Originally posted: 2023-03-09. Last updated: 2024-09-02. View source code for this page here.
This post presents the argument for open sourcing analytical work using the real-world example of Splink, a Python record linkage library. The hope is that it helps others who want to make the case for open sourcing their work.
The benefits to Splink of working in the open have been enormous. Academics, government, private sector consultancies, big tech companies and expert hobbyists have all contributed code and ideas to the project, for which we are very thankful.
As such, Splink represents an international collaboration of some of the world's leading experts in record linkage.
The following headline statistics give an indication of the extent of external engagement and collaboration and the size of the community:
Splink has been downloaded 8,973,210 times and is in the top 0.89% of all Python packages by monthly downloads.
1254 people have starred (favourited) Splink on GitHub.
305 people from outside of the Splink team have posted a total of 1256 comments on the main Splink website, including bug reports, feature requests and discussions.
58 people have contributed to the codebase.
Note that this post is a more concrete companion to a previous, more abstract blog post I've written: Why you should open source your analytical work.
Some of the greatest benefits have come from the informal, virtual community centered around Splink, and the resultant network effects. Each interaction within this community is a small quality and productivity multiplier, but these multipliers compound to make dramatic overall gains. We also seem to have established a virtuous cycle whereby these improvements in quality draw more users into the network.
The following chart shows how the number of different users we've interacted with on GitHub has grown through time:
With Splink, I believe working in the open was critical to establishing these network effects because:
This community has been critical in helping us to understand best practice record linkage techniques and how we can incorporate them into Splink.
One of the greatest benefits of this community has been the access it's given us to a wide range of users who've contacted us both via our public discussion forums and privately. This has effectively enabled us to conduct user research with some of the world's leading experts and practitioners. As a result, Splink is faster, easier to use and more accurate than it would have been as a closed source project.
For example, we have been contacted by and have spoken to government analysts in at least six different countries, who have been at various stages of testing Splink on their data. Similarly, we've spoken to academics in leading international universities, and Splink has been used to power published academic research, e.g. this paper.
Overall, users have spontaneously given us hundreds of pieces of feedback, including reporting bugs, suggesting ideas and new features, and asking questions, as shown in the following chart:
The main result has been continuous and iterative improvement of Splink, with a total of 136 releases, each of which represents an incremental improvement. Some of these releases also represent a step change in capabilities: in July 2022, in response to two years of user feedback, we rewrote Splink from the ground up, addressing the most serious usability and performance issues which had been identified by the community. At the same time, we released a new documentation website to help users get the most from Splink.
As new starters have joined the team, we've seen first-hand how these user experience improvements and online documentation have made Splink easier to use. Corporate knowledge retention is much easier when all the materials are easily available online, including:
We've received contributions to Splink code from private sector consultancies, big tech firms, academics and hobbyists, all for free. Overall a total of 58 different people have contributed to our codebase.
The following chart shows the number of contributions (pull requests) from external users through time:
A final strong argument for open sourcing Splink has been the benefits external users themselves get from Splink. As a government analyst, my work is taxpayer funded so it makes sense that taxpayers should get as much value as possible from it.
One of the clearest ways we've been able to chart the resultant growth in interest in Splink is through GitHub 'stars', which are a way of favouriting Splink within the GitHub platform:
This includes a range of people from around the world; see the bubble chart here.
The following two charts show the monthly downloads of Splink, and the cumulative total downloads, respectively:
It's a bit unclear how this translates into users. But it's worth noting that Splink is the 5,043rd most downloaded Python package out of a total of 566,369 packages posted on PyPI, putting it in the top 0.89% of all Python packages.
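For anyone wanting to reproduce the "top 0.89%" figure for their own package, the arithmetic is simply the download rank divided by the total number of packages. The snippet below is a minimal sketch using the two numbers quoted in this post; the function name `top_percent` is my own and is not part of Splink or any PyPI tooling.

```python
def top_percent(rank: int, total: int) -> float:
    """Return the share of packages ranked at or above `rank`, as a percentage."""
    return 100 * rank / total


# Figures quoted in this post; substitute your own package's rank and total.
splink_rank = 5_043
pypi_total = 566_369

share = top_percent(splink_rank, pypi_total)
print(f"Top {share:.2f}% of packages")  # prints "Top 0.89% of packages"
```

Note that this says nothing about the number of distinct users, since a single user or automated pipeline can account for many downloads.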
Since Splink is open source, we only ever hear from a fraction of the people using it. This is particularly true because record linkage is usually conducted on sensitive data, so code that uses Splink will generally not be open source.
If you are a Splink user, please do get in touch - we'd love to hear from you about how you've used it - email [email protected] or put in a PR like this one to be included in the use cases section of the docs.