Splink now includes DuckDB and AWS Athena backends, in addition to Spark. It's easier to use, faster and more flexible, and can be used for near-real-time linkage.
Two years ago, we introduced Splink, a Python library for data deduplication and linkage (entity resolution) at scale.
Since then, Splink has been used in government, the private sector, and academia to link and deduplicate huge datasets, some in excess of 100 million records, and it’s been downloaded over 3 million times.
We believe that it is now the fastest and most accurate free library for record linkage at scale.
We’ve learned a lot from our users, and we’ve just released version 3, which includes frequently requested features and addresses some of the biggest pain points.
Small linkages can run 100x faster using the new DuckDB backend, relative to Spark. It’s possible to link a dataset of a million records in under two minutes on a modern laptop, as in the sketch below.
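To give a flavour of what this looks like in practice, here is a minimal sketch of setting up a deduplication job with the DuckDB backend. The import paths and functions follow the Splink 3 documentation, but the input file, column names and comparison choices are illustrative assumptions rather than a real example, and exact names should be checked against the installed version.

```python
# Minimal sketch of a DuckDB-backed deduplication job (Splink 3 API);
# the input file and columns below are hypothetical.
import pandas as pd
from splink.duckdb.duckdb_linker import DuckDBLinker
import splink.duckdb.duckdb_comparison_library as cl

df = pd.read_csv("people_to_deduplicate.csv")  # hypothetical input dataset

settings = {
    "link_type": "dedupe_only",
    # Candidate pairs are only generated where one of these rules holds
    "blocking_rules_to_generate_predictions": [
        "l.surname = r.surname",
        "l.date_of_birth = r.date_of_birth",
    ],
    # How pairs of records are compared, column by column
    "comparisons": [
        cl.levenshtein_at_thresholds("first_name", 2),
        cl.exact_match("surname"),
        cl.exact_match("date_of_birth"),
    ],
}

# Everything runs locally in DuckDB - no Spark cluster needed
linker = DuckDBLinker(df, settings)
```

The same settings structure is used across backends, so a model prototyped locally against DuckDB can later be re-run in Spark or Athena with the equivalent backend-specific comparison functions.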
Linkage models are now simpler to estimate: less code is needed, and the API is clearer. Splink now automatically combines parameter estimates from different model runs, significantly reducing the number of lines of code needed to estimate a full model.
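Continuing with the linker from the sketch above, the estimation workflow might look like the following. An expectation-maximisation session cannot estimate m probabilities for the comparisons used in its own blocking rule, so two sessions with different blocking rules are run and Splink combines the resulting estimates into a single model. Function and argument names are taken from the Splink 3 documentation and may differ slightly between versions.

```python
# Estimate u probabilities from random sampling of record pairs
linker.estimate_u_using_random_sampling(target_rows=1e6)

# Blocking on date of birth: estimates m probabilities for the name comparisons
linker.estimate_parameters_using_expectation_maximisation(
    "l.date_of_birth = r.date_of_birth"
)

# Blocking on surname: estimates m probabilities for the date of birth comparison
linker.estimate_parameters_using_expectation_maximisation(
    "l.surname = r.surname"
)

# Splink pools the estimates from both sessions; score candidate pairs
df_predictions = linker.predict().as_pandas_dataframe()
```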
Linkages in Spark run significantly faster and are less likely to result in out-of-memory errors or other scaling issues. Big data linkage in Athena runs significantly faster than Spark in some instances.