Originally posted: 2020-04-16. Last updated: 2022-08-04.
Splink is a Python library for data deduplication (probabilistic record linkage, entity resolution). It supports running record linkage workloads using the Apache Spark, AWS Athena, or DuckDB backends.
Its key features are:
- It is extremely fast: it is capable of linking a million records on a modern laptop in under two minutes using the DuckDB backend.
- It is highly accurate, with support for term frequency adjustments and sophisticated fuzzy matching logic.
- It is capable of linking very large datasets (100 million records or more) using the Spark or AWS Athena backends.
- It has a simple but highly customisable interface, so users can tackle the majority of record linking and deduplication problems.
- Training data is not required, because models can be trained using an unsupervised approach.
- It supports the full workflow of data linkage, from exploratory analysis through to model prediction, diagnostics and quality assurance.
- It is robust, with an automated suite of unit and integration tests.
A common data quality problem is to have multiple different records that refer to the same entity but no unique identifier that ties these entities together:
row_id | first_name | surname | dob | city |
---|---|---|---|---|
1 | lucas | smith | 1984-01-02 | London |
2 | lucas | smyth | 1984-07-02 | Manchester |
3 | lucas | smyth | 1984-07-02 | |
4 | david | jones | | Leeds |
5 | david | jones | 1990-03-21 | Leeds |
For instance, customer data may have been entered multiple times in multiple different computer systems, with different spellings of names, different addresses, and other typos. The lack of a unique customer identifier presents challenges at all stages of data analysis — from basic questions such as counting the number of unique customers, to feature engineering of customers’ details for machine learning purposes.
There is a large body of theoretical and empirical work into this problem. The solution usually involves computing a new unique identifier column which allows entities to be linked and grouped, using a process of statistical estimation, machine learning and/or rules-based logic.
However, there is a lack of free software that can tackle this problem at the scale of millions of records — the size typically seen in large organisations. Solving the problem usually involves generating very large numbers of record comparisons and so is ill-suited to in-memory solutions in R or Python. Distributed computing frameworks like Apache Spark, or backends that are parallelised and not memory-bound like DuckDB, are a much better fit.
We have released a free and open source library called Splink, which implements the Fellegi-Sunter/Expectation Maximisation approach, one of the key statistical models from the data linking literature. This is an unsupervised learning algorithm which yields a match score for each pair of record comparisons.
In a nutshell, Splink takes the messy input data shown in the table above and produces pairwise predictions of which records match:
row_id_l | row_id_r | match_probability |
---|---|---|
1 | 2 | 0.9 |
1 | 3 | 0.85 |
2 | 3 | 0.92 |
4 | 5 | 0.7 |
It then clusters the predictions to produce an estimated unique id:
cluster_id | row_id |
---|---|
a | 1 |
a | 2 |
a | 3 |
b | 4 |
b | 5 |
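To give a feel for what this clustering step does, here is a minimal sketch (not Splink's internal implementation, which runs as SQL on the chosen backend): threshold the pairwise match probabilities and take connected components of the resulting graph. The use of networkx and the threshold value are illustrative assumptions.

```python
# Illustrative sketch only: keep pairs above a threshold, then treat each
# connected component of the resulting graph as one entity.
import networkx as nx

# (row_id_l, row_id_r, match_probability), matching the table above
pairwise_predictions = [
    (1, 2, 0.90),
    (1, 3, 0.85),
    (2, 3, 0.92),
    (4, 5, 0.70),
]
threshold = 0.5  # illustrative choice

g = nx.Graph()
g.add_nodes_from({rid for l, r, _ in pairwise_predictions for rid in (l, r)})
g.add_edges_from((l, r) for l, r, p in pairwise_predictions if p >= threshold)

for cluster_id, component in enumerate(nx.connected_components(g)):
    for row_id in sorted(component):
        print(cluster_id, row_id)  # cluster_id | row_id, as in the table above
```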
It makes no assumption about the entity type (person, company, case, etc.), so it can be used to link or deduplicate most types of data.
You can try out the library in a Jupyter notebook using our Binder link here. These demos illustrate how to use the library, but note they are running in local mode on free servers, so don’t expect great performance.
Splink is an implementation of the Fellegi-Sunter model. The software generates pairwise record comparisons using an approach called blocking, and computes a match score for each pair which quantifies the similarity between the two records.
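To illustrate what blocking means, here is a minimal sketch, assuming DuckDB and an illustrative `first_name` blocking rule: candidate pairs are generated by self-joining the input table on the rule, rather than comparing every possible pair of records. Splink builds comparable SQL internally from the blocking rules you supply; the table and rule below are made up for the example.

```python
# Conceptual sketch of a blocking rule: generate candidate record pairs via a
# self-join on an exact-match condition, instead of scoring all n*(n-1)/2 pairs.
import duckdb
import pandas as pd

df = pd.DataFrame({
    "row_id": [1, 2, 3, 4, 5],
    "first_name": ["lucas", "lucas", "lucas", "david", "david"],
    "surname": ["smith", "smyth", "smyth", "jones", "jones"],
})

candidate_pairs = duckdb.query("""
    SELECT l.row_id AS row_id_l, r.row_id AS row_id_r
    FROM df AS l
    JOIN df AS r
      ON l.first_name = r.first_name   -- the blocking rule
     AND l.row_id < r.row_id           -- each unordered pair only once
""").to_df()

print(candidate_pairs)
```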
The match score is determined by parameters known as partial match weights. These quantify the importance of different aspects of the comparison.
For example, a match on date of birth lends more evidence in favour of two records being a match than a match on gender. A mismatch on postcode may provide weak evidence against a match because people move house, whereas a mismatch on date of birth may be stronger evidence against the record being a match.
This simple idea has a lot of power to build highly nuanced models. Partial match weights can be computed for an arbitrary number of user-defined scenarios, not just a match or non-match. For example, a partial match weight can be estimated for a scenario where postcodes do not match, but are within 10 miles of each other.
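As a sketch of how such user-defined scenarios can be expressed, a comparison can be spelled out as a list of comparison levels, each defined by a SQL condition, and each level gets its own partial match weight. The postcode levels below are purely illustrative (a real "within 10 miles" level would need geographic coordinates), and the exact dictionary keys may differ between Splink versions, so treat this as a sketch rather than an API reference.

```python
# Illustrative only: a custom comparison with several user-defined levels,
# expressed as SQL conditions. A dict like this would sit in the "comparisons"
# list of the settings shown in the full example below; column names are assumed.
postcode_comparison = {
    "output_column_name": "postcode",
    "comparison_levels": [
        {
            "sql_condition": "postcode_l IS NULL OR postcode_r IS NULL",
            "label_for_charts": "Null",
            "is_null_level": True,
        },
        {
            "sql_condition": "postcode_l = postcode_r",
            "label_for_charts": "Exact match",
        },
        {
            # e.g. 'SW1A 1AA' -> 'SW1A': same postcode district
            "sql_condition": "split_part(postcode_l, ' ', 1) = split_part(postcode_r, ' ', 1)",
            "label_for_charts": "Same postcode district",
        },
        {
            "sql_condition": "ELSE",
            "label_for_charts": "All other comparisons",
        },
    ],
}
```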
These partial match weights are combined into an overall match score, which represents the weight of evidence that the two records are a match.
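To make this concrete, here is a small numeric sketch, with made-up numbers, of how partial match weights (log2 Bayes factors built from the m and u probabilities of the Fellegi-Sunter model) sum into an overall match weight and then a match probability.

```python
# Toy illustration of combining partial match weights. All probabilities here
# are invented for the example; in practice they are estimated from the data.
import math

# Prior: assumed odds that a randomly chosen pair of records is a match.
prior_match_weight = math.log2(1 / 1_000)

# Each partial match weight is log2(m / u): how much more likely the observed
# agreement or disagreement is among matches (m) than among non-matches (u).
partial_match_weights = {
    "date of birth matches": math.log2(0.95 / 0.001),  # strong evidence for
    "gender matches": math.log2(0.98 / 0.5),           # weak evidence for
    "postcode differs": math.log2(0.20 / 0.99),        # weak evidence against
}

match_weight = prior_match_weight + sum(partial_match_weights.values())
bayes_factor = 2 ** match_weight
match_probability = bayes_factor / (1 + bayes_factor)
print(f"match weight {match_weight:.2f} -> probability {match_probability:.3f}")
```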
The library uses unsupervised learning (the Expectation Maximisation algorithm) to estimate these match weights. You can learn more about the theory in my interactive training materials.
A more detailed video description of how this all works can be found here.
We have tried to design an interface which is simple but nevertheless can be adapted to most record linkage and deduplication problems.
In the following code we specify a data linkage model, estimate its parameters, use the model to compute pairwise match scores, and cluster those scores into groups to produce an estimated unique person id:
```python
from splink.duckdb.duckdb_linker import DuckDBLinker
from splink.duckdb.duckdb_comparison_library import (
    exact_match,
    levenshtein_at_thresholds,
)
import pandas as pd

df = pd.read_csv("./tests/datasets/fake_1000_from_splink_demos.csv")

# Specify a data linkage model
settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name",
        "l.surname = r.surname",
    ],
    "comparisons": [
        levenshtein_at_thresholds("first_name", 2),
        exact_match("surname"),
        exact_match("dob"),
        exact_match("city", term_frequency_adjustments=True),
        exact_match("email"),
    ],
}

linker = DuckDBLinker(df, settings)

# Estimate its parameters
linker.estimate_u_using_random_sampling(target_rows=1e6)

blocking_rule_for_training = "l.first_name = r.first_name and l.surname = r.surname"
linker.estimate_parameters_using_expectation_maximisation(blocking_rule_for_training)

blocking_rule_for_training = "l.dob = r.dob"
linker.estimate_parameters_using_expectation_maximisation(blocking_rule_for_training)

# Use the model to compute pairwise match scores
pairwise_predictions = linker.predict()

# Cluster the match scores into groups to produce an estimated unique person id
clusters = linker.cluster_pairwise_predictions_at_threshold(pairwise_predictions, 0.95)

clusters.as_pandas_dataframe(limit=5)
```
We are grateful to all our users who have contributed and provided feedback so far. Please continue to do so, either via the Splink GitHub repository or on Twitter, where I'm @robinlinacre.