Originally posted: 2023-10-24. Last updated: 2024-01-07.
Data deduplication (record linkage) algorithms are models that predict whether records such as the following represent the same entity.
| first_name | surname | dob | city |
|---|---|---|---|
| joss | hammond | 1984-01-02 | london |
| joss | hamond | 1984-07-02 | manchester |
What algorithms work best? To get the best accuracy in record linkage, we need a model that uses as much information from the input data as possible.
This article describes the three types of information that are most important in making an accurate prediction, and how all three are leveraged by the Fellegi-Sunter model as used in Splink, a free software package for record linkage at scale.
It also describes how some alternative record linkage approaches throw away some of this information, leaving accuracy on the table.
Broadly, there are three categories of information that are relevant when trying to predict whether this pair of records represents the same entity:

1. The similarity of the pair of records
2. The frequency of values in the overall dataset
3. The data quality of each column

Let's look at each in turn.
The most obvious way to predict whether two records represent the same entity is to measure whether the columns contain the same or similar information.
The similarity of each column can be measured quantitatively using fuzzy matching functions like Levenshtein or Jaro-Winkler for text, or numeric differences such as absolute or percentage difference.
For example, `hammond` vs `hamond` has a Jaro-Winkler similarity of 0.97; it's probably a typo. `1984-01-02` vs `1984-07-02` has a Levenshtein distance of 1.
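Similarity measures like these can be computed with off-the-shelf string-distance libraries. The sketch below uses the `jellyfish` package, which is just one option among many and is not something the rest of this article depends on:

```python
import jellyfish

# String similarity: 1.0 means identical, higher means more alike
print(jellyfish.jaro_winkler_similarity("hammond", "hamond"))  # roughly 0.97

# Edit distance: number of single-character edits needed to turn one string into the other
print(jellyfish.levenshtein_distance("1984-01-02", "1984-07-02"))  # 1
```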
But how do we combine these different measures of similarity? They could be assigned weights, and summed together to compute a total similarity score.
This approach is sometimes known as fuzzy matching, and it is an important part of an accurate linkage model.
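For illustration, such a weighted-sum score might look like the sketch below. The column weights here are invented for the example, which is precisely the problem discussed next:

```python
import jellyfish

def fuzzy_match_score(rec_l, rec_r, weights):
    """Naive fuzzy matching: a weighted sum of per-column similarity scores."""
    score = 0.0
    for col, weight in weights.items():
        score += weight * jellyfish.jaro_winkler_similarity(rec_l[col], rec_r[col])
    return score

# The weights are arbitrary: chosen by the user, not learned from the data
weights = {"first_name": 1.0, "surname": 1.5, "dob": 2.0, "city": 0.5}

rec_l = {"first_name": "joss", "surname": "hammond", "dob": "1984-01-02", "city": "london"}
rec_r = {"first_name": "joss", "surname": "hamond", "dob": "1984-07-02", "city": "manchester"}

print(fuzzy_match_score(rec_l, rec_r, weights))
```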
However, using this approach alone has a major drawback: the weights are arbitrary. The user has to decide, for example, how much less important a match on `gender` is than a match on `dob`. And what negative weights should be assigned to non-matches?

We can improve on fuzzy matching by accounting for the frequency of values in the overall dataset (sometimes known as 'term frequencies').
For example, `john` vs `john` and `joss` vs `joss` are both exact matches, so they have the same Jaro-Winkler or Levenshtein similarity score, but the latter is stronger evidence of a match than the former, because `joss` is an unusual name.
The term frequencies of `john` vs `joss` provide a data-driven estimate of the relative importance of these different names, which can be used to inform the weights.
In essence, the term frequencies capture the likelihood that two non-matching records would have the same value on a given column, i.e. the likelihood of a 'collision' (e.g. two different people with the same name). In probabilistic linkage, these values are called `u` probabilities, which are defined more precisely in the article on m and u probabilities.
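As a rough sketch of the idea, term frequencies can be estimated directly from the data, and from them the chance that two non-matching records collide on a value. The data below is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {"first_name": ["john", "john", "john", "john", "joss", "mary", "mary", "sarah"]}
)

# Term frequency: the proportion of records holding each value
term_freq = df["first_name"].value_counts(normalize=True)

print(term_freq["john"])  # common name -> collisions between different people are likely
print(term_freq["joss"])  # rare name   -> an exact match is much stronger evidence

# Approximate u probability for an exact match on this column:
# the chance that two records drawn at random share a first_name
u_exact_match = (term_freq ** 2).sum()
print(u_exact_match)
```

The squared-frequency calculation is a simplification; in practice tools such as Splink estimate `u` probabilities by sampling large numbers of random record pairs.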
This concept can be extended to encompass similar records that are not an exact match. How much of a coincidence is it that `dob` differs by a single character (`1984-01-02` vs `1984-07-02`)? This can be measured from the data by looking at how often this occurs.
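One way to measure this is to sample random pairs of records (which are almost certainly non-matches) and count how often the date of birth differs by exactly one character. A rough sketch, with an assumed `dob` column:

```python
import random
import jellyfish

def estimate_one_char_dob_collision_rate(df, n_pairs=100_000, seed=0):
    """Among random pairs (almost all non-matches), how often does dob
    differ by exactly one character?"""
    rng = random.Random(seed)
    dobs = df["dob"].tolist()
    hits = 0
    for _ in range(n_pairs):
        a, b = rng.sample(dobs, 2)
        if jellyfish.levenshtein_distance(a, b) == 1:
            hits += 1
    return hits / n_pairs
```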
We've seen that fuzzy matching and term frequency based approaches can allow us to score the similarity between records, and even weight the importance of matches on different columns.
However, none of these techniques help quantify the relative importance of non-matches (e.g. `london` vs `manchester`) to the predicted match probability.
Probabilistic methods explicitly estimate the relative importance of non-matches by estimating the data quality of each column. In probabilistic linkage, this information is captured in the `m` probabilities, which are discussed in more detail in the article on m and u probabilities.
For example, if records have been observed over a number of years (people move house over time), a non-match on `city` wouldn't be strong evidence against the two records being a match.
Conversely, if the data quality in the `dob` variable is extremely high, then a non-match on `dob` would be strong evidence against the two records being a true match.
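To make the combination concrete, each comparison's `m` and `u` probabilities can be turned into a Bayes factor `m/u`, whose base-2 logarithm is the partial match weight. Below is a minimal sketch with illustrative numbers, not estimates from real data:

```python
import math

# m: probability of seeing this outcome among true matches (reflects data quality)
# u: probability of seeing this outcome among non-matches (reflects collision likelihood)
comparison_outcomes = {
    "first_name exact match":       {"m": 0.90, "u": 0.01},
    "surname close fuzzy match":    {"m": 0.80, "u": 0.005},
    "dob differs by one character": {"m": 0.10, "u": 0.002},
    "city does not match":          {"m": 0.30, "u": 0.90},  # noisy column: weak negative evidence
}

prior_odds = 1 / 1000  # illustrative odds that a random pair is a match

odds = prior_odds
for outcome, p in comparison_outcomes.items():
    bayes_factor = p["m"] / p["u"]
    partial_match_weight = math.log2(bayes_factor)
    odds *= bayes_factor
    print(f"{outcome}: partial match weight = {partial_match_weight:+.2f}")

probability = odds / (1 + odds)
print(f"match probability = {probability:.4f}")
```

Note how the non-match on city contributes only a small negative weight, because its `m` probability says that true matches quite often disagree on city.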
Much of the power of probabilistic models comes from combining all three sources of information in a way which is not possible in other models.
Not only are all three sources of information used to make the prediction, but the Fellegi-Sunter model used in Splink allows us to estimate the relative importance of each source of information from the data itself.
These data-driven partial match weights have intuitive interpretations, leading to transparent and easy to understand models.
Conversely, fuzzy matching techniques often use arbitrary weights, and cannot fully incorporate information from all three sources. Term frequency approaches lack the ability to use information about data quality, or a mechanism to appropriately weight fuzzy matches.
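Putting this into practice, a Splink model specification along the following lines can capture all three sources of information. This is a sketch assuming Splink 3.x's DuckDB-backed API; the thresholds, blocking rules and input DataFrame are illustrative rather than a recommendation:

```python
from splink.duckdb.linker import DuckDBLinker
import splink.duckdb.comparison_library as cl

# df is assumed to be a pandas DataFrame with the columns from the example above
settings = {
    "link_type": "dedupe_only",
    "comparisons": [
        # Fuzzy match levels, with term frequency adjustments for unusual names
        cl.jaro_winkler_at_thresholds("first_name", [0.9, 0.7], term_frequency_adjustments=True),
        cl.jaro_winkler_at_thresholds("surname", [0.9, 0.7], term_frequency_adjustments=True),
        # Levenshtein levels capture near-miss dates of birth such as single-character typos
        cl.levenshtein_at_thresholds("dob", [1, 2]),
        cl.exact_match("city", term_frequency_adjustments=True),
    ],
    "blocking_rules_to_generate_predictions": ["l.dob = r.dob", "l.surname = r.surname"],
}

linker = DuckDBLinker(df, settings)

# u probabilities (collision likelihoods) estimated from random pairs of records
linker.estimate_u_using_random_sampling(max_pairs=1e6)

# m probabilities (data quality) estimated with expectation maximisation; in practice
# further sessions with different blocking rules are run so every comparison is covered
linker.estimate_parameters_using_expectation_maximisation("l.dob = r.dob")

predictions = linker.predict()
```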