Originally posted: 2023-10-24. Last updated: 2024-01-07.

How effectively do different approaches to record linkage exploit the information in the data to make predictions?

Data deduplication (record linkage) algorithms are models that predict whether records such as the following represent the same entity.

| first_name | surname | dob        | city       |
|------------|---------|------------|------------|
| joss       | hammond | 1984-01-02 | london     |
| joss       | hamond  | 1984-07-02 | manchester |

What algorithms work best? To get the best accuracy in record linkage, we need a model that uses as much information from the input data as possible.

This article describes the three types of information that are most important in making an accurate prediction, and how all three are leveraged by the Fellegi-Sunter model as used in Splink, a free software package for record linkage at scale.

It also describes how some alternative record linkage approaches throw away some of this information, leaving accuracy on the table.

Broadly, there are three categories of information that are relevant when trying to predict whether this pair of records represent the same entity:

  1. The similarity of the pair of records, without reference to the overall dataset (sometimes known as fuzzy matching)
  2. The frequency of values in the overall dataset, or more generally how common it is to observe different types of fuzzy matches
  3. The data quality of the overall dataset

Let's look at each in turn.

The most obvious way to predict whether two records represent the same entity is to measure whether the columns contain the same or similar information.

The similarity of each column can be measured quantitatively using fuzzy matching functions like Levenshtein or Jaro-Winkler for text, or numeric differences such as absolute or percentage difference.

For example, hammond vs hamond has a Jaro-Winkler similarity of 0.97; it's probably a typo.

1984-01-02 vs 1984-07-02 has a Levenshtein distance of 1.
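As a concrete illustration, here is a small self-contained sketch of these two comparators in plain Python, applied to the examples above. Libraries such as jellyfish, and Splink's own comparison functions, provide production implementations; this is just to show the calculation.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits (insert, delete, substitute) to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if the characters match)
            ))
        prev = curr
    return prev[-1]


def jaro_winkler(a: str, b: str, prefix_weight: float = 0.1) -> float:
    """Jaro-Winkler similarity: 1.0 means identical, 0.0 means entirely dissimilar."""
    if a == b:
        return 1.0
    window = max(len(a), len(b)) // 2 - 1
    used_b = [False] * len(b)
    matches_a = []
    # Count characters of a that can be matched to an unused, equal character of b
    # within the allowed window.
    for i, ca in enumerate(a):
        lo, hi = max(0, i - window), min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not used_b[j] and b[j] == ca:
                used_b[j] = True
                matches_a.append(ca)
                break
    matches_b = [cb for j, cb in enumerate(b) if used_b[j]]
    m = len(matches_a)
    if m == 0:
        return 0.0
    transpositions = sum(x != y for x, y in zip(matches_a, matches_b)) / 2
    jaro = (m / len(a) + m / len(b) + (m - transpositions) / m) / 3
    # Winkler adjustment: reward a shared prefix of up to 4 characters
    prefix = 0
    for ca, cb in zip(a[:4], b[:4]):
        if ca != cb:
            break
        prefix += 1
    return jaro + prefix * prefix_weight * (1 - jaro)


print(round(jaro_winkler("hammond", "hamond"), 2))  # 0.97 - probably a typo
print(levenshtein("1984-01-02", "1984-07-02"))      # 1    - a single-character difference
```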

But how do we combine these different measures of similarity? They could be assigned weights, and summed together to compute a total similarity score.

This approach is sometimes known as fuzzy matching, and it is an important part of an accurate linkage model.
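To illustrate, a naive weighted-sum scorer might look like the sketch below. Every number in it is an invented guess (the field weights, and difflib's ratio standing in for a comparator; the jaro_winkler sketch above would slot in the same way), which is exactly the problem discussed next.

```python
from difflib import SequenceMatcher

# Hand-picked weights - pure guesses about the relative importance of each field.
FIELD_WEIGHTS = {"first_name": 1.0, "surname": 1.5, "dob": 2.0, "city": 0.5}

def naive_score(rec_l: dict, rec_r: dict) -> float:
    """Weighted sum of per-field similarities."""
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        similarity = SequenceMatcher(None, rec_l[field], rec_r[field]).ratio()
        total += weight * similarity
    return total

record_l = {"first_name": "joss", "surname": "hammond", "dob": "1984-01-02", "city": "london"}
record_r = {"first_name": "joss", "surname": "hamond", "dob": "1984-07-02", "city": "manchester"}
print(naive_score(record_l, record_r))  # a single score - but what threshold counts as a match?
```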

However, using this approach alone has a major drawback - the weights are arbitrary:

  • The importance of different fields has to be guessed at by the user. Clearly a match on gender is less important than a match on dob, but how much less? And what negative weights should be assigned to non-matches?
  • For each field, a decision must be made on how changes in the fuzzy matching metric influence the score. For example, how much should our prediction change for a Jaro-Winkler score of 0.8 vs. 0.9? Is this the same or different depending on the field?

We can improve on fuzzy matching by accounting for the frequency of values in the overall dataset (sometimes known as 'term frequencies').

For example, john vs john, and joss vs joss are both exact matches so have the same Jaro-Winkler or Levenshtein similarity score, but the latter is stronger evidence of a match than the former, because joss is an unusual name.

The term frequencies of john and joss provide a data-driven estimate of the relative importance of these different names, which can be used to inform the weights.

In essence, the term frequencies capture the likelihood that two non-matching records would have the same value on a given column, i.e. the likelihood of a 'collision' (e.g. two different people with the same name).

In probabilistic linkage, these values are called u probabilities which are defined more precisely in the article on m and u probabilities.

This concept can be extended to encompass similar records that are not an exact match. How much of a coincidence is it that dob differs by a single character (1984-01-02 vs 1984-07-02)? This can be measured from the data by looking at how often this occurs.
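Term frequencies are cheap to compute. The sketch below, assuming pandas and an invented toy dataset, shows how the frequency of each name yields a data-driven weight for an exact match on that name.

```python
import pandas as pd

# A toy dataset - in practice this would be the full table being linked/deduplicated.
df = pd.DataFrame({
    "first_name": ["john", "john", "john", "john", "joss", "mary", "mary", "sarah"]
})

# Term frequency: the share of records holding each value. For an exact match on a
# given value, this approximates the chance that two *different* people collide on it.
term_freq = df["first_name"].value_counts(normalize=True)

print(term_freq["john"])  # 0.5   - a collision on 'john' is unremarkable
print(term_freq["joss"])  # 0.125 - a collision on 'joss' is much stronger evidence
```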

We've seen that fuzzy matching and term frequency based approaches can allow us to score the similarity between records, and even weight the importance of matches on different columns.

However, none of these techniques help quantify the relative importance of non-matches (e.g. london vs manchester) to the predicted match probability.

Probabilistic methods explicitly estimate the relative importance of non-matches by estimating the data quality of each column. In probabilistic linkage, this information is captured in the m probabilities, which are discussed in more detail in the article on m and u probabilities.

For example, if records have been observed over a number of years, a non-match on city wouldn't be strong evidence against the two records being a match, because people move between cities over time.

Conversely, if the data quality in the dob variable is extremely high, then a non-match on dob would be strong evidence against the two records being a true match.
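To sketch how this is quantified: the m probability of a column is the chance that truly matching records agree on it, and combining it with the u probability from the previous section turns a disagreement into a (negative) partial match weight. All the m and u values below are invented purely for illustration.

```python
import math

# Invented m probabilities: P(column agrees | records are a true match).
# A less stable column (city changes when people move) has a lower m.
m_probs = {"city": 0.70, "dob": 0.99}

# Invented u probabilities: P(column agrees | records are NOT a match),
# e.g. derived from term frequencies as in the previous sketch.
u_probs = {"city": 0.10, "dob": 0.001}

for col in ["city", "dob"]:
    m, u = m_probs[col], u_probs[col]
    # Bayes factor for a *disagreement* on this column: (1 - m) / (1 - u)
    bf_disagree = (1 - m) / (1 - u)
    print(col, round(math.log2(bf_disagree), 2))
# city -> log2(0.30 / 0.900) ≈ -1.58  (mildly against a match)
# dob  -> log2(0.01 / 0.999) ≈ -6.64  (strongly against a match)
```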

Much of the power of probabilistic models comes from combining all three sources of information in a way which is not possible in other models.

Not only are all three sources of information used to make the prediction, but the Fellegi-Sunter model used in Splink allows us to estimate the relative importance of each source of information from the data itself.

These data-driven partial match weights have intuitive interpretations, leading to transparent and easy to understand models.
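As a final sketch of how everything combines (all numbers below are invented): each column contributes a Bayes factor - m/u for an agreement, (1-m)/(1-u) for a disagreement - these multiply the prior odds, and the resulting odds convert to a match probability.

```python
# Invented parameters for a four-column model.
# m: P(agree | match), u: P(agree | non-match) - in Splink these are estimated from the data.
params = {
    "first_name": {"m": 0.95, "u": 0.01},
    "surname":    {"m": 0.90, "u": 0.005},
    "dob":        {"m": 0.99, "u": 0.001},
    "city":       {"m": 0.70, "u": 0.10},
}

# Comparison outcome for the joss hammond / joss hamond pair. Fuzzy matches are treated
# as plain agreements or disagreements here for simplicity.
agreements = {"first_name": True, "surname": True, "dob": False, "city": False}

prior_odds = 1 / 1000  # invented: roughly 1 in 1001 candidate pairs is a true match

odds = prior_odds
for col, agreed in agreements.items():
    m, u = params[col]["m"], params[col]["u"]
    odds *= (m / u) if agreed else ((1 - m) / (1 - u))

probability = odds / (1 + odds)
print(round(probability, 3))  # ~0.054 with these invented numbers - the dob disagreement dominates
```

In a real model the dob comparison would not be a flat disagreement: a graded comparison level for a single-character difference (as described above) would contribute a much less punishing Bayes factor.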

By contrast, fuzzy matching techniques often use arbitrary weights, and cannot fully incorporate information from all three sources. Term frequency approaches lack the ability to use information about data quality, and have no mechanism to appropriately weight fuzzy matches.