Originally posted: 2025-02-18.
So far, we have thought about the Fellegi-Sunter model's calculation of the match probability from the perspective of pairwise record comparisons.
In this view, we begin by imagining a random draw from the set of all possible pairwise comparisons of records. The probability that these records match is the 'prior' - our assessment of the match probability before we begin to take into account the information we observe in the records.
To compute the match probability for a specific pairwise comparison, we then update our beliefs based on the information in the records using Bayesian conditioning.
For example, if we observe an exact match on first name, then we are no longer in the space of all comparisons, and instead need to restrict our attention to the subset of all comparisons where first name matches. In this subset, the probability of a match is higher, and so we update the match probability to reflect this.
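To make this conditioning step concrete, here is a small Python sketch (the records are made-up toy data, not from any real dataset): we compute the match probability across all pairs, then restrict attention to the pairs that agree on first name and see the probability rise.

```python
from itertools import combinations

# Toy dataset: person_id identifies the real-world individual behind each record
records = [
    {"person_id": 1, "first_name": "john"},
    {"person_id": 1, "first_name": "john"},  # duplicate of person 1
    {"person_id": 2, "first_name": "john"},  # different person, same first name
    {"person_id": 3, "first_name": "mary"},
    {"person_id": 3, "first_name": "mary"},  # duplicate of person 3
    {"person_id": 4, "first_name": "lucy"},
]


def is_match(a, b):
    return a["person_id"] == b["person_id"]


all_pairs = list(combinations(records, 2))

# Prior: probability that a randomly drawn pair is a match
prior = sum(is_match(a, b) for a, b in all_pairs) / len(all_pairs)

# Condition on observing an exact match on first name: restrict to that subset
agree = [(a, b) for a, b in all_pairs if a["first_name"] == b["first_name"]]
posterior = sum(is_match(a, b) for a, b in agree) / len(agree)

print(f"P(match)                     = {prior:.3f}")      # 2/15 ≈ 0.133
print(f"P(match | first name agrees) = {posterior:.3f}")  # 2/4  = 0.500
```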
In this post, we look at an alternative and equivalent way to think about the computation of the match probability that may be more intuitive. This explanation starts from the perspective of a single record, rather than the space of all pairwise comparisons.
Given this record, let's define two variables¹:

- $n_{\text{same}}$: the expected number of other records in the dataset that belong to the same person as this record
- $n_{\text{coincidence}}$: the expected number of records that belong to different people, but coincidentally look like this record (agree with it on the observed information)
Using these variables, we can compute the probability that two records like this one are a match:

$$\text{Pr}(\text{match}) = \frac{n_{\text{same}}}{n_{\text{same}} + n_{\text{coincidence}}}$$
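As a minimal sketch (the function and argument names are just labels for the two variables above, not anything from the original post):

```python
def match_probability(n_same: float, n_coincidence: float) -> float:
    """Probability that a record which looks like ours is a true match,
    given the expected number of records for the same person (n_same)
    and the expected number of look-alike records for other people
    (n_coincidence)."""
    return n_same / (n_same + n_coincidence)
```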
To clarify, let's look at a simple example:
Suppose we have a dataset of 100,000 records. Each record has, on average, one duplicate, so there are 50,000 distinct individuals in this dataset.
Take an example record from this data. Let's say we observe the full name 'John Smith', with all other fields null. Further assume that:

- amongst records belonging to different people, 1 in 20 (5%) share the first name 'John' and 1 in 50 (2%) share the surname 'Smith'
- records belonging to the same person always match exactly on first name and surname
In this example, $n_{\text{same}} = 1$ because we assumed that each record has, on average, one duplicate.
But what is $n_{\text{coincidence}}$? How many records are there in the dataset for different John Smiths?
There are 99,998 records belonging to different individuals (all records except this one and its expected duplicate), so we can compute the expected number of coincidental John Smiths as:

$$n_{\text{coincidence}} = 99{,}998 \times 0.05 \times 0.02 = 99.998$$

That is, we expect 99.998 other John Smiths who are not the same person.
So if we find another "John Smith" in the dataset, the probability that this is a duplicate is:

$$\text{Pr}(\text{match}) = \frac{n_{\text{same}}}{n_{\text{same}} + n_{\text{coincidence}}} = \frac{1}{1 + 99.998} = 0.990119\%$$
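Here is a short snippet reproducing this calculation under the assumptions above (variable names are illustrative):

```python
# Single-record view of the John Smith example
n_same = 1                                    # each record has, on average, one duplicate
other_people = 100_000 - 2                    # everyone except this record and its duplicate
n_coincidence = other_people * 0.05 * 0.02    # 5% share 'John', 2% share 'Smith'

probability = n_same / (n_same + n_coincidence)

print(f"{n_coincidence:.3f}")  # 99.998
print(f"{probability:.6%}")    # 0.990119%
```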
This approach is mathematically equivalent to the pairwise comparison approach.
With the pairwise comparison approach we have:
Total number of possible comparisons $= \frac{100{,}000 \times 99{,}999}{2} = 4{,}999{,}950{,}000$

$\lambda$ = probability two random records match = $\frac{50{,}000}{4{,}999{,}950{,}000}$
Recall from our article on the maths of the Fellegi-Sunter model that:

$$\text{posterior odds of match} = \frac{\lambda}{1 - \lambda} \times \prod_i K_i$$

where $K_i$ is the Bayes Factor for the $i$th element of the comparison vector.
The relevant Bayes Factors are 20 and 50, so we have:

$$\text{posterior odds} = \frac{50{,}000}{4{,}999{,}900{,}000} \times 20 \times 50 = \frac{1{,}000}{99{,}998} = 0.0100002000$$
Which we can convert into a probability using:

$$\text{probability} = \frac{\text{odds}}{1 + \text{odds}} = \frac{0.0100002}{1.0100002} = 0.990119\%$$
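And here is the pairwise calculation as a snippet, confirming it gives the same answer as the single-record view (again, variable names are illustrative):

```python
# Pairwise-comparison view of the same example
n_records = 100_000
n_matching_pairs = 50_000   # one duplicate per record => 50,000 matching pairs

total_comparisons = n_records * (n_records - 1) // 2   # 4,999,950,000
lam = n_matching_pairs / total_comparisons             # probability a random pair matches

prior_odds = lam / (1 - lam)                           # = 1 / 99,998
posterior_odds = prior_odds * 20 * 50                  # apply the two Bayes Factors

probability = posterior_odds / (1 + posterior_odds)

print(f"{posterior_odds:.10f}")  # 0.0100002000
print(f"{probability:.6%}")      # 0.990119%
```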
¹ These are defined loosely here for the sake of exposition. In reality, they should be defined in relation to a specific comparison vector.