Originally posted: 2025-02-18.

This is part 9 of the tutorial.

So far, we have thought about the calculation of the Fellegi-Sunter model from the perspective of pairwise record comparisons.

In this view, we begin by imagining a random draw from the set of all possible pairwise comparisons of records. The probability that these records match is the 'prior' - our assessment of the match probability before we begin to take into account the information we observe in the records.

To compute the match probability for a specific pairwise comparison, we then update our beliefs based on the information in the records using Bayesian conditioning.

For example, if we observe an exact match on first name, then we are no longer in the space of all $\frac{n(n-1)}{2}$ comparisons, and instead need to restrict our attention to the subset of all comparisons where first name matches. In this subset, the probability of a match is higher, and so we update the match probability to reflect this.
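As a minimal sketch of this counting (the dataset size here is hypothetical):

```python
from math import comb

n = 1_000  # hypothetical number of records
total_pairs = comb(n, 2)  # n(n-1)/2 possible pairwise comparisons
print(total_pairs)  # 499500

# Conditioning on an exact first-name match restricts attention to the
# subset of these pairs where first name agrees, and the match probability
# is recomputed within that smaller subset.
```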

In this post, we look at an alternative and equivalent way to think about the computation of the match probability that may be more intuitive. This explanation starts from the perspective of a single record, rather than the space of all pairwise comparisons.

Given this record, let's define two variables¹:

  • $s$: how many other records for the same individual are there in the dataset?
  • $d$: how many records are there in the dataset for different individuals who, by coincidence, share the same information? (We can think of these as 'red herrings'.)

Using these variables, we can compute the probability that two records like this one are a match:

$$P(match) = \frac{\text{true matches}}{\text{true matches} + \text{'red herring' matches}} = \frac{s}{s+d}$$
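This formula translates directly into code. A minimal sketch (the function name is my own, not from any library):

```python
def match_probability(s: float, d: float) -> float:
    """P(match) = true matches / (true matches + 'red herring' matches)."""
    return s / (s + d)
```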

To clarify, let's look at a simple example:

Suppose we have a dataset of 100,000 records. Each record has, on average, one duplicate, so there are 50,000 distinct individuals in this dataset.

Take an example record from this data. Let's say we observe the full name 'John Smith', with all other fields null. Further assume that:

  • 1 in 20 people are called John in the population
  • 1 in 50 people are called Smith in the population

In this example $s = 1$ because we assumed that each record has, on average, one duplicate.

But what is $d$? How many records are there in the dataset for different John Smiths?

There are 99,998 records for different individuals, so we can compute the expected number of John Smiths as:

$d = 99,998 \times \frac{1}{20} \times \frac{1}{50} = 99.998$ other John Smiths.

So if we find another "John Smith" in the dataset, the probability that this is a duplicate is:

$$P(match) = \frac{s}{s+d} = \frac{1}{1+99.998} = 0.990119\%$$
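We can check this arithmetic with a short script (a sketch using the numbers assumed above):

```python
n_records = 100_000
s = 1  # each record has, on average, one duplicate

# 99,998 records belong to different individuals (excluding the record
# itself and its one expected duplicate)
other_records = n_records - 2

# Expected number of coincidental 'John Smith' records
d = other_records * (1 / 20) * (1 / 50)
print(d)  # 99.998

print(f"{s / (s + d):.6%}")  # 0.990119%
```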

This approach is mathematically equivalent to the pairwise comparison approach.

With the pairwise comparison approach we have:

Total number of possible comparisons $= \frac{n(n-1)}{2} = 4,999,950,000$

$\lambda$ = probability two random records match = $50,000 / 4,999,950,000$

Recall from our article on the maths of the Fellegi-Sunter model that:

$$\text{posterior odds of match} = \frac{\lambda}{1-\lambda} K_1 K_2 \dots K_n$$

where $K_i$ is the Bayes Factor for the $i$th comparison vector.
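This maps onto a small helper function (a sketch; the name is mine, not from any library):

```python
import math

def posterior_odds(lam: float, bayes_factors: list[float]) -> float:
    """Posterior odds = lambda / (1 - lambda) * K_1 * K_2 * ... * K_n."""
    return lam / (1 - lam) * math.prod(bayes_factors)
```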

The relevant Bayes Factors are 20 and 50, so we have:

$$\text{posterior odds of match} = \left( \frac{50,000}{4,999,950,000 - 50,000} \right) \times 50 \times 20 = 0.0100002000$$

(The 50,000 subtracted in the denominator comes from the $1 - \lambda$ term.)

Which we can convert into a probability using:

$$P(match) = \frac{\text{odds}}{1 + \text{odds}}$$

$= 0.990119\%$
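Running the numbers through the posterior_odds sketch above confirms the two approaches agree exactly:

```python
def odds_to_probability(odds: float) -> float:
    return odds / (1 + odds)

lam = 50_000 / 4_999_950_000  # prior: probability two random records match
odds = posterior_odds(lam, [50, 20])
print(odds)  # 0.010000200002...

print(f"{odds_to_probability(odds):.6%}")  # 0.990119%
print(f"{1 / (1 + 99.998):.6%}")           # 0.990119% - same as s / (s + d)
```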

  1. These are defined loosely here for the sake of exposition. In reality, they should be defined in relation to a specific comparison vector.