Originally posted: 2025-02-18.

This is part 9 of the tutorial.

So far, we have thought about the calculation of the Fellegi-Sunter model from the perspective of pairwise record comparisons.

In this view, we begin by imagining a random draw from the set of all possible pairwise comparisons of records. The probability that these records match is the 'prior' - our assessment of the match probability before we begin to take into account the information we observe in the records.

To compute the match probability for a specific pairwise comparison, we then update our beliefs based on the information in the records using Bayesian conditioning.

For example, if we observe an exact match on first name, then we are no longer in the space of all $\frac{n(n-1)}{2}$ comparisons, and instead need to restrict our attention to the subset of all comparisons where first name matches. In this subset, the probability of a match is higher, and so we update the match probability to reflect this.
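As a minimal sketch of this counting (the dataset size here is hypothetical):

```python
from math import comb

n = 1_000  # hypothetical number of records
total_pairs = comb(n, 2)  # n(n-1)/2 possible pairwise comparisons
print(total_pairs)  # 499500

# Conditioning on an exact first-name match restricts attention to the
# subset of these pairs where first name agrees, and the match probability
# is recomputed within that smaller subset.
```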

In this post, we look at an alternative and equivalent way to think about the computation of the match probability that may be more intuitive. This explanation starts from the perspective of a single record, rather than the space of all pairwise comparisons.

Given this record, let's define two variables¹:

  • $s$: how many other records for the same individual are there in the dataset?
  • $d$: how many records are there in the dataset for different individuals who, by coincidence, share the same information? (We can think of these as 'red herrings'.)

Using these variables, we can compute the probability that two records like this one are a match:

$$P(match) = \frac{\text{true matches}}{\text{true matches} + \text{'red herring' matches}} = \frac{s}{s+d}$$
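This formula translates directly into code. A minimal sketch (the function name is my own, not from any library):

```python
def match_probability(s: float, d: float) -> float:
    """P(match) = true matches / (true matches + 'red herring' matches)."""
    return s / (s + d)
```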

To clarify, let's look at a simple example:

Suppose we have a dataset of 100,000 records. Each record has, on average, one duplicate, so there are 50,000 distinct individuals in this dataset.

Take an example record from this data. Let's say we observe the full name 'John Smith', with all other fields null. Further assume that:

  • 1 in 20 people are called John in the population
  • 1 in 50 people are called Smith in the population

In this example $s = 1$ because we assumed that each record has, on average, one duplicate.

But what is $d$? How many records are there in the dataset for different John Smiths?

There are 99,998 records for different individuals, so we can compute the expected number of John Smiths as:

$d = 99,998 \times \frac{1}{20} \times \frac{1}{50} = 99.998$ other John Smiths.

So if we find another "John Smith" in the dataset, the probability that this is a duplicate is:

$$P(match) = \frac{s}{s+d} = \frac{1}{1+99.998} = 0.990119\%$$
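We can check this arithmetic with a short script (a sketch using the numbers assumed above):

```python
n_records = 100_000
s = 1  # each record has, on average, one duplicate

# 99,998 records belong to different individuals (excluding the record
# itself and its one expected duplicate)
other_records = n_records - 2

# Expected number of coincidental 'John Smith' records
d = other_records * (1 / 20) * (1 / 50)
print(d)  # 99.998

print(f"{s / (s + d):.6%}")  # 0.990119%
```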

This approach is mathematically equivalent to the pairwise comparison approach.

With the pairwise comparison approach we have:

Total number of possible comparisons $= \frac{n(n-1)}{2} = 4,999,950,000$

$\lambda$ = probability two random records match = $50,000 / 4,999,950,000$

Recall from our article on the maths of the Fellegi-Sunter model that:

$$\text{posterior odds of match} = \frac{\lambda}{1-\lambda} K_1 K_2 \dots K_n$$

where $K_i$ is the Bayes Factor for the $i$th comparison vector.
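This maps onto a small helper function (a sketch; the name is mine, not from any library):

```python
import math

def posterior_odds(lam: float, bayes_factors: list[float]) -> float:
    """Posterior odds = lambda / (1 - lambda) * K_1 * K_2 * ... * K_n."""
    return lam / (1 - lam) * math.prod(bayes_factors)
```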

The relevant Bayes Factors are 20 and 50, so we have:

$$\text{posterior odds of match} = \left( \frac{50,000}{4,999,950,000 - 50,000} \right) \times 50 \times 20 = 0.0100002000$$

(The 50,000 subtracted in the denominator comes from the $1 - \lambda$ term.)

Which we can convert into a probability using:

$$P(match) = \frac{\text{odds}}{1 + \text{odds}}$$

$= 0.990119\%$
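Running the numbers through the posterior_odds sketch above confirms the two approaches agree exactly:

```python
def odds_to_probability(odds: float) -> float:
    return odds / (1 + odds)

lam = 50_000 / 4_999_950_000  # prior: probability two random records match
odds = posterior_odds(lam, [50, 20])
print(odds)  # 0.010000200002...

print(f"{odds_to_probability(odds):.6%}")  # 0.990119%
print(f"{1 / (1 + 99.998):.6%}")           # 0.990119% - same as s / (s + d)
```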

  1. These are defined loosely here for the sake of exposition. In reality, they should be defined in relation to a specific comparison vector.