Originally posted: 2023-09-22.
The previous article showed how partial match weights are used to compute a prediction of whether two records match.
However, partial match weights are not estimated directly. They are made up of two parameters known as the m and the u probabilities.
These probabilities are key to enabling estimation of partial match weights.
The m and u probabilities also have intuitive interpretations that allow us to understand linkage models and diagnose problems.
Imagine we have two records. We're not sure whether they represent the same person.
Now we're given some new information: we're told that month of birth matches.
Is this scenario more likely among matches or non-matches?
Since it's common to observe this scenario among matching records, but rare to observe it among non-matching records, this is evidence in favour of a match.
But how much evidence?
The strength of the evidence is quantified using the m and u probabilities. For each scenario in the model:
The m probability measures how often the scenario occurs among matching records:
$$m = \Pr(\text{scenario} \mid \text{records match})$$
The u probability measures how often the scenario occurs among non-matching records:
$$u = \Pr(\text{scenario} \mid \text{records do not match})$$
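To make the two definitions concrete, here is a minimal sketch that computes m and u for a single scenario by simple counting. It assumes a hypothetical set of record comparisons that have already been labelled as match or non-match; the function name and the toy data are illustrative only and are not taken from the article.

```python
def estimate_m_and_u(labelled_comparisons):
    """Return (m, u) for one scenario from labelled record comparisons.

    Each item is a pair (scenario_observed, is_match) of booleans.
    """
    match_total = match_hits = 0
    non_match_total = non_match_hits = 0
    for scenario_observed, is_match in labelled_comparisons:
        if is_match:
            match_total += 1
            match_hits += int(scenario_observed)
        else:
            non_match_total += 1
            non_match_hits += int(scenario_observed)
    # m: share of matching pairs in which the scenario occurs
    m = match_hits / match_total
    # u: share of non-matching pairs in which the scenario occurs
    u = non_match_hits / non_match_total
    return m, u


# Hypothetical toy data: 10 matching pairs (scenario seen in 9 of them)
# and 100 non-matching pairs (scenario seen in 8 of them).
toy_comparisons = (
    [(True, True)] * 9
    + [(False, True)] * 1
    + [(True, False)] * 8
    + [(False, False)] * 92
)

m, u = estimate_m_and_u(toy_comparisons)
print(f"m = {m:.2f}, u = {u:.2f}")
```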
What matters is the relative size of these values. This is calculated as a ratio known as the Bayes Factor, denoted by $K$:
$$K = \text{Bayes Factor} = \frac{m}{u}$$
Bayes Factors provide the easiest way to interpret the parameters of the Fellegi Sunter model because they act as a relative multiplier that increases or decreases the overall prediction of whether the records match.
For example, suppose we observe that month of birth matches.
$$\text{Bayes Factor} = \frac{m}{u} \approx 11.9$$
This means we observe this scenario around 11.9 times more often amongst matching records than non-matching records.
Hence, given this observation, the records are 11.9 times more likely to be a match.
More generally, we can see from the formula that strong positive match weights are only possible with low u probabilities, implying high cardinality.
Suppose we observe that gender does not match.
$$\text{Bayes Factor} = \frac{m}{u} \approx \frac{1}{25} = 0.04$$
We observe this scenario around 25 times more often among non-matching records than matching records.
Hence, given this observation the records are 25 times less likely to be a match.
More generally, we can see from the formula that strong negative match weights are only possible with low m probabilities for the scenario in question, which in turn implies high data quality.
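The two examples above can be reproduced with a few lines of Python. The m and u values below are assumptions chosen only so that the ratios come out close to the figures quoted in the text (around 11.9 and 1/25). They are not necessarily the values behind the article's examples.

```python
def bayes_factor(m, u):
    """Ratio of how often a scenario occurs among matches vs non-matches."""
    return m / u


# Assumed values: month of birth agrees for 99% of true matches,
# and for about 1 in 12 non-matches by coincidence.
month_of_birth_matches = bayes_factor(m=0.99, u=1 / 12)

# Assumed values: gender disagrees for 2% of true matches (data error),
# and for half of all non-matches.
gender_does_not_match = bayes_factor(m=0.02, u=0.5)

print(f"Month of birth matches: {month_of_birth_matches:.1f}x more likely")
print(f"Gender does not match:  {1 / gender_does_not_match:.0f}x less likely")
```

A Bayes Factor above 1 is evidence in favour of a match, and a Bayes Factor below 1 is evidence against.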
In addition to these quantitative interpretations, the m and u probabilities also have intuitive qualitative interpretations:
The m probability can be thought of as a measure of data quality, or the propensity for data to change through time.
For example, consider the scenario of an exact match on first name.
An m probability of 0.9 means that, amongst matching records, the first name matches just 90% of the time, which is an indication of poor data quality.
The m probability for an exact match on postcode may be even lower - but this may be driven primarily by people moving house, as opposed to data error.
The u probability is primarily a measure of the likelihood of coincidences, which is driven by the cardinality of the data.
Consider the scenario of an exact match on first name.
A u probability of 0.005 means that, amongst non-matching records, first name matches 0.5% of the time.
The u probability therefore measures how often two different people have the same first name, so in this sense it's a measure of how often coincidences occur.
A column such as first name with a large number of distinct values (high cardinality) will have much smaller u probabilities than a column such as gender, which has low cardinality.
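The link between cardinality and coincidences can be illustrated with a rough calculation: if two records are drawn independently at random, the chance that they share a value in a column is the sum of the squared value frequencies. This is a simplified sketch and not how a linkage model estimates u; it assumes independent draws with replacement, and the column contents are hypothetical.

```python
from collections import Counter


def coincidence_probability(values):
    """Chance that two independently drawn records share the same value."""
    counts = Counter(values)
    total = len(values)
    return sum((count / total) ** 2 for count in counts.values())


# Hypothetical columns: gender has 2 equally common values,
# first name has 100 equally common values.
gender_column = ["M", "F"] * 50
first_name_column = [f"name_{i}" for i in range(100)]

print(coincidence_probability(gender_column))      # low cardinality
print(coincidence_probability(first_name_column))  # high cardinality
```

The higher-cardinality column produces a much smaller coincidence probability, which is the behaviour described for u above.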
What does it mean for a match to be $n$ times more or less likely? More likely than what?
It's only meaningful to say that something is more or less likely relative to a starting probability - known as the 'prior' (our 'prior belief').
In the context of record linkage, the prior is our existing belief that the two records match before we saw the new information contained in a scenario (e.g. that first names match).
Our updated belief given this new information is called the 'posterior'.
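Mathematically this can be written:
$$\text{posterior odds} = \text{prior odds} \times \text{Bayes Factor}$$
and odds can be turned into probabilities with the following formula:
$$\text{probability} = \frac{\text{odds}}{1 + \text{odds}}$$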
See the mathematical annex for further detail on these derivations.
For example, suppose we believe the odds of a record comparison being a match are 1 to 120. But now we observe the new information that month of birth matches, with a Bayes Factor of 12.
So
$$\text{posterior odds} = \frac{1}{120} \times 12 = \frac{1}{10}$$
After observing that the month of birth matches, the odds of the records being a match would therefore be 1 to 10, or a probability of approximately 0.0909.
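The same update can be written as a small Python sketch. The function names are illustrative; the numbers are the ones used in the example above (prior odds of 1 to 120 and a Bayes Factor of 12).

```python
def update_odds(prior_odds, bayes_factor):
    """Posterior odds = prior odds x Bayes Factor."""
    return prior_odds * bayes_factor


def odds_to_probability(odds):
    """Convert odds (e.g. 0.1 for '1 to 10') into a probability."""
    return odds / (1 + odds)


def probability_to_odds(probability):
    """Convert a probability into odds."""
    return probability / (1 - probability)


prior_odds = 1 / 120
posterior_odds = update_odds(prior_odds, bayes_factor=12)
posterior_probability = odds_to_probability(posterior_odds)

print(f"Posterior odds:        {posterior_odds:.4f}")
print(f"Posterior probability: {posterior_probability:.4f}")
```

Because the update is multiplicative, further observations can be applied one after another by calling `update_odds` again on the result, provided the observations are treated as independent.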
An alternative way of visualising these concepts can be found here.
How do m and u probabilities and Bayes Factors relate to the partial match weights we explored in the previous article?
Partial match weights relate to Bayes Factors through a simple formula:
$$\text{partial match weight} = \log_2(\text{Bayes Factor}) = \log_2\left(\frac{m}{u}\right)$$
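As a minimal sketch of this relationship, the snippet below converts between Bayes Factors and partial match weights. It assumes the base-2 logarithm convention, and the function names are illustrative. It also shows that multiplying Bayes Factors corresponds to adding partial match weights.

```python
import math


def partial_match_weight(bayes_factor):
    """Partial match weight as the base-2 log of the Bayes Factor (assumed convention)."""
    return math.log2(bayes_factor)


def bayes_factor_from_weight(weight):
    """Inverse transform: recover the Bayes Factor from a partial match weight."""
    return 2 ** weight


# Illustrative Bayes Factors: one in favour of a match, one against.
in_favour = 12
against = 0.04

combined_weight = partial_match_weight(in_favour) + partial_match_weight(against)
combined_factor = in_favour * against

print(f"Weight for a Bayes Factor of 12:   {partial_match_weight(in_favour):.3f}")
print(f"Weight for a Bayes Factor of 0.04: {partial_match_weight(against):.3f}")
print(f"Sum of weights:                    {combined_weight:.3f}")
print(f"Weight of the product:             {partial_match_weight(combined_factor):.3f}")
```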
There are two main reasons that the additional concept of partial match weights is useful in addition to Bayes Factors:
[Chart: the relationship between m and u probabilities, Bayes Factors and partial match weights.]
Now that we have a firm grasp of these ingredients, we're in a position to present the full mathematical specification of the Fellegi Sunter model.
In the main text we asserted that:
$$\text{posterior odds} = \text{prior odds} \times \text{Bayes Factor}$$
We can derive this formula from the m and u probabilities and Bayes' Theorem.
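Recall that Bayes' Theorem is:
$$\Pr(a \mid b) = \frac{\Pr(b \mid a)\,\Pr(a)}{\Pr(b)}$$
or in words:
$$\text{posterior probability} = \frac{\text{likelihood} \times \text{prior probability}}{\text{evidence}}$$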
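In the context of record linkage, we can describe these parts as:
Prior: The overall proportion of comparisons which are matches, $\Pr(\text{match})$
Evidence: We have observed that e.g. first name matches, which has overall probability $\Pr(\text{first name matches})$
Likelihood: The probability that first name matches amongst matches, given by $m = \Pr(\text{first name matches} \mid \text{match})$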
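So Bayes' formula is:
$$\Pr(\text{match} \mid \text{first name matches}) = \frac{\Pr(\text{first name matches} \mid \text{match})\,\Pr(\text{match})}{\Pr(\text{first name matches})}$$
Which can also be written:
$$\Pr(\text{match} \mid \text{first name matches}) = \frac{\Pr(\text{first name matches} \mid \text{match})\,\Pr(\text{match})}{\Pr(\text{first name matches} \mid \text{match})\,\Pr(\text{match}) + \Pr(\text{first name matches} \mid \text{non-match})\,\Pr(\text{non-match})}$$
Using some of the terminology from the article this is the same as:
$$\text{posterior probability} = \frac{m \times \text{prior probability}}{m \times \text{prior probability} + u \times (1 - \text{prior probability})}$$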
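The formula for odds is:
$$\text{odds} = \frac{p}{1 - p}$$
So we can write:
$$\text{posterior odds} = \frac{\text{prior probability}}{1 - \text{prior probability}} \times \frac{m}{u} = \text{prior odds} \times \text{Bayes Factor}$$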
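As a sanity check on this derivation, the following sketch computes the posterior probability in two ways: directly from Bayes' Theorem, and via prior odds multiplied by the Bayes Factor. The m and u values are the first name figures quoted earlier in the article (0.9 and 0.005); the prior of 0.01 is an assumption made purely for illustration.

```python
def posterior_via_bayes_theorem(prior, m, u):
    """Posterior probability of a match, using Bayes' Theorem directly."""
    return (m * prior) / (m * prior + u * (1 - prior))


def posterior_via_odds(prior, m, u):
    """Posterior probability of a match, using prior odds x Bayes Factor."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * (m / u)
    return posterior_odds / (1 + posterior_odds)


prior = 0.01  # assumed prior probability that a comparison is a match
m = 0.9       # first name matches amongst matching records
u = 0.005     # first name matches amongst non-matching records

direct = posterior_via_bayes_theorem(prior, m, u)
via_odds = posterior_via_odds(prior, m, u)

print(f"Bayes' Theorem:            {direct:.6f}")
print(f"Prior odds x Bayes Factor: {via_odds:.6f}")
```

Both routes give the same posterior probability, as the algebra above implies.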