Originally posted: 2023-09-22.

The previous article showed how partial match weights are used to compute a prediction of whether two records match.

However, partial match weights are not estimated directly. Instead, each one is derived from two parameters known as the m and the u probabilities.

These probabilities are key to enabling estimation of partial match weights.

The m and u probabilities also have intuitive interpretations that allow us to understand linkage models and diagnose problems.

Imagine we have two records. We're not sure whether they represent the same person.

Now we're given some new information: we're told that month of birth matches.

Is this scenario more likely among matches or non-matches?

  • Amongst matching records, month of birth will usually match
  • Amongst non-matching records, month of birth will rarely match

Since it's common to observe this scenario among matching records, but rare to observe it among non-matching records, this is evidence in favour of a match.

But how much evidence?

The strength of the evidence is quantified using the m and u probabilities. For each scenario in the model:

  • The m probability measures how often the scenario occurs among matching records: m = \text{Pr}(\text{scenario}|\text{records match})

  • The u probability measures how often the scenario occurs among non-matching records: u = \text{Pr}(\text{scenario}|\text{records do not match})

What matters is the relative size of these values. This is calculated as a ratio known as the Bayes Factor¹, denoted by K.

\text{Bayes Factor} = K = \frac{m}{u} = \frac{\text{Pr}(\text{scenario}|\text{records match})}{\text{Pr}(\text{scenario}|\text{records do not match})}

Bayes Factors provide the easiest way to interpret the parameters of the Fellegi Sunter model because they act as a relative multiplier that increases or decreases the overall prediction of whether the records match. For example:

  • A Bayes Factor of 5 can be interpreted as '5 times more likely to match'
  • A Bayes Factor of 0.2 can be interpreted as '5 times less likely to match'
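
These two readings are easy to express in code. Here is a minimal Python sketch (the helper name `describe_bayes_factor` is purely illustrative):

```python
def describe_bayes_factor(k: float) -> str:
    """Express a Bayes Factor as a 'times more/less likely' statement."""
    if k >= 1:
        return f"{k:g} times more likely to match"
    return f"{1 / k:g} times less likely to match"

print(describe_bayes_factor(5))    # 5 times more likely to match
print(describe_bayes_factor(0.2))  # 5 times less likely to match
```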

For example, suppose we observe that month of birth matches.

  • Amongst matching records, month of birth will usually match. Allowing for the occasional typo, we may have m = 0.99
  • Amongst non-matching records, month of birth matches around a twelfth of the time, so u = 1/12.

\text{Bayes Factor} = K = \frac{m}{u} = \frac{0.99}{0.0833} = 11.9

This means we observe this scenario around 11.9 times more often amongst matching records than non-matching records.

Hence, given this observation, the records are 11.9 times more likely to be a match.
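
Checking the arithmetic in Python, with the illustrative values above:

```python
m = 0.99    # Pr(month of birth matches | records match)
u = 1 / 12  # Pr(month of birth matches | records do not match)

bayes_factor = m / u
print(round(bayes_factor, 1))  # 11.9
```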

More generally, we can see from the formula that strong positive match weights are only possible with low u probabilities, which implies high cardinality.

Suppose we observe that gender does not match.

  • Amongst matching records, it will be rare to observe a non-match on gender. If there are occasional data entry errors, we may have m = 0.02
  • Amongst non-matching records, gender will match around half the time, so u = 0.5.

\text{Bayes Factor} = K = \frac{m}{u} = \frac{0.02}{0.5} = 0.04 = \frac{1}{25}

We observe this scenario around 25 times more often among non-matching records than matching records.

Hence, given this observation, the records are 25 times less likely to be a match.
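
The same check for the gender example, again with the illustrative values above:

```python
m = 0.02  # Pr(gender does not match | records match)
u = 0.5   # Pr(gender does not match | records do not match)

bayes_factor = m / u
print(bayes_factor)             # 0.04
print(round(1 / bayes_factor))  # 25, i.e. 25 times less likely to match
```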

More generally, we can see from the formula that strong negative match weights are only possible with low m probabilities, which in turn implies high data quality.

In addition to these quantitative interpretations, the m and u probabilities also have intuitive qualitative interpretations:

The m probability can be thought of as a measure of data quality, or the propensity for data to change through time.

For example, consider the scenario of an exact match on first name.

An m probability of 0.9 means that, amongst matching records, the first name matches just 90% of the time, which is an indication of poor data quality.

The m probability for an exact match on postcode may be even lower - but this may be driven primarily by people moving house, as opposed to data error.

The u probability is primarily a measure of the likelihood of coincidences, which is driven by the cardinality of the data.

Consider the scenario of an exact match on first name.

A u probability of 0.005 means that, amongst non-matching records, first name matches 0.5% of the time.

The u probability therefore measures how often two different people have the same first name - so in this sense it's a measure of how often coincidences occur.

A column such as first name with a large number of distinct values (high cardinality) will have much smaller u probabilities than a column such as gender which has low cardinality.
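
One way to make this link concrete: if we assume that, for non-matching records, each record's value is drawn independently from the column's observed distribution, then the chance of a coincidental match is the sum of the squared value frequencies. The following is a simplified sketch of that idea with made-up data (the helper `u_from_values` is invented for illustration, and this is not how any particular linkage package estimates u):

```python
from collections import Counter

def u_from_values(values):
    """Chance that two randomly chosen non-matching records share a value,
    assuming each value is drawn independently from this distribution."""
    counts = Counter(values)
    n = len(values)
    return sum((c / n) ** 2 for c in counts.values())

# Made-up toy data: a high-cardinality column vs a low-cardinality one
first_names = ["ann", "bob", "carl", "dev", "eve", "fay", "gus", "hana"]
genders = ["m", "f", "m", "f", "m", "f", "m", "f"]

print(u_from_values(first_names))  # 0.125 - coincidences are rare
print(u_from_values(genders))      # 0.5   - coincidences are common
```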

What does it mean for a match to be n times more or less likely? More likely than what?

It's only meaningful to say that something is more or less likely relative to a starting probability - known as the 'prior' (our 'prior belief').

In the context of record linkage, the prior is our existing belief that the two records match, before observing the new information contained in a scenario (e.g. that first names match).

Our updated belief given this new information is called the 'posterior'.

Mathematically this can be written:

\text{posterior odds} = \text{prior odds} \times \text{Bayes Factor}

and odds can be turned into probabilities with the following formula:

\text{probability} = \frac{\text{odds}}{1 + \text{odds}}
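
Both conversions are one-liners; a quick Python sketch:

```python
def odds_to_probability(odds: float) -> float:
    return odds / (1 + odds)

def probability_to_odds(p: float) -> float:
    return p / (1 - p)

print(odds_to_probability(1))     # 0.5 - even odds
print(probability_to_odds(0.75))  # 3.0 - odds of 3 to 1
```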

See the mathematical annex for further detail on these derivations.

For example, suppose we believe the odds of a record comparison being a match are 1 to 120. But now we observe the new information that month of birth matches, with a Bayes Factor of 12.

So \text{posterior odds} = \frac{1}{120} \times 12 = \frac{1}{10}

\text{posterior probability} = \frac{\frac{1}{10}}{1 + \frac{1}{10}} = \frac{1}{11} \approx 0.0909

So after observing that the month of birth matches, the odds of the records being a match would be 1 to 10, equivalent to a probability of approximately 0.0909.

Here's a calculator which shows how a prior probability is updated with a Bayes Factor/partial match weight:
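
A minimal Python sketch of such a calculator (the helper name `update_prior` is purely illustrative), reproducing the month of birth example above:

```python
def update_prior(prior_probability, bayes_factor):
    """Update a prior match probability with a Bayes Factor,
    returning the posterior match probability."""
    prior_odds = prior_probability / (1 - prior_probability)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1 + posterior_odds)

# Prior odds of 1 to 120 correspond to a prior probability of 1/121
print(round(update_prior(1 / 121, 12), 4))  # 0.0909 - posterior odds of 1 to 10
```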


An alternative way of visualising these concepts can be found here.

How do m and u probabilities and Bayes Factors relate to the partial match weights we explored in the previous article?

Partial match weights relate to Bayes Factors through a simple formula:

\text{partial match weight} = \omega = \log_2(\text{Bayes Factor}) = \log_2\left(\frac{m}{u}\right)

There are two main reasons why the concept of partial match weights is useful in addition to Bayes Factors:

  • Partial match weights are easier to represent on charts. They tend to range from -30 to 30, whereas Bayes Factors can be tiny (one in a million) or massive (millions).
  • Since Bayes Factors are multiplicative, the log transform turns them into something additive, which simplifies the maths a little.
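
A short sketch of both points, using the illustrative m and u values from earlier (the helper name `match_weight` is just for illustration): the log2 relationship, and the fact that multiplying Bayes Factors corresponds to adding partial match weights.

```python
import math

def match_weight(m: float, u: float) -> float:
    """Partial match weight: log2 of the Bayes Factor m/u."""
    return math.log2(m / u)

w_month = match_weight(0.99, 1 / 12)  # ~3.57
w_gender = match_weight(0.02, 0.5)    # ~-4.64

# Multiplying the Bayes Factors is equivalent to adding the match weights
combined_bayes_factor = (0.99 / (1 / 12)) * (0.02 / 0.5)
print(math.isclose(math.log2(combined_bayes_factor), w_month + w_gender))  # True
```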

We can summarise this relationship with this chart.

A larger, standalone version is available here.

Now that we have a firm grasp of these ingredients, we're in a position to present the full mathematical specification of the Fellegi Sunter model.

In the main text we asserted that:

\text{posterior odds} = \text{prior odds} \times \text{Bayes Factor}

We can derive this formula from the m and u probabilities and Bayes' Theorem.

Recall that Bayes' Theorem is:

\operatorname{Pr}(a|b) = \frac{\operatorname{Pr}(b|a)\operatorname{Pr}(a)}{\operatorname{Pr}(b)}

or in words:

\text{posterior probability} = \frac{\text{likelihood} \times \text{prior probability}}{\text{evidence}}

In the context of record linkage, we can describe these parts as:

Prior: The overall proportion of comparisons which are matches, \operatorname{Pr}(\text{match})

Evidence: We have observed that e.g. first name matches, \operatorname{Pr}(\text{first name matches})

Likelihood: The probability that first name matches amongst matching records, given by \operatorname{Pr}(\text{first name matches}|\text{records match})

So Bayes' formula is:

\operatorname{Pr}(\text{match}|\text{first name matches}) = \frac{\operatorname{Pr}(\text{first name matches}|\text{match})\operatorname{Pr}(\text{match})}{\operatorname{Pr}(\text{first name matches})}

Which can also be written:

\frac{\operatorname{Pr}(\text{first name matches}|\text{match})\operatorname{Pr}(\text{match})}{\operatorname{Pr}(\text{first name matches}|\text{match})\operatorname{Pr}(\text{match}) + \operatorname{Pr}(\text{first name matches}|\text{non match})\operatorname{Pr}(\text{non match})}

Using some of the terminology from the article this is the same as:

\text{posterior probability} = \frac{m \times \text{prior probability}}{m \times \text{prior probability} + u \times (1 - \text{prior probability})}
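
This form can be computed directly. A quick sketch using the illustrative month of birth values, with a prior probability of 1/121 (i.e. prior odds of 1 to 120):

```python
m = 0.99         # Pr(first name or month of birth matches | match)
u = 1 / 12       # Pr(... matches | non match)
prior = 1 / 121  # prior probability equivalent to odds of 1 to 120

posterior = (m * prior) / (m * prior + u * (1 - prior))
print(round(posterior, 4))  # 0.0901
```

This is slightly below the 0.0909 in the worked example above, which rounded the Bayes Factor up to 12.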

The formula for odds is:

\text{odds} = \frac{p}{1-p}

Applying this to the posterior probability above, the shared denominator cancels and we can write:

\text{posterior odds} = \frac{\text{prior}}{1 - \text{prior}} \times \frac{m}{u}

\text{posterior odds} = \text{prior odds} \times \text{Bayes Factor}
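
As a quick numerical check that the two forms agree, here is a small sketch computing the posterior both ways with the same illustrative values:

```python
import math

m, u, prior = 0.99, 1 / 12, 1 / 121  # illustrative values from the examples above

# Route 1: posterior probability directly from m, u and the prior
posterior_direct = (m * prior) / (m * prior + u * (1 - prior))

# Route 2: via odds and the Bayes Factor
prior_odds = prior / (1 - prior)
posterior_odds = prior_odds * (m / u)
posterior_via_odds = posterior_odds / (1 + posterior_odds)

print(math.isclose(posterior_direct, posterior_via_odds))  # True
```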

  1. You can read more about Bayes Factors here. The concept is quite similar to a likelihood ratio.