Originally posted: 2021-05-21. Last updated: 2023-10-02.
The previous article showed how $m$ and $u$ probabilities can be converted into Bayes Factors, which in turn can be combined with the prior probability to compute an updated prediction (the 'posterior' probability).
We now have all the groundwork in place to present a full mathematical definition of a model in the Fellegi-Sunter framework.
A more technical treatment is given in Enamorado, T., Fifield, B., & Imai, K. (2019), which contains a slightly more generalised form of the model presented here.
In the previous article we saw that Bayes Factors act as relative multipliers that increase or decrease the overall prediction of whether the records match.
We saw that, for a Bayes Factor $K$ for a specific scenario, such as a match on month of birth, we could write:
$$\text{posterior odds} = \text{prior odds} \times K$$
It turns out that this formula can be extended to account for the information in multiple scenarios at once, such as a match on first name, a match on surname and a fuzzy match on date of birth.
Let us denote by $K_i$ the Bayes Factor of the $i$-th activated scenario¹; then the general formula is:
$$\text{posterior odds} = \text{prior odds} \times \prod_i K_i$$
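This formula is quite intuitive. For example, suppose (purely for illustration) that a match on first name makes it 4x more likely the records match, and a match on surname makes it 6x more likely.
If both match, it becomes $4 \times 6 = 24$x more likely the records match.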
Multiplying Bayes Factors to compute a predicted probability in this way is sometimes known as a Naive Bayes classifier.²
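The multiplication can be sketched in a few lines of Python. This is only an illustration: the function name `posterior_odds`, the prior odds of 1 in 100 and the two Bayes Factors (4 and 6, chosen so that their product is 24) are assumptions made for the example, not values estimated from data.

```python
from math import prod


def posterior_odds(prior_odds, bayes_factors):
    """Multiply the prior odds by the Bayes Factor of each activated scenario."""
    return prior_odds * prod(bayes_factors)


# Illustrative values only (hypothetical, not estimated from real data)
prior_odds = 1 / 100        # assumed prior odds that two records match
bayes_factors = [4.0, 6.0]  # e.g. match on first name, match on surname

combined_multiplier = prod(bayes_factors)  # how much more likely a match becomes
odds = posterior_odds(prior_odds, bayes_factors)

print(combined_multiplier)
print(odds)
```

Because the factors simply multiply, adding a further column to the model only means appending one more Bayes Factor to the list.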
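Since Bayes Factors are not estimated directly, and instead we estimate the $m$ and $u$ probabilities of the model, we can substitute $K_i = m_i / u_i$ for each Bayes Factor to write the full specification of our model:
$$\text{posterior odds} = \text{prior odds} \times \prod_i \frac{m_i}{u_i} \tag{4.1}$$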
This final result can be converted into a probability using
$$\text{probability} = \frac{\text{odds}}{1 + \text{odds}}$$
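As a rough sketch, the whole model can be written as one short Python function that takes the prior probability and the $m$ and $u$ probabilities of the activated scenarios and returns a match probability. The function name `match_probability` and all numeric values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
from math import prod


def match_probability(prior_probability, m_probs, u_probs):
    """Posterior match probability from the m and u probabilities
    of the activated scenarios (one pair per column compared)."""
    prior_odds = prior_probability / (1 - prior_probability)
    # Each Bayes Factor is the ratio m / u for the activated scenario
    bayes_factors = [m / u for m, u in zip(m_probs, u_probs)]
    posterior_odds = prior_odds * prod(bayes_factors)
    # Convert odds back into a probability
    return posterior_odds / (1 + posterior_odds)


# Illustrative values only: two columns, both in a 'match' scenario
probability = match_probability(
    prior_probability=0.01,
    m_probs=[0.9, 0.8],
    u_probs=[0.1, 0.05],
)
print(round(probability, 4))
```

With these assumed inputs the Bayes Factors are roughly 9 and 16, so the prior odds of 1 to 99 are multiplied by about 144.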
In the first article in this tutorial we claimed that the Fellegi-Sunter model makes its predictions by summing the partial match weights.
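This follows from equation 4.1. Specifically, writing each Bayes Factor as $K_i = m_i / u_i$, equation 4.1 reads:
$$\text{posterior odds} = \text{prior odds} \times \prod_i K_i$$
Applying $\log_2$ to both sides, you get:
$$\log_2(\text{posterior odds}) = \log_2(\text{prior odds}) + \sum_i \omega_i$$
where $\omega_i = \log_2 K_i$ are the partial match weights.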
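The equivalence between multiplying Bayes Factors and summing partial match weights can be checked numerically. The sketch below assumes base-2 logarithms and uses made-up prior odds and Bayes Factors; the variable names are illustrative only.

```python
from math import isclose, log2, prod

# Illustrative values only (hypothetical)
prior_odds = 1 / 99
bayes_factors = [9.0, 16.0]

# Partial match weights are the (base-2) logs of the Bayes Factors
partial_match_weights = [log2(k) for k in bayes_factors]

# Summing weights in log space ...
total_match_weight = log2(prior_odds) + sum(partial_match_weights)
odds_from_weights = 2 ** total_match_weight

# ... agrees with multiplying Bayes Factors directly
odds_from_product = prior_odds * prod(bayes_factors)

probability = odds_from_weights / (1 + odds_from_weights)
print(partial_match_weights, total_match_weight, probability)
```

Working with weights rather than factors turns the product into a sum, which is why each column's contribution can be read off as a simple additive term.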
The remainder of this article delves a bit deeper into equation 4.1 and can be safely skipped if you're not interested in the maths. In particular, it shows how the match probability can be expressed in terms of the $m$ and $u$ probabilities using a formulation equivalent to equation 4 in the FastLink paper.
The next article will look at the mechanics of computing the final prediction.
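In the previous article we saw that by applying Bayes' Theorem to a specific scenario we could write:
$$\Pr(\text{match} \mid \text{scenario}) = \frac{\Pr(\text{scenario} \mid \text{match}) \Pr(\text{match})}{\Pr(\text{scenario} \mid \text{match}) \Pr(\text{match}) + \Pr(\text{scenario} \mid \text{non-match}) \Pr(\text{non-match})}$$
We can make this easier to read by recalling that
$$m = \Pr(\text{scenario} \mid \text{match})$$
and
$$u = \Pr(\text{scenario} \mid \text{non-match})$$
and denoting as $\lambda$ the prior probability
$$\lambda = \Pr(\text{match})$$
so that this can be written:
$$\Pr(\text{match} \mid \text{scenario}) = \frac{m \lambda}{m \lambda + u (1 - \lambda)}$$
By repeated application of Bayes' Theorem (see annex), we can generalise this formula to:
$$\Pr(\text{match} \mid \text{observation}) = \frac{\lambda \prod_i m_i}{\lambda \prod_i m_i + (1 - \lambda) \prod_i u_i} \tag{4.2}$$
We can write a similar formula for a non-match:
$$\Pr(\text{non-match} \mid \text{observation}) = \frac{(1 - \lambda) \prod_i u_i}{\lambda \prod_i m_i + (1 - \lambda) \prod_i u_i} \tag{4.3}$$
Now, recall that:
$$\text{odds} = \frac{p}{1 - p}$$
This means we can divide (4.2) by (4.3) to obtain the odds. Since they share a denominator this becomes:
$$\text{posterior odds} = \frac{\lambda \prod_i m_i}{(1 - \lambda) \prod_i u_i} = \frac{\lambda}{1 - \lambda} \prod_i \frac{m_i}{u_i}$$
Since $\text{prior odds} = \frac{\lambda}{1 - \lambda}$, this is equivalent to Equation 4.1 in this article, and is also equivalent to Equation 4 in the FastLink paper.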
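The claimed equivalence can be sanity-checked numerically. The sketch below computes the match probability once directly from the generalised Bayes' Theorem form (4.2) and once via the odds form (4.1), and confirms the two agree. The function names and the input values are hypothetical.

```python
from math import isclose, prod


def posterior_direct(prior, m_probs, u_probs):
    """Match probability from the generalised Bayes' Theorem form (4.2)."""
    match_term = prior * prod(m_probs)
    non_match_term = (1 - prior) * prod(u_probs)
    return match_term / (match_term + non_match_term)


def posterior_via_odds(prior, m_probs, u_probs):
    """Match probability from the odds form (4.1), converted to a probability."""
    odds = prior / (1 - prior) * prod(m / u for m, u in zip(m_probs, u_probs))
    return odds / (1 + odds)


# Illustrative values only: three columns with assumed m and u probabilities
prior = 0.001
m_probs = [0.95, 0.9, 0.7]
u_probs = [0.01, 0.02, 0.3]

direct = posterior_direct(prior, m_probs, u_probs)
via_odds = posterior_via_odds(prior, m_probs, u_probs)
print(direct, via_odds, isclose(direct, via_odds, rel_tol=1e-9))
```

A numerical check like this does not replace the algebra, but it is a quick way to catch a sign or index slip when re-deriving the result.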
The 'activated scenario' is the similarity category for the record comparison under evaluation. For example, three scenarios (similarity categories) may be defined for the first name column: 'exact match', 'fuzzy match' and 'all other'. Each will have a Bayes Factor associated with it. The 'activated scenario' is whichever of these scenarios describes the record comparison under evaluation. ↩
The result is only valid if we assume that the columns are independent conditional on match status. This is 'naive' in the sense that it's rarely true. However, it turns out that, for record linkage purposes, Fellegi-Sunter models are quite robust to violations of this assumption. ↩