Originally posted: 2023-10-02.
The previous article showed how we could derive a mathematical representation of the Fellegi-Sunter model.
This article contains a step-by-step demonstration of how we use this mathematical model to compute predictions from data.
The aim of the model is to compute a prediction of which records match. This prediction is a probability that quantifies the likelihood the two records represent the same entity.
We begin with some data. In this example we will link records between two tables, but will not deduplicate the records within either table.
The following two input tables are interactive! Edit the values and all the calculations in the rest of this article will update.
In the first step of the calculation we compare each record in the first table with all records in the second table.
We wish to predict which of these pairwise record comparisons represent the same entity (i.e. which records match).
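The pairwise comparison step amounts to a cross join of the two tables. Here is a minimal sketch in Python. The table contents, the column names and the `pairwise_comparisons` helper are all hypothetical and chosen only for illustration:

```python
from itertools import product

# Hypothetical input tables; the columns and values are illustrative only.
table_l = [
    {"id": 1, "first_name": "john", "surname": "smith"},
    {"id": 2, "first_name": "mary", "surname": "jones"},
]
table_r = [
    {"id": 1, "first_name": "jon", "surname": "smith"},
    {"id": 2, "first_name": "mary", "surname": "jones"},
]


def pairwise_comparisons(left, right):
    """Pair every record in `left` with every record in `right`."""
    pairs = []
    for rec_l, rec_r in product(left, right):
        row = {}
        # Suffix the column names so we can tell the two sides apart.
        for col, val in rec_l.items():
            row[f"{col}_l"] = val
        for col, val in rec_r.items():
            row[f"{col}_r"] = val
        pairs.append(row)
    return pairs


comparisons = pairwise_comparisons(table_l, table_r)
print(len(comparisons))  # one row per (left, right) pair
```

With two records in each table this produces 2 × 2 = 4 pairwise comparisons. Because we only compare across tables, records are never compared with others from the same table.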
At first glance, it may seem challenging to turn this table of text into a numeric prediction. But this can be achieved using the concepts of scenarios and partial match weights introduced earlier in this tutorial.
Specifically, we define scenarios for each column and then use a lookup table to find the partial match weights.
For example, we may choose to define our scenarios as follows:
Note the 'comparison vector value' column in this table. This is an integer value that identifies the scenario within the column. We will use it to help us look up the correct partial match weights.
We then proceed by computing which scenario is activated for each pairwise record comparison, and representing this using the comparison vector value. We use the gamma symbol (γ) to refer to the comparison vector value.
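As a rough illustration of this step, the sketch below assigns a comparison vector value to each column of a single pairwise comparison. The scenarios used here (exact match, same first letter, everything else) are assumptions made for the example and are not necessarily the ones defined in this article:

```python
def gamma_first_name(value_l, value_r):
    """Hypothetical scenarios for first name: 2 = exact match,
    1 = same first letter, 0 = anything else."""
    if value_l == value_r:
        return 2
    if value_l[:1] == value_r[:1]:
        return 1
    return 0


def gamma_surname(value_l, value_r):
    """Hypothetical scenarios for surname: 1 = exact match, 0 = anything else."""
    return 1 if value_l == value_r else 0


def comparison_vector(pair):
    """Replace the text values of one pairwise comparison with gamma values."""
    return {
        "first_name": gamma_first_name(pair["first_name_l"], pair["first_name_r"]),
        "surname": gamma_surname(pair["surname_l"], pair["surname_r"]),
    }


pair = {
    "first_name_l": "john",
    "first_name_r": "jon",
    "surname_l": "smith",
    "surname_r": "smith",
}
vector = comparison_vector(pair)
print(vector)
```

Here the first names differ but share a first letter, so γ is 1 for that column, while the surnames match exactly, so γ is 1 for the surname column under its own two-level scheme. The sketch assumes string inputs; a real implementation would also need a scenario for missing values.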
Now that we've used the text values to compute the comparison vector values, we no longer need the text itself. What remains are known as the comparison vectors for each pairwise comparison:
Each row in this table is known as the 'comparison vector' for the record comparison.
We now look up the partial match weights from the comparison vector. I've also added the prior, expressed as a partial match weight, and the sum of the match weights:
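The lookup and the sum can be sketched as follows. The partial match weights and the prior below are made-up numbers chosen to keep the arithmetic easy to follow, and `final_match_weight` is a hypothetical helper name:

```python
# Hypothetical lookup table: column -> comparison vector value -> partial match weight.
PARTIAL_MATCH_WEIGHTS = {
    "first_name": {0: -3.0, 1: 1.5, 2: 4.0},
    "surname": {0: -2.5, 1: 5.0},
}

# Hypothetical prior, expressed as a partial match weight.
PRIOR_MATCH_WEIGHT = -6.0


def final_match_weight(vector, weights=PARTIAL_MATCH_WEIGHTS, prior=PRIOR_MATCH_WEIGHT):
    """Start from the prior and add the partial match weight for each column."""
    total = prior
    for column, gamma in vector.items():
        total += weights[column][gamma]
    return total


vector = {"first_name": 1, "surname": 1}
match_weight = final_match_weight(vector)
print(match_weight)
```

For this comparison vector the sum is -6.0 + 1.5 + 5.0 = 0.5. Keeping the weights in a nested dictionary mirrors the lookup-table idea: the column name and the γ value together identify exactly one partial match weight.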
Finally, we sum the partial match weights, including the prior, to produce a final match weight. This can then be converted into a probability using:

$$\text{probability} = \frac{2^{\text{match weight}}}{1 + 2^{\text{match weight}}}$$
We can now see how the original comparison pairs were scored by this simple model:
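The final conversion from match weight to probability can be sketched as below. It assumes that match weights are base-2 logarithms of Bayes factors, so that a match weight of 0 corresponds to a probability of 0.5. The function name `match_weight_to_probability` is hypothetical:

```python
def match_weight_to_probability(match_weight):
    """Convert a final match weight (assumed to be a log2 Bayes factor that
    already includes the prior) into a match probability."""
    bayes_factor = 2 ** match_weight
    return bayes_factor / (1 + bayes_factor)


# Score a few illustrative final match weights.
for weight in (-5.0, 0.0, 5.0):
    print(weight, match_weight_to_probability(weight))
```

Negative match weights map to probabilities below 0.5 and positive match weights to probabilities above 0.5. This simple form is fine for match weights of moderate size; very large values would need a numerically safer formulation.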