Originally posted: 2021-11-05.
In part 1 of this post, I showed that more complex probabilistic linkage models can be dramatically more accurate than simple models. I used supervised learning to compute the model parameters directly from ground truth data to ensure a fair comparison. This meant that differences in performance could be attributed solely to the different model formulations, rather than the process of estimation.
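For concreteness, here is a sketch of what that supervised estimation looks like, assuming a labelled table of record comparisons with an is_match flag and a boolean agreement column per field (all names are illustrative):

```python
import pandas as pd

def estimate_m_u(df_labelled: pd.DataFrame, agreement_col: str):
    """Estimate m = P(agreement | match) and u = P(agreement | non-match)
    for one field, directly from ground-truth labels."""
    matches = df_labelled["is_match"] == 1
    m = df_labelled.loc[matches, agreement_col].mean()
    u = df_labelled.loc[~matches, agreement_col].mean()
    return m, u
```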
I also used the labelled data to ensure that match scores were computed for all true matches, in addition to being computed for non-matching comparisons that met a list of blocking rules. This meant that the results were not strongly dependent on the blocking approach.
In real applications of record linkage, we typically do not have ground truth data. This post explores whether the findings change if the same models are trained using unsupervised learning. Specifically, model parameters will be estimated using Splink.
I find that the gains in accuracy from the more complex models are similar to those found in Part 1, and seem little affected by the differences between the estimated parameters and their true values.
I also find that convergence of the Expectation Maximisation algorithm is significantly faster for more complex models. This is not surprising: a more complex model more accurately describes the data, leading to more precise predictions on each iteration of the algorithm.
I also find evidence that the more complex models have their greatest advantage when predicting the match status of the hardest-to-find matches. This may imply that, in models where blocking rules are very tight, the more complex models have less of a comparative advantage. That would make intuitive sense, since tight blocking rules pick out record comparisons that are easier to classify, many of which are likely to be matched correctly even by a very simple model.
First, I present ROC curves on the same basis as in Part 1, but using estimated parameters rather than their true values. For this chart, the ground truth data has been used to ensure that all true matches are included in the ROC curve. The blocking rules for this ROC curve are here.
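Curves like these can be computed directly from the predicted match probabilities and the ground-truth labels. Here is a minimal sketch using scikit-learn; the data is purely illustrative:

```python
import pandas as pd
from sklearn.metrics import roc_curve

# Illustrative scored comparisons: a ground-truth label and the model's
# predicted match probability for each record comparison
df_scored = pd.DataFrame({
    "is_match":          [1, 1, 0, 1, 0, 0],
    "match_probability": [0.99, 0.62, 0.40, 0.95, 0.05, 0.01],
})

# False positive rate and true positive rate at every score threshold
fpr, tpr, _ = roc_curve(df_scored["is_match"], df_scored["match_probability"])
```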
Second, I present a ROC curve where record comparisons are generated from a list of blocking rules. This is a better approximation of a real-life record linkage scenario, in which we have to use blocking rules to try to recover all positive matches.
This second ROC curve achieves lower recall (true positive rate) because the blocking rules fail to detect a fairly large number of true matches. These missed pairs are therefore scored as certain non-matches.
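When ground truth is available, the recall ceiling imposed by blocking can be measured directly as the share of true matching pairs that survive at least one blocking rule. A minimal sketch, assuming hypothetical DataFrames of id pairs:

```python
import pandas as pd

def blocking_recall(true_pairs: pd.DataFrame, blocked_pairs: pd.DataFrame) -> float:
    """Share of ground-truth matching pairs generated by the blocking rules.

    Both inputs are DataFrames with unique_id_l and unique_id_r columns
    (hypothetical names): all true matching pairs, and all blocked comparisons.
    """
    captured = true_pairs.merge(
        blocked_pairs.drop_duplicates(), on=["unique_id_l", "unique_id_r"]
    )
    return len(captured) / len(true_pairs)
```

Any true match missed by every rule is scored as a certain non-match, so this ratio is an upper bound on the recall achievable in the second ROC curve.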
Model training was performed using Splink. Specifically, I estimated u values directly, and then used the expectation maximisation algorithm to estimate m values (e.g. here) and the proportion of matches parameter.
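To make the training recipe concrete, here is a minimal sketch. Note that the API shown is Splink's later DuckDB-backed interface (Splink 3), which postdates this post, and the comparisons and blocking rules are simplified illustrations rather than the exact specifications used:

```python
import pandas as pd
from splink.duckdb.linker import DuckDBLinker
from splink.duckdb.duckdb_comparison_library import exact_match

df = pd.read_csv("persons.csv")  # hypothetical input data

settings = {
    "link_type": "dedupe_only",
    # the real models used the comparison specifications from Part 1
    "comparisons": [
        exact_match("first_name"),
        exact_match("surname"),
        exact_match("dob"),
        exact_match("postcode"),
    ],
    "blocking_rules_to_generate_predictions": [
        "l.dob = r.dob",
        "l.postcode = r.postcode",
    ],
}

linker = DuckDBLinker(df, settings)

# u values are estimated directly: random pairings of records are
# overwhelmingly non-matches, so their agreement rates approximate u
linker.estimate_u_using_random_sampling(1e6)

# m values and the proportion-of-matches parameter are estimated by EM;
# each session blocks on fields whose parameters are not being estimated
linker.estimate_parameters_using_expectation_maximisation(
    "l.dob = r.dob and l.postcode = r.postcode"
)
linker.estimate_parameters_using_expectation_maximisation(
    "l.first_name = r.first_name and l.surname = r.surname"
)

predictions = linker.predict()
```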
Exactly the same model specifications were used as in Part 1 of this post.
Reproducibility materials can be found here.
There were two other interesting findings.
First, the more complex models trained in fewer iterations. That is, convergence of the Expectation Maximisation algorithm is significantly faster for the more complex models.
Second, the m values estimated by the models were different from their true values by a fairly large margin, and this pattern was consistent across all models.
What might explain this?
The m values are computed by running the expectation maximisation algorithm on a subset of blocked record comparisons. If the blocking rules select a biased sample of comparisons, then the estimated m values will also be biased.
There is good reason to believe this may be the case. For example, to estimate the parameters on first name and surname, I block on date of birth and postcode. But records where date of birth and postcode match are likely to be higher quality overall, and so are less likely to contain an error in first name or surname than an average matching record. This would be expected to bias the estimated m values upwards.
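A small worked example shows the consequence, using the standard Fellegi-Sunter match weight log2(m/u); the numbers are purely illustrative:

```python
import math

u = 0.004        # P(first name agrees | non-match), illustrative
m_true = 0.90    # P(first name agrees | match), across all matches
m_biased = 0.97  # same probability, measured only on comparisons blocked
                 # on dob and postcode (higher-quality records)

for label, m in [("true", m_true), ("biased", m_biased)]:
    agree = math.log2(m / u)                 # weight when first name agrees
    disagree = math.log2((1 - m) / (1 - u))  # weight when it disagrees
    print(f"{label:>6}: agree {agree:+.2f}, disagree {disagree:+.2f}")
```

With these numbers, the overstated m barely changes the reward for agreement (about +7.8 vs +7.9) but substantially increases the penalty for disagreement (about -3.3 vs -5.1), so matching records with an error in the name are scored too low.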
The following charts show a comparison of the estimated parameters against their true values for all five models.