This visualisation aims to clarify the intuition behind the following formula from the Fellegi Sunter model, which was presented in the article on the maths of Fellegi Sunter in the main tutorial.
$$\text{posterior match probability} = \frac{\lambda m}{\lambda m + (1-\lambda) u}$$
where
$$\lambda = \text{prior match probability}$$
and the $m$ and $u$ probabilities are defined in relation to a scenario:
$$m = \Pr(\text{scenario} \mid \text{records match})$$
$$u = \Pr(\text{scenario} \mid \text{records do not match})$$
This article also explains why the formula is just a restatement of Bayes' Theorem.
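As a minimal sketch, the formula can be written as a small Python function. The function name `posterior_match_probability` and its argument names are illustrative assumptions, not part of any library API.

```python
def posterior_match_probability(prior: float, m: float, u: float) -> float:
    """Posterior match probability given the prior (lambda), m and u.

    Hypothetical helper for illustration only:
        posterior = (prior * m) / (prior * m + (1 - prior) * u)
    """
    for name, value in (("prior", prior), ("m", m), ("u", u)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be a probability in [0, 1], got {value}")

    # Share of all comparisons that are matches AND satisfy the scenario.
    matches_with_scenario = prior * m
    # Share of all comparisons that are non-matches AND satisfy the scenario.
    non_matches_with_scenario = (1.0 - prior) * u

    evidence = matches_with_scenario + non_matches_with_scenario
    if evidence == 0.0:
        raise ValueError("The scenario has zero probability, so the posterior is undefined")

    return matches_with_scenario / evidence
```

Note that when `m` equals `u` the scenario is equally likely among matches and non-matches, so the function returns the prior unchanged: observing the scenario tells us nothing.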
Recall that the prior is the probability that two random records match, which is one of the parameters to be estimated.
We can visualise this parameter by showing how it divides the set of all pairwise record comparisons:
Using the definition of the $m$ and $u$ probabilities above, we can further subdivide this space as follows:
Given our observation that the scenario holds, we can discard the areas in white as no longer applicable given this new information.
Turning this back into formulas we can write:
Substituting in numbers we have:
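The area argument can also be sketched in code by counting comparisons. All numbers below (one million comparisons, a prior of 0.01, `m` of 0.9 and `u` of 0.02) are assumptions chosen purely for illustration, and the variable names are hypothetical.

```python
# Hypothetical inputs, chosen only for illustration.
total_comparisons = 1_000_000
prior = 0.01  # lambda: proportion of comparisons that are matches
m = 0.9       # Pr(scenario | records match)
u = 0.02      # Pr(scenario | records do not match)

# Step 1: the prior divides all comparisons into matches and non-matches.
matches = total_comparisons * prior
non_matches = total_comparisons - matches

# Step 2: m and u subdivide each group by whether the scenario holds.
matches_with_scenario = matches * m
non_matches_with_scenario = non_matches * u

# Step 3: having observed the scenario, discard every comparison where it
# does not hold, and ask what share of the remainder are matches.
remaining = matches_with_scenario + non_matches_with_scenario
posterior = matches_with_scenario / remaining

print(f"posterior match probability: {posterior:.4f}")
```

With these assumed inputs roughly 9,000 matching and 19,800 non-matching comparisons remain, so the posterior is about 0.31 even though `m` is high: non-matches are so much more common that they still make up most of the comparisons in which the scenario holds.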
In a more general sense, these visualisations explain the intuition behind Bayes' Theorem.
Recall that Bayes' Theorem is:
$$\text{posterior probability} = \frac{\text{likelihood} \times \text{prior probability}}{\text{evidence}}$$
or:
$$\Pr(a \mid b) = \frac{\Pr(b \mid a)\Pr(a)}{\Pr(b)}$$
In the context of record linkage, we can describe these parts as:
Prior: The overall proportion of comparisons which are matches, $\Pr(\text{match})$
Evidence: We have observed that a scenario holds, $\Pr(\text{scenario})$
Likelihood: The probability that the scenario holds amongst matches, given by $\Pr(\text{scenario} \mid \text{records match})$
So Bayes' Theorem is:
$$\Pr(\text{match} \mid \text{scenario}) = \frac{\Pr(\text{scenario} \mid \text{match})\Pr(\text{match})}{\Pr(\text{scenario})}$$
but:
$$\Pr(\text{scenario}) = \lambda m + (1-\lambda) u$$
So Bayes' Theorem is just our original formula:
$$\text{posterior match probability} = \frac{\lambda m}{\lambda m + (1-\lambda) u}$$
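The expression for Pr(scenario) used in this argument is an application of the law of total probability: every comparison is either a match or a non-match, so the scenario's overall probability is the sum of its probability within each group, weighted by the size of that group. As a sketch of that step, using the same symbols as above:

$$
\begin{aligned}
\Pr(\text{scenario}) &= \Pr(\text{scenario} \mid \text{match})\Pr(\text{match}) + \Pr(\text{scenario} \mid \text{non-match})\Pr(\text{non-match}) \\
&= m\lambda + u(1-\lambda)
\end{aligned}
$$

The two terms correspond to the two regions that remain in the visualisation once the comparisons where the scenario does not hold have been discarded.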
See also this great video!