POPULATION SIZE ESTIMATION AND LINKAGE ERRORS: THE MULTIPLE LISTS CASE
Loredana Di Consiglio (Eurostat)
Tiziana Tuoto (Istat)
Capture recapture model
              L1 IN              L1 OUT
          L3 IN   L3 OUT     L3 IN   L3 OUT
L2 IN     n111    n110       n011    n010
L2 OUT    n101    n100       n001    n000
(n_ijk: the indices i, j, k indicate presence (1) or absence (0) in L1, L2 and L3.)
The method compares two (or more) independent counts ("captures") of the same units on multiple occasions.
Strong assumptions:
• independence of the lists in the two-list case (no highest-order interaction with multiple lists);
• homogeneity of capture probabilities;
• error-free linkage at record level;
• population closure.
Capture recapture model
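\[
\log E(n_{ijk}) = \lambda + \lambda_i^{L_1} + \lambda_j^{L_2} + \lambda_k^{L_3} + \lambda_{ij}^{L_1 L_2} + \lambda_{ik}^{L_1 L_3} + \lambda_{jk}^{L_2 L_3}
\]
\[
\tilde{N} = n + \tilde{n}_{000}, \qquad
\tilde{n}_{000} = \frac{n_{111}\, n_{100}\, n_{010}\, n_{001}}{n_{110}\, n_{101}\, n_{011}}
\]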
Saturated model for the three-way case
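As a minimal sketch of the three-list estimator, the function below computes the estimated unobserved cell as n111·n100·n010·n001 / (n110·n101·n011) and adds it to the observed total. The function name, the dictionary layout and the counts are illustrative assumptions, not taken from the slides.

```python
def saturated_estimate(counts):
    """Estimate the missing cell n000 and the population size N.

    counts: dict mapping the seven observed patterns 'ijk' (L1, L2, L3)
    to their counts. Hypothetical helper, for illustration only.
    """
    numerator = counts["111"] * counts["100"] * counts["010"] * counts["001"]
    denominator = counts["110"] * counts["101"] * counts["011"]
    if denominator == 0:
        raise ValueError("estimator undefined when n110, n101 or n011 is zero")
    n000 = numerator / denominator
    n_observed = sum(counts.values())
    return n000, n_observed + n000


# Made-up counts, chosen only to make the arithmetic easy to follow.
example = {"111": 60, "110": 40, "101": 30, "100": 20,
           "011": 30, "010": 20, "001": 15}
n000_hat, N_hat = saturated_estimate(example)
print(n000_hat, N_hat)
```

With these made-up counts the numerator is 360000 and the denominator is 36000, so the estimated missing cell is 10 and the estimated population size is 215 + 10 = 225.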
Probabilistic record linkage (Fellegi Sunter)
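The linkage process between L1 and L2 can be viewed as a classification of the pairs (a, b) into M (matched pairs) and U (unmatched pairs) on the basis of the ratio
\[
r = \frac{m(\gamma)}{u(\gamma)} = \frac{P(\gamma \mid (a,b) \in M)}{P(\gamma \mid (a,b) \in U)}
\]
where γ is the comparison vector observed for the pair (a, b).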
The probabilities m and u can be estimated by treating the true link status as a latent variable and using, for instance, the EM algorithm.
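The latent-variable estimation of m and u can be sketched as follows. This is a minimal illustration, not the implementation used by the authors. It assumes binary agree/disagree comparisons that are conditionally independent given the link status, and it runs on synthetic data. The function name `fs_em`, the starting values and all simulation settings are assumptions.

```python
import numpy as np


def fs_em(gamma, p=0.1, n_iter=200):
    """EM for a two-class latent model on binary comparison vectors.

    gamma: array (pairs x fields), 1 = agreement, 0 = disagreement.
    Assumes conditional independence of the fields given link status.
    Returns per-field m and u probabilities and the match proportion p.
    """
    gamma = np.asarray(gamma, dtype=float)
    n_fields = gamma.shape[1]
    m = np.full(n_fields, 0.9)  # starting values (assumption)
    u = np.full(n_fields, 0.1)
    eps = 1e-6
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        like_m = p * np.prod(m**gamma * (1 - m)**(1 - gamma), axis=1)
        like_u = (1 - p) * np.prod(u**gamma * (1 - u)**(1 - gamma), axis=1)
        g = like_m / (like_m + like_u)
        # M-step: weighted agreement frequencies.
        p = g.mean()
        m = np.clip(g @ gamma / g.sum(), eps, 1 - eps)
        u = np.clip((1 - g) @ gamma / (1 - g).sum(), eps, 1 - eps)
    return m, u, p


# Synthetic pairs: 10% true matches, 5 comparison fields.
rng = np.random.default_rng(0)
is_match = rng.random(5000) < 0.1
agree_prob = np.where(is_match[:, None], 0.9, 0.1)
gamma = (rng.random((5000, 5)) < agree_prob).astype(int)
m_hat, u_hat, p_hat = fs_em(gamma)
print(m_hat.round(2), u_hat.round(2), round(p_hat, 3))
```

The estimated m and u feed directly into the ratio r for each comparison vector. In real applications the conditional independence assumption may not hold and the fit should be checked.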
Probabilistic record linkage (Fellegi Sunter)
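The thresholds T_m and T_u are chosen to minimize the false link probability β and the false non-link probability 1-α:
\[
\beta = \sum_{\gamma} u(\gamma)\, P(M^* \mid \gamma), \qquad
1 - \alpha = \sum_{\gamma} m(\gamma)\, P(U^* \mid \gamma)
\]
\[
U^* : T_u \ge m(\gamma)/u(\gamma), \qquad
M^* : T_m \le m(\gamma)/u(\gamma)
\]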
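The decision rule can be illustrated with a short sketch: compute the ratio r for a comparison vector and compare it with the two thresholds. The conditional independence used to build r, the function names, and the numeric values of m, u, T_m and T_u are all assumptions for illustration. Pairs falling between the thresholds are left undecided, as in the standard Fellegi-Sunter rule.

```python
def likelihood_ratio(gamma, m, u):
    """r = m(gamma) / u(gamma), assuming conditionally independent fields."""
    r = 1.0
    for agree, m_k, u_k in zip(gamma, m, u):
        r *= (m_k / u_k) if agree else ((1 - m_k) / (1 - u_k))
    return r


def classify(r, t_m, t_u):
    """Assign a pair to M*, U* or the undecided region between thresholds."""
    if t_u > t_m:
        raise ValueError("expected t_u <= t_m")
    if r >= t_m:
        return "link"
    if r <= t_u:
        return "non-link"
    return "possible link"


# Illustrative values only.
m = [0.9, 0.9]
u = [0.1, 0.1]
for gamma in ([1, 1], [1, 0], [0, 0]):
    r = likelihood_ratio(gamma, m, u)
    print(gamma, classify(r, t_m=10.0, t_u=0.1))
```

With these values, full agreement gives r close to 81 (link), full disagreement gives r close to 1/81 (non-link), and one agreement with one disagreement gives r close to 1 (possible link).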
Capture recapture in presence of linkage errors
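              L1 IN              L1 OUT
          L3 IN   L3 OUT     L3 IN   L3 OUT
L2 IN     n*111   n*110      n*011   n*010
L2 OUT    n*101   n*100      n*001   n*000
(n*_ijk: counts observed after linkage; i, j, k indicate presence (1) or absence (0) in L1, L2 and L3.)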
Fienberg and Ding propose an adjustment of the estimators that corrects for missed true matches.
i. No erroneous matches: only missed true links are taken into account.
ii. A transition can go downwards by at most one level. For example, a unit with true status (111) can be observed as (111), or split into (110) and (001), into (101) and (010), or into (011) and (100), BUT NOT into (100), (010) and (001).
Capture recapture in presence of linkage errors
iii. The probability of staying in the original state (no missed link) is equal to α, and the probability of moving to any one of the other possible states is equal to (1-α)/(m-1), where m is the number of all states to which transitions are possible and allowed.
Then the observed counts satisfy n* = Mn, and the standard formulas can be applied by taking into account the "known" matrix M (a function of the missed-match error).
Capture-recapture in presence of linkage errors
• Extension of the Fienberg and Ding adjustment.
• The transition from n to n* is a function of both the probability of missing a true link and the probability of a false link.
• Assuming that linkage is a two-step process, with a first linkage of lists 1 and 2 followed by a linkage with list 3, we allow for different linkage errors in the two steps. The transition matrix takes this process into account.
Capture-recapture in presence of linkage errors
Observed \ True   111              110              101            100            011            010   001
111*              α1α2             α1β2             α2β1           β1β2           α2β1           -     -
110*              α1(1−α2)         α1(1−β2)         β1(1−α2)       β1(1−β2)       β1(1−α2)       -     -
101*              (1−α1)α2/2       (1−α1)β2/2       (1−β1)α2       (1−β1)β2       -              -     -
100*              (1−α1)(2−α2)/2   (1−α1)(1−β2/2)   (1−β1)(1−α2)   (1−β1)(1−β2)   −β1            -     -
011*              (1−α1)α2/2       (1−α1)β2/2       -                             (1−β1)α2       -     -
010*              (1−α1)(2−α2)/2   (1−α1)(1−β2/2)   −β1            −β1            (1−β1)(1−α2)   1     -
001*              1−α2             −β2              1−α2           −β2            1−α2           -     1
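The relation n* = Mn can be made concrete with a short sketch that builds a 7×7 matrix from α1, α2, β1, β2 and recovers the true counts by solving the linear system. The cell values follow one reading of the transition table above and should be checked against the original slides. Cells shown as "-" are set to 0, and the cell in row 011*, column 100 is not legible in the source, so it is also set to 0 as an assumption. The function name, error rates and counts are illustrative.

```python
import numpy as np

STATES = ["111", "110", "101", "100", "011", "010", "001"]


def transition_matrix(a1, a2, b1, b2):
    """Rows: observed states (starred); columns: true states, in STATES order.

    One reading of the slide's table; '-' and illegible cells are set to 0.
    """
    return np.array([
        [a1 * a2, a1 * b2, a2 * b1, b1 * b2, a2 * b1, 0, 0],
        [a1 * (1 - a2), a1 * (1 - b2), b1 * (1 - a2), b1 * (1 - b2),
         b1 * (1 - a2), 0, 0],
        [(1 - a1) * a2 / 2, (1 - a1) * b2 / 2, (1 - b1) * a2, (1 - b1) * b2,
         0, 0, 0],
        [(1 - a1) * (2 - a2) / 2, (1 - a1) * (1 - b2 / 2),
         (1 - b1) * (1 - a2), (1 - b1) * (1 - b2), -b1, 0, 0],
        [(1 - a1) * a2 / 2, (1 - a1) * b2 / 2, 0, 0, (1 - b1) * a2, 0, 0],
        [(1 - a1) * (2 - a2) / 2, (1 - a1) * (1 - b2 / 2), -b1, -b1,
         (1 - b1) * (1 - a2), 1, 0],
        [1 - a2, -b2, 1 - a2, -b2, 1 - a2, 0, 1],
    ], dtype=float)


# Illustrative error rates and made-up true counts.
M = transition_matrix(a1=0.97, a2=0.96, b1=0.02, b2=0.03)
n_true = np.array([60, 40, 30, 20, 30, 20, 15], dtype=float)
n_star = M @ n_true                      # counts expected after linkage errors
n_adjusted = np.linalg.solve(M, n_star)  # invert the transition
print(dict(zip(STATES, n_adjusted.round(6))))
```

With error-free linkage (α1 = α2 = 1, β1 = β2 = 0) this matrix reduces to the identity, so the adjustment leaves the observed counts unchanged. The adjusted counts can then be passed to the usual estimator.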
Simulation: the data
Fictitious data from the ESSnet DI: 26000 records (McLeod et al. 2011).
• List1: Patient Register Data, coverage rate t1=0.65;
• List2: CIS, data from the tax and benefit systems, coverage rate t2=0.53;
• List3: Population census, coverage rate t3=0.57.
The three lists mimic the registers' undercoverage and the presence of errors in the common identifiers.
100 replicated settings of 1000 units each, independently and randomly selected from the lists generated according to the coverage probabilities.
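A simplified sketch of the replication design is given below. It does not reproduce the ESSnet DI data or the identifier errors. It only draws independent capture indicators for 1000 units per replicate using the three coverage rates, and tabulates the observed patterns. The function name and the random seed are assumptions.

```python
import numpy as np
from collections import Counter


def simulate_counts(n_units, coverage, rng):
    """Draw independent capture indicators and tabulate observed patterns.

    Units captured by no list (pattern '000') are unobserved and dropped.
    """
    captured = rng.random((n_units, len(coverage))) < np.asarray(coverage)
    observed = captured[captured.any(axis=1)]
    return Counter("".join(str(int(c)) for c in row) for row in observed)


rng = np.random.default_rng(2011)
coverage = (0.65, 0.53, 0.57)  # List1, List2, List3
replicates = [simulate_counts(1000, coverage, rng) for _ in range(100)]
print(sorted(replicates[0].items()))
```

Each element of `replicates` is a table of the seven observable cells n_ijk, which is the input required by a three-list estimator.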
Simulation: the linkage
Linkage step 1: List1 and List2
Linkage step 2: Results of first linkage and List3
Linkage variables in both steps: Name, Surname, Day, Month and Year of Birth.
Linkage errors (%), step 1
         Min    Median   Mean   Max
1−α1     0.00   2.50     2.48   4.90
β1       0.20   4.52     4.10   7.63
Linkage errors (%), step 2
         Min    Median   Mean   Max
1−α2     0.92   2.92     2.94   5.40
β2       0.22   4.28     4.07   7.89
Simulations: Results
Concluding remarks
• The adjustment reduces the bias of the naïve estimator without a relevant effect on the variance; however, the bias is not cancelled out completely because of the nonlinear nature of the Petersen estimator.
• Evaluation of linkage errors: this is still an open issue. One solution is the model-derived evaluation from the Fellegi-Sunter (FS) model. Other proposals, external to the model, are based on a training set to assess linkage quality. In any case, automatic probabilistic methods are necessary, particularly for detecting missed-link errors.
• Generalization to multiple lists requires a non-trivial evaluation of M and knowledge of the multi-step linkage mechanism.