POPULATION SIZE ESTIMATION AND LINKAGE ERRORS: THE MULTIPLE LISTS CASE Loredana Di Consiglio (Eurostat) Tiziana Tuoto (Istat)

Post on 25-Jun-2020




Capture recapture model

                 L1 IN               L1 OUT
            L2 IN    L2 OUT     L2 IN    L2 OUT
L3 IN       n111     n110       n011     n010
L3 OUT      n101     n100       n001     n000

The method compares two (or more) independent counts ("captures") of the same units on multiple occasions.

Strong assumptions:
• independence in the two-list case (no highest-order interaction with multiple lists);
• homogeneity of capture probabilities;
• error-free linkage at record level;
• population closure.
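As an illustration of the two-list case, the sketch below computes the Petersen estimator named in the concluding remarks, which divides the product of the two list sizes by the number of units found on both lists. The function name and the counts are hypothetical and serve only as an example; the estimate is meaningful only when the four assumptions above hold.

```python
def petersen_estimate(n1: int, n2: int, matched: int) -> float:
    """Two-list (Petersen) estimate of the population size.

    n1, n2  : number of units on list 1 and on list 2
    matched : number of units found on both lists
    """
    if matched <= 0:
        raise ValueError("at least one matched unit is needed")
    return n1 * n2 / matched


# Hypothetical counts, for illustration only.
estimate = petersen_estimate(n1=650, n2=530, matched=345)
print(round(estimate, 1))
```

A missed link lowers `matched` and therefore inflates the estimate, which is why the error-free linkage assumption matters.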


Capture recapture model

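Saturated model for the three-way case:

\[
\log E(n_{ijk}) = \lambda + \lambda_i^{L_1} + \lambda_j^{L_2} + \lambda_k^{L_3} + \lambda_{ij}^{L_1 L_2} + \lambda_{ik}^{L_1 L_3} + \lambda_{jk}^{L_2 L_3}
\]

\[
\tilde{N} = n + \tilde{n}_{000}, \qquad
\tilde{n}_{000} = \frac{n_{111}\, n_{100}\, n_{010}\, n_{001}}{n_{110}\, n_{101}\, n_{011}}
\]

where n is the number of units observed in at least one list.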
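The three-list estimate under this model can be sketched in a few lines: the unobserved cell is estimated as the product of the counts with an odd number of captures divided by the product of the counts with exactly two captures, and is then added to the observed total. The function name and the dictionary layout (keys such as "110") are assumptions made for this example.

```python
def three_list_estimate(n: dict) -> tuple:
    """Return (n000_hat, N_hat) from the seven observed cell counts.

    n maps capture patterns ("111", "110", ..., "001") to counts.
    Uses n000 = n111*n100*n010*n001 / (n110*n101*n011), the estimate
    under the model with no three-way interaction.
    """
    numerator = n["111"] * n["100"] * n["010"] * n["001"]
    denominator = n["110"] * n["101"] * n["011"]
    if denominator == 0:
        raise ValueError("two-list cells must be non-empty")
    n000_hat = numerator / denominator
    return n000_hat, sum(n.values()) + n000_hat


# Hypothetical counts: 1000 units, each list capturing half of them
# independently, so every cell (including the missing one) holds 125.
cells = {p: 125 for p in ("111", "110", "101", "011", "100", "010", "001")}
n000_hat, n_hat = three_list_estimate(cells)
print(n000_hat, n_hat)
```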


Probabilistic record linkage (Fellegi Sunter)

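The linkage between L1 and L2 can be viewed as the classification of record pairs (a, b) into M (matched pairs) and U (unmatched pairs), on the basis of the likelihood ratio of the comparison vector γ:

\[
r = \frac{P(\gamma \mid (a,b) \in M)}{P(\gamma \mid (a,b) \in U)} = \frac{m(\gamma)}{u(\gamma)}
\]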

m and u can be estimated by treating the true link status as a latent variable, using, for instance, the EM algorithm.
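A minimal sketch of such an EM estimation follows. It assumes binary agree/disagree comparisons that are conditionally independent given the link status; this assumption, the function name `fs_em`, the starting values and the simulated pairs are all choices made for the example and are not taken from the slides.

```python
import random


def _clip(x, eps=1e-6):
    """Keep a probability strictly inside (0, 1)."""
    return min(max(x, eps), 1.0 - eps)


def fs_em(gammas, n_iter=50, p=0.1, m0=0.8, u0=0.2):
    """EM for a two-class mixture of independent binary comparisons.

    gammas : list of 0/1 tuples, one agreement pattern per record pair
    Returns (p, m, u): the share of matched pairs and, per field, the
    agreement probabilities among matched (m) and unmatched (u) pairs.
    """
    k = len(gammas[0])
    m, u = [m0] * k, [u0] * k
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a true match.
        post = []
        for g in gammas:
            lm, lu = p, 1.0 - p
            for j, agree in enumerate(g):
                lm *= m[j] if agree else 1.0 - m[j]
                lu *= u[j] if agree else 1.0 - u[j]
            post.append(lm / (lm + lu))
        # M-step: posterior-weighted proportions update p, m and u.
        total = sum(post)
        rest = len(gammas) - total
        p = _clip(total / len(gammas))
        m = [_clip(sum(w * g[j] for w, g in zip(post, gammas)) / total)
             for j in range(k)]
        u = [_clip(sum((1 - w) * g[j] for w, g in zip(post, gammas)) / rest)
             for j in range(k)]
    return p, m, u


# Simulated pairs (illustration only): 10% true matches, five fields,
# agreement probability 0.9 for matches and 0.1 for non-matches.
rng = random.Random(0)
pairs = []
for _ in range(2000):
    prob = 0.9 if rng.random() < 0.1 else 0.1
    pairs.append(tuple(int(rng.random() < prob) for _ in range(5)))

p_hat, m_hat, u_hat = fs_em(pairs)
print(round(p_hat, 3), [round(x, 2) for x in m_hat], [round(x, 2) for x in u_hat])
```

The latent class labelled "matched" is fixed only by the starting values (m0 above u0), so in practice the fitted classes should be checked against a clerical sample.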


Probabilistic record linkage (Fellegi Sunter)

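The thresholds are chosen to minimize the false-link probability β and the false-non-link probability 1−α:

\[
\beta = \sum_{\gamma \in \Gamma} u(\gamma)\, P(M \mid \gamma) = \sum_{\gamma \in \Gamma_M} u(\gamma)
\]

\[
1 - \alpha = \sum_{\gamma \in \Gamma} m(\gamma)\, P(U \mid \gamma) = \sum_{\gamma \in \Gamma_U} m(\gamma)
\]

\[
\Gamma_M = \{\gamma : m(\gamma)/u(\gamma) \ge T_m\}, \qquad
\Gamma_U = \{\gamma : m(\gamma)/u(\gamma) \le T_u\}
\]

where T_m and T_u are the upper and lower thresholds on the ratio.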
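The threshold rule can be sketched as follows. The example works with the logarithm of the ratio m/u and assumes that fields agree or disagree independently given the link status, so the log-ratio is a sum of per-field weights; that assumption, the function names, the "possible link" label for the region between the thresholds and all numeric values are illustrative and do not come from the slides.

```python
import math


def fs_weight(gamma, m, u):
    """Log of m(gamma)/u(gamma) under conditional independence of fields."""
    weight = 0.0
    for agree, mj, uj in zip(gamma, m, u):
        if agree:
            weight += math.log(mj / uj)
        else:
            weight += math.log((1.0 - mj) / (1.0 - uj))
    return weight


def classify(gamma, m, u, t_upper, t_lower):
    """Assign a pair to M, to U, or to the undecided region in between.

    t_upper and t_lower are thresholds on the log-ratio scale.
    """
    weight = fs_weight(gamma, m, u)
    if weight >= t_upper:
        return "link"
    if weight <= t_lower:
        return "non-link"
    return "possible link"


# Hypothetical parameters for three comparison fields.
m = [0.9, 0.9, 0.9]
u = [0.1, 0.1, 0.1]
for pattern in [(1, 1, 1), (1, 0, 0), (0, 0, 0)]:
    print(pattern, classify(pattern, m, u, t_upper=3.0, t_lower=-3.0))
```

Raising `t_upper` lowers the false-link rate at the cost of more undecided or missed pairs, which is the trade-off between β and 1−α described above.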


Capture recapture in presence of linkage errors

                 L1 IN               L1 OUT
            L2 IN    L2 OUT     L2 IN    L2 OUT
L3 IN       n*111    n*110      n*011    n*010
L3 OUT      n*101    n*100      n*001    n*000

Fienberg and Ding propose an adjustment of the estimators that corrects for missed true matches, under the following assumptions.

i. No erroneous matches: only missed true links are taken into account.

ii. A transition can only go downwards, by at most one level. For example, a unit with true status (111) can be observed as (111); as (110) and (001); as (101) and (010); or as (011) and (100); BUT NOT as (100), (010) and (001).


Capture recapture in presence of linkage errors

iii. The probability of staying in the original state (no missed match) is α, and the probability of moving to any one of the other allowed states is (1-α)/(m-1), where m is the number of states to which a transition is possible and allowed, the original state included.

Then the observed counts satisfy n* = Mn, and the standard formulas can be applied once the "known" matrix M (a function of the missed-match error) is taken into account.
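The relation between true and observed counts can be sketched for three lists as follows. The transition pattern coded here is one reading of assumptions i to iii (a (111) unit splits into one of three pairs of records with probability (1-α)/3 each, a two-list unit splits into two single-list records with probability 1-α), so the matrix actually used by the authors may differ. Function names and counts are hypothetical.

```python
def apply_missed_links(n, alpha):
    """Expected observed counts n* = M n when only true links are missed."""
    s = (1.0 - alpha) / 3.0  # 111 moves to each of its three split states
    t = 1.0 - alpha          # a two-list state moves to its single split
    return {
        "111": alpha * n["111"],
        "110": alpha * n["110"] + s * n["111"],
        "101": alpha * n["101"] + s * n["111"],
        "011": alpha * n["011"] + s * n["111"],
        "100": n["100"] + s * n["111"] + t * (n["110"] + n["101"]),
        "010": n["010"] + s * n["111"] + t * (n["110"] + n["011"]),
        "001": n["001"] + s * n["111"] + t * (n["101"] + n["011"]),
    }


def correct_counts(n_star, alpha):
    """Invert n* = M n by substitution (M is triangular in this ordering)."""
    s = (1.0 - alpha) / 3.0
    t = 1.0 - alpha
    n = {"111": n_star["111"] / alpha}
    for cell in ("110", "101", "011"):
        n[cell] = (n_star[cell] - s * n["111"]) / alpha
    n["100"] = n_star["100"] - s * n["111"] - t * (n["110"] + n["101"])
    n["010"] = n_star["010"] - s * n["111"] - t * (n["110"] + n["011"])
    n["001"] = n_star["001"] - s * n["111"] - t * (n["101"] + n["011"])
    return n


# Hypothetical true counts and a 5% probability of missing a link.
true_counts = {"111": 200, "110": 150, "101": 120, "011": 100,
               "100": 90, "010": 80, "001": 70}
observed = apply_missed_links(true_counts, alpha=0.95)
recovered = correct_counts(observed, alpha=0.95)
print(observed)
print(recovered)
```

The corrected counts can then be passed to the usual three-list estimator in place of the observed ones.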


Capture-recapture in presence of linkage errors

• Extension to the Fienberg and Ding adjustment

• The transition from n to n* is a function of the probability of missing a true link and the probability of false links.

• Linkage is assumed to be a two-step process: lists 1 and 2 are linked first, and the result is then linked with list 3. Different linkage errors are allowed in the two steps, and the transition takes this process into account.


Capture-recapture in presence of linkage errors

Transition matrix (rows: observed status, marked *; columns: true status):

        111               110               101             100             011             010   001
111*    α1α2              α1β2              α2β1            β1β2            α2β1            -     -
110*    α1(1−α2)          α1(1−β2)          β1(1−α2)        β1(1−β2)        β1(1−α2)        -     -
101*    (1−α1)α2/2        (1−α1)β2/2        (1−β1)α2        (1−β1)β2        -               -     -
100*    (1−α1)(2−α2)/2    (1−α1)(1−β2/2)    (1−β1)(1−α2)    (1−β1)(1−β2)    1−β1            -     -
011*    (1−α1)α2/2        (1−α1)β2/2        -               -               (1−β1)α2        -     -
010*    (1−α1)(2−α2)/2    (1−α1)(1−β2/2)    1−β1            1−β1            (1−β1)(1−α2)    1     -
001*    1−α2              1−β2              1−α2            1−β2            1−α2            -     1


Simulation: the data

Fictitious data from the ESSnet DI, 26000 records (McLeod et al. 2011).

• List1: Patient Register Data, coverage rate t1=0.65;
• List2: CIS, data from the tax and benefit systems, coverage rate t2=0.53;
• List3: Population census, coverage rate t3=0.57.

The three lists mimic the registers' undercoverage and the presence of errors in the common identifiers.

100 replications of 1000 units each, independently and randomly selected from the lists generated according to the coverage probabilities.
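The replication scheme can be mimicked with a small sketch. It is not the ESSnet DI data generator: it simply draws, for each of 1000 units, an independent capture indicator per list with the three coverage rates quoted above, and it ignores identifier errors altogether. The function name and the seed are arbitrary choices for the example.

```python
import random
from collections import Counter


def simulate_counts(size, coverage, rng):
    """Tabulate capture patterns for `size` units and independent lists.

    coverage : capture probability of each list, in list order
    Returns a Counter keyed by patterns such as "110"; the "000" cell
    is kept here although it would not be observed in practice.
    """
    counts = Counter()
    for _ in range(size):
        pattern = "".join("1" if rng.random() < c else "0" for c in coverage)
        counts[pattern] += 1
    return counts


rng = random.Random(2020)  # arbitrary seed, for reproducibility only
coverage = (0.65, 0.53, 0.57)
replications = [simulate_counts(1000, coverage, rng) for _ in range(100)]

# Average share of units captured by list 1 across the replications.
share_list1 = sum(
    count for rep in replications for pattern, count in rep.items()
    if pattern[0] == "1"
) / (100 * 1000)
print(round(share_list1, 3))
```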


Simulation: the linkage

Linkage step 1: List1 and List2

Linkage step 2: Results of first linkage and List3

Linkage variables in both steps: Name, Surname, Day, Month and Year of Birth.

Linkage errors (%), step 1:

          Min    Median   Mean   Max
1−α1      0.00   2.50     2.48   4.90
β1        0.20   4.52     4.10   7.63

Linkage errors (%), step 2:

          Min    Median   Mean   Max
1−α2      0.92   2.92     2.94   5.40
β2        0.22   4.28     4.07   7.89


Simulations: Results


Concluding remarks

• The adjustment reduces the bias of the naïve estimator without a relevant effect on the variance; however, the bias is not cancelled out completely, owing to the nonlinear nature of the Petersen estimator.

• Evaluation of linkage errors: this is still an open issue. One solution is the model-derived evaluation from the FS model. Other proposals, external to the model, rely on a training set to assess linkage quality. In any case, automatic probabilistic methods are necessary, particularly for detecting missed links.

• Generalization to multiple lists requires a non-trivial evaluation of M and knowledge of the multi-step linkage mechanism.