Eurostat Statistical Matching using auxiliary information Training Course «Statistical Matching» Rome, 6-8 November 2013 Marco Di Zio Dept. Integration, Quality, Research and Production Networks Development Department, Istat dizio [at] istat.it


Statistical Matching using auxiliary information

Training Course «Statistical Matching»

Rome, 6-8 November 2013

Marco Di Zio

Dept. Integration, Quality, Research and Production Networks Development Department, Istat

dizio [at] istat.it


Outline

The problem

Auxiliary information

Auxiliary information in parametric models

Auxiliary information in nonparametric models

References


The problem

Let $A \cup B$ be a sample of $n_A + n_B$ observations i.i.d. from $f(x, y, z)$, with Z missing on the records of A and Y missing on the records of B.

Two alternative models are identifiable for $A \cup B$: the CIA (conditional independence assumption) and the PIA. The reason is that those models involve only the distributions of X, Y|X and Z|X.

When the CIA (or PIA) is not appropriate for our problem, it is necessary to use auxiliary information (if we want a point estimate).


Example. The normal case

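$(X, Y, Z) \sim N(\mu, \Sigma)$

The inestimable parameter is $\sigma_{YZ}$ (or equivalently $\rho_{YZ}$).

Under the CIA this is $\sigma_{YZ} = \sigma_{XY}\sigma_{XZ} / \sigma_X^2$

In general it holds that $\sigma_{YZ} = \sigma_{XY}\sigma_{XZ} / \sigma_X^2 + \sigma_{YZ|X}$

We need information to fill the gap $\sigma_{YZ|X} = ?$ (or $\rho_{YZ|X}$)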
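
The same decomposition can be written on the correlation scale, which makes the role of the missing piece easier to see. The sketch below is an illustration, not part of the course material: it assumes a scalar X, uses the standard definition of the partial correlation, and the function names and the input values 0.5 and 0.4 are hypothetical.

```python
import math

def rho_yz(rho_xy, rho_xz, rho_yz_given_x):
    """Correlation of Y and Z in a trivariate normal with scalar X.

    Uses rho_YZ = rho_XY * rho_XZ
                  + rho_YZ|X * sqrt((1 - rho_XY^2) * (1 - rho_XZ^2)).
    """
    residual = math.sqrt((1.0 - rho_xy ** 2) * (1.0 - rho_xz ** 2))
    return rho_xy * rho_xz + rho_yz_given_x * residual

def admissible_interval(rho_xy, rho_xz):
    """Range of rho_YZ obtained by letting rho_YZ|X vary over [-1, 1]."""
    return rho_yz(rho_xy, rho_xz, -1.0), rho_yz(rho_xy, rho_xz, 1.0)

# Hypothetical estimates from files A and B.
cia_value = rho_yz(0.5, 0.4, 0.0)        # CIA: partial correlation set to 0
low, high = admissible_interval(0.5, 0.4)
print(cia_value, low, high)
```

Under the CIA the result is the midpoint of the admissible interval; auxiliary information on $\rho_{YZ|X}$ selects one point inside it.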


Regression

where


Auxiliary information

In general there are two different kinds of auxiliary information:

1) a third file C where either (X, Y, Z) or (Y, Z) are jointly observed

2) a plausible value of the inestimable parameters of either (Y, Z | X) or (Y, Z)


Sources

Sources may not be perfect:

an outdated statistical investigation;

administrative register;

a supplemental (even small) ad hoc survey;

proxy variables (Y°,Z°)


Auxiliary information on parameters

Previous surveys, assumptions made by the researcher, or proxy variables may suggest a value $\theta^*$ for the non-estimable parameters.

Two kinds of information:

information about $\theta_{YZ|X}$

information about $\theta_{YZ}$


Auxiliary information on parameters

Consequences of information on parameters:

It restricts the parameter space $\Theta$ to a subspace $\Theta^*$.

$\Theta^*$ consists of all the parameters in $\Theta$ compatible with the auxiliary information.


Auxiliary information and likelihood

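Combining estimates and auxiliary information is easier when the information is about $\theta_{YZ|X}$.

In general, the pdf $f(x, y, z; \theta)$ may be written as:

$f(x, y, z; \theta) = f_X(x; \theta_X)\, f_{YZ|X}(y, z \mid x; \theta_{YZ|X})$

where $x \in \mathcal{X}$, $y \in \mathcal{Y}$, $z \in \mathcal{Z}$, and the parameter space is $\Theta = \{\theta\}$,

reparametrised in two sets $\Theta_X = \{\theta_X\}$ and $\Theta_{YZ|X} = \{\theta_{YZ|X}\}$.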


Auxiliary information about $\theta_{YZ|X}$

This information is valuable but rarely available.

An interesting case is when $(X, Y, Z) \sim N(\mu, \Sigma)$. In this case the only information required is on $\rho_{YZ|X}$.

Algorithm for the MLE:

estimate $\theta_X$ on $A \cup B$

estimate $\theta_{Y|X}$ on A and $\theta_{Z|X}$ on B

with the previous estimates and $\rho_{YZ|X} = \rho^*_{YZ|X}$ we can compute the remaining parameters involving Y and Z jointly.
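
The formulas that closed this slide were not preserved in the transcript. As a hedged sketch of how the three steps could be carried out for a scalar X, the code below estimates the variance of X on the pooled files, fits the two regressions separately, and then combines them with an assumed value `rho_star` for $\rho_{YZ|X}$, following the decomposition $\sigma_{YZ} = \sigma_{XY}\sigma_{XZ}/\sigma_X^2 + \sigma_{YZ|X}$ from the normal example. Function names and the toy data are hypothetical.

```python
import numpy as np

def fit_line(x, y):
    """ML slope and residual variance of the regression of y on x."""
    slope = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    intercept = y.mean() - slope * x.mean()
    resid_var = np.mean((y - intercept - slope * x) ** 2)
    return slope, resid_var

def sigma_yz_from_partial(x_a, y_a, x_b, z_b, rho_star):
    """Covariance of Y and Z given an assumed partial correlation."""
    # Step 1: parameters of X, estimated on A u B.
    var_x = np.var(np.concatenate([x_a, x_b]))
    # Step 2: Y|X estimated on A, Z|X estimated on B.
    beta_yx, var_y_given_x = fit_line(x_a, y_a)
    beta_zx, var_z_given_x = fit_line(x_b, z_b)
    # Step 3: plug in the auxiliary value for rho_YZ|X.
    sigma_yz_given_x = rho_star * np.sqrt(var_y_given_x * var_z_given_x)
    # sigma_XY * sigma_XZ / sigma_X^2 equals beta_yx * beta_zx * var_x.
    return beta_yx * beta_zx * var_x + sigma_yz_given_x

# Toy files (hypothetical values, for illustration only).
x_a = np.array([0.0, 1.0, 2.0, 3.0]); y_a = np.array([1.0, 1.0, 5.0, 5.0])
x_b = np.array([0.0, 1.0, 2.0, 3.0]); z_b = np.array([2.0, 0.0, 2.0, 0.0])
print(sigma_yz_from_partial(x_a, y_a, x_b, z_b, rho_star=0.5))
```

Setting `rho_star=0` reproduces the CIA estimate, so the same function covers both cases.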


Auxiliary information about $\theta_{YZ}$

This information is more problematic:

it does not guarantee a unique MLE (see, e.g., the multinomial distribution);

it is not an easy task to combine it with the estimates obtained from $A \cup B$;

it requires constrained maximization approaches.


Auxiliary information about $\theta_{YZ}$

This information does not guarantee a unique MLE.

We cannot estimate a log-linear model like

However we can estimate


Auxiliary information about $\theta_{YZ}$

Normal distribution

This information guarantees a unique MLE.

The only parameter involving Y and Z is $\sigma_{YZ}$.

Information on it is sufficient to fill the lack of knowledge.


Auxiliary information about $\theta_{YZ}$

Let us estimate $\rho_{YX}$ and $\rho_{ZX}$ with

and let $\rho_{YZ} = \rho^*_{YZ}$.

There are two possibilities:

1) Auxiliary information is compatible with the estimates


Auxiliary information about $\theta_{YZ}$

2) Auxiliary information is NOT compatible with the estimates


Example: Auxiliary information on $\rho_{YZ} = \rho^*_{YZ}$

Let us suppose that

The value $\rho^*_{YZ} = 0.7$ is compatible: $\det(\Sigma) = 0.096$,

while

$\rho^*_{YZ} = 0.9$ is not compatible: $\det(\Sigma) = -0.008$.
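
The estimates assumed in this example were lost in the transcript. As an illustration of the compatibility check, the sketch below builds the 3x3 correlation matrix and inspects its determinant: a negative determinant means the matrix is not positive semidefinite, so the auxiliary value cannot coexist with the estimates. The values $\rho_{XY} = 0.9$ and $\rho_{XZ} = 0.6$ are an assumption, chosen only because they reproduce the two determinants quoted above; function names are hypothetical.

```python
import numpy as np

def correlation_matrix(rho_xy, rho_xz, rho_yz):
    """Correlation matrix of (X, Y, Z)."""
    return np.array([
        [1.0,    rho_xy, rho_xz],
        [rho_xy, 1.0,    rho_yz],
        [rho_xz, rho_yz, 1.0],
    ])

def is_compatible(rho_xy, rho_xz, rho_yz_star):
    """True when the completed matrix is positive definite.

    With |rho_xy| < 1 the first two leading minors are positive,
    so the sign of the determinant decides.
    """
    return np.linalg.det(correlation_matrix(rho_xy, rho_xz, rho_yz_star)) > 0

# Assumed estimates (not taken from the slide): rho_xy = 0.9, rho_xz = 0.6.
for rho_star in (0.7, 0.9):
    det = np.linalg.det(correlation_matrix(0.9, 0.6, rho_star))
    print(rho_star, round(det, 3), is_compatible(0.9, 0.6, rho_star))
```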


Micro approach

As in the micro approach under the CIA

Conditional mean

Random draw


Conditional mean – Normal distribution

Imputation of Z in A


Random draw

Imputation of Z in A
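
The imputation formulas for these two slides did not survive extraction. The following sketch shows one way the two micro approaches could look for the normal model, using the standard conditional distribution of a multivariate normal: Z given (X, Y) has mean $\mu_Z + \Sigma_{Z,XY}\Sigma_{XY}^{-1}((x, y) - \mu_{XY})$ and variance $\sigma_Z^2 - \Sigma_{Z,XY}\Sigma_{XY}^{-1}\Sigma_{XY,Z}$. It assumes that $\mu$ and $\Sigma$ have already been completed with auxiliary information; the function names and the example matrix are hypothetical.

```python
import numpy as np

def conditional_z(mu, sigma, x, y):
    """Mean and variance of Z | X=x, Y=y; variables ordered (X, Y, Z)."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    s_oo = sigma[:2, :2]              # covariance of the observed (X, Y)
    s_zo = sigma[2, :2]               # covariances of Z with (X, Y)
    weights = np.linalg.solve(s_oo, s_zo)
    mean = mu[2] + weights @ (np.array([x, y]) - mu[:2])
    var = sigma[2, 2] - s_zo @ weights
    return mean, var

def impute_z(mu, sigma, x, y, rng=None):
    """Conditional mean if rng is None, otherwise a random draw."""
    mean, var = conditional_z(mu, sigma, x, y)
    if rng is None:
        return mean
    return mean + np.sqrt(var) * rng.standard_normal()

# Illustrative parameters (assumed, unit variances).
mu = [0.0, 0.0, 0.0]
sigma = [[1.0, 0.9, 0.6],
         [0.9, 1.0, 0.7],
         [0.6, 0.7, 1.0]]
print(impute_z(mu, sigma, x=1.0, y=0.5))                        # conditional mean
print(impute_z(mu, sigma, 1.0, 0.5, np.random.default_rng(0)))  # random draw
```

The conditional mean shrinks the variability of the imputed Z, while the random draw adds the residual noise back.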


Non-parametric methods

Auxiliary information may be an additional file C.

Micro

Hot-deck (A recipient and B donor)

Any record in A is imputed with a record from C (if a distance is used, it is computed on (X, Y) if C contains (X, Y, Z), or on Y if C contains (Y, Z)). The imputed record is

$(x_a, y_a, \tilde{z}_a^{(1)} = z_{c^*})$

Z in a is imputed with a live value $\tilde{z}_a^{(2)} = z_{b^*}$ from B through hot-deck.

If a distance is used, $b^* \in B$ minimizes $d((x_a, z_{c^*}), (x_b, z_b))$.

The final data set is composed of $(x_a, y_a, \tilde{z}_a^{(2)})$.
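
The two-stage hot deck described above can be sketched as follows. This is a minimal illustration, assuming that C contains (X, Y, Z), that all variables are numeric, and that the distance is Euclidean; the slide does not prescribe a distance, and the function names and toy files are hypothetical.

```python
import numpy as np

def nearest(target, candidates):
    """Index of the row of `candidates` closest to `target` (Euclidean)."""
    return int(np.argmin(np.linalg.norm(candidates - target, axis=1)))

def two_step_hot_deck(a_xy, b_xz, c_xyz):
    """Impute Z in A using C for an intermediate value and B for a live one."""
    z_final = np.empty(len(a_xy))
    for i, (x_a, y_a) in enumerate(a_xy):
        # Stage 1: donor c* in C, distance computed on (X, Y).
        c_star = nearest(np.array([x_a, y_a]), c_xyz[:, :2])
        z_intermediate = c_xyz[c_star, 2]
        # Stage 2: donor b* in B, distance computed on (X, Z).
        b_star = nearest(np.array([x_a, z_intermediate]), b_xz)
        z_final[i] = b_xz[b_star, 1]
    return z_final

# Toy files (hypothetical values).
a_xy = np.array([[0.0, 0.0], [10.0, 10.0]])
c_xyz = np.array([[0.0, 0.0, 1.0], [10.0, 10.0, 9.0]])
b_xz = np.array([[0.0, 1.2], [10.0, 8.5], [5.0, 5.0]])
print(two_step_hot_deck(a_xy, b_xz, c_xyz))
```

The intermediate value from C is used only to steer the search in B, so every value that ends up in the final file is one actually observed in B.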


Auxiliary information

Auxiliary information can be

1. information on the inestimable parameters (e.g. $\rho_{YZ}$), as already introduced

2. information on some parameters not directly identifying the model; for instance, (X, Y, Z) are continuous but the contingency table of a categorization of them is known

This kind of auxiliary information can be dealt with by using mixed methods and non-parametric methods as well.


Mixed methods

They combine parametric and non-parametric approaches, mainly in two steps.

1. Estimate the parametric model

2. Use a hot-deck procedure for the imputation of the missing data


Mixed methods: Auxiliary file C

Regression step 1

Regression step 2

Matching step

For each obs. a, the value $z_{b^*}$ corresponding to the nearest neighbor $b^*$ in B is imputed.


Mixed methods: Auxiliary file C

Regression step

Matching step

For each obs. a, the value $z_{b^*}$ corresponding to the nearest neighbor $b^*$ in B is imputed.


Mixed methods: Auxiliary file C

Categorical variables

1. Estimation step

Estimate $\theta_{ijk}$ through maximum likelihood applied to file C.

2. Matching step

For each obs. a, a value $z_{b^*}$ is found through a hot-deck procedure. This value is used for the imputation if the corresponding estimated frequency of the cell (X=i, Y=j, Z=k) is not exceeded.
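
One possible reading of these two steps is sketched below. It is an interpretation, not the course's procedure: cell probabilities are estimated as relative frequencies in C (the multinomial MLE), each cell gets a quota of `ceil(n_A * theta_hat)` records, and donors in B are scanned in file order among those sharing the recipient's X. The quota rule, the scan order, and all names are assumptions made for illustration.

```python
import math
from collections import Counter

def estimate_cells(file_c):
    """Relative frequency of each (x, y, z) cell in file C."""
    counts = Counter(file_c)
    return {cell: n / len(file_c) for cell, n in counts.items()}

def constrained_hot_deck(file_a, file_b, theta_hat):
    """Impute z in A from donors in B without exceeding the cell quotas."""
    n_a = len(file_a)
    quota = {cell: math.ceil(p * n_a) for cell, p in theta_hat.items()}
    used = Counter()
    imputed = []
    for x, y in file_a:
        z_value = None                      # stays None if no donor fits
        for x_b, z_b in file_b:
            cell = (x, y, z_b)
            if x_b == x and used[cell] < quota.get(cell, 0):
                used[cell] += 1
                z_value = z_b
                break
        imputed.append((x, y, z_value))
    return imputed

# Toy files (hypothetical values).
file_c = [(0, 0, 0), (0, 0, 1), (1, 1, 1), (1, 1, 1)]
file_a = [(0, 0), (0, 0), (1, 1), (1, 1)]
file_b = [(0, 0), (0, 1), (1, 1), (1, 0)]
print(constrained_hot_deck(file_a, file_b, estimate_cells(file_c)))
```

Because the quotas come from C, the imputed file reproduces the (X, Y, Z) structure seen in C while still using only values observed in B.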


Mixed methods: ‘Coarse’ information

We do not know the parameters of (X, Y, Z), but we know the contingency table for a categorization (X°, Y°, Z°) of (X, Y, Z).

1. Hot-deck step

For each obs. a in A determine a ‘live’ value $z_{c^*}$ from the record $c^*$ in C with respect to a distance $d((x_a, y_a), (x_c, y_c))$. It is imputed only if the frequency of (X°, Y°, Z°) in A is not exceeded. Otherwise continue.

2. Matching step

For each obs. a in A impute the live value $z_{b^*}$ corresponding to the nearest neighbor $b^*$ in B with respect to the distance $d((x_a, \tilde{z}_a), (x_b, z_b))$.


Selected references

Rässler S. (2002) Statistical Matching: a frequentist theory, practical applications and alternative Bayesian approaches, Springer

Moriarity C., Scheuren F. (2001) “Statistical Matching: a Paradigm for Assessing the Uncertainty in the Procedure”, Jour. of Official Statistics, 17, 407–422

Moriarity C., Scheuren F. (2003) “A Note on Rubin’s Statistical Matching Using File Concatenation with Adjusted Weights and Multiple Imputation”, Jour. of Business and Economic Statistics, 21, 65–73

Moriarity C., Scheuren F. (2004) “Regression–based statistical matching: recent developments”, Proceedings of the Section on Survey Research Methods, American Statistical Association

D’Orazio M., Di Zio M., Scanu M. (2006) “Statistical Matching for Categorical Data: displaying uncertainty and using logical constraints”, Jour. of Official Statistics, 22, 1–22