Eurostat

Statistical Matching using auxiliary information

Training Course «Statistical Matching»

Rome, 6-8 November 2013

Marco Di Zio

Dept. Integration, Quality, Research and Production Networks Development Department, Istat

dizio [at] istat.it

Outline

The problem

Auxiliary information

Auxiliary information in parametric models

Auxiliary information in nonparametric models

References

The problem

Let A ∪ B be a sample of $n_A + n_B$ i.i.d. observations from f(x, y, z), with Z missing on the records of A and Y missing on the records of B.

Two alternative models are identifiable for A ∪ B: the CIA (conditional independence assumption) and the PIA. The reason is that those models involve only the distributions of X, Y|X and Z|X.

When the CIA (or PIA) is not suitable for our problem, it is necessary to use auxiliary information (if we want a point estimate).

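Example. The normal case

$(X, Y, Z) \sim N(\mu, \Sigma)$

The inestimable parameter is $\sigma_{YZ}$ (or equivalently $\rho_{YZ}$).

Under the CIA this is $\sigma_{YZ} = \sigma_{XY}\,\sigma_{XZ} / \sigma^2_X$

In general it holds $\sigma_{YZ} = \sigma_{XY}\,\sigma_{XZ} / \sigma^2_X + \sigma_{YZ|X}$

We need information to fill the gap $\sigma_{YZ|X} = ?$ (or $\rho_{YZ|X}$)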
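
A small numerical check can make the role of $\sigma_{YZ|X}$ concrete. The sketch below assumes a univariate X and a covariance matrix chosen only for illustration (the numbers are not taken from the slides). It splits $\sigma_{YZ}$ into the part explained by X and the partial covariance given X.

```python
import numpy as np

# Illustrative covariance matrix for (X, Y, Z); the values are assumptions.
sigma = np.array([
    [1.0, 0.9, 0.6],
    [0.9, 1.0, 0.7],
    [0.6, 0.7, 1.0],
])

s_xx, s_xy, s_xz = sigma[0, 0], sigma[0, 1], sigma[0, 2]
s_yz = sigma[1, 2]

# Part of cov(Y, Z) explained by X: the only part A and B can identify.
cia_part = s_xy * s_xz / s_xx

# Partial covariance of (Y, Z) given X (Schur complement of the X block).
partial = sigma[1:, 1:] - np.outer(sigma[1:, 0], sigma[0, 1:]) / s_xx
s_yz_given_x = partial[0, 1]

print(cia_part, s_yz_given_x, cia_part + s_yz_given_x)
```

Under the CIA the second term is set to zero, so the CIA value of $\sigma_{YZ}$ is just `cia_part`.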

Regression

where


In general there are two different kinds of auxiliary information:

1) a third file C where either (X, Y, Z) or (Y, Z) are jointly observed

2) a plausible value of the inestimable parameters of either (Y, Z|X) or (Y, Z)

Sources

Sources may not be perfect:

an outdated statistical investigation;

administrative register;

a supplemental (even small) ad hoc survey;

proxy variables (Y°,Z°)

Auxiliary information on parameters

Previous surveys, assumptions made by the researcher, or proxy variables may suggest a value $\theta^*$ for the inestimable parameters.

Two kinds of information:

information about $\theta_{YZ|X}$

information about $\theta_{YZ}$

Auxiliary information on parameters

Consequences of information on parameters.

It restricts the parameter space $\Theta$ to a subspace $\Theta^*$.

$\Theta^*$ contains all the parameters in $\Theta$ that are compatible with the auxiliary information.

Auxiliary information and likelihood

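Combining estimates and auxiliary information is easier when the information is about $\theta_{YZ|X}$.

In general, the pdf $f(x, y, z; \theta)$ may be written as:

$f(x, y, z; \theta) = f_X(x; \theta_X)\, f_{YZ|X}(y, z \mid x; \theta_{YZ|X})$

where $x \in \mathcal{X}$, $y \in \mathcal{Y}$, $z \in \mathcal{Z}$, and the parameter space $\Theta = \{\theta\}$ is reparametrised in two sets $\Theta_X = \{\theta_X\}$, $\Theta_{YZ|X} = \{\theta_{YZ|X}\}$.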
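
To see why information on the conditional parameter is convenient, it may help to write the observed-data likelihood of A ∪ B. The following is a standard factorization under the i.i.d. setting described earlier. The symbols $\theta_{Y|X}$ and $\theta_{Z|X}$ (parameters of the two marginal conditionals) are notation assumed here.

```latex
L(\theta; A \cup B)
  = \prod_{a=1}^{n_A} f_{XY}(x_a, y_a; \theta)\,
    \prod_{b=1}^{n_B} f_{XZ}(x_b, z_b; \theta)
  = \prod_{i \in A \cup B} f_X(x_i; \theta_X)\,
    \prod_{a=1}^{n_A} f_{Y|X}(y_a \mid x_a; \theta_{Y|X})\,
    \prod_{b=1}^{n_B} f_{Z|X}(z_b \mid x_b; \theta_{Z|X})
```

No factor links Y and Z given X, so a value for that association can be plugged in without disturbing the three estimable factors.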

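Auxiliary information about $\theta_{YZ|X}$

This information is precious but rarely available.

An interesting case is when $(X, Y, Z) \sim N(\mu, \Sigma)$. In this case the only information required is on $\rho_{YZ|X}$.

Algorithm for the MLE:

estimate $\theta_X$ on A ∪ B;

estimate $\theta_{Y|X}$ on A and $\theta_{Z|X}$ on B;

with the previous estimates and $\rho_{YZ|X} = \rho^*_{YZ|X}$ we can compute

$\hat\sigma_{YZ|X} = \rho^*_{YZ|X} \sqrt{\hat\sigma^2_{Y|X}\, \hat\sigma^2_{Z|X}}$

and

$\hat\sigma_{YZ} = \hat\sigma_{XY}\, \hat\sigma_{XZ} / \hat\sigma^2_X + \hat\sigma_{YZ|X}$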
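
The algorithm could be sketched as follows. This is a minimal illustration, assuming a univariate X, simulated trivariate normal data, and hypothetical function names (`ml_regression`, `sigma_yz_with_partial_corr`). It is not the course's own code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated files (assumption: univariate X, trivariate normal data).
true_cov = np.array([[1.0, 0.9, 0.6], [0.9, 1.0, 0.7], [0.6, 0.7, 1.0]])
data_a = rng.multivariate_normal(np.zeros(3), true_cov, size=500)
data_b = rng.multivariate_normal(np.zeros(3), true_cov, size=500)
x_a, y_a = data_a[:, 0], data_a[:, 1]   # Z is missing in A
x_b, z_b = data_b[:, 0], data_b[:, 2]   # Y is missing in B


def ml_regression(x, t):
    """ML fit of t = alpha + beta * x + error: (alpha, beta, residual variance)."""
    beta = np.cov(x, t, bias=True)[0, 1] / np.var(x)
    alpha = t.mean() - beta * x.mean()
    resid_var = np.mean((t - alpha - beta * x) ** 2)
    return alpha, beta, resid_var


def sigma_yz_with_partial_corr(x_a, y_a, x_b, z_b, rho_star):
    # Step 1: parameters of X from the concatenated file A U B.
    var_x = np.var(np.concatenate([x_a, x_b]))  # ML variance (divides by n)
    # Step 2: Y|X on A and Z|X on B.
    _, beta_yx, var_y_given_x = ml_regression(x_a, y_a)
    _, beta_zx, var_z_given_x = ml_regression(x_b, z_b)
    # Step 3: plug in the auxiliary value rho*_{YZ|X}.
    s_yz_given_x = rho_star * np.sqrt(var_y_given_x * var_z_given_x)
    s_xy, s_xz = beta_yx * var_x, beta_zx * var_x
    return s_xy * s_xz / var_x + s_yz_given_x


print(sigma_yz_with_partial_corr(x_a, y_a, x_b, z_b, rho_star=0.0))  # CIA value
print(sigma_yz_with_partial_corr(x_a, y_a, x_b, z_b, rho_star=0.46))
```

Setting `rho_star=0.0` reproduces the CIA estimate, and any other value shifts $\hat\sigma_{YZ}$ by the plugged-in partial covariance.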

Auxiliary information about $\theta_{YZ}$

This information is more problematic:

It does not guarantee a unique MLE (see e.g. the multinomial distribution).

It is not an easy task to combine it with the estimates obtained from A ∪ B.

It requires constrained maximisation approaches.


This info does not guarantee a unique MLE.

We cannot estimate a log-linear model like

However we can estimate


Normal distribution

This info guarantees a unique MLE

The only parameter involving Y, Z is $\sigma_{YZ}$.

Information on it is sufficient to fill the lack of knowledge.


Let us estimate $\sigma_{YX}$ and $\sigma_{ZX}$ with

and let $\sigma_{YZ} = \sigma^*_{YZ}$.

There are two possibilities

1) Auxiliary info is compatible with estimates


2) Auxiliary info is NOT compatible with estimates

Example: Auxiliary info on $\sigma_{YZ} = \rho_{YZ}$

Let us suppose that

Value $\rho^*_{YZ} = 0.7$ is compatible, $\det(\Sigma) = 0.096$,

while

$\rho^*_{YZ} = 0.9$ is not compatible, $\det(\Sigma) = -0.008$.
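
The compatibility check amounts to asking whether the completed matrix is still a valid (positive semidefinite) correlation matrix. The sketch below assumes estimated correlations of 0.9 and 0.6 between X and the other two variables. These values are not given in the text above. They were chosen because they reproduce the two determinants quoted, and which one belongs to Y or Z is likewise an assumption.

```python
import numpy as np


def correlation_matrix(rho_xy, rho_xz, rho_yz):
    return np.array([
        [1.0, rho_xy, rho_xz],
        [rho_xy, 1.0, rho_yz],
        [rho_xz, rho_yz, 1.0],
    ])


def admissible_interval(rho_xy, rho_xz):
    """Range of rho_yz keeping the 3x3 correlation matrix positive semidefinite."""
    half_width = np.sqrt((1 - rho_xy**2) * (1 - rho_xz**2))
    centre = rho_xy * rho_xz
    return centre - half_width, centre + half_width


# Assumed estimates (not stated in the slides).
rho_xy, rho_xz = 0.9, 0.6
for rho_star in (0.7, 0.9):
    det = np.linalg.det(correlation_matrix(rho_xy, rho_xz, rho_star))
    print(rho_star, round(det, 3))

print(admissible_interval(rho_xy, rho_xz))
```

With these assumed inputs the admissible interval runs from roughly 0.19 to 0.89, which is why 0.7 is accepted and 0.9 is rejected.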

Micro approach

As in the micro approach under the CIA

Conditional mean

Random draw

Conditional mean – Normal distribution

Imputation of Z in A
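
A minimal sketch of conditional mean imputation under normality follows. It assumes that `mu` and `sigma` are already complete, i.e. that $\sigma_{YZ}$ has been filled in using the auxiliary information. The function name and the numbers are illustrative only.

```python
import numpy as np


def conditional_mean_z(x, y, mu, sigma):
    """E[Z | X=x, Y=y] for (X, Y, Z) ~ N(mu, sigma), variables ordered X, Y, Z."""
    s_oo = sigma[:2, :2]                  # covariance of the observed (X, Y)
    s_zo = sigma[2, :2]                   # covariance of Z with (X, Y)
    coef = np.linalg.solve(s_oo, s_zo)    # regression coefficients of Z on (X, Y)
    observed = np.column_stack([x, y]) - mu[:2]
    return mu[2] + observed @ coef


# Illustrative parameters (assumptions, not values from the slides).
mu = np.zeros(3)
sigma = np.array([[1.0, 0.9, 0.6], [0.9, 1.0, 0.7], [0.6, 0.7, 1.0]])
x_a = np.array([1.0, -0.5])
y_a = np.array([1.0, 0.2])
print(conditional_mean_z(x_a, y_a, mu, sigma))
```

Every record of A with the same (x, y) receives the same value, so the imputed Z has less variability than the true Z.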

Random draw

Imputation of Z in A
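
A random draw adds a residual to the conditional mean, so that the imputed values keep the conditional variability of Z. The sketch below assumes a complete normal model (with $\sigma_{YZ}$ fixed through auxiliary information). Names and numbers are illustrative.

```python
import numpy as np


def draw_z(x, y, mu, sigma, rng):
    """Draw Z from its normal distribution given X=x, Y=y (order X, Y, Z)."""
    s_oo = sigma[:2, :2]
    s_zo = sigma[2, :2]
    coef = np.linalg.solve(s_oo, s_zo)
    cond_mean = mu[2] + (np.column_stack([x, y]) - mu[:2]) @ coef
    cond_var = sigma[2, 2] - s_zo @ coef   # residual variance of Z given (X, Y)
    return cond_mean + rng.normal(0.0, np.sqrt(cond_var), size=len(cond_mean))


# Illustrative parameters (assumptions, not values from the slides).
mu = np.zeros(3)
sigma = np.array([[1.0, 0.9, 0.6], [0.9, 1.0, 0.7], [0.6, 0.7, 1.0]])
rng = np.random.default_rng(1)
print(draw_z(np.array([1.0, -0.5]), np.array([1.0, 0.2]), mu, sigma, rng))
```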

Non-parametric methods

Auxiliary information may be an additional file C

Micro

Hot-deck (A recipient and B donor)

Any record in A is imputed with a record from C (if a distance is used, it is computed on (X, Y) if C is (X, Y, Z), or on Y if C is (Y, Z)). The imputed record is

$(x_a, y_a, \tilde z_a^{(1)} = z_{c^*})$

Z in a is then imputed with a live value $\tilde z_a^{(2)} = z_{b^*}$ from B through hot-deck.

If a distance is used, $b^* \in B$ minimizes $d((x_a, z_{c^*}), (x_b, z_b))$.

The final data set is composed of $(x_a, y_a, \tilde z_a^{(2)})$
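
The two-step procedure could be sketched as follows, assuming that C observes (X, Y, Z), that all variables are continuous, and that an unweighted Euclidean distance is acceptable. These choices and the function names are assumptions made for illustration.

```python
import numpy as np


def nearest(query, pool):
    """Index of the row of `pool` closest (Euclidean) to each row of `query`."""
    dist = np.linalg.norm(query[:, None, :] - pool[None, :, :], axis=2)
    return dist.argmin(axis=1)


def two_step_hot_deck(xy_a, xyz_c, xz_b):
    # Step 1: intermediate value z_c* taken from C, matching on (X, Y).
    z_interm = xyz_c[nearest(xy_a, xyz_c[:, :2]), 2]
    # Step 2: live value z_b* taken from B, matching (x_a, z_c*) to (x_b, z_b).
    query = np.column_stack([xy_a[:, 0], z_interm])
    z_final = xz_b[nearest(query, xz_b), 1]
    return np.column_stack([xy_a, z_final])


xy_a = np.array([[0.0, 0.0], [1.0, 1.0]])
xyz_c = np.array([[0.0, 0.1, 5.0], [1.0, 0.9, 9.0]])
xz_b = np.array([[0.0, 4.8], [1.0, 9.5], [0.0, 9.0]])
print(two_step_hot_deck(xy_a, xyz_c, xz_b))
```

The value taken from C only steers the choice of donor. The value finally imputed is always one actually observed in B.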


Auxiliary information can be

1. information on the inestimable parameters (e.g. $\rho_{YZ}$), as already introduced

2. information on some parameters not directly identifying the model; for instance, (X, Y, Z) are continuous but the contingency table of a categorization of them is known

This kind of auxiliary information can be dealt with by using mixed methods, as well as non-parametric methods.

Mixed methods

They combine parametric and non-parametric approaches, mainly in two steps.

1. Estimate the parametric model

2. Use a hot-deck procedure for the imputation of the missing data

Mixed methods: Auxiliary file C

Regression step 1

Regression step 2

Matching step

For each obs. a, the value $z_{b^*}$ is imputed, corresponding to the nearest neighbor $b^*$ in B,

Mixed methods: Auxiliary file C

Regression step

Matching step

For each obs. a, the value $z_{b^*}$ is imputed, corresponding to the nearest neighbor $b^*$ in B,
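
One way the regression and matching steps could fit together is sketched below. It assumes that C observes (X, Y, Z), that the regression of Z on (X, Y) is fitted by least squares on C alone, and that the matching uses an unweighted Euclidean distance on (x, z). The slides do not confirm any of these choices, and the function name is hypothetical.

```python
import numpy as np


def mixed_match(xy_a, xz_b, xyz_c):
    # Regression step: least-squares fit of Z on (1, X, Y) using file C only.
    design_c = np.column_stack([np.ones(len(xyz_c)), xyz_c[:, :2]])
    coef, *_ = np.linalg.lstsq(design_c, xyz_c[:, 2], rcond=None)
    z_interm = np.column_stack([np.ones(len(xy_a)), xy_a]) @ coef
    # Matching step: nearest neighbour in B, comparing (x_a, z_interm) with (x_b, z_b).
    query = np.column_stack([xy_a[:, 0], z_interm])
    dist = np.linalg.norm(query[:, None, :] - xz_b[None, :, :], axis=2)
    donors = dist.argmin(axis=1)
    return np.column_stack([xy_a, xz_b[donors, 1]]), donors


xyz_c = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 2.0], [1.0, 1.0, 3.0]])
xy_a = np.array([[0.0, 1.0], [1.0, 0.0]])
xz_b = np.array([[0.0, 2.1], [1.0, 0.9], [1.0, 3.0]])
matched, donors = mixed_match(xy_a, xz_b, xyz_c)
print(matched, donors)
```

The regression prediction is used only to locate the donor, so the imputed values are live values observed in B.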

Mixed methods: Auxiliary file C

Categorical variables

1. Estimation step

Estimate $\theta_{ijk}$ through maximum likelihood applied to file C.

2. Matching step

For each obs. a, a value $z_{b^*}$ is found through a hot-deck procedure. This value is used for the imputation if the corresponding estimated frequency of the cell (X=i, Y=j, Z=k) is not exceeded.
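
The frequency constraint in the matching step could be implemented along the following lines. This sketch assumes a random hot-deck within classes of X, cell capacities obtained by rounding up the estimated probabilities times $n_A$, and `None` for records with no admissible donor. All of these are illustrative choices, not the procedure prescribed by the slides.

```python
from collections import Counter

import numpy as np


def constrained_hot_deck(records_a, records_b, theta, rng):
    """records_a: list of (x, y); records_b: list of (x, z);
    theta: dict {(x, y, z): estimated probability}, e.g. estimated on file C."""
    n_a = len(records_a)
    # Cell capacities implied by the estimated frequencies (rounded up).
    capacity = {cell: int(np.ceil(p * n_a)) for cell, p in theta.items()}
    used = Counter()
    imputed = []
    for x, y in records_a:
        donors = [z for xb, z in records_b if xb == x]
        z_star = None
        for idx in rng.permutation(len(donors)):  # random hot-deck within the X class
            z = donors[idx]
            if used[(x, y, z)] < capacity.get((x, y, z), 0):
                z_star = z  # accept: the estimated cell frequency is not exceeded
                break
        if z_star is not None:
            used[(x, y, z_star)] += 1
        imputed.append((x, y, z_star))
    return imputed


records_a = [(0, "a"), (0, "a")]
records_b = [(0, "u"), (0, "v")]
theta = {(0, "a", "u"): 0.5, (0, "a", "v"): 0.5}
print(constrained_hot_deck(records_a, records_b, theta, np.random.default_rng(0)))
```

In this toy input each cell can be used once, so the two records of A necessarily receive different values of Z.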

Mixed methods: ‘Coarse’ information

We do not know the parameters of (X, Y, Z), but we know the contingency table for a categorization (X°, Y°, Z°) of (X, Y, Z).

1. Hot-deck step

For each obs. a in A determine a ‘live’ value $z_{c^*}$ from a record $c^*$ in C with respect to a distance $d((x_a, y_a), (x_c, y_c))$. It is imputed only if the frequency of (X°, Y°, Z°) in A is not exceeded. Otherwise continue.

2. Matching step

For each obs. a in A impute the live value $z_{b^*}$ corresponding to the nearest neighbor $b^*$ in B, i.e. the record minimizing the distance $d((x_a, \tilde z_a), (x_b, z_b))$.

Selected references

Rässler S. (2002) Statistical Matching: a frequentist theory, practical applications and alternative Bayesian approaches, Springer

Moriarity C., Scheuren F. (2001) “Statistical Matching: a Paradigm for Assessing the Uncertainty in the Procedure”, Jour. of Official Statistics, 17, 407–422

Moriarity C., Scheuren F. (2003) “A Note on Rubin’s Statistical Matching Using File Concatenation with Adjusted Weights and Multiple Imputation”, Jour. of Business and Economic Statistics, 21, 65–73

Moriarity C., Scheuren F. (2004) “Regression–based statistical matching: recent developments”, Proceedings of the Section on Survey Research Methods, American Statistical Association

D’Orazio M., Di Zio M., Scanu M. (2006) “Statistical Matching for Categorical Data: displaying uncertainty and using logical constraints”, Jour. of Official Statistics, 22, 1–22
