Eurostat
Statistical Matching using auxiliary information
Training Course «Statistical Matching»
Rome, 6-8 November 2013
Marco Di Zio
Dept. Integration, Quality, Research and Production Networks Development Department, Istat
dizio [at] istat.it
Outline
The problem
Auxiliary information
Auxiliary information in parametric models
Auxiliary information in nonparametric models
References
The problem
Let A ∪ B be a sample of nA + nB i.i.d. observations from f(x, y, z),
with Z missing on the records of A and Y missing on the records of B. Two alternative models are identifiable for A ∪ B:
the CIA and the PIA. The reason is that these models involve only the distributions of X,
Y|X and Z|X. When the CIA (or the PIA) is not appropriate for our problem, it is necessary to
use auxiliary information (if we want a point estimate).
Example. The normal case
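$(X, Y, Z) \sim N(\mu, \Sigma)$
The inestimable parameter is $\sigma_{YZ}$ (or, equivalently, $\rho_{YZ}$). Under the CIA it is
$\sigma_{YZ} = \sigma_{XY}\,\sigma_{XZ} / \sigma^2_X$
In general it holds that
$\sigma_{YZ} = \sigma_{XY}\,\sigma_{XZ} / \sigma^2_X + \sigma_{YZ|X}$
We need information to fill the gap: $\sigma_{YZ|X} = ?$ (or $\rho_{YZ|X}$)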
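The gap left by the CIA can be made concrete with a small numerical check. The following is a minimal sketch in Python, assuming a univariate X; the function name `split_sigma_yz` and the covariance values are hypothetical and chosen only for illustration.

```python
def split_sigma_yz(s_xx, s_xy, s_xz, s_yz):
    """Split sigma_YZ into the part implied by the CIA and sigma_YZ|X.

    s_xx is the variance of X; s_xy, s_xz, s_yz are covariances.
    """
    cia_part = s_xy * s_xz / s_xx   # value of sigma_YZ under the CIA
    partial = s_yz - cia_part       # sigma_YZ|X, not estimable from A and B alone
    return cia_part, partial


# Illustrative (assumed) values: in practice s_yz is exactly what is unknown.
cia_part, partial = split_sigma_yz(s_xx=1.0, s_xy=0.5, s_xz=0.4, s_yz=0.6)
print(cia_part, partial)
```

With these assumed values the CIA would account for only part of the covariance between Y and Z; the remainder is the partial covariance that auxiliary information has to supply.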
Regression
[Regression equations not preserved in the transcript.]
Auxiliary information
In general there are two different kinds of auxiliary information:
1) a third file C where either (X, Y, Z) or (Y, Z) are jointly observed
2) a plausible value of the inestimable parameters of either (Y, Z|X) or (Y, Z)
Sources
Sources may not be perfect:
an outdated statistical investigation;
an administrative register;
a supplemental (even small) ad hoc survey;
proxy variables (Y°,Z°)
Auxiliary information on parameters
Previous surveys, assumptions made by the researcher, or proxy variables may suggest a value $\theta^*$ for the non-estimable parameters.
Two kinds of information:
information about $\theta_{YZ|X}$
information about $\theta_{YZ}$
Auxiliary information on parameters
Consequences of information on parameters.
It restricts the parameter space $\Theta$ to a subspace $\Theta^*$.
$\Theta^*$ contains all the parameters in $\Theta$ compatible with the auxiliary information.
Auxiliary information and likelihood
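Combining estimates and auxiliary information is easier when the information is about $\theta_{YZ|X}$.
In general, the pdf $f(x, y, z; \theta)$ may be written as:
$f(x, y, z; \theta) = f_X(x; \theta_X)\, f_{YZ|X}(y, z \mid x; \theta_{YZ|X})$
where $x \in \mathcal{X}$, $y \in \mathcal{Y}$, $z \in \mathcal{Z}$, and the parameter space $\Theta = \{\theta\}$
is reparametrised in two sets, $\Theta_X = \{\theta_X\}$ and $\Theta_{YZ|X} = \{\theta_{YZ|X}\}$.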
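Auxiliary information about $\theta_{YZ|X}$
This information is valuable but rarely available.
An interesting case is when $(X, Y, Z) \sim N(\mu, \Sigma)$. In this case the only information required is on $\rho_{YZ|X}$.
Algorithm for the MLE:
1. estimate $\theta_X$ on A ∪ B;
2. estimate $\theta_{Y|X}$ on A and $\theta_{Z|X}$ on B;
3. with the previous estimates and $\rho_{YZ|X} = \rho^*_{YZ|X}$ we can compute
$\hat\sigma_{YZ|X} = \rho^*_{YZ|X} \sqrt{\hat\sigma^2_{Y|X}\,\hat\sigma^2_{Z|X}}$
and
$\hat\sigma_{YZ} = \hat\sigma_{XY}\,\hat\sigma_{XZ} / \hat\sigma^2_X + \hat\sigma_{YZ|X}$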
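The three steps of the algorithm can be sketched in Python for a univariate X. This is only an illustration under stated assumptions: the helper names (`ml_cov`, `regress`, `sigma_yz_from_rho`) and the toy data are hypothetical, variances use the maximum likelihood divisor n, and the covariance of X with Y (or Z) is taken as the regression slope times the pooled variance of X.

```python
from math import sqrt
from statistics import fmean


def ml_cov(u, v):
    """Maximum likelihood covariance (divisor n) of two equal-length lists."""
    mu, mv = fmean(u), fmean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)


def regress(x, t):
    """Slope and ML residual variance of the regression of t on x."""
    slope = ml_cov(x, t) / ml_cov(x, x)
    res_var = ml_cov(t, t) - slope ** 2 * ml_cov(x, x)
    return slope, res_var


def sigma_yz_from_rho(xa, ya, xb, zb, rho_star):
    """Estimate sigma_YZ|X and sigma_YZ from A = (x, y), B = (x, z) and rho*."""
    # Step 1: variance of X on the pooled file A U B.
    x_all = list(xa) + list(xb)
    var_x = ml_cov(x_all, x_all)
    # Step 2: regression of Y on X in A and of Z on X in B.
    beta_yx, var_y_x = regress(xa, ya)
    beta_zx, var_z_x = regress(xb, zb)
    # Step 3: plug in the assumed partial correlation rho*.
    sigma_yz_x = rho_star * sqrt(var_y_x * var_z_x)
    sigma_yz = beta_yx * beta_zx * var_x + sigma_yz_x
    return sigma_yz_x, sigma_yz


# Toy files (hypothetical data) and an assumed value rho* = 0.5.
xa, ya = [0, 1, 2, 3], [1, 1, 4, 4]
xb, zb = [0, 1, 2, 3], [0, 2, 1, 3]
sigma_yz_x, sigma_yz = sigma_yz_from_rho(xa, ya, xb, zb, rho_star=0.5)
print(sigma_yz_x, sigma_yz)
```

Setting `rho_star=0` gives back the estimate obtained under the CIA.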
Auxiliary information about $\theta_{YZ}$
This information is more problematic:
it does not guarantee a unique MLE (see, e.g., the multinomial distribution);
it is not an easy task to combine it with the estimates obtained from A ∪ B;
it requires constrained maximum likelihood approaches.
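Auxiliary information about $\theta_{YZ}$
This information does not guarantee a unique MLE. We cannot estimate a log-linear model like
$\log \theta_{ijk} = \lambda + \lambda^X_i + \lambda^Y_j + \lambda^Z_k + \lambda^{XY}_{ij} + \lambda^{XZ}_{ik} + \lambda^{YZ}_{jk} + \lambda^{XYZ}_{ijk}$
However, we can estimate
$\log \theta_{ijk} = \lambda + \lambda^X_i + \lambda^Y_j + \lambda^Z_k + \lambda^{XY}_{ij} + \lambda^{XZ}_{ik} + \lambda^{YZ}_{jk}$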
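One log-linear model that becomes estimable once the (Y, Z) table is known is the model with all two-way interactions and no three-way interaction. A standard way to fit it is iterative proportional fitting (IPF); the slides do not name a fitting method, so this choice is an assumption. In the sketch below the function name and the three margins are hypothetical: `p_xy` would be estimated on A, `p_xz` on B, and `p_yz` would be the auxiliary information. The margins are assumed to be strictly positive and mutually compatible.

```python
def ipf_no_three_way(p_xy, p_xz, p_yz, n_iter=500):
    """Fit theta[i][j][k] with the given XY, XZ and YZ margins by IPF.

    Starting from a uniform table, each cycle rescales the table so that
    one two-way margin at a time matches its target.
    """
    I, J, K = len(p_xy), len(p_xy[0]), len(p_xz[0])
    theta = [[[1.0 / (I * J * K)] * K for _ in range(J)] for _ in range(I)]
    for _ in range(n_iter):
        # Match the (X, Y) margin.
        for i in range(I):
            for j in range(J):
                s = sum(theta[i][j])
                for k in range(K):
                    theta[i][j][k] *= p_xy[i][j] / s
        # Match the (X, Z) margin.
        for i in range(I):
            for k in range(K):
                s = sum(theta[i][j][k] for j in range(J))
                for j in range(J):
                    theta[i][j][k] *= p_xz[i][k] / s
        # Match the (Y, Z) margin supplied by the auxiliary information.
        for j in range(J):
            for k in range(K):
                s = sum(theta[i][j][k] for i in range(I))
                for i in range(I):
                    theta[i][j][k] *= p_yz[j][k] / s
    return theta


# Illustrative (assumed) margins for three binary variables.
p_xy = [[0.30, 0.20], [0.10, 0.40]]
p_xz = [[0.35, 0.15], [0.15, 0.35]]
p_yz = [[0.25, 0.15], [0.25, 0.35]]
theta = ipf_no_three_way(p_xy, p_xz, p_yz)
```

The fitted table reproduces the three margins, and the conditional odds ratio between Y and Z is the same in every category of X, which is what the absence of a three-way interaction means.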
Auxiliary information about $\theta_{YZ}$
Normal distribution
This information guarantees a unique MLE.
The only parameter involving Y and Z is $\sigma_{YZ}$.
Information on it is sufficient to fill the lack of knowledge.
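Auxiliary information about $\theta_{YZ}$
Let us estimate $\sigma_{YX}$ and $\sigma_{ZX}$ with $\hat\sigma_{YX}$ (from A) and $\hat\sigma_{ZX}$ (from B),
and let $\sigma_{YZ} = \sigma^*_{YZ}$.
There are two possibilities:
1) The auxiliary information is compatible with the estimates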
Auxiliary information about $\theta_{YZ}$
2) The auxiliary information is NOT compatible with the estimates
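Example: auxiliary information on $\rho_{YZ} = \rho^*_{YZ}$
Let us suppose that [estimated parameters not preserved in the transcript]
The value $\rho^*_{YZ} = 0.7$ is compatible, $\det(\Sigma) = 0.096$,
while
$\rho^*_{YZ} = 0.9$ is not compatible, $\det(\Sigma) = -0.008$.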
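Compatibility amounts to checking that the completed correlation matrix is positive definite. The following is a minimal sketch in Python for standardized variables. The values 0.9 and 0.6 for the correlations of X with Y and with Z are assumptions: they are not in the transcript, and are used here because they reproduce the two determinants quoted above. The function names are illustrative.

```python
from math import sqrt


def corr_det(r_xy, r_xz, r_yz):
    """Determinant of the 3x3 correlation matrix of (X, Y, Z)."""
    return 1 + 2 * r_xy * r_xz * r_yz - r_xy ** 2 - r_xz ** 2 - r_yz ** 2


def admissible_interval(r_xy, r_xz):
    """Open interval of r_yz values giving a positive definite matrix."""
    centre = r_xy * r_xz                               # value under the CIA
    half_width = sqrt((1 - r_xy ** 2) * (1 - r_xz ** 2))
    return centre - half_width, centre + half_width


r_xy, r_xz = 0.9, 0.6   # assumed estimates, see the note above
low, high = admissible_interval(r_xy, r_xz)
for r_star in (0.7, 0.9):
    compatible = low < r_star < high
    print(r_star, round(corr_det(r_xy, r_xz, r_star), 3), compatible)
```

The interval makes explicit which values of the auxiliary correlation can be combined with the estimates from A and B without further adjustment.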
Micro approach
As in the micro approach under the CIA
Conditional mean
Random draw
Conditional mean – Normal distribution
Imputation of Z in A
Random draw
Imputation of Z in A
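Both imputation options for Z in A (conditional mean and random draw) can be sketched in Python for the normal case with univariate X, Y and Z. The sketch assumes that all parameters, including the covariance of Y and Z obtained with the help of the auxiliary information, have already been estimated. The function names and the numerical values are hypothetical.

```python
import random
from math import sqrt


def z_given_xy(mu, cov):
    """Regression of Z on (X, Y) implied by a trivariate normal.

    mu = (mu_x, mu_y, mu_z); cov is the 3x3 covariance matrix of (X, Y, Z).
    Returns the conditional mean function and the residual variance.
    """
    (s_xx, s_xy, s_xz), (_, s_yy, s_yz), (_, _, s_zz) = cov
    den = s_xx * s_yy - s_xy ** 2
    b_x = (s_xz * s_yy - s_yz * s_xy) / den    # coefficient of X
    b_y = (s_yz * s_xx - s_xz * s_xy) / den    # coefficient of Y
    res_var = s_zz - b_x * s_xz - b_y * s_yz   # variance of Z given (X, Y)

    def cond_mean(x, y):
        return mu[2] + b_x * (x - mu[0]) + b_y * (y - mu[1])

    return cond_mean, res_var


def impute_z(file_a, mu, cov, draw=False, rng=None):
    """Impute Z in A by conditional mean, or by a random draw if draw=True."""
    cond_mean, res_var = z_given_xy(mu, cov)
    rng = rng or random.Random()
    completed = []
    for x, y in file_a:
        z = cond_mean(x, y)
        if draw:
            # Add a residual drawn from N(0, variance of Z given (X, Y)).
            z += rng.gauss(0.0, sqrt(res_var))
        completed.append((x, y, z))
    return completed


# Illustrative (assumed) parameter estimates and a small file A.
mu = (0.0, 0.0, 0.0)
cov = [[1.0, 0.5, 0.4], [0.5, 1.0, 0.6], [0.4, 0.6, 1.0]]
file_a = [(1.0, 1.0), (-1.0, 0.5)]
by_mean = impute_z(file_a, mu, cov)
by_draw = impute_z(file_a, mu, cov, draw=True, rng=random.Random(1))
```

The conditional mean gives the same imputed value to all records sharing the same (x, y), whereas the random draw adds residual variability to the imputed values.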
Non-parametric methods
Auxiliary information may be an additional file C.
Micro
Hot-deck (A recipient and B donor):
1. any record a in A is imputed with a record c* from C (if a distance is used, it is computed on (X, Y) if C contains (X, Y, Z), or on Y if C contains (Y, Z)). The imputed record is
$(x_a, y_a, \tilde z_a^{(1)} = z_{c^*})$;
2. Z in a is imputed with a live value $\tilde z_a^{(2)} = z_{b^*}$ from B through hot-deck.
If a distance is used, $b^* \in B$ minimizes $d((x_a, z_{c^*}), (x_b, z_b))$;
3. the final data set is composed of $(x_a, y_a, \tilde z_a^{(2)})$.
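The two-stage hot-deck can be sketched in Python as follows. The sketch assumes that C contains (X, Y, Z), that all variables are numeric, and that an unconstrained nearest-neighbour search with Euclidean distance is used; the function name and the toy files are hypothetical.

```python
from math import dist


def two_step_hot_deck(file_a, file_b, file_c):
    """file_a: (x, y) records; file_b: (x, z) records; file_c: (x, y, z) records.

    Returns file A completed with a live value of Z observed in B.
    """
    completed = []
    for x_a, y_a in file_a:
        # Step 1: nearest record c* in C on (X, Y) gives the intermediate value.
        c_star = min(file_c, key=lambda c: dist((x_a, y_a), (c[0], c[1])))
        z_1 = c_star[2]
        # Step 2: nearest record b* in B on (X, Z) gives the live value.
        b_star = min(file_b, key=lambda b: dist((x_a, z_1), b))
        completed.append((x_a, y_a, b_star[1]))
    return completed


# Toy files (hypothetical data).
file_a = [(1.0, 2.0)]
file_b = [(1.0, 4.8), (1.0, 9.0), (5.0, 5.0)]
file_c = [(1.1, 2.1, 5.0), (3.0, 0.0, 9.0)]
completed = two_step_hot_deck(file_a, file_b, file_c)
```

In this toy example the intermediate value taken from C is 5.0, and the record of B closest to (1.0, 5.0) supplies the final value 4.8.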
Auxiliary information
Auxiliary information can be
1. information on the inestimable parameters (e.g. $\rho_{YZ}$), as already introduced;
2. information on some parameters not directly identifying the model; for instance, (X, Y, Z) are continuous but the contingency table of a categorization of them is known.
This kind of auxiliary information can be dealt with by using mixed methods, and non-parametric methods as well.
Mixed methods
They combine parametric and non-parametric approaches, mainly in two steps:
1. estimate the parametric model;
2. use a hot-deck procedure to impute the missing data.
Mixed methods: Auxiliary file C
Regression step 1
Regression step 2
Matching step
For each obs. a in A, the value $z_{b^*}$ of its nearest neighbor $b^*$ in B is imputed.
Mixed methods: Auxiliary file C
Regression step
Matching step
For each obs. a in A, the value $z_{b^*}$ of its nearest neighbor $b^*$ in B is imputed.
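A mixed procedure of this kind can be sketched in Python. The regression equations are not preserved in the transcript, so the sketch rests on assumptions: Z is regressed on (X, Y) by least squares in file C, the fitted equation gives an intermediate value for each record of A, and the nearest neighbour in B is found with Euclidean distance on (x, intermediate z). All function names and data are hypothetical.

```python
from math import dist
from statistics import fmean


def cov(u, v):
    """Covariance with divisor n of two equal-length sequences."""
    mu, mv = fmean(u), fmean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)


def fit_z_on_xy(file_c):
    """Least squares regression of Z on (X, Y) estimated on file C."""
    x, y, z = zip(*file_c)
    s_xx, s_yy, s_xy = cov(x, x), cov(y, y), cov(x, y)
    s_xz, s_yz = cov(x, z), cov(y, z)
    den = s_xx * s_yy - s_xy ** 2
    b_x = (s_xz * s_yy - s_yz * s_xy) / den
    b_y = (s_yz * s_xx - s_xz * s_xy) / den
    b_0 = fmean(z) - b_x * fmean(x) - b_y * fmean(y)
    return lambda x_a, y_a: b_0 + b_x * x_a + b_y * y_a


def mixed_match(file_a, file_b, file_c):
    """Regression step on C, then nearest-neighbour matching step in B."""
    predict = fit_z_on_xy(file_c)
    completed = []
    for x_a, y_a in file_a:
        z_tilde = predict(x_a, y_a)   # regression step: intermediate value
        # Matching step: live value from the nearest record of B.
        b_star = min(file_b, key=lambda b: dist((x_a, z_tilde), b))
        completed.append((x_a, y_a, b_star[1]))
    return completed


# Toy files (hypothetical data); in C, z = 1 + 2x + 3y exactly.
file_c = [(0, 0, 1), (1, 0, 3), (0, 1, 4), (1, 1, 6), (2, 1, 8)]
file_b = [(1.0, 8.5), (1.0, 2.0), (4.0, 9.0)]
file_a = [(1.0, 2.0)]
completed = mixed_match(file_a, file_b, file_c)
```

The imputed values are always values actually observed in B, while the parametric model is used only to decide which donor is closest.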
Mixed methods: Auxiliary file C
Categorical variables
1. Estimation step
Estimate $\theta_{ijk}$ through maximum likelihood applied to file C.
2. Matching step
For each obs. a, a value $z_{b^*}$ is found through a hot-deck procedure. This value is used for the imputation if the corresponding estimated frequency of the cell (X=i, Y=j, Z=k) is not exceeded.
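The frequency constraint in the matching step can be sketched in Python. Several details are assumptions, since the slide does not specify them: donors are restricted to records of B with the same category of X, they are tried in random order, and the maximum count of each cell is the estimated probability times the size of A, rounded up. The function name and the data are hypothetical.

```python
import math
import random
from collections import Counter


def constrained_hot_deck(file_a, file_b, theta_hat, rng=None):
    """file_a: (i, j) categories of (X, Y); file_b: (i, k) categories of (X, Z).

    theta_hat maps each cell (i, j, k) to its probability estimated on file C.
    """
    rng = rng or random.Random(0)
    n_a = len(file_a)
    # Maximum number of records of A allowed in each cell.
    ceiling = {cell: math.ceil(round(n_a * p, 9)) for cell, p in theta_hat.items()}
    used = Counter()
    completed = []
    for i, j in file_a:
        donors = [k for (x_b, k) in file_b if x_b == i]   # same category of X
        rng.shuffle(donors)
        z = None
        for k in donors:
            # Accept the donor only if the cell frequency is not exceeded.
            if used[(i, j, k)] < ceiling.get((i, j, k), 0):
                z = k
                used[(i, j, k)] += 1
                break
        completed.append((i, j, z))   # z stays None if no donor is admissible
    return completed


# Toy files (hypothetical data) with a uniform estimated table.
file_a = [(0, 0), (0, 0), (0, 1), (0, 1)]
file_b = [(0, 0), (0, 1)]
theta_hat = {(0, 0, 0): 0.25, (0, 0, 1): 0.25, (0, 1, 0): 0.25, (0, 1, 1): 0.25}
completed = constrained_hot_deck(file_a, file_b, theta_hat)
```

With these inputs each cell can receive at most one record of A, so the two records sharing the same (X, Y) categories are forced to take different values of Z.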
Mixed methods: ‘Coarse’ information
We do not know the parameters of (X, Y, Z), but we know the contingency table for a categorization (X°, Y°, Z°) of (X, Y, Z).
1. Hot-deck step
For each obs. a in A, determine a ‘live’ value $z_{c^*}$ from a record c* in C with respect to a distance $d((x_a, y_a), (x_c, y_c))$. It is imputed only if the frequency of (X°, Y°, Z°) in A is not exceeded. Otherwise continue.
2. Matching step
For each obs. a in A, impute the live value $z_{b^*}$ of the nearest neighbor b* in B, i.e. the record minimizing the distance $d((x_a, \tilde z_a), (x_b, z_b))$.
Selected references
Rässler S. (2002) Statistical Matching: a frequentist theory, practical applications and alternative Bayesian approaches, Springer
Moriarity C., Scheuren F. (2001) “Statistical Matching: a Paradigm for Assessing the Uncertainty in the Procedure”, Jour. of Official Statistics, 17, 407–422
Moriarity C., Scheuren F. (2003) “A Note on Rubin’s Statistical Matching Using File Concatenation with Adjusted Weights and Multiple Imputation”, Jour. of Business and Economic Statistics, 21, 65–73
Moriarity C., Scheuren F. (2004) “Regression–based statistical matching: recent developments”, Proceedings of the Section on Survey Research Methods, American Statistical Association
D’Orazio M., Di Zio M., Scanu M. (2006) “Statistical Matching for Categorical Data: displaying uncertainty and using logical constraints”, Jour. of Official Statistics, 22, 1–22