Multivariate selective editing via mixture models: first applications to Italian structural business surveys
Orietta Luzi, Guarnera U., Silvestri F., Buglielli T., Nurra A., Siesto G.
Italian National Institute of Statistics
UNECE Worksession on Statistical Data Editing
Oslo, 22-24 September 2012
Outline
• Objective of the work
• The SeleMix approach to selective editing
• The Software SeleMix
• The Applications
• Final remarks and future work
Objective of the work
• Assessing the advantages (in terms of quality improvement and cost reduction) of a multivariate, model-based, robust selective editing approach for detecting influential errors in business surveys.
• Exploring the potential benefits of using administrative data when detecting influential errors in economic business surveys.
The idea is to make selective editing more effective by directly incorporating into the selective editing strategy the auxiliary information available in external sources, both administrative and statistical.
Selective Editing
• Key elements:
– score function
– cut-off value (threshold) determining the units to be manually reviewed
• The components of a score function are:
– risk ~ probability of error occurrence
– influence ~ (expected) impact on estimates
Score Function
• A local score is often defined for each record and each variable through a comparison of current values and “estimated” true values, e.g.
– historical values on the same units (when available)
– estimates (predictions) obtained using auxiliary information (e.g. admin data) or covariates from the same survey
• Different local scores are combined in a single global score. The cut-off value of the global score determines which units are to be manually reviewed
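A minimal Python sketch of the scoring scheme described above. It assumes a local score defined as the weighted absolute difference between observed and predicted values, scaled by the weighted total of the predictions, and a global score taken as the maximum of the local scores. Both definitions, the function names, the data and the cut-off are illustrative assumptions, not the definitions used in the applications.

```python
import numpy as np

def local_scores(observed, predicted, weights):
    """Local score per unit and variable: weighted |observed - predicted|,
    scaled by the weighted total of the predictions for that variable."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    weights = np.asarray(weights, dtype=float)[:, None]
    totals = (weights * predicted).sum(axis=0)
    return weights * np.abs(observed - predicted) / totals

def global_score(local):
    """Combine the local scores into one global score per unit (here: the maximum)."""
    return local.max(axis=1)

# Hypothetical data: 4 units, 2 variables (turnover, costs).
observed = np.array([[100.0, 80.0], [1200.0, 90.0], [95.0, 70.0], [110.0, 900.0]])
predicted = np.array([[105.0, 82.0], [120.0, 88.0], [98.0, 72.0], [108.0, 85.0]])
weights = np.array([10.0, 10.0, 10.0, 10.0])

scores = global_score(local_scores(observed, predicted, weights))
cutoff = 0.5  # illustrative cut-off value
to_review = np.flatnonzero(scores > cutoff)  # indices of units sent to manual review
```

With these numbers, the two units whose turnover or costs are far from the prediction are the ones flagged for review.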
Selective Editing
The difference between observed and predicted values is due to
• the potential error
• the natural variability of the analyzed quantity.
In the usual setting, there is no possibility to distinguish these two elements, and the score of an observation is not directly related to the expected error of that unit.
As a consequence we will not be able to relate the selective editing threshold to the desired degree of accuracy in the final estimates.
Problem:
Relate the threshold value of the score function to the desired estimate accuracy (i.e. residual error left in data)
Model-based Selective Editing
Proposed solution: use an approach based on
1) explicit modeling of both the data and the error mechanism (via mixture models). In particular, a latent variable model makes it possible, under certain assumptions, to estimate the expected error associated with each unit.
The method uses contaminated normal models, in which the distribution of the erroneous data is assumed to be obtained from the distribution of the error-free data by inflating the variance.
2) definition of the score function in terms of the conditional distribution of “true” data given observed data
The model
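Y*: true data
Y: observed data
X: covariates (no error)
B: regression coefficients
U: residuals
I: Bernoulli variable, with $P(I = 1) = \pi = 1 - P(I = 0)$
True data model:
$$Y^* = B'X + U, \qquad U \sim N(0, \Sigma)$$
Error model:
$$Y \mid Y^*, I = Y^* + I\varepsilon, \qquad P(Y \neq Y^*) = P(I = 1) = \pi$$
$$\varepsilon \sim N(0, \Sigma_\varepsilon), \qquad \Sigma_\varepsilon = (\lambda - 1)\Sigma, \qquad \lambda > 1$$
Distribution of observed data:
$$f_Y(y) = (1 - \pi)\, N(y; B'x, \Sigma) + \pi\, N(y; B'x, \lambda\Sigma)$$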
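To make the contamination model concrete, here is a small Python sketch for the univariate case with one covariate. It assumes true data with mean b0 + b1*x and variance sigma2, an error indicator with probability p_err, and an additive Gaussian error with variance (lam - 1)*sigma2, so that erroneous observations have variance lam*sigma2. All names and parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def simulate_contaminated(x, b0, b1, sigma2, lam, p_err, rng):
    """Draw true values, observed values and error indicators from the model."""
    n = len(x)
    y_true = b0 + b1 * x + rng.normal(0.0, np.sqrt(sigma2), n)
    is_error = rng.random(n) < p_err                           # Bernoulli indicator I
    eps = rng.normal(0.0, np.sqrt((lam - 1.0) * sigma2), n)   # additive error
    y_obs = y_true + is_error * eps                            # Y = Y* + I * eps
    return y_true, y_obs, is_error

def normal_pdf(y, mean, var):
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def observed_density(y, x, b0, b1, sigma2, lam, p_err):
    """Two-component mixture: same mean, variance inflated by lam for erroneous data."""
    mean = b0 + b1 * x
    return ((1.0 - p_err) * normal_pdf(y, mean, sigma2)
            + p_err * normal_pdf(y, mean, lam * sigma2))

# Hypothetical parameter values.
rng = np.random.default_rng(0)
x = rng.uniform(1.0, 10.0, 20000)
y_true, y_obs, is_error = simulate_contaminated(
    x, b0=0.0, b1=2.0, sigma2=1.0, lam=25.0, p_err=0.1, rng=rng)
```

Error-free units keep their true value exactly, while the contaminated ones are spread around the same regression line with a much larger variance.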
The method
Model parameters can be estimated based on the observed data via EM. These estimates can be used to estimate the conditional distribution of true data given observed data:
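$$f_{Y^*|Y}(y^* \mid y) = (1 - \tau(x, y))\, \delta(y^* - y) + \tau(x, y)\, N(y^*; \tilde{\mu}_{x,y}, \tilde{\Sigma})$$
where $\tau(x_i, y_i) = P\{Y_i^* \neq y_i \mid x_i, y_i\}$ is the posterior probability of error for unit i, $\delta$ is a point mass at $y^* = y$, and
$$\tilde{\mu}_{x,y} = \frac{y + (\lambda - 1) B'x}{\lambda}, \qquad \tilde{\Sigma} = \left(1 - \frac{1}{\lambda}\right) \Sigma$$
We obtain a prediction for unit i as:
$$\hat{y}_i = E_{Y^*|Y}(y_i^* \mid y_i)$$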
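The prediction step can be sketched in Python for the univariate case. The sketch assumes known (already estimated) parameters: regression mean b0 + b1*x, variance sigma2, inflation factor lam and error probability p_err. It computes the posterior probability of error tau, the conditional mean mu_tilde of the true value given that the observation is in error, and the prediction y_hat as the posterior mean of the true value. Function and variable names are hypothetical.

```python
import numpy as np

def normal_pdf(y, mean, var):
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def predict_true_values(y, x, b0, b1, sigma2, lam, p_err):
    """Posterior error probability, conditional mean under error, and prediction."""
    mean = b0 + b1 * x
    clean = (1.0 - p_err) * normal_pdf(y, mean, sigma2)
    contaminated = p_err * normal_pdf(y, mean, lam * sigma2)
    tau = contaminated / (clean + contaminated)     # P(error | x, y)
    mu_tilde = (y + (lam - 1.0) * mean) / lam       # E(y* | y, x, error)
    y_hat = (1.0 - tau) * y + tau * mu_tilde        # E(y* | y, x)
    return tau, mu_tilde, y_hat

# Two hypothetical units with the same covariate: one plausible value, one outlying value.
x = np.array([1.0, 1.0])
y = np.array([2.1, 12.0])
tau, mu_tilde, y_hat = predict_true_values(
    y, x, b0=0.0, b1=2.0, sigma2=1.0, lam=100.0, p_err=0.05)
```

The plausible value gets a posterior error probability close to 0 and is left almost unchanged, while the outlying value gets a probability close to 1 and is pulled towards the regression prediction.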
Risk and Influence
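The expected error is:
$$\hat{y}_i - y_i = \tau(x_i, y_i)\,(\tilde{\mu}_{x_i, y_i} - y_i)$$
• risk component: $\tau(x_i, y_i)$, the posterior probability that unit i is in error
• influence component: $(\tilde{\mu}_{x_i, y_i} - y_i)$, where $\tilde{\mu}_{x_i, y_i}$ is the conditional mean of the true value given that the observed value is in error
The expected error is the product of the two components. It is natural to define the score function in terms of the expected error.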
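The product form follows in one step once the prediction is written as a posterior mean. In the sketch below, $\tau_i$ stands for the posterior probability of error of unit i and $\tilde{\mu}_i$ for the conditional mean of the true value given an error; the numbers in the comment are purely illustrative.

```latex
\hat{y}_i = (1 - \tau_i)\, y_i + \tau_i\, \tilde{\mu}_i
\quad\Longrightarrow\quad
\hat{y}_i - y_i = \tau_i\, (\tilde{\mu}_i - y_i)
% Illustrative values: y_i = 12, \tilde{\mu}_i = 2.1, \tau_i = 0.9
% give \hat{y}_i - y_i = 0.9 \times (2.1 - 12) = -8.91.
```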
The score function
If a total $T_Y$ in a finite population is to be estimated on a sample S via the robust estimator:
$$\hat{\hat{T}}_Y = \sum_{i \in S} w_i\, \hat{y}_i$$
we define a (local) score function as:
$$r_{iY} = \frac{w_i\, (\hat{y}_i - y_i)}{\hat{\hat{T}}_Y}$$
(weighted expected error for variable Y in unit i, with $w_i$ the sampling weights and $\hat{y}_i$ the model-based predictions)
Ordering the records by that score function (in descending order), correcting the first k units, and summing the $r_{iY}$ scores over all the unedited units, we obtain an estimate of the relative expected residual error $R_{kY}$ in the data:
$$R_{kY} = \sum_{i > k} r_{iY}$$
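The selection procedure can be sketched as follows in Python for a single variable. Two points are assumptions of this sketch rather than statements from the slides: units are ordered by the absolute value of the score, and the stopping rule picks the smallest k after which the absolute residual error stays below the threshold. Names and data are hypothetical.

```python
import numpy as np

def select_units(y_obs, y_pred, weights, threshold):
    """Return indices of the units to edit and the residual error for k = 0..n."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    weights = np.asarray(weights, dtype=float)
    total = np.sum(weights * y_pred)              # robust estimate of the total
    r = weights * (y_pred - y_obs) / total        # signed local scores
    order = np.argsort(-np.abs(r))                # descending absolute score
    r_sorted = r[order]
    # tail[k] = sum of the scores of the units left unedited after editing the first k
    tail = np.concatenate([np.cumsum(r_sorted[::-1])[::-1], [0.0]])
    residual = np.abs(tail)
    # smallest k such that the residual error is below the threshold from k onwards
    above = np.flatnonzero(residual >= threshold)
    k = 0 if above.size == 0 else int(above[-1]) + 1
    return order[:k], residual

# Hypothetical data: five units with unit weights; units 1 and 4 look erroneous.
y_obs = [100.0, 1000.0, 95.0, 110.0, 20.0]
y_pred = [104.0, 105.0, 96.0, 107.0, 100.0]
selected, residual = select_units(y_obs, y_pred, np.ones(5), threshold=0.01)
```

Here editing the two units with the largest scores is enough to bring the estimated relative residual error below 1%.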
Warnings
1) Model assumptions
- true data are assumed to be normal/log-normal
- error is modeled as additive and Gaussian (in a suitable scale)
- covariance matrices of true data and error distributions are
supposed to be proportional
2) Population Estimates
The score function and the stopping criterion have a straightforward interpretation only for linear estimates like means or totals.
The software SeleMix
SeleMix is an R package for selective editing based on a contamination model. Its main functionalities are:
• parameter estimation via ECM algorithm
• prediction of “true” values conditional on observed values according to the estimated model
• computation of score functions, ordering of the units, and identification of influential errors according to the user-specified threshold
SeleMix also provides anticipated values (predictions) for units where
some (or all) of the Y variables are not observed. Missing values in the
X covariates are not allowed.
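To illustrate the kind of estimation involved, here is a simplified Python sketch of an ECM-style algorithm for a univariate contaminated normal regression with one covariate. It is not the SeleMix implementation (SeleMix is an R package and handles the multivariate case and missing Y values); the function name, the starting values and the fixed number of iterations are assumptions made for this example.

```python
import numpy as np

def normal_pdf(y, mean, var):
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def fit_contamination_model(x, y, n_iter=200):
    """ECM-style fit of y = b0 + b1*x + u, with variance sigma2 (clean) or lam*sigma2 (error)."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]            # start from OLS
    resid = y - X @ beta
    sigma2 = (1.4826 * np.median(np.abs(resid))) ** 2      # robust starting variance
    lam, p_err = 10.0, 0.1                                 # arbitrary starting values
    for _ in range(n_iter):
        # E-step: posterior probability that each unit is contaminated
        clean = (1.0 - p_err) * normal_pdf(y, X @ beta, sigma2)
        cont = p_err * normal_pdf(y, X @ beta, lam * sigma2)
        tau = cont / (clean + cont)
        # CM-steps: update each block of parameters given the others
        p_err = tau.mean()
        q = (1.0 - tau) + tau / lam                        # precision weights
        beta = np.linalg.solve(X.T @ (q[:, None] * X), X.T @ (q * y))
        resid = y - X @ beta
        sigma2 = np.sum(q * resid ** 2) / len(y)
        lam = max(np.sum(tau * resid ** 2) / (sigma2 * tau.sum()), 1.0 + 1e-6)
    return beta, sigma2, lam, p_err, tau

# Simulated data with hypothetical true values: b = (1, 2), sigma2 = 1, lam = 100, p_err = 0.1.
rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
is_error = rng.random(n) < 0.1
y = 1.0 + 2.0 * x + rng.normal(size=n) + is_error * rng.normal(0.0, np.sqrt(99.0), n)
beta, sigma2, lam, p_err, tau = fit_contamination_model(x, y)
```

The regression coefficients are updated by weighted least squares, so units with a high posterior probability of error contribute less to the fit, which is what makes the estimates robust.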
The Applications: the surveys
The Economic Surveys
the annual sampling survey on Information and Communication Technology usage and e-commerce in industry (ICT)
the annual sampling survey on Small and Medium Enterprises (SME)
The target variables: Turnover, Costs
The target Parameters: Variables’ Totals (by domain)
The Applications: the auxiliary sources
Administrative Archives
Financial Statements (FS) Corporate companies (~ 15.000 enterprises)
Best harmonized source w.r.t. SBS Regulation definitions
Sector Studies Survey (SS) Fiscal survey (~ 4 million enterprises)
Detailed costs and income
Like financial statement
Statistical Sources
Annual total Survey on the Economic Accounts of Enterprises (SEA) (enterprises with at least 100 employees; ~12,000 enterprises)
ICT - Experiment 1
Objective: Evaluating the effectiveness of the proposed selective editing approach in correctly identifying influential errors and in correctly treating both influential errors and item non-responses in the ICT context
Experimental approach
• Simulation of contaminated values and item non-responses on edited values of Turnover and Costs for the sub-sample of corporate enterprises in the 2009 ICT sample
• Monte Carlo evaluation of selective editing and imputation w.r.t. FS (different thresholds); “corrections” based on either 2009 FS (true) data or model-based predictions
• Auxiliary variables: Turnover and Costs from 2008 FS data
Results: Editing a small number of units is sufficient to remove the most influential errors: the bias of the estimates based on edited data is always below 0.3%, while the RRMSE is quite close to the threshold value (0.5%) for almost all domains
ICT - Results of experiment 1
                                                      Relative bias (%)                                 RRMSE (%)
                                        RAW        EDITED       ROB.EST           RAW        EDITED       ROB.EST
Dom      N n.cont n.out n.sel   turn   cost   turn   cost   turn   cost   turn   cost   turn   cost   turn   cost
G     3497    336   515   116    2.8    2.6    0.0    0.0    0.9    1.2    4.2    3.7    0.2    0.2    1.0    1.3
F     3260    317   565   238   15.4   18.1   -0.2    0.0   -7.6   -7.0   22.3   32.8    0.4    0.2    7.7    7.1
DE     876     85   143    16    4.4   13.6    0.1    0.1   -0.2   -1.0   10.4   39.3    0.3    0.5    2.0    1.8
C     3691    362   494   231   13.7   16.3   -0.1   -0.1    0.9    0.3   19.4   23.9    0.3    0.3    1.0    0.7
H      653     63   144    20    2.7    3.3    0.1    0.0   -0.6   -0.8    8.8   10.5    0.4    0.5    0.9    1.0
L      133     13    25    16   44.5  166.7    0.0   -0.1    3.9   10.2   95.4  686.4    1.0    0.7    7.9   11.5
J      565     55    76    15   16.2   19.4    0.0   -0.1   -1.8   -3.6   35.0   50.1    0.6    0.4    2.1    3.8
I      224     22    35    16    6.4    4.6   -0.2   -0.2    2.1    2.9   15.6   12.3    0.8    0.6    2.3    3.0
NS    1156    111   211    18    6.8    6.5    0.2    0.1    0.5   -0.4   11.0   12.1    0.5    0.7    0.9    0.8
M      450     43    78    38   39.2   30.5   -0.1    0.0   -6.1    7.4   79.5   64.3    0.4    0.4    6.1    7.5
Relative bias and relative root mean square error (RRMSE) of the Turnover (turn) and Costs (cost) estimates based on raw data (RAW), edited data (EDITED) and SeleMix predictions (ROB.EST) (threshold = 0.005)
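For reference, the two evaluation measures can be computed from Monte Carlo replicates as in the Python sketch below. It assumes the common definitions (mean relative error for the bias, root of the mean squared relative error for the RRMSE, both in percent); the slides do not spell out the exact formulas, and the replicated estimates used here are invented.

```python
import numpy as np

def relative_bias_and_rrmse(estimates, true_total):
    """Relative bias (%) and relative root mean square error (%) over replicates."""
    estimates = np.asarray(estimates, dtype=float)
    rel_err = (estimates - true_total) / true_total
    rel_bias = 100.0 * rel_err.mean()
    rrmse = 100.0 * np.sqrt(np.mean(rel_err ** 2))
    return rel_bias, rrmse

# Hypothetical replicated estimates of a total whose true value is 1000.
bias, rrmse = relative_bias_and_rrmse([1010.0, 990.0, 1020.0, 1000.0], 1000.0)
```

Because the RRMSE combines bias and variance, it can stay well above the relative bias when the replicated estimates are scattered around the true value.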
ICT - Experiment 2
Objective: Assessing the potential reduction in follow-up and interactive editing costs obtained by integrating selective editing into the current E&I procedure
Experimental approach
• Application of selective editing to raw Turnover and Costs of all the 2008 ICT responding units (different thresholds)
• Comparison of the parameter estimates obtained after selective editing with the estimates obtained by the current procedure
• Auxiliary variables: Turnover and Costs available in at least one external source (SEA, FS, SME, SS, in this order of priority), year 2008
• Correction using either ICT edited data or model-based predictions
Results
• Far fewer units are selected as suspect than are manually revised under the current approach
• For most domains, the estimates of totals based on selective editing are close to the corresponding final ICT estimates
ICT - Results of experiment 2
Influential errors and missing values imputed with ICT edited data
                             Turnover            Costs
Dom         N  n.sel  n.miss  ICT.Sel  n.miss  ICT.Sel
1         745      3       9     0.90       6     0.55
2         338      6       3     1.62       5     1.70
3         293      0       0    -0.22       1     0.39
4         546      8       2     0.16       7     0.21
5         255      1       1    -0.88       2    -0.35
6        1036     33       8     0.89       8    -0.12
7         146      6       3    -0.21       2     0.26
8         603     19       8     0.34      10    -0.24
9         169      4       1    -0.29       1     0.37
10        416      5       4     0.26       5     0.96
11        201      5       0    -0.48       1    -0.40
11        747     17      15     0.27      17     0.56
12       5155     16      74    -1.68      97    -1.91
13        620      3       7     0.13       9    -0.61
14       2795     13      22    -0.94      29     0.26
15       1174      5      17     0.06      22     0.30
16        752      8       5     0.07       8    -0.01
17         29      0       0     0.00       1     0.20
18        205      6       4     0.23       4     1.97
19        131      1       4     0.04       5     3.69
20        120      0       2     0.35       2    -0.32
21         47      0       0     0.01       0    -0.25
22         36      2       0     0.18       0     0.19
23        406      2       1    -1.77       4    -2.28
24        149     22       2     0.00       2     0.42
25        613     16      10     1.17      12     1.33
26       1124     25      16     0.32      17     0.39
27        176      0       2     0.18       3     0.00
28         74      9       0     0.20       0     0.01
Total   19101    235     220
Relative distances between SeleMix estimates (Sel) and the estimates based on raw data (Raw) and on ICT edited data (ICT) (threshold = 0.01)
SME - Experiment 1
Objective
Assessing the potential reduction in follow-up and interactive editing that could be obtained by integrating selective editing into the current E&I procedure
Experimental approach
• Application of selective editing and imputation to raw Turnover and Costs of all the 2008 SME responding units (different thresholds and imputation approaches)
• Comparison of the parameter estimates obtained after selective editing and imputation with the “true” estimates obtained from administrative archives
• Auxiliary variables: Turnover and Costs available in at least one external source (FS, SS, in this order of priority), year 2007
SME - Results of experiment 1
As expected, higher threshold values imply a consistent reduction in the number of units to be revised, which is balanced by less accurate estimates
In SME, this loss of accuracy seems to affect too many domains
Threshold = 0.01
869 units selected as influential (~2.9% of the experimental sub-sample)
Diff(True.Sel) ≤ 1.5 in 89% of domains (the median of the distribution of Diff(True.Sel) over the domains is 0.65)
Threshold = 0.02
382 units selected as influential (~1.3% of the experimental sub-sample)
Diff(True.Sel) ≤ 1.5 in 75% of domains (the median of the distribution of Diff(True.Sel) over the considered domains is 0.9)
SME - Results of experiment 1
[Figure] Turnover: relative differences between Diff(True.Sel) at threshold 0.01 and at threshold 0.02, where
$$\mathrm{Diff}(\mathrm{True.Sel}) = \frac{\hat{T}_Y^{True} - \hat{T}_Y^{Sel}}{\hat{T}_Y^{True}}$$
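A short Python sketch of how the domain-level comparison could be computed. It assumes that Diff(True.Sel) is reported in percent and in absolute value, which the slides do not state explicitly; the domain totals below are invented for illustration.

```python
import numpy as np

def diff_true_sel(t_true, t_sel):
    """Relative difference (%) between 'true' and selectively edited totals, by domain."""
    t_true = np.asarray(t_true, dtype=float)
    t_sel = np.asarray(t_sel, dtype=float)
    return 100.0 * np.abs(t_true - t_sel) / t_true

# Hypothetical totals for four domains.
t_true = np.array([1000.0, 2000.0, 500.0, 800.0])
t_sel = np.array([1005.0, 1950.0, 499.0, 808.0])

d = diff_true_sel(t_true, t_sel)
share_within = np.mean(d <= 1.5)   # share of domains with Diff(True.Sel) <= 1.5
median_diff = np.median(d)         # median of Diff(True.Sel) over the domains
```

The share of domains under a fixed bound and the median over domains are the two summaries quoted in the SME results.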
Conclusions
Application to ICT data
Fully satisfactory results. The integration of the method in the current E&I procedure is already in progress
Application to SME data
• Further analyses are needed:
• Different thresholds for different domains?
• Additional covariates?
Thank you for your attention