
Multivariate selective editing via mixture models: first applications to Italian structural business surveys

Orietta Luzi, Guarnera U., Silvestri F., Buglielli T., Nurra A., Siesto G.

Italian National Institute of Statistics

UNECE Worksession on Statistical Data Editing

Oslo, 22-24 September 2012

Outline

• Objective of the work

• The SeleMix approach to selective editing

• The Software SeleMix

• The Applications

• Final remarks and future work


Objective of the work


• Assessing the advantages (in terms of quality improvement and cost reduction) of a multivariate, model-based, robust selective editing approach for detecting influential errors in business surveys.

• Exploring the potential benefits of using administrative data when detecting influential errors in business surveys.

The idea is to make selective editing more effective by incorporating the auxiliary information available in external sources (both administrative and statistical) directly into the selective editing strategy.


Selective Editing

• Key elements:

– score function

– cut-off value (threshold) determining the units to be manually reviewed

• The components of a score function are:

– risk ~ probability of error occurrence

– influence ~ (expected) impact on estimates




Score Function

• A local score is often defined for each record and each variable through a comparison of current values and “estimated” true values, e.g.

– historical values on the same units (when available)

– estimates (predictions) obtained using auxiliary information (e.g. admin data) or covariates from the same survey

• Different local scores are combined into a single global score. The cut-off value of the global score determines which units are manually reviewed.
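The combination of local scores into a global score can be sketched as follows. This is only an illustration: the slides do not say which combination rule is used, so taking the maximum over variables is an assumption, and the function names, scores and cut-off are hypothetical.

```python
from typing import Dict, List


def global_scores(local_scores: Dict[str, List[float]]) -> List[float]:
    """Combine per-variable local scores into one global score per unit.

    Assumption: the global score is the maximum local score of the unit.
    """
    columns = list(local_scores.values())
    return [max(unit_scores) for unit_scores in zip(*columns)]


def units_to_review(scores: List[float], cutoff: float) -> List[int]:
    """Return the indices of units whose global score exceeds the cut-off."""
    return [i for i, s in enumerate(scores) if s > cutoff]


# Hypothetical local scores for four units and two variables.
local = {
    "turnover": [0.001, 0.030, 0.002, 0.000],
    "costs": [0.004, 0.010, 0.020, 0.001],
}
scores = global_scores(local)
selected = units_to_review(scores, cutoff=0.015)
print(scores)
print(selected)
```

Other rules (for example the sum of the local scores) would fit the same structure; only `global_scores` would change.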




Selective Editing

The difference between observed and predicted values is due to

• the potential error

• the natural variability of the analyzed quantity.

In the usual setting these two elements cannot be distinguished, so the score of an observation is not directly related to the expected error of that unit.

As a consequence, the selective editing threshold cannot be related to the desired degree of accuracy in the final estimates.

Problem:

Relate the threshold value of the score function to the desired accuracy of the estimates (i.e. the residual error left in the data).




Model-based Selective Editing

Proposed solution: use an approach based on

1) explicit modeling of both the data and the error mechanism (via mixture models). In particular, a latent variable model makes it possible, under certain assumptions, to estimate the expected error associated with each unit.

The method uses normal contamination models, in which the distribution of the erroneous data is obtained from the distribution of the error-free data by inflating the variance.

2) definition of the score function in terms of the conditional distribution of “true” data given observed data.




The model

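Notation: $Y^*$ true data; $Y$ observed data; $X$ covariates (no error); $B$ regression coefficients; $U$ residuals; $I$ Bernoulli variable (error indicator).

True data model:

$$Y^* = XB + U, \qquad U \sim N(0, \Sigma)$$

Error model:

$$Y \mid Y^*, I \;=\; Y^* + I\varepsilon, \qquad \varepsilon \sim N(0, \Sigma_\varepsilon), \qquad \Sigma_\varepsilon = (\alpha - 1)\Sigma, \quad \alpha > 1$$

$$\pi = P(I = 1) = 1 - P(I = 0), \qquad P(Y \neq Y^*) = P(I = 1)$$

Distribution of observed data:

$$f_Y(y) = (1 - \pi)\, N(y; B'x, \Sigma) + \pi\, N(y; B'x, \alpha\Sigma)$$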
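To make the contamination model concrete, here is a minimal univariate sketch with one covariate. The parameter names (`b0`, `b1`, `var`, `pi`, `alpha`) and all numeric values are illustrative assumptions: `pi` stands for the error probability and `alpha` for the variance inflation factor of the erroneous data.

```python
import math
import random


def normal_pdf(y, mean, var):
    """Density of N(mean, var) evaluated at y."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_pdf(y, mean, var, pi, alpha):
    """Observed-data density: (1 - pi) N(mean, var) + pi N(mean, alpha * var)."""
    return (1.0 - pi) * normal_pdf(y, mean, var) + pi * normal_pdf(y, mean, alpha * var)


def simulate(n, b0, b1, var, pi, alpha, rng):
    """Draw (x, y_true, y_obs) triples from the univariate contamination model."""
    data = []
    for _ in range(n):
        x = rng.uniform(0.0, 10.0)
        y_true = b0 + b1 * x + rng.gauss(0.0, math.sqrt(var))
        is_error = rng.random() < pi  # Bernoulli error indicator
        # An error adds Gaussian noise with variance (alpha - 1) * var.
        eps = rng.gauss(0.0, math.sqrt((alpha - 1.0) * var)) if is_error else 0.0
        data.append((x, y_true, y_true + eps))
    return data


rng = random.Random(1)
sample = simulate(n=1000, b0=2.0, b1=0.5, var=1.0, pi=0.1, alpha=25.0, rng=rng)
n_errors = sum(1 for _, y_true, y_obs in sample if y_obs != y_true)
print(n_errors, "contaminated values out of", len(sample))
```

Because the error has mean zero and a covariance proportional to that of the true data, an erroneous value has the same conditional mean as a correct one and differs only in its larger variance, which is what `mixture_pdf` expresses.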




The method

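Model parameters can be estimated from the observed data via EM. These estimates can then be used to estimate the conditional distribution of true data given observed data:

$$f_{Y^*|Y}(y^* \mid y) = (1 - \tau(x, y))\, \delta(y^* - y) + \tau(x, y)\, N(y^*; \tilde{\mu}_{x,y}, \tilde{\Sigma})$$

where

$$\tau(x_i, y_i) = P(y_i \neq y_i^* \mid x_i, y_i)$$

is the posterior probability of error for unit $i$ and, with $\alpha$ the variance inflation factor of the error model,

$$\tilde{\mu}_{x,y} = \frac{y + (\alpha - 1)\, B'x}{\alpha}, \qquad \tilde{\Sigma} = \left(1 - \frac{1}{\alpha}\right)\Sigma$$

We obtain a prediction for unit $i$ as:

$$\hat{y}_i = E_{Y^*|Y}(y^* \mid y_i)$$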
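As an illustration of the prediction step, the sketch below computes, for a single variable, the posterior probability of error, the conditional mean of the true value given that an error occurred, and the resulting prediction. It assumes the model parameters are already known; `mean` stands for the regression prediction for the unit, `pi` for the prior error probability and `alpha` for the variance inflation factor, and all numbers are hypothetical.

```python
import math


def normal_pdf(y, mean, var):
    """Density of N(mean, var) evaluated at y."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def predict_true_value(y, mean, var, pi, alpha):
    """Return (tau, mu_tilde, y_hat) for one observed value y.

    tau      : posterior probability that y is in error
    mu_tilde : expected true value given that y is in error
    y_hat    : expected true value given y
    """
    clean = (1.0 - pi) * normal_pdf(y, mean, var)
    contaminated = pi * normal_pdf(y, mean, alpha * var)
    tau = contaminated / (clean + contaminated)
    mu_tilde = (y + (alpha - 1.0) * mean) / alpha
    y_hat = (1.0 - tau) * y + tau * mu_tilde
    return tau, mu_tilde, y_hat


# A value close to its regression prediction and a value far from it.
tau_near, mu_near, y_hat_near = predict_true_value(101.0, mean=100.0, var=4.0, pi=0.1, alpha=25.0)
tau_far, mu_far, y_hat_far = predict_true_value(130.0, mean=100.0, var=4.0, pi=0.1, alpha=25.0)
print(round(tau_near, 3), round(y_hat_near, 2))
print(round(tau_far, 3), round(y_hat_far, 2))
```

The value close to the regression prediction is left almost unchanged, while the distant one is pulled strongly towards it. For values extremely far from `mean` both densities can underflow to zero, so a production version would work on the log scale.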




Risk and Influence

The expected error is:

$$y_i - \hat{y}_i = \tau(x_i, y_i)\,(y_i - \tilde{\mu}_{x_i, y_i})$$

• risk component: $\tau(x_i, y_i)$

• influence component: $(y_i - \tilde{\mu}_{x_i, y_i})$

The expected error is the product of the two components. It is natural to define the score function in terms of the expected error.
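The factorization into risk and influence can be checked in two lines. In this sketch, $\tau_i$ denotes the posterior probability of error for unit $i$, $\tilde{\mu}_i$ the expected true value given that an error occurred, $\alpha$ the variance inflation factor and $B'x_i$ the regression prediction; this notation is assumed here for illustration.

```latex
\begin{align*}
\hat{y}_i &= (1 - \tau_i)\, y_i + \tau_i\, \tilde{\mu}_i
  \quad\Longrightarrow\quad
  y_i - \hat{y}_i = \tau_i\,(y_i - \tilde{\mu}_i), \\
\tilde{\mu}_i &= \frac{y_i + (\alpha - 1)\, B'x_i}{\alpha}
  \quad\Longrightarrow\quad
  y_i - \tilde{\mu}_i = \Bigl(1 - \frac{1}{\alpha}\Bigr)\,(y_i - B'x_i).
\end{align*}
```

Under these assumptions the expected error is the regression residual, shrunk by the factor $1 - 1/\alpha$ and weighted by the probability that the unit is in error.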




The score function

If a total of $Y$ in a finite population is to be estimated from a sample $S$, with sampling weights $w_i$, via the robust estimator

$$\hat{\hat{T}}_y = \sum_{i \in S} w_i\, \hat{y}_i$$

we define a (local) score function as

$$r_i^Y = \frac{w_i\,(y_i - \hat{y}_i)}{\hat{\hat{T}}_y}$$

(weighted expected error for variable $Y$ in unit $i$).

Ordering the records by that score function (in descending order), correcting the first $k$ units, and summing the scores $r_i^Y$ over all the units not edited, we obtain an estimate of the relative expected residual error $R_k^Y$ in the data:

$$R_k^Y = \sum_{i > k} r_i^Y$$
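The ranking and stopping rule can be sketched as follows. The sketch assumes that units are ranked by the absolute value of their score and that the residual error is taken as the absolute value of the sum of the remaining scores; the slides do not state how signs are handled, so both choices are assumptions, as are the function name and the example data.

```python
def selective_editing(y_obs, y_pred, weights, threshold):
    """Rank units by score and find how many need to be reviewed.

    Returns (order, k): the unit indices sorted by decreasing absolute score,
    and the smallest k such that the estimated relative residual error of the
    units left unedited is below `threshold`.
    """
    total = sum(w * yp for w, yp in zip(weights, y_pred))  # robust total estimate
    scores = [w * (y - yp) / total for y, yp, w in zip(y_obs, y_pred, weights)]
    order = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    for k in range(len(order) + 1):
        residual = abs(sum(scores[i] for i in order[k:]))
        if residual < threshold:
            return order, k
    return order, len(order)


# Hypothetical observed values, model predictions and sampling weights.
y_obs = [100.0, 250.0, 90.0, 400.0]
y_pred = [100.0, 120.0, 95.0, 390.0]
weights = [10.0, 10.0, 10.0, 10.0]
order, k = selective_editing(y_obs, y_pred, weights, threshold=0.01)
print(order, k)
```

In this example only the second unit, whose observed value is far from its prediction, has to be reviewed to bring the estimated residual error below 1%.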




Warnings

1) Model assumptions

- true data are assumed to be normal or log-normal

- the error is modeled as additive and Gaussian (on a suitable scale)

- the covariance matrices of the true data and error distributions are assumed to be proportional

2) Population Estimates

The score function and the stopping criterion have a straightforward interpretation only for linear estimates like means or totals.
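The remark about a "suitable scale" can be made explicit for the log-normal case. The following is a short worked step, assuming the additive error model is applied to the logarithms of the data, with $I$ the error indicator and $\varepsilon$ the Gaussian error (notation assumed for illustration):

```latex
\begin{align*}
\log Y &= \log Y^* + I\,\varepsilon
  \quad\Longrightarrow\quad
  Y = Y^* \, e^{I\varepsilon}.
\end{align*}
```

So an error that is additive on the log scale acts as a multiplicative factor on the original scale, and predictions computed on the log scale have to be transformed back before totals are estimated.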




The software SeleMix

SeleMix is an R package for selective editing based on a contamination model. Its main functionalities are:

• parameter estimation via the ECM algorithm

• prediction of “true” values conditional on observed values according to the estimated model

• computation of score functions, ordering of the units, and identification of influential errors according to the user-specified threshold

SeleMix also provides anticipated values (predictions) for units where some (or all) of the Y variables are not observed. Missing values in the X covariates are not allowed.
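To give an idea of what ECM estimation involves, here is a deliberately simplified sketch for a single variable without covariates. It is not the SeleMix code, which is an R package and handles multivariate data with covariates; the function name, starting values and simulated data are all assumptions made for illustration.

```python
import math
import random


def normal_pdf(y, mean, var):
    """Density of N(mean, var) evaluated at y."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_contamination_model(y, n_iter=200):
    """ECM-style fit of (1 - pi) N(mu, var) + pi N(mu, alpha * var), alpha > 1."""
    n = len(y)
    mu = sorted(y)[n // 2]  # robust start: the median
    sample_var = sum((v - mu) ** 2 for v in y) / n
    pi, alpha = 0.05, 10.0  # arbitrary starting values
    var = sample_var / (1.0 - pi + pi * alpha)
    for _ in range(n_iter):
        # E-step: posterior probability that each value is contaminated.
        tau = []
        for v in y:
            clean = (1.0 - pi) * normal_pdf(v, mu, var)
            cont = pi * normal_pdf(v, mu, alpha * var)
            tau.append(cont / (clean + cont))
        # CM-step 1: update mu with alpha held fixed (precision-weighted mean).
        prec = [(1.0 - t) + t / alpha for t in tau]
        mu = sum(p * v for p, v in zip(prec, y)) / sum(prec)
        # CM-step 2: update var, alpha and pi with mu held fixed.
        sq = [(v - mu) ** 2 for v in y]
        sum_tau = sum(tau)
        var = sum((1.0 - t) * d for t, d in zip(tau, sq)) / (n - sum_tau)
        alpha = max(sum(t * d for t, d in zip(tau, sq)) / (sum_tau * var), 1.0 + 1e-6)
        pi = sum_tau / n
    return mu, var, pi, alpha


# Simulated data: true values N(10, 1); 10% of them get an extra N(0, 24) error.
rng = random.Random(0)
data = [
    10.0 + rng.gauss(0.0, 1.0) + (rng.gauss(0.0, math.sqrt(24.0)) if rng.random() < 0.1 else 0.0)
    for _ in range(2000)
]
mu_hat, var_hat, pi_hat, alpha_hat = fit_contamination_model(data)
print(round(mu_hat, 2), round(var_hat, 2), round(pi_hat, 3), round(alpha_hat, 1))
```

The mean is updated conditionally on the current inflation factor, and the variance, inflation factor and mixing weight are then updated conditionally on the new mean, which is what makes this a conditional-maximization (ECM) scheme rather than a plain EM step.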




The Applications: the surveys

The economic surveys:

• the annual sample survey on Information and Communication Technology usage and e-commerce in industry (ICT)

• the annual sample survey on Small and Medium Enterprises (SME)

The target variables: Turnover, Costs

The target parameters: totals of the variables (by domain)





The Applications: the auxiliary sources

Administrative archives

• Financial Statements (FS): corporate companies (~15,000 enterprises); the best harmonized source w.r.t. the SBS Regulation definitions

• Sector Studies Survey (SS): fiscal survey (~4 million enterprises); detailed costs and income, similar to financial statements

Statistical sources

• Annual total survey on the Economic Accounts of Enterprises (SEA): enterprises with 100 or more employees (~12,000 enterprises)





ICT - Experiment 1

Objective: evaluating the effectiveness of the proposed selective editing approach in terms of correct identification of influential errors and correct treatment of both influential errors and item non-response in the ICT context

Experimental approach

• Simulation of contaminated values and item non-response on the edited values of Turnover and Costs in the sub-sample of corporate enterprises of the 2009 ICT sample

• Monte Carlo evaluation of selective editing and imputation w.r.t. FS (different thresholds); “corrections” based on either 2009 FS (true) data or model-based predictions

• Auxiliary variables: Turnover and Costs from 2008 FS data

Results

Editing a small number of units is sufficient to remove the most influential errors: the bias of the estimates based on edited data is always below 0.3%, while the RRMSE is quite close to the threshold value (0.5%) for almost all domains.





ICT - Results of experiment 1

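Columns 6-11: relative bias (%); columns 12-17: RRMSE (%). Within each block: RAW, EDITED and ROB.EST, each for turnover (turn.) and costs (cost).

| Dom | N | n.cont | n.out | n.sel | Bias RAW turn. | Bias RAW cost | Bias EDITED turn. | Bias EDITED cost | Bias ROB.EST turn. | Bias ROB.EST cost | RRMSE RAW turn. | RRMSE RAW cost | RRMSE EDITED turn. | RRMSE EDITED cost | RRMSE ROB.EST turn. | RRMSE ROB.EST cost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| G | 3497 | 336 | 515 | 116 | 2.8 | 2.6 | 0.0 | 0.0 | 0.9 | 1.2 | 4.2 | 3.7 | 0.2 | 0.2 | 1.0 | 1.3 |
| F | 3260 | 317 | 565 | 238 | 15.4 | 18.1 | -0.2 | 0.0 | -7.6 | -7.0 | 22.3 | 32.8 | 0.4 | 0.2 | 7.7 | 7.1 |
| DE | 876 | 85 | 143 | 16 | 4.4 | 13.6 | 0.1 | 0.1 | -0.2 | -1.0 | 10.4 | 39.3 | 0.3 | 0.5 | 2.0 | 1.8 |
| C | 3691 | 362 | 494 | 231 | 13.7 | 16.3 | -0.1 | -0.1 | 0.9 | 0.3 | 19.4 | 23.9 | 0.3 | 0.3 | 1.0 | 0.7 |
| H | 653 | 63 | 144 | 20 | 2.7 | 3.3 | 0.1 | 0.0 | -0.6 | -0.8 | 8.8 | 10.5 | 0.4 | 0.5 | 0.9 | 1.0 |
| L | 133 | 13 | 25 | 16 | 44.5 | 166.7 | 0.0 | -0.1 | 3.9 | 10.2 | 95.4 | 686.4 | 1.0 | 0.7 | 7.9 | 11.5 |
| J | 565 | 55 | 76 | 15 | 16.2 | 19.4 | 0.0 | -0.1 | -1.8 | -3.6 | 35.0 | 50.1 | 0.6 | 0.4 | 2.1 | 3.8 |
| I | 224 | 22 | 35 | 16 | 6.4 | 4.6 | -0.2 | -0.2 | 2.1 | 2.9 | 15.6 | 12.3 | 0.8 | 0.6 | 2.3 | 3.0 |
| NS | 1156 | 111 | 211 | 18 | 6.8 | 6.5 | 0.2 | 0.1 | 0.5 | -0.4 | 11.0 | 12.1 | 0.5 | 0.7 | 0.9 | 0.8 |
| M | 450 | 43 | 78 | 38 | 39.2 | 30.5 | -0.1 | 0.0 | -6.1 | 7.4 | 79.5 | 64.3 | 0.4 | 0.4 | 6.1 | 7.5 |

Relative bias and relative root mean square error (RRMSE) for the estimates based on raw data (RAW), edited data (EDITED) and SeleMix predictions (ROB.EST) (threshold = 0.005)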
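For reference, the two quality measures reported in the table can be computed from Monte Carlo replicates as in the sketch below. The slides do not give the formulas, so the usual definitions (mean deviation and root mean square deviation from the true total, both divided by the true total) are assumed; the replicate values are hypothetical.

```python
import math


def relative_bias_and_rrmse(estimates, true_total):
    """Return (relative bias %, RRMSE %) of Monte Carlo estimates of a total."""
    m = len(estimates)
    bias = sum(e - true_total for e in estimates) / m
    mse = sum((e - true_total) ** 2 for e in estimates) / m
    return 100.0 * bias / true_total, 100.0 * math.sqrt(mse) / true_total


# Hypothetical estimates of a total from four Monte Carlo replicates.
estimates = [1010.0, 990.0, 1020.0, 1000.0]
rel_bias, rrmse = relative_bias_and_rrmse(estimates, true_total=1000.0)
print(round(rel_bias, 2), round(rrmse, 2))
```

With these definitions the RRMSE is never smaller than the absolute relative bias, which matches the pattern in every row of the table.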





ICT - Experiment 2

Objective: assessing the advantages, in terms of potential reduction of follow-up and interactive editing costs, of integrating selective editing into the current E&I procedure


• Application of selective editing to raw Turnover and Costs of all the 2008 ICT responding units (different thresholds)

• Comparative evaluation of the parameter estimates obtained after selective editing against the estimates obtained with the current procedure

• Auxiliary variables: Turnover and Costs available in at least one external source (SEA, FS, SME, SS, in order of priority), year 2008

• Correction using either ICT edited data or model-based predictions

Results

• Large reduction in the number of units selected as suspect compared with the number of units manually reviewed under the current approach

• Small distances between the estimates of totals based on selective editing and the corresponding final ICT estimates for most domains
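The choice of the auxiliary value when several external sources are available can be sketched as a simple priority lookup. The sketch assumes that "with priority" means the sources are tried in the order in which they are listed; the function name and the example values are hypothetical.

```python
from typing import Dict, Optional, Sequence

# Source order as listed on the slide (assumed to be the priority order).
PRIORITY = ("SEA", "FS", "SME", "SS")


def pick_auxiliary(values: Dict[str, Optional[float]],
                   priority: Sequence[str] = PRIORITY) -> Optional[float]:
    """Return the value from the first source in `priority` that has one."""
    for source in priority:
        value = values.get(source)
        if value is not None:
            return value
    return None  # the unit has no auxiliary information


# Hypothetical turnover values for three enterprises.
print(pick_auxiliary({"FS": 120.0, "SS": 118.0}))
print(pick_auxiliary({"SEA": None, "SS": 95.0}))
print(pick_auxiliary({}))
```

Units for which the lookup returns `None` have no covariate value, so they would have to be handled separately, since missing values in the covariates are not allowed by the model.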






ICT - Results of experiment 2

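Influential errors and missing values imputed with ICT edited data

| Dom | N | n.sel | Turnover n.miss | Turnover ICT.Sel | Costs n.miss | Costs ICT.Sel |
|---|---|---|---|---|---|---|
| 1 | 745 | 3 | 9 | 0.90 | 6 | 0.55 |
| 2 | 338 | 6 | 3 | 1.62 | 5 | 1.70 |
| 3 | 293 | 0 | 0 | -0.22 | 1 | 0.39 |
| 4 | 546 | 8 | 2 | 0.16 | 7 | 0.21 |
| 5 | 255 | 1 | 1 | -0.88 | 2 | -0.35 |
| 6 | 1036 | 33 | 8 | 0.89 | 8 | -0.12 |
| 7 | 146 | 6 | 3 | -0.21 | 2 | 0.26 |
| 8 | 603 | 19 | 8 | 0.34 | 10 | -0.24 |
| 9 | 169 | 4 | 1 | -0.29 | 1 | 0.37 |
| 10 | 416 | 5 | 4 | 0.26 | 5 | 0.96 |
| 11 | 201 | 5 | 0 | -0.48 | 1 | -0.40 |
| 11 | 747 | 17 | 15 | 0.27 | 17 | 0.56 |
| 12 | 5155 | 16 | 74 | -1.68 | 97 | -1.91 |
| 13 | 620 | 3 | 7 | 0.13 | 9 | -0.61 |
| 14 | 2795 | 13 | 22 | -0.94 | 29 | 0.26 |
| 15 | 1174 | 5 | 17 | 0.06 | 22 | 0.30 |
| 16 | 752 | 8 | 5 | 0.07 | 8 | -0.01 |
| 17 | 29 | 0 | 0 | 0.00 | 1 | 0.20 |
| 18 | 205 | 6 | 4 | 0.23 | 4 | 1.97 |
| 19 | 131 | 1 | 4 | 0.04 | 5 | 3.69 |
| 20 | 120 | 0 | 2 | 0.35 | 2 | -0.32 |
| 21 | 47 | 0 | 0 | 0.01 | 0 | -0.25 |
| 22 | 36 | 2 | 0 | 0.18 | 0 | 0.19 |
| 23 | 406 | 2 | 1 | -1.77 | 4 | -2.28 |
| 24 | 149 | 22 | 2 | 0.00 | 2 | 0.42 |
| 25 | 613 | 16 | 10 | 1.17 | 12 | 1.33 |
| 26 | 1124 | 25 | 16 | 0.32 | 17 | 0.39 |
| 27 | 176 | 0 | 2 | 0.18 | 3 | 0.00 |
| 28 | 74 | 9 | 0 | 0.20 | 0 | 0.01 |
| Total | 19,101 | 235 | 220 | | | |

Relative distances between SeleMix estimates (Sel) and the estimates based on raw data (Raw) and on ICT edited data (ICT) (threshold = 0.01)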





SME - Experiment 1

Objective

Assessing the advantages, in terms of potential reduction of follow-up and interactive editing, of integrating selective editing into the current E&I procedure


• Application of selective editing and imputation to raw Turnover and Costs of all the 2008 SME responding units (different thresholds and imputation approaches)

• Comparative evaluation of the parameter estimates obtained after selective editing and imputation against the “true” estimates obtained from administrative archives

• Auxiliary variables: Turnover and Costs available in at least one external source (FS, SS, in order of priority), year 2007






SME - Results of experiment 1

As expected, higher threshold values imply a consistent reduction in the number of expected revisions, at the price of less accurate estimates.

In SME this loss of accuracy seems to affect too many domains.

Threshold = 0.01

869 units selected as influential (~2.9% of the experimental sub-sample)

Diff(True.Sel) ≤ 1.5 in 89% of domains (the median of the distribution of Diff(True.Sel) over the domains is 0.65)

Threshold = 0.02

382 units selected as influential (~1.3% of the experimental sub-sample)

Diff(True.Sel) ≤ 1.5 in 75% of domains (the median of the distribution of Diff(True.Sel) over the domains is 0.9)





SME - Results of experiment 1

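Turnover: comparison of the relative differences Diff(True.Sel) by domain for threshold = 0.01 and threshold = 0.02 [figure not reproduced], where

$$\mathrm{Diff}(True.Sel) = \frac{\hat{T}_Y^{True} - \hat{T}_Y^{Sel}}{\hat{T}_Y^{True}}$$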
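The summary figures quoted for the SME experiment (share of domains with Diff(True.Sel) ≤ 1.5 and median of Diff(True.Sel)) can be computed as in the sketch below. It assumes that Diff(True.Sel) is expressed in percent and taken in absolute value, which the slides do not state explicitly; the domain totals are hypothetical.

```python
from statistics import median


def diff_true_sel(true_total, sel_total):
    """Relative difference (in %) between the 'true' and the SeleMix-based total."""
    return 100.0 * abs(true_total - sel_total) / true_total


def summarize(diffs, limit=1.5):
    """Return (share of domains with diff <= limit in %, median diff)."""
    share = 100.0 * sum(d <= limit for d in diffs) / len(diffs)
    return share, median(diffs)


# Hypothetical (true total, SeleMix-based total) pairs for four domains.
totals = {
    "A": (1000.0, 995.0),
    "B": (2000.0, 2050.0),
    "C": (500.0, 499.0),
    "D": (800.0, 806.0),
}
diffs = [diff_true_sel(t, s) for t, s in totals.values()]
share, med = summarize(diffs)
print(diffs, share, med)
```

Running the same summary for each threshold gives the pair of figures used to compare thresholds.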



Conclusions

Application to ICT data

Fully satisfactory results. The integration of the method into the current E&I procedure is already in progress.

Application to SME data

• Further analyses are needed:

– Different thresholds for different domains?

– Additional covariates?






Thank you for your attention


