Developing a theoretical framework for selective editing based on modelling and optimization
Work Session on Statistical Data Editing Budapest, 14-16 September 2015
Pedro Revilla, INE Spain
Outline
• Editing based on modelling
• Macroediting tools based on models
• Selective editing based on optimization
Editing based on modelling
• Specification of the edits based on the use of statistical models
• Essential editing criterion: data have to be consistent with the available information
• The available information is summarized in a model
• Data are consistent when they are close to the model prediction
References
- Revilla, P. and Rey, P. (1999). “Selective editing methods based on time series modelling.” UN/ECE Work Session on SDE. 2-4 June 1999, Rome.
- Revilla, P. (2002). “An E&I method based on time series modelling designed to improve timeliness.” UN/ECE Work Session on SDE. 27-29 May 2002, Helsinki.
Microediting
Edit
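$P\left(\hat{x}_{ijt} - 1.96\,\sigma_{ij} < x_{ijt} < \hat{x}_{ijt} + 1.96\,\sigma_{ij}\right) = 0.95$, where $\hat{x}_{ijt}$ is the model prediction of $x_{ijt}$
Microimputation
$\hat{x}_{ijt}$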
Macroediting
Edit
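$P\left(\hat{I}_{it} - 1.96\,\sigma_{i} < I_{it} < \hat{I}_{it} + 1.96\,\sigma_{i}\right) = 0.95$, where $\hat{I}_{it}$ is the model prediction of the index $I_{it}$
Macroimputation
$\hat{I}_{it}$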
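The micro and macro edits share the same rule: accept the observed value when it falls inside the 95% interval around the model forecast, and use the forecast itself as the imputation. A minimal Python sketch of that rule follows; the function name `model_based_edit` and the numbers are illustrative assumptions, not part of the original presentation.

```python
def model_based_edit(observed, forecast, sigma, z=1.96):
    """Return (passes_edit, imputation) for one datum or one index.

    The edit accepts the observed value when it lies strictly inside
    forecast +/- z * sigma. The imputation is the model forecast.
    """
    lower = forecast - z * sigma
    upper = forecast + z * sigma
    passes_edit = lower < observed < upper
    return passes_edit, forecast


# Hypothetical values: forecast 100 and sigma 5 give the interval (90.2, 109.8).
inside = model_based_edit(104.0, 100.0, 5.0)
outside = model_based_edit(112.0, 100.0, 5.0)
print(inside, outside)
```

The same function serves for a microdatum or for an aggregated index, since only the forecast and its standard deviation change.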
Selective editing tools
• Surprises
• Influences
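Surprise $S_{i,t}$ for the index $I_{i,t}$
Relative change between the observed and the forecasted data
$S_{i,t} = \dfrac{I_{i,t} - \hat{I}_{i,t}}{\hat{I}_{i,t}}$
Distribution
• $e_{i,t} = \ln I_{i,t} - \ln \hat{I}_{i,t} \cong \dfrac{I_{i,t} - \hat{I}_{i,t}}{\hat{I}_{i,t}}$
• $e_{i,t}$ is $N(0, \sigma_i)$, so $S_{i,t}$ is approximately $N(0, \sigma_i)$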
Confidence interval for the surprises
$P\left(-1.96\,\sigma_i < S_{i,t} \le 1.96\,\sigma_i\right) = 0.95$
Edit: outliers can be defined as the indices whose surprise lies outside the confidence interval
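The surprise and the outlier rule can be sketched in a few lines of Python. The helper names `surprise` and `is_outlier` and the sample figures are assumptions made for illustration only.

```python
def surprise(observed_index, forecast_index):
    """Relative change between the observed and the forecasted index."""
    return (observed_index - forecast_index) / forecast_index


def is_outlier(observed_index, forecast_index, sigma, z=1.96):
    """Flag the index when its surprise falls outside (-z*sigma, z*sigma]."""
    s = surprise(observed_index, forecast_index)
    return not (-z * sigma < s <= z * sigma)


# Hypothetical index with sigma = 0.03, so the interval is (-0.0588, 0.0588].
print(is_outlier(110.0, 100.0, 0.03))  # surprise 0.10 lies outside the interval
print(is_outlier(102.0, 100.0, 0.03))  # surprise 0.02 lies inside the interval
```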
Surprises
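Standard Surprise
$\dfrac{S_{i,t}}{\sigma_i} = \dfrac{I_{i,t} - \hat{I}_{i,t}}{\hat{I}_{i,t}} \cdot \dfrac{1}{\sigma_i}$
It allows the comparison between indices of different variability
Weighted Standard Surprise
$\dfrac{S_{i,t}}{\sigma_i}\, w_i = \dfrac{I_{i,t} - \hat{I}_{i,t}}{\hat{I}_{i,t}} \cdot \dfrac{w_i}{\sigma_i}$
It allows ranking indices taking into account not only the magnitude of the surprise but also the different weights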
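As a rough illustration of the ranking, the sketch below computes the weighted standard surprise for three made-up indices and orders them by absolute value, largest first. Ordering by absolute value is an assumption; all names and figures are hypothetical.

```python
def weighted_standard_surprise(observed, forecast, sigma, weight):
    """Surprise divided by sigma (standard surprise), then times the weight."""
    s = (observed - forecast) / forecast
    return s / sigma * weight


# Hypothetical indices: name -> (observed, forecast, sigma, weight).
indices = {
    "A": (110.0, 100.0, 0.05, 2.0),  # surprise 0.10, weighted value about 4
    "B": (95.0, 100.0, 0.01, 1.0),   # surprise -0.05, weighted value about -5
    "C": (101.0, 100.0, 0.02, 4.0),  # surprise 0.01, weighted value about 2
}

# Rank by absolute weighted standard surprise, most surprising first.
ranking = sorted(
    indices,
    key=lambda name: abs(weighted_standard_surprise(*indices[name])),
    reverse=True,
)
print(ranking)
```

Index B comes first even though A has the larger raw surprise, because B is far less variable.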
Influence over the aggregated index $I_t$
The influence of an individual datum over an aggregated magnitude is the difference between the observed aggregated magnitude and the value of this same magnitude when the individual datum is not available
$\mathrm{INF}_{i_0,j_0}^{I_t} = \sum_i w_i I_{i,t} - \left[\sum_{i \ne i_0} w_i I_{i,t} + w_{i_0} I_{i_0,t-1} \dfrac{\sum_{j \ne j_0} q_{i_0,j,t} + \hat{q}_{i_0,j_0,t}}{\sum_j q_{i_0,j,t-1}}\right]$
$= w_{i_0} I_{i_0,t-1} \dfrac{q_{i_0,j_0,t} - \hat{q}_{i_0,j_0,t}}{\sum_j q_{i_0,j,t-1}}$
Influence factors
$\mathrm{INF}_{i_0,j_0}^{I_t} = w_{i_0} I_{i_0,t-1} \dfrac{q_{i_0,j_0,t} - \hat{q}_{i_0,j_0,t}}{\sum_j q_{i_0,j,t-1}}$
• The product (or activity) weight $w_{i_0}$
• The index $I_{i_0,t-1}$, which “updates” the weight
• A measure of the relative discrepancy between the observed and the imputed data: $\dfrac{q_{i_0,j_0,t} - \hat{q}_{i_0,j_0,t}}{\sum_j q_{i_0,j,t-1}}$
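The closed form of the influence can be checked numerically against its definition (aggregate with the observed datum minus aggregate with the imputed datum). The sketch below assumes, as the formula implies, that each elementary index is chained as the previous index times the ratio of current to previous totals. Function names and figures are hypothetical.

```python
def influence(weight, previous_index, observed_q, imputed_q, previous_total_q):
    """Closed form: w * I_{t-1} * (q_obs - q_imputed) / sum_j q_{j,t-1}."""
    return weight * previous_index * (observed_q - imputed_q) / previous_total_q


def chained_index(previous_index, current_values, previous_values):
    """Assumed update rule: I_t = I_{t-1} * sum(q_t) / sum(q_{t-1})."""
    return previous_index * sum(current_values) / sum(previous_values)


# Hypothetical branch i0: weight 0.2, previous index 105, three reporting units.
weight, previous_index = 0.2, 105.0
q_now = [50.0, 30.0, 20.0]
q_before = [40.0, 35.0, 25.0]
imputed_first_unit = 42.0

closed_form = influence(weight, previous_index, q_now[0], imputed_first_unit, sum(q_before))

# Definition: weighted index with the observed datum minus with the imputed datum.
with_observed = weight * chained_index(previous_index, q_now, q_before)
with_imputed = weight * chained_index(
    previous_index, [imputed_first_unit] + q_now[1:], q_before
)
print(closed_form, with_observed - with_imputed)
```

The contributions of all other branches cancel in the difference, which is why only branch $i_0$ appears in the example.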
Surprises
| Sector | Actual rate | Forecasted rate | Surprise | Standard surprise | Weighted standard surprise |
| --- | --- | --- | --- | --- | --- |
| 4243 | 70.28 | 3.32 | 64.73 | 3.79 | 17.10 |
| 2511 | -27.73 | -3.29 | -25.25 | -3.11 | -16.93 |
| 4110 | -50.24 | -6.89 | -70.96 | -6.84 | -16.89 |
| 2514 | -15.92 | 4.64 | -19.62 | -3.00 | -16.87 |
| 2512 | 39.39 | -11.83 | 58.12 | 7.22 | 16.51 |
| 4752 | -0.74 | 2.06 | -2.75 | -1.09 | -15.66 |
| 3299 | -11.97 | 4.45 | -15.70 | -2.02 | -15.57 |
| 4751 | 22.82 | -7.36 | 32.55 | 2.34 | 14.64 |
| 3630 | -0.28 | 3.68 | -3.82 | -0.81 | -14.54 |
| 3166 | 15.97 | 5.83 | 9.58 | 1.89 | 13.92 |
Macroediting tools based on models
• We use a RegARIMA model to estimate a set of characteristics of a short term indicator
• Characteristics:
- Level behaviour
- Seasonal behaviour
- Calendar effects
- Other deterministic effects
- Outliers
- Uncertainty
References
- Revilla, P. and Rey, P. (2000). “Analysis and quality control from ARIMA modelling”. UN/ECE Work Session on SDE. Cardiff, 18-20 October 2000.
Figure: monthly plots (Jan to Dec) for Gas Manufacturing and Dairy Industries
Figure: monthly plots (Jan to Dec) for Beer Brewing and Clothing Industries
Index of Industrial Production

| Region | Level behaviour | Seasonal behaviour | Working-days effect (%) | Easter effect (%) | February strike effect (%) | Outliers | Uncertainty (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| National Total | Trend | Yes | 1.9 | -4.2 | -3.8 | | 2.1 |
| Andalusia | Trend | Yes | 1.7 | -2.0 | (*) +7.5 | Jan 1996 (-), Feb 1997 (+), Feb 1998 (+) | 2.7 |
| Aragón | Trend | Yes | 2.1 | -4.1 | (*) -6.6 | Dec 1994 (+), Feb 1997 (-) | 2.2 |
| Asturias | Trend | Yes | 1.3 | -4.3 | -4.4 | | 2.3 |
| Balearic Islands | Trend | Yes | 1.6 | -3.5 | | | 2.3 |
| Canary Islands | Local oscillations | No | 1.8 | -3.7 | | Mar 1994 (+), Mar 1997 (+) | 2.9 |
| Cantabria | Trend | Yes | 1.9 | -4.2 | | | 2.8 |
| Castilla-León | Trend | Yes | 1.6 | -5.2 | (*) -8.1 | Feb 1996 (-), Jul 1995 (-), Dec 1996 (-), Feb 1997 (-), Nov 1997 (-) | |
Index of Industrial Production

| Region | Level behaviour | Seasonal behaviour | Working-days effect (%) | Easter effect (%) | February strike effect (%) | Outliers | Uncertainty (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Castilla-La Mancha | Trend | Yes | 0.8 | 4.0 | | | 2.7 |
| Catalonia | Trend | Yes | 2.0 | -4.7 | -4.6 | | 2.4 |
| Valencian Community | Trend | Yes | 2.2 | -4.6 | -4.4 | | 1.7 |
| Estremadura | Local oscillations | No | 1.3 | | | | 4.5 |
| Galicia | Trend | Yes | 1.9 | -3.3 | (*) -5.6 | Dec 1995 (+), Feb 1997 (-) | 1.9 |
| Madrid | Trend | Yes | 1.6 | -5.2 | | | 3.0 |
| Murcia Region | Trend | Yes | 2.1 | -3.2 | | | 2.7 |
| Navarre | Trend | Yes | 2.4 | -4.7 | (*) -7.6 | Feb 1997 (-) | 2.6 |
| Basque Country | Trend | Yes | 2.4 | -3.8 | (*) -7.7 | Aug 1993 (-), Feb 1997 (-) | 2.6 |
| Rioja | Trend | Yes | 2.5 | -4.9 | | Jan 1994 (-), Dec 1994 (+) | 3.5 |
Selective editing based on optimization
• Selective editing as an optimization problem
• To reconcile two objectives (reduce editing work and maintain quality at the aggregate level)
• We determine a selection strategy that edits the minimum number of units while meeting given accuracy requirements for the aggregates
• Score functions can be obtained
References
- Arbués, I., Revilla, P. and Salgado, D. (2013). “An optimization approach to selective editing.” Journal of Official Statistics (JOS).
- Arbués, I., González, M. and Revilla, P. (2012). “A class of stochastic optimization problems with application to selective data editing.” Optimization.
General approach
True ($y_k^{0}$), observed ($y_k^{obs}$) and edited ($y_k^{edit}$) values
The variables to be determined are the components of the selection strategy vector
$R^T = (R_1, R_2, \ldots, R_n)$
for the sample units $k = 1, \ldots, n$,
where $R_k = 0$ if unit $k$ is selected for interactive editing and $R_k = 1$ otherwise
Objective function to maximize
The objective function to maximize, given the available information $Z$, is
$E_m\left[\sum_k R_k \mid Z\right]$
(in matrix notation, $E_m\left[\mathbf{1}^T R \mid Z\right]$, where $\mathbf{1}$ stands for a vector of ones),
whose maximization amounts to minimizing the number of selected units
Constraints
Each constraint controls the loss of accuracy due to non-selected units, measured with the chosen loss function $L$
The two loss functions most used in practice:
- absolute loss $L = L_1(a, b) = |a - b|$
- squared loss $L = L_2(a, b) = (a - b)^2$
For these loss functions, each constraint can always be written as a bound on a quadratic form, denoted by $E_m\left[R^T \Delta R \mid Z\right]$
Generic optimization problem
$[P_0]$ $\max\; E_m\left[\mathbf{1}^T R \mid Z\right]$
s.t. $E_m\left[R^T \Delta^{(q)} R \mid Z\right] \le \eta_q$, $q = 1, 2, \ldots, Q$
$R \in \Omega_0$
$\Omega_0$ denotes the admissible outcome space of $R$
$q$ indexes the different constraints (several constraints may arise because there are multiple variables of interest in the questionnaire)
By choosing the auxiliary information $Z$ and the subset $S_0$ of sought selection strategies in the general problem $P_0$, we obtain different versions of the optimization problem
Stochastic version
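This version arises if no auxiliary information is used and the sought selection strategies are of the form
$\left\{ R \in S : R_k = \begin{cases} 1 & \text{if } \xi_k < Q_k \\ 0 & \text{if } \xi_k > Q_k \end{cases} \right\}$
where $\xi_k$ is a $U(0,1)$ random variable and $Q_k = Q_k(X, Y^{obs}, S)$ is a continuous random variable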
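A strategy of this form can be simulated by drawing one uniform number per unit and comparing it with $Q_k$. The sketch below treats the $Q_k$ as already realized numbers in $[0, 1]$; the function name `draw_selection` and the values are assumptions for illustration.

```python
import random


def draw_selection(q_values, rng):
    """Return R with R_k = 1 (not selected) if xi_k < Q_k, else 0 (selected)."""
    return [1 if rng.random() < q_k else 0 for q_k in q_values]


rng = random.Random(0)  # fixed seed so the illustration is reproducible
q_values = [0.0, 1.0, 0.5]  # hypothetical realized values of Q_k
r = draw_selection(q_values, rng)
print(r)
```

With $Q_k = 0$ the unit is always selected for editing, and with $Q_k = 1$ it never is, so $Q_k$ acts as the probability of leaving unit $k$ unedited.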
Combinatorial version
We use all available auxiliary information as well as the observed values in the sample, and do not restrict the form of the sought selection strategies ($S_0 = S$)
$[P_{co}]$ $\max\; \mathbf{1}^T r$
s.t. $r^T M^{(q)} r \le m_q^2$, $q = 1, \ldots, Q$
$r \in B^n$, the realized selection, with $B = \{0, 1\}$
$M^{(q)}$ condenses the modelling of the measurement error
$m_q$ are bounds chosen by the statistician
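For a handful of units the combinatorial problem can be solved by plain enumeration, which makes its structure easy to see. The sketch below is only an illustration under that assumption: it is exponential in the number of units and is not the algorithm used by the authors. All names and numbers are hypothetical.

```python
from itertools import product


def quadratic_form(r, m):
    """Compute r' M r for a 0/1 tuple r and a square matrix m (list of lists)."""
    n = len(r)
    return sum(r[i] * m[i][j] * r[j] for i in range(n) for j in range(n))


def solve_by_enumeration(matrices, bounds):
    """Maximize sum(r) over r in {0,1}^n subject to r' M(q) r <= m_q^2 for all q."""
    n = len(matrices[0])
    best = None
    for r in product((0, 1), repeat=n):
        feasible = all(
            quadratic_form(r, m) <= b ** 2 for m, b in zip(matrices, bounds)
        )
        if feasible and (best is None or sum(r) > sum(best)):
            best = r
    return best


# One hypothetical constraint with a diagonal M and bound m_1 = 2.5 (m_1^2 = 6.25).
m1 = [[9.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 4.0]]
solution = solve_by_enumeration([m1], [2.5])
print(solution)
```

In this toy case the first unit carries too much potential error to be left unedited, so it is the only one selected ($r_1 = 0$).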
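Score functions obtained from the optimization approach
An additional assumption is to neglect the cross-unit terms in each constraint
Since $R_k \in \{0, 1\}$ implies $R_k^2 = R_k$, the constraints then become linear in $R$ and can be rewritten as
$E_m\left[R^T \Delta^{(q)} R \mid Z\right] = E_m\left[R^T \mathrm{diag}(\Delta^{(q)}) \mid Z\right]$
where $\mathrm{diag}(\Delta^{(q)})$ is the vector of diagonal entries of $\Delta^{(q)}$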
Score function
One constraint ($Q = 1$)
Unit $k$ is selected provided $M_{kk} > 1/\lambda^*$
$M_{kk}$ can be regarded as a single score and $1/\lambda^*$ as the threshold value.
$\lambda^* M_{kk}$ can be considered a “standardized” score, in the sense that the threshold value is generically set to 1.
Multiple constraints ($Q > 1$)
Each $\lambda_q^* M_{kk}^{(q)}$ is a standardized local score;
$\sum_q \lambda_q^* M_{kk}^{(q)}$ is the standardized global score, with the generic global threshold value 1.
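The selection rule with several constraints can be sketched as follows: each unit's global score is the sum over constraints of the starred multiplier times its diagonal term $M_{kk}^{(q)}$, and the unit is selected when that score exceeds 1. The multipliers are taken as given inputs here (how they are obtained is outside this sketch), and all names and values are hypothetical.

```python
def global_scores(local_m, multipliers):
    """local_m[q][k] holds M_kk^(q); return sum_q multiplier_q * M_kk^(q) per unit."""
    n_units = len(local_m[0])
    return [
        sum(mult * m_q[k] for mult, m_q in zip(multipliers, local_m))
        for k in range(n_units)
    ]


def select_units(local_m, multipliers, threshold=1.0):
    """Return the positions of the units whose global score exceeds the threshold."""
    scores = global_scores(local_m, multipliers)
    return [k for k, score in enumerate(scores) if score > threshold]


# Two hypothetical constraints (rows) and three units (columns).
local_m = [[4.0, 0.5, 1.0], [2.0, 0.25, 8.0]]
multipliers = [0.25, 0.125]  # assumed values of the starred multipliers
print(global_scores(local_m, multipliers))
print(select_units(local_m, multipliers))
```

The third unit is selected only because of the second constraint, which is the practical point of combining local scores into a global one.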
Evaluation
Compare the performance of the score obtained under the optimization approach with that of the score function described, for example, in Hedlin (2003)
$\delta_i^{0} = \omega_i \left| x_i^{obs} - x_i^{pre} \right|$, prediction: data from $t-1$
$\delta_i^{1} = \omega_i \left| x_i^{obs} - x_i^{pre} \right|$, prediction: ARIMA model
$\delta_i^{2}$: optimization method
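The two benchmark scores differ only in where the prediction comes from. A minimal sketch of such a score is given below, assuming the absolute difference between observed and predicted values is intended, as is usual for scores of this kind. Unit names and figures are invented.

```python
def difference_score(weight, observed, predicted):
    """Weight times the absolute gap between the observed and the predicted value."""
    return weight * abs(observed - predicted)


# Hypothetical units: (name, weight, observed value, predicted value).
# For the first benchmark the prediction is simply the value reported at t-1.
units = [
    ("u1", 10.0, 120.0, 100.0),
    ("u2", 50.0, 101.0, 100.0),
    ("u3", 2.0, 90.0, 100.0),
]
scores = {name: difference_score(w, obs, pre) for name, w, obs, pre in units}
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores, ranked)
```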
Effectiveness
$E_{1,j}^{n} = \sum_{i \ge n} (\omega_i^{j})^2 \left(x_{ij}^{obs} - x_{ij}^{0}\right)^2$
$E_{2,j}^{n} = \left[\sum_{i \ge n} \omega_i^{j} \left(x_{ij}^{obs} - x_{ij}^{0}\right)\right]^2$
(Units arranged in descending order according to the corresponding score function)
These measures can be interpreted as estimates of the remaining error after editing the first $n$ units
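Both effectiveness measures can be computed for a single variable by sorting the units by score, dropping the first n (the edited ones) and accumulating the weighted errors of the rest. The sketch below assumes the true values are known, as in an evaluation study, and uses zero-based positions, so the sum over the remaining units starts at position n. The function name and the data are hypothetical.

```python
def remaining_error(scores, weights, observed, true_values, n_edited):
    """Return (E1, E2) over the units left unedited after editing the top n_edited."""
    # Positions of the units sorted by descending score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    rest = order[n_edited:]
    weighted_errors = [weights[i] * (observed[i] - true_values[i]) for i in rest]
    e1 = sum(err ** 2 for err in weighted_errors)  # sum of squared weighted errors
    e2 = sum(weighted_errors) ** 2                 # square of the summed weighted errors
    return e1, e2


# Hypothetical data for three units.
scores = [3.0, 1.0, 2.0]
weights = [1.0, 2.0, 1.0]
observed = [10.0, 5.0, 8.0]
true_values = [7.0, 4.0, 9.0]

print(remaining_error(scores, weights, observed, true_values, 0))
print(remaining_error(scores, weights, observed, true_values, 1))
```

In the second measure errors of opposite sign cancel, which is why it can be smaller than the first for the same set of unedited units.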
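Comparison of score function methods

| Score | Turnover $E_1$ | Turnover $E_2$ | Orders $E_1$ | Orders $E_2$ |
| --- | --- | --- | --- | --- |
| $\delta^{0}$ | 0.43 | 0.44 | 1.16 | 1.33 |
| $\delta^{1}$ | 0.30 | 0.38 | 0.36 | 0.45 |
| $\delta^{2}$ | 0.21 | 0.26 | 0.28 | 0.37 |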
Figure: remaining error as a function of the number of edited questionnaires (N ≈ 9000).
Final remarks
• We have introduced theoretical frameworks using models and optimization techniques
• We consider the search for an adequate selection strategy as a generic optimization problem with a stochastic and a combinatorial version. We have shown that a certain score function provides the solution to the problem with linear constraints
• The selection obtained outperforms that of traditional score functions in general
• More methodological research and practical experiences are needed