Data Editing
United Nations Statistics Division (UNSD)
16 March 2011, Santiago, Chile
Editing and Imputation Defined
• Data editing: Identification and flagging of missing, invalid, inconsistent or anomalous entries
• Imputation: Resolves problems identified in editing
Editing and Imputation Process Flow
[Figure: three-step process flow diagram]
A General Editing and Imputation Process
1. Identify and treat initial errors
  • At the data capture stage
  • At the data entry stage
    – Ex: Data entered into a table is shifted by a row
2. Identify and treat errors
  a. Interactively/manually treat influential errors
  b. Automatically treat non-influential errors
3. Check the aggregated output
Editing Errors
• Two categories of errors
  – Systematic – reported consistently by some of the respondents
    • Ex: Gross values are reported instead of net values
    • Ex: Units are reported in thousands
  – Random – non-systematic or caused by accident
    • Ex: An extra digit is accidentally typed in the response
• Manifestations of errors can be systematic or random
  – Missing
    • Ex: A variable is left blank because the respondent does not know the answer to the question, does not want to answer the question or does not understand the question
  – Outliers – values that deviate from a model
    • Ex: Unanticipated large values as compared to historic trend
  – Violation of logical or consistency rules
    • Ex: A total value is larger than the sum of its components
• Edit rules are used to detect errors and often define how they should be treated
Systematic Errors
• Errors that are reported consistently over time
  – Unit error
    • Ex: $x_{t-1} / x_t \le 300$
  – Sign error
  – Bugs in the collection vehicle
  – Misunderstanding a question or skip rules
    • Ex: systematic missing values
• Detection
  – High failure rates of edits
  – Outlier detection (e.g. for unit errors)
  – Knowledge of the survey and the raw data processing
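A ratio check for unit errors, like the example on this slide, can be sketched in Python. This is only an illustration: the function name `flag_unit_errors` and the sample figures are hypothetical, the bound of 300 comes from the slide's example, and testing the inverse ratio as well is an added assumption.

```python
def flag_unit_errors(previous, current, bound=300.0):
    """Return indices of records whose period-to-period ratio breaks the bound.

    The edit passes when previous/current <= bound. Checking the inverse
    ratio too is an assumption made here so that a unit switch in either
    period is caught.
    """
    flagged = []
    for i, (prev, curr) in enumerate(zip(previous, current)):
        if prev <= 0 or curr <= 0:
            continue  # ratio undefined; leave these records to other edits
        ratio = prev / curr
        if ratio > bound or 1.0 / ratio > bound:
            flagged.append(i)
    return flagged


# Hypothetical turnover values; record 1 switched from units to thousands.
previous = [52_000, 48_000, 1_200]
current = [50_500, 47, 1_150]
print(flag_unit_errors(previous, current))  # [1]
```

A high share of flagged records for one variable would point to a systematic problem rather than isolated typing errors.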
Systematic errors (2)
Suggestions
• Improvements in the survey or processing procedures should be made
• When systematic errors are identified, they should be turned into edit rules
• Detecting and correcting is cost effective
• Should be treated before random errors
Missing Values
• Stem from questions a respondent did not answer
• Detection is usually simple
Suggestions
• Do not ignore missing values (→ bias and loss of estimate precision)
  – Missing values may not be missing at random
• Do not replace with zeros (→ inaccurate results)
• Nonresponse indicators should be compiled and analyzed because missing values may be systematic
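The sketch below compiles a simple nonresponse indicator (the item nonresponse rate per variable) and shows how zero filling distorts a mean. The record layout, field names and figures are hypothetical, and `None` is assumed to mark an unanswered question.

```python
def item_nonresponse_rates(records, variables):
    """Share of records with a missing (None) value, per variable."""
    n = len(records)
    return {v: sum(1 for r in records if r.get(v) is None) / n for v in variables}


def mean_of_reported(records, variable):
    """Mean over reported values only (missing values excluded, not zeroed)."""
    values = [r[variable] for r in records if r.get(variable) is not None]
    return sum(values) / len(values)


# Hypothetical responses; None marks an unanswered question.
records = [
    {"employees": 10, "wages": 200.0},
    {"employees": 12, "wages": None},
    {"employees": None, "wages": 260.0},
    {"employees": 8, "wages": None},
]
rates = item_nonresponse_rates(records, ["employees", "wages"])
print(rates)  # {'employees': 0.25, 'wages': 0.5}

# Replacing missing wages with zeros halves the mean in this toy data.
zero_filled = sum((r["wages"] or 0.0) for r in records) / len(records)
print(mean_of_reported(records, "wages"), zero_filled)  # 230.0 115.0
```

Neither figure is a proper estimate. The comparison only illustrates the distortion from zero filling, and the mean of reported values can itself be biased when values are not missing at random.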
Outliers
• Observations that do not fit well to a model
  – Ex: Median - k*IQR < value < Median + k*IQR
  – Ex: Month-on-month change <= 50%
• May be defined by one variable (univariate) or a set of variables (multivariate)
• Two types
  – Representative: correct with similar units in population
  – Non-representative: either incorrect or correct but unique
    • Ex: correct – isolated labor strike at a plant
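The median and IQR rule from the first example can be sketched as a univariate check. The value of `k`, the function name `iqr_outliers` and the sample data are assumptions for illustration; the slide does not specify them.

```python
import statistics


def iqr_outliers(values, k=3.0):
    """Return values outside the open interval (median - k*IQR, median + k*IQR)."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = median - k * iqr, median + k * iqr
    return [v for v in values if not (lower < v < upper)]


# Hypothetical monthly production figures with one suspicious value.
production = [98, 102, 100, 97, 103, 101, 99, 450]
print(iqr_outliers(production))  # [450]
```

A flagged value still needs review: it may be a representative outlier, a correct but unique observation, or an error.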
Outliers (2)
• Detection
  – Univariate
  – Multivariate
  – Periodic data (e.g. Hidiroglou-Berthelot)
  – Regression models or tree-models
Edit Rules
• Edit rules are used to determine whether a value is consistent or may be erroneous
  – Surveys are often created to allow these rules
• Edit rules flag data in two ways
  – Fatal edit – indicates a value that is (almost) certainly in error
  – Query edit – indicates values that may be in error
Types of Edit Rules
• Validation edits – often in the form of if-then statements
  – Ex: if total hours worked > 0 then employees > 0
  – Ex: if Σproduction quantity > 0 then Σproduction value > 0
  – Ex: if revenue from manufacturing plant > 0 then
    1. hours worked by machinery technicians > 0
    2. plant capacity utilization > 0
    3. Σproduction volume > 0
    4. Σproduction value > 0
• Balance edits – detail items must add to total
  – Ex: total employee remuneration = wages + salaries + employer contributions to social security + welfare benefits + profits distributed to workers
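A validation edit and a balance edit of the kind listed above can be written as small predicates that return `True` when a record passes. The field names, the rounding tolerance and the sample record are hypothetical.

```python
import math

# Hypothetical field names for the remuneration components on the slide.
COMPONENTS = ("wages", "salaries", "social_security",
              "welfare_benefits", "profits_distributed")


def validation_edit(record):
    """If total hours worked > 0 then employees > 0."""
    return record["hours_worked"] <= 0 or record["employees"] > 0


def balance_edit(record, tolerance=0.5):
    """Detail items must add to the reported total, up to a rounding tolerance."""
    detail_sum = sum(record[name] for name in COMPONENTS)
    return math.isclose(record["total_remuneration"], detail_sum, abs_tol=tolerance)


record = {
    "hours_worked": 1600, "employees": 0,
    "total_remuneration": 1000,
    "wages": 400, "salaries": 300, "social_security": 150,
    "welfare_benefits": 100, "profits_distributed": 50,
}
print(validation_edit(record))  # False: hours reported but no employees
print(balance_edit(record))     # True: components sum to 1000
```

Whether a failed edit counts as fatal or as a query is a separate decision attached to each rule.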
Types of Edit Rules (2)
• Ratio edits – the ratio of two data items is bounded by lower and upper bounds. The pairs should be correlated.
  – Ex: total hours/employee/day is between 6 and 10 (very correlated)
  – Ex: plant capacity utilization <= 20% change from previous month
  – Ex: wages (W) should change within 10% of the change in total employment (E):
    $(E_t/E_{t-1} - 1) - 0.1 \le W_t/W_{t-1} - 1 \le (E_t/E_{t-1} - 1) + 0.1$
  – Ex: Σproduct value / Σproduct quantity <= 10% change from previous month
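The wage and employment example translates directly into code. This is a minimal sketch; the function name `wage_change_edit` and the sample figures are hypothetical, and the band of 0.1 is taken from the slide.

```python
def wage_change_edit(e_prev, e_curr, w_prev, w_curr, band=0.10):
    """Pass when wage growth lies within `band` of employment growth."""
    employment_growth = e_curr / e_prev - 1.0
    wage_growth = w_curr / w_prev - 1.0
    return employment_growth - band <= wage_growth <= employment_growth + band


# Employment grows 5%; wages growing 10% pass, wages growing 30% fail.
print(wage_change_edit(100, 105, 2000, 2200))  # True
print(wage_change_edit(100, 105, 2000, 2600))  # False
```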
Types of Edit Rules (3)
• Hidiroglou-Berthelot is a particular type of ratio edit
  – Ex: Employee month-on-month change
    <= 100 employees: <= 50% change from previous month
    100 < employees <= 200: <= 20% change from previous month
    > 200 employees: <= 10% change from previous month
[Figure: % change from previous month (vertical axis, -0.6 to 0.6) plotted against number of employees (horizontal axis, 0 to 1000)]
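The size-dependent bounds in the example can be sketched as a stepwise rule. This encodes only the fixed tiers shown on the slide, not the full Hidiroglou-Berthelot procedure; assigning the size class from the previous month's employee count is an assumption, and the function names are hypothetical.

```python
def max_allowed_change(employees_prev):
    """Size-dependent bound on relative month-on-month change (slide values)."""
    if employees_prev <= 100:
        return 0.50
    if employees_prev <= 200:
        return 0.20
    return 0.10


def size_dependent_edit(employees_prev, employees_curr):
    """Pass when the relative change stays within the bound for the unit's size.

    Assumes employees_prev > 0 and that the previous month's count
    determines the size class.
    """
    change = abs(employees_curr - employees_prev) / employees_prev
    return change <= max_allowed_change(employees_prev)


print(size_dependent_edit(40, 55))    # True: 37.5% change, bound 50%
print(size_dependent_edit(150, 190))  # False: about 26.7% change, bound 20%
print(size_dependent_edit(800, 900))  # False: 12.5% change, bound 10%
```

Tighter bounds for larger units reflect that the same relative change is less plausible, and more influential, for a large enterprise than for a small one.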
Editing & Imputation Process
• Interactive/Manual – a record with flagged data is manually reviewed, preferably by a subject matter expert
• Automatic – a record with flagged data is automatically reviewed and corrected by a computer
• Selective – designed to route edits/imputations into interactive or automatic streams
  – based on influential vs. non-influential errors
• Macro-editing
Selective Editing
• Distinguishes between errors in values that have a significant influence on the survey estimate and those that are insignificant to the estimate
• Selective editing splits raw data into two streams:
  – critical stream: records that most likely contain influential errors, plus records of large companies
  – non-critical stream: records that are unlikely to contain influential errors
• A score function determines which responses go into which stream
Selective Editing (2)
• Local score function = influence * risk
• For example:
  Influence = $w_i \tilde{y}_i$
  Risk = $|y_i - \tilde{y}_i| / \tilde{y}_i$
  where $y_i$ is the raw value, $\tilde{y}_i$ is the anticipated value and $w_i$ is the sampling weight
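A local score of this form can be computed as below. The sketch assumes influence is the weighted anticipated value and risk is the relative deviation of the raw value from the anticipated value; the function name and figures are hypothetical.

```python
def local_score(raw, anticipated, weight):
    """Local score = influence * risk for one variable of one record.

    Assumes anticipated > 0, influence = weight * anticipated and
    risk = |raw - anticipated| / anticipated.
    """
    influence = weight * anticipated
    risk = abs(raw - anticipated) / anticipated
    return influence * risk


# Raw value 1500 against an anticipated 1000, sampling weight 20.
print(local_score(1500, 1000, 20))  # 10000.0
```

Under these definitions the product simplifies to the weighted absolute deviation, so a small relative error in a heavily weighted large unit can outrank a large relative error in a small one.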
Selective Editing (3)
• Local score functions are aggregated into global score functions for each record
  – First local scores are scaled, e.g. dividing observed values by mean values
  – Scaled local scores are combined into a global score. For example: Minkowski metric (a common approach)
    $GS_r = \left( \sum_{i=1}^{n} LS_{r,i}^{\alpha} \right)^{1/\alpha}$
  – The influence of large local scores increases with α
    α = 1: simple sum of local scores
    α = 2: Euclidean metric
    α → ∞: max local score
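The Minkowski combination of local scores can be sketched as follows. The function name `global_score` is hypothetical, the local scores are assumed to be non-negative and already scaled, and the limiting case is handled by returning the maximum directly.

```python
import math


def global_score(local_scores, alpha=2.0):
    """Minkowski combination of scaled, non-negative local scores for one record."""
    if math.isinf(alpha):
        return max(local_scores)  # limiting case as alpha grows without bound
    return sum(s ** alpha for s in local_scores) ** (1.0 / alpha)


scaled = [3.0, 4.0]
print(global_score(scaled, alpha=1.0))       # 7.0 (simple sum)
print(global_score(scaled, alpha=2.0))       # 5.0 (Euclidean)
print(global_score(scaled, alpha=math.inf))  # 4.0 (largest local score)
```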
Selective Editing (4)
• GS cut-off threshold must be determined
  – All records above the cut-off are selected for interactive editing
  – A simulation can be performed on previous data to determine a threshold
    • Raw unedited values and corresponding edited values are used
    • The first p% of records are edited and the resultant estimate is compared with the fully edited estimate
    • Trial and error will lead to estimates that are the same and a corresponding cut-off value
• Alternatively, a threshold doesn’t need to be used
  – Records can be edited in priority order until time or budget constraints tell one to stop
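The simulation on previous data can be sketched as a search for the fewest edited records that reproduce the fully edited estimate. Several things here are assumptions: the estimate is an unweighted total, "the same" is read as agreement within a relative tolerance, and the function names and data are hypothetical.

```python
def partial_edit_total(raw, edited, scores, n_edit):
    """Total when only the n_edit highest-scoring records are edited."""
    order = sorted(range(len(raw)), key=lambda i: scores[i], reverse=True)
    chosen = set(order[:n_edit])
    return sum(edited[i] if i in chosen else raw[i] for i in range(len(raw)))


def find_cutoff(raw, edited, scores, tolerance=0.01):
    """Return (n_edit, cutoff_score) for the fewest edits whose total is
    within a relative `tolerance` of the fully edited total."""
    full = sum(edited)
    ranked = sorted(scores, reverse=True)
    for n_edit in range(len(raw) + 1):
        total = partial_edit_total(raw, edited, scores, n_edit)
        if abs(total - full) <= tolerance * abs(full):
            cutoff = ranked[n_edit - 1] if n_edit else float("inf")
            return n_edit, cutoff
    return len(raw), ranked[-1]  # fallback: edit every record


# Hypothetical previous-period data: raw values, edited values, global scores.
raw = [100, 5000, 210, 95, 300]
edited = [100, 500, 200, 95, 300]
scores = [0.0, 9.0, 0.5, 0.0, 0.0]
print(find_cutoff(raw, edited, scores))  # (1, 9.0)
```

In this toy data, editing only the highest-scoring record brings the total within 1% of the fully edited total, so records scoring at or above 9.0 would go to interactive editing.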
Selective Editing (5)
• A score function can be augmented in many ways
  – E.g. Size criteria where large enterprises are always selected for critical stream (influence irrespective of risk)
• Selective editing improves efficiency
Macro-Editing
• Macro-editing techniques account for the distribution of variables and for the plausibility of estimates
• Two forms of macro-editing
  – Aggregation method
  – Distribution method
Macro-Editing - Aggregation
• Verifies whether the figures to be published seem plausible
• Compare estimates with
  – Previous estimate values
  – Values from other related sources
  – Related estimates (such as electricity production and consumption)
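Comparing current aggregates with previous estimate values can be sketched as a simple plausibility check. The threshold of 15%, the function name and the series names are hypothetical; in practice the bound would depend on the volatility of each series.

```python
def implausible_aggregates(current, previous, max_relative_change=0.15):
    """Return aggregates whose relative change from the previous estimate
    exceeds the bound, with the change rounded to three decimals."""
    flagged = {}
    for name, value in current.items():
        change = (value - previous[name]) / previous[name]
        if abs(change) > max_relative_change:
            flagged[name] = round(change, 3)
    return flagged


# Hypothetical published totals for two periods.
previous = {"electricity_production": 1000.0, "cement_output": 500.0}
current = {"electricity_production": 1040.0, "cement_output": 660.0}
print(implausible_aggregates(current, previous))  # {'cement_output': 0.32}
```

A flagged aggregate is a prompt to drill down into the contributing records, not proof of an error.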
Macro-Editing - Distribution
• Available data used to characterize distribution of variables
• Individual values are compared with this distribution
• Records that contain uncommon values may require further inspection and possibly editing
Macro-Editing Example: Graphical Editing
• Univariate plot
• Bivariate scatter plot