1 data editing, coding, and just a little imputation katherine (jenny) thompson office of...
TRANSCRIPT
1
Data Editing, Coding, and Just a Little Imputation
Katherine (Jenny) Thompson
Office of Statistical Methods and Research
for Economic Programs
(301) 763-4941
2
The Basics: What is Editing?
Editing (procedures) review reported/keyed data for errors and pinpoints “inconsistent” values• For “industry”• For respondent
Editing does not change the data. Itemsthat fail edits are• referred to an analyst; or • automatically imputed (replaced with consistent
values)
3
The Basics: What Is Imputation?
Imputation is the replacement of a missing or incorrectly reported item using logical edits or statistical procedures.
In other words,
Imputation replaces a missing or incorrect data item with an “educated guess.”
4
The Basics: What is Coding?
Coding is the assignment of recognizable values to flags that describe key characteristics of the unit or item, such as
• Industry (unit level)• Response status (unit or item level)• Source of data correction (item level)• Imputation model (item level)
5
We Begin With Coding
Before we can evaluate whether a response is reasonable, we have to know where it comes from:• Classification variable(s) value, e.g., industry, state
• Frame information may be erroneous or • unit may have changed classification value
Each unit must be assigned classification code(s) before editing/imputation
6
We End With Coding
At the end of the processing cycle, we want to know • How the data were changed,• Where the data were changed, • Why (if possible) data were changed, and • The final status of the reporting unit
(respondent, non-respondent).
7
Some Edit DefinitionsEditing: Procedures for detecting
“incorrect” keyed or respondent
data.
Micro-Editing: Editing at the individual record
(questionnaire) level
Macro-Editing:Editing at the tabulated value level
8
“Typical” Editing Processing Flow
• Micro-editing (static)• Performed on a flow basis • Predetermined edit tests and edit parameters (historic data)• Administered by machine• Resolved by machine and human
• Outlier detection (dynamic)• Performed after “close-out”• Administered by machine• Often resolved by human
• Macro-editing (dynamic)• See above
9
Micro-Edits are Either:
Fatal Must be resolved before subsequentediting
• Unit is Out-of-Scope for Survey• Unit is missing classification variable value• Required data item not reported
Query Can be corrected “automatically”• Detail items do not add to reported total• Ratio of two items is outside (user-
determined) limits
10
Where Do Micro-Edits Come From?
• Questionnaire
• Reality
• Subject-Matter Expert Rules
• (Enforced) Statistical Relationships
11
Fictional Sample Questionnaire
Instructions: Report all dollar figures in thousands. Report all hours in thousands. Report employment in units.
Millions Thousands
Item 1. ANNUAL PAYROLL
Item 2. 1ST QUARTER PAYROLL
Item 3. SALES
3.a. ON SITE MANUFACTURES
3.b. REPACKAGED MANUFACTURES
3.c. TOTAL (3.a. + 3.b.)
Item 4. TOTAL HOURS WORKED
Item 5. EMPLOYMENT
12
Edit Sources: Questionnaire
Item 3. SALES
3.a. ON SITE MANUFACTURES
3.b. REPACKAGED MANUFACTURES
3.c. TOTAL (3.a. + 3.b.)
Balance EditItem 3.a. Value + Item 3.b. Value = Item 3.c. Value
Things have to add up!
13
Edits Sources: Questionnaire/Reality
Millions Thousands
Item 1. ANNUAL PAYROLL
Item 2. 1ST QUARTER PAYROLL
Ratio EditANNUAL PAYROLL/1ST QUARTER PAYROLL 1
Can’t spend more on payroll in one quarter than for the entire year!
14
Edit Sources: Questionnaire/Reality
Millions Thousands
Item 4. TOTAL HOURS WORKED
Item 5. EMPLOYMENT
( ) (.
2 0 h o u rsw eek
w eek sy ea r )
1 0 0 0 (rep o rtin g u n it)
4 80 9 6
( ) ( ).
2 4 h o u rsd ay
d ay sy ea r
1 0 0 0 (rep o rtin g u n it)
3 6 5
8 7 6
Ratio Edit
0.96 < TOTAL HOURS WORKED/EMPLOYMENT < 8.76
15
Edits Sources: Questionnaire/Reality
Item 5. EMPLOYMENT
Range Edit0 EMPLOYMENT 5,615,727
A unit can’t have more employees than the population of the resident state (or negatively-value employees!)
16
Edit Sources: Subject-Matter Rules
Millions Thousands
Item 1. ANNUAL PAYROLL
Item 3. SALES
3.a. ON SITE MANUFACTURES
3.b. REPACKAGED MANUFACTURES
3.c. TOTAL (3.a. + 3.b.)
Ratio Edit
TOTAL SALES/ANNUAL PAYROLL > 1
“Full-year reporters should operate at a profit!”
17
Edit Sources: Statistical Relationships
Millions Thousands
Item 1. ANNUAL PAYROLL
Item 5. EMPLOYMENT
Ratio Edit
A ANNUAL PAYROLL/EMPLOYMENT B
Wage per employee should be within the (industry) range.
18
Examples of Fatal Micro-edits
• Classification Edits
• Required Data Item Tests
19
Examples of Query Micro-edits
• List Directed (Verification) Edits• Skip Pattern Validation Edits • Range Edits (Including negative tests)• Ratio Edits
• Within same questionnaire • Current to prior period
• Balance Edits• Subject-matter rules
20
List Directed/Verification Edits
Purpose: To compare the reported value of a data field to a pre-determined
list of legal values.
• Machine edits, but highly dependent on data-quality of list
• Human (manual) correction of edit failures
21
Skip Pattern Validation Edits
Purpose: To verify that values of skip items are
consistent with the skip instructions
provided on the questionnaire.
Machine edits that CAN be resolved by machine-imputation• Subject-matter rules (if..then..logic)• Operations Research approach
22
Range Edits
Purpose: To check the reported value of a data item to see if it is within specified minimum and maximum values.
Form of edit: lower bound data item upper bound
• Upper and lower bounds are tolerances.• If data item is not contained within the bounds, then it
fails the range edit (“out of tolerance”).• Negative tests are a special case of range edits.• Can be used to define an imputation region.
23
Range Edits
Examples:
0 Employment 301,064,982 (2006 U.S. Population)
0 Sales 12,455.8 billion (2005 Gross Domestic Product)
0 Percent of work done in category 100%
24
Ratio Edits
Purpose: To compare two “related” items in a questionnaire to see if reported
values are consistent.
Form of Ratio Edit:
• Upper and lower bounds are known as tolerances.• Tolerances generally developed from prior period
data.• If ratio is not contained within the bounds, then it
fails the ratio edit (“out of tolerance”).
LX
XU 1
2
25
Some Reasons for Ratio Editing
One data item is a function of another.
Annual Payroll = 1st Quarter Payroll + Payroll for Remaining 3 Quarters
Ratio Edit: 4.4PAYROLL QUARTER 1ST
PAYROLL ANNUAL1
26
Some Reasons for Ratio Editing
One data item can only be evaluated in comparison with another item
(reasonable lower bound)
(reasonable upper bound)
( ) (.
2 0 h o u rsw eek
w eek sy ea r )
1 0 0 0 (rep o rtin g u n it)
4 80 9 6
( ) ( ).
2 4 h o u rsd ay
d ay sy ea r
1 0 0 0 (rep o rtin g u n it)
3 6 5
8 7 6
76.8EMPLOYEES OF NUMBER
WORKEDHOURS TOTAL96.0
27
Some Reasons for Ratio Editing
One data item is a good predictor of another.
Annual Payroll = factor Total Employment
0
1000
2000
3000
4000
5000
6000
0 20 40 60 80 100 120
Total Employment
An
nu
al
Payro
ll
28
Plot of a Typical Ratio Edit
0
1000
2000
3000
4000
5000
6000
7000
8000
0 20 40 60 80 100 120
Total Employment
An
nu
al P
ayro
ll
Census Data Lower Tolerance Upper Tolerance
29
Advantages of Ratio Edits
• Useful for detecting systematic and random errors• Reasonable comparisons for quantitative data • Verifiable assumptions• Often insensitive to changes in economy when
both items are in the same units • Imply certain imputation models• Can be solved simultaneously
• imputation region implications
30
Disadvantages of Ratio Edits
• Edit failure identifies a pair of potentially incorrect data fields
• Need to have a “tie-breaker” • Often work best when combined with other
edits (can be ratio edits)• Very dependent on the distribution of ratios
• Highly correlated• Goes through origin
31
“Best” Practices for Ratio Edits• Incorporate unit size categories as well as
classification variables in editing cells• Perform preliminary data analysis to determine
validity of edit model• Incorporate tests to prior data from same unit and
item when reasonable• Use non-parametric outlier-resistant methods for
setting ratio edit tolerances• Audit edits
• An edit test that has a high rate of failure could indicate problems with the tolerances or the test itself
32
Periodic Data and Ratio Edits (Caution)
0
10
20
30
40
50
60
70
80
0 5 10 15 20 25 30 35
Prior Month's Number of Employees
Cu
rren
t M
on
th's
Nu
mb
er o
f E
mp
loye
es
33
Brief Digression on Imputation
Situation: Missing item or item marked forimputation (replacement) due to edit failure(s)
We would like the machine to automatically replace the “inconsistent” item with a consistent value.
34
The ideal “imputations” find replacement values that are still considered reported (from the same respondent)
Examples• divide reported data by correct reporting unit • replace reported total with sum of details
35
Link Between Imputation and Program
• Published tabulations (macro-data)• Ratio imputation models• Regression imputation models
• Published micro-data• Hot deck imputation
36
Commonly-Used Imputation Methods(Economic Data)
Rounding/Data Slides (systematic error)• Respondent data divided by unit conversion factor
(e.g., imputed value = reported value/1,000)
Direct Substitution• Another data item (same questionnaire)• Absolute value of reported/keyed item• Sum of Reported Details (logical edit)• Derived value from other reported/keyed item • Previously reported value (historic) from same respondent• Administrative data value (same respondent)
37
Ratio Imputation (Model Imputation)
imputed item = (factor) (another data field)• Same reporting unit/questionnaire• Edit-passing item
Industry (Category) Average Ratio (use average ratio of two items in industry/category)e.g., factor = industry wage/employee ratio
Historic Imputation (Auxiliary Trend) (use ratio of prior data to current data for same respondent)
e.g. factor = previous tabulated value of edit-failing item previous tabulated value of auxiliary data field
38
Balance Edits
Purpose: To determine if detail items add to associated reported total.
Form of Edit: TOTAL = DETAIL1 + DETAIL2 + ... + DETAILn
• Developed from questionnaire• A set of details along with their associated total is
called a balance complex.• More complicated balance complexes
• Nested 1-Dimensional• 2-Dimensional
39
Sample Questionnaire Example
Instructions: Report all dollar figures in thousands. Report all hours in thousands. Report employment in units.
Millions Thousands
Item 1. ANNUAL PAYROLL
Item 2. 1ST QUARTER PAYROLL
Item 3. SALES
3.a. ON SITE MANUFACTURES
3.b. REPACKAGED MANUFACTURES
3.c. TOTAL (3.a. + 3.b.)
Item 4. TOTAL HOURS WORKED
Item 5. EMPLOYMENT
40
“Fixing” a Failed Balance Edit
Editing generally integrated with imputation:
Editor decides which is more believable: TOTAL or SUM OF DETAILS
Only change one side of balance complex (TOTAL or SUM OF DETAILS)
41
Balance Edit Definitions
Residual: TOTAL - SUM OF DETAILS
Failed edit solution can depend on
• SIZE of residual (absolute tolerance)• RATIO of residual to total (relative tolerance)
42
A Few Balance Edit Fixes
RAKE* Rake all detail items to TOTAL
YSUMX* Replace TOTAL with the SUM OF DETAILS
ROUND Divide all details by 1000 or
divide TOTAL by 1000
RESIDUAL Set one missing DETAIL to the RESIDUAL
IMPUTE* Replace all DETAILS with imputed values
*Briefly discussed…
43
Raking
Adjust each detail item as
Conditions:• Reported TOTAL must be “acceptable.”• Relative tolerance is “small” (e.g., within 5%).
D E T A IL *= T O T A L
S U M O F D E T A IL SD E T A ILi i
44
Raking -- Considerations
• Is not considered imputation• Preserves reported distribution of the detail
items
45
YSUMX
Set TOTAL equal to SUM OF DETAILS
Conditions:• TOTAL can be changed by edit (not
“fixed”);• (Optional, but preferable) SUM OF
DETAILS is “reasonable” (e.g., verify with ratio test or range test)
46
YSUMX -- Considerations
• Not (considered: imputation;• logical edit or deductive imputation
• Useful when TOTAL is missing (and details are not);
• Can be imputation solution to ratio edit
1DETAILS OF SUM
TOTAL1
47
Impute
Replace ALL reported DETAILS with imputed values
Imputed DETAILi for reporting unit c is given by
factori TOTAL
Conditions
• TOTAL > 0 (and value of TOTAL “acceptable”)• No restriction on SUM OF DETAILS (all DETAILS are
replaced...]• Difference between TOTAL and SUM OF DETAILS too
large for raking
48
Macro-Editing (Brief Comments)
• Systematic review of tabulations (estimates)
• Tendency to rely on ratio comparisons to identify outlying estimates• Hidiroglou-Berthelot edit• Ratio Edits
• Need to analyze micro-data in outlying cells
49
Back to Coding…
Throughout the editing and imputation process, what do we need to keep track of?
50
Back to Coding…
Original source of data item • Reported from respondent• Elicited by analyst/subject-matter expert• Missing/not reported
51
Back to Coding…Final source of data item value• Unchanged (respondent data)• Raked/considered reported• Rounded detail item (rescaled)• Substitution
• Administrative data• Other item from same questionnaire• Prior period value from same respondent (can indicate
bad survey practice)
• Model imputation (+ auxiliary data)• Other imputation
52
Back to Coding…
Why was data item value changed?• Not reported?• Edit failure (automatic)
• Macro or micro level?• Which edit/edit module
• Analyst change (manual)• Should be documented in notes
53
Back to Coding…
What is the final disposition the data item?
• Reported data?• Equivalent in quality to reported data?• Imputed data?
What is the final disposition of the entire reporting unit?
54
Wrap Up
• Talked in GREAT detail on• Methods and sources of micro-edits• Common imputation models
• Talked in SEMI-GREAT detail on imputation methods
• Brought up the idea of macro-editing• Mentioned a few coding concerns
here and there…