ilcs raking
Data Quality Issues and “Fixes”
Dr. Fritz Scheuren
July 3, 2009
(for academic purposes only)
RCRC
Two Definitions of Quality
• Conformance to Requirements
• (Traditional Producer-Oriented Definition)
• Fitness for Use
• (Modern Client-Oriented Definition)
Definition of Process Quality
• Process Improvements Focus
• (Do It Right the First Time)
• Can be Reduced to Slogans
• Can also lead to Continuous Improvements
• Kaizen
Be Real: Four Quality Costs
• Costs of Reputation and Loss of Business from Inaction
• Cost of Prevention to Avoid Errors
• Cost of Detection to Find Errors
• Cost of Repairing Errors Found
Quality and Cost: 2 Worlds
Repair Methods
• Goal is “Fixing” to Fit Use
• Data Editing
• Data Imputation
• Data Fabrication
• Raking at NSS
Data Editing
• Honest Differences of Opinion or Real Errors?
• Need for Redundancy in System for Can’t Fail Items
• Achieving Measurability to Frame Expectations and Improvements
Data Editing Techniques
• Minimizing Processing Errors
• Definitional (e.g., Range) Tests
• Deterministic Tests
• Probabilistic Tests
–Outlier Tests
–Ratio Tests
Types of Edits Illustrated
• Range Test
Age Negative
• Deterministic Tests
If Age =14, then code as Child
• Probabilistic Tests
If Income > $1,000,000, take a look
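The three edit types illustrated here can be sketched as simple checks. This is a minimal illustration in Python, not a production edit system: the field names `age`, `income`, and `status`, the function name `run_edits`, and the reading of the age rule as "14 or under" are assumptions made for the example. Only the $1,000,000 figure comes from the slide, and it is treated here as an upper threshold.

```python
def run_edits(record):
    """Apply three illustrative edits to one record (a dict).

    Returns a list of (edit_type, message) flags. The record itself
    is never modified, so the original data are preserved.
    """
    flags = []
    age = record.get("age")
    income = record.get("income")

    # Range (definitional) test: age cannot be negative.
    if age is not None and age < 0:
        flags.append(("range", "age is negative"))

    # Deterministic test: a fixed rule assigns a code.
    # Assumption: the rule is read as "age 14 or under => Child".
    if age is not None and 0 <= age <= 14 and record.get("status") != "Child":
        flags.append(("deterministic", "should be coded as Child"))

    # Probabilistic test: an unusual but possible value is flagged
    # for a person to look at rather than changed automatically.
    if income is not None and income > 1_000_000:
        flags.append(("probabilistic", "income above $1,000,000, take a look"))

    return flags
```

Returning flags instead of changing values keeps diagnosis separate from correction, and the list of flags can be tallied across records.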
Practical Editing Tips
• Edit for Diagnosis, not just Correction
• Don’t Edit Outside Your Confidence Interval
• Preserve the Original Dataset as Backup to Avoid Irreversible Changes
• Keep Tallies of all Errors Found
Not all errors need to be corrected
Resist your Perfectionist Tendencies
More Practical Edit Tips
• Use your skilled staff to improve system rather than just edit data
• Never just depend on Intuition but still use it too!
• Employ Redundancy, Frugally!
Capture Recapture Methods (Double Keying Example)
• Two-by-Two Table with Cells
A B
C D
• Comparing Data Keyed the Same each time (A) with Errors Detected, (B and C)
• How to Estimate D?
• One Model D = BC/A?
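The model D = BC/A can be worked through numerically. The sketch below assumes that A counts items keyed the same both times, that B and C count items where a discrepancy exposed an error in only one of the two keyings, and that the two keyings make errors independently, so that A x D = B x C. The function name and the example counts are hypothetical.

```python
def estimate_undetected(a, b, c):
    """Estimate cell D of the two-by-two table under independence.

    a: items keyed the same both times
    b: items where only one keying (say the first) was in error
    c: items where only the other keying was in error
    Independence of the two keyings implies A*D = B*C, so D = B*C / A.
    """
    if a <= 0:
        raise ValueError("A must be positive to estimate D")
    return b * c / a


# Hypothetical counts for 10,000 keyed items.
a, b, c = 9_700, 200, 100
d_hat = estimate_undetected(a, b, c)  # 200 * 100 / 9700
print(f"Estimated errors missed by both keyings: {d_hat:.2f}")
```

If the two keyings tend to make the same mistakes, independence fails and this model is likely to understate D.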
Bottom Line Take-Away
• Use Data Checking to Understand Data’s Fitness for Use
• Edit but Don’t Over-Edit
• Use Edit Checks to Prevent Future Errors
Data Editing and Data Imputation
• Joint Role of Imputation and Editing: No Clear Line?
• Editing “fixes” Often are Model-Based Hunches
• Data Quality (editing)
• Information Quality (imputation)
Imputation Versus Editing
• What is Imputation?
• Handles Missing and Misreported Data
• Imputation Goal is roughly right! (Information Quality)
• Editing Goal often “correction”: Exactly right? (Data Quality)
Data Imputation Techniques
• Imputation Needs More Justification when Data Quality is the Goal
• Must be no more than Cosmetic in Nature, if done at all
• Can only be Aggressively applied for Information Quality Goal
Fellegi-Holt Example
• Identify Errors with Automated Edit Detection Software
• Hot Deck acceptable values from Records that Pass Edits
• Can be worth doing if errors are minor or cosmetic (e.g., Rounding)
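The hot-deck step can be illustrated with a simple random hot deck. This is a sketch only: it leaves out the error-localization step that automated Fellegi-Holt software performs (deciding which fields of a failing record to change), and the edit rule, the field name `age`, and the function names are hypothetical.

```python
import random


def passes_edit(record):
    """Hypothetical range edit: age must be present and within 0..120."""
    age = record.get("age")
    return age is not None and 0 <= age <= 120


def hot_deck(records, seed=0):
    """Return copies of the records, replacing `age` in failing records
    with the value from a randomly chosen record that passes the edit.

    The input list is left untouched, and every output record carries
    an `imputed` flag so the change stays documented.
    """
    rng = random.Random(seed)
    donors = [r for r in records if passes_edit(r)]
    if not donors:
        raise ValueError("no records pass the edit, so there are no donors")

    result = []
    for r in records:
        fixed = dict(r)
        if passes_edit(r):
            fixed["imputed"] = False
        else:
            fixed["age"] = rng.choice(donors)["age"]
            fixed["imputed"] = True
        result.append(fixed)
    return result


data = [{"id": 1, "age": 34}, {"id": 2, "age": -5}, {"id": 3, "age": 61}]
cleaned = hot_deck(data)
```

In practice donors are usually drawn from records that resemble the failing record on other variables; the fully random draw here is a simplification.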
More on Imputation
• Treat Influential Errors Individually not just Automatically
• That Said, Software Fixes can lead to Better Documentation (Paradata Matters)
• Need to Measure Variance Impacts
• Provide a natural break to Overediting but seldom used for this.
Edit/Imputation Summary
• Most Editing Mainly Eliminates the Bad
• Replacing it with a (Good?) Guess of some Sort
• Imputation emphasizes Guessing even more
More Editing/Imputation
• Best Imputation Practice tries to quantify Guessing impact on Information Quality
• Editing has not improved as much as Imputation
• Editing/Imputation needs more Joint Theory, especially to Measure and Use Mean Square Error Impacts
First Illustrative Example
• Fabrication/Falsification
• Illustrate the General Points about Editing and Imputation
• Emphasize Importance of Fabrication threat to Quality
Fabrication/Falsification
• Respondent/Interviewer Make up Data
• How Common?
• How to Reduce?
• How to Detect?
Right Structure Right Resources
• Examine Practice Elsewhere?
• www.amstat.org Website
• Key is right incentives
• Good staff/training
• But Eternal Vigilance
Second Illustration
• Raking Application at NSS
• To link up to Next Talk
• To illustrate Information Quality that is fit for use despite Data Quality
Raking Quality “Fix”
• What is Raking?
• How does it improve quality?
• Not Data Quality But Information Quality
• Sometimes both: Better Point Estimates, More Stable (smaller variances)
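Raking is commonly implemented as iterative proportional fitting: weighted counts are scaled, one margin at a time, until the weighted totals match known control totals. The sketch below shows this on a two-way table. It is not a description of the NSS application, and the table, the control totals, and the function name `rake` are hypothetical.

```python
def rake(table, row_targets, col_targets, tol=1e-9, max_iter=100):
    """Rake a two-way table of weighted counts to target margins.

    table: list of rows (lists of positive numbers)
    row_targets, col_targets: control totals with equal grand totals
    Returns a new table; the input is not modified.
    """
    cells = [list(row) for row in table]
    for _ in range(max_iter):
        # Scale each row to match its control total.
        for i, target in enumerate(row_targets):
            total = sum(cells[i])
            cells[i] = [v * target / total for v in cells[i]]
        # Scale each column to match its control total.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in cells)
            for row in cells:
                row[j] *= target / total
        # After the column step the columns match exactly,
        # so convergence is checked on the rows.
        gap = max(abs(sum(cells[i]) - t) for i, t in enumerate(row_targets))
        if gap < tol:
            break
    return cells


# Hypothetical sample counts and population control totals.
sample = [[30.0, 20.0], [10.0, 40.0]]
raked = rake(sample, row_targets=[600.0, 400.0], col_targets=[450.0, 550.0])
```

The raked table reproduces both sets of control totals while keeping the association pattern observed in the sample, which is why the fix bears on the estimates (information quality) and leaves the individual responses unchanged.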
Quality Summary
• Editing: Data Quality
• Imputation: Information Quality
• Raking: Information Quality
• Fabrication Can Harm Both
• Must be guarded against always
Almost Done Now
• Tried to Stay Practical, with a Frank Discussion of Key Weaknesses in Current Practice
• Deeper Understanding of Data Quality
• But at an Applied Level