ilcs raking

Data Quality Issues and “Fixes” Dr. Fritz Scheuren July 3, 2009 (for academic purposes only) R C R C

Upload: crrc-armenia

Post on 11-Nov-2014


TRANSCRIPT

Page 1: ILCS Raking

Data Quality Issues and “Fixes”

Dr. Fritz Scheuren

July 3, 2009

(for academic purposes only)

RCRC

Page 2: ILCS Raking

Two Definitions of Quality

• Conformance to Requirements (Traditional Producer-Oriented Definition)

• Fitness for Use (Modern Client-Oriented Definition)

Page 3: ILCS Raking

Definition of Process Quality

• Process Improvements Focus (Do It Right the First Time)

• Can be Reduced to Slogans

• Can also lead to Continuous Improvements

• Kaizen

Page 4: ILCS Raking

Be Real: Four Quality Costs

• Costs of Reputation and Loss of Business from Inaction

• Cost of Prevention to Avoid Errors

• Cost of Detection to Find Errors

• Cost of Repairing Errors Found

Page 5: ILCS Raking

Quality and Cost: 2 Worlds

Page 6: ILCS Raking

Repair Methods

• Goal is “Fixing” to Fit Use

• Data Editing

• Data Imputation

• Data Fabrication

• Raking at NSS

Page 7: ILCS Raking

Data Editing

• Honest Differences of Opinion or Real Errors?

• Need for Redundancy in System for Can’t Fail Items

• Achieving Measurability to Frame Expectations and Improvements

Page 8: ILCS Raking

Data Editing Techniques

• Minimizing Processing Errors

• Definitional (e.g., Range) Tests

• Deterministic Tests

• Probabilistic Tests

–Outlier Tests

–Ratio Tests
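As a rough illustration of a ratio test, the sketch below compares each record's ratio of two related values against the typical (median) ratio and flags the ones that stand far apart. The function name, the sample data, and the tolerance of 3 are all assumptions made for this example, not part of the talk. It also assumes every previous value is positive.

```python
from statistics import median


def ratio_outliers(pairs, tolerance=3.0):
    """Return indices of (current, previous) pairs with an unusual ratio.

    A ratio is flagged when it is more than `tolerance` times larger or
    smaller than the median ratio. The tolerance is an illustrative choice,
    and previous values are assumed to be positive.
    """
    ratios = [current / previous for current, previous in pairs]
    typical = median(ratios)
    flagged = []
    for index, ratio in enumerate(ratios):
        if ratio > typical * tolerance or ratio < typical / tolerance:
            flagged.append(index)
    return flagged


# Hypothetical (this year, last year) expenditures for five households.
pairs = [(105, 100), (98, 100), (210, 200), (1000, 100), (51, 50)]
print(ratio_outliers(pairs))  # [3]
```

A flagged record is a candidate for review, not automatically an error.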

Page 9: ILCS Raking

Types of Edits Illustrated

• Range Test

Age Negative

• Deterministic Tests

If Age =14, then code as Child

• Probabilistic Tests

If Income exceeds $1,000,000, take a look
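The three kinds of edits can be sketched in a few lines. This is only an illustration: the field names, the record layout, and the exact comparisons are assumptions (the comparison symbols on the slide may not have survived transcription, so "age of 14 or less" and "income above 1,000,000" are guesses used for the example).

```python
def check_record(record, income_review_threshold=1_000_000):
    """Apply a range, a deterministic, and a probabilistic edit to one record.

    Returns (edited_copy, messages). The original record is never modified,
    so the raw data stay available as a backup.
    """
    edited = dict(record)
    messages = []

    if edited["age"] < 0:
        # Range test: a negative age is impossible, so report it.
        messages.append("range: age is negative")
    elif edited["age"] <= 14 and edited.get("category") != "Child":
        # Deterministic test: the age fixes the category.
        # The cutoff of 14 or less is an assumption for illustration.
        edited["category"] = "Child"
        messages.append("deterministic: recoded category to Child")

    if edited["income"] > income_review_threshold:
        # Probabilistic test: unusual but possible, so flag it for review only.
        messages.append("probabilistic: income unusually high, take a look")

    return edited, messages


# Hypothetical record with two problems.
raw = {"age": 10, "category": "Adult", "income": 2_000_000}
edited, messages = check_record(raw)
print(edited["category"], messages)
```

Note that only the deterministic edit changes a value. The range and probabilistic edits just report, which keeps the edit useful for diagnosis as well as correction.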

Page 10: ILCS Raking

Practical Editing Tips

• Edit for Diagnosis, not just Correction

• Don’t Edit Outside Your Confidence Interval

• Preserve the Original Dataset as Backup to Avoid Irreversible Changes

• Keep Tallies of all Errors Found

Page 11: ILCS Raking

Not all errors need to be corrected

Resist your Perfectionist Tendencies

Page 12: ILCS Raking

More Practical Edit Tips

• Use your skilled staff to improve system rather than just edit data

• Never just depend on Intuition but still use it too!

• Employ Redundancy, Frugally!

Page 13: ILCS Raking

Capture Recapture Methods (Double Keying Example)

• Two-by-Two Table with Cells

A B

C D

• Comparing Data Keyed the Same each time (A) with Errors Detected, (B and C)

• How to Estimate D?

• One Model D = BC/A?
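A worked version of the D = BC/A model, as a minimal sketch. One reading of the table is assumed here: A counts items keyed the same both times, B and C count items where only one of the two keyings was in error (so the disagreement was detected), and D is the unobserved count of items both keyers got wrong. If the two keyers err independently, the cross-product ratio AD/(BC) equals 1, which gives D = BC/A. The function name and the counts are hypothetical.

```python
def estimate_undetected(a, b, c):
    """Estimate the undetected cell D of a two-by-two table as B*C/A.

    Assumes the two keying passes make errors independently of each other.
    """
    if a <= 0:
        raise ValueError("cell A must be positive to estimate D")
    return b * c / a


# Hypothetical counts: 9000 items agree, 90 and 60 disagreements.
a, b, c = 9000, 90, 60
d_hat = estimate_undetected(a, b, c)
print(f"estimated undetected errors: {d_hat:.2f}")
```

If the keyers tend to make the same mistakes (for example, on illegible forms), independence fails and this estimate will be too low.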

Page 14: ILCS Raking

Bottom Line Take-Away

• Use Data Checking to Understand Data’s Fitness for Use

• Edit but Don’t Over-Edit

• Use Edit Checks to Prevent Future Errors

Page 15: ILCS Raking

Data Editing and Data Imputation

• Joint Role of Imputation and Editing: No Clear Line?

• Editing “fixes” Often are Model-Based Hunches

• Data Quality (editing)

• Information Quality (imputation)

Page 16: ILCS Raking

Imputation Versus Editing

• What is Imputation?

• Handles Missing and Misreported Data

• Imputation Goal is roughly right! Information Quality

• Editing Goal often “correction”: Exactly right? Data Quality

Page 17: ILCS Raking

Data Imputation Techniques

• Imputation Needs More Justification when Data Quality is the Goal

• Must be no more than Cosmetic in Nature, if done at all

• Can only be Aggressively applied for Information Quality Goal

Page 18: ILCS Raking

Fellegi-Holt Example

• Identify Errors with Automated Edit Detection Software

• Hot Deck acceptable values from Records that Pass Edits

• Can be worth doing if errors are minor or cosmetic (e.g., Rounding)
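The donor step above can be sketched as follows. This is not the full Fellegi-Holt method, which also works out the smallest set of fields to change. It only shows the hot deck idea: a record that fails an edit borrows a value from a record that passes. The edit rule, the field names, and the choice of region as the matching class are all assumptions for illustration.

```python
import random


def passes_edits(record):
    """Hypothetical edit rule: age must be present and between 0 and 120."""
    age = record["age"]
    return age is not None and 0 <= age <= 120


def hot_deck_age(records, seed=0):
    """Replace failing ages with the age of a randomly chosen passing donor.

    Donors from the same region are preferred; any donor is used otherwise.
    Returns new records with an `age_imputed` flag; the input is not modified.
    """
    rng = random.Random(seed)
    donors = [r for r in records if passes_edits(r)]
    if not donors:
        raise ValueError("no records pass the edits, so there are no donors")
    result = []
    for record in records:
        new = dict(record)
        new["age_imputed"] = not passes_edits(record)
        if new["age_imputed"]:
            same_region = [d for d in donors if d["region"] == record["region"]]
            new["age"] = rng.choice(same_region or donors)["age"]
        result.append(new)
    return result


# Hypothetical records: one negative age, one missing age.
records = [
    {"id": 1, "region": "N", "age": 34},
    {"id": 2, "region": "N", "age": -5},
    {"id": 3, "region": "S", "age": 51},
    {"id": 4, "region": "S", "age": None},
]
for row in hot_deck_age(records):
    print(row)
```

Keeping the `age_imputed` flag on every record is one simple way to document what the software changed.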

Page 19: ILCS Raking

More on Imputation

• Treat Influential Errors Individually not just Automatically

• That Said, Software Fixes can lead to Better Documentation (Paradata Matters)

• Need to Measure Variance Impacts

• Provide a natural break to Overediting but seldom used for this.

Page 20: ILCS Raking

Edit/Imputation Summary

• Most Editing Mainly Eliminates the Bad

• Replacing it with a (Good?) Guess of some Sort

• Imputation emphasizes Guessing even more

Page 21: ILCS Raking

More Editing/Imputation

• Best Imputation Practice tries to quantify Guessing impact on Information Quality

• Editing has not improved as much as Imputation

• Editing/Imputation needs more Joint Theory, especially to Measure and Use Mean Square Error Impacts

Page 22: ILCS Raking

First Illustrative Example

• Fabrication/Falsification

• Illustrate the General Points about Editing and Imputation

• Emphasize Importance of Fabrication threat to Quality

Page 23: ILCS Raking

Fabrication/Falsification

• Respondent/Interviewer Make up Data

• How Common?

• How to Reduce?

• How to Detect?

Page 24: ILCS Raking

Right Structure Right Resources

• Examine Practice Elsewhere?

• www.amstat.org Website

• Key is right incentives

• Good staff/training

• But Eternal Vigilance

Page 25: ILCS Raking

Second Illustration

• Raking Application at NSS

• To link up to Next Talk

• To illustrate Information Quality that is fit for use despite Data Quality

Page 26: ILCS Raking

Raking Quality “Fix”

• What is Raking?

• How does it improve quality? Not Data Quality But Information Quality

• Sometimes both: Better Point Estimates, More Stable (smaller variances)
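For readers new to raking, here is a minimal sketch of the general technique (iterative proportional fitting), not the specific NSS application. A table of weighted sample counts is scaled, rows then columns, over and over until its margins match known population totals. The table, the targets, and the function name are hypothetical, and the sketch assumes every row and column total is positive and that the two sets of targets sum to the same total.

```python
def rake(table, row_targets, col_targets, max_iter=100, tol=1e-9):
    """Scale a 2-D table of weighted counts to match row and column targets.

    Returns a new table; the input is not modified. Stops when the row
    margins are within `tol` of their targets after a column adjustment.
    """
    cells = [list(row) for row in table]
    for _ in range(max_iter):
        # Row step: scale each row to its target total.
        for i, target in enumerate(row_targets):
            factor = target / sum(cells[i])
            cells[i] = [value * factor for value in cells[i]]
        # Column step: scale each column to its target total.
        for j, target in enumerate(col_targets):
            factor = target / sum(row[j] for row in cells)
            for row in cells:
                row[j] *= factor
        # The column step can disturb the row margins, so check them.
        gap = max(abs(sum(cells[i]) - t) for i, t in enumerate(row_targets))
        if gap < tol:
            break
    return cells


# Hypothetical sample counts (rows: urban/rural, columns: two age groups).
sample = [[30, 20], [10, 40]]
raked = rake(sample, row_targets=[60, 40], col_targets=[45, 55])
for row in raked:
    print([round(value, 2) for value in row])
```

The individual responses are untouched, so data quality is unchanged. Only the weights move, which is why the gain is in information quality.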

Page 27: ILCS Raking

Quality Summary

• Editing: Data Quality

• Imputation: Information Quality

• Raking: Information Quality

• Fabrication Can Harm Both

• Must be guarded against always

Page 28: ILCS Raking

Almost Done Now

• Tried to Stay Practical, with a Frank Discussion of Key Weaknesses in Current Practice

• Deeper Understanding of Data Quality

• But at an Applied Level

Page 29: ILCS Raking

Thank you

Fritz Scheuren

[email protected]