Working with Missing Values
Alan C. Acock
February, 2007
Supporting material is available at www.oregonstate.edu/~acock/missing
Alan C. Acock, Working with Missing Values
Why are the Values Missing: The reason instructs the solution
By Design: Completely Random
– Missing Completely at Random (MCAR)
– 50% of items selected randomly for each interview
– 50% randomly selected for follow-up
– Effective when there are too many items or high costs
Intentionally Missing: Researcher controlled
– Boys not asked about age at first menstruation
– Drop from analysis
– Sometimes unintentionally imputed
– Imputing doesn’t necessarily hurt
Why are the Values Missing
Refusals: We may know the mechanism
– Adjusted for gender, race, education
– May be missing at random
– Otherwise, bias is likely without auxiliary variables
Missing because of “don’t know” responses
– Between agree and disagree?
– Can we impute a better value?
– Should we?
Why are the Values Missing
Missing by researcher error
– May be missing completely at random
– May reflect researcher bias
– Perceived risk to researcher
– Missing observation worse than missing value
Code the reason each value is missing
– NLSY97 uses 5 types of missing values
– Treat each differently
Why are the Values Missing
• Understand why each value is missing
• Delete observations or variables where you do not intend to impute a value
– Drop variable
– Drop observation
Four Questions
• Do I want to have a value for this person?
• Is the value missing completely at random, or
• Do I have auxiliary variables that explain why it is missing, and
• Do I have covariates that predict the score?
Patterns of Missing Values
MISSING DATA PATTERNS
1 2 3 4 5 6 7 8 9 10
HLTH x x x x
CHILDS x x x x x x x x x x
HAP_GEN x x x x x
INCOME98 x x x x x x
AGE x x x x x x x x
EDUC x x x x x
– What is the problem with HLTH? INCOME98? EDUC?
Patterns of Missing Values
MISSING DATA PATTERN FREQUENCIES
Pattern  Freq    Pattern  Freq    Pattern  Freq
1        550     5        27      9        4
2        81      6        2       10       14
3        77      7        12
4        30      8        21
• Throw out 81 people in pattern 2?
• We have data on five of the six variables
• Income might not be a key predictor
• Why is health missing in patterns 5 to 10? Was this by design?
Amount of Missing Values
PROPORTION OF DATA PRESENT
           HLTH   CHILDS   HAP_GEN   INC    AGE    EDUC
HLTH       .90
CHILDS     .90    1.00
HAP_GEN    .77    .82      .82
INCOME98   .76    .83      .70       .83
AGE        .90    .99      .81       .82    .99
EDUC       .77    .82      .82       .70    .81    .822
• Income low with educ, hlth, hap_gen
• If income is “just” a control variable, find a substitute or impute
• Over 50% of cases for all the combinations
• Could be worse if you did 3-way (hlth, income, educ)
Raw Data Missingness
ID Var1 Var2 Var3
1 9 7 .
2 . 3 5
3 7 4 .
4 9 4 6
5 6 2 7
6 . . 5
ID D1 D2 D3
1 0 0 1
2 1 0 0
3 0 0 1
4 0 0 0
5 0 0 0
6 1 1 0
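The second table can be derived mechanically from the first. The sketch below rebuilds the D1, D2, D3 indicators from the raw values on this slide, assuming NumPy is available and using `np.nan` in place of the "." marker; the function name `missingness_indicators` is only illustrative.

```python
import numpy as np

# Raw data from the slide: rows are IDs 1-6, columns are Var1, Var2, Var3.
# np.nan stands in for the "." missing-value marker.
raw = np.array([
    [9.0, 7.0, np.nan],
    [np.nan, 3.0, 5.0],
    [7.0, 4.0, np.nan],
    [9.0, 4.0, 6.0],
    [6.0, 2.0, 7.0],
    [np.nan, np.nan, 5.0],
])


def missingness_indicators(data):
    """Return a 0/1 matrix with 1 wherever a value is missing (hypothetical helper)."""
    return np.isnan(data).astype(int)


indicators = missingness_indicators(raw)  # columns correspond to D1, D2, D3
print(indicators)
```

Each column of `indicators` is a dummy variable that can be analyzed like any other variable, which is what the MCAR and MAR checks rely on.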
Missing Completely at Random (MCAR)
• The missingness is random: D1, D2, D3 are uncorrelated with anything!
• Correlate (or use logistic regression on) variables with D1, D2, D3
• Consider race, gender, age, education
• None of these should be correlated with D1, D2, or D3
• This is not correlating variables with the raw score!
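A rough way to run this check is to correlate a missingness indicator with each observed covariate. The sketch below uses simulated data, so the variable names, sample size, and missingness rates are all assumptions made for illustration; `indicator_correlations` is a hypothetical helper, not part of any package named in these slides.

```python
import numpy as np


def indicator_correlations(indicator, covariates):
    """Correlate a 0/1 missingness indicator with each covariate (hypothetical helper)."""
    return {name: float(np.corrcoef(indicator, values)[0, 1])
            for name, values in covariates.items()}


rng = np.random.default_rng(0)
n = 5000
age = rng.normal(45, 12, n)   # simulated covariates, independent of each other
educ = rng.normal(13, 3, n)
covariates = {"age": age, "educ": educ}

# MCAR: every respondent has the same 20% chance of a missing value.
d_mcar = (rng.random(n) < 0.20).astype(int)

# Not MCAR: the chance of a missing value rises with age.
p_missing = 1.0 / (1.0 + np.exp(-(age - 45) / 6))
d_by_age = (rng.random(n) < p_missing).astype(int)

mcar_r = indicator_correlations(d_mcar, covariates)
by_age_r = indicator_correlations(d_by_age, covariates)
print("MCAR indicator:", mcar_r)
print("Age-driven indicator:", by_age_r)
```

Under MCAR both correlations should sit near zero. In the second case the indicator correlates clearly with age, which signals that age belongs in the analysis as an auxiliary variable.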
Missing at Random (MAR)
• The missingness is a random pattern after you control for
– Variables in your analysis
– Auxiliary variables
• Probability of missingness does NOT depend on unobserved variables
• Correlate variables with D1, D2, D3
• Consider auxiliary variables: race, gender, age, education
Missing at Random (MAR)
• Include auxiliary variables as mechanisms for missingness
– If they are correlated significantly with the missingness, D1, D2, D3
• Data is MAR after controlling for auxiliary variables
• Auxiliary variables are available in many datasets
Problem with Traditional Approaches
Listwise deletion: the standard default
– It excludes many observations (50%?)
– An observation may be missing only one variable, and that variable may not be important
– In longitudinal program evaluations
• Missing those with low level of implementation
– If MCAR, this reduces power, but is unbiased
– Without MCAR, this is biased
– Political Science Journal: 50% deleted
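To see how quickly listwise deletion shrinks a sample, the sketch below applies it to the six-observation example from the Raw Data Missingness slide. It assumes NumPy and uses `np.nan` for the "." marker.

```python
import numpy as np

# Six-observation example from the Raw Data Missingness slide (IDs 1-6).
raw = np.array([
    [9.0, 7.0, np.nan],
    [np.nan, 3.0, 5.0],
    [7.0, 4.0, np.nan],
    [9.0, 4.0, 6.0],
    [6.0, 2.0, 7.0],
    [np.nan, np.nan, 5.0],
])

# Listwise deletion keeps only rows with no missing value on any variable.
complete = ~np.isnan(raw).any(axis=1)
kept_ids = (np.flatnonzero(complete) + 1).tolist()  # IDs are 1-based
share_kept = complete.mean()

print("IDs kept:", kept_ids)
print("Share of sample kept:", round(float(share_kept), 2))
```

Only IDs 4 and 5 survive, even though four of the six observations are missing just one of the three values.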
Problem with Traditional Approaches
Mean Substitution
– Mean often bad estimate
– Attenuates variance
– Reduces effects of variables with missing data, or
– Exaggerates effects of variables with little missing data
– Reduces R²
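The variance attenuation can be shown directly. The sketch below uses simulated scores with 30% of values removed completely at random; the distribution, sample size, and missingness rate are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(50, 10, 1000)          # simulated complete scores
missing = rng.random(x.size) < 0.30   # 30% missing completely at random

observed = x[~missing]
# Mean substitution: every missing value is replaced by the observed mean.
filled = np.where(missing, observed.mean(), x)

var_observed = observed.var(ddof=1)
var_filled = filled.var(ddof=1)

print("Variance of observed values:   ", round(float(var_observed), 2))
print("Variance after mean substitution:", round(float(var_filled), 2))
```

The filled-in values add cases but no spread, so the variance shrinks by the factor (n_observed - 1) / (n_total - 1), and anything computed from that variance is distorted with it.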
Problem with Traditional Approaches
Pairwise Deletion (rarely used)
– Each correlation on different subsample
– Set of correlations—no single sample
– May not be able to invert matrix
– What is the right sample size?
– If it works, usually better than mean substitution or listwise deletion
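The "different subsample" problem is easy to see by counting how many observations are available for each pair of variables. The sketch below does this for the six-observation example from the Raw Data Missingness slide, assuming NumPy and `np.nan` for the "." marker.

```python
import numpy as np

# Six-observation example from the Raw Data Missingness slide (Var1, Var2, Var3).
raw = np.array([
    [9.0, 7.0, np.nan],
    [np.nan, 3.0, 5.0],
    [7.0, 4.0, np.nan],
    [9.0, 4.0, 6.0],
    [6.0, 2.0, 7.0],
    [np.nan, np.nan, 5.0],
])

present = (~np.isnan(raw)).astype(int)

# Entry (i, j) is the number of observations with both variable i and
# variable j present, i.e. the sample size behind that pairwise correlation.
pair_counts = present.T @ present
print(pair_counts)
```

The three off-diagonal counts are 4, 2, and 3, so each correlation rests on a different set of people and no single N describes the resulting matrix.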
Problem with Traditional Approaches
Ordinary regression imputation
– Multiple regression used to predict their score
– Predicted value will have no new information if predictors are in your model (collinearity)
– Does nothing about uncertainty of predictions
• If R² = .90, the predicted value is good
• If R² = .10, the predicted value has a lot of noise
– Thus, predicted values are “too good”
Problem with Traditional Approaches
Single Imputation (SPSS Module) (MAR)
– American Statistician article--done incorrectly
– Single imputation does not incorporate variability between multiple imputations
– Reviewers for many journals not aware of limitations of single imputation so . . .
– Easy to implement using SPSS
Modern Approaches
Multiple Imputation--Assumes MAR
– Imputation is done 5-20 times
– Model is estimated 5-20 times
– Estimates (R’s, B’s, Betas) are averaged
– Standard errors--variances between solutions incorporated
– Reflects uncertainty of the process
– Always better than single imputation
Modern Approaches
Multiple Imputation
– Available with best statistical packages
• Stata
• SAS
– Available with freeware programs that work in conjunction with statistical packages
• Norm
• Amelia
• IVEware
• Mice
Modern Approaches
Full Information Maximum Likelihood (FIML)
– Assumes MAR
– Uses all available information
– Assumes patterns would be the same if no values were missing
– Results similar to multiple imputation
– Available with SEM programs
• Mplus
• LISREL
• AMOS
• EQS
Modern Approaches
Full Information Maximum Likelihood
– Easy changes in SEM programs will do this
– Researchers rarely include auxiliary variables
– Researchers rarely include covariates unless they are in the model
– Possible to add auxiliary/predictor variables
– Mplus allows for both FIML estimation and multiple imputation, so it is nice to compare results
How Multiple Imputation Works: Non-technical Explanation
• All variables may have some missing values, including the DV
• Eliminate observations with missing values on all variables
– A missing wave of a panel is just missing values
• Estimate the covariance matrix (listwise)
• Regress x_i on the remaining variables
How Multiple Imputation Works
• Add residual based on strength of prediction
– R² = .90: add small error
– R² = .10: add big error
• You now have an actual or imputed value for all observations on all variables
• Estimate a covariance matrix
• This covariance matrix should be “better” because it utilizes more information
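The "add residual" step can be sketched for a single predictor. This is a simplified illustration on simulated data, not the algorithm of any particular package: real multiple imputation software also accounts for uncertainty in the regression coefficients themselves. The function name `impute_with_residual` and all data values are assumptions.

```python
import numpy as np


def impute_with_residual(x, y, rng):
    """Fill missing y from a regression on x plus a random residual (hypothetical helper).

    Returns the completed y and the residual SD as a share of the observed SD of y.
    """
    obs = ~np.isnan(y)
    slope, intercept = np.polyfit(x[obs], y[obs], 1)
    residuals = y[obs] - (intercept + slope * x[obs])
    resid_sd = residuals.std(ddof=2)
    y_imputed = y.copy()
    n_missing = int((~obs).sum())
    # Predicted value plus noise sized to how poorly x predicts y.
    y_imputed[~obs] = intercept + slope * x[~obs] + rng.normal(0, resid_sd, n_missing)
    return y_imputed, resid_sd / y[obs].std(ddof=1)


rng = np.random.default_rng(2)
n = 2000
x = rng.normal(0, 1, n)
y_strong = 3 * x + rng.normal(0, 1, n)   # R-squared of about .90
y_weak = x / 3 + rng.normal(0, 1, n)     # R-squared of about .10
mask = rng.random(n) < 0.30              # 30% of y set to missing
y_strong[mask] = np.nan
y_weak[mask] = np.nan

y_strong_imp, ratio_strong = impute_with_residual(x, y_strong, rng)
y_weak_imp, ratio_weak = impute_with_residual(x, y_weak, rng)
print("Relative error added, strong predictor:", round(float(ratio_strong), 2))
print("Relative error added, weak predictor:  ", round(float(ratio_weak), 2))
```

With a strong predictor the added noise is small relative to the spread of the variable. With a weak predictor it is nearly as large as that spread, which keeps the imputed values from looking "too good".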
How Multiple Imputation Works
• If covariance matrices are different
– Repeat process until successive covariance matrices are virtually identical
• This provides the first imputed dataset
• Repeat this process m times
– Result: m imputed datasets with no missing values
How Multiple Imputation Works
• Estimate your model with each of your m imputed datasets
• Combine the results using Rubin’s rules
– Parameter estimates: mean of their m values
– Standard errors: inflate the mean of the standard errors based on how much the solutions vary
– Standard errors (hence t-tests) will be unbiased if the data is MAR
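The pooling step can be sketched for one parameter. In the usual statement of Rubin's rules the averaging is done on squared standard errors: total variance = mean within-imputation variance + (1 + 1/m) × between-imputation variance. The function name `pool_rubin` and the input numbers below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math


def pool_rubin(estimates, std_errors):
    """Pool one parameter across m imputed datasets (hypothetical helper)."""
    m = len(estimates)
    pooled_estimate = sum(estimates) / m
    # Within-imputation variance: mean of the squared standard errors.
    within = sum(se ** 2 for se in std_errors) / m
    # Between-imputation variance: how much the m estimates disagree.
    between = sum((q - pooled_estimate) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between
    return pooled_estimate, math.sqrt(total)


# Made-up results from m = 3 imputed datasets.
estimate, std_error = pool_rubin([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
print(estimate, std_error)
```

Here the pooled estimate is 2.0 and the pooled standard error is sqrt(1 + (4/3) × 1), about 1.53, which is larger than any single-dataset standard error because the three solutions disagree.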
How FIML is Implemented: Mplus
Title: Missing values including mechanisms
Data: File is miss_systematic-999.dat ;
Variables: Names are childs satfin male hap_gen ident income98 educ hlth age ;
  Missing are all (-999) ;
  Usevariables are hlth childs hap_gen income98 age educ satfin male ;
Analysis: Type = missing ; ! without this you get listwise deletion
FIML: Mplus Example
Model: hlth on childs hap_gen income98 age educ ;
  satfin on childs hap_gen income98 age educ ;
  male on childs hap_gen income98 age educ ;
Output: standardized ;
1. The “hlth” and “satfin” lines are the model
2. The “male” line is a nonsense equation that includes any covariates or auxiliary variables
Freeware Dedicated Packages
Package   Single Imputation   Multiple Imputation   FIML
Amelia    X                   X
IVEware   X                   X
Norm      X                   X
MICE      X                   X
Mx                                                  X
Commercial Statistical Packages
Package       Single Imputation   Multiple Imputation   FIML
SAS (MI)                          X
SPSS (EM)     X
Stata (ice)   X                   X
Commercial FIML Packages
Package   Single Imputation   Multiple Imputation   FIML
AMOS                                                X
EQS                                                 X
HLM                                                 X
LISREL                                              X
Mplus                         X                     X
Web Pages for Selected Software
• Amelia gking.harvard.edu/amelia/
• IVEware http://www.isr.umich.edu/src/smp/ive/
• Norm http://www.stat.psu.edu/~jls/misoftwa.html#aut
• MX www.vcu.edu/mx/
• SPSS www.spss.com
• EQS www.mvsoft.com/
• LISREL http://www.ssicentral.com/hlm/index.html
• Mplus www.statmodel.com
• SAS www.sas.com
• Stata www.stata.com