Timothy B. Gravelle
Principal Scientist & Director, Insights Lab
September 13, 2013
What’s Missing?
Handling Missing Data with SAS
© 2000-2013 PriceMetrix Inc. Patents granted and pending.
Missing data
• Survey data frequently contain missing observations due
to respondent refusal, errors in fieldwork, etc.
• Business data can also contain missing observations.
• Large amounts of missing data can bias survey
estimates.
• Many statistical techniques assume (or require)
complete data, so missing data can reduce effective
sample size (and statistical power).
Types of missing data
• Patterns of data loss are typically described as either
ignorable or non-ignorable.
• Types of ignorable missing data:
- Missing completely at random (MCAR): the missing
observations on a given variable differ from the
observed scores on that variable only by chance and
the missing observations are further not related to any
other variable.
- Missing at random (MAR): missingness on a given variable
may be related to other observed variables but, once those
are taken into account, the missing observations differ from
the observed scores on that variable only by chance.
Types of missing data
• Non-ignorable missing data, or data that are missing not
at random (MNAR): the chance that a value is missing
depends on the unobserved value itself, so cases with
missing data differ systematically (not just randomly) from
cases with complete data.
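To make the three mechanisms concrete, the following SAS sketch simulates each one on made-up data. The variable names (age, income), the sample size and the missingness probabilities are illustrative assumptions, not values from the surveys discussed in this talk.

```sas
/* Simulated data: income depends on age (all values are hypothetical) */
data work.mechanisms;
  call streaminit(13);
  do id = 1 to 1000;
    age    = 18 + floor(60 * rand('uniform'));
    income = 20 + 0.5 * age + 10 * rand('normal');

    income_mcar = income;
    income_mar  = income;
    income_mnar = income;

    /* MCAR: every case has the same 20% chance of being missing */
    if rand('uniform') < 0.20 then income_mcar = .;

    /* MAR: chance of being missing rises with age, which is observed */
    if rand('uniform') < (age - 18) / 120 then income_mar = .;

    /* MNAR: chance of being missing rises with income itself */
    if rand('uniform') < 0.5 / (1 + exp(-(income - 50) / 10)) then income_mnar = .;

    output;
  end;
run;

/* Compare the complete variable with the three incomplete versions */
proc means data=work.mechanisms n nmiss mean std;
  var income income_mcar income_mar income_mnar;
run;
```

Comparing the observed means against the mean of the complete income variable shows how much each mechanism distorts a naive estimate in this simulated setting.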
Dealing with missing data
• Listwise deletion (or complete-case analysis): removes
all cases with any missing data from the analysis.
• Pairwise deletion (or available-case analysis): different
parts of the analysis are conducted with different subsets
of the data.
• Imputation: missing data points in a dataset are replaced
with plausible values.
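As a small illustration of the first two options, PROC CORR uses pairwise deletion by default and switches to listwise deletion with the NOMISS option. The data set and variable names below (work.survey, q1 to q3) are hypothetical.

```sas
/* Hypothetical survey items with scattered missing values */
data work.survey;
  input id q1 q2 q3;
  datalines;
1 4 5 3
2 . 4 2
3 3 . 4
4 5 5 .
5 2 3 1
6 4 4 4
7 1 2 2
8 5 . 5
;
run;

/* Listwise deletion: only cases complete on q1, q2 and q3 are used */
proc corr data=work.survey nomiss;
  var q1 q2 q3;
run;

/* Pairwise deletion (the default): each correlation uses all cases
   that are complete on that particular pair of variables */
proc corr data=work.survey;
  var q1 q2 q3;
run;
```

In the pairwise output the number of observations can differ from one correlation to the next, which is the "different subsets of the data" point made above.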
Types of imputation
• Mean imputation: missing data points are simply
replaced with the mean of the observed values on that
variable.
• Random imputation: missing data points are replaced
with random draws from a uniform distribution.
• Regression-based imputation: missing values are
replaced by a predicted score generated by a regression
model based on the non-missing data.
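A minimal sketch of mean imputation and regression-based imputation in SAS follows. The simulated data set and its variables (work.demo, age, income) are assumptions made for illustration only.

```sas
/* Simulated data with about 25% of income values set to missing */
data work.demo;
  call streaminit(2013);
  do id = 1 to 200;
    age    = 18 + floor(60 * rand('uniform'));
    income = 20 + 0.5 * age + 10 * rand('normal');
    if rand('uniform') < 0.25 then income = .;
    output;
  end;
run;

/* Mean imputation: REPONLY leaves observed values untouched and
   replaces only the missing values, here with the variable mean */
proc stdize data=work.demo out=work.mean_imputed reponly method=mean;
  var income;
run;

/* Regression-based imputation: fit the model on the observed cases,
   then obtain predicted values for every case with a non-missing age */
proc reg data=work.demo noprint;
  model income = age;
  output out=work.pred p=income_hat;
run;
quit;

/* Keep the observed income where available, else use the prediction */
data work.reg_imputed;
  set work.pred;
  income_reg = coalesce(income, income_hat);
run;
```

Both approaches fill in a single value per missing data point.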
Single vs. multiple imputation
• A problem with imputing only a single value for every
missing value is that this does not reflect our uncertainty
about the predictions. Standard errors may therefore be
biased (too small).
• An alternative is to replace each missing value with
multiple plausible values. This represents the uncertainty
about the right value to impute.
• Data analyses from multiply-imputed datasets can be
combined to produce estimates and confidence intervals
that incorporate missing-data uncertainty.
Steps for multiple imputation
• Impute the missing values m times (m is usually 3 to 10).
• Analyze each of the m completed data sets. This results
in m analyses.
• Pool the results from m analyses into a final result.
[Figure: incomplete data → m complete (imputed) data sets → analysis results β1, β2, β3 → pooled final result β]
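The pooling step is usually carried out with the combining rules commonly known as Rubin's rules. The notation below is an assumption chosen to match the β symbols in the diagram: each analysis i yields an estimate and a standard error, and these are combined as follows.

```latex
\bar{\beta} = \frac{1}{m} \sum_{i=1}^{m} \hat{\beta}_i
\qquad \text{(pooled estimate)}

W = \frac{1}{m} \sum_{i=1}^{m} \widehat{SE}_i^{\,2}
\qquad \text{(within-imputation variance)}

B = \frac{1}{m-1} \sum_{i=1}^{m} \left( \hat{\beta}_i - \bar{\beta} \right)^2
\qquad \text{(between-imputation variance)}

T = W + \left( 1 + \frac{1}{m} \right) B ,
\qquad SE(\bar{\beta}) = \sqrt{T}
```

The between-imputation term B is what single imputation leaves out: it grows when the m completed data sets disagree, and it widens the pooled standard error accordingly.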
Multiple imputation vs. listwise deletion: an example
Predicting concern about illegal immigration: United States (OLS regression)
                                    Multiple Imputation         Listwise Deletion
                                    Coeff.  SE    p      Sig.   Coeff.  SE    p      Sig.
Intercept                             2.08  0.41  0.000  ***      1.88  0.45  0.000  ***
Male                                 -0.02  0.06  0.693          -0.02  0.07  0.725
ln Age (Years)                        0.14  0.08  0.093           0.15  0.10  0.127
Education: College                   -0.24  0.07  0.000  ***     -0.25  0.08  0.002  **
Education: Some College               0.03  0.07  0.696           0.01  0.08  0.864
Monthly HH Income: 2K-4K             -0.01  0.10  0.958           0.06  0.10  0.552
Monthly HH Income: 4K-7.5K            0.07  0.09  0.438           0.12  0.09  0.208
Monthly HH Income: 7.5K+             -0.03  0.10  0.747          -0.02  0.11  0.865
Race: Black                          -0.03  0.11  0.812           0.03  0.12  0.833
Race: Other                           0.10  0.11  0.379           0.07  0.12  0.544
Hispanic                             -0.28  0.13  0.031  *       -0.28  0.14  0.044  *
Party: Democrat                      -0.11  0.11  0.323          -0.14  0.12  0.233
Party: Republican                     0.15  0.11  0.154           0.17  0.12  0.149
Ideology (Conservative)               0.10  0.04  0.009  **       0.10  0.04  0.017  *
ln Distance to US-Mex Border (km)     0.07  0.03  0.049  *        0.08  0.03  0.016  *
n                                    1,037                          763
R²                                   0.142                        0.165
Adjusted R²                          0.130                        0.149
Multiple imputation vs. listwise deletion: a second example
Predicting positive impressions of NAFTA: Canada (logistic regression)
                                    Multiple Imputation               Listwise Deletion
                                    Coeff.  S.E.  O.R.  p      Sig.   Coeff.  S.E.  O.R.  p      Sig.
Intercept                             2.05  0.98  7.76  0.037  *        1.66  1.88  5.27  0.378
Male                                  0.60  0.18  1.83  0.001  ***      0.55  0.32  1.74  0.084
ln Age (Years)                       -0.57  0.21  0.56  0.007  **      -0.49  0.38  0.61  0.198
Education: University                 0.16  0.22  1.17  0.481           0.53  0.36  1.70  0.141
Education: Community College         -0.14  0.23  0.87  0.525           0.06  0.40  1.06  0.888
Province: NL                          0.56  1.31  1.75  0.671          -0.26  1.73  0.77  0.880
Province: NS/PEI                      1.37  0.54  3.92  0.011  **      -0.02  0.78  0.98  0.983
Province: NB                          0.26  0.59  1.30  0.663           1.70  0.91  5.48  0.061
Province: QC                         -0.48  0.24  0.62  0.047  *        0.91  0.44  2.49  0.037  *
Province: MB                          0.12  0.48  1.13  0.796           0.86  0.68  2.37  0.202
Province: SK                          0.21  0.50  1.24  0.670           0.17  0.75  1.18  0.821
Province: AB                          0.06  0.35  1.06  0.869           0.93  0.78  2.52  0.233
Province: BC                         -0.23  0.29  0.79  0.422          -0.42  0.48  0.66  0.382
HH Income: Comfortable                0.18  0.19  1.20  0.347           0.38  0.37  1.46  0.306
HH Income: Finding it Difficult      -0.73  0.31  0.48  0.018  *       -0.14  0.41  0.87  0.725
City Economy Getting Better           0.46  0.19  1.58  0.017  *        0.77  0.36  2.16  0.030  *
Local Job Market Good                 0.19  0.20  1.21  0.358          -0.41  0.36  0.67  0.259
Confident in National Government      0.56  0.21  1.75  0.009  **       1.73  0.40  5.63  0.000  ***
Approve of Canadian Leadership        0.65  0.22  1.91  0.004  **      -0.37  0.40  0.69  0.356
Approve of American Leadership        0.36  0.23  1.44  0.118           0.43  0.37  1.54  0.239
ln Distance to Can-US Border (km)    -0.23  0.12  0.80  0.051  *       -0.35  0.23  0.71  0.131
n                                      885                               379
Model Chi Square                    177.37                             91.69
Cox & Snell R²                       0.182                             0.216
Nagelkerke R²                        0.244                             0.295
PROC MI
• Provides analyses of missing data patterns.
• Creates imputed values (mainly for interval-level
variables using linear regression; handling of categorical
data is new in SAS/STAT 12.1).
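Both uses of PROC MI can be sketched on simulated data. The data set and variable names (work.demo, y, x1, x2), the number of imputations and the seed are illustrative assumptions.

```sas
/* Simulated data in which about 20% of x2 values are missing */
data work.demo;
  call streaminit(2013);
  do id = 1 to 200;
    x1 = rand('normal');
    x2 = rand('normal');
    y  = 1 + 0.5 * x1 - 0.3 * x2 + rand('normal');
    if rand('uniform') < 0.20 then x2 = .;
    output;
  end;
run;

/* NIMPUTE=0 creates no imputations; the procedure only reports
   the missing data patterns and descriptive statistics */
proc mi data=work.demo nimpute=0;
  var y x1 x2;
run;

/* Create 5 completed data sets, stacked in one output data set and
   indexed by the automatic variable _Imputation_ */
proc mi data=work.demo out=work.demo_mi nimpute=5 seed=20130913;
  var y x1 x2;
run;
```

Setting SEED= makes the imputations reproducible from one run to the next.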
PROC MIANALYZE
• Combines the analyses of multiply imputed data
performed in other SAS procedures – e.g., PROC REG,
PROC LOGISTIC, PROC SURVEYREG, PROC
SURVEYLOGISTIC, PROC CALIS.
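A minimal end-to-end sketch with PROC REG is shown below: impute, analyze each completed data set with a BY statement, then pool with PROC MIANALYZE. The simulated data and all names (work.demo, y, x1, x2, work.est) are assumptions for illustration.

```sas
/* Simulated data in which about 20% of x2 values are missing */
data work.demo;
  call streaminit(2013);
  do id = 1 to 200;
    x1 = rand('normal');
    x2 = rand('normal');
    y  = 1 + 0.5 * x1 - 0.3 * x2 + rand('normal');
    if rand('uniform') < 0.20 then x2 = .;
    output;
  end;
run;

/* Step 1: create 5 completed data sets, indexed by _Imputation_ */
proc mi data=work.demo out=work.demo_mi nimpute=5 seed=20130913;
  var y x1 x2;
run;

/* Step 2: fit the same model to each completed data set.
   OUTEST= with COVOUT saves the estimates and their covariance matrices */
proc reg data=work.demo_mi outest=work.est covout noprint;
  model y = x1 x2;
  by _Imputation_;
run;
quit;

/* Step 3: pool the 5 sets of estimates into one set of results */
proc mianalyze data=work.est;
  modeleffects Intercept x1 x2;
run;
```

The MODELEFFECTS statement lists the parameters to be pooled, using the names under which the analysis procedure saved them.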
IVEware (SAS-callable)
• Developed at the University of Michigan and distributed
free of charge.
• Can accommodate interval, ordinal, nominal, count and
“mixed” data (using linear, binary logistic, generalized
logistic and Poisson regression).
• Can accommodate bounds on the imputed values.
• Can restrict the imputation to a subset of cases (useful
for imputing data for contingent/skip-based questions).
IVEware (SAS-callable)
/* Copy the IVEware setup commands that follow DATALINES4 into the file impute.set */
DATA _null_;
   INFILE datalines;
   FILENAME setup "impute.set";
   FILE setup;
   INPUT;
   PUT _infile_;
   REPLACE;
   DATALINES4;
DATAIN work.data_3;
DATAOUT work.data_MI;
DEFAULT CATEGORICAL;
CONTINUOUS LN_AGE WP1220 DISTANCE_CAN_US_BORDER
LN_DISTANCE_USA;
TRANSFER CASE_ID YEAR WP8018 WP12596 WP5 WP1220 WEIGHT LAT LON
DISTANCE_CAN_US_BORDER GEO_MATCH;
BOUNDS WP30(>=1,<=2) WP31(>=1,<=3) WP87(>=1,<=2) WP88(>=1,<=3)
WP89(>=1,<=2) WP137(>=1,<=2) WP138(>=1,<=2) WP139(>=1,<=2)
WP141(>=1,<=2) WP142(>=1,<=2) WP143(>=1,<=2) WP144(>=1,<=2)
WP145(>=1,<=2) WP146(>=1,<=2) WP148(>=1,<=2) WP150(>=1,<=2)
WP151(>=1,<=2) WP6879(>=1,<=2) WP1219(>=1,<=2)
LN_AGE(>=2.7080502,<=4.5951199) EDUCATION(>=1,<=3)
INCOME(>=1,<=7) INCOME_GET_BY(>=1,<=7) WP2319(>=1,<=4)
WP4657(>=1,<=2) REGION_CAN(>=1,<=8);
MINRSQD .01;
ITERATIONS 10;
MULTIPLES 10;
PERTURB=COEF;
SEED 20110718;
RUN;
;;;;
IVEware (SAS-callable)
%IMPUTE(name=impute, dir=.);
%PUTDATA(name=impute, dir=., mult=1, dataout=data_MI1);
%PUTDATA(name=impute, dir=., mult=2, dataout=data_MI2);
%PUTDATA(name=impute, dir=., mult=3, dataout=data_MI3);
%PUTDATA(name=impute, dir=., mult=4, dataout=data_MI4);
%PUTDATA(name=impute, dir=., mult=5, dataout=data_MI5);
%PUTDATA(name=impute, dir=., mult=6, dataout=data_MI6);
%PUTDATA(name=impute, dir=., mult=7, dataout=data_MI7);
%PUTDATA(name=impute, dir=., mult=8, dataout=data_MI8);
%PUTDATA(name=impute, dir=., mult=9, dataout=data_MI9);
%PUTDATA(name=impute, dir=., mult=10, dataout=data_MI10);
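The ten calls above differ only in the imputation number, so they could also be generated by a short macro loop. This is a sketch: the wrapper macro put_all and its loop variable are hypothetical names, and it assumes the IVEware %PUTDATA macro is available in the session and accepts exactly the arguments used above.

```sas
/* Hypothetical wrapper: extract each of the m completed data sets
   from the IVEware run named "impute" into data_MI1 ... data_MIm */
%MACRO put_all(m=10);
   %LOCAL _k;
   %DO _k = 1 %TO &m;
      %PUTDATA(name=impute, dir=., mult=&_k, dataout=data_MI&_k);
   %END;
%MEND put_all;

%put_all(m=10);
```

Changing MULTIPLES in the setup file then only requires changing the m= argument here.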
IVEware (SAS-callable)
PROC SQL;
CREATE TABLE data_MI AS
SELECT * FROM data_MI1
UNION ALL SELECT * FROM data_MI2
UNION ALL SELECT * FROM data_MI3
UNION ALL SELECT * FROM data_MI4
UNION ALL SELECT * FROM data_MI5
UNION ALL SELECT * FROM data_MI6
UNION ALL SELECT * FROM data_MI7
UNION ALL SELECT * FROM data_MI8
UNION ALL SELECT * FROM data_MI9
UNION ALL SELECT * FROM data_MI10
ORDER BY _mult_, CASE_ID
;
QUIT;
PROC SURVEYLOGISTIC
(using multiply-imputed data)
PROC SURVEYLOGISTIC DATA=data_MI;
BY _mult_ ;
MODEL NAFTA_POS(EVENT='1')=YEAR_2009 YEAR_2010 YEAR_2011
MALE LN_AGE EDU_UNIV EDU_COLLEGE
INCOME_0_1999_MTH INCOME_2000_2999_MTH INCOME_3000_3999_MTH
INCOME_5000_7499_MTH INCOME_7500_9999_MTH INCOME_10000_PL_MTH
PROVINCE_NL PROVINCE_NS_PE PROVINCE_NB PROVINCE_QC PROVINCE_MB
PROVINCE_SK PROVINCE_AB PROVINCE_BC
GOOD_TIME_FIND_JOB CITY_ECON_BETTER CITY_ECON_WORSE
NATL_ECON_BETTER NATL_ECON_WORSE CAN_LEADERSHIP USA_LEADERSHIP
LN_DISTANCE_USA
/RSQ;
ODS OUTPUT ParameterEstimates=ParmEst;
STRATA PROVINCE;
WEIGHT WT;
RUN;
PROC MIANALYZE
(combining parameter estimates)
/* PROC MIANALYZE identifies imputations by the variable _Imputation_,
   so the IVEware index _mult_ is renamed on input */
PROC MIANALYZE PARMS=ParmEst(RENAME=(_mult_=_Imputation_));
MODELEFFECTS
Intercept
YEAR_2009 YEAR_2010 YEAR_2011
MALE LN_AGE EDU_UNIV EDU_COLLEGE
INCOME_0_1999_MTH INCOME_2000_2999_MTH INCOME_3000_3999_MTH
INCOME_5000_7499_MTH INCOME_7500_9999_MTH INCOME_10000_PL_MTH
PROVINCE_NL PROVINCE_NS_PE PROVINCE_NB PROVINCE_QC PROVINCE_MB
PROVINCE_SK PROVINCE_AB PROVINCE_BC
GOOD_TIME_FIND_JOB CITY_ECON_BETTER CITY_ECON_WORSE
NATL_ECON_BETTER NATL_ECON_WORSE CAN_LEADERSHIP USA_LEADERSHIP
LN_DISTANCE_USA;
RUN;
Wrap-up
• How one chooses to deal with missing data has
implications for one’s analyses and the substantive
conclusions that one can reach.
• The default of listwise deletion is a choice, though often
an implicit one. It may not be the best choice.
• State-of-the-art multiple imputation techniques are now
(relatively) easy to implement in SAS. They allow us to
use all available data while still accounting for the
uncertainty inherent in the imputation process.
Wrap-up
• The creation of multiply-imputed datasets, analysis of
multiply-imputed data and pooling of estimates are
distinct steps.
• Consequently, one can conduct the different steps using
different software according to the software’s capabilities
and the analyst’s preference.
• When we encounter missing data, we should give
greater thought to why they are missing.