MultReg_Mult-Imputation.docx
Multiple Imputation and Multiple Regression with SAS and IBM SPSS
See IntroQ Questionnaire for a description of the survey used to generate the data used here.
*** Mult-Imput_M-Reg.sas ***;
options pageno=min nodate formdlim='-';
title 'Multiple Imputation of Missing Data then Multiple Regression.';
run;
PROC IMPORT OUT= WORK.IntroQuest
    DATAFILE= "C:\Users\Vati\Documents\StatData\IntroQ\IntroQ.sav"
    DBMS=SPSS REPLACE;
RUN;
/* Create an indicator: 1 if SATM is missing, 0 otherwise */
Data Priapus; set IntroQuest;
  SATM_Miss = 0;
  If SATM = . then SATM_Miss = 1;
proc means n nmiss; run;
/* Is missingness on SATM related to the other variables? */
proc corr nosimple; var SATM_Miss; with statoph gender ideal nucoph year; run;
The data are imported from an SPSS “.sav” file.
The MEANS Procedure

Variable   Label       N    N Miss
Gender     Gender     694      0
Ideal      Ideal      689      5
Eye        Eye        693      1
Statoph    Statoph    685      9
Nucoph     Nucoph     692      2
SATM       SATM       547    147
Year       Year       694      0
Pearson Correlation Coefficients
Prob > |r| under H0: Rho=0
Number of Observations

Variable   Label      r with SATM_Miss   Prob > |r|    N
Statoph    Statoph        0.08406          0.0278     685
Gender     Gender        -0.05740          0.1309     694
Ideal      Ideal         -0.01715          0.6531     689
Nucoph     Nucoph         0.00741          0.8458     692
Year       Year           0.08196          0.0309     694
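The p values in the correlation table can be roughly checked by hand. The following is a minimal Python sketch, not part of the SAS program: it converts each r to t = r·sqrt(n − 2)/sqrt(1 − r²) and then uses a normal approximation to the t distribution, which is an assumption that is reasonable only because n is large here. The function name is made up for illustration.

```python
import math

def corr_p_normal_approx(r, n):
    """Return (t, two-tailed p) for H0: rho = 0.

    t = r * sqrt(n - 2) / sqrt(1 - r^2); p uses a normal approximation
    to the t distribution (an assumption; fine when n is in the hundreds).
    """
    t = r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)
    p = math.erfc(abs(t) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|t|))
    return t, p

# r and N copied from the PROC CORR output above
rows = {
    "Statoph": (0.08406, 685),
    "Gender": (-0.05740, 694),
    "Ideal": (-0.01715, 689),
    "Nucoph": (0.00741, 692),
    "Year": (0.08196, 694),
}

for name, (r, n) in rows.items():
    t, p = corr_p_normal_approx(r, n)
    print(f"{name:8s} t = {t:7.3f}   approx p = {p:.4f}")
```

The approximate p values should land close to, but slightly below, the exact values SAS prints, since the normal distribution has thinner tails than t.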
Note that missingness on SATM is associated with statophobia and year.
------------------------------------------------------------------------------------------------
Proc MI seed=69301 out=MIdata;
  var statoph gender ideal nucoph SATM year;
run;
Proc MI is used to create five imputations.
Model Information
Data Set                            WORK.INTROQUEST
Method                              MCMC
Multiple Imputation Chain           Single Chain
Initial Estimates for MCMC          EM Posterior Mode
Start                               Starting Value
Prior                               Jeffreys
Number of Imputations               5
Number of Burn-in Iterations        200
Number of Iterations                100
Seed for random number generator    69301
Missing Data Patterns

Group  Statoph  Gender  Ideal  Nucoph  SATM  Year   Freq  Percent
1      X        X       X      X       X     X       540    77.81
2      X        X       X      X       .     X       139    20.03
3      X        X       X      .       X     X         1     0.14
4      X        X       .      X       X     X         2     0.29
5      X        X       .      X       .     X         3     0.43
6      .        X       X      X       X     X         3     0.43
7      .        X       X      X       .     X         5     0.72
8      .        X       X      .       X     X         1     0.14

Group Means

Group  Statoph   Gender     Ideal       Nucoph      SATM         Year
1      6.1712    1.26666    70.27925    58.05740    506.6685     1997.6666
2      6.6726    1.20863    70.23741    59.18705    .            1999.1510
3      6.0000    1.00000    73.00000    .           650.0000     2012.0000
4      5.0000    1.50000    .           57.50000    440.0000     1993.0000
5      5.3333    1.333333   .           50.000000   .            1999.333333
6      .         1.000000   70.666667   58.333333   550.000000   2010.000000
7      .         1.000000   67.200000   43.400000   .            2010.000000
8      .         1.000000   75.000000   .           730.000000   1991.000000
The most common pattern (aside from complete data) is missingness only on SATM. Means are reported for each of the patterns. Those missing data on SATM do not appear to differ much from those with SATM data, although, as noted earlier, missingness on SATM is related to statophobia. Below we have Expectation Maximization (EM) estimates of the means and covariances.
EM (Posterior Mode) Estimates

_TYPE_  _NAME_    Statoph       Gender      Ideal        Nucoph       SATM          Year
MEAN              6.259824      1.252161    70.255976    58.155101    507.402318    1998.110951
COV     Statoph   5.252201      -0.141896   0.685998     0.893161     -72.908739    -3.509858
COV     Gender    -0.141896     0.186693    -0.912737    -0.826166    2.167681      -0.041964
COV     Ideal     0.685998      -0.912737   14.816058    8.297513     -23.940582    -1.323101
COV     Nucoph    0.893161      -0.826166   8.297513     497.023714   77.921516     2.866324
COV     SATM      -72.908739    2.167681    -23.940582   77.921516    9130.096517   284.126488
COV     Year      -3.509858     -0.041964   -1.323101    2.866324     284.126488    79.070552
Variance Information

           ------------ Variance ------------             Relative      Fraction      Relative
Variable   Between     Within       Total        DF       Increase      Missing       Efficiency
                                                          in Variance   Information
Statoph    0.000237    0.007664     0.007949     549.04   0.037133      0.036421      0.992769
Ideal      0.000194    0.021632     0.021865     670.7    0.010747      0.010688      0.997867
Nucoph     0.004013    0.725252     0.730067     681.36   0.006639      0.006617      0.998678
SATM       3.857993    13.586615    18.216206    55.285   0.340746      0.277121      0.947486
Snip, snip. I have culled the rest of the text output from Proc MI.

/* Run the regression separately within each imputed data set */
Proc Reg outest=MRbyImput covout;
  Model Statoph = gender ideal nucoph SATM year / stb;
  By _Imputation_;
run; quit;
/* Pool the estimates across the imputations */
Proc MIAnalyze;
  modeleffects intercept gender ideal nucoph SATM year;
run;
Here we used Proc Reg to conduct a multiple regression analysis on each of the five imputations.
------------------------------------- Imputation Number=1 --------------------------------------

Analysis of Variance
Source            DF    Sum of Squares   Mean Square   F Value   Pr > F
Model               5   578.89881        115.77976     25.76     <.0001
Error             688   3092.80459       4.49536
Corrected Total   693   3671.70340

Root MSE          2.12023    R-Square   0.1577
Dependent Mean    6.25911    Adj R-Sq   0.1515
Coeff Var         33.87421

Parameter Estimates
Variable    Label       DF   Parameter   Standard     t Value   Pr > |t|   Standardized
                             Estimate    Error                             Estimate
Intercept   Intercept    1   41.56462    19.15755      2.17     0.0304     0
Gender      Gender       1   -0.72482    0.22232      -3.26     0.0012     -0.13684
Ideal       Ideal        1   -0.01048    0.02499      -0.42     0.6750     -0.01763
Nucoph      Nucoph       1   0.00154     0.00362       0.43     0.6696     0.01503
SATM        SATM         1   -0.00836    0.00089371   -9.36     <.0001     -0.34798
Year        Year         1   -0.01477    0.00956      -1.54     0.1230     -0.05739
------------------------------------- Imputation Number=2 --------------------------------------

Analysis of Variance
Source            DF    Sum of Squares   Mean Square   F Value   Pr > F
Model               5   486.59526        97.31905      20.75     <.0001
Error             688   3226.23255       4.68929
Corrected Total   693   3712.82781

Root MSE          2.16548    R-Square   0.1311
Dependent Mean    6.28280    Adj R-Sq   0.1247
Coeff Var         34.46674

Parameter Estimates
Variable    Label       DF   Parameter   Standard     t Value   Pr > |t|   Standardized
                             Estimate    Error                             Estimate
Intercept   Intercept    1   43.57996    19.63473      2.22     0.0268     0
Gender      Gender       1   -0.77863    0.22730      -3.43     0.0006     -0.14618
Ideal       Ideal        1   -0.02357    0.02550      -0.92     0.3556     -0.03946
Nucoph      Nucoph       1   0.00237     0.00370       0.64     0.5228     0.02289
SATM        SATM         1   -0.00735    0.00091452   -8.04     <.0001     -0.30560
Year        Year         1   -0.01556    0.00981      -1.59     0.1132     -0.06011
------------------------------------- Imputation Number=3 --------------------------------------

Analysis of Variance
Source            DF    Sum of Squares   Mean Square   F Value   Pr > F
Model               5   542.46132        108.49226     23.71     <.0001
Error             688   3148.45474       4.57624
Corrected Total   693   3690.91606

Root MSE          2.13922    R-Square   0.1470
Dependent Mean    6.24411    Adj R-Sq   0.1408
Coeff Var         34.25972

Parameter Estimates
Variable    Label       DF   Parameter   Standard     t Value   Pr > |t|   Standardized
                             Estimate    Error                             Estimate
Intercept   Intercept    1   47.65661    19.41478      2.45     0.0143     0
Gender      Gender       1   -0.73550    0.22412      -3.28     0.0011     -0.13850
Ideal       Ideal        1   -0.00947    0.02528      -0.37     0.7080     -0.01585
Nucoph      Nucoph       1   0.00284     0.00365       0.78     0.4367     0.02759
SATM        SATM         1   -0.00754    0.00086555   -8.71     <.0001     -0.32789
Year        Year         1   -0.01809    0.00970      -1.87     0.0626     -0.07011
------------------------------------- Imputation Number=4 --------------------------------------

Analysis of Variance
Source            DF    Sum of Squares   Mean Square   F Value   Pr > F
Model               5   495.04444        99.00889      21.43     <.0001
Error             688   3178.16052       4.61942
Corrected Total   693   3673.20496

Root MSE          2.14928    R-Square   0.1348
Dependent Mean    6.26736    Adj R-Sq   0.1285
Coeff Var         34.29329

Parameter Estimates
Variable    Label       DF   Parameter   Standard     t Value   Pr > |t|   Standardized
                             Estimate    Error                             Estimate
Intercept   Intercept    1   47.04474    19.41199      2.42     0.0156     0
Gender      Gender       1   -0.77684    0.22556      -3.44     0.0006     -0.14663
Ideal       Ideal        1   -0.02171    0.02532      -0.86     0.3916     -0.03654
Nucoph      Nucoph       1   0.00208     0.00366       0.57     0.5699     0.02033
SATM        SATM         1   -0.00742    0.00090292   -8.21     <.0001     -0.30996
Year        Year         1   -0.01733    0.00969      -1.79     0.0741     -0.06732
------------------------------------- Imputation Number=5 --------------------------------------

Analysis of Variance
Source            DF    Sum of Squares   Mean Square   F Value   Pr > F
Model               5   479.36626        95.87325      20.60     <.0001
Error             688   3201.88856       4.65391
Corrected Total   693   3681.25482

Root MSE          2.15729    R-Square   0.1302
Dependent Mean    6.24894    Adj R-Sq   0.1239
Coeff Var         34.52255

Parameter Estimates
Variable    Label       DF   Parameter   Standard     t Value   Pr > |t|   Standardized
                             Estimate    Error                             Estimate
Intercept   Intercept    1   54.51860    19.46584      2.80     0.0052     0
Gender      Gender       1   -0.68471    0.22651      -3.02     0.0026     -0.12910
Ideal       Ideal        1   -0.00908    0.02533      -0.36     0.7202     -0.01533
Nucoph      Nucoph       1   0.00235     0.00368       0.64     0.5227     0.02289
SATM        SATM         1   -0.00705    0.00089989   -7.84     <.0001     -0.29619
Year        Year         1   -0.02170    0.00972      -2.23     0.0259     -0.08418
Proc MIAnalyze is used to pool the results from the multiple imputations. The variance in the estimated parameters is partitioned into that among imputations (A, labeled “Between” in the output) and that within imputations (W). The “Relative Increase in Variance” (r) is the increase in variance due to having missing data imputed (relative to the condition where no data are missing), $r = \frac{(1 + m^{-1})A}{W}$, where “m” is the number of imputations. A related statistic, “Fraction of Missing Information,” is an index of how much more precise the parameter estimate would have been if there had been no missing data. Power will, of course, be greater when the fraction of missing information and relative increase in variance are small. The greater the number of imputations, the less the error and the greater the power, ceteris paribus. “Relative efficiency” tells you how much power you have for the number of imputations you have employed relative to what you would have if you used an infinitely large number of imputations.
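As a check on these definitions, here is a minimal Python sketch (not part of the SAS program) that computes the statistics from the among-imputation variance A, the within-imputation variance W, and the number of imputations m. The degrees-of-freedom and fraction-of-missing-information formulas are the standard Rubin expressions, which I am assuming are the ones Proc MIAnalyze applies by default; the function and key names are made up for illustration. The inputs are the rounded SATM values printed by Proc MIAnalyze, so agreement with the SAS output is only to about three decimals.

```python
def mi_variance_info(A, W, m):
    """Variance diagnostics from among (A) and within (W) imputation variance."""
    total = W + (1 + 1 / m) * A          # total variance of the pooled estimate
    r = (1 + 1 / m) * A / W              # relative increase in variance
    df = (m - 1) * (1 + 1 / r) ** 2      # degrees of freedom (assumed default)
    fmi = (r + 2 / (df + 3)) / (r + 1)   # fraction of missing information
    re = 1 / (1 + fmi / m)               # relative efficiency
    return {"total": total, "r": r, "df": df, "fmi": fmi, "re": re}

# SATM: Between = 0.000000242, Within = 0.000000802, five imputations
satm = mi_variance_info(A=0.000000242, W=0.000000802, m=5)
for name, value in satm.items():
    print(f"{name:>5}: {value:.6g}")
```

For comparison, SAS reports for SATM a total variance of 0.000001092, DF of 56.584, a relative increase in variance of 0.362170, a fraction of missing information of 0.290519, and a relative efficiency of 0.945087.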
The MIANALYZE Procedure

Variance Information

            --------------- Variance ---------------             Relative      Fraction      Relative
Parameter   Between       Within        Total         DF         Increase      Missing       Efficiency
                                                                 in Variance   Information
intercept   24.530433     377.042475    406.478994    762.72     0.078072      0.074841      0.985253
gender      0.001539      0.050701      0.052548      3237       0.036433      0.035748      0.992901
ideal       0.000051078   0.000639      0.000701      522.62     0.095874      0.090958      0.982133
nucoph      0.000000224   0.000013399   0.000013669   10309      0.020094      0.019888      0.996038
SATM        0.000000242   0.000000802   0.000001092   56.584     0.362170      0.290519      0.945087
year        0.000007301   0.000094011   0.000103      550.36     0.093198      0.088559      0.982596

Parameter Estimates

Parameter   Estimate    95% Confidence Limits   DF       Minimum     Maximum     t       Pr > |t|
intercept   46.872905   7.29463     86.45118    762.72   41.564615   54.518596    2.32   0.0203
gender      -0.740101   -1.18956    -0.29064    3237     -0.778630   -0.684711   -3.23   0.0013
ideal       -0.014862   -0.06686    0.03714     522.62   -0.023569   -0.009078   -0.56   0.5747
nucoph      0.002236    -0.00501    0.00948     10309    0.001544    0.002838     0.60   0.5453
SATM        -0.007544   -0.00964    -0.00545    56.584   -0.008363   -0.007052   -7.22   <.0001
year        -0.017489   -0.03740    0.00242     550.36   -0.021695   -0.014770   -1.73   0.0851
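The pooled SATM row can be reproduced, up to rounding, from the five per-imputation Proc Reg results. The following Python sketch is an illustration, not part of the SAS program, and its function name is hypothetical. It applies Rubin's rules: the pooled estimate is the mean of the five coefficients, W is the mean of the squared standard errors, A is the sample variance of the coefficients, and the total variance is W + (1 + 1/m)A.

```python
from math import sqrt
from statistics import mean, variance

def pool_rubin(estimates, std_errors):
    """Pool one coefficient across m imputations using Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)                     # pooled point estimate
    W = mean([se ** 2 for se in std_errors])    # within-imputation variance
    A = variance(estimates)                     # among-imputation variance (divisor m - 1)
    total = W + (1 + 1 / m) * A                 # total variance
    t = q_bar / sqrt(total)                     # test of H0: parameter = 0
    return {"estimate": q_bar, "W": W, "A": A, "total": total, "t": t}

# SATM slopes and standard errors copied from imputations 1 through 5
satm_b = [-0.00836, -0.00735, -0.00754, -0.00742, -0.00705]
satm_se = [0.00089371, 0.00091452, 0.00086555, 0.00090292, 0.00089989]

pooled = pool_rubin(satm_b, satm_se)
for name, value in pooled.items():
    print(f"{name:>8}: {value:.6g}")
```

Because the printed slopes are rounded to five decimals, the among-imputation variance computed this way differs slightly from the 0.000000242 that SAS reports, but the pooled estimate (-0.007544) and t (-7.22) should agree with the table.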
Multiple Imputation with IBM SPSS
Analyze, Multiple Imputation, Impute Missing Data Values
*Impute Missing Data Values.
DATASET DECLARE IntroQ_Imputed.
MULTIPLE IMPUTATION Statoph Gender Ideal Nucoph SATM Year
  /IMPUTE METHOD=AUTO NIMPUTATIONS=5 MAXPCTMISSING=NONE
  /MISSINGSUMMARIES NONE
  /IMPUTATIONSUMMARIES MODELS
  /OUTFILE IMPUTATIONS=IntroQ_Imputed .

Multiple Imputation
[DataSet] C:\Users\Vati\Documents\StatData\IntroQ\IntroQ.sav
Imputation Specifications
Imputation Method                                    Automatic
Number of Imputations                                5
Model for Scale Variables                            Linear Regression
Interactions Included in Models                      (none)
Maximum Percentage of Missing Values                 100.0%
Maximum Number of Parameters in Imputation Model     100
Imputed Values
Imputation Results
Imputation Method                                          Fully Conditional Specification
Fully Conditional Specification Method Iterations          10
Dependent Variables   Imputed                              Statoph,Ideal,Nucoph,SATM
                      Not Imputed(Too Many Missing Values)
                      Not Imputed(No Missing Values)       Gender,Year
Imputation Sequence                                        Gender,Year,Nucoph,Ideal,Statoph,SATM
Imputation Models

           Model                                                  Missing   Imputed
           Type                Effects                            Values    Values
Nucoph     Linear Regression   Gender,Year,Ideal,Statoph,SATM       2        10
Ideal      Linear Regression   Gender,Year,Nucoph,Statoph,SATM      5        25
Statoph    Linear Regression   Gender,Year,Nucoph,Ideal,SATM        9        45
SATM       Linear Regression   Gender,Year,Nucoph,Ideal,Statoph   147       735
At this point SPSS has created a new data set with the original data (imputation 0) and the imputed data (in this case, imputations 1 through 5). The cells with imputed scores are highlighted. Now all you need do is run the desired analysis. If that analysis is supported, SPSS will automatically analyze the original data and each imputed set of data and give you convenient summaries of the results.
DATASET ACTIVATE IntroQ_MultipleImputation.
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT Statoph
  /METHOD=ENTER Gender Ideal Nucoph SATM Year.
Model Summary

Imputation_     Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
Original data   1       .371 (a)   .138       .129                2.1940
1               1       .361 (b)   .130       .124                2.1569
2               1       .352 (b)   .124       .117                2.1682
3               1       .388 (b)   .150       .144                2.1279
4               1       .367 (b)   .134       .128                2.1547
5               1       .340 (b)   .116       .109                2.1649

a. Predictors: (Constant), Year, Nucoph, Ideal, SATM, Gender
b. Predictors: (Constant), Year, Gender, Nucoph, SATM, Ideal
ANOVA (a)

Imputation_     Model                Sum of Squares   df    Mean Square   F        Sig.
Original data   1       Regression   410.002            5   82.000        17.036   .000 (b)
                        Residual     2570.403         534   4.813
                        Total        2980.405         539
1               1       Regression   480.049            5   96.010        20.637   .000 (c)
                        Residual     3200.736         688   4.652
                        Total        3680.785         693
2               1       Regression   456.176            5   91.235        19.406   .000 (c)
                        Residual     3234.479         688   4.701
                        Total        3690.656         693
3               1       Regression   551.694            5   110.339       24.368   .000 (c)
                        Residual     3115.278         688   4.528
                        Total        3666.972         693
4               1       Regression   495.625            5   99.125        21.351   .000 (c)
                        Residual     3194.150         688   4.643
                        Total        3689.775         693
5               1       Regression   421.753            5   84.351        17.997   .000 (c)
                        Residual     3224.629         688   4.687
                        Total        3646.382         693

a. Dependent Variable: Statoph
b. Predictors: (Constant), Year, Nucoph, Ideal, SATM, Gender
c. Predictors: (Constant), Year, Gender, Nucoph, SATM, Ideal
Coefficients (a)

                                     Unstandardized         Standardized
                                     Coefficients           Coefficients
Imputation_     Model                B         Std. Error   Beta           t        Sig.
Original data   1       (Constant)   45.632    22.219                      2.054    .040
                        Gender       -.833     .270         -.157          -3.083   .002
                        Ideal        -.019     .031         -.031          -.616    .538
                        Nucoph       .003      .004         .032           .801     .424
                        SATM         -.008     .001         -.308          -7.183   .000
                        Year         -.017     .011         -.064          -1.506   .133
1               1       (Constant)   49.000    19.465                      2.517    .012
                        Gender       -.827     .226         -.156          -3.655   .000
                        Ideal        -.032     .026         -.054          -1.265   .206
                        Nucoph       .002      .004         .022           .606     .545
                        SATM         -.007     .001         -.302          -7.951   .000
                        Year         -.018     .010         -.070          -1.842   .066
2               1       (Constant)   53.105    19.503                      2.723    .007
                        Gender       -.827     .227         -.156          -3.636   .000
                        Ideal        -.021     .026         -.035          -.816    .415
                        Nucoph       .001      .004         .015           .405     .686
                        SATM         -.007     .001         -.287          -7.587   .000
                        Year         -.020     .010         -.079          -2.102   .036
3               1       (Constant)   45.250    19.175                      2.360    .019
                        Gender       -.717     .223         -.136          -3.219   .001
                        Ideal        -.013     .025         -.022          -.521    .603
                        Nucoph       .002      .004         .016           .465     .642
                        SATM         -.008     .001         -.335          -8.988   .000
                        Year         -.017     .010         -.065          -1.737   .083
4               1       (Constant)   43.517    19.558                      2.225    .026
                        Gender       -.810     .226         -.153          -3.581   .000
                        Ideal        -.019     .025         -.032          -.738    .460
                        Nucoph       .001      .004         .012           .345     .730
                        SATM         -.007     .001         -.312          -8.216   .000
                        Year         -.016     .010         -.061          -1.598   .111
5               1       (Constant)   50.098    19.720                      2.540    .011
                        Gender       -.751     .226         -.142          -3.325   .001
                        Ideal        -.014     .025         -.023          -.533    .594
                        Nucoph       .002      .004         .016           .452     .651
                        SATM         -.007     .001         -.273          -7.073   .000
                        Year         -.019     .010         -.076          -1.963   .050
Pooled          1       (Constant)   48.194    19.934                      2.418    .016
                        Gender       -.787     .232                        -3.387   .001
                        Ideal        -.020     .027                        -.735    .463
                        Nucoph       .002      .004                        .452     .651
                        SATM         -.007     .001                        -6.730   .000
                        Year         -.018     .010                        -1.805   .071
Coefficients (a)

                                     Fraction        Relative Increase   Relative
Imputation_     Model                Missing Info.   Variance            Efficiency
Pooled          1       (Constant)   .045            .047                .991
                        Gender       .056            .058                .989
                        Ideal        .106            .113                .979
                        Nucoph       .011            .011                .998
                        SATM         .311            .396                .941
                        Year         .048            .049                .991

a. Dependent Variable: Statoph
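The relative efficiencies in the pooled table follow from the fraction of missing information and the number of imputations. The Python sketch below is an illustration only, with a hypothetical function name. It assumes the usual formula, relative efficiency = 1 / (1 + FMI/m), and shows what 20 or 100 imputations would give for the same fractions of missing information. The FMI values are the rounded ones printed by SPSS, so a third-decimal discrepancy is possible.

```python
def relative_efficiency(fmi, m):
    """Efficiency of m imputations relative to infinitely many (assumed formula)."""
    return 1 / (1 + fmi / m)

# Fraction Missing Info. copied from the pooled SPSS table
spss_fmi = {
    "(Constant)": 0.045,
    "Gender": 0.056,
    "Ideal": 0.106,
    "Nucoph": 0.011,
    "SATM": 0.311,
    "Year": 0.048,
}

for name, fmi in spss_fmi.items():
    by_m = {m: round(relative_efficiency(fmi, m), 3) for m in (5, 20, 100)}
    print(f"{name:11s} {by_m}")
```

With only five imputations the efficiency for SATM, the variable with by far the most missing data, is already about .94, and it rises as m increases.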
• Karl L. Wuensch, January, 2021
• Return to Wuensch’s Stats Lessons Page
• Treatment of Missing Data – recommended reading, David Howell.
• Multiple Imputation with SAS – nice tutorial from UCLA