
Post on 12-Oct-2020


Page 1: Multiple Imputation and Multiple Regression with SAS and ...core.ecu.edu/psyc/wuenschk/MV/MultReg/MultReg_Mult-Imputation.pdfwith SATM data. Below we have Expectation Maximization

MultReg_Mult-Imputation.docx

Multiple Imputation and Multiple Regression with SAS and IBM SPSS

See IntroQ Questionnaire for a description of the survey used to generate the data used here.

*** Mult-Imput_M-Reg.sas ***;
options pageno=min nodate formdlim='-';
title 'Multiple Imputation of Missing Data then Multiple Regression.'; run;
/* Import the SPSS data file */
PROC IMPORT OUT= WORK.IntroQuest
  DATAFILE= "C:\Users\Vati\Documents\StatData\IntroQ\IntroQ.sav"
  DBMS=SPSS REPLACE;
RUN;
/* Flag cases that are missing SATM (1 = missing, 0 = present) */
Data Priapus; set IntroQuest;
  SATM_Miss = 0;
  If SATM = . then SATM_Miss = 1;
proc means n nmiss; run;
/* Is missingness on SATM related to the other variables? */
proc corr nosimple; var SATM_Miss; with statoph gender ideal nucoph year; run;

The data are imported from an SPSS “.sav” file.

The MEANS Procedure

Variable   Label      N     N Miss
Gender     Gender     694     0
Ideal      Ideal      689     5
Eye        Eye        693     1
Statoph    Statoph    685     9
Nucoph     Nucoph     692     2
SATM       SATM       547   147
Year       Year       694     0

Pearson Correlation Coefficients
Prob > |r| under H0: Rho=0
Number of Observations

Variable   Label      SATM_Miss (r)   Prob > |r|   N
Statoph    Statoph        0.08406       0.0278     685
Gender     Gender        -0.05740       0.1309     694
Ideal      Ideal         -0.01715       0.6531     689
Nucoph     Nucoph         0.00741       0.8458     692
Year       Year           0.08196       0.0309     694
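The p values in the correlation output can be roughly recovered from r and N alone. The following is a minimal Python sketch, not part of the SAS program; the function name `corr_t_and_p` is made up for illustration, and it assumes a normal approximation to the t distribution (reasonable here only because N is large), so it will not match SAS to the last digit.

```python
import math

def corr_t_and_p(r, n):
    """t statistic for H0: rho = 0, plus an approximate two-tailed p value."""
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
    # Assumption: normal approximation to t with n - 2 df (fine for n near 700).
    p = math.erfc(abs(t) / math.sqrt(2))
    return t, p

# r and N copied from the correlations with SATM_Miss
t_stat, p_stat = corr_t_and_p(0.08406, 685)   # Statoph
t_year, p_year = corr_t_and_p(0.08196, 694)   # Year
print(f"Statoph: t = {t_stat:.2f}, approx p = {p_stat:.4f}")
print(f"Year:    t = {t_year:.2f}, approx p = {p_year:.4f}")
```

The approximate p values land within about .001 of the values SAS reports for Statoph and Year.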

Note that missingness on SATM is associated with statophobia and year.
------------------------------------------------------------------------------------------------
Proc MI seed=69301 out=MIdata;
  var statoph gender ideal nucoph SATM year;
run;

Proc MI is used to create five imputations.

Model Information

Data Set                            WORK.INTROQUEST
Method                              MCMC
Multiple Imputation Chain           Single Chain
Initial Estimates for MCMC          EM Posterior Mode
Start                               Starting Value
Prior                               Jeffreys
Number of Imputations               5
Number of Burn-in Iterations        200
Number of Iterations                100
Seed for random number generator    69301

Missing Data Patterns

Group  Statoph  Gender  Ideal  Nucoph  SATM  Year  Freq  Percent
1      X        X       X      X       X     X     540   77.81
2      X        X       X      X       .     X     139   20.03
3      X        X       X      .       X     X       1    0.14
4      X        X       .      X       X     X       2    0.29
5      X        X       .      X       .     X       3    0.43
6      .        X       X      X       X     X       3    0.43
7      .        X       X      X       .     X       5    0.72
8      .        X       X      .       X     X       1    0.14

Group Means

Group  Statoph  Gender    Ideal      Nucoph     SATM        Year
1      6.1712   1.26666   70.27925   58.05740   506.6685    1997.6666
2      6.6726   1.20863   70.23741   59.18705   .           1999.1510
3      6.0000   1.00000   73.00000   .          650.0000    2012.0000
4      5.0000   1.50000   .          57.50000   440.0000    1993.0000
5      5.3333   1.333333  .          50.000000  .           1999.333333
6      .        1.000000  70.666667  58.333333  550.000000  2010.000000
7      .        1.000000  67.200000  43.400000  .           2010.000000
8      .        1.000000  75.000000  .          730.000000  1991.000000


The most common pattern (aside from complete data) is missingness only on SATM. We have means for each of the patterns. Cases missing SATM do not appear to differ much from cases with SATM data, although, as noted earlier, missingness on SATM is related to statophobia. Below we have Expectation Maximization estimates of means and covariances.

EM (Posterior Mode) Estimates

_TYPE_  _NAME_    Statoph      Gender      Ideal        Nucoph       SATM          Year
MEAN              6.259824     1.252161    70.255976    58.155101    507.402318    1998.110951
COV     Statoph   5.252201     -0.141896   0.685998     0.893161     -72.908739    -3.509858
COV     Gender    -0.141896    0.186693    -0.912737    -0.826166    2.167681      -0.041964
COV     Ideal     0.685998     -0.912737   14.816058    8.297513     -23.940582    -1.323101
COV     Nucoph    0.893161     -0.826166   8.297513     497.023714   77.921516     2.866324
COV     SATM      -72.908739   2.167681    -23.940582   77.921516    9130.096517   284.126488
COV     Year      -3.509858    -0.041964   -1.323101    2.866324     284.126488    79.070552

Variance Information

           ----------- Variance -----------             Relative      Fraction
                                                        Increase      Missing       Relative
Variable   Between     Within      Total       DF       in Variance   Information   Efficiency
Statoph    0.000237    0.007664    0.007949    549.04   0.037133      0.036421      0.992769
Ideal      0.000194    0.021632    0.021865    670.7    0.010747      0.010688      0.997867
Nucoph     0.004013    0.725252    0.730067    681.36   0.006639      0.006617      0.998678
SATM       3.857993    13.586615   18.216206   55.285   0.340746      0.277121      0.947486

Snip, snip. I have culled the rest of the text output from Proc MI.

Proc Reg outest=MRbyImput covout;
  Model Statoph = gender ideal nucoph SATM year / stb;
  By _Imputation_;
run; quit;
Proc MIAnalyze;
  modeleffects intercept gender ideal nucoph SATM year;
run;

Here we used Proc Reg to conduct a multiple regression analysis on each of the five imputations.

------------------------------------- Imputation Number=1 --------------------------------------

Analysis of Variance

Source            DF    Sum of Squares   Mean Square   F Value   Pr > F
Model               5      578.89881      115.77976     25.76    <.0001
Error             688     3092.80459        4.49536
Corrected Total   693     3671.70340


Root MSE          2.12023    R-Square   0.1577
Dependent Mean    6.25911    Adj R-Sq   0.1515
Coeff Var        33.87421

Parameter Estimates

                              Parameter    Standard                          Standardized
Variable    Label       DF    Estimate     Error        t Value   Pr > |t|   Estimate
Intercept   Intercept    1    41.56462     19.15755       2.17     0.0304     0
Gender      Gender       1    -0.72482     0.22232       -3.26     0.0012    -0.13684
Ideal       Ideal        1    -0.01048     0.02499       -0.42     0.6750    -0.01763
Nucoph      Nucoph       1     0.00154     0.00362        0.43     0.6696     0.01503
SATM        SATM         1    -0.00836     0.00089371    -9.36     <.0001    -0.34798
Year        Year         1    -0.01477     0.00956       -1.54     0.1230    -0.05739

------------------------------------- Imputation Number=2 --------------------------------------

Analysis of Variance

Source            DF    Sum of Squares   Mean Square   F Value   Pr > F
Model               5      486.59526       97.31905     20.75    <.0001
Error             688     3226.23255        4.68929
Corrected Total   693     3712.82781

Root MSE          2.16548    R-Square   0.1311
Dependent Mean    6.28280    Adj R-Sq   0.1247
Coeff Var        34.46674

Parameter Estimates

                              Parameter    Standard                          Standardized
Variable    Label       DF    Estimate     Error        t Value   Pr > |t|   Estimate
Intercept   Intercept    1    43.57996     19.63473       2.22     0.0268     0
Gender      Gender       1    -0.77863     0.22730       -3.43     0.0006    -0.14618
Ideal       Ideal        1    -0.02357     0.02550       -0.92     0.3556    -0.03946
Nucoph      Nucoph       1     0.00237     0.00370        0.64     0.5228     0.02289
SATM        SATM         1    -0.00735     0.00091452    -8.04     <.0001    -0.30560
Year        Year         1    -0.01556     0.00981       -1.59     0.1132    -0.06011


------------------------------------- Imputation Number=3 --------------------------------------

Analysis of Variance

Source            DF    Sum of Squares   Mean Square   F Value   Pr > F
Model               5      542.46132      108.49226     23.71    <.0001
Error             688     3148.45474        4.57624
Corrected Total   693     3690.91606

Root MSE          2.13922    R-Square   0.1470
Dependent Mean    6.24411    Adj R-Sq   0.1408
Coeff Var        34.25972

Parameter Estimates

                              Parameter    Standard                          Standardized
Variable    Label       DF    Estimate     Error        t Value   Pr > |t|   Estimate
Intercept   Intercept    1    47.65661     19.41478       2.45     0.0143     0
Gender      Gender       1    -0.73550     0.22412       -3.28     0.0011    -0.13850
Ideal       Ideal        1    -0.00947     0.02528       -0.37     0.7080    -0.01585
Nucoph      Nucoph       1     0.00284     0.00365        0.78     0.4367     0.02759
SATM        SATM         1    -0.00754     0.00086555    -8.71     <.0001    -0.32789
Year        Year         1    -0.01809     0.00970       -1.87     0.0626    -0.07011

------------------------------------- Imputation Number=4 --------------------------------------

Analysis of Variance

Source            DF    Sum of Squares   Mean Square   F Value   Pr > F
Model               5      495.04444       99.00889     21.43    <.0001
Error             688     3178.16052        4.61942
Corrected Total   693     3673.20496

Root MSE          2.14928    R-Square   0.1348
Dependent Mean    6.26736    Adj R-Sq   0.1285
Coeff Var        34.29329

Parameter Estimates

                              Parameter    Standard                          Standardized
Variable    Label       DF    Estimate     Error        t Value   Pr > |t|   Estimate
Intercept   Intercept    1    47.04474     19.41199       2.42     0.0156     0
Gender      Gender       1    -0.77684     0.22556       -3.44     0.0006    -0.14663
Ideal       Ideal        1    -0.02171     0.02532       -0.86     0.3916    -0.03654
Nucoph      Nucoph       1     0.00208     0.00366        0.57     0.5699     0.02033
SATM        SATM         1    -0.00742     0.00090292    -8.21     <.0001    -0.30996
Year        Year         1    -0.01733     0.00969       -1.79     0.0741    -0.06732

------------------------------------- Imputation Number=5 --------------------------------------

Analysis of Variance

Source            DF    Sum of Squares   Mean Square   F Value   Pr > F
Model               5      479.36626       95.87325     20.60    <.0001
Error             688     3201.88856        4.65391
Corrected Total   693     3681.25482

Root MSE          2.15729    R-Square   0.1302
Dependent Mean    6.24894    Adj R-Sq   0.1239
Coeff Var        34.52255

Parameter Estimates

                              Parameter    Standard                          Standardized
Variable    Label       DF    Estimate     Error        t Value   Pr > |t|   Estimate
Intercept   Intercept    1    54.51860     19.46584       2.80     0.0052     0
Gender      Gender       1    -0.68471     0.22651       -3.02     0.0026    -0.12910
Ideal       Ideal        1    -0.00908     0.02533       -0.36     0.7202    -0.01533
Nucoph      Nucoph       1     0.00235     0.00368        0.64     0.5227     0.02289
SATM        SATM         1    -0.00705     0.00089989    -7.84     <.0001    -0.29619
Year        Year         1    -0.02170     0.00972       -2.23     0.0259    -0.08418

Proc MIAnalyze is used to pool the results from the multiple imputations. The variance in the estimated parameters is partitioned between that among imputations (A) and that within imputations (W). The “Relative Increase in Variance” (r) is the increase in variance due to having missing data imputed (relative to the condition where no data are missing),

$$ r = \frac{(1 + m^{-1})\,A}{W} , $$

where “m” is the number of imputations. A related statistic, “Fraction of Missing Information,” is an index of how much more precise the parameter estimate would have been if there had been no missing data. Power will, of course, be greater when the fraction of missing information and relative increase in variance are small. The greater the number of imputations, the less the error and the greater the power, ceteris paribus. “Relative efficiency” tells you how much power you have for the number of imputations you have employed relative to what you would have if you used an infinitely large number of imputations.
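To make these quantities concrete, here is a minimal Python sketch (not part of the SAS program) that starts from the between (A) and within (W) variances for SATM in the MIANALYZE output. Only the formula for r comes from the text. The total variance, degrees of freedom, fraction of missing information, and relative efficiency use the standard Rubin formulas, which is an assumption on my part, although they do reproduce the printed values to within rounding. The function name `mi_variance_summary` is hypothetical.

```python
def mi_variance_summary(between, within, m):
    """Variance diagnostics for one pooled parameter across m imputations."""
    total = within + (1 + 1 / m) * between    # total variance of the pooled estimate
    r = (1 + 1 / m) * between / within        # relative increase in variance
    df = (m - 1) * (1 + 1 / r) ** 2           # assumed: Rubin's degrees of freedom
    fmi = (r + 2 / (df + 3)) / (r + 1)        # assumed: fraction of missing information
    rel_eff = 1 / (1 + fmi / m)               # assumed: efficiency relative to infinite m
    return {"total": total, "r": r, "df": df, "fmi": fmi, "rel_eff": rel_eff}

# Between and within variances for the SATM slope, copied from PROC MIANALYZE
satm = mi_variance_summary(between=0.000000242, within=0.000000802, m=5)
for name, value in satm.items():
    print(f"{name:8s} {value:.6g}")
```

Because the inputs are the rounded values from the printout, the results agree with the SATM row of the MIANALYZE table only to about three decimal places.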

The MIANALYZE Procedure

Variance Information

            ------------- Variance --------------             Relative      Fraction
                                                              Increase      Missing       Relative
Parameter   Between       Within        Total         DF      in Variance   Information   Efficiency
intercept   24.530433     377.042475    406.478994    762.72  0.078072      0.074841      0.985253
gender      0.001539      0.050701      0.052548      3237    0.036433      0.035748      0.992901
ideal       0.000051078   0.000639      0.000701      522.62  0.095874      0.090958      0.982133
nucoph      0.000000224   0.000013399   0.000013669   10309   0.020094      0.019888      0.996038
SATM        0.000000242   0.000000802   0.000001092   56.584  0.362170      0.290519      0.945087
year        0.000007301   0.000094011   0.000103      550.36  0.093198      0.088559      0.982596

Parameter Estimates

Parameter   Estimate    95% Confidence Limits   DF      Minimum     Maximum     t       Pr > |t|
intercept   46.872905    7.29463   86.45118     762.72  41.564615   54.518596    2.32   0.0203
gender      -0.740101   -1.18956   -0.29064     3237    -0.778630   -0.684711   -3.23   0.0013
ideal       -0.014862   -0.06686    0.03714     522.62  -0.023569   -0.009078   -0.56   0.5747
nucoph       0.002236   -0.00501    0.00948     10309    0.001544    0.002838    0.60   0.5453
SATM        -0.007544   -0.00964   -0.00545     56.584  -0.008363   -0.007052   -7.22   <.0001
year        -0.017489   -0.03740    0.00242     550.36  -0.021695   -0.014770   -1.73   0.0851
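The pooled row for gender can be rebuilt by hand from the five Proc Reg runs. The sketch below is Python rather than SAS and assumes Rubin's combining rules: the pooled estimate is the mean of the five estimates, W is the mean of the squared standard errors, and A is the sample variance of the five estimates. The function name `pool_estimates` is made up for illustration.

```python
import math
from statistics import mean, variance

# Gender slope and its standard error from imputations 1 through 5 (Proc Reg output)
estimates = [-0.72482, -0.77863, -0.73550, -0.77684, -0.68471]
std_errors = [0.22232, 0.22730, 0.22412, 0.22556, 0.22651]

def pool_estimates(estimates, std_errors):
    """Pool one coefficient across imputations (assumed: Rubin's rules)."""
    m = len(estimates)
    pooled = mean(estimates)                       # pooled point estimate
    within = mean([se ** 2 for se in std_errors])  # W: average sampling variance
    between = variance(estimates)                  # A: variance among imputations (divisor m - 1)
    total = within + (1 + 1 / m) * between         # total variance
    pooled_se = math.sqrt(total)
    return pooled, pooled_se, pooled / pooled_se   # estimate, standard error, t

gender_est, gender_se, gender_t = pool_estimates(estimates, std_errors)
print(f"estimate = {gender_est:.6f}, SE = {gender_se:.5f}, t = {gender_t:.2f}")
```

The squared standard error should agree with the Total variance for gender in the Variance Information table, and the estimate and t with the gender row of the Parameter Estimates table, up to rounding in the per-imputation printouts.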


Multiple Imputation with IBM SPSS

Analyze, Multiple Imputation, Impute Missing Data Values

*Impute Missing Data Values.
DATASET DECLARE IntroQ_Imputed.
MULTIPLE IMPUTATION Statoph Gender Ideal Nucoph SATM Year
  /IMPUTE METHOD=AUTO NIMPUTATIONS=5 MAXPCTMISSING=NONE
  /MISSINGSUMMARIES NONE
  /IMPUTATIONSUMMARIES MODELS
  /OUTFILE IMPUTATIONS=IntroQ_Imputed .

Multiple Imputation

[DataSet] C:\Users\Vati\Documents\StatData\IntroQ\IntroQ.sav


Imputation Specifications

Imputation Method                                   Automatic
Number of Imputations                               5
Model for Scale Variables                           Linear Regression
Interactions Included in Models                     (none)
Maximum Percentage of Missing Values                100.0%
Maximum Number of Parameters in Imputation Model    100

Imputed Values

Imputation Results

Imputation Method                                        Fully Conditional Specification
Fully Conditional Specification Method Iterations        10
Dependent Variables: Imputed                             Statoph,Ideal,Nucoph,SATM
Dependent Variables: Not Imputed(Too Many Missing Values)
Dependent Variables: Not Imputed(No Missing Values)      Gender,Year
Imputation Sequence                                      Gender,Year,Nucoph,Ideal,Statoph,SATM

Imputation Models

           Model Type          Model Effects                      Missing Values   Imputed Values
Nucoph     Linear Regression   Gender,Year,Ideal,Statoph,SATM           2                10
Ideal      Linear Regression   Gender,Year,Nucoph,Statoph,SATM          5                25
Statoph    Linear Regression   Gender,Year,Nucoph,Ideal,SATM            9                45
SATM       Linear Regression   Gender,Year,Nucoph,Ideal,Statoph       147               735


At this point SPSS has created a new data set with the original data (imputation 0) and the imputed data (in this case, imputations 1 through 5).

The cells with imputed scores are highlighted. Now all you need do is run the desired analysis. If that analysis is supported, SPSS will automatically analyze the original data and each imputed set of data and give you convenient summaries of the results.

DATASET ACTIVATE IntroQ_MultipleImputation.
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT Statoph
  /METHOD=ENTER Gender Ideal Nucoph SATM Year.


Model Summary

Imputation_     Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
Original data   1       .371a   .138       .129                2.1940
1               1       .361b   .130       .124                2.1569
2               1       .352b   .124       .117                2.1682
3               1       .388b   .150       .144                2.1279
4               1       .367b   .134       .128                2.1547
5               1       .340b   .116       .109                2.1649

a. Predictors: (Constant), Year, Nucoph, Ideal, SATM, Gender
b. Predictors: (Constant), Year, Gender, Nucoph, SATM, Ideal

ANOVAa

Imputation_     Model            Sum of Squares   df    Mean Square   F        Sig.
Original data   1   Regression      410.002         5     82.000      17.036   .000b
                    Residual       2570.403       534      4.813
                    Total          2980.405       539
1               1   Regression      480.049         5     96.010      20.637   .000c
                    Residual       3200.736       688      4.652
                    Total          3680.785       693
2               1   Regression      456.176         5     91.235      19.406   .000c
                    Residual       3234.479       688      4.701
                    Total          3690.656       693
3               1   Regression      551.694         5    110.339      24.368   .000c
                    Residual       3115.278       688      4.528
                    Total          3666.972       693
4               1   Regression      495.625         5     99.125      21.351   .000c
                    Residual       3194.150       688      4.643
                    Total          3689.775       693
5               1   Regression      421.753         5     84.351      17.997   .000c
                    Residual       3224.629       688      4.687
                    Total          3646.382       693

a. Dependent Variable: Statoph
b. Predictors: (Constant), Year, Nucoph, Ideal, SATM, Gender
c. Predictors: (Constant), Year, Gender, Nucoph, SATM, Ideal


Coefficientsa

                                 Unstandardized Coefficients   Standardized Coefficients
Imputation_     Model            B         Std. Error          Beta                        t        Sig.
Original data   1   (Constant)   45.632    22.219                                          2.054    .040
                    Gender       -.833     .270                -.157                       -3.083   .002
                    Ideal        -.019     .031                -.031                       -.616    .538
                    Nucoph       .003      .004                .032                        .801     .424
                    SATM         -.008     .001                -.308                       -7.183   .000
                    Year         -.017     .011                -.064                       -1.506   .133
1               1   (Constant)   49.000    19.465                                          2.517    .012
                    Gender       -.827     .226                -.156                       -3.655   .000
                    Ideal        -.032     .026                -.054                       -1.265   .206
                    Nucoph       .002      .004                .022                        .606     .545
                    SATM         -.007     .001                -.302                       -7.951   .000
                    Year         -.018     .010                -.070                       -1.842   .066
2               1   (Constant)   53.105    19.503                                          2.723    .007
                    Gender       -.827     .227                -.156                       -3.636   .000
                    Ideal        -.021     .026                -.035                       -.816    .415
                    Nucoph       .001      .004                .015                        .405     .686
                    SATM         -.007     .001                -.287                       -7.587   .000
                    Year         -.020     .010                -.079                       -2.102   .036
3               1   (Constant)   45.250    19.175                                          2.360    .019
                    Gender       -.717     .223                -.136                       -3.219   .001
                    Ideal        -.013     .025                -.022                       -.521    .603
                    Nucoph       .002      .004                .016                        .465     .642
                    SATM         -.008     .001                -.335                       -8.988   .000
                    Year         -.017     .010                -.065                       -1.737   .083
4               1   (Constant)   43.517    19.558                                          2.225    .026
                    Gender       -.810     .226                -.153                       -3.581   .000
                    Ideal        -.019     .025                -.032                       -.738    .460
                    Nucoph       .001      .004                .012                        .345     .730
                    SATM         -.007     .001                -.312                       -8.216   .000
                    Year         -.016     .010                -.061                       -1.598   .111


Coefficientsa

                                 Unstandardized Coefficients   Standardized Coefficients
Imputation_     Model            B         Std. Error          Beta                        t        Sig.
5               1   (Constant)   50.098    19.720                                          2.540    .011
                    Gender       -.751     .226                -.142                       -3.325   .001
                    Ideal        -.014     .025                -.023                       -.533    .594
                    Nucoph       .002      .004                .016                        .452     .651
                    SATM         -.007     .001                -.273                       -7.073   .000
                    Year         -.019     .010                -.076                       -1.963   .050
Pooled          1   (Constant)   48.194    19.934                                          2.418    .016
                    Gender       -.787     .232                                            -3.387   .001
                    Ideal        -.020     .027                                            -.735    .463
                    Nucoph       .002      .004                                            .452     .651
                    SATM         -.007     .001                                            -6.730   .000
                    Year         -.018     .010                                            -1.805   .071

Coefficientsa

Imputation_   Model            Fraction Missing Info.   Relative Increase Variance   Relative Efficiency
Pooled        1   (Constant)   .045                     .047                         .991
                  Gender       .056                     .058                         .989
                  Ideal        .106                     .113                         .979
                  Nucoph       .011                     .011                         .998
                  SATM         .311                     .396                         .941
                  Year         .048                     .049                         .991

a. Dependent Variable: Statoph

• Karl L. Wuensch, January, 2021
• Return to Wuensch’s Stats Lessons Page
• Treatment of Missing Data – recommended reading, David Howell.
• Multiple Imputation with SAS – nice tutorial from UCLA