texas a&m hsc jin is designed by dr. huber. korean female colon cancer risk factors range...

Multiple Imputation with large proportions of missing data: how much is too much? Texas A&M HSC Jin is designed by Dr. Huber

Upload: mckayla-bullock

Post on 14-Dec-2015


Page 1: Texas A&M HSC Jin is designed by Dr. Huber. Korean Female Colon Cancer Risk Factors Range EventNon-event HR95% CIP n%n% Smoking Habits Missing144940079.57407195.70----

Multiple Imputation with large proportions of missing data: how much is too much?

Texas A&M HSC

Jin is designed by Dr. Huber

Page 2:

Motivation: Motivations and Examples

Korean Female Colon Cancer

Risk factor     Range                           Event n   Event %  Non-event n  Non-event %  HR     95% CI lower  95% CI upper  P
Smoking habits  Missing                         1449400   79.57    4071         95.70        -      -             -             -
                No smoking                      351896    19.32    93           2.19         1.000  1.000         1.000         1.000
                Smoked before, but quit         4611      0.25     21           0.49         1.174  1.058         1.303         0.0025
                Currently, 1/2 pack             8735      0.48     38           0.89         0.948  0.828         1.084         0.4339
                Currently, 1/2 to one pack      5534      0.30     26           0.61         0.991  0.901         1.09          0.8457
                Currently, more than one pack   1410      0.08     5            0.12         1.015  0.894         1.153         0.8162

Is smoking protective? Not sure, because of the huge amount of missing data!

Page 3:

Background: Types of missing data

The types differ by WHY the data are missing.

1. Missing Completely At Random (MCAR): missingness depends neither on the observed values nor on the missing values

2. Missing At Random (MAR): missingness depends only on the observed values

3. Not Missing At Random (NMAR): missingness depends both on the observed values and on the missing values

The type of missingness affects the effectiveness and the bias of methods for handling missing data.
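The three types above can be illustrated with a small simulation. This is only a sketch: the logistic missingness probabilities, the sample size, and the names make_missing, x, and z are assumptions made for illustration, not part of the slides.

```python
import math
import random

def make_missing(x, z, mechanism, rng, base_rate=0.3):
    """Return a copy of z with some entries replaced by None.

    MCAR: every value has the same chance of being missing.
    MAR:  the chance depends only on the fully observed x.
    NMAR: the chance depends on the value of z itself.
    """
    out = []
    for xi, zi in zip(x, z):
        if mechanism == "MCAR":
            p = base_rate
        elif mechanism == "MAR":
            p = 1.0 / (1.0 + math.exp(-xi))  # larger x -> more likely missing
        elif mechanism == "NMAR":
            p = 1.0 / (1.0 + math.exp(-zi))  # larger z -> more likely missing
        else:
            raise ValueError(mechanism)
        out.append(None if rng.random() < p else zi)
    return out

rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(2000)]
z = [0.8 * xi + 0.6 * rng.gauss(0, 1) for xi in x]  # z is correlated with x

for mech in ("MCAR", "MAR", "NMAR"):
    z_miss = make_missing(x, z, mech, rng)
    observed = [v for v in z_miss if v is not None]
    print(mech, "mean of observed z:", round(sum(observed) / len(observed), 3))
```

Under MCAR the mean of the observed z stays close to the full-data mean, while under MAR and NMAR the observed values are a selected subset, so the naive mean drifts away from it.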

Page 4:

Background: Methods of handling missing data

Older methods

1. Complete Case Analysis (CCA)

2. Available Case Analysis (ACA)

Single imputation

3. Mean imputation

4. Expectation-Maximization (EM)

Multiple imputation

5. Multiple Imputation (MI)

This study considers only CCA and MI.

Page 5:

Background: Methods of handling missing data

1. Complete Case Analysis (CCA)

Procedure:

1. Delete all cases with missing values on Y1, Y2, or Y3

2. Analyze the remaining cases

Y1   Y2   Y3
140  .    20
31   25   .
10   35   40
25   48   57
30   49   60
35   55   65
37   47   70
140  32   30
42   65   40
50   200  20

Notes:

1. CCA amounts to NOT using any method of handling missing data.

2. Deleting cases decreases power (because the sample size is reduced).
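A minimal sketch of the two CCA steps, applied to the Y1, Y2, Y3 table from this slide. The helper names complete_cases and column_means are illustrative, and the column mean is just one example of an analysis of the remaining cases.

```python
# Slide data; None marks a missing value ("." on the slide).
rows = [
    (140, None, 20),
    (31, 25, None),
    (10, 35, 40),
    (25, 48, 57),
    (30, 49, 60),
    (35, 55, 65),
    (37, 47, 70),
    (140, 32, 30),
    (42, 65, 40),
    (50, 200, 20),
]

def complete_cases(data):
    """Step 1: keep only rows with no missing value in any column."""
    return [row for row in data if all(v is not None for v in row)]

def column_means(data):
    """Step 2 (example analysis): mean of each column over the given rows."""
    n = len(data)
    return [sum(col) / n for col in zip(*data)]

cc = complete_cases(rows)
print(len(rows), "rows ->", len(cc), "complete cases")
print("CCA means of Y1, Y2, Y3:", column_means(cc))
```

Two of the ten cases are dropped even though each of them is missing only one value, which is the loss of sample size the slide refers to.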

Page 6:

Background: Methods of handling missing data

2. Multiple Imputation (MI)

MI has 3 steps:

(1) Imputation Step

(2) Analysis Step

(3) Combination Step

Page 7:

Background: Methods of handling missing data

2. MI (1) Imputation Step

Original data ("." = missing):

Obs  Y    X1   X2
1    44   11   178
2    45   10   185
3    59   .    .
4    49   9    .
5    60   8    170
6    50   .    44
7    11   176  .
8    10   49   8
9    170  50   .

After imputation: "5 complete datasets"

Obs  Imputation Number  Y    X1      X2
1    1                  44   11      178
2    1                  45   10      185
3    1                  59   16.51   136.48
4    1                  49   9       179.59
5    1                  60   8       170
6    1                  50   38.40   44
7    1                  11   176     -608.57
8    1                  10   49      8
9    1                  170  50      -88.94
10   2                  44   11      178
11   2                  45   10      185
12   2                  59   63.99   -98.96
13   2                  49   9       192.37
14   2                  60   8       170
15   2                  50   38.49   44
16   2                  11   176     -644.26
17   2                  10   49      8
18   2                  170  50      -97.00
19   3                  44   11      178
20   3                  45   10      185
21   3                  59   63.88   -121.12
22   3                  49   9       185.82
23   3                  60   8       170
24   3                  50   33.65   44
25   3                  11   176     -665.12
26   3                  10   49      8
27   3                  170  50      -189.96
28   4                  44   11      178
29   4                  45   10      185
30   4                  59   -42.87  458.60
31   4                  49   9       179.07
32   4                  60   8       170
33   4                  50   33.60   44
34   4                  11   176     -706.87
35   4                  10   49      8
36   4                  170  50      -212.18
37   5                  44   11      178
38   5                  45   10      185
39   5                  59   1.64    213.94
40   5                  49   9       182.08
41   5                  60   8       170
42   5                  50   33.16   44
43   5                  11   176     -720.92
44   5                  10   49      8
45   5                  170  50      -222.16
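The idea of the imputation step can be sketched as below: each missing value is replaced by a regression prediction plus random noise, and the whole process is repeated M times. This is a deliberately simplified illustration, not the procedure that produced the slide's numbers: it imputes only X1, uses only Y as predictor, and keeps the regression coefficients fixed, whereas MI software also draws the coefficients at random for each imputation. The function names and the seed are assumptions.

```python
import random

# Y and X1 from the slide's original table; None marks a missing X1.
y = [44, 45, 59, 49, 60, 50, 11, 10, 170]
x1 = [11, 10, None, 9, 8, None, 176, 49, 50]

def fit_line(xs, ys):
    """Least-squares intercept, slope and residual SD for ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(xs, ys))
    return intercept, slope, (rss / (n - 2)) ** 0.5

def impute_once(pred, target, rng):
    """Fill each missing target value with its prediction plus random noise."""
    obs = [(p, t) for p, t in zip(pred, target) if t is not None]
    a, b, sd = fit_line([p for p, _ in obs], [t for _, t in obs])
    return [t if t is not None else a + b * p + rng.gauss(0, sd)
            for p, t in zip(pred, target)]

rng = random.Random(2015)
M = 5  # number of imputations, as on the slide
completed = [impute_once(y, x1, rng) for _ in range(M)]
for m, data in enumerate(completed, start=1):
    print(m, [round(v, 2) for v in data])
```

Observed values are identical in all M datasets; only the filled-in cells differ, and that spread is what later carries the uncertainty due to missingness.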

Page 8:

Background: Methods of handling missing data

2. MI (2) Analysis Step

* A standard statistical procedure (here, regression) is run on each of the 5 complete datasets separately, so the data are analyzed 5 times.

Columns: Obs; Imputation Number; Label of model; Type of statistics; Variable names for rows of estimated COV; Dependent variable; Root mean squared error (RMSE); Intercept; X1; X2; Y

Obs  Imputation  Model   Type   COV row    Dep. var.  RMSE   Intercept  X1      X2     Y
1    1           MODEL1  PARMS             Y          9.49   417.91     -7.96   -1.64  -1
2    1           MODEL1  COV    Intercept  Y          9.49   722.00     -15.61  -3.26  .
3    1           MODEL1  COV    X1         Y          9.49   -15.61     0.34    0.07   .
4    1           MODEL1  COV    X2         Y          9.49   -3.26      0.07    0.02   .
5    2           MODEL1  PARMS             Y          11.80  405.16     -7.81   -1.53  -1
6    2           MODEL1  COV    Intercept  Y          11.80  1052.74    -23.16  -4.60  .
7    2           MODEL1  COV    X1         Y          11.80  -23.16     0.52    0.10   .
8    2           MODEL1  COV    X2         Y          11.80  -4.60      0.10    0.02   .
9    3           MODEL1  PARMS             Y          3.86   233.43     -4.31   -0.80  -1
10   3           MODEL1  COV    Intercept  Y          3.86   28.82      -0.66   -0.12  .
11   3           MODEL1  COV    X1         Y          3.86   -0.66      0.02    0.00   .
12   3           MODEL1  COV    X2         Y          3.86   -0.12      0.00    0.00   .
13   4           MODEL1  PARMS             Y          1.76   221.04     -4.17   -0.74  -1
14   4           MODEL1  COV    Intercept  Y          1.76   5.20       -0.12   -0.02  .
15   4           MODEL1  COV    X1         Y          1.76   -0.12      0.00    0.00   .
16   4           MODEL1  COV    X2         Y          1.76   -0.02      0.00    0.00   .
17   5           MODEL1  PARMS             Y          1.46   215.80     -4.08   -0.71  -1
18   5           MODEL1  COV    Intercept  Y          1.46   3.36       -0.08   -0.01  .
19   5           MODEL1  COV    X1         Y          1.46   -0.08      0.00    0.00   .
20   5           MODEL1  COV    X2         Y          1.46   -0.01      0.00    0.00   .
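As a sketch of the analysis step, the code below fits the same regression of Y on X1 and X2 to each of five completed datasets. The imputed values are the ones read from the imputation-step slide of this transcript, whose decimals were split across lines, so they should be treated as an assumption. The use of NumPy's least-squares solver and the helper name fill are also illustrative choices, and no claim is made that the output reproduces the slide's table exactly.

```python
import numpy as np

y = np.array([44, 45, 59, 49, 60, 50, 11, 10, 170], dtype=float)
x1_obs = [11, 10, None, 9, 8, None, 176, 49, 50]
x2_obs = [178, 185, None, None, 170, 44, None, 8, None]

# Imputed values per imputation, in row order of the missing cells
# (X1: rows 3 and 6; X2: rows 3, 4, 7 and 9), as read from the transcript.
imputed = [
    ([16.51, 38.40], [136.48, 179.59, -608.57, -88.94]),
    ([63.99, 38.49], [-98.96, 192.37, -644.26, -97.00]),
    ([63.88, 33.65], [-121.12, 185.82, -665.12, -189.96]),
    ([-42.87, 33.60], [458.60, 179.07, -706.87, -212.18]),
    ([1.64, 33.16], [213.94, 182.08, -720.92, -222.16]),
]

def fill(column, values):
    """Replace each None in column with the next imputed value."""
    it = iter(values)
    return np.array([next(it) if v is None else v for v in column], dtype=float)

fits = []
for x1_vals, x2_vals in imputed:
    # Design matrix: intercept, X1, X2 for this completed dataset.
    X = np.column_stack([np.ones(len(y)), fill(x1_obs, x1_vals), fill(x2_obs, x2_vals)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fits.append(beta)

for m, beta in enumerate(fits, start=1):
    print("imputation", m, "intercept, X1, X2 =", np.round(beta, 2))
```

Each pass yields its own set of coefficients; nothing is pooled at this stage.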

Page 9:

Background: Methods of handling missing data

2. MI (3) Combination Step

> The results from the 5 datasets are combined into ONE result with the combination equations, which give:

1. Combined estimate

2. Total variance

3. Within-imputation variance

4. Between-imputation variance

5. Degrees of freedom (DF)

6. Fraction of missing information

7. Confidence interval
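The combination quantities listed on this slide are usually computed with Rubin's rules. The sketch below assumes that these standard formulas are what the slide showed, since its equations are not in the transcript. As input it uses the five X1 coefficients and their variances from the analysis-step table, which are rounded to two decimals there. The function name pool is illustrative.

```python
def pool(estimates, variances):
    """Combine M estimates and their variances (squared SEs) with Rubin's rules."""
    m = len(estimates)
    q_bar = sum(estimates) / m                              # 1. combined estimate
    w = sum(variances) / m                                  # 3. within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # 4. between-imputation variance
    t = w + (1 + 1 / m) * b                                 # 2. total variance
    r = (1 + 1 / m) * b / w                                 # relative increase in variance
    df = (m - 1) * (1 + 1 / r) ** 2                         # 5. degrees of freedom
    fmi = (r + 2 / (df + 3)) / (r + 1)                      # 6. fraction of missing information
    # 7. confidence interval: q_bar +/- t-quantile(df) * sqrt(t); the t-quantile
    #    is not in the standard library, so it is left out of this sketch.
    return {"estimate": q_bar, "within": w, "between": b,
            "total": t, "df": df, "fmi": fmi}

# X1 coefficients and their variances from the five regressions (rounded values).
x1_coef = [-7.96, -7.81, -4.31, -4.17, -4.08]
x1_var = [0.34, 0.52, 0.02, 0.00, 0.00]

for name, value in pool(x1_coef, x1_var).items():
    print(name, round(value, 4))
```

The function assumes that both the within and the between variance are non-zero; a production version would need to guard those divisions.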

Page 10:

Background: Methods of handling missing data

* Comparison of methods to handle missing values

Criteria                        CCA  ACA  Mean imputation  EM method  Multiple Imputation
Unbiased parameter estimation
  MCAR                          O    X    X                O          O
  MAR                           X    X    X                O          O
  NMAR                          X    X    X                X          X
Good estimates of variability   X    X    X                X          O
Best statistical power          X    O    O                O          O

(O = yes, X = no)

MI is the BEST!!

- Excellent estimation

- Good variability estimates: the variance among the 'M' estimates, because the data are multiply imputed

- Best power: by not deleting any cases

Page 11:

Background: Imputation mechanisms

(1) Imputation step of MI: imputation mechanisms for substituting missing values

Pattern                     Type        Normality  Imputation mechanism
Univariate, monotone        Continuous  O          Regression
Univariate, monotone        Continuous  X          Predictive Mean Matching (PMM)
Multivariate, not monotone  Continuous  -          MCMC

MCMC has NOT been tested for univariate missing data.
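Of the three mechanisms in the table, predictive mean matching is the least self-explanatory, so a minimal sketch follows. It uses a single predictor, a fixed regression fit, and k = 3 donors, all simplifying assumptions, and the toy data are hypothetical. The point is only that PMM fills a gap with an actually observed value borrowed from a case with a similar predicted mean, which is why it does not rely on normality.

```python
import random

def pmm_impute(pred, target, rng, k=3):
    """Predictive mean matching with one predictor (simplified sketch)."""
    obs = [(p, t) for p, t in zip(pred, target) if t is not None]
    n = len(obs)
    mp = sum(p for p, _ in obs) / n
    mt = sum(t for _, t in obs) / n
    slope = (sum((p - mp) * (t - mt) for p, t in obs)
             / sum((p - mp) ** 2 for p, _ in obs))
    intercept = mt - slope * mp

    def predict(p):
        return intercept + slope * p

    out = []
    for p, t in zip(pred, target):
        if t is not None:
            out.append(t)
            continue
        # Donors: the k observed cases whose predicted means are closest.
        donors = sorted(obs, key=lambda d: abs(predict(d[0]) - predict(p)))[:k]
        out.append(rng.choice(donors)[1])  # borrow one donor's observed value
    return out

# Hypothetical toy data; None marks a missing value.
pred = [1, 2, 3, 4, 5, 6, 7, 8]
target = [2.1, None, 6.2, 8.1, None, 11.9, 14.2, None]

rng = random.Random(11)
print(pmm_impute(pred, target, rng))
```

Regression imputation would instead insert the predicted value plus normal noise, so its imputations can fall outside the range of the observed data.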

Page 12:

Data: Simulated Data

* 3000 obs. are generated on Z1 and X1,…,X6 (all variables are continuous)

  (Xs: observed variables; Z1: partly missing variable)

* Z1 and X1,…,X6 are drawn from a multivariate normal distribution with means = 0 and correlation =

     z1      x1      x2       x3       x4      x5      x6
z1   1.0000
x1   0.7655  1.0000
x2   0.2764  0.3233  1.0000
x3   0.0509  0.0351  0.5352   1.0000
x4   0.1612  0.1415  -0.0063  -0.0738  1.0000
x5   0.2924  0.3581  0.8062   -0.0640  0.0441  1.0000
x6   0.1052  0.1124  -0.0061  -0.0764  0.1157  0.0420  1.0000
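A sketch of how data like these could be generated with NumPy, using the correlation matrix from this slide. Unit variances are assumed, so the correlation matrix doubles as the covariance matrix; the seed and the variable names are illustrative, and this is not the author's actual generation code.

```python
import numpy as np

names = ["z1", "x1", "x2", "x3", "x4", "x5", "x6"]

# Lower triangle of the correlation matrix, row by row, as given on the slide.
lower = [
    [1.0000],
    [0.7655, 1.0000],
    [0.2764, 0.3233, 1.0000],
    [0.0509, 0.0351, 0.5352, 1.0000],
    [0.1612, 0.1415, -0.0063, -0.0738, 1.0000],
    [0.2924, 0.3581, 0.8062, -0.0640, 0.0441, 1.0000],
    [0.1052, 0.1124, -0.0061, -0.0764, 0.1157, 0.0420, 1.0000],
]

# Mirror the lower triangle into a full symmetric matrix.
corr = np.zeros((7, 7))
for i, row in enumerate(lower):
    for j, r in enumerate(row):
        corr[i, j] = corr[j, i] = r

rng = np.random.default_rng(0)
data = rng.multivariate_normal(mean=np.zeros(7), cov=corr, size=3000)

# Check that the sample correlations resemble the target matrix.
sample_corr = np.corrcoef(data, rowvar=False)
print("target corr(z1, x1):", corr[0, 1])
print("sample corr(z1, x1):", round(float(sample_corr[0, 1]), 4))
```

With 3000 observations the sample correlations should land close to the targets, though not exactly on them.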

Page 13:

Data: Example Data ("A Predictive Study of Coronary Heart Disease")

* 3154 obs. (all variables are continuous)

  - Missing variable: Systolic Blood Pressure (SBP; mean: 128.63)

  - Observed variables (means): DBP (82.02), height (69.78), weight (169.95), age (46.28), BMI (24.52), and cholesterol (226.37)

* Correlation =

        sbp     dbp     height   weight   age     bmi     chol
sbp     1.0000
dbp     0.7700  1.0000
height  0.0156  0.0070  1.0000
weight  0.2513  0.2940  0.5333   1.0000
age     0.1701  0.1440  -0.0919  -0.0331  1.0000
bmi     0.2878  0.3428  -0.0633  0.8079   0.0256  1.0000
chol    0.1231  0.1296  -0.0889  0.0085   0.0892  0.0706  1.0000

Page 14:

Method

1. Missing mechanisms: Z1 (SBP) is deleted to reach 0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, and 80% missing

   1) MCAR: Z1 (SBP) is deleted at random

   2) MAR: Z1 (SBP) is deleted after sorting by one of the Xs (observed variables)

   3) NMAR: Z1 (SBP) is deleted after sorting by Z1 (SBP) itself

2. Bias is mainly measured by

   RMSE (Root Mean Square Error) = sqrt(Variance of estimates + Bias^2),

   which captures both the accuracy and the variability of the estimates and compares them in the same units.

   * True value = mean of Z1 (SBP) at 0% missing

   * Estimate = mean of Z1 (SBP) at 10% to 80% missing after MI

The smaller the RMSE, the better the estimation.
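The deletion schemes and the RMSE criterion on this slide can be sketched together as follows. Several choices here are assumptions, not taken from the slides: which end of the sorted data is deleted, 200 replications of 300 cases, a correlation of about 0.77 between x and z, and the population mean 0 as the true value (the slide uses the mean at 0% missing). The estimate is the plain complete-case mean, so the sketch shows the CCA side of the comparison only.

```python
import math
import random

def delete_z(x, z, mechanism, prop, rng):
    """Return the z values still observed after deleting a share prop of them."""
    n = len(z)
    k = int(round(prop * n))
    idx = list(range(n))
    if mechanism == "MCAR":
        rng.shuffle(idx)                            # delete at random
    elif mechanism == "MAR":
        idx.sort(key=lambda i: x[i], reverse=True)  # sort by observed x (assumed: drop largest)
    elif mechanism == "NMAR":
        idx.sort(key=lambda i: z[i], reverse=True)  # sort by z itself (assumed: drop largest)
    else:
        raise ValueError(mechanism)
    return [z[i] for i in idx[k:]]                  # the first k cases are deleted

def rmse(estimates, truth):
    """sqrt(variance of the estimates + bias^2)."""
    mean_est = sum(estimates) / len(estimates)
    var = sum((e - mean_est) ** 2 for e in estimates) / len(estimates)
    bias = mean_est - truth
    return math.sqrt(var + bias ** 2)

rng = random.Random(14)
results = {}
for mech in ("MCAR", "MAR", "NMAR"):
    estimates = []
    for _ in range(200):
        x = [rng.gauss(0, 1) for _ in range(300)]
        z = [0.77 * a + math.sqrt(1 - 0.77 ** 2) * rng.gauss(0, 1) for a in x]
        kept = delete_z(x, z, mech, 0.5, rng)
        estimates.append(sum(kept) / len(kept))     # complete-case mean of z
    results[mech] = rmse(estimates, truth=0.0)
    print(mech, "RMSE of complete-case mean at 50% missing:", round(results[mech], 3))
```

Because the variance here is the population variance of the estimates, the result equals the square root of the mean squared error, which is why RMSE reflects accuracy and variability in one number.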

Page 15:

Method

3. Methods for dealing with missing values (to measure the effectiveness of MI)

   - Complete Case Analysis (CCA)

   - Multiple Imputation (MI)

4. Number of imputations

   M = 10, 20, 30, 40, and 50

5. Imputation model

   (z1 = x1 x2 x3 x4 x5 x6): all variables

   (z1 = x1 x2 x5): variables highly correlated with z1

   (z1 = x3 x4 x6): variables rarely correlated with z1

   The z1 = x1 x2 x5 model is the best model because it has the smallest RMSE.

Page 16:

Method

6. Imputation mechanisms

   Regression method, PMM, MCMC

7. 500 repetitions of each MI (to reduce the random variability of imputation)

   e.g. M = 10 × 500 reps → average them → mean of estimates for M = 10

        M = 50 × 500 reps → average them → mean of estimates for M = 50

8. Statistical software

   STATA11 (Multiple Imputation)

Page 17:

Result (simulated data)

1. CCA vs. MI* by RMSE

[Figure: three panels (MCAR, MAR, NMAR), each plotting RMSE against the proportion of missing data (10% to 80%) for CCA and MI; lower RMSE is better. The MCAR and MAR panels are shown twice: on their own y-axis scales (0 to 0.12 and 0 to 0.25) and on the NMAR scale (0 to 1.6).]

Under MCAR and MAR, both CCA and MI are good. With the y-axis rescaled, MI is better than CCA under all missing mechanisms.

As the percentage of missing data increases, the RMSEs increase linearly, and so does the difference in RMSE between CCA and MI.

> With a high amount of missing data, use Multiple Imputation.

Page 18:

Result (simulated data)

2. Number of imputations

[Figure: three panels (MCAR, MAR, NMAR), each plotting RMSE against the proportion of missing data (10% to 80%) for 10, 20, 30, 40, and 50 imputations.]

The results are similar regardless of the number of imputations.

Under MCAR and MAR, MI is good!

Under NMAR, MI gives biased estimates at 80% missing, because the RMSE is large ≒ (1 SD of data = 0.99).

The 5 lines (M = 10 to M = 50) run together and look like 1 line.

> No difference among the numbers of imputations (m) = 10, 20, 30, 40, 50.

Page 19:

Result (simulated data)

3. Regression, PMM, MCMC

[Figure: three panels (MCAR, MAR, NMAR), each plotting RMSE against the proportion of missing data (10% to 80%) for the regression (reg), PMM, and MCMC mechanisms. The NMAR panel is annotated "MCMC / Reg."]

Mechanism  Normality   Theory      Practically (MI)
MCAR       Normal      Regression  All imputation mechanisms
MAR        Normal      Regression  All imputation mechanisms (Reg. slightly better)
NMAR       Not normal  PMM         Regression, MCMC

1. Under MCAR and MAR, the regression method should theoretically be better because the data are normal, but all methods are good. However, the regression method is slightly better under MAR.

2. Under NMAR, even though normality is not met, the regression method is better than PMM.

* The normality assumption may not be important under NMAR.

* MCMC is good under all missing mechanisms. Thus, MCMC can be used for univariate, continuous missing data.

Page 20:

Result (example data)

1. CCA vs. MI* by RMSE

[Figure: three panels (MCAR, MAR, NMAR), each plotting RMSE against the proportion of missing data (10% to 80%) for CCA and MI; lower RMSE is better. The MCAR and MAR panels are shown twice: on their own y-axis scales (0 to 1.6 and 0 to 4) and on the NMAR scale (0 to 20).]

Under MCAR and MAR, both CCA and MI are good. With the y-axis rescaled, MI produced markedly less biased values than CCA under MCAR, MAR, and NMAR.

As the percentage of missing data increases, the RMSEs increase linearly, and so does the difference in RMSE between CCA and MI.

> With a high amount of missing data, Multiple Imputation is preferable.

Page 21:

Result (example data)

2. Number of imputations

[Figure: three panels (MCAR, MAR, NMAR), each plotting RMSE (0 to 16) against the proportion of missing data (10% to 80%) for 10, 20, 30, 40, and 50 imputations.]

The results are similar regardless of the number of imputations and the percentage of missing data.

Under MCAR and MAR, MI produces unbiased estimates.

Under NMAR, MI did not do well at 80% missing, due to a large RMSE ≒ (1 SD of data = 15.11).

No difference among the increased numbers of imputations = 10, 20, 30, 40, 50.

> Increasing the number of imputations has no significant effect on correcting bias for data with these characteristics.

Page 22:

Result (example data)

3. Regression, PMM, MCMC

[Figure: three panels (MCAR, MAR, NMAR), each plotting RMSE (0 to 18) against the proportion of missing data (10% to 80%) for the regression (reg), PMM, and MCMC mechanisms. The NMAR panel is annotated "MCMC / Reg."]

Mechanism  Normality   Theory  Practically (MI)
MCAR       Not normal  PMM     All imputation mechanisms
MAR        Not normal  PMM     All imputation mechanisms (PMM slightly better)
NMAR       Not normal  PMM     Regression, MCMC

1. Under MCAR and MAR, PMM should theoretically be better because the normality assumption is violated, but all methods are good. However, the PMM method is slightly better under MAR.

2. Under NMAR, even though normality is not met, the regression method has a lower RMSE than PMM.

* The normality assumption may be important only under MAR.

* MCMC is good to use under MCAR, MAR, and NMAR. Thus, MCMC can be used not only for multivariate, continuous missing data but also for univariate, continuous missing data.

Page 23:

Conclusion

1. Multiple Imputation (MI) always performed better than Complete Case Analysis (CCA).

2. There was no significant difference among the numbers of imputations in my data.

3. Under MCAR and MAR, MI produces unbiased estimates even with a high amount of missing data.

4. However, under NMAR, the estimates from MI are also biased with a high amount of missing data.

5. MCMC is good for univariate, continuous missing data under MCAR, MAR, and NMAR.

Page 24:

Thank you