texas a&m hsc jin is designed by dr. huber. korean female colon cancer risk factors range...

Multiple Imputation with Large Proportions of Missing Data: How Much Is Too Much?

Texas A&M HSC

Jin is designed by Dr. Huber

Korean Female Colon Cancer

Risk factor: Smoking habits

Range                            Event n   Event %   Non-event n   Non-event %   HR      95% CI         P
Missing                          1449400   79.57     4071          95.70         -       -              -
No smoking                       351896    19.32     93            2.19          1.000   1.000-1.000    1.000
Smoked before, but quit          4611      0.25      21            0.49          1.174   1.058-1.303    0.0025
Currently, 1/2 pack              8735      0.48      38            0.89          0.948   0.828-1.084    0.4339
Currently, 1/2 to one pack       5534      0.30      26            0.61          0.991   0.901-1.09     0.8457
Currently, more than one pack    1410      0.08      5             0.12          1.015   0.894-1.153    0.8162

Motivation

☞ Is smoking protective? We cannot be sure, because the proportion of missing data is huge.

Types of Missing Data

☞ The types differ by why the data are missing:

1. Missing Completely At Random (MCAR): missingness depends neither on the observed data nor on the missing values

2. Missing At Random (MAR): missingness depends only on the observed data

3. Not Missing At Random (NMAR): missingness depends on the missing values themselves, and possibly also on the observed data

The type of missingness affects the effectiveness and bias of methods for handling missing data.

Methods of Handling Missing Data

1. Complete Case Analysis (CCA)

2. Available Case Analysis (ACA)

3. Mean imputation

4. Expectation-Maximization (EM)

5. Multiple Imputation (MI)

These fall into three groups: older methods, single imputation, and multiple imputation.

This study considers only CCA and MI.

1. Complete Case Analysis (CCA)

Steps:

1. Delete all cases with missing values on Y1, Y2, or Y3

2. Analyze the remaining cases

Y1     Y2     Y3
140    .      20
31     25     .
10     35     40
25     48     57
30     49     60
35     55     65
37     47     70
140    32     30
42     65     40
50     200    20

Notes:

1. CCA means not using any method of handling missing data.

2. Deleting cases decreases power, because the sample size is reduced.
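The two CCA steps (delete incomplete cases, then analyze what remains) can be sketched in a few lines of Python. This is only an illustration on the ten rows of the Y1/Y2/Y3 example table: `None` stands in for the "." missing marker, the variable names are made up for this sketch, and the column means are an arbitrary choice of analysis.

```python
# Rows of (Y1, Y2, Y3) from the example table; None marks a missing value (".").
rows = [
    (140, None, 20),
    (31, 25, None),
    (10, 35, 40),
    (25, 48, 57),
    (30, 49, 60),
    (35, 55, 65),
    (37, 47, 70),
    (140, 32, 30),
    (42, 65, 40),
    (50, 200, 20),
]

# Step 1: delete every case that has a missing value on Y1, Y2, or Y3.
complete_cases = [r for r in rows if all(v is not None for v in r)]

# Step 2: analyze the remaining cases (here, simply the column means).
n = len(complete_cases)
means = [sum(r[j] for r in complete_cases) / n for j in range(3)]

print(n)      # number of cases kept after deletion
print(means)  # means of Y1, Y2, Y3 over the complete cases only
```

Two of the ten cases are dropped, which is the loss of sample size (and therefore power) noted for CCA.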

2. Multiple Imputation (MI)

MI has 3 steps:

(1) Imputation Step

(2) Analysis Step

(3) Combination Step

2. MI (1) Imputation Step

Original data ("." marks a missing value):

Obs   Y      X1     X2
1     44     11     178
2     45     10     185
3     59     .      .
4     49     9      .
5     60     8      170
6     50     .      44
7     11     176    .
8     10     49     8
9     170    50     .

Imputed data ("5 complete datasets"):

Obs   Imputation Number   Y      X1        X2
1     1                   44     11        178
2     1                   45     10        185
3     1                   59     16.51     136.48
4     1                   49     9         179.59
5     1                   60     8         170
6     1                   50     38.40     44
7     1                   11     176       -608.57
8     1                   10     49        8
9     1                   170    50        -88.94
10    2                   44     11        178
11    2                   45     10        185
12    2                   59     63.99     -98.96
13    2                   49     9         192.37
14    2                   60     8         170
15    2                   50     38.49     44
16    2                   11     176       -644.26
17    2                   10     49        8
18    2                   170    50        -97.00
19    3                   44     11        178
20    3                   45     10        185
21    3                   59     63.88     -121.12
22    3                   49     9         185.82
23    3                   60     8         170
24    3                   50     33.65     44
25    3                   11     176       -665.12
26    3                   10     49        8
27    3                   170    50        -189.96
28    4                   44     11        178
29    4                   45     10        185
30    4                   59     -42.87    458.60
31    4                   49     9         179.07
32    4                   60     8         170
33    4                   50     33.60     44
34    4                   11     176       -706.87
35    4                   10     49        8
36    4                   170    50        -212.18
37    5                   44     11        178
38    5                   45     10        185
39    5                   59     1.64      213.94
40    5                   49     9         182.08
41    5                   60     8         170
42    5                   50     33.16     44
43    5                   11     176       -720.92
44    5                   10     49        8
45    5                   170    50        -222.16

2. MI (2) Analysis Step

* Standard statistical procedure: a regression is fitted to each of the 5 complete datasets separately (analyzed 5 times).

Columns: Obs, Imputation Number, Label of model, Type of statistics, Variable name (rows of estimated COV), Dependent variable, Root mean squared error, Intercept, X1, X2, Y.

Obs   Imp.   Model    Type    Name        Dep.   RMSE     Intercept   X1       X2      Y
1     1      MODEL1   PARMS               Y      9.49     417.91      -7.96    -1.64   -1
2     1      MODEL1   COV     Intercept   Y      9.49     722.00      -15.61   -3.26   .
3     1      MODEL1   COV     X1          Y      9.49     -15.61      0.34     0.07    .
4     1      MODEL1   COV     X2          Y      9.49     -3.26       0.07     0.02    .
5     2      MODEL1   PARMS               Y      11.80    405.16      -7.81    -1.53   -1
6     2      MODEL1   COV     Intercept   Y      11.80    1052.74     -23.16   -4.60   .
7     2      MODEL1   COV     X1          Y      11.80    -23.16      0.52     0.10    .
8     2      MODEL1   COV     X2          Y      11.80    -4.60       0.10     0.02    .
9     3      MODEL1   PARMS               Y      3.86     233.43      -4.31    -0.80   -1
10    3      MODEL1   COV     Intercept   Y      3.86     28.82       -0.66    -0.12   .
11    3      MODEL1   COV     X1          Y      3.86     -0.66       0.02     0.00    .
12    3      MODEL1   COV     X2          Y      3.86     -0.12       0.00     0.00    .
13    4      MODEL1   PARMS               Y      1.76     221.04      -4.17    -0.74   -1
14    4      MODEL1   COV     Intercept   Y      1.76     5.20        -0.12    -0.02   .
15    4      MODEL1   COV     X1          Y      1.76     -0.12       0.00     0.00    .
16    4      MODEL1   COV     X2          Y      1.76     -0.02       0.00     0.00    .
17    5      MODEL1   PARMS               Y      1.46     215.80      -4.08    -0.71   -1
18    5      MODEL1   COV     Intercept   Y      1.46     3.36        -0.08    -0.01   .
19    5      MODEL1   COV     X1          Y      1.46     -0.08       0.00     0.00    .
20    5      MODEL1   COV     X2          Y      1.46     -0.01       0.00     0.00    .

2. MI (3) Combination Step

The results from the 5 datasets are combined into ONE result with the combination equations, which give:

1. Combined estimate

2. Total variance

3. Within-imputation variance

4. Between-imputation variance

5. Degrees of freedom (DF)

6. Fraction of missing information

7. Confidence interval
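The combination step is commonly carried out with Rubin's rules. The sketch below assumes their standard form, which may differ in notation from the equations on the original slide: with m imputations, estimates Q_i and variances U_i, the combined estimate is the mean of the Q_i, the within variance W is the mean of the U_i, the between variance B is the sample variance of the Q_i, and the total variance is T = W + (1 + 1/m)B. The inputs are the X1 coefficients and their variances as printed (rounded) in the analysis-step output; the function name `combine` is hypothetical.

```python
def combine(estimates, variances):
    """Pool m estimates and their variances with Rubin's rules (standard form, assumed).

    Requires m >= 2 and nonzero within and between variance.
    """
    m = len(estimates)
    q_bar = sum(estimates) / m                                       # combined estimate
    within = sum(variances) / m                                      # W
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)     # B
    total = within + (1 + 1 / m) * between                           # T
    r = (1 + 1 / m) * between / within       # relative increase in variance
    df = (m - 1) * (1 + 1 / r) ** 2          # degrees of freedom
    fmi = (r + 2 / (df + 3)) / (r + 1)       # fraction of missing information
    return {
        "estimate": q_bar,
        "within_var": within,
        "between_var": between,
        "total_var": total,
        "df": df,
        "fmi": fmi,
    }

# X1 coefficient and its variance from each of the 5 regressions (rounded values).
x1_estimates = [-7.96, -7.81, -4.31, -4.17, -4.08]
x1_variances = [0.34, 0.52, 0.02, 0.00, 0.00]

pooled = combine(x1_estimates, x1_variances)
for name, value in pooled.items():
    print(name, round(value, 4))
```

The confidence interval would then be estimate ± t × sqrt(total_var), with t taken from a t distribution on `df` degrees of freedom; it is left out here because the Python standard library has no t quantile function.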

* Comparison of methods to handle missing values (O = yes, X = no)

Criteria                                       CCA   ACA   Mean Imputation   EM method   Multiple Imputation
Unbiased parameter estimation under MCAR       O     X     X                 O           O
Unbiased parameter estimation under MAR        X     X     X                 O           O
Unbiased parameter estimation under NMAR       X     X     X                 X           X
Good estimates of variability                  X     X     X                 X           O
Best statistical power                         X     O     O                 O           O

MI is the BEST!!

- Excellent estimation

- Good estimates of variability: the variance among the M estimates is available, because the data are multiply imputed

- Best statistical power: no cases are deleted

Imputation Mechanisms

(1) Imputation step of MI: imputation mechanisms for substituting missing values

Pattern                    Type         Normality   Imputation mechanism
Univariate, monotone       Continuous   O           Regression
Univariate, monotone       Continuous   X           Predictive Mean Matching (PMM)
Multivariate, not monotone Continuous   -           MCMC

MCMC has NOT been tested for univariate missing data.
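To make the difference between the regression mechanism and Predictive Mean Matching concrete, here is a deliberately simplified PMM sketch. It assumes a single predictor, k = 3 donors, and no random draw of the regression coefficients (full implementations also perturb the coefficients before matching); the function names and the toy data are hypothetical. The point it shows is that PMM fills each gap with a value actually observed in the data, rather than with a model prediction, which is why it does not rely on normality.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def pmm_impute(x, z, k=3, seed=0):
    """Fill None entries of z by simplified predictive mean matching on x."""
    rng = random.Random(seed)
    observed = [(xi, zi) for xi, zi in zip(x, z) if zi is not None]
    a, b = fit_line([p[0] for p in observed], [p[1] for p in observed])
    # Each donor carries its predicted mean and its actually observed value.
    donors = [(a + b * xi, zi) for xi, zi in observed]
    completed = []
    for xi, zi in zip(x, z):
        if zi is not None:
            completed.append(zi)
            continue
        predicted = a + b * xi
        # Take the k donors whose predicted means are closest, then draw one at random.
        nearest = sorted(donors, key=lambda d: abs(d[0] - predicted))[:k]
        completed.append(rng.choice(nearest)[1])
    return completed

# Toy data (made up for illustration): x fully observed, z partly missing.
x = [1, 2, 3, 4, 5, 6, 7, 8]
z = [2.1, 3.9, None, 8.2, None, 12.1, 13.8, None]
z_completed = pmm_impute(x, z)
print(z_completed)
```

Running `pmm_impute` with several different seeds would give several completed datasets, which is what the imputation step of MI needs.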

Simulated Data

* 3000 obs. are generated on Z1 and X1,…,X6 (all variables are continuous)

( Xs: observed variables, Z: partly missing variable )

* Z1 and X1,…,X6 are drawn from a multivariate normal distribution with means = 0 and the following correlations:

       z1        x1        x2        x3        x4        x5        x6
z1     1.0000
x1     0.7655    1.0000
x2     0.2764    0.3233    1.0000
x3     0.0509    0.0351    0.5352    1.0000
x4     0.1612    0.1415    -0.0063   -0.0738   1.0000
x5     0.2924    0.3581    0.8062    -0.0640   0.0441    1.0000
x6     0.1052    0.1124    -0.0061   -0.0764   0.1157    0.0420    1.0000

Example Data (“A Predictive Study of Coronary Heart Disease”)

* 3154 obs. (all variables are continuous)

- Missing variable: Systolic Blood Pressure, SBP (Mean: 128.63)

- Observed variables: DBP (82.02), height (69.78), weight (169.95), age (46.28), BMI (24.52), and Cholesterol (Mean: 226.37)

* Correlations:

         sbp       dbp       height    weight    age       bmi       chol
sbp      1.0000
dbp      0.7700    1.0000
height   0.0156    0.0070    1.0000
weight   0.2513    0.2940    0.5333    1.0000
age      0.1701    0.1440    -0.0919   -0.0331   1.0000
bmi      0.2878    0.3428    -0.0633   0.8079    0.0256    1.0000
chol     0.1231    0.1296    -0.0889   0.0085    0.0892    0.0706    1.0000

Method

1. Missing Mechanisms

1) MCAR: Z1 (SBP) values are deleted at random

2) MAR: after sorting by one of the X (observed) variables, Z1 (SBP) values are deleted

3) NMAR: after sorting by Z1 (SBP) itself, Z1 (SBP) values are deleted
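The three deletion schemes can be sketched as follows. This is an assumption-laden illustration, not the study's code: it uses only two variables (z1 and x1, with the 0.7655 correlation from the simulated data), it assumes that "after sorting" means the largest values are the ones deleted, and all function names are hypothetical.

```python
import random

def make_data(n, rho, seed=0):
    """Draw n pairs (z, x) of standard normals with correlation rho."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        z = rho * x + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
        rows.append((z, x))
    return rows

def delete_z(rows, proportion, mechanism, seed=1):
    """Set a given proportion of z values to None under MCAR, MAR, or NMAR."""
    n = len(rows)
    k = int(round(proportion * n))
    idx = list(range(n))
    if mechanism == "MCAR":
        random.Random(seed).shuffle(idx)                   # random order
    elif mechanism == "MAR":
        idx.sort(key=lambda i: rows[i][1], reverse=True)   # sort by observed x
    elif mechanism == "NMAR":
        idx.sort(key=lambda i: rows[i][0], reverse=True)   # sort by z itself
    else:
        raise ValueError("mechanism must be MCAR, MAR, or NMAR")
    drop = set(idx[:k])  # assumption: the first k cases after sorting are deleted
    return [(None if i in drop else z, x) for i, (z, x) in enumerate(rows)]

def complete_case_mean(rows):
    """Mean of z over the cases where z is still observed."""
    observed = [z for z, _ in rows if z is not None]
    return sum(observed) / len(observed)

data = make_data(3000, 0.7655)
full_mean = complete_case_mean(data)
cc_means = {mech: complete_case_mean(delete_z(data, 0.30, mech))
            for mech in ("MCAR", "MAR", "NMAR")}
print(full_mean, cc_means)
```

With deletion from the top of the sorted order, the complete-case mean of z drifts away from the full-data mean under MAR and drifts further under NMAR, while it stays close under MCAR.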

2. Bias is mainly measured by RMSE (Root Mean Square Error):

RMSE = sqrt(Variance of estimates + Bias^2)

RMSE captures both the accuracy and the variability of the estimates and compares them in the same units. The smaller the RMSE, the better the estimation.

* True value = mean of Z1 (SBP) at 0% missing

* Estimate = mean of Z1 (SBP) after MI at 10% to 80% missing

* Proportions of missing data: 0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%
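A short sketch of the RMSE formula, assuming the variance is taken as the population variance of the estimates across repetitions (in that case the formula equals the root of the mean squared error exactly). The true value 128.63 is the SBP mean quoted for the example data; the four estimates are invented for illustration.

```python
import math

def rmse(estimates, true_value):
    """sqrt(variance of estimates + bias^2), using the population variance."""
    m = len(estimates)
    mean_est = sum(estimates) / m
    bias = mean_est - true_value
    variance = sum((e - mean_est) ** 2 for e in estimates) / m
    return math.sqrt(variance + bias ** 2)

true_sbp_mean = 128.63                          # mean of SBP at 0% missing
sbp_estimates = [127.9, 128.4, 129.1, 128.0]    # hypothetical estimates after MI

value = rmse(sbp_estimates, true_sbp_mean)

# Cross-check: the same quantity computed directly as root mean squared error.
direct = math.sqrt(sum((e - true_sbp_mean) ** 2 for e in sbp_estimates) / len(sbp_estimates))
print(value, direct)
```

Because RMSE is in the units of the variable itself, it can be compared directly with the standard deviation of the data, as done in the results.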

3. Methods for dealing with missing values (to measure the effectiveness of MI)

Complete Case Analysis (CCA)

Multiple Imputation (MI)

4. Number of imputations

M = 10, 20, 30, 40, and 50

5. Imputation model

(z1 = x1 x2 x3 x4 x5 x6): all variables

(z1 = x1 x2 x5): variables highly correlated with z1

(z1 = x3 x4 x6): variables weakly correlated with z1

The z1 = x1 x2 x5 model is the best model, because it has the smallest RMSE.

6. Imputation mechanisms

Regression method, PMM, MCMC

7. 500 repetitions of each MI (to reduce the random variability of imputation)

ex) M=10 × 500 reps. → average them → mean of estimates for M=10

…

M=50 × 500 reps. → average them → mean of estimates for M=50

8. Statistical software

Stata 11 (Multiple Imputation)
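The repetition scheme in item 7 (run MI with M imputations, repeat many times, average the estimates) can be sketched as below. This is not the Stata procedure used in the study: it is a scaled-down illustration with 20 repetitions instead of 500, MCAR deletion only, one predictor, and stochastic regression imputation without a draw of the regression parameters. All names and the generated data are hypothetical.

```python
import random

def fit_line(xs, ys):
    """OLS fit of y = a + b*x; returns (a, b, residual standard deviation)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    sigma = (sum((y - a - b * x) ** 2 for x, y in zip(xs, ys)) / (n - 2)) ** 0.5
    return a, b, sigma

def mi_mean(x, z, m, rng):
    """Mean of z averaged over m stochastic regression imputations."""
    observed = [(xi, zi) for xi, zi in zip(x, z) if zi is not None]
    a, b, sigma = fit_line([p[0] for p in observed], [p[1] for p in observed])
    pooled = 0.0
    for _ in range(m):
        completed = [zi if zi is not None else a + b * xi + rng.gauss(0, sigma)
                     for xi, zi in zip(x, z)]
        pooled += sum(completed) / len(completed)
    return pooled / m

def repeated_mi(x, z_full, proportion, m, reps, seed=0):
    """Repeat (MCAR deletion, then MI) reps times; return the average and all estimates."""
    rng = random.Random(seed)
    n = len(z_full)
    k = int(round(proportion * n))
    estimates = []
    for _ in range(reps):
        drop = set(rng.sample(range(n), k))   # MCAR: delete k values at random
        z = [None if i in drop else zi for i, zi in enumerate(z_full)]
        estimates.append(mi_mean(x, z, m, rng))
    return sum(estimates) / reps, estimates

# Generated data: z correlated with x (coefficients chosen for illustration).
data_rng = random.Random(42)
x = [data_rng.gauss(0, 1) for _ in range(500)]
z_full = [0.77 * xi + 0.64 * data_rng.gauss(0, 1) for xi in x]
true_mean = sum(z_full) / len(z_full)         # "true value" at 0% missing

avg_estimate, estimates = repeated_mi(x, z_full, proportion=0.5, m=10, reps=20)
print(true_mean, avg_estimate)
```

The list `estimates` is what an RMSE calculation would take as input, with `true_mean` as the true value.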

Result (simulated data) 1. CCA vs. MI by RMSE

[Figure: RMSE (y-axis) against proportion of missing data, 10% to 80% (x-axis), for CCA and MI. Three panels: MCAR (y-axis 0 to 0.12), MAR (y-axis 0 to 0.25), NMAR (y-axis 0 to 1.6).]

[Figure: the MCAR and MAR panels redrawn on the same y-axis scale as NMAR (0 to 1.6), for CCA and MI.]

On the common y-axis scale, both CCA and MI look good under MCAR and MAR.

After changing the scale of the y-axis, MI is better than CCA under all missing mechanisms.

As the percent of missing data increases, the RMSEs increase linearly, and so does the difference in RMSE between CCA and MI.

> With a high amount of missing data, use Multiple Imputation.

2. Imputation numbers (simulated data)

[Figure: RMSE (y-axis) against proportion of missing data, 10% to 80% (x-axis), with one line each for 10, 20, 30, 40, and 50 imputations. Three panels: MCAR, MAR, NMAR.]

Result: similar, regardless of the number of imputations

Under MCAR and MAR, MI is good!

Under NMAR, MI gives biased estimates at 80% missing, because the RMSE is large, about 1 SD of the data (0.99).

The 5 lines (M=10 to M=50) run together and look like 1 line.

> No difference among the numbers of imputations (M) = 10, 20, 30, 40, 50.

3. Regression, PMM, MCMC (simulated data)

[Figure: RMSE (y-axis) against proportion of missing data, 10% to 80% (x-axis), with one line each for reg, pmm, and mcmc. Three panels: MCAR, MAR, NMAR. Plot annotation: MCMC/Reg.]

Mechanism   Normality    Theory       Practically (MI)
MCAR        Normal       Regression   All imputation mechanisms
MAR         Normal       Regression   All imputation mechanisms (Reg. slightly better)
NMAR        Not Normal   PMM          Regression, MCMC

1. Under MCAR and MAR, Reg. should theoretically be better because of normality, but all methods are good. However, the Reg. method is slightly better under MAR.

2. Under NMAR, even though normality is not met, the Reg. method is better than PMM.

* The normality assumption may not be important under NMAR.

* MCMC is good under all missing mechanisms. Thus, MCMC can be used for univariate, continuous missing data.

Result (example data) 1. CCA vs. MI by RMSE

[Figure: RMSE (y-axis) against proportion of missing data, 10% to 80% (x-axis), for CCA and MI. Three panels: MCAR (y-axis 0 to 1.6), MAR (y-axis 0 to 4), NMAR (y-axis 0 to 20).]

[Figure: the MCAR and MAR panels redrawn on the same y-axis scale as NMAR (0 to 20), for CCA and MI.]

On the common y-axis scale, both CCA and MI look good under MCAR and MAR.

After changing the scale of the y-axis, MI produced markedly less biased values than CCA under MCAR, MAR, and NMAR.

As the percent of missing data increases, the RMSEs increase linearly, and so does the difference in RMSE between CCA and MI.

> With a high amount of missing data, Multiple Imputation is preferable.

2. Imputation numbers (example data)

[Figure: RMSE (y-axis, 0 to 16) against proportion of missing data, 10% to 80% (x-axis), by number of imputations. Three panels: MCAR, MAR, NMAR.]

Result: similar, regardless of the number of imputations and the percent of missing data

Under MCAR and MAR, MI produces unbiased estimates.

Under NMAR, MI did not do well at 80% missing, because the RMSE is large, about 1 SD of the data (15.11).

No difference among the increased numbers of imputations = 10, 20, 30, 40, 50.

> Increasing the number of imputations had no significant effect on correcting bias for data with these characteristics.

3. Regression, PMM, MCMC (example data)

[Figure: RMSE (y-axis, 0 to 18) against proportion of missing data, 10% to 80% (x-axis), with one line each for reg, pmm, and mcmc. Three panels: MCAR, MAR, NMAR. Plot annotation: MCMC/Reg.]

Mechanism   Normality    Theory   Practically (MI)
MCAR        Not Normal   PMM      All imputation mechanisms
MAR         Not Normal   PMM      All imputation mechanisms (PMM slightly better)
NMAR        Not Normal   PMM      Regression, MCMC

1. Under MCAR and MAR, PMM should theoretically be better because the normality assumption is broken, but all methods are good. However, the PMM method is slightly better under MAR.

2. Under NMAR, even though normality is not met, Reg. has a lower RMSE than PMM.

* The normality assumption may be important only under MAR.

* MCMC is good to use under MCAR, MAR, and NMAR. Thus, MCMC can be used not only for multivariate, continuous missing data, but also for univariate, continuous missing data.

Conclusion

1. Multiple Imputation (MI) performed better than Complete Case Analysis in every setting tested.

2. No significant difference among the numbers of imputations in my data.

3. Under MCAR and MAR, MI produces unbiased estimates even at high amounts of missing data.

4. However, under NMAR, the estimates from MI are also biased at high amounts of missing data.

5. MCMC is good for univariate, continuous missing data under MCAR, MAR, and NMAR.

Thank you
