missing data in randomized control trials

Missing Data in Randomized Control Trials John W. Graham The Prevention Research Center and Department of Biobehavioral Health Penn State University [email protected] IES/NCER Summer Research Training Institute, July 2008



Page 1: Missing Data in  Randomized Control Trials

Missing Data in Randomized Control Trials

John W. Graham

The Prevention Research Center and Department of Biobehavioral Health

Penn State University

[email protected]

IES/NCER Summer Research Training Institute, July 2008

Page 2: Missing Data in  Randomized Control Trials

Sessions in Four Parts

(1) Introduction: Missing Data Theory
(2) Attrition: Bias and Lost Power
(3) A brief analysis demonstration
      Multiple Imputation with NORM and Proc MI
(4) Hands-on Intro to Multiple Imputation

Page 3: Missing Data in  Randomized Control Trials

Recent Papers

Graham, J. W., Cumsille, P. E., & Elek-Fisk, E. (2003). Methods for handling missing data. In J. A. Schinka & W. F. Velicer (Eds.), Research Methods in Psychology (pp. 87-114). Volume 2 of Handbook of Psychology (I. B. Weiner, Editor-in-Chief). New York: John Wiley & Sons.

Graham, J. W. (2009, in press). Missing data analysis: making it work in the real world. Annual Review of Psychology, 60.

Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6, 330-351.

Schafer, J. L., & Graham, J. W. (2002). Missing data: our view of the state of the art. Psychological Methods, 7, 147-177.

Page 4: Missing Data in  Randomized Control Trials

Part 1: A Brief Introduction to Analysis with Missing Data

Page 5: Missing Data in  Randomized Control Trials

Problem with Missing Data

Analysis procedures were designed for complete data

. . .

Page 6: Missing Data in  Randomized Control Trials

Solution 1

Design new model-based procedures

Missing Data + Parameter Estimation in One Step

Full Information Maximum Likelihood (FIML)

SEM and Other Latent Variable Programs (Amos, LISREL, Mplus, Mx, LTA)

Page 7: Missing Data in  Randomized Control Trials

Solution 2

Data-based procedures, e.g., Multiple Imputation (MI)

Two Steps

Step 1: Deal with the missing data (e.g., replace missing values with plausible values)
  Produce a product

Step 2: Analyze the product as if there were no missing data

Page 8: Missing Data in  Randomized Control Trials

FAQ

Aren't you somehow helping yourself with imputation?

. . .

Page 9: Missing Data in  Randomized Control Trials

NO. Missing data imputation . . .

does NOT give you something for nothing

DOES let you make use of all data you have

. . .

Page 10: Missing Data in  Randomized Control Trials

FAQ

Is the imputed value what the person would have given?

Page 11: Missing Data in  Randomized Control Trials

NO. When we impute a value . . .

We do not impute for the sake of the value itself

We impute to preserve important characteristics of the whole data set

. . .

Page 12: Missing Data in  Randomized Control Trials

We want . . .

unbiased parameter estimation e.g., b-weights

Good estimate of variability e.g., standard errors

best statistical power

Page 13: Missing Data in  Randomized Control Trials

Causes of Missingness

Ignorable
  MCAR: Missing Completely At Random
  MAR: Missing At Random

Non-Ignorable
  MNAR: Missing Not At Random

Page 14: Missing Data in  Randomized Control Trials

MCAR (Missing Completely At Random)

MCAR 1: Cause of missingness is a completely random process (like a coin flip)

MCAR 2: Cause uncorrelated with variables of interest
  Example: parents move

No bias if cause omitted

Page 15: Missing Data in  Randomized Control Trials

MAR (Missing At Random)

Missingness may be related to measured variables

But no residual relationship with unmeasured variables
  Example: reading speed

No bias if you control for measured variables

Page 16: Missing Data in  Randomized Control Trials

MNAR (Missing Not At Random)

Even after controlling for measured variables ...

Residual relationship with unmeasured variables

Example: drug use reason for absence
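The three mechanisms can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function name `make_missing`, the simulated variables, and the hard cutoffs used for MAR and MNAR are hypothetical simplifications (real missingness is usually probabilistic, and, as argued later in the talk, never purely one type).

```python
import random

def make_missing(y, z, mechanism, n_drop, rng):
    """Return a copy of y with values set to None under one mechanism.

    MCAR: each value is dropped by a coin flip, unrelated to y or z.
    MAR:  the n_drop cases highest on the MEASURED variable z are dropped.
    MNAR: the n_drop cases highest on y ITSELF (unmeasured when missing) are dropped.
    """
    n = len(y)
    z_cut = sorted(z)[n - n_drop]
    y_cut = sorted(y)[n - n_drop]
    out = []
    for yi, zi in zip(y, z):
        if mechanism == "MCAR":
            drop = rng.random() < n_drop / n
        elif mechanism == "MAR":
            drop = zi >= z_cut
        elif mechanism == "MNAR":
            drop = yi >= y_cut
        else:
            raise ValueError("unknown mechanism: " + mechanism)
        out.append(None if drop else yi)
    return out

rng = random.Random(0)
z = [rng.gauss(0, 1) for _ in range(200)]        # measured covariate
y = [0.5 * zi + rng.gauss(0, 1) for zi in z]     # variable of interest

mcar = make_missing(y, z, "MCAR", 60, rng)
mar = make_missing(y, z, "MAR", 60, rng)
mnar = make_missing(y, z, "MNAR", 60, rng)
```

Under MNAR in this toy setup, the observed mean of y is pulled below the full-sample mean, and no measured variable fully explains why.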

Page 17: Missing Data in  Randomized Control Trials

MNAR Causes

The recommended methods assume missingness is MAR

But what if the cause of missingness is not MAR?

Should these methods be used when MAR assumptions not met?

. . .

Page 18: Missing Data in  Randomized Control Trials

YES! These Methods Work!

Suggested methods work better than “old” methods

Multiple causes of missingness
  Only small part of missingness may be MNAR

Suggested methods usually work very well

Page 19: Missing Data in  Randomized Control Trials

Methods: "Old" vs MAR vs MNAR

MAR methods (MI and ML) are ALWAYS at least as good as, usually better than, "old" methods (e.g., listwise deletion)

Methods designed to handle MNAR missingness are NOT always better than MAR methods

Page 20: Missing Data in  Randomized Control Trials

Analysis: Old and New

Page 21: Missing Data in  Randomized Control Trials

Old Procedures: Analyze Complete Cases (listwise deletion)

may produce bias

you always lose some power (because you are throwing away data)

reasonable if you lose only 5% of cases

often lose substantial power

Page 22: Missing Data in  Randomized Control Trials

Analyze Complete Cases

(listwise deletion)

Cases (rows) by variables (columns); 1 = observed, 0 = missing:

1 1 1 1
0 1 1 1
1 0 1 1
1 1 0 1
1 1 1 0

very common situation
  only 20% (4 of 20) data points missing
  but discard 80% of the cases
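The arithmetic on this slide can be checked with a few lines of code. The sketch below assumes the twenty 1/0 values form a grid of 5 cases by 4 variables (1 = observed, 0 = missing); the function name `listwise_summary` is illustrative only.

```python
# Missingness pattern read from the slide, assumed to be 5 cases x 4 variables.
pattern = [
    [1, 1, 1, 1],
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
]

def listwise_summary(pattern):
    """Compare the share of missing data points with the share of cases lost."""
    n_cases = len(pattern)
    n_points = sum(len(row) for row in pattern)
    n_missing = sum(row.count(0) for row in pattern)
    n_complete = sum(1 for row in pattern if all(row))
    return {
        "pct_points_missing": 100 * n_missing / n_points,
        "pct_cases_discarded": 100 * (n_cases - n_complete) / n_cases,
    }

summary = listwise_summary(pattern)
print(summary)
```

Only one case has no missing value, so listwise deletion keeps one case out of five even though four of every five data points were observed.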

Page 23: Missing Data in  Randomized Control Trials

Other "Old" Procedures

Pairwise deletion
  May be of occasional use for preliminary analyses

Mean substitution
  Never use it

Regression-based single imputation
  generally not recommended ... except ...

Page 24: Missing Data in  Randomized Control Trials

Recommended Model-Based Procedures

Multiple Group SEM (Structural Equation Modeling)

Latent Transition Analysis (Collins et al.)

A latent class procedure

Page 25: Missing Data in  Randomized Control Trials

Recommended Model-Based Procedures

Raw Data Maximum Likelihood SEM, aka Full Information Maximum Likelihood (FIML)
  Amos (James Arbuckle)
  LISREL 8.5+ (Jöreskog & Sörbom)
  Mplus (Bengt Muthén)
  Mx (Michael Neale)

Page 26: Missing Data in  Randomized Control Trials

Amos 7, Mx, Mplus, LISREL 8.8

Structural Equation Modeling (SEM) Programs

In Single Analysis ...

Good Estimation

Reasonable standard errors

Windows Graphical Interface

Page 27: Missing Data in  Randomized Control Trials

Limitation with Model-Based Procedures

That particular model must be what you want

Page 28: Missing Data in  Randomized Control Trials

Recommended Data-Based Procedures

EM Algorithm (ML parameter estimation)
  Norm-Cat-Mix, EMcov, SAS, SPSS

Multiple Imputation
  NORM, Cat, Mix, Pan (Joe Schafer)
  SAS Proc MI
  LISREL 8.5+
  Amos 7

Page 29: Missing Data in  Randomized Control Trials

EM Algorithm (Expectation-Maximization)

Alternate between
  E-step: predict missing data
  M-step: estimate parameters

Excellent (ML) parameter estimates

But no standard errors
  must use bootstrap or multiple imputation
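To make the E-step/M-step alternation concrete, here is a minimal sketch for the simplest useful case: two variables assumed bivariate normal, with X complete and Y partly missing. The function name `em_bivariate` and the toy data are hypothetical; programs such as NORM or SAS handle general multivariate patterns.

```python
def em_bivariate(x, y, n_iter=200):
    """EM estimates of mean(Y), var(Y), cov(X, Y) when some y values are None."""
    n = len(x)
    mu_x = sum(x) / n
    s_xx = sum((xi - mu_x) ** 2 for xi in x) / n   # X is complete: fixed ML values
    obs = [yi for yi in y if yi is not None]
    mu_y = sum(obs) / len(obs)                     # starting values from observed y
    s_yy = sum((yi - mu_y) ** 2 for yi in obs) / len(obs)
    s_xy = 0.0
    for _ in range(n_iter):
        slope = s_xy / s_xx
        resid_var = s_yy - slope * s_xy
        sum_y = sum_yy = sum_xy = 0.0
        for xi, yi in zip(x, y):
            if yi is None:
                # E-step: expected y and y^2 given x and current parameters
                ey = mu_y + slope * (xi - mu_x)
                eyy = ey * ey + resid_var
            else:
                ey, eyy = yi, yi * yi
            sum_y += ey
            sum_yy += eyy
            sum_xy += xi * ey
        # M-step: re-estimate parameters from the expected sufficient statistics
        mu_y = sum_y / n
        s_yy = sum_yy / n - mu_y ** 2
        s_xy = sum_xy / n - mu_x * mu_y
    return mu_y, s_yy, s_xy

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, None, 12.1, None, None]
mu_y, s_yy, s_xy = em_bivariate(x, y)
print(mu_y, s_yy, s_xy)
```

Note that the E-step adds the residual variance to the expected squared value; plugging in predicted values alone would understate the variance of Y. The output is a set of parameter estimates only, with no standard errors, which is the limitation the slide points out.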

Page 30: Missing Data in  Randomized Control Trials

Multiple Imputation

Problem with Single Imputation: Too Little Variability

Because of Error Variance

Because covariance matrix is only one estimate

Page 31: Missing Data in  Randomized Control Trials

Too Little Error Variance

Imputed value lies on regression line

Page 32: Missing Data in  Randomized Control Trials

Imputed Values on Regression Line

Page 33: Missing Data in  Randomized Control Trials

Restore Error . . .

Add random normal residual
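The contrast between imputing on the regression line and restoring error variance can be sketched as follows. The helper names `fit_line` and `impute` and the toy data are hypothetical; this shows single regression imputation only, not a full multiple-imputation procedure.

```python
import random

def fit_line(x, y):
    """Least-squares fit on complete cases: intercept, slope, residual SD."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    sse = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    return intercept, slope, (sse / (n - 2)) ** 0.5

def impute(x_missing, intercept, slope, resid_sd=0.0, rng=None):
    """Predict y for cases missing y; with rng, add a random normal residual."""
    values = []
    for a in x_missing:
        noise = rng.gauss(0.0, resid_sd) if rng is not None else 0.0
        values.append(intercept + slope * a + noise)
    return values

x_obs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_obs = [1.2, 1.9, 3.4, 3.8, 5.3, 5.9]
x_mis = [2.5, 4.5]                    # cases with x observed but y missing

a, b, sd = fit_line(x_obs, y_obs)
on_line = impute(x_mis, a, b)                          # too little variability
with_error = impute(x_mis, a, b, sd, random.Random(1)) # error variance restored
print(on_line, with_error)
```

Values in `on_line` have zero residual by construction, which is why analyses of such data understate variability.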

Page 34: Missing Data in  Randomized Control Trials

Regression Line only One Estimate

Page 35: Missing Data in  Randomized Control Trials

Covariance Matrix (Regression Line) only One Estimate

Obtain multiple plausible estimates of the covariance matrix

Ideally, draw multiple covariance matrices from the population

Approximate this with
  Bootstrap
  Data Augmentation (NORM)
  MCMC (SAS 8.2, 9)

Page 36: Missing Data in  Randomized Control Trials

Data Augmentation: stochastic version of EM

EM
  E (expectation) step: predict missing data
  M (maximization) step: estimate parameters

Data Augmentation
  I (imputation) step: simulate missing data
  P (posterior) step: simulate parameters
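The I-step/P-step alternation can be sketched with a deliberately tiny model. The example below assumes a single normally distributed variable with a standard noninformative prior; NORM and Proc MI work with a multivariate normal model, so this is a toy illustration of the idea, not their algorithm. The function name and data are hypothetical.

```python
import random

def data_augmentation(y, n_steps=100, seed=0):
    """Alternate I- and P-steps for one normal variable; None marks missing."""
    rng = random.Random(seed)
    obs = [v for v in y if v is not None]
    n = len(y)
    mu = sum(obs) / len(obs)
    sigma2 = sum((v - mu) ** 2 for v in obs) / (len(obs) - 1)
    draws = []
    for _ in range(n_steps):
        # I-step: simulate the missing values given the current parameters
        filled = [v if v is not None else rng.gauss(mu, sigma2 ** 0.5) for v in y]
        # P-step: simulate parameters from their posterior given completed data
        ybar = sum(filled) / n
        ss = sum((v - ybar) ** 2 for v in filled)
        chi2 = rng.gammavariate((n - 1) / 2, 2.0)   # chi-square with n-1 df
        sigma2 = ss / chi2
        mu = rng.gauss(ybar, (sigma2 / n) ** 0.5)
        draws.append((mu, sigma2))
    return draws

y = [4.1, 5.0, None, 6.2, 5.5, None, 4.8, 5.9]
draws = data_augmentation(y)
print(draws[-1])
```

Unlike EM, each step draws random values rather than expected values, so the sequence of parameter draws wanders around the posterior instead of converging to a single point.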

Page 37: Missing Data in  Randomized Control Trials

Data Augmentation

Parameters from consecutive steps are too related (i.e., not enough variability)

After 50 or 100 steps of DA, covariance matrices are like random draws from the population

Page 38: Missing Data in  Randomized Control Trials

Multiple Imputation Allows:

Unbiased Estimation

Good standard errors, provided the number of imputations (m) is large enough

Too few imputations: reduced power with small effect sizes

Page 39: Missing Data in  Randomized Control Trials

[Figure: Power Falloff (FMI = .50, ρ = .10). Y-axis: Percent Power Falloff (0 to 14); x-axis: m Imputations (100, 85, 70, 55, 40, 25, 10).]

From Graham, J. W., Olchowski, A. E., & Gilreath, T. D. (2007). How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prevention Science, 8, 206-213.

Page 40: Missing Data in  Randomized Control Trials

Part 2: Attrition: Bias and Loss of Power

Page 41: Missing Data in  Randomized Control Trials

Relevant Papers

Graham, J. W. (in press). Missing data analysis: making it work in the real world. Annual Review of Psychology, 60.

Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6, 330-351.

Hedeker, D., & Gibbons, R. D. (1997). Application of random-effects pattern-mixture models for missing data in longitudinal studies. Psychological Methods, 2, 64-78.

Graham, J. W., & Collins, L. M. (2008). Using Modern Missing Data Methods with Auxiliary Variables to Mitigate the Effects of Attrition on Statistical Power. Annual Meetings of the Society for Prevention Research, San Francisco, CA. (available upon request)

Graham, J. W., Palen, L. A., et al. (2008). Attrition: MAR & MNAR missingness, and estimation bias. Annual Meetings of the Society for Prevention Research, San Francisco, CA. (available upon request)

Page 42: Missing Data in  Randomized Control Trials

What if the cause of missingness is MNAR?

Problems with this statement

MAR & MNAR are widely misunderstood concepts

I argue that the cause of missingness is never purely MNAR

The cause of missingness is virtually never purely MAR either.

Page 43: Missing Data in  Randomized Control Trials

MAR vs MNAR

"Pure" MCAR, MAR, MNAR never occur in field research

Each requires untenable assumptions, e.g., that all possible correlations and partial correlations are r = 0

Page 44: Missing Data in  Randomized Control Trials

MAR vs MNAR

Better to think of MAR and MNAR as forming a continuum

MAR vs MNAR is NOT even the dimension of interest

Page 45: Missing Data in  Randomized Control Trials

MAR vs MNAR: What IS the Dimension of Interest?

How much estimation bias?
  when cause of missingness cannot be included in the model

Page 46: Missing Data in  Randomized Control Trials

Bottom Line ...

All missing data situations are partly MAR and partly MNAR

Sometimes it matters ...
  bias affects statistical conclusions

Often it does not matter
  bias has tolerably little effect on statistical conclusions
  (Collins, Schafer, & Kam, Psych Methods, 2001)

Page 47: Missing Data in  Randomized Control Trials

Methods: "Old" vs MAR vs MNAR

MAR methods (MI and ML) are ALWAYS at least as good as, usually better than, "old" methods (e.g., listwise deletion)

Methods designed to handle MNAR missingness are NOT always better than MAR methods

Page 48: Missing Data in  Randomized Control Trials

Yardstick for Measuring Bias

Standardized Bias =

$$\frac{(\text{average parameter est}) - (\text{population value})}{\text{Standard Error (SE)}} \times 100$$

|bias| < 40 considered small enough to be tolerable
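As a worked example of the yardstick, the sketch below computes standardized bias for made-up numbers; the function name and the input values are hypothetical, and only the formula and the cutoff of 40 come from the slide.

```python
def standardized_bias(avg_estimate, population_value, standard_error):
    """Bias expressed as a percentage of the standard error."""
    return 100 * (avg_estimate - population_value) / standard_error

# Hypothetical simulation summary: average b-weight .52, true value .50, SE .10
example = standardized_bias(0.52, 0.50, 0.10)
tolerable = abs(example) < 40
print(round(example, 1), tolerable)
```

Here the estimate is off by one fifth of a standard error, which falls inside the tolerable range.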

Page 49: Missing Data in  Randomized Control Trials

A little background for Collins, Schafer, & Kam (2001; CSK)

Example model of interest: X → Y
  X = Program (prog vs control)
  Y = Cigarette Smoking
  Z = Cause of missingness: say, Rebelliousness (or smoking itself)

Factors to be considered:
  % Missing (e.g., % attrition)
  rYZ
  rZR

Page 50: Missing Data in  Randomized Control Trials

rYZ

Correlation between
  cause of missingness (Z), e.g., rebelliousness (or smoking itself)
  and the variable of interest (Y), e.g., Cigarette Smoking

Page 51: Missing Data in  Randomized Control Trials

rZR

Correlation between
  cause of missingness (Z), e.g., rebelliousness (or smoking itself)
  and missingness on the variable of interest, e.g., Missingness on the Smoking variable

Missingness on Smoking (often designated: R or RY)
  Dichotomous variable:
    R = 1: Smoking variable not missing
    R = 0: Smoking variable missing
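In practice, rZR is simply the ordinary correlation between a measured candidate cause Z and the 0/1 indicator R. A minimal sketch, with hypothetical toy data and helper name:

```python
def pearson_r(a, b):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / (va * vb) ** 0.5

z = [1.0, 2.0, 3.0, 4.0]              # hypothetical cause of missingness
smoking = [None, None, 1.0, 3.0]      # None marks a missing smoking score

# R = 1: smoking variable not missing; R = 0: smoking variable missing
r = [0 if s is None else 1 for s in smoking]
r_zr = pearson_r(z, r)
print(round(r_zr, 3))
```

The toy data are chosen to give an extreme value for illustration; the sign of rZR depends only on how Z and R are coded.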

Page 52: Missing Data in  Randomized Control Trials

CSK Study Design (partial)

Simulations manipulated
  amount of missingness (25% vs 50%)
  rYZ (r = .40, r = .90)

rZR held constant
  r = .45 with 50% missing (applies to "MNAR-Linear" missingness)

Page 53: Missing Data in  Randomized Control Trials

CSK Results (partial) (MNAR Missingness)

25% missing, rYZ = .40 ... no problem
25% missing, rYZ = .90 ... no problem
50% missing, rYZ = .40 ... no problem
50% missing, rYZ = .90 ... problem

* "no problem" = bias does not interfere with inference

These results apply to the regression coefficient for X → Y with "MNAR-Linear" missingness (see CSK, 2001, Table 2)

Page 54: Missing Data in  Randomized Control Trials

But Even CSK Results Too Conservative

Not considered by CSK: rZR
  In their simulation rZR = .45

Even with 50% missing and rYZ = .90, bias can be acceptably small

Graham et al. (2008): Bias acceptably small (standardized bias < 40) as long as rZR < .24

Page 55: Missing Data in  Randomized Control Trials

rZR < .24 Very Plausible

Study                                         rZR
------------------------------------------    ----
HealthWise (Caldwell, Smith, et al., 2004)    .106
AAPT (Hansen & Graham, 1991)                  .093
Botvin1                                       .044
Botvin2                                       .078
Botvin3                                       .104

All of these yield standardized bias < 10 (estimated)

Page 56: Missing Data in  Randomized Control Trials

CSK and Follow-up Simulations

Results very promising
  Suggest that even MNAR biases are often tolerably small

But these simulations still too narrow

Page 57: Missing Data in  Randomized Control Trials

Beginnings of a Taxonomy of Attrition

Causes of Attrition on Y (main DV)

Case 1: not Program (P), not Y, not PY interaction
Case 2: P only
Case 3: Y only . . . (CSK scenario)
Case 4: P and Y only

Page 58: Missing Data in  Randomized Control Trials

Beginnings of a Taxonomy of Attrition

Causes of Attrition on Y (main DV)

Case 5: PY interaction only
Case 6: P + PY interaction
Case 7: Y + PY interaction
Case 8: P, Y, and PY interaction

Page 59: Missing Data in  Randomized Control Trials

Taxonomy of Attrition

Cases 1-4
  often little or no problem

Cases 5-8
  Jury still out (more research needed)
  Very likely not as much of a problem as previously thought
  Use diagnostics to shed light

Page 60: Missing Data in  Randomized Control Trials

Use of Missing Data Diagnostics

Diagnostics based on pretest data not much help
  Hard to predict missing distal outcomes from differences on pretest scores

Longitudinal Diagnostics can be much more helpful

Page 61: Missing Data in  Randomized Control Trials

Hedeker & Gibbons (1997)

Plot main DV over time for four groups:
  for Program and Control
  for those with and without last wave of data

Much can be learned
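The four trajectories for this kind of plot can be computed before any graphing. The sketch below assumes a hypothetical record layout (a `condition` label and a list of scores per wave, with None for missing waves) and defines "dropper" as missing the last wave; the data are invented.

```python
from collections import defaultdict

def group_means_by_wave(records, n_waves):
    """Mean DV per wave for each (condition, stayer/dropper) group."""
    sums = defaultdict(lambda: [0.0] * n_waves)
    counts = defaultdict(lambda: [0] * n_waves)
    for rec in records:
        status = "stayer" if rec["scores"][-1] is not None else "dropper"
        key = (rec["condition"], status)
        for wave, score in enumerate(rec["scores"]):
            if score is not None:
                sums[key][wave] += score
                counts[key][wave] += 1
    return {
        key: [sums[key][w] / counts[key][w] if counts[key][w] else None
              for w in range(n_waves)]
        for key in sums
    }

records = [
    {"condition": "program", "scores": [5.0, 4.0, 3.0]},
    {"condition": "program", "scores": [6.0, 4.0, 3.0]},
    {"condition": "program", "scores": [5.0, 3.0, None]},
    {"condition": "control", "scores": [5.0, 5.0, 4.0]},
    {"condition": "control", "scores": [5.0, 6.0, None]},
]
means = group_means_by_wave(records, n_waves=3)
for key, values in sorted(means.items()):
    print(key, values)
```

Comparing droppers with stayers within each condition, wave by wave, is what reveals whether attrition looks similar or different across Program and Control.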

Page 62: Missing Data in  Randomized Control Trials

Empirical Examples

Hedeker & Gibbons (1997)
  Drug treatment of psychiatric patients

Hansen & Graham (1991)
  Adolescent Alcohol Prevention Trial (AAPT)
  Alcohol, smoking, other drug prevention among normal adolescents (7th – 11th grade)

Page 63: Missing Data in  Randomized Control Trials

Empirical Example Used by Hedeker & Gibbons (1997)

IV: Drug Treatment vs. Placebo Control

DV: Inpatient Multidimensional Psychiatric Scale (IMPS)
  1 = normal
  2 = borderline mentally ill
  3 = mildly ill
  4 = moderately ill
  5 = markedly ill
  6 = severely ill
  7 = among the most extremely ill

Page 64: Missing Data in  Randomized Control Trials

From Hedeker & Gibbons (1997)

[Figure: IMPS (low = better outcomes; y-axis 2.5 to 5.5) by Weeks of Treatment (0, 1, 3, 6), with lines for Placebo Control and Drug Treatment.]

Page 65: Missing Data in  Randomized Control Trials

Longitudinal Diagnostics: Hedeker & Gibbons Example

Treatment: droppers do BETTER than stayers
Control: droppers do WORSE than stayers

Example of Program X DV interaction

But in this case, pattern would lead to suppression bias
  Not as bad for internal validity in presence of significant program effect

Page 66: Missing Data in  Randomized Control Trials

AAPT (Hansen & Graham, 1991)

IV: Normative Education Program vs Information Only Control

DV: Cigarette Smoking (3-item scale)
  Measured at one-year intervals
  7th grade – 11th grade

Page 67: Missing Data in  Randomized Control Trials

[Figure: AAPT Cigarette Smoking (high = more smoking; arbitrary scale) across five grade levels (tick labels lost in extraction), with two lines labeled Control and two lines labeled Program.]

Page 68: Missing Data in  Randomized Control Trials

Longitudinal Diagnostics: AAPT Example

Treatment: droppers do WORSE than stayers
  little steeper increase
Control: droppers do WORSE than stayers
  little steeper increase

Little evidence for Prog X DV interaction

Very likely MAR methods allow good conclusions (CSK scenario holds)

Page 69: Missing Data in  Randomized Control Trials

Use of Auxiliary Variables

Reduces attrition bias
Restores some power lost due to attrition

Page 70: Missing Data in  Randomized Control Trials

What Is an Auxiliary Variable?

A variable correlated with the variables in your model
  but not part of the model
  not necessarily related to missingness
  used to "help" with missing data estimation

Best auxiliary variables: same variable as main DV, but measured at waves not used in analysis model

Page 71: Missing Data in  Randomized Control Trials

Model of Interest

[Path diagram: X → Y, with a residual (res) on Y]

Page 72: Missing Data in  Randomized Control Trials

Benefit of Auxiliary Variables

Example from Graham & Collins (2008)

X  Y  Z
1  1  1    500 complete cases
1  0  1    500 cases missing Y

X, Y variables in the model (Y sometimes missing)

Z is auxiliary variable

Page 73: Missing Data in  Randomized Control Trials

Benefit of Auxiliary Variables

Effective sample size (N')

Analysis involving N cases, with auxiliary variable(s)

gives statistical power equivalent to N' complete cases without auxiliary variables

Page 74: Missing Data in  Randomized Control Trials

Benefit of Auxiliary Variables

It matters how highly Y and Z (the auxiliary variable) are correlated

For example:

rYZ    N      N'     increase
.40    500    542     8%
.60    500    608    22%
.80    500    733    47%
.90    500    839    68%
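The percentage increases follow directly from the effective sample sizes reported on the slide. The short check below reuses those N' values as given (it does not derive them) and recomputes each increase relative to N = 500.

```python
n = 500
# Effective sample sizes N' as reported on the slide, keyed by rYZ
effective_n = {0.40: 542, 0.60: 608, 0.80: 733, 0.90: 839}

# Percent gain over the 500 complete cases, rounded as on the slide
pct_increase = {r_yz: round(100 * (n_eff - n) / n)
                for r_yz, n_eff in effective_n.items()}

for r_yz, pct in pct_increase.items():
    print(f"rYZ = {r_yz:.2f}: N' = {effective_n[r_yz]} ({pct}% increase)")
```

The gain grows quickly with rYZ, which is why the same variable measured at other waves tends to be the best auxiliary variable.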

Page 75: Missing Data in  Randomized Control Trials

[Figure: Effective Sample Size by rYZ. Y-axis: Effective Sample Size (500 to 1000); x-axis: rYZ (0.1 to 0.9).]

Page 76: Missing Data in  Randomized Control Trials

Conclusions

Attrition CAN be bad for internal validity

But often it's NOT nearly as bad as feared

Don't rush to conclusions, even with rather substantial attrition

Examine evidence (especially longitudinal diagnostics) before drawing conclusions

Use MI and ML missing data procedures!

Use good auxiliary variables to minimize impact of attrition

Page 77: Missing Data in  Randomized Control Trials

Part 3: Illustration of Missing Data Analysis: Multiple Imputation with NORM and Proc MI

Page 78: Missing Data in  Randomized Control Trials

Multiple Imputation: Basic Steps

Impute

Analyze

Combine results

Page 79: Missing Data in  Randomized Control Trials

Imputation and Analysis

Impute 40 datasets
  a missing value gets a different imputed value in each dataset

Analyze each data set with USUAL procedures
  e.g., SAS, SPSS, LISREL, EQS, STATA

Save parameter estimates and SEs

Page 80: Missing Data in  Randomized Control Trials

Combine the Results: Parameter Estimates to Report

Average of estimate (b-weight) over 40 imputed datasets

Page 81: Missing Data in  Randomized Control Trials

Combine the Results: Standard Errors to Report

Weighted sum of:
  "within imputation" variance
    average squared standard error
    usual kind of variability
  "between imputation" variance
    sample variance of parameter estimates over 40 datasets
    variability due to missing data
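The combining step can be sketched with the standard pooling formulas (Rubin's rules), in which the total variance is the within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name `pool` and the three toy results are hypothetical; the talk itself uses m = 40 datasets.

```python
import math
import statistics

def pool(estimates, std_errors):
    """Combine per-dataset estimates and SEs from m imputed datasets."""
    m = len(estimates)
    estimate = sum(estimates) / m                       # average b-weight
    within = sum(se ** 2 for se in std_errors) / m      # average squared SE
    between = statistics.variance(estimates)            # sample variance of estimates
    total = within + (1 + 1 / m) * between
    return estimate, math.sqrt(total)

# Toy results from m = 3 imputed datasets
estimates = [1.0, 2.0, 3.0]
std_errors = [1.0, 1.0, 1.0]
pooled_estimate, pooled_se = pool(estimates, std_errors)
print(pooled_estimate, round(pooled_se, 4))
```

The pooled SE is larger than any single dataset's SE because the between-imputation term carries the extra uncertainty due to missing data.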

Page 82: Missing Data in  Randomized Control Trials

Materials for SPSS Regression

Starting place: http://methodology.psu.edu

downloads (you will need to get a free user ID to download all our free software)

missing data software
  Joe Schafer's Missing Data Programs
  John Graham's Additional NORM Utilities

http://mcgee.hhdev.psu.edu/missing/index.html