MISSING DATA IN EPIDEMIOLOGY: ISSUES & APPROACHES N. Birkett, September 4, 2014 Presented to EPI8166 (PhD seminar course)




• There are known knowns;
  • There are things we know we know.
• We also know there are known unknowns;
  • That is to say we know there are some things we do not know.
• But there are also unknown unknowns;
  • The ones we don’t know we don’t know.

U.S. Secretary of Defense Donald H. Rumsfeld, Department of Defense news briefing, February 12, 2002


Example (1)

Full data:

                          # getting disease   # total   Cumulative incidence
Female                          300            1,000          0.3
Male                            200            1,000          0.2
Total                           500            2,000          0.25

RR = 1.5

Now, assume 50% of the females refuse to give you information about their final outcome (decline that question but continue in the study).

                          # getting disease   # total   Cumulative incidence
Female: missing data            150              500          0.3
Female: valid data              150              500          0.3
Male                            200            1,000          0.2
Total with valid data           350            1,500          0.233

RR = 1.5


Example (2)
• We are missing the outcome status on 50% of the females
• Using available data, we find:
  • Overall estimate of the rate of disease is biased
  • The RR for risk in females compared to males is OK
• Why?
  • Subjects missing the outcome status are a random subset of all females
  • Female-specific incidence risk is correct
  • Prevalence of female sex is lower in study ‘complete cases’
    • Fails to reflect the 50:50 distribution of sex in the target population
    • External validity


Example (3)

Full data:

                          # getting disease   # total   Cumulative incidence
Female                          300            1,000          0.3
Male                            200            1,000          0.2
Total                           500            2,000          0.25

RR = 1.5

Now, assume 50% of the females refuse to give you information about their final outcome, BUT only people not getting the outcome refuse.

                          # getting disease   # total   Cumulative incidence
Female: missing data              0              500          0.0
Female: valid data              300              500          0.6
Male                            200            1,000          0.2
Total with valid data           500            1,500          0.33

RR = 3.0


Example (4)
• The chance the outcome data is missing depends on the true status of the outcome
• Using available data, we find:
  • Overall estimate of the rate of disease is biased
  • The RR for risk in females compared to males is biased
• Why?
  • Female-specific incidence risk is biased
    • Over-estimated
  • Prevalence of female sex is lower in study ‘complete cases’
    • Fails to reflect the 50:50 distribution of sex in the target population
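The arithmetic behind the two examples can be checked directly from the counts in the tables. The sketch below is only an illustration; the function and variable names are assumptions, not part of the original slides.

```python
def risk_ratio(cases_a, total_a, cases_b, total_b):
    """Ratio of cumulative incidence in group A to that in group B."""
    return (cases_a / total_a) / (cases_b / total_b)

# Full cohort: 300/1,000 females and 200/1,000 males develop disease.
full_rr = risk_ratio(300, 1000, 200, 1000)

# Random refusal: a random half of the females have no outcome recorded,
# so 150 of the 500 females with valid data are diseased.
random_rr = risk_ratio(150, 500, 200, 1000)
random_overall = (150 + 200) / (500 + 1000)

# Selective refusal: only disease-free females refuse, so all 300
# diseased females remain among the 500 with valid data.
selective_rr = risk_ratio(300, 500, 200, 1000)
selective_overall = (300 + 200) / (500 + 1000)

print(f"full data:         RR = {full_rr:.2f}, overall risk = 0.250")
print(f"random refusal:    RR = {random_rr:.2f}, overall risk = {random_overall:.3f}")
print(f"selective refusal: RR = {selective_rr:.2f}, overall risk = {selective_overall:.3f}")
```

With random refusal only the overall risk moves (the sex mix of the analysed sample has changed); with selective refusal both the overall risk and the RR move.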


Why missing data matters (1)
• All studies have missing data
  • People drop out of studies
  • People decline one of several questionnaires
  • People decline to complete certain questions (e.g. income)
  • People miss questions (pages get stuck together)
  • Lab tests fail
  • Biological levels are ‘below threshold of detection’
• Missing data is usually not the focus of a study
• In many cases, missing data is just ignored


Why missing data matters (2)
• Failing to adjust properly for missing data can cause serious problems.
  • Introduces potential bias in parameter estimation
  • Weakens the generalizability of the results
  • Ignoring cases with missing data leads to the loss of information
    • Decreases statistical power
    • Increases standard errors
• Failing to adjust data properly for missing values can make the data unsuitable for a statistical procedure
• Can also make the statistical analyses vulnerable to violations of assumptions


Levels of missing data
• Data can be missing at two ‘levels’
• Unit-level non-response
  • A subject included in the study declines to take part and provides no information at all.
  • Serious issue in much research
  • Mainly affects external generalizability
  • Not the focus of further discussions
• Item-level non-response
  • Subject participates in the study
  • Fails to provide information for some items
    • Applies a skip sequence wrongly
    • Two pages get stuck together


Types of missing data patterns (1)
• Three patterns are generally recognized:
  • Missing Completely at Random (MCAR)
  • Missing at Random (MAR)
  • Missing Not at Random (MNAR or NMAR)


Types of missing data patterns (2)
• Missing Completely at Random (MCAR)
  • The probability of a data value being missing is independent of all observed and non-observed data.
  • Missing data is a random sample of all data
  • Observed data gives an unbiased estimate of the results from the total data
  • Complete-case (listwise deletion) methods work fine
  • Can check for MCAR by comparing cases with and without missing data (systematic differences argue against MCAR)
• Example
  • Biosamples collected for genotyping
  • Some results are missing because the instrument failed for one batch of samples


Types of missing data patterns (3)
• Missing at Random (MAR)
  • The probability of a data value being missing is related to observed data but not to non-observed data.
  • Can be analyzed using Multiple Imputation methods or likelihood-based methods
• Example
  • Looking at prognostic value of SNPs for sub-types of breast cancer
  • Eligible subjects with advanced stage breast cancer (III/IV) were more likely to be missing SNP information
    • Subjects with advanced disease are less cooperative with the study.
  • Conditional on disease stage, the probability of missing the SNP is unrelated to the value of the SNP.


Types of missing data patterns (4)
• Missing Not at Random (MNAR or NMAR)
  • The probability of a data value being missing is related to the unobserved values.
    • e.g. high values are more likely to be missing than low values
  • Can be analyzed using Multiple Imputation methods or likelihood-based methods
    • Much more complex to use
    • Requires modeling the process yielding the missing values
• Example
  • Looking at a study which requires measurement of tumor size.
  • Smaller tumors are less likely to have size recorded
    • Harder to measure size of small tumors
    • Requires more complex methods (e.g. MRI or PET scanning).
  • Probability of size being missing relates to the size of the tumor
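A small simulation can make the three mechanisms concrete. The sketch below is an illustration under assumed settings (a binary, always-observed covariate `z` standing in for something like disease stage, a continuous variable `x` with true mean 0.5, and arbitrary missingness probabilities); none of these numbers come from the slides.

```python
import random
from statistics import mean

rng = random.Random(2014)  # fixed seed so the run is repeatable
N = 40_000

# Hypothetical data: z is always observed; x may go missing.
records = []
for _ in range(N):
    z = 1 if rng.random() < 0.5 else 0
    x = z + rng.gauss(0.0, 1.0)
    records.append((z, x))

def complete_case_mean(prob_missing):
    """Mean of x among the records that stay observed when each record
    is dropped with probability prob_missing(z, x)."""
    kept = [x for z, x in records if rng.random() >= prob_missing(z, x)]
    return mean(kept)

true_mean = mean(x for _, x in records)
# MCAR: missingness ignores both z and x.
mcar_mean = complete_case_mean(lambda z, x: 0.3)
# MAR: missingness depends only on the observed covariate z.
mar_mean = complete_case_mean(lambda z, x: 0.6 if z == 1 else 0.1)
# MNAR: missingness depends on the value of x itself.
mnar_mean = complete_case_mean(lambda z, x: 0.6 if x > 0.5 else 0.1)

print(f"true mean of x:          {true_mean:.3f}")
print(f"complete cases, MCAR:    {mcar_mean:.3f}")
print(f"complete cases, MAR:     {mar_mean:.3f}")
print(f"complete cases, MNAR:    {mnar_mean:.3f}")
```

Under these assumptions the complete-case mean stays close to the truth only for MCAR. Under MAR the bias can be removed by conditioning on `z`; under MNAR it cannot be removed without modeling the missingness process.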


Another classification of patterns
• Univariate missing data
  • Data are missing on only one variable in the analysis set
• Monotonic missing data
  • You can rearrange the data so the following is true:
    • If a subject is missing data on variable ‘i’, then they are missing data on all variables after that
  • Example: longitudinal study with drop-outs.
• Arbitrary missing data
  • Doesn’t meet the above conditions.
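Whether a data set has a monotone pattern can be checked mechanically. The sketch below is one possible check, assuming missing values are coded as `None`; the function name and the toy data are hypothetical.

```python
def is_monotone(rows):
    """Return True if the columns can be reordered so that, within every row,
    once a value is missing (None) all later values are missing too."""
    if not rows:
        return True
    n_cols = len(rows[0])
    # Candidate ordering: fewest missing values first. If any monotone
    # ordering exists, this one works, because the sets of subjects
    # missing each variable must then be nested.
    order = sorted(range(n_cols),
                   key=lambda j: sum(row[j] is None for row in rows))
    for row in rows:
        seen_missing = False
        for j in order:
            if row[j] is None:
                seen_missing = True
            elif seen_missing:
                return False  # an observed value follows a missing one
    return True

dropout = [
    [1.2, 3.4, 5.6],    # completed all three visits
    [0.9, 2.8, None],   # dropped out after visit 2
    [1.5, None, None],  # dropped out after visit 1
]
arbitrary = [
    [1.2, None, 5.6],
    [None, 2.8, 4.1],
    [1.5, 3.3, None],
]

print(is_monotone(dropout))    # drop-out data: monotone
print(is_monotone(arbitrary))  # scattered gaps: not monotone
```

Univariate missingness is a special case: with gaps in a single variable, that variable simply goes last in the ordering.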


Ignorability
• And now, some confusing terminology
• Rubin introduced the term ‘ignorability’
• If data is MCAR or MAR, then the mechanism which produces the missing data is not important and can be ignored in analysis.
  • He called this ‘Ignorability’
• This does not mean that the missing data can be ignored!


Missing data in the literature (1)
• Peng et al (2006)
  • Education & psychology journals
  • 36% had no missing data
  • 48% had missing data
  • 16% were unclear
  • 97% used listwise deletion or pairwise deletion methods.


Missing data in the literature (2)
• Klebanoff & Cole (2008)
  • Looked at the use of multiple imputation methods
  • 2 years of articles from Amer J Epidem, Annals Epi, Epidemiology & Int J Epidem
  • 1,105 original research articles
  • 16 papers (1.4%) used one of
    • Multiple Imputation (n=12)
    • Inverse probability weighting
    • Expectation-maximization (EM) algorithm
  • 99 papers contained the text string ‘imput’


Missing data in the literature (3)
• Desai et al (2011)
  • Focused on molecular epidemiology studies in Cancer Epidemiology, Biomarkers and Prevention
  • 15 month period (2009-2010)
  • 278 eligible articles
  • 95% either had missing data or excluded cases with missing data
  • Only 23 papers (13%) used missing data methods for analysis
    • 9 dealt with ‘assays below detection limit’
    • Single imputation
    • 7 used ‘missing data indicators’
  • 26 (14%) reported differences between subjects with and without missing data.


All articles (n=278)
  Had missing data (n=184, 66%)
    Used CC only (n=161, 88%)
    Missing data methods (n=23, 12%)
      Beyond limits of detection (n=9)
      Single imputation (n=7)
      Missing value indicators (n=7)
  Required data for eligibility (n=81, 29%)
  Population defined by biomarker (n=11, 4%)
  Just nothing missing (n=2)


Methods to handle missing data (1)
• Need to decide on a model for missing data
  • MCAR
  • MAR
  • MNAR
    • If MNAR, how is the missingness related to the unobserved value?
• Set a statistical model for the full data
  • Commonly assumed to be multivariate normal
    • Limiting, especially for categorical data
  • Some other form


Methods to handle missing data (2)
• Complete Case (Listwise deletion)
• Pairwise deletion (e.g. PROC CORR)
• Corrected complete case method
  • Weighted regression model with complete cases
  • Weights related to inverse of probability that a case is complete
• Fill the contingency table
  • Allocate subjects with missing values of a row/column to cells in proportion to the complete cases.
• Replacement with the frequency or mean of complete cases
  • For categorical variables, create multiple variables (one per level)
  • Impute the percent of the group at each level
• Indicator variable for missing data


Methods to handle missing data (3)
• Simple/Single imputation
• Multiple imputation
• Full MLE methods
  • SAS can use FIML (Full Information Maximum Likelihood)
    • Assumes multivariate normality and MAR
    • Linked to Structural Equation Modeling (PROC CALIS)
• Reweighting estimation equations
  • Used in complex survey studies
  • Sample weights are adjusted to reflect missing data patterns.


Complete Case (listwise deletion)
• Subjects missing any values for any variable included in the analysis or model are excluded.
• Most commonly used method (‘the default’)
• Usually used without any thought to missing data patterns, etc.
• Acceptable if data is MCAR
• Leads to loss of sample size and reduced power/precision
• Often produces reasonable results
  • Especially if the amount of missing data is small
• Can be strongly biased if data is MAR
• Methodological results from
  • multiple papers and
  • theory


Pairwise deletion
• Similar to listwise (casewise) deletion BUT only subjects with missing data for variables involved in the specific analysis are subject to exclusion.
• Consider a case where x1 is missing some data but x2 and x3 are complete. Suppose the analysis looks at these two models:
  • Model 1: Y = B2 * x2 + B3 * x3
  • Model 2: Y = B1 * x1 + B2 * x2 + B3 * x3
• In the complete case method, subjects missing x1 will be excluded for both models.
• Pairwise deletion:
  • All subjects would be used in model 1;
  • Some cases would be excluded in model 2.
• Leads to different sub-sets being used for different analyses
  • Complicates interpretation.
• PROC CORR in SAS uses this approach
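The difference in analysis sets can be seen by counting usable subjects per model. A minimal sketch follows, with made-up records in which only x1 has gaps; the helper name `usable_rows` is an assumption for illustration.

```python
# Hypothetical records: x1 has gaps, while y, x2 and x3 are complete.
data = [
    {"y": 3.1, "x1": 0.5, "x2": 1.0, "x3": 2.0},
    {"y": 2.7, "x1": None, "x2": 0.8, "x3": 1.9},
    {"y": 4.0, "x1": 0.9, "x2": 1.4, "x3": 2.2},
    {"y": 3.6, "x1": None, "x2": 1.1, "x3": 2.5},
    {"y": 2.9, "x1": 0.4, "x2": 0.7, "x3": 1.8},
]

def usable_rows(rows, variables):
    """Rows with no missing value among the named variables."""
    return [r for r in rows if all(r[v] is not None for v in variables)]

model_1 = ["y", "x2", "x3"]
model_2 = ["y", "x1", "x2", "x3"]
all_vars = sorted(set(model_1) | set(model_2))

# Complete-case (listwise): one fixed subset, defined by every variable in play.
listwise = usable_rows(data, all_vars)
n_listwise = {"model 1": len(listwise), "model 2": len(listwise)}

# Pairwise-style deletion: each model keeps whoever has its own variables.
n_pairwise = {
    "model 1": len(usable_rows(data, model_1)),
    "model 2": len(usable_rows(data, model_2)),
}

print("listwise:", n_listwise)
print("pairwise:", n_pairwise)
```

Here listwise deletion analyses 3 subjects in both models, while pairwise deletion uses 5 subjects for model 1 and 3 for model 2, so the two models no longer describe the same group.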


Corrected Complete Case Method
• Subjects missing any values for any variable included in the analysis or model are excluded.
• Regression models use weighted regression.
• Weights are computed to reflect the inverse of the probability that a subject will have complete data.
• Works OK if data is MAR but can be seriously biased if not true.
• Figuring out the weights is difficult
• Finding SEs can be difficult
• Results from Vach et al, 1991
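The weighting idea can be illustrated with the simplest possible "regression", an intercept-only model (a weighted mean). The sketch below assumes missingness depends only on an observed stage variable and estimates the completeness probability within each stage; the data and names are hypothetical, and a real analysis would usually model that probability with more covariates.

```python
from collections import defaultdict

# Hypothetical records: (stage, value); None marks a missing value.
records = [
    ("early", 2.0), ("early", 4.0), ("early", 2.0), ("early", 4.0), ("early", None),
    ("late", 8.0), ("late", 10.0), ("late", None), ("late", None), ("late", None),
]

# Step 1: estimate P(complete | stage) from the observed pattern of missingness.
n_total = defaultdict(int)
n_complete = defaultdict(int)
for stage, value in records:
    n_total[stage] += 1
    if value is not None:
        n_complete[stage] += 1
p_complete = {s: n_complete[s] / n_total[s] for s in n_total}

# Step 2: weight each complete case by the inverse of that probability.
complete = [(stage, value) for stage, value in records if value is not None]
weights = [1.0 / p_complete[stage] for stage, _ in complete]

unweighted_mean = sum(v for _, v in complete) / len(complete)
weighted_mean = sum(w * v for w, (_, v) in zip(weights, complete)) / sum(weights)

print(f"complete-case mean: {unweighted_mean:.2f}")
print(f"weighted mean:      {weighted_mean:.2f}")
```

Late-stage subjects are under-represented among the complete cases, so each one is weighted up (2.5 versus 1.25); the weighted mean of 6.0 matches what the stage-specific means (3 and 9) imply for a 50:50 stage mix, whereas the unweighted complete-case mean is 5.0.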


Fill the Contingency Table
• Under MAR, the distribution of subjects with missing data across the 4 cells in a contingency table is the same as the distribution of the complete cases.
• Modify the contingency table by allocating ‘counts’ of missing subjects to the table
• Similar to the ‘corrected complete case’ method.
• Leads to non-integer counts in the cells
• Computing variance is tricky because standard formulae don’t work
• Logistic Regression needs integer counts in the tables.
• Results from Vach et al, 1991
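As a rough illustration of the allocation step, the sketch below spreads subjects with a known outcome but missing exposure across the exposure rows in proportion to the complete cases. The counts and function names are invented for the example.

```python
# Hypothetical complete-case counts: rows are exposure (+, -),
# columns are (cases, controls).
complete = [[60.0, 30.0],
            [40.0, 70.0]]
# Hypothetical subjects whose outcome is known but whose exposure is missing.
missing_exposure = [7.0, 5.0]  # (cases, controls)

def fill_table(table, missing_by_column):
    """Spread each column's missing-exposure subjects over that column's
    cells in proportion to the complete-case counts."""
    filled = [row[:] for row in table]
    for col, n_missing in enumerate(missing_by_column):
        col_total = sum(row[col] for row in table)
        for r, row in enumerate(table):
            filled[r][col] += n_missing * row[col] / col_total
    return filled

def odds_ratio(t):
    """Odds ratio of a 2x2 table laid out as [[a, b], [c, d]]."""
    return (t[0][0] * t[1][1]) / (t[0][1] * t[1][0])

filled = fill_table(complete, missing_exposure)
print(filled)
print(f"OR before: {odds_ratio(complete):.2f}, after: {odds_ratio(filled):.2f}")
```

The filled cells are fractional (64.2, 31.5, 42.8, 73.5), which is the non-integer-count problem noted above. Proportional allocation leaves the odds ratio unchanged at 3.5, so any gain is in the apparent sample size, and that is why standard variance formulae no longer apply.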


Replacement with the frequency or mean of complete cases
• Really a type of single imputation
• For each subject with missing data, replace the missing value by the mean of the complete cases
• For categorical data, define indicator variables
  • 0/1 if there is valid data
  • If data is missing, use the proportion of the complete cases with that level of the variable.
  • Leads to indicator variables which have non-integer components.
• Strongly biased method, even with MAR
  • More biased than Complete Case method
  • Henry et al, 2013


Indicator variable for missing data
• Treat ‘missing values’ as if they are a valid response to the questionnaire
• Assign them a code value
• Example (Do you drink alcohol?):
  • Yes: 1
  • No: 2
  • Missing: 3
• Analysis is done using three levels
  • 2 dummy variables
• This is a very bad method which is strongly biased.


Indicator variable for missing data
• Commonly used and commonly taught in epidemiology courses.
• Studied by multiple authors (Vach, Greenland)
• Very strongly biased in every study, including theoretical analyses
• Consider two situations:
  • Variable is the main effect of interest:


Full population data

            Cases   Controls
Exp +ve      140       60
Exp -ve       60      140
Total        200      200

OR = 5.44

Now, assume 30% of data is missing, MCAR. Define the ‘missing data’ indicator variable.

            Cases   Controls
Exp +ve       98       42
Exp -ve       42       98
Missing       60       60
Total        200      200

What is the OR of Exp +ve to Exp -ve? It is still 5.44 = (98 × 98) / (42 × 42).


Confounding example (1)
• So, we gain nothing by defining the missing category.
• But, suppose the missing data is in a confounder.
• Here is the population data. Crude table is as before (OR = 5.44):

Level 1
            Cases   Controls
Exp +ve       50       10
Exp -ve       50       90
Total        100      100
OR = 9.0

Level 2
            Cases   Controls
Exp +ve       90       50
Exp -ve       10       50
Total        100      100
OR = 9.0

Adjusted OR would be 9.0: strong confounding.


Confounding example (2)
• Now, 30% of the data on the confounder is missing. We create the missing value indicator level.
• Means we now have three 2x2 tables for our confounding analysis.

Level 1
            Cases   Controls
Exp +ve       35        7
Exp -ve       35       63
Total         70       70
OR = 9.0

Level 2
            Cases   Controls
Exp +ve       63       35
Exp -ve        7       35
Total         70       70
OR = 9.0

Level 3: Missing
            Cases   Controls
Exp +ve       42       18
Exp -ve       18       42
Total         60       60
OR = 5.44


Confounding example (3)
• When there is no missing data, the ORs are as follows. Clearly, there is confounding, with the adjusted OR being 9.0.
• When we have the missing indicator in the data, the adjusted OR is not 9.0 but around 8. Very strongly biased.

               No missing data   With missing indicator
Stratum 1           9.0                  9.0
Stratum 2           9.0                  9.0
Stratum 3           N/A                  5.44
Crude               5.44                 5.44
Adjusted OR         9.0                  around 8.0
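The pull of the ‘missing’ stratum toward the crude OR can be reproduced numerically. The slides do not say which adjusted estimator was used; the sketch below assumes a Mantel-Haenszel summary OR, and uses stratum counts arranged so that each fully observed stratum has OR = 9.0 and the strata collapse to the crude table (140, 60, 60, 140).

```python
def odds_ratio(a, b, c, d):
    """OR for a 2x2 table: a, b = exposed cases, exposed controls;
    c, d = unexposed cases, unexposed controls."""
    return (a * d) / (b * c)

def mantel_haenszel_or(strata):
    """Mantel-Haenszel summary odds ratio over a list of (a, b, c, d) tables."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# Confounder fully observed: two strata, each with OR = 9.0.
full = [(50, 10, 50, 90), (90, 50, 10, 50)]

# 30% of confounder values missing completely at random, with those
# subjects placed in a third 'missing' stratum.
with_indicator = [(35, 7, 35, 63), (63, 35, 7, 35), (42, 18, 18, 42)]

crude = odds_ratio(140, 60, 60, 140)
adjusted_full = mantel_haenszel_or(full)
adjusted_indicator = mantel_haenszel_or(with_indicator)
adjusted_complete_case = mantel_haenszel_or(with_indicator[:2])

print(f"crude OR:                       {crude:.2f}")
print(f"adjusted, no missing data:      {adjusted_full:.2f}")
print(f"adjusted, missing indicator:    {adjusted_indicator:.2f}")
print(f"adjusted, complete cases only:  {adjusted_complete_case:.2f}")
```

Under this assumed estimator the missing-indicator analysis gives roughly 7.5, between the crude 5.44 and the true 9.0, while simply dropping the subjects with a missing confounder recovers 9.0 because the data are MCAR.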


Indicator Variable for Missing Data
• This method has no role in handling missing data
• Is strongly biased, even with MCAR data.
• One core requirement for any method to address missing data is that it gives the ‘right’ answer for MCAR data.


Single Imputation (1)
• Replace a missing value with an estimate of what the value should have been
• Various methods are possible
  • Overall mean
  • Group-specific mean
  • Last observation carried forward (in follow-up studies)
  • An extreme value (e.g. missing = heavy alcohol use)
  • Regression modeling
    • Works best with monotonic missing data.
    • To impute Yj, regress Yj on Y1 to Yj-1 for all subjects with valid data for Yj
    • This gives a group of Betas with SEs.
    • Select a value of each beta at random from the distributions.
      • For single imputation, you often use the actual estimated Beta values
    • Use the regression equation to estimate the mean value of Yj for subjects with missing data.
  • Hot-deck imputation
  • MCMC methods
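The regression-modeling option can be sketched with a single predictor. The example below is a deterministic version (it plugs in the estimated coefficients rather than drawing them at random) on made-up follow-up data; `fit_line` is a hypothetical helper, not a library function.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical follow-up data: y1 is always observed, y2 is missing for some subjects.
y1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y2 = [3.1, 4.9, 7.2, None, 10.8, None]

# Fit the imputation model on subjects with a valid y2.
observed = [(a, b) for a, b in zip(y1, y2) if b is not None]
intercept, slope = fit_line([a for a, _ in observed], [b for _, b in observed])

# Single imputation: replace each missing y2 with its fitted mean.
y2_imputed = [b if b is not None else intercept + slope * a
              for a, b in zip(y1, y2)]

print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
print([round(v, 2) for v in y2_imputed])
```

Because every imputed value sits exactly on the fitted line, the filled-in data look less variable than real data would, which is one reason valid variance estimates are hard to obtain from a single imputation.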


Single Imputation (2)
• Hard to generate valid variance estimates
• Greenland found regression-based single imputation to be subject to serious errors in the face of mis-specified models.


Multiple Imputation (1)
• MI handles missing data in three steps:
  • Impute missing data ‘m’ times to produce ‘m’ complete data sets;
  • Analyze each data set using a standard statistical procedure;
  • Combine the ‘m’ results into one using formulae from Rubin (1987) or Schafer (1997).
• Most MI methods assume
  • MAR
  • Multivariate normality
• If the assumptions are met, and if these three steps are done correctly, multiple imputation produces estimates that have nearly optimal statistical properties. They are:
  • Consistent (and, hence, approximately unbiased in large samples),
  • Asymptotically efficient (almost), and
  • Asymptotically normal.
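The combining step is usually done with Rubin's rules: the pooled estimate is the average of the m estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. A minimal sketch follows; the input numbers are invented and the function name is an assumption.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Combine m point estimates and their squared standard errors."""
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (m - 1 divisor)
    total = within + (1 + 1 / m) * between
    return q_bar, total

# Hypothetical log odds ratios and squared SEs from m = 5 imputed data sets.
estimates = [0.50, 0.55, 0.45, 0.60, 0.40]
variances = [0.040, 0.042, 0.038, 0.041, 0.039]

pooled, total_var = pool_rubin(estimates, variances)
print(f"pooled estimate = {pooled:.3f}, pooled SE = {total_var ** 0.5:.3f}")
```

The between-imputation term is what single imputation leaves out: it carries the extra uncertainty due to not knowing the missing values.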


Multiple Imputation (2)
• One common method uses regression models in step #1
• Three kinds of variables are included in an imputation regression model:
  • Variables that are of theoretical interest,
  • Variables that are associated with the missing mechanism, &
  • Variables that are correlated with the variables with missing data.
• Consider adding interaction terms for continuous variables.
• Bayesian ideas can be used in step #1
  • Regression based
    • Set a prior distribution for the regression parameters and error term
    • Fit model to generate posterior distribution
    • Select at random from posterior distribution to generate several imputation equations


Multiple Imputation (2, continued)
• Bayesian ideas can be used in step #1
  • MCMC (Markov Chain Monte Carlo)
    • Divide sample into subsets with the same missing data for variables
      • e.g. Group #1: missing x1 & x2; Group #2: missing x1, x3 & x4
    • Fit regression models within each pattern of missingness
    • Impute using these models
    • Uses full data set to update means, variances and covariances
    • Make a random selection from the posterior distribution of these parameters
    • Update the regression models
    • Repeat
  • FCS (Fully Conditional Specification)
    • Similar to above but handles categorical data better
    • No strong theoretical justification


Multiple Imputation (3)
• Most MI models assume variables are multivariate normal
  • Issues arise with categorical variables
  • Can treat as continuous and then round to generate a suitable categorical value
    • Round based on the normal approximation to the binomial distribution
• Most studies find MI methods to be the most valid of missing data methods
• Some issues/questions
  • How many replicates (multiples) to include?
  • What variables to include in model?
  • How to handle non-normal variables, including categorical variables?
  • Software limitations


Full MLE methods (1)
• Suppose we have a data set and we want to fit a regression model (could be linear, logistic, etc.)
• With no missing data, we use Maximum Likelihood methods
• n observations on k variables:
• Based on regression model assumptions, the likelihood of the data can be given as:
  • θ is the set of parameters to estimate
  • We find the values of θ to make ‘L’ as big as possible
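For reference, the likelihood being described is presumably the usual product over independent subjects. In assumed notation (the density symbol f_i and the subscripts y_{ij}, the value of variable j for subject i, are not taken from the slide), a standard form is:

```latex
L(\theta \mid Y) = \prod_{i=1}^{n} f_i(y_{i1}, y_{i2}, \ldots, y_{ik};\, \theta)
```

Here f_i is the joint probability (or density) of subject i's k observed values under the model, and the maximum likelihood estimate is the θ that maximizes this product.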


Full MLE methods (2)
• What if we have some missing data?
• Suppose y1 & y2 have missing data which is MAR
• For a subject with missing values, we cannot generate the likelihood contribution since we don’t know y1 & y2
• Instead, consider all possible values which they might have, combined with the probability of those values.
• Add up the likelihood contribution for every possible value:
• Substitute this into the MLE equation and estimate ‘θ’
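A sketch of what "adding up over every possible value" looks like, in assumed notation: y_{ij} is the value of variable j for subject i, f_i is the joint probability of a subject's values under the model, and the first m of the n subjects are taken to be complete while the rest are missing y1 and y2. None of these symbols come from the slide itself.

```latex
f_i^{*}(y_{i3}, \ldots, y_{ik};\, \theta)
  = \sum_{y_1} \sum_{y_2} f_i(y_1, y_2, y_{i3}, \ldots, y_{ik};\, \theta)

L(\theta \mid Y_{\text{obs}})
  = \prod_{i=1}^{m} f_i(y_{i1}, \ldots, y_{ik};\, \theta)
    \prod_{i=m+1}^{n} f_i^{*}(y_{i3}, \ldots, y_{ik};\, \theta)
```

For continuous y1 and y2 the sums become integrals. Subjects with missing values still contribute through their observed variables, which is why this approach uses more information than complete-case analysis.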


Full MLE methods (3)
• FIML is one way to do this in SAS
  • Part of PROC CALIS
  • Assumes multivariate normality
• MPlus
  • Software which can handle non-linear models
  • Can use various regression models
    • Logistic
    • Poisson
    • Tobit
    • Cox
    • Etc.


Reweighting estimation equations
• Discussed by Henry et al (2013)
• Applies to complex surveys
  • Differential probability of selection from target population
  • Analysis requires ‘weights’ to adjust for this
  • Standard weights are proportional to the inverse of the probability of selection
• With missing data, complete case analysis leads to different subsets for each set of variables
  • Weights are incorrect
• Adjust each weight to account for probability of being a complete case
• Do analysis using new weights and complete cases only
• Henry shows it produces very good estimates
• Limited area of application


Summary
• Missing data can be very important
• More than 5-10% of data missing is considered a potential source of serious bias
• Need to consider the model which produces the missing data
• ‘Ad hoc’ methods are poor and should not be used
• Multiple Imputation or Full MLE methods give excellent results in most situations
• If missing data is MNAR, need to consider the model which gives rise to the missing data
  • If missingness is strongly related to value of variable, problem is complex


One suggested approach (1) §
• Describe target population
• Clearly describe derivation of analytic data set
• Describe population characteristics of analytic data set, including missing values
• Describe differences in population characteristics for subjects with valid and missing data for key variables

§ adapted from Desai et al, Cancer Epidemiol Biomarkers Prev; 20(8), 2011


One suggested approach (2)
• Investigate possible assumptions for missing data
  • Assume MCAR if
    • no data to suggest it is violated &
    • no mechanism to generate MNAR
  • Assume MAR if
    • MCAR is not acceptable,
    • no mechanism to generate MNAR &
    • candidate ancillary variables exist
  • Assume MNAR if
    • a priori knowledge exists that missing data are related to unknown values
• Conduct a CC analysis


One suggested approach (3)
• Choose an additional analysis as appropriate
  • For MAR,
    • Use Multiple Imputation with suitable ancillary variables
  • For MNAR,
    • Use Multiple Imputation,
    • Need to model the mechanism which generated the missing data.
  • If a variable is limited by sensitivity of a lab detection device,
    • Use a likelihood-based method
• Implement the additional analysis
  • Include all potential ancillary variables
  • Use SAS if you can postulate a joint distribution for ancillary variables
  • Use STATA or R (fully conditional method).


One suggested approach (4)
• Perform sensitivity analyses
  • Do both CC & MI
  • Use different subsets of ancillary variables for MI
  • Use different models for MNAR missing generation
• Interpret the results
  • If all analyses give same results, this is easy
  • If they differ, need to present a more complex result in the paper.
