overview of missing value analysis
DESCRIPTION
This presentation was given at the UCSD/VA Medical Center San Diego Addictions Seminar with the intent to orient the research groups to the options that modern missing value analysis afforded them and determine what type of support each group would prefer.
TRANSCRIPT
Statistical Analysis with Missing Data:
A Survey of Options
Kevin Cummins
Addictions Research Seminar
October 18, 2006
2
© David Farley
3
Objectives
• Introduce the main concepts with missing value analysis
• Provide some tentative guidance dealing with missing values (MV)
• Develop a discussion of MV’s impact on the interpretation of our research findings
• Identify where we should prioritize our development of MV analysis tools
4
Outline
• Introduction
– Objective
– The Problem
• Getting Parameter Estimates
– Complete Case
– Imputation
– Maximum Likelihood
• Comparison of Approaches
• Hypothesis Testing
S = Number of Slides = 33
5
Problems with Missing Data
• Bias Potential
• Analytical Hurdles
• Loss of Power
6
Missing Value Pattern Example
7
Missing Value Mechanisms (MVM)
• Missing by Necessity (NA)
• Missing Completely at Random (MCAR)
Missingness does not depend on the data values, observed or missing
• Missing at Random (MAR)
Missingness may depend on observed values, but not on the missing values themselves
• Not Missing at Random (NMAR)
Missingness depends on the unobserved values of the variables with missings
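The three mechanisms can be made concrete with a small simulation. This is only a sketch: the variables `x` and `y`, the missingness probabilities, and the function names are illustrative assumptions, not part of the original talk. It blanks out `y` under each mechanism and compares the mean of the remaining values with the full-data mean.

```python
import random

def simulate_missingness(n=10_000, seed=0):
    """Generate (x, y) pairs and flag y as missing under three hypothetical mechanisms."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = 0.8 * x + rng.gauss(0, 0.6)  # y is correlated with x
        u = rng.random()
        rows.append({
            "x": x,
            "y": y,
            # MCAR: every y has the same 30% chance of being missing
            "miss_mcar": u < 0.3,
            # MAR: the chance of missing y depends only on the observed x
            "miss_mar": u < (0.5 if x > 0 else 0.1),
            # NMAR: the chance of missing y depends on the unobserved y itself
            "miss_nmar": u < (0.5 if y > 0 else 0.1),
        })
    return rows

def observed_mean(rows, flag):
    """Mean of y over the cases where y is still observed under the given mechanism."""
    kept = [r["y"] for r in rows if not r[flag]]
    return sum(kept) / len(kept)

rows = simulate_missingness()
full_mean = sum(r["y"] for r in rows) / len(rows)
means = {flag: observed_mean(rows, flag)
         for flag in ("miss_mcar", "miss_mar", "miss_nmar")}
```

Under these assumed settings the MCAR mean stays close to the full-data mean, while the MAR and NMAR means are pulled downward. The difference between the last two is that the MAR bias can in principle be corrected using `x`, because `x` is observed.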
8
Graphical Examples of MVM
[Figure: diagrams of the missing value mechanisms, relating the variables Y, X, and Z to the missingness of Y (Ymis)]
9
Why MVM Assumptions are Crucial
• Missing data methods depend very strongly on the nature of the dependencies in these mechanisms
10
What Can You Do?
• Complete Case Analyses
• Weighting Procedures
• Imputation-Based Procedures
• Model-Based Procedures
11
Complete Case Analysis
• Benchmark
+ Simple & easy
+ Often satisfactory
+ Direct comparability among variables
- Inefficient
- Can lead to bias, unless MCAR
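A minimal sketch of complete case analysis (listwise deletion) shows where the inefficiency comes from. The records and field names below are made-up toy data used only for illustration.

```python
def complete_cases(records):
    """Keep only the records with no missing (None) values: listwise deletion."""
    return [r for r in records if all(v is not None for v in r.values())]

# Hypothetical toy data: each variable is observed for 4 of the 5 cases
records = [
    {"age": 34, "drinks_per_week": 12, "score": 7.1},
    {"age": 51, "drinks_per_week": None, "score": 5.4},
    {"age": 29, "drinks_per_week": 3, "score": None},
    {"age": 46, "drinks_per_week": 8, "score": 6.0},
    {"age": None, "drinks_per_week": 5, "score": 6.6},
]

kept = complete_cases(records)
fraction_lost = 1 - len(kept) / len(records)
```

Each variable here is 80% observed, yet only two of the five cases survive, because a single missing value removes the whole record.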
12
Complete Case Analysis
• Can be improved under some designs by using weights (Little & Rubin 2002)
• Can be improved by dropping variables with many missing values
13
Single Imputation
Imputed observation: a calculated value used in place of a missing value
Explicit Model Imputation
Mean imputation
Regression imputation
Stochastic regression imputation
Implicit Model Imputation
Hot deck Imputation
Substitution Imputation
Cold deck Imputation
Composite Approaches
14
Single Imputation
Explicit Model Imputation (mean, regression, and stochastic regression imputation)
Formal statistical model created to describe the distribution of missing values. Explicit assumptions.
Implicit Model Imputation (hot deck, substitution, and cold deck imputation)
No formal model. An algorithm is created for selecting and assigning imputed values. Implicit assumptions.
15
“The idea of imputation is both seductive and dangerous. It is seductive because it can lull the user into the pleasurable state of believing that the data are complete after all, and it is dangerous because it lumps together situations where [its application is legitimate and where it creates serious biases]”
Dempster and Rubin 1983
16
Single Imputation: Mean Imputation
• Replace the missing value with the variable’s mean
- Severe bias is possible
- Covariance matrices will be attenuated
• Some rectification is possible with conditional mean imputation
+/- An improvement on a bad option
- Not generally recommended
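The attenuation can be seen in a short sketch. The data are simulated under an assumed MCAR mechanism with 40% missingness, and all names and settings are illustrative.

```python
import random
import statistics

rng = random.Random(1)
y = [rng.gauss(50, 10) for _ in range(2000)]
# Hypothetical MCAR mechanism: each value is missing with probability 0.4
y_obs = [v if rng.random() > 0.4 else None for v in y]

observed = [v for v in y_obs if v is not None]
mean_obs = statistics.fmean(observed)

# Mean imputation: every missing value is replaced by the observed mean
y_imputed = [mean_obs if v is None else v for v in y_obs]

var_observed = statistics.variance(observed)
var_imputed = statistics.variance(y_imputed)
```

The imputed values add cases but no spread, so `var_imputed` shrinks roughly in proportion to the fraction of observed cases. Covariances with other variables shrink for the same reason.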
17
Single Imputation: Regression Imputation
• Replace the missing value with the expected value from a regression. The regression models the variable with missing values using the other variables as predictors.
- Substantial bias, especially in variance estimates (so correlations are affected)
- Valid only with monotone missing-data patterns
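A sketch of why correlations are affected: imputed points fall exactly on the fitted line, so the completed data look more strongly related than the observed data. The simulated variables, the MCAR mechanism, and the helper names are assumptions made for illustration.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def corr(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(2)
x = [rng.gauss(0, 1) for _ in range(3000)]
y = [1.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x]
missing = [rng.random() < 0.5 for _ in x]  # hypothetical MCAR mechanism on y

# Fit the regression on the complete cases only
x_cc = [xi for xi, m in zip(x, missing) if not m]
y_cc = [yi for yi, m in zip(y, missing) if not m]
a, b = fit_line(x_cc, y_cc)

# Deterministic regression imputation: fill in the fitted value a + b*x
y_reg = [a + b * xi if m else yi for xi, yi, m in zip(x, y, missing)]

corr_complete = corr(x_cc, y_cc)
corr_imputed = corr(x, y_reg)
```

In this setup `corr_imputed` comes out noticeably larger than `corr_complete`, and the variance of `y_reg` is smaller than the variance of the observed `y`.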
18
Single Imputation: Stochastic Regression Imputation
• Replace the missing value with a draw from a regression model. The imputed value is not the expected value but a random observation generated from the model (including the stochastic/error term).
+ Reduced bias, better variance estimates
+ Can be recommended at times
- Adding the stochastic term can reduce efficiency
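The effect of the error term can be sketched by imputing the same simulated data twice, once with fitted values only and once with fitted values plus a normal residual draw. The data-generating model, the MCAR mechanism, and the normal residual assumption are illustrative choices.

```python
import random
import statistics

rng = random.Random(3)
n = 3000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [1.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x]
missing = [rng.random() < 0.5 for _ in range(n)]  # hypothetical MCAR mechanism on y

x_cc = [xi for xi, m in zip(x, missing) if not m]
y_cc = [yi for yi, m in zip(y, missing) if not m]

# Ordinary least squares on the complete cases
mx, my = statistics.fmean(x_cc), statistics.fmean(y_cc)
b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x_cc, y_cc))
     / sum((xi - mx) ** 2 for xi in x_cc))
a = my - b * mx
residuals = [yi - (a + b * xi) for xi, yi in zip(x_cc, y_cc)]
resid_sd = (sum(r ** 2 for r in residuals) / (len(residuals) - 2)) ** 0.5

# Regression imputation: fitted value only
y_reg = [a + b * xi if m else yi for xi, yi, m in zip(x, y, missing)]
# Stochastic regression imputation: fitted value plus a random error draw
y_sto = [a + b * xi + rng.gauss(0, resid_sd) if m else yi
         for xi, yi, m in zip(x, y, missing)]

var_complete = statistics.variance(y_cc)
var_reg = statistics.variance(y_reg)
var_sto = statistics.variance(y_sto)
```

With these settings `var_reg` is clearly too small, while `var_sto` is close to the variance of the observed values. The price is extra random noise in any estimate computed from the completed data.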
19
Single Imputation: Hot Deck Imputation
• Replace missings with values from similar sampling units
- Unbiased only under MCAR
- Inefficient estimators
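One simple variant, sketched here under assumptions, defines "similar" as sharing the same value of a matching variable and picks a random donor within that cell. The function name, field names, and toy records are hypothetical; real hot deck procedures often use more elaborate matching.

```python
import random

def hot_deck_impute(records, target, match_on, seed=0):
    """Fill missing `target` values with the value of a random donor in the same cell."""
    rng = random.Random(seed)
    # Build the "deck": observed target values grouped by the matching variable
    donors = {}
    for r in records:
        if r[target] is not None:
            donors.setdefault(r[match_on], []).append(r[target])
    completed = []
    for r in records:
        filled = dict(r)  # copy so the input records are left untouched
        if filled[target] is None and donors.get(filled[match_on]):
            filled[target] = rng.choice(donors[filled[match_on]])
        completed.append(filled)
    return completed

# Hypothetical toy data
records = [
    {"site": "A", "score": 4.0},
    {"site": "A", "score": 5.0},
    {"site": "A", "score": None},
    {"site": "B", "score": 9.0},
    {"site": "B", "score": None},
]
completed = hot_deck_impute(records, target="score", match_on="site")
```

Cases with no donor in their cell are left missing in this sketch.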
20
Single Imputation: Cold Deck Imputation
• Replace missings with values from a source outside the current analysis’s data.
- Theory for cold deck is either lacking or obvious
21
Single Imputation: Composite Approaches
Example: Hot Deck + Regression Imputation (RI) in a Longitudinal Design
1) Find the conditional expectation for the missing value
2) Obtain a hot deck residual
3) Combine the hot deck residual and the expectation to provide the imputed value
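The three steps can be sketched as follows. This assumes a single predictor `x` (for example, an earlier wave) and an outcome `y` (a later wave) with some missing values. The function name, the simple linear regression, and the random choice of donor residual are illustrative assumptions rather than the exact procedure from the talk.

```python
import random
import statistics

def composite_impute(x, y, seed=0):
    """Impute each missing y (None) as a fitted value plus a residual donated by a complete case."""
    rng = random.Random(seed)
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    mx = statistics.fmean(p[0] for p in pairs)
    my = statistics.fmean(p[1] for p in pairs)
    b = (sum((xi - mx) * (yi - my) for xi, yi in pairs)
         / sum((xi - mx) ** 2 for xi, _ in pairs))
    a = my - b * mx
    # The "deck" of residuals observed among the complete cases
    residuals = [yi - (a + b * xi) for xi, yi in pairs]
    # Step 1: conditional expectation a + b*x; step 2: draw a donor residual;
    # step 3: add the two to form the imputed value
    return [a + b * xi + rng.choice(residuals) if yi is None else yi
            for xi, yi in zip(x, y)]

# Hypothetical toy data: x is fully observed, y is missing for two cases
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, None, 8.2, None, 12.0]
completed = composite_impute(x, y)
```

Drawing residuals from observed cases avoids assuming a normal error distribution, which is the main appeal of combining the two methods.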
22
Properties of Imputation
+ Can be more powerful than complete-case analysis
+ Imputation produces completed-case data that can be plugged into standard analyses
- Variances can be biased
- P-values can be overly significant
23
Some Take Homes
• Imputation should be conditional: regression or matched cases
• Multivariate
• Draws from distributions, not expected values
• Use when there are few missings
• Key problem: inferences about parameters based on completed data don’t account for imputation uncertainty (Little & Rubin 2002).
24
Methods Addressing Imputation Uncertainty
• Replication Methods
Jackknife or bootstrap the analysis
- Require large samples (Little and Rubin 2002)
+ Can be easy
• Multiple Imputation
Create multiple imputed data tables; the variability among the completed-data analyses is integrated into the assessment of parameter estimate uncertainty
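For multiple imputation, the standard way to integrate that variability is to pool the per-dataset results with the combining rules usually credited to Rubin. The sketch below shows the pooling step only. The function name and the five estimates and variances are made-up illustrative numbers, not results from any study.

```python
import statistics

def pool_rubin(estimates, variances):
    """Pool m completed-data estimates and their squared standard errors."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)       # pooled point estimate
    within = statistics.fmean(variances)      # average within-imputation variance
    between = statistics.variance(estimates)  # between-imputation variance (m - 1 divisor)
    total = within + (1 + 1 / m) * between    # total variance of the pooled estimate
    return q_bar, total

# Hypothetical estimates and variances from m = 5 imputed data tables
estimates = [2.0, 2.4, 2.2, 1.8, 2.1]
variances = [0.10, 0.12, 0.11, 0.09, 0.10]
q_bar, total_var = pool_rubin(estimates, variances)
```

The between-imputation term is what a single imputation leaves out. It makes the pooled variance larger than the average completed-data variance.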
25
MI vs. Resampling
• Both make assumptions about the predictive distributions
• In large samples, resampling produces consistent estimates of variance with minimal assumptions, whereas MI variance estimates are strongly tied to the model and MVM (Little & Rubin 2002)
• MI can have Bayesian motivations, rendering it more applicable in small samples than resampling (Little & Rubin 2002)
• Must assume a stochastic distribution in MI
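A sketch of the resampling route: resample the incomplete data, repeat the imputation and the analysis on each resample, and use the spread of the resulting estimates as the standard error. The analysis here (mean after mean imputation), the simulated data, and the function names are deliberately simple assumptions chosen for illustration.

```python
import random
import statistics

def mean_after_mean_imputation(values):
    """Hypothetical analysis: mean-impute the None entries, then return (mean, completed data)."""
    observed = [v for v in values if v is not None]
    fill = statistics.fmean(observed)
    completed = [fill if v is None else v for v in values]
    return statistics.fmean(completed), completed

def bootstrap_se(values, n_boot=500, seed=0):
    """Standard error from re-running imputation and analysis on resampled incomplete data."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        resample = rng.choices(values, k=len(values))
        if all(v is None for v in resample):
            continue  # skip the very unlikely resample with nothing observed
        estimates.append(mean_after_mean_imputation(resample)[0])
    return statistics.stdev(estimates)

rng = random.Random(5)
# Simulated data with roughly 40% of values missing completely at random
y = [rng.gauss(0, 1) if rng.random() > 0.4 else None for _ in range(200)]

_, completed = mean_after_mean_imputation(y)
# Naive standard error that treats the imputed values as real observations
naive_se = statistics.stdev(completed) / len(completed) ** 0.5
boot_se = bootstrap_se(y)
```

In this setup `boot_se` is substantially larger than `naive_se`, which is the sense in which completed-data p-values are overly significant.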
26
Model Based Approach: Maximum Likelihood
• Maximum likelihood (ML) is a method of estimating parameters, as is ordinary least squares
+ When ML is applied to incomplete data, the mean and covariance estimates are unbiased (under MAR)
27
Model Based Approach: Maximum Likelihood
• Accept a probability density function
• Calculate likelihood function
• Maximize likelihood
Solutions may not be achievable with incomplete cases (cases with missing values)
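As a worked illustration of the three steps, take n complete, independent observations from a univariate normal density. This is an assumed textbook example, not necessarily the one shown on the slides.

```latex
% Step 1: accept a density
f(y_i \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\!\left(-\frac{(y_i - \mu)^2}{2\sigma^2}\right)

% Step 2: the log likelihood
\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2)
  - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2

% Step 3: set the derivatives to zero and solve
\hat{\mu} = \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i,
\qquad
\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2
```

With complete data the maximization has this closed form. With incomplete cases, the corresponding equations generally have to be solved iteratively.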
28
Maximum Likelihood Estimators
[Equations: two maximum likelihood estimators, each shown with the quantity it converges to]
29
ML with Missing Values
• Under MAR, the marginal distribution of the observed data provides the correct likelihood for the unknown parameters, provided that the model is realistic.
• This means, ML can be directly applied to the incomplete data.
• But, the math gets much harder.
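In symbols, writing the data as an observed part and a missing part (the notation is an assumption following common usage), the likelihood that ML works with under MAR is the complete-data density with the missing values integrated out:

```latex
L(\theta \mid Y_{\mathrm{obs}}) \;\propto\; f(Y_{\mathrm{obs}} \mid \theta)
  = \int f(Y_{\mathrm{obs}}, Y_{\mathrm{mis}} \mid \theta)\, dY_{\mathrm{mis}}
```

The integral is what makes the math harder: it rarely simplifies when different cases are missing different variables.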
30
EM Algorithm
General
• Find the conditional expectation of the “missing data functions” given current estimates
• Maximize the new completed-data log likelihood to get new parameter estimates
• Reiterate steps until estimates stabilize
Multivariate Normal
• Regression imputation of missings using the means and entire covariance matrix
• Re-estimate the means and covariance matrix with the imputed values
• Reiterate steps until estimates stabilize
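A minimal sketch of EM for the simplest multivariate normal case: two variables, with `x` fully observed and `y` partly missing. The function name, the simulated data, and the MAR mechanism are illustrative assumptions. Note that the E step fills in expected sufficient statistics, so the expected square of a missing `y` includes the residual variance and is not just the square of the regression prediction.

```python
import random

def em_bivariate_normal(x, y, n_iter=200, tol=1e-10):
    """EM estimates (mu_x, mu_y, s_xx, s_xy, s_yy) when x is complete and some y are None."""
    n = len(x)
    mu_x = sum(x) / n
    s_xx = sum((xi - mu_x) ** 2 for xi in x) / n  # x is complete, so these stay fixed
    obs = [yi for yi in y if yi is not None]
    mu_y = sum(obs) / len(obs)                    # starting values from the observed y
    s_yy = sum((yi - mu_y) ** 2 for yi in obs) / len(obs)
    s_xy = 0.0
    for _ in range(n_iter):
        # E step: expected sufficient statistics given the current estimates
        beta = s_xy / s_xx
        resid_var = s_yy - s_xy ** 2 / s_xx
        sum_y = sum_yy = sum_xy = 0.0
        for xi, yi in zip(x, y):
            if yi is None:
                ey = mu_y + beta * (xi - mu_x)    # conditional mean of y given x
                eyy = ey ** 2 + resid_var         # conditional mean of y squared
            else:
                ey, eyy = yi, yi ** 2
            sum_y += ey
            sum_yy += eyy
            sum_xy += xi * ey
        # M step: re-estimate the parameters from the expected statistics
        new_mu_y = sum_y / n
        new_s_yy = sum_yy / n - new_mu_y ** 2
        new_s_xy = sum_xy / n - mu_x * new_mu_y
        change = abs(new_mu_y - mu_y) + abs(new_s_yy - s_yy) + abs(new_s_xy - s_xy)
        mu_y, s_yy, s_xy = new_mu_y, new_s_yy, new_s_xy
        if change < tol:
            break
    return mu_x, mu_y, s_xx, s_xy, s_yy

# Simulated example: y is more likely to be missing when x is positive (MAR)
rng = random.Random(4)
x = [rng.gauss(0, 1) for _ in range(5000)]
y_full = [2.0 + 0.8 * xi + rng.gauss(0, 0.6) for xi in x]
y = [None if rng.random() < (0.7 if xi > 0 else 0.1) else yi
     for xi, yi in zip(x, y_full)]

observed = [yi for yi in y if yi is not None]
cc_mean = sum(observed) / len(observed)  # complete-case estimate of the mean of y
mu_x, mu_y, s_xx, s_xy, s_yy = em_bivariate_normal(x, y)
```

In this simulation the true mean of `y` is 2.0. The complete-case mean is biased downward, while the EM estimate `mu_y` recovers a value close to 2.0.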
31
EM Algorithm: Observations Are Not Imputed
Generally, missing sufficient statistics, rather than observations, need to be re-estimated.
Consider the trivial case of univariate normal data (Y).
E step ->
32
The E step
33
The M step
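For the univariate normal example, the two steps can be written out as follows. The notation is an assumption: n cases in total, the first r of them observed, and current estimates marked with the superscript (t).

```latex
% E step: expected sufficient statistics given the current estimates
E\!\left[\sum_{i=1}^{n} y_i \,\middle|\, Y_{\mathrm{obs}}, \theta^{(t)}\right]
  = \sum_{i=1}^{r} y_i + (n - r)\,\mu^{(t)}

E\!\left[\sum_{i=1}^{n} y_i^2 \,\middle|\, Y_{\mathrm{obs}}, \theta^{(t)}\right]
  = \sum_{i=1}^{r} y_i^2 + (n - r)\left[\left(\mu^{(t)}\right)^2 + \sigma^{2\,(t)}\right]

% M step: complete-data ML estimates computed from the expected statistics
\mu^{(t+1)} = \frac{1}{n}\,
  E\!\left[\sum_{i=1}^{n} y_i \,\middle|\, Y_{\mathrm{obs}}, \theta^{(t)}\right]

\sigma^{2\,(t+1)} = \frac{1}{n}\,
  E\!\left[\sum_{i=1}^{n} y_i^2 \,\middle|\, Y_{\mathrm{obs}}, \theta^{(t)}\right]
  - \left(\mu^{(t+1)}\right)^2
```

The variance term in the second expectation is why the missing observations are not simply replaced by the current mean: doing so would understate the sum of squares.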
34
EM Algorithm
Concept
• Find the conditional expectation of the “missing data functions” given current estimates
• Maximize the new completed-data log likelihood to get new parameter estimates
• Reiterate steps until estimates stabilize
Explicit Formalization
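In the usual notation (an assumption, since the slide's own formulas are not in the transcript), the two steps of one iteration are:

```latex
% E step: expected complete-data log likelihood given the current estimate
Q\!\left(\theta \mid \theta^{(t)}\right)
  = \int \ell(\theta \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}})\,
    f\!\left(Y_{\mathrm{mis}} \mid Y_{\mathrm{obs}}, \theta^{(t)}\right) dY_{\mathrm{mis}}

% M step: maximize it to get the next estimate
\theta^{(t+1)} = \arg\max_{\theta}\; Q\!\left(\theta \mid \theta^{(t)}\right)
```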
35
EM Algorithm: Observations Are Not Imputed
Generally, missing sufficient statistics, rather than observations, need to be re-estimated.
Consider the trivial case of univariate normal data (Y).
36
Take Homes on ML
• If the model and MVM assumptions are good, ML is likely to be the best alternative
• But, need specialized software or special statisticians to help out
37
Comparison of Methods
• Single Imputation Methods (bad)
• Complete-case (bad-good)
• Conditional Imputation (possibly okay)
• Multiple Imputation (okay-good)
• Maximum Likelihood (okay-good+)
38
Nonignorable Missing Data Models
• Typically, missing data are NMAR
• Include the missing value function in the likelihood
• Need to know something about the function
39
What is Missing?
• Issues and methods for test statistics need further development or exposure
– For MI there are okay, but not fully satisfactory, approaches, including Wald, likelihood ratio, and combined chi-squared tests (Schafer 1997).
– For ML, coverage of hypothesis testing in the literature is unsatisfactory.
40
Notes on the Literature
• New but growing (hub and spokes)
• Often too focused on one aspect of mathematical statistics, or too applied without clear support, and not comparative
41
Objectives
• Introduce the main concepts with MVA
• Provide some tentative guidance dealing with MVA
• Develop a discussion of MVA’s impact on research interpretations
• Identify where we should prioritize our development with regard to MVA tools