Overview of Missing Value Analysis

Statistical Analysis with Missing Data: A Survey of Options Kevin Cummins Addictions Research Seminar October 18, 2006

Upload: kevin-cummins

Post on 03-Dec-2014

DESCRIPTION

This presentation was given at the UCSD/VA Medical Center San Diego Addictions Seminar with the intent to orient the research groups to the options that modern missing value analysis afforded them and determine what type of support each group would prefer.

TRANSCRIPT

Page 1: Overview of Missing Value Analysis

Statistical Analysis with Missing Data:

A Survey of Options

Kevin Cummins

Addictions Research Seminar

October 18, 2006

Page 2: Overview of Missing Value Analysis

© David Farley

Page 3: Overview of Missing Value Analysis

Objectives

• Introduce the main concepts of missing value analysis

• Provide some tentative guidance on dealing with missing values (MV)

• Develop a discussion of MV’s impact on the interpretation of our research findings

• Identify where we should prioritize our development of MV analysis tools

Page 4: Overview of Missing Value Analysis

Outline

• Introduction
  – Objective
  – The Problem

• Getting Parameter Estimates
  – Complete Case
  – Imputation
  – Maximum Likelihood

• Comparison of Approaches

• Hypothesis Testing

S = Number of Slides = 33

Page 5: Overview of Missing Value Analysis


Problems with Missing Data

• Bias Potential

• Analytical Hurdles

• Loss of Power

Page 6: Overview of Missing Value Analysis


Missing Value Pattern Example

Page 7: Overview of Missing Value Analysis

Missing Value Mechanisms (MVM)

• Missing by Necessity (NA)

• Missing Completely at Random (MCAR): missingness does not depend on the values of any variables, observed or missing

• Missing at Random (MAR): missingness may depend on other measured variables, but not on the missing values themselves

• Not Missing at Random (NMAR): missingness depends on the unobserved values of the variables with missings
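To make the three mechanisms concrete, here is a minimal simulation sketch. The variable names, the logistic missingness probabilities, and the sample size are illustrative assumptions, not part of the original talk. It compares the full-data mean of Y with the mean of the cases that remain observed.

```python
import math
import random
from statistics import mean

def simulate(mechanism, n=20000, seed=0):
    """Return (full-data mean of Y, complete-case mean of Y) under one mechanism."""
    rng = random.Random(seed)
    ys, observed = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)          # covariate, always observed
        y = x + rng.gauss(0, 1)      # outcome that may go missing
        if mechanism == "MCAR":
            p_missing = 0.3                          # constant: depends on nothing
        elif mechanism == "MAR":
            p_missing = 1 / (1 + math.exp(-2 * x))   # depends on observed X only
        elif mechanism == "NMAR":
            p_missing = 1 / (1 + math.exp(-2 * y))   # depends on Y itself
        else:
            raise ValueError(mechanism)
        ys.append(y)
        if rng.random() >= p_missing:
            observed.append(y)
    return mean(ys), mean(observed)

for mech in ("MCAR", "MAR", "NMAR"):
    full, cc = simulate(mech)
    print(f"{mech}: full-data mean {full:+.3f}, complete-case mean {cc:+.3f}")
```

In this setup the observed-case mean tracks the full-data mean only under MCAR. Under MAR the gap could in principle be removed by conditioning on X, whereas under NMAR the observed data alone do not identify it.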

Page 8: Overview of Missing Value Analysis

Graphical Examples of MVM

[Figure: three diagrams relating the variables Y, X, and Z to the missingness of Y (Ymis)]

Page 9: Overview of Missing Value Analysis


Why MVM Assumptions are Crucial

• Missing data methods depend very strongly on the nature of the dependencies in these mechanisms

Page 10: Overview of Missing Value Analysis


What Can You Do?

• Complete Case Analyses

• Weighting Procedures

• Imputation-Based Procedures

• Model-Based Procedures

Page 11: Overview of Missing Value Analysis


Complete Case Analysis

• Benchmark

+ Simple & easy

+ Often satisfactory

+ Direct comparability among variables

- Inefficient

- Can lead to bias, unless MCAR
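As a small illustration of why complete-case analysis is simple but inefficient, the sketch below applies listwise deletion to a handful of hypothetical records (the field names and values are invented for the example). It also shows that restricting the analysis to fewer variables retains more cases.

```python
def complete_cases(rows, variables):
    """Keep only rows in which every listed variable is observed (not None)."""
    return [r for r in rows if all(r[v] is not None for v in variables)]

# Hypothetical records; None marks a missing value.
rows = [
    {"age": 34, "drinks": 12, "score": 7.1},
    {"age": 41, "drinks": None, "score": 6.4},
    {"age": 29, "drinks": 3, "score": None},
    {"age": 52, "drinks": 8, "score": 5.9},
    {"age": 38, "drinks": None, "score": 6.8},
]

all_vars = complete_cases(rows, ["age", "drinks", "score"])
without_drinks = complete_cases(rows, ["age", "score"])
print(len(all_vars), "of", len(rows), "cases are complete on all variables")
print(len(without_drinks), "cases are complete once 'drinks' is left out")
```

Every deleted row discards the values that were observed in it, which is the source of the efficiency loss.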

Page 12: Overview of Missing Value Analysis

Complete Case Analysis

• Can be improved under some designs by using weighting (Little & Rubin 2002)

• Can be improved by dropping variables with many missing values
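One common form of weighting is a weighting-class adjustment, sketched below under the assumption that response is random within each class. The class labels and values are hypothetical. Each respondent is weighted by the inverse of the response rate in its class, so under-responding classes count for more.

```python
from collections import defaultdict

def weighting_class_mean(records):
    """Mean of y over respondents, each weighted by the inverse of the
    response rate in its adjustment class. records: (class_label, y or None)."""
    n_total = defaultdict(int)
    n_resp = defaultdict(int)
    for cls, y in records:
        n_total[cls] += 1
        if y is not None:
            n_resp[cls] += 1
    num = den = 0.0
    for cls, y in records:
        if y is not None:
            w = n_total[cls] / n_resp[cls]   # inverse of the class response rate
            num += w * y
            den += w
    return num / den

# Hypothetical data: class A has a 50% response rate, class B responds fully.
records = [("A", 10.0), ("A", 12.0), ("A", None), ("A", None),
           ("B", 20.0), ("B", 22.0), ("B", 24.0), ("B", 18.0)]

respondents = [y for _, y in records if y is not None]
print("unweighted complete-case mean:", sum(respondents) / len(respondents))
print("weighting-class mean:", weighting_class_mean(records))
```

Here the unweighted mean is pulled toward class B because class A is under-represented among respondents; the weighted mean restores each class to its share of the full sample.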

Page 13: Overview of Missing Value Analysis

Single Imputation

Imputed observation: a calculated value used in place of a missing value

• Explicit Model Imputation
  – Mean imputation
  – Regression imputation
  – Stochastic regression imputation

• Implicit Model Imputation
  – Hot deck imputation
  – Substitution imputation
  – Cold deck imputation

• Composite Approaches

Page 14: Overview of Missing Value Analysis

Single Imputation

• Explicit Model Imputation: a formal statistical model is created to describe the distribution of the missing values. The assumptions are explicit.

• Implicit Model Imputation: no formal model. An algorithm is created for selecting and assigning imputed values. The assumptions are implicit.

Page 15: Overview of Missing Value Analysis


“The idea of imputation is both seductive and dangerous. It is seductive because it can lull the user into the pleasurable state of believing that the data are complete after all, and it is dangerous because it lumps together situations where [its application is legitimate and where it creates serious biases]”

Dempster and Rubin 1983

Page 16: Overview of Missing Value Analysis

Single Imputation: Mean Imputation

• Replace the missing value with the variable’s mean

- Severe bias is possible

- Covariance matrices will be attenuated

• Some rectification possible with conditional mean imputation

+/- Improvement on a bad option

- Not generally recommended
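The attenuation can be seen in a quick simulation. This is only a sketch: the data-generating model, the 40% MCAR deletion rate, and the helper `covariance` are assumptions made for the illustration.

```python
import random
from statistics import mean, variance

def covariance(a, b):
    """Sample covariance with divisor n - 1."""
    ma, mb = mean(a), mean(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (len(a) - 1)

rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(5000)]
y = [xi + rng.gauss(0, 1) for xi in x]

# Delete about 40% of y completely at random (MCAR).
y_obs = [yi if rng.random() > 0.4 else None for yi in y]

# Unconditional mean imputation: every hole gets the observed mean.
observed = [v for v in y_obs if v is not None]
fill = mean(observed)
y_imp = [fill if v is None else v for v in y_obs]

print("variance of observed y:     ", round(variance(observed), 3))
print("variance after imputation:  ", round(variance(y_imp), 3))
print("cov(x, y), full data:       ", round(covariance(x, y), 3))
print("cov(x, y), after imputation:", round(covariance(x, y_imp), 3))
```

Because every imputed value sits exactly at the mean, it adds nothing to the sums of squares or cross-products while still being counted in the sample size, so both the variance and the covariance shrink.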

Page 17: Overview of Missing Value Analysis

Single Imputation: Regression Imputation

• Replace the missing value with the expected value from a regression. The regression models the variable with missing values using the other variables as predictors.

- Substantial bias issues, especially in variance estimates (so correlations are affected)

- Valid only with monotone missings

Page 18: Overview of Missing Value Analysis

Single Imputation: Stochastic Regression Imputation

• Replace the missing value with a draw from a regression model. Here the imputed value is not an expected value but a random observation generated from the model (including the stochastic/error term).

+ Reduced bias, better variance estimates

+ Can be recommended at times

- Adding the stochastic term can reduce efficiency
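The contrast between imputing the regression expectation and imputing a random draw can be sketched as follows. The simulated model, the MCAR deletion, and the helper `fit_line` are assumptions for the example; the draw also ignores uncertainty in the fitted coefficients, which a fuller treatment would include.

```python
import random
from statistics import mean, variance

def fit_line(xs, ys):
    """Ordinary least squares of y on x; returns intercept, slope, residual SD."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((a - mx) ** 2 for a in xs)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    resid = [b - (intercept + slope * a) for a, b in zip(xs, ys)]
    sigma = (sum(e * e for e in resid) / (len(xs) - 2)) ** 0.5
    return intercept, slope, sigma

rng = random.Random(2)
n = 5000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [1 + 0.8 * xi + rng.gauss(0, 1) for xi in x]
missing = [rng.random() < 0.4 for _ in range(n)]   # MCAR, for simplicity

# Fit the regression on the complete cases only.
xc = [a for a, m in zip(x, missing) if not m]
yc = [b for b, m in zip(y, missing) if not m]
b0, b1, s = fit_line(xc, yc)

# Regression imputation: fill with the fitted expectation.
y_reg = [b0 + b1 * a if m else b for a, b, m in zip(x, y, missing)]
# Stochastic regression imputation: expectation plus a random error draw.
y_sto = [b0 + b1 * a + rng.gauss(0, s) if m else b for a, b, m in zip(x, y, missing)]

print("variance, full data:            ", round(variance(y), 3))
print("variance, regression imputation:", round(variance(y_reg), 3))
print("variance, stochastic imputation:", round(variance(y_sto), 3))
```

The expectation-only version understates the spread of Y because the imputed points lie exactly on the fitted line; adding the error term restores the variance at the cost of extra noise in the estimates.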

Page 19: Overview of Missing Value Analysis

Single Imputation: Hot Deck Imputation

• Replace missings with values from similar sampling units

- Unbiased only under MCAR

- Inefficient estimators
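A minimal sketch of one hot deck variant follows, in which "similar" is taken to mean "shares the same value of a grouping variable". The function name, field names, and records are hypothetical; real hot deck procedures differ in how donors are matched and selected.

```python
import random

def hot_deck(records, key, target, rng):
    """Fill missing `target` values with a value drawn at random from donors
    (records with `target` observed) that share the same `key`.
    Raises KeyError if a group has no donor."""
    donors = {}
    for r in records:
        if r[target] is not None:
            donors.setdefault(r[key], []).append(r[target])
    completed = []
    for r in records:
        r = dict(r)                      # copy, so the input is left untouched
        if r[target] is None:
            r[target] = rng.choice(donors[r[key]])
        completed.append(r)
    return completed

# Hypothetical records; None marks a missing score.
records = [
    {"site": "A", "score": 4.0},
    {"site": "A", "score": 6.0},
    {"site": "A", "score": None},
    {"site": "B", "score": 10.0},
    {"site": "B", "score": None},
    {"site": "B", "score": 12.0},
]

completed = hot_deck(records, key="site", target="score", rng=random.Random(0))
for r in completed:
    print(r)
```

Imputed values are always real observed values from the same data set, which is what distinguishes the hot deck from model-based imputation.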

Page 20: Overview of Missing Value Analysis

Single Imputation: Cold Deck Imputation

• Replace missings with values from a source outside of the current analysis’ data.

- Theory for cold deck is lacking or obvious

Page 21: Overview of Missing Value Analysis

Single Imputation: Composite Approaches

Example: Hot Deck + RI (regression imputation) in Longitudinal Design

1) Find the conditional expectation for the missing value

2) Obtain a hot deck residual

3) Combine the hot deck residual and the expectation to provide the imputed value
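The three steps can be sketched for a two-wave design as follows. The simulated baseline and follow-up scores, the dropout rate, and the simple linear model are all assumptions for illustration, and the residual donor is drawn at random rather than matched.

```python
import random
from statistics import mean

rng = random.Random(3)

# Hypothetical longitudinal data: baseline is complete, follow-up has dropouts.
baseline = [rng.gauss(10, 2) for _ in range(200)]
followup = [2 + 0.9 * b + rng.gauss(0, 1) for b in baseline]
followup = [f if rng.random() > 0.25 else None for f in followup]

# Regress follow-up on baseline using the complete pairs (ordinary least squares).
pairs = [(b, f) for b, f in zip(baseline, followup) if f is not None]
mb = mean(b for b, _ in pairs)
mf = mean(f for _, f in pairs)
slope = sum((b - mb) * (f - mf) for b, f in pairs) / sum((b - mb) ** 2 for b, _ in pairs)
intercept = mf - slope * mb
residuals = [f - (intercept + slope * b) for b, f in pairs]

completed = []
for b, f in zip(baseline, followup):
    if f is None:
        expected = intercept + slope * b        # 1) conditional expectation
        donor_residual = rng.choice(residuals)  # 2) hot deck residual from an observed case
        f = expected + donor_residual           # 3) combine into the imputed value
    completed.append(f)

print("cases imputed:", sum(f is None for f in followup))
print("mean of completed follow-up:", round(mean(completed), 3))
```

Drawing the residual from observed cases avoids assuming a particular error distribution, while the regression keeps the imputation conditional on baseline.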

Page 22: Overview of Missing Value Analysis


Properties of Imputation

+ Can be more powerful than complete-case analysis

+ Imputation produces completed-case data that can be plugged into standard analyses

- Variances can be biased

- P-values can be overly significant

Page 23: Overview of Missing Value Analysis

Some Take Homes

• Imputation should be conditional
  – Regression or matched cases

• Multivariate

• Draws from distributions, not expected values

• Use when there are few missings

• Key problem: inferences about parameters based on completed data don’t account for imputation uncertainty (Little & Rubin 2002).

Page 24: Overview of Missing Value Analysis

Methods Addressing Imputation Uncertainty

• Replication Methods: jackknife or bootstrap the analysis

- Require large samples (Little and Rubin 2002)

+ Can be easy

• Multiple Imputation: create multiple imputed data tables; the variability across the completed-data analyses is integrated into the assessment of parameter estimate uncertainty
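For multiple imputation, the usual way to integrate between-imputation variability is the set of combining rules commonly attributed to Rubin. The sketch below assumes each of the m completed data sets has already been analyzed, yielding a point estimate and a squared standard error; the function name and the numbers are hypothetical.

```python
from statistics import mean, variance

def pool(estimates, variances):
    """Combine m completed-data estimates and their squared standard errors."""
    m = len(estimates)
    q_bar = mean(estimates)       # pooled point estimate
    w = mean(variances)           # within-imputation variance
    b = variance(estimates)       # between-imputation variance (divisor m - 1)
    t = w + (1 + 1 / m) * b       # total variance
    # Degrees of freedom for the t reference distribution; requires b > 0.
    df = (m - 1) * (1 + w / ((1 + 1 / m) * b)) ** 2
    return q_bar, t, df

# Hypothetical results from m = 5 imputed data sets.
estimates = [1.0, 1.2, 0.8, 1.1, 0.9]
variances = [0.04, 0.05, 0.04, 0.05, 0.04]

q_bar, t, df = pool(estimates, variances)
print(f"pooled estimate {q_bar:.3f}, total variance {t:.3f}, df {df:.1f}")
```

The total variance exceeds the average within-imputation variance whenever the estimates differ across imputations, which is exactly the uncertainty that a single imputation hides.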

Page 25: Overview of Missing Value Analysis

MI vs. Resampling

• Both make assumptions about the predictive distributions

• In large samples, resampling produces consistent estimates of variance with minimal assumptions, whereas MI variance estimates are strongly tied to the model and MVM (Little & Rubin 2002)

• MI can have Bayesian motivations, rendering it more applicable in small samples than resampling (Little & Rubin 2002)

• Must assume a stochastic distribution in MI
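On the resampling side, a rough sketch of the bootstrap looks like this: the incomplete data are resampled, and the imputation is repeated inside every replicate so that its uncertainty shows up in the spread of the estimates. The simulated data, the simple hot deck imputer, and the number of replicates are assumptions for the example.

```python
import random
from statistics import mean, stdev

def impute_and_estimate(sample, rng):
    """Hot-deck impute missing values within the sample, then return its mean."""
    donors = [v for v in sample if v is not None]
    completed = [v if v is not None else rng.choice(donors) for v in sample]
    return mean(completed)

def bootstrap_se(data, n_boot, rng):
    """Standard error of the imputed-data mean, with imputation redone per replicate."""
    estimates = []
    for _ in range(n_boot):
        resample = [rng.choice(data) for _ in data]   # resample the incomplete data
        estimates.append(impute_and_estimate(resample, rng))
    return stdev(estimates)

rng = random.Random(4)
# Hypothetical data: about half of the values are missing completely at random.
data = [rng.gauss(50, 10) if rng.random() > 0.5 else None for _ in range(200)]

# Naive approach: impute once and treat the imputed values as if they were observed.
donors = [v for v in data if v is not None]
completed = [v if v is not None else rng.choice(donors) for v in data]
naive_se = stdev(completed) / len(completed) ** 0.5

boot_se = bootstrap_se(data, n_boot=400, rng=rng)
print(f"naive SE {naive_se:.3f}, bootstrap SE {boot_se:.3f}")
```

The naive standard error is too small because it counts imputed values as real information; the bootstrap version reflects both the smaller effective sample and the randomness of the imputation step.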

Page 26: Overview of Missing Value Analysis


Model Based Approach: Maximum Likelihood

• Maximum likelihood (ML) is a method of estimating parameters, as is ordinary least squares

+ When ML is applied to incomplete data, the mean and covariance estimates are consistent (asymptotically unbiased) under MAR

Page 27: Overview of Missing Value Analysis


Model Based Approach: Maximum Likelihood

• Accept a probability density function

• Calculate likelihood function

• Maximize likelihood

Solutions may not be achievable with incomplete cases (cases with missing values)
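As a worked instance of the three steps, consider complete univariate normal data $y_1, \dots, y_n$. This example and its notation are supplied here for illustration and are not taken from the slides.

```latex
% Step 1: accept a probability density function
f(y_i \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\!\left(-\frac{(y_i - \mu)^2}{2\sigma^2}\right)

% Step 2: calculate the (log) likelihood of the sample
\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2)
    - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2

% Step 3: maximize by setting the partial derivatives to zero
\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} y_i, \qquad
\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{\mu})^2
```

With complete data the maximizers have this closed form; with general patterns of missing values they typically do not, which is what motivates iterative methods.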

Page 28: Overview of Missing Value Analysis

Maximum Likelihood Estimators

[Equations not captured in the transcript: two estimators, each followed by “which converges to” and its limit]

Page 29: Overview of Missing Value Analysis


ML with Missing Values

• Under MAR, the marginal distribution of the observed data provides the correct likelihood for the unknown parameters, provided that the model is realistic.

• This means, ML can be directly applied to the incomplete data.

• But, the math gets much harder.

Page 30: Overview of Missing Value Analysis

EM Algorithm

General

• Find the conditional expectation of the “missing data functions” given current estimates

• Maximize the new completed-data log likelihood to get new parameter estimates

• Reiterate steps until estimates stabilize

Multivariate Normal

• Regression imputation of missings using the means and entire covariance matrix

• Re-estimate the means and covariance matrix with the imputed values

• Reiterate steps until estimates stabilize
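A minimal sketch of these steps for the bivariate normal case, with X fully observed and Y partly missing, is given below. The function name, starting values, and simulated MAR data are assumptions for illustration. Note that the E step carries the expected square of each missing Y, so the residual variance is added back rather than treating the regression prediction as an observed value.

```python
import math
import random

def em_bivariate(xs, ys, n_iter=200, tol=1e-10):
    """EM estimates of the mean and covariance of (X, Y) when some ys are None."""
    n = len(xs)
    mu_x = sum(xs) / n                                # X is complete, so these
    s_xx = sum((a - mu_x) ** 2 for a in xs) / n       # two estimates never change
    obs = [b for b in ys if b is not None]
    mu_y = sum(obs) / len(obs)                        # complete-case starting values
    s_yy = sum((b - mu_y) ** 2 for b in obs) / len(obs)
    s_xy = 0.0
    for _ in range(n_iter):
        beta = s_xy / s_xx                  # slope of Y on X at current estimates
        resid_var = s_yy - beta * s_xy      # Var(Y | X) at current estimates
        sum_y = sum_yy = sum_xy = 0.0
        for a, b in zip(xs, ys):            # E step: expected sufficient statistics
            if b is None:
                ey = mu_y + beta * (a - mu_x)
                eyy = ey * ey + resid_var   # E[Y^2 | X] includes the residual variance
            else:
                ey, eyy = b, b * b
            sum_y += ey
            sum_yy += eyy
            sum_xy += a * ey
        new_mu_y = sum_y / n                # M step: update from expected statistics
        new_s_yy = sum_yy / n - new_mu_y ** 2
        new_s_xy = sum_xy / n - mu_x * new_mu_y
        change = abs(new_mu_y - mu_y) + abs(new_s_yy - s_yy) + abs(new_s_xy - s_xy)
        mu_y, s_yy, s_xy = new_mu_y, new_s_yy, new_s_xy
        if change < tol:
            break
    return mu_x, mu_y, s_xx, s_xy, s_yy

# Simulated MAR data: Y is more likely to be missing when X is large.
rng = random.Random(5)
n = 5000
xs = [rng.gauss(0, 1) for _ in range(n)]
ys_full = [2 + a + rng.gauss(0, 1) for a in xs]
ys = [None if rng.random() < 1 / (1 + math.exp(-2 * a)) else b
      for a, b in zip(xs, ys_full)]

observed = [b for b in ys if b is not None]
cc_mean = sum(observed) / len(observed)
mu_x, mu_y, s_xx, s_xy, s_yy = em_bivariate(xs, ys)
print(f"true mean of Y 2.000, complete-case {cc_mean:.3f}, EM {mu_y:.3f}")
```

In this simulation the complete-case mean of Y is pulled downward because high-X cases are preferentially missing, while the EM estimate uses the observed X values to correct for it.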

Page 31: Overview of Missing Value Analysis

EM Algorithm: Observations Are Not Imputed

Generally, the missing sufficient statistics, rather than the individual observations, need to be re-estimated.

Consider the trivial case of univariate normal data (Y).

E step ->

Page 32: Overview of Missing Value Analysis


The E step

Page 33: Overview of Missing Value Analysis


The M step
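The equations for these two slides did not survive in the transcript. For reference, the standard textbook form of the two steps for univariate normal data is given below, assuming $n$ sampled units of which the first $r$ are observed, with current estimates $\mu^{(t)}$ and $\sigma^{2(t)}$. The notation is an assumption and may differ from what the slides showed.

```latex
% E step: expected complete-data sufficient statistics
E\!\left[\sum_{i=1}^{n} y_i \,\Big|\, Y_{\text{obs}}, \theta^{(t)}\right]
    = \sum_{i=1}^{r} y_i + (n - r)\,\mu^{(t)}

E\!\left[\sum_{i=1}^{n} y_i^2 \,\Big|\, Y_{\text{obs}}, \theta^{(t)}\right]
    = \sum_{i=1}^{r} y_i^2 + (n - r)\left[\left(\mu^{(t)}\right)^2 + \sigma^{2(t)}\right]

% M step: complete-data ML formulas applied to the expected statistics
\mu^{(t+1)} = \frac{1}{n}\, E\!\left[\sum_{i=1}^{n} y_i \,\Big|\, Y_{\text{obs}}, \theta^{(t)}\right]

\sigma^{2(t+1)} = \frac{1}{n}\, E\!\left[\sum_{i=1}^{n} y_i^2 \,\Big|\, Y_{\text{obs}}, \theta^{(t)}\right]
    - \left(\mu^{(t+1)}\right)^2
```

Setting $\mu^{(t+1)} = \mu^{(t)}$ and $\sigma^{2(t+1)} = \sigma^{2(t)}$ shows that the iteration settles at the mean and the divisor-$r$ variance of the $r$ observed values. The expected sum of squares includes the $\sigma^{2(t)}$ term, so this is not the same as filling in each missing value with $\mu^{(t)}$.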

Page 34: Overview of Missing Value Analysis

EM Algorithm

Concept

• Find the conditional expectation of the “missing data functions” given current estimates

• Maximize the new completed-data log likelihood to get new parameter estimates

• Reiterate steps until estimates stabilize

Explicit Formalization

Page 35: Overview of Missing Value Analysis

EM Algorithm: Observations Are Not Imputed

Generally, the missing sufficient statistics, rather than the individual observations, need to be re-estimated.

Consider the trivial case of univariate normal data (Y).

Page 36: Overview of Missing Value Analysis

Take Homes on ML

• If the model and MVM assumptions are good, ML is likely to be the best alternative

• But it requires specialized software or a specialist statistician to help out

Page 37: Overview of Missing Value Analysis


Comparison of Methods

• Single Imputation Methods (bad)

• Complete-case (bad-good)

• Conditional Imputation (possibly okay)

• Multiple Imputation (okay-good)

• Maximum Likelihood (okay-good+)

Page 38: Overview of Missing Value Analysis


Nonignorable Missing Data Models

• Typically, missing data are NMAR

• Include the missing value function in the likelihood

• Need to know something about the function
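One common way to write this down is the selection-model factorization, sketched here with notation that is assumed for illustration: $Y = (Y_{\text{obs}}, Y_{\text{mis}})$ is the data, $M$ the missingness indicator, $\theta$ the parameters of interest, and $\psi$ the parameters of the missing value function.

```latex
% Joint model for the data and the missingness indicator
f(Y, M \mid \theta, \psi) = f(Y \mid \theta)\, f(M \mid Y, \psi)

% Likelihood based on what is actually observed: integrate out the missing values
L(\theta, \psi \mid Y_{\text{obs}}, M) \propto
    \int f(Y_{\text{obs}}, Y_{\text{mis}} \mid \theta)\,
         f(M \mid Y_{\text{obs}}, Y_{\text{mis}}, \psi)\, dY_{\text{mis}}
```

Under MAR, $f(M \mid Y, \psi)$ does not involve $Y_{\text{mis}}$ and factors out of the integral, so it can be ignored when estimating $\theta$. Under NMAR it cannot, which is why something must be known or assumed about the missing value function.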

Page 39: Overview of Missing Value Analysis

What is Missing?

• Issues and methods for test statistics need further development or exposure

– For MI there are okay, but not fully satisfactory, approaches, including the Wald test, likelihood ratio test, and combined chi-squared tests (Schafer 1997).

– For ML, coverage of hypothesis testing in the literature is unsatisfactory.

Page 40: Overview of Missing Value Analysis

Notes on the Literature

• New but growing (hub and spokes)

• Often too focused on one aspect of mathematical statistics, or too applied without clear support, and not comparative

Page 41: Overview of Missing Value Analysis

Objectives

• Introduce the main concepts of MVA

• Provide some tentative guidance on dealing with MVA

• Develop a discussion of MVA’s impact on research interpretations

• Identify where we should prioritize our development with regard to MVA tools