overview of missing value analysis

Statistical Analysis with Missing Data:

A Survey of Options

Kevin Cummins

Addictions Research Seminar, October 18, 2006

2

© David Farley

3

Objectives

• Introduce the main concepts of missing value analysis

• Provide some tentative guidance on dealing with missing values (MV)

• Develop a discussion of MV’s impact on the interpretation of our research findings

• Identify where we should prioritize our development of MV analysis tools

4

Outline

• Introduction
– Objective
– The Problem

• Getting Parameter Estimates
– Complete Case
– Imputation
– Maximum Likelihood

• Comparison of Approaches

• Hypothesis Testing

S = Number of Slides = 33

5

Problems with Missing Data

• Bias Potential

• Analytical Hurdles

• Loss of Power

6

Missing Value Pattern Example

7

Missing Value Mechanisms (MVM)

• Missing by Necessity (NA)

• Missing Completely at Random (MCAR): missingness does not depend on any of the variables, measured or missing

• Missing at Random (MAR): missingness may depend on other measured variables, but not on the missing values themselves

• Not Missing at Random (NMAR): missingness depends on the values of the variables that are missing
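The three random mechanisms can be told apart in a small simulation. The sketch below is illustrative only: the variables `x` and `y`, the sample size, and the missingness rules are assumptions made for this example, not material from the seminar. It compares the mean of the observed Y values under each mechanism with the mean of the full data.

```python
import random
import statistics

random.seed(0)

# Hypothetical data: X is always observed, Y depends on X and may go missing.
n = 20000
x = [random.gauss(0, 1) for _ in range(n)]
y = [xi + random.gauss(0, 1) for xi in x]

def observed_mean(keep):
    """Mean of Y over the cases where Y is observed (keep is True)."""
    return statistics.mean(yi for yi, k in zip(y, keep) if k)

# MCAR: every Y has the same 50% chance of being observed.
mcar = [random.random() > 0.5 for _ in range(n)]
# MAR: Y is observed only when the (always observed) X is not positive.
mar = [xi <= 0 for xi in x]
# NMAR: Y is observed only when Y itself is not positive.
nmar = [yi <= 0 for yi in y]

full_mean = statistics.mean(y)
mcar_mean = observed_mean(mcar)
mar_mean = observed_mean(mar)
nmar_mean = observed_mean(nmar)
print(full_mean, mcar_mean, mar_mean, nmar_mean)
```

In this setup only the MCAR mean stays close to the full-data mean. The MAR bias could in principle be corrected using X, because the missingness rule involves only observed values; the NMAR bias cannot be corrected without modelling the mechanism itself.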

8

Graphical Examples of MVM

[Figure: three diagrams relating the variables Y, X, and Z to the missingness of Y (Ymis)]

9

Why MVM Assumptions are Crucial

• Missing data methods depend very strongly on the nature of the dependencies in these mechanisms

10

What Can You Do?

• Complete Case Analyses

• Weighting Procedures

• Imputation-Based Procedures

• Model-Based Procedures

11

Complete Case Analysis

• Benchmark

+ Simple & easy

+ Often satisfactory

+ Direct comparability among variables

- Inefficient

- Can lead to bias, unless MCAR
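The inefficiency is easy to see numerically. The following sketch uses made-up settings (1000 cases, 5 variables, 10% of values missing completely at random) to show how many cases survive listwise deletion.

```python
import random

random.seed(1)

# Hypothetical data set: 1000 cases, 5 variables, each value missing
# completely at random (MCAR) with probability 0.10. None marks a missing value.
n_cases, n_vars, p_miss = 1000, 5, 0.10
data = [
    [None if random.random() < p_miss else random.gauss(0, 1) for _ in range(n_vars)]
    for _ in range(n_cases)
]

# Complete case analysis (listwise deletion): keep rows with no missing value.
complete = [row for row in data if all(v is not None for v in row)]

frac_values_missing = sum(v is None for row in data for v in row) / (n_cases * n_vars)
frac_cases_kept = len(complete) / n_cases
print(frac_values_missing, frac_cases_kept)
```

With independent missingness the expected share of complete cases is 0.9 to the power 5, roughly 59%, so about four in ten cases are discarded even though only one value in ten is missing.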

12

Complete Case Analysis

• Can be improved under some designs by using weightings (Little & Rubin 2002)

• Can be improved by dropping variables with many missing values

13

Single Imputation

Imputed observation: a calculated value used in place of a missing value

Explicit Model Imputation

• Mean imputation

• Regression imputation

• Stochastic regression imputation

Implicit Model Imputation

• Hot deck imputation

• Substitution imputation

• Cold deck imputation

Composite Approaches

14

Single Imputation

Imputed observation: a calculated value used in place of a missing value

Explicit Model Imputation: a formal statistical model is created to describe the distribution of the missing values. The assumptions are explicit.

Implicit Model Imputation: no formal model. An algorithm for selecting and assigning imputed values is created. The assumptions are implicit.

15

“The idea of imputation is both seductive and dangerous. It is seductive because it can lull the user into the pleasurable state of believing that the data are complete after all, and it is dangerous because it lumps together situations where [its application is legitimate and where it creates serious biases]”

Dempster and Rubin 1983

16

Single Imputation: Explicit Model Imputation

Mean Imputation

• Replace the missing value with the variable’s mean

- Severe bias is possible

- Covariance matrices will be attenuated

• Some rectification possible with conditional mean imputation

+/- Improvement on a bad option

- Not generally recommended
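The attenuation can be checked with a short simulation. This is a minimal sketch on invented data (Y = X + noise, 40% of Y removed completely at random); the helper `cov` is defined here for the example.

```python
import random
import statistics

random.seed(2)

# Hypothetical complete data: Y = X + noise; then 40% of Y is removed (MCAR).
n = 5000
x = [random.gauss(0, 1) for _ in range(n)]
y_true = [xi + random.gauss(0, 1) for xi in x]
y = [None if random.random() < 0.4 else yi for yi in y_true]

# Mean imputation: replace every missing Y with the mean of the observed Y.
y_bar = statistics.mean(v for v in y if v is not None)
y_imp = [y_bar if v is None else v for v in y]

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

var_true, var_imp = statistics.variance(y_true), statistics.variance(y_imp)
cov_true, cov_imp = cov(x, y_true), cov(x, y_imp)
print(var_true, var_imp, cov_true, cov_imp)
```

Every imputed case sits exactly at the mean, so it adds nothing to the sums of squares or cross-products while still counting in the denominator. Both the variance of Y and its covariance with X shrink by roughly the fraction of values imputed.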

17


Regression Imputation

• Replace the missing value with the expected value from a regression that models the variable with missing values using the other variables.

- Substantial bias issues, especially in variance estimates (and thus correlations are affected)

- Valid only with monotone missing-data patterns
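A minimal sketch of regression imputation with one predictor follows. The data, the 40% MCAR missingness, and the simple least-squares fit are assumptions chosen for illustration.

```python
import random
import statistics

random.seed(3)

# Hypothetical data: Y = 2 + X + noise, with 40% of Y missing (MCAR).
n = 5000
x = [random.gauss(0, 1) for _ in range(n)]
y_true = [2 + xi + random.gauss(0, 1) for xi in x]
y = [None if random.random() < 0.4 else yi for yi in y_true]

# Fit Y = a + b*X by least squares on the complete cases.
obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
mx = statistics.mean(xi for xi, _ in obs)
my = statistics.mean(yi for _, yi in obs)
b = sum((xi - mx) * (yi - my) for xi, yi in obs) / sum((xi - mx) ** 2 for xi, _ in obs)
a = my - b * mx

# Regression imputation: fill each missing Y with its predicted (expected) value.
y_imp = [a + b * xi if yi is None else yi for xi, yi in zip(x, y)]

mean_imp = statistics.mean(y_imp)
var_true, var_imp = statistics.variance(y_true), statistics.variance(y_imp)
print(mean_imp, var_true, var_imp)
```

The mean is recovered well here, but the imputed points all lie exactly on the fitted line, so the variance of the completed Y is too small and its correlation with X is overstated.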

18


Stochastic Regression Imputation

• Replace the missing value with a value from a regression model. In this case the imputed value is not an expected value but a random draw generated from the model (including the stochastic/error term).

+ Reduced bias, better variance estimates

+ Can be recommended at times

- Adding the stochastic term can reduce efficiency
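The only change from deterministic regression imputation is the added random residual. The sketch below assumes normally distributed errors and uses invented data; both are choices made for the example.

```python
import random
import statistics

random.seed(4)

# Hypothetical data: Y = 2 + X + noise, with 40% of Y missing (MCAR).
n = 5000
x = [random.gauss(0, 1) for _ in range(n)]
y_true = [2 + xi + random.gauss(0, 1) for xi in x]
y = [None if random.random() < 0.4 else yi for yi in y_true]

# Least-squares fit of Y on X using the complete cases.
obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
mx = statistics.mean(xi for xi, _ in obs)
my = statistics.mean(yi for _, yi in obs)
b = sum((xi - mx) * (yi - my) for xi, yi in obs) / sum((xi - mx) ** 2 for xi, _ in obs)
a = my - b * mx
# Residual standard deviation (two coefficients were estimated).
resid_sd = (sum((yi - a - b * xi) ** 2 for xi, yi in obs) / (len(obs) - 2)) ** 0.5

# Stochastic regression imputation: prediction plus a random normal residual.
y_imp = [
    a + b * xi + random.gauss(0, resid_sd) if yi is None else yi
    for xi, yi in zip(x, y)
]

var_true, var_imp = statistics.variance(y_true), statistics.variance(y_imp)
print(var_true, var_imp)
```

The completed-data variance is now close to that of the full data. The price is extra random noise in any estimate computed from the completed data, which is the efficiency loss noted above.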

19


Hot Deck Imputation



• Replace missings with values from similar sampling units

- Unbiased only under MCAR

- Inefficient estimators
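One simple way to define "similar" is membership in the same group, as in the sketch below. The grouping variable, group means, and 30% missingness rate are hypothetical; real hot deck procedures use a variety of matching rules.

```python
import random
import statistics

random.seed(5)

# Hypothetical survey: each unit belongs to a group, and Y differs by group.
n = 4000
group = [random.choice(["a", "b"]) for _ in range(n)]
y_true = [random.gauss(10 if g == "a" else 20, 2) for g in group]
y = [None if random.random() < 0.3 else yi for yi in y_true]

# Donor pools: the observed Y values, kept separately for each group.
donors = {"a": [], "b": []}
for g, yi in zip(group, y):
    if yi is not None:
        donors[g].append(yi)

# Hot deck: each missing Y receives the value of a random donor from the same group.
y_imp = [random.choice(donors[g]) if yi is None else yi for g, yi in zip(group, y)]

mean_true, mean_imp = statistics.mean(y_true), statistics.mean(y_imp)
print(mean_true, mean_imp)
```

Because imputed values are real observed values, they are always plausible, but no explicit model states what "similar" is assumed to mean.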

20


Cold Deck Imputation



• Replace missings with values from a source outside of the current analysis’ data.

- Theory for cold deck is lacking or obvious

21


Composite Approaches

Example: Hot Deck + Regression Imputation (RI) in a Longitudinal Design

1) Find conditional expectation for missing value

2) Obtain a hot deck residual

3) Combine the hot deck residual and expectation to provide imputed value
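The three steps can be sketched as follows. The two-wave data, the linear model, and the choice to draw residuals at random from all complete cases are assumptions made for this illustration.

```python
import random
import statistics

random.seed(6)

# Hypothetical two-wave study: wave 1 (x) is complete, wave 2 (y) has dropouts.
n = 3000
x = [random.gauss(50, 10) for _ in range(n)]
y_true = [5 + 0.9 * xi + random.gauss(0, 4) for xi in x]
y = [None if random.random() < 0.3 else yi for yi in y_true]

# Least-squares fit of wave 2 on wave 1 using the complete cases.
obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
mx = statistics.mean(xi for xi, _ in obs)
my = statistics.mean(yi for _, yi in obs)
b = sum((xi - mx) * (yi - my) for xi, yi in obs) / sum((xi - mx) ** 2 for xi, _ in obs)
a = my - b * mx

# Donor pool for step 2: observed residuals from the complete cases.
residuals = [yi - (a + b * xi) for xi, yi in obs]

y_imp = []
for xi, yi in zip(x, y):
    if yi is None:
        expectation = a + b * xi                 # 1) conditional expectation
        hot_residual = random.choice(residuals)  # 2) hot deck residual
        yi = expectation + hot_residual          # 3) combine the two
    y_imp.append(yi)

var_true, var_imp = statistics.variance(y_true), statistics.variance(y_imp)
print(var_true, var_imp)
```

Drawing residuals from observed cases avoids assuming a normal error distribution, which is the main difference from stochastic regression imputation.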

22

Properties of Imputation

+ Can be more powerful than complete-case analysis

+ Imputation produces completed-case data that can be plugged into standard analyses

- Variances can be biased

- P-values can be overly significant

23

Some Take Homes

• Imputation should be conditional (regression or matched cases)

• Multivariate

• Draws from distributions, not expected values

• Use when there are few missing values

• Key problem: inferences about parameters based on completed data don’t account for imputation uncertainty (Little & Rubin 2002).

24

Methods Addressing Imputation Uncertainty

• Replication Methods: jackknife or bootstrap the analysis

- Require large samples (Little & Rubin 2002)

+ Can be easy

• Multiple Imputation: create multiple imputed data tables; the variability among the completed-data analyses is integrated into the assessment of parameter estimate uncertainty
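A rough sketch of multiple imputation for a single mean is given below. Several choices are assumptions for this example rather than part of the talk: the data, the number of imputations `m`, and the use of a bootstrap refit of the regression before each imputation as a simple stand-in for drawing the model parameters from their posterior. The estimates are pooled with Rubin's combining rules.

```python
import random
import statistics

random.seed(7)

n, m = 2000, 20  # sample size and number of imputed data sets (both hypothetical)
x = [random.gauss(0, 1) for _ in range(n)]
y = [2 + xi + random.gauss(0, 1) for xi in x]
y = [None if random.random() < 0.4 else yi for yi in y]
obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]

def fit(pairs):
    """Least-squares intercept, slope, and residual SD of y on x."""
    mx = statistics.mean(p[0] for p in pairs)
    my = statistics.mean(p[1] for p in pairs)
    b = sum((px - mx) * (py - my) for px, py in pairs) / sum((px - mx) ** 2 for px, _ in pairs)
    a = my - b * mx
    sd = (sum((py - a - b * px) ** 2 for px, py in pairs) / (len(pairs) - 2)) ** 0.5
    return a, b, sd

estimates, variances = [], []
for _ in range(m):
    # Refit on a bootstrap sample so the imputation model itself varies.
    a, b, sd = fit(random.choices(obs, k=len(obs)))
    completed = [
        a + b * xi + random.gauss(0, sd) if yi is None else yi
        for xi, yi in zip(x, y)
    ]
    estimates.append(statistics.mean(completed))         # estimate of the mean of Y
    variances.append(statistics.variance(completed) / n)  # its completed-data variance

# Rubin's rules: pooled estimate, within- and between-imputation variance.
q_bar = statistics.mean(estimates)
within = statistics.mean(variances)
between = statistics.variance(estimates)
total_var = within + (1 + 1 / m) * between
print(q_bar, within, between, total_var)
```

The between-imputation term is what single imputation leaves out: it carries the extra uncertainty due to not knowing the missing values.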

25

MI vs. Resampling

• Both make assumptions about the predictive distributions

• In large samples, resampling produces consistent estimates of variance with minimal assumptions, whereas MI variance estimates are strongly tied to the model and MVM (Little & Rubin 2002)

• MI can have Bayesian motivations, rendering it more applicable in small samples than resampling (Little & Rubin 2002)

• Must assume a stochastic distribution in MI

26

Model Based Approach: Maximum Likelihood

• Maximum likelihood (ML) is a method of estimating parameters, as is ordinary least squares

+ When ML is applied to incomplete data, the mean and covariance estimates are unbiased (under MAR)

27

Model Based Approach: Maximum Likelihood

• Accept a probability density function

• Calculate likelihood function

• Maximize likelihood

Solutions may not be achievable with incomplete cases (cases with missing values)
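As a concrete instance of the three steps, assume complete, independent observations y_1, ..., y_n from a univariate normal distribution with mean μ and variance σ². The distribution is an assumption chosen for illustration. The density, the log likelihood, and its maximizers are then:

```latex
% Step 1: accept a probability density function
f(y_i \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\!\left( -\frac{(y_i - \mu)^2}{2\sigma^2} \right)

% Step 2: calculate the (log) likelihood of the sample
\ell(\mu, \sigma^2) = -\frac{n}{2} \ln(2\pi\sigma^2)
  - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2

% Step 3: maximize by setting the partial derivatives to zero
\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} y_i , \qquad
\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{\mu})^2
```

With complete data the maximizers have this closed form. With incomplete cases the likelihood usually has to be maximized iteratively.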

28

Maximum Likelihood Estimators

[Equations: two estimators, each followed by the value it converges to]

29

ML with Missing Values

• Under MAR, the marginal distribution of the observed data provides the correct likelihood for the unknown parameters, provided that the model is realistic.

• This means, ML can be directly applied to the incomplete data.

• But, the math gets much harder.
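The first point can be written out as follows. The notation is an assumption made here: Y_obs and Y_mis are the observed and missing parts of the data, R is the missingness indicator, θ holds the parameters of interest, and ψ holds the parameters of the missingness mechanism.

```latex
% Joint density of the observed data and the missingness pattern
f(Y_{obs}, R \mid \theta, \psi)
  = \int f(Y_{obs}, Y_{mis} \mid \theta)\,
         f(R \mid Y_{obs}, Y_{mis}, \psi)\, dY_{mis}

% Under MAR the mechanism does not depend on Y_mis, so it leaves the integral
f(R \mid Y_{obs}, Y_{mis}, \psi) = f(R \mid Y_{obs}, \psi)
\;\Longrightarrow\;
f(Y_{obs}, R \mid \theta, \psi)
  = f(R \mid Y_{obs}, \psi) \int f(Y_{obs}, Y_{mis} \mid \theta)\, dY_{mis}

% Likelihood for theta based on the observed data alone
L(\theta \mid Y_{obs}) \propto f(Y_{obs} \mid \theta)
  = \int f(Y_{obs}, Y_{mis} \mid \theta)\, dY_{mis}
```

Provided θ and ψ are distinct, the factor involving ψ does not affect the maximization over θ. The integral over Y_mis is where the math gets harder.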

30

EM Algorithm

General

• Find the conditional expectation of the “missing data functions” given current estimates

• Maximize the new completed-data log likelihood to get new parameter estimates

• Reiterate steps until estimates stabilize

Multivariate Normal

• Regression imputation of missing values using the means and entire covariance matrix

• Re-estimate the means and covariance matrix with the imputed values
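For a bivariate normal case the two multivariate normal steps can be sketched in a few lines. This is an illustrative implementation under stated assumptions: X is fully observed, Y is missing at random depending on X, the data are simulated, and the number of iterations is fixed at 100 rather than checked for convergence. Note that the E step fills in expected sums and sums of squares, so the residual variance is added to the squared prediction for each missing Y.

```python
import random

random.seed(8)

# Hypothetical data: Y = 1 + 2X + noise, so E[Y] = 1, Var(Y) = 5, Cov(X, Y) = 2.
n = 4000
x = [random.gauss(0, 1) for _ in range(n)]
y = [1 + 2 * xi + random.gauss(0, 1) for xi in x]
# MAR: Y is missing 60% of the time when the observed X is positive.
y = [None if (xi > 0 and random.random() < 0.6) else yi for xi, yi in zip(x, y)]

# X is complete, so its mean and variance are fixed. Starting values for Y
# come from the complete cases; the covariance starts at zero.
mu_x = sum(x) / n
s_xx = sum((xi - mu_x) ** 2 for xi in x) / n
seen = [yi for yi in y if yi is not None]
cc_mean = sum(seen) / len(seen)
mu_y = cc_mean
s_yy = sum((yi - mu_y) ** 2 for yi in seen) / len(seen)
s_xy = 0.0

for _ in range(100):
    # E step: expected sufficient statistics given the current estimates.
    beta = s_xy / s_xx
    resid_var = s_yy - beta * s_xy
    sum_y = sum_yy = sum_xy = 0.0
    for xi, yi in zip(x, y):
        if yi is None:
            ey = mu_y + beta * (xi - mu_x)  # regression prediction of Y
            eyy = ey ** 2 + resid_var       # expected Y squared, not prediction squared
        else:
            ey, eyy = yi, yi ** 2
        sum_y += ey
        sum_yy += eyy
        sum_xy += xi * ey
    # M step: re-estimate the mean and (co)variances from those statistics.
    mu_y = sum_y / n
    s_yy = sum_yy / n - mu_y ** 2
    s_xy = sum_xy / n - mu_x * mu_y

print(cc_mean, mu_y, s_yy, s_xy)
```

In this simulation the complete-case mean of Y is pulled well below the true value of 1 because high-X cases are under-represented, while the EM estimate recovers it.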


31

EM Algorithm: Observations Are Not Imputed

Generally, it is the missing sufficient statistics, rather than the observations themselves, that need to be re-estimated.

Consider the trivial case of univariate normal data (Y).

E step ->
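One standard way to write this case is shown below. The notation is an assumption made for this sketch: y_1, ..., y_r are observed, the remaining n - r values are missing under an ignorable mechanism, and (μ^(t), σ^2(t)) are the estimates at iteration t.

```latex
% E step: expected sufficient statistics given the current estimates
E\!\left[ \sum_{i=1}^{n} y_i \,\middle|\, Y_{obs}, \mu^{(t)}, \sigma^{2(t)} \right]
  = \sum_{i=1}^{r} y_i + (n - r)\, \mu^{(t)}

E\!\left[ \sum_{i=1}^{n} y_i^2 \,\middle|\, Y_{obs}, \mu^{(t)}, \sigma^{2(t)} \right]
  = \sum_{i=1}^{r} y_i^2 + (n - r) \left[ \left( \mu^{(t)} \right)^2 + \sigma^{2(t)} \right]

% M step: complete-data ML estimates computed from the expected statistics
\mu^{(t+1)} = \frac{1}{n}\, E\!\left[ \sum_{i=1}^{n} y_i \,\middle|\, Y_{obs}, \mu^{(t)}, \sigma^{2(t)} \right]

\sigma^{2(t+1)} = \frac{1}{n}\, E\!\left[ \sum_{i=1}^{n} y_i^2 \,\middle|\, Y_{obs}, \mu^{(t)}, \sigma^{2(t)} \right]
  - \left( \mu^{(t+1)} \right)^2
```

Setting the estimates at iterations t and t + 1 equal gives the fixed point: μ is the mean of the r observed values and σ² is their variance with divisor r. The expected sum of squares includes the term σ^2(t) for each missing value, which would be lost if missing observations were simply replaced by μ^(t).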

32

The E step

33

The M step

34

EM Algorithm

Concept

• Find the conditional expectation of the “missing data functions” given current estimates

• Maximize the new completed-data log likelihood to get new parameter estimates


Explicit Formalization


36

Take Homes on ML

• If the model and MVM assumptions are good, ML is likely to be the best alternative

• But, need specialized software or special statisticians to help out

37

Comparison of Methods

• Single Imputation Methods (bad)

• Complete-case (bad-good)

• Conditional Imputation (possibly okay)

• Multiple Imputation (okay-good)

• Maximum Likelihood (okay-good+)

38

Nonignorable Missing Data Models

• Typically, missing data are NMAR

• Include the missing value function in the likelihood

• Need to know something about the function

39

What is Missing?

• Issues and methods for test statistics need further development or exposure

– For MI there are okay, but not fully satisfactory, approaches; these include the Wald test, likelihood ratio test, and combined chi-squared tests (Schafer 1997).

– For ML, the coverage of hypothesis testing in the literature is unsatisfactory.

40

Notes on the Literature

• New but growing (hub and spokes)

• Often too focused on one aspect of mathematical statistics, or too applied without clear support, and not comparative

41

Objectives

• Introduce the main concepts of missing value analysis (MVA)

• Provide some tentative guidance on MVA

• Develop a discussion of MVA’s impact on research interpretations

• Identify where we should prioritize our development of MVA tools
