introduction to mixed model and missing data issues in...


Introduction to mixed model and missing data issues in longitudinal studies

Hélène Jacqmin-Gadda

INSERM, U897, Bordeaux, France

Inserm workshop, St Raphael


Outline of the talk

Introduction

Mixed models

Typology of missing data

Exploring incomplete data

Methods for MAR data

Conclusion


Longitudinal data : definition

Definition : Variables measured at several times on the same subjects

Examples :

• Repeated measures of biological markers (CD4, HIV RNA) in HIV patients

• Repeated measures of neuropsychological tests to study cognitive aging

• Repeated events : dental caries, absences from school or job, ...


Longitudinal data analysis

Objective :

• Describe change of the variable with time

• Identify factors associated with change

Problem : Intra-subject correlation


Example : HIV clinical trial

$X_i = 1$ if treatment A,

$X_i = 0$ if treatment B

Criterion : Change over time of CD4

Repeated measures of CD4 over the follow-up period.

$t = 0$ at initiation of treatment.

$Y_{ij}$ = CD4 measure for subject $i$ at time $t_{ij}$, $i = 1, \dots, N$, $j = 1, \dots, n_i$.


Analysis assuming independence

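$Y_{ij} = \beta_0 + \beta_1 t_{ij} + \beta_2 X_i + \beta_3 X_i t_{ij} + \epsilon_{ij}$

with $\epsilon_{ij} \sim N(0, \sigma^2)$ and $\epsilon_{ij} \perp \epsilon_{ij'}$ for $j \neq j'$

Intra-subject correlation

→ $\widehat{\mathrm{Var}}(\hat{\beta})$ biased

→ Tests for $\beta$ biased

For a time-independent covariate :

• $\mathrm{Var}(\hat{\beta}_2)$ under-estimated

• Tests for $H_0 : \beta_2 = 0$ anti-conservative (p-value too small)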
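The under-estimation can be checked with a small simulation. The sketch below is only an illustration: the sample sizes, the variance values and the function name `simulate_naive_ols` are assumptions, not part of the talk. It generates data with a random intercept, fits the independence model by ordinary least squares, and compares the average naive standard error of $\hat{\beta}_2$ with its empirical standard deviation across replications.

```python
import numpy as np

def simulate_naive_ols(n_subjects=100, n_visits=5, sigma0=2.0, sigma=1.0,
                       n_rep=300, seed=0):
    """Return (mean naive SE of beta2-hat, empirical SD of beta2-hat)."""
    rng = np.random.default_rng(seed)
    # Long format: subject i occupies rows i*n_visits ... (i+1)*n_visits - 1
    t = np.tile(np.arange(n_visits, dtype=float), n_subjects)          # t_ij
    x = np.repeat(np.arange(n_subjects) % 2, n_visits).astype(float)   # X_i
    design = np.column_stack([np.ones_like(t), t, x, x * t])
    xtx_inv = np.linalg.inv(design.T @ design)
    beta2_hat = np.empty(n_rep)
    naive_se = np.empty(n_rep)
    for r in range(n_rep):
        # Random intercept shared by all measures of a subject (all betas = 0)
        gamma0 = np.repeat(rng.normal(0.0, sigma0, n_subjects), n_visits)
        y = gamma0 + rng.normal(0.0, sigma, t.size)
        beta_hat = xtx_inv @ design.T @ y
        resid = y - design @ beta_hat
        s2 = resid @ resid / (t.size - design.shape[1])
        beta2_hat[r] = beta_hat[2]
        naive_se[r] = np.sqrt(s2 * xtx_inv[2, 2])   # SE assuming independence
    return naive_se.mean(), beta2_hat.std(ddof=1)

mean_naive_se, empirical_sd = simulate_naive_ols()
print(f"naive SE: {mean_naive_se:.3f}  empirical SD: {empirical_sd:.3f}")
```

With these illustrative settings the naive standard error should come out smaller than the true sampling variability, which is what makes the test anti-conservative.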


Linear mixed model with random intercept

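$Y_{ij} = (\beta_0 + \gamma_{0i}) + \beta_1 t_{ij} + \beta_2 X_i + \beta_3 X_i t_{ij} + \epsilon_{ij}$

with $\gamma_{0i} \sim N(0, \sigma_0^2)$, $\epsilon_{ij} \sim N(0, \sigma^2)$ and $\epsilon_{ij} \perp \epsilon_{ij'}$ for $j \neq j'$

• $\gamma_{0i}$ are random variables

• Only one additional parameter : $\sigma_0^2$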


Linear mixed model with random intercept (2)

• Population (marginal) mean :

$E(Y_{ij}) = \beta_0 + \beta_1 t_{ij} + \beta_2 X_i + \beta_3 X_i t_{ij}$

• Subject-specific (conditional) mean :

$E(Y_{ij} \mid \gamma_{0i}) = (\beta_0 + \gamma_{0i}) + \beta_1 t_{ij} + \beta_2 X_i + \beta_3 X_i t_{ij}$

• Assumes a common correlation between all the repeated measures
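The common correlation follows directly from the model. A short derivation, assuming (as is usual, although not stated on the slide) that $\gamma_{0i}$ is independent of the errors $\epsilon_{ij}$:

```latex
\begin{aligned}
\mathrm{Var}(Y_{ij}) &= \sigma_0^2 + \sigma^2, \\
\mathrm{Cov}(Y_{ij}, Y_{ij'}) &= \mathrm{Var}(\gamma_{0i}) = \sigma_0^2 \quad (j \neq j'), \\
\mathrm{Corr}(Y_{ij}, Y_{ij'}) &= \frac{\sigma_0^2}{\sigma_0^2 + \sigma^2}.
\end{aligned}
```

The correlation is the same for every pair of measures of a subject, whatever the time lag between them.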


Linear mixed model with random intercept and slope

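$Y_{ij} = (\beta_0 + \gamma_{0i}) + (\beta_1 + \gamma_{1i}) t_{ij} + \beta_2 X_i + \beta_3 X_i t_{ij} + \epsilon_{ij}$,

$\gamma_{0i} \sim N(0, \sigma_0^2)$, $\gamma_{1i} \sim N(0, \sigma_1^2)$, $\epsilon_{ij} \sim N(0, \sigma^2)$, $\epsilon_{ij} \perp \epsilon_{ij'}$ for $j \neq j'$

• Population (marginal) mean :

$E(Y_{ij}) = \beta_0 + \beta_1 t_{ij} + \beta_2 X_i + \beta_3 X_i t_{ij}$

• Subject-specific (conditional) mean :

$E(Y_{ij} \mid \gamma_i) = (\beta_0 + \gamma_{0i}) + (\beta_1 + \gamma_{1i}) t_{ij} + \beta_2 X_i + \beta_3 X_i t_{ij}$

• The correlation between repeated measures depends on the measurement times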
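To see why the correlation depends on the measurement times, the variance and covariance can be written out. In this sketch $\sigma_{01} = \mathrm{Cov}(\gamma_{0i}, \gamma_{1i})$ is an added symbol (the slide does not specify it; set $\sigma_{01} = 0$ if the two random effects are taken as independent), and the random effects are assumed independent of the errors:

```latex
\begin{aligned}
\mathrm{Var}(Y_{ij}) &= \sigma_0^2 + 2\,\sigma_{01}\, t_{ij} + \sigma_1^2\, t_{ij}^2 + \sigma^2, \\
\mathrm{Cov}(Y_{ij}, Y_{ij'}) &= \sigma_0^2 + \sigma_{01}\,(t_{ij} + t_{ij'}) + \sigma_1^2\, t_{ij}\, t_{ij'} \quad (j \neq j').
\end{aligned}
```

Both quantities change with $t_{ij}$ and $t_{ij'}$, so the variance and the correlation are no longer constant over time.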


Linear mixed model : general formulation

$Y_{ij} = X_{ij}^T \beta + Z_{ij}^T \gamma_i + \epsilon_{ij}$

$\gamma_i \sim N(0, B)$ and $\epsilon_i \sim N(0, R_i)$.

$X_{ij}$ : vector of explanatory variables
$\beta$ : vector of fixed effects
$Z_{ij}$ : sub-vector of $X_{ij}$ (including functions of time)
$\gamma_i$ : vector of random effects

Population (marginal) mean : $E(Y_{ij}) = X_{ij}^T \beta$

Subject-specific (conditional) mean : $E(Y_{ij} \mid \gamma_i) = X_{ij}^T \beta + Z_{ij}^T \gamma_i$


Linear mixed model : example

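Linear mixed model with AR Gaussian error

$Y_{ij} = (\beta_0 + \gamma_{0i}) + (\beta_1 + \gamma_{1i}) t_{ij} + \beta_2 X_i + \beta_3 X_i t_{ij} + w_{ij} + e_{ij}$

with $\gamma_i^T = (\gamma_{0i}, \gamma_{1i}) \sim N(0, B)$,

$e_{ij} \sim N(0, \sigma^2)$, $e_{ij} \perp e_{ij'}$ for $j \neq j'$,

$w_{ij} \sim N(0, \sigma_w^2)$ and $\mathrm{Corr}(w_{ij}, w_{ij'}) = \exp(-\delta\,|t_{ij} - t_{ij'}|)$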
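The covariance structure implied by this model can be assembled term by term. The following is a minimal sketch: the function name `marginal_covariance` and all numerical values are illustrative assumptions, and the three components (random effects, autoregressive process, measurement error) are assumed mutually independent.

```python
import numpy as np

def marginal_covariance(times, B, sigma2_w, delta, sigma2_e):
    """Covariance matrix of Y_i for random intercept and slope + AR error.

    times    : measurement times t_i1, ..., t_in_i
    B        : 2x2 covariance matrix of (gamma_0i, gamma_1i)
    sigma2_w : variance of the autoregressive process w_ij
    delta    : decay rate of Corr(w_ij, w_ij') = exp(-delta * |t_ij - t_ij'|)
    sigma2_e : variance of the independent error e_ij
    """
    t = np.asarray(times, dtype=float)
    Z = np.column_stack([np.ones_like(t), t])     # random intercept and slope
    lag = np.abs(t[:, None] - t[None, :])         # |t_ij - t_ij'|
    ar_part = sigma2_w * np.exp(-delta * lag)
    return Z @ B @ Z.T + ar_part + sigma2_e * np.eye(t.size)

# Illustrative values only
B = np.array([[4.0, 0.5],
              [0.5, 1.0]])
V = marginal_covariance([0, 1, 3, 5], B, sigma2_w=2.0, delta=0.7, sigma2_e=1.0)
print(np.round(V, 2))
```

The autoregressive term makes measures that are close in time more strongly correlated than distant ones, on top of the correlation induced by the random effects.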


Linear mixed model : Estimation

• Maximum likelihood estimator

• $Y_i = (Y_{i1}, \dots, Y_{ij}, \dots, Y_{in_i})^T$ is multivariate Gaussian with

  • mean $X_i \beta$
  • covariance matrix $V_i = Z_i B Z_i^T + R_i$

• Software : SAS Proc MIXED, R lme, Stata
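As an illustration of what the software maximizes, the contribution of one subject to the Gaussian log-likelihood can be computed directly from $X_i \beta$ and $V_i$. This is only a sketch with hypothetical names (`subject_loglik`, `total_loglik`) and made-up numbers; real packages add an optimizer and usually offer restricted maximum likelihood (REML) as well.

```python
import numpy as np

def subject_loglik(y, X, beta, V):
    """Log-density of Y_i ~ N(X_i beta, V_i) for one subject."""
    y = np.asarray(y, dtype=float)
    resid = y - X @ beta
    _, logdet = np.linalg.slogdet(V)              # log |V_i|
    quad = resid @ np.linalg.solve(V, resid)      # resid' V_i^{-1} resid
    return -0.5 * (y.size * np.log(2.0 * np.pi) + logdet + quad)

def total_loglik(subjects, beta):
    """Sum over subjects; `subjects` is a list of (y, X, V) triples."""
    return sum(subject_loglik(y, X, beta, V) for y, X, V in subjects)

# Illustrative subject with three visits, random intercept and slope
times = np.array([0.0, 1.0, 3.0])
X = np.column_stack([np.ones(3), times])   # fixed effects: intercept, time
Z = X                                      # same design for the random effects
B = np.array([[4.0, 0.0],
              [0.0, 1.0]])
V = Z @ B @ Z.T + 1.0 * np.eye(3)          # V_i = Z_i B Z_i' + R_i, R_i = I
ll = subject_loglik(np.array([10.0, 9.0, 7.5]), X, np.array([10.0, -1.0]), V)
print(ll)
```

Because each subject brings its own $X_i$, $Z_i$ and $V_i$, subjects may have different numbers and times of measurements.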


Generalized linear mixed model

$Y_{ij}$ follows a distribution from the exponential family and

$g(E(Y_{ij} \mid \gamma_i)) = X_{ij}^T \beta + Z_{ij}^T \gamma_i$ with $\gamma_i \sim N(0, B)$.

• Example : Logistic mixed model

$\mathrm{logit}(\Pr(Y_{ij} = 1 \mid \gamma_i)) = X_{ij}^T \beta + Z_{ij}^T \gamma_i$ with $\gamma_i \sim N(0, B)$.

• Maximum likelihood estimation : Numerical integration

• Software : SAS Proc NLMIXED, R (e.g. glmer in lme4), Stata
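The numerical integration step can be illustrated for the simplest case, a logistic model with a single random intercept $\gamma_{0i} \sim N(0, \sigma_0^2)$. The sketch below uses plain (non-adaptive) Gauss-Hermite quadrature from NumPy; the function name `marginal_likelihood` and the argument `eta_fixed` (the fixed part $X_{ij}^T \beta$ at each visit) are assumptions made for the example, and production software typically relies on adaptive quadrature or a Laplace approximation instead.

```python
import numpy as np

def marginal_likelihood(y, eta_fixed, sigma0, n_nodes=30):
    """Pr(Y_i = y), integrating the random intercept out by quadrature.

    y         : binary responses of one subject
    eta_fixed : fixed-effect linear predictor X_ij' beta for each visit
    sigma0    : standard deviation of the random intercept
    """
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    gamma = np.sqrt(2.0) * sigma0 * nodes           # change of variable
    eta = np.asarray(eta_fixed, dtype=float)[None, :] + gamma[:, None]
    p = 1.0 / (1.0 + np.exp(-eta))                  # Pr(Y_ij = 1 | gamma)
    y = np.asarray(y, dtype=float)[None, :]
    cond_lik = np.prod(p ** y * (1.0 - p) ** (1.0 - y), axis=1)
    return float(weights @ cond_lik / np.sqrt(np.pi))

# Illustrative subject: three visits, responses 1, 1, 0
lik = marginal_likelihood([1, 1, 0], eta_fixed=[0.5, 0.2, -0.1], sigma0=1.5)
print(lik)
```

The full likelihood is the product of such terms over subjects; with several random effects the integral becomes multidimensional, which is why estimation is much heavier than in the linear case.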


Typology of missing data in longitudinal studies

Notation :

$Y_i = (Y_{obs,i}, Y_{mis,i})$ with $Y_{obs,i}$ the observed part of $Y_i$ and $Y_{mis,i}$ the missing part,

$R_{ij} = 1$ if $Y_{ij}$ is observed and $R_{ij} = 0$ if $Y_{ij}$ is missing

$R_i = (R_{i1}, \dots, R_{ij}, \dots, R_{in_i})'$

$X_i$ : explanatory variables, completely observed


Typology of missing data (2)

Monotone missing data = dropout : $P(R_{ij} = 0 \mid R_{i,j-1} = 0) = 1$

$R_i$ may be summarized by the time to dropout $T_i$ and an indicator for dropout $\delta_i$

Intermittent missing data : $P(R_{ij} = 0 \mid R_{i,j-1} = 0) < 1$
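In practice the pattern of each subject can be read off the vector $R_i$. A small sketch (the function name `classify_pattern` and the returned labels are hypothetical); a subject with an intermittent gap followed by a dropout is labelled "intermittent" here, which is one possible convention among others:

```python
def classify_pattern(r):
    """Classify a response-indicator vector R_i = (R_i1, ..., R_in).

    Returns (pattern, dropout_index), where dropout_index is the 0-based
    index of the first missing visit for a dropout and None otherwise.
    """
    r = list(r)
    if all(r):
        return "complete", None
    first_missing = r.index(0)
    if any(r[first_missing:]):
        # At least one observed value after a missing one
        return "intermittent", None
    return "dropout", first_missing

print(classify_pattern([1, 1, 1, 1]))   # complete
print(classify_pattern([1, 1, 0, 0]))   # dropout at the third visit
print(classify_pattern([1, 0, 1, 1]))   # intermittent
```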



Missing completely at random (MCAR) : $P(R_{ij} = 1)$ is constant

The observed sample is representative of the whole sample.

→ Loss of precision, no bias

Covariate-dependent missingness process : $P(R_{ij} = 1) = f(X_i)$

→ Loss of precision, no bias if analyses are adjusted for $X_i$



Missing at random (MAR) : $P(R_{ij} = 1) = f(Y_{obs,i}, X_i)$

Example : Probability of dropout depends on past observed values

→ Loss of precision, no bias with appropriate statistical methods

Informative or MNAR (missing not at random) : $P(R_{ij} = 1) = f(Y_{mis,i}, Y_{obs,i}, X_i)$

Example : Probability that $Y$ is observed depends on the current value of $Y$

→ Loss of precision, bias

→ Sensitivity analyses
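The consequences of the three mechanisms can be illustrated by simulation. The sketch below is an assumption-laden toy example, not an analysis from the talk: two visits, $Y_1$ always observed, $Y_2$ possibly missing, true mean of $Y_2$ equal to 0, and arbitrary logistic missingness models. It compares the naive mean of the observed $Y_2$ with a simple method that is valid under MAR here: regress $Y_2$ on $Y_1$ among the subjects with $Y_2$ observed, then average the predictions over all subjects.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
y1 = rng.normal(0.0, 1.0, n)                  # always observed
y2 = 0.8 * y1 + rng.normal(0.0, 0.6, n)       # true mean of Y2 is 0

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

# Indicator that Y2 is observed (R = 1) under each mechanism
observed = {
    "MCAR": rng.random(n) < 0.5,              # constant probability
    "MAR": rng.random(n) < expit(2.0 * y1),   # depends on the observed Y1
    "MNAR": rng.random(n) < expit(2.0 * y2),  # depends on Y2 itself
}

def naive_mean(obs):
    """Mean of Y2 among subjects with Y2 observed."""
    return y2[obs].mean()

def regression_mean(obs):
    """Fit Y2 ~ Y1 on observed cases, average predictions over everyone."""
    slope, intercept = np.polyfit(y1[obs], y2[obs], 1)
    return (intercept + slope * y1).mean()

results = {k: (naive_mean(o), regression_mean(o)) for k, o in observed.items()}
for name, (naive, reg) in results.items():
    print(f"{name}: naive mean = {naive:.3f}, regression-based mean = {reg:.3f}")
```

Under MCAR both estimates should be close to 0; under MAR the naive mean is biased while the regression-based estimate is not; under MNAR both are biased, which is why sensitivity analyses are needed.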


Exploring incomplete data

• Describe missing data frequency

• Cross classify missing data patterns with covariates

• Compare mean evolution for available data and complete cases

• Compare mean evolution until time $t$ given observation status at time $t + 1$

• Logistic regression for $P(R_{ij} = 1)$ given covariates and $Y_{ik}$, $k < j$

• Cox regression for time to dropout given covariates

→ Impossible to distinguish MAR from MNAR


An example : Paquid data set

The Paquid Cohort in Gironde

• 2792 subjects aged 65 years and older at baseline

• Living at home at the beginning of the study (1988) in Gironde (France)

• Seen at home at 1, 3, 5, 8, and 10 years after the baseline visit

• Cognitive measure : Wechsler Digit Symbol Substitution Test (DSST; measures attention, time limit of 90 s)

Sample :

• 2026 subjects

• without diagnosis of dementia between T0 and T10

• with the test completed at least once (at T0)


Description of dropout : Kaplan-Meier

Dropout time (=event) : first visit with missing score

[Figure: Probability of remaining in the cohort. Kaplan-Meier estimate with 95% confidence interval. x-axis: follow-up time (0 to 10 years); y-axis: probability (0.2 to 1).]
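For reference, the Kaplan-Meier estimate behind such a curve can be computed in a few lines. This is a minimal sketch on made-up data (the function name `kaplan_meier` and the toy values are assumptions, not Paquid data); the event is dropout, subjects still followed at the last visit are censored, and the confidence interval (e.g. via Greenwood's formula) is omitted.

```python
import numpy as np

def kaplan_meier(time, event):
    """Kaplan-Meier estimate of the probability of still being in the cohort.

    time  : dropout time or censoring time for each subject
    event : 1 if the subject dropped out at `time`, 0 if censored
    Returns the distinct dropout times and the estimate just after each.
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    event_times = np.unique(time[event])
    surv = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(time >= t)               # still followed just before t
        dropouts = np.sum((time == t) & event)
        s *= 1.0 - dropouts / at_risk
        surv.append(s)
    return event_times, np.array(surv)

# Toy data: visit times in years, 1 = dropout, 0 = censored
times, surv = kaplan_meier([1, 3, 3, 5, 10, 10], [1, 1, 0, 1, 0, 0])
for t, s in zip(times, surv):
    print(f"t = {t:g}: {s:.3f}")
```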


Observed means of the DSST score given time

[Figure: Observed mean DSST score (y-axis, 10 to 40) by age class (x-axis: 65-69 years, 70-74, 75-79, 80 and over). Series: available data.]


Observed means of the DSST score given time

[Figure: Observed mean DSST score (y-axis, 10 to 40) by age class (x-axis: 65-69 years, 70-74, 75-79, 80 and over). Series: complete data, available data.]


Logistic regression model for dropout in the first 5 years

Covariates                          OR      95% CI of the OR
T3                                  0.02    0.003 - 0.10
T5                                  0.01    0.001 - 0.09
age                                 1.01    0.99 - 1.02
age × T3                            1.05    1.03 - 1.08
age × T5                            1.06    1.03 - 1.09
previous MMSE score                 0.91    0.88 - 0.93
men                                 0.86    0.75 - 0.99
Education (vs university level)
  No education                      1.88    1.15 - 3.07
  No diploma                        2.02    1.39 - 2.93
  CEP                               1.67    1.17 - 2.40
  High school level                 1.39    0.96 - 2.00


Methods for MCAR or MAR data

• Complete case analysis (loss of precision, requires MCAR)

• Imputation (requires MCAR or MAR)

• Maximum likelihood using available data (requires MAR)


Maximum likelihood for MAR data (1)

Objective : Estimate $\theta$ from the distribution $f(Y \mid \theta)$

Likelihood of the observed data $(Y_{obs}, R)$ :

$f(Y_{obs}, R \mid \theta, \psi) = \int f(Y_{obs}, Y_{mis} \mid \theta)\, f(R \mid Y_{obs}, Y_{mis}, \psi)\, dY_{mis}$
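The step that makes maximum likelihood on the available data valid under MAR can be sketched as follows. This is the standard ignorability argument (see for instance Little and Rubin); it is given here as a reminder and additionally assumes that $\theta$ and $\psi$ are distinct parameters. Under MAR, the missingness model does not depend on $Y_{mis}$, so it can be taken out of the integral:

```latex
\begin{aligned}
f(Y_{obs}, R \mid \theta, \psi)
  &= \int f(Y_{obs}, Y_{mis} \mid \theta)\, f(R \mid Y_{obs}, \psi)\, dY_{mis} \\
  &= f(R \mid Y_{obs}, \psi) \int f(Y_{obs}, Y_{mis} \mid \theta)\, dY_{mis} \\
  &= f(R \mid Y_{obs}, \psi)\, f(Y_{obs} \mid \theta).
\end{aligned}
```

The factor involving $\psi$ does not depend on $\theta$, so maximizing $f(Y_{obs} \mid \theta)$ alone gives the same estimate of $\theta$: the missingness process can be ignored.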


Example : MAR analysis of Paquid data

Mixed effect model

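$Y_{ij}$ : test score for subject $i$ at time $t_{ij}$

$Y_{ij} = (\beta_0 + \mathrm{age}_i' \gamma_0 + \alpha_{0i}) + (\beta_1 + \mathrm{age}_i' \gamma_1 + \alpha_{1i}) \times t_{ij} + \beta_3 I_{\{t_{ij} = 0\}} + e_{ij}$

with $\alpha_i = (\alpha_{0i}, \alpha_{1i})^T \sim N(0, G)$, $e_{ij} \sim N(0, \sigma_e^2)$

$\mathrm{age}_i$ : vector of indicators for baseline age classes (70-74, 75-79, 80 years and older; reference = 65-69)

$I_{\{t_{ij} = 0\}}$ : indicator of the baseline visit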


Observed and predicted means of the score given time

[Figure: Observed and predicted mean score (y-axis, 10 to 40) by age class (x-axis: 65-69 years, 70-74, 75-79, 80 and over). Series: complete data, available data, mixed model (MAR).]


Conclusion

Advantages of mixed models

• use all the available information (repeated measures)

• Flexibly handle intra-subject correlation (unbiased inference)

• Any number and times of measurements

• Robust to missing at random data

• Available in most software packages

Limits of mixed models

• Assume a homogeneous population → extended models including latent classes (mixture)

• As the MAR assumption is uncheckable, complete the study with a sensitivity analysis → extended models for MNAR data


