
  • UNIVERSITY OF CALIFORNIA

    Los Angeles

    Analysis of longitudinal data with missing values

    A dissertation submitted in partial satisfaction

    of the requirements for the degree

    Doctor of Philosophy in Statistics

    by

    Jinhui Li

    2006

  • Copyright by

    Jinhui Li

    2006

  • The dissertation of Jinhui Li is approved.

    Xiaowei Yang

    Thomas R. Belin

    Rick Paik Schoenberg

    Hongquan Xu

    Yingnian Wu, Committee Chair

    University of California, Los Angeles

    2006


  • To my family


  • TABLE OF CONTENTS

    1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

    2 Models for longitudinal data analysis and missing data issue . . . . . . 5

    2.1 Models for longitudinal data analysis . . . . . . . . . . . . . . . . . . 6

    2.1.1 Linear models . . . . . . . . . . . . . . . . . . . . . . . . . . 6

    2.1.2 Generalized linear models . . . . . . . . . . . . . . . . . . . 7

    2.1.3 Transition models . . . . . . . . . . . . . . . . . . . . . . . . 8

    2.1.4 Random effect models . . . . . . . . . . . . . . . . . . . . . 9

    2.1.5 Estimation and inference . . . . . . . . . . . . . . . . . . . . 10

    2.2 Missing data issue . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

    2.2.1 Ignorable versus Nonignorable . . . . . . . . . . . . . . . . . 13

    2.3 Modelling with missing values . . . . . . . . . . . . . . . . . . . . . 15

    2.3.1 Multiple partial imputation . . . . . . . . . . . . . . . . . . . . 16

    2.3.2 Selection model . . . . . . . . . . . . . . . . . . . . . . . . . 18

    2.3.3 REMTM for binary response variables . . . . . . . . . . . . . 20

    3 Bayesian framework with MCMC approach . . . . . . . . . . . . . . . 24

    3.1 Bayesian methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

    3.1.1 Bayes Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

    3.1.2 Prior distribution . . . . . . . . . . . . . . . . . . . . . . . . 25

    3.1.3 Bayesian inference . . . . . . . . . . . . . . . . . . . . . . . 26

    3.2 MCMC methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28


  • 3.2.1 Monte Carlo methods . . . . . . . . . . . . . . . . . . . . . . 28

    3.2.2 MCMC fundamentals . . . . . . . . . . . . . . . . . . . . . 30

    3.2.3 Gibbs sampling . . . . . . . . . . . . . . . . . . . . . . . . . 34

    3.2.4 Metropolis-Hastings algorithm . . . . . . . . . . . . . . . . . 34

    3.2.5 Adaptive rejection sampling for Gibbs sampling . . . . . . . 35

    4 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

    4.1 Bayesian Inference using MCMC . . . . . . . . . . . . . . . . . . . 39

    4.1.1 Priors specification and monitoring convergence . . . . . . . 39

    4.1.2 MCMC: Data augmentation via Gibbs sampler . . . . . . . . 40

    4.2 Implementation for selection models . . . . . . . . . . . . . . . . . . 42

    4.2.1 Selection Models with AR(1) Covariance . . . . . . . . . . . 42

    4.2.2 Selection Models with Random-Effects . . . . . . . . . . . . 44

    4.3 Implementation for shared-parameter models . . . . . . . . . . . . . 46

    4.3.1 Count response variables . . . . . . . . . . . . . . . . . . . . 47

    4.3.2 Binary response variables . . . . . . . . . . . . . . . . . . . 49

    4.3.3 Continuous response variables . . . . . . . . . . . . . . . . . 50

    4.4 Model extension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

    4.5 MPI package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

    5 Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

    5.1 Selection models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

    5.1.1 A simulation study . . . . . . . . . . . . . . . . . . . . . . . 55

    5.1.2 Selection Model for Continuous Carbon Monoxide Levels . . 58


  • 5.2 Shared-parameter models . . . . . . . . . . . . . . . . . . . . . . . . 63

    5.2.1 A Simulation Study on count response variables . . . . . . . 63

    5.2.2 Application to Smoking Cessation Data . . . . . . . . . . . . 65

    5.3 Multiple-model sensitivity analysis . . . . . . . . . . . . . . . . . . . 71

    5.3.1 Pattern-Mixture Models for Continuous Carbon Monoxide Levels . . . . . . . . . . . . . . 72

    5.3.2 REMTM for Continuous Carbon Monoxide Data . . . . . . . 74

    6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

    References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81


  • LIST OF FIGURES

    2.1 Three Representations of Missingness Mechanisms. Three ways of modeling incomplete data are depicted here. Parameters and symbols in the figures are defined as: y_i = (y_i^obs, y_i^mis), the observed and missing values for subject i; r_i, the missingness indicators for the repeated measures on subject i; θ, the parameters of the data; φ, the parameters of the missingness indicators; and b_i, the parameters shared by the data and the missingness indicators. . . . . . . . . . . 12

    3.1 The figure shows the idea of Adaptive Rejection Sampling, with functions defined as: h(x) = log(g(x)), the logarithm of a function g(x) proportional to a statistical density function f(x); u(x), the upper hull of h(x); and l(x), the lower hull of h(x). . . . . . . . . . . 36

    5.1 The average and SD curves for the log-scaled carbon monoxide levels. On this plot, the four mean curves of the log-scaled carbon monoxide levels and the corresponding point-wise standard errors are drawn for each of the four treatment conditions: Control, RP-only, CM-only, and RP+CM (RP = Relapse Prevention, CM = Contingency Management). Vertical bars indicate the estimated standard errors of the average carbon monoxide levels. The stars (*) over the x-axis mark the time points (i.e., visit numbers) where the carbon monoxide levels are significantly different according to a point-wise ANOVA (P-value < 0.001). The y-axis indicates values of carbon monoxide levels after the log(1+x) transform. The x-axis represents the clinic visit number for study participants (1, . . . , 36). . . . . 60


  • 5.2 Missingness patterns for the carbon monoxide levels across treatment conditions. For each treatment condition, an image depicts the missingness indicators of carbon monoxide levels for each smoker at each research visit. Dark colored areas indicate that the corresponding carbon monoxide levels were observed, while white colored areas indicate that the corresponding data were missing intermittently or missing after dropout. The four treatment conditions are Control, RP-only, CM-only, and RP+CM (RP = Relapse Prevention, CM = Contingency Management). . . . . 61

    5.3 The Sampling Process and Posterior Distribution of Parameter β1 in the Simulation Study. The left-side figure draws three chains of the augmented Gibbs sampler, each starting from randomly selected points. It can be seen that after around 100 iterations the Gibbs sampler converges. The right-side histogram depicts the posterior distribution of the parameter of interest (i.e., β1). From the histogram, we can see that the mean value of the posterior distribution is very close to the true value (i.e., -1.0). . . . . 66

    5.4 Histograms of estimators of β1 in the Simulation Study. . . . . . . . 66


  • 5.5 Repeated Count Measures Over the 12-week Study Period for the Smoking Cessation Clinical Trial. The left graph plots the repeatedly measured number of smoking episodes for the 90 smokers in the treatment group who received contingency management. The right graph plots the repeatedly measured number of smoking episodes for the 85 smokers in the control group who did not receive contingency management. In both plots, the y-axis indicates the number or count of tobacco uses for the previous week, and the x-axis corresponds to the week number (1-12). The thick solid curves depict the mean profiles in the two groups, while the dashed curves represent the individual profiles. . . . . 67

    5.6 Mean Carbon Monoxide Levels for Completers and Early Terminators. By dividing the 174 smokers into two groups, Completers (n1=112) and Early Terminators (n2=62), the mean curves of carbon monoxide levels for subjects receiving CM (contingency management) and for subjects receiving no CM are depicted within each of the two groups (completers and early terminators). . . . . 73

    5.7 Pattern-Dependent Distribution of Carbon Monoxide Levels. Using the software package named MPI 2.0, profiles and mean curves of carbon monoxide levels are drawn within each of the five groups determined by the dropout times: dropout before week 5 (a), 7 (b), 9 (c), 11 (d), and 12 (e). In plots (a) to (e), green curves correspond to the mean carbon monoxide levels of subjects who received CM (contingency management), red curves indicate the mean curves of the subjects who did not receive CM, and gray-colored dashed lines depict the profiles of all the subjects within each group. Plot (f) depicts all the mean profiles corresponding to the five dropout patterns. . . . . 75


  • LIST OF TABLES

    5.1 Averaged percentage bias (%) in the estimate of slope difference from the 100 simulated data sets. . . . . . . . . . . . . . . . . . . . . . . 57

    5.2 Coverage rate (%) in the estimate of slope difference from the 100 simulated data sets. . . . . . . . . . . . . . . . . . . . . . . . . . . 57

    5.3 Estimated Treatment Effects and Parameters of the Dropout Model for the Four Partially Imputed Carbon Monoxide Data Sets. . . . . . 63

    5.4 Relative Bias (%), Standard Deviations and 95% Posterior Credible Intervals of the Averaged Treatment Effect Estimators (β1) in the Simulation Study. . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

    5.5 Parameter Estimation with Standard Deviations and 95% Posterior Credible Intervals for the Smoking Cessation Data Set. . . . . . . . 69

    5.6 Estimation of the Treatment Effect Parameter (β1) in the Smoking Cessation Study with artificially generated intermittent missing values. . . . . 69

    5.7 Estimation of Parameters in the Smoking Cessation Study with the

    Method of Multiple Partial Imputation. . . . . . . . . . . . . . . . . . 71

    5.8 Estimated Treatment Effect of Contingency Management (β1 (S.D.)) using the Pattern-Mixture Model with two Patterns (Complete vs. Dropout). . . . . 74

    5.9 Estimated Treatment Effect of Contingency Management using the

    Pattern-Mixture Models within the Framework of 2-stage MPI. . . . . 74

    5.10 Posterior Parameter Estimation with Standard Deviations and 95% Credible Intervals Using the REMTM for the Continuous Carbon Monoxide Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76


  • ACKNOWLEDGMENTS

    First and foremost, I would like to express my deepest gratitude to my advisor, my

    collaborator, and my friend, Prof. Yingnian Wu. His continuous guidance, support,

    and care make the research presented in this dissertation possible.

    I also owe my heartfelt gratitude to Prof. Xiaowei Yang of UC Davis, one of the top researchers on missing data modeling, for giving Yingnian and me this most valuable opportunity to work with him and for guiding me all the way through. The dissertation is based on this fruitful collaboration. I admire Prof. Yang's scientific achievements, and I also cherish his warmth and kindness.

    I would also like to thank the former fellow students in the vision lab for helpful discussions. In particular, I thank Dr. Xiangrong Chen, Dr. Cheng-en Guo, and Zijian Xu for their help with programming.

    My sincere thanks go to Prof. Hongquan Xu, Prof. Rick Paik Schoenberg and Prof. Thomas R. Belin for serving on my doctoral committee and giving useful advice on my dissertation.

    Finally, I am forever grateful to my family and friends for their care and help.


  • VITA

    1976 Born, Hubei, P.R.China

    1999 B.S. in Statistics, Peking University, Beijing, P.R.China

    2002 M.S. in Statistics, Peking University, Beijing, P.R.China

    2004 M.S. in Statistics, UCLA

    2002-2006 Graduate Student Researcher & Teaching Assistant, Department of

    Statistics, UCLA

    PUBLICATIONS

    J. Li, X. Yang, Y. Wu, and S. Shoptaw, A Random-effects Markov Transition Model

    for Poisson-distributed Repeated Measures with Nonignorable Missing Values, 2005

    Proceedings of the American Statistical Association, Biometrics Section [CD-ROM], Alexandria, VA: American Statistical Association.

    J. Li, X. Yang, Y. Wu, and S. Shoptaw, A Random-effects Markov Transition Model

    for Poisson-distributed Repeated Measures with Nonignorable Missing Values, ac-

    cepted by Statistics in Medicine.


  • X. Yang and J. Li (2006), Selection Models with Augmented Gibbs Samplers for Continuous Repeated Measures with Nonignorable Dropout, to be submitted.

    X. Yang, J. Li, and S. Shoptaw (2006), Multiple Partial Imputation for Longitudinal Data with Missing Values in Clinical Trials, resubmitted to Statistics in Medicine.

    Y. N. Wu, S.-C. Zhu, J. Li, and S. Bahrami (2006), Information Scaling in Natural Images, resubmitted to JASA.

    Y. N. Wu, J. Li, Z. Liu, and S.-C. Zhu (2006), Statistical Principles in Low-Level Vision, tentatively accepted by Technometrics.


  • ABSTRACT OF THE DISSERTATION

    Analysis of longitudinal data with missing values

    by

    Jinhui Li

    Doctor of Philosophy in Statistics

    University of California, Los Angeles, 2006

    Professor Yingnian Wu, Chair

    Biomedical research is plagued with problems of missing data, especially in clini-

    cal trials of medical and behavioral therapies adopting longitudinal design. After a

    comprehensive literature review on modeling incomplete longitudinal data based on

    the full-likelihood functions, this dissertation proposes a Bayesian framework with

    MCMC strategies for implementing two kinds of advanced models: selection models

    and shared parameter models, for handling intermittent missing values and dropouts

    that are potentially nonignorable according to various criteria. We combine the advan-

    tages of mixed effect models and Markov transition models. Simulation studies and applications to practical data were performed, and comparisons with likelihood methods were made, to show the efficacy and consistency of our methods.

    KEY WORDS: Selection Model; Shared Parameter Model; Markov Transition

    Model; Nonignorable Missing Values.


  • CHAPTER 1

    Introduction

    Longitudinal data refer to data points collected over time on the same subjects. For example, if some selected persons measure their weights once a week for ten consecutive weeks, the collection of weights through time is longitudinal data. Data points repeatedly collected on the same subjects in this way are called repeated measures. Longitudinal data are very common in biomedical research and clinical trials, where a characteristic or measurement of a person or a thing, for example, the status of a person's disease or the value of a car, evolves over time. One variable corresponds to such a characteristic or measurement.

    We often have several variables, among which one is the target or response variable and the others are potential explanatory variables. In principle any one of them can change and be measured over time, but usually it is the response variable that constitutes the repeated measures. We are interested in how the change or variation of the response variable can be explained by the explanatory variables, and statistical modeling is used to approximate the relationship between the two kinds of variables. Diggle, Liang and Zeger ([DLZ02]) presented a detailed summary of the models used. Popular choices are linear models, generalized linear models, transition models and mixed effect models. The simplest is the linear or additive model, where we use a linear combination of the explanatory variables and random errors to describe the response variable. We can also include generated new variables, i.e., functions of the original explanatory variables. If we use a linear combination of the explanatory variables to describe a function of the expected value of the response variable, we have a generalized linear model. Furthermore, if we let the model for the current measure explicitly depend on its previous measures, it becomes a (Markov) transition model. Finally, if we allow the coefficients of the explanatory variables to vary from subject to subject, we have mixed effect models, in contrast with models containing the same coefficients for all subjects, which describe the population trend and are called fixed effect models. We can obtain the likelihood function (or conditional likelihood function) from the models used as the joint probability distribution function over all subjects and all time points. MLE (maximum likelihood estimation) methods are then used for estimation and inference.

    With the same kinds of models, the computation becomes much more complicated when some of the data points are missing. Missing data are very common in biomedical research with longitudinal studies, reflecting the problematic nature of the phenomena under study, such as substance abuse ([NC02]), sexual behavior ([CS94]) and mental health disorders ([TSBU02]). The proportion of missing data is sometimes noticeably large, e.g., as large as 70% at termination in a randomized study of buprenorphine versus methadone ([FW95]). Although investigators may devote substantial effort to minimizing the number of missing values, some amount of missing data is inevitable in the practice of randomized medical clinical trials. In this dissertation, we focus on missing values of the response variables. Missing data can be categorized into two kinds: intermittent missing values (i.e., missing values due to occasional withdrawal), with observed values afterwards, and dropout values (i.e., missing values due to earlier withdrawal), with no more observed values. Reasons for dropout are often study-related, e.g., negative side effects of the tested medicines, ineffectiveness of the intervention, and inappropriate delivery of the therapy ([YS05]). Without careful handling of dropouts, either biased parameter estimates or invalid inferences would result. This may also be true for intermittent missing values.


  • It is not possible to compute the original joint probability function of all the repeated measures directly when data are missing. There are at least three ways to deal with this situation: complete-case only, analyze as incomplete, and imputation. Complete-case only means keeping just the complete cases, with specified weights, and dropping all other cases with missing values. The imputation method fills in the missing values with randomly generated ones, and how to generate good numbers to approximate the original ones is the critical issue. Analyze as incomplete integrates out the intermittent missing values and ignores the dropout missing values, which seems a straightforward choice but is computationally expensive and not always stable.

    All three approaches above focus on the response variable, since we are mainly concerned with estimating the effect of the covariates on the response variable. If the missing data do not affect the estimation at all, we can simply use the available data directly and ignore the missing ones. This raises the question of ignorability of the missing values. Ignorability is usually unknown to us, so the safe and natural way is to model the repeated measures and the missingness indicators jointly. There exist three ways to factor the joint distribution of the complete data and missingness indicators: outcome-dependent factorization, pattern-dependent factorization, and parameter-dependent factorization. Correspondingly there are three kinds of models: selection models, pattern-mixture models and shared-parameter models. No matter which model is used, computing the joint probability function of all the repeated measures remains difficult. Evaluating this joint probability by numerical integration is a common choice, but it is usually slow and sometimes unstable. Because of this, we propose an approach that uses a Bayesian framework with an MCMC implementation to overcome the difficulty of integration.

    Following this introductory chapter, the dissertation is organized as follows: we review models commonly used for longitudinal data analysis, the missing data issue and ignorability in Chapter 2; after that, we review the Bayesian framework and MCMC methodology in Chapter 3. Based on those methods, we describe our methodology for modeling longitudinal data with missing values in Chapter 4. Next, in Chapter 5 we present the results of simulation studies and applications to practical data, making comparisons with the maximum likelihood method and showing the consistency of our models. Finally, we discuss open problems and future work in the last chapter.


  • CHAPTER 2

    Models for longitudinal data analysis and missing data

    issue

    In most scientific research, we are concerned with the relationship between different quantities. For a longitudinal data set with a balanced design, our main interest is in the relationship between the response variable (usually some measurement related to the treatment effect) and the covariates (especially the treatment, if there is any). Repeated measures are potentially observed on the i-th subject at the j-th time point t_ij (i = 1, . . . , N; j = 1, . . . , J). From now on, we use capital symbols to represent variables, e.g., Yij indicating response variables and Xij = (Xij1, . . . , Xijp) indicating covariates or explanatory variables. Symbols in lower case represent observed or missing values, i.e., the realizations of variables: yij denoting the value of Yij and xijk denoting the value of Xijk recorded at time tij (i = 1, . . . , N; j = 1, . . . , J; k = 1, . . . , p). Bold symbols represent vectors or matrices, e.g., yi = (yi1, . . . , yiJ)^T is a vector of values for the repeated measures and xi = [xijk]_{J×p} is a matrix of values of time-varying or time-independent covariates on the i-th subject.

    For our current study, the response variables are of three common types: continuous, count and binary. The covariates can be time dependent, but in practice they are usually time independent.


  • 2.1 Models for longitudinal data analysis

    The review in this section mainly relies on the detailed and systematic summary of the widely used models by Diggle, Liang and Zeger ([DLZ02]). We can use linear models, generalized linear models, transition models and mixed effect models to approximate the relationship between the response variables and the covariates. Different models require different assumptions, e.g., normality or independence, and those assumptions need to be checked before or after any model is applied.

    Although no model is absolutely correct in revealing the underlying mechanisms that generate real-world data, some provide useful approximations and help us understand those mechanisms.

    2.1.1 Linear models

    For continuous responses, the simplest model is a linear or additive model, where we assume Yij can be expressed as a linear or additive combination of the covariates and a random error,

    Y_{ij} = \sum_{k=1}^{p} \beta_k x_{ijk} + \epsilon_{ij} = x_{ij}\beta + \epsilon_{ij}

    where we assume the random error ε_ij ~ N(0, σ²), which is the most widely used normality assumption. Note that usually x_ij1 = 1 and β1 is an intercept term.

    Usually we assume that the subjects come from a random sample of the population and are therefore independent of each other. Without loss of generality, we can describe the model for the i-th subject. In matrix form, we have the model for Y_i as

    Y_i \sim N(x_i\beta, \sigma^2 V)

    where V is a general positive definite symmetric matrix. If we assume that V is a diagonal matrix, or even an identity matrix, it means that the repeated measures from the same subject are independent of each other. Usually they are dependent and positively correlated, especially for neighboring measurements, and V has no simple structure.

    We can set constraints on the elements v_jk of the variance-covariance structure matrix V. Some common choices are:

    v_{jj} = 1, \quad v_{jk} = \rho

    which is called compound symmetric, since the correlation between any pair Y_ij, Y_ik is ρ, and

    v_{jk} = \rho^{|j-k|}

    which is called AR(1) (first order autoregressive), where the correlation between any pair Y_ij, Y_ik is determined by ρ and the length of time between the two time points; since −1 ≤ ρ ≤ 1, the correlation is weaker for measures that are farther apart.
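    To make these two covariance structures concrete, the following is a minimal Python/NumPy sketch (my own illustration, not part of the dissertation; the dimensions, coefficients and ρ value are made up) that builds the compound-symmetric and AR(1) matrices and simulates one subject's repeated measures from Y_i ~ N(x_i β, σ²V).

```python
import numpy as np

def compound_symmetric(J, rho):
    """V with v_jj = 1 and v_jk = rho for j != k."""
    return np.full((J, J), rho) + (1.0 - rho) * np.eye(J)

def ar1(J, rho):
    """V with v_jk = rho**|j - k| (first-order autoregressive)."""
    idx = np.arange(J)
    return rho ** np.abs(idx[:, None] - idx[None, :])

rng = np.random.default_rng(0)
J = 6                                               # hypothetical: 6 visits
beta = np.array([1.0, -0.5])                        # hypothetical coefficients
x_i = np.column_stack([np.ones(J), np.arange(J)])   # intercept + time
sigma2 = 0.8

# Y_i ~ N(x_i beta, sigma^2 V) under each covariance structure
for V in (compound_symmetric(J, 0.4), ar1(J, 0.4)):
    y_i = rng.multivariate_normal(x_i @ beta, sigma2 * V)
    print(np.round(y_i, 2))
```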

    2.1.2 Generalized linear models

    Instead of directly linking the response to the covariates, as for a continuous response, we can link the marginal expectation of the response μ_ij = E(Y_ij) to the covariates for other kinds of responses, which gives the generalized linear models.

    For binary responses, e.g., the status of smoking or not, or the indicator of having a certain disease or not, let μ_ij = E(Y_ij) = P(Y_ij = 1); the link function is the logit function

    \mathrm{logit}(\mu_{ij}) = \log\frac{\mu_{ij}}{1-\mu_{ij}} = \log\frac{P(Y_{ij}=1)}{P(Y_{ij}=0)} = x_{ij}\beta

    which models the log transformation of the odds P(Y_ij = 1)/P(Y_ij = 0).

    For count responses, e.g., the number of smoking episodes in a certain period of time, it is often reasonable to assume that the repeated measures follow a Poisson distribution, i.e.,

    P(Y_{ij} = y_{ij}) = \frac{\lambda_{ij}^{y_{ij}}}{y_{ij}!}\exp(-\lambda_{ij})

    For the marginal expectation of the response we have μ_ij = λ_ij, and a log function serves as the link function

    \log(\mu_{ij}) = x_{ij}\beta

    We should notice that these models for the marginal expectation of the response

    assume independence between the repeated measures since there are only covariates

    and fixed parameters in the model. This is usually not true and we need more complex

    models.
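    As a small illustration of the two link functions just described, the following Python sketch (my own example; the covariate and coefficient values are made up) computes the inverse-logit mean for a binary response and the inverse-log (exponential) mean for a count response.

```python
import numpy as np

def logistic_mean(x_ij, beta):
    """Inverse logit link: mu_ij = P(Y_ij = 1) for a binary response."""
    eta = x_ij @ beta
    return 1.0 / (1.0 + np.exp(-eta))

def poisson_mean(x_ij, beta):
    """Inverse log link: mu_ij = lambda_ij for a count response."""
    return np.exp(x_ij @ beta)

x_ij = np.array([1.0, 0.5])     # hypothetical covariates (intercept, treatment)
beta = np.array([-0.3, 0.8])    # hypothetical coefficients
print(logistic_mean(x_ij, beta), poisson_mean(x_ij, beta))
```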

    2.1.3 Transition models

    The above way of modeling the marginal expectation of the response through generalized linear models ignores the fact that the repeated measures on the same subject are usually not independent. It is quite straightforward to make the distribution of the current repeated measure depend on its predecessors, usually through some kind of residuals. This results in the transition models.

    Assume a q-step transition, i.e., the current repeated measure is related to its q immediate predecessors. For continuous variables, we may have

    Y_{ij} = x_{ij}\beta + \sum_{r=1}^{q} \alpha_r \left( Y_{i,j-r} - x_{i,j-r}\beta \right) + \epsilon_{ij}

    where ε_ij ~ N(0, σ²) as before. Here we connect Y_ij to its q immediate predecessors through the residuals Y_{i,j−r} − x_{i,j−r}β, r = 1, . . . , q. This is just a re-interpretation of the autoregressive covariance structure for continuous variables.

    For binary variables, the one-step transition probabilities P_kl = Pr(y_ij = l | y_{i,j−1} = k) (k = 0 or 1; l = 0 or 1) may be modeled by logistic regressions,

    \mathrm{logit}(P_{01}) = \mathrm{logit}(\Pr(y_{ij} = 1 \mid y_{i,j-1} = 0, x_{ij})) = x_{ij}\beta_{01}   (2.1)
    \mathrm{logit}(P_{10}) = \mathrm{logit}(\Pr(y_{ij} = 0 \mid y_{i,j-1} = 1, x_{ij})) = x_{ij}\beta_{10}   (2.2)

    Similarly, for a count response following a Poisson distribution, we can make the intensity (mean) depend on its q immediate predecessors,

    \log(\lambda_{ij}) = x_{ij}\beta + \sum_{r=1}^{q} \alpha_r \left( \log(\max(y_{i,j-r}, 1)) - x_{i,j-r}\beta \right)

    Transition models thus provide a reasonable way to directly connect the autocorrelated repeated measures.
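    For concreteness, here is a short Python sketch (my own illustration under the count-response formula above; the design matrix, coefficients and number of visits are made up) that simulates one subject's counts from a one-step Poisson transition model.

```python
import numpy as np

def simulate_poisson_transition(x_i, beta, alpha, rng):
    """Simulate one subject's counts from a q-step Poisson transition model:
    log(lambda_ij) = x_ij beta + sum_r alpha_r (log(max(y_{i,j-r}, 1)) - x_{i,j-r} beta)."""
    J, q = x_i.shape[0], len(alpha)
    y = np.zeros(J, dtype=int)
    for j in range(J):
        eta = x_i[j] @ beta
        for r in range(1, q + 1):
            if j - r >= 0:
                eta += alpha[r - 1] * (np.log(max(y[j - r], 1)) - x_i[j - r] @ beta)
        y[j] = rng.poisson(np.exp(eta))
    return y

rng = np.random.default_rng(1)
J = 12                                                       # hypothetical: 12 weekly visits
x_i = np.column_stack([np.ones(J), np.linspace(0, 1, J)])    # intercept + scaled time
print(simulate_poisson_transition(x_i, beta=np.array([1.5, -0.8]),
                                  alpha=np.array([0.3]), rng=rng))
```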

    2.1.4 Random effect models

    The previous models all assume fixed effects, i.e., the coefficients of the covariates are the same for all subjects. Since there is variation from subject to subject, we can reasonably allow a different (random) coefficient on each covariate for each subject to account for the between-subject variation. This introduces many more parameters into the model, and with them the danger of over-parameterization. To avoid this, the natural constraint is to assume that the coefficients for the same covariate come from a common distribution, which is usually a normal distribution.

    As an example, for continuous variables we may have

    Y_{ij} = x_{ij}\beta_i + \epsilon_{ij}   (2.3)
           = x_{ij}\beta + x_{ij} b_i + \epsilon_{ij}   (2.4)

    where ε_ij ~ N(0, σ²) as before, and the subject-specific coefficient β_i ~ N(β, D) is decomposed into β and b_i to account for the population effect and the subject (random) effect, respectively. b_i ~ N(0, D) is commonly used, and the specification of D is needed. In practice, we often assume D is diagonal if appropriate.

    The covariates associated with the random effects can differ from those associated with the fixed effects, but usually the former are a subset of the latter, which means we do not have to assume a random effect for each covariate. Random-intercept-only and random-intercept-and-slope structures are widely used.
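    As an illustration of the random-intercept-and-slope structure, here is a minimal Python sketch (my own example; the design, coefficient values and diagonal D are made up, and the random-effect design is simply taken equal to the fixed-effect design) that simulates repeated measures from model (2.4).

```python
import numpy as np

def simulate_random_effects(N, J, beta, D, sigma2, rng):
    """Simulate Y_ij = x_ij beta + z_ij b_i + eps_ij with a random intercept and slope.
    D is the (diagonal) covariance of b_i; all numeric values here are illustrative."""
    time = np.linspace(0, 1, J)
    x = np.column_stack([np.ones(J), time])    # fixed-effect design (intercept, time)
    z = x                                      # random effects on the same covariates
    b = rng.multivariate_normal(np.zeros(len(D)), D, size=N)   # b_i ~ N(0, D)
    eps = rng.normal(0.0, np.sqrt(sigma2), size=(N, J))
    return x @ beta + b @ z.T + eps            # N x J matrix of repeated measures

rng = np.random.default_rng(2)
Y = simulate_random_effects(N=5, J=4, beta=np.array([2.0, -1.0]),
                            D=np.diag([0.5, 0.2]), sigma2=0.3, rng=rng)
print(np.round(Y, 2))
```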

    2.1.5 Estimation and inference

    With complete data and appropriate models assumed, we can write down the likelihood function. Estimation and inference based on MLE (maximum likelihood estimation) or REML (restricted maximum likelihood estimation) is widely used. Please refer to Diggle, Liang and Zeger ([DLZ02]) for details.

    2.2 Missing data issue

    Missing data make the parameter estimation and inference much more complicated.

    The likelihood function requires integration to evaluate. Missing values can happen

    to both the response variables and the covariates but we assume only the response

    variables have missing values in the current study.

    When some values of the repeated measures are missing, we partition y_i into two parts, y_i = (y_i^obs, y_i^mis), with y_i^obs indicating the observed values and y_i^mis indicating values that would be observed if they were not missing. When missing values are introduced by dropout, the pattern of missingness can be indicated by a scalar d_i, which represents the actual time of withdrawal for subject i (i.e., d_i = 2, . . . , J + 1), with d_i = J + 1 indicating completion of the study. Since a subject who drops out at baseline does not contribute to the likelihood function, the case d_i = 1 is excluded from our consideration. A vector of missingness indicators is defined as r_i = (r_i1, . . . , r_iJ)^T with elements

    r_{it} =
      0  if y_{it} is observed,
      1  if y_{it} is intermittently missing,
      2  if y_{it} is missing after dropout.
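    As a small, purely illustrative sketch (not part of the dissertation), such indicators could be coded in Python from one subject's recorded series, with NaN marking an unrecorded visit:

```python
import numpy as np

def missingness_indicators(y_i):
    """Code r_it = 0 (observed), 1 (intermittently missing), 2 (missing after dropout)
    from one subject's series, where np.nan marks an unrecorded visit."""
    y_i = np.asarray(y_i, dtype=float)
    observed = ~np.isnan(y_i)
    r = np.full(y_i.shape, 2)                 # default: missing after dropout
    if observed.any():
        last_obs = np.max(np.nonzero(observed)[0])
        r[: last_obs + 1] = np.where(observed[: last_obs + 1], 0, 1)
    return r

print(missingness_indicators([3.1, np.nan, 2.7, np.nan, np.nan]))  # [0 1 0 2 2]
```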

    Theoretically, the joint distribution of the complete data (i.e., y_i) and the missingness patterns (i.e., r_i) should be modeled in the statistical analysis, and based on that we can write down the full likelihood function,

    L(\theta, \phi \mid y, r) \propto \prod_{i=1}^{N} \int f(y_i, r_i \mid \theta, \phi)\, dy_i^{mis}

    where θ represents the parameters of the model for the repeated measures and φ represents the parameters of the missingness mechanism.

    According to the possible graphical representations, there exist three ways to factor the joint distribution of the complete data and missingness indicators: outcome-dependent factorization, pattern-dependent factorization, and parameter-dependent factorization. Figure 2.1 shows the general directions of the factorizations with the associated models. Outcome-dependent factorization means the repeated measures or outcomes affect the distribution of the missingness indicators; pattern-dependent factorization means the missingness indicators (i.e., missingness patterns) indicate differences in the distribution of the repeated measures; and parameter-dependent factorization means some common parameters (covariates or random effects) affect both the repeated measures and the missingness indicators.

    The three corresponding models available for incomplete longitudinal data analysis

    are:

    Figure 2.1: Three Representations of Missingness Mechanisms. Three ways of modeling incomplete data are depicted here. Parameters and symbols in the figure are defined as: y_i = (y_i^obs, y_i^mis), the observed and missing values for subject i; r_i, the missingness indicators for the repeated measures on subject i; θ, the parameters of the data; φ, the parameters of the missingness indicators; and b_i, the parameters shared by the data and the missingness indicators.

    Selection Model, which factors the joint distribution into a marginal distribution for y_i and a conditional distribution of r_i given y_i, i.e.,

    f(y_i, r_i \mid x_i, \theta, \phi) = f(y_i \mid x_i, \theta)\, f(r_i \mid y_i, x_i, \phi)

    where f(r_i | y_i, x_i, φ) can be interpreted as self-selection of the i-th subject into a specific missingness group.

    Pattern-Mixture Model, which is a pattern-dependent model, assumes that the distribution of the repeated measures varies with the missingness patterns, and the joint distribution is factored as

    f(y_i, r_i \mid x_i, \theta, \phi) = f(y_i \mid r_i, x_i, \theta)\, f(r_i \mid x_i, \phi).

    Assuming that there are P patterns of missingness in a data set, the marginal distribution of y_i is a mixture,

    f(y_i \mid x_i, \theta) = \sum_{p=1}^{P} f(y_i \mid r_i = r_i^{(p)}, x_i, \theta^{(p)})\, \pi_p

    where θ^(p) represents the parameters of f(y_i) in the p-th pattern, π_p = Pr(r_i = r_i^(p) | x_i, φ), and r_i^(p) enumerates the P patterns. In pattern-mixture models, θ^(1), . . . , θ^(P) can differ in dimensionality or in value.

    Shared-Parameter Model, in which we assume that y_i and r_i are conditionally independent of each other given a group of parameters b_i,

    f(y_i, r_i \mid x_i, \theta, \phi) = \int f(y_i \mid b_i, x_i, \theta)\, f(r_i \mid b_i, x_i, \phi)\, f(b_i)\, db_i   (2.5)

    The shared parameters b_i affect both y_i and r_i, and thus can be either observable variables (e.g., gender) or latent variables (e.g., random effects or latent scores). For the case of observed confounders, model 2.5 is in fact a mixture model and the analysis can be conducted as a stratified analysis.

    My focus is on the selection model and the shared-parameter model, while Dr. Yang did a lot of research on the pattern-mixture model. One example is shown in Yang and Li ([YL06]).

    2.2.1 Ignorable versus Nonignorable

    The above models are natural, and it is true that in certain biomedical studies both the missingness patterns and the values of the repeated measures are of interest. For example, in a heart-disease study, the repeatedly measured blood pressures and the survival lengths of the patients can be modeled jointly. In these scenarios, the above selection, pattern-mixture, and shared-parameter models can be applied directly or after some modification ([HL97]). In the majority of biomedical studies, however, only the parameters of the repeated measures themselves are of interest, while parameters related to missing values are usually viewed as nuisance parameters. In this latter case, it is desirable that missing data be ignored if possible.


  • Within the setting of outcome-dependent missingness, the concept of ignorability has been defined and extensively addressed. According to Rubin ([R76]), missing values are ignorable when

    (i) r_i is independent of y_i^mis, given y_i^obs and x_i;
    (ii) θ and φ are distinct.

    Under this ignorability, the log-likelihood function for θ can be separated from the log-likelihood function for φ, i.e.,

    l(\theta, \phi \mid y_i^{obs}, r_i) = l(\theta \mid y_i^{obs}, r_i) + l(\phi \mid y_i^{obs}, r_i).

    In Little and Rubin [LR02], outcome-dependent missingness was further divided into sub-categories:

    missing completely at random (MCAR): r_i ⊥ y_i | x_i;

    missing at random (MAR): r_i ⊥ y_i^mis | y_i^obs, x_i;

    not missing at random (NMAR): none of the above.

    For intermittent missing values, ignorability can be interpreted as whether the missing values can be interpolated from neighboring observed values. For dropouts, the assumption corresponds to whether missing values after dropout can be extrapolated from the previously observed values. In certain applications, occasional omissions or nonresponses are due to reasons that are purely random in nature, e.g., schedule conflicts or bad weather, and thus can be assumed to be ignorable. Nonetheless, subjects usually withdraw from a study because of study-related reasons, e.g., dissatisfaction with the intervention or notorious side effects of a medical therapy, and hence dropouts are nonignorable ([VM00, DLZ02]).

    The definition of ignorability may be extended to meet the needs of pattern-mixture and shared-parameter models. Adopting an informal view, we interpret ignorability as a condition under which the observed repeated measures can be used to estimate θ without bias. For selection and pattern-mixture models, so long as r_i and y_i^mis are independent of each other given y_i^obs and x_i, missing data can be ignored. For shared-parameter models, ignorability corresponds only to the case where the b_i are observable variables, which are usually viewed as a subset of x_i. Unless r_i and y_i share no random effects, a shared-parameter model is generally associated with a nonignorability assumption.

    However, it is difficult or impossible to check ignorability directly. Therefore, we take an approach that jointly models the missingness patterns and repeated measures without assuming ignorability. In the end, we can assess ignorability based on the results from our models.

    2.3 Modelling with missing values

    In this section, we give a brief summary of several common approaches used by researchers to handle modelling with missing values, including imputation-based methods, illustrated by multiple partial imputation, and integration-based methods (the analyze-as-incomplete approaches), illustrated by the selection model for continuous data with dropout and the shared-parameter model for binary responses with both intermittent missingness and dropout.

    Imputation-based methods usually require critical assumptions about the missingness mechanisms. Wrong assumptions lead to imputed values whose distribution is inconsistent with that of the observed ones. The previously proposed integration-based methods are generally computationally expensive and tend to be unstable.


  • 2.3.1 Multiple partial imputation

    For incomplete longitudinal data sets, the method of multiple imputation (see Rubin [R87]) is especially useful. Accurately predicting missing values is possible because repeated measures are often highly correlated with each other. When imputing, all three of the above modeling options can be used. In longitudinal data sets, the missingness patterns and mechanisms for intermittent missing values and dropouts are apt to be distinct, thus requiring different treatment. Empirical experience suggests that in certain clinical trials intermittent missing values are ignorable, being due to factors unrelated to the theme of the study, while dropouts should not be simply ignored. In Yang and Shoptaw ([YS05]), a partial version of multiple imputation, MPI, was first proposed, within which only intermittent missing values are imputed. As seen in the application of pattern-mixture models, imputation methods can be further employed to implement various schemes of identifying restrictions for managing dropouts. This leads to a further extension of multiple imputation, termed 2-stage MPI here. Depending on the assumptions about the mechanisms of intermittent missingness and dropout, there exist many specific forms of MPI and 2-stage MPI.

    The missing values y_i^mis are further partitioned into (y_i^IM, y_i^DM) to denote intermittent missing values and dropouts. For MPI, m > 1 independent values y_i^{IM(1)}, y_i^{IM(2)}, . . . , y_i^{IM(m)} are drawn from the posterior predictive distribution p(y_i^IM | y_i^obs, r_i). Within the 2-stage MPI, for each of the partial imputations of the intermittent missing values, n conditionally independent values y_i^{DM(j,1)}, y_i^{DM(j,2)}, . . . , y_i^{DM(j,n)} are additionally drawn from the predictive distribution p(y_i^{DM(j,k)} | y_i^obs, y_i^{IM(j)}), j = 1, . . . , m. The 2-stage MPI provides a natural framework for fitting pattern-mixture models by identifying restrictions with information borrowed from completers, neighboring cases, or available cases. If we use selection models or shared-parameter models, imputations for dropouts can be similarly conducted by applying appropriate MCMC algorithms.


  • A main concern for multiple imputation is how to combine the multiple point estimators to make an overall inferential statement. A set of rules for combination was originally developed by Rubin and Schenker ([RS86]), which can be used directly for MPI. In Shen [S00], the idea was extended to the case of 2-step multiple imputation, which can be viewed as a general approach for 2-stage MPI. More specifically, m × n complete data sets are eventually obtained in the 2-stage MPI, y_i^{(j,k)} = (y_i^obs, y_i^{IM(j)}, y_i^{DM(j,k)}): j = 1, . . . , m, k = 1, . . . , n. A noticeable problem with these complete data sets is that they are not independent of each other, because each block or nest (y_i^{DM(j,1)}, y_i^{DM(j,2)}, . . . , y_i^{DM(j,n)}) contains identical values of y_i^{IM(j)}. Denoting by Q^{(j,k)} and U^{(j,k)} the point and variance estimates for Q from the (j,k)-th completed data set, the overall point estimate for Q is still simply the grand average, i.e.,

    \bar{Q} = \frac{1}{mn} \sum_{j=1}^{m} \sum_{k=1}^{n} Q^{(j,k)}   (2.6)

    The associated variance for \bar{Q} involves three components, i.e.,

    T = \bar{U} + \left(1 - \frac{1}{n}\right) W + \left(1 + \frac{1}{m}\right) B   (2.7)

    where

    \bar{U} = \frac{1}{mn} \sum_{j=1}^{m} \sum_{k=1}^{n} U^{(j,k)}

    estimates the complete-data variance,

    B = \frac{1}{m-1} \sum_{j=1}^{m} \left( \bar{Q}^{(j,\cdot)} - \bar{Q} \right)^2

    indicates the between-nest variance,

    W = \frac{1}{m} \sum_{j=1}^{m} \frac{1}{n-1} \sum_{k=1}^{n} \left( Q^{(j,k)} - \bar{Q}^{(j,\cdot)} \right)^2

    represents the within-nest variance, and \bar{Q}^{(j,\cdot)} = \frac{1}{n} \sum_{k=1}^{n} Q^{(j,k)}. Inferences about Q

  • are based on the Student's t-distribution, (Q - \bar{Q})/\sqrt{T} \sim t_{\nu}, with degrees of freedom ν given by

    \nu^{-1} = \frac{1}{m(n-1)} \left[ \frac{(1 - 1/n)\, W}{T} \right]^2 + \frac{1}{m-1} \left[ \frac{(1 + 1/m)\, B}{T} \right]^2

    Other formulas, such as rates of missing information and relative efficiency, can be found in Shen [S00].
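    To make the combining rules above concrete, here is a small Python sketch (my own illustration with made-up estimates, not from the dissertation) that computes the grand average, the three variance components, the total variance T, and the degrees of freedom from an m × n grid of point and variance estimates.

```python
import numpy as np

def combine_two_stage_mpi(Q, U):
    """Combine point estimates Q[j, k] and variance estimates U[j, k] from an
    m x n grid of 2-stage partially imputed data sets (formulas 2.6-2.7)."""
    m, n = Q.shape
    Q_bar = Q.mean()                                   # overall point estimate
    U_bar = U.mean()                                   # within-imputation variance
    Q_nest = Q.mean(axis=1)                            # nest means Q^(j,.)
    B = np.sum((Q_nest - Q_bar) ** 2) / (m - 1)        # between-nest variance
    W = np.mean(np.sum((Q - Q_nest[:, None]) ** 2, axis=1) / (n - 1))  # within-nest
    T = U_bar + (1 - 1 / n) * W + (1 + 1 / m) * B      # total variance
    inv_df = ((1 - 1 / n) * W / T) ** 2 / (m * (n - 1)) + \
             ((1 + 1 / m) * B / T) ** 2 / (m - 1)
    return Q_bar, T, 1.0 / inv_df

rng = np.random.default_rng(3)
Q = rng.normal(-1.0, 0.1, size=(5, 3))   # hypothetical m=5 nests, n=3 draws each
U = np.full((5, 3), 0.04)
print(combine_two_stage_mpi(Q, U))
```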

    Harel and Schafer ([HS02]) proposed using two-stage multiple imputation to handle two qualitatively different types of missingness, not necessarily classified as intermittent missingness and dropout. This idea might be extended to multiple-stage multiple imputation to handle several qualitatively different types of missingness.

    2.3.2 Selection model

    Diggle and Kenward ([DK94]) studied modelling continuous response variables with dropout missing values, where the missingness indicator r_ij = 1 means dropout and r_ij = 0 means observed.

    Let us denote t_{d_i} as the dropout time for the i-th subject, where 2 ≤ d_i ≤ J + 1 (d_i = J + 1 indicates a subject who has completed the study). Then the missingness indicator r_i is a vector of d_i − 1 consecutive zeros followed by J + 1 − d_i consecutive ones. Suppressing the dependence on covariates, the selection model of Diggle and Kenward [DK94] assumes:

    (i) Pr(r_ij = 1 | j > d_i) = 1;
    (ii) for j ≤ d_i, Pr(r_ij = 1) depends on y_ij and its history H_ij = (y_i1, . . . , y_{i,j−1})^T;
    (iii) the conditional distribution of y_ij given H_ij is f_ij(y_ij | H_ij, θ).

    Under these assumptions, the full likelihood function for the i-th subject is expressed as

    L_i(\theta, \phi \mid y_i^{obs}, r_i) \propto \prod_{j=1}^{d_i - 1} f_{ij}(y_{ij} \mid H_{ij}, \theta) \prod_{j=1}^{d_i - 1} \left[ 1 - p_j(y_{ij}, H_{ij}) \right] \Pr(r_{i d_i} = 1 \mid H_{i d_i})

    where p_j(y_ij, H_ij) = Pr(r_ij = 1 | y_ij, H_ij, φ) indicates the probability of dropout at time t_ij. The dropout probability is

    \Pr(r_{i d_i} = 1 \mid H_{i d_i}) = \int \Pr(r_{i d_i} = 1 \mid y, H_{i d_i}, \phi)\, f_{i d_i}(y \mid H_{i d_i}, \theta)\, dy   (2.8)

    if d_i < J + 1, and Pr(r_{i d_i} = 1 | H_{i d_i}) = 1 if d_i = J + 1.

    A natural choice for modeling Pr(r_ij = 1 | y_ij, H_ij, φ) is a logistic regression model,

    \mathrm{logit}(\Pr(r_{ij} = 1 \mid y_{ij}, H_{ij}, \phi)) = \phi_0 + H_{ij}\phi_1 + \phi_2 y_{ij}

    where φ2 ≠ 0 implies that the dropout process is outcome-dependent nonignorable.
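    The following short Python sketch (my own illustration; the coefficient values are invented, and only the most recent observed value is used for the history term as one simple choice) computes this dropout probability for a single measurement occasion.

```python
import numpy as np

def dropout_probability(y_ij, history, phi0, phi1, phi2):
    """Logistic dropout model: logit(Pr(r_ij = 1)) = phi0 + H_ij phi1 + phi2 * y_ij.
    Here only the most recent observed value enters H_ij."""
    eta = phi0 + phi1 * history[-1] + phi2 * y_ij
    return 1.0 / (1.0 + np.exp(-eta))

# Illustrative values: a nonzero phi2 makes dropout depend on the current, possibly
# unobserved, response, i.e., an outcome-dependent nonignorable mechanism.
print(dropout_probability(y_ij=2.5, history=np.array([3.0, 2.8]),
                          phi0=-2.0, phi1=0.3, phi2=0.4))
```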

    The full log-likelihood function of the whole data set for (θ, φ) can be partitioned into

    l(\theta, \phi) = l_1(\theta) + l_2(\phi) + l_3(\theta, \phi)

    where

    l_1(\theta) = \sum_{i=1}^{N} \log f(y_i^{obs})

    corresponds to the observed-data log-likelihood function for θ,

    l_2(\phi) = \sum_{i=1}^{N} \sum_{j=2}^{d_i - 1} \log\left[ 1 - \Pr(r_{ij} = 1 \mid y_{ij}, H_{ij}, \phi) \right]

    and

    l_3(\theta, \phi) = \sum_{i \le N;\, d_i \le J} \log\left( \Pr(r_{i d_i} = 1 \mid H_{i, d_i}) \right)

    together determine the log-likelihood function of the dropout process, which contains partial information on θ.

    If dropouts are ignorable, then l_3(θ, φ) depends only on φ and thus can be absorbed into l_2(φ). In this case, the estimation of θ can be derived solely from l_1(θ).


  • For a normal longitudinal data set, y_i ~ N(X_i β, Σ_i(α)) with parameters θ = (β^T, α^T)^T, the conditional distribution f_ij(y | H_ij, θ) is a scalar normal, and the marginal distribution ∏_{j=1}^{d_i−1} f(y_ij | H_ij, θ) = f(y_i^obs) is a multivariate normal.

    To evaluate the integral in equation 2.8, they used a probit approximation to the logit transformation of the dropout probability and converted the problem into moment computations of the conditional normal distribution of the first dropout missing value.

    With the likelihood function at hand, Diggle and Kenward resorted to the simplex algorithm ([NM65]) to obtain the MLE estimators. The simplex algorithm does not depend on derivatives, but it converges unacceptably slowly and provides no Fisher information matrix.

    Selection models originated from the Tobit model of Heckman ([H76]). Verbeke and Molenberghs ([VM00]) addressed the theoretical translation from the Tobit model to Diggle and Kenward's selection model. Subsequently, Troxel, Harrington, and Lipsitz ([THL98]) extended it to the non-monotone setting. Selection models for categorical and other types of measures were also developed; see Fitzmaurice, Molenberghs, and Lipsitz [FML], Molenberghs, Kenward, and Lesaffre [MKL97], Nordheim [N84], and Kenward and Molenberghs [KM99].

    2.3.3 REMTM for binary response variables

    When the dynamic features of the transition pattern in longitudinal data are of interest, an appropriate longitudinal approach is a transition model. For binary repeated measures with nonignorable missing values, Albert and Follmann ([AF03]) developed a Markov transition model with random effects shared by the sub-model for the measurements and the sub-model for the missingness indicators. We let REMTM stand for random-effects Markov transition model and review their models in this section.

    In the REMTM for incomplete binary repeated measures, the sub-model for the measurement process is assumed to be a first-order Markov chain for each series of binary measures. The response variable can jump between the two states, 0 and 1. The transition probabilities P_kl = Pr(y_ij = l | y_{i,j−1} = k) (k = 0, 1; l = 0, 1) can be modeled by a logistic regression with random intercepts,

    \mathrm{logit}(P_{01}) = \mathrm{logit}(\Pr(y_{ij} = 1 \mid y_{i,j-1} = 0, x_{ij}, b_i)) = x_{ij}\beta_{01} + b_i   (2.9)
    \mathrm{logit}(P_{10}) = \mathrm{logit}(\Pr(y_{ij} = 0 \mid y_{i,j-1} = 1, x_{ij}, b_i)) = x_{ij}\beta_{10} + \gamma b_i   (2.10)

    where b_i ~iid N(0, σ_b²) denotes the random intercept, and γ is the heterogeneity parameter indicating the correlation between P_10 and P_01.

    The distribution of the missingness indicators r_i = (r_i1, . . . , r_iJ)^T can be modeled by another Markov transition model. Here

    r_{ij} =
      0  if y_{ij} is observed,
      1  if y_{ij} is intermittently missing,
      2  if y_{ij} is missing due to dropout.

    They used a first-order Markov process associated with 3 × 3 transition probabilities (i.e., P_kl = Pr(r_ij = l | r_{i,j−1} = k), k = 0, 1, 2; l = 0, 1, 2). Determined by certain restrictions, the following transition probabilities are always equal to zero: P_12 = P_20 = P_21 = 0. For the other combinations of r_{i,j−1} and r_ij, the transition probabilities are calculated in the following way. First, if the previous repeated measure is observed (i.e., r_{i,j−1} = 0), then the current one could be observed, intermittently missing, or dropout; the 3-category multinomial-logit model can be used to calculate the transition probabilities:

    P(r_{ij} = k \mid b_i, x_{ij}, r_{i,j-1} = 0) =
      \frac{1}{1 + \sum_{l=1}^{2} \exp(x_{ij}\alpha_l + \psi_l b_i)}   if k = 0,
      \frac{\exp(x_{ij}\alpha_k + \psi_k b_i)}{1 + \sum_{l=1}^{2} \exp(x_{ij}\alpha_l + \psi_l b_i)}   if k = 1 or 2.

    Second, if the previous measure is intermittently missing, then the current one may only be observed or intermittently missing again. Correspondingly, a logistic regression model is used to calculate P_10 and P_11, i.e.,

    P(r_{ij} = k \mid b_i, x_{ij}, r_{i,j-1} = 1) =
      \frac{1}{1 + \exp(x_{ij}\alpha_1 + \psi_1 b_i)}   if k = 0,
      \frac{\exp(x_{ij}\alpha_1 + \psi_1 b_i)}{1 + \exp(x_{ij}\alpha_1 + \psi_1 b_i)}   if k = 1.

    Third, for the absorbing state 2, we always have P(r_{ij} = 2 | b_i, x_{ij}, r_{i,j−1} = 2) = 1. In the above logit and logistic models, the regression coefficients α_1 and α_2 respectively indicate whether intermittent missingness and dropout depend on the covariates, while the coefficients ψ_1 and ψ_2 respectively indicate whether intermittent missingness and dropout are informative (i.e., nonignorable).
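    The following Python sketch (my own illustration, using the α_k and ψ_k notation reconstructed above; the covariate and coefficient values are made up) computes the row of missingness transition probabilities for a given previous state under this sub-model.

```python
import numpy as np

def missingness_transition_row(prev_state, x_ij, b_i, alpha, psi):
    """Return Pr(r_ij = 0, 1, 2 | r_{i,j-1} = prev_state) under an REMTM-style
    missingness sub-model: multinomial logit from state 0, binary logit from
    state 1, and absorbing state 2. alpha[k] and psi[k] (k = 1, 2) are the
    coefficients for intermittent missingness and dropout, respectively."""
    if prev_state == 0:
        e = np.array([np.exp(x_ij @ alpha[k] + psi[k] * b_i) for k in (1, 2)])
        denom = 1.0 + e.sum()
        return np.array([1.0 / denom, e[0] / denom, e[1] / denom])
    if prev_state == 1:
        e = np.exp(x_ij @ alpha[1] + psi[1] * b_i)
        return np.array([1.0 / (1.0 + e), e / (1.0 + e), 0.0])
    return np.array([0.0, 0.0, 1.0])          # dropout is absorbing

x_ij = np.array([1.0, 0.5])                   # hypothetical covariates
alpha = {1: np.array([-1.5, 0.2]), 2: np.array([-2.0, 0.4])}
psi = {1: 0.3, 2: 0.8}
print(missingness_transition_row(0, x_ij, b_i=0.1, alpha=alpha, psi=psi))
```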

    By combining the above sub-models for measurement and missingness, the likelihood function for the parameters θ = (β_01, β_10, γ, σ_b²) and φ = (α_1, α_2, ψ_1, ψ_2) is expressed as

    L(\theta, \phi) \propto \prod_{i=1}^{n} \int \left\{ \prod_{j=1}^{T_i} p(y_{ij} \mid x_{ij}, y_{i,j-1}, b_i, \theta) \right\} \left\{ \prod_{j=1}^{J} p(r_{ij} \mid b_i, x_{ij}, r_{i,j-1}, \phi) \right\} p(b_i)\, db_i

    where p(b_i) is the pdf of N(0, σ_b²) and T_i is the last observed time point.

    They derived the K-step transition probabilities from the one-step transition probabilities to handle the intermittent missing values. They then maximized the above likelihood function using a Newton-Raphson algorithm, where the integral was evaluated with Gaussian quadrature.
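    As a small sketch of that last step (my own illustration, not the authors' code), the integral over a normal random intercept can be approximated with Gauss-Hermite quadrature; the per-subject likelihood contribution below is only a toy placeholder.

```python
import numpy as np

def integrate_over_random_intercept(contribution, sigma_b, n_nodes=20):
    """Approximate  integral of contribution(b) * N(b; 0, sigma_b^2) db  with
    Gauss-Hermite quadrature, via the change of variable b = sqrt(2)*sigma_b*t."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    b = np.sqrt(2.0) * sigma_b * nodes
    return np.sum(weights * contribution(b)) / np.sqrt(np.pi)

# Toy per-subject contribution: a product of Bernoulli terms whose logits shift with b.
def toy_contribution(b):
    p = 1.0 / (1.0 + np.exp(-(0.5 + b)))
    return p * (1.0 - p) * p        # e.g., an observed binary sequence 1, 0, 1

print(integrate_over_random_intercept(toy_contribution, sigma_b=0.7))
```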

    It should be remarked here that the above REMTM can easily be extended to deal with other types of repeated measures. In Li et al. [LYWS06] we applied the REMTM to Poisson-distributed repeated measures with nonignorable missing values. The random intercept in the models can be replaced with other types of random effects, including random slopes and random cohort effects. The REMTM is only one specific example of shared-parameter models; other longitudinal models, such as marginal models or random-effects models, can also be used to implement shared-parameter modeling. The shared-parameter model was first developed by Wu and Carroll ([WC88]), where certain parameters are shared by the measurement model and a censoring process.

  • Caroll([WC88]) where certain parameters are shared by the measurement model and acensoring process.


  • CHAPTER 3

    Bayesian framework with MCMC approach

    As stated before, for the longitudinal models based on full-likelihood functions, parameter estimation based on asymptotic normal theory is difficult to apply, mainly because of the complicated form of the likelihood functions. Without an analytical solution for the score function and Hessian matrix, optimization is challenging. Diggle and Kenward ([DK94]) resorted to the simplex algorithm ([NM65]), which does not depend on derivatives. Unfortunately, it converges unacceptably slowly and, being derivative-free, provides no Fisher information matrix. We implemented the nonlinear optimization algorithms of Dennis and Schnabel ([DS96]) with numerical derivatives, but found that the global maximum remained difficult to obtain in practical settings. When calculating the dropout probabilities in the selection model or integrating over the random effects in the REMTMs, time-consuming numerical integration is required, such as the Gauss-Hermite method. Bayesian methods based on MCMC provide a more affordable and appropriate alternative for parameter estimation and inference. By sampling parameters and missing values, MCMC methods using the Gibbs sampler offer a natural option for integration and optimization, without relying on fully determined density functions or expensive derivatives.

    In this chapter, we review the basics of the Bayesian framework and MCMC methods, based on the systematic description by Carlin and Louis in [CL00].


  • 3.1 Bayesian methods

    3.1.1 Bayes Rule

    Basically, if we have data D, i.e., collected samples, at hand, and we want to estimate some parameters θ with the data and a specified model, the likelihood function is the joint probability distribution of the data given θ,

    L(\theta) = f(D \mid \theta)

    The maximum likelihood method estimates the parameters, assuming they are unknown constants, entirely from this likelihood function. The Bayesian approach is different in that it treats the parameters as random draws from a prior distribution q(θ), and bases estimation and inference entirely on the posterior distribution P(θ | D), which follows from simple calculus in the form of Bayes' rule,

    P(\theta \mid D) = \frac{q(\theta) f(D \mid \theta)}{f(D)}

    where

    f(D) = \int q(\theta) f(D \mid \theta)\, d\theta

    3.1.2 Prior distribution

    For the Bayesian approach, introducing a prior distribution provides extra information. There are three major ways to specify it:

    Elicited priors: prior distributions are specified by experts or elicitees with experience and knowledge in the field;

    Conjugate priors: the prior distribution is selected from a family that is conjugate to the likelihood function, so that the posterior distribution belongs to the same distributional family as the prior;

    Noninformative priors: a normal distribution with infinite variance or a uniform distribution over the support of θ is used as the prior distribution.

    Among the three kinds of priors, noninformative priors are simple and actually provide no extra information, and thus avoid the danger of introducing incorrect extra information. With enough data, the information from the data usually dominates the information from the prior, so the choice does not matter much.

    3.1.3 Bayesian inference

    Within the Bayesian framework, we can perform point estimation, interval estimation and hypothesis testing based on the posterior distribution. The first two are the most important.

    Point estimation: First let us look at the univariate parameter case. The estimator θ̂(D) can be a summary feature such as the mean, median, or mode of the posterior distribution P(θ | D). The basic observations on the three features are:

    the posterior mode equals the maximum likelihood estimator with a flat prior;

    the mean and the median are identical for symmetric posterior densities;

    all three measures are the same for symmetric unimodal posteriors.

    The posterior variance with respect to θ̂(D), or mean squared error, defined as

    E_{\theta \mid D}\left( \theta - \hat{\theta}(D) \right)^2,

    provides a measure of accuracy of the point estimator. Denote by μ(D) = E_{θ|D}(θ) the posterior mean, which of course depends on the data D. We have the following decomposition:

    E_{\theta \mid D}\left( \theta - \hat{\theta}(D) \right)^2 = E_{\theta \mid D}\left[ \theta - \mu(D) + \mu(D) - \hat{\theta}(D) \right]^2
      = E_{\theta \mid D}\left( \theta - \mu(D) \right)^2 + \left( \mu(D) - \hat{\theta}(D) \right)^2
      \ge E_{\theta \mid D}\left( \theta - \mu(D) \right)^2 = \mathrm{Var}_{\theta \mid D}(\theta)

    Equality is achieved if and only if θ̂(D) = μ(D), i.e., when we use the posterior mean as the estimator.

    When θ is a vector, the posterior mode is more difficult to obtain, and we still have

    E_{\theta \mid D}\left( \theta - \hat{\theta}(D) \right)\left( \theta - \hat{\theta}(D) \right)^T = \mathrm{Var}_{\theta \mid D}(\theta) + \left( \mu(D) - \hat{\theta}(D) \right)\left( \mu(D) - \hat{\theta}(D) \right)^T

    and the minimum is still achieved at θ̂(D) = μ(D). So generally we use the posterior mean as the estimator to obtain the smallest variance.

    Interval estimation: The counterpart of a frequentist confidence interval (CI) in the Bayesian framework is called a credible set, or Bayesian confidence interval. The 100(1 − α)% credible set for θ is defined as a subset S of the support of θ such that

    1 - \alpha \le P(S \mid D) = \int_{S} P(\theta \mid D)\, d\theta

    However, there is a computational issue with the Bayesian approach. The posterior distribution

    P(\theta \mid D) = \frac{q(\theta) f(D \mid \theta)}{\int q(\theta) f(D \mid \theta)\, d\theta}

    is the basis for all estimation and inference, but because of the integration in the denominator it is difficult to compute explicitly. In addition, we still need to compute the posterior mean and variance, which involve additional integration, usually in a high dimensional space,

    E_{\theta \mid D}(\theta) = \int \theta\, P(\theta \mid D)\, d\theta

    \mathrm{Var}_{\theta \mid D}(\theta) = \int \left( \theta - E_{\theta \mid D}(\theta) \right)^2 P(\theta \mid D)\, d\theta

    Fortunately, we can perform the estimation and inference through sampling, even with

    the existence of missing data, which is the idea of Monte Carlo methods. MCMC is

    a powerful tool for us to overcome the difficulty in computing. We review basics about

    MCMC in next section.

    3.2 MCMC methods

MCMC (Markov chain Monte Carlo) methods are now the most widely used methods for statistical computing. They are based on a class of algorithms for sampling from probability distributions by constructing a Markov chain whose stationary distribution is the desired distribution. After a large number of steps, a state of the chain can be used as a sample from the target distribution. Gilks et al. ([GRS96]) provided a comprehensive summary of the related theory and practice.

    3.2.1 Monte Carlo methods

Monte Carlo methods refer to integration by sampling and averaging. Suppose we have q(θ) and we want to compute the mean of g(θ),

\mu = E(g(\theta)) = \int g(\theta) q(\theta) \, d\theta.

Given an iid (independent and identically distributed) sample θ_1, ..., θ_N ~ q(θ), the average

\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} g(\theta_i)

converges to μ with probability 1 as N → ∞ by the law of large numbers. This is a great property: as long as we have such an iid sample θ_1, ..., θ_N ~ q(θ), we can approximate the expectation of any function of θ. So the problem of integration becomes the problem of obtaining a good sample.
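As a minimal illustration of this sample-and-average idea (not part of the models developed later), the following sketch approximates E(g(θ)) when q(θ) is a standard normal and g(θ) = θ²; the function names and the choice of q and g here are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_mean(g, sampler, n=100_000):
    """Plain Monte Carlo estimate of E[g(theta)] from an iid sample."""
    theta = sampler(n)
    return np.mean(g(theta))

# Example: theta ~ N(0,1), g(theta) = theta^2, so E[g(theta)] = 1.
est = mc_mean(lambda t: t**2, lambda n: rng.standard_normal(n))
print(est)  # close to 1 by the law of large numbers
```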

Usually the target distribution q(θ) is quite complex and hard to sample from directly. When it can at least be evaluated, two commonly used sampling techniques are importance sampling and rejection sampling.

Importance sampling: we sample from a simpler proposal distribution h(θ) instead of q(θ) and define the weight function w(θ) = q(θ)/h(θ); then

\mu = E(g(\theta)) = \int g(\theta) q(\theta) \, d\theta = \int g(\theta) w(\theta) h(\theta) \, d\theta,

and, for θ_1, ..., θ_N ~ h(θ),

\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} g(\theta_i) w(\theta_i).

The optimal proposal distribution is

h^{*}(\theta) = \frac{|g(\theta)| \, q(\theta)}{\int |g(\theta)| \, q(\theta) \, d\theta},

which minimizes the variance of the estimator. The variance of the estimator is

\mathrm{Var}_{h(\theta)}\big(g(\theta) w(\theta)\big) = E_{h(\theta)}\big(g^{2}(\theta) w^{2}(\theta)\big) - \mu^{2},

which attains its minimum at h*(θ).
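A minimal sketch of the weighting step, assuming a standard normal target q and a heavier-tailed t(3) proposal h; these particular distributions and the function names are illustrative choices, not part of the dissertation's models.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def importance_mean(g, n=200_000):
    """Estimate E_q[g(theta)] for q = N(0,1) using a t(3) proposal h."""
    theta = rng.standard_t(df=3, size=n)                  # draw from the proposal h
    w = stats.norm.pdf(theta) / stats.t.pdf(theta, df=3)  # weights w = q/h
    return np.mean(g(theta) * w)

print(importance_mean(lambda t: t**2))  # close to 1 = E_q[theta^2]
```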

Rejection sampling: if there exist a simpler distribution h(θ) and a constant M such that q(θ) ≤ M h(θ), we can sample in the following way:

1. Sample θ_i from h(θ);

2. Sample u from the uniform distribution U(0,1);

3. Keep θ_i if u < q(θ_i) / (M h(θ_i)); otherwise reject it and return to step 1.
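A small sketch of these three steps, assuming a Beta(2,2)-shaped target q(x) = 6x(1−x) on (0,1) with a uniform envelope and M = 1.5; the target, envelope, and names are hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def rejection_sample(q, h_sample, h_pdf, M, n):
    """Accept-reject: keep theta ~ h with probability q(theta) / (M * h(theta))."""
    out = []
    while len(out) < n:
        theta = h_sample()
        u = rng.uniform()
        if u < q(theta) / (M * h_pdf(theta)):
            out.append(theta)
    return np.array(out)

# Target q(x) = 6x(1-x) on (0,1); envelope h = U(0,1); M = 1.5 bounds q everywhere.
q = lambda x: 6.0 * x * (1.0 - x)
draws = rejection_sample(q, rng.uniform, lambda x: 1.0, 1.5, 5000)
print(draws.mean())  # close to 0.5
```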

Unfortunately, the posterior distribution in the Bayesian approach often cannot even be evaluated, because its normalizing constant is unknown. In this case we can turn to MCMC instead. The idea is to sample as follows: start from some initial values and keep updating iteratively with a Markov chain (or chains) until the chain reaches its stationary distribution. The Metropolis-Hastings algorithm and Gibbs sampling are the most popular methods. Gibbs sampling allows us to sample using only a function proportional to the target distribution, and thus avoids computing the integral in the denominator of the posterior distribution in the Bayesian approach, which makes it very useful for our purposes.

    3.2.2 MCMC fundamentals

Mathematically, starting from θ^(0) ~ q(θ), the strategy for generating new samples θ^(i), while exploring the state space using a Markov chain mechanism, is to sample from

P\big(\theta^{(i)} \mid \theta^{(i-1)}, \ldots, \theta^{(0)}\big) = K\big(\theta^{(i)} \mid \theta^{(i-1)}\big).

To eventually sample from P(θ), the requirement is that

q\big(\theta^{(0)}\big) K^{t} \longrightarrow P(\theta) \quad \text{as } t \to \infty.

Then the stationary distribution of the Markov chain must be P(θ), where stationarity means

P K = P.

If this is the case, we can start from an arbitrary state, let the Markov chain perform transitions for a long time, then stop and output the current state θ^(t), which is a sample from P(θ).

To ensure that the Markov chain converges to a unique stationary distribution, the following conditions are sufficient:


• Irreducibility: every state is eventually reachable from any starting state θ^(0); for any state θ there exists a t such that

\Pr{}^{(t)}\big(\theta \mid \theta^{(0)}\big) > 0.

• Aperiodicity: the chain does not get caught in cycles.

The process is ergodic if it is both irreducible and aperiodic.

To ensure that the stationary distribution of the Markov chain is P(θ), it is sufficient for the functions P and K to satisfy the detailed balance (reversibility) condition:

P\big(\theta^{(i)}\big) K\big(\theta^{(i-1)} \mid \theta^{(i)}\big) = P\big(\theta^{(i-1)}\big) K\big(\theta^{(i)} \mid \theta^{(i-1)}\big).

Usually, with some care, these conditions are satisfied by an MCMC algorithm. In practice, however, we also need to tell when the stationary distribution, i.e., convergence, has been reached; since the functions K and P are difficult or impossible to compute, it is not practical to check the above conditions directly. This task is called convergence diagnostics.

The pre-convergence period is called the burn-in period. Broadly, there are multiple-chain approaches, which require running several chains independently, and single-chain approaches, for which running a single chain is enough. The advantage of multiple-chain approaches is that we obtain independent samples across chains, while the disadvantage is that many samples are wasted. In contrast, single-chain approaches are less expensive, but they produce only correlated samples, from which roughly independent samples must then be constructed.

One representative and popular multiple-chain approach was proposed by Gelman and Rubin ([GR92]). They run a small number n of parallel chains, with starting points that are over-dispersed with respect to the true posterior, for 2N iterations each, and keep the samples from the last N iterations. They then compare the variation within the chains, for a given parameter of interest, to the total variation across the chains. Specifically, they monitor convergence by the estimated scale reduction factor R, defined as

R = \left( \frac{N-1}{N} + \frac{n+1}{nN} \, \frac{B}{W} \right) \frac{df}{df - 2},

where B/N is the variance between the means of the n parallel chains, W is the average of the n within-chain variances, and df is the degrees of freedom of an approximating t density to the posterior distribution. They showed that

R \longrightarrow 1 \quad \text{as } t \to \infty,

which means that after a long running time the between-chain variance equals the within-chain variance. Since this criterion applies to a single parameter, we can randomly select some parameters to monitor; there are also ways to construct a criterion that monitors the convergence of all parameters.
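The following sketch shows the core between/within-chain comparison on simulated chains; it omits the t-density degrees-of-freedom correction in the formula above, and the array shapes and function name are assumptions made for illustration.

```python
import numpy as np

def scale_reduction(chains):
    """Gelman-Rubin style scale reduction from an (n_chains, N) array of draws.

    Only the between/within variance ratio is used here; the df/(df - 2)
    correction described in the text is omitted.
    """
    n, N = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()   # average within-chain variance
    B_over_N = chain_means.var(ddof=1)      # variance between chain means (= B/N)
    return (N - 1) / N + (n + 1) / n * B_over_N / W

rng = np.random.default_rng(3)
chains = rng.standard_normal((4, 2000))     # four chains already at stationarity
print(scale_reduction(chains))              # close to 1 when the chains agree
```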

One representative single-chain approach was proposed by Raftery and Lewis ([RL92]). They retain only every nth sample after burn-in, with n large enough that the retained samples are approximately independent. More specifically, their procedure supposes that the goal is to estimate a posterior quantile q = P(θ < a | D) to within ±r with probability s. For example, with q = .025, r = .005, and s = .95, the reported 95% credible sets would have true coverage between .94 and .96. To accomplish this, they considered the thinned indicator process Z_t^{(n)} = Z_{1+(t-1)n}, with indicator Z_j = I(θ^{(j)} < a), and used results from Markov chain convergence theory. So, in essence, one waits until some selected posterior quantiles become very stable.

The multiple-chain approach of Gelman and Rubin is less prone to missing convergence failures than single-chain approaches, but it also fails occasionally. And since all diagnostic methods use only a finite realization of the chain, no diagnostic can prove convergence of an MCMC algorithm. Statisticians still rely heavily on such diagnostics, because even a weak diagnostic is useful.


We also need to provide point and interval estimates based on the samples we generate. Again there are multiple-chain and single-chain approaches. For a multiple-chain approach, one needs a large number, say N, of independently running Markov chains; after they have all converged, collect the current sample θ^(j) from chain j, and the posterior mean can be estimated as

E(\theta \mid D) \approx \bar{\theta}_N = \frac{1}{N} \sum_{j=1}^{N} \theta^{(j)}.

The variance of the posterior mean can be estimated as

\widehat{\mathrm{Var}}(\bar{\theta}_N) = \frac{1}{N(N-1)} \sum_{j=1}^{N} \big(\theta^{(j)} - \bar{\theta}_N\big)^{2}.

For single-chain approaches, one representative method is called batching. We divide the single long run of length N into m successive batches of length k (i.e., N = mk), provided that k is large enough for the correlation between batches to be negligible, and m is large enough to reliably estimate the between-batch variance. We then compute the mean of each batch to get B_1, ..., B_m, and the posterior mean is estimated by

E(\theta \mid D) \approx \bar{\theta}_N = \frac{1}{m} \sum_{j=1}^{m} B_j.

The variance of the posterior mean can be estimated by

\widehat{\mathrm{Var}}(\bar{\theta}_N) = \frac{1}{m(m-1)} \sum_{j=1}^{m} \big(B_j - \bar{\theta}_N\big)^{2}.

In either case, the 95% interval estimate for E(θ|D) is given by

\bar{\theta}_N \pm z_{0.025} \sqrt{\widehat{\mathrm{Var}}(\bar{\theta}_N)}
\quad \text{or} \quad
\bar{\theta}_N \pm t_{m-1,\,0.025} \sqrt{\widehat{\mathrm{Var}}(\bar{\theta}_N)},

where z_{0.025} and t_{m−1,0.025} are the upper .025 points of the standard normal distribution and the t distribution with m−1 degrees of freedom, respectively, depending on the method used and the sample or batch size.
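A minimal sketch of the batch-means estimator, applied to a deliberately correlated chain (a simple AR(1) sequence standing in for MCMC output); the chain, batch count, and names are illustrative assumptions.

```python
import numpy as np

def batch_means(draws, m):
    """Batch-means estimate of the posterior mean and its Monte Carlo variance."""
    k = len(draws) // m                      # batch length; assumes N = m * k
    batches = draws[: m * k].reshape(m, k).mean(axis=1)
    mean = batches.mean()
    var = batches.var(ddof=1) / m            # = sum((B_j - mean)^2) / (m * (m - 1))
    return mean, var

# Example: a correlated chain (AR(1) with lag-1 correlation 0.9) whose mean is 0.
rng = np.random.default_rng(4)
x = np.empty(50_000)
x[0] = 0.0
for t in range(1, len(x)):
    x[t] = 0.9 * x[t - 1] + rng.standard_normal()
mean, var = batch_means(x, m=50)
print(mean, mean - 1.96 * var**0.5, mean + 1.96 * var**0.5)
```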

    3.2.3 Gibbs sampling

Gibbs sampling provides a way to convert a multivariate sampling problem into a sequence of univariate sampling problems. Suppose we have

\theta = (\theta_1, \ldots, \theta_p) \sim q(\theta_1, \ldots, \theta_p).

If all full conditional distributions {q_i(θ_i | θ_j, j ≠ i), i = 1, ..., p} are available for sampling, we start from an arbitrary set of initial values (θ_1^{(0)}, ..., θ_p^{(0)}); at iteration t (t ≥ 0), we proceed with the following sampling steps to get (θ_1^{(t+1)}, ..., θ_p^{(t+1)}):

\theta_1^{(t+1)} \sim q_1\big(\theta_1 \mid \theta_2^{(t)}, \ldots, \theta_p^{(t)}\big)

\theta_2^{(t+1)} \sim q_2\big(\theta_2 \mid \theta_1^{(t+1)}, \theta_3^{(t)}, \ldots, \theta_p^{(t)}\big)

\;\;\vdots

\theta_p^{(t+1)} \sim q_p\big(\theta_p \mid \theta_1^{(t+1)}, \ldots, \theta_{p-1}^{(t+1)}\big).

Geman and Geman ([GG84]) proved the following results (Gibbs convergence theorem):
(i) (θ_1^{(t)}, ..., θ_p^{(t)}) converges in distribution to (θ_1, ..., θ_p) ~ q(θ_1, ..., θ_p) as t → ∞;
(ii) using the L1 norm, the convergence rate is exponential in t.
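As a minimal sketch of this cycle (assuming a toy target, not any model used later): for a bivariate normal with zero means, unit variances, and correlation ρ, both full conditionals are univariate normals N(ρ·other, 1−ρ²), so the two-step Gibbs update below is exact.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_iter=10_000, seed=5):
    """Gibbs sampler for (x, y) bivariate normal with zero means, unit variances,
    and correlation rho; each full conditional is N(rho * other, 1 - rho^2)."""
    rng = np.random.default_rng(seed)
    x, y = 0.0, 0.0                                       # arbitrary starting values
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        x = rng.normal(rho * y, np.sqrt(1 - rho**2))      # draw x | y
        y = rng.normal(rho * x, np.sqrt(1 - rho**2))      # draw y | x
        draws[t] = (x, y)
    return draws

draws = gibbs_bivariate_normal(rho=0.8)
print(np.corrcoef(draws[2000:].T)[0, 1])  # close to 0.8 after burn-in
```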

    3.2.4 Metropolis-Hastings algorithm

The Metropolis-Hastings algorithm is one of the most popular MCMC methods. Metropolis et al. ([MRR+53]) presented the method in its simplest form. We choose an auxiliary proposal function h(x, y) such that h(·, y) is a pdf with respect to x for any y and is symmetric, i.e., h(x, y) = h(y, x). At step t, given the current state θ^(t), we do the following:

1. Sample θ* ~ h(·, θ^(t));

2. Compute the ratio r = \min\left\{1, \dfrac{q(\theta^{*})}{q(\theta^{(t)})}\right\};

3. Sample u ~ U(0,1) and set

\theta^{(t+1)} =
\begin{cases}
\theta^{(t)} & \text{if } u > r, \\
\theta^{*} & \text{if } u \le r.
\end{cases}

It can be proved that, under mild conditions, θ^(t) converges in distribution to q(θ) as t → ∞.

Hastings ([H70]) made a simple but important generalization of the Metropolis algorithm in 1970. He removed the requirement that the proposal function h(x, y) be symmetric, so that the ratio becomes

r = \min\left\{1, \frac{q(\theta^{*}) \, h(\theta^{(t)}, \theta^{*})}{q(\theta^{(t)}) \, h(\theta^{*}, \theta^{(t)})}\right\};

the rest of the procedure remains the same.
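A minimal random-walk Metropolis sketch: the Gaussian proposal is symmetric, so the simple Metropolis ratio applies, and only an unnormalized target q is needed. The target here (standard normal up to a constant), the step size, and the names are illustrative assumptions.

```python
import numpy as np

def metropolis(log_q, theta0, step=1.0, n_iter=20_000, seed=6):
    """Random-walk Metropolis: symmetric normal proposal, so r = q(new)/q(old)."""
    rng = np.random.default_rng(seed)
    theta = theta0
    draws = np.empty(n_iter)
    for t in range(n_iter):
        prop = theta + step * rng.standard_normal()       # symmetric proposal
        if np.log(rng.uniform()) < log_q(prop) - log_q(theta):
            theta = prop                                    # accept with prob min(1, r)
        draws[t] = theta
    return draws

# Unnormalized target q(theta) proportional to exp(-theta^2 / 2).
draws = metropolis(lambda th: -0.5 * th**2, theta0=5.0)
print(draws[5000:].mean(), draws[5000:].std())  # roughly 0 and 1
```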

    3.2.5 Adaptive rejection sampling for Gibbs sampling

Even with Gibbs sampling, where at every step we only need to sample from a univariate conditional distribution q_i(θ_i | θ_j, j ≠ i), it is still difficult to sample directly under most circumstances. Adaptive rejection sampling, proposed by Gilks and Wild ([GW92]), is an efficient method that can be used when the conditional distribution is log-concave, directly or after some transformation. The basic idea of the sampling method is illustrated by Figure 3.1.


Figure 3.1: The idea of adaptive rejection sampling, with functions defined as: h(x) = log(g(x)), the logarithm of a function g(x) proportional to a statistical density function f(x); u(x), the upper hull of h(x); and l(x), the lower hull of h(x).

Suppose g(x) is proportional to the target distribution and let h(x) = log(g(x)), which is concave with a given domain. We maintain a set of sorted points {x_i, i = 1, ..., n}, x_1 < x_2 < ... < x_n, at which h(x) has been evaluated. The lower hull l(x) is formed by the chords between (x_i, h(x_i)) and (x_{i+1}, h(x_{i+1})), while the upper hull u(x) is formed by the tangents at {x_i, i = 1, ..., n}. Denote the abscissae {z_i, i = 1, ..., n−1}, where z_i is the intersection of the tangents at x_i and x_{i+1}.

The kernel of the algorithm for sampling one point is the following:

1. Sample x* from the normalized exponential of the upper hull, s(x);

2. Sample q from the uniform distribution U(0,1);

3. If q < exp[l(x*) − u(x*)], accept x* and stop. Otherwise evaluate h(x*); if q < exp[h(x*) − u(x*)], accept x* and stop. Otherwise update the upper and lower hulls with the rejected x* and (h(x*), h'(x*)), resulting in closer upper and lower bounds, and go back to step 1.

From this we see that each rejected sample leads to closer upper and lower bounds and hence higher efficiency for subsequent sampling; this is why the method is called adaptive.

Gilks and Wild ([GW92]) also provided the following formulas. For the abscissae and the tangent values at them, we have

z_i = x_i + \frac{h(x_i) - h(x_{i+1}) + h'(x_{i+1})\,(x_{i+1} - x_i)}{h'(x_{i+1}) - h'(x_i)},
\qquad
u(z_i) = h'(x_i)\,(z_i - x_i) + h(x_i).

These can be derived from the tangent functions and the formula for the intersection of two lines.
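A quick numerical check of the intersection formula, using the log-concave h(x) = −x²/2 (the standard normal log-density up to a constant) as a hypothetical example: the two tangent lines must agree at the computed abscissa.

```python
def z_intersect(x1, x2, h, dh):
    """Intersection abscissa of the tangents to h at x1 and x2 (x1 < x2)."""
    return x1 + (h(x1) - h(x2) + dh(x2) * (x2 - x1)) / (dh(x2) - dh(x1))

h = lambda x: -0.5 * x**2    # log-concave test function
dh = lambda x: -x            # its derivative
x1, x2 = 0.0, 1.0
z = z_intersect(x1, x2, h, dh)
# Both tangent lines evaluated at z must give the same value.
print(z, h(x1) + dh(x1) * (z - x1), h(x2) + dh(x2) * (z - x2))  # 0.5, 0.0, 0.0
```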

Let c_u denote the normalizing factor of the exponentiated upper hull, i.e., the integral of the exponentiated upper hull over its domain; we have

c_u = \int \exp[u(x)] \, dx
= \sum_{j=0}^{n-1} \frac{1}{h'(x_{j+1})} \big\{ \exp[u(z_{j+1})] - \exp[u(z_j)] \big\},

where z_0 is the lower bound of x (or −∞ if x has no lower bound), and z_n is the upper bound of x (or +∞ if x has no upper bound). Denote the cumulative distribution function of s(x) by s_{cum}(x); at an abscissa z_i it is

s_{cum}(z_i) = \frac{1}{c_u} \sum_{j=0}^{i-1} \frac{1}{h'(x_{j+1})} \big\{ \exp[u(z_{j+1})] - \exp[u(z_j)] \big\}.

To sample x* from s(x), we first sample q from the uniform distribution U(0,1), find the index i = \arg\max_i \{ s_{cum}(z_i) < q \}, and compute x* by

x^{*} = z_i + \frac{1}{h'(x_{i+1})} \log\left[ 1 + \frac{h'(x_{i+1}) \, c_u \, \big(q - s_{cum}(z_i)\big)}{\exp[u(z_i)]} \right].

The adaptive rejection sampling method allows us to work with a function proportional to the target distribution and to constrain it between an upper hull and a lower hull, which are simple piecewise linear segments. Introducing the upper and lower hulls exploits the advantage of rejection sampling with squeezing, so the method is efficient and is the one most used in practice in our current research.

CHAPTER 4

    Methodology

In this chapter, we present our approach, a Bayesian framework with an MCMC implementation, for handling the analysis of longitudinal data with missing values. This approach can accommodate different assumptions about the missingness mechanism and can be applied to various kinds of repeated measures.

    4.1 Bayesian Inference using MCMC

    4.1.1 Priors specification and monitoring convergence

    In our Bayesian framework, we need to specify prior distributions for all parameters

involved. Since most of the time there is no actual prior information or enough historical study, non-informative or flat priors are adopted, which usually yields results consistent with those based on the maximum likelihood estimates, when stable estimates exist. More specifically, a normal distribution with infinite variance, or a uniform distribution over the support of the target, is used for all the regression coefficients and for the logarithms of the variances of the random effects and random errors. We tried some informative priors, e.g., an inverse-Gamma distribution on the variance parameters, but did not observe much difference, since with enough data the information from the data usually dominates the information from the priors. Flat uniform distributions are used for parameters with restricted support, such as the correlation parameter ρ in the AR(1) structure, which has the natural constraint −1 ≤ ρ ≤ 1.

    On monitoring convergence, we tried two approaches: the first is to start two or

    more Markov chains from quite different initial values and wait until they interweave

    with each other and the between chain variance is about the same as the within chain

    variance. The second is to work with one chain, retaining only every kth sample after

    burn-in, with k large enough so that the retained samples are approximately indepen-

    dent. The stopping time is when some selected quantiles become stable.

4.1.2 MCMC: Data augmentation via Gibbs sampler

We employ the method of data augmentation via the Gibbs sampler to alleviate the computational difficulty caused by missing values when sampling from the posterior distribution. We first impute the missing values and then draw the parameters one by one, conditional on the observed and imputed data. More specifically, the Gibbs sampler is an iterative procedure with each iteration consisting of two steps.

(I) Imputation-Step, where the missing values are updated by drawing from the conditional predictive distribution. That is, for i = 1, ..., N, draw y^{mis}_{ij} from

y^{mis}_{ij} \sim f_{ij}\big(y \mid y_i^{\setminus j}, x_i, r_i, \theta\big),

where y_i^{\setminus j} means the vector y_i excluding y_{ij}. For multivariate-normally distributed measures, this predictive conditional is a scalar normal distribution.

We still use d_i to denote the dropout time point for subject i. To fill in the missing values, we actually only need to impute all intermittent missing values and the first dropout value for the purpose of computation, because the dropout probability at d_i depends only on the current and previous values (i.e., y_{i,d_i} and H_{i,d_i}). Imputing additional dropout values is not necessary for computation and would not provide additional information, since all information comes from the available data and the stability assumption we make: the models remain unchanged through time. Note that imputing intermittent missing values differs from imputing the first dropout value, since the former draws information from both its history and its future, while the latter has information only from its history.

(II) Posterior-Step, where the parameters θ are drawn from the posterior distribution P(θ | Y, X, R), component by component, according to the decomposition of the joint distribution into full-conditional distributions,

\theta_k \sim f\big(\theta_k \mid \theta_{\setminus k}, y_1, \ldots, y_N, X, R\big),

where y_i = (y_{i1}, ..., y_{i,d_i−1}, y^{mis}_{i,d_i})^T, Y = (y_1, ..., y_N)^T, and θ_{\setminus k} means θ with the indicated component excluded.

Note that in the above algorithm, missing values are simulated from the predictive function, while parameters are simulated from full-conditional distributions. Essentially, however, it is a process in which one unobserved quantity, a missing value or a parameter, is drawn at a time from its distribution conditional on all other related quantities: observed data, imputed data, and parameters. Similar ideas were adopted by Schafer ([S97]) in his data augmentation algorithms for creating imputations of missing values in multivariate data sets.

Starting from an initial point and repeating the two sampling stages for a large enough number of iterations, the procedure converges to its stationary distribution, i.e., the joint distribution of the parameters and the missing values. Thus, after a long enough burn-in period, the simulated missing values and parameters can be used for parameter estimation. Note also that this algorithm can be used for conducting multiple imputation ([R87]), where multiple imputed data sets are first created and then analyzed using standard longitudinal models for complete data.
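The following toy sketch shows only the structure of the two-step cycle (I-step then P-step), not the selection models developed below: it assumes a univariate normal sample with known variance, a flat prior on the mean, and ignorable missingness, so every name and modeling choice here is a simplifying assumption for illustration.

```python
import numpy as np

def data_augmentation(y_obs, n_missing, n_iter=5000, sigma=1.0, seed=7):
    """Toy data augmentation: impute missing y (I-step), then draw the mean mu (P-step).

    Assumes y ~ N(mu, sigma^2) with sigma known, a flat prior on mu, and
    ignorable missingness; it only illustrates the iterative two-step scheme.
    """
    rng = np.random.default_rng(seed)
    n = len(y_obs) + n_missing
    mu = np.mean(y_obs)                                   # starting value
    mu_draws = np.empty(n_iter)
    for t in range(n_iter):
        # I-step: draw the missing values from their predictive distribution.
        y_mis = rng.normal(mu, sigma, size=n_missing)
        # P-step: draw mu from its full conditional given the completed data.
        y_full_mean = (y_obs.sum() + y_mis.sum()) / n
        mu = rng.normal(y_full_mean, sigma / np.sqrt(n))
        mu_draws[t] = mu
    return mu_draws

y_obs = np.random.default_rng(0).normal(2.0, 1.0, size=80)
print(data_augmentation(y_obs, n_missing=20).mean())   # close to the mean of y_obs
```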


4.2 Implementation for selection models

We implemented this approach for continuous response data with dropout missingness, using the same modeling strategy as Diggle and Kenward [DK94]. For the autocorrelation, we tried both a first-order autoregressive (AR(1)) variance structure and an explicit random-effects (random-intercept) model.

    4.2.1 Selection Models with AR(1) Covariance

For the AR(1) covariance matrix Σ_J, the inverse is tridiagonal,

\Sigma_J^{-1} = \frac{1}{\sigma^2 (1-\rho^2)}
\begin{pmatrix}
1 & -\rho & 0 & \cdots & 0 \\
-\rho & 1+\rho^2 & -\rho & \cdots & 0 \\
\vdots & \ddots & \ddots & \ddots & \vdots \\
0 & \cdots & -\rho & 1+\rho^2 & -\rho \\
0 & \cdots & 0 & -\rho & 1
\end{pmatrix},

and det(Σ_J) = (σ²)^J (1−ρ²)^{J−1}.
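A quick numerical check of the tridiagonal inverse and the determinant identity, assuming the usual AR(1) covariance entries σ²ρ^{|j−k|} (the exact definition of Σ_J appears with the model specification earlier in this dissertation); the dimensions and parameter values are arbitrary.

```python
import numpy as np

def ar1_cov(J, sigma2, rho):
    """AR(1) covariance matrix with entries sigma2 * rho**|j-k|."""
    idx = np.arange(J)
    return sigma2 * rho ** np.abs(idx[:, None] - idx[None, :])

def ar1_precision(J, sigma2, rho):
    """Tridiagonal inverse given in the text."""
    P = np.zeros((J, J))
    np.fill_diagonal(P, 1 + rho**2)
    P[0, 0] = P[-1, -1] = 1.0
    for j in range(J - 1):
        P[j, j + 1] = P[j + 1, j] = -rho
    return P / (sigma2 * (1 - rho**2))

J, sigma2, rho = 6, 2.0, 0.7
S = ar1_cov(J, sigma2, rho)
print(np.allclose(np.linalg.inv(S), ar1_precision(J, sigma2, rho)))        # True
print(np.isclose(np.linalg.det(S), sigma2**J * (1 - rho**2) ** (J - 1)))   # True
```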

The Gibbs sampler consists of the following steps.

I. Imputation-Step: draw the missing values from

y_{i,d_i} \sim f_{i,d_i}\big(y \mid y_i^{obs}, x_i^{obs}, \theta\big)
\propto \frac{1}{\sqrt{2\pi \nu_i}} \exp\Big\{ -\frac{\big(y - \sum_{k=1}^{K} x_{i,d_i,k}\beta_k - \mu_i\big)^2}{2\nu_i} \Big\}
\cdot \frac{\exp\big(\phi_0 + \phi_1 y + \sum_{k=2}^{d_i} y_{i,d_i+1-k}\phi_k\big)}{1 + \exp\big(\phi_0 + \phi_1 y + \sum_{k=2}^{d_i} y_{i,d_i+1-k}\phi_k\big)},

where the φ_k are the parameters of the logistic dropout model,

\mu_i = C_{(d_i-1)}^{T} \Sigma_{d_i-1}^{-1} \big(y_i^{obs} - x_i^{obs}\beta\big)
\quad \text{and} \quad
\nu_i = \sigma^2 \big(1 - C_{(d_i-1)}^{T} \Sigma_{d_i-1}^{-1} C_{(d_i-1)}\big),

with x_i^{obs} = (x_{i1}, ..., x_{i,d_i-1})^T and C_{(j)} = (ρ^{j}, ..., ρ)^T. This can be derived by regressing y_{i,d_i} on y_i^{obs}.

II. Posterior-Step: draw the parameters one by one in the following order.

(1) For k = 1, ..., K, draw the fixed-effect parameters:

\beta_k \sim f\big(\beta_k \mid \theta_{\setminus \beta_k}, Y, X\big)
\propto \prod_{i=1}^{N} \exp\Big\{ -\frac{(y_i - x_i\beta)^{T} \Sigma_{d_i}^{-1} (y_i - x_i\beta)}{2} \Big\}.

(2) Draw the autocorrelation parameter ρ in the variance structure:

\rho \sim f\big(\rho \mid \theta_{\setminus \rho}, Y, X\big)
\propto \prod_{i=1}^{N} \frac{1}{\sqrt{\det(\Sigma_{d_i})}}
\exp\Big\{ -\frac{(y_i - x_i\beta)^{T} \Sigma_{d_i}^{-1} (y_i - x_i\beta)}{2} \Big\}.

(3) Draw the variance of the residuals:

\sigma^2 \sim f\big(\sigma^2 \mid \theta_{\setminus \sigma^2}, Y, X\big)
\propto \prod_{i=1}^{N} \frac{1}{\sqrt{\det(\Sigma_{d_i})}}
\exp\Big\{ -\frac{(y_i - x_i\beta)^{T} \Sigma_{d_i}^{-1} (y_i - x_i\beta)}{2} \Big\}.

(4) For k = 1, ..., J, draw the parameters of the dropout mechanism:

\phi_k \sim f\big(\phi_k \mid \theta_{\setminus \phi_k}, Y\big)
\propto \prod_{i=1}^{N} \Bigg[ \prod_{j=2}^{d_i-1} \frac{1}{1 + \exp\big(\phi_0 + \phi_1 y_{ij} + \sum_{k=2}^{j} y_{i,j+1-k}\phi_k\big)} \Bigg]
\cdot \frac{\exp\big(\phi_0 + \phi_1 y_{i,d_i} + \sum_{k=2}^{d_i} y_{i,d_i+1-k}\phi_k\big)}{1 + \exp\big(\phi_0 + \phi_1 y_{i,d_i} + \sum_{k=2}^{d_i} y_{i,d_i+1-k}\phi_k\big)}.

In the above density functions, x_i = (x_{i1}, ..., x_{i,d_i})^T and X = (x_1, ..., x_N)^T.

When drawing each parameter, all other parameters are viewed as known constants. All of the above conditional distributions, except the one for ρ, have log-concave forms, directly or after some transformation, and thus can be simulated using the method of adaptive rejection sampling. A convenient sampling method for ρ is the Metropolis-Hastings algorithm. The proof is simple.

(1) For the missing values, the predictive function is log-concave because

\frac{\partial^2 \log f_{i,d_i}(y \mid y_i^{obs}, x_i^{obs}, \theta)}{\partial y^2}
= -\frac{1}{\nu_i}
- \frac{\phi_1^2 \exp\big(\phi_0 + \phi_1 y + \sum_{k=2}^{d_i} y_{i,d_i+1-k}\phi_k\big)}{\big(1 + \exp\big(\phi_0 + \phi_1 y + \sum_{k=2}^{d_i} y_{i,d_i+1-k}\phi_k\big)\big)^2} < 0,

where ν_i is the positive conditional variance.

(2) For the fixed-effect parameters, the conditional density function is log-concave because

\frac{\partial^2 \log f(\beta_k \mid \theta_{\setminus \beta_k}, Y, X)}{\partial \beta_k^2}
= -\sum_{i=1}^{N} (x_{i1k}, \ldots, x_{i,d_i,k}) \, \Sigma_{d_i}^{-1} \, (x_{i1k}, \ldots, x_{i,d_i,k})^{T} < 0.

(3) For the residual variance, the conditional density function after a logarithmic transform, s = log(σ²), is log-concave:

\frac{\partial^2 \log f(s \mid \theta_{\setminus s}, Y, X)}{\partial s^2}
= -\sum_{i=1}^{N} \frac{(y_i - x_i\beta)^{T} \, \Sigma_{d_i}^{*\,-1} \, (y_i - x_i\beta)}{2 e^{s}} < 0,

where Σ*_{d_i} denotes the correlation part of Σ_{d_i}, i.e., Σ_{d_i} with σ² set to 1.

(4) For the parameters of the dropout mechanism, the conditional density function is log-concave because

\frac{\partial^2 \log f(\phi_k \mid \theta_{\setminus \phi_k}, Y)}{\partial \phi_k^2}
= -\sum_{i=1}^{N} \Bigg[ \sum_{j=2}^{d_i-1}
\frac{y_{i,j+1-k}^2 \exp\big(\phi_0 + \phi_1 y_{ij} + \sum_{k=2}^{j} y_{i,j+1-k}\phi_k\big)}{\big(1 + \exp\big(\phi_0 + \phi_1 y_{ij} + \sum_{k=2}^{j} y_{i,j+1-k}\phi_k\big)\big)^2}
+ \frac{y_{i,d_i+1-k}^2 \exp\big(\phi_0 + \phi_1 y_{i,d_i} + \sum_{k=2}^{d_i} y_{i,d_i+1-k}\phi_k\big)}{\big(1 + \exp\big(\phi_0 + \phi_1 y_{i,d_i} + \sum_{k=2}^{d_i} y_{i,d_i+1-k}\phi_k\big)\big)^2}
\Bigg] < 0.

    4.2.2 Selection Models with Random-Effects

In the selection model with random effects, the random-effects model for the repeated measures can be written as

y_{ij} = \sum_{k=1}^{K} x_{ijk}\beta_k + \sum_{k=1}^{q} z_{ijk}\alpha_{ik} + \epsilon_{ij},

where ε_ij ~ N(0, σ²). Here we restrict the random effects to be independent of each other, with α_ik ~ N(0, σ_k²), k = 1, ..., q. In the Gibbs sampler, the missing values (y_{1,d_1}, ..., y_{N,d_N}) and the parameters θ = (β, α, σ², {σ_k²}, φ) are simulated in the following steps.

I. Imputation-Step: draw the missing values from

y_{i,d_i} \sim f_{i,d_i}\big(y \mid y_i^{obs}, x_{i,d_i}, z_{i,d_i}, \theta\big)
\propto \exp\Big\{ -\frac{\big(y - \sum_{k=1}^{K} x_{i,d_i,k}\beta_k - \sum_{k=1}^{q} z_{i,d_i,k}\alpha_{ik}\big)^2}{2\sigma^2} \Big\}
\cdot \frac{\exp\big(\phi_0 + \phi_1 y + \sum_{k=2}^{d_i} y_{i,d_i+1-k}\phi_k\big)}{1 + \exp\big(\phi_0 + \phi_1 y + \sum_{k=2}^{d_i} y_{i,d_i+1-k}\phi_k\big)}.

II. Posterior-Step: draw the parameters one by one.

(1) For the fixed-effect parameters:

\beta_k \sim f\big(\beta_k \mid \theta_{\setminus \beta_k}, Y, X, Z\big)
\propto \prod_{i=1}^{N} \prod_{j=1}^{d_i}
\exp\Big\{ -\frac{\big(y_{ij} - \sum_{k=1}^{K} x_{ijk}\beta_k - \sum_{k=1}^{q} z_{ijk}\alpha_{ik}\big)^2}{2\sigma^2} \Big\}.

(2) For i = 1, ..., N and k = 1, ..., q, simulate the random effects:

\alpha_{ik} \sim f\big(\alpha_{ik} \mid \theta_{\setminus \alpha_{ik}}, Y, X, Z\big)
\propto \prod_{j=1}^{d_i}
\exp\Big\{ -\frac{\big(y_{ij} - \sum_{k=1}^{K} x_{ijk}\beta_k - \sum_{k=1}^{q} z_{ijk}\alpha_{ik}\big)^2}{2\sigma^2} \Big\}
\cdot \exp\Big\{ -\frac{\alpha_{ik}^2}{2\sigma_k^2} \Big\}.

(3) For k = 1, ..., q, draw the variance of the random effects:

\sigma_k^2 \sim f\big(\sigma_k^2 \mid \theta_{\setminus \sigma_k^2}\big)
\propto \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma_k^2}} \exp\Big\{ -\frac{\alpha_{ik}^2}{2\sigma_k^2} \Big\}.

(4) For the residual variance:

\sigma^2 \sim f\big(\sigma^2 \mid \theta_{\setminus \sigma^2}, Y, X, Z\big)
\propto \prod_{i=1}^{N} \prod_{j=1}^{d_i} \frac{1}{\sqrt{2\pi\sigma^2}}
\exp\Big\{ -\frac{\big(y_{ij} - \sum_{k=1}^{K} x_{ijk}\beta_k - \sum_{k=1}^{q} z_{ijk}\alpha_{ik}\big)^2}{2\sigma^2} \Big\}.

(5) For the parameters of the dropout model:

\phi_k \sim f\big(\phi_k \mid \theta_{\setminus \phi_k}, Y\big)
\propto \prod_{i=1}^{N} \Bigg[ \prod_{j=2}^{d_i-1} \frac{1}{1 + \exp\big(\phi_0 + \phi_1 y_{ij} + \sum_{k=2}^{j} y_{i,j+1-k}\phi_k\big)} \Bigg]
\cdot \frac{\exp\big(\phi_0 + \phi_1 y_{i,d_i} + \sum_{k=2}^{d_i} y_{i,d_i+1-k}\phi_k\big)}{1 + \exp\big(\phi_0 + \phi_1 y_{i,d_i} + \sum_{k=2}^{d_i} y_{i,d_i+1-k}\phi_k\big)}.

In the above densities, z_i = (z_{i1}, ..., z_{i,d_i})^T and Z = (z_1, ..., z_N)^T.

As shown here, all of the above conditional distributions have log-concave forms, directly or after some transformation, and thus can be simulated using the method of adaptive rejection sampling.

(1) For the missing values, the predictive function is log-concave because

\frac{\partial^2 \log f_{i,d_i}(y \mid x_{i,d_i}, z_{i,d_i}, \theta)}{\partial y^2}
= -\frac{1}{\sigma^2}
- \frac{\phi_1^2 \exp\big(\phi_0 + \phi_1 y + \sum_{k=2}^{d_i} y_{i,d_i+1-k}\phi_k\big)}{\big(1 + \exp\big(\phi_0 + \phi_1 y + \sum_{k=2}^{d_i} y_{i,d_i+1-k}\phi_k\big)\big)^2} < 0.

(2) For the fixed-effect parameters, the conditional density function is log-concave because

\frac{\partial^2 \log f(\beta_k \mid \theta_{\setminus \beta_k}, Y, X, Z)}{\partial \beta_k^2}
= -\sum_{i=1}^{N} \sum_{j=1}^{d_i} \frac{x_{ijk}^2}{\sigma^2} < 0.

(3) For the random effects, the conditional density function is log-concave because

\frac{\partial^2 \log f(\alpha_{ik} \mid \theta_{\setminus \alpha_{ik}}, Y, X, Z)}{\partial \alpha_{ik}^2}
= -\sum_{j=1}^{d_i} \frac{z_{ijk}^2}{\sigma^2} - \frac{1}{\sigma_k^2} < 0.

(4) For the variance of the random effects, the conditional density function after a logarithmic transform, s = log(σ_k²), is log-concave:

\frac{\partial^2 \log f(s \mid \theta_{\setminus s}, Y)}{\partial s^2}
= -\sum_{i=1}^{N} \frac{\alpha_{ik}^2}{2 e^{s}} < 0.

(5) For the residual variance, the conditional density function after a logarithmic transform, s = log(σ²), is log-concave:

\frac{\partial^2 \log f(s \mid \theta_{\setminus s}, Y, X, Z)}{\partial s^2}
= -\sum_{i=1}^{N} \sum_{j=1}^{d_i}
\frac{\big(y_{ij} - \sum_{k=1}^{K} x_{ijk}\beta_k - \sum_{k=1}^{q} z_{ijk}\alpha_{ik}\big)^2}{2 e^{s}} < 0.

(6) For the parameters of the dropout mechanism, the conditional density function is log-concave because

\frac{\partial^2 \log f(\phi_k \mid \theta_{\setminus \phi_k}, Y)}{\partial \phi_k^2}
= -\sum_{i=1}^{N} \Bigg[ \sum_{j=2}^{d_i-1}
\frac{y_{i,j+1-k}^2 \exp\big(\phi_0 + \phi_1 y_{ij} + \sum_{k=2}^{j} y_{i,j+1-k}\phi_k\big)}{\big(1 + \exp\big(\phi_0 + \phi_1 y_{ij} + \sum_{k=2}^{j} y_{i,j+1-k}\phi_k\big)\big)^2}
+ \frac{y_{i,d_i+1-k}^2 \exp\big(\phi_0 + \phi_1 y_{i,d_i} + \sum_{k=2}^{d_i} y_{i,d_i+1-k}\phi_k\big)}{\big(1 + \exp\big(\phi_0 + \phi_1 y_{i,d_i} + \sum_{k=2}^{d_i} y_{i,d_i+1-k}\phi_k\big)\big)^2}
\Bigg] < 0.

    4.3 Implementation for shared-parameter models

Similar to the models for binary response variables (see Section 2.3.3) used by Albert and Follmann ([AF03]), for count and continuous response variables we also combine the advantages of mixed-effects models and Markov transition models to set up the submodels for the repeated measures and the missingness indicators.

    4.3.1 Count response variables

For Poisson-distributed count response variables, a one-step (Markov) transition model is assumed for y_i = (y_{i1}, ..., y_{iT}), where at any time point, y_{it} (t > 1) is independent of (y_{i1}, ..., y_{i,t−2}) given the previous observation y_{i,t−1}. A random intercept is used to capture the baseline heterogeneity across subjects. In this random-intercept Markov transition model, the distribution of the response variable depends on the covariates under investigation and a random intercept,

P(y_{it} \mid x_{it}, y_{i,t-1}, \alpha_i) = \frac{\lambda_{it}^{y_{it}}}{y_{it}!} e^{-\lambda_{it}},

where x_it = (x_{i1t}, ..., x_{ipt}) and λ_it = E(y_it | x_it, y_{i,t−1}, α_i). A Poisson linear regression model connects the covariates x_it, the random effect α_i ~ N(0, σ²), and the previous observation y_{i,t−1} with the current observation y_it:

\log(\lambda_{it}) = x_{it}\beta + \gamma\big(\log(\max(1, y_{i,t-1})) - x_{i,t-1}\beta\big) + \alpha_i.

Here β contains the fixed parameters, which are of interest when making inferences about the effect of covariates in modifying the transition among outcomes. The parameter γ indicates the influence of the previous count through the residual

\log(\max(1, y_{i,t-1})) - x_{i,t-1}\beta,

where max(1, y_{i,t−1}) is used to ensure a positive value for the logarithmic operation.
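A small simulation sketch of this transition model for a single subject; the covariate values, dimensions, parameter values, and the handling of the first occasion (which the transition model itself does not specify) are all made-up assumptions for illustration.

```python
import numpy as np

def simulate_counts(x, beta, gamma, sigma, rng):
    """Simulate one subject's counts from the random-intercept Markov transition model.

    x: (T, p) covariate matrix; beta: (p,) fixed effects; gamma: transition weight;
    sigma: sd of the random intercept alpha_i.
    """
    T = x.shape[0]
    alpha = rng.normal(0.0, sigma)                      # subject-level random intercept
    y = np.empty(T, dtype=int)
    y[0] = rng.poisson(np.exp(x[0] @ beta + alpha))     # first occasion: no previous count (assumption)
    for t in range(1, T):
        resid = np.log(max(1, y[t - 1])) - x[t - 1] @ beta
        lam = np.exp(x[t] @ beta + gamma * resid + alpha)
        y[t] = rng.poisson(lam)
    return y

rng = np.random.default_rng(8)
x = np.column_stack([np.ones(6), rng.uniform(size=6)])  # intercept + one covariate
print(simulate_counts(x, beta=np.array([0.5, 0.3]), gamma=0.4, sigma=0.2, rng=rng))
```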

To impute intermittent missing values in the imputation step, a Metropolis random-walk sampler is used. For i = 1, ..., n and t = 1, ..., T_i, if y_it is missing, an imputation is drawn with a symmetric proposal distribution, and the acceptance probability is evaluated conditionally on the observed or imputed y_{i,t−1} and y_{i,t+1}, i.e.,

\frac{f\big(y^{mis}_{it,new} \mid y_{i,t-1}, y_{i,t+1}, \theta\big)}{f\big(y^{mis}_{it,current} \mid y_{i,t-1}, y_{i,t+1}, \theta\big)},

where

f\big(y^{mis}_{it} \mid y_{i,t-1}, y_{i,t+1}, \theta\big)
\propto \frac{\lambda_{it}^{y_{it}}}{y_{it}!} e^{-\lambda_{it}}
\cdot \lambda_{i,t+1}^{y_{i,t+1}} e^{-\lambda_{i,t+1}}.

For count response variables, the proposal in this random walk is y^{mis}_{it,new} = y^{mis}_{it,current} ± 1.
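A sketch of a single ±1 random-walk update for one missing count, assuming λ_it and λ_{i,t+1} have already been computed from the transition model (here lam_next is passed as a function of the candidate value, since the next mean depends on the imputed count); the helper name and the made-up numbers in the example are illustrative only.

```python
import numpy as np
from scipy.stats import poisson

def rw_impute_count(y_cur, lam_t, lam_next, y_next, rng):
    """One Metropolis random-walk update (+/- 1) for a missing count y_it.

    Target is proportional to Poisson(y_cur; lam_t) * Poisson(y_next; lam_next(y_cur)).
    """
    y_new = y_cur + rng.choice([-1, 1])
    if y_new < 0:
        return y_cur                                     # proposal outside the support
    log_r = (poisson.logpmf(y_new, lam_t) + poisson.logpmf(y_next, lam_next(y_new))
             - poisson.logpmf(y_cur, lam_t) - poisson.logpmf(y_next, lam_next(y_cur)))
    return y_new if np.log(rng.uniform()) < log_r else y_cur

rng = np.random.default_rng(9)
lam_next = lambda y: np.exp(0.5 + 0.4 * np.log(max(1, y)))   # made-up transition mean
print(rw_impute_count(y_cur=3, lam_t=2.5, lam_next=lam_next, y_next=4, rng=rng))
```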

For the posterior-step, the adaptive rejection sampling method is used to draw the parameters in the following order. First, the counting-process parameters β and γ of the Poisson regression model are drawn using

f(\beta \mid \theta_{\setminus \beta}, Y_{mis}) \propto \prod_{i=1}^{n} \prod_{t=1}^{T_i} P(y_{it} \mid x_{it}, y_{i,t-1}, \alpha_i),
\qquad
f(\gamma \mid \theta_{\setminus \gamma}, Y_{mis}) \propto \prod_{i=1}^{n} \prod_{t=1}^{T_i} P(y_{it} \mid x_{it}, y_{i,t-1}, \alpha_i),

where θ_{\β} refers to the subvector of θ excluding β, and θ_{\γ} and the other similar notations that follow have the same sense.

Second, each block of parameters of the missingness mechanism model is drawn from a conditional distribution of the form

f(\cdot \mid \theta_{\setminus \cdot}, Y_{mis}) \propto \prod_{i=1}^{n} \prod_{t=1}^{D_i} P(r_{it} \mid \alpha_i, x_{it}, r_{i,t-1}).

Third, for the random intercepts α_i and their variance σ², we use the following conditional distributions,

f(\alpha_i \mid \theta_{\setminus \alpha_i}, Y_{mis})
\propto \prod_{t=1}^{T_i} P(y_{it} \mid x_{it}, y_{i,t-1}, \alpha_i)
\prod_{t=1}^{D_i} P(r_{it} \mid \alpha_i, x_{it}, r_{i,t-1})
\exp\Big\{ -\frac{\alpha_i^2}{2\sigma^2} \Big\},

f(\sigma^2 \mid \theta_{\setminus \sigma^2}, Y_{mis})
\propto (\sigma^2)^{-n/2} \exp\Big\{ -\frac{\sum_{i=1}^{n} \alpha_i^2}{2\sigma^2} \Big\}
\cdot b^{a} \exp(-b/\sigma^2)\,(\sigma^2)^{-(a+1)}.

It is not difficult to verify that the above density functions for β, γ, the missingness-model parameters, and the α_i are log-concave, by computing the second derivatives of the corresponding log-transformed functions. For example,

\big(\log f(\gamma \mid \theta_{\setminus \gamma}, Y_{mis})\big)''
\propto -\sum_{i=1}^{n} \sum_{t=1}^{T_i} \lambda_{it} \big(\log(\max(1, y_{i,t-1})) - x_{i,t-1}\beta\big)^2 < 0.

By denoting s = log(σ²) and letting a = b = 0, it can be shown that the logarithmic form of

f(s \mid \theta_{\setminus s}, Y_{mis}) \propto \exp\Big\{ -s\,\frac{n+2}{2} - \frac{\sum_{i=1}^{n} \alpha_i^2}{2 e^{s}} \Big\}

also has a negative second derivative. So we can use the adaptive rejection sampling method on all of them.

    4.3.2 Binary response variables

    For binary response variables, to impute intermittent missing values in the imputation

    step, the Metropolis Random walk sampling method would be used. For i = 1, . . . , n,

    and t = 1, . . . , Ti, if yit is missing, an imputation would be drawn with a symmet-

    ric proposal distrib