
  • UNIVERSITY OF CALIFORNIA

    Los Angeles

    Analysis of longitudinal data with missing values

    A dissertation submitted in partial satisfaction

    of the requirements for the degree

    Doctor of Philosophy in Statistics

    by

    Jinhui Li

    2006

  • Copyright by

    Jinhui Li

    2006

  • The dissertation of Jinhui Li is approved.

    Xiaowei Yang

    Thomas R. Belin

    Rick Paik Schoenberg

    Hongquan Xu

    Yingnian Wu, Committee Chair

    University of California, Los Angeles

    2006


  • To my family


  • TABLE OF CONTENTS

    1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

    2 Models for longitudinal data analysis and missing data issue . . . . . . 5

    2.1 Models for longitudinal data analysis . . . . . . . . . . . . . . . . . . 6

    2.1.1 Linear models . . . . . . . . . . . . . . . . . . . . . . . . . . 6

    2.1.2 Generalized linear models . . . . . . . . . . . . . . . . . . . 7

    2.1.3 Transition models . . . . . . . . . . . . . . . . . . . . . . . . 8

    2.1.4 Random effect models . . . . . . . . . . . . . . . . . . . . . 9

    2.1.5 Estimation and inference . . . . . . . . . . . . . . . . . . . . 10

    2.2 Missing data issue . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

    2.2.1 Ignorable versus Nonignorable . . . . . . . . . . . . . . . . . 13

    2.3 Modelling with missing values . . . . . . . . . . . . . . . . . . . . . 15

    2.3.1 Multiple partial imputation . . . . . . . . . . . . . . . . . . . . 16

    2.3.2 Selection model . . . . . . . . . . . . . . . . . . . . . . . . . 18

    2.3.3 REMTM for binary response variables . . . . . . . . . . . . . 20

    3 Bayesian framework with MCMC approach . . . . . . . . . . . . . . . 24

    3.1 Bayesian methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

    3.1.1 Bayes Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

    3.1.2 Prior distribution . . . . . . . . . . . . . . . . . . . . . . . . 25

    3.1.3 Bayesian inference . . . . . . . . . . . . . . . . . . . . . . . 26

    3.2 MCMC methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28


  • 3.2.1 Monte Carlo methods . . . . . . . . . . . . . . . . . . . . . . 28

    3.2.2 MCMC fundamentals . . . . . . . . . . . . . . . . . . . . . 30

    3.2.3 Gibbs sampling . . . . . . . . . . . . . . . . . . . . . . . . . 34

    3.2.4 Metropolis-Hastings algorithm . . . . . . . . . . . . . . . . . 34

    3.2.5 Adaptive rejection sampling for Gibbs sampling . . . . . . . 35

    4 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

    4.1 Bayesian Inference using MCMC . . . . . . . . . . . . . . . . . . . 39

    4.1.1 Priors specification and monitoring convergence . . . . . . . 39

    4.1.2 MCMC: Data augmentation via Gibbs sampler . . . . . . . . 40

    4.2 Implementation for selection models . . . . . . . . . . . . . . . . . . 42

    4.2.1 Selection Models with AR(1) Covariance . . . . . . . . . . . 42

    4.2.2 Selection Models with Random-Effects . . . . . . . . . . . . 44

    4.3 Implementation for shared-parameter models . . . . . . . . . . . . . 46

    4.3.1 Count response variables . . . . . . . . . . . . . . . . . . . . 47

    4.3.2 Binary response variables . . . . . . . . . . . . . . . . . . . 49

    4.3.3 Continuous response variables . . . . . . . . . . . . . . . . . 50

    4.4 Model extension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

    4.5 MPI package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

    5 Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

    5.1 Selection models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

    5.1.1 A simulation study . . . . . . . . . . . . . . . . . . . . . . . 55

    5.1.2 Selection Model for Continuous Carbon Monoxide Levels . . 58


  • 5.2 Shared-parameter models . . . . . . . . . . . . . . . . . . . . . . . . 63

    5.2.1 A Simulation Study on count response variables . . . . . . . 63

    5.2.2 Application to Smoking Cessation Data . . . . . . . . . . . . 65

    5.3 Multiple-model sensitivity analysis . . . . . . . . . . . . . . . . . . . 71

    5.3.1 Pattern-Mixture Models for Continuous Carbon Monoxide Levels . . . . . . . . . . . . . . 72

    5.3.2 REMTM for Continuous Carbon Monoxide Data . . . . . . . 74

    6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

    References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81


  • LIST OF FIGURES

    2.1 Three Representations of Missingness Mechanisms. Three ways of modeling incomplete data are depicted here. Parameters and symbols in the figures are defined as: y_i = (y_i^obs, y_i^mis), the observed and missing values for subject i; r_i, the missingness indicators for the repeated measures on subject i; θ, the parameters of the data; φ, the parameters of the missingness indicators; and b_i, the parameters shared by the data and the missingness indicators. . . . . . . . . . . 12

    3.1 The figure shows the idea of Adaptive Rejection Sampling, with functions defined as: h(x) = log(g(x)), the logarithm of a function g(x) proportional to a statistical density function f(x); u(x), the upper hull of h(x); and l(x), the lower hull of h(x). . . . . . . . . . . 36

    5.1 The average and SD curves for the log-scaled carbon monoxide levels. On this plot, the four mean curves of the log-scaled carbon monoxide levels and the corresponding point-wise standard errors are drawn for each of the four treatment conditions: Control, RP-only, CM-only, and RP+CM (RP = Relapse Prevention, CM = Contingency Management). Vertical bars indicate the estimated standard errors of the average carbon monoxide levels. The stars (*) over the x-axis mark the time points (i.e., visit numbers) where the carbon monoxide levels are significantly different according to a point-wise ANOVA (P-value < 0.001). The y-axis indicates values of carbon monoxide levels after the log(1+x) transform. The x-axis represents the clinic visit number for study participants (1, . . . , 36). . . . . 60


  • 5.2 Missingness patterns for the carbon monoxide levels across treatment conditions. For each treatment condition, an image depicts the missingness indicators of carbon monoxide levels for each smoker at each research visit. Dark colored areas indicate that the corresponding carbon monoxide levels were observed, while white colored areas indicate that the corresponding data were missing intermittently or missing after dropout. The four treatment conditions are Control, RP-only, CM-only, and RP+CM (RP = Relapse Prevention, CM = Contingency Management). . . . . 61

    5.3 The Sampling Process and Posterior Distribution of Parameter β1 in the Simulation Study. The left-side figure draws three chains of the augmented Gibbs sampler, each starting from randomly selected points. It can be seen that after around 100 iterations the Gibbs sampler converges. The right-side histogram depicts the posterior distribution of the parameter of interest (i.e., β1). From the histogram, we can see that the mean value of the posterior distribution is very close to the true value (i.e., -1.0). . . . . 66

    5.4 Histograms of estimators of β1 in the Simulation Study. . . . . . . . 66


  • 5.5 Repeated Count Measures Over the 12-week Study Period for the Smoking Cessation Clinical Trial. The left graph plots the repeatedly measured number of smoking episodes for the 90 smokers in the treatment group who received contingency management. The right graph plots the repeatedly measured number of smoking episodes for the 85 smokers in the control group who did not receive contingency management. In both plots, the y-axis indicates the number or count of tobacco uses for the previous week, and the x-axis corresponds to the week number (1-12). The thick solid curves depict the mean profiles in the two groups, while the dashed curves represent the individual profiles. . . . . 67

    5.6 Mean Carbon Monoxide Levels for Completers and Early Terminators. By dividing the 174 smokers into two groups, Completers (n1=112) and Early Terminators (n2=62), the mean curves of carbon monoxide levels for subjects receiving CM (contingency management) and for subjects receiving no CM are depicted within each of the two groups (completers and early terminators). . . . . 73

    5.7 Pattern-Dependent Distribution of Carbon Monoxide Levels. Using the software package named MPI 2.0, profiles and mean curves of carbon monoxide levels are drawn within each of the five groups determined by the dropout times: dropout before week 5 (a), 7 (b), 9 (c), 11 (d), and 12 (e). In plots (a) to (e), green curves correspond to the mean carbon monoxide levels of subjects who received CM (contingency management), red curves indicate the mean curves of the subjects who did not receive CM, and gray-colored dashed lines depict the profiles of all the subjects within each group. Plot (f) depicts all the mean profiles corresponding to the five dropout patterns. . . . . 75


  • LIST OF TABLES

    5.1 Averaged percentage bias (%) in the estimate of slope difference from the 100 simulated data sets. . . . . . . . . . . . . . . . . . . . . . . 57

    5.2 Coverage rate (%) in the estimate of slope difference from the 100 simulated data sets. . . . . . . . . . . . . . . . . . . . . . . . . . . 57

    5.3 Estimated Treatment Effects and Parameters of the Dropout Model for the Four Partially Imputed Carbon Monoxide Data Sets. . . . . . 63

    5.4 Relative Bias (%), Standard Deviations and 95% Posterior Credible Intervals of the Averaged Treatment Effect Estimators (β1) in the Simulation Study. . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

    5.5 Parameter Estimation with Standard Deviations and 95% Posterior Credible Intervals for the Smoking Cessation Data Set. . . . . . . . 69

    5.6 Estimation of the Treatment Effect Parameter (β1) in the Smoking Cessation Study with artificially generated intermittent missing values. . . . . 69

    5.7 Estimation of Parameters in the Smoking Cessation Study with the

    Method of Multiple Partial Imputation. . . . . . . . . . . . . . . . . . 71

    5.8 Estimated Treatment Effect of Contingency Management (β1 (S.D.)) using the Pattern-Mixture Model with two Patterns (Complete vs. Dropout). . . . . 74

    5.9 Estimated Treatment Effect of Contingency Management using the

    Pattern-Mixture Models within the Framework of 2-stage MPI. . . . . 74

    5.10 Posterior Parameter Estimation with Standard Deviations and 95% Credible Intervals Using the REMTM for the Continuous Carbon Monoxide Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76


  • ACKNOWLEDGMENTS

    First and foremost, I would like to express my deepest gratitude to my advisor, my

    collaborator, and my friend, Prof. Yingnian Wu. His continuous guidance, support,

    and care make the research presented in this dissertation possible.

    I also owe my heartfelt gratitude to Prof. Xiaowei Yang of UC Davis, one of the top researchers on missing data modeling, for giving Yingnian and me this most valuable opportunity to work with him and for guiding me all the way through. The dissertation is based on this fruitful collaboration. I admire Prof. Yang's scientific achievements, and I also cherish his warmth and kindness.

    I would also like to thank the former fellow students in the vision lab for helpful discussions. In particular, I thank Dr. Xiangrong Chen, Dr. Cheng-en Guo, and Zijian Xu for their help with programming.

    My sincere thanks go to Prof. Hongquan Xu, Prof. Rick Paik Schoenberg and Prof. Thomas R. Belin for serving on my doctoral committee and giving useful advice on my dissertation.

    Finally, I am forever grateful to my family and friends for their care and help.


  • VITA

    1976 Born, Hubei, P.R.China

    1999 B.S. in Statistics, Peking University, Beijing, P.R.China

    2002 M.S. in Statistics, Peking University, Beijing, P.R.China

    2004 M.S. in Statistics, UCLA

    2002-2006 Graduate Student Researcher & Teaching Assistant, Department of

    Statistics, UCLA

    PUBLICATIONS

    J. Li, X. Yang, Y. Wu, and S. Shoptaw, A Random-effects Markov Transition Model

    for Poisson-distributed Repeated Measures with Nonignorable Missing Values, 2005

    Proceedings of the American Statistical Association, Biometrics Section [CD-ROM], Alexandria, VA: American Statistical Association.

    J. Li, X. Yang, Y. Wu, and S. Shoptaw, A Random-effects Markov Transition Model

    for Poisson-distributed Repeated Measures with Nonignorable Missing Values, ac-

    cepted by Statistics in Medicine.


  • X. Yang and J. Li (2006), Selection Models with Augmented Gibbs Samplers for Continuous Repeated Measures with Nonignorable Dropout, to be submitted.

    X. Yang, J. Li, and S. Shoptaw (2006), Multiple Partial Imputation for Longitudinal Data with Missing Values in Clinical Trials, resubmitted to Statistics in Medicine.

    Y. N. Wu, S.-C. Zhu, J. Li, and S. Bahrami (2006), Information Scaling in Natural Images, resubmitted to JASA.

    Y. N. Wu, J. Li, Z. Liu, and S.-C. Zhu (2006), Statistical Principles in Low-Level Vision, tentatively accepted by Technometrics.


  • ABSTRACT OF THE DISSERTATION

    Analysis of longitudinal data with missing values

    by

    Jinhui Li

    Doctor of Philosophy in Statistics

    University of California, Los Angeles, 2006

    Professor Yingnian Wu, Chair

    Biomedical research is plagued with problems of missing data, especially in clini-

    cal trials of medical and behavioral therapies adopting longitudinal design. After a

    comprehensive literature review on modeling incomplete longitudinal data based on

    the full-likelihood functions, this dissertation proposes a Bayesian framework with

    MCMC strategies for implementing two kinds of advanced models: selection models

    and shared parameter models, for handling intermittent missing values and dropouts

    that are potentially nonignorable according to various criteria. We combine the advan-

    tages of mixed effect models and Markov transition models. Simulation studies and applications to practical data were performed, and comparisons with likelihood methods were made, to show the efficacy and consistency of our methods.

    KEY WORDS: Selection Model; Shared Parameter Model; Markov Transition

    Model; Nonignorable Missing Values.


  • CHAPTER 1

    Introduction

    Longitudinal data refer to data points collected over time on the same subjects. For example, if some selected persons measure their weights once a week for ten consecutive weeks, the collection of weights through time is longitudinal data. Data points repeatedly collected on the same subjects in this way are called repeated measures. Longitudinal data are very common in biomedical research and clinical trials, where a characteristic or measurement of a person or a thing, for example, the status of a person's disease or the value of a car, evolves over time. One variable corresponds to such a characteristic or measurement.

    We often have several variables, among which one is the target or response variable and the others are potential explanatory variables. In principle any one of them can change and be measured over time, but usually it is the response variable that constitutes the repeated measures. We are interested in how the change or variation of the response variable can be explained by the explanatory variables, and statistical modeling is used to approximate the relationship between the two kinds of variables. Diggle, Liang and Zeger ([DLZ02]) presented a detailed summary of the models used. Popular choices are linear models, generalized linear models, transition models and mixed effect models. The simplest is the linear or additive model, where we use a linear combination of the explanatory variables and random errors to describe the response variable. We can also include generated new variables, i.e., functions of the original explanatory variables. If we use a linear combination of the explanatory variables to describe a function of the expected value of the response variable, we have a generalized linear model. Furthermore, if we let the model for the current measure explicitly depend on its previous measures, it becomes a (Markov) transition model. Finally, if we allow the coefficients of the explanatory variables to vary from subject to subject, we have mixed effect models, in contrast with models containing the same coefficients for all subjects, which describe the population trend and are called fixed effect models. We can obtain the likelihood function (or conditional likelihood function) from the models used as the joint probability distribution function over all subjects and all time points. MLE (maximum likelihood estimation) methods are then used for estimation and inference.

    With the same kinds of models, the computation becomes much more complicated when some of the data points are missing. Missing data are very common in biomedical research with longitudinal studies, reflecting the problematic nature of the phenomena under study, such as substance abuse ([NC02]), sexual behavior ([CS94]) and mental health disorders ([TSBU02]). The proportion of missing data is sometimes noticeably large, e.g., as large as 70% at termination in a randomized study of buprenorphine versus methadone ([FW95]). Although investigators may devote substantial effort to minimizing the number of missing values, some amount of missing data is inevitable in the practice of randomized medical clinical trials. In this dissertation, we focus on missing values of the response variables. Missing data can be categorized into two kinds: intermittent missing values (i.e., missing values due to occasional withdrawal), with observed values afterwards, and dropout values (i.e., missing values due to earlier withdrawal), with no more observed values. Reasons for dropout are often study-related, e.g., negative side effects of the tested medicines, ineffectiveness of the intervention, and inappropriate delivery of the therapy ([YS05]). Without careful handling of dropouts, either biased parameter estimates or invalid inferences would result. This may also be true for intermittent missing values.


  • It is not possible to compute the original joint probability function of all the repeated measures directly when data are missing. There are at least three ways to deal with this situation: complete-case only, analyze as incomplete, and imputation. Complete-case only means keeping just the complete cases, with specified weights, and dropping all other cases with missing values. The imputation method fills in the missing values with randomly generated ones, and how to generate good numbers to approximate the original ones is the critical issue. Analyze as incomplete integrates out the intermittent missing values and ignores the dropout missing values, which seems a straightforward choice but is computationally expensive and not always stable.

    All three approaches above focus on the response variable, since we are mainly concerned with estimating the effect of the covariates on the response variable. If the missing data do not affect the estimation at all, we can simply use the available data directly and ignore the missing ones. This raises the question of ignorability of the missing values. Ignorability is usually unknown to us, so the safe and natural way is to model the repeated measures and the missingness indicators jointly. There exist three ways to factor the joint distribution of the complete data and missingness indicators: outcome-dependent factorization, pattern-dependent factorization, and parameter-dependent factorization. Correspondingly there are three kinds of models: selection models, pattern-mixture models and shared-parameter models. No matter which model is used, computing the joint probability function of all the repeated measures remains difficult. Evaluating this joint probability by numerical integration is a common choice, but it is usually slow and sometimes unstable. Because of this, we propose an approach that uses a Bayesian framework with an MCMC implementation to overcome the difficulty of integration.

    Following this introductory chapter, the dissertation is organized as follows: we review models commonly used for longitudinal data analysis, the missing data issue and ignorability in Chapter 2; after that, we review the Bayesian framework and MCMC methodology in Chapter 3. Based on those methods, we describe our methodology for modeling longitudinal data with missing values in Chapter 4. Next, in Chapter 5 we present the results of simulation studies and applications to practical data, making comparisons with the maximum likelihood method and showing the consistency of our models. Finally, we discuss open problems and future work in the last chapter.


  • CHAPTER 2

    Models for longitudinal data analysis and missing data

    issue

    In most scientific research, we are concerned with the relationship between different quantities. For a longitudinal data set with a balanced design, our main interest is in the relationship between the response variable (usually some measurement related to the treatment effect) and the covariates (especially the treatment, if there is any). Repeated measures are potentially observed on the i-th subject at the j-th time point t_ij (i = 1, . . . , N; j = 1, . . . , J). From now on, we use capital symbols to represent variables, e.g., Yij indicating response variables and Xij = (Xij1, . . . , Xijp) indicating covariates or explanatory variables. Symbols in lower case represent observed or missing values, i.e., the realizations of variables: yij denoting the value of Yij and xijk denoting the value of Xijk recorded at time tij (i = 1, . . . , N; j = 1, . . . , J; k = 1, . . . , p). Bold symbols represent vectors or matrices, e.g., yi = (yi1, . . . , yiJ)^T is a vector of values for the repeated measures and xi = [xijk]_{J×p} is a matrix of values of time-varying or time-independent covariates on the i-th subject.

    For our current study, the response variables are of three common types: continuous, count and binary. The covariates can be time dependent, but in practice they are usually time independent.


  • 2.1 Models for longitudinal data analysis

    The review in this section mainly relies on the detailed and systematic summary of the widely used models by Diggle, Liang and Zeger ([DLZ02]). We can use linear models, generalized linear models, transition models and mixed effect models to approximate the relationship between the response variables and the covariates. Different models require different assumptions, e.g., normality or independence, and those assumptions need to be checked before or after any model is applied.

    Although no model is absolutely correct in revealing the underlying mechanisms that generate real-world data, some provide useful approximations and help us understand those mechanisms.

    2.1.1 Linear models

    For continuous responses, the simplest model is a linear or additive model, where we assume Yij can be expressed as a linear or additive combination of the covariates and a random error,

    Y_{ij} = \sum_{k=1}^{p} \beta_k x_{ijk} + \epsilon_{ij} = x_{ij}\beta + \epsilon_{ij}

    where we assume the random error ε_ij ~ N(0, σ²), which is the most widely used normality assumption. Note that usually x_ij1 = 1 and β1 is an intercept term.

    Usually we assume that the subjects come from a random sample of the population and are therefore independent of each other. Without loss of generality, we can describe the model for the i-th subject. In matrix form, we have the model for Y_i as

    Y_i \sim N(x_i\beta, \sigma^2 V)

    where V is a general positive definite symmetric matrix. If we assume that V is a diagonal matrix, or even an identity matrix, it means that the repeated measures from the same subject are independent of each other. Usually they are dependent and positively correlated, especially for neighboring measurements, and V has no simple structure.

    We can set constraints on the elements v_jk of the variance-covariance structure matrix V. Some common choices are:

    v_{jj} = 1, \quad v_{jk} = \rho

    which is called compound symmetric, since the correlation between any pair Y_ij, Y_ik is ρ, and

    v_{jk} = \rho^{|j-k|}

    which is called AR(1) (first order autoregressive), where the correlation between any pair Y_ij, Y_ik is determined by ρ and the length of time between the two time points; since −1 ≤ ρ ≤ 1, the correlation is weaker for measures that are farther apart.
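    To make these two covariance structures concrete, the following is a minimal Python/NumPy sketch (my own illustration, not part of the dissertation; the dimensions, coefficients and ρ value are made up) that builds the compound-symmetric and AR(1) matrices and simulates one subject's repeated measures from Y_i ~ N(x_i β, σ²V).

```python
import numpy as np

def compound_symmetric(J, rho):
    """V with v_jj = 1 and v_jk = rho for j != k."""
    return np.full((J, J), rho) + (1.0 - rho) * np.eye(J)

def ar1(J, rho):
    """V with v_jk = rho**|j - k| (first-order autoregressive)."""
    idx = np.arange(J)
    return rho ** np.abs(idx[:, None] - idx[None, :])

rng = np.random.default_rng(0)
J = 6                                               # hypothetical: 6 visits
beta = np.array([1.0, -0.5])                        # hypothetical coefficients
x_i = np.column_stack([np.ones(J), np.arange(J)])   # intercept + time
sigma2 = 0.8

# Y_i ~ N(x_i beta, sigma^2 V) under each covariance structure
for V in (compound_symmetric(J, 0.4), ar1(J, 0.4)):
    y_i = rng.multivariate_normal(x_i @ beta, sigma2 * V)
    print(np.round(y_i, 2))
```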

    2.1.2 Generalized linear models

    Instead of directly linking the response to the covariates, as for a continuous response, we can link the marginal expectation of the response μ_ij = E(Y_ij) to the covariates for other kinds of responses, which gives the generalized linear models.

    For binary responses, e.g., the status of smoking or not, or the indicator of having a certain disease or not, let μ_ij = E(Y_ij) = P(Y_ij = 1); the link function is the logit function

    \mathrm{logit}(\mu_{ij}) = \log\frac{\mu_{ij}}{1-\mu_{ij}} = \log\frac{P(Y_{ij}=1)}{P(Y_{ij}=0)} = x_{ij}\beta

    which models the log transformation of the odds P(Y_ij = 1)/P(Y_ij = 0).

    For count responses, e.g., the number of smoking episodes in a certain period of time, it is often reasonable to assume that the repeated measures follow a Poisson distribution, i.e.,

    P(Y_{ij} = y_{ij}) = \frac{\lambda_{ij}^{y_{ij}}}{y_{ij}!}\exp(-\lambda_{ij})

    For the marginal expectation of the response we have μ_ij = λ_ij, and a log function serves as the link function

    \log(\mu_{ij}) = x_{ij}\beta

    We should notice that these models for the marginal expectation of the response

    assume independence between the repeated measures since there are only covariates

    and fixed parameters in the model. This is usually not true and we need more complex

    models.
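    As a small illustration of the two link functions just described, the following Python sketch (my own example; the covariate and coefficient values are made up) computes the inverse-logit mean for a binary response and the inverse-log (exponential) mean for a count response.

```python
import numpy as np

def logistic_mean(x_ij, beta):
    """Inverse logit link: mu_ij = P(Y_ij = 1) for a binary response."""
    eta = x_ij @ beta
    return 1.0 / (1.0 + np.exp(-eta))

def poisson_mean(x_ij, beta):
    """Inverse log link: mu_ij = lambda_ij for a count response."""
    return np.exp(x_ij @ beta)

x_ij = np.array([1.0, 0.5])     # hypothetical covariates (intercept, treatment)
beta = np.array([-0.3, 0.8])    # hypothetical coefficients
print(logistic_mean(x_ij, beta), poisson_mean(x_ij, beta))
```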

    2.1.3 Transition models

    The above way of modeling the marginal expectation of the response through generalized linear models ignores the fact that the repeated measures on the same subject are usually not independent. It is quite straightforward to make the distribution of the current repeated measure depend on its predecessors, usually through some kind of residuals. This results in the transition models.

    Assume a q-step transition, i.e., the current repeated measure is related to its q immediate predecessors. For continuous variables, we may have

    Y_{ij} = x_{ij}\beta + \sum_{r=1}^{q} \alpha_r \left( Y_{i,j-r} - x_{i,j-r}\beta \right) + \epsilon_{ij}

    where ε_ij ~ N(0, σ²) as before. Here we connect Y_ij to its q immediate predecessors through the residuals Y_{i,j−r} − x_{i,j−r}β, r = 1, . . . , q. This is just a re-interpretation of the autoregressive covariance structure for continuous variables.

    For binary variables, the one-step transition probabilities P_kl = Pr(y_ij = l | y_{i,j−1} = k) (k = 0 or 1; l = 0 or 1) may be modeled by logistic regressions,

    \mathrm{logit}(P_{01}) = \mathrm{logit}(\Pr(y_{ij} = 1 \mid y_{i,j-1} = 0, x_{ij})) = x_{ij}\beta_{01}   (2.1)
    \mathrm{logit}(P_{10}) = \mathrm{logit}(\Pr(y_{ij} = 0 \mid y_{i,j-1} = 1, x_{ij})) = x_{ij}\beta_{10}   (2.2)

    Similarly, for a count response following a Poisson distribution, we can make the intensity (mean) depend on its q immediate predecessors,

    \log(\lambda_{ij}) = x_{ij}\beta + \sum_{r=1}^{q} \alpha_r \left( \log(\max(y_{i,j-r}, 1)) - x_{i,j-r}\beta \right)

    Transition models thus provide a reasonable way to directly connect the autocorrelated repeated measures.
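    For concreteness, here is a short Python sketch (my own illustration under the count-response formula above; the design matrix, coefficients and number of visits are made up) that simulates one subject's counts from a one-step Poisson transition model.

```python
import numpy as np

def simulate_poisson_transition(x_i, beta, alpha, rng):
    """Simulate one subject's counts from a q-step Poisson transition model:
    log(lambda_ij) = x_ij beta + sum_r alpha_r (log(max(y_{i,j-r}, 1)) - x_{i,j-r} beta)."""
    J, q = x_i.shape[0], len(alpha)
    y = np.zeros(J, dtype=int)
    for j in range(J):
        eta = x_i[j] @ beta
        for r in range(1, q + 1):
            if j - r >= 0:
                eta += alpha[r - 1] * (np.log(max(y[j - r], 1)) - x_i[j - r] @ beta)
        y[j] = rng.poisson(np.exp(eta))
    return y

rng = np.random.default_rng(1)
J = 12                                                       # hypothetical: 12 weekly visits
x_i = np.column_stack([np.ones(J), np.linspace(0, 1, J)])    # intercept + scaled time
print(simulate_poisson_transition(x_i, beta=np.array([1.5, -0.8]),
                                  alpha=np.array([0.3]), rng=rng))
```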

    2.1.4 Random effect models

    The previous models all assume fixed effects, i.e., the coefficients of the covariates are the same for all subjects. Since there is variation from subject to subject, we can reasonably allow a different (random) coefficient on each covariate for each subject to account for the between-subject variation. This introduces many more parameters into the model, and with them the danger of over-parameterization. To avoid this, the natural constraint is to assume that the coefficients for the same covariate come from a common distribution, which is usually a normal distribution.

    As an example, for continuous variables we may have

    Y_{ij} = x_{ij}\beta_i + \epsilon_{ij}   (2.3)
           = x_{ij}\beta + x_{ij} b_i + \epsilon_{ij}   (2.4)

    where ε_ij ~ N(0, σ²) as before, and the subject-specific coefficient β_i ~ N(β, D) is decomposed into β and b_i to account for the population effect and the subject (random) effect, respectively. b_i ~ N(0, D) is commonly used, and the specification of D is needed. In practice, we often assume D is diagonal if appropriate.

    The covariates associated with the random effects can differ from those associated with the fixed effects, but usually the former are a subset of the latter, which means we do not have to assume a random effect for each covariate. Random-intercept-only and random-intercept-and-slope structures are widely used.
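    As an illustration of the random-intercept-and-slope structure, here is a minimal Python sketch (my own example; the design, coefficient values and diagonal D are made up, and the random-effect design is simply taken equal to the fixed-effect design) that simulates repeated measures from model (2.4).

```python
import numpy as np

def simulate_random_effects(N, J, beta, D, sigma2, rng):
    """Simulate Y_ij = x_ij beta + z_ij b_i + eps_ij with a random intercept and slope.
    D is the (diagonal) covariance of b_i; all numeric values here are illustrative."""
    time = np.linspace(0, 1, J)
    x = np.column_stack([np.ones(J), time])    # fixed-effect design (intercept, time)
    z = x                                      # random effects on the same covariates
    b = rng.multivariate_normal(np.zeros(len(D)), D, size=N)   # b_i ~ N(0, D)
    eps = rng.normal(0.0, np.sqrt(sigma2), size=(N, J))
    return x @ beta + b @ z.T + eps            # N x J matrix of repeated measures

rng = np.random.default_rng(2)
Y = simulate_random_effects(N=5, J=4, beta=np.array([2.0, -1.0]),
                            D=np.diag([0.5, 0.2]), sigma2=0.3, rng=rng)
print(np.round(Y, 2))
```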

    2.1.5 Estimation and inference

    With complete data and appropriate models assumed, we can write down the likelihood function. Estimation and inference based on MLE (maximum likelihood estimation) or REML (restricted maximum likelihood estimation) is widely used. Please refer to Diggle, Liang and Zeger ([DLZ02]) for details.

    2.2 Missing data issue

    Missing data make the parameter estimation and inference much more complicated.

    The likelihood function requires integration to evaluate. Missing values can happen

    to both the response variables and the covariates but we assume only the response

    variables have missing values in the current study.

    When some values of the repeated measures are missing, we partition y_i into two parts, y_i = (y_i^obs, y_i^mis), with y_i^obs indicating the observed values and y_i^mis indicating values that would be observed if they were not missing. When missing values are introduced by dropout, the pattern of missingness can be indicated by a scalar d_i, which represents the actual time of withdrawal for subject i (i.e., d_i = 2, . . . , J + 1), with d_i = J + 1 indicating completion of the study. Since a subject who drops out at baseline does not contribute to the likelihood function, the case d_i = 1 is excluded from our consideration. A vector of missingness indicators is defined as r_i = (r_i1, . . . , r_iJ)^T with elements

    r_{it} =
      0  if y_{it} is observed,
      1  if y_{it} is intermittently missing,
      2  if y_{it} is missing after dropout.
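    As a small, purely illustrative sketch (not part of the dissertation), such indicators could be coded in Python from one subject's recorded series, with NaN marking an unrecorded visit:

```python
import numpy as np

def missingness_indicators(y_i):
    """Code r_it = 0 (observed), 1 (intermittently missing), 2 (missing after dropout)
    from one subject's series, where np.nan marks an unrecorded visit."""
    y_i = np.asarray(y_i, dtype=float)
    observed = ~np.isnan(y_i)
    r = np.full(y_i.shape, 2)                 # default: missing after dropout
    if observed.any():
        last_obs = np.max(np.nonzero(observed)[0])
        r[: last_obs + 1] = np.where(observed[: last_obs + 1], 0, 1)
    return r

print(missingness_indicators([3.1, np.nan, 2.7, np.nan, np.nan]))  # [0 1 0 2 2]
```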

    Theoretically, the joint distribution of the complete data (i.e., y_i) and the missingness patterns (i.e., r_i) should be modeled in the statistical analysis, and based on that we can write down the full likelihood function,

    L(\theta, \phi \mid y, r) \propto \prod_{i=1}^{N} \int f(y_i, r_i \mid \theta, \phi)\, dy_i^{mis}

    where θ represents the parameters of the model for the repeated measures and φ represents the parameters of the missingness mechanism.

    According to the possible graphical representations, there exist three ways to factor the joint distribution of the complete data and missingness indicators: outcome-dependent factorization, pattern-dependent factorization, and parameter-dependent factorization. Figure 2.1 shows the general directions of the factorizations with the associated models. Outcome-dependent factorization means the repeated measures or outcomes affect the distribution of the missingness indicators; pattern-dependent factorization means the missingness indicators (i.e., missingness patterns) indicate differences in the distribution of the repeated measures; and parameter-dependent factorization means some common parameters (covariates or random effects) affect both the repeated measures and the missingness indicators.

    The three corresponding models available for incomplete longitudinal data analysis

    are:

    Figure 2.1: Three Representations of Missingness Mechanisms. Three ways of modeling incomplete data are depicted here. Parameters and symbols in the figure are defined as: y_i = (y_i^obs, y_i^mis), the observed and missing values for subject i; r_i, the missingness indicators for the repeated measures on subject i; θ, the parameters of the data; φ, the parameters of the missingness indicators; and b_i, the parameters shared by the data and the missingness indicators.

    Selection Model, which factors the joint distribution into a marginal distribution for y_i and a conditional distribution of r_i given y_i, i.e.,

    f(y_i, r_i \mid x_i, \theta, \phi) = f(y_i \mid x_i, \theta)\, f(r_i \mid y_i, x_i, \phi)

    where f(r_i | y_i, x_i, φ) can be interpreted as self-selection of the i-th subject into a specific missingness group.

    Pattern-Mixture Model, which is a pattern-dependent model, assumes that the distribution of the repeated measures varies with the missingness patterns, and the joint distribution is factored as

    f(y_i, r_i \mid x_i, \theta, \phi) = f(y_i \mid r_i, x_i, \theta)\, f(r_i \mid x_i, \phi).

    Assuming that there are P patterns of missingness in a data set, the marginal distribution of y_i is a mixture,

    f(y_i \mid x_i, \theta) = \sum_{p=1}^{P} f(y_i \mid r_i = r_i^{(p)}, x_i, \theta^{(p)})\, \pi_p

    where θ^(p) represents the parameters of f(y_i) in the p-th pattern, π_p = Pr(r_i = r_i^(p) | x_i, φ), and r_i^(p) enumerates the P patterns. In pattern-mixture models, θ^(1), . . . , θ^(P) can differ in dimensionality or in value.

    Shared-Parameter Model, in which we assume that y_i and r_i are conditionally independent of each other given a group of parameters b_i,

    f(y_i, r_i \mid x_i, \theta, \phi) = \int f(y_i \mid b_i, x_i, \theta)\, f(r_i \mid b_i, x_i, \phi)\, f(b_i)\, db_i   (2.5)

    The shared parameters b_i affect both y_i and r_i, and thus can be either observable variables (e.g., gender) or latent variables (e.g., random effects or latent scores). For the case of observed confounders, model 2.5 is in fact a mixture model and the analysis can be conducted as a stratified analysis.

    My focus is on the selection model and the shared-parameter model, while Dr. Yang did a lot of research on the pattern-mixture model. One example is shown in Yang and Li ([YL06]).

    2.2.1 Ignorable versus Nonignorable

    The above models are natural, and it is true that in certain biomedical studies both the missingness patterns and the values of the repeated measures are of interest. For example, in a heart-disease study, the repeatedly measured blood pressures and the survival lengths of the patients can be modeled jointly. In these scenarios, the above selection, pattern-mixture, and shared-parameter models can be applied directly or after some modification ([HL97]). In the majority of biomedical studies, however, only the parameters of the repeated measures themselves are of interest, while parameters related to missing values are usually viewed as nuisance parameters. In this latter case, it is desirable that missing data be ignored if possible.


  • Within the setting of outcome-dependent missingness, the concept of ignorability has been defined and extensively addressed. According to Rubin ([R76]), missing values are ignorable when

    (i) r_i is independent of y_i^mis, given y_i^obs and x_i;
    (ii) θ and φ are distinct.

    Under this ignorability, the log-likelihood function for θ can be separated from the log-likelihood function for φ, i.e.,

    l(\theta, \phi \mid y_i^{obs}, r_i) = l(\theta \mid y_i^{obs}, r_i) + l(\phi \mid y_i^{obs}, r_i).

    In Little and Rubin [LR02], outcome-dependent missingness was further divided into sub-categories:

    missing completely at random (MCAR): r_i ⊥ y_i | x_i;

    missing at random (MAR): r_i ⊥ y_i^mis | y_i^obs, x_i;

    not missing at random (NMAR): none of the above.

    For intermittent missing values, ignorability can be interpreted as whether the missing values can be interpolated from neighboring observed values. For dropouts, the assumption corresponds to whether missing values after dropout can be extrapolated from the previously observed values. In certain applications, occasional omissions or nonresponses are due to reasons that are purely random in nature, e.g., schedule conflicts or bad weather, and thus can be assumed to be ignorable. Nonetheless, subjects usually withdraw from a study because of study-related reasons, e.g., dissatisfaction with the intervention or notorious side effects of a medical therapy, and hence dropouts are nonignorable ([VM00, DLZ02]).

    The definition of ignorability may be extended to meet the needs of pattern-mixture and shared-parameter models. Adopting an informal view, we interpret ignorability as a condition under which the observed repeated measures can be used to estimate θ without bias. For selection and pattern-mixture models, so long as r_i and y_i^mis are independent of each other given y_i^obs and x_i, missing data can be ignored. For shared-parameter models, ignorability corresponds only to the case where the b_i are observable variables, which are usually viewed as a subset of x_i. Unless r_i and y_i share no random effects, a shared-parameter model is generally associated with a nonignorability assumption.

    However, it is difficult or impossible to check ignorability directly. Therefore, we take an approach that jointly models the missingness patterns and repeated measures without assuming ignorability. In the end, we can assess ignorability based on the results from our models.

    2.3 Modelling with missing values

    In this section, we give a brief summary of several common approaches used by researchers to handle modelling with missing values, including imputation-based methods, illustrated by multiple partial imputation, and integration-based methods (the analyze-as-incomplete approaches), illustrated by the selection model for continuous data with dropout and the shared-parameter model for binary responses with both intermittent missingness and dropout.

    Imputation-based methods usually require critical assumptions about the missingness mechanisms. Wrong assumptions lead to imputed values whose distribution is inconsistent with that of the observed ones. The previously proposed integration-based methods are generally computationally expensive and tend to be unstable.


  • 2.3.1 Multiple partial imputation

    For incomplete longitudinal data sets, the method of multiple imputation (see Rubin [R87]) is especially useful. Accurately predicting missing values is possible because repeated measures are often highly correlated with each other. When imputing, all three of the above modeling options can be used. In longitudinal data sets, the missingness patterns and mechanisms for intermittent missing values and dropouts are apt to be distinct, thus requiring different treatment. Empirical experience suggests that in certain clinical trials intermittent missing values are ignorable, being due to factors unrelated to the theme of the study, while dropouts should not be simply ignored. In Yang and Shoptaw ([YS05]), a partial version of multiple imputation, MPI, was first proposed, within which only intermittent missing values are imputed. As seen in the application of pattern-mixture models, imputation methods can be further employed to implement various schemes of identifying restrictions for managing dropouts. This leads to a further extension of multiple imputation, termed 2-stage MPI here. Depending on the assumptions about the mechanisms of intermittent missingness and dropout, there exist many specific forms of MPI and 2-stage MPI.

    The missing values y_i^mis are further partitioned into (y_i^IM, y_i^DM) to denote intermittent missing values and dropouts. For MPI, m > 1 independent values y_i^{IM(1)}, y_i^{IM(2)}, . . . , y_i^{IM(m)} are drawn from the posterior predictive distribution p(y_i^IM | y_i^obs, r_i). Within the 2-stage MPI, for each of the partial imputations of the intermittent missing values, n conditionally independent values y_i^{DM(j,1)}, y_i^{DM(j,2)}, . . . , y_i^{DM(j,n)} are additionally drawn from the predictive distribution p(y_i^{DM(j,k)} | y_i^obs, y_i^{IM(j)}), j = 1, . . . , m. The 2-stage MPI provides a natural framework for fitting pattern-mixture models by identifying restrictions with information borrowed from completers, neighboring cases, or available cases. If we use selection models or shared-parameter models, imputations for dropouts can be similarly conducted by applying appropriate MCMC algorithms.


  • A main concern for multiple imputation is how to combine the multiple point estimators to make an overall inferential statement. A set of rules for combination was originally developed by Rubin and Schenker ([RS86]), which can be used directly for MPI. In Shen [S00], the idea was extended to the case of 2-step multiple imputation, which can be viewed as a general approach for 2-stage MPI. More specifically, m × n complete data sets are eventually obtained in the 2-stage MPI, y_i^{(j,k)} = (y_i^obs, y_i^{IM(j)}, y_i^{DM(j,k)}): j = 1, . . . , m, k = 1, . . . , n. A noticeable problem with these complete data sets is that they are not independent of each other, because each block or nest (y_i^{DM(j,1)}, y_i^{DM(j,2)}, . . . , y_i^{DM(j,n)}) contains identical values of y_i^{IM(j)}. Denoting by Q^{(j,k)} and U^{(j,k)} the point and variance estimates for Q from the (j,k)-th completed data set, the overall point estimate for Q is still simply the grand average, i.e.,

    \bar{Q} = \frac{1}{mn} \sum_{j=1}^{m} \sum_{k=1}^{n} Q^{(j,k)}   (2.6)

    The associated variance for \bar{Q} involves three components, i.e.,

    T = \bar{U} + \left(1 - \frac{1}{n}\right) W + \left(1 + \frac{1}{m}\right) B   (2.7)

    where

    \bar{U} = \frac{1}{mn} \sum_{j=1}^{m} \sum_{k=1}^{n} U^{(j,k)}

    estimates the complete-data variance,

    B = \frac{1}{m-1} \sum_{j=1}^{m} \left( \bar{Q}^{(j,\cdot)} - \bar{Q} \right)^2

    indicates the between-nest variance,

    W = \frac{1}{m} \sum_{j=1}^{m} \frac{1}{n-1} \sum_{k=1}^{n} \left( Q^{(j,k)} - \bar{Q}^{(j,\cdot)} \right)^2

    represents the within-nest variance, and \bar{Q}^{(j,\cdot)} = \frac{1}{n} \sum_{k=1}^{n} Q^{(j,k)}. Inferences about Q

  • are based on the Student's t-distribution, (Q - \bar{Q})/\sqrt{T} \sim t_{\nu}, with degrees of freedom ν given by

    \nu^{-1} = \frac{1}{m(n-1)} \left[ \frac{(1 - 1/n)\, W}{T} \right]^2 + \frac{1}{m-1} \left[ \frac{(1 + 1/m)\, B}{T} \right]^2

    Other formulas, such as rates of missing information and relative efficiency, can be found in Shen [S00].
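    To make the combining rules above concrete, here is a small Python sketch (my own illustration with made-up estimates, not from the dissertation) that computes the grand average, the three variance components, the total variance T, and the degrees of freedom from an m × n grid of point and variance estimates.

```python
import numpy as np

def combine_two_stage_mpi(Q, U):
    """Combine point estimates Q[j, k] and variance estimates U[j, k] from an
    m x n grid of 2-stage partially imputed data sets (formulas 2.6-2.7)."""
    m, n = Q.shape
    Q_bar = Q.mean()                                   # overall point estimate
    U_bar = U.mean()                                   # within-imputation variance
    Q_nest = Q.mean(axis=1)                            # nest means Q^(j,.)
    B = np.sum((Q_nest - Q_bar) ** 2) / (m - 1)        # between-nest variance
    W = np.mean(np.sum((Q - Q_nest[:, None]) ** 2, axis=1) / (n - 1))  # within-nest
    T = U_bar + (1 - 1 / n) * W + (1 + 1 / m) * B      # total variance
    inv_df = ((1 - 1 / n) * W / T) ** 2 / (m * (n - 1)) + \
             ((1 + 1 / m) * B / T) ** 2 / (m - 1)
    return Q_bar, T, 1.0 / inv_df

rng = np.random.default_rng(3)
Q = rng.normal(-1.0, 0.1, size=(5, 3))   # hypothetical m=5 nests, n=3 draws each
U = np.full((5, 3), 0.04)
print(combine_two_stage_mpi(Q, U))
```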

    Harel and Schafer ([HS02]) proposed using two-stage multiple imputation to handle two qualitatively different types of missingness, not necessarily classified as intermittent missingness and dropout. This idea might be extended to multiple-stage multiple imputation to handle several qualitatively different types of missingness.

    2.3.2 Selection model

    Diggle and Kenward ([DK94]) studied modelling continuous response variables with dropout missing values, where the missingness indicator r_ij = 1 means dropout and r_ij = 0 means observed.

    Let us denote t_{d_i} as the dropout time for the i-th subject, where 2 ≤ d_i ≤ J + 1 (d_i = J + 1 indicates a subject who has completed the study). Then the missingness indicator r_i is a vector of d_i − 1 consecutive zeros followed by J + 1 − d_i consecutive ones. Suppressing the dependence on covariates, the selection model of Diggle and Kenward [DK94] assumes:

    (i) Pr(r_ij = 1 | j > d_i) = 1;
    (ii) for j ≤ d_i, Pr(r_ij = 1) depends on y_ij and its history H_ij = (y_i1, . . . , y_{i,j−1})^T;
    (iii) the conditional distribution of y_ij given H_ij is f_ij(y_ij | H_ij, θ).

    Under these assumptions, the full likelihood function for the i-th subject is expressed as

    L_i(\theta, \phi \mid y_i^{obs}, r_i) \propto \prod_{j=1}^{d_i - 1} f_{ij}(y_{ij} \mid H_{ij}, \theta) \prod_{j=1}^{d_i - 1} \left[ 1 - p_j(y_{ij}, H_{ij}) \right] \Pr(r_{i d_i} = 1 \mid H_{i d_i})

    where p_j(y_ij, H_ij) = Pr(r_ij = 1 | y_ij, H_ij, φ) indicates the probability of dropout at time t_ij. The dropout probability is

    \Pr(r_{i d_i} = 1 \mid H_{i d_i}) = \int \Pr(r_{i d_i} = 1 \mid y, H_{i d_i}, \phi)\, f_{i d_i}(y \mid H_{i d_i}, \theta)\, dy   (2.8)

    if d_i < J + 1, and Pr(r_{i d_i} = 1 | H_{i d_i}) = 1 if d_i = J + 1.

    A natural choice for modeling Pr(r_ij = 1 | y_ij, H_ij, φ) is a logistic regression model,

    \mathrm{logit}(\Pr(r_{ij} = 1 \mid y_{ij}, H_{ij}, \phi)) = \phi_0 + H_{ij}\phi_1 + \phi_2 y_{ij}

    where φ2 ≠ 0 implies that the dropout process is outcome-dependent nonignorable.
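    The following short Python sketch (my own illustration; the coefficient values are invented, and only the most recent observed value is used for the history term as one simple choice) computes this dropout probability for a single measurement occasion.

```python
import numpy as np

def dropout_probability(y_ij, history, phi0, phi1, phi2):
    """Logistic dropout model: logit(Pr(r_ij = 1)) = phi0 + H_ij phi1 + phi2 * y_ij.
    Here only the most recent observed value enters H_ij."""
    eta = phi0 + phi1 * history[-1] + phi2 * y_ij
    return 1.0 / (1.0 + np.exp(-eta))

# Illustrative values: a nonzero phi2 makes dropout depend on the current, possibly
# unobserved, response, i.e., an outcome-dependent nonignorable mechanism.
print(dropout_probability(y_ij=2.5, history=np.array([3.0, 2.8]),
                          phi0=-2.0, phi1=0.3, phi2=0.4))
```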

    The full log-likelihood function of the whole data set for (θ, φ) can be partitioned into

    l(\theta, \phi) = l_1(\theta) + l_2(\phi) + l_3(\theta, \phi)

    where

    l_1(\theta) = \sum_{i=1}^{N} \log f(y_i^{obs})

    corresponds to the observed-data log-likelihood function for θ,

    l_2(\phi) = \sum_{i=1}^{N} \sum_{j=2}^{d_i - 1} \log\left[ 1 - \Pr(r_{ij} = 1 \mid y_{ij}, H_{ij}, \phi) \right]

    and

    l_3(\theta, \phi) = \sum_{i \le N;\, d_i \le J} \log\left( \Pr(r_{i d_i} = 1 \mid H_{i, d_i}) \right)

    together determine the log-likelihood function of the dropout process, which contains partial information on θ.

    If dropouts are ignorable, then l_3(θ, φ) depends only on φ and thus can be absorbed into l_2(φ). In this case, the estimation of θ can be derived solely from l_1(θ).


  • For a normal longitudinal data set, y_i ~ N(X_i β, Σ_i(α)) with parameters θ = (β^T, α^T)^T, the conditional distribution f_ij(y | H_ij, θ) is a scalar normal, and the marginal distribution ∏_{j=1}^{d_i−1} f(y_ij | H_ij, θ) = f(y_i^obs) is a multivariate normal.

    To evaluate the integral in equation 2.8, they used a probit approximation to the logit transformation of the dropout probability and converted the problem into moment computations of the conditional normal distribution of the first dropout missing value.

    With the likelihood function at hand, Diggle and Kenward resorted to the simplex algorithm ([NM65]) to obtain the MLE estimators. The simplex algorithm does not depend on derivatives, but it converges unacceptably slowly and provides no Fisher information matrix.

    Selection models originated from the Tobit model of Heckman ([H76]). Verbeke and Molenberghs ([VM00]) addressed the theoretical translation from the Tobit model to Diggle and Kenward's selection model. Subsequently, Troxel, Harrington, and Lipsitz ([THL98]) extended it to the non-monotone setting. Selection models for categorical and other types of measures were also developed; see Fitzmaurice, Molenberghs, and Lipsitz [FML], Molenberghs, Kenward, and Lesaffre [MKL97], Nordheim [N84], and Kenward and Molenberghs [KM99].

    2.3.3 REMTM for binary response variables

    When the dynamic features of the transition pattern in longitudinal data are of interest, an appropriate longitudinal approach is a transition model. For binary repeated measures with nonignorable missing values, Albert and Follmann ([AF03]) developed a Markov transition model with random effects shared by the sub-model for the measurements and the sub-model for the missingness indicators. We let REMTM stand for random-effects Markov transition model and review their models in this section.

    In the REMTM for incomplete binary repeated measures, the sub-model for the measurement process is assumed to be a first-order Markov chain for each series of binary measures. The response variable can jump between the two states, 0 and 1. The transition probabilities P_kl = Pr(y_ij = l | y_{i,j−1} = k) (k = 0, 1; l = 0, 1) can be modeled by a logistic regression with random intercepts,

    \mathrm{logit}(P_{01}) = \mathrm{logit}(\Pr(y_{ij} = 1 \mid y_{i,j-1} = 0, x_{ij}, b_i)) = x_{ij}\beta_{01} + b_i   (2.9)
    \mathrm{logit}(P_{10}) = \mathrm{logit}(\Pr(y_{ij} = 0 \mid y_{i,j-1} = 1, x_{ij}, b_i)) = x_{ij}\beta_{10} + \gamma b_i   (2.10)

    where b_i ~iid N(0, σ_b²) denotes the random intercept, and γ is the heterogeneity parameter indicating the correlation between P_10 and P_01.

    The distribution of the missingness indicators r_i = (r_i1, . . . , r_iJ)^T can be modeled by another Markov transition model. Here

    r_{ij} =
      0  if y_{ij} is observed,
      1  if y_{ij} is intermittently missing,
      2  if y_{ij} is missing due to dropout.

    They used a first-order Markov process associated with 3 × 3 transition probabilities (i.e., P_kl = Pr(r_ij = l | r_{i,j−1} = k), k = 0, 1, 2; l = 0, 1, 2). Determined by certain restrictions, the following transition probabilities are always equal to zero: P_12 = P_20 = P_21 = 0. For the other combinations of r_{i,j−1} and r_ij, the transition probabilities are calculated in the following way. First, if the previous repeated measure is observed (i.e., r_{i,j−1} = 0), then the current one could be observed, intermittently missing, or dropout; the 3-category multinomial-logit model can be used to calculate the transition probabilities:

    P(r_{ij} = k \mid b_i, x_{ij}, r_{i,j-1} = 0) =
      \frac{1}{1 + \sum_{l=1}^{2} \exp(x_{ij}\alpha_l + \psi_l b_i)}   if k = 0,
      \frac{\exp(x_{ij}\alpha_k + \psi_k b_i)}{1 + \sum_{l=1}^{2} \exp(x_{ij}\alpha_l + \psi_l b_i)}   if k = 1 or 2.

    Second, if the previous measure is intermittently missing, then the current one may only be observed or intermittently missing again. Correspondingly, a logistic regression model is used to calculate P_10 and P_11, i.e.,

    P(r_{ij} = k \mid b_i, x_{ij}, r_{i,j-1} = 1) =
      \frac{1}{1 + \exp(x_{ij}\alpha_1 + \psi_1 b_i)}   if k = 0,
      \frac{\exp(x_{ij}\alpha_1 + \psi_1 b_i)}{1 + \exp(x_{ij}\alpha_1 + \psi_1 b_i)}   if k = 1.

    Third, for the absorbing state 2, we always have P(r_{ij} = 2 | b_i, x_{ij}, r_{i,j−1} = 2) = 1. In the above logit and logistic models, the regression coefficients α_1 and α_2 respectively indicate whether intermittent missingness and dropout depend on the covariates, while the coefficients ψ_1 and ψ_2 respectively indicate whether intermittent missingness and dropout are informative (i.e., nonignorable).
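    The following Python sketch (my own illustration, using the α_k and ψ_k notation reconstructed above; the covariate and coefficient values are made up) computes the row of missingness transition probabilities for a given previous state under this sub-model.

```python
import numpy as np

def missingness_transition_row(prev_state, x_ij, b_i, alpha, psi):
    """Return Pr(r_ij = 0, 1, 2 | r_{i,j-1} = prev_state) under an REMTM-style
    missingness sub-model: multinomial logit from state 0, binary logit from
    state 1, and absorbing state 2. alpha[k] and psi[k] (k = 1, 2) are the
    coefficients for intermittent missingness and dropout, respectively."""
    if prev_state == 0:
        e = np.array([np.exp(x_ij @ alpha[k] + psi[k] * b_i) for k in (1, 2)])
        denom = 1.0 + e.sum()
        return np.array([1.0 / denom, e[0] / denom, e[1] / denom])
    if prev_state == 1:
        e = np.exp(x_ij @ alpha[1] + psi[1] * b_i)
        return np.array([1.0 / (1.0 + e), e / (1.0 + e), 0.0])
    return np.array([0.0, 0.0, 1.0])          # dropout is absorbing

x_ij = np.array([1.0, 0.5])                   # hypothetical covariates
alpha = {1: np.array([-1.5, 0.2]), 2: np.array([-2.0, 0.4])}
psi = {1: 0.3, 2: 0.8}
print(missingness_transition_row(0, x_ij, b_i=0.1, alpha=alpha, psi=psi))
```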

    By combining the above sub-models for measurement and missingness, the likelihood function for the parameters θ = (β_01, β_10, γ, σ_b²) and φ = (α_1, α_2, ψ_1, ψ_2) is expressed as

    L(\theta, \phi) \propto \prod_{i=1}^{n} \int \left\{ \prod_{j=1}^{T_i} p(y_{ij} \mid x_{ij}, y_{i,j-1}, b_i, \theta) \right\} \left\{ \prod_{j=1}^{J} p(r_{ij} \mid b_i, x_{ij}, r_{i,j-1}, \phi) \right\} p(b_i)\, db_i

    where p(b_i) is the pdf of N(0, σ_b²) and T_i is the last observed time point.

    They derived the K-step transition probabilities from the one-step transition probabilities to handle the intermittent missing values. They then maximized the above likelihood function using a Newton-Raphson algorithm, where the integral was evaluated with Gaussian quadrature.
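    As a small sketch of that last step (my own illustration, not the authors' code), the integral over a normal random intercept can be approximated with Gauss-Hermite quadrature; the per-subject likelihood contribution below is only a toy placeholder.

```python
import numpy as np

def integrate_over_random_intercept(contribution, sigma_b, n_nodes=20):
    """Approximate  integral of contribution(b) * N(b; 0, sigma_b^2) db  with
    Gauss-Hermite quadrature, via the change of variable b = sqrt(2)*sigma_b*t."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    b = np.sqrt(2.0) * sigma_b * nodes
    return np.sum(weights * contribution(b)) / np.sqrt(np.pi)

# Toy per-subject contribution: a product of Bernoulli terms whose logits shift with b.
def toy_contribution(b):
    p = 1.0 / (1.0 + np.exp(-(0.5 + b)))
    return p * (1.0 - p) * p        # e.g., an observed binary sequence 1, 0, 1

print(integrate_over_random_intercept(toy_contribution, sigma_b=0.7))
```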

    It should be remarked here that the above REMTM can easily be extended to deal with other types of repeated measures. In Li et al. [LYWS06] we applied the REMTM to Poisson-distributed repeated measures with nonignorable missing values. The random intercept in the models can be replaced with other types of random effects, including random slopes and random cohort effects. The REMTM is only one specific example of shared-parameter models; other longitudinal models, such as marginal models or random-effects models, can also be used to implement shared-parameter modeling. The shared-parameter model was first developed by Wu and Carroll ([WC88]), where certain parameters are shared by the measurement model and a censoring process.

  • Caroll([WC88]) where certain parameters are shared by the measurement model and acensoring process.


  • CHAPTER 3

    Bayesian framework with MCMC approach

    As stated before, for the longitudinal models based on full-likelihood functions, parameter estimation based on asymptotic normal theory is difficult to apply, mainly because of the complicated form of the likelihood functions. Without an analytical solution for the score function and Hessian matrix, optimization is challenging. Diggle and Kenward ([DK94]) resorted to the simplex algorithm ([NM65]), which does not depend on derivatives. Unfortunately, it converges unacceptably slowly and, being derivative-free, provides no Fisher information matrix. We implemented the nonlinear optimization algorithms of Dennis and Schnabel ([DS96]) with numerical derivatives, but found that the global maximum remained difficult to obtain in practical settings. When calculating the dropout probabilities in the selection model or integrating over the random effects in the REMTMs, time-consuming numerical integration is required, such as the Gauss-Hermite method. Bayesian methods based on MCMC provide a more affordable and appropriate alternative for parameter estimation and inference. By sampling parameters and missing values, MCMC methods using the Gibbs sampler offer a natural option for integration and optimization, without relying on fully determined density functions or expensive derivatives.

    In this chapter, we review the basics of the Bayesian framework and MCMC methods, based on the systematic description by Carlin and Louis in [CL00].


  • 3.1 Bayesian methods

    3.1.1 Bayes Rule

    Basically, if we have data D, i.e., collected samples, at hand, and we want to estimate some parameters θ with the data and a specified model, the likelihood function is the joint probability distribution of the data given θ,

    L(\theta) = f(D \mid \theta)

    The maximum likelihood method estimates the parameters, assuming they are unknown constants, entirely from this likelihood function. The Bayesian approach is different in that it treats the parameters as random draws from a prior distribution q(θ), and bases estimation and inference entirely on the posterior distribution P(θ | D), which follows from simple calculus in the form of Bayes' rule,

    P(\theta \mid D) = \frac{q(\theta) f(D \mid \theta)}{f(D)}

    where

    f(D) = \int q(\theta) f(D \mid \theta)\, d\theta

    3.1.2 Prior distribution

    For the Bayesian approach, introducing a prior distribution provides extra information. There are three major ways to specify it:

    Elicited priors: prior distributions are specified by experts or elicitees with experience and knowledge in the field;

    Conjugate priors: the prior distribution is selected from a family that is conjugate to the likelihood function, so that the posterior distribution belongs to the same distributional family as the prior;

    Noninformative priors: a normal distribution with infinite variance or a uniform distribution over the support of θ is used as the prior distribution.

    Among the three kinds of priors, noninformative priors are simple and actually provide no extra information, and thus avoid the danger of introducing incorrect extra information. With enough data, the information from the data usually dominates the information from the prior, so the choice does not matter much.

    3.1.3 Bayesian inference

    Within the Bayesian framework, we can perform point estimation, interval estimation and hypothesis testing based on the posterior distribution. The first two are the most important.

    Point estimation: First let us look at the univariate parameter case. The estimator θ̂(D) can be a summary feature such as the mean, median, or mode of the posterior distribution P(θ | D). The basic observations on the three features are:

    the posterior mode equals the maximum likelihood estimator with a flat prior;

    the mean and the median are identical for symmetric posterior densities;

    all three measures are the same for symmetric unimodal posteriors.

    The posterior variance with respect to θ̂(D), or mean squared error, defined as

    E_{\theta \mid D}\left( \theta - \hat{\theta}(D) \right)^2,

    provides a measure of accuracy of the point estimator. Denote by μ(D) = E_{θ|D}(θ) the posterior mean, which of course depends on the data D. We have the following decomposition:

    E_{\theta \mid D}\left( \theta - \hat{\theta}(D) \right)^2 = E_{\theta \mid D}\left[ \theta - \mu(D) + \mu(D) - \hat{\theta}(D) \right]^2
      = E_{\theta \mid D}\left( \theta - \mu(D) \right)^2 + \left( \mu(D) - \hat{\theta}(D) \right)^2
      \ge E_{\theta \mid D}\left( \theta - \mu(D) \right)^2 = \mathrm{Var}_{\theta \mid D}(\theta)

    Equality is achieved if and only if θ̂(D) = μ(D), i.e., when we use the posterior mean as the estimator.

    When θ is a vector, the posterior mode is more difficult to obtain, and we still have

    E_{\theta \mid D}\left( \theta - \hat{\theta}(D) \right)\left( \theta - \hat{\theta}(D) \right)^T = \mathrm{Var}_{\theta \mid D}(\theta) + \left( \mu(D) - \hat{\theta}(D) \right)\left( \mu(D) - \hat{\theta}(D) \right)^T

    and the minimum is still achieved at θ̂(D) = μ(D). So generally we use the posterior mean as the estimator to obtain the smallest variance.

    Interval estimation: The counterpart of a frequentist confidence interval (CI) in the Bayesian framework is called a credible set, or Bayesian confidence interval. The 100(1 − α)% credible set for θ is defined as a subset S of the support of θ such that

    1 - \alpha \le P(S \mid D) = \int_{S} P(\theta \mid D)\, d\theta

    However, there is a computational issue with the Bayesian approach. The posterior distribution

    P(\theta \mid D) = \frac{q(\theta) f(D \mid \theta)}{\int q(\theta) f(D \mid \theta)\, d\theta}

    is the basis for all estimation and inference, but because of the integration in the denominator it is difficult to compute explicitly. In addition, we still need to compute the posterior mean and variance, which involve additional integration, usually in a high dimensional space,

    E_{\theta \mid D}(\theta) = \int \theta\, P(\theta \mid D)\, d\theta

    \mathrm{Var}_{\theta \mid D}(\theta) = \int \left( \theta - E_{\theta \mid D}(\theta) \right)^2 P(\theta \mid D)\, d\theta

    Fortunately, we can perform the estimation and inference through sampling, even with

    the existence of missing data, which is the idea of Monte Carlo methods. MCMC is

    a powerful tool for us to overcome the difficulty in computing. We review basics about

    MCMC in next section.

    3.2 MCMC methods

MCMC (Markov chain Monte Carlo) methods are now the most widely used methods for statistical computing. They are based on a class of algorithms for sampling from probability distributions by constructing a Markov chain whose stationary distribution is the desired distribution. After a large number of steps, a state of the chain can be used as a sample from the target distribution. Gilks et al. ([GRS96]) provided a comprehensive summary of the related theory and practice.

    3.2.1 Monte Carlo methods

Monte Carlo methods refer to integration by sampling and averaging. Suppose we have q(θ) and we want to compute the mean of g(θ),

\mu = E(g(\theta)) = \int g(\theta) q(\theta) \, d\theta.

Given an iid (independent and identically distributed) sample θ_1, ..., θ_N ~ q(θ), the average

\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} g(\theta_i)

converges to μ with probability 1 as N → ∞ by the law of large numbers. This is a great property: as long as we have such an iid sample θ_1, ..., θ_N ~ q(θ), we can approximate the expectation of any function of θ. So the problem of integration becomes the problem of obtaining a good sample.
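As a minimal illustration of this sample-and-average idea (not part of the models developed later), the following sketch approximates E(g(θ)) when q(θ) is a standard normal and g(θ) = θ²; the function names and the choice of q and g here are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_mean(g, sampler, n=100_000):
    """Plain Monte Carlo estimate of E[g(theta)] from an iid sample."""
    theta = sampler(n)
    return np.mean(g(theta))

# Example: theta ~ N(0,1), g(theta) = theta^2, so E[g(theta)] = 1.
est = mc_mean(lambda t: t**2, lambda n: rng.standard_normal(n))
print(est)  # close to 1 by the law of large numbers
```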

Usually the target distribution q(θ) is quite complex and hard to sample from directly. When it can at least be evaluated, two commonly used sampling techniques are importance sampling and rejection sampling.

Importance sampling: we sample from a simpler proposal distribution h(θ) instead of q(θ) and define the weight function w(θ) = q(θ)/h(θ); then

\mu = E(g(\theta)) = \int g(\theta) q(\theta) \, d\theta = \int g(\theta) w(\theta) h(\theta) \, d\theta,

and, for θ_1, ..., θ_N ~ h(θ),

\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} g(\theta_i) w(\theta_i).

The optimal proposal distribution is

h^{*}(\theta) = \frac{|g(\theta)| \, q(\theta)}{\int |g(\theta)| \, q(\theta) \, d\theta},

which minimizes the variance of the estimator. The variance of the estimator is

\mathrm{Var}_{h(\theta)}\big(g(\theta) w(\theta)\big) = E_{h(\theta)}\big(g^{2}(\theta) w^{2}(\theta)\big) - \mu^{2},

which attains its minimum at h*(θ).
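A minimal sketch of the weighting step, assuming a standard normal target q and a heavier-tailed t(3) proposal h; these particular distributions and the function names are illustrative choices, not part of the dissertation's models.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def importance_mean(g, n=200_000):
    """Estimate E_q[g(theta)] for q = N(0,1) using a t(3) proposal h."""
    theta = rng.standard_t(df=3, size=n)                  # draw from the proposal h
    w = stats.norm.pdf(theta) / stats.t.pdf(theta, df=3)  # weights w = q/h
    return np.mean(g(theta) * w)

print(importance_mean(lambda t: t**2))  # close to 1 = E_q[theta^2]
```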

Rejection sampling: if there exist a simpler distribution h(θ) and a constant M such that q(θ) ≤ M h(θ), we can sample in the following way:

1. Sample θ_i from h(θ);

2. Sample u from the uniform distribution U(0,1);

3. Keep θ_i if u < q(θ_i) / (M h(θ_i)); otherwise reject it and return to step 1.
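A small sketch of these three steps, assuming a Beta(2,2)-shaped target q(x) = 6x(1−x) on (0,1) with a uniform envelope and M = 1.5; the target, envelope, and names are hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def rejection_sample(q, h_sample, h_pdf, M, n):
    """Accept-reject: keep theta ~ h with probability q(theta) / (M * h(theta))."""
    out = []
    while len(out) < n:
        theta = h_sample()
        u = rng.uniform()
        if u < q(theta) / (M * h_pdf(theta)):
            out.append(theta)
    return np.array(out)

# Target q(x) = 6x(1-x) on (0,1); envelope h = U(0,1); M = 1.5 bounds q everywhere.
q = lambda x: 6.0 * x * (1.0 - x)
draws = rejection_sample(q, rng.uniform, lambda x: 1.0, 1.5, 5000)
print(draws.mean())  # close to 0.5
```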

Unfortunately, the posterior distribution in the Bayesian approach often cannot even be evaluated, because its normalizing constant is unknown. In this case we can turn to MCMC instead. The idea is to sample as follows: start from some initial values and keep updating iteratively with a Markov chain (or chains) until the chain reaches its stationary distribution. The Metropolis-Hastings algorithm and Gibbs sampling are the most popular methods. Gibbs sampling allows us to sample using only a function proportional to the target distribution, and thus avoids computing the integral in the denominator of the posterior distribution in the Bayesian approach, which makes it very useful for our purposes.

    3.2.2 MCMC fundamentals

Mathematically, starting from θ^(0) ~ q(θ), the strategy for generating new samples θ^(i), while exploring the state space using a Markov chain mechanism, is to sample from

P\big(\theta^{(i)} \mid \theta^{(i-1)}, \ldots, \theta^{(0)}\big) = K\big(\theta^{(i)} \mid \theta^{(i-1)}\big).

To eventually sample from P(θ), the requirement is that

q\big(\theta^{(0)}\big) K^{t} \longrightarrow P(\theta) \quad \text{as } t \to \infty.

Then the stationary distribution of the Markov chain must be P(θ), where stationarity means

P K = P.

If this is the case, we can start from an arbitrary state, let the Markov chain perform transitions for a long time, then stop and output the current state θ^(t), which is a sample from P(θ).

To ensure that the Markov chain converges to a unique stationary distribution, the following conditions are sufficient:


• Irreducibility: every state is eventually reachable from any starting state θ^(0); for any state θ there exists a t such that

\Pr{}^{(t)}\big(\theta \mid \theta^{(0)}\big) > 0.

• Aperiodicity: the chain does not get caught in cycles.

The process is ergodic if it is both irreducible and aperiodic.

To ensure that the stationary distribution of the Markov chain is P(θ), it is sufficient for the functions P and K to satisfy the detailed balance (reversibility) condition:

P\big(\theta^{(i)}\big) K\big(\theta^{(i-1)} \mid \theta^{(i)}\big) = P\big(\theta^{(i-1)}\big) K\big(\theta^{(i)} \mid \theta^{(i-1)}\big).

Usually, with some care, these conditions are satisfied by an MCMC algorithm. In practice, however, we also need to tell when the stationary distribution, i.e., convergence, has been reached; since the functions K and P are difficult or impossible to compute, it is not practical to check the above conditions directly. This task is called convergence diagnostics.

The pre-convergence period is called the burn-in period. Broadly, there are multiple-chain approaches, which require running several chains independently, and single-chain approaches, for which running a single chain is enough. The advantage of multiple-chain approaches is that we obtain independent samples across chains, while the disadvantage is that many samples are wasted. In contrast, single-chain approaches are less expensive, but they produce only correlated samples, from which roughly independent samples must then be constructed.

One representative and popular multiple-chain approach was proposed by Gelman and Rubin ([GR92]). They run a small number n of parallel chains, with starting points that are over-dispersed with respect to the true posterior, for 2N iterations each, and keep the samples from the last N iterations. They then compare the variation within the chains, for a given parameter of interest, to the total variation across the chains. Specifically, they monitor convergence by the estimated scale reduction factor R, defined as

R = \left( \frac{N-1}{N} + \frac{n+1}{nN} \, \frac{B}{W} \right) \frac{df}{df - 2},

where B/N is the variance between the means of the n parallel chains, W is the average of the n within-chain variances, and df is the degrees of freedom of an approximating t density to the posterior distribution. They showed that

R \longrightarrow 1 \quad \text{as } t \to \infty,

which means that after a long running time the between-chain variance equals the within-chain variance. Since this criterion applies to a single parameter, we can randomly select some parameters to monitor; there are also ways to construct a criterion that monitors the convergence of all parameters.
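The following sketch shows the core between/within-chain comparison on simulated chains; it omits the t-density degrees-of-freedom correction in the formula above, and the array shapes and function name are assumptions made for illustration.

```python
import numpy as np

def scale_reduction(chains):
    """Gelman-Rubin style scale reduction from an (n_chains, N) array of draws.

    Only the between/within variance ratio is used here; the df/(df - 2)
    correction described in the text is omitted.
    """
    n, N = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()   # average within-chain variance
    B_over_N = chain_means.var(ddof=1)      # variance between chain means (= B/N)
    return (N - 1) / N + (n + 1) / n * B_over_N / W

rng = np.random.default_rng(3)
chains = rng.standard_normal((4, 2000))     # four chains already at stationarity
print(scale_reduction(chains))              # close to 1 when the chains agree
```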

One representative single-chain approach was proposed by Raftery and Lewis ([RL92]). They retain only every nth sample after burn-in, with n large enough that the retained samples are approximately independent. More specifically, their procedure supposes that the goal is to estimate a posterior quantile q = P(θ < a | D) to within ±r with probability s. For example, with q = .025, r = .005, and s = .95, the reported 95% credible sets would have true coverage between .94 and .96. To accomplish this, they considered the thinned indicator process Z_t^{(n)} = Z_{1+(t-1)n}, with indicator Z_j = I(θ^{(j)} < a), and used results from Markov chain convergence theory. So, in essence, one waits until some selected posterior quantiles become very stable.

The multiple-chain approach of Gelman and Rubin is less prone to missing convergence failures than single-chain approaches, but it also fails occasionally. And since all diagnostic methods use only a finite realization of the chain, no diagnostic can prove convergence of an MCMC algorithm. Statisticians still rely heavily on such diagnostics, because even a weak diagnostic is useful.


We also need to provide point and interval estimates based on the samples we generate. Again there are multiple-chain and single-chain approaches. For a multiple-chain approach, one needs a large number, say N, of independently running Markov chains; after they have all converged, collect the current sample θ^(j) from chain j, and the posterior mean can be estimated as

E(\theta \mid D) \approx \bar{\theta}_N = \frac{1}{N} \sum_{j=1}^{N} \theta^{(j)}.

The variance of the posterior mean can be estimated as

\widehat{\mathrm{Var}}(\bar{\theta}_N) = \frac{1}{N(N-1)} \sum_{j=1}^{N} \big(\theta^{(j)} - \bar{\theta}_N\big)^{2}.

For single-chain approaches, one representative method is called batching. We divide the single long run of length N into m successive batches of length k (i.e., N = mk), provided that k is large enough for the correlation between batches to be negligible, and m is large enough to reliably estimate the between-batch variance. We then compute the mean of each batch to get B_1, ..., B_m, and the posterior mean is estimated by

E(\theta \mid D) \approx \bar{\theta}_N = \frac{1}{m} \sum_{j=1}^{m} B_j.

The variance of the posterior mean can be estimated by

\widehat{\mathrm{Var}}(\bar{\theta}_N) = \frac{1}{m(m-1)} \sum_{j=1}^{m} \big(B_j - \bar{\theta}_N\big)^{2}.

In either case, the 95% interval estimate for E(θ|D) is given by

\bar{\theta}_N \pm z_{0.025} \sqrt{\widehat{\mathrm{Var}}(\bar{\theta}_N)}
\quad \text{or} \quad
\bar{\theta}_N \pm t_{m-1,\,0.025} \sqrt{\widehat{\mathrm{Var}}(\bar{\theta}_N)},

where z_{0.025} and t_{m−1,0.025} are the upper .025 points of the standard normal distribution and the t distribution with m−1 degrees of freedom, respectively, depending on the method used and the sample or batch size.
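A minimal sketch of the batch-means estimator, applied to a deliberately correlated chain (a simple AR(1) sequence standing in for MCMC output); the chain, batch count, and names are illustrative assumptions.

```python
import numpy as np

def batch_means(draws, m):
    """Batch-means estimate of the posterior mean and its Monte Carlo variance."""
    k = len(draws) // m                      # batch length; assumes N = m * k
    batches = draws[: m * k].reshape(m, k).mean(axis=1)
    mean = batches.mean()
    var = batches.var(ddof=1) / m            # = sum((B_j - mean)^2) / (m * (m - 1))
    return mean, var

# Example: a correlated chain (AR(1) with lag-1 correlation 0.9) whose mean is 0.
rng = np.random.default_rng(4)
x = np.empty(50_000)
x[0] = 0.0
for t in range(1, len(x)):
    x[t] = 0.9 * x[t - 1] + rng.standard_normal()
mean, var = batch_means(x, m=50)
print(mean, mean - 1.96 * var**0.5, mean + 1.96 * var**0.5)
```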

    3.2.3 Gibbs sampling

Gibbs sampling provides a way to convert a multivariate sampling problem into a sequence of univariate sampling problems. Suppose we have

\theta = (\theta_1, \ldots, \theta_p) \sim q(\theta_1, \ldots, \theta_p).

If all full conditional distributions {q_i(θ_i | θ_j, j ≠ i), i = 1, ..., p} are available for sampling, we start from an arbitrary set of initial values (θ_1^{(0)}, ..., θ_p^{(0)}); at iteration t (t ≥ 0), we proceed with the following sampling steps to get (θ_1^{(t+1)}, ..., θ_p^{(t+1)}):

\theta_1^{(t+1)} \sim q_1\big(\theta_1 \mid \theta_2^{(t)}, \ldots, \theta_p^{(t)}\big)

\theta_2^{(t+1)} \sim q_2\big(\theta_2 \mid \theta_1^{(t+1)}, \theta_3^{(t)}, \ldots, \theta_p^{(t)}\big)

\;\;\vdots

\theta_p^{(t+1)} \sim q_p\big(\theta_p \mid \theta_1^{(t+1)}, \ldots, \theta_{p-1}^{(t+1)}\big).

Geman and Geman ([GG84]) proved the following results (Gibbs convergence theorem):
(i) (θ_1^{(t)}, ..., θ_p^{(t)}) converges in distribution to (θ_1, ..., θ_p) ~ q(θ_1, ..., θ_p) as t → ∞;
(ii) using the L1 norm, the convergence rate is exponential in t.
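As a minimal sketch of this cycle (assuming a toy target, not any model used later): for a bivariate normal with zero means, unit variances, and correlation ρ, both full conditionals are univariate normals N(ρ·other, 1−ρ²), so the two-step Gibbs update below is exact.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_iter=10_000, seed=5):
    """Gibbs sampler for (x, y) bivariate normal with zero means, unit variances,
    and correlation rho; each full conditional is N(rho * other, 1 - rho^2)."""
    rng = np.random.default_rng(seed)
    x, y = 0.0, 0.0                                       # arbitrary starting values
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        x = rng.normal(rho * y, np.sqrt(1 - rho**2))      # draw x | y
        y = rng.normal(rho * x, np.sqrt(1 - rho**2))      # draw y | x
        draws[t] = (x, y)
    return draws

draws = gibbs_bivariate_normal(rho=0.8)
print(np.corrcoef(draws[2000:].T)[0, 1])  # close to 0.8 after burn-in
```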

    3.2.4 Metropolis-Hastings algorithm

The Metropolis-Hastings algorithm is one of the most popular MCMC methods. Metropolis et al. ([MRR+53]) presented the method in its simplest form. We choose an auxiliary proposal function h(x, y) such that h(·, y) is a pdf with respect to x for any y and is symmetric, i.e., h(x, y) = h(y, x). At step t, given the current state θ^(t), we do the following:

1. Sample θ* ~ h(·, θ^(t));

2. Compute the ratio r = \min\left\{1, \dfrac{q(\theta^{*})}{q(\theta^{(t)})}\right\};

3. Sample u ~ U(0,1) and set

\theta^{(t+1)} =
\begin{cases}
\theta^{(t)} & \text{if } u > r, \\
\theta^{*} & \text{if } u \le r.
\end{cases}

It can be proved that, under mild conditions, θ^(t) converges in distribution to q(θ) as t → ∞.

Hastings ([H70]) made a simple but important generalization of the Metropolis algorithm in 1970. He removed the requirement that the proposal function h(x, y) be symmetric, so that the ratio becomes

r = \min\left\{1, \frac{q(\theta^{*}) \, h(\theta^{(t)}, \theta^{*})}{q(\theta^{(t)}) \, h(\theta^{*}, \theta^{(t)})}\right\};

the rest of the procedure remains the same.
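A minimal random-walk Metropolis sketch: the Gaussian proposal is symmetric, so the simple Metropolis ratio applies, and only an unnormalized target q is needed. The target here (standard normal up to a constant), the step size, and the names are illustrative assumptions.

```python
import numpy as np

def metropolis(log_q, theta0, step=1.0, n_iter=20_000, seed=6):
    """Random-walk Metropolis: symmetric normal proposal, so r = q(new)/q(old)."""
    rng = np.random.default_rng(seed)
    theta = theta0
    draws = np.empty(n_iter)
    for t in range(n_iter):
        prop = theta + step * rng.standard_normal()       # symmetric proposal
        if np.log(rng.uniform()) < log_q(prop) - log_q(theta):
            theta = prop                                    # accept with prob min(1, r)
        draws[t] = theta
    return draws

# Unnormalized target q(theta) proportional to exp(-theta^2 / 2).
draws = metropolis(lambda th: -0.5 * th**2, theta0=5.0)
print(draws[5000:].mean(), draws[5000:].std())  # roughly 0 and 1
```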

    3.2.5 Adaptive rejection sampling for Gibbs sampling

Even with Gibbs sampling, where at every step we only need to sample from a univariate conditional distribution q_i(θ_i | θ_j, j ≠ i), it is still difficult to sample directly under most circumstances. Adaptive rejection sampling, proposed by Gilks and Wild ([GW92]), is an efficient method that can be used when the conditional distribution is log-concave, directly or after some transformation. The basic idea of the sampling method is illustrated by Figure 3.1.


Figure 3.1: The idea of adaptive rejection sampling, with functions defined as: h(x) = log(g(x)), the logarithm of a function g(x) proportional to a statistical density function f(x); u(x), the upper hull of h(x); and l(x), the lower hull of h(x).

Suppose g(x) is proportional to the target distribution and let h(x) = log(g(x)), which is concave with a given domain. We maintain a set of sorted points {x_i, i = 1, ..., n}, x_1 < x_2 < ... < x_n, at which h(x) has been evaluated. The lower hull l(x) is formed by the chords between (x_i, h(x_i)) and (x_{i+1}, h(x_{i+1})), while the upper hull u(x) is formed by the tangents at {x_i, i = 1, ..., n}. Denote the abscissae {z_i, i = 1, ..., n−1}, where z_i is the intersection of the tangents at x_i and x_{i+1}.

The kernel of the algorithm for sampling one point is the following:

1. Sample x* from the normalized exponential of the upper hull, s(x);

2. Sample q from the uniform distribution U(0,1);

3. If q < exp[l(x*) − u(x*)], accept x* and stop. Otherwise evaluate h(x*); if q < exp[h(x*) − u(x*)], accept x* and stop. Otherwise update the upper and lower hulls with the rejected x* and (h(x*), h'(x*)), resulting in closer upper and lower bounds, and go back to step 1.

From this we see that each rejected sample leads to closer upper and lower bounds and hence higher efficiency for subsequent sampling; this is why the method is called adaptive.

Gilks and Wild ([GW92]) also provided the following formulas. For the abscissae and the tangent values at them, we have

z_i = x_i + \frac{h(x_i) - h(x_{i+1}) + h'(x_{i+1})\,(x_{i+1} - x_i)}{h'(x_{i+1}) - h'(x_i)},
\qquad
u(z_i) = h'(x_i)\,(z_i - x_i) + h(x_i).

These can be derived from the tangent functions and the formula for the intersection of two lines.
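A quick numerical check of the intersection formula, using the log-concave h(x) = −x²/2 (the standard normal log-density up to a constant) as a hypothetical example: the two tangent lines must agree at the computed abscissa.

```python
def z_intersect(x1, x2, h, dh):
    """Intersection abscissa of the tangents to h at x1 and x2 (x1 < x2)."""
    return x1 + (h(x1) - h(x2) + dh(x2) * (x2 - x1)) / (dh(x2) - dh(x1))

h = lambda x: -0.5 * x**2    # log-concave test function
dh = lambda x: -x            # its derivative
x1, x2 = 0.0, 1.0
z = z_intersect(x1, x2, h, dh)
# Both tangent lines evaluated at z must give the same value.
print(z, h(x1) + dh(x1) * (z - x1), h(x2) + dh(x2) * (z - x2))  # 0.5, 0.0, 0.0
```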

Let c_u denote the normalizing factor of the exponentiated upper hull, i.e., the integral of the exponentiated upper hull over its domain; we have

c_u = \int \exp[u(x)] \, dx
= \sum_{j=0}^{n-1} \frac{1}{h'(x_{j+1})} \big\{ \exp[u(z_{j+1})] - \exp[u(z_j)] \big\},

where z_0 is the lower bound of x (or −∞ if x has no lower bound), and z_n is the upper bound of x (or +∞ if x has no upper bound). Denote the cumulative distribution function of s(x) by s_{cum}(x); at an abscissa z_i it is

s_{cum}(z_i) = \frac{1}{c_u} \sum_{j=0}^{i-1} \frac{1}{h'(x_{j+1})} \big\{ \exp[u(z_{j+1})] - \exp[u(z_j)] \big\}.

To sample x* from s(x), we first sample q from the uniform distribution U(0,1), find the index i = \arg\max_i \{ s_{cum}(z_i) < q \}, and compute x* by

x^{*} = z_i + \frac{1}{h'(x_{i+1})} \log\left[ 1 + \frac{h'(x_{i+1}) \, c_u \, \big(q - s_{cum}(z_i)\big)}{\exp[u(z_i)]} \right].

The adaptive rejection sampling method allows us to work with a function proportional to the target distribution and to constrain it between an upper hull and a lower hull, which are simple piecewise linear segments. Introducing the upper and lower hulls exploits the advantage of rejection sampling with squeezing, so the method is efficient and is the one most used in practice in our current research.

CHAPTER 4

    Methodology

In this chapter, we present our approach, a Bayesian framework with an MCMC implementation, for handling the analysis of longitudinal data with missing values. This approach can accommodate different assumptions about the missingness mechanism and can be applied to various kinds of repeated measures.

    4.1 Bayesian Inference using MCMC

    4.1.1 Priors specification and monitoring convergence

    In our Bayesian framework, we need to specify prior distributions for all parameters

involved. Since most of the time there is no actual prior information or enough historical study, non-informative or flat priors are adopted, which usually yields results consistent with those based on the maximum likelihood estimates, when stable estimates exist. More specifically, a normal distribution with infinite variance, or a uniform distribution over the support of the target, is used for all the regression coefficients and for the logarithms of the variances of the random effects and random errors. We tried some informative priors, e.g., an inverse-Gamma distribution on the variance parameters, but did not observe much difference, since with enough data the information from the data usually dominates the information from the priors. Flat uniform distributions are used for parameters with restricted support, such as the correlation parameter ρ in the AR(1) structure, which has the natural constraint −1 ≤ ρ ≤ 1.

    On monitoring convergence, we tried two approaches: the first is to start two or

    more Markov chains from quite different initial values and wait until they interweave

    with each other and the between chain variance is about the same as the within chain

    variance. The second is to work with one chain, retaining only every kth sample after

    burn-in, with k large enough so that the retained samples are approximately indepen-

    dent. The stopping time is when some selected quantiles become stable.

4.1.2 MCMC: Data augmentation via Gibbs sampler

We employ the method of data augmentation via the Gibbs sampler to alleviate the computational difficulty caused by missing values when sampling from the posterior distribution. We first impute the missing values and then draw the parameters one by one, conditional on the observed and imputed data. More specifically, the Gibbs sampler is an iterative procedure with each iteration consisting of two steps.

(I) Imputation-Step, where the missing values are updated by drawing from the conditional predictive distribution. That is, for i = 1, ..., N, draw y^{mis}_{ij} from

y^{mis}_{ij} \sim f_{ij}\big(y \mid y_i^{\setminus j}, x_i, r_i, \theta\big),

where y_i^{\setminus j} means the vector y_i excluding y_{ij}. For multivariate-normally distributed measures, this predictive conditional is a scalar normal distribution.

We still use d_i to denote the dropout time point for subject i. To fill in the missing values, we actually only need to impute all intermittent missing values and the first dropout value for the purpose of computation, because the dropout probability at d_i depends only on the current and previous values (i.e., y_{i,d_i} and H_{i,d_i}). Imputing additional dropout values is not necessary for computation and would not provide additional information, since all information comes from the available data and the stability assumption we make: the models remain unchanged through time. Note that imputing intermittent missing values differs from imputing the first dropout value, since the former draws information from both its history and its future, while the latter has information only from its history.

(II) Posterior-Step, where the parameters θ are drawn from the posterior distribution P(θ | Y, X, R), component by component, according to the decomposition of the joint distribution into full-conditional distributions,

\theta_k \sim f\big(\theta_k \mid \theta_{\setminus k}, y_1, \ldots, y_N, X, R\big),

where y_i = (y_{i1}, ..., y_{i,d_i−1}, y^{mis}_{i,d_i})^T, Y = (y_1, ..., y_N)^T, and θ_{\setminus k} means θ with the indicated component excluded.

Note that in the above algorithm, missing values are simulated from the predictive function, while parameters are simulated from full-conditional distributions. Essentially, however, it is a process in which one unobserved quantity, a missing value or a parameter, is drawn at a time from its distribution conditional on all other related quantities: observed data, imputed data, and parameters. Similar ideas were adopted by Schafer ([S97]) in his data augmentation algorithms for creating imputations of missing values in multivariate data sets.

Starting from an initial point and repeating the two sampling stages for a large enough number of iterations, the procedure converges to its stationary distribution, i.e., the joint distribution of the parameters and the missing values. Thus, after a long enough burn-in period, the simulated missing values and parameters can be used for parameter estimation. Note also that this algorithm can be used for conducting multiple imputation ([R87]), where multiple imputed data sets are first created and then analyzed using standard longitudinal models for complete data.
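The following toy sketch shows only the structure of the two-step cycle (I-step then P-step), not the selection models developed below: it assumes a univariate normal sample with known variance, a flat prior on the mean, and ignorable missingness, so every name and modeling choice here is a simplifying assumption for illustration.

```python
import numpy as np

def data_augmentation(y_obs, n_missing, n_iter=5000, sigma=1.0, seed=7):
    """Toy data augmentation: impute missing y (I-step), then draw the mean mu (P-step).

    Assumes y ~ N(mu, sigma^2) with sigma known, a flat prior on mu, and
    ignorable missingness; it only illustrates the iterative two-step scheme.
    """
    rng = np.random.default_rng(seed)
    n = len(y_obs) + n_missing
    mu = np.mean(y_obs)                                   # starting value
    mu_draws = np.empty(n_iter)
    for t in range(n_iter):
        # I-step: draw the missing values from their predictive distribution.
        y_mis = rng.normal(mu, sigma, size=n_missing)
        # P-step: draw mu from its full conditional given the completed data.
        y_full_mean = (y_obs.sum() + y_mis.sum()) / n
        mu = rng.normal(y_full_mean, sigma / np.sqrt(n))
        mu_draws[t] = mu
    return mu_draws

y_obs = np.random.default_rng(0).normal(2.0, 1.0, size=80)
print(data_augmentation(y_obs, n_missing=20).mean())   # close to the mean of y_obs
```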


4.2 Implementation for selection models

We implemented this approach for continuous response data with dropout missingness, using the same modeling strategy as Diggle and Kenward [DK94]. For the autocorrelation, we tried both a first-order autoregressive (AR(1)) variance structure and an explicit random-effects (random-intercept) model.

    4.2.1 Selection Models with AR(1) Covariance

For the AR(1) covariance matrix Σ_J, the inverse is tridiagonal,

\Sigma_J^{-1} = \frac{1}{\sigma^2 (1-\rho^2)}
\begin{pmatrix}
1 & -\rho & 0 & \cdots & 0 \\
-\rho & 1+\rho^2 & -\rho & \cdots & 0 \\
\vdots & \ddots & \ddots & \ddots & \vdots \\
0 & \cdots & -\rho & 1+\rho^2 & -\rho \\
0 & \cdots & 0 & -\rho & 1
\end{pmatrix},

and det(Σ_J) = (σ²)^J (1−ρ²)^{J−1}.
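A quick numerical check of the tridiagonal inverse and the determinant identity, assuming the usual AR(1) covariance entries σ²ρ^{|j−k|} (the exact definition of Σ_J appears with the model specification earlier in this dissertation); the dimensions and parameter values are arbitrary.

```python
import numpy as np

def ar1_cov(J, sigma2, rho):
    """AR(1) covariance matrix with entries sigma2 * rho**|j-k|."""
    idx = np.arange(J)
    return sigma2 * rho ** np.abs(idx[:, None] - idx[None, :])

def ar1_precision(J, sigma2, rho):
    """Tridiagonal inverse given in the text."""
    P = np.zeros((J, J))
    np.fill_diagonal(P, 1 + rho**2)
    P[0, 0] = P[-1, -1] = 1.0
    for j in range(J - 1):
        P[j, j + 1] = P[j + 1, j] = -rho
    return P / (sigma2 * (1 - rho**2))

J, sigma2, rho = 6, 2.0, 0.7
S = ar1_cov(J, sigma2, rho)
print(np.allclose(np.linalg.inv(S), ar1_precision(J, sigma2, rho)))        # True
print(np.isclose(np.linalg.det(S), sigma2**J * (1 - rho**2) ** (J - 1)))   # True
```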

The Gibbs sampler consists of the following steps.

I. Imputation-Step: draw the missing values from

y_{i,d_i} \sim f_{i,d_i}\big(y \mid y_i^{obs}, x_i^{obs}, \theta\big)
\propto \frac{1}{\sqrt{2\pi \nu_i}} \exp\Big\{ -\frac{\big(y - \sum_{k=1}^{K} x_{i,d_i,k}\beta_k - \mu_i\big)^2}{2\nu_i} \Big\}
\cdot \frac{\exp\big(\phi_0 + \phi_1 y + \sum_{k=2}^{d_i} y_{i,d_i+1-k}\phi_k\big)}{1 + \exp\big(\phi_0 + \phi_1 y + \sum_{k=2}^{d_i} y_{i,d_i+1-k}\phi_k\big)},

where the φ_k are the parameters of the logistic dropout model,

\mu_i = C_{(d_i-1)}^{T} \Sigma_{d_i-1}^{-1} \big(y_i^{obs} - x_i^{obs}\beta\big)
\quad \text{and} \quad
\nu_i = \sigma^2 \big(1 - C_{(d_i-1)}^{T} \Sigma_{d_i-1}^{-1} C_{(d_i-1)}\big),

with x_i^{obs} = (x_{i1}, ..., x_{i,d_i-1})^T and C_{(j)} = (ρ^{j}, ..., ρ)^T. This can be derived by regressing y_{i,d_i} on y_i^{obs}.

II. Posterior-Step: draw the parameters one by one in the following order.

(1) For k = 1, ..., K, draw the fixed-effect parameters:

\beta_k \sim f\big(\beta_k \mid \theta_{\setminus \beta_k}, Y, X\big)
\propto \prod_{i=1}^{N} \exp\Big\{ -\frac{(y_i - x_i\beta)^{T} \Sigma_{d_i}^{-1} (y_i - x_i\beta)}{2} \Big\}.

(2) Draw the autocorrelation parameter ρ in the variance structure:

\rho \sim f\big(\rho \mid \theta_{\setminus \rho}, Y, X\big)
\propto \prod_{i=1}^{N} \frac{1}{\sqrt{\det(\Sigma_{d_i})}}
\exp\Big\{ -\frac{(y_i - x_i\beta)^{T} \Sigma_{d_i}^{-1} (y_i - x_i\beta)}{2} \Big\}.

(3) Draw the variance of the residuals:

\sigma^2 \sim f\big(\sigma^2 \mid \theta_{\setminus \sigma^2}, Y, X\big)
\propto \prod_{i=1}^{N} \frac{1}{\sqrt{\det(\Sigma_{d_i})}}
\exp\Big\{ -\frac{(y_i - x_i\beta)^{T} \Sigma_{d_i}^{-1} (y_i - x_i\beta)}{2} \Big\}.

(4) For k = 1, ..., J, draw the parameters of the dropout mechanism:

\phi_k \sim f\big(\phi_k \mid \theta_{\setminus \phi_k}, Y\big)
\propto \prod_{i=1}^{N} \Bigg[ \prod_{j=2}^{d_i-1} \frac{1}{1 + \exp\big(\phi_0 + \phi_1 y_{ij} + \sum_{k=2}^{j} y_{i,j+1-k}\phi_k\big)} \Bigg]
\cdot \frac{\exp\big(\phi_0 + \phi_1 y_{i,d_i} + \sum_{k=2}^{d_i} y_{i,d_i+1-k}\phi_k\big)}{1 + \exp\big(\phi_0 + \phi_1 y_{i,d_i} + \sum_{k=2}^{d_i} y_{i,d_i+1-k}\phi_k\big)}.

In the above density functions, x_i = (x_{i1}, ..., x_{i,d_i})^T and X = (x_1, ..., x_N)^T.

When drawing each parameter, all other parameters are viewed as known constants. All of the above conditional distributions, except the one for ρ, have log-concave forms, directly or after some transformation, and thus can be simulated using the method of adaptive rejection sampling. A convenient sampling method for ρ is the Metropolis-Hastings algorithm. The proof is simple.

(1) For the missing values, the predictive function is log-concave because

\frac{\partial^2 \log f_{i,d_i}(y \mid y_i^{obs}, x_i^{obs}, \theta)}{\partial y^2}
= -\frac{1}{\nu_i}
- \frac{\phi_1^2 \exp\big(\phi_0 + \phi_1 y + \sum_{k=2}^{d_i} y_{i,d_i+1-k}\phi_k\big)}{\big(1 + \exp\big(\phi_0 + \phi_1 y + \sum_{k=2}^{d_i} y_{i,d_i+1-k}\phi_k\big)\big)^2} < 0,

where ν_i is the positive conditional variance.

(2) For the fixed-effect parameters, the conditional density function is log-concave because

\frac{\partial^2 \log f(\beta_k \mid \theta_{\setminus \beta_k}, Y, X)}{\partial \beta_k^2}
= -\sum_{i=1}^{N} (x_{i1k}, \ldots, x_{i,d_i,k}) \, \Sigma_{d_i}^{-1} \, (x_{i1k}, \ldots, x_{i,d_i,k})^{T} < 0.

(3) For the residual variance, the conditional density function after a logarithmic transform, s = log(σ²), is log-concave:

\frac{\partial^2 \log f(s \mid \theta_{\setminus s}, Y, X)}{\partial s^2}
= -\sum_{i=1}^{N} \frac{(y_i - x_i\beta)^{T} \, \Sigma_{d_i}^{*\,-1} \, (y_i - x_i\beta)}{2 e^{s}} < 0,

where Σ*_{d_i} denotes the correlation part of Σ_{d_i}, i.e., Σ_{d_i} with σ² set to 1.

(4) For the parameters of the dropout mechanism, the conditional density function is log-concave because

\frac{\partial^2 \log f(\phi_k \mid \theta_{\setminus \phi_k}, Y)}{\partial \phi_k^2}
= -\sum_{i=1}^{N} \Bigg[ \sum_{j=2}^{d_i-1}
\frac{y_{i,j+1-k}^2 \exp\big(\phi_0 + \phi_1 y_{ij} + \sum_{k=2}^{j} y_{i,j+1-k}\phi_k\big)}{\big(1 + \exp\big(\phi_0 + \phi_1 y_{ij} + \sum_{k=2}^{j} y_{i,j+1-k}\phi_k\big)\big)^2}
+ \frac{y_{i,d_i+1-k}^2 \exp\big(\phi_0 + \phi_1 y_{i,d_i} + \sum_{k=2}^{d_i} y_{i,d_i+1-k}\phi_k\big)}{\big(1 + \exp\big(\phi_0 + \phi_1 y_{i,d_i} + \sum_{k=2}^{d_i} y_{i,d_i+1-k}\phi_k\big)\big)^2}
\Bigg] < 0.

    4.2.2 Selection Models with Random-Effects

In the selection model with random effects, the random-effects model for the repeated measures can be written as

y_{ij} = \sum_{k=1}^{K} x_{ijk}\beta_k + \sum_{k=1}^{q} z_{ijk}\alpha_{ik} + \epsilon_{ij},

where ε_ij ~ N(0, σ²). Here we restrict the random effects to be independent of each other, with α_ik ~ N(0, σ_k²), k = 1, ..., q. In the Gibbs sampler, the missing values (y_{1,d_1}, ..., y_{N,d_N}) and the parameters θ = (β, α, σ², {σ_k²}, φ) are simulated in the following steps.

I. Imputation-Step: draw the missing values from

y_{i,d_i} \sim f_{i,d_i}\big(y \mid y_i^{obs}, x_{i,d_i}, z_{i,d_i}, \theta\big)
\propto \exp\Big\{ -\frac{\big(y - \sum_{k=1}^{K} x_{i,d_i,k}\beta_k - \sum_{k=1}^{q} z_{i,d_i,k}\alpha_{ik}\big)^2}{2\sigma^2} \Big\}
\cdot \frac{\exp\big(\phi_0 + \phi_1 y + \sum_{k=2}^{d_i} y_{i,d_i+1-k}\phi_k\big)}{1 + \exp\big(\phi_0 + \phi_1 y + \sum_{k=2}^{d_i} y_{i,d_i+1-k}\phi_k\big)}.

II. Posterior-Step: draw the parameters one by one.

(1) For the fixed-effect parameters:

\beta_k \sim f\big(\beta_k \mid \theta_{\setminus \beta_k}, Y, X, Z\big)
\propto \prod_{i=1}^{N} \prod_{j=1}^{d_i}
\exp\Big\{ -\frac{\big(y_{ij} - \sum_{k=1}^{K} x_{ijk}\beta_k - \sum_{k=1}^{q} z_{ijk}\alpha_{ik}\big)^2}{2\sigma^2} \Big\}.

(2) For i = 1, ..., N and k = 1, ..., q, simulate the random effects:

\alpha_{ik} \sim f\big(\alpha_{ik} \mid \theta_{\setminus \alpha_{ik}}, Y, X, Z\big)
\propto \prod_{j=1}^{d_i}
\exp\Big\{ -\frac{\big(y_{ij} - \sum_{k=1}^{K} x_{ijk}\beta_k - \sum_{k=1}^{q} z_{ijk}\alpha_{ik}\big)^2}{2\sigma^2} \Big\}
\cdot \exp\Big\{ -\frac{\alpha_{ik}^2}{2\sigma_k^2} \Big\}.

(3) For k = 1, ..., q, draw the variance of the random effects:

\sigma_k^2 \sim f\big(\sigma_k^2 \mid \theta_{\setminus \sigma_k^2}\big)
\propto \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma_k^2}} \exp\Big\{ -\frac{\alpha_{ik}^2}{2\sigma_k^2} \Big\}.

(4) For the residual variance:

\sigma^2 \sim f\big(\sigma^2 \mid \theta_{\setminus \sigma^2}, Y, X, Z\big)
\propto \prod_{i=1}^{N} \prod_{j=1}^{d_i} \frac{1}{\sqrt{2\pi\sigma^2}}
\exp\Big\{ -\frac{\big(y_{ij} - \sum_{k=1}^{K} x_{ijk}\beta_k - \sum_{k=1}^{q} z_{ijk}\alpha_{ik}\big)^2}{2\sigma^2} \Big\}.

(5) For the parameters of the dropout model:

\phi_k \sim f\big(\phi_k \mid \theta_{\setminus \phi_k}, Y\big)
\propto \prod_{i=1}^{N} \Bigg[ \prod_{j=2}^{d_i-1} \frac{1}{1 + \exp\big(\phi_0 + \phi_1 y_{ij} + \sum_{k=2}^{j} y_{i,j+1-k}\phi_k\big)} \Bigg]
\cdot \frac{\exp\big(\phi_0 + \phi_1 y_{i,d_i} + \sum_{k=2}^{d_i} y_{i,d_i+1-k}\phi_k\big)}{1 + \exp\big(\phi_0 + \phi_1 y_{i,d_i} + \sum_{k=2}^{d_i} y_{i,d_i+1-k}\phi_k\big)}.

In the above densities, z_i = (z_{i1}, ..., z_{i,d_i})^T and Z = (z_1, ..., z_N)^T.

As shown here, all of the above conditional distributions have log-concave forms, directly or after some transformation, and thus can be simulated using the method of adaptive rejection sampling.

(1) For the missing values, the predictive function is log-concave because

\frac{\partial^2 \log f_{i,d_i}(y \mid x_{i,d_i}, z_{i,d_i}, \theta)}{\partial y^2}
= -\frac{1}{\sigma^2}
- \frac{\phi_1^2 \exp\big(\phi_0 + \phi_1 y + \sum_{k=2}^{d_i} y_{i,d_i+1-k}\phi_k\big)}{\big(1 + \exp\big(\phi_0 + \phi_1 y + \sum_{k=2}^{d_i} y_{i,d_i+1-k}\phi_k\big)\big)^2} < 0.

(2) For the fixed-effect parameters, the conditional density function is log-concave because

\frac{\partial^2 \log f(\beta_k \mid \theta_{\setminus \beta_k}, Y, X, Z)}{\partial \beta_k^2}
= -\sum_{i=1}^{N} \sum_{j=1}^{d_i} \frac{x_{ijk}^2}{\sigma^2} < 0.

(3) For the random effects, the conditional density function is log-concave because

\frac{\partial^2 \log f(\alpha_{ik} \mid \theta_{\setminus \alpha_{ik}}, Y, X, Z)}{\partial \alpha_{ik}^2}
= -\sum_{j=1}^{d_i} \frac{z_{ijk}^2}{\sigma^2} - \frac{1}{\sigma_k^2} < 0.

(4) For the variance of the random effects, the conditional density function after a logarithmic transform, s = log(σ_k²), is log-concave:

\frac{\partial^2 \log f(s \mid \theta_{\setminus s}, Y)}{\partial s^2}
= -\sum_{i=1}^{N} \frac{\alpha_{ik}^2}{2 e^{s}} < 0.

(5) For the residual variance, the conditional density function after a logarithmic transform, s = log(σ²), is log-concave:

\frac{\partial^2 \log f(s \mid \theta_{\setminus s}, Y, X, Z)}{\partial s^2}
= -\sum_{i=1}^{N} \sum_{j=1}^{d_i}
\frac{\big(y_{ij} - \sum_{k=1}^{K} x_{ijk}\beta_k - \sum_{k=1}^{q} z_{ijk}\alpha_{ik}\big)^2}{2 e^{s}} < 0.

(6) For the parameters of the dropout mechanism, the conditional density function is log-concave because

\frac{\partial^2 \log f(\phi_k \mid \theta_{\setminus \phi_k}, Y)}{\partial \phi_k^2}
= -\sum_{i=1}^{N} \Bigg[ \sum_{j=2}^{d_i-1}
\frac{y_{i,j+1-k}^2 \exp\big(\phi_0 + \phi_1 y_{ij} + \sum_{k=2}^{j} y_{i,j+1-k}\phi_k\big)}{\big(1 + \exp\big(\phi_0 + \phi_1 y_{ij} + \sum_{k=2}^{j} y_{i,j+1-k}\phi_k\big)\big)^2}
+ \frac{y_{i,d_i+1-k}^2 \exp\big(\phi_0 + \phi_1 y_{i,d_i} + \sum_{k=2}^{d_i} y_{i,d_i+1-k}\phi_k\big)}{\big(1 + \exp\big(\phi_0 + \phi_1 y_{i,d_i} + \sum_{k=2}^{d_i} y_{i,d_i+1-k}\phi_k\big)\big)^2}
\Bigg] < 0.

    4.3 Implementation for shared-parameter models

Similar to the models for binary response variables (see Section 2.3.3) used by Albert and Follmann ([AF03]), for count and continuous response variables we also combine the advantages of mixed-effects models and Markov transition models to set up the submodels for the repeated measures and the missingness indicators.

    4.3.1 Count response variables

For Poisson-distributed count response variables, a one-step (Markov) transition model is assumed for y_i = (y_{i1}, ..., y_{iT}), where at any time point, y_{it} (t > 1) is independent of (y_{i1}, ..., y_{i,t−2}) given the previous observation y_{i,t−1}. A random intercept is used to capture the baseline heterogeneity across subjects. In this random-intercept Markov transition model, the distribution of the response variable depends on the covariates under investigation and a random intercept,

P(y_{it} \mid x_{it}, y_{i,t-1}, \alpha_i) = \frac{\lambda_{it}^{y_{it}}}{y_{it}!} e^{-\lambda_{it}},

where x_it = (x_{i1t}, ..., x_{ipt}) and λ_it = E(y_it | x_it, y_{i,t−1}, α_i). A Poisson linear regression model connects the covariates x_it, the random effect α_i ~ N(0, σ²), and the previous observation y_{i,t−1} with the current observation y_it:

\log(\lambda_{it}) = x_{it}\beta + \gamma\big(\log(\max(1, y_{i,t-1})) - x_{i,t-1}\beta\big) + \alpha_i.

Here β contains the fixed parameters, which are of interest when making inferences about the effect of covariates in modifying the transition among outcomes. The parameter γ indicates the influence of the previous count through the residual

\log(\max(1, y_{i,t-1})) - x_{i,t-1}\beta,

where max(1, y_{i,t−1}) is used to ensure a positive value for the logarithmic operation.
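A small simulation sketch of this transition model for a single subject; the covariate values, dimensions, parameter values, and the handling of the first occasion (which the transition model itself does not specify) are all made-up assumptions for illustration.

```python
import numpy as np

def simulate_counts(x, beta, gamma, sigma, rng):
    """Simulate one subject's counts from the random-intercept Markov transition model.

    x: (T, p) covariate matrix; beta: (p,) fixed effects; gamma: transition weight;
    sigma: sd of the random intercept alpha_i.
    """
    T = x.shape[0]
    alpha = rng.normal(0.0, sigma)                      # subject-level random intercept
    y = np.empty(T, dtype=int)
    y[0] = rng.poisson(np.exp(x[0] @ beta + alpha))     # first occasion: no previous count (assumption)
    for t in range(1, T):
        resid = np.log(max(1, y[t - 1])) - x[t - 1] @ beta
        lam = np.exp(x[t] @ beta + gamma * resid + alpha)
        y[t] = rng.poisson(lam)
    return y

rng = np.random.default_rng(8)
x = np.column_stack([np.ones(6), rng.uniform(size=6)])  # intercept + one covariate
print(simulate_counts(x, beta=np.array([0.5, 0.3]), gamma=0.4, sigma=0.2, rng=rng))
```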

To impute intermittent missing values in the imputation step, a Metropolis random-walk sampler is used. For i = 1, ..., n and t = 1, ..., T_i, if y_it is missing, an imputation is drawn with a symmetric proposal distribution, and the acceptance probability is evaluated conditionally on the observed or imputed y_{i,t−1} and y_{i,t+1}, i.e.,

\frac{f\big(y^{mis}_{it,new} \mid y_{i,t-1}, y_{i,t+1}, \theta\big)}{f\big(y^{mis}_{it,current} \mid y_{i,t-1}, y_{i,t+1}, \theta\big)},

where

f\big(y^{mis}_{it} \mid y_{i,t-1}, y_{i,t+1}, \theta\big)
\propto \frac{\lambda_{it}^{y_{it}}}{y_{it}!} e^{-\lambda_{it}}
\cdot \lambda_{i,t+1}^{y_{i,t+1}} e^{-\lambda_{i,t+1}}.

For count response variables, the proposal in this random walk is y^{mis}_{it,new} = y^{mis}_{it,current} ± 1.
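A sketch of a single ±1 random-walk update for one missing count, assuming λ_it and λ_{i,t+1} have already been computed from the transition model (here lam_next is passed as a function of the candidate value, since the next mean depends on the imputed count); the helper name and the made-up numbers in the example are illustrative only.

```python
import numpy as np
from scipy.stats import poisson

def rw_impute_count(y_cur, lam_t, lam_next, y_next, rng):
    """One Metropolis random-walk update (+/- 1) for a missing count y_it.

    Target is proportional to Poisson(y_cur; lam_t) * Poisson(y_next; lam_next(y_cur)).
    """
    y_new = y_cur + rng.choice([-1, 1])
    if y_new < 0:
        return y_cur                                     # proposal outside the support
    log_r = (poisson.logpmf(y_new, lam_t) + poisson.logpmf(y_next, lam_next(y_new))
             - poisson.logpmf(y_cur, lam_t) - poisson.logpmf(y_next, lam_next(y_cur)))
    return y_new if np.log(rng.uniform()) < log_r else y_cur

rng = np.random.default_rng(9)
lam_next = lambda y: np.exp(0.5 + 0.4 * np.log(max(1, y)))   # made-up transition mean
print(rw_impute_count(y_cur=3, lam_t=2.5, lam_next=lam_next, y_next=4, rng=rng))
```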

For the posterior-step, the adaptive rejection sampling method is used to draw the parameters in the following order. First, the counting-process parameters β and γ of the Poisson regression model are drawn using

f(\beta \mid \theta_{\setminus \beta}, Y_{mis}) \propto \prod_{i=1}^{n} \prod_{t=1}^{T_i} P(y_{it} \mid x_{it}, y_{i,t-1}, \alpha_i),
\qquad
f(\gamma \mid \theta_{\setminus \gamma}, Y_{mis}) \propto \prod_{i=1}^{n} \prod_{t=1}^{T_i} P(y_{it} \mid x_{it}, y_{i,t-1}, \alpha_i),

where θ_{\β} refers to the subvector of θ excluding β, and θ_{\γ} and the other similar notations that follow have the same sense.

Second, each block of parameters of the missingness mechanism model is drawn from a conditional distribution of the form

f(\cdot \mid \theta_{\setminus \cdot}, Y_{mis}) \propto \prod_{i=1}^{n} \prod_{t=1}^{D_i} P(r_{it} \mid \alpha_i, x_{it}, r_{i,t-1}).

Third, for the random intercepts α_i and their variance σ², we use the following conditional distributions,

f(\alpha_i \mid \theta_{\setminus \alpha_i}, Y_{mis})
\propto \prod_{t=1}^{T_i} P(y_{it} \mid x_{it}, y_{i,t-1}, \alpha_i)
\prod_{t=1}^{D_i} P(r_{it} \mid \alpha_i, x_{it}, r_{i,t-1})
\exp\Big\{ -\frac{\alpha_i^2}{2\sigma^2} \Big\},

f(\sigma^2 \mid \theta_{\setminus \sigma^2}, Y_{mis})
\propto (\sigma^2)^{-n/2} \exp\Big\{ -\frac{\sum_{i=1}^{n} \alpha_i^2}{2\sigma^2} \Big\}
\cdot b^{a} \exp(-b/\sigma^2)\,(\sigma^2)^{-(a+1)}.

It is not difficult to verify that the above density functions for β, γ, the missingness-model parameters, and the α_i are log-concave, by computing the second derivatives of the corresponding log-transformed functions. For example,

\big(\log f(\gamma \mid \theta_{\setminus \gamma}, Y_{mis})\big)''
\propto -\sum_{i=1}^{n} \sum_{t=1}^{T_i} \lambda_{it} \big(\log(\max(1, y_{i,t-1})) - x_{i,t-1}\beta\big)^2 < 0.

By denoting s = log(σ²) and letting a = b = 0, it can be shown that the logarithmic form of

f(s \mid \theta_{\setminus s}, Y_{mis}) \propto \exp\Big\{ -s\,\frac{n+2}{2} - \frac{\sum_{i=1}^{n} \alpha_i^2}{2 e^{s}} \Big\}

also has a negative second derivative. So we can use the adaptive rejection sampling method on all of them.

    4.3.2 Binary response variables

    For binary response variables, to impute intermittent missing values in the imputation

    step, the Metropolis Random walk sampling method would be used. For i = 1, . . . , n,

    and t = 1, . . . , Ti, if yit is missing, an imputation would be drawn with a symmet-

    ric proposal distrib