
Topics in Mixed Effects Models

by

Jose Carlos Pinheiro

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE

REQUIREMENTS FOR THE DEGREE OF

Doctor of Philosophy

(Statistics)

at the

UNIVERSITY OF WISCONSIN – MADISON

1994

Abstract

Mixed effects models have received a great deal of attention in the statistical literature for the past forty years because of the flexibility they offer in handling the unbalanced clustered data that arise in many areas of investigation. In this dissertation we consider both linear and nonlinear mixed effects models under maximum likelihood and restricted maximum likelihood estimation. We derive the asymptotic distribution of both maximum likelihood and restricted maximum likelihood estimators in a general linear mixed effects model, under mild regularity conditions. We study different approximations to the loglikelihood function of nonlinear mixed effects models, comparing them with respect to their accuracy and computational efficiency. We describe five different parametrizations for variance-covariance matrices that ensure positive definiteness, while leaving the estimation problem unconstrained, comparing them with respect to their computational efficiency and statistical interpretability. We consider the model building issue for mixed effects models, describing techniques for choosing random effects to be incorporated in the model, using structured random effects variance-covariance matrices, and using covariates to explain cluster-to-cluster parameter variability. Finally we describe the S software we have developed for analyzing linear and nonlinear mixed effects models and which we have contributed to the StatLib collection.


Contents

Abstract i

1 Introduction 1

1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Linear Mixed Effects Models . . . . . . . . . . . . . . . . . . . . 2

1.3 Nonlinear Mixed Effects Models . . . . . . . . . . . . . . . . . . 4

1.4 Parametrizations for Variance-Covariance Matrices . . . . . . . 5

1.5 Software Development . . . . . . . . . . . . . . . . . . . . . . . 7

1.6 Model Building . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

1.7 Future Research . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2 The Linear Mixed Effects Model 11

2.1 Model and Examples . . . . . . . . . . . . . . . . . . . . . . . . 11

2.2 Likelihood Estimation . . . . . . . . . . . . . . . . . . . . . . . 17

2.3 Bibliographic Review . . . . . . . . . . . . . . . . . . . . . . . . 20

3 Asymptotic Results for the Linear Mixed Effects Model 24

3.1 Maximum Likelihood . . . . . . . . . . . . . . . . . . . . . . . . 25

3.1.1 Limit of φ5 . . . . . . . . . . . . . . . . . . . . . . . . . 33


3.1.2 Limit of φ4 . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.1.3 Limit of φ3 . . . . . . . . . . . . . . . . . . . . . . . . . 36

3.1.4 Limit of φ2 . . . . . . . . . . . . . . . . . . . . . . . . . 38

3.1.5 Limit of φ1 . . . . . . . . . . . . . . . . . . . . . . . . . 39

3.2 Restricted Maximum Likelihood . . . . . . . . . . . . . . . . . . 48

3.3 Parametrized and/or Structured σ . . . . . . . . . . . . . . . . 57

3.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

4 The Nonlinear Mixed Effects Model 76

4.1 The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

4.2 Orange Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

4.3 Bibliographic Review . . . . . . . . . . . . . . . . . . . . . . . . 80

5 Approximations to the Loglikelihood in the Nonlinear Mixed Effects Model 83

5.1 Approximations to the Loglikelihood . . . . . . . . . . . . . . . 84

5.1.1 Alternating Approximation . . . . . . . . . . . . . . . . 84

5.1.2 Laplacian Approximation . . . . . . . . . . . . . . . . . 86

5.1.3 Importance Sampling . . . . . . . . . . . . . . . . . . . . 89

5.1.4 Gaussian quadrature . . . . . . . . . . . . . . . . . . . . 91

5.2 Comparing the Approximations . . . . . . . . . . . . . . . . . . 93

5.2.1 Orange Trees . . . . . . . . . . . . . . . . . . . . . . . . 94

5.2.2 Theophylline . . . . . . . . . . . . . . . . . . . . . . . . 98

5.2.3 Simulation Results . . . . . . . . . . . . . . . . . . . . . 104

5.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

6 Parametrizations for Variance-Covariance Matrices 115


6.1 Parametrizations . . . . . . . . . . . . . . . . . . . . . . . . . . 116

6.1.1 Cholesky Parametrization . . . . . . . . . . . . . . . . . 117

6.1.2 Log-Cholesky Parametrization . . . . . . . . . . . . . . . 119

6.1.3 Spherical Parametrization . . . . . . . . . . . . . . . . . 119

6.1.4 Matrix Logarithm Parametrization . . . . . . . . . . . . 121

6.1.5 Givens Parametrization . . . . . . . . . . . . . . . . . . . 122

6.2 Comparing the Parametrizations . . . . . . . . . . . . . . . . . . 125

6.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

7 Mixed Effects Models Methods and Classes for S 132

7.1 The lme class and related methods . . . . . . . . . . . . . . . . 133

7.1.1 The lme function . . . . . . . . . . . . . . . . . . . . . . 135

7.1.2 The print, summary, and anova methods. . . . . . . . . 136

7.1.3 The plot method . . . . . . . . . . . . . . . . . . . . . . 139

7.1.4 Other methods . . . . . . . . . . . . . . . . . . . . . . . 140

7.2 The nlme class and related methods . . . . . . . . . . . . . . . . 142

7.2.1 The nlme function . . . . . . . . . . . . . . . . . . . . . 144

7.2.2 The nlme methods . . . . . . . . . . . . . . . . . . . . . 147

7.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154

8 Model Building in Mixed Effects Models 157

8.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158

8.1.1 Pine Trees . . . . . . . . . . . . . . . . . . . . . . . . . . 158

8.1.2 Theophylline . . . . . . . . . . . . . . . . . . . . . . . . 160

8.1.3 Quinidine . . . . . . . . . . . . . . . . . . . . . . . . . . 161

8.1.4 CO2 Uptake . . . . . . . . . . . . . . . . . . . . . . . . . 163

8.2 Variance-Covariance Modeling . . . . . . . . . . . . . . . . . . . 164


8.2.1 Pine Trees . . . . . . . . . . . . . . . . . . . . . . . . . . 166

8.2.2 Theophylline . . . . . . . . . . . . . . . . . . . . . . . . 168

8.2.3 Quinidine . . . . . . . . . . . . . . . . . . . . . . . . . . 170

8.2.4 CO2 Uptake . . . . . . . . . . . . . . . . . . . . . . . . . 172

8.3 Covariate Modeling . . . . . . . . . . . . . . . . . . . . . . . . . 173

8.3.1 Quinidine . . . . . . . . . . . . . . . . . . . . . . . . . . 175

8.3.2 CO2 Uptake Data . . . . . . . . . . . . . . . . . . . . . . 180

8.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182

9 Conclusions and Suggestions for Future Research 183

9.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183

9.2 Future Research . . . . . . . . . . . . . . . . . . . . . . . . . . . 185

9.2.1 Asymptotics . . . . . . . . . . . . . . . . . . . . . . . . . 185

9.2.2 Parametrizations . . . . . . . . . . . . . . . . . . . . . . 186

9.2.3 Assessing Variability . . . . . . . . . . . . . . . . . . . . 187

Bibliography 188

Appendix A 195

Appendix B 203


Chapter 1

Introduction

In this chapter we present an overview of the topics covered in this dissertation.

We discuss the motivation behind mixed effects models and describe briefly the

contents of each of the subsequent chapters.

1.1 Motivation

Mixed models were developed to handle clustered data and have been a topic

of increasing interest in Statistics for the past forty years. Clustered data can

be loosely defined as data in which the observations are grouped into disjoint

classes, called clusters, according to some classification criterion. Examples of

clustered data include split-plot designs in which the observations pertaining

to the same block form a cluster and repeated measures data in which several

observations are made sequentially on the same individual (cluster).

Observations in the same cluster usually cannot be considered independent and mixed effects models constitute a convenient tool for modeling cluster dependence. In these models the response is assumed to be a function of fixed (population) effects, non-observable cluster-specific random effects, and an error term. Observations within the same cluster share common random effects and are therefore statistically dependent.

We will restrict ourselves in this dissertation to models in which the error

terms and the random effects are normally distributed.

The parameters in a mixed effects model can be classified into two types:

fixed effects, associated with the average effect of predictors on the response,

and variance-covariance components, associated with the covariance structure

of the random effects and of the error term. In many practical applications

estimates of the random effects are also of interest.

Several estimation methods have been proposed for mixed effects models and

though maximum likelihood and restricted maximum likelihood (Harville, 1974)

are generally adopted for linear mixed effects models (Longford, 1993), there is

an ongoing debate in the statistical literature about estimation methods for

nonlinear mixed effects models.

1.2 Linear Mixed Effects Models

Linear mixed effects models are mixed effects models in which both the fixed

and the random effects contribute linearly to the response function. The general

form of such models is

y = Xβ + Zb + ε (1.2.1)

where y is the response vector, X and Z are the design matrices corresponding

to the fixed and random effects respectively, β is the fixed effects vector, b is the

random effects vector, and ε is the error vector. It is assumed that b ∼ N (0, D)

and ε ∼ N (0,Λ), with b independent of ε.
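To make the ingredients of (1.2.1) concrete, the following minimal S sketch simulates one data set from the model with random intercepts only; the cluster sizes and parameter values are illustrative assumptions, not taken from the dissertation.

    ## A minimal sketch: simulate from y = X beta + Z b + eps  (model 1.2.1).
    ## Assumed setup: m clusters of ni observations each, random intercepts.
    set.seed(1)
    m  <- 10                               # number of clusters
    ni <- 5                                # observations per cluster
    x  <- rep(1:ni, m)                     # a within-cluster covariate
    X  <- cbind(1, x)                      # fixed effects design matrix
    Z  <- kronecker(diag(m), rep(1, ni))   # random intercept design matrix
    beta <- c(2, 0.5)                      # fixed effects vector
    b    <- rnorm(m, sd = 1)               # b ~ N(0, D), here with D = 1
    eps  <- rnorm(m * ni, sd = 0.3)        # eps ~ N(0, sigma^2 I)
    y    <- X %*% beta + Z %*% b + eps     # response vector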


Variance components models (Searle, Casella and McCulloch, 1992), mixed

effects ANOVA models (Miller, 1977), and linear models for longitudinal data

(Laird and Ware, 1982) are all special cases of model (1.2.1). The linear mixed

effects model (1.2.1) is described in detail in chapter 2. Two examples are

included there to illustrate the use of this model in the context of mixed effects

ANOVA models and repeated measures data.

Maximum likelihood (ML) and restricted maximum likelihood (RML) are the most common estimation methods used for linear mixed effects models. The derivation of (R)ML estimates constitutes a rather complex nonlinear optimization problem that only became feasible when fast computers became available.

This optimization is usually done using the EM algorithm (Dempster, Laird and

Rubin, 1977) or Newton-Raphson methods (Thisted, 1988), but the latter seems

to be more efficient than the former (Lindstrom and Bates, 1988). No closed form expressions are available for the distribution of (R)ML estimates and inference usually has to rely on asymptotic results. The classical asymptotic theory

available for ML estimates (Lehmann, 1983) cannot be applied to linear mixed

effects models, since the observations are not independent. Miller (1977) derived

the asymptotic distribution of ML estimates for mixed effects ANOVA models,

following the work by Hartley and Rao (1967), but these results had not been

extended to the more general linear mixed effects model (1.2.1). We derive,

in chapter 3, the asymptotic distribution of both ML and RML estimates in

the linear mixed effects model (1.2.1) under quite general regularity conditions.

We also derive the asymptotic distribution of ML and RML estimates of the

variance-covariance components in (1.2.1) for a large class of reparametrizations

of the variance-covariance matrix of the random effects, that encompasses most

cases of practical interest.


1.3 Nonlinear Mixed Effects Models

Nonlinear mixed effects models are mixed effects models in which some of the

fixed and/or random effects occur nonlinearly in the response function. Several

different formulations of nonlinear mixed effects models are available in the

literature; we will adopt here the model proposed by Lindstrom and Bates

(1990), given by

y = f(φ, X) + ε, where φ = Aβ + Bb (1.3.1)

where y is the response vector, f is a general nonlinear function, φ is a mixed

effects parameter vector that is expressed as a linear function of the fixed effects

β and the random effects b, X is a matrix of covariates, ε is the error vector, and

A and B are the design matrices for the fixed and random effects respectively.

As in the linear mixed effects model (1.2.1) it is assumed that b ∼ N (0, D) and

ε ∼ N (0,Λ), with b independent of ε.
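As an assumed, concrete instance of (1.3.1), the short S sketch below simulates clustered data from a logistic growth curve in which only the asymptote carries a random effect; the function f and all numerical values are our own illustrative choices, not the dissertation's.

    ## Sketch: simulate from y = f(phi, X) + eps with phi = A beta + B b.
    ## Assumed f: a logistic growth curve; only the asymptote is mixed.
    set.seed(2)
    m     <- 8                                      # clusters
    times <- rep(c(100, 300, 500, 700), m)          # measurement times
    id    <- rep(1:m, each = 4)                     # cluster indicator
    beta  <- c(Asym = 200, xmid = 350, scal = 120)  # fixed effects
    b     <- rnorm(m, sd = 25)                      # random effect, b ~ N(0, D)
    Asym  <- beta["Asym"] + b[id]                   # phi = A beta + B b
    y     <- Asym / (1 + exp((beta["xmid"] - times) / beta["scal"])) +
             rnorm(length(times), sd = 5)           # eps ~ N(0, sigma^2 I)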

By far the most common application of model (1.3.1) is for repeated measures data and we will restrict ourselves in this dissertation to this type of situation. The nonlinear mixed effects model for repeated measures data is described in detail in chapter 4, together with a real data example of its use.

Different estimation methods have been proposed for the parameters in the nonlinear mixed effects model (1.3.1) and there is an ongoing debate in the literature about the most adequate method(s) (Davidian and Giltinan, 1993). One of the reasons for this variety of estimation methods is related to the numerical complexity involved in the derivation of (R)ML estimates in the nonlinear mixed effects model. This complexity is due to the fact that the likelihood function in the nonlinear mixed effects model, which is based on the marginal distribution of y, does not usually have a closed form expression. Different approximations to the loglikelihood in (1.3.1) have been proposed to try to circumvent this problem (Lindstrom and Bates, 1990; Vonesh and Carter, 1992; Davidian and Gallant, 1993). We describe in chapter 5 alternative approximations to

the loglikelihood in (1.3.1) based on the Laplacian approximation (Tierney and

Kadane, 1986), importance sampling (Geweke, 1989), and Gaussian quadrature

(Davis and Rabinowitz, 1984). We present a comparison between these methods and the approximation suggested by Lindstrom and Bates (1990), using simulated and real data, and conclude that, in most cases, Lindstrom and Bates' approximation gives very accurate results.

As in the linear mixed effects model, the distribution of the (R)ML estimates

cannot be determined explicitly. Asymptotic results for these estimates have

not yet been established and will not be considered in this dissertation.

1.4 Parametrizations for Variance-Covariance Matrices

The (R)ML estimation of the variance-covariance components in both models (1.2.1) and (1.3.1) is usually a difficult numerical problem, since the resulting estimates should correspond to a positive semi-definite matrix. This difficulty has been pointed out by Harville (1977), Lindstrom and Bates (1988), and Searle et al. (1992, chapter 6).

Two approaches can be used for ensuring positive semi-definiteness of the estimated variance-covariance matrix of the random effects: constrained optimization, where the natural parametrization for the unique elements in the variance-covariance matrix is used and the estimates are constrained to be positive semi-definite matrices, and unconstrained optimization, where the unique

elements in the variance-covariance matrix are reparametrized in such a way that the resulting estimate must be positive semi-definite. We recommend the use of the second approach, not only for numerical reasons (parameter estimation tends to be much easier when there are no constraints), but also because of the superior inferential properties that unconstrained estimates tend to have (e.g. asymptotic properties). Lindstrom and Bates (1988, 1990) describe the use of Cholesky factors for implementing unconstrained (R)ML estimation of variance-covariance components in both the linear and the nonlinear mixed effects models.

We describe, in chapter 6, five different parametrizations for transforming the (R)ML estimation of the variance-covariance components in models (1.2.1) and (1.3.1) into an unconstrained optimization. The basic idea behind all parametrizations considered in this dissertation is to write

$$ D = L^T L \tag{1.4.1} $$

where the unique elements of L form an unconstrained parameter vector. Different choices of L lead to different parametrizations of D. The parametrizations considered in chapter 6 are of two types: three of them are based on the Cholesky factorization of D (Thisted, 1988) and the other two are based on the spectral decomposition (Rao, 1973).
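A minimal S sketch of one such map, in the spirit of the log-Cholesky variant of section 6.1.2 and shown here for q = 2 with our own function name, illustrates how an unconstrained parameter vector always produces a positive definite D:

    ## Sketch: unconstrained theta -> D = L'L, with the diagonal of L stored
    ## on the log scale so that any real theta gives a positive definite D.
    theta.to.D <- function(theta) {
      L <- matrix(0, 2, 2)
      L[1, 1] <- exp(theta[1])     # diagonal entries forced positive
      L[2, 2] <- exp(theta[2])
      L[1, 2] <- theta[3]          # off-diagonal entry unconstrained
      t(L) %*% L                   # D = L'L is positive definite
    }
    theta.to.D(c(0, 0.5, -1))      # a valid covariance matrix for any theta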

In choosing a parametrization for D one has to take into consideration its computational efficiency and the statistical interpretability of the individual parameters. A comparison of the computational efficiency of the different parametrizations, using simulation, is included in chapter 6. The statistical interpretation of the individual parameters in each parametrization is also discussed in that chapter. We conclude that different parametrizations should be used at different stages of the analysis: during the optimization of the (restricted) loglikelihood function, a parametrization based on the matrix logarithm of D (Leonard and Hsu, 1993) is to be preferred for its superior computational efficiency; to assess the variability of the variance-covariance components estimates, a parametrization based on the spherical coordinates of the Cholesky factor of D is recommended, since it is the one with the most interpretable elements.

1.5 Software Development

The success of any statistical technique nowadays is directly related to the

availability of reliable, efficient, and simple-to-use software for its application.

We describe in chapter 7 a set of S functions, classes, and methods (Chambers

and Hastie, 1992) that we developed for the analysis of mixed effects models,

using either maximum, or restricted maximum likelihood. These extend the lin-

ear and nonlinear modeling facilities available in release 3 of S and S-plus. The

source code, written in S and C using an object-oriented approach, is available

in the S collection at StatLib. Help files for all S functions and methods are

included in Appendix B.

The two functions used to fit linear and nonlinear mixed effects models are

respectively lme() and nlme(). Objects returned by these functions are of

classes lme and nlme respectively, and the latter class inherits from the former.

Several methods are available for both the lme and nlme classes, including print, summary, plot, predict and anova. These were developed keeping

consistency with the methods of other model fitting functions available in S,

such as lm(), glm(), and nls().

The use of the S functions and methods for mixed effects models is illustrated

in chapter 7 through the analysis of two real data examples: one of a linear mixed

effects model and the other of a nonlinear mixed effects model.
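To indicate the flavor of the interface, a hypothetical call for the ramus height data of chapter 2 might look as follows; the exact argument names and defaults are the ones documented in chapter 7 and Appendix B, so this sketch should be read as illustrative only.

    ## Hypothetical sketch of fitting the ramus height model with lme();
    ## the data frame and argument names here are assumed, not definitive.
    fit <- lme(fixed = ramus ~ age,    # fixed intercept and slope in age
               random = ~ age,         # random intercept and slope per boy
               cluster = ~ boy,        # grouping factor (assumed name)
               data = ramus.data)      # assumed data frame
    summary(fit)                       # summary method of section 7.1.2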

1.6 Model Building

Model building in mixed effects models involves questions that do not have a

parallel in (fixed effects) linear and nonlinear models. Some of these questions

are:

• determining which effects should have an associated random component

and which should be purely fixed;

• using covariates to explain cluster-to-cluster parameter variability;

• using structured random effects variance-covariance matrices (e.g. diagonal matrices) to reduce the number of parameters in the model.

We consider in chapter 8 strategies for addressing these questions in the context

of nonlinear mixed effects models, though most of the techniques described are

also applicable to linear mixed effects models.

The proposed strategy for choosing the random effects to be included in

the model is to start with all parameters as mixed effects, whenever no prior

information about the random effects variance-covariance structure is available

and convergence is possible. Then examine the eigenvalues of the estimated D matrix, checking if one, or more, are close to zero. The associated eigenvector(s)

would then give an estimate of the linear combination of the parameters that

could be taken as fixed. If near zero eigenvalues are present, a reduced model,

in which the corresponding linear combination of random effects is eliminated,

can then be fit and compared to the original model by means of likelihood

ratio tests or information criterion statistics. In this dissertation we use the

Akaike information criterion (Sakamoto, Ishiguro and Kitagawa, 1986) to decide

between alternative models, choosing the one with the smaller AIC.
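In S, the eigenvalue check and the AIC comparison might be sketched as follows; the numerical values and object names are illustrative assumptions.

    ## Sketch: inspect the eigenvalues of an assumed estimate of D and
    ## compare nested fits by AIC (the fit objects are assumed to exist).
    D.hat <- matrix(c(1.00, 0.95, 0.95, 0.91), 2, 2)   # assumed estimate
    ev <- eigen(D.hat, symmetric = TRUE)
    ev$values                  # the second eigenvalue is near zero here
    ev$vectors[, 2]            # the nearly-fixed linear combination
    ## AIC(fit.full); AIC(fit.reduced)  # retain the fit with smaller AIC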

For choosing covariates to explain cluster-to-cluster parameter variability

we suggest analyzing plots of random effects estimates (e.g. conditional modes)

versus the candidate covariates. If the number of covariates/random effects

combinations is large, we suggest using a forward stepwise type of approach in

which covariates are included one at a time and the potential importance of

the remaining covariates is (graphically) assessed at each step. The decision on

whether or not to include a covariate can be based on the change in the AIC

values of the fits with and without it.

In comparing alternative models one must also analyze the residuals from

the fit, checking for departures from the model’s assumptions. It is also highly

recommended that any model building analysis be done in conjunction with

experts in the field of application of the model, to ensure the practical usefulness

of the chosen model. The use of the proposed model building strategies is

illustrated in chapter 8 through the analyses of four real data examples, obtained

from the areas of forestry, ecology, and pharmacokinetics.


1.7 Future Research

Considerable research effort is currently dedicated to expanding the applicability of, and improving estimation methods for, mixed effects models. We suggest in

chapter 9 topics for future research in mixed effects models that were not covered

in this dissertation. These include suggestions for:

• expanding the asymptotic results of chapter 3 to nonlinear mixed effects models and linear mixed effects models with more general variance-covariance structures for the error term;

• deriving unconstrained parametrizations for structured variance-covariance matrices;

• comparing methods for assessing the variability of parameter estimates in

mixed effects models.

Chapter 2

The Linear Mixed Effects Model

In this chapter we describe a general linear mixed effects model and present

two examples of its use in the context of mixed effects ANOVA models and

repeated measures data. We also include a brief bibliographic review of linear

mixed effects models.

2.1 Model and Examples

We write the linear mixed effects model as

y = Xβ + Zb + ε (2.1.1)

where y, X, and Z denote respectively the n-dimensional response vector, the n × p_0 fixed effects design matrix, and the n × m random effects design matrix, β denotes the p_0-dimensional vector of fixed effects parameters, b denotes the m-dimensional random effects vector, and ε denotes the error term.


The model formulation in (2.1.1) is quite general and in practice some restrictions on the structure and the distribution of the random effects are assumed.

Assumption 2.1.1 By permuting the columns of Z if necessary, the random effects design matrix can be partitioned as

$$ Z = [Z_1 : \cdots : Z_r] $$

where each Z_i is of the form

$$ Z_i = \begin{bmatrix} Z_i^1 & 0 & \cdots & 0 \\ 0 & Z_i^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & Z_i^{m_i} \end{bmatrix} $$

with each Z_i^j having the same number of columns q_i and a variable number of rows n_i^j. The random effects vector b can be accordingly partitioned as b = [b_1^T, b_2^T, ..., b_r^T]^T and each b_i can in turn be partitioned as b_i = [(b_i^1)^T, (b_i^2)^T, ..., (b_i^{m_i})^T]^T. This partition defines a grouping of the random effects into r classes, with the q_i random effects belonging to the same class i being observed at exactly m_i different levels.

We will restrict ourselves in this dissertation to normal distribution models. More specifically, we will assume

Assumption 2.1.2 The b_i^j are independent (for different i and/or j) and follow a N(0, D_i) distribution, ε follows a N(0, Λ) distribution, and the b_i^j are independent of ε.


The D_i can be either general positive semi-definite matrices, with q_i(q_i + 1)/2 free parameters, or structured positive semi-definite matrices, i.e. D_i = D_i(θ_i) with the dimension of θ_i being less than q_i(q_i + 1)/2 (Jennrich and Schluchter, 1986).

Define

$$ D = \bigoplus_{i=1}^{r} D_i, \qquad D_A = \bigoplus_{i=1}^{r} \big( I_{m_i} \otimes D_i \big) $$

where ⊕ denotes the direct sum and ⊗ denotes the tensor product. Note that D and D_A have the same eigenvalues, with different multiplicities (in particular they have the same maximum and minimum eigenvalues). Under assumption (2.1.2) it follows that y has a N(Xβ, Σ) distribution (Searle et al., 1992), where Σ = Λ + Z D_A Z^T.
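For instance, with r = 2 scalar classes (q_1 = q_2 = 1) observed at m_1 = 2 and m_2 = 3 levels, D_A can be assembled in S as the block diagonal (direct sum) of tensor products; the numerical values below are illustrative assumptions.

    ## Sketch: D_A as the direct sum of I_{m_i} (x) D_i, via kronecker().
    D1 <- matrix(4, 1, 1)                  # variance of class 1 effects
    D2 <- matrix(9, 1, 1)                  # variance of class 2 effects
    blk1 <- kronecker(diag(2), D1)         # I_{m1} (x) D1
    blk2 <- kronecker(diag(3), D2)         # I_{m2} (x) D2
    DA <- matrix(0, 5, 5)                  # direct sum = block diagonal
    DA[1:2, 1:2] <- blk1
    DA[3:5, 3:5] <- blk2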

In most applications of linear mixed effects models it is assumed that Λ = σ^2 I and we will also assume this here.

The mixed effects ANOVA model (Miller, 1977; Searle et al., 1992) is a particular case of model (2.1.1) where q_i = 1 and D_i = σ_i^2, i = 1, ..., r. As an example, consider the design in which the experimental units are divided into two blocks, each with two plots, which in turn are divided into two subplots, and two treatment factors A and B, in a 2 × 2 full factorial arrangement, are used according to the scheme shown in Table 2.1.1.

Table 2.1.1: Split-split plot design

  Block  Plot  Subplot  A  B
    1     1      1      1  1
    1     1      2      1  2
    1     2      1      2  1
    1     2      2      2  2
    2     1      1      1  1
    2     1      2      1  2
    2     2      1      2  1
    2     2      2      2  2

Assuming that the block and plot effects are random, the corresponding mixed effects ANOVA model can be written as

$$ y_{ijk} = \mu + b_i + A_j + s_{ij} + B_k + A.B_{jk} + \varepsilon_{ijk}, \qquad i, j, k = 1, 2 $$

where y_ijk is the response observed in the ith block, jth plot, and kth subplot, μ is the grand mean, b_i is the random effect corresponding to block i, s_ij is the random effect corresponding to the ijth block-plot combination, A_j and B_k are the A and B treatment effects respectively, and ε_ijk is the error term. To ensure identifiability of the fixed effects we will use the sum-to-zero conditions

$$ \sum_{j=1}^{2} A_j = \sum_{k=1}^{2} B_k = \sum_{j=1}^{2} A.B_{jk} = \sum_{k=1}^{2} A.B_{jk} = \sum_{j,k=1}^{2} A.B_{jk} = 0. $$

The assumptions of the model are that the b_i are i.i.d. with distribution N(0, σ_1^2), the s_ij are i.i.d. with distribution N(0, σ_2^2) and independent of the b_i, and the ε_ijk are i.i.d. with distribution N(0, σ_3^2) and independent of both the b_i and the s_ij.

In the notation of model (2.1.1), we have

$$
\begin{bmatrix} y_{111}\\ y_{112}\\ y_{121}\\ y_{122}\\ y_{211}\\ y_{212}\\ y_{221}\\ y_{222} \end{bmatrix}
=
\begin{bmatrix}
1 & 1 & 1 & 1\\
1 & 1 & -1 & -1\\
1 & -1 & 1 & -1\\
1 & -1 & -1 & 1\\
1 & 1 & 1 & 1\\
1 & 1 & -1 & -1\\
1 & -1 & 1 & -1\\
1 & -1 & -1 & 1
\end{bmatrix}
\begin{bmatrix} \mu\\ A_1\\ B_1\\ A.B_{11} \end{bmatrix}
+
\begin{bmatrix}
1 & 0 & 1 & 0 & 0 & 0\\
1 & 0 & 1 & 0 & 0 & 0\\
1 & 0 & 0 & 1 & 0 & 0\\
1 & 0 & 0 & 1 & 0 & 0\\
0 & 1 & 0 & 0 & 1 & 0\\
0 & 1 & 0 & 0 & 1 & 0\\
0 & 1 & 0 & 0 & 0 & 1\\
0 & 1 & 0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} b_1\\ b_2\\ s_{11}\\ s_{12}\\ s_{21}\\ s_{22} \end{bmatrix}
+
\begin{bmatrix} \varepsilon_{111}\\ \varepsilon_{112}\\ \varepsilon_{121}\\ \varepsilon_{122}\\ \varepsilon_{211}\\ \varepsilon_{212}\\ \varepsilon_{221}\\ \varepsilon_{222} \end{bmatrix}
$$

By setting Z_1^j = [1 1 1 1]^T, j = 1, 2, Z_2^j = [1 1]^T, j = 1, ..., 4, and Z_i = ⊕_j Z_i^j, i = 1, 2, we see that r = 2, q_1 = q_2 = 1, m_1 = 2, m_2 = 4, b_1 = [b_1 b_2]^T, and b_2 = [s_11 s_12 s_21 s_22]^T. Note also that in this example

$$ D = \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix} \qquad \text{and} \qquad D_A = \operatorname{diag}\big( \sigma_1^2,\, \sigma_1^2,\, \sigma_2^2,\, \sigma_2^2,\, \sigma_2^2,\, \sigma_2^2 \big). $$

The linear mixed effects model for repeated measures (Laird and Ware, 1982;

Lindstrom and Bates, 1988) is a particular case of model (2.1.1) where r = 1.

As an example we consider the data presented in Grizzle and Allen (1969) from

a dental study on the ramus height (in millimeters) measured in 20 boys at ages

8, 8.5, 9, and 9.5 years. The data are shown in figure 2.1.1.

Figure 2.1.1: Ramus heights for 20 boys measured at 4 ages. [Scatterplot of ramus height (mm., roughly 46-56) against age (years, 8.0-9.5), with one plotting letter (a-t) per boy.]

A linear model in age in which both the intercept and the slope vary with the boy seems adequate to explain the ramus height evolution. The corresponding

linear mixed effects model is written as

$$ y_{ij} = (\beta_1 + b_{i1}) + (\beta_2 + b_{i2})\, \mathrm{age}_j + \varepsilon_{ij}, \qquad i = 1, \ldots, 20, \; j = 1, \ldots, 4, $$

where yij is the ramus height of the ith boy at age j, β1 and β2 are the fixed

intercept and the fixed slope respectively, bi1 and bi2 are the random intercept

and the random slope corresponding to the ith boy, and εij is the error term. The

assumptions of the model are that the bi are i.i.d. with distribution N (0, D1)

and the εij are i.i.d. with distribution N (0, σ2), independent of the bi. D1 is a

general variance-covariance matrix.

In the notation of model (2.1.1) we can express the linear mixed effects model as

$$
\begin{bmatrix} 47.8\\ 48.8\\ \vdots\\ 51.3\\ 51.8 \end{bmatrix}
=
\begin{bmatrix} 1 & 8.0\\ 1 & 8.5\\ \vdots & \vdots\\ 1 & 9.0\\ 1 & 9.5 \end{bmatrix}
\begin{bmatrix} \beta_1\\ \beta_2 \end{bmatrix}
+
\begin{bmatrix}
1 & 8.0 & \cdots & 0 & 0\\
1 & 8.5 & \cdots & 0 & 0\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
0 & 0 & \cdots & 1 & 9.0\\
0 & 0 & \cdots & 1 & 9.5
\end{bmatrix}
\begin{bmatrix} b_{11}\\ b_{12}\\ \vdots\\ b_{20,1}\\ b_{20,2} \end{bmatrix}
+
\begin{bmatrix} \varepsilon_{11}\\ \varepsilon_{12}\\ \vdots\\ \varepsilon_{20,3}\\ \varepsilon_{20,4} \end{bmatrix}
$$

By letting X[n_1 ... n_2, ] denote the submatrix of X corresponding to its n_1 through n_2 rows and setting Z_1^j = X[4j-3 ... 4j, ], j = 1, ..., 20, we see that, in this example, r = 1, q_1 = 2, m_1 = 20, D = D_1, and D_A = ⊕_{i=1}^{20} D.

2.2 Likelihood Estimation

Different estimation methods for the parameters in model (2.1.1) have been proposed over the years (Searle et al., 1992), but the most commonly used methods today are maximum likelihood (ML) and restricted maximum likelihood (RML) (Longford, 1993).

It is convenient, when writing the (restricted) likelihood of y in model (2.1.1), to factor out the variance of the error term, σ^2, from the variance-covariance matrix of the random effects, i.e. D = σ^2 D^s, where D^s is called the scaled variance-covariance matrix of the random effects. Under assumption (2.1.2), the loglikelihood function for y in model (2.1.1) is given by

$$ \ell\big( \beta, \sigma^2, D^s \mid y \big) = -\frac{1}{2} \left[ n \log\big( 2\pi\sigma^2 \big) + \log\Big( \big| I + Z D_A^s Z^T \big| \Big) + \frac{1}{\sigma^2}\, (y - X\beta)^T \big( I + Z D_A^s Z^T \big)^{-1} (y - X\beta) \right] \tag{2.2.1} $$


For fixed D^s, the values of β and σ^2 that maximize (2.2.1) are given by

$$ \hat\beta(D^s) = \left[ X^T \big( I + Z D_A^s Z^T \big)^{-1} X \right]^{-1} X^T \big( I + Z D_A^s Z^T \big)^{-1} y, \tag{2.2.2} $$

$$ \hat\sigma^2(D^s) = (1/n) \left[ y - X\hat\beta(D^s) \right]^T \big( I + Z D_A^s Z^T \big)^{-1} \left[ y - X\hat\beta(D^s) \right]. $$
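As a computational aside, substituting (2.2.2) back into (2.2.1) yields a profiled loglikelihood in D^s alone; a direct and deliberately naive S transcription, assuming y, X, Z, and the scaled matrix D^s_A are already in scope, is:

    ## Sketch: profiled loglikelihood of (2.2.1) as a function of Ds alone,
    ## plugging in beta(Ds) and sigma2(Ds) from (2.2.2).  Naive O(n^3) code.
    profiled.logLik <- function(y, X, Z, DsA) {
      n  <- length(y)
      V  <- diag(n) + Z %*% DsA %*% t(Z)          # I + Z Ds_A Z'
      Vi <- solve(V)
      beta   <- solve(t(X) %*% Vi %*% X, t(X) %*% Vi %*% y)  # (2.2.2)
      r      <- y - X %*% beta
      sigma2 <- drop(t(r) %*% Vi %*% r) / n                  # (2.2.2)
      -0.5 * (n * log(2 * pi * sigma2) +
              determinant(V)$modulus + n)          # (2.2.1) at the profile
    }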

Restricted maximum likelihood estimates (RMLEs) of the variance-covariance components are usually preferred to maximum likelihood estimates (MLEs) in linear mixed effects models. The basic reason is that RMLEs take into account the estimation of the fixed effects when calculating the degrees of freedom associated with the variance-components estimates, while MLEs do not.

The RMLEs are defined as the MLEs of the likelihood of a set of n - p_0 linear combinations of the response vector y, corresponding to n - p_0 vectors that span the orthogonal complement of the column space of the fixed effects design matrix X (Harville, 1974). One way of defining such a set of vectors is to consider the QR decomposition (Thisted, 1988) of X

$$ X = [Q_1\ Q_2] \begin{bmatrix} R_1 \\ 0 \end{bmatrix} \tag{2.2.3} $$

where R_1 is upper triangular. It follows from the definition of the QR decomposition that the columns of Q_2 define a set of orthonormal vectors that span the orthogonal complement of the column space of X and the RMLEs can be obtained from the likelihood of y* = Q_2^T y. From elementary properties of the multivariate normal distribution and the definition of the QR decomposition, y* ~ N(0, Σ*), where Σ* = σ^2 (I + Q_2^T Z D_A^s Z^T Q_2). Letting Z* = Q_2^T Z and n* = n - p_0, we can write the corresponding restricted likelihood as

$$ \ell_R\big( \beta, \sigma^2, D^s \mid y \big) = -\frac{1}{2} \left[ n^* \log\big( 2\pi\sigma^2 \big) + \log\Big( \big| I + Z^* D_A^s Z^{*T} \big| \Big) + \frac{1}{\sigma^2}\, y^{*T} \big( I + Z^* D_A^s Z^{*T} \big)^{-1} y^* \right] \tag{2.2.4} $$

For fixed D^s, the value of σ^2 that maximizes (2.2.4) is

$$ \hat\sigma_R^2(D^s) = (1/n^*)\, y^{*T} \big( I + Z^* D_A^s Z^{*T} \big)^{-1} y^* \tag{2.2.5} $$
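The REML construction (2.2.3)-(2.2.5) has an equally direct sketch: a QR decomposition of X supplies Q_2, after which (2.2.5) is one line (again a naive illustration, assuming y, X, Z, and D^s_A in scope):

    ## Sketch of (2.2.3)-(2.2.5): project y onto the orthogonal complement
    ## of the column space of X, then profile out sigma2.
    reml.sigma2 <- function(y, X, Z, DsA) {
      p0 <- ncol(X)
      Q  <- qr.Q(qr(X), complete = TRUE)       # full Q of the QR of X
      Q2 <- Q[, -(1:p0), drop = FALSE]         # orthonormal complement basis
      ystar <- t(Q2) %*% y                     # y* = Q2' y
      Zstar <- t(Q2) %*% Z                     # Z* = Q2' Z
      Vs <- diag(nrow(Zstar)) + Zstar %*% DsA %*% t(Zstar)
      drop(t(ystar) %*% solve(Vs, ystar)) / (length(y) - p0)   # (2.2.5)
    }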

The restricted likelihood (2.2.4) does not depend upon β and hence no fixed effects RMLEs are available. Nevertheless the first formula in (2.2.2), with D^s replaced by its corresponding RMLE, is usually employed to provide estimates for the fixed effects in restricted maximum likelihood estimation.

The (R)MLE of D^s in general does not have a closed form expression and its determination constitutes a constrained nonlinear optimization problem whose numerical solution has been addressed in several papers (Hartley and Rao, 1967; Laird and Ware, 1982; Lindstrom and Bates, 1988; Wolfinger, Tobias and Sall, 1991). We will not consider the numerical problem of determining the (R)MLE of D^s in this dissertation. Using the formulas in (2.2.2) and (2.2.5) one can express the likelihood (2.2.1), or the restricted likelihood (2.2.4), as a function of D^s alone, greatly simplifying the optimization problem.

The exact distribution of the (R)MLEs cannot be derived in most applications of model (2.1.1) and inference about them usually has to rely on asymptotic results. We derive, in chapter 3, the asymptotic distribution of both the MLE and the RMLE, under quite general regularity conditions.

In many applications of linear mixed effects models, estimates of the random effects b are also of interest. In (R)ML estimation the conditional modes of the random effects are frequently used for that purpose (Lindstrom and Bates, 1988). These are defined as the mode of the conditional distribution of b given y, which in the case of maximum likelihood estimation is given by

$$ \hat b_{ML} = \hat D_{A,ML}^s Z^T \big( I + Z \hat D_{A,ML}^s Z^T \big)^{-1} \big( y - X\hat\beta_{ML} \big) $$

and in the case of restricted maximum likelihood is given by

$$ \hat b_{RML} = \hat D_{A,RML}^s Z^T \big( I + Z \hat D_{A,RML}^s Z^T \big)^{-1} \big( y - X\hat\beta_{RML} \big) $$

where D̂^s_{A,ML}, D̂^s_{A,RML}, and β̂_ML denote respectively the MLE and RMLE of D_A^s, and the MLE of β.

2.3 Bibliographic Review

The first developments of linear mixed effects models were related to the so-called variance components models, defined as linear mixed effects models in

which all random effects are independent (and hence no covariance components

are present). Airy (1861) seems to have given the first known formulation of a

variance components model while considering a standard measurement problem

in astronomy.

Fisher (1925) introduced the ANOVA method for estimating variance components (i.e. equating sums of squares to their expected values). Tippet (1931) clarified the use of the ANOVA method for analysis of variance designs and extended it to 2-way crossed classification mixed effects models. Possibly the

most important paper in ANOVA estimation for unbalanced data is Henderson (1953). The three ANOVA methods presented in that paper, later known as Henderson methods, were the standard estimation methods for linear mixed effects models until fast computers became available.

Maximum likelihood estimation for normal distribution variance components models seems to have been first considered by Crump (1947). The landmark paper on ML estimation for variance components models is Hartley and Rao (1967), in which, among other things, the first asymptotic results for the MLE were established. Miller (1977) corrected some problems in Hartley and Rao's results and established asymptotic results for a large class of variance component models, giving also conditions for them to hold. Restricted maximum likelihood was introduced by Thompson (1962) and later extended by Patterson and Thompson (1971). Harville (1977) presents a comprehensive review of maximum likelihood and restricted maximum likelihood estimation in linear mixed effects models and introduces the model formulation given in (2.1.1). Laird and Ware (1982) describe a general linear mixed effects model for repeated measures data and suggest the use of the EM algorithm for obtaining (R)MLEs of the variance-covariance components.

The general structure of the linear mixed effects model (2.1.1) seems to be accepted by most researchers today. The linear mixed effects models literature that has been published after Harville (1977) and Laird and Ware (1982) refers more to generalizations of the assumptions in model (2.1.1) and/or to different estimation approaches, than to reformulations of the basic model's structure.

Chi and Reinsel (1989) consider model (2.1.1) when Λ has the structure of an autoregressive process of order one (AR(1)). Maximum likelihood estimators of the model parameters and a score test for the autocorrelation are derived. One of the main conclusions is that the use of an AR(1) structure for the cluster-specific errors may have the effect of reducing the number of random effects needed in the model, but the investigation of ways to determine the best combination of time series error structure and number of random effects deserves further study. This issue is also considered by Jones (1990).

A Bayesian analysis of model (2.1.1) using the Gibbs sampler (Geman and Geman, 1984) is described in Gelfand, Hills, Racine-Poon and Smith (1990) and in Wakefield, Smith, Racine-Poon and Gelfand (1994). The Bayesian analysis is developed using a hierarchical model approach. In the second paper the normal distribution of the random effects (b) is replaced by a multivariate Student-t, enhancing the robustness of the fit and giving a method for detecting outlying random effects. The main advantage of this approach is its flexibility in handling complex situations, such as constrained parameters and non-Gaussian distributions for the random effects and/or error terms. The main drawbacks are the intensive computational effort required and the need for prior distributions for all the population parameters involved.

Jennrich and Schluchter (1986) consider ML estimation in linear mixed effects models for repeated measures with structured variance-covariance matrices. Their work was extended to the general linear mixed effects models by Wolfinger et al. (1991), who also discuss restricted maximum likelihood. The use of structured matrices is very appealing in practice since many times it is known beforehand that the covariance structure of the random effects and/or the errors follows a particular pattern, and substantial reductions in computing time can thus be achieved.

A generalized linear model version of (2.1.1) is discussed in Liang and Zeger (1986) and Zeger, Liang and Albert (1988). They allow a more flexible error structure that is no longer restricted to being Gaussian and introduce the idea of a link function, h, relating E(y | b) to β and b, so that h(E(y | b)) = Xβ + Zb. This model should in fact be considered a competitor of the nonlinear mixed effects model, discussed in chapter 4.

Three books solely dedicated to linear mixed effects models have been recently published. Searle et al. (1992) includes a comprehensive review of models and estimation methods for linear mixed effects models, but focuses more on variance components models and mixed effects ANOVA models. Lindsey (1993) covers in detail linear mixed effects models for repeated measures data and Longford (1993) considers linear mixed effects models in a regression context.

Chapter 3

Asymptotic Results for the Linear Mixed Effects Model

Miller (1977) derived the asymptotic distribution of maximum likelihood estimators for a mixed effects ANOVA model. In section 3.1 we extend these results to the more general linear mixed effects model (2.1.1), showing that, under fairly general conditions, with probability going to one there exists a sequence of roots of the likelihood equations that is consistent and asymptotically normal. These results are helpful in establishing the asymptotic uncorrelatedness of the estimators of the fixed effects and the estimators of the variance-covariance components. We also show, in section 3.2, that under fairly general conditions the restricted maximum likelihood estimators for the general linear mixed effects model are consistent and asymptotically normal. In section 3.3, we show that the asymptotic normality of the (restricted) maximum likelihood estimators continues to hold for a large class of reparametrizations/structurings of the variance-covariance components. Our conclusions are included in section 3.4.


The proofs of the lemmas used throughout this chapter are included in Appendix A.

3.1 Maximum Likelihood

Under Assumption 2.1.1 the linear mixed effects model (2.1.1) can alternatively be expressed as

$$ y = X\beta + \sum_{i=1}^{r} \sum_{j=1}^{q_i} U_i^j a_i^j + \varepsilon \tag{3.1.1} $$

where the U_i^j are n × m_i incidence-like matrices defined by the relation: the kth column of U_i^j equals the jth column of Z_i^k. Note that each U_i^j has at most one nonzero entry per row. We will assume here that it has at least one nonzero entry per column, to rule out trivial cases. The a_i^j vectors are defined by the relation [a_i^j]_k = [b_i^k]_j and represent the values of the jth random effect of the ith class.

The model formulation (3.1.1) is analogous to that of Hartley and Rao (1967)

and Miller (1977) for the mixed effects ANOVA model. We will use it in this

chapter to maintain consistency with the terminology used in the second paper.

The covariance matrix of y can be expressed as

$$ \Sigma = \sigma^2 I + \sum_{i=1}^{r} \sum_{j,k=1}^{q_i} [D_i]_{jk}\, U_i^j \big( U_i^k \big)^T. $$

By letting p_1 = Σ_{i=1}^{r} q_i(q_i + 1)/2, σ_0 = σ^2, and G_0 = I and setting

$$ \sigma_1 = [D_1]_{11},\ \sigma_2 = [D_1]_{12},\ \cdots,\ \sigma_{q_1(q_1+1)/2+1} = [D_2]_{11},\ \cdots,\ \sigma_{p_1} = [D_r]_{q_r q_r}, $$

$$ G_1 = U_1^1 \big( U_1^1 \big)^T,\quad G_2 = \Big( U_1^1 \big( U_1^2 \big)^T + U_1^2 \big( U_1^1 \big)^T \Big),\quad \cdots,\quad G_{p_1} = U_r^{q_r} \big( U_r^{q_r} \big)^T, $$

we can write

$$ \Sigma = \sum_{i=0}^{p_1} \sigma_i G_i. \tag{3.1.2} $$

This formulation of model (2.1.1) differs from that in Miller (1977) in that some

of the σi may assume negative values and some of the Gi are not required to

be positive semi-definite.

The following assumptions (equivalent to Assumptions 2.2 through 2.5 in

Miller (1977)) are made about model (3.1.1).

Assumption 3.1.1 The matrix X is of full rank p0.

Assumption 3.1.2 n ≥ p0 + p1 + 1.

Assumption 3.1.3 The partitioned matrix [X : U_i^j] has rank greater than p_0 for i = 1, ..., r, j = 1, ..., q_i.

Assumption 3.1.4 The matrices G_0, G_1, ..., G_{p_1} are linearly independent, i.e. Σ_{i=0}^{p_1} τ_i G_i = 0 if and only if τ_i = 0, i = 0, ..., p_1.

As mentioned in Miller (1977), Assumption 3.1.1 can always be satisfied by

suitably reparametrizing the fixed effects vector. Assumptions 3.1.3 and 3.1.4

ensure that the random effects are not confounded with the fixed effects and

with each other.

Let p = p_0 + p_1 + 1 and σ = (σ_0, σ_1, ..., σ_{p_1})^T. Then the parameter space Θ for model (3.1.1) is

$$ \Theta = \Big\{ \theta \in \mathbb{R}^p \mid \theta = \big( \beta^T, \sigma^T \big)^T,\ \beta \in \mathbb{R}^{p_0};\ \sigma_0 > 0 \text{ and } (\sigma_1, \ldots, \sigma_{p_1}) \in \mathbb{R}^{p_1} \text{ such that each } D_i \text{ is positive semi-definite, } i = 1, \ldots, r \Big\}. $$


Since the asymptotic results proven here require that θ be an interior point of

Θ we may assume without loss of generality that the Di matrices are actually

positive definite. If Di is not positive definite then there exists one or more

linear combinations of the random effects within the ith class that are identically

equal to zero. The model can then be reparametrized to eliminate this (these)

linear combination(s), thus making the new Di positive definite. In this case,

by a suitable reparametrization of σ, the constrained optimization problem

of determining the maximum likelihood estimates can be transformed into an

unconstrained problem (cf. chapter 6).

The proof of the asymptotic normality and consistency of the maximum likelihood estimates in the general linear mixed effects model (3.1.1) parallels that of Theorem 3.1 in Miller (1977). We will also make use of the general theorem on asymptotic properties of maximum likelihood estimates given in Weiss (1971, 1973). The version of Weiss' theorem given in Miller (1977) is reproduced below, since it introduces several quantities that will be used throughout the rest of this section.

Theorem 3.1.1 (Weiss (1971, 1973)) Let y_n be a sequence of random vectors with density L_n(y_n, θ), where θ ∈ Θ ⊂ ℝ^p, and define ℓ(θ | y_n) = log(L_n(y_n, θ)). Assume that the true parameter value θ_0 is an interior point of Θ and that there exist 2p sequences of positive quantities n_i(n) and g_i(n), i = 1, 2, ..., p, such that

$$ \lim_{n\to\infty} n_i(n) = \lim_{n\to\infty} g_i(n) = \infty, \qquad \lim_{n\to\infty} \frac{g_i(n)}{n_i(n)} = 0, \quad i = 1, 2, \ldots, p. $$

Further assume that there exist nonrandom quantities J_{ij}(θ) such that

$$ -\big[ 1/\big( n_i(n)\, n_j(n) \big) \big] \big[ \partial^2 \ell(\theta \mid y_n)/\partial\theta_i\, \partial\theta_j \big|_{\theta_0} \big] \to J_{ij}(\theta_0), \quad i, j = 1, \ldots, p $$

in probability as n → ∞. The matrix J(θ_0) is assumed to be a continuous function of θ_0 and to be positive definite. Let

$$ N_n(\theta_0) = \big\{ \theta \in \Theta \mid |\theta_i - \theta_{0i}| \le g_i(n)/n_i(n),\ i = 1, \ldots, p \big\} $$

and

$$ \varepsilon_{ij}(\theta, \theta_0, n) = -\big[ 1/\big( n_i(n)\, n_j(n) \big) \big] \big[ \partial^2 \ell(\theta \mid y_n)/\partial\theta_i\, \partial\theta_j \big] - J_{ij}(\theta_0). $$

For any γ > 0 let R_n(θ_0, γ) denote the region in ℝ^n where

$$ \sum_{i,j=1}^{p} g_i(n)\, g_j(n) \sup_{\theta \in N_n(\theta_0)} |\varepsilon_{ij}(\theta, \theta_0, n)| < \gamma. $$

Assume that there exist sequences {γ_n(θ_0)}, {δ_n(θ_0)} of positive quantities with

$$ \lim_{n\to\infty} \gamma_n(\theta_0) = \lim_{n\to\infty} \delta_n(\theta_0) = 0 $$

such that for each n,

$$ P_\theta\big( R_n[\theta_0, \gamma_n(\theta_0)] \big) > 1 - \delta_n(\theta_0), \quad \forall\, \theta \in N_n(\theta_0). $$

It then follows that there exists a sequence of estimates θ̂(n), which are roots of the equations ∂ℓ(θ | y_n)/∂θ = 0, such that the vector whose ith component is n_i(n)(θ̂_i(n) - θ_{0i}) converges in distribution to a N(0, J^{-1}(θ_0)). That is, the

sequence θ̂(n) is consistent, asymptotically normal, and efficient.

We will show that, under general assumptions, the conditions of Weiss' theorem are satisfied by the MLEs in (3.1.1). We need the following additional assumptions in order to derive the main asymptotic theorem of this section.

Assumption 3.1.5 The number of observed levels (mi) of the random effects

in the ith class goes to infinity, i = 1, . . . , r.

Define now ν_k = rank(G_k), k = 1, ..., p_1, and ν_0 = n - rank[U_1^1 : ... : U_r^{q_r}].

Assumption 3.1.6 lim_{n→∞} ν_0/n exists and is positive.

For k = 1, ..., p_1, we have m_{i(k)} ≤ rank(G_k) ≤ 2m_{i(k)} when σ_k is a covariance term and rank(G_k) = m_{i(k)} when σ_k is a variance term, with i(k) denoting the random effect class with which σ_k is associated. For the rank of a G_k associated with a covariance term to be equal to 2m_{i(k)} - s there must be exactly s indices l for which [U_{i(k)}^{j_1(k)}]_l = [U_{i(k)}^{j_2(k)}]_l, with [A]_l representing the lth column of A and j_1(k), j_2(k) representing the random effects within class i(k) corresponding to σ_k. Note that ν_k is of the same order of magnitude as m_{i(k)}, which by Assumption 3.1.5 goes to infinity.

The next assumption pertains to the asymptotic covariance matrix of the maximum likelihood estimates in model (3.1.1). Let θ_0 = (β_0^T, σ_0^T)^T denote the true value of the parameter vector and Σ_0 the associated covariance matrix of the response vector y.

Assumption 3.1.7 There exists a sequence of positive quantities ν_{p_1+1} depending on n and going to infinity such that C_0 = lim_{n→∞} X^T Σ_0^{-1} X / ν_{p_1+1} exists and is positive definite. Also let C_1 be the (p_1 + 1) × (p_1 + 1) matrix defined by

$$ [C_1]_{ij} = (1/2) \lim_{n\to\infty} \operatorname{trace}\big( \Sigma_0^{-1} G_i \Sigma_0^{-1} G_j \big) \big/ (\nu_i \nu_j)^{1/2}, \qquad i, j = 0, \ldots, p_1. $$

Then the limits exist and C_1 is positive definite.

Now letting ℓ(θ | y) denote the loglikelihood of the data, it can be shown that (Searle et al., 1992)

$$ \frac{\partial^2 \ell(\theta \mid y)}{\partial\beta\, \partial\beta^T} = -X^T \Sigma^{-1} X = E_\theta\!\left( \frac{\partial^2 \ell(\theta \mid y)}{\partial\beta\, \partial\beta^T} \right), $$

$$ \frac{\partial^2 \ell(\theta \mid y)}{\partial\sigma_i\, \partial\beta} = -X^T \Sigma^{-1} G_i \Sigma^{-1} (y - X\beta), \qquad E_\theta\!\left( \frac{\partial^2 \ell(\theta \mid y)}{\partial\sigma_i\, \partial\beta} \right) = 0, $$

$$ \frac{\partial^2 \ell(\theta \mid y)}{\partial\sigma_i\, \partial\sigma_j} = \operatorname{trace}\big( \Sigma^{-1} G_i \Sigma^{-1} G_j \big)/2 - (y - X\beta)^T \Sigma^{-1} G_i \Sigma^{-1} G_j \Sigma^{-1} (y - X\beta), $$

$$ E_\theta\!\left( \frac{\partial^2 \ell(\theta \mid y)}{\partial\sigma_i\, \partial\sigma_j} \right) = -\operatorname{trace}\big( \Sigma^{-1} G_i \Sigma^{-1} G_j \big)/2. $$

Assumption 3.1.7 simply establishes the existence and positive definiteness of

the limit of the negative of the expected Hessian matrix of the loglikelihood

function. Note in particular that under the conditions of Weiss' theorem, the maximum likelihood estimates of the fixed effects β and of the vector σ of variance-covariance components are asymptotically independent.

We are now in a position to state and prove the extension of Miller’s theorem

to the general linear mixed effects model (3.1.1).

Theorem 3.1.2 Under Assumptions 2.1.1, 2.1.2, and 3.1.1 through 3.1.7, and letting θ_0 be an interior point of Θ representing the true parameter vector and

$$ J = \begin{bmatrix} C_0 & 0 \\ 0 & C_1 \end{bmatrix}, $$

there exists a sequence of estimates θ̂_n = (β̂_n^T, σ̂_n^T)^T with the following properties.

1. Given ε > 0, there exist δ = δ(ε), 0 < δ < ∞, and n_0 = n_0(ε) such that for all n > n_0

$$ P_{\theta_0}\!\left( \frac{\partial \ell(\theta \mid y)}{\partial\theta} \bigg|_{\theta = \hat\theta_n} = 0;\ \| \hat\beta_n - \beta_0 \| < \frac{\delta}{n_{p_1+1}} \text{ and } | \hat\sigma_{ni} - \sigma_{0i} | < \frac{\delta}{n_i},\ i = 0, \ldots, p_1 \right) \ge 1 - \varepsilon $$

where n_i = ν_i^{1/2}, i = 0, ..., p_1 + 1.

2. The p-dimensional random vector with the first p_0 components given by n_{p_1+1}(β̂_n - β_0) and the last p_1 + 1 given by n_i(σ̂_{ni} - σ_{0i}), i = 0, ..., p_1, converges in distribution to a N_p(0, J^{-1}).

The proof of the theorem will consist of verifying that the maximum likelihood estimates for model (3.1.1) satisfy the conditions of Theorem 3.1.1, under Assumptions 2.1.1, 2.1.2, and 3.1.1 through 3.1.7. The proof will parallel the steps in Miller (1977), but we will need to derive intermediate results, since those used in his paper do not apply to the more general model (3.1.1).

Define

$$ \kappa = \kappa(n) = \max_{i,j} \left| -\big( 1/n_{l(i)} n_{l(j)} \big)\, E_{\theta_0}\!\left( \frac{\partial^2 \ell(\theta \mid y)}{\partial\theta_i\, \partial\theta_j} \bigg|_{\theta_0} \right) - J_{ij}(\theta_0) \right| $$

where

$$ l(i) = \begin{cases} p_1 + 1, & \text{if } 1 \le i \le p_0 \\ i - (p_0 + 1), & \text{otherwise.} \end{cases} $$

By Assumption 3.1.7, κ → 0. Define now g = min(n_0^{1/4}, n_1^{1/4}, ..., n_{p_1+1}^{1/4}, κ^{-1/4}). Note that g → ∞ since n_i → ∞, i = 0, ..., p_1 + 1 by Assumption 3.1.6 and κ → 0. It is also true that g/n_i ≤ g^{-3} → 0, i = 0, ..., p_1 + 1. Theorem 3.1.1 allows a different sequence g_i for each parameter, but we will use a common g_i = g, i = 1, ..., p.

The conditions of Theorem 3.1.1 are then equivalent to

$$ \left| \left( -\big( 1/n_{l(i)} n_{l(j)} \big) \frac{\partial^2 \ell(\theta \mid y)}{\partial\theta_i\, \partial\theta_j} \bigg|_{\theta_0} \right) - J_{ij}(\theta_0) \right| \xrightarrow{\;P_{\theta_0}\;} 0, \quad i, j = 1, \ldots, p \tag{3.1.3} $$

and

$$ \sup_{\theta_1 \in N_n(\theta_0)} g^2 \left| \left( -\big( 1/n_{l(i)} n_{l(j)} \big) \frac{\partial^2 \ell(\theta \mid y)}{\partial\theta_i\, \partial\theta_j} \bigg|_{\theta_1} \right) - J_{ij}(\theta_0) \right| \xrightarrow{\;P_{\theta_2}\;} 0, $$

for i, j = 1, ..., p and for all θ_2 ∈ N_n(θ_0), where N_n(θ_0) is as defined in Theorem 3.1.1. Using the same reasoning as in Miller (1977) we have, by repeated applications of the triangle inequality,

$$
\begin{aligned}
\sup_{\theta_1 \in N_n(\theta_0)} & \left| \left( -\big( 1/n_{l(i)} n_{l(j)} \big) \frac{\partial^2 \ell(\theta \mid y)}{\partial\theta_i\, \partial\theta_j} \bigg|_{\theta_1} \right) - J_{ij}(\theta_0) \right| \\
\le\;& \big( 1/n_{l(i)} n_{l(j)} \big) \sup_{\theta_1 \in N_n(\theta_0)} \left| \frac{\partial^2 \ell(\theta \mid y)}{\partial\theta_i\, \partial\theta_j} \bigg|_{\theta_1} - \frac{\partial^2 \ell(\theta \mid y)}{\partial\theta_i\, \partial\theta_j} \bigg|_{\theta_2} \right| \\
&+ \big( 1/n_{l(i)} n_{l(j)} \big) \left| \frac{\partial^2 \ell(\theta \mid y)}{\partial\theta_i\, \partial\theta_j} \bigg|_{\theta_2} - E_{\theta_2}\!\left( \frac{\partial^2 \ell(\theta \mid y)}{\partial\theta_i\, \partial\theta_j} \bigg|_{\theta_2} \right) \right| \\
&+ \big( 1/n_{l(i)} n_{l(j)} \big) \left| E_{\theta_2}\!\left( \frac{\partial^2 \ell(\theta \mid y)}{\partial\theta_i\, \partial\theta_j} \bigg|_{\theta_2} \right) - E_{\theta_2}\!\left( \frac{\partial^2 \ell(\theta \mid y)}{\partial\theta_i\, \partial\theta_j} \bigg|_{\theta_0} \right) \right| \\
&+ \big( 1/n_{l(i)} n_{l(j)} \big) \left| E_{\theta_2}\!\left( \frac{\partial^2 \ell(\theta \mid y)}{\partial\theta_i\, \partial\theta_j} \bigg|_{\theta_0} \right) - E_{\theta_0}\!\left( \frac{\partial^2 \ell(\theta \mid y)}{\partial\theta_i\, \partial\theta_j} \bigg|_{\theta_0} \right) \right| \\
&+ \left| \big( -1/n_{l(i)} n_{l(j)} \big)\, E_{\theta_0}\!\left( \frac{\partial^2 \ell(\theta \mid y)}{\partial\theta_i\, \partial\theta_j} \bigg|_{\theta_0} \right) - J_{ij}(\theta_0) \right|.
\end{aligned} \tag{3.1.4}
$$

Let the five terms on the right hand side of inequality (3.1.4) be denoted by

φ1, φ2, φ3, φ4, and φ5. We note that φ3, φ4 and φ5 are nonrandom terms that we

will show are bounded by sequences going to zero as n → ∞. Then we will also

show that φ1 and φ2 converge in probability to zero.


3.1.1 Limit of φ5

By definition

$$ g^2 \phi_5 \le \kappa^{-1/2} \kappa = \kappa^{1/2} \to 0, \quad \text{as } n \to \infty. $$

3.1.2 Limit of φ4

To establish g²φ_4 → 0 we will first consider the ∂²ℓ(θ | y)/∂β_i∂β_j derivatives. Since this quantity is nonrandom it follows that φ_4 = 0 for these pairs of terms. Next we consider the ∂²ℓ(θ | y)/∂σ_i∂β_j second derivatives. In this case

$$ g^2 \phi_4 = \frac{g^2}{n_i n_{p_1+1}} \left| \xi_j^T X^T \Sigma_0^{-1} G_i \Sigma_0^{-1} X (\beta_2 - \beta_0) \right| $$

where ξ_j denotes the jth canonical basis vector with components

$$ [\xi_j]_k = \begin{cases} 1, & \text{if } k = j \\ 0, & \text{otherwise.} \end{cases} $$

Using the Cauchy-Schwartz inequality repeatedly we get

$$
\begin{aligned}
(n_i n_{p_1+1}) \phi_4 &\le \left[ \xi_j^T X^T \Sigma_0^{-1} G_i \Sigma_0^{-1} G_i \Sigma_0^{-1} X \xi_j \right]^{1/2} \left[ (\beta_2 - \beta_0)^T X^T \Sigma_0^{-1} X (\beta_2 - \beta_0) \right]^{1/2} \\
&\le \max_k \left| \lambda_k\big( \Sigma_0^{-1} G_i \big) \right| \left[ \xi_j^T X^T \Sigma_0^{-1} X \xi_j \right]^{1/2} \left[ (\beta_2 - \beta_0)^T X^T \Sigma_0^{-1} X (\beta_2 - \beta_0) \right]^{1/2} \\
&\le \max_k \left| \lambda_k\big( \Sigma_0^{-1} G_i \big) \right| \lambda_{\max}\big( X^T \Sigma_0^{-1} X \big)\, \| \beta_2 - \beta_0 \|,
\end{aligned} \tag{3.1.5}
$$

where λ_k denotes the kth eigenvalue and λ_max the maximum eigenvalue. By the definition of N_n(θ_0), ‖β_2 - β_0‖ ≤ √p_0 g/n_{p_1+1} and by Assumption 3.1.7 and the continuity of the maximum eigenvalue, for sufficiently large n we must have

$$ \lambda_{\max}\big( X^T \Sigma_0^{-1} X \big) < 2 n_{p_1+1}^2 \lambda_{\max}(C_0). \tag{3.1.6} $$

Also, from Lemma A.3 proven in Appendix A, there exists δ_0 = δ_0(σ_0) > 0 such that max_k |λ_k(Σ_0^{-1} G_i)| ≤ 2/δ_0, i = 0, ..., p_1. It then follows that for sufficiently large n

$$ g^2 \phi_4 \le \frac{4 g^3 \sqrt{p_0}\, \lambda_{\max}(C_0)}{n_i \delta_0} \le \frac{4 \sqrt{p_0}\, \lambda_{\max}(C_0)}{g \delta_0} $$

and by Assumptions 3.1.5 and 3.1.6 the last quantity goes to zero as n → ∞.

Finally consider the ∂²ℓ(θ | y)/∂σ_i∂σ_j derivatives. In this case we have

$$ g^2 \phi_4 \le \frac{g^2}{n_i n_j} \Big\{ \left| \operatorname{trace}\big[ \Sigma_0^{-1} G_i \Sigma_0^{-1} G_j \Sigma_0^{-1} (\Sigma_0 - \Sigma_2) \big] \right| + \left| (\beta_2 - \beta_0)^T X^T \Sigma_0^{-1} G_i \Sigma_0^{-1} G_j \Sigma_0^{-1} X (\beta_2 - \beta_0) \right| \Big\}. \tag{3.1.7} $$

We will consider each term on the right hand side of inequality (3.1.7) separately. For the first term note that r_{ij} = rank(Σ_0^{-1} G_i Σ_0^{-1} G_j Σ_0^{-1}(Σ_0 - Σ_2)) ≤ min(rank(G_i), rank(G_j)) = min(ν_i, ν_j). Let P_{ij} be an n × r_{ij} matrix whose columns form an orthonormal basis for the range space of Σ_0^{-1/2} G_i Σ_0^{-1} G_j Σ_0^{-1}(Σ_0 - Σ_2)(Σ_0^{-1/2})^T, with Σ_0^{-1/2} denoting the Cholesky factor of Σ_0^{-1} (Thisted, 1988). It then follows that

$$
\begin{aligned}
& \left| \operatorname{trace}\big( \Sigma_0^{-1} G_i \Sigma_0^{-1} G_j \Sigma_0^{-1} (\Sigma_0 - \Sigma_2) \big) \right| \\
&\quad = \left| \operatorname{trace}\Big( P_{ij}^T \Sigma_0^{-1/2} G_i \Sigma_0^{-1} G_j \Sigma_0^{-1} (\Sigma_0 - \Sigma_2) \big( \Sigma_0^{-1/2} \big)^T P_{ij} \Big) \right| \\
&\quad \le \sum_{k=1}^{r_{ij}} \left| \xi_k^T P_{ij}^T \Sigma_0^{-1/2} G_i \Sigma_0^{-1} G_j \Sigma_0^{-1} (\Sigma_0 - \Sigma_2) \big( \Sigma_0^{-1/2} \big)^T P_{ij} \xi_k \right|.
\end{aligned} \tag{3.1.8}
$$

Applying the Cauchy-Schwartz inequality and Lemmas A.3 and A.7 to the terms of the summation in (3.1.8) gives

$$
\begin{aligned}
& \left| \xi_k^T P_{ij}^T \Sigma_0^{-1/2} G_i \Sigma_0^{-1} G_j \Sigma_0^{-1} (\Sigma_0 - \Sigma_2) \big( \Sigma_0^{-1/2} \big)^T P_{ij} \xi_k \right| \\
&\quad \le \left[ \xi_k^T P_{ij}^T \Sigma_0^{-1/2} G_i \Sigma_0^{-1} G_i \big( \Sigma_0^{-1/2} \big)^T P_{ij} \xi_k \right]^{1/2} \\
&\qquad \times \left[ \xi_k^T P_{ij}^T \Sigma_0^{-1/2} (\Sigma_0 - \Sigma_2) \Sigma_0^{-1} G_j \Sigma_0^{-1} G_j \Sigma_0^{-1} (\Sigma_0 - \Sigma_2) \big( \Sigma_0^{-1/2} \big)^T P_{ij} \xi_k \right]^{1/2} \\
&\quad \le \max_k \left| \lambda_k\big( \Sigma_0^{-1} G_i \big) \right| \max_k \left| \lambda_k\big( \Sigma_0^{-1} G_j \big) \right| \max_k \left| \lambda_k\big( \Sigma_0^{-1} (\Sigma_0 - \Sigma_2) \big) \right| \\
&\quad \le \frac{4}{g^3 \delta_0^2} \left( \frac{q}{\lambda_{\min}(D_0)} + \frac{1}{\sigma_0^2} \right),
\end{aligned} \tag{3.1.9}
$$

where q = Σ_{i=1}^{r} q_i and D_0 is the D matrix evaluated at θ = θ_0. Consider now the second term in the right hand side of inequality (3.1.7). Using the Cauchy-Schwartz inequality, (3.1.6), and Lemma A.3 we get

$$
\begin{aligned}
& \left| (\beta_2 - \beta_0)^T X^T \Sigma_0^{-1} G_i \Sigma_0^{-1} G_j \Sigma_0^{-1} X (\beta_2 - \beta_0) \right| \\
&\quad \le \left[ (\beta_2 - \beta_0)^T X^T \Sigma_0^{-1} G_i \Sigma_0^{-1} G_i \Sigma_0^{-1} X (\beta_2 - \beta_0) \right]^{1/2} \\
&\qquad \times \left[ (\beta_2 - \beta_0)^T X^T \Sigma_0^{-1} G_j \Sigma_0^{-1} G_j \Sigma_0^{-1} X (\beta_2 - \beta_0) \right]^{1/2} \\
&\quad \le 2 n_{p_1+1}^2 \lambda_{\max}(C_0) \max_k \left| \lambda_k\big( \Sigma_0^{-1} G_i \big) \right| \max_k \left| \lambda_k\big( \Sigma_0^{-1} G_j \big) \right| \| \beta_2 - \beta_0 \|^2 \\
&\quad \le 8 p_0 g^2 \lambda_{\max}(C_0)/\delta_0^2
\end{aligned} \tag{3.1.10}
$$

and therefore

$$ g^2 \phi_4 \le \frac{4 r_{ij}}{g n_i n_j \delta_0^2} \left( \frac{q}{\lambda_{\min}(D_0)} + \frac{1}{\sigma_0^2} \right) + \frac{8 p_0 g^4 \lambda_{\max}(C_0)}{n_i n_j \delta_0^2}. $$

Since r_{ij} ≤ n_i n_j, g^4/(n_i n_j) ≤ g^{-4}, and g → ∞ it follows that g²φ_4 → 0.


3.1.3 Limit of φ3

Let us start with the ∂²ℓ(θ | y)/∂β_i∂β_j derivatives. In this case we have n_{p_1+1}² φ_3 = |ξ_i^T X^T (Σ_2^{-1} - Σ_0^{-1}) X ξ_j|. Noting that Σ_2^{-1} - Σ_0^{-1} = Σ_2^{-1}(Σ_0 - Σ_2)Σ_0^{-1} and using the Cauchy-Schwartz inequality and (3.1.6) we get

$$
\begin{aligned}
n_{p_1+1}^2 \phi_3 &\le \left[ \xi_i^T X^T \big( \Sigma_2^{-1} (\Sigma_0 - \Sigma_2) \big)^2 \Sigma_0^{-1} X \xi_i \right]^{1/2} \left[ \xi_j^T X^T \Sigma_0^{-1} X \xi_j \right]^{1/2} \\
&\le 2 n_{p_1+1}^2 \lambda_{\max}(C_0) \max_k \left| \lambda_k\big( \Sigma_2^{-1} (\Sigma_0 - \Sigma_2) \big) \right|
\end{aligned} \tag{3.1.11}
$$

and using Lemma A.7 we get

$$ g^2 \phi_3 \le \frac{4 \lambda_{\max}(C_0)}{g} \left( \frac{q}{\lambda_{\min}(D_0)} + \frac{1}{\sigma_0^2} \right), $$

which goes to zero as n → ∞.

which goes to zero as n → ∞.

Next consider the ∂2�(θ | y)/∂βi∂σj derivatives. By applying the Cauchy-

Schwartz inequality, Lemma A.3, and (3.1.6) we get

np1+1njφ3

=∣∣∣ξT

i XTΣ−10 GjΣ

−10 X (β2 − β0)

∣∣∣≤

[ξT

i XTΣ−10 GjΣ

−10 GjΣ

−10 Xξi

]1/2 [(β2 − β0)

T XTΣ−10 X (β2 − β0)

]1/2

≤ 4√

p0np1+1gλmax(C0)/δ0

and therefore g2φ3 ≤ 4√

p0g3λmax(C0)/nj ≤ 4

√p0λmax(C0)/g → 0.

Finally consider the ∂²ℓ(θ | y)/∂σ_i∂σ_j derivatives. By applying the triangle inequality we get

$$
\begin{aligned}
n_i n_j \phi_3 \le\;& \tfrac{1}{2} \left| \operatorname{trace}\big( \Sigma_0^{-1} G_i \Sigma_0^{-1} G_j \Sigma_0^{-1} \Sigma_2 \big) - \operatorname{trace}\big( \Sigma_2^{-1} G_i \Sigma_2^{-1} G_j \big) \right| \\
&+ \tfrac{1}{2} \left| \operatorname{trace}\big( \Sigma_0^{-1} G_i \Sigma_0^{-1} G_j \Sigma_0^{-1} (\Sigma_2 - \Sigma_0) \big) \right| \\
&+ (\beta_2 - \beta_0)^T X^T \Sigma_0^{-1} G_i \Sigma_0^{-1} G_j \Sigma_0^{-1} X (\beta_2 - \beta_0).
\end{aligned} \tag{3.1.12}
$$

From (3.1.8), (3.1.9), and (3.1.10) we have that the last two quantities on the right hand side of (3.1.12) are bounded respectively by (2 r_{ij}/g³δ_0²)(q/λ_min(D_0) + 1/σ_0²) and 8 p_0 g² λ_max(C_0)/δ_0². Now note that

$$
\begin{aligned}
\Sigma_0^{-1} G_i \Sigma_0^{-1} G_j \Sigma_0^{-1} \Sigma_2 - \Sigma_2^{-1} G_i \Sigma_2^{-1} G_j =\;& \Sigma_0^{-1} G_i \Sigma_0^{-1} G_j \Sigma_0^{-1} (\Sigma_2 - \Sigma_0) + \Sigma_0^{-1} G_i \big( \Sigma_0^{-1} - \Sigma_2^{-1} \big) G_j \\
&+ \big( \Sigma_0^{-1} - \Sigma_2^{-1} \big) G_i \Sigma_2^{-1} G_j.
\end{aligned}
$$

From (3.1.8), (3.1.9), and Lemmas A.5 and A.7 it follows that

$$ \left| \operatorname{trace}\big( \Sigma_0^{-1} G_i \Sigma_0^{-1} G_j \Sigma_0^{-1} (\Sigma_2 - \Sigma_0) \big) \right| \le \big( 4 r_{ij}/\delta_0^2 g^3 \big) \left( \frac{q}{\lambda_{\min}(D_0)} + \frac{1}{\sigma_0^2} \right), $$

$$ \left| \operatorname{trace}\big( \Sigma_0^{-1} G_i \big( \Sigma_0^{-1} - \Sigma_2^{-1} \big) G_j \big) \right| \le \big( 8 r_{ij}/\delta_0^2 g^3 \big) \left( \frac{q}{\lambda_{\min}(D_0)} + \frac{1}{\sigma_0^2} \right), $$

$$ \left| \operatorname{trace}\big( \big( \Sigma_0^{-1} - \Sigma_2^{-1} \big) G_i \Sigma_2^{-1} G_j \big) \right| \le \big( 16 r_{ij}/\delta_0^2 g^3 \big) \left( \frac{q}{\lambda_{\min}(D_0)} + \frac{1}{\sigma_0^2} \right), $$

and therefore

$$ g^2 \phi_3 \le \frac{16 r_{ij}}{g n_i n_j \delta_0^2} \left( \frac{q}{\lambda_{\min}(D_0)} + \frac{1}{\sigma_0^2} \right) + \frac{8 p_0 g^4 \lambda_{\max}(C_0)}{n_i n_j \delta_0^2}, $$

where, as before, r_{ij} = min(ν_i, ν_j). Since r_{ij}/(g n_i n_j) ≤ g^{-1} and g^4/(n_i n_j) ≤ g^{-4} it follows that g²φ_3 → 0.

3.1.4 Limit of φ2

It follows from Tchebychev's inequality that to show g²φ_2 → 0 in P_{θ_2}-probability it suffices to show Var_{θ_2}(g²φ_2) → 0. Since ∂²ℓ(θ | y)/∂β_i∂β_j is nonrandom, its variance is zero and the condition is trivially verified.

Consider now the ∂²ℓ(θ | y)/∂β∂σ_j derivatives. To show that the variance of each component goes to zero, it is enough to show that the trace of the associated variance-covariance matrix goes to zero. In this case

$$
\begin{aligned}
\operatorname{trace}\big( \operatorname{Var}\big( \partial^2 \ell(\theta \mid y)/\partial\beta\, \partial\sigma_j \big) \big) &= \operatorname{trace}\big( X^T \Sigma_2^{-1} G_j \Sigma_2^{-1} G_j \Sigma_2^{-1} X \big) \\
&\le p_0 \max_k \Big( \big[ \lambda_k\big( \Sigma_2^{-1} G_j \big) \big]^2 \Big)\, \lambda_{\max}\big( X^T \Sigma_2^{-1} X \big).
\end{aligned} \tag{3.1.13}
$$

Using the fact that Σ−12 = Σ−1

0 +(Σ−1

2 −Σ−10

)and the results in (3.1.6) and

(3.1.11) we get that

λmax

(XTΣ−1

2 X)

(3.1.14)

≤ λmax

(XTΣ−1

0 X)

+ λmax

(XT

(Σ−1

2 − Σ−10

)X)

≤ 2n2p1+1λmax(C0)

(1 +

2

g3

(q

λmin(D0)

+1

σ20

)).

Using (3.1.13), (3.1.14), and Lemma A.5 gives

trace(Var

(∂2�(θ | y)/∂β∂σj

))≤

39

32g4p0λmax(C0)

n2jδ

20

(1 +

2

g3

(q

λmin(D0)+

1

σ20

))

and since g4/n2j ≤ g−4 it follows that g2φ2

Pθ2−→ 0.

Finally consider the ∂2�(θ | y)/∂σi∂σj derivatives. Using standard results

on the variance of quadratic forms (Seber, 1977) and Lemma A.5 we get

n2i n

2jVar (φ2) =

2trace(Σ−1

2 GiΣ−12 Gj

)2 ≤ 2rij

[max

k

∣∣∣λk(Σ−12 Gi)

∣∣∣maxk

∣∣∣λk(Σ−12 Gj)

∣∣∣]2 ,

hence

g4Var (φ2) ≤ 512g4rij

n2i n

2jδ

40

≤ 512

g4δ40

and so g2φ2

Pθ2−→ 0.

3.1.5 Limit of φ1

Let us first take the ∂2�(θ | y)/∂βi∂βj derivatives. In this case we get

n2p1+1φ1 = sup

θ1∈Nn(θ0)

∣∣∣ξTi XT

(Σ−1

1 − Σ−12

)Xξj

∣∣∣ .Applying the Cauchy-Schwartz inequality, (3.1.14), and Lemma A.8 we have

that for large enough n

∣∣∣ξTi XT

(Σ−1

1 − Σ−12

)Xξj

∣∣∣≤

[ξT

i XTΣ−11 (Σ2 − Σ1)Σ−1

2 (Σ2 − Σ1)Σ−11 Xξi

]1/2 [ξT

j XT Σ−12 Xξj

]1/2

≤ λmax(XTΣ−1

2 X) maxk

∣∣∣λk

(Σ−1

1 (Σ2 − Σ1))∣∣∣

40

≤ 2n2p1+1λmax(C0)

(1 +

2

g3

(q

λmin(D0)

+1

σ20

))4

g3

(q

λmin(D0)

+1

σ20

)

and as the bound does not depend on θ1 we get that, for large enough n,

g2φ1 ≤ 8λmax(C0)

g

(1 +

2

g3

(q

λmin(D0)

+1

σ20

))(q

λmin(D0)

+1

σ20

)

and therefore g2φ1 → 0.

Consider now the ∂2�(θ | y)/∂βi∂σj derivatives. We have that

np1+1njφ1 =

supθ1∈Nn(θ0)

∣∣∣ξTi XT

(Σ−1

1 GjΣ−11 (y − Xβ1) − Σ−1

2 GjΣ−12 (y − Xβ2)

)∣∣∣ .Note that

∣∣∣ξTi XT

(Σ−1

1 GjΣ−11 (y − Xβ1) − Σ−1

2 GjΣ−12 (y − Xβ2)

)∣∣∣≤

∣∣∣ξTi XT

(Σ−1

1 GjΣ−11 − Σ−1

2 GjΣ−12

)(y − Xβ2)

∣∣∣+∣∣∣ξT

i XTΣ−11 GjΣ

−11 X (β1 − β2)

∣∣∣ .Now from (3.1.14) and Lemmas A.1 and A.5

∣∣∣ξTi XTΣ−1

1 GjΣ−11 X (β1 − β2)

∣∣∣ (3.1.15)

≤[ξT

i XTΣ−11 Xξi

]1/2 [(β1 − β2)

T XTΣ−11 GjΣ

−11 X (β1 − β2)

]1/2

≤ 4np1+1g√

p0λmax(C0)√δ0

(1 +

2

g3

(q

λmin(D0)

+1

σ20

)).

41

Noting that

Σ−11 GjΣ

−11 −Σ−1

2 GjΣ−12 = Σ−1

1 Gj

(Σ−1

1 −Σ−12

)+(Σ−1

1 − Σ−12

)GjΣ

−12 ,

we get by applying the triangle inequality

∣∣∣ξTi XT

(Σ−1

1 GjΣ−11 −Σ−1

2 GjΣ−12

)(y − Xβ2)

∣∣∣ (3.1.16)

≤∣∣∣ξT

i XTΣ−11 Gj

(Σ−1

1 − Σ−12

)(y − Xβ2)

∣∣∣+∣∣∣ξT

i XT(Σ−1

1 − Σ−12

)GjΣ

−12 (y − Xβ2)

∣∣∣ .But using the Cauchy-Schwartz inequality once again gives

∣∣∣ξTi XTΣ−1

1 Gj

(Σ−1

1 −Σ−12

)(y − Xβ2)

∣∣∣ ≤ [ξT

i XTΣ−11 Xξi

]1/2

×[(y − Xβ2)

T(Σ−1

1 − Σ−12

)GjΣ

−11 Gj

(Σ−1

1 −Σ−12

)(y − Xβ2)

]1/2.

From (3.1.14) it follows that

ξTi XTΣ−1

1 Xξi ≤ 2n2p1+1λmax(C0)

(1 +

2

g3

(q

λmin(D0)

+1

σ20

)).

Now by applying Lemmas A.5, A.6, and A.8 we get that

g4λmax

([Σ

1/22

]−T(Σ2 −Σ1)Σ

−11 GjΣ

−11 GjΣ

−11 (Σ2 − Σ1)

1/22

]−1)

≤ g4 maxk

([λk

(Σ−1

1 Gi

)]2)max

k

([λk

(Σ−1

1 (Σ2 − Σ1))]2)

λmax

(Σ−1

2 Σ1

)≤ 1024g4

g6δ20

(q

λmin(D0)

+1

σ20

)(λmax(D

0)

λmin(D0)

+ 1

)→ 0.

42

Noting that

rank(Gj) = n2j

≥ rank([

Σ1/22

]−T(Σ2 − Σ1)Σ−1

1 GjΣ−11 GjΣ1−1 (Σ2 −Σ1)

1/22

]−1)

we get from Lemma A.9

supθ1∈Nn(θ0)

g4

n2j

(y − Xβ2)T (3.1.17)

×(Σ−1

1 −Σ−12

)GjΣ

−11 Gj

(Σ−1

1 − Σ−12

)(y − Xβ2)

Pθ2−→ 0.

Consider now the second term on the right hand side of (3.1.16). Applying

the Cauchy-Schwartz inequality to it gives

∣∣∣ξTi XT

(Σ−1

1 − Σ−12

)GjΣ

−12 (y − Xβ2)

∣∣∣≤

[ξT

i XT(Σ−1

1 (Σ2 −Σ1))2

Σ−12 Xξi

]1/2

×[(y − Xβ2)Σ−1

2 GjΣ−12 GjΣ

−12 (y − Xβ2)

]1/2

and using (3.1.14) and Lemma A.8 we get that

ξTi XT

(Σ−1

1 (Σ2 − Σ1))2

Σ−12 Xξi

≤ 32n2p1+1λmax(C0)

g6

(1 +

2

g3

(q

λmin(D0)

+1

σ20

))(q

λmin(D0)

+1

σ20

)2

.

Observing that

rank([

Σ1/22

]−TGjΣ

−12 Gj

1/22

]−1)

≤ n2j and

43

λmax

([Σ

1/22

]−TGjΣ

−12 Gj

1/22

]−1)

≤ 16/δ20

we get using Lemma A.9

Pθ2

(sup

θ1∈Nn(θ0)(1/n2

j) (y − Xβ2)T[Σ

1/22

]−1

×([

Σ1/22

]−TGjΣ

−12 Gj

1/22

]−1) [

Σ1/22

]−T(y − Xβ2) >

32

δ20

)→ 0.

Therefore

g2

np1+1njsup

θ1∈Nn(θ0)

∣∣∣ξTi XT

(Σ−1

1 − Σ−12

)GjΣ

−12 (y − Xβ2)

∣∣∣ (3.1.18)

≤ 4√

2λmax(C0)

g

(1 + 2

(q

λmin(D0)

+1

σ20

))1/2 (q

λmin(D0)

+1

σ20

)

× supθ1∈Nn(θ0)

[1

n2j

(y − Xβ2)T[Σ

1/22

]−1([

Σ1/22

]−TGjΣ

−12 Gj

1/22

]−1)

×[Σ

1/22

]−T(y − Xβ2)

]1/2

and this converges to zero in probability (under θ2), since g−1 → 0 and the

second term of the product on the right hand side of (3.1.18) is bounded in

probability by 4√

2/δ0. Combining results (3.1.15), (3.1.17), and (3.1.18) gives

that g2φ1

Pθ2−→ 0 as desired.

Finally consider the ∂2�(θ | y)/∂σi∂σj derivatives. In this case we get

ninjφ1 (3.1.19)

≤∣∣∣trace

(Σ−1

1 GiΣ−11 Gj

)− trace

(Σ−1

2 GiΣ−12 Gj

)∣∣∣+∣∣∣(y − Xβ1)

T XTΣ−11 GiΣ

−11 GjΣ

−11 (y − Xβ1)

− (y − Xβ2)T XTΣ−1

2 GiΣ−12 GjΣ

−12 (y − Xβ2)

∣∣∣ .

44

Now noting that

Σ−11 GiΣ

−11 Gj − Σ−1

2 GiΣ−12 Gj =(

Σ−11 − Σ−1

2

)GiΣ

−11 Gj + Σ−1

2 Gi

(Σ−1

1 − Σ−12

)Gj

and

rank((

Σ−11 − Σ−1

2

)GiΣ

−11 Gj

)= rij ≤ min(n2

i , n2j )

rank(Σ−1

2 Gi

(Σ−1

1 − Σ−12

)Gj

)= rij ≤ min(n2

i , n2j )

we get using the triangle inequality and Lemmas A.5 and A.8

g2

ninj

∣∣∣trace(Σ−1

1 GiΣ−11 Gj

)− trace

(Σ−1

2 GiΣ−12 Gj

)∣∣∣≤ g2rij

ninjmax

k

∣∣∣λk(Σ−11 Gi)

∣∣∣maxk

∣∣∣λk(Σ−11 Gj)

∣∣∣maxk

∣∣∣λk

(Σ−1

2 (Σ2 − Σ1))∣∣∣

+g2rij

ninjmax

k

∣∣∣λk(Σ−12 Gi)

∣∣∣maxk

∣∣∣λk(Σ−12 Gj)

∣∣∣maxk

∣∣∣λk

(Σ−1

1 (Σ1 − Σ2))∣∣∣

≤ 128rij

δ20gninj

.

Since rij ≤ ninj and g → ∞ we see that this term converges to zero as n → ∞.

Now note that

∣∣∣(y − Xβ1)T XTΣ−1

1 GiΣ−11 GjΣ

−11 (y − Xβ1) (3.1.20)

− (y − Xβ2)T XTΣ−1

2 GiΣ−12 GjΣ

−12 (y − Xβ2)

∣∣∣≤∣∣∣(y − Xβ2)

T(Σ−1

1 GiΣ−11 GjΣ

−11 −Σ−1

2 GiΣ−12 GjΣ

−12

)(y − Xβ2)

∣∣∣+∣∣∣(y − Xβ2)

T Σ−11 GiΣ

−11 GjΣ

−11 X (β2 − β1)

∣∣∣+∣∣∣(β2 − β1)

T XTΣ−11 GiΣ

−11 GjΣ

−11 (y − Xβ2)

∣∣∣

45

+∣∣∣(β2 − β1)

T XTΣ−11 GiΣ

−11 GjΣ

−11 X (β2 − β1)

∣∣∣ .Consider now the first term on the right hand side of (3.1.20). Note that

(Σ−1

1 GiΣ−11 GjΣ

−11 −Σ−1

2 GiΣ−12 GjΣ

−12

)=

(Σ−1

1 −Σ−12

)GiΣ

−11 GjΣ

−11 + Σ−1

2 Gi

(Σ−1

1 − Σ−12

)GjΣ

−11

+ Σ−12 GiΣ

−12 Gj

(Σ−1

1 − Σ−12

).

But

∣∣∣(y − Xβ2)T(Σ−1

1 −Σ−12

)GiΣ

−11 GjΣ

−11 (y − Xβ2)

∣∣∣≤

[(y − Xβ2)

T Σ−12 (Σ2 − Σ1)Σ

−11 GiΣ

−11 GiΣ

−11 (Σ2 − Σ1)Σ−1

2

× (y − Xβ2)]1/2[(y − Xβ2)Σ−1

1 GjΣ−11 GjΣ

−11 (y − Xβ2)

]1/2.

Now note that

g4λmax

([Σ

1/22

]−T(Σ2 −Σ1)Σ

−11 GiΣ

−11 GiΣ

−11 (Σ2 −Σ1)

1/22

]−1)

≤ 1024g4

δ20g

6

(q

λmin(D0)

+1

σ20

)2 (λmax(D

0)

λmin(D0)

+ 1

)

and this term converges to zero as n → ∞.

From Lemmas A.5 and A.6 it follows that

λmax

1/22 Σ−1

1 GjΣ−11 GjΣ

−11

1/22

)T)≤

maxk

([λk(Σ

−11 Gj)

]2)λmax(Σ

−11 Σ2) ≤ 64

δ20

(λmax(D

0)

λmin(D0)

+ 1

).

46

Therefore, by Lemma A.9 we have that

g2

ninj

supθ1∈Nn(θ0)

∣∣∣(y − Xβ2)T(Σ−1

1 −Σ−12

)GiΣ

−11 GjΣ

−11 (y − Xβ2)

∣∣∣ Pθ2−→ 0

since it is dominated by a product of two terms, the first converging to zero in

probability and the second bounded in probability by a constant.

Using the exact same reasoning we show that

g2

ninj

supθ1∈Nn(θ0)

∣∣∣(y − Xβ2)T Σ−1

2 Gi

(Σ−1

1 −Σ−12

)GjΣ

−11 (y − Xβ2)

∣∣∣ Pθ2−→ 0

g2

ninjsup

θ1∈Nn(θ0)

∣∣∣(y − Xβ2)T Σ−1

2 GiΣ−12 Gj

(Σ−1

1 − Σ−12

)(y − Xβ2)

∣∣∣ Pθ2−→ 0

and that in turn implies that

g2

ninjsup

θ1∈Nn(θ0)

∣∣∣(y − Xβ2)T

×(Σ−1

1 GiΣ−11 GjΣ

−11 −Σ−1

2 GiΣ−12 GjΣ

−12

)(y − Xβ2)

∣∣∣ Pθ2−→ 0.

Consider now the term

g2

ninj

∣∣∣(y − Xβ2)T Σ−1

1 GiΣ−11 GjΣ

−11 X (β2 − β1)

∣∣∣ (3.1.21)

≤[g4

n2i

(β2 − β1)T XTΣ−1

1 GjΣ−11 GjΣ

−11 X (β2 − β1)

]1/2

×[

1

n2j

(y − Xβ2)T Σ−1

1 GjΣ−11 GjΣ

−11 (y − Xβ2)

]1/2

.

The first term on the right hand side of (3.1.21) is bounded by(4√

2p0λmax(C0)

÷gδ0)(1 + (2/g3)

(q/λmin(D

0) + 1/σ20

))1/2which goes to zero with n. The

47

supremum over θ1 ∈ Nn(θ0) of the second term is bounded in probability (un-

der θ2) by(4√

2/δ0

) (λmax(D

0)/λmin(D0) + 1

)1/2as n → ∞, by Lemma A.9.

It then follows that

g2

ninj

supθ1∈Nn(θ0)

∣∣∣(y − Xβ2)T Σ−1

1 GiΣ−11 GjΣ

−11 X (β2 − β1)

∣∣∣ Pθ2−→ 0.

Similarly we can show that

g2

ninj

supθ1∈Nn(θ0)

∣∣∣(y − Xβ2)T Σ−1

1 GjΣ−11 GiΣ

−11 X (β2 − β1)

∣∣∣ Pθ2−→ 0.

Finally we note that

g2

ninj

∣∣∣(β2 − β1)T XTΣ−1

1 GiΣ−11 GjΣ

−11 X (β2 − β1)

∣∣∣≤ g2

ninj

[(β2 − β1)

T XTΣ−11 GiΣ

−11 GiΣ

−11 X (β2 − β1)

]1/2

×[(β2 − β1)

T XTΣ−11 GjΣ

−11 GjΣ

−11 X (β2 − β1)

]1/2

≤ 32p0g4λmax(C0)

ninjδ20

(1 +

2

g3

(q

λmin(D0)

+1

σ20

))

and this term goes to zero as n → ∞, since g4/ninj ≤ g−4.

Now we put all the previous results together and see that they in fact imply

the second condition of Theorem 3.1.1. We want to show that for given θ2 ∈Nn(θ0) and ε, δ > 0, there exists an n0 such that for all n > n0 the probability

that the left hand side of (3.1.4) is greater than δ is less than ε. First we choose

n1 such that g2 max(φ3, φ4, φ5) < δ/5 which is always possible since all the terms

converge to zero. Next we get n2 > n1 such that ∀n > n2, Pθ2(g2φ2 > δ/5) <

ε/2, which is always possible since g2φ2

Pθ2−→ 0. Finally choose n0 > n2 such that

∀n > n0, Pθ2(g2φ1 > δ/5) < ε/2, which is always possible since g2φ1

Pθ2−→ 0.

48

Now since

g2 supθ1∈Nn(θ0)

∣∣∣∣∣(−(1/nl(i)nl(j))

∂2� (θ | y)

∂θi∂θj

∣∣∣∣∣θ1

)− J ij (θ0)

∣∣∣∣∣ ≤ g25∑

i=1

φi

for all n > n0 we have that

Pθ2

(g2 sup

θ1∈Nn(θ0)

∣∣∣∣∣(−(1/nl(i)nl(j))

∂2� (θ | y)

∂θi∂θj

∣∣∣∣∣θ1

)− J ij (θ0)

∣∣∣∣∣ > δ

)

≤ Pθ2

(g2 (φ1 + φ2) > 2δ/5

)≤ Pθ2(g

2φ1 > δ/5) + Pθ2(g2φ2 > δ/5) < ε.

To complete the proof of Theorem 3.1.2 we note that

∣∣∣∣∣(−(1/nl(i)nl(j))

∂2� (θ | y)

∂θi∂θj

∣∣∣∣∣θ0

)− J ij (θ0)

∣∣∣∣∣ ≤ g2 (φ2 + φ5)

with φ2 being evaluated for θ2 = θ0. From the results in subsections 3.1.1 and

3.1.4, both g2φ2 and g2φ5 converge in probability to zero and therefore the left

hand side of the last inequality also does.

3.2 Restricted Maximum Likelihood

In this section we show that under a modification of Assumptions 3.1.4 and 3.1.7,

the RMLEs of the variance-covariance components in (3.1.1) are asymptotically

normal and consistent. We also show that the usual estimates of the fixed effects

in RML estimation have the same asymptotic distribution as the maximum

likelihood estimates.

We recall from section 2.2, that the restricted likelihood can be defined as

the likelihood of y∗ = QT2 y, where Q2 is defined in (2.2.3). Letting G∗

i =

49

QT2 GiQ2, i = 0, . . . , p1, it follows from (2.2.4) and (3.1.2) that

Σ∗ =p1∑i=0

σiG∗i (3.2.1)

where Σ∗ denotes the covariance matrix of y∗.

The linear mixed effects model corresponding to y∗ can be written as

y∗ = Z∗b + ε∗ (3.2.2)

where ε∗ = QT2 ε ∼ N (0, σ2I) and is independent of b.

Assumptions 2.1.1 through 3.1.7 were the only conditions required in the

proof of Theorem 3.1.2. In this section we will assume that they still hold.

Assumption 2.1.1 was used only to ensure that Σ had a linear structure (i.e.

could be expressed as a linear combination of known matrices). Even though it

doesn’t necessarily hold for Z∗, Σ∗ has a linear structure, as shown in (3.2.1).

Assumption 2.1.2 holds for model (3.2.2) if we replace ε by ε∗. Assumption 3.1.1

implies that Q2 is an n× (n− p0) matrix, but is otherwise not needed, since no

fixed effects need to be estimated in (3.2.2). If we let n∗ = n− p0 represent the

sample size of the restricted model (3.2.2), then by Assumption 3.1.2 n∗ ≥ p1+1

(which is just Assumption 3.1.2 translated to the restricted model). Assump-

tion 3.1.3 is not needed for the restricted model. Assumption 3.1.4 ensures that

σ is identifiable in the general linear mixed effects model. We need a similar

assumption for the restricted model.

Assumption 3.2.1 The matrices G∗0, G∗

1, . . ., G∗p1

are linearly independent,

i.e.∑p1

i=0 τiG∗i = 0 ⇐⇒ τi = 0, i = 0, . . . , p1.

Assumption 3.1.5 remains unchanged in the restricted model. Now define

50

ν∗k = rank (G∗

k) , k = 1, . . . , p1, ν∗0 = n∗ − rank

(QT

2

[U 1

1 : · · · : U qrr

]), and note

that νk − 2p0 ≤ ν∗k ≤ νk, k = 0, . . . , p1. It follows that ν∗

k → ∞, k = 0, . . . , p1

and limn→∞ ν∗0/n

∗ = limn→∞ ν0/n, so that Assumption 3.1.6 also holds for the

restricted model, if we replace ν0 and n by ν∗0 and n∗. Assumption 3.1.7 needs

to be rephrased for model (3.2.2) as

Assumption 3.2.2 Let C∗1 be the (p1 + 1) × (p1 + 1) matrix defined by

[C∗1]ij = (1/2) lim

n∗→∞ trace((Σ∗

0)−1 G∗

i (Σ∗0)

−1 G∗j

)/(ν∗

i ν∗j )1/2, i, j = 0, . . . , p1.

Then the limits exist and C∗1 is positive definite.

In the definition of C∗1 above, Σ∗

0 represents the variance-covariance matrix of

y∗ evaluated at the true parameter vector σ0.

Define now the parameter space Θ∗ for the restricted model (3.2.2)

Θ∗ ={σ ∈ �p1+1 | σ0 > 0 and each Di is positive semi-definite, i = 1, . . . , r

}.

We can now state the equivalent of Theorem 3.1.2 for the RMLEs of the

variance-covariance components in model (3.2.2).

Theorem 3.2.1 Under Assumptions 2.1.1, 2.1.2, 3.1.1, 3.1.2, 3.1.5, 3.1.6,

3.2.1, and 3.2.2 and letting σ0 be an interior point of Θ∗ representing the true

parameter vector for model (3.2.2), there exists a sequence of estimates σn with

the following properties.

1. Given ε > 0, ∃δ = δ(ε), 0 < δ < ∞ and n0 = n0(ε) such that ∀n > n0

Pσ0

(∂�(σ | y∗)

∂σ

∣∣∣∣∣σ=σn

= 0; | σni − σ0i |< δ

n∗i

, i = 0, . . . , p1

)≥ 1 − ε

51

where n∗i =

√ν∗

i , i = 0, . . . , p1.

2. The (p1+1)-dimensional random vector with components given by ni (σni − σ0i),

i = 0, . . . , p1 converges in distribution to a Np(0, (C∗1)

−1).

Proof: The proof is identical to that of Theorem 3.1.2, since under Assump-

tions 2.1.1, 3.1.2, 3.1.5, 3.1.6, 3.2.1, and 3.2.2 the lemmas in Appendix A are

valid for the restricted model (with the obvious modifications, such as replacing

Σ0 by Σ∗0, etc.). Note that only the ∂2� (σ | y∗) /∂σi∂σj derivatives need to be

considered for the restricted model (3.2.2).

We now consider the estimation of the fixed effects under RML estimation

of the variance-covariance components. Since, for given σ, the maximum like-

lihood estimates of the fixed effects are given by

β(σ) =(XTΣ−1X

)−1XTΣ−1y (3.2.3)

it has been proposed that β(σ), with σ denoting the RML estimate of σ, be

used as a natural estimate for the fixed effects (Lindstrom and Bates, 1988). We

show now that such estimates have the same asymptotic properties as the ML

estimates, described in Theorem 3.1.2. In fact we show the more general result

Theorem 3.2.2 Let σ be a (weakly) consistent estimator of σ and β(σ) the

corresponding estimator of the fixed effects β, given by (3.2.3). Then under

Assumption 3.1.7 it follows that

1. β(σ) is (weakly) consistent for β;

2. np1+1

(β(σ) − β

) D−→ N (0, C−10 ).

52

Proof: Note that 2 ⇒ 1 since, by Slutsky’s theorem and (3.1.7)

np1+1

(β(σ) − β

) D−→ N (0, C−10 ) ⇒ β(σ) − β

D−→ 0 ⇒ β(σ)P−→ β

so that we just need to prove the asymptotic normality of β(σ).

By Assumption 3.1.7, XTΣ−1X/νp1+1 → C0. It then follows that

νp1+1

(XTΣ−1X

)−1C0 → I and np1+1

[(XTΣ−1X

)1/2]−1

C1/20 → I. Now

since y ∼ N (Xβ,Σ), we have

(XTΣ−1X

)−1XTΣ−1 (y − Xβ) ∼ N

(0,(XTΣ−1X

)−1)

⇒[(

XTΣ−1X)T/2

]−1

XTΣ−1 (y − Xβ) ∼ N (0, I)

and therefore, by Slutsky’s theorem, we have that

np1+1C1/20

(XTΣ−1X

)−1XTΣ−1 (y − Xβ)

D−→ N (0, I)

⇒ np1+1

(XTΣ−1X

)−1XTΣ−1 (y − Xβ)

D−→ N (0, C−10 ).

We will show that np1+1

(β (σ) − β (σ)

)P−→ 0 and by an application of

Slutsky’s theorem will conclude that np1+1

(β (σ) − β

) D−→ N(0, C−1

0

). Let

Σ, DA, andD, denote the estimates of Σ, DA, and D, corresponding to σ. We

first show that

(1/νp1+1)XT(Σ−1 − Σ

−1)

XP−→ 0. (3.2.4)

In order to do that we need

Lemma 3.2.1 Let A be an a × a symmetric matrix, then

maxij

|Aij | ≤ max1≤k≤a

|λk (A)| .

53

Proof: Letting ξi denote the ith canonical basis vector, we have, using the

Cauchy-Schwartz inequality,

|Aij | =∣∣∣ξT

i Aξj

∣∣∣ ≤ (ξT

i A2ξi

)1/2 ≤ max1≤k≤a

|λk (A)| .

As a result of Lemma 3.2.1, to show (3.2.4), it suffices to show that

max1≤k≤p0

∣∣∣∣λk

((1/νp1+1)X

T(Σ−1 − Σ

−1)

X)∣∣∣∣ P−→ 0.

But using elementary results on eigenvalues of symmetric matrices we have

maxk

∣∣∣∣λk

((1/νp1+1)X

T(Σ−1 − Σ

−1)

X)∣∣∣∣ (3.2.5)

= maxk

∣∣∣∣λk

((1/νp1+1)X

T Σ−1 (

Σ −Σ)Σ−1X

)∣∣∣∣≤ λmax

(XTΣ−1X

νp1+1

)max

k

∣∣∣∣λk

−1 (Σ − Σ

))∣∣∣∣ .Now

maxk

∣∣∣∣λk

−1 (Σ − Σ

))∣∣∣∣ (3.2.6)

= sup‖ξ‖=1

∣∣∣ξT(Σ − Σ

)ξ∣∣∣

ξT Σξ≤ sup

‖ξ‖=1,ZT ξ�=0

∣∣∣ξT Z(DA − DA

)ZT ξ

∣∣∣ξT ZDAZT ξ

+|σ2 − σ2|

σ2

≤ maxk

∣∣∣λk

(D − D

)∣∣∣λmin(D)

+|σ2 − σ2|

σ2.

By assumption, σP−→ σ. Therefore D

P−→ D and σ2 P−→ σ2. By the continuity

of the minimum and maximum eigenvalues and Slutsky’s theorem, the last

54

bound of (3.2.6) converges in probability to zero. By Assumption 3.1.7

λmax(XTΣ−1X/νp1+1) → λmax(C0)

and it follows that

maxk

∣∣∣∣λk

((1/νp1+1)X

T(Σ−1 − Σ

−1)

X)∣∣∣∣ P−→ 0.

As a consequence of (3.2.4) we have that

∥∥∥∥∥∥XT Σ

−1X

νp1+1− C0

∥∥∥∥∥∥ ≤∥∥∥∥∥∥∥∥XT

−1 −Σ−1)

X

νp1+1

∥∥∥∥∥∥∥∥ +

∥∥∥∥∥XTΣ−1X

νp1+1− C0

∥∥∥∥∥ P−→ 0

so that XT Σ−1

X/νp1+1P−→ C0 and by the continuity of the inverse matrix

νp1+1

(XT Σ

−1X)−1

P−→ C−10 . Therefore by Assumption 3.1.7

XTΣ−1X(XT Σ

−1X)−1

= (3.2.7)

(XTΣ−1X/νp1+1

) [νp1+1

(XT Σ

−1X)−1

]P−→ C0C

−10 = I.

Then by Slutsky’s theorem we have that

np1+1

(XT Σ

−1X)−1

XTΣ−1 (y − Xβ)D−→ N (0, C−1

0 ). (3.2.8)

To complete the proof of the theorem we just need to show that

np1+1

(XT Σ

−1X)−1

XT(Σ−1 − Σ

−1)

(y − Xβ)P−→ 0.

55

By applying the Cauchy-Schwartz inequality we get

np1+1ξTi

(XT Σ

−1X)−1

XT(Σ−1 − Σ

−1)

(y − Xβ) (3.2.9)

≤[νp1+1ξ

Ti

(XT Σ

−1X)−1

ξi

]1/2 [(y − Xβ)T Σ−1

(Σ −Σ

−1

×X(XT Σ

−1X)−1

XT Σ−1 (

Σ − Σ)Σ−1 (y − Xβ)

]1/2

Since

νp1+1ξTi

(XT Σ

−1X)−1

ξiP−→ ξT

i C−10 ξi ≤ λmax(C

−10 ) (3.2.10)

the first term on the right hand side of (3.2.9) is bounded in probability by

λ1/2max(C

−10 ) as n → ∞. Now noting that

(Σ1/2

)−1(y − Xβ) ∼ N (0, I) and

rank

((Σ1/2

)−T (Σ −Σ

−1X(XT Σ

−1X)−1

×XT Σ−1 (

Σ −Σ) (

Σ1/2)−1)≤ p0

we get using the same reasoning as in Lemma A.9 in Appendix A

(y − Xβ)T Σ−1(Σ −Σ

−1X(XT Σ

−1X)−1

(3.2.11)

× XT Σ−1 (

Σ − Σ)Σ−1 (y − Xβ)

≤ λmax

((Σ

1/2)−T

X(XT Σ

−1X)−1

XT(Σ

1/2)−1

)

× maxk

∣∣∣∣λk

−1 (Σ − Σ

))∣∣∣∣maxk

∣∣∣λk

(Σ−1

(Σ −Σ

))∣∣∣ ‖wn,p0‖2

56

where ‖wn,p0‖2 ∼ χ2p0

, ∀n. Now note that

λmax

((Σ

1/2)−T

X(XT Σ

−1X)−1

XT(Σ

1/2)−1

)≤ (3.2.12)

trace

((Σ

1/2)−T

X(XT Σ

−1X)−1

XT(Σ

1/2)−1

)= p0

and

maxk

∣∣∣∣λk

−1 (Σ − Σ

))∣∣∣∣ ≤ (3.2.13)

maxk

∣∣∣λk

(D − D

)∣∣∣λmin(D)

+|σ2 − σ2|

σ2

P−→ 0

maxk

∣∣∣λk

(Σ−1

(Σ − Σ

))∣∣∣ ≤maxk

∣∣∣λk

(D − D

)∣∣∣λmin(D)

+|σ2 − σ2|

σ2

P−→ 0.

Combining (3.2.9), (3.2.10), (3.2.12), (3.2.13), and Lemma 3.2.1 gives

np1+1

(XT Σ

−1X)−1

XT(Σ−1 − Σ

−1)

(y − Xβ)P−→ 0.

It then follows from Slutsky’s theorem that

np1+1

(β (σ) − β

)(3.2.14)

= np1+1

(XT Σ

−1X)−1

XTΣ−1 (y − Xβ)

− np1+1

(XT Σ

−1X)−1

XT(Σ−1 − Σ

−1)

(y − Xβ)D−→ N (0, C−1

0 ),

as we wanted to show.

57

3.3 Parametrized and/or Structured σ

In this section we consider the asymptotic behavior of the (restricted) maximum

likelihood estimates of the variance-covariance components under reparametriza-

tion (Lindstrom and Bates, 1988) and/or structuring (Jennrich and Schluchter,

1986) of σ. More specifically, we consider the case where σ = f (α), with α of

dimension pα less than or equal to p1 +1. We show that for a large class of well

behaved f , the (restricted) maximum likelihood estimators of α are consistent

and asymptotically normal.

We start by establishing some assumptions about f and α.

Assumption 3.3.1 Let σi, i = 0, . . . , r denote the subset of the parameters

in σ that define the scaled variance-covariance matrix Di of the random effects

belonging to the ith random effects class, with the convention that σ0 = σ2. Then

the parameter vector α and the vector function f can be decomposed into r + 1

disjoint subsets α0, . . . , αr, f0, . . . , f r in such a way that σi = f i (αi) , i =

0, . . . , r.

In other words, we assume that σ2, D1, . . . , Dr are each defined by disjoint

subsets of the parameters in α.

Assumption 3.3.2 f i is of class C2, i = 0, . . . , r, i.e. f i is twice differentiable

with continuous second derivatives.

In the proof of the main asymptotic theorem of this subsection we just need that

the second derivatives do not explode in a small neighborhood of the true pa-

rameter vector α0. Requiring continuity of these derivatives is just a convenient

way of controlling their behavior in a neighborhood of α0.

58

Assumption 3.3.3 f is one-to-one, i.e. α �= α′ ⇒ f (α) �= f (α′).

Assumption 3.3.3 is needed to ensure that α is identifiable.

We also need an assumption regarding the limit behavior of νi/ml(i), i =

1, . . . , p1, where l(i) denotes the random effect class to which the ith variance-

covariance component corresponds.

Assumption 3.3.4 limn→∞ νi/ml(i) = si, i = 1, . . . , p1 exists and is positive.

As observed in section 3.1, ml(i) ≤ νi ≤ 2ml(i), i = 1, . . . , p1, and hence νi and

ml(i) are of the same order of magnitude. Assumption 3.3.2 is simply stating

that their ratio tends to a limit.

Now note that, by the chain rule,

∂� (β, α | y)

∂αi

=p1∑

k=0

∂� (β, σ | y)

∂σk

∂σk

∂αi

(3.3.1)

∂2� (β, α | y)

∂αi∂αj=

p1∑k,l=0

∂2� (β, σ | y)

∂σk∂σl

∂σk

∂αi

∂σl

∂αj+

p1∑k=0

∂� (β, σ | y)

∂σk

∂2σk

∂αi∂αj

∂2� (β, α | y)

∂αi∂βj=

p1∑k=0

� (β, σ | y)

∂σk∂βj

∂σk

∂αi

where σ is taken as a function of α, so that for example ∂σk/∂αi should be

understood as ∂fk(α)/∂αi, ∂� (β, σ) /∂σ = ∂� (β, σ) /∂σ|σ=f(α), and so on.

Now let ∇f and Hf denote respectively the (p1 + 1) × pα gradient matrix

of f and the pα × pα × (p1 + 1) Hessian array of f , defined as

[∇f ]ij =∂fi (α)

αj, i = 1, . . . , p1 + 1, j = 1, . . . , pα,

[Hf ]ijk =∂2fk (α)

∂αi∂αj, i, j = 1, . . . , pα, k = 1, . . . , p1 + 1.

59

Note that by Assumption 3.3.1 [Hf ]ijk = 0 whenever l(i) �= l(j). We can

rewrite (3.3.1) in matrix form as

∂� (β, α | y)

∂α= ∇T

f

∂� (β, σ | y)

∂σ∂2� (β, α | y)

∂α∂αT= ∇T

f

∂2� (β, σ | y)

∂σ∂σT∇f + Hf

∂� (β, σ | y)

∂σ∂2� (β, α | y)

∂α∂βT = ∇Tf

∂2� (β, σ | y)

∂σ∂βT .

Now note that

2∂� (β, σ | y)

∂σi= trace

(Σ−1Gi

)− (y − Xβ)T Σ−1GiΣ

−1 (y − Xβ)

and as E((y − Xβ)T Σ−1GiΣ

−1 (y − Xβ))

= trace(Σ−1Gi

)it follows that

E (∂� (β, σ | y) /∂σ) = 0. Hence we get

E

(∂2� (β, α)

∂α∂αT

)= ∇T

f

∂2� (β, σ | y)

∂σ∂σT∇f . (3.3.2)

Note also that

E

(∂2� (β, α | y)

∂α∂βT

)= ∇T

fE

(∂2� (β, σ | y)

∂σ∂βT

)= 0. (3.3.3)

The parameter space Θf of the parametrized/structured model is

Θf ={θf ∈ �p0+pα | θ =

(βT , αT

)T, β ∈ �p0; α ∈ �pα such that

σ0 (α) > 0 and Di (α) is positive semi-definite, i = 1, . . . , r} .

60

Note that if we define the augmented function

f a (θ) =(βT , (f (α))T

)T

we then have

fa (Θf) ⊂ Θ (3.3.4)

where Θ denotes the parameter space of the linear mixed effects model (2.1.1).

Now let

νfi = ml(i), nf

i =√

νfi , i = 1, . . . , pα

νf0 = ν0, nf

0 = n0.

Define nf = diag(nf0 , nf

1 , . . . , nfpα

) and s = diag(1, s1, . . . , sp1). Then we have

Theorem 3.3.1 Under Assumptions 3.1.7 and 3.3.1 through 3.3.4

n−1f

∂2� (β, α | y)

∂α∂αTn−1

fP−→ ∇T

fs1/2C1s1/2∇f

def= Cf

1 , (3.3.5)

(np1+1nf)−1∂2� (β, α | y)

∂α∂βT

P−→ 0.

Proof: Consider initially the first limit in (3.3.5). By (3.3.2) and Assump-

tions 3.1.7, 3.3.1, and 3.3.4 we have that

1

nfi nf

j

E

(∂2� (β, α | y)

∂αi∂αj

)

=∑

p:ml(p)=νfi

∑q:ml(q)=νf

j

npnq

nfi nf

j

∂σp

∂αi

∂σq

∂αj

[1

npnqE

(∂� (β, σ | y)

∂σp∂σq

)]

→ ∑p:ml(p)=νf

i

∑q:ml(q)=νf

j

s1/2p s1/2

q

∂σp

∂αi

∂σq

∂αj

[C1]pq =[Cf

1

]ij

.

61

Hence E(n−1

f

(∂2� (β, α | y) /∂α∂αT

)n−1

f

)→ Cf

1 . Now since

∥∥∥∥∥n−1f

∂2� (β, α | y)

∂α∂αTn−1

f − Cf1

∥∥∥∥∥≤

∥∥∥∥∥n−1f

(∂2� (β, α | y)

∂α∂αT− E

(∂2� (β, α | y)

∂α∂αT

))n−1

f

∥∥∥∥∥+

∥∥∥∥∥E(n−1

f

∂2� (β, α | y)

∂α∂αTn−1

f

)− Cf

1

∥∥∥∥∥it suffices to show that

n−1f

(∂2� (β, α | y) /∂α∂αT − E

(∂2� (β, α | y) /∂α∂αT

))n−1

fP−→ 0.

Now note that

1

nfi nf

j

(∂2� (β, α | y)

∂αi∂αj− E

(∂2� (β, α | y)

∂αi∂αj

))

=∑

p:ml(p)=νfi

∑q:ml(q)=νf

j

npnq

nfi nf

j

∂σp

∂αi

∂σq

∂αj

[1

npnq

(∂2� (β, σ | y)

∂σp∂σq

−E

(∂2� (β, σ | y)

∂σp∂σq

))]P−→ 0

since npnq/nfi njf → s1/2

p s1/2q and by Tchebychev’s inequality and Lemma A.5

P

(1

npnq

∣∣∣∣∣(

∂2� (β, σ | y)

∂σp∂σq

− E

(∂2� (β, σ | y)

∂σp∂σq

))∣∣∣∣∣ > ε

)

≤ 1

ε2n2pn

2q

Var

(∂2� (β, σ | y)

∂σp∂σq

)=

2

ε2n2pn

2q

trace(Σ−1GpΣ

−1Gq

)2

≤ 2 max(n2p, n

2q)

ε2n2pn

2q

[max

k

∣∣∣λk

(Σ−1Gp

)∣∣∣ ∣∣∣λk

(Σ−1Gq

)∣∣∣]2 ≤ 32

ε2 min(n2p, n

2q)δ

40

→0.

Consider now the second limit in (3.3.5). Since E (∂2� (β, α | y) /∂αi∂β) = 0,

62

by Tchebychev’s inequality we just need to show that

(1/np1+1nfi )2trace

(Var

(∂2� (β, α | y) /∂αi∂β

))→ 0.

Now note that

−∂2� (β, α | y)

∂αi∂β=

∑p:ml(p)=νf

i

∂σp

∂αi

XTΣ−1GpΣ−1 (y − Xβ)

= XTΣ−1Gfi Σ

−1 (y − Xβ)

where Gfi =

∑p:ml(p)=νf

i

(∂σp/∂αi)Gp. Let Mf (α) = maxij |∂σj/∂αi|. Note that

by Assumption 3.3.3 Mf (α) > 0. Using Lemma A.3 and Assumption 3.1.7 we

get

1(np1+1n

fi

)2 trace

(Var

(∂2� (β, α | y)

∂αi∂β

))(3.3.6)

=1(

np1+1nfi

)2 trace(XTΣ−1Gf

i Σ−1Gf

i Σ−1X

)

≤ p0(np1+1n

fi

)2λmax

(XTΣ−1X

)λmax

(Σ−1Gf

i

)2

≤ 4p1M2f (α)

(nfi δ0)2

λmax

(XTΣ−1X

νp1+1

)

and the last term on the right hand side of (3.3.6) converges to zero as n → ∞,

since λmax

(XTΣ−1X/νp1+1

)→ λmax (C0) and 1/(nf

i )2 → 0.

We can now state and prove the main asymptotic theorem of this section.

Theorem 3.3.2 Under Assumptions 2.1.1, 2.1.2, 3.1.1 through 3.1.7, and 3.3.1

through 3.3.4, and letting θf0 be an interior point of Θf representing the true

63

parameter vector and Jf =

C0 0

0 Cf1

, there exists a sequence of estimates

θf

n =(β

T

n , αTn

)T

with the following properties.

1. Given ε > 0, ∃δ = δ(ε), 0 < δ < ∞ and n0 = n0(ε) such that ∀n > n0

Pθf0

∂�(θf)

∂θf

∣∣∣∣∣∣θf =θ

f

n

= 0; ‖βn − β0‖ <δ

np+1

and

| αni − α0i |< δ

nfi

, i = 1, . . . , pα

)≥ 1 − ε.

2. The (p0 + pα)-dimensional vector with the first p0 components given by

np1+1

(βn − β0

)and the last pα components given by nf

i (αni − α0i) , i =

1, . . . , pα converges in distribution to a N(0, J−1

f

).

Proof: The proof will consist in verifying that the maximum likelihood es-

timates of θf in the parametrized/structured model satisfy the conditions of

Theorem 3.1.1. Note that the first condition was proven in Theorem 3.3.1.

Therefore we just need to show that the second condition holds.

Let g be as defined in the proof of Theorem 3.1.2 and

gfi = gf

i (θf0 ) =

g/√

2p1Mf (α0) , if i = 1, . . . , pα

g, if i = pα + 1

Note that gfi → ∞ and gf

i /nfi → 0, i = 1, . . . , pα + 1. Also let

Nfn (θf

0 ) ={θf ∈ Θf | |θf

i − θf0i| ≤ gf

k(i)/nfk(i), i = 1, . . . , p0 + pα

}

64

where

k(i) =

pα + 1, if 1 ≤ i ≤ p0

i − p0, otherwise

with the convention that nfpα+1 = np1+1. Then the second condition of Theo-

rem 3.1.1 is that for i, j = 1, . . . , p0 + pα and ∀θf2 ∈ Nf

n (θf0 )

supθf

1∈Nfn (θf

0 )

gfk(i)g

fk(j) (3.3.7)

×∣∣∣∣∣∣∣−(1/nf

k(i)nfk(j))

∂2�(θf | y

)∂θf

i ∂θfj

∣∣∣∣∣∣θf1

−[Jf

(θf

0

)]ij

∣∣∣∣∣∣∣P

θf2−→ 0.

In the remainder of this section we will adopt the shorthand notation θi =

f a

(θf

i

). The following lemma will be used in the proof of Theorem 3.3.2.

Lemma 3.3.1 f a

(Nf

n

(θf

0

))⊂ Nn (θ0) .

Proof: Let θf ∈ Nfn

(θf

0

)and θ = f a

(θf), then for i = 1, . . . , p0 we have

|θi − θ0i| =∣∣∣θf

i − θf0i

∣∣∣ ≤ gfk(i)

nfk(i)

=g

np1+1

.

Now take i = p0 + 1, . . . , p0 + p1 + 1 and let l(i) denote the random effect

class to which θi refers. By the mean value theorem we get

|θi − θ0i| ≤∥∥∥f l(i)

(αl(i)

)− f l(i)

(α0l(i)

)∥∥∥ ≤ Mf

(αf

0

)‖α − α0‖

≤ Mf (α0) p1gfi

nfi

≤ g

ni

and therefore by definition θ ∈ Nn (θ0).

65

Note that by Lemma 3.3.1

supθf∈Nf

n (θf0 )

∣∣∣h (fa

(θf))∣∣∣ ≤ sup

θ∈Nn(θ0)|h (θ)|

for any real function h.

For the ∂2� (β, α | y) /∂β∂βT derivatives condition (3.3.7) is identical to the

equivalent one in Theorem 3.1.2 and the proof given there also applies here.

Consider now the ∂2� (β, α | y) /∂αi∂βj derivatives. Since the corresponding

entries in the Jf matrix are 0, we just need to show that

supθf

1∈Nfn (θf

0 )

gfi gf

pα+1

∣∣∣∣∣∣∣−(1/nfi nf

pα+1)∂2�

(θf | y

)∂αi∂βj

∣∣∣∣∣∣θf

1

∣∣∣∣∣∣∣P

θf2−→ 0.

By the continuity of ∂f/∂α we have that ∃ε = ε(α0) > 0 such that ‖α−α0‖ <

ε ⇒ maxij |∂fi (α) /∂αj | < 2Mf (α0). From (3.3.1) we get that for sufficiently

large n (such that θf ∈ Nfn

(θf

0

)⇒ ‖α − α0‖ < ε)

supθf

1∈Nfn (θf

0 )

gfi gf

pα+1

∣∣∣∣∣∣∣−(1/nfi nf

pα+1)∂2�

(θf | y

)∂αi∂βj

∣∣∣∣∣∣θf

1

∣∣∣∣∣∣∣≤ (1/

√2p1)

∑p:ml(p)=νf

i

(np/nfi ) sup

θ1∈Nn(θ0)g2

∣∣∣∣∣−(1/npnp1+1)∂2� (θ | y)

∂σi∂βj

∣∣∣∣∣θ1

∣∣∣∣∣P

θf2−→ 0

since np/nfi → s1/2

p and by Theorem 3.1.2

supθ1∈Nn(θ0)

g2

∣∣∣∣∣−(1/npnp1+1)∂2� (θ | y)

∂σi∂βj

∣∣∣∣∣θ1

∣∣∣∣∣P

θf2−→ 0.

Consider now the ∂2� (β, α | y) /∂αi∂αj terms. Note that letting ∇f ,i(α)

66

denote the ith column of the gradient matrix ∇f evaluated at α we get

∣∣∣∣∣∣∣−(1/nfi nf

j )∂2�

(θf | y

)∂αi∂αj

∣∣∣∣∣∣θf

1

−[Cf

1

]ij

∣∣∣∣∣∣∣ (3.3.8)

≤∣∣∣∣∣∣−(1/nf

i nfj )∇T

f,i(α1)∂2� (θ | y)

∂σ∂σT

∣∣∣∣∣fa(θf

1 )

∇f ,j(α1)

−∇Tf ,i(α0)s

1/2C1s1/2∇f,j(α0)

∣∣∣+

∣∣∣∣∣∣∣(1/nfi nf

j )p1∑

k=0

∂�(θf | y

)∂σk

∣∣∣∣∣∣σ=f(α1)

∂2σk

∂αi∂αj

∣∣∣∣∣α1

∣∣∣∣∣∣∣ .Now note that

∣∣∣∇Tf ,i(α)s1/2C1s

1/2∇f ,j(α) − ∇Tf ,i(α0)s

1/2C1s1/2∇f,j(α0)

∣∣∣ (3.3.9)

≤∣∣∣(∇f,i(α) − ∇f ,i(α0))

T s1/2C1s1/2∇f ,j(α)

∣∣∣+∣∣∣∇T

f ,i(α0)s1/2C1s

1/2 (∇f ,j (α) − ∇f ,j (α0))∣∣∣ .

Let Qf (α0) = max(maxijk

∣∣∣∂2fk/∂αi∂αj |α0

∣∣∣ , 1). By Assumption 3.3.2,

∃ε = ε(α0) > 0 such that ‖α − α0‖ < ε ⇒ maxijk

∣∣∣∂2fk/∂αi∂αj |α∣∣∣ < Qf(α0).

By the continuity of ∇f and the mean value theorem we have that, for suffi-

ciently large n we have that ∀θf ∈ Nfn

(θf

0

)∣∣∣[∇f ,i (α)]k − [∇f ,i (α0)]k

∣∣∣ ≤ Qf (α0) ‖α − α0‖ ≤ p1Qf (α0) gfi

nfi

. (3.3.10)

Therefore we have for sufficiently large n

gfi gf

j supθ1∈Nf

n (θf0 )

∣∣∣∇Tf ,i(α1)s

1/2C1s1/2∇f,j(α1)

67

−∇Tf,i(α0)s

1/2C1s1/2∇f ,j(α0)

∣∣∣≤ p1g

fi gf

j Mf (α0) Qf (α0)1T s1/2C1s

1/21

gfi

nfi

+gf

j

nfj

≤ Qf (α0)1

T s1/2C1s1/21

M2f (α0) p2

1g

def=

κ(α0)

g

where 1 denote the constant vector of ones. Hence for sufficiently large n, we

have

∣∣∣∣∣∣−(1/nfi nf

j )∇Tf ,i(α1)

∂2� (θ | y)

∂σ∂σT

∣∣∣∣∣fa(θf

1 )

∇f,j(α1)

−∇Tf ,i(α0)s

1/2C1s1/2∇f ,j(α0)

∣∣∣≤

∣∣∣∣∣∣∇Tf ,i(α1)

−(1/nfi nf

j )∂2� (θ | y)

∂σ∂σT

∣∣∣∣∣fa(θf

1 )

− s1/2C1s1/2

∇f ,j(α1)

∣∣∣∣∣∣+

2Qf (α0)1T s1/2C1s

1/21

g3

≤ 2Mf (α0)∑

p:ml(p)=νfi

∑q:ml(q)=νf

j

∣∣∣∣∣∣−(1/nfi nf

j )∂� (θ | y)

∂σp∂σq

∣∣∣∣∣fa(θf

1 )

− s1/2p s1/2

q [C1]pq

∣∣∣∣∣∣+

2Qf (α0)1T s1/2C1s

1/21

g3.

By Assumption 3.3.4 we may replace s1/2p s1/2

q by npnq/nfi nf

j without altering

the limit values. It then follows that for large enough n

supθf1 ∈Nf

n (θf0 )

gfi gf

j

∣∣∣∣∣∣−(1/nfi nf

j )∇Tf,i(α1)

∂2� (θ | y)

∂σ∂σT

∣∣∣∣∣fa(θf

1 )

∇f ,j(α1) (3.3.11)

−∇Tf ,i(α0)s

1/2C1s1/2∇f,j(α0)

∣∣∣≤ 2Mf (α0)

∑p:ml(p)=νf

i

∑q:ml(q)=νf

j

(npnq/nfi nf

j )g2

68

× supθ1∈Nn(θ0)

∣∣∣∣∣−(1/npnq)∂� (θ | y)

∂σp∂σq

∣∣∣∣∣θ1

− [C1]pq

∣∣∣∣∣+ κ(α0)

g

f2−→ 0.

since npnq/nfi nf

j → s1/2p s1/2

q and by Theorem 3.1.2 and (3.3.4)

g2 supθ1∈Nn(θ0)

∣∣∣∣∣−(1/npnq)∂� (θ | y)

∂σp∂σq

∣∣∣∣∣θ1

− [C1]pq

∣∣∣∣∣ Pθ2−→ 0.

Finally we note that the second term of the sum on the right hand side of (3.3.8)

is zero whenever nfi �= nf

j , so we can restrict ourselves to the case when they

are equal. We have

gfi gf

j supθf1∈Nn(θf

0 )

∣∣∣∣∣∣∣(1/νfi )

∑p:ml(p)=νf

i

∂� (θ | y)

∂σp

∣∣∣∣∣σ=f(α1)

∂2σp

∂αi∂αj

∣∣∣∣∣α1

∣∣∣∣∣∣∣ ≤Qf(α0)

∑p:ml(p)=νf

i

(νp/νfi )g2 sup

θ1∈Nn(θ0)

∣∣∣∣∣(1/νp)∂� (θ | y)

∂σp

∣∣∣∣∣θ1

∣∣∣∣∣and since νp/ν

fi → sp it suffices to show that

g2 supθ1∈Nn(θ0)

∣∣∣∣∣(1/νi)∂� (θ | y)

∂σi

∣∣∣∣∣θ1

∣∣∣∣∣ Pθ2−→ 0, i = 0, . . . , p1.

Now note that

2

∣∣∣∣∣ ∂� (θ | y)

∂σi

∣∣∣∣∣θ1

∣∣∣∣∣ (3.3.12)

=∣∣∣trace

(Σ−1

1 Gi

)− (y − Xβ1)

T Σ−11 GiΣ

−11 (y − Xβ1)

∣∣∣≤

∣∣∣trace((

Σ−11 − Σ−1

2

)Gi

)∣∣∣+∣∣∣trace

(Σ−1

2 Gi

)− (y − Xβ2)

T Σ−12 GiΣ

−12 (y − Xβ2)

∣∣∣+∣∣∣(y − Xβ1)

T Σ−11 GiΣ

−11 (y − Xβ1)

69

− (y − Xβ2)T Σ−1

2 GiΣ−12 (y − Xβ2)

∣∣∣ .Consider initially the first term on the right hand side of (3.3.12). By Lem-

mas A.5 and A.8 we have that for sufficiently large n

(g2/νi)∣∣∣trace

((Σ−1

1 − Σ−12

)Gi

)∣∣∣ (3.3.13)

≤ g2 maxk

∣∣∣λk

((Σ−1

1 − Σ−12

)Gi

)∣∣∣≤ g2 max

k

∣∣∣λk

(Σ−1

2 (Σ2 − Σ1))∣∣∣max

k

∣∣∣λk

(Σ−1

1 Gi

)∣∣∣≤ 16

δ0g

q

λmin

(D0) +

1

σ20

→ 0.

Note that since the bound on (3.3.13) does not depend on θ1, we also have

(g2/νi) supθ1∈Nn(θ0)

∣∣∣trace((

Σ−11 −Σ−1

2

)Gi

)∣∣∣→ 0.

Next consider the second term on the right hand side of the inequality in

(3.3.12) and note that this term does not depend on θ1. By Tchebychev’s

inequality we just need to show that the variance of that term goes to zero with

n. Now by Lemma A.5

(g4/ν2i )var

((y − Xβ2)

T Σ−12 GiΣ

−12 (y − Xβ2)

)= (2g4/ν2

i )trace(Σ−1

2 Gi

)2 ≤ (2/g4) maxk

(λk

(Σ−1

2 Gi

))2 ≤ 32

δ20g

4→ 0.

Finally consider the last term on the right hand side of the inequality in

(3.3.12). We note that

∣∣∣(y − Xβ1)T Σ−1

1 GiΣ−11 (y − Xβ1) (3.3.14)

70

− (y − Xβ2)T Σ−1

2 GiΣ−12 (y − Xβ2)

∣∣∣≤

∣∣∣(y − Xβ2)T(Σ−1

1 GiΣ−11 − Σ−1

2 GiΣ−12

)(y − Xβ2)

∣∣∣+ 2

∣∣∣(y − Xβ2)T Σ−1

1 GiΣ−11 X (β2 − β1)

∣∣∣+∣∣∣(β2 − β1)

T XTΣ−11 GiΣ

−11 X (β2 − β1)

∣∣∣ .Consider the first term on the right hand side of (3.3.14). Note that

Σ−11 GiΣ

−11 −Σ−1

2 GiΣ−12 =

(Σ−1

1 −Σ−12

)GiΣ

−11 + Σ−1

2 Gi

(Σ−1

1 − Σ−12

).

Let j(i) and k(i) denote the random effects to which σi corresponds within the

associated random effects class l(i). By the Cauchy-Schwartz inequality we have

that for any u, v ∈ �n

∣∣∣uT Giv∣∣∣ ≤ ∥∥∥∥(U l(i)

j(i)

)Tu

∥∥∥∥ ∥∥∥∥(U l(i)k(i)

)Tv

∥∥∥∥+

∥∥∥∥(U l(i)j(i)

)Tv

∥∥∥∥ ∥∥∥∥(U l(i)k(i)

)Tu

∥∥∥∥ . (3.3.15)

Using (3.3.15) and Cauchy-Schwartz once again gives

∣∣∣(y − Xβ2)T(Σ−1

1 − Σ−12

)GiΣ

−11 (y − Xβ2)

∣∣∣≤

[(y − Xβ2)

T Σ−12 (Σ2 − Σ1)Σ−1

1 Ul(i)j(i)

(U

l(i)j(i)

)T

×Σ−11 (Σ2 −Σ1)Σ

−12 (y − Xβ2)

]1/2

×[(y − Xβ2)

T Σ−11 U

l(i)k(i)

(U

l(i)k(i)

)TΣ−1

1 (y − Xβ2)]1/2

+[(y − Xβ2)

T Σ−12 (Σ2 −Σ1)Σ

−11 U

l(i)k(i)

(U

l(i)k(i)

)T

×Σ−11 (Σ2 −Σ1)Σ

−12 (y − Xβ2)

]1/2

×[(y − Xβ2)

T Σ−11 U

l(i)j(i)

(U

l(i)j(i)

)TΣ−1

1 (y − Xβ2)]1/2

.

71

Now note that by Lemmas A.4, A.6, and A.8

g4λmax

([Σ

1/22

]−T(Σ2 −Σ1)Σ

−11 U

l(i)j(i)

(U

l(i)j(i)

)TΣ−1

1 (Σ2 − Σ1)[Σ

1/22

]−1)

≤ 128

δ0g2

q

λmin

(D0) +

1

σ20

2λmax

(D0)

λmin

(D0) + 1

which converges to zero with n. By Lemma A.9 it follows that

g4

νisup

θ1∈Nn(θ0)(y − Xβ2)

T Σ−12 (Σ2 − Σ1)Σ−1

1 Ul(i)j(i)

(U

l(i)j(i)

)T

×Σ−11 (Σ2 −Σ1)Σ

−12 (y − Xβ2)

Pθ2−→ 0.

From Lemmas A.4 and A.6 we also have that for large enough n

λmax

1/22 Σ−1

1 Ul(i)k(i)

(U

l(i)k(i)

)TΣ−1

1

1/22

]T) ≤ 8

δ0

λmax

(D0)

λmin

(D0) + 1

and therefore by Lemma A.9 it follows that

Pθ2

(1/νi sup

θ1∈Nn(θ0)(y − Xβ2)

T Σ−11 U

l(i)k(i)

(U

l(i)k(i)

)T

Σ−11 (y − Xβ2) >

16

δ0

λmax

(D0)

λmin

(D0) + 1

→ 0.

We conclude that

g2

νisup

θ1∈Nn(θ0)

[(y − Xβ2)

T Σ−12 (Σ2 − Σ1)Σ−1

1 Ul(i)j(i)

(U

l(i)j(i)

)T

× Σ−11 (Σ2 −Σ1)Σ

−12 (y − Xβ2)

]1/2

×[(y − Xβ2)

T Σ−11 U

l(i)k(i)

(U

l(i)k(i)

)TΣ−1

1 (y − Xβ2)]1/2 Pθ2−→ 0.

72

Similarly we prove that

g2

νi

supθ1∈Nn(θ0)

[(y − Xβ2)

T Σ−12 (Σ2 −Σ1)Σ

−11 U

l(i)k(i)

(U

l(i)k(i)

)T

× Σ−11 (Σ2 −Σ1)Σ

−12 (y − Xβ2)

]1/2

×[(y − Xβ2)

T Σ−11 U

l(i)j(i)

(U

l(i)j(i)

)TΣ−1

1 (y − Xβ2)]1/2 Pθ2−→ 0

and therefore

g2

νisup

θ1∈Nn(θ0)(y − Xβ2)

T(Σ−1

1 −Σ−12

)GiΣ

−11 (y − Xβ2)

Pθ2−→ 0.

Using the exact same reasoning we show that

g2

νisup

θ1∈Nn(θ0)(y − Xβ2)

T Σ−12 Gi

(Σ−1

1 − Σ−12

)(y − Xβ2)

Pθ2−→ 0.

Consider now the second term on the right hand side of (3.3.14). By Cauchy-

Schwartz we get

∣∣∣(y − Xβ2)T Σ−1

1 GiΣ−11 X (β2 − β1)

∣∣∣≤

[(y − Xβ2)

T Σ−11 GiΣ

−11 GiΣ

−11 (y − Xβ2)

]1/2

×[(β1 − β2)

T XTΣ−11 X (β2 − β1)

]1/2.

Now note that for large enough n

λmax

1/22 Σ−1

1 GiΣ−11 GiΣ

−11

1/22

]T) ≤ 64

δ20

λmax

(D0)

λmin

(D0) + 1

73

and by Lemma A.9 it follows that

Pθ2

(1/νi sup

θ1∈Nn(θ0)(y − Xβ2)

T Σ−11 GiΣ

−11 GiΣ

−11 (y − Xβ2)

>128

δ20

λmax

(D0)

λmin

(D0) + 1

→ 0.

Note also that for sufficiently large n

(g4/νi) supθ1∈Nn(θ0)

(β1 − β2)T XTΣ−1

1 X (β2 − β1)

≤ (g4/νi) supθ1∈Nn(θ0)

(λmax

(XT Σ−1

1 X)‖β2 − β1‖2

)

≤ 4p0g6

νiλmax

(XTΣ−1

0 X

νp1+1

)1 +2

g3

q

λmin

(D0) +

1

σ20

→ 0

since g6/νi → 0 and λmax

(XTΣ−1

0 X/νp1+1

)→ λmax (C0). Therefore we con-

clude that

supθ1∈Nn(θ0)

∣∣∣(y − Xβ2)T Σ−1

1 GiΣ−11 X (β2 − β1)

∣∣∣ Pθ2−→ 0.

Finally consider the last term on the right hand side of (3.3.14). For suffi-

ciently large n we get

(g2/νi) supθ1∈Nn(θ0)

(β1 − β2)T XTΣ−1

1 GiΣ−11 X (β2 − β1)

≤ 4g2

δ0νi

supθ1∈Nn(θ0)

(λmax

(XTΣ−1

1 X)‖β2 − β1‖2

)

≤ 16p0g4

δ0νiλmax

(XTΣ−1

0 X

νp1+1

)1 +2

g3

q

λmin

(D0) +

1

σ20

→ 0

74

since g4/νi → 0 and λmax

(XTΣ−1

0 X/νp1+1

)→ λmax (C0).

Hence we have that

g2 supθ1∈Nn(θ0)

∣∣∣∣∣(1/νi)∂� (θ | y)

∂σi

∣∣∣∣∣θ1

∣∣∣∣∣ Pθ2−→ 0

and this completes the proof of Theorem 3.3.2.

Most of the parametrizations and structures of σ proposed in the literature

satisfy Assumptions 3.3.1 through 3.3.4. For example, all the structured covari-

ance matrices considered in Jennrich and Schluchter (1986) and the parametriza-

tions considered in chapter 6 satisfy these assumptions.

As a final comment, note that the results of Theorem 3.3.2 are easily ex-

tended to restricted maximum likelihood estimation, using the same steps as in

Theorem 3.2.1.

3.4 Conclusions

We have established the asymptotic normality of the (restricted) maximum like-

lihood estimators for the parameters in the model (2.1.1). It is interesting to

interpret the basic Assumptions 3.1.5 to 3.1.7 (3.2.2 for the restricted maximum

likelihood estimators). Assumption 3.1.7 is a typical condition requiring that

the limit of the variance-covariance matrix of the parameter estimates exists.

Assumption 3.1.5 ensures that the number of levels goes to infinity. Note that

we do not require that the number of observations within each level becomes

infinite. This is because we need to estimate the variance-covariance compo-

nents of the random effects to an arbitrary precision but not the random effects

75

themselves. Similarly Assumption 3.1.6 ensures that there are enough observa-

tions over the bare minimum from each level to estimate the fixed effects to an

arbitrary precision.

We have also established the asymptotic normality of (restricted) maximum

likelihood estimators for a large class of reparametrizations/structurings of the

variance-covariance components σ, that includes most cases of practical interest.

The basic condition for the result to hold is that the mapping that defines

the parametrization/structuring be twice differentiable with continuous second

derivatives, a condition commonly observed in practical applications.

Chapter 4

The Nonlinear Mixed Effects

Model

In this chapter we describe a general nonlinear mixed effects model for repeated

measures data and present a real data example of its use. We also include a

brief bibliographic review of nonlinear mixed effects models.

4.1 The Model

The nonlinear mixed effects model used in this dissertation has been suggested

by Lindstrom and Bates (1990) and in its most general form is written as

in (1.3.1). This model formulation allows the use of nested and crossed classi-

fication factors for the clusters, but by far its most common application is for

repeated measures data, which corresponds to a one-way classification scheme.

We will restrict ourselves in this dissertation to this particular application of

model (1.3.1).

The nonlinear mixed effects model for repeated measures can be thought

77

of as a two-stage model that in some ways generalizes both the linear mixed

effects model for repeated measures (Laird and Ware, 1982) and the nonlinear

regression model for independent data (Bates and Watts, 1988). In the first

stage the jth observation on the ith cluster is modeled as

yij = f(φij, xij) + εij , i = 1, . . . , m, j = 1, . . . , ni (4.1.1)

where m is the number of clusters, ni is the number observations on the ith

cluster, f is a general real valued nonlinear function of a cluster-specific param-

eter vector φij and the covariate vector xij , and εij is a normally distributed

error term. In the second stage the cluster-specific parameter vector is modeled

as

φij = Aijβ + Bijbi, bi ∼ N (0, σ2D), (4.1.2)

where β is a p-dimensional vector of fixed effects, bi is a q-dimensional random

effects vector associated with the ith cluster (not varying with j), Aij and Bij

are design matrices for the fixed and random effects respectively, and σ2D is a

general variance-covariance matrix. It is assumed that observations correspond-

ing to different clusters are independent and that the εij are i.i.d. N (0, σ2) and

independent of the bi.

We can write (4.1.1) and (4.1.2) in matrix form as

yi = f i (φi, X i) + εi,

φi = Aiβ + Bibi

78

for i = 1, . . . , m, where

yi = [yi1 · · · yini]T, εi = [εi1 · · · εini

]T,

f i (φi, Xi) =[f(φi1, xi1) · · ·f(φini

, xini)]T

,

X i =[xT

i1 : · · · : xTini

]T, Ai =

[AT

i1 : · · · : ATini

]T,

Bi =[BT

i1 : · · · : BTini

]T.

By letting

y =[yT

1 : · · · : yTm

]T, b =

[bT

1 : · · · : bTm

]T, ε =

[εT

1 : · · · : εTm

]T,

f (φ, X) =[f1(φ1, X1)

T : · · · : fm(φm, Xm)T]T

,

X =[XT

1 : · · · : XTm

]T, A =

[AT

1 : · · · : ATm

]T, B =

m⊕i=1

Bi.

we see that the nonlinear mixed effects model for repeated measures described

here is a particular case of model (1.3.1).

Several different methods for estimating the parameters in the nonlinear

mixed effects model have been proposed. We concentrate here on two of them:

maximum likelihood and restricted maximum likelihood. A rather complex

numerical issue for (restricted) maximum likelihood estimation is the evaluation

of the loglikelihood function of the data, since it involves the evaluation of the

integral

p(y | β, D, σ2) =∫

p(y | b, β, D, σ2) p(b) db (4.1.3)

which in general does not have a closed-form expression when the model function

f is nonlinear in b. Different approximations have been suggested to try to

circumvent this difficulty (Lindstrom and Bates, 1990; Vonesh and Carter, 1992;

79

Davidian and Gallant, 1993). This issue is considered in detail in chapter 5.

4.2 Orange Trees

The orange trees data are presented in Figure 4.2.1 and consist of seven mea-

surements of the trunk circumference (in millimeters) on each of five orange

trees, taken over a period of 1,600 days. These data were originally presented

in Draper and Smith (1981, p. 524) and were also described in Lindstrom and

Bates (1990).

Day

Tre

e ci

rcum

fere

nce

(mm

.)

200 400 600 800 1000 1200 1400 1600

5010

015

020

0

1

1

1

11

1 1

2

2

2

2

2

2 2

3

3

3

33

3 3

4

4

4

4

4

44

5

5

5

5

5

5 5

Figure 4.2.1: Trunk circumference (in millimeters) of five orange trees.

The logistic model y = φ1/ {1 + exp [− (t − φ2) /φ3]} seems to fit the data

well. Lindstrom and Bates (1990) concluded in their analysis that only the

80

asymptotic circumference φ1 needed a random effect to account for tree-to-tree

variation and suggested the following nonlinear mixed effects model

yij =β1 + bi1

1 + exp [− (tij − β2) /β3]+ εij (4.2.1)

where yij represents the jth circumference measurement on the ith tree, tij

represents the day corresponding to the jth measurement on the ith tree, the

bi1, i = 1, . . . , 5 are i.i.d. N (0, σ2D), and the εij , i = 1, . . . , 5, j = 1, . . . , 7 are

i.i.d. N (0, σ2) and independent of the bi1. In this example p = 3, q = 1, m = 5,

ni = 7, i = 1, . . . , 5, X ij = tij , Aij = I, and Bij = (1, 0, 0)T .

4.3 Bibliographic Review

The first developments of nonlinear mixed effects models appear in Sheiner

and Beal (1980). Their model and estimation method are incorporated in the

NONMEM (Beal and Sheiner, 1980) program which is widely used in pharma-

cokinetics. They introduced a model very similar to (4.1.1) and developed a

maximum likelihood estimation method that was based on a first order Tay-

lor expansion of the model function around the expected values of the random

effects, i.e. 0. The expansion around the current conditional modes of the ran-

dom effects, as done in Lindstrom and Bates (1990), seems to give better results

(Wolf, 1986).

A nonparametric maximum likelihood method for nonlinear mixed effects

models was proposed by Mallet, Mentre, Steimer and Lokiek (1988). They use

a model similar to (4.1.1), but make no assumptions about the distribution of

the random effects, except that it is a probability measure. The conditional

81

distribution of the yij given the random effects is assumed to be known. The

objective of the estimation procedure is to get the probability distribution of the

cluster-specific effects (φij) that maximizes the likelihood of the data. Mallet

(1986) proved that the maximum likelihood solution is a discrete distribution

with the number of discontinuity points less or equal to the number of clusters

in the sample. Inference is based on the maximum likelihood distribution from

which summary statistics (e.g. means and variance-covariance matrices) and

plots are obtained.

Davidian and Gallant (1992) introduce a smooth nonparametric maximum

likelihood estimation method for nonlinear mixed effects. Their model is again

very similar to (4.1.1), but with a more general definition for the cluster-specific

effects – φij = g(β, bi, xij), where g is a generic function. As in Mallet et al.

(1988), Davidian and Gallant assume that the conditional distribution of the

response vector given the random effects is known (up to the parameters that

define it), but the distribution of the random effects is free to vary within a class

of smooth densities H defined in Gallant and Nychka (1987). A density from Hcan be expressed as an infinite linear combination of normal densities. In the

likelihood calculations the summation is truncated to a finite number of terms

and a quadrature approach is used to calculate the integral that defines the

likelihood (4.1.3). This nonparametric approach is implemented in the Nlmix

software, available through StatLib.

A Bayesian approach using hierarchical models for nonlinear mixed effects

is described in Bennett and Wakefield (1993) and Wakefield (1993). The first

stage model is again very similar to (4.1.1). The distributions of both the

random effects and the errors εij are assumed known up to population parame-

ters. Prior distributions for these must also be provided. Markov chain Monte

82

Carlo methods, such as the Gibbs sampler (Geman and Geman, 1984) and the

Metropolis algorithm (Hastings, 1970), are used to obtain the posterior density

of the random effects.

Vonesh and Carter (1992) have developed a mixed effects model that is

nonlinear in the fixed effects, but linear in the random effects. Their model can

be described as

yi = f (β, X i) + Zibi + εi

where β, bi, and εi as before denote respectively the fixed effects, the random

effects, and the error term, X i is a matrix of covariates, and Zi is a full-

rank matrix of known constants. It is further assumed that bi ∼ N (0, D),

εi ∼ N (0, σ2I), and the two vectors are independent. In a certain way Vonesh

and Carter incorporate in the model the approximations suggested by Sheiner

and Beal (1980) and Lindstrom and Bates (1990). They propose an estimated

generalized least squares (EGLS) procedure to estimate the model parameters.

In the first stage estimates of the fixed effects are obtained through ordinary

nonlinear least squares. The residuals from that fit are used to estimate the

variance-covariance matrix of the random effects and that in turn is used in

a weighted nonlinear least squares algorithm to get the final estimates of the

fixed effects. Strong consistency and asymptotic normality of the fixed effects

estimators are proven in the paper. Vonesh and Carter’s approach concentrates

more on inferences on the fixed effects, and less on the variance-covariance

components of the random effects.

Chapter 5

Approximations to the

Loglikelihood in the Nonlinear

Mixed Effects Model

In this chapter we consider the estimation of the parameters in the nonlinear

mixed effects model for repeated measures (4.1.1) by either maximum likeli-

hood, or restricted maximum likelihood, based on the marginal density of y

given in (4.1.3). Different approximations have been proposed for estimating

this likelihood. Some of these methods consist of taking a first order Taylor

expansion of the model function f around the expected value of the random

effects (Sheiner and Beal, 1980; Vonesh and Carter, 1992), or around the con-

ditional (on D) modes of the random effects (Lindstrom and Bates, 1990).

Others have proposed the use of Gaussian quadrature rules (Davidian and Gal-

lant, 1992).

We consider here four different approximations to the loglikelihood (4.1.3):

84

Lindstrom and Bates’ (1990) alternating method, a modified Laplacian approx-

imation (Tierney and Kadane, 1986), importance sampling (Geweke, 1989), and

Gaussian quadrature (Davidian and Gallant, 1992). We compare them based

on their computational and statistical properties, using both real data exam-

ples and simulation results. Section 5.1 contains a description of the different

approximations to the loglikelihood as applied to the nonlinear mixed effects

model (4.1.1). Section 5.2 presents a comparison of the different approximations

based on real and simulated data. Our conclusions are given in section 5.3.

5.1 Approximations to the Loglikelihood

In this section we describe four different approximations to the loglikelihood of

y in the nonlinear mixed effects model (4.1.1). We show that there exists a

close relationship between the Laplacian approximation, importance sampling

and a Gaussian quadrature rule centered around the conditional modes of the

random effects b.

5.1.1 Alternating Approximation

Lindstrom and Bates (1990) propose an alternating algorithm for estimating the

parameters in model (4.1.1). Conditional on the data and the current estimate

of D (the scaled variance-covariance matrix of the random effects), the modes

of the random effects b and the estimates of the fixed effects β are obtained by

minimizing a penalized nonlinear least squares (PNLS) objective function

m∑i=1

(‖yi − f i(β, bi)‖2 + bT

i D−1bi

)(5.1.1)

85

where [f i (β, bi)]j = f(φij, xij

), i = 1, . . . , m, j = 1, . . . , ni.

To update the estimate of D at the wth iteration, Lindstrom and Bates use a

first order Taylor expansion of the model function around the current estimates

of β and the conditional modes of the random effects b, which we will denote

by β(w)

and b(w)

respectively. Letting

Zi =∂f i

∂bTi

∣∣∣∣∣β,b

, X i =∂f i

∂βT

∣∣∣∣∣β,b

and

w(w)i = yi − f i(β

(w), b

(w)

i ) + X(w)

i β(w)

+ Z(w)

i b(w)

i ,

the approximate loglikelihood used for the estimation of D is

�A

(β, σ2, D | y

)= −1

2

m∑i=1

{log

∣∣∣∣σ2(I + Z

(w)

i DZ(w)T

i

)∣∣∣∣ (5.1.2)

+ σ−2[w

(w)i − X

(w)

i β]T (

I + Z(w)

i DZ(w)T

i

)−1 [w

(w)i − X

(w)

i β]}

.

This loglikelihood is identical to that of a linear mixed effects (LME) model in

which the response vector is given by w(w) and the fixed and random effects

design matrices are given by X(w)

and Z(w)

. Using (2.2.2), one can express

the optimal values of β and σ2 as functions of D and work with the profile

loglikelihood of D, greatly simplifying the optimization problem. Lindstrom

and Bates (1990) have also proposed an approximate restricted loglikelihood

for the estimation of D

�RA

(β, σ2, D | y

)= (5.1.3)

−1

2

m∑i=1

log∣∣∣∣σ2X

(w)T(I + Z

(w)

i DZ(w)T

i

)X

(w)∣∣∣∣+ �A

(β, σ2, D | y

).

Their estimation algorithm alternates between the PNLS and LME steps

86

until some convergence criterion is met. Such alternating algorithms tend to

be more efficient when the estimates of the variance-covariance components (D

and σ2) are not highly correlated with the estimates of the fixed effects (β).

In chapter 3 we have demonstrated that, in the linear mixed effects model,

the maximum likelihood estimates of D and σ2 are asymptotically independent

of the maximum likelihood estimates of β . These results have not yet been

extended to the nonlinear mixed effects model (4.1.1).

It can be shown that the maximum likelihood estimate of β and the condi-

tional modes of the random effects bi corresponding to the approximate loglike-

lihood (5.1.2) are the values obtained in the first iteration of the Gauss-Newton

algorithm used to minimize the PNLS objective function (5.1.1). Therefore, at

the converged value of D, the estimates of β and bi obtained from the LME

and PNLS steps coincide. We will use �A when comparing the different approx-

imations at the optimal values in section 5.2, but we do note that in Lindstrom

and Bates (1990) approximation (5.1.2) is used only to update the estimates of

D and not for estimating β.

5.1.2 Laplacian Approximation

Laplacian approximations are frequently used in Bayesian inference to estimate

marginal posterior densities and predictive distributions (Tierney and Kadane,

1986; Leonard, Hsu and Tsui, 1989). These techniques can also be used for the

integration considered here.

The integral that we want to estimate for the marginal distribution of yi in

model (4.1.1) can be written as

p(yi | β, D, σ2) =∫ (

2πσ2)−(ni+q)/2 |D|−1/2 exp

[−g(β, D, yi, bi)/2σ2

]dbi,

87

where g(β, D, yi, bi) = ‖yi − f i(β, bi)‖2 + bTi D−1bi.

Let

bi = bi (β, D, yi) = arg minbi

g(β, D, yi, bi)

g′ (β, D, yi, bi) =∂g(β, D, yi, bi)

∂bi

g′′ (β, D, yi, bi) =∂2g(β, D, yi, bi)

∂bi∂bTi

and consider a second order Taylor expansion of g around bi

g (β, D, yi, bi) � (5.1.4)

g(β, D, yi, bi

)+

1

2

[bi − bi

]Tg′′ (β, D, yi, bi

) [bi − bi

]

where the linear term of the approximation vanishes since g′(β, D, yi, bi) = 0.

The Laplacian approximation is defined as

p(y | β, D, σ2

)�(2πσ2

)−N/2 |D|−m/2 exp

[− 1

2σ2

m∑i=1

g(β, D, yi, bi

)]

×∫ (

2πσ2)q/2

exp

{− 1

2σ2

m∑i=1

[bi − bi

]Tg′′ (β, D, yi, bi

) [bi − bi

]}dbi

=(2πσ2

)−N/2 |D|−m/2m∏

i=1

∣∣∣g′′ (β, D, yi, bi

)∣∣∣−1/2exp

[−g(β, D, yi, bi

)/2σ2

]

where N =∑m

i=1 ni.

Now we consider an approximation to g′′ similar to the one used in Gauss-

Newton optimization. We have

g′′ (β, D, yi, bi

)=

88

∂2f(β, bi)

∂bi∂bTi

∣∣∣∣∣bi=bi

[yi − f(β, bi)

]+

∂f (β, bi)

∂bTi

∣∣∣∣∣bi=bi

∂f (β, bi)

∂bi

∣∣∣∣∣bi=bi

+ D−1.

At bi, the contribution of ∂2f(β, bi)/∂bi∂bTi

∣∣∣bi=bi

[yi − f (β, bi)

]is usually neg-

ligible compared to that of ∂f (β, bi)/∂bTi

∣∣∣bi=bi

∂f (β, bi)/∂bi|bi=bi(Bates and

Watts, 1980). Therefore we use the approximation

g′′ (β, D, yi, bi

)� G (β, D, yi) =

∂f (β, bi)

∂bTi

∣∣∣∣∣bi=bi

∂f (β, bi)

∂bi

∣∣∣∣∣bi=bi

+ D−1.

This has the advantage of requiring only the first order partial derivatives of the

model function with respect to the random effects, which are usually available

from the estimation of bi. This estimation of bi is a penalized least squares

problem, for which standard and reliable code is available.

The modified Laplacian approximation to the loglikelihood of model (4.1.1)

is then given by

�LA

(β, D, σ2 | y

)(5.1.5)

= −1

2

{N log

(2πσ2

)+ m log |D| +

m∑i=1

log [G (β, D, yi)]

+σ−2m∑

i=1

g(β, D, yi, bi

)}.

Since bi does not depend upon σ2, for given β and D, the maximum likeli-

hood estimate of σ2 (based upon �LA) is

σ2 = σ2 (β, D, y) =m∑

i=1

g(β, D, yi, bi

)/N.

We can profile �LA on σ2 to reduce the dimension of the optimization problem,

89

obtaining

�LAp (β, D | y) = (5.1.6)

− 1

2

{N[1 + log (2π) + log

(σ2)]

+ m log |D| +m∑

i=1

log [G (β, D, yi)]

}.

We note that if f is linear in b then the modified Laplacian approximation

is exact because the second order Taylor expansion in (5.1.4) is exact when

f (β, b) = f (β) + Z (β) b.

There does not yet seem to be a straightforward generalization of the concept

of restricted maximum likelihood (Harville, 1974) to nonlinear mixed effects

models. The difficulty is that restricted maximum likelihood depends heavily

upon the linearity of the fixed effects in the model function, which generally does

not occur in nonlinear models. Lindstrom and Bates (1990) circumvented that

problem by using an approximation to the model function f in which the fixed

effects β occur linearly. This cannot be done for the Laplacian approximation,

unless we consider yet another Taylor expansion of the model function, which

would lead us back to something very similar to Lindstrom and Bates’ approach.

We will return to this topic later in section 5.3.

5.1.3 Importance Sampling

Importance sampling provides a simple and efficient way of performing Monte

Carlo integration. The critical step for the success of this method is the choice of

an importance distribution from which the sample is drawn and the importance

weights calculated. Ideally this distribution corresponds to the density that we

are trying to integrate, but in practice one uses an easily sampled approximation.

For the nonlinear mixed effects model the function that we want to integrate is,


up to a multiplicative constant, equal to exp[−g(β, D, yi, bi)/2σ²]. As shown in subsection 5.1.2, by taking a second order Taylor expansion of g(β, D, yi, bi) around b̂i, the integrand is, up to a multiplicative constant, approximately equal to a N(b̂i, σ²[G(β, D, yi)]^(−1)) density. This gives us a natural choice for the importance distribution.

Let NIS denote the number of importance samples to be drawn. In practice one such sample can be generated by selecting a vector z* with distribution N(0, I) and calculating the sample of random effects as b*i = b̂i + σ[G(β, D, yi)]^(−1/2) z*, where [G(β, D, yi)]^(−1/2) denotes the inverse of the Cholesky factor of G(β, D, yi). The importance sampling approximation to the loglikelihood of y is then defined as

ℓIS(β, D, σ² | y) = −(1/2)[ N log(2πσ²) + m log|D| + Σ_{i=1}^m log|G(β, D, yi)| ]   (5.1.7)
        + Σ_{i=1}^m log{ Σ_{j=1}^{NIS} exp[−g(β, D, yi, b*ij)/2σ² + ‖z*j‖²/2] / NIS }.

Note that we cannot in general obtain a closed form expression for the MLE of

σ2 for fixed β and D, so that profiling on σ2 is no longer reasonable.
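The per-cluster sum in (5.1.7) can be sketched as follows, again in illustrative Python; the callable g, which evaluates g(β, D, yi, ·) for fixed β, D, and yi, is an assumed input:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def cluster_is_logsum(g, bhat, G, sigma2, n_is, rng):
    """log of (1/N_IS) sum_j exp[-g(b*_ij)/(2 sigma^2) + ||z*_j||^2 / 2],
    the i-th term of the second sum in (5.1.7)."""
    q = bhat.size
    U = cholesky(G)                     # upper triangular, G = U'U
    logs = np.empty(n_is)
    for j in range(n_is):
        z = rng.standard_normal(q)
        # b* = bhat + sigma [G]^{-1/2} z*, with [G]^{-1/2} the inverse
        # of the Cholesky factor of G
        b = bhat + np.sqrt(sigma2) * solve_triangular(U, z, lower=False)
        logs[j] = -g(b) / (2.0 * sigma2) + 0.5 * (z @ z)
    mx = logs.max()                     # log-mean-exp, for numerical stability
    return mx + np.log(np.mean(np.exp(logs - mx)))
```

The remaining terms of ℓIS, −[N log(2πσ²) + m log|D| + Σ log|G(β, D, yi)|]/2, are the same as in the modified Laplacian approximation.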

As in the modified Laplacian approximation, importance sampling gives exact results when the model function is linear in b, because in this case

p(yi | bi, β, D, σ²) p(bi) = p(yi | β, D, σ²) · N(b̂i, σ²[G(β, D, yi)]^(−1)),

where N(b̂i, σ²[G(β, D, yi)]^(−1)) denotes the corresponding normal density in bi, so that the importance weights are equal to p(yi | β, D, σ²).

5.1.4 Gaussian Quadrature

Gaussian quadrature is used to approximate integrals of functions with re-

spect to a given kernel by a weighted average of the integrand evaluated at

pre-determined abscissas. The weights and abscissas used in Gaussian quadra-

ture rules for the most common kernels can be obtained from the tables of

Abramowitz and Stegun (1964) or by using an algorithm proposed by Golub

(1973) (see also Golub and Welsch (1969)). Gaussian quadrature rules for mul-

tiple integrals are known to be numerically complex (Davis and Rabinowitz,

1984), but using the structure of the integrand in the nonlinear mixed effects

model we can transform the problem into successive applications of simple one

dimensional Gaussian quadrature rules. Letting z∗j , wj, j = 1, . . . , NGQ denote

respectively the abscissas and the weights for the (one dimensional) Gaussian

quadrature rule with NGQ points based on the N (0, 1) kernel, we get

∫ (2πσ²)^(−q/2) |D|^(−1/2) exp[−‖yi − f(β, bi)‖²/2σ²] exp(−bi^T D^(−1) bi/2σ²) dbi   (5.1.8)

    = ∫ (2π)^(−q/2) exp[−‖yi − f(β, σD^(T/2) z*)‖²/2σ²] exp(−‖z*‖²/2) dz*

    ≈ Σ_{j1=1}^{NGQ} · · · Σ_{jq=1}^{NGQ} exp[−‖yi − f(β, σD^(T/2) z*_{j1,...,jq})‖²/2σ²] ∏_{k=1}^q wjk,

where z*_{j1,...,jq} = (z*_{j1}, . . . , z*_{jq})^T. The corresponding approximation to the loglikelihood function is

ℓGQ(β, D, σ² | y) = −N log(2πσ²)/2   (5.1.9)
        + Σ_{i=1}^m log{ Σ_j exp[−‖yi − f(β, σD^(T/2) z*j)‖²/2σ²] ∏_{k=1}^q wjk },

where j = (j1, . . . , jq)^T.
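The one dimensional rule for the N(0, 1) kernel is a rescaled Gauss-Hermite rule, and the multidimensional rule is its tensor product. A minimal Python sketch (the function names and the normalization convention are ours):

```python
import numpy as np
from itertools import product

def std_normal_rule(n_gq):
    """Abscissas and weights for integration against the N(0,1) kernel,
    from the Gauss-Hermite rule (kernel exp(-x^2)): z = sqrt(2) x and
    w -> w / sqrt(pi), so that the weights sum to one."""
    x, w = np.polynomial.hermite.hermgauss(n_gq)
    return np.sqrt(2.0) * x, w / np.sqrt(np.pi)

def tensor_rule(n_gq, q):
    """The grid points z*_{j1,...,jq} and weights prod_k w_{jk} of (5.1.8)."""
    z1, w1 = std_normal_rule(n_gq)
    nodes = np.array(list(product(z1, repeat=q)))        # (n_gq**q, q)
    weights = np.array([np.prod(c) for c in product(w1, repeat=q)])
    return nodes, weights
```

The grid has NGQ^q points, so the cost of this rule grows quickly with the number of random effects.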

The Gaussian quadrature rule in this case can be viewed as a deterministic

version of Monte Carlo integration in which random samples of bi are gener-

ated from the N (0, σ2D) distribution. The samples (z∗j) and the weights (wj)

are fixed beforehand, while in Monte Carlo integration they are left to random

choice. Since importance sampling tends to be much more efficient than sim-

ple Monte Carlo integration, we also considered the equivalent of importance

sampling in the Gaussian quadrature context, which we will denote by adaptive

Gaussian quadrature. In this approach the grid of abscissas in the bi scale is

centered around the conditional modes b̂i rather than around 0 as in (5.1.8). Another

modification is the use of G (β, D, yi) instead of D in the scaling of the z∗.

The adaptive Gaussian quadrature is then given by

∫ (2πσ²)^(−q/2) |D|^(−1/2) exp[−‖yi − f(β, bi)‖²/2σ²] exp(−bi^T D^(−1) bi/2σ²) dbi

    = ∫ (2π)^(−q/2) |G(β, D, yi) D|^(−1/2) exp(−g(β, D, yi, b̂i + σ[G(β, D, yi)]^(−1/2) z*)/2σ² + ‖z*‖²/2) exp(−‖z*‖²/2) dz*

    ≈ |G(β, D, yi) D|^(−1/2) Σ_{j1=1}^{NGQ} · · · Σ_{jq=1}^{NGQ} exp(−g(β, D, yi, b̂i + σ[G(β, D, yi)]^(−1/2) z*_{j1,...,jq})/2σ² + ‖z*_{j1,...,jq}‖²/2) ∏_{k=1}^q wjk.

The corresponding approximation to the loglikelihood is then

ℓAGQ(β, D, σ² | y) = −[ N log(2πσ²) + m log|D| + Σ_{i=1}^m log|G(β, D, yi)| ]/2   (5.1.10)
        + Σ_{i=1}^m log{ Σ_j exp(−g(β, D, yi, b̂i + σ[G(β, D, yi)]^(−1/2) z*j)/2σ² + ‖z*j‖²/2) ∏_{k=1}^q wjk }.

The adaptive Gaussian quadrature approximation very closely resembles

that obtained for importance sampling. The basic difference is that the former

uses fixed abscissas and weights, while the latter allows them to be determined

by a pseudo-random mechanism. It is also interesting to note that the one

point (i.e. NGQ = 1) adaptive Gaussian quadrature approximation is simply

the modified Laplacian approximation (5.1.5), since in this case z*1 = 0 and

w1 = 1. The adaptive Gaussian quadrature also gives the exact loglikelihood

when the model function is linear in b, but that is not true in general for the

Gaussian quadrature approximation (5.1.8). Like the importance sampling ap-

proximation, the Gaussian quadrature approximation cannot be profiled on σ2

to reduce the dimensionality of the optimization problem.
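The per-cluster term of (5.1.10) differs from the importance sampling sketch only in using the fixed abscissas and weights. Under the same assumptions as before (g an assumed callable; nodes and weights from a tensor product rule for the N(0, 1) kernel):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def cluster_agq_logsum(g, bhat, G, sigma2, nodes, weights):
    """log of sum_j exp[-g(bhat + sigma [G]^{-1/2} z*_j)/(2 sigma^2)
    + ||z*_j||^2 / 2] prod_k w_{jk}, the i-th term of the second sum in
    (5.1.10). With nodes = [[0, ..., 0]] and weights = [1], i.e. N_GQ = 1,
    this reduces to the Laplacian term -g(bhat)/(2 sigma^2)."""
    U = cholesky(G)                     # upper triangular, G = U'U
    vals = np.array([np.log(w)
                     - g(bhat + np.sqrt(sigma2)
                         * solve_triangular(U, z, lower=False)) / (2.0 * sigma2)
                     + 0.5 * (z @ z)
                     for z, w in zip(nodes, weights)])
    mx = vals.max()                     # log-sum-exp, for numerical stability
    return mx + np.log(np.exp(vals - mx).sum())
```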

5.2 Comparing the Approximations

In this section we present a comparison of the different approximations to the

loglikelihood of model (4.1.1) described in section 5.1. Two real data examples,

the orange trees data, introduced in section 4.2, and the Theophylline data, as

well as simulation results are used to compare the statistical and computational

aspects of the various approximations.


5.2.1 Orange Trees

The orange trees data and the nonlinear mixed effects model used to describe

it were presented in section 4.2. We note that the single random effect occurs

linearly in (4.2.1) and therefore the modified Laplacian (5.1.6), the importance

sampling (5.1.7), and the adaptive Gaussian quadrature (5.1.10) approximations

are all exact. Figure 5.2.1 presents the data on the trunk circumference together

with the fitted curves corresponding to model (4.2.1), using maximum likelihood

based on the exact likelihood and the conditional modes of the random effects.

Table 5.2.1 presents the results of estimation using the alternating approxi-

mation, Gaussian quadrature with 10 and 200 abscissas, and the exact loglikeli-

hood. Since only the alternating approximation provides a version of restricted

maximum loglikelihood, we will just consider maximum likelihood estimation in

this and the next subsection. The subscript on Gaussian refers to the number

of abscissas used in the approximation, and the scalar L is √D, the square root of the scaled variance of the random effects. In general this is a matrix, but there is only one random effect here.

Table 5.2.1: Estimation Results – Orange Trees

Approximation   log(L)    β1        β2        β3        log(σ²)   ℓ
Alternating     1.389     191.049   722.556   344.164   4.120     -131.585
Gaussian10      1.123     194.325   727.490   348.065   4.102     -130.497
Gaussian200     1.396     192.293   727.074   348.074   4.119     -131.571
Exact           1.395     192.053   727.906   348.073   4.119     -131.572

The estimation results in Table 5.2.1 indicate that the different approxima-

tions produce similar fits. The Gaussian approximation with only 10 abscissas

gives the worst approximation, in terms of the value of the loglikelihood, but


Figure 5.2.1: Trunk circumference (in millimeters) of five orange trees, plotted against days: Data and fitted curves using the conditional modes of the random effects and maximum likelihood estimation based on the exact loglikelihood. The dashed line represents the curve obtained setting the random effects to zero.

even that is not far from the exact value. The Gaussian quadrature with 200

abscissas is almost identical to the exact loglikelihood. The alternating approx-

imation is also very close to the exact value.

Another important issue regarding the different approximations is how well

they behave in a neighborhood of the optimal value, since this behavior is often

used to assess the variability of maximum likelihood estimates. Figure 5.2.2

displays the profile traces and contours (Bates and Watts, 1988) for the ex-

act loglikelihood and the alternating approximation. This plot could not be


obtained for the Gaussian approximation because the objective function pre-

sented several local optima during the profiling algorithm. We believe that this

is related to the fact that the Gaussian approximation is centered at bi = 0

and not at the conditional modes of the random effects, where the integrand in

(4.1.3) takes on its highest values.

It can be seen from Figure 5.2.2 that the alternating method gives a good

approximation to the loglikelihood in a neighborhood of the optimal values. It

is interesting to note that the profile traces for the variance-covariance compo-

nents (D and σ2) and the fixed effects (β) meet almost perpendicularly. This

indicates a local lack of correlation between the variance-covariance components

and the fixed effects, which explains why the alternating method was so suc-

cessful in approximating the loglikelihood. The same pattern was observed in

several other data sets that we have analyzed, leading us to conjecture that the

asymptotic lack of correlation between the estimators of the variance-covariance

components and the fixed effects verified in the linear mixed effects model also

holds, at least approximately, for the nonlinear mixed effects model.

To compare the computational efficiency of the different approximations

we consider the number of function evaluations needed until convergence. For

the alternating approximation there are two different functions being evaluated during the iterations: the objective function (5.1.1) within the PNLS step and the approximate loglikelihood ℓA (5.1.2) within the LME step. We will use here the total number of evaluations of either (5.1.1) or ℓA, multiplied by the number

of clusters. For the other approximations we will use the total number of calls

to g (β, D, yi, bi). Even though the number of function evaluations used for the

alternating approximation is not directly comparable to the number of function

evaluations of the remaining approximations, it gives a good idea of the relative

computational efficiency of this algorithm.

Figure 5.2.2: Profile traces and profile contour plots for the orange trees data based on the exact loglikelihood (solid line) and the alternating approximation (dashed line). Plots below the diagonal are in the original scale and plots above the diagonal are in the zeta scale (Bates and Watts, 1988). Interpolated contours correspond approximately to joint confidence levels of 68%, 87%, and 95%.

Table 5.2.2 presents the number of function evaluations for the different

approximations in the orange trees example. The Gaussian quadrature approx-

imation is considerably less efficient than either the alternating approximation

or the exact loglikelihood. As expected the alternating approximation is the

most computationally efficient.

Table 5.2.2: Number of Function Evaluations to Convergence – Orange Trees

Approximation   Function Evaluations
Alternating     200
Exact           420
Gaussian10      8,150
Gaussian200     101,000

5.2.2 Theophylline

The data considered here are courtesy of Dr. Robert A. Upton of the Univer-

sity of California, San Francisco. Theophylline was administered orally to 12

subjects whose serum concentrations were measured at 11 times over the next

25 hours. This is an example of a laboratory pharmacokinetic study character-

ized by many observations on a moderate number of individuals. Figure 5.2.3

displays the data and the fitted curves obtained through maximum likelihood

estimation using the adaptive Gaussian approximation with 10 abscissas and

using the conditional modes of the random effects.

A common model for such data is a first order compartment model with absorption in a peripheral compartment

Ct = D·K·ka / [Cl(ka − K)] · [exp(−Kt) − exp(−ka·t)],   (5.2.1)

Figure 5.2.3: Theophylline concentrations (in mg/L) of twelve patients, plotted against time (hrs): Data and fitted curves using the conditional modes of the random effects and maximum likelihood estimation based on the adaptive Gaussian approximation.

where Ct is the observed concentration (mg/L) at time t, t is the time (hr),

D is the dose (mg/kg), Cl is the clearance (L/kg), K is the elimination rate

constant (1/hr), and ka is the absorption rate constant (1/hr). In order to

ensure positivity of the rate constants and the clearance, the logarithms of

these quantities were used in the fit. Analysis of the Theophylline data using

model (5.2.1) suggested that only log(Cl) and log(ka) needed random effects

to account for the patient-to-patient variability. The nonlinear mixed effects


model used for the Theophylline data is

Ct = D exp[−(β1 + bi1) + (β2 + bi2) + β3] / [exp(β2 + bi2) − exp(β3)]   (5.2.2)
        × {exp[−exp(β3) t] − exp[−exp(β2 + bi2) t]},

with log(Cl) = β1 + bi1, log(ka) = β2 + bi2, and log(K) = β3.
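Written out in code, model (5.2.2) is straightforward. The sketch below is illustrative Python; the argument dose stands for the dose D of (5.2.1), renamed to avoid a clash with the random effects covariance matrix D:

```python
import numpy as np

def first_order_model(beta, b, dose, t):
    """First order compartment model (5.2.2), with random effects on
    log(Cl) and log(ka): beta = (beta1, beta2, beta3), b = (b_i1, b_i2)."""
    ka = np.exp(beta[1] + b[1])         # absorption rate constant
    K = np.exp(beta[2])                 # elimination rate constant
    scale = dose * np.exp(-(beta[0] + b[0]) + (beta[1] + b[1]) + beta[2])
    return scale / (ka - K) * (np.exp(-K * t) - np.exp(-ka * t))
```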

Table 5.2.3 presents the estimation results from the various approximations

to the loglikelihood. Only maximum likelihood estimation is considered. The

subscripts on Gaussian and on Adap. Gaussian refer to the number of ab-

scissas used in the Gaussian and adaptive Gaussian approximations, while the

subscript on Imp. Sampling refers to the number of importance samples used

in this approximation. L denotes the vector with elements given by the upper

triangular half of the Cholesky decomposition of D, stacked by columns.

Table 5.2.3: Estimation Results – Theophylline Data

Approximation        log(L1)   L2       log(L3)   β1       β2      β3
Alternating          -1.4466   0.0027   -0.0999   -3.227   0.466   -2.455
Laplacian            -1.4438   0.0027   -0.0997   -3.230   0.469   -2.464
Imp. Sampling1000    -1.4438   0.0027   -0.0988   -3.227   0.476   -2.459
Gaussian5            -1.5554   0.0024   -0.3969   -3.304   0.501   -2.487
Gaussian10           -1.5642   0.0023   -0.2043   -3.238   0.595   -2.469
Gaussian100          -1.4457   0.0027   -0.0982   -3.227   0.480   -2.459
Adap. Gaussian5      -1.4460   0.0027   -0.0991   -3.225   0.476   -2.458
Adap. Gaussian10     -1.4475   0.0027   -0.0994   -3.227   0.474   -2.459

Approximation        log(σ²)   ℓ
Alternating          -0.6866   -177.0237
Laplacian            -0.6866   -177.0000
Imp. Sampling1000    -0.6875   -177.7689
Gaussian5            -0.4840   -182.4680
Gaussian10           -0.7028   -176.1008
Gaussian100          -0.6854   -177.7290
Adap. Gaussian5      -0.6868   -177.7500
Adap. Gaussian10     -0.6853   -177.7473


We can see from Table 5.2.3 that the alternating approximation, the Lapla-

cian approximation, the importance sampling approximation, and the adaptive

Gaussian approximation all give similar estimation results. The Gaussian ap-

proximation only approaches the other approximations when the number of

abscissas is increased considerably. Note that the actual number of points used

in the grid that defines the Gaussian approximation for this example is the

square of the number of abscissas. The adaptive Gaussian approximations for

1 (Laplacian), 5, and 10 abscissas give similar results, indicating that just a

few points are needed for this approximation to be accurate. The importance

sampling approximation caused some numerical difficulties for the optimiza-

tion algorithm (the ms() function in S (Chambers and Hastie, 1992)) used to

obtain the maximum likelihood estimates, since the stochastic variability asso-

ciated with different importance samples overwhelmed the numerical variability

of the loglikelihood for small changes in the parameter values (used to calculate

numerical derivatives). We solved this problem by keeping the random num-

ber generator seed fixed during the optimization process, thus using the same

importance samples throughout the calculations. Since the results obtained us-

ing importance sampling were very similar to those of the adaptive Gaussian

approximation, we concluded that the latter is to be preferred for its greater

simplicity and computational efficiency.
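In code, the fix amounts to freezing the underlying N(0, I) draws once, outside the optimizer. The sketch below is illustrative Python, with scipy.optimize.minimize standing in for ms(); neg_loglik_is, a routine evaluating −ℓIS from a given set of draws, is an assumed input:

```python
import numpy as np
from scipy.optimize import minimize

def make_deterministic_objective(neg_loglik_is, n_is, q, seed=0):
    """Reuse the same N(0, I) draws at every parameter value, so that
    finite-difference derivatives measure changes in the loglikelihood
    rather than Monte Carlo variability between importance samples."""
    z_fixed = np.random.default_rng(seed).standard_normal((n_is, q))
    return lambda theta: neg_loglik_is(theta, z_fixed)

# e.g.: minimize(make_deterministic_objective(neg_loglik_is, 1000, 2), theta0)
```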

Table 5.2.4 gives the number of function evaluations until convergence for the

different approximations. The alternating approximation is the most efficient,

followed by the Laplacian and adaptive Gaussian approximations. Gaussian

quadrature with 5 abscissas is efficient compared to the adaptive Gaussian,

but is quite inaccurate. The more reliable Gaussian approximation with 100

abscissas takes about 100 times more function evaluations than the adaptive


Gaussian with 10 abscissas. The importance sampling approximation had the

worst performance in terms of function evaluations.

Table 5.2.4: Number of Function Evaluations to Convergence – Theophylline

Approximation        Function Evaluations
Alternating          1,512
Laplacian            7,683
Adap. Gaussian5      30,020
Adap. Gaussian10     96,784
Gaussian5            47,700
Gaussian10           318,000
Gaussian100          10,200,000
Imp. Sampling1000    11,211,284

Next we consider the approximations in a neighborhood of the optimal value.

We will restrict ourselves here to the alternating, the Laplacian, and the adap-

tive Gaussian approximation, as the Gaussian approximation for a moderate

number of abscissas is not reliable, and both the Gaussian approximation with

a larger number of abscissas and the importance sampling approximation are

very inefficient computationally and give results quite similar to the adaptive

Gaussian approximation. We used five abscissas for the adaptive Gaussian

quadrature, as this gives roughly the same precision as the ten-abscissa quadra-

ture rule.

The alternating approximation gives results very similar to the adaptive

Gaussian quadrature. As in the orange trees example, the profile traces of

the variance-covariance components and the fixed effects meet almost perpen-

dicularly, indicating a local lack of correlation between these estimates. The

Laplacian and the adaptive Gaussian approximations give virtually identical

plots (not included here). This suggests there is little to be gained by increas-

ing the number of abscissas past one in the quadrature rule. The major gain in

precision is obtained by centering the grid at the conditional modes and scaling it using the approximate Hessian.

Figure 5.2.4: Profile traces and profile contour plots for the Theophylline data based on the adaptive Gaussian approximation with 5 abscissas (solid line) and the alternating approximation (dashed line). Plots below the diagonal are in the original scale and plots above the diagonal are in the zeta scale (Bates and Watts, 1988). Interpolated contours correspond approximately to joint confidence levels of 68%, 87%, and 95%.

5.2.3 Simulation Results

In this section we include a comparison of the approximations to the loglikeli-

hood in model (4.1.1) using simulation. We restrict ourselves to the alternat-

ing, the Laplacian, and the (five-abscissa) adaptive Gaussian approximations as

these seem to be more accurate and/or more efficient than the Gaussian and the

importance sampling approximations. Two models were used in the simulation

analysis: a logistic model similar to the one used for the orange trees data and

a first order open compartment model similar to the one used for the Theo-

phylline example. For both models 1000 samples were generated and maximum

likelihood (ML) estimates based on the different approximations obtained. For

the alternating approximation, restricted maximum likelihood (RML) estimates

were also obtained.

Logistic Model

A logistic model similar to (4.2.1), but with two random effects instead of one,

was used to generate the data. The model is given by

yij = (β1 + bi1) / (1 + exp{−[tij − (β2 + bi2)]/β3}) + εij,   (5.2.3)

where the bi are i.i.d. N(0, σ²D), i = 1, . . . , m, the εij are i.i.d. N(0, σ²), i = 1, . . . , m, j = 1, . . . , ni, and the εij are independent of the bi. We used m = 15, ni = 10, i = 1, . . . , 15, σ² = 25, β = (200, 700, 350)^T, and

D = [  4   −2
      −2   25 ].
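A sketch of the corresponding data generation step (illustrative Python; the measurement times tij used in the simulation are not reproduced here, so the time grid below is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_i, sigma2 = 15, 10, 25.0
beta = np.array([200.0, 700.0, 350.0])
D = np.array([[4.0, -2.0], [-2.0, 25.0]])
t = np.linspace(100.0, 1600.0, n_i)     # illustrative design times only

def simulate_logistic_sample():
    """One data set from (5.2.3): b_i ~ N(0, sigma2 D), eps_ij ~ N(0, sigma2)."""
    y = np.empty((m, n_i))
    for i in range(m):
        b = rng.multivariate_normal(np.zeros(2), sigma2 * D)
        mean = (beta[0] + b[0]) / (1.0 + np.exp(-(t - (beta[1] + b[1])) / beta[2]))
        y[i] = mean + np.sqrt(sigma2) * rng.standard_normal(n_i)
    return y
```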


Table 5.2.5 summarizes the simulation results for the variance-covariance

components (MSE denotes the mean square error of the estimators). The dif-

ferent approximations to the loglikelihood give similar simulation results for all

the parameters involved. The cluster specific variance (σ2) is estimated with

more relative precision than the elements of the scaled variance-covariance ma-

trix of the random effects (D). This is probably because the precision of the

estimate of σ2 (as well as the estimates of β) is related more to the total number

of observations, while the precision of the estimates of D is determined by the

number of clusters. We can also see a tendency for the restricted maximum

likelihood to give positively biased estimates of D11 and D22, while the other

approximations give negatively biased estimates. The rationale for restricted

maximum likelihood is to reduce bias in estimating variance components. It

does not seem to do so in this case; it just changes its direction.

Table 5.2.5: Simulation results for D and σ² in the logistic model

                     D11                          D12
Approximation        Mean     Bias     MSE        Mean     Bias    MSE
Alternating – RML    4.200    0.200    3.916      -1.946   0.054   18.421
Alternating – ML     3.922    -0.078   3.437      -1.995   0.005   16.185
Laplacian            3.935    -0.065   3.375      -1.978   0.022   15.724
Adap. Gaussian       3.941    -0.059   3.408      -1.965   0.035   15.754

                     D22                          σ²
Approximation        Mean     Bias     MSE        Mean     Bias     MSE
Alternating – RML    26.089   1.089    360.985    24.885   -0.115   9.756
Alternating – ML     23.322   -1.678   314.503    24.651   -0.349   9.647
Laplacian            23.864   -1.136   310.054    24.625   -0.375   9.570
Adap. Gaussian       23.934   -1.066   312.422    24.617   -0.383   9.567

Figure 5.2.5 presents the scatter plots of the variance-covariance component

(σ2 and D) estimates for the alternating RML, the alternating ML, and the

Laplacian approximations versus the adaptive Gaussian approximation. We


see that, except for the alternating RML approximation, all methods lead to

very similar estimates. In general the alternating RML approximation gives

larger values for the estimates of the variance components (especially D11 and

D22) than the other methods. The higher mean square error for D12 from the

alternating ML and RML methods is visible in the plot, as each of the panels

comparing these estimates to those from the adaptive Gaussian method has a

vertical clump of points at the true value.

Table 5.2.6 presents the simulation results for the fixed effects estimates.

The results are very similar for all approximations considered. We also note

that the relative variability of the fixed effects estimates is much smaller than

those of the estimates of the elements of D. There is very little, if any, bias in

the fixed effects estimates.

Table 5.2.6: Simulation results for β in the logistic model

                     β1                          β2
Approximation        Mean     Bias    MSE        Mean     Bias    MSE
Alternating – RML    199.61   -0.39   10.18      698.43   -1.57   138.21
Alternating – ML     199.61   -0.39   10.18      698.43   -1.57   138.22
Laplacian            199.93   -0.07   10.20      700.03   0.03    138.38
Adap. Gaussian       199.92   -0.08   10.15      699.90   -0.09   138.44

                     β3
Approximation        Mean     Bias    MSE
Alternating – RML    348.81   -1.19   57.17
Alternating – ML     348.82   -1.18   57.13
Laplacian            350.20   0.20    56.94
Adap. Gaussian       350.06   0.06    57.06

Figure 5.2.6 presents the scatter plots of the fixed effects estimates for the

alternating RML, alternating ML, and Laplacian approximations versus the

adaptive Gaussian approximation. Again we observe a strong agreement in

the estimates obtained through the various approximations. The alternating

approximations tend to give estimates slightly smaller than the Laplacian and adaptive Gaussian, but the differences are minor.

Figure 5.2.5: Scatter plots of variance-covariance components estimates for the alternating (RML and ML), Laplacian, and adaptive Gaussian approximations in the logistic model (5.2.3). The dashed lines indicate the true values of the parameters.

First Order Compartment Model

The model used in the simulation is identical to (5.2.2). As in the Theophylline

example we set m = 12 and ni = 11, i = 1, . . . , 12. The parameter values used

were σ² = 0.25, β = (−3.0, 0.5, −2.5)^T, and

D = [ 0.2   0
        0   1 ].

Table 5.2.7 summarizes the simulation results for the variance-covariance

components estimates. As in the logistic model analysis, we observe that the el-

ements of D are estimated with less relative precision than σ2. The alternating

ML, Laplacian, and adaptive Gaussian approximations seem to lead to slightly

downward biased estimates of D11 and D22, while the alternating RML approx-

imation appears to give unbiased estimates (thus achieving its main purpose).

Note, however, that the unbiasedness of the RML estimates does not translate

into smaller mean square error — all four estimation methods lead to similar

MSE, for all parameters.

Figure 5.2.7 presents the scatter plots of the variance-covariance estimates

for the alternating RML, alternating ML, and Laplacian approximations versus

the adaptive Gaussian approximation. The alternating RML approximation

tends to give larger values for D11 and D22, and larger absolute values for D12,

while the remaining approximations lead to very similar estimates. There was

one sample for which the alternating approximations apparently converged to a

different solution than the Laplacian and adaptive Gaussian. Overall there were

no major differences between the approximations in estimating the variance-

covariance components.

Figure 5.2.6: Scatter plots of fixed effects estimates for the alternating (RML and ML), Laplacian, and adaptive Gaussian approximations in the logistic model (5.2.3). The dashed lines indicate the true values of the parameters.

sig2

Adaptive Gaussian

Alte

rnat

ing

- R

ML

0.15 0.20 0.25 0.30 0.35

0.15

0.20

0.25

0.30

0.35

••

•••

••

••

••

•••

••

••

••

••

•••

••••

••

•••

•••

••

••

••

••

••

••

• ••

••

••

••

••

••

••

••

••

••

••

••

••

••

••

••••

••

•••

• •

••

••

••

••

••

• •

••••

••

••

••

••

•••

••

••

••

••

••

••••

••

••

••

••

••

••

••

••

••

•••

•••

••

••

• •

••

••

• •

••

••

••

••

•••

••

•••

••

••

••

••

••

••

••

••••

•••

••

••

••

•••

•••

•••

••

••

••

••

••

••

••••

••

••

••

••

••

••

••••

••

••

••

••

••

••

••

••

••

••

•••

•• •

••

••

•••

•••

• ••

••

••

••

••

••

••

••

••••

••

••

••

••

••

••

••

••

••

••

••

••

•• •

••

••

••

••

••

••

••

••

•••

••

••

••

•••

••

•••

••

••••

••

••

• •

••

••

••

••

••

••

••

••

••

••

•••

••

••

••

••

••

••

• •

••

• •

••

•••

••

••

••

••

••

•••

••

•••

••

••

•••

••

••

••

••

•••

••••

••

••

••

sig2

Adaptive GaussianA

ltern

atin

g -

ML

0.15 0.20 0.25 0.30 0.35

0.15

0.20

0.25

0.30

0.35

••

•••

••

••

••

•••

••

••

••

••

•••

••••

••

•••

•••

••

••

••

••

••

••

• ••

••

••

••

••

••

••

••

••

••

••

••

••

••

••

••••

••

•••

••

••

••

••

••

••

• •

••••

••

••

••

••

•••

••

••

••

••

••

••

••••

••

••

••

••

••

••

••

••

••

•••

•••

••

••

• •

••

••

• •

••

••

••

•••

•••

••

•••

••

••

••

••

••

••

••

•••

••

•••

••

••

••

•••

•••

•••

••

• •

••

••

••

••

••••

••

••

••

••

••

••

••••

••

••

••

••

••

••

••

••

••

••

•••

•• •

••

••

•••

•••

• ••

••

••

••

••

••

••

••

••••

••

••

••

••

••

••

••

••

••

••

••

•• •

••

••

••

••

••

••

••

••

•••

••

••

••

•••

••

•••

••

••••

••

••

• •

••

••

••

••

••

••

••

••

••

••

•••

••

••

••

••

••

••

• •

••

• •

••

•••

••

••

••

••

••

•••

••

•••

••

••

•••

••

••

••

••

•••

••••

••

• ••

sig2

Adaptive Gaussian

Lapl

acia

n

0.15 0.20 0.25 0.30 0.35

0.15

0.20

0.25

0.30

0.35

••

•••

••

••

••

•••

••

••

••

••

•••

••••

••

•••

•••

••

••

••

••

••

••

•••

••

••

••

••

••

••

••

••

••

••

••

••

••

••••

••

•••

••

••

••

••

••

••

• •

••••

••

••

••

••

•••

••

••

••

••

••

••

••••

••

••

••

••

••

••

••

••

••

•••

•••

••

••

• •

••

••

• •

••

••

••

•••

•••

••

•••

••

••

••

••

••

••

••

•••

••

•••

••

••

••

•••

•••

•••

••

••

••

••

••

••

•••

••

••

••

••

••

••

••••

••

••

••

••

••

••

••

••

••

••

•••

•• •

••

••

•••

•••

• ••

••

••

••

••

••

••

••

• •••

••

••

••

••

••

••

••

••

••

••

••

••

•• •

••

••

••

••

••

••

••

••

•••

••

••

••

•••

••

•••

••

••••

••

••

•• •

••

••

••

••

••

••

••

••

••

••

•••

••

••

••

••

••

••

• •

••

• ••

••

•••

••

••

••

••

••

•••

••

•••

••

••

•••

••

••

••

••

•••

•••

••

• ••

D11

Adaptive Gaussian

Alte

rnat

ing

- R

ML

0.0 0.1 0.2 0.3 0.4 0.5

0.1

0.2

0.3

0.4

0.5

0.6

••

• •

••

••

• •

••

••

••

••

••

••

••

••

••

••

••

••

••

• •

••

••

••

••

••

• ••

••

••

••

•••

•••

••

••

••

••

••

• •

•••

••

••

••

•••

•••

••

••

•••

•••

••

••

••

•••

••

•••

••

••

••

••

•••

••

••

••

••

•• •••

••

••• •

••

••

••

•••

••

••

• ••

••

••

••

••

• •

••

••

••

••

••

••

•••

••

•••••

•• ••

••

••

••

••

••

••

••

••

••

• ••

••

••

••

••

••

•••

••

••

••

••

••

••

•••••

••

••

••

••

•••

••

••

••

••

• •

••

••

••

••

••

••

••

••

•••

••

••

••

••

•••

•• ••

••••

•••

••

••

••

• •••

••

••

• ••

••

•••

••

• •

••

••

••

••

••

••

••

••

•••

••

•••

••

••

••

••

••

••

••

••

•••

••

••

• •

••

••

••••

••

••

••

••

••

••••

••

••

••

••

••

••

••

••

•• ••

••

••

••

••

••

••

••

• •

••

••

••••

••

••

•••

••

••

D11

Adaptive Gaussian

Alte

rnat

ing

- M

L

0.0 0.1 0.2 0.3 0.4 0.5

0.0

0.1

0.2

0.3

0.4

0.5

••

• •

••

••

• •

••

••

••

••

••

••

••

••

••

••

••

••

••

• •

••

••

••

••

••

••

••

••

•••

•••

••

••

••

••

••

• •

•••

••

••

••

•••

•••

••

••

•••

•••

••

••

••

•••

••

•••

••

••

••

••

•••

••

••

••

••

•••

••

••

••••

••

••

••

•••

••

••

• ••

••

••

••

••

• •

••

••

••

••

••

••

•••

••

•••••

•• ••

••

••

••

••

••

••

••

••

••

•••

••

••

••

••

••

•••

••

••

••

••

••

••

•••••

••

••

••

••

•••

••

••

••

••

••

• •

••

••

••

••

••

••

••

••

•••

••

••

••

••

•••••

••••

•••

••

••

••

• •••

••

••

• ••

••

•••

••

• •

••

••

••

••

••

••

••

••

•••

••

•••

••

••

••

••

••

••

••

••

•••

••

••

••

••

••

••••

••

••

••

••

••

••••

••

••

••

••

••

••

••

••

•• ••

••

••

••

••

••

••

••

• •

••

••

••••

••

••

•••

••

••

D11

Adaptive GaussianLa

plac

ian

0.0 0.1 0.2 0.3 0.4 0.5

0.0

0.1

0.2

0.3

0.4

0.5

••

• •

••

••

••

••

••

••

••

••

••

••

••

••

••

••

••

••

• •

••

••

••

••

••

• ••

••

••

••

•••

•••

••

••

••

••

••

• •

•••

••

••

•••

•••

••

••

•••

•••

••

••

••

••

••

••

•••

••

••

••

••

•••

••

••

••

••

••••

••

••••

••

••

••

•••

••

••

• ••

••

••

••

••

• •

••

••

••

••

••

••

•••

••

•••••

•• ••

••

••

••

••

••

••

••

••

••

• ••

••

••

••

••

••

•••

••

••

••

••

••

••

•••••

••

••

••

••

•••

••

•••

••

••

• •

••

••

••

••

••

••

••

•••

••

••

•••

••

•••

•• ••

••••

•••

••

••

••

•••

••

••

• ••

••

•••

••

• •

••

[Figure 5.2.7: six scatter-plot panels comparing the adaptive Gaussian estimates of D12 and D22 with the corresponding alternating (RML and ML) and Laplacian estimates.]

Figure 5.2.7: Scatter plots of variance-covariance components estimates for the alternating (RML and ML), Laplacian, and adaptive Gaussian approximations in the first order compartment model (5.2.2). The dashed lines indicate the true values of the parameters.


Table 5.2.7: Simulation results for D and σ2 in the first order compartment model

                               D11                         D12
Approximation          Mean     Bias     MSE      Mean      Bias     MSE
Alternating – RML    0.1996  -0.0004  0.0089   -0.0013   -0.0013  0.0210
Alternating – ML     0.1840  -0.0160  0.0078   -0.0023   -0.0023  0.0179
Laplacian            0.1862  -0.0138  0.0078   -0.0011   -0.0011  0.0178
Adap. Gaussian       0.1860  -0.0140  0.0077    0.0002    0.0002  0.0180

                               D22                         σ2
Approximation          Mean     Bias     MSE      Mean      Bias     MSE
Alternating – RML    1.0095   0.0095  0.2565    0.2508    0.0008  0.0012
Alternating – ML     0.9249  -0.0751  0.2240    0.2486   -0.0014  0.0011
Laplacian            0.9388  -0.0612  0.2276    0.2480   -0.0020  0.0011
Adap. Gaussian       0.9476  -0.0524  0.2332    0.2481   -0.0019  0.0011

Table 5.2.8 gives the simulation results for the fixed effects estimates. All

four approximations give virtually identical results for the estimation of the

fixed effects. They all show very little bias and smaller relative variability when

compared to the estimates of the variance-covariance components.

The scatter plots of the fixed effects estimates, not included here, show

practically identical results for the alternating RML and ML, the Laplacian,

and the adaptive Gaussian approximations.

5.3 Conclusions

The results of section 5.1 indicate that the alternating approximation (5.1.2)

to the loglikelihood function in the nonlinear mixed effects model (4.1.1) pro-

posed by Lindstrom and Bates (1990) gives accurate and reliable estimation

results. The main advantages of this approximation are its computational effi-

ciency (allowing the use of linear mixed effects techniques to estimate the scaled variance-covariance matrix of the random effects D) and the availability of a restricted likelihood version of it, which is not yet defined for other approximations/estimation methods.

Table 5.2.8: Simulation results for β in the first order compartment model

                               β1                          β2
Approximation          Mean     Bias     MSE      Mean      Bias     MSE
Alternating – RML   -2.9989   0.0011  0.0053    0.4876   -0.0124  0.0244
Alternating – ML    -2.9992   0.0008  0.0053    0.4869   -0.0131  0.0244
Laplacian           -3.0009  -0.0009  0.0053    0.4983   -0.0017  0.0242
Adap. Gaussian      -2.9987   0.0013  0.0053    0.4984   -0.0016  0.0246

                               β3
Approximation          Mean     Bias     MSE
Alternating – RML   -2.4965   0.0035  0.0020
Alternating – ML    -2.4965   0.0035  0.0020
Laplacian           -2.5045  -0.0045  0.0020
Adap. Gaussian      -2.5008  -0.0008  0.0020

With regard to the restricted maximum likelihood

estimation though, the results of section 5.2 suggest that the bias correction

ability of this method depends on the nonlinear model that is being consid-

ered: RML estimation achieved its purpose for the first order compartment

model (5.2.2), but it increased the bias in the logistic model (5.2.3). More re-

search is needed in this area. Since it is simpler computationally, the alternating

approximation should be used to provide starting values for the more accurate

approximations (e.g. Laplacian and adaptive Gaussian) if they are preferred.
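The size of this effect can be read from Table 5.2.7: with a true D22 of 1, the alternating – RML mean estimate is 1.0095 (a bias of about 1%), against 0.9249 (a bias of about -7.5%) for alternating – ML.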

The Gaussian quadrature approximation (5.1.9) only seems to give accurate

results for a large number of abscissas (> 100), which makes it very inefficient

computationally. The cause of this behavior is that the grid of abscissas is

centered at 0 (the expected value of the random effects) and scaled according to D, while the highest values of the integrand in (4.1.3) are concentrated around the posterior modes of the random effects (b̂) and scaled according to g″(β, D, y, b̂). The advantages of this approximation are that it does not

require the estimation of the posterior modes of the random effects at each

iteration and it admits closed form partial derivatives with respect to the pa-

rameters of interest (β, D, and σ2), provided these are available for the model

function f (Davidian and Gallant, 1992). We feel that these advantages do not

compensate for the inaccuracy or computational inefficiency of the Gaussian

approximation.

The importance sampling approximation (5.1.7) gives reliable estimation

results, comparable to those of the adaptive Gaussian and Laplacian approxi-

mations, but is considerably less efficient computationally than these approx-

imations. Also, the stochastic variability associated with the different impor-

tance samples may overwhelm the numerical variability of the loglikelihood for

small changes in the parameter values, making it difficult to calculate numeri-

cal derivatives. The main advantage of the importance sampling approximation

is its versatility in handling distributions other than the normal, for both the

random effects and the error term (ε). For example it would be rather straight-

forward to adapt the importance sampling integration to handle a multivariate

t distribution for the random effects, but that would not be a trivial task for

either the alternating, the Laplacian, or the adaptive Gaussian approximations.

Wakefield et al. (1994) use a similar property of Gibbs sampling methods to

check for outliers in nonlinear mixed effects models. If one is willing to stick with

the normal distribution for b and ε in the nonlinear mixed effects model (4.1.1)

then the importance sampling approximation is not the most efficient choice.

Of all approximations considered here, the Laplacian and adaptive Gaussian

approximations probably give the best mix of efficiency and accuracy. The

former can be regarded as a particular case of the latter, where just one abscissa


is used. Both approximations (and the importance sampling approximation

as well) give the exact loglikelihood when the model function f in (4.1.1)

is a linear function of the random effects. In the examples that we analyzed

not much was gained by going from a one-point adaptive Gaussian quadrature

(Laplacian) approximation to approximations with a larger number of abscissas.

It appears that the major gain in adaptive Gaussian approximations is related to

the centering and scaling of the abscissas. Increasing the number of points in the

evaluation grid only gives marginal improvement. The Laplacian approximation

has the additional advantage over the adaptive Gaussian approximation with

more than one abscissa of allowing profiling of the loglikelihood over σ2, thus

reducing the dimensionality of the optimization problem.

For statistical analysis purpose we would recommend using a hybrid scheme

in which the alternating algorithm would be used to get good initial values for

the more refined Laplacian approximation to the loglikelihood of model (4.1.1).

Chapter 6

Parametrizations for

Variance-Covariance Matrices

The estimation of variance-covariance matrices in mixed effects models using ei-

ther maximum likelihood, or restricted maximum likelihood, is usually a difficult

numerical problem, since one must ensure that the resulting estimate is posi-

tive semi-definite. Two approaches can be used for that purpose: constrained

optimization, where the natural parametrization for the unique elements in the

variance-covariance matrix is used and the estimates are constrained to be pos-

itive semi-definite matrices, and unconstrained optimization, where the unique

elements in the variance-covariance matrix are reparametrized in a way such

that the resulting estimate must be positive semi-definite. We recommend the

use of the second approach not only for numerical reasons (parameter estima-

tion tends to be much easier when there are no constraints), but also because

of the superior inferential properties that unconstrained estimates tend to have

(e.g. asymptotic properties).
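The device is easiest to see in the scalar case: a 1 x 1 variance "matrix" D = (d) with d > 0 can be estimated without constraints by optimizing over θ = log(d), since every real θ maps back to a valid positive variance. A minimal S sketch (the names are illustrative only):

> theta <- log(2)   # unconstrained parameter; any real value is allowed
> d <- exp(theta)   # the implied variance is always positive

The parametrizations of this chapter generalize this idea from scalars to matrices.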

Since a variance-covariance matrix is positive semi-definite, but not positive


definite (p.d.) only in the rather degenerate situation of nonrandom linear

combinations of the underlying random variables, we will restrict ourselves here

to positive definite variance-covariance matrices.

In addition to enforcing the positive definiteness constraints, the choice of

the parametrization can be influenced by computational efficiency and by the

statistical interpretability of the individual components. In general we can use

numerically or analytically determined second derivatives of the (restricted)

likelihood to approximate standard errors and derive confidence intervals for

the individual parameters. In order to assess the variability of the variance

and covariance estimates, it is desirable that they can be expressed as simple

functions of the unconstrained parameters. More detailed techniques, such as

profiling the likelihood (Bates and Watts, 1988), also work best for functions of

the variance-covariance matrix that are expressed in the original parametriza-

tion.

We describe in section 6.1 five different parametrizations for transforming

the estimation of unstructured (general) variance-covariance matrices into an

unconstrained problem. In section 6.2 we compare the parametrizations with

respect to their computational efficiency and statistical interpretability. Our

conclusions are presented in section 6.3.

6.1 Parametrizations

Let D denote an unstructured positive definite q×q variance-covariance matrix

corresponding to a random vector b = (b1, . . . , bq). Since D is symmetric,

only q(q + 1)/2 parameters are needed to represent it. We will denote by θ

any such minimal set of parameters to determine D. The rationale behind all


parametrizations considered in this section is to write

D = L^T L    (6.1.1)

where L = L(θ) is a q × q matrix of full rank obtained from a q(q + 1)/2-dimensional vector of unconstrained parameters θ. It is clear that any D defined as in (6.1.1) is positive definite, since x^T D x = ||Lx||^2 > 0 for any x ≠ 0 when L has full rank.

Different choices of L lead to different parametrizations of D. We will

consider here two classes of L: one based on the Cholesky factorization (Thisted,

1988) of D and another based on the spectral decomposition of D (Rao, 1973).

The first three parametrizations presented below use the Cholesky factorization

of D, while the last two are based on its spectral decomposition.

In some of the parametrizations there are particular components of the pa-

rameter vector θ that have meaningful statistical interpretations. These can

include the eigenvalues of D — important in considering when the matrix is ill-

conditioned, the individual variances or standard deviations, and the particular

correlations.

The following variance-covariance matrix will be used throughout this sec-

tion to illustrate the use of the various parametrizations.

     | 1  1   1 |
A =  | 1  5   5 |    (6.1.2)
     | 1  5  14 |

6.1.1 Cholesky Parametrization

Since D is p.d. it may be factored as D = L^T L, where L is an upper triangular

matrix. Setting θ to be the upper triangular elements of L gives the Cholesky


parametrization of D. Lindstrom and Bates (1988) use this parametrization to

obtain derivatives of the loglikelihood of a linear mixed effects model for use in a

Newton-Raphson algorithm. They reported that the use of this parametrization

dramatically improved the convergence properties of the optimization algorithm,

when compared to a constrained estimation approach.

One problem with the Cholesky parametrization is that the Cholesky factor

is not unique. In fact, if L is a Cholesky factor of D then so is any matrix

obtained by multiplying a subset of the rows of L by −1. This has implications

for parameter identification, since up to 2^q different θ may represent the same

D. Numerical problems can arise when different optimal solutions are not far

apart.

Another problem with the Cholesky parametrization is the lack of a straight-

forward relationship between θ and the elements of D. This makes it hard to

interpret the estimates of θ and to obtain confidence intervals for the variances

and covariances in D based on confidence intervals for the elements of θ. One

exception is |L11| = √D11, so confidence intervals on D11 can be obtained from

confidence intervals on L11. By appropriately permuting the columns and rows

of D we can in fact derive confidence intervals for all the variance terms based

on confidence intervals for the elements of L.

The main advantage of this parametrization, apart from the fact that it

ensures positive definiteness of the estimate of D, is that it is computationally

simple and stable.


The Cholesky factorization of A in (6.1.2) is

     | 1  0  0 |   | 1  1  1 |
A =  | 1  2  0 |   | 0  2  2 |  = L^T L
     | 1  2  3 |   | 0  0  3 |

By convention, the components of the upper triangular part of L are listed

column-wise to give θ = (1, 1, 2, 1, 2, 3)^T.
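This construction is easy to reproduce in S using the built-in chol function (a small sketch; the variable names are illustrative):

> A <- matrix(c(1, 1, 1,
+               1, 5, 5,
+               1, 5, 14), nrow = 3)
> L <- chol(A)                  # upper triangular factor with t(L) %*% L = A
> theta <- L[row(L) <= col(L)]  # column-wise upper triangle: 1 1 2 1 2 3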

6.1.2 Log-Cholesky Parametrization

If one requires the diagonal elements of L in the Cholesky factorization to be

positive then L is unique. In order to avoid constrained estimation, one can

use the logarithms of the diagonal elements of L. We call this parametrization

the log-Cholesky parametrization. It inherits the good computational proper-

ties of the Cholesky parametrization, but has the advantage of being uniquely

defined. As in the Cholesky parametrization the parameters also lack direct

interpretation in terms of the original variances and covariances, except for L11.

The log-Cholesky parametrization of A is θ = (0, 1, log(2), 1, 2, log(3))^T.
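Continuing the S sketch from the previous subsection, the log-Cholesky parameters are obtained by logging the diagonal entries of L, which occupy positions i(i + 1)/2 in the column-wise listing:

> diag.pos <- cumsum(1:3)                     # positions 1, 3, 6 hold diag(L)
> theta.lc <- theta
> theta.lc[diag.pos] <- log(theta[diag.pos])  # gives (0, 1, log(2), 1, 2, log(3))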

6.1.3 Spherical Parametrization

The purpose of this parametrization is to combine the computational efficiency

of the Cholesky parametrization with direct interpretation of θ in terms of the

variances and correlations in D.

Let Li denote the ith column of L in the Cholesky factorization of D and


li denote the spherical coordinates of the first i elements of Li. That is

[Li]1     = [li]1 cos([li]2)
[Li]2     = [li]1 sin([li]2) cos([li]3)
    ...
[Li]{i-1} = [li]1 sin([li]2) · · · cos([li]i)
[Li]i     = [li]1 sin([li]2) · · · sin([li]i)

It then follows that Dii = [li]1^2, i = 1, . . . , q and ρ1i = cos([li]2), i = 2, . . . , q,

where ρij denotes the correlation coefficient between bi and bj . The correlations

between other variables can be expressed as linear combinations of products

of sines and cosines of the elements in l1, . . . , lq, but the relationship is not as

straightforward as those involving b1. If confidence intervals are available for

the elements of li, i = 1, . . . , q then we can also obtain confidence intervals for

the variances and the correlations ρ1i. By appropriately permuting the rows and

columns of D, we can in fact obtain confidence intervals for all the variances

and correlations of b1, . . . , bq. The exact same reasoning can be applied to

derive profile traces and profile contours (Bates and Watts, 1988) for variances

and correlations of b1, . . . , bq.

In order to ensure uniqueness of the spherical parametrization we must have

[li]1 > 0, i = 1, . . . , q and [li]j ∈ (0, π), i = 2, . . . , q, j = 2, . . . , i

Unconstrained estimation is obtained by defining θ as follows

θi = log([li]1), i = 1, . . . , q, and

θ{q+(i-2)(i-1)/2+(j-1)} = log([li]j / (π − [li]j)), i = 2, . . . , q, j = 2, . . . , i

The spherical parametrization has about the same computational efficiency

as the Cholesky and log-Cholesky parametrizations, is uniquely defined, and

allows direct interpretability of θ in terms of the variances and correlations in

D.

The spherical parametrization of A is θ = (0, log(5)/2, log(14)/2, −0.608, −0.348, −0.787)^T.
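These numbers are simple to verify in S. For the second column of the Cholesky factor of A we have [l2]1 = sqrt(D22) and ρ12 = cos([l2]2), so that (a quick check, using only values from (6.1.2)):

> log(5)/2                  # the second element of theta
> ang <- acos(1/sqrt(5))    # [l2]2, since rho12 = 1/sqrt(1 * 5)
> log(ang/(pi - ang))       # approximately -0.608, the fourth element of theta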

6.1.4 Matrix Logarithm Parametrization

The next two parametrizations are based on the spectral decomposition of D.

Since D is p.d., it has q positive eigenvalues λ. Letting U denote the orthogonal

matrix of orthonormal eigenvectors of D and Λ = diag (λ), we can write

D = U Λ U^T    (6.1.3)

By setting

L = Λ^{1/2} U^T    (6.1.4)

in (6.1.1), where Λ^{1/2} denotes the diagonal matrix with [Λ^{1/2}]ii = √[Λ]ii, we

get a factorization of D based on the spectral decomposition.

The matrix logarithm of D is defined as log(D) = U log(Λ) U^T, where

log (Λ) = diag [log (λ)]. Note that D and log (D) share the same eigenvectors.

The matrix log (D) can take any value in the space of q × q symmetric ma-

trices and letting θ be equal to its upper triangular elements gives the matrix

logarithm parametrization of D.
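Computationally the map runs through the eigendecomposition in both directions. A minimal S sketch (logm is an illustrative helper, not a built-in function):

> logm <- function(D) {
+   e <- eigen(D, symmetric = TRUE)
+   e$vectors %*% diag(log(e$values)) %*% t(e$vectors)
+ }
> lA <- logm(matrix(c(1, 1, 1, 1, 5, 5, 1, 5, 14), 3, 3))  # log(A) of (6.1.2)
> theta <- lA[row(lA) <= col(lA)]    # column-wise upper triangle of log(A)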


The matrix logarithm parametrization defines a one-to-one mapping between

θ and D and therefore does not have the identification problems of the Cholesky

factorization. It does involve considerable calculations, as θ produces log (D)

whose eigenstructure must be determined before L in (6.1.4) can be calculated.

Similarly to the Cholesky and log-Cholesky parametrizations, the vector θ in the

matrix logarithm parametrization does not have a straightforward interpretation

in terms of the original variances and covariances in D. We note that even

though the matrix logarithm is based on the spectral decomposition of D, there

is not a straightforward relationship between θ and the eigenvalues-eigenvectors

of D.

The matrix logarithm of A is

           | -0.174  0.397  0.104 |
log(A) =   |  0.397  1.265  0.650 |
           |  0.104  0.650  2.492 |

and therefore the matrix logarithm parametrization of A is θ = (−0.174, 0.397, 1.265, 0.104, 0.650, 2.492)^T.

6.1.5 Givens Parametrization

The eigenstructure of D contains valuable information for determining whether

some linear combination of b1, . . . , bq could be regarded as constant. The Givens

parametrization uses the eigenvalues of D directly in the definition of the pa-

rameter vector θ.

The Givens parametrization is based on the spectral decomposition of D

given in (6.1.3) and the fact that the eigenvector matrix U can be represented


by q(q − 1)/2 angles, used to generate a series of Givens rotation matrices

(Thisted, 1988) whose product reproduces U as follows

U = G1 G2 · · · G{q(q-1)/2}, where

           |  cos(δi),  if j = k = m1(i) or j = k = m2(i)
           |  sin(δi),  if j = m1(i), k = m2(i)
Gi[j, k] = | -sin(δi),  if j = m2(i), k = m1(i)
           |  1,        if j = k ≠ m1(i) and j = k ≠ m2(i)
           |  0,        otherwise

and m1(i) < m2(i) are integers taking values in {1, . . . , q} and satisfying i =

m2(i)−m1(i)+ (m1(i) − 1) (q − m1(i)/2). In order to ensure uniqueness of the

Givens parametrization we must have δi ∈ (0, π), i = 1, . . . , q(q − 1)/2.

The spectral decomposition (6.1.3) is unique up to a reordering of the diag-

onal elements of Λ and columns of U . Uniqueness can be achieved by forcing

the eigenvalues to be sorted in ascending order. This can be attained, within

an unconstrained estimation framework, by using a parametrization suggested

by Jupp (1978) and defining the first q elements of θ as

θi = log (λi − λi−1) , i = 1, . . . , q,

where λi denotes the ith eigenvalue of D in ascending order, with the convention

that λ0 = 0. The remaining elements of θ in the Givens parametrization are

defined by the relation

θ{q+i} = log(δi / (π − δi)), i = 1, . . . , q(q − 1)/2.
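Inverting these definitions is direct. A short S sketch for q = 3, using the Givens parameters of A given at the end of this subsection:

> q <- 3
> theta <- c(-0.275, 0.761, 2.598, -0.265, -0.562, -0.072)
> lambda <- cumsum(exp(theta[1:q]))            # ascending eigenvalues of A
> th.ang <- theta[(q + 1):(q * (q + 1)/2)]
> delta <- pi * exp(th.ang)/(1 + exp(th.ang))  # rotation angles in (0, pi)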


The main advantage of this parametrization is that the first q elements of

θ give information about the eigenvalues of D directly. Another advantage of

the Givens parametrization is that it can be easily modified to handle general

(not necessarily p.d.) symmetric matrices. The only modification needed is to

set θ1 = λ1 and

λi = θ1 + Σ_{j=2}^{i} exp(θj), i = 2, . . . , q.

The main disadvantage of this parametrization is that it involves consider-

able computational effort in the calculation of D from the parameter vector

θ. Another problem with the Givens parametrization is that one cannot relate

θ to the elements of D in a straightforward manner, so that inferences about

variances and covariances require indirect methods.

The eigenvector matrix U in (6.1.3) can also be expressed as a product of

a series of Householder reflection matrices (Thisted, 1988) and these in turn

can be derived from q(q − 1)/2 parameters used to obtain the directions of the

Householder reflections. This Householder parametrization is essentially equiv-

alent to the Givens parametrization in terms of statistical interpretability, but

it is less efficient, since the derivation of the Householder reflection matrices in-

volves even more computation than the Givens rotations. We did not consider it here.

The Givens parametrization of A is θ = (−0.275, 0.761, 2.598, −0.265, −0.562, −0.072)^T.


6.2 Comparing the Parametrizations

In this section we compare the parametrizations described in section 6.1 in

terms of their computational efficiency and the statistical interpretability of the

individual parameters.

The computational efficiency of the different parametrizations is assessed

using simulation results. First we analyze the average time needed to calculate

L (θ) from θ for each parametrization and for varying sizes of L. Then we

compare the performance of the different parametrizations in computing the

maximum likelihood estimate of the variance-covariance matrix in a linear mixed

effects model.

To evaluate the average time needed to calculate L, we generated 25 random

q × q matrices Z whose elements were i.i.d. random variables with uniform

distribution in (0, 1) for q varying from 5 to 100, obtained D = ZT Z and then

θ, and recorded the average time to calculate L. Since the user times were too

small for matrices of dimension less than 10, we used 5 evaluations of L at each

user time calculation. Figure 6.2.1 presents the average user time as a function

of q for each of the parametrizations of D.

The Cholesky, the log-Cholesky, and the spherical parametrizations have

similar performances, considerably better than the other two parametrizations.

The matrix logarithm had the worst performance, followed by the Givens param-

etrization. These results are essentially reflecting the computational complexity

of each parametrization, as described in section 6.1.

Figure 6.2.1: Average user time to calculate L as a function of q, for the different parametrizations of D (Matrix Log, Givens, Cholesky, log-Cholesky, Spherical). Plot (a) shows the behavior of the average user time for q ≤ 40 and plot (b) for q up to 100.

In order to compare the different parametrizations in an estimation context, we conducted a small simulation study using the linear mixed effects model

yi = Xi(β + bi) + εi,  i = 1, . . . , m    (6.2.1)

where the bi are i.i.d. N (0, σ2D) random effects and the εi are i.i.d. Nni(0, σ2I)

error terms independent of the bi, with ni representing the number of obser-

vations on the ith cluster. Lindstrom and Bates (1988) have shown that the

loglikelihood corresponding to (6.2.1) can be profiled to produce a function of

D alone. We used, in the simulation, D matrices of dimensions 3 and 6. These

were defined such that the nonzero elements of the ith column of the correspond-

ing Cholesky factor were equal to {1, 2, . . . , i}. For q = 3 we have D = A, as

given in (6.1.2). For q = 3 we used m = 10, ni = 15, i = 1, . . . , 10, σ2 = 1, and

β = (10, 1, 2)^T, while for q = 6 we used m = 50, ni = 25, i = 1, . . . , 50, σ2 = 1, and β = (10, 1, 2, 3, 4, 5)^T. In both cases, the elements of the first column of X

were set equal to 1 and the remaining elements were generated according to a

U (1, 20) distribution. A total of 300 and 50 samples were generated respectively

for q = 3 and q = 6, and the number of iterations and the user time to calculate

the maximum likelihood estimate of D for each parametrization were recorded.
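The simulated D matrices are easy to reconstruct; for q = 3 the recipe reproduces A of (6.1.2) (a sketch, with illustrative names):

> q <- 3
> L <- matrix(0, q, q)
> for (i in 1:q) L[1:i, i] <- 1:i  # column i of the factor is (1, ..., i, 0, ...)
> D <- t(L) %*% L                  # equals A in (6.1.2) when q = 3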

Figures 6.2.2 and 6.2.3 present the box-plots of the number of iterations and

user times for the various parametrizations. The Cholesky, the log-Cholesky,

the spherical, and the matrix logarithm parametrizations had similar perfor-

mances for q = 3, considerably better than the Givens parametrization. For

q = 6 the Cholesky and the matrix logarithm parametrizations gave the best

performances, followed by the log-Cholesky and spherical parametrizations, all

considerably better than the Givens parametrization. Since D is relatively

small in these examples, the numerical complexity of the different parametriza-

tions did not play a major role in their performances. It is interesting to note

that even though the matrix logarithm is the least efficient parametrization in

terms of numerical complexity, it had the best performance in terms of number

of iterations and user time to obtain the maximum likelihood estimate of D,

suggesting that this parametrization is the most numerically stable.

Another important aspect in which the parametrizations should be compared

has to do with their behavior as D approaches singularity. All parametrizations

described in section 6.1 require D to be positive definite, though the Givens

parametrization can be modified to handle general symmetric matrices. It is

usually an important statistical issue to test whether D is rank deficient, in which case the dimension of the parameter space can be reduced.

Figure 6.2.2: Box-plots of user time and number of iterations to convergence for 300 random samples of model (6.2.1) with D of dimension 3.

As D approaches singularity its determinant goes to zero and so at least one of the diagonal elements of its Cholesky factor goes to zero too. The Cholesky parametrization would then become numerically unstable, since equivalent so-

lutions would get closer together in the estimation space. At least one element

of θ in the log-Cholesky parametrization would go to −∞ (the logarithm of

the diagonal element of L that goes to zero). In the spherical parametrization

we would also have at least one element of θ going in absolute value to ∞: if

the first diagonal element of L goes to zero, θ1 → −∞; otherwise at least one

angle of the spherical coordinates of the column of L whose diagonal element

approaches 0 would either approach 0 or π, in which cases the corresponding

element of θ would go respectively to −∞ or ∞.

Singularity of D implies that at least one of its eigenvalues is zero. The

Givens parametrization would then have at least the first element of θ going

to −∞. To understand what happens with the matrix logarithm parametriza-

tion when D approaches singularity we note that letting (λ1, u1), . . . , (λq, uq)

represent the eigenvalue-eigenvector pairs corresponding to D, we can write D = Σ_{i=1}^{q} λi ui ui^T. As λ1 → 0, all entries of log(D) corresponding to nonzero elements of u1 u1^T would converge in absolute value to ∞. Hence in the matrix logarithm parametrization we could have all elements of θ going either to −∞ or ∞ as D approached singularity.

Figure 6.2.3: Box-plots of user time and number of iterations to convergence for 50 random samples of model (6.2.1) with D of dimension 6.

Finally we consider the statistical interpretability of the parametrizations of

D. The least interpretable parametrization is the matrix logarithm — none of

its elements can be directly related to the individual variances, covariances, or

eigenvalues of D. The Cholesky and log-Cholesky parametrizations have the

first component directly related to the variance of b1, the first underlying random

variable in D. By permuting the order of the random variables in the definition

of D, one can derive measures of variability and confidence intervals for all the

variances in D, from corresponding quantities obtained for the parameters in the

Cholesky or log-Cholesky parametrizations. The Givens parametrization is the

only one considered here that uses the eigenvalues of D directly in the definition

130

of θ. It is a very useful parametrization for identifying ill-conditioning of D.

None of its parameters, though, can be directly related to the variances and

covariances in D. Finally, the spherical parametrization is the one that gives the

largest number of interpretable parameters of all parametrizations considered

here. Measures of variability and confidence intervals for all the variances in D

and the correlations with b1 can be obtained from the corresponding quantities

calculated for θ. By permuting the order of the underlying random variables in

the definition of D, one can in fact derive measures of variability and confidence

intervals for all the variances and correlations in D.
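Concretely, the permutation argument amounts to no more than reordering the rows and columns of D before reparametrizing. A short S illustration with A of (6.1.2):

> A <- matrix(c(1, 1, 1, 1, 5, 5, 1, 5, 14), 3, 3)
> perm <- c(2, 1, 3)     # move b2 into the first position
> A[perm, perm]          # Var(b2) = 5 is now the (1, 1) element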

6.3 Conclusions

The parametrizations described in section 6.1 allow the estimation of variance-

covariance matrices using unconstrained optimization. This has numerical and

statistical advantages over constrained optimization, since the latter is usually

a much harder numerical problem. Furthermore unconstrained estimates tend

to have better inferential properties.

Of the five parametrizations considered here, the spherical parametrization

presents the best combination of performance and statistical interpretability of

individual parameters. The Cholesky and log-Cholesky parametrizations have

comparable performances, similar to the spherical parametrization, but lack

direct parameter interpretability. The Givens parametrization is considerably

less efficient than these parametrizations, but has the feature of being directly

based on the eigenvalues of the variance-covariance matrix. This can be used,

for example, to identify nonrandom linear combinations of the underlying ran-

dom variables. The matrix logarithm parametrization is very inefficient as the

dimension of the variance-covariance matrix increases, but seems to be the most

stable parametrization. It also lacks direct interpretability of its parameters.

Different parametrizations can be used at different stages of the data analy-

sis. The matrix logarithm parametrization seems to be the most efficient for the

optimization step, at least for moderately large D. The spherical parametriza-

tion is probably the best one to derive measures of variability and confidence

intervals for the elements of D, while the Givens parametrization is the most

convenient to investigate rank deficiency of D.

Chapter 7

Mixed Effects Models Methods

and Classes for S

In this chapter we describe a set of S functions, classes, and methods (Chambers

and Hastie, 1992) for the analysis of mixed effects models. These extend the lin-

ear and nonlinear modeling facilities available in release 3 of S and S-plus. The

source code, written in S and C using an object-oriented approach, is available

in the S collection at StatLib. Details on how to obtain this and other soft-

ware from StatLib can be found in Newton (1993). Help files for all functions

described here are included in Appendix B.

Section 7.1 presents the functions and methods for fitting and analyzing

linear mixed effects models. The nonlinear mixed effects functions and methods

are described in section 7.2. Section 7.3 presents our conclusions and some

future directions for the code development.


7.1 The lme class and related methods

The functions and methods for the linear mixed effects model will be described

here through the analysis of data on a dental study presented in Potthoff and Roy (1964). The data, displayed in Figure 7.1.1, consist of four measurements

of the distance (in millimeters) from the centre of the pituitary to the ptery-

omaxillary fissure made at ages 8, 10, 12, and 14 years for 16 boys and 11 girls.

A linear model seems adequate to explain the distance as a function of age, but

the intercept and slope seem to vary with the individual. The corresponding

linear mixed effects model is

dij = (β0 + bi0) + (β1 + bi1) agej + εij , i = 1, . . . , 27, j = 1, . . . , 4 (7.1.1)

where dij represents the distance for the ith individual at age j, β0 and β1 are

respectively the fixed intercept and the fixed slope, bi0 and bi1 are respectively

the random intercept and the random slope corresponding to the ith individual,

and εij is the cluster error term. It is assumed that the bi = (bi0, bi1)T are i.i.d.

with a N (0, σ2D) distribution and the εij are i.i.d. with a N (0, σ2) distribution,

independent of the bi.

One of the questions of interest for these data is to determine whether

there are significant differences between boys and girls with respect to distance

growth. Model (7.1.1) can be modified to test for sex related differences in

intercept and slope

dij = (β00 + β01sexi + bi0) + (β10 + β11sexi + bi1) agej + εij (7.1.2)

where sexi is an indicator variable assuming value zero if the ith individual is a boy and one if she is a girl. β00 and β10 represent the fixed intercept and slope for the boys and β01 and β11 the (fixed) increments in intercept and slope associated with girls. Differences between boys and girls can be evaluated by testing whether β01 and β11 are significantly different from zero. The remaining terms in (7.1.2) are defined as in (7.1.1).

Figure 7.1.1: Distance from the centre of the pituitary to the pteryomaxillary fissure in boys and girls at different ages.

It will be assumed here that the

data is available in a data.frame called dental, with columns distance, age,

subject, and sex as below

> dental
    distance age subject sex
1       26.0   8       1   0
2       25.0  10       1   0
3       29.0  12       1   0
4       31.0  14       1   0
. . .
105     24.5   8      27   1
106     25.0  10      27   1
107     28.0  12      27   1
108     28.0  14      27   1

7.1.1 The lme function

The lme function is used to fit the general linear mixed effects model, described

in chapter 2, using either maximum likelihood or restricted maximum likelihood.

Several optional arguments can be used with this function, but the typical call

is

lme(fixed, random, cluster, data)

The first three arguments are required. The fixed and random arguments are formulas

defining the fixed and random effects part of the model. Any linear model

formula (Chambers and Hastie, 1992) is allowed, giving the model formulation

considerable flexibility. For the dental data these formulas would be written as

fixed = distance ~ age, random = ~ age

for model (7.1.1) and

fixed = distance ~ age * sex, random = ~ age

for model (7.1.2). Note that the response variable is defined only in the fixed

formula.

The cluster argument is a formula, or expression, defining the labels of the

different subjects in the data. For the dental data we would use

cluster = ~ subject

136

for both models (7.1.1) and (7.1.2). Note that the cluster formula has no left

hand side. The optional argument data specifies the data frame in which the

variables used in the model are available. A simple call to lme to fit model (7.1.1)

would be

> dental.fit1 <- lme(fixed = distance ~ age, random = ~ age,
+                    cluster = ~ subject, data = dental)

and to fit model (7.1.2) we would use

> dental.fit2 <- lme(fixed = distance ~ age * sex, random = ~ age,
+                    cluster = ~ subject, data = dental)

The fitted objects returned by lme are of class lme, for which several methods

are available, including those for the generic functions print, summary, and

plot.

7.1.2 The print, summary, and anova methods.

A brief description of the estimation results can be obtained through the print

method. This only gives the estimates for the standard errors and correla-

tions of the random effects, the cluster variance, and the fixed effects. For the

dental.fit1 object we get

> dental.fit1
Call:
  Fixed: distance ~ age
  Random: ~ age
  Cluster: ~ subject
  Data: dental

Variance/Covariance Components Estimates:

 Structure: logcholesky
 Standard Deviation(s) of Random Effect(s)
  (Intercept)       age
     2.194103 0.2149245
 Correlation of Random Effects
     (Intercept)
 age  -0.5814881

 Cluster Residual Variance: 1.716204

Fixed Effects Estimates:
 (Intercept)       age
    16.76111 0.6601852

Number of Observations: 108
Number of Clusters: 27

A more complete description of the estimation results is obtained with summary.

> summary(dental.fit2)
. . .
Loglikelihood: -114.6576
AIC: 245.3152

Variance/Covariance Components Estimates:
 Structure: logcholesky
 Standard Deviation(s) of Random Effect(s)
  (Intercept)       age
     2.134464 0.1541247
 Correlation of Random Effects
     (Intercept)
 age  -0.6024329

 Cluster Residual Variance: 1.716232

Fixed Effects Estimates:
                  Value Approx. Std.Error  z ratio(C)
 (Intercept)  16.3406250       0.98005731  16.6731321
 age           0.7843750       0.08275189   9.4786353
 sex           1.0321023       1.53545472   0.6721802
 age:sex      -0.3048295       0.12964730  -2.3512218

 Conditional Correlations of Fixed Effects Estimates
         (Intercept)        age        sex
 age      -0.8801554
 sex      -0.6382847  0.5617897
 age:sex   0.5617897 -0.6382847 -0.8801554
. . .

The approximate standard errors for the fixed effects are derived using the

asymptotic theory described in chapter 3. The results above indicate that the

distance grows faster in boys than in girls (significant, negative age:sex fixed

effect), but they have the same average initial distance (non significant sex fixed

effect).

A likelihood ratio test to evaluate the hypothesis of no sex differences in

distance development is available with the anova method.

> anova(dental.fit1, dental.fit2)
. . .
            Model Df    AIC  Loglik    Test Lik.Ratio   P value
dental.fit1     1  6 252.72 -120.36
dental.fit2     2  8 245.32 -114.66 1 vs. 2    11.406 0.0033365

The likelihood ratio test strongly rejects the null hypothesis of no sex differences. In order to test if only the growth rate is dependent on sex, using a likelihood ratio test, we can fit

> dental.fit3 <- lme(fixed = distance ~ age + age:sex, random = ~ age,
+                    cluster = ~ subject, data = dental)

and use the anova method again.

> anova(dental.fit2, dental.fit3)
. . .
            Model Df    AIC  Loglik    Test Lik.Ratio P value
dental.fit2     1  8 245.32 -114.66
dental.fit3     2  7 243.76 -114.88 1 vs. 2   0.44806 0.50326

As expected, the likelihood ratio test indicates that the initial distances do not

depend on sex.

139

7.1.3 The plot method

Plots of random effects estimates, residuals, and fitted values can be obtained

using the plot method for class lme. The following call will produce a scatter

plot of the intercept and slope random effects estimates in model (7.1.2), as

shown in Figure 7.1.2.

> plot(dental.fit2, levels = c(0.5, 0.75, 0.9, 0.95))

The optional levels argument specifies the approximate coverage probabilities

for the random effects density contours to be included in the plot.

Figure 7.1.2: Scatter plot of the conditional modes of the intercept and slope random effects in model (7.1.2). Dashed lines represent the approximate 50%, 75%, 90%, and 95% random effects density contours.

The point at the upper left corner of Figure 7.1.2 appears to be an outlying

value that is possibly having a great impact on the correlation and variance

estimates.

Residual plots may be obtained by setting the argument option in the plot

method to "r".

> plot(dental.fit3, option = "r")

The resulting plots are included in Figure 7.1.3. The first plot, observed

versus fitted values, indicates that the linear model does a reasonable job of

explaining the distance growth. The points fall relatively close to the y = x line,

indicating a reasonable agreement between the fitted and observed values. The

second plot, residuals versus fitted values, suggests the presence of three outliers

in the data. The remaining residuals appear to be homogeneously scattered

around the y = 0 line. The final plot, featuring the boxplot of the residuals by

subject, suggests that the outliers occurred for subjects 9 and 13. There seems

to be considerable variation in the within subjects variability, but it must be

remembered that the boxplots represent only four residual values.

7.1.4 Other methods

Standard S methods for extracting components of fitted objects, such as resid-

uals, fitted, and coefficients, can also be used on lme objects. The first

two methods return data frames with two columns, population and cluster,

while the last one returns a list with two components, the random and the fixed

effects estimates. A more detailed description of these objects is available in

the help files, included in Appendix B.
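For example, using the column and component names described above, one might extract (a short sketch based on the fits created earlier in this section):

> r.clu <- residuals(dental.fit3)$cluster    # within-cluster residuals
> f.pop <- fitted(dental.fit3)$population    # population fitted values
> b.hat <- coefficients(dental.fit3)$random  # random effects estimates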

Figure 7.1.3: Residuals and fitted values plots.

Estimates of the individual parameters are obtained using the cluster.coef

method.

> cluster.coef(dental.fit3)
   (Intercept)       age    age:sex
1     18.27436 0.8322535 -0.2281247
2     15.48918 0.7357740 -0.2281247
3     16.18725 0.7418871 -0.2281247
. . .
27    19.21336 0.8370006 -0.2281247

Predicted values are obtained using the predict method. For example, if

we are interested in predicting the average distance for boys and girls at ages


14, 15, and 16, as well as for subjects 1 and 20 at age 13, we should create a

new data frame, say dental.new, as follows

> dental.new <-
+   data.frame(sex = c(1, 1, 1, 0, 0, 0, 1, 0),
+              age = c(14, 15, 16, 14, 15, 16, 13, 13),
+              subject = c(NA, NA, NA, NA, NA, NA, 1, 20))

and then use

> predict(dental.fit3, ~ subject, dental.new)
  cluster fit.cluster fit.population
1      NA                   24.11111
2      NA                   24.63611
3      NA                   25.16111
4      NA                   27.30486
5      NA                   28.05798
6      NA                   28.81111
7       1    26.12804       23.58611
8      20    25.05424       26.55173

to get the cluster and population predictions.

7.2 The nlme class and related methods

The functions and methods for the nonlinear mixed effects model will be de-

scribed here through the analysis of the CO2 uptake data. These data, shown

in Figure 7.2.1 and described in Potvin and Lechowicz (1990), come from a

biological study aimed at analyzing the cold tolerance of a C4 grass species,

Echinochloa crus-galli. A total of twelve four-week-old plants, six from Quebec

and six from Mississippi, were divided into two groups: control plants that

stayed at 26◦C and chilled plants that were subject to 14 h of chilling at 7◦C.

After 10 h of recovery at 20◦C, CO2 uptake rates (in µmol/m2s) were measured


for each plant at seven concentrations of ambient CO2 (100, 175, 250, 350, 500,

675, 1000µL/L). Each plant was subjected to the seven concentrations of CO2 in

increasing, consecutive order. The objective of the experiment was to evaluate

the effect of plant type and chilling treatment on the CO2 uptake.

Figure 7.2.1: CO2 uptake rates (in µmol/m2s) for Quebec and Mississippi plants of Echinochloa crus-galli, control and chilled, at different ambient CO2 concentrations.

The model used in Potvin and Lechowicz (1990) is

Uij = φ1i {1 − exp [−φ2i (Cj − φ3i)]} + εij (7.2.1)

where Uij denotes the CO2 uptake rate of the ith plant at the jth CO2 ambient

concentration; φ1i, φ2i, and φ3i denote respectively the asymptotic uptake rate,

the uptake growth rate, and the maximum ambient CO2 concentration at which

no uptake occurs for the ith plant; Cj denotes the jth ambient CO2 level;

and the εij are i.i.d. error terms with distribution N (0, σ2).

It will be assumed here that the CO2 uptake data is available in a data.frame

called CO2, with columns plant, type, trt, conc, and uptake as below

   plant        type        trt conc uptake
1      1      Quebec nonchilled   95   16.0
2      1      Quebec nonchilled  175   30.4
3      1      Quebec nonchilled  250   34.8
. . .
83    12 Mississippi    chilled  675   18.9
84    12 Mississippi    chilled 1000   19.9

7.2.1 The nlme function

The nlme function is used to fit the nonlinear mixed effects model (cf. chapter

4) using either maximum likelihood or restricted maximum likelihood. Several

optional arguments can be used with this function, but a typical call is

nlme(model, fixed, random, cluster, data, start)

The model argument is required and consists of a formula specifying the

nonlinear model to be fitted. Any S nonlinear formula can be used, giving the

function considerable flexibility. From (7.2.1) we have that for the CO2 uptake

data this argument is declared as

uptake ~ A * (1 - exp(-B * (conc - C)))

where we have used the notation A = φ1, B = φ2, and C = φ3. Alternatively,

we can define an S function, say co2.uptake, as follows

> co2.uptake <- function(A, B, C, conc) A * (1 - exp(-B*(conc - C)))

and write the model argument as


uptake ~ co2.uptake(A, B, C, conc)

The advantage of this latter approach is that the analytical derivatives of the

model function can be passed to the nlme function as the gradient attribute

of co2.uptake and used in the optimization algorithm. The S function deriv

can be used to create expressions for the derivatives.

> co2.uptake <- deriv(~ A * (1 - exp(-B * (conc - C))),
+                     LETTERS[1:3], function(A, B, C, conc) {})

If the model function does not have a gradient attribute, numerical derivatives

are used instead.
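The gradient attribute produced by deriv can be inspected directly; a quick check with illustrative argument values:

> val <- co2.uptake(30, 0.01, 50, c(100, 250))
> attr(val, "gradient")  # derivatives of the model with respect to A, B, and C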

The required arguments fixed and random are lists of formulas that define

the structures of the fixed and random effects in the model. In these formulas

a . on the right hand side of a formula indicates that a single parameter is

associated with the effect, but any linear formula in S could be used instead.

This gives considerable flexibility to the model, as time-dependent parameters

can be easily incorporated (e.g. when a formula in the fixed list involves a co-

variate that changes with time). Usually every parameter in the model will have

an associated fixed effect, but it may, or may not, have an associated random

effect. Since we assumed that all random effects have mean zero, the inclusion

of a random effect without a corresponding fixed effect would be unusual. Note

that the fixed and random formulas could be directly incorporated in the model

declaration. The approach used in nlme allows for more efficient calculation of

derivatives and will be useful for update methods that will be incorporated in

the code in the future.

For the CO2 uptake data, if we want to fit a model in which all parameters

are random and no covariates are included we use

fixed = list(A ~ ., B ~ ., C ~ .), random = list(A ~ ., B ~ ., C ~ .)

If we want to estimate the effects of plant type and chilling treatment on

the parameters in the model we can use

fixed = list(A ~ type*trt, B ~ type*trt, C ~ type*trt),
random = list(A ~ ., B ~ ., C ~ .)

The cluster argument is required and defines the cluster label of each

observation. An S expression or a formula with no left hand side can be used

here. The optional data argument names a data frame, and start provides

a list of starting values for the iterative algorithm. Only the fixed effects starting

estimates are required. The default value for the random effects is zero and

starting estimates for the variance-covariance matrix of the random effects (D)

and the cluster variance (σ2) are automatically generated using a formula given

in Laird, Lange and Stram (1987) if they are not supplied. Further information

on the arguments of nlme is available in the help files in Appendix B.

A simple call to nlme to fit model (7.2.1), without any covariates and with

all parameters random is

> co2.fit1 <-
+   nlme(model = uptake ~ co2.uptake(A, B, C, conc),
+        fixed = list(A ~ ., B ~ ., C ~ .),
+        random = list(A ~ ., B ~ ., C ~ .),
+        cluster = ~ plant, data = CO2,
+        start = list(fixed = c(30, 0.01, 50)))

The initial values for the fixed effects were obtained from Potvin and Lechowicz

(1990).


7.2.2 The nlme methods

Objects returned by the nlme function are of class nlme which inherits from

lme. All methods described in section 7.1 are also available for the nlme class.

In fact, with the exception of the predict method, all methods are common to

both classes. We illustrate their use here with the CO2 uptake data.

The print method provides a brief description of the estimation results.

This only gives the estimates for the standard errors and correlations of the

random effects, the cluster variance, and the fixed effects.

> co2.fit1
Call:
  Model: uptake ~ co2.uptake(A, B, C, conc)
  Fixed: list(A ~ ., B ~ ., C ~ .)
  Random: list(A ~ ., B ~ ., C ~ .)
  Cluster: ~ plant
  Data: CO2

Variance/Covariance Components Estimates:

 Structure: logcholesky
 Standard Deviation(s) of Random Effect(s)
         A           B        C
  9.510373 0.001152827 11.39466
 Correlation of Random Effects
             A           B
 B -0.06187818
 C  0.99998745 -0.06192643

 Cluster Residual Variance: 3.129989

Fixed Effects Estimates:
        A          B        C
 32.55042 0.00944257 41.61764

Number of Observations: 84
Number of Clusters: 12


Note that there is a very strong correlation between the φ1 and the φ3 random

effects and these are almost uncorrelated with the φ2 random effect. The

scatter plot matrix of the random effects is obtained using the plot method

> plot(co2.fit1, levels = c(0.5, 0.75, 0.9, 0.95))

and is shown in Figure 7.2.2.

Figure 7.2.2: Scatter plot of the conditional modes of the φ1, φ2, and φ3 random effects in model (7.2.1). Dashed lines represent the approximate 50%, 75%, 90%, and 95% random effects density contours.

It is clear that the φ1 and φ3 random effects are virtually identical. This correlation may be due to the fact that the plant type

and the chilling treatment, which were not included in the co2.fit1 model, are

affecting φ1 and φ3 in the same way.


One of the main advantages of having the code defined within the S en-

vironment is that all the analytical and graphical machinery present in S is

simultaneously available. We can use these to analyze the dependence of the

individual parameters φ1i, φ2i, and φ3i in model (7.2.1) on plant type and chill-

ing factor. Initially we create a data.frame with the conditional modes of the

random effects obtained in the first fit.

> CO2.random <- data.frame(coef(co2.fit1)$random)

Then we add a column to CO2.random with the treatment combinations corre-

sponding to each plant.

> CO2.random$type.trt <- as.factor(rep(c("Quebec nonchilled",
+     "Quebec chilled", "Mississippi nonchilled",
+     "Mississippi chilled"), rep(3, 4)))

Finally we obtain plots of the conditional modes of the random effects versus the

treatment combinations. The corresponding plots are presented in Figure 7.2.3.

> plot(A ~ type.trt, data = CO2.random)
> plot(B ~ type.trt, data = CO2.random)
> plot(C ~ type.trt, data = CO2.random)

These plots indicate that chilled plants tend to have smaller values of φ1 and

φ3, but the Mississippi plants seem to be much more affected than the Quebec

plants, suggesting an interaction effect between plant type and chilling treat-

ment. There is no clear pattern of dependence between φ2 and the treatment

factors, suggesting that this parameter is not significantly affected by either

plant type or chilling treatment. We can then fit a new model in which φ1 and

φ3 depend on the treatment factors, as below.

Figure 7.2.3: Boxplots of the conditional modes of the φ1, φ2, and φ3 random effects in model (7.2.1) by plant type and chilling treatment combination.

> co2.fit2 <-
+   nlme(model = uptake ~ co2.uptake(A, B, C, conc),
+        fixed = list(A ~ type*trt, B ~ ., C ~ type*trt),
+        random = list(A ~ ., B ~ ., C ~ .), cluster = ~ plant, data = CO2,
+        start = list(fixed = c(30, 0, 0, 0, 0.01, 50, 0, 0, 0)))

We can use the summary method to get more detailed information on the esti-

mation results of the new fitted object.

> summary(co2.fit2)
. . .

Convergence at iteration: 6
Approximate Loglikelihood: -103.5041
AIC: 239.0082

Variance/Covariance Components Estimates:
Structure: logcholesky

Standard Deviation(s) of Random Effect(s)
A.(Intercept)            B C.(Intercept)
     2.276278 0.0003200845      5.981132

Correlation of Random Effects
              A.(Intercept)            B
B              -0.008043761
C.(Intercept)   0.999984502 -0.008100170

Cluster Residual Variance: 3.127764

Fixed Effects Estimates:
                       Value Approx. Std.Error z ratio(C)
A.(Intercept)   32.452100011      0.7225786330  44.911513
A.type          -7.909764880      0.7024079993 -11.260927
A.trt           -4.231594577      0.7009980593  -6.036528
A.type:trt      -2.434420834      0.7010132656  -3.472717
B                0.009545959      0.0005908485  16.156356
C.(Intercept)   39.936295607      5.6567839253   7.059894
C.type         -10.469319722      4.2166574898  -2.482848
C.trt           -7.975396202      4.1963538181  -1.900554
C.type:trt     -12.360984497      4.2249903799  -2.925683
. . .

Note that the correlation between the φ1 and the φ3 random effects remains


very high, suggesting that the model is probably overparametrized and fewer

random effects are needed. We will not pursue the model building analysis of

the CO2 uptake data here, since our main goal is to illustrate the use of

the methods for the nlme class and not to present a thorough analysis of the

problem.

In order to compare the fits corresponding to the objects co2.fit1 and

co2.fit2 we can use the anova method.

> anova(co2.fit1, co2.fit2)
. . .

         Model Df    AIC  Loglik    Test Lik.Ratio    P value
co2.fit1     1 10 268.44 -124.22
co2.fit2     2 16 239.01 -103.50 1 vs. 2     41.43 2.3824e-07

We see that the inclusion of plant type and chilling treatment in the model

caused a substantial increase in the loglikelihood, indicating that they have a

significant effect on φ1 and φ3.

Diagnostic plots can be obtained by using the r option of the plot method

> plot(co2.fit2, option = "r")

The corresponding plot is presented in Figure 7.2.4. The first plot, observed

versus fitted values, indicates that the model fits the data well — most points

lie close to the y = x line. The second plot, residuals versus fitted values, does

not indicate any departures from the assumptions in the model — no outliers

seem to be present and the residuals are symmetrically scattered around the

y = 0 line, with constant spread for different levels of the fitted values.

Figure 7.2.4: Residuals and fitted values plots.

Predictions are obtained through the predict method. For example, to

obtain the population predictions of CO2 uptake rate for Quebec and Missis-

sippi plants under chilling and no chilling, at ambient CO2 concentrations of

50, 100, 200, and 500 µL/L, we would first define


> CO2.new <-
+   data.frame(type = rep(c("Quebec", "Mississippi"), c(8, 8)),
+              trt = rep(rep(c("chilled", "nonchilled"), c(4, 4)), 2),
+              conc = rep(c(50, 100, 200, 500), 4))

and then use

> predict(co2.fit2, CO2.new)
   population
1   0.05852781
2  11.88120535
. . .
15 28.92219633
16 38.01456512

to obtain the predictions.


The predict method can also be used for plotting smooth fitted curves by

calculating fitted values at closely spaced concentrations. Figure 7.2.5 presents

the individual fitted curves for all twelve plants using a total of 200 concentra-

tions between 50 and 1000 µL/L.
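A minimal sketch of how such curves might be computed (the grid size and
object names below are ours; as in the example above, predict returns the
population predictions for a new data frame, while individual curves would
additionally require the cluster information):

> conc.grid <- seq(50, 1000, length = 200)
> CO2.grid <- data.frame(type = rep("Quebec", 200),
+                        trt = rep("chilled", 200), conc = conc.grid)
> plot(conc.grid, predict(co2.fit2, CO2.grid), type = "l")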


Figure 7.2.5: Individual fitted curves for the twelve plants in the CO2 uptake
data based on the co2.fit2 object.

7.3 Conclusions

The classes and methods described here provide tools for analyzing linear and

nonlinear mixed effects models. As they are defined within the S environment,


all the powerful analytical and graphical machinery present in S is simultane-

ously available. The analyses of the dental data and CO2 uptake data illustrate

some of the available features, but many others are available as well.

The code presented here was developed to handle primarily repeated mea-

sures data, i.e. data generated by observing a number of clusters repeatedly

under varying experimental conditions. More general mixed effects models (e.g.

with different levels of nesting) can be analyzed using the functions described

here, but the code will not be computationally efficient for that purpose.

There are several directions in which the software can be expanded to handle

more general mixed effects models and/or incorporate other estimation tech-

niques. These include, but are not limited to,

• Mixed effects models with autocorrelated cluster errors (Chi and Reinsel,

1989). The current version of the code only handles the i.i.d. case;

• More accurate approximations to the loglikelihood in the nonlinear mixed

effects model (cf. chapter 5). These include Laplacian and Gaussian

quadrature approximations to the integral that defines the likelihood of

the data in the nonlinear mixed effects model. The current version uses

an alternating algorithm suggested by Lindstrom and Bates (1990);

• Profiling methods (Bates and Watts, 1988) for deriving confidence regions

on the parameters in the model and assessing the normality of the param-

eter estimates. These methods are computationally intensive, especially

for the nonlinear mixed effects model, and efficient programming is needed

to make them feasible to use;

• Update methods for refitting the model when only small changes in the


original calling sequence are necessary. These methods are particularly use-

ful for model building, when several similar models are fitted sequentially;

• Methods for deriving confidence and prediction intervals for predicted val-

ues.

We plan to incorporate all these features in future releases of the software

to be contributed to the S collection at StatLib. The autocorrelation structure

for the cluster errors has already been incorporated in an experimental version

currently undergoing tests. C code to calculate Laplacian and Gaussian quadra-

ture approximations to the integral in the nonlinear mixed effects model has

been developed, but needs to be incorporated into the S code. For the profiling

methods we plan to use a linear mixed effects approximation to the marginal

density in the nonlinear mixed effects model, suggested in Lindstrom and Bates (1990),

to speed up the calculations.

Chapter 8

Model Building in Mixed Effects

Models

Model building in mixed effects models involves questions that do not have a

parallel in (fixed effects) linear and nonlinear models. Some of these questions

are:

• determining which effects should have an associated random component

and which should be purely fixed;

• using covariates to explain cluster-to-cluster parameter variability;

• using structured random effects variance-covariance matrices (e.g. diago-

nal matrices) to reduce the number of parameters in the model.

In this chapter we consider strategies for addressing these questions in the con-

text of nonlinear mixed effects models, though most of the techniques described

are also applicable to linear mixed effects models.


Any model building strategy is by nature iterative: a tentative model is

initially fitted and modified to generate possibly better models (according to

some goodness-of-fit criterion) and the process is repeated until no further im-

provements are possible. In comparing alternative models one must also analyze

the residuals from the fit, checking for departures from the assumptions in the

model. It is also highly recommended that any model building analysis be done

in conjunction with experts in the field of application of the model, to ensure

the practical usefulness of the chosen model.

The use of the model building techniques described in this chapter is illus-

trated through the analysis of four real data examples. These data sets are

described in section 8.1. In section 8.2 we describe techniques that can be used

to model the variance-covariance matrix of the random effects and to choose

which random effects should be incorporated in the model. The use of covari-

ates to model cluster-to-cluster parameter variability is considered in section 8.3.

Our conclusions are included in section 8.4.

8.1 Examples

We make extensive use of real data examples to illustrate the model building

techniques presented in this chapter. We now introduce the data sets that will

be used throughout this chapter.

8.1.1 Pine Trees

The pine trees growth data are described in Kung (1986). A total of 14 sources

(seeds) of Loblolly pine were planted in the southern United States and the

tree heights (in ft.) were measured at 3, 5, 10, 15, 20, and 25 years of age.


Figure 8.1.1 shows a plot of these data.


Figure 8.1.1: Loblolly pine heights at different ages.

Kung (1986) used a logistic curve to model the trees’ growth, but an asymptotic

regression model seems to explain the observed growth pattern better. We also

tried the logistic model, the Gompertz model, the Morgan, Mercer, and Flodin

model, and the Weibull type model (Ratkowsky, 1990), but the asymptotic

regression gave the best overall fit. This model can be expressed as

f(t, φ) = φ1 − φ2 exp (−φ3t) (8.1.1)

where t denotes the tree’s age, φ1 the asymptotic height, φ2 the difference be-

tween φ1 and the height at age zero, and φ3 the growth rate.
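In S, model (8.1.1) translates directly into a model function (a sketch; the
name asympReg is ours):

# asymptotic regression model (8.1.1):
#   phi1 = asymptotic height, phi2 = phi1 minus the height at age zero,
#   phi3 = growth rate
asympReg <- function(t, phi1, phi2, phi3)
    phi1 - phi2 * exp(-phi3 * t)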


8.1.2 Theophylline

The Theophylline data were described in section 5.1. We reproduce the plot of

the data in Figure 8.1.2.


Figure 8.1.2: Theophylline concentrations (in mg/L) of twelve patients over
time.

We recall from section 5.1 that a first order compartment model with absorption

in a peripheral compartment is used to represent the variation in the drug

concentration with time. The model equation is reproduced next

Ct = [DKka/(Cl(ka − K))] [exp (−Kt) − exp (−kat)] (8.1.2)

where Ct is the observed concentration at time t (mg/L), t is the time (hr),

D is the dose (mg/kg), Cl is the clearance (L/kg), K is the elimination rate

constant (1/hr), and ka is the absorption rate constant (1/hr). In order to


ensure positivity of the rate constants and the clearance, the logarithms of

these quantities can be used in (8.1.2), giving the reparametrized model

Ct = [D exp (lka + lK − lCl)/(exp (lka) − exp (lK))]
         × {exp [− exp (lK) t] − exp [− exp (lka) t]} (8.1.3)

where lCl = log(Cl), lka = log(ka), and lK = log(K).
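A sketch of the reparametrized model (8.1.3) as an S function (the name
theo.conc is ours):

# first-order compartment model (8.1.3); working with lka, lK, and lCl
# ensures positivity of ka, K, and Cl
theo.conc <- function(dose, time, lka, lK, lCl)
    dose * exp(lka + lK - lCl) / (exp(lka) - exp(lK)) *
        (exp(-exp(lK) * time) - exp(-exp(lka) * time))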

8.1.3 Quinidine

The third data set comes from a pharmacokinetics clinical study of the anti-

arrhythmic drug Quinidine. A total of 361 Quinidine concentration measurements

were made on 136 hospitalized patients under varying dosage regimens. Addi-

tional data were collected on a set of nine covariates: age, height, weight, race,

smoking status, ethanol abuse, congestive heart failure, creatinine clearance,

and α-1-acid glycoprotein concentration. Some of these covariates varied for

the same patient during the course of the study, while others remained con-

stant. One of the main objectives of the study was to investigate relationships

between the individual pharmacokinetics parameters and the covariates. A full

description of the data can be found in Verme, Ludden, Clementi and Harris

(1992). Statistical analyses of these data using alternative modeling approaches

are given in Davidian and Gallant (1993) and Wakefield (1993).

The model that has been suggested for the Quinidine data is the one-

compartment open model with first-order absorption. This model can be defined

in a recursive way as follows.

Suppose that, at time t, the patient receives a dose dt and prior to that

time the last dose was given at time t′. The expected concentration, Ct, and

the apparent concentration in the absorption compartment, Cat, are given by

Ct  = Ct′ exp [−K (t − t′)] + [Cat′ka/(ka − K)] {exp [−K (t − t′)] − exp [−ka (t − t′)]}
Cat = Cat′ exp [−ka (t − t′)] + dt/V (8.1.4)

where V represents the apparent volume of distribution and ka and K are

respectively the absorption and the elimination rate constants.

When a patient receives the same dose d at regular time intervals ∆, the

model (8.1.4) converges to the so-called steady state model, where the expected

concentrations are given by

Ct  = [dka/(V (ka − K))] [1/(1 − exp (−K∆)) − 1/(1 − exp (−ka∆))]
Cat = d/{V [1 − exp (−ka∆)]} (8.1.5)

Patients considered to be in steady state conditions have concentrations modeled

as above.

Finally, for a between-dosages time t, the model for the expected concentra-

tion Ct, given that the last dose was received at time t′, is identical to (8.1.4).

Using the fact that the elimination rate constant K is equal to the ra-

tio between the clearance (Cl) and the volume of distribution (V ), we can

reparametrize models (8.1.4) and (8.1.5) in terms of V , ka, and Cl.

In order to ensure that the estimates of V , ka, and Cl are positive, we can

rewrite models (8.1.4) and (8.1.5) in terms of lV = log(V ), lka = log(ka), and

lCl = log(Cl).

The initial conditions for the recursive model are C0 = 0 and Ca0 = d0/V ,

with d0 denoting the first dose received by the patient. It has been assumed


throughout the model definition that the bioavailability of the drug, i.e. the

fraction of the administered dose that reaches the measurement compartment,

is equal to one.
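For concreteness, the recursion (8.1.4) for a single patient can be sketched
in S as follows (the function name and argument layout are ours; for patients
in steady state conditions the formulas (8.1.5) would be used instead):

# one-compartment open model with first-order absorption, model (8.1.4);
# dose[i] is the dose received at time[i] and tobs >= max(time) is the
# time at which the expected concentration is wanted
quin.conc <- function(dose, time, tobs, lV, lka, lCl) {
    V <- exp(lV); ka <- exp(lka); K <- exp(lCl - lV)   # K = Cl/V
    C <- 0; Ca <- dose[1] / V                          # C0 = 0, Ca0 = d0/V
    for (i in seq(along = time)[-1]) {                 # update at each dose
        dt <- time[i] - time[i - 1]
        C  <- C * exp(-K * dt) +
            Ca * ka / (ka - K) * (exp(-K * dt) - exp(-ka * dt))
        Ca <- Ca * exp(-ka * dt) + dose[i] / V
    }
    dt <- tobs - time[length(time)]                    # last dose to tobs
    C * exp(-K * dt) + Ca * ka / (ka - K) * (exp(-K * dt) - exp(-ka * dt))
}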

8.1.4 CO2 Uptake

The last data set considered here is the CO2 uptake data described in section 7.2.

The data, presented in Figure 8.1.3, consist of measurements of CO2 uptake (in

µmol/m2s) for six Echinochloa crus-galli plants from Quebec and six plants

from Mississippi at seven different concentrations of ambient CO2. Half the

plants from each type were chilled before the measurements were taken, while

the other half stayed at room temperature.


Figure 8.1.3: CO2 uptake rates (in µmol/m2s) for Quebec and Mississippi plants
of Echinochloa crus-galli, control and chilled, at different ambient CO2
concentrations.


The nonlinear mixed effects model used to describe the CO2 uptake as a

function of the ambient CO2 concentration is

Uij = φ1i {1 − exp [−φ2i (Cj − φ3i)]} + εij (8.1.6)

where Uij denotes the CO2 uptake rate of the ith plant at the jth CO2 ambient

concentration; φ1i, φ2i, and φ3i denote respectively the asymptotic uptake rate,

the uptake growth rate, and the maximum ambient CO2 concentration at which

no uptake occurs for the ith plant; Cj denotes the jth ambient CO2 level;

and the εij are i.i.d. error terms with distribution N (0, σ2). The main purpose

of the study was to estimate the effect of the plant type (P ) and the chilling

treatment (T ) on the parameters φ1, φ2, and φ3.
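The deterministic part of (8.1.6) corresponds to the co2.uptake model
function appearing in the nlme calls of section 7.2; a sketch of such a
function (its actual definition is not reproduced in this dissertation) is

# CO2 uptake model (8.1.6): A = asymptotic uptake rate (phi1),
# B = uptake growth rate (phi2), C = maximum ambient concentration
# at which no uptake occurs (phi3)
co2.uptake <- function(A, B, C, conc)
    A * (1 - exp(-B * (conc - C)))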

8.2 Variance-Covariance Modeling

In this section we consider the questions of determining which parameters in

the model should have a random component and whether the scaled variance-

covariance matrix of the random effects (D) can be structured in a simpler

form, i.e. with fewer parameters than the unstructured form.

The first question that should be addressed in the analysis is choosing which

parameters should be random effects and which purely fixed effects. Our ap-

proach is to fit different prospective models and compare nested models us-

ing some information criterion statistics, e.g. the Akaike information criterion

(Sakamoto et al., 1986). One of the problems with this approach is deciding

which way to construct the nesting; from smaller to larger models, or the other


way around. Starting with a model where all parameters have associated ran-

dom effects and then removing unnecessary terms is probably the best strategy,

but may not be possible to implement if the model is badly overparametrized.

In these cases the variance-covariance matrix of the random effects may become

seriously ill-conditioned, making it difficult or impossible to converge. The

smaller to larger approach is another alternative in these cases, but has the dis-

advantage of the large number of models that may have to be fitted before the

desired one is found. There is yet another important aspect that is overlooked

by the model nesting approach: sometimes it is a linear combination of random

effects being treated as fixed that gives the best model reduction.

The strategy we suggest for choosing the random effects to be included in

the model is to start with all parameters as mixed effects, whenever no prior

information about the random effects variance-covariance structure is available

and convergence is possible. Then we examine the eigenvalues of the estimated

D matrix, checking if one, or more, are close to zero. The associated eigenvec-

tor(s) would then give an estimate of the linear combination of the parameters

that could be taken as fixed. We used the Akaike information criterion to decide

between alternative models, choosing the one with the smaller AIC.
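We recall the definition: for a model with p estimated parameters and
maximized loglikelihood L, AIC = −2L + 2p, so that smaller values represent a
better compromise between goodness-of-fit and parsimony. For example, the
co2.fit2 object of chapter 7 has L = −103.5041 with p = 16 parameters, giving
AIC = −2(−103.5041) + 2(16) = 239.0082.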

Small eigenvalues may arise when the relative magnitude of the scales of

the parameters in the model are quite different, without necessarily implying

overparametrization. Therefore we suggest using a normalized version of the

variance-covariance matrix that is scale invariant. There are different choices

of normalized D, the most common being the correlation matrix. This is not a

particularly good choice in the present context, since all random effects would

then have normalized variance equal to one and we would not be able to iden-

tify those with relatively small dispersion (which would be natural candidates


to be dropped from the model). Whenever the A and B matrices in (4.1.2)

are incidence-like matrices (i.e. with just one nonzero entry per row), a more

convenient choice of normalization is the coefficient of variation (CV) matrix

DCV with

[DCV ]ij = [D]ij / |βk(i)βk(j)| (8.2.1)

where βk represents the kth fixed effect and k(i), k(j) represent the indices of

the fixed effects associated with the ith and jth random effects. When the

nonzero elements of the ith row of A and B are equal to one, the ith diagonal

element of DCV is equal to the square of the coefficient of variation of φi. We

note that, in the majority of real life applications of model (4.1.1), A and B

will be incidence-like matrices.
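A minimal S sketch of the normalization (8.2.1) and of the eigenvalue
analysis described above, assuming the incidence-like case in which each
random effect has exactly one associated fixed effect, with the estimates
stored in D and beta:

D.cv <- D / abs(outer(beta, beta))    # coefficient of variation matrix (8.2.1)
ev <- eigen(D.cv)
ev$values                             # near-zero values indicate rank deficiency
v <- ev$vectors[, length(ev$values)]  # eigenvector of the smallest eigenvalue
w <- v / abs(beta)                    # converted back to the original scale
w / sqrt(sum(w^2))                    # normalized combination to be taken as fixed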

To illustrate the use of this method we consider the examples described in

section 8.1. We do not include any analyses of residuals here, but in all cases

they did not indicate violations of the model’s assumptions. All maximum

likelihood calculations in the examples were done using the nlme function in S

(cf. chapter 7).

8.2.1 Pine Trees

Assuming that all three parameters in model (8.1.1) have both a fixed and a

random component, the corresponding nonlinear mixed effects model is

yij = (β1 + bi1) − (β2 + bi2) exp [−(β3 + bi3) tj ] + εij (8.2.2)


The maximum likelihood estimates of the parameters are σ2 = 0.397,

β = (102.26, 110.85, 0.039)T ,

D =
    334.43   328.53      −0.15
    328.53   322.72      −0.14
     −0.15    −0.14  8.26 × 10−5 ,

and

DCV =
     0.032    0.029   −0.037
     0.029    0.026   −0.033
    −0.037   −0.033    0.054 .

The value of the loglikelihood at convergence is −34.618, corresponding to an

AIC of 89.236.

The eigenvalues of DCV , 0.1056, 0.0065, and 8.152 × 10−15, give a clear

indication of rank deficiency. The eigenvector corresponding to the smallest

eigenvalue, converted back to the original scale of the random effects and nor-

malized, is (0.7005,−0.7131,−0.0278)T , suggesting that the difference between

the first two random effects can be considered nonrandom. This can be checked

by reparametrizing model (8.1.1) as follows

yij = φ′1 + (φ′2 − φ′1) exp (−φ′3 tj) + eij (8.2.3)

where φ′1 and φ′3 continue to have the same interpretation as φ1 and φ3 in

the previous parametrization, but φ′2 now represents the height at age zero (i.e.

φ1−φ2). Using this reformulation of the model, with φ1 and φ3 as random effects,

we get the following estimates β = (102.253,−8.574, 0.039)T , σ2 = 0.400, and

D =
    251.060        −0.102
     −0.102   5.967 × 10−5 .

The loglikelihood at convergence is equal to

−34.630, corresponding to an AIC of 83.262, which is considerably smaller than


the AIC of model (8.2.2). The AIC values obtained by considering each of φ1,

φ2, and φ3 at a time as fixed in model (8.1.1) are respectively 86.201, 89.917,

and 89.078, all larger than the AIC of the reduced reparametrized model (8.2.3).

The eigenvalues of the DCV matrix corresponding to the reduced model

(8.2.3) are 0.058 and 0.005, suggesting that no further reduction in the num-

ber of random effects can be attained. If we refit the reduced model (8.2.3)

with either φ′1 or φ′

3 as a fixed effect, we get AIC values of 85.105 and 85.917

respectively, both larger than in the previous model. It is interesting to note

that the eigenvalues of D for model (8.2.3), 251.06 and 0.000018, could at first

suggest that further reductions were possible. In fact they are just reflecting the

different scales in which φ′1 and φ′

3 are measured (the eigenvector corresponding

to the smallest eigenvalue is (−0.0004,−0.9999)T ).

8.2.2 Theophylline

The Theophylline data give yet another example where convergence is attained

for the model in which all parameters are mixed effects. We refer to this model

as model I.

The AIC of model I is 124.03 and the MLE of the DCV matrix has eigen-

values 4.324, 0.019, and 2.031 × 10−7 indicating that the model is probably

overparametrized. The eigenvector corresponding to the smallest eigenvalue,

converted back to the original scale and normalized is (0.464, 0.020,−0.886)T ,

suggesting that the lCl random effect is approximately equal to twice the lK

random effect. Recalling that the volume of distribution (V) is equal to the

ratio between Cl and K, so that lV = lCl − lK, we see that lCl = 2lK implies

that lV − lK = lCl − 2lK has no random component, i.e. that the ratio between

V and K, which we will denote by R, is a fixed effect. The recommendation at

this point would be to contact a pharmacologist and check the plausibility of


this finding, as well as the interpretability of the parameter R. We will proceed

with the analysis here for the purpose of illustrating the use of the proposed

model building techniques.

We reparametrized model (8.1.3) in terms of lka, lK, and lR = log(R),

letting only the first two parameters be mixed effects. The AIC of this reduced

model, called model II, was 118.20, considerably smaller than the AIC of model

I. The eigenvalues of the estimated DCV matrix are 0.356 and 0.158 indicating

that no further linear combinations of random effects could be eliminated from

the model. In fact, if we remove either the lka or the lK random effect we get

AIC values of respectively 203.865 and 200.135, both substantially worse than

in model II.

It is interesting to compare model II with the models obtained by considering

each parameter at a time in model I as a fixed effect, to check if a more easily

understood model could be used. The AIC of the models considering each of

lCl, lka, and lK at a time as fixed effects are respectively 163.224, 194.189, and

125.446, all considerably larger than the AIC of model II. Note however that the

elimination of the lK random effect from model I has a much smaller impact

on the AIC value, than the elimination of either the lCl or the lka random

effects. This suggests that if one is willing to correct the overparametrization

problem by dropping one of the random effects from the model (and not a linear

combination of them, as was done in model II), lK would be the natural choice.

The estimated correlation between the lK and the lka random effects in

model II was −0.132, suggesting that the two random effects could be regarded

as independent and a diagonal D used. The AIC of this model (III) was 116.388

indicating that it should be preferred over the previous models. No further

reduction in the number of parameters in D could be obtained and we concluded


that model III was the most adequate.

8.2.3 Quinidine

The Quinidine data provide an example where convergence cannot be attained

for the model with all parameters as mixed effects, called model I. The data are

characterized by few observations on many patients: for 46 patients there is only

one observation of Quinidine concentration and for 32 patients only two. As a

consequence, the optimization of the loglikelihood for model I becomes a very

ill-conditioned numerical problem, with the optimizing algorithm alternating

between equivalent solutions (in terms of the value of the loglikelihood) without

ever converging.

Different strategies can be used to try to circumvent the nonconvergence

problem:

• try to achieve convergence using a diagonal D and examine the relative

variability of the random effects, investigating the possibility of eliminat-

ing one, or more, of them from the model;

• force convergence (e.g. letting the algorithm run until it reaches a pre-established

maximum number of iterations) and examine the corresponding DCV ma-

trix for rank deficiency;

• try to achieve convergence for models with a smaller number of random

effects.

Convergence could not be achieved even for a diagonal D and so the second

strategy was used here. We forced convergence after ten iterations of Lindstrom

and Bates’ alternating algorithm. The AIC of this forced convergence fit was


344.74. The eigenvalues of the DCV matrix were equal to 74.526, 0.032, and

1.363 × 10−9, suggesting that the model was overparametrized.

The eigenvector corresponding to the smallest eigenvalue, converted to the

original scale of the random effects and normalized, was (0.097, 0.415,−0.905)T ,

indicating that the lka random effect was about twice the lV random effect. In

terms of the original parameters, this is equivalent to assuming that R = ka/V²

is a fixed effect. As in the Theophylline example, the recommendation at this

point would be to consult a pharmacologist about the physical meaning, if any,

of the parameter R. It must be said, though, that in this case R seems to be

a rather awkward quantity and probably lacks any practical meaning. Most

likely the poor quality of the data is responsible for the convergence problems

in general, and the rank deficiency observed for DCV in particular. Therefore,

by using a reparametrization of model I in which R was incorporated as a fixed

effect, we would run the risk of overfitting low quality data. We decided to

follow a more conservative approach trying to solve the overparametrization

problem by removing each random effect at a time from model I.

Convergence was attained for the models in which each of lCl, lka, and lV

at a time were treated as fixed. The corresponding AIC values were respectively

501.925, 341.782, and 365.409, indicating that the model in which lka is the only

purely fixed effect, called model II, should be preferred. For the sake of com-

parison, we also fitted the reparametrized model in which R was incorporated

as the only purely fixed effect. The corresponding AIC was 338.620 and though

this suggests that the reparametrized model fits the data better than model II,

we will keep the latter for the reasons described in the previous paragraph.

The estimated standard deviations of the random effects in model II were


0.323 and 0.310 and the estimated correlation coefficient was 0.05. This sug-

gested that the random effects had approximately the same variance and were

uncorrelated. A multiple of the identity matrix was used to model D and the

AIC of this reduced model (III) was 338.205, considerably smaller than those

of both models I and II. No further reductions were possible, since, if we re-

moved either the lCl random effect or the lV random effect from model III, we

obtained AIC values of 497.968 and 339.799 respectively.

In section 8.3.1 we explore the use of covariates to explain cluster-to-cluster

variability observed for the lCl and the lV random effects.

8.2.4 CO2 Uptake

Convergence using the model with all parameters as mixed effects (called model

I) was attained for the CO2 uptake data. The eigenvalues of the DCV ma-

trix were 0.051, 0.005, and 3.201 × 10−7, suggesting that the model was over-

parametrized. The eigenvector corresponding to the smallest eigenvalue, con-

verted to the original scale of the random effects and normalized, was (0.683,

−0.00008,−0.730)T indicating that φ1 −φ3 was probably nonrandom. In terms

of the parameters in model (8.1.6), this implies that the difference between the

asymptotic CO2 uptake rate (φ1) and the maximum ambient concentration of

CO2 at which no uptake occurs (φ3) is a fixed effect. Graphically this implies

that if the asymptotic CO2 uptake rate of a given plant is ∆ units above that of

another plant, so will be the concentration at which no CO2 uptake is present.

The AIC of model I was 268.44.

Since the linear dependence between φ1 and φ3 in model I seems to have a

meaningful practical interpretation, we decided to fit the reduced model, called

model II, in which this dependence was incorporated. Instead of reparametrizing

the model though, we decided to set A = I,

B =
   1 0
   0 1
   1 0 ,

and bi = (b1i, b2i)T

in the model specification of φ, cf. (4.1.2). The reason for this alternative

formulation was that we wanted to preserve the same parameters as in the

original model. This allows an easier interpretation of the effects of plant type

and chilling treatment on the parameters. Note that φ1 and φ3 share the same

random effect in model II.
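In S the matrix B of model II is simply (how A and B are passed to the
fitting function is not shown here):

# phi1 and phi3 share the first random effect; phi2 has its own
B <- matrix(c(1, 0,
              0, 1,
              1, 0), nrow = 3, byrow = TRUE)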

The AIC of model II was 262.44, considerably better than that of model

I. The estimated correlation between the random effects in model II, −0.226,

suggested that a diagonal D could be used. The AIC of this reduced model

(III) was 260.62, indicating that it should be preferred over model II. No further

reductions were possible: convergence was not attained for the model with just

φ2 random and the AIC for the model with φ2 as a purely fixed effect was 261.95.

The CO2 uptake data is an example of a designed experiment in which most

of the variability in the random effects is related to differences in the treatment

effects. This issue will be explored in detail in section 8.3.2.

8.3 Covariate Modeling

In this section we consider the use of covariates to model random effects variabil-

ity. This variability can either be related to natural cluster-to-cluster variation,

or caused by differences in covariate values between and/or within clusters.

The first questions to be addressed in the covariate modeling process are the

determination of which variables are potentially useful in explaining random

effects variation and which random effects may have their variability explained


by covariates. This is probably best achieved by analyzing plots of the random

effects estimates versus the covariates, looking for trends and patterns. The

conditional modes of the random effects (Lindstrom and Bates, 1990) will be

used here for this purpose.

After the candidate covariates have been chosen, a decision has to be made

on how to test for their inclusion in the model. The number of extra parame-

ters to be estimated tends to grow considerably with the inclusion of covariates

and their associated random effects in the model. If the number of covari-

ates/random effects combinations is large, we suggest using a forward stepwise

type of approach in which covariates are included one at a time and the po-

tential importance of the remaining covariates is (graphically) assessed at each

step. The decision on whether or not to include a covariate can be based on the

AIC of the fits with and without it. Another question that has to be addressed

when including a covariate in the model, is which of the new parameters should

be random or purely fixed. We suggest using an approach similar to the one

described in section 8.2, for modeling the variance-covariance structure: when-

ever no prior information is available and convergence is possible, start with a

saturated model (in which all new parameters are random) and, by examining

the eigenstructure of the estimated D (or DCV ) matrix, search for plausible

structures with fewer parameters. We use the Quinidine and the CO2 uptake

data to illustrate this model building approach. We reiterate that any model

building strategy is not complete without a careful analysis of residuals and

expert advice. In all examples considered here the residual analyses did not

indicate departures from the assumptions in the model.


8.3.1 Quinidine

Figure 8.3.1 presents the scatter plots of the conditional modes of the lCl ran-

dom effect, based on model III of section 8.2.3, versus the available covariates

(when the covariate value changed over time, the mode was used). A loess

smoother (Cleveland, Grosse and Shyu, 1992) was included in the continuous

covariates plots to help the visualization of possible trends.


Figure 8.3.1: Conditional modes of the lCl random effect in model III versus
available covariates.

Clearance appears to decrease with α-1-acid glycoprotein concentration and age


and to increase with weight and height. There is also some evidence that clear-

ance decreases with severity of congestive heart failure and is smaller in Blacks

than in both Caucasians and Latins. Clearly the α-1-acid glycoprotein concen-

tration is the most important covariate for explaining the lCl cluster-to-cluster

variation and a straight line seems adequate to model the observed relationship.

Figure 8.3.2 presents the scatter plots of the conditional modes of the lV

random effect versus the available covariates. None of the covariates seems help-

ful in explaining the variability of this random effect and we did not pursue the

modeling of its variability any further.

Initially only the α-1-acid glycoprotein concentration was included in the

model to explain the lCl random effect variation according to a linear model.

In the notation of (4.1.2) this modification of models (8.1.4) and (8.1.5) is ac-

complished by writing

lClij = (β1 + bi1) + (β2 + bi2) glycoproteinij (8.3.1)
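In the nlme interface of chapter 7 this covariate model might be specified
along the following lines (a sketch mimicking the co2.fit2 call of
section 7.2; quin.model, glyco, patient, quinidine, and the starting values
are placeholder names of ours, and the covariance structure linking the
random effects is not expressed in this call):

> quin.fit4 <-
+   nlme(model = conc ~ quin.model(lV, lka, lCl, dose, time),
+        fixed = list(lV ~ ., lka ~ ., lCl ~ glyco),
+        random = list(lV ~ ., lCl ~ glyco),
+        cluster = ~ patient, data = quinidine,
+        start = list(fixed = c(5, 0.5, 2, 0)))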

All parameters were treated as mixed in this first attempt, called model IV,

but the random effects associated with lCl were assumed independent of the lV

random effect, preserving the covariance structure of model III. The AIC of this

model was 215.796 indicating a considerable gain in goodness-of-fit when com-

pared to model III (AIC of 338.205). Using the same strategy as in section 8.2

to model the variance-covariance matrix of the random effects, we selected a

model in which the lCl random effect was independent of the lV random effect,

the variance-covariance matrix of the lCl random effects was unstructured, but

the variances of the intercept term in the lCl random effect, bi1 in (8.3.1), and

the lV random effect were the same. The AIC of this model (V) was 213.788.


Figure 8.3.2: Conditional modes of the lV random effect in model III versus
available covariates.

Figures 8.3.3 and 8.3.4 present the scatter plots of the conditional modes of

the lCl random effect in model V versus the available covariates. These plots

indicate that the intercept random effect does not vary systematically with any

of the covariates, but the slope random effect tends to increase with weight and

height and is smaller among Blacks and patients with a previous history of con-

gestive heart failure. This suggests an interaction between the effects of these

covariates and the α-1-acid glycoprotein on the Quinidine clearance. At this


point expert advice would be needed to clarify the plausibility of this hypoth-

esis. Since this was not possible here, we proceeded with the model building

analysis just for the purpose of illustrating the use of the proposed methodology.


Figure 8.3.3: Conditional modes of the lCl intercept random effect in model V
versus available covariates.

Figure 8.3.4: Conditional modes of the lCl slope random effect in model V
versus available covariates.

Using the forward stepwise approach we included the interactions between α-

1-acid glycoprotein and weight (as a linear predictor), race (as an indicator

variable of Black/not Black status), and congestive heart failure (as an indica-

tor variable of previous/no previous history of congestive heart failure) in the


model, in this order. The same random effect variance-covariance structure as

in model V was used in all cases. The corresponding AIC values were respec-

tively 210.117, 204.556, and 199.893. In all three cases substantial reductions in

the AIC values were observed. The random effects plots of the last model (with

all three interactions with α-1-acid glycoprotein included) did not indicate any

other candidate covariates to be included in the model and we concluded that

the model was adequate.


8.3.2 CO2 Uptake Data

Figure 8.3.5 presents the plots of the conditional modes of the random effects of

model III in section 8.2.4 against plant type and using the chilling treatment as

a symbol. The plots indicate a strong relationship between the φ1/φ3 random

effect and both the plant type and the chilling treatment. Apparently the

Quebec plants have higher values of φ1 and φ3 than the Mississippi plants and

chilling the plants causes a reduction in both φ1 and φ3 that is more

pronounced in the Mississippi plants than in the Quebec plants, suggesting an

interaction between plant type and chilling treatment. The φ2 random effects

plot suggests a possible interaction between plant type and chilling treatment

with respect to their effect on φ2, but there is considerable variability in the

random effects estimates, making the statistical significance of this interaction

unclear.


Figure 8.3.5: Conditional modes of the φ1/φ3 random effect (a) and the φ2
random effect (b) in model III versus plant type, using chilling treatment as
a symbol.


We initially considered a model (IV) in which both φ1 and φ3 were written

as a full 2² factorial model in plant type and chilling treatment, with the inter-

cept and all treatment effects random and an unstructured variance-covariance

matrix. In order to keep consistency with the variance-covariance structure of

model III, the random effects were the same in φ1 and φ3. No covariates were

used for φ2 in this first model. As in model III, the φ1/φ3 random effects were

assumed independent of the φ2 random effect.

The AIC of model IV was 245.38, considerably smaller than the AIC of

model III. Analysis of the estimated D matrix of model IV indicated severe

overparametrization. Using the strategy described in section 8.2, we chose a

model (V) in which only the intercept of the φ1/φ3 random effect was random.

The AIC of model V was 230.63. No further terms could be dropped, nor any

covariates included in model V. Table 8.3.1 gives the AIC of the models obtained

by dropping each covariate term at a time from model V. Note that the AIC

increases for each of the covariate terms and hence none should be removed

from the model.

Table 8.3.1: AIC of models in which one covariate term was dropped from model V

Parameter   Coefficient      AIC
φ1          P             258.95
            T             246.10
            P × T         237.36
φ3          P             236.62
            T             233.01
            P × T         239.94


8.4 Conclusions

The analysis of the eigenstructure of the estimated variance-covariance matrix

(D) of the random effects is a useful tool to determine which terms in the model

should be random and which should be purely fixed. The estimated D matrix

also provides useful information to identify structured variance-covariance pat-

terns. Information criterion statistics, such as the AIC, can be used as guidelines

to model selection, but analysis of residuals and consultation with experts in

the field of application of the model should also be used.

The goodness-of-fit and interpretability of a mixed effects model can be

substantially enhanced through the inclusion of covariates to explain random

effects variability. Information criterion statistics can again be used for model

selection, together with analysis of residuals and expert advice.

We restricted ourselves here to mixed effects models in which the cluster

errors, ε in model (4.1.1), were i.i.d., but other covariance structures (e.g. au-

toregressive processes) can easily be incorporated into mixed models (Chi and

Reinsel, 1989; Lindstrom and Bates, 1990). There is usually a trade-off between

the number of random effects incorporated in the model and the complexity of

the cluster errors covariance structure (Jones, 1990). Further research is needed

in that area, especially under a model building perspective.

Chapter 9

Conclusions and Suggestions for

Future Research

Mixed effects models constitute a powerful tool for modeling dependence within

clustered data. They give an intuitive interpretation for the source and the

structure of the dependence and can easily handle the unbalanced data that are

frequently encountered in many areas of scientific investigation.

Despite their usefulness, mixed effects models remain a mystery for many

researchers who could benefit from their application. We believe that this un-

fortunate situation could be changed if easy-to-use and reliable software were

available for fitting and analyzing mixed effects models. This has been the main

goal of our research.

9.1 Conclusions

The set of S functions and methods described in chapter 7 constitutes a user-

friendly, efficient, flexible, and reliable implementation for the analysis of mixed


effects models. We hope this will facilitate access of researchers to these kinds

of models. This software has been available in the S collection at StatLib for

over a year now and has been successfully used by researchers from a broad

range of areas.

The efficiency of the code is partially explained by the use of a loose coupling

algorithm (Soo and Bates, 1992) that allows the size of the optimization problem

to increase only linearly with the number of clusters, instead of quadratically

as would otherwise be expected.

The parametrizations for variance-covariance matrices described in chapter 6

are also fundamental for the code’s efficiency and reliability, since they allow

the unconstrained estimation of the variance-covariance components, greatly

simplifying the optimization problem.

The asymptotic results for the linear mixed effects models, proven in chap-

ter 3, provide the justification for the common practice of using the information

matrix in conjunction with the normal distribution to derive confidence regions

on the parameters in the model. These results are also important in showing

that the estimates of the fixed effects and the variance-covariance components

are asymptotically uncorrelated. This probably explains the success of the al-

ternating algorithm used in the nonlinear mixed effects code and described in

section 5.1.1.

Different approximations to the loglikelihood function in the nonlinear mixed

effects model were analyzed in chapter 5. The alternating approximation (5.1.2)

suggested by Lindstrom and Bates (1990) and used in the software implemen-

tation, gives in general accurate and reliable estimation results. If more exact

results are needed, the Laplacian approximation (5.1.6), or the adaptive Gaus-

sian approximation (5.1.10), can be used instead. Possibly the best strategy


is to use a hybrid scheme in which the alternating algorithm would be used

to get good initial values for the more refined Laplacian or adaptive Gaussian

approximations.

Model building in mixed effects models constitutes an interesting and diffi-

cult topic. Several questions that do not have a parallel in fixed effects models

arise when one has to choose a mixed effects model. Possibly the most difficult

question is determining which parameters should be mixed effects and which

should be purely fixed. Other important questions are related to the use of

simpler structured variance-covariance matrices for the random effects and the

choice of covariates to explain cluster-to-cluster parameter variability. Strategies

for addressing these questions in the context of nonlinear mixed effects models,

based on the eigenstructure of the estimated variance-covariance matrix of the

random effects, were described in chapter 8. These strategies were illustrated

through the analysis of real data examples.

9.2 Future Research

Considerable research effort is currently dedicated to expanding the applicability

of and improving the estimation methods for mixed effects models. We include

here some topics for future research that were not covered in this dissertation.

9.2.1 Asymptotics

There are at least two directions in which the results given in chapter 3 can

be extended. We assumed in this dissertation that the error term variance-

covariance matrix Λ was of the form σ2I, but it may be of interest to consider

it to be of the more general form Λ = Λ(ρ) where ρ is a (low dimensional)


parameter vector, e.g. Λ may have an AR(1) structure (Chi and Reinsel, 1989).

We feel the asymptotic results given in chapter 3, such as the asymptotic nor-

mality, can be extended to the (restricted) maximum likelihood estimators of

the error term variance-covariance parameters ρ.

Another possible direction in which the asymptotic results given in chapter 3

can be extended is in (restricted) maximum likelihood estimation for nonlinear

mixed effects models. One of the difficulties with this extension comes from the

fact that the likelihood function usually does not have a closed form expression

in nonlinear mixed effects models. The approximations described in chapter 5

can be used as a starting point for investigating solutions to this problem.

9.2.2 Parametrizations

Only unstructured variance-covariance matrices were considered in chapter 6

but, in many applications of mixed effects models, structured matrices are used

instead (Jennrich and Schluchter, 1986). It is therefore important to derive

parametrizations for structured variance-covariance matrices that allow uncon-

strained estimation of the associated parameters. This issue is particularly im-

portant in mixed models that allow more general variance-covariance structures

for the error term, since these will usually correspond to structured variance-

covariance matrices. Unrestricted parametrizations are easily derived for sim-

pler structures, such as diagonal and compound symmetry, but are far from

trivial for more complex structures, such as generalized autoregressive matrices.

The asymptotic properties of the different parametrizations considered in

chapter 6 have not yet been studied and certainly constitute an interesting

research topic. It may be that some of the parametrizations give faster rates of

convergence to normality than others and this could be used as a criterion for

choosing among them.

9.2.3 Assessing Variability

Once a model has been chosen to represent the data, measures of variability

for the estimates and confidence regions on the model’s parameters are usually

needed for inferential purposes. Asymptotic theory can certainly be used at a

preliminary stage. These results have only been proven for the linear mixed

effects model (cf. chapter 3) but, at least as a first-order approximation,

could also be used for nonlinear mixed effects models.

More refined methods, such as likelihood profile traces and contours (Bates

and Watts, 1988), can also be used to assess the variability in the estimates,

but these will generally be computer intensive. A compromise between the

asymptotic and profiling methodologies is to use a linear approximation to the

loglikelihood, as in (5.1.2), to calculate the profile traces and contours. This

constitutes a considerably less intensive computational problem than profiling

the loglikelihood directly. Alternatively, if the fixed effects are the primary

object of interest, we could profile the penalized nonlinear least squares problem

corresponding to a fixed $D = \widehat{D}$.
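
As a sketch of what this involves, assuming the penalized criterion of the Lindstrom and Bates (1990) algorithm discussed in chapter 5 and writing $M$ for the number of clusters and $f_i$ for the model function of the $i$th cluster, the objective profiled over the components of $\beta$ would be

\[
\min_{\beta,\, b_1, \ldots, b_M} \; \sum_{i=1}^{M} \left[ \left\| y_i - f_i(\beta, b_i) \right\|^2 + \sigma^2\, b_i^T \widehat{D}^{-1} b_i \right],
\]

with $\widehat{D}$ held fixed at its estimate, so that each evaluation of the profile trace reduces to a penalized least squares fit rather than a full loglikelihood optimization.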

Another (computationally intensive) alternative to assess the variability in

the estimates is to use bootstrap methods (Efron and Tibshirani, 1993) to esti-

mate standard errors and confidence regions.
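
As an illustration, the following is a minimal S sketch of the cluster bootstrap, in which whole clusters are resampled with replacement so as to preserve the within-cluster dependence. The fitting function fitModel, assumed to take a data frame and return the vector of estimated fixed effects, and the name of the cluster column are hypothetical placeholders.

bootMixed <- function(data, cluster, fitModel, B = 200)
{
    ids <- unique(data[[cluster]])
    M <- length(ids)
    est <- matrix(NA, B, length(fitModel(data)))
    for (b in 1:B) {
        # resample whole clusters with replacement, relabeling
        # duplicated clusters so they are treated as distinct
        samp <- sample(ids, M, replace = TRUE)
        boot <- NULL
        for (j in 1:M) {
            piece <- data[data[[cluster]] == samp[j], ]
            piece[[cluster]] <- j
            boot <- rbind(boot, piece)
        }
        est[b, ] <- fitModel(boot)
    }
    # bootstrap standard errors are the column standard deviations
    apply(est, 2, function(x) sqrt(var(x)))
}

Retaining the matrix est, rather than just its column standard deviations, would also provide empirical quantiles from which confidence regions could be constructed.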

More research is needed to determine which methods provide the most reli-

able statistical results and to compare their relative computational performance.

Bibliography

Abramowitz, M. and Stegun, I. A. (1964). Handbook of Mathematical Functions

with Formulas, Graphs, and Mathematical Tables, Dover, New York.

Airy, G. B. (1861). On the Algebraical and Numerical Theory of Errors of

Observations and the Combinations of Observations, MacMillan, London.

Bates, D. M. and Watts, D. G. (1980). Relative curvature measures of nonlin-

earity, Journal of the Royal Statistical Society, Ser. B 42: 1–25.

Bates, D. M. and Watts, D. G. (1988). Nonlinear Regression Analysis and Its

Applications, Wiley, New York.

Beal, S. and Sheiner, L. (1980). The NONMEM system, American Statistician

34: 118–119.

Bennett, J. E. and Wakefield, J. C. (1993). Markov chain Monte Carlo for non-

linear hierarchical models, Technical Report TR-93-11, Statistics Section,

Imperial College.

Chambers, J. M. and Hastie, T. J. (eds) (1992). Statistical Models in S,

Wadsworth, Belmont, CA.


Chi, E. M. and Reinsel, G. C. (1989). Models for longitudinal data with random

effects and AR(1) errors, Journal of the American Statistical Association

84: 452–459.

Cleveland, W. S., Grosse, E. and Shyu, W. M. (1992). Local regression models,

in J. M. Chambers and T. J. Hastie (eds), Statistical Models in S, Wadsworth, Belmont, CA, chapter 8.

Crump, S. L. (1947). The Estimation of Variance in Multiple Classification,

PhD thesis, Department of Statistics, Iowa State University.

Davidian, M. and Gallant, A. R. (1992). Smooth nonparametric maximum

likelihood estimation for population pharmacokinetics, with application to

quinidine, Journal of Pharmacokinetics and Biopharmaceutics 20: 529–556.

Davidian, M. and Gallant, A. R. (1993). The nonlinear mixed effects model

with a smooth random effects density, Biometrika 80: 475–488.

Davidian, M. and Giltinan, D. M. (1993). Some simple methods for estimating

intraindividual variability in nonlinear random effects models, Biometrics

49: 59–73.

Davis, P. J. and Rabinowitz, P. (1984). Methods of Numerical Integration,

second edn, Academic Press, New York.

Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood

from incomplete data via the EM algorithm (with discussion: 22–37), Journal of the

Royal Statistical Society, Ser. B 39: 1–22.

Draper, N. R. and Smith, H. (1981). Applied Regression Analysis, 2nd edn,

Wiley, New York.


Efron, B. and Tibshirani, R. J. (1993). An Introduction to the Bootstrap, Chap-

man & Hall, New York.

Fisher, R. A. (1925). Statistical Methods for Research Workers, Oliver and

Boyd, London.

Gallant, A. R. and Nychka, D. W. (1987). Seminonparametric maximum like-

lihood estimation, Econometrica 55: 363–390.

Gelfand, A. E., Hills, S. E., Racine-Poon, A. and Smith, A. F. M. (1990). Illus-

tration of Bayesian inference in normal data models using Gibbs sampling,

Journal of the American Statistical Association 85(412): 972–985.

Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions

and the Bayesian restoration of images, IEEE Transactions on Pattern

Analysis and Machine Intelligence 6: 721–741.

Geweke, J. (1989). Bayesian inference in econometric models using Monte Carlo

integration, Econometrica 57: 1317–1339.

Golub, G. H. (1973). Some modified matrix eigenvalue problems, SIAM Review

15: 318–334.

Golub, G. H. and Welsch, J. H. (1969). Calculation of Gaussian quadrature

rules, Mathematics of Computation 23: 221–230.

Grizzle, J. E. and Allen, D. M. (1969). Analysis of growth and dose response

curves, Biometrics 25: 357–382.

Hartley, H. O. and Rao, J. N. K. (1967). Maximum likelihood estimation for

the mixed analysis of variance model, Biometrika 54: 93–108.


Harville, D. A. (1974). Bayesian inference for variance components using only

error contrasts, Biometrika 61: 383–385.

Harville, D. A. (1977). Maximum likelihood approaches to variance components

estimation and to related problems, Journal of the American Statistical

Association 72: 320–338.

Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains

and their applications, Biometrika 57: 97–109.

Henderson, C. R. (1953). Estimation of variance and covariance components,

Biometrics 9: 226–252.

Jennrich, R. I. and Schluchter, M. D. (1986). Unbalanced repeated measures

models with structural covariance matrices, Biometrics 42(4): 805–820.

Jones, R. H. (1990). Serial correlation or random subject effects, Communica-

tions in Statistics, Part B – Simulation and Computation 19: 1105–1123.

Jupp, D. L. B. (1978). Approximation to data by splines with free knots, SIAM

Journal of Numerical Analysis 15(2): 328–343.

Kung, F. H. (1986). Fitting logistic growth curve with predetermined carrying

capacity, ASA Proceedings of the Statistical Computing Section, pp. 340–343.

Laird, N., Lange, N. and Stram, D. (1987). Maximum likelihood computations

with repeated measures: Application of the EM algorithm, Journal of the

American Statistical Association 82: 97–105.

Laird, N. M. and Ware, J. H. (1982). Random-effects models for longitudinal

data, Biometrics 38: 963–974.


Lehmann, E. L. (1983). Theory of Point Estimation, Wiley, New York.

Leonard, T. and Hsu, J. S. J. (1993). Bayesian inference for a covariance matrix,

Annals of Statistics 21: 1–25.

Leonard, T., Hsu, J. S. J. and Tsui, K. W. (1989). Bayesian marginal inference,

Journal of the American Statistical Association 84: 1051–1058.

Liang, K.-Y. and Zeger, S. L. (1986). Longitudinal data analysis using general-

ized linear models, Biometrika 73: 13–22.

Lindsey, J. K. (1993). Models for Repeated Measurements, Oxford University

Press, New York.

Lindstrom, M. J. and Bates, D. M. (1988). Newton-Raphson and EM algorithms

for linear mixed-effects models for repeated-measures data, Journal of the

American Statistical Association 83: 1014–1022.

Lindstrom, M. J. and Bates, D. M. (1990). Nonlinear mixed effects models for

repeated measures data, Biometrics 46: 673–687.

Longford, N. T. (1993). Random Coefficient Models, Oxford University Press,

New York.

Mallet, A. (1986). A maximum likelihood estimation method for random coef-

ficient regression models, Biometrika 73(3): 645–656.

Mallet, A., Mentré, F., Steimer, J.-L. and Lokiec, F. (1988). Nonparametric

maximum likelihood estimation for population pharmacokinetics, with

applications to Cyclosporine, Journal of Pharmacokinetics and Biopharmaceutics 16: 311–327.


Miller, J. J. (1977). Asymptotic properties of maximum likelihood estimates in

the mixed model of the analysis of variance, Annals of Statistics 5: 746–762.

Newton, H. J. (1993). New developments in statistical computing, American

Statistician 47: 146–147.

Patterson, H. D. and Thompson, R. (1971). Recovery of inter-block information

when block sizes are unequal, Biometrika 58(3): 545–554.

Potthoff, R. F. and Roy, S. N. (1964). A generalized multivariate analysis of

variance model useful especially for growth curve problems, Biometrika

51: 313–326.

Potvin, C. and Lechowicz, M. J. (1990). The statistical analysis of ecophysio-

logical response curves obtained from experiments involving repeated mea-

sures, Ecology 71: 1389–1400.

Rao, C. R. (1973). Linear statistical inference and its applications, 2nd edn,

Wiley, New York.

Ratkowsky, D. A. (1990). Handbook of Nonlinear Regression Models, Marcel

Dekker, New York.

Sakamoto, Y., Ishiguro, M. and Kitagawa, G. (1986). Akaike Information Cri-

terion Statistics, D. Reidel Publishing Company, Holland.

Searle, S. R., Casella, G. and McCulloch, C. E. (1992). Variance Components,

Wiley, New York.

Seber, G. A. F. (1977). Linear Regression Analysis, Wiley, New York.


Sheiner, L. B. and Beal, S. L. (1980). Evaluation of methods for estimating pop-

ulation pharmacokinetic parameters. I. Michaelis-Menten model: Routine

clinical pharmacokinetic data, Journal of Pharmacokinetics and Biophar-

maceutics 8(6): 553–571.

Soo, Y.-W. and Bates, D. M. (1992). Loosely coupled nonlinear least squares,

Computational Statistics and Data Analysis 14: 249–259.

Thisted, R. A. (1988). Elements of Statistical Computing, Chapman & Hall,

London.

Thompson, W. A. (1962). The problem of negative estimates of variance com-

ponents, Annals of Mathematical Statistics 33: 273–289.

Tierney, L. and Kadane, J. B. (1986). Accurate approximations for posterior

moments and densities, Journal of the American Statistical Association

81(393): 82–86.

Tippett, L. H. C. (1931). The Methods of Statistics, Williams and Norgate,

London.

Verme, C. N., Ludden, T. M., Clementi, W. A. and Harris, S. C. (1992). Phar-

macokinetics of quinidine in male patients: A population analysis, Clinical

Pharmacokinetics 22: 468–480.

Vonesh, E. F. and Carter, R. L. (1992). Mixed-effects nonlinear regression for

unbalanced repeated measures, Biometrics 48: 1–18.

Wakefield, J. C. (1993). The Bayesian analysis of population pharmacokinetic

models, Technical Report TR-93-11, Statistics Section, Imperial College.


Wakefield, J. C., Smith, A. F. M., Racine-Poon, A. and Gelfand, A. E. (1994).

Bayesian analysis of linear and nonlinear population models using the

Gibbs sampler, Applied Statistics. Accepted for publication.

Weiss, L. (1971). Asymptotic properties of maximum likelihood estimators in

some nonstandard cases, I, Journal of the American Statistical Association 66: 345–350.

Weiss, L. (1973). Asymptotic properties of maximum likelihood estimators in

some nonstandard cases, II, Journal of the American Statistical Association 68: 428–430.

Wolf, D. A. (1986). Nonlinear Least Squares for Linear Compartment Models,

PhD thesis, University of Wisconsin–Madison.

Wolfinger, R., Tobias, R. and Sall, J. (1991). Mixed models: A future direction,

SUGI 16: 1380–1388.

Zeger, S. L., Liang, K.-Y. and Albert, P. S. (1988). Models for longitudinal data:

A generalized estimating equation approach, Biometrics 44: 1049–1060.

Appendix A

We prove here a series of lemmas used in the proofs of the theorems of chapter 3. Throughout the proofs we will let $\theta_0$ be an interior point of the parameter space $\Theta$ and assume that $\theta_1, \theta_2 \in N_n(\theta_0)$, with $N_n(\theta_0)$ as defined in theorem (3.1.1). We will denote by $\beta_k$, $\sigma_k$, $\Sigma_k$, $D_k$, and $D_A^k$ the quantities associated with $\theta_k$, $k = 0, 1, 2$.

Lemma A.1
\[
\|\beta_1 - \beta_0\|^2 \le p_0 g^2 / n_{p_1+1}^2, \qquad
\|\beta_1 - \beta_2\|^2 \le 4 p_0 g^2 / n_{p_1+1}^2 .
\]

proof: By definition of $N_n(\theta_0)$, $\|\beta_1 - \beta_0\|^2 = \sum_{i=1}^{p_0} (\beta_{1i} - \beta_{0i})^2 \le p_0 g^2 / n_{p_1+1}^2$. Also,
\[
\|\beta_1 - \beta_2\|^2 \le \|\beta_1 - \beta_0\|^2 + \|\beta_2 - \beta_0\|^2 + 2 \|\beta_1 - \beta_0\| \, \|\beta_2 - \beta_0\| \le 4 p_0 g^2 / n_{p_1+1}^2 .
\]

Lemma A.2 $\lambda_{\max}\left( \Sigma_0^{-1} U_i^j (U_i^j)^T \right) \le 1/\delta_0$, $i = 1, \ldots, r$, $j = 1, \ldots, q_i$, for some $\delta_0 = \delta_0(\sigma_0) > 0$.

proof: We note first that the eigenvalues of $\Sigma_0 = \sigma_0^2 I + Z D_A^0 Z^T$ are of the form $\sigma_0^2 + \lambda(Z D_A^0 Z^T)$ and, since $Z D_A^0 Z^T$ is positive semi-definite, it follows that $\lambda_{\min}(\Sigma_0) \ge \sigma_0^2$. Now let $\sigma_k$ be the $k$th diagonal element of $D$ and $G_{l(k)}$ the corresponding $G$ matrix, chosen so that $G_{l(k)} = U_i^j (U_i^j)^T$. Define $D_0^k(\delta) = D_0 - \delta \xi_k \xi_k^T$, where $\xi_k$ is a $q$-dimensional vector whose only nonnull element is a one at position $k$. By assumption $\theta_0$ is an interior point of $\Theta$ and so $D_0$ is positive definite. It follows that $\lambda_{\min}(D_0^k(0)) = \lambda_{\min}(D_0) > 0$. Since $\operatorname{trace}(D_0^k(\delta)) = \operatorname{trace}(D_0) - \delta \ge q \, \lambda_{\min}(D_0^k(\delta))$, it follows that $\delta > \operatorname{trace}(D_0) \Rightarrow \lambda_{\min}(D_0^k(\delta)) < 0$. Define $\delta_k = \min\{\delta > 0 : \lambda_{\min}(D_0^k(\delta)) = 0\}$. Note that, by the continuity of the minimum eigenvalue, we must have $\delta_k \in (0, \operatorname{trace}(D_0))$. Define now $\delta_0 = \min(\delta_1, \ldots, \delta_q, \sigma_0^2)$. Note that $\delta_0 > 0$ and $\lambda_{\min}(D_0^k(\delta_0)) \ge 0$, $k = 1, \ldots, q$. Let $\Sigma_0^k(\delta) = \sigma_0^2 I + Z D_A^{0k}(\delta) Z^T$ and note that by (2.3.1.2) $\Sigma_0 = \Sigma_0^k(\delta) + \delta G_{l(k)}$. From standard results on eigenvalues we have that
\[
\lambda_{\max}(\Sigma_0^{-1} G_{l(k)}) = \sup_{\|\xi\|=1} \frac{\xi^T G_{l(k)} \xi}{\xi^T \Sigma_0 \xi} = \sup_{\|\xi\|=1} \frac{\xi^T G_{l(k)} \xi}{\delta_0 \, \xi^T G_{l(k)} \xi + \xi^T \Sigma_0^k(\delta_0) \xi} \le 1/\delta_0,
\]
where we used the fact that $\xi^T \Sigma_0^k(\delta_0) \xi \ge \sigma_0^2 + \lambda_{\min}(D_0^k(\delta_0)) \|Z^T \xi\|^2$, which is greater than zero by construction. Note that $\lambda_{\max}(\Sigma_0^{-1}) = 1/\lambda_{\min}(\Sigma_0) \le 1/\sigma_0^2 \le 1/\delta_0$, so that the result also holds for $k = 0$ ($G_0 = I$).

Lemma A.3 $\max_k \left| \lambda_k(\Sigma_0^{-1} G_i) \right| \le 2/\delta_0$, $i = 0, \ldots, p_1$.

proof: Letting $l(i)$, $j(i)$, and $k(i)$ denote respectively the random effects class and the random effects indices associated with $G_i$ ($j(i) = k(i)$ when $\sigma_i$ is a variance term), we note that $G_i$ is either of the form $U_{l(i)}^{j(i)} (U_{l(i)}^{j(i)})^T$ (when $j(i) = k(i)$) or $U_{l(i)}^{j(i)} (U_{l(i)}^{k(i)})^T + U_{l(i)}^{k(i)} (U_{l(i)}^{j(i)})^T$ (when $j(i) \ne k(i)$). By the Cauchy–Schwarz and the triangle inequalities and lemma (A.2) we have that
\[
\left| \xi^T \Sigma_0^{-1/2} G_i (\Sigma_0^{-1/2})^T \xi \right|
\le 2 \left[ \xi^T \Sigma_0^{-1/2} U_{l(i)}^{j(i)} (U_{l(i)}^{j(i)})^T (\Sigma_0^{-1/2})^T \xi \right]^{1/2}
\left[ \xi^T \Sigma_0^{-1/2} U_{l(i)}^{k(i)} (U_{l(i)}^{k(i)})^T (\Sigma_0^{-1/2})^T \xi \right]^{1/2}
\le (2/\delta_0) \|\xi\|^2,
\]
where $\Sigma_0^{-1/2}$ denotes the Cholesky factor (Thisted, 1988) of $\Sigma_0^{-1}$. We note that $\Sigma_0^{-1} G_i$ and $\Sigma_0^{-1/2} G_i (\Sigma_0^{-1/2})^T$ share the same eigenvalues. To see that, let $u$ be an eigenvector of $\Sigma_0^{-1/2} G_i (\Sigma_0^{-1/2})^T$ with eigenvalue $\lambda$; then
\[
(\Sigma_0^{-1/2})^T \Sigma_0^{-1/2} G_i (\Sigma_0^{-1/2})^T u = \lambda (\Sigma_0^{-1/2})^T u .
\]
Letting $v = (\Sigma_0^{-1/2})^T u$ and noting that $\Sigma_0^{-1} = (\Sigma_0^{-1/2})^T \Sigma_0^{-1/2}$, we have that $v$ is an eigenvector of $\Sigma_0^{-1} G_i$ with eigenvalue $\lambda$. Conversely, let $v$ be an eigenvector of $\Sigma_0^{-1} G_i$ with eigenvalue $\lambda$. Since $\Sigma_0$ is positive definite, there exists $u \in \mathbb{R}^n$ such that $v = (\Sigma_0^{-1/2})^T u$ and
\[
(\Sigma_0^{-1/2})^T \Sigma_0^{-1/2} G_i (\Sigma_0^{-1/2})^T u = \lambda (\Sigma_0^{-1/2})^T u
\;\Rightarrow\;
\Sigma_0^{-1/2} G_i (\Sigma_0^{-1/2})^T u = \lambda u,
\]
and $u$ is an eigenvector of $\Sigma_0^{-1/2} G_i (\Sigma_0^{-1/2})^T$ with eigenvalue $\lambda$. It then follows that $\max_k |\lambda_k(\Sigma_0^{-1} G_i)| = \sup_{\|\xi\|=1} |\xi^T \Sigma_0^{-1/2} G_i (\Sigma_0^{-1/2})^T \xi| \le 2/\delta_0$.

Lemma A.4 There exists an $n_0 = n_0(\sigma_0)$ such that $\forall n \ge n_0$
\[
\lambda_{\max}\left( \Sigma_1^{-1} U_i^j (U_i^j)^T \right) \le 2/\delta_0, \qquad i = 1, \ldots, r, \; j = 1, \ldots, q_i .
\]

proof: As before, let $\sigma_k$ be such that $G_{l(k)} = U_i^j (U_i^j)^T$. By the definition of $\delta_0$, $\lambda_{\min}(D_0^k(\delta_0/2)) > 0$ and, by the continuity of the functions involved, $\exists\, \varepsilon_k = \varepsilon_k(\sigma_0) > 0$ such that
\[
\|\sigma_1 - \sigma_0\| < \varepsilon_k \;\Rightarrow\;
\left| \lambda_{\min}(D_1^k(\delta_0/2)) - \lambda_{\min}(D_0^k(\delta_0/2)) \right| < \lambda_{\min}(D_0^k(\delta_0/2))/2 .
\]
Let $\varepsilon_0 = \min_k \varepsilon_k$; it then follows that $\forall \theta_1 \in N_n(\theta_0)$ such that $\|\sigma_1 - \sigma_0\| < \varepsilon_0$ we have, for $k = 1, \ldots, q$,
\[
\lambda_{\max}(\Sigma_1^{-1} G_{l(k)}) = \sup_{\|\xi\|=1} \frac{\xi^T G_{l(k)} \xi}{\xi^T \Sigma_1 \xi}
= \sup_{\|\xi\|=1} \frac{\xi^T G_{l(k)} \xi}{(\delta_0/2) \, \xi^T G_{l(k)} \xi + \xi^T \Sigma_1^k(\delta_0/2) \xi} \le 2/\delta_0,
\]
since
\[
\xi^T \Sigma_1^k(\delta_0/2) \xi \ge \sigma_1^2 + \lambda_{\min}(D_1^k(\delta_0/2)) \|Z^T \xi\|^2 \ge (1/2) \lambda_{\min}(D_0^k(\delta_0/2)) \|Z^T \xi\|^2 > 0 .
\]
Now note that if $|\sigma_1^2 - \sigma_0^2| < \sigma_0^2/2$ then
\[
\lambda_{\max}(\Sigma_1^{-1}) = 1/\lambda_{\min}(\Sigma_1) \le 1/\sigma_1^2 \le 2/\sigma_0^2 \le 2/\delta_0 .
\]
Let $n_0 = n_0(\sigma_0)$ be the smallest integer $n$ such that $\max_{0 \le i \le p_1} (g_i(n)/n_i(n)) < \min(\varepsilon_0, \sigma_0^2/2)$. Note that such an $n_0$ always exists, since $g_i/n_i \to 0$. Then by the previous results it follows that
\[
\forall n \ge n_0, \quad \lambda_{\max}\left( \Sigma_1^{-1} U_i^j (U_i^j)^T \right) \le 2/\delta_0, \qquad i = 1, \ldots, r, \; j = 1, \ldots, q_i .
\]

Lemma A.5 There exists an $n_0 = n_0(\sigma_0)$ such that $\forall n \ge n_0$,
\[
\max_k \left| \lambda_k(\Sigma_1^{-1} G_i) \right| \le 4/\delta_0, \qquad i = 0, \ldots, p_1 .
\]

proof: The proof is identical to that of lemma (A.3), with lemma (A.4) replacing lemma (A.2) in the final inequalities.

Lemma A.6 There exists an $n_0 = n_0(\sigma_0)$ such that $\forall n \ge n_0$
\[
\lambda_{\max}(\Sigma_1^{-1} \Sigma_0) \le 2 \left( \lambda_{\max}(D_0)/\lambda_{\min}(D_0) + 1 \right).
\]

proof: Let $\varepsilon_0 = \varepsilon_0(\sigma_0) > 0$ be such that $\forall \theta_1 \in N_n(\theta_0)$ satisfying $\|\theta_1 - \theta_0\| < \varepsilon_0$ we have $|\lambda_{\min}(D_1) - \lambda_{\min}(D_0)| < \lambda_{\min}(D_0)/2$ and $|\sigma_1^2 - \sigma_0^2| < \sigma_0^2/2$. Define $n_0$ to be the smallest integer such that $\max_{0 \le i \le p_1} (g(n)/n_i(n)) < \varepsilon_0$. It then follows that $\forall n \ge n_0$
\[
\lambda_{\max}(\Sigma_1^{-1} \Sigma_0) = \sup_{\|\xi\|=1} \frac{\xi^T \Sigma_0 \xi}{\xi^T \Sigma_1 \xi}
\le \sup_{\|\xi\|=1,\, Z^T \xi \ne 0} \frac{\xi^T Z D_A^0 Z^T \xi}{\xi^T Z D_A^1 Z^T \xi} + \frac{\sigma_0^2}{\sigma_1^2}
\le \frac{\lambda_{\max}(D_0)}{\lambda_{\min}(D_1)} + 2
\le 2 \left( \frac{\lambda_{\max}(D_0)}{\lambda_{\min}(D_0)} + 1 \right).
\]
Using the exact same reasoning, we also have that for sufficiently large $n$
\[
\lambda_{\max}(\Sigma_1^{-1} \Sigma_2) \le 4 \left( \lambda_{\max}(D_0)/\lambda_{\min}(D_0) + 1 \right).
\]

Lemma A.7 $\max_k \left| \lambda_k\left( \Sigma_0^{-1} (\Sigma_1 - \Sigma_0) \right) \right| \le (1/g_3) \left( q/\lambda_{\min}(D_0) + 1/\sigma_0^2 \right)$.

proof: By definition, the maximum absolute eigenvalue of $\Sigma_0^{-1}(\Sigma_1 - \Sigma_0)$ is
\[
\sup_{\|\xi\|=1} \frac{\left| \xi^T (\Sigma_1 - \Sigma_0) \xi \right|}{\xi^T \Sigma_0 \xi}
\le \sup_{\|\xi\|=1,\, Z^T \xi \ne 0} \frac{\left| \xi^T Z (D_A^1 - D_A^0) Z^T \xi \right|}{\xi^T Z D_A^0 Z^T \xi} + \frac{|\sigma_1^2 - \sigma_0^2|}{\sigma_0^2} .
\]
Now note that
\[
\left| \xi^T (D_1 - D_0) \xi \right|
\le \sum_{i,j}^{q} \left| [D_1]_{ij} - [D_0]_{ij} \right| |\xi_i| |\xi_j|
\le (1/g_3) \left( \sum_{i=1}^{q} |\xi_i| \right)^2
\le (q/g_3) \|\xi\|^2 .
\]
Noting also that $|\sigma_1^2 - \sigma_0^2| \le 1/g_3$ and that $D_A^1 - D_A^0$ and $D_1 - D_0$ share the same eigenvalues with different multiplicities, the result follows immediately.

Lemma A.8 There exists an $n_0 = n_0(\sigma_0)$ such that $\forall n \ge n_0$
\[
\max_k \left| \lambda_k\left( \Sigma_1^{-1} (\Sigma_0 - \Sigma_1) \right) \right| \le (2/g_3) \left( q/\lambda_{\min}(D_0) + 1/\sigma_0^2 \right)
\]
and
\[
\max_k \left| \lambda_k\left( \Sigma_1^{-1} (\Sigma_2 - \Sigma_1) \right) \right| \le (4/g_3) \left( q/\lambda_{\min}(D_0) + 1/\sigma_0^2 \right).
\]

proof: It is an immediate consequence of the previous lemma and the fact that, for sufficiently large $n$, the smallest eigenvalue of $D_1$ may be bounded from below by $\lambda_{\min}(D_0)/2$ for any $\theta_1 \in N_n(\theta_0)$.

Lemma A.9 Let $\theta_0 \in \Theta$ and $\theta_1, \theta_2 \in N_n(\theta_0)$, let $\{A_n(\theta_1)\}$ be a sequence of positive semi-definite $n \times n$ matrices of rank $r_n(\theta_1)$, and let $\{a_n\}$ be a sequence of positive quantities going to infinity such that $\forall \theta_1 \in N_n(\theta_0)$ and $\forall n$ we have $r_n(\theta_1) \le a_n$ and $\lambda_{\max}(A_n(\theta_1)) \le M(\theta_0)$ for some nonnegative $M(\theta_0) < \infty$. It then follows that
\[
P_{\theta_2}\left( \frac{1}{a_n} \sup_{\theta_1 \in N_n(\theta_0)} (y - X\beta_2)^T \left[\Sigma_2^{1/2}\right]^{-1} A_n(\theta_1) \left[\Sigma_2^{1/2}\right]^{-T} (y - X\beta_2) > 2M(\theta_0) \right) \to 0 .
\]
Furthermore, if $\lambda_{\max}(A_n(\theta_1)) \le M_n(\theta_0)$, $\forall n$ and $\forall \theta_1 \in N_n(\theta_0)$, with $M_n(\theta_0) \to 0$, then
\[
\frac{1}{a_n} \left( \sup_{\theta_1 \in N_n(\theta_0)} (y - X\beta_2)^T \left[\Sigma_2^{1/2}\right]^{-1} A_n(\theta_1) \left[\Sigma_2^{1/2}\right]^{-T} (y - X\beta_2) \right) \stackrel{P_{\theta_2}}{\longrightarrow} 0 .
\]

proof: Under $\theta_2$, $y \sim N(X\beta_2, \Sigma_2)$ and so $z = \left[\Sigma_2^{1/2}\right]^{-T} (y - X\beta_2) \sim N_n(0, I)$. By assumption $A_n(\theta_1)$ has $n - r_n(\theta_1)$ zero eigenvalues and therefore we can write $A_n(\theta_1) = P_n(\theta_1) \Lambda_n(\theta_1) P_n^T(\theta_1)$, where $P_n(\theta_1)$ is an $n \times r_n(\theta_1)$ matrix whose columns are orthonormal eigenvectors of $A_n(\theta_1)$ corresponding to the nonzero eigenvalues and $\Lambda_n(\theta_1)$ is an $r_n(\theta_1) \times r_n(\theta_1)$ diagonal matrix with diagonal elements given by the nonzero eigenvalues of $A_n(\theta_1)$. It follows that $w_n = P_n^T(\theta_1) z$ follows a $N_{r_n(\theta_1)}(0, I)$ distribution and
\[
z^T A_n(\theta_1) z = w_n^T \Lambda_n(\theta_1) w_n \le \lambda_{\max}(A_n(\theta_1)) \|w_n\|^2 \le M(\theta_0) \|w_n\|^2 .
\]
Note that $\|w_n\|^2 \sim \chi^2_{r_n(\theta_1)}$. Let $F_k$ denote the distribution function of the chi-square distribution with $k$ degrees of freedom. It follows from the definition of that distribution that $F_{r_n(\theta_1)} \ge F_{a_n}$ and so
\[
P_{\theta_2}\left( \frac{1}{a_n} (y - X\beta_2)^T \left[\Sigma_2^{1/2}\right]^{-1} A_n(\theta_1) \left[\Sigma_2^{1/2}\right]^{-T} (y - X\beta_2) > 2M(\theta_0) \right)
\le 1 - F_{a_n}(2a_n) \le 2/a_n,
\]
with the last bound following from Tchebychev's inequality. Since $a_n$ does not depend on $\theta_1$, we have
\[
P_{\theta_2}\left( \frac{1}{a_n} \sup_{\theta_1 \in N_n(\theta_0)} (y - X\beta_2)^T \left[\Sigma_2^{1/2}\right]^{-1} A_n(\theta_1) \left[\Sigma_2^{1/2}\right]^{-T} (y - X\beta_2) > 2M(\theta_0) \right) \le 2/a_n \to 0 .
\]
Now if $\lambda_{\max}(A_n(\theta_1))$ is bounded by a sequence $\{M_n(\theta_0)\}$ going to zero, then for any given $\varepsilon > 0$
\[
P_{\theta_2}\left( \frac{1}{a_n} (y - X\beta_2)^T \left[\Sigma_2^{1/2}\right]^{-1} A_n(\theta_1) \left[\Sigma_2^{1/2}\right]^{-T} (y - X\beta_2) > \varepsilon \right)
\le 1 - F_{a_n}\left( \varepsilon a_n / M_n(\theta_0) \right) \le M_n(\theta_0)/\varepsilon \to 0,
\]
with the last bound again following from Tchebychev's inequality. Since the bound does not depend on $\theta_1$, the result is also true when we take the sup over $\theta_1 \in N_n(\theta_0)$.

Appendix B

This appendix contains detailed documentation for the functions, classes, and

methods described in chapter 7. Online documentation in S and S-PLUS is avail-

able for all of these functions, classes, and methods after correct installation of

the software that we contributed to StatLib.
