Topics in Mixed Effects Models
by
Jose Carlos Pinheiro
A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE
REQUIREMENTS FOR THE DEGREE OF
Doctor of Philosophy
(Statistics)
at the
UNIVERSITY OF WISCONSIN – MADISON
1994
Abstract
Mixed effects models have received a great deal of attention in the statistical
literature for the past forty years because of the flexibility they offer in handling
the unbalanced clustered data that arise in many areas of investigation. In this
dissertation we consider both linear and nonlinear mixed effects models under
maximum likelihood and restricted maximum likelihood estimation. We derive
the asymptotic distribution of both maximum likelihood and restricted maximum likelihood estimators in a general linear mixed effects model, under mild
regularity conditions. We study different approximations to the loglikelihood
function of nonlinear mixed effects models, comparing them with respect to their
accuracy and computational efficiency. We describe five different parametriza-
tions for variance-covariance matrices that ensure positive definiteness, while
leaving the estimation problem unconstrained, comparing them with respect to
their computational efficiency and statistical interpretability. We consider the
model building issue for mixed effects models, describing techniques for choosing
random effects to be incorporated in the model, using structured random effects
variance-covariance matrices, and using covariates to explain cluster-to-cluster
parameter variability. Finally we describe the S software we have developed for
analyzing linear and nonlinear mixed effects models and which we have con-
tributed to the StatLib collection.
Contents

Abstract

1 Introduction
1.1 Motivation
1.2 Linear Mixed Effects Models
1.3 Nonlinear Mixed Effects Models
1.4 Parametrizations for Variance-Covariance Matrices
1.5 Software Development
1.6 Model Building
1.7 Future Research

2 The Linear Mixed Effects Model
2.1 Model and Examples
2.2 Likelihood Estimation
2.3 Bibliographic Review

3 Asymptotic Results for the Linear Mixed Effects Model
3.1 Maximum Likelihood
3.1.1 Limit of φ5
3.1.2 Limit of φ4
3.1.3 Limit of φ3
3.1.4 Limit of φ2
3.1.5 Limit of φ1
3.2 Restricted Maximum Likelihood
3.3 Parametrized and/or Structured σ
3.4 Conclusions

4 The Nonlinear Mixed Effects Model
4.1 The Model
4.2 Orange Trees
4.3 Bibliographic Review

5 Approximations to the Loglikelihood in the Nonlinear Mixed Effects Model
5.1 Approximations to the Loglikelihood
5.1.1 Alternating Approximation
5.1.2 Laplacian Approximation
5.1.3 Importance Sampling
5.1.4 Gaussian quadrature
5.2 Comparing the Approximations
5.2.1 Orange Trees
5.2.2 Theophylline
5.2.3 Simulation Results
5.3 Conclusions

6 Parametrizations for Variance-Covariance Matrices
6.1 Parametrizations
6.1.1 Cholesky Parametrization
6.1.2 Log-Cholesky Parametrization
6.1.3 Spherical Parametrization
6.1.4 Matrix Logarithm Parametrization
6.1.5 Givens Parametrization
6.2 Comparing the Parametrizations
6.3 Conclusions

7 Mixed Effects Models Methods and Classes for S
7.1 The lme class and related methods
7.1.1 The lme function
7.1.2 The print, summary, and anova methods
7.1.3 The plot method
7.1.4 Other methods
7.2 The nlme class and related methods
7.2.1 The nlme function
7.2.2 The nlme methods
7.3 Conclusions

8 Model Building in Mixed Effects Models
8.1 Examples
8.1.1 Pine Trees
8.1.2 Theophylline
8.1.3 Quinidine
8.1.4 CO2 Uptake
8.2 Variance-Covariance Modeling
8.2.1 Pine Trees
8.2.2 Theophylline
8.2.3 Quinidine
8.2.4 CO2 Uptake
8.3 Covariate Modeling
8.3.1 Quinidine
8.3.2 CO2 Uptake Data
8.4 Conclusions

9 Conclusions and Suggestions for Future Research
9.1 Conclusions
9.2 Future Research
9.2.1 Asymptotics
9.2.2 Parametrizations
9.2.3 Assessing Variability

Bibliography

Appendix A

Appendix B
Chapter 1
Introduction
In this chapter we present an overview of the topics covered in this dissertation.
We discuss the motivation behind mixed effects models and describe briefly the
contents of each of the subsequent chapters.
1.1 Motivation
Mixed models were developed to handle clustered data and have been a topic
of increasing interest in Statistics for the past forty years. Clustered data can
be loosely defined as data in which the observations are grouped into disjoint
classes, called clusters, according to some classification criterion. Examples of
clustered data include split-plot designs in which the observations pertaining
to the same block form a cluster and repeated measures data in which several
observations are made sequentially on the same individual (cluster).
Observations in the same cluster usually cannot be considered independent
and mixed effects models constitute a convenient tool for modeling cluster de-
pendence. In these models the response is assumed to be a function of fixed
(population) effects, non-observable cluster specific random effects, and an error
term. Observations within the same cluster share common random effects and
are therefore statistically dependent.
We will restrict ourselves in this dissertation to models in which the error
terms and the random effects are normally distributed.
The parameters in a mixed effects model can be classified into two types:
fixed effects, associated with the average effect of predictors on the response,
and variance-covariance components, associated with the covariance structure
of the random effects and of the error term. In many practical applications
estimates of the random effects are also of interest.
Several estimation methods have been proposed for mixed effects models and
though maximum likelihood and restricted maximum likelihood (Harville, 1974)
are generally adopted for linear mixed effects models (Longford, 1993), there is
an ongoing debate in the statistical literature about estimation methods for
nonlinear mixed effects models.
1.2 Linear Mixed Effects Models
Linear mixed effects models are mixed effects models in which both the fixed
and the random effects contribute linearly to the response function. The general
form of such models is
y = Xβ + Zb + ε (1.2.1)
where y is the response vector, X and Z are the design matrices corresponding
to the fixed and random effects respectively, β is the fixed effects vector, b is the
random effects vector, and ε is the error vector. It is assumed that b ∼ N (0, D)
and ε ∼ N (0,Λ), with b independent of ε.
Variance components models (Searle, Casella and McCulloch, 1992), mixed
effects ANOVA models (Miller, 1977), and linear models for longitudinal data
(Laird and Ware, 1982) are all special cases of model (1.2.1). The linear mixed
effects model (1.2.1) is described in detail in chapter 2. Two examples are
included there to illustrate the use of this model in the context of mixed effects
ANOVA models and repeated measures data.
Maximum likelihood (ML) and restricted maximum likelihood (RML) are
the most common estimation methods used for linear mixed effects models. The
derivation of (R)ML estimates constitutes a rather complex nonlinear optimiza-
tion problem that only became feasible when fast computers became available.
This optimization is usually done using the EM algorithm (Dempster, Laird and
Rubin, 1977) or Newton-Raphson methods (Thisted, 1988), but the latter seems
to be more efficient than the former (Lindstrom and Bates, 1988). No closed
form expressions are available for the distribution of (R)ML estimates and infer-
ence usually has to rely on asymptotic results. The classical asymptotic theory
available for ML estimates (Lehmann, 1983) cannot be applied to linear mixed
effects models, since the observations are not independent. Miller (1977) derived
the asymptotic distribution of ML estimates for mixed effects ANOVA models,
following the work by Hartley and Rao (1967), but these results had not been
extended to the more general linear mixed effects model (1.2.1). We derive,
in chapter 3, the asymptotic distribution of both ML and RML estimates in
the linear mixed effects model (1.2.1) under quite general regularity conditions.
We also derive the asymptotic distribution of ML and RML estimates of the
variance-covariance components in (1.2.1) for a large class of reparametrizations
of the variance-covariance matrix of the random effects, that encompasses most
cases of practical interest.
1.3 Nonlinear Mixed Effects Models
Nonlinear mixed effects models are mixed effects models in which some of the
fixed and/or random effects occur nonlinearly in the response function. Several
different formulations of nonlinear mixed effects models are available in the
literature; we will adopt here the model proposed by Lindstrom and Bates
(1990), given by
y = f(φ, X) + ε,   φ = Aβ + Bb   (1.3.1)

where y is the response vector, f is a general nonlinear function, φ is a mixed effects parameter vector that is expressed as a linear function of the fixed effects β and the random effects b, X is a matrix of covariates, ε is the error vector, and A and B are the design matrices for the fixed and random effects respectively. As in the linear mixed effects model (1.2.1) it is assumed that b ∼ N(0, D) and ε ∼ N(0, Λ), with b independent of ε.
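To make the structure of model (1.3.1) concrete, the following Python sketch (the dissertation's own software is written in S; Python is used here purely for illustration) simulates one cluster's response under an assumed logistic growth function f, with A = B = I so that φ = β + b. The function form and every numerical value (β, D, σ, the measurement times) are illustrative choices, not taken from this dissertation.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(phi, x):
    """Hypothetical logistic growth curve; phi = (asymptote, inflection, scale)."""
    return phi[0] / (1.0 + np.exp(-(x - phi[1]) / phi[2]))

beta = np.array([200.0, 700.0, 350.0])   # illustrative fixed effects
D = np.diag([400.0, 0.0, 0.0])           # random effect on the asymptote only
sigma = 5.0                              # illustrative error standard deviation
x = np.linspace(100.0, 1600.0, 7)        # measurement times for one cluster

# phi = A beta + B b with A = B = I_3 here: one parameter vector per cluster
b = rng.multivariate_normal(np.zeros(3), D)
phi = beta + b
y = f(phi, x) + rng.normal(0.0, sigma, size=x.shape)   # y = f(phi, X) + eps
```

Because f is nonlinear in φ, the marginal density of y involves an integral over b that generally has no closed form; this is the difficulty motivating the loglikelihood approximations of chapter 5.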
By far the most common application of model (1.3.1) is for repeated mea-
sures data and we will restrict ourselves in this dissertation to this type of
situation. The nonlinear mixed effects model for repeated measures data is
described in detail in chapter 4, together with a real data example of its use.
Different estimation methods have been proposed for the parameters in the
nonlinear mixed effects model (1.3.1) and there is an ongoing debate in the liter-
ature about the most adequate method(s) (Davidian and Giltinan, 1993). One
of the reasons for this variety of estimation methods is related to the numerical
complexity involved in the derivation of (R)ML estimates in the nonlinear mixed
effects model. This complexity is due to the fact that the likelihood function in
the nonlinear mixed effects model, which is based on the marginal distribution
of y, does not usually have a closed form expression. Different approxima-
tions to the loglikelihood in (1.3.1) have been proposed to try to circumvent
this problem (Lindstrom and Bates, 1990; Vonesh and Carter, 1992; Davidian
and Gallant, 1993). We describe in chapter 5 alternative approximations to
the loglikelihood in (1.3.1) based on the Laplacian approximation (Tierney and
Kadane, 1986), importance sampling (Geweke, 1989), and Gaussian quadrature
(Davis and Rabinowitz, 1984). We present a comparison between these methods
and the approximation suggested by Lindstrom and Bates (1990), using sim-
ulated and real data and conclude that, in most cases, Lindstrom and Bates’
approximation gives very accurate results.
As in the linear mixed effects model, the distribution of the (R)ML estimates
cannot be determined explicitly. Asymptotic results for these estimates have
not yet been established and will not be considered in this dissertation.
1.4 Parametrizations for Variance-Covariance
Matrices
The (R)ML estimation of the variance-covariance components in both mod-
els (1.2.1) and (1.3.1) is usually a difficult numerical problem, since the resulting
estimates should correspond to a positive semi-definite matrix. This difficulty
has been pointed out by Harville (1977), Lindstrom and Bates (1988), and Searle
et al. (1992, chapter 6).
Two approaches can be used for ensuring positive semi-definiteness of the
estimated variance-covariance matrix of the random effects: constrained op-
timization, where the natural parametrization for the unique elements in the
variance-covariance matrix is used and the estimates are constrained to be pos-
itive semi-definite matrices, and unconstrained optimization, where the unique
elements in the variance-covariance matrix are reparametrized in such a way
that the resulting estimate must be positive semi-definite. We recommend the
use of the second approach, not only for numerical reasons (parameter estima-
tion tends to be much easier when there are no constraints), but also because
of the superior inferential properties that unconstrained estimates tend to have
(e.g. asymptotic properties). Lindstrom and Bates (1988, 1990) describe the
use of Cholesky factors for implementing unconstrained (R)ML estimation of
variance-covariance components in both the linear and the nonlinear mixed ef-
fects models.
We describe, in chapter 6, five different parametrizations for transforming
the (R)ML estimation of the variance-covariance components in models (1.2.1)
and (1.3.1) into an unconstrained optimization. The basic idea behind all
parametrizations considered in this dissertation is to write
D = L^T L   (1.4.1)
where the unique elements of L form an unconstrained parameter vector. Differ-
ent choices of L lead to different parametrizations of D. The parametrizations
considered in chapter 6 are of two types: three of them are based on the Cholesky
factorization of D (Thisted, 1988) and the other two are based on the spectral
decomposition (Rao, 1973).
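The idea behind (1.4.1) can be sketched as follows: an unconstrained real vector of length q(q + 1)/2 fills an upper triangular matrix L, and D = L^T L is then positive semi-definite by construction. The helper name and test values below are hypothetical; exponentiating the diagonal of L would give the log-Cholesky variant, which makes the map one-to-one and D strictly positive definite.

```python
import numpy as np

def spd_from_unconstrained(theta, q):
    """Hypothetical helper: map an unconstrained vector of length q(q+1)/2
    to a positive semi-definite matrix via D = L^T L, with L upper triangular
    (the Cholesky-type parametrization of (1.4.1))."""
    L = np.zeros((q, q))
    L[np.triu_indices(q)] = theta        # fill the upper triangle of L row by row
    return L.T @ L

theta = np.array([1.0, 0.5, -0.3, 2.0, 0.1, 0.7])   # any real values are admissible
D = spd_from_unconstrained(theta, 3)
```

An optimizer can therefore search freely over theta, with positive semi-definiteness of the resulting estimate guaranteed rather than imposed as a constraint.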
In choosing a parametrization for D one has to take into consideration
its computational efficiency and the statistical interpretability of the individ-
ual parameters. A comparison of the computational efficiency of the different
parametrizations, using simulation, is included in chapter 6. The statistical
interpretation of the individual parameters in each parametrization is also dis-
cussed in that chapter. We conclude that different parametrizations should
be used at different stages of the analysis: during the optimization of the (re-
stricted) loglikelihood function, a parametrization based on the matrix loga-
rithm of D (Leonard and Hsu, 1993) is to be preferred for its superior computa-
tional efficiency; to assess the variability of the variance-covariance components
estimates, a parametrization based on the spherical coordinates of the Cholesky
factor of D is recommended, since it is the one with the most interpretable
elements.
1.5 Software Development
The success of any statistical technique nowadays is directly related to the
availability of reliable, efficient, and simple-to-use software for its application.
We describe in chapter 7 a set of S functions, classes, and methods (Chambers
and Hastie, 1992) that we developed for the analysis of mixed effects models,
using either maximum or restricted maximum likelihood. These extend the linear and nonlinear modeling facilities available in release 3 of S and S-plus. The
source code, written in S and C using an object-oriented approach, is available
in the S collection at StatLib. Help files for all S functions and methods are
included in Appendix B.
The two functions used to fit linear and nonlinear mixed effects models are
respectively lme() and nlme(). Objects returned by these functions are of
classes lme and nlme respectively, and the latter class inherits from the former.
Several methods are available for both the lme and nlme classes, including
print, summary, plot, predict and anova. These were developed keeping
consistency with the methods of other model fitting functions available in S,
such as lm(), glm(), and nls().
The use of the S functions and methods for mixed effects models is illustrated
in chapter 7 through the analysis of two real data examples: one of a linear mixed
effects model and the other of a nonlinear mixed effects model.
1.6 Model Building
Model building in mixed effects models involves questions that do not have a
parallel in (fixed effects) linear and nonlinear models. Some of these questions
are:
• determining which effects should have an associated random component
and which should be purely fixed;
• using covariates to explain cluster-to-cluster parameter variability;
• using structured random effects variance-covariance matrices (e.g. diago-
nal matrices) to reduce the number of parameters in the model.
We consider in chapter 8 strategies for addressing these questions in the context
of nonlinear mixed effects models, though most of the techniques described are
also applicable to linear mixed effects models.
The proposed strategy for choosing the random effects to be included in
the model is to start with all parameters as mixed effects, whenever no prior
information about the random effects variance-covariance structure is available
and convergence is possible. Then examine the eigenvalues of the estimated D
matrix, checking if one or more are close to zero. The associated eigenvector(s) would then give an estimate of the linear combination of the parameters that could be taken as fixed. If near-zero eigenvalues are present, a reduced model,
in which the corresponding linear combination of random effects is eliminated,
can then be fit and compared to the original model by means of likelihood
ratio tests or information criterion statistics. In this dissertation we use the
Akaike information criterion (Sakamoto, Ishiguro and Kitagawa, 1986) to decide
between alternative models, choosing the one with the smaller AIC.
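The eigenvalue inspection and AIC comparison described above can be sketched numerically; the estimated D matrix, loglikelihood values, parameter counts, and near-zero threshold below are all illustrative stand-ins, not results from this dissertation.

```python
import numpy as np

# Hypothetical estimated random effects covariance matrix from a full model fit
D_hat = np.array([[4.00, 1.99, 0.00],
                  [1.99, 1.00, 0.00],
                  [0.00, 0.00, 2.50]])

eigvals, eigvecs = np.linalg.eigh(D_hat)      # eigenvalues in ascending order
near_zero = eigvals < 1e-2 * eigvals.max()    # illustrative "close to zero" cutoff
# The eigenvectors paired with near-zero eigenvalues estimate the linear
# combinations of random effects that could be treated as purely fixed.

def aic(loglik, n_params):
    """Akaike information criterion; the model with the smaller AIC is preferred."""
    return -2.0 * loglik + 2.0 * n_params

# Hypothetical maximized loglikelihoods and parameter counts of the two fits
aic_full = aic(-250.3, 10)
aic_reduced = aic(-251.0, 8)
prefer_reduced = aic_reduced < aic_full
```

In this fabricated example the reduced model loses a little loglikelihood but saves two parameters, so its AIC is smaller and it would be preferred.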
For choosing covariates to explain cluster-to-cluster parameter variability
we suggest analyzing plots of random effects estimates (e.g. conditional modes)
versus the candidate covariates. If the number of covariates/random effects
combinations is large, we suggest using a forward stepwise type of approach in
which covariates are included one at a time and the potential importance of
the remaining covariates is (graphically) assessed at each step. The decision on
whether or not to include a covariate can be based on the change in the AIC
values of the fits with and without it.
In comparing alternative models one must also analyze the residuals from
the fit, checking for departures from the model’s assumptions. It is also highly
recommended that any model building analysis be done in conjunction with
experts in the field of application of the model, to ensure the practical usefulness
of the chosen model. The use of the proposed model building strategies is
illustrated in chapter 8 through the analyses of four real data examples, obtained
from the areas of forestry, ecology, and pharmacokinetics.
1.7 Future Research
Considerable research effort is currently dedicated to expanding the applicability of mixed effects models and to improving their estimation methods. We suggest in
chapter 9 topics for future research in mixed effects models that were not covered
in this dissertation. These include suggestions for:
• expanding the asymptotic results of chapter 3 to nonlinear mixed ef-
fects models and linear mixed effects models with more general variance-
covariance structures for the error term;
• deriving unconstrained parametrizations for structured variance-covariance matrices;
• comparing methods for assessing the variability of parameter estimates in
mixed effects models.
Chapter 2
The Linear Mixed Effects Model
In this chapter we describe a general linear mixed effects model and present
two examples of its use in the context of mixed effects ANOVA models and
repeated measures data. We also include a brief bibliographic review of linear
mixed effects models.
2.1 Model and Examples
We write the linear mixed effects model as

y = Xβ + Zb + ε   (2.1.1)

where y, X, and Z denote respectively the n-dimensional response vector, the n × p_0 fixed effects design matrix, and the n × m random effects design matrix, β denotes the p_0-dimensional vector of fixed effects parameters, b denotes the m-dimensional random effects vector, and ε denotes the error term.
The model formulation in (2.1.1) is quite general and in practice some restric-
tions on the structure and the distribution of the random effects are assumed.
Assumption 2.1.1 By permuting the columns of Z if necessary, the random effects design matrix can be partitioned as

Z = [Z_1 : \cdots : Z_r]

where each Z_i is block-diagonal, of the form

Z_i = \begin{bmatrix}
Z_i^1 & 0 & \cdots & 0 \\
0 & Z_i^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & Z_i^{m_i}
\end{bmatrix}

with each Z_i^j having the same number of columns q_i and a variable number of rows n_i^j. The random effects vector b can be accordingly partitioned as b = [b_1^T, b_2^T, \cdots, b_r^T]^T and each b_i can in turn be partitioned as b_i = [(b_i^1)^T, (b_i^2)^T, \cdots, (b_i^{m_i})^T]^T. This partition defines a grouping of the random effects into r classes, with the q_i random effects belonging to the same class i being observed at exactly m_i different levels.
We will restrict ourselves in this dissertation to normal distribution models.
More specifically, we will assume
Assumption 2.1.2 The b_i^j are independent (for different i and/or j) and follow a N(0, D_i) distribution, ε follows a N(0, Λ) distribution, and the b_i^j are independent of ε.
The D_i can be either general positive semi-definite matrices, with q_i(q_i + 1)/2 free parameters, or structured positive semi-definite matrices, i.e. D_i = D_i(θ_i) with the dimension of θ_i being less than q_i(q_i + 1)/2 (Jennrich and Schluchter, 1986).
Define

D = \bigoplus_{i=1}^{r} D_i, \qquad D_A = \bigoplus_{i=1}^{r} \left( I_{m_i} \otimes D_i \right)

where ⊕ denotes the direct sum and ⊗ denotes the tensor product. Note that D and D_A have the same eigenvalues, with different multiplicities (in particular they have the same maximum and minimum eigenvalues). Under assumption (2.1.2) it follows that y has a N(Xβ, Σ) distribution (Searle et al., 1992), where Σ = Λ + Z D_A Z^T.

In most applications of linear mixed effects models it is assumed that Λ = σ^2 I and we will also assume this here.
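The direct sum and tensor product construction of D and D_A can be sketched numerically; the dimensions below (r = 2 classes with q_1 = q_2 = 1, m_1 = 2, m_2 = 4, as in the split-plot example that follows) and the variance values are illustrative only.

```python
import numpy as np

def direct_sum(blocks):
    """Block-diagonal stacking of square matrices: the direct sum (⊕)."""
    n = sum(b.shape[0] for b in blocks)
    out = np.zeros((n, n))
    pos = 0
    for b in blocks:
        k = b.shape[0]
        out[pos:pos + k, pos:pos + k] = b
        pos += k
    return out

# Illustrative 1x1 variance blocks for two random effects classes
D1 = np.array([[1.5]])
D2 = np.array([[0.8]])
m = [2, 4]                    # each class observed at m_i levels

D = direct_sum([D1, D2])
# D_A repeats each D_i once per level, via I_{m_i} ⊗ D_i
D_A = direct_sum([np.kron(np.eye(m[0]), D1), np.kron(np.eye(m[1]), D2)])
```

As noted above, D and D_A have the same distinct eigenvalues (here 1.5 and 0.8), differing only in their multiplicities.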
The mixed effects ANOVA model (Miller, 1977; Searle et al., 1992) is a particular case of model (2.1.1) where q_i = 1 and D_i = σ_i^2, i = 1, . . . , r. As an example, consider the design in which the experimental units are divided into two blocks, each with two plots, which in turn are divided into two subplots, and two treatment factors A and B, in a 2 × 2 full factorial arrangement, are used according to the scheme shown in Table 2.1.1.

Table 2.1.1: Split-split plot design

Block  Plot  Subplot  A  B
  1      1      1     1  1
  1      1      2     1  2
  1      2      1     2  1
  1      2      2     2  2
  2      1      1     1  1
  2      1      2     1  2
  2      2      1     2  1
  2      2      2     2  2

Assuming that the block and plot effects are random, the corresponding mixed effects ANOVA model can be written as

y_{ijk} = \mu + b_i + A_j + s_{ij} + B_k + A.B_{jk} + \varepsilon_{ijk}, \qquad i, j, k = 1, 2

where y_{ijk} is the response observed in the ith block, jth plot, and kth subplot, μ is the grand mean, b_i is the random effect corresponding to block i, s_{ij} is the random effect corresponding to the ijth block-plot combination, A_j and B_k are the A and B treatment effects respectively, and ε_{ijk} is the error term. To ensure identifiability of the fixed effects we will use the sum-to-zero conditions

\sum_{j=1}^{2} A_j = \sum_{k=1}^{2} B_k = \sum_{j=1}^{2} A.B_{jk} = \sum_{k=1}^{2} A.B_{jk} = \sum_{j,k=1}^{2} A.B_{jk} = 0.

The assumptions of the model are that the b_i are i.i.d. with distribution N(0, σ_1^2), the s_{ij} are i.i.d. with distribution N(0, σ_2^2) and independent of the b_i, and the ε_{ijk} are i.i.d. with distribution N(0, σ_3^2) and independent of both the b_i and the s_{ij}.
In the notation of model (2.1.1), we have

\begin{bmatrix} y_{111} \\ y_{112} \\ y_{121} \\ y_{122} \\ y_{211} \\ y_{212} \\ y_{221} \\ y_{222} \end{bmatrix}
=
\begin{bmatrix}
1 & 1 & 1 & 1 \\
1 & 1 & -1 & -1 \\
1 & -1 & 1 & -1 \\
1 & -1 & -1 & 1 \\
1 & 1 & 1 & 1 \\
1 & 1 & -1 & -1 \\
1 & -1 & 1 & -1 \\
1 & -1 & -1 & 1
\end{bmatrix}
\begin{bmatrix} \mu \\ A_1 \\ B_1 \\ A.B_{11} \end{bmatrix}
+
\begin{bmatrix}
1 & 0 & 1 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} b_1 \\ b_2 \\ s_{11} \\ s_{12} \\ s_{21} \\ s_{22} \end{bmatrix}
+
\begin{bmatrix} \varepsilon_{111} \\ \varepsilon_{112} \\ \varepsilon_{121} \\ \varepsilon_{122} \\ \varepsilon_{211} \\ \varepsilon_{212} \\ \varepsilon_{221} \\ \varepsilon_{222} \end{bmatrix}

By setting Z_1^j = [1\; 1\; 1\; 1]^T, j = 1, 2, Z_2^j = [1\; 1]^T, j = 1, \ldots, 4, and Z_i = \bigoplus_j Z_i^j, i = 1, 2, we see that r = 2, q_1 = q_2 = 1, m_1 = 2, m_2 = 4, b_1 = [b_1\; b_2]^T, and b_2 = [s_{11}\; s_{12}\; s_{21}\; s_{22}]^T. Note also that in this example

D = \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix}
\quad \text{and} \quad
D_A = \begin{bmatrix}
\sigma_1^2 & 0 & 0 & 0 & 0 & 0 \\
0 & \sigma_1^2 & 0 & 0 & 0 & 0 \\
0 & 0 & \sigma_2^2 & 0 & 0 & 0 \\
0 & 0 & 0 & \sigma_2^2 & 0 & 0 \\
0 & 0 & 0 & 0 & \sigma_2^2 & 0 \\
0 & 0 & 0 & 0 & 0 & \sigma_2^2
\end{bmatrix}
The linear mixed effects model for repeated measures (Laird and Ware, 1982;
Lindstrom and Bates, 1988) is a particular case of model (2.1.1) where r = 1.
As an example we consider the data presented in Grizzle and Allen (1969) from
a dental study on the ramus height (in millimeters) measured in 20 boys at ages
8, 8.5, 9, and 9.5 years. The data are shown in figure 2.1.1.
A linear model in age in which both the intercept and the slope vary with the
[Figure 2.1.1 appears here: a plot of ramus height (mm.) against age (years, 8.0 to 9.5), with each boy's four measurements labeled by a letter from a to t.]

Figure 2.1.1: Ramus heights for 20 boys measured at 4 ages.
boy seems adequate to explain the ramus height evolution. The corresponding
linear mixed effects model is written as

y_{ij} = (\beta_1 + b_{i1}) + (\beta_2 + b_{i2})\, \text{age}_j + \varepsilon_{ij}, \qquad i = 1, \ldots, 20, \; j = 1, \ldots, 4,

where y_{ij} is the ramus height of the ith boy at age j, β_1 and β_2 are the fixed intercept and the fixed slope respectively, b_{i1} and b_{i2} are the random intercept and the random slope corresponding to the ith boy, and ε_{ij} is the error term. The assumptions of the model are that the b_i are i.i.d. with distribution N(0, D_1) and the ε_{ij} are i.i.d. with distribution N(0, σ^2), independent of the b_i. D_1 is a general variance-covariance matrix.
In the notation of model (2.1.1) we can express the linear mixed effects model as

\begin{bmatrix} 47.8 \\ 48.8 \\ \vdots \\ 51.3 \\ 51.8 \end{bmatrix}
=
\begin{bmatrix} 1 & 8.0 \\ 1 & 8.5 \\ \vdots & \vdots \\ 1 & 9.0 \\ 1 & 9.5 \end{bmatrix}
\begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix}
+
\begin{bmatrix}
1 & 8.0 & 0 & 0 & \cdots & 0 & 0 \\
1 & 8.5 & 0 & 0 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & 0 & \cdots & 1 & 9.0 \\
0 & 0 & 0 & 0 & \cdots & 1 & 9.5
\end{bmatrix}
\begin{bmatrix} b_{11} \\ b_{12} \\ \vdots \\ b_{20\,1} \\ b_{20\,2} \end{bmatrix}
+
\begin{bmatrix} \varepsilon_{11} \\ \varepsilon_{12} \\ \vdots \\ \varepsilon_{20\,3} \\ \varepsilon_{20\,4} \end{bmatrix}

By letting X[n_1 \cdots n_2, ] denote the submatrix of X corresponding to its n_1 through n_2 rows and setting Z_1^j = X[4j - 3 \cdots 4j, ], j = 1, \ldots, 20, we see that, in this example, r = 1, q_1 = 2, m_1 = 20, D = D_1, and D_A = \bigoplus_{i=1}^{20} D_1.
2.2 Likelihood Estimation
Different estimation methods for the parameters in model (2.1.1) have been pro-
posed over the years (Searle et al., 1992), but the most commonly used methods
today are maximum likelihood (ML) and restricted maximum likelihood (RML)
(Longford, 1993).
It is convenient, when writing the (restricted) likelihood of y in model (2.1.1), to factor out the variance of the error term, σ^2, from the variance-covariance matrix of the random effects, i.e. D = σ^2 D^s, where D^s is called the scaled variance-covariance matrix of the random effects. Under assumption (2.1.2), the loglikelihood function for y in model (2.1.1) is given by

\ell(\beta, \sigma^2, D^s \mid y) = -\frac{1}{2} \left[ n \log(2\pi\sigma^2) + \log\left| I + Z D_A^s Z^T \right| + \frac{1}{\sigma^2} (y - X\beta)^T \left( I + Z D_A^s Z^T \right)^{-1} (y - X\beta) \right]   (2.2.1)
For fixed D^s, the values of β and σ^2 that maximize (2.2.1) are given by

\hat\beta(D^s) = \left[ X^T \left( I + Z D_A^s Z^T \right)^{-1} X \right]^{-1} X^T \left( I + Z D_A^s Z^T \right)^{-1} y   (2.2.2)

\hat\sigma^2(D^s) = (1/n) \left[ y - X\hat\beta(D^s) \right]^T \left( I + Z D_A^s Z^T \right)^{-1} \left[ y - X\hat\beta(D^s) \right]
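The profiled estimates (2.2.2) can be sketched numerically as follows. Everything here is illustrative: a toy balanced random-intercept design, pure noise as the response, and an arbitrary value standing in for the scaled matrix D_A^s (written `Ds_A` in the code).

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy random-intercept design: m = 4 clusters with 5 observations each
n, m = 20, 4
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Z = np.kron(np.eye(m), np.ones((5, 1)))       # random-intercept design matrix
y = rng.normal(size=n)                        # simulated response (noise only)

Ds_A = 2.0 * np.eye(m)                        # illustrative value of D_A^s
V = np.eye(n) + Z @ Ds_A @ Z.T                # I + Z D_A^s Z^T

# Closed-form profiled estimates (2.2.2) for fixed D^s
beta_hat = np.linalg.solve(X.T @ np.linalg.solve(V, X),
                           X.T @ np.linalg.solve(V, y))
resid = y - X @ beta_hat
sigma2_hat = resid @ np.linalg.solve(V, resid) / n
```

Substituting these closed forms back into (2.2.1) is what reduces the optimization to a search over D^s alone, as discussed later in this section.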
Restricted maximum likelihood estimates (RMLEs) of the variance-covariance components are usually preferred to maximum likelihood estimates (MLEs) in linear mixed effects models. The basic reason is that RMLEs take into account the estimation of the fixed effects when calculating the degrees of freedom associated with the variance-components estimates, while MLEs do not.
The RMLEs are defined as the MLEs based on the likelihood of a set of n − p_0 linear combinations of the response vector y, corresponding to n − p_0 vectors that span the orthogonal complement of the column space of the fixed effects design matrix X (Harville, 1974). One way of defining such a set of vectors is to consider the QR decomposition (Thisted, 1988) of X

X = [Q_1 \; Q_2] \begin{bmatrix} R_1 \\ 0 \end{bmatrix}   (2.2.3)

where R_1 is upper triangular. It follows from the definition of the QR decomposition that the columns of Q_2 define a set of orthonormal vectors that span the orthogonal complement of the column space of X and the RMLEs can be obtained from the likelihood of y^* = Q_2^T y. From elementary properties of the multivariate normal distribution and the definition of the QR decomposition, y^* ∼ N(0, Σ^*), where Σ^* = σ^2 (I + Q_2^T Z D_A^s Z^T Q_2). Letting Z^* = Q_2^T Z and n^* = n − p_0, we can write the corresponding restricted likelihood as
\ell_R(\beta, \sigma^2, D^s \mid y) = -\frac{1}{2} \left[ n^* \log(2\pi\sigma^2) + \log\left| I + Z^* D_A^s Z^{*T} \right| + \frac{1}{\sigma^2} y^{*T} \left( I + Z^* D_A^s Z^{*T} \right)^{-1} y^* \right]   (2.2.4)

For fixed D^s, the value of σ^2 that maximizes (2.2.4) is

\hat\sigma_R^2(D^s) = (1/n^*) \, y^{*T} \left( I + Z^* D_A^s Z^{*T} \right)^{-1} y^*   (2.2.5)
The restricted likelihood (2.2.4) does not depend upon β and hence no fixed
effects RMLEs are available. Nevertheless the first formula in (2.2.2), with Ds
replaced by its corresponding RMLE, is usually employed to provide estimates
for the fixed effects in restricted maximum likelihood estimation.
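The QR construction of the error contrasts and the restricted estimate (2.2.5) can be sketched on a toy random-intercept design; the design, the response, and the value of D_A^s below are again illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(2)

n, p0, m = 20, 2, 4
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Z = np.kron(np.eye(m), np.ones((5, 1)))       # toy random-intercept design
y = rng.normal(size=n)

# Full QR decomposition of X: the columns of Q2 are orthonormal and span
# the orthogonal complement of the column space of X
Q, _ = np.linalg.qr(X, mode='complete')
Q2 = Q[:, p0:]
y_star = Q2.T @ y                             # error contrasts y* = Q2^T y
Z_star = Q2.T @ Z

Ds_A = 2.0 * np.eye(m)                        # illustrative value of D_A^s
V_star = np.eye(n - p0) + Z_star @ Ds_A @ Z_star.T

# Restricted ML estimate of sigma^2 for fixed D^s, formula (2.2.5)
sigma2_reml = y_star @ np.linalg.solve(V_star, y_star) / (n - p0)
```

Since y^* = Q_2^T y has mean Q_2^T Xβ = 0 by construction, the restricted likelihood is free of β, which is why only the variance components are estimated at this stage.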
The (R)MLE of Ds in general does not have a closed form expression and its
determination constitutes a constrained nonlinear optimization problem whose
numerical solution has been addressed in several papers (Hartley and Rao,
1967; Laird and Ware, 1982; Lindstrom and Bates, 1988; Wolfinger, Tobias and
Sall, 1991). We will not consider the numerical problem of determining the
(R)MLE of Ds in this dissertation. Using the formulas in (2.2.2) and (2.2.5)
one can express the likelihood (2.2.1), or the restricted likelihood (2.2.4), as a
function of Ds alone, greatly simplifying the optimization problem.
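The profiling just described can be sketched numerically. The helper below is a minimal illustration (not the thesis's S software; names are hypothetical): it forms $y^* = Q_2^T y$ and $Z^* = Q_2^T Z$ from a complete QR decomposition of $X$, with the argument `Ds` standing in for the scaled covariance $D^s_A$, and evaluates the restricted likelihood (2.2.4) at the profiled $\hat\sigma^2_R$ of (2.2.5):

```python
import numpy as np

def reml_profiled_loglik(y, X, Z, Ds):
    """Profiled restricted loglikelihood as a function of Ds alone.

    A sketch assuming Ds plays the role of the thesis's D^s_A; beta and
    sigma^2 are profiled out, as in (2.2.2) and (2.2.5).
    """
    n, p0 = X.shape
    # Complete QR of X: the columns of Q2 span the orthogonal
    # complement of the column space of X, as in (2.2.3).
    Q, _ = np.linalg.qr(X, mode="complete")
    Q2 = Q[:, p0:]
    ystar = Q2.T @ y                    # y* = Q2^T y
    Zstar = Q2.T @ Z                    # Z* = Q2^T Z
    nstar = n - p0
    W = np.eye(nstar) + Zstar @ Ds @ Zstar.T
    sign, logdet = np.linalg.slogdet(W)
    # Profiled sigma^2_R(Ds) from (2.2.5)
    sigma2 = (ystar @ np.linalg.solve(W, ystar)) / nstar
    # (2.2.4) evaluated at the profiled sigma^2; the quadratic form
    # then contributes the constant n*.
    return -0.5 * (nstar * np.log(2 * np.pi * sigma2) + logdet + nstar)
```

Because $Q_2^T X = 0$, the returned value is invariant to shifting $y$ by any $Xc$, which is the defining property of the restricted likelihood.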
The exact distribution of the (R)MLEs cannot be derived in most applica-
tions of model (2.1.1) and inference about them usually has to rely on asymptotic
results. We derive, in chapter 3, the asymptotic distribution of both the MLE
and the RMLE, under quite general regularity conditions.
In many applications of linear mixed effects models, estimates of the random
effects b are also of interest. In (R)ML estimation the conditional modes of
the random effects are frequently used for that purpose (Lindstrom and Bates,
1988). These are defined as the mode of the conditional distribution of b given
y, which in the case of maximum likelihood estimation is given by
$$
\hat{b}_{\mathrm{ML}} = \hat{D}^s_{A,\mathrm{ML}}\, Z^T \left(I + Z \hat{D}^s_{A,\mathrm{ML}} Z^T\right)^{-1} \left(y - X\hat{\beta}_{\mathrm{ML}}\right)
$$
and in the case of restricted maximum likelihood is given by
$$
\hat{b}_{\mathrm{RML}} = \hat{D}^s_{A,\mathrm{RML}}\, Z^T \left(I + Z \hat{D}^s_{A,\mathrm{RML}} Z^T\right)^{-1} \left(y - X\hat{\beta}_{\mathrm{RML}}\right)
$$
where $\hat{D}^s_{A,\mathrm{ML}}$, $\hat{D}^s_{A,\mathrm{RML}}$, and $\hat{\beta}_{\mathrm{ML}}$ denote respectively the MLE and RMLE of $D^s_A$, and the MLE of $\beta$.
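A direct transcription of these conditional-mode formulas, as a hedged numpy sketch (names are illustrative; `Ds` plays the role of the estimated $D^s_A$ and `beta_hat` the corresponding fixed effects estimate, so passing the ML or the RML-based quantities gives the two displayed formulas):

```python
import numpy as np

def conditional_modes(y, X, Z, Ds, beta_hat):
    """Conditional modes of the random effects b given y (a sketch).

    Evaluates Ds Z^T (I + Z Ds Z^T)^{-1} (y - X beta_hat); Ds stands in
    for the (R)ML estimate of D^s_A.
    """
    n = len(y)
    W = np.eye(n) + Z @ Ds @ Z.T
    return Ds @ Z.T @ np.linalg.solve(W, y - X @ beta_hat)
```

For positive definite `Ds` this agrees with the equivalent normal-equations form $(D_s^{-1} + Z^T Z)^{-1} Z^T (y - X\hat\beta)$, a standard matrix identity that makes a convenient numerical check.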
2.3 Bibliographic Review
The first developments of linear mixed effects models were related to the so-called variance components models, defined as linear mixed effects models in which all random effects are independent (and hence no covariance components are present).
are present). Airy (1861) seems to have given the first known formulation of a
variance components model while considering a standard measurement problem
in astronomy.
Fisher (1925) introduced the ANOVA method for estimating variance components (i.e., equating sums of squares to their expected values). Tippett (1931) clarified the use of the ANOVA method for analysis of variance designs and extended it to 2-way crossed classification mixed effects models. Possibly the most important paper on ANOVA estimation for unbalanced data is Henderson (1953). The three ANOVA methods presented in that paper, later known as
Henderson methods, were the standard estimation methods for linear mixed
effects models until fast computers became available.
Maximum likelihood estimation for normal distribution variance components
models seems to have been first considered by Crump (1947). The landmark
paper on ML estimation for variance components models is Hartley and Rao
(1967), in which, among other things, the first asymptotic results for the MLE
were established. Miller (1977) corrected some problems in Hartley and Rao’s
results and established asymptotic results for a large class of variance com-
ponent models, giving also conditions for them to hold. Restricted maximum
likelihood was introduced by Thompson (1962) and later extended by Patterson
and Thompson (1971). Harville (1977) presents a comprehensive review of max-
imum likelihood and restricted maximum likelihood estimation in linear mixed
effects models and introduces the model formulation given in (2.1.1). Laird and
Ware (1982) describe a general linear mixed effects model for repeated measures
data and suggest the use of the EM algorithm for obtaining (R)MLEs of the
variance-covariance components.
The general structure of the linear mixed effects model (2.1.1) seems to be
accepted by most researchers today. The linear mixed effects models literature
that has been published after (Harville, 1977) and Laird and Ware (1982) refers
more to generalizations of the assumptions in model (2.1.1) and/or to different
estimation approaches, than to reformulations of the basic model’s structure.
Chi and Reinsel (1989) consider model (2.1.1) when Λ has the structure of an
autoregressive process of order one (AR(1)). Maximum likelihood estimators of the model parameters and a score test for the autocorrelation are derived. One of the main conclusions is that the use of an AR(1) structure for the cluster-specific errors may have the effect of reducing the number of random effects needed in
the model, but the investigation of ways to determine the best combination of
time series error structure and number of random effects deserves further study.
This issue is also considered by Jones (1990).
A Bayesian analysis of model (2.1.1) using the Gibbs sampler (Geman and
Geman, 1984) is described in Gelfand, Hills, Racine-Poon and Smith (1990) and
in Wakefield, Smith, Racine-Poon and Gelfand (1994). The Bayesian analysis is
developed using a hierarchical model approach. In the second paper the normal
distribution of the random effects (b) is replaced by a multivariate Student-t,
enhancing the robustness of the fit and giving a method for detecting outlying
random effects. The main advantage of this approach is its flexibility in handling
complex situations, such as constrained parameters and non-Gaussian distribu-
tions for the random effects and/or error terms. The main drawbacks are the
intensive computational effort required and the need for prior distributions for
all the population parameters involved.
Jennrich and Schluchter (1986) consider ML estimation in linear mixed ef-
fects models for repeated measures with structured variance-covariance matri-
ces. Their work was extended to the general linear mixed effects models by
Wolfinger et al. (1991), who also discuss restricted maximum likelihood. The
use of structured matrices is very appealing in practice since many times it is
known beforehand that the covariance structure of the random effects and/or
the errors follows a particular pattern, and substantial reductions in computing
time can thus be achieved.
A generalized linear model version of (2.1.1) is discussed in Liang and Zeger
(1986) and Zeger, Liang and Albert (1988). They allow a more flexible error
structure that is no longer restricted to being Gaussian and introduce the idea of
a link function, h, relating E(y | b) to β and b, so that h(E(y | b)) = Xβ+Zb.
This model should in fact be considered a competitor of the nonlinear mixed
effects model, discussed in chapter 4.
Three books solely dedicated to linear mixed effects models have been re-
cently published. Searle et al. (1992) includes a comprehensive review of models
and estimation methods for linear mixed effects models, but focuses more on
variance components models and mixed effects ANOVA models. Lindsey (1993)
covers in detail linear mixed effects models for repeated measures data and
Longford (1993) considers linear mixed effects models in a regression context.
Chapter 3
Asymptotic Results for the
Linear Mixed Effects Model
Miller (1977) derived the asymptotic distribution of maximum likelihood es-
timators for a mixed effects ANOVA model. In section 3.1 we extend these
results to the more general linear mixed effects model (2.1.1), showing that,
under fairly general conditions, with probability going to one there exists a se-
quence of roots of the likelihood equations that is consistent and asymptotically
normal. These results are helpful in establishing the asymptotic uncorrelation of
the estimators of the fixed effects and the estimators of the variance-covariance
components. We also show, in section 3.2, that under fairly general conditions
the restricted maximum likelihood estimators for the general linear mixed ef-
fects model are consistent and asymptotically normal. In section 3.3, we show
that the asymptotic normality of the (restricted) maximum likelihood estima-
tors continues to hold for a large class of reparametrizations/structuring of the
variance-covariance components. Our conclusions are included in section 3.4.
The proofs of the lemmas used throughout this chapter are included in Ap-
pendix A.
3.1 Maximum Likelihood
Under Assumption 2.1.1 the linear mixed effects model (2.1.1) can alternatively be expressed as
$$
y = X\beta + \sum_{i=1}^{r} \sum_{j=1}^{q_i} U_i^j a_i^j + \varepsilon \tag{3.1.1}
$$
where the $U_i^j$ are $n \times m_i$ incidence-like matrices defined by the relation
$$
k\text{th column of } U_i^j = j\text{th column of } Z_i^k.
$$
Note that each $U_i^j$ has at most one nonzero entry per row. We will assume here that it has at least one nonzero entry per column, to rule out trivial cases. The $a_i^j$ vectors are defined by the relation $\left[a_i^j\right]_k = \left[b_i^k\right]_j$ and represent the values of the $j$th random effect of the $i$th class.
The model formulation (3.1.1) is analogous to that of Hartley and Rao (1967)
and Miller (1977) for the mixed effects ANOVA model. We will use it in this
chapter to maintain consistency with the terminology used in the second paper.
The covariance matrix of $y$ can be expressed as
$$
\Sigma = \sigma^2 I + \sum_{i=1}^{r} \sum_{j,k=1}^{q_i} \left[D_i\right]_{jk} U_i^j \left(U_i^k\right)^T.
$$
By letting $p_1 = \sum_{i=1}^{r} q_i(q_i+1)/2$, $\sigma_0 = \sigma^2$, and $G_0 = I$, and setting
$$
\sigma_1 = \left[D_1\right]_{11},\; \sigma_2 = \left[D_1\right]_{12},\; \cdots,\; \sigma_{q_1(q_1+1)/2+1} = \left[D_2\right]_{11},\; \cdots,\; \sigma_{p_1} = \left[D_r\right]_{q_r q_r},
$$
$$
G_1 = U_1^1\left(U_1^1\right)^T,\; G_2 = U_1^1\left(U_1^2\right)^T + U_1^2\left(U_1^1\right)^T,\; \cdots,\; G_{p_1} = U_r^{q_r}\left(U_r^{q_r}\right)^T,
$$
we can write
$$
\Sigma = \sum_{i=0}^{p_1} \sigma_i G_i. \tag{3.1.2}
$$
This formulation of model (2.1.1) differs from that in Miller (1977) in that some of the $\sigma_i$ may assume negative values and some of the $G_i$ are not required to be positive semi-definite.
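The linear structure (3.1.2) is easy to verify numerically in a small example. The script below (an illustration, not part of the thesis) builds the incidence-like matrices $U_1^1$, $U_1^2$ for a single class ($r = 1$) of random intercepts and slopes over $m$ clusters and checks that $\sigma^2 I + \sum_i \sigma_i G_i$ agrees with the covariance $\sigma^2 I + Z \left(I_m \otimes D\right) Z^T$ computed directly:

```python
import numpy as np

# Illustrative check of the linear covariance structure (3.1.2) for one
# random-effects class with a random intercept and slope (q_1 = 2).
rng = np.random.default_rng(2)
m, per = 4, 3                        # m clusters, `per` observations each
n = m * per
groups = np.repeat(np.arange(m), per)
t = rng.standard_normal(n)           # within-cluster covariate for the slope

U1 = (groups[:, None] == np.arange(m)).astype(float)  # U_1^1: intercept incidence
U2 = U1 * t[:, None]                                  # U_1^2: slope incidence

D = np.array([[1.0, 0.3], [0.3, 0.5]])                # D_1 (2 x 2)
s2 = 0.25                                             # sigma^2 = sigma_0

# Sigma via (3.1.2): sigma_1 = [D]_11, sigma_2 = [D]_12, sigma_3 = [D]_22
G = [U1 @ U1.T, U1 @ U2.T + U2 @ U1.T, U2 @ U2.T]
Sigma_lin = s2 * np.eye(n) + D[0, 0] * G[0] + D[0, 1] * G[1] + D[1, 1] * G[2]

# Sigma computed directly: y = X beta + [U1 U2] b + eps, Var(b) = kron(D, I_m)
Z = np.hstack([U1, U2])
Sigma_dir = s2 * np.eye(n) + Z @ np.kron(D, np.eye(m)) @ Z.T
```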
The following assumptions (equivalent to Assumptions 2.2 through 2.5 in
Miller (1977)) are made about model (3.1.1).
Assumption 3.1.1 The matrix X is of full rank p0.
Assumption 3.1.2 n ≥ p0 + p1 + 1.
Assumption 3.1.3 The partitioned matrix $\left[X : U_i^j\right]$ has rank greater than $p_0$ for $i = 1, \ldots, r$, $j = 1, \ldots, q_i$.

Assumption 3.1.4 The matrices $G_0, G_1, \ldots, G_{p_1}$ are linearly independent, i.e. $\sum_{i=0}^{p_1} \tau_i G_i = 0 \iff \tau_i = 0$, $i = 0, \ldots, p_1$.
As mentioned in Miller (1977), Assumption 3.1.1 can always be satisfied by
suitably reparametrizing the fixed effects vector. Assumptions 3.1.3 and 3.1.4
ensure that the random effects are not confounded with the fixed effects and
with each other.
Let $p = p_0 + p_1 + 1$ and $\sigma = (\sigma_0, \sigma_1, \ldots, \sigma_{p_1})^T$. Then the parameter space $\Theta$ for model (3.1.1) is
$$
\Theta = \left\{\theta \in \mathbb{R}^p \;\middle|\; \theta = \left(\beta^T, \sigma^T\right)^T,\; \beta \in \mathbb{R}^{p_0};\; \sigma_0 > 0 \text{ and } (\sigma_1, \ldots, \sigma_{p_1}) \in \mathbb{R}^{p_1} \text{ such that each } D_i \text{ is positive semi-definite},\; i = 1, \ldots, r\right\}.
$$
Since the asymptotic results proven here require that θ be an interior point of
Θ we may assume without loss of generality that the Di matrices are actually
positive definite. If Di is not positive definite then there exists one or more
linear combinations of the random effects within the ith class that are identically
equal to zero. The model can then be reparametrized to eliminate this (these)
linear combination(s), thus making the new Di positive definite. In this case,
by a suitable reparametrization of σ, the constrained optimization problem
of determining the maximum likelihood estimates can be transformed into an
unconstrained problem (cf. chapter 6).
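One such reparametrization (the parametrizations actually used are compared in chapter 6; the specific choice here is only an illustration) is the log-Cholesky map, which turns the positive-definiteness constraint on a $D_i$ into an unconstrained vector:

```python
import numpy as np

def spd_from_logchol(x, q):
    """Map an unconstrained vector x of length q(q+1)/2 to a positive
    definite q x q matrix via the log-Cholesky parametrization: fill the
    lower triangle of L with x, exponentiate the diagonal so it is
    positive, and return L L^T. A hypothetical sketch of the kind of
    reparametrization meant here, not the thesis's own definition."""
    L = np.zeros((q, q))
    L[np.tril_indices(q)] = x
    L[np.diag_indices(q)] = np.exp(np.diag(L))  # strictly positive diagonal
    return L @ L.T
```

Since $L$ has a strictly positive diagonal it is nonsingular, so $LL^T$ is positive definite for every unconstrained input, and a standard optimizer can be run directly on $x$.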
The proof of the asymptotic normality and consistency of the maximum
likelihood estimates in the general linear mixed effects model (3.1.1) parallels
that of Theorem 3.1 in Miller (1977). We will also make use of the general the-
orem on asymptotic properties of maximum likelihood estimates given in Weiss
(1971, 1973). The version of Weiss’ theorem given in Miller (1977) is reproduced
below, since it introduces several quantities that will be used throughout the
rest of this section.
Theorem 3.1.1 (Weiss (1971, 1973)) Let $y_n$ be a sequence of random vectors with density $L_n(y_n, \theta)$ where $\theta \in \Theta \subset \mathbb{R}^p$ and define $\ell(\theta \mid y_n) = \log\left(L_n(y_n, \theta)\right)$. Assume that the true parameter value $\theta_0$ is an interior point of $\Theta$ and that there exist $2p$ sequences of positive quantities $n_i(n)$ and $g_i(n)$, $i = 1, 2, \ldots, p$, such that
$$
\lim_{n \to \infty} n_i(n) = \lim_{n \to \infty} g_i(n) = \infty, \qquad \lim_{n \to \infty} \frac{g_i(n)}{n_i(n)} = 0, \quad i = 1, 2, \ldots, p.
$$
Further assume that there exist nonrandom quantities $J_{ij}(\theta)$ such that
$$
-\left[1/\left(n_i(n) n_j(n)\right)\right]\left[\partial^2 \ell(\theta \mid y_n)/\partial\theta_i\,\partial\theta_j \big|_{\theta_0}\right] \to J_{ij}(\theta_0), \quad i, j = 1, \ldots, p
$$
in probability as $n \to \infty$. The matrix $J(\theta_0)$ is assumed to be a continuous function of $\theta_0$ and to be positive definite. Let
$$
N_n(\theta_0) = \left\{\theta \in \Theta \mid |\theta_i - \theta_{0i}| \le g_i(n)/n_i(n),\; i = 1, \ldots, p\right\}
$$
and
$$
\varepsilon_{ij}(\theta, \theta_0, n) = -\left[1/\left(n_i(n) n_j(n)\right)\right]\left[\partial^2 \ell(\theta \mid y_n)/\partial\theta_i\,\partial\theta_j\right] - J_{ij}(\theta_0).
$$
For any $\gamma > 0$ let $R_n(\theta_0, \gamma)$ denote the region in $\mathbb{R}^n$ where
$$
\sum_{i,j=1}^{p} g_i(n) g_j(n) \sup_{\theta \in N_n(\theta_0)} \left|\varepsilon_{ij}(\theta, \theta_0, n)\right| < \gamma.
$$
Assume that there exist sequences $\{\gamma_n(\theta_0)\}$, $\{\delta_n(\theta_0)\}$ of positive quantities with
$$
\lim_{n \to \infty} \gamma_n(\theta_0) = \lim_{n \to \infty} \delta_n(\theta_0) = 0
$$
such that for each $n$,
$$
P_\theta\left(R_n\left[\theta_0, \gamma_n(\theta_0)\right]\right) > 1 - \delta_n(\theta_0), \quad \forall\, \theta \in N_n(\theta_0).
$$
It then follows that there exists a sequence of estimates $\hat\theta(n)$, which are roots of the equations $\partial\ell(\theta \mid y_n)/\partial\theta = 0$, such that the vector whose $i$th component is $n_i(n)\left(\hat\theta_i(n) - \theta_{0i}\right)$ converges in distribution to a $N\left(0, J^{-1}(\theta_0)\right)$. That is, the sequence $\hat\theta(n)$ is consistent, asymptotically normal, and efficient.
We will show that, under general assumptions, the conditions of Weiss’ the-
orem are satisfied by the MLEs in (3.1.1). We need the following additional
assumptions in order to derive the main asymptotic theorem of this section.
Assumption 3.1.5 The number of observed levels (mi) of the random effects
in the ith class goes to infinity, i = 1, . . . , r.
Define now $\nu_k = \operatorname{rank}(G_k)$, $k = 1, \ldots, p_1$, and $\nu_0 = n - \operatorname{rank}\left[U_1^1 : \cdots : U_r^{q_r}\right]$.

Assumption 3.1.6 $\lim_{n \to \infty} \nu_0/n$ exists and is positive.
For $k = 1, \ldots, p_1$, we have $m_{i(k)} \le \operatorname{rank}(G_k) \le 2m_{i(k)}$ when $\sigma_k$ is a covariance term and $\operatorname{rank}(G_k) = m_{i(k)}$ when $\sigma_k$ is a variance term, with $i(k)$ denoting the random effect class with which $\sigma_k$ is associated. For the rank of a $G_k$ associated with a covariance term to be equal to $2m_{i(k)} - s$ there must be exactly $s$ indices $l$ for which $\left[U_{i(k)}^{j_1(k)}\right]_l = \left[U_{i(k)}^{j_2(k)}\right]_l$, with $[A]_l$ representing the $l$th column of $A$ and $j_1(k)$, $j_2(k)$ representing the random effects within class $i(k)$ corresponding to $\sigma_k$. Note that $\nu_k$ is of the same order of magnitude as $m_{i(k)}$, which by Assumption 3.1.5 goes to infinity.
The next assumption pertains to the asymptotic covariance matrix of the maximum likelihood estimates in model (3.1.1). Let $\theta_0 = \left(\beta_0^T, \sigma_0^T\right)^T$ denote the true value of the parameter vector and $\Sigma_0$ the associated covariance matrix of the response vector $y$.

Assumption 3.1.7 There exists a sequence of positive quantities $\nu_{p_1+1}$ depending on $n$ and going to infinity such that $C_0 = \lim_{n \to \infty} X^T \Sigma_0^{-1} X/\nu_{p_1+1}$ exists and is positive definite. Also let $C_1$ be the $(p_1+1) \times (p_1+1)$ matrix defined by $\left[C_1\right]_{ij} = (1/2) \lim_{n \to \infty} \operatorname{trace}\left(\Sigma_0^{-1} G_i \Sigma_0^{-1} G_j\right)/(\nu_i \nu_j)^{1/2}$, $i, j = 0, \ldots, p_1$. Then the limits exist and $C_1$ is positive definite.
Now letting $\ell(\theta \mid y)$ denote the loglikelihood of the data, it can be shown that (Searle et al., 1992)
$$
\frac{\partial^2 \ell(\theta \mid y)}{\partial\beta\,\partial\beta^T} = -X^T \Sigma^{-1} X = \mathrm{E}_\theta\left(\frac{\partial^2 \ell(\theta \mid y)}{\partial\beta\,\partial\beta^T}\right),
$$
$$
\frac{\partial^2 \ell(\theta \mid y)}{\partial\sigma_i\,\partial\beta} = -X^T \Sigma^{-1} G_i \Sigma^{-1} (y - X\beta), \qquad \mathrm{E}_\theta\left(\frac{\partial^2 \ell(\theta \mid y)}{\partial\sigma_i\,\partial\beta}\right) = 0,
$$
$$
\frac{\partial^2 \ell(\theta \mid y)}{\partial\sigma_i\,\partial\sigma_j} = \operatorname{trace}\left(\Sigma^{-1} G_i \Sigma^{-1} G_j\right)/2 - (y - X\beta)^T \Sigma^{-1} G_i \Sigma^{-1} G_j \Sigma^{-1} (y - X\beta),
$$
$$
\mathrm{E}_\theta\left(\frac{\partial^2 \ell(\theta \mid y)}{\partial\sigma_i\,\partial\sigma_j}\right) = -\operatorname{trace}\left(\Sigma^{-1} G_i \Sigma^{-1} G_j\right)/2.
$$
Assumption 3.1.7 simply establishes the existence and positive definiteness of
the limit of the negative of the expected Hessian matrix of the loglikelihood
function. Note in particular that, under the conditions of Weiss' theorem, the maximum likelihood estimates of the fixed effects $\beta$ and of the variance-covariance components $\sigma$ are asymptotically independent.
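These second-derivative formulas can be checked by finite differences. The sketch below (illustrative names only) implements the score and Hessian components in $\sigma$ for a covariance with the linear structure $\Sigma = \sum_k \sigma_k G_k$ of (3.1.2), so that the displayed $\partial^2\ell/\partial\sigma_i\,\partial\sigma_j$ can be compared against a numerical derivative of the score:

```python
import numpy as np

def score_sigma(y, X, beta, sigmas, Gs, i):
    """d l / d sigma_i for Sigma = sum_k sigma_k G_k:
    -trace(Si G_i)/2 + (y - X b)^T Si G_i Si (y - X b)/2, Si = Sigma^{-1}."""
    Si = np.linalg.inv(sum(s * G for s, G in zip(sigmas, Gs)))
    u = Si @ (y - X @ beta)
    return -0.5 * np.trace(Si @ Gs[i]) + 0.5 * u @ Gs[i] @ u

def hessian_sigma(y, X, beta, sigmas, Gs, i, j):
    """d^2 l / d sigma_i d sigma_j, the displayed formula:
    trace(Si G_i Si G_j)/2 - (y - X b)^T Si G_i Si G_j Si (y - X b)."""
    Si = np.linalg.inv(sum(s * G for s, G in zip(sigmas, Gs)))
    r = y - X @ beta
    return (0.5 * np.trace(Si @ Gs[i] @ Si @ Gs[j])
            - r @ Si @ Gs[i] @ Si @ Gs[j] @ Si @ r)
```

Differentiating the analytic score numerically (rather than the loglikelihood twice) keeps the finite-difference error small enough for a tight comparison.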
We are now in a position to state and prove the extension of Miller’s theorem
to the general linear mixed effects model (3.1.1).
Theorem 3.1.2 Under Assumptions 2.1.1, 2.1.2, and 3.1.1 through 3.1.7, and letting $\theta_0$ be an interior point of $\Theta$ representing the true parameter vector and
$$
J = \begin{bmatrix} C_0 & 0 \\ 0 & C_1 \end{bmatrix},
$$
there exists a sequence of estimates $\hat\theta_n = \left(\hat\beta_n^T, \hat\sigma_n^T\right)^T$ with the following properties.

1. Given $\varepsilon > 0$, $\exists\, \delta = \delta(\varepsilon)$, $0 < \delta < \infty$, and $n_0 = n_0(\varepsilon)$ such that $\forall n > n_0$
$$
P_{\theta_0}\left(\frac{\partial \ell(\theta \mid y)}{\partial\theta}\bigg|_{\theta = \hat\theta_n} = 0;\; \left\|\hat\beta_n - \beta_0\right\| < \frac{\delta}{n_{p_1+1}} \text{ and } \left|\hat\sigma_{ni} - \sigma_{0i}\right| < \frac{\delta}{n_i},\; i = 0, \ldots, p_1\right) \ge 1 - \varepsilon
$$
where $n_i = \nu_i^{1/2}$, $i = 0, \ldots, p_1 + 1$.

2. The $p$-dimensional random vector with the first $p_0$ components given by $n_{p_1+1}\left(\hat\beta_n - \beta_0\right)$ and the last $p_1 + 1$ given by $n_i\left(\hat\sigma_{ni} - \sigma_{0i}\right)$, $i = 0, \ldots, p_1$, converges in distribution to a $N_p(0, J^{-1})$.
The proof of the theorem will consist of verifying that the maximum likelihood
estimates for model (3.1.1) satisfy the conditions of Theorem 3.1.1, under As-
sumptions 2.1.1, 2.1.2, and 3.1.1 through 3.1.7. The proof will parallel the steps
in Miller (1977), but we will need to derive intermediate results, since those
used in his paper do not apply to the more general model (3.1.1).
Define
$$
\kappa = \kappa(n) = \max_{i,j} \left| -\left(1/n_{l(i)} n_{l(j)}\right) \mathrm{E}_{\theta_0}\left(\frac{\partial^2 \ell(\theta \mid y)}{\partial\theta_i\,\partial\theta_j}\bigg|_{\theta_0}\right) - J_{ij}(\theta_0) \right|
$$
where
$$
l(i) = \begin{cases} p_1 + 1, & \text{if } 1 \le i \le p_0 \\ i - (p_0 + 1), & \text{otherwise.} \end{cases}
$$
By Assumption 3.1.7, $\kappa \to 0$. Define now $g = \min\left(n_0^{1/4}, n_1^{1/4}, \ldots, n_{p_1+1}^{1/4}, \kappa^{-1/4}\right)$. Note that $g \to \infty$, since $n_i \to \infty$, $i = 0, \ldots, p_1 + 1$, by Assumption 3.1.6 and $\kappa \to 0$. It is also true that $g/n_i \le g^{-3} \to 0$, $i = 0, \ldots, p_1 + 1$. Theorem 3.1.1 allows a different sequence $g_i$ for each parameter, but we will use a common $g_i = g$, $i = 1, \ldots, p$.
The conditions of Theorem 3.1.1 are then equivalent to
$$
\left| \left(-\left(1/n_{l(i)} n_{l(j)}\right) \frac{\partial^2 \ell(\theta \mid y)}{\partial\theta_i\,\partial\theta_j}\bigg|_{\theta_0}\right) - J_{ij}(\theta_0) \right| \xrightarrow{P_{\theta_0}} 0, \quad i, j = 1, \ldots, p \tag{3.1.3}
$$
and
$$
\sup_{\theta_1 \in N_n(\theta_0)} g^2 \left| \left(-\left(1/n_{l(i)} n_{l(j)}\right) \frac{\partial^2 \ell(\theta \mid y)}{\partial\theta_i\,\partial\theta_j}\bigg|_{\theta_1}\right) - J_{ij}(\theta_0) \right| \xrightarrow{P_{\theta_2}} 0,
$$
for $i, j = 1, \ldots, p$ and $\forall\, \theta_2 \in N_n(\theta_0)$, where $N_n(\theta_0)$ is as defined in Theorem 3.1.1. Using the same reasoning as in Miller (1977) we have, by repeated applications of the triangle inequality,
$$
\begin{aligned}
&\sup_{\theta_1 \in N_n(\theta_0)} \left| \left(-\left(1/n_{l(i)} n_{l(j)}\right) \frac{\partial^2 \ell(\theta \mid y)}{\partial\theta_i\,\partial\theta_j}\bigg|_{\theta_1}\right) - J_{ij}(\theta_0) \right| \\
&\quad \le \left(1/n_{l(i)} n_{l(j)}\right) \sup_{\theta_1 \in N_n(\theta_0)} \left| \frac{\partial^2 \ell(\theta \mid y)}{\partial\theta_i\,\partial\theta_j}\bigg|_{\theta_1} - \frac{\partial^2 \ell(\theta \mid y)}{\partial\theta_i\,\partial\theta_j}\bigg|_{\theta_2} \right| \\
&\qquad + \left(1/n_{l(i)} n_{l(j)}\right) \left| \frac{\partial^2 \ell(\theta \mid y)}{\partial\theta_i\,\partial\theta_j}\bigg|_{\theta_2} - \mathrm{E}_{\theta_2}\left(\frac{\partial^2 \ell(\theta \mid y)}{\partial\theta_i\,\partial\theta_j}\bigg|_{\theta_2}\right) \right| \\
&\qquad + \left(1/n_{l(i)} n_{l(j)}\right) \left| \mathrm{E}_{\theta_2}\left(\frac{\partial^2 \ell(\theta \mid y)}{\partial\theta_i\,\partial\theta_j}\bigg|_{\theta_2}\right) - \mathrm{E}_{\theta_2}\left(\frac{\partial^2 \ell(\theta \mid y)}{\partial\theta_i\,\partial\theta_j}\bigg|_{\theta_0}\right) \right| \\
&\qquad + \left(1/n_{l(i)} n_{l(j)}\right) \left| \mathrm{E}_{\theta_2}\left(\frac{\partial^2 \ell(\theta \mid y)}{\partial\theta_i\,\partial\theta_j}\bigg|_{\theta_0}\right) - \mathrm{E}_{\theta_0}\left(\frac{\partial^2 \ell(\theta \mid y)}{\partial\theta_i\,\partial\theta_j}\bigg|_{\theta_0}\right) \right| \\
&\qquad + \left| \left(-1/n_{l(i)} n_{l(j)}\right) \mathrm{E}_{\theta_0}\left(\frac{\partial^2 \ell(\theta \mid y)}{\partial\theta_i\,\partial\theta_j}\bigg|_{\theta_0}\right) - J_{ij}(\theta_0) \right|. \tag{3.1.4}
\end{aligned}
$$
Let the five terms on the right hand side of inequality (3.1.4) be denoted by $\phi_1$, $\phi_2$, $\phi_3$, $\phi_4$, and $\phi_5$. We note that $\phi_3$, $\phi_4$, and $\phi_5$ are nonrandom terms that we will show are bounded by sequences going to zero as $n \to \infty$. Then we will also show that $\phi_1$ and $\phi_2$ converge in probability to zero.
3.1.1 Limit of φ5
By definition
$$
g^2 \phi_5 \le \kappa^{-1/2} \kappa = \kappa^{1/2} \to 0, \quad \text{as } n \to \infty.
$$
3.1.2 Limit of φ4
To establish g2φ4 → 0 we will first consider the ∂2�(θ | y)/∂βi∂βj derivatives.
Since this quantity is nonrandom it follows that φ4 = 0 for these pairs of terms.
Next we consider the $\partial^2\ell(\theta \mid y)/\partial\sigma_i\,\partial\beta_j$ second derivatives. In this case
$$
g^2 \phi_4 = \frac{g^2}{n_i n_{p_1+1}} \left| \xi_j^T X^T \Sigma_0^{-1} G_i \Sigma_0^{-1} X (\beta_2 - \beta_0) \right|
$$
where $\xi_j$ denotes the $j$th canonical basis vector with components
$$
\left[\xi_j\right]_k = \begin{cases} 1, & \text{if } k = j \\ 0, & \text{otherwise.} \end{cases}
$$
Using the Cauchy-Schwartz inequality repeatedly we get
$$
\begin{aligned}
\left(n_i n_{p_1+1}\right) \phi_4 &\le \left[\xi_j^T X^T \Sigma_0^{-1} G_i \Sigma_0^{-1} G_i \Sigma_0^{-1} X \xi_j\right]^{1/2} \left[(\beta_2 - \beta_0)^T X^T \Sigma_0^{-1} X (\beta_2 - \beta_0)\right]^{1/2} \\
&\le \max_k \left|\lambda_k\left(\Sigma_0^{-1} G_i\right)\right| \left[\xi_j^T X^T \Sigma_0^{-1} X \xi_j\right]^{1/2} \left[(\beta_2 - \beta_0)^T X^T \Sigma_0^{-1} X (\beta_2 - \beta_0)\right]^{1/2} \\
&\le \max_k \left|\lambda_k\left(\Sigma_0^{-1} G_i\right)\right| \lambda_{\max}\left(X^T \Sigma_0^{-1} X\right) \left\|\beta_2 - \beta_0\right\|, \tag{3.1.5}
\end{aligned}
$$
where $\lambda_k$ denotes the $k$th eigenvalue and $\lambda_{\max}$ the maximum eigenvalue. By the definition of $N_n(\theta_0)$, $\left\|\beta_2 - \beta_0\right\| \le \sqrt{p_0}\, g/n_{p_1+1}$ and, by Assumption 3.1.7 and the continuity of the maximum eigenvalue, for sufficiently large $n$ we must have
$$
\lambda_{\max}\left(X^T \Sigma_0^{-1} X\right) < 2 n_{p_1+1}^2 \lambda_{\max}(C_0). \tag{3.1.6}
$$
Also, from Lemma A.3 proven in Appendix A, $\exists\, \delta_0 = \delta_0(\sigma_0) > 0$ such that $\max_k \left|\lambda_k\left(\Sigma_0^{-1} G_i\right)\right| \le 2/\delta_0$, $i = 0, \ldots, p_1$. It then follows that for sufficiently large $n$
$$
g^2 \phi_4 \le \frac{4 g^3 \sqrt{p_0}\, \lambda_{\max}(C_0)}{n_i \delta_0} \le \frac{4 \sqrt{p_0}\, \lambda_{\max}(C_0)}{g \delta_0}
$$
and by Assumptions 3.1.5 and 3.1.6 the last quantity goes to zero as $n \to \infty$.
Finally consider the $\partial^2\ell(\theta \mid y)/\partial\sigma_i\,\partial\sigma_j$ derivatives. In this case we have
$$
g^2 \phi_4 \le \frac{g^2}{n_i n_j} \left\{ \left| \operatorname{trace}\left[\Sigma_0^{-1} G_i \Sigma_0^{-1} G_j \Sigma_0^{-1} (\Sigma_0 - \Sigma_2)\right] \right| + \left| (\beta_2 - \beta_0)^T X^T \Sigma_0^{-1} G_i \Sigma_0^{-1} G_j \Sigma_0^{-1} X (\beta_2 - \beta_0) \right| \right\}. \tag{3.1.7}
$$
We will consider each term on the right hand side of inequality (3.1.7) separately. For the first term note that $r_{ij} = \operatorname{rank}\left(\Sigma_0^{-1} G_i \Sigma_0^{-1} G_j \Sigma_0^{-1} (\Sigma_0 - \Sigma_2)\right) \le \min\left(\operatorname{rank}(G_i), \operatorname{rank}(G_j)\right) = \min(\nu_i, \nu_j)$. Let $P_{ij}$ be an $n \times r_{ij}$ matrix whose columns form an orthonormal basis for the range space of $\Sigma_0^{-1/2} G_i \Sigma_0^{-1} G_j \Sigma_0^{-1} (\Sigma_0 - \Sigma_2) \left(\Sigma_0^{-1/2}\right)^T$, with $\Sigma_0^{-1/2}$ denoting the Cholesky factor of $\Sigma_0^{-1}$ (Thisted, 1988). It then follows that
$$
\begin{aligned}
\left| \operatorname{trace}\left(\Sigma_0^{-1} G_i \Sigma_0^{-1} G_j \Sigma_0^{-1} (\Sigma_0 - \Sigma_2)\right) \right| &= \left| \operatorname{trace}\left(P_{ij}^T \Sigma_0^{-1/2} G_i \Sigma_0^{-1} G_j \Sigma_0^{-1} (\Sigma_0 - \Sigma_2) \left(\Sigma_0^{-1/2}\right)^T P_{ij}\right) \right| \\
&\le \sum_{k=1}^{r_{ij}} \left| \xi_k^T P_{ij}^T \Sigma_0^{-1/2} G_i \Sigma_0^{-1} G_j \Sigma_0^{-1} (\Sigma_0 - \Sigma_2) \left(\Sigma_0^{-1/2}\right)^T P_{ij} \xi_k \right|. \tag{3.1.8}
\end{aligned}
$$
Applying the Cauchy-Schwartz inequality and Lemmas A.3 and A.7 to the terms of the summation in (3.1.8) gives
$$
\begin{aligned}
&\left| \xi_k^T P_{ij}^T \Sigma_0^{-1/2} G_i \Sigma_0^{-1} G_j \Sigma_0^{-1} (\Sigma_0 - \Sigma_2) \left(\Sigma_0^{-1/2}\right)^T P_{ij} \xi_k \right| \\
&\quad \le \left[\xi_k^T P_{ij}^T \Sigma_0^{-1/2} G_i \Sigma_0^{-1} G_i \left(\Sigma_0^{-1/2}\right)^T P_{ij} \xi_k\right]^{1/2} \left[\xi_k^T P_{ij}^T \Sigma_0^{-1/2} (\Sigma_0 - \Sigma_2) \Sigma_0^{-1} G_j \Sigma_0^{-1} G_j \Sigma_0^{-1} (\Sigma_0 - \Sigma_2) \left(\Sigma_0^{-1/2}\right)^T P_{ij} \xi_k\right]^{1/2} \\
&\quad \le \max_k \left|\lambda_k\left(\Sigma_0^{-1} G_i\right)\right| \max_k \left|\lambda_k\left(\Sigma_0^{-1} G_j\right)\right| \max_k \left|\lambda_k\left(\Sigma_0^{-1}(\Sigma_0 - \Sigma_2)\right)\right| \le \frac{4}{g^3 \delta_0^2} \left(\frac{q}{\lambda_{\min}(D_0)} + \frac{1}{\sigma_0^2}\right), \tag{3.1.9}
\end{aligned}
$$
where $q = \sum_{i=1}^{r} q_i$ and $D_0$ is the $D$ matrix evaluated at $\theta = \theta_0$. Consider now the second term on the right hand side of inequality (3.1.7). Using the Cauchy-Schwartz inequality, (3.1.6), and Lemma A.3 we get
$$
\begin{aligned}
\left| (\beta_2 - \beta_0)^T X^T \Sigma_0^{-1} G_i \Sigma_0^{-1} G_j \Sigma_0^{-1} X (\beta_2 - \beta_0) \right| &\le \left[(\beta_2 - \beta_0)^T X^T \Sigma_0^{-1} G_i \Sigma_0^{-1} G_i \Sigma_0^{-1} X (\beta_2 - \beta_0)\right]^{1/2} \\
&\qquad \times \left[(\beta_2 - \beta_0)^T X^T \Sigma_0^{-1} G_j \Sigma_0^{-1} G_j \Sigma_0^{-1} X (\beta_2 - \beta_0)\right]^{1/2} \\
&\le 2 n_{p_1+1}^2 \lambda_{\max}(C_0) \max_k \left|\lambda_k\left(\Sigma_0^{-1} G_i\right)\right| \max_k \left|\lambda_k\left(\Sigma_0^{-1} G_j\right)\right| \left\|\beta_2 - \beta_0\right\|^2 \\
&\le 8 p_0 g^2 \lambda_{\max}(C_0)/\delta_0^2 \tag{3.1.10}
\end{aligned}
$$
and therefore
$$
g^2 \phi_4 \le \frac{4 r_{ij}}{g\, n_i n_j \delta_0^2} \left(\frac{q}{\lambda_{\min}(D_0)} + \frac{1}{\sigma_0^2}\right) + \frac{8 p_0 g^4 \lambda_{\max}(C_0)}{n_i n_j \delta_0^2}.
$$
Since $r_{ij} \le n_i n_j$, $g^4/(n_i n_j) \le g^{-4}$, and $g \to \infty$, it follows that $g^2 \phi_4 \to 0$.
3.1.3 Limit of φ3
Let us start with the $\partial^2\ell(\theta \mid y)/\partial\beta_i\,\partial\beta_j$ derivatives. In this case we have $n_{p_1+1}^2 \phi_3 = \left| \xi_i^T X^T \left(\Sigma_2^{-1} - \Sigma_0^{-1}\right) X \xi_j \right|$. Noting that $\Sigma_2^{-1} - \Sigma_0^{-1} = \Sigma_2^{-1}(\Sigma_0 - \Sigma_2)\Sigma_0^{-1}$ and using the Cauchy-Schwartz inequality and (3.1.6) we get
$$
\begin{aligned}
n_{p_1+1}^2 \phi_3 &\le \left[\xi_i^T X^T \left(\Sigma_2^{-1}(\Sigma_0 - \Sigma_2)\right)^2 \Sigma_0^{-1} X \xi_i\right]^{1/2} \left[\xi_j^T X^T \Sigma_0^{-1} X \xi_j\right]^{1/2} \\
&\le 2 n_{p_1+1}^2 \lambda_{\max}(C_0) \max_k \left|\lambda_k\left(\Sigma_2^{-1}(\Sigma_0 - \Sigma_2)\right)\right| \tag{3.1.11}
\end{aligned}
$$
and using Lemma A.7 we get
$$
g^2 \phi_3 \le \frac{4 \lambda_{\max}(C_0)}{g} \left(\frac{q}{\lambda_{\min}(D_0)} + \frac{1}{\sigma_0^2}\right),
$$
which goes to zero as $n \to \infty$.

Next consider the $\partial^2\ell(\theta \mid y)/\partial\beta_i\,\partial\sigma_j$ derivatives. By applying the Cauchy-Schwartz inequality, Lemma A.3, and (3.1.6) we get
$$
\begin{aligned}
n_{p_1+1} n_j \phi_3 &= \left| \xi_i^T X^T \Sigma_0^{-1} G_j \Sigma_0^{-1} X (\beta_2 - \beta_0) \right| \\
&\le \left[\xi_i^T X^T \Sigma_0^{-1} G_j \Sigma_0^{-1} G_j \Sigma_0^{-1} X \xi_i\right]^{1/2} \left[(\beta_2 - \beta_0)^T X^T \Sigma_0^{-1} X (\beta_2 - \beta_0)\right]^{1/2} \\
&\le 4 \sqrt{p_0}\, n_{p_1+1}\, g\, \lambda_{\max}(C_0)/\delta_0
\end{aligned}
$$
and therefore $g^2 \phi_3 \le 4 \sqrt{p_0}\, g^3 \lambda_{\max}(C_0)/(n_j \delta_0) \le 4 \sqrt{p_0}\, \lambda_{\max}(C_0)/(g \delta_0) \to 0$.

Finally consider the $\partial^2\ell(\theta \mid y)/\partial\sigma_i\,\partial\sigma_j$ derivatives. By applying the triangle inequality we get
$$
\begin{aligned}
n_i n_j \phi_3 &\le \tfrac{1}{2} \left| \operatorname{trace}\left(\Sigma_0^{-1} G_i \Sigma_0^{-1} G_j \Sigma_0^{-1} \Sigma_2\right) - \operatorname{trace}\left(\Sigma_2^{-1} G_i \Sigma_2^{-1} G_j\right) \right| \\
&\quad + \tfrac{1}{2} \left| \operatorname{trace}\left(\Sigma_0^{-1} G_i \Sigma_0^{-1} G_j \Sigma_0^{-1} (\Sigma_2 - \Sigma_0)\right) \right| \\
&\quad + (\beta_2 - \beta_0)^T X^T \Sigma_0^{-1} G_i \Sigma_0^{-1} G_j \Sigma_0^{-1} X (\beta_2 - \beta_0). \tag{3.1.12}
\end{aligned}
$$
From (3.1.8), (3.1.9), and (3.1.10) we have that the last two quantities on the right hand side of (3.1.12) are bounded respectively by $\left(2 r_{ij}/g^3 \delta_0^2\right)\left(q/\lambda_{\min}(D_0) + 1/\sigma_0^2\right)$ and $8 p_0 g^2 \lambda_{\max}(C_0)/\delta_0^2$. Now note that
$$
\begin{aligned}
\Sigma_0^{-1} G_i \Sigma_0^{-1} G_j \Sigma_0^{-1} \Sigma_2 - \Sigma_2^{-1} G_i \Sigma_2^{-1} G_j &= \Sigma_0^{-1} G_i \Sigma_0^{-1} G_j \Sigma_0^{-1} (\Sigma_2 - \Sigma_0) + \Sigma_0^{-1} G_i \left(\Sigma_0^{-1} - \Sigma_2^{-1}\right) G_j \\
&\quad + \left(\Sigma_0^{-1} - \Sigma_2^{-1}\right) G_i \Sigma_2^{-1} G_j.
\end{aligned}
$$
From (3.1.8), (3.1.9), and Lemmas A.5 and A.7 it follows that
$$
\begin{aligned}
\left| \operatorname{trace}\left(\Sigma_0^{-1} G_i \Sigma_0^{-1} G_j \Sigma_0^{-1} (\Sigma_2 - \Sigma_0)\right) \right| &\le \left(4 r_{ij}/\delta_0^2 g^3\right)\left(\frac{q}{\lambda_{\min}(D_0)} + \frac{1}{\sigma_0^2}\right) \\
\left| \operatorname{trace}\left(\Sigma_0^{-1} G_i \left(\Sigma_0^{-1} - \Sigma_2^{-1}\right) G_j\right) \right| &\le \left(8 r_{ij}/\delta_0^2 g^3\right)\left(\frac{q}{\lambda_{\min}(D_0)} + \frac{1}{\sigma_0^2}\right) \\
\left| \operatorname{trace}\left(\left(\Sigma_0^{-1} - \Sigma_2^{-1}\right) G_i \Sigma_2^{-1} G_j\right) \right| &\le \left(16 r_{ij}/\delta_0^2 g^3\right)\left(\frac{q}{\lambda_{\min}(D_0)} + \frac{1}{\sigma_0^2}\right)
\end{aligned}
$$
and therefore
$$
g^2 \phi_3 \le \frac{16 r_{ij}}{g\, n_i n_j \delta_0^2} \left(\frac{q}{\lambda_{\min}(D_0)} + \frac{1}{\sigma_0^2}\right) + \frac{8 p_0 g^4 \lambda_{\max}(C_0)}{n_i n_j \delta_0^2},
$$
where, as before, $r_{ij} \le \min(\nu_i, \nu_j)$. Since $r_{ij}/(g n_i n_j) \le g^{-1}$ and $g^4/(n_i n_j) \le g^{-4}$, it follows that $g^2 \phi_3 \to 0$.
3.1.4 Limit of φ2
It follows from Tchebychev's inequality that to show $g^2 \phi_2 \xrightarrow{P_{\theta_2}} 0$ it suffices to show $\operatorname{Var}_{\theta_2}\left(g^2 \phi_2\right) \to 0$. Since $\partial^2\ell(\theta \mid y)/\partial\beta_i\,\partial\beta_j$ is nonrandom, its variance is zero and the condition is trivially verified.

Consider now the $\partial^2\ell(\theta \mid y)/\partial\beta\,\partial\sigma_j$ derivatives. To show that the variance of each component goes to zero, it is enough to show that the trace of the associated variance-covariance matrix goes to zero. In this case
$$
\operatorname{trace}\left(\operatorname{Var}\left(\partial^2\ell(\theta \mid y)/\partial\beta\,\partial\sigma_j\right)\right) = \operatorname{trace}\left(X^T \Sigma_2^{-1} G_j \Sigma_2^{-1} G_j \Sigma_2^{-1} X\right) \le p_0 \max_k \left(\left[\lambda_k\left(\Sigma_2^{-1} G_j\right)\right]^2\right) \lambda_{\max}\left(X^T \Sigma_2^{-1} X\right). \tag{3.1.13}
$$
Using the fact that $\Sigma_2^{-1} = \Sigma_0^{-1} + \left(\Sigma_2^{-1} - \Sigma_0^{-1}\right)$ and the results in (3.1.6) and (3.1.11) we get that
$$
\begin{aligned}
\lambda_{\max}\left(X^T \Sigma_2^{-1} X\right) &\le \lambda_{\max}\left(X^T \Sigma_0^{-1} X\right) + \lambda_{\max}\left(X^T \left(\Sigma_2^{-1} - \Sigma_0^{-1}\right) X\right) \\
&\le 2 n_{p_1+1}^2 \lambda_{\max}(C_0) \left(1 + \frac{2}{g^3}\left(\frac{q}{\lambda_{\min}(D_0)} + \frac{1}{\sigma_0^2}\right)\right). \tag{3.1.14}
\end{aligned}
$$
Using (3.1.13), (3.1.14), and Lemma A.5, and recalling that these derivatives enter $\phi_2$ scaled by $1/(n_{p_1+1} n_j)$, gives
$$
g^4 \operatorname{Var}_{\theta_2}(\phi_2) \le \frac{32 g^4 p_0 \lambda_{\max}(C_0)}{n_j^2 \delta_0^2} \left(1 + \frac{2}{g^3}\left(\frac{q}{\lambda_{\min}(D_0)} + \frac{1}{\sigma_0^2}\right)\right)
$$
and since $g^4/n_j^2 \le g^{-4}$ it follows that $g^2 \phi_2 \xrightarrow{P_{\theta_2}} 0$.

Finally consider the $\partial^2\ell(\theta \mid y)/\partial\sigma_i\,\partial\sigma_j$ derivatives. Using standard results on the variance of quadratic forms (Seber, 1977) and Lemma A.5 we get
$$
n_i^2 n_j^2 \operatorname{Var}(\phi_2) = 2\operatorname{trace}\left[\left(\Sigma_2^{-1} G_i \Sigma_2^{-1} G_j\right)^2\right] \le 2 r_{ij} \left[\max_k \left|\lambda_k\left(\Sigma_2^{-1} G_i\right)\right| \max_k \left|\lambda_k\left(\Sigma_2^{-1} G_j\right)\right|\right]^2,
$$
hence
$$
g^4 \operatorname{Var}(\phi_2) \le \frac{512 g^4 r_{ij}}{n_i^2 n_j^2 \delta_0^4} \le \frac{512}{g^4 \delta_0^4}
$$
and so $g^2 \phi_2 \xrightarrow{P_{\theta_2}} 0$.
3.1.5 Limit of φ1
Let us first take the $\partial^2\ell(\theta \mid y)/\partial\beta_i\,\partial\beta_j$ derivatives. In this case we get
$$
n_{p_1+1}^2 \phi_1 = \sup_{\theta_1 \in N_n(\theta_0)} \left| \xi_i^T X^T \left(\Sigma_1^{-1} - \Sigma_2^{-1}\right) X \xi_j \right|.
$$
Applying the Cauchy-Schwartz inequality, (3.1.14), and Lemma A.8 we have that for large enough $n$
$$
\begin{aligned}
\left| \xi_i^T X^T \left(\Sigma_1^{-1} - \Sigma_2^{-1}\right) X \xi_j \right| &\le \left[\xi_i^T X^T \Sigma_1^{-1} (\Sigma_2 - \Sigma_1) \Sigma_2^{-1} (\Sigma_2 - \Sigma_1) \Sigma_1^{-1} X \xi_i\right]^{1/2} \left[\xi_j^T X^T \Sigma_2^{-1} X \xi_j\right]^{1/2} \\
&\le \lambda_{\max}\left(X^T \Sigma_2^{-1} X\right) \max_k \left|\lambda_k\left(\Sigma_1^{-1}(\Sigma_2 - \Sigma_1)\right)\right| \\
&\le 2 n_{p_1+1}^2 \lambda_{\max}(C_0) \left(1 + \frac{2}{g^3}\left(\frac{q}{\lambda_{\min}(D_0)} + \frac{1}{\sigma_0^2}\right)\right) \frac{4}{g^3}\left(\frac{q}{\lambda_{\min}(D_0)} + \frac{1}{\sigma_0^2}\right)
\end{aligned}
$$
and as the bound does not depend on $\theta_1$ we get that, for large enough $n$,
$$
g^2 \phi_1 \le \frac{8 \lambda_{\max}(C_0)}{g} \left(1 + \frac{2}{g^3}\left(\frac{q}{\lambda_{\min}(D_0)} + \frac{1}{\sigma_0^2}\right)\right)\left(\frac{q}{\lambda_{\min}(D_0)} + \frac{1}{\sigma_0^2}\right)
$$
and therefore $g^2 \phi_1 \to 0$.
Consider now the $\partial^2\ell(\theta \mid y)/\partial\beta_i\,\partial\sigma_j$ derivatives. We have that
$$
n_{p_1+1} n_j \phi_1 = \sup_{\theta_1 \in N_n(\theta_0)} \left| \xi_i^T X^T \left(\Sigma_1^{-1} G_j \Sigma_1^{-1} (y - X\beta_1) - \Sigma_2^{-1} G_j \Sigma_2^{-1} (y - X\beta_2)\right) \right|.
$$
Note that
$$
\begin{aligned}
&\left| \xi_i^T X^T \left(\Sigma_1^{-1} G_j \Sigma_1^{-1} (y - X\beta_1) - \Sigma_2^{-1} G_j \Sigma_2^{-1} (y - X\beta_2)\right) \right| \\
&\quad \le \left| \xi_i^T X^T \left(\Sigma_1^{-1} G_j \Sigma_1^{-1} - \Sigma_2^{-1} G_j \Sigma_2^{-1}\right) (y - X\beta_2) \right| + \left| \xi_i^T X^T \Sigma_1^{-1} G_j \Sigma_1^{-1} X (\beta_1 - \beta_2) \right|.
\end{aligned}
$$
Now from (3.1.14) and Lemmas A.1 and A.5
$$
\begin{aligned}
\left| \xi_i^T X^T \Sigma_1^{-1} G_j \Sigma_1^{-1} X (\beta_1 - \beta_2) \right| &\le \left[\xi_i^T X^T \Sigma_1^{-1} X \xi_i\right]^{1/2} \left[(\beta_1 - \beta_2)^T X^T \Sigma_1^{-1} G_j \Sigma_1^{-1} G_j \Sigma_1^{-1} X (\beta_1 - \beta_2)\right]^{1/2} \\
&\le \frac{4 n_{p_1+1}\, g\, \sqrt{p_0}\, \lambda_{\max}(C_0)}{\sqrt{\delta_0}} \left(1 + \frac{2}{g^3}\left(\frac{q}{\lambda_{\min}(D_0)} + \frac{1}{\sigma_0^2}\right)\right). \tag{3.1.15}
\end{aligned}
$$
Noting that
$$
\Sigma_1^{-1} G_j \Sigma_1^{-1} - \Sigma_2^{-1} G_j \Sigma_2^{-1} = \Sigma_1^{-1} G_j \left(\Sigma_1^{-1} - \Sigma_2^{-1}\right) + \left(\Sigma_1^{-1} - \Sigma_2^{-1}\right) G_j \Sigma_2^{-1},
$$
we get by applying the triangle inequality
$$
\begin{aligned}
\left| \xi_i^T X^T \left(\Sigma_1^{-1} G_j \Sigma_1^{-1} - \Sigma_2^{-1} G_j \Sigma_2^{-1}\right) (y - X\beta_2) \right| &\le \left| \xi_i^T X^T \Sigma_1^{-1} G_j \left(\Sigma_1^{-1} - \Sigma_2^{-1}\right) (y - X\beta_2) \right| \\
&\quad + \left| \xi_i^T X^T \left(\Sigma_1^{-1} - \Sigma_2^{-1}\right) G_j \Sigma_2^{-1} (y - X\beta_2) \right|. \tag{3.1.16}
\end{aligned}
$$
But using the Cauchy-Schwartz inequality once again gives
$$
\begin{aligned}
\left| \xi_i^T X^T \Sigma_1^{-1} G_j \left(\Sigma_1^{-1} - \Sigma_2^{-1}\right) (y - X\beta_2) \right| \le &\left[\xi_i^T X^T \Sigma_1^{-1} X \xi_i\right]^{1/2} \\
&\times \left[(y - X\beta_2)^T \left(\Sigma_1^{-1} - \Sigma_2^{-1}\right) G_j \Sigma_1^{-1} G_j \left(\Sigma_1^{-1} - \Sigma_2^{-1}\right) (y - X\beta_2)\right]^{1/2}.
\end{aligned}
$$
From (3.1.14) it follows that
$$
\xi_i^T X^T \Sigma_1^{-1} X \xi_i \le 2 n_{p_1+1}^2 \lambda_{\max}(C_0) \left(1 + \frac{2}{g^3}\left(\frac{q}{\lambda_{\min}(D_0)} + \frac{1}{\sigma_0^2}\right)\right).
$$
Now by applying Lemmas A.5, A.6, and A.8 we get that
$$
\begin{aligned}
&g^4 \lambda_{\max}\left(\left[\Sigma_2^{1/2}\right]^{-T} (\Sigma_2 - \Sigma_1) \Sigma_1^{-1} G_j \Sigma_1^{-1} G_j \Sigma_1^{-1} (\Sigma_2 - \Sigma_1) \left[\Sigma_2^{1/2}\right]^{-1}\right) \\
&\quad \le g^4 \max_k \left(\left[\lambda_k\left(\Sigma_1^{-1} G_j\right)\right]^2\right) \max_k \left(\left[\lambda_k\left(\Sigma_1^{-1}(\Sigma_2 - \Sigma_1)\right)\right]^2\right) \lambda_{\max}\left(\Sigma_2^{-1} \Sigma_1\right) \\
&\quad \le \frac{1024\, g^4}{g^6 \delta_0^2} \left(\frac{q}{\lambda_{\min}(D_0)} + \frac{1}{\sigma_0^2}\right)^2 \left(\frac{\lambda_{\max}(D_0)}{\lambda_{\min}(D_0)} + 1\right) \to 0.
\end{aligned}
$$
Noting that
$$
\operatorname{rank}(G_j) = n_j^2 \ge \operatorname{rank}\left(\left[\Sigma_2^{1/2}\right]^{-T} (\Sigma_2 - \Sigma_1) \Sigma_1^{-1} G_j \Sigma_1^{-1} G_j \Sigma_1^{-1} (\Sigma_2 - \Sigma_1) \left[\Sigma_2^{1/2}\right]^{-1}\right)
$$
we get from Lemma A.9
$$
\sup_{\theta_1 \in N_n(\theta_0)} \frac{g^4}{n_j^2} (y - X\beta_2)^T \left(\Sigma_1^{-1} - \Sigma_2^{-1}\right) G_j \Sigma_1^{-1} G_j \left(\Sigma_1^{-1} - \Sigma_2^{-1}\right) (y - X\beta_2) \xrightarrow{P_{\theta_2}} 0. \tag{3.1.17}
$$
Consider now the second term on the right hand side of (3.1.16). Applying the Cauchy-Schwartz inequality to it gives
$$
\begin{aligned}
\left| \xi_i^T X^T \left(\Sigma_1^{-1} - \Sigma_2^{-1}\right) G_j \Sigma_2^{-1} (y - X\beta_2) \right| \le &\left[\xi_i^T X^T \left(\Sigma_1^{-1}(\Sigma_2 - \Sigma_1)\right)^2 \Sigma_2^{-1} X \xi_i\right]^{1/2} \\
&\times \left[(y - X\beta_2)^T \Sigma_2^{-1} G_j \Sigma_2^{-1} G_j \Sigma_2^{-1} (y - X\beta_2)\right]^{1/2}
\end{aligned}
$$
and using (3.1.14) and Lemma A.8 we get that
$$
\xi_i^T X^T \left(\Sigma_1^{-1}(\Sigma_2 - \Sigma_1)\right)^2 \Sigma_2^{-1} X \xi_i \le \frac{32 n_{p_1+1}^2 \lambda_{\max}(C_0)}{g^6} \left(1 + \frac{2}{g^3}\left(\frac{q}{\lambda_{\min}(D_0)} + \frac{1}{\sigma_0^2}\right)\right)\left(\frac{q}{\lambda_{\min}(D_0)} + \frac{1}{\sigma_0^2}\right)^2.
$$
Observing that
$$
\operatorname{rank}\left(\left[\Sigma_2^{1/2}\right]^{-T} G_j \Sigma_2^{-1} G_j \left[\Sigma_2^{1/2}\right]^{-1}\right) \le n_j^2 \quad \text{and} \quad \lambda_{\max}\left(\left[\Sigma_2^{1/2}\right]^{-T} G_j \Sigma_2^{-1} G_j \left[\Sigma_2^{1/2}\right]^{-1}\right) \le 16/\delta_0^2,
$$
we get using Lemma A.9
$$
P_{\theta_2}\left(\sup_{\theta_1 \in N_n(\theta_0)} (1/n_j^2)(y - X\beta_2)^T \left[\Sigma_2^{1/2}\right]^{-1}\left(\left[\Sigma_2^{1/2}\right]^{-T} G_j \Sigma_2^{-1} G_j \left[\Sigma_2^{1/2}\right]^{-1}\right)\left[\Sigma_2^{1/2}\right]^{-T} (y - X\beta_2) > \frac{32}{\delta_0^2}\right) \to 0.
$$
Therefore
$$
\begin{aligned}
&\frac{g^2}{n_{p_1+1} n_j} \sup_{\theta_1 \in N_n(\theta_0)} \left| \xi_i^T X^T \left(\Sigma_1^{-1} - \Sigma_2^{-1}\right) G_j \Sigma_2^{-1} (y - X\beta_2) \right| \\
&\quad \le \frac{4\sqrt{2}\, \lambda_{\max}(C_0)}{g} \left(1 + 2\left(\frac{q}{\lambda_{\min}(D_0)} + \frac{1}{\sigma_0^2}\right)\right)^{1/2} \left(\frac{q}{\lambda_{\min}(D_0)} + \frac{1}{\sigma_0^2}\right) \\
&\qquad \times \sup_{\theta_1 \in N_n(\theta_0)} \left[\frac{1}{n_j^2}(y - X\beta_2)^T \left[\Sigma_2^{1/2}\right]^{-1}\left(\left[\Sigma_2^{1/2}\right]^{-T} G_j \Sigma_2^{-1} G_j \left[\Sigma_2^{1/2}\right]^{-1}\right)\left[\Sigma_2^{1/2}\right]^{-T} (y - X\beta_2)\right]^{1/2} \tag{3.1.18}
\end{aligned}
$$
and this converges to zero in probability (under $\theta_2$), since $g^{-1} \to 0$ and the second term of the product on the right hand side of (3.1.18) is bounded in probability by $4\sqrt{2}/\delta_0$. Combining results (3.1.15), (3.1.17), and (3.1.18) gives that $g^2 \phi_1 \xrightarrow{P_{\theta_2}} 0$ as desired.
Finally consider the $\partial^2\ell(\theta \mid y)/\partial\sigma_i\,\partial\sigma_j$ derivatives. In this case we get
$$
\begin{aligned}
n_i n_j \phi_1 &\le \left| \operatorname{trace}\left(\Sigma_1^{-1} G_i \Sigma_1^{-1} G_j\right) - \operatorname{trace}\left(\Sigma_2^{-1} G_i \Sigma_2^{-1} G_j\right) \right| \\
&\quad + \big| (y - X\beta_1)^T \Sigma_1^{-1} G_i \Sigma_1^{-1} G_j \Sigma_1^{-1} (y - X\beta_1) - (y - X\beta_2)^T \Sigma_2^{-1} G_i \Sigma_2^{-1} G_j \Sigma_2^{-1} (y - X\beta_2) \big|. \tag{3.1.19}
\end{aligned}
$$
Now noting that
$$
\Sigma_1^{-1} G_i \Sigma_1^{-1} G_j - \Sigma_2^{-1} G_i \Sigma_2^{-1} G_j = \left(\Sigma_1^{-1} - \Sigma_2^{-1}\right) G_i \Sigma_1^{-1} G_j + \Sigma_2^{-1} G_i \left(\Sigma_1^{-1} - \Sigma_2^{-1}\right) G_j
$$
and
$$
\operatorname{rank}\left(\left(\Sigma_1^{-1} - \Sigma_2^{-1}\right) G_i \Sigma_1^{-1} G_j\right) = r_{ij} \le \min(n_i^2, n_j^2), \qquad \operatorname{rank}\left(\Sigma_2^{-1} G_i \left(\Sigma_1^{-1} - \Sigma_2^{-1}\right) G_j\right) = r_{ij} \le \min(n_i^2, n_j^2),
$$
we get using the triangle inequality and Lemmas A.5 and A.8
$$
\begin{aligned}
&\frac{g^2}{n_i n_j} \left| \operatorname{trace}\left(\Sigma_1^{-1} G_i \Sigma_1^{-1} G_j\right) - \operatorname{trace}\left(\Sigma_2^{-1} G_i \Sigma_2^{-1} G_j\right) \right| \\
&\quad \le \frac{g^2 r_{ij}}{n_i n_j} \max_k \left|\lambda_k\left(\Sigma_1^{-1} G_i\right)\right| \max_k \left|\lambda_k\left(\Sigma_1^{-1} G_j\right)\right| \max_k \left|\lambda_k\left(\Sigma_2^{-1}(\Sigma_2 - \Sigma_1)\right)\right| \\
&\qquad + \frac{g^2 r_{ij}}{n_i n_j} \max_k \left|\lambda_k\left(\Sigma_2^{-1} G_i\right)\right| \max_k \left|\lambda_k\left(\Sigma_2^{-1} G_j\right)\right| \max_k \left|\lambda_k\left(\Sigma_1^{-1}(\Sigma_1 - \Sigma_2)\right)\right| \\
&\quad \le \frac{128\, r_{ij}}{\delta_0^2\, g\, n_i n_j}.
\end{aligned}
$$
Since $r_{ij} \le n_i n_j$ and $g \to \infty$, we see that this term converges to zero as $n \to \infty$.
Now note that
$$
\begin{aligned}
&\big| (y - X\beta_1)^T \Sigma_1^{-1} G_i \Sigma_1^{-1} G_j \Sigma_1^{-1} (y - X\beta_1) - (y - X\beta_2)^T \Sigma_2^{-1} G_i \Sigma_2^{-1} G_j \Sigma_2^{-1} (y - X\beta_2) \big| \\
&\quad \le \left| (y - X\beta_2)^T \left(\Sigma_1^{-1} G_i \Sigma_1^{-1} G_j \Sigma_1^{-1} - \Sigma_2^{-1} G_i \Sigma_2^{-1} G_j \Sigma_2^{-1}\right) (y - X\beta_2) \right| \\
&\qquad + \left| (y - X\beta_2)^T \Sigma_1^{-1} G_i \Sigma_1^{-1} G_j \Sigma_1^{-1} X (\beta_2 - \beta_1) \right| \\
&\qquad + \left| (\beta_2 - \beta_1)^T X^T \Sigma_1^{-1} G_i \Sigma_1^{-1} G_j \Sigma_1^{-1} (y - X\beta_2) \right| \\
&\qquad + \left| (\beta_2 - \beta_1)^T X^T \Sigma_1^{-1} G_i \Sigma_1^{-1} G_j \Sigma_1^{-1} X (\beta_2 - \beta_1) \right|. \tag{3.1.20}
\end{aligned}
$$
Consider now the first term on the right hand side of (3.1.20). Note that
$$
\begin{aligned}
\Sigma_1^{-1} G_i \Sigma_1^{-1} G_j \Sigma_1^{-1} - \Sigma_2^{-1} G_i \Sigma_2^{-1} G_j \Sigma_2^{-1} &= \left(\Sigma_1^{-1} - \Sigma_2^{-1}\right) G_i \Sigma_1^{-1} G_j \Sigma_1^{-1} + \Sigma_2^{-1} G_i \left(\Sigma_1^{-1} - \Sigma_2^{-1}\right) G_j \Sigma_1^{-1} \\
&\quad + \Sigma_2^{-1} G_i \Sigma_2^{-1} G_j \left(\Sigma_1^{-1} - \Sigma_2^{-1}\right).
\end{aligned}
$$
But
$$
\begin{aligned}
&\left| (y - X\beta_2)^T \left(\Sigma_1^{-1} - \Sigma_2^{-1}\right) G_i \Sigma_1^{-1} G_j \Sigma_1^{-1} (y - X\beta_2) \right| \\
&\quad \le \left[(y - X\beta_2)^T \Sigma_2^{-1} (\Sigma_2 - \Sigma_1) \Sigma_1^{-1} G_i \Sigma_1^{-1} G_i \Sigma_1^{-1} (\Sigma_2 - \Sigma_1) \Sigma_2^{-1} (y - X\beta_2)\right]^{1/2} \\
&\qquad \times \left[(y - X\beta_2)^T \Sigma_1^{-1} G_j \Sigma_1^{-1} G_j \Sigma_1^{-1} (y - X\beta_2)\right]^{1/2}.
\end{aligned}
$$
Now note that
$$
g^4 \lambda_{\max}\left(\left[\Sigma_2^{1/2}\right]^{-T} (\Sigma_2 - \Sigma_1) \Sigma_1^{-1} G_i \Sigma_1^{-1} G_i \Sigma_1^{-1} (\Sigma_2 - \Sigma_1) \left[\Sigma_2^{1/2}\right]^{-1}\right) \le \frac{1024\, g^4}{\delta_0^2\, g^6} \left(\frac{q}{\lambda_{\min}(D_0)} + \frac{1}{\sigma_0^2}\right)^2 \left(\frac{\lambda_{\max}(D_0)}{\lambda_{\min}(D_0)} + 1\right)
$$
and this term converges to zero as $n \to \infty$. From Lemmas A.5 and A.6 it follows that
$$
\lambda_{\max}\left(\Sigma_2^{1/2} \Sigma_1^{-1} G_j \Sigma_1^{-1} G_j \Sigma_1^{-1} \left(\Sigma_2^{1/2}\right)^T\right) \le \max_k \left(\left[\lambda_k\left(\Sigma_1^{-1} G_j\right)\right]^2\right) \lambda_{\max}\left(\Sigma_1^{-1} \Sigma_2\right) \le \frac{64}{\delta_0^2}\left(\frac{\lambda_{\max}(D_0)}{\lambda_{\min}(D_0)} + 1\right).
$$
Therefore, by Lemma A.9 we have that
$$
\frac{g^2}{n_i n_j} \sup_{\theta_1 \in N_n(\theta_0)} \left| (y - X\beta_2)^T \left(\Sigma_1^{-1} - \Sigma_2^{-1}\right) G_i \Sigma_1^{-1} G_j \Sigma_1^{-1} (y - X\beta_2) \right| \xrightarrow{P_{\theta_2}} 0,
$$
since it is dominated by a product of two terms, the first converging to zero in probability and the second bounded in probability by a constant. Using the exact same reasoning we show that
$$
\frac{g^2}{n_i n_j} \sup_{\theta_1 \in N_n(\theta_0)} \left| (y - X\beta_2)^T \Sigma_2^{-1} G_i \left(\Sigma_1^{-1} - \Sigma_2^{-1}\right) G_j \Sigma_1^{-1} (y - X\beta_2) \right| \xrightarrow{P_{\theta_2}} 0,
$$
$$
\frac{g^2}{n_i n_j} \sup_{\theta_1 \in N_n(\theta_0)} \left| (y - X\beta_2)^T \Sigma_2^{-1} G_i \Sigma_2^{-1} G_j \left(\Sigma_1^{-1} - \Sigma_2^{-1}\right) (y - X\beta_2) \right| \xrightarrow{P_{\theta_2}} 0,
$$
and that in turn implies that
$$
\frac{g^2}{n_i n_j} \sup_{\theta_1 \in N_n(\theta_0)} \left| (y - X\beta_2)^T \left(\Sigma_1^{-1} G_i \Sigma_1^{-1} G_j \Sigma_1^{-1} - \Sigma_2^{-1} G_i \Sigma_2^{-1} G_j \Sigma_2^{-1}\right) (y - X\beta_2) \right| \xrightarrow{P_{\theta_2}} 0.
$$
Consider now the term
$$
\begin{aligned}
\frac{g^2}{n_i n_j} \left| (y - X\beta_2)^T \Sigma_1^{-1} G_i \Sigma_1^{-1} G_j \Sigma_1^{-1} X (\beta_2 - \beta_1) \right| &\le \left[\frac{g^4}{n_i^2} (\beta_2 - \beta_1)^T X^T \Sigma_1^{-1} G_j \Sigma_1^{-1} G_j \Sigma_1^{-1} X (\beta_2 - \beta_1)\right]^{1/2} \\
&\quad \times \left[\frac{1}{n_j^2} (y - X\beta_2)^T \Sigma_1^{-1} G_i \Sigma_1^{-1} G_i \Sigma_1^{-1} (y - X\beta_2)\right]^{1/2}. \tag{3.1.21}
\end{aligned}
$$
The first term on the right hand side of (3.1.21) is bounded by
$$
\frac{4 \sqrt{2 p_0}\, \lambda_{\max}(C_0)}{g \delta_0} \left(1 + \frac{2}{g^3}\left(\frac{q}{\lambda_{\min}(D_0)} + \frac{1}{\sigma_0^2}\right)\right)^{1/2},
$$
which goes to zero with $n$. The supremum over $\theta_1 \in N_n(\theta_0)$ of the second term is bounded in probability (under $\theta_2$) by $\left(4\sqrt{2}/\delta_0\right)\left(\lambda_{\max}(D_0)/\lambda_{\min}(D_0) + 1\right)^{1/2}$ as $n \to \infty$, by Lemma A.9. It then follows that
$$
\frac{g^2}{n_i n_j} \sup_{\theta_1 \in N_n(\theta_0)} \left| (y - X\beta_2)^T \Sigma_1^{-1} G_i \Sigma_1^{-1} G_j \Sigma_1^{-1} X (\beta_2 - \beta_1) \right| \xrightarrow{P_{\theta_2}} 0.
$$
Similarly we can show that
$$
\frac{g^2}{n_i n_j} \sup_{\theta_1 \in N_n(\theta_0)} \left| (y - X\beta_2)^T \Sigma_1^{-1} G_j \Sigma_1^{-1} G_i \Sigma_1^{-1} X (\beta_2 - \beta_1) \right| \xrightarrow{P_{\theta_2}} 0.
$$
Finally we note that
$$
\begin{aligned}
\frac{g^2}{n_i n_j} \left| (\beta_2 - \beta_1)^T X^T \Sigma_1^{-1} G_i \Sigma_1^{-1} G_j \Sigma_1^{-1} X (\beta_2 - \beta_1) \right| &\le \frac{g^2}{n_i n_j} \left[(\beta_2 - \beta_1)^T X^T \Sigma_1^{-1} G_i \Sigma_1^{-1} G_i \Sigma_1^{-1} X (\beta_2 - \beta_1)\right]^{1/2} \\
&\quad \times \left[(\beta_2 - \beta_1)^T X^T \Sigma_1^{-1} G_j \Sigma_1^{-1} G_j \Sigma_1^{-1} X (\beta_2 - \beta_1)\right]^{1/2} \\
&\le \frac{32 p_0 g^4 \lambda_{\max}(C_0)}{n_i n_j \delta_0^2} \left(1 + \frac{2}{g^3}\left(\frac{q}{\lambda_{\min}(D_0)} + \frac{1}{\sigma_0^2}\right)\right)
\end{aligned}
$$
and this term goes to zero as $n \to \infty$, since $g^4/(n_i n_j) \le g^{-4}$.
Now we put all the previous results together and see that they in fact imply the second condition of Theorem 3.1.1. We want to show that for given $\theta_2\in N_n(\theta_0)$ and $\varepsilon,\delta>0$, there exists an $n_0$ such that for all $n>n_0$ the probability that the left hand side of (3.1.4) is greater than $\delta$ is less than $\varepsilon$. First we choose $n_1$ such that $g^2\max(\phi_3,\phi_4,\phi_5)<\delta/5$, which is always possible since all three terms converge to zero. Next we get $n_2>n_1$ such that $\forall n>n_2$, $P_{\theta_2}(g^2\phi_2>\delta/5)<\varepsilon/2$, which is always possible since $g^2\phi_2\xrightarrow{P_{\theta_2}}0$. Finally choose $n_0>n_2$ such that $\forall n>n_0$, $P_{\theta_2}(g^2\phi_1>\delta/5)<\varepsilon/2$, which is always possible since $g^2\phi_1\xrightarrow{P_{\theta_2}}0$.
Now since
\[
g^2\sup_{\theta_1\in N_n(\theta_0)}\left|\left(-\frac{1}{n_{l(i)}n_{l(j)}}\left.\frac{\partial^2\ell(\theta\mid y)}{\partial\theta_i\partial\theta_j}\right|_{\theta_1}\right)-J_{ij}(\theta_0)\right|\le g^2\sum_{i=1}^{5}\phi_i
\]
for all $n>n_0$ we have that
\[
P_{\theta_2}\left(g^2\sup_{\theta_1\in N_n(\theta_0)}\left|\left(-\frac{1}{n_{l(i)}n_{l(j)}}\left.\frac{\partial^2\ell(\theta\mid y)}{\partial\theta_i\partial\theta_j}\right|_{\theta_1}\right)-J_{ij}(\theta_0)\right|>\delta\right)
\le P_{\theta_2}\left(g^2(\phi_1+\phi_2)>2\delta/5\right)\le P_{\theta_2}(g^2\phi_1>\delta/5)+P_{\theta_2}(g^2\phi_2>\delta/5)<\varepsilon.
\]
To complete the proof of Theorem 3.1.2 we note that
\[
\left|\left(-\frac{1}{n_{l(i)}n_{l(j)}}\left.\frac{\partial^2\ell(\theta\mid y)}{\partial\theta_i\partial\theta_j}\right|_{\theta_0}\right)-J_{ij}(\theta_0)\right|\le g^2(\phi_2+\phi_5)
\]
with $\phi_2$ evaluated at $\theta_2=\theta_0$. From the results in subsections 3.1.1 and 3.1.4, both $g^2\phi_2$ and $g^2\phi_5$ converge in probability to zero, and therefore so does the left hand side of the last inequality.
3.2 Restricted Maximum Likelihood
In this section we show that under a modification of Assumptions 3.1.4 and 3.1.7,
the RMLEs of the variance-covariance components in (3.1.1) are asymptotically
normal and consistent. We also show that the usual estimates of the fixed effects
in RML estimation have the same asymptotic distribution as the maximum
likelihood estimates.
We recall from section 2.2 that the restricted likelihood can be defined as the likelihood of $y^*=Q_2^Ty$, where $Q_2$ is defined in (2.2.3). Letting $G_i^*=Q_2^TG_iQ_2$, $i=0,\ldots,p_1$, it follows from (2.2.4) and (3.1.2) that
\[
\Sigma^*=\sum_{i=0}^{p_1}\sigma_iG_i^* \tag{3.2.1}
\]
where $\Sigma^*$ denotes the covariance matrix of $y^*$. The linear mixed effects model corresponding to $y^*$ can be written as
\[
y^*=Z^*b+\varepsilon^* \tag{3.2.2}
\]
where $\varepsilon^*=Q_2^T\varepsilon\sim N(0,\sigma^2I)$ and is independent of $b$.
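The error-contrast construction above is easy to exercise numerically. A minimal sketch follows; the helper name `restricted_loglik` and the use of a QR decomposition of $X$ to obtain $Q_2$ are implementation choices, not prescribed by the text:

```python
import numpy as np
from scipy import linalg, stats

def restricted_loglik(y, X, Sigma):
    """Restricted log-likelihood: the log-density of y* = Q2' y, where the
    columns of Q2 span the orthogonal complement of the column space of X."""
    n, p = X.shape
    Q, _ = linalg.qr(X)              # full QR; trailing columns span C(X)^perp
    Q2 = Q[:, p:]                    # n x (n - p)
    y_star = Q2.T @ y                # error contrasts, free of the fixed effects
    Sigma_star = Q2.T @ Sigma @ Q2   # covariance of y*, as in (3.2.1)
    return stats.multivariate_normal.logpdf(y_star, mean=np.zeros(n - p),
                                            cov=Sigma_star)
```

Because $Q_2^TX=0$, the value returned is invariant to shifting $y$ by any $X\beta$, which is exactly why the restricted likelihood contains no information about the fixed effects.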
Assumptions 2.1.1 through 3.1.7 were the only conditions required in the
proof of Theorem 3.1.2. In this section we will assume that they still hold.
Assumption 2.1.1 was used only to ensure that Σ had a linear structure (i.e.
could be expressed as a linear combination of known matrices). Even though it
doesn’t necessarily hold for Z∗, Σ∗ has a linear structure, as shown in (3.2.1).
Assumption 2.1.2 holds for model (3.2.2) if we replace ε by ε∗. Assumption 3.1.1
implies that Q2 is an n× (n− p0) matrix, but is otherwise not needed, since no
fixed effects need to be estimated in (3.2.2). If we let $n^*=n-p_0$ represent the sample size of the restricted model (3.2.2), then Assumption 3.1.2 gives $n^*\ge p_1+1$, which is simply that assumption translated to the restricted model. Assumption 3.1.3 is not needed for the restricted model. Assumption 3.1.4 ensures that
σ is identifiable in the general linear mixed effects model. We need a similar
assumption for the restricted model.
Assumption 3.2.1 The matrices $G_0^*,G_1^*,\ldots,G_{p_1}^*$ are linearly independent, i.e. $\sum_{i=0}^{p_1}\tau_iG_i^*=0\iff\tau_i=0,\ i=0,\ldots,p_1$.
Assumption 3.1.5 remains unchanged in the restricted model. Now define
\[
\nu_k^*=\operatorname{rank}(G_k^*),\ k=1,\ldots,p_1,\qquad
\nu_0^*=n^*-\operatorname{rank}\left(Q_2^T\left[U_1^1:\cdots:U_{q_r}^r\right]\right),
\]
and note that $\nu_k-2p_0\le\nu_k^*\le\nu_k$, $k=0,\ldots,p_1$. It follows that $\nu_k^*\to\infty$, $k=0,\ldots,p_1$, and $\lim_{n\to\infty}\nu_0^*/n^*=\lim_{n\to\infty}\nu_0/n$, so that Assumption 3.1.6 also holds for the restricted model if we replace $\nu_0$ and $n$ by $\nu_0^*$ and $n^*$. Assumption 3.1.7 needs to be rephrased for model (3.2.2) as
Assumption 3.2.2 Let $C_1^*$ be the $(p_1+1)\times(p_1+1)$ matrix defined by
\[
[C_1^*]_{ij}=\frac{1}{2}\lim_{n^*\to\infty}\operatorname{trace}\left((\Sigma_0^*)^{-1}G_i^*(\Sigma_0^*)^{-1}G_j^*\right)\big/(\nu_i^*\nu_j^*)^{1/2},\quad i,j=0,\ldots,p_1.
\]
Then the limits exist and $C_1^*$ is positive definite.

In the definition of $C_1^*$ above, $\Sigma_0^*$ represents the variance-covariance matrix of $y^*$ evaluated at the true parameter vector $\sigma_0$.
Define now the parameter space $\Theta^*$ for the restricted model (3.2.2):
\[
\Theta^*=\left\{\sigma\in\Re^{p_1+1}\mid\sigma_0>0\text{ and each }D_i\text{ is positive semi-definite},\ i=1,\ldots,r\right\}.
\]
We can now state the equivalent of Theorem 3.1.2 for the RMLEs of the variance-covariance components in model (3.2.2).

Theorem 3.2.1 Under Assumptions 2.1.1, 2.1.2, 3.1.1, 3.1.2, 3.1.5, 3.1.6, 3.2.1, and 3.2.2, and letting $\sigma_0$ be an interior point of $\Theta^*$ representing the true parameter vector for model (3.2.2), there exists a sequence of estimates $\sigma_n$ with the following properties.

1. Given $\varepsilon>0$, $\exists\delta=\delta(\varepsilon)$, $0<\delta<\infty$, and $n_0=n_0(\varepsilon)$ such that $\forall n>n_0$
\[
P_{\sigma_0}\left(\left.\frac{\partial\ell(\sigma\mid y^*)}{\partial\sigma}\right|_{\sigma=\sigma_n}=0;\ |\sigma_{ni}-\sigma_{0i}|<\frac{\delta}{n_i^*},\ i=0,\ldots,p_1\right)\ge1-\varepsilon
\]
where $n_i^*=\sqrt{\nu_i^*}$, $i=0,\ldots,p_1$.

2. The $(p_1+1)$-dimensional random vector with components given by $n_i^*(\sigma_{ni}-\sigma_{0i})$, $i=0,\ldots,p_1$, converges in distribution to a $N\left(0,(C_1^*)^{-1}\right)$ distribution.
Proof: The proof is identical to that of Theorem 3.1.2, since under Assumptions 2.1.1, 3.1.2, 3.1.5, 3.1.6, 3.2.1, and 3.2.2 the lemmas in Appendix A are valid for the restricted model (with the obvious modifications, such as replacing $\Sigma_0$ by $\Sigma_0^*$, etc.). Note that only the $\partial^2\ell(\sigma\mid y^*)/\partial\sigma_i\partial\sigma_j$ derivatives need to be considered for the restricted model (3.2.2).
We now consider the estimation of the fixed effects under RML estimation of the variance-covariance components. Since, for given $\sigma$, the maximum likelihood estimate of the fixed effects is given by
\[
\beta(\sigma)=\left(X^T\Sigma^{-1}X\right)^{-1}X^T\Sigma^{-1}y \tag{3.2.3}
\]
it has been proposed that $\beta(\hat\sigma)$, with $\hat\sigma$ denoting the RML estimate of $\sigma$, be used as a natural estimate of the fixed effects (Lindstrom and Bates, 1988). We now show that such estimates have the same asymptotic properties as the ML estimates, described in Theorem 3.1.2. In fact we prove the more general result:
Theorem 3.2.2 Let $\hat\sigma$ be a (weakly) consistent estimator of $\sigma$ and $\beta(\hat\sigma)$ the corresponding estimator of the fixed effects $\beta$, given by (3.2.3). Then under Assumption 3.1.7 it follows that

1. $\beta(\hat\sigma)$ is (weakly) consistent for $\beta$;

2. $n_{p_1+1}\left(\beta(\hat\sigma)-\beta\right)\xrightarrow{D}N(0,C_0^{-1})$.
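For a given variance-covariance estimate, the plug-in estimator $\beta(\hat\sigma)$ of (3.2.3) is an ordinary generalized least squares computation. A minimal sketch (the Cholesky whitening step and the function name are implementation choices, not taken from the text):

```python
import numpy as np

def gls_beta(y, X, Sigma):
    """Generalized least squares estimate (3.2.3):
    beta(sigma) = (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} y.
    Instead of inverting Sigma, whiten with its Cholesky factor."""
    L = np.linalg.cholesky(Sigma)          # Sigma = L L'
    Xw = np.linalg.solve(L, X)             # L^{-1} X
    yw = np.linalg.solve(L, y)             # L^{-1} y
    beta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
    return beta
```

With $\Sigma=\sigma^2I$ this reduces to ordinary least squares; in the theorem, $\Sigma$ is replaced by its plug-in estimate $\hat\Sigma$.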
Proof: Note that 2 $\Rightarrow$ 1 since, by Slutsky's theorem and (3.1.7),
\[
n_{p_1+1}\left(\beta(\hat\sigma)-\beta\right)\xrightarrow{D}N(0,C_0^{-1})\ \Rightarrow\ \beta(\hat\sigma)-\beta\xrightarrow{D}0\ \Rightarrow\ \beta(\hat\sigma)\xrightarrow{P}\beta,
\]
so that we just need to prove the asymptotic normality of $\beta(\hat\sigma)$. By Assumption 3.1.7, $X^T\Sigma^{-1}X/\nu_{p_1+1}\to C_0$. It then follows that $\nu_{p_1+1}\left(X^T\Sigma^{-1}X\right)^{-1}C_0\to I$ and $n_{p_1+1}\left[\left(X^T\Sigma^{-1}X\right)^{1/2}\right]^{-1}C_0^{1/2}\to I$. Now
since $y\sim N(X\beta,\Sigma)$, we have
\[
\left(X^T\Sigma^{-1}X\right)^{-1}X^T\Sigma^{-1}(y-X\beta)\sim N\left(0,\left(X^T\Sigma^{-1}X\right)^{-1}\right)
\]
\[
\Rightarrow\left[\left(X^T\Sigma^{-1}X\right)^{T/2}\right]^{-1}X^T\Sigma^{-1}(y-X\beta)\sim N(0,I)
\]
and therefore, by Slutsky's theorem, we have that
\[
n_{p_1+1}C_0^{1/2}\left(X^T\Sigma^{-1}X\right)^{-1}X^T\Sigma^{-1}(y-X\beta)\xrightarrow{D}N(0,I)
\Rightarrow n_{p_1+1}\left(X^T\Sigma^{-1}X\right)^{-1}X^T\Sigma^{-1}(y-X\beta)\xrightarrow{D}N(0,C_0^{-1}).
\]
We will show that $n_{p_1+1}\left(\beta(\hat\sigma)-\beta(\sigma)\right)\xrightarrow{P}0$ and, by an application of Slutsky's theorem, conclude that $n_{p_1+1}\left(\beta(\hat\sigma)-\beta\right)\xrightarrow{D}N(0,C_0^{-1})$. Let $\hat\Sigma$, $\hat D_A$, and $\hat D$ denote the estimates of $\Sigma$, $D_A$, and $D$ corresponding to $\hat\sigma$. We first show that
\[
(1/\nu_{p_1+1})X^T\left(\Sigma^{-1}-\hat\Sigma^{-1}\right)X\xrightarrow{P}0. \tag{3.2.4}
\]
In order to do that we need

Lemma 3.2.1 Let $A$ be an $a\times a$ symmetric matrix. Then
\[
\max_{ij}|A_{ij}|\le\max_{1\le k\le a}|\lambda_k(A)|.
\]

Proof: Letting $\xi_i$ denote the $i$th canonical basis vector, we have, using the Cauchy–Schwarz inequality,
\[
|A_{ij}|=\left|\xi_i^TA\xi_j\right|\le\left(\xi_i^TA^2\xi_i\right)^{1/2}\le\max_{1\le k\le a}|\lambda_k(A)|.
\]
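A quick numerical sanity check of Lemma 3.2.1 (purely illustrative; the matrix below is arbitrary):

```python
import numpy as np

# Lemma 3.2.1: for a symmetric matrix, the largest absolute entry is
# bounded by the largest absolute eigenvalue (the spectral radius).
rng = np.random.default_rng(42)
B = rng.normal(size=(6, 6))
A = (B + B.T) / 2.0                                  # arbitrary symmetric matrix
max_entry = np.abs(A).max()
spectral_radius = np.abs(np.linalg.eigvalsh(A)).max()
assert max_entry <= spectral_radius + 1e-12
```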
As a result of Lemma 3.2.1, to show (3.2.4) it suffices to show that
\[
\max_{1\le k\le p_0}\left|\lambda_k\left((1/\nu_{p_1+1})X^T\left(\Sigma^{-1}-\hat\Sigma^{-1}\right)X\right)\right|\xrightarrow{P}0.
\]
But using elementary results on eigenvalues of symmetric matrices we have
\[
\max_k\left|\lambda_k\left((1/\nu_{p_1+1})X^T\left(\Sigma^{-1}-\hat\Sigma^{-1}\right)X\right)\right| \tag{3.2.5}
=\max_k\left|\lambda_k\left((1/\nu_{p_1+1})X^T\hat\Sigma^{-1}\left(\hat\Sigma-\Sigma\right)\Sigma^{-1}X\right)\right|
\le\lambda_{\max}\left(\frac{X^T\Sigma^{-1}X}{\nu_{p_1+1}}\right)\max_k\left|\lambda_k\left(\hat\Sigma^{-1}\left(\Sigma-\hat\Sigma\right)\right)\right|.
\]
Now
\[
\max_k\left|\lambda_k\left(\hat\Sigma^{-1}\left(\Sigma-\hat\Sigma\right)\right)\right| \tag{3.2.6}
=\sup_{\|\xi\|=1}\frac{\left|\xi^T\left(\Sigma-\hat\Sigma\right)\xi\right|}{\xi^T\hat\Sigma\xi}
\le\sup_{\|\xi\|=1,\,Z^T\xi\ne0}\frac{\left|\xi^TZ\left(D_A-\hat D_A\right)Z^T\xi\right|}{\xi^TZ\hat D_AZ^T\xi}+\frac{|\sigma^2-\hat\sigma^2|}{\hat\sigma^2}
\le\frac{\max_k\left|\lambda_k\left(D-\hat D\right)\right|}{\lambda_{\min}(\hat D)}+\frac{|\sigma^2-\hat\sigma^2|}{\hat\sigma^2}.
\]
By assumption, $\hat\sigma\xrightarrow{P}\sigma$. Therefore $\hat D\xrightarrow{P}D$ and $\hat\sigma^2\xrightarrow{P}\sigma^2$. By the continuity of the minimum and maximum eigenvalues and Slutsky's theorem, the last bound in (3.2.6) converges in probability to zero. By Assumption 3.1.7, $\lambda_{\max}\left(X^T\Sigma^{-1}X/\nu_{p_1+1}\right)\to\lambda_{\max}(C_0)$
and it follows that
\[
\max_k\left|\lambda_k\left((1/\nu_{p_1+1})X^T\left(\Sigma^{-1}-\hat\Sigma^{-1}\right)X\right)\right|\xrightarrow{P}0.
\]
As a consequence of (3.2.4) we have that
\[
\left\|\frac{X^T\hat\Sigma^{-1}X}{\nu_{p_1+1}}-C_0\right\|
\le\left\|\frac{X^T\left(\hat\Sigma^{-1}-\Sigma^{-1}\right)X}{\nu_{p_1+1}}\right\|
+\left\|\frac{X^T\Sigma^{-1}X}{\nu_{p_1+1}}-C_0\right\|\xrightarrow{P}0
\]
so that $X^T\hat\Sigma^{-1}X/\nu_{p_1+1}\xrightarrow{P}C_0$ and, by the continuity of the matrix inverse, $\nu_{p_1+1}\left(X^T\hat\Sigma^{-1}X\right)^{-1}\xrightarrow{P}C_0^{-1}$. Therefore, by Assumption 3.1.7,
\[
X^T\Sigma^{-1}X\left(X^T\hat\Sigma^{-1}X\right)^{-1}
=\left(X^T\Sigma^{-1}X/\nu_{p_1+1}\right)\left[\nu_{p_1+1}\left(X^T\hat\Sigma^{-1}X\right)^{-1}\right]\xrightarrow{P}C_0C_0^{-1}=I. \tag{3.2.7}
\]
Then by Slutsky's theorem we have that
\[
n_{p_1+1}\left(X^T\hat\Sigma^{-1}X\right)^{-1}X^T\Sigma^{-1}(y-X\beta)\xrightarrow{D}N(0,C_0^{-1}). \tag{3.2.8}
\]
To complete the proof of the theorem we just need to show that
\[
n_{p_1+1}\left(X^T\hat\Sigma^{-1}X\right)^{-1}X^T\left(\Sigma^{-1}-\hat\Sigma^{-1}\right)(y-X\beta)\xrightarrow{P}0.
\]
By applying the Cauchy–Schwarz inequality we get
\[
n_{p_1+1}\xi_i^T\left(X^T\hat\Sigma^{-1}X\right)^{-1}X^T\left(\Sigma^{-1}-\hat\Sigma^{-1}\right)(y-X\beta) \tag{3.2.9}
\]
\[
\le\left[\nu_{p_1+1}\xi_i^T\left(X^T\hat\Sigma^{-1}X\right)^{-1}\xi_i\right]^{1/2}
\left[(y-X\beta)^T\Sigma^{-1}\left(\hat\Sigma-\Sigma\right)\hat\Sigma^{-1}X\left(X^T\hat\Sigma^{-1}X\right)^{-1}X^T\hat\Sigma^{-1}\left(\hat\Sigma-\Sigma\right)\Sigma^{-1}(y-X\beta)\right]^{1/2}.
\]
Since
\[
\nu_{p_1+1}\xi_i^T\left(X^T\hat\Sigma^{-1}X\right)^{-1}\xi_i\xrightarrow{P}\xi_i^TC_0^{-1}\xi_i\le\lambda_{\max}(C_0^{-1}) \tag{3.2.10}
\]
the first term on the right hand side of (3.2.9) is bounded in probability by $\lambda_{\max}^{1/2}(C_0^{-1})$ as $n\to\infty$. Now noting that $\left(\Sigma^{1/2}\right)^{-1}(y-X\beta)\sim N(0,I)$ and
\[
\operatorname{rank}\left(\left(\Sigma^{1/2}\right)^{-T}\left(\hat\Sigma-\Sigma\right)\hat\Sigma^{-1}X\left(X^T\hat\Sigma^{-1}X\right)^{-1}X^T\hat\Sigma^{-1}\left(\hat\Sigma-\Sigma\right)\left(\Sigma^{1/2}\right)^{-1}\right)\le p_0
\]
we get, using the same reasoning as in Lemma A.9 in Appendix A,
\[
(y-X\beta)^T\Sigma^{-1}\left(\hat\Sigma-\Sigma\right)\hat\Sigma^{-1}X\left(X^T\hat\Sigma^{-1}X\right)^{-1}X^T\hat\Sigma^{-1}\left(\hat\Sigma-\Sigma\right)\Sigma^{-1}(y-X\beta) \tag{3.2.11}
\]
\[
\le\lambda_{\max}\left(\left(\hat\Sigma^{1/2}\right)^{-T}X\left(X^T\hat\Sigma^{-1}X\right)^{-1}X^T\left(\hat\Sigma^{1/2}\right)^{-1}\right)
\max_k\left|\lambda_k\left(\hat\Sigma^{-1}\left(\Sigma-\hat\Sigma\right)\right)\right|\max_k\left|\lambda_k\left(\Sigma^{-1}\left(\hat\Sigma-\Sigma\right)\right)\right|\,\|w_{n,p_0}\|^2
\]
where $\|w_{n,p_0}\|^2\sim\chi^2_{p_0}$, $\forall n$. Now note that
\[
\lambda_{\max}\left(\left(\hat\Sigma^{1/2}\right)^{-T}X\left(X^T\hat\Sigma^{-1}X\right)^{-1}X^T\left(\hat\Sigma^{1/2}\right)^{-1}\right)
\le\operatorname{trace}\left(\left(\hat\Sigma^{1/2}\right)^{-T}X\left(X^T\hat\Sigma^{-1}X\right)^{-1}X^T\left(\hat\Sigma^{1/2}\right)^{-1}\right)=p_0 \tag{3.2.12}
\]
and
\[
\max_k\left|\lambda_k\left(\hat\Sigma^{-1}\left(\Sigma-\hat\Sigma\right)\right)\right|
\le\frac{\max_k\left|\lambda_k\left(D-\hat D\right)\right|}{\lambda_{\min}(\hat D)}+\frac{|\sigma^2-\hat\sigma^2|}{\hat\sigma^2}\xrightarrow{P}0 \tag{3.2.13}
\]
\[
\max_k\left|\lambda_k\left(\Sigma^{-1}\left(\hat\Sigma-\Sigma\right)\right)\right|
\le\frac{\max_k\left|\lambda_k\left(D-\hat D\right)\right|}{\lambda_{\min}(D)}+\frac{|\sigma^2-\hat\sigma^2|}{\sigma^2}\xrightarrow{P}0.
\]
Combining (3.2.9), (3.2.10), (3.2.12), (3.2.13), and Lemma 3.2.1 gives
\[
n_{p_1+1}\left(X^T\hat\Sigma^{-1}X\right)^{-1}X^T\left(\Sigma^{-1}-\hat\Sigma^{-1}\right)(y-X\beta)\xrightarrow{P}0.
\]
It then follows from Slutsky's theorem that
\[
n_{p_1+1}\left(\beta(\hat\sigma)-\beta\right) \tag{3.2.14}
=n_{p_1+1}\left(X^T\hat\Sigma^{-1}X\right)^{-1}X^T\Sigma^{-1}(y-X\beta)
-n_{p_1+1}\left(X^T\hat\Sigma^{-1}X\right)^{-1}X^T\left(\Sigma^{-1}-\hat\Sigma^{-1}\right)(y-X\beta)\xrightarrow{D}N(0,C_0^{-1}),
\]
as we wanted to show.
3.3 Parametrized and/or Structured σ
In this section we consider the asymptotic behavior of the (restricted) maximum
likelihood estimates of the variance-covariance components under reparametriza-
tion (Lindstrom and Bates, 1988) and/or structuring (Jennrich and Schluchter,
1986) of σ. More specifically, we consider the case where σ = f (α), with α of
dimension pα less than or equal to p1 +1. We show that for a large class of well
behaved f , the (restricted) maximum likelihood estimators of α are consistent
and asymptotically normal.
We start by establishing some assumptions about f and α.
Assumption 3.3.1 Let $\sigma_i$, $i=0,\ldots,r$ denote the subset of the parameters in $\sigma$ that define the scaled variance-covariance matrix $D_i$ of the random effects belonging to the $i$th random effects class, with the convention that $\sigma_0=\sigma^2$. Then the parameter vector $\alpha$ and the vector function $f$ can be decomposed into $r+1$ disjoint subsets $\alpha_0,\ldots,\alpha_r$ and $f_0,\ldots,f_r$ in such a way that $\sigma_i=f_i(\alpha_i)$, $i=0,\ldots,r$.

In other words, we assume that $\sigma^2,D_1,\ldots,D_r$ are each defined by disjoint subsets of the parameters in $\alpha$.
Assumption 3.3.2 $f_i$ is of class $C^2$, $i=0,\ldots,r$, i.e. $f_i$ is twice differentiable with continuous second derivatives.
In the proof of the main asymptotic theorem of this subsection we just need that
the second derivatives do not explode in a small neighborhood of the true pa-
rameter vector α0. Requiring continuity of these derivatives is just a convenient
way of controlling their behavior in a neighborhood of α0.
Assumption 3.3.3 $f$ is one-to-one, i.e. $\alpha\ne\alpha'\Rightarrow f(\alpha)\ne f(\alpha')$.
Assumption 3.3.3 is needed to ensure that α is identifiable.
We also need an assumption regarding the limiting behavior of $\nu_i/m_{l(i)}$, $i=1,\ldots,p_1$, where $l(i)$ denotes the random effects class to which the $i$th variance-covariance component corresponds.

Assumption 3.3.4 $\lim_{n\to\infty}\nu_i/m_{l(i)}=s_i$, $i=1,\ldots,p_1$, exists and is positive.

As observed in section 3.1, $m_{l(i)}\le\nu_i\le2m_{l(i)}$, $i=1,\ldots,p_1$, and hence $\nu_i$ and $m_{l(i)}$ are of the same order of magnitude. Assumption 3.3.4 simply states that their ratio tends to a limit.
Now note that, by the chain rule,
\[
\frac{\partial\ell(\beta,\alpha\mid y)}{\partial\alpha_i}=\sum_{k=0}^{p_1}\frac{\partial\ell(\beta,\sigma\mid y)}{\partial\sigma_k}\frac{\partial\sigma_k}{\partial\alpha_i} \tag{3.3.1}
\]
\[
\frac{\partial^2\ell(\beta,\alpha\mid y)}{\partial\alpha_i\partial\alpha_j}
=\sum_{k,l=0}^{p_1}\frac{\partial^2\ell(\beta,\sigma\mid y)}{\partial\sigma_k\partial\sigma_l}\frac{\partial\sigma_k}{\partial\alpha_i}\frac{\partial\sigma_l}{\partial\alpha_j}
+\sum_{k=0}^{p_1}\frac{\partial\ell(\beta,\sigma\mid y)}{\partial\sigma_k}\frac{\partial^2\sigma_k}{\partial\alpha_i\partial\alpha_j}
\]
\[
\frac{\partial^2\ell(\beta,\alpha\mid y)}{\partial\alpha_i\partial\beta_j}
=\sum_{k=0}^{p_1}\frac{\partial^2\ell(\beta,\sigma\mid y)}{\partial\sigma_k\partial\beta_j}\frac{\partial\sigma_k}{\partial\alpha_i}
\]
where $\sigma$ is taken as a function of $\alpha$, so that, for example, $\partial\sigma_k/\partial\alpha_i$ should be understood as $\partial f_k(\alpha)/\partial\alpha_i$, $\partial\ell(\beta,\sigma)/\partial\sigma$ as $\left.\partial\ell(\beta,\sigma)/\partial\sigma\right|_{\sigma=f(\alpha)}$, and so on.
Now let $\nabla_f$ and $H_f$ denote respectively the $(p_1+1)\times p_\alpha$ gradient matrix of $f$ and the $p_\alpha\times p_\alpha\times(p_1+1)$ Hessian array of $f$, defined as
\[
[\nabla_f]_{ij}=\frac{\partial f_i(\alpha)}{\partial\alpha_j},\quad i=1,\ldots,p_1+1,\ j=1,\ldots,p_\alpha,
\]
\[
[H_f]_{ijk}=\frac{\partial^2f_k(\alpha)}{\partial\alpha_i\partial\alpha_j},\quad i,j=1,\ldots,p_\alpha,\ k=1,\ldots,p_1+1.
\]
Note that by Assumption 3.3.1, $[H_f]_{ijk}=0$ whenever $l(i)\ne l(j)$. We can rewrite (3.3.1) in matrix form as
\[
\frac{\partial\ell(\beta,\alpha\mid y)}{\partial\alpha}=\nabla_f^T\frac{\partial\ell(\beta,\sigma\mid y)}{\partial\sigma}
\]
\[
\frac{\partial^2\ell(\beta,\alpha\mid y)}{\partial\alpha\partial\alpha^T}
=\nabla_f^T\frac{\partial^2\ell(\beta,\sigma\mid y)}{\partial\sigma\partial\sigma^T}\nabla_f+H_f\frac{\partial\ell(\beta,\sigma\mid y)}{\partial\sigma}
\]
\[
\frac{\partial^2\ell(\beta,\alpha\mid y)}{\partial\alpha\partial\beta^T}=\nabla_f^T\frac{\partial^2\ell(\beta,\sigma\mid y)}{\partial\sigma\partial\beta^T}.
\]
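The first of these identities is easy to exercise numerically. A minimal sketch follows; the exponential map `f` and the finite-difference Jacobian are illustrative assumptions, not the parametrizations studied later in the text:

```python
import numpy as np

# Chain rule of (3.3.1) in matrix form: with sigma = f(alpha), the gradient
# of the loglikelihood with respect to alpha is (grad_f)^T grad_sigma.

def f(alpha):
    """Example parametrization: sigma_k = exp(alpha_k), which keeps each
    variance component positive while leaving alpha unconstrained."""
    return np.exp(alpha)

def grad_f(alpha, eps=1e-6):
    """Forward-difference Jacobian [grad_f]_{kj} = d f_k / d alpha_j."""
    p = alpha.size
    J = np.zeros((p, p))
    base = f(alpha)
    for j in range(p):
        step = np.zeros(p); step[j] = eps
        J[:, j] = (f(alpha + step) - base) / eps
    return J

# Suppose grad_sigma holds dl/dsigma evaluated at sigma = f(alpha); then
alpha = np.array([0.1, -0.3])
grad_sigma = np.array([2.0, -1.0])
grad_alpha = grad_f(alpha).T @ grad_sigma   # the chain rule in matrix form
```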
Now note that
\[
2\frac{\partial\ell(\beta,\sigma\mid y)}{\partial\sigma_i}
=\operatorname{trace}\left(\Sigma^{-1}G_i\right)-(y-X\beta)^T\Sigma^{-1}G_i\Sigma^{-1}(y-X\beta)
\]
and, as $E\left((y-X\beta)^T\Sigma^{-1}G_i\Sigma^{-1}(y-X\beta)\right)=\operatorname{trace}\left(\Sigma^{-1}G_i\right)$, it follows that $E\left(\partial\ell(\beta,\sigma\mid y)/\partial\sigma\right)=0$. Hence, taking expectations in the second identity above, we get
\[
E\left(\frac{\partial^2\ell(\beta,\alpha\mid y)}{\partial\alpha\partial\alpha^T}\right)
=\nabla_f^TE\left(\frac{\partial^2\ell(\beta,\sigma\mid y)}{\partial\sigma\partial\sigma^T}\right)\nabla_f. \tag{3.3.2}
\]
Note also that
\[
E\left(\frac{\partial^2\ell(\beta,\alpha\mid y)}{\partial\alpha\partial\beta^T}\right)
=\nabla_f^TE\left(\frac{\partial^2\ell(\beta,\sigma\mid y)}{\partial\sigma\partial\beta^T}\right)=0. \tag{3.3.3}
\]
The parameter space $\Theta_f$ of the parametrized/structured model is
\[
\Theta_f=\left\{\theta^f\in\Re^{p_0+p_\alpha}\mid\theta^f=\left(\beta^T,\alpha^T\right)^T,\ \beta\in\Re^{p_0},\ \alpha\in\Re^{p_\alpha}\text{ such that }\sigma_0(\alpha)>0\text{ and }D_i(\alpha)\text{ is positive semi-definite},\ i=1,\ldots,r\right\}.
\]
Note that if we define the augmented function
\[
f_a\left(\theta^f\right)=\left(\beta^T,(f(\alpha))^T\right)^T
\]
we then have
\[
f_a(\Theta_f)\subset\Theta \tag{3.3.4}
\]
where $\Theta$ denotes the parameter space of the linear mixed effects model (2.1.1).
Now let
\[
\nu_i^f=m_{l(i)},\quad n_i^f=\sqrt{\nu_i^f},\quad i=1,\ldots,p_\alpha,\qquad
\nu_0^f=\nu_0,\quad n_0^f=n_0.
\]
Define $n_f=\operatorname{diag}(n_0^f,n_1^f,\ldots,n_{p_\alpha}^f)$ and $s=\operatorname{diag}(1,s_1,\ldots,s_{p_1})$. Then we have
Theorem 3.3.1 Under Assumptions 3.1.7 and 3.3.1 through 3.3.4,
\[
n_f^{-1}\frac{\partial^2\ell(\beta,\alpha\mid y)}{\partial\alpha\partial\alpha^T}n_f^{-1}
\xrightarrow{P}\nabla_f^Ts^{1/2}C_1s^{1/2}\nabla_f\stackrel{\mathrm{def}}{=}C_1^f, \tag{3.3.5}
\]
\[
\left(n_{p_1+1}n_f\right)^{-1}\frac{\partial^2\ell(\beta,\alpha\mid y)}{\partial\alpha\partial\beta^T}\xrightarrow{P}0.
\]
Proof: Consider initially the first limit in (3.3.5). By (3.3.2) and Assumptions 3.1.7, 3.3.1, and 3.3.4 we have that
\[
\frac{1}{n_i^fn_j^f}E\left(\frac{\partial^2\ell(\beta,\alpha\mid y)}{\partial\alpha_i\partial\alpha_j}\right)
=\sum_{p:\,m_{l(p)}=\nu_i^f}\ \sum_{q:\,m_{l(q)}=\nu_j^f}\frac{n_pn_q}{n_i^fn_j^f}\frac{\partial\sigma_p}{\partial\alpha_i}\frac{\partial\sigma_q}{\partial\alpha_j}\left[\frac{1}{n_pn_q}E\left(\frac{\partial^2\ell(\beta,\sigma\mid y)}{\partial\sigma_p\partial\sigma_q}\right)\right]
\]
\[
\to\sum_{p:\,m_{l(p)}=\nu_i^f}\ \sum_{q:\,m_{l(q)}=\nu_j^f}s_p^{1/2}s_q^{1/2}\frac{\partial\sigma_p}{\partial\alpha_i}\frac{\partial\sigma_q}{\partial\alpha_j}[C_1]_{pq}=\left[C_1^f\right]_{ij}.
\]
Hence $E\left(n_f^{-1}\left(\partial^2\ell(\beta,\alpha\mid y)/\partial\alpha\partial\alpha^T\right)n_f^{-1}\right)\to C_1^f$. Now since
\[
\left\|n_f^{-1}\frac{\partial^2\ell(\beta,\alpha\mid y)}{\partial\alpha\partial\alpha^T}n_f^{-1}-C_1^f\right\|
\le\left\|n_f^{-1}\left(\frac{\partial^2\ell(\beta,\alpha\mid y)}{\partial\alpha\partial\alpha^T}-E\left(\frac{\partial^2\ell(\beta,\alpha\mid y)}{\partial\alpha\partial\alpha^T}\right)\right)n_f^{-1}\right\|
+\left\|E\left(n_f^{-1}\frac{\partial^2\ell(\beta,\alpha\mid y)}{\partial\alpha\partial\alpha^T}n_f^{-1}\right)-C_1^f\right\|
\]
it suffices to show that
\[
n_f^{-1}\left(\frac{\partial^2\ell(\beta,\alpha\mid y)}{\partial\alpha\partial\alpha^T}-E\left(\frac{\partial^2\ell(\beta,\alpha\mid y)}{\partial\alpha\partial\alpha^T}\right)\right)n_f^{-1}\xrightarrow{P}0.
\]
Now note that
\[
\frac{1}{n_i^fn_j^f}\left(\frac{\partial^2\ell(\beta,\alpha\mid y)}{\partial\alpha_i\partial\alpha_j}-E\left(\frac{\partial^2\ell(\beta,\alpha\mid y)}{\partial\alpha_i\partial\alpha_j}\right)\right)
=\sum_{p:\,m_{l(p)}=\nu_i^f}\ \sum_{q:\,m_{l(q)}=\nu_j^f}\frac{n_pn_q}{n_i^fn_j^f}\frac{\partial\sigma_p}{\partial\alpha_i}\frac{\partial\sigma_q}{\partial\alpha_j}
\left[\frac{1}{n_pn_q}\left(\frac{\partial^2\ell(\beta,\sigma\mid y)}{\partial\sigma_p\partial\sigma_q}-E\left(\frac{\partial^2\ell(\beta,\sigma\mid y)}{\partial\sigma_p\partial\sigma_q}\right)\right)\right]\xrightarrow{P}0
\]
since $n_pn_q/n_i^fn_j^f\to s_p^{1/2}s_q^{1/2}$ and, by Tchebychev's inequality and Lemma A.5,
\[
P\left(\frac{1}{n_pn_q}\left|\frac{\partial^2\ell(\beta,\sigma\mid y)}{\partial\sigma_p\partial\sigma_q}-E\left(\frac{\partial^2\ell(\beta,\sigma\mid y)}{\partial\sigma_p\partial\sigma_q}\right)\right|>\varepsilon\right)
\le\frac{1}{\varepsilon^2n_p^2n_q^2}\operatorname{Var}\left(\frac{\partial^2\ell(\beta,\sigma\mid y)}{\partial\sigma_p\partial\sigma_q}\right)
=\frac{2}{\varepsilon^2n_p^2n_q^2}\operatorname{trace}\left[\left(\Sigma^{-1}G_p\Sigma^{-1}G_q\right)^2\right]
\]
\[
\le\frac{2\max(n_p^2,n_q^2)}{\varepsilon^2n_p^2n_q^2}\left[\max_k\left|\lambda_k\left(\Sigma^{-1}G_p\right)\right|\max_k\left|\lambda_k\left(\Sigma^{-1}G_q\right)\right|\right]^2
\le\frac{32}{\varepsilon^2\min(n_p^2,n_q^2)\delta_0^4}\to0.
\]
Consider now the second limit in (3.3.5). Since $E\left(\partial^2\ell(\beta,\alpha\mid y)/\partial\alpha_i\partial\beta\right)=0$, by Tchebychev's inequality we just need to show that
\[
\left(1/n_{p_1+1}n_i^f\right)^2\operatorname{trace}\left(\operatorname{Var}\left(\partial^2\ell(\beta,\alpha\mid y)/\partial\alpha_i\partial\beta\right)\right)\to0.
\]
Now note that
\[
-\frac{\partial^2\ell(\beta,\alpha\mid y)}{\partial\alpha_i\partial\beta}
=\sum_{p:\,m_{l(p)}=\nu_i^f}\frac{\partial\sigma_p}{\partial\alpha_i}X^T\Sigma^{-1}G_p\Sigma^{-1}(y-X\beta)
=X^T\Sigma^{-1}G_i^f\Sigma^{-1}(y-X\beta)
\]
where $G_i^f=\sum_{p:\,m_{l(p)}=\nu_i^f}(\partial\sigma_p/\partial\alpha_i)G_p$. Let $M_f(\alpha)=\max_{ij}|\partial\sigma_j/\partial\alpha_i|$. Note that by Assumption 3.3.3, $M_f(\alpha)>0$. Using Lemma A.3 and Assumption 3.1.7 we get
\[
\frac{1}{\left(n_{p_1+1}n_i^f\right)^2}\operatorname{trace}\left(\operatorname{Var}\left(\frac{\partial^2\ell(\beta,\alpha\mid y)}{\partial\alpha_i\partial\beta}\right)\right) \tag{3.3.6}
=\frac{1}{\left(n_{p_1+1}n_i^f\right)^2}\operatorname{trace}\left(X^T\Sigma^{-1}G_i^f\Sigma^{-1}G_i^f\Sigma^{-1}X\right)
\]
\[
\le\frac{p_0}{\left(n_{p_1+1}n_i^f\right)^2}\lambda_{\max}\left(X^T\Sigma^{-1}X\right)\lambda_{\max}\left(\Sigma^{-1}G_i^f\right)^2
\le\frac{4p_1M_f^2(\alpha)}{\left(n_i^f\delta_0\right)^2}\lambda_{\max}\left(\frac{X^T\Sigma^{-1}X}{\nu_{p_1+1}}\right)
\]
and the last term on the right hand side of (3.3.6) converges to zero as $n\to\infty$, since $\lambda_{\max}\left(X^T\Sigma^{-1}X/\nu_{p_1+1}\right)\to\lambda_{\max}(C_0)$ and $1/(n_i^f)^2\to0$.
We can now state and prove the main asymptotic theorem of this section.
Theorem 3.3.2 Under Assumptions 2.1.1, 2.1.2, 3.1.1 through 3.1.7, and 3.3.1 through 3.3.4, and letting $\theta_0^f$ be an interior point of $\Theta_f$ representing the true parameter vector and
\[
J_f=\begin{pmatrix}C_0&0\\0&C_1^f\end{pmatrix},
\]
there exists a sequence of estimates $\theta_n^f=\left(\beta_n^T,\alpha_n^T\right)^T$ with the following properties.

1. Given $\varepsilon>0$, $\exists\delta=\delta(\varepsilon)$, $0<\delta<\infty$, and $n_0=n_0(\varepsilon)$ such that $\forall n>n_0$
\[
P_{\theta_0^f}\left(\left.\frac{\partial\ell\left(\theta^f\right)}{\partial\theta^f}\right|_{\theta^f=\theta_n^f}=0;\ \|\beta_n-\beta_0\|<\frac{\delta}{n_{p_1+1}}\ \text{and}\ |\alpha_{ni}-\alpha_{0i}|<\frac{\delta}{n_i^f},\ i=1,\ldots,p_\alpha\right)\ge1-\varepsilon.
\]

2. The $(p_0+p_\alpha)$-dimensional vector with the first $p_0$ components given by $n_{p_1+1}(\beta_n-\beta_0)$ and the last $p_\alpha$ components given by $n_i^f(\alpha_{ni}-\alpha_{0i})$, $i=1,\ldots,p_\alpha$, converges in distribution to a $N\left(0,J_f^{-1}\right)$ distribution.
Proof: The proof consists of verifying that the maximum likelihood estimates of $\theta^f$ in the parametrized/structured model satisfy the conditions of Theorem 3.1.1. The first condition was proven in Theorem 3.3.1, so we just need to show that the second condition holds.
Let $g$ be as defined in the proof of Theorem 3.1.2 and
\[
g_i^f=g_i^f(\theta_0^f)=\begin{cases}g/\left(\sqrt{2p_1}\,M_f(\alpha_0)\right),&\text{if }i=1,\ldots,p_\alpha\\ g,&\text{if }i=p_\alpha+1.\end{cases}
\]
Note that $g_i^f\to\infty$ and $g_i^f/n_i^f\to0$, $i=1,\ldots,p_\alpha+1$. Also let
\[
N_n^f(\theta_0^f)=\left\{\theta^f\in\Theta_f\ \Big|\ |\theta_i^f-\theta_{0i}^f|\le g_{k(i)}^f/n_{k(i)}^f,\ i=1,\ldots,p_0+p_\alpha\right\}
\]
where
\[
k(i)=\begin{cases}p_\alpha+1,&\text{if }1\le i\le p_0\\ i-p_0,&\text{otherwise}\end{cases}
\]
with the convention that $n_{p_\alpha+1}^f=n_{p_1+1}$. Then the second condition of Theorem 3.1.1 is that for $i,j=1,\ldots,p_0+p_\alpha$ and $\forall\theta_2^f\in N_n^f(\theta_0^f)$
\[
\sup_{\theta_1^f\in N_n^f(\theta_0^f)}g_{k(i)}^fg_{k(j)}^f
\left|\left(-\frac{1}{n_{k(i)}^fn_{k(j)}^f}\left.\frac{\partial^2\ell\left(\theta^f\mid y\right)}{\partial\theta_i^f\partial\theta_j^f}\right|_{\theta_1^f}\right)-\left[J_f\left(\theta_0^f\right)\right]_{ij}\right|\xrightarrow{P_{\theta_2^f}}0. \tag{3.3.7}
\]
In the remainder of this section we will adopt the shorthand notation $\theta_i=f_a\left(\theta_i^f\right)$. The following lemma will be used in the proof of Theorem 3.3.2.
Lemma 3.3.1 $f_a\left(N_n^f\left(\theta_0^f\right)\right)\subset N_n(\theta_0)$.

Proof: Let $\theta^f\in N_n^f\left(\theta_0^f\right)$ and $\theta=f_a\left(\theta^f\right)$. Then for $i=1,\ldots,p_0$ we have
\[
|\theta_i-\theta_{0i}|=\left|\theta_i^f-\theta_{0i}^f\right|\le\frac{g_{k(i)}^f}{n_{k(i)}^f}=\frac{g}{n_{p_1+1}}.
\]
Now take $i=p_0+1,\ldots,p_0+p_1+1$ and let $l(i)$ denote the random effects class to which $\theta_i$ refers. By the mean value theorem we get
\[
|\theta_i-\theta_{0i}|\le\left\|f_{l(i)}\left(\alpha_{l(i)}\right)-f_{l(i)}\left(\alpha_{0l(i)}\right)\right\|\le M_f(\alpha_0)\|\alpha-\alpha_0\|
\le M_f(\alpha_0)\,p_1\frac{g_i^f}{n_i^f}\le\frac{g}{n_i}
\]
and therefore by definition $\theta\in N_n(\theta_0)$.
Note that by Lemma 3.3.1
\[
\sup_{\theta^f\in N_n^f(\theta_0^f)}\left|h\left(f_a\left(\theta^f\right)\right)\right|\le\sup_{\theta\in N_n(\theta_0)}|h(\theta)|
\]
for any real function $h$.
For the $\partial^2\ell(\beta,\alpha\mid y)/\partial\beta\partial\beta^T$ derivatives, condition (3.3.7) is identical to the corresponding one in Theorem 3.1.2 and the proof given there also applies here.

Consider now the $\partial^2\ell(\beta,\alpha\mid y)/\partial\alpha_i\partial\beta_j$ derivatives. Since the corresponding entries in the $J_f$ matrix are 0, we just need to show that
\[
\sup_{\theta_1^f\in N_n^f(\theta_0^f)}g_i^fg_{p_\alpha+1}^f\left|\left.-\frac{1}{n_i^fn_{p_\alpha+1}^f}\frac{\partial^2\ell\left(\theta^f\mid y\right)}{\partial\alpha_i\partial\beta_j}\right|_{\theta_1^f}\right|\xrightarrow{P_{\theta_2^f}}0.
\]
By the continuity of $\partial f/\partial\alpha$ we have that $\exists\varepsilon=\varepsilon(\alpha_0)>0$ such that $\|\alpha-\alpha_0\|<\varepsilon\Rightarrow\max_{ij}|\partial f_i(\alpha)/\partial\alpha_j|<2M_f(\alpha_0)$. From (3.3.1) we get that for sufficiently large $n$ (such that $\theta^f\in N_n^f\left(\theta_0^f\right)\Rightarrow\|\alpha-\alpha_0\|<\varepsilon$)
\[
\sup_{\theta_1^f\in N_n^f(\theta_0^f)}g_i^fg_{p_\alpha+1}^f\left|\left.-\frac{1}{n_i^fn_{p_\alpha+1}^f}\frac{\partial^2\ell\left(\theta^f\mid y\right)}{\partial\alpha_i\partial\beta_j}\right|_{\theta_1^f}\right|
\le\frac{1}{\sqrt{2p_1}}\sum_{p:\,m_{l(p)}=\nu_i^f}\frac{n_p}{n_i^f}\sup_{\theta_1\in N_n(\theta_0)}g^2\left|\left.-\frac{1}{n_pn_{p_1+1}}\frac{\partial^2\ell(\theta\mid y)}{\partial\sigma_p\partial\beta_j}\right|_{\theta_1}\right|\xrightarrow{P_{\theta_2^f}}0
\]
since $n_p/n_i^f\to s_p^{1/2}$ and by Theorem 3.1.2
\[
\sup_{\theta_1\in N_n(\theta_0)}g^2\left|\left.-\frac{1}{n_pn_{p_1+1}}\frac{\partial^2\ell(\theta\mid y)}{\partial\sigma_p\partial\beta_j}\right|_{\theta_1}\right|\xrightarrow{P_{\theta_2}}0.
\]
Consider now the $\partial^2\ell(\beta,\alpha\mid y)/\partial\alpha_i\partial\alpha_j$ terms. Letting $\nabla_{f,i}(\alpha)$ denote the $i$th column of the gradient matrix $\nabla_f$ evaluated at $\alpha$, we get
\[
\left|\left(-\frac{1}{n_i^fn_j^f}\left.\frac{\partial^2\ell\left(\theta^f\mid y\right)}{\partial\alpha_i\partial\alpha_j}\right|_{\theta_1^f}\right)-\left[C_1^f\right]_{ij}\right| \tag{3.3.8}
\le\left|-\frac{1}{n_i^fn_j^f}\nabla_{f,i}^T(\alpha_1)\left.\frac{\partial^2\ell(\theta\mid y)}{\partial\sigma\partial\sigma^T}\right|_{f_a(\theta_1^f)}\nabla_{f,j}(\alpha_1)-\nabla_{f,i}^T(\alpha_0)s^{1/2}C_1s^{1/2}\nabla_{f,j}(\alpha_0)\right|
\]
\[
+\left|\frac{1}{n_i^fn_j^f}\sum_{k=0}^{p_1}\left.\frac{\partial\ell(\theta\mid y)}{\partial\sigma_k}\right|_{\sigma=f(\alpha_1)}\left.\frac{\partial^2\sigma_k}{\partial\alpha_i\partial\alpha_j}\right|_{\alpha_1}\right|.
\]
Now note that
\[
\left|\nabla_{f,i}^T(\alpha)s^{1/2}C_1s^{1/2}\nabla_{f,j}(\alpha)-\nabla_{f,i}^T(\alpha_0)s^{1/2}C_1s^{1/2}\nabla_{f,j}(\alpha_0)\right| \tag{3.3.9}
\le\left|\left(\nabla_{f,i}(\alpha)-\nabla_{f,i}(\alpha_0)\right)^Ts^{1/2}C_1s^{1/2}\nabla_{f,j}(\alpha)\right|
+\left|\nabla_{f,i}^T(\alpha_0)s^{1/2}C_1s^{1/2}\left(\nabla_{f,j}(\alpha)-\nabla_{f,j}(\alpha_0)\right)\right|.
\]
Let $Q_f(\alpha_0)=\max\left(\max_{ijk}\left|\left.\partial^2f_k/\partial\alpha_i\partial\alpha_j\right|_{\alpha_0}\right|,1\right)$. By Assumption 3.3.2, $\exists\varepsilon=\varepsilon(\alpha_0)>0$ such that $\|\alpha-\alpha_0\|<\varepsilon\Rightarrow\max_{ijk}\left|\left.\partial^2f_k/\partial\alpha_i\partial\alpha_j\right|_{\alpha}\right|<Q_f(\alpha_0)$. By the continuity of $\nabla_f$ and the mean value theorem we have that, for sufficiently large $n$, $\forall\theta^f\in N_n^f\left(\theta_0^f\right)$
\[
\left|\left[\nabla_{f,i}(\alpha)\right]_k-\left[\nabla_{f,i}(\alpha_0)\right]_k\right|\le Q_f(\alpha_0)\|\alpha-\alpha_0\|\le p_1Q_f(\alpha_0)\frac{g_i^f}{n_i^f}. \tag{3.3.10}
\]
Therefore we have for sufficiently large $n$
\[
g_i^fg_j^f\sup_{\theta_1^f\in N_n^f(\theta_0^f)}\left|\nabla_{f,i}^T(\alpha_1)s^{1/2}C_1s^{1/2}\nabla_{f,j}(\alpha_1)-\nabla_{f,i}^T(\alpha_0)s^{1/2}C_1s^{1/2}\nabla_{f,j}(\alpha_0)\right|
\]
\[
\le p_1g_i^fg_j^fM_f(\alpha_0)Q_f(\alpha_0)\,\mathbf{1}^Ts^{1/2}C_1s^{1/2}\mathbf{1}\left(\frac{g_i^f}{n_i^f}+\frac{g_j^f}{n_j^f}\right)
\le\frac{Q_f(\alpha_0)\,\mathbf{1}^Ts^{1/2}C_1s^{1/2}\mathbf{1}}{M_f^2(\alpha_0)}\,\frac{p_1^2}{g}\stackrel{\mathrm{def}}{=}\frac{\kappa(\alpha_0)}{g}
\]
where $\mathbf{1}$ denotes the constant vector of ones. Hence for sufficiently large $n$ we have
\[
\left|-\frac{1}{n_i^fn_j^f}\nabla_{f,i}^T(\alpha_1)\left.\frac{\partial^2\ell(\theta\mid y)}{\partial\sigma\partial\sigma^T}\right|_{f_a(\theta_1^f)}\nabla_{f,j}(\alpha_1)-\nabla_{f,i}^T(\alpha_0)s^{1/2}C_1s^{1/2}\nabla_{f,j}(\alpha_0)\right|
\]
\[
\le\left|\nabla_{f,i}^T(\alpha_1)\left(-\frac{1}{n_i^fn_j^f}\left.\frac{\partial^2\ell(\theta\mid y)}{\partial\sigma\partial\sigma^T}\right|_{f_a(\theta_1^f)}-s^{1/2}C_1s^{1/2}\right)\nabla_{f,j}(\alpha_1)\right|
+\frac{2Q_f(\alpha_0)\,\mathbf{1}^Ts^{1/2}C_1s^{1/2}\mathbf{1}}{g^3}
\]
\[
\le2M_f(\alpha_0)\sum_{p:\,m_{l(p)}=\nu_i^f}\ \sum_{q:\,m_{l(q)}=\nu_j^f}\left|\left(-\frac{1}{n_i^fn_j^f}\left.\frac{\partial^2\ell(\theta\mid y)}{\partial\sigma_p\partial\sigma_q}\right|_{f_a(\theta_1^f)}\right)-s_p^{1/2}s_q^{1/2}[C_1]_{pq}\right|
+\frac{2Q_f(\alpha_0)\,\mathbf{1}^Ts^{1/2}C_1s^{1/2}\mathbf{1}}{g^3}.
\]
By Assumption 3.3.4 we may replace $s_p^{1/2}s_q^{1/2}$ by $n_pn_q/n_i^fn_j^f$ without altering the limit values. It then follows that for large enough $n$
\[
\sup_{\theta_1^f\in N_n^f(\theta_0^f)}g_i^fg_j^f\left|-\frac{1}{n_i^fn_j^f}\nabla_{f,i}^T(\alpha_1)\left.\frac{\partial^2\ell(\theta\mid y)}{\partial\sigma\partial\sigma^T}\right|_{f_a(\theta_1^f)}\nabla_{f,j}(\alpha_1)-\nabla_{f,i}^T(\alpha_0)s^{1/2}C_1s^{1/2}\nabla_{f,j}(\alpha_0)\right| \tag{3.3.11}
\]
\[
\le2M_f(\alpha_0)\sum_{p:\,m_{l(p)}=\nu_i^f}\ \sum_{q:\,m_{l(q)}=\nu_j^f}\left(n_pn_q/n_i^fn_j^f\right)g^2\sup_{\theta_1\in N_n(\theta_0)}\left|\left(-\frac{1}{n_pn_q}\left.\frac{\partial^2\ell(\theta\mid y)}{\partial\sigma_p\partial\sigma_q}\right|_{\theta_1}\right)-[C_1]_{pq}\right|+\frac{\kappa(\alpha_0)}{g}\xrightarrow{P_{\theta_2^f}}0
\]
since $n_pn_q/n_i^fn_j^f\to s_p^{1/2}s_q^{1/2}$ and, by Theorem 3.1.2 and (3.3.4),
\[
g^2\sup_{\theta_1\in N_n(\theta_0)}\left|\left(-\frac{1}{n_pn_q}\left.\frac{\partial^2\ell(\theta\mid y)}{\partial\sigma_p\partial\sigma_q}\right|_{\theta_1}\right)-[C_1]_{pq}\right|\xrightarrow{P_{\theta_2}}0.
\]
Finally we note that the second term of the sum on the right hand side of (3.3.8) is zero whenever $n_i^f\ne n_j^f$, so we can restrict ourselves to the case when they are equal. We have
\[
g_i^fg_j^f\sup_{\theta_1^f\in N_n^f(\theta_0^f)}\left|\frac{1}{\nu_i^f}\sum_{p:\,m_{l(p)}=\nu_i^f}\left.\frac{\partial\ell(\theta\mid y)}{\partial\sigma_p}\right|_{\sigma=f(\alpha_1)}\left.\frac{\partial^2\sigma_p}{\partial\alpha_i\partial\alpha_j}\right|_{\alpha_1}\right|
\le Q_f(\alpha_0)\sum_{p:\,m_{l(p)}=\nu_i^f}\left(\nu_p/\nu_i^f\right)g^2\sup_{\theta_1\in N_n(\theta_0)}\left|\left.\frac{1}{\nu_p}\frac{\partial\ell(\theta\mid y)}{\partial\sigma_p}\right|_{\theta_1}\right|
\]
and since $\nu_p/\nu_i^f\to s_p$ it suffices to show that
\[
g^2\sup_{\theta_1\in N_n(\theta_0)}\left|\left.\frac{1}{\nu_i}\frac{\partial\ell(\theta\mid y)}{\partial\sigma_i}\right|_{\theta_1}\right|\xrightarrow{P_{\theta_2}}0,\quad i=0,\ldots,p_1.
\]
Now note that
\[
2\left|\left.\frac{\partial\ell(\theta\mid y)}{\partial\sigma_i}\right|_{\theta_1}\right| \tag{3.3.12}
=\left|\operatorname{trace}\left(\Sigma_1^{-1}G_i\right)-(y-X\beta_1)^T\Sigma_1^{-1}G_i\Sigma_1^{-1}(y-X\beta_1)\right|
\]
\[
\le\left|\operatorname{trace}\left(\left(\Sigma_1^{-1}-\Sigma_2^{-1}\right)G_i\right)\right|
+\left|\operatorname{trace}\left(\Sigma_2^{-1}G_i\right)-(y-X\beta_2)^T\Sigma_2^{-1}G_i\Sigma_2^{-1}(y-X\beta_2)\right|
\]
\[
+\left|(y-X\beta_1)^T\Sigma_1^{-1}G_i\Sigma_1^{-1}(y-X\beta_1)-(y-X\beta_2)^T\Sigma_2^{-1}G_i\Sigma_2^{-1}(y-X\beta_2)\right|.
\]
Consider initially the first term on the right hand side of (3.3.12). By Lemmas A.5 and A.8 we have that for sufficiently large $n$
\[
\frac{g^2}{\nu_i}\left|\operatorname{trace}\left(\left(\Sigma_1^{-1}-\Sigma_2^{-1}\right)G_i\right)\right| \tag{3.3.13}
\le g^2\max_k\left|\lambda_k\left(\left(\Sigma_1^{-1}-\Sigma_2^{-1}\right)G_i\right)\right|
\le g^2\max_k\left|\lambda_k\left(\Sigma_2^{-1}\left(\Sigma_2-\Sigma_1\right)\right)\right|\max_k\left|\lambda_k\left(\Sigma_1^{-1}G_i\right)\right|
\le\frac{16}{\delta_0g}\left(\frac{q}{\lambda_{\min}(D^0)}+\frac{1}{\sigma_0^2}\right)\to0.
\]
Note that since the bound in (3.3.13) does not depend on $\theta_1$, we also have
\[
\frac{g^2}{\nu_i}\sup_{\theta_1\in N_n(\theta_0)}\left|\operatorname{trace}\left(\left(\Sigma_1^{-1}-\Sigma_2^{-1}\right)G_i\right)\right|\to0.
\]
Next consider the second term on the right hand side of the inequality in (3.3.12) and note that this term does not depend on $\theta_1$. By Tchebychev's inequality we just need to show that the variance of that term goes to zero with $n$. Now by Lemma A.5
\[
\frac{g^4}{\nu_i^2}\operatorname{var}\left((y-X\beta_2)^T\Sigma_2^{-1}G_i\Sigma_2^{-1}(y-X\beta_2)\right)
=\frac{2g^4}{\nu_i^2}\operatorname{trace}\left[\left(\Sigma_2^{-1}G_i\right)^2\right]
\le\frac{2}{g^4}\max_k\left(\lambda_k\left(\Sigma_2^{-1}G_i\right)\right)^2\le\frac{32}{\delta_0^2g^4}\to0.
\]
Finally consider the last term on the right hand side of the inequality in (3.3.12). We note that
\[
\left|(y-X\beta_1)^T\Sigma_1^{-1}G_i\Sigma_1^{-1}(y-X\beta_1)-(y-X\beta_2)^T\Sigma_2^{-1}G_i\Sigma_2^{-1}(y-X\beta_2)\right| \tag{3.3.14}
\le\left|(y-X\beta_2)^T\left(\Sigma_1^{-1}G_i\Sigma_1^{-1}-\Sigma_2^{-1}G_i\Sigma_2^{-1}\right)(y-X\beta_2)\right|
\]
\[
+2\left|(y-X\beta_2)^T\Sigma_1^{-1}G_i\Sigma_1^{-1}X(\beta_2-\beta_1)\right|
+\left|(\beta_2-\beta_1)^TX^T\Sigma_1^{-1}G_i\Sigma_1^{-1}X(\beta_2-\beta_1)\right|.
\]
Consider the first term on the right hand side of (3.3.14). Note that
\[
\Sigma_1^{-1}G_i\Sigma_1^{-1}-\Sigma_2^{-1}G_i\Sigma_2^{-1}
=\left(\Sigma_1^{-1}-\Sigma_2^{-1}\right)G_i\Sigma_1^{-1}+\Sigma_2^{-1}G_i\left(\Sigma_1^{-1}-\Sigma_2^{-1}\right).
\]
Let $j(i)$ and $k(i)$ denote the random effects to which $\sigma_i$ corresponds within the associated random effects class $l(i)$. By the Cauchy–Schwarz inequality we have that for any $u,v\in\Re^n$
\[
\left|u^TG_iv\right|\le\left\|\left(U_{j(i)}^{l(i)}\right)^Tu\right\|\left\|\left(U_{k(i)}^{l(i)}\right)^Tv\right\|
+\left\|\left(U_{j(i)}^{l(i)}\right)^Tv\right\|\left\|\left(U_{k(i)}^{l(i)}\right)^Tu\right\|. \tag{3.3.15}
\]
Using (3.3.15) and Cauchy–Schwarz once again gives
\[
\left|(y-X\beta_2)^T\left(\Sigma_1^{-1}-\Sigma_2^{-1}\right)G_i\Sigma_1^{-1}(y-X\beta_2)\right|
\le\left[(y-X\beta_2)^T\Sigma_2^{-1}\left(\Sigma_2-\Sigma_1\right)\Sigma_1^{-1}U_{j(i)}^{l(i)}\left(U_{j(i)}^{l(i)}\right)^T\Sigma_1^{-1}\left(\Sigma_2-\Sigma_1\right)\Sigma_2^{-1}(y-X\beta_2)\right]^{1/2}
\]
\[
\times\left[(y-X\beta_2)^T\Sigma_1^{-1}U_{k(i)}^{l(i)}\left(U_{k(i)}^{l(i)}\right)^T\Sigma_1^{-1}(y-X\beta_2)\right]^{1/2}
\]
\[
+\left[(y-X\beta_2)^T\Sigma_2^{-1}\left(\Sigma_2-\Sigma_1\right)\Sigma_1^{-1}U_{k(i)}^{l(i)}\left(U_{k(i)}^{l(i)}\right)^T\Sigma_1^{-1}\left(\Sigma_2-\Sigma_1\right)\Sigma_2^{-1}(y-X\beta_2)\right]^{1/2}
\]
\[
\times\left[(y-X\beta_2)^T\Sigma_1^{-1}U_{j(i)}^{l(i)}\left(U_{j(i)}^{l(i)}\right)^T\Sigma_1^{-1}(y-X\beta_2)\right]^{1/2}.
\]
Now note that by Lemmas A.4, A.6, and A.8
\[
g^4\lambda_{\max}\left(\left[\Sigma_2^{1/2}\right]^{-T}\left(\Sigma_2-\Sigma_1\right)\Sigma_1^{-1}U_{j(i)}^{l(i)}\left(U_{j(i)}^{l(i)}\right)^T\Sigma_1^{-1}\left(\Sigma_2-\Sigma_1\right)\left[\Sigma_2^{1/2}\right]^{-1}\right)
\le\frac{128}{\delta_0g^2}\left(\frac{q}{\lambda_{\min}(D^0)}+\frac{1}{\sigma_0^2}\right)\left(\frac{2\lambda_{\max}(D^0)}{\lambda_{\min}(D^0)}+1\right)
\]
which converges to zero with $n$. By Lemma A.9 it follows that
\[
\frac{g^4}{\nu_i}\sup_{\theta_1\in N_n(\theta_0)}(y-X\beta_2)^T\Sigma_2^{-1}\left(\Sigma_2-\Sigma_1\right)\Sigma_1^{-1}U_{j(i)}^{l(i)}\left(U_{j(i)}^{l(i)}\right)^T\Sigma_1^{-1}\left(\Sigma_2-\Sigma_1\right)\Sigma_2^{-1}(y-X\beta_2)\xrightarrow{P_{\theta_2}}0.
\]
From Lemmas A.4 and A.6 we also have that for large enough $n$
\[
\lambda_{\max}\left(\Sigma_2^{1/2}\Sigma_1^{-1}U_{k(i)}^{l(i)}\left(U_{k(i)}^{l(i)}\right)^T\Sigma_1^{-1}\left[\Sigma_2^{1/2}\right]^T\right)\le\frac{8}{\delta_0}\left(\frac{\lambda_{\max}(D^0)}{\lambda_{\min}(D^0)}+1\right)
\]
and therefore by Lemma A.9 it follows that
\[
P_{\theta_2}\left(\frac{1}{\nu_i}\sup_{\theta_1\in N_n(\theta_0)}(y-X\beta_2)^T\Sigma_1^{-1}U_{k(i)}^{l(i)}\left(U_{k(i)}^{l(i)}\right)^T\Sigma_1^{-1}(y-X\beta_2)>\frac{16}{\delta_0}\left(\frac{\lambda_{\max}(D^0)}{\lambda_{\min}(D^0)}+1\right)\right)\to0.
\]
We conclude that
\[
\frac{g^2}{\nu_i}\sup_{\theta_1\in N_n(\theta_0)}\left[(y-X\beta_2)^T\Sigma_2^{-1}\left(\Sigma_2-\Sigma_1\right)\Sigma_1^{-1}U_{j(i)}^{l(i)}\left(U_{j(i)}^{l(i)}\right)^T\Sigma_1^{-1}\left(\Sigma_2-\Sigma_1\right)\Sigma_2^{-1}(y-X\beta_2)\right]^{1/2}
\]
\[
\times\left[(y-X\beta_2)^T\Sigma_1^{-1}U_{k(i)}^{l(i)}\left(U_{k(i)}^{l(i)}\right)^T\Sigma_1^{-1}(y-X\beta_2)\right]^{1/2}\xrightarrow{P_{\theta_2}}0.
\]
Similarly we prove that
\[
\frac{g^2}{\nu_i}\sup_{\theta_1\in N_n(\theta_0)}\left[(y-X\beta_2)^T\Sigma_2^{-1}\left(\Sigma_2-\Sigma_1\right)\Sigma_1^{-1}U_{k(i)}^{l(i)}\left(U_{k(i)}^{l(i)}\right)^T\Sigma_1^{-1}\left(\Sigma_2-\Sigma_1\right)\Sigma_2^{-1}(y-X\beta_2)\right]^{1/2}
\]
\[
\times\left[(y-X\beta_2)^T\Sigma_1^{-1}U_{j(i)}^{l(i)}\left(U_{j(i)}^{l(i)}\right)^T\Sigma_1^{-1}(y-X\beta_2)\right]^{1/2}\xrightarrow{P_{\theta_2}}0
\]
and therefore
\[
\frac{g^2}{\nu_i}\sup_{\theta_1\in N_n(\theta_0)}(y-X\beta_2)^T\left(\Sigma_1^{-1}-\Sigma_2^{-1}\right)G_i\Sigma_1^{-1}(y-X\beta_2)\xrightarrow{P_{\theta_2}}0.
\]
Using the exact same reasoning we show that
\[
\frac{g^2}{\nu_i}\sup_{\theta_1\in N_n(\theta_0)}(y-X\beta_2)^T\Sigma_2^{-1}G_i\left(\Sigma_1^{-1}-\Sigma_2^{-1}\right)(y-X\beta_2)\xrightarrow{P_{\theta_2}}0.
\]
Consider now the second term on the right hand side of (3.3.14). By Cauchy–Schwarz we get
\[
\left|(y-X\beta_2)^T\Sigma_1^{-1}G_i\Sigma_1^{-1}X(\beta_2-\beta_1)\right|
\le\left[(y-X\beta_2)^T\Sigma_1^{-1}G_i\Sigma_1^{-1}G_i\Sigma_1^{-1}(y-X\beta_2)\right]^{1/2}
\times\left[(\beta_2-\beta_1)^TX^T\Sigma_1^{-1}X(\beta_2-\beta_1)\right]^{1/2}.
\]
Now note that for large enough $n$
\[
\lambda_{\max}\left(\Sigma_2^{1/2}\Sigma_1^{-1}G_i\Sigma_1^{-1}G_i\Sigma_1^{-1}\left[\Sigma_2^{1/2}\right]^T\right)\le\frac{64}{\delta_0^2}\left(\frac{\lambda_{\max}(D^0)}{\lambda_{\min}(D^0)}+1\right)
\]
and by Lemma A.9 it follows that
\[
P_{\theta_2}\left(\frac{1}{\nu_i}\sup_{\theta_1\in N_n(\theta_0)}(y-X\beta_2)^T\Sigma_1^{-1}G_i\Sigma_1^{-1}G_i\Sigma_1^{-1}(y-X\beta_2)>\frac{128}{\delta_0^2}\left(\frac{\lambda_{\max}(D^0)}{\lambda_{\min}(D^0)}+1\right)\right)\to0.
\]
Note also that for sufficiently large $n$
\[
\frac{g^4}{\nu_i}\sup_{\theta_1\in N_n(\theta_0)}(\beta_2-\beta_1)^TX^T\Sigma_1^{-1}X(\beta_2-\beta_1)
\le\frac{g^4}{\nu_i}\sup_{\theta_1\in N_n(\theta_0)}\left(\lambda_{\max}\left(X^T\Sigma_1^{-1}X\right)\|\beta_2-\beta_1\|^2\right)
\]
\[
\le\frac{4p_0g^6}{\nu_i}\lambda_{\max}\left(\frac{X^T\Sigma_0^{-1}X}{\nu_{p_1+1}}\right)\left(1+\frac{2}{g^3}\left(\frac{q}{\lambda_{\min}(D^0)}+\frac{1}{\sigma_0^2}\right)\right)\to0
\]
since $g^6/\nu_i\to0$ and $\lambda_{\max}\left(X^T\Sigma_0^{-1}X/\nu_{p_1+1}\right)\to\lambda_{\max}(C_0)$. Therefore we conclude that
\[
\frac{g^2}{\nu_i}\sup_{\theta_1\in N_n(\theta_0)}\left|(y-X\beta_2)^T\Sigma_1^{-1}G_i\Sigma_1^{-1}X(\beta_2-\beta_1)\right|\xrightarrow{P_{\theta_2}}0.
\]
Finally consider the last term on the right hand side of (3.3.14). For sufficiently large $n$ we get
\[
\frac{g^2}{\nu_i}\sup_{\theta_1\in N_n(\theta_0)}(\beta_2-\beta_1)^TX^T\Sigma_1^{-1}G_i\Sigma_1^{-1}X(\beta_2-\beta_1)
\le\frac{4g^2}{\delta_0\nu_i}\sup_{\theta_1\in N_n(\theta_0)}\left(\lambda_{\max}\left(X^T\Sigma_1^{-1}X\right)\|\beta_2-\beta_1\|^2\right)
\]
\[
\le\frac{16p_0g^4}{\delta_0\nu_i}\lambda_{\max}\left(\frac{X^T\Sigma_0^{-1}X}{\nu_{p_1+1}}\right)\left(1+\frac{2}{g^3}\left(\frac{q}{\lambda_{\min}(D^0)}+\frac{1}{\sigma_0^2}\right)\right)\to0
\]
since $g^4/\nu_i\to0$ and $\lambda_{\max}\left(X^T\Sigma_0^{-1}X/\nu_{p_1+1}\right)\to\lambda_{\max}(C_0)$.
Hence we have that
\[
g^2\sup_{\theta_1\in N_n(\theta_0)}\left|\left.\frac{1}{\nu_i}\frac{\partial\ell(\theta\mid y)}{\partial\sigma_i}\right|_{\theta_1}\right|\xrightarrow{P_{\theta_2}}0
\]
and this completes the proof of Theorem 3.3.2.
Most of the parametrizations and structures of σ proposed in the literature
satisfy Assumptions 3.3.1 through 3.3.4. For example, all the structured covari-
ance matrices considered in Jennrich and Schluchter (1986) and the parametriza-
tions considered in chapter 6 satisfy these assumptions.
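One concrete map satisfying these assumptions can be sketched as follows. The log-Cholesky form below is a common choice used here purely for illustration; it is not claimed to be the exact parametrization of chapter 6:

```python
import numpy as np

def log_cholesky(alpha):
    """Map an unconstrained alpha in R^3 to a 2x2 positive-definite D = L L',
    with L lower triangular and positive diagonal exp(a11), exp(a22).
    The map is C-infinity (Assumption 3.3.2) and one-to-one (Assumption 3.3.3)."""
    L = np.array([[np.exp(alpha[0]), 0.0],
                  [alpha[1],         np.exp(alpha[2])]])
    return L @ L.T

D = log_cholesky(np.array([0.2, -0.7, -0.1]))
assert np.all(np.linalg.eigvalsh(D) > 0)   # positive definite by construction
```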
As a final comment, note that the results of Theorem 3.3.2 are easily ex-
tended to restricted maximum likelihood estimation, using the same steps as in
Theorem 3.2.1.
3.4 Conclusions
We have established the asymptotic normality of the (restricted) maximum like-
lihood estimators for the parameters in the model (2.1.1). It is interesting to
interpret the basic Assumptions 3.1.5 to 3.1.7 (3.2.2 for the restricted maximum
likelihood estimators). Assumption 3.1.7 is a typical condition requiring that
the limit of the variance-covariance matrix of the parameter estimates exists.
Assumption 3.1.5 ensures that the number of levels goes to infinity. Note that
we do not require that the number of observations within each level becomes
infinite. This is because we need to estimate the variance-covariance components of the random effects to arbitrary precision, but not the random effects themselves. Similarly, Assumption 3.1.6 ensures that there are enough observations, beyond the bare minimum from each level, to estimate the fixed effects to arbitrary precision.
We have also established the asymptotic normality of (restricted) maximum
likelihood estimators for a large class of reparametrizations/structurings of the
variance-covariance components σ, that includes most cases of practical interest.
The basic condition for the result to hold is that the mapping that defines
the parametrization/structuring be twice differentiable with continuous second
derivatives, a condition commonly observed in practical applications.
Chapter 4
The Nonlinear Mixed Effects
Model
In this chapter we describe a general nonlinear mixed effects model for repeated
measures data and present a real data example of its use. We also include a
brief bibliographic review of nonlinear mixed effects models.
4.1 The Model
The nonlinear mixed effects model used in this dissertation has been suggested
by Lindstrom and Bates (1990) and in its most general form is written as
in (1.3.1). This model formulation allows the use of nested and crossed classi-
fication factors for the clusters, but by far its most common application is for
repeated measures data, which corresponds to a one-way classification scheme.
We will restrict ourselves in this dissertation to this particular application of
model (1.3.1).
The nonlinear mixed effects model for repeated measures can be thought
of as a two-stage model that in some ways generalizes both the linear mixed
effects model for repeated measures (Laird and Ware, 1982) and the nonlinear
regression model for independent data (Bates and Watts, 1988). In the first
stage the jth observation on the ith cluster is modeled as
yij = f(φij, xij) + εij , i = 1, . . . , m, j = 1, . . . , ni (4.1.1)
where m is the number of clusters, ni is the number of observations on the ith
cluster, f is a general real valued nonlinear function of a cluster-specific param-
eter vector φij and the covariate vector xij , and εij is a normally distributed
error term. In the second stage the cluster-specific parameter vector is modeled
as
φij = Aijβ + Bijbi, bi ∼ N (0, σ2D), (4.1.2)
where β is a p-dimensional vector of fixed effects, bi is a q-dimensional random
effects vector associated with the ith cluster (not varying with j), Aij and Bij
are design matrices for the fixed and random effects respectively, and σ2D is a
general variance-covariance matrix. It is assumed that observations correspond-
ing to different clusters are independent and that the εij are i.i.d. N (0, σ2) and
independent of the bi.
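The two-stage structure in (4.1.1)-(4.1.2) is straightforward to simulate. The sketch below is illustrative only: the exponential mean function, dimensions, and parameter values are assumptions, not taken from the text:

```python
import numpy as np

# Simulate the two-stage model (4.1.1)-(4.1.2): stage 1 adds i.i.d. N(0, sigma^2)
# errors around a nonlinear mean f(phi_ij, x_ij); stage 2 sets the cluster-specific
# parameters phi_i = A_i beta + B_i b_i with b_i ~ N(0, sigma^2 D).
rng = np.random.default_rng(7)
m, n_i = 4, 6                          # clusters and observations per cluster
beta = np.array([10.0, 0.5])           # fixed effects (p = 2)
sigma2 = 0.04
D = np.array([[2.0]])                  # scaled random effects covariance (q = 1)

def f(phi, x):
    """Illustrative nonlinear mean function: exponential decay."""
    return phi[0] * np.exp(-phi[1] * x)

x = np.linspace(0.0, 5.0, n_i)
y = np.empty((m, n_i))
for i in range(m):
    b_i = rng.multivariate_normal(np.zeros(1), sigma2 * D)
    phi_i = beta + np.array([b_i[0], 0.0])   # random effect on phi_1 only
    y[i] = f(phi_i, x) + rng.normal(0.0, np.sqrt(sigma2), n_i)
```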
We can write (4.1.1) and (4.1.2) in matrix form as
\[
y_i=f_i(\phi_i,X_i)+\varepsilon_i,\qquad \phi_i=A_i\beta+B_ib_i
\]
for $i=1,\ldots,m$, where
\[
y_i=[y_{i1}\cdots y_{in_i}]^T,\qquad \varepsilon_i=[\varepsilon_{i1}\cdots\varepsilon_{in_i}]^T,
\]
\[
f_i(\phi_i,X_i)=[f(\phi_{i1},x_{i1})\cdots f(\phi_{in_i},x_{in_i})]^T,
\]
\[
X_i=\left[x_{i1}^T:\cdots:x_{in_i}^T\right]^T,\qquad A_i=\left[A_{i1}^T:\cdots:A_{in_i}^T\right]^T,\qquad B_i=\left[B_{i1}^T:\cdots:B_{in_i}^T\right]^T.
\]
By letting
\[
y=\left[y_1^T:\cdots:y_m^T\right]^T,\qquad b=\left[b_1^T:\cdots:b_m^T\right]^T,\qquad \varepsilon=\left[\varepsilon_1^T:\cdots:\varepsilon_m^T\right]^T,
\]
\[
f(\phi,X)=\left[f_1(\phi_1,X_1)^T:\cdots:f_m(\phi_m,X_m)^T\right]^T,
\]
\[
X=\left[X_1^T:\cdots:X_m^T\right]^T,\qquad A=\left[A_1^T:\cdots:A_m^T\right]^T,\qquad B=\bigoplus_{i=1}^mB_i,
\]
we see that the nonlinear mixed effects model for repeated measures described here is a particular case of model (1.3.1).
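The only stacking step above that is not a plain concatenation is the direct sum defining $B$; a minimal sketch (the cluster sizes are arbitrary illustrative choices):

```python
import numpy as np
from scipy.linalg import block_diag

# B is the direct sum of the cluster-level random effects design matrices B_i,
# so each cluster's random effects act only on that cluster's rows.
B_list = [np.ones((4, 1)), np.ones((3, 1)), np.ones((5, 1))]   # B_i with q = 1
B = block_diag(*B_list)
assert B.shape == (12, 3)    # sum of the n_i rows; m * q columns
```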
Several different methods for estimating the parameters in the nonlinear
mixed effects model have been proposed. We concentrate here on two of them:
maximum likelihood and restricted maximum likelihood. A rather complex
numerical issue for (restricted) maximum likelihood estimation is the evaluation of the loglikelihood function of the data, since it involves the integral
\[
p(y\mid\beta,D,\sigma^2)=\int p(y\mid b,\beta,D,\sigma^2)\,p(b)\,db \tag{4.1.3}
\]
which in general does not have a closed-form expression when the model function $f$ is nonlinear in $b$. Different approximations have been suggested to circumvent this difficulty (Lindstrom and Bates, 1990; Vonesh and Carter, 1992; Davidian and Gallant, 1993). This issue is considered in detail in chapter 5.
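As a concrete illustration of what evaluating (4.1.3) involves, one standard device for a scalar random effect is Gauss-Hermite quadrature. This sketch is illustrative only and is not one of the specific approximations compared in chapter 5:

```python
import numpy as np

# For scalar b ~ N(0, tau^2), approximate the cluster contribution to (4.1.3):
#   p(y_i) = int p(y_i | b) N(b; 0, tau^2) db
#          ~ sum_k (w_k / sqrt(pi)) p(y_i | sqrt(2) tau z_k)
# where (z_k, w_k) are Gauss-Hermite nodes and weights.

def marginal_density(y_i, cond_density, tau, n_nodes=20):
    """Gauss-Hermite approximation of int cond_density(y_i, b) N(b; 0, tau^2) db."""
    z, w = np.polynomial.hermite.hermgauss(n_nodes)
    b_nodes = np.sqrt(2.0) * tau * z                 # change of variables
    vals = np.array([cond_density(y_i, b) for b in b_nodes])
    return (w / np.sqrt(np.pi)) @ vals
```

When the conditional density is itself Gaussian in $b$ the marginal is available in closed form, which makes a convenient accuracy check; the interesting case, of course, is when $f$ is nonlinear in $b$ and no such closed form exists.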
4.2 Orange Trees
The orange trees data are presented in Figure 4.2.1 and consist of seven mea-
surements of the trunk circumference (in millimeters) on each of five orange
trees, taken over a period of 1,600 days. These data were originally presented
in Draper and Smith (1981, p. 524) and were also described in Lindstrom and
Bates (1990).
Figure 4.2.1: Trunk circumference (in millimeters) of five orange trees.
The logistic model y = φ1/ {1 + exp [− (t − φ2) /φ3]} seems to fit the data
well. Lindstrom and Bates (1990) concluded in their analysis that only the
asymptotic circumference φ1 needed a random effect to account for tree-to-tree
variation and suggested the following nonlinear mixed effects model
\[
y_{ij} = \frac{\beta_1 + b_{i1}}{1 + \exp[-(t_{ij} - \beta_2)/\beta_3]} + \varepsilon_{ij} \qquad (4.2.1)
\]
where yij represents the jth circumference measurement on the ith tree, tij
represents the day corresponding to the jth measurement on the ith tree, the
bi1, i = 1, . . . , 5 are i.i.d. N (0, σ2D), and the εij , i = 1, . . . , 5, j = 1, . . . , 7 are
i.i.d. N (0, σ2) and independent of the bi1. In this example p = 3, q = 1, m = 5,
ni = 7, i = 1, . . . , 5, X ij = tij , Aij = I, and Bij = (1, 0, 0)T .
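As a concrete illustration of this two-stage structure (a Python sketch, not the thesis's S software; the measurement days and parameter values are illustrative, loosely based on the estimates reported later in Table 5.2.1), model (4.2.1) can be simulated by drawing a random asymptote per tree and adding i.i.d. errors:

```python
import numpy as np

def logistic(t, phi1, phi2, phi3):
    """Logistic growth curve: phi1 is the asymptote, phi2 the inflection
    time, and phi3 the scale parameter."""
    return phi1 / (1.0 + np.exp(-(t - phi2) / phi3))

def simulate_orange(beta, sqrt_D, sigma, t, m=5, seed=None):
    """Simulate from model (4.2.1): only the asymptote beta_1 receives a
    random effect b_i1 ~ N(0, sigma^2 D); errors are i.i.d. N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    b = rng.normal(0.0, sigma * sqrt_D, size=m)      # sd of b_i1 is sigma * sqrt(D)
    y = np.empty((m, len(t)))
    for i in range(m):
        y[i] = logistic(t, beta[0] + b[i], beta[1], beta[2]) \
               + rng.normal(0.0, sigma, size=len(t))
    return y

t_days = np.linspace(100.0, 1600.0, 7)               # illustrative measurement days
y = simulate_orange(np.array([190.0, 720.0, 345.0]), 4.0, 8.0, t_days, seed=0)
```

Note that at t = φ₂ the curve passes through φ₁/2, which is a convenient sanity check on the parametrization.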
4.3 Bibliographic Review
The first developments of nonlinear mixed effects models appear in Sheiner
and Beal (1980). Their model and estimation method are incorporated in the
NONMEM (Beal and Sheiner, 1980) program which is widely used in pharma-
cokinetics. They introduced a model very similar to (4.1.1) and developed a
maximum likelihood estimation method that was based on a first order Tay-
lor expansion of the model function around the expected values of the random
effects, i.e. 0. The expansion around the current conditional modes of the ran-
dom effects, as done in Lindstrom and Bates (1990), seems to give better results
(Wolf, 1986).
A nonparametric maximum likelihood method for nonlinear mixed effects
models was proposed by Mallet, Mentré, Steimer and Lokiec (1988). They use
a model similar to (4.1.1), but make no assumptions about the distribution of
the random effects, except that it is a probability measure. The conditional
distribution of the yij given the random effects is assumed to be known. The
objective of the estimation procedure is to get the probability distribution of the
cluster-specific effects (φij) that maximizes the likelihood of the data. Mallet
(1986) proved that the maximum likelihood solution is a discrete distribution
with the number of discontinuity points less than or equal to the number of clusters
in the sample. Inference is based on the maximum likelihood distribution from
which summary statistics (e.g. means and variance-covariance matrices) and
plots are obtained.
Davidian and Gallant (1992) introduce a smooth nonparametric maximum
likelihood estimation method for nonlinear mixed effects. Their model is again
very similar to (4.1.1), but with a more general definition for the cluster-specific
effects – φij = g(β, bi, xij), where g is a generic function. As in Mallet et al.
(1988), Davidian and Gallant assume that the conditional distribution of the
response vector given the random effects is known (up to the parameters that
define it), but the distribution of the random effects is free to vary within a class
of smooth densities H defined in Gallant and Nychka (1987). A density from H can be expressed as an infinite linear combination of normal densities. In the
likelihood calculations the summation is truncated to a finite number of terms
and a quadrature approach is used to calculate the integral that defines the
likelihood (4.1.3). This nonparametric approach is implemented in the Nlmix
software, available through StatLib.
A Bayesian approach using hierarchical models for nonlinear mixed effects
is described in Bennett and Wakefield (1993) and Wakefield (1993). The first
stage model is again very similar to (4.1.1). The distributions of both the
random effects and the errors εij are assumed known up to population parame-
ters. Prior distributions for these must also be provided. Markov chain Monte
Carlo methods, such as the Gibbs sampler (Geman and Geman, 1984) and the
Metropolis algorithm (Hastings, 1970), are used to obtain the posterior density
of the random effects.
Vonesh and Carter (1992) have developed a mixed effects model that is
nonlinear in the fixed effects, but linear in the random effects. Their model can
be described as
yi = f (β, X i) + Zibi + εi
where β, bi, and εi as before denote respectively the fixed effects, the random
effects, and the error term, X i is a matrix of covariates, and Zi is a full-
rank matrix of known constants. It is further assumed that bi ∼ N (0, D),
εi ∼ N (0, σ2I), and the two vectors are independent. In a certain sense, Vonesh
and Carter build into the model the approximations suggested by Sheiner and
Beal (1980) and Lindstrom and Bates (1990). They propose an estimated
generalized least squares (EGLS) procedure to estimate the model parameters.
In the first stage estimates of the fixed effects are obtained through ordinary
nonlinear least squares. The residuals from that fit are used to estimate the
variance-covariance matrix of the random effects and that in turn is used in
a weighted nonlinear least squares algorithm to get the final estimates of the
fixed effects. Strong consistency and asymptotic normality of the fixed effects
estimators are proven in the paper. Vonesh and Carter’s approach concentrates
more on inferences on the fixed effects, and less on the variance-covariance
components of the random effects.
Chapter 5

Approximations to the Loglikelihood in the Nonlinear Mixed Effects Model
In this chapter we consider the estimation of the parameters in the nonlinear
mixed effects model for repeated measures (4.1.1) by either maximum likeli-
hood, or restricted maximum likelihood, based on the marginal density of y
given in (4.1.3). Different approximations have been proposed for estimating
this likelihood. Some of these methods consist of taking a first order Taylor
expansion of the model function f around the expected value of the random
effects (Sheiner and Beal, 1980; Vonesh and Carter, 1992), or around the con-
ditional (on D) modes of the random effects (Lindstrom and Bates, 1990).
Others have proposed the use of Gaussian quadrature rules (Davidian and Gal-
lant, 1992).
We consider here four different approximations to the loglikelihood (4.1.3):
Lindstrom and Bates’ (1990) alternating method, a modified Laplacian approx-
imation (Tierney and Kadane, 1986), importance sampling (Geweke, 1989), and
Gaussian quadrature (Davidian and Gallant, 1992). We compare them based
on their computational and statistical properties, using both real data exam-
ples and simulation results. Section 5.1 contains a description of the different
approximations to the loglikelihood as applied to the nonlinear mixed effects
model (4.1.1). Section 5.2 presents a comparison of the different approximations
based on real and simulated data. Our conclusions are given in section 5.3.
5.1 Approximations to the Loglikelihood
In this section we describe four different approximations to the loglikelihood of
y in the nonlinear mixed effects model (4.1.1). We show that there exists a
close relationship between the Laplacian approximation, importance sampling
and a Gaussian quadrature rule centered around the conditional modes of the
random effects b.
5.1.1 Alternating Approximation
Lindstrom and Bates (1990) propose an alternating algorithm for estimating the
parameters in model (4.1.1). Conditional on the data and the current estimate
of D (the scaled variance-covariance matrix of the random effects), the modes
of the random effects b and the estimates of the fixed effects β are obtained by
minimizing a penalized nonlinear least squares (PNLS) objective function
\[
\sum_{i=1}^{m} \left( \| y_i - f_i(\beta, b_i) \|^2 + b_i^T D^{-1} b_i \right) \qquad (5.1.1)
\]
where [f_i(β, b_i)]_j = f(φ_ij, x_ij), i = 1, . . . , m, j = 1, . . . , n_i.
To update the estimate of D at the wth iteration, Lindstrom and Bates use a
first order Taylor expansion of the model function around the current estimates
of β and the conditional modes of the random effects b, which we will denote
by β^(w) and b^(w) respectively. Letting
\[
Z_i^{(w)} = \left. \frac{\partial f_i}{\partial b_i^T} \right|_{\beta^{(w)},\, b^{(w)}}, \qquad
X_i^{(w)} = \left. \frac{\partial f_i}{\partial \beta^T} \right|_{\beta^{(w)},\, b^{(w)}},
\]
and
\[
w_i^{(w)} = y_i - f_i\big(\beta^{(w)}, b_i^{(w)}\big) + X_i^{(w)} \beta^{(w)} + Z_i^{(w)} b_i^{(w)},
\]
the approximate loglikelihood used for the estimation of D is
\[
\ell_A\big(\beta, \sigma^2, D \mid y\big) = -\frac{1}{2} \sum_{i=1}^{m} \Big\{
\log \Big| \sigma^2 \Big( I + Z_i^{(w)} D Z_i^{(w)T} \Big) \Big|
+ \sigma^{-2} \Big[ w_i^{(w)} - X_i^{(w)} \beta \Big]^T
\Big( I + Z_i^{(w)} D Z_i^{(w)T} \Big)^{-1}
\Big[ w_i^{(w)} - X_i^{(w)} \beta \Big] \Big\}. \qquad (5.1.2)
\]
This loglikelihood is identical to that of a linear mixed effects (LME) model in
which the response vector is given by w^(w) and the fixed and random effects
design matrices are given by X^(w) and Z^(w). Using (2.2.2), one can express
the optimal values of β and σ2 as functions of D and work with the profile
loglikelihood of D, greatly simplifying the optimization problem. Lindstrom
and Bates (1990) have also proposed an approximate restricted loglikelihood
for the estimation of D:
\[
\ell_{RA}\big(\beta, \sigma^2, D \mid y\big) =
-\frac{1}{2} \log \Big| \sum_{i=1}^{m} X_i^{(w)T}
\Big[ \sigma^2 \Big( I + Z_i^{(w)} D Z_i^{(w)T} \Big) \Big]^{-1} X_i^{(w)} \Big|
+ \ell_A\big(\beta, \sigma^2, D \mid y\big). \qquad (5.1.3)
\]
Their estimation algorithm alternates between the PNLS and LME steps
until some convergence criterion is met. Such alternating algorithms tend to
be more efficient when the estimates of the variance-covariance components (D
and σ2) are not highly correlated with the estimates of the fixed effects (β).
In chapter 3 we have demonstrated that, in the linear mixed effects model,
the maximum likelihood estimates of D and σ2 are asymptotically independent
of the maximum likelihood estimates of β . These results have not yet been
extended to the nonlinear mixed effects model (4.1.1).
It can be shown that the maximum likelihood estimate of β and the condi-
tional modes of the random effects bi corresponding to the approximate loglike-
lihood (5.1.2) are the values obtained in the first iteration of the Gauss-Newton
algorithm used to minimize the PNLS objective function (5.1.1). Therefore, at
the converged value of D, the estimates of β and bi obtained from the LME
and PNLS steps coincide. We will use ℓ_A when comparing the different approximations at the optimal values in section 5.2, but we do note that in Lindstrom
and Bates (1990) approximation (5.1.2) is used only to update the estimates of
D and not for estimating β.
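The PNLS criterion (5.1.1) can be handed to a standard nonlinear least squares routine by absorbing the penalty into extra residuals. The following Python sketch illustrates the idea (the model function, data, and names are illustrative; this is not the thesis's S code):

```python
import numpy as np

def pnls_residuals(beta, b, y_list, t, f, D):
    """Stacked residuals r such that r^T r equals the PNLS criterion (5.1.1):
    sum_i ( ||y_i - f_i(beta, b_i)||^2 + b_i^T D^{-1} b_i )."""
    L = np.linalg.cholesky(np.linalg.inv(D))      # L L^T = D^{-1}
    res = []
    for y_i, b_i in zip(y_list, b):
        res.append(y_i - f(beta, b_i, t))         # data residuals
        res.append(L.T @ b_i)                     # ||L^T b_i||^2 = b_i^T D^{-1} b_i
    return np.concatenate(res)

def f_lin(beta, b_i, t):                          # illustrative model function
    return (beta[0] + b_i[0]) * t

t = np.linspace(0.0, 1.0, 4)
D = np.array([[2.0]])
y_list = [np.ones(4), 2.0 * np.ones(4)]
b = [np.array([0.5]), np.array([-0.5])]
r = pnls_residuals(np.array([1.0]), b, y_list, t, f_lin, D)
pnls_value = r @ r                                # equals criterion (5.1.1)
```

In the alternating scheme, one would minimize ‖r‖² jointly over β and the b_i for fixed D (for example with a Gauss-Newton or `scipy.optimize.least_squares`-style solver), and then update D by maximizing the LME loglikelihood ℓ_A in (5.1.2).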
5.1.2 Laplacian Approximation
Laplacian approximations are frequently used in Bayesian inference to estimate
marginal posterior densities and predictive distributions (Tierney and Kadane,
1986; Leonard, Hsu and Tsui, 1989). These techniques can also be used for the
integration considered here.
The integral that we want to estimate for the marginal distribution of yi in
model (4.1.1) can be written as
\[
p(y_i \mid \beta, D, \sigma^2) = \int (2\pi\sigma^2)^{-(n_i+q)/2}\, |D|^{-1/2}
\exp\big[ -g(\beta, D, y_i, b_i)/2\sigma^2 \big]\, db_i,
\]
where g(β, D, y_i, b_i) = ‖y_i − f_i(β, b_i)‖² + b_i^T D^{−1} b_i.
Let
\[
\hat{b}_i = \hat{b}_i(\beta, D, y_i) = \arg\min_{b_i} g(\beta, D, y_i, b_i),
\]
\[
g'(\beta, D, y_i, b_i) = \frac{\partial g(\beta, D, y_i, b_i)}{\partial b_i}, \qquad
g''(\beta, D, y_i, b_i) = \frac{\partial^2 g(\beta, D, y_i, b_i)}{\partial b_i\, \partial b_i^T},
\]
and consider a second order Taylor expansion of g around \hat{b}_i,
\[
g(\beta, D, y_i, b_i) \approx g\big(\beta, D, y_i, \hat{b}_i\big)
+ \frac{1}{2} \big[ b_i - \hat{b}_i \big]^T g''\big(\beta, D, y_i, \hat{b}_i\big) \big[ b_i - \hat{b}_i \big], \qquad (5.1.4)
\]
where the linear term of the approximation vanishes since g'(β, D, y_i, \hat{b}_i) = 0.
The Laplacian approximation is defined as
\[
p\big(y \mid \beta, D, \sigma^2\big) \approx (2\pi\sigma^2)^{-N/2} |D|^{-m/2}
\exp\Big[ -\frac{1}{2\sigma^2} \sum_{i=1}^{m} g\big(\beta, D, y_i, \hat{b}_i\big) \Big]
\times \prod_{i=1}^{m} \int (2\pi\sigma^2)^{-q/2}
\exp\Big\{ -\frac{1}{2\sigma^2} \big[ b_i - \hat{b}_i \big]^T g''\big(\beta, D, y_i, \hat{b}_i\big) \big[ b_i - \hat{b}_i \big] \Big\}\, db_i
\]
\[
= (2\pi\sigma^2)^{-N/2} |D|^{-m/2} \prod_{i=1}^{m}
\big| g''\big(\beta, D, y_i, \hat{b}_i\big) \big|^{-1/2}
\exp\big[ -g\big(\beta, D, y_i, \hat{b}_i\big)/2\sigma^2 \big],
\]
where N = \sum_{i=1}^{m} n_i.
Now we consider an approximation to g′′ similar to the one used in Gauss-
Newton optimization. We have
\[
g''\big(\beta, D, y_i, \hat{b}_i\big) =
- \left. \frac{\partial^2 f_i(\beta, b_i)}{\partial b_i\, \partial b_i^T} \right|_{b_i = \hat{b}_i}
\big[ y_i - f_i\big(\beta, \hat{b}_i\big) \big]
+ \left. \frac{\partial f_i(\beta, b_i)}{\partial b_i^T} \right|_{b_i = \hat{b}_i}^{T}
\left. \frac{\partial f_i(\beta, b_i)}{\partial b_i^T} \right|_{b_i = \hat{b}_i}
+ D^{-1}.
\]
At \hat{b}_i, the contribution of the term involving the second derivatives of the
model function and the residuals y_i − f_i(β, \hat{b}_i) is usually negligible compared
to that of the cross-product of the first derivatives (Bates and Watts, 1980).
Therefore we use the approximation
\[
g''\big(\beta, D, y_i, \hat{b}_i\big) \approx G(\beta, D, y_i) =
\left. \frac{\partial f_i(\beta, b_i)}{\partial b_i^T} \right|_{b_i = \hat{b}_i}^{T}
\left. \frac{\partial f_i(\beta, b_i)}{\partial b_i^T} \right|_{b_i = \hat{b}_i}
+ D^{-1}.
\]
This has the advantage of requiring only the first order partial derivatives of the
model function with respect to the random effects, which are usually available
from the estimation of bi. This estimation of bi is a penalized least squares
problem, for which standard and reliable code is available.
The modified Laplacian approximation to the loglikelihood of model (4.1.1)
is then given by
\[
\ell_{LA}\big(\beta, D, \sigma^2 \mid y\big) =
-\frac{1}{2} \Big\{ N \log(2\pi\sigma^2) + m \log|D|
+ \sum_{i=1}^{m} \log|G(\beta, D, y_i)|
+ \sigma^{-2} \sum_{i=1}^{m} g\big(\beta, D, y_i, \hat{b}_i\big) \Big\}. \qquad (5.1.5)
\]
Since \hat{b}_i does not depend upon σ², for given β and D, the maximum likelihood
estimate of σ² (based upon ℓ_LA) is
\[
\hat{\sigma}^2 = \hat{\sigma}^2(\beta, D, y) = \sum_{i=1}^{m} g\big(\beta, D, y_i, \hat{b}_i\big) \Big/ N.
\]
We can profile ℓ_LA on σ² to reduce the dimension of the optimization problem,
obtaining
\[
\ell_{LAp}(\beta, D \mid y) =
-\frac{1}{2} \Big\{ N \big[ 1 + \log(2\pi) + \log\big(\hat{\sigma}^2\big) \big] + m \log|D|
+ \sum_{i=1}^{m} \log|G(\beta, D, y_i)| \Big\}. \qquad (5.1.6)
\]
We note that if f is linear in b then the modified Laplacian approximation
is exact because the second order Taylor expansion in (5.1.4) is exact when
f (β, b) = f (β) + Z (β) b.
There does not yet seem to be a straightforward generalization of the concept
of restricted maximum likelihood (Harville, 1974) to nonlinear mixed effects
models. The difficulty is that restricted maximum likelihood depends heavily
upon the linearity of the fixed effects in the model function, which generally does
not occur in nonlinear models. Lindstrom and Bates (1990) circumvented that
problem by using an approximation to the model function f in which the fixed
effects β occur linearly. This cannot be done for the Laplacian approximation,
unless we consider yet another Taylor expansion of the model function, which
would lead us back to something very similar to Lindstrom and Bates’ approach.
We will return to this topic later in section 5.3.
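The exactness of the modified Laplacian approximation for models linear in b provides a convenient correctness check. The following numpy sketch (all data, dimensions, and names are illustrative, not the thesis's S code) compares ℓ_LA for a single cluster with f(β, b) = Xβ + Zb against the exact marginal loglikelihood; for linear-in-b models, \hat{b}_i has the closed form used below:

```python
import numpy as np

# One cluster (m = 1), model linear in b, random intercept and slope (q = 2).
rng = np.random.default_rng(0)
ni, sigma2 = 8, 0.5
X = np.column_stack([np.ones(ni), np.arange(ni, dtype=float)])
Z = X.copy()
beta = np.array([1.0, 0.3])
D = np.diag([2.0, 0.5])                        # scaled covariance of b_i
y = rng.normal(X @ beta, 1.0)

r = y - X @ beta
Dinv = np.linalg.inv(D)
G = Z.T @ Z + Dinv                             # Gauss-Newton approximation to g''
bhat = np.linalg.solve(G, Z.T @ r)             # minimizer of g (closed form, f linear in b)
g_hat = np.sum((r - Z @ bhat) ** 2) + bhat @ Dinv @ bhat

# Modified Laplacian loglikelihood (5.1.5) with m = 1
ell_LA = -0.5 * (ni * np.log(2 * np.pi * sigma2) + np.log(np.linalg.det(D))
                 + np.log(np.linalg.det(G)) + g_hat / sigma2)

# Exact marginal loglikelihood: y ~ N(X beta, sigma2 (I + Z D Z^T))
V = sigma2 * (np.eye(ni) + Z @ D @ Z.T)
ell_exact = -0.5 * (ni * np.log(2 * np.pi) + np.log(np.linalg.det(V))
                    + r @ np.linalg.solve(V, r))
```

The agreement rests on the determinant identity |I + ZDZ^T| = |D| |Z^T Z + D^{−1}| and on g(\hat{b}) = r^T (I + ZDZ^T)^{−1} r for the linear model.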
5.1.3 Importance Sampling
Importance sampling provides a simple and efficient way of performing Monte
Carlo integration. The critical step for the success of this method is the choice of
an importance distribution from which the sample is drawn and the importance
weights calculated. Ideally this distribution corresponds to the density that we
are trying to integrate, but in practice one uses an easily sampled approximation.
For the nonlinear mixed effects model the function that we want to integrate is,
up to a multiplicative constant, equal to exp[−g(β, D, y_i, b_i)/2σ²]. As shown
in subsection 5.1.2, by taking a second order Taylor expansion of g(β, D, y_i, b_i)
around b̂_i, the integrand is, up to a multiplicative constant, approximately equal
to a N(b̂_i, σ²[G(β, D, y_i)]^{−1}) density. This gives us a natural choice for the
importance distribution.
Let N_IS denote the number of importance samples to be drawn. In practice
one such sample can be generated by selecting a vector z* with distribution
N(0, I) and calculating the sample of random effects as b*_i = b̂_i +
σ[G(β, D, y_i)]^{−1/2} z*, where [G(β, D, y_i)]^{−1/2} denotes the inverse of the Cholesky
factor of G(β, D, y_i). The importance sampling approximation to the loglikelihood
of y is then defined as
\[
\ell_{IS}\big(\beta, D, \sigma^2 \mid y\big) =
-\frac{1}{2} \Big[ N \log(2\pi\sigma^2) + m \log|D|
+ \sum_{i=1}^{m} \log|G(\beta, D, y_i)| \Big]
+ \sum_{i=1}^{m} \log \Big\{ \sum_{j=1}^{N_{IS}}
\exp\big[ -g\big(\beta, D, y_i, b^*_{ij}\big)/2\sigma^2 + \|z^*_j\|^2/2 \big] \Big/ N_{IS} \Big\}. \qquad (5.1.7)
\]
Note that we cannot in general obtain a closed form expression for the MLE of
σ2 for fixed β and D, so that profiling on σ2 is no longer reasonable.
As in the modified Laplacian approximation, importance sampling gives ex-
act results when the model function is linear in b because in this case
\[
p(y_i \mid b_i, \beta, D, \sigma^2)\, p(b_i) =
p\big(y_i \mid \beta, D, \sigma^2\big) \cdot
N\big(\hat{b}_i,\, \sigma^2 [G(\beta, D, y_i)]^{-1}\big)
\]
so that the importance weights are equal to p (yi | β, D, σ2).
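This exactness can be checked directly: with f linear in b, every importance weight collapses to the same value, so ℓ_IS reproduces the exact loglikelihood for any number of samples. A numpy sketch for a single cluster with a random intercept (all data illustrative, not the thesis's S code):

```python
import numpy as np

rng = np.random.default_rng(1)
ni, q, sigma2 = 6, 1, 0.4
X = np.column_stack([np.ones(ni), np.linspace(0.0, 1.0, ni)])
Z = np.ones((ni, 1))                       # random intercept (q = 1)
beta = np.array([0.5, 2.0])
D = np.array([[1.5]])
y = rng.normal(X @ beta, 1.0)

r = y - X @ beta
G = Z.T @ Z + np.linalg.inv(D)             # Gauss-Newton approximation to g''
bhat = np.linalg.solve(G, Z.T @ r)         # conditional mode (exact, f linear in b)

def g(b):
    e = r - Z @ b
    return e @ e + b @ np.linalg.solve(D, b)

# Importance samples b*_j = bhat + sigma [G]^{-1/2} z*_j, with z*_j ~ N(0, I)
L = np.linalg.cholesky(G)
z_star = rng.standard_normal((200, q))
log_w = np.array([-g(bhat + np.sqrt(sigma2) * np.linalg.solve(L.T, z)) / (2 * sigma2)
                  + z @ z / 2 for z in z_star])
ell_IS = (-0.5 * (ni * np.log(2 * np.pi * sigma2) + np.log(np.linalg.det(D))
                  + np.log(np.linalg.det(G)))
          + np.log(np.mean(np.exp(log_w))))

# Exact marginal loglikelihood: y ~ N(X beta, sigma2 (I + Z D Z^T))
V = sigma2 * (np.eye(ni) + Z @ D @ Z.T)
ell_exact = -0.5 * (ni * np.log(2 * np.pi) + np.log(np.linalg.det(V))
                    + r @ np.linalg.solve(V, r))
```

Here the exponents −g(b*)/2σ² + ‖z*‖²/2 are identical across samples, which is exactly the constant-weight behavior described above.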
5.1.4 Gaussian Quadrature
Gaussian quadrature is used to approximate integrals of functions with re-
spect to a given kernel by a weighted average of the integrand evaluated at
pre-determined abscissas. The weights and abscissas used in Gaussian quadra-
ture rules for the most common kernels can be obtained from the tables of
Abramowitz and Stegun (1964) or by using an algorithm proposed by Golub
(1973) (see also Golub and Welsch (1969)). Gaussian quadrature rules for mul-
tiple integrals are known to be numerically complex (Davis and Rabinowitz,
1984), but using the structure of the integrand in the nonlinear mixed effects
model we can transform the problem into successive applications of simple one
dimensional Gaussian quadrature rules. Letting z∗j , wj, j = 1, . . . , NGQ denote
respectively the abscissas and the weights for the (one dimensional) Gaussian
quadrature rule with NGQ points based on the N (0, 1) kernel, we get
\[
\int (2\pi\sigma^2)^{-q/2} |D|^{-1/2}
\exp\big[ -\|y_i - f_i(\beta, b_i)\|^2/2\sigma^2 \big]
\exp\big( -b_i^T D^{-1} b_i/2\sigma^2 \big)\, db_i \qquad (5.1.8)
\]
\[
= \int (2\pi)^{-q/2}
\exp\Big[ -\big\| y_i - f_i\big(\beta, \sigma D^{T/2} z^*\big) \big\|^2/2\sigma^2 \Big]
\exp\big( -\|z^*\|^2/2 \big)\, dz^*
\]
\[
\approx \sum_{j_1=1}^{N_{GQ}} \cdots \sum_{j_q=1}^{N_{GQ}}
\exp\Big[ -\big\| y_i - f_i\big(\beta, \sigma D^{T/2} z^*_{j_1,\ldots,j_q}\big) \big\|^2/2\sigma^2 \Big]
\prod_{k=1}^{q} w_{j_k},
\]
where z^*_{j_1,\ldots,j_q} = (z^*_{j_1}, \ldots, z^*_{j_q})^T. The corresponding approximation to the
loglikelihood function is
\[
\ell_{GQ}\big(\beta, D, \sigma^2 \mid y\big) = -N \log(2\pi\sigma^2)/2
+ \sum_{i=1}^{m} \log \Big\{ \sum_{j}
\exp\Big[ -\big\| y_i - f_i\big(\beta, \sigma D^{T/2} z^*_j\big) \big\|^2/2\sigma^2 \Big]
\prod_{k=1}^{q} w_{j_k} \Big\}, \qquad (5.1.9)
\]
where j = (j_1, \ldots, j_q)^T.
The Gaussian quadrature rule in this case can be viewed as a deterministic
version of Monte Carlo integration in which random samples of bi are gener-
ated from the N (0, σ2D) distribution. The samples (z∗j) and the weights (wj)
are fixed beforehand, while in Monte Carlo integration they are left to random
choice. Since importance sampling tends to be much more efficient than sim-
ple Monte Carlo integration, we also considered the equivalent of importance
sampling in the Gaussian quadrature context, which we will denote by adaptive
Gaussian quadrature. In this approach the grid of abscissas in the bi scale is
centered around the conditional modes bi rather than 0, as in (5.1.8). Another
modification is the use of G (β, D, yi) instead of D in the scaling of the z∗.
The adaptive Gaussian quadrature is then given by
\[
\int (2\pi\sigma^2)^{-q/2} |D|^{-1/2}
\exp\big[ -\|y_i - f_i(\beta, b_i)\|^2/2\sigma^2 \big]
\exp\big( -b_i^T D^{-1} b_i/2\sigma^2 \big)\, db_i
\]
\[
= \int (2\pi)^{-q/2} |G(\beta, D, y_i)\, D|^{-1/2}
\exp\Big( -g\big\{\beta, D, y_i, \hat{b}_i + \sigma[G(\beta, D, y_i)]^{-1/2} z^*\big\}/2\sigma^2
+ \|z^*\|^2/2 \Big)
\exp\big( -\|z^*\|^2/2 \big)\, dz^*
\]
\[
\approx |G(\beta, D, y_i)\, D|^{-1/2} \sum_{j_1=1}^{N_{GQ}} \cdots \sum_{j_q=1}^{N_{GQ}}
\exp\Big( -g\big\{\beta, D, y_i, \hat{b}_i + \sigma[G(\beta, D, y_i)]^{-1/2} z^*_{j_1,\ldots,j_q}\big\}/2\sigma^2
+ \big\| z^*_{j_1,\ldots,j_q} \big\|^2/2 \Big)
\prod_{k=1}^{q} w_{j_k}.
\]
The corresponding approximation to the loglikelihood is then
\[
\ell_{AGQ}\big(\beta, D, \sigma^2 \mid y\big) =
-\Big[ N \log(2\pi\sigma^2) + m \log|D|
+ \sum_{i=1}^{m} \log|G(\beta, D, y_i)| \Big] \Big/ 2
+ \sum_{i=1}^{m} \log \Big\{ \sum_{j}
\exp\Big( -g\big\{\beta, D, y_i, \hat{b}_i + \sigma[G(\beta, D, y_i)]^{-1/2} z^*_j\big\}/2\sigma^2
+ \|z^*_j\|^2/2 \Big) \prod_{k=1}^{q} w_{j_k} \Big\}. \qquad (5.1.10)
\]
The adaptive Gaussian quadrature approximation very closely resembles
that obtained for importance sampling. The basic difference is that the former
uses fixed abscissas and weights, while the latter allows them to be determined
by a pseudo-random mechanism. It is also interesting to note that the one
point (i.e. NGQ = 1) adaptive Gaussian quadrature approximation is simply
the modified Laplacian approximation (5.1.5), since in this case z∗1 = 0 and
w1 = 1. The adaptive Gaussian quadrature also gives the exact loglikelihood
when the model function is linear in b, but that is not true in general for the
Gaussian quadrature approximation (5.1.8). Like the importance sampling ap-
proximation, the Gaussian quadrature approximation cannot be profiled on σ2
to reduce the dimensionality of the optimization problem.
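A one-dimensional (q = 1) numpy sketch of the adaptive rule (5.1.10), using the probabilists' Gauss-Hermite nodes for the N(0, 1) kernel, illustrates both points above: with a single node the rule collapses to the modified Laplacian value. The model function, data, and the crude mode search are illustrative assumptions, not the thesis's S implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
ni, sigma2, D = 6, 0.3, 1.2                    # scalar D since q = 1
t = np.linspace(0.0, 1.0, ni)
beta = np.array([1.0, 0.8])

def f(b):                                      # model function, nonlinear in b
    return beta[0] * np.exp(-(beta[1] + b) * t)

y = rng.normal(f(0.3), np.sqrt(sigma2))

def g(b):
    return np.sum((y - f(b)) ** 2) + b * b / D

# conditional mode by a crude grid search (a sketch; a real implementation
# would use a penalized least squares solver)
grid = np.linspace(-3.0, 3.0, 2001)
bhat = grid[np.argmin([g(b) for b in grid])]
h = 1e-6                                       # finite-difference Jacobian df/db
J = (f(bhat + h) - f(bhat - h)) / (2.0 * h)
G = J @ J + 1.0 / D                            # Gauss-Newton approximation to g''

def ell_AGQ(n_gq):
    """Adaptive Gaussian quadrature loglikelihood (5.1.10) for one cluster."""
    z, w = np.polynomial.hermite_e.hermegauss(n_gq)   # kernel exp(-z^2/2)
    w = w / np.sqrt(2.0 * np.pi)                      # normalize for N(0, 1)
    s = np.sqrt(sigma2 / G)                           # grid scale sigma G^{-1/2}
    total = sum(wj * np.exp(-g(bhat + s * zj) / (2.0 * sigma2) + zj ** 2 / 2.0)
                for zj, wj in zip(z, w))
    return (-0.5 * (ni * np.log(2.0 * np.pi * sigma2) + np.log(D) + np.log(G))
            + np.log(total))
```

With `n_gq = 1` the single node is z* = 0 with unit normalized weight, so `ell_AGQ(1)` equals the modified Laplacian loglikelihood; a handful of nodes already approximates the integral closely for this smooth integrand.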
5.2 Comparing the Approximations
In this section we present a comparison of the different approximations to the
loglikelihood of model (4.1.1) described in section 5.1. Two real data examples,
the orange trees data, introduced in section 4.2, and the Theophylline data, as
well as simulation results are used to compare the statistical and computational
aspects of the various approximations.
5.2.1 Orange Trees
The orange trees data and the nonlinear mixed effects model used to describe
it were presented in section 4.2. We note that the single random effect occurs
linearly in (4.2.1) and therefore the modified Laplacian (5.1.6), the importance
sampling (5.1.7), and the adaptive Gaussian quadrature (5.1.10) approximations
are all exact. Figure 5.2.1 presents the data on the trunk circumference together
with the fitted curves corresponding to model (4.2.1), using maximum likelihood
based on the exact likelihood and the conditional modes of the random effects.
Table 5.2.1 presents the results of estimation using the alternating approxi-
mation, Gaussian quadrature with 10 and 200 abscissas, and the exact loglikeli-
hood. Since only the alternating approximation provides a version of restricted
maximum loglikelihood, we will just consider maximum likelihood estimation in
this and the next subsection. The subscript on Gaussian refers to the number
of abscissas used in the approximation, and the scalar L is √D, the square root
of the scaled variance of the random effects. In general L is a matrix, but
there is only one random effect here.
Table 5.2.1: Estimation Results – Orange Trees

Approximation   log(L)   β1        β2        β3        log(σ2)   ℓ
Alternating     1.389    191.049   722.556   344.164   4.120     -131.585
Gaussian10      1.123    194.325   727.490   348.065   4.102     -130.497
Gaussian200     1.396    192.293   727.074   348.074   4.119     -131.571
Exact           1.395    192.053   727.906   348.073   4.119     -131.572
The estimation results in Table 5.2.1 indicate that the different approxima-
tions produce similar fits. The Gaussian approximation with only 10 abscissas
gives the worst approximation, in terms of the value of the loglikelihood, but
even that is not far from the exact value. The Gaussian quadrature with 200
abscissas is almost identical to the exact loglikelihood. The alternating
approximation is also very close to the exact value.

Figure 5.2.1: Trunk circumference (in millimeters) of five orange trees: Data
and fitted curves using the conditional modes of the random effects and maximum
likelihood estimation based on the exact loglikelihood. The dashed line
represents the curve obtained setting the random effects to zero.
Another important issue regarding the different approximations is how well
they behave in a neighborhood of the optimal value, since this behavior is often
used to assess the variability of maximum likelihood estimates. Figure 5.2.2
displays the profile traces and contours (Bates and Watts, 1988) for the ex-
act loglikelihood and the alternating approximation. This plot could not be
obtained for the Gaussian approximation because the objective function pre-
sented several local optima during the profiling algorithm. We believe that this
is related to the fact that the Gaussian approximation is centered at bi = 0
and not at the conditional modes of the random effects, where the integrand in
(4.1.3) takes on its highest values.
It can be seen from Figure 5.2.2 that the alternating method gives a good
approximation to the loglikelihood in a neighborhood of the optimal values. It
is interesting to note that the profile traces for the variance-covariance compo-
nents (D and σ2) and the fixed effects (β) meet almost perpendicularly. This
indicates a local lack of correlation between the variance-covariance components
and the fixed effects, which explains why the alternating method was so suc-
cessful in approximating the loglikelihood. The same pattern was observed in
several other data sets that we have analyzed, leading us to conjecture that the
asymptotic lack of correlation between the estimators of the variance-covariance
components and the fixed effects verified in the linear mixed effects model also
holds, at least approximately, for the nonlinear mixed effects model.
To compare the computational efficiency of the different approximations
we consider the number of function evaluations needed until convergence. For
the alternating approximation there are two different functions being evaluated
during the iterations: the objective function (5.1.1) within the PNLS step and
the approximate loglikelihood ℓ_A (5.1.2) within the LME step. We will use here
the total number of evaluations of either (5.1.1) or ℓ_A, multiplied by the number
of clusters. For the other approximations we will use the total number of calls
to g (β, D, yi, bi). Even though the number of function evaluations used for the
alternating approximation is not directly comparable to the number of function
evaluations of the remaining approximations, it gives a good idea of the relative
computational efficiency of this algorithm.

Figure 5.2.2: Profile traces and profile contour plots for the orange trees data
based on the exact loglikelihood (solid line) and the alternating approximation
(dashed line). Plots below the diagonal are in the original scale and plots above
the diagonal are in the zeta scale (Bates and Watts, 1988). Interpolated contours
correspond approximately to joint confidence levels of 68%, 87%, and 95%.
Table 5.2.2 presents the number of function evaluations for the different
approximations in the orange trees example. The Gaussian quadrature approx-
imation is considerably less efficient than either the alternating approximation
or the exact loglikelihood. As expected the alternating approximation is the
most computationally efficient.
Table 5.2.2: Number of Function Evaluations to Convergence – Orange Trees

Approximation   Function Evaluations
Alternating     200
Exact           420
Gaussian10      8,150
Gaussian200     101,000
5.2.2 Theophylline
The data considered here are courtesy of Dr. Robert A. Upton of the Univer-
sity of California, San Francisco. Theophylline was administered orally to 12
subjects whose serum concentrations were measured at 11 times over the next
25 hours. This is an example of a laboratory pharmacokinetic study character-
ized by many observations on a moderate number of individuals. Figure 5.2.3
displays the data and the fitted curves obtained through maximum likelihood
estimation using the adaptive Gaussian approximation with 10 abscissas and
using the conditional modes of the random effects.
Figure 5.2.3: Theophylline concentrations (in mg/L) of twelve patients: Data
and fitted curves using the conditional modes of the random effects and maximum
likelihood estimation based on the adaptive Gaussian approximation.

A common model for such data is a first order compartment model with
absorption in a peripheral compartment
\[
C_t = \frac{D K k_a}{Cl\,(k_a - K)} \big[ \exp(-Kt) - \exp(-k_a t) \big] \qquad (5.2.1)
\]
where Ct is the observed concentration (mg/L) at time t, t is the time (hr),
D is the dose (mg/kg), Cl is the clearance (L/kg), K is the elimination rate
constant (1/hr), and ka is the absorption rate constant (1/hr). In order to
ensure positivity of the rate constants and the clearance, the logarithms of
these quantities were used in the fit. Analysis of the Theophylline data using
model (5.2.1) suggested that only log(Cl) and log(ka) needed random effects
to account for the patient-to-patient variability. The nonlinear mixed effects
model used for the Theophylline data is
\[
C_t = \frac{D \exp[-(\beta_1 + b_{i1}) + (\beta_2 + b_{i2}) + \beta_3]}{\exp(\beta_2 + b_{i2}) - \exp(\beta_3)}
\times \big\{ \exp[-\exp(\beta_3)\, t] - \exp[-\exp(\beta_2 + b_{i2})\, t] \big\} \qquad (5.2.2)
\]
Table 5.2.3 presents the estimation results from the various approximations
to the loglikelihood. Only maximum likelihood estimation is considered. The
subscripts on Gaussian and on Adap. Gaussian refer to the number of ab-
scissas used in the Gaussian and adaptive Gaussian approximations, while the
subscript on Imp. Sampling refers to the number of importance samples used
in this approximation. L denotes the vector with elements given by the upper
triangular half of the Cholesky decomposition of D, stacked by columns.
Table 5.2.3: Estimation Results – Theophylline Data

Approximation        log(L1)   L2       log(L3)   β1       β2      β3
Alternating          -1.4466   0.0027   -0.0999   -3.227   0.466   -2.455
Laplacian            -1.4438   0.0027   -0.0997   -3.230   0.469   -2.464
Imp. Sampling1000    -1.4438   0.0027   -0.0988   -3.227   0.476   -2.459
Gaussian5            -1.5554   0.0024   -0.3969   -3.304   0.501   -2.487
Gaussian10           -1.5642   0.0023   -0.2043   -3.238   0.595   -2.469
Gaussian100          -1.4457   0.0027   -0.0982   -3.227   0.480   -2.459
Adap. Gaussian5      -1.4460   0.0027   -0.0991   -3.225   0.476   -2.458
Adap. Gaussian10     -1.4475   0.0027   -0.0994   -3.227   0.474   -2.459

Approximation        log(σ2)   ℓ
Alternating          -0.6866   -177.0237
Laplacian            -0.6866   -177.0000
Imp. Sampling1000    -0.6875   -177.7689
Gaussian5            -0.4840   -182.4680
Gaussian10           -0.7028   -176.1008
Gaussian100          -0.6854   -177.7290
Adap. Gaussian5      -0.6868   -177.7500
Adap. Gaussian10     -0.6853   -177.7473
We can see from Table 5.2.3 that the alternating approximation, the Lapla-
cian approximation, the importance sampling approximation, and the adaptive
Gaussian approximation all give similar estimation results. The Gaussian ap-
proximation only approaches the other approximations when the number of
abscissas is increased considerably. Note that the actual number of points used
in the grid that defines the Gaussian approximation for this example is the
square of the number of abscissas. The adaptive Gaussian approximations for
1 (Laplacian), 5, and 10 abscissas give similar results, indicating that just a
few points are needed for this approximation to be accurate. The importance
sampling approximation caused some numerical difficulties for the optimiza-
tion algorithm (the ms() function in S (Chambers and Hastie, 1992)) used to
obtain the maximum likelihood estimates, since the stochastic variability asso-
ciated with different importance samples overwhelmed the numerical variability
of the loglikelihood for small changes in the parameter values (used to calculate
numerical derivatives). We solved this problem by keeping the random num-
ber generator seed fixed during the optimization process, thus using the same
importance samples throughout the calculations. Since the results obtained us-
ing importance sampling were very similar to those of the adaptive Gaussian
approximation, we concluded that the latter is to be preferred for its greater
simplicity and computational efficiency.
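The seed-fixing device just described amounts to using common random numbers across objective evaluations. A tiny illustrative Python sketch (a stand-in objective, not the thesis's S code): drawing the importance samples once, up front, and reusing them makes repeated evaluations at the same parameter value agree exactly, so that finite-difference derivatives see a smooth, non-stochastic function of the parameters.

```python
import numpy as np

rng = np.random.default_rng(12345)
Z_FIXED = rng.standard_normal((1000, 2))            # N_IS = 1000 samples, q = 2

def mc_objective(theta, z):
    """Stand-in for an importance sampling loglikelihood (illustrative)."""
    return float(np.log(np.mean(np.exp(-0.5 * np.sum((z - theta) ** 2, axis=1)))))

v1 = mc_objective(np.array([0.1, -0.2]), Z_FIXED)
v2 = mc_objective(np.array([0.1, -0.2]), Z_FIXED)   # identical, not just close
```

With freshly drawn samples at each call, v1 and v2 would differ by Monte Carlo noise, which is precisely what overwhelmed the numerical derivatives in the optimization described above.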
Table 5.2.4 gives the number of function evaluations until convergence for the
different approximations. The alternating approximation is the most efficient,
followed by the Laplacian and adaptive Gaussian approximations. Gaussian
quadrature with 5 abscissas is efficient compared to the adaptive Gaussian,
but is quite inaccurate. The more reliable Gaussian approximation with 100
abscissas takes about 100 times more function evaluations than the adaptive
Gaussian with 10 abscissas. The importance sampling approximation had the
worst performance in terms of function evaluations.
Table 5.2.4: Number of Function Evaluations to Convergence – Theophylline

Approximation        Function Evaluations
Alternating          1,512
Laplacian            7,683
Adap. Gaussian5      30,020
Adap. Gaussian10     96,784
Gaussian5            47,700
Gaussian10           318,000
Gaussian100          10,200,000
Imp. Sampling1000    11,211,284
Next we consider the approximations in a neighborhood of the optimal value.
We will restrict ourselves here to the alternating, the Laplacian, and the adap-
tive Gaussian approximation, as the Gaussian approximation for a moderate
number of abscissas is not reliable, and both the Gaussian approximation with
a larger number of abscissas and the importance sampling approximation are
very inefficient computationally and give results quite similar to the adaptive
Gaussian approximation. We used five abscissas for the adaptive Gaussian
quadrature, as this gives roughly the same precision as the ten-abscissa quadra-
ture rule.
The alternating approximation gives results very similar to the adaptive
Gaussian quadrature. As in the orange trees example, the profile traces of
the variance-covariance components and the fixed effects meet almost perpen-
dicularly, indicating a local lack of correlation between these estimates. The
Laplacian and the adaptive Gaussian approximations give virtually identical
plots (not included here). This suggests there is little to be gained by increas-
ing the number of abscissas past one in the quadrature rule. The major gain in
precision is obtained by centering the grid at the conditional modes and scaling it using the approximate Hessian.

Figure 5.2.4: Profile traces and profile contour plots for the Theophylline data based on the adaptive Gaussian approximation with 5 abscissas (solid line) and the alternating approximation (dashed line). Plots below the diagonal are in the original scale and plots above the diagonal are in the zeta scale (Bates and Watts, 1988). Interpolated contours correspond approximately to joint confidence levels of 68%, 87%, and 95%. [Panels for log(L1), L2, log(L3), log(sig2), log(Cl), log(ka), and log(K); plot panels omitted.]
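The centering-and-scaling idea can be illustrated on a one-dimensional toy integral (hypothetical numbers, not the thesis computation): with the same three abscissas, the adaptive rule recovers the integral essentially exactly, while the rule centered at zero misses badly because the integrand's mode is far from the origin.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

# Toy integrand: a Gaussian "likelihood" in the random effect b, with
# mode far from zero, times a N(0, 1) prior on b.
def integrand(b):
    lik = np.exp(-0.5 * (b - 3.0) ** 2)
    prior = np.exp(-0.5 * b ** 2) / np.sqrt(2.0 * np.pi)
    return lik * prior

exact = np.exp(-2.25) / np.sqrt(2.0)  # closed form for this Gaussian product

x, w = hermgauss(3)                   # three abscissas in both rules

# Plain Gauss-Hermite: grid centered at the prior mean 0, scaled by the
# prior standard deviation (substitution b = sqrt(2) * x).
plain = np.sum(w * np.exp(-0.5 * (np.sqrt(2.0) * x - 3.0) ** 2)) / np.sqrt(np.pi)

# Adaptive rule: center at the mode b_hat of the log integrand and scale
# by the inverse square root of minus its second derivative there.  For
# this integrand the mode is 1.5 and the curvature is -2.
b_hat, scale = 1.5, 1.0 / np.sqrt(2.0)
nodes = b_hat + np.sqrt(2.0) * scale * x
adaptive = np.sqrt(2.0) * scale * np.sum(w * integrand(nodes) * np.exp(x ** 2))
```

Because the toy integrand is itself Gaussian, the adaptive rule is exact here; the plain rule is off by roughly ten percent.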
5.2.3 Simulation Results
In this section we include a comparison of the approximations to the loglikeli-
hood in model (4.1.1) using simulation. We restrict ourselves to the alternat-
ing, the Laplacian, and the (five-abscissa) adaptive Gaussian approximations as
these seem to be more accurate and/or more efficient than the Gaussian and the
importance sampling approximations. Two models were used in the simulation
analysis: a logistic model similar to the one used for the orange trees data and
a first order open compartment model similar to the one used for the Theo-
phylline example. For both models 1000 samples were generated and maximum
likelihood (ML) estimates based on the different approximations obtained. For
the alternating approximation, restricted maximum likelihood (RML) estimates
were also obtained.
Logistic Model
A logistic model similar to (4.2.1), but with two random effects instead of one,
was used to generate the data. The model is given by
    yij = (β1 + bi1) / (1 + exp{−[tij − (β2 + bi2)]/β3}) + εij,        (5.2.3)

where the bi are i.i.d. N(0, σ2D), i = 1, . . . , m, the εij are i.i.d. N(0, σ2), i = 1, . . . , m, j = 1, . . . , ni, and the εij are independent of the bi. We used m = 15, ni = 10, i = 1, . . . , 15, σ2 = 25, β = (200, 700, 350)T, and

    D = [  4  −2 ]
        [ −2  25 ]
Table 5.2.5 summarizes the simulation results for the variance-covariance
components (MSE denotes the mean square error of the estimators). The dif-
ferent approximations to the loglikelihood give similar simulation results for all
the parameters involved. The cluster specific variance (σ2) is estimated with
more relative precision than the elements of the scaled variance-covariance ma-
trix of the random effects (D). This is probably because the precision of the
estimate of σ2 (as well as the estimates of β) is related more to the total number
of observations, while the precision of the estimates of D is determined by the
number of clusters. We can also see a tendency for the restricted maximum
likelihood to give positively biased estimates of D11 and D22, while the other
approximations give negatively biased estimates. The rationale for restricted
maximum likelihood is to reduce bias in estimating variance components. It does not seem to do so in this case; it merely changes the direction of the bias.
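For contrast, in the simplest linear setting REML does remove the bias of the ML variance estimate. A quick illustrative check (Python, i.i.d. normal sample with unknown mean — not the nonlinear mixed model above):

```python
import numpy as np

rng = np.random.default_rng(1)
n, true_var, reps = 10, 25.0, 20000

ml_est, reml_est = [], []
for _ in range(reps):
    y = rng.normal(0.0, np.sqrt(true_var), size=n)
    ss = np.sum((y - y.mean()) ** 2)
    ml_est.append(ss / n)          # ML: biased downward by factor (n-1)/n
    reml_est.append(ss / (n - 1))  # REML: accounts for estimating the mean

# Averages over many replications: ML near true_var * (n-1)/n = 22.5,
# REML near true_var = 25.0.
```

In the nonlinear setting of Table 5.2.5 this exact correction is unavailable, which is consistent with RML reversing rather than removing the bias there.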
Table 5.2.5: Simulation results for D and σ2 in the logistic model

                              D11                        D12
    Approximation         Mean    Bias    MSE       Mean    Bias     MSE
    Alternating – RML    4.200   0.200   3.916    −1.946   0.054  18.421
    Alternating – ML     3.922  −0.078   3.437    −1.995   0.005  16.185
    Laplacian            3.935  −0.065   3.375    −1.978   0.022  15.724
    Adap. Gaussian       3.941  −0.059   3.408    −1.965   0.035  15.754

                              D22                        σ2
    Approximation         Mean    Bias      MSE      Mean    Bias    MSE
    Alternating – RML   26.089   1.089  360.985    24.885  −0.115  9.756
    Alternating – ML    23.322  −1.678  314.503    24.651  −0.349  9.647
    Laplacian           23.864  −1.136  310.054    24.625  −0.375  9.570
    Adap. Gaussian      23.934  −1.066  312.422    24.617  −0.383  9.567
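The summary statistics reported in the tables can be reproduced from a vector of simulation estimates with a small helper (illustrative Python; recall that MSE = variance + bias²):

```python
import numpy as np

def summarize(estimates, true_value):
    """Mean, bias, and MSE of simulation estimates of one parameter,
    as tabulated for each approximation."""
    est = np.asarray(estimates, dtype=float)
    mean = est.mean()
    bias = mean - true_value
    mse = np.mean((est - true_value) ** 2)
    return mean, bias, mse

# Example with made-up numbers (not the thesis estimates):
mean, bias, mse = summarize([3.8, 4.3, 3.9, 4.1, 3.6], true_value=4.0)
```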
Figure 5.2.5 presents the scatter plots of the variance-covariance component
(σ2 and D) estimates for the alternating RML, the alternating ML, and the
Laplacian approximations versus the adaptive Gaussian approximation. We
see that, except for the alternating RML approximation, all methods lead to
very similar estimates. In general the alternating RML approximation gives
larger values for the estimates of the variance components (especially D11 and
D22) than the other methods. The higher mean square error for D12 from the
alternating ML and RML methods is visible in the plot, as each of the panels
comparing these estimates to those from the adaptive Gaussian method has a
vertical clump of points at the true value.
Table 5.2.6 presents the simulation results for the fixed effects estimates.
The results are very similar for all approximations considered. We also note
that the relative variability of the fixed effects estimates is much smaller than
that of the estimates of the elements of D. There is very little, if any, bias in
the fixed effects estimates.
Table 5.2.6: Simulation results for β in the logistic model

                              β1                       β2
    Approximation         Mean   Bias    MSE       Mean   Bias     MSE
    Alternating – RML   199.61  −0.39  10.18     698.43  −1.57  138.21
    Alternating – ML    199.61  −0.39  10.18     698.43  −1.57  138.22
    Laplacian           199.93  −0.07  10.20     700.03   0.03  138.38
    Adap. Gaussian      199.92  −0.08  10.15     699.90  −0.09  138.44

                              β3
    Approximation         Mean   Bias    MSE
    Alternating – RML   348.81  −1.19  57.17
    Alternating – ML    348.82  −1.18  57.13
    Laplacian           350.20   0.20  56.94
    Adap. Gaussian      350.06   0.06  57.06
Figure 5.2.6 presents the scatter plots of the fixed effects estimates for the
alternating RML, alternating ML, and Laplacian approximations versus the
adaptive Gaussian approximation. Again we observe a strong agreement in
the estimates obtained through the various approximations. The alternating
approximations tend to give estimates slightly smaller than the Laplacian and adaptive Gaussian, but the differences are minor.

Figure 5.2.5: Scatter plots of variance-covariance components estimates for the alternating (RML and ML), Laplacian, and adaptive Gaussian approximations in the logistic model (5.2.3). The dashed lines indicate the true values of the parameters. [Panels: sig2, D11, D12, and D22 estimates for the alternating RML, alternating ML, and Laplacian approximations versus the adaptive Gaussian; plot panels omitted.]
First Order Compartment Model
The model used in the simulation is identical to (5.2.2). As in the Theophylline
example we set m = 12 and ni = 11, i = 1, . . . , 12. The parameter values used
were σ2 = 0.25, β = (−3.0, 0.5, −2.5)T, and

    D = [ 0.2  0 ]
        [  0   1 ]

Table 5.2.7 summarizes the simulation results for the variance-covariance
components estimates. As in the logistic model analysis, we observe that the el-
ements of D are estimated with less relative precision than σ2. The alternating
ML, Laplacian, and adaptive Gaussian approximations seem to lead to slightly
downward biased estimates of D11 and D22, while the alternating RML approx-
imation appears to give unbiased estimates (thus achieving its main purpose).
Note however that the unbiasedness of the RML estimates does not translate
into smaller mean square error — all four estimation methods lead to similar
MSE, for all parameters.
Figure 5.2.7 presents the scatter plots of the variance-covariance estimates
for the alternating RML, alternating ML, and Laplacian approximations versus
the adaptive Gaussian approximation. The alternating RML approximation
tends to give larger values for D11 and D22, and larger absolute values for D12,
while the remaining approximations lead to very similar estimates. There was
one sample for which the alternating approximations apparently converged to a
different solution than the Laplacian and adaptive Gaussian. Overall there were
no major differences between the approximations in estimating the variance-
covariance components.
[Figure 5.2.6: Scatter plots of the fixed effects estimates for the alternating RML, alternating ML, and Laplacian approximations versus the adaptive Gaussian approximation in the logistic model; beta1 panels shown at this point, plot panels omitted.]
••
•
•
•
•
•
•• •
••
••
•
••
•
•••
••
•
•
••
•••
••
•
•
•
•
••
•
••
•
•
••
••
•
•
••
•
••
••
•
••
•
•
•
•
•
•
•
•
•
•
•
•
•
••
•
•
•
•
••
•
•
•
••
•
•
•
•
•
••
•
•
•
•
•
•
•
••
•
•
•
•
•
••
•
••••
•
•
•
•••
•
•
•
•
•
•
•
•
••
•
••
•
•
•
• •
•••
••
•
•
••
••
•
••
•
• •
•
•
••
•
•••
••
•
•
•
•
•
•
•
•
•
•
•
••
••
•
•
••
••
•
•
•
•
•
••
•
•
•
•
•
•
•
•
••
•
••
•
•
•
•••
••
•
••
•
•
•
•
•
•
••
••
•
•
•••
•
•
•
•
•
•
•
•
•
•
••
•
•
•
•
•
•••
•
••
•
•
•
••
•••
•
•
•
•
•
•
••
••
•
•
••
•
• •
••••••
•
•
•
•• •
•
••
•
•
•
•
•
•
•
•
•
••
•
••
•
•
•
••
••
•
•
•
•
•
••
•
•
•
•
•
•
•
••
•
••
•
•
•
•
••
•
•
••
•
••
•
•
•
•
•
•
••
•
•
•
••
•
••
•
•
•
••
•
•
•
beta2
Adaptive Gaussian
Alte
rnat
ing
- R
ML
680 700 720 740
680
700
720 •
••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•••
•
•
••
•
•
•
•
•
•
••
•
•
••
•
•
•
•
••
•
••
•
•
•
•
•
•
•
•
•
••
•
•
••
•
•
•••
•
••
••
•
•
••
•
•
•
•
••
•
••
••••
•
••
•• ••• •
••
••
•
•
••
•
•
•
•
•
• •
•
•
•
•
••
•
•
•
•
•
•
•
•
•
•••
•
•
•
•
•
•
•
••
••
•
•
•
•
•
•
•
•
••
•
•
•
•
••
••
••
•
•
•
••
•
•
•
••
•
•
•
•
•
•
•
•
•
••
••
•
•
•
••
•
•
•
•
•
•
••
•
•
••
•
•
•
••
• •
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
••
•
••
•
•
•
•••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
••
•
••
••
•
•
•
•
•
•
•
•
•••
•
••
•
•
•
••
•
••
•
•
•
•
•
••
•
••
•
•
•
• •
•
••
•
•
••
••
•
•
•
•
••••
•
•
••
••
•
•
•
•
••
•
•
•
•
•
•
•
•
•
••
•
•
••
•
•
•
•
•
•••
••
•
•
•
••
••
•
•
•
•
•
•
•
••
•
•
•
•
•
••
•
•
•
•
•
•
•
••
•
•
••
•
••
•
•
•
• ••
•
•
•
•
•
•
••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
••
••
•
••
•
•
••
•
•
• •
••
•
•
•
•
•
•
••
•
•
•
•
•
••
•
•
••
•
••
•
•••
•
•
•
•
•
•
•
••
••
•
•
•
•
••
•
•
•
•
••
••
•
• ••
••
•
•
•
•
••
•
•
•
•
•
•
••
•
•
•
••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
••
•
•
•
•
•
••
•
••
••
•
•
•
•
•
•
•
•
•
•
•
•
•
••
•
••
•
•
••
••
•
•
•
•
••
••
•
••
•
•
•
•
•
•
•
•
•
••
•
••
•
•
•
•
•
• ••
•
•
•
•••
•
•
•
•
•
•
•
••
•
•
•
•
•
•
•
•
•
••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
••
•
•
••
••
•
•
••
•
•
•
• •
•
••
•
•
•
•
•
••
•
•
•• •
••
•
•
•
•
•
•
•
•
•
•
•
•
•
••
•
•
•••
•
••
•
•
•
•
•
••
•
•
•
•
••
••
•
••
•
•
•
•
•
•
•
•
•
••
•••
•
•
•
••
•
•
•
•
••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
••
•
•
•
••
••
•
•
•
•
•
•
•
•
•
•
•
•
•
• •
•
••
•
•
••
•
•
•
•
••
•
••
••
•
•
•
•
•
••
••
•
•••
•
•
•
•
•
•
•
•
•
•
•
•
••
•
•
• •
••
••
•
•
••
•
•
•
•
•
•
•
••
•
••
•
•
•
••
•
•
••
•
•
••
•
•
••
•
•
•
•
•
•
•
•
•
•
•
••
•
•
•
•
•••
•
•
•
•
•
•
•
••
•
•
•
•
•
•
•
••
••
•
•
•
•
•
•
••
•••
•
••
•
•
•
•
•
•••
•
•
•
•
•
•
•
•
•
••
•
•
•
•
•
•
•
• •
beta2
Adaptive Gaussian
Alte
rnat
ing
- M
L
680 700 720 740
680
700
720
•
••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•••
•
•
••
•
•
•
•
•
•
••
•
•
••
•
•
•
•
••
•
••
•
•
•
•
•
•
•
•
•
••
•
•
••
•
•
••
•
•
••
••
•
•
••
•
•
•
•
••
•
••
••••
•
••
•• ••• •
••
••
•
•
••
•
•
•
•
•
• •
•
•
•
•
••
•
•
•
•
•
•
•
•
•
•••
•
•
•
•
•
•
•
••
••
•
•
•
•
•
•
•
•
••
•
•
•
•
••
••
••
•
•
•
••
•
•
•
••
•
•
•
•
•
•
•
•
•
••
••
•
•
•
••
•
•
•
•
•
•
••
•
•
••
•
•
•
••
• •
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
••
•
••
•
•
•
•••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
••
•
••
••
•
•
•
•
•
•
•
•
•••
•
••
•
•
•
••
•
••
•
•
•
•
•
••
•
••
•
•
•
• •
•
••
•
•
••
••
•
•
•
•
••••
•
•
••
••
•
•
•
•
••
•
•
•
•
•
•
•
•
•
••
•
•
••
•
•
•
•
•
•••
••
•
•
•
••
••
•
•
•
•
•
•
•
••
•
•
•
•
•
••
•
•
•
•
•
•
•
••
•
•
••
•
••
•
•
•
• ••
•
•
•
•
•
•
••
•
•
•
•
•
•
•
•
•
•
•
•
••
•
•
•
•
•
••
••
•
••
•
•
••
•
•
• •
••
•
•
•
•
•
•
••
•
•
•
•
•
••
•
•
••
•
••
•
•••
•
•
•
•
•
•
•
••
••
•
•
•
•
••
•
•
•
•
••
••
•
• ••
••
•
•
•
•
••
•
•
•
•
•
•
••
•
•
•
••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
••
•
•
•
•
•
••
•
••
••
•
•
•
•
•
•
•
•
•
•
•
•
•
••
•
••
•
•
••
••
•
•
•
•
••
••
•
••
•
•
•
•
•
•
•
•
•
••
•
••
•
•
•
•
•
• ••
•
•
•
•••
•
•
•
•
•
•
•
••
•
•
•
•
•
•
•
•
•
••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
••
•
•
••
••
•
•
••
•
•
•
• •
•
••
•
•
•
•
•
••
•
•
•• •
••
•
•
•
•
•
•
•
•
•
•
•
•
•
••
•
•
•••
•
••
•
•
•
•
•
••
•
•
•
•
••
••
••
•
•
•
•
•
•
•
•
•
•
••
•••
•
•
•
••
•
•
•
•
••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
••
•
•
•
••
••
•
•
•
•
•
•
•
•
•
•
•
•
•
• •
•
••
•
•
••
•
•
•
•
••
•
••
••
•
•
•
•
•
••
••
•
•••
•
•
•
•
•
•
•
•
•
•
•
•
••
•
•
• •
••
••
•
•
••
•
•
•
•
•
•
•
••
•
••
•
•
•
••
•
•
••
•
•
••
•
•
••
•
•
•
•
•
•
•
•
•
•
•
••
•
•
•
•
•••
•
•
•
•
•
•
•
••
•
•
•
•
•
•
•
••
••
•
•
•
•
•
•
••
•••
•
••
•
•
•
•
•
•••
•
•
•
•
•
•
•
•
•
••
•
•
•
•
•
•
•
• •
beta2
Adaptive Gaussian
Lapl
acia
n
680 700 720 740
680
700
720
740
•
••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•••
•
•
••
•
••
•
•
•
••
•
•
••
•
•
•
•
••
•
••
•
•
•
•
•
•
•
•
•
••
•
•
••
•
•
••
•
•••
•••
•
••
•
•
•
•
••
•
••
•••
••
• •
•• ••• •
••
••
•
•
••
•
•
•
•
•
••
•
•
•
•
••
•
•
•
•
•
•
•
•
•
•••
•
•
•
•
•
•
•
• •••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
••
••
••
•
•
•
••
•
•
•
••
•
•
•
•
•
•
•
•
•
••
••
•
•
•
••
•
•
•
•
•
•
••
•
•
••
•
•
•
••
••
•
•
•
•
•
•
•
•
•
••
•
•
•
•
••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
••
•
••
•
•
•
••
••
•
•
•
•
•
•
••
•
•
•
•
•
•
•
•
••
•
••
••
•
•
••
•
•
•
•
•••
•
••
•
•
•
••
•
••
•
•
•
•
•
••
•
••
•
•
•
• •
•
••
•
•
••
••
•
•
•
•
••••
•
•
••
••
•
•
•
•
••
•
•
•
•
•
•
•
•
•
••
•
•
••
•
•
•
•
•
•••
••
•
•
•
••
••
••
•
•
•
•
•
•
•
•
•
•
•
•
••
•
•
•
•
•
•
•
••
•
•
••
•
••
•
••
• ••
•
•
•
•
•
•
••
•
•
•
•
•
•
•
•
•
••
•
••
•
•
•
•
•
••
•
•
•
••
•
•
••
•
•
••
••
•
•
•
•
•
•
••
•
•
•
•
•
••
•
•
••
•
••
•
•••
•
•
•
•
•
•
•
••
••
•
•
•
•
••
•
•
•
•
••
••
•
• ••
•
•
•
•
•
•
••
•
•
•
•
•
•
••
•
•
•
••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
••
•
•
•
•
•
• ••
•
•
••
•
•
•
•
•
•
•
•
•
•
•
•
•
••
•
••
•
•
••
••
•
•
•
•
••
••
•
••
•
•
•
•
•
•
•
•
•
••
•
•
••
•
•
•
•
••
•
•
•
•
•••
•
•
•
•
•
•
•
••
•
•
•
•
•
•
•
•
•
••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
••
•
•
•
•
••
•
•
••
•
•
•
••
•
••
•
•
•
•
•
••
•
•
••
•
••
•
•
•
•
•
•
•
•
•
•
•
•
•
••
•
•
•••
•
••
•
•
•
•
•
••
•
•
•
•
••
••
•
••
•
•
•
•
•
•
•
•
•
••
•• •
•
•
•
••
•
•
•
•
••
••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
••
•
•
•
••
••
•
•
•
•
•
•
•
•
•
•
•
•
•
••
•
••
•
•
••
•
•
•
•
••
•
••
••
•
•
•
•
•
••
•
•
•
•••
•
•
•
•
•
•
•
•
•
•
•
•
••
•
•
••
••
••
•
•
••
••
•
•
•
•
•
••
•
••
•
•
•••
•
•
••
•
•
••
•
•
••
•
•
•
•
•
••
•
•
•
•
••
•
•
•
•
•
••
•
•
•
•
•
•
•
••
•
•
•
•
•
•
•
••
•
•
•
•
•
•
•
•
•
••
•••
••
•
•
•
•
•
•••
•
•
•
•
•
•
•
•
•
••
•
•
•
•
•
•
•
• •
beta3
Adaptive Gaussian
Alte
rnat
ing
- R
ML
330 340 350 360 370
330
340
350
360
370
•
•
•• •
•• •
•
••
•
•
•
•
•
•
•
••
•
•
•
•
•
•••
•
•
•
••
•
•
•
•
•
•
•••
••
••
••
•
•
•
•• •
•••
••
•
•
•
••
•
•••
•
••
••
•
•
•
••
••
•
•
•
•••
•
•
•
••
•
•
•
• •••
•
••
•
•
••
••
•
•
•
•
••
•
••
•
•
••
•
•
•
•
•
•
•
••
•
•
••
•
•
•••
•
•
•
• •
•
•
•
••
•
•
••
•
•
•
• •
•
•
•
•
•
•
•
•
•
••
•
••
•
•
•
••
•
•
•
•
•
•
••
••
•
•
•
•
•
••
•
•
•
•
•
•
•
•
••
••
• •
•
•
•
••
•••
•
•
••
•
••
•
•
••
•
•
•
••
•
•
•
•
•
••
•••
•
••
•
•
•
•
••
••
•
•
•
•
•
•
•
•
•
••
•
•
•
•
•
•
••
•
••
•
•
•
••
•
••
•
•
•
••
••
•
•
••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
••
•
••
••
••
•
•••
•
•• •
•
•
••
••
•
•
••
••
•
••
••
•
•
•
•
•
•
•
•
•
•••
•
•
•
•
•
••
•
•
••
•
•
•
•
•
•
••
•
•
•
•
•
•
•
• ••
•
••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
••
•
••
•
•
•
•
•
•
••
•
•
•
•
•
•
•
••
•
•
•
•
•
•
•
•
•
•
•
•••
••
•
•
•••
•
•
••
••
•
•
••
••
•
••
•
•
•
•
•
•
•
•
•
•
•
••
•
•
•
•
•
•
•
•
•
••
••
• •
•
•
••
•
•
•
•
•
•
•
•
••
••
•
•
•
•
•
•
• •
•
•
•
•
•
•
•
•
•
•
•
•
•
••
•
•
•
•
•
••
•
•
•
•
•
••
•
•
•
•
•
•
•
•
•
•
•
•••
•
•
• •
•
•
•
•
•
•
•
•
•
•
••
•••
•
•
•
•••
•
•
•
•
•
••
•
•
•
•••
•
•
•
• •
•
••
•
• ••
•
•
•
•
••
•
••
••
•
•
•
•
•
•
•
•
•
••
•
•
•
•
•
•
•••
•
•
•
•
•
••
•
•
•
••
•
••
•
•
•
•
•
•
•
•
••
•
•
•
•
•
•
•
•
•
•
•
••
•
••
•
•
•••
•
•
•
•
•
••
•
•
••
•••
••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
••
•••
••
•
•
•
•
•
•
•
••
•
••
•
•
•
•
•
•
•
•
•
•
••••
•
•
•
•
•
•
•
••
•
•
• •
•
•
•
•
•
•
•
•
• ••
•
•
•
•
•
•
•
•
•
•
••
•
•
•
•
•
•
•
•
•
•
•
•
•••
•
••
•
•
•
•
•
••
•
•
••
•••
•
•
•
•
•
•
•
•
•
••
•
•
•
•
•
•
•
•
•
•
•
•••
•
•
•
•
•
•
•
•
••
•
••
••
•• •
••
•••
•
• •
•
•
• •
•
•
•
•
•
••
•••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
••••
•
••
••
•
•
•
•
•
•
••
•
•
•
•
••
•
•
•
•
•
•
•
•
• •
•
•
• ••
• •
•
•
•
•
••
•
•
••
••
•
••
•
•
•
•
•
•
•
••
•
••
beta3
Adaptive Gaussian
Alte
rnat
ing
- M
L
330 340 350 360 370
330
340
350
360
370
•
•
•••
•• •
•
••
•
•
•
•
•
•
•
••
•
•
•
•
•
•••
•
•
•
••
•
•
•
•
•
•
•••
••
••
••
•
•
•
•• •
•••
••
•
•
•
••
•
•••
•
••
••
•
•
•
••
••
•
•
•
•••
•
•
•
••
•
•
•
• •••
•
••
•
•
••
••
•
•
•
•
••
•
••
•
•
••
•
•
•
•
•
•
•
••
•
•
••
•
•
•••
•
•
•
• •
•
•
•
••
•
•
••
•
•
•
• •
•
•
•
•
•
•
•
•
•
••
•
••
•
•
•
••
•
•
•
•
•
•
••
••
•
•
•
•
•
••
•
•
•
•
•
•
•
•
••
••
• •
•
•
•
••
•••
•
•
••
•
••
•
•
••
•
•
•
••
•
•
•
•
•
••
•••
•
••
•
•
•
•
•
••
•
•
•
•
•
•
•
•
•
•
••
•
•
•
•
•
•
••
•
••
•
•
•
••
•
••
•
•
•
••
••
•
•
••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
••
•
••
••
••
•
•••
•
•• •
•
•
••
••
•
•
••
••
•
••
••
•
•
•
•
•
•
•
•
•
•••
•
•
•
•
•
••
•
•
••
•
•
•
•
•
•
••
•
•
•
•
•
•
•
• ••
•
••
••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
••
•
••
•
•
•
•
•
•
••
•
•
•
•
•
•
•
••
•
•
•
•
•
•
•
•
•
•
•
•••
•••
•
•••
•
•
••
••
•
•
••
••
•
••
•
•
•
•
•
•
•
•
•
•
•
••
•
•
•
•
•
•
•
•
•
••
••
• •
•
•
••
•
•
•
•
•
•
•
•
••
••
•
••
•
•
•
• •
•
•
•
•
•
•
•
•
•
•
•
•
•
••
•
•
•
•
•
••
•
•
•
•
•
••
•
•
•
•
•
•
•
•
•
•
•
•••
•
•
• •
•
•
•
•
•
•
•
•
•
•
••
•••
•
•
••••
•
•
•
•
•
••
•
•
•
•••
•
•
•
• •
•
••
•
• ••
•
•
•
•
••
•
••
••
•
•
•
•
•
•
•
•
•
••
•
•
•
•
•
•
•••
•
•
•
•
•
••
•
•
•
••
•
••
•
•
•
•
•
•
•
•
••
•
•
•
•
•
•
•
•
•
•
•
••
•
••
•
•
•••
•
•
•
•
•
••
•
•
••
•••
••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
••
•••
•••
•
•
•
•
•
•
••
•
••
•
•
•
•
•
•
•
•
•
•
••••
•
•
•
•
•
•
•
••
•
•
• •
•
•
•
•
•
•
•
•
• ••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•••
•
••
•
•
•
•
•
••
•
•
••
••
•
•
•
•
•
•
•
•
•
•
••
•
•
•
•
•
•
•
•
•
•
•
•••
•
•
•
•
•
•
•
•
••
•
••
••
•• •
••
•••
•
• •
•
•
• •
•
•
••
•
••
•••
•
•
•
•
••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
••••
•
••
••
•
•
•
•
•
•
••
•
•
•
•
••
•
•
•
•
•
•
•
•
• •
•
•
• ••
• •
•
•
•
•
••
•
•
••
••
•
••
•
•
•
•
•
•
•
••
•
••
beta3
Adaptive Gaussian
Lapl
acia
n
330 340 350 360 370
330
340
350
360
370
•
•
•••
• ••
•
••
•
•
•
•
•
•
•
••
•
•
•
•
•
••
•
•
•
•
••
•
•
•
•
•
•
•••
••
•
•
••
•
•
•
••
•
•••
•
•
•
•
•
••
•
•••
•
••
••
•
•
•
••
••
•
•
•
•••
•
•
•
••
•
•
•
• •••
•
••
•
•
••
••
•
•
•
•
••
•
••
•
•
••
•
•
•
•
•
•
•
••
•
•
••
•
•
•
••
•
•
•
••
•
•
•
••
••
••
•
•
•
••
•
•
•
•
••
•
•
•
••
•
••
•
•
•
••
•
•
•
•
•
•
••
••
•
•
•
•
•
••
•
••
•
•
•
•
•
••
••
••
•
•
•
••
•••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
••
•
•
•
•
•
••
•••
•
••
•
•
•
•
•
•
••
•
•
•
•
•
•
•
•
•
••
•
•
•
•
•
••
•
•
•
•
•
•
•
••
•
••
•
•
•
••
• •
•
•
• •
•
•
•
•
•
•
•
•
•
••
•
•
•
•
•
•••
••
•
••
•
•
•••
•
••
•
•
•
••
••
•
•
••
••
•
••
••
•
•
•
•
•
•
•
•
•
•••
••
•
•
•
••
•
•
•
•
•
•
•
•
•
•
••
•
•
•
•
•
•
•
• ••
••
••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
••
•
•
•
•
•
•
••
•
•
•
•
•
•
•
••
•
•
•
•
•
•
•
•
•
•
•
••
•
••
•
•
•••
•
•
•
•
••
•
•
•
•
•
•
•
•
•
•
•
••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
••
••
• •
•
•
••
•
•
•
•
•
•
•
•
••
••
•
•
•
•
•
•
••
•
•
•
•
•
•
•
•
•
••
•
•
• •
•
•
•
•
•
••
•
•
•
•
•
••
•
•
•
•
•
•
•
•
•
•
••
•••
•
••
•
•
•
•
•
•
•
•
•
•
••
•••
•
•
••••
•
•
•
•
•
••
•
•
•
•••
•
•
•
• •
•
••
•
••
•
•
•
•
•
••
•
••
••
•
••
•
•
•
•
•
•
••
•
•
•
•
•
•
•••
•
•
•
•
•
•
••
•
•
••
•
••
•
•
•
•
•
•
••
• ••
•
•
•
•
•
•
•
•
•
•
••
••
•
•
•
•••
•
•
•
•
•
••
••
•
• •••••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
••
•••
•••
•
•
•
•
•
•
• •
•
••
•
•
•
•
•
•
•
•
•
•
••••
•
•
••
•
•
•
••
•
•
••
•
•
•
•
•
•
•
•
••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•••
•
••
•
•
•
•
•
••
•
•
••
•• •
•
•
•
•
•
•
•
•
•
••
•
•
•
•
•
•
•
•
•
•
•••
•
•
•
•
•
•
•
•
•
••
•
••
••
••
•
••
•••
•
••
•
•
• •
•
•
•
•
•
••
•••
••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
••••
•
••
•• •
••
•
•
•
•
•
••
•
•
••
•
•
•
•
•
•
•
•
••
••
••
••
•
•
•
•
•
••
•
•
••
••
•
••
•
•
•
•
•
•
•
••
•
••
Figure 5.2.6: Scatter plots of fixed effects estimates for the alternating (RMLand ML), Laplacian, and adaptive Gaussian approximations in the logisticmodel (5.2.3). The dashed lines indicate the true values of the parameters.
110
[Scatter plots omitted: panels compare estimates of sig2, D11, D12, and D22 from the alternating (RML and ML) and Laplacian approximations against the adaptive Gaussian approximation.]
•
•
••
••
•
••
•
•
•
•
••
•
••
•
••••
••
•
•
••
•
•
•••
••
•
•
••
•
•••
•
•
•
•
•
•
•
•
•
•
•••
•
•
••••
•
••
•
•
•
••
•
•
••
••
••
•••
•
•
•
•
••
•
•
•
••
•••
•
•
••
••
•
•
•
•••
•
•
•••
• •
•
••
•
•
•
•
••
••
•
•
••
••
•
•
••
•
••
•
•
••
•
••
••
•
•
•
••
•
••
•
•
••••
•
•
•
••
•
••
•
•
•
•
•
••
••
••
••
•
••
•••••
••
•••
•
•
•
•
•
•••••
•
•
••
•
••••
•
••
••
•
•
•
•
•
••
••
•
•
•
•
•••••
•
••
•
•
•
••
•
••
••••
•
••
•
•
• ••
•
•
•
• •••
•
•
••
••
•• •
••
•••••
•••
•
•
••
•
•
•
•• •••
•
•••
•
•
•••
•
••
••
•
•••
•
•••
•
•
•
••
•
••
•
•••
• •
•
•••
•
•
••
••
•
•
•
••
•
•
•
•
•••
•
••
•
•
•
•
••
•
••
•
•••
•
•
•
•
•
••
••
•
•
•
•
•
•
••
•
•
•
••
••
•
•
••
•
•
•
•
•
••
••
•
•
•
•
•
••
•
••
•
••
••
•••
•
•
•
•••
•
•
•
•••
•
•••
••
•
••
•
••
••
•
••
•
••
•
••
•
•
••
•
•
•
••
•
• ••
•
••
• ••
•
•
•
•
•
•
••
•
•
•
••
•••
••
•
•
•
•
••
••
•
•
•
•
•
••
••
••
•
••
••
•
••
•
•
•
•
•
••••
•
•
••
•
•
••
•
•
•
••
••
••
•
••
•
•
•••
•
•
•
••
•
••• ••
••
•
••
• •
••
••
•
••
••
•
•
•
•
•
•
•
•
•
•
••
••
••••
•
••
•••
••
•
••
••
•
•
•
•
•
•
••
•
•
•
•
•
•••••••
••
•
•
•
•
••
•
••
•
•
•
•
••
••••
•
••
•
•
•
•
•
••
•
•••
••
•
••
•
•••
•
••••
••
••
•
••
•
•
•
•
•
••
••
•••
•
•
••
••
•
•
•
•
••
•
•
••
•
•••••
••
•
•
•
•
••
•
•
•
•
••
•
••
•
•
••
•••
•
•
••
••
•
•
•
•
•
•
••
••
•
••
•
••
••
•
••
••••
••
• ••
•
••
•
•••
•
••
•
•
•
•
•
••
•
•••
•
•
•
•
•
•
••
•••
••
•
•••
••
•
•
•
••
••
•
•
••
••••
•
•
•
•
•
•
•
••
•
•
Figure 5.2.7: Scatter plots of variance-covariance components estimates for thealternating (RML and ML), Laplacian, and adaptive Gaussian approximationsin the first order compartment model (5.2.2). The dashed lines indicate the truevalues of the parameters.
Table 5.2.7: Simulation results for D and σ2 in the first order compartment model
                        D11                         D12
Approximation       Mean     Bias     MSE      Mean     Bias     MSE
Alternating – RML   0.1996  -0.0004   0.0089  -0.0013  -0.0013   0.0210
Alternating – ML    0.1840  -0.0160   0.0078  -0.0023  -0.0023   0.0179
Laplacian           0.1862  -0.0138   0.0078  -0.0011  -0.0011   0.0178
Adap. Gaussian      0.1860  -0.0140   0.0077   0.0002   0.0002   0.0180

                        D22                         σ2
Approximation       Mean     Bias     MSE      Mean     Bias     MSE
Alternating – RML   1.0095   0.0095   0.2565   0.2508   0.0008   0.0012
Alternating – ML    0.9249  -0.0751   0.2240   0.2486  -0.0014   0.0011
Laplacian           0.9388  -0.0612   0.2276   0.2480  -0.0020   0.0011
Adap. Gaussian      0.9476  -0.0524   0.2332   0.2481  -0.0019   0.0011
Table 5.2.8 gives the simulation results for the fixed effects estimates. All
four approximations give virtually identical results for the estimation of the
fixed effects. They all show very little bias and smaller relative variability when
compared to the estimates of the variance-covariance components.
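The Mean, Bias, and MSE columns in Tables 5.2.7 and 5.2.8 are the usual Monte Carlo summaries of the simulated estimates. A minimal sketch in Python (the thesis software is written in S; the estimates below are made-up numbers for illustration, not simulation output):

```python
import numpy as np

def summarize(estimates, true_value):
    """Mean, bias, and MSE of a vector of simulated estimates."""
    estimates = np.asarray(estimates, dtype=float)
    mean = estimates.mean()
    bias = mean - true_value          # bias = mean of estimates - true value
    mse = np.mean((estimates - true_value) ** 2)
    return mean, bias, mse

# hypothetical estimates of D22 (true value 1.0) from five simulated samples
d22_hat = [0.91, 1.12, 0.97, 1.05, 0.88]
mean, bias, mse = summarize(d22_hat, 1.0)
print(mean, bias, mse)
```

Note the decomposition MSE = (sampling variance of the estimates) + bias², which is why the nearly unbiased fixed effects estimates also have small MSE.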
The scatter plots of the fixed effects estimates, not included here, show
practically identical results for the alternating RML and ML, the Laplacian,
and the adaptive Gaussian approximations.
5.3 Conclusions
The results of section 5.1 indicate that the alternating approximation (5.1.2)
to the loglikelihood function in the nonlinear mixed effects model (4.1.1),
proposed by Lindstrom and Bates (1990), gives accurate and reliable estimation
results. The main advantages of this approximation are its computational effi-
ciency (allowing the use of linear mixed effects techniques to estimate the scaled
Table 5.2.8: Simulation results for β in the first order compartment model

                        β1                          β2
Approximation       Mean     Bias     MSE      Mean     Bias     MSE
Alternating – RML  -2.9989   0.0011   0.0053   0.4876  -0.0124   0.0244
Alternating – ML   -2.9992   0.0008   0.0053   0.4869  -0.0131   0.0244
Laplacian          -3.0009  -0.0009   0.0053   0.4983  -0.0017   0.0242
Adap. Gaussian     -2.9987   0.0013   0.0053   0.4984  -0.0016   0.0246

                        β3
Approximation       Mean     Bias     MSE
Alternating – RML  -2.4965   0.0035   0.0020
Alternating – ML   -2.4965   0.0035   0.0020
Laplacian          -2.5045  -0.0045   0.0020
Adap. Gaussian     -2.5008  -0.0008   0.0020
variance-covariance matrix of the random effects D) and the availability of a
restricted likelihood version of it, which is not yet defined for other approxima-
tions/estimation methods. With regard to restricted maximum likelihood
estimation, though, the results of section 5.2 suggest that the bias-correction
ability of this method depends on the nonlinear model that is being consid-
ered: RML estimation achieved its purpose for the first order compartment
model (5.2.2), but it increased the bias in the logistic model (5.2.3). More re-
search is needed in this area. Since it is simpler computationally, the alternating
approximation should be used to provide starting values for the more accurate
approximations (e.g. Laplacian and adaptive Gaussian) if they are preferred.
The Gaussian quadrature approximation (5.1.9) only seems to give accurate
results for a large number of abscissas (> 100), which makes it very inefficient
computationally. The cause of this behavior is that the grid of abscissas is
centered at 0 (the expected value of the random effects) and scaled according
to D, while the highest values of the integrand in (4.1.3) are concentrated
around the posterior modes of the random effects (b) and scaled according to
g′′(β, D, y, b). The advantages of this approximation are that it does not
require the estimation of the posterior modes of the random effects at each
iteration and it admits closed form partial derivatives with respect to the pa-
rameters of interest (β, D, and σ2), provided these are available for the model
function f (Davidian and Gallant, 1992). We feel that these advantages do not
compensate for the inaccuracy or computational inefficiency of the Gaussian
approximation.
The importance sampling approximation (5.1.7) gives reliable estimation
results, comparable to those of the adaptive Gaussian and Laplacian approxi-
mations, but is considerably less efficient computationally than these approx-
imations. Also, the stochastic variability associated with the different impor-
tance samples may overwhelm the numerical variability of the loglikelihood for
small changes in the parameter values, making it difficult to calculate numeri-
cal derivatives. The main advantage of the importance sampling approximation
is its versatility in handling distributions other than the normal, for both the
random effects and the error term (ε). For example it would be rather straight-
forward to adapt the importance sampling integration to handle a multivariate
t distribution for the random effects, but that would not be a trivial task for
either the alternating, the Laplacian, or the adaptive Gaussian approximations.
Wakefield et al. (1994) use a similar property of Gibbs sampling methods to
check for outliers in nonlinear mixed effects models. If one is willing to stick with
the normal distribution for b and ε in the nonlinear mixed effects model (4.1.1)
then the importance sampling approximation is not the most efficient choice.
Of all approximations considered here, the Laplacian and adaptive Gaussian
approximations probably give the best mix of efficiency and accuracy. The
former can be regarded as a particular case of the latter, where just one abscissa
is used. Both approximations (and the importance sampling approximation
as well) give the exact loglikelihood when the model function f in (4.1.1)
is a linear function of the random effects. In the examples that we analyzed
not much was gained by going from a one-point adaptive Gaussian quadrature
(Laplacian) approximation to approximations with a larger number of abscissas.
It appears that the major gain in adaptive Gaussian approximations is related to
the centering and scaling of the abscissas. Increasing the number of points in the
evaluation grid only gives marginal improvement. The Laplacian approximation
has the additional advantage over the adaptive Gaussian approximation with
more than one abscissa of allowing profiling of the loglikelihood over σ2, thus
reducing the dimensionality of the optimization problem.
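The effect of centering and scaling the abscissas can be illustrated on a toy one-dimensional integral of the same form as (4.1.3). The setup below is a hypothetical stand-in, not one of the thesis models: a single observation y = exp(b) + error, with error standard deviation sigma and random effect b ~ N(0, 1). It contrasts Gauss-Hermite quadrature centered at 0 with the adaptive version centered at the posterior mode, whose one-abscissa case is exactly the Laplacian approximation:

```python
import numpy as np

y, sigma = 5.0, 0.3  # hypothetical data value and error sd

def g(b):
    """Negative log of the (unnormalized) integrand exp(-g(b))."""
    return (y - np.exp(b)) ** 2 / (2 * sigma**2) + b**2 / 2

def gpp(b):
    """Second derivative of g, used to scale the adaptive grid."""
    return (2 * np.exp(2 * b) - y * np.exp(b)) / sigma**2 + 1

def quad(center, scale, k):
    """k-point Gauss-Hermite approximation to the integral of exp(-g(b)),
    with the abscissas centered at `center` and scaled by `scale`."""
    x, w = np.polynomial.hermite.hermgauss(k)
    b = center + np.sqrt(2) * scale * x
    return np.sqrt(2) * scale * np.sum(w * np.exp(x**2 - g(b)))

# "truth" by brute-force trapezoidal integration on a very fine grid
grid = np.linspace(-3, 4, 200001)
f = np.exp(-g(grid))
truth = np.sum((f[1:] + f[:-1]) / 2 * np.diff(grid))

# naive rule: grid centered at 0 and scaled by the prior sd of b (here 1)
naive10 = quad(0.0, 1.0, 10)

# adaptive rule: grid centered at the posterior mode, scaled by the curvature
# there; a single abscissa reproduces the Laplacian approximation
mode = grid[np.argmin(g(grid))]
scale = 1 / np.sqrt(gpp(mode))
laplace = quad(mode, scale, 1)
adapt5 = quad(mode, scale, 5)

rel = lambda v: abs(v - truth) / truth
print(rel(naive10), rel(laplace), rel(adapt5))
```

In this setup the integrand peaks near b ≈ 1.6 with a small spread, so the 0-centered ten-point rule misses most of the mass, while the mode-centered rule is already far more accurate with a single abscissa.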
For statistical analysis purposes we would recommend using a hybrid scheme
in which the alternating algorithm would be used to get good initial values for
the more refined Laplacian approximation to the loglikelihood of model (4.1.1).
Chapter 6
Parametrizations for
Variance-Covariance Matrices
The estimation of variance-covariance matrices in mixed effects models using ei-
ther maximum likelihood, or restricted maximum likelihood, is usually a difficult
numerical problem, since one must ensure that the resulting estimate is posi-
tive semi-definite. Two approaches can be used for that purpose: constrained
optimization, where the natural parametrization for the unique elements in the
variance-covariance matrix is used and the estimates are constrained to be pos-
itive semi-definite matrices, and unconstrained optimization, where the unique
elements in the variance-covariance matrix are reparametrized in a way such
that the resulting estimate must be positive semi-definite. We recommend the
use of the second approach not only for numerical reasons (parameter estima-
tion tends to be much easier when there are no constraints), but also because
of the superior inferential properties that unconstrained estimates tend to have
(e.g. asymptotic properties).
Since a variance-covariance matrix is always positive semi-definite, and fails to
be positive definite (p.d.) only in the rather degenerate situation where some
nonrandom linear combination of the underlying random variables exists, we will
restrict ourselves here to positive definite variance-covariance matrices.
In addition to enforcing the positive definiteness constraints, the choice of
the parametrization can be influenced by computational efficiency and by the
statistical interpretability of the individual components. In general we can use
numerically or analytically determined second derivatives of the (restricted)
likelihood to approximate standard errors and derive confidence intervals for
the individual parameters. In order to assess the variability of the variance
and covariance estimates, it is desirable that they can be expressed as simple
functions of the unconstrained parameters. More detailed techniques, such as
profiling the likelihood (Bates and Watts, 1988), also work best for functions of
the variance-covariance matrix that are expressed in the original parametriza-
tion.
We describe in section 6.1 five different parametrizations for transforming
the estimation of unstructured (general) variance-covariance matrices into an
unconstrained problem. In section 6.2 we compare the parametrizations with
respect to their computational efficiency and statistical interpretability. Our
conclusions are presented in section 6.3.
6.1 Parametrizations
Let D denote an unstructured positive definite q×q variance-covariance matrix
corresponding to a random vector b = (b1, . . . , bq). Since D is symmetric,
only q(q + 1)/2 parameters are needed to represent it. We will denote by θ
any such minimal set of parameters to determine D. The rationale behind all
parametrizations considered in this section is to write
D = LT L (6.1.1)
where L = L (θ) is a q × q matrix of full rank obtained from a q(q + 1)/2-
dimensional vector of unconstrained parameters θ. It is clear that any D defined
as in (6.1.1) is positive definite.
Different choices of L lead to different parametrizations of D. We will
consider here two classes of L: one based on the Cholesky factorization (Thisted,
1988) of D and another based on the spectral decomposition of D (Rao, 1973).
The first three parametrizations presented below use the Cholesky factorization
of D, while the last two are based on its spectral decomposition.
In some of the parametrizations there are particular components of the pa-
rameter vector θ that have meaningful statistical interpretations. These can
include the eigenvalues of D (important for assessing whether the matrix is
ill-conditioned), the individual variances or standard deviations, and particular
correlations.
The following variance-covariance matrix will be used throughout this sec-
tion to illustrate the use of the various parametrizations.
A = [ 1  1  1 ]
    [ 1  5  5 ]          (6.1.2)
    [ 1  5 14 ]
6.1.1 Cholesky Parametrization
Since D is p.d. it may be factored as D = LT L, where L is an upper triangular
matrix. Setting θ to be the upper triangular elements of L gives the Cholesky
parametrization of D. Lindstrom and Bates (1988) use this parametrization to
obtain derivatives of the loglikelihood of a linear mixed effects model for use in a
Newton-Raphson algorithm. They reported that the use of this parametrization
dramatically improved the convergence properties of the optimization algorithm,
when compared to a constrained estimation approach.
One problem with the Cholesky parametrization is that the Cholesky factor
is not unique. In fact, if L is a Cholesky factor of D then so is any matrix
obtained by multiplying a subset of the rows of L by −1. This has implications
for parameter identification, since up to 2^q different θ may represent the same
D. Numerical problems can arise when different optimal solutions are not far
apart.
Another problem with the Cholesky parametrization is the lack of a straight-
forward relationship between θ and the elements of D. This makes it hard to
interpret the estimates of θ and to obtain confidence intervals for the variances
and covariances in D based on confidence intervals for the elements of θ. One
exception is |L11| = √D11, so confidence intervals on D11 can be obtained from
confidence intervals on L11. By appropriately permuting the columns and rows
of D we can in fact derive confidence intervals for all the variance terms based
on confidence intervals for the elements of L.
The main advantage of this parametrization, apart from the fact that it
ensures positive definiteness of the estimate of D, is that it is computationally
simple and stable.
The Cholesky factorization of A in (6.1.2) is
A = LT L = [ 1  0  0 ] [ 1  1  1 ]
           [ 1  2  0 ] [ 0  2  2 ]
           [ 1  2  3 ] [ 0  0  3 ]
By convention, the components of the upper triangular part of L are listed
column-wise to give θ = (1, 1, 2, 1, 2, 3)T .
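As a sketch in Python of the factorization above (numpy computes the lower-triangular factor C with A = C C^T, so the thesis's upper-triangular L is C^T; the thesis software itself is in S):

```python
import numpy as np

A = np.array([[1.,  1.,  1.],
              [1.,  5.,  5.],
              [1.,  5., 14.]])

# numpy's Cholesky gives C lower triangular with A = C @ C.T, so the
# upper-triangular factor with A = L.T @ L is L = C.T
L = np.linalg.cholesky(A).T

# theta: the upper-triangular elements of L listed column-wise
theta = np.concatenate([L[: j + 1, j] for j in range(L.shape[1])])
print(theta)   # [1. 1. 2. 1. 2. 3.]
```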
6.1.2 Log-Cholesky Parametrization
If one requires the diagonal elements of L in the Cholesky factorization to be
positive then L is unique. In order to avoid constrained estimation, one can
use the logarithms of the diagonal elements of L. We call this parametrization
the log-Cholesky parametrization. It inherits the good computational proper-
ties of the Cholesky parametrization, but has the advantage of being uniquely
defined. As in the Cholesky parametrization the parameters also lack direct
interpretation in terms of the original variances and covariances, except for L11.
The log-Cholesky parametrization of A is θ = (0, 1, log(2), 1, 2, log(3))T .
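A sketch of the inverse map, illustrating the point of the parametrization: any unconstrained θ produces a positive definite D. The helper `logchol_to_D` is illustrative, not part of the thesis software:

```python
import numpy as np

def logchol_to_D(theta, q):
    """Build D = L.T @ L from an unconstrained log-Cholesky vector theta,
    listed column-wise with logged diagonal elements."""
    L = np.zeros((q, q))
    pos = 0
    for j in range(q):
        L[: j + 1, j] = theta[pos : pos + j + 1]
        pos += j + 1
    # exponentiating makes the diagonal positive, hence L has full rank
    L[np.diag_indices(q)] = np.exp(np.diag(L))
    return L.T @ L

# the log-Cholesky vector of A quoted above recovers A exactly...
theta_A = [0.0, 1.0, np.log(2.0), 1.0, 2.0, np.log(3.0)]
print(logchol_to_D(theta_A, 3))

# ...and an arbitrary theta still yields a positive definite matrix
rng = np.random.default_rng(0)
D = logchol_to_D(rng.normal(size=6), 3)
print(np.linalg.eigvalsh(D).min() > 0)   # True
```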
6.1.3 Spherical Parametrization
The purpose of this parametrization is to combine the computational efficiency
of the Cholesky parametrization with direct interpretation of θ in terms of the
variances and correlations in D.
Let Li denote the ith column of L in the Cholesky factorization of D and
li denote the spherical coordinates of the first i elements of Li. That is
[Li]1 = [li]1 cos ([li]2)
[Li]2 = [li]1 sin ([li]2) cos ([li]3)
· · ·
[Li]i−1 = [li]1 sin ([li]2) · · · cos ([li]i)
[Li]i = [li]1 sin ([li]2) · · · sin ([li]i)
It then follows that Dii = [li]1², i = 1, . . . , q and ρ1i = cos([li]2), i = 2, . . . , q,
where ρij denotes the correlation coefficient between bi and bj . The correlations
between other variables can be expressed as linear combinations of products
of sines and cosines of the elements in l1, . . . , lq, but the relationship is not as
straightforward as those involving b1. If confidence intervals are available for
the elements of li, i = 1, . . . , q then we can also obtain confidence intervals for
the variances and the correlations ρ1i. By appropriately permuting the rows and
columns of D, we can in fact obtain confidence intervals for all the variances
and correlations of b1, . . . , bq. The exact same reasoning can be applied to
derive profile traces and profile contours (Bates and Watts, 1988) for variances
and correlations of b1, . . . , bq.
In order to ensure uniqueness of the spherical parametrization we must have
[li]1 > 0, i = 1, . . . , q and [li]j ∈ (0, π) , i = 2, . . . , q, j = 2, . . . , i
Unconstrained estimation is obtained by defining θ as follows
θi = log ([li]1) , i = 1, . . . , q and
θq+(i−2)(i−1)/2+(j−1) = log ([li]j / (π − [li]j)) , i = 2, . . . , q, j = 2, . . . , i
The spherical parametrization has about the same computational efficiency
as the Cholesky and log-Cholesky parametrizations, is uniquely defined, and
allows direct interpretability of θ in terms of the variances and correlations in
D.
The spherical parametrization of A is θ = (0, log(5)/2, log(14)/2,−0.608,−0.348,
−0.787)T .
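A sketch of the forward map from the Cholesky factor to the spherical θ, reproducing the values quoted above for A. The helper `spherical_theta` is illustrative, not part of the thesis software, and assumes the positive-diagonal (unique) Cholesky factor:

```python
import numpy as np

def spherical_theta(A):
    """Spherical parametrization theta of a p.d. matrix A = L.T @ L."""
    L = np.linalg.cholesky(A).T          # upper triangular, A = L.T @ L
    q = A.shape[0]
    radii, angles = [], []
    for i in range(q):
        col = L[: i + 1, i]
        r = np.linalg.norm(col)          # [li]1: length of column i
        radii.append(np.log(r))
        rest = r
        for j in range(i):               # column i contributes i angles
            phi = np.arccos(col[j] / rest)
            # logit-type map sends an angle in (0, pi) to the whole real line
            angles.append(np.log(phi / (np.pi - phi)))
            rest = rest * np.sin(phi)
    return np.array(radii + angles)

A = np.array([[1., 1., 1.], [1., 5., 5.], [1., 5., 14.]])
theta = spherical_theta(A)
print(theta.round(3))  # compare with (0, log(5)/2, log(14)/2, -0.608, -0.348, -0.787)
```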
6.1.4 Matrix Logarithm Parametrization
The next two parametrizations are based on the spectral decomposition of D.
Since D is p.d., it has q positive eigenvalues λ. Letting U denote the orthogonal
matrix of orthonormal eigenvectors of D and Λ = diag (λ), we can write
D = UΛUT (6.1.3)
By setting
L = Λ1/2UT (6.1.4)
in (6.1.1), where Λ1/2 denotes the diagonal matrix with [Λ1/2]ii = √[Λ]ii, we
get a factorization of D based on the spectral decomposition.
The matrix logarithm of D is defined as log (D) = U log (Λ) UT , where
log (Λ) = diag [log (λ)]. Note that D and log (D) share the same eigenvectors.
The matrix log (D) can take any value in the space of q × q symmetric ma-
trices and letting θ be equal to its upper triangular elements gives the matrix
logarithm parametrization of D.
The matrix logarithm parametrization defines a one-to-one mapping between
θ and D and therefore does not have the identification problems of the Cholesky
factorization. It does involve considerable calculations, as θ produces log (D)
whose eigenstructure must be determined before L in (6.1.4) can be calculated.
Similarly to the Cholesky and log-Cholesky parametrizations, the vector θ in the
matrix logarithm parametrization does not have a straightforward interpretation
in terms of the original variances and covariances in D. We note that even
though the matrix logarithm is based on the spectral decomposition of D, there
is not a straightforward relationship between θ and the eigenvalues-eigenvectors
of D.
The matrix logarithm of A is
log (A) = [ −0.174  0.397  0.104 ]
          [  0.397  1.265  0.650 ]
          [  0.104  0.650  2.492 ]
and therefore the matrix logarithm parametrization of A is θ = (−0.174, 0.397,
1.265, 0.104, 0.650, 2.492)T .
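A sketch of the map via the spectral decomposition (numpy only; for symmetric input scipy.linalg.logm would give the same result). The helpers are illustrative, not part of the thesis software:

```python
import numpy as np

def sym_logm(D):
    """Matrix logarithm of a symmetric p.d. matrix via eigendecomposition."""
    lam, U = np.linalg.eigh(D)
    return U @ np.diag(np.log(lam)) @ U.T

def sym_expm(S):
    """Inverse map: matrix exponential of a symmetric matrix."""
    lam, U = np.linalg.eigh(S)
    return U @ np.diag(np.exp(lam)) @ U.T

A = np.array([[1., 1., 1.], [1., 5., 5.], [1., 5., 14.]])
S = sym_logm(A)

# theta: the upper-triangular elements of log(A), listed column-wise
theta = np.concatenate([S[: j + 1, j] for j in range(3)])
print(theta.round(3))

# the map is one-to-one: exponentiating recovers A
print(np.allclose(sym_expm(S), A))   # True
```

Note that trace(log A) = log det A (here log 36 ≈ 3.58), a quick consistency check on the diagonal of log(A).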
6.1.5 Givens Parametrization
The eigenstructure of D contains valuable information for determining whether
some linear combination of b1, . . . , bq could be regarded as constant. The Givens
parametrization uses the eigenvalues of D directly in the definition of the pa-
rameter vector θ.
The Givens parametrization is based on the spectral decomposition of D
given in (6.1.3) and the fact that the eigenvector matrix U can be represented
by q(q − 1)/2 angles, used to generate a series of Givens rotation matrices
(Thisted, 1988) whose product reproduces U as follows
U = G1G2 · · ·Gq(q−1)/2, where
Gi[j, k] = cos(δi),   if j = k = m1(i) or j = k = m2(i)
           sin(δi),   if j = m1(i), k = m2(i)
           −sin(δi),  if j = m2(i), k = m1(i)
           1,         if j = k ≠ m1(i) and j = k ≠ m2(i)
           0,         otherwise
and m1(i) < m2(i) are integers taking values in {1, . . . , q} and satisfying i =
m2(i)−m1(i)+ (m1(i) − 1) (q − m1(i)/2). In order to ensure uniqueness of the
Givens parametrization we must have δi ∈ (0, π), i = 1, . . . , q(q − 1)/2.
The spectral decomposition (6.1.3) is unique up to a reordering of the diag-
onal elements of Λ and columns of U . Uniqueness can be achieved by forcing
the eigenvalues to be sorted in ascending order. This can be attained, within
an unconstrained estimation framework, by using a parametrization suggested
by Jupp (1978) and defining the first q elements of θ as
θi = log (λi − λi−1) , i = 1, . . . , q,
where λi denotes the ith eigenvalue of D in ascending order, with the convention
that λ0 = 0. The remaining elements of θ in the Givens parametrization are
defined by the relation
θq+i = log (δi / (π − δi)) , i = 1, . . . , q(q − 1)/2.
The main advantage of this parametrization is that the first q elements of
θ give information about the eigenvalues of D directly. Another advantage of
the Givens parametrization is that it can be easily modified to handle general
(not necessarily p.d.) symmetric matrices. The only modification needed is to
set θ1 = λ1 and

λi = θ1 + ∑j=2,...,i exp (θj) , i = 2, . . . , q.
The main disadvantage of this parametrization is that it involves consider-
able computational effort in the calculation of D from the parameter vector
θ. Another problem with the Givens parametrization is that one cannot relate
θ to the elements of D in a straightforward manner, so that inferences about
variances and covariances require indirect methods.
The eigenvector matrix U in (6.1.3) can also be expressed as a product of
a series of Householder reflection matrices (Thisted, 1988) and these in turn
can be derived from q(q − 1)/2 parameters used to obtain the directions of the
Householder reflections. This Householder parametrization is essentially equiv-
alent to the Givens parametrization in terms of statistical interpretability, but
it is less efficient, since the derivation of the Householder reflection matrices in-
volves even more computation than the Givens rotations. We did not consider
it here.
The Givens parametrization of A is
θ = (−0.275, 0.761, 2.598, −0.265, −0.562, −0.072)T.
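The eigenvalue part of the Givens θ can be checked directly; a sketch of Jupp's transform applied to the sorted eigenvalues of A (the three angle components would come from factoring U into Givens rotations, which is omitted here):

```python
import numpy as np

A = np.array([[1., 1., 1.], [1., 5., 5.], [1., 5., 14.]])
lam = np.linalg.eigvalsh(A)            # eigenvalues in ascending order

# Jupp (1978): unconstrained coordinates for an increasing positive sequence,
# theta_i = log(lambda_i - lambda_{i-1}) with lambda_0 = 0
theta = np.log(np.diff(lam, prepend=0.0))
print(theta.round(3))                  # first q elements of the Givens theta

# inverse map: cumulative sums of exp(theta) recover the sorted eigenvalues
print(np.allclose(np.cumsum(np.exp(theta)), lam))   # True
```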
6.2 Comparing the Parametrizations
In this section we compare the parametrizations described in section 6.1 in
terms of their computational efficiency and the statistical interpretability of the
individual parameters.
The computational efficiency of the different parametrizations is assessed
using simulation results. First we analyze the average time needed to calculate
L (θ) from θ for each parametrization and for varying sizes of L. Then we
compare the performance of the different parametrizations in computing the
maximum likelihood estimate of the variance-covariance matrix in a linear mixed
effects model.
To evaluate the average time needed to calculate L, we generated 25 random
q × q matrices Z whose elements were i.i.d. random variables with uniform
distribution in (0, 1) for q varying from 5 to 100, obtained D = ZT Z and then
θ, and recorded the average time to calculate L. Since the user times were too
small for matrices of dimension less than 10, we used 5 evaluations of L at each
user time calculation. Figure 6.2.1 presents the average user time as a function
of q for each of the parametrizations of D.
The Cholesky, the log-Cholesky, and the spherical parametrizations have
similar performances, considerably better than the other two parametrizations.
The matrix logarithm had the worst performance, followed by the Givens param-
etrization. These results essentially reflect the computational complexity
of each parametrization, as described in section 6.1.
In order to compare the different parametrizations in an estimation context,
[Figure 6.2.1 appears here: two panels of average user time (seconds) against dimension q, with one curve for each of the Matrix Log, Givens, Cholesky, logCholesky, and Spherical parametrizations.]
Figure 6.2.1: Average user time to calculate L as a function of q, for the different parametrizations of D. Plot (a) shows the behavior of the average user time for q ≤ 40 and plot (b) shows the behavior of the average user time for q up to 100.
we conducted a small simulation study using the linear mixed effects model
yi = Xi (β + bi) + εi, i = 1, . . . , m (6.2.1)
where the bi are i.i.d. N (0, σ2D) random effects and the εi are i.i.d. Nni(0, σ2I)
error terms independent of the bi, with ni representing the number of obser-
vations on the ith cluster. Lindstrom and Bates (1988) have shown that the
loglikelihood corresponding to (6.2.1) can be profiled to produce a function of
D alone. We used, in the simulation, D matrices of dimensions 3 and 6. These
were defined such that the nonzero elements of the ith column of the correspond-
ing Cholesky factor were equal to {1, 2, . . . , i}. For q = 3 we have D = A, as
given in (6.1.2). For q = 3 we used m = 10, ni = 15, i = 1, . . . , 10, σ2 = 1, and
β = (10, 1, 2)T , while for q = 6 we used m = 50, ni = 25, i = 1, . . . , 50, σ2 = 1,
and β = (10, 1, 2, 3, 4, 5)T . In both cases, the elements of the first column of X
were set equal to 1 and the remaining elements were generated according to a
U (1, 20) distribution. A total of 300 and 50 samples were generated respectively
for q = 3 and q = 6, and the number of iterations and the user time to calculate
the maximum likelihood estimate of D for each parametrization were recorded.
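The simulated D matrices can be reconstructed from that description; a sketch (the helper `sim_D` is illustrative; for q = 3 it recovers A of (6.1.2)):

```python
import numpy as np

def sim_D(q):
    """D = L.T @ L where the nonzero elements of the i-th column of the
    upper-triangular Cholesky factor L are 1, 2, ..., i."""
    L = np.zeros((q, q))
    for j in range(q):
        L[: j + 1, j] = np.arange(1, j + 2)
    return L.T @ L

print(sim_D(3))        # equals A of (6.1.2)
print(sim_D(6).shape)  # (6, 6)
```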
Figures 6.2.2 and 6.2.3 present the box-plots of the number of iterations and
user times for the various parametrizations. The Cholesky, the log-Cholesky,
the spherical, and the matrix logarithm parametrizations had similar perfor-
mances for q = 3, considerably better than the Givens parametrization. For
q = 6 the Cholesky and the matrix logarithm parametrizations gave the best
performances, followed by the log-Cholesky and spherical parametrizations, all
considerably better than the Givens parametrization. Since D is relatively
small in these examples, the numerical complexity of the different parametriza-
tions did not play a major role in their performances. It is interesting to note
that even though the matrix logarithm is the least efficient parametrization in
terms of numerical complexity, it had the best performance in terms of number
of iterations and user time to obtain the maximum likelihood estimate of D,
suggesting that this parametrization is the most numerically stable.
Another important aspect in which the parametrizations should be compared
has to do with their behavior as D approaches singularity. All parametrizations
described in section 6.1 require D to be positive definite, though the Givens
parametrization can be modified to handle general symmetric matrices. It is
usually an important statistical issue to test whether D is less than full rank,
in which case the dimension of the parameter space can be reduced.
As D approaches singularity its determinant goes to zero and so at least one
of the diagonal elements of its Cholesky factor goes to zero too. The Cholesky
[Figure 6.2.2 appears here: box-plots, with outlier points, of user time (seconds) and number of iterations to convergence for the Cholesky, logCholesky, Spherical, Matrix log, and Givens parametrizations.]
Figure 6.2.2: Box-plots of user time and number of iterations to convergence for 300 random samples of model (6.2.1) with D of dimension 3.
parametrization would then become numerically unstable, since equivalent so-
lutions would get closer together in the estimation space. At least one element
of θ in the log-Cholesky parametrization would go to −∞ (the logarithm of
the diagonal element of L that goes to zero). In the spherical parametrization
we would also have at least one element of θ going in absolute value to ∞: if
the first diagonal element of L goes to zero, θ1 → −∞; otherwise at least one
angle of the spherical coordinates of the column of L whose diagonal element
approaches 0 would either approach 0 or π, in which cases the corresponding
element of θ would go respectively to −∞ or ∞.
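This divergence is easy to see numerically; a sketch with a 2 × 2 matrix whose correlation approaches 1:

```python
import numpy as np

# As rho -> 1 the matrix [[1, rho], [rho, 1]] approaches singularity, and the
# second diagonal element of its Cholesky factor, sqrt(1 - rho**2), collapses
# to zero, so its logarithm (a log-Cholesky coordinate) drifts to -infinity.
logs = []
for rho in [0.9, 0.999, 0.999999]:
    D = np.array([[1.0, rho], [rho, 1.0]])
    C = np.linalg.cholesky(D)            # lower-triangular factor of D
    logs.append(np.log(C[1, 1]))
    print(rho, logs[-1])
```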
Singularity of D implies that at least one of its eigenvalues is zero. The
Givens parametrization would then have at least the first element of θ going
to −∞. To understand what happens with the matrix logarithm parametriza-
tion when D approaches singularity we note that letting (λ1, u1), . . . , (λq, uq)
represent the eigenvalue-eigenvector pairs corresponding to D we can write
[Figure 6.2.3 appears here: box-plots, with outlier points, of user time (seconds) and number of iterations to convergence for the five parametrizations.]
Figure 6.2.3: Box-plots of user time and number of iterations to convergence for 50 random samples of model (6.2.1) with D of dimension 6.
D = ∑qi=1 λiuiuiT. As λ1 → 0, all entries of log(D) corresponding to nonzero
elements of u1u1T would converge in absolute value to ∞. Hence in the matrix
logarithm parametrization we could have all elements of θ going either to −∞
or ∞ as D approached singularity.
Finally we consider the statistical interpretability of the parametrizations of
D. The least interpretable parametrization is the matrix logarithm — none of
its elements can be directly related to the individual variances, covariances, or
eigenvalues of D. The Cholesky and log-Cholesky parametrizations have the
first component directly related to the variance of b1, the first underlying random
variable in D. By permuting the order of the random variables in the definition
of D, one can derive measures of variability and confidence intervals for all the
variances in D, from corresponding quantities obtained for the parameters in the
Cholesky or log-Cholesky parametrizations. The Givens parametrization is the
only one considered here that uses the eigenvalues of D directly in the definition
of θ. It is a very useful parametrization for identifying ill-conditioning of D.
None of its parameters, though, can be directly related to the variances and
covariances in D. Finally, the spherical parametrization is the one that gives the
largest number of interpretable parameters of all parametrizations considered
here. Measures of variability and confidence intervals for all the variances in D
and the correlations with b1 can be obtained from the corresponding quantities
calculated for θ. By permuting the order of the underlying random variables in
the definition of D, one can in fact derive measures of variability and confidence
intervals for all the variances and correlations in D.
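As a concrete illustration of how these unconstrained parametrizations work (a Python sketch, not the thesis S code), the 2 × 2 log-Cholesky map sends any θ ∈ R³ to a positive-definite D, and exp(θ1) is exactly the standard deviation of b1, which is what makes the first component directly interpretable.

```python
import math

def logchol_to_cov(theta):
    """Log-Cholesky map for a 2x2 covariance matrix: theta in R^3 is
    unconstrained, D = L L^T is always positive definite, and the
    diagonal of L is exp() of the corresponding theta entries."""
    l11, l21, l22 = math.exp(theta[0]), theta[1], math.exp(theta[2])
    d11 = l11 * l11                 # Var(b1) = exp(theta[0])^2
    d21 = l21 * l11
    d22 = l21 * l21 + l22 * l22
    return [[d11, d21], [d21, d22]]
```

Because det(D) = (l11 l22)² > 0 and d11 > 0, the result is positive definite for every θ, so an optimizer can move freely through R³.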
6.3 Conclusions
The parametrizations described in section 6.1 allow the estimation of variance-
covariance matrices using unconstrained optimization. This has numerical and
statistical advantages over constrained optimization, since the latter is usually
a much harder numerical problem. Furthermore, unconstrained estimates tend
to have better inferential properties.
Of the five parametrizations considered here, the spherical parametrization
presents the best combination of performance and statistical interpretability of
individual parameters. The Cholesky and log-Cholesky parametrizations have
comparable performances, similar to the spherical parametrization, but lack
direct parameter interpretability. The Givens parametrization is considerably
less efficient than these parametrizations, but has the feature of being directly
based on the eigenvalues of the variance-covariance matrix. This can be used,
for example, to identify nonrandom linear combinations of the underlying ran-
dom variables. The matrix logarithm parametrization is very inefficient as the
dimension of the variance-covariance matrix increases, but seems to be the
most stable parametrization. It also lacks direct interpretability of its parameters.
Different parametrizations can be used at different stages of the data analy-
sis. The matrix logarithm parametrization seems to be the most efficient for the
optimization step, at least for moderately large D. The spherical parametrization
is probably the best one to derive measures of variability and confidence
intervals for the elements of D, while the Givens parametrization is the most
convenient to investigate rank deficiency of D.
Chapter 7
Mixed Effects Models Methods
and Classes for S
In this chapter we describe a set of S functions, classes, and methods (Chambers
and Hastie, 1992) for the analysis of mixed effects models. These extend the lin-
ear and nonlinear modeling facilities available in release 3 of S and S-plus. The
source code, written in S and C using an object-oriented approach, is available
in the S collection at StatLib. Details on how to obtain this and other soft-
ware from StatLib can be found in Newton (1993). Help files for all functions
described here are included in Appendix B.
Section 7.1 presents the functions and methods for fitting and analyzing
linear mixed effects models. The nonlinear mixed effects functions and methods
are described in section 7.2. Section 7.3 presents our conclusions and some
future directions for the code development.
7.1 The lme class and related methods
The functions and methods for the linear mixed effects model will be described
here through the analysis of data on a dental study presented in Potthoff and
Roy (1964). The data, displayed in Figure 7.1.1, consist of four measurements
of the distance (in millimeters) from the centre of the pituitary to the ptery-
omaxillary fissure made at ages 8, 10, 12, and 14 years for 16 boys and 11 girls.
A linear model seems adequate to explain the distance as a function of age, but
the intercept and slope seem to vary with the individual. The corresponding
linear mixed effects model is
dij = (β0 + bi0) + (β1 + bi1) agej + εij , i = 1, . . . , 27, j = 1, . . . , 4 (7.1.1)
where dij represents the distance for the ith individual at age j, β0 and β1 are
respectively the fixed intercept and the fixed slope, bi0 and bi1 are respectively
the random intercept and the random slope corresponding to the ith individual,
and εij is the cluster error term. It is assumed that the bi = (bi0, bi1)T are i.i.d.
with a N (0, σ2D) distribution and the εij are i.i.d. with a N (0, σ2) distribution,
independent of the bi.
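The data-generating process defined by model (7.1.1) can be sketched in a few lines. The following is an illustrative Python simulation (not part of the thesis S software); the parameter values are taken from the dental.fit1 printout shown later in this section.

```python
import math
import random

def simulate_dental(beta0, beta1, sd_b, corr, sigma, ages, n_subj, rng):
    """Simulate from model (7.1.1):
    d_ij = (beta0 + bi0) + (beta1 + bi1) * age_j + e_ij,
    with (bi0, bi1) bivariate normal (standard deviations sd_b,
    correlation corr) and e_ij ~ N(0, sigma^2), independent of the bi."""
    rows = []
    for i in range(1, n_subj + 1):
        bi0 = rng.gauss(0.0, sd_b[0])
        # Draw the random slope from its conditional distribution given bi0
        cond_mean = corr * sd_b[1] / sd_b[0] * bi0
        cond_sd = sd_b[1] * math.sqrt(1.0 - corr ** 2)
        bi1 = rng.gauss(cond_mean, cond_sd)
        for age in ages:
            e = rng.gauss(0.0, sigma)
            rows.append((i, age, (beta0 + bi0) + (beta1 + bi1) * age + e))
    return rows

# Values from the dental.fit1 printout (sigma = sqrt(1.716204))
rng = random.Random(1)
sim = simulate_dental(16.76111, 0.6601852, (2.194103, 0.2149245),
                      -0.5814881, math.sqrt(1.716204), [8, 10, 12, 14], 27, rng)
```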
One of the questions of interest for these data is to determine whether
there are significant differences between boys and girls with respect to distance
growth. Model (7.1.1) can be modified to test for sex-related differences in
intercept and slope
dij = (β00 + β01sexi + bi0) + (β10 + β11sexi + bi1) agej + εij (7.1.2)
where sexi is an indicator variable assuming value zero if the ith individual is
a boy and one if a girl.

Figure 7.1.1: Distance from the centre of the pituitary to the pteryomaxillary fissure in boys and girls at different ages.

β00 and β10 represent the fixed intercept and
slope for the boys and β01 and β11 the (fixed) increments in intercept and slope
associated with girls. Differences between boys and girls can be evaluated by
testing whether β01 and β11 are significantly different from zero. The remaining
terms in (7.1.2) are defined as in (7.1.1). It will be assumed here that the
data is available in a data.frame called dental, with columns distance, age,
subject, and sex as below
> dental
    distance age subject sex
1       26.0   8       1   0
2       25.0  10       1   0
3       29.0  12       1   0
4       31.0  14       1   0
 . . .
105     24.5   8      27   1
106     25.0  10      27   1
107     28.0  12      27   1
108     28.0  14      27   1
7.1.1 The lme function
The lme function is used to fit the general linear mixed effects model, described
in chapter 2, using either maximum likelihood or restricted maximum likelihood.
Several optional arguments can be used with this function, but the typical call
is
lme(fixed, random, cluster, data)
The first three arguments are required. The fixed and random arguments are formulas
defining the fixed and random effects part of the model. Any linear model
formula (Chambers and Hastie, 1992) is allowed, giving the model formulation
considerable flexibility. For the dental data these formulas would be written as
fixed = distance ~ age, random = ~ age
for model (7.1.1) and
fixed = distance ~ age * sex, random = ~ age
for model (7.1.2). Note that the response variable is defined only in the fixed
formula.
The cluster argument is a formula, or expression, defining the labels of the
different subjects in the data. For the dental data we would use
cluster = ~ subject
for both models (7.1.1) and (7.1.2). Note that the cluster formula has no left
hand side. The optional argument data specifies the data frame in which the
variables used in the model are available. A simple call to lme to fit model (7.1.1)
would be
> dental.fit1 <- lme(fixed = distance ~ age, random = ~ age,
+                    cluster = ~ subject, data = dental)
and to fit model (7.1.2) we would use
> dental.fit2 <- lme(fixed = distance ~ age * sex, random = ~ age,
+                    cluster = ~ subject, data = dental)
The fitted objects returned by lme are of class lme, for which several methods
are available, including those for the generic functions print, summary, and
plot.
7.1.2 The print, summary, and anova methods.
A brief description of the estimation results can be obtained through the print
method. This only gives the estimates for the standard errors and correla-
tions of the random effects, the cluster variance, and the fixed effects. For the
dental.fit1 object we get
> dental.fit1
Call:
  Fixed: distance ~ age
  Random: ~ age
  Cluster: ~ subject
  Data: dental

Variance/Covariance Components Estimates:

Structure: logcholesky
Standard Deviation(s) of Random Effect(s)
(Intercept)       age
   2.194103 0.2149245
Correlation of Random Effects
    (Intercept)
age  -0.5814881

Cluster Residual Variance: 1.716204

Fixed Effects Estimates:
(Intercept)       age
   16.76111 0.6601852

Number of Observations: 108
Number of Clusters: 27
A more complete description of the estimation results is obtained with summary.
> summary(dental.fit2)
 . . .
Loglikelihood: -114.6576
AIC: 245.3152

Variance/Covariance Components Estimates:
Structure: logcholesky
Standard Deviation(s) of Random Effect(s)
(Intercept)       age
   2.134464 0.1541247
Correlation of Random Effects
    (Intercept)
age  -0.6024329

Cluster Residual Variance: 1.716232

Fixed Effects Estimates:
                 Value Approx. Std.Error z ratio(C)
(Intercept) 16.3406250        0.98005731 16.6731321
age          0.7843750        0.08275189  9.4786353
sex          1.0321023        1.53545472  0.6721802
age:sex     -0.3048295        0.12964730 -2.3512218

Conditional Correlations of Fixed Effects Estimates
        (Intercept)        age        sex
age      -0.8801554
sex      -0.6382847  0.5617897
age:sex   0.5617897 -0.6382847 -0.8801554
 . . .
The approximate standard errors for the fixed effects are derived using the
asymptotic theory described in chapter 3. The results above indicate that the
distance grows faster in boys than in girls (significant, negative age:sex fixed
effect), but they have the same average initial distance (non-significant sex fixed
effect).
A likelihood ratio test to evaluate the hypothesis of no sex differences in
distance development is available with the anova method.
> anova(dental.fit1, dental.fit2)
 . . .
            Model Df    AIC  Loglik    Test Lik.Ratio   P value
dental.fit1     1  6 252.72 -120.36
dental.fit2     2  8 245.32 -114.66 1 vs. 2    11.406 0.0033365

The likelihood ratio test strongly rejects the null hypothesis of no sex differences.
In order to test if only the growth rate is dependent on sex, using a likelihood
ratio test, we can fit

> dental.fit3 <- lme(fixed = distance ~ age + age:sex, random = ~ age,
+                    cluster = ~ subject, data = dental)
and use the anova method again.
> anova(dental.fit2, dental.fit3)
 . . .
            Model Df    AIC  Loglik    Test Lik.Ratio P value
dental.fit2     1  8 245.32 -114.66
dental.fit3     2  7 243.76 -114.88 1 vs. 2   0.44806 0.50326
As expected, the likelihood ratio test indicates that the initial distances do not
depend on sex.
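The statistics in these anova tables can be checked by hand from the reported log-likelihoods and degrees of freedom (small discrepancies arise because the printed log-likelihoods are rounded). A minimal Python sketch, using the closed-form chi-square tail probabilities available for 1 and 2 degrees of freedom:

```python
import math

def lrt_pvalue(loglik_null, loglik_alt, df):
    """Likelihood-ratio statistic 2*(loglik_alt - loglik_null) and its
    chi-square p-value; closed forms cover df = 1 and df = 2 only."""
    stat = 2.0 * (loglik_alt - loglik_null)
    if df == 1:
        p = math.erfc(math.sqrt(stat / 2.0))  # chi-square(1) tail
    elif df == 2:
        p = math.exp(-stat / 2.0)             # chi-square(2) tail
    else:
        raise ValueError("only df = 1 or df = 2 are handled here")
    return stat, p

# anova(dental.fit1, dental.fit2): df difference = 8 - 6 = 2
stat, p = lrt_pvalue(-120.36, -114.66, df=2)
```

With the rounded log-likelihoods this gives a statistic of 11.40 and a p-value of about 0.0033, matching the table.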
7.1.3 The plot method
Plots of random effects estimates, residuals, and fitted values can be obtained
using the plot method for class lme. The following call will produce a scatter
plot of the intercept and slope random effects estimates in model (7.1.2), as
shown in Figure 7.1.2.
> plot(dental.fit2, levels = c(0.5, 0.75, 0.9, 0.95))
The optional levels argument specifies the approximate coverage probabilities
for the random effects density contours to be included in the plot.
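The thesis does not spell out how these contours are computed, but for a bivariate normal density the standard construction is the ellipse whose squared Mahalanobis radius equals the chi-square(2) quantile −2 ln(1 − p). A hedged Python sketch of that mapping, under the assumption that this is the construction used:

```python
import math

def contour_radius(p):
    """Mahalanobis radius of the bivariate normal density contour with
    coverage probability p: the squared radius is -2*ln(1 - p),
    the chi-square(2) quantile."""
    return math.sqrt(-2.0 * math.log(1.0 - p))

# Radii corresponding to levels = c(0.5, 0.75, 0.9, 0.95)
radii = [contour_radius(p) for p in (0.5, 0.75, 0.9, 0.95)]
```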
Figure 7.1.2: Scatter plot of the conditional modes of the intercept and slope random effects in model (7.1.2). Dashed lines represent the approximate 50%, 75%, 90%, and 95% random effects density contours.
The point at the upper left corner of Figure 7.1.2 appears to be an outlying
value that is possibly having a great impact on the correlation and variance
estimates.
Residual plots may be obtained by setting the argument option in the plot
method to "r".
> plot(dental.fit3, option = "r")
The resulting plots are included in Figure 7.1.3. The first plot, observed
versus fitted values, indicates that the linear model does a reasonable job of
explaining the distance growth. The points fall relatively close to the y = x line,
indicating a reasonable agreement between the fitted and observed values. The
second plot, residuals versus fitted values, suggests the presence of three outliers
in the data. The remaining residuals appear to be homogeneously scattered
around the y = 0 line. The final plot, featuring the boxplot of the residuals by
subject, suggests that the outliers occurred for subjects 9 and 13. There seems
to be considerable variation in the within-subject variability, but it must be
remembered that the boxplots represent only four residual values.
7.1.4 Other methods
Standard S methods for extracting components of fitted objects, such as
residuals, fitted, and coefficients, can also be used on lme objects. The first
two methods return data frames with two columns, population and cluster,
while the last one returns a list with two components, the random and the fixed
effects estimates. A more detailed description of these objects is available in
the help files, included in Appendix B.
Figure 7.1.3: Residuals and fitted values plots.
Estimates of the individual parameters are obtained using the cluster.coef
method.
> cluster.coef(dental.fit3)
   (Intercept)       age    age:sex
1     18.27436 0.8322535 -0.2281247
2     15.48918 0.7357740 -0.2281247
3     16.18725 0.7418871 -0.2281247
 . . .
27    19.21336 0.8370006 -0.2281247
Predicted values are obtained using the predict method. For example, if
we are interested in predicting the average distance for boys and girls at ages
14, 15, and 16, as well as for subjects 1 and 20 at age 13, we should create a
new data frame, say dental.new, as follows
> dental.new <-
+   data.frame(sex = c(1, 1, 1, 0, 0, 0, 1, 0),
+              age = c(14, 15, 16, 14, 15, 16, 13, 13),
+              subject = c(NA, NA, NA, NA, NA, NA, 1, 20))
and then use
> predict(dental.fit3, ~ subject, dental.new)
  cluster fit.cluster fit.population
1      NA                   24.11111
2      NA                   24.63611
3      NA                   25.16111
4      NA                   27.30486
5      NA                   28.05798
6      NA                   28.81111
7       1    26.12804       23.58611
8      20    25.05424       26.55173
to get the cluster and population predictions.
7.2 The nlme class and related methods
The functions and methods for the nonlinear mixed effects model will be de-
scribed here through the analysis of the CO2 uptake data. These data, shown
in Figure 7.2.1 and described in Potvin and Lechowicz (1990), come from a
biological study aimed at analyzing the cold tolerance of a C4 grass species,
Echinochloa crus-galli. A total of twelve four-week-old plants, six from Quebec
and six from Mississippi, were divided into two groups: control plants that
stayed at 26◦C and chilled plants that were subject to 14 h of chilling at 7◦C.
After 10 h of recovery at 20◦C, CO2 uptake rates (in µmol/m2s) were measured
for each plant at seven concentrations of ambient CO2 (100, 175, 250, 350, 500,
675, 1000µL/L). Each plant was subjected to the seven concentrations of CO2 in
increasing, consecutive order. The objective of the experiment was to evaluate
the effect of plant type and chilling treatment on the CO2 uptake.
Figure 7.2.1: CO2 uptake rates (in µmol/m2s) for Quebec and Mississippi plants of Echinochloa crus-galli, control and chilled, at different ambient CO2 concentrations.
The model used in Potvin and Lechowicz (1990) is
Uij = φ1i {1 − exp [−φ2i (Cj − φ3i)]} + εij (7.2.1)
where Uij denotes the CO2 uptake rate of the ith plant at the jth CO2 ambient
concentration; φ1i, φ2i, and φ3i denote respectively the asymptotic uptake rate,
the uptake growth rate, and the maximum ambient CO2 concentration at which
no uptake occurs for the ith plant; Cj denotes the jth ambient CO2 level;
and the εij are i.i.d. error terms with distribution N (0, σ2).
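The mean function of model (7.2.1) makes the parameter interpretations concrete: uptake is exactly zero at conc = φ3 and approaches the asymptote φ1 as the concentration grows. A Python sketch (not the thesis S code), evaluated at the co2.fit1 estimates reported later in this section:

```python
import math

def co2_uptake_mean(conc, phi1, phi2, phi3):
    """Mean uptake under model (7.2.1): phi1 * (1 - exp(-phi2*(conc - phi3)))."""
    return phi1 * (1.0 - math.exp(-phi2 * (conc - phi3)))

# Fixed effects estimates from the co2.fit1 printout
phi1, phi2, phi3 = 32.55042, 0.00944257, 41.61764
```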
It will be assumed here that the CO2 uptake data is available in a data.frame
called CO2, with columns plant, type, trt, conc, and uptake as below
   plant        type        trt conc uptake
1      1      Quebec nonchilled   95   16.0
2      1      Quebec nonchilled  175   30.4
3      1      Quebec nonchilled  250   34.8
 . . .
83    12 Mississippi    chilled  675   18.9
84    12 Mississippi    chilled 1000   19.9
7.2.1 The nlme function
The nlme function is used to fit the nonlinear mixed effects model (cf. chapter
4) using either maximum likelihood or restricted maximum likelihood. Several
optional arguments can be used with this function, but a typical call is
nlme(model, fixed, random, cluster, data, start)
The model argument is required and consists of a formula specifying the
nonlinear model to be fitted. Any S nonlinear formula can be used, giving the
function considerable flexibility. From (7.2.1) we have that for the CO2 uptake
data this argument is declared as
uptake ~ A * (1 - exp(-B * (conc - C)))
where we have used the notation A = φ1, B = φ2, and C = φ3. Alternatively,
we can define an S function, say co2.uptake, as follows
> co2.uptake <- function(A, B, C, conc) A * (1 - exp(-B*(conc - C)))
and write the model argument as
uptake ~ co2.uptake(A, B, C, conc)
The advantage of this latter approach is that the analytical derivatives of the
model function can be passed to the nlme function as the gradient attribute
of co2.uptake and used in the optimization algorithm. The S function deriv
can be used to create expressions for the derivatives.
> co2.uptake <- deriv(~ A * (1 - exp(-B * (conc - C))),
+                     LETTERS[1:3], function(A, B, C, conc) {})
If the model function does not have a gradient attribute, numerical derivatives
are used instead.
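The numerical fallback amounts to finite differencing of the model function in each parameter. A Python sketch of a central-difference gradient (illustrative only; the actual S/C implementation is not shown in the thesis), applied to the CO2 model function at a fixed concentration:

```python
import math

def num_grad(f, params, h=1e-6):
    """Central finite-difference gradient of f(*params), a stand-in for
    the analytic gradient attribute produced by deriv()."""
    grads = []
    for i in range(len(params)):
        up, dn = list(params), list(params)
        up[i] += h
        dn[i] -= h
        grads.append((f(*up) - f(*dn)) / (2.0 * h))
    return grads

# The CO2 model function at a fixed concentration, differentiated in A, B, C
conc = 500.0
f = lambda A, B, C: A * (1.0 - math.exp(-B * (conc - C)))
g = num_grad(f, [30.0, 0.01, 50.0])
```

The central differences agree with the analytic partial derivatives 1 − e, A(conc − C)e, and −ABe (with e = exp(−B(conc − C))) to several decimal places.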
The required arguments fixed and random are lists of formulas that define
the structures of the fixed and random effects in the model. In these formulas
a . on the right hand side of a formula indicates that a single parameter is
associated with the effect, but any linear formula in S could be used instead.
This gives considerable flexibility to the model, as time-dependent parameters
can be easily incorporated (e.g. when a formula in the fixed list involves a co-
variate that changes with time). Usually every parameter in the model will have
an associated fixed effect, but it may, or may not, have an associated random
effect. Since we assumed that all random effects have mean zero, the inclusion
of a random effect without a corresponding fixed effect would be unusual. Note
that the fixed and random formulas could be directly incorporated in the model
declaration. The approach used in nlme allows for more efficient calculation of
derivatives and will be useful for update methods that will be incorporated in
the code in the future.
For the CO2 uptake data, if we want to fit a model in which all parameters
are random and no covariates are included we use
fixed = list(A ~ ., B ~ ., C ~ .), random = list(A ~ ., B ~ ., C ~ .)
If we want to estimate the effects of plant type and chilling treatment on
the parameters in the model we can use
fixed = list(A ~ type*trt, B ~ type*trt, C ~ type*trt),
random = list(A ~ ., B ~ ., C ~ .)
The cluster argument is required and defines the cluster label of each
observation. An S expression or a formula with no left hand side can be used
here. The optional argument data names a data frame, and start provides
a list of starting values for the iterative algorithm. Only the fixed effects starting
estimates are required. The default value for the random effects is zero and
starting estimates for the variance-covariance matrix of the random effects (D)
and the cluster variance (σ2) are automatically generated using a formula given
in Laird, Lange and Stram (1987) if they are not supplied. Further information
on the arguments of nlme is available in the help files in Appendix B.
A simple call to nlme to fit model (7.2.1), without any covariates and with
all parameters random is
> co2.fit1 <-
+   nlme(model = uptake ~ co2.uptake(A, B, C, conc),
+        fixed = list(A ~ ., B ~ ., C ~ .),
+        random = list(A ~ ., B ~ ., C ~ .),
+        cluster = ~ plant, data = CO2,
+        start = list(fixed = c(30, 0.01, 50)))
The initial values for the fixed effects were obtained from Potvin and Lechowicz
(1990).
7.2.2 The nlme methods
Objects returned by the nlme function are of class nlme which inherits from
lme. All methods described in section 7.1 are also available for the nlme class.
In fact, with the exception of the predict method, all methods are common to
both classes. We illustrate their use here with the CO2 uptake data.
The print method provides a brief description of the estimation results.
This only gives the estimates for the standard errors and correlations of the
random effects, the cluster variance, and the fixed effects.
> co2.fit1
Call:
  Model: uptake ~ co2.uptake(A, B, C, conc)
  Fixed: list(A ~ ., B ~ ., C ~ .)
  Random: list(A ~ ., B ~ ., C ~ .)
  Cluster: ~ plant
  Data: CO2

Variance/Covariance Components Estimates:

Structure: logcholesky
Standard Deviation(s) of Random Effect(s)
        A           B        C
 9.510373 0.001152827 11.39466
Correlation of Random Effects
            A           B
B -0.06187818
C  0.99998745 -0.06192643

Cluster Residual Variance: 3.129989

Fixed Effects Estimates:
        A          B        C
 32.55042 0.00944257 41.61764

Number of Observations: 84
Number of Clusters: 12
Note that there is a very strong correlation between the φ1 and φ3 random
effects, while both are nearly uncorrelated with the φ2 random effect. The
scatter plot matrix of the random effects is obtained using the plot method
> plot(co2.fit1, levels = c(0.5, 0.75, 0.9, 0.95))
and is shown in Figure 7.2.2. It is clear that the φ1 and φ3 random effects are
Figure 7.2.2: Scatter plot of the conditional modes of the φ1, φ2, and φ3 random effects in model (7.2.1). Dashed lines represent the approximate 50%, 75%, 90%, and 95% random effects density contours.
virtually identical. This correlation may be due to the fact that the plant type
and the chilling treatment, which were not included in the co2.fit1 model, are
affecting φ1 and φ3 in the same way.
One of the main advantages of having the code defined within the S en-
vironment is that all the analytical and graphical machinery present in S is
simultaneously available. We can use these to analyze the dependence of the
individual parameters φ1i, φ2i, and φ3i in model (7.2.1) on plant type and chill-
ing factor. Initially we create a data.frame with the conditional modes of the
random effects obtained in the first fit.
> CO2.random <- data.frame(coef(co2.fit1)$random)
Then we add a column to CO2.random with the treatment combinations corre-
sponding to each plant.
> CO2.random$type.trt <- as.factor(rep(c("Quebec nonchilled",
+     "Quebec chilled", "Mississippi nonchilled",
+     "Mississippi chilled"), rep(3, 4)))
Finally we obtain plots of the conditional modes of the random effects versus the
treatment combinations. The corresponding plots are presented in Figure 7.2.3.
> plot(A ~ type.trt, data = CO2.random)
> plot(B ~ type.trt, data = CO2.random)
> plot(C ~ type.trt, data = CO2.random)
These plots indicate that chilled plants tend to have smaller values of φ1 and
φ3, but the Mississippi plants seem to be much more affected than the Quebec
plants, suggesting an interaction effect between plant type and chilling treat-
ment. There is no clear pattern of dependence between φ2 and the treatment
factors, suggesting that this parameter is not significantly affected by either
plant type or chilling treatment. We can then fit a new model in which φ1 and
φ3 depend on the treatment factors, as below.
> co2.fit2 <-
+   nlme(model = uptake ~ co2.uptake(A, B, C, conc),
+        fixed = list(A ~ type*trt, B ~ ., C ~ type*trt),
+        random = list(A ~ ., B ~ ., C ~ .), cluster = ~ plant, data = CO2,
+        start = list(fixed = c(30, 0, 0, 0, 0.01, 50, 0, 0, 0)))

Figure 7.2.3: Boxplots of the conditional modes of the φ1, φ2, and φ3 random effects in model (7.2.1) by plant type and chilling treatment combination.
We can use the summary method to get more detailed information on the esti-
mation results of the new fitted object.
> summary(co2.fit2)
 . . .
Convergence at iteration: 6
Approximate Loglikelihood: -103.5041
AIC: 239.0082

Variance/Covariance Components Estimates:
Structure: logcholesky
Standard Deviation(s) of Random Effect(s)
A.(Intercept)            B C.(Intercept)
     2.276278 0.0003200845      5.981132
Correlation of Random Effects
              A.(Intercept)            B
B              -0.008043761
C.(Intercept)   0.999984502 -0.008100170

Cluster Residual Variance: 3.127764

Fixed Effects Estimates:
                       Value Approx. Std.Error z ratio(C)
A.(Intercept)  32.452100011     0.7225786330  44.911513
A.type         -7.909764880     0.7024079993 -11.260927
A.trt          -4.231594577     0.7009980593  -6.036528
A.type:trt     -2.434420834     0.7010132656  -3.472717
B               0.009545959     0.0005908485  16.156356
C.(Intercept)  39.936295607     5.6567839253   7.059894
C.type        -10.469319722     4.2166574898  -2.482848
C.trt          -7.975396202     4.1963538181  -1.900554
C.type:trt    -12.360984497     4.2249903799  -2.925683
 . . .
Note that the correlation between the φ1 and the φ3 random effects remains
very high, suggesting that the model is probably overparametrized and fewer
random effects are needed. We will not pursue the model building analysis of
the CO2 uptake data here, since our main goal is to illustrate the use of
the methods for the nlme class and not to present a thorough analysis of the
problem.
the methods for the nlme class and not to present a thorough analysis of the
problem.
In order to compare the fits corresponding to the objects co2.fit1 and
co2.fit2 we can use the anova method.
> anova(co2.fit1, co2.fit2)
 . . .
         Model Df    AIC  Loglik    Test Lik.Ratio    P value
co2.fit1     1 10 268.44 -124.22
co2.fit2     2 16 239.01 -103.50 1 vs. 2     41.43 2.3824e-07
We see that the inclusion of plant type and chilling treatment in the model
caused a substantial increase in the loglikelihood, indicating that they have a
significant effect on φ1 and φ3.
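The AIC column in the anova table follows directly from the log-likelihoods and degrees of freedom. A one-line Python check (the small discrepancy in the second value comes from the rounded printed log-likelihood):

```python
def aic(loglik, df):
    """AIC as reported in the anova tables: -2*loglik + 2*df."""
    return -2.0 * loglik + 2.0 * df

a1 = aic(-124.22, 10)  # co2.fit1
a2 = aic(-103.50, 16)  # co2.fit2: lower AIC despite six extra parameters
```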
Diagnostic plots can be obtained by using the r option of the plot method
> plot(co2.fit2, option = "r")
The corresponding plot is presented in Figure 7.2.4. The first plot, observed
versus fitted values, indicates that the model fits the data well — most points
lie close to the y = x line. The second plot, residuals versus fitted values, does
not indicate any departures from the assumptions in the model — no outliers
seem to be present and the residuals are symmetrically scattered around the
y = 0 line, with constant spread for different levels of the fitted values.
Predictions are obtained through the predict method. For example, to
obtain the population predictions of CO2 uptake rate for Quebec and Missis-
sippi plants under chilling and no chilling, at ambient CO2 concentrations of
50, 100, 200, and 500µL/L, we would first define
Figure 7.2.4: Residuals and fitted values plots.
> CO2.new <-
+   data.frame(type = rep(c("Quebec", "Mississippi"), c(8, 8)),
+              trt = rep(rep(c("chilled", "nonchilled"), c(4, 4)), 2),
+              conc = rep(c(50, 100, 200, 500), 4))
and then use
> predict(co2.fit2, CO2.new)
    population
1   0.05852781
2  11.88120535
 . . .
15 28.92219633
16 38.01456512
to obtain the predictions.
The predict method can also be used for plotting smooth fitted curves by
calculating fitted values at closely spaced concentrations. Figure 7.2.5 presents
the individual fitted curves for all twelve plants using a total of 200 concentra-
tions between 50 and 1000 µL/L.
Figure 7.2.5: Individual fitted curves for the twelve plants in the CO2 uptake data based on the co2.fit2 object.
7.3 Conclusions
The classes and methods described here provide tools for analyzing linear and
nonlinear mixed effects models. As they are defined within the S environment,
all the powerful analytical and graphical machinery present in S is simultane-
ously available. The analyses of the dental data and CO2 uptake data illustrate
some, but by no means all, of the available features.
The code presented here was developed to handle primarily repeated mea-
sures data, i.e. data generated by observing a number of clusters repeatedly
under varying experimental conditions. More general mixed effects models (e.g.
with different levels of nesting) can be analyzed using the functions described
here, but the code will not be computationally efficient for that purpose.
There are several directions in which the software can be expanded to handle
more general mixed effects models and/or incorporate other estimation tech-
niques. These include, but are not limited to,
• Mixed effects models with autocorrelated cluster errors (Chi and Reinsel,
1989). The current version of the code only handles the i.i.d. case;
• More accurate approximations to the loglikelihood in the nonlinear mixed
effects model (cf. chapter 5). These include Laplacian and Gaussian
quadrature approximations to the integral that defines the likelihood of
the data in the nonlinear mixed effects model. The current version uses
an alternating algorithm suggested by Lindstrom and Bates (1990);
• Profiling methods (Bates and Watts, 1988) for deriving confidence regions
on the parameters in the model and assessing the normality of the param-
eter estimates. These methods are computationally intensive, especially
for the nonlinear mixed effects model, and efficient programming is needed
to make them feasible to use;
• Update methods for refitting the model when only small changes in the
original calling sequence are necessary. These methods are particularly useful
for model building, when several similar models are fitted sequentially;
• Methods for deriving confidence and prediction intervals for predicted val-
ues.
We plan to incorporate all these features in future releases of the software
to be contributed to the S collection at StatLib. The autocorrelation structure
for the cluster errors has already been incorporated in an experimental version
currently undergoing tests. C code to calculate Laplacian and Gaussian quadrature
approximations to the integral in the nonlinear mixed effects model has already
been developed, but needs to be incorporated into the S code. For the profiling
methods we plan to use a linear mixed effects approximation to the marginal
density in the nonlinear mixed effects model, suggested in Lindstrom and Bates (1990),
to speed up the calculations.
Chapter 8
Model Building in Mixed Effects
Models
Model building in mixed effects models involves questions that do not have a
parallel in (fixed effects) linear and nonlinear models. Some of these questions
are:
• determining which effects should have an associated random component
and which should be purely fixed;
• using covariates to explain cluster-to-cluster parameter variability;
• using structured random effects variance-covariance matrices (e.g. diago-
nal matrices) to reduce the number of parameters in the model.
In this chapter we consider strategies for addressing these questions in the con-
text of nonlinear mixed effects models, though most of the techniques described
are also applicable to linear mixed effects models.
Any model building strategy is by nature iterative: a tentative model is
initially fitted and modified to generate possibly better models (according to
some goodness-of-fit criterion) and the process is repeated until no further im-
provements are possible. In comparing alternative models one must also analyze
the residuals from the fit, checking for departures from the assumptions in the
model. It is also highly recommended that any model building analysis be done
in conjunction with experts in the field of application of the model, to ensure
the practical usefulness of the chosen model.
The use of the model building techniques described in this chapter is illus-
trated through the analysis of four real data examples. These data sets are
described in section 8.1. In section 8.2 we describe techniques that can be used
to model the variance-covariance matrix of the random effects and to choose
which random effects should be incorporated in the model. The use of covari-
ates to model cluster-to-cluster parameter variability is considered in section 8.3.
Our conclusions are included in section 8.4.
8.1 Examples
We make extensive use of real data examples to illustrate the model building
techniques presented in this chapter. We now introduce the data sets that will
be used throughout this chapter.
8.1.1 Pine Trees
The pine trees growth data are described in Kung (1986). A total of 14 sources
(seeds) of Loblolly pine were planted in the southern United States and the
tree heights (in ft.) were measured at 3, 5, 10, 15, 20, and 25 years of age.
Figure 8.1.1 shows a plot of these data.
Figure 8.1.1: Loblolly pine heights (in feet) at different ages (years); one growth curve (labeled a–n) per seed source.
Kung (1986) used a logistic curve to model the trees’ growth, but an asymptotic
regression model seems to explain the observed growth pattern better. We also
tried the logistic model, the Gompertz model, the Morgan, Mercer, and Flodin
model, and the Weibull type model (Ratkowsky, 1990), but the asymptotic
regression gave the best overall fit. This model can be expressed as
f(t, φ) = φ1 − φ2 exp (−φ3t) (8.1.1)
where t denotes the tree’s age, φ1 the asymptotic height, φ2 the difference be-
tween φ1 and the height at age zero, and φ3 the growth rate.
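As a quick illustration, the mean function (8.1.1) can be coded directly; this is a minimal sketch, with function name and parameter values ours, chosen only for illustration:

```python
import math

def asymptotic_regression(t, phi1, phi2, phi3):
    """Mean function (8.1.1): f(t, phi) = phi1 - phi2 * exp(-phi3 * t).

    phi1: asymptotic height; phi2: difference between phi1 and the
    height at age zero; phi3: growth rate.
    """
    return phi1 - phi2 * math.exp(-phi3 * t)

# At t = 0 the model gives phi1 - phi2, the height at age zero;
# as t grows, the predicted height approaches the asymptote phi1.
height_at_zero = asymptotic_regression(0.0, 100.0, 95.0, 0.04)  # 5.0
```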
8.1.2 Theophylline
The Theophylline data were described in section 5.1. We reproduce the plot of
the data in Figure 8.1.2.
Figure 8.1.2: Theophylline concentrations (in mg/L) of twelve patients over time.
We recall from section 5.1 that a first order compartment model with absorption
in a peripheral compartment is used to represent the variation in the drug
concentration with time. The model equation is reproduced next
Ct = [DKka / (Cl (ka − K))] [exp (−Kt) − exp (−ka t)] (8.1.2)
where Ct is the observed concentration at time t (mg/L), t is the time (hr),
D is the dose (mg/kg), Cl is the clearance (L/kg), K is the elimination rate
constant (1/hr), and ka is the absorption rate constant (1/hr). In order to
ensure positivity of the rate constants and the clearance, the logarithms of
these quantities can be used in (8.1.2), giving the reparametrized model
Ct = [D exp (lka + lK − lCl) / (exp (lka) − exp (lK))]
× {exp [− exp (lK) t] − exp [− exp (lka) t]} (8.1.3)
where lCl = log(Cl), lka = log(ka), and lK = log(K).
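The two parametrizations are algebraically identical, which is easy to verify numerically; a sketch with function names of our choosing:

```python
import math

def conc(t, dose, ka, K, Cl):
    # Original parametrization (8.1.2)
    return (dose * K * ka / (Cl * (ka - K))
            * (math.exp(-K * t) - math.exp(-ka * t)))

def conc_log(t, dose, lka, lK, lCl):
    # Log-scale parametrization (8.1.3): ka = exp(lka), K = exp(lK), and
    # Cl = exp(lCl) are positive for any real-valued parameters
    num = dose * math.exp(lka + lK - lCl)
    den = math.exp(lka) - math.exp(lK)
    return num / den * (math.exp(-math.exp(lK) * t)
                        - math.exp(-math.exp(lka) * t))
```

For any t the two functions agree when lka = log(ka), lK = log(K), and lCl = log(Cl), but the second can be optimized over unconstrained real parameters.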
8.1.3 Quinidine
The third data set comes from a pharmacokinetics clinical study of the antiarrhythmic
drug Quinidine. A total of 361 Quinidine concentration measurements
were made on 136 hospitalized patients under varying dosage regimens. Addi-
tional data were collected on a set of nine covariates: age, height, weight, race,
smoking status, ethanol abuse, congestive heart failure, creatinine clearance,
and α-1-acid glycoprotein concentration. Some of these covariates varied for
the same patient during the course of the study, while others remained con-
stant. One of the main objectives of the study was to investigate relationships
between the individual pharmacokinetics parameters and the covariates. A full
description of the data can be found in Verme, Ludden, Clementi and Harris
(1992). Statistical analyses of these data using alternative modeling approaches
are given in Davidian and Gallant (1993) and Wakefield (1993).
The model that has been suggested for the Quinidine data is the one-
compartment open model with first-order absorption. This model can be defined
in a recursive way as follows.
Suppose that, at time t, the patient receives a dose dt and prior to that
time the last dose was given at time t′. The expected concentration, Ct, and
the apparent concentration in the absorption compartment, Cat are given by
Ct = Ct′ exp [−K (t − t′)] + [Cat′ka / (ka − K)] {exp [−K (t − t′)] − exp [−ka (t − t′)]}
Cat = Cat′ exp [−ka (t − t′)] + dt/V (8.1.4)
where V represents the apparent volume of distribution and ka and K are
respectively the absorption and the elimination rate constants.
When a patient receives the same dose d at regular time intervals ∆, the
model (8.1.4) converges to the so-called steady state model, where the expected
concentrations are given by
Ct = [dka / (V (ka − K))] [1/(1 − exp (−K∆)) − 1/(1 − exp (−ka∆))]
Cat = d / {V [1 − exp (−ka∆)]} (8.1.5)
Patients considered to be in steady state conditions have concentrations modeled
as above.
Finally, for a between-dosages time t, the model for the expected concentra-
tion Ct, given that the last dose was received at time t′, is identical to (8.1.4).
Using the fact that the elimination rate constant K is equal to the ra-
tio between the clearance (Cl) and the volume of distribution (V ), we can
reparametrize models (8.1.4) and (8.1.5) in terms of V , ka, and Cl.
In order to ensure that the estimates of V , ka, and Cl are positive, we can
rewrite models (8.1.4) and (8.1.5) in terms of lV = log(V ), lka = log(ka), and
lCl = log(Cl).
The initial conditions for the recursive model are C0 = 0 and Ca0 = d0/V ,
with d0 denoting the first dose received by the patient. It has been assumed
throughout the model definition that the bioavailability of the drug, i.e. the per-
centage of the administered dose that reaches the measurement compartment,
is equal to one.
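The recursion (8.1.4), together with the initial conditions and unit bioavailability stated above, can be sketched as follows (function names are ours):

```python
import math

def advance(C, Ca, dose, gap, ka, K, V):
    # One step of the recursive model (8.1.4): (C, Ca) are the expected
    # concentration and the apparent absorption-compartment concentration
    # at the previous dosing time t'; `gap` = t - t'; `dose` is given at t.
    eK = math.exp(-K * gap)
    eka = math.exp(-ka * gap)
    C_new = C * eK + Ca * ka / (ka - K) * (eK - eka)
    Ca_new = Ca * eka + dose / V
    return C_new, Ca_new

def simulate(doses, gaps, ka, K, V):
    # doses[0] is the first dose d0; gaps[i] is the time elapsed between
    # dose i and dose i+1.  Initial conditions: C0 = 0, Ca0 = d0/V.
    C, Ca = 0.0, doses[0] / V
    for dose, gap in zip(doses[1:], gaps):
        C, Ca = advance(C, Ca, dose, gap, ka, K, V)
    return C, Ca
```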
8.1.4 CO2 Uptake
The last data set considered here is the CO2 uptake data described in section 7.2.
The data, presented in Figure 8.1.3, consist of measurements of CO2 uptake (in
µmol/m2s) for six Echinochloa crus-galli plants from Quebec and six plants
from Mississippi at seven different concentrations of ambient CO2. Half the
plants from each type were chilled before the measurements were taken, while
the other half stayed at room temperature.
Figure 8.1.3: CO2 uptake rates (in µmol/m2s) for Quebec and Mississippi plants of Echinochloa crus-galli, control and chilled, at different ambient CO2 concentrations.
The nonlinear mixed effects model used to describe the CO2 uptake as a
function of the ambient CO2 concentration is
Uij = φ1i {1 − exp [−φ2i (Cj − φ3i)]} + εij (8.1.6)
where Uij denotes the CO2 uptake rate of the ith plant at the jth CO2 ambient
concentration; φ1i, φ2i, and φ3i denote respectively the asymptotic uptake rate,
the uptake growth rate, and the maximum ambient CO2 concentration at which
no uptake occurs for the ith plant; Cj denotes the jth ambient CO2 level;
and the εij are i.i.d. error terms with distribution N (0, σ2). The main purpose
of the study was to estimate the effect of the plant type (P ) and the chilling
treatment (T ) on the parameters φ1, φ2, and φ3.
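The fixed part of model (8.1.6) is easily coded; note that the mean uptake is exactly zero at C = φ3 and approaches φ1 as the ambient concentration grows (function name and parameter values below are ours, for illustration):

```python
import math

def co2_uptake(C, phi1, phi2, phi3):
    # Mean function of model (8.1.6):
    # phi1: asymptotic uptake rate; phi2: uptake growth rate;
    # phi3: ambient CO2 concentration at which no uptake occurs
    return phi1 * (1.0 - math.exp(-phi2 * (C - phi3)))
```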
8.2 Variance-Covariance Modeling
In this section we consider the questions of determining which parameters in
the model should have a random component and whether the scaled variance-
covariance matrix of the random effects (D) can be structured in a simpler
form, i.e. with fewer parameters than the unstructured form.
The first question that should be addressed in the analysis is choosing which
parameters should be random effects and which purely fixed effects. Our ap-
proach is to fit different prospective models and compare nested models us-
ing some information criterion statistics, e.g. the Akaike information criterion
(Sakamoto et al., 1986). One of the problems with this approach is deciding
which way to construct the nesting; from smaller to larger models, or the other
way around. Starting with a model where all parameters have associated ran-
dom effects and then removing unnecessary terms is probably the best strategy,
but may not be possible to implement if the model is badly overparametrized.
In these cases the variance-covariance matrix of the random effects may become
seriously ill-conditioned, making it difficult or impossible to converge. The
smaller to larger approach is another alternative in these cases, but has the dis-
advantage of the large number of models that may have to be fitted before the
desired one is found. There is yet another important aspect that is overlooked
by the model nesting approach: sometimes it is a linear combination of random
effects being treated as fixed that gives the best model reduction.
The strategy we suggest for choosing the random effects to be included in
the model is to start with all parameters as mixed effects, whenever no prior
information about the random effects variance-covariance structure is available
and convergence is possible. Then we examine the eigenvalues of the estimated
D matrix, checking if one, or more, are close to zero. The associated eigenvec-
tor(s) would then give an estimate of the linear combination of the parameters
that could be taken as fixed. We used the Akaike information criterion to decide
between alternative models, choosing the one with the smaller AIC.
Small eigenvalues may arise when the relative magnitude of the scales of
the parameters in the model are quite different, without necessarily implying
overparametrization. Therefore we suggest using a normalized version of the
variance-covariance matrix that is scale invariant. There are different choices
of normalized D, the most common being the correlation matrix. This is not a
particularly good choice in the present context, since all random effects would
then have normalized variance equal to one and we would not be able to iden-
tify those with relatively small dispersion (which would be natural candidates
to be dropped from the model). Whenever the A and B matrices in (4.1.2)
are incidence-like matrices (i.e. with just one nonzero entry per row), a more
convenient choice of normalization is the coefficient of variation (CV) matrix
DCV with
[DCV ]ij = [D]ij / |βk(i) βk(j)| (8.2.1)
where βk represents the kth fixed effect and k(i), k(j) represent the indices of
the fixed effects associated with the ith and jth random effects. When the
nonzero elements of the ith row of A and B are equal to one, the ith diagonal
element of DCV is equal to the square of the coefficient of variation of φi. We
note that, in the majority of real life applications of model (4.1.1), A and B
will be incidence-like matrices.
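With incidence-like A and B, computing DCV from D, β, and the index map k(·) in (8.2.1) amounts to an elementwise rescaling; a sketch with numpy (function name is ours):

```python
import numpy as np

def cv_matrix(D, beta, k):
    # Coefficient-of-variation normalization (8.2.1):
    # [DCV]ij = [D]ij / |beta_k(i) * beta_k(j)|,
    # where k[i] indexes the fixed effect tied to the i-th random effect
    scale = np.abs(np.asarray(beta)[np.asarray(k)])
    return np.asarray(D) / np.outer(scale, scale)
```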
To illustrate the use of this method we consider the examples described in
section 8.1. We do not include any analyses of residuals here, but in all cases
they did not indicate violations of the model’s assumptions. All maximum
likelihood calculations in the examples were done using the nlme function in S
(cf. chapter 7).
8.2.1 Pine Trees
Assuming that all three parameters in model (8.1.1) have both a fixed and a
random component, the corresponding nonlinear mixed effects model is
yij = (β1 + bi1) − (β2 + bi2) exp [− (β3 + bi3) tj ] + εij (8.2.2)
The maximum likelihood estimates of the parameters are σ2 = 0.397,
β = (102.26, 110.85, 0.039)T ,

D = [  334.43    328.53    −0.15
       328.53    322.72    −0.14
       −0.15     −0.14      8.26 × 10−5 ] , and

DCV = [  0.032    0.029   −0.037
         0.029    0.026   −0.033
        −0.037   −0.033    0.054 ] .
The value of the loglikelihood at convergence is −34.618, corresponding to an
AIC of 89.236.
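The AIC values quoted throughout are AIC = −2 loglik + 2 npar. Model (8.2.2) has 10 free parameters (three fixed effects, six free elements of the symmetric 3 × 3 matrix D, and σ2), which reproduces the reported value:

```python
def aic(loglik, n_par):
    # Akaike information criterion: smaller values indicate better models
    return -2.0 * loglik + 2.0 * n_par

# Model (8.2.2): loglikelihood -34.618 with 10 parameters
model_822_aic = aic(-34.618, 10)  # 89.236
```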
The eigenvalues of DCV , 0.1056, 0.0065, and 8.152 × 10−15, give a clear
indication of rank deficiency. The eigenvector corresponding to the smallest
eigenvalue, converted back to the original scale of the random effects and nor-
malized, is (0.7005,−0.7131,−0.0278)T , suggesting that the difference between
the first two random effects can be considered nonrandom. This can be checked
by reparametrizing model (8.1.1) as follows
yij = φ′1 + (φ′2 − φ′1) exp (−φ′3 tj) + εij (8.2.3)
where φ′1 and φ′3 continue to have the same interpretation as φ1 and φ3 in
the previous parametrization, but φ′2 now represents the height at age zero (i.e.
φ1 − φ2). Using this reformulation of the model, with φ′1 and φ′3 as random effects,
we get the following estimates β = (102.253,−8.574, 0.039)T , σ2 = 0.400, and
D = [  251.060          −0.102
       −0.102    5.967 × 10−5 ] .

The loglikelihood at convergence is equal to
−34.630, corresponding to an AIC of 83.262, which is considerably smaller than
the AIC of model (8.2.2). The AIC values obtained by considering each of φ1,
φ2, and φ3 at a time as fixed in model (8.1.1) are respectively 86.201, 89.917,
and 89.078, all larger than the AIC of the reduced reparametrized model (8.2.3).
The eigenvalues of the DCV matrix corresponding to the reduced model
(8.2.3) are 0.058 and 0.005, suggesting that no further reduction in the number
of random effects can be attained. If we refit the reduced model (8.2.3)
with either φ′1 or φ′3 as a fixed effect, we get AIC values of 85.105 and 85.917
respectively, both larger than in the previous model. It is interesting to note
that the eigenvalues of D for model (8.2.3), 251.06 and 0.000018, could at first
suggest that further reductions were possible. In fact they are just reflecting the
different scales in which φ′1 and φ′3 are measured (the eigenvector corresponding
to the smallest eigenvalue is (−0.0004, −0.9999)T ).
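The eigenanalysis used above is a standard symmetric eigendecomposition; a sketch with numpy, using the rounded DCV estimate reported for model (8.2.2):

```python
import numpy as np

# Rounded DCV estimate for the pine trees model (8.2.2)
Dcv = np.array([[ 0.032,  0.029, -0.037],
                [ 0.029,  0.026, -0.033],
                [-0.037, -0.033,  0.054]])

vals, vecs = np.linalg.eigh(Dcv)   # eigenvalues in ascending order
# An eigenvalue near zero relative to the largest one signals rank
# deficiency; the matching eigenvector (vecs[:, 0]) estimates the linear
# combination of random effects that can be treated as fixed.
near_zero = abs(vals[0]) < 1e-2 * vals[-1]
```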
8.2.2 Theophylline
The Theophylline data give yet another example where convergence is attained
for the model in which all parameters are mixed effects. We refer to this model
as model I.
The AIC of model I is 124.03 and the MLE of the DCV matrix has eigen-
values 4.324, 0.019, and 2.031 × 10−7 indicating that the model is probably
overparametrized. The eigenvector corresponding to the smallest eigenvalue,
converted back to the original scale and normalized is (0.464, 0.020,−0.886)T ,
suggesting that the lCl random effect is approximately equal to twice the lK
random effect. Recalling that the volume of distribution (V) is equal to the
ratio between Cl and K, we see that lCl = 2lK implies that the ratio between
V and K, which we will denote by R, is a fixed effect. The recommendation at
this point would be to contact a pharmacologist and check the plausibility of
this finding, as well as the interpretability of the parameter R. We will proceed
with the analysis here for the purpose of illustrating the use of the proposed
model building techniques.
We reparametrized model (8.1.3) in terms of lka, lK, and lR = log(R),
letting only the first two parameters be mixed effects. The AIC of this reduced
model, called model II, was 118.20, considerably smaller than the AIC of model
I. The eigenvalues of the estimated DCV matrix are 0.356 and 0.158 indicating
that no further linear combinations of random effects could be eliminated from
the model. In fact, if we remove either the lka or the lK random effect we get
AIC values of respectively 203.865 and 200.135, both substantially worse than
in model II.
It is interesting to compare model II with the models obtained by considering
each parameter at a time in model I as a fixed effect, to check if a more easily
understood model could be used. The AIC of the models considering each of
lCl, lka, and lK at a time as fixed effects are respectively 163.224, 194.189, and
125.446, all considerably larger than the AIC of model II. Note however that the
elimination of the lK random effect from model I has a much smaller impact
on the AIC value, than the elimination of either the lCl or the lka random
effects. This suggests that if one is willing to correct the overparametrization
problem by dropping one of the random effects from the model (and not a linear
combination of them, as was done in model II), lK would be the natural choice.
The estimated correlation between the lK and the lka random effects in
model II was −0.132, suggesting that the two random effects could be regarded
as independent and a diagonal D used. The AIC of this model (III) was 116.388
indicating that it should be preferred over the previous models. No further
reduction in the number of parameters in D could be obtained and we concluded
that model III was the most adequate.
8.2.3 Quinidine
The Quinidine data provide an example where convergence cannot be attained
for the model with all parameters as mixed effects, called model I. The data are
characterized by few observations on many patients: for 46 patients there is only
one observation of Quinidine concentration and for 32 patients only two. As a
consequence, the optimization of the loglikelihood for model I becomes a very
ill-conditioned numerical problem, with the optimizing algorithm alternating
between equivalent solutions (in terms of the value of the loglikelihood) without
ever converging.
Different strategies can be used to try to circumvent the nonconvergence
problem:
• try to achieve convergence using a diagonal D and examine the relative
variability of the random effects, investigating the possibility of eliminat-
ing one, or more, of them from the model;
• force convergence (e.g. letting the algorithm run until a pre-established
maximum number of iterations) and examine the corresponding DCV ma-
trix for rank deficiency;
• try to achieve convergence for models with a smaller number of random
effects.
Convergence could not be achieved even for a diagonal D and so the second
strategy was used here. We forced convergence after ten iterations of Lindstrom
and Bates’ alternating algorithm. The AIC of this forced convergence fit was
344.74. The eigenvalues of the DCV matrix were equal to 74.526, 0.032, and
1.363 × 10−9, suggesting that the model was overparametrized.
The eigenvector corresponding to the smallest eigenvalue, converted to the
original scale of the random effects and normalized, was (0.097, 0.415,−0.905)T ,
indicating that the lka random effect was about twice the lV random effect. In
terms of the original parameters, that is equivalent to assuming that R = ka/V²
is a fixed effect.
point would be to consult a pharmacologist about the physical meaning, if any,
of the parameter R. It must be said, though, that in this case R seems to be
a rather awkward quantity and probably lacks any practical meaning. Most
likely the poor quality of the data is responsible for the convergence problems
in general, and the rank deficiency observed for DCV in particular. Therefore,
by using a reparametrization of model I in which R was incorporated as a fixed
effect, we would run the risk of overfitting low quality data. We decided to
follow a more conservative approach trying to solve the overparametrization
problem by removing each random effect at a time from model I.
Convergence was attained for the models in which each of lCl, lka, and lV
at a time were treated as fixed. The corresponding AIC values were respectively
501.925, 341.782, and 365.409, indicating that the model in which lka is the only
purely fixed effect, called model II, should be preferred. For the sake of com-
parison, we also fitted the reparametrized model in which R was incorporated
as the only purely fixed effect. The corresponding AIC was 338.620 and though
this suggests that the reparametrized model fits the data better than model II,
we will keep the latter for the reasons described in the previous paragraph.
The estimated standard deviations of the random effects in model II were
0.323 and 0.310 and the estimated correlation coefficient was 0.05. This sug-
gested that the random effects had approximately the same variance and were
uncorrelated. A multiple of the identity matrix was used to model D and the
AIC of this reduced model (III) was 338.205, considerably smaller than those
of both models I and II. No further reductions were possible, since, if we re-
moved either the lCl random effect or the lV random effect from model III, we
obtained AIC values of 497.968 and 339.799 respectively.
In section 8.3.1 we explore the use of covariates to explain cluster-to-cluster
variability observed for the lCl and the lV random effects.
8.2.4 CO2 Uptake
Convergence using the model with all parameters as mixed effects (called model
I) was attained for the CO2 uptake data. The eigenvalues of the DCV ma-
trix were 0.051, 0.005, and 3.201 × 10−7, suggesting that the model was over-
parametrized. The eigenvector corresponding to the smallest eigenvalue, con-
verted to the original scale of the random effects and normalized, was (0.683,
−0.00008,−0.730)T indicating that φ1 −φ3 was probably nonrandom. In terms
of the parameters in model (8.1.6), this implies that the difference between the
asymptotic CO2 uptake rate (φ1) and the maximum ambient concentration of
CO2 at which no uptake occurs (φ3) is a fixed effect. Graphically this implies
that if the asymptotic CO2 uptake rate of a given plant is ∆ units above that of
another plant, so will be the concentration at which no CO2 uptake is present.
The AIC of model I was 268.44.
Since the linear dependence between φ1 and φ3 in model I seems to have a
meaningful practical interpretation, we decided to fit the reduced model, called
model II, in which this dependence was incorporated. Instead of reparametrizing
the model though, we decided to set A = I, to set B equal to the 3 × 2 matrix
with rows (1, 0), (0, 1), and (1, 0), and to set bi = (b1i, b2i)T
in the model specification of φ, cf. (4.1.2). The reason for this alternative
formulation was that we wanted to preserve the same parameters as in the
original model. This allows an easier interpretation of the effects of plant type
and chilling treatment on the parameters. Note that φ1 and φ3 share the same
random effect in model II.
The AIC of model II was 262.44, considerably better than that of model
I. The estimated correlation between the random effects in model II, −0.226,
suggested that a diagonal D could be used. The AIC of this reduced model
(III) was 260.62, indicating that it should be preferred over model II. No further
reductions were possible: convergence was not attained for the model with just
φ2 random and the AIC for the model with φ2 as a purely fixed effect was 261.95.
The CO2 uptake data are an example of a designed experiment in which most
of the variability in the random effects is related to differences in the treatment
effects. This issue will be explored in detail in section 8.3.2.
8.3 Covariate Modeling
In this section we consider the use of covariates to model random effects variabil-
ity. This variability can either be related to natural cluster-to-cluster variation,
or caused by differences in covariate values between and/or within clusters.
The first questions to be addressed in the covariate modeling process are the
determination of which variables are potentially useful in explaining random
effects variation and which random effects may have their variability explained
by covariates. This is probably best achieved by analyzing plots of the random
effects estimates versus the covariates, looking for trends and patterns. The
conditional modes of the random effects (Lindstrom and Bates, 1990) will be
used here for this purpose.
After the candidate covariates have been chosen, a decision has to be made
on how to test for their inclusion in the model. The number of extra parame-
ters to be estimated tends to grow considerably with the inclusion of covariates
and their associated random effects in the model. If the number of covari-
ates/random effects combinations is large, we suggest using a forward stepwise
type of approach in which covariates are included one at a time and the po-
tential importance of the remaining covariates is (graphically) assessed at each
step. The decision on whether or not to include a covariate can be based on the
AIC of the fits with and without it. Another question that has to be addressed
when including a covariate in the model, is which of the new parameters should
be random or purely fixed. We suggest using an approach similar to the one
described in section 8.2, for modeling the variance-covariance structure: when-
ever no prior information is available and convergence is possible, start with a
saturated model (in which all new parameters are random) and, by examining
the eigenstructure of the estimated D (or DCV ) matrix, search for plausible
structures with fewer parameters. We use the Quinidine and the CO2 uptake
data to illustrate this model building approach. We reiterate that any model
building strategy is not complete without a careful analysis of residuals and
expert advice. In all examples considered here the residual analyses did not
indicate departures from the assumptions in the model.
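The forward stepwise loop just described can be sketched generically; `fit` below is a hypothetical stand-in for refitting the mixed effects model with a given set of covariates and reading off its AIC:

```python
def forward_stepwise(candidates, fit):
    # Greedy forward selection on AIC: at each step add the covariate whose
    # inclusion lowers AIC the most; stop when no candidate improves the fit.
    # `fit(selected)` is assumed to return an object with an `aic` attribute.
    selected = []
    remaining = list(candidates)
    best_aic = fit(selected).aic
    while remaining:
        trial = {c: fit(selected + [c]).aic for c in remaining}
        best = min(trial, key=trial.get)
        if trial[best] >= best_aic:
            break                      # no candidate improves the AIC
        selected.append(best)
        remaining.remove(best)
        best_aic = trial[best]
    return selected, best_aic
```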
8.3.1 Quinidine
Figure 8.3.1 presents the scatter plots of the conditional modes of the lCl ran-
dom effect, based on model III of section 8.2.3, versus the available covariates
(when the covariate value changed over time, the mode was used). A loess
smoother (Cleveland, Grosse and Shyu, 1992) was included in the continuous
covariates plots to help the visualization of possible trends.
Figure 8.3.1: Conditional modes of the lCl random effect in model III versus available covariates.
Clearance appears to decrease with α-1-acid glycoprotein concentration and age
and to increase with weight and height. There is also some evidence that clear-
ance decreases with severity of congestive heart failure and is smaller in Blacks
than in both Caucasian and Latins. Clearly the α-1-acid glycoprotein concen-
tration is the most important covariate for explaining the lCl cluster-to-cluster
variation and a straight line seems adequate to model the observed relationship.
Figure 8.3.2 presents the scatter plots of the conditional modes of the lV
random effect versus the available covariates. None of the covariates seems help-
ful in explaining the variability of this random effect and we did not pursue the
modeling of its variability any further.
Initially only the α-1-acid glycoprotein concentration was included in the
model to explain the lCl random effect variation according to a linear model.
In the notation of (4.1.2) this modification of models (8.1.4) and (8.1.5) is ac-
complished by writing
lClij = (β1 + bi1) + (β2 + bi2) glycoproteinij (8.3.1)
All parameters were treated as mixed in this first attempt, called model IV,
but the random effects associated with lCl were assumed independent of the lV
random effect, preserving the covariance structure of model III. The AIC of this
model was 215.796 indicating a considerable gain in goodness-of-fit when com-
pared to model III (AIC of 338.205). Using the same strategy as in section 8.2
to model the variance-covariance matrix of the random effects, we selected a
model in which the lCl random effect was independent of the lV random effect,
the variance-covariance matrix of the lCl random effects was unstructured, but
the variances of the intercept term in the lCl random effect, bi1 in (8.3.1), and
the lV random effect were the same. The AIC of this model (V) was 213.788.
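Model (8.3.1) makes the patient-specific log-clearance a linear function of the α-1-acid glycoprotein concentration; a minimal sketch (function and parameter names are ours):

```python
import math

def lCl(glyco, beta1, beta2, b1=0.0, b2=0.0):
    # Model (8.3.1): lCl_ij = (beta1 + b_i1) + (beta2 + b_i2) * glyco_ij,
    # where (b_i1, b_i2) are the patient-specific random effects
    return (beta1 + b1) + (beta2 + b2) * glyco

def clearance(glyco, beta1, beta2, b1=0.0, b2=0.0):
    # Clearance stays positive because it is modeled on the log scale
    return math.exp(lCl(glyco, beta1, beta2, b1, b2))
```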
Figure 8.3.2: Conditional modes of the lV random effect in model III versus available covariates.
Figures 8.3.3 and 8.3.4 present the scatter plots of the conditional modes of
the lCl random effect in model V versus the available covariates. These plots
indicate that the intercept random effect does not vary systematically with any
of the covariates, but the slope random effect tends to increase with weight and
height and is smaller among Blacks and patients with a previous history of
congestive heart failure. This suggests an interaction between the effects of these
covariates and the α-1-acid glycoprotein on the Quinidine clearance. At this
point expert advice would be needed to clarify the plausibility of this hypothesis.
Since this was not possible here, we proceeded with the model building
analysis just for the purpose of illustrating the use of the proposed methodology.
Figure 8.3.3: Conditional modes of the lCl intercept random effect in model V versus available covariates.
Using the forward stepwise approach we included the interactions between α-1-acid
glycoprotein and weight (as a linear predictor), race (as an indicator
variable of Black/not Black status), and congestive heart failure (as an
indicator variable of previous/no previous history of congestive heart failure) in the
[Figure: scatter plots of the lCl slope conditional modes against glycoprotein concentration, creatinine clearance, congestive heart failure, age, height, weight, race, smoking status, and ethanol abuse]

Figure 8.3.4: Conditional modes of the lCl slope random effect in model V versus available covariates.
model, in this order. The same random effect variance-covariance structure as
in model V was used in all cases. The corresponding AIC values were
respectively 210.117, 204.556, and 199.893. In all three cases substantial
reductions in the AIC values were observed. The random effects plots of the
last model (with all three interactions with α-1-acid glycoprotein included)
did not indicate any other candidate covariates to be included in the model
and we concluded that the model was adequate.
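The AIC bookkeeping behind this forward stepwise comparison is simple: each candidate model is scored by −2 × loglikelihood + 2 × (number of parameters), and the smallest score wins. A minimal sketch of that comparison — the model names, log-likelihoods, and parameter counts below are hypothetical, not the fitted values from the Quinidine analysis:

```python
def aic(loglik, n_params):
    """Akaike Information Criterion: -2 * loglik + 2 * n_params."""
    return -2.0 * loglik + 2.0 * n_params

# Hypothetical fits, each adding one interaction with
# alpha-1-acid glycoprotein to the previous model.
candidates = {
    "base":            (-101.0, 8),
    "+ weight":        (-96.0, 9),
    "+ race":          (-93.0, 10),
    "+ heart failure": (-90.0, 11),
}

scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)  # smaller AIC is better
```

In the analysis above the AIC dropped at every step (210.117, 204.556, 199.893), so each interaction was retained.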
8.3.2 CO2 Uptake Data
Figure 8.3.5 presents the plots of the conditional modes of the random effects
of model III in section 8.2.4 against plant type, using the chilling treatment
as a plotting symbol. The plots indicate a strong relationship between the
φ1/φ3 random effect and both the plant type and the chilling treatment.
Apparently the Quebec plants have higher values of φ1 and φ3 than the
Mississippi plants, and chilling the plants causes a reduction in both φ1 and
φ3 that is more pronounced in the Mississippi plants than in the Quebec
plants, suggesting an interaction between plant type and chilling treatment.
The φ2 random effects plot suggests a possible interaction between plant type
and chilling treatment with respect to their effect on φ2, but there is
considerable variability in the random effects estimates, making the
statistical significance of this interaction unclear.
[Figure: two panels of conditional modes versus plant type (Quebec, Mississippi), with control and chilled plants distinguished by plotting symbol]

Figure 8.3.5: Conditional modes of the φ1/φ3 random effect (a) and the φ2 random effect (b) in model III versus plant type, using chilling treatment as a symbol.
We initially considered a model (IV) in which both φ1 and φ3 were expressed as
a full 2² factorial model in plant type and chilling treatment, with the
intercept and all treatment effects random and an unstructured
variance-covariance matrix. To keep consistency with the variance-covariance
structure of model III, the random effects were the same in φ1 and φ3. No
covariates were used for φ2 in this first model. As in model III, the φ1/φ3
random effects were assumed independent of the φ2 random effect.
The AIC of model IV was 245.38, considerably smaller than the AIC of
model III. Analysis of the estimated D matrix of model IV indicated severe
overparametrization. Using the strategy described in section 8.2, we chose a
model (V) in which only the intercept of the φ1/φ3 random effect was random.
The AIC of model V was 230.63. No further terms could be dropped, nor any
covariates included in model V. Table 8.3.1 gives the AIC of the models
obtained by dropping one covariate term at a time from model V. Note that the
AIC increases for each of the covariate terms and hence none should be removed
from the model.
Table 8.3.1: AIC of models in which one covariate was dropped from model V

    Parameter   Coefficient   AIC
    φ1          P             258.95
                T             246.10
                P × T         237.36
    φ3          P             236.62
                T             233.01
                P × T         239.94
8.4 Conclusions
The analysis of the eigenstructure of the estimated variance-covariance matrix
(D) of the random effects is a useful tool to determine which terms in the
model should be random and which should be purely fixed. The estimated D
matrix also provides useful information to identify structured
variance-covariance patterns. Information criterion statistics, such as the
AIC, can be used as guidelines for model selection, but analysis of residuals
and consultation with experts in the field of application of the model should
also be used.
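The eigenstructure check described above can be sketched numerically: eigendecompose the estimated D and flag directions whose variance is negligible relative to the largest eigenvalue, suggesting the corresponding combinations of random effects could be made purely fixed. This is an illustrative sketch with a made-up matrix and tolerance, not the thesis software:

```python
import numpy as np

def flag_overparametrization(D_hat, tol=1e-3):
    """Return eigenvalues (ascending), eigenvectors, and a boolean mask
    marking eigenvalues negligible relative to the largest one."""
    eigvals, eigvecs = np.linalg.eigh(D_hat)
    negligible = eigvals / eigvals.max() < tol
    return eigvals, eigvecs, negligible

# Hypothetical 3x3 estimated D with one nearly degenerate direction
# (the first two random effects are almost perfectly correlated).
D_hat = np.array([[1.0, 0.999, 0.0],
                  [0.999, 1.0, 0.0],
                  [0.0, 0.0, 2.0]])
vals, vecs, drop = flag_overparametrization(D_hat)
```

A flagged direction suggests dropping (or fixing) the corresponding linear combination of random effects, as was done in moving from model IV to model V above.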
The goodness-of-fit and interpretability of a mixed effects model can be
substantially enhanced through the inclusion of covariates to explain random
effects variability. Information criterion statistics can again be used for model
selection, together with analysis of residuals and expert advice.
We restricted ourselves here to mixed effects models in which the cluster
errors, ε in model (4.1.1), were i.i.d., but other covariance structures (e.g.
autoregressive processes) can easily be incorporated into mixed models (Chi
and Reinsel, 1989; Lindstrom and Bates, 1990). There is usually a trade-off
between the number of random effects incorporated in the model and the
complexity of the cluster errors covariance structure (Jones, 1990). Further
research is needed in that area, especially from a model building perspective.
Chapter 9
Conclusions and Suggestions for
Future Research
Mixed effects models constitute a powerful tool for modeling dependence within
clustered data. They give an intuitive interpretation for the source and the
structure of the dependence and can easily handle the unbalanced data that are
frequently encountered in many areas of scientific investigation.
Despite their usefulness, mixed effects models remain a mystery to many
researchers who could benefit from their application. We believe that this
unfortunate situation could be changed if easy-to-use and reliable software
were available for fitting and analyzing mixed effects models. This has been
the main goal of our research.
9.1 Conclusions
The set of S functions and methods described in chapter 7 constitutes a
user-friendly, efficient, flexible, and reliable implementation for the
analysis of mixed effects models. We hope it will facilitate researchers'
access to these kinds of models. This software has been available in the S
collection at StatLib for over a year now and has been successfully used by
researchers from a broad range of areas.
The efficiency of the code is partially explained by the use of a loose coupling
algorithm (Soo and Bates, 1992) that allows the size of the optimization problem
to increase only linearly with the number of clusters, instead of quadratically
as would otherwise be expected.
The parametrizations for variance-covariance matrices described in chapter 6
are also fundamental for the code’s efficiency and reliability, since they allow
the unconstrained estimation of the variance-covariance components, greatly
simplifying the optimization problem.
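As one concrete instance of such a parametrization (a sketch of the general idea, not the thesis code), a Cholesky factor with exponentiated diagonal maps any real vector to a positive definite matrix, so the optimizer can work in a fully unconstrained space:

```python
import numpy as np

def unconstrained_to_cov(theta, q):
    """Map an unconstrained vector of length q*(q+1)//2 to a positive
    definite q x q matrix: fill a lower triangle, exponentiate its
    diagonal (forcing positivity), and form L @ L.T."""
    L = np.zeros((q, q))
    L[np.tril_indices(q)] = theta
    L[np.diag_indices(q)] = np.exp(np.diag(L))
    return L @ L.T

theta = np.array([0.0, 0.5, -0.3])  # any real values are valid
D = unconstrained_to_cov(theta, 2)  # symmetric positive definite
```

Because every real vector yields a valid covariance matrix, a general-purpose optimizer can search over theta without positive definiteness constraints.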
The asymptotic results for the linear mixed effects models, proven in
chapter 3, provide the justification for the common practice of using the
information matrix in conjunction with the normal distribution to derive
confidence regions for the parameters in the model. These results are also
important in showing that the estimates of the fixed effects and the
variance-covariance components are asymptotically uncorrelated. This probably
explains the success of the alternating algorithm used in the nonlinear mixed
effects code and described in section 5.1.1.
Different approximations to the loglikelihood function in the nonlinear mixed
effects model were analyzed in chapter 5. The alternating approximation
(5.1.2), suggested by Lindstrom and Bates (1990) and used in the software
implementation, in general gives accurate and reliable estimation results. If
more exact results are needed, the Laplacian approximation (5.1.6) or the
adaptive Gaussian approximation (5.1.10) can be used instead. Possibly the
best strategy is a hybrid scheme in which the alternating algorithm is used to
get good initial values for the more refined Laplacian or adaptive Gaussian
approximations.
Model building in mixed effects models constitutes an interesting and
difficult topic. Several questions that do not have a parallel in fixed
effects models arise when one has to choose a mixed effects model. Possibly
the most difficult question is determining which parameters should be mixed
effects and which should be purely fixed. Other important questions are
related to the use of simpler structured variance-covariance matrices for the
random effects and the choice of covariates to explain cluster-to-cluster
parameter variability. Strategies for addressing these questions in the
context of nonlinear mixed effects models, based on the eigenstructure of the
estimated variance-covariance matrix of the random effects, were described in
chapter 8. These strategies were illustrated through the analysis of real data
examples.
9.2 Future Research
Considerable research effort is currently dedicated to expanding the
applicability of mixed effects models and to improving their estimation
methods. We include here some topics for future research that were not covered
in this dissertation.
9.2.1 Asymptotics
There are at least two directions in which the results given in chapter 3 can
be extended. We assumed in this dissertation that the error term
variance-covariance matrix Λ was of the form σ²I, but it may be of interest to
consider the more general form Λ = Λ(ρ), where ρ is a (low dimensional)
parameter vector; e.g. Λ may have an AR(1) structure (Chi and Reinsel, 1989).
We feel the asymptotic results given in chapter 3, such as the asymptotic
normality, can be extended to the (restricted) maximum likelihood estimators
of the error term variance-covariance parameters ρ.
Another possible direction in which the asymptotic results given in chapter 3
can be extended is in (restricted) maximum likelihood estimation for nonlinear
mixed effects models. One of the difficulties with this extension comes from the
fact that the likelihood function usually does not have a closed form expression
in nonlinear mixed effects models. The approximations described in chapter 5
can be used as a starting point for investigating solutions to this problem.
9.2.2 Parametrizations
Only unstructured variance-covariance matrices were considered in chapter 6
but, in many applications of mixed effects models, structured matrices are used
instead (Jennrich and Schluchter, 1986). It is therefore important to derive
parametrizations for structured variance-covariance matrices that allow uncon-
strained estimation of the associated parameters. This issue is particularly im-
portant in mixed models that allow more general variance-covariance structures
for the error term, since these will usually correspond to structured variance-
covariance matrices. Unrestricted parametrizations are easily derived for sim-
pler structures, such as diagonal and compound symmetry, but are far from
trivial for more complex structures, such as generalized autoregressive matrices.
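For the compound symmetry case the idea is straightforward: σ²[(1 − ρ)I + ρJ] is positive definite exactly when σ² > 0 and ρ ∈ (−1/(q−1), 1), so two unconstrained reals can be mapped onto that region with a log and a scaled logistic. A hypothetical sketch of such a map:

```python
import math
import numpy as np

def compound_symmetry(theta, q):
    """Map two unconstrained reals to a positive definite q x q
    compound symmetry matrix sigma^2 * ((1 - rho) * I + rho * J):
    theta[0] -> log(sigma^2); theta[1] -> rho via a logistic scaled
    onto the valid interval (-1/(q-1), 1)."""
    sigma2 = math.exp(theta[0])
    lo = -1.0 / (q - 1)
    rho = lo + (1.0 - lo) / (1.0 + math.exp(-theta[1]))
    return sigma2 * ((1.0 - rho) * np.eye(q) + rho * np.ones((q, q)))

M = compound_symmetry([0.0, 0.0], 3)  # sigma^2 = 1, rho = 0.25
```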
The asymptotic properties of the different parametrizations considered in
chapter 6 have not yet been studied and certainly constitute an interesting
research topic. It may be that some of the parametrizations give faster rates of
convergence to normality than others and this could be used as a criterion for
choosing among them.
9.2.3 Assessing Variability
Once a model has been chosen to represent the data, measures of variability
for the estimates and confidence regions on the model’s parameters are usually
needed for inferential purposes. Asymptotic theory can certainly be used at a
preliminary stage. These results have only been proven for the linear mixed
effects model (cf. chapter 3), but, at least as a first order-like
approximation, could also be used for nonlinear mixed effects models.
More refined methods, such as likelihood profile traces and contours (Bates
and Watts, 1988), can also be used to assess the variability in the estimates,
but these will generally be computer intensive. A compromise between the
asymptotic and profiling methodologies is to use a linear approximation to the
loglikelihood, as in (5.1.2), to calculate the profile traces and contours. This
constitutes a considerably less intensive computational problem than profiling
the loglikelihood directly. Alternatively, if the fixed effects are the primary
object of interest, we could profile the penalized nonlinear least squares problem
corresponding to a fixed D = D̂.
Another (computationally intensive) alternative to assess the variability in
the estimates is to use bootstrap methods (Efron and Tibshirani, 1993) to
estimate standard errors and confidence regions.
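As a toy illustration of the idea (a case-resampling sketch for a simple statistic, not the clustered resampling a mixed model would require — there one would resample whole clusters):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_se(data, statistic, n_boot=2000):
    """Case-resampling bootstrap estimate of the standard error of a
    statistic: resample with replacement, recompute, take the spread."""
    n = len(data)
    reps = np.empty(n_boot)
    for b in range(n_boot):
        reps[b] = statistic(data[rng.integers(0, n, size=n)])
    return reps.std(ddof=1)

data = rng.normal(loc=10.0, scale=2.0, size=100)
se = bootstrap_se(data, np.mean)  # near the analytic 2/sqrt(100) = 0.2
```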
More research is needed to determine which methods provide the most reliable
statistical results and to compare their relative computational performance.
Bibliography
Abramowitz, M. and Stegun, I. A. (1964). Handbook of Mathematical Functions
with Formulas, Graphs, and Mathematical Tables, Dover, New York.
Airy, G. B. (1861). On the Algebraical and Numerical Theory of Errors of
Observations and the Combinations of Observations, MacMillan, London.
Bates, D. M. and Watts, D. G. (1980). Relative curvature measures of
nonlinearity, Journal of the Royal Statistical Society, Ser. B 42: 1–25.
Bates, D. M. and Watts, D. G. (1988). Nonlinear Regression Analysis and Its
Applications, Wiley, New York.
Beal, S. and Sheiner, L. (1980). The NONMEM system, American Statistician
34: 118–119.
Bennett, J. E. and Wakefield, J. C. (1993). Markov chain Monte Carlo for
nonlinear hierarchical models, Technical Report TR-93-11, Statistics Section,
Imperial College.
Chambers, J. M. and Hastie, T. J. (eds) (1992). Statistical Models in S,
Wadsworth, Belmont, CA.
Chi, E. M. and Reinsel, G. C. (1989). Models for longitudinal data with random
effects and AR(1) errors, Journal of the American Statistical Association
84: 452–459.
Cleveland, W. S., Grosse, E. and Shyu, W. M. (1992). Local regression models,
Statistical Models in S, Wadsworth, Belmont, CA, chapter 8.
Crump, S. L. (1947). The Estimation of Variance in Multiple Classification,
PhD thesis, Department of Statistics, Iowa State University.
Davidian, M. and Gallant, A. R. (1992). Smooth nonparametric maximum
likelihood estimation for population pharmacokinetics, with application to
quinidine, Journal of Pharmacokinetics and Biopharmaceutics 20: 529–556.
Davidian, M. and Gallant, A. R. (1993). The nonlinear mixed effects model
with a smooth random effects density, Biometrika 80: 475–488.
Davidian, M. and Giltinan, D. M. (1993). Some simple methods for estimating
intraindividual variability in nonlinear random effects models, Biometrics
49: 59–73.
Davis, P. J. and Rabinowitz, P. (1984). Methods of Numerical Integration,
second edn, Academic Press, New York.
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood
from incomplete data via the EM algorithm (c/r: P22-37), Journal of the
Royal Statistical Society, Ser. B 39: 1–22.
Draper, N. R. and Smith, H. (1981). Applied Regression Analysis, 2nd edn,
Wiley, New York.
Efron, B. and Tibshirani, R. J. (1993). An Introduction to the Bootstrap,
Chapman & Hall, New York.
Fisher, R. A. (1925). Statistical Methods for Research Workers, Oliver and
Boyd, London.
Gallant, A. R. and Nychka, D. W. (1987). Seminonparametric maximum likelihood
estimation, Econometrica 55: 363–390.
Gelfand, A. E., Hills, S. E., Racine-Poon, A. and Smith, A. F. M. (1990).
Illustration of Bayesian inference in normal data models using Gibbs sampling,
Journal of the American Statistical Association 85(412): 972–985.
Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions
and the Bayesian restoration of images, IEEE Transactions on Pattern
Analysis and Machine Intelligence 6: 721–741.
Geweke, J. (1989). Bayesian inference in econometric models using Monte Carlo
integration, Econometrica 57: 1317–1339.
Golub, G. H. (1973). Some modified matrix eigenvalue problems, SIAM Review
15: 318–334.
Golub, G. H. and Welsch, J. H. (1969). Calculation of Gaussian quadrature
rules, Mathematics of Computation 23: 221–230.
Grizzle, J. E. and Allen, D. M. (1969). Analysis of growth and dose response
curves, Biometrics 25: 357–382.
Hartley, H. O. and Rao, J. N. K. (1967). Maximum likelihood estimation for
the mixed analysis of variance model, Biometrika 54: 93–108.
Harville, D. A. (1974). Bayesian inference for variance components using only
error contrasts, Biometrika 61: 383–385.
Harville, D. A. (1977). Maximum likelihood approaches to variance components
estimation and to related problems, Journal of the American Statistical
Association 72: 320–338.
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains
and their applications, Biometrika 57: 97–109.
Henderson, C. R. (1953). Estimation of variance and covariance components,
Biometrics 9: 226–252.
Jennrich, R. I. and Schluchter, M. D. (1986). Unbalanced repeated measures
models with structural covariance matrices, Biometrics 42(4): 805–820.
Jones, R. H. (1990). Serial correlation or random subject effects,
Communications in Statistics, Part B – Simulation and Computation 19: 1105–1123.
Jupp, D. L. B. (1978). Approximation to data by splines with free knots, SIAM
Journal of Numerical Analysis 15(2): 328–343.
Kung, F. H. (1986). Fitting logistic growth curve with predetermined carrying
capacity, ASA Proceedings of the Statistical Computing Section pp. 340–343.
Laird, N., Lange, N. and Stram, D. (1987). Maximum likelihood computations
with repeated measures: Application of the EM algorithm, Journal of the
American Statistical Association 82: 97–105.
Laird, N. M. and Ware, J. H. (1982). Random-effects models for longitudinal
data, Biometrics 38: 963–974.
Lehmann, E. L. (1983). Theory of Point Estimation, Wiley, New York.
Leonard, T. and Hsu, J. S. J. (1993). Bayesian inference for a covariance matrix,
Annals of Statistics 21: 1–25.
Leonard, T., Hsu, J. S. J. and Tsui, K. W. (1989). Bayesian marginal inference,
Journal of the American Statistical Association 84: 1051–1058.
Liang, K.-Y. and Zeger, S. L. (1986). Longitudinal data analysis using
generalized linear models, Biometrika 73: 13–22.
Lindsey, J. K. (1993). Models for Repeated Measurements, Oxford University
Press, New York.
Lindstrom, M. J. and Bates, D. M. (1988). Newton-Raphson and EM algorithms
for linear mixed-effects models for repeated-measures data, Journal of the
American Statistical Association 83: 1014–1022.
Lindstrom, M. J. and Bates, D. M. (1990). Nonlinear mixed effects models for
repeated measures data, Biometrics 46: 673–687.
Longford, N. T. (1993). Random Coefficient Models, Oxford University Press,
New York.
Mallet, A. (1986). A maximum likelihood estimation method for random
coefficient regression models, Biometrika 73(3): 645–656.
Mallet, A., Mentre, F., Steimer, J.-L. and Lokiek, F. (1988). Nonparametric
maximum likelihood estimation for population pharmacokinetics, with
applications to Cyclosporine, Journal of Pharmacokinetics and
Biopharmaceutics 16: 311–327.
Miller, J. J. (1977). Asymptotic properties of maximum likelihood estimates in
the mixed model of the analysis of variance, Annals of Statistics 5: 746–762.
Newton, H. J. (1993). New developments in statistical computing, American
Statistician 47: 146–147.
Patterson, H. D. and Thompson, R. (1971). Recovery of inter-block information
when block sizes are unequal, Biometrika 58(3): 545–554.
Potthoff, R. F. and Roy, S. N. (1964). A generalized multivariate analysis of
variance model useful especially for growth curve problems, Biometrika
51: 313–326.
Potvin, C. and Lechowicz, M. J. (1990). The statistical analysis of
ecophysiological response curves obtained from experiments involving repeated
measures, Ecology 71: 1389–1400.
Rao, C. R. (1973). Linear statistical inference and its applications, 2nd edn,
Wiley, New York.
Ratkowsky, D. A. (1990). Handbook of Nonlinear Regression Models, Marcel
Dekker, New York.
Sakamoto, Y., Ishiguro, M. and Kitagawa, G. (1986). Akaike Information
Criterion Statistics, D. Reidel Publishing Company, Holland.
Searle, S. R., Casella, G. and McCulloch, C. E. (1992). Variance Components,
Wiley, New York.
Seber, G. A. F. (1977). Linear Regression Analysis, Wiley, New York.
Sheiner, L. B. and Beal, S. L. (1980). Evaluation of methods for estimating
population pharmacokinetic parameters. I. Michaelis-Menten model: Routine
clinical pharmacokinetic data, Journal of Pharmacokinetics and
Biopharmaceutics 8(6): 553–571.
Soo, Y.-W. and Bates, D. M. (1992). Loosely coupled nonlinear least squares,
Computational Statistics and Data Analysis 14: 249–259.
Thisted, R. A. (1988). Elements of Statistical Computing, Chapman & Hall,
London.
Thompson, W. A. (1962). The problem of negative estimates of variance
components, Annals of Mathematical Statistics 33: 273–289.
Tierney, L. and Kadane, J. B. (1986). Accurate approximations for posterior
moments and densities, Journal of the American Statistical Association
81(393): 82–86.
Tippett, L. H. C. (1931). The Methods of Statistics, Williams and Norgate,
London.
Verme, C. N., Ludden, T. M., Clementi, W. A. and Harris, S. C. (1992).
Pharmacokinetics of quinidine in male patients: A population analysis,
Clinical Pharmacokinetics 22: 468–480.
Vonesh, E. F. and Carter, R. L. (1992). Mixed-effects nonlinear regression for
unbalanced repeated measures, Biometrics 48: 1–18.
Wakefield, J. C. (1993). The Bayesian analysis of population pharmacokinetic
models, Technical Report TR-93-11, Statistics Section, Imperial College.
Wakefield, J. C., Smith, A. F. M., Racine-Poon, A. and Gelfand, A. E. (1994).
Bayesian analysis of linear and nonlinear population models using the Gibbs
sampler, Applied Statistics. Accepted for publication.
Weiss, L. (1971). Asymptotic properties of maximum likelihood estimators in
some nonstandard cases, I, Journal of the American Statistical Association
66: 345–350.
Weiss, L. (1973). Asymptotic properties of maximum likelihood estimators in
some nonstandard cases, II, Journal of the American Statistical Association
68: 428–430.
Wolf, D. A. (1986). Nonlinear Least Squares for Linear Compartment Models,
PhD thesis, University of Wisconsin–Madison.
Wolfinger, R., Tobias, R. and Sall, J. (1991). Mixed models: A future direction,
SUGI 16: 1380–1388.
Zeger, S. L., Liang, K.-Y. and Albert, P. S. (1988). Models for longitudinal data:
A generalized estimating equation approach, Biometrics 44: 1049–1060.
Appendix A

We prove here a series of lemmas used in the proof of the theorems of chapter 3. Throughout the proofs we will let $\theta_0$ be an interior point of the parameter space $\Theta$ and assume that $\theta_1, \theta_2 \in N_n(\theta_0)$, with $N_n(\theta_0)$ as defined in theorem (3.1.1). We will denote by $\beta_k$, $\sigma_k$, $\Sigma_k$, $D_k$, and $D_A^k$ the quantities associated with $\theta_k$, $k = 0, 1, 2$.
Lemma A.1
$$\|\beta_1 - \beta_0\|^2 \le p_0 g^2 / n^2_{p_1+1}, \qquad \|\beta_1 - \beta_2\|^2 \le 4 p_0 g^2 / n^2_{p_1+1}.$$

Proof: By definition of $N_n(\theta_0)$, $\|\beta_1 - \beta_0\|^2 = \sum_{i=1}^{p_0} (\beta_{1i} - \beta_{0i})^2 \le p_0 g^2 / n^2_{p_1+1}$. Also $\|\beta_1 - \beta_2\|^2 \le \|\beta_1 - \beta_0\|^2 + \|\beta_2 - \beta_0\|^2 + 2\|\beta_1 - \beta_0\|\,\|\beta_2 - \beta_0\| \le 4 p_0 g^2 / n^2_{p_1+1}$.
Lemma A.2 $\lambda_{\max}\!\left(\Sigma_0^{-1} U_i^j (U_i^j)^T\right) \le 1/\delta_0$, $i = 1, \ldots, r$, $j = 1, \ldots, q_i$, for some $\delta_0 = \delta_0(\sigma_0) > 0$.

Proof: We note first that the eigenvalues of $\Sigma_0 = \sigma_0^2 I + Z D_A^0 Z^T$ are of the form $\sigma_0^2 + \lambda(Z D_A^0 Z^T)$ and, since $Z D_A^0 Z^T$ is positive semi-definite, it follows that $\lambda_{\min}(\Sigma_0) \ge \sigma_0^2$. Now let $\sigma_k$ be the $k$th diagonal element of $D$ and $G_{l(k)}$ the corresponding $G$ matrix, chosen so that $G_{l(k)} = U_i^j (U_i^j)^T$. Define $D_0^k(\delta) = D_0 - \delta \xi_k \xi_k^T$, where $\xi_k$ is a $q$-dimensional vector whose only nonnull element is a one at position $k$. By assumption $\theta_0$ is an interior point of $\Theta$ and so $D_0$ is positive definite. It follows that $\lambda_{\min}(D_0^k(0)) = \lambda_{\min}(D_0) > 0$. Since $\operatorname{trace}(D_0^k(\delta)) = \operatorname{trace}(D_0) - \delta \ge q\,\lambda_{\min}(D_0^k(\delta))$, it follows that $\delta > \operatorname{trace}(D_0) \Rightarrow \lambda_{\min}(D_0^k(\delta)) < 0$. Define $\delta_k = \min\{\delta > 0 : \lambda_{\min}(D_0^k(\delta)) = 0\}$. Note that by the continuity of the minimum eigenvalue we must have $\delta_k \in (0, \operatorname{trace}(D_0))$. Define now $\delta_0 = \min(\delta_1, \ldots, \delta_q, \sigma_0^2)$. Note that $\delta_0 > 0$ and $\lambda_{\min}(D_0^k(\delta_0)) \ge 0$, $k = 1, \ldots, q$. Let $\Sigma_0^k(\delta) = \sigma_0^2 I + Z D_A^{0k}(\delta) Z^T$ and note that by (2.3.1.2) $\Sigma_0 = \Sigma_0^k(\delta) + \delta G_{l(k)}$. From standard results on eigenvalues we have that
$$\lambda_{\max}(\Sigma_0^{-1} G_{l(k)}) = \sup_{\|\xi\|=1} \frac{\xi^T G_{l(k)} \xi}{\xi^T \Sigma_0 \xi} = \sup_{\|\xi\|=1} \frac{\xi^T G_{l(k)} \xi}{\delta_0\, \xi^T G_{l(k)} \xi + \xi^T \Sigma_0^k(\delta_0) \xi} \le 1/\delta_0,$$
where we used the fact that $\xi^T \Sigma_0^k(\delta_0) \xi \ge \sigma_0^2 + \lambda_{\min}(D_0^k(\delta_0))\,\|Z^T \xi\|^2$, which is greater than zero by construction. Note that $\lambda_{\max}(\Sigma_0^{-1}) = 1/\lambda_{\min}(\Sigma_0) \le 1/\sigma_0^2 \le 1/\delta_0$, so that the result also holds for $k = 0$ ($G_0 = I$).
Lemma A.3 $\max_k \left|\lambda_k(\Sigma_0^{-1} G_i)\right| \le 2/\delta_0$, $i = 0, \ldots, p_1$.

Proof: Letting $l(i)$, $j(i)$, and $k(i)$ denote respectively the random effects class and the random effects indices associated with $G_i$ ($j(i) = k(i)$ when $\sigma_i$ is a variance term), we note that $G_i$ is either of the form $U_{l(i)}^{j(i)} (U_{l(i)}^{j(i)})^T$ (when $j(i) = k(i)$) or $U_{l(i)}^{j(i)} (U_{l(i)}^{k(i)})^T + U_{l(i)}^{k(i)} (U_{l(i)}^{j(i)})^T$ (when $j(i) \neq k(i)$). By the Cauchy–Schwarz and the triangle inequalities and lemma (A.2) we have that
$$\left|\xi^T \Sigma_0^{-1/2} G_i \left(\Sigma_0^{-1/2}\right)^T \xi\right| \le 2 \left[\xi^T \Sigma_0^{-1/2} U_{l(i)}^{j(i)} \left(U_{l(i)}^{j(i)}\right)^T \left(\Sigma_0^{-1/2}\right)^T \xi\right]^{1/2} \left[\xi^T \Sigma_0^{-1/2} U_{l(i)}^{k(i)} \left(U_{l(i)}^{k(i)}\right)^T \left(\Sigma_0^{-1/2}\right)^T \xi\right]^{1/2} \le (2/\delta_0)\,\|\xi\|^2,$$
where $\Sigma_0^{-1/2}$ denotes the Cholesky factor (Thisted, 1988) of $\Sigma_0^{-1}$. We note that $\Sigma_0^{-1} G_i$ and $\Sigma_0^{-1/2} G_i (\Sigma_0^{-1/2})^T$ share the same eigenvalues. To see that, let $u$ be an eigenvector of $\Sigma_0^{-1/2} G_i (\Sigma_0^{-1/2})^T$ with eigenvalue $\lambda$; then
$$\left(\Sigma_0^{-1/2}\right)^T \Sigma_0^{-1/2} G_i \left(\Sigma_0^{-1/2}\right)^T u = \lambda \left(\Sigma_0^{-1/2}\right)^T u.$$
Letting $v = (\Sigma_0^{-1/2})^T u$ and noting that $\Sigma_0^{-1} = (\Sigma_0^{-1/2})^T \Sigma_0^{-1/2}$, we have that $v$ is an eigenvector of $\Sigma_0^{-1} G_i$ with eigenvalue $\lambda$. Conversely, let $v$ be an eigenvector of $\Sigma_0^{-1} G_i$ with eigenvalue $\lambda$. Since $\Sigma_0$ is positive definite, there exists $u \in \mathbb{R}^n$ such that $v = (\Sigma_0^{-1/2})^T u$ and
$$\left(\Sigma_0^{-1/2}\right)^T \Sigma_0^{-1/2} G_i \left(\Sigma_0^{-1/2}\right)^T u = \lambda \left(\Sigma_0^{-1/2}\right)^T u \;\Rightarrow\; \Sigma_0^{-1/2} G_i \left(\Sigma_0^{-1/2}\right)^T u = \lambda u,$$
and $u$ is an eigenvector of $\Sigma_0^{-1/2} G_i (\Sigma_0^{-1/2})^T$ with eigenvalue $\lambda$. It then follows that $\max_k |\lambda_k(\Sigma_0^{-1} G_i)| = \sup_{\|\xi\|=1} |\xi^T \Sigma_0^{-1/2} G_i (\Sigma_0^{-1/2})^T \xi| \le 2/\delta_0$.
Lemma A.4 There exists an $n_0 = n_0(\sigma_0)$ such that $\forall n \ge n_0$
$$\lambda_{\max}\!\left(\Sigma_1^{-1} U_i^j (U_i^j)^T\right) \le 2/\delta_0, \quad i = 1, \ldots, r, \; j = 1, \ldots, q_i.$$

Proof: As before let $\sigma_k$ be such that $G_{l(k)} = U_i^j (U_i^j)^T$. By the definition of $\delta_0$, $\lambda_{\min}(D_0^k(\delta_0/2)) > 0$ and, by the continuity of the functions involved, $\exists\, \varepsilon_k = \varepsilon_k(\sigma_0) > 0$ such that
$$\|\sigma_1 - \sigma_0\| < \varepsilon_k \;\Rightarrow\; \left|\lambda_{\min}(D_1^k(\delta_0/2)) - \lambda_{\min}(D_0^k(\delta_0/2))\right| < \lambda_{\min}(D_0^k(\delta_0/2))/2.$$
Let $\varepsilon_0 = \min_k \varepsilon_k$; then it follows that for all $\theta_1 \in N_n(\theta_0)$ such that $\|\sigma_1 - \sigma_0\| < \varepsilon_0$ we have, for $k = 1, \ldots, q$,
$$\lambda_{\max}(\Sigma_1^{-1} G_{l(k)}) = \sup_{\|\xi\|=1} \frac{\xi^T G_{l(k)} \xi}{\xi^T \Sigma_1 \xi} = \sup_{\|\xi\|=1} \frac{\xi^T G_{l(k)} \xi}{(\delta_0/2)\, \xi^T G_{l(k)} \xi + \xi^T \Sigma_1^k(\delta_0/2) \xi} \le 2/\delta_0,$$
since
$$\xi^T \Sigma_1^k(\delta_0/2) \xi \ge \sigma_1^2 + \lambda_{\min}(D_1^k(\delta_0/2))\,\|Z^T \xi\|^2 \ge (1/2)\,\lambda_{\min}(D_0^k(\delta_0/2))\,\|Z^T \xi\|^2 > 0.$$
Now note that if $|\sigma_1^2 - \sigma_0^2| < \sigma_0^2/2$ then
$$\lambda_{\max}(\Sigma_1^{-1}) = 1/\lambda_{\min}(\Sigma_1) \le 1/\sigma_1^2 \le 2/\sigma_0^2 \le 2/\delta_0.$$
Let $n_0 = n_0(\sigma_0)$ be the smallest integer $n$ such that $\max_{0 \le i \le p_1} (g_i(n)/n_i(n)) < \min(\varepsilon_0, \sigma_0^2/2)$. Note that such an $n_0$ always exists, since $g_i/n_i \to 0$. Then by the previous results it follows that
$$\forall n \ge n_0, \quad \lambda_{\max}\!\left(\Sigma_1^{-1} U_i^j (U_i^j)^T\right) \le 2/\delta_0, \quad i = 1, \ldots, r, \; j = 1, \ldots, q_i.$$
Lemma A.5 There exists an $n_0 = n_0(\sigma_0)$ such that $\forall n \ge n_0$,
$$\max_k \left|\lambda_k(\Sigma_1^{-1} G_i)\right| \le 4/\delta_0, \quad i = 0, \ldots, p_1.$$

Proof: The proof is identical to that of lemma (A.3), with lemma (A.4) replacing lemma (A.2) in the final inequalities.
Lemma A.6 There exists an $n_0 = n_0(\sigma_0)$ such that $\forall n \ge n_0$
$$\lambda_{\max}(\Sigma_1^{-1} \Sigma_0) \le 2\left(\lambda_{\max}(D_0)/\lambda_{\min}(D_0) + 1\right).$$

Proof: Let $\varepsilon_0 = \varepsilon_0(\sigma_0) > 0$ be such that for all $\theta_1 \in N_n(\theta_0)$ satisfying $\|\theta_1 - \theta_0\| < \varepsilon_0$ we have $|\lambda_{\min}(D_1) - \lambda_{\min}(D_0)| < \lambda_{\min}(D_0)/2$ and $|\sigma_1^2 - \sigma_0^2| < \sigma_0^2/2$. Define $n_0$ to be the smallest integer such that $\max_{0 \le i \le p_1}(g(n)/n_i(n)) < \varepsilon_0$. It then follows that $\forall n \ge n_0$
$$\lambda_{\max}(\Sigma_1^{-1} \Sigma_0) = \sup_{\|\xi\|=1} \frac{\xi^T \Sigma_0 \xi}{\xi^T \Sigma_1 \xi} \le \sup_{\|\xi\|=1,\, Z^T\xi \neq 0} \frac{\xi^T Z D_A^0 Z^T \xi}{\xi^T Z D_A^1 Z^T \xi} + \frac{\sigma_0^2}{\sigma_1^2} \le \frac{\lambda_{\max}(D_0)}{\lambda_{\min}(D_1)} + 2 \le 2\left(\frac{\lambda_{\max}(D_0)}{\lambda_{\min}(D_0)} + 1\right).$$
Using the exact same reasoning we also have that for sufficiently large $n$
$$\lambda_{\max}(\Sigma_1^{-1} \Sigma_2) \le 4\left(\lambda_{\max}(D_0)/\lambda_{\min}(D_0) + 1\right).$$
Lemma A.7 $\max_k \left|\lambda_k\!\left(\Sigma_0^{-1}(\Sigma_1 - \Sigma_0)\right)\right| \le (1/g_3)\left(q/\lambda_{\min}(D_0) + 1/\sigma_0^2\right)$.

Proof: By definition the maximum absolute eigenvalue of $\Sigma_0^{-1}(\Sigma_1 - \Sigma_0)$ is
$$\sup_{\|\xi\|=1} \frac{\left|\xi^T (\Sigma_1 - \Sigma_0)\, \xi\right|}{\xi^T \Sigma_0 \xi} \le \sup_{\|\xi\|=1,\, Z^T\xi \neq 0} \frac{\left|\xi^T Z (D_A^1 - D_A^0) Z^T \xi\right|}{\xi^T Z D_A^0 Z^T \xi} + \frac{|\sigma_1^2 - \sigma_0^2|}{\sigma_0^2}.$$
Now note that
$$\left|\xi^T (D_1 - D_0)\, \xi\right| \le \sum_{i,j}^{q} \left|[D_1]_{ij} - [D_0]_{ij}\right| |\xi_i|\, |\xi_j| \le \frac{1}{g_3}\left(\sum_{i=1}^{q} |\xi_i|\right)^2 \le (q/g_3)\,\|\xi\|^2.$$
Noting also that $|\sigma_1^2 - \sigma_0^2| \le 1/g_3$ and that $D_A^1 - D_A^0$ and $D_1 - D_0$ share the same eigenvalues with different multiplicities, the result follows immediately.
Lemma A.8 There exists an $n_0 = n_0(\sigma_0)$ such that $\forall n \ge n_0$
$$\max_k \left|\lambda_k\!\left(\Sigma_1^{-1}(\Sigma_0 - \Sigma_1)\right)\right| \le (2/g_3)\left(q/\lambda_{\min}(D_0) + 1/\sigma_0^2\right)$$
and
$$\max_k \left|\lambda_k\!\left(\Sigma_1^{-1}(\Sigma_2 - \Sigma_1)\right)\right| \le (4/g_3)\left(q/\lambda_{\min}(D_0) + 1/\sigma_0^2\right).$$

Proof: It is an immediate consequence of the previous lemma and the fact that for sufficiently large $n$ the smallest eigenvalue of $D_1$ may be bounded from below by $\lambda_{\min}(D_0)/2$ for any $\theta_1 \in N_n(\theta_0)$.
Lemma A.9 Let $\theta_0 \in \Theta$ and $\theta_1, \theta_2 \in N_n(\theta_0)$, and let $\{A_n(\theta_1)\}$ be a sequence of positive semi-definite $n \times n$ matrices of rank $r_n(\theta_1)$ and $\{a_n\}$ a sequence of positive quantities going to infinity such that $\forall \theta_1 \in N_n(\theta_0)$ and $\forall n$ we have $r_n(\theta_1) \le a_n$ and $\lambda_{\max}(A_n(\theta_1)) \le M(\theta_0)$ for some nonnegative $M(\theta_0) < \infty$. It then follows that
$$P_{\theta_2}\!\left( \frac{1}{a_n} \sup_{\theta_1 \in N_n(\theta_0)} (y - X\beta_2)^T \left[\Sigma_2^{1/2}\right]^{-1} A_n(\theta_1) \left[\Sigma_2^{1/2}\right]^{-T} (y - X\beta_2) > 2M(\theta_0) \right) \to 0.$$
Furthermore, if $\lambda_{\max}(A_n(\theta_1)) \le M_n(\theta_0)$, $\forall n$ and $\forall \theta_1 \in N_n(\theta_0)$, with $M_n(\theta_0) \to 0$, then
$$\frac{1}{a_n}\left( \sup_{\theta_1 \in N_n(\theta_0)} (y - X\beta_2)^T \left[\Sigma_2^{1/2}\right]^{-1} A_n(\theta_1) \left[\Sigma_2^{1/2}\right]^{-T} (y - X\beta_2) \right) \xrightarrow{P_{\theta_2}} 0.$$

Proof: Under $\theta_2$, $y \sim N(X\beta_2, \Sigma_2)$ and so $z = [\Sigma_2^{1/2}]^{-T}(y - X\beta_2) \sim N_n(0, I)$. By assumption $A_n(\theta_1)$ has $n - r_n(\theta_1)$ zero eigenvalues and therefore we can write $A_n(\theta_1) = P_n(\theta_1)\Lambda_n(\theta_1)P_n^T(\theta_1)$, where $P_n(\theta_1)$ is an $n \times r_n(\theta_1)$ matrix whose columns are orthonormal eigenvectors of $A_n(\theta_1)$ corresponding to the nonzero eigenvalues and $\Lambda_n(\theta_1)$ is an $r_n(\theta_1) \times r_n(\theta_1)$ diagonal matrix with diagonal elements given by the nonzero eigenvalues of $A_n(\theta_1)$. It follows that $w_n = P_n^T(\theta_1) z$ follows a $N_{r_n(\theta_1)}(0, I)$ distribution and
$$z^T A_n(\theta_1) z = w_n^T \Lambda_n(\theta_1) w_n \le \lambda_{\max}(A_n(\theta_1))\,\|w_n\|^2 \le M(\theta_0)\,\|w_n\|^2.$$
Note that $\|w_n\|^2 \sim \chi^2_{r_n(\theta_1)}$. Let $F_k$ denote the distribution function of the chi-square distribution with $k$ degrees of freedom. It follows from the definition of that distribution that $F_{r_n(\theta_1)} \ge F_{a_n}$ and so
$$P_{\theta_2}\!\left( \frac{1}{a_n} (y - X\beta_2)^T \left[\Sigma_2^{1/2}\right]^{-1} A_n(\theta_1) \left[\Sigma_2^{1/2}\right]^{-T} (y - X\beta_2) > 2M(\theta_0) \right) \le 1 - F_{a_n}(2a_n) \le 2/a_n,$$
with the last bound following from Tchebychev's inequality. Since $a_n$ does not depend on $\theta_1$ we have
$$P_{\theta_2}\!\left( \frac{1}{a_n} \sup_{\theta_1 \in N_n(\theta_0)} (y - X\beta_2)^T \left[\Sigma_2^{1/2}\right]^{-1} A_n(\theta_1) \left[\Sigma_2^{1/2}\right]^{-T} (y - X\beta_2) > 2M(\theta_0) \right) \le 2/a_n \to 0.$$
Now if $\lambda_{\max}(A_n(\theta_1))$ is bounded by a sequence $\{M_n(\theta_0)\}$ going to zero, then for any given $\varepsilon > 0$
$$P_{\theta_2}\!\left( \frac{1}{a_n} (y - X\beta_2)^T \left[\Sigma_2^{1/2}\right]^{-1} A_n(\theta_1) \left[\Sigma_2^{1/2}\right]^{-T} (y - X\beta_2) > \varepsilon \right) \le 1 - F_{a_n}(\varepsilon a_n / M_n(\theta_0)) \le M_n(\theta_0)/\varepsilon \to 0,$$
with the last bound following from Tchebychev's inequality. Since the bound does not depend on $\theta_1$, the result is also true when we take the sup over $\theta_1 \in N_n(\theta_0)$.
Appendix B

This appendix contains detailed documentation for the functions, classes, and
methods described in chapter 7. Online documentation in S and S-PLUS is
available for all of these functions, classes, and methods after correct
installation of the software that we contributed to StatLib.