TRANSCRIPT
Short-Course: Introduction to regression analysis for
longitudinal data
(Part 1)
Instructor: Juan Carlos Salazar-Uribe*.
School of Statistics. Faculty of Sciences
National University of Colombia at Medellin
[email protected]
*Educational Ambassador 2014-2015, Dr. Martha Aliaga's Fund, American Statistical Association. This material is largely based on Fitzmaurice, Laird, and Ware (2011). Applied Longitudinal Analysis. 2nd Ed. Wiley: New Jersey, and the JSM Professional Development Program Continuing Education Course on Applied Longitudinal Analysis. Other references were also used.
VII Summer School, Medellín, Colombia, December 2015.
Abstract Short-Course
Longitudinal data are used both in research and in practice to answer questions about events that change or evolve over time (e.g., the cognitive skills of a senior citizen, the weight of a person, the diameter of a tree). Linear mixed models have become an indispensable tool for the analysis of longitudinal data.
Abstract Short-Course
In this short course, we will present and discuss linear mixed regression models adapted to situations where longitudinal data arise. The first part of this short course is dedicated to presenting motivational examples where linear mixed regression models can be applied; strategies for estimating the parameters characterizing these models (maximum likelihood, ML, and restricted maximum likelihood, REML) will also be discussed.
Abstract Short-Course
This first part also includes examples of how these models can be implemented using SAS. In the second part of this short course we will present some modern approaches to modeling longitudinal data using penalized splines, as proposed by Fitzmaurice et al. (2011); finally, we study conditional and marginal linear mixed regression models. Advantages and disadvantages of the presented methods are discussed.
Contents
Introduction and motivational examples; the classic linear model; longitudinal data; basic concepts; fixed and random effects; maximum likelihood estimation and restricted maximum likelihood estimation; regression models based on penalized splines; conditional models versus marginal models. Examples.
Spruce tree growth data
(Kryscio, 2002) This experiment was conducted to compare the growth of spruce trees under two environments: a control environment and an ozone-enriched environment. There are 12 trees per group. For each tree, 4 growth measurements were recorded at days 152, 174, 201, and 227, respectively; in other words, each tree contributes 4 measurements. The data set:
Spruce tree growth data
Obs  Grupo    Id  y152  y174  y201  y227
 1   Control   1  2.79  3.10  3.30  3.38
 2   Control   2  3.30  3.90  4.34  4.96
 3   Control   3  3.98  4.36  4.79  4.99
 4   Control   4  4.36  4.77  5.10  5.30
 5   Control   5  4.34  4.95  5.42  5.97
 6   Control   6  4.59  5.08  5.36  5.76
 7   Control   7  4.41  4.56  4.95  5.23
 8   Control   8  4.24  4.64  4.95  5.38
 9   Control   9  4.82  5.17  5.76  6.12
10   Control  10  3.84  4.17  4.67  4.67
11   Control  11  4.07  4.31  4.90  5.10
12   Control  12  4.28  4.80  5.27  5.55
Spruce tree growth data
Obs  Grupo  Id  y152  y174  y201  y227
13   Ozono  13  4.53  5.05  5.18  5.41
14   Ozono  14  4.97  5.32  5.83  6.29
15   Ozono  15  4.37  4.81  5.03  5.19
16   Ozono  16  4.58  4.99  5.37  5.68
17   Ozono  17  4.00  4.50  4.92  5.44
18   Ozono  18  4.73  5.05  5.33  5.92
19   Ozono  19  5.15  5.63  6.11  6.39
20   Ozono  20  4.10  4.46  4.84  5.29
21   Ozono  21  4.56  5.12  5.40  5.69
22   Ozono  22  5.16  5.49  5.74  6.05
23   Ozono  23  5.46  5.79  6.12  6.41
24   Ozono  24  4.52  4.91  5.04  5.71
Spruce tree growth data
Control environment: [profile plot]
Ozone-enriched environment: [profile plot]
Spruce tree growth data
These plots show a linear trend over time, and each tree seems to have its own random intercept (the slopes seem to be the same). This data set will be used to illustrate different analysis techniques typical of LMMs.
Inference in the LMM
Example (Prostate Cancer). The Baltimore Longitudinal Study on Aging*.
This is a retrospective study. Prostate disease is one of the most common and most costly medical problems in the USA, so it is important to look for markers which can detect the disease at an early stage. Prostate-Specific Antigen (PSA)** is an enzyme produced by both normal and cancerous prostate cells. The PSA level is related to the volume of prostate tissue.
*References: Verbeke and Molenberghs (2000); Carter et al. (1992), Cancer Research; Morrell et al. (1995), Journal of the American Statistical Association; Pearson et al. (1994), Statistics in Medicine.
**High values of PSA mean a high risk of prostate cancer.
Inference in the LMM
Problem: Patients with Benign Prostatic Hyperplasia (BPH) also have an increased PSA level. This produces an overlap in the PSA distribution. In fact, this overlapping of the PSA distributions for cancer and BPH cases seriously complicates the detection of prostate cancer. The research question (a hypothesis based on clinical practice): Can longitudinal PSA profiles be used to detect prostate cancer at an early stage?
Inference in the LMM
In order to answer this important question, a retrospective case-control study based on frozen serum samples was conducted. The study included 16 control patients, 20 BPH cases, 14 local cancer cases, and 4 metastatic cancer cases (more serious than local). Complication: no perfect match for age at diagnosis and years of follow-up was possible*.
*Source: Stat Med. 1994 Mar 15-Apr 15;13(5-7):587-601. Mixed-effects regression models for studying the natural history of prostate disease. Pearson JD, Morrell CH, Landis PK, Carter HB, Brant LJ. Abstract. Although prostate cancer and benign prostatic hyperplasia are major health problems in U.S. men, little is known about the early stages of the natural history of prostate disease. A molecular biomarker called prostate specific antigen (PSA), together with a unique longitudinal bank of frozen serum, now allows a historic prospective study of changes in PSA levels for decades prior to the diagnosis of prostate disease. Linear mixed-effects regression models were used to test whether rates of change in PSA were different in men with and without prostate disease. In addition, since the prostate cancer cases developed their tumours at different (and unknown) times during their periods of follow-up, a piece-wise non-linear mixed-effects regression model was used to estimate the time when rapid increases in PSA were first observable beyond the background level of PSA change. These methods have a wide range of applications in biomedical research utilizing repeated measures data such as pharmacokinetic studies, crossover trials, growth and development studies, aging studies, and disease detection.
Inference in the LMM
From the previous plot we observe:
Much variability between subjects
Little variability within subjects
Highly unbalanced data
Different numbers of subjects per group
Nonlinear trends
Treatment of Lead-Exposed Children Trial
(Fitzmaurice and Laird 2014)
Exposure to lead during infancy is associated with substantial deficits in tests of cognitive ability
Chelation* treatment of children with high lead levels usually requires injections and hospitalizations
A new agent, Succimer, can be given orally
Randomized placebo-controlled trial examining changes in blood lead level during the course of treatment
100 children randomized to placebo or Succimer
Measures of blood lead level at baseline and weeks 1, 4, and 6
*To remove (a heavy metal, such as lead or mercury) from the bloodstream by means of a chelate, such as EDTA (ethylenediaminetetraacetic acid; a crystalline acid that acts as a strong chelating agent and that forms a sodium salt used as an antidote for metal poisoning and as an anticoagulant).
Treatment of Lead-Exposed Children Trial
Treatment of Lead-Exposed Children Trial data. Mean blood lead levels (and SD) at baseline and weeks 1, 4, and 6.

Group     Baseline     Week 1       Week 4       Week 6
Succimer  26.5 (5.0)   13.5 (7.7)   15.5 (7.8)   20.8 (9.2)
Placebo   26.3 (5.0)   24.7 (5.5)   24.1 (5.7)   23.2 (6.2)
Treatment of Lead-Exposed Children Trial
Plot of mean blood lead levels at baseline (Time = 0) and weeks 1, 4, and 6 in the Succimer and Placebo groups
Example
Example. Progesterone metabolite concentration during the menstrual cycle*. Longitudinal data on PdG (pregnanediol-3-glucuronide), measured daily in urine from day −8 to day 15 of the menstrual cycle (day 0 denotes the ovulation day).
*Brumback, B. and Rice, J. (1998). Smoothing spline models for the analysis of nested and crossed samples of curves. Journal of the American Statistical Association, 93, 961−993.
Example
A sample of 22 conceptive cycles from 22 women and 29 non-conceptive cycles from another 29 women. Following standard practice in endocrinological research, progesterone profiles are aligned by the day of ovulation, determined by serum luteinizing hormone, and then truncated at each end to yield curves of equal length.
Measurements are missing for certain cycles on some days. The goal is to describe and compare the mean hormone profiles in the conceptive and non-conceptive groups*, especially before and after implantation, which occurs approximately 7−8 days after ovulation.
*Using anti-pregnancy pills
Example
The following figures display time plots for the conceptive and non-conceptive groups. Figure 12(a) suggests that the trajectory of the log progesterone concentration differs between the two groups following implantation (around day 7).
Alkaline battery duration
Six different brands of alkaline batteries were selected to study their duration under normal conditions: RAYOVAC©, DURACELL©, VARTA©, ENERGIZER MAX©, EVEREADY ALKALINE©, PANASONIC POWER ALKALINE©. For each brand, 2 batteries were selected and connected to a parallel circuit specially designed to drain the charge from each battery. Every hour, for 11 hours, the voltage was measured. The data are as follows:
           RAYOVAC                DURACELL
           VOLTAGE                VOLTAGE
TIME  Battery 1  Battery 2   Battery 1  Battery 2
  0      1217       1288        1419       1450
  1      1181       1248        1359       1393
  2      1077       1170        1284       1296
  3      1040       1138        1222       1235
  4      1043       1099        1173       1178
  5      1004       1047        1103       1106
  6       949        968         877        814
  7       873        870         156        138
  8       780        740         107        112
  9       641        618          89         84
 10        62         62          71         78
 11        50         58          61         75
Alkaline battery duration
           VARTA                  ENERGIZER
           VOLTAGE                VOLTAGE
TIME  Battery 1  Battery 2   Battery 1  Battery 2
  0      1430       1308        1286       1268
  1      1207       1150        1102       1060
  2      1112       1052        1036        995
  3      1056        993        1000        962
  4      1025        953         972        940
  5       992        917         934        905
  6       952        818         874        853
  7       870        805         759        763
  8       246        172         166      109.2
  9        96         77        75.4       70.4
 10        69         65        62.6         66
 11        64         60        60.1       65.1
Alkaline battery duration
           EVEREADY               PANASONIC
           VOLTAGE                VOLTAGE
TIME  Battery 1  Battery 2   Battery 1  Battery 2
  0      1276       1212        1217       1279
  1      1086       1001        1005       1044
  2      1034        955         943        980
  3       992        910         897        958
  4       959        879         871        930
  5       886        844         804        879
  6       825        788         718        823
  7       693        679         602        736
  8     118.8      142.4         718        612
  9      75.5       75.1         602       95.5
 10      64.7       64.3       117.6         73
 11      61.2       61.5        81.7       70.3
Alkaline battery duration
A profile plot of the data:

Figure 1. Profile plot of the data for the battery duration experiment. Around 5 hours, a decay of the profiles associated with the different brands is observed.
Longitudinal Data: Basic concepts
A defining feature of longitudinal studies is that measurements on the
same individuals are taken repeatedly through time
Longitudinal studies allow direct study of change over time
Primary goal is to characterize the changes in response over time and the
factors that influence change
With repeated measures, we can capture within-individual (or within-subject)
variability.
Longitudinal Data: Basic concepts
However, there are several complications when dealing with longitudinal data:
Repeated measures on individuals are correlated
Variability is often heterogeneous over time
Introduction
When the response variable is continuous, classical linear regression models can be extended to handle the correlated outcomes that are typical of longitudinal data. This correlation among repeated measures can be modelled explicitly (e.g., via covariance pattern models) or implicitly (via the introduction of random effects). The latter approach yields a versatile class of regression models for longitudinal data known as linear mixed effects models (Fitzmaurice and Laird, 2014).
Introduction
The LMM is a powerful statistical tool in applied research, since it is not only widely available in standard software but also has a solid and strong theoretical background. The term mixed model refers to the nature of the terms that explain the mean of the model: it contains both fixed and random effects.
Introduction
Usually in longitudinal data the outcome is observed repeatedly, at different times, for each subject under study. Subjects are not necessarily observed at the same times, nor the same number of times (unbalanced data). Covariates can also differ across subjects. These kinds of data have several features:
Each subject has a particular trend (or path), and these are not necessarily straight lines.
If we average all the observations we obtain an estimate of the mean population profile.
For this kind of longitudinal data, the so-called subject-specific models or population-mean models are formulated; depending on the goals of the study, one may be interested in one or the other.
Introduction
Why use the LMM to model longitudinal data?
Because it is a flexible tool for modelling correlated data
Because estimation is based on the likelihood function
It yields valid tests in complex designs with unbalanced data
It allows REML estimation of variance components (Restricted Maximum Likelihood, to be discussed in detail later)
It allows inference on the random effects
It is available in R and SAS©
Longitudinal and Panel data
Define yit to be the response for the ith subject during the tth time period. A longitudinal data set consists of observations on the ith subject over t = 1, 2, ..., Ti time periods, for each of i = 1, 2, ..., n subjects. Thus, we observe:

First subject:  {y11, y12, ..., y1T1}
Second subject: {y21, y22, ..., y2T2}
...
nth subject:    {yn1, yn2, ..., ynTn}
General Linear Model (GLM)
(Littell et al. 2000) Statistical models for data are descriptions of how the data could have been produced, and they basically consist of two parts: (1) a formula relating the response to all explanatory variables (e.g., effects), and (2) a description of the probability distribution assumed to characterize the random variation affecting the observed response.
GLM
The GLM in matrix form:

Y = Xβ + ε

where ε ~ MN(E(ε) = 0, Σε = σ²I).

The Ordinary Least Squares (OLS) estimator of β is

β̂ = (X′X)⁻¹X′Y

If Σε = σ²V, where V is known,

β̂ = (X′V⁻¹X)⁻¹X′V⁻¹Y   (generalized least squares estimator)
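As a quick numerical check of the two estimators above, here is an illustrative NumPy sketch (not from the course materials; the simulated design and the variance values are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # design matrix with intercept
beta_true = np.array([1.0, 2.0])

# Errors with known covariance sigma^2 * V (here sigma^2 = 1, V diagonal)
V = np.diag(np.linspace(0.5, 3.0, n))
Y = X @ beta_true + rng.multivariate_normal(np.zeros(n), V)

# OLS: beta = (X'X)^{-1} X'Y
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)

# GLS: beta = (X'V^{-1}X)^{-1} X'V^{-1}Y
Vinv = np.linalg.inv(V)
beta_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ Y)

print(beta_ols, beta_gls)  # two estimates of the same beta
```

GLS simply reweights the observations by the inverse covariance; with V = I the two estimators coincide.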
Linear mixed model LMM
This GLM can be extended to obtain an LMM, which can be written more precisely as:

Yi = Xiβ + Zibi + εi
bi ~ N(0, D)
εi ~ N(0, Σi)
b1, ..., bm, ε1, ..., εm: mutually independent
Linear mixed model LMM
Terminology:
β: fixed effects.
bi and εi: random effects.
Variance components: the elements of the matrices D and Σi
Special formulations of LMM
Random intercept model. This is one of the most common formulations of the LMM. Each subject has a profile which is approximately parallel to the trend of its group (recall the Sitka spruce data example). This model can be expressed as:

yij = β0 + β1Tij + β2Gi + β3Gi×Tij + b0i + εij

where
the subject is represented by i: i = 1, 2, ..., N;
observations for subject i are indexed by j: j = 1, 2, ..., mi;
b0i ~ N(0, σ²b), εij ~ N(0, σ²ε), and b0i is independent of εij.
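The random intercept model above is easy to simulate; the following NumPy sketch uses the spruce-data layout (24 subjects, 4 measurement days), with invented parameter values, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1, beta2, beta3 = 4.0, 0.01, 0.3, 0.002  # fixed effects (invented)
sigma_b, sigma_e = 0.5, 0.1                          # SDs of b0i and eps_ij
days = np.array([152, 174, 201, 227])

rows = []
for i in range(24):                  # 24 subjects, as in the spruce data
    G = 0 if i < 12 else 1           # group indicator
    b0 = rng.normal(0, sigma_b)      # subject-specific random intercept
    for t in days:
        y = beta0 + beta1*t + beta2*G + beta3*G*t + b0 + rng.normal(0, sigma_e)
        rows.append((i, G, t, y))

data = np.array(rows)                # columns: subject, group, time, response
print(data.shape)  # (96, 4)
```

Because b0 is shared within a subject, the four responses of one subject are correlated, which is exactly the feature the LMM is built to handle.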
Alternative way of formulating the LMM
Recall the classic linear model*:

Y(n×1) = X(n×k)β(k×1) + e(n×1)

E(e) = 0,  Var(e) = σ²I,  e ~ N(0, σ²I)

Then:

E(Y) = E(Xβ + e) = Xβ,  Var(Y) = Var(e) = σ²I

Y ~ Nn(Xβ, σ²I)

*Machiavelli, R. (2014). Introducción a los modelos mixtos. Proceedings X Coloquio de Estadística. Medellín, Colombia.
Randomized block design
The random intercepts model can also be obtained from a design-of-experiments point of view, as follows. When we are studying the influence of a factor on a quantitative response, it is common to find other variables or factors influencing that response, and they must be controlled. These variables are known as blocks. Blocks have the following properties:
(i) Studying their effect is not a direct goal, but they appear in a natural fashion in the study.
(ii) They do not interact with the factor of interest.
Randomized block design
Classic model: Yij = µ + τi + eij

versus

Mixed model (randomized block design): Yij = µ + τi + bj + eij

(similar to the random intercept model)

bj and eij are normally distributed.
Special formulations of LMM
Random intercepts and slopes model
This is perhaps the other most widely used LMM. Each subject differs from the others in terms of both the intercept and the trend (slope) of its group. The model is expressed as:

yij = β0 + β1Tij + β2Gi + β3Gi×Tij + b0i + b1iTij + εij

where εij ~ N(0, σ²ε) and

[ b0i ]       ( [ 0 ]   [ σ²b0     σb0,b1 ] )
[ b1i ]  ~  N ( [ 0 ] , [ σb0,b1   σ²b1   ] )
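Drawing the pairs (b0i, b1i) from this bivariate normal distribution is a one-liner; below is a sketch with an invented covariance matrix D, checking that the sample covariance of the draws recovers D:

```python
import numpy as np

rng = np.random.default_rng(4)
D = np.array([[0.30, 0.02],
              [0.02, 0.005]])   # invented Cov of (b0i, b1i)

m = 5000                        # number of subjects to draw
b = rng.multivariate_normal(mean=[0.0, 0.0], cov=D, size=m)  # rows are (b0i, b1i)

D_hat = np.cov(b, rowvar=False) # sample covariance, should approximate D
print(np.round(D_hat, 3))
```

The off-diagonal term σb0,b1 lets steeper-than-average subjects also have, say, higher-than-average intercepts.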
Special formulations of LMM
In matrix form:

yi = Xiβ + Zibi + εi

where yi = (yi1, yi2, ..., yini)′, β = (β0, β1, β2, β3)′, bi = (b0i, b1i)′, εi = (εi1, ..., εini)′, the jth row of Xi is (1, Tij, Gi, Gi×Tij), and the jth row of Zi is (1, Tij),

with Var(yi) = ZiDZi′ + σ²ε Ini. Notice that when no random effects are specified, Var(yi) = σ²ε Ini.
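Given Zi, D, and σ²ε, the marginal covariance Var(yi) = ZiDZi′ + σ²ε Ini can be formed directly; a sketch with the four spruce measurement times and invented covariance parameters:

```python
import numpy as np

T = np.array([152.0, 174.0, 201.0, 227.0])
Z = np.column_stack([np.ones(4), T])     # rows (1, Tij): random intercept and slope
D = np.array([[0.25, 0.001],
              [0.001, 0.0001]])          # invented Cov of (b0i, b1i)
sigma2_e = 0.01

V = Z @ D @ Z.T + sigma2_e * np.eye(4)   # Var(yi) = Zi D Zi' + sigma^2 I
print(V.shape)  # (4, 4)
```

Note how the random effects induce nonzero off-diagonal entries in V even though the εij are independent.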
More on LMM
When is an effect fixed and when is it random?
If the studied levels are the only ones we are interested in, the effect can be considered fixed (e.g., the effect of a specific diet).
If the studied levels correspond to a random sample from a set of possible levels, the effect can be considered random (e.g., the effect of weight groups).
More on Longitudinal Data
Let us suppose for a moment that we have not specified random effects. If we have N independent units, then the matrix Σ will have a block-diagonal structure:

Σ = [ Σ1  0  ...  0 ;
      0  Σ2  ...  0 ;
      ... ;
      0   0  ...  ΣN ]

Each of these blocks corresponds to a particular experimental unit, and it will have a structure reflecting both the variances and the covariances within that unit.
More on Longitudinal Data
If the errors are independent:

Σi = σ²e [ 1 0 ... 0 ;
           0 1 ... 0 ;
           ... ;
           0 0 ... 1 ] = σ²e I

(recall that if no random effects are specified, Var(yi) = Σi)
More on Longitudinal Data
On the other hand, if we specify random effects and the so-called compound symmetry structure:

Var(yi) = Vi = (σ²b + σ²e) [ 1 ρ ... ρ ;
                             ρ 1 ... ρ ;
                             ... ;
                             ρ ρ ... 1 ]

If ρ > 0, this is equivalent to treating Σi as independent and adding a random intercept at the subject level; in that case ρ = σ²b/(σ²b + σ²e). The variances are homogeneous. There is a correlation between two separate measurements, but it is assumed that this correlation is constant regardless of how far apart the measurements are.
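This equivalence is easy to verify numerically: a random intercept with variance σ²b plus independent errors with variance σ²e yields exactly the compound-symmetry matrix with ρ = σ²b/(σ²b + σ²e). A sketch with invented values:

```python
import numpy as np

sigma2_b, sigma2_e, n = 0.5, 0.2, 4   # invented variance components, 4 time points

# Construction 1: random intercept + independent errors
V_ri = sigma2_b * np.ones((n, n)) + sigma2_e * np.eye(n)

# Construction 2: compound symmetry with rho = sigma2_b / (sigma2_b + sigma2_e)
rho = sigma2_b / (sigma2_b + sigma2_e)
V_cs = (sigma2_b + sigma2_e) * (rho * np.ones((n, n)) + (1 - rho) * np.eye(n))

print(np.allclose(V_ri, V_cs))  # True
```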
More on Longitudinal Data
Autoregressive of order 1 (AR(1)*, valid if observations are equi-spaced):

Var(yi) = Vi = σ²e [ 1        ρ        ρ²       ...  ρ^(T−1) ;
                     ρ        1        ρ        ...  ρ^(T−2) ;
                     ρ²       ρ        1        ...  ρ^(T−3) ;
                     ... ;
                     ρ^(T−1)  ρ^(T−2)  ρ^(T−3)  ...  1 ]

If observations are not equi-spaced one can use:

Corr(ej, ej′) = ρ^|tj − tj′|,  j ≠ j′

*The AR(1) structure has homogeneous variances and correlations that decline exponentially with distance. It means that the variability in a measurement, say white blood cell count, is constant regardless of when you measure it. It also means that two measurements that are right next to each other in time are going to be pretty correlated (depending on the value of ρ), but that as measurements get farther and farther apart they are less correlated.
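Both AR(1) variants can be generated from the vector of measurement times, since for equi-spaced times ρ^|tj − tj′| reduces to the powers-of-ρ pattern. An illustrative sketch (the helper name and values are ours, not from the course):

```python
import numpy as np

def ar1_cov(times, rho, sigma2_e):
    """Covariance sigma2_e * rho**|tj - tj'|, for arbitrary (possibly unequally spaced) times."""
    t = np.asarray(times, dtype=float)
    return sigma2_e * rho ** np.abs(t[:, None] - t[None, :])

# Equi-spaced times recover the classic AR(1) pattern 1, rho, rho^2, ...
V = ar1_cov([0, 1, 2, 3], rho=0.6, sigma2_e=2.0)
print(V[0] / 2.0)  # first row of the correlation matrix: 1, 0.6, 0.36, 0.216
```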
More on Longitudinal Data
Selecting a covariance structure . To select models for the mean is a neces-
sary task. One approach consists in adjusting different covariance structures
and random effects (matrices D and Σ) and then, by means of a suitable crite-
ria (such as LRT, BIC, AIC) we choose the one suggested by the the criterium.
More on Longitudinal Data
Strategy of analysis. Once the structure for the covariance has been selected, we check the fixed part of the model again. For example, when fitting a regression model, we first select variables, then we check assumptions; then we interpret the estimated coefficients, test the null hypotheses, and draw conclusions. In an ANOVA we interpret interactions, compare means, and test contrasts to draw conclusions.
Inference in the LMM
Recall the formulation of the LMM:

Y = Xβ + Zb + e

[ b ]       ( [ 0 ]   [ D  0 ] )
[ e ]  ~  N ( [ 0 ] , [ 0  Σ ] )

E(Y) = Xβ,   V(Y) = V = ZDZ′ + Σ
Inference in the LMM
The conditional model:

Y|b ~ N(Xβ + Zb, Σ),   b ~ N(0, D)

E(Y|b) = Xβ + Zb,   V(Y|b) = Σ
Inference in the LMM
The system of Henderson's equations (Henderson, 1975) is useful for obtaining estimates of β and predictions of b, respectively*:

X′Σ⁻¹Xβ + X′Σ⁻¹Zb = X′Σ⁻¹Y
(Z′Σ⁻¹Z + D⁻¹)b + Z′Σ⁻¹Xβ = Z′Σ⁻¹Y

*Henderson's advice to beginning scientists: Study the methods of your predecessors. Work hard. Do not fear to try new ideas. Discuss your ideas with others freely. Be quick to admit errors. Progress comes by correcting mistakes. Always be optimistic. Nature is benign. Enjoy your scientific work; it can be a great joy.
Inference in the LMM
Solving this system yields

β̂ = (X′V⁻¹X)⁻¹X′V⁻¹Y

The solution for b is

b̂ = DZ′V⁻¹(Y − Xβ̂)

which is the Best Linear Unbiased Predictor (BLUP) of b*.

*BLUPs are used to 'estimate' random effects. BLUPs are similar to BLUEs (Best Linear Unbiased Estimators) of fixed effects. The difference is that we estimate fixed effects and predict random effects; the equations for one and the other are different.
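The Henderson system and the closed-form GLS/BLUP solutions above can be cross-checked numerically; here is an illustrative sketch on a small random-intercept layout (all dimensions and values invented):

```python
import numpy as np

rng = np.random.default_rng(2)
m, ni = 10, 4                      # 10 subjects, 4 observations each
n = m * ni
X = np.column_stack([np.ones(n), np.tile(np.arange(ni), m)])  # intercept + time
Z = np.kron(np.eye(m), np.ones((ni, 1)))                      # random intercept design
D = 0.5 * np.eye(m)                # Cov of the random intercepts
Sigma = 0.2 * np.eye(n)            # error covariance
Y = rng.normal(size=n)             # arbitrary response, just to test the algebra

Si = np.linalg.inv(Sigma)
# Henderson's mixed model equations, assembled as one block system
A = np.block([[X.T @ Si @ X, X.T @ Si @ Z],
              [Z.T @ Si @ X, Z.T @ Si @ Z + np.linalg.inv(D)]])
rhs = np.concatenate([X.T @ Si @ Y, Z.T @ Si @ Y])
sol = np.linalg.solve(A, rhs)
beta_h, b_h = sol[:2], sol[2:]

# Closed forms: GLS for beta and BLUP for b, using V = Z D Z' + Sigma
V = Z @ D @ Z.T + Sigma
Vi = np.linalg.inv(V)
beta = np.linalg.solve(X.T @ Vi @ X, X.T @ Vi @ Y)
b = D @ Z.T @ Vi @ (Y - X @ beta)

print(np.allclose(beta_h, beta), np.allclose(b_h, b))  # True True
```

The block system avoids inverting the n × n matrix V, which is why mixed-model software works with Henderson's equations rather than the closed forms.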
Inference in the LMM
Maximum likelihood estimation

Notation:
β: vector of fixed effects;
α: vector of the variance components in D and Σ;
θ = (β, α): vector of parameters in the marginal model (fixed effects and variance components, respectively).
Inference in the LMM
By means of a process known as profiling we can find estimates of the fixed effects.

Profiled likelihood. Let v(β, α|Y) be twice the negative of the marginal log-likelihood,

v(β, α|Y) = −2ℓ(β, α|Y) = log|V(α)| + (Y − Xβ)′V⁻¹(α)(Y − Xβ) + T log(2π)   (4)

which must be minimized with respect to β and α (the quantity T log(2π) does not affect the optimization, since it is a constant). This can be done by profiling with respect to β:
In this process we find a closed-form expression for β̂ and substitute it into v(β, α|Y), i.e., v(β̂(α), α|Y).
The resulting expression is a function of α only, and it is twice the negative of the profiled log-likelihood.
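A minimal illustration of profiling, using the simplest case Y ~ N(µ1, σ²I): for every σ², the minimizer of the objective over µ is the sample mean, and the resulting profiled function of σ² alone is minimized at the ML estimate SSE/n (all numbers below are invented):

```python
import numpy as np

rng = np.random.default_rng(5)
Y = rng.normal(loc=2.0, scale=1.5, size=40)
n = len(Y)

def neg2loglik(mu, s2):
    # -2 log-likelihood of N(mu, s2 I): log|V| + (Y-mu)'V^{-1}(Y-mu) + n log(2 pi)
    return n * np.log(s2) + np.sum((Y - mu) ** 2) / s2 + n * np.log(2 * np.pi)

# Profiling step: the closed-form minimizer over mu is the sample mean, for every s2
mu_hat = Y.mean()

# The profiled objective depends on s2 only; minimize it on a grid
s2_grid = np.linspace(0.5, 5.0, 2000)
s2_best = s2_grid[np.argmin([neg2loglik(mu_hat, s) for s in s2_grid])]
print(s2_best, np.mean((Y - mu_hat) ** 2))  # grid minimizer vs. ML estimate SSE/n
```

In the LMM the same idea applies with β̂(α) in place of the sample mean, leaving an objective in α alone.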
Inference in the LMM
Marginal likelihood function:

LML(θ) = ∏_{i=1}^{m} { (2π)^(−ni/2) |Vi(α)|^(−1/2) exp[ −½ (Yi − Xiβ)′Vi⁻¹(α)(Yi − Xiβ) ] }

If α is known, the MLE of β is:

β̂(α) = ( ∑_{i=1}^{m} Xi′WiXi )⁻¹ ∑_{i=1}^{m} Xi′WiYi

where Wi = Vi⁻¹. However, in general α is unknown, and it is necessary to replace it by an estimator α̂, which is obtained either by maximum likelihood (ML) or restricted maximum likelihood (REML)*. Using ML, one obtains α̂ML by maximizing LML(α, β̂(α)) with respect to α. The resulting estimator β̂(α̂ML) of β is denoted β̂ML (a process known as profiling).

*REML will be discussed ahead.
Inference in the LMM
α̂ML and β̂ML can also be obtained by maximizing LML(θ = (α, β)) with respect to θ, i.e., with respect to α and β simultaneously. Estimation of both the fixed effects and the variance components by ML used to be uncommon, due to computational difficulties related to the solution of the resulting systems of equations. Harville (1977) reviewed previous work, unified the theory, and built iterative ML algorithms for obtaining estimates of the parameters. In addition to the numerical problems, the estimators of the variance components obtained by ML tend to be biased.
Inference in the LMM
Restricted maximum likelihood estimation (REML). The question is: how can we estimate the variance components without having to estimate the corresponding parameters for the fixed effects? The answer leads to the so-called Restricted Maximum Likelihood (REML) estimation proposed by Patterson and Thompson (1971). With this strategy, the vector of fixed effects β is dropped from the log-likelihood function.
Inference in the LMM
In general, the only thing we need is a transformation of the data orthogonal to X. In other words, we need residuals:

r = A′Y ~ N(0, A′V(θ)A)

The restricted likelihood function does not depend on β. Substituting the REML estimator θ̂REML into the generalized least squares estimator of β, we obtain the fixed effects estimator β̂(θ̂REML), which is denoted β̂REML.

In other words, we transform Y in such a way that the linear predictor Xβ vanishes from the likelihood:

r = A′Y ~ N(0, A′V(θ)A)
Inference in the LMM
Two examples of the matrix A could be the following. Consider a sample of n observations Y1, Y2, ..., Yn from a N(µ, σ²).

1) Successive differences:

r = A′Y = [ 1 −1  0 ...  0  0 ;
            0  1 −1 ...  0  0 ;
            ... ;
            0  0  0 ...  1 −1 ] (Y1, Y2, ..., Yn)′ = (Y1 − Y2, Y2 − Y3, ..., Yn−1 − Yn)′

2) Deviations from the mean:

r = A′Y = [ 1 − 1/n    −1/n  ...  −1/n ;
            −1/n    1 − 1/n  ...  −1/n ;
            ... ;
            −1/n    −1/n  ...  1 − 1/n ] (Y1, Y2, ..., Yn)′ = (Y1 − Ȳ, Y2 − Ȳ, ..., Yn−1 − Ȳ)′

In both cases, matrix A corresponds to the first n − 1 rows of some matrix M.
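Both example matrices can be checked numerically: each has rows summing to zero, so the transformed data A′Y are free of µ (here the design X is the column of ones). An illustrative sketch:

```python
import numpy as np

n = 5
# 1) Successive differences: (n-1) x n rows of the form (..., 1, -1, ...)
A1 = np.eye(n - 1, n) - np.eye(n - 1, n, k=1)
# 2) First n-1 rows of the centering matrix I - (1/n) * ones
A2 = (np.eye(n) - np.ones((n, n)) / n)[: n - 1]

ones = np.ones(n)  # the design X for a N(mu, sigma^2) sample
print(np.allclose(A1 @ ones, 0), np.allclose(A2 @ ones, 0))  # True True

Y = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
print(A1 @ Y)  # successive differences Y1-Y2, ..., Y4-Y5
print(A2 @ Y)  # first n-1 deviations from the mean
```

Shifting every Yj by a constant µ leaves both transformed vectors unchanged, which is exactly why the restricted likelihood does not involve the fixed effects.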
Inference in the LMM
Usually M can be chosen as

M = I − X(X′X)⁻¹X′

M can also be formulated as

M = V̂⁻¹ − V̂⁻¹X(X′V̂⁻¹X)⁻¹X′V̂⁻¹

where V̂ is some estimator of V = Cov(Y).

Note: SAS and R use this last formulation for the matrix M.
Inference in the LMM
These REML equations yield an estimator for α only. The estimators of the fixed effects are then obtained by generalized least squares, once one has α̂R, as

β̂R = (X′V⁻¹(α̂R)X)⁻¹X′V⁻¹(α̂R)Y
Inference in the LMM
SAS PROC MIXED constructs an objective function associated with ML or REML and maximizes it over all unknown parameters. Using calculus, it is possible to reduce this maximization problem to one over only the parameters in D and Σ. The corresponding log-likelihood functions are as follows:

ML:   ℓ(D, Σ) = −½ log|V| − ½ r′V⁻¹r − (n/2) log(2π)

REML: ℓR(D, Σ) = −½ log|V| − ½ log|X′V⁻¹X| − ½ r′V⁻¹r − ((n − p)/2) log(2π)

where r = y − X(X′V⁻¹X)⁻¹X′V⁻¹y, and p is the rank of X.
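Both objective functions can be evaluated directly for a given V; the sketch below (invented V and data) also confirms numerically that they differ by −½ log|X′V⁻¹X| plus a constant depending only on p:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 12, 2
X = np.column_stack([np.ones(n), np.arange(n, dtype=float)])
V = 0.5 * np.eye(n) + 0.3 * np.ones((n, n))   # an invented compound-symmetry V
y = rng.normal(size=n)

Vi = np.linalg.inv(V)
XtViX = X.T @ Vi @ X
beta = np.linalg.solve(XtViX, X.T @ Vi @ y)
r = y - X @ beta                               # GLS residuals

_, logdetV = np.linalg.slogdet(V)
_, logdetXtViX = np.linalg.slogdet(XtViX)

ll_ml = -0.5 * logdetV - 0.5 * r @ Vi @ r - (n / 2) * np.log(2 * np.pi)
ll_reml = (-0.5 * logdetV - 0.5 * logdetXtViX - 0.5 * r @ Vi @ r
           - ((n - p) / 2) * np.log(2 * np.pi))

print(ll_ml, ll_reml)
```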
Inference in the LMM
Harville's formulation for the likelihood

Harville (1974) showed that the log-likelihood function for the set of contrasts can be written as

ℓ(α, A′Y) = c − ½ log| ∑_{i=1}^{m} Xi′Vi⁻¹Xi | − ½ ∑_{i=1}^{m} log|Vi| − ½ ∑_{i=1}^{m} (Yi − Xiβ̂)′Vi⁻¹(Yi − Xiβ̂)

where c is a constant and β̂ = β̂(α) as defined previously. Finding a matrix A is not a problem: any matrix of dimension (T − r) × T whose T − r rows are linearly independent of M can be used (Harville, 1977).
Inference in the LMM
This establishes two key properties of the REML estimators:
1. The likelihood, and hence the REML estimator of α, does not depend on the choice of M.
2. If we take β̂REML = β̂(α̂REML), then ℓ(α, A′Y) can be interpreted as the log-likelihood for θ, or simply ℓREML(θ).
Inference in the LMM
Relationship between ℓREML(θ) and ℓML(θ)

Since ℓML(θ) = log LML(θ), it can be proved that

ℓREML(θ) = −½ log| ∑_{i=1}^{m} Xi′Vi⁻¹Xi | + ℓML(θ)

In this way, we establish another property: even though ℓ(α, A′Y) is the true likelihood for α, IT IS NOT a log-likelihood for θ.

Observe that property 2 implies that, while the ML estimators of α and β and the REML estimators of α satisfy the large-sample properties of estimators based on the likelihood principle (consistency, asymptotic normality, and efficiency), the REML estimators of the fixed effects DO NOT satisfy these properties.
Inference in the LMM
On the other hand, for balanced mixed ANOVA designs, it is known that the REML estimators of the variance components are equal to the classic estimators of the variance components obtained by the method of moments. This yields a robustness property of the REML estimators against violations of the normality assumption.

Theoretical justification for the REML estimators

According to Patterson and Thompson (1971) and Harville and Jeske (1992), if there is no information about β, we do not lose information about α when the inference is based on A′Y instead of Y.
From a Bayesian point of view, using just contrasts to make inference about α is equivalent to ignoring any a priori information about β and using all the data to make those inferences.
Inference in the LMM
(Verbeke and Molenberghs 2009). Maximum likelihood estimation and restricted maximum likelihood estimation both have the merit of being based on the likelihood principle, which leads to useful properties such as consistency, asymptotic normality, and efficiency (valid also for the REML estimates of the variance components). ML estimation also provides estimators of the fixed effects, whereas REML estimation, in itself, does not (the reason is that REML yields estimators of the variance components only). On the other hand, for balanced mixed ANOVA models, the REML estimates of the variance components are identical to the classical ANOVA-type estimates obtained from solving the equations which set mean squares equal to their expectations.
This implies optimal minimum-variance properties, and it shows that REML estimates in that context do not rely on any normality assumption, since only moment assumptions are involved (Harville 1977; Searle 1992). Also, with regard to the mean squared error for estimating α, there is no clear preference for either of the two estimation procedures, since it depends on the specifics of the underlying model and possibly on the true value of α.
For ordinary ANOVA or regression models, the ML estimator of the residual variance σ²_e has uniformly smaller mean squared error than the REML estimator when p = rank(X) ≤ 4, but the opposite is true when p > 4 and n − p is sufficiently large (indeed, n − p > 2 suffices if p > 12). In general, one may expect the results of ML and REML estimation to differ more as the number p of fixed effects in the model increases.
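In ordinary regression, the two estimators divide the same residual sum of squares by n (ML) and by n − p (REML), so their ratio is exactly n/(n − p). A quick numerical sketch (an illustration, not from the course):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 3
# Design matrix with intercept plus p-1 random covariates
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
rss = np.sum((y - X @ beta_hat) ** 2)
sigma2_ml = rss / n          # ML estimator of the residual variance (biased downward)
sigma2_reml = rss / (n - p)  # REML estimator (unbiased)
```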
Once the likelihood function is built (either ML or REML), we can use a numerical method to find the value of θ that maximizes it. This can be done by means of the Newton-Raphson algorithm (Lindstrom and Bates, 1988), the Expectation-Maximization algorithm (Dempster et al., 1977), or the Nelder-Mead simplex method (Nelder and Mead, 1965). Markov chain Monte Carlo (MCMC) methods can also be used, and today they are becoming very popular.
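As a minimal illustration of the Newton-Raphson idea (not from the course, and much simpler than the mixed-model likelihood), the sketch below maximizes a Poisson log-likelihood in the rate λ, whose MLE is the sample mean:

```python
import numpy as np

y = np.array([2, 3, 1, 4, 2, 5, 3, 2], dtype=float)
n, s = y.size, y.sum()

lam = 1.0  # starting value
for _ in range(50):
    score = s / lam - n      # first derivative of the log-likelihood
    hess = -s / lam ** 2     # second derivative
    step = score / hess
    lam -= step              # Newton-Raphson update
    if abs(step) < 1e-10:
        break
# lam now equals the sample mean, the Poisson MLE
```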
Example: Sitka-Spruce
This data set has the same number of trees per group (12), and 4 growth measurements were taken at days 152, 174, 201, and 227, respectively. That is, each tree contributes 4 measurements. The main goal of this study is to compare the growth of trees raised in two different environments: a control environment and an ozone-enriched environment. A SAS program for the analysis follows, beginning with the data step:
data spruce;
  input ID Group$ y152 y174 y201 y227 @@;
  cards;
1 Control 2.79 3.10 3.30 3.38    2 Control 3.30 3.90 4.34 4.96
3 Control 3.98 4.36 4.79 4.99    4 Control 4.36 4.77 5.10 5.30
5 Control 4.34 4.95 5.42 5.97    6 Control 4.59 5.08 5.36 5.76
7 Control 4.41 4.56 4.95 5.23    8 Control 4.24 4.64 4.95 5.38
9 Control 4.82 5.17 5.76 6.12   10 Control 3.84 4.17 4.67 4.67
11 Control 4.07 4.31 4.90 5.10  12 Control 4.28 4.80 5.27 5.55
13 Ozone 4.53 5.05 5.18 5.41    14 Ozone 4.97 5.32 5.83 6.29
15 Ozone 4.37 4.81 5.03 5.19    16 Ozone 4.58 4.99 5.37 5.68
17 Ozone 4.00 4.50 4.92 5.44    18 Ozone 4.73 5.05 5.33 5.92
19 Ozone 5.15 5.63 6.11 6.39    20 Ozone 4.10 4.46 4.84 5.29
21 Ozone 4.56 5.12 5.40 5.69    22 Ozone 5.16 5.49 5.74 6.05
23 Ozone 5.46 5.79 6.12 6.41    24 Ozone 4.52 4.91 5.04 5.71
;
run;
data spruce1;
  set spruce;
  array xx(4) y152 y174 y201 y227;
  do i=1 to 4;
    log_growth=xx{i};
    if i=1 then time1=152;
    if i=2 then time1=174;
    if i=3 then time1=201;
    if i=4 then time1=227;
    output;
  end;
run;

proc sort data=spruce1;
  by id;
run;
Consider the Sitka spruce trees data. First we fit the following model using REML as the estimation method:
proc mixed data=spruce1 method=REML;
  class group id;
  model log_growth=group time1 / solution;
  repeated / subject=id type=cs;
  lsmeans group / pdiff;
  title1 'A variance component model with same marginal distribution';
  title2 'REML';
run;
This program yields:

USING REML

Solution for Fixed Effects
                                  Standard
Effect     Group     Estimate     Error       DF   t Value   Pr > |t|
Intercept            2.4783       0.1759      22   14.09     <.0001
Group      Control   -0.5746      0.2139      22   -2.69     0.0135
Group      Ozone     0            .           .    .         .
time1                0.01466      0.000477    71   30.77     <.0001

Type 3 Tests of Fixed Effects
           Num    Den
Effect     DF     DF     F Value   Pr > F
Group      1      22     7.22      0.0135
time1      1      71     946.76    <.0001
USING REML

Least Squares Means
                                 Standard
Effect    Group     Estimate     Error     DF   t Value   Pr > |t|
Group     Control   4.6677       0.1512    22   30.86     <.0001
Group     Ozone     5.2423       0.1512    22   34.66     <.0001

Differences of Least Squares Means
                                           Standard
Effect   Group     _Group   Estimate       Error     DF   t Value   Pr > |t|
Group    Control   Ozone    -0.5746        0.2139    22   -2.69     0.0135
Now we fit the following model with ML as the estimation method:
proc mixed data=spruce1 method=ML;
  class group id;
  model log_growth=group time1 / solution;
  repeated / subject=id type=cs;
  lsmeans group / pdiff;
  title1 'A variance component model with same marginal distribution';
  title2 'ML';
run;
This program yields:

USING ML

Solution for Fixed Effects
                                  Standard
Effect     Group     Estimate     Error       DF   t Value   Pr > |t|
Intercept            2.4783       0.1701      22   14.57     <.0001
Group      Control   -0.5746      0.2048      22   -2.81     0.0103
Group      Ozone     0            .           .    .         .
time1                0.01466      0.000473    71   30.99     <.0001

Type 3 Tests of Fixed Effects
           Num    Den
Effect     DF     DF     F Value   Pr > F
Group      1      22     7.87      0.0103
time1      1      71     960.09    <.0001
USING REML

Least Squares Means
                                 Standard
Effect    Group     Estimate     Error     DF   t Value   Pr > |t|
Group     Control   4.6677       0.1512    22   30.86     <.0001
Group     Ozone     5.2423       0.1512    22   34.66     <.0001

Differences of Least Squares Means
                                           Standard
Effect   Group     _Group   Estimate       Error     DF   t Value   Pr > |t|
Group    Control   Ozone    -0.5746        0.2139    22   -2.69     0.0135
Notice that there are differences in some p-values and standard errors, as well as between the values of some test statistics. For instance, the estimated standard error for the Control group parameter using REML is 0.2139, whereas the one obtained using ML is 0.2048 (the t-test values also differ: -2.69 versus -2.81). Still, there are no substantial differences between the estimates obtained with REML and with ML (a possible reason is that n > p).
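As a rough numerical footnote (an illustration, not from the course): the relative gap between the two standard errors above is small, in line with the remark that ML and REML differ little when n is much larger than p:

```python
# Standard errors for the Control effect reported in the SAS output above
se_reml, se_ml = 0.2139, 0.2048
rel_diff = (se_reml - se_ml) / se_ml  # relative difference, under 5%
```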