


10 Dichotomous or binary responses

10.1 Introduction

Dichotomous or binary responses are widespread. Examples include being dead or alive, agreeing or disagreeing with a statement, and succeeding or failing to accomplish something. The responses are usually coded as 1 or 0, where 1 can be interpreted as the answer "yes" and 0 as the answer "no" to some question. For instance, in section 10.2, we will consider the employment status of women where the question is whether the women are employed.

We start by briefly reviewing single-level logistic and probit regression for dichotomous responses, formulating the models both as generalized linear models, as is common in statistics and biostatistics, and as latent-response models, which is common in econometrics and psychometrics. This lays the foundation for a discussion of various approaches for clustered dichotomous data. The focus will be on logistic regression with random effects, a special case of generalized linear mixed models. In this setting, the distinction between conditional or subject-specific effects and marginal or population-averaged effects is highlighted, and measures of dependence and heterogeneity are described.

We also discuss special features of statistical inference for generalized linear mixed models, including maximum likelihood (ML) estimation of model parameters, methods for assigning values to random effects, and how to obtain different kinds of predicted probabilities. This more technical material is provided here because the principles apply to all models discussed in this volume. However, you can skip it (sections 10.10 through 10.12) on first reading because it is not essential for understanding and interpreting the models.

Other approaches to clustered data with binary responses, such as fixed-effects (conditional ML) estimation and generalized estimating equations (GEE), are briefly discussed in section 10.13.

10.2 Single-level logit and probit regression models for dichotomous responses

In this section, we will introduce logit and probit models without random effects that are appropriate for datasets without any kind of clustering. For simplicity, we will start


by considering just one covariate xi for unit (for example, subject) i. The models can be specified either as generalized linear models or as latent-response models. These two approaches and their relationship are described in sections 10.2.1 and 10.2.2.

10.2.1 Generalized linear model formulation

As in models for continuous responses, we are interested in the expectation (mean) of the response as a function of the covariate. The expectation of a binary (0 or 1) response is just the probability that the response is 1:

E(yi|xi) = Pr(yi = 1|xi)

In linear regression, the conditional expectation of the response is modeled as a linear function E(yi|xi) = β1 + β2xi of the covariate (see section 1.5). For dichotomous responses, this approach may be problematic because the probability must lie between 0 and 1, whereas regression lines increase (or decrease) indefinitely as the covariate increases (or decreases). Instead, a nonlinear function is specified in one of two ways:

Pr(yi = 1|xi) = h(β1 + β2xi)

or

g{Pr(yi = 1|xi)} = β1 + β2xi ≡ νi

where νi (pronounced "nu") is referred to as the linear predictor. These two formulations are equivalent if the function h(·) is the inverse of the function g(·). Here g(·) is known as the link function and h(·) as the inverse link function, sometimes written as g⁻¹(·).

An appealing feature of generalized linear models is that they all involve a linear predictor resembling linear regression (without a residual error term). Therefore, we can handle categorical explanatory variables, interactions, and flexible curved relationships by using dummy variables, products of variables, and polynomials or splines, just as in linear regression.

Typical choices of link function for binary responses are the logit and probit links. In this section, we focus on the logit link, which is used for logistic regression, whereas both links are discussed in section 10.2.2. For the logit link, the model can be written as

logit{Pr(yi = 1|xi)} ≡ ln{Pr(yi = 1|xi) / (1 − Pr(yi = 1|xi))} = β1 + β2xi    (10.1)

where ln is the natural logarithm (base e ≈ 2.718). The fraction in braces in (10.1) represents the odds that yi = 1 given xi, denoted Odds(yi = 1|xi): the expected number of 1 responses per 0 response, or successes per failure. The odds against, in other words the expected number of failures per success, is the standard way of representing the chances against winning in gambling. It follows from (10.1) that the logit model can alternatively be expressed as an exponential function for the odds:

Odds(yi = 1|xi) = exp(β1 + β2xi)


Because the relationship between odds and probabilities is

Odds = Pr / (1 − Pr)    and    Pr = Odds / (1 + Odds)

the probability that the response is 1 in the logit model is

Pr(yi = 1|xi) = logit⁻¹(β1 + β2xi) ≡ exp(β1 + β2xi) / {1 + exp(β1 + β2xi)}    (10.2)

which is the inverse logit function (sometimes called the logistic function) of the linear predictor.
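The correspondence between the two formulations is easy to check numerically. The following sketch (Python rather than Stata, with made-up coefficient values) verifies that the logit and inverse logit are inverses and that the odds equal the exponentiated linear predictor:

```python
import math

def inv_logit(nu):
    """Inverse logit (logistic) function: maps a linear predictor to a probability."""
    return math.exp(nu) / (1 + math.exp(nu))

def logit(p):
    """Logit (log-odds) function: maps a probability to the whole real line."""
    return math.log(p / (1 - p))

# Hypothetical coefficients and covariate value, for illustration only
beta1, beta2, x = 1.0, -0.5, 2.0
nu = beta1 + beta2 * x          # linear predictor
p = inv_logit(nu)               # probability, as in (10.2)
odds = p / (1 - p)              # odds implied by the probability

# g is the inverse of h, and the odds equal exp(linear predictor)
assert abs(logit(p) - nu) < 1e-12
assert abs(odds - math.exp(nu)) < 1e-12
```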

We have introduced two components of a generalized linear model: the linear predictor and the link function. The third component is the distribution of the response given the covariates. Letting πi ≡ Pr(yi = 1|xi), the distribution is specified as Bernoulli(πi), or equivalently as binomial(1, πi). There is no level-1 residual εi in (10.1), so the relationship between the probability and the covariate is deterministic. However, the responses are random because the covariate determines only the probability. Whether the response is 0 or 1 is the result of a Bernoulli trial. A Bernoulli trial can be thought of as tossing a biased coin with probability of heads equal to πi. It follows from the Bernoulli distribution that the relationship between the conditional variance of the response and its conditional mean πi, also known as the variance function, is Var(yi|xi) = πi(1 − πi). (Including a residual εi in the linear predictor of binary regression models would lead to a model that is at best weakly identified¹ unless the residual is shared between units in a cluster, as in the multilevel models considered later in the chapter.)

The logit link is appealing because it produces a linear model for the log of the odds, implying a multiplicative model for the odds themselves. If we add one unit to xi, we must add β2 to the log odds or multiply the odds by exp(β2). This can be seen by considering a one-unit change in xi from some value a to a + 1. The corresponding change in the log odds is

ln{Odds(yi = 1|xi = a + 1)} − ln{Odds(yi = 1|xi = a)} = {β1 + β2(a + 1)} − (β1 + β2a) = β2

Exponentiating both sides, we obtain the odds ratio (OR):

exp[ln{Odds(yi = 1|xi = a + 1)} − ln{Odds(yi = 1|xi = a)}] = Odds(yi = 1|xi = a + 1) / Odds(yi = 1|xi = a)

    = {Pr(yi = 1|xi = a + 1) / Pr(yi = 0|xi = a + 1)} / {Pr(yi = 1|xi = a) / Pr(yi = 0|xi = a)} = exp(β2)

1. Formally, the model is identified by functional form. For instance, if xi is continuous, the level-1 variance has a subtle effect on the shape of the relationship between Pr(yi = 1|xi) and xi. With a probit link, single-level models with residuals are not identified.
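The fact that the OR does not depend on the baseline value a can be checked numerically; the sketch below uses hypothetical coefficients (Python, illustrative only):

```python
import math

def prob(x, b1=0.7, b2=0.3):
    """Pr(y=1|x) for a logistic model with hypothetical coefficients."""
    nu = b1 + b2 * x
    return math.exp(nu) / (1 + math.exp(nu))

def odds(x):
    """Odds of y=1 at covariate value x."""
    p = prob(x)
    return p / (1 - p)

# The odds ratio for a one-unit increase equals exp(b2), whatever a is
for a in (-3.0, 0.0, 1.5, 10.0):
    OR = odds(a + 1) / odds(a)
    assert abs(OR - math.exp(0.3)) < 1e-9
```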


Consider the case where several covariates, for instance, x2i and x3i, are included in the model:

logit {Pr(yi = 1|x2i, x3i)} = β1 + β2x2i + β3x3i

In this case, exp(β2) is interpreted as the OR comparing x2i = a + 1 with x2i = a for given x3i (controlling or adjusting for x3i), and exp(β3) is the OR comparing x3i = a + 1 with x3i = a for given x2i.

The predominant interpretation of the coefficients in logistic regression models is in terms of ORs, which is natural because the log odds is a linear function of the covariates. However, economists instead tend to interpret the coefficients in terms of marginal effects or partial effects on the response probability, which is a nonlinear function of the covariates. We relegate description of this approach to display 10.1, which may be skipped.

For a continuous covariate x2i, economists often consider the partial derivative of the probability of success with respect to x2i:

∆(x2i|x3i) ≡ ∂Pr(yi=1|x2i, x3i)/∂x2i = β2 exp(β1 + β2x2i + β3x3i) / {1 + exp(β1 + β2x2i + β3x3i)}²

A small change in x2i hence produces a change of β2 exp(β1 + β2x2i + β3x3i)/{1 + exp(β1 + β2x2i + β3x3i)}² in Pr(yi = 1|x2i, x3i). Unlike in linear models, where the partial effect simply becomes β2, the derivative of the nonlinear logistic function is not constant but depends on x2i and x3i.

For a binary covariate x3i, economists consider the difference in probabilities between groups,

∆(x3i|x2i) ≡ Pr(yi=1|x2i, x3i=1) − Pr(yi=1|x2i, x3i=0)
           = exp(β1 + β2x2i + β3) / {1 + exp(β1 + β2x2i + β3)} − exp(β1 + β2x2i) / {1 + exp(β1 + β2x2i)}

which, unlike in linear models, depends on x2i.

The partial effect at the average (PEA) is obtained by substituting the sample means x̄2 = (1/N) ∑i x2i and x̄3 = (1/N) ∑i x3i for x2i and x3i, respectively, in the above expressions. For binary covariates, the sample means are proportions, and subjects cannot be at the average (because the proportions are between 0 and 1).

The average partial effect (APE) overcomes this problem by taking the sample means of the individual partial effects, APE(x2i|x3i) = (1/N) ∑i ∆(x2i|x3i) and APE(x3i|x2i) = (1/N) ∑i ∆(x3i|x2i). Fortunately, the APE and PEA tend to be similar.

Display 10.1: Partial effects at the average (PEA) and average partial effects (APE) for the logistic regression model, logit{Pr(yi=1|x2i, x3i)} = β1 + β2x2i + β3x3i, where x2i is continuous and x3i is binary
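The quantities in display 10.1 are straightforward to compute once estimates are available. The sketch below uses hypothetical coefficients and a tiny fabricated sample purely for illustration (Python; the derivative simplifies to β2 p(1 − p)):

```python
import math

def p(x2, x3, b=(0.5, -0.2, 1.0)):
    """Pr(y=1|x2,x3) for a logit model with hypothetical coefficients b."""
    nu = b[0] + b[1] * x2 + b[2] * x3
    return math.exp(nu) / (1 + math.exp(nu))

def partial_x2(x2, x3, b2=-0.2):
    """Partial effect of continuous x2: beta2 * p * (1 - p)."""
    pr = p(x2, x3)
    return b2 * pr * (1 - pr)

def partial_x3(x2):
    """Discrete effect of binary x3: p(x2, 1) - p(x2, 0)."""
    return p(x2, 1) - p(x2, 0)

# Fabricated sample of (x2, x3) pairs, for illustration only
data = [(1.0, 0), (2.5, 1), (0.3, 0), (4.0, 1), (1.8, 0)]
N = len(data)

# PEA: plug the sample means into the partial-effect formulas
x2bar = sum(x2 for x2, _ in data) / N
x3bar = sum(x3 for _, x3 in data) / N
pea_x2 = partial_x2(x2bar, x3bar)

# APE: average the individual partial effects over the sample
ape_x2 = sum(partial_x2(x2, x3) for x2, x3 in data) / N

# As the display notes, APE and PEA tend to be similar
assert abs(ape_x2 - pea_x2) < 0.01
```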


Labor-participation data

To illustrate logistic regression, we will consider data on married women from the 1977 Canadian Women's Labor Force Participation Dataset used by Fox (1997):

. use https://www.stata-press.com/data/mlmus4/womenlf

The dataset womenlf.dta contains women's employment status and two explanatory variables:

• workstat: employment status (0: not working; 1: employed part time; 2: employed full time)

• husbinc: husband’s income in $1,000

• chilpres: child present in household; dummy variable (0: absent; 1: present)

Fox (1997) considered a multiple logistic regression model for a woman being employed (full or part time) versus not working, with covariates husbinc and chilpres:

logit{Pr(yi=1|xi)} = β1 + β2x2i + β3x3i

where yi=1 denotes being employed, yi=0 denotes not being employed, x2i is husbinc, x3i is chilpres, and xi = (x2i, x3i)′ is a vector containing both covariates.

We first merge categories 1 and 2 (employed part time and full time) of workstat into a new category 1 for being employed:

. recode workstat 2=1

Estimation using logit

We then fit the model by ML using Stata’s logit command:

. logit workstat husbinc i.chilpres

Logistic regression                                     Number of obs =    263
                                                        LR chi2(2)    =  36.42
                                                        Prob > chi2   = 0.0000
Log likelihood = -159.86627                             Pseudo R2     = 0.1023

------------------------------------------------------------------------------
    workstat | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     husbinc |  -.0423084   .0197801    -2.14   0.032    -.0810768   -.0035401
    chilpres |
     present |  -1.575648   .2922629    -5.39   0.000    -2.148473   -1.002824
       _cons |    1.33583   .3837634     3.48   0.000     .5836674    2.087992
------------------------------------------------------------------------------


Interpretation

The estimated coefficients are negative, so the estimated log odds of employment are lower if the husband earns more and if there is a child in the household. At the 5% significance level, we can reject the null hypotheses that the individual coefficients β2 and β3 are 0. The estimated coefficients and their estimated standard errors are also given in table 10.1.

Table 10.1: ML estimates for logistic regression model for women's labor force participation

                         Est (SE)        OR = exp(β) (95% CI)
  β1 [_cons]          1.34 (0.38)
  β2 [husbinc]       −0.04 (0.02)        0.96 (0.92, 1.00)
  β3 [chilpres]      −1.58 (0.29)        0.21 (0.12, 0.37)

Instead of considering changes in log odds, it is more informative to obtain ORs, the exponentiated regression coefficients. This can be achieved using the logit command with the or option:

. logit workstat husbinc i.chilpres, or

Logistic regression                                     Number of obs =    263
                                                        LR chi2(2)    =  36.42
                                                        Prob > chi2   = 0.0000
Log likelihood = -159.86627                             Pseudo R2     = 0.1023

------------------------------------------------------------------------------
    workstat | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     husbinc |   .9585741   .0189607    -2.14   0.032     .9221229    .9964662
    chilpres |
     present |   .2068734   .0604614    -5.39   0.000     .1166621    .3668421
       _cons |    3.80315    1.45951     3.48   0.000     1.792601    8.068699
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

Comparing women with and without a child at home, whose husbands have the same income, the odds of working are estimated to be about 5 (≈ 1/0.2068734) times as high for women who do not have a child at home as for women who do. Within these two groups of women, each $1,000 increase in husband's income reduces the odds of working by an estimated 4% {−4% = 100%(0.9585741 − 1)}. Although this OR looks less important than the one for chilpres, remember that we cannot directly compare the magnitude of the two ORs. The OR for chilpres represents a comparison of two distinct groups of women, whereas the OR for husbinc merely expresses the effect of a $1,000 increase in the husband's income. A $10,000 increase would be associated with an OR of 0.66 {= 0.9585741^10 = exp(−0.0423084 × 10)}.
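These calculations can be reproduced directly from the reported coefficient for husbinc (Python, illustrative):

```python
import math

b_husbinc = -0.0423084       # estimated coefficient from the logit output
or_1k = math.exp(b_husbinc)  # OR per $1,000 increase
or_10k = or_1k ** 10         # OR per $10,000 increase

# Two equivalent routes to the same number
assert abs(or_10k - math.exp(b_husbinc * 10)) < 1e-12
# Matches the values quoted in the text
assert abs(or_1k - 0.9585741) < 1e-6
assert abs(or_10k - 0.66) < 0.01
```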


The exponentiated intercept, estimated as 3.80, represents the odds of working for women who do not have a child at home and whose husbands' income is 0. This is not an OR as the column heading implies, but the odds when all covariates are 0. To avoid potential confusion, the exponentiated intercept was omitted from the output in earlier releases of Stata (until Stata 12.0) when the or option was used. As of Stata 15, a footnote is included to explain that this quantity represents the baseline odds, that is, the odds when all covariates are 0. The baseline odds is meaningful only if 0 is a possible value for all covariates.

In an attempt to make effects directly comparable and assess the relative importance of covariates, some researchers standardize all covariates to have standard deviation 1, thereby comparing the effects of a standard deviation change in each covariate. As discussed in section 1.5, there are many problems with such an approach, one of them being the meaningless notion of a standard deviation change in a dummy variable, such as chilpres.

The standard errors of exponentiated estimated regression coefficients should not be used for confidence intervals or hypothesis tests. Instead, the 95% confidence intervals in the above output were computed by taking the exponentials of the confidence limits for the regression coefficients β:

exp{β̂ ± 1.96 × SE(β̂)}

In table 10.1, we therefore report estimated ORs with 95% confidence intervals instead of standard errors.
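For instance, the confidence limits for the OR of husbinc in the output above can be recovered from the coefficient and standard error on the log-odds scale (Python, illustrative; 1.96 is used as the approximate normal quantile):

```python
import math

beta, se = -0.0423084, 0.0197801   # estimate and SE from the logit output

# 95% limits on the log-odds scale, then exponentiate the endpoints
lo, hi = beta - 1.96 * se, beta + 1.96 * se
or_lo, or_hi = math.exp(lo), math.exp(hi)

# Matches the interval shown in the OR output
assert abs(or_lo - 0.9221229) < 1e-4
assert abs(or_hi - 0.9964662) < 1e-4
```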

To visualize the model, we can produce a plot of the predicted probabilities versus husbinc, with separate curves for women with and without children at home. Plugging in ML estimates for the parameters in (10.2), the predicted probability for woman i, often denoted π̂i, is given by the inverse logit of the estimated linear predictor,

π̂i ≡ P̂r(yi = 1|xi) = exp(β̂1 + β̂2x2i + β̂3x3i) / {1 + exp(β̂1 + β̂2x2i + β̂3x3i)} = logit⁻¹(β̂1 + β̂2x2i + β̂3x3i)    (10.3)

and can be obtained for the women in the dataset by using the predict command with the pr option:

. predict prob, pr
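The predictions produced by predict can be replicated by hand from the reported estimates. A sketch (Python, illustrative; the covariate values are made up):

```python
import math

# Estimates from the logit output above
b_cons, b_husbinc, b_chilpres = 1.33583, -0.0423084, -1.575648

def predicted_prob(husbinc, chilpres):
    """Inverse logit of the estimated linear predictor, as in (10.3)."""
    nu = b_cons + b_husbinc * husbinc + b_chilpres * chilpres
    return math.exp(nu) / (1 + math.exp(nu))

# e.g., a woman with husband's income $10,000 and a child at home
p = predicted_prob(10, 1)
assert 0 < p < 1
```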

We can now produce the graph of predicted probabilities, shown in figure 10.1, by using

. twoway (line prob husbinc if chilpres==0, sort)
>     (line prob husbinc if chilpres==1, sort lpatt(dash)),
>     legend(order(1 "No child" 2 "Child"))
>     xtitle("Husband's income/$1000") ytitle("Probability that wife works")


[Figure omitted: line plot of the predicted probability that the wife works (y axis, 0 to .8) against husband's income/$1000 (x axis, 0 to 50), with a solid curve for "No child" and a dashed curve for "Child".]

Figure 10.1: Predicted probability of working from logistic regression model (for range of husbinc in dataset)

The graph is similar to the graph of the predicted means from an analysis of covariance model (a linear regression model with a continuous and a dichotomous covariate; see section 1.7) except that the curves are not exactly straight. The curves have been plotted for the range of values of husbinc observed for the two groups of women, and for these ranges the predicted probabilities are nearly linear functions of husbinc.

Just for illustration, to see what the inverse logit function looks like, we will now plot the predicted probabilities for a widely extended range of values of husbinc (including negative values, although this does not make sense). This could be accomplished by inventing additional observations with more extreme values of husbinc and then using the predict command again. More conveniently, we can also use Stata's useful twoway plot type function:

. twoway (function y=invlogit(_b[husbinc]*x+_b[_cons]), range(-100 100))
>     (function y=invlogit(_b[husbinc]*x+_b[1.chilpres]+_b[_cons]),
>     range(-100 100) lpatt(dash)),
>     xtitle("Husband's income/$1000") ytitle("Probability that wife works")
>     legend(order(1 "No child" 2 "Child")) xline(1) xline(45)

The estimated regression coefficients are referred to as _b[husbinc], _b[1.chilpres], and _b[_cons] (use the option coeflegend in the logit command to find out how to refer to estimated coefficients), and we used Stata's invlogit() function to obtain the predicted probabilities given in (10.3). The resulting graph is shown in figure 10.2.


[Figure omitted: the same two curves plotted for husband's income/$1000 from −100 to 100, with the probability that the wife works ranging from 0 to 1 and vertical lines at 1 and 45.]

Figure 10.2: Illustration: Predicted probability of working from logistic regression model, extrapolated beyond the range of husbinc in the data

We could have alternatively used the margins and marginsplot commands to produce the same graph as above:

. quietly margins chilpres, at(husbinc=(-100(10)100))

. marginsplot, noci recast(line) plot2opts(lpatt(dash))
>     legend(order(1 "No child" 2 "Child"))
>     ytitle("Probability that wife works") xline(1) xline(45)

The range of husbinc actually observed in the data lies approximately between the two vertical lines. It would not be safe to rely on predicted probabilities extrapolated outside this range. The curves are approximately linear in the region where the linear predictor is close to 0 (and the predicted probability is close to 0.5) and then flatten as the linear predictor becomes extreme. This flattening ensures that the predicted probabilities remain in the permitted interval from 0 to 1.

Estimation using glm

We can fit the same model by using the glm command for generalized linear models. The syntax is the same as that of the logit command except that we must specify the logit link function in the link() option and the binomial distribution in the family() option:


. glm workstat husbinc i.chilpres, link(logit) family(binomial)

Generalized linear models                        Number of obs   =        263
Optimization     : ML                            Residual df     =        260
                                                 Scale parameter =          1
Deviance         =  319.7325378                  (1/df) Deviance =   1.229741
Pearson          =  265.9615312                  (1/df) Pearson  =   1.022929

Variance function: V(u) = u*(1-u)                [Bernoulli]
Link function    : g(u) = ln(u/(1-u))            [Logit]

                                                 AIC             =   1.238527
Log likelihood   = -159.8662689                  BIC             =  -1129.028

------------------------------------------------------------------------------
             |                 OIM
    workstat | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     husbinc |  -.0423084   .0197801    -2.14   0.032    -.0810768   -.0035401
    chilpres |
     present |  -1.575648   .2922629    -5.39   0.000    -2.148473   -1.002824
       _cons |    1.33583   .3837634     3.48   0.000     .5836674    2.087992
------------------------------------------------------------------------------

To obtain estimated ORs, we use the eform option (for "exponentiated form"), and to fit a probit model, we simply change the link(logit) option to link(probit).

10.2.2 Latent-response formulation

The logistic regression model and other models for dichotomous responses can also be viewed as latent-response models. Underlying the observed dichotomous response yi (whether the woman works or not), we imagine that there is an unobserved or latent continuous response y∗i representing the propensity to work or the excess utility of working as compared with not working. If this latent response is greater than 0, then the observed response is 1; otherwise, the observed response is 0:

yi = { 1 if y∗i > 0
     { 0 otherwise

For simplicity, we will assume that there is one covariate xi. A linear regression model is then specified for the latent response y∗i,

y∗i = β1 + β2xi + εi

where εi is a residual error term, assumed to be independent of xi and independent across women. As we will see, the model for the observed binary response yi becomes a probit model if εi has a standard normal distribution and a logit model if εi has a standard logistic distribution.

The latent-response formulation has been used in various disciplines and applications. In genetics, where yi is often a phenotype or qualitative trait, y∗i is called a liability. For attitudes measured by agreement or disagreement with statements, the latent response can be thought of as a "sentiment" in favor of the statement. In economics, the latent response is often called an index function. In discrete-choice settings (see chapter 12), y∗i is the difference in utilities between alternatives.

Figure 10.3 illustrates the relationship between the latent-response formulation, shown in the lower graph, and the generalized linear model formulation, shown in the upper graph in terms of a curve for the conditional probability that yi=1. The regression line in the lower graph represents the conditional expectation of y∗i given xi as a function of xi, and the density curves represent the conditional distributions of y∗i given xi. The dotted horizontal line at y∗i=0 represents the threshold, so yi=1 if y∗i exceeds the threshold and yi=0 otherwise. Therefore, the areas under the parts of the density curves that lie above the dotted line, here shaded gray, represent the probabilities that yi=1 given xi. For the value of xi indicated by the vertical dotted line, the mean of y∗i is 0; therefore, half the area under the density curve lies above the threshold, and the conditional probability that yi=1 equals 0.5 at that point.

[Figure omitted: upper panel shows the curve for Pr(yi=1|xi) rising through 0.5; lower panel shows the regression line for y∗i against xi with density curves and the threshold at y∗i = 0.]

Figure 10.3: Illustration of equivalence of latent-response and generalized linear model formulations for logistic regression

We can derive the probability curve from the latent-response formulation as follows:

Pr(yi=1|xi) = Pr(y∗i > 0|xi) = Pr(β1 + β2xi + εi > 0|xi)
            = Pr{εi > −(β1 + β2xi)|xi} = Pr(−εi ≤ β1 + β2xi|xi)
            = F(β1 + β2xi)


where F(·) is the cumulative distribution function of −εi, or the area under the density curve for −εi from minus infinity to β1 + β2xi. If the density of εi is symmetric, the cumulative distribution function of −εi is the same as that of εi.
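This equivalence can be illustrated by simulation: draw standard logistic residuals, threshold the latent response at 0, and compare the proportion of 1s with the inverse logit of the linear predictor. The coefficients below are hypothetical (Python, illustrative):

```python
import math
import random

random.seed(12345)
beta1, beta2, x = 0.4, 0.8, 1.0        # hypothetical values
nu = beta1 + beta2 * x                 # linear predictor

def rlogistic():
    """Draw from the standard logistic distribution by inverting its CDF."""
    u = random.random()
    return math.log(u / (1 - u))

n = 100_000
# Latent response y* = nu + eps; observed y = 1 if y* > 0
y = [1 if nu + rlogistic() > 0 else 0 for _ in range(n)]

p_sim = sum(y) / n
p_theory = math.exp(nu) / (1 + math.exp(nu))   # inverse logit of nu
assert abs(p_sim - p_theory) < 0.01
```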

Logistic regression

In logistic regression, εi (and hence −εi) is assumed to have a standard logistic cumulative distribution function given xi,

Pr(εi < τ|xi) = exp(τ) / {1 + exp(τ)}

For this distribution, εi has mean 0 and variance π²/3 ≈ 3.29 (note that π here represents the famous mathematical constant pronounced "pi", the circumference of a circle divided by its diameter).

Probit regression

When a latent-response formulation is used, it seems natural to assume that εi (and hence −εi) has a normal distribution given xi, as is typically done in linear regression. If a standard (mean 0 and variance 1) normal distribution is assumed, the model becomes a probit model,

Pr(yi=1|xi) = F (β1 + β2xi) = Φ(β1 + β2xi) (10.4)

Here Φ(·) is the standard normal cumulative distribution function, the probability that a standard normally distributed random variable (here εi) is less than the argument. For example, when β1 + β2xi equals 1.96, Φ(β1 + β2xi) equals 0.975. Φ(·) is the inverse link function h(·), whereas the link function g(·) is Φ⁻¹(·), the inverse standard normal cumulative distribution function, called the probit link function [the Stata function for Φ⁻¹(·) is invnormal()].
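The standard normal CDF is available in Python via math.erf, which makes the claims above easy to check (illustrative sketch):

```python
import math

def Phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Phi(1.96) is approximately 0.975, as stated above
assert abs(Phi(1.96) - 0.975) < 0.001

# Probit model probability for a hypothetical linear predictor
beta1, beta2, x = 0.5, -0.3, 1.0
p = Phi(beta1 + beta2 * x)   # inverse link applied to the linear predictor
assert 0 < p < 1
```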

To understand why a standard normal distribution is specified for εi, with the variance θ fixed at 1, consider the graph in figure 10.4. On the left, the standard deviation is 1, whereas the standard deviation on the right is 2. However, by doubling the slope of the regression line for y∗i on the right (without changing the point where it intersects the threshold 0), we obtain the same curve for the probability that yi=1. Because we can obtain equivalent models by increasing both the standard deviation and the slope by the same multiplicative factor, the model with a freely estimated standard deviation is not identified.

This lack of identification is also evident from inspecting the expression for the probability if the variance θ were not fixed at 1. If εi ~ N(0, θ), then εi/√θ ~ N(0, 1), so (10.4) changes as follows:

Pr(yi=1|xi) = Pr(εi ≤ β1 + β2xi) = Pr(εi/√θ ≤ (β1 + β2xi)/√θ) = Φ(β1/√θ + (β2/√θ)xi)


It is now easy to see that multiplication of the regression coefficients by a constant can be counteracted by multiplying √θ by the same constant. This is the reason for fixing the standard deviation in probit models to 1 (see also exercise 10.10). The variance of εi in logistic regression is also fixed, but to a larger value, π²/3.
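This invariance is easy to verify numerically: multiplying the coefficients by a constant and dividing by the same constant (the rescaled residual standard deviation) leaves the probabilities unchanged (Python sketch with hypothetical coefficients):

```python
import math

def Phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

beta1, beta2 = 0.5, -0.3   # hypothetical coefficients
c = 2.0                    # scale coefficients and residual sd by the same constant

for x in (-2.0, 0.0, 1.5, 3.0):
    p_original = Phi(beta1 + beta2 * x)
    # Coefficients multiplied by c, residual sd sqrt(theta) = c:
    p_rescaled = Phi((c * beta1 + c * beta2 * x) / c)
    assert abs(p_original - p_rescaled) < 1e-12
```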

[Figure omitted: two panels showing Pr(yi=1|xi) curves and y∗i regression lines against xi; the right panel doubles both the residual standard deviation and the slope, producing the same probability curve.]

Figure 10.4: Illustration of equivalence between probit models with change in residual standard deviation counteracted by change in slope

Estimation using probit

A probit model can be fit to the women’s employment data in Stata by using:

. probit workstat husbinc i.chilpres

Probit regression                                       Number of obs =    263
                                                        LR chi2(2)    =  36.19
                                                        Prob > chi2   = 0.0000
Log likelihood = -159.97986                             Pseudo R2     = 0.1016

------------------------------------------------------------------------------
    workstat | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     husbinc |  -.0242081   .0114252    -2.12   0.034    -.0466011    -.001815
    chilpres |
     present |  -.9706164   .1769051    -5.49   0.000    -1.317344   -.6238887
       _cons |   .7981507   .2240082     3.56   0.000     .3591028    1.237199
------------------------------------------------------------------------------


Interpretation

These estimates are closer to 0 than those reported for the logit model in table 10.1 because the standard deviation of εi is 1 for the probit model and π/√3 ≈ 1.81 for the logit model. Therefore, as we have already seen in figure 10.4, the regression coefficients in logit models must be larger in absolute value to produce nearly the same curve for the conditional probability that yi = 1. Here we say "nearly the same" because the shapes of the probit and logit curves are similar yet not identical. To visualize the subtle difference in shape, we can plot the predicted probabilities for women without children at home from both the logit and the probit models:

. twoway (function y=invlogit(1.3358-0.0423*x), range(-100 100))
>     (function y=normal(0.7982-0.0242*x), range(-100 100) lpatt(dash)),
>     xtitle("Husband's income/$1000") ytitle("Probability that wife works")
>     legend(order(1 "Logit link" 2 "Probit link")) xline(1) xline(45)

[Figure omitted: predicted probability that the wife works (0 to 1) against husband's income/$1000 (−100 to 100), with a solid curve for the logit link and a dashed curve for the probit link, and vertical lines at 1 and 45.]

Figure 10.5: Predicted probabilities of working from logistic and probit regression models for women without children at home

Here the predictions from the models coincide nearly perfectly in the region where most of the data are concentrated and are very similar elsewhere. It is thus futile to attempt to empirically distinguish between the logit and probit links unless one has a huge sample.
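This rescaling can be checked against the two sets of estimates reported above: each logit coefficient divided by the corresponding probit coefficient gives a factor in the rough vicinity of π/√3 ≈ 1.81 (here between about 1.6 and 1.8), not exactly, because the two fits are not simple rescalings of each other (Python, illustrative):

```python
import math

# Estimates from the logit and probit outputs above
logit_est  = {"husbinc": -0.0423084, "chilpres": -1.575648, "_cons": 1.33583}
probit_est = {"husbinc": -0.0242081, "chilpres": -0.9706164, "_cons": 0.7981507}

scale = math.pi / math.sqrt(3)   # ratio of residual sds, about 1.81

for name in logit_est:
    ratio = logit_est[name] / probit_est[name]
    # Ratios fall roughly near (but not exactly at) the theoretical scale
    assert 1.5 < ratio < 1.9
```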

Regression coefficients in probit models cannot be interpreted in terms of ORs as in logistic regression models. Instead, the coefficients can be interpreted as differences in the population means of the latent response y∗i, controlling or adjusting for other covariates (the same kind of interpretation can also be made in logistic regression). Many people find interpretation based on latent responses less appealing than interpretation


using ORs, because the latter refer to observed responses yi. Alternatively, the coefficients can be interpreted in terms of average partial effects or partial effects at the average as shown for logit models² in display 10.1.

10.3 Which treatment is best for toenail infection?

Lesaffre and Spiessens (2001) analyzed data provided by De Backer et al. (1998) from a randomized, double-blind trial of treatments for toenail infection (dermatophyte onychomycosis). Toenail infection is not uncommon, with a prevalence of about 2% to 3% in the U.S. and a much higher prevalence among diabetics and the elderly. The infection is caused by a fungus that not only disfigures the nails but also can cause physical pain and impair the ability to work.

In this clinical trial, 378 patients were randomly allocated to two oral antifungal treatments (250 mg/day terbinafine and 200 mg/day itraconazole) and evaluated at seven visits, at weeks 0, 4, 8, 12, 24, 36, and 48. One outcome is onycholysis, the degree of separation of the nail plate from the nail bed, which was dichotomized ("moderate or severe" versus "none or mild") and is available for 294 patients.

The dataset toenail.dta contains the following variables:

• patient: patient identifier

• outcome: onycholysis (separation of nail plate from nail bed) (0: none or mild; 1: moderate or severe)

• treatment: treatment group (0: itraconazole; 1: terbinafine)

• visit: visit number (1, 2, . . . , 7)

• month: exact timing of visit in months

We read in the toenail data by typing

. use https://www.stata-press.com/data/mlmus4/toenail, clear

The main research question is whether the treatments differ in their efficacy. In other words, do patients receiving one treatment experience a greater decrease in their probability of having onycholysis than those receiving the other treatment?

10.4 Longitudinal data structure

Before investigating the research question, we should look at the longitudinal structure of the toenail data by using, for instance, the xtdescribe, xtsum, and xttab commands, discussed in Part III: Introduction to models for longitudinal and panel data (in volume 1).

2. For probit models with continuous x2i and binary x3i, ∆(x2i|x3i) = β2 φ(β1 + β2x2i + β3x3i), where φ(·) is the density function of the standard normal distribution, and ∆(x3i|x2i) = Φ(β1 + β2x2i + β3) − Φ(β1 + β2x2i).
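As an aside, the probit partial-effect formulas in footnote 2 above are easy to evaluate numerically. Here is a small sketch in Python (rather than Stata) with made-up coefficient values; it illustrates the formulas only and is not output from the toenail analysis.

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def phi(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# Hypothetical probit coefficients (not estimates from the toenail data)
b1, b2, b3 = -0.5, 0.8, 0.4
x2, x3 = 1.0, 1  # continuous x2i, binary x3i

# Partial effect of continuous x2i, holding x3i fixed:
# Delta(x2i|x3i) = b2 * phi(b1 + b2*x2i + b3*x3i)
pe_x2 = b2 * phi(b1 + b2 * x2 + b3 * x3)

# Partial effect of binary x3i, holding x2i fixed:
# Delta(x3i|x2i) = Phi(b1 + b2*x2i + b3) - Phi(b1 + b2*x2i)
pe_x3 = Phi(b1 + b2 * x2 + b3) - Phi(b1 + b2 * x2)

print(round(pe_x2, 4), round(pe_x3, 4))
```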

Here we illustrate the use of the xtdescribe command. Before using xtdescribe, we xtset the data with patient as the cluster identifier and visit as the time variable:

. xtset patient visit

Panel variable: patient (unbalanced)
 Time variable: visit, 1 to 7, but with gaps
         Delta: 1 unit

The output states that the data are unbalanced and that there are gaps. We would describe the time variable visit as fixed occasions because the values are identical across patients apart from the gaps caused by missing data; see Part III: Introduction to models for longitudinal and panel data (in volume 1). In contrast, month is a varying-occasion time variable that takes on a unique set of values for each individual. To use xtdescribe, it is important to xtset the data with a fixed-occasion variable like visit so that it is clear when data are missing for individuals:

. xtdescribe if !missing(outcome)

patient: 1, 2, ..., 383                                    n =    294
  visit: 1, 2, ..., 7                                      T =      7
         Delta(visit) = 1 unit
         Span(visit)  = 7 periods
         (patient*visit uniquely identifies each observation)

Distribution of T_i:   min     5%    25%    50%    75%    95%    max
                         1      3      7      7      7      7      7

     Freq.  Percent    Cum.    Pattern
       224    76.19   76.19    1111111
        21     7.14   83.33    11111.1
        10     3.40   86.73    1111.11
         6     2.04   88.78    111....
         5     1.70   90.48    1......
         5     1.70   92.18    11111..
         4     1.36   93.54    1111...
         3     1.02   94.56    11.....
         3     1.02   95.58    111.111
        13     4.42  100.00    (other patterns)

       294   100.00            XXXXXXX

We see that 224 patients have complete data (the pattern "1111111"), 21 patients missed the sixth visit ("11111.1"), 10 patients missed the fifth visit ("1111.11"), and most other patients dropped out at some point, never returning after missing a visit. The latter pattern is sometimes referred to as monotone missingness, in contrast with intermittent missingness, which follows no particular pattern.

As discussed in section 5.9, a nice feature of ML estimation for incomplete data such as these is that all information is used. Thus, not only patients who attended all visits but also patients with missing visits contribute information. If the model is correctly specified, ML estimates are consistent when the responses are missing at random (MAR).


10.5 Proportions and fitted population-averaged or marginal probabilities

A useful graphical display of the data is a bar plot showing the proportion of patients with onycholysis at each visit by treatment group. The following Stata commands can be used to produce the graph shown in figure 10.6:

. label define tr 0 "Itraconazole" 1 "Terbinafine"

. label values treatment tr

. graph bar (mean) proportion = outcome, over(visit) by(treatment)
>   ytitle(Proportion with onycholysis)

Here we defined value labels for treatment to make them appear on the graph.


Figure 10.6: Bar plot of proportion of patients with toenail infection by visit and treatment group

An alternative display is a line graph, plotting the observed proportions at each visit against time. The most natural time scale would be month. However, the exact timing in months is unique for each patient, so we cannot estimate proportions for each timing. We therefore use the proportions associated with visit number visit and find the average time (month) associated with each visit number. Both variables can be obtained using the egen command with the mean() function:


. egen prop = mean(outcome), by(treatment visit)

. egen mn_month = mean(month), by(treatment visit)

. twoway line prop mn_month, by(treatment) sort
>   xtitle(Time in months) ytitle(Proportion with onycholysis)

The resulting graph is shown in figure 10.7.


Figure 10.7: Line plot of proportion of patients with toenail infection by average time at visit and treatment group

The proportions shown in figure 10.7 represent the estimated average (or marginal) probabilities of onycholysis given the two covariates, time since randomization and treatment group. We are not attempting to estimate individual patients' personal probabilities, which may vary substantially, but are considering the population averages given the covariates.

Instead of estimating the probabilities for each combination of visit and treatment, we can attempt to obtain smooth curves of the estimated probability as a function of time. We then no longer have to group observations for the same visit number together—we can use the exact timing of the visits directly. One way to accomplish this is by using a logistic regression model with month, treatment, and their interaction as covariates. This model for the dichotomous outcome yij at visit i for patient j can be written as

logit{Pr(yij=1|xij)} = β1 + β2x2j + β3x3ij + β4x2jx3ij (10.5)

where x2j represents treatment, x3ij represents month, and xij = (x2j, x3ij)′ is a vector containing both covariates. This model allows for a difference between groups at baseline, β2, and linear changes in the log odds of onycholysis over time with slope β3 in the itraconazole group and slope β3 + β4 in the terbinafine group. Therefore, β4, the difference in the rate of improvement (on the log odds per month scale) between treatment groups, can be viewed as the treatment effect (terbinafine versus itraconazole).

This model makes the unrealistic assumption that the responses for a given patient are conditionally independent after controlling for the included covariates. We will relax this assumption in the next section. Here we can get satisfactory inferences for marginal effects by using robust standard errors for clustered data instead of model-based standard errors. This approach is analogous to pooled OLS in linear models and corresponds to the generalized estimating equations (GEE) approach discussed in section 6.6 with an independence working correlation structure (see section 10.13.2 for an example with a different working correlation structure).

Estimation using logit

We fit the model by ML with cluster–robust standard errors:

. logit outcome i.treatment##c.month, or vce(cluster patient)

Logistic regression                             Number of obs =  1,908
                                                Wald chi2(3)  =  64.30
                                                Prob > chi2   = 0.0000
Log pseudolikelihood = -908.00747               Pseudo R2     = 0.0830

                          (Std. err. adjusted for 294 clusters in patient)

                                Robust
     outcome   Odds ratio   std. err.      z    P>|z|   [95% conf. interval]

   treatment
 Terbinafine     .9994184    .2511294  -0.00   0.998    .6107468   1.635436
       month     .8434052    .0246377  -5.83   0.000    .7964725   .8931034

   treatment#
     c.month
 Terbinafine      .934988    .0488105  -1.29   0.198    .8440528    1.03572

       _cons     .5731389    .0982719  -3.25   0.001    .4095534   .8020642

Note: _cons estimates baseline odds.

Note that, as discussed in display 2.1 in volume 1, cluster–robust standard errors (based on a sandwich estimator) work only if there is a sufficiently large number of clusters. Inspired by Angrist and Pischke (2009, 319), we use the rule of thumb that the number of clusters minus the number of cluster-level covariates q should be at least 42.
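To see how the reported ORs translate into fitted marginal probabilities, the linear predictor can be rebuilt on the odds scale. Here is a sketch in Python (rather than Stata), using the odds ratios from the output above; prob() is an illustrative helper, not a Stata function.

```python
# ORs from the marginal logistic regression output above
base_odds = 0.5731389     # _cons: baseline odds (itraconazole, month 0)
or_treat = 0.9994184      # terbinafine vs itraconazole at month 0
or_month = 0.8434052      # per-month OR, itraconazole group
or_interact = 0.934988    # ratio of per-month ORs, terbinafine/itraconazole

def prob(month, terbinafine):
    """Fitted marginal probability: multiply ORs on the odds scale,
    then convert odds to a probability."""
    odds = base_odds * (or_treat if terbinafine else 1.0) \
        * (or_month * (or_interact if terbinafine else 1.0)) ** month
    return odds / (1 + odds)

# Baseline probability in the itraconazole group, and fitted
# probabilities at month 6 in each group (both decline over time)
p0 = prob(0, False)
print(round(p0, 3), round(prob(6, False), 3), round(prob(6, True), 3))
```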


We will leave interpretation of the estimates for later and first check how well the predicted probabilities from the logistic regression model correspond to the observed proportions in figure 10.7. The predicted probabilities are obtained and plotted together with the observed proportions by using the following commands, which result in figure 10.8:

. predict prob, pr

. twoway (line prop mn_month, sort) (line prob month, sort lpatt(dash)),
>   by(treatment) legend(order(1 "Observed proportions" 2 "Fitted probabilities"))
>   xtitle(Time in months) ytitle(Probability of onycholysis)


Figure 10.8: Proportions and fitted probabilities using ordinary logistic regression

The marginal probabilities predicted by the model fit the observed proportions reasonably well. However, we have treated the dependence among responses for the same patient as a nuisance by fitting an ordinary logistic regression model with robust standard errors for clustered data. We now add random effects to accommodate the dependence and estimate the degree of dependence instead of treating it as a nuisance.


10.6 Random-intercept logistic regression

10.6.1 Model specification

Reduced-form specification

To relax the assumption of conditional independence among the responses for the same patient given the covariates, we can include a patient-specific random intercept ζj in the linear predictor to obtain a random-intercept logistic regression model:

logit{Pr(yij=1|xij , ζj)} = β1 + β2x2j + β3x3ij + β4x2jx3ij + ζj (10.6)

The random intercepts are assumed to be independently normally distributed given the covariates, ζj|xij ∼ N(0, ψ). Given ζj and xij, the responses yij for patient j at different occasions i are independently Bernoulli distributed. To write this down more formally, it is useful to define πij ≡ Pr(yij = 1|xij, ζj), giving

logit(πij) = β1 + β2x2j + β3x3ij + β4x2jx3ij + ζj

yij |πij ∼ Binomial(1, πij)

This is a simple example of a generalized linear mixed model (GLMM) because it is a generalized linear model with both fixed effects β1 to β4 and a random effect ζj. The model is also sometimes referred to as a hierarchical generalized linear model (HGLM) in contrast to a hierarchical linear model (HLM). The random intercept can be thought of as the combined effect of omitted patient-specific (time-constant) covariates that cause some patients to be more prone to onycholysis than others (more precisely, the component of this combined effect that is independent of the covariates in the model—not an issue if the covariates are exogenous). It is appealing to model this unobserved heterogeneity in the same way as observed heterogeneity by simply adding the random intercept to the linear predictor. As we will explain later, be aware that ORs obtained by exponentiating regression coefficients in this model must be interpreted conditionally on the random intercept and are therefore often referred to as conditional or subject-specific ORs.

Using the latent-response formulation, the model can equivalently be written as

y∗ij = β1 + β2x2j + β3x3ij + β4x2jx3ij + ζj + εij   (10.7)

where ζj|xij ∼ N(0, ψ) and the εij|xij, ζj have independent standard logistic distributions. The binary responses yij are determined by the latent continuous responses via the threshold model

yij = { 1 if y∗ij > 0
      { 0 otherwise
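A small simulation sketch (in Python, with made-up linear-predictor and random-intercept values) confirms that this threshold model generates the same response probabilities as the logit formulation:

```python
import math
import random

random.seed(12345)

def inv_logit(x):
    return 1 / (1 + math.exp(-x))

# Hypothetical linear-predictor value and random-intercept draw
lin_pred = -0.4
zeta = 1.2

# Simulate from the latent-response formulation:
# y* = lin_pred + zeta + eps, eps ~ standard logistic; y = 1 if y* > 0
n = 200_000
count = 0
for _ in range(n):
    u = random.random()
    eps = math.log(u / (1 - u))  # inverse-CDF draw from standard logistic
    if lin_pred + zeta + eps > 0:
        count += 1

# The simulated proportion should match the logit-model probability
p_sim = count / n
p_model = inv_logit(lin_pred + zeta)
print(round(p_sim, 3), round(p_model, 3))
```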

Confusingly, logistic random-effects models are sometimes written as yij = πij + eij, where eij is a normally distributed level-1 residual with variance πij(1 − πij). This formulation is clearly incorrect because such a model does not produce binary responses (see Skrondal and Rabe-Hesketh [2007]).


Two-stage formulation

Raudenbush and Bryk (2002) and others write two-level models in terms of a level-1 model and one or more level-2 models (see section 4.9 of volume I). In generalized linear mixed models, the need to specify a link function and distribution leads to two further stages of model specification.

Using the notation and terminology of Raudenbush and Bryk (2002), the level-1 sampling model, link function, and structural model are written as

yij ∼ Bernoulli(ϕij)

logit(ϕij) = ηij

ηij = β0j + β1jx2j + β2jx3ij + β3jx2jx3ij

respectively.

The level-2 model for the intercept β0j is written as

β0j = γ00 + u0j

where γ00 is a fixed intercept and u0j is a residual or random intercept. The level-2 models for the coefficients β1j, β2j, and β3j have no residuals for a random-intercept model,

βpj = γp0, p = 1, 2, 3

Plugging the level-2 models into the level-1 structural model, we obtain

ηij = γ00 + u0j + γ10x2j + γ20x3ij + γ30x2jx3ij

    ≡ β1 + ζ0j + β2x2j + β3x3ij + β4x2jx3ij

Equivalent models can be specified using either the reduced-form formulation (used for instance by Stata) or the two-stage formulation (used in the HLM software of Raudenbush et al. 2019). However, in practice, the models being considered are to some extent influenced by the approach adopted, as discussed in section 4.9.

10.6.2 Model assumptions

It is assumed that the ζj are independent across patients and independent of the covariates xij at occasion i. It is also assumed that the covariates at other occasions do not affect the response probabilities given the random intercept (called strict exogeneity conditional on the random intercept). For the latent-response formulation, the εij are assumed to be independent across both occasions and patients, and independent of both ζj and xij. In the generalized linear model formulation, the analogous assumptions are implicit in assuming that the responses are independently Bernoulli distributed (with probabilities determined by ζj and xij).


In contrast to linear random-effects models, consistent estimation in random-effects logistic regression requires that the random part of the model is correctly specified in addition to the fixed part. Specifically, consistency formally requires (1) a correct linear predictor (such as including relevant interactions), (2) a correct link function, (3) correct specification of covariates having random coefficients, (4) conditional independence of responses given the random effects and covariates, (5) independence of the random effects from covariates (for causal inference), and (6) normally distributed random effects. Hence, the assumptions are stronger than those discussed for linear models in section 3.3.2. However, for regression coefficients, the normality assumption for the random intercepts seems to be rather innocuous (McCulloch and Neuhaus 2011), in contrast to the assumption of independence between the random intercepts and covariates (Heagerty and Kurland 2001). As in standard logistic regression, the ML estimator is not necessarily unbiased in finite samples even if all the assumptions are true.

Although the sandwich estimator of the standard errors of regression coefficients is robust to misspecification (consistent as the number of clusters becomes large), it may be less useful than in linear models because in logistic regression misspecification also leads to inconsistent point estimates. For example, a reason for using the sandwich estimator in linear models is to protect against violation of the constant variance assumption (homoskedasticity) for the random intercepts. However, in logistic (or probit) models, heteroskedasticity of the random intercepts leads to inconsistent estimates of the regression coefficients (Heagerty and Kurland 2001). In this case, the sandwich estimator is consistent for the sampling variance of an inconsistent point estimator, which may be of limited use (see, for example, Freedman 2006).

10.6.3 Estimation

There are two commands for fitting random-intercept logistic models in Stata, xtlogit and melogit, and two commands for broader classes of models that include logistic mixed models, meglm and gllamm. (melogit was introduced as xtmelogit in Stata 10, and the xt prefix was removed in Stata 13.) All of these commands provide marginal ML estimation with model-based standard errors, or robust standard errors (sandwich estimator) if the vce(robust) option is used (or just robust in gllamm). Note that the REML (restricted maximum likelihood) estimator, used extensively in volume I, is defined only for linear mixed models (although various approximations have been proposed for generalized linear mixed models). Adaptive quadrature is used to approximate the integrals involved in the likelihood (see section 10.10.1 for more information). The commands have essentially the same syntax as their counterparts for linear models discussed in volume I. Specifically, xtlogit corresponds to xtreg, melogit corresponds to mixed, and both meglm and gllamm use essentially the same syntax for linear, logistic, and other types of models.

All of these commands are relatively slow because they use numerical integration, but for random-intercept models, xtlogit, melogit, and meglm tend to be faster than gllamm. However, gllapred, the postestimation command of gllamm, is still the most useful command for predicting random effects and various types of probabilities, as we will see in sections 10.11 and 10.12. Each command uses a default for the number of terms (called "integration points") used to approximate the integral, and there is no guarantee that a sufficient number of terms has been used to achieve reliable estimates. It is therefore the user's responsibility to make sure that the approximation is adequate by increasing the number of integration points until the results stabilize. The more integration points are used, the more accurate the approximation, at the cost of increased computation time.
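This stabilization check can be mimicked outside Stata. The Python sketch below approximates a single cluster's likelihood contribution, an integral over the random-intercept distribution, with a simple midpoint rule (a crude stand-in for the adaptive quadrature Stata uses), using made-up responses and parameter values; the approximation settles down as the number of points grows.

```python
import math

def inv_logit(x):
    return 1 / (1 + math.exp(-x))

# Made-up cluster: linear predictors and binary responses for one patient
etas = [-1.6, -2.0, -2.4, -2.8, -3.2]
ys = [1, 1, 0, 0, 0]
psi = 16.0  # random-intercept variance

def likelihood_contribution(n_points):
    """Midpoint-rule approximation of the integral over zeta ~ N(0, psi)
    of the product of Bernoulli probabilities for this cluster."""
    sd = math.sqrt(psi)
    lo, hi = -8 * sd, 8 * sd
    h = (hi - lo) / n_points
    total = 0.0
    for i in range(n_points):
        z = lo + (i + 0.5) * h
        dens = math.exp(-z * z / (2 * psi)) / (sd * math.sqrt(2 * math.pi))
        prod = 1.0
        for eta, y in zip(etas, ys):
            p = inv_logit(eta + z)
            prod *= p if y == 1 else 1 - p
        total += prod * dens * h
    return total

# Increase the number of points until the results stabilize
for n in (5, 10, 30, 100):
    print(n, round(likelihood_contribution(n), 6))
```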

We do not discuss random-coefficient logistic regression in this chapter, but such models can be fit with melogit and gllamm (but not with xtlogit), using essentially the same syntax as for linear random-coefficient models discussed in section 4.5. Random-coefficient logistic regression using meologit and gllamm is demonstrated in chapter 11 for ordinal responses. Chapter 16 uses melogit for a three-level random-coefficient logistic regression model. The probit versions of these models are available in meprobit, meoprobit, and gllamm (see sections 11.10 through 11.12 for ordinal probit random-intercept models). Analogously to xtlogit, xtprobit is a fast command for two-level random-intercept models but cannot be used for random-coefficient or higher-level models.

Using xtlogit

The xtlogit command for fitting the random-intercept model is analogous to the xtreg command for fitting the corresponding linear model. We first use the xtset command to specify the clustering variable. In the xtlogit command, we use the intpoints(30) option (intpoints() stands for "integration points") to ensure accurate estimates (see section 10.10.1):


. quietly xtset patient

. xtlogit outcome i.treatment##c.month, intpoints(30)

Random-effects logistic regression              Number of obs    =  1,908
Group variable: patient                         Number of groups =    294

Random effects u_i ~ Gaussian                   Obs per group:
                                                             min =      1
                                                             avg =    6.5
                                                             max =      7

Integration method: mvaghermite                 Integration pts. =     30

                                                Wald chi2(3)     = 150.65
Log likelihood = -625.38558                     Prob > chi2      = 0.0000

     outcome   Coefficient   Std. err.      z    P>|z|   [95% conf. interval]

   treatment
 Terbinafine     -.160608    .5796716  -0.28   0.782   -1.296744   .9755275
       month     -.390956    .0443707  -8.81   0.000   -.4779209  -.3039911

   treatment#
     c.month
 Terbinafine    -.1367758    .0679947  -2.01   0.044    -.270043  -.0035085

       _cons    -1.618795    .4303891  -3.76   0.000   -2.462342  -.7752477

    /lnsig2u     2.775749    .1890237                   2.405269   3.146228

     sigma_u     4.006325    .3786451                   3.328876   4.821641
         rho     .8298976     .026684                   .7710804   .8760322

LR test of rho=0: chibar2(01) = 565.24                 Prob >= chibar2 = 0.000

The estimated regression coefficients are given in the usual format. The value next to sigma_u represents the estimated residual standard deviation √ψ of the random intercept, and the value next to rho represents the estimated residual intraclass correlation of the latent responses (see section 10.8.1).

We can use the or option to obtain exponentiated regression coefficients, which are interpreted as conditional ORs here. Instead of refitting the model, we can simply change the way the results are displayed by using the following short xtlogit command (known as "replaying the estimation results" in Stata parlance):


. xtlogit, or

Random-effects logistic regression              Number of obs    =  1,908
Group variable: patient                         Number of groups =    294

Random effects u_i ~ Gaussian                   Obs per group:
                                                             min =      1
                                                             avg =    6.5
                                                             max =      7

Integration method: mvaghermite                 Integration pts. =     30

                                                Wald chi2(3)     = 150.65
Log likelihood = -625.38558                     Prob > chi2      = 0.0000

     outcome   Odds ratio   Std. err.      z    P>|z|   [95% conf. interval]

   treatment
 Terbinafine     .8516258    .4936633  -0.28   0.782    .2734207   2.652566
       month     .6764099    .0300128  -8.81   0.000    .6200712   .7378675

   treatment#
     c.month
 Terbinafine     .8721658    .0593027  -2.01   0.044    .7633467   .9964976

       _cons     .1981373    .0852762  -3.76   0.000    .0852351   .4605897

    /lnsig2u     2.775749    .1890237                   2.405269   3.146228

     sigma_u     4.006325    .3786451                   3.328876   4.821641
         rho     .8298976     .026684                   .7710804   .8760322

Note: Estimates are transformed only in the first equation to odds ratios.
Note: _cons estimates baseline odds (conditional on zero random effects).
LR test of rho=0: chibar2(01) = 565.24                 Prob >= chibar2 = 0.000

Interpretation

The estimated ORs and their 95% confidence intervals are also given in table 10.2. We see that the estimated conditional odds (given ζj) for a subject in the itraconazole group are multiplied by 0.68 every month and the conditional odds for a subject in the terbinafine group are multiplied by 0.59 (= 0.6764 × 0.8722) every month. In terms of percentage change in estimated odds, 100%(OR − 1), the conditional odds decrease 32% [−32% = 100%(0.6764 − 1)] per month in the itraconazole group and 41% [−41% = 100%(0.6764 × 0.8722 − 1)] per month in the terbinafine group. (The difference in interpretation of ORs from random-intercept logistic regression and ordinary logistic regression is discussed in section 10.7.)
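These back-of-the-envelope calculations can be verified directly; here is a sketch in Python using the estimated conditional ORs from the xtlogit output above:

```python
# Conditional ORs from the random-intercept model (xtlogit output above)
or_month_itra = 0.6764099    # per-month OR, itraconazole group
or_interaction = 0.8721658   # ratio of per-month ORs, terbinafine/itraconazole
or_month_terb = or_month_itra * or_interaction

# Percentage change in conditional odds per month, 100%(OR - 1)
pct_itra = 100 * (or_month_itra - 1)
pct_terb = 100 * (or_month_terb - 1)

print(round(or_month_terb, 2), round(pct_itra), round(pct_terb))
```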


Table 10.2: Estimates for toenail data

                                     Marginal effects                       Conditional effects
                              Ordinary           GEE†                Random-int.        Conditional
                              logistic           logistic            logistic           logistic
Parameter                     OR (95% CI)⋆       OR (95% CI)⋆        OR (95% CI)        OR (95% CI)

Fixed part
exp(β2) [treatment]           1.00 (0.61, 1.64)  1.01 (0.61, 1.68)   0.85 (0.27, 2.65)
exp(β3) [month]               0.84 (0.80, 0.89)  0.84 (0.79, 0.89)   0.68 (0.62, 0.74)  0.68 (0.62, 0.75)
exp(β4) [treatment#c.month]   0.93 (0.84, 1.04)  0.93 (0.83, 1.03)   0.87 (0.76, 1.00)  0.91 (0.78, 1.05)

Random part
ψ                                                                    16.08
ρ                                                                    0.83

Log likelihood                −908.01                                −625.39            −188.94•

† Using exchangeable working correlation
⋆ Based on the sandwich estimator for clustered data
• Log conditional likelihood


Using melogit

The syntax for melogit is analogous to that for mixed except that we also specify the number of quadrature points, or integration points, by using the intpoints() option:

. melogit outcome i.treatment##c.month || patient:, intpoints(30)

Mixed-effects logistic regression               Number of obs    =  1,908
Group variable: patient                         Number of groups =    294

                                                Obs per group:
                                                             min =      1
                                                             avg =    6.5
                                                             max =      7

Integration method: mvaghermite                 Integration pts. =     30

                                                Wald chi2(3)     = 150.61
Log likelihood = -625.38557                     Prob > chi2      = 0.0000

     outcome   Coefficient   Std. err.      z    P>|z|   [95% conf. interval]

   treatment
 Terbinafine    -.1608934    .5802058  -0.28   0.782   -1.298076   .9762891
       month    -.3911056    .0443906  -8.81   0.000   -.4781097  -.3041016

   treatment#
     c.month
 Terbinafine    -.1368286    .0680213  -2.01   0.044   -.2701479  -.0035093

       _cons    -1.620355    .4322382  -3.75   0.000   -2.467526  -.7731834

patient
  var(_cons)      16.0841    3.062625                   11.07431   23.36021

LR test vs. logistic model: chibar2(01) = 565.24       Prob >= chibar2 = 0.0000

We store the estimates for later use within the same Stata session:

. estimates store melogit

(The command estimates save can be used to save the estimates in a file for use in a future Stata session.)

The results are almost but not exactly the same as those from xtlogit because the commands use slightly different algorithms. Estimated ORs can be obtained using the or option, possibly by just replaying the estimation output by using the command melogit, or. The estimated standard deviation of the random intercept (instead of the default variance) can be obtained by using the postestimation command estat sd (as of Stata 15).

melogit can be used with one integration point by specifying the intpoints(1) and intmethod(mcaghermite) options, which is equivalent to using the Laplace approximation; the Laplace approximation can also be requested by specifying the intmethod(laplace) option. See section 10.10.2 for the results obtained using this less accurate but faster method for the toenail data.


Using gllamm

We now introduce the community-contributed command for multilevel and latent variable modeling called gllamm (which stands for generalized linear latent and mixed models) by Rabe-Hesketh, Skrondal, and Pickles (2002, 2005). See also http://www.gllamm.org, where you can download the gllamm manual and the gllamm companion for this book, and find many other resources.

To check whether gllamm is installed on your computer, use the command

. which gllamm

If the message

command gllamm not found as either built-in or ado-file

appears, install gllamm (assuming that you have a net-aware Stata) by using the ssc command:

. ssc install gllamm

Occasionally, you should update gllamm by using ssc with the replace option:

. ssc install gllamm, replace

Before fitting the model, we construct a new variable, trt_month, for the interaction of treatment and month:

. generate trt_month = treatment*month

Using gllamm for the random-intercept logistic regression model requires that we specify a logit link and binomial distribution with the link() and family() options (exactly as for the glm command). We also use the nip() option (for the number of integration points) to request that 30 integration points be used. The cluster identifier is specified in the i() option:


. gllamm outcome treatment month trt_month, i(patient) link(logit)
>   family(binomial) nip(30) adapt

number of level 1 units = 1908
number of level 2 units = 294

Condition Number = 23.076299

gllamm model

log likelihood = -625.38558

     outcome   Coefficient   Std. err.      z    P>|z|   [95% conf. interval]

   treatment    -.1608751    .5802054  -0.28   0.782   -1.298057   .9763065
       month    -.3911055    .0443906  -8.81   0.000   -.4781096  -.3041015
   trt_month     -.136829    .0680213  -2.01   0.044   -.2701484  -.0035097
       _cons    -1.620364    .4322408  -3.75   0.000    -2.46754  -.7731873

Variances and covariances of random effects
------------------------------------------------------------------------------

***level 2 (patient)

    var(1): 16.084107 (3.0626223)
------------------------------------------------------------------------------

The estimates are almost the same as those from xtlogit and melogit. The estimated random-intercept variance is given next to var(1). We store the gllamm estimates for later use:

. estimates store gllamm

We can use the eform option to obtain estimated ORs, or we can alternatively use the command

gllamm, eform

to replay the estimation results after having already fit the model. We can also use the robust option to obtain robust standard errors based on the sandwich estimator. At the time of writing this book, gllamm does not accept factor variables (i., c., and #) but does accept i. if the gllamm command is preceded by the prefix command xi:.

10.7 Subject-specific or conditional versus population-averaged or marginal relationships

The estimated regression coefficients for the random-intercept logistic regression model are more extreme (more different from 0) than those for the ordinary logistic regression model (see table 10.2). Correspondingly, the estimated ORs are more extreme (more different from 1) than those for the ordinary logistic regression model. The reason for this discrepancy is that ordinary logistic regression fits overall population-averaged or marginal probabilities, whereas random-effects logistic regression fits subject-specific or conditional probabilities for the individual patients.

This important distinction can be seen in the way the two models are written in (10.5) and (10.6). Whereas the former is for the overall or population-averaged probability, conditioning only on covariates, the latter is for the subject-specific probability, conditioning on the covariates and the subject-specific random intercept ζj. ORs derived from these models can be referred to as population-averaged (although the averaging is applied to the probabilities) or subject-specific ORs, respectively.

For instance, in the random-intercept logistic regression model, we can interpret the estimated subject-specific or conditional OR of 0.68 for month (a covariate varying within patients) as the OR for each patient in the itraconazole group: the odds for a given patient hence decrease by 32% per month. In contrast, the estimated population-averaged OR of 0.84 for month means that the odds of having onycholysis among the patients in the itraconazole group decrease by 16% per month.

Considering instead the OR for treatment (a covariate varying only between patients) when month equals 1, the subject-specific or conditional OR is estimated as 0.74 (= 0.85 × 0.87), and the odds after one month of treatment are hence 26% lower for terbinafine than for itraconazole for each subject. However, because no patients are given both terbinafine and itraconazole, it might be best to interpret the OR in terms of a comparison between two patients j and j′ with the same value of the random intercept ζj = ζj′, one of whom is given terbinafine and the other itraconazole. The estimated population-averaged or marginal OR of about 0.93 (= 1.00 × 0.93) means that the odds after one month of treatment are 7% lower for the group of patients given terbinafine compared with the group of patients given itraconazole.

When interpreting subject-specific or conditional ORs, keep in mind that these are not purely based on within-subject information and are hence not free from subject-level confounding. In fact, for between-subject covariates like treatment group above, there is no within-subject information in the data. Although the ORs are interpreted as effects keeping the subject-specific random intercepts ζj constant, these random intercepts are assumed to be independent of the covariates included in the model and hence do not represent effects of unobserved confounders, which are by definition correlated with the covariates. Unlike fixed-effects approaches, we are therefore not controlling for unobserved confounders. Both conditional and marginal effect estimates suffer from omitted-variable bias if subject-level or other confounders are not included in the model. See section 3.7.4 for a discussion of this issue in linear random-intercept models. Section 10.13.1 is on conditional logistic regression, the fixed-effects approach in logistic regression that controls for observed and unobserved subject-level confounders.

The population-averaged probabilities implied by the random-intercept model can be obtained by averaging the subject-specific probabilities over the random-intercept distribution. Because the random intercepts are continuous, this averaging is accomplished by integration:


Pr(yij = 1|x2j, x3ij) = ∫ Pr(yij = 1|x2j, x3ij, ζj) φ(ζj; 0, ψ) dζj

                      = ∫ [exp(β1 + β2x2j + β3x3ij + β4x2jx3ij + ζj) / {1 + exp(β1 + β2x2j + β3x3ij + β4x2jx3ij + ζj)}] φ(ζj; 0, ψ) dζj

                      ≠ exp(β1 + β2x2j + β3x3ij + β4x2jx3ij) / {1 + exp(β1 + β2x2j + β3x3ij + β4x2jx3ij)}    (10.8)

where φ(ζj ; 0, ψ) is the normal density function with mean 0 and variance ψ.

The difference between population-averaged and subject-specific effects is due to the average of a nonlinear function not being the same as the nonlinear function of the average. In the present context, the average of the inverse logit of the linear predictor, β1 + β2x2j + β3x3ij + β4x2jx3ij + ζj, is not the same as the inverse logit of the average of the linear predictor, which is β1 + β2x2j + β3x3ij + β4x2jx3ij. We can see this by comparing the simple average of the inverse logits of 1 and 2 with the inverse logit of the average of 1 and 2:

. display (invlogit(1) + invlogit(2))/2

.80592783

. display invlogit((1+2)/2)

.81757448
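The same comparison can be reproduced outside Stata; the following Python snippet (an illustration of the arithmetic, not part of the book's Stata session) defines an invlogit function and repeats the calculation:

```python
import math

def invlogit(x):
    # inverse logit (expit): exp(x)/{1 + exp(x)}
    return 1 / (1 + math.exp(-x))

avg_of_invlogits = (invlogit(1) + invlogit(2)) / 2  # average of the inverse logits
invlogit_of_avg = invlogit((1 + 2) / 2)             # inverse logit of the average
print(avg_of_invlogits, invlogit_of_avg)            # 0.8059... vs 0.8176...
```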

We can also see this in figure 10.9. Here the individual, thin, dashed curves represent subject-specific logistic curves for a made-up model, each with a subject-specific (randomly drawn) intercept. These are inverse logit functions of the subject-specific linear predictors (here the linear predictors are simply β1 + β2xij + ζj). The thick, dashed curve is the inverse logit function of the average of the linear predictor (that is, ζj = 0), and this is not the same as the flatter average of the logistic functions shown as a thick, solid curve.


Figure 10.9: Subject-specific probabilities (thin, dashed curves), population-averaged probabilities (thick, solid curve), and population median probabilities (thick, dashed curve) plotted against xij for random-intercept logistic regression

The average curve has a different shape than the subject-specific curves. Specifically, the effect of xij on the average curve is smaller than the effect of xij on the subject-specific curves. However, the population median probability is the same as the subject-specific probability evaluated at the median of ζj (ζj = 0), shown as the thick, dashed curve, because the inverse logit function is a strictly increasing function.
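The attenuation of the average curve can also be checked numerically. The following Python sketch uses hypothetical values (fixed part η = 2 and random-intercept variance ψ = 16, chosen only for illustration) and approximates the integral in (10.8) by a simple Riemann sum:

```python
import math

def invlogit(x):
    return 1 / (1 + math.exp(-x))

def normal_pdf(z, var):
    return math.exp(-z * z / (2 * var)) / math.sqrt(2 * math.pi * var)

eta, psi = 2.0, 16.0  # hypothetical fixed part and random-intercept variance

# population-averaged probability: integrate the subject-specific probability
# over the N(0, psi) random-intercept distribution (Riemann sum over +-10 SDs)
step = 0.001
marginal = sum(invlogit(eta + z) * normal_pdf(z, psi) * step
               for z in [i * step - 40 for i in range(80000)])

conditional = invlogit(eta)   # probability for the median subject (zeta_j = 0)
print(conditional, marginal)  # the marginal probability is pulled toward 0.5
```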

Another way of understanding why the subject-specific effects are more extreme than the population-averaged effects is by writing the random-intercept logistic regression model as a latent-response model:

y∗ij = β1 + β2x2j + β3x3ij + β4x2jx3ij + ζj + εij,    where ξij ≡ ζj + εij is the total residual

The total residual variance is

Var(ξij) = ψ + π²/3

estimated as ψ̂ + π²/3 = 16.0841 + 3.2899 = 19.37, which is much greater than the residual variance of about 3.29 for an ordinary logistic regression model. As we have already seen in figure 10.4 for probit models, the slope in the model for y∗ij has to increase when the residual standard deviation increases to produce an equivalent curve for the marginal probability that the observed response is 1. Therefore, the regression coefficients of the random-intercept model (representing subject-specific effects) must be larger in absolute value than those of the ordinary logistic regression model (representing population-averaged effects) to obtain a good fit of the model-implied marginal probabilities to the corresponding sample proportions (see exercise 10.10). In section 10.12, we will obtain predicted subject-specific and population-averaged probabilities for the toenail data.

In the special case of extremely rare outcomes, subject-specific and population-averaged effects become practically identical in random-intercept logistic models (Lin, Psaty, and Kronmal 1998), in contrast to random-intercept probit models.

Having described subject-specific and population-averaged probabilities or expectations of yij, for given covariate values, we now consider the corresponding variances. The subject-specific or conditional variance is

Var(yij |xij , ζj) = Pr(yij = 1|xij , ζj){1− Pr(yij = 1|xij , ζj)}

and the population-averaged or marginal variance (obtained by integrating over ζj) is

Var(yij |xij) = Pr(yij = 1|xij){1− Pr(yij = 1|xij)}

We see that the random-intercept variance ψ does not affect the relationship between the marginal variance and the marginal mean. This is in contrast to models for counts described in chapter 13, where a random intercept (with ψ > 0) produces so-called overdispersion, with a larger marginal variance for a given marginal mean than the model without a random intercept (ψ = 0). It is sometimes not recognized that overdispersion is impossible for dichotomous responses (Skrondal and Rabe-Hesketh 2007).

10.8 Measures of dependence and heterogeneity

10.8.1 Conditional or residual intraclass correlation of the latent responses

Returning to the latent-response formulation, the dependence among the dichotomous responses for the same subject (or the between-subject heterogeneity) can be quantified by the conditional intraclass correlation or residual intraclass correlation ρ of the latent responses y∗ij given the covariates:

ρ ≡ Cor(y∗ij, y∗i′j |xij, xi′j) = Cor(ξij, ξi′j) = ψ / (ψ + π²/3)

Substituting the estimated variance ψ̂ = 16.08, we obtain an estimated conditional intraclass correlation of 0.83, which is large even for longitudinal data. The estimated intraclass correlation is also reported next to rho by xtlogit and can be obtained using estat icc after melogit:


. estimates restore melogit
(results melogit are active now)

. estat icc

Residual intraclass correlation

Level ICC Std. err. [95% conf. interval]

patient .8301913 .0268433 .7709672 .8765531
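The reported ICC is just ψ̂/(ψ̂ + π²/3); a quick check of the arithmetic in Python (illustrative only, using the variance estimate from the melogit output):

```python
import math

psi_hat = 16.0841  # estimated random-intercept variance from melogit
icc = psi_hat / (psi_hat + math.pi ** 2 / 3)
print(icc)  # matches the estat icc output of .8301913 up to rounding
```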

For probit models, the expression for the residual intraclass correlation of the latent responses is as above with π²/3 replaced by 1.

10.8.2 Median odds ratio

Larsen et al. (2000) and Larsen and Merlo (2005) suggest a measure of heterogeneity for random-intercept models with normally distributed random intercepts. They consider repeatedly sampling two subjects with the same covariate values and forming the OR comparing the subject who has the larger random intercept with the other subject. For a given pair of subjects j and j′, this OR is given by exp(|ζj − ζj′|), and heterogeneity is expressed as the median of these ORs across repeated samples.

The median and other percentiles a > 1 can be obtained from the cumulative distribution function

Pr{exp(|ζj − ζj′|) ≤ a} = Pr{ |ζj − ζj′| / √(2ψ) ≤ ln(a) / √(2ψ) } = 2Φ{ln(a) / √(2ψ)} − 1

If the cumulative probability is set to 1/2, a is the median OR, ORmedian:

2Φ{ln(ORmedian) / √(2ψ)} − 1 = 1/2

Solving this equation gives

ORmedian = exp{√(2ψ) Φ−1(3/4)}
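The formula is easy to evaluate directly; a Python sketch (illustrative, using the toenail variance estimate ψ̂ = 16.0841 that appears in the coeflegend output below):

```python
import math
from statistics import NormalDist

psi_hat = 16.0841  # estimated random-intercept variance for the toenail data
or_median = math.exp(math.sqrt(2 * psi_hat) * NormalDist().inv_cdf(3 / 4))
print(or_median)   # about 45.86
```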


To be able to plug in the variance estimate, we first check how this is stored by melogit by using the coeflegend option:

. melogit, coeflegend

Mixed-effects logistic regression               Number of obs     =  1,908
Group variable: patient                         Number of groups  =    294

                                                Obs per group:
                                                              min =      1
                                                              avg =    6.5
                                                              max =      7

Integration method: mvaghermite                 Integration pts.  =     30

                                                Wald chi2(3)      = 150.61
Log likelihood = -625.38557                     Prob > chi2       = 0.0000

outcome               Coefficient   Legend

treatment
  Terbinafine          -.1608934    _b[1.treatment]
month                  -.3911056    _b[month]

treatment#c.month
  Terbinafine          -.1368286    _b[1.treatment#c.month]

_cons                  -1.620355    _b[_cons]

patient
  var(_cons)            16.0841     _b[/var(_cons[patient])]

LR test vs. logistic model: chibar2(01) = 565.24       Prob >= chibar2 = 0.0000

Now we obtain ORmedian as follows:

. display exp(sqrt(2*_b[/var(_cons[patient])])*invnormal(3/4))
45.855915

When two subjects are chosen at random at a given time point from the same treatment group, the OR comparing the subject who has the larger odds with the subject who has the smaller odds will exceed 45.86 half the time, which is a very large OR. For comparison, the estimated OR comparing two subjects at 20 months who had the same value of the random intercept, but one of whom received itraconazole (treatment=0) and the other of whom received terbinafine (treatment=1), is 18.13 {= 1/exp(−0.1609 − 20 × 0.1368)}.

10.8.3 q Measures of association for observed responses at median fixed part of the model

The reason why the degree of dependence is often expressed in terms of the residual intraclass correlation for the latent responses y∗ij is that the intraclass correlation for the observed responses yij varies according to the values of the covariates.


One may nevertheless proceed by obtaining measures of association for specific values of the covariates. In particular, Rodríguez and Elo (2003) suggest obtaining the marginal association between the binary observed responses at the sample median value of the estimated fixed part of the model, β̂1 + β̂2x2j + β̂3x3ij + β̂4x2jx3ij. Marginal association here refers to the fact that the associations are based on marginal probabilities (averaged over the random-intercept distribution with the ML estimate ψ̂ plugged in).

Rodríguez and Elo (2003) have written a program called xtrho that can be used after xtlogit, xtprobit, and xtcloglog to produce such marginal association measures and their confidence intervals. The program can be downloaded by issuing the command

. findit xtrho

clicking on st0031, and then clicking on click here to install. Having downloaded xtrho, we run it after refitting the random-intercept logistic model with xtlogit:

. quietly xtset patient

. quietly xtlogit outcome i.treatment##c.month, re intpoints(30)

. xtrho

Measures of intra-class manifest association in random-effects logit
Evaluated at median linear predictor

Measure           Estimate    [95% Conf. Interval]

Marginal prob.     .250812    .217334    .283389
Joint prob.        .178265    .139538    .217568
Odds ratio         22.9189    16.2512    32.6823
Pearson's r         .61392    .542645    .675887
Yule's Q           .916384    .884066    .940622

We see that for a patient whose fixed part of the linear predictor is equal to the sample median, the marginal probability of having onycholysis (a measure of toenail infection) at an occasion is estimated as 0.25 and the joint probability of having onycholysis at two occasions is estimated as 0.18. From the estimated joint probabilities for the responses 00, 10, 01, and 11 in the 2×2 table for two occasions (with linear predictor equal to the sample median), xtrho estimates various measures of association for onycholysis for two occasions, given that the fixed part of the linear predictor equals the sample median.

The estimated OR of 22.92 means that the odds of onycholysis at one of the two occasions is almost 23 times as high for a patient who had onycholysis at the other occasion as for a patient with the same covariates (treatment and month) who did not have onycholysis at the other occasion. The estimated Pearson correlation of 0.61 for the observed responses is lower than the estimated residual correlation for the latent responses of 0.83, as would be expected from statistical theory. Squaring the Pearson correlation, we see that onycholysis at one occasion explains about 38% of the variation in onycholysis at the other occasion for fixed covariate values.

We can use the detail option to obtain the above measures of associations evaluated at sample percentiles other than the median. We can also use Rodríguez and Elo's (2003) xtrhoi command to obtain measures of associations for other values of the fixed part of the linear predictor and other values of the variance of the random-intercept distribution.

xtrho and xtrhoi assume that the fixed part of the linear predictor is the same across occasions. However, in the toenail example, month must change between any two occasions within a patient, and the linear predictor is a function of month. Considering two occasions with month equal to 3 and 6, the OR is estimated as 25.6 for patients in the control group and 29.4 for patients in the treatment group. Marginal 2×2 tables, taking into account that month changes, can be obtained using gllamm and gllapred with the ll option as demonstrated in a do-file that can be copied into the working directory as follows:

copy https://www.stata-press.com/data/mlmus4/ch10table.do ch10table.do

10.9 Inference for random-intercept logistic models

10.9.1 Tests and confidence intervals for odds ratios

As discussed earlier, we can interpret the regression coefficient β as the difference in log odds associated with a unit change in the corresponding covariate, and we can interpret the exponentiated regression coefficient as an OR, OR = exp(β). The relevant null hypothesis for ORs usually is H0: OR = 1, and this corresponds directly to the null hypothesis that the corresponding regression coefficient is 0, H0: β = 0.

Wald tests can be used for regression coefficients just as described in section 3.6.1 for linear models. Ninety-five percent Wald confidence intervals for individual regression coefficients are obtained using

β̂ ± z0.975 SE(β̂)

where z0.975 = 1.96 is the 97.5th percentile of the standard normal distribution. The corresponding confidence interval for the OR is obtained by exponentiating both limits of the confidence interval:

exp{β̂ − z0.975 SE(β̂)} to exp{β̂ + z0.975 SE(β̂)}

Wald tests for linear combinations of regression coefficients can be used to test the corresponding multiplicative relationships among odds for different covariate values. For instance, for the toenail data, we may want to obtain the OR comparing the treatment groups after 20 months. The corresponding difference in log odds after 20 months is a linear combination of regression coefficients, namely, β2 + β4 × 20 (see section 1.8 if this is not clear). We can test the null hypothesis that this difference in log odds is 0 and hence that the OR is 1 by using the lincom command:


. lincom 1.treatment + 1.treatment#c.month*20

( 1) [outcome]1.treatment + 20*[outcome]1.treatment#c.month = 0

outcome Coefficient Std. err. z P>|z| [95% conf. interval]

(1) -2.896123 1.309682 -2.21 0.027 -5.463053 -.3291935

If we require the corresponding OR with a 95% confidence interval, we can use the lincom command with the or option:

. lincom 1.treatment + 1.treatment#c.month*20, or

( 1) [outcome]1.treatment + 20*[outcome]1.treatment#c.month = 0

outcome Odds ratio Std. err. z P>|z| [95% conf. interval]

(1) .0552369 .0723428 -2.21 0.027 .0042406 .7195038

After 20 months of treatment, the OR comparing terbinafine (treatment=1) with itraconazole is estimated as 0.055. Such small numbers are difficult to interpret, so we can switch the groups around by taking the reciprocal of the OR, 18 (= 1/0.055), which represents the OR comparing itraconazole with terbinafine. Alternatively, we can always switch the comparison around by simply changing the sign of the corresponding difference in log odds in the lincom command:

lincom -(1.treatment + 1.treatment#c.month*20), or
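These transformations are simple enough to check by hand; a Python sketch of the arithmetic (illustrative only, using the rounded lincom estimates above):

```python
import math
from statistics import NormalDist

b, se = -2.896123, 1.309682      # difference in log odds after 20 months and its SE
z = NormalDist().inv_cdf(0.975)  # 97.5th percentile of the standard normal

odds_ratio = math.exp(b)                           # about 0.055
ci = (math.exp(b - z * se), math.exp(b + z * se))  # about (0.0042, 0.72)
reciprocal = 1 / odds_ratio                        # about 18: itraconazole vs terbinafine
print(odds_ratio, ci, reciprocal)
```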

Multivariate Wald tests can be performed using testparm. Wald tests and confidence intervals can be based on robust standard errors from the sandwich estimator, obtained using the vce(robust) option in estimation commands (if the number of clusters J minus the number of cluster-level covariates q is at least 42).

Null hypotheses about individual regression coefficients or several regression coefficients can also be tested using likelihood-ratio tests. Although likelihood-ratio and Wald tests are asymptotically equivalent, the test statistics are not identical in finite samples. (See display 2.2 for the relationships between likelihood-ratio, Wald, and score tests.) If the statistics are very different, there may be a sparseness problem, for instance with mostly “1” responses or mostly “0” responses in one of the groups.

10.9.2 Tests of variance components

The last line of the output of both xtlogit and melogit provides a likelihood-ratio test for the null hypothesis that the residual between-cluster variance ψ is 0. The p-value is based on the correct asymptotic sampling distribution [0.5χ²(0) + 0.5χ²(1), as described for linear models in section 2.6.2, not the naïve χ²(1)]. For the toenail data, the likelihood-ratio statistic is 565.2, giving p < 0.001, which suggests that a multilevel model is required.
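The mixture p-value can be computed from the standard normal distribution alone, because Pr{χ²(1) > x} = 2{1 − Φ(√x)}. A Python sketch (chibar2_pvalue is a hypothetical helper name for illustration):

```python
import math
from statistics import NormalDist

def chibar2_pvalue(stat):
    # p-value for the 0.5*chi2(0) + 0.5*chi2(1) mixture distribution:
    # 0.5 * Pr{chi2(1) > stat} = 1 - Phi(sqrt(stat))
    return 1 - NormalDist().cdf(math.sqrt(stat))

print(chibar2_pvalue(565.24))  # essentially zero for the toenail data
print(chibar2_pvalue(2.706))   # about 0.05, the 5% critical value being 2.71
```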


10.10 Maximum likelihood estimation

10.10.1 q Adaptive quadrature

The marginal likelihood is the joint probability of all observed responses given the observed covariates. For linear mixed models, this marginal likelihood can be evaluated and maximized relatively easily (see section 2.10). However, in generalized linear mixed models, the marginal likelihood does not have a closed form and must be evaluated by approximate methods.

To see this, we will now construct this marginal likelihood step by step for a random-intercept logistic regression model with one covariate xj. The responses are conditionally independent given the random intercept ζj and the covariate xj. Therefore, the joint probability of all the responses yij (i = 1, . . . , nj) for cluster j given the random intercept and covariate is simply the product of the conditional probabilities of the individual responses:

Pr(y1j, . . . , ynjj |xj, ζj) = ∏(i=1 to nj) Pr(yij |xj, ζj) = ∏(i=1 to nj) {exp(β1 + β2xj + ζj)}^yij / {1 + exp(β1 + β2xj + ζj)}

In the last term,

{exp(β1 + β2xj + ζj)}^yij / {1 + exp(β1 + β2xj + ζj)} = exp(β1 + β2xj + ζj) / {1 + exp(β1 + β2xj + ζj)}   if yij = 1
                                                      = 1 / {1 + exp(β1 + β2xj + ζj)}                      if yij = 0

as specified by the logistic regression model.
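The compact way of writing both cases as one expression can be verified directly; a small Python illustration (names and values hypothetical):

```python
import math

def bernoulli_logit_prob(y, eta):
    # exp(eta)^y / {1 + exp(eta)}: probability of response y (0 or 1)
    # for a logistic model with linear predictor eta
    return math.exp(eta) ** y / (1 + math.exp(eta))

eta = 0.7  # hypothetical linear predictor value beta1 + beta2*xj + zeta_j
p1 = bernoulli_logit_prob(1, eta)  # reduces to exp(eta)/{1 + exp(eta)}
p0 = bernoulli_logit_prob(0, eta)  # reduces to 1/{1 + exp(eta)}
print(p1, p0, p1 + p0)             # the two cases sum to 1
```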

To obtain the marginal joint probability of the responses, not conditioning on the random intercept ζj (but still on the covariate xj), we integrate out the random intercept:

Pr(y1j, . . . , ynjj |xj) = ∫ Pr(y1j, . . . , ynjj |xj, ζj) φ(ζj; 0, ψ) dζj    (10.9)

where φ(ζj; 0, ψ) is the normal density of ζj with mean 0 and variance ψ. Unfortunately, this integral does not have a closed-form expression.

The marginal likelihood is just the joint probability of all responses for all clusters. Because the clusters are mutually independent, this is given by the product of the marginal joint probabilities of the responses for the individual clusters:

L(β1, β2, ψ) = ∏(j=1 to N) Pr(y1j, . . . , ynjj |xj)

This marginal likelihood is viewed as a function of the parameters β1, β2, and ψ (with the observed responses treated as given). The parameters are estimated by finding the values of β1, β2, and ψ that yield the largest likelihood. The search for the maximum is iterative, beginning with some initial guesses or starting values for the parameters and updating these step by step until the maximum is reached, typically by using a Newton–Raphson or expectation-maximization (EM) algorithm.

The integral over ζj in (10.9) can be approximated by a sum of R terms with er substituted for ζj and the normal density replaced by a weight wr for the rth term, r = 1, . . . , R,

Pr(y1j, . . . , ynjj |xj) ≈ ∑(r=1 to R) Pr(y1j, . . . , ynjj |xj, ζj = er) wr

where er and wr are called Gauss–Hermite quadrature locations and weights, respectively. This approximation can be viewed as replacing the continuous density of ζj with a discrete distribution with R possible values of ζj having probabilities wr = Pr(ζj = er). The Gauss–Hermite approximation is illustrated for R = 5 in figure 10.10. Obviously, the approximation improves when the number of points R increases.
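The idea can be sketched in Python with the standard five-point Gauss–Hermite rule (the locations and weights below are the usual tabulated values; the substitution ζj = √(2ψ) er converts the N(0, ψ) integral to Gauss–Hermite form):

```python
import math

def invlogit(x):
    return 1 / (1 + math.exp(-x))

# standard 5-point Gauss-Hermite locations and weights
nodes = [-2.0201828705, -0.9585724646, 0.0, 0.9585724646, 2.0201828705]
weights = [0.0199532421, 0.3936193232, 0.9453087205, 0.3936193232, 0.0199532421]

def marginal_prob_gh(eta, psi):
    # approximate E{invlogit(eta + zeta)} for zeta ~ N(0, psi):
    # substitute zeta = sqrt(2*psi)*e_r and divide by sqrt(pi)
    return sum(w * invlogit(eta + math.sqrt(2 * psi) * e)
               for e, w in zip(nodes, weights)) / math.sqrt(math.pi)

print(marginal_prob_gh(1.0, 1.0))  # close to the exact integral for modest psi
```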

Figure 10.10: Gauss–Hermite quadrature: Approximating continuous density (dashed curve) by discrete distribution (bars)

The ordinary quadrature approximation described above can perform poorly if the function being integrated, called the integrand, has a sharp peak, as discussed in Rabe-Hesketh, Skrondal, and Pickles (2002, 2005). Sharp peaks can occur when the clusters are very large so that many functions (the individual response probabilities as functions of ζj) are multiplied to yield Pr(y1j, . . . , ynjj |xj, ζj). Similarly, if the responses are counts or continuous responses, even a few terms can result in a highly peaked function. Another potential problem is a high intraclass correlation. Here the functions being multiplied coincide with each other more closely because of the greater similarity of responses within clusters, yielding a sharper peak. In fact, the toenail data we have been analyzing, which has an estimated conditional intraclass correlation for the latent responses of 0.83, poses real problems for estimation using ordinary quadrature, as pointed out by Lesaffre and Spiessens (2001).

The top panel in figure 10.11 shows the same five-point quadrature approximation and density of ζj as in figure 10.10. The solid curve is proportional to the integrand for a hypothetical cluster. Here the quadrature approximation works poorly because the peak of the integrand falls between adjacent quadrature points.

Figure 10.11: Density of ζj (dashed curve), normalized integrand (solid curve), and quadrature weights (bars) for ordinary quadrature (top panel) and adaptive quadrature (bottom panel) (source: Rabe-Hesketh, Skrondal, and Pickles 2002)


The bottom panel of figure 10.11 shows an improved approximation, known as adaptive quadrature, with rescaled and translated locations,

erj = aj + bj er    (10.10)

that are concentrated under the peak of the integrand, where aj and bj are cluster-specific constants. This transformation of the locations is accompanied by a transformation of the weights wr that also depends on aj and bj. The method is called adaptive because the quadrature locations and weights are adapted to the data for the individual clusters.

To maximize the likelihood, we start with a set of initial or starting values of the parameters and then keep updating the parameters until the likelihood is maximized. The quantities aj and bj needed to evaluate the likelihood are functions of the parameters (as well as the data) and must therefore be updated or “readapted” when the parameters are updated.

There are two different implementations of adaptive quadrature in Stata that differ in the values used for aj and bj in (10.10). The method called mvaghermite for “mean–variance adaptive Gauss–Hermite quadrature” uses the posterior mean of ζj for aj and the posterior standard deviation for bj. This is the default in melogit, xtlogit, and gllamm. Obtaining the posterior mean and standard deviation requires numerical integration (which requires values of aj and bj, making the process iterative), so the mvaghermite version of adaptive quadrature sometimes does not work when there are too few quadrature points (for example, fewer than five). Details of the algorithm are given in Rabe-Hesketh, Skrondal, and Pickles (2002, 2005) and Skrondal and Rabe-Hesketh (2004).
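The posterior moments that mvaghermite adapts to can be sketched by brute force. The Python snippet below (purely illustrative; the responses, fixed part, and ψ are made up) computes the posterior mean aj and standard deviation bj of ζj for one cluster by Riemann integration:

```python
import math

def invlogit(x):
    return 1 / (1 + math.exp(-x))

def posterior_mean_sd(ys, eta, psi):
    # posterior of zeta_j is proportional to
    # prod_i Pr(y_ij | zeta) * phi(zeta; 0, psi)
    step = 0.01
    grid = [i * step - 20 for i in range(4000)]

    def kernel(z):
        lik = 1.0
        for y in ys:
            p = invlogit(eta + z)
            lik *= p if y == 1 else 1 - p
        return lik * math.exp(-z * z / (2 * psi))

    norm = sum(kernel(z) * step for z in grid)
    mean = sum(z * kernel(z) * step for z in grid) / norm
    var = sum((z - mean) ** 2 * kernel(z) * step for z in grid) / norm
    return mean, math.sqrt(var)

# hypothetical cluster with responses 1, 1, 1, 0, fixed part 0, and psi = 4
a_j, b_j = posterior_mean_sd([1, 1, 1, 0], eta=0.0, psi=4.0)
print(a_j, b_j)  # a_j > 0 (mostly 1 responses); b_j < sqrt(4) = 2
```

The key point the sketch illustrates is that the posterior is shifted toward the cluster's data and is tighter than the prior, which is exactly why locations placed at aj with spread bj capture the peak of the integrand.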

The method called mcaghermite for “mode-curvature adaptive Gauss–Hermite quadrature” uses the posterior mode of ζj for aj, and for bj it uses the standard deviation of the normal density whose logarithm has the same curvature (second derivative) as the log posterior of ζj at the mode. An advantage of this approach is that it does not rely on numerical integration and can therefore be implemented even with one quadrature point. With one quadrature point, this version of adaptive quadrature becomes a Laplace approximation. The laplace method is the default in melogit for models with crossed random effects (see chapter 16) and can be requested for other models by using the intmethod(laplace) option.

10.10.2 Some speed and accuracy considerations

Integration methods and number of quadrature points

As discussed in section 10.10.1, the likelihood involves integrals that are evaluated by numerical integration. The marginal likelihood itself, as well as the marginal ML estimates, are therefore only approximate. The accuracy increases as the number of quadrature points increases, at the cost of increased computation time. We can assess whether the approximation is adequate in a given situation by repeating the analysis with a larger number of quadrature points. If we get essentially the same result, the lower number of quadrature points is likely to be adequate. Such checking should always be done before estimates are taken at face value. See section 16.3.4 for an example in gllamm. For a given number of quadrature points, adaptive quadrature is more accurate than ordinary quadrature. Stata's commands therefore use adaptive quadrature by default, and we recommend using the adapt option in gllamm.

Because of numerical integration, estimation can be slow, especially if there are many random effects. The time it takes to fit a model is approximately proportional to the product of the number of quadrature points for all random effects. For example, if there are two random effects at level 2 (a random intercept and slope) and eight quadrature points are used for each random effect, the time will be approximately proportional to 64. Therefore, using four quadrature points for each random effect will take only about one-fourth (16/64) as long as using eight. The time is also approximately proportional to the number of observations. melogit and the other “me” commands use analytical differentiation as of Stata 15, making them much faster than before when they used numerical differentiation. gllamm still uses numerical differentiation, and its computation time is therefore approximately quadratic in the number of parameters.

For large problems, it may be advisable to estimate how long estimation will take before starting work on a project. In this case, we recommend fitting a similar model with fewer random effects, fewer integration points, fewer observations, or some combination of those, and using the above approximate proportionality factors to estimate the time that will be required for the larger problem.

For random-intercept models, melogit and xtlogit are relatively fast because they use analytical derivatives. For random-coefficient models or higher-level models introduced in chapter 16, the quickest way of obtaining results is with melogit by specifying intmethod(laplace) or intmethod(mcaghermite) together with intpoints(1). Although this Laplace approximation sometimes works well, it can produce severely biased estimates, especially if the clusters are small and the (true) random-intercept variance is large, as for the toenail data. For these data, we obtain the following:

Page 45: 10 Dichotomous or binary responses

10.10.2 Some speed and accuracy considerations 601

. melogit outcome i.treatment##c.month || patient:, intmethod(laplace)

Mixed-effects logistic regression               Number of obs     =  1,908
Group variable: patient                         Number of groups  =    294

                                                Obs per group:
                                                              min =      1
                                                              avg =    6.5
                                                              max =      7

Integration method: laplace

                                                Wald chi2(3)      = 131.96
Log likelihood = -627.80894                     Prob > chi2       = 0.0000

outcome             Coefficient  Std. err.      z    P>|z|     [95% conf. interval]

treatment
  Terbinafine        -.3070105   .6899463   -0.44   0.656     -1.65928    1.045259
month                -.4000896   .0470581   -8.50   0.000    -.4923217   -.3078574

treatment#c.month
  Terbinafine        -.1372588    .069586   -1.97   0.049    -.2736448   -.0008728

_cons                -2.523249   .7882044   -3.20   0.001    -4.068101   -.9783969

patient
  var(_cons)          20.89221   6.580464                     11.26885    38.73371

LR test vs. logistic model: chibar2(01) = 560.40 Prob >= chibar2 = 0.0000

We see that the estimated intercept and coefficient of treatment are very different from the estimates in section 10.6.3 using adaptive quadrature with 30 quadrature points. As mentioned in the previous section, the integration method mvaghermite typically requires at least five quadrature points, but the method mcaghermite used above (with one integration point, which makes this a Laplace approximation) could also be used with just two or three integration points to check the accuracy of the Laplace approximation while keeping estimation relatively fast (see section 16.8).

Starting values

To speed up estimation in melogit or gllamm, use good starting values if they are available. For instance, when increasing the number of quadrature points or adding or dropping covariates, use the previous estimates as starting values. Another possibility, when estimation is slow because the dataset is large, is fitting the model on a subset of the data to obtain starting values for the full dataset. Starting values can be specified using the from() option. This option should be combined with the skip option if the new model contains fewer parameters than supplied. You can also use the copy option if your parameters are supplied in the correct order yet are not necessarily labeled correctly. Use of these options is demonstrated in sections 16.3 and 16.8. The startvalues() option in melogit can be used to specify how starting values should be computed.

Page 46: 10 Dichotomous or binary responses

602 Chapter 10 Dichotomous or binary responses

Using melogit and gllamm for collapsible data

For some datasets and models, you can represent the data by using fewer rows than there are observations, thus speeding up estimation. For example, if the response is dichotomous and we are using one dichotomous covariate in a two-level dataset, then we can use one row of data for each combination of covariate and response (00, 01, 10, 11) for each cluster, leading to at most four rows per cluster. We can then specify a variable containing level-1 frequency weights equal to the number of observations, or level-1 units, in each cluster having each combination of the covariate and response values. Level-2 weights can be used if several clusters have the same response and covariate pattern across units. For example, if clusters are of size nj = 5 and the response variable is binary, then there are only 32 different response patterns; if there is one cluster-level binary covariate, the dataset can be reduced to 64 cluster types, which can be treated as “clusters” in the collapsed data, giving a total of 320 (= 64 × 5) units. Level-2 weights then specify how many original clusters each of the 64 “clusters” in the collapsed dataset represents. For collapsed data with frequency weights, melogit can be used with the fweight() and [fweight=] options for weights at the highest and lowest level, respectively. Weights at all levels can be specified in gllamm by using the weight() option. See exercise 10.7 for an example with level-1 weights, and see exercises 10.3 and 2.3 for examples with level-2 weights. In exercise 16.11, collapsing the data reduces computation time by about 99%.
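The collapsing step itself is just frequency counting. A language-neutral sketch in Python (with a made-up miniature dataset) shows the idea of one weighted row per distinct (cluster, covariate, response) combination:

```python
from collections import Counter

# hypothetical two-level data: (cluster id, binary covariate, binary response)
rows = [
    (1, 0, 0), (1, 0, 0), (1, 0, 1), (1, 1, 1),
    (2, 0, 0), (2, 1, 1), (2, 1, 1), (2, 1, 0),
]

# one collapsed row per (cluster, covariate, response) combination, with a
# level-1 frequency weight equal to the number of original observations
collapsed = Counter(rows)
for (cluster, x, y), weight in sorted(collapsed.items()):
    print(cluster, x, y, weight)
```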

Spherical quadrature in gllamm

For models involving several random effects at the same level, such as two-level random-coefficient models with a random intercept and slope, the multivariate integral can be evaluated more efficiently using spherical quadrature instead of the default Cartesian-product quadrature. For the random intercept and slope example, Cartesian-product quadrature consists of evaluating the function being integrated on the rectangular grid of quadrature points consisting of all combinations of ζ1j = e1, . . . , eR and ζ2j = e1, . . . , eR, giving R² terms. In contrast, spherical quadrature consists of evaluating ζ1j and ζ2j at values falling on concentric circles (spheres in three dimensions). The important point is that the same accuracy can now be achieved with fewer than R² points. For example, when R = 8, Cartesian-product quadrature requires 64 evaluations, while spherical quadrature requires only 44 evaluations, taking nearly 30% less time to achieve the same accuracy. Here accuracy is expressed in terms of the degree of the approximation, given by d = 2R − 1 for Cartesian-product quadrature. For example, R = 8 gives d = 15. To use spherical quadrature, specify the ip(m) option in gllamm and give the degree d of the approximation by using the nip(#) option. Unfortunately, spherical integration is available only for certain combinations of numbers of dimensions (or numbers of random effects) and degrees of accuracy, d: For two dimensions, d can be 5, 7, 9, 11, or 15, and for more than two dimensions, d can be 5 or 7. See Rabe-Hesketh, Skrondal, and Pickles (2005) for more information.

10.11 Assigning values to random effects

Having estimated the model parameters (the β's and ψ), we may want to assign values to the random intercepts ζj for individual clusters j. The ζj are not model parameters, but as for linear models, we can treat the estimated parameters as known and then either estimate or predict ζj.

Such predictions are useful for making inferences for the clusters in the data, important examples being assessment of institutional performance (see section 4.8.5) or of abilities in item response theory (see exercise 10.4). The estimated or predicted values of ζj should generally not be used for model diagnostics in random-intercept logistic regression because their distribution if the model is true is not known. In general, the values should also not be used to obtain cluster-specific predicted probabilities (see section 10.12.2).

10.11.1 Maximum “likelihood” estimation

As discussed for linear models in section 2.11.1, we can estimate the intercepts ζj by treating them as the only unknown parameters, after estimates have been plugged in for the model parameters:

logit{Pr(yij = 1 | xij, ζj)} = offsetij + ζj,   offsetij = β̂1 + β̂2x2ij + · · ·

This is a logistic regression model for cluster j with offset (a term with regression coefficient set to 1) given by the estimated fixed part of the linear predictor and with a cluster-specific intercept ζj.

We then maximize the corresponding likelihood for cluster j,

Likelihood(y1j , y2j , . . . , ynjj |Xj , ζj)

with respect to ζj, where Xj is a matrix containing all covariates for cluster j. As explained in section 2.11.1, we put "likelihood" in quotes in the section heading because it differs from the marginal likelihood that is used to estimate the model parameters.
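Conceptually, each per-cluster fit is a one-parameter Newton–Raphson maximization of this "likelihood". A minimal Python sketch (ours, not the algorithm Stata's logit command uses) makes the divergence for all-0 and all-1 clusters visible:

```python
import math

def invlogit(u):
    return 1.0 / (1.0 + math.exp(-u))

def ml_intercept(y, offset, tol=1e-10, max_iter=100):
    """Newton-Raphson ML estimate of zeta in
    logit{Pr(y_i = 1)} = offset_i + zeta, offsets treated as known."""
    zeta = 0.0
    for _ in range(max_iter):
        p = [invlogit(o + zeta) for o in offset]
        score = sum(yi - pi for yi, pi in zip(y, p))   # gradient in zeta
        info = sum(pi * (1.0 - pi) for pi in p)        # observed information
        step = score / info
        zeta += step
        if abs(step) < tol:
            return zeta
    return None  # no finite maximizer, e.g., all responses 0 or all 1

mixed = ml_intercept([1, 0, 1], [0.0, 0.0, 0.0])      # log(2): invlogit(zeta) = 2/3
all_zero = ml_intercept([0, 0, 0], [0.0, 0.0, 0.0])   # None: estimate is -infinity
```

For a cluster with two 1s out of three responses and zero offsets, the estimate solves invlogit(ζ) = 2/3; for an all-0 cluster the Newton steps never shrink, mirroring the nonconvergence flagged by x's in the statsby output below.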

Maximization can be accomplished by fitting logistic regression models to the individual clusters. First, obtain the offset from the melogit estimates:

. estimates restore melogit
(results melogit are active now)

. predict offset, xb


Then use the statsby command to fit individual logistic regression models for each patient, specifying an offset:

. statsby mlest=_b[_cons], by(patient) saving(ml, replace): logit outcome,
>    offset(offset)
(running logit on estimation sample)

      Command: logit outcome, offset(offset)
        mlest: _b[_cons]
           By: patient
(file ml.dta not found)

Statsby groups
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
......xx.......xx..xxx...x.x...xxxxx.xx...xxx.xxxx    50
xx.xxxxxxxxx.xxxx..xxxxxxxxx.x..xx..x.xxx.xxx.x...   100
xx.xxxxxxxxxxx.xxx.x.x...x.xx.xxxxx.xx....xxx.x.xx   150
.x..x.xxxx..xxxxx.xx..xxxx..xxx.x.xxxxx.x.x.xxx...   200
.xxxxx.xx.xx..x.xxx...xx.x..xxxxx.x..x.x..x..xxxxx   250
x.xx.x..xxxxxx..x..x..xxx.x..xxxxxxxx.x.x...

Here we have saved the estimates under the variable name mlest in a file called ml.dta in the working directory. The x's in the output indicate that the logit command did not converge for many clusters. For these clusters, the variable mlest is missing. This happens for clusters where all responses are 0 or all responses are 1 because the maximum "likelihood" estimate then is −∞ or +∞, respectively.

We now merge the estimates with the data for later use:

. sort patient

. merge m:1 patient using ml

    Result                      Number of obs
    -----------------------------------------
    Not matched                             0
    Matched                             1,908  (_merge==3)

. drop _merge

10.11.2 Empirical Bayes prediction

The ideas behind empirical Bayes prediction discussed in section 2.11.2 for linear variance-components models also apply to other generalized linear mixed models. Instead of basing inference completely on the "likelihood" of the responses for a cluster given the random intercept, we combine this information with the prior of the random intercept, which is just the estimated density of the random intercept (a normal density with mean 0 and estimated variance ψ), to obtain the posterior distribution:

Posterior(ζj |y1j , . . . , ynjj ,Xj) ∝ Prior(ζj)× Likelihood(y1j , . . . , ynjj |Xj , ζj)

The product on the right is proportional to, but not equal to, the posterior distribution. Obtaining the posterior distribution requires dividing this product by a normalizing constant that can only be obtained by numerical integration. Note that the model parameters are treated as known, and estimates are plugged into the expression for the posterior, giving what is sometimes called an estimated posterior distribution.

The estimated posterior distribution is no longer normal as for linear models, and hence its mode does not equal its mean. There are therefore two different types of predictions we could consider: the mean of the posterior and its mode. The first is undoubtedly the most common and is referred to as empirical Bayes prediction [sometimes called expected a posteriori (EAP) prediction], whereas the second is referred to as empirical Bayes modal prediction [sometimes called modal a posteriori (MAP) prediction].

The empirical Bayes prediction of the random intercept for a cluster j is the mean of the estimated posterior distribution and can be obtained as

ζ̃j = ∫ ζj Posterior(ζj | y1j, . . . , ynjj, Xj) dζj

by using numerical integration. This is not a best linear unbiased prediction (BLUP) as in linear models.
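The one-dimensional integral can be sketched directly. The following Python toy (ours; gllamm and melogit use adaptive quadrature rather than this crude Riemann sum) also shows that, unlike the per-cluster ML estimate, the posterior mean stays finite even when all responses are 0 or all are 1.

```python
import math

def invlogit(u):
    return 1.0 / (1.0 + math.exp(-u))

def posterior_mean(y, offset, psi, lim=10.0, n=2001):
    """Empirical Bayes prediction of a random intercept: the mean of the
    (estimated) posterior, computed by a Riemann sum on [-lim, lim].
    psi is the estimated random-intercept variance; model parameters are
    treated as known. lim must cover the prior's effective support."""
    step = 2.0 * lim / (n - 1)
    num = den = 0.0
    for i in range(n):
        z = -lim + i * step
        prior = math.exp(-0.5 * z * z / psi) / math.sqrt(2.0 * math.pi * psi)
        lik = 1.0
        for yi, o in zip(y, offset):
            p = invlogit(o + z)
            lik *= p if yi == 1 else 1.0 - p
        num += z * prior * lik * step
        den += prior * lik * step
    return num / den

no_data = posterior_mean([], [], psi=4.0)                 # prior mean: 0
all_ones = posterior_mean([1, 1, 1], [0.0] * 3, psi=4.0)  # positive but finite
```

With no responses the posterior equals the prior and the prediction is 0; observing only 1s pulls the prediction up, but the prior shrinks it back from the +∞ that per-cluster ML would give.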

We can obtain empirical Bayes predictions by using the postestimation command predict for melogit with the reffects and ebmeans (for empirical Bayes) options:

. estimates restore melogit
(results melogit are active now)

. predict eb, reffects ebmeans reses(semean)
(calculating posterior means of random effects)
(using 30 quadrature points)

The variable eb contains the empirical Bayes predictions. In the next section, we will produce a graph of these predictions, together with maximum "likelihood" estimates and empirical Bayes modal predictions.

The posterior standard deviations produced by the reses() option of predict above and placed in the variable semean represent the conditional standard deviations of the prediction errors, given the observed responses and treating the parameter estimates as known. The square of semean is also the conditional mean squared error of the prediction, conditional on the observed responses. As in section 2.11.3, we refer to this standard error as the comparative standard error because it can be used to make inferences regarding the random effects of individual clusters and to compare clusters.

We mentioned in section 2.11.3 that, for linear models, the posterior variance is the same as the unconditional mean squared error of prediction (MSEP) or diagnostic standard error. However, this is not true for generalized linear mixed models not having an identity link, such as the random-intercept logistic model discussed here.

There is also no longer an easy way to obtain the MSEP or diagnostic standard error, but an approximate version can be obtained as the square root of the difference between the random-intercept variance estimate and the posterior variance estimate, here √(ψ̂ − semean²). (See Skrondal and Rabe-Hesketh [2004, 231–232] or Skrondal and Rabe-Hesketh [2009] for details.) This approximation is used by gllamm with the ustd option to obtain empirical Bayes predictions divided by their approximate diagnostic standard errors.

10.11.3 Empirical Bayes modal prediction

Instead of basing prediction of random effects on the mean of the posterior distribution, we can use the mode. Such empirical Bayes modal predictions are easy to obtain using the predict command with the reffects and ebmodes (for empirical Bayes modal) options:

. predict ebmodal, reffects ebmodes reses(semode)
(calculating posterior modes of random effects)

To see how the various methods compare, we now produce a graph of the empirical Bayes modal predictions (circles) and nonmissing maximum "likelihood" estimates (triangles) versus the empirical Bayes predictions, connecting empirical Bayes modal predictions and maximum "likelihood" estimates with vertical lines.

. twoway (rspike mlest ebmodal eb if visit==1)
>   (scatter mlest eb if visit==1, msize(small) msym(th) mcol(black))
>   (scatter ebmodal eb if visit==1, msize(small) msym(oh) mcol(black))
>   (function y=x, range(eb) lpatt(solid)),
>   xtitle(Empirical Bayes prediction)
>   legend(order(2 "Maximum likelihood" 3 "Empirical Bayes modal"))

The graph is given in figure 10.12.

Figure 10.12: Empirical Bayes modal predictions (circles) and nonmissing maximum "likelihood" estimates (triangles) versus empirical Bayes predictions

We see that the maximum "likelihood" estimates are missing when the empirical Bayes predictions are extreme (where the responses are all 0 or all 1) and that the empirical Bayes modal predictions tend to be quite close to the empirical Bayes predictions (close to the line).

When using the posterior modes as predictions of the random intercepts instead of the posterior means, it makes sense to base standard errors on the curvature of the log posterior distribution at the mode (just as is done in the mcaghermite method of adaptive quadrature). The reses() option of predict combined with the ebmodes option therefore produces standard errors that are standard deviations of normal densities that approximate the posterior at the mode. These approximate standard errors have been placed in the variable semode. Below we list the predictions and standard errors produced by the ebmeans option (eb and semean) with those produced by the ebmodes option (ebmodal and semode), together with the number of 0 responses, num0, and the number of 1 responses, num1, for the first 12 patients:


. egen num0 = total(outcome==0), by(patient)

. egen num1 = total(outcome==1), by(patient)

. list patient num0 num1 eb ebmodal semean semode if visit==1&patient<=12, noobs

  patient   num0   num1          eb     ebmodal      semean      semode

        1      4      3    3.742003    3.738238    1.053459    1.025657
        2      4      2    1.834451    1.935999    1.019206    .9423842
        3      6      1    .5889862    .9492028     1.30982     1.13152
        4      6      1    .6017116    .9566681    1.314893    1.136407
        6      4      3    3.283585    3.255311     1.01189    .9710647

        7      4      3    3.403232    3.369022    1.030795    .9956948
        9      7      0   -2.680705   -1.399398    2.707369    2.610388
       10      7      0   -2.888323   -1.604582    2.645096    2.505447
       11      3      4    4.464952    4.363659    1.088513     1.07268
       12      4      3    2.727964     2.73049    .9417346     .899028

We see that the predictions and standard errors agree reasonably well for some patients with several 1s and 0s, but there are large discrepancies when the responses are all 0 (patients 9 and 10), which also leads to large standard errors. Such large discrepancies suggest that the (empirical) posteriors are asymmetric. Although the posterior mean is generally preferred by Bayesians because it minimizes the posterior mean squared error loss, the mean becomes a less useful summary of the posterior when the posterior is very asymmetric.

10.12 Different kinds of predicted probabilities

10.12.1 Predicted population-averaged or marginal probabilities

Population-averaged or marginal probabilities π(xij) can be predicted for random-intercept logistic regression models by evaluating the integral in (10.8) numerically for the estimated parameters and values of covariates in the data, that is, evaluating

π(xij) ≡ ∫ Pr(yij = 1 | x2j, x3ij, ζj) φ(ζj; 0, ψ) dζj
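The integral is easy to sketch numerically. In this Python toy (ours; melogit and gllapred use Gauss–Hermite quadrature rather than a Riemann sum), the marginal probability is visibly attenuated toward 0.5 relative to the conditional probability at ζj = 0, anticipating the comparison of conditional and marginal curves below.

```python
import math

def invlogit(u):
    return 1.0 / (1.0 + math.exp(-u))

def marginal_prob(xb, psi, lim=8.0, n=2001):
    """Population-averaged probability: integrate invlogit(xb + zeta)
    over the N(0, psi) random-intercept density, using a Riemann sum
    that covers lim standard deviations on each side of zero."""
    sd = math.sqrt(psi)
    step = 2.0 * lim * sd / (n - 1)
    total = 0.0
    for i in range(n):
        z = -lim * sd + i * step
        phi = math.exp(-0.5 * (z / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
        total += invlogit(xb + z) * phi * step
    return total

conditional = invlogit(2.0)             # probability for a cluster with zeta = 0
marginal = marginal_prob(2.0, psi=4.0)  # attenuated toward 0.5
```

With a fixed part of 2 and random-intercept variance 4, the marginal probability lies strictly between 0.5 and the conditional probability at ζj = 0.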

To obtain these predicted marginal probabilities after estimation using melogit, use the predict command with the options pr or mu (for the probability or mean response) and marginal (for integrating over the random-intercept distribution):

. predict margprob, pr marginal
(using 30 quadrature points)

(After estimation using gllamm, the command gllapred margprob, mu marginal gives identical results.)

We now compare predictions of population-averaged or marginal probabilities from the ordinary logit model (previously obtained under the variable name prob) and the random-intercept logit model, giving figure 10.13.


. twoway (line prob month, sort) (line margprob month, sort lpatt(dash)),
>   by(treatment) legend(order(1 "Ordinary logit" 2 "Random intercept logit"))
>   xtitle(Time in months) ytitle(Fitted marginal probabilities of onycholysis)

The predictions are nearly identical. This is not surprising because estimators of the marginal relationships based on generalized linear mixed models are consistent for the true marginal relationships even if the random-intercept distribution is misspecified (Heagerty and Kurland 2001), implying that both single-level and random-intercept logistic models are consistent. (However, unlike probit models, logit models have subtly different functional forms for the marginal relationship depending on the estimated random-intercept variance.)

Figure 10.13: Fitted marginal probabilities using ordinary and random-intercept logistic regression

10.12.2 Predicted subject-specific probabilities

Predictions for hypothetical subjects: Conditional probabilities

Subject-specific or conditional predictions of Pr(yij = 1 | x2j, x3ij, ζj) for different values of ζj can be produced using the inverse logit function of the linear predictor. We first use predict with the xb option to obtain the predicted fixed part of the model,

. predict fixedpart, xb

and then produce predicted probabilities for ζj equal to −4, −2, 0, 2, and 4:

. generate condprobm4 = invlogit(fixedpart-4)

. generate condprobm2 = invlogit(fixedpart-2)


. generate condprob0 = invlogit(fixedpart)

. generate condprob2 = invlogit(fixedpart+2)

. generate condprob4 = invlogit(fixedpart+4)

Plotting all of these conditional probabilities together with the observed proportions and marginal probabilities produces figure 10.14.

. twoway (line prop mn_month, sort)
>   (line margprob month, sort lpatt(dash))
>   (line condprob0 month, sort lpatt(shortdash_dot))
>   (line condprob4 month, sort lpatt(shortdash))
>   (line condprobm4 month, sort lpatt(shortdash))
>   (line condprob2 month, sort lpatt(shortdash))
>   (line condprobm2 month, sort lpatt(shortdash)),
>   by(treatment)
>   legend(order(1 "Observed proportion" 2 "Marginal probability"
>     3 "Median probability" 4 "Conditional probabilities"))
>   xtitle(Time in months) ytitle(Probabilities of onycholysis)

Figure 10.14: Conditional and marginal predicted probabilities for random-intercept logistic regression model

Clearly, the conditional curves have steeper downward slopes than does the marginal curve. The conditional curve represented by a dash-dot line is for ζj = 0 and hence represents the population median curve.


Predictions for the subjects in the sample: Posterior mean probabilities

We may also want to predict the probability that yij = 1 for a given subject j. The predicted conditional probability, given the unknown random intercept ζj, is

Pr(yij = 1 | xij, ζj) = exp(β̂1 + β̂2x2j + β̂3x3ij + β̂4x2jx3ij + ζj) / {1 + exp(β̂1 + β̂2x2j + β̂3x3ij + β̂4x2jx3ij + ζj)}

Because our knowledge about ζj for subject j is represented by the posterior distribution, a good prediction π̃j(xij) of the unconditional probability is obtained by integrating over the posterior distribution:

π̃j(xij) ≡ ∫ Pr(yij = 1 | xij, ζj) × Posterior(ζj | y1j, . . . , ynjj, Xj) dζj    (10.11)
          ≠ Pr(yij = 1 | xij, ζ̃j)

This minimizes the mean squared error of prediction for known parameters and can be thought of as an empirical Bayes version of posterior predictive probabilities where the random intercepts are the only unknown parameters.

We cannot simply plug in the posterior mean ζ̃j of the random intercept for ζj in generalized linear mixed models (for example, by using melogit followed by predict with the mu and conditional(ebmeans) options). The reason is that the mean of a given nonlinear function of ζj does not in general equal the same function evaluated at the mean of ζj.
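A two-point toy posterior (ours, not from the book) shows the problem immediately: the posterior mean of the probability and the probability at the posterior mean of ζj can differ substantially.

```python
import math

def invlogit(u):
    return 1.0 / (1.0 + math.exp(-u))

# Toy posterior for zeta_j: equal mass at -2 and +2, so posterior mean 0;
# the fixed part of the linear predictor is taken to be 1
posterior_mean_prob = 0.5 * invlogit(1.0 - 2.0) + 0.5 * invlogit(1.0 + 2.0)
plugin_prob = invlogit(1.0 + 0.0)  # zeta replaced by its posterior mean

# posterior_mean_prob is about 0.61, plugin_prob about 0.73
```

The two answers differ by about 0.12 here; with the asymmetric posteriors seen for all-0 patients above, the discrepancy from plugging in can be even larger.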

At the time of writing, melogit and xtlogit did not have postestimation commands for computing the posterior mean predicted probabilities as defined in (10.11). We therefore use gllapred with the mu option (and not the marginal option) after retrieving the gllamm estimates:

. estimates restore gllamm
(results gllamm are active now)

. gllapred cmu, mu
(mu will be stored in cmu)
Non-adaptive log-likelihood: -625.52573
-625.3853  -625.3856  -625.3856
log-likelihood:-625.38558

gllapred can produce predicted posterior mean probabilities also for occasions where the response variable is missing. This is useful for making forecasts for a patient or for making predictions for visits where the patient did not attend the assessment. As we saw in section 10.4, such missing data occur frequently in the toenail data.


Listing patient and visit for patients 2 and 15,

. sort patient visit

. list patient visit if patient==2|patient==15, sepby(patient) noobs

  patient   visit

        2       1
        2       2
        2       3
        2       4
        2       5
        2       6

       15       1
       15       2
       15       3
       15       4
       15       5
       15       7

we see that these patients each have one missing visit: visit 7 is missing for patient 2 and visit 6 is missing for patient 15. To make predictions for these visits, we must first create rows of data (or records) for these visits. A very convenient command to accomplish this is fillin:

. fillin patient visit

. list patient visit _fillin if patient==2|patient==15, sepby(patient) noobs

  patient   visit   _fillin

        2       1         0
        2       2         0
        2       3         0
        2       4         0
        2       5         0
        2       6         0
        2       7         1

       15       1         0
       15       2         0
       15       3         0
       15       4         0
       15       5         0
       15       6         1
       15       7         0

fillin finds all values of patient that occur in the data and all values of visit and fills in all combinations of these values that do not already occur in the data, for example, patient 2 and visit 7. The command creates a new variable, _fillin, taking the value 1 for filled-in records and 0 for records that existed before. All variables have missing values for these new records except patient, visit, and _fillin.
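What fillin does can be mimicked in a few lines. This Python sketch (ours, with dictionaries standing in for Stata records) fills in the missing patient–visit combinations and flags them:

```python
import itertools

def fillin(records, keys=("patient", "visit")):
    """Add a record for every combination of the key variables that does
    not already occur, flagging new records with _fillin = 1 (existing
    records get _fillin = 0), as Stata's fillin command does."""
    values = [sorted({r[k] for r in records}) for k in keys]
    existing = {tuple(r[k] for k in keys) for r in records}
    out = [dict(r, _fillin=0) for r in records]
    for combo in itertools.product(*values):
        if combo not in existing:
            out.append(dict(zip(keys, combo), _fillin=1))
    return out

records = [{"patient": 2, "visit": v} for v in (1, 2, 3)] + \
          [{"patient": 15, "visit": v} for v in (1, 3)]
filled = fillin(records)  # adds the record for patient 15, visit 2
```

All other variables would be missing in the new records, which is why the covariates have to be filled in afterward, as done for treatment and month below.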


Before we can make predictions, we must fill in values for the covariates: treatment, month, and the interaction trt_month. Note that, by filling in values for covariates, we are not imputing missing data but just specifying for which covariate values we would like to make predictions.

We start by filling in the appropriate values for treatment, taking into account that treatment is a time-constant variable,

. egen trt = mean(treatment), by(patient)

. replace treatment = trt if _fillin==1

and proceed by filling in the average time (month) associated with the visit number for the time-varying variable month:

. drop mn_month

. egen mn_month = mean(month), by(treatment visit)

. replace month = mn_month if _fillin==1

Finally, we obtain the filled-in version of the interaction variable, trt_month, by multiplying the variables treatment and month that we have constructed:

. replace trt_month = treatment*month

It is important that the response variable, outcome, remains missing; the posterior distribution should only be based on the responses that were observed. We also cannot change the covariate values corresponding to these observed responses because that would change the posterior distribution.

We can now make predictions for the entire dataset by repeating the gllapred command (after deleting cmu) with the fsample (for "full sample") option:

. drop cmu

. gllapred cmu, mu fsample
(mu will be stored in cmu)
Non-adaptive log-likelihood: -625.52573
-625.3853  -625.3856  -625.3856
log-likelihood:-625.38558


. list patient visit _fillin cmu if patient==2|patient==15, sepby(patient) noobs

  patient   visit   _fillin         cmu

        2       1         0   .54654227
        2       2         0   .46888925
        2       3         0    .3867953
        2       4         0   .30986966
        2       5         0   .12102271
        2       6         0   .05282663
        2       7         1   .01463992

       15       1         0   .59144346
       15       2         0   .47716226
       15       3         0   .39755635
       15       4         0   .30542907
       15       5         0   .08992082
       15       6         1   .01855957
       15       7         0   .00015355

The predicted forecast probability for visit 7 for patient 2 hence is 0.015.

To look at some patient-specific posterior mean probability curves, we will produce trellis graphs of 16 randomly chosen patients from each treatment group. We will first randomly assign consecutive integer identifiers (1, 2, 3, etc.) to the patients in each group, in a new variable, randomid. We will then plot the data for patients with randomid 1 through 16 in each group.

To create the random identifier, we first generate a random number from the uniform distribution whenever visit is 1 (which happens once for each patient):

. set seed 1234421

. sort patient

. generate rand = runiform() if visit==1

Here use of the set seed and sort commands ensures that you get the same values of randomid as we do, because the same "seed" is used for the random-number generator. We now define a variable, randid, that represents the rank order of rand within treatment groups and is missing when rand is missing:

. by treatment (rand), sort: generate randid = _n if rand<.

randid is the required random identifier, but it is only available when visit is 1 and missing otherwise. We can fill in the missing values by using

. egen randomid = mean(randid), by(patient)


We are now ready to produce the trellis graphs:

. twoway (line cmu month, sort) (scatter cmu month if _fillin==1, mcol(black))
>   if randomid<=16&treatment==0, by(patient, compact legend(off)
>   l1title("Posterior mean probabilities"))

and

. twoway (line cmu month, sort) (scatter cmu month if _fillin==1, mcol(black))
>   if randomid<=16&treatment==1, by(patient, compact legend(off)
>   l1title("Posterior mean probabilities"))

The graphs are shown in figure 10.15. We see that there is considerable variability in the probability trajectories of different patients within the same treatment group.

Figure 10.15: Posterior mean probabilities against time for 16 patients in the control group (a) and treatment group (b), with predictions for missing responses shown as diamonds


After estimation with melogit, the predict command with the options mu and conditional(ebmodes) gives the posterior mode of the predicted conditional probability Pr(yij = 1 | xij, ζj) instead of the posterior mean. This is achieved by substituting the posterior mode of ζj into the expression for the conditional probability. [The mode of a strictly increasing function of ζj (here an inverse logit) is the same function evaluated at the mode of ζj.]

10.13 Other approaches to clustered dichotomous data

10.13.1 Conditional logistic regression

Instead of using random intercepts for clusters (patients in the toenail application), it would be tempting to use fixed intercepts by including a dummy variable for each patient (and omitting the overall intercept). This would be analogous to the fixed-effects estimator of within-patient effects discussed for linear models in section 3.7.2. However, in logistic regression, this approach would lead to inconsistent estimates of the within-patient effects unless the cluster size n is large, due to what is known as the incidental parameter problem. Roughly speaking, this problem occurs because the number of cluster-specific intercepts (the incidental parameters) increases in tandem with the sample size (number of clusters), so that the usual asymptotic or large-sample results break down. Obviously, we also cannot eliminate the random intercepts in nonlinear models by simply cluster-mean-centering the responses and covariates, as in (3.12).

Instead, we can eliminate the patient-specific intercepts by constructing a likelihood that is conditional on the number of responses that take the value 1, a sufficient statistic for the patient-specific intercept. This approach is called conditional ML estimation, in contrast to marginal ML estimation where the patient-specific intercepts are integrated out, the approach used thus far in this chapter and implemented in melogit. In linear random-intercept models, conditional ML estimation is equivalent to ordinary least squares estimation of the cluster-mean-centered model, that is, fixed-effects estimation as implemented in xtreg with the fe option. In logistic regression, conditional ML estimation is more involved and is known as conditional logistic regression—see display 12.2 for a derivation of the likelihood contribution of a cluster with one response equal to 1. By eliminating the random intercepts, conditional logistic regression relaxes all assumptions regarding the random intercepts, including any patient-level exogeneity assumptions, which is why the method is sometimes referred to as fixed-effects logistic regression. Importantly, this method estimates conditional or subject-specific effects, just like random-effects logistic regression, but without assuming that there is no patient-level unobserved confounding. When using conditional logistic regression, we can only estimate the effects of within-patient or time-varying covariates. Patient-specific covariates, such as treatment, cannot be included. However, interactions between patient-specific and time-varying variables, such as treatment by month, can be estimated.
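The conditional likelihood contribution of one cluster can be sketched directly: condition on the total Σi yij and sum over all response patterns with that total. In this Python toy (ours; display 12.2 derives the one-success case), xb holds the fixed part of the linear predictor for each unit, and the patient-specific intercept has already cancelled out of the ratio.

```python
import math
from itertools import combinations

def conditional_likelihood(y, xb):
    """Pr(observed pattern | sum of responses) for one cluster in
    conditional logistic regression. The cluster intercept cancels
    because it multiplies numerator and denominator equally."""
    total = sum(y)
    n = len(y)
    num = math.exp(sum(b for yi, b in zip(y, xb) if yi == 1))
    den = sum(math.exp(sum(xb[i] for i in idx))
              for idx in combinations(range(n), total))
    return num / den

xb = [0.5, 1.0, 1.5]
uninformative = conditional_likelihood([0, 0, 0], xb)  # exactly 1
informative = conditional_likelihood([1, 0, 0], xb)
```

Clusters with all-0 (or all-1) responses have conditional probability 1 whatever the covariates, which is why such clusters drop out of the analysis, as noted in the clogit output below.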


Estimation using clogit

Conditional logistic regression can be performed using Stata's xtlogit command with the fe option or using the clogit command (with the or option to obtain ORs):

. clogit outcome month i.treatment#c.month, group(patient) or
note: multiple positive outcomes within groups encountered.
note: 179 groups (1,141 obs) omitted because of all positive or
      all negative outcomes.

Conditional (fixed-effects) logistic regression      Number of obs =    767
                                                     LR chi2(2)    = 290.97
                                                     Prob > chi2   = 0.0000
Log likelihood = -188.94377                          Pseudo R2     = 0.4350

          outcome | Odds ratio   Std. err.      z    P>|z|   [95% conf. interval]
------------------+--------------------------------------------------------------
            month |   .6827717   .0321547   -8.10   0.000    .6225707     .748794
                  |
treatment#c.month |
      Terbinafine |   .9065404   .0667426   -1.33   0.183    .7847274    1.047262

Interpretation

The subject-specific or conditional OR for the treatment effect (treatment by time interaction) is now estimated as 0.91 and is no longer significant at the 5% level. However, both this estimate and the estimate for month, also given in the last column of table 10.2, are quite similar to the marginal ML estimates for the random-intercept model. This is likely to be due to the randomization of treatments in the toenail study, ensuring that they are exogenous.

The subject-specific or conditional ORs from conditional logistic regression represent within-effects, where patients serve as their own controls. As discussed in chapter 5, within-patient estimates cannot be confounded with omitted between-patient covariates and are hence less sensitive to model misspecification than marginal ML estimates for the random-intercept model (which makes the strong assumption that the patient-specific intercepts are independent of the covariates). A further advantage of conditional ML estimation is that it does not make any assumptions regarding the distribution of the patient-specific effect. Therefore, it is reassuring that the conditional ML estimates are fairly similar to the marginal ML estimates.

Skrondal and Rabe-Hesketh (2014a) show that conditional logistic regression may be beneficial when there are missing data. Specifically, conditional ML can be protective against not-missing-at-random (NMAR) mechanisms, such as missingness depending on the random intercept and on the (observed or missing) responses.

If the random-intercept model is correct, the marginal ML estimator is more efficient and tends to yield smaller standard errors leading to smaller p-values, as we can see for the treatment by time interaction. Here the conditional logistic regression method is inefficient because, as noted in the output, 179 subjects whose responses were all 0 or all 1 cannot contribute to the analysis. This is because the conditional probabilities of these response patterns, conditioning on the sum of responses across time, are 1 regardless of the covariates (for example, if the sum is 0, all responses must be 0), and the conditional probabilities therefore do not provide any information on covariate effects.

The conditional logistic regression model is sometimes referred to as the Chamberlain fixed-effects logit model in econometrics and is used for matched case–control studies in epidemiology. The same trick of conditioning is also used for the Rasch model in psychometrics and the conditional logit model for discrete choice and nominal responses (see sections 12.2.2 and 12.2.3). Unfortunately, there is no counterpart to conditional logistic regression for probit models.

Note that dynamic models with subject-specific effects cannot be estimated consistently by simply including lagged responses in conditional logistic regression. Also, subject-specific predictions are not possible in conditional logistic regression because no inferences are made regarding the subject-specific intercepts.

10.13.2 Generalized estimating equations (GEE)

Generalized estimating equations (GEE), first introduced in section 6.6, can be used to estimate marginal or population-averaged effects. Dependence among the responses of units in a given cluster is taken into account but treated as a nuisance, whereas this dependence is of central interest in multilevel modeling.

The basic idea of GEE is that an algorithm, known as iteratively reweighted least squares, for ML estimation of single-level generalized linear models requires only the mean structure (the expectation of the response variable as a function of the covariates) and the variance function. The algorithm iterates between linearizing the model given current parameter estimates and then updating the parameters by using weighted least squares, with weights determined by the variance function. In GEE, this iterative algorithm is extended to two-level data by assuming a within-cluster correlation structure, in addition to the mean structure and variance function, so that the weighted least-squares step becomes a generalized least-squares step (see section 3.10.1), and another step is required for updating the correlation matrix. GEE can be viewed as a special case of generalized method of moments (GMM) estimation (implemented in Stata's gmm command).

In addition to specifying a model for the marginal relationship between the response variable and covariates, it is necessary to choose a structure for the correlations among the observed responses (conditional on covariates). The variance function follows from the Bernoulli distribution. The most common correlation structures are (see section 6.6 for some other correlation structures):


• Independence: same as ordinary logistic regression.

• Exchangeable: the same correlation for all pairs of units.

• Autoregressive lag-1 [AR(1)]: the correlation declines exponentially with the time lag. This structure only makes sense for longitudinal data and assumes equally spaced occasions, that is, constant time intervals between occasions, except for gaps due to missing data.

• Unstructured: a different correlation for each pair of responses. This structure only makes sense if units are not exchangeable within clusters, in the sense that the labels i attached to the units mean the same thing across clusters. For instance, it is meaningful for fixed-occasion longitudinal data, where the time associated with a given occasion i is identical across individuals, but not for data on students nested in schools, where the labels i assigned to students are arbitrary. In addition, each pair of unit labels i and i′ must occur sufficiently often across clusters to estimate the pairwise correlations. Finally, the number of unique unit labels, say, m, should not be too large because the number of parameters is m(m−1)/2.
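For concreteness, the exchangeable and AR(1) working correlation matrices described above can be built in a few lines of NumPy (a Python sketch rather than part of the Stata analysis; the function names are our own):

```python
import numpy as np

def exchangeable(m, rho):
    """Exchangeable working correlation: the same rho for every pair."""
    R = np.full((m, m), rho)
    np.fill_diagonal(R, 1.0)
    return R

def ar1(m, rho):
    """AR(1) working correlation: rho^|i - i'| declines with the lag."""
    idx = np.arange(m)
    return rho ** np.abs(idx[:, None] - idx[None, :])

R_ex = exchangeable(4, 0.422)   # 0.422 echoes the fitted value from estat wcorrelation
R_ar = ar1(4, 0.422)
```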

The reason for specifying a correlation structure is that more efficient estimates (with smaller standard errors) are obtained if the specified correlation structure resembles the true dependence structure. Using ordinary logistic regression is equivalent to assuming an independence structure. GEE is therefore generally more efficient than ordinary logistic regression for clustered data, although the gain in precision can be meager for balanced data (Lipsitz and Fitzmaurice 2009).

An important feature of GEE (and ordinary logistic regression) is that marginal effects can be consistently estimated even if the dependence among units in clusters is not properly modeled. For this reason, correct specification of the correlation structure is downplayed by using the term “working correlations”.

In GEE, the standard errors for the marginal effects are usually based on the robust sandwich estimator, which takes the dependence into account. Use of the sandwich estimator implicitly relies on there being many replications of the responses associated with each distinct combination of covariate values. Otherwise, the estimated standard errors can be biased downward. Furthermore, estimated standard errors based on the sandwich estimator can be very unreliable unless the number of clusters is large (42 or larger as a rough rule of thumb), so in this case model-based (nonrobust) standard errors may be preferable. See Lipsitz and Fitzmaurice (2009) for further discussion.
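The structure of the sandwich estimator is easy to see in code. The sketch below (Python, with hypothetical simulated data rather than the toenail data, and evaluated at a fixed coefficient vector for simplicity) shows the model-based “bread” and the “meat” formed from cluster-summed scores; with few clusters, the meat is a sum of only a few outer products, which is why it is then poorly estimated:

```python
import numpy as np

def sandwich_vcov(X, y, beta, cluster):
    """Cluster-robust (sandwich) variance for logistic regression."""
    mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
    w = mu * (1.0 - mu)
    bread = np.linalg.inv(X.T @ (w[:, None] * X))  # model-based piece
    scores = X * (y - mu)[:, None]                 # observation-level scores
    k = X.shape[1]
    meat = np.zeros((k, k))
    for c in np.unique(cluster):                   # one outer product per cluster
        s = scores[cluster == c].sum(axis=0)
        meat += np.outer(s, s)
    return bread @ meat @ bread

rng = np.random.default_rng(2)
n_clusters, m = 100, 7                             # e.g., 100 subjects, 7 visits
cluster = np.repeat(np.arange(n_clusters), m)
X = np.column_stack([np.ones(n_clusters * m), rng.normal(size=n_clusters * m)])
zeta = np.repeat(rng.normal(scale=1.5, size=n_clusters), m)  # shared intercepts
p = 1.0 / (1.0 + np.exp(-(X @ np.array([0.0, 1.0]) + zeta)))
y = (rng.uniform(size=n_clusters * m) < p).astype(float)
V = sandwich_vcov(X, y, np.array([0.0, 1.0]), cluster)
```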

Estimation using xtgee

We now use GEE to estimate marginal ORs for the toenail data. We request an exchangeable correlation structure (the default) and robust standard errors by using xtgee with the corr(exchangeable), vce(robust), and eform options:


. quietly xtset patient

. xtgee outcome i.treatment##c.month, link(logit)
>      family(binomial) corr(exchangeable) vce(robust) eform

GEE population-averaged model                   Number of obs     =      1,908
Group variable:             patient             Number of groups  =        294
Family:                    Binomial             Obs per group:
Link:                         Logit                           min =          1
Correlation:           exchangeable                           avg =        6.5
                                                              max =          7
                                                Wald chi2(3)      =      63.44
Scale parameter =                 1             Prob > chi2       =     0.0000

(Std. err. adjusted for clustering on patient)

                               Robust
     outcome   Odds ratio   std. err.      z    P>|z|     [95% conf. interval]

   treatment
 Terbinafine     1.007207    .2618022     0.03   0.978     .6051549    1.676373

       month     .8425856    .0253208    -5.70   0.000     .7943911     .893704

   treatment#
     c.month
 Terbinafine     .9252113    .0501514    -1.43   0.152     .8319576    1.028918

       _cons     .5588229    .0963122    -3.38   0.001     .3986309    .7833889

Note: _cons estimates baseline odds (conditional on zero random effects).

Interpretation

These estimates are given under “GEE” in table 10.2 and can alternatively be obtained using xtlogit with the pa option. Comparing the standard error estimates for GEE with an exchangeable correlation structure to those for ordinary logistic regression (that is, GEE with an independence correlation structure) suggests that specification of a more realistic working correlation structure does not result in much efficiency gain here.

We can display the fitted working correlation matrix by using estat wcorrelation:

. estat wcorrelation, format(%4.3f)

Estimated within-patient correlation matrix R:

c1 c2 c3 c4 c5 c6 c7

r1   1.000
r2   0.422   1.000
r3   0.422   0.422   1.000
r4   0.422   0.422   0.422   1.000
r5   0.422   0.422   0.422   0.422   1.000
r6   0.422   0.422   0.422   0.422   0.422   1.000
r7   0.422   0.422   0.422   0.422   0.422   0.422   1.000

A problem with the exchangeable correlation structure is that the true marginal (over the random effects) correlation of the responses is in general not constant but varies according to the values of the observed covariates. Using Pearson correlations for dichotomous responses is also somewhat peculiar because the OR is the measure of association in logistic regression.

GEE is an estimation method that does not require the specification of a full statistical model. While the mean structure, variance function, and correlation structure are specified, it is often not possible to find a statistical model with such a structure. As we already pointed out, it may not be possible to specify a model for binary responses where the residual Pearson correlation matrix is exchangeable. For this reason, the approach is called an estimating equation approach rather than a modeling approach. This is in stark contrast to multilevel modeling, where statistical models are explicitly specified.

The fact that no full statistical model is specified has three important implications. First, there is no likelihood, and therefore likelihood-ratio tests cannot be used. Instead, comparison of nested models typically proceeds by using Wald tests. Unless the sample size is large, this approach may be problematic because it is known that these tests do not work as well as likelihood-ratio tests in ordinary logistic regression. Second, it is not possible to simulate or predict individual responses based on the estimates from GEE (see section 10.12.2 for prediction and forecasting based on multilevel models). Third, GEE does not share the useful property of ML that estimators are consistent when data are MAR. Although GEE produces consistent estimates of marginal effects if the probability of responses being missing is covariate dependent [and for the special case of responses missing completely at random (MCAR)], it produces inconsistent estimates if the probability of a response being missing for a unit depends on observed responses for other units in the same cluster. Such missingness is likely to occur in longitudinal data where dropout could depend on a subject's previous responses (see sections 5.9.1 and 13.12).

10.14 Summary and further reading

We described various approaches to modeling clustered dichotomous data, focusing on random-intercept logistic models for longitudinal data. Alternatives to multilevel modeling, such as conditional ML estimation and GEE, were also briefly discussed. The important distinction between conditional or subject-specific effects and marginal or population-averaged effects was emphasized.

We described adaptive quadrature for marginal ML estimation. We recommended that you compare a sequence of estimates with an increasing number of quadrature (or integration) points and establish that the estimates stabilize before concluding that you have used enough quadrature points for a given model and application. We demonstrated the use of a variety of predictions, either cluster-specific predictions, based on empirical Bayes, or population-averaged predictions. Keep in mind that consistent estimation in logistic regression models with random effects, in principle, requires a completely correct model specification. Diagnostics for generalized linear mixed models are still being developed (see, for example, Breinegaard, Rabe-Hesketh, and Skrondal (2017, 2018) and the references therein).
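The distinction between cluster-specific and population-averaged predicted probabilities can be made concrete with a small quadrature sketch (Python, with illustrative values β1 = 1 and ψ = 4 that are not estimates from this chapter): the marginal probability integrates the conditional probability over the random-intercept distribution, which pulls it toward 0.5.

```python
import numpy as np

def marginal_prob(eta_fixed, psi, n_points=30):
    """Population-averaged probability for a random-intercept logistic
    model, integrating over zeta ~ N(0, psi) by Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_points)
    p_cond = 1.0 / (1.0 + np.exp(-(eta_fixed + np.sqrt(psi) * nodes)))
    return np.sum(weights * p_cond) / np.sqrt(2.0 * np.pi)

beta1, psi = 1.0, 4.0
p_marginal = marginal_prob(beta1, psi)           # averaged over clusters
p_conditional = 1.0 / (1.0 + np.exp(-beta1))     # for the median cluster (zeta = 0)
# p_marginal lies between 0.5 and p_conditional: attenuation toward 0.5
```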


We did not cover random-coefficient models for binary responses in this chapter but have included two exercises (10.3 and 10.8), with solutions provided, involving these models. The issues discussed in chapter 4 regarding linear models with random coefficients are also relevant for other generalized linear mixed models. The syntax for random-coefficient logistic models is analogous to the syntax for linear random-coefficient models, except that mixed is replaced with melogit and gllamm is used with a different link function and distribution (the syntax for linear random-coefficient models in gllamm can be found in the gllamm companion). Three-level random-coefficient logistic models for binary responses are discussed in chapter 16. In chapter 11, meologit is used to fit random-coefficient ordinal logistic regression models; see section 11.7.1.

Dynamic or lagged-response models for binary responses were not discussed in this chapter. Such models, sometimes called transition models, can suffer from similar kinds of endogeneity problems as those discussed for dynamic models with random intercepts in section 5.8 of volume 1. Skrondal and Rabe-Hesketh (2014b) review several approaches for handling these endogeneity problems. The paper provides a link to a do-file for the Wooldridge (2005) solution using melogit or meprobit and for the Heckman (1981) approach using gllamm. Rabe-Hesketh and Skrondal (2013) point out that Wooldridge's solution has often been implemented incorrectly, and their proposed correct version is implemented in the community-contributed Stata program xtpdyn by Grotti and Cutuli (2018).

We discussed the most common link functions for dichotomous responses, namely, the logit and probit links. A third link sometimes used is the complementary log–log link, which is introduced in section 14.6. Dichotomous responses are sometimes aggregated into counts, giving the number of successes yi in mi trials for unit i. In this situation, it is usually assumed that yi has a binomial(mi, πi) distribution. melogit can then be used as for dichotomous responses but with the binomial() option to specify the variable containing the values mi. Similarly, gllamm can be used with the binomial distribution and any of the link functions together with the denom() option to specify the variable containing mi.

Good introductions to single-level logistic regression include Collett (2003), Long (1997), and Hosmer, Lemeshow, and Sturdivant (2013). Logistic and other types of regression using Stata are discussed by Long and Freese (2014), primarily with examples from social science, and by Vittinghoff et al. (2012), with examples from medicine.

Generalized linear mixed models are described in the books by McCulloch, Searle, and Neuhaus (2008), Skrondal and Rabe-Hesketh (2004), Molenberghs and Verbeke (2005), and Hedeker and Gibbons (2006). See also Goldstein (2011), Raudenbush and Bryk (2002), and volume 3 of the anthology by Skrondal and Rabe-Hesketh (2010). Several examples with dichotomous responses are discussed in Skrondal and Rabe-Hesketh (2004, chap. 9). Guo and Zhao (2000) is a good introductory paper on multilevel modeling of binary data with applications in social science. We also recommend the book chapter by Rabe-Hesketh and Skrondal (2009), the article by Agresti et al. (2000), and the encyclopedia entry by Hedeker (2005) for overviews of generalized linear mixed models. Prediction of random effects and outcomes in generalized linear mixed models is treated in Skrondal and Rabe-Hesketh (2009).

A classic paper on conditional logistic regression for longitudinal data is Chamberlain (1980). Skrondal and Rabe-Hesketh (2014a) discuss advantages of conditional logistic regression when there are missing data. Detailed accounts of GEE are given in Hardin and Hilbe (2013), Diggle et al. (2002), and Lipsitz and Fitzmaurice (2009).

Exercises 10.1, 10.2, 10.3, and 10.6 are on longitudinal or panel data. There are also exercises on cross-sectional datasets on students nested in schools (10.7 and 10.8), cows nested in herds (10.5), questions nested in respondents (10.4), and wine bottles nested in judges (10.9). Exercise 10.2 involves GEE, whereas exercises 10.4 and 10.6 involve conditional logistic regression. The latter exercise also asks you to perform a Hausman test. Exercises 10.3 and 10.8 consider random-coefficient models for dichotomous responses (solutions are provided for both exercises). Exercise 10.4 introduces the idea of item-response theory, and exercise 10.8 shows how melogit and gllamm can be used to fit multilevel models with survey weights.

10.15 Exercises

10.1 Toenail data

1. Fit the probit version of the random-intercept model in (10.6) with meprobit. How many quadrature points appear to be needed?

2. Estimate the residual intraclass correlation for the latent responses, both by plugging the estimates into the appropriate expression and by using the appropriate postestimation command.

3. Obtain empirical Bayes predictions of the random intercepts for both the logit and probit models, and estimate the approximate constant of proportionality between these.

4. q By considering the residual standard deviations of the latent response for the logit and probit models, work out what you think the constant of proportionality should be for the logit- and probit-based empirical Bayes predictions. How does this compare with the constant estimated in step 3?

10.2 Ohio-wheeze data

In this exercise, we use data from the Six Cities Study (Ware et al. 1984), previously analyzed by Fitzmaurice (1998), among others. The dataset includes 537 children from Steubenville, Ohio, who were examined annually four times from age 7 to age 10 to ascertain their wheezing status. The smoking status of the mother was also determined at the beginning of the study to investigate whether maternal smoking increases the risk of wheezing in children. The mother's smoking status is treated as time constant, although it may have changed for some mothers over time.


The dataset wheeze.dta has the following variables:

• id: child identifier (j)

• age: number of years since ninth birthday (x2ij)

• smoking: mother smokes regularly (1: yes; 0: no) (x3j)

• y: wheeze status (1: yes; 0: no) (yij)

1. Fit the following transition model considered by Fitzmaurice (1998):

logit{Pr(yij=1|xij , yi−1,j)} = β1 + β2x2ij + β3x3j + γyi−1,j , i = 2, 3, 4

where x2ij is age and x3j is smoking. (The lagged responses can be included in the model by first typing xtset id age and then including L.y as a covariate in the model.)

2. Fit the following random-intercept model considered by Fitzmaurice (1998):

logit{Pr(yij=1|xij , ζj)} = β1 + β2x2ij + β3x3j + ζj , i = 1, 2, 3, 4

It is assumed that ζj |xij ∼ N(0, ψ) and that ζj is independent across children.

3. Use GEE to fit the marginal model

logit{Pr(yij=1|xij)} = β1 + β2x2ij + β3x3j , i = 1, 2, 3, 4

specifying an unstructured correlation matrix (xtset the data using xtset id age). Try some other correlation structures and compare the fit (using estat wcorrelation) to the unstructured version.

4. Interpret the estimated effects of mother's smoking status for the models in steps 1, 2, and 3.

10.3 Vaginal-bleeding data (solutions provided)

Fitzmaurice, Laird, and Ware (2011) analyzed data from a trial reported by Machin et al. (1988). Women were randomized to receive an injection of either 100 mg or 150 mg of the long-lasting injectable contraception depot medroxyprogesterone acetate (DMPA) at the start of the trial and at three successive 90-day intervals. In addition, the women were followed up 90 days after the final injection. Throughout the study, each woman completed a menstrual diary that recorded any vaginal bleeding-pattern disturbances. The diary data were used to determine whether a woman experienced amenorrhea, defined as the absence of menstrual bleeding for at least 80 consecutive days.

The response variable for each of the four 90-day intervals is whether the woman experienced amenorrhea during the interval. Data are available on 1,151 women for the first interval, but there was considerable dropout after that.


The dataset amenorrhea.dta has the following variables:

• dose: high dose (1: yes; 0: no)

• y1–y4: responses for time intervals 1–4 (1: amenorrhea; 0: no amenorrhea)

• wt2: number of women with the same dose level and response pattern

1. Produce an identifier variable for women, and reshape the data to long form, stacking the responses y1–y4 into one variable and creating a new variable, occasion, taking the values 1–4 for each woman.

2. Fit the following model considered by Fitzmaurice, Laird, and Ware (2011):

logit{Pr(yij = 1|xj, tij, ζj)} = β1 + β2tij + β3t²ij + β4xjtij + β5xjt²ij + ζj

where tij = 1, 2, 3, 4 is the time interval and xj is dose. It is assumed that ζj ∼ N(0, ψ) and that ζj is independent across women and independent of xj and tij. Use melogit with the fweight(wt2) option to specify that wt2 are level-2 frequency weights.

3. Write down the above model, adding a random slope of tij, and fit the extended model. (See section 11.7.1 for an example of a random-coefficient model for ordinal responses fit in meologit.)

4. Interpret the estimated coefficients.

5. Plot marginal predicted probabilities as a function of time, separately for women in the two treatment groups.

10.4 Verbal-aggression data

De Boeck and Wilson (2004) discuss a dataset from Vansteelandt (2000) where 316 participants were asked to imagine the following four frustrating situations where either another or oneself is to blame:

1. Bus: A bus fails to stop for me (another to blame)

2. Train: I miss a train because a clerk gave me faulty information (another to blame)

3. Store: The grocery store closes just as I am about to enter (self to blame)

4. Operator: The operator disconnects me when I have used up my last 10 cents for a call (self to blame)

For each situation, the participant was asked if it was true (yes, perhaps, or no) that

1. I would (want to) curse

2. I would (want to) scold

3. I would (want to) shout

For each of the three behaviors above, the words “want to” were both included and omitted, yielding six statements with a 3 × 2 factorial design (3 behaviors in 2 modes) combined with the four situations. Thus, there were 24 items in total.


The dataset aggression.dta contains the following variables:

• person: subject identifier

• item: item (or question) identifier

• description: item description (situation: bus/train/store/operator; behavior: curse/scold/shout; mode: do/want)

• i1–i24: dummy variables for the items; for example, i5 equals 1 when item equals 5 and 0 otherwise

• y: ordinal response (0: no; 1: perhaps; 2: yes)

• Person characteristics:

– anger: trait anger score (STAXI, Spielberger [1988]) (w1j)

– gender: dummy variable for being male (1: male; 0: female) (w2j)

• Item characteristics:

– do_want: dummy variable for mode being “do” (that is, omitting the words “want to”) versus “want” (x2ij)

– other_self: dummy variable for others to blame versus self to blame (x3ij)

– blame: variable equal to 0.5 for the blaming behaviors curse and scold and −1 for shout (x4ij)

– express: variable equal to 0.5 for the expressive behaviors curse and shout and −1 for scold (x5ij)

1. Recode the ordinal response variable y so that either a “2” or a “1” for the original variable becomes a “1” for the recoded variable.

2. De Boeck and Wilson (2004, sec. 2.5) consider the following “explanatory item-response model” for the dichotomous response:

logit{Pr(yij=1|xij , ζj)} = β1 + β2x2ij + β3x3ij + β4x4ij + β5x5ij + ζj

where ζj ∼ N(0, ψ) can be interpreted as the latent trait “verbal aggressiveness”. Fit this model using melogit, and interpret the estimated coefficients. In De Boeck and Wilson (2004), the first five terms have minus signs, so their estimated coefficients have the opposite sign.

3. De Boeck and Wilson (2004, sec. 2.6) extend the above model by including a latent regression, allowing verbal aggressiveness (now denoted ηj instead of ζj) to depend on the person characteristics w1j and w2j:

logit{Pr(yij=1|xij , ηj)} = β1 + β2x2ij + β3x3ij + β4x4ij + β5x5ij + ηj

ηj = γ1w1j + γ2w2j + ζj

Substitute the level-2 model for ηj into the level-1 model for the item responses, and fit the model using melogit.


4. Use melogit to fit the “descriptive item-response model”, usually called a one-parameter logistic item response theory (IRT) model or Rasch model, considered by De Boeck and Wilson (2004, sec. 2.3):

logit{Pr(yij=1|d1i, . . . , d24,i, ζj)} = Σ_{m=1}^{24} βmdmi + ζj

where dmi is a dummy variable for item i, with dmi = 1 if m = i and 0 otherwise. In De Boeck and Wilson (2004), the first term has a minus sign, so their βm coefficients have the opposite sign; see also their page 53.

5. The model above is known as a one-parameter item-response model because there is one parameter βm for each item. The negative of these item-specific parameters, −βm, can be interpreted as “difficulties”; the larger −βm, the larger the latent trait (here verbal aggressiveness, but often ability) has to be to yield a given probability (for example, 0.5) of a 1 response.

Sort the items in increasing order of the estimated difficulties. For the least and most difficult items, look up the variable description, and discuss whether it makes sense that these items are particularly easy and hard to endorse (requiring little and a lot of verbal aggressiveness), respectively.

6. Replace the random intercepts ζj with fixed parameters αj. Set the difficulty of the first item to 0 for identification and fit the model by conditional ML. Verify that differences between estimated difficulties for the items are similar to those in step 4.

7. q Obtain empirical Bayes (also called EAP) predictions and empirical Bayes modal (also called MAP) predictions, and ML estimates of the latent trait. Also obtain standard errors (for ML, this means saving _se[_cons] in addition to _b[_cons] by adding mlse=_se[_cons] in the statsby command). Does there appear to be much shrinkage? Calculate the total score (sum of item responses) for each person and plot the different kinds of standard errors against the total score, connecting adjacent points. Comment on what you find.

See also exercise 11.2 for further analyses of these data.

10.5 Dairy-cow data

Dohoo et al. (2001) and Dohoo, Martin, and Stryhn (2010) analyzed data on dairy cows from Reunion Island. One outcome considered was the “risk” of conception at the first insemination attempt (first service) since the previous calving. This outcome was available for several lactations (calvings) per cow.

The variables in the dataset dairy.dta used here are

• cow: cow identifier

• herd: herd identifier

• region: geographic region


• fscr: first-service conception risk (dummy variable for cow becoming pregnant)

• lncfs: log of the time interval (in log days) between calving and first service (insemination attempt)

• ai: dummy variable for artificial insemination being used (versus natural) at first service

• heifer: dummy variable for being a young cow that has calved only once

1. Fit a two-level random-intercept logistic regression model for the response variable fscr, an indicator for conception at the first insemination attempt (first service). Include a random intercept for cow and the covariates lncfs, ai, and heifer. (Use either melogit, xtlogit, or gllamm.)

2. Obtain estimated ORs with 95% confidence intervals for the covariates and interpret them.

3. Obtain the estimated residual intraclass correlation between the latent responses for two observations on the same cow.

4. Obtain the estimated median OR for two randomly chosen cows with the same covariates, comparing the cow that has the larger random intercept with the cow that has the smaller random intercept. Is there much variability in the cows' fertility?

See also exercises 8.8 and 16.1.

10.6 Union-membership data

Vella and Verbeek (1998) analyzed panel data on 545 young males taken from the U.S. National Longitudinal Survey (Youth Sample) for the period 1980–1987. In this exercise, we will focus on modeling whether the men were members of unions or not.

The dataset wagepan.dta was provided by Wooldridge (2010) and was previously used in exercise 3.6 and Part III: Introduction to models for longitudinal and panel data (in volume 1). The subset of variables considered here is

• nr: person identifier (j)

• year: 1980–1987 (i)

• union: dummy variable for being a member of a union (that is, wage being set in a collective bargaining agreement) (yij)

• educ: years of schooling (x2j)

• black: dummy variable for being Black (x3j)

• hisp: dummy variable for being Hispanic (x4j)

• exper: labor market experience, defined as age−6−educ (x5ij)

• married: dummy variable for being married (x6ij)

• rur: dummy variable for living in a rural area (x7ij)

• nrtheast: dummy variable for living in Northeast (x8ij)


• nrthcen: dummy variable for living in Northern Central (x9ij)

• south: dummy variable for living in South (x10,ij)

You can use the describe command to get a description of the other variables in the file.

1. Use ML estimation to fit the random-intercept logistic regression model

logit{Pr(yij = 1|xij , ζj)} = β1 + β2x2j + · · ·+ β11x10,ij + ζj

where ζj ∼ N(0, ψ), and ζj is assumed to be independent across persons and independent of xij. Use xtlogit here.

2. Interpret the estimated effects of the covariates from step 1 in terms of ORs, and report the estimated residual intraclass correlation of the latent responses.

3. Fit the marginal model

logit{Pr(yij = 1|xij)} = β1 + β2x2j + · · ·+ β11x10,ij

using GEE with an exchangeable working correlation structure.

4. Interpret the estimated effects of the covariates from step 3 in terms of ORs, and compare these estimates with those from step 1. Why are the estimates different?

5. Explore the within and between variability of the response variable and the covariates listed above. For which of the covariates is it impossible to estimate an effect using a fixed-effects approach? Are there any covariates whose effects you would expect to be imprecisely estimated when using a fixed-effects approach?

6. Use conditional ML estimation to fit the fixed-intercept logistic regression model

logit{Pr(yij = 1|xij)} = β1 + β2x2j + · · ·+ β11x10,ij + αj

where the αj are unknown person-specific parameters.

7. Interpret the estimated effects of the covariates from step 6 in terms of ORs, and compare these estimates with those from step 1. Why are the estimates different?

8. Perform a Hausman test to assess the validity of the random-intercept model. What do you conclude?

9. Fit the probit version of the random-intercept model from step 1 using xtprobit. Which type of model do you find easiest to interpret?


10.7 School-retention-in-Thailand data

A national survey of primary education was conducted in Thailand in 1988. The data were previously analyzed by Raudenbush and Bhumirat (1992) and are distributed with the HLM software (Raudenbush et al. 2019). Here we will model the probability that a child repeats a grade any time during primary school.

The dataset thailand.dta has the following variables:

• rep: dummy variable for child having repeated a grade during primary school (yij)

• schoolid: school identifier (j)

• pped: dummy variable for child having preprimary experience (x2ij)

• male: dummy variable for child being male (x3ij)

• mses: school mean socioeconomic status (SES) (x4j)

• wt1: number of children in the school having a given set of values of rep, pped, and male (level-1 frequency weights)

1. Fit the model

logit{Pr(yij = 1|xij , ζj)} = β1 + β2x2ij + β3x3ij + β4x4j + ζj

where ζj ∼ N(0, ψ), and ζj is independent across schools and independent of the covariates xij. Use gllamm with the weight(wt) option to specify that each row in the data represents wt1 children (level-1 units).

2. Obtain and interpret the estimated ORs and the estimated residual intraschool correlation of the latent responses.

3. Use gllapred to obtain empirical Bayes predictions of the probability of repeating a grade. These probabilities will be specific to the schools, as well as dependent on the student-level predictors.

a. List the values of male, pped, rep, wt1, and the predicted probabilities for the school with schoolid equal to 10104. Explain why the predicted probabilities are greater than 0 although none of the children in the sample from that school have been retained. For comparison, list the same variables for the school with schoolid equal to 10105.

b. Produce box plots of the predicted probabilities for each school by male and pped (for instance, using by(male) and over(pped)). To ensure that each school contributes no more than four probabilities to the graph (one for each combination of the student-level covariates), use only responses where rep is 0 (that is, if rep==0). Do the schools appear to be variable in their retention probabilities?


10.8 PISA data (solutions provided)

Here we consider data from the Program for International Student Assessment (PISA) 2000 conducted by the Organization for Economic Cooperation and Development (OECD 2000) that are made available with permission from Mariann Lemke. The survey assessed educational attainment of 15-year-olds in 43 countries in various areas, with an emphasis on reading. Following Rabe-Hesketh and Skrondal (2006), we will analyze reading proficiency, treated as dichotomous (1: proficient; 0: not proficient), for the U.S. sample.

The variables in the dataset pisaUSA2000.dta are

• id_school: school identifier

• pass_read: dummy variable for being proficient in reading

• female: dummy variable for student being female

• isei: international socioeconomic index

• high_school: dummy variable for highest education level by either parent being high school

• college: dummy variable for highest education level by either parent being college

• test_lang: dummy variable for test language (English) being spoken at home

• one_for: dummy variable for one parent being foreign born

• both_for: dummy variable for both parents being foreign born

• w_fstuwt: student-level or level-1 survey weights

• wnrschbq: school-level or level-2 survey weights

1. Fit a logistic regression model with pass_read as the response variable and the variables female to both_for above as covariates and with a random intercept for schools using melogit. (Use the default seven quadrature points.)

2. Fit the model from step 1 with the school mean of isei as an additional covariate.

3. Interpret the estimated coefficients of isei and school mean isei, and comment on the change in the other parameter estimates due to adding school mean isei.

4. From the estimates in step 2, obtain an estimate of the between-school effect of socioeconomic status.

5. Rerun the command but this time with robust standard errors.

6. q In this survey, schools were sampled with unequal probabilities, πj, and given that a school was sampled, students were sampled from the school with unequal probabilities πi|j. The reciprocals of these probabilities are given as school- and student-level survey weights, wnrschbg (wj = 1/πj) and w_fstuwt (wi|j = 1/πi|j), respectively. As discussed in Rabe-Hesketh and Skrondal (2006), incorporating survey weights in multilevel models using a so-called pseudolikelihood approach can lead to biased estimates, particularly if the level-1 weights wi|j are substantially different from 1 and if the cluster sizes are small. Neither of these issues arises here, so implement pseudo-ML estimation as follows:

a. Rescale the student-level weights by dividing them by their cluster means [this is scaling method 2 in Rabe-Hesketh and Skrondal (2006)].

b. Rename the level-2 weights and rescaled level-1 weights to wt2 and wt1, respectively.

c. Run the melogit command from step 2 above, adding [pw=wt1] before || to specify level-1 weights and the additional option pweight(wt2) to specify level-2 weights.

d. Compare the estimates with those from step 2. Robust standard errors are computed by melogit because model-based standard errors are not appropriate with survey weights.
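Steps a through c can be sketched as follows; wt1 and wt2 follow the naming in the exercise, while mn_wt1 and mn_isei are assumed names (mn_isei standing in for the school mean of isei from step 2):

```
* a. Scaling method 2: divide level-1 weights by their cluster means
egen mn_wt1 = mean(w_fstuwt), by(id_school)
generate wt1 = w_fstuwt/mn_wt1
* b. Level-2 weights under the name used in the exercise
generate wt2 = wnrschbq
* c. Pseudo-ML estimation with weights at both levels
melogit pass_read female isei mn_isei high_school college test_lang ///
    one_for both_for [pw=wt1] || id_school:, pweight(wt2)
```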

For useful discussions of the use of survey weights in multilevel modeling, we refer to Rabe-Hesketh and Skrondal (2006) and Snijders and Bosker (2012, chap. 14).

10.9 Wine-tasting data

Tutz and Hennevogl (1996) and Fahrmeir and Tutz (2001) analyzed data on the bitterness of white wines from Randall (1989).

The dataset wine.dta has the following variables:

• bitter: dummy variable for bottle being classified as bitter (yij)

• judge: judge identifier (j)

• temp: temperature (low=1; high=0) (x2ij)

• contact: skin contact when pressing the grapes (yes=1; no=0) (x3ij)

• repl: replication

Interest concerns whether conditions that can be controlled while pressing the grapes, such as temperature and skin contact, influence the bitterness. For each combination of temperature and skin contact, two bottles of white wine were randomly chosen. The bitterness of each bottle was rated by the same nine judges, who were selected and trained for their ability to detect bitterness. Here we consider the binary response "bitter" or "nonbitter".

To allow the judgment of bitterness to vary between judges, a random-intercept logistic model is specified:

ln{Pr(yij = 1 | x2ij, x3ij, ζj) / Pr(yij = 0 | x2ij, x3ij, ζj)} = β1 + β2 x2ij + β3 x3ij + ζj

where ζj ∼ N(0, ψ). The random intercepts are assumed to be independent across judges and independent of the covariates x2ij and x3ij. ML estimates and estimated standard errors for the model are given in table 10.3 below.
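For reference, estimates like those in table 10.3 could be obtained as follows (a sketch, assuming wine.dta is in the working directory):

```
use wine, clear
melogit bitter temp contact || judge:
```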


Table 10.3: ML estimates for bitterness model

                      Est     (SE)
    Fixed part
      β1            −1.50   (0.90)
      β2             4.26   (1.30)
      β3             2.63   (1.00)
    Random part
      ψ              2.80
    Log likelihood  −25.86

1. Interpret the estimated effects of the covariates as ORs.

2. State the expression for the residual intraclass correlation of the latent responses for the above model and estimate this intraclass correlation.

3. Consider two bottles characterized by the same covariates and judged by two randomly chosen judges. Estimate the median OR comparing the judge who has the larger random intercept with the judge who has the smaller random intercept.

4. q Based on the estimates given in table 10.3, provide an approximate estimate of ψ if a probit model is used instead of a logit model. Assume that the estimated residual intraclass correlation of the latent responses is the same as for the logit model.

5. q Based on the estimates given in the table, provide approximate estimates for the marginal effects of x2ij and x3ij in an ordinary logistic regression model (without any random effects).

See also exercise 11.8 for further analysis of these data.

10.10 q Random-intercept probit model

In a hypothetical study, an ordinary probit model was fit for students clustered in schools. The response was whether students gave the right answer to a question, and the single covariate was socioeconomic status (SES). The intercept and regression coefficient of SES were estimated as β1 = 0.2 and β2 = 1.6, respectively. The analysis was then repeated, this time including a normally distributed random intercept for school with variance estimated as ψ = 0.15.

1. Using a latent-response formulation for the random-intercept probit model, derive the marginal probability that yij = 1 given SES. See section 10.2.2 and remember to replace εij with ζj + εij.

2. Obtain the values of the estimated school-specific regression coefficients for the random-intercept probit model.

3. Obtain the estimated residual intraclass correlation for the latent responses.
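As a starting point for these steps, the latent-response formulation of the random-intercept probit model (see section 10.2.2) can be written as follows; this restates the model setup, not the requested derivations:

```latex
y^{*}_{ij} = \beta_1 + \beta_2\,\mathrm{SES}_{ij} + \zeta_j + \epsilon_{ij},
\qquad
y_{ij} = \begin{cases} 1 & \text{if } y^{*}_{ij} > 0 \\ 0 & \text{otherwise} \end{cases}
```

where ζj ∼ N(0, ψ) is the school-level random intercept and εij ∼ N(0, 1) is the standard normal level-1 residual of the probit model.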