
Mixed-effects Hidden Markov Model

Edward H. Ip ∗, Alison Snow Jones†, Qiang Zhang∗, Frank Rijmen ‡

September 21, 2007

Correspondence concerning this article should be addressed to Edward Ip, Wake Forest University School of Medicine, Department of Biostatistical Sciences, Division of Public Health Sciences, MRI Building 3/F., Medical Center Boulevard, Winston-Salem, NC 27157. Phone: 336.716.9833. Fax: 336.716.6427. Electronic mail may be sent to [email protected] or [email protected].

This research was made possible by research grants #R01 AA11071 (Jones (PI)) from the National Institute on Alcohol Abuse and Alcoholism and the Office of Research on Women's Health and #0532185 from the National Science Foundation (Ip (PI), Jones (co-PI)).

∗ Department of Biostatistical Sciences, Wake Forest University School of Medicine, Winston-Salem, NC 27159

† Department of Social Sciences and Health Policy, Wake Forest University School of Medicine, Winston-Salem, NC 27159

‡ Educational Testing Service, Princeton, NJ



Abstract

In this paper, we develop a method - the Mixed-effects Hidden Markov Model (MHMM) - for analyzing multiple outcomes in a longitudinal context and for examining the impact of covariates on HMM parameters. The MHMM embeds a generalized linear mixed model (GLMM) into an HMM structure and treats any one of the three sets of HMM parameters, i.e. prior probabilities, transition probabilities and conditional probabilities, as predicted variables. We present the overall likelihood function and its simplified forms, and estimate the parameters through an EM algorithm. The convergence of the algorithm and the identifiability of the model are also briefly discussed. The MHMM is applied to a sample of young children drawn from the National Longitudinal Survey of Youth (NLSY).


1. INTRODUCTION

The hidden Markov model (HMM; Rabiner 1989, MacDonald and Zucchini 1997) is a natural way of modeling stochastic processes with, often, multiple outcomes from each individual, especially in the context of longitudinal data (Scott, James and Sugar 2005). It assumes conditional independence of outcomes given a discrete latent state at each time point, and a Markov chain of latent states to represent the stochastic changes. The parameters to be estimated, i.e. prior probabilities, transition probabilities and conditional probabilities, summarize the mean features of the whole sample population.

Heterogeneity is inherently ignored when only mean features are estimated. Thus, there has been interest in relating certain covariates, also called risk factors, to these parameters. For example, in one of the HMM's major applications, speech recognition, Fujinaga, Nakai, Shimodaira and Sagayama (2001) found that the fundamental frequency correlates well with the 7th mel-cepstral coefficient (MCEP), and accordingly modeled the mean of the conditional distribution of outcomes as a linear function of MCEP. Gupta, Qu and Ibrahim (2007) assumed a similar model for gene datasets, while Rijmen, Vansteelandt and De Boeck (2006) also included a random effect in the linear regression model for conditional means. Bandeen-Roche, Miglioretti, Zeger and Rathouz (1997) were interested in how the latent states, namely health states in their context, are related to risk factors, i.e. covariates, and assumed a similar linear relationship between prior probabilities and covariates. We have not identified prior work on regression for transition probabilities.

Even though HMM regression is not a new idea, we see only limited work in the literature, and mostly in applications. Perhaps because independence can no longer be assumed for the dependent variables, it becomes a rather hard problem. Fridman (1993) initiated theoretical work in this direction by assuming a Gaussian conditional distribution for continuous outcomes, with the mean related to covariates through a design matrix. Using an EM algorithm to estimate the parameters, that work provided both re-estimation formulas and a proof of convergence of the log likelihood.

Motivated by the attempt to set up both a theoretical and a computational framework for HMM regression, we intend to allow generalized linear regressions on any one or all three sets of HMM parameters, and to include both fixed effects and random effects, which reaches further into the variance-covariance structure of the error terms. Collectively, we name this the Mixed-effects HMM (MHMM), which simultaneously estimates the HMM parameters, the regression coefficients and their standard errors (Rijmen, Vansteelandt and De Boeck 2006).

The method is illustrated by an application to a sample of young women and their children drawn from the National Longitudinal Survey of Youth (NLSY). We study a collection of cognitive and social outcomes and their evolution over time for children who have been exposed to mothers who engage in heavy episodic drinking.

The remainder of this article is organized as follows. In Section 2, we describe the MHMM, the likelihood function, the parameter estimation and the computational algorithm. In Section 3, we apply the method to the NLSY dataset with regressions on both conditional and transition probabilities. We conclude the article with a brief discussion.


2. MODEL

In this section, we begin the model description by briefly reviewing the HMM and the mixed-effects model separately, and then derive a likelihood function with the mixed-effects model embedded in an HMM structure. We then provide the details of the parameter estimation and the computational algorithm.

2.1 Hidden Markov Model

The discrete response HMM is defined as follows. Let yitj denote the response of subject i at occasion t on outcome j, i = 1, . . . , N; j = 1, . . . , J; t = 1, . . . , T. Specifically, yit = (yit1, . . . , yitJ)′ denotes the response vector of subject i at occasion t, and yi = (y′i1, . . . , y′iT)′ represents the complete response pattern of subject i. Additionally, zit ∈ {1, . . . , S} denotes the categorical latent state of subject i at occasion t. Thus, zi = (zi1, . . . , ziT)′ is the trajectory of subject i through the latent space over time. A first-order Markov chain for the latent variable assumes that the latent state at occasion t + 1 depends only upon the latent state at occasion t, i.e. Pr(zt | z1, . . . , zt−1) = Pr(zt | zt−1). Accordingly, the basic hidden Markov model contains the following three sets of parameters:

1. τ = (τrs), the matrix of time-homogeneous transition probabilities between latent states, where

τrs = Pr (zt = s |zt−1 = r ) , t = 2, . . . , T. (1)

2. αt = (αt1, . . . , αtS)′, the vector of marginal-state probabilities at occasion t,

αts = Pr(zt = s), t = 1, . . . , T, (2)

which can be recursively derived by the formula α′t = α′t−1 τ.

3. Π = (πjsk), the matrix of state-conditional response probabilities, where k is the index of the outcome category, k = 1, . . . , K, and for any t,

πjsk = Pr (yitj = k |zt = s) (3)

Assuming conditional independence between responses given the latent state, the marginal probability of a response pattern yi is given by

Pr(yi) = ∑_z Pr(z) ∏_{t=1}^{T} Pr(yit | zt),   (4)

with the summation over the entire latent state space of size S^T. Furthermore, we can write out the likelihood using the local independence assumption given the latent state,

Pr(yit | zt = s) = ∏_{j=1}^{J} ∏_{k=1}^{K} πjsk^{yijtk},   (5)

where yijtk = 1 if yitj = k, and 0 otherwise.


The distribution of latent states is given by:

Pr(z) = α1z1 ∏_{t=2}^{T} τzt−1,zt.   (6)
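The summation in (4) over all S^T possible state sequences is never carried out directly; the Markov structure lets Pr(yi) be accumulated occasion by occasion by the standard forward recursion in O(T S^2) operations. Below is a minimal sketch of that recursion in Python, for a single binary outcome per occasion; the toy dimensions and probability values are illustrative only and are not estimates from this paper.

import numpy as np

def hmm_marginal_loglik(alpha1, tau, log_obs):
    """Forward recursion for log Pr(y_i) under (4)-(6).

    alpha1  : (S,) occasion-1 marginal state probabilities.
    tau     : (S, S) transition matrix, tau[r, s] = Pr(z_t = s | z_{t-1} = r).
    log_obs : (T, S) array with log Pr(y_it | z_t = s) from (5).
    """
    T, S = log_obs.shape
    fwd = alpha1 * np.exp(log_obs[0])          # unnormalized forward variable at t = 1
    loglik = 0.0
    for t in range(1, T):
        c = fwd.sum()                          # scaling constant to avoid underflow
        loglik += np.log(c)
        fwd = (fwd / c) @ tau * np.exp(log_obs[t])
    return loglik + np.log(fwd.sum())

# toy example: S = 2 states, T = 3 occasions, one binary outcome
alpha1 = np.array([0.6, 0.4])
tau = np.array([[0.8, 0.2], [0.3, 0.7]])
pi1 = np.array([0.9, 0.2])                     # Pr(y_t = 1 | z_t = s)
y = np.array([1, 1, 0])
log_obs = np.where(y[:, None] == 1, np.log(pi1), np.log(1.0 - pi1))
print(hmm_marginal_loglik(alpha1, tau, log_obs))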

2.2 Mixed-effects HMM

The mixed-effects regression (Rijmen, Vansteelandt and De Boeck 2006), used in conjunction with the HMM, is motivated by several factors. First, in the traditional HMM, at a specific time point the outcomes are assumed to be conditionally independent given the latent state. The assumption of conditional independence often leads to a large number of latent states because of possible residual correlation among responses. The mixed-effects regression model handles dependency among multiple responses the way it is handled in item response theory (IRT) (Lord and Novick 1968). Specifically, an extended one-parameter logistic model (1PL), or Rasch model, is designed to capture conditional dependence between items given state and time. According to the Rasch model, an underlying random effect, θ, is posited to influence the conditional probabilities with which a subject responds positively to an item. A second motivation for using an IRT-based model is the ability to incorporate both item-level and individual-level contextual information (Fischer 1973).

In a general form, the mixed-effects model can be expressed as:

E(g(ζi)|bi) = Xβ + Zbi, (7)

where g is the link function, β are the fixed effects, and bi ∼ N(0, Σ) are the random effects (Fitzmaurice, Laird and Ware 2004). The assumptions on g are that g ∈ C2, g′ > 0, g(−∞) = 0 and g(+∞) = 1. ζ represents one of the three sets of HMM parameters, i.e. τ, α or π.
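To fix ideas, here is a minimal sketch of how (7) produces subject-specific probabilities once an inverse link is chosen. The multinomial-logit inverse link with the first category as reference is used purely for illustration, since the model leaves the choice of link open; the design matrices, coefficient values and dimensions below are hypothetical.

import numpy as np

def softmax_with_reference(eta):
    """Map a length-(S-1) linear predictor to S probabilities,
    with category 1 as the reference (its predictor fixed at 0)."""
    expo = np.exp(np.concatenate(([0.0], eta)))
    return expo / expo.sum()

# hypothetical design: 2 fixed-effect covariates, 1 random intercept, S = 3 states
X_i = np.array([[1.0, 0.4], [1.0, 0.4]])   # one row per non-reference state
Z_i = np.ones((2, 1))
beta = np.array([0.5, -1.2])               # fixed effects
b_i = np.array([0.3])                      # a draw from N(0, Sigma)

eta_i = X_i @ beta + Z_i @ b_i             # linear predictor X beta + Z b_i as in (7)
alpha_i1 = softmax_with_reference(eta_i)   # subject-specific prior state probabilities
print(alpha_i1)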

Since the mixed-effects regression needs individual-level HMM parameters as the predicted variables, we redefine (1) to (3) by adding a subject index i,

τirs = Pr(zt = s|zt−1 = r,yi, θ), (8)

αits = Pr(zt = s|yi, θ), (9)

and

πijsk = Pr(yitj = k | zt = s, yi, θ),   (10)

where θ = (β, Σ). Equations (7) to (10) together define the Mixed-effects HMM (MHMM).

2.3 Parameter Estimation

The EM algorithm (Dempster, Laird and Rubin 1977) is used to estimate the parameters. The objective is to maximize the expected log-likelihood of the complete data, which include the observed y, the latent states z and the random effects b, given provisional estimates of the parameters. If we denote the whole parameter set by θ, then the EM algorithm is an iterative process that solves for θ(n+1) at step n, i.e.

θ(n+1) = argmaxθ Q(θ, θ(n)),


where Q, the expected value of the log-likelihood function, is defined as:

Q(θ, θ(n)) = E[ log p(y, z, b | θ) | y ].   (11)

With discrete latent state variables z and continuous random effect variables b, we have

Q(θ, θ(n)) = ∑_z ∫ p(z, b | y, θ(n)) log p(y, z, b | θ) db
           = ∑_z ∫ p(b | z, y, θ(n)) p(z | y, θ(n)) log p(y, z, b | θ) db.   (12)

With the assumptions of independent observations and i.i.d. random effects, it is not hard to see that

Q(θ, θ(n)) = ∑_i ∑_z ∫ p(z | yi, θ(n)) log p(yi, z, bi | θ) fθ(n)(bi) dbi,   (13)

where fθ(n) is a multivariate normal density with covariance matrix Σ(n), since bi depends only on θ(n), which includes Σ(n). The posterior probabilities in (13), p(z | yi, θ(n)), can be derived via Bayes' theorem and are treated as known factors in (13), as is fθ(n)(bi).

We can write out the joint distribution of the complete data of subject i, by (4) to (6), as

p(yi, zi, bi | θ) = c αi1z1 ∏_{t=2}^{T} τi,zt−1zt ∏_{t=1}^{T} ∏_{j=1}^{J} ∏_{k=1}^{K} πisjk^{yijtk} fθ(bi).   (14)

What distinguishes (14) from the basic HMM is that all three parameter sets carry a subject index i, and all parameters are related to covariates through the following generalized linear mixed effects models, i.e.

E(αi1|bi) = hα(Xαβα + Zαbi),

E(τi|bi) = hτ (Xτβτ + Zτbi),

E(πi|bi) = hπ(Xπβπ + Zπbi), (15)

where αi1 ∈ R^S, τi ∈ R^{S·S} and πi ∈ R^{J·S·K}. The inverse link function, h, can be different from one set of parameters to another, as can the design matrices X and Z. In the trivial case, if we set all h's to the identity function, all X's to the identity matrix and all Z's to the zero matrix, we recover the basic HMM. In this sense, (14) is a generalized HMM, and it is also a generalization of the latent variable regression in Bandeen-Roche, Miglioretti, Zeger and Rathouz (1997), where the regression applies only to the latent state prior probabilities, αi, and of Rijmen, Vansteelandt and De Boeck (2006) and Gupta, Qu and Ibrahim (2007), in both of which the regression is only on the conditional probabilities, πi. Moreover, none of these three articles included random effects. Equations (14) and (15) together define the Mixed-effects HMM (MHMM).

One nice feature of (14) is the separability of the three parameter sets, which can be optimized individually, because the logarithm turns products into


additions. Now we can write out the complete objective function, Q(θ, θ(n)):

Q(θ, θ(n)) = ∑_i ∑_z p(z | yi, θ(n)) ∫ log[ hα(Xαβα + Zαbi) |q ] fθ(n)(bi) dbi
           + ∑_i ∑_z p(z | yi, θ(n)) ∫ ∑_{t=2}^{T} log[ hτ(Xτβτ + Zτbi) |rs ] fθ(n)(bi) dbi
           + ∑_i ∑_z p(z | yi, θ(n)) ∫ ∑_t ∑_j ∑_k yijtk log[ hπ(Xπβπ + Zπbi) |jsk ] fθ(n)(bi) dbi
           + ∑_i ∑_z p(z | yi, θ(n)) ∫ log[ fθ(bi) ] fθ(n)(bi) dbi,   (16)

where [·]|q, [·]|rs and [·]|jsk denote the component of the vector-valued inverse link corresponding to the state q = z1, to the transition (r, s) = (zt−1, zt), and to the response category (j, s, k), respectively.

Even though (16) looks long and seemingly difficult to solve, the following lemmas show that each part of (16) is simply the log likelihood function of multinomial responses, and thus the standard approach for solving a generalized linear mixed model (McCullagh and Nelder 1989, Fitzmaurice, Laird and Ware 2004) can be applied to each part.

To begin with a simpler statement and proof, we first ignore the random effects in (15); the next lemma shows that the first three rows of (16) each take exactly the form of a multinomial log likelihood. We then show that including the random effects results in a similar form of log likelihood functions.

Lemma 1. Without random effects in the MHMM, each of the first three rows of (16) can be simplified into the exact form of the log likelihood function of multinomial responses,

∑_i ∑_I yiI log(πiI),   (17)

where I is a vector of summation indices, the yiI are observed responses and the πiI are the probabilities to be estimated.

1) In the first row of (16), I = {s = 1, . . . , S}, yiI = α(n)i1s = p(z1 = s | yi, θ(n)) and πiI = αi1s.

2) In the second row of (16), I = {r = 1, . . . , S, s = 1, . . . , S}, yiI = ∑_{t=2}^{T} τ(n)itrs, where τ(n)itrs = p(zt−1 = r, zt = s | yi, θ(n)), and πiI = τirs.

3) In the third row of (16), I = {s = 1, . . . , S, j = 1, . . . , J, k = 1, . . . , K}, yiI = ∑_t α(n)its yijtk and πiI = πisjk, where α(n)its = p(zt = s | yi, θ(n)).

In (17), the α(n)iq, τ(n)itrs and α(n)its yijtk act as observed responses, which result from the current estimates of the parameters, and the αiq, τirs and πisjk are the new sets of probabilities to be linked with the parameters in a generalized linear model format. With this linkage, if we take the sum-to-one constraints on αq, τrs and πsjk, add a Lagrange term to each form of (17) and take derivatives, it is easy to see that the MLEs of αq, τrs and πsjk are exactly the re-estimation formulas of the basic HMM (Rabiner 1989). In other words, if we decide not to run a regression on any one of αq, τrs and πsjk, we simply use its re-estimation formula from the basic HMM.
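For the no-regression case just described, the constrained multinomial MLE is simply the normalized table of expected counts, which reproduces the familiar Baum-Welch style re-estimation. The following is a minimal sketch, assuming the posterior quantities of Lemma 1 have already been computed in the E-step; the array names and shapes are illustrative only.

import numpy as np

def reestimate_without_regression(post_z1, post_trans, post_state, y_onehot):
    """Closed-form M-step when no regression is placed on a parameter set.

    post_z1    : (N, S)          p(z_1 = s | y_i, theta^(n))
    post_trans : (N, T-1, S, S)  p(z_{t-1} = r, z_t = s | y_i, theta^(n))
    post_state : (N, T, S)       p(z_t = s | y_i, theta^(n))
    y_onehot   : (N, T, J, K)    indicator y_ijtk
    """
    # prior probabilities: average posterior occasion-1 state membership
    alpha1 = post_z1.mean(axis=0)
    # transition probabilities: expected transition counts, row-normalized
    counts = post_trans.sum(axis=(0, 1))                   # (S, S)
    tau = counts / counts.sum(axis=1, keepdims=True)
    # conditional probabilities: expected response counts within each state,
    # pi[s, j, k] = sum_{i,t} post_state[i,t,s] * y_onehot[i,t,j,k], normalized over k
    pi = np.einsum('its,itjk->sjk', post_state, y_onehot)
    pi = pi / pi.sum(axis=2, keepdims=True)
    return alpha1, tau, pi

# toy shapes: N = 3 subjects, T = 4 occasions, S = 2 states, J = 2 items, K = 2 categories
rng = np.random.default_rng(1)
post_z1 = rng.dirichlet(np.ones(2), size=3)
post_trans = rng.dirichlet(np.ones(4), size=(3, 3)).reshape(3, 3, 2, 2)
post_state = rng.dirichlet(np.ones(2), size=(3, 4))
y_onehot = np.eye(2)[rng.integers(0, 2, size=(3, 4, 2))]
print(reestimate_without_regression(post_z1, post_trans, post_state, y_onehot))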


The literature on mixed-effects multinomial regression varies mostly in two respects. First, because of the integration over the random effects in computing the marginal distribution of the response, there is no closed form solution, and there have been various approaches to the numerical integration, e.g. Gauss-Hermite quadrature (Hinde 1982), Monte Carlo techniques (McCulloch 1997) and Taylor expansions (Wolfinger and O'Connell 1993), among which Gauss-Hermite quadrature seems the most popular; it turns the integrals into summations over Q quadrature points per dimension. The likelihood function can also be replaced by a pseudo-likelihood that avoids the integration (Wolfinger and O'Connell 1993, Hartzel 2001), though this gives a poorer approximation.

Second, for the iterative parameter updating algorithm, Harville and Mee (1984), Jansen (1990) and Tutz and Hennevogl (1996) used the EM algorithm, though the first two articles commented on its slow convergence. Ezzet and Whitehead (1991) chose the Newton-Raphson method, and similarly, Hedeker and Gibbons (1994) and Hedeker (2003) used the Fisher scoring solution, in the same spirit as the Newton-Raphson method, which converges much faster than the EM algorithm when applied to random effects models (Bock 1989) and also provides standard errors for all model parameters.

Here our choice is to use Gauss-Hermite quadrature for the integration and the Fisher scoring solution for updating the parameters in an iterative process.
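For reference, a minimal sketch of how a one-dimensional integral of the form ∫ g(b) fθ(b) db, with fθ a N(0, σ²) density, is replaced by a weighted sum over Q Gauss-Hermite quadrature points. numpy's hermgauss supplies points and weights for integrals against exp(−x²), so a change of variables b = √2 σ x is needed; the integrand g below is only a placeholder.

import numpy as np

def gauss_hermite_expectation(g, sigma, Q=15):
    """Approximate E[g(b)] for b ~ N(0, sigma^2) with Q quadrature points."""
    x, w = np.polynomial.hermite.hermgauss(Q)
    b = np.sqrt(2.0) * sigma * x          # rescale points for the normal density
    return np.sum(w * g(b)) / np.sqrt(np.pi)

# check: E[exp(b)] for b ~ N(0, 0.5^2) equals exp(0.5^2 / 2) ≈ 1.1331
print(gauss_hermite_expectation(np.exp, 0.5))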

We denote the Gauss-Hermite quadrature points by v(n) ∈ R^b, a b-dimensional vector that replaces bi in (15), and by w(n) their weights, which replace fθ(bi); both depend on the current estimate of the variance structure, Σ(n). With little additional effort in the proof, we can extend the previous lemma to the following one.

Lemma 2. With random effects in the MHMM, we have the same general form of log likelihood functions as (17), except that extra dimensions of quadrature points for the random effects are added.

1) In the first row of (16), I = {s, l}, yiI = w(n)l α(n)i1sl and πiI = αi1sl, where l = (lk), k = 1, . . . , b, lk = 1, . . . , Q, and α(n)i1sl = p(z1 = s | yi, θ(n), v(n)l).

2) In the second row of (16), I = {r, s, l}, yiI = ∑_{t=2}^{T} w(n)l τ(n)itrsl and πiI = τirsl, where τ(n)itrsl = p(zt−1 = r, zt = s | yi, θ(n), v(n)l).

3) In the third row of (16), I = {s, j, k, l}, yiI = ∑_t w(n)l α(n)itsl yijtk and πiI = πisjkl, where α(n)itsl = p(zt = s | yi, θ(n), v(n)l).

Now we can define β̂ as,

β̂ = argmaxβ ∑_i ∑_I yiI log(πiI),   (18)

where I, y and π depend on the parameter set, as shown in Lemma 2. From Section 3.4.1 of Fahrmeir and Tutz (1994), the Fisher scoring iteration uses the following updating equation,

β(n+1) = β(n) + (X ′W (β(n))X)−1s(β(n)), (19)


where

W(β) = D(β) V−1(β) D′(β),

and

s(β) = X′ D V−1 (y − µ(β)).

Here, D is the derivative of the inverse link function h, V is the variance function and µ is the vector of estimated frequency counts at step n. The matrices W, V and D all take a block-diagonal form, i.e.

W = diag(Wi),   V = diag(Vi),   D = diag(Di).

Since Di = ∂h(ηi)/∂η, we only need to specify V for the mixed-effects regression (MER) on each parameter set of the HMM. For simplicity, we only show the versions with fixed effects only, because to include the random effects we simply add b extra dimensions of quadrature points.

1. For the MER on the prior probabilities, we have

Vi = diag(αi1) − αi1 α′i1,

where αi1 ∈ I^S.

2. For the MER on the transition probabilities, we have

Vi = diag(Vir),   Vir = ( 1 / ∑_s ∑_{t=2}^{T} τitrs ) [ diag(τir) − τir τ′ir ],

where τir ∈ I^S.

3. For the MER on the conditional probabilities, we have

Vi = diag(Visj),   Visj = ( 1 / ∑_t ∑_k yijtk αits ) [ diag(πisj) − πisj π′isj ],

where πisj ∈ I^K.

Note that α, τ and π are all computed using the current estimates of parameters.
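To make the update in (19) concrete, the following is a minimal, self-contained sketch of Fisher scoring for a fixed-effects-only baseline-category multinomial logit, i.e. the simplest special case of the MER above: no random effects, one grouped observation per subject, and hypothetical toy data. It is meant only to show the mechanics of forming the score and the Fisher information and solving for the update; it is not the paper's implementation.

import numpy as np

def fisher_scoring_multinomial(X, Y, n_iter=25, tol=1e-8):
    """Fisher scoring (the update in (19)) for a baseline-category
    multinomial logit: eta_ik = x_i' beta_k, k = 1..K-1, category K as reference.

    X : (N, p) design matrix (fixed effects only).
    Y : (N, K) matrix of category counts per subject.
    Returns the coefficient matrix beta of shape (p, K-1).
    """
    N, p = X.shape
    K = Y.shape[1]
    n = Y.sum(axis=1)                                        # group sizes
    beta = np.zeros((p, K - 1))
    for _ in range(n_iter):
        eta = X @ beta                                       # (N, K-1) linear predictors
        expeta = np.exp(np.column_stack([eta, np.zeros(N)]))
        pi = expeta / expeta.sum(axis=1, keepdims=True)      # (N, K) fitted probabilities
        score = np.zeros(p * (K - 1))
        info = np.zeros((p * (K - 1), p * (K - 1)))
        for k in range(K - 1):
            score[k*p:(k+1)*p] = X.T @ (Y[:, k] - n * pi[:, k])
            for l in range(K - 1):
                w = n * pi[:, k] * ((k == l) - pi[:, l])     # multinomial variance weights
                info[k*p:(k+1)*p, l*p:(l+1)*p] = X.T @ (X * w[:, None])
        step = np.linalg.solve(info, score)                  # (X'WX)^{-1} s(beta)
        beta = beta + step.reshape(K - 1, p).T
        if np.max(np.abs(step)) < tol:
            break
    return beta

# hypothetical toy data: N = 200 subjects, intercept plus one covariate, K = 3
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
true_beta = np.array([[0.5, -0.3], [1.0, -1.0]])             # (p, K-1)
eta = X @ true_beta
prob = np.exp(np.column_stack([eta, np.zeros(200)]))
prob /= prob.sum(axis=1, keepdims=True)
u = rng.random((200, 1))
cat = (u > np.cumsum(prob, axis=1)).sum(axis=1)              # sampled category 0..2
Y = np.eye(3)[cat]
print(fisher_scoring_multinomial(X, Y))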

2.4 Convergence and Identifiability

In (11), we defined the expected log likelihood of the complete data, including latent states and random effects. Eventually, however, we would like to know whether the iterative EM process yields an increasing sequence of log likelihood values of the observed data only, defined as

L(θ) = log p(y | θ).   (20)

Here we use a map within the parameter space, Θ, to describe the EM parameter updating, i.e.

σ : θ → θ̂ = argmaxθ̂ Q(θ̂, θ),   (21)

where θ = (βα,βτ ,βπ,Σ) and θ̂ = (β̂α, β̂τ , β̂π, Σ̂).


Wu (1983) showed that any EM sequence of parameter estimates increases the likelihood of the incomplete data and, if the likelihood is bounded above, L converges to some L′. If the unobserved complete data can be described by a curved exponential family, L′ is a stationary value of L. Following the same line, we explicitly prove the increasing sequence of likelihood values and its convergence to a stationary point in the first part of the following theorem.

Model identifiability, a particular problem for latent class models (Goodman 1974), concerns situations in which different parameter sets give rise to identical probability distributions, and hence the same maximum likelihood of the observed data. This leads to ambiguities in interpreting the estimated parameters. One weak form of identifiability is local identifiability, i.e. the model is identifiable in a small neighborhood of a point in the parameter space. In the second part of the following theorem, we show that a stationary point of σ is also locally identifiable, and thus the MHMM is locally identifiable. This means that the likelihood function may have multiple local maxima, and multiple runs with different starting values might be needed to choose the best maximum as the global one.

Theorem 1. With L(θ) and σ defined in (20) and (21), we have:

(1) L(σ(θ)) ≥ L(θ) for all θ ∈ Θ.

(2) If the link function is strictly log concave and X has full rank, the maximum likelihood estimates exist and are unique; hence, the equality in (1) holds iff θ is a stationary point of σ.

(3) A stationary point, θ0, is locally identifiable, i.e. there exists a neighborhood ν of θ0 such that ∀θ ∈ ν,

L(θ) = L(θ0) ⇔ θ = θ0.

We leave the proof of the first part to the Appendix and prove the second and third parts here.

Albert and Anderson (1984) and Lesaffre and Kaufmann (1992) both showed that if the covariate points are completely separable or quasi-completely separable, no finite MLEs exist and the likelihood diverges. Figure 1, from Lesaffre and Kaufmann (1992), shows how two covariates plotted on a 2-D plane can be completely separated by their bivariate response patterns. Here we show that the covariate points are inherently inseparable in the MHMM.

Since the latent states in an HMM are not observable, and we only infer the probability of observing each latent state given a subject's responses, this is equivalent to assigning different latent state memberships to the same subject multiple times. Thus, the covariate points cannot be separated unless we are in one extreme situation, i.e. all the probabilities are either 1 or 0. But this is not possible, given our assumptions on the inverse link function, which can only reach 1 or 0 when β(n) → ∞, whereas our initial estimate is never ∞. This finishes the proof of the existence and uniqueness of β̂.

The full rank of X ensures the positive definiteness of the Fisher information matrix, F(β), and thus the likelihood function has a local maximum and is also locally identifiable.

2.5 Computation and Algorithm

The E-step, which involves the computation of the joint probabilities in (14), can be efficiently implemented using graphical model theory (Lauritzen 1995).


Figure 1: Schematic representation of complete separation. The numbers 1, 2, 3 and 4 denoteresponse patterns (-1,-1), (1, -1), (-1,1) and (1,1) (Lesaffre and Kaufmann 1992).

The essence of this implementation is a decomposition of a high-dimensional joint probability distribution into modules of lower dimension. Briefly, we create a directed acyclic graph for the associated statistical model. The directed graph, consisting of nodes (variables) and arcs, is a pictorial representation of the conditional dependence relations of the statistical model. The directed acyclic graph (DAG) is then transformed into a so-called junction tree, which forms the basis of efficient computational schemes (Jensen, Lauritzen and Olesen 1990; Cowell, Dawid, Lauritzen and Spiegelhalter 1999). To construct the directed acyclic graph, and to transform the graph into a junction tree, we adapted the Bayes Net Toolbox of Murphy (2001).

Figure 2 shows an example of such a DAG: a basic HMM with nodes for the latent states, z1, z2 and z3, and for the observables, yi1, yi2 and yi3, plus an extra node for the random effect θi, which is a parent of the two latent state nodes at times 2 and 3, meaning that we are running a mixed-effects regression model on the transition probabilities. Similarly, if we make the random effect node a parent of the three observable nodes, we are running the mixed-effects regression on the conditional probabilities.
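As a plain illustration of the graph structure just described (the node names simply mirror Figure 2 and are not toolbox syntax), the parent sets for the case of a regression on the transition probabilities can be written down directly:

# Parent structure of the DAG in Figure 2: theta_i feeds into the latent-state
# nodes at times 2 and 3 (regression on transition probabilities); each
# observable depends only on its own latent state.
parents = {
    "theta_i": [],
    "z1": [],
    "z2": ["z1", "theta_i"],
    "z3": ["z2", "theta_i"],
    "y_i1": ["z1"],
    "y_i2": ["z2"],
    "y_i3": ["z3"],
}
# For a regression on the conditional probabilities instead, theta_i would be
# added to the parent lists of y_i1, y_i2 and y_i3.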

In fitting HMMs, the number of hidden (latent) states is usually not known a priori, and a model selection step prior to parameter estimation is required. Model selection has been an important and actively researched area in HMMs (Scott 2002) and in finite mixture models generally (McLachlan and Peel 2000, MacKay 2002). In this application, we follow an approach suggested by Huang and Bandeen-Roche (2004): we fit a range of HMMs to the data and use the Bayesian Information Criterion (Schwarz 1978) to determine the optimal number of states as the one giving the lowest BIC value.

We recommend not introducing covariates in the model selection step; in other words, we first fit only basic HMMs, and then estimate the MHMM parameters only for the model with the optimal number of states. The reason is that adding a few


Figure 2: Directed acyclic graph of a HMM model with a latent trait node

covariates would not significantly increase the number of parameters to be estimated, compared to the total number of basic HMM parameters, S^2 + SJ(K − 1) − 1, and thus it would not alter the BIC values significantly either. However, estimating the MHMM for the full range of latent states would require a different order of computation time. Thus, running basic HMMs for the model selection saves much time without altering the BIC values much.
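For concreteness, a minimal sketch of the BIC computation used in this selection step, with the basic HMM parameter count given above; the log likelihood value in the example is purely illustrative, and taking the number of subjects as the sample size in the penalty is one common convention rather than something prescribed by the paper.

import numpy as np

def n_basic_hmm_params(S, J, K):
    """Free parameters of the basic HMM: (S-1) prior + S(S-1) transition
    + S*J*(K-1) conditional, i.e. S^2 + S*J*(K-1) - 1."""
    return S * S + S * J * (K - 1) - 1

def bic(loglik, S, J, K, N):
    """Schwarz (1978) criterion: -2 log L + (number of parameters) * log N."""
    return -2.0 * loglik + n_basic_hmm_params(S, J, K) * np.log(N)

# example: S = 5 states, J = 8 items, K = 3 categories, N = 367 subjects,
# with a purely illustrative maximized log likelihood of -7900
print(bic(-7900.0, S=5, J=8, K=3, N=367))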

Finally, we summarize the above in pseudo-code as Algorithm 1.

3. DATA ANALYSIS

Data sources are the 1989 and 1994 waves of the National Longitudinal Survey of Youth (NLSY79) and the 1990-1994 Children of the NLSY (CoNLSY), a sample of N = 367 children (52% female, 29% African-American and 20% Hispanic).

This analysis focuses on cognitive and social development in children five to nine years old. Five variables were selected from the Behavior Problem Index (BPI): antisocial, anxious, peer conflict, headstrong, and hyperactive behaviors. Three variables that measure academic achievement were selected from the Peabody Individual Achievement Test (PIAT) scores: math, oral reading, and reading comprehension. We trichotomized the BPI subscales and the PIAT scores into three categories (0, 1, 2) based on nationally normed percentiles, with cutpoints at the 15th and 85th percentiles. The BPI categories were then reordered so that movement from 0 to 2 represented increasingly better behavior and conformed to the direction of the PIAT classification.
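A minimal sketch of that recoding step, assuming a vector of nationally normed percentile scores; the function name and the example values are illustrative only.

import numpy as np

def trichotomize(percentiles, lower=15.0, upper=85.0):
    """Recode percentile scores into 0 / 1 / 2 using cutpoints at the
    15th and 85th percentiles."""
    p = np.asarray(percentiles, dtype=float)
    return np.where(p < lower, 0, np.where(p < upper, 1, 2))

# illustrative percentile scores for one BPI subscale
scores = [4.0, 22.5, 60.0, 91.0]
print(trichotomize(scores))          # -> [0 1 1 2]
# for BPI items the categories are then reversed so that 2 = better behavior
print(2 - trichotomize(scores))      # -> [2 1 1 0]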

Approximately 1 in 4 children in the United States under 18 years of age is exposed to heavy, abusive or dependent drinking in the family (Grant 2000). Previous studies (Jones, Miller and Salkever 1999; Wall, Garcia-Andrade, Wong, Lau and Ehlers 2000; Jones forthcoming; Schuckit and Chiles 1978; Aronson, Kyllerman, Sabel, Sandin and Olegard 1985) have found a positive association between parental or maternal alcohol consumption and children's behavioral and cognitive problems. However, most of these


Algorithm 1 MHMM Estimation and Maximization Algorithm

for S = S1 to S2 do
    α, τ, π = random(0,1);
    (loglikelihood, α, τ, π) = fitHMM(Y, α, τ, π);
    BICS = computeBIC(loglikelihood, S);
end for
S = findMinimumPosition(BICS);
(Xα, Xτ, Xπ, Zα, Zτ, Zπ) = constructDesignMatrices(S, covariates);
(βα, βτ, βπ) = random(0,1);
while converge(loglikelihood) == False do
    (ηα, ητ, ηπ) = computeLinearPredictors(βα, βτ, βπ, Xα, Xτ, Xπ);
    (α, τ, π) = linkFunction(ηα, ητ, ηπ);
    (cliqueTable, separatorTable) = propagateMessages('forward', α, τ, π);
    (cliqueTable, separatorTable) = propagateMessages('backward', α, τ, π);
    (αi, τi, πi) = computeJointProbabilities(cliqueTable);
    loglikelihood = computeLoglikelihood(αi, τi, πi);
    (βα, βτ, βπ, seα, seτ, seπ) = fitMHMM(Y, αi, τi, πi, Xα, Xτ, Xπ, Zα, Zτ, Zπ);
end while

studies viewed developmental outcomes as relatively stationary over time. They have not examined the extent to which dynamic child development trajectories differ among children with heterogeneous cognitive and behavioral profiles who are exposed to heavy-drinking versus non-heavy-drinking mothers.

Because characteristics of the child's home may also be associated with child behavior problems, we controlled for the home environment in the mixed-effects regression using the Home Observation for Measurement of the Environment (HOME) Inventory, i.e. two HOME scores (low, medium, or high) measuring cognitive stimulation and emotional support.

3.1 Results of basic HMM parameters

Figure 3 shows the BIC values for models with 2 to 7 hidden states. Based on the BIC criterion, the 5-state model was selected. Figure 4 displays the profiles of the five states. Each bar shows the conditional probability distribution of the response categories for a specific item within each state. As noted above, the BPI was rescaled such that higher values for items imply fewer behavior problems. For PIAT items, higher values indicate better academic achievement.

Table 1 summarizes the interpretation of the five latent states, based on their conditional probability profiles. Table 2 contains the prior probabilities of the five states and the transition probability matrix.

3.2 Fixed-effect regression on conditional probabilities

We first run a fixed-effects regression on the conditional probabilities to see whether they are affected by the two HOME scores and maternal drinking during pregnancy. The model



Figure 3: BIC values for models with 2 to 7 latent states

Figure 4: Conditional probability profiles for the 5-state model. Each panel (state 1 to state 5) shows, on a 0 to 1 scale, the conditional response probabilities for the items Anti-Social, Anxiety, Dependency, Head Strong, Hyperactivity, PIAT Math, PIAT RR and PIAT RC.

is described below.

g(πisjk) = βsjk + xi1γ1 + xi2γ2 + xi3γ3, k = 1, . . . ,K − 1, (22)

and

g1(πisj1, . . . , πisjK) = log( πisj1 / (1 − πisj1) ),

gk(πisj1, . . . , πisjK) = log[ log( (πisj1 + . . . + πisjk) / (1 − πisj1 − . . . − πisjk) ) − log( (πisj1 + . . . + πisjk−1) / (1 − πisj1 − . . . − πisjk−1) ) ],   (23)

where xi1, xi2 and xi3 are the two HOME scores and the maternal pregnancy drinking score, and γ1 to γ3 are the coefficients to be estimated. The link function, g, looks a bit complicated because the ordinal responses demand a cumulative model (Section 3.3.3 of Fahrmeir and Tutz 1994).
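For orientation, a minimal sketch of the cumulative quantities such a model works with. The code computes the plain cumulative logits log(F_k / (1 − F_k)) for an ordinal probability vector; the transformation in (23) then takes g_1 as the first cumulative logit and, for k ≥ 2, the logarithm of the difference of successive cumulative logits, which keeps them ordered. The probability values below are illustrative only.

import numpy as np

def cumulative_logits(pi):
    """Cumulative logits log(F_k / (1 - F_k)), k = 1..K-1,
    where F_k = pi_1 + ... + pi_k."""
    pi = np.asarray(pi, dtype=float)
    F = np.cumsum(pi)[:-1]                     # F_1, ..., F_{K-1}
    return np.log(F / (1.0 - F))

def g_transform(pi):
    """The transformation in (23): g_1 is the first cumulative logit; g_k for
    k >= 2 is the log of the difference of successive cumulative logits."""
    cl = cumulative_logits(pi)
    return np.concatenate(([cl[0]], np.log(np.diff(cl))))

# illustrative ordinal probabilities for K = 3 categories
pi = np.array([0.2, 0.5, 0.3])
print(cumulative_logits(pi))
print(g_transform(pi))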


Table 1: Interpretation of states

State     Interpretation
State 1   No problems (NP)
State 2   Bad behavior, good academics (PB/GA)
State 3   Good behavior, poor academics (GB/PA)
State 4   Very bad behavior, poor academics (VPB/PA)
State 5   Bad behavior, very poor academics (PB/VPA)

Table 2: Estimated prior state probabilities and transition probabilities

                               State 1   State 2   State 3   State 4   State 5
Prior probabilities α:           0.28      0.18      0.27      0.16      0.10
Transition probabilities τ (rows = from, columns = to):
From State 1                     0.80      0.08      0.04      0.06      0.02
From State 2                     0.11      0.80      0.09      0.01      0.00
From State 3                     0.15      0.04      0.67      0.04      0.11
From State 4                     0.09      0.00      0.04      0.87      0.00
From State 5                     0.00      0.00      0.06      0.00      0.94

Table 3 reports the fixed-effects regression coefficient estimates, standard errors and p-values. The two HOME scores show strong positive impacts on the conditional response probabilities (p < 0.001), indicating that more supportive home environments are associated with better behavior and academic performance.

Table 3: Coefficients for fixed-effects regression covariates and their significances

Covariate                       Coefficient     SE        t value    Pr > |t|
HOME cognitive stimulation        0.928         0.048     19.33      < .001
HOME emotional support            0.259         0.051      5.117     < .001
Maternal pregnancy drinking      -0.0012        0.2159    -0.005      0.498

3.3 Mixed-effect regression on transition probabilities

In this second analysis, we add a subject-level random effect to the regression on the transition probabilities, and use only the HOME score on emotional support:

log( τirs / τir1 ) = βrs + xi1γrs + θi,   s = 2, . . . , S,   (24)

where xi1 is the covariate and the γrs are the coefficients. Notice that we have one γ for each transition probability, and thus from its positive or negative value we can see the covariate's impact on that transition probability. Though the model above introduces


more parameters, which is the reason we use only one covariate here, it also gives a complete picture of a covariate's influence.

Table 4 shows the estimated γrs, and we see clearly that HOME emotional support has a positive impact on transitions from worse to better states and a negative impact on transitions from better to worse states. Note that, due to a permutation of the states, the reference state here is State 2. Also, ±15 are the limits we impose on the parameters to avoid machine overflow.

Table 4: Coefficients of HOME emotional support for the mixed-effects regression on transition probabilities and their significances

From       To          γ          se          t       Pr > |t|
State 1    State 1     0.662       0.733      0.903    0.183
           State 3    -5.967       2.805     -2.127    0.017
           State 4   -15.000      23.142     -0.648    0.258
           State 5    -5.792       3.149     -1.839    0.033
State 2    State 1     0.975       0.586      1.663    0.048
           State 3    -0.219       0.890     -0.246    0.403
           State 4    -0.996       0.504     -1.978    0.024
           State 5    -0.439       0.570     -0.771    0.220
State 3    State 1     6.224       1.557      3.997    0.000
           State 3     5.085       1.435      3.543    0.000
           State 4     1.052       1.425      0.738    0.230
           State 5     0.013       1.738      0.007    0.497
State 4    State 1     1.669       2.321      0.719    0.236
           State 3    -3.215       0.906     -3.548    0.000
           State 4    -0.537       0.496     -1.083    0.139
           State 5    -3.178       0.805     -3.950    0.000
State 5    State 1     7.729       9.949      0.777    0.219
           State 3   -15.000   87628.245      0.000    0.500
           State 4    -1.335       0.955     -1.398    0.081
           State 5    -1.229       0.859     -1.432    0.076

4. DISCUSSION

In this article, we have described a new method, the Mixed-effects Hidden Markov Model (MHMM), and its application to the NLSY dataset. The MHMM allows us to analyze multiple outcomes stochastically and also reveals the impacts of covariates on the stochastic process. This extended HMM approach brings more insight into the connections between multiple child outcomes and related covariates, e.g. HOME scores and mothers' alcohol use. Parameters estimated through the MHMM showed that the HOME environment can influence a child's chance of staying in, or transitioning out of, a better hidden state. It


also affects the probabilities of observing better behavior and academic performance, given a child's state.

One advantage the MHMM possesses is its flexible framework, which accommodates different design matrices (possibly including random effects), link functions and HMM parameter sets within the same likelihood function. Thus, depending on their interest in a particular application, researchers can choose where the regression is turned on or off, which does not change the overall parameter estimation procedure.

Even though the MHMM can simultaneously regress all three sets of HMM parameters, doing so usually slows the computation significantly. Hence we recommend running the regression for only one parameter set, and also avoiding a mixed-effects regression on the conditional probabilities, which have the greatest dimension by themselves; adding extra dimensions for subjects and for random effects would increase the computation time exponentially.

Due to the large number of transition probabilities, having one coefficient for each transition probability increases the total number of estimated parameters. However, ways to reduce the number of parameters have to be carefully designed, as covariates often affect transitions from the same hidden state to other states in opposite directions. Thus, it would not be satisfactory to use the same parameter for all transitions from the same state, nor for all transitions to the same state.

A Appendix

A.1 Proof of Lemma 1.

(1) In the first row of (16), if we set Zα = 0, we have

∑_i ∑_z p(z | yi, θ(n)) ∫ log[ hα(Xαβα) |q ] fθ(n)(bi) dbi
= ∑_i ∑_z p(z | yi, θ(n)) log[ hα(Xαβα) |q ]
= ∑_i ∑_z p(z1 | yi, θ(n)) p(z2, . . . , zT | z1, yi, θ(n)) log(αiq)
= ∑_i ∑_{z1} p(z1 | yi, θ(n)) log(αiq) ∑_{z2,...,zT} p(z2, . . . , zT | z1, yi, θ(n))
= ∑_i ∑_q p(z1 = q | yi, θ(n)) log(αiq)
= ∑_i ∑_q α(n)iq log(αiq).

Hence, we have exactly the log likelihood function of multinomial responses (Section 5.4.1 in McCullagh and Nelder 1989).


(2) In the second row of (16), if we set Zτ = 0, we have

∑_i ∑_z p(z | yi, θ(n)) ∫ ∑_{t=2}^{T} log[ hτ(Xτβτ) |rs ] fθ(n)(bi) dbi
= ∑_i ∑_z p(z | yi, θ(n)) ∑_{t=2}^{T} log(τirs)
= ∑_i ∑_z α(n)iq ( ∏_{t=2}^{T} τ(n)irs ) ∑_{t=2}^{T} log(τirs)
= ∑_i ∑_{z2,...,zT} ( ∏_{t=2}^{T} τ(n)irs ) ∑_{t=2}^{T} log(τirs) ∑_{z1} α(n)iq
= ∑_i ∑_{z2,...,zT} ( ∏_{t=2}^{T} τ(n)irs ) ∑_{t=2}^{T} log(τirs)
= ∑_i [ ∑_{z2,...,zT} ( ∏_{t=2}^{T} τ(n)i,zt−1zt ) log(τi,z2z3) + . . . + ∑_{z2,...,zT} ( ∏_{t=2}^{T} τ(n)i,zt−1zt ) log(τi,zT−1zT) ]
= ∑_i [ ∑_{z2,z3} τ(n)i,z2z3 log(τi,z2z3) + . . . + ∑_{zT−1,zT} τ(n)i,zT−1zT log(τi,zT−1zT) ].

In the last step, because there is no difference among the zt's in the summations, we can collect them together while still keeping the indices r and s,

∑_i ∑_r ∑_s ∑_{t=2}^{T} τ(n)itrs log(τirs).

(3) In the third row of (16), if we set Zπ = 0 and denote fθ(yit, zt) = ∑_j ∑_k yijtk log(πisjk), we have

∑_i ∑_z p(z | yi, θ(n)) ∑_t fθ(yit, zt)
= ∑_i ∑_t ∑_z p(z | yi, θ(n)) fθ(yit, zt)
= ∑_i ∑_t ∑_{zt} fθ(yit, zt) ∑_{z1,...,zt−1,zt+1,...,zT} p(z | yi, θ(n))
= ∑_i ∑_t ∑_{zt} fθ(yit, zt) p(zt | yi, θ(n))
= ∑_i ∑_t ∑_s α(n)its ∑_j ∑_k yijtk log(πisjk).

A.2 Proof of Lemma 2.

Here we only prove the fourth part, since the first three parts follow naturally from the proof above by adding an extra dimension of l.


(4) Since we assume that bi is independent of the latent states, the fourth row of (16) simplifies to

∑_i ∫ log[ fθ(bi) ] fθ(n)(bi) dbi = ∑_i ∑_l log(wl) w(n)l.

A.3 Proof of Theorem 1.

(1) By definition of θ̂, we have Q(θ̂, θ) ≥ Q(θ, θ), i.e.

∑_z p(z | y, θ) log(p(y, z | θ̂)) ≥ ∑_z p(z | y, θ) log(p(y, z | θ)).

We also know that p(y, z | θ) = p(z | y, θ) p(y | θ), and hence

∑_z p(z | y, θ) log(p(z | y, θ̂)) + ∑_z p(z | y, θ) log(p(y | θ̂))
≥ ∑_z p(z | y, θ) log(p(z | y, θ)) + ∑_z p(z | y, θ) log(p(y | θ)).

Knowing that p(y | θ) does not depend on z, we can simplify the inequality above as

∑_z p(z | y, θ) log(p(z | y, θ̂)) + log(p(y | θ̂))
≥ ∑_z p(z | y, θ) log(p(z | y, θ)) + log(p(y | θ)),

or simply

∑_z p(z | y, θ) log(p(z | y, θ̂)) + L(θ̂)
≥ ∑_z p(z | y, θ) log(p(z | y, θ)) + L(θ).   (25)

Kullback-Leibler's inequality tells us that

∑_z p(z | y, θ) log(p(z | y, θ̂)) ≤ ∑_z p(z | y, θ) log(p(z | y, θ)),   (26)

and thus by (25) and (26), we conclude that

L(θ̂) = L(σ(θ)) ≥ L(θ).

References

[1] Albert, A., and Anderson, J. A. (1981), "Probit and Logistic Discriminant Functions," Communications in Statistics, Series A - Theory and Methods, 641-657.

[2] Altman, R. M. and Petkau, A. J. (2005), "Application of Hidden Markov Models to Multiple Sclerosis Lesion Count Data", Statistics in Medicine, 24(15), 2335-44.

[3] Aronson, M., Kyllerman, M., Sabel, K. G., Sandin, B., and Olegard, R. (1985), "Children of Alcoholic Mothers: Developmental, Perceptual and Behavioral Characteristics As Compared to Matched Controls", Acta Paediatrica Scandinavica, 74, 27-35.

[4] Achenbach, T. M. and Edelbrock, C. S. (1983), Manual for the Child Behavior Checklist and Revised Child Behavior Profile, Burlington, Vermont: University of Vermont, Department of Psychology.

[5] Bandeen-Roche, K., Miglioretti, D. L., Zeger, S. L. and Rathouz, P. J. (1997), "Latent Variable Regression for Multiple Discrete Outcomes", Journal of the American Statistical Association, 92, 1375-1386.

[6] Bock, R. D. (1989), "Measurement of Human Variation: a Two-Stage Model", in Multilevel Analysis of Educational Data, R. D. Bock (Ed.). San Diego, California: Academic Press.

[7] Bureau, A., Shiboski, S. and Hughes, J. P. (2003), "Applications of Continuous Time Hidden Markov Models to the Study of Misclassified Disease Outcomes", Statistics in Medicine, 22(3), 441-62.

[8] Caldwell, B. M. and Bradley, R. H. (1984), Home Observation for Measurement of the Environment. Little Rock: University of Arkansas at Little Rock, Center for Child Development and Education.

[9] Cowell, R. G., Dawid, A. P., Lauritzen, S. L., and Spiegelhalter, D. J. (1999), Probabilistic Networks and Expert Systems. New York: Springer.

[10] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), "Maximum Likelihood From Incomplete Data via the EM Algorithm", Journal of the Royal Statistical Society, Series B, 39, 1-38.

[11] Dunson, D. B. and Herring, A. H. (2005), "Bayesian Latent Variable Models for Mixed Discrete Outcomes", Biostatistics, 6, 11-25.

[12] Ezzet, F. and Whitehead, J. (1991), "A Random Effects Model for Ordinal Responses From a Crossover Trial," Statistics in Medicine, 10, 901-907.

[13] Fahrmeir, L., and Tutz, G. (1994), Multivariate Statistical Modeling Based on Generalized Linear Models. New York: Springer.

[14] Fischer, G. H. (1973), "The Linear Logistic Test Model As An Instrument in Educational Research", Acta Psychologica, 37, 359-374.

[15] Fitzmaurice, G. M., Laird, N. M., and Ware, J. H. (2004), Applied Longitudinal Analysis. Hoboken, NJ: Wiley.

[16] Fujinaga, K., Nakai, M., Shimodaira, H. and Sagayama, S. (2001), "Multiple-Regression Hidden Markov Model", Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, 1, 513-516.

[17] Goodman, L. A. (1974), "Exploratory Latent Structure Analysis Using Identifiable and Unidentifiable Models", Biometrika, 61, 215-231.

[18] Grant, B. F. (2000), "Estimates of US Children Exposed to Alcohol Abuse and Dependence in the Family", American Journal of Public Health, 90, 112-115.


[19] Gupta, M., Qu, P. and Ibrahim, J. G. (2007), "A Temporal Hidden Markov Regression Model for the Analysis of Gene Regulatory Networks", Biostatistics, 0, 1-16.

[20] Haberman, S. (1977), "Maximum Likelihood Estimates in Exponential Response Models", Annals of Statistics, 5, 815-841.

[21] Harville, D. A. and Mee, R. W. (1984), "A Mixed-Model Procedure for Analyzing Ordered Categorical Data", Biometrics, 40, 393-408.

[22] Hedeker, D. and Gibbons, R. D. (1994), "A Random Effects Ordinal Regression Model for Multilevel Analysis", Biometrics, 50, 933-944.

[23] Hedeker, D. (2003), "A Mixed-effects Multinomial Logistic Regression Model", Statistics in Medicine, 22, 1433-1446.

[24] Hinde, J. P. (1982), "Compound Regression Models", in: Gilchrist, R., Ed. GLIM 82: International Conference for Generalized Linear Models. New York: Springer, 109-121.

[25] Huang, G. H. and Bandeen-Roche, K. (2004), "Building an Identifiable Latent Variable Model with Covariate Effects on Underlying and Measured Variables", Psychometrika, 69, 5-32.

[26] Jackson, C. H. and Sharples, L. D. (2002), "Hidden Markov Models for the Onset and Progression of Bronchiolitis Obliterans Syndrome in Lung Transplant Recipients", Statistics in Medicine, 21(1), 113-28.

[27] Jansen, J. (1990), "On the Statistical Analysis of Ordinal Data When Extravariation is Present", Applied Statistics, 39, 74-85.

[28] Jensen, F. V., Lauritzen, S. L., and Olesen, K. G. (1990), "Bayesian Updating in Causal Probabilistic Networks by Local Computation", Computational Statistics Quarterly, 4, 269-282.

[29] Jones, A. S., Miller, D. J. and Salkever, D. S. (1999), "Parental Use of Alcohol and Children's Behavioural Health: a Household Production Analysis", Health Economics, 8, 661-683.

[30] Jones, A. S., "Maternal Alcohol Abuse/Dependence, Children's Behavior Problems and Home Environment: Estimates From the NLSY Using Propensity Score Matching", Journal of Studies on Alcohol, (in press).

[31] Lauritzen, S. L. (1995), "The EM Algorithm for Graphical Association Models With Missing Data", Computational Statistics and Data Analysis, 19, 191-201.

[32] Lesaffre, E. and Kaufmann, H. (1992), "Existence and Uniqueness of the Maximum Likelihood Estimator for a Multivariate Probit Model", Journal of the American Statistical Association, 87(419), 805-811.

[33] Le Strat, Y. and Carrat, F. (1999), "Monitoring Epidemiologic Surveillance Data Using Hidden Markov Models", Statistics in Medicine, 18(24), 3463-78.

[34] Lord, F. M., and Novick, M. R. (1968), Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley.

[35] MacDonald, I. L. and Zucchini, W. (1997), Hidden Markov and Other Models for Discrete-Valued Time Series. Boca Raton, FL: Chapman and Hall.


[36] MacKay, R. J. (2002), "Estimating the Order of a Hidden Markov Model", The Canadian Journal of Statistics, 30(4), 573-589.

[37] Mattson, S. N. and Riley, E. P. (2000), "Parent Ratings of Behavior in Children With Heavy Prenatal Alcohol Exposure and IQ-Matched Controls", Alcoholism, Clinical and Experimental Research, 24, 226-231.

[38] McCarty, C. A., Zimmerman, F. J., DiGiuseppe, D. L. and Christakis, D. L. (2005), "Parental Emotional Support and Subsequent Internalizing and Externalizing Problems Among Children", Journal of Developmental and Behavioral Pediatrics, 26, 267-275.

[39] McCulloch, C. E. (1997), "Maximum Likelihood Algorithms for Generalized Linear Mixed Models", Journal of the American Statistical Association, 92, 162-70.

[40] McLachlan, G. and Peel, D. (2000), Finite Mixture Models. New York: Wiley.

[41] Murphy, K. (2001), "The Bayes Net Toolbox for Matlab", Computing Sciences and Statistics: Proceedings of the 33rd Symposium on the Interface.

[42] Rabiner, L. R. (1989), "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, 77, 257-286.

[43] Rijmen, F., Vansteelandt, K. and De Boeck, P. (2006), "Latent Class Models for Diary Method Data: Parameter Estimation by Local Computations", Psychometrika, accepted.

[44] Scott, S. L. (2002), "Bayesian Methods for Hidden Markov Models: Recursive Computing in the 21st Century", Journal of the American Statistical Association, 97, 337-351.

[45] Scott, S. and Ip, E. (2002), "Empirical Bayes and Item Clustering Effects in a Latent Variable Hierarchical Model: a Case Study From the National Assessment of Educational Progress", Journal of the American Statistical Association, 97, 409-419.

[46] Scott, S. L., James, G. M., and Sugar, C. A. (2005), "Hidden Markov Models for Longitudinal Comparisons", Journal of the American Statistical Association, 100, 359-369.

[47] Schuckit, M., and Chiles, J. (1978), "Family History As a Diagnostic Aid in Two Samples of Adolescents", Journal of Nervous and Mental Disease, 166, 165-176.

[48] Schwarz, G. (1978), "Estimating the Dimension of a Model", Annals of Statistics, 6, 461-464.

[49] Smith, T. and Vounatsou, P. (2003), "Estimation of Infection and Recovery Rates for Highly Polymorphic Parasites When Detectability is Imperfect, Using Hidden Markov Models", Statistics in Medicine, 22(10), 1709-24.

[50] Tutz, G. and Hennevogl, W. (1996), "Random Effects in Ordinal Regression Models", Computational Statistics and Data Analysis, 22, 537-57.

[51] Viterbi, A. J. (1967), "Error Bounds for Convolutional Codes and an Asymptotically Optimal Decoding Algorithm", IEEE Transactions on Information Theory, 13, 260-269.


[52] Wall, T. L., Garcia-Andrade, C., Wong, V., Lau, P. and Ehlers, C. L. (2000), "Parental History of Alcoholism and Problem Behaviors in Native-American Children and Adolescents", Alcoholism, Clinical and Experimental Research, 24, 30-34.

[53] Wedderburn, R. W. M. (1976), "On the Existence and Uniqueness of the Maximum Likelihood Estimates for Certain Generalized Linear Models", Biometrika, 63, 27-32.

[54] Wolfinger, R. and O'Connell, M. (1993), "Generalized Linear Mixed Models: a Pseudo-Likelihood Approach", Journal of Statistical Computation and Simulation, 48, 233-43.

[55] Wu, C. F. J. (1983), "On the Convergence Properties of the EM Algorithm", Annals of Statistics, 11(1), 95-103.
