
Bayesian Networks for Dynamic Classification

Shengtong Zhong a, Helge Langseth a, Thomas D. Nielsen b

a Department of Computer and Information Science, The Norwegian University of Science and Technology, Trondheim, Norway

b Department of Computer Science, Aalborg University, Aalborg, Denmark

Email addresses: [email protected] (Shengtong Zhong), [email protected] (Helge Langseth), [email protected] (Thomas D. Nielsen).

Preprint submitted to Reliability Engineering & System Safety, 28 May 2012.

Abstract

Monitoring a complex process often involves keeping an eye on hundreds or thousands of sensors to determine whether or not the process is stable. We have been working with dynamic data from an oil production facility in the North Sea, where unstable situations should be identified as soon as possible. Motivated by this problem setting, we propose a generative model for dynamic classification in continuous domains. At each time point the model can be seen as combining a naïve Bayes model with a mixture of factor analyzers. The latent variables of the factor analyzers are used to capture the state-specific dynamics of the process as well as to model dependencies between attributes. Exact inference in the proposed model is intractable and computationally expensive, so in this paper we focus on an approximate inference scheme.

1 Introduction

When doing classification, the task is to assign a class label to an object. Objects are characterized by a collection Y = {Y_1, Y_2, ..., Y_n} of random variables, and a specific object is defined by a value assignment y = {y_1, y_2, ..., y_n} to these variables. The set of possible labels (or classes) for an object is represented by a class variable C, whose state space is denoted sp(C). We will focus on real-valued attributes in this paper, meaning that y ∈ R^n.

In a probabilistic framework, it is well known that the optimal classifier labels an object y with the class label c^*, where

c^* = \arg\min_{c \in sp(C)} \sum_{c' \in sp(C)} L(c, c')\, P(c' \mid y),

and L(c, c') is the loss function encoding the cost of misclassification. Learning a classifier therefore amounts to estimating the probability distribution P(C = c | y). In order to reduce complexity, certain independence assumptions are often introduced. For example, in the naïve Bayes (NB) classifier all attributes are assumed to be independent given the class. Although the NB model obtains competitive results in some application domains, it can also perform poorly in many other domains. This has motivated the development of a variety of improvements to overcome the often overly simplistic independence assumption. Among these approaches, the tree-augmented naïve Bayes (TAN) [7] and the aggregating one-dependence estimator (AODE) [24] relax the independence assumptions by allowing a more general correlation structure in the model. Alternatively, the latent classification model (LCM) [16] encodes conditional dependencies among the attributes using latent variables.
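As an illustration of the decision rule above, the following minimal sketch (in Python, with NumPy) computes c^* by minimizing the expected loss over a hypothetical loss matrix and posterior; the numbers are made up and are not tied to any particular classifier.

```python
import numpy as np

def bayes_optimal_class(loss, posterior):
    """Return argmin_c sum_{c'} L(c, c') P(c' | y).

    loss[c, c2] is the cost of predicting c when the true class is c2;
    posterior[c2] is P(C = c2 | y).
    """
    expected_loss = loss @ posterior          # expected loss of each candidate label c
    return int(np.argmin(expected_loss))

# Hypothetical three-class example with a 0/1 loss:
loss = np.ones((3, 3)) - np.eye(3)            # L(c, c') = 1 if c != c', else 0
posterior = np.array([0.2, 0.5, 0.3])         # P(C | y) from some classifier
print(bayes_optimal_class(loss, posterior))   # -> 1, the MAP class under 0/1 loss
```

Under a 0/1 loss the rule reduces to picking the most probable class.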

Common to all these classification techniques is the assumption that the different objects are independently and identically distributed. This assumption is violated in dynamic domains, where a temporal relation in the data can introduce correlations between the different observations. To exemplify, let us consider a system that monitors a complex process. Overseeing such processes will often mean keeping an eye on hundreds or thousands of sensors, each of them updating their readings several times per second. Real-life processes have their own natural dynamics when everything is running according to plan; "outliers" may be seen as indications that the process is leaving its stable state, thereby becoming more dangerous. We want the monitoring system to determine whether or not the process is under control at all times (that is, to detect unstable behavior) in order to be able to shut down an unsafe process, or even better, to detect that the system is about to become unstable (that is, to predict future problems). The latter would give the system operators the chance to implement countermeasures before the system has exposed anyone to an increased risk. Classifiers that fail to take the dynamic aspect of the process into account will not be able to make accurate predictions, and will therefore not be able to recognize a problem under development.

The framework of dynamic Bayesian networks [11] supports the specification of dynamic processes. A simple instance of this framework is the hidden Markov model (HMM), which has previously been considered for classification purposes [13,22]; to this end the "hidden" node in the HMM is used as the classification node, and the attributes at time t are assumed to be independent of those at time t+1 given the class label at either of the two points in time. Further simplification can be obtained by assuming that all attributes at one time step are conditionally independent given the class label at that time step; the resulting model, proposed by [20], is known as a dynamic naïve Bayes (dNB) classifier. The dNB models can be effectively learned from data due to the relatively small number of parameters required to specify them. However, the dNB has limitations in expressive power similar to those of its static counterpart, which can lead to reduced classification accuracy when the independence assumptions are violated.

To the best of our knowledge, there has been no systematic investigation into the properties of probabilistic classifiers and their applicability to real-life dynamic data. In this paper we take a step in that direction by examining the underlying assumptions of some well-known probabilistic classifiers and their natural extensions to dynamic domains. We do so by carefully linking our analysis back to a real-life dataset, and the result of this analysis is a classification model which can be used, e.g., to help prevent unwanted events by automatically analyzing a data stream and raising an alert if the process is entering an unstable state.

This paper is organized as follows: In Section 2 we describe the target domain in some detail. This is followed by Section 3, where we give an overview of the dynamic classification scheme in general and propose a specific classification model, the dynamic latent classification model (dLCM). Next, we look at inference and learning in dLCMs (Section 4), before reporting on their classification accuracy in Section 5. Finally, in Section 6 we conclude and give directions for future research.

2 The domain and the dataset

The dataset we consider is from an oil production installation in the North Sea. Data, consisting of some sixty variables, is captured every five seconds. The data is monitored in real time by experienced engineers, who have a number of tasks to perform, ranging from understanding the situation on the platform (activity recognition), via avoiding a number of either dangerous or costly situations (event detection), to optimization of the drilling operation. The variables that are collected cover both topside measurements (like flow rates) and down-hole measurements (like, for instance, gamma rate). The goal of this research is to build a dynamic classification system that can help the drilling engineers by automatically analyzing the data stream and classifying the situation accordingly. To do so, we will first investigate a number of different dynamic classification frameworks, and thereafter propose a new model for dynamic classification in general. However, to keep the discussion concrete, we will tie the development to the task of activity recognition. The drilling of a well is a complex process, which consists of activities that are performed iteratively as the length of the well increases, and knowing which activity is performed at any time is important in several contexts: Firstly, some undesired events can only happen during specific activities, and knowing the current activity is therefore of high importance to the drilling engineer from a safety perspective; Secondly, operators are consistently looking for more cost-efficient ways of drilling, and the sequencing of activities during an operation is important for hunting down potential time-sinks. Activity recognition is a task that also finds applications in areas as diverse as healthcare [23] and video analysis [17]. In the Wellsite Information Transfer Specification (WITS), a total of 34 different activities with associated activity codes are defined. Each activity has its separate purpose and consists of a set of actions. Out of the 34 drilling activities, only a handful are really important to recognize. The important activities, which roughly correspond to those that constitute most of the total well drilling time, are described next:

2 – Drilling: The activity when the well is gaining depth by crushing rock at the bottom of the hole and removing the crushed pieces (cuttings) out of the well-bore. Thus, the drill string is rotating during this activity, and mud is circulated at low speed to transport out the cuttings. The activity is interrupted by other activities, but continues until the well reaches the reservoir and production of oil may commence.

3 – Connection: This activity involves changing the length of the drill-string, by either adding or removing pieces of drill-pipe.

4 – Reaming: Reaming means widening a section of the well hole after it has been drilled the first time.

7 – Circulation: Circulation resembles drilling to the extent that the mud is flowing and the drill string is rotating at a low speed. However, new cuttings are not produced. As for reaming, the point of this activity is to clean a recently drilled section of the well, but the radius of the well hole is not widened as much as during reaming, and circulation can therefore be seen as a less severe cleaning activity.

8 – Tripping in: This is the act of running the drill string into the well hole.

9 – Tripping out: Tripping out means pulling the drill string out of the well bore.

Additionally, the activity 0 – Other is used to describe events that do not fit any of the other activity definitions, making sure that exactly one activity is performed at all times.

Out of the approximately 60 attributes that are collected, domain experts have selected the following as most important for activity detection: Depth Bit Measured, Depth Hole Measured, Block Position, Hookload, Weight On Bit, Revolutions Per Minute, Torque, Mud Flow In, and Standpipe Pressure. These variables are also the ones we will use in the following.


3 From Static to Dynamic Bayesian Classifiers

In this section we develop a general framework for performing dynamic classification. The framework will be specified incrementally by examining its expressivity relative to the oil production data.

3.1 Static classifiers

Standard (static) classifiers like NB [5] or TAN [7] assume that the class variable and attributes at different time points are independent given the model M, i.e., {C^{t_1}, Y^{t_1}} ⊥⊥ {C^{t_2}, Y^{t_2}} | M for all t_1 ≠ t_2. This independence assumption is clearly violated in many domains and, in particular, in domains that specify a process evolving over time. To validate the independence assumptions in practice, we can for instance compare the marginal distribution of the class variable with the conditional distribution of the class variable given its value at the previous time step. The results for the oil production data are shown in Table 1, where we see a considerable correlation between the class variables of consecutive time slices. In particular, if the system was in the drilling activity at time t − 1, the probability of being in the drilling activity also at time t changes from 0.325 (static classifier) to 0.997 (dynamic classifier). The reason for this dramatic difference is that the system tends to remain in the drilling activity once drilling has started, an effect that the static classifier is unable to represent.

Table 1
Comparison between P(c^t) and P(c^t | c^{t-1}). The class "¬drilling" covers all activities that do not belong to drilling.

P(C^t = drilling)                          3.25 · 10^{-1}
P(C^t = drilling | C^{t-1} = ¬drilling)    2.90 · 10^{-3}
P(C^t = drilling | C^{t-1} = drilling)     9.97 · 10^{-1}
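The probabilities compared in Table 1 can be estimated from a labeled activity sequence by simple frequency counting. The sketch below is a hypothetical illustration (toy labels, made-up activity code for drilling) of how P(C^t = drilling), P(C^t = drilling | C^{t-1} = drilling) and P(C^t = drilling | C^{t-1} = ¬drilling) can be computed; it is not the authors' code.

```python
import numpy as np

def drilling_transition_stats(labels, drilling=2):
    """Estimate the three probabilities of Table 1 from an activity-code sequence."""
    labels = np.asarray(labels)
    prev, curr = labels[:-1], labels[1:]
    p_marginal = np.mean(labels == drilling)              # P(C^t = drilling)
    p_stay = np.mean(curr[prev == drilling] == drilling)  # P(drilling | drilling)
    p_enter = np.mean(curr[prev != drilling] == drilling) # P(drilling | not drilling)
    return p_marginal, p_stay, p_enter

labels = [2, 2, 2, 3, 3, 2, 2, 2, 2, 7, 2, 2]   # toy sequence of activity codes
print(drilling_transition_stats(labels))
```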

One way to capture this dependence is to explicitly take the dynamics of the process into account, i.e., to look at the class of dynamic classifiers.

3.2 A simple dynamic classifier

The temporal dynamics of the class variable can be described using, e.g., a first-order Markov chain, where P(c^t | c^{1:t-1}) = P(c^t | c^{t-1}) for all t. By combining this temporal model with the class-conditional observation model for the attributes we have the well-known hidden Markov model. This model is described by a prior distribution over the class variable, P(c^0), a conditional observation distribution P(y^t | c^t), and transition probabilities for the class variable, P(c^t | c^{t-1}); we assume that the model is stationary, i.e., P(y^t | c^t) = P(y^s | c^s) and P(c^t | c^{t-1}) = P(c^s | c^{s-1}) for all s, t ≥ 1.

Fig. 1. Attributes are assumed to be conditionally independent given the class variable (the structure is equivalent to that of an HMM); shown with n = 4.

With a continuous observation vector, the conditional distribution may be specified by a class-conditional multivariate Gaussian distribution with mean μ_c and covariance matrix Σ_c, i.e., Y | {C = c} ∼ N(μ_c, Σ_c) [9]. Unfortunately, learning a full covariance matrix involves estimating a number of parameters that is quadratic in the number of attributes, which may result in over-fitting when data is scarce compared with the number of free parameters. One approach to alleviate this problem is to introduce additional independence assumptions about the domain being modeled. Specifically, by assuming that all variables are independent given the class variable, we will at each time step have an NB model defined by a diagonal covariance matrix, thus requiring only O(n) parameters to be learned, where n = |Y| is the number of attributes in the model. This structure corresponds to the dNB model for discrete domains. A graphical representation of the resulting independence assumptions can be seen in Fig. 1 in the form of a two time-slice DBN [19].
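To make the diagonal-covariance observation model concrete, the sketch below evaluates the class-conditional log-density log N(y; μ_c, diag(σ_c²)) that a dNB-style classifier uses at a single time slice; all parameter values are hypothetical.

```python
import numpy as np

def diag_gaussian_logpdf(y, mu, var):
    """log N(y; mu, diag(var)) for one time slice; y, mu, var are length-n arrays."""
    y, mu, var = map(np.asarray, (y, mu, var))
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (y - mu) ** 2 / var)

# Hypothetical parameters for one class and one observation vector (n = 3):
mu_c = np.array([0.0, 1.0, -2.0])
var_c = np.array([1.0, 0.5, 2.0])    # diagonal of Sigma_c
y_t = np.array([0.1, 0.8, -1.5])
print(diag_gaussian_logpdf(y_t, mu_c, var_c))
```

Only the n means and n variances per class need to be estimated, which is the O(n) parameter count mentioned above.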

As for the (static) NB model, the independence assumptions encoded in the dNB model are often violated in real-world settings. For example, if we consider the measured flow of drilling fluid going into the well (Mud Flow In) and the observed pressure in the well (Stand Pipe Pressure), and plot their values conditioned on the class variable (activities tripping in and tripping out), it is evident that there is a conditional correlation between the two attributes given the class, see Fig. 2; similar results are also obtained when considering other pairs of attributes.

3.3 Modeling dependence between attributes

There are several approaches to modeling attribute dependence. For example, Friedman et al. [8] propose an extension of the TAN model [7] to facilitate continuous domains. In the TAN framework, each attribute is allowed to have at most one parent besides the class variable. As an alternative, [16] present the latent classification model (LCM), which can be seen as combining the NB model with a factor analysis model [6].

Fig. 2. Scatter plot of the Mud Flow In (x-axis) and the Stand Pipe Pressure (y-axis) for the two classes in the oil production data (black "+" is the tripping in class and grey "o" is the tripping out class). The conditional correlation between the two attributes is evident.

An LCM offers a natural extension of the NB model by introducing latent variables Z = (Z_1, ..., Z_k) as children of the class variable C and parents of all the attributes Y = (Y_1, ..., Y_n). The latent variables and the attributes work as a factor analyzer focusing on modeling the correlation structure among the attributes.

Following the approach by [16], we introduce latent variables to encode conditional dependencies among the attributes. Specifically, for each time step t we have a vector Z^t = (Z^t_1, ..., Z^t_k) of latent variables that appear as children of the class variable and parents of all the attributes (see Fig. 3). The latent variable Z^t is assigned a multivariate Gaussian distribution conditional on the class variable, and the attributes Y^t are also assumed to follow a multivariate Gaussian distribution conditional on the latent variables:

Z^t \mid \{C^t = c^t\} \sim N(\mu_{c^t}, \Sigma_{c^t}),
Y^t \mid \{Z^t = z^t\} \sim N(L z^t, \Theta),

where Σ_{c^t} and Θ are diagonal and L is the matrix mapping the latent space to the attribute space; note that a stationarity assumption is again encoded in the model.

In this model, the latent variables capture the dependencies between the attributes. They are conditionally independent given the class but marginally dependent. Furthermore, the same mapping, L, from the latent space to the attribute space is used for all classes, and hence, the relation between the class and the attributes is conveyed by the latent variables only.
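The role of the latent variables can also be seen by integrating Z out of a single time slice: since Z | C = c is Gaussian and Y | Z is a linear-Gaussian model, the class-conditional attribute covariance becomes Cov(Y | C = c) = L Σ_c L' + Θ, which is generally a full matrix even though Σ_c and Θ are diagonal. The numeric sketch below uses hypothetical parameter values.

```python
import numpy as np

rng = np.random.default_rng(0)

k, n = 2, 4                              # latent dimension and number of attributes
L = rng.normal(size=(n, k))              # shared latent-to-attribute mapping (hypothetical)
Sigma_c = np.diag([1.0, 0.5])            # diagonal latent covariance for one class
Theta = np.diag([0.1, 0.1, 0.2, 0.2])    # diagonal observation noise

# Marginal attribute covariance for class c after integrating out Z:
cov_y_given_c = L @ Sigma_c @ L.T + Theta
print(np.round(cov_y_given_c, 3))        # off-diagonal entries are generally non-zero
```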

Fig. 3. In each time step, the conditional dependencies between the attributes are encoded by the latent variables (Z^t_1, ..., Z^t_k); shown with k = 2, n = 4.

The model in Fig. 3 assumes that the attributes in different time slices are independent given the class variable. This assumption implies that the temporal dynamics are captured at the class level only. When the state specification of the class variable is coarse, this assumption will rarely hold.¹ For the oil production data the assumption does not hold, as can be seen in Fig. 4, which shows the conditional correlation of the Stand Pipe Pressure attribute in successive time slices for both the tripping in and the tripping out activities.

¹ Obviously, the finer the granularity of the state specification of the class variable, the more accurate this assumption will be.

Fig. 4. Scatter plot of the Stand Pipe Pressure at time t (x-axis) against its value at time t+1 (y-axis) for the two classes in the oil production data (black "+" is the tripping in class and grey "o" is the tripping out class). The conditional dynamic correlation of this attribute over time is evident.

We propose to address this apparent shortcoming by modeling the dynamics of the system at the level of the latent variables, which semantically can be seen as a compact representation of the "true" state of the system. Specifically, we encode the state-specific dynamics by assuming that the latent variable vector Z^t follows a linear multivariate Gaussian distribution conditioned on Z^{t-1}:

Z^t \mid \{Z^{t-1} = z^{t-1}, C^t = c^t\} \sim N(A_{c^t} z^{t-1}, \Sigma_{c^t}),

where A_{c^t} specifies the transition dynamics of the latent variables. A graphical representation of the model is given in Fig. 5; we will refer to it as a dynamic latent classification model (dLCM).
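A minimal sketch of these state-specific latent dynamics is given below, assuming a fixed class label c throughout and hypothetical parameter values: the latent vector is propagated as z^t ~ N(A_c z^{t-1}, Σ_c).

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_latent_chain(A_c, Sigma_c, z0, T):
    """Draw z^1, ..., z^T with z^t | z^{t-1} ~ N(A_c z^{t-1}, Sigma_c) for a fixed class c."""
    zs = np.empty((T, len(z0)))
    z_prev = np.asarray(z0, dtype=float)
    for t in range(T):
        z_prev = rng.multivariate_normal(A_c @ z_prev, Sigma_c)
        zs[t] = z_prev
    return zs

A_c = np.array([[0.9, 0.1], [0.0, 0.8]])    # hypothetical transition dynamics
Sigma_c = np.diag([0.05, 0.05])             # diagonal latent noise covariance
print(simulate_latent_chain(A_c, Sigma_c, z0=[1.0, -1.0], T=5))
```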

Fig. 5. The state-specific dynamics are encoded at the level of the latent variables; shown with k = 2, n = 4.

3.4 Modeling non-linear systems

One of the main assumptions in the model above is that there is a linear mapping from the latent variables to the attribute space as well as a linear mapping from the latent variables in a given time slice to the latent variables in the succeeding time slice (i.e., that the state-specific dynamics are linear). When the class variable is fixed to a single value, the model is equivalent to a stochastic linear dynamical system (LDS) [1], also known as a state-space model.

Given a latent continuous space of sufficient dimension, an LDS can represent complex real-world processes, but the computational cost makes this infeasible in practice. In order to reduce the computational cost while maintaining the representational power, we introduce a discrete mixture variable M for each time slice (Fig. 6), as done by [16] for static domains. Similar ideas in dynamic domains can be found in [2,12], where a probabilistic model called the switching state-space model is proposed by combining discrete and continuous dynamics; the representational power and computational efficiency are well demonstrated in their work. When the value of the class variable is fixed, the model proposed in this paper differs from theirs, because there are no dynamics on the latent mixture node, while the discrete class variable carries the dynamics over time. Switching state-space models focus on modeling a non-linear real-world process with one single system state (class), whereas our model tries to capture the non-linearity of multiple system states (classes) at the same time.

Fig. 6. A mixture variable M is introduced at each time slice to extend the model; shown with k = 2, n = 4.

We call the resulting model the dynamic latent classification model (dLCM); each time slice can now be seen as combining a naïve Bayes model with a mixture of factor analyzers. In this case, the mixture variable follows a multinomial distribution conditioned on the class variable, and the attributes Y^t follow a multivariate Gaussian distribution conditioned on the latent variables and the discrete mixture variable, i.e.,

M^t \mid \{C^t = c^t\} \sim P(M^t \mid C^t = c^t),
Y^t \mid \{Z^t = z^t, M^t = m^t\} \sim N(L_{m^t} z^t, \Theta_{m^t}),

where 1 ≤ m^t ≤ |sp(M)|, P(M^t = m^t | C^t = c^t) ≥ 0, and \sum_{m^t=1}^{|sp(M)|} P(M^t = m^t \mid C^t = c^t) = 1 for all 1 ≤ c^t ≤ |sp(C)|.
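Putting the pieces together, the sketch below samples one time slice of the dLCM generative process given the previous class label and latent vector. All parameter values and dimensions are hypothetical; J and K denote the class-transition and mixture-selection distributions, and A, Σ, L, Θ follow the definitions above (the offset Φ used later in the learning section is omitted here, as in the equations of this section).

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_slice(c_prev, z_prev, J, K, A, Sigma, L, Theta):
    """Sample (c^t, m^t, z^t, y^t) given (c^{t-1}, z^{t-1}) under the dLCM."""
    c = rng.choice(len(J), p=J[c_prev])                   # c^t ~ P(C^t | c^{t-1})
    m = rng.choice(K.shape[1], p=K[c])                    # m^t ~ P(M^t | c^t)
    z = rng.multivariate_normal(A[c] @ z_prev, Sigma[c])  # z^t ~ N(A_c z^{t-1}, Sigma_c)
    y = rng.multivariate_normal(L[m] @ z, Theta[m])       # y^t ~ N(L_m z^t, Theta_m)
    return c, m, z, y

# Hypothetical model with |sp(C)| = 2, |sp(M)| = 2, k = 2, n = 3:
J = np.array([[0.99, 0.01], [0.02, 0.98]])
K = np.array([[0.7, 0.3], [0.4, 0.6]])
A = np.stack([np.eye(2) * 0.9, np.eye(2) * 0.8])
Sigma = np.stack([np.eye(2) * 0.1, np.eye(2) * 0.2])
L = rng.normal(size=(2, 3, 2))
Theta = np.stack([np.eye(3) * 0.05, np.eye(3) * 0.1])
print(sample_slice(0, np.zeros(2), J, K, A, Sigma, L, Theta))
```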

4 Learning and inference

In what follows we discuss algorithms for performing inference and learning in the proposed models.

4.1 Learning

Learning the dLCM involves estimating the parameters of the model, the number of latent variables, and the number of mixture components.

With the number of latent variables and the number of mixture components specified, parameter learning becomes simple: since the class variables are always observed during learning, the resulting model is similar to a linear state-space model, for which existing EM algorithms [4] can be applied.

Following the approach from [16], we determine the number of latent variables and the number of mixture components by performing a systematic search over a selected subset of candidate structures (note that a model structure is completely specified by its number of latent variables and its number of mixture components). For each candidate structure we learn the model parameters by applying the EM algorithm, and the quality of the resulting model is then evaluated using the wrapper approach [15]. Finally, we select the model with the highest score.
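As a sketch of this structure search, the loop below scores each candidate (number of latent variables, number of mixture components) by training with EM and measuring validation accuracy, in the spirit of the wrapper approach. The functions train_dlcm_em and classify are hypothetical placeholders for the learning and inference routines described in this section; they are passed in as arguments rather than defined here.

```python
def select_dlcm_structure(train_data, train_labels, valid_data, valid_labels,
                          latent_dims, mixture_sizes, train_dlcm_em, classify):
    """Wrapper-style search over candidate (k, |sp(M)|) structures."""
    best_model, best_score = None, -1.0
    for k in latent_dims:                  # candidate numbers of latent variables
        for m in mixture_sizes:            # candidate numbers of mixture components
            model = train_dlcm_em(train_data, train_labels, n_latent=k, n_mixture=m)
            predictions = classify(model, valid_data)
            correct = sum(p == y for p, y in zip(predictions, valid_labels))
            score = correct / len(valid_labels)
            if score > best_score:
                best_model, best_score = model, score
    return best_model, best_score
```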

In the following, we present the parameter learning algorithm for the dLCM in the form of an EM algorithm, which is a straightforward adaptation of the original EM formulation.

4.1.1 The E-step

The E-step of the EM algorithm computes the expected log-likelihood of the completed observations, given the data:

Q = E\Big[ \log \prod_{t=1}^{T} f(\cdot^t) \,\Big|\, D_{1:T} \Big] = \sum_{t=1}^{T} \big\{ E[\log f(\cdot^t) \mid D_{1:T}] \big\}
  = \sum_{t=1}^{T} \big\{ E[\log P(c^t \mid c^{t-1}) \mid D_{1:T}] \big\} + \sum_{t=1}^{T} \big\{ E[\log P(M^t \mid c^t) \mid D_{1:T}] \big\}
  + \sum_{t=1}^{T} \big\{ E[\log p(Z^t \mid Z^{t-1}, c^t) \mid D_{1:T}] \big\} + \sum_{t=1}^{T} \big\{ E[\log p(y^t \mid Z^t, M^t) \mid D_{1:T}] \big\}.

Therefore, the expected log-likelihood depends on these four expectations:

• \sum_{t=1}^{T} E[\log P(c^t \mid c^{t-1}) \mid D_{1:T}],
• \sum_{t=1}^{T} E[\log P(M^t \mid c^t) \mid D_{1:T}],
• \sum_{t=1}^{T} E[\log p(Z^t \mid Z^{t-1}, c^t) \mid D_{1:T}], and
• \sum_{t=1}^{T} E[\log p(y^t \mid Z^t, M^t) \mid D_{1:T}].

These expectations can be directly calculated (see Appendix A for more details).

4.1.2 The M-step

The parameters of the dLCM are A, L, Σ, Θ, Φ, J, K. At each iteration, these parameters can be obtained by setting the corresponding partial derivative of the expected log-likelihood to zero. The results are as follows (see Appendix A for more details):

L, the linear mapping from the latent space to the attribute space: L_{i,m^t} denotes the i'th row of this matrix when the mixture node is m^t.

L_{i,m^t} = \Big( \sum_{t=1}^{T} \big\{ P(M^t = m^t \mid D_{1:T})\, E[Z^t (Z^t)' \mid M^t = m^t, D_{1:T}] \big\} \Big)^{-1} \cdot \sum_{t=1}^{T} \big\{ (y_i^t - \Phi_{i,m^t})\, P(M^t = m^t \mid D_{1:T})\, E[Z^t \mid M^t = m^t, D_{1:T}] \big\}

Φ, the offset from the latent space to the attribute space: Φ_{i,m^t} denotes the i'th row of this matrix when the mixture node is m^t.

\Phi_{i,m^t} = \frac{1}{\sum_{t=1}^{T} \big\{ P(M^t = m^t \mid D_{1:T}) \big\}} \Big( \sum_{t=1}^{T} \big\{ P(M^t = m^t \mid D_{1:T})\, y_i^t \big\} - \sum_{t=1}^{T} \big\{ P(M^t = m^t \mid D_{1:T})\, L'_{i,m^t} E[Z^t \mid M^t = m^t, D_{1:T}] \big\} \Big)

Θ, the covariance matrices of the attribute space: Θ_{i,m^t} denotes the i'th element on the diagonal of the matrix when the mixture node is m^t.

\Theta_{i,m^t} = \frac{1}{\sum_{t=1}^{T} \big\{ P(M^t = m^t \mid D_{1:T}) \big\}} \sum_{t=1}^{T} \Big\{ (y_i^t - \Phi_{i,m^t})^2 P(M^t = m^t \mid D_{1:T}) - 2 (y_i^t - \Phi_{i,m^t}) P(M^t = m^t \mid D_{1:T})\, L'_{i,m^t} E[Z^t \mid M^t = m^t, D_{1:T}] + P(M^t = m^t \mid D_{1:T})\, L'_{i,m^t} E[Z^t (Z^t)' \mid M^t = m^t, D_{1:T}]\, L_{i,m^t} \Big\}

Σ, the covariance matrices of the latent space: Σ_c denotes the diagonal matrix when the class variable is c, and α_c denotes the number of time slices t with c^t = c.

\Sigma_c = \frac{1}{\alpha_c} \sum_{t: c^t = c} \Big\{ E[Z^t (Z^t)' \mid D_{1:T}] - 2 A_c E[Z^{t-1} (Z^t)' \mid D_{1:T}] + A_c E[Z^{t-1} (Z^{t-1})' \mid D_{1:T}]\, A'_c \Big\}

A, the linear dynamics within the latent space from one time slice to the next: A_c denotes the matrix when the class variable is c.

A_c = \sum_{t: c^t = c} \big\{ E[Z^t (Z^{t-1})' \mid D_{1:T}] \big\} \cdot \Big( \sum_{t: c^t = c} \big\{ E[Z^{t-1} (Z^{t-1})' \mid D_{1:T}] \big\} \Big)^{-1}

J, the transition matrix of the class variable, is obtained directly by frequency estimation of P(c^t | c^{t-1}).

K, the transition matrix of the mixture node, is computed based on P(m^t | c^t = c, D_{1:T}):

P(m^t \mid c^t = c, D_{1:T}) = \frac{\sum_{t: c^t = c} \big\{ P(M^t = m^t \mid D_{1:T}) \big\}}{\sum_{t=1}^{T} \big\{ P(M^t = m^t \mid D_{1:T}) \big\}}

4.2 Inference

From the dLCM in Fig. 6, an equivalent probabilistic model is

p(y^{1:T}, z^{1:T}, m^{1:T}, c^{1:T}) = p(y^1 \mid z^1, m^1)\, p(z^1 \mid c^1)\, p(m^1 \mid c^1)\, p(c^1) \prod_{t=2}^{T} p(y^t \mid z^t, m^t)\, p(z^t \mid z^{t-1}, c^t)\, p(m^t \mid c^t)\, p(c^t \mid c^{t-1}).

Following the reasoning of [2], exact filtered and smoothed inference in the dLCM can easily be shown to be intractable (scaling exponentially with T [18]), as neither the class variables nor the mixture variables are observed: at time step 1, p(z^1 | y^1) is a mixture of |sp(C)| · |sp(M)| Gaussians. At time step 2, due to the summation over the classes and mixture variables, p(z^2 | y^{1:2}) will be a mixture of |sp(C)|^2 · |sp(M)|^2 Gaussians; at time step 3 it will be a mixture of |sp(C)|^3 · |sp(M)|^3 Gaussians, and so on, until a mixture of |sp(C)|^T · |sp(M)|^T Gaussians is generated at time point T. To control this explosion in computational complexity, we resort to approximate inference techniques for both classification and learning (i.e., when performing the E-step).

4.2.1 Approximate inference

As the structure of the proposed dLCM is similar to that of the linear dynamical system (LDS) [1], the standard Rauch-Tung-Striebel (RTS) smoother [21] and the expectation correction smoother [2] for the LDS provide the basic ideas for the approximate inference in the dLCM. As in the RTS smoother, the filtered estimate p(z^t, m^t, c^t | y^{1:t}) is obtained by a forward recursion, followed by a backward recursion that calculates the smoothed estimate p(z^t, m^t, c^t | y^{1:T}). The inference in the dLCM is thus achieved by a single forward recursion and a single backward recursion. Gaussian collapse is incorporated into both the forward recursion and the backward recursion to form the approximate inference. The Gaussian collapse in the forward recursion is equivalent to assumed density filtering [3], and the Gaussian collapse in the backward recursion mirrors the smoothed posterior collapse from [2]. The details of the forward and backward recursions are given in the next subsections.

4.2.2 Forward recursion: filtering

During the forward recursion of the dLCM, the filtered posterior p(z^t, m^t, c^t | y^{1:t}) is computed in a recursive form. By a simple decomposition,

p(z^t, m^t, c^t \mid y^{1:t}) = p(z^t, m^t, c^t, y^t \mid y^{1:t-1}) / p(y^t \mid y^{1:t-1}) \propto p(z^t, m^t, c^t, y^t \mid y^{1:t-1}).

Dropping the normalization constant p(y^t | y^{1:t-1}), p(z^t, m^t, c^t | y^{1:t}) is proportional to the joint probability p(z^t, m^t, c^t, y^t | y^{1:t-1}), where

p(z^t, m^t, c^t, y^t \mid y^{1:t-1}) = p(y^t, z^t \mid m^t, c^t, y^{1:t-1})\, p(m^t \mid c^t, y^{1:t-1})\, p(c^t \mid y^{1:t-1}). \quad (1)

Closing the forward recursion with Gaussian collapse

Since all factors in Equation (1) can be obtained from the previously filtered results (see Appendix B for more details), the forward recursion is complete. During the construction of the forward recursion, the term p(z^{t-1}, m^{t-1}, c^{t-1} | y^{1:t-1}) is required. Although p(z^{t-1} | y^{1:t-1}) is the filtered probability of the previous time slice and can be obtained directly, the number of mixture components of p(z^{t-1} | y^{1:t-1}) increases exponentially over time; the same also applies to p(z^{t-1} | m^{t-1}, c^{t-1}, y^{1:t-1}). To deal with this, we collapse p(z^{t-1} | m^{t-1}, c^{t-1}, y^{1:t-1}) to a single Gaussian, parameterized with mean ν_{m^{t-1},c^{t-1}} and covariance Γ_{m^{t-1},c^{t-1}}, and then propagate this collapsed Gaussian to the next time slice. The approximated Gaussian is denoted p̃(z^{t-1} | m^{t-1}, c^{t-1}, y^{1:t-1}); for notational convenience we drop the tildes on all approximated quantities in the remainder of this article. After propagation and updating at time t, the approximated p(z^t | m^t, c^t, y^{1:t}) will be a Gaussian mixture with a fixed number of |sp(C)| · |sp(M)| components. We then collapse this new |sp(C)| · |sp(M)|-component Gaussian mixture to a single Gaussian and propagate it to the next time slice. This guarantees that at any time slice the number of mixture components of the approximated Gaussian mixture is fixed. At the last step of the forward recursion, the filtered result is p(z^T, m^T, c^T | y^{1:T}) at time T.
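The collapse of a Gaussian mixture to a single Gaussian referred to above can be done by moment matching: the collapsed mean is the weighted average of the component means, and the collapsed covariance adds the spread of the component means to the weighted average of the component covariances. A minimal sketch (not the authors' implementation) is given below.

```python
import numpy as np

def collapse_gaussian_mixture(weights, means, covs):
    """Moment-match a Gaussian mixture to a single Gaussian N(mu, Sigma).

    weights: (J,) mixture weights summing to one
    means:   (J, k) component means
    covs:    (J, k, k) component covariances
    """
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    covs = np.asarray(covs, dtype=float)
    mu = weights @ means                                      # collapsed mean
    diffs = means - mu                                        # deviations of the component means
    spread = np.einsum('j,ja,jb->ab', weights, diffs, diffs)  # between-component spread
    sigma = np.einsum('j,jab->ab', weights, covs) + spread    # collapsed covariance
    return mu, sigma

# Toy two-component mixture in k = 2 dimensions:
w = [0.3, 0.7]
mu_j = [[0.0, 0.0], [1.0, 2.0]]
cov_j = [np.eye(2) * 0.1, np.eye(2) * 0.2]
print(collapse_gaussian_mixture(w, mu_j, cov_j))
```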

4.2.3 Backward recursion: smoothing

Similar to the forward pass, the backward pass relies on a recursive computation of the smoothed posterior p(z^t, m^t, c^t | y^{1:T}). In detail, we compute p(z^t, m^t, c^t | y^{1:T}) from the smoothed result of the previous step, p(z^{t+1}, m^{t+1}, c^{t+1} | y^{1:T}), together with some other quantities obtained from the forward pass. The first smoothed posterior is p(z^T, m^T, c^T | y^{1:T}), which can be obtained directly as it is also the last filtered posterior of the forward pass. To compute p(z^t, m^t, c^t | y^{1:T}), we factorize it as

p(z^t, m^t, c^t \mid y^{1:T}) = \sum_{m^{t+1}, c^{t+1}} p(z^t, m^t, c^t, m^{t+1}, c^{t+1} \mid y^{1:T})
  = \sum_{m^{t+1}, c^{t+1}} p(z^t \mid m^t, c^t, m^{t+1}, c^{t+1}, y^{1:T})\, p(m^t, c^t \mid m^{t+1}, c^{t+1}, y^{1:T})\, p(m^{t+1}, c^{t+1} \mid y^{1:T}). \quad (2)

With this factorized form we can achieve the recursion for the backward pass, since all the quantities can be computed either from the previous smoothed results or from the forward recursion. There are two main approximations in the backward recursion. Following [2], we note that although {m^t, c^t} and z^{t+1} are not independent given y^{1:T}, the influence of {m^t, c^t} on z^{t+1} through z^t is "weak", as z^t will be mostly influenced by y^{1:t}. We shall therefore approximate p(z^{t+1} | m^t, c^t, m^{t+1}, c^{t+1}, y^{1:T}) by p(z^{t+1} | m^{t+1}, c^{t+1}, y^{1:T}). The benefit of this simple assumption is that p(z^{t+1} | m^{t+1}, c^{t+1}, y^{1:T}) can be obtained directly from the previous backward recursion. Note that p(z^{t+1} | m^{t+1}, c^{t+1}, y^{1:T}) is a Gaussian mixture whose number of components increases exponentially in T − t. The second approximation is therefore that we collapse p(z^{t+1} | m^{t+1}, c^{t+1}, y^{1:T}) to a single Gaussian and pass this collapsed Gaussian to the next step. The approximated p(z^{t+1} | m^{t+1}, c^{t+1}, y^{1:T}) at the next time step will then be a Gaussian mixture with a fixed number of |sp(C)| · |sp(M)| components. We again collapse this new approximated Gaussian mixture to a single Gaussian and propagate it to the next time step. This process guarantees that at any time step the number of mixture components of the approximated Gaussian mixture is fixed (see Appendix B for more details).

Closing the backward recursion with Gaussian collapse

All the factors in Equation (B.5) of Appendix B can now be computed recursively based on the smoothed results from the previous backward recursion, which completes the approximate inference for the dLCM.

5 Experimental Results

The dynamic latent classification model should prove its computational efficiency and representational power in modeling non-linear real-world time series. To illustrate this point we examine time series data from an oil drilling platform in the North Sea; it is natural to choose oil drilling data, since the dLCM was initially motivated by the same data. The data come from on-rig sensors that describe the physical attributes of the drilling system during drilling, and the sensor readings are recorded every 5 seconds. There are more than 60 physical attributes of raw sensor data, but only some of these attributes are considered important. In the end, only 9 attributes were chosen for the classification of oil drilling activities; they are shown in Fig. 7. These attributes are highly non-linear and were chosen based on the analysis of experienced field experts.

Fig. 7. The selected on-rig sensor data during oil drilling; e.g., HKL (Hookload) measures the load on the hook in tons.


In total, there are 5 oil drilling activities in the dataset used for the classification task: drilling, connection, tripping in, tripping out, and other, which refer to different actions during the drilling process. For the classification of these activities, we use a two-step hierarchical classification process. In the first hierarchical step, the drilling and connection activities are combined into one activity called drilling-connection, and the tripping in and tripping out activities are similarly combined into tripping in/out. The dataset after this combination is called the modified dataset. The reason behind this grouping is that the combined activities are physically close and have quite similar dynamics. A dLCM is then trained to classify these two combined activities. In the second hierarchical step, using the classification results from the first step, a dLCM is trained for each of the two combined activities: one dLCM classifies the activities comprising drilling and connection, and the other is trained to classify the tripping in and tripping out activities. All of these dLCMs are trained separately, but their training process is identical and is described as follows.
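The first hierarchical step can be expressed as a simple label mapping from the five original activities to the two coarse groups. The sketch below shows only this label-merging part; the activity names are taken from the description above, while the placement of the other activity in the drilling-connection group is an assumption made for illustration (consistent with the composition of the second-step training sets described below).

```python
# First hierarchical step: map the original activities to the combined labels.
FIRST_STEP_GROUPS = {
    "drilling": "drilling-connection",
    "connection": "drilling-connection",
    "other": "drilling-connection",      # assumption: grouped with drilling-connection
    "tripping in": "tripping in/out",
    "tripping out": "tripping in/out",
}

def merge_labels(activity_sequence):
    """Map the original activity labels to the first-step combined labels."""
    return [FIRST_STEP_GROUPS[a] for a in activity_sequence]

print(merge_labels(["drilling", "connection", "tripping in", "tripping out", "other"]))
```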

The learning of the dLCM contains two components: the scoring scheme for evaluating the quality of a model, and the search strategy for exploring the space of dLCMs. First, the wrapper approach [15] is adopted to score a model based on its classification accuracy: a model is trained on a training dataset, and the score is given by the classification accuracy obtained by applying this model to a separate validation dataset. Following [16], the search strategy is characterized by (i) a systematic approach for selecting values for the number of mixture components |sp(M)| and the number of latent variables, and, given such a pair of values, (ii) learning the edge set and the parameters of the model. The model with the highest score on the validation dataset is then chosen as the learned dLCM, and the classification task is conducted on a separate test dataset. In our experiment the training, validation, and test datasets used for the train-validate-test process contain 90000, 80000, and 50000 time slices, respectively. The termination condition for learning is either meeting a convergence criterion measured by the change in likelihood over consecutive EM steps or reaching 50 iterations.

To illustrate the classification performance of the dLCM, we compare against naïve Bayes (NB), HMM, the dLCM without mixture components (equivalent to |sp(M)| = 1), a decision tree (J48), and local-global learning (LGL) [25]. Among these models, the first three are the intermediate models encountered on the way to motivating the dLCM. The classification results of the first hierarchical step for the drilling-connection and tripping in/out activities are shown in Table 2.

Furthermore, Fig. 8 shows the detailed classification results obtained with the dLCM.

Table 2
Experimental results on classification accuracy (%) from the first hierarchical step.

Model                 Classification accuracy
NB                    84.77
HMM                   84.80
J48                   96.83
LGL                   -
dLCM (|sp(M)| = 1)    97.14
dLCM                  98.72

Fig. 8. Classification results for the drilling-connection activity (1) and the tripping in/out activity (2) using the dLCM. Panels: "The combined activities of the test dataset" and "The classified results using dLCM for the combined activities"; x-axis: time step, y-axis: activity.

After obtaining the classification results for the drilling-connection and tripping in/out activities, we separate the original, unmodified training dataset into subsets based on the activity-change points (between drilling-connection and tripping in/out) of the first-step results. This yields one set of training data for drilling-connection and another set of training data for tripping in/out. In the training set for the drilling-connection activity, the exact activities in the data are drilling, connection, and other; in the training set for the tripping in/out activity, the exact activities are tripping in and tripping out. The same separation process is also applied to the validation and test datasets.

In the second hierarchical step of the classification, we trained one dLCM for the set containing the drilling, connection, and other activities, and another dLCM for the tripping in/out activities. The training processes and the parameter settings are exactly the same as in the first hierarchical step. The classification results for the two subsets in the second step are shown in Table 3.

Table 3
Experimental results on classification accuracy (%) from the second hierarchical step.

Model                 Tripping in/out    Drilling-connection
NB                    60.72              67.64
HMM                   59.96              76.23
J48                   57.88              69.56
LGL                   -                  -
dLCM (|sp(M)| = 1)    75.57              76.66
dLCM                  75.57              85.47

The overall result on the full dataset, obtained by combining the two classification steps, is shown in Table 4.

Table 4
The overall experimental results on classification accuracy (%).

Model                 Full dataset
NB                    58.53
HMM                   61.69
J48                   66.75
LGL                   -
dLCM (|sp(M)| = 1)    76.29
dLCM                  82.09

The experimental results show that the dLCM achieves significantly better classification accuracy than static classifiers such as NB, J48, and LGL. The results also show that the dLCM is more effective than all three intermediate models on the way to the dLCM. In practice, HMM is more effective than NB in the drilling-connection classification (Table 3); dLCM (|sp(M)| = 1) in turn outperforms HMM in Table 2, Table 3 (tripping in/out classification), and Table 4. In general, these results show that the dLCM is very effective for activity detection in our oil drilling data. The detailed classification of the full dataset with the dLCM is shown in Fig. 9.


Fig. 9. The detailed classification results on the full dataset using the dLCM. Panels: "The original activities of the test dataset" and "The classified results using dLCM for the original activities"; x-axis: time step, y-axis: activity.

6 Conclusions

In this paper we have described a new family of classification models specifically designed for the analysis of streaming data. Dynamic classification systems are of great importance for reliability engineers, with obvious applications in, e.g., monitoring systems. We exemplified the use of the classifier by looking at online activity recognition, and showed that our classification model significantly outperforms the standard (non-dynamic) classifiers.

Our dynamic latent classification model (dLCM) is a generalization of the latent classification model [16] to dynamic domains, but at the same time it is closely related to switching state-space models [12]. This is a natural choice when each class has multiple approximately linear sub-regimes of dynamics, but it forced us to look at an approximate inference technique inspired by [2]. The approximations are shown to be effective and efficient, which makes the learning of dLCMs tractable. We want to investigate the appropriateness of the approximate inference scheme further by thoroughly comparing our approach to traditional sampling techniques (e.g., [10]).

Finally, we plan to use the classification model for event detection, i.e., to foresee – and therefore potentially help the operators prevent – undesired situations. This is a challenging task, because traditional probabilistic classification, based on the decision rule in Section 1, requires the specification of a meaningful loss function. In the dynamic setting, one would want to encode statements like "Detecting an event before a minute has passed is worth $1, but detection after 90 seconds is useless", which requires a richer representation than a (static) loss function. Furthermore, event detection is an extremely unbalanced classification problem [14], where the marginal probability of a given event can be on the order of 10^{-5} or even less. The appropriateness of maximum-likelihood based learning can thus be questioned, and we plan to devise a semi-discriminative strategy in the spirit of [25] to this end.

Acknowledgments

We thank Ana for her help in the early stages of this work. An early version of this paper was presented at FLAIRS. We also thank the people at Verdande for their help with preparing and understanding the data.

References

[1] Yaakov Bar-Shalom and Xiao-Rong Li. Estimation and Tracking: Principles, Techniques and Software. Artech House Publishers, 1993.

[2] David Barber. Expectation correction for smoothed inference in switching linear dynamical systems. Journal of Machine Learning Research, 7:2515–2540, 2006.

[3] Xavier Boyen and Daphne Koller. Approximate learning of dynamic models. In Advances in Neural Information Processing Systems 12, pages 396–402, 1999.

[4] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.

[5] Richard O. Duda and Peter E. Hart. Pattern Classification and Scene Analysis. John Wiley & Sons, New York, 1973.

[6] Brian S. Everitt. An Introduction to Latent Variable Models. Chapman & Hall, London, 1984.

[7] Nir Friedman, Dan Geiger, and Moises Goldszmidt. Bayesian network classifiers. Machine Learning, 29(2–3):131–163, 1997.

[8] Nir Friedman, Moises Goldszmidt, and Thomas J. Lee. Bayesian network classification with continuous attributes: Getting the best of both discretization and parametric fitting. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 179–187, San Francisco, CA, 1998. Morgan Kaufmann Publishers.


[9] Joao Gama. A linear-Bayes classifier. In Proceedings of the 7th Ibero-American Conference on AI: Advances in Artificial Intelligence, volume 1952 of Lecture Notes in Computer Science, pages 269–279. Springer-Verlag, Berlin, Germany, 2000.

[10] Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721–741, 1984.

[11] Zoubin Ghahramani. Learning dynamic Bayesian networks. In Adaptive Processing of Sequences and Data Structures, International Summer School on Neural Networks, "E.R. Caianiello" - Tutorial Lectures, pages 168–197, London, UK, 1998. Springer-Verlag.

[12] Zoubin Ghahramani and Geoffrey E. Hinton. Variational learning for switching state-space models. Neural Computation, 12:963–996, 1998.

[13] Y. He and A. Kundu. 2-D shape classification using hidden Markov model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(11):1172–1184, 1991.

[14] Nathalie Japkowicz and Shaju Stephen. The class imbalance problem: A systematic study. Intelligent Data Analysis, 6(5):429–449, 2002.

[15] Ron Kohavi and George H. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1–2):273–324, 1997.

[16] Helge Langseth and Thomas D. Nielsen. Latent classification models. Machine Learning, 59(3):237–265, 2005.

[17] Gal Lavee, Ehud Rivlin, and Michael Rudzsky. Understanding video events: a survey of methods for automatic interpretation of semantic occurrences in video. IEEE Transactions on Systems, Man, and Cybernetics Part C, 39(5):489–504, 2009.

[18] Uri Lerner. Hybrid Bayesian networks for reasoning about complex systems. PhD thesis, Dept. of Comp. Sci., Stanford University, Stanford, 2002.

[19] Uri Lerner and Ronald Parr. Inference in hybrid networks: Theoretical limits and practical algorithms. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence (UAI), pages 310–318, 2001.

[20] Miriam Martínez and Luis Enrique Sucar. Learning dynamic naive Bayesian classifiers. In Proceedings of the Twenty-First International Florida Artificial Intelligence Research Symposium Conference, pages 655–659, 2008.

[21] H. E. Rauch, F. Tung, and C. T. Striebel. Maximum likelihood estimates of linear dynamic systems. AIAA Journal, 3:1445–1450, 1965.

[22] Padhraic Smyth. Hidden Markov models for fault detection in dynamic systems. Pattern Recognition, 27(1):149–164, 1994.


[23] Ryoji Suzuki, Mitsuhiro Ogawa, Sakuto Otake, Takeshi Izutsu, Yoshiko Tobimatsu, Shin-Ichi Izumi, and Tsutomu Iwaya. Analysis of activities of daily living in elderly people living alone: single-subject feasibility study. Telemedicine Journal and E-Health, 10(2):260–276, 2004.

[24] G. I. Webb, J. R. Boughton, and Z. Wang. Not so naïve Bayes: Aggregating one-dependence estimators. Machine Learning, 58(1):5–24, 2005.

[25] Shengtong Zhong and Helge Langseth. Local-global-learning of naive Bayesian classifier. In Proceedings of the 2009 Fourth International Conference on Innovative Computing, Information and Control, pages 278–281, Washington, DC, USA, 2009. IEEE Computer Society.

Appendix A. Learning

A.1 The E-step

The E-step of the EM algorithm computes the expected log-likelihood of the completed observations, given the data:

Q = E\Big[ \log \prod_{t=1}^{T} f(\cdot^t) \,\Big|\, D_{1:T} \Big] = \sum_{t=1}^{T} \big\{ E[\log f(\cdot^t) \mid D_{1:T}] \big\}
  = \sum_{t=1}^{T} \big\{ E[\log P(c^t \mid c^{t-1}) \mid D_{1:T}] \big\} + \sum_{t=1}^{T} \big\{ E[\log P(M^t \mid c^t) \mid D_{1:T}] \big\}
  + \sum_{t=1}^{T} \big\{ E[\log p(Z^t \mid Z^{t-1}, c^t) \mid D_{1:T}] \big\} + \sum_{t=1}^{T} \big\{ E[\log p(y^t \mid Z^t, M^t) \mid D_{1:T}] \big\}.

Therefore, the expected log-likelihood depends on these four expectations:

• \sum_{t=1}^{T} E[\log P(c^t \mid c^{t-1}) \mid D_{1:T}],
• \sum_{t=1}^{T} E[\log P(M^t \mid c^t) \mid D_{1:T}],
• \sum_{t=1}^{T} E[\log p(Z^t \mid Z^{t-1}, c^t) \mid D_{1:T}], and
• \sum_{t=1}^{T} E[\log p(y^t \mid Z^t, M^t) \mid D_{1:T}].

The respective computational forms of these probability quantities are as follows:

P(c^t \mid c^{t-1}) \sim J

P(m^t \mid c^t) \sim K

p(z^t \mid z^{t-1}, c^t) \sim N(A_{c^t} z^{t-1}, \Sigma_{c^t}) = (2\pi)^{-k/2} |\Sigma_{c^t}|^{-1/2} \exp\big( -\tfrac{1}{2} (z^t - A_{c^t} z^{t-1})' \Sigma_{c^t}^{-1} (z^t - A_{c^t} z^{t-1}) \big)

p(y^t \mid z^t, m^t) \sim N(L_{m^t} z^t + \Phi_{m^t}, \Theta_{m^t}) = (2\pi)^{-n/2} |\Theta_{m^t}|^{-1/2} \exp\big( -\tfrac{1}{2} (y^t - L_{m^t} z^t - \Phi_{m^t})' \Theta_{m^t}^{-1} (y^t - L_{m^t} z^t - \Phi_{m^t}) \big)

A.2 The M-step

The parameters of the dLCM are A, L, Σ, Θ, Φ, J, K. At each iteration, these parameters can be obtained by setting the corresponding partial derivative of the expected log-likelihood to zero. The results are as follows:

L, the linear mapping from the latent space to the attribute space: L_{i,m^t} denotes the i'th row of this matrix when the mixture node is m^t.

\frac{\partial Q}{\partial L_{i,m^t}} = \Theta_{i,m^t}^{-1} \sum_{t=1}^{T} \Big\{ P(M^t = m^t \mid D_{1:T})\, E[Z^t (Z^t)' \mid M^t = m^t, D_{1:T}]\, L_{i,m^t} - (y_i^t - \Phi_{i,m^t})\, P(M^t = m^t \mid D_{1:T})\, E[Z^t \mid M^t = m^t, D_{1:T}] \Big\}

L_{i,m^t} = \Big( \sum_{t=1}^{T} \big\{ P(M^t = m^t \mid D_{1:T})\, E[Z^t (Z^t)' \mid M^t = m^t, D_{1:T}] \big\} \Big)^{-1} \cdot \sum_{t=1}^{T} \big\{ (y_i^t - \Phi_{i,m^t})\, P(M^t = m^t \mid D_{1:T})\, E[Z^t \mid M^t = m^t, D_{1:T}] \big\}

Φ, the offset from the latent space to the attribute space: Φ_{i,m^t} denotes the i'th row of this matrix when the mixture node is m^t.

\frac{\partial Q}{\partial \Phi_{i,m^t}} = \sum_{t=1}^{T} \Big\{ 2 P(M^t = m^t \mid D_{1:T}) (y_i^t - \Phi_{i,m^t}) - 2 P(M^t = m^t \mid D_{1:T})\, L'_{i,m^t} E[Z^t \mid M^t = m^t, D_{1:T}] \Big\}

\Phi_{i,m^t} = \frac{1}{\sum_{t=1}^{T} \big\{ P(M^t = m^t \mid D_{1:T}) \big\}} \Big( \sum_{t=1}^{T} \big\{ P(M^t = m^t \mid D_{1:T})\, y_i^t \big\} - \sum_{t=1}^{T} \big\{ P(M^t = m^t \mid D_{1:T})\, L'_{i,m^t} E[Z^t \mid M^t = m^t, D_{1:T}] \big\} \Big)

Θ, the covariance matrices of the attribute space: Θ_{i,m^t} denotes the i'th element on the diagonal of the matrix when the mixture node is m^t.

\frac{\partial Q}{\partial \Theta_{i,m^t}} = -\frac{1}{2} \Theta_{i,m^t}^{-1} \sum_{t=1}^{T} \big\{ P(M^t = m^t \mid D_{1:T}) \big\} + \frac{1}{2} \Theta_{i,m^t}^{-2} \sum_{t=1}^{T} \Big\{ (y_i^t - \Phi_{i,m^t})^2 P(M^t = m^t \mid D_{1:T}) - 2 (y_i^t - \Phi_{i,m^t}) P(M^t = m^t \mid D_{1:T})\, L'_{i,m^t} E[Z^t \mid M^t = m^t, D_{1:T}] + P(M^t = m^t \mid D_{1:T})\, L'_{i,m^t} E[Z^t (Z^t)' \mid M^t = m^t, D_{1:T}]\, L_{i,m^t} \Big\}

\Theta_{i,m^t} = \frac{1}{\sum_{t=1}^{T} \big\{ P(M^t = m^t \mid D_{1:T}) \big\}} \sum_{t=1}^{T} \Big\{ (y_i^t - \Phi_{i,m^t})^2 P(M^t = m^t \mid D_{1:T}) - 2 (y_i^t - \Phi_{i,m^t}) P(M^t = m^t \mid D_{1:T})\, L'_{i,m^t} E[Z^t \mid M^t = m^t, D_{1:T}] + P(M^t = m^t \mid D_{1:T})\, L'_{i,m^t} E[Z^t (Z^t)' \mid M^t = m^t, D_{1:T}]\, L_{i,m^t} \Big\}

Σ, the covariance matrices of the latent space: Σ_c denotes the diagonal matrix when the class variable is c, and α_c denotes the number of time slices t with c^t = c.

\frac{\partial Q}{\partial \Sigma_c} = -\frac{\alpha_c}{2} \Sigma_c^{-1} \Big( I - \frac{1}{\alpha_c} \sum_{t: c^t = c} \big\{ E[Z^t (Z^t)' \mid D_{1:T}]\, \Sigma_c^{-1} \big\} + \frac{2}{\alpha_c} \sum_{t: c^t = c} \big\{ A_c\, E[Z^{t-1} (Z^t)' \mid D_{1:T}]\, \Sigma_c^{-1} \big\} - \frac{1}{\alpha_c} \sum_{t: c^t = c} \big\{ A_c\, E[Z^{t-1} (Z^{t-1})' \mid D_{1:T}]\, A'_c\, \Sigma_c^{-1} \big\} \Big)

\Sigma_c = \frac{1}{\alpha_c} \sum_{t: c^t = c} \Big\{ E[Z^t (Z^t)' \mid D_{1:T}] - 2 A_c E[Z^{t-1} (Z^t)' \mid D_{1:T}] + A_c E[Z^{t-1} (Z^{t-1})' \mid D_{1:T}]\, A'_c \Big\}

A, the linear dynamics within the latent space from one time slice to the next: A_c denotes the matrix when the class variable is c.

\frac{\partial Q}{\partial A_c} = -2 \sum_{t: c^t = c} \big\{ \Sigma_c^{-1} E[Z^t (Z^{t-1})' \mid D_{1:T}] \big\} + 2 \sum_{t: c^t = c} \big\{ \Sigma_c^{-1} A_c\, E[Z^{t-1} (Z^{t-1})' \mid D_{1:T}] \big\}

A_c = \sum_{t: c^t = c} \big\{ E[Z^t (Z^{t-1})' \mid D_{1:T}] \big\} \cdot \Big( \sum_{t: c^t = c} \big\{ E[Z^{t-1} (Z^{t-1})' \mid D_{1:T}] \big\} \Big)^{-1}

J, the transition matrix of the class variable, is obtained directly by frequency estimation of P(c^t | c^{t-1}).

K, the transition matrix of the mixture node, is computed based on P(m^t | c^t = c, D_{1:T}):

P(m^t \mid c^t = c, D_{1:T}) = \frac{\sum_{t: c^t = c} \big\{ P(M^t = m^t \mid D_{1:T}) \big\}}{\sum_{t=1}^{T} \big\{ P(M^t = m^t \mid D_{1:T}) \big\}}

Appendix B. Inference

From the dLCM in Fig. 6, an equivalent probabilistic model is

p(y^{1:T}, z^{1:T}, m^{1:T}, c^{1:T}) = p(y^1 \mid z^1, m^1)\, p(z^1 \mid c^1)\, p(m^1 \mid c^1)\, p(c^1) \prod_{t=2}^{T} p(y^t \mid z^t, m^t)\, p(z^t \mid z^{t-1}, c^t)\, p(m^t \mid c^t)\, p(c^t \mid c^{t-1}). \quad (B.1)

Following the reasoning of [2], exact filtered and smoothed inference in the dLCM can easily be shown to be intractable (scaling exponentially with T [18]), as neither the class variables nor the mixture variables are observed: at time step 1, p(z^1 | y^1) is a mixture of |sp(C)| · |sp(M)| Gaussians. At time step 2, due to the summation over the classes and mixture variables, p(z^2 | y^{1:2}) will be a mixture of |sp(C)|^2 · |sp(M)|^2 Gaussians; at time step 3 it will be a mixture of |sp(C)|^3 · |sp(M)|^3 Gaussians, and so on, until a mixture of |sp(C)|^T · |sp(M)|^T Gaussians is generated at time point T. To control this explosion in computational complexity, we resort to approximate inference techniques for both classification and learning (i.e., when performing the E-step).

B.1 Approximate inference

As the structure of the proposed dLCM is similar to that of the Linear Dynamical System (LDS) [1], the standard Rauch-Tung-Striebel (RTS) smoother [21] and the expectation correction smoother [2] for the LDS provide the basic ideas for the approximate inference in the dLCM. As in the RTS smoother, the filtered estimate p(z^t, m^t, c^t | y^{1:t}) is obtained by a forward recursion, followed by a backward recursion that computes the smoothed estimate p(z^t, m^t, c^t | y^{1:T}). Inference in the dLCM is thus achieved by a single forward recursion and a single backward recursion, and a Gaussian collapse is incorporated into both to obtain the approximate inference scheme: the collapse in the forward recursion is equivalent to assumed density filtering [3], while the collapse in the backward recursion mirrors the smoothed posterior collapse from [2]. The details of the forward and backward recursions are introduced next.
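Purely as a structural illustration, the two-pass scheme can be sketched as follows. All helper callables (filter_step, smooth_step, collapse) are hypothetical placeholders standing in for the computations of Sections B.2 and B.3; the sketch only shows how the collapsed quantities are passed between time slices.

```python
def approximate_inference(T, filter_step, smooth_step, collapse):
    """Schematic two-pass (RTS-style) approximate inference for the dLCM.

    filter_step(t, prev)          : filtered quantities at time t from the collapsed
                                    quantities of time t-1 (Section B.2); prev is None at t=0
    smooth_step(t, nxt, filtered) : smoothed quantities at time t from the collapsed
                                    smoothed quantities of time t+1 (Section B.3)
    collapse(q)                   : moment-matches a Gaussian mixture to a single Gaussian
    """
    filtered, prev = [], None
    for t in range(T):                        # forward recursion: filtering
        prev = collapse(filter_step(t, prev))
        filtered.append(prev)
    smoothed = [None] * T
    smoothed[-1] = filtered[-1]               # last filtered posterior = first smoothed posterior
    for t in range(T - 2, -1, -1):            # backward recursion: smoothing
        smoothed[t] = collapse(smooth_step(t, smoothed[t + 1], filtered[t]))
    return smoothed
```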

B.2 Forward recursion: filtering

During the forward recursion of the dLCM, the filtered posterior p(z^t, m^t, c^t | y^{1:t}) is computed in a recursive form. By a simple decomposition,

\[
p(z^t, m^t, c^t \mid y^{1:t}) = p(z^t, m^t, c^t, y^t \mid y^{1:t-1}) / p(y^t \mid y^{1:t-1}) \propto p(z^t, m^t, c^t, y^t \mid y^{1:t-1}).
\]

Dropping the normalization constant p(y^t | y^{1:t-1}), p(z^t, m^t, c^t | y^{1:t}) is proportional to the joint probability p(z^t, m^t, c^t, y^t | y^{1:t-1}), where

\[
p(z^t, m^t, c^t, y^t \mid y^{1:t-1}) = p(y^t, z^t \mid m^t, c^t, y^{1:t-1})\, p(m^t \mid c^t, y^{1:t-1})\, p(c^t \mid y^{1:t-1}). \tag{B.2}
\]


We now derive a recursive form for each of the factors in Equation B.2, given the filtered results of the previous time step.

Evaluating p(c^t | y^{1:t-1}):
p(c^t | y^{1:t-1}) can be factorized as $\sum_{c^{t-1}\in sp(C)} p(c^t\mid c^{t-1})\,p(c^{t-1}\mid y^{1:t-1})$, where p(c^t | c^{t-1}) is the known transition probability of the class variable and p(c^{t-1} | y^{1:t-1}) can be obtained by marginalizing the previously filtered results,

\[
p(c^{t-1}\mid y^{1:t-1})
= \sum_{m^{t-1}} p(m^{t-1}, c^{t-1}\mid y^{1:t-1})
= \sum_{m^{t-1}} \int_{z^{t-1}} p(z^{t-1}, m^{t-1}, c^{t-1}\mid y^{1:t-1})\, dz^{t-1}.
\]

p(z^{t-1}, m^{t-1}, c^{t-1} | y^{1:t-1}) is directly available, since it is the filtered probability of the previous time slice. However, as pointed out earlier, the number of mixture components of p(z^{t-1} | y^{1:t-1}) increases exponentially over time, and the same holds for p(z^{t-1} | m^{t-1}, c^{t-1}, y^{1:t-1}). To deal with this, we collapse p(z^{t-1} | m^{t-1}, c^{t-1}, y^{1:t-1}) to a single Gaussian, parameterized with mean $\nu_{m^{t-1}, c^{t-1}}$ and covariance $\Gamma_{m^{t-1}, c^{t-1}}$, and then propagate this collapsed Gaussian to the next time slice. The approximated Gaussian is denoted $\tilde{p}(z^{t-1} \mid m^{t-1}, c^{t-1}, y^{1:t-1})$; for notational convenience we drop the tildes on all approximated quantities in the remainder of this article. After propagation and updating at time t, the approximated p(z^t | m^t, c^t, y^{1:t}) is a Gaussian mixture with a fixed number of |sp(C)| · |sp(M)| components. We then collapse this new Gaussian mixture to a single Gaussian and propagate it to the next time slice, which guarantees that the number of mixture components in the approximation stays fixed at every time slice. At the last step of the forward recursion, the filtered result at time T is p(z^T, m^T, c^T | y^{1:T}).
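The collapse itself is a standard moment-matching step: the mixture is replaced by the single Gaussian that has the same mean and covariance. A minimal numpy sketch, with our own variable names for the mixture weights, means and covariances:

```python
import numpy as np

def collapse(weights, means, covs):
    """Moment-match a Gaussian mixture to a single Gaussian.

    weights : (K,) mixture weights summing to one
    means   : (K, d) component means
    covs    : (K, d, d) component covariances
    """
    w = np.asarray(weights)
    means = np.asarray(means)
    mu = w @ means                                # overall mean
    cov = np.zeros_like(covs[0])
    for wk, mk, Sk in zip(w, means, covs):
        diff = (mk - mu)[:, None]
        cov += wk * (Sk + diff @ diff.T)          # within-component + between-component spread
    return mu, cov
```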

Evaluating p(m^t | c^t, y^{1:t-1}):
As $m^t \perp\!\!\!\perp y^{1:t-1} \mid c^t$, p(m^t | c^t, y^{1:t-1}) is equal to p(m^t | c^t), which is a known model parameter.

Evaluating p(y^t, z^t | m^t, c^t, y^{1:t-1}):

\[
p(y^t, z^t \mid m^t, c^t, y^{1:t-1}) = \sum_{m^{t-1}, c^{t-1}} p(y^t, z^t \mid m^t, c^t, m^{t-1}, c^{t-1}, y^{1:t-1})\, p(m^{t-1}, c^{t-1} \mid m^t, c^t, y^{1:t-1}).
\]

The joint p(y^t, z^t | m^t, c^t, m^{t-1}, c^{t-1}, y^{1:t-1}) is a Gaussian whose mean and covariance elements can be obtained by integrating the forward dynamics over z^{t-1} using Appendix C:

\[
\begin{aligned}
\mu_{z^t} &= A_{c^t}\, \nu_{m^{t-1}, c^{t-1}}, \\
\mu_{y^t} &= L_{m^t}\, \mu_{z^t}, \\
\Sigma_{z^t z^t} &= A_{c^t}\, \Gamma_{m^{t-1}, c^{t-1}}\, A'_{c^t} + \Sigma_{c^t}, \\
\Sigma_{y^t y^t} &= L_{m^t}\, \Sigma_{z^t z^t}\, L'_{m^t} + \Theta_{m^t}, \\
\Sigma_{y^t z^t} &= L_{m^t}\, \Sigma_{z^t z^t}.
\end{aligned}
\]
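These statistics follow directly from the Gaussian propagation rule in Appendix C, and could be computed along the following lines (a numpy sketch with our own parameter names):

```python
import numpy as np

def forward_joint_moments(A_c, Sigma_c, L_m, Theta_m, nu_prev, Gamma_prev):
    """Mean/covariance of (y^t, z^t) given (m^t, c^t, m^{t-1}, c^{t-1}, y^{1:t-1}).

    A_c, Sigma_c        : dynamics matrix and latent noise covariance for class c^t
    L_m, Theta_m        : loading matrix and attribute noise covariance for mixture m^t
    nu_prev, Gamma_prev : collapsed filtered mean and covariance of z^{t-1}
    """
    mu_z = A_c @ nu_prev                           # mu_{z^t}
    mu_y = L_m @ mu_z                              # mu_{y^t}
    S_zz = A_c @ Gamma_prev @ A_c.T + Sigma_c      # Sigma_{z^t z^t}
    S_yy = L_m @ S_zz @ L_m.T + Theta_m            # Sigma_{y^t y^t}
    S_yz = L_m @ S_zz                              # Sigma_{y^t z^t}
    return mu_z, mu_y, S_zz, S_yy, S_yz
```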

Up to a trivial normalization constant, the remaining joint p(m^{t-1}, c^{t-1} | m^t, c^t, y^{1:t-1}) can be further decomposed as

\[
p(m^{t-1}, c^{t-1} \mid m^t, c^t, y^{1:t-1}) \propto p(m^t, c^t \mid m^{t-1}, c^{t-1}, y^{1:t-1})\, p(m^{t-1}, c^{t-1} \mid y^{1:t-1}). \tag{B.3}
\]

In Equation B.3, the last factor p(m^{t-1}, c^{t-1} | y^{1:t-1}) is given from the previous time step, and the factor p(m^t, c^t | m^{t-1}, c^{t-1}, y^{1:t-1}) is found from

\[
p(m^t, c^t \mid m^{t-1}, c^{t-1}, y^{1:t-1}) = \int_{z^{t-1}} p(m^t, c^t \mid z^{t-1}, m^{t-1}, c^{t-1}, y^{1:t-1})\, p(z^{t-1} \mid m^{t-1}, c^{t-1}, y^{1:t-1})\, dz^{t-1}. \tag{B.4}
\]

As both p(m^t, c^t | z^{t-1}, m^{t-1}, c^{t-1}, y^{1:t-1}) and p(z^{t-1} | m^{t-1}, c^{t-1}, y^{1:t-1}) can be computed directly from the previously filtered results, any numerical integration method can be adopted to solve Equation B.4, e.g., a mean-field approximation or Monte Carlo integration.
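As one concrete option, the integral in Equation B.4 can be approximated by plain Monte Carlo: draw samples from the collapsed Gaussian p(z^{t-1} | m^{t-1}, c^{t-1}, y^{1:t-1}) and average the first factor over them. A sketch under that assumption, where cond_prob is a hypothetical callable evaluating the first factor:

```python
import numpy as np

def mc_transition_estimate(nu, Gamma, cond_prob, n_samples=500, seed=0):
    """Monte Carlo approximation of the integral in Equation B.4.

    nu, Gamma : mean and covariance of the collapsed p(z^{t-1} | m^{t-1}, c^{t-1}, y^{1:t-1})
    cond_prob : callable z -> p(m^t, c^t | z^{t-1} = z, m^{t-1}, c^{t-1}, y^{1:t-1})
    """
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(nu, Gamma, size=n_samples)
    return np.mean([cond_prob(z) for z in samples])   # average of the integrand over the samples
```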

Closing the forward recursion

The forward recursion is thereby complete, since all factors in Equation B.2 are obtained from the previously filtered results. At the last step of the forward recursion, the filtered result at time T is p(z^T, m^T, c^T | y^{1:T}).

B.3 Backward recursion: smoothing

Similar to the forward pass, the backward pass relies on a recursive computation of the smoothed posterior p(z^t, m^t, c^t | y^{1:T}). In detail, we compute p(z^t, m^t, c^t | y^{1:T}) from the smoothed result of the previous step, p(z^{t+1}, m^{t+1}, c^{t+1} | y^{1:T}), together with quantities obtained from the forward pass. The first smoothed posterior is p(z^T, m^T, c^T | y^{1:T}), which is obtained directly since it is also the last filtered posterior from the forward pass. To compute p(z^t, m^t, c^t | y^{1:T}), we factorize it as

\[
\begin{aligned}
p(z^t, m^t, c^t \mid y^{1:T})
&= \sum_{m^{t+1}, c^{t+1}} p(z^t, m^t, c^t, m^{t+1}, c^{t+1} \mid y^{1:T}) \\
&= \sum_{m^{t+1}, c^{t+1}} p(z^t \mid m^t, c^t, m^{t+1}, c^{t+1}, y^{1:T})\,
   p(m^t, c^t \mid m^{t+1}, c^{t+1}, y^{1:T})\, p(m^{t+1}, c^{t+1} \mid y^{1:T}).
\end{aligned} \tag{B.5}
\]

With this factorized form we can establish the recursion for the backward pass.

Evaluating p(z^t | m^t, c^t, m^{t+1}, c^{t+1}, y^{1:T}):
Due to the fact that $z^t \perp\!\!\!\perp \{y^{t+1:T}, m^{t+1}, c^{t+1}\} \mid \{z^{t+1}, m^t, c^t\}$, the term p(z^t | m^t, c^t, m^{t+1}, c^{t+1}, y^{1:T}) can be found from

\[
\begin{aligned}
p(z^t \mid m^t, c^t, m^{t+1}, c^{t+1}, y^{1:T})
&= \int_{z^{t+1}} p(z^t \mid z^{t+1}, m^t, c^t, m^{t+1}, c^{t+1}, y^{1:T})\, p(z^{t+1} \mid m^t, c^t, m^{t+1}, c^{t+1}, y^{1:T})\, dz^{t+1} \\
&= \int_{z^{t+1}} p(z^t \mid z^{t+1}, m^t, c^t, y^{1:t})\, p(z^{t+1} \mid m^t, c^t, m^{t+1}, c^{t+1}, y^{1:T})\, dz^{t+1}.
\end{aligned}
\]

The above equation enables a recursive computation of p(z^t | m^t, c^t, m^{t+1}, c^{t+1}, y^{1:T}). First we compute the term p(z^{t+1} | m^t, c^t, m^{t+1}, c^{t+1}, y^{1:T}). Following [2], we note that although {m^t, c^t} and z^{t+1} are not conditionally independent given y^{1:T}, the influence of {m^t, c^t} on z^{t+1} through z^t is "weak", as z^t is mostly determined by y^{1:t}. We shall therefore approximate p(z^{t+1} | m^t, c^t, m^{t+1}, c^{t+1}, y^{1:T}) by p(z^{t+1} | m^{t+1}, c^{t+1}, y^{1:T}). The benefit of this simple assumption is that p(z^{t+1} | m^{t+1}, c^{t+1}, y^{1:T}) can be obtained directly from the previous backward recursion. Note that p(z^{t+1} | m^{t+1}, c^{t+1}, y^{1:T}) is a Gaussian mixture whose number of components increases exponentially in T − t. As in the forward recursion, we therefore collapse p(z^{t+1} | m^{t+1}, c^{t+1}, y^{1:T}) to a single Gaussian and pass this collapsed Gaussian on to the next step. The approximated distribution at the next step of the backward recursion is then again a Gaussian mixture with a fixed number of |sp(C)| · |sp(M)| components, which we collapse to a single Gaussian before propagating it further. This guarantees that the number of mixture components in the approximation stays fixed at every time step.

The factor p(z^t | z^{t+1}, m^t, c^t, y^{1:t}) can be evaluated by conditioning the joint p(z^t, z^{t+1} | m^t, c^t, y^{1:t}) on z^{t+1} using Appendix D. The joint p(z^t, z^{t+1} | m^t, c^t, y^{1:t}) can be computed as

\[
p(z^t, z^{t+1} \mid m^t, c^t, y^{1:t}) = p(z^{t+1} \mid z^t, m^t, c^t, c^{t+1}, y^{1:t})\, p(z^t \mid m^t, c^t, y^{1:t}). \tag{B.6}
\]

p(z^t | m^t, c^t, y^{1:t}) can be obtained from the forward recursion. The remaining necessary statistics of the joint are the mean and covariance of z^{t+1}, and the covariance between z^{t+1} and z^t:

\[
\begin{aligned}
\mu_{z^{t+1}} &= A_{c^{t+1}}\, \nu_{m^t, c^t}, \\
\Sigma_{z^{t+1} z^{t+1}} &= A_{c^{t+1}}\, \Gamma_{m^t, c^t}\, A'_{c^{t+1}} + \Sigma_{c^{t+1}}, \\
\Sigma_{z^{t+1} z^t} &= A_{c^{t+1}}\, \Gamma_{m^t, c^t}.
\end{aligned}
\]
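Combining these joint statistics with the conditioning rule of Appendix D gives the parameters of p(z^t | z^{t+1}, m^t, c^t, y^{1:t}) in closed form. A small numpy sketch under that reading, with our own variable names:

```python
import numpy as np

def backward_conditional(A_next, Sigma_next, nu, Gamma):
    """Parameters of p(z^t | z^{t+1}, m^t, c^t, y^{1:t}) obtained by Gaussian conditioning.

    A_next, Sigma_next : A_{c^{t+1}} and Sigma_{c^{t+1}}
    nu, Gamma          : filtered mean and covariance of z^t given (m^t, c^t, y^{1:t})
    Returns (G, b, S) such that z^t | z^{t+1} is N(G z^{t+1} + b, S).
    """
    mu_next = A_next @ nu                           # mean of z^{t+1}
    S_nn = A_next @ Gamma @ A_next.T + Sigma_next   # covariance of z^{t+1}
    S_tn = Gamma @ A_next.T                         # Cov(z^t, z^{t+1})
    G = S_tn @ np.linalg.inv(S_nn)                  # backward gain
    b = nu - G @ mu_next
    S = Gamma - G @ S_tn.T                          # conditional covariance
    return G, b, S
```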

Evaluating p(m^t, c^t | m^{t+1}, c^{t+1}, y^{1:T}):
The term p(m^t, c^t | m^{t+1}, c^{t+1}, y^{1:T}) can be found from

\[
p(m^t, c^t \mid m^{t+1}, c^{t+1}, y^{1:T})
= \int_{z^{t+1}} p(m^t, c^t \mid z^{t+1}, m^{t+1}, c^{t+1}, y^{1:T})\, p(z^{t+1} \mid m^t, c^t, m^{t+1}, c^{t+1}, y^{1:T})\, dz^{t+1}. \tag{B.7}
\]

As $\{m^t, c^t\} \perp\!\!\!\perp y^{t+1:T} \mid \{z^{t+1}, m^{t+1}, c^{t+1}\}$, we have p(m^t, c^t | z^{t+1}, m^{t+1}, c^{t+1}, y^{1:T}) = p(m^t, c^t | z^{t+1}, m^{t+1}, c^{t+1}, y^{1:t}), and we can compute p(m^t, c^t | z^{t+1}, m^{t+1}, c^{t+1}, y^{1:t}) as

\[
\begin{aligned}
p(m^t, c^t \mid z^{t+1}, m^{t+1}, c^{t+1}, y^{1:t})
&\propto p(m^t, c^t, z^{t+1}, m^{t+1}, c^{t+1}, y^{1:t}) \\
&= p(z^{t+1} \mid m^t, c^t, m^{t+1}, c^{t+1}, y^{1:t})\, p(m^t, c^t, m^{t+1}, c^{t+1} \mid y^{1:t}).
\end{aligned} \tag{B.8}
\]

The first factor in Equation B.8 can be obtained by marginalizing out z^t from p(z^t, z^{t+1} | m^t, c^t, m^{t+1}, c^{t+1}, y^{1:t}), which is computed in Equation B.6. The last factor p(m^t, c^t, m^{t+1}, c^{t+1} | y^{1:t}) is found from

\[
p(m^t, c^t, m^{t+1}, c^{t+1} \mid y^{1:t}) \propto p(m^{t+1}, c^{t+1} \mid m^t, c^t, y^{1:t})\, p(m^t, c^t \mid y^{1:t}),
\]

where all the quantities can be computed from the forward recursion. Using p(z^{t+1} | m^{t+1}, c^{t+1}, y^{1:T}) from the previous backward recursion, we are now able to compute Equation B.7. As for the forward recursion in Equation B.4, any numerical integration method can be adopted to perform the integral in Equation B.7.

Evaluating p(m^{t+1}, c^{t+1} | y^{1:T}):
This term is obtained by marginalizing out z^{t+1} from p(z^{t+1}, m^{t+1}, c^{t+1} | y^{1:T}), which was obtained during the previous step of the backward recursion.


Closing the backward recursion

All the factors in Equation B.5 can now be computed recursively from the smoothed results of the previous backward step, which completes the approximate inference scheme for the dLCM.

C Appendix C. Gaussian propagation

Let $y$ be linearly related to $x$ through $y = Ax + \psi$, where $\psi \sim N(\nu, \Psi)$ and $x \sim N(\mu, \Sigma)$. Then $p(y) = \int_x p(y \mid x)\, p(x)\, dx$ is Gaussian with mean $A\mu + \nu$ and covariance $A\Sigma A' + \Psi$ [2].
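A one-function numpy illustration of this rule (variable names are ours):

```python
import numpy as np

def propagate(A, mu, Sigma, nu, Psi):
    """If x ~ N(mu, Sigma), psi ~ N(nu, Psi) and y = A x + psi, then y ~ N(A mu + nu, A Sigma A' + Psi)."""
    return A @ mu + nu, A @ Sigma @ A.T + Psi
```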

D Appendix D. Gaussian conditioning

For a joint Gaussian over $x$ and $y$ with means $\mu_x, \mu_y$ and covariance elements $\Sigma_{xx}, \Sigma_{xy}, \Sigma_{yy}$, the conditional $p(x \mid y)$ is a Gaussian with mean $\mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}(y - \mu_y)$ and covariance $\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}$ [2].
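And the corresponding numpy illustration of the conditioning rule (again with our own names):

```python
import numpy as np

def condition(mu_x, mu_y, S_xx, S_xy, S_yy, y):
    """Mean and covariance of p(x | y) for a jointly Gaussian (x, y)."""
    G = S_xy @ np.linalg.inv(S_yy)                 # gain Sigma_xy Sigma_yy^{-1}
    return mu_x + G @ (y - mu_y), S_xx - G @ S_xy.T
```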
