
23 Analyzing Time-Evolving Networks using an Evolving Cluster Mixed Membership Blockmodel

Qirong Ho
Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA

Eric P. Xing
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA

CONTENTS
23.1 Introduction
23.2 Related Work
23.3 Problem Formulation
23.4 Time-Evolving Network Models
    23.4.1 The Mixed Membership Stochastic Blockmodel (MMSB)
    23.4.2 Mixture of MMSBs (M3SB)
    23.4.3 Dynamic M3SB (dM3SB)
23.5 dM3SB Inference and Learning
    23.5.1 Parameter Estimation (M-step)
    23.5.2 Variational Inference
    23.5.3 Suitability of the Variational Approximation
23.6 Validation
    23.6.1 Synthetic Data
    23.6.2 Real Data
23.7 Case Study: U.S. Congress Voting Data
23.8 Conclusion
Appendix
References

Time-evolving networks are a natural representation for dynamic social and biological interactions. While latent space models are gaining popularity in network modeling and analysis, previous works mostly ignore networks with temporal behavior and multi-modal actor roles. Furthermore, prior knowledge, such as division and grouping of social actors or biological specificity of molecular functions, has not been systematically exploited in network modeling. In this chapter, we develop a network model featuring a state-space mixture prior that tracks complex actor latent role changes through time. We provide a fast variational inference algorithm for learning our model, and validate it with simulations and held-out likelihood comparisons on real-world, time-evolving networks. Finally, we demonstrate our model's utility as a network analysis tool, by applying it to United States Congress voting data.


23.1 Introduction

Social and biological systems can often be represented as a series of temporal networks over actors, and these networks may undergo systematic rewiring or experience large topological changes over time. The dynamics of these time-evolving networks pose many interesting questions. For instance, what are the latent roles played by these networked actors? How will these roles dictate the way two actors interact? Furthermore, how do actors play multiple roles (multi-functionality) in different social and biological contexts, and how does an actor's set of roles evolve over time? By knowing which actors play what roles as well as the relationships between different roles, we can gain insight as to how social or biological communities form in networks. For example, we can elucidate how actors with diverse role compositions group together, and how these groupings change over time.

In particular, we want network actors to be capable of multiple roles, because assuming a single role per actor may simply be too restrictive. As an example, consider a social network composed of working adults. We can imagine the participants play at least two roles: one when at work (say, being a manager or a worker), and one when at home (perhaps a parent, or possibly unmarried). These two classes of roles are orthogonal to each other, thus one cannot account for all network behaviors with just one class.

The time-evolving aspect of the network is equally important—we do not expect each actor's roles to remain static over time, but anticipate that they will change, giving rise to rewiring in the network. Returning to the previous example, we might imagine a newly-pregnant mother increasing her "parent" role, or a promoted employee shifting from worker to manager. In fact, multiple roles could change at once—a working father caught in an accident would be less active both as a worker and as a parent, for instance.

A final, crucial assumption is that the relationships between roles remain constant over time, like how a manager always delegates work to a subordinate, or how parents are always involved in raising children. This static relationship between roles provides a reference point for actor role mixtures to evolve over time; it is difficult to interpret actor role changes if the roles themselves are also changing! In fact, allowing both actor roles and role relationships to change arbitrarily makes for an ill-posed problem; it becomes unclear if a given network change should be explained in terms of actor roles or role relationships, or even a combination of both.

In this chapter, we present a mixed membership solution to understanding time-evolving networks, which we call a dynamic mixture of mixed membership stochastic blockmodels (dM3SB). This model employs the regular mixed membership stochastic blockmodel (MMSB) as the basic building block, but augments it with a multi-modal mixture prior that captures each actor's role-mixture trajectory in a statistically flexible manner. Essentially, we conjoin the MMSB with a set of state-space models, one over each mixture component, and each state-space trajectory corresponds to the average evolution of the role mixtures of a group of actors.

Compared to MMSB, this evolving mixture prior presents additional challenges to parameter learning and latent variable inference. We overcome these difficulties by developing a variational EM algorithm inspired by ideas from Ghahramani and Hinton (2000) and dMMSB (an earlier version of dM3SB) (Xing et al., 2010), which allow for efficient approximate inference and parameter learning. In the following sections, we first develop the dM3SB model and variational EM algorithm, after which we present validation experiments on both synthetic and real data. Finally, we conclude with a demonstration of dM3SB towards analyzing voting data from the United States Congress.


23.2 Related Work

There is increasing interest in employing latent space models for network analysis¹ (Hoff et al., 2002; Handcock et al., 2007; Heaukulani and Ghahramani, 2013; Soufiani and Airoldi, 2012), of which dM3SB is one kind. However, most of these models assume static networks and a single, fixed role for each actor. Hence, they cannot model evolution of multiple actor roles over time, making them unsuitable for analyzing complex temporal networks.

With respect to addressing these issues, Airoldi et al. (2008) provided a foundation with MMSB, which permits actors to have role mixtures instead of single roles. Later, Xing et al. (2010) developed a dynamic extension of MMSB, called dMMSB, which addresses temporal evolution of actor role mixtures. The dMMSB places a time-evolving, unimodal prior on all network actors; specifically, it employs a time-evolving logistic normal distribution similar to a state-space model.

Although an important first step towards dynamic network analysis, dMMSB offers very weak modeling power—because it employs a unimodal logistic normal for the role distribution of all actors, it is only applicable to networks where the role mixtures of all actors follow similar, unimodal dynamics. A direct solution might be to introduce a separate dynamic process for each actor, but not only is this computationally impractical for large networks with many actors, it is also statistically unsatisfactory from a Bayesian standpoint as the actors no longer share any common pattern and coupling, leaving the model prone to over-fitting and unable to support activity and anomaly detection.

This challenge naturally leads us to explore "evolving clusters" of actors. By modeling dynamic processes on clusters, rather than on individuals or on the whole network, we can increase inferential power while retaining a common, yet expressive, multi-modal mixture model prior over each actor. Such a prior allows dM3SB to accommodate the non-stationary and heterogeneous behaviors of actors.

23.3 Problem Formulation

We consider a sequence of interaction networks or graphs, denoted by {G^(t)}_{t=1}^T, where each G^(t) ≡ {V, E^(t)} represents the network observed at time t. We assume the set of actors V = {1, …, N} is constant. Furthermore, we permit E^(t) ≡ {e_ij^(t)}_{i,j=1}^{N,N}, the set of interactions between actors, to evolve with time. We ignore self edges e_ii^(t).

Our goal is to infer the time-evolving actor role mixtures that give rise to this network sequence. An actor's role mixture is essentially a probability distribution over network roles. For example, a person in a social network could be 0.5 manager and 0.5 parent, meaning that half of his interactions (and non-interactions) can be explained in terms of manager role behavior, while the other half can be explained in terms of parenting behavior. The precise definition of an actor role mixture will be made clear later.

We approach this problem by extending the mixed membership stochastic blockmodel (MMSB) (Airoldi et al., 2008), a static network model that treats each actor as having a mixture of network roles. The key modification is the addition of a time-evolving (i.e., dynamic) prior on top of the MMSB, which allows it to account for temporally-evolving network dynamics. This prior is a mixture of time-evolving logistic normal distributions, which is multi-modal, time-evolving, and captures correlations between roles. In particular, it is similar to the factorial hidden Markov model, for which variational inference techniques have been developed (Ghahramani and Hinton, 2000). With this prior, the resulting MMSB model is able to fit complex, time-evolving data densities that the static, unimodal, uncorrelated Dirichlet prior used in MMSB cannot.

¹ Also, see the chapter entitled "Mixed Membership Blockmodels for Dynamic Networks with Feedback" (Cho et al., 2014).

23.4 Time-Evolving Network Models

Rather than directly introduce the full dM3SB model, we shall start by introducing the regular MMSB, and gradually extend it to become dM3SB. We hope that this presentation will not only be easier to understand, but will also make the connection between MMSB and dM3SB more clear.

23.4.1 The Mixed Membership Stochastic Blockmodel (MMSB)

We begin by describing the mixed membership stochastic blockmodel (Airoldi et al., 2008), which serves as the foundation for our model. The MMSB is a static network model, meaning that we only consider one network E ≡ {e_ij}_{i,j=1}^{N,N}. Furthermore, it assumes each actor v_i ∈ V possesses a latent mixture of K roles, which determine observed network interactions. This role mixture formalizes the notion of actor multi-functionality, and we denote it by a normalized K × 1 vector π_i, referred to as a mixed membership or MM vector. We assume these vectors are drawn from some prior p(π).

Given MM vectors π_i, π_j for actors i and j, the network edge e_ij is stochastically generated as follows: first, actor i (the donor) picks one role z_{→ij} ∼ p(z|π_i) to interact with actor j. Next, actor j (the receiver) also picks one role z_{←ij} ∼ p(z|π_j) to receive the interaction from i. Both z_{→ij}, z_{←ij} are K × 1 unit indicator vectors. Finally, the chosen roles of i, j determine the network interaction e_ij ∼ p(e|z_{→ij}, z_{←ij}), where e_ij ∈ {0, 1}. The specific distributions over z_{→ij}, z_{←ij}, e_ij are:

• z→ij ∼ Multinomial(πi). Actor i’s donor role indicator.

• z←ij ∼ Multinomial(πj). Actor j’s receiver role indicator.

• e_ij ∼ Bernoulli(z_{→ij}^⊤ B z_{←ij}). Interaction outcome from actor i to j,

where B is a K × K role compatibility matrix. Intuitively, the bilinear form z_{→ij}^⊤ B z_{←ij} selects a single element of B; the indicators z_{→ij}, z_{←ij} behave like indices into B.

This generative model has two noteworthy features. First, observed relations E result from actor latent roles interacting. In the case of social networks, the latent roles are naturally interpretable as social functions, like manager, worker, parent, or single adult. Note that actor i's latent membership indicators z_{→i·}, z_{←·i} are unique to each interaction; he/she may assume different roles for interacting with each actor.

Second, the role compatibility matrix B completely determines the affinity between latent roles. For example, a diagonally-dominant B signifies that actors of the same role are more likely to interact. Conversely, off-diagonal entries in B suggest interactions between actors of different roles. The MMSB's expressive power lies in its ability to control the interaction strength between any pair of roles, by specifying the corresponding entries of B.
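To make these mechanics concrete, here is a minimal Python sketch of sampling one directed edge under MMSB; the MM vectors and B below are illustrative placeholders, not values from this chapter.

import numpy as np

rng = np.random.default_rng(0)

K = 3
pi_i = np.array([0.2, 0.5, 0.3])   # actor i's MM vector over K roles (placeholder)
pi_j = np.array([0.6, 0.1, 0.3])   # actor j's MM vector (placeholder)
B = rng.uniform(size=(K, K))       # role compatibility matrix (placeholder)

z_to = rng.choice(K, p=pi_i)       # donor role picked by i (index form of z_{→ij})
z_from = rng.choice(K, p=pi_j)     # receiver role picked by j (index form of z_{←ij})

# The bilinear form z_{→ij}^T B z_{←ij} simply indexes one entry of B.
e_ij = rng.binomial(1, B[z_to, z_from])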

An Example

We now provide a simple example to explain how MMSBs generate interactions. Say we have two social network actors i, j, with MM vectors:

• πi = [parent = 0.3, worker = 0.7],


• πj = [child = 1].

Let us assume that i is the biological father of j, and that the presence or absence of the directed edge e_ij signifies whether i has given orders to j. Finally, suppose that the role compatibility matrix has the following entries:

• Bparent,child = 0.5,

• Bworker,child = 0.01,

where we ignore the other entries of B as they are irrelevant to this discussion. Intuitively, this B reflects how people acting as parents are likely to order their children to do things, whereas people acting as (office) workers are unlikely to interact with children at all. Then, the probability of e_ij = 1 is computed as:

$$
\begin{aligned}
p(e_{ij} = 1 \mid \pi_i, \pi_j, B)
&= \sum_{z_{\to ij},\, z_{\gets ij}} p(e_{ij} = 1 \mid z_{\to ij}, z_{\gets ij}, B)\, p(z_{\to ij} \mid \pi_i)\, p(z_{\gets ij} \mid \pi_j) \\
&= p(e_{ij} = 1 \mid z_{\to ij} = \text{parent}, z_{\gets ij} = \text{child}, B)\, p(z_{\to ij} = \text{parent} \mid \pi_i)\, p(z_{\gets ij} = \text{child} \mid \pi_j) \\
&\quad + p(e_{ij} = 1 \mid z_{\to ij} = \text{worker}, z_{\gets ij} = \text{child}, B)\, p(z_{\to ij} = \text{worker} \mid \pi_i)\, p(z_{\gets ij} = \text{child} \mid \pi_j) \\
&= (0.5)(0.3)(1) + (0.01)(0.7)(1) \\
&= 0.15 + 0.007 \\
&= 0.157.
\end{aligned}
$$

We see that most of the interaction probability comes from the parent → child relationship, rather than the worker → child relationship.
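The same 0.157 falls out of the sum over role pairs; a small Python check of the worked example (role order parent/worker for i, child for j):

import numpy as np

pi_i = np.array([0.3, 0.7])    # [parent, worker]
pi_j = np.array([1.0])         # [child]
B = np.array([[0.5],           # B[parent, child]
              [0.01]])         # B[worker, child]

# p(e_ij = 1 | pi_i, pi_j, B) = sum_{k,l} pi_i[k] * B[k, l] * pi_j[l]
p_edge = pi_i @ B @ pi_j
print(p_edge)                  # 0.157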

23.4.2 Mixture of MMSBs (M3SB)

The actor MM prior p(π) significantly affects MMSB's expressive power. In the previous section, we saw that MMSB uses a Dirichlet prior, which is conjugate to the multinomial role indicator distribution p(z|π). The advantage of this conjugacy is that one can derive a clean variational inference algorithm (Airoldi et al., 2008). However, a Dirichlet prior over roles is fairly restrictive in a statistical sense: it is not multi-modal and cannot capture correlations between roles.

To overcome these shortcomings, we shall extend the MMSB by making p(π) a logistic normal mixture prior, which is both multi-modal (due to the mixture) and permits correlations (due to the normal distribution). This adds the following generative process over the MM vectors π:

• ci ∼ Multinomial(δ). Mixture component indicator.

• γi ∼ Normal(µ_{ci}, Σ_{ci}). Unnormalized MM vector.

• πi = Logistic(γi). Logistic-transformed MM vector, where [Logistic(γ)]_k = exp{γ_k} / Σ_{l=1}^K exp{γ_l}.

Combining this generative process over π with the MMSB model gives rise to what we call a mixture of MMSBs (M3SB). Here, ci is a C × 1 cluster selection indicator for πi, where C is the number of mixture components. Thus, πi is drawn from a logistic normal distribution with mean and covariance selected by ci, while ci itself is drawn from a prior multinomial distribution δ.
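As a sketch, drawing a single MM vector under this mixture prior might look as follows in Python; all parameter values below are placeholders, not fitted quantities.

import numpy as np

rng = np.random.default_rng(0)

K, C = 3, 2
delta = np.array([0.5, 0.5])              # mixture weights (placeholder)
mu = rng.normal(size=(C, K))              # component means (placeholder)
Sigma = np.stack([np.eye(K)] * C)         # component covariances (placeholder)

c_i = rng.choice(C, p=delta)              # mixture component indicator c_i
gamma_i = rng.multivariate_normal(mu[c_i], Sigma[c_i])   # unnormalized MM vector
pi_i = np.exp(gamma_i) / np.exp(gamma_i).sum()           # logistic transform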

The M3SB accounts for role correlations using its logistic normal distribution, and has the flexibility to fit complex data densities by virtue of its multi-modal mixture prior. In the next section, we shall exploit these properties to design a time-varying network model that tracks the role mixture trajectories of clusters of actors. This is in contrast to the dMMSB model of Xing et al. (2010), which tracks a single, average trajectory.

23.4.3 Dynamic M3SB (dM3SB)

In a time-evolving network, we assume that the actor MM vectors π^(t) and their prior p^(t)(π) change with time, and the goal is to infer their dynamic trajectories. Inferring the dynamic actor MM vectors allows us to detect large-scale temporal network trends, particularly groups of actors whose MM vectors π shift from one set of roles to another. For example, if a company suddenly goes out of business, then its employees will also shift from the "worker" role to the "unemployed" role.

In order to model time-evolution in the network, we place a state-space model on every logistic normal distribution in the mixture prior p(π), similar to a Kalman filter. Let N denote the number of actors and T the number of time points in the evolving network. Also, let K denote the number of MMSB latent roles and C the number of mixture components. We begin with an outline of our full generative process; see Figure 23.1 for a graphical model representation.

FIGURE 23.1: Graphical model representation of dM3SB, comprising (1) the mixture state-space model over the cluster means µ_h^(t), (2) the mixture component indicators c_i^(t), and (3) the mixed membership stochastic blockmodel (MMSB) variables γ_i^(t), z_{→ij}^(t), z_{←ij}^(t), and e_ij^(t) at each time point.


1. Mixture State-Space Model for MM Vectors

• µ_h^(1) ∼ Normal(ν, Φ) for h = 1 … C. Mixture means for the MM prior at t = 1.

• µ_h^(t) ∼ Normal(µ_h^(t−1), Φ) for h = 1 … C, t = 2 … T. Mixture means for t > 1.

2. Mixture Component Indicators

• {c_i^(t)}_{i=1}^N ∼ Multinomial(δ) for t = 1 … T. Mixture indicator for each MM vector.

3. Mixed Membership Stochastic Blockmodel

• {γ_i^(t)}_{i=1}^N ∼ Normal(µ^(t)_{c_i^(t)}, Σ_{c_i^(t)}) for t = 1 … T. Unnormalized MM vectors according to the mixture indicated by c_i^(t).

• π_i^(t) = Logistic(γ_i^(t)), where [Logistic(γ)]_k = exp{γ_k} / Σ_{l=1}^K exp{γ_l}. Logistic transform of γ_i^(t) into MM vector π_i^(t).

• For every actor pair (i, j ≠ i) and every time point t = 1 … T:

  – z_{→ij}^(t) ∼ Multinomial(π_i^(t)). Actor i's donor role indicator.

  – z_{←ij}^(t) ∼ Multinomial(π_j^(t)). Actor j's receiver role indicator.

  – e_ij^(t) ∼ Bernoulli((z_{→ij}^(t))^⊤ B z_{←ij}^(t)). Interaction outcome from actor i to j.

We refer to this model as the dynamic mixture of MMSBs (dM3SB for short). The general idea is to apply the state-space model (SSM) used in object tracking to the MMSB model. Specifically, the MMSB becomes the emission model to the SSM; a distinct MMSB model is "emitted" at each time point (Figure 23.1). Furthermore, the SSM contains C distinct trajectories µ_h, each modeling the mean trajectory for a subset of MM vectors π_i^(t). The SSM has two parameters ν, Φ, representing the prior mean and variance of the C trajectories. Each trajectory evolves according to a linear transition model µ_h^(t) = A µ_h^(t−1) + w_h^(t), where A is a transition matrix and w_h^(t) ∼ Normal(0, Φ) is Gaussian transition noise. We assume A to be the identity matrix, which corresponds to random walk dynamics; generalization to arbitrary A is straightforward.

Each MM vector π_i^(t) is then drawn from one of the C trajectories µ_h^(t). The choice of trajectory for π_i^(t) is given by the indicator vector c_i^(t), which is drawn from some prior. For simplicity, we have used a single multinomial prior with parameter δ for all c_i^(t). Observe that c_i^(t) can change over time, allowing actors to switch clusters if that would fit the data better. Given c_i^(t), the MM vector π_i^(t) is drawn according to LN(µ^(t)_{c_i^(t)}, Σ_{c_i^(t)}), where the variances Σ_1, …, Σ_C are model parameters. LN denotes a logistic normal distribution, the result of applying a logistic transformation to a normal distribution.

Once {π_i^(t)}_{i=1}^N have been drawn for some t, the remaining variables z_{→ij}^(t), z_{←ij}^(t), e_ij^(t) follow the MMSB exactly. We assume the role compatibility matrix B to be a model parameter, although we note that more sophisticated assumptions can be found in the literature, such as a state-space model prior (Xing et al., 2010).
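Putting the three pieces together, the following Python sketch samples from the full dM3SB generative process, with A fixed to the identity as assumed above; the dimensions and parameter values are illustrative placeholders only.

import numpy as np

rng = np.random.default_rng(0)
N, T, K, C = 20, 5, 3, 2

nu, Phi = np.zeros(K), np.eye(K)            # SSM prior mean and transition noise
Sigma = np.stack([0.1 * np.eye(K)] * C)     # per-cluster emission covariances
delta = np.full(C, 1.0 / C)                 # cluster prior
B = rng.uniform(size=(K, K))                # role compatibility matrix

E = np.zeros((T, N, N), dtype=int)
mu = np.zeros((T, C, K))
for t in range(T):
    for h in range(C):
        prev = nu if t == 0 else mu[t - 1, h]           # A = I: random walk
        mu[t, h] = rng.multivariate_normal(prev, Phi)
    c = rng.choice(C, p=delta, size=N)                  # cluster indicators c_i^(t)
    gamma = np.array([rng.multivariate_normal(mu[t, c[i]], Sigma[c[i]])
                      for i in range(N)])               # unnormalized MM vectors
    pi = np.exp(gamma) / np.exp(gamma).sum(axis=1, keepdims=True)
    for i in range(N):
        for j in range(N):
            if i == j:
                continue                                # self edges are ignored
            z_to = rng.choice(K, p=pi[i])               # donor role
            z_from = rng.choice(K, p=pi[j])             # receiver role
            E[t, i, j] = rng.binomial(1, B[z_to, z_from])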

23.5 dM3SB Inference and Learning

As with other mixed membership models, neither exact latent variable inference nor parameter learning is computationally tractable in dM3SB. The mixture prior on π_i^(t), a factorial hidden Markov model, presents the biggest difficulty—it is analytically un-integrable, its likelihood is subject to many local maxima, and it requires exponential time for exact inference. Moreover, its logistic normal distribution does not admit closed-form integration with the multinomial distribution of z|π. Finally, the space of possible discrete role indicators z is exponentially large in the number of actors N and time points T.

We address all these difficulties with a variational EM procedure (Ghahramani and Beal, 2001) based on the generalized mean field (GMF) algorithm (Xing et al., 2003), and using techniques from Ghahramani and Hinton (2000) and dMMSB (Xing et al., 2010). Our algorithm simultaneously performs inference and learning for dM3SB in a computationally-effective fashion.

Throughout this section, we shall present just the final dM3SB update equations. For more thorough derivations, the interested reader is referred to the Appendix.

Briefly, variational inference attempts to approximate the true posterior distribution with a simpler factored distribution on which inference is computationally more tractable. Let Θ = {ν, Φ, {Σ_h}_{h=1}^C, δ, B} denote all model parameters. We approximate the joint posterior p({z^(t), γ^(t), c^(t), {µ_h^(t)}_{h=1}^C}_{t=1}^T | {E^(t)}_{t=1}^T; Θ) by a variational distribution over factored marginals,

$$
q = q_\mu\left( \{\mu_h^{(t)}\}_{t,h}^{T,C} \right) \prod_{t,i=1}^{T,N} \left[ q_\gamma(\gamma_i^{(t)})\, q_c(c_i^{(t)}) \prod_{j=1}^N q_z(z_{\to ij}^{(t)}, z_{\gets ij}^{(t)}) \right].
$$

The variational factors qz, qγ, and qc are the marginal distributions over the MMSB latent variables z, γ, and mixture indicators c, respectively. The last variational factor qµ is the marginal distribution over the mixture of C SSMs over time. The idea is to approximate latent variable inference under p (intractable) with feasible inference under q. In particular, Ghahramani and Hinton (2000) have demonstrated that it is feasible to have one marginal qµ over all µs.

The GMF algorithm maximizes a lower bound on the marginal distribution p({E^(t)}_{t=1}^T; Θ) over arbitrary choices of qz, qγ, qc, qµ. We use the GMF solutions to the variational distributions q as the E-step of our variational EM algorithm, and derive the M-step through direct maximization of our variational lower bound with respect to Θ. Under GMF, the optimal solution to a marginal q(X) for some latent variable set X is p(X | Y, E_q[φ(MB_X)]), the distribution of X conditioned on the observed variables Y and the expected exponential family sufficient statistics (under variational distribution q) of X's Markov blanket variables (Xing et al., 2003). Hence, our E-step iteratively computes q(X) := p(X | {E^(t)}_{t=1}^T, E_q[φ(MB_X)]) for X = {µ_h^(t)}_{t,h}^{T,C}, γ_i^(t), c_i^(t), and {z_{→ij}^(t), z_{←ij}^(t)}. For brevity, we present only the final E-step equations; exact derivations can be found in the Appendix.

E-step for qz:

From here, we drop time indices t whenever appropriate. qz is a categorical distribution over K² elements,

$$
q_z(z_{\to ij} = k, z_{\gets ij} = l) \sim \mathrm{Multinomial}(\omega_{(ij)}), \qquad
\omega_{(ij)kl} \propto (B_{kl})^{e_{ij}} (1 - B_{kl})^{1 - e_{ij}} \exp\left( \langle \gamma_{ik} \rangle + \langle \gamma_{jl} \rangle \right), \qquad (23.1)
$$

where ω_(ij) is a normalized K² × 1 vector indexed by (k, l).² The notation ⟨X⟩ denotes the expectation of X under q; for example, the expectations of z under qz are ⟨z_{(→ij)k}⟩ := Σ_l ω_{(ij)kl} and ⟨z_{(←ij)l}⟩ := Σ_k ω_{(ij)kl}.

² k, l correspond to the roles indicated by z_{i→j}, z_{i←j}.
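In code, this update builds one K × K table per pair (i, j) and normalizes over both indices; a minimal numpy sketch, where the expectations ⟨γ_i⟩, ⟨γ_j⟩ are assumed to come from the current qγ:

import numpy as np

def qz_update(e_ij, B, gamma_i_mean, gamma_j_mean):
    # omega[k, l] ∝ B[k,l]^e_ij * (1 - B[k,l])^(1 - e_ij) * exp(<gamma_ik> + <gamma_jl>)
    lik = B if e_ij == 1 else 1.0 - B
    omega = lik * np.exp(gamma_i_mean[:, None] + gamma_j_mean[None, :])
    omega /= omega.sum()               # normalize over all K^2 entries (k, l)
    z_to_mean = omega.sum(axis=1)      # <z_{(→ij)k}> = Σ_l omega[k, l]
    z_from_mean = omega.sum(axis=0)    # <z_{(←ij)l}> = Σ_k omega[k, l]
    return omega, z_to_mean, z_from_mean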


E-step for qγ:

qγ does not have a closed form, because the logistic-normal distribution of γ is not conjugate to the multinomial distribution of z. We apply a Laplace approximation to qγ, making it normally distributed (Xing et al., 2010; Ahmed and Xing, 2007). Define Ψ(a, b, C) := exp{−(1/2)(a − b)^⊤ C^{−1} (a − b)}. The approximation to qγ is

$$
q_\gamma(\gamma_i) \propto \Psi(\gamma_i, \tau_i, \Lambda_i), \qquad (23.2)
$$

where

$$
\begin{aligned}
\Lambda_i &= \left( (2N - 2) H_i + \sum_{h=1}^C \Sigma_h^{-1} \langle c_{ih} \rangle \right)^{-1}, \\
\tau_i &= u + \Lambda_i \left\{ \sum_{j \ne i}^N \left( \langle z_{\to ij} \rangle + \langle z_{\gets ji} \rangle \right) - (2N - 2) \left( g_i + H_i (u - \hat\gamma_i) \right) \right\}, \\
u &= \left( \sum_{h=1}^C \Sigma_h^{-1} \langle c_{ih} \rangle \right)^{-1} \left( \sum_{h=1}^C \Sigma_h^{-1} \langle c_{ih} \rangle \langle \mu_h \rangle \right),
\end{aligned}
$$

γ̂_i is a Taylor expansion point, and g_i and H_i are the gradient and Hessian of the log-partition function log Σ_{l=1}^K exp γ_{i,l} evaluated at γ_i = γ̂_i. We set γ̂_i to ⟨γ_i⟩ from the previous E-step iteration, keeping the expansion point close to the current expectation of γ_i.
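The gradient and Hessian of the log-partition term have the usual softmax form, H_i = diag(g_i) − g_i g_i^⊤ (see the Appendix); a short numpy sketch:

import numpy as np

def logsumexp_grad_hess(gamma_hat):
    # g_i = softmax(gamma_hat); H_i = diag(g_i) - g_i g_i^T
    g = np.exp(gamma_hat - gamma_hat.max())   # shift for numerical stability
    g /= g.sum()
    H = np.diag(g) - np.outer(g, g)
    return g, H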

E-step for qc:

qc is a discrete distribution over C elements,

$$
q_c(c_i = h) \propto \delta_h\, |\Sigma_h|^{-1/2} \exp\left\{ -\frac{1}{2} \mathrm{tr}\left[ \Sigma_h^{-1} \left( \langle \gamma_i \gamma_i^\top \rangle - \langle \mu_h \rangle \langle \gamma_i \rangle^\top - \langle \gamma_i \rangle \langle \mu_h \rangle^\top + \langle \mu_h \mu_h^\top \rangle \right) \right] \right\}.
$$

Note the dependency on the second-order moments ⟨γ_i γ_i^⊤⟩ and ⟨µ_h µ_h^⊤⟩. Since qγ and qµ are Gaussian, these moments are simple to compute.
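A direct transcription of this update, with all moments assumed precomputed from qγ and qµ (a sketch, with hypothetical argument names):

import numpy as np

def qc_update(delta, Sigma, gamma_mean, gamma_outer, mu_mean, mu_outer):
    # Moments per the text: gamma_outer = <gamma_i gamma_i^T>,
    # mu_mean[h] = <mu_h>, mu_outer[h] = <mu_h mu_h^T>.
    C = len(delta)
    log_q = np.empty(C)
    for h in range(C):
        M = (gamma_outer
             - np.outer(mu_mean[h], gamma_mean)
             - np.outer(gamma_mean, mu_mean[h])
             + mu_outer[h])
        _, logdet = np.linalg.slogdet(Sigma[h])
        log_q[h] = (np.log(delta[h]) - 0.5 * logdet
                    - 0.5 * np.trace(np.linalg.solve(Sigma[h], M)))
    log_q -= log_q.max()               # stabilize before exponentiating
    q = np.exp(log_q)
    return q / q.sum()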

E-step for qµ:

The GMF solution to qµ factors across clusters h:

$$
q_\mu\left( \{\mu_h^{(t)}\}_{t,h}^{T,C} \right) := \prod_{h=1}^C q_{\mu,h}\left( \{\mu_h^{(t)}\}_{t}^{T} \right), \qquad (23.3)
$$

where

$$
q_{\mu,h}\left( \{\mu_h^{(t)}\}_{t}^{T} \right) \propto \Psi(\mu_h^{(1)}, \nu, \Phi)\, \mathrm{Ob}(1, h) \prod_{t=2}^T \Psi(\mu_h^{(t)}, \mu_h^{(t-1)}, \Phi)\, \mathrm{Ob}(t, h),
$$

$$
\mathrm{Ob}(t, h) := \Psi\left( \frac{\sum_{i=1}^N \langle c_{ih}^{(t)} \rangle \langle \gamma_i^{(t)} \rangle}{\sum_{i=1}^N \langle c_{ih}^{(t)} \rangle},\; \mu_h^{(t)},\; \frac{\Sigma_h}{\sum_{i=1}^N \langle c_{ih}^{(t)} \rangle} \right).
$$

Notice that the factor q_{µ,h}({µ_h^(t)}_t^T) resembles a state-space model for cluster h, with "observation probability" at time t proportional to Ob(t, h). Hence the mean and covariance of each µ can be efficiently computed using the Kalman smoothing algorithm.
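The text names only the Kalman smoothing step; as a sketch, one could apply a standard random-walk Rauch-Tung-Striebel smoother per cluster h, treating the Ob(t, h) means and covariances as pseudo-observations y_t and R_t. This is generic smoother code under those assumptions, not code from the chapter.

import numpy as np

def rts_smoother(y, R, nu, Phi):
    # Random-walk state-space model: mu_t ~ N(mu_{t-1}, Phi), mu_1 ~ N(nu, Phi),
    # pseudo-observation y_t ~ N(mu_t, R_t) taken from Ob(t, h) above.
    T, K = y.shape
    m_pred, P_pred = np.zeros((T, K)), np.zeros((T, K, K))
    m_filt, P_filt = np.zeros((T, K)), np.zeros((T, K, K))
    for t in range(T):                                    # forward (filter) pass
        m_pred[t] = nu if t == 0 else m_filt[t - 1]
        P_pred[t] = Phi if t == 0 else P_filt[t - 1] + Phi
        G = P_pred[t] @ np.linalg.inv(P_pred[t] + R[t])   # Kalman gain
        m_filt[t] = m_pred[t] + G @ (y[t] - m_pred[t])
        P_filt[t] = (np.eye(K) - G) @ P_pred[t]
    m_smooth, P_smooth = m_filt.copy(), P_filt.copy()
    for t in range(T - 2, -1, -1):                        # backward (smoothing) pass
        J = P_filt[t] @ np.linalg.inv(P_pred[t + 1])
        m_smooth[t] = m_filt[t] + J @ (m_smooth[t + 1] - m_filt[t])
        P_smooth[t] = P_filt[t] + J @ (P_smooth[t + 1] - P_pred[t + 1]) @ J.T
    return m_smooth, P_smooth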


Input: Temporal sequence of networks {G^(t)}_{t=1}^T.
Output: Variational distributions qz, qγ, qc, qµ and model parameters B, δ, ν, Φ, {Σ_h}_{h=1}^C.

Initialize parameters B, δ, ν, Φ, {Σ_h}_{h=1}^C.
Sample initial values for µ^(t), γ^(t), c^(t).
repeat
    repeat
        Update qz(z_{i→j}^(t), z_{i←j}^(t)) for all i, j, t.
        Update B.
        Update qγ(γ_i^(t)) for all i, t.
    until convergence
    Update qµ({µ_h^(t)}_{t,h=1}^{T,C}).
    Update ν, Φ.
    Update qc(c_i^(t)) for all i, t.
    Update δ, {Σ_h}_{h=1}^C.
until convergence

Algorithm 1: Variational EM for dM3SB.

23.5.1 Parameter Estimation (M-step)

Given GMF solutions to each q from our E-step, we take our variational lower bound on the log marginal likelihood and maximize it jointly with respect to all parameters Θ (for details, refer to the Appendix). Let S(A) := A + A^⊤. The parameter solutions are:

$$
\hat\beta_{kl} := \frac{\sum_{t,i,j \ne i}^{T,N,N} \omega^{(t)}_{(ij)kl}\, e^{(t)}_{ij}}{\sum_{t,i,j \ne i}^{T,N,N} \omega^{(t)}_{(ij)kl}}, \qquad
\hat\nu := \frac{1}{C} \sum_{h}^{C} \langle \mu_h^{(1)} \rangle, \qquad
\hat\delta := \frac{1}{TN} \sum_{t,i}^{T,N} \langle c_i^{(t)} \rangle,
$$

$$
\hat\Phi := \frac{1}{TC} \sum_{h=1}^C \left[ \langle \mu_h^{(1)} \mu_h^{(1)\top} \rangle - S\left( \langle \mu_h^{(1)} \rangle \hat\nu^\top \right) + \hat\nu \hat\nu^\top + \sum_{t=2}^T \left( \langle \mu_h^{(t)} \mu_h^{(t)\top} \rangle - S\left( \langle \mu_h^{(t)} \mu_h^{(t-1)\top} \rangle \right) + \langle \mu_h^{(t-1)} \mu_h^{(t-1)\top} \rangle \right) \right],
$$

$$
\hat\Sigma_h := \frac{\sum_{t,i}^{T,N} \langle c_{ih}^{(t)} \rangle \left[ \langle \gamma_i^{(t)} \gamma_i^{(t)\top} \rangle - S\left( \langle \gamma_i^{(t)} \rangle \langle \mu_h^{(t)} \rangle^\top \right) + \langle \mu_h^{(t)} \mu_h^{(t)\top} \rangle \right]}{\sum_{t,i}^{T,N} \langle c_{ih}^{(t)} \rangle}.
$$

Our full inference and learning algorithm is summarized in Algorithm 1. This algorithm interleaves the E-step and M-step equations, yielding a coordinate ascent algorithm in the space of variational and model parameters. The algorithm is guaranteed to converge to a local optimum in our variational lower bound, and we use multiple random restarts to approach the global optimum. Similar to the MMSB variational EM algorithm (Airoldi et al., 2008), we update qz, qγ, and B more frequently for improved convergence. Note that each random restart can be run on a separate computational thread, making dM3SB easily parallelizable and therefore highly scalable.

23.5.2 Variational Inference

The E-step consists of the GMF updates for qz, qγ, qc, and qµ presented above; their full derivations are given in the Appendix.

23.5.3 Suitability of the Variational Approximation

Given that our true model is multi-modal, our variational approximation will only be useful if it also fits multi-modal data. Historically, naive mean field approximations, such as those used in latent space models like MMSB (Airoldi et al., 2008) and latent Dirichlet allocation (Blei et al., 2003), approximate all latent variables with unimodal variational distributions. These unimodal distributions are unlikely to fit multi-modal densities well; instead, we employ a structured mean field approximation that approximates all µs with a single, multi-modal switching state-space distribution qµ(·), essentially a collection of C Kalman filters. This ensures that the multi-modal structure of the prior on the MM vectors γ_i^(t) is not lost. Moreover, although each qγ(γ_i^(t)) for a given i, t is a unimodal Gaussian, it can be fitted to any mode in qµ(·), independently of qγ(γ_i^(t)) for other i, t. This flexibility ensures the variational posterior over all γ_i^(t) remains multi-modal.

23.6 Validation

To validate dM3SB, we need to show that it fits multi-modal, correlated, time-varying data better than alternative models. For this purpose, we shall compare dM3SB to its unimodal predecessor dMMSB (Xing et al., 2010), and show that it improves over the latter in multiple respects, on both synthetic and real-world data. Later, we shall conduct a case study on a real-world dataset to demonstrate dM3SB's capabilities.

In the experiments that follow, we ran our algorithm for 50 outer loop iterations per random restart, with 5 iterations per inner loop. We also fixed Φ = I_K and δ = 1/C instead of running their M-steps, as we found this yields more stable results. For the remaining parameters, we used their M-steps with the following initializations: B_kl ∼ Uniform(0, 1), Σ_h = I_K. As for ν, we initialized ⟨µ_h^(1)⟩ ∼ Uniform([−1, 1]^K) for all h and set ν to their average. The remaining variational parameters were initialized via the generative process.

23.6.1 Synthetic Data

Previously, Xing et al. (2010) compared the performance of the dMMSB time-varying model against a naive sequence of disjoint MMSBs, one per network time point. In particular, when the roles are correlated, the logistic-normal prior provides a better fit to the data than the Dirichlet prior. Moreover, for time-varying networks, dMMSB provides a better fit than disjoint MMSBs on every time point.

We now demonstrate that dM3SB's multi-modal prior is an even better fit to time-varying network data than dMMSB's unimodal prior. In this experiment, we shall compare dM3SBs to dMMSBs in terms of model fit (measured by the log marginal likelihood) and actor MM recovery. We generate data with N = 200 actors and T = 5 time points, and assume a K = 3 role compatibility matrix B = (B1, B2, B3)^⊤, with rows B1 = (1, .25, 0), B2 = (0, 1, .25), and B3 = (0, 0, 1). The actors are divided into four groups of 50, with the first three groups having true MM vectors (.9, .05, .05), (.05, .9, .05), and (.05, .05, .9), respectively, for all time points. The last group has MM vectors that move over time, according to the sequence π^(1) = (.6, .3, .1), π^(2) = (.3, .6, .1), π^(3) = (.1, .8, .1), π^(4) = (.1, .6, .3), π^(5) = (.1, .3, .6). The generated B, MM vectors π, and networks E^(t) are visualized in Figure 23.2.
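For reference, this ground truth is easy to reproduce; a short numpy sketch of the setup (the network sampling itself follows the generative process of Section 23.4.3):

import numpy as np

B = np.array([[1.0, 0.25, 0.0],        # B1
              [0.0, 1.0, 0.25],        # B2
              [0.0, 0.0, 1.0]])        # B3

static = [(.9, .05, .05), (.05, .9, .05), (.05, .05, .9)]
moving = [(.6, .3, .1), (.3, .6, .1), (.1, .8, .1), (.1, .6, .3), (.1, .3, .6)]

# pi[t, i] is actor i's true MM vector at time t: four groups of 50 actors.
pi = np.zeros((5, 200, 3))
for g, mm in enumerate(static):
    pi[:, 50 * g: 50 * (g + 1)] = mm   # groups 1-3 hold static MM vectors
for t, mm in enumerate(moving):
    pi[t, 150:] = mm                   # group 4's MM vector moves over time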

Thus far, we have not addressed model selection—specifically, selection of the number of roles K and the number of mixture components (clusters) C. To do so, we performed a gridsearch over K ∈ {2, 3, 4, 5, 6} and C ∈ {1, 2, 3, 4, 5} on the full network, using 200 random restarts per (K, C) combination. For all combinations, we observed convergence well within our limit of 50 outer iterations. Furthermore, completing all 200 restarts for each K, C took between 8 hours (K = 2, C = 1) and 28 hours (K = 6, C = 5) on a single processor. Since the random restarts can be run in parallel, with sufficient computing power one could easily scale dM3SB to much larger time-varying networks with thousands of actors and tens of time points.

FIGURE 23.2: Synthetic data ground truth visualization. Top row: adjacency matrix visualizations, beginning on the left with t = 1 using random actor ordering, followed by t = 1, …, 5 with actors grouped according to the ground truth. Bottom left: the role compatibility matrix B, shown as a graph; circles represent roles, and numbered arrows represent interaction probabilities. Bottom row: true actor MM plots in the 3-role simplex for each t; blue, green, and red crosses denote the static MMs of the first 3 actor groups, and the cyan circle denotes the moving MM of the last actor group.

For each (K, C) from the gridsearch, we selected its best random restart using the variational lower bound with a Bayesian information criterion (BIC) penalty. The best restart BIC scores are plotted in Figure 23.3; note that dMMSB corresponds to the special case C = 1. The optimal BIC score selects the correct number of roles K = 3 and clusters C = 4, making it a good substitute for held-out model selection.

FIGURE 23.3: Synthetic data: BIC scores (over K ∈ {2, …, 6} and C ∈ {1, …, 5}, where C = 1 corresponds to dMMSB) and 5-fold heldout log-likelihoods for dM3SB (K = 3, C = 4) and dMMSB (K = 3).

Next, using the BIC-optimal (K, C), we ran dM3SB on a 5-fold heldout experiment. In each fold, we randomly partitioned the dataset's actors into two equal sets, and used the two corresponding subnetworks as training and test data. In each training fold, we selected the best model parameters Θ from 100 random restarts using the variational lower bound. We then estimated the log marginal likelihood for these parameters on the corresponding test fold, using Monte Carlo integration with 2000 samples. This process was repeated for all 5 folds to get an average log marginal likelihood for dM3SB. For comparison, we conducted the same heldout experiment for a dMMSB set to K from the optimal (K, C) pair. The average log marginal likelihood for both methods is shown in Figure 23.3, and we see that dM3SB's greater heldout likelihood makes it a better statistical fit to this synthetic dataset than dMMSB.

Finally, we compared dM3SB to dMMSB in role estimation (B) and actor role recovery (π_i^(t)), using their best restarts on the correct (K, C) (or just K for dMMSB). Table 23.1 shows, for both methods versus the ground truth, the average ℓ2 error in π_i^(t)—specifically, we compared the ground truth to π_i^(t)'s posterior mean from either method—as well as the total variation in B. dM3SB's average ℓ2 error in π_i^(t) is significantly lower than dMMSB's, at the cost of a higher total variation in B. However, dM3SB's total variation of 0.1083 implies an average difference of only 0.012 in each of the nine entries of B, which is already quite accurate. The fact that dM3SB accurately recovers π_i^(t) confirms that its posterior over all π_i^(t) is multi-modal, which validates our variational approximation.

TABLE 23.1
Synthetic data: estimation accuracy of dM3SB (K = 3, C = 4) and dMMSB (K = 3).

    dM3SB role matrix B, total variation:          0.1083
    dMMSB role matrix B, total variation:          0.0135
    dM3SB MMs π_i^(t), mean ℓ2 difference:         0.0266
    dMMSB MMs π_i^(t), mean ℓ2 difference:         0.0477

We also note that dM3SB's mean cluster trajectories ⟨µ_h^(t)⟩ accurately estimated the four groups' mean MM vectors, with a maximum ℓ2 error of 0.0761 for any group h and time t, except at t = 5, where dM3SB exchanged group 3's trajectory with that of (moving) group 4. In conclusion, we have seen that dM3SB provides a better fit to this synthetic dataset than dMMSB, thanks to the former's multi-modal prior.

23.6.2 Real Data

We now assess the model fitness of both dM3SB and dMMSB on two real-world datasets: a 151-actor subset of the Enron email communications dataset (Shetty and Adibi, 2004) over the 12 months of 2001, and a 100-actor subset of the United States Congress voting data over the 8 quarters of 2005 and 2006 (described in the next section). As with the synthetic data, we shall use heldout log-likelihood to measure how well each model fits the data.

For both datasets, we first selected the optimal values of (K, C) via BIC score gridsearch with dM3SB over K ∈ {3, 4, 5, 6}, C ∈ {2, 3, 4, 5}. Our previous synthetic experiment has demonstrated that model gridsearch using BIC produces good results. The optimal values were K = 4, C = 2 for the Senator dataset, and K = 3, C = 4 for the Enron dataset (Figure 23.4).

Using each dataset's optimal (K, C), we next ran dM3SB on the 5-fold heldout experiment discussed in the previous section, obtaining average log marginal likelihoods. For comparison, we conducted the same heldout experiments for dMMSB set to K from the optimal (K, C) pair.

Plots of the heldout log marginal likelihoods for dM3SB and dMMSB can be found in Figure 23.4. On the Senator dataset, dM3SB has the higher log marginal likelihood, implying that it is a better statistical fit than dMMSB. For the Enron dataset, both methods have the same likelihood, showing that using dM3SB with more mixture components at least incurs no statistical cost over dMMSB. These results demonstrate that dM3SB's multi-modal prior is a better fit to some real-world, time-varying networks, compared to dMMSB's unimodal prior.

FIGURE 23.4: Senator/Enron data: BIC scores (over K ∈ {3, …, 6} and C ∈ {2, …, 5}) and 5-fold heldout log-likelihoods for dM3SB (Senator: K = 4, C = 2; Enron: K = 3, C = 4) and dMMSB (Senator: K = 4; Enron: K = 3).

23.7 Case Study: U.S. Congress Voting Data

We finish our discussion with an application of dM3SB to the United States 109th Congress voting records. Here, we will show that dM3SB not only recovers MM vectors and a role compatibility matrix that match our intuitive expectations of the data, but that the MM vectors are useful for identifying outliers and other unusual phenomena.

The 109th Congress involved 100 senators and 542 bills spread over the dates January 1, 2005 through December 31, 2006. The original voting data³ is provided in the form of yes/no votes for each senator and each bill. In order to create a time-varying network suitable for dM3SB, we applied the method of Kolar et al. (2008) to recreate their network result.

The generated time-varying network contains 100 actors (senators) and 8 time points corresponding to 3-month epochs starting on January 1, 2005 and ending on December 31, 2006. The network is an undirected graph, where an edge between two senators indicates that their votes were mostly similar during that particular epoch. Conversely, a missing edge indicates that their votes were mostly different. Our intention is to discover how the political allegiances of different senators shifted from 2005 to 2006.

³ Available at http://www.senate.gov.

For our analysis, we used the optimal dM3SB restart from the BIC gridsearch described in the previous held-out experiment. Recall that this optimal restart uses K = 4 roles and C = 2 clusters. The learned MM vectors π_i, compatibility matrix B, and most probable cluster assignments are summarized in Figure 23.5. The results are intuitive: Democratic party members have a high proportion of Role 1, while Republican party members have a high proportion of Role 2. Both Roles 1 and 2 interact exclusively with themselves, reflecting the tendency of both political parties to vote with their comrades and against the other party. The remaining two roles exhibit no interactions; senators with high proportions of these roles are unaligned and unlikely to vote with either political party. Observe that the two clusters perfectly capture party affiliations—Democratic senators are almost always in cluster 1, while Republican senators are almost always in cluster 2.

While it is reassuring to see results that reflect a contemporary understanding of U.S. politics, a more useful application of dM3SB's mixed membership analysis is in identifying outliers. For instance, consider the Democrat Senator Ben Nelson (#75): from t = 1 through 7, his votes were unaligned with either Democrats or Republicans, though his votes were gradually shifting towards Republican. At t = 8 (the end of 2006), his voting becomes strongly Republican (Role 2), and he shifts from the Democrat cluster (1) to the Republican one (2). Sen. Nelson's trajectory through the role simplex is plotted in Figure 23.6. Incidentally, Sen. Nelson was re-elected as the Senator from Nebraska in late 2006, winning a considerable percentage of his state's Republican vote.

Next, observe how the senator from New Jersey, #28, started off unaligned from t = 1 to 4 but ended up Democratic from t = 5 to 8; his role trajectory is also plotted in Figure 23.6. There is an interesting reason for this: the seat for New Jersey was occupied by two senators during the Congress, Senator Jon Corzine in the first session (t = 1 to 4), and Senator Bob Menendez in the second session (t = 5 to 8). Sen. Corzine was known to have far-left views, reflected in #28's lack of both Republican and Democratic roles during his term (the Democrat role captures mainstream rather than extremist voting behavior). Once Sen. Menendez took over, #28's behavior fell in line with most Democrats.

Other notable outliers include Senator James Jeffords (#54), the sole Independent senator, who votes like a Democrat, and three Republican senators with Democratic leanings: Senator Lincoln Chafee (#19), Senator Susan Collins (#25), and Senator Olympia Snowe (#89). These senators exhibit MM vectors that deviate significantly from their party average, which makes them obvious outliers under even a simple K-means clustering. Through examining these outliers, dM3SB allows us to perform anomaly detection and analysis.

In summary, dM3SB provides a latent space view of the 109th Congress voting network, which reveals both expected aggregate trends (voting along partisan lines) as well as unexpected anomalies (senators who differ from their party norm). We anticipate that dM3SB can also be applied to understanding time-evolving biological networks, just as Xing et al. (2010) applied the earlier dMMSB model to such data in 2010.


FIGURE 23.5: Congress voting network: mixed membership vectors (colored bars) and most probable cluster assignments (numbers under bars) for all 100 senators, displayed as an 8-time-point series from left to right. The annotation beside a senator's number gives that senator's official affiliation (D for Democrat, R for Republican, I for Independent) and state abbreviation (e.g., HI = Hawaii); cluster 1 is the "Democrat" cluster and cluster 2 the "Republican" cluster. The learned role compatibility matrix, with roles labeled Democrat behavior, Republican behavior, and Centrist behavior (two roles), is displayed at the top right.


FIGURE 23.6: Congress voting network 3-simplex visualizations, with corners labeled Role 1 (Democrat), Role 2 (Republican), and Roles 3, 4 (Centrist). Colors (green, blue) denote cluster membership. Left: MM vector time-trajectory for Senator #28 (D-NJ)—Jon Corzine during time points 1–4 and Bob Menendez during time points 5–8. Right: MM vector time-trajectory for Senator Ben Nelson (#75, D-NE).


23.8 Conclusion

dM3SB is a probabilistic model for latent role analysis in time-varying networks, with an efficient variational EM algorithm for approximate inference and learning. This model is distinguished by its explicit modeling of actor multi-functionalities (role MMs), as well as its multi-modal, time-evolving, logistic normal mixture prior over these multi-functionalities, which allows dM3SB to fit complex latent role densities. We also note that dM3SB's variational inference algorithm is trivial to run in parallel, since each random restart can be run on a separate computational thread.

Notably, dM3SB is an evolution of the dMMSB (Xing et al., 2010) and MMSB (Airoldi et al., 2008) models, and shares much in common with them. Validation experiments show that dM3SB's multi-modal prior outperforms the unimodal prior of dMMSB on both synthetic and real data, which underscores the importance of using statistically flexible priors. The most important uses of dM3SB are exploration of actor latent roles and anomaly detection, which were demonstrated in a case study on the 109th U.S. Congress voting data.

Appendix

Derivation of the Variational EM Algorithm

This appendix provides detailed derivations of the dM3SB variational EM algorithm. Recall that our goal is to find the posterior distribution of the latent variables µ, c, γ, z given the observed network sequence E^(1), …, E^(T), under the maximum likelihood model parameters B, δ, ν, Φ, and Σ.

Finding the posterior (inference) or solving for the maximum likelihood parameters (learning) are both intractable under our original model. Hence we resort to a variational EM algorithm, which locally optimizes the model parameters with respect to a lower bound on the true marginal log-likelihood, while simultaneously finding a variational distribution that approximates the latent variable posterior. The marginal log-likelihood lower bound being optimized is

$$
\begin{aligned}
\log p(E \mid \Theta) &= \log \int_X p(E, X \mid \Theta)\, dX \\
&= \log \int_X q(X) \frac{p(E, X \mid \Theta)}{q(X)}\, dX \\
&\ge \int_X q(X) \log \frac{p(E, X \mid \Theta)}{q(X)}\, dX \qquad \text{(Jensen's inequality)} \\
&= \mathbb{E}_q\left[ \log p(E, X \mid \Theta) - \log q(X) \right] =: \mathcal{L}(q, \Theta),
\end{aligned}
$$

where X denotes the latent variables {µ, c, γ, z}, Θ denotes the model parameters {B, δ, ν, Φ, Σ}, and q is the variational distribution. This lower bound is iteratively maximized with respect to q's parameters (E-step) and the model parameters Θ (M-step).

In principle, the lower bound L(q, Θ) holds for any distribution q; ideally, q should closely approximate the true posterior p(X | E, Θ). In the next section, we define a factored form for q and derive its optimal solution.

Variational Distribution q

We assume a factorized form for q:

$$
q = q_\mu\left( \mu_1^{(1)}, \ldots, \mu_C^{(T)} \right) \prod_{t,i=1}^{T,N} \left[ q_\gamma(\gamma_i^{(t)})\, q_c(c_i^{(t)}) \prod_{j \ne i}^N q_z(z_{i \to j}^{(t)}, z_{i \gets j}^{(t)}) \right].
$$

We now make use of the generalized mean field (GMF) theory (Xing et al., 2003) to determine each factor's form. GMF theory optimizes a lower bound on the marginal distribution p(E | Θ) over arbitrary choices of qµ, qγ, qc, and qz. In particular, the optimal solution to q_X is p(X | E, E_q[φ(MB_X)]), the distribution of the latent variable set X conditioned on the observed variables E and the expected exponential family sufficient statistics (under q) of X's Markov blanket variables. More precisely, q_X has the same functional form as p(X | E, MB_X), but where a variational parameter V replaces φ(Y) for each Y ∈ MB_X, with optimal solution V := E_q[φ(Y)]. In general, if Y ∈ MB_X, then we use ⟨φ(Y)⟩ to denote the variational parameter corresponding to Y.

We begin by deriving optimal solutions to qµ, qγ, qc, and qz in terms of the variational parameters ⟨φ(Y)⟩. After we have derived all factors, we present closed-form solutions to ⟨φ(Y)⟩. These solutions form a set of fixed-point equations which, when iterated, converge to a local optimum in the space of variational parameters (thus completing the E-step).

Distribution of qz

qz is a discrete distribution since the zs are indicator vectors. We begin by deriving the distribution of the zs conditioned on their Markov blanket:

p(z

(t)i→j , z

(t)i←j | MB

z(t)i→j ,z

(t)i←j

)∝ p

(E

(t)ij | z

(t)i→j , z

(t)i←j

)p(z

(t)i→j | γ

(t)i

)p(z

(t)i←j | γ

(t)j

)=

((z

(t)i→j

)>Bz

(t)i←j

)E(t)ij(

1−(z

(t)i→j

)>Bz

(t)i←j

)1−E(t)ij

K∏k=1

(exp γ

(t)i,k∑K

l=1 exp γ(t)i,l

)z(t)i→j,k (

exp γ(t)j,k∑K

l=1 exp γ(t)j,l

)z(t)i←j,k

∝ exp

{E

(t)ij log

((z

(t)i→j

)>Bz

(t)i←j

)+(

1− E(t)ij

)log

(1−

(z

(t)i→j

)>Bz

(t)i←j

)+(z

(t)i→j

)>γ

(t)i +

(z

(t)i←j

)>γ

(t)j

}.

The variables γ(t)i , γ

(t)j belong to other variational factors, and their exponential family sufficient

statistics are just γ(t)i and γ(t)

j themselves. Hence

\begin{align*}
q_z\!\left(z_{i\to j}^{(t)}, z_{i\leftarrow j}^{(t)}\right) :\propto \exp\Big\{ &E_{ij}^{(t)} \log\left(\left(z_{i\to j}^{(t)}\right)^{\!\top}\! B z_{i\leftarrow j}^{(t)}\right) + \left(1 - E_{ij}^{(t)}\right) \log\left(1 - \left(z_{i\to j}^{(t)}\right)^{\!\top}\! B z_{i\leftarrow j}^{(t)}\right) \\
&+ \left(z_{i\to j}^{(t)}\right)^{\!\top}\! \left\langle \gamma_i^{(t)} \right\rangle + \left(z_{i\leftarrow j}^{(t)}\right)^{\!\top}\! \left\langle \gamma_j^{(t)} \right\rangle \Big\},
\end{align*}

with variational parameters $\langle \gamma_i^{(t)} \rangle$ and $\langle \gamma_j^{(t)} \rangle$. We can also express $q_z$ in terms of indices $k, l$:

\[
q_z\!\left(z_{i\to j}^{(t)} = k, z_{i\leftarrow j}^{(t)} = l\right) :\propto \exp\left\{ E_{ij}^{(t)} \log B_{k,l} + \left(1 - E_{ij}^{(t)}\right) \log\left(1 - B_{k,l}\right) + \left\langle \gamma_{i,k}^{(t)} \right\rangle + \left\langle \gamma_{j,l}^{(t)} \right\rangle \right\}.
\]
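To make this update concrete, here is a minimal numpy sketch (our own illustration, not code from the chapter) that builds the normalized $K \times K$ table $q_z$ for a single pair $(i, j)$ at one time step; the function name and array layout are assumptions:

```python
import numpy as np

def qz_table(B, gamma_i, gamma_j, E_ij):
    """Normalized q_z over (k, l) for one directed pair (i, j) at time t.
    B: (K, K) role-compatibility matrix; gamma_i, gamma_j: (K,) variational
    means <gamma_i>, <gamma_j>; E_ij: 0/1 observed edge indicator."""
    log_q = (E_ij * np.log(B) + (1 - E_ij) * np.log(1 - B)
             + gamma_i[:, None] + gamma_j[None, :])
    log_q -= log_q.max()          # subtract max for numerical stability
    q = np.exp(log_q)
    return q / q.sum()            # normalize over all K*K cells
```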

Distribution of qγ

$q_\gamma$ is a continuous distribution. The distribution of $\gamma_i^{(t)}$ conditioned on its Markov blanket is

\begin{align*}
p\!\left(\gamma_i^{(t)} \,\middle|\, \mathrm{MB}_{\gamma_i^{(t)}}\right)
&\propto p\!\left(\gamma_i^{(t)} \mid c_i^{(t)}, \mu_1^{(t)}, \ldots, \mu_C^{(t)}\right) \prod_{j \neq i}^{N} p\!\left(z_{i\to j}^{(t)} \mid \gamma_i^{(t)}\right) p\!\left(z_{j\leftarrow i}^{(t)} \mid \gamma_i^{(t)}\right) \\
&\propto \exp\left\{ \sum_{h=1}^{C} -\frac{1}{2} c_{i,h}^{(t)} \left(\gamma_i^{(t)} - \mu_h^{(t)}\right)^{\!\top} \Sigma_h^{-1} \left(\gamma_i^{(t)} - \mu_h^{(t)}\right) \right\} \prod_{j \neq i}^{N} \prod_{k=1}^{K} \left(\frac{\exp \gamma_{i,k}^{(t)}}{\sum_{l=1}^{K} \exp \gamma_{i,l}^{(t)}}\right)^{z_{i\to j,k}^{(t)}} \left(\frac{\exp \gamma_{i,k}^{(t)}}{\sum_{l=1}^{K} \exp \gamma_{i,l}^{(t)}}\right)^{z_{j\leftarrow i,k}^{(t)}} \\
&= \exp\left\{ \sum_{h=1}^{C} -\frac{1}{2} c_{i,h}^{(t)} \left(\gamma_i^{(t)} - \mu_h^{(t)}\right)^{\!\top} \Sigma_h^{-1} \left(\gamma_i^{(t)} - \mu_h^{(t)}\right) + \sum_{j \neq i}^{N} \sum_{k=1}^{K} \left( z_{i\to j,k}^{(t)} \gamma_{i,k}^{(t)} + z_{j\leftarrow i,k}^{(t)} \gamma_{i,k}^{(t)} \right) - (2N - 2) \log \sum_{l=1}^{K} \exp \gamma_{i,l}^{(t)} \right\} \\
&\propto \exp\left\{ \sum_{h=1}^{C} -\frac{1}{2} c_{i,h}^{(t)} \left[ \left(\gamma_i^{(t)}\right)^{\!\top} \Sigma_h^{-1} \gamma_i^{(t)} - \left(\gamma_i^{(t)}\right)^{\!\top} \Sigma_h^{-1} \mu_h^{(t)} - \left(\mu_h^{(t)}\right)^{\!\top} \Sigma_h^{-1} \gamma_i^{(t)} \right] + \left( \sum_{j \neq i}^{N} z_{i\to j}^{(t)} + z_{j\leftarrow i}^{(t)} \right)^{\!\top} \gamma_i^{(t)} - (2N - 2) \log \sum_{l=1}^{K} \exp \gamma_{i,l}^{(t)} \right\}.
\end{align*}

The variables $c_i^{(t)}, \mu_1^{(t)}, \ldots, \mu_C^{(t)}, z_{i\to 1}^{(t)}, \ldots, z_{i\to N}^{(t)}, z_{1\leftarrow i}^{(t)}, \ldots, z_{N\leftarrow i}^{(t)}$ belong to other variational factors. The sufficient statistics for the variables $z$ are just $z_{i\to j}^{(t)}$ and $z_{j\leftarrow i}^{(t)}$ themselves. For variables $c$ and $\mu$, their sufficient statistics are $c_{i,h}^{(t)}$ and $c_{i,h}^{(t)} \left(\mu_h^{(t)}\right)^{\!\top}$. However, since $c$ is marginally independent of $\mu$ under $q$, we can take their expectations independently, hence the variational parameters are just $\left\langle c_{i,h}^{(t)} \right\rangle$ and $\left\langle \mu_h^{(t)} \right\rangle$. Hence

\begin{align*}
q_\gamma\!\left(\gamma_i^{(t)}\right) :\propto \exp\Bigg\{ &\sum_{h=1}^{C} -\frac{1}{2} \left\langle c_{i,h}^{(t)} \right\rangle \left[ \left(\gamma_i^{(t)}\right)^{\!\top} \Sigma_h^{-1} \gamma_i^{(t)} - \left(\gamma_i^{(t)}\right)^{\!\top} \Sigma_h^{-1} \left\langle \mu_h^{(t)} \right\rangle - \left\langle \mu_h^{(t)} \right\rangle^{\!\top} \Sigma_h^{-1} \gamma_i^{(t)} \right] \\
&+ \left( \sum_{j \neq i}^{N} \left\langle z_{i\to j}^{(t)} \right\rangle + \left\langle z_{j\leftarrow i}^{(t)} \right\rangle \right)^{\!\top} \gamma_i^{(t)} - (2N - 2) \log \sum_{l=1}^{K} \exp \gamma_{i,l}^{(t)} \Bigg\},
\end{align*}

with variational parameters $\left\langle c_i^{(t)} \right\rangle$, $\left\langle \mu_h^{(t)} \right\rangle$, $\left\langle z_{i\to j}^{(t)} \right\rangle$, $\left\langle z_{j\leftarrow i}^{(t)} \right\rangle$.

Laplace Approximation to qγ

The term $Z_\gamma\!\left(\gamma_i^{(t)}\right) := \log \sum_{l=1}^{K} \exp \gamma_{i,l}^{(t)}$ makes the exponent analytically un-integrable, which prevents us from computing the normalizer for $q_\gamma\!\left(\gamma_i^{(t)}\right)$. Thus, we approximate $Z_\gamma\!\left(\gamma_i^{(t)}\right)$ with its second-order Taylor expansion around a chosen point $\hat{\gamma}_i^{(t)}$:

\begin{align}
Z_\gamma\!\left(\gamma_i^{(t)}\right) &\approx Z_\gamma\!\left(\hat{\gamma}_i^{(t)}\right) + \left(g_i^{(t)}\right)^{\!\top} \left(\gamma_i^{(t)} - \hat{\gamma}_i^{(t)}\right) + \frac{1}{2} \left(\gamma_i^{(t)} - \hat{\gamma}_i^{(t)}\right)^{\!\top} H_i^{(t)} \left(\gamma_i^{(t)} - \hat{\gamma}_i^{(t)}\right) \tag{23.4} \\
g_{i,k}^{(t)} &:= \frac{\exp \hat{\gamma}_{i,k}^{(t)}}{\sum_{k'=1}^{K} \exp \hat{\gamma}_{i,k'}^{(t)}} \qquad
H_{i,kl}^{(t)} := \frac{\mathbb{I}[k = l] \exp \hat{\gamma}_{i,k}^{(t)}}{\sum_{k'=1}^{K} \exp \hat{\gamma}_{i,k'}^{(t)}} - \frac{\exp \hat{\gamma}_{i,k}^{(t)} \exp \hat{\gamma}_{i,l}^{(t)}}{\left(\sum_{k'=1}^{K} \exp \hat{\gamma}_{i,k'}^{(t)}\right)^{2}}. \notag
\end{align}
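As a small sanity check on these formulas, the gradient and Hessian of the log-sum-exp term can be computed in a few lines; this numpy sketch is our own (function name assumed), using the identity $H = \operatorname{diag}(g) - g g^{\!\top}$ noted in the text:

```python
import numpy as np

def laplace_g_H(gamma_hat):
    """Gradient g and Hessian H of Z_gamma(gamma) = log sum_l exp(gamma_l)
    at the expansion point gamma_hat, as in Equation (23.4)."""
    g = np.exp(gamma_hat - gamma_hat.max())   # stabilized softmax
    g /= g.sum()
    H = np.diag(g) - np.outer(g, g)           # H = diag(g) - g g^T
    return g, H
```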

Note that $H_i^{(t)} = \operatorname{diag}\!\left(g_i^{(t)}\right) - g_i^{(t)} \left(g_i^{(t)}\right)^{\!\top}$. Because the variational EM algorithm is iterative, we set $\hat{\gamma}_i^{(t)}$ to $\tilde{\gamma}_i^{(t)} := \mathbb{E}_q\!\left[\gamma_i^{(t)}\right]$ from the previous iteration, which should keep the point of expansion close to $\mathbb{E}_q\!\left[\gamma_i^{(t)}\right]$ for the current iteration. The point of this Taylor expansion is to approximate $q_\gamma$ with a normal distribution; consider the exponent of $q_\gamma$,

\begin{align*}
&\sum_{h=1}^{C} -\frac{\left\langle c_{i,h}^{(t)} \right\rangle}{2} \left[ \left(\gamma_i^{(t)}\right)^{\!\top} \Sigma_h^{-1} \gamma_i^{(t)} - \left(\gamma_i^{(t)}\right)^{\!\top} \Sigma_h^{-1} \left\langle \mu_h^{(t)} \right\rangle - \left\langle \mu_h^{(t)} \right\rangle^{\!\top} \Sigma_h^{-1} \gamma_i^{(t)} \right] + \left( \sum_{j \neq i}^{N} \left\langle z_{i\to j}^{(t)} \right\rangle + \left\langle z_{j\leftarrow i}^{(t)} \right\rangle \right)^{\!\top} \gamma_i^{(t)} - (2N - 2) Z_\gamma\!\left(\gamma_i^{(t)}\right) \\
&= \mathrm{const}^{(1)} - \frac{1}{2} \left(\gamma_i^{(t)} - u\right)^{\!\top} S \left(\gamma_i^{(t)} - u\right) + \left( \sum_{j \neq i}^{N} \left\langle z_{i\to j}^{(t)} \right\rangle + \left\langle z_{j\leftarrow i}^{(t)} \right\rangle \right)^{\!\top} \gamma_i^{(t)} - (2N - 2) Z_\gamma\!\left(\gamma_i^{(t)}\right),
\end{align*}

where $\mathrm{const}^{(i)}$ denotes a constant independent of $\gamma_i^{(t)}$, $S := \sum_{h=1}^{C} \Sigma_h^{-1} \left\langle c_{i,h}^{(t)} \right\rangle$ and $u := S^{-1} \left( \sum_{h=1}^{C} \Sigma_h^{-1} \left\langle c_{i,h}^{(t)} \right\rangle \left\langle \mu_h^{(t)} \right\rangle \right)$. Applying the Taylor expansion in Equation (23.4) gives

\begin{align*}
&\approx \mathrm{const}^{(1)} - \frac{1}{2} \left(\gamma_i^{(t)} - u\right)^{\!\top} S \left(\gamma_i^{(t)} - u\right) + \left( \sum_{j \neq i}^{N} \left\langle z_{i\to j}^{(t)} \right\rangle + \left\langle z_{j\leftarrow i}^{(t)} \right\rangle \right)^{\!\top} \gamma_i^{(t)} \\
&\qquad - (2N - 2) \left[ Z_\gamma\!\left(\hat{\gamma}_i^{(t)}\right) + \left(g_i^{(t)}\right)^{\!\top} \left(\gamma_i^{(t)} - \hat{\gamma}_i^{(t)}\right) + \frac{1}{2} \left(\gamma_i^{(t)} - \hat{\gamma}_i^{(t)}\right)^{\!\top} H_i^{(t)} \left(\gamma_i^{(t)} - \hat{\gamma}_i^{(t)}\right) \right] \\
&= \mathrm{const}^{(2)} - \frac{1}{2} \left(\gamma_i^{(t)} - u\right)^{\!\top} S \left(\gamma_i^{(t)} - u\right) + \left( \sum_{j \neq i}^{N} \left\langle z_{i\to j}^{(t)} \right\rangle + \left\langle z_{j\leftarrow i}^{(t)} \right\rangle \right)^{\!\top} \gamma_i^{(t)} \\
&\qquad - (2N - 2) \left[ \left(g_i^{(t)}\right)^{\!\top} \gamma_i^{(t)} + \frac{1}{2} \left(\gamma_i^{(t)}\right)^{\!\top} H_i^{(t)} \gamma_i^{(t)} - \left(\hat{\gamma}_i^{(t)}\right)^{\!\top} H_i^{(t)} \gamma_i^{(t)} \right] \\
&= \mathrm{const}^{(2)} - \frac{1}{2} \left(\gamma_i^{(t)} - u\right)^{\!\top} S \left(\gamma_i^{(t)} - u\right) + \left[ \left( \sum_{j \neq i}^{N} \left\langle z_{i\to j}^{(t)} \right\rangle + \left\langle z_{j\leftarrow i}^{(t)} \right\rangle \right)^{\!\top} - (2N - 2) \left( \left(g_i^{(t)}\right)^{\!\top} - \left(\hat{\gamma}_i^{(t)}\right)^{\!\top} H_i^{(t)} \right) \right] \gamma_i^{(t)} \\
&\qquad - (N - 1) \left(\gamma_i^{(t)}\right)^{\!\top} H_i^{(t)} \gamma_i^{(t)}.
\end{align*}

Define $A := \left( \sum_{j \neq i}^{N} \left\langle z_{i\to j}^{(t)} \right\rangle + \left\langle z_{j\leftarrow i}^{(t)} \right\rangle \right)^{\!\top} - (2N - 2) \left( \left(g_i^{(t)}\right)^{\!\top} - \left(\hat{\gamma}_i^{(t)}\right)^{\!\top} H_i^{(t)} \right)$ and $B := -(N - 1) H_i^{(t)}$, so that we obtain

\begin{align*}
&= \mathrm{const}^{(2)} - \frac{1}{2} \left(\gamma_i^{(t)} - u\right)^{\!\top} S \left(\gamma_i^{(t)} - u\right) + A \gamma_i^{(t)} + \left(\gamma_i^{(t)}\right)^{\!\top} B \gamma_i^{(t)} \\
&= \mathrm{const}^{(2)} - \frac{1}{2} \left(\gamma_i^{(t)} - u\right)^{\!\top} S \left(\gamma_i^{(t)} - u\right) + A \left(\gamma_i^{(t)} - u + u\right) + \left(\gamma_i^{(t)} - u + u\right)^{\!\top} B \left(\gamma_i^{(t)} - u + u\right) \\
&= \mathrm{const}^{(3)} - \frac{1}{2} \left(\gamma_i^{(t)} - u\right)^{\!\top} (S - 2B) \left(\gamma_i^{(t)} - u\right) + \left(A + 2u^{\!\top} B\right) \left(\gamma_i^{(t)} - u\right).
\end{align*}

Finally, define $D := A + 2u^{\!\top} B$ and $E := S - 2B$, resulting in

\begin{align*}
&= \mathrm{const}^{(3)} - \frac{1}{2} \left(\gamma_i^{(t)} - u\right)^{\!\top} E \left(\gamma_i^{(t)} - u\right) + D \left(\gamma_i^{(t)} - u\right) \\
&= \mathrm{const}^{(4)} - \frac{1}{2} \left(\gamma_i^{(t)} - u\right)^{\!\top} E \left(\gamma_i^{(t)} - u\right) + \left(E^{-1} D^{\!\top}\right)^{\!\top} E \left(\gamma_i^{(t)} - u\right) - \frac{1}{2} \left(E^{-1} D^{\!\top}\right)^{\!\top} E \left(E^{-1} D^{\!\top}\right) \\
&= \mathrm{const}^{(4)} - \frac{1}{2} \left(\gamma_i^{(t)} - u - E^{-1} D^{\!\top}\right)^{\!\top} E \left(\gamma_i^{(t)} - u - E^{-1} D^{\!\top}\right).
\end{align*}

Hence $q_\gamma\!\left(\gamma_i^{(t)}\right)$ is approximately $\mathrm{Normal}\!\left(\tau_i^{(t)}, \Lambda_i^{(t)}\right)$ with variance and mean

\begin{align*}
\Lambda_i^{(t)} &:= E^{-1} = \left( \left[ \sum_{h=1}^{C} \Sigma_h^{-1} \left\langle c_{i,h}^{(t)} \right\rangle \right] + (2N - 2) H_i^{(t)} \right)^{-1} \\
\tau_i^{(t)} &:= u + E^{-1} D^{\!\top} = u + \Lambda_i^{(t)} \left( \sum_{j \neq i}^{N} \left\langle z_{i\to j}^{(t)} \right\rangle + \left\langle z_{j\leftarrow i}^{(t)} \right\rangle - (2N - 2) \left[ g_i^{(t)} + H_i^{(t)} \left(u - \hat{\gamma}_i^{(t)}\right) \right] \right) \\
u &:= \left( \sum_{h=1}^{C} \Sigma_h^{-1} \left\langle c_{i,h}^{(t)} \right\rangle \right)^{-1} \left( \sum_{h=1}^{C} \Sigma_h^{-1} \left\langle c_{i,h}^{(t)} \right\rangle \left\langle \mu_h^{(t)} \right\rangle \right).
\end{align*}
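Assembling these pieces, a hedged numpy sketch (ours; variable names and array shapes are assumptions) of the $q_\gamma$ update for one actor $i$ at one time step:

```python
import numpy as np

def qgamma_params(Sigma_inv, c_i, mu, z_sum, gamma_hat, N):
    """Laplace-approximate q_gamma parameters (tau, Lambda) for actor i, time t.
    Sigma_inv: (C, K, K) precisions; c_i: (C,) responsibilities <c_{i,h}>;
    mu: (C, K) means <mu_h>; z_sum: (K,) sum over j != i of <z_{i->j}> + <z_{j<-i}>;
    gamma_hat: (K,) Taylor expansion point."""
    g = np.exp(gamma_hat - gamma_hat.max()); g /= g.sum()
    H = np.diag(g) - np.outer(g, g)
    S = np.einsum('h,hkl->kl', c_i, Sigma_inv)       # S = sum_h Sigma_h^{-1} <c_{i,h}>
    u = np.linalg.solve(S, np.einsum('h,hkl,hl->k', c_i, Sigma_inv, mu))
    Lambda = np.linalg.inv(S + (2 * N - 2) * H)      # E = S + (2N-2) H, Lambda = E^{-1}
    tau = u + Lambda @ (z_sum - (2 * N - 2) * (g + H @ (u - gamma_hat)))
    return tau, Lambda
```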


Distribution of qc

$q_c$ is a discrete distribution. The distribution of $c_i^{(t)}$ conditioned on its Markov blanket is

\begin{align*}
p\!\left(c_i^{(t)} \,\middle|\, \mathrm{MB}_{c_i^{(t)}}\right)
&\propto p\!\left(\gamma_i^{(t)} \mid c_i^{(t)}, \mu_1^{(t)}, \ldots, \mu_C^{(t)}\right) p\!\left(c_i^{(t)}\right) \\
&\propto \left( \prod_{h=1}^{C} \left[ |\Sigma_h|^{-1/2} \right]^{c_{i,h}^{(t)}} \right) \exp\left\{ \sum_{h=1}^{C} -\frac{1}{2} c_{i,h}^{(t)} \left(\gamma_i^{(t)} - \mu_h^{(t)}\right)^{\!\top} \Sigma_h^{-1} \left(\gamma_i^{(t)} - \mu_h^{(t)}\right) \right\} \left( \prod_{h=1}^{C} \delta_h^{c_{i,h}^{(t)}} \right) \\
&= \exp\left\{ \sum_{h=1}^{C} -\frac{1}{2} c_{i,h}^{(t)} \left(\gamma_i^{(t)} - \mu_h^{(t)}\right)^{\!\top} \Sigma_h^{-1} \left(\gamma_i^{(t)} - \mu_h^{(t)}\right) + \sum_{h=1}^{C} c_{i,h}^{(t)} \log \frac{\delta_h}{|\Sigma_h|^{1/2}} \right\} \\
&= \exp\left\{ \sum_{h=1}^{C} -\frac{1}{2} c_{i,h}^{(t)} \left[ \left(\gamma_i^{(t)}\right)^{\!\top} \Sigma_h^{-1} \gamma_i^{(t)} - \left(\gamma_i^{(t)}\right)^{\!\top} \Sigma_h^{-1} \mu_h^{(t)} - \left(\mu_h^{(t)}\right)^{\!\top} \Sigma_h^{-1} \gamma_i^{(t)} + \left(\mu_h^{(t)}\right)^{\!\top} \Sigma_h^{-1} \mu_h^{(t)} \right] + \sum_{h=1}^{C} c_{i,h}^{(t)} \log \frac{\delta_h}{|\Sigma_h|^{1/2}} \right\} \\
&= \exp\left\{ \sum_{h=1}^{C} -\frac{1}{2} c_{i,h}^{(t)} \operatorname{tr}\left[ \Sigma_h^{-1} \left( \gamma_i^{(t)} \left(\gamma_i^{(t)}\right)^{\!\top} - \mu_h^{(t)} \left(\gamma_i^{(t)}\right)^{\!\top} - \gamma_i^{(t)} \left(\mu_h^{(t)}\right)^{\!\top} + \mu_h^{(t)} \left(\mu_h^{(t)}\right)^{\!\top} \right) \right] + \sum_{h=1}^{C} c_{i,h}^{(t)} \log \frac{\delta_h}{|\Sigma_h|^{1/2}} \right\}.
\end{align*}

The variables $\gamma_1^{(t)}, \ldots, \gamma_N^{(t)}, \mu_1^{(t)}, \ldots, \mu_C^{(t)}$ belong to other variational factors. The sufficient statistics of $\gamma$ and $\mu$ are $\gamma_i^{(t)} \left(\gamma_i^{(t)}\right)^{\!\top}$, $\mu_h^{(t)} \left(\gamma_i^{(t)}\right)^{\!\top}$, and $\mu_h^{(t)} \left(\mu_h^{(t)}\right)^{\!\top}$, but since $\gamma$ and $\mu$ are marginally independent under $q$, we can take their expectations separately. Hence

\begin{align*}
q_c\!\left(c_i^{(t)}\right) :\propto \exp\Bigg\{ &\sum_{h=1}^{C} -\frac{1}{2} c_{i,h}^{(t)} \operatorname{tr}\left[ \Sigma_h^{-1} \left( \left\langle \gamma_i^{(t)} \left(\gamma_i^{(t)}\right)^{\!\top} \right\rangle - \left\langle \mu_h^{(t)} \right\rangle \left\langle \gamma_i^{(t)} \right\rangle^{\!\top} - \left\langle \gamma_i^{(t)} \right\rangle \left\langle \mu_h^{(t)} \right\rangle^{\!\top} + \left\langle \mu_h^{(t)} \left(\mu_h^{(t)}\right)^{\!\top} \right\rangle \right) \right] \\
&+ \sum_{h=1}^{C} c_{i,h}^{(t)} \log \frac{\delta_h}{|\Sigma_h|^{1/2}} \Bigg\},
\end{align*}

with variational parameters $\left\langle \mu_h^{(t)} \left(\mu_h^{(t)}\right)^{\!\top} \right\rangle$, $\left\langle \gamma_i^{(t)} \left(\gamma_i^{(t)}\right)^{\!\top} \right\rangle$, $\left\langle \mu_h^{(t)} \right\rangle$, $\left\langle \gamma_i^{(t)} \right\rangle$. We can also express $q_c$ in terms of indices $h$:

\[
q_c\!\left(c_i^{(t)} = h\right) :\propto \frac{\delta_h}{|\Sigma_h|^{1/2}} \exp\left\{ -\frac{1}{2} \operatorname{tr}\left[ \Sigma_h^{-1} \left( \left\langle \gamma_i^{(t)} \left(\gamma_i^{(t)}\right)^{\!\top} \right\rangle - \left\langle \mu_h^{(t)} \right\rangle \left\langle \gamma_i^{(t)} \right\rangle^{\!\top} - \left\langle \gamma_i^{(t)} \right\rangle \left\langle \mu_h^{(t)} \right\rangle^{\!\top} + \left\langle \mu_h^{(t)} \left(\mu_h^{(t)}\right)^{\!\top} \right\rangle \right) \right] \right\}.
\]
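For illustration, a small numpy sketch (ours, not the authors' code; names and shapes assumed) of this $q_c$ update, returning the normalized responsibilities over clusters $h$:

```python
import numpy as np

def qc_probs(delta, Sigma, gamma_mean, gamma_outer, mu_mean, mu_outer):
    """q_c(c_i^(t) = h) for one actor at one time step, normalized over h.
    Sigma: (C, K, K); gamma_mean: (K,) <gamma_i>; gamma_outer: (K, K)
    <gamma_i gamma_i^T>; mu_mean: (C, K); mu_outer: (C, K, K) <mu_h mu_h^T>."""
    C = len(delta)
    log_q = np.empty(C)
    for h in range(C):
        M = (gamma_outer
             - np.outer(mu_mean[h], gamma_mean)
             - np.outer(gamma_mean, mu_mean[h])
             + mu_outer[h])
        _, logdet = np.linalg.slogdet(Sigma[h])
        log_q[h] = (np.log(delta[h]) - 0.5 * logdet
                    - 0.5 * np.trace(np.linalg.solve(Sigma[h], M)))
    log_q -= log_q.max()          # stabilize before exponentiating
    q = np.exp(log_q)
    return q / q.sum()
```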

Distribution of qµ

$q_\mu$ is a continuous distribution. The distribution of $\mu_1^{(1)}, \ldots, \mu_C^{(T)}$ conditioned on its Markov blanket is

\begin{align*}
&p\!\left(\mu_1^{(1)}, \ldots, \mu_C^{(T)} \,\middle|\, \mathrm{MB}_{\mu_1^{(1)}, \ldots, \mu_C^{(T)}}\right)
\propto \left[ \prod_{t=1}^{T} \prod_{i=1}^{N} p\!\left(\gamma_i^{(t)} \mid c_i^{(t)}, \mu_1^{(t)}, \ldots, \mu_C^{(t)}\right) \right] \left[ \prod_{h=1}^{C} p\!\left(\mu_h^{(1)}\right) \prod_{t=2}^{T} p\!\left(\mu_h^{(t)} \mid \mu_h^{(t-1)}\right) \right] \\
&\propto \exp\Bigg\{ \sum_{t=1}^{T} \sum_{i=1}^{N} \sum_{h=1}^{C} -\frac{1}{2} c_{i,h}^{(t)} \left(\gamma_i^{(t)} - \mu_h^{(t)}\right)^{\!\top} \Sigma_h^{-1} \left(\gamma_i^{(t)} - \mu_h^{(t)}\right) \\
&\qquad + \sum_{h=1}^{C} \left[ -\frac{1}{2} \left(\mu_h^{(1)} - \nu\right)^{\!\top} \Phi^{-1} \left(\mu_h^{(1)} - \nu\right) + \sum_{t=2}^{T} -\frac{1}{2} \left(\mu_h^{(t)} - \mu_h^{(t-1)}\right)^{\!\top} \Phi^{-1} \left(\mu_h^{(t)} - \mu_h^{(t-1)}\right) \right] \Bigg\} \\
&\propto \exp\Bigg\{ \sum_{t=1}^{T} \sum_{i=1}^{N} \sum_{h=1}^{C} -\frac{1}{2} c_{i,h}^{(t)} \left[ -\left(\gamma_i^{(t)}\right)^{\!\top} \Sigma_h^{-1} \mu_h^{(t)} - \left(\mu_h^{(t)}\right)^{\!\top} \Sigma_h^{-1} \gamma_i^{(t)} + \left(\mu_h^{(t)}\right)^{\!\top} \Sigma_h^{-1} \mu_h^{(t)} \right] \\
&\qquad + \sum_{h=1}^{C} \left[ -\frac{1}{2} \left(\mu_h^{(1)} - \nu\right)^{\!\top} \Phi^{-1} \left(\mu_h^{(1)} - \nu\right) + \sum_{t=2}^{T} -\frac{1}{2} \left(\mu_h^{(t)} - \mu_h^{(t-1)}\right)^{\!\top} \Phi^{-1} \left(\mu_h^{(t)} - \mu_h^{(t-1)}\right) \right] \Bigg\}.
\end{align*}

The variables $\gamma_1^{(1)}, \ldots, \gamma_N^{(T)}, c_1^{(1)}, \ldots, c_N^{(T)}$ belong to other variational factors. The sufficient statistic of $\gamma$ and $c$ is $c_{i,h}^{(t)} \left(\gamma_i^{(t)}\right)^{\!\top}$, but since $\gamma$ and $c$ are marginally independent under $q$, we can take their expectations separately. Hence

\begin{align*}
&q_\mu\!\left(\mu_1^{(1)}, \ldots, \mu_C^{(T)}\right) :\propto \exp\Bigg\{ \sum_{t=1}^{T} \sum_{i=1}^{N} \sum_{h=1}^{C} -\frac{1}{2} \left\langle c_{i,h}^{(t)} \right\rangle \left[ -\left\langle \gamma_i^{(t)} \right\rangle^{\!\top} \Sigma_h^{-1} \mu_h^{(t)} - \left(\mu_h^{(t)}\right)^{\!\top} \Sigma_h^{-1} \left\langle \gamma_i^{(t)} \right\rangle + \left(\mu_h^{(t)}\right)^{\!\top} \Sigma_h^{-1} \mu_h^{(t)} \right] \\
&\qquad + \sum_{h=1}^{C} \left[ -\frac{1}{2} \left(\mu_h^{(1)} - \nu\right)^{\!\top} \Phi^{-1} \left(\mu_h^{(1)} - \nu\right) + \sum_{t=2}^{T} -\frac{1}{2} \left(\mu_h^{(t)} - \mu_h^{(t-1)}\right)^{\!\top} \Phi^{-1} \left(\mu_h^{(t)} - \mu_h^{(t-1)}\right) \right] \Bigg\} \\
&\propto \prod_{h=1}^{C} \exp\Bigg\{ \sum_{t=1}^{T} \sum_{i=1}^{N} -\frac{1}{2} \left\langle c_{i,h}^{(t)} \right\rangle \left[ -\left\langle \gamma_i^{(t)} \right\rangle^{\!\top} \Sigma_h^{-1} \mu_h^{(t)} - \left(\mu_h^{(t)}\right)^{\!\top} \Sigma_h^{-1} \left\langle \gamma_i^{(t)} \right\rangle + \left(\mu_h^{(t)}\right)^{\!\top} \Sigma_h^{-1} \mu_h^{(t)} \right] \\
&\qquad - \frac{1}{2} \left(\mu_h^{(1)} - \nu\right)^{\!\top} \Phi^{-1} \left(\mu_h^{(1)} - \nu\right) + \sum_{t=2}^{T} -\frac{1}{2} \left(\mu_h^{(t)} - \mu_h^{(t-1)}\right)^{\!\top} \Phi^{-1} \left(\mu_h^{(t)} - \mu_h^{(t-1)}\right) \Bigg\},
\end{align*}

with variational parameters $\left\langle \gamma_i^{(t)} \right\rangle$, $\left\langle c_i^{(t)} \right\rangle$.

Kalman Smoother for qµ

We can apply the Kalman smoother to compute the mean and covariance of each $\mu_h^{(t)}$ under $q_\mu$. Let $\Psi(a, b, C) := \exp\left\{ -\frac{1}{2} (a - b)^{\!\top} C^{-1} (a - b) \right\}$; then with some manipulation we obtain

\begin{align*}
q_\mu\!\left(\mu_1^{(1)}, \ldots, \mu_C^{(T)}\right)
&\propto \prod_{h=1}^{C} \left[ \Psi\!\left(\mu_h^{(1)}, \nu, \Phi\right) \prod_{i=1}^{N} \Psi\!\left(\left\langle \gamma_i^{(1)} \right\rangle, \mu_h^{(1)}, \Sigma_h\right)^{\left\langle c_{i,h}^{(1)} \right\rangle} \right] \left[ \prod_{t=2}^{T} \Psi\!\left(\mu_h^{(t)}, \mu_h^{(t-1)}, \Phi\right) \prod_{i=1}^{N} \Psi\!\left(\left\langle \gamma_i^{(t)} \right\rangle, \mu_h^{(t)}, \Sigma_h\right)^{\left\langle c_{i,h}^{(t)} \right\rangle} \right] \\
&\propto \prod_{h=1}^{C} \Psi\!\left(\mu_h^{(1)}, \nu, \Phi\right) \Psi\!\left( \frac{\sum_{i=1}^{N} \left\langle c_{i,h}^{(1)} \right\rangle \left\langle \gamma_i^{(1)} \right\rangle}{\sum_{i=1}^{N} \left\langle c_{i,h}^{(1)} \right\rangle}, \mu_h^{(1)}, \frac{\Sigma_h}{\sum_{i=1}^{N} \left\langle c_{i,h}^{(1)} \right\rangle} \right) \prod_{t=2}^{T} \Psi\!\left(\mu_h^{(t)}, \mu_h^{(t-1)}, \Phi\right) \Psi\!\left( \frac{\sum_{i=1}^{N} \left\langle c_{i,h}^{(t)} \right\rangle \left\langle \gamma_i^{(t)} \right\rangle}{\sum_{i=1}^{N} \left\langle c_{i,h}^{(t)} \right\rangle}, \mu_h^{(t)}, \frac{\Sigma_h}{\sum_{i=1}^{N} \left\langle c_{i,h}^{(t)} \right\rangle} \right).
\end{align*}

Notice that $q_\mu$ factorizes across cluster indices $h$:

\begin{align*}
q_\mu\!\left(\mu_1^{(1)}, \ldots, \mu_C^{(T)}\right) &= \prod_{h=1}^{C} q_{\mu_h}\!\left(\mu_h^{(1)}, \ldots, \mu_h^{(T)}\right) \\
q_{\mu_h}\!\left(\mu_h^{(1)}, \ldots, \mu_h^{(T)}\right) &:\propto \Psi\!\left(\mu_h^{(1)}, \nu, \Phi\right) \Psi\!\left( \frac{\sum_{i=1}^{N} \left\langle c_{i,h}^{(1)} \right\rangle \left\langle \gamma_i^{(1)} \right\rangle}{\sum_{i=1}^{N} \left\langle c_{i,h}^{(1)} \right\rangle}, \mu_h^{(1)}, \frac{\Sigma_h}{\sum_{i=1}^{N} \left\langle c_{i,h}^{(1)} \right\rangle} \right) \prod_{t=2}^{T} \Psi\!\left(\mu_h^{(t)}, \mu_h^{(t-1)}, \Phi\right) \Psi\!\left( \frac{\sum_{i=1}^{N} \left\langle c_{i,h}^{(t)} \right\rangle \left\langle \gamma_i^{(t)} \right\rangle}{\sum_{i=1}^{N} \left\langle c_{i,h}^{(t)} \right\rangle}, \mu_h^{(t)}, \frac{\Sigma_h}{\sum_{i=1}^{N} \left\langle c_{i,h}^{(t)} \right\rangle} \right).
\end{align*}

Observe that each factor $q_{\mu_h}\!\left(\mu_h^{(1)}, \ldots, \mu_h^{(T)}\right)$ is a linear system of the form

\begin{align*}
\mu_h^{(t+1)} &= \mu_h^{(t)} + w_h^{(t)} \\
\alpha_h^{(t)} &= \mu_h^{(t)} + v_h^{(t)},
\end{align*}

where $\mu_h^{(t)}$ are latent variables and $\alpha_h^{(t)}$ are observed variables with value $\alpha_h^{(t)} = \frac{\sum_{i=1}^{N} \left\langle c_{i,h}^{(t)} \right\rangle \left\langle \gamma_i^{(t)} \right\rangle}{\sum_{i=1}^{N} \left\langle c_{i,h}^{(t)} \right\rangle}$. Furthermore, $w_h^{(t)} \sim \mathcal{N}(0, \Phi)$, $v_h^{(t)} \sim \mathcal{N}\!\left(0, \Xi_h^{(t)}\right)$ with $\Xi_h^{(t)} = \frac{\Sigma_h}{\sum_{i=1}^{N} \left\langle c_{i,h}^{(t)} \right\rangle}$, and $\mu_h^{(1)} \sim \mathcal{N}(\nu, \Phi)$.

Hence the distribution of each $\mu_h^{(t)}$ under $q_\mu$ is Gaussian, and its mean and covariance can be computed using the Kalman smoother equations

\begin{align*}
\hat{\mu}_h^{(t+1)|(t)} &= \hat{\mu}_h^{(t)|(t)} \\
P_h^{(t+1)|(t)} &= P_h^{(t)|(t)} + \Phi \\
K_h^{(t+1)} &= P_h^{(t+1)|(t)} \left( P_h^{(t+1)|(t)} + \Xi_h^{(t+1)} \right)^{-1} \\
\hat{\mu}_h^{(t+1)|(t+1)} &= \hat{\mu}_h^{(t+1)|(t)} + K_h^{(t+1)} \left( \alpha_h^{(t+1)} - \hat{\mu}_h^{(t+1)|(t)} \right) \\
P_h^{(t+1)|(t+1)} &= \left( I - K_h^{(t+1)} \right) P_h^{(t+1)|(t)}
\end{align*}

and

\begin{align*}
L_h^{(t)} &= P_h^{(t)|(t)} \left( P_h^{(t+1)|(t)} \right)^{-1} \\
\hat{\mu}_h^{(t)|(T)} &= \hat{\mu}_h^{(t)|(t)} + L_h^{(t)} \left( \hat{\mu}_h^{(t+1)|(T)} - \hat{\mu}_h^{(t+1)|(t)} \right) \\
P_h^{(t)|(T)} &= P_h^{(t)|(t)} + L_h^{(t)} \left( P_h^{(t+1)|(T)} - P_h^{(t+1)|(t)} \right) \left( L_h^{(t)} \right)^{\!\top}.
\end{align*}

Thus, $\mu_h^{(t)}$ has mean $\hat{\mu}_h^{(t)|(T)}$ and covariance $P_h^{(t)|(T)}$ under $q_\mu$.
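A compact numpy implementation of this forward-backward recursion might look as follows; this is our own sketch of the standard Rauch-Tung-Striebel smoother specialized to the identity-dynamics system above, with function name and array layout assumed:

```python
import numpy as np

def kalman_smoother(alpha, Xi, nu, Phi):
    """RTS smoother for one cluster h. alpha: (T, K) pseudo-observations;
    Xi: (T, K, K) observation noise covariances; nu: (K,) prior mean;
    Phi: (K, K) transition/prior covariance. Returns smoothed means (T, K),
    smoothed covariances (T, K, K), and gains L (T-1, K, K)."""
    T, K = alpha.shape
    I = np.eye(K)
    m_f = np.empty((T, K)); P_f = np.empty((T, K, K))   # filtered
    m_p = np.empty((T, K)); P_p = np.empty((T, K, K))   # predicted
    m_p[0], P_p[0] = nu, Phi                            # mu_h^(1) ~ N(nu, Phi)
    for t in range(T):
        if t > 0:
            m_p[t] = m_f[t - 1]                         # identity dynamics
            P_p[t] = P_f[t - 1] + Phi
        Kt = P_p[t] @ np.linalg.inv(P_p[t] + Xi[t])     # Kalman gain
        m_f[t] = m_p[t] + Kt @ (alpha[t] - m_p[t])
        P_f[t] = (I - Kt) @ P_p[t]
    mu_hat = m_f.copy(); P = P_f.copy()
    L = np.empty((T - 1, K, K))
    for t in range(T - 2, -1, -1):                      # backward pass
        L[t] = P_f[t] @ np.linalg.inv(P_p[t + 1])
        mu_hat[t] = m_f[t] + L[t] @ (mu_hat[t + 1] - m_p[t + 1])
        P[t] = P_f[t] + L[t] @ (P[t + 1] - P_p[t + 1]) @ L[t].T
    return mu_hat, P, L
```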

E-Step: Solutions to Variational Parameters

In the E-step, we find locally optimal variational parameters for each factor of $q$. The solutions to the continuous parameters are

\begin{align*}
\left\langle \mu_h^{(t)} \right\rangle &= \hat{\mu}_h^{(t)|(T)} \\
\left\langle \mu_h^{(t)} \left(\mu_h^{(t)}\right)^{\!\top} \right\rangle &= \mathbb{E}_{q_\mu}\!\left[ \mu_h^{(t)} \left(\mu_h^{(t)}\right)^{\!\top} \right] = \mathbb{V}_{q_\mu}\!\left[ \mu_h^{(t)} \right] + \mathbb{E}_{q_\mu}\!\left[ \mu_h^{(t)} \right] \mathbb{E}_{q_\mu}\!\left[ \mu_h^{(t)} \right]^{\!\top} = P_h^{(t)|(T)} + \hat{\mu}_h^{(t)|(T)} \left( \hat{\mu}_h^{(t)|(T)} \right)^{\!\top} \\
\left\langle \gamma_i^{(t)} \right\rangle &= \tau_i^{(t)} \\
\left\langle \gamma_i^{(t)} \left(\gamma_i^{(t)}\right)^{\!\top} \right\rangle &= \mathbb{E}_{q_\gamma}\!\left[ \gamma_i^{(t)} \left(\gamma_i^{(t)}\right)^{\!\top} \right] = \mathbb{V}_{q_\gamma}\!\left[ \gamma_i^{(t)} \right] + \mathbb{E}_{q_\gamma}\!\left[ \gamma_i^{(t)} \right] \mathbb{E}_{q_\gamma}\!\left[ \gamma_i^{(t)} \right]^{\!\top} = \Lambda_i^{(t)} + \tau_i^{(t)} \left( \tau_i^{(t)} \right)^{\!\top},
\end{align*}

while the solutions to the discrete parameters are

\begin{align*}
\left\langle c_{i,h}^{(t)} \right\rangle &= q_c\!\left(c_i^{(t)} = h\right) \\
\left\langle z_{(i\to j),k}^{(t)} \right\rangle &= \sum_{l=1}^{K} q_z\!\left(z_{i\to j}^{(t)} = k, z_{i\leftarrow j}^{(t)} = l\right) \\
\left\langle z_{(i\leftarrow j),l}^{(t)} \right\rangle &= \sum_{k=1}^{K} q_z\!\left(z_{i\to j}^{(t)} = k, z_{i\leftarrow j}^{(t)} = l\right).
\end{align*}

These solutions are used to update the variational parameters in each factor of $q$. Note that they form a set of fixed-point equations that converge to a local optimum in the space of variational parameters. Hence the E-step involves iterating these equations until some convergence threshold has been reached, as in the sketch below.
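A minimal skeleton of this fixed-point loop (ours; the per-factor update callables stand in for the closed forms derived above):

```python
import numpy as np

def e_step(state, updates, tol=1e-4, max_iter=100):
    """Iterate the fixed-point equations to a local optimum. `state` maps
    parameter names (e.g., 'z', 'gamma', 'c', 'mu') to numpy arrays;
    `updates` is a list of callables, one per factor of q, each returning
    an updated state dict."""
    for _ in range(max_iter):
        old = {k: v.copy() for k, v in state.items()}
        for update in updates:
            state = update(state)
        # stop when the largest absolute parameter change falls below tol
        if max(np.max(np.abs(state[k] - old[k])) for k in state) < tol:
            break
    return state
```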

M-Step

In the M-step, we maximize $\mathcal{L}(q, \Theta)$ with respect to the model parameters $\Theta = \{B, \Sigma, \delta, \nu, \Phi\}$. Recall that

\[
\mathcal{L}(q, \Theta) := \mathbb{E}_q\left[\log p(E, X \mid \Theta) - \log q(X)\right].
\]

Note that the variational distribution $q$ is not actually a function of the model parameters $\Theta$; the model parameters that appear in $q$'s optimal solution come from the previous M-step, similar to regular EM. Hence it suffices to maximize

\[
\mathcal{L}'(q, \Theta) := \mathbb{E}_q\left[\log p(E, X \mid \Theta)\right]
\]

\begin{align*}
&= \mathbb{E}_q\Bigg[ \log \prod_{t,i=1}^{T,N} \prod_{j \neq i}^{N} p\!\left(E_{i,j}^{(t)} \mid z_{i\to j}^{(t)}, z_{i\leftarrow j}^{(t)}; B\right) p\!\left(z_{i\to j}^{(t)} \mid \gamma_i^{(t)}\right) p\!\left(z_{i\leftarrow j}^{(t)} \mid \gamma_j^{(t)}\right) \\
&\qquad\quad \prod_{t,i=1}^{T,N} p\!\left(\gamma_i^{(t)} \mid c_i^{(t)}, \mu_1^{(t)}, \ldots, \mu_C^{(t)}; \Sigma_1, \ldots, \Sigma_C\right) p\!\left(c_i^{(t)}; \delta\right) \left( \prod_{h=1}^{C} p\!\left(\mu_h^{(1)}; \nu, \Phi\right) \prod_{t=2}^{T} p\!\left(\mu_h^{(t)} \mid \mu_h^{(t-1)}; \Phi\right) \right) \Bigg] \\
&= \mathbb{E}_q\left[ \sum_{t,i=1}^{T,N} \sum_{j \neq i}^{N} \log p\!\left(E_{i,j}^{(t)} \mid z_{i\to j}^{(t)}, z_{i\leftarrow j}^{(t)}; B\right) \right] + \mathbb{E}_q\left[ \sum_{t,i=1}^{T,N} \sum_{j \neq i}^{N} \log p\!\left(z_{i\to j}^{(t)} \mid \gamma_i^{(t)}\right) p\!\left(z_{i\leftarrow j}^{(t)} \mid \gamma_j^{(t)}\right) \right] \\
&\quad + \mathbb{E}_q\left[ \sum_{t,i=1}^{T,N} \log p\!\left(\gamma_i^{(t)} \mid c_i^{(t)}, \mu_1^{(t)}, \ldots, \mu_C^{(t)}; \Sigma_1, \ldots, \Sigma_C\right) \right] + \mathbb{E}_q\left[ \sum_{t,i=1}^{T,N} \log p\!\left(c_i^{(t)}; \delta\right) \right] \\
&\quad + \mathbb{E}_q\left[ \sum_{h=1}^{C} \log p\!\left(\mu_h^{(1)}; \nu, \Phi\right) + \sum_{h=1}^{C} \sum_{t=2}^{T} \log p\!\left(\mu_h^{(t)} \mid \mu_h^{(t-1)}; \Phi\right) \right].
\end{align*}

Maximizing B

Consider the $B$-dependent terms in $\mathcal{L}'(q, \Theta)$,

\begin{align*}
\mathbb{E}_q\left[ \sum_{t,i=1}^{T,N} \sum_{j \neq i}^{N} \log p\!\left(E_{i,j}^{(t)} \mid z_{i\to j}^{(t)}, z_{i\leftarrow j}^{(t)}; B\right) \right]
&= \sum_{t,i=1}^{T,N} \sum_{j \neq i}^{N} \mathbb{E}_q\left[ \log p\!\left(E_{i,j}^{(t)} \mid z_{i\to j}^{(t)}, z_{i\leftarrow j}^{(t)}; B\right) \right] \\
&= \sum_{t,i=1}^{T,N} \sum_{j \neq i}^{N} \sum_{z_{i\to j}^{(t)}} \sum_{z_{i\leftarrow j}^{(t)}} q_z\!\left(z_{i\to j}^{(t)}, z_{i\leftarrow j}^{(t)}\right) \log p\!\left(E_{i,j}^{(t)} \mid z_{i\to j}^{(t)}, z_{i\leftarrow j}^{(t)}; B\right)
\end{align*}

($z$s independent of other latent variables under $q$). Since $z_{i\to j}^{(t)}, z_{i\leftarrow j}^{(t)}$ are indicator variables, we index their possible values with $k \in \{1, \ldots, K\}$ and $l \in \{1, \ldots, K\}$, respectively:

\begin{align}
&= \sum_{t,i=1}^{T,N} \sum_{j \neq i}^{N} \sum_{k,l=1}^{K,K} q_z\!\left(z_{i\to j}^{(t)} = k, z_{i\leftarrow j}^{(t)} = l\right) \log p\!\left(E_{i,j}^{(t)} \mid z_{i\to j}^{(t)} = k, z_{i\leftarrow j}^{(t)} = l; B\right) \notag \\
&= \sum_{t,i=1}^{T,N} \sum_{j \neq i}^{N} \sum_{k,l=1}^{K,K} q_z\!\left(z_{i\to j}^{(t)} = k, z_{i\leftarrow j}^{(t)} = l\right) \left( E_{i,j}^{(t)} \log B_{k,l} + \left(1 - E_{i,j}^{(t)}\right) \log\left(1 - B_{k,l}\right) \right). \tag{23.5}
\end{align}

Setting the first derivative wrt $B_{k,l}$ to zero yields the maximizer $\hat{B}_{k,l}$ for $\mathcal{L}'(q, \Theta)$:

\begin{align*}
0 &= \frac{\partial}{\partial B_{k,l}} \sum_{t,i=1}^{T,N} \sum_{j \neq i}^{N} \sum_{k',l'=1}^{K,K} q_z\!\left(z_{i\to j}^{(t)} = k', z_{i\leftarrow j}^{(t)} = l'\right) \left( E_{i,j}^{(t)} \log B_{k',l'} + \left(1 - E_{i,j}^{(t)}\right) \log\left(1 - B_{k',l'}\right) \right) \\
0 &= \sum_{t,i=1}^{T,N} \sum_{j \neq i}^{N} q_z\!\left(z_{i\to j}^{(t)} = k, z_{i\leftarrow j}^{(t)} = l\right) \left( \frac{E_{i,j}^{(t)}}{B_{k,l}} - \frac{1 - E_{i,j}^{(t)}}{1 - B_{k,l}} \right) \\
0 &= \sum_{t,i=1}^{T,N} \sum_{j \neq i}^{N} q_z\!\left(z_{i\to j}^{(t)} = k, z_{i\leftarrow j}^{(t)} = l\right) \left( E_{i,j}^{(t)} - B_{k,l} \right) \\
\hat{B}_{k,l} &:= B_{k,l} = \frac{\sum_{t,i=1}^{T,N} \sum_{j \neq i}^{N} q_z\!\left(z_{i\to j}^{(t)} = k, z_{i\leftarrow j}^{(t)} = l\right) E_{i,j}^{(t)}}{\sum_{t,i=1}^{T,N} \sum_{j \neq i}^{N} q_z\!\left(z_{i\to j}^{(t)} = k, z_{i\leftarrow j}^{(t)} = l\right)}.
\end{align*}
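In code, the $\hat{B}$ update is a weighted average of edge indicators under the pairwise tables $q_z$; a hedged numpy sketch (the array layout is our assumption):

```python
import numpy as np

def update_B(qz, E):
    """M-step update for the role-compatibility matrix B.
    qz: (T, N, N, K, K) tables q_z(z_{i->j}=k, z_{i<-j}=l); E: (T, N, N)
    binary adjacency; entries with i == j are assumed zeroed out."""
    num = np.einsum('tij,tijkl->kl', E, qz)   # q_z-weighted edge counts
    den = qz.sum(axis=(0, 1, 2))              # total pair mass per (k, l)
    return num / den
```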

Maximizing Σ

Consider the $\Sigma_1, \ldots, \Sigma_C$-dependent terms in $\mathcal{L}'(q, \Theta)$,

\begin{align*}
&\mathbb{E}_q\left[ \sum_{t,i=1}^{T,N} \log p\!\left(\gamma_i^{(t)} \mid c_i^{(t)}, \mu_1^{(t)}, \ldots, \mu_C^{(t)}; \Sigma_1, \ldots, \Sigma_C\right) \right]
= \sum_{t,i=1}^{T,N} \mathbb{E}_q\left[ \log p\!\left(\gamma_i^{(t)} \mid c_i^{(t)}, \mu_1^{(t)}, \ldots, \mu_C^{(t)}; \Sigma_1, \ldots, \Sigma_C\right) \right] \\
&= \sum_{t,i=1}^{T,N} \mathbb{E}_q\left[ \log \prod_{h=1}^{C} \left( (2\pi)^{-K/2} |\Sigma_h|^{-1/2} \exp\left\{ -\frac{1}{2} \left(\gamma_i^{(t)} - \mu_h^{(t)}\right)^{\!\top} \Sigma_h^{-1} \left(\gamma_i^{(t)} - \mu_h^{(t)}\right) \right\} \right)^{c_{i,h}^{(t)}} \right] \\
&= \sum_{t,i=1}^{T,N} \sum_{h=1}^{C} \mathbb{E}_q\left[ c_{i,h}^{(t)} \log\left( (2\pi)^{-K/2} |\Sigma_h|^{-1/2} \right) - \frac{1}{2} c_{i,h}^{(t)} \left(\gamma_i^{(t)} - \mu_h^{(t)}\right)^{\!\top} \Sigma_h^{-1} \left(\gamma_i^{(t)} - \mu_h^{(t)}\right) \right] \\
&= \sum_{t,i=1}^{T,N} \sum_{h=1}^{C} -\log\left( (2\pi)^{K/2} |\Sigma_h|^{1/2} \right) \mathbb{E}_q\left[ c_{i,h}^{(t)} \right] - \frac{1}{2} \mathbb{E}_q\left[ c_{i,h}^{(t)} \operatorname{tr}\left[ \Sigma_h^{-1} \left( \gamma_i^{(t)} \left(\gamma_i^{(t)}\right)^{\!\top} - \mu_h^{(t)} \left(\gamma_i^{(t)}\right)^{\!\top} - \gamma_i^{(t)} \left(\mu_h^{(t)}\right)^{\!\top} + \mu_h^{(t)} \left(\mu_h^{(t)}\right)^{\!\top} \right) \right] \right].
\end{align*}

Since c, µ, γ are independent of each other (and other latent variables) under q,

\begin{align}
= \sum_{t,i=1}^{T,N} \sum_{h=1}^{C} &-\log\left( (2\pi)^{K/2} |\Sigma_h|^{1/2} \right) \left\langle c_{i,h}^{(t)} \right\rangle \notag \\
&- \frac{1}{2} \left\langle c_{i,h}^{(t)} \right\rangle \operatorname{tr}\left[ \Sigma_h^{-1} \left( \left\langle \gamma_i^{(t)} \left(\gamma_i^{(t)}\right)^{\!\top} \right\rangle - \left\langle \mu_h^{(t)} \right\rangle \left\langle \gamma_i^{(t)} \right\rangle^{\!\top} - \left\langle \gamma_i^{(t)} \right\rangle \left\langle \mu_h^{(t)} \right\rangle^{\!\top} + \left\langle \mu_h^{(t)} \left(\mu_h^{(t)}\right)^{\!\top} \right\rangle \right) \right], \tag{23.6}
\end{align}

where we have defined $\langle X \rangle := \mathbb{E}_q[X]$, and the solutions to $\langle X \rangle$ are identical to the E-step. Setting the first derivative wrt $\Sigma_h$ to zero yields the maximizer $\hat{\Sigma}_h$ for $\mathcal{L}'(q, \Theta)$:

\begin{align*}
0 &= \nabla_{\Sigma_h} \sum_{t,i=1}^{T,N} \sum_{h=1}^{C} -\log\left( (2\pi)^{K/2} |\Sigma_h|^{1/2} \right) \left\langle c_{i,h}^{(t)} \right\rangle - \frac{1}{2} \left\langle c_{i,h}^{(t)} \right\rangle \operatorname{tr}\left[ \Sigma_h^{-1} \left( \left\langle \gamma_i^{(t)} \left(\gamma_i^{(t)}\right)^{\!\top} \right\rangle - \left\langle \mu_h^{(t)} \right\rangle \left\langle \gamma_i^{(t)} \right\rangle^{\!\top} - \left\langle \gamma_i^{(t)} \right\rangle \left\langle \mu_h^{(t)} \right\rangle^{\!\top} + \left\langle \mu_h^{(t)} \left(\mu_h^{(t)}\right)^{\!\top} \right\rangle \right) \right] \\
0 &= \sum_{t,i=1}^{T,N} -\frac{1}{2} \left\langle c_{i,h}^{(t)} \right\rangle \Sigma_h^{-1} + \frac{1}{2} \left\langle c_{i,h}^{(t)} \right\rangle \Sigma_h^{-1} \left( \left\langle \gamma_i^{(t)} \left(\gamma_i^{(t)}\right)^{\!\top} \right\rangle - \left\langle \gamma_i^{(t)} \right\rangle \left\langle \mu_h^{(t)} \right\rangle^{\!\top} - \left\langle \mu_h^{(t)} \right\rangle \left\langle \gamma_i^{(t)} \right\rangle^{\!\top} + \left\langle \mu_h^{(t)} \left(\mu_h^{(t)}\right)^{\!\top} \right\rangle \right) \Sigma_h^{-1} \\
0 &= \sum_{t,i=1}^{T,N} -\left\langle c_{i,h}^{(t)} \right\rangle \Sigma_h + \left\langle c_{i,h}^{(t)} \right\rangle \left( \left\langle \gamma_i^{(t)} \left(\gamma_i^{(t)}\right)^{\!\top} \right\rangle - \left\langle \gamma_i^{(t)} \right\rangle \left\langle \mu_h^{(t)} \right\rangle^{\!\top} - \left\langle \mu_h^{(t)} \right\rangle \left\langle \gamma_i^{(t)} \right\rangle^{\!\top} + \left\langle \mu_h^{(t)} \left(\mu_h^{(t)}\right)^{\!\top} \right\rangle \right) \\
\hat{\Sigma}_h &:= \Sigma_h = \frac{\sum_{t,i=1}^{T,N} \left\langle c_{i,h}^{(t)} \right\rangle \left( \left\langle \gamma_i^{(t)} \left(\gamma_i^{(t)}\right)^{\!\top} \right\rangle - \left\langle \gamma_i^{(t)} \right\rangle \left\langle \mu_h^{(t)} \right\rangle^{\!\top} - \left\langle \mu_h^{(t)} \right\rangle \left\langle \gamma_i^{(t)} \right\rangle^{\!\top} + \left\langle \mu_h^{(t)} \left(\mu_h^{(t)}\right)^{\!\top} \right\rangle \right)}{\sum_{t,i=1}^{T,N} \left\langle c_{i,h}^{(t)} \right\rangle}.
\end{align*}
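A corresponding numpy sketch of the $\hat{\Sigma}_h$ update (ours; the moment arrays are assumed precomputed in the E-step):

```python
import numpy as np

def update_Sigma(c, gamma_mean, gamma_outer, mu_mean, mu_outer):
    """M-step update for each Sigma_h. c: (T, N, C) responsibilities;
    gamma_mean: (T, N, K); gamma_outer: (T, N, K, K) <gamma gamma^T>;
    mu_mean: (T, C, K); mu_outer: (T, C, K, K) <mu_h mu_h^T>."""
    C = c.shape[2]
    Sigma = []
    for h in range(C):
        w = c[:, :, h]                                    # weights (T, N)
        M = (np.einsum('ti,tikl->kl', w, gamma_outer)
             - np.einsum('ti,tik,tl->kl', w, gamma_mean, mu_mean[:, h])
             - np.einsum('ti,tk,til->kl', w, mu_mean[:, h], gamma_mean)
             + np.einsum('ti,tkl->kl', w, mu_outer[:, h]))
        Sigma.append(M / w.sum())
    return np.stack(Sigma)
```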

Maximizing δ

Consider the $\delta$-dependent terms in $\mathcal{L}'(q, \Theta)$,

\begin{align}
\mathbb{E}_q\left[ \sum_{t,i=1}^{T,N} \log p\!\left(c_i^{(t)}; \delta\right) \right]
&= \sum_{t,i=1}^{T,N} \mathbb{E}_q\left[ \log \prod_{h=1}^{C} \delta_h^{c_{i,h}^{(t)}} \right] = \sum_{t,i=1}^{T,N} \sum_{h=1}^{C} \mathbb{E}_q\left[ c_{i,h}^{(t)} \log \delta_h \right] \notag \\
&= \sum_{t,i=1}^{T,N} \sum_{h=1}^{C} \left\langle c_{i,h}^{(t)} \right\rangle \log \delta_h = \sum_{t,i=1}^{T,N} \left\langle c_i^{(t)} \right\rangle^{\!\top} \log \delta, \tag{23.7}
\end{align}

where $\left\langle c_{i,h}^{(t)} \right\rangle := \mathbb{E}_q\!\left[ c_{i,h}^{(t)} \right]$, and the solution to $\left\langle c_{i,h}^{(t)} \right\rangle$ is identical to the E-step. Taking the first derivative with respect to $\delta_1, \ldots, \delta_{C-1}$,

\begin{align*}
\frac{\partial}{\partial \delta_h} \sum_{t,i=1}^{T,N} \sum_{h'=1}^{C} \left\langle c_{i,h'}^{(t)} \right\rangle \log \delta_{h'}
&= \sum_{t,i=1}^{T,N} \frac{\partial}{\partial \delta_h} \left[ \left\langle c_{i,h}^{(t)} \right\rangle \log \delta_h + \left\langle c_{i,C}^{(t)} \right\rangle \log\left( 1 - \sum_{h'=1}^{C-1} \delta_{h'} \right) \right] \\
&= \sum_{t,i=1}^{T,N} \frac{\left\langle c_{i,h}^{(t)} \right\rangle}{\delta_h} - \frac{\left\langle c_{i,C}^{(t)} \right\rangle}{1 - \sum_{h'=1}^{C-1} \delta_{h'}}.
\end{align*}

By setting all the derivatives to zero and performing some manipulation, we obtain the maximizer $\hat{\delta}$ for $\mathcal{L}'(q, \Theta)$:

\[
\hat{\delta} = \frac{\sum_{t,i=1}^{T,N} \left\langle c_i^{(t)} \right\rangle}{TN}.
\]
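In code this is a one-line average of the responsibilities; a sketch under our assumed array layout:

```python
import numpy as np

def update_delta(c):
    """M-step update for delta: average the responsibilities <c_i^(t)>
    over all T*N actor-time pairs. c: (T, N, C)."""
    T, N, _ = c.shape
    return c.sum(axis=(0, 1)) / (T * N)
```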


Maximizing ν,Φ

Consider the $\nu, \Phi$-dependent terms in $\mathcal{L}'(q, \Theta)$,

\begin{align*}
&\mathbb{E}_q\left[ \sum_{h=1}^{C} \log p\!\left(\mu_h^{(1)}; \nu, \Phi\right) + \sum_{h=1}^{C} \sum_{t=2}^{T} \log p\!\left(\mu_h^{(t)} \mid \mu_h^{(t-1)}; \Phi\right) \right] \\
&= \sum_{h=1}^{C} \mathbb{E}_q\left[ \log p\!\left(\mu_h^{(1)}; \nu, \Phi\right) \right] + \sum_{h=1}^{C} \sum_{t=2}^{T} \mathbb{E}_q\left[ \log p\!\left(\mu_h^{(t)} \mid \mu_h^{(t-1)}; \Phi\right) \right].
\end{align*}

We begin by maximizing wrt ν, which only requires us to focus on the first term:

\begin{align*}
\sum_{h=1}^{C} \mathbb{E}_q\left[ \log p\!\left(\mu_h^{(1)}; \nu, \Phi\right) \right]
&= \sum_{h=1}^{C} \mathbb{E}_q\left[ \log\left( (2\pi)^{-K/2} |\Phi|^{-1/2} \exp\left\{ -\frac{1}{2} \left(\mu_h^{(1)} - \nu\right)^{\!\top} \Phi^{-1} \left(\mu_h^{(1)} - \nu\right) \right\} \right) \right] \\
&= \sum_{h=1}^{C} \mathbb{E}_q\left[ \log\left( (2\pi)^{-K/2} |\Phi|^{-1/2} \right) - \frac{1}{2} \left( \left(\mu_h^{(1)}\right)^{\!\top} \Phi^{-1} \mu_h^{(1)} - \left(\mu_h^{(1)}\right)^{\!\top} \Phi^{-1} \nu - \nu^{\!\top} \Phi^{-1} \mu_h^{(1)} + \nu^{\!\top} \Phi^{-1} \nu \right) \right].
\end{align*}

Dropping terms that do not depend on ν,

\begin{align*}
&= \sum_{h=1}^{C} \mathbb{E}_q\left[ -\frac{1}{2} \left( -\left(\mu_h^{(1)}\right)^{\!\top} \Phi^{-1} \nu - \nu^{\!\top} \Phi^{-1} \mu_h^{(1)} + \nu^{\!\top} \Phi^{-1} \nu \right) \right] \\
&= \sum_{h=1}^{C} \frac{1}{2} \left\langle \mu_h^{(1)} \right\rangle^{\!\top} \Phi^{-1} \nu + \frac{1}{2} \nu^{\!\top} \Phi^{-1} \left\langle \mu_h^{(1)} \right\rangle - \frac{1}{2} \nu^{\!\top} \Phi^{-1} \nu,
\end{align*}

where $\left\langle \mu_h^{(1)} \right\rangle := \mathbb{E}_q\!\left[ \mu_h^{(1)} \right]$, and the solution to $\left\langle \mu_h^{(1)} \right\rangle$ is identical to the E-step. Setting the first derivative wrt $\nu$ to zero yields the maximizer $\hat{\nu}$ for $\mathcal{L}'(q, \Theta)$:

\begin{align*}
0 &= \nabla_\nu \sum_{h=1}^{C} \frac{1}{2} \left\langle \mu_h^{(1)} \right\rangle^{\!\top} \Phi^{-1} \nu + \frac{1}{2} \nu^{\!\top} \Phi^{-1} \left\langle \mu_h^{(1)} \right\rangle - \frac{1}{2} \nu^{\!\top} \Phi^{-1} \nu \\
0 &= \sum_{h=1}^{C} \Phi^{-1} \left\langle \mu_h^{(1)} \right\rangle - \Phi^{-1} \nu \\
\hat{\nu} &:= \nu = \frac{\sum_{h=1}^{C} \left\langle \mu_h^{(1)} \right\rangle}{C}.
\end{align*}

We now substitute $\nu = \hat{\nu}$ and consider the $\Phi$-dependent terms in $\mathcal{L}'(q, \Theta)$:

\begin{align*}
&\mathbb{E}_q\left[ \sum_{h=1}^{C} \log p\!\left(\mu_h^{(1)}; \hat{\nu}, \Phi\right) + \sum_{h=1}^{C} \sum_{t=2}^{T} \log p\!\left(\mu_h^{(t)} \mid \mu_h^{(t-1)}; \Phi\right) \right] \\
&= \sum_{h=1}^{C} \mathbb{E}_q\left[ \log p\!\left(\mu_h^{(1)}; \hat{\nu}, \Phi\right) \right] + \sum_{h=1}^{C} \sum_{t=2}^{T} \mathbb{E}_q\left[ \log p\!\left(\mu_h^{(t)} \mid \mu_h^{(t-1)}; \Phi\right) \right] \\
&= \sum_{h=1}^{C} -\log\left( (2\pi)^{K/2} |\Phi|^{1/2} \right) - \frac{1}{2} \mathbb{E}_q\left[ \left(\mu_h^{(1)} - \hat{\nu}\right)^{\!\top} \Phi^{-1} \left(\mu_h^{(1)} - \hat{\nu}\right) \right] \\
&\quad + \sum_{h=1}^{C} \sum_{t=2}^{T} -\log\left( (2\pi)^{K/2} |\Phi|^{1/2} \right) - \frac{1}{2} \mathbb{E}_q\left[ \left(\mu_h^{(t)} - \mu_h^{(t-1)}\right)^{\!\top} \Phi^{-1} \left(\mu_h^{(t)} - \mu_h^{(t-1)}\right) \right] \\
&= -TC \log\left( (2\pi)^{K/2} |\Phi|^{1/2} \right) - \sum_{h=1}^{C} \frac{1}{2} \operatorname{tr}\left[ \Phi^{-1} \left( \left\langle \mu_h^{(1)} \left(\mu_h^{(1)}\right)^{\!\top} \right\rangle - \hat{\nu} \left\langle \mu_h^{(1)} \right\rangle^{\!\top} - \left\langle \mu_h^{(1)} \right\rangle \hat{\nu}^{\!\top} + \hat{\nu} \hat{\nu}^{\!\top} \right) \right] \\
&\quad - \sum_{h=1}^{C} \sum_{t=2}^{T} \frac{1}{2} \operatorname{tr}\left[ \Phi^{-1} \left( \left\langle \mu_h^{(t)} \left(\mu_h^{(t)}\right)^{\!\top} \right\rangle - \left\langle \mu_h^{(t-1)} \left(\mu_h^{(t)}\right)^{\!\top} \right\rangle - \left\langle \mu_h^{(t)} \left(\mu_h^{(t-1)}\right)^{\!\top} \right\rangle + \left\langle \mu_h^{(t-1)} \left(\mu_h^{(t-1)}\right)^{\!\top} \right\rangle \right) \right],
\end{align*}

where $\langle X \rangle := \mathbb{E}_q[X]$. The solutions to $\left\langle \mu_h^{(1)} \right\rangle$, $\left\langle \mu_h^{(t)} \left(\mu_h^{(t)}\right)^{\!\top} \right\rangle$, $\left\langle \mu_h^{(t-1)} \left(\mu_h^{(t-1)}\right)^{\!\top} \right\rangle$ are identical to the E-step. The remaining expectations are

\[
\left\langle \mu_h^{(t)} \left(\mu_h^{(t-1)}\right)^{\!\top} \right\rangle = \left\langle \mu_h^{(t-1)} \left(\mu_h^{(t)}\right)^{\!\top} \right\rangle^{\!\top} = P_h^{(t)|(T)} \left( L_h^{(t-1)} \right)^{\!\top} + \left\langle \mu_h^{(t)} \right\rangle \left\langle \mu_h^{(t-1)} \right\rangle^{\!\top},
\]

where $P$ and $L$ are defined in the section discussing the Kalman smoother. Setting the first derivative wrt $\Phi$ to zero yields the maximizer $\hat{\Phi}$ for $\mathcal{L}'(q, \Theta)$:

\begin{align*}
0 &= \nabla_\Phi \Bigg[ -TC \log\left( (2\pi)^{K/2} |\Phi|^{1/2} \right) - \sum_{h=1}^{C} \frac{1}{2} \operatorname{tr}\left[ \Phi^{-1} \left( \left\langle \mu_h^{(1)} \left(\mu_h^{(1)}\right)^{\!\top} \right\rangle - \hat{\nu} \left\langle \mu_h^{(1)} \right\rangle^{\!\top} - \left\langle \mu_h^{(1)} \right\rangle \hat{\nu}^{\!\top} + \hat{\nu} \hat{\nu}^{\!\top} \right) \right] \\
&\qquad\quad - \sum_{h=1}^{C} \sum_{t=2}^{T} \frac{1}{2} \operatorname{tr}\left[ \Phi^{-1} \left( \left\langle \mu_h^{(t)} \left(\mu_h^{(t)}\right)^{\!\top} \right\rangle - \left\langle \mu_h^{(t-1)} \left(\mu_h^{(t)}\right)^{\!\top} \right\rangle - \left\langle \mu_h^{(t)} \left(\mu_h^{(t-1)}\right)^{\!\top} \right\rangle + \left\langle \mu_h^{(t-1)} \left(\mu_h^{(t-1)}\right)^{\!\top} \right\rangle \right) \right] \Bigg] \\
0 &= -\frac{TC}{2} \Phi^{-1} + \sum_{h=1}^{C} \frac{1}{2} \Phi^{-1} \left( \left\langle \mu_h^{(1)} \left(\mu_h^{(1)}\right)^{\!\top} \right\rangle - \left\langle \mu_h^{(1)} \right\rangle \hat{\nu}^{\!\top} - \hat{\nu} \left\langle \mu_h^{(1)} \right\rangle^{\!\top} + \hat{\nu} \hat{\nu}^{\!\top} \right) \Phi^{-1} \\
&\qquad + \sum_{h=1}^{C} \sum_{t=2}^{T} \frac{1}{2} \Phi^{-1} \left( \left\langle \mu_h^{(t)} \left(\mu_h^{(t)}\right)^{\!\top} \right\rangle - \left\langle \mu_h^{(t)} \left(\mu_h^{(t-1)}\right)^{\!\top} \right\rangle - \left\langle \mu_h^{(t-1)} \left(\mu_h^{(t)}\right)^{\!\top} \right\rangle + \left\langle \mu_h^{(t-1)} \left(\mu_h^{(t-1)}\right)^{\!\top} \right\rangle \right) \Phi^{-1} \\
0 &= -TC\, \Phi + \sum_{h=1}^{C} \left( \left\langle \mu_h^{(1)} \left(\mu_h^{(1)}\right)^{\!\top} \right\rangle - \left\langle \mu_h^{(1)} \right\rangle \hat{\nu}^{\!\top} - \hat{\nu} \left\langle \mu_h^{(1)} \right\rangle^{\!\top} + \hat{\nu} \hat{\nu}^{\!\top} \right) \\
&\qquad + \sum_{h=1}^{C} \sum_{t=2}^{T} \left( \left\langle \mu_h^{(t)} \left(\mu_h^{(t)}\right)^{\!\top} \right\rangle - \left\langle \mu_h^{(t)} \left(\mu_h^{(t-1)}\right)^{\!\top} \right\rangle - \left\langle \mu_h^{(t-1)} \left(\mu_h^{(t)}\right)^{\!\top} \right\rangle + \left\langle \mu_h^{(t-1)} \left(\mu_h^{(t-1)}\right)^{\!\top} \right\rangle \right) \\
\hat{\Phi} &:= \Phi = \frac{1}{TC} \Bigg[ \sum_{h=1}^{C} \left( \left\langle \mu_h^{(1)} \left(\mu_h^{(1)}\right)^{\!\top} \right\rangle - \left\langle \mu_h^{(1)} \right\rangle \hat{\nu}^{\!\top} - \hat{\nu} \left\langle \mu_h^{(1)} \right\rangle^{\!\top} + \hat{\nu} \hat{\nu}^{\!\top} \right) \\
&\qquad\qquad + \sum_{h=1}^{C} \sum_{t=2}^{T} \left( \left\langle \mu_h^{(t)} \left(\mu_h^{(t)}\right)^{\!\top} \right\rangle - \left\langle \mu_h^{(t)} \left(\mu_h^{(t-1)}\right)^{\!\top} \right\rangle - \left\langle \mu_h^{(t-1)} \left(\mu_h^{(t)}\right)^{\!\top} \right\rangle + \left\langle \mu_h^{(t-1)} \left(\mu_h^{(t-1)}\right)^{\!\top} \right\rangle \right) \Bigg].
\end{align*}
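A hedged numpy sketch combining the $\hat{\nu}$ and $\hat{\Phi}$ updates (ours; it assumes the smoothed means, covariances, and gains from the Kalman smoother section are stacked into arrays):

```python
import numpy as np

def update_nu_Phi(mu_hat, P, L):
    """M-step updates for nu and Phi. mu_hat: (C, T, K) smoothed means;
    P: (C, T, K, K) smoothed covariances; L: (C, T-1, K, K) smoother gains."""
    C, T, K = mu_hat.shape
    nu = mu_hat[:, 0].mean(axis=0)                       # nu_hat = mean of <mu_h^(1)>
    # second moments m_h^(t) and cross-moments <mu^(t) mu^(t-1)^T>
    m = P + np.einsum('htk,htl->htkl', mu_hat, mu_hat)
    V = (np.einsum('htkl,html->htkm', P[:, 1:], L)       # P^(t)|(T) (L^(t-1))^T
         + np.einsum('htk,htl->htkl', mu_hat[:, 1:], mu_hat[:, :-1]))
    Phi = (m[:, 0].sum(axis=0)
           - np.einsum('hk,l->kl', mu_hat[:, 0], nu)     # <mu^(1)> nu^T terms
           - np.einsum('k,hl->kl', nu, mu_hat[:, 0])
           + C * np.outer(nu, nu)
           + (m[:, 1:] - V - V.transpose(0, 1, 3, 2) + m[:, :-1]).sum(axis=(0, 1))
          ) / (T * C)
    return nu, Phi
```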


Computing the Variational Lower Bound $\mathcal{L}(q, \Theta)$

The marginal likelihood lower bound $\mathcal{L}(q, \Theta)$ can be used to test for convergence in the variational EM algorithm. It also functions as a surrogate for the true marginal likelihood $p(E \mid \Theta)$; this is useful when taking random restarts, as it enables us to select the highest likelihood restart. Recall that

\begin{align*}
\mathcal{L}(q, \Theta) &= \mathbb{E}_q\left[ \log p(E, X \mid \Theta) - \log q(X) \right] \\
&= \mathbb{E}_q\left[ \sum_{t,i=1}^{T,N} \sum_{j \neq i}^{N} \log p\!\left(E_{i,j}^{(t)} \mid z_{i\to j}^{(t)}, z_{i\leftarrow j}^{(t)}; B\right) \right] + \mathbb{E}_q\left[ \sum_{t,i=1}^{T,N} \sum_{j \neq i}^{N} \log p\!\left(z_{i\to j}^{(t)} \mid \gamma_i^{(t)}\right) p\!\left(z_{i\leftarrow j}^{(t)} \mid \gamma_j^{(t)}\right) \right] \\
&\quad + \mathbb{E}_q\left[ \sum_{t,i=1}^{T,N} \log p\!\left(\gamma_i^{(t)} \mid c_i^{(t)}, \mu_1^{(t)}, \ldots, \mu_C^{(t)}; \Sigma_1, \ldots, \Sigma_C\right) \right] + \mathbb{E}_q\left[ \sum_{t,i=1}^{T,N} \log p\!\left(c_i^{(t)}; \delta\right) \right] \\
&\quad + \mathbb{E}_q\left[ \sum_{h=1}^{C} \log p\!\left(\mu_h^{(1)}; \nu, \Phi\right) + \sum_{h=1}^{C} \sum_{t=2}^{T} \log p\!\left(\mu_h^{(t)} \mid \mu_h^{(t-1)}; \Phi\right) \right] \\
&\quad - \mathbb{E}_q\left[ \log q_\mu\!\left(\mu_1^{(1)}, \ldots, \mu_C^{(T)}\right) \right] - \mathbb{E}_q\left[ \sum_{t,i=1}^{T,N} \log q_\gamma\!\left(\gamma_i^{(t)}\right) \right] \\
&\quad - \mathbb{E}_q\left[ \sum_{t,i=1}^{T,N} \log q_c\!\left(c_i^{(t)}\right) \right] - \mathbb{E}_q\left[ \sum_{t,i,j=1}^{T,N,N} \log q_z\!\left(z_{i\to j}^{(t)}, z_{i\leftarrow j}^{(t)}\right) \right].
\end{align*}

It turns out that we cannot compute $\mathcal{L}(q, \Theta)$ exactly because of term 2, but we can lower-bound the latter to produce a lower bound $\mathcal{L}_{\mathrm{lower}}(q, \Theta)$ on $\mathcal{L}(q, \Theta)$.

Closed forms for terms 1, 3, 4, and 5 are in Equations (23.5), (23.6), (23.7), and (23.8), respectively. We now provide closed forms for terms 6, 7, 8, and 9, as well as the aforementioned lower bound for term 2.

Lower Bound for Term 2

\begin{align*}
&\mathbb{E}_q\left[ \sum_{t,i=1}^{T,N} \sum_{j \neq i}^{N} \log p\!\left(z_{i\to j}^{(t)} \mid \gamma_i^{(t)}\right) p\!\left(z_{i\leftarrow j}^{(t)} \mid \gamma_j^{(t)}\right) \right]
= \sum_{t,i=1}^{T,N} \sum_{j \neq i}^{N} \mathbb{E}_q\left[ \log \prod_{k=1}^{K} \left( \frac{\exp \gamma_{i,k}^{(t)}}{\sum_{l=1}^{K} \exp \gamma_{i,l}^{(t)}} \right)^{z_{i\to j,k}^{(t)}} \left( \frac{\exp \gamma_{j,k}^{(t)}}{\sum_{l=1}^{K} \exp \gamma_{j,l}^{(t)}} \right)^{z_{i\leftarrow j,k}^{(t)}} \right] \\
&= \sum_{t,i=1}^{T,N} \sum_{j \neq i}^{N} \sum_{k=1}^{K} \mathbb{E}_q\left[ z_{i\to j,k}^{(t)} \gamma_{i,k}^{(t)} - z_{i\to j,k}^{(t)} \log \sum_{l=1}^{K} \exp \gamma_{i,l}^{(t)} + z_{i\leftarrow j,k}^{(t)} \gamma_{j,k}^{(t)} - z_{i\leftarrow j,k}^{(t)} \log \sum_{l=1}^{K} \exp \gamma_{j,l}^{(t)} \right].
\end{align*}

Since z, γ are independent of each other under q,

\begin{align*}
= \sum_{t,i=1}^{T,N} \sum_{j \neq i}^{N} \sum_{k=1}^{K} &\left\langle z_{i\to j,k}^{(t)} \right\rangle \left\langle \gamma_{i,k}^{(t)} \right\rangle - \left\langle z_{i\to j,k}^{(t)} \right\rangle \mathbb{E}_q\left[ \log \sum_{l=1}^{K} \exp \gamma_{i,l}^{(t)} \right] \\
+ &\left\langle z_{i\leftarrow j,k}^{(t)} \right\rangle \left\langle \gamma_{j,k}^{(t)} \right\rangle - \left\langle z_{i\leftarrow j,k}^{(t)} \right\rangle \mathbb{E}_q\left[ \log \sum_{l=1}^{K} \exp \gamma_{j,l}^{(t)} \right].
\end{align*}

Applying Jensen’s inequality to the log-sum-exp terms,

\begin{align*}
&\geq \sum_{t,i=1}^{T,N} \sum_{j \neq i}^{N} \sum_{k=1}^{K} \left\langle z_{i\to j,k}^{(t)} \right\rangle \left\langle \gamma_{i,k}^{(t)} \right\rangle - \left\langle z_{i\to j,k}^{(t)} \right\rangle \log \mathbb{E}_q\left[ \sum_{l=1}^{K} \exp \gamma_{i,l}^{(t)} \right] + \left\langle z_{i\leftarrow j,k}^{(t)} \right\rangle \left\langle \gamma_{j,k}^{(t)} \right\rangle - \left\langle z_{i\leftarrow j,k}^{(t)} \right\rangle \log \mathbb{E}_q\left[ \sum_{l=1}^{K} \exp \gamma_{j,l}^{(t)} \right] \\
&= \sum_{t,i=1}^{T,N} \sum_{j \neq i}^{N} \sum_{k=1}^{K} \left\langle z_{i\to j,k}^{(t)} \right\rangle \left\langle \gamma_{i,k}^{(t)} \right\rangle - \left\langle z_{i\to j,k}^{(t)} \right\rangle \log\left( \sum_{l=1}^{K} \left\langle \exp \gamma_{i,l}^{(t)} \right\rangle \right) + \left\langle z_{i\leftarrow j,k}^{(t)} \right\rangle \left\langle \gamma_{j,k}^{(t)} \right\rangle - \left\langle z_{i\leftarrow j,k}^{(t)} \right\rangle \log\left( \sum_{l=1}^{K} \left\langle \exp \gamma_{j,l}^{(t)} \right\rangle \right) \\
&= \sum_{t,i=1}^{T,N} \sum_{j \neq i}^{N} \sum_{k=1}^{K} \left\langle z_{i\to j,k}^{(t)} \right\rangle \left( \left\langle \gamma_{i,k}^{(t)} \right\rangle - \log \sum_{l=1}^{K} \left\langle \exp \gamma_{i,l}^{(t)} \right\rangle \right) + \left\langle z_{i\leftarrow j,k}^{(t)} \right\rangle \left( \left\langle \gamma_{j,k}^{(t)} \right\rangle - \log \sum_{l=1}^{K} \left\langle \exp \gamma_{j,l}^{(t)} \right\rangle \right) \\
&= \sum_{t,i=1}^{T,N} \sum_{k=1}^{K} \left( \left\langle \gamma_{i,k}^{(t)} \right\rangle - \log \sum_{l=1}^{K} \left\langle \exp \gamma_{i,l}^{(t)} \right\rangle \right) \sum_{j \neq i}^{N} \left( \left\langle z_{i\to j,k}^{(t)} \right\rangle + \left\langle z_{j\leftarrow i,k}^{(t)} \right\rangle \right) \\
&= \sum_{t,i=1}^{T,N} \left( \left\langle \gamma_i^{(t)} \right\rangle - \log \sum_{l=1}^{K} \left\langle \exp \gamma_{i,l}^{(t)} \right\rangle \right)^{\!\top} \left( \sum_{j \neq i}^{N} \left\langle z_{i\to j}^{(t)} \right\rangle + \left\langle z_{j\leftarrow i}^{(t)} \right\rangle \right),
\end{align*}

where $\langle X \rangle := \mathbb{E}_q[X]$. The solutions to $\left\langle z_{i\to j,k}^{(t)} \right\rangle$, $\left\langle z_{i\leftarrow j,k}^{(t)} \right\rangle$, $\left\langle \gamma_{i,k}^{(t)} \right\rangle$ are in the E-step. As for $\left\langle \exp \gamma_{i,l}^{(t)} \right\rangle$, observe that for a univariate Gaussian random variable $X$ with mean $\mu$ and variance $\sigma^2$,

\begin{align*}
\mathbb{E}[\exp X] &= \int_x \frac{1}{\sqrt{2\pi}\sigma} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\} \exp\{x\}\, dx \\
&= \int_x \frac{1}{\sqrt{2\pi}\sigma} \exp\left\{ -\frac{x^2 - 2x\left(\mu + \sigma^2\right) + \mu^2}{2\sigma^2} \right\} dx \\
&= \int_x \frac{1}{\sqrt{2\pi}\sigma} \exp\left\{ -\frac{x^2 - 2x\left(\mu + \sigma^2\right) + \left(\mu^2 + 2\mu\sigma^2 + \sigma^4\right)}{2\sigma^2} \right\} \exp\left\{ \mu + \frac{\sigma^2}{2} \right\} dx \\
&= \exp\left\{ \mu + \frac{\sigma^2}{2} \right\} \int_x \frac{1}{\sqrt{2\pi}\sigma} \exp\left\{ -\frac{\left(x - \left(\mu + \sigma^2\right)\right)^2}{2\sigma^2} \right\} dx \\
&= \exp\left\{ \mu + \frac{\sigma^2}{2} \right\}.
\end{align*}

Hence,

\[
\left\langle \exp \gamma_{i,l}^{(t)} \right\rangle = \exp\left\{ \left\langle \gamma_{i,l}^{(t)} \right\rangle + \frac{1}{2} \Lambda_{i,ll}^{(t)} \right\},
\]

where $\Lambda_i$ is defined in the previous section discussing the Laplace approximation.
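This is the standard log-normal mean identity; the following small sketch (ours) applies it to the diagonal of $\Lambda_i^{(t)}$ and checks the identity by Monte Carlo:

```python
import numpy as np

def expected_exp_gamma(gamma_mean, Lambda):
    """<exp gamma_{i,l}> under the Gaussian q_gamma, via E[exp X] =
    exp(mu + sigma^2 / 2). gamma_mean: (K,) is tau_i^(t); Lambda: (K, K)
    is the covariance from the Laplace approximation."""
    return np.exp(gamma_mean + 0.5 * np.diag(Lambda))

# quick Monte Carlo check of the identity (our own sanity test):
rng = np.random.default_rng(0)
x = rng.normal(1.5, 0.7, size=1_000_000)
print(np.exp(x).mean(), np.exp(1.5 + 0.7**2 / 2))  # should roughly agree
```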


Term 6

Define

\[
\mathcal{N}(a, b, C) := (2\pi)^{-\dim(C)/2} |C|^{-1/2} \exp\left\{ -\frac{1}{2} (a - b)^{\!\top} C^{-1} (a - b) \right\}.
\]

Thus,

\begin{align*}
&-\mathbb{E}_q\left[ \log q_\mu\!\left(\mu_1^{(1)}, \ldots, \mu_C^{(T)}\right) \right] = -\mathbb{E}_q\left[ \log \prod_{h=1}^{C} q_{\mu_h}\!\left(\mu_h^{(1)}, \ldots, \mu_h^{(T)}\right) \right] \\
&= -\sum_{h=1}^{C} \mathbb{E}_q\left[ \log \mathcal{N}\!\left(\mu_h^{(1)}, \nu, \Phi\right) \mathcal{N}\!\left(\alpha_h^{(1)}, \mu_h^{(1)}, \Xi_h^{(1)}\right) \prod_{t=2}^{T} \mathcal{N}\!\left(\mu_h^{(t)}, \mu_h^{(t-1)}, \Phi\right) \mathcal{N}\!\left(\alpha_h^{(t)}, \mu_h^{(t)}, \Xi_h^{(t)}\right) \right] \\
&= -\sum_{h=1}^{C} \mathbb{E}_q\left[ \log \mathcal{N}\!\left(\mu_h^{(1)}, \nu, \Phi\right) + \log \mathcal{N}\!\left(\alpha_h^{(1)}, \mu_h^{(1)}, \Xi_h^{(1)}\right) + \sum_{t=2}^{T} \log \mathcal{N}\!\left(\mu_h^{(t)}, \mu_h^{(t-1)}, \Phi\right) + \log \mathcal{N}\!\left(\alpha_h^{(t)}, \mu_h^{(t)}, \Xi_h^{(t)}\right) \right],
\end{align*}

where $\alpha_h^{(t)}, \Xi_h^{(t)}$ are from the Kalman smoother. Also note our abuse of notation: $\nu, \Phi$ refer to the values used to compute $\hat{\mu}_h^{(t)|(T)}, P_h^{(t)|(T)}, L_h^{(t)}$ in the E-step (see the Kalman smoother section), and not their current values (recall that $q_\mu$ is not a function of $\nu, \Phi$). Now define

\begin{align*}
Z_{\mathcal{N}}(C) &:= \log (2\pi)^{-\dim(C)/2} |C|^{-1/2} \\
\Psi(a, b, C) &:= \exp\left\{ -\frac{1}{2} (a - b)^{\!\top} C^{-1} (a - b) \right\}
\end{align*}

so we have

\begin{align*}
&= -\left[ CT\, Z_{\mathcal{N}}(\Phi) + \sum_{h=1}^{C} \sum_{t=1}^{T} Z_{\mathcal{N}}\!\left(\Xi_h^{(t)}\right) \right] \\
&\quad - \sum_{h=1}^{C} \left[ \mathbb{E}_q\left[ \log \Psi\!\left(\mu_h^{(1)}, \nu, \Phi\right) \right] + \mathbb{E}_q\left[ \log \Psi\!\left(\alpha_h^{(1)}, \mu_h^{(1)}, \Xi_h^{(1)}\right) \right] + \sum_{t=2}^{T} \mathbb{E}_q\left[ \log \Psi\!\left(\mu_h^{(t)}, \mu_h^{(t-1)}, \Phi\right) \right] + \mathbb{E}_q\left[ \log \Psi\!\left(\alpha_h^{(t)}, \mu_h^{(t)}, \Xi_h^{(t)}\right) \right] \right],
\end{align*}

where

\begin{align*}
\mathbb{E}_q\left[ \log \Psi\!\left(\mu_h^{(1)}, \nu, \Phi\right) \right] &= -\frac{1}{2} \operatorname{tr}\left[ \Phi^{-1} \left( m_h^{(1)} - \nu \left( \hat{\mu}_h^{(1)|(T)} \right)^{\!\top} - \hat{\mu}_h^{(1)|(T)} \nu^{\!\top} + \nu \nu^{\!\top} \right) \right] \\
\mathbb{E}_q\left[ \log \Psi\!\left(\mu_h^{(t)}, \mu_h^{(t-1)}, \Phi\right) \right] &= -\frac{1}{2} \operatorname{tr}\left[ \Phi^{-1} \left( m_h^{(t)} - \left( V_h^{(t,t-1)} \right)^{\!\top} - V_h^{(t,t-1)} + m_h^{(t-1)} \right) \right] \quad \forall t \in \{2, \ldots, T\} \\
\mathbb{E}_q\left[ \log \Psi\!\left(\alpha_h^{(t)}, \mu_h^{(t)}, \Xi_h^{(t)}\right) \right] &= -\frac{1}{2} \operatorname{tr}\left[ \left( \Xi_h^{(t)} \right)^{-1} \left( \alpha_h^{(t)} \left( \alpha_h^{(t)} \right)^{\!\top} - \hat{\mu}_h^{(t)|(T)} \left( \alpha_h^{(t)} \right)^{\!\top} - \alpha_h^{(t)} \left( \hat{\mu}_h^{(t)|(T)} \right)^{\!\top} + m_h^{(t)} \right) \right] \quad \forall t \in \{1, \ldots, T\} \\
m_h^{(t)} &:= P_h^{(t)|(T)} + \hat{\mu}_h^{(t)|(T)} \left( \hat{\mu}_h^{(t)|(T)} \right)^{\!\top} \\
V_h^{(t,t-1)} &:= P_h^{(t)|(T)} \left( L_h^{(t-1)} \right)^{\!\top} + \hat{\mu}_h^{(t)|(T)} \left( \hat{\mu}_h^{(t-1)|(T)} \right)^{\!\top},
\end{align*}

and where $\hat{\mu}_h^{(t)|(T)}, P_h^{(t)|(T)}, L_h^{(t)}$ are from the Kalman smoother section.

Term 7

Using definitions from the previous section,

\begin{align*}
-\mathbb{E}_q\left[ \sum_{t,i=1}^{T,N} \log q_\gamma\!\left(\gamma_i^{(t)}\right) \right]
&= -\sum_{t,i=1}^{T,N} \mathbb{E}_q\left[ \log \mathcal{N}\!\left(\gamma_i^{(t)}, \tau_i^{(t)}, \Lambda_i^{(t)}\right) \right] \\
&= -\sum_{t,i=1}^{T,N} Z_{\mathcal{N}}\!\left(\Lambda_i^{(t)}\right) - \sum_{t,i=1}^{T,N} \mathbb{E}_q\left[ \log \Psi\!\left(\gamma_i^{(t)}, \tau_i^{(t)}, \Lambda_i^{(t)}\right) \right] \\
&= -\sum_{t,i=1}^{T,N} Z_{\mathcal{N}}\!\left(\Lambda_i^{(t)}\right) + \sum_{t,i=1}^{T,N} \frac{1}{2} \operatorname{tr}\left[ \left( \Lambda_i^{(t)} \right)^{-1} \left( \mathbb{E}_q\left[ \gamma_i^{(t)} \left(\gamma_i^{(t)}\right)^{\!\top} \right] - \tau_i^{(t)} \mathbb{E}_q\left[ \gamma_i^{(t)} \right]^{\!\top} - \mathbb{E}_q\left[ \gamma_i^{(t)} \right] \left( \tau_i^{(t)} \right)^{\!\top} + \tau_i^{(t)} \left( \tau_i^{(t)} \right)^{\!\top} \right) \right] \\
&= -\sum_{t,i=1}^{T,N} Z_{\mathcal{N}}\!\left(\Lambda_i^{(t)}\right) + \frac{TNK}{2},
\end{align*}

where $\Lambda_i^{(t)}$ is from the Laplace approximation section.

Term 8

Term 8 is trivial to compute since qc is discrete:

\[
-\mathbb{E}_q\left[ \sum_{t,i=1}^{T,N} \log q_c\!\left(c_i^{(t)}\right) \right] = -\sum_{t,i=1}^{T,N} \sum_{h=1}^{C} q_c\!\left(c_i^{(t)} = h\right) \log q_c\!\left(c_i^{(t)} = h\right).
\]


Term 9

Term 9 is also trivial to compute since qz is discrete:

\[
-\mathbb{E}_q\left[ \sum_{t,i=1}^{T,N} \sum_{j \neq i}^{N} \log q_z\!\left(z_{i\to j}^{(t)}, z_{i\leftarrow j}^{(t)}\right) \right] = -\sum_{t,i=1}^{T,N} \sum_{j \neq i}^{N} \sum_{k,l=1}^{K,K} q_z\!\left(z_{i\to j}^{(t)} = k, z_{i\leftarrow j}^{(t)} = l\right) \log q_z\!\left(z_{i\to j}^{(t)} = k, z_{i\leftarrow j}^{(t)} = l\right).
\]

References

Ahmed, A. and Xing, E. P. (2007). On tight approximate inference of logistic-normal admixture model. In Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS 2007). Journal of Machine Learning Research – Proceedings Track 2: 16–26.

Airoldi, E. M., Blei, D. M., Fienberg, S. E., and Xing, E. P. (2008). Mixed membership stochastic blockmodels. Journal of Machine Learning Research 9: 1981–2014.

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research 3: 993–1022.

Cho, Y.-S., Ver Steeg, G., and Galstyan, A. (2014). Mixed membership blockmodels for dynamic networks with feedback. In Airoldi, E. M., Blei, D. M., Erosheva, E. A., and Fienberg, S. E. (eds), Handbook of Mixed Membership Models and Its Applications. Chapman & Hall/CRC.

Ghahramani, Z. and Beal, M. J. (2001). Propagation algorithms for variational Bayesian learning. In Leen, T. K., Dietterich, T. G., and Tresp, V. (eds), Advances in Neural Information Processing Systems 13. Cambridge, MA: The MIT Press, 507–513.

Ghahramani, Z. and Hinton, G. E. (2000). Variational learning for switching state-space models. Neural Computation 12: 831–864.

Handcock, M. S., Raftery, A. E., and Tantrum, J. M. (2007). Model-based clustering for social networks. Journal of the Royal Statistical Society: Series A 170: 1–22.

Heaukulani, C. and Ghahramani, Z. (2013). Dynamic probabilistic models for latent feature propagation in social networks. In Proceedings of the 30th International Conference on Machine Learning (ICML '13). Omnipress, 275–283.

Hoff, P. D., Raftery, A. E., and Handcock, M. S. (2002). Latent space approaches to social network analysis. Journal of the American Statistical Association 97: 1090–1098.

Kolar, M., Song, L., Ahmed, A., and Xing, E. P. (2008). Estimating time-varying networks. Annals of Applied Statistics, to appear. http://arxiv.org/abs/0812.5087 [stat.ML].

Shetty, J. and Adibi, J. (2004). The Enron Email Dataset Database Schema and Brief Statistical Report. Tech. report, Information Sciences Institute, University of Southern California.

Soufiani, H. A. and Airoldi, E. M. (2012). Graphlet decomposition of a weighted network. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics. Vol. 22 of Journal of Machine Learning Research: Workshop and Conference Proceedings, 54–63.

Xing, E. P., Fu, W., and Song, L. (2010). A state-space mixed membership blockmodel for dynamic network tomography. Annals of Applied Statistics 4: 535–566.

Xing, E. P., Jordan, M. I., and Russell, S. (2003). A generalized mean field algorithm for variational inference in exponential families. In Proceedings of the 19th Conference on Uncertainty in Artificial Intelligence (UAI 2003). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 583–591.
