
Page 1:

Particle Learning for General Mixtures

Carlos M. Carvalho1, Hedibert F. Lopes2, Nicholas G. Polson2 and Matt A. Taddy2

1 UT-Austin, 2 University of Chicago

Presenter: Miao Liu

August 19, 2011


Page 2:

1 Introduction

2 PL algorithm

3 Other Nonparametric Models

4 Summary


Page 3:

Sequential Inference for General State-space Model

A general state-space model:

$$x_{t+1} = g(x_t, \theta)$$

$$y_{t+1} = f(x_{t+1}, \theta)$$

Sequential inference of states and parameters involves computing the joint posterior distributions $p(x_s, \dots, x_{s'}, \theta \mid y^t)$, where $y^t = (y_1, \dots, y_t)$ is the set of observations up to time $t$.

Specific problems include filtering (when $s = s' = t$) and smoothing (when $s = 0$ and $s' = t$).

$p(x_s, \dots, x_{s'}, \theta \mid y^t)$ can be computed in closed form only in very specific cases, including the linear Gaussian model and the discrete hidden Markov model.

For nonlinear dynamic systems the extended Kalman filter (EKF) and its variants can be employed, but their performance may be very poor.

Sequential Monte Carlo (particle) methods use a discrete representation of $p(x_t, \theta \mid y^t)$ via $p^N(x_t, \theta \mid y^t) = \frac{1}{N} \sum_{i=1}^N \delta_{(x_t, \theta)^{(i)}}$.


Page 4:

Particle Filtering

The filtering density we are interested in is

$$p(x_{t+1} \mid y^{t+1}) \propto p(y_{t+1} \mid x_{t+1})\, p(x_{t+1} \mid y^t)$$

where the state predictive is

$$p(x_{t+1} \mid y^t) = \int p(x_{t+1} \mid x_t)\, dP(x_t \mid y^t)$$

Importance Sampling (IS)

Assume you have at time $t$ the approximation $\hat p(x_t \mid y^t) = \frac{1}{N} \sum_{i=1}^N \delta_{x_t^{(i)}}(x_t)$.

By sampling $\tilde x_{t+1}^{(i)} \sim p(x_{t+1} \mid x_t^{(i)})$, we obtain

$$\hat p(x_{t+1} \mid y^t) = \frac{1}{N} \sum_{i=1}^N \delta_{\tilde x_{t+1}^{(i)}}(x_{t+1})$$

Our target at time $t+1$ is

$$p(x_{t+1} \mid y^{t+1}) \propto p(y_{t+1} \mid x_{t+1})\, p(x_{t+1} \mid y^t)$$

so by substituting $\hat p(x_{t+1} \mid y^t)$ for $p(x_{t+1} \mid y^t)$ we obtain

$$\tilde p(x_{t+1} \mid y^{t+1}) = \sum_{i=1}^N \omega_{t+1}^{(i)}\, \delta_{\tilde x_{t+1}^{(i)}}(x_{t+1}), \qquad \omega_{t+1}^{(i)} \propto p(y_{t+1} \mid \tilde x_{t+1}^{(i)})$$


Page 5:

Resampling

Through IS, we obtain a "weighted" approximation $\tilde p(x_{t+1} \mid y^{t+1})$ of $p(x_{t+1} \mid y^{t+1})$. This approximation is based on weighted samples from $p(x_{t+1} \mid x_t)$.

To obtain $N$ samples $x_{t+1}^{(i)}$ approximately distributed according to $p(x_{t+1} \mid y^{t+1})$, we simply resample $x_{t+1}^{(i)} \sim \tilde p(x_{t+1} \mid y^{t+1})$ to obtain

$$\hat p(x_{t+1} \mid y^{t+1}) = \frac{1}{N} \sum_{i=1}^N \delta_{x_{t+1}^{(i)}}(x_{t+1})$$

Particles with high weights are copied multiple times; particles with low weights die.

In practice, resampling is performed when the variance of the weights is larger than a pre-specified threshold.

The bootstrap filter (BF) is called a "propagate-resample" filter.

BOOTSTRAP FILTER (BF) [Gordon, Salmond and Smith (1993)]

1. Propagate: $\{x_t^{(i)}\}_{i=1}^N$ to $\{\tilde x_{t+1}^{(i)}\}_{i=1}^N$ via $p(x_{t+1} \mid x_t)$.

2. Resample: $\{x_{t+1}^{(i)}\}_{i=1}^N$ from $\{\tilde x_{t+1}^{(i)}\}_{i=1}^N$ with weights $\omega_{t+1}^{(i)} \propto p(y_{t+1} \mid \tilde x_{t+1}^{(i)})$.


Page 6:

Auxiliary particle filter (APF) [Pitt and Shephard (1999)]

1. Resample: $\{\tilde x_t^{(i)}\}_{i=1}^N$ from $\{x_t^{(i)}\}_{i=1}^N$ with weights $\omega_{t+1}^{(i)} \propto p(y_{t+1} \mid g(x_t^{(i)}))$.

2. Propagate: $\{\tilde x_t^{(i)}\}_{i=1}^N$ to $\{x_{t+1}^{(i)}\}_{i=1}^N$ via $p(x_{t+1} \mid x_t)$.

3. Resample: $\{x_{t+1}^{(i)}\}_{i=1}^N$ with weights $\omega_{t+1}^{(i)} \propto p(y_{t+1} \mid x_{t+1}^{(i)}) / p(y_{t+1} \mid g(\tilde x_t^{(i)}))$.

The APF uses a different representation of the joint filtering distribution of $(x_t, x_{t+1})$:

$$p(x_t, x_{t+1} \mid y^{t+1}) \propto p(x_{t+1} \mid x_t, y^{t+1})\, p(x_t \mid y^{t+1}) \quad (1)$$

$$\phantom{p(x_t, x_{t+1} \mid y^{t+1})} \propto p(x_{t+1} \mid x_t, y^{t+1})\, p(y_{t+1} \mid x_t)\, p(x_t \mid y^t) \quad (2)$$

The APF is a "resample-propagate" filter.

Starting with a particle approximation of $p(x_t \mid y^t)$, draws from $p(x_t \mid y^{t+1})$ are obtained by resampling the particles with weights proportional to $p(y_{t+1} \mid x_t)$. These resampled particles are then propagated forward via $p(x_{t+1} \mid x_t, y^{t+1})$.

This representation is optimal (fully adapted) when both the predictive $p(y_{t+1} \mid x_t)$ and the propagation density $p(x_{t+1} \mid x_t, y^{t+1})$ are available.

Otherwise, the importance function $p(y_{t+1} \mid g(x_t))$ is used in the resampling steps.
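A matching resample-propagate sketch of the APF, under the same interface assumptions as the bootstrap-filter snippet above, plus a point predictor `g` (e.g. the state-equation mean) for the look-ahead weights; again a sketch of the idea, not the authors' code.

```python
import numpy as np

def auxiliary_particle_filter(y, g, propagate, loglik, x0, rng=None):
    """Resample-propagate (auxiliary) particle filter of Pitt and Shephard (1999).

    g : particle-wise point prediction of x_{t+1} given x_t (look-ahead).
    """
    rng = np.random.default_rng() if rng is None else rng
    particles = np.asarray(x0)
    N = particles.shape[0]
    means = []
    for yt in y:
        # 1. Resample with look-ahead weights p(y_{t+1} | g(x_t)).
        logw1 = loglik(yt, g(particles))
        w1 = np.exp(logw1 - logw1.max()); w1 /= w1.sum()
        parents = particles[rng.choice(N, size=N, p=w1)]
        # 2. Propagate the resampled particles via p(x_{t+1} | x_t).
        particles = propagate(parents, rng)
        # 3. Resample with correction weights p(y | x_{t+1}) / p(y | g(x_t)).
        logw2 = loglik(yt, particles) - loglik(yt, g(parents))
        w2 = np.exp(logw2 - logw2.max()); w2 /= w2.sum()
        particles = particles[rng.choice(N, size=N, p=w2)]
        means.append(particles.mean(axis=0))
    return np.array(means)
```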


Page 7:

The General PL Framework

PL = APF + Parameter Estimation

Define $Z_t$ as the "essential state vector", which may include state variables, auxiliary variables, sufficient statistics, etc.

$Z_t$ allows the computation of

(a) the posterior predictive $p(y_{t+1} \mid Z_t)$,
(b) the posterior updating rule $p(Z_{t+1} \mid Z_t, y_{t+1})$,
(c) parameter learning via $p(\theta \mid Z_{t+1})$.

Particle Learning

1. Resample: $\{\tilde z_t^{(i)}\}_{i=1}^N$ from $\{z_t^{(i)}\}_{i=1}^N$ with weights $\omega_{t+1}^{(i)} \propto p(y_{t+1} \mid z_t^{(i)})$.

2. Propagate: $\{x_t^{(i)}\}_{i=1}^N$ to $\{x_{t+1}^{(i)}\}_{i=1}^N$ via $p(x_{t+1} \mid \tilde z_t, y_{t+1})$; update the sufficient statistics $s_{t+1}^{(i)} = S(\tilde s_t^{(i)}, x_{t+1}^{(i)}, y_{t+1})$.

3. Learn: $p(\theta \mid y^{t+1}) \approx \frac{1}{N} \sum_{i=1}^N p(\theta \mid z_{t+1}^{(i)})$.
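Below is a generic sketch of this resample-propagate-learn sweep. The particle representation (a dict with keys "x" and "s") and the callback names (`predictive_loglik`, `propagate`, `update_suffstats`) are conventions of mine, not the authors'; a concrete model only needs to supply those three ingredients.

```python
import numpy as np

def particle_learning(y, predictive_loglik, propagate, update_suffstats,
                      init_particles, rng=None):
    """Generic PL sweep over essential state vectors Z_t = {x_t, s_t}.

    predictive_loglik : log p(y_{t+1} | Z_t^{(i)}) for one particle
    propagate         : draws x_{t+1} from p(x_{t+1} | Z_t, y_{t+1})
    update_suffstats  : deterministic recursion S(s_t, x_{t+1}, y_{t+1})
    init_particles    : list of N dicts, e.g. {"x": ..., "s": ...}
    """
    rng = np.random.default_rng() if rng is None else rng
    particles = list(init_particles)
    N = len(particles)
    for yt in y:
        # 1. Resample with weights given by the posterior predictive p(y_{t+1} | Z_t).
        logw = np.array([predictive_loglik(z, yt) for z in particles])
        w = np.exp(logw - logw.max()); w /= w.sum()
        particles = [particles[i] for i in rng.choice(N, size=N, p=w)]
        # 2. Propagate the state and update the sufficient statistics.
        new_particles = []
        for z in particles:
            x_new = propagate(z, yt, rng)
            new_particles.append({"x": x_new,
                                  "s": update_suffstats(z["s"], x_new, yt)})
        particles = new_particles
        # 3. Learn: p(theta | y^{t+1}) is approximated by mixing the conditional
        #    posteriors p(theta | Z_{t+1}^{(i)}) over the particle set.
    return particles
```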


Page 8:

Example 1: Finite Mixture of Poisson Densities

$$p(y_t) = \sum_{j=1}^m \pi_j\, \mathrm{Po}(y_t; \theta_j)$$

$$\pi \sim \mathrm{Dir}(\gamma), \quad \theta_j \sim \mathrm{Ga}(\alpha_j, \beta_j), \quad j = 1, \dots, m$$

Define $n_t$ as the number of samples in each component and the sufficient statistics $s_t = (s_{t,1}, \dots, s_{t,m})$, where $s_{t,j} = \sum_{r=1}^t y_r \mathbb{I}_{\{k_r = j\}}$, for the conditional posterior given $y^t$ and $k^t$; the essential state vector is $Z_t = \{k_t, s_t, n_t\}$.

Predictive for $y_{t+1}$:

$$p(y_{t+1} \mid Z_t) = \sum_{k_{t+1}=j=1}^{m} \int\!\!\int p(y_{t+1}, k_{t+1} \mid \theta_{k_{t+1}})\, p(\theta, \pi)\, d(\theta, \pi)$$

$$= \sum_{k_{t+1}=j=1}^{m} \frac{\Gamma(s_{t,j} + y_{t+1} + \alpha_j)}{\Gamma(s_{t,j} + \alpha_j)}\, \frac{(\beta_j + n_{t,j})^{s_{t,j}+\alpha_j}}{(\beta_j + n_{t,j} + 1)^{s_{t,j}+\alpha_j+y_{t+1}}}\, \frac{1}{y_{t+1}!}\, \frac{\gamma_j + n_{t,j}}{\sum_{i=1}^m (\gamma_i + n_{t,i})}$$

Propagating $k_{t+1}$ is done according to

$$p(k_{t+1} = j \mid Z_t, y_{t+1}) \propto \frac{\Gamma(s_{t,j} + y_{t+1} + \alpha_j)}{\Gamma(s_{t,j} + \alpha_j)}\, \frac{(\beta_j + n_{t,j})^{s_{t,j}+\alpha_j}}{(\beta_j + n_{t,j} + 1)^{s_{t,j}+\alpha_j+y_{t+1}}}\, \frac{1}{y_{t+1}!}\, \frac{\gamma_j + n_{t,j}}{\sum_{i=1}^m (\gamma_i + n_{t,i})}$$

Given $k_{t+1}$, $Z_t$ is updated by the recursions $s_{t+1,j} = s_{t,j} + y_{t+1}\mathbb{I}_{\{k_{t+1}=j\}}$ and $n_{t+1,j} = n_{t,j} + \mathbb{I}_{\{k_{t+1}=j\}}$, for $j = 1, \dots, m$.

Learning $\theta$: $p(\theta \mid y^{t+1}) \approx \frac{1}{N} \sum_{i=1}^N p(\theta \mid Z_{t+1}^{(i)})$.
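The per-component weight above is the same quantity in the predictive and in the propagation rule, so a sketch only needs one function; the log-gamma form below is a numerically stable rewrite of it. Function names and the SciPy dependency are my choices, not the slides'.

```python
import numpy as np
from scipy.special import gammaln

def poisson_mixture_logweights(y_new, s, n, alpha, beta, gamma):
    """Log of the unnormalized component weights for a finite Poisson mixture
    with Ga(alpha_j, beta_j) and Dir(gamma) priors; s, n are the sufficient
    statistics (s_t, n_t). All arguments except y_new are length-m arrays."""
    return (gammaln(s + y_new + alpha) - gammaln(s + alpha)
            + (s + alpha) * np.log(beta + n)
            - (s + alpha + y_new) * np.log(beta + n + 1)
            - gammaln(y_new + 1)
            + np.log(gamma + n) - np.log(np.sum(gamma + n)))

def predictive_logdensity(y_new, s, n, alpha, beta, gamma):
    # log p(y_{t+1} | Z_t): log-sum of the unnormalized weights (PL resampling weight).
    logw = poisson_mixture_logweights(y_new, s, n, alpha, beta, gamma)
    return logw.max() + np.log(np.exp(logw - logw.max()).sum())

def propagate_k(y_new, s, n, alpha, beta, gamma, rng):
    # Draw k_{t+1} with probability proportional to the component weights.
    logw = poisson_mixture_logweights(y_new, s, n, alpha, beta, gamma)
    w = np.exp(logw - logw.max()); w /= w.sum()
    return rng.choice(len(w), p=w)
```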

Page 9:

Figure: Poisson Mixture Example. (a) 4-component mixture of Poissons; (b) Monte Carlo study.


Page 10:

Allocation: to obtain smoothed samples of $k^t$

Draws from $p(k^t \mid y^t)$ can be obtained through the backwards update equation

$$p(k^t \mid y^t) = \int p(k^t \mid Z_t, y^t)\, dP(Z_t \mid y^t) = \int \prod_{r=1}^t p(k_r \mid Z_t, y_r)\, dP(Z_t \mid y^t)$$

We can directly approximate $p(k^t \mid y^t)$ by sampling, for each particle $Z_t^{(i)}$, with probability $p(k_r \mid Z_t, y_r) \propto p(y_r \mid k_r = j, Z_t)\, p(k_r = j \mid Z_t)$.

The conditional independence structure of mixture models leads to this factorization, which provides an algorithm for posterior allocation of order $O(N)$.

Marginal Likelihood

Useful in Bayesian model comparison procedures, based either on Bayes factors or posterior model probabilities.

The sequential decomposition of the marginal likelihood gives $p(y^t) = \prod_{r=1}^t p(y_r \mid y^{r-1})$. The factors of this product can be estimated at each PL step as

$$p(y_t \mid y^{t-1}) = \int p(y_t \mid Z_{t-1})\, dP(Z_{t-1} \mid y^{t-1}) \approx \frac{1}{N} \sum_{i=1}^N p(y_t \mid Z_{t-1}^{(i)})$$

Given each new observation $y_t$: $p(y^t) = p(y^{t-1}) \sum_{i=1}^N p(y_t \mid Z_{t-1}^{(i)}) / N$.
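Since the same particle predictive values already serve as PL resampling weights, the running marginal likelihood costs almost nothing extra. A small sketch in log space for numerical stability (names are mine):

```python
import numpy as np

def update_log_marginal(log_marginal, pred_logliks):
    """One step of log p(y^t) = log p(y^{t-1}) + log[(1/N) sum_i p(y_t | Z_{t-1}^{(i)})],
    where pred_logliks holds log p(y_t | Z_{t-1}^{(i)}) over the N particles."""
    m = pred_logliks.max()
    return log_marginal + m + np.log(np.mean(np.exp(pred_logliks - m)))
```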


Page 11:

PL for Nonparametric mixture models

Consider the DP mixture model

$$y_t \sim k(y_t; \theta_{x_t}), \quad x_t \sim G, \quad G \sim DP(\alpha, G_0) \quad (3)$$

Define $n_t = (n_{t,1}, \dots, n_{t,m_t})$, the number of observations allocated to each unique component, and $s_t = (s_{t,1}, \dots, s_{t,m_t})$, the conditional sufficient statistics, where $m_t$ is the number of mixture components; $Z_t = (x_t, m_t, s_t, n_t)$ is the state vector to be tracked.

The uncertainty updating equation is

$$p(s_{t+1}, n_{t+1} \mid y^{t+1}) \propto \int p(s_{t+1}, n_{t+1} \mid s_t, n_t, m_t, y_{t+1})\, p(y_{t+1} \mid s_t, n_t, m_t)\, dP(s_t, n_t, m_t \mid y^t),$$

where

$$p(y_{t+1} \mid n_t, s_t) = \int\!\!\int\!\!\int k(y_{t+1}; \theta)\, dG(\theta)\, dP(G \mid \theta^\star_t, n_t)\, dP(\theta^\star_t \mid n_t, s_t) = \int k(y_{t+1}; \theta) \left[ \int dG_0^t(\theta; \theta^\star_t, n_t)\, dP(\theta^\star_t \mid n_t, s_t) \right]$$

and

$$p(s_{t+1}, n_{t+1}, m_{t+1} \mid s_t, n_t, m_t, y_{t+1}) = p(x_{t+1} \mid s_t, n_t, m_t, y_{t+1}) \propto \begin{cases} n_{t,x_{t+1}} \int k(y_{t+1}; \theta^\star_{x_{t+1}})\, dP(\theta^\star_{x_{t+1}} \mid s_{t,x_{t+1}}, n_{t,x_{t+1}}) & \text{for } x_{t+1} = 1, \dots, m_t \\ \alpha \int k(y_{t+1}; \theta)\, dG_0(\theta) & \text{if } x_{t+1} = m_t + 1 \end{cases}$$


Page 12:

PL for DP mixture models

Given a set of particles $\{(n_t, s_t, m_t)^{(i)}\}_{i=1}^N$ which approximate $p(n_t, s_t, m_t \mid y^t)$, and a new observation $y_{t+1}$, the PL algorithm for DP mixture models updates the approximation to $p(n_{t+1}, s_{t+1} \mid y^{t+1})$ as follows.

Algorithm 2: PL for DP mixture models

1. Resample: Generate an index $\xi \sim \mathrm{MN}(\omega, N)$ where

$$\omega^{(i)} = \frac{p(y_{t+1} \mid (s_t, n_t, m_t)^{(i)})}{\sum_{i=1}^N p(y_{t+1} \mid (s_t, n_t, m_t)^{(i)})}$$

2. Propagate:

$$k_{t+1} \sim p(x_{t+1} \mid (s_t, n_t, m_t)^{\xi(i)}, y_{t+1})$$

$s_{t+1} = S(s_t, k_{t+1}, y_{t+1})$. For $j \neq x_{t+1}$, $n_{t+1,j} = n_{t,j}$.
If $k_{t+1} \leq m_t$, then $n_{t+1,x_{t+1}} = n_{t,x_{t+1}} + 1$ and $m_{t+1} = m_t$.
Otherwise, $m_{t+1} = m_t + 1$ and $n_{t+1,m_{t+1}} = 1$.

3. Estimate:

$$\mathrm{E}[f(y; G) \mid y^t] \approx \frac{1}{N} \sum_{i=1}^N p(y \mid (s_t, n_t, m_t)^{(i)})$$
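A sketch of one pass of Algorithm 2 for a generic DP mixture. The particle layout (dicts of per-component sufficient statistics and counts) and the callbacks are assumptions of mine; the two marginal-likelihood callbacks stand for the integrals $\int k(y;\theta^\star)\,dP(\theta^\star \mid s_j, n_j)$ and $\int k(y;\theta)\,dG_0(\theta)$ above.

```python
import numpy as np

def _log_alloc_weights(p, y_new, comp_pred_loglik, base_pred_loglik, alpha):
    """Unnormalized log weights over k_{t+1} = 1..m_t (existing) and m_t+1 (new)."""
    n, t = p["n"], p["n"].sum()
    logw = [np.log(nj) + comp_pred_loglik(y_new, sj, nj)
            for nj, sj in zip(n, p["s"])]
    logw.append(np.log(alpha) + base_pred_loglik(y_new))
    return np.array(logw) - np.log(alpha + t)

def dp_pl_step(particles, y_new, comp_pred_loglik, base_pred_loglik,
               update_suffstat, new_suffstat, alpha, rng):
    """One resample-propagate step of Algorithm 2 for a DP mixture (sketch).

    particles        : list of dicts {"s": per-component suff. stats, "n": counts}
    comp_pred_loglik : log of  int k(y; theta*) dP(theta* | s_j, n_j)
    base_pred_loglik : log of  int k(y; theta) dG0(theta)
    update_suffstat  : s_j <- S(s_j, y_new) for an existing component
    new_suffstat     : sufficient statistic of a new component seeded with y_new
    """
    N = len(particles)
    all_logw = [_log_alloc_weights(p, y_new, comp_pred_loglik,
                                   base_pred_loglik, alpha) for p in particles]
    # 1. Resample: particle weights are p(y_{t+1} | (s_t, n_t, m_t)^{(i)}),
    #    i.e. the sum over the allocation weights.
    logpred = np.array([lw.max() + np.log(np.exp(lw - lw.max()).sum())
                        for lw in all_logw])
    w = np.exp(logpred - logpred.max()); w /= w.sum()
    idx = rng.choice(N, size=N, p=w)
    # 2. Propagate: draw k_{t+1}, then update (s, n, m) by the recursions above.
    out = []
    for i in idx:
        p, lw = particles[i], all_logw[i]
        pr = np.exp(lw - lw.max()); pr /= pr.sum()
        k = rng.choice(len(pr), p=pr)
        s, n = list(p["s"]), p["n"].copy()
        if k < len(n):                      # existing component
            s[k] = update_suffstat(s[k], y_new); n[k] += 1
        else:                               # open a new component
            s.append(new_suffstat(y_new)); n = np.append(n, 1)
        out.append({"s": s, "n": n})
    return out
```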


Page 13:

Example 2: DP mixture of Multivariate Normals

$$f(y_t; G) = \int \mathcal{N}(y_t \mid \mu_t, \Sigma_t)\, dG(\mu_t, \Sigma_t), \quad G \sim DP(\alpha, G_0(\mu, \Sigma))$$

with $G_0 = \mathcal{N}(\mu; \lambda, \Sigma/\kappa)\, \mathcal{W}(\Sigma^{-1}; \nu, \Omega)$.

$Z_t = \{k_t, s_t, n_t, m_t\}$, where $s_t$ is the conditional sufficient statistic for each unique mixture component, defined by $\bar y_{t,j} = \sum_{r: k_r = j} y_r / n_{t,j}$ and $S_{t,j} = \sum_{r: k_r = j} (y_r - \bar y_{t,j})(y_r - \bar y_{t,j})'$.

The predictive density for sampling is

$$p(y_{t+1} \mid Z_t) = \frac{\alpha}{\alpha + t}\, \mathrm{St}(y_{t+1}; a_0, B_0, c_0) + \sum_{j=1}^{m_t} \frac{n_{t,j}}{\alpha + t}\, \mathrm{St}(y_{t+1}; a_{t,j}, B_{t,j}, c_{t,j})$$

where the Student's t distributions are parametrized by $a_0 = \lambda$, $B_0 = \frac{2(\kappa+1)}{\kappa c_0}\,\Omega$, $c_0 = 2\nu - d + 1$, $a_{t,j} = \frac{\kappa\lambda + n_{t,j}\bar y_{t,j}}{\kappa + n_{t,j}}$, ...

Propagate $k_{t+1}$ according to

$$p(k_{t+1} = j) \propto \frac{n_{t,j}}{\alpha + t}\, \mathrm{St}(y_{t+1}; a_{t,j}, B_{t,j}, c_{t,j}) \quad \text{for } j = 1, \dots, m_t$$

$$p(k_{t+1} = m_t + 1) \propto \frac{\alpha}{\alpha + t}\, \mathrm{St}(y_{t+1}; a_0, B_0, c_0)$$
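A sketch of this allocation step, reading $\mathrm{St}(y; a, B, c)$ as a multivariate Student-t with location $a$, scale matrix $B$ and $c$ degrees of freedom (my assumption about the notation), and using `scipy.stats.multivariate_t` for the density:

```python
import numpy as np
from scipy.stats import multivariate_t

def dp_mvn_allocation_probs(y_new, comps, a0, B0, c0, alpha, t):
    """Allocation probabilities for the DP mixture of multivariate normals.

    comps : list of (n_j, a_j, B_j, c_j) for the m_t existing components.
    Returns a length m_t + 1 vector; the last entry is the probability of
    opening a new component.
    """
    w = [n_j / (alpha + t) * multivariate_t.pdf(y_new, loc=a_j, shape=B_j, df=c_j)
         for (n_j, a_j, B_j, c_j) in comps]
    w.append(alpha / (alpha + t) * multivariate_t.pdf(y_new, loc=a0, shape=B0, df=c0))
    w = np.asarray(w)
    return w / w.sum()
```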


Pages 14-16: (figure-only slides)

Page 17:

Latent Feature Models via the Indian Buffet Process

Consider the infinite linear-Gaussian matrix factorization model of Griffiths and Ghahramani (2006):

$$Y_T = Z_T B + E \quad (4)$$

$Y_T = \{y'_1, y'_2, \dots, y'_T\}$, where each $y_t \in \mathbb{R}^p$, $Z_T \in \{0,1\}^{T \times k}$, $B \in \mathbb{R}^{k \times p}$, $E \in \mathbb{R}^{T \times p}$. Assume $e_{i,j} \sim \mathcal{N}(0, \sigma^2)$ and $b_{r,q} \sim \mathcal{N}(0, \sigma_b^2)$.

Eqn (4) can be cast as a state-space model where each row of $Z_T$ is a state and the transition $p(z_{t+1} \mid Z_t)$ follows from the IBP generative process: an existing feature $j$ is included with probability $p(z_{t+1,j} = 1 \mid Z_t) = n_{t,j}/(t+1)$.

Define the essential state vector $Z_t = (m_t, s_t, n_t)$, where $m_t$ is the number of current latent features, $n_t$ is the number of objects allocated to each of the currently available features, and $s_t$ is the set of conditional sufficient statistics for $B$. Let $m_*$ denote the number of potential new features per particle, $m_*^{(i)} \sim \mathrm{Po}(\alpha/t)$.

The posterior predictive: $p(y_{t+1} \mid Z_t, m_*) \propto \sum_{z_{t+1} \in C} p(y_{t+1} \mid z_{t+1}, s_t)\, p(z_{t+1} \mid Z_t)$, where $C$ is defined by all $2^{m_t}$ possible configurations of $z_{t+1}$ in which the final $m_*$ elements are fixed at one.

Propagating $Z_{t+1}$ forward: $p(z_{t+1} \mid Z_t, y_{t+1}) \propto p(y_{t+1} \mid z_{t+1}, s_t)\, p(z_{t+1} \mid Z_t)$.
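A sketch of the enumeration over $C$ used by the predictive and propagation rules above. The inclusion probability $n_{t,j}/(t+1)$ for existing features and the callback `loglik` (the marginal $p(y_{t+1} \mid z_{t+1}, s_t)$ with $B$ integrated out) are my assumptions; the enumeration is only feasible for small $m_t$.

```python
import numpy as np
from itertools import product

def ibp_allocation_posterior(y_new, n, m_star, t, s, loglik):
    """Enumerate the 2^{m_t} configurations of z_{t+1} over existing features,
    with m_star trailing new-feature entries fixed at one, and return them with
    their posterior probabilities p(z_{t+1} | Z_t, y_{t+1}).

    n      : counts n_{t,j} for the m_t existing features
    loglik : (y_new, z, s) -> log p(y_{t+1} | z_{t+1}, s_t)
    """
    m_t = len(n)
    prior_on = np.asarray(n, dtype=float) / (t + 1.0)   # IBP inclusion probabilities
    configs, logpost = [], []
    for bits in product([0, 1], repeat=m_t):
        bits = np.array(bits)
        z = np.concatenate([bits, np.ones(m_star)])
        logprior = np.sum(np.where(bits == 1, np.log(prior_on),
                                   np.log1p(-prior_on)))
        configs.append(z)
        logpost.append(logprior + loglik(y_new, z, s))
    logpost = np.array(logpost)
    # Summing exp(logpost) gives the (unnormalized) predictive p(y_{t+1} | Z_t, m_*).
    post = np.exp(logpost - logpost.max())
    return configs, post / post.sum()
```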


Page 18:

Summary

PL is essentially an auxiliary particle filter applied in settings where sufficient statistics exist.

PL can provide on-line inference for a wide class of mixture models.

Computation can be accelerated by exploiting the parallel nature of PL.

Caveat: degeneracy is unavoidable. Increasing the number of observations without simultaneously (and exponentially) increasing the number of particles necessarily leads to degeneracy of the simulated sufficient statistic paths. In the experiments of [Chopin et al. 2010], this degeneracy always occurred between 5,000 and 10,000 observations.
