
Page 1:

Particle Learning for General Mixtures

Carlos M. Carvalho1, Hedibert F. Lopes2, Nicholas G. Polson2 and Matt A. Taddy2

1 UT-Austin, 2 University of Chicago

Presenter: Miao Liu

August 19, 2011


Page 2:

1 Introduction

2 PL algorithm

3 Other Nonparametric Models

4 Summary


Page 3:

Sequential Inference for General State-space Model

A general state-space model:

$$x_{t+1} = g(x_t, \theta)$$

$$y_{t+1} = f(x_{t+1}, \theta)$$

Sequential inference of states and parameters involves computing the joint posterior distributions $p(x_s, \dots, x_{s'}, \theta \mid y^t)$, where $y^t = (y_1, \dots, y_t)$ is the set of observations up to time $t$.

Specific problems include filtering (when $s = s' = t$) and smoothing (when $s = 0$ and $s' = t$).

$p(x_s, \dots, x_{s'}, \theta \mid y^t)$ can be computed in closed form only in very specific cases, including the linear Gaussian model and the discrete hidden Markov model.

For nonlinear dynamic systems the extended Kalman filter (EKF) and its variants can be employed, but their performance may be very poor.

Sequential Monte Carlo (particle) methods use a discrete representation of $p(x_t, \theta \mid y^t)$ via $p^N(x_t, \theta \mid y^t) = \frac{1}{N} \sum_{i=1}^N \delta_{(x_t, \theta)^{(i)}}$.


Page 4:

Particle Filtering

The filtering density we are interested in is

$$p(x_{t+1} \mid y^{t+1}) \propto p(y_{t+1} \mid x_{t+1})\, p(x_{t+1} \mid y^t)$$

where the state predictive is

$$p(x_{t+1} \mid y^t) = \int p(x_{t+1} \mid x_t)\, dP(x_t \mid y^t)$$

Importance Sampling (IS)

Assume you have at time $t$ the approximation $\hat p(x_t \mid y^t) = \frac{1}{N} \sum_{i=1}^N \delta_{x_t^{(i)}}(x_t)$.

By sampling $\tilde x_{t+1}^{(i)} \sim p(x_{t+1} \mid x_t^{(i)})$, we obtain

$$\hat p(x_{t+1} \mid y^t) = \frac{1}{N} \sum_{i=1}^N \delta_{\tilde x_{t+1}^{(i)}}(x_{t+1})$$

Our target at time $t+1$ is

$$p(x_{t+1} \mid y^{t+1}) \propto p(y_{t+1} \mid x_{t+1})\, p(x_{t+1} \mid y^t)$$

so by substituting $\hat p(x_{t+1} \mid y^t)$ for $p(x_{t+1} \mid y^t)$ we obtain

$$\tilde p(x_{t+1} \mid y^{t+1}) = \sum_{i=1}^N \omega_{t+1}^{(i)}\, \delta_{\tilde x_{t+1}^{(i)}}(x_{t+1}), \qquad \omega_{t+1}^{(i)} \propto p(y_{t+1} \mid \tilde x_{t+1}^{(i)})$$


Page 5:

Resampling

Through IS, we obtain a "weighted" approximation $\tilde p(x_{t+1} \mid y^{t+1})$ of $p(x_{t+1} \mid y^{t+1})$. This approximation is based on weighted samples from $p(x_{t+1} \mid x_t)$.

To obtain $N$ samples $x_{t+1}^{(i)}$ approximately distributed according to $p(x_{t+1} \mid y^{t+1})$, we simply resample $x_{t+1}^{(i)} \sim \tilde p(x_{t+1} \mid y^{t+1})$ to obtain

$$\hat p(x_{t+1} \mid y^{t+1}) = \frac{1}{N} \sum_{i=1}^N \delta_{x_{t+1}^{(i)}}(x_{t+1})$$

Particles with high weights are copied multiple times; particles with low weights die.

In practice, resampling is performed when the variance of the weights is larger than a pre-specified threshold.

The bootstrap filter (BF) is called a "propagate-resample" filter.

BOOTSTRAP FILTER (BF) [Gordon, Salmond and Smith (1993)]

1. Propagate: $\{x_t^{(i)}\}_{i=1}^N$ to $\{\tilde x_{t+1}^{(i)}\}_{i=1}^N$ via $p(x_{t+1} \mid x_t)$.

2. Resample: $\{x_{t+1}^{(i)}\}_{i=1}^N$ from $\{\tilde x_{t+1}^{(i)}\}_{i=1}^N$ with weights $\omega_{t+1}^{(i)} \propto p(y_{t+1} \mid \tilde x_{t+1}^{(i)})$.


Page 6:

Auxiliary particle filter (APF) [Pitt and Shephard (1999)]

1. Resample: $\{\tilde x_t^{(i)}\}_{i=1}^N$ from $\{x_t^{(i)}\}_{i=1}^N$ with weights $\omega_{t+1}^{(i)} \propto p(y_{t+1} \mid g(x_t^{(i)}))$.

2. Propagate: $\{\tilde x_t^{(i)}\}_{i=1}^N$ to $\{x_{t+1}^{(i)}\}_{i=1}^N$ via $p(x_{t+1} \mid x_t)$.

3. Resample: $\{x_{t+1}^{(i)}\}_{i=1}^N$ with weights $\omega_{t+1}^{(i)} \propto p(y_{t+1} \mid x_{t+1}^{(i)}) / p(y_{t+1} \mid g(\tilde x_t^{(i)}))$.

The APF uses a different representation of the joint filtering distribution of $(x_t, x_{t+1})$:

$$p(x_t, x_{t+1} \mid y^{t+1}) \propto p(x_{t+1} \mid x_t, y^{t+1})\, p(x_t \mid y^{t+1}) \quad (1)$$

$$\phantom{p(x_t, x_{t+1} \mid y^{t+1})} \propto p(x_{t+1} \mid x_t, y^{t+1})\, p(y_{t+1} \mid x_t)\, p(x_t \mid y^t) \quad (2)$$

The APF is a "resample-propagate" filter.

Starting with a particle approximation of $p(x_t \mid y^t)$, draws from $p(x_t \mid y^{t+1})$ are obtained by resampling the particles with weights proportional to $p(y_{t+1} \mid x_t)$. These resampled particles are then propagated forward via $p(x_{t+1} \mid x_t, y^{t+1})$.

This representation is optimal (fully adapted) when both the predictive $p(y_{t+1} \mid x_t)$ and the propagation density $p(x_{t+1} \mid x_t, y^{t+1})$ are available.

Otherwise, the importance function $p(y_{t+1} \mid g(x_t))$ is used in the resampling steps.
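A matching resample-propagate sketch of the APF, under the same interface assumptions as the bootstrap-filter snippet above, plus a point predictor `g` (e.g. the state-equation mean) for the look-ahead weights; again a sketch of the idea, not the authors' code.

```python
import numpy as np

def auxiliary_particle_filter(y, g, propagate, loglik, x0, rng=None):
    """Resample-propagate (auxiliary) particle filter of Pitt and Shephard (1999).

    g : particle-wise point prediction of x_{t+1} given x_t (look-ahead).
    """
    rng = np.random.default_rng() if rng is None else rng
    particles = np.asarray(x0)
    N = particles.shape[0]
    means = []
    for yt in y:
        # 1. Resample with look-ahead weights p(y_{t+1} | g(x_t)).
        logw1 = loglik(yt, g(particles))
        w1 = np.exp(logw1 - logw1.max()); w1 /= w1.sum()
        parents = particles[rng.choice(N, size=N, p=w1)]
        # 2. Propagate the resampled particles via p(x_{t+1} | x_t).
        particles = propagate(parents, rng)
        # 3. Resample with correction weights p(y | x_{t+1}) / p(y | g(x_t)).
        logw2 = loglik(yt, particles) - loglik(yt, g(parents))
        w2 = np.exp(logw2 - logw2.max()); w2 /= w2.sum()
        particles = particles[rng.choice(N, size=N, p=w2)]
        means.append(particles.mean(axis=0))
    return np.array(means)
```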


Page 7:

The General PL Framework

PL = APF + Parameter Estimation

Define $Z_t$ as the "essential state vector", which may include state variables, auxiliary variables, sufficient statistics, etc.

$Z_t$ allows the computation of

(a) the posterior predictive $p(y_{t+1} \mid Z_t)$,
(b) the posterior updating rule $p(Z_{t+1} \mid Z_t, y_{t+1})$,
(c) parameter learning via $p(\theta \mid Z_{t+1})$.

Particle Learning

1. Resample: $\{\tilde z_t^{(i)}\}_{i=1}^N$ from $\{z_t^{(i)}\}_{i=1}^N$ with weights $\omega_{t+1}^{(i)} \propto p(y_{t+1} \mid z_t^{(i)})$.

2. Propagate: $\{x_t^{(i)}\}_{i=1}^N$ to $\{x_{t+1}^{(i)}\}_{i=1}^N$ via $p(x_{t+1} \mid \tilde z_t, y_{t+1})$; update the sufficient statistics $s_{t+1}^{(i)} = S(\tilde s_t^{(i)}, x_{t+1}^{(i)}, y_{t+1})$.

3. Learn: $p(\theta \mid y^{t+1}) \approx \frac{1}{N} \sum_{i=1}^N p(\theta \mid z_{t+1}^{(i)})$.
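Below is a generic sketch of this resample-propagate-learn sweep. The particle representation (a dict with keys "x" and "s") and the callback names (`predictive_loglik`, `propagate`, `update_suffstats`) are conventions of mine, not the authors'; a concrete model only needs to supply those three ingredients.

```python
import numpy as np

def particle_learning(y, predictive_loglik, propagate, update_suffstats,
                      init_particles, rng=None):
    """Generic PL sweep over essential state vectors Z_t = {x_t, s_t}.

    predictive_loglik : log p(y_{t+1} | Z_t^{(i)}) for one particle
    propagate         : draws x_{t+1} from p(x_{t+1} | Z_t, y_{t+1})
    update_suffstats  : deterministic recursion S(s_t, x_{t+1}, y_{t+1})
    init_particles    : list of N dicts, e.g. {"x": ..., "s": ...}
    """
    rng = np.random.default_rng() if rng is None else rng
    particles = list(init_particles)
    N = len(particles)
    for yt in y:
        # 1. Resample with weights given by the posterior predictive p(y_{t+1} | Z_t).
        logw = np.array([predictive_loglik(z, yt) for z in particles])
        w = np.exp(logw - logw.max()); w /= w.sum()
        particles = [particles[i] for i in rng.choice(N, size=N, p=w)]
        # 2. Propagate the state and update the sufficient statistics.
        new_particles = []
        for z in particles:
            x_new = propagate(z, yt, rng)
            new_particles.append({"x": x_new,
                                  "s": update_suffstats(z["s"], x_new, yt)})
        particles = new_particles
        # 3. Learn: p(theta | y^{t+1}) is approximated by mixing the conditional
        #    posteriors p(theta | Z_{t+1}^{(i)}) over the particle set.
    return particles
```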


Page 8:

Example 1: Finite Mixture of Poisson Densities

$$p(y_t) = \sum_{j=1}^m \pi_j\, \mathrm{Po}(y_t; \theta_j)$$

$$\pi \sim \mathrm{Dir}(\gamma), \quad \theta_j \sim \mathrm{Ga}(\alpha_j, \beta_j), \quad j = 1, \dots, m$$

Define $n_t$ as the number of samples in each component and the sufficient statistics $s_t = (s_{t,1}, \dots, s_{t,m})$, where $s_{t,j} = \sum_{r=1}^t y_r \mathbb{I}_{\{k_r = j\}}$, for the conditional posterior given $y^t$ and $k^t$; the essential state vector is $Z_t = \{k_t, s_t, n_t\}$.

Predictive for $y_{t+1}$:

$$p(y_{t+1} \mid Z_t) = \sum_{k_{t+1}=j=1}^{m} \int\!\!\int p(y_{t+1}, k_{t+1} \mid \theta_{k_{t+1}})\, p(\theta, \pi)\, d(\theta, \pi)$$

$$= \sum_{k_{t+1}=j=1}^{m} \frac{\Gamma(s_{t,j} + y_{t+1} + \alpha_j)}{\Gamma(s_{t,j} + \alpha_j)}\, \frac{(\beta_j + n_{t,j})^{s_{t,j}+\alpha_j}}{(\beta_j + n_{t,j} + 1)^{s_{t,j}+\alpha_j+y_{t+1}}}\, \frac{1}{y_{t+1}!}\, \frac{\gamma_j + n_{t,j}}{\sum_{i=1}^m (\gamma_i + n_{t,i})}$$

Propagating $k_{t+1}$ is done according to

$$p(k_{t+1} = j \mid Z_t, y_{t+1}) \propto \frac{\Gamma(s_{t,j} + y_{t+1} + \alpha_j)}{\Gamma(s_{t,j} + \alpha_j)}\, \frac{(\beta_j + n_{t,j})^{s_{t,j}+\alpha_j}}{(\beta_j + n_{t,j} + 1)^{s_{t,j}+\alpha_j+y_{t+1}}}\, \frac{1}{y_{t+1}!}\, \frac{\gamma_j + n_{t,j}}{\sum_{i=1}^m (\gamma_i + n_{t,i})}$$

Given $k_{t+1}$, $Z_t$ is updated by the recursions $s_{t+1,j} = s_{t,j} + y_{t+1}\mathbb{I}_{\{k_{t+1}=j\}}$ and $n_{t+1,j} = n_{t,j} + \mathbb{I}_{\{k_{t+1}=j\}}$, for $j = 1, \dots, m$.

Learning $\theta$: $p(\theta \mid y^{t+1}) \approx \frac{1}{N} \sum_{i=1}^N p(\theta \mid Z_{t+1}^{(i)})$.
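The per-component weight above is the same quantity in the predictive and in the propagation rule, so a sketch only needs one function; the log-gamma form below is a numerically stable rewrite of it. Function names and the SciPy dependency are my choices, not the slides'.

```python
import numpy as np
from scipy.special import gammaln

def poisson_mixture_logweights(y_new, s, n, alpha, beta, gamma):
    """Log of the unnormalized component weights for a finite Poisson mixture
    with Ga(alpha_j, beta_j) and Dir(gamma) priors; s, n are the sufficient
    statistics (s_t, n_t). All arguments except y_new are length-m arrays."""
    return (gammaln(s + y_new + alpha) - gammaln(s + alpha)
            + (s + alpha) * np.log(beta + n)
            - (s + alpha + y_new) * np.log(beta + n + 1)
            - gammaln(y_new + 1)
            + np.log(gamma + n) - np.log(np.sum(gamma + n)))

def predictive_logdensity(y_new, s, n, alpha, beta, gamma):
    # log p(y_{t+1} | Z_t): log-sum of the unnormalized weights (PL resampling weight).
    logw = poisson_mixture_logweights(y_new, s, n, alpha, beta, gamma)
    return logw.max() + np.log(np.exp(logw - logw.max()).sum())

def propagate_k(y_new, s, n, alpha, beta, gamma, rng):
    # Draw k_{t+1} with probability proportional to the component weights.
    logw = poisson_mixture_logweights(y_new, s, n, alpha, beta, gamma)
    w = np.exp(logw - logw.max()); w /= w.sum()
    return rng.choice(len(w), p=w)
```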

Page 9:

Figure: Poisson Mixture Example. (a) 4-component mixture of Poissons; (b) Monte Carlo study.


Page 10:

Allocation: to obtain smoothed samples of $k^t$

Draws from $p(k^t \mid y^t)$ can be obtained through the backwards update equation

$$p(k^t \mid y^t) = \int p(k^t \mid Z_t, y^t)\, dP(Z_t \mid y^t) = \int \prod_{r=1}^t p(k_r \mid Z_t, y_r)\, dP(Z_t \mid y^t)$$

We can directly approximate $p(k^t \mid y^t)$ by sampling, for each particle $Z_t^{(i)}$, with probability $p(k_r \mid Z_t, y_r) \propto p(y_r \mid k_r = j, Z_t)\, p(k_r = j \mid Z_t)$.

The conditional independence structure of mixture models leads to this factorization, which provides an algorithm for posterior allocation of order $O(N)$.

Marginal Likelihood

Useful in Bayesian model comparison procedures, based either on Bayes factors or posterior model probabilities.

The sequential decomposition of the marginal likelihood gives $p(y^t) = \prod_{r=1}^t p(y_r \mid y^{r-1})$. The factors of this product can be estimated at each PL step as

$$p(y_t \mid y^{t-1}) = \int p(y_t \mid Z_{t-1})\, dP(Z_{t-1} \mid y^{t-1}) \approx \frac{1}{N} \sum_{i=1}^N p(y_t \mid Z_{t-1}^{(i)})$$

Given each new observation $y_t$: $p(y^t) = p(y^{t-1}) \sum_{i=1}^N p(y_t \mid Z_{t-1}^{(i)}) / N$.
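Since the same particle predictive values already serve as PL resampling weights, the running marginal likelihood costs almost nothing extra. A small sketch in log space for numerical stability (names are mine):

```python
import numpy as np

def update_log_marginal(log_marginal, pred_logliks):
    """One step of log p(y^t) = log p(y^{t-1}) + log[(1/N) sum_i p(y_t | Z_{t-1}^{(i)})],
    where pred_logliks holds log p(y_t | Z_{t-1}^{(i)}) over the N particles."""
    m = pred_logliks.max()
    return log_marginal + m + np.log(np.mean(np.exp(pred_logliks - m)))
```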


Page 11:

PL for Nonparametric mixture models

Consider the DP mixture model

$$y_t \sim k(y_t; \theta_{x_t}), \quad x_t \sim G, \quad G \sim DP(\alpha, G_0) \quad (3)$$

Define $n_t = (n_{t,1}, \dots, n_{t,m_t})$, the number of observations allocated to each unique component, and $s_t = (s_{t,1}, \dots, s_{t,m_t})$, the conditional sufficient statistics, where $m_t$ is the number of mixture components; $Z_t = (x_t, m_t, s_t, n_t)$ is the state vector to be tracked.

The uncertainty updating equation is

$$p(s_{t+1}, n_{t+1} \mid y^{t+1}) \propto \int p(s_{t+1}, n_{t+1} \mid s_t, n_t, m_t, y_{t+1})\, p(y_{t+1} \mid s_t, n_t, m_t)\, dP(s_t, n_t, m_t \mid y^t),$$

where

$$p(y_{t+1} \mid n_t, s_t) = \int\!\!\int\!\!\int k(y_{t+1}; \theta)\, dG(\theta)\, dP(G \mid \theta^\star_t, n_t)\, dP(\theta^\star_t \mid n_t, s_t) = \int k(y_{t+1}; \theta) \left[ \int dG_0^t(\theta; \theta^\star_t, n_t)\, dP(\theta^\star_t \mid n_t, s_t) \right]$$

and

$$p(s_{t+1}, n_{t+1}, m_{t+1} \mid s_t, n_t, m_t, y_{t+1}) = p(x_{t+1} \mid s_t, n_t, m_t, y_{t+1}) \propto \begin{cases} n_{t,x_{t+1}} \int k(y_{t+1}; \theta^\star_{x_{t+1}})\, dP(\theta^\star_{x_{t+1}} \mid s_{t,x_{t+1}}, n_{t,x_{t+1}}) & \text{for } x_{t+1} = 1, \dots, m_t \\ \alpha \int k(y_{t+1}; \theta)\, dG_0(\theta) & \text{if } x_{t+1} = m_t + 1 \end{cases}$$


Page 12:

PL for DP mixture models

Given a set of particles $\{(n_t, s_t, m_t)^{(i)}\}_{i=1}^N$ which approximate $p(n_t, s_t, m_t \mid y^t)$, and a new observation $y_{t+1}$, the PL algorithm for DP mixture models updates the approximation to $p(n_{t+1}, s_{t+1} \mid y^{t+1})$ as follows.

Algorithm 2: PL for DP mixture models

1. Resample: Generate an index $\xi \sim \mathrm{MN}(\omega, N)$ where

$$\omega^{(i)} = \frac{p(y_{t+1} \mid (s_t, n_t, m_t)^{(i)})}{\sum_{i=1}^N p(y_{t+1} \mid (s_t, n_t, m_t)^{(i)})}$$

2. Propagate:

$$k_{t+1} \sim p(x_{t+1} \mid (s_t, n_t, m_t)^{\xi(i)}, y_{t+1})$$

$s_{t+1} = S(s_t, k_{t+1}, y_{t+1})$. For $j \neq x_{t+1}$, $n_{t+1,j} = n_{t,j}$.
If $k_{t+1} \leq m_t$, then $n_{t+1,x_{t+1}} = n_{t,x_{t+1}} + 1$ and $m_{t+1} = m_t$.
Otherwise, $m_{t+1} = m_t + 1$ and $n_{t+1,m_{t+1}} = 1$.

3. Estimate:

$$\mathrm{E}[f(y; G) \mid y^t] \approx \frac{1}{N} \sum_{i=1}^N p(y \mid (s_t, n_t, m_t)^{(i)})$$
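A sketch of one pass of Algorithm 2 for a generic DP mixture. The particle layout (dicts of per-component sufficient statistics and counts) and the callbacks are assumptions of mine; the two marginal-likelihood callbacks stand for the integrals $\int k(y;\theta^\star)\,dP(\theta^\star \mid s_j, n_j)$ and $\int k(y;\theta)\,dG_0(\theta)$ above.

```python
import numpy as np

def _log_alloc_weights(p, y_new, comp_pred_loglik, base_pred_loglik, alpha):
    """Unnormalized log weights over k_{t+1} = 1..m_t (existing) and m_t+1 (new)."""
    n, t = p["n"], p["n"].sum()
    logw = [np.log(nj) + comp_pred_loglik(y_new, sj, nj)
            for nj, sj in zip(n, p["s"])]
    logw.append(np.log(alpha) + base_pred_loglik(y_new))
    return np.array(logw) - np.log(alpha + t)

def dp_pl_step(particles, y_new, comp_pred_loglik, base_pred_loglik,
               update_suffstat, new_suffstat, alpha, rng):
    """One resample-propagate step of Algorithm 2 for a DP mixture (sketch).

    particles        : list of dicts {"s": per-component suff. stats, "n": counts}
    comp_pred_loglik : log of  int k(y; theta*) dP(theta* | s_j, n_j)
    base_pred_loglik : log of  int k(y; theta) dG0(theta)
    update_suffstat  : s_j <- S(s_j, y_new) for an existing component
    new_suffstat     : sufficient statistic of a new component seeded with y_new
    """
    N = len(particles)
    all_logw = [_log_alloc_weights(p, y_new, comp_pred_loglik,
                                   base_pred_loglik, alpha) for p in particles]
    # 1. Resample: particle weights are p(y_{t+1} | (s_t, n_t, m_t)^{(i)}),
    #    i.e. the sum over the allocation weights.
    logpred = np.array([lw.max() + np.log(np.exp(lw - lw.max()).sum())
                        for lw in all_logw])
    w = np.exp(logpred - logpred.max()); w /= w.sum()
    idx = rng.choice(N, size=N, p=w)
    # 2. Propagate: draw k_{t+1}, then update (s, n, m) by the recursions above.
    out = []
    for i in idx:
        p, lw = particles[i], all_logw[i]
        pr = np.exp(lw - lw.max()); pr /= pr.sum()
        k = rng.choice(len(pr), p=pr)
        s, n = list(p["s"]), p["n"].copy()
        if k < len(n):                      # existing component
            s[k] = update_suffstat(s[k], y_new); n[k] += 1
        else:                               # open a new component
            s.append(new_suffstat(y_new)); n = np.append(n, 1)
        out.append({"s": s, "n": n})
    return out
```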


Page 13:

Example 2: DP mixture of Multivariate Normals

$$f(y_t; G) = \int \mathcal{N}(y_t \mid \mu_t, \Sigma_t)\, dG(\mu_t, \Sigma_t), \quad G \sim DP(\alpha, G_0(\mu, \Sigma))$$

with $G_0 = \mathcal{N}(\mu; \lambda, \Sigma/\kappa)\, \mathcal{W}(\Sigma^{-1}; \nu, \Omega)$.

$Z_t = \{k_t, s_t, n_t, m_t\}$, where $s_t$ is the conditional sufficient statistic for each unique mixture component, defined by $\bar y_{t,j} = \sum_{r: k_r = j} y_r / n_{t,j}$ and $S_{t,j} = \sum_{r: k_r = j} (y_r - \bar y_{t,j})(y_r - \bar y_{t,j})'$.

The predictive density for sampling is

$$p(y_{t+1} \mid Z_t) = \frac{\alpha}{\alpha + t}\, \mathrm{St}(y_{t+1}; a_0, B_0, c_0) + \sum_{j=1}^{m_t} \frac{n_{t,j}}{\alpha + t}\, \mathrm{St}(y_{t+1}; a_{t,j}, B_{t,j}, c_{t,j})$$

where the Student's t distributions are parametrized by $a_0 = \lambda$, $B_0 = \frac{2(\kappa+1)}{\kappa c_0}\,\Omega$, $c_0 = 2\nu - d + 1$, $a_{t,j} = \frac{\kappa\lambda + n_{t,j}\bar y_{t,j}}{\kappa + n_{t,j}}$, ...

Propagate $k_{t+1}$ according to

$$p(k_{t+1} = j) \propto \frac{n_{t,j}}{\alpha + t}\, \mathrm{St}(y_{t+1}; a_{t,j}, B_{t,j}, c_{t,j}) \quad \text{for } j = 1, \dots, m_t$$

$$p(k_{t+1} = m_t + 1) \propto \frac{\alpha}{\alpha + t}\, \mathrm{St}(y_{t+1}; a_0, B_0, c_0)$$
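A sketch of this allocation step, reading $\mathrm{St}(y; a, B, c)$ as a multivariate Student-t with location $a$, scale matrix $B$ and $c$ degrees of freedom (my assumption about the notation), and using `scipy.stats.multivariate_t` for the density:

```python
import numpy as np
from scipy.stats import multivariate_t

def dp_mvn_allocation_probs(y_new, comps, a0, B0, c0, alpha, t):
    """Allocation probabilities for the DP mixture of multivariate normals.

    comps : list of (n_j, a_j, B_j, c_j) for the m_t existing components.
    Returns a length m_t + 1 vector; the last entry is the probability of
    opening a new component.
    """
    w = [n_j / (alpha + t) * multivariate_t.pdf(y_new, loc=a_j, shape=B_j, df=c_j)
         for (n_j, a_j, B_j, c_j) in comps]
    w.append(alpha / (alpha + t) * multivariate_t.pdf(y_new, loc=a0, shape=B0, df=c0))
    w = np.asarray(w)
    return w / w.sum()
```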


Pages 14-16: (figure-only slides)

Page 17:

Latent Feature Models via the Indian Buffet Process

Consider the infinite linear-Gaussian matrix factorization model of Griffiths and Ghahramani (2006):

$$Y_T = Z_T B + E \quad (4)$$

$Y_T = \{y'_1, y'_2, \dots, y'_T\}$, where each $y_t \in \mathbb{R}^p$, $Z_T \in \{0,1\}^{T \times k}$, $B \in \mathbb{R}^{k \times p}$, $E \in \mathbb{R}^{T \times p}$. Assume $e_{i,j} \sim \mathcal{N}(0, \sigma^2)$ and $b_{r,q} \sim \mathcal{N}(0, \sigma_b^2)$.

Eqn (4) can be cast as a state-space model where each row of $Z_T$ is a state and the transition $p(z_{t+1} \mid Z_t)$ follows from the IBP generative process: an existing feature $j$ is included with probability $p(z_{t+1,j} = 1 \mid Z_t) = n_{t,j}/(t+1)$.

Define the essential state vector $Z_t = (m_t, s_t, n_t)$, where $m_t$ is the number of current latent features, $n_t$ is the number of objects allocated to each of the currently available features, and $s_t$ is the set of conditional sufficient statistics for $B$. Let $m_*$ denote the number of potential new features per particle, $m_*^{(i)} \sim \mathrm{Po}(\alpha/t)$.

The posterior predictive: $p(y_{t+1} \mid Z_t, m_*) \propto \sum_{z_{t+1} \in C} p(y_{t+1} \mid z_{t+1}, s_t)\, p(z_{t+1} \mid Z_t)$, where $C$ is defined by all $2^{m_t}$ possible configurations of $z_{t+1}$ in which the final $m_*$ elements are fixed at one.

Propagating $Z_{t+1}$ forward: $p(z_{t+1} \mid Z_t, y_{t+1}) \propto p(y_{t+1} \mid z_{t+1}, s_t)\, p(z_{t+1} \mid Z_t)$.
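A sketch of the enumeration over $C$ used by the predictive and propagation rules above. The inclusion probability $n_{t,j}/(t+1)$ for existing features and the callback `loglik` (the marginal $p(y_{t+1} \mid z_{t+1}, s_t)$ with $B$ integrated out) are my assumptions; the enumeration is only feasible for small $m_t$.

```python
import numpy as np
from itertools import product

def ibp_allocation_posterior(y_new, n, m_star, t, s, loglik):
    """Enumerate the 2^{m_t} configurations of z_{t+1} over existing features,
    with m_star trailing new-feature entries fixed at one, and return them with
    their posterior probabilities p(z_{t+1} | Z_t, y_{t+1}).

    n      : counts n_{t,j} for the m_t existing features
    loglik : (y_new, z, s) -> log p(y_{t+1} | z_{t+1}, s_t)
    """
    m_t = len(n)
    prior_on = np.asarray(n, dtype=float) / (t + 1.0)   # IBP inclusion probabilities
    configs, logpost = [], []
    for bits in product([0, 1], repeat=m_t):
        bits = np.array(bits)
        z = np.concatenate([bits, np.ones(m_star)])
        logprior = np.sum(np.where(bits == 1, np.log(prior_on),
                                   np.log1p(-prior_on)))
        configs.append(z)
        logpost.append(logprior + loglik(y_new, z, s))
    logpost = np.array(logpost)
    # Summing exp(logpost) gives the (unnormalized) predictive p(y_{t+1} | Z_t, m_*).
    post = np.exp(logpost - logpost.max())
    return configs, post / post.sum()
```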


Page 18:

Summary

PL is essentially an auxiliary particle filter applied in settings where sufficient statistics exist.

PL can provide on-line inference for a wide class of mixture models.

Computation can be accelerated by exploiting the parallel nature of PL.

Caveat: degeneracy is unavoidable. Increasing the number of observations without simultaneously (and exponentially) increasing the number of particles necessarily leads to degeneracy of the simulated sufficient statistic paths. In the experiments of [Chopin et al. 2010], this degeneracy always occurred between 5,000 and 10,000 observations.
