Bayesian Nonparametric Topic Modeling: Hierarchical Dirichlet Processes


DESCRIPTION

These are the presentation slides from the Machine Learning Summer School in Korea (http://prml.yonsei.ac.kr/). I talked about the Dirichlet distribution, the Dirichlet process, and the HDP.

TRANSCRIPT

Bayesian Nonparametric Topic Modeling: Hierarchical Dirichlet Processes

JinYeong Bak

Department of Computer Science, KAIST, Daejeon, South Korea

jy.bak@kaist.ac.kr

August 22, 2013

Part of these slides is adapted from a presentation by Yee Whye Teh (y.w.teh@stats.ox.ac.uk).

Outline

1 Introduction: Motivation, Topic Modeling

2 Background: Dirichlet Distribution, Dirichlet Processes

3 Hierarchical Dirichlet Processes: Dirichlet Process Mixture Models, Hierarchical Dirichlet Processes

4 Inference: Gibbs Sampling, Variational Inference, Online Learning, Distributed Online Learning

5 Practical Tips

6 Summary


Introduction

Bayesian topic models
- Latent Dirichlet Allocation (LDA) [BNJ03]
- Hierarchical Dirichlet Processes (HDP) [TJBB06]

In this talk:
- Dirichlet distribution, Dirichlet process
- Concept of Hierarchical Dirichlet Processes (HDP)
- How to infer the latent variables in HDP

Motivation

What are the topics discussed in the article?

How can we describe the topics?



Topic Modeling

Each topic has a word distribution

Each document has a topic proportion

Each word has its own topic index

Latent Dirichlet Allocation

Generative process of LDA

For each topic k ∈ 1, . . . , K:
- Draw a word distribution βk ∼ Dir(η)

For each document d ∈ 1, . . . , D:
- Draw topic proportions θd ∼ Dir(α)
- For each word n ∈ 1, . . . , N in the document:
  - Draw a topic index zdn ∼ Mult(θd)
  - Generate the word from the chosen topic, wdn ∼ Mult(βzdn)
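This generative story is easy to simulate. Below is a minimal sketch in Python/NumPy; the sizes K, D, N, V and the symmetric hyperparameters are illustrative choices, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, N, V = 3, 5, 20, 50   # topics, documents, words per document, vocabulary size
eta, alpha = 0.1, 0.5       # symmetric Dirichlet hyperparameters (illustrative)

beta = rng.dirichlet(np.full(V, eta), size=K)      # beta_k ~ Dir(eta), one row per topic
docs = []
for d in range(D):
    theta = rng.dirichlet(np.full(K, alpha))       # theta_d ~ Dir(alpha)
    z = rng.choice(K, size=N, p=theta)             # z_dn ~ Mult(theta_d)
    words = [rng.choice(V, p=beta[k]) for k in z]  # w_dn ~ Mult(beta_{z_dn})
    docs.append(words)
```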

Latent Dirichlet Allocation

Our interests
- What are the topics discussed in the article?
- How can we describe the topics?

Latent Dirichlet Allocation: What we can see

Words in documents

Latent Dirichlet Allocation: What we want to see

Latent Dirichlet Allocation

Our interests
- What are the topics discussed in the article?
  => Topic proportion of each document
- How can we describe the topics?
  => Word distribution of each topic

Latent Dirichlet Allocation

What we can see: w

What we want to see: θ, z, β

∴ Compute p(θ, z, β | w, α, η) = p(θ, z, β, w | α, η) / p(w | α, η)

But this distribution is intractable to compute (∵ the normalization term p(w | α, η))

So we use approximate methods:
- Gibbs Sampling
- Variational Inference

Limitation of Latent Dirichlet Allocation

Latent Dirichlet Allocation is a parametric model
- People should assign the number of topics in a corpus
- People should find the best number of topics

Q) Can we get it from the data automatically?

A) Hierarchical Dirichlet Processes


Dice modeling

Think about the probability of a number rolled on dice

Each die has its own pmf

According to the textbook, it is widely known as uniform
=> 1/6 for a 6-sided die

Is it true? Ans) No!

Dice modeling

We should model the randomness of the pmf of each die

How can we do that?
- Imagine a bag which holds many dice
- We cannot see inside the bag
- We can draw one die out of the bag

OK, but what is the formal description?

Standard Simplex

A generalization of the notion of a triangle or tetrahedron

All points are non-negative and sum to 1 [1]

A pmf can be thought of as a point in the standard simplex

Ex) A point p = (x, y, z), where x ≥ 0, y ≥ 0, z ≥ 0 and x + y + z = 1

[1] http://en.wikipedia.org/wiki/Simplex

Dirichlet distribution

Definition [BN06]
- A probability distribution over the (K − 1)-dimensional standard simplex
- A distribution over pmfs of length K

Notation

θ ∼ Dir(α)

where θ = [θ1, . . . , θK] is a random pmf and α = [α1, . . . , αK]

Probability density function

p(θ; α) = (Γ(∑_{k=1}^K αk) / ∏_{k=1}^K Γ(αk)) ∏_{k=1}^K θk^{αk − 1}
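A quick way to build intuition for Dir(α) is to draw pmfs with NumPy; the α values below are illustrative choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.dirichlet([2.0, 3.0, 4.0])   # one random pmf on the 2-simplex
print(theta, theta.sum())                # entries are non-negative and sum to 1

sparse = rng.dirichlet([0.1, 0.1, 0.1], size=5)    # small alpha: draws near the corners
dense = rng.dirichlet([10.0, 10.0, 10.0], size=5)  # large alpha: draws near the center
```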


Property of Dirichlet distribution: Density plots [BAFG10]

Property of Dirichlet distribution: Sample pmfs from the Dirichlet distribution [BAFG10]

Property of Dirichlet distribution

When K = 2, it is the Beta distribution

Conjugate prior for the Multinomial distribution
- Likelihood X ∼ Mult(n, θ), prior θ ∼ Dir(α)
- ∴ Posterior (θ | X) ∼ Dir(α + n)
- Proof)

p(θ | X) = p(X | θ) p(θ) / p(X)
         ∝ p(X | θ) p(θ)
         = (n! / (x1! · · · xK!)) ∏_{k=1}^K θk^{xk} · (Γ(∑_{k=1}^K αk) / ∏_{k=1}^K Γ(αk)) ∏_{k=1}^K θk^{αk − 1}
         = C ∏_{k=1}^K θk^{αk + xk − 1}
         = Dir(α + n)
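Because of this conjugacy, a posterior update is just vector addition. A tiny sketch with made-up prior parameters and counts:

```python
import numpy as np

alpha = np.array([1.0, 2.0, 3.0])   # prior Dir(alpha); illustrative values
counts = np.array([5, 0, 2])        # observed multinomial counts n
posterior = alpha + counts          # posterior is Dir(alpha + n)
print(alpha / alpha.sum())          # prior mean
print(posterior / posterior.sum())  # posterior mean shifts toward empirical frequencies
```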

Property of Dirichlet distribution

Aggregation property
- Let (θ1, θ2, . . . , θK) ∼ Dir(α1, α2, . . . , αK),
  then (θ1 + θ2, θ3, . . . , θK) ∼ Dir(α1 + α2, α3, . . . , αK)
- In general, if A1, . . . , AR is any partition of {1, . . . , K},
  then (∑_{k∈A1} θk, . . . , ∑_{k∈AR} θk) ∼ Dir(∑_{k∈A1} αk, . . . , ∑_{k∈AR} αk)

Decimative property
- Let (θ1, θ2, . . . , θK) ∼ Dir(α1, α2, . . . , αK)
  and (τ1, τ2) ∼ Dir(α1β1, α1β2) where β1 + β2 = 1,
  then (θ1τ1, θ1τ2, θ2, . . . , θK) ∼ Dir(α1β1, α1β2, α2, . . . , αK)

Neutrality property
- Let (θ1, θ2, . . . , θK) ∼ Dir(α1, α2, . . . , αK),
  then θk is independent of the vector (1 / (1 − θk)) (θ1, θ2, . . . , θk−1, θk+1, . . . , θK)


Dice modeling

Think about the probability of a number rolled on dice

Each die has its own pmf

Draw a die out of a bag

Problem) We do not know the number of faces of the dice in the bag

Solution) Dirichlet process

Dirichlet Process

Definition [BAFG10]
- A distribution over probability measures
- A distribution whose realizations are distributions over an arbitrary sample space

Formal definition
- (Ω, B) is a measurable space
- G0 is a distribution over the sample space Ω
- α0 is a positive real number
- G is a random probability measure over (Ω, B)

G ∼ DP(α0, G0)

if for any finite measurable partition (A1, . . . , AR) of Ω

(G(A1), . . . , G(AR)) ∼ Dir(α0G0(A1), . . . , α0G0(AR))

Posterior Dirichlet Processes

G ∼ DP(α0, G0) can be treated as a random distribution over Ω

We can draw a sample θ1 from G

We can also make a finite partition (A1, . . . , AR) of Ω,
then p(θ1 ∈ Ar | G) = G(Ar) and p(θ1 ∈ Ar) = G0(Ar)

(G(A1), . . . , G(AR)) ∼ Dir(α0G0(A1), . . . , α0G0(AR))

Using Dirichlet-multinomial conjugacy, the posterior is

(G(A1), . . . , G(AR)) | θ1 ∼ Dir(α0G0(A1) + δθ1(A1), . . . , α0G0(AR) + δθ1(AR))

where δθ(Ar) = 1 if θ ∈ Ar and 0 otherwise

This holds for every finite partition of Ω

Posterior Dirichlet Processes

For every finite partition of Ω,

(G(A1), . . . , G(AR)) | θ1 ∼ Dir(α0G0(A1) + δθ1(A1), . . . , α0G0(AR) + δθ1(AR))

where δθ1(Ar) = 1 if θ1 ∈ Ar and 0 otherwise

The posterior process is also a Dirichlet process:

G | θ1 ∼ DP(α0 + 1, (α0G0 + δθ1) / (α0 + 1))

Summary)

θ1 | G ∼ G, G ∼ DP(α0, G0)   ⇐⇒   θ1 ∼ G0, G | θ1 ∼ DP(α0 + 1, (α0G0 + δθ1) / (α0 + 1))

Blackwell-MacQueen Urn Scheme

Now we draw samples θ1, . . . , θN

First sample

θ1 | G ∼ G, G ∼ DP(α0, G0)   ⇐⇒   θ1 ∼ G0, G | θ1 ∼ DP(α0 + 1, (α0G0 + δθ1) / (α0 + 1))

Second sample

θ2 | θ1, G ∼ G, G | θ1 ∼ DP(α0 + 1, (α0G0 + δθ1) / (α0 + 1))
⇐⇒   θ2 | θ1 ∼ (α0G0 + δθ1) / (α0 + 1), G | θ1, θ2 ∼ DP(α0 + 2, (α0G0 + δθ1 + δθ2) / (α0 + 2))

Blackwell-MacQueen Urn Scheme

Nth sample

θN | θ1,...,N−1, G ∼ G, G | θ1,...,N−1 ∼ DP(α0 + N − 1, (α0G0 + ∑_{n=1}^{N−1} δθn) / (α0 + N − 1))
⇐⇒   θN | θ1,...,N−1 ∼ (α0G0 + ∑_{n=1}^{N−1} δθn) / (α0 + N − 1), G | θ1,...,N ∼ DP(α0 + N, (α0G0 + ∑_{n=1}^N δθn) / (α0 + N))

Blackwell-MacQueen Urn Scheme

The Blackwell-MacQueen urn scheme produces a sequence θ1, θ2, . . . with the following conditionals:

θN | θ1,...,N−1 ∼ (α0G0 + ∑_{n=1}^{N−1} δθn) / (α0 + N − 1)

Polya urn analogy
- An infinite number of ball colors
- An empty urn
- Filling the Polya urn (n starts at 1):
  - With probability proportional to α0, pick a new color from the set of infinite ball colors G0, paint a new ball that color, and add it to the urn
  - With probability proportional to n − 1, pick a ball from the urn, record its color, and put it back into the urn with another ball of the same color
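The urn scheme is a few lines of code. A sketch assuming G0 = N(0, 1) and an illustrative α0; note how the number of distinct values grows much more slowly than N:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha0, N = 1.0, 100
draws = []
for n in range(1, N + 1):
    if rng.random() < alpha0 / (alpha0 + n - 1):
        draws.append(rng.normal())                     # new atom from G0
    else:
        draws.append(draws[rng.integers(len(draws))])  # repeat a uniformly chosen earlier draw
print(len(set(draws)), "distinct values among", N, "draws")
```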

Chinese Restaurant Process

Draw θ1, θ2, . . . , θN from the Blackwell-MacQueen urn scheme
- With probability proportional to α0, pick a new color from the set of infinite ball colors G0, paint a new ball that color, and add it to the urn
- With probability proportional to n − 1, pick a ball from the urn, record its color, and put it back into the urn with another ball of the same color

The θs can take the same values: θi = θj

There are K ≤ N distinct values, φ1, . . . , φK

These induce a partition of Ω

θ1, θ2, . . . , θN induce φ1, . . . , φK

The distribution over partitions is called the Chinese Restaurant Process (CRP)

Chinese Restaurant Process

θ1, θ2, . . . , θN induce φ1, . . . , φK

Chinese Restaurant Process interpretation
- There is a Chinese restaurant with infinitely many tables
- Each customer sits at a table

Generating from the Chinese Restaurant Process
- The first customer sits at the first table
- The n-th customer sits at
  - a new table with probability α0 / (α0 + n − 1)
  - table k with probability nk / (α0 + n − 1), where nk is the number of customers at table k

Chinese Restaurant Process

The CRP exhibits the clustering property of the DP
- Tables are clusters: φk ∼ G0
- Customers are the actual realizations: θn = φzn where zn ∈ {1, . . . , K}
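A minimal CRP simulation, with an illustrative α0, makes the clustering property visible: a few large tables plus a long tail of small ones.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha0, N = 1.0, 100
tables = []                                # tables[k] = n_k, customers at table k
for n in range(1, N + 1):
    weights = np.array(tables + [alpha0])  # existing tables get n_k, a new table gets alpha0
    k = rng.choice(len(weights), p=weights / (alpha0 + n - 1))
    if k == len(tables):
        tables.append(1)                   # open a new table
    else:
        tables[k] += 1
print(sorted(tables, reverse=True))        # a few big clusters, many small ones
```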

Stick Breaking Construction

The Blackwell-MacQueen urn scheme / CRP generates θ ∼ G, not G itself

To construct G, we use the Stick Breaking Construction

Review) Posterior Dirichlet Processes

θ1 | G ∼ G, G ∼ DP(α0, G0)   ⇐⇒   θ1 ∼ G0, G | θ1 ∼ DP(α0 + 1, (α0G0 + δθ1) / (α0 + 1))

Consider the partition ({θ1}, Ω∖{θ1}) of Ω. Then

(G(θ1), G(Ω∖θ1)) ∼ Dir((α0 + 1) ((α0G0 + δθ1) / (α0 + 1))(θ1), (α0 + 1) ((α0G0 + δθ1) / (α0 + 1))(Ω∖θ1)) = Dir(1, α0) = Beta(1, α0)

Stick Breaking Construction

Consider the partition ({θ1}, Ω∖{θ1}) of Ω. Then

(G(θ1), G(Ω∖θ1)) = (β1, 1 − β1) ∼ Beta(1, α0)

G has a point mass located at θ1:

G = β1δθ1 + (1 − β1)G′, β1 ∼ Beta(1, α0)

where G′ is the probability measure with the point mass θ1 removed

What is G′?

Stick Breaking Construction

Summary) Posterior Dirichlet Processes

θ1 | G ∼ G, G ∼ DP(α0, G0)   ⇐⇒   θ1 ∼ G0, G | θ1 ∼ DP(α0 + 1, (α0G0 + δθ1) / (α0 + 1))

G = β1δθ1 + (1 − β1)G′, β1 ∼ Beta(1, α0)

Consider a further partition ({θ1}, A1, . . . , AR) of Ω:

(G(θ1), G(A1), . . . , G(AR)) = (β1, (1 − β1)G′(A1), . . . , (1 − β1)G′(AR)) ∼ Dir(1, α0G0(A1), . . . , α0G0(AR))

Using the decimative property of the Dirichlet distribution (proof),

(G′(A1), . . . , G′(AR)) ∼ Dir(α0G0(A1), . . . , α0G0(AR))

G′ ∼ DP(α0, G0)

Stick Breaking Construction

Do this repeatedly with distinct values φ1, φ2, . . .:

G ∼ DP(α0, G0)
G = β1δφ1 + (1 − β1)G′1
G = β1δφ1 + (1 − β1)(β2δφ2 + (1 − β2)G′2)
...
G = ∑_{k=1}^∞ πk δφk

where

πk = βk ∏_{i=1}^{k−1} (1 − βi), ∑_{k=1}^∞ πk = 1, βk ∼ Beta(1, α0), φk ∼ G0

Draws from the DP look like a sum of point masses, with masses drawn from a stick-breaking construction.

Stick Breaking Construction

Summary)

G = ∑_{k=1}^∞ πk δφk

πk = βk ∏_{i=1}^{k−1} (1 − βi), ∑_{k=1}^∞ πk = 1, βk ∼ Beta(1, α0), φk ∼ G0
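A truncated stick-breaking draw is straightforward to code; the sketch below assumes G0 = N(0, 1) and truncates the infinite sum at an illustrative level K.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha0, K = 1.0, 50                   # truncation level K approximates the infinite sum
beta = rng.beta(1.0, alpha0, size=K)  # beta_k ~ Beta(1, alpha0)
pi = beta * np.cumprod(np.concatenate(([1.0], 1.0 - beta[:-1])))  # pi_k = beta_k prod_{i<k}(1 - beta_i)
phi = rng.normal(size=K)              # atom locations phi_k ~ G0 = N(0, 1)
print(pi.sum())                       # close to 1 for large K; G ≈ sum_k pi_k * delta_{phi_k}
```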

Summary of DP

Definition
- G is a random probability measure over (Ω, B)

G ∼ DP(α0, G0)

if for any finite measurable partition (A1, . . . , AR) of Ω

(G(A1), . . . , G(AR)) ∼ Dir(α0G0(A1), . . . , α0G0(AR))

Chinese Restaurant Process

Stick Breaking Construction


Dirichlet Process Mixture Models

We model a data set x1, . . . , xN using the following model [Nea00]:

xn ∼ F(θn)
θn ∼ G
G ∼ DP(α0, G0)

Each θn is a latent parameter modelling xn, while G is the unknown distribution over parameters, modelled using a DP

Dirichlet Process Mixture Models

Since G is of the form

G = ∑_{k=1}^∞ πk δφk

we have θn = φk with probability πk

Let kn take on value k with probability πk. We can equivalently define θn = φkn

An equivalent model:

xn ∼ F(θn), θn ∼ G, G ∼ DP(α0, G0)
⇐⇒
xn ∼ F(φkn), p(kn = k) = πk, πk = βk ∏_{i=1}^{k−1} (1 − βi), βk ∼ Beta(1, α0), φk ∼ G0


Topic modeling with documents

- Each document consists of a bag of words
- Each word in a document has a latent topic index
- Latent topics for words in a document can be grouped
- Each document has a topic proportion
- Each topic has a word distribution
- Topics must be shared across documents

Problem of the Naive Dirichlet Process Mixture Model

Use a DP mixture for each document:

xdn ∼ F(θdn), θdn ∼ Gd, Gd ∼ DP(α0, G0)

But there is no sharing of clusters across different groups, because G0 is smooth:

G1 = ∑_{k=1}^∞ π1k δφ1k, G2 = ∑_{k=1}^∞ π2k δφ2k, φ1k, φ2k ∼ G0

Problem of the Naive Dirichlet Process Mixture Model

Solution
- Make the base distribution G0 discrete
- Put a DP prior on the common base distribution

Hierarchical Dirichlet Process:

G0 ∼ DP(γ, H)
G1, G2 | G0 ∼ DP(α0, G0)

Hierarchical Dirichlet Processes

Making G0 discrete forces clusters to be shared between G1 and G2

Stick Breaking Construction

A Hierarchical Dirichlet Process with documents 1, . . . , D:

G0 ∼ DP(γ, H)
Gd | G0 ∼ DP(α0, G0)

The stick-breaking construction for the HDP:

G0 = ∑_{k=1}^∞ βk δφk, φk ∼ H
βk = β′k ∏_{i=1}^{k−1} (1 − β′i), β′k ∼ Beta(1, γ)

Gd = ∑_{k=1}^∞ πdk δφk
πdk = π′dk ∏_{i=1}^{k−1} (1 − π′di), π′dk ∼ Beta(α0βk, α0(1 − ∑_{i=1}^k βi))
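A truncated version of this two-level construction is sketched below; the clipping of the residual mass 1 − ∑_{i≤k} βi is only a numerical guard needed because of truncation, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha0, K, D = 1.0, 1.0, 30, 4   # truncation level K and corpus size are illustrative

b = rng.beta(1.0, gamma, size=K)                              # beta'_k ~ Beta(1, gamma)
beta = b * np.cumprod(np.concatenate(([1.0], 1.0 - b[:-1])))  # corpus-level weights beta_k

tail = np.maximum(1.0 - np.cumsum(beta), 1e-12)  # 1 - sum_{i<=k} beta_i, clipped at truncation
pis = []
for d in range(D):
    p = rng.beta(alpha0 * beta, alpha0 * tail)                    # pi'_dk ~ Beta(a0 b_k, a0 (1 - sum))
    pi = p * np.cumprod(np.concatenate(([1.0], 1.0 - p[:-1])))    # document-level weights pi_dk
    pis.append(pi)                       # every document reuses the same global atoms phi_k
```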

Chinese Restaurant Franchise

G0 ∼ DP(γ, H), φk ∼ H

Gd | G0 ∼ DP(α0, G0), θdn ∼ Gd

Draw θd1, θd2, . . . from a Blackwell-MacQueen urn scheme; θd1, θd2, . . . induce φd1, φd2, . . .

Likewise, draw θd′1, θd′2, . . . for another document d′; they induce φd′1, φd′2, . . .

Chinese Restaurant Franchise

Chinese Restaurant Franchise interpretation
- Each restaurant has infinitely many tables
- All restaurants share the food menu
- Each customer sits at a table

Generating from the Chinese Restaurant Franchise, for each restaurant:
- The first customer sits at the first table and chooses a new menu item
- The n-th customer sits at
  - a new table with probability α0 / (α0 + n − 1)
  - table t with probability ndt / (α0 + n − 1), where ndt is the number of customers at table t
- The n-th customer (at a new table) chooses
  - a new menu item with probability γ / (γ + m − 1)
  - existing menu item k with probability mk / (γ + m − 1),
  where m is the number of tables in all restaurants and mk is the number of tables serving menu item k in all restaurants
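A minimal simulation of the franchise, with illustrative α0 and γ; the shared dish counts mk play the role of global topics.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha0, gamma, D, N = 1.0, 1.0, 3, 50
mk = []                                    # mk[k]: tables serving dish k across all restaurants
franchise = []
for d in range(D):
    ndt, dish = [], []                     # customers per table, dish per table (restaurant d)
    for n in range(1, N + 1):
        w = np.array(ndt + [alpha0])       # CRP over tables within the restaurant
        t = rng.choice(len(w), p=w / w.sum())
        if t == len(ndt):                  # a new table orders from the shared menu
            dw = np.array(mk + [gamma])    # CRP over dishes, weighted by global table counts
            k = rng.choice(len(dw), p=dw / dw.sum())
            if k == len(mk):
                mk.append(1)               # a brand-new dish (topic) for the whole franchise
            else:
                mk[k] += 1
            ndt.append(1)
            dish.append(k)
        else:
            ndt[t] += 1
    franchise.append((ndt, dish))
print("shared dishes (topics):", len(mk))
```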


HDP for Topic modeling

Questions
- What can we assume about the topics in a document?
- What can we assume about the words in the topics?

Solution
- Each document consists of a bag of words
- Each word in a document has a latent topic
- Latent topics for words in a document can be grouped
- Each document has a topic proportion
- Each topic has a word distribution
- Topics must be shared across documents


Gibbs Sampling

Definition
- A special case of the Markov chain Monte Carlo (MCMC) method
- An iterative algorithm that constructs a dependent sequence of parameter values whose distribution converges to the target joint posterior distribution [Hof09]

Algorithm
- Find the full conditional distribution of the latent variables of the target distribution
- Initialize all latent variables
- Sample until converged:
  - Sample one latent variable from its full conditional distribution

Collapsed Gibbs sampling

Collapsed Gibbs sampling integrates out one or more variables when sampling the other variables.

Example)
- There are three latent variables A, B and C
- Plain Gibbs samples p(A | B, C), p(B | A, C) and p(C | A, B) sequentially
- But when we integrate out B, we sample only p(A | C) and p(C | A) sequentially
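The contrast is purely in the loop structure. A schematic sketch in Python, where the sample_* functions are hypothetical stand-ins for the actual full-conditional samplers:

```python
def full_gibbs(A, B, C, iters, sample_A, sample_B, sample_C):
    # Plain Gibbs: cycle through all three full conditionals.
    for _ in range(iters):
        A = sample_A(B, C)   # A ~ p(A | B, C)
        B = sample_B(A, C)   # B ~ p(B | A, C)
        C = sample_C(A, B)   # C ~ p(C | A, B)
    return A, B, C

def collapsed_gibbs(A, C, iters, sample_A_given_C, sample_C_given_A):
    # Collapsed Gibbs: B has been integrated out analytically,
    # so only the two remaining conditionals are sampled.
    for _ in range(iters):
        A = sample_A_given_C(C)   # A ~ p(A | C)
        C = sample_C_given_A(A)   # C ~ p(C | A)
    return A, C
```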

Review) Dirichlet Process Mixture Models

xn ∼ F(θn), θn ∼ G, G ∼ DP(α0, G0)
⇐⇒
xn ∼ F(φkn), p(kn = k) = πk, πk = βk ∏_{i=1}^{k−1} (1 − βi), βk ∼ Beta(1, α0), φk ∼ G0

Review) Blackwell-MacQueen Urn Scheme for DP

Nth sample:

θN | θ1,...,N−1, G ∼ G, G | θ1,...,N−1 ∼ DP(α0 + N − 1, (α0G0 + ∑_{n=1}^{N−1} δθn) / (α0 + N − 1))
⇐⇒   θN | θ1,...,N−1 ∼ (α0G0 + ∑_{n=1}^{N−1} δθn) / (α0 + N − 1), G | θ1,...,N ∼ DP(α0 + N, (α0G0 + ∑_{n=1}^N δθn) / (α0 + N))

Review) Chinese Restaurant Franchise

Generating from the Chinese Restaurant Franchise, for each restaurant:
- The first customer sits at the first table and chooses a new menu item
- The n-th customer sits at a new table with probability α0 / (α0 + n − 1), or at table t with probability ndt / (α0 + n − 1), where ndt is the number of customers at table t
- The n-th customer (at a new table) chooses a new menu item with probability γ / (γ + m − 1), or existing menu item k with probability mk / (γ + m − 1), where m is the number of tables in all restaurants and mk is the number of tables serving menu item k in all restaurants

Alternative form of HDP

G0 ∼ DP(γ, H), φdt ∼ G0

∴ G0 | φdt, . . . ∼ DP(γ + m, (γH + ∑_{k=1}^K mk δφk) / (γ + m))

Then G0 is given as

G0 = ∑_{k=1}^K βk δφk + βu Gu

where

Gu ∼ DP(γ, H)
π = (π1, . . . , πK, πu) ∼ Dir(m1, . . . , mK, γ)
p(φk | ·) ∝ h(φk) ∏_{dn: zdn = k} f(xdn | φk)

Hierarchical Dirichlet Processes

xdn ∼ F(θdn), θdn ∼ Gd, Gd ∼ DP(α0, G0), G0 ∼ DP(γ, H)
⇐⇒
xdn ∼ Mult(φzdn), zdn ∼ Mult(θd), φk ∼ Dir(η), θd ∼ Dir(α0π), π ∼ Dir(m.1, . . . , m.K, γ)

Gibbs Sampling for HDP

Joint distribution:

p(θ, z, φ, x, π, m | α0, η, γ) = p(π | m, γ) ∏_{k=1}^K p(φk | η) ∏_{d=1}^D p(θd | α0, π) ∏_{n=1}^N p(zdn | θd) p(xdn | zdn, φ)

Integrate out θ, φ:

p(z, x, π, m | α0, η, γ) = (Γ(∑_{k=1}^K m.k + γ) / (∏_{k=1}^K Γ(m.k) Γ(γ))) ∏_{k=1}^K πk^{m.k − 1} π_{K+1}^{γ − 1}
  · ∏_{k=1}^K (Γ(∑_{v=1}^V ηv) / ∏_{v=1}^V Γ(ηv)) (∏_{v=1}^V Γ(ηv + n^k_{(·),v}) / Γ(∑_{v=1}^V ηv + n^k_{(·),v}))
  · ∏_{d=1}^D (Γ(∑_{k=1}^K α0πk) / ∏_{k=1}^K Γ(α0πk)) (∏_{k=1}^K Γ(α0πk + n^k_{d,(·)}) / Γ(∑_{k=1}^K α0πk + n^k_{d,(·)}))

Gibbs Sampling for HDP

Full conditional distribution of z:

p(z(d′,n′) = k′ | z−(d′,n′), m, π, x, ·) = p(z(d′,n′) = k′, z−(d′,n′), m, π, x | ·) / p(z−(d′,n′), m, π, x | ·)
∝ p(z(d′,n′) = k′, z−(d′,n′), m, π, x | ·)
∝ (α0πk′ + n^{k′,−(d′,n′)}_{d′,(·)}) (ηv′ + n^{k′,−(d′,n′)}_{(·),v′}) / (∑_{v=1}^V ηv + n^{k′,−(d′,n′)}_{(·),v})

Gibbs Sampling for HDP

Full conditional distribution of m

The probability that word xd′n′ is assigned to some table t such that kdt = k:

p(θd′n′ = φt | φdt = φk, θ−(d′,n′), π) ∝ n^{−(d′,n′)}_{d,(·),t}
p(θd′n′ = new table | φdt_new = φk, θ−(d′,n′), π) ∝ α0πk

These equations form a Dirichlet process with concentration parameter α0πk and assignments of n^{−(d′,n′)}_{d,(·),t} to components. The corresponding distribution over the number of components is the desired conditional distribution of mdk.

Antoniak [Ant74] has shown that

p(md′k′ = m | z, m−(d′k′), π) = (Γ(α0πk′) / Γ(α0πk′ + n^{k′}_{d,(·)})) s(n^{k′}_{d,(·)}, m) (α0πk′)^m

where s(n, m) is the unsigned Stirling number of the first kind

Gibbs Sampling for HDP

Full conditional distribution of π:

(π1, π2, . . . , πK, πu) | · ∼ Dir(m.1, m.2, . . . , m.K, γ)

Gibbs Sampling for HDP

Algorithm 1 Gibbs Sampling for HDP
1: Initialize all latent variables at random
2: repeat
3:   for each document d do
4:     for each word n in document d do
5:       Sample z(d,n) from its full conditional,
         p(z(d,n) = k′ | ·) ∝ (α0πk′ + n^{k′,−(d,n)}_{d,(·)}) (ηv′ + n^{k′,−(d,n)}_{(·),v′}) / (∑_{v=1}^V ηv + n^{k′,−(d,n)}_{(·),v})
6:     end for
7:     Sample m from p(mdk = m | ·) ∝ (Γ(α0πk) / Γ(α0πk + n^k_{d,(·)})) s(n^k_{d,(·)}, m) (α0πk)^m
8:     Sample π ∼ Dir(m.1, m.2, . . . , m.K, γ)
9:   end for
10: until converged
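A skeleton that mirrors Algorithm 1; the three sampling helpers are hypothetical stand-ins for the full conditionals derived above, not a complete implementation.

```python
def gibbs_hdp(corpus, n_iters, sample_z, sample_m, sample_pi, state):
    # Skeleton of Algorithm 1. sample_z / sample_m / sample_pi are
    # hypothetical helpers implementing the full conditionals above.
    for _ in range(n_iters):                # "repeat ... until converged"
        for d, doc in enumerate(corpus):
            for n in range(len(doc)):
                state = sample_z(state, d, n)   # z_{dn} from its full conditional
            state = sample_m(state, d)          # table counts m_{dk} (Antoniak)
            state = sample_pi(state)            # pi | m ~ Dir(m_.1, ..., m_.K, gamma)
    return state
```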



Alternative Stick Breaking Construction

Problem) In the original stick-breaking construction, the weights βk and πdk are tightly correlated:

βk = β′k ∏_{i=1}^{k−1} (1 − β′i), β′k ∼ Beta(1, γ)
πdk = π′dk ∏_{i=1}^{k−1} (1 − π′di), π′dk ∼ Beta(α0βk, α0(1 − ∑_{i=1}^k βi))

Alternative stick-breaking construction for each document [FSJW08]:

ψdt ∼ G0
πdt = π′dt ∏_{i=1}^{t−1} (1 − π′di), π′dt ∼ Beta(1, α0)
Gd = ∑_{t=1}^∞ πdt δψdt

Alternative Stick Breaking Construction

The stick-breaking construction for the HDP:

G0 = ∑_{k=1}^∞ βk δφk, φk ∼ H
βk = β′k ∏_{i=1}^{k−1} (1 − β′i), β′k ∼ Beta(1, γ)

Gd = ∑_{t=1}^∞ πdt δψdt, ψdt ∼ G0
πdt = π′dt ∏_{i=1}^{t−1} (1 − π′di), π′dt ∼ Beta(1, α0)

To connect ψdt and φk, we add an auxiliary variable cdt ∼ Mult(β); then ψdt = φcdt

Alternative Stick Breaking Construction

Generative process
1 For each global-level topic k ∈ 1, . . . , ∞:
  1 Draw topic word proportions φk ∼ Dir(η)
  2 Draw a corpus breaking proportion β′k ∼ Beta(1, γ)
2 For each document d ∈ 1, . . . , D:
  1 For each document-level topic t ∈ 1, . . . , ∞:
    1 Draw a document-level topic index cdt ∼ Mult(σ(β′))
    2 Draw a document breaking proportion π′dt ∼ Beta(1, α0)
  2 For each word n ∈ 1, . . . , N:
    1 Draw a topic index zdn ∼ Mult(σ(π′d))
    2 Generate a word wdn ∼ Mult(φ_{c_{d,zdn}})
3 where σ(β′) ≡ (β1, β2, . . .), βk = β′k ∏_{i=1}^{k−1} (1 − β′i)

Variational Inference

Main idea [JGJS98]
- Modify the original graphical model into a simpler model
- Minimize the dissimilarity between the original and the modified one

More formally
- Observed data X, latent variable Z
- We want to compute p(Z | X)
- Make q(Z)
- Minimize the dissimilarity between p and q [2]

[2] Commonly the KL-divergence of p from q, DKL(q||p)

KL-divergence of p from q

Find a lower bound of the log evidence log p(X):

log p(X) = log ∑_Z p(Z, X) = log ∑_Z p(Z, X) q(Z|X) / q(Z|X)
         = log ∑_Z q(Z|X) p(Z, X) / q(Z|X)
         ≥ ∑_Z q(Z|X) log (p(Z, X) / q(Z|X))   [3]

Gap between the lower bound and log p(X):

log p(X) − ∑_Z q(Z|X) log (p(Z, X) / q(Z|X)) = ∑_Z q(Z) log (q(Z) / p(Z|X)) = DKL(q||p)

[3] Use Jensen's inequality

KL-divergence of p from q

log p(X) = ∑_Z q(Z|X) log (p(Z, X) / q(Z|X)) + DKL(q||p)

The log evidence log p(X) is fixed with respect to q

Minimizing DKL(q||p) ≡ maximizing the lower bound of log p(X)

Variational Inference

Main idea [JGJS98]
- Modify the original graphical model into a simpler model
- Minimize the dissimilarity between the original and the modified one

More formally
- Observed data X, latent variable Z
- We want to compute p(Z | X)
- Make q(Z)
- Minimize the dissimilarity between p and q [4]
  - Find a lower bound of log p(X)
  - Maximize it

[4] Commonly the KL-divergence of p from q, DKL(q||p)

Variational Inference for HDP

The fully factorized variational distribution:

q(β, φ, π, c, z) = ∏_{k=1}^{K} q(φk|λk) ∏_{k=1}^{K−1} q(βk|a1_k, a2_k) ∏_{d=1}^{D} [ ∏_{t=1}^{T} q(cdt|ζdt) ∏_{t=1}^{T−1} q(πdt|γ1_dt, γ2_dt) ∏_{n=1}^{N} q(zdn|ϕdn) ]

Variational Inference for HDP

Find a lower bound of log p(w|α0, γ, η):

ln p(w|α0,γ,η)
= ln ∫_β ∫_φ ∫_π ∑_c ∑_z p(w,β,φ,π,c,z|α0,γ,η) dβ dφ dπ
= ln ∫_β ∫_φ ∫_π ∑_c ∑_z p(w,β,φ,π,c,z|α0,γ,η) · q(β,φ,π,c,z) / q(β,φ,π,c,z) dβ dφ dπ
≥ ∫_β ∫_φ ∫_π ∑_c ∑_z ln [ p(w,β,φ,π,c,z|α0,γ,η) / q(β,φ,π,c,z) ] · q(β,φ,π,c,z) dβ dφ dπ   (∵ Jensen's inequality)
= Eq[ln p(w,β,φ,π,c,z|α0,γ,η)] − Eq[ln q(β,φ,π,c,z)]

Variational Inference for HDP

ln p(w|α0,γ,η)
≥ Eq[ln p(w,β,φ,π,c,z|α0,γ,η)] − Eq[ln q(β,φ,π,c,z)]
= Eq[ln p(β|γ) p(φ|η) ∏_{d=1}^{D} p(πd|α0) p(cd|β) ∏_{n=1}^{N} p(wdn|cd,zdn,φ) p(zdn|πd)]
  − Eq[ln ∏_{k=1}^{K} q(φk|λk) ∏_{k=1}^{K−1} q(βk|a1_k,a2_k) ∏_{d=1}^{D} ∏_{t=1}^{T} q(cdt|ζdt) ∏_{t=1}^{T−1} q(πdt|γ1_dt,γ2_dt) ∏_{n=1}^{N} q(zdn|ϕdn)]
= ∑_{d=1}^{D} { Eq[ln p(πd|α0)] + Eq[ln p(cd|β)] + Eq[ln p(wd|cd,zd,φ)] + Eq[ln p(zd|πd)]
    − Eq[ln q(cd|ζd)] − Eq[ln q(πd|γ1_d,γ2_d)] − Eq[ln q(zd|ϕd)] }
  + Eq[ln p(β|γ)] + Eq[ln p(φ|η)] − Eq[ln q(φ|λ)] − Eq[ln q(β|a1,a2)]

We can run variational EM to maximize this lower bound of log p(w|α0,γ,η).

Variational Inference for HDP

Maximize the lower bound of log p(w|α0,γ,η): take its derivative with respect to each variational parameter and set it to zero. This gives the coordinate updates

γ1_dt = 1 + ∑_{n=1}^{N} ϕdnt,   γ2_dt = α0 + ∑_{n=1}^{N} ∑_{b=t+1}^{T} ϕdnb

ζdtk ∝ exp{ ∑_{e=1}^{k−1} (Ψ(a2_e) − Ψ(a1_e + a2_e)) + (Ψ(a1_k) − Ψ(a1_k + a2_k)) + ∑_{n=1}^{N} ∑_{v=1}^{V} w^v_dn ϕdnt (Ψ(λkv) − Ψ(∑_{l=1}^{V} λkl)) }

ϕdnt ∝ exp{ ∑_{h=1}^{t−1} (Ψ(γ2_dh) − Ψ(γ1_dh + γ2_dh)) + (Ψ(γ1_dt) − Ψ(γ1_dt + γ2_dt)) + ∑_{k=1}^{K} ∑_{v=1}^{V} w^v_dn ζdtk (Ψ(λkv) − Ψ(∑_{l=1}^{V} λkl)) }

a1_k = 1 + ∑_{d=1}^{D} ∑_{t=1}^{T} ζdtk,   a2_k = γ + ∑_{d=1}^{D} ∑_{t=1}^{T} ∑_{f=k+1}^{K} ζdtf

λkv = ηv + ∑_{d=1}^{D} ∑_{n=1}^{N} ∑_{t=1}^{T} w^v_dn ϕdnt ζdtk

where Ψ is the digamma function, and ζdtk and ϕdnt are normalized over k and t respectively.

Variational Inference for HDP

Maximize the lower bound of log p(w|α0,γ,η): derive the updates above and alternate them in a variational EM loop.
I E step: compute the document-level parameters γ1_dt, γ2_dt, ζdtk, ϕdnt
I M step: compute the corpus-level parameters a1_k, a2_k, λkv

Algorithm 2 Variational Inference for HDP
1: Initialize the variational parameters
2: repeat
3:   for each document d do
4:     repeat
5:       Compute document parameters γ1_dt, γ2_dt, ζdtk, ϕdnt
6:     until converged
7:   end for
8:   Compute topic parameters a1_k, a2_k, λkv
9: until converged
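A schematic Python skeleton of Algorithm 2 (a sketch only: the helper functions stand in for the closed-form updates above and are assumptions, not an existing library API):

```python
import numpy as np

def variational_em(docs, K, T, max_iters=100, tol=1e-4):
    """Batch variational EM for HDP; helpers are placeholders for the updates above."""
    corpus_params = initialize_corpus_params(K)                 # a1, a2, lambda
    old_bound = -np.inf
    for _ in range(max_iters):
        doc_params = []
        for doc in docs:                                        # E step, per document
            params = initialize_doc_params(T)                   # gamma1, gamma2, zeta, phi
            while not converged(params):
                params = update_doc_params(doc, params, corpus_params)
            doc_params.append(params)
        corpus_params = update_corpus_params(docs, doc_params)  # M step
        bound = lower_bound(docs, doc_params, corpus_params)
        if abs(bound - old_bound) < tol:                        # outer convergence check
            break
        old_bound = bound
    return corpus_params
```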


Online Variational Inference

Stochastic optimization of the variational objective [WPB11]
I Subsample the documents
I Compute an approximation of the gradient based on the subsample
I Follow that gradient with a decreasing step size

Variational Inference for HDP

Lower bound of log p(w|α0,γ,η):

ln p(w|α0,γ,η)
≥ ∑_{d=1}^{D} { Eq[ln p(πd|α0)] + Eq[ln p(cd|β)] + Eq[ln p(wd|cd,zd,φ)] + Eq[ln p(zd|πd)]
    − Eq[ln q(cd|ζd)] − Eq[ln q(πd|γ1_d,γ2_d)] − Eq[ln q(zd|ϕd)] }
  + Eq[ln p(β|γ)] + Eq[ln p(φ|η)] − Eq[ln q(φ|λ)] − Eq[ln q(β|a1,a2)]
= ∑_{d=1}^{D} Ld + Lk
= E_d[ D·Ld + Lk ]   (expectation over a document d sampled uniformly from the corpus)

Online Variational Inference for HDP

Lower bound of log p(w|α0,γ,η) = E_d[ D·Ld + Lk ]

Online learning algorithm for HDP:
I Sample a document d
I Compute its optimal document-level parameters γ1_dt, γ2_dt, ζdtk, ϕdnt
I Take the noisy gradient of the corpus-level parameters a1_k, a2_k, λkv; the natural gradient is structurally equivalent to the batch variational inference update
I Update the corpus-level parameters with a decreasing learning rate:

a1_k = (1 − ρe) a1_k + ρe (1 + D ∑_{t=1}^{T} ζdtk)
a2_k = (1 − ρe) a2_k + ρe (γ + D ∑_{t=1}^{T} ∑_{f=k+1}^{K} ζdtf)
λkv = (1 − ρe) λkv + ρe (ηv + D ∑_{n=1}^{N} ∑_{t=1}^{T} w^v_dn ϕdnt ζdtk)

where ρe is the learning rate, which satisfies ∑_{e=1}^{∞} ρe = ∞ and ∑_{e=1}^{∞} ρe² < ∞.

Online Variational Inference for HDP

Algorithm 3 Online Variational Inference for HDP
1: Initialize the variational parameters
2: e = 0
3: for each document d ∈ 1, . . . , D do
4:   repeat
5:     Compute document parameters γ1_dt, γ2_dt, ζdtk, ϕdnt
6:   until converged
7:   e = e + 1
8:   Compute learning rate ρe = (τ0 + e)^−κ where τ0 > 0, κ ∈ (0.5, 1]
9:   Update topic parameters a1_k, a2_k, λkv
10: end for
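A minimal sketch of the corpus-level update in step 9 (illustrative; the per-document sufficient statistics are assumed to come from an E step like the one in Algorithm 2):

```python
def online_update(corpus_params, doc_stats, e, tau0=1.0, kappa=0.7):
    """One stochastic update of the corpus-level parameters (schematic).

    doc_stats holds the sampled document's statistics already scaled to the
    corpus, e.g. doc_stats["a1"][k] = 1 + D * sum_t zeta[d, t, k].
    """
    rho = (tau0 + e) ** (-kappa)        # decreasing learning rate, kappa in (0.5, 1]
    for name in ("a1", "a2", "lam"):    # blend old values with the noisy estimate
        corpus_params[name] = (1 - rho) * corpus_params[name] + rho * doc_stats[name]
    return corpus_params
```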


Motivation

Problem 1: Inference for HDP takes a long time.
Problem 2: A continuously expanding corpus necessitates continuous updates of the model parameters.
I But updating the model parameters is not possible with plain HDP
I We must re-train with the entire updated corpus

Our approach: combine distributed inference and online learning.

Distributed Online HDP

I Based on variational inference
I Mini-batch updates via stochastic learning (variational EM)
I Variational EM distributed using MapReduce

Distributed Online HDP

Algorithm 4 Distributed Online HDP - Driver
1: Initialize the variational parameters
2: e = 0
3: while run forever do
4:   Collect new documents s ∈ 1, . . . , S
5:   e = e + 1
6:   Compute learning rate ρe = (τ0 + e)^−κ where τ0 > 0, κ ∈ (0.5, 1]
7:   Run MapReduce job
8:   Get the result of the job and update the topic parameters
9: end while

Algorithm 5 Distributed Online HDP - Mapper
1: The mapper gets one document s ∈ 1, . . . , S
2: repeat
3:   Compute document parameters γ1_dt, γ2_dt, ζdtk, ϕdnt
4: until converged
5: Output the sufficient statistics for the topic parameters

Algorithm 6 Distributed Online HDP - Reducer
1: The reducer gets the sufficient statistics for each topic parameter
2: Compute the change of the topic parameter from the sufficient statistics
3: Output the change of the topic parameter
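A toy sketch of the mapper/reducer pair in plain Python (the structure is an assumption for illustration; the actual system runs these as Hadoop MapReduce tasks, and the document-level helpers are the same placeholders as in the earlier EM sketch):

```python
def mapper(doc, corpus_params, T):
    """Fit document-level parameters, then emit sufficient statistics keyed by topic."""
    params = initialize_doc_params(T)
    while not converged(params):
        params = update_doc_params(doc, params, corpus_params)
    for k, stats in doc_sufficient_stats(doc, params).items():
        yield k, stats                   # key: topic index k

def reducer(k, stats_iter):
    """Aggregate per-document statistics into the change for topic k's parameters."""
    yield k, sum(stats_iter)
```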

Experimental Setup

Data: 973,266 Twitter conversations, 7.54 tweets per conversation on average (approximately 7,297,000 tweets)

60-node Hadoop system, each node with 8 x 2.30 GHz cores

Result

Distributed online HDP runs faster than online HDP and preserves the quality of the result (perplexity).

Practical Tips

Until now, I talked about Bayesian nonparametric topic modeling:
I The concept of Hierarchical Dirichlet Processes
I How to infer the latent variables in HDP

These are theoretical interests.

Someone who attended the last machine learning winter school said:
"Wow! There are good and interesting machine learning topics! But I want to know about practical issues, because I am in the industrial field."

So I prepared some tips for him/her and you.


Implementation

https://github.com/NoSyu/Topic_Models

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 106 / 121

Some tips for using topic models

I How to manage the hyper-parameters (Dirichlet parameters)?
I How to manage the learning rate and mini-batch size in online learning?


HDP (graphical model figure omitted)

Property of the Dirichlet distribution: sample pmfs drawn from the Dirichlet distribution [BAFG10] (figure omitted)

Assign Dirichlet parameters

Dirichlet parameters less than 1:
I People usually use a few topics to write a document
I People usually do not use all topics
I Each topic usually uses a few words to represent itself
I Each topic does not use all words

We can assign weights to individual topics/words:
I Some topics are more general than others
I Some words are more general than others
I Words with positive/negative meaning appear in positive/negative sentiments [JO11]
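To see why parameters below 1 encourage this sparsity, a small NumPy experiment (illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
for alpha in (0.1, 1.0, 10.0):
    theta = rng.dirichlet(np.full(10, alpha), size=1000)
    # average mass on each sample's single largest component
    print(alpha, round(theta.max(axis=1).mean(), 2))
# alpha = 0.1 puts most mass on a few components (a document uses few topics);
# alpha = 10 spreads mass nearly uniformly over all ten.
```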



Compute learning rate ρe = (τ0 + e)^−κ where τ0 > 0, κ ∈ (0.5, 1]:

a1_k = (1 − ρe) a1_k + ρe (1 + D ∑_{t=1}^{T} ζdtk)
a2_k = (1 − ρe) a2_k + ρe (γ + D ∑_{t=1}^{T} ∑_{f=k+1}^{K} ζdtf)
λkv = (1 − ρe) λkv + ρe (ηv + D ∑_{n=1}^{N} ∑_{t=1}^{T} w^v_dn ϕdnt ζdtk)

Meaning of each parameter:
I τ0: slows down the early iterations of the algorithm
I κ: rate at which old values of the topic parameters are forgotten

The best setting depends on the dataset; usually we set τ0 = 1.0, κ = 0.7.
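A two-line illustration of how τ0 and κ shape the schedule (the settings shown are just examples):

```python
import numpy as np

e = np.arange(1, 6)
for tau0, kappa in ((1.0, 0.7), (16.0, 0.7), (1.0, 1.0)):
    print(tau0, kappa, np.round((tau0 + e) ** -kappa, 3))
# a larger tau0 damps the earliest updates; kappa controls how fast the step size decays
```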


Mini-batch size

When the mini-batch size is large, distributed online HDP runs faster, while perplexity remains similar across mini-batch sizes.

Summary

Bayesian Nonparametric Topic Modeling: Hierarchical Dirichlet Processes
I Chinese Restaurant Franchise
I Stick Breaking Construction

Posterior inference for HDP
I Gibbs Sampling
I Variational Inference
I Online Learning

Slides and other materials are uploaded at http://uilab.kaist.ac.kr/members/jinyeongbak
Implementations are updated at http://github.com/NoSyu/Topic_Models

Further Reading

Dirichlet Process
I Dirichlet Process
I Dirichlet distribution and Dirichlet Process + Indian Buffet Process

Bayesian Nonparametric model
I Machine Learning Summer School - Yee Whye Teh
I Machine Learning Summer School - Peter Orbanz
I Introductory article

Inference
I MCMC
I Variational Inference

Thank You!

JinYeong Bak
jy.bak@kaist.ac.kr, linkedin.com/in/jybak

Users & Information Lab, KAIST

References

Charles E. Antoniak, Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems, The Annals of Statistics (1974), 1152–1174.

Bela A. Frigyik, Amol Kapila, and Maya R. Gupta, Introduction to the Dirichlet distribution and related processes, Tech. Report UWEETR-2010-0006, Department of Electrical Engineering, University of Washington, Seattle, WA 98195, December 2010.

Christopher M. Bishop and Nasser M. Nasrabadi, Pattern Recognition and Machine Learning, vol. 1, Springer, New York, 2006.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research 3 (2003), 993–1022.

Emily B. Fox, Erik B. Sudderth, Michael I. Jordan, and Alan S. Willsky, An HDP-HMM for systems with state persistence, Proceedings of the 25th International Conference on Machine Learning, ACM, 2008, pp. 312–319.

Peter D. Hoff, A First Course in Bayesian Statistical Methods, Springer, 2009.

Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul, An introduction to variational methods for graphical models, Springer, 1998.

Yohan Jo and Alice H. Oh, Aspect and sentiment unification model for online review analysis, Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (WSDM '11), ACM, New York, NY, USA, 2011, pp. 815–824.

Radford M. Neal, Markov chain sampling methods for Dirichlet process mixture models, Journal of Computational and Graphical Statistics 9 (2000), no. 2, 249–265.

Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei, Hierarchical Dirichlet processes, Journal of the American Statistical Association 101 (2006), no. 476.

Chong Wang, John W. Paisley, and David M. Blei, Online variational inference for the hierarchical Dirichlet process, International Conference on Artificial Intelligence and Statistics, 2011, pp. 752–760.

Images source

http://christmasstockimages.com/free/ideas_concepts/slides/dice_throw.htm
http://www.flickr.com/photos/autumn2may/3965964418/
http://www.flickr.com/photos/ppix/1802571058/
http://yesurakezu.deviantart.com/art/Domo-s-head-exploding-with-dice-298452871
http://www.flickr.com/photos/jwight/2710392971/
http://www.flickr.com/photos/jasohill/2511594886/
http://en.wikipedia.org/wiki/Kim_Yuna
http://en.wikipedia.org/wiki/Hand_in_Hand_%28Olympics%29
http://en.wikipedia.org/wiki/Gangnam_Style

Measurable space (Ω, B)

Def) A set considered together with the σ-algebra on the set (http://mathworld.wolfram.com/MeasurableSpace.html).

Ω: the set of all outcomes, the sample space
B: a σ-algebra over Ω
I A special kind of collection of subsets of the sample space Ω
  F Closed under complement: if A is in B, then A^C is in B
  F Closed under countable unions and intersections: if A1 and A2 are in B, then A1∪A2 and A1∩A2 are in B
I A collection of events
I Properties
  F Smallest possible σ-algebra: {Ω, ∅}
  F Largest possible σ-algebra: the power set of Ω


Proof 1

Decimative property:
I Let (θ1, θ2, . . . , θK) ∼ Dir(α1, α2, . . . , αK) and (τ1, τ2) ∼ Dir(α1β1, α1β2) where β1 + β2 = 1; then (θ1τ1, θ1τ2, θ2, . . . , θK) ∼ Dir(α1β1, α1β2, α2, . . . , αK)

Then

(G(θ1), G(A1), . . . , G(AR)) = (β1, (1−β1)G′(A1), . . . , (1−β1)G′(AR)) ∼ Dir(1, α0G0(A1), . . . , α0G0(AR))

changes to

(G′(A1), . . . , G′(AR)) ∼ Dir(α0G0(A1), . . . , α0G0(AR)),   i.e. G′ ∼ DP(α0, G0)

using the decimative property with α1 = α0, θ1 = (1−β1), βk = G0(Ak), τk = G′(Ak).
