Bayesian Nonparametric Topic Modeling: Hierarchical Dirichlet Processes
JinYeong Bak
Department of Computer Science, KAIST, Daejeon, South Korea
August 22, 2013
Part of these slides is adapted from a presentation by Yee Whye Teh ([email protected]).
JinYeong Bak (U&I Lab), Bayesian Nonparametric Topic Modeling, August 22, 2013
Outline
1. Introduction: Motivation; Topic Modeling
2. Background: Dirichlet Distribution; Dirichlet Processes
3. Hierarchical Dirichlet Processes: Dirichlet Process Mixture Models; Hierarchical Dirichlet Processes
4. Inference: Gibbs Sampling; Variational Inference; Online Learning; Distributed Online Learning
5. Practical Tips
6. Summary
Introduction
Bayesian topic models:
- Latent Dirichlet Allocation (LDA) [BNJ03]
- Hierarchical Dirichlet Processes (HDP) [TJBB06]
In this talk:
- The Dirichlet distribution and the Dirichlet process
- The concept of Hierarchical Dirichlet Processes (HDP)
- How to infer the latent variables in HDP
Motivation
What are the topics discussed in the article?
How can we describe the topics?
Topic Modeling
Each topic has a word distribution.
Each document has a topic proportion.
Each word has its own topic index.
Latent Dirichlet Allocation
Generative process of LDA:
- For each topic k ∈ {1, ..., K}:
  - Draw a word distribution βk ∼ Dir(η)
- For each document d ∈ {1, ..., D}:
  - Draw topic proportions θd ∼ Dir(α)
  - For each word n ∈ {1, ..., N} in the document:
    - Draw a topic index zdn ∼ Mult(θd)
    - Generate the word from the chosen topic: wdn ∼ Mult(βzdn)
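The generative process above can be simulated directly. The sketch below is illustrative only: the corpus sizes and hyperparameters (K, D, N, V, alpha, eta) are arbitrary assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, N, V = 3, 5, 20, 50          # topics, documents, words per doc, vocabulary size
alpha, eta = 0.5, 0.1              # Dirichlet hyperparameters (illustrative)

# For each topic k: draw a word distribution beta_k ~ Dir(eta)
beta = rng.dirichlet(np.full(V, eta), size=K)        # shape (K, V)

docs = []
for d in range(D):
    theta_d = rng.dirichlet(np.full(K, alpha))       # topic proportions for document d
    z_d = rng.choice(K, size=N, p=theta_d)           # a topic index for each word
    w_d = np.array([rng.choice(V, p=beta[z]) for z in z_d])  # words from chosen topics
    docs.append((theta_d, z_d, w_d))
```

Inference then runs this process in reverse: given only the words `w_d`, recover `theta_d`, `z_d`, and `beta`.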
Latent Dirichlet Allocation
Our interests:
- What are the topics discussed in the article?
- How can we describe the topics?
Latent Dirichlet Allocation
What we can see: the words in documents.
What we want to see: the latent topic structure.
Latent Dirichlet Allocation
Our interests:
- What are the topics discussed in the article? => the topic proportion of each document
- How can we describe the topics? => the word distribution of each topic
Latent Dirichlet Allocation
What we can see: w
What we want to see: θ, z, β
∴ Compute the posterior
p(θ, z, β | w, α, η) = p(θ, z, β, w | α, η) / p(w | α, η)
But this distribution is intractable to compute (because of the normalization term p(w | α, η)), so we use approximate inference methods:
- Gibbs sampling
- Variational inference
Limitation of Latent Dirichlet Allocation
Latent Dirichlet Allocation is a parametric model:
- One must specify the number of topics for a corpus in advance
- One must search for the best number of topics
Q) Can we infer the number of topics from the data automatically?
A) Hierarchical Dirichlet Processes
Dice modeling
Think about the probability of rolling each number with a set of dice.
Each die has its own pmf.
According to textbooks, it is widely assumed to be uniform => 1/6 for a six-sided die.
Is it true?
Ans) No!
Dice modeling
We should model the randomness of the pmf of each die. How can we do that?
- Imagine a bag containing many dice
- We cannot see inside the bag
- We can draw one die out of the bag
OK, but what is the formal description?
Standard Simplex
A generalization of the notion of a triangle or tetrahedron.
All points have non-negative coordinates that sum to 1. [1]
A pmf can be thought of as a point in the standard simplex.
Ex) A point p = (x, y, z), where x ≥ 0, y ≥ 0, z ≥ 0 and x + y + z = 1.
[1] http://en.wikipedia.org/wiki/Simplex
Dirichlet distribution
Definition [BN06]:
- A probability distribution over the (K − 1)-dimensional standard simplex
- A distribution over pmfs of length K
Notation:
θ ∼ Dir(α)
where θ = [θ1, ..., θK] is a random pmf and α = [α1, ..., αK]
Probability density function:
p(θ; α) = (Γ(∑_{k=1}^K αk) / ∏_{k=1}^K Γ(αk)) ∏_{k=1}^K θk^{αk − 1}
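A quick numerical sketch of this definition: a draw from Dir(α) is a random pmf that lies in the standard simplex, and its mean is αk / ∑k αk. The α vector below is an arbitrary example.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = np.array([2.0, 3.0, 5.0])              # illustrative parameter vector
samples = rng.dirichlet(alpha, size=100_000)   # each row is one random pmf

# Every sample lies in the simplex: non-negative coordinates summing to 1
assert (samples >= 0).all()

# The empirical mean approaches alpha / alpha.sum() = (0.2, 0.3, 0.5)
mean_theta = samples.mean(axis=0)
```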
Latent Dirichlet Allocation (figure)
Property of Dirichlet distribution
Density plots [BAFG10] (figure)
Sample pmfs drawn from the Dirichlet distribution [BAFG10] (figure)
Property of Dirichlet distribution
When K = 2, it is the Beta distribution.
The Dirichlet is the conjugate prior for the multinomial distribution:
- Likelihood: X ∼ Mult(n, θ); prior: θ ∼ Dir(α)
- ∴ Posterior: (θ | X) ∼ Dir(α + x), where x = (x1, ..., xK) are the observed counts
- Proof)
p(θ | X) = p(X | θ) p(θ) / p(X)
∝ p(X | θ) p(θ)
= (n! / (x1! ··· xK!)) ∏_{k=1}^K θk^{xk} · (Γ(∑_{k=1}^K αk) / ∏_{k=1}^K Γ(αk)) ∏_{k=1}^K θk^{αk − 1}
= C ∏_{k=1}^K θk^{αk + xk − 1}
∝ Dir(α + x)
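The conjugacy above makes the posterior update a one-line computation: observing counts x turns the prior Dir(α) into Dir(α + x). A minimal sketch, with an arbitrary true pmf and sample size:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = np.array([1.0, 1.0, 1.0])        # prior theta ~ Dir(alpha)
theta_true = np.array([0.2, 0.3, 0.5])   # illustrative "true" pmf
x = rng.multinomial(1000, theta_true)    # observed counts, X ~ Mult(n, theta)

alpha_post = alpha + x                   # conjugate update: posterior Dir(alpha + x)
posterior_mean = alpha_post / alpha_post.sum()
```

With 1000 observations and a weak prior, the posterior mean sits close to the true pmf.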
Property of Dirichlet distribution
Aggregation property:
- Let (θ1, θ2, ..., θK) ∼ Dir(α1, α2, ..., αK); then (θ1 + θ2, θ3, ..., θK) ∼ Dir(α1 + α2, α3, ..., αK)
- In general, if A1, ..., AR is any partition of {1, ..., K}, then (∑_{k∈A1} θk, ..., ∑_{k∈AR} θk) ∼ Dir(∑_{k∈A1} αk, ..., ∑_{k∈AR} αk)
Decimative property:
- Let (θ1, θ2, ..., θK) ∼ Dir(α1, α2, ..., αK) and (τ1, τ2) ∼ Dir(α1 β1, α1 β2) where β1 + β2 = 1; then (θ1 τ1, θ1 τ2, θ2, ..., θK) ∼ Dir(α1 β1, α1 β2, α2, ..., αK)
Neutrality property:
- Let (θ1, θ2, ..., θK) ∼ Dir(α1, α2, ..., αK); then θk is independent of the vector (1/(1 − θk)) (θ1, ..., θk−1, θk+1, ..., θK)
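The aggregation property can be checked by Monte Carlo: summing coordinates of a Dirichlet draw over a partition should match a Dirichlet with summed parameters. The parameters and partition below are an arbitrary example.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha = np.array([1.0, 2.0, 3.0, 4.0])
theta = rng.dirichlet(alpha, size=200_000)

# Partition {1,2} and {3,4}: (theta1 + theta2, theta3 + theta4) ~ Dir(3, 7)
agg = np.stack([theta[:, 0] + theta[:, 1], theta[:, 2] + theta[:, 3]], axis=1)
direct = rng.dirichlet(np.array([3.0, 7.0]), size=200_000)

# Both empirical means should be close to the Dir(3, 7) mean (0.3, 0.7)
agg_mean, direct_mean = agg.mean(axis=0), direct.mean(axis=0)
```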
Dice modeling
Think about the probability of rolling each number with a set of dice.
- Each die has its own pmf
- Draw a die from a bag
Problem) We do not know the number of faces of the dice in the bag.
Solution) The Dirichlet process.
Dirichlet Process
Definition [BAFG10]:
- A distribution over probability measures
- A distribution whose realizations are themselves distributions over a sample space
Formal definition:
- (Ω, B) is a measurable space
- G0 is a distribution over the sample space Ω
- α0 is a positive real number
- G is a random probability measure over (Ω, B)
We write G ∼ DP(α0, G0) if, for any finite measurable partition (A1, ..., AR) of Ω,
(G(A1), ..., G(AR)) ∼ Dir(α0 G0(A1), ..., α0 G0(AR))
Posterior Dirichlet Processes
G ∼ DP(α0, G0) can be treated as a random distribution over Ω, so we can draw a sample θ1 from G.
For any finite partition (A1, ..., AR) of Ω,
p(θ1 ∈ Ar | G) = G(Ar), p(θ1 ∈ Ar) = G0(Ar)
(G(A1), ..., G(AR)) ∼ Dir(α0 G0(A1), ..., α0 G0(AR))
Using Dirichlet-multinomial conjugacy, the posterior is
(G(A1), ..., G(AR)) | θ1 ∼ Dir(α0 G0(A1) + δθ1(A1), ..., α0 G0(AR) + δθ1(AR))
where δθ(Ar) = 1 if θ ∈ Ar and 0 otherwise.
This holds for every finite partition of Ω.
Posterior Dirichlet Processes
For every finite partition of Ω,
(G(A1), ..., G(AR)) | θ1 ∼ Dir(α0 G0(A1) + δθ1(A1), ..., α0 G0(AR) + δθ1(AR))
where δθ1(Ar) = 1 if θ1 ∈ Ar and 0 otherwise.
The posterior process is also a Dirichlet process:
G | θ1 ∼ DP(α0 + 1, (α0 G0 + δθ1)/(α0 + 1))
Summary)
θ1 | G ∼ G, G ∼ DP(α0, G0)  ⇐⇒  θ1 ∼ G0, G | θ1 ∼ DP(α0 + 1, (α0 G0 + δθ1)/(α0 + 1))
Blackwell-MacQueen Urn Scheme
Now we draw samples θ1, ..., θN.
First sample:
θ1 | G ∼ G, G ∼ DP(α0, G0)  ⇐⇒  θ1 ∼ G0, G | θ1 ∼ DP(α0 + 1, (α0 G0 + δθ1)/(α0 + 1))
Second sample:
θ2 | θ1, G ∼ G, G | θ1 ∼ DP(α0 + 1, (α0 G0 + δθ1)/(α0 + 1))
⇐⇒  θ2 | θ1 ∼ (α0 G0 + δθ1)/(α0 + 1), G | θ1, θ2 ∼ DP(α0 + 2, (α0 G0 + δθ1 + δθ2)/(α0 + 2))
Blackwell-MacQueen Urn Scheme
N-th sample:
θN | θ1,...,N−1, G ∼ G, G | θ1,...,N−1 ∼ DP(α0 + N − 1, (α0 G0 + ∑_{n=1}^{N−1} δθn)/(α0 + N − 1))
⇐⇒  θN | θ1,...,N−1 ∼ (α0 G0 + ∑_{n=1}^{N−1} δθn)/(α0 + N − 1), G | θ1,...,N ∼ DP(α0 + N, (α0 G0 + ∑_{n=1}^{N} δθn)/(α0 + N))
Blackwell-MacQueen Urn Scheme
The Blackwell-MacQueen urn scheme produces a sequence θ1, θ2, ... with the conditionals
θN | θ1,...,N−1 ∼ (α0 G0 + ∑_{n=1}^{N−1} δθn)/(α0 + N − 1)
Pólya urn analogy:
- There are infinitely many ball colors, distributed according to G0
- The urn starts empty
- Filling the urn (n starts at 1):
  - With probability α0/(α0 + n − 1), pick a new color from the set of infinite ball colors G0, paint a new ball that color, and add it to the urn
  - With probability (n − 1)/(α0 + n − 1), pick a ball from the urn, record its color, and put it back together with another ball of the same color
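The urn scheme translates directly into a sampler. A minimal sketch, using a standard normal as an illustrative base measure G0 (any distribution would do) and an arbitrary concentration α0:

```python
import numpy as np

rng = np.random.default_rng(4)
alpha0, N = 1.0, 500

draws = []                      # theta_1, ..., theta_N
for n in range(1, N + 1):
    if rng.random() < alpha0 / (alpha0 + n - 1):
        # New "color": a fresh draw from the base measure G0 = N(0, 1)
        draws.append(rng.normal())
    else:
        # Reuse a previous draw, chosen uniformly (equivalent to the
        # urn: probability of a value is proportional to its count)
        draws.append(draws[rng.integers(len(draws))])

num_distinct = len(set(draws))  # grows roughly like alpha0 * log(N)
```

The draws repeat values, which is exactly the clustering behavior the next slides exploit.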
Chinese Restaurant Process
Draw θ1, θ2, ..., θN from the Blackwell-MacQueen urn scheme.
The θ's can take the same value, θi = θj, so there are K ≤ N distinct values φ1, ..., φK.
This works as a partition of Ω: θ1, θ2, ..., θN induce φ1, ..., φK.
The distribution over such partitions is called the Chinese Restaurant Process (CRP).
Chinese Restaurant Process
Chinese restaurant interpretation:
- A Chinese restaurant has infinitely many tables
- Each customer sits at a table
Generating from the Chinese Restaurant Process:
- The first customer sits at the first table
- The n-th customer sits at:
  - a new table with probability α0/(α0 + n − 1)
  - table k with probability nk/(α0 + n − 1), where nk is the number of customers at table k
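The seating rule above can be sketched as a table-assignment sampler; the concentration α0 and sequence length are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)
alpha0, N = 2.0, 1000

counts = []                     # customers per occupied table
assignments = []
for n in range(1, N + 1):
    # Existing tables with probability n_k/(alpha0+n-1), new table with alpha0/(alpha0+n-1)
    probs = np.array(counts + [alpha0]) / (alpha0 + n - 1)
    table = rng.choice(len(probs), p=probs)
    if table == len(counts):
        counts.append(1)        # open a new table
    else:
        counts[table] += 1
    assignments.append(table)

K = len(counts)                 # number of occupied tables (clusters)
```

K grows slowly with N (roughly α0 log N), which is what lets the DP pick the number of clusters from the data.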
Chinese Restaurant Process
The CRP exhibits the clustering property of the DP:
- Tables are clusters: φk ∼ G0
- Customers are the actual realizations: θn = φzn, where zn ∈ {1, ..., K}
Stick Breaking Construction
The Blackwell-MacQueen urn scheme / CRP generates draws θ ∼ G, not G itself.
To construct G itself, we use the stick-breaking construction.
Review) Posterior Dirichlet processes:
θ1 | G ∼ G, G ∼ DP(α0, G0)  ⇐⇒  θ1 ∼ G0, G | θ1 ∼ DP(α0 + 1, (α0 G0 + δθ1)/(α0 + 1))
Consider the partition ({θ1}, Ω\{θ1}) of Ω. Then
(G({θ1}), G(Ω\{θ1})) ∼ Dir((α0 G0 + δθ1)({θ1}), (α0 G0 + δθ1)(Ω\{θ1}))
= Dir(1, α0) = Beta(1, α0)
(assuming G0 is non-atomic, so G0({θ1}) = 0)
Stick Breaking Construction
Consider the partition ({θ1}, Ω\{θ1}) of Ω. Then
(G({θ1}), G(Ω\{θ1})) = (β1, 1 − β1), β1 ∼ Beta(1, α0)
so G has a point mass located at θ1:
G = β1 δθ1 + (1 − β1) G′, β1 ∼ Beta(1, α0)
where G′ is the probability measure with the point mass at θ1 removed.
What is G′?
Stick Breaking Construction
Summary) Posterior Dirichlet processes:
θ1 | G ∼ G, G ∼ DP(α0, G0)  ⇐⇒  θ1 ∼ G0, G | θ1 ∼ DP(α0 + 1, (α0 G0 + δθ1)/(α0 + 1))
G = β1 δθ1 + (1 − β1) G′, β1 ∼ Beta(1, α0)
Consider a further partition ({θ1}, A1, ..., AR) of Ω:
(G({θ1}), G(A1), ..., G(AR)) = (β1, (1 − β1) G′(A1), ..., (1 − β1) G′(AR))
∼ Dir(1, α0 G0(A1), ..., α0 G0(AR))
Using the decimative property of the Dirichlet distribution (proof):
(G′(A1), ..., G′(AR)) ∼ Dir(α0 G0(A1), ..., α0 G0(AR))
∴ G′ ∼ DP(α0, G0)
Stick Breaking Construction
Repeat this construction with the distinct values φ1, φ2, ...:
G ∼ DP(α0, G0)
G = β1 δφ1 + (1 − β1) G′1
G = β1 δφ1 + (1 − β1)(β2 δφ2 + (1 − β2) G′2)
...
G = ∑_{k=1}^∞ πk δφk
where
πk = βk ∏_{i=1}^{k−1} (1 − βi), ∑_{k=1}^∞ πk = 1, βk ∼ Beta(1, α0), φk ∼ G0
A draw from the DP thus looks like a sum of point masses, with masses drawn from a stick-breaking construction.
Stick Breaking Construction
Summary)
G = ∑_{k=1}^∞ πk δφk
πk = βk ∏_{i=1}^{k−1} (1 − βi), ∑_{k=1}^∞ πk = 1, βk ∼ Beta(1, α0), φk ∼ G0
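The stick-breaking formulas above translate into a few lines of code. A truncated sketch (in practice the infinite sum is cut off at a truncation level T; G0 = N(0, 1), α0, and T below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
alpha0, T = 1.0, 200            # concentration and truncation level

beta = rng.beta(1.0, alpha0, size=T)     # beta_k ~ Beta(1, alpha0)
# pi_k = beta_k * prod_{i<k} (1 - beta_i): break off a beta_k fraction
# of the stick that remains after the first k-1 breaks
remaining = np.concatenate([[1.0], np.cumprod(1.0 - beta)[:-1]])
pi = beta * remaining                    # atom weights
phi = rng.normal(size=T)                 # atom locations phi_k ~ G0

mass = pi.sum()                          # approaches 1 as T grows
```

The pair (pi, phi) is a truncated draw of G = ∑k πk δφk; with α0 = 1 and T = 200 the leftover stick mass is negligible.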
Summary of DP
Definition:
- G is a random probability measure over (Ω, B); G ∼ DP(α0, G0) if for any finite measurable partition (A1, ..., AR) of Ω, (G(A1), ..., G(AR)) ∼ Dir(α0 G0(A1), ..., α0 G0(AR))
Chinese Restaurant Process
Stick Breaking Construction
Outline
1 Introduction: Motivation, Topic Modeling
2 Background: Dirichlet Distribution, Dirichlet Processes
3 Hierarchical Dirichlet Processes: Dirichlet Process Mixture Models, Hierarchical Dirichlet Processes
4 Inference: Gibbs Sampling, Variational Inference, Online Learning, Distributed Online Learning
5 Practical Tips
6 Summary
Dirichlet Process Mixture Models
We model a data set x1, . . . , xN using the following model [Nea00]:
xn ∼ F(θn)
θn ∼ G
G ∼ DP(α0, G0)
Each θn is a latent parameter modelling xn, while G is the unknown distribution over parameters, modelled using a DP.
Dirichlet Process Mixture Models
Since G is of the form
G = ∑_{k=1}^∞ πk δφk
we have θn = φk with probability πk.
Let kn take on value k with probability πk. We can equivalently define θn = φ_{k_n}.
An equivalent model:
xn ∼ F(θn), θn ∼ G, G ∼ DP(α0, G0)
⇐⇒
xn ∼ F(φ_{k_n}), p(kn = k) = πk, πk = βk ∏_{i=1}^{k−1} (1−βi), βk ∼ Beta(1, α0), φk ∼ G0
Dirichlet Process Mixture Models
xn ∼ F(θn), θn ∼ G, G ∼ DP(α0, G0)
⇐⇒
xn ∼ F(φ_{k_n}), p(kn = k) = πk, πk = βk ∏_{i=1}^{k−1} (1−βi), βk ∼ Beta(1, α0), φk ∼ G0
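The equivalent model on the right can be sampled from directly. The sketch below is my own illustration (assuming NumPy): the choices G0 = N(0, 3²) for the base measure and F(φ) = N(φ, 1) for the component likelihood are hypothetical, not prescribed by the slides.

```python
import numpy as np

def sample_dpmm(n, alpha0=1.0, truncation=200, rng=None):
    """Generate x_1..x_n from a truncated DP mixture via the equivalent
    model: beta_k ~ Beta(1, alpha0), pi_k = beta_k prod_{i<k}(1 - beta_i),
    phi_k ~ G0, p(k_n = k) = pi_k, x_n ~ F(phi_{k_n})."""
    rng = np.random.default_rng(rng)
    betas = rng.beta(1.0, alpha0, size=truncation)
    pi = betas * np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    pi = pi / pi.sum()                            # renormalise truncated weights
    phi = rng.normal(0.0, 3.0, size=truncation)   # phi_k ~ G0 = N(0, 3^2)
    k = rng.choice(truncation, size=n, p=pi)      # p(k_n = k) = pi_k
    x = rng.normal(phi[k], 1.0)                   # x_n ~ F(phi_{k_n}) = N(phi, 1)
    return x, k

x, k = sample_dpmm(1000, alpha0=1.0, rng=42)
```

Even with 1000 data points, only a handful of the mixture components are actually used, which is the clustering effect of the DP prior.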
Outline
1 Introduction: Motivation, Topic Modeling
2 Background: Dirichlet Distribution, Dirichlet Processes
3 Hierarchical Dirichlet Processes: Dirichlet Process Mixture Models, Hierarchical Dirichlet Processes
4 Inference: Gibbs Sampling, Variational Inference, Online Learning, Distributed Online Learning
5 Practical Tips
6 Summary
Topic modeling with documents
- Each document consists of a bag of words
- Each word in a document has a latent topic index
- Latent topics for words in a document can be grouped
- Each document has a topic proportion
- Each topic has a word distribution
- Topics must be shared across documents
Problem of Naive Dirichlet Process Mixture Model
Use a DP mixture for each document:
xdn ∼ F(θdn), θdn ∼ Gd, Gd ∼ DP(α0, G0)
But there is no sharing of clusters across different groups, because G0 is smooth (continuous): the atoms of different Gd almost surely differ.
G1 = ∑_{k=1}^∞ π1k δφ_{1k}, G2 = ∑_{k=1}^∞ π2k δφ_{2k}
φ1k, φ2k ∼ G0
Problem of Naive Dirichlet Process Mixture Model
Solution:
- Make the base distribution G0 discrete
- Put a DP prior on the common base distribution
Hierarchical Dirichlet Process:
G0 ∼ DP(γ, H)
G1, G2 | G0 ∼ DP(α0, G0)
Hierarchical Dirichlet Processes
Making G0 discrete forces clusters to be shared between G1 and G2.
Stick Breaking Construction
A Hierarchical Dirichlet Process with documents 1, . . . , D:
G0 ∼ DP(γ, H)
Gd | G0 ∼ DP(α0, G0)
The stick-breaking construction for the HDP:
G0 = ∑_{k=1}^∞ βk δφk, φk ∼ H
βk = β′k ∏_{i=1}^{k−1} (1−β′i), β′k ∼ Beta(1, γ)
Gd = ∑_{k=1}^∞ πdk δφk
πdk = π′dk ∏_{i=1}^{k−1} (1−π′di), π′dk ∼ Beta(α0βk, α0(1−∑_{i=1}^k βi))
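The HDP stick-breaking construction can also be sketched in a few lines. The code below is my own illustration (assuming NumPy; the function name and truncation are not from the slides): it draws the corpus-level weights βk and then, for each document, weights πdk over the *same* atoms, which is exactly how sharing arises.

```python
import numpy as np

def hdp_weights(gamma, alpha0, n_docs, truncation=50, rng=None):
    """Corpus-level weights beta_k and per-document weights pi_dk for a
    truncated HDP: beta'_k ~ Beta(1, gamma), beta_k = beta'_k prod_{i<k}(1-beta'_i);
    pi'_dk ~ Beta(alpha0*beta_k, alpha0*(1 - sum_{i<=k} beta_i)).

    All documents share the same atoms phi_k; only the weights pi_d differ."""
    rng = np.random.default_rng(rng)
    b = rng.beta(1.0, gamma, size=truncation)                     # beta'_k
    beta = b * np.concatenate(([1.0], np.cumprod(1.0 - b)[:-1]))  # beta_k
    tail = np.clip(1.0 - np.cumsum(beta), 1e-12, None)            # 1 - sum_{i<=k} beta_i
    pis = np.empty((n_docs, truncation))
    for d in range(n_docs):
        pp = rng.beta(alpha0 * beta, alpha0 * tail)               # pi'_dk
        pis[d] = pp * np.concatenate(([1.0], np.cumprod(1.0 - pp)[:-1]))
    return beta, pis

beta, pis = hdp_weights(gamma=1.0, alpha0=1.0, n_docs=5, rng=1)
```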
Chinese Restaurant Franchise
Gd | G0 ∼ DP(α0, G0), θdn ∼ Gd
Draw θd1, θd2, . . . from a Blackwell-MacQueen urn scheme.
θd1, θd2, . . . induce the table values φd1, φd2, . . .
Chinese Restaurant Franchise
Gd | G0 ∼ DP(α0, G0), θdn ∼ Gd
Draw θd1, θd2, . . . from a Blackwell-MacQueen urn scheme; θd1, θd2, . . . induce φd1, φd2, . . .
Draw θd′1, θd′2, . . . from a Blackwell-MacQueen urn scheme; θd′1, θd′2, . . . induce φd′1, φd′2, . . .
Chinese Restaurant Franchise
G0 ∼ DP(γ, H), φk ∼ H
Gd | G0 ∼ DP(α0, G0), θdn ∼ Gd
Draw θd1, θd2, . . . from a Blackwell-MacQueen urn scheme; θd1, θd2, . . . induce φd1, φd2, . . .
Draw θd′1, θd′2, . . . from a Blackwell-MacQueen urn scheme; θd′1, θd′2, . . . induce φd′1, φd′2, . . .
Chinese Restaurant Franchise
Chinese Restaurant Franchise interpretation:
- Each restaurant has infinitely many tables
- All restaurants share the same food menu
- Each customer sits at a table
Generating from the Chinese Restaurant Franchise, for each restaurant:
- The first customer sits at the first table and chooses a new menu item
- The n-th customer sits at
  - a new table with probability α0 / (α0 + n − 1)
  - table t with probability n_{dt} / (α0 + n − 1), where n_{dt} is the number of customers at table t
- A customer starting a new table chooses
  - a new menu item with probability γ / (γ + m − 1)
  - existing menu item k with probability m_k / (γ + m − 1), where m is the number of tables in all restaurants and m_k is the number of tables serving menu item k in all restaurants
Chinese Restaurant Franchise
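The seating process above can be simulated directly. The sketch below is my own illustration in plain Python (no external dependencies; function name and bookkeeping are mine, and it uses the m/(γ+m) convention for the menu choice): customers are words, restaurants are documents, and menu items are topics shared across restaurants.

```python
import random

def chinese_restaurant_franchise(n_customers_per_doc, alpha0, gamma, seed=0):
    """Simulate table/menu assignments in a Chinese Restaurant Franchise.
    Returns, per restaurant, the menu (topic) index of each customer, plus
    m_k: the number of tables serving each menu item over all restaurants."""
    rnd = random.Random(seed)
    menu_counts = []                  # m_k over all restaurants
    topics = []
    for n_cust in n_customers_per_doc:
        table_counts = []             # n_dt: customers at each table here
        table_dish = []               # menu item served at each table
        assignments = []
        for _ in range(n_cust):
            # table t w.p. n_dt, new table w.p. alpha0 (normalised implicitly)
            weights = table_counts + [alpha0]
            t = rnd.choices(range(len(weights)), weights=weights)[0]
            if t == len(table_counts):        # new table: pick its menu item
                mweights = menu_counts + [gamma]
                k = rnd.choices(range(len(mweights)), weights=mweights)[0]
                if k == len(menu_counts):     # brand-new menu item
                    menu_counts.append(0)
                menu_counts[k] += 1
                table_counts.append(0)
                table_dish.append(k)
            table_counts[t] += 1
            assignments.append(table_dish[t])
        topics.append(assignments)
    return topics, menu_counts

topics, menu_counts = chinese_restaurant_franchise([50, 50, 50], alpha0=1.0, gamma=1.0)
```

Because the menu is shared, the same topic indices reappear across all three simulated restaurants.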
HDP for Topic modeling
Questions:
- What can we assume about the topics in a document?
- What can we assume about the words in the topics?
Solution:
- Each document consists of a bag of words
- Each word in a document has a latent topic
- Latent topics for words in a document can be grouped
- Each document has a topic proportion
- Each topic has a word distribution
- Topics must be shared across documents
Outline
1 Introduction: Motivation, Topic Modeling
2 Background: Dirichlet Distribution, Dirichlet Processes
3 Hierarchical Dirichlet Processes: Dirichlet Process Mixture Models, Hierarchical Dirichlet Processes
4 Inference: Gibbs Sampling, Variational Inference, Online Learning, Distributed Online Learning
5 Practical Tips
6 Summary
Gibbs Sampling
Definition:
- A special case of the Markov chain Monte Carlo (MCMC) method
- An iterative algorithm that constructs a dependent sequence of parameter values whose distribution converges to the target joint posterior distribution [Hof09]
Algorithm:
- Find the full conditional distribution of each latent variable under the target distribution
- Initialize all latent variables
- Until converged: sample each latent variable from its full conditional distribution
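The algorithm above is easiest to see on a toy model where both full conditionals are known in closed form. The example below is hypothetical (not from the slides, assuming NumPy): Gibbs sampling for a bivariate normal with correlation ρ, where X | Y=y ∼ N(ρy, 1−ρ²) and symmetrically for Y.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_iter=5000, burn_in=500, rng=None):
    """Gibbs sampling for (X, Y) ~ N(0, [[1, rho], [rho, 1]]).
    Each iteration samples one variable from its full conditional in turn."""
    rng = np.random.default_rng(rng)
    x = y = 0.0                          # initialise latent variables
    sd = np.sqrt(1.0 - rho ** 2)
    samples = []
    for i in range(n_iter):
        x = rng.normal(rho * y, sd)      # sample x | y
        y = rng.normal(rho * x, sd)      # sample y | x
        if i >= burn_in:                 # discard burn-in draws
            samples.append((x, y))
    return np.array(samples)

samples = gibbs_bivariate_normal(rho=0.8, rng=0)
```

After burn-in, the empirical correlation of the dependent sample sequence approaches the target's ρ.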
Collapsed Gibbs sampling
A collapsed Gibbs sampler integrates out one or more variables when sampling the others.
Example) There are three latent variables A, B and C.
- Plain Gibbs: sample p(A|B,C), p(B|A,C) and p(C|A,B) sequentially.
- When we integrate out B: sample only p(A|C) and p(C|A) sequentially.
Review) Dirichlet Process Mixture Models
xn ∼ F(θn), θn ∼ G, G ∼ DP(α0, G0)
⇐⇒
xn ∼ F(φ_{k_n}), p(kn = k) = πk, πk = βk ∏_{i=1}^{k−1} (1−βi), βk ∼ Beta(1, α0), φk ∼ G0
Review) Blackwell-MacQueen Urn Scheme for DP
N-th sample:
θN | θ1,...,N−1, G ∼ G,  G | θ1,...,N−1 ∼ DP(α0 + N − 1, (α0G0 + ∑_{n=1}^{N−1} δθn) / (α0 + N − 1))
⇐⇒ θN | θ1,...,N−1 ∼ (α0G0 + ∑_{n=1}^{N−1} δθn) / (α0 + N − 1),
G | θ1,...,N ∼ DP(α0 + N, (α0G0 + ∑_{n=1}^N δθn) / (α0 + N))
Review) Chinese Restaurant Franchise
Generating from the Chinese Restaurant Franchise, for each restaurant:
- The first customer sits at the first table and chooses a new menu item
- The n-th customer sits at
  - a new table with probability α0 / (α0 + n − 1)
  - table t with probability n_{dt} / (α0 + n − 1), where n_{dt} is the number of customers at table t
- A customer starting a new table chooses
  - a new menu item with probability γ / (γ + m − 1)
  - existing menu item k with probability m_k / (γ + m − 1), where m is the number of tables in all restaurants and m_k is the number of tables serving menu item k in all restaurants
Alternative form of HDP
G0 ∼ DP(γ, H), φdt ∼ G0
∴ G0 | φdt, . . . ∼ DP(γ + m, (γH + ∑_{k=1}^K m_k δφk) / (γ + m))
Then G0 is given as
G0 = ∑_{k=1}^K πk δφk + πu Gu
where
Gu ∼ DP(γ, H)
π = (π1, . . . , πK, πu) ∼ Dir(m_1, . . . , m_K, γ)
p(φk | ·) ∝ h(φk) ∏_{dn: zdn=k} f(xdn | φk)
Hierarchical Dirichlet Processes
xdn ∼ F(θdn), θdn ∼ Gd, Gd ∼ DP(α0, G0), G0 ∼ DP(γ, H)
⇐⇒
xdn ∼ Mult(φ_{z_dn}), zdn ∼ Mult(θd), φk ∼ Dir(η), θd ∼ Dir(α0π), π ∼ Dir(m.1, . . . , m.K, γ)
Gibbs Sampling for HDP
Joint distribution:
p(θ, z, φ, x, π, m | α0, η, γ) = p(π | m, γ) ∏_{k=1}^K p(φk | η) ∏_{d=1}^D [ p(θd | α0, π) ∏_{n=1}^N p(zdn | θd) p(xdn | zdn, φ) ]
Integrate out θ, φ:
p(z, x, π, m | α0, η, γ) = [Γ(∑_{k=1}^K m.k + γ) / (∏_{k=1}^K Γ(m.k) Γ(γ))] ∏_{k=1}^K π_k^{m.k−1} · π_{K+1}^{γ−1}
· ∏_{k=1}^K [Γ(∑_{v=1}^V ηv) / ∏_{v=1}^V Γ(ηv)] · [∏_{v=1}^V Γ(ηv + n^k_{(·),v}) / Γ(∑_{v=1}^V (ηv + n^k_{(·),v}))]
· ∏_{d=1}^M [Γ(∑_{k=1}^K α0πk) / ∏_{k=1}^K Γ(α0πk)] · [∏_{k=1}^K Γ(α0πk + n^k_{d,(·)}) / Γ(∑_{k=1}^K (α0πk + n^k_{d,(·)}))]
Gibbs Sampling for HDP
Full conditional distribution of z:
p(z_{(d′,n′)} = k′ | z_{−(d′,n′)}, m, π, x, ·) = p(z_{(d′,n′)} = k′, z_{−(d′,n′)}, m, π, x | ·) / p(z_{−(d′,n′)}, m, π, x | ·)
∝ p(z_{(d′,n′)} = k′, z_{−(d′,n′)}, m, π, x | ·)
∝ (α0πk′ + n^{k′,−(d′,n′)}_{d′,(·)}) · (ηv′ + n^{k′,−(d′,n′)}_{(·),v′}) / (∑_{v=1}^V (ηv + n^{k′,−(d′,n′)}_{(·),v}))
Gibbs Sampling for HDP
Full conditional distribution of m. The probability that word x_{d′n′} is assigned to some table t such that k_{dt} = k:
p(θ_{d′n′} = φt | φdt = φk, θ_{−(d′,n′)}, π) ∝ n^{−(d′,n′)}_{d′,(·),t}
p(θ_{d′n′} = new table | φ_{dt_new} = φk, θ_{−(d′,n′)}, π) ∝ α0πk
These equations form a Dirichlet process with concentration parameter α0πk and assignments of n^{−(d′,n′)}_{d′,(·),t} customers to components. The corresponding distribution over the number of components is the desired conditional distribution of m_{dk}.
Antoniak [Ant74] has shown that
p(m_{d′k′} = m | z, m_{−d′k′}, π) = [Γ(α0πk′) / Γ(α0πk′ + n^{k′}_{d′,(·)})] · s(n^{k′}_{d′,(·)}, m) · (α0πk′)^m
where s(n, m) is the unsigned Stirling number of the first kind.
Gibbs Sampling for HDP
Full conditional distribution of π:
(π1, π2, . . . , πK, πu) | · ∼ Dir(m.1, m.2, . . . , m.K, γ)
Gibbs Sampling for HDP

Algorithm 1 Gibbs Sampling for HDP
1: Initialize all latent variables randomly
2: repeat
3:   for each document d do
4:     for each word n in document d do
5:       Sample z_{(d,n)} from its full conditional: p(z_{(d,n)} = k′) ∝ (α0πk′ + n^{k′,−(d,n)}_{d,(·)}) (ηv′ + n^{k′,−(d,n)}_{(·),v′}) / (∑_{v=1}^V (ηv + n^{k′,−(d,n)}_{(·),v}))
6:     end for
7:     Sample m from p(m_{dk′} = m) ∝ [Γ(α0πk′) / Γ(α0πk′ + n^{k′}_{d,(·)})] s(n^{k′}_{d,(·)}, m)(α0πk′)^m
8:     Sample π ∼ Dir(m.1, m.2, . . . , m.K, γ)
9:   end for
10: until converged
Outline
1 Introduction: Motivation, Topic Modeling
2 Background: Dirichlet Distribution, Dirichlet Processes
3 Hierarchical Dirichlet Processes: Dirichlet Process Mixture Models, Hierarchical Dirichlet Processes
4 Inference: Gibbs Sampling, Variational Inference, Online Learning, Distributed Online Learning
5 Practical Tips
6 Summary
Stick Breaking Construction
A Hierarchical Dirichlet Process with documents 1, . . . , D:
G0 ∼ DP(γ, H)
Gd | G0 ∼ DP(α0, G0)
The stick-breaking construction for the HDP:
G0 = ∑_{k=1}^∞ βk δφk, φk ∼ H
βk = β′k ∏_{i=1}^{k−1} (1−β′i), β′k ∼ Beta(1, γ)
Gd = ∑_{k=1}^∞ πdk δφk
πdk = π′dk ∏_{i=1}^{k−1} (1−π′di), π′dk ∼ Beta(α0βk, α0(1−∑_{i=1}^k βi))
Alternative Stick Breaking Construction
Problem) In the original stick-breaking construction, the weights βk and πdk are tightly correlated:
βk = β′k ∏_{i=1}^{k−1} (1−β′i), β′k ∼ Beta(1, γ)
πdk = π′dk ∏_{i=1}^{k−1} (1−π′di), π′dk ∼ Beta(α0βk, α0(1−∑_{i=1}^k βi))
Alternative stick-breaking construction for each document [FSJW08]:
ψdt ∼ G0
πdt = π′dt ∏_{i=1}^{t−1} (1−π′di), π′dt ∼ Beta(1, α0)
Gd = ∑_{t=1}^∞ πdt δψdt
Alternative Stick Breaking Construction
The stick-breaking construction for the HDP:
G0 = ∑_{k=1}^∞ βk δφk, φk ∼ H
βk = β′k ∏_{i=1}^{k−1} (1−β′i), β′k ∼ Beta(1, γ)
Gd = ∑_{t=1}^∞ πdt δψdt, ψdt ∼ G0
πdt = π′dt ∏_{i=1}^{t−1} (1−π′di), π′dt ∼ Beta(1, α0)
To connect ψdt and φk, we add an auxiliary variable cdt ∼ Mult(β); then ψdt = φ_{c_dt}.
Alternative Stick Breaking Construction
Generative process:
1 For each global-level topic k ∈ 1, . . . , ∞:
  1 Draw topic word proportions φk ∼ Dir(η)
  2 Draw a corpus breaking proportion β′k ∼ Beta(1, γ)
2 For each document d ∈ 1, . . . , D:
  1 For each document-level topic t ∈ 1, . . . , ∞:
    1 Draw a document-level topic index cdt ∼ Mult(σ(β′))
    2 Draw a document breaking proportion π′dt ∼ Beta(1, α0)
  2 For each word n ∈ 1, . . . , N:
    1 Draw a topic index zdn ∼ Mult(σ(π′d))
    2 Generate a word wdn ∼ Mult(φ_{c_{d z_dn}})
where σ(β′) ≡ (β1, β2, . . .), with βk = β′k ∏_{i=1}^{k−1} (1−β′i)
Variational Inference
Main idea [JGJS98]:
- Replace the original graphical model with a simpler model
- Minimize the dissimilarity between the original and the simplified one
More formally:
- Observed data X, latent variable Z
- We want to compute p(Z|X)
- Introduce q(Z)
- Minimize the dissimilarity between p and q (commonly the KL divergence of p from q, DKL(q||p))
KL-divergence of p from q
Find a lower bound of the log evidence log p(X):
log p(X) = log ∑_Z p(Z, X) = log ∑_Z p(Z, X) · q(Z|X)/q(Z|X)
= log ∑_Z q(Z|X) · p(Z, X)/q(Z|X)
≥ ∑_Z q(Z|X) log [p(Z, X)/q(Z|X)]   (Jensen's inequality)
Gap between log p(X) and its lower bound:
log p(X) − ∑_Z q(Z|X) log [p(Z, X)/q(Z|X)] = ∑_Z q(Z) log [q(Z)/p(Z|X)] = DKL(q||p)
KL-divergence of p from q
log p(X) = ∑_Z q(Z|X) log [p(Z, X)/q(Z|X)] + DKL(q||p)
The log evidence log p(X) is fixed with respect to q, so minimising DKL(q||p) ≡ maximizing the lower bound of log p(X).
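The identity above can be checked numerically for a discrete latent variable. The snippet below is my own toy illustration (assuming NumPy; the joint table and q are made up): it computes the lower bound (ELBO), the KL gap, and the exact log evidence, and their sum relation holds to machine precision.

```python
import numpy as np

def elbo_and_kl(joint, q):
    """For a discrete unnormalised joint p(z, x) (vector over z) and a
    variational q(z): ELBO = sum_z q(z) log[p(z,x)/q(z)],
    KL = DKL(q || p(.|x)); ELBO + KL = log p(x) = log sum_z p(z, x)."""
    evidence = joint.sum()
    posterior = joint / evidence            # exact p(z | x)
    elbo = np.sum(q * np.log(joint / q))    # lower bound of log evidence
    kl = np.sum(q * np.log(q / posterior))  # the gap
    return elbo, kl, np.log(evidence)

joint = np.array([0.1, 0.25, 0.05])   # hypothetical p(z, x) for 3 values of z
q = np.array([0.2, 0.5, 0.3])         # any q(z) with full support
elbo, kl, log_evidence = elbo_and_kl(joint, q)
```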
Variational Inference
Main idea [JGJS98]:
- Replace the original graphical model with a simpler model
- Minimize the dissimilarity between the original and the simplified one
More formally:
- Observed data X, latent variable Z
- We want to compute p(Z|X)
- Introduce q(Z)
- Minimize the dissimilarity between p and q (commonly DKL(q||p)):
  - Find a lower bound of log p(X)
  - Maximize it
Variational Inference for HDP
q(β, φ, π, c, z) = ∏_{k=1}^K q(φk | λk) ∏_{k=1}^{K−1} q(βk | a^1_k, a^2_k) ∏_{d=1}^D [ ∏_{t=1}^T q(cdt | ζdt) ∏_{t=1}^{T−1} q(πdt | γ^1_dt, γ^2_dt) ∏_{n=1}^N q(zdn | ϕdn) ]
Variational Inference for HDP
Find a lower bound of log p(w | α0, γ, η):
ln p(w | α0, γ, η)
= ln ∫_β ∫_φ ∫_π ∑_c ∑_z p(w, β, φ, π, c, z | α0, γ, η) dβ dφ dπ
= ln ∫_β ∫_φ ∫_π ∑_c ∑_z p(w, β, φ, π, c, z | α0, γ, η) · [q(β, φ, π, c, z)/q(β, φ, π, c, z)] dβ dφ dπ
≥ ∫_β ∫_φ ∫_π ∑_c ∑_z ln [p(w, β, φ, π, c, z | α0, γ, η)/q(β, φ, π, c, z)] · q(β, φ, π, c, z) dβ dφ dπ   (∵ Jensen's inequality)
= Eq[ln p(w, β, φ, π, c, z | α0, γ, η)] − Eq[ln q(β, φ, π, c, z)]
Variational Inference for HDP
ln p(w | α0, γ, η)
≥ Eq[ln p(w, β, φ, π, c, z | α0, γ, η)] − Eq[ln q(β, φ, π, c, z)]
= Eq[ln p(β|γ) p(φ|η) ∏_{d=1}^D p(πd|α0) p(cd|β) ∏_{n=1}^N p(wdn|cd, zdn, φ) p(zdn|πd)]
− Eq[ln ∏_{k=1}^K q(φk|λk) ∏_{k=1}^{K−1} q(βk|a^1_k, a^2_k) ∏_{d=1}^D ∏_{t=1}^T q(cdt|ζdt) ∏_{t=1}^{T−1} q(πdt|γ^1_dt, γ^2_dt) ∏_{n=1}^N q(zdn|ϕdn)]
= ∑_{d=1}^D { Eq[ln p(πd|α0)] + Eq[ln p(cd|β)] + Eq[ln p(wd|cd, zd, φ)] + Eq[ln p(zd|πd)]
− Eq[ln q(cd|ζd)] − Eq[ln q(πd|γ^1_d, γ^2_d)] − Eq[ln q(zd|ϕd)] }
+ Eq[ln p(β|γ)] + Eq[ln p(φ|η)] − Eq[ln q(φ|λ)] − Eq[ln q(β|a^1, a^2)]
We can run variational EM to maximize this lower bound of log p(w | α0, γ, η).
Variational Inference for HDP
Maximize the lower bound of log p(w | α0, γ, η): take its derivative with respect to each variational parameter and set it to zero.
γ^1_dt = 1 + ∑_{n=1}^N ϕdnt,  γ^2_dt = α0 + ∑_{n=1}^N ∑_{b=t+1}^T ϕdnb
ζdtk ∝ exp( ∑_{e=1}^{k−1} (Ψ(a^2_e) − Ψ(a^1_e + a^2_e)) + (Ψ(a^1_k) − Ψ(a^1_k + a^2_k)) + ∑_{n=1}^N ∑_{v=1}^V w^v_dn ϕdnt (Ψ(λkv) − Ψ(∑_{l=1}^V λkl)) )
ϕdnt ∝ exp( ∑_{h=1}^{t−1} (Ψ(γ^2_dh) − Ψ(γ^1_dh + γ^2_dh)) + (Ψ(γ^1_dt) − Ψ(γ^1_dt + γ^2_dt)) + ∑_{k=1}^K ∑_{v=1}^V w^v_dn ζdtk (Ψ(λkv) − Ψ(∑_{l=1}^V λkl)) )
a^1_k = 1 + ∑_{d=1}^D ∑_{t=1}^T ζdtk,  a^2_k = γ + ∑_{d=1}^D ∑_{t=1}^T ∑_{f=k+1}^K ζdtf
λkv = ηv + ∑_{d=1}^D ∑_{n=1}^N ∑_{t=1}^T w^v_dn ϕdnt ζdtk
Variational Inference for HDP
Maximize the lower bound of log p(w | α0, γ, η): take the derivative with respect to each variational parameter, then run variational EM.
- E step: compute the document-level parameters γ^1_dt, γ^2_dt, ζdtk, ϕdnt
- M step: compute the corpus-level parameters a^1_k, a^2_k, λkv

Algorithm 2 Variational Inference for HDP
1: Initialize the variational parameters
2: repeat
3:   for each document d do
4:     repeat
5:       Compute document parameters γ^1_dt, γ^2_dt, ζdtk, ϕdnt
6:     until converged
7:   end for
8:   Compute topic parameters a^1_k, a^2_k, λkv
9: until converged
Outline
1 Introduction: Motivation, Topic Modeling
2 Background: Dirichlet Distribution, Dirichlet Processes
3 Hierarchical Dirichlet Processes: Dirichlet Process Mixture Models, Hierarchical Dirichlet Processes
4 Inference: Gibbs Sampling, Variational Inference, Online Learning, Distributed Online Learning
5 Practical Tips
6 Summary
Online Variational Inference
Stochastic optimization of the variational objective [WPB11]:
- Subsample the documents
- Compute an approximation of the gradient based on the subsample
- Follow that gradient with a decreasing step size
Variational Inference for HDP
Lower bound of log p(w | α0, γ, η):
ln p(w | α0, γ, η)
≥ ∑_{d=1}^D { Eq[ln p(πd|α0)] + Eq[ln p(cd|β)] + Eq[ln p(wd|cd, zd, φ)] + Eq[ln p(zd|πd)]
− Eq[ln q(cd|ζd)] − Eq[ln q(πd|γ^1_d, γ^2_d)] − Eq[ln q(zd|ϕd)] }
+ Eq[ln p(β|γ)] + Eq[ln p(φ|η)] − Eq[ln q(φ|λ)] − Eq[ln q(β|a^1, a^2)]
= ∑_{d=1}^D Ld + Lk = E_j[ D Lj + Lk ]
where j is a document index drawn uniformly at random.
Online Variational Inference for HDP
Lower bound of log p(w | α0, γ, η) = E_j[ D Lj + Lk ]
Online learning algorithm for HDP:
- Sample a document d
- Compute its optimal document-level parameters γ^1_dt, γ^2_dt, ζdtk, ϕdnt
- Take the gradient of the corpus-level parameters a^1_k, a^2_k, λkv with noise
- Update the corpus-level parameters a^1_k, a^2_k, λkv with a decreasing learning rate:
a^1_k = (1−ρe) a^1_k + ρe (1 + D ∑_{t=1}^T ζdtk)
a^2_k = (1−ρe) a^2_k + ρe (γ + D ∑_{t=1}^T ∑_{f=k+1}^K ζdtf)
λkv = (1−ρe) λkv + ρe (ηv + D ∑_{n=1}^N ∑_{t=1}^T w^v_dn ϕdnt ζdtk)
where ρe is the learning rate, which must satisfy ∑_{e=1}^∞ ρe = ∞ and ∑_{e=1}^∞ ρe² < ∞.
(The natural gradient step is structurally equivalent to the batch variational update.)
Online Variational Inference for HDP

Algorithm 3 Online Variational Inference for HDP
1: Initialize the variational parameters
2: e = 0
3: for each document d ∈ 1, . . . , D do
4:   repeat
5:     Compute document parameters γ^1_dt, γ^2_dt, ζdtk, ϕdnt
6:   until converged
7:   e = e + 1
8:   Compute learning rate ρe = (τ0 + e)^−κ where τ0 > 0, κ ∈ (0.5, 1]
9:   Update topic parameters a^1_k, a^2_k, λkv
10: end for
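The decreasing-step-size update pattern in Algorithm 3 can be sketched on its own. The code below is my own illustration (assuming NumPy; the "noisy estimate" here is a stand-in for the per-document natural-gradient term, and the toy target value 3.0 is made up): ρe = (τ0 + e)^−κ with κ ∈ (0.5, 1] satisfies the two summability conditions, so the iterates converge despite the noise.

```python
import numpy as np

def learning_rate(e, tau0=1.0, kappa=0.7):
    """rho_e = (tau0 + e)^(-kappa); kappa in (0.5, 1] gives
    sum rho_e = inf and sum rho_e^2 < inf (Robbins-Monro conditions)."""
    return (tau0 + e) ** (-kappa)

def online_update(param, noisy_estimate, rho):
    """One stochastic step: blend the old parameter with the estimate
    computed from a single sampled document."""
    return (1.0 - rho) * param + rho * noisy_estimate

# Toy run: the per-step estimates are noisy observations of a true value 3.0
rng = np.random.default_rng(0)
lam = 0.0
for e in range(1, 2001):
    lam = online_update(lam, 3.0 + rng.normal(0.0, 1.0), learning_rate(e))
```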
Outline
1 Introduction: Motivation, Topic Modeling
2 Background: Dirichlet Distribution, Dirichlet Processes
3 Hierarchical Dirichlet Processes: Dirichlet Process Mixture Models, Hierarchical Dirichlet Processes
4 Inference: Gibbs Sampling, Variational Inference, Online Learning, Distributed Online Learning
5 Practical Tips
6 Summary
Motivation
Problem 1: Inference for HDP takes a long time.
Problem 2: A continuously expanding corpus necessitates continuous updates of the model parameters.
- But updating the model parameters is not possible with plain HDP
- We must re-train with the entire updated corpus
Our approach: combine distributed inference and online learning.
Distributed Online HDP
- Based on variational inference
- Mini-batch updates via stochastic learning (variational EM)
- Distributes variational EM using MapReduce
Distributed Online HDP

Algorithm 4 Distributed Online HDP - Driver
1: Initialize the variational parameters
2: e = 0
3: while run forever do
4:   Collect new documents s ∈ 1, . . . , S
5:   e = e + 1
6:   Compute learning rate ρe = (τ0 + e)^−κ where τ0 > 0, κ ∈ (0.5, 1]
7:   Run MapReduce job
8:   Get the result of the job and update the topic parameters
9: end while
Distributed Online HDP

Algorithm 5 Distributed Online HDP - Mapper
1: Mapper gets one document s ∈ 1, . . . , S
2: repeat
3:   Compute document parameters γ^1_dt, γ^2_dt, ζdtk, ϕdnt
4: until converged
5: Output the sufficient statistics for the topic parameters

Algorithm 6 Distributed Online HDP - Reducer
1: Reducer gets the sufficient statistics for each topic parameter
2: Compute the change of the topic parameter from the sufficient statistics
3: Output the change of the topic parameter
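The mapper/reducer contract above can be sketched without Hadoop. The snippet below is my own minimal simulation in plain Python (the dict-of-partial-counts representation is a stand-in for the real ζ/ϕ sufficient statistics): mappers emit (topic, statistic) pairs per document, and the reducer sums them per topic.

```python
def mapper(doc_stats):
    """Per-document E-step output: (topic_id, sufficient statistic) pairs,
    standing in for the per-document statistics of Algorithm 5."""
    return list(doc_stats.items())

def reducer(pairs):
    """Sum the sufficient statistics per topic, as in Algorithm 6."""
    totals = {}
    for k, v in pairs:
        totals[k] = totals.get(k, 0.0) + v
    return totals

# Three hypothetical documents' partial statistics, keyed by topic id
docs = [{0: 1.5, 1: 0.5}, {0: 0.5, 2: 2.0}, {1: 1.0}]
all_pairs = [pair for d in docs for pair in mapper(d)]
topic_totals = reducer(all_pairs)   # the driver then applies the rho_e update
```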
Experimental Setup
- Data: 973,266 Twitter conversations, 7.54 tweets per conversation
- Approximately 7,297,000 tweets
- 60-node Hadoop system
- Each node with 8 x 2.30GHz cores
Result
- Distributed Online HDP runs faster than online HDP
- Distributed Online HDP preserves the quality of the result (perplexity)
Practical Tips
Until now, I talked about Bayesian Nonparametric Topic Modeling:
- the concept of Hierarchical Dirichlet Processes
- how to infer the latent variables in HDP
These are theoretical interests.
Someone who attended the last machine learning winter school said:
"Wow! There are good and interesting machine learning topics! But I want to know about practical issues, because I am in the industrial field."
So I prepared some tips for him/her and you.
Implementation
https://github.com/NoSyu/Topic_Models
Some tips for using topic models
How to manage hyper-parameters (Dirichlet parameters)?
How to manage learning rate and mini-batch size in online learning?
HDP
Property of Dirichlet distribution
Sample pmfs from Dirichlet distribution [BAFG10]
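The sampled pmfs this slide refers to can be reproduced numerically. A minimal sketch (NumPy, parameters chosen for illustration): a symmetric Dirichlet parameter below 1 yields spiky pmfs that put most mass on a few entries, while a large parameter yields near-uniform pmfs.

```python
import numpy as np

rng = np.random.default_rng(42)
K = 10

sparse = rng.dirichlet(np.full(K, 0.1), size=1000)   # alpha < 1: spiky pmfs
dense = rng.dirichlet(np.full(K, 10.0), size=1000)   # alpha >> 1: flat pmfs

# Average largest entry of each sampled pmf:
# close to 1 for small alpha, close to 1/K for large alpha
sparse_peak = sparse.max(axis=1).mean()
dense_peak = dense.max(axis=1).mean()
```

This is exactly the behavior exploited by the tips on the next slide: small Dirichlet parameters encode the prior belief that a document uses few topics and a topic uses few words.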
Assign Dirichlet parameters
Dirichlet parameters are less than 1
I People usually use only a few topics to write a document
I People usually do not use all topics
I Each topic usually uses a few words to represent itself
I Each topic does not use all words
We can assign weights to each topic/word
I Some topics are more general than others
I Some words are more general than others
I Words that have positive/negative meaning appear in positive/negative sentiments [JO11]
Compute learning rate ρ_e = (τ_0 + e)^{−κ} where τ_0 > 0, κ ∈ (0.5, 1]

a¹_k = (1 − ρ_e) a¹_k + ρ_e (1 + D ∑_{t=1}^{T} ζ_{dtk})

a²_k = (1 − ρ_e) a²_k + ρ_e (γ + D ∑_{t=1}^{T} ∑_{f=k+1}^{K} ζ_{dtf})

λ_{kv} = (1 − ρ_e) λ_{kv} + ρ_e (η_v + D ∑_{n=1}^{N} ∑_{t=1}^{T} w^v_{dn} ϕ_{dnt} ζ_{dtk})

Meaning of each parameter
I τ_0: Slows down the early iterations of the algorithm
I κ: Rate at which old values of the topic parameters are forgotten
So the best values depend on the dataset
Usually, we set τ_0 = 1.0, κ = 0.7
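The schedule above is easy to illustrate. A minimal sketch (not tied to any particular implementation) showing that ρ_e decays monotonically with the suggested defaults, and that a larger τ_0 damps the early updates:

```python
# Learning-rate schedule rho_e = (tau0 + e)^(-kappa) with tau0 > 0, kappa in (0.5, 1]
def rho(e, tau0=1.0, kappa=0.7):
    return (tau0 + e) ** (-kappa)

rates = [rho(e) for e in range(1, 6)]                  # defaults tau0 = 1.0, kappa = 0.7
slow_start = [rho(e, tau0=64.0) for e in range(1, 6)]  # larger tau0 damps early steps
```

The constraint κ ∈ (0.5, 1] is what makes the stochastic updates well-behaved: the steps shrink fast enough to converge but slowly enough that every mini-batch still contributes.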
Mini-batch size
When the mini-batch size is large, distributed online HDP runs faster
Perplexity is similar to that of the other settings
Summary
Bayesian Nonparametric Topic Modeling
Hierarchical Dirichlet Processes
I Chinese Restaurant FranchiseI Stick Breaking Construction
Posterior Inference for HDPI Gibbs SamplingI Variational InferenceI Online Learning
Slides and other materials are available at http://uilab.kaist.ac.kr/members/jinyeongbak
Implementations are maintained at http://github.com/NoSyu/Topic_Models
Further Reading
Dirichlet Process
I Dirichlet Process
I Dirichlet distribution and Dirichlet Process + Indian Buffet Process
Bayesian Nonparametric model
I Machine Learning Summer School - Yee Whye Teh
I Machine Learning Summer School - Peter Orbanz
I Introductory article
Inference
I MCMC
I Variational Inference
Thank You!
JinYeong [email protected], linkedin.com/in/jybak
Users & Information Lab, KAIST
References I
Charles E. Antoniak, Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems, The Annals of Statistics (1974), 1152–1174.

Bela A. Frigyik, Amol Kapila, and Maya R. Gupta, Introduction to the Dirichlet distribution and related processes, Tech. Report UWEETR-2010-0006, Department of Electrical Engineering, University of Washington, Seattle, WA 98195, December 2010.

Christopher M. Bishop and Nasser M. Nasrabadi, Pattern Recognition and Machine Learning, vol. 1, Springer New York, 2006.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research 3 (2003), 993–1022.

Emily B. Fox, Erik B. Sudderth, Michael I. Jordan, and Alan S. Willsky, An HDP-HMM for systems with state persistence, Proceedings of the 25th International Conference on Machine Learning, ACM, 2008, pp. 312–319.
References II
Peter D. Hoff, A First Course in Bayesian Statistical Methods, Springer, 2009.

Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul, An introduction to variational methods for graphical models, Springer, 1998.

Yohan Jo and Alice H. Oh, Aspect and sentiment unification model for online review analysis, Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (WSDM '11), New York, NY, USA, ACM, 2011, pp. 815–824.

Radford M. Neal, Markov chain sampling methods for Dirichlet process mixture models, Journal of Computational and Graphical Statistics 9 (2000), no. 2, 249–265.

Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei, Hierarchical Dirichlet processes, Journal of the American Statistical Association 101 (2006), no. 476.
References III
Chong Wang, John W. Paisley, and David M. Blei, Online variational inference for the hierarchical Dirichlet process, International Conference on Artificial Intelligence and Statistics, 2011, pp. 752–760.
Images source I
http://christmasstockimages.com/free/ideas_concepts/slides/dice_throw.htm
http://www.flickr.com/photos/autumn2may/3965964418/
http://www.flickr.com/photos/ppix/1802571058/
http://yesurakezu.deviantart.com/art/Domo-s-head-exploding-with-dice-298452871
http://www.flickr.com/photos/jwight/2710392971/
http://www.flickr.com/photos/jasohill/2511594886/
http://en.wikipedia.org/wiki/Kim_Yuna
http://en.wikipedia.org/wiki/Hand_in_Hand_%28Olympics%29
http://en.wikipedia.org/wiki/Gangnam_Style
Measurable space (Ω, B)
Def) A set considered together with the σ-algebra on the set [6]
Ω: the set of all outcomes, the sample space
B: σ-algebra over Ω
I Special kind of collection of subsets of the sample space Ω
F Complete (closed under complement): if A ∈ B, then A^C ∈ B
F Closed under countable unions and intersections: if A ∈ B and B′ ∈ B, then A ∪ B′ ∈ B and A ∩ B′ ∈ B
I A collection of events
I Property
F Smallest possible σ-algebra: {Ω, ∅}
F Largest possible σ-algebra: the power set of Ω
[6] http://mathworld.wolfram.com/MeasurableSpace.html
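These closure properties can be checked mechanically on a tiny sample space. A minimal sketch (the helper names are mine) verifying that the power set of Ω = {1, 2, 3}, the largest possible σ-algebra, contains Ω and ∅ and is closed under complement and union:

```python
from itertools import chain, combinations

omega = frozenset({1, 2, 3})                     # a tiny sample space

def powerset(s):
    """All subsets of s, as frozensets."""
    return {frozenset(c) for c in
            chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))}

B = powerset(omega)                              # largest possible sigma-algebra

contains_omega_and_empty = omega in B and frozenset() in B
closed_complement = all(omega - A in B for A in B)
closed_union = all((A | C) in B for A in B for C in B)
```

For a finite Ω, closure under countable unions reduces to closure under pairwise unions, which is why the finite check above suffices.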
Proof 1
Decimative property
I Let (θ_1, θ_2, . . . , θ_K) ∼ Dir(α_1, α_2, . . . , α_K) and (τ_1, τ_2) ∼ Dir(α_1β_1, α_1β_2) where β_1 + β_2 = 1,
then (θ_1τ_1, θ_1τ_2, θ_2, . . . , θ_K) ∼ Dir(α_1β_1, α_1β_2, α_2, . . . , α_K)

Then

(G({θ_1}), G(A_1), . . . , G(A_R)) = (β_1, (1 − β_1)G′(A_1), . . . , (1 − β_1)G′(A_R)) ∼ Dir(1, α_0G_0(A_1), . . . , α_0G_0(A_R))

changes to

(G′(A_1), . . . , G′(A_R)) ∼ Dir(α_0G_0(A_1), . . . , α_0G_0(A_R)), i.e. G′ ∼ DP(α_0, G_0)

using the decimative property with

α_1 = α_0, θ_1 = (1 − β_1), β_k = G_0(A_k), τ_k = G′(A_k)
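The decimative property can be sanity-checked numerically. A small Monte Carlo sketch (parameter values are arbitrary; comparing only the component means, which is a necessary but not sufficient check of the distributional claim):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000

alpha = np.array([2.0, 3.0, 5.0])   # (theta_1, theta_2, theta_3) ~ Dir(alpha)
beta = np.array([0.4, 0.6])         # beta_1 + beta_2 = 1

theta = rng.dirichlet(alpha, size=n)
tau = rng.dirichlet(alpha[0] * beta, size=n)

# Decimated draws: (theta_1 tau_1, theta_1 tau_2, theta_2, theta_3)
split = np.column_stack([theta[:, :1] * tau, theta[:, 1:]])

# Claimed distribution: Dir(alpha_1 beta_1, alpha_1 beta_2, alpha_2, alpha_3)
target = np.concatenate([alpha[0] * beta, alpha[1:]])
err = np.abs(split.mean(axis=0) - target / target.sum()).max()
```

Each decimated row still sums to 1 because θ_1(τ_1 + τ_2) = θ_1, so the split vectors remain valid pmfs.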