Transcript
Page 1: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Bayesian Nonparametric Topic Modeling: Hierarchical Dirichlet Processes

JinYeong Bak

Department of Computer Science, KAIST, Daejeon

South Korea

[email protected]

August 22, 2013

Part of these slides is adapted from a presentation by Yee Whye Teh ([email protected]).

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 1 / 121

Page 2: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Outline
1 Introduction: Motivation, Topic Modeling
2 Background: Dirichlet Distribution, Dirichlet Processes
3 Hierarchical Dirichlet Processes: Dirichlet Process Mixture Models, Hierarchical Dirichlet Processes
4 Inference: Gibbs Sampling, Variational Inference, Online Learning, Distributed Online Learning
5 Practical Tips
6 Summary



Page 4: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Introduction

Bayesian topic models
- Latent Dirichlet Allocation (LDA) [BNJ03]
- Hierarchical Dirichlet Processes (HDP) [TJBB06]

In this talk:
- Dirichlet distribution, Dirichlet process
- Concept of Hierarchical Dirichlet Processes (HDP)
- How to infer the latent variables in HDP


Page 5: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Motivation



Page 10: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Motivation

What are the topics discussed in the article?

How can we describe the topics?



Page 12: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Topic Modeling



Page 16: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Topic Modeling

Each topic has a word distribution


Page 17: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Topic Modeling

Each document has topic proportions
Each word has its own topic index


Page 20: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Topic Modeling



Page 25: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Latent Dirichlet Allocation

Generative process of LDA
For each topic k ∈ {1, . . . , K}:
- Draw a word distribution βk ∼ Dir(η)
For each document d ∈ {1, . . . , D}:
- Draw topic proportions θd ∼ Dir(α)
- For each word n ∈ {1, . . . , N} in the document:
  - Draw a topic index zdn ∼ Mult(θd)
  - Generate the word from the chosen topic: wdn ∼ Mult(βzdn)

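The generative process above can be sketched directly in plain Python. This is a minimal illustration, not the author's code: the function names (`generate_corpus`, `sample_dirichlet`, `sample_categorical`) and the tiny corpus sizes are assumptions made for the example.

```python
import random

def sample_dirichlet(alpha):
    # Draw from Dir(alpha) by normalizing independent Gamma(alpha_k, 1) draws.
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]

def sample_categorical(probs):
    # Draw an index k with probability probs[k].
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def generate_corpus(K=3, V=10, D=5, N=20, eta=0.1, alpha=0.5):
    # Topics: each beta_k is a distribution over the V-word vocabulary.
    beta = [sample_dirichlet([eta] * V) for _ in range(K)]
    corpus = []
    for _ in range(D):
        theta = sample_dirichlet([alpha] * K)   # topic proportions theta_d
        doc = []
        for _ in range(N):
            z = sample_categorical(theta)       # topic index z_dn
            w = sample_categorical(beta[z])     # word w_dn from topic z
            doc.append(w)
        corpus.append(doc)
    return corpus
```

Each generated document is a list of word indices into the vocabulary; inference would try to recover `beta` and `theta` from such data.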


Page 29: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Latent Dirichlet Allocation

Our interests:
- What are the topics discussed in the article?
- How can we describe the topics?


Page 30: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Latent Dirichlet Allocation: What we can see

Words in documents


Page 31: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Latent Dirichlet Allocation: What we want to see


Page 32: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Latent Dirichlet Allocation

Our interests:
- What are the topics discussed in the article? => The topic proportions of each document
- How can we describe the topics? => The word distribution of each topic


Page 33: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Latent Dirichlet Allocation

What we can see: w

What we want to see: θ, z, β

∴ Compute p(θ, z, β | w, α, η) = p(θ, z, β, w | α, η) / p(w | α, η)

But this distribution is intractable to compute (because of the normalization term), so we use approximate methods:

- Gibbs Sampling
- Variational Inference




Page 36: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Limitation of Latent Dirichlet Allocation

Latent Dirichlet Allocation is a parametric model
- People must specify the number of topics in a corpus
- People must search for the best number of topics

Q) Can we get it from the data automatically?

A) Hierarchical Dirichlet Processes



Page 38: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Dice modeling
Think about the probability of a number rolled on a die
Each die has its own pmf
According to textbooks, it is widely assumed to be uniform
=> 1/6 for a 6-sided die

Is it true?



Page 40: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Dice modeling
Think about the probability of a number rolled on a die
According to textbooks, it is widely assumed to be uniform
=> 1/6 for a 6-sided die

Is it true? Ans) No!


Page 41: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Dice modeling

We should model the randomness of the pmf of each die
How can we do that?
- Imagine a bag that contains many dice
- We cannot see inside the bag
- We can draw one die from the bag

OK, but what is the formal description?



Page 43: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Standard Simplex

A generalization of the notion of a triangle or tetrahedron

All points are non-negative and sum to 1 [1]

A pmf can be thought of as a point in the standard simplex

Ex) A point p = (x, y, z), where x ≥ 0, y ≥ 0, z ≥ 0 and x + y + z = 1

[1] http://en.wikipedia.org/wiki/Simplex


Page 45: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Dirichlet distribution

Definition [BN06]
- A probability distribution over the (K − 1)-dimensional standard simplex
- A distribution over pmfs of length K

Notation

θ ∼ Dir(α)

where θ = [θ1, . . . , θK] is a random pmf and α = [α1, . . . , αK]

Probability density function

p(θ; α) = Γ(∑_{k=1}^K αk) / (∏_{k=1}^K Γ(αk)) · ∏_{k=1}^K θk^(αk − 1)

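As a quick sanity check on this density, one can draw many pmfs from Dir(α) via normalized Gamma draws and compare the empirical mean of each coordinate against the known mean E[θk] = αk / ∑j αj. This is a minimal stdlib-only sketch; the fixed seed, sample count, and parameter values are arbitrary choices for the example.

```python
import random

def sample_dirichlet(alpha, rng=random.Random(0)):
    # Dir(alpha) draw: normalize independent Gamma(alpha_k, 1) draws.
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

alpha = [2.0, 3.0, 5.0]
n = 20000
sums = [0.0, 0.0, 0.0]
for _ in range(n):
    theta = sample_dirichlet(alpha)
    assert abs(sum(theta) - 1.0) < 1e-9       # each draw lies on the simplex
    for k in range(3):
        sums[k] += theta[k]

mean = [s / n for s in sums]
expected = [a / sum(alpha) for a in alpha]     # E[theta_k] = alpha_k / sum(alpha)
```

The empirical means land close to (0.2, 0.3, 0.5), confirming that a Dirichlet draw really is a random pmf concentrated according to α.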


Page 48: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Latent Dirichlet Allocation


Page 49: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Property of Dirichlet distribution: Density plots [BAFG10]


Page 50: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Property of Dirichlet distribution: Sample pmfs from the Dirichlet distribution [BAFG10]


Page 51: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Property of Dirichlet distribution

When K = 2, it is the Beta distribution
Conjugate prior for the Multinomial distribution
- Likelihood X ∼ Mult(n, θ), prior θ ∼ Dir(α)
- ∴ Posterior θ|X ∼ Dir(α + n)
- Proof)

p(θ|X) = p(X|θ) p(θ) / p(X)
∝ p(X|θ) p(θ)
= [n! / (x1! · · · xK!)] ∏_{k=1}^K θk^(xk) · [Γ(∑_{k=1}^K αk) / ∏_{k=1}^K Γ(αk)] ∏_{k=1}^K θk^(αk − 1)
= C ∏_{k=1}^K θk^(αk + xk − 1)
= Dir(α + n)

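The conjugate update above amounts to "add the observed counts to the prior parameters". A minimal numerical illustration (the helper names `dirichlet_posterior` and `posterior_mean` and the toy counts are assumptions for this example):

```python
def dirichlet_posterior(alpha, counts):
    # Dir(alpha) prior + Multinomial counts -> Dir(alpha + counts) posterior.
    return [a + n for a, n in zip(alpha, counts)]

def posterior_mean(alpha):
    # E[theta_k] under Dir(alpha) is alpha_k / sum(alpha).
    s = sum(alpha)
    return [a / s for a in alpha]

prior = [1.0, 1.0, 1.0]          # symmetric Dir(1) prior over a 3-face die
counts = [10, 2, 0]              # observed face counts
post = dirichlet_posterior(prior, counts)
```

The posterior mean shifts from uniform toward the empirical frequencies, with the prior acting as pseudo-counts.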


Page 53: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Property of Dirichlet distribution

Aggregation property
- Let (θ1, θ2, . . . , θK) ∼ Dir(α1, α2, . . . , αK),
  then (θ1 + θ2, θ3, . . . , θK) ∼ Dir(α1 + α2, α3, . . . , αK)
- In general, if A1, . . . , AR is any partition of {1, . . . , K},
  then (∑_{k∈A1} θk, . . . , ∑_{k∈AR} θk) ∼ Dir(∑_{k∈A1} αk, . . . , ∑_{k∈AR} αk)

Decimative property
- Let (θ1, θ2, . . . , θK) ∼ Dir(α1, α2, . . . , αK)
  and (τ1, τ2) ∼ Dir(α1β1, α1β2) where β1 + β2 = 1,
  then (θ1τ1, θ1τ2, θ2, . . . , θK) ∼ Dir(α1β1, α1β2, α2, . . . , αK)

Neutrality property
- Let (θ1, θ2, . . . , θK) ∼ Dir(α1, α2, . . . , αK),
  then θk is independent of the vector (1/(1 − θk)) (θ1, . . . , θk−1, θk+1, . . . , θK)

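The aggregation property can be checked by Monte Carlo: summing the first two coordinates of Dir(α1, α2, α3) draws should reproduce, at least in its mean, the first coordinate of Dir(α1 + α2, α3). This is a stdlib-only sketch; the seed, sample size, and tolerance are arbitrary choices.

```python
import random

rng = random.Random(42)

def sample_dirichlet(alpha):
    # Dir(alpha) draw via normalized Gamma draws.
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

alpha = [1.0, 2.0, 3.0]
n = 20000
# Empirical mean of theta_1 + theta_2 under Dir(1, 2, 3).
agg_mean = sum(t[0] + t[1] for t in (sample_dirichlet(alpha) for _ in range(n))) / n
# Under aggregation, theta_1 + theta_2 is the first coordinate of Dir(1 + 2, 3),
# whose mean is (1 + 2) / (1 + 2 + 3) = 0.5.
expected = (alpha[0] + alpha[1]) / sum(alpha)
```

A full check would compare the whole distribution (e.g. by histogram), but matching the mean already illustrates how partitions of coordinates collapse into a lower-dimensional Dirichlet.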



Page 58: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Dice modeling
Think about the probability of a number rolled on a die

Each die has its own pmf

Draw a die from a bag

Problem) We do not know the number of faces of the dice in the bag

Solution) Dirichlet process



Page 60: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Dirichlet Process

Definition [BAFG10]
- A distribution over probability measures
- A distribution whose realizations are distributions over a sample space

Formal definition
- (Ω, B) is a measurable space
- G0 is a distribution over the sample space Ω
- α0 is a positive real number
- G is a random probability measure over (Ω, B)

G ∼ DP(α0,G0)

if for any finite measurable partition (A1, . . . ,AR) of Ω

(G(A1), . . . ,G(AR))∼ Dir(α0G0(A1), . . . ,α0G0(AR))



Page 62: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Posterior Dirichlet Processes

G ∼ DP(α0, G0) can be treated as a random distribution over Ω

We can draw a sample θ1 from G

We can also take a finite partition (A1, . . . , AR) of Ω; then p(θ1 ∈ Ar | G) = G(Ar) and p(θ1 ∈ Ar) = G0(Ar)

(G(A1), . . . ,G(AR))∼ Dir(α0G0(A1), . . . ,α0G0(AR))

Using Dirichlet-multinomial conjugacy, the posterior is

(G(A1), . . . , G(AR)) | θ1 ∼ Dir(α0G0(A1) + δθ1(A1), . . . , α0G0(AR) + δθ1(AR))

where δθ1(Ar) = 1 if θ1 ∈ Ar and 0 otherwise

It is always true for every finite partition of Ω



Page 66: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Posterior Dirichlet Processes

For every finite partition of Ω,

(G(A1), . . . , G(AR)) | θ1 ∼ Dir(α0G0(A1) + δθ1(A1), . . . , α0G0(AR) + δθ1(AR))

where δθ1(Ar) = 1 if θ1 ∈ Ar and 0 otherwise

The posterior process is also a Dirichlet process

G | θ1 ∼ DP(α0 + 1, (α0G0 + δθ1) / (α0 + 1))

Summary)

θ1 | G ∼ G, G ∼ DP(α0, G0)

⇐⇒ θ1 ∼ G0, G | θ1 ∼ DP(α0 + 1, (α0G0 + δθ1) / (α0 + 1))



Page 69: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Blackwell-MacQueen Urn Scheme

Now we draw a sample θ1, . . . ,θN

First sample

θ1 | G ∼ G, G ∼ DP(α0, G0)

⇐⇒ θ1 ∼ G0, G | θ1 ∼ DP(α0 + 1, (α0G0 + δθ1) / (α0 + 1))

Second sample

θ2 | θ1, G ∼ G, G | θ1 ∼ DP(α0 + 1, (α0G0 + δθ1) / (α0 + 1))

⇐⇒ θ2 | θ1 ∼ (α0G0 + δθ1) / (α0 + 1), G | θ1, θ2 ∼ DP(α0 + 2, (α0G0 + δθ1 + δθ2) / (α0 + 2))



Page 72: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Blackwell-MacQueen Urn Scheme

Nth sample

θN | θ1,...,N−1, G ∼ G, G | θ1,...,N−1 ∼ DP(α0 + N − 1, (α0G0 + ∑_{n=1}^{N−1} δθn) / (α0 + N − 1))

⇐⇒ θN | θ1,...,N−1 ∼ (α0G0 + ∑_{n=1}^{N−1} δθn) / (α0 + N − 1), G | θ1,...,N ∼ DP(α0 + N, (α0G0 + ∑_{n=1}^{N} δθn) / (α0 + N))


Page 73: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Blackwell-MacQueen Urn Scheme

The Blackwell-MacQueen urn scheme produces a sequence θ1, θ2, . . . with the following conditionals:

θN | θ1,...,N−1 ∼ (α0G0 + ∑_{n=1}^{N−1} δθn) / (α0 + N − 1)

As a Polya urn analogy:
- There are infinitely many ball colors
- Start with an empty urn
- Fill the urn as follows (n starts at 1):
  - With probability proportional to α0, pick a new color from the set of infinite ball colors (G0), paint a new ball that color, and add it to the urn
  - With probability proportional to n − 1, pick a ball from the urn, record its color, and put it back together with another ball of the same color

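The urn process above can be simulated directly. This is a stdlib-only sketch under an assumed setup: `polya_urn` is an assumed name, and G0 is taken to be Uniform(0, 1) (a continuous base measure, so fresh colors are always distinct).

```python
import random

def polya_urn(alpha0, n_draws, rng=None):
    # Blackwell-MacQueen / Polya urn: the n-th draw is a fresh color from G0
    # w.p. alpha0 / (alpha0 + n - 1), or a uniformly chosen existing ball's
    # color w.p. (n - 1) / (alpha0 + n - 1).
    rng = rng or random.Random(7)
    urn = []  # colors of balls currently in the urn
    for n in range(1, n_draws + 1):
        if rng.random() < alpha0 / (alpha0 + n - 1):
            color = rng.random()          # new color drawn from G0 = Uniform(0, 1)
        else:
            color = rng.choice(urn)       # reuse an existing ball's color
        urn.append(color)
    return urn

draws = polya_urn(alpha0=1.0, n_draws=200)
n_colors = len(set(draws))
```

The rich-get-richer dynamic means the 200 draws concentrate on far fewer than 200 distinct colors, which is the clustering behavior the next slides exploit.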

Page 74: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Chinese Restaurant Process

Draw θ1, θ2, . . . , θN from the Blackwell-MacQueen urn scheme
- With probability proportional to α0, pick a new color from the set of infinite ball colors (G0), paint a new ball that color, and add it to the urn
- With probability proportional to n − 1, pick a ball from the urn, record its color, and put it back together with another ball of the same color

The θs can take the same value, θi = θj

There are K < N distinct values, φ1, . . . ,φK

These distinct values induce a partition of the draws

θ1, θ2, . . . , θN reduce to the distinct values φ1, . . . , φK

The distribution over partitions is called the Chinese Restaurant Process (CRP)



Page 76: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Chinese Restaurant Process

θ1, θ2, . . . , θN reduce to the distinct values φ1, . . . , φK

Chinese Restaurant Process interpretation
- There is a Chinese restaurant with infinitely many tables
- Each customer sits at a table

Generating from the Chinese Restaurant Process:
- The first customer sits at the first table
- The n-th customer sits at:
  - a new table with probability α0 / (α0 + n − 1)
  - table k with probability nk / (α0 + n − 1), where nk is the number of customers at table k

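The seating rule above can be sketched as a short simulation (a minimal illustration; `chinese_restaurant_process` is an assumed name, and the fixed seed is an arbitrary choice):

```python
import random

def chinese_restaurant_process(alpha0, n_customers, rng=None):
    # Returns the table occupancy counts after seating n_customers by the CRP
    # rule: customer n opens a new table w.p. alpha0 / (alpha0 + n - 1), or
    # joins table k w.p. n_k / (alpha0 + n - 1).
    rng = rng or random.Random(0)
    tables = []  # tables[k] = number of customers at table k
    for n in range(1, n_customers + 1):
        r = rng.random() * (alpha0 + n - 1)
        if r < alpha0:
            tables.append(1)              # open a new table
        else:
            r -= alpha0
            for k, nk in enumerate(tables):
                if r < nk:                # join table k proportionally to n_k
                    tables[k] += 1
                    break
                r -= nk
    return tables

sizes = chinese_restaurant_process(alpha0=1.0, n_customers=100)
```

The occupancy counts always sum to the number of customers, and large tables attract further customers, producing the power-law-like cluster sizes characteristic of the CRP.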

Page 77: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Chinese Restaurant Process

θ1,θ2, . . . ,θN induces to φ1, . . . ,φK

Chinese Restaurant Process interpretationI There is a Chinese Restaurant which has infinite tablesI Each customer sits at a table

Generating from the Chinese Restaurant ProcessI First customer sits at the first tableI n-th customer sits at

F A new table with probability α0α0+n−1

F Table k with probability nkα0+n−1 ,

where nk is the number of customers at table k

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 41 / 121

Page 78: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Chinese Restaurant Process

θ1,θ2, . . . ,θN induces to φ1, . . . ,φK

Chinese Restaurant Process interpretationI There is a Chinese Restaurant which has infinite tablesI Each customer sits at a table

Generating from the Chinese Restaurant ProcessI First customer sits at the first tableI n-th customer sits at

F A new table with probability α0α0+n−1

F Table k with probability nkα0+n−1 ,

where nk is the number of customers at table k

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 41 / 121

Page 79: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Chinese Restaurant Process

θ1,θ2, . . . ,θN induces to φ1, . . . ,φK

Chinese Restaurant Process interpretationI There is a Chinese Restaurant which has infinite tablesI Each customer sits at a table

Generating from the Chinese Restaurant ProcessI First customer sits at the first tableI n-th customer sits at

F A new table with probability α0α0+n−1

F Table k with probability nkα0+n−1 ,

where nk is the number of customers at table k

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 41 / 121

Page 80: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Chinese Restaurant Process

The CRP exhibits the clustering property of the DP
  - Tables are clusters, φk ∼ G0
  - Customers are the actual realizations, θn = φ_{zn} where zn ∈ {1, . . . , K}


Stick Breaking Construction

Blackwell-MacQueen Urn Scheme / CRP generates θ ∼ G, not G itself

To construct G, we use the Stick Breaking Construction

Review) Posterior Dirichlet Processes

θ1|G ∼ G, G ∼ DP(α0,G0)
⇐⇒ θ1 ∼ G0, G|θ1 ∼ DP(α0 + 1, (α0G0 + δθ1)/(α0 + 1))

Consider a partition ({θ1}, Ω\{θ1}) of Ω. Then

(G({θ1}), G(Ω\{θ1})) ∼ Dir( (α0G0 + δθ1)({θ1}), (α0G0 + δθ1)(Ω\{θ1}) )
= Dir(1, α0) = Beta(1, α0)


Stick Breaking Construction

Consider a partition ({θ1}, Ω\{θ1}) of Ω. Then

(G({θ1}), G(Ω\{θ1})) = (β1, 1 − β1), β1 ∼ Beta(1, α0)

G has a point mass located at θ1:

G = β1 δθ1 + (1 − β1) G′, β1 ∼ Beta(1, α0)

where G′ is the probability measure with the point mass θ1 removed

What is G′?


Stick Breaking Construction
Summary) Posterior Dirichlet Processes

θ1|G ∼ G, G ∼ DP(α0,G0)
⇐⇒ θ1 ∼ G0, G|θ1 ∼ DP(α0 + 1, (α0G0 + δθ1)/(α0 + 1))

G = β1 δθ1 + (1 − β1) G′, β1 ∼ Beta(1, α0)

Consider a further partition ({θ1}, A1, . . . , AR) of Ω:

(G({θ1}), G(A1), . . . , G(AR)) = (β1, (1 − β1)G′(A1), . . . , (1 − β1)G′(AR))
∼ Dir(1, α0G0(A1), . . . , α0G0(AR))

Using the decimative property of the Dirichlet distribution (proof):

(G′(A1), . . . , G′(AR)) ∼ Dir(α0G0(A1), . . . , α0G0(AR))

G′ ∼ DP(α0,G0)


Stick Breaking Construction

Do this repeatedly with distinct values, φ1, φ2, . . .

G ∼ DP(α0,G0)

G = β1 δφ1 + (1 − β1) G′1
G = β1 δφ1 + (1 − β1)(β2 δφ2 + (1 − β2) G′2)
...

G = ∑_{k=1}^∞ πk δφk

where

πk = βk ∏_{i=1}^{k−1} (1 − βi),  ∑_{k=1}^∞ πk = 1,  βk ∼ Beta(1, α0),  φk ∼ G0

Draws from the DP look like a sum of point masses, with masses drawn from a stick-breaking construction.
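The construction can be sketched with a finite truncation. Assumptions (not from the slides): truncation level K, with the last weight absorbing the remaining stick so the weights sum to 1, and a standard-normal base measure G0 chosen purely for illustration:

```python
import random

def stick_breaking_weights(alpha0, K, seed=0):
    """Truncated stick-breaking: pi_k = beta_k * prod_{i<k}(1 - beta_i),
    beta_k ~ Beta(1, alpha0); the final weight is the leftover stick."""
    rng = random.Random(seed)
    remaining = 1.0
    weights = []
    for _ in range(K - 1):
        beta_k = rng.betavariate(1.0, alpha0)
        weights.append(beta_k * remaining)
        remaining *= 1.0 - beta_k
    weights.append(remaining)   # mass left on the stick
    return weights

def draw_G(alpha0, K, base_draw, seed=0):
    """Draw a truncated G ~ DP(alpha0, G0): atoms phi_k ~ G0, weights pi_k."""
    rng = random.Random(seed)
    pi = stick_breaking_weights(alpha0, K, seed)
    atoms = [base_draw(rng) for _ in range(K)]
    return list(zip(pi, atoms))

# illustrative base measure: G0 = N(0, 1)
G = draw_G(alpha0=1.0, K=50, base_draw=lambda rng: rng.gauss(0.0, 1.0))
print(abs(sum(w for w, _ in G) - 1.0) < 1e-9)  # weights sum to one
```

Smaller α0 concentrates the mass on the first few atoms; larger α0 spreads it over many atoms.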


Stick Breaking Construction
Summary)

G = ∑_{k=1}^∞ πk δφk

πk = βk ∏_{i=1}^{k−1} (1 − βi),  ∑_{k=1}^∞ πk = 1,  βk ∼ Beta(1, α0),  φk ∼ G0


Summary of DP
Definition
  - G is a random probability measure over (Ω,B)

G ∼ DP(α0,G0)

if for any finite measurable partition (A1, . . . , Ar) of Ω

(G(A1), . . . , G(Ar)) ∼ Dir(α0G0(A1), . . . , α0G0(Ar))

Chinese Restaurant Process

Stick Breaking Construction


Outline
1 Introduction
    Motivation
    Topic Modeling
2 Background
    Dirichlet Distribution
    Dirichlet Processes
3 Hierarchical Dirichlet Processes
    Dirichlet Process Mixture Models
    Hierarchical Dirichlet Processes
4 Inference
    Gibbs Sampling
    Variational Inference
    Online Learning
    Distributed Online Learning
5 Practical Tips
6 Summary


Dirichlet Process Mixture Models

We model a data set x1, . . . , xN using the following model [Nea00]

xn ∼ F(θn)
θn ∼ G
G ∼ DP(α0,G0)

Each θn is a latent parameter modelling xn, while G is the unknown distribution over parameters modelled using a DP


Dirichlet Process Mixture Models

Since G is of the form

G = ∑_{k=1}^∞ πk δφk

we have θn = φk with probability πk.

Let kn take on value k with probability πk. We can equivalently define θn = φ_{kn}.

An equivalent model:

xn ∼ F(θn), θn ∼ G, G ∼ DP(α0,G0)

⇐⇒

xn ∼ F(φ_{kn}), p(kn = k) = πk,
πk = βk ∏_{i=1}^{k−1} (1 − βi), βk ∼ Beta(1, α0), φk ∼ G0
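The equivalent model suggests a direct way to generate synthetic data. A truncated sketch, assuming 1-D Gaussian components (base measure G0 = N(0, 10²) over component means, likelihood F(θ) = N(θ, 1)) purely for illustration:

```python
import random

def sample_dp_mixture(n, alpha0, K=100, seed=0):
    """Generate n points from a (truncated) DP mixture of 1-D Gaussians.

    Illustrative assumptions: G0 = N(0, 10^2) over component means,
    component likelihood F(theta) = N(theta, 1), truncation level K."""
    rng = random.Random(seed)
    # stick-breaking weights pi_k and atoms phi_k ~ G0
    pi, remaining = [], 1.0
    for _ in range(K - 1):
        b = rng.betavariate(1.0, alpha0)
        pi.append(b * remaining)
        remaining *= 1.0 - b
    pi.append(remaining)
    phi = [rng.gauss(0.0, 10.0) for _ in range(K)]
    data, labels = [], []
    for _ in range(n):
        k = rng.choices(range(K), weights=pi)[0]  # k_n with p(k_n = k) = pi_k
        data.append(rng.gauss(phi[k], 1.0))       # x_n ~ F(phi_{k_n})
        labels.append(k)
    return data, labels

xs, zs = sample_dp_mixture(200, alpha0=2.0)
print(len(xs), len(set(zs)))  # 200 points; typically far fewer than K clusters
```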


Outline
1 Introduction
    Motivation
    Topic Modeling
2 Background
    Dirichlet Distribution
    Dirichlet Processes
3 Hierarchical Dirichlet Processes
    Dirichlet Process Mixture Models
    Hierarchical Dirichlet Processes
4 Inference
    Gibbs Sampling
    Variational Inference
    Online Learning
    Distributed Online Learning
5 Practical Tips
6 Summary


Topic modeling with documents

Each document consists of bags of words
Each word in a document has a latent topic index
Latent topics for words in a document can be grouped
Each document has a topic proportion
Each topic has a word distribution
Topics must be shared across documents


Problem of Naive Dirichlet Process Mixture Model

Use a DP mixture for each document

xdn ∼ F(θdn), θdn ∼ Gd, Gd ∼ DP(α0,G0)

But there is no sharing of clusters across different groups because G0 is smooth

G1 = ∑_{k=1}^∞ π1k δ_{φ1k},  G2 = ∑_{k=1}^∞ π2k δ_{φ2k}

φ1k, φ2k ∼ G0


Problem of Naive Dirichlet Process Mixture Model

Solution
  - Make the base distribution G0 discrete
  - Put a DP prior on the common base distribution

Hierarchical Dirichlet Process

G0 ∼ DP(γ,H)

G1,G2|G0 ∼ DP(α0,G0)


Hierarchical Dirichlet Processes

Making G0 discrete forces G1 and G2 to share clusters


Stick Breaking Construction
A Hierarchical Dirichlet Process with documents 1, . . . , D

G0 ∼ DP(γ,H)
Gd |G0 ∼ DP(α0,G0)

The stick-breaking construction for the HDP

G0 = ∑_{k=1}^∞ βk δφk,  φk ∼ H
βk = β′k ∏_{i=1}^{k−1} (1 − β′i),  β′k ∼ Beta(1, γ)

Gd = ∑_{k=1}^∞ πdk δφk
πdk = π′dk ∏_{i=1}^{k−1} (1 − π′di),  π′dk ∼ Beta(α0 βk, α0 (1 − ∑_{i=1}^k βi))


Chinese Restaurant Franchise

G0 ∼ DP(γ,H), φk ∼ H
Gd |G0 ∼ DP(α0,G0), θdn ∼ Gd

Draw θd1, θd2, . . . from a Blackwell-MacQueen Urn Scheme
θd1, θd2, . . . induce φd1, φd2, . . .

Draw θd′1, θd′2, . . . from a Blackwell-MacQueen Urn Scheme
θd′1, θd′2, . . . induce φd′1, φd′2, . . .


Chinese Restaurant Franchise

Chinese Restaurant Franchise interpretation
  - Each restaurant has infinitely many tables
  - All restaurants share a food menu
  - Each customer sits at a table

Generating from the Chinese Restaurant Franchise
For each restaurant
  - The first customer sits at the first table and chooses a new menu
  - The n-th customer sits at
      - a new table with probability α0 / (α0 + n − 1)
      - table t with probability ndt / (α0 + n − 1),
        where ndt is the number of customers at table t
  - A customer who opens a new table chooses
      - a new menu with probability γ / (γ + m − 1)
      - existing menu k with probability mk / (γ + m − 1),
        where m is the number of tables in all restaurants and mk is the number of tables serving menu k in all restaurants
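The two-level seating process can be simulated directly. A sketch with illustrative α0, γ, and document sizes; the franchise-level menu counts m_k are shared across all restaurants:

```python
import random

def crf_sample(doc_sizes, alpha0, gamma, seed=0):
    """Simulate the Chinese Restaurant Franchise.

    Each restaurant (document) seats customers at tables via CRP(alpha0);
    each newly opened table orders a dish (topic) from a franchise-level
    CRP(gamma) over menus shared across all restaurants."""
    rng = random.Random(seed)
    menu_counts = []      # m_k: tables serving dish k across all restaurants
    dish_of = []          # per restaurant: table index -> dish index
    for size in doc_sizes:
        table_counts, table_dish = [], []
        for n in range(1, size + 1):
            u = rng.random() * (alpha0 + n - 1)
            if u < alpha0:                       # open a new table ...
                m = sum(menu_counts)
                v = rng.random() * (gamma + m)
                if v < gamma:                    # ... with a brand-new dish
                    menu_counts.append(1)
                    dish = len(menu_counts) - 1
                else:                            # ... serving an existing dish
                    v -= gamma
                    dish = 0
                    while v >= menu_counts[dish]:
                        v -= menu_counts[dish]
                        dish += 1
                    menu_counts[dish] += 1
                table_counts.append(1)
                table_dish.append(dish)
            else:                                # join an occupied table
                u -= alpha0
                t = 0
                while u >= table_counts[t]:
                    u -= table_counts[t]
                    t += 1
                table_counts[t] += 1
        dish_of.append(table_dish)
    return menu_counts, dish_of

menus, tables_per_doc = crf_sample([50, 50, 50], alpha0=1.0, gamma=1.0)
print(len(menus))  # number of shared topics across the three documents
```

Because every new table draws its dish from the same franchise-level urn, topics are shared across documents, which is exactly what the naive per-document DP mixture failed to do.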


HDP for Topic modeling

Questions
  - What can we assume about the topics in a document?
  - What can we assume about the words in the topics?

Solution
  - Each document consists of bags of words
  - Each word in a document has a latent topic
  - Latent topics for words in a document can be grouped
  - Each document has a topic proportion
  - Each topic has a word distribution
  - Topics must be shared across documents


Outline
1 Introduction
    Motivation
    Topic Modeling
2 Background
    Dirichlet Distribution
    Dirichlet Processes
3 Hierarchical Dirichlet Processes
    Dirichlet Process Mixture Models
    Hierarchical Dirichlet Processes
4 Inference
    Gibbs Sampling
    Variational Inference
    Online Learning
    Distributed Online Learning
5 Practical Tips
6 Summary


Gibbs Sampling

Definition

A special case of the Markov chain Monte Carlo (MCMC) family of methods

An iterative algorithm that constructs a dependent sequence of parameter values whose distribution converges to the target joint posterior distribution [Hof09]

Algorithm

Find the full conditional distributions of the latent variables of the target distribution

Initialize all latent variables
Sample until converged
  - Sample one latent variable at a time from its full conditional distribution


Collapsed Gibbs sampling

A collapsed Gibbs sampler integrates out one or more variables when sampling some other variable.
Example)

There are three latent variables A,B and C.

Sampling p(A|B,C), p(B|A,C) and p(C|A,B) sequentially

But when we integrate out B,

Sampling only p(A|C), p(C|A) sequentially
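The A, B, C example can be made concrete. A sketch of a collapsed Gibbs sampler for a 2-component Bernoulli mixture, where the mixture weights and component biases are integrated out by conjugacy (Dirichlet-multinomial and Beta-Bernoulli); hyperparameters alpha, a, b are illustrative assumptions:

```python
import random

def collapsed_gibbs_mixture(x, iters=200, alpha=1.0, a=1.0, b=1.0, seed=0):
    """Collapsed Gibbs for a 2-component Bernoulli mixture.

    Weights and component biases are integrated out, so we only sample
    the assignments z_i from p(z_i | z_{-i}, x)."""
    rng = random.Random(seed)
    z = [rng.randrange(2) for _ in x]
    n = [z.count(0), z.count(1)]                    # cluster sizes
    heads = [sum(xi for xi, zi in zip(x, z) if zi == k) for k in range(2)]
    for _ in range(iters):
        for i, xi in enumerate(x):
            k_old = z[i]                            # remove x_i from counts
            n[k_old] -= 1
            heads[k_old] -= xi
            # p(z_i = k | rest) ∝ (n_k + alpha) * predictive(x_i | cluster k)
            probs = []
            for k in range(2):
                if xi == 1:
                    pred = (heads[k] + a) / (n[k] + a + b)
                else:
                    pred = (n[k] - heads[k] + b) / (n[k] + a + b)
                probs.append((n[k] + alpha) * pred)
            k_new = 0 if rng.random() * sum(probs) < probs[0] else 1
            z[i] = k_new                            # add x_i back
            n[k_new] += 1
            heads[k_new] += xi
    return z

data = [1] * 10 + [0] * 10
z = collapsed_gibbs_mixture(data)
print(len(set(z)) <= 2)
```

Collapsing the conjugate parameters typically mixes faster than sampling them explicitly, at the cost of coupling the assignment updates.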


Review) Dirichlet Process Mixture Models

xn ∼ F(θn), θn ∼ G, G ∼ DP(α0,G0)

⇐⇒

xn ∼ F(φ_{kn}), p(kn = k) = πk,
πk = βk ∏_{i=1}^{k−1} (1 − βi), βk ∼ Beta(1, α0), φk ∼ G0

Review) Blackwell-MacQueen Urn Scheme for DP

Nth sample

θN |θ1,...,N−1, G ∼ G
G|θ1,...,N−1 ∼ DP(α0 + N − 1, (α0G0 + ∑_{n=1}^{N−1} δθn)/(α0 + N − 1))

⇐⇒ θN |θ1,...,N−1 ∼ (α0G0 + ∑_{n=1}^{N−1} δθn)/(α0 + N − 1)

G|θ1,...,N ∼ DP(α0 + N, (α0G0 + ∑_{n=1}^N δθn)/(α0 + N))


Review) Chinese Restaurant Franchise
Generating from the Chinese Restaurant Franchise

For each restaurant
  - The first customer sits at the first table and chooses a new menu
  - The n-th customer sits at
      - a new table with probability α0 / (α0 + n − 1)
      - table t with probability ndt / (α0 + n − 1),
        where ndt is the number of customers at table t
  - A customer who opens a new table chooses
      - a new menu with probability γ / (γ + m − 1)
      - existing menu k with probability mk / (γ + m − 1),
        where m is the number of tables in all restaurants and mk is the number of tables serving menu k in all restaurants


Alternative form of HDP

G0 ∼ DP(γ,H), φdt ∼ G0

∴ G0 |φdt, . . . ∼ DP(γ + m, (γH + ∑_{k=1}^K mk δφk)/(γ + m))

Then G0 is given as

G0 = ∑_{k=1}^K βk δφk + βu Gu

where

Gu ∼ DP(γ,H)
π = (π1, . . . , πK, πu) ∼ Dir(m1, . . . , mK, γ)
p(φk |·) ∝ h(φk) ∏_{dn:zdn=k} f(xdn|φk)


Hierarchical Dirichlet Processes

xdn ∼ F(θdn), θdn ∼ Gd, Gd ∼ DP(α0,G0), G0 ∼ DP(γ,H)

⇐⇒

xdn ∼ Mult(φ_{zdn}), zdn ∼ Mult(θd),
φk ∼ Dir(η), θd ∼ Dir(α0π), π ∼ Dir(m.1, . . . , m.K, γ)


Gibbs Sampling for HDP
Joint distribution

p(θ, z, φ, x, π, m | α0, η, γ) = p(π|m,γ) ∏_{k=1}^K p(φk|η) ∏_{d=1}^D [ p(θd|α0,π) ∏_{n=1}^N p(zdn|θd) p(xdn|zdn,φ) ]

Integrate out θ, φ:

p(z, x, π, m | α0, η, γ) =
  [ Γ(∑_{k=1}^K m.k + γ) / (∏_{k=1}^K Γ(m.k) · Γ(γ)) ] ∏_{k=1}^K πk^{m.k−1} · π_{K+1}^{γ−1}
  × ∏_{k=1}^K [ Γ(∑_{v=1}^V ηv) / ∏_{v=1}^V Γ(ηv) ] · [ ∏_{v=1}^V Γ(ηv + n^k_{(·),v}) / Γ(∑_{v=1}^V (ηv + n^k_{(·),v})) ]
  × ∏_{d=1}^M [ Γ(∑_{k=1}^K α0πk) / ∏_{k=1}^K Γ(α0πk) ] · [ ∏_{k=1}^K Γ(α0πk + n^k_{d,(·)}) / Γ(∑_{k=1}^K (α0πk + n^k_{d,(·)})) ]


Gibbs Sampling for HDP

Full conditional distribution of z

p(z(d′,n′) = k′ | z−(d′,n′), m, π, x, ·)
  = p(z(d′,n′) = k′, z−(d′,n′), m, π, x | ·) / p(z−(d′,n′), m, π, x | ·)
  ∝ p(z(d′,n′) = k′, z−(d′,n′), m, π, x | ·)
  ∝ (α0πk′ + n^{k′,−(d′,n′)}_{d′,(·)}) · (ηv′ + n^{k′,−(d′,n′)}_{(·),v′}) / (∑_{v=1}^V (ηv + n^{k′,−(d′,n′)}_{(·),v}))


Gibbs Sampling for HDP
Full conditional distribution of m

The probability that word xd′n′ is assigned to some table t such that kdt = k:

p(θd′n′ = φt | φdt = φk, θ−(d′,n′), π) ∝ n^{(·),−(d′,n′)}_{d,(·),t}
p(θd′n′ = new table | φdt_new = φk, θ−(d′,n′), π) ∝ α0πk

These equations form a Dirichlet process with concentration parameter α0πk and assignment of n^{(·),−(d′,n′)}_{d,(·),t} customers to components.
The corresponding distribution over the number of components is the desired conditional distribution of mdk.

Antoniak [Ant74] has shown that

p(md′k′ = m | z, m−(d′k′), π) = [ Γ(α0πk′) / Γ(α0πk′ + n^{k′}_{d,(·),(·)}) ] · s(n^{k′}_{d,(·),(·)}, m) · (α0πk′)^m

where s(n,m) is the unsigned Stirling number of the first kind
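The Antoniak result can be checked by simulation: the number of tables opened by a CRP with concentration α0πk′ and n customers is exactly a draw from this distribution, so one can sample it without evaluating Stirling numbers. A sketch (parameter values illustrative):

```python
import random

def sample_num_tables(n, concentration, seed=0):
    """Sample the number of tables a CRP(concentration) opens for n
    customers. Marginally this is a draw from the Antoniak distribution
    p(m) ∝ s(n, m) * concentration^m, obtained here by direct simulation."""
    rng = random.Random(seed)
    m = 0
    for i in range(n):
        # with i customers already seated, a new table opens
        # with probability concentration / (concentration + i)
        if rng.random() < concentration / (concentration + i):
            m += 1
    return m

draws = [sample_num_tables(100, 1.0, seed=s) for s in range(500)]
print(min(draws), max(draws))  # always between 1 and 100
```

This simulation trick is a common alternative to tabulating Stirling numbers, which overflow quickly for large n.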


Gibbs Sampling for HDP

Full conditional distribution of π

(π1,π2, . . . ,πK ,πu)|· ∼ Dir(m.1,m.2, . . . ,m.K ,γ)


Gibbs Sampling for HDP

Algorithm 1 Gibbs Sampling for HDP
 1: Initialize all latent variables at random
 2: repeat
 3:   for each document d do
 4:     for each word n in document d do
 5:       Sample z(d,n) ∼ Mult( (α0πk′ + n^{k′,−(d,n)}_{d,(·)}) · (ηv′ + n^{k′,−(d,n)}_{(·),v′}) / ∑_{v=1}^V (ηv + n^{k′,−(d,n)}_{(·),v}) )
 6:     end for
 7:     Sample m ∼ Mult( [ Γ(α0πk′) / Γ(α0πk′ + n^{k′}_{d,(·),(·)}) ] · s(n^{k′}_{d,(·),(·)}, m) · (α0πk′)^m )
 8:     Sample π ∼ Dir(m.1, m.2, . . . , m.K, γ)
 9:   end for
10: until converged


Outline
1 Introduction
    Motivation
    Topic Modeling
2 Background
    Dirichlet Distribution
    Dirichlet Processes
3 Hierarchical Dirichlet Processes
    Dirichlet Process Mixture Models
    Hierarchical Dirichlet Processes
4 Inference
    Gibbs Sampling
    Variational Inference
    Online Learning
    Distributed Online Learning
5 Practical Tips
6 Summary



Alternative Stick Breaking Construction
Problem)

In the original stick-breaking construction, the weights βk and πdk are tightly correlated:

βk = β′k ∏_{i=1}^{k−1} (1 − β′i),  β′k ∼ Beta(1, γ)
πdk = π′dk ∏_{i=1}^{k−1} (1 − π′di),  π′dk ∼ Beta(α0βk, α0(1 − ∑_{i=1}^k βi))

Alternative Stick Breaking Construction for each document [FSJW08]:

ψdt ∼ G0
πdt = π′dt ∏_{i=1}^{t−1} (1 − π′di),  π′dt ∼ Beta(1, α0)
Gd = ∑_{t=1}^∞ πdt δψdt

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 80 / 121

Page 139: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Alternative Stick Breaking Construction

The stick-breaking construction for the HDP

G0 = ∑_{k=1}^{∞} βk δφk ,  φk ∼ H
βk = β′k ∏_{i=1}^{k−1} (1−β′i) ,  β′k ∼ Beta(1,γ)

Gd = ∑_{t=1}^{∞} πdt δψdt ,  ψdt ∼ G0
πdt = π′dt ∏_{i=1}^{t−1} (1−π′di) ,  π′dt ∼ Beta(1,α0)

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 81 / 121

Page 140: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Alternative Stick Breaking Construction

The stick-breaking construction for the HDP

G0 = ∑_{k=1}^{∞} βk δφk ,  φk ∼ H
βk = β′k ∏_{i=1}^{k−1} (1−β′i) ,  β′k ∼ Beta(1,γ)

Gd = ∑_{t=1}^{∞} πdt δψdt ,  ψdt ∼ G0
πdt = π′dt ∏_{i=1}^{t−1} (1−π′di) ,  π′dt ∼ Beta(1,α0)

To connect ψdt and φk
We add an auxiliary variable cdt ∼ Mult(β)
Then ψdt = φcdt

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 82 / 121

Page 141: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Alternative Stick Breaking Construction

Generative process
1 For each global-level topic k ∈ 1, . . . ,∞:
   1 Draw topic word proportions φk ∼ Dir(η)
   2 Draw a corpus breaking proportion β′k ∼ Beta(1,γ)
2 For each document d ∈ 1, . . . ,D:
   1 For each document-level topic t ∈ 1, . . . ,∞:
      1 Draw a document-level topic index cdt ∼ Mult(σ(β′))
      2 Draw a document breaking proportion π′dt ∼ Beta(1,α0)
   2 For each word n ∈ 1, . . . ,N:
      1 Draw a topic index zdn ∼ Mult(σ(π′d))
      2 Generate a word wdn ∼ Mult(φ_{c_{d,zdn}})
3 where σ(β′) ≡ (β1,β2, . . .), βk = β′k ∏_{i=1}^{k−1} (1−β′i)

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 83 / 121
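The generative process above can be sketched with finite truncations K (global topics) and T (document-level topics); the truncations and the renormalization of the truncated stick weights are illustration-only assumptions.

```python
import numpy as np

def sbc(weights_prime):
    """σ(·): map stick-breaking proportions to mixture weights."""
    sticks = np.concatenate(([1.0], np.cumprod(1.0 - weights_prime)[:-1]))
    w = weights_prime * sticks
    return w / w.sum()  # renormalize the truncated weights so they sum to 1

def generate_corpus(D, N, K, T, V, gamma, alpha0, eta, rng):
    phi = rng.dirichlet(np.full(V, eta), size=K)         # φ_k ~ Dir(η)
    beta = sbc(rng.beta(1.0, gamma, size=K))             # corpus weights σ(β')
    docs = []
    for _ in range(D):
        c = rng.choice(K, size=T, p=beta)                # c_dt ~ Mult(σ(β'))
        pi = sbc(rng.beta(1.0, alpha0, size=T))          # document weights σ(π'_d)
        z = rng.choice(T, size=N, p=pi)                  # z_dn ~ Mult(σ(π'_d))
        words = [rng.choice(V, p=phi[c[t]]) for t in z]  # w_dn ~ Mult(φ_{c_{d,z_dn}})
        docs.append(words)
    return docs

rng = np.random.default_rng(1)
corpus = generate_corpus(D=3, N=10, K=8, T=4, V=20,
                         gamma=1.0, alpha0=1.0, eta=0.1, rng=rng)
```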

Page 142: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Variational Inference

Main idea [JGJS98]
I Modify the original graphical model into a simpler model
I Minimize the dissimilarity between the original and the modified one

More formally
I Observed data X, latent variable Z
I We want to compute p(Z|X)
I Introduce q(Z)
I Minimize the dissimilarity between p and q (2)

2 Commonly the KL-divergence of p from q, DKL(q||p)
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 84 / 121


Page 144: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

KL-divergence of p from q
Find a lower bound of the log evidence logp(X)

logp(X) = log ∑_Z p(Z,X) = log ∑_Z p(Z,X) · q(Z|X)/q(Z|X)
        = log ∑_Z q(Z|X) · p(Z,X)/q(Z|X)
        ≥ ∑_Z q(Z|X) log [ p(Z,X)/q(Z|X) ]  (3)

Gap between logp(X) and its lower bound

logp(X) − ∑_Z q(Z|X) log [ p(Z,X)/q(Z|X) ] = ∑_Z q(Z) log [ q(Z)/p(Z|X) ] = DKL(q||p)

3 By Jensen's inequality
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 85 / 121


Page 146: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

KL-divergence of p from q

logp(X) = ∑_Z q(Z|X) log [ p(Z,X)/q(Z|X) ] + DKL(q||p)

The log evidence logp(X) is fixed with respect to q

Minimizing DKL(q||p) ≡ Maximizing the lower bound of logp(X)

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 86 / 121
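This identity (log evidence = lower bound + KL gap, for any q) can be checked numerically on a tiny discrete model; the joint table below is an arbitrary illustration.

```python
import numpy as np

# Joint p(Z, X) over 3 latent states for one observed X (arbitrary example values)
p_zx = np.array([0.10, 0.25, 0.15])       # p(Z=z, X=x) for the observed x
p_x = p_zx.sum()                          # evidence p(X)
p_z_given_x = p_zx / p_x                  # posterior p(Z|X)

q = np.array([0.5, 0.3, 0.2])             # any variational distribution q(Z)

elbo = np.sum(q * np.log(p_zx / q))       # Σ_Z q(Z) log[p(Z,X)/q(Z)]
kl = np.sum(q * np.log(q / p_z_given_x))  # DKL(q||p)

# log p(X) = lower bound + KL gap, so minimizing KL maximizes the bound
assert np.isclose(np.log(p_x), elbo + kl)
```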

Page 147: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Variational Inference

Main idea [JGJS98]
I Modify the original graphical model into a simpler model
I Minimize the dissimilarity between the original and the modified one

More formally
I Observed data X, latent variable Z
I We want to compute p(Z|X)
I Introduce q(Z)
I Minimize the dissimilarity between p and q (4)
  F Find a lower bound of logp(X)
  F Maximize it

4 Commonly the KL-divergence of p from q, DKL(q||p)
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 87 / 121

Page 148: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Variational Inference for HDP

q(β ,φ ,π,c,z) = ∏_{k=1}^{K} q(φk|λk) ∏_{k=1}^{K−1} q(βk|a1k,a2k) ∏_{d=1}^{D} [ ∏_{t=1}^{T} q(cdt|ζdt) ∏_{t=1}^{T−1} q(πdt|γ1dt,γ2dt) ∏_{n=1}^{N} q(zdn|ϕdn) ]

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 88 / 121

Page 149: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Variational Inference for HDP
Find a lower bound of logp(w|α0,γ,η)

lnp(w|α0,γ,η)
= ln ∫β ∫φ ∫π ∑c ∑z p(w,β,φ,π,c,z|α0,γ,η) dβ dφ dπ
= ln ∫β ∫φ ∫π ∑c ∑z p(w,β,φ,π,c,z|α0,γ,η) · q(β,φ,π,c,z)/q(β,φ,π,c,z) dβ dφ dπ
≥ ∫β ∫φ ∫π ∑c ∑z ln [ p(w,β,φ,π,c,z|α0,γ,η)/q(β,φ,π,c,z) ] · q(β,φ,π,c,z) dβ dφ dπ  (∵ Jensen's inequality)
= ∫β ∫φ ∫π ∑c ∑z lnp(w,β,φ,π,c,z|α0,γ,η) · q(β,φ,π,c,z) dβ dφ dπ
  − ∫β ∫φ ∫π ∑c ∑z lnq(β,φ,π,c,z) · q(β,φ,π,c,z) dβ dφ dπ
= Eq[lnp(w,β,φ,π,c,z|α0,γ,η)] − Eq[lnq(β,φ,π,c,z)]

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 89 / 121

Page 150: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Variational Inference for HDP

lnp(w|α0,γ,η)
≥ Eq[lnp(w,β,φ,π,c,z|α0,γ,η)] − Eq[lnq(β,φ,π,c,z)]
= Eq[ln p(β|γ) p(φ|η) ∏_{d=1}^{D} p(πd|α0) p(cd|β) ∏_{n=1}^{N} p(wdn|cd,zdn,φ) p(zdn|πd)]
  − Eq[ln ∏_{k=1}^{K} q(φk|λk) ∏_{k=1}^{K−1} q(βk|a1k,a2k) ∏_{d=1}^{D} ∏_{t=1}^{T} q(cdt|ζdt) ∏_{t=1}^{T−1} q(πdt|γ1dt,γ2dt) ∏_{n=1}^{N} q(zdn|ϕdn)]
= ∑_{d=1}^{D} { Eq[lnp(πd|α0)] + Eq[lnp(cd|β)] + Eq[lnp(wd|cd,zd,φ)] + Eq[lnp(zd|πd)]
  − Eq[lnq(cd|ζd)] − Eq[lnq(πd|γ1d,γ2d)] − Eq[lnq(zd|ϕd)] }
  + Eq[lnp(β|γ)] + Eq[lnp(φ|η)] − Eq[lnq(φ|λ)] − Eq[lnq(β|a1,a2)]

We can run variational EM to maximize this lower bound of logp(w|α0,γ,η)

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 90 / 121

Page 151: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Variational Inference for HDP
Maximize the lower bound of logp(w|α0,γ,η)
Take the derivative of it with respect to each variational parameter:

γ1dt = 1 + ∑_{n=1}^{N} ϕdnt ,  γ2dt = α0 + ∑_{n=1}^{N} ∑_{b=t+1}^{T} ϕdnb

ζdtk = exp{ ∑_{e=1}^{k−1} (Ψ(a2e) − Ψ(a1e + a2e)) + (Ψ(a1k) − Ψ(a1k + a2k))
  + ∑_{n=1}^{N} ∑_{v=1}^{V} w^v_dn ϕdnt (Ψ(λkv) − Ψ(∑_{l=1}^{V} λkl)) }

ϕdnt = exp{ ∑_{h=1}^{t−1} (Ψ(γ2dh) − Ψ(γ1dh + γ2dh)) + (Ψ(γ1dt) − Ψ(γ1dt + γ2dt))
  + ∑_{k=1}^{K} ∑_{v=1}^{V} w^v_dn ζdtk (Ψ(λkv) − Ψ(∑_{l=1}^{V} λkl)) }

a1k = 1 + ∑_{d=1}^{D} ∑_{t=1}^{T} ζdtk ,  a2k = γ + ∑_{d=1}^{D} ∑_{t=1}^{T} ∑_{f=k+1}^{K} ζdtf

λkv = ηv + ∑_{d=1}^{D} ∑_{n=1}^{N} ∑_{t=1}^{T} w^v_dn ϕdnt ζdtk

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 91 / 121
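The two closed-form updates for γ1dt and γ2dt can be sketched in numpy (ϕ is a given N×T word-level responsibility matrix; all names are illustrative):

```python
import numpy as np

def update_gamma(phi, alpha0):
    """Update γ1_dt and γ2_dt from word-level responsibilities ϕ (N x T)."""
    gamma1 = 1.0 + phi.sum(axis=0)                   # 1 + Σ_n ϕ_dnt
    # α0 + Σ_n Σ_{b>t} ϕ_dnb: mass that words assign to sticks after t
    tail = np.cumsum(phi[:, ::-1], axis=1)[:, ::-1]  # Σ_{b>=t} ϕ_dnb per word
    gamma2 = alpha0 + (tail - phi).sum(axis=0)       # drop b = t to get b > t
    return gamma1, gamma2

phi = np.array([[0.7, 0.2, 0.1],
                [0.1, 0.6, 0.3]])  # N=2 words, T=3 document-level topics
g1, g2 = update_gamma(phi, alpha0=1.0)
```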

Page 152: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Variational Inference for HDP
Maximize the lower bound of logp(w|α0,γ,η)

Take the derivative of it with respect to each variational parameter
Run variational EM
I E step: compute document-level parameters γ1dt, γ2dt, ζdtk, ϕdnt
I M step: compute corpus-level parameters a1k, a2k, λkv

Algorithm 2 Variational Inference for HDP
1: Initialize the variational parameters
2: repeat
3:   for each document d do
4:     repeat
5:       Compute document parameters γ1dt, γ2dt, ζdtk, ϕdnt
6:     until converged
7:   end for
8:   Compute topic parameters a1k, a2k, λkv
9: until converged

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 92 / 121

Page 153: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Outline
1 Introduction
   Motivation
   Topic Modeling
2 Background
   Dirichlet Distribution
   Dirichlet Processes
3 Hierarchical Dirichlet Processes
   Dirichlet Process Mixture Models
   Hierarchical Dirichlet Processes
4 Inference
   Gibbs Sampling
   Variational Inference
   Online Learning
   Distributed Online Learning
5 Practical Tips
6 Summary

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 93 / 121

Page 154: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Online Variational Inference

Stochastic optimization of the variational objective [WPB11]
I Subsample the documents
I Compute an approximation of the gradient based on the subsample
I Follow that gradient with a decreasing step-size

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 94 / 121

Page 155: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Variational Inference for HDP

Lower bound of logp(w|α0,γ,η)

lnp(w|α0,γ,η)
≥ ∑_{d=1}^{D} { Eq[lnp(πd|α0)] + Eq[lnp(cd|β)] + Eq[lnp(wd|cd,zd,φ)] + Eq[lnp(zd|πd)]
  − Eq[lnq(cd|ζd)] − Eq[lnq(πd|γ1d,γ2d)] − Eq[lnq(zd|ϕd)] }
  + Eq[lnp(β|γ)] + Eq[lnp(φ|η)] − Eq[lnq(φ|λ)] − Eq[lnq(β|a1,a2)]
= ∑_{d=1}^{D} Ld + Lk
= Ej[ D Lj + Lk ]  (j: a uniformly sampled document index)

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 95 / 121

Page 156: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Online Variational Inference for HDP

Lower bound of logp(w|α0,γ,η) = Ej[ D Lj + Lk ]

Online learning algorithm for HDP
I Sample a document d
I Compute its optimal document-level parameters γ1dt, γ2dt, ζdtk, ϕdnt
I Take the gradient (5) of the corpus-level parameters a1k, a2k, λkv (a noisy estimate)
I Update the corpus-level parameters a1k, a2k, λkv with a decreasing learning rate

a1k ← (1−ρe) a1k + ρe (1 + D ∑_{t=1}^{T} ζdtk)
a2k ← (1−ρe) a2k + ρe (γ + D ∑_{t=1}^{T} ∑_{f=k+1}^{K} ζdtf)
λkv ← (1−ρe) λkv + ρe (ηv + D ∑_{n=1}^{N} ∑_{t=1}^{T} w^v_dn ϕdnt ζdtk)

where ρe is a learning rate satisfying ∑_{e=1}^{∞} ρe = ∞ and ∑_{e=1}^{∞} ρe² < ∞

5 The natural gradient is structurally equivalent to the variational inference update
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 96 / 121
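The corpus-level update for λ can be sketched as a single stochastic step (the per-document sufficient statistic Σ_n Σ_t w^v_dn ϕ_dnt ζ_dtk is assumed precomputed; names are illustrative):

```python
import numpy as np

def online_update_lambda(lam, suff_stat, eta, D, rho):
    """One stochastic update: λ ← (1-ρ)λ + ρ(η + D * per-doc sufficient stats).

    lam       : (K, V) current topic-word variational parameters
    suff_stat : (K, V) statistics from the single sampled document, scaled by D
                to form a noisy estimate of the full-corpus gradient
    """
    return (1.0 - rho) * lam + rho * (eta + D * suff_stat)

K, V, D = 2, 4, 1000
lam = np.ones((K, V))
stat = np.zeros((K, V))
stat[0, 1] = 0.9   # fabricated per-document statistics for illustration
stat[1, 2] = 0.1
lam = online_update_lambda(lam, stat, eta=0.01, D=D, rho=0.1)
```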

Page 157: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Online Variational Inference for HDP

Algorithm 3 Online Variational Inference for HDP
 1: Initialize the variational parameters
 2: e = 0
 3: for each document d ∈ 1, . . . ,D do
 4:   repeat
 5:     Compute document parameters γ1dt, γ2dt, ζdtk, ϕdnt
 6:   until converged
 7:   e = e + 1
 8:   Compute learning rate ρe = (τ0 + e)^{−κ} where τ0 > 0, κ ∈ (0.5,1]
 9:   Update topic parameters a1k, a2k, λkv
10: end for

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 97 / 121

Page 158: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Outline
1 Introduction
   Motivation
   Topic Modeling
2 Background
   Dirichlet Distribution
   Dirichlet Processes
3 Hierarchical Dirichlet Processes
   Dirichlet Process Mixture Models
   Hierarchical Dirichlet Processes
4 Inference
   Gibbs Sampling
   Variational Inference
   Online Learning
   Distributed Online Learning
5 Practical Tips
6 Summary

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 98 / 121

Page 159: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Motivation

Problem 1: Inference for HDP takes a long time
Problem 2: A continuously expanding corpus necessitates continuous updates of the model parameters
I But updating the model parameters is not possible with plain HDP
I Must re-train with the entire updated corpus

Our approach: combine distributed inference and online learning

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 99 / 121

Page 160: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Distributed Online HDP

Based on variational inference

Mini-batch updates via stochastic learning (variational EM)

Distribute variational EM using MapReduce

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 100 / 121

Page 161: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Distributed Online HDP

Algorithm 4 Distributed Online HDP - Driver
1: Initialize the variational parameters
2: e = 0
3: while run forever do
4:   Collect new documents s ∈ 1, . . . ,S
5:   e = e + 1
6:   Compute learning rate ρe = (τ0 + e)^{−κ} where τ0 > 0, κ ∈ (0.5,1]
7:   Run MapReduce job
8:   Get the result of the job and update the topic parameters
9: end while

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 101 / 121

Page 162: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Distributed Online HDP

Algorithm 5 Distributed Online HDP - Mapper
1: Mapper gets one document s ∈ 1, . . . ,S
2: repeat
3:   Compute document parameters γ1dt, γ2dt, ζdtk, ϕdnt
4: until converged
5: Output the sufficient statistics for the topic parameters

Algorithm 6 Distributed Online HDP - Reducer
1: Reducer gets the sufficient statistics for each topic parameter
2: Compute the changes of the topic parameters from the sufficient statistics
3: Output the changes of the topic parameters

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 102 / 121
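The mapper/reducer pair can be mimicked in plain Python. This is a toy simulation of the aggregation pattern only, not the actual Hadoop job; the per-document statistics are fabricated stand-ins.

```python
import numpy as np

def mapper(doc_stats):
    """Emit (topic id, sufficient-statistic row) pairs for one document."""
    for k, stat in enumerate(doc_stats):
        yield k, stat

def reducer(pairs, K, V):
    """Sum the sufficient statistics per topic across all mapped documents."""
    totals = np.zeros((K, V))
    for k, stat in pairs:
        totals[k] += stat
    return totals

K, V = 2, 3
docs = [np.ones((K, V)), 2 * np.ones((K, V))]   # stand-in per-document statistics
pairs = [p for d in docs for p in mapper(d)]    # the "shuffle" step, flattened
totals = reducer(pairs, K, V)                   # driver applies these to the topics
```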

Page 163: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Experimental Setup

Data: 973,266 Twitter conversations, 7.54 tweets per conversation

Approximately 7,297,000 tweets

60-node Hadoop cluster

Each node with 8 × 2.30GHz cores

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 103 / 121

Page 164: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Result
Distributed Online HDP runs faster than online HDP

Distributed Online HDP preserves the quality of the result (perplexity)

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 104 / 121

Page 165: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Practical Tips

Until now, I talked about Bayesian Nonparametric Topic Modeling
I Concept of Hierarchical Dirichlet Processes
I How to infer the latent variables in HDP

These are of theoretical interest

Someone who attended the last machine learning winter school said:
"Wow! There are good and interesting machine learning topics! But I want to know about practical issues, because I am in the industrial field."

So I prepared some tips for him/her and you

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 105 / 121


Page 168: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Implementation

https://github.com/NoSyu/Topic_Models

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 106 / 121

Page 169: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Some tips for using topic models

How to manage hyper-parameters (Dirichlet parameters)?

How to manage learning rate and mini-batch size in online learning?

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 107 / 121


Page 171: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

HDP

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 109 / 121

Page 172: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Property of Dirichlet distribution
Sample pmfs from the Dirichlet distribution [BAFG10]

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 110 / 121

Page 173: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Assign Dirichlet parameters

Dirichlet parameters are less than 1
I People usually use a few topics to write a document
I People usually do not use all topics
I Each topic usually uses a few words to represent itself
I Each topic does not use all words

We can assign each topic/word a weight
I Some topics are more general than others
I Some words are more general than others
I Words that have positive/negative meaning appear in positive/negative sentiments [JO11]

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 111 / 121


Page 177: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Some tips for using topic models

How to manage hyper-parameters (Dirichlet parameters)?

How to manage learning rate and mini-batch size in online learning?

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 112 / 121

Page 178: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Compute the learning rate ρe = (τ0 + e)^{−κ} where τ0 > 0, κ ∈ (0.5,1]

a1k ← (1−ρe) a1k + ρe (1 + D ∑_{t=1}^{T} ζdtk)
a2k ← (1−ρe) a2k + ρe (γ + D ∑_{t=1}^{T} ∑_{f=k+1}^{K} ζdtf)
λkv ← (1−ρe) λkv + ρe (ηv + D ∑_{n=1}^{N} ∑_{t=1}^{T} w^v_dn ϕdnt ζdtk)

Meaning of each parameter
I τ0: slows down the early iterations of the algorithm
I κ: rate at which old values of the topic parameters are forgotten

So it depends on the dataset

Usually, we set τ0 = 1.0, κ = 0.7

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 113 / 121
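A quick look at how the schedule behaves with the suggested defaults:

```python
def learning_rate(e, tau0=1.0, kappa=0.7):
    """ρ_e = (τ0 + e)^(-κ); valid when τ0 > 0 and κ ∈ (0.5, 1]."""
    return (tau0 + e) ** (-kappa)

rates = [learning_rate(e) for e in range(1, 6)]
# the rate decays monotonically, so later mini-batches perturb the topics less
assert all(a > b for a, b in zip(rates, rates[1:]))
```

Raising τ0 damps the very first updates; pushing κ toward 1 makes old topic values be forgotten more slowly.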


Page 181: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Mini-batch size
When the mini-batch size is large, distributed online HDP runs faster

Perplexity is similar to that of the other settings

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 114 / 121

Page 182: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Summary

Bayesian Nonparametric Topic ModelingHierarchical Dirichlet Processes

I Chinese Restaurant FranchiseI Stick Breaking Construction

Posterior Inference for HDPI Gibbs SamplingI Variational InferenceI Online Learning

Slides and other materials are uploaded at http://uilab.kaist.ac.kr/members/jinyeongbak

Implementations are updated at http://github.com/NoSyu/Topic_Models

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 115 / 121

Page 183: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Further Reading

Dirichlet ProcessI Dirichlet ProcessI Dirichlet distribution and Dirichlet Process + Indian Buffet Process

Bayesian Nonparametric modelI Machine Learning Summer School - Yee Whye TehI Machine Learning Summer School - Peter OrbanzI Introductory article

InferenceI MCMCI Variational Inference

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 116 / 121

Page 184: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Thank You!

JinYeong [email protected], linkedin.com/in/jybak

Users & Information Lab, KAIST

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 117 / 121

Page 185: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

References I

Charles E. Antoniak, Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems, The Annals of Statistics (1974), 1152–1174.

Bela A. Frigyik, Amol Kapila, and Maya R. Gupta, Introduction to the Dirichlet distribution and related processes, Tech. Report UWEETR-2010-0006, Department of Electrical Engineering, University of Washington, Seattle, WA 98195, December 2010.

Christopher M. Bishop and Nasser M. Nasrabadi, Pattern Recognition and Machine Learning, vol. 1, Springer, New York, 2006.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research 3 (2003), 993–1022.

Emily B. Fox, Erik B. Sudderth, Michael I. Jordan, and Alan S. Willsky, An HDP-HMM for systems with state persistence, Proceedings of the 25th International Conference on Machine Learning, ACM, 2008, pp. 312–319.

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 118 / 121

Page 186: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

References II

Peter D. Hoff, A First Course in Bayesian Statistical Methods, Springer, 2009.

Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul, An introduction to variational methods for graphical models, Springer, 1998.

Yohan Jo and Alice H. Oh, Aspect and sentiment unification model for online review analysis, Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (WSDM '11), ACM, 2011, pp. 815–824.

Radford M. Neal, Markov chain sampling methods for Dirichlet process mixture models, Journal of Computational and Graphical Statistics 9 (2000), no. 2, 249–265.

Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei, Hierarchical Dirichlet processes, Journal of the American Statistical Association 101 (2006), no. 476.

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 119 / 121

Page 187: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

References III

Chong Wang, John W. Paisley, and David M. Blei, Online variational inference for the hierarchical Dirichlet process, International Conference on Artificial Intelligence and Statistics, 2011, pp. 752–760.

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 120 / 121

Page 188: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Images source I

http://christmasstockimages.com/free/ideas_concepts/slides/dice_throw.htm

http://www.flickr.com/photos/autumn2may/3965964418/

http://www.flickr.com/photos/ppix/1802571058/

http://yesurakezu.deviantart.com/art/Domo-s-head-exploding-with-dice-298452871

http://www.flickr.com/photos/jwight/2710392971/

http://www.flickr.com/photos/jasohill/2511594886/

http://en.wikipedia.org/wiki/Kim_Yuna

http://en.wikipedia.org/wiki/Hand_in_Hand_%28Olympics%29

http://en.wikipedia.org/wiki/Gangnam_Style

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 121 / 121

Page 189: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Measurable space (Ω,B)

Def) A set considered together with a σ-algebra on the set (6).

Ω: the set of all outcomes, the sample space
B: a σ-algebra over Ω
I A special kind of collection of subsets of the sample space Ω
  F Complete: if A is in the σ-algebra, then A^C is also in the σ-algebra
  F Closed under countable unions and intersections: if A and B are in the σ-algebra, then A∪B and A∩B are also in the σ-algebra
I A collection of events
I Properties
  F Smallest possible σ-algebra: {Ω, ∅}
  F Largest possible σ-algebra: the power set

6 http://mathworld.wolfram.com/MeasurableSpace.html
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 122 / 121


Page 191: Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

Proof 1
Decimative property
I Let (θ1,θ2, . . . ,θK) ∼ Dir(α1,α2, . . . ,αK)
  and (τ1,τ2) ∼ Dir(α1β1,α1β2) where β1 + β2 = 1,
  then (θ1τ1,θ1τ2,θ2, . . . ,θK) ∼ Dir(α1β1,α1β2,α2, . . . ,αK)

Then

(G(θ1),G(A1), . . . ,G(AR)) = (β1,(1−β1)G′(A1), . . . ,(1−β1)G′(AR)) ∼ Dir(1,α0G0(A1), . . . ,α0G0(AR))

changes to

(G′(A1), . . . ,G′(AR)) ∼ Dir(α0G0(A1), . . . ,α0G0(AR))
G′ ∼ DP(α0,G0)

using the decimative property with

α1 = α0 ,  θ1 = (1−β1) ,  βk = G0(Ak) ,  τk = G′(Ak)

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 123 / 121

