Bayesian Nonparametric Topic Modeling: Hierarchical Dirichlet Processes


DESCRIPTION

These are the presentation slides from the Machine Learning Summer School in Korea (http://prml.yonsei.ac.kr/). I talked about the Dirichlet distribution, the Dirichlet process, and the HDP.

TRANSCRIPT

Bayesian Nonparametric Topic Modeling: Hierarchical Dirichlet Processes

JinYeong Bak

Department of Computer Science, KAIST, Daejeon, South Korea

jy.bak@kaist.ac.kr

August 22, 2013

Part of these slides is adapted from a presentation by Yee Whye Teh (y.w.teh@stats.ox.ac.uk).

Outline

1 Introduction: Motivation, Topic Modeling

2 Background: Dirichlet Distribution, Dirichlet Processes

3 Hierarchical Dirichlet Processes: Dirichlet Process Mixture Models, Hierarchical Dirichlet Processes

4 Inference: Gibbs Sampling, Variational Inference, Online Learning, Distributed Online Learning

5 Practical Tips

6 Summary


Introduction

Bayesian topic models
- Latent Dirichlet Allocation (LDA) [BNJ03]
- Hierarchical Dirichlet Processes (HDP) [TJBB06]

In this talk:
- Dirichlet distribution, Dirichlet process
- Concept of Hierarchical Dirichlet Processes (HDP)
- How to infer the latent variables in HDP

Motivation

What are the topics discussed in the article?

How can we describe the topics?



Topic Modeling

Each topic has a word distribution

Each document has a topic proportion

Each word has its own topic index

Latent Dirichlet Allocation

Generative process of LDA

For each topic k ∈ 1, . . . , K:
- Draw a word distribution βk ∼ Dir(η)

For each document d ∈ 1, . . . , D:
- Draw topic proportions θd ∼ Dir(α)
- For each word n ∈ 1, . . . , N in the document:
  - Draw a topic index zdn ∼ Mult(θd)
  - Generate the word from the chosen topic, wdn ∼ Mult(βzdn)
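This generative story is easy to simulate. Below is a minimal sketch in Python/NumPy; the sizes K, D, N, V and the symmetric hyperparameters are illustrative choices, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, N, V = 3, 5, 20, 50   # topics, documents, words per document, vocabulary size
eta, alpha = 0.1, 0.5       # symmetric Dirichlet hyperparameters (illustrative)

beta = rng.dirichlet(np.full(V, eta), size=K)      # beta_k ~ Dir(eta), one row per topic
docs = []
for d in range(D):
    theta = rng.dirichlet(np.full(K, alpha))       # theta_d ~ Dir(alpha)
    z = rng.choice(K, size=N, p=theta)             # z_dn ~ Mult(theta_d)
    words = [rng.choice(V, p=beta[k]) for k in z]  # w_dn ~ Mult(beta_{z_dn})
    docs.append(words)
```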

Latent Dirichlet Allocation

Our interests
- What are the topics discussed in the article?
- How can we describe the topics?

Latent Dirichlet Allocation: What we can see

Words in documents

Latent Dirichlet Allocation: What we want to see

Latent Dirichlet Allocation

Our interests
- What are the topics discussed in the article?
  => Topic proportion of each document
- How can we describe the topics?
  => Word distribution of each topic

Latent Dirichlet Allocation

What we can see: w

What we want to see: θ, z, β

∴ Compute p(θ, z, β | w, α, η) = p(θ, z, β, w | α, η) / p(w | α, η)

But this distribution is intractable to compute (∵ the normalization term p(w | α, η))

So we use approximate methods:
- Gibbs Sampling
- Variational Inference

Limitation of Latent Dirichlet Allocation

Latent Dirichlet Allocation is a parametric model
- People should assign the number of topics in a corpus
- People should find the best number of topics

Q) Can we get it from the data automatically?

A) Hierarchical Dirichlet Processes


Dice modeling

Think about the probability of a number rolled on dice

Each die has its own pmf

According to the textbook, it is widely known as uniform
=> 1/6 for a 6-sided die

Is it true? Ans) No!

Dice modeling

We should model the randomness of the pmf of each die

How can we do that?
- Imagine a bag which holds many dice
- We cannot see inside the bag
- We can draw one die out of the bag

OK, but what is the formal description?

Standard Simplex

A generalization of the notion of a triangle or tetrahedron

All points are non-negative and sum to 1 [1]

A pmf can be thought of as a point in the standard simplex

Ex) A point p = (x, y, z), where x ≥ 0, y ≥ 0, z ≥ 0 and x + y + z = 1

[1] http://en.wikipedia.org/wiki/Simplex

Dirichlet distribution

Definition [BN06]
- A probability distribution over the (K − 1)-dimensional standard simplex
- A distribution over pmfs of length K

Notation

θ ∼ Dir(α)

where θ = [θ1, . . . , θK] is a random pmf and α = [α1, . . . , αK]

Probability density function

p(θ; α) = (Γ(∑_{k=1}^K αk) / ∏_{k=1}^K Γ(αk)) ∏_{k=1}^K θk^{αk − 1}
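A quick way to build intuition for Dir(α) is to draw pmfs with NumPy; the α values below are illustrative choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.dirichlet([2.0, 3.0, 4.0])   # one random pmf on the 2-simplex
print(theta, theta.sum())                # entries are non-negative and sum to 1

sparse = rng.dirichlet([0.1, 0.1, 0.1], size=5)    # small alpha: draws near the corners
dense = rng.dirichlet([10.0, 10.0, 10.0], size=5)  # large alpha: draws near the center
```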


Property of Dirichlet distribution: Density plots [BAFG10]

Property of Dirichlet distribution: Sample pmfs from the Dirichlet distribution [BAFG10]

Property of Dirichlet distribution

When K = 2, it is the Beta distribution

Conjugate prior for the Multinomial distribution
- Likelihood X ∼ Mult(n, θ), prior θ ∼ Dir(α)
- ∴ Posterior (θ | X) ∼ Dir(α + n)
- Proof)

p(θ | X) = p(X | θ) p(θ) / p(X)
         ∝ p(X | θ) p(θ)
         = (n! / (x1! · · · xK!)) ∏_{k=1}^K θk^{xk} · (Γ(∑_{k=1}^K αk) / ∏_{k=1}^K Γ(αk)) ∏_{k=1}^K θk^{αk − 1}
         = C ∏_{k=1}^K θk^{αk + xk − 1}
         = Dir(α + n)
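Because of this conjugacy, a posterior update is just vector addition. A tiny sketch with made-up prior parameters and counts:

```python
import numpy as np

alpha = np.array([1.0, 2.0, 3.0])   # prior Dir(alpha); illustrative values
counts = np.array([5, 0, 2])        # observed multinomial counts n
posterior = alpha + counts          # posterior is Dir(alpha + n)
print(alpha / alpha.sum())          # prior mean
print(posterior / posterior.sum())  # posterior mean shifts toward empirical frequencies
```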

Property of Dirichlet distribution

Aggregation property
- Let (θ1, θ2, . . . , θK) ∼ Dir(α1, α2, . . . , αK),
  then (θ1 + θ2, θ3, . . . , θK) ∼ Dir(α1 + α2, α3, . . . , αK)
- In general, if A1, . . . , AR is any partition of {1, . . . , K},
  then (∑_{k∈A1} θk, . . . , ∑_{k∈AR} θk) ∼ Dir(∑_{k∈A1} αk, . . . , ∑_{k∈AR} αk)

Decimative property
- Let (θ1, θ2, . . . , θK) ∼ Dir(α1, α2, . . . , αK)
  and (τ1, τ2) ∼ Dir(α1β1, α1β2) where β1 + β2 = 1,
  then (θ1τ1, θ1τ2, θ2, . . . , θK) ∼ Dir(α1β1, α1β2, α2, . . . , αK)

Neutrality property
- Let (θ1, θ2, . . . , θK) ∼ Dir(α1, α2, . . . , αK),
  then θk is independent of the vector (1 / (1 − θk)) (θ1, θ2, . . . , θk−1, θk+1, . . . , θK)


Dice modeling

Think about the probability of a number rolled on dice

Each die has its own pmf

Draw a die out of a bag

Problem) We do not know the number of faces of the dice in the bag

Solution) Dirichlet process

Dirichlet Process

Definition [BAFG10]
- A distribution over probability measures
- A distribution whose realizations are distributions over an arbitrary sample space

Formal definition
- (Ω, B) is a measurable space
- G0 is a distribution over the sample space Ω
- α0 is a positive real number
- G is a random probability measure over (Ω, B)

G ∼ DP(α0, G0)

if for any finite measurable partition (A1, . . . , AR) of Ω

(G(A1), . . . , G(AR)) ∼ Dir(α0G0(A1), . . . , α0G0(AR))

Posterior Dirichlet Processes

G ∼ DP(α0, G0) can be treated as a random distribution over Ω

We can draw a sample θ1 from G

We can also make a finite partition (A1, . . . , AR) of Ω,
then p(θ1 ∈ Ar | G) = G(Ar) and p(θ1 ∈ Ar) = G0(Ar)

(G(A1), . . . , G(AR)) ∼ Dir(α0G0(A1), . . . , α0G0(AR))

Using Dirichlet-multinomial conjugacy, the posterior is

(G(A1), . . . , G(AR)) | θ1 ∼ Dir(α0G0(A1) + δθ1(A1), . . . , α0G0(AR) + δθ1(AR))

where δθ(Ar) = 1 if θ ∈ Ar and 0 otherwise

This holds for every finite partition of Ω

Posterior Dirichlet Processes

For every finite partition of Ω,

(G(A1), . . . , G(AR)) | θ1 ∼ Dir(α0G0(A1) + δθ1(A1), . . . , α0G0(AR) + δθ1(AR))

where δθ1(Ar) = 1 if θ1 ∈ Ar and 0 otherwise

The posterior process is also a Dirichlet process:

G | θ1 ∼ DP(α0 + 1, (α0G0 + δθ1) / (α0 + 1))

Summary)

θ1 | G ∼ G, G ∼ DP(α0, G0)   ⇐⇒   θ1 ∼ G0, G | θ1 ∼ DP(α0 + 1, (α0G0 + δθ1) / (α0 + 1))

Blackwell-MacQueen Urn Scheme

Now we draw samples θ1, . . . , θN

First sample

θ1 | G ∼ G, G ∼ DP(α0, G0)   ⇐⇒   θ1 ∼ G0, G | θ1 ∼ DP(α0 + 1, (α0G0 + δθ1) / (α0 + 1))

Second sample

θ2 | θ1, G ∼ G, G | θ1 ∼ DP(α0 + 1, (α0G0 + δθ1) / (α0 + 1))
⇐⇒   θ2 | θ1 ∼ (α0G0 + δθ1) / (α0 + 1), G | θ1, θ2 ∼ DP(α0 + 2, (α0G0 + δθ1 + δθ2) / (α0 + 2))

Blackwell-MacQueen Urn Scheme

Nth sample

θN | θ1,...,N−1, G ∼ G, G | θ1,...,N−1 ∼ DP(α0 + N − 1, (α0G0 + ∑_{n=1}^{N−1} δθn) / (α0 + N − 1))
⇐⇒   θN | θ1,...,N−1 ∼ (α0G0 + ∑_{n=1}^{N−1} δθn) / (α0 + N − 1), G | θ1,...,N ∼ DP(α0 + N, (α0G0 + ∑_{n=1}^N δθn) / (α0 + N))

Blackwell-MacQueen Urn Scheme

The Blackwell-MacQueen urn scheme produces a sequence θ1, θ2, . . . with the following conditionals:

θN | θ1,...,N−1 ∼ (α0G0 + ∑_{n=1}^{N−1} δθn) / (α0 + N − 1)

Polya urn analogy
- An infinite number of ball colors
- An empty urn
- Filling the Polya urn (n starts at 1):
  - With probability proportional to α0, pick a new color from the set of infinite ball colors G0, paint a new ball that color, and add it to the urn
  - With probability proportional to n − 1, pick a ball from the urn, record its color, and put it back into the urn with another ball of the same color
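The urn scheme is a few lines of code. A sketch assuming G0 = N(0, 1) and an illustrative α0; note how the number of distinct values grows much more slowly than N:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha0, N = 1.0, 100
draws = []
for n in range(1, N + 1):
    if rng.random() < alpha0 / (alpha0 + n - 1):
        draws.append(rng.normal())                     # new atom from G0
    else:
        draws.append(draws[rng.integers(len(draws))])  # repeat a uniformly chosen earlier draw
print(len(set(draws)), "distinct values among", N, "draws")
```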

Chinese Restaurant Process

Draw θ1, θ2, . . . , θN from the Blackwell-MacQueen urn scheme
- With probability proportional to α0, pick a new color from the set of infinite ball colors G0, paint a new ball that color, and add it to the urn
- With probability proportional to n − 1, pick a ball from the urn, record its color, and put it back into the urn with another ball of the same color

The θs can take the same values: θi = θj

There are K ≤ N distinct values, φ1, . . . , φK

These induce a partition of Ω

θ1, θ2, . . . , θN induce φ1, . . . , φK

The distribution over partitions is called the Chinese Restaurant Process (CRP)

Chinese Restaurant Process

θ1, θ2, . . . , θN induce φ1, . . . , φK

Chinese Restaurant Process interpretation
- There is a Chinese restaurant with infinitely many tables
- Each customer sits at a table

Generating from the Chinese Restaurant Process
- The first customer sits at the first table
- The n-th customer sits at
  - a new table with probability α0 / (α0 + n − 1)
  - table k with probability nk / (α0 + n − 1), where nk is the number of customers at table k

Chinese Restaurant Process

The CRP exhibits the clustering property of the DP
- Tables are clusters: φk ∼ G0
- Customers are the actual realizations: θn = φzn where zn ∈ {1, . . . , K}
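A minimal CRP simulation, with an illustrative α0, makes the clustering property visible: a few large tables plus a long tail of small ones.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha0, N = 1.0, 100
tables = []                                # tables[k] = n_k, customers at table k
for n in range(1, N + 1):
    weights = np.array(tables + [alpha0])  # existing tables get n_k, a new table gets alpha0
    k = rng.choice(len(weights), p=weights / (alpha0 + n - 1))
    if k == len(tables):
        tables.append(1)                   # open a new table
    else:
        tables[k] += 1
print(sorted(tables, reverse=True))        # a few big clusters, many small ones
```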

Stick Breaking Construction

The Blackwell-MacQueen urn scheme / CRP generates θ ∼ G, not G itself

To construct G, we use the Stick Breaking Construction

Review) Posterior Dirichlet Processes

θ1 | G ∼ G, G ∼ DP(α0, G0)   ⇐⇒   θ1 ∼ G0, G | θ1 ∼ DP(α0 + 1, (α0G0 + δθ1) / (α0 + 1))

Consider the partition ({θ1}, Ω∖{θ1}) of Ω. Then

(G(θ1), G(Ω∖θ1)) ∼ Dir((α0 + 1) ((α0G0 + δθ1) / (α0 + 1))(θ1), (α0 + 1) ((α0G0 + δθ1) / (α0 + 1))(Ω∖θ1)) = Dir(1, α0) = Beta(1, α0)

Stick Breaking Construction

Consider the partition ({θ1}, Ω∖{θ1}) of Ω. Then

(G(θ1), G(Ω∖θ1)) = (β1, 1 − β1) ∼ Beta(1, α0)

G has a point mass located at θ1:

G = β1δθ1 + (1 − β1)G′, β1 ∼ Beta(1, α0)

where G′ is the probability measure with the point mass θ1 removed

What is G′?

Stick Breaking Construction

Summary) Posterior Dirichlet Processes

θ1 | G ∼ G, G ∼ DP(α0, G0)   ⇐⇒   θ1 ∼ G0, G | θ1 ∼ DP(α0 + 1, (α0G0 + δθ1) / (α0 + 1))

G = β1δθ1 + (1 − β1)G′, β1 ∼ Beta(1, α0)

Consider a further partition ({θ1}, A1, . . . , AR) of Ω:

(G(θ1), G(A1), . . . , G(AR)) = (β1, (1 − β1)G′(A1), . . . , (1 − β1)G′(AR)) ∼ Dir(1, α0G0(A1), . . . , α0G0(AR))

Using the decimative property of the Dirichlet distribution (proof),

(G′(A1), . . . , G′(AR)) ∼ Dir(α0G0(A1), . . . , α0G0(AR))

G′ ∼ DP(α0, G0)

Stick Breaking Construction

Do this repeatedly with distinct values φ1, φ2, . . .:

G ∼ DP(α0, G0)
G = β1δφ1 + (1 − β1)G′1
G = β1δφ1 + (1 − β1)(β2δφ2 + (1 − β2)G′2)
...
G = ∑_{k=1}^∞ πk δφk

where

πk = βk ∏_{i=1}^{k−1} (1 − βi), ∑_{k=1}^∞ πk = 1, βk ∼ Beta(1, α0), φk ∼ G0

Draws from the DP look like a sum of point masses, with masses drawn from a stick-breaking construction.

Stick Breaking Construction

Summary)

G = ∑_{k=1}^∞ πk δφk

πk = βk ∏_{i=1}^{k−1} (1 − βi), ∑_{k=1}^∞ πk = 1, βk ∼ Beta(1, α0), φk ∼ G0
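A truncated stick-breaking draw is straightforward to code; the sketch below assumes G0 = N(0, 1) and truncates the infinite sum at an illustrative level K.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha0, K = 1.0, 50                   # truncation level K approximates the infinite sum
beta = rng.beta(1.0, alpha0, size=K)  # beta_k ~ Beta(1, alpha0)
pi = beta * np.cumprod(np.concatenate(([1.0], 1.0 - beta[:-1])))  # pi_k = beta_k prod_{i<k}(1 - beta_i)
phi = rng.normal(size=K)              # atom locations phi_k ~ G0 = N(0, 1)
print(pi.sum())                       # close to 1 for large K; G ≈ sum_k pi_k * delta_{phi_k}
```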

Summary of DP

Definition
- G is a random probability measure over (Ω, B)

G ∼ DP(α0, G0)

if for any finite measurable partition (A1, . . . , AR) of Ω

(G(A1), . . . , G(AR)) ∼ Dir(α0G0(A1), . . . , α0G0(AR))

Chinese Restaurant Process

Stick Breaking Construction


Dirichlet Process Mixture Models

We model a data set x1, . . . , xN using the following model [Nea00]:

xn ∼ F(θn)
θn ∼ G
G ∼ DP(α0, G0)

Each θn is a latent parameter modelling xn, while G is the unknown distribution over parameters, modelled using a DP

Dirichlet Process Mixture Models

Since G is of the form

G = ∑_{k=1}^∞ πk δφk

we have θn = φk with probability πk

Let kn take on value k with probability πk. We can equivalently define θn = φkn

An equivalent model:

xn ∼ F(θn), θn ∼ G, G ∼ DP(α0, G0)
⇐⇒
xn ∼ F(φkn), p(kn = k) = πk, πk = βk ∏_{i=1}^{k−1} (1 − βi), βk ∼ Beta(1, α0), φk ∼ G0


Topic modeling with documents

- Each document consists of a bag of words
- Each word in a document has a latent topic index
- Latent topics for words in a document can be grouped
- Each document has a topic proportion
- Each topic has a word distribution
- Topics must be shared across documents

Problem of the Naive Dirichlet Process Mixture Model

Use a DP mixture for each document:

xdn ∼ F(θdn), θdn ∼ Gd, Gd ∼ DP(α0, G0)

But there is no sharing of clusters across different groups, because G0 is smooth:

G1 = ∑_{k=1}^∞ π1k δφ1k, G2 = ∑_{k=1}^∞ π2k δφ2k, φ1k, φ2k ∼ G0

Problem of the Naive Dirichlet Process Mixture Model

Solution
- Make the base distribution G0 discrete
- Put a DP prior on the common base distribution

Hierarchical Dirichlet Process:

G0 ∼ DP(γ, H)
G1, G2 | G0 ∼ DP(α0, G0)

Hierarchical Dirichlet Processes

Making G0 discrete forces clusters to be shared between G1 and G2

Stick Breaking Construction

A Hierarchical Dirichlet Process with documents 1, . . . , D:

G0 ∼ DP(γ, H)
Gd | G0 ∼ DP(α0, G0)

The stick-breaking construction for the HDP:

G0 = ∑_{k=1}^∞ βk δφk, φk ∼ H
βk = β′k ∏_{i=1}^{k−1} (1 − β′i), β′k ∼ Beta(1, γ)

Gd = ∑_{k=1}^∞ πdk δφk
πdk = π′dk ∏_{i=1}^{k−1} (1 − π′di), π′dk ∼ Beta(α0βk, α0(1 − ∑_{i=1}^k βi))
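A truncated version of this two-level construction is sketched below; the clipping of the residual mass 1 − ∑_{i≤k} βi is only a numerical guard needed because of truncation, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha0, K, D = 1.0, 1.0, 30, 4   # truncation level K and corpus size are illustrative

b = rng.beta(1.0, gamma, size=K)                              # beta'_k ~ Beta(1, gamma)
beta = b * np.cumprod(np.concatenate(([1.0], 1.0 - b[:-1])))  # corpus-level weights beta_k

tail = np.maximum(1.0 - np.cumsum(beta), 1e-12)  # 1 - sum_{i<=k} beta_i, clipped at truncation
pis = []
for d in range(D):
    p = rng.beta(alpha0 * beta, alpha0 * tail)                    # pi'_dk ~ Beta(a0 b_k, a0 (1 - sum))
    pi = p * np.cumprod(np.concatenate(([1.0], 1.0 - p[:-1])))    # document-level weights pi_dk
    pis.append(pi)                       # every document reuses the same global atoms phi_k
```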

Chinese Restaurant Franchise

G0 ∼ DP(γ, H), φk ∼ H

Gd | G0 ∼ DP(α0, G0), θdn ∼ Gd

Draw θd1, θd2, . . . from a Blackwell-MacQueen urn scheme; θd1, θd2, . . . induce φd1, φd2, . . .

Likewise, draw θd′1, θd′2, . . . for another document d′; they induce φd′1, φd′2, . . .

Chinese Restaurant Franchise

Chinese Restaurant Franchise interpretation
- Each restaurant has infinitely many tables
- All restaurants share the food menu
- Each customer sits at a table

Generating from the Chinese Restaurant Franchise, for each restaurant:
- The first customer sits at the first table and chooses a new menu item
- The n-th customer sits at
  - a new table with probability α0 / (α0 + n − 1)
  - table t with probability ndt / (α0 + n − 1), where ndt is the number of customers at table t
- The n-th customer (at a new table) chooses
  - a new menu item with probability γ / (γ + m − 1)
  - existing menu item k with probability mk / (γ + m − 1),
  where m is the number of tables in all restaurants and mk is the number of tables serving menu item k in all restaurants
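A minimal simulation of the franchise, with illustrative α0 and γ; the shared dish counts mk play the role of global topics.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha0, gamma, D, N = 1.0, 1.0, 3, 50
mk = []                                    # mk[k]: tables serving dish k across all restaurants
franchise = []
for d in range(D):
    ndt, dish = [], []                     # customers per table, dish per table (restaurant d)
    for n in range(1, N + 1):
        w = np.array(ndt + [alpha0])       # CRP over tables within the restaurant
        t = rng.choice(len(w), p=w / w.sum())
        if t == len(ndt):                  # a new table orders from the shared menu
            dw = np.array(mk + [gamma])    # CRP over dishes, weighted by global table counts
            k = rng.choice(len(dw), p=dw / dw.sum())
            if k == len(mk):
                mk.append(1)               # a brand-new dish (topic) for the whole franchise
            else:
                mk[k] += 1
            ndt.append(1)
            dish.append(k)
        else:
            ndt[t] += 1
    franchise.append((ndt, dish))
print("shared dishes (topics):", len(mk))
```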


HDP for Topic modeling

Questions
- What can we assume about the topics in a document?
- What can we assume about the words in the topics?

Solution
- Each document consists of a bag of words
- Each word in a document has a latent topic
- Latent topics for words in a document can be grouped
- Each document has a topic proportion
- Each topic has a word distribution
- Topics must be shared across documents


Gibbs Sampling

Definition
- A special case of the Markov chain Monte Carlo (MCMC) method
- An iterative algorithm that constructs a dependent sequence of parameter values whose distribution converges to the target joint posterior distribution [Hof09]

Algorithm
- Find the full conditional distribution of the latent variables of the target distribution
- Initialize all latent variables
- Sample until converged:
  - Sample one latent variable from its full conditional distribution

Collapsed Gibbs sampling

Collapsed Gibbs sampling integrates out one or more variables when sampling the other variables.

Example)
- There are three latent variables A, B and C
- Plain Gibbs samples p(A | B, C), p(B | A, C) and p(C | A, B) sequentially
- But when we integrate out B, we sample only p(A | C) and p(C | A) sequentially
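The contrast is purely in the loop structure. A schematic sketch in Python, where the sample_* functions are hypothetical stand-ins for the actual full-conditional samplers:

```python
def full_gibbs(A, B, C, iters, sample_A, sample_B, sample_C):
    # Plain Gibbs: cycle through all three full conditionals.
    for _ in range(iters):
        A = sample_A(B, C)   # A ~ p(A | B, C)
        B = sample_B(A, C)   # B ~ p(B | A, C)
        C = sample_C(A, B)   # C ~ p(C | A, B)
    return A, B, C

def collapsed_gibbs(A, C, iters, sample_A_given_C, sample_C_given_A):
    # Collapsed Gibbs: B has been integrated out analytically,
    # so only the two remaining conditionals are sampled.
    for _ in range(iters):
        A = sample_A_given_C(C)   # A ~ p(A | C)
        C = sample_C_given_A(A)   # C ~ p(C | A)
    return A, C
```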

Review) Dirichlet Process Mixture Models

xn ∼ F(θn), θn ∼ G, G ∼ DP(α0, G0)
⇐⇒
xn ∼ F(φkn), p(kn = k) = πk, πk = βk ∏_{i=1}^{k−1} (1 − βi), βk ∼ Beta(1, α0), φk ∼ G0

Review) Blackwell-MacQueen Urn Scheme for DP

Nth sample:

θN | θ1,...,N−1, G ∼ G, G | θ1,...,N−1 ∼ DP(α0 + N − 1, (α0G0 + ∑_{n=1}^{N−1} δθn) / (α0 + N − 1))
⇐⇒   θN | θ1,...,N−1 ∼ (α0G0 + ∑_{n=1}^{N−1} δθn) / (α0 + N − 1), G | θ1,...,N ∼ DP(α0 + N, (α0G0 + ∑_{n=1}^N δθn) / (α0 + N))

Review) Chinese Restaurant Franchise

Generating from the Chinese Restaurant Franchise, for each restaurant:
- The first customer sits at the first table and chooses a new menu item
- The n-th customer sits at a new table with probability α0 / (α0 + n − 1), or at table t with probability ndt / (α0 + n − 1), where ndt is the number of customers at table t
- The n-th customer (at a new table) chooses a new menu item with probability γ / (γ + m − 1), or existing menu item k with probability mk / (γ + m − 1), where m is the number of tables in all restaurants and mk is the number of tables serving menu item k in all restaurants

Alternative form of HDP

G0 ∼ DP(γ, H), φdt ∼ G0

∴ G0 | φdt, . . . ∼ DP(γ + m, (γH + ∑_{k=1}^K mk δφk) / (γ + m))

Then G0 is given as

G0 = ∑_{k=1}^K βk δφk + βu Gu

where

Gu ∼ DP(γ, H)
π = (π1, . . . , πK, πu) ∼ Dir(m1, . . . , mK, γ)
p(φk | ·) ∝ h(φk) ∏_{dn: zdn = k} f(xdn | φk)

Hierarchical Dirichlet Processes

xdn ∼ F(θdn), θdn ∼ Gd, Gd ∼ DP(α0, G0), G0 ∼ DP(γ, H)
⇐⇒
xdn ∼ Mult(φzdn), zdn ∼ Mult(θd), φk ∼ Dir(η), θd ∼ Dir(α0π), π ∼ Dir(m.1, . . . , m.K, γ)

Gibbs Sampling for HDP

Joint distribution:

p(θ, z, φ, x, π, m | α0, η, γ) = p(π | m, γ) ∏_{k=1}^K p(φk | η) ∏_{d=1}^D p(θd | α0, π) ∏_{n=1}^N p(zdn | θd) p(xdn | zdn, φ)

Integrate out θ, φ:

p(z, x, π, m | α0, η, γ) = (Γ(∑_{k=1}^K m.k + γ) / (∏_{k=1}^K Γ(m.k) Γ(γ))) ∏_{k=1}^K πk^{m.k − 1} π_{K+1}^{γ − 1}
  · ∏_{k=1}^K (Γ(∑_{v=1}^V ηv) / ∏_{v=1}^V Γ(ηv)) (∏_{v=1}^V Γ(ηv + n^k_{(·),v}) / Γ(∑_{v=1}^V ηv + n^k_{(·),v}))
  · ∏_{d=1}^D (Γ(∑_{k=1}^K α0πk) / ∏_{k=1}^K Γ(α0πk)) (∏_{k=1}^K Γ(α0πk + n^k_{d,(·)}) / Γ(∑_{k=1}^K α0πk + n^k_{d,(·)}))

Gibbs Sampling for HDP

Full conditional distribution of z:

p(z(d′,n′) = k′ | z−(d′,n′), m, π, x, ·) = p(z(d′,n′) = k′, z−(d′,n′), m, π, x | ·) / p(z−(d′,n′), m, π, x | ·)
∝ p(z(d′,n′) = k′, z−(d′,n′), m, π, x | ·)
∝ (α0πk′ + n^{k′,−(d′,n′)}_{d′,(·)}) (ηv′ + n^{k′,−(d′,n′)}_{(·),v′}) / (∑_{v=1}^V ηv + n^{k′,−(d′,n′)}_{(·),v})

Gibbs Sampling for HDP

Full conditional distribution of m

The probability that word xd′n′ is assigned to some table t such that kdt = k:

p(θd′n′ = φt | φdt = φk, θ−(d′,n′), π) ∝ n^{−(d′,n′)}_{d,(·),t}
p(θd′n′ = new table | φdt_new = φk, θ−(d′,n′), π) ∝ α0πk

These equations form a Dirichlet process with concentration parameter α0πk and assignments of n^{−(d′,n′)}_{d,(·),t} to components. The corresponding distribution over the number of components is the desired conditional distribution of mdk.

Antoniak [Ant74] has shown that

p(md′k′ = m | z, m−(d′k′), π) = (Γ(α0πk′) / Γ(α0πk′ + n^{k′}_{d,(·)})) s(n^{k′}_{d,(·)}, m) (α0πk′)^m

where s(n, m) is the unsigned Stirling number of the first kind

Gibbs Sampling for HDP

Full conditional distribution of π:

(π1, π2, . . . , πK, πu) | · ∼ Dir(m.1, m.2, . . . , m.K, γ)

Gibbs Sampling for HDP

Algorithm 1 Gibbs Sampling for HDP
1: Initialize all latent variables at random
2: repeat
3:   for each document d do
4:     for each word n in document d do
5:       Sample z(d,n) from its full conditional,
         p(z(d,n) = k′ | ·) ∝ (α0πk′ + n^{k′,−(d,n)}_{d,(·)}) (ηv′ + n^{k′,−(d,n)}_{(·),v′}) / (∑_{v=1}^V ηv + n^{k′,−(d,n)}_{(·),v})
6:     end for
7:     Sample m from p(mdk = m | ·) ∝ (Γ(α0πk) / Γ(α0πk + n^k_{d,(·)})) s(n^k_{d,(·)}, m) (α0πk)^m
8:     Sample π ∼ Dir(m.1, m.2, . . . , m.K, γ)
9:   end for
10: until converged
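A skeleton that mirrors Algorithm 1; the three sampling helpers are hypothetical stand-ins for the full conditionals derived above, not a complete implementation.

```python
def gibbs_hdp(corpus, n_iters, sample_z, sample_m, sample_pi, state):
    # Skeleton of Algorithm 1. sample_z / sample_m / sample_pi are
    # hypothetical helpers implementing the full conditionals above.
    for _ in range(n_iters):                # "repeat ... until converged"
        for d, doc in enumerate(corpus):
            for n in range(len(doc)):
                state = sample_z(state, d, n)   # z_{dn} from its full conditional
            state = sample_m(state, d)          # table counts m_{dk} (Antoniak)
            state = sample_pi(state)            # pi | m ~ Dir(m_.1, ..., m_.K, gamma)
    return state
```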



Alternative Stick Breaking Construction

Problem) In the original stick-breaking construction, the weights βk and πdk are tightly correlated:

βk = β′k ∏_{i=1}^{k−1} (1 − β′i), β′k ∼ Beta(1, γ)
πdk = π′dk ∏_{i=1}^{k−1} (1 − π′di), π′dk ∼ Beta(α0βk, α0(1 − ∑_{i=1}^k βi))

Alternative stick-breaking construction for each document [FSJW08]:

ψdt ∼ G0
πdt = π′dt ∏_{i=1}^{t−1} (1 − π′di), π′dt ∼ Beta(1, α0)
Gd = ∑_{t=1}^∞ πdt δψdt

Alternative Stick Breaking Construction

The stick-breaking construction for the HDP:

G0 = ∑_{k=1}^∞ βk δφk, φk ∼ H
βk = β′k ∏_{i=1}^{k−1} (1 − β′i), β′k ∼ Beta(1, γ)

Gd = ∑_{t=1}^∞ πdt δψdt, ψdt ∼ G0
πdt = π′dt ∏_{i=1}^{t−1} (1 − π′di), π′dt ∼ Beta(1, α0)

To connect ψdt and φk, we add an auxiliary variable cdt ∼ Mult(β); then ψdt = φcdt

Alternative Stick Breaking Construction

Generative process
1 For each global-level topic k ∈ 1, . . . , ∞:
  1 Draw topic word proportions φk ∼ Dir(η)
  2 Draw a corpus breaking proportion β′k ∼ Beta(1, γ)
2 For each document d ∈ 1, . . . , D:
  1 For each document-level topic t ∈ 1, . . . , ∞:
    1 Draw a document-level topic index cdt ∼ Mult(σ(β′))
    2 Draw a document breaking proportion π′dt ∼ Beta(1, α0)
  2 For each word n ∈ 1, . . . , N:
    1 Draw a topic index zdn ∼ Mult(σ(π′d))
    2 Generate a word wdn ∼ Mult(φ_{c_{d,zdn}})
3 where σ(β′) ≡ (β1, β2, . . .), βk = β′k ∏_{i=1}^{k−1} (1 − β′i)

Variational Inference

Main idea [JGJS98]
- Modify the original graphical model into a simpler model
- Minimize the dissimilarity between the original and the modified one

More formally
- Observed data X, latent variable Z
- We want to compute p(Z | X)
- Make q(Z)
- Minimize the dissimilarity between p and q [2]

[2] Commonly the KL-divergence of p from q, DKL(q||p)

KL-divergence of p from q

Find a lower bound of the log evidence log p(X):

log p(X) = log ∑_Z p(Z, X) = log ∑_Z p(Z, X) q(Z|X) / q(Z|X)
         = log ∑_Z q(Z|X) p(Z, X) / q(Z|X)
         ≥ ∑_Z q(Z|X) log (p(Z, X) / q(Z|X))   [3]

Gap between the lower bound and log p(X):

log p(X) − ∑_Z q(Z|X) log (p(Z, X) / q(Z|X)) = ∑_Z q(Z) log (q(Z) / p(Z|X)) = DKL(q||p)

[3] Use Jensen's inequality

KL-divergence of p from q

log p(X) = ∑_Z q(Z|X) log (p(Z, X) / q(Z|X)) + DKL(q||p)

The log evidence log p(X) is fixed with respect to q

Minimizing DKL(q||p) ≡ maximizing the lower bound of log p(X)

Variational Inference

Main idea [JGJS98]
- Modify the original graphical model into a simpler model
- Minimize the dissimilarity between the original and the modified one

More formally
- Observed data X, latent variable Z
- We want to compute p(Z | X)
- Make q(Z)
- Minimize the dissimilarity between p and q [4]
  - Find a lower bound of log p(X)
  - Maximize it

[4] Commonly the KL-divergence of p from q, DKL(q||p)

Variational Inference for HDP

The fully factorized variational distribution:

q(β, φ, π, c, z) = ∏_{k=1}^{K} q(φk|λk) ∏_{k=1}^{K−1} q(βk|a1_k, a2_k) ∏_{d=1}^{D} [ ∏_{t=1}^{T} q(cdt|ζdt) ∏_{t=1}^{T−1} q(πdt|γ1_dt, γ2_dt) ∏_{n=1}^{N} q(zdn|ϕdn) ]

Variational Inference for HDP

Find a lower bound of log p(w|α0, γ, η):

ln p(w|α0,γ,η)
= ln ∫_β ∫_φ ∫_π ∑_c ∑_z p(w,β,φ,π,c,z|α0,γ,η) dβ dφ dπ
= ln ∫_β ∫_φ ∫_π ∑_c ∑_z p(w,β,φ,π,c,z|α0,γ,η) · q(β,φ,π,c,z) / q(β,φ,π,c,z) dβ dφ dπ
≥ ∫_β ∫_φ ∫_π ∑_c ∑_z ln [ p(w,β,φ,π,c,z|α0,γ,η) / q(β,φ,π,c,z) ] · q(β,φ,π,c,z) dβ dφ dπ   (∵ Jensen's inequality)
= Eq[ln p(w,β,φ,π,c,z|α0,γ,η)] − Eq[ln q(β,φ,π,c,z)]

Variational Inference for HDP

ln p(w|α0,γ,η)
≥ Eq[ln p(w,β,φ,π,c,z|α0,γ,η)] − Eq[ln q(β,φ,π,c,z)]
= Eq[ln p(β|γ) p(φ|η) ∏_{d=1}^{D} p(πd|α0) p(cd|β) ∏_{n=1}^{N} p(wdn|cd,zdn,φ) p(zdn|πd)]
  − Eq[ln ∏_{k=1}^{K} q(φk|λk) ∏_{k=1}^{K−1} q(βk|a1_k,a2_k) ∏_{d=1}^{D} ∏_{t=1}^{T} q(cdt|ζdt) ∏_{t=1}^{T−1} q(πdt|γ1_dt,γ2_dt) ∏_{n=1}^{N} q(zdn|ϕdn)]
= ∑_{d=1}^{D} { Eq[ln p(πd|α0)] + Eq[ln p(cd|β)] + Eq[ln p(wd|cd,zd,φ)] + Eq[ln p(zd|πd)]
    − Eq[ln q(cd|ζd)] − Eq[ln q(πd|γ1_d,γ2_d)] − Eq[ln q(zd|ϕd)] }
  + Eq[ln p(β|γ)] + Eq[ln p(φ|η)] − Eq[ln q(φ|λ)] − Eq[ln q(β|a1,a2)]

We can run variational EM to maximize this lower bound of log p(w|α0,γ,η).

Variational Inference for HDP

Maximize the lower bound of log p(w|α0,γ,η): take its derivative with respect to each variational parameter and set it to zero. This gives the coordinate updates

γ1_dt = 1 + ∑_{n=1}^{N} ϕdnt,   γ2_dt = α0 + ∑_{n=1}^{N} ∑_{b=t+1}^{T} ϕdnb

ζdtk ∝ exp{ ∑_{e=1}^{k−1} (Ψ(a2_e) − Ψ(a1_e + a2_e)) + (Ψ(a1_k) − Ψ(a1_k + a2_k)) + ∑_{n=1}^{N} ∑_{v=1}^{V} w^v_dn ϕdnt (Ψ(λkv) − Ψ(∑_{l=1}^{V} λkl)) }

ϕdnt ∝ exp{ ∑_{h=1}^{t−1} (Ψ(γ2_dh) − Ψ(γ1_dh + γ2_dh)) + (Ψ(γ1_dt) − Ψ(γ1_dt + γ2_dt)) + ∑_{k=1}^{K} ∑_{v=1}^{V} w^v_dn ζdtk (Ψ(λkv) − Ψ(∑_{l=1}^{V} λkl)) }

a1_k = 1 + ∑_{d=1}^{D} ∑_{t=1}^{T} ζdtk,   a2_k = γ + ∑_{d=1}^{D} ∑_{t=1}^{T} ∑_{f=k+1}^{K} ζdtf

λkv = ηv + ∑_{d=1}^{D} ∑_{n=1}^{N} ∑_{t=1}^{T} w^v_dn ϕdnt ζdtk

where Ψ is the digamma function, and ζdtk and ϕdnt are normalized over k and t respectively.

Variational Inference for HDP

Maximize the lower bound of log p(w|α0,γ,η): derive the updates above and alternate them in a variational EM loop.
I E step: compute the document-level parameters γ1_dt, γ2_dt, ζdtk, ϕdnt
I M step: compute the corpus-level parameters a1_k, a2_k, λkv

Algorithm 2 Variational Inference for HDP
1: Initialize the variational parameters
2: repeat
3:   for each document d do
4:     repeat
5:       Compute document parameters γ1_dt, γ2_dt, ζdtk, ϕdnt
6:     until converged
7:   end for
8:   Compute topic parameters a1_k, a2_k, λkv
9: until converged
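A schematic Python skeleton of Algorithm 2 (a sketch only: the helper functions stand in for the closed-form updates above and are assumptions, not an existing library API):

```python
import numpy as np

def variational_em(docs, K, T, max_iters=100, tol=1e-4):
    """Batch variational EM for HDP; helpers are placeholders for the updates above."""
    corpus_params = initialize_corpus_params(K)                 # a1, a2, lambda
    old_bound = -np.inf
    for _ in range(max_iters):
        doc_params = []
        for doc in docs:                                        # E step, per document
            params = initialize_doc_params(T)                   # gamma1, gamma2, zeta, phi
            while not converged(params):
                params = update_doc_params(doc, params, corpus_params)
            doc_params.append(params)
        corpus_params = update_corpus_params(docs, doc_params)  # M step
        bound = lower_bound(docs, doc_params, corpus_params)
        if abs(bound - old_bound) < tol:                        # outer convergence check
            break
        old_bound = bound
    return corpus_params
```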


Online Variational Inference

Stochastic optimization of the variational objective [WPB11]
I Subsample the documents
I Compute an approximation of the gradient based on the subsample
I Follow that gradient with a decreasing step size

Variational Inference for HDP

Lower bound of log p(w|α0,γ,η):

ln p(w|α0,γ,η)
≥ ∑_{d=1}^{D} { Eq[ln p(πd|α0)] + Eq[ln p(cd|β)] + Eq[ln p(wd|cd,zd,φ)] + Eq[ln p(zd|πd)]
    − Eq[ln q(cd|ζd)] − Eq[ln q(πd|γ1_d,γ2_d)] − Eq[ln q(zd|ϕd)] }
  + Eq[ln p(β|γ)] + Eq[ln p(φ|η)] − Eq[ln q(φ|λ)] − Eq[ln q(β|a1,a2)]
= ∑_{d=1}^{D} Ld + Lk
= E_d[ D·Ld + Lk ]   (expectation over a document d sampled uniformly from the corpus)

Online Variational Inference for HDP

Lower bound of log p(w|α0,γ,η) = E_d[ D·Ld + Lk ]

Online learning algorithm for HDP:
I Sample a document d
I Compute its optimal document-level parameters γ1_dt, γ2_dt, ζdtk, ϕdnt
I Take the noisy gradient of the corpus-level parameters a1_k, a2_k, λkv; the natural gradient is structurally equivalent to the batch variational inference update
I Update the corpus-level parameters with a decreasing learning rate:

a1_k = (1 − ρe) a1_k + ρe (1 + D ∑_{t=1}^{T} ζdtk)
a2_k = (1 − ρe) a2_k + ρe (γ + D ∑_{t=1}^{T} ∑_{f=k+1}^{K} ζdtf)
λkv = (1 − ρe) λkv + ρe (ηv + D ∑_{n=1}^{N} ∑_{t=1}^{T} w^v_dn ϕdnt ζdtk)

where ρe is the learning rate, which satisfies ∑_{e=1}^{∞} ρe = ∞ and ∑_{e=1}^{∞} ρe² < ∞.

Online Variational Inference for HDP

Algorithm 3 Online Variational Inference for HDP
1: Initialize the variational parameters
2: e = 0
3: for each document d ∈ 1, . . . , D do
4:   repeat
5:     Compute document parameters γ1_dt, γ2_dt, ζdtk, ϕdnt
6:   until converged
7:   e = e + 1
8:   Compute learning rate ρe = (τ0 + e)^−κ where τ0 > 0, κ ∈ (0.5, 1]
9:   Update topic parameters a1_k, a2_k, λkv
10: end for
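A minimal sketch of the corpus-level update in step 9 (illustrative; the per-document sufficient statistics are assumed to come from an E step like the one in Algorithm 2):

```python
def online_update(corpus_params, doc_stats, e, tau0=1.0, kappa=0.7):
    """One stochastic update of the corpus-level parameters (schematic).

    doc_stats holds the sampled document's statistics already scaled to the
    corpus, e.g. doc_stats["a1"][k] = 1 + D * sum_t zeta[d, t, k].
    """
    rho = (tau0 + e) ** (-kappa)        # decreasing learning rate, kappa in (0.5, 1]
    for name in ("a1", "a2", "lam"):    # blend old values with the noisy estimate
        corpus_params[name] = (1 - rho) * corpus_params[name] + rho * doc_stats[name]
    return corpus_params
```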


Motivation

Problem 1: Inference for HDP takes a long time.
Problem 2: A continuously expanding corpus necessitates continuous updates of the model parameters.
I But updating the model parameters is not possible with plain HDP
I We must re-train with the entire updated corpus

Our approach: combine distributed inference and online learning.

Distributed Online HDP

I Based on variational inference
I Mini-batch updates via stochastic learning (variational EM)
I Variational EM distributed using MapReduce

Distributed Online HDP

Algorithm 4 Distributed Online HDP - Driver
1: Initialize the variational parameters
2: e = 0
3: while run forever do
4:   Collect new documents s ∈ 1, . . . , S
5:   e = e + 1
6:   Compute learning rate ρe = (τ0 + e)^−κ where τ0 > 0, κ ∈ (0.5, 1]
7:   Run MapReduce job
8:   Get the result of the job and update the topic parameters
9: end while

Algorithm 5 Distributed Online HDP - Mapper
1: The mapper gets one document s ∈ 1, . . . , S
2: repeat
3:   Compute document parameters γ1_dt, γ2_dt, ζdtk, ϕdnt
4: until converged
5: Output the sufficient statistics for the topic parameters

Algorithm 6 Distributed Online HDP - Reducer
1: The reducer gets the sufficient statistics for each topic parameter
2: Compute the change of the topic parameter from the sufficient statistics
3: Output the change of the topic parameter
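A toy sketch of the mapper/reducer pair in plain Python (the structure is an assumption for illustration; the actual system runs these as Hadoop MapReduce tasks, and the document-level helpers are the same placeholders as in the earlier EM sketch):

```python
def mapper(doc, corpus_params, T):
    """Fit document-level parameters, then emit sufficient statistics keyed by topic."""
    params = initialize_doc_params(T)
    while not converged(params):
        params = update_doc_params(doc, params, corpus_params)
    for k, stats in doc_sufficient_stats(doc, params).items():
        yield k, stats                   # key: topic index k

def reducer(k, stats_iter):
    """Aggregate per-document statistics into the change for topic k's parameters."""
    yield k, sum(stats_iter)
```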

Experimental Setup

Data: 973,266 Twitter conversations, 7.54 tweets per conversation on average (approximately 7,297,000 tweets)

60-node Hadoop system, each node with 8 x 2.30 GHz cores

Result

Distributed online HDP runs faster than online HDP and preserves the quality of the result (perplexity).

Practical Tips

Until now, I talked about Bayesian nonparametric topic modeling:
I The concept of Hierarchical Dirichlet Processes
I How to infer the latent variables in HDP

These are theoretical interests.

Someone who attended the last machine learning winter school said:
"Wow! There are good and interesting machine learning topics! But I want to know about practical issues, because I am in the industrial field."

So I prepared some tips for him/her and you.


Implementation

https://github.com/NoSyu/Topic_Models

JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 106 / 121

Some tips for using topic models

I How to manage the hyper-parameters (Dirichlet parameters)?
I How to manage the learning rate and mini-batch size in online learning?


HDP (graphical model figure omitted)

Property of the Dirichlet distribution: sample pmfs drawn from the Dirichlet distribution [BAFG10] (figure omitted)

Assign Dirichlet parameters

Dirichlet parameters less than 1:
I People usually use a few topics to write a document
I People usually do not use all topics
I Each topic usually uses a few words to represent itself
I Each topic does not use all words

We can assign weights to individual topics/words:
I Some topics are more general than others
I Some words are more general than others
I Words with positive/negative meaning appear in positive/negative sentiments [JO11]
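To see why parameters below 1 encourage this sparsity, a small NumPy experiment (illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
for alpha in (0.1, 1.0, 10.0):
    theta = rng.dirichlet(np.full(10, alpha), size=1000)
    # average mass on each sample's single largest component
    print(alpha, round(theta.max(axis=1).mean(), 2))
# alpha = 0.1 puts most mass on a few components (a document uses few topics);
# alpha = 10 spreads mass nearly uniformly over all ten.
```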



Compute learning rate ρe = (τ0 + e)^−κ where τ0 > 0, κ ∈ (0.5, 1]:

a1_k = (1 − ρe) a1_k + ρe (1 + D ∑_{t=1}^{T} ζdtk)
a2_k = (1 − ρe) a2_k + ρe (γ + D ∑_{t=1}^{T} ∑_{f=k+1}^{K} ζdtf)
λkv = (1 − ρe) λkv + ρe (ηv + D ∑_{n=1}^{N} ∑_{t=1}^{T} w^v_dn ϕdnt ζdtk)

Meaning of each parameter:
I τ0: slows down the early iterations of the algorithm
I κ: rate at which old values of the topic parameters are forgotten

The best setting depends on the dataset; usually we set τ0 = 1.0, κ = 0.7.
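A two-line illustration of how τ0 and κ shape the schedule (the settings shown are just examples):

```python
import numpy as np

e = np.arange(1, 6)
for tau0, kappa in ((1.0, 0.7), (16.0, 0.7), (1.0, 1.0)):
    print(tau0, kappa, np.round((tau0 + e) ** -kappa, 3))
# a larger tau0 damps the earliest updates; kappa controls how fast the step size decays
```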


Mini-batch size

When the mini-batch size is large, distributed online HDP runs faster, while perplexity remains similar across mini-batch sizes.

Summary

Bayesian Nonparametric Topic Modeling: Hierarchical Dirichlet Processes
I Chinese Restaurant Franchise
I Stick Breaking Construction

Posterior inference for HDP
I Gibbs Sampling
I Variational Inference
I Online Learning

Slides and other materials are uploaded at http://uilab.kaist.ac.kr/members/jinyeongbak
Implementations are updated at http://github.com/NoSyu/Topic_Models

Further Reading

Dirichlet Process
I Dirichlet Process
I Dirichlet distribution and Dirichlet Process + Indian Buffet Process

Bayesian Nonparametric model
I Machine Learning Summer School - Yee Whye Teh
I Machine Learning Summer School - Peter Orbanz
I Introductory article

Inference
I MCMC
I Variational Inference

Thank You!

JinYeong Bak
jy.bak@kaist.ac.kr, linkedin.com/in/jybak

Users & Information Lab, KAIST

References

Charles E. Antoniak, Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems, The Annals of Statistics (1974), 1152–1174.

Bela A. Frigyik, Amol Kapila, and Maya R. Gupta, Introduction to the Dirichlet distribution and related processes, Tech. Report UWEETR-2010-0006, Department of Electrical Engineering, University of Washington, Seattle, WA 98195, December 2010.

Christopher M. Bishop and Nasser M. Nasrabadi, Pattern Recognition and Machine Learning, vol. 1, Springer, New York, 2006.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research 3 (2003), 993–1022.

Emily B. Fox, Erik B. Sudderth, Michael I. Jordan, and Alan S. Willsky, An HDP-HMM for systems with state persistence, Proceedings of the 25th International Conference on Machine Learning, ACM, 2008, pp. 312–319.

Peter D. Hoff, A First Course in Bayesian Statistical Methods, Springer, 2009.

Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul, An introduction to variational methods for graphical models, Springer, 1998.

Yohan Jo and Alice H. Oh, Aspect and sentiment unification model for online review analysis, Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (WSDM '11), ACM, New York, NY, USA, 2011, pp. 815–824.

Radford M. Neal, Markov chain sampling methods for Dirichlet process mixture models, Journal of Computational and Graphical Statistics 9 (2000), no. 2, 249–265.

Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei, Hierarchical Dirichlet processes, Journal of the American Statistical Association 101 (2006), no. 476.

Chong Wang, John W. Paisley, and David M. Blei, Online variational inference for the hierarchical Dirichlet process, International Conference on Artificial Intelligence and Statistics, 2011, pp. 752–760.

Images source

http://christmasstockimages.com/free/ideas_concepts/slides/dice_throw.htm
http://www.flickr.com/photos/autumn2may/3965964418/
http://www.flickr.com/photos/ppix/1802571058/
http://yesurakezu.deviantart.com/art/Domo-s-head-exploding-with-dice-298452871
http://www.flickr.com/photos/jwight/2710392971/
http://www.flickr.com/photos/jasohill/2511594886/
http://en.wikipedia.org/wiki/Kim_Yuna
http://en.wikipedia.org/wiki/Hand_in_Hand_%28Olympics%29
http://en.wikipedia.org/wiki/Gangnam_Style

Measurable space (Ω, B)

Def) A set considered together with the σ-algebra on the set (http://mathworld.wolfram.com/MeasurableSpace.html).

Ω: the set of all outcomes, the sample space
B: a σ-algebra over Ω
I A special kind of collection of subsets of the sample space Ω
  F Closed under complement: if A is in B, then A^C is in B
  F Closed under countable unions and intersections: if A1 and A2 are in B, then A1∪A2 and A1∩A2 are in B
I A collection of events
I Properties
  F Smallest possible σ-algebra: {Ω, ∅}
  F Largest possible σ-algebra: the power set of Ω


Proof 1

Decimative property:
I Let (θ1, θ2, . . . , θK) ∼ Dir(α1, α2, . . . , αK) and (τ1, τ2) ∼ Dir(α1β1, α1β2) where β1 + β2 = 1; then (θ1τ1, θ1τ2, θ2, . . . , θK) ∼ Dir(α1β1, α1β2, α2, . . . , αK)

Then

(G(θ1), G(A1), . . . , G(AR)) = (β1, (1−β1)G′(A1), . . . , (1−β1)G′(AR)) ∼ Dir(1, α0G0(A1), . . . , α0G0(AR))

changes to

(G′(A1), . . . , G′(AR)) ∼ Dir(α0G0(A1), . . . , α0G0(AR)),   i.e. G′ ∼ DP(α0, G0)

using the decimative property with α1 = α0, θ1 = (1−β1), βk = G0(Ak), τk = G′(Ak).
