Bayesian Nonparametric Topic Modeling: Hierarchical Dirichlet Processes
JinYeong Bak
Department of Computer Science, KAIST, Daejeon, South Korea
August 22, 2013
Part of these slides is adapted from a presentation by Yee Whye Teh ([email protected]).
JinYeong Bak (U&I Lab), Bayesian Nonparametric Topic Modeling, August 22, 2013
Outline
1. Introduction: Motivation; Topic Modeling
2. Background: Dirichlet Distribution; Dirichlet Processes
3. Hierarchical Dirichlet Processes: Dirichlet Process Mixture Models; Hierarchical Dirichlet Processes
4. Inference: Gibbs Sampling; Variational Inference; Online Learning; Distributed Online Learning
5. Practical Tips
6. Summary
Introduction
Bayesian topic models:
- Latent Dirichlet Allocation (LDA) [BNJ03]
- Hierarchical Dirichlet Processes (HDP) [TJBB06]
In this talk:
- The Dirichlet distribution and the Dirichlet process
- The concept of Hierarchical Dirichlet Processes (HDP)
- How to infer the latent variables in HDP
Motivation
What are the topics discussed in the article?
How can we describe the topics?
Topic Modeling
Each topic has a word distribution.
Each document has a topic proportion.
Each word has its own topic index.
Latent Dirichlet Allocation
Generative process of LDA:
- For each topic k ∈ {1, ..., K}:
  - Draw a word distribution βk ∼ Dir(η)
- For each document d ∈ {1, ..., D}:
  - Draw topic proportions θd ∼ Dir(α)
  - For each word n ∈ {1, ..., N} in the document:
    - Draw a topic index zdn ∼ Mult(θd)
    - Generate the word from the chosen topic: wdn ∼ Mult(βzdn)
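The generative process above can be simulated directly. The sketch below is illustrative only: the corpus sizes and hyperparameters (K, D, N, V, alpha, eta) are arbitrary assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, N, V = 3, 5, 20, 50          # topics, documents, words per doc, vocabulary size
alpha, eta = 0.5, 0.1              # Dirichlet hyperparameters (illustrative)

# For each topic k: draw a word distribution beta_k ~ Dir(eta)
beta = rng.dirichlet(np.full(V, eta), size=K)        # shape (K, V)

docs = []
for d in range(D):
    theta_d = rng.dirichlet(np.full(K, alpha))       # topic proportions for document d
    z_d = rng.choice(K, size=N, p=theta_d)           # a topic index for each word
    w_d = np.array([rng.choice(V, p=beta[z]) for z in z_d])  # words from chosen topics
    docs.append((theta_d, z_d, w_d))
```

Inference then runs this process in reverse: given only the words `w_d`, recover `theta_d`, `z_d`, and `beta`.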
Latent Dirichlet Allocation
Our interests:
- What are the topics discussed in the article?
- How can we describe the topics?
Latent Dirichlet Allocation
What we can see: the words in documents.
What we want to see: the latent topic structure.
Latent Dirichlet Allocation
Our interests:
- What are the topics discussed in the article? => the topic proportion of each document
- How can we describe the topics? => the word distribution of each topic
Latent Dirichlet Allocation
What we can see: w
What we want to see: θ, z, β
∴ Compute the posterior
p(θ, z, β | w, α, η) = p(θ, z, β, w | α, η) / p(w | α, η)
But this distribution is intractable to compute (because of the normalization term p(w | α, η)), so we use approximate inference methods:
- Gibbs sampling
- Variational inference
Limitation of Latent Dirichlet Allocation
Latent Dirichlet Allocation is a parametric model:
- One must specify the number of topics for a corpus in advance
- One must search for the best number of topics
Q) Can we infer the number of topics from the data automatically?
A) Hierarchical Dirichlet Processes
Dice modeling
Think about the probability of rolling each number with a set of dice.
Each die has its own pmf.
According to textbooks, it is widely assumed to be uniform => 1/6 for a six-sided die.
Is it true?
Ans) No!
Dice modeling
We should model the randomness of the pmf of each die. How can we do that?
- Imagine a bag containing many dice
- We cannot see inside the bag
- We can draw one die out of the bag
OK, but what is the formal description?
Standard Simplex
A generalization of the notion of a triangle or tetrahedron.
All points have non-negative coordinates that sum to 1. [1]
A pmf can be thought of as a point in the standard simplex.
Ex) A point p = (x, y, z), where x ≥ 0, y ≥ 0, z ≥ 0 and x + y + z = 1.
[1] http://en.wikipedia.org/wiki/Simplex
Dirichlet distribution
Definition [BN06]:
- A probability distribution over the (K − 1)-dimensional standard simplex
- A distribution over pmfs of length K
Notation:
θ ∼ Dir(α)
where θ = [θ1, ..., θK] is a random pmf and α = [α1, ..., αK]
Probability density function:
p(θ; α) = (Γ(∑_{k=1}^K αk) / ∏_{k=1}^K Γ(αk)) ∏_{k=1}^K θk^{αk − 1}
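A quick numerical sketch of this definition: a draw from Dir(α) is a random pmf that lies in the standard simplex, and its mean is αk / ∑k αk. The α vector below is an arbitrary example.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = np.array([2.0, 3.0, 5.0])              # illustrative parameter vector
samples = rng.dirichlet(alpha, size=100_000)   # each row is one random pmf

# Every sample lies in the simplex: non-negative coordinates summing to 1
assert (samples >= 0).all()

# The empirical mean approaches alpha / alpha.sum() = (0.2, 0.3, 0.5)
mean_theta = samples.mean(axis=0)
```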
Latent Dirichlet Allocation (figure)
Property of Dirichlet distribution
Density plots [BAFG10] (figure)
Sample pmfs drawn from the Dirichlet distribution [BAFG10] (figure)
Property of Dirichlet distribution
When K = 2, it is the Beta distribution.
The Dirichlet is the conjugate prior for the multinomial distribution:
- Likelihood: X ∼ Mult(n, θ); prior: θ ∼ Dir(α)
- ∴ Posterior: (θ | X) ∼ Dir(α + x), where x = (x1, ..., xK) are the observed counts
- Proof)
p(θ | X) = p(X | θ) p(θ) / p(X)
∝ p(X | θ) p(θ)
= (n! / (x1! ··· xK!)) ∏_{k=1}^K θk^{xk} · (Γ(∑_{k=1}^K αk) / ∏_{k=1}^K Γ(αk)) ∏_{k=1}^K θk^{αk − 1}
= C ∏_{k=1}^K θk^{αk + xk − 1}
∝ Dir(α + x)
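The conjugacy above makes the posterior update a one-line computation: observing counts x turns the prior Dir(α) into Dir(α + x). A minimal sketch, with an arbitrary true pmf and sample size:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = np.array([1.0, 1.0, 1.0])        # prior theta ~ Dir(alpha)
theta_true = np.array([0.2, 0.3, 0.5])   # illustrative "true" pmf
x = rng.multinomial(1000, theta_true)    # observed counts, X ~ Mult(n, theta)

alpha_post = alpha + x                   # conjugate update: posterior Dir(alpha + x)
posterior_mean = alpha_post / alpha_post.sum()
```

With 1000 observations and a weak prior, the posterior mean sits close to the true pmf.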
Property of Dirichlet distribution
Aggregation property:
- Let (θ1, θ2, ..., θK) ∼ Dir(α1, α2, ..., αK); then (θ1 + θ2, θ3, ..., θK) ∼ Dir(α1 + α2, α3, ..., αK)
- In general, if A1, ..., AR is any partition of {1, ..., K}, then (∑_{k∈A1} θk, ..., ∑_{k∈AR} θk) ∼ Dir(∑_{k∈A1} αk, ..., ∑_{k∈AR} αk)
Decimative property:
- Let (θ1, θ2, ..., θK) ∼ Dir(α1, α2, ..., αK) and (τ1, τ2) ∼ Dir(α1 β1, α1 β2) where β1 + β2 = 1; then (θ1 τ1, θ1 τ2, θ2, ..., θK) ∼ Dir(α1 β1, α1 β2, α2, ..., αK)
Neutrality property:
- Let (θ1, θ2, ..., θK) ∼ Dir(α1, α2, ..., αK); then θk is independent of the vector (1/(1 − θk)) (θ1, ..., θk−1, θk+1, ..., θK)
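The aggregation property can be checked by Monte Carlo: summing coordinates of a Dirichlet draw over a partition should match a Dirichlet with summed parameters. The parameters and partition below are an arbitrary example.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha = np.array([1.0, 2.0, 3.0, 4.0])
theta = rng.dirichlet(alpha, size=200_000)

# Partition {1,2} and {3,4}: (theta1 + theta2, theta3 + theta4) ~ Dir(3, 7)
agg = np.stack([theta[:, 0] + theta[:, 1], theta[:, 2] + theta[:, 3]], axis=1)
direct = rng.dirichlet(np.array([3.0, 7.0]), size=200_000)

# Both empirical means should be close to the Dir(3, 7) mean (0.3, 0.7)
agg_mean, direct_mean = agg.mean(axis=0), direct.mean(axis=0)
```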
Dice modeling
Think about the probability of rolling each number with a set of dice.
- Each die has its own pmf
- Draw a die from a bag
Problem) We do not know the number of faces of the dice in the bag.
Solution) The Dirichlet process.
Dirichlet Process
Definition [BAFG10]:
- A distribution over probability measures
- A distribution whose realizations are themselves distributions over a sample space
Formal definition:
- (Ω, B) is a measurable space
- G0 is a distribution over the sample space Ω
- α0 is a positive real number
- G is a random probability measure over (Ω, B)
We write G ∼ DP(α0, G0) if, for any finite measurable partition (A1, ..., AR) of Ω,
(G(A1), ..., G(AR)) ∼ Dir(α0 G0(A1), ..., α0 G0(AR))
Posterior Dirichlet Processes
G ∼ DP(α0, G0) can be treated as a random distribution over Ω, so we can draw a sample θ1 from G.
For any finite partition (A1, ..., AR) of Ω,
p(θ1 ∈ Ar | G) = G(Ar), p(θ1 ∈ Ar) = G0(Ar)
(G(A1), ..., G(AR)) ∼ Dir(α0 G0(A1), ..., α0 G0(AR))
Using Dirichlet-multinomial conjugacy, the posterior is
(G(A1), ..., G(AR)) | θ1 ∼ Dir(α0 G0(A1) + δθ1(A1), ..., α0 G0(AR) + δθ1(AR))
where δθ(Ar) = 1 if θ ∈ Ar and 0 otherwise.
This holds for every finite partition of Ω.
Posterior Dirichlet Processes
For every finite partition of Ω,
(G(A1), ..., G(AR)) | θ1 ∼ Dir(α0 G0(A1) + δθ1(A1), ..., α0 G0(AR) + δθ1(AR))
where δθ1(Ar) = 1 if θ1 ∈ Ar and 0 otherwise.
The posterior process is also a Dirichlet process:
G | θ1 ∼ DP(α0 + 1, (α0 G0 + δθ1)/(α0 + 1))
Summary)
θ1 | G ∼ G, G ∼ DP(α0, G0)  ⇐⇒  θ1 ∼ G0, G | θ1 ∼ DP(α0 + 1, (α0 G0 + δθ1)/(α0 + 1))
Blackwell-MacQueen Urn Scheme
Now we draw samples θ1, ..., θN.
First sample:
θ1 | G ∼ G, G ∼ DP(α0, G0)  ⇐⇒  θ1 ∼ G0, G | θ1 ∼ DP(α0 + 1, (α0 G0 + δθ1)/(α0 + 1))
Second sample:
θ2 | θ1, G ∼ G, G | θ1 ∼ DP(α0 + 1, (α0 G0 + δθ1)/(α0 + 1))
⇐⇒  θ2 | θ1 ∼ (α0 G0 + δθ1)/(α0 + 1), G | θ1, θ2 ∼ DP(α0 + 2, (α0 G0 + δθ1 + δθ2)/(α0 + 2))
Blackwell-MacQueen Urn Scheme
N-th sample:
θN | θ1,...,N−1, G ∼ G, G | θ1,...,N−1 ∼ DP(α0 + N − 1, (α0 G0 + ∑_{n=1}^{N−1} δθn)/(α0 + N − 1))
⇐⇒  θN | θ1,...,N−1 ∼ (α0 G0 + ∑_{n=1}^{N−1} δθn)/(α0 + N − 1), G | θ1,...,N ∼ DP(α0 + N, (α0 G0 + ∑_{n=1}^{N} δθn)/(α0 + N))
Blackwell-MacQueen Urn Scheme
The Blackwell-MacQueen urn scheme produces a sequence θ1, θ2, ... with the conditionals
θN | θ1,...,N−1 ∼ (α0 G0 + ∑_{n=1}^{N−1} δθn)/(α0 + N − 1)
Pólya urn analogy:
- There are infinitely many ball colors, distributed according to G0
- The urn starts empty
- Filling the urn (n starts at 1):
  - With probability α0/(α0 + n − 1), pick a new color from the set of infinite ball colors G0, paint a new ball that color, and add it to the urn
  - With probability (n − 1)/(α0 + n − 1), pick a ball from the urn, record its color, and put it back together with another ball of the same color
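The urn scheme translates directly into a sampler. A minimal sketch, using a standard normal as an illustrative base measure G0 (any distribution would do) and an arbitrary concentration α0:

```python
import numpy as np

rng = np.random.default_rng(4)
alpha0, N = 1.0, 500

draws = []                      # theta_1, ..., theta_N
for n in range(1, N + 1):
    if rng.random() < alpha0 / (alpha0 + n - 1):
        # New "color": a fresh draw from the base measure G0 = N(0, 1)
        draws.append(rng.normal())
    else:
        # Reuse a previous draw, chosen uniformly (equivalent to the
        # urn: probability of a value is proportional to its count)
        draws.append(draws[rng.integers(len(draws))])

num_distinct = len(set(draws))  # grows roughly like alpha0 * log(N)
```

The draws repeat values, which is exactly the clustering behavior the next slides exploit.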
Chinese Restaurant Process
Draw θ1, θ2, ..., θN from the Blackwell-MacQueen urn scheme.
The θ's can take the same value, θi = θj, so there are K ≤ N distinct values φ1, ..., φK.
This works as a partition of Ω: θ1, θ2, ..., θN induce φ1, ..., φK.
The distribution over such partitions is called the Chinese Restaurant Process (CRP).
Chinese Restaurant Process
Chinese restaurant interpretation:
- A Chinese restaurant has infinitely many tables
- Each customer sits at a table
Generating from the Chinese Restaurant Process:
- The first customer sits at the first table
- The n-th customer sits at:
  - a new table with probability α0/(α0 + n − 1)
  - table k with probability nk/(α0 + n − 1), where nk is the number of customers at table k
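The seating rule above can be sketched as a table-assignment sampler; the concentration α0 and sequence length are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)
alpha0, N = 2.0, 1000

counts = []                     # customers per occupied table
assignments = []
for n in range(1, N + 1):
    # Existing tables with probability n_k/(alpha0+n-1), new table with alpha0/(alpha0+n-1)
    probs = np.array(counts + [alpha0]) / (alpha0 + n - 1)
    table = rng.choice(len(probs), p=probs)
    if table == len(counts):
        counts.append(1)        # open a new table
    else:
        counts[table] += 1
    assignments.append(table)

K = len(counts)                 # number of occupied tables (clusters)
```

K grows slowly with N (roughly α0 log N), which is what lets the DP pick the number of clusters from the data.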
Chinese Restaurant Process
The CRP exhibits the clustering property of the DP:
- Tables are clusters: φk ∼ G0
- Customers are the actual realizations: θn = φzn, where zn ∈ {1, ..., K}
Stick Breaking Construction
The Blackwell-MacQueen urn scheme / CRP generates draws θ ∼ G, not G itself.
To construct G itself, we use the stick-breaking construction.
Review) Posterior Dirichlet processes:
θ1 | G ∼ G, G ∼ DP(α0, G0)  ⇐⇒  θ1 ∼ G0, G | θ1 ∼ DP(α0 + 1, (α0 G0 + δθ1)/(α0 + 1))
Consider the partition ({θ1}, Ω\{θ1}) of Ω. Then
(G({θ1}), G(Ω\{θ1})) ∼ Dir((α0 G0 + δθ1)({θ1}), (α0 G0 + δθ1)(Ω\{θ1}))
= Dir(1, α0) = Beta(1, α0)
(assuming G0 is non-atomic, so G0({θ1}) = 0)
Stick Breaking Construction
Consider the partition ({θ1}, Ω\{θ1}) of Ω. Then
(G({θ1}), G(Ω\{θ1})) = (β1, 1 − β1), β1 ∼ Beta(1, α0)
so G has a point mass located at θ1:
G = β1 δθ1 + (1 − β1) G′, β1 ∼ Beta(1, α0)
where G′ is the probability measure with the point mass at θ1 removed.
What is G′?
Stick Breaking Construction
Summary) Posterior Dirichlet processes:
θ1 | G ∼ G, G ∼ DP(α0, G0)  ⇐⇒  θ1 ∼ G0, G | θ1 ∼ DP(α0 + 1, (α0 G0 + δθ1)/(α0 + 1))
G = β1 δθ1 + (1 − β1) G′, β1 ∼ Beta(1, α0)
Consider a further partition ({θ1}, A1, ..., AR) of Ω:
(G({θ1}), G(A1), ..., G(AR)) = (β1, (1 − β1) G′(A1), ..., (1 − β1) G′(AR))
∼ Dir(1, α0 G0(A1), ..., α0 G0(AR))
Using the decimative property of the Dirichlet distribution (proof):
(G′(A1), ..., G′(AR)) ∼ Dir(α0 G0(A1), ..., α0 G0(AR))
∴ G′ ∼ DP(α0, G0)
Stick Breaking Construction
Repeat this construction with the distinct values φ1, φ2, ...:
G ∼ DP(α0, G0)
G = β1 δφ1 + (1 − β1) G′1
G = β1 δφ1 + (1 − β1)(β2 δφ2 + (1 − β2) G′2)
...
G = ∑_{k=1}^∞ πk δφk
where
πk = βk ∏_{i=1}^{k−1} (1 − βi), ∑_{k=1}^∞ πk = 1, βk ∼ Beta(1, α0), φk ∼ G0
A draw from the DP thus looks like a sum of point masses, with masses drawn from a stick-breaking construction.
Stick Breaking Construction
Summary)
G = ∑_{k=1}^∞ πk δφk
πk = βk ∏_{i=1}^{k−1} (1 − βi), ∑_{k=1}^∞ πk = 1, βk ∼ Beta(1, α0), φk ∼ G0
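The stick-breaking formulas above translate into a few lines of code. A truncated sketch (in practice the infinite sum is cut off at a truncation level T; G0 = N(0, 1), α0, and T below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
alpha0, T = 1.0, 200            # concentration and truncation level

beta = rng.beta(1.0, alpha0, size=T)     # beta_k ~ Beta(1, alpha0)
# pi_k = beta_k * prod_{i<k} (1 - beta_i): break off a beta_k fraction
# of the stick that remains after the first k-1 breaks
remaining = np.concatenate([[1.0], np.cumprod(1.0 - beta)[:-1]])
pi = beta * remaining                    # atom weights
phi = rng.normal(size=T)                 # atom locations phi_k ~ G0

mass = pi.sum()                          # approaches 1 as T grows
```

The pair (pi, phi) is a truncated draw of G = ∑k πk δφk; with α0 = 1 and T = 200 the leftover stick mass is negligible.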
Summary of DP
Definition:
- G is a random probability measure over (Ω, B); G ∼ DP(α0, G0) if for any finite measurable partition (A1, ..., AR) of Ω, (G(A1), ..., G(AR)) ∼ Dir(α0 G0(A1), ..., α0 G0(AR))
Chinese Restaurant Process
Stick Breaking Construction
Outline
1 Introduction: Motivation, Topic Modeling
2 Background: Dirichlet Distribution, Dirichlet Processes
3 Hierarchical Dirichlet Processes: Dirichlet Process Mixture Models, Hierarchical Dirichlet Processes
4 Inference: Gibbs Sampling, Variational Inference, Online Learning, Distributed Online Learning
5 Practical Tips
6 Summary
Dirichlet Process Mixture Models
We model a data set x1, . . . , xN using the following model [Nea00]:
xn ∼ F(θn)
θn ∼ G
G ∼ DP(α0, G0)
Each θn is a latent parameter modelling xn, while G is the unknown distribution over parameters, modelled using a DP.
Dirichlet Process Mixture Models
Since G is of the form
G = ∑_{k=1}^∞ πk δφk
we have θn = φk with probability πk.
Let kn take on value k with probability πk. We can equivalently define θn = φ_{k_n}.
An equivalent model:
xn ∼ F(θn), θn ∼ G, G ∼ DP(α0, G0)
⇐⇒
xn ∼ F(φ_{k_n}), p(kn = k) = πk, πk = βk ∏_{i=1}^{k−1} (1−βi), βk ∼ Beta(1, α0), φk ∼ G0
Dirichlet Process Mixture Models
xn ∼ F(θn), θn ∼ G, G ∼ DP(α0, G0)
⇐⇒
xn ∼ F(φ_{k_n}), p(kn = k) = πk, πk = βk ∏_{i=1}^{k−1} (1−βi), βk ∼ Beta(1, α0), φk ∼ G0
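The equivalent model on the right can be sampled from directly. The sketch below is my own illustration (assuming NumPy): the choices G0 = N(0, 3²) for the base measure and F(φ) = N(φ, 1) for the component likelihood are hypothetical, not prescribed by the slides.

```python
import numpy as np

def sample_dpmm(n, alpha0=1.0, truncation=200, rng=None):
    """Generate x_1..x_n from a truncated DP mixture via the equivalent
    model: beta_k ~ Beta(1, alpha0), pi_k = beta_k prod_{i<k}(1 - beta_i),
    phi_k ~ G0, p(k_n = k) = pi_k, x_n ~ F(phi_{k_n})."""
    rng = np.random.default_rng(rng)
    betas = rng.beta(1.0, alpha0, size=truncation)
    pi = betas * np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    pi = pi / pi.sum()                            # renormalise truncated weights
    phi = rng.normal(0.0, 3.0, size=truncation)   # phi_k ~ G0 = N(0, 3^2)
    k = rng.choice(truncation, size=n, p=pi)      # p(k_n = k) = pi_k
    x = rng.normal(phi[k], 1.0)                   # x_n ~ F(phi_{k_n}) = N(phi, 1)
    return x, k

x, k = sample_dpmm(1000, alpha0=1.0, rng=42)
```

Even with 1000 data points, only a handful of the mixture components are actually used, which is the clustering effect of the DP prior.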
Outline
1 Introduction: Motivation, Topic Modeling
2 Background: Dirichlet Distribution, Dirichlet Processes
3 Hierarchical Dirichlet Processes: Dirichlet Process Mixture Models, Hierarchical Dirichlet Processes
4 Inference: Gibbs Sampling, Variational Inference, Online Learning, Distributed Online Learning
5 Practical Tips
6 Summary
Topic modeling with documents
- Each document consists of a bag of words
- Each word in a document has a latent topic index
- Latent topics for words in a document can be grouped
- Each document has a topic proportion
- Each topic has a word distribution
- Topics must be shared across documents
Problem of Naive Dirichlet Process Mixture Model
Use a DP mixture for each document:
xdn ∼ F(θdn), θdn ∼ Gd, Gd ∼ DP(α0, G0)
But there is no sharing of clusters across different groups, because G0 is smooth (continuous): the atoms of different Gd almost surely differ.
G1 = ∑_{k=1}^∞ π1k δφ_{1k}, G2 = ∑_{k=1}^∞ π2k δφ_{2k}
φ1k, φ2k ∼ G0
Problem of Naive Dirichlet Process Mixture Model
Solution:
- Make the base distribution G0 discrete
- Put a DP prior on the common base distribution
Hierarchical Dirichlet Process:
G0 ∼ DP(γ, H)
G1, G2 | G0 ∼ DP(α0, G0)
Hierarchical Dirichlet Processes
Making G0 discrete forces clusters to be shared between G1 and G2.
Stick Breaking Construction
A Hierarchical Dirichlet Process with documents 1, . . . , D:
G0 ∼ DP(γ, H)
Gd | G0 ∼ DP(α0, G0)
The stick-breaking construction for the HDP:
G0 = ∑_{k=1}^∞ βk δφk, φk ∼ H
βk = β′k ∏_{i=1}^{k−1} (1−β′i), β′k ∼ Beta(1, γ)
Gd = ∑_{k=1}^∞ πdk δφk
πdk = π′dk ∏_{i=1}^{k−1} (1−π′di), π′dk ∼ Beta(α0βk, α0(1−∑_{i=1}^k βi))
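The HDP stick-breaking construction can also be sketched in a few lines. The code below is my own illustration (assuming NumPy; the function name and truncation are not from the slides): it draws the corpus-level weights βk and then, for each document, weights πdk over the *same* atoms, which is exactly how sharing arises.

```python
import numpy as np

def hdp_weights(gamma, alpha0, n_docs, truncation=50, rng=None):
    """Corpus-level weights beta_k and per-document weights pi_dk for a
    truncated HDP: beta'_k ~ Beta(1, gamma), beta_k = beta'_k prod_{i<k}(1-beta'_i);
    pi'_dk ~ Beta(alpha0*beta_k, alpha0*(1 - sum_{i<=k} beta_i)).

    All documents share the same atoms phi_k; only the weights pi_d differ."""
    rng = np.random.default_rng(rng)
    b = rng.beta(1.0, gamma, size=truncation)                     # beta'_k
    beta = b * np.concatenate(([1.0], np.cumprod(1.0 - b)[:-1]))  # beta_k
    tail = np.clip(1.0 - np.cumsum(beta), 1e-12, None)            # 1 - sum_{i<=k} beta_i
    pis = np.empty((n_docs, truncation))
    for d in range(n_docs):
        pp = rng.beta(alpha0 * beta, alpha0 * tail)               # pi'_dk
        pis[d] = pp * np.concatenate(([1.0], np.cumprod(1.0 - pp)[:-1]))
    return beta, pis

beta, pis = hdp_weights(gamma=1.0, alpha0=1.0, n_docs=5, rng=1)
```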
Chinese Restaurant Franchise
Gd | G0 ∼ DP(α0, G0), θdn ∼ Gd
Draw θd1, θd2, . . . from a Blackwell-MacQueen urn scheme.
θd1, θd2, . . . induce the table values φd1, φd2, . . .
Chinese Restaurant Franchise
Gd | G0 ∼ DP(α0, G0), θdn ∼ Gd
Draw θd1, θd2, . . . from a Blackwell-MacQueen urn scheme; θd1, θd2, . . . induce φd1, φd2, . . .
Draw θd′1, θd′2, . . . from a Blackwell-MacQueen urn scheme; θd′1, θd′2, . . . induce φd′1, φd′2, . . .
Chinese Restaurant Franchise
G0 ∼ DP(γ, H), φk ∼ H
Gd | G0 ∼ DP(α0, G0), θdn ∼ Gd
Draw θd1, θd2, . . . from a Blackwell-MacQueen urn scheme; θd1, θd2, . . . induce φd1, φd2, . . .
Draw θd′1, θd′2, . . . from a Blackwell-MacQueen urn scheme; θd′1, θd′2, . . . induce φd′1, φd′2, . . .
Chinese Restaurant Franchise
Chinese Restaurant Franchise interpretation:
- Each restaurant has infinitely many tables
- All restaurants share the same food menu
- Each customer sits at a table
Generating from the Chinese Restaurant Franchise, for each restaurant:
- The first customer sits at the first table and chooses a new menu item
- The n-th customer sits at
  - a new table with probability α0 / (α0 + n − 1)
  - table t with probability n_{dt} / (α0 + n − 1), where n_{dt} is the number of customers at table t
- A customer starting a new table chooses
  - a new menu item with probability γ / (γ + m − 1)
  - existing menu item k with probability m_k / (γ + m − 1), where m is the number of tables in all restaurants and m_k is the number of tables serving menu item k in all restaurants
Chinese Restaurant Franchise
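The seating process above can be simulated directly. The sketch below is my own illustration in plain Python (no external dependencies; function name and bookkeeping are mine, and it uses the m/(γ+m) convention for the menu choice): customers are words, restaurants are documents, and menu items are topics shared across restaurants.

```python
import random

def chinese_restaurant_franchise(n_customers_per_doc, alpha0, gamma, seed=0):
    """Simulate table/menu assignments in a Chinese Restaurant Franchise.
    Returns, per restaurant, the menu (topic) index of each customer, plus
    m_k: the number of tables serving each menu item over all restaurants."""
    rnd = random.Random(seed)
    menu_counts = []                  # m_k over all restaurants
    topics = []
    for n_cust in n_customers_per_doc:
        table_counts = []             # n_dt: customers at each table here
        table_dish = []               # menu item served at each table
        assignments = []
        for _ in range(n_cust):
            # table t w.p. n_dt, new table w.p. alpha0 (normalised implicitly)
            weights = table_counts + [alpha0]
            t = rnd.choices(range(len(weights)), weights=weights)[0]
            if t == len(table_counts):        # new table: pick its menu item
                mweights = menu_counts + [gamma]
                k = rnd.choices(range(len(mweights)), weights=mweights)[0]
                if k == len(menu_counts):     # brand-new menu item
                    menu_counts.append(0)
                menu_counts[k] += 1
                table_counts.append(0)
                table_dish.append(k)
            table_counts[t] += 1
            assignments.append(table_dish[t])
        topics.append(assignments)
    return topics, menu_counts

topics, menu_counts = chinese_restaurant_franchise([50, 50, 50], alpha0=1.0, gamma=1.0)
```

Because the menu is shared, the same topic indices reappear across all three simulated restaurants.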
HDP for Topic modeling
Questions:
- What can we assume about the topics in a document?
- What can we assume about the words in the topics?
Solution:
- Each document consists of a bag of words
- Each word in a document has a latent topic
- Latent topics for words in a document can be grouped
- Each document has a topic proportion
- Each topic has a word distribution
- Topics must be shared across documents
Outline
1 Introduction: Motivation, Topic Modeling
2 Background: Dirichlet Distribution, Dirichlet Processes
3 Hierarchical Dirichlet Processes: Dirichlet Process Mixture Models, Hierarchical Dirichlet Processes
4 Inference: Gibbs Sampling, Variational Inference, Online Learning, Distributed Online Learning
5 Practical Tips
6 Summary
Gibbs Sampling
Definition:
- A special case of the Markov chain Monte Carlo (MCMC) method
- An iterative algorithm that constructs a dependent sequence of parameter values whose distribution converges to the target joint posterior distribution [Hof09]
Algorithm:
- Find the full conditional distribution of each latent variable under the target distribution
- Initialize all latent variables
- Until converged: sample each latent variable from its full conditional distribution
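The algorithm above is easiest to see on a toy model where both full conditionals are known in closed form. The example below is hypothetical (not from the slides, assuming NumPy): Gibbs sampling for a bivariate normal with correlation ρ, where X | Y=y ∼ N(ρy, 1−ρ²) and symmetrically for Y.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_iter=5000, burn_in=500, rng=None):
    """Gibbs sampling for (X, Y) ~ N(0, [[1, rho], [rho, 1]]).
    Each iteration samples one variable from its full conditional in turn."""
    rng = np.random.default_rng(rng)
    x = y = 0.0                          # initialise latent variables
    sd = np.sqrt(1.0 - rho ** 2)
    samples = []
    for i in range(n_iter):
        x = rng.normal(rho * y, sd)      # sample x | y
        y = rng.normal(rho * x, sd)      # sample y | x
        if i >= burn_in:                 # discard burn-in draws
            samples.append((x, y))
    return np.array(samples)

samples = gibbs_bivariate_normal(rho=0.8, rng=0)
```

After burn-in, the empirical correlation of the dependent sample sequence approaches the target's ρ.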
Collapsed Gibbs sampling
A collapsed Gibbs sampler integrates out one or more variables when sampling the others.
Example) There are three latent variables A, B and C.
- Plain Gibbs: sample p(A|B,C), p(B|A,C) and p(C|A,B) sequentially.
- When we integrate out B: sample only p(A|C) and p(C|A) sequentially.
Review) Dirichlet Process Mixture Models
xn ∼ F(θn), θn ∼ G, G ∼ DP(α0, G0)
⇐⇒
xn ∼ F(φ_{k_n}), p(kn = k) = πk, πk = βk ∏_{i=1}^{k−1} (1−βi), βk ∼ Beta(1, α0), φk ∼ G0
Review) Blackwell-MacQueen Urn Scheme for DP
N-th sample:
θN | θ1,...,N−1, G ∼ G,  G | θ1,...,N−1 ∼ DP(α0 + N − 1, (α0G0 + ∑_{n=1}^{N−1} δθn) / (α0 + N − 1))
⇐⇒ θN | θ1,...,N−1 ∼ (α0G0 + ∑_{n=1}^{N−1} δθn) / (α0 + N − 1),
G | θ1,...,N ∼ DP(α0 + N, (α0G0 + ∑_{n=1}^N δθn) / (α0 + N))
Review) Chinese Restaurant Franchise
Generating from the Chinese Restaurant Franchise, for each restaurant:
- The first customer sits at the first table and chooses a new menu item
- The n-th customer sits at
  - a new table with probability α0 / (α0 + n − 1)
  - table t with probability n_{dt} / (α0 + n − 1), where n_{dt} is the number of customers at table t
- A customer starting a new table chooses
  - a new menu item with probability γ / (γ + m − 1)
  - existing menu item k with probability m_k / (γ + m − 1), where m is the number of tables in all restaurants and m_k is the number of tables serving menu item k in all restaurants
Alternative form of HDP
G0 ∼ DP(γ, H), φdt ∼ G0
∴ G0 | φdt, . . . ∼ DP(γ + m, (γH + ∑_{k=1}^K m_k δφk) / (γ + m))
Then G0 is given as
G0 = ∑_{k=1}^K πk δφk + πu Gu
where
Gu ∼ DP(γ, H)
π = (π1, . . . , πK, πu) ∼ Dir(m_1, . . . , m_K, γ)
p(φk | ·) ∝ h(φk) ∏_{dn: zdn=k} f(xdn | φk)
Hierarchical Dirichlet Processes
xdn ∼ F(θdn), θdn ∼ Gd, Gd ∼ DP(α0, G0), G0 ∼ DP(γ, H)
⇐⇒
xdn ∼ Mult(φ_{z_dn}), zdn ∼ Mult(θd), φk ∼ Dir(η), θd ∼ Dir(α0π), π ∼ Dir(m.1, . . . , m.K, γ)
Gibbs Sampling for HDP
Joint distribution:
p(θ, z, φ, x, π, m | α0, η, γ) = p(π | m, γ) ∏_{k=1}^K p(φk | η) ∏_{d=1}^D [ p(θd | α0, π) ∏_{n=1}^N p(zdn | θd) p(xdn | zdn, φ) ]
Integrate out θ, φ:
p(z, x, π, m | α0, η, γ) = [Γ(∑_{k=1}^K m.k + γ) / (∏_{k=1}^K Γ(m.k) Γ(γ))] ∏_{k=1}^K π_k^{m.k−1} · π_{K+1}^{γ−1}
· ∏_{k=1}^K [Γ(∑_{v=1}^V ηv) / ∏_{v=1}^V Γ(ηv)] · [∏_{v=1}^V Γ(ηv + n^k_{(·),v}) / Γ(∑_{v=1}^V (ηv + n^k_{(·),v}))]
· ∏_{d=1}^M [Γ(∑_{k=1}^K α0πk) / ∏_{k=1}^K Γ(α0πk)] · [∏_{k=1}^K Γ(α0πk + n^k_{d,(·)}) / Γ(∑_{k=1}^K (α0πk + n^k_{d,(·)}))]
Gibbs Sampling for HDP
Full conditional distribution of z:
p(z_{(d′,n′)} = k′ | z_{−(d′,n′)}, m, π, x, ·) = p(z_{(d′,n′)} = k′, z_{−(d′,n′)}, m, π, x | ·) / p(z_{−(d′,n′)}, m, π, x | ·)
∝ p(z_{(d′,n′)} = k′, z_{−(d′,n′)}, m, π, x | ·)
∝ (α0πk′ + n^{k′,−(d′,n′)}_{d′,(·)}) · (ηv′ + n^{k′,−(d′,n′)}_{(·),v′}) / (∑_{v=1}^V (ηv + n^{k′,−(d′,n′)}_{(·),v}))
Gibbs Sampling for HDP
Full conditional distribution of m. The probability that word x_{d′n′} is assigned to some table t such that k_{dt} = k:
p(θ_{d′n′} = φt | φdt = φk, θ_{−(d′,n′)}, π) ∝ n^{−(d′,n′)}_{d′,(·),t}
p(θ_{d′n′} = new table | φ_{dt_new} = φk, θ_{−(d′,n′)}, π) ∝ α0πk
These equations form a Dirichlet process with concentration parameter α0πk and assignments of n^{−(d′,n′)}_{d′,(·),t} customers to components. The corresponding distribution over the number of components is the desired conditional distribution of m_{dk}.
Antoniak [Ant74] has shown that
p(m_{d′k′} = m | z, m_{−d′k′}, π) = [Γ(α0πk′) / Γ(α0πk′ + n^{k′}_{d′,(·)})] · s(n^{k′}_{d′,(·)}, m) · (α0πk′)^m
where s(n, m) is the unsigned Stirling number of the first kind.
Gibbs Sampling for HDP
Full conditional distribution of π:
(π1, π2, . . . , πK, πu) | · ∼ Dir(m.1, m.2, . . . , m.K, γ)
Gibbs Sampling for HDP

Algorithm 1 Gibbs Sampling for HDP
1: Initialize all latent variables randomly
2: repeat
3:   for each document d do
4:     for each word n in document d do
5:       Sample z_{(d,n)} from its full conditional: p(z_{(d,n)} = k′) ∝ (α0πk′ + n^{k′,−(d,n)}_{d,(·)}) (ηv′ + n^{k′,−(d,n)}_{(·),v′}) / (∑_{v=1}^V (ηv + n^{k′,−(d,n)}_{(·),v}))
6:     end for
7:     Sample m from p(m_{dk′} = m) ∝ [Γ(α0πk′) / Γ(α0πk′ + n^{k′}_{d,(·)})] s(n^{k′}_{d,(·)}, m)(α0πk′)^m
8:     Sample π ∼ Dir(m.1, m.2, . . . , m.K, γ)
9:   end for
10: until converged
Outline
1 Introduction: Motivation, Topic Modeling
2 Background: Dirichlet Distribution, Dirichlet Processes
3 Hierarchical Dirichlet Processes: Dirichlet Process Mixture Models, Hierarchical Dirichlet Processes
4 Inference: Gibbs Sampling, Variational Inference, Online Learning, Distributed Online Learning
5 Practical Tips
6 Summary
Stick Breaking Construction
A Hierarchical Dirichlet Process with documents 1, . . . , D:
G0 ∼ DP(γ, H)
Gd | G0 ∼ DP(α0, G0)
The stick-breaking construction for the HDP:
G0 = ∑_{k=1}^∞ βk δφk, φk ∼ H
βk = β′k ∏_{i=1}^{k−1} (1−β′i), β′k ∼ Beta(1, γ)
Gd = ∑_{k=1}^∞ πdk δφk
πdk = π′dk ∏_{i=1}^{k−1} (1−π′di), π′dk ∼ Beta(α0βk, α0(1−∑_{i=1}^k βi))
Alternative Stick Breaking Construction
Problem) In the original stick-breaking construction, the weights βk and πdk are tightly correlated:
βk = β′k ∏_{i=1}^{k−1} (1−β′i), β′k ∼ Beta(1, γ)
πdk = π′dk ∏_{i=1}^{k−1} (1−π′di), π′dk ∼ Beta(α0βk, α0(1−∑_{i=1}^k βi))
Alternative stick-breaking construction for each document [FSJW08]:
ψdt ∼ G0
πdt = π′dt ∏_{i=1}^{t−1} (1−π′di), π′dt ∼ Beta(1, α0)
Gd = ∑_{t=1}^∞ πdt δψdt
Alternative Stick Breaking Construction
The stick-breaking construction for the HDP:
G0 = ∑_{k=1}^∞ βk δφk, φk ∼ H
βk = β′k ∏_{i=1}^{k−1} (1−β′i), β′k ∼ Beta(1, γ)
Gd = ∑_{t=1}^∞ πdt δψdt, ψdt ∼ G0
πdt = π′dt ∏_{i=1}^{t−1} (1−π′di), π′dt ∼ Beta(1, α0)
To connect ψdt and φk, we add an auxiliary variable cdt ∼ Mult(β); then ψdt = φ_{c_dt}.
Alternative Stick Breaking Construction
Generative process:
1 For each global-level topic k ∈ 1, . . . , ∞:
  1 Draw topic word proportions φk ∼ Dir(η)
  2 Draw a corpus breaking proportion β′k ∼ Beta(1, γ)
2 For each document d ∈ 1, . . . , D:
  1 For each document-level topic t ∈ 1, . . . , ∞:
    1 Draw a document-level topic index cdt ∼ Mult(σ(β′))
    2 Draw a document breaking proportion π′dt ∼ Beta(1, α0)
  2 For each word n ∈ 1, . . . , N:
    1 Draw a topic index zdn ∼ Mult(σ(π′d))
    2 Generate a word wdn ∼ Mult(φ_{c_{d z_dn}})
where σ(β′) ≡ (β1, β2, . . .), with βk = β′k ∏_{i=1}^{k−1} (1−β′i)
Variational Inference
Main idea [JGJS98]:
- Replace the original graphical model with a simpler model
- Minimize the dissimilarity between the original and the simplified one
More formally:
- Observed data X, latent variable Z
- We want to compute p(Z|X)
- Introduce q(Z)
- Minimize the dissimilarity between p and q (commonly the KL divergence of p from q, DKL(q||p))
KL-divergence of p from q
Find a lower bound of the log evidence log p(X):
log p(X) = log ∑_Z p(Z, X) = log ∑_Z p(Z, X) · q(Z|X)/q(Z|X)
= log ∑_Z q(Z|X) · p(Z, X)/q(Z|X)
≥ ∑_Z q(Z|X) log [p(Z, X)/q(Z|X)]   (Jensen's inequality)
Gap between log p(X) and its lower bound:
log p(X) − ∑_Z q(Z|X) log [p(Z, X)/q(Z|X)] = ∑_Z q(Z) log [q(Z)/p(Z|X)] = DKL(q||p)
KL-divergence of p from q
log p(X) = ∑_Z q(Z|X) log [p(Z, X)/q(Z|X)] + DKL(q||p)
The log evidence log p(X) is fixed with respect to q, so minimising DKL(q||p) ≡ maximizing the lower bound of log p(X).
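The identity above can be checked numerically for a discrete latent variable. The snippet below is my own toy illustration (assuming NumPy; the joint table and q are made up): it computes the lower bound (ELBO), the KL gap, and the exact log evidence, and their sum relation holds to machine precision.

```python
import numpy as np

def elbo_and_kl(joint, q):
    """For a discrete unnormalised joint p(z, x) (vector over z) and a
    variational q(z): ELBO = sum_z q(z) log[p(z,x)/q(z)],
    KL = DKL(q || p(.|x)); ELBO + KL = log p(x) = log sum_z p(z, x)."""
    evidence = joint.sum()
    posterior = joint / evidence            # exact p(z | x)
    elbo = np.sum(q * np.log(joint / q))    # lower bound of log evidence
    kl = np.sum(q * np.log(q / posterior))  # the gap
    return elbo, kl, np.log(evidence)

joint = np.array([0.1, 0.25, 0.05])   # hypothetical p(z, x) for 3 values of z
q = np.array([0.2, 0.5, 0.3])         # any q(z) with full support
elbo, kl, log_evidence = elbo_and_kl(joint, q)
```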
Variational Inference
Main idea [JGJS98]:
- Replace the original graphical model with a simpler model
- Minimize the dissimilarity between the original and the simplified one
More formally:
- Observed data X, latent variable Z
- We want to compute p(Z|X)
- Introduce q(Z)
- Minimize the dissimilarity between p and q (commonly DKL(q||p)):
  - Find a lower bound of log p(X)
  - Maximize it
Variational Inference for HDP
q(β, φ, π, c, z) = ∏_{k=1}^K q(φk | λk) ∏_{k=1}^{K−1} q(βk | a^1_k, a^2_k) ∏_{d=1}^D [ ∏_{t=1}^T q(cdt | ζdt) ∏_{t=1}^{T−1} q(πdt | γ^1_dt, γ^2_dt) ∏_{n=1}^N q(zdn | ϕdn) ]
Variational Inference for HDP
Find a lower bound of log p(w | α0, γ, η):
ln p(w | α0, γ, η)
= ln ∫_β ∫_φ ∫_π ∑_c ∑_z p(w, β, φ, π, c, z | α0, γ, η) dβ dφ dπ
= ln ∫_β ∫_φ ∫_π ∑_c ∑_z p(w, β, φ, π, c, z | α0, γ, η) · [q(β, φ, π, c, z)/q(β, φ, π, c, z)] dβ dφ dπ
≥ ∫_β ∫_φ ∫_π ∑_c ∑_z ln [p(w, β, φ, π, c, z | α0, γ, η)/q(β, φ, π, c, z)] · q(β, φ, π, c, z) dβ dφ dπ   (∵ Jensen's inequality)
= Eq[ln p(w, β, φ, π, c, z | α0, γ, η)] − Eq[ln q(β, φ, π, c, z)]
Variational Inference for HDP
ln p(w | α0, γ, η)
≥ Eq[ln p(w, β, φ, π, c, z | α0, γ, η)] − Eq[ln q(β, φ, π, c, z)]
= Eq[ln p(β|γ) p(φ|η) ∏_{d=1}^D p(πd|α0) p(cd|β) ∏_{n=1}^N p(wdn|cd, zdn, φ) p(zdn|πd)]
− Eq[ln ∏_{k=1}^K q(φk|λk) ∏_{k=1}^{K−1} q(βk|a^1_k, a^2_k) ∏_{d=1}^D ∏_{t=1}^T q(cdt|ζdt) ∏_{t=1}^{T−1} q(πdt|γ^1_dt, γ^2_dt) ∏_{n=1}^N q(zdn|ϕdn)]
= ∑_{d=1}^D { Eq[ln p(πd|α0)] + Eq[ln p(cd|β)] + Eq[ln p(wd|cd, zd, φ)] + Eq[ln p(zd|πd)]
− Eq[ln q(cd|ζd)] − Eq[ln q(πd|γ^1_d, γ^2_d)] − Eq[ln q(zd|ϕd)] }
+ Eq[ln p(β|γ)] + Eq[ln p(φ|η)] − Eq[ln q(φ|λ)] − Eq[ln q(β|a^1, a^2)]
We can run variational EM to maximize this lower bound of log p(w | α0, γ, η).
Variational Inference for HDP
Maximize the lower bound of log p(w | α0, γ, η): take its derivative with respect to each variational parameter and set it to zero.
γ^1_dt = 1 + ∑_{n=1}^N ϕdnt,  γ^2_dt = α0 + ∑_{n=1}^N ∑_{b=t+1}^T ϕdnb
ζdtk ∝ exp( ∑_{e=1}^{k−1} (Ψ(a^2_e) − Ψ(a^1_e + a^2_e)) + (Ψ(a^1_k) − Ψ(a^1_k + a^2_k)) + ∑_{n=1}^N ∑_{v=1}^V w^v_dn ϕdnt (Ψ(λkv) − Ψ(∑_{l=1}^V λkl)) )
ϕdnt ∝ exp( ∑_{h=1}^{t−1} (Ψ(γ^2_dh) − Ψ(γ^1_dh + γ^2_dh)) + (Ψ(γ^1_dt) − Ψ(γ^1_dt + γ^2_dt)) + ∑_{k=1}^K ∑_{v=1}^V w^v_dn ζdtk (Ψ(λkv) − Ψ(∑_{l=1}^V λkl)) )
a^1_k = 1 + ∑_{d=1}^D ∑_{t=1}^T ζdtk,  a^2_k = γ + ∑_{d=1}^D ∑_{t=1}^T ∑_{f=k+1}^K ζdtf
λkv = ηv + ∑_{d=1}^D ∑_{n=1}^N ∑_{t=1}^T w^v_dn ϕdnt ζdtk
Variational Inference for HDP
Maximize the lower bound of log p(w | α0, γ, η): take the derivative with respect to each variational parameter, then run variational EM.
- E step: compute the document-level parameters γ^1_dt, γ^2_dt, ζdtk, ϕdnt
- M step: compute the corpus-level parameters a^1_k, a^2_k, λkv

Algorithm 2 Variational Inference for HDP
1: Initialize the variational parameters
2: repeat
3:   for each document d do
4:     repeat
5:       Compute document parameters γ^1_dt, γ^2_dt, ζdtk, ϕdnt
6:     until converged
7:   end for
8:   Compute topic parameters a^1_k, a^2_k, λkv
9: until converged
Outline
1 Introduction: Motivation, Topic Modeling
2 Background: Dirichlet Distribution, Dirichlet Processes
3 Hierarchical Dirichlet Processes: Dirichlet Process Mixture Models, Hierarchical Dirichlet Processes
4 Inference: Gibbs Sampling, Variational Inference, Online Learning, Distributed Online Learning
5 Practical Tips
6 Summary
Online Variational Inference
Stochastic optimization of the variational objective [WPB11]:
- Subsample the documents
- Compute an approximation of the gradient based on the subsample
- Follow that gradient with a decreasing step size
Variational Inference for HDP
Lower bound of log p(w | α0, γ, η):
ln p(w | α0, γ, η)
≥ ∑_{d=1}^D { Eq[ln p(πd|α0)] + Eq[ln p(cd|β)] + Eq[ln p(wd|cd, zd, φ)] + Eq[ln p(zd|πd)]
− Eq[ln q(cd|ζd)] − Eq[ln q(πd|γ^1_d, γ^2_d)] − Eq[ln q(zd|ϕd)] }
+ Eq[ln p(β|γ)] + Eq[ln p(φ|η)] − Eq[ln q(φ|λ)] − Eq[ln q(β|a^1, a^2)]
= ∑_{d=1}^D Ld + Lk = E_j[ D Lj + Lk ]
where j is a document index drawn uniformly at random.
Online Variational Inference for HDP
Lower bound of log p(w | α0, γ, η) = E_j[ D Lj + Lk ]
Online learning algorithm for HDP:
- Sample a document d
- Compute its optimal document-level parameters γ^1_dt, γ^2_dt, ζdtk, ϕdnt
- Take the gradient of the corpus-level parameters a^1_k, a^2_k, λkv with noise
- Update the corpus-level parameters a^1_k, a^2_k, λkv with a decreasing learning rate:
a^1_k = (1−ρe) a^1_k + ρe (1 + D ∑_{t=1}^T ζdtk)
a^2_k = (1−ρe) a^2_k + ρe (γ + D ∑_{t=1}^T ∑_{f=k+1}^K ζdtf)
λkv = (1−ρe) λkv + ρe (ηv + D ∑_{n=1}^N ∑_{t=1}^T w^v_dn ϕdnt ζdtk)
where ρe is the learning rate, which must satisfy ∑_{e=1}^∞ ρe = ∞ and ∑_{e=1}^∞ ρe² < ∞.
(The natural gradient step is structurally equivalent to the batch variational update.)
Online Variational Inference for HDP

Algorithm 3 Online Variational Inference for HDP
1: Initialize the variational parameters
2: e = 0
3: for each document d ∈ 1, . . . , D do
4:   repeat
5:     Compute document parameters γ^1_dt, γ^2_dt, ζdtk, ϕdnt
6:   until converged
7:   e = e + 1
8:   Compute learning rate ρe = (τ0 + e)^−κ where τ0 > 0, κ ∈ (0.5, 1]
9:   Update topic parameters a^1_k, a^2_k, λkv
10: end for
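The decreasing-step-size update pattern in Algorithm 3 can be sketched on its own. The code below is my own illustration (assuming NumPy; the "noisy estimate" here is a stand-in for the per-document natural-gradient term, and the toy target value 3.0 is made up): ρe = (τ0 + e)^−κ with κ ∈ (0.5, 1] satisfies the two summability conditions, so the iterates converge despite the noise.

```python
import numpy as np

def learning_rate(e, tau0=1.0, kappa=0.7):
    """rho_e = (tau0 + e)^(-kappa); kappa in (0.5, 1] gives
    sum rho_e = inf and sum rho_e^2 < inf (Robbins-Monro conditions)."""
    return (tau0 + e) ** (-kappa)

def online_update(param, noisy_estimate, rho):
    """One stochastic step: blend the old parameter with the estimate
    computed from a single sampled document."""
    return (1.0 - rho) * param + rho * noisy_estimate

# Toy run: the per-step estimates are noisy observations of a true value 3.0
rng = np.random.default_rng(0)
lam = 0.0
for e in range(1, 2001):
    lam = online_update(lam, 3.0 + rng.normal(0.0, 1.0), learning_rate(e))
```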
Outline
1 Introduction: Motivation, Topic Modeling
2 Background: Dirichlet Distribution, Dirichlet Processes
3 Hierarchical Dirichlet Processes: Dirichlet Process Mixture Models, Hierarchical Dirichlet Processes
4 Inference: Gibbs Sampling, Variational Inference, Online Learning, Distributed Online Learning
5 Practical Tips
6 Summary
Motivation
Problem 1: Inference for HDP takes a long time.
Problem 2: A continuously expanding corpus necessitates continuous updates of the model parameters.
- But updating the model parameters is not possible with plain HDP
- We must re-train with the entire updated corpus
Our approach: combine distributed inference and online learning.
Distributed Online HDP
- Based on variational inference
- Mini-batch updates via stochastic learning (variational EM)
- Distributes variational EM using MapReduce
Distributed Online HDP

Algorithm 4 Distributed Online HDP - Driver
1: Initialize the variational parameters
2: e = 0
3: while run forever do
4:   Collect new documents s ∈ 1, . . . , S
5:   e = e + 1
6:   Compute learning rate ρe = (τ0 + e)^−κ where τ0 > 0, κ ∈ (0.5, 1]
7:   Run MapReduce job
8:   Get the result of the job and update the topic parameters
9: end while
Distributed Online HDP

Algorithm 5 Distributed Online HDP - Mapper
1: Mapper gets one document s ∈ 1, . . . , S
2: repeat
3:   Compute document parameters γ^1_dt, γ^2_dt, ζdtk, ϕdnt
4: until converged
5: Output the sufficient statistics for the topic parameters

Algorithm 6 Distributed Online HDP - Reducer
1: Reducer gets the sufficient statistics for each topic parameter
2: Compute the change of the topic parameter from the sufficient statistics
3: Output the change of the topic parameter
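The mapper/reducer contract above can be sketched without Hadoop. The snippet below is my own minimal simulation in plain Python (the dict-of-partial-counts representation is a stand-in for the real ζ/ϕ sufficient statistics): mappers emit (topic, statistic) pairs per document, and the reducer sums them per topic.

```python
def mapper(doc_stats):
    """Per-document E-step output: (topic_id, sufficient statistic) pairs,
    standing in for the per-document statistics of Algorithm 5."""
    return list(doc_stats.items())

def reducer(pairs):
    """Sum the sufficient statistics per topic, as in Algorithm 6."""
    totals = {}
    for k, v in pairs:
        totals[k] = totals.get(k, 0.0) + v
    return totals

# Three hypothetical documents' partial statistics, keyed by topic id
docs = [{0: 1.5, 1: 0.5}, {0: 0.5, 2: 2.0}, {1: 1.0}]
all_pairs = [pair for d in docs for pair in mapper(d)]
topic_totals = reducer(all_pairs)   # the driver then applies the rho_e update
```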
Experimental Setup
- Data: 973,266 Twitter conversations, 7.54 tweets per conversation
- Approximately 7,297,000 tweets
- 60-node Hadoop system
- Each node with 8 x 2.30GHz cores
Result
- Distributed Online HDP runs faster than online HDP
- Distributed Online HDP preserves the quality of the result (perplexity)
Practical Tips
Until now, I talked about Bayesian Nonparametric Topic Modeling:
- the concept of Hierarchical Dirichlet Processes
- how to infer the latent variables in HDP
These are theoretical interests.
Someone who attended the last machine learning winter school said:
"Wow! There are good and interesting machine learning topics! But I want to know about practical issues, because I am in the industrial field."
So I prepared some tips for him/her and you.
Implementation
https://github.com/NoSyu/Topic_Models
Some tips for using topic models
How to manage hyper-parameters (Dirichlet parameters)?
How to manage learning rate and mini-batch size in online learning?
HDP
Property of Dirichlet distribution
Sample pmfs from Dirichlet distribution [BAFG10]
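The sampled pmfs this slide refers to can be reproduced numerically. A minimal sketch (NumPy, parameters chosen for illustration): a symmetric Dirichlet parameter below 1 yields spiky pmfs that put most mass on a few entries, while a large parameter yields near-uniform pmfs.

```python
import numpy as np

rng = np.random.default_rng(42)
K = 10

sparse = rng.dirichlet(np.full(K, 0.1), size=1000)   # alpha < 1: spiky pmfs
dense = rng.dirichlet(np.full(K, 10.0), size=1000)   # alpha >> 1: flat pmfs

# Average largest entry of each sampled pmf:
# close to 1 for small alpha, close to 1/K for large alpha
sparse_peak = sparse.max(axis=1).mean()
dense_peak = dense.max(axis=1).mean()
```

This is exactly the behavior exploited by the tips on the next slide: small Dirichlet parameters encode the prior belief that a document uses few topics and a topic uses few words.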
Assign Dirichlet parameters
Dirichlet parameters are less than 1
I People usually use only a few topics to write a document
I People usually do not use all topics
I Each topic usually uses a few words to represent itself
I Each topic does not use all words
We can assign weights to each topic/word
I Some topics are more general than others
I Some words are more general than others
I Words that have positive/negative meaning appear in positive/negative sentiments [JO11]
Compute learning rate ρ_e = (τ_0 + e)^{−κ} where τ_0 > 0, κ ∈ (0.5, 1]

a¹_k = (1 − ρ_e) a¹_k + ρ_e (1 + D ∑_{t=1}^{T} ζ_{dtk})

a²_k = (1 − ρ_e) a²_k + ρ_e (γ + D ∑_{t=1}^{T} ∑_{f=k+1}^{K} ζ_{dtf})

λ_{kv} = (1 − ρ_e) λ_{kv} + ρ_e (η_v + D ∑_{n=1}^{N} ∑_{t=1}^{T} w^v_{dn} ϕ_{dnt} ζ_{dtk})

Meaning of each parameter
I τ_0: Slows down the early iterations of the algorithm
I κ: Rate at which old values of the topic parameters are forgotten
So the best values depend on the dataset
Usually, we set τ_0 = 1.0, κ = 0.7
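The schedule above is easy to illustrate. A minimal sketch (not tied to any particular implementation) showing that ρ_e decays monotonically with the suggested defaults, and that a larger τ_0 damps the early updates:

```python
# Learning-rate schedule rho_e = (tau0 + e)^(-kappa) with tau0 > 0, kappa in (0.5, 1]
def rho(e, tau0=1.0, kappa=0.7):
    return (tau0 + e) ** (-kappa)

rates = [rho(e) for e in range(1, 6)]                  # defaults tau0 = 1.0, kappa = 0.7
slow_start = [rho(e, tau0=64.0) for e in range(1, 6)]  # larger tau0 damps early steps
```

The constraint κ ∈ (0.5, 1] is what makes the stochastic updates well-behaved: the steps shrink fast enough to converge but slowly enough that every mini-batch still contributes.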
Mini-batch size
When the mini-batch size is large, distributed online HDP runs faster
Perplexity is similar to that of the other settings
Summary
Bayesian Nonparametric Topic Modeling
Hierarchical Dirichlet Processes
I Chinese Restaurant FranchiseI Stick Breaking Construction
Posterior Inference for HDPI Gibbs SamplingI Variational InferenceI Online Learning
Slides and other materials are available at http://uilab.kaist.ac.kr/members/jinyeongbak
Implementations are maintained at http://github.com/NoSyu/Topic_Models
Further Reading
Dirichlet Process
I Dirichlet Process
I Dirichlet distribution and Dirichlet Process + Indian Buffet Process
Bayesian Nonparametric model
I Machine Learning Summer School - Yee Whye Teh
I Machine Learning Summer School - Peter Orbanz
I Introductory article
Inference
I MCMC
I Variational Inference
Thank You!
JinYeong [email protected], linkedin.com/in/jybak
Users & Information Lab, KAIST
References I
Charles E. Antoniak, Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems, The Annals of Statistics (1974), 1152–1174.

Bela A. Frigyik, Amol Kapila, and Maya R. Gupta, Introduction to the Dirichlet distribution and related processes, Tech. Report UWEETR-2010-0006, Department of Electrical Engineering, University of Washington, Seattle, WA 98195, December 2010.

Christopher M. Bishop and Nasser M. Nasrabadi, Pattern Recognition and Machine Learning, vol. 1, Springer New York, 2006.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research 3 (2003), 993–1022.

Emily B. Fox, Erik B. Sudderth, Michael I. Jordan, and Alan S. Willsky, An HDP-HMM for systems with state persistence, Proceedings of the 25th International Conference on Machine Learning, ACM, 2008, pp. 312–319.
References II
Peter D. Hoff, A First Course in Bayesian Statistical Methods, Springer, 2009.

Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul, An introduction to variational methods for graphical models, Springer, 1998.

Yohan Jo and Alice H. Oh, Aspect and sentiment unification model for online review analysis, Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (WSDM '11), New York, NY, USA, ACM, 2011, pp. 815–824.

Radford M. Neal, Markov chain sampling methods for Dirichlet process mixture models, Journal of Computational and Graphical Statistics 9 (2000), no. 2, 249–265.

Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei, Hierarchical Dirichlet processes, Journal of the American Statistical Association 101 (2006), no. 476.
References III
Chong Wang, John W. Paisley, and David M. Blei, Online variational inference for the hierarchical Dirichlet process, International Conference on Artificial Intelligence and Statistics, 2011, pp. 752–760.
Images source I
http://christmasstockimages.com/free/ideas_concepts/slides/dice_throw.htm
http://www.flickr.com/photos/autumn2may/3965964418/
http://www.flickr.com/photos/ppix/1802571058/
http://yesurakezu.deviantart.com/art/Domo-s-head-exploding-with-dice-298452871
http://www.flickr.com/photos/jwight/2710392971/
http://www.flickr.com/photos/jasohill/2511594886/
http://en.wikipedia.org/wiki/Kim_Yuna
http://en.wikipedia.org/wiki/Hand_in_Hand_%28Olympics%29
http://en.wikipedia.org/wiki/Gangnam_Style
Measurable space (Ω, B)
Def) A set considered together with the σ-algebra on the set [6]
Ω: the set of all outcomes, the sample space
B: σ-algebra over Ω
I Special kind of collection of subsets of the sample space Ω
F Complete (closed under complement): if A ∈ B, then A^C ∈ B
F Closed under countable unions and intersections: if A ∈ B and B′ ∈ B, then A ∪ B′ ∈ B and A ∩ B′ ∈ B
I A collection of events
I Property
F Smallest possible σ-algebra: {Ω, ∅}
F Largest possible σ-algebra: the power set of Ω
[6] http://mathworld.wolfram.com/MeasurableSpace.html
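These closure properties can be checked mechanically on a tiny sample space. A minimal sketch (the helper names are mine) verifying that the power set of Ω = {1, 2, 3}, the largest possible σ-algebra, contains Ω and ∅ and is closed under complement and union:

```python
from itertools import chain, combinations

omega = frozenset({1, 2, 3})                     # a tiny sample space

def powerset(s):
    """All subsets of s, as frozensets."""
    return {frozenset(c) for c in
            chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))}

B = powerset(omega)                              # largest possible sigma-algebra

contains_omega_and_empty = omega in B and frozenset() in B
closed_complement = all(omega - A in B for A in B)
closed_union = all((A | C) in B for A in B for C in B)
```

For a finite Ω, closure under countable unions reduces to closure under pairwise unions, which is why the finite check above suffices.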
Proof 1
Decimative property
I Let (θ_1, θ_2, . . . , θ_K) ∼ Dir(α_1, α_2, . . . , α_K) and (τ_1, τ_2) ∼ Dir(α_1β_1, α_1β_2) where β_1 + β_2 = 1,
then (θ_1τ_1, θ_1τ_2, θ_2, . . . , θ_K) ∼ Dir(α_1β_1, α_1β_2, α_2, . . . , α_K)

Then

(G({θ_1}), G(A_1), . . . , G(A_R)) = (β_1, (1 − β_1)G′(A_1), . . . , (1 − β_1)G′(A_R)) ∼ Dir(1, α_0G_0(A_1), . . . , α_0G_0(A_R))

changes to

(G′(A_1), . . . , G′(A_R)) ∼ Dir(α_0G_0(A_1), . . . , α_0G_0(A_R)), i.e. G′ ∼ DP(α_0, G_0)

using the decimative property with

α_1 = α_0, θ_1 = (1 − β_1), β_k = G_0(A_k), τ_k = G′(A_k)
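The decimative property can be sanity-checked numerically. A small Monte Carlo sketch (parameter values are arbitrary; comparing only the component means, which is a necessary but not sufficient check of the distributional claim):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000

alpha = np.array([2.0, 3.0, 5.0])   # (theta_1, theta_2, theta_3) ~ Dir(alpha)
beta = np.array([0.4, 0.6])         # beta_1 + beta_2 = 1

theta = rng.dirichlet(alpha, size=n)
tau = rng.dirichlet(alpha[0] * beta, size=n)

# Decimated draws: (theta_1 tau_1, theta_1 tau_2, theta_2, theta_3)
split = np.column_stack([theta[:, :1] * tau, theta[:, 1:]])

# Claimed distribution: Dir(alpha_1 beta_1, alpha_1 beta_2, alpha_2, alpha_3)
target = np.concatenate([alpha[0] * beta, alpha[1:]])
err = np.abs(split.mean(axis=0) - target / target.sum()).max()
```

Each decimated row still sums to 1 because θ_1(τ_1 + τ_2) = θ_1, so the split vectors remain valid pmfs.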