Bayesian Nonparametric Topic Modeling: Hierarchical Dirichlet Processes


DESCRIPTION

These are the presentation slides from the Machine Learning Summer School in Korea (http://prml.yonsei.ac.kr/). I talked about the Dirichlet distribution, the Dirichlet process, and the HDP.

TRANSCRIPT

1. Bayesian Nonparametric Topic Modeling: Hierarchical Dirichlet Processes. JinYeong Bak, Department of Computer Science, KAIST, Daejeon, South Korea. [email protected]. August 22, 2013. Part of these slides is adapted from a presentation by Yee Whye Teh (y.w[email protected]).

2. Outline
1 Introduction: Motivation; Topic Modeling
2 Background: Dirichlet Distribution; Dirichlet Processes
3 Hierarchical Dirichlet Processes: Dirichlet Process Mixture Models; Hierarchical Dirichlet Processes
4 Inference: Gibbs Sampling; Variational Inference; Online Learning; Distributed Online Learning
5 Practical Tips
6 Summary

4. Introduction
Bayesian topic models: Latent Dirichlet Allocation (LDA) [BNJ03] and Hierarchical Dirichlet Processes (HDP) [TJBB06]. In this talk: the Dirichlet distribution and the Dirichlet process; the concept of Hierarchical Dirichlet Processes (HDP); how to infer the latent variables in HDP.

5. Motivation

10. Motivation
What are the topics discussed in the article? How can we describe the topics?

12. Topic Modeling

16. Topic Modeling
Each topic has a word distribution.

17. Topic Modeling
Each document has a topic proportion. Each word has its own topic index.

20. Topic Modeling

25. Latent Dirichlet Allocation
Generative process of LDA:
For each topic $k \in \{1,\dots,K\}$: draw a word distribution $\phi_k \sim \mathrm{Dir}(\eta)$.
For each document $d \in \{1,\dots,D\}$: draw topic proportions $\theta_d \sim \mathrm{Dir}(\alpha)$.
For each word $n \in \{1,\dots,N\}$ in a document: draw a topic index $z_{dn} \sim \mathrm{Mult}(\theta_d)$, then generate the word from the chosen topic, $w_{dn} \sim \mathrm{Mult}(\phi_{z_{dn}})$.

29. Latent Dirichlet Allocation
Our interests: What are the topics discussed in the article? How can we describe the topics?

30. Latent Dirichlet Allocation
What we can see: words in documents.

31. Latent Dirichlet Allocation
What we want to see: the latent topic structure.

32. Latent Dirichlet Allocation
Our interests: What are the topics discussed in the article? => the topic proportion of each document. How can we describe the topics? => the word distribution of each topic.

33. Latent Dirichlet Allocation
What we can see: $w$. What we want to see: $\theta$, $z$, $\phi$. Compute
$$p(\theta,z,\phi \mid w,\alpha,\eta) = \frac{p(\theta,z,\phi,w \mid \alpha,\eta)}{p(w \mid \alpha,\eta)}$$
But this distribution is intractable to compute (the normalization term), so we use approximate methods: Gibbs sampling and variational inference.
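To make the generative story above concrete, here is a minimal numpy sketch that samples a toy corpus from the LDA process; the corpus sizes and hyperparameter values are arbitrary illustrative choices, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

K, D, N, V = 3, 5, 20, 10      # toy sizes: topics, documents, words per doc, vocabulary
alpha, eta = 0.5, 0.1          # assumed toy Dirichlet hyperparameters

# For each topic k: draw a word distribution phi_k ~ Dir(eta)
phi = rng.dirichlet(np.full(V, eta), size=K)

corpus = []
for d in range(D):
    theta_d = rng.dirichlet(np.full(K, alpha))    # topic proportions for document d
    z = rng.choice(K, size=N, p=theta_d)          # a topic index per word
    words = [rng.choice(V, p=phi[k]) for k in z]  # each word drawn from its chosen topic
    corpus.append(words)

print(corpus[0])
```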
35. Limitation of Latent Dirichlet Allocation
Latent Dirichlet Allocation is a parametric model: people must assign the number of topics in a corpus, and must find the best number of topics. Q) Can we get it from the data automatically? A) Hierarchical Dirichlet Processes.

38. Dice modeling
Think about the probability of each number from dice. Each die has its own pmf. According to the textbook, it is widely known to be uniform: 1/6 for a six-sided die. Is it true? Ans) No!

41. Dice modeling
We should model the randomness of the pmfs of the individual dice. How can we do that? Let's imagine a bag which holds many dice. We cannot see inside the bag; we can only draw one die out of the bag. OK, but what is the formal description?

43. Standard Simplex
A generalization of the notion of a triangle or tetrahedron: all points are non-negative and sum to 1.¹ A pmf can be thought of as a point in the standard simplex. Ex) a point p = (x, y, z), where x ≥ 0, y ≥ 0, z ≥ 0 and x + y + z = 1.
¹ http://en.wikipedia.org/wiki/Simplex
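As a quick illustration of "a pmf as a point in the simplex," the sketch below estimates a (possibly loaded) die's pmf from simulated rolls; the true pmf and the roll count are made-up toy values.

```python
import numpy as np

rng = np.random.default_rng(1)

true_pmf = np.array([0.25, 0.15, 0.15, 0.15, 0.15, 0.15])  # a slightly loaded die (toy values)
rolls = rng.choice(6, size=1000, p=true_pmf)

counts = np.bincount(rolls, minlength=6)
est_pmf = counts / counts.sum()   # a point in the 5-dimensional standard simplex

print(est_pmf, est_pmf.sum())     # non-negative entries summing to 1
```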
45. Dirichlet distribution
Definition [BN06]: a probability distribution over the (K − 1)-dimensional standard simplex; a distribution over pmfs of length K. Notation: $q \sim \mathrm{Dir}(\alpha)$, where $q = [q_1,\dots,q_K]$ is a random pmf and $\alpha = [\alpha_1,\dots,\alpha_K]$. Probability density function:
$$p(q;\alpha) = \frac{\Gamma\big(\sum_{k=1}^K \alpha_k\big)}{\prod_{k=1}^K \Gamma(\alpha_k)} \prod_{k=1}^K q_k^{\alpha_k - 1}$$

48. Latent Dirichlet Allocation

49. Property of Dirichlet distribution
Density plots [BAFG10].

50. Property of Dirichlet distribution
Sample pmfs from the Dirichlet distribution [BAFG10].

51. Property of Dirichlet distribution
When K = 2, it is the Beta distribution. It is the conjugate prior for the multinomial distribution: likelihood $X \sim \mathrm{Mult}(n, q)$, prior $q \sim \mathrm{Dir}(\alpha)$, posterior $(q \mid X) \sim \mathrm{Dir}(\alpha + x)$, where $x = (x_1,\dots,x_K)$ is the vector of observed counts. Proof)
$$p(q \mid X) = \frac{p(X \mid q)\,p(q)}{p(X)} \propto p(X \mid q)\,p(q) = \frac{n!}{x_1!\cdots x_K!}\prod_{k=1}^K q_k^{x_k}\cdot\frac{\Gamma\big(\sum_k \alpha_k\big)}{\prod_k \Gamma(\alpha_k)}\prod_{k=1}^K q_k^{\alpha_k-1} = C\prod_{k=1}^K q_k^{\alpha_k + x_k - 1} = \mathrm{Dir}(\alpha + x)$$

53. Property of Dirichlet distribution
Aggregation property: let $(q_1,q_2,\dots,q_K) \sim \mathrm{Dir}(\alpha_1,\alpha_2,\dots,\alpha_K)$; then $(q_1+q_2,q_3,\dots,q_K) \sim \mathrm{Dir}(\alpha_1+\alpha_2,\alpha_3,\dots,\alpha_K)$. In general, if $\{A_1,\dots,A_R\}$ is any partition of $\{1,\dots,K\}$, then $\big(\sum_{k\in A_1} q_k,\dots,\sum_{k\in A_R} q_k\big) \sim \mathrm{Dir}\big(\sum_{k\in A_1}\alpha_k,\dots,\sum_{k\in A_R}\alpha_k\big)$.
Decimative property: let $(q_1,q_2,\dots,q_K) \sim \mathrm{Dir}(\alpha_1,\alpha_2,\dots,\alpha_K)$ and $(\tau_1,\tau_2) \sim \mathrm{Dir}(\alpha_1\beta_1,\alpha_1\beta_2)$ where $\beta_1+\beta_2=1$; then $(q_1\tau_1,q_1\tau_2,q_2,\dots,q_K) \sim \mathrm{Dir}(\alpha_1\beta_1,\alpha_1\beta_2,\alpha_2,\dots,\alpha_K)$.
Neutrality property: let $(q_1,q_2,\dots,q_K) \sim \mathrm{Dir}(\alpha_1,\alpha_2,\dots,\alpha_K)$; then $q_k$ is independent of the vector $\frac{1}{1-q_k}(q_1,q_2,\dots,q_{k-1},q_{k+1},\dots,q_K)$.
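A small numpy check of two of these facts, Dirichlet-multinomial conjugacy and the aggregation property, on made-up toy parameters (the Monte Carlo comparison is approximate):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = np.array([2.0, 3.0, 4.0])         # toy prior parameters

# Conjugacy: observe multinomial counts x, posterior is Dir(alpha + x)
q_true = rng.dirichlet(alpha)
x = rng.multinomial(100, q_true)           # observed counts
posterior_alpha = alpha + x                # parameters of Dir(alpha + x)
print("posterior mean:", posterior_alpha / posterior_alpha.sum())

# Aggregation: (q1+q2, q3) ~ Dir(alpha1+alpha2, alpha3)
samples = rng.dirichlet(alpha, size=100_000)
agg = np.column_stack([samples[:, 0] + samples[:, 1], samples[:, 2]])
direct = rng.dirichlet([alpha[0] + alpha[1], alpha[2]], size=100_000)
print("aggregated mean:", agg.mean(0), "vs direct mean:", direct.mean(0))
```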
58. Dice modeling
Think about the probability of each number from dice. Each die has its own pmf, and we draw a die from a bag. Problem) we do not know the number of faces of the dice in the bag. Solution) the Dirichlet process.

60. Dirichlet Process
Definition [BAFG10]: a distribution over probability measures; a distribution whose realizations are distributions over some sample space. Formal definition: $(\Theta, \mathcal{B})$ is a measurable space, $G_0$ is a distribution over the sample space $\Theta$, $\alpha_0$ is a positive real number, and $G$ is a random probability measure over $(\Theta, \mathcal{B})$. Then $G \sim \mathrm{DP}(\alpha_0, G_0)$ if for any finite measurable partition $(A_1,\dots,A_R)$ of $\Theta$,
$$(G(A_1),\dots,G(A_R)) \sim \mathrm{Dir}(\alpha_0 G_0(A_1),\dots,\alpha_0 G_0(A_R))$$

62. Posterior Dirichlet Processes
$G \sim \mathrm{DP}(\alpha_0, G_0)$ can be treated as a random distribution over $\Theta$, so we can draw a sample $\theta_1$ from $G$. We can also make a finite partition $(A_1,\dots,A_R)$ of $\Theta$; then
$$p(\theta_1 \in A_r \mid G) = G(A_r),\qquad p(\theta_1 \in A_r) = G_0(A_r),\qquad (G(A_1),\dots,G(A_R)) \sim \mathrm{Dir}(\alpha_0 G_0(A_1),\dots,\alpha_0 G_0(A_R))$$
Using Dirichlet-multinomial conjugacy, the posterior is
$$(G(A_1),\dots,G(A_R)) \mid \theta_1 \sim \mathrm{Dir}\big(\alpha_0 G_0(A_1)+\delta_{\theta_1}(A_1),\dots,\alpha_0 G_0(A_R)+\delta_{\theta_1}(A_R)\big)$$
where $\delta_{\theta_1}(A_r) = 1$ if $\theta_1 \in A_r$ and 0 otherwise. This is true for every finite partition of $\Theta$.

66. Posterior Dirichlet Processes
Since the Dirichlet posterior above holds for every finite partition of $\Theta$, the posterior process is also a Dirichlet process:
$$G \mid \theta_1 \sim \mathrm{DP}\Big(\alpha_0+1,\ \frac{\alpha_0 G_0 + \delta_{\theta_1}}{\alpha_0+1}\Big)$$
Summary)
$$\theta_1 \mid G \sim G,\quad G \sim \mathrm{DP}(\alpha_0, G_0) \qquad\Longleftrightarrow\qquad \theta_1 \sim G_0,\quad G \mid \theta_1 \sim \mathrm{DP}\Big(\alpha_0+1,\ \frac{\alpha_0 G_0 + \delta_{\theta_1}}{\alpha_0+1}\Big)$$
69. Blackwell-MacQueen Urn Scheme
Now we draw samples $\theta_1,\dots,\theta_N$. First sample:
$$\theta_1 \mid G \sim G,\quad G \sim \mathrm{DP}(\alpha_0, G_0) \qquad\Longleftrightarrow\qquad \theta_1 \sim G_0,\quad G \mid \theta_1 \sim \mathrm{DP}\Big(\alpha_0+1,\ \tfrac{\alpha_0 G_0+\delta_{\theta_1}}{\alpha_0+1}\Big)$$
Second sample:
$$\theta_2 \mid \theta_1, G \sim G,\quad G \mid \theta_1 \sim \mathrm{DP}\Big(\alpha_0+1,\ \tfrac{\alpha_0 G_0+\delta_{\theta_1}}{\alpha_0+1}\Big) \qquad\Longleftrightarrow\qquad \theta_2 \mid \theta_1 \sim \tfrac{\alpha_0 G_0+\delta_{\theta_1}}{\alpha_0+1},\quad G \mid \theta_1,\theta_2 \sim \mathrm{DP}\Big(\alpha_0+2,\ \tfrac{\alpha_0 G_0+\delta_{\theta_1}+\delta_{\theta_2}}{\alpha_0+2}\Big)$$

72. Blackwell-MacQueen Urn Scheme
N-th sample:
$$\theta_N \mid \theta_1,\dots,\theta_{N-1} \sim \frac{\alpha_0 G_0+\sum_{n=1}^{N-1}\delta_{\theta_n}}{\alpha_0+N-1},\qquad G \mid \theta_1,\dots,\theta_N \sim \mathrm{DP}\Big(\alpha_0+N,\ \frac{\alpha_0 G_0+\sum_{n=1}^{N}\delta_{\theta_n}}{\alpha_0+N}\Big)$$

73. Blackwell-MacQueen Urn Scheme
The Blackwell-MacQueen urn scheme produces a sequence $\theta_1,\theta_2,\dots$ with the conditionals
$$\theta_N \mid \theta_1,\dots,\theta_{N-1} \sim \frac{\alpha_0 G_0+\sum_{n=1}^{N-1}\delta_{\theta_n}}{\alpha_0+N-1}$$
Pólya urn analogy: there is an infinite set of ball colors ($G_0$) and an initially empty urn. Filling the urn (n starts at 1): with probability proportional to $\alpha_0$, pick a new color from the infinite set of ball colors $G_0$, paint a new ball that color, and add it to the urn; with probability proportional to $n-1$, pick a ball from the urn, record its color, and put it back together with another ball of the same color.

74. Chinese Restaurant Process
Draw $\theta_1,\theta_2,\dots,\theta_N$ from a Blackwell-MacQueen urn scheme. The $\theta$s can take the same value ($\theta_i = \theta_j$), so there are $K \le N$ distinct values $\theta_1^*,\dots,\theta_K^*$. This induces a partition of $\theta_1,\theta_2,\dots,\theta_N$ according to $\theta_1^*,\dots,\theta_K^*$. The distribution over partitions is called the Chinese Restaurant Process (CRP).

76. Chinese Restaurant Process
Chinese Restaurant Process interpretation: there is a Chinese restaurant with infinitely many tables, and each customer sits at a table. Generating from the CRP: the first customer sits at the first table; the n-th customer sits at a new table with probability $\frac{\alpha_0}{\alpha_0+n-1}$, and at table $k$ with probability $\frac{n_k}{\alpha_0+n-1}$, where $n_k$ is the number of customers at table $k$.
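A toy simulation of the CRP seating process just described; the concentration parameter and the number of customers are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(3)

def crp(num_customers, alpha0):
    """Sample a table assignment for each customer from the CRP."""
    tables = []                       # tables[k] = number of customers at table k
    seating = []
    for n in range(num_customers):
        probs = np.array(tables + [alpha0], dtype=float)
        probs /= alpha0 + n           # existing table k: n_k/(alpha0+n); new table: alpha0/(alpha0+n)
        k = rng.choice(len(probs), p=probs)
        if k == len(tables):
            tables.append(1)          # open a new table
        else:
            tables[k] += 1
        seating.append(k)
    return seating, tables

seating, tables = crp(100, alpha0=2.0)
print("occupied tables:", len(tables), "sizes:", tables)
```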
80. Chinese Restaurant Process
The CRP exhibits the clustering property of the DP: tables are clusters, with $\theta_k^* \sim G_0$; customers are the actual realizations, $\theta_n = \theta_{z_n}^*$ where $z_n \in \{1,\dots,K\}$.

81. Stick Breaking Construction
The Blackwell-MacQueen urn scheme / CRP generates draws from $G$, not $G$ itself. To construct $G$, we use the stick-breaking construction. Review) posterior Dirichlet processes:
$$\theta_1 \mid G \sim G,\quad G \sim \mathrm{DP}(\alpha_0, G_0) \qquad\Longleftrightarrow\qquad \theta_1 \sim G_0,\quad G \mid \theta_1 \sim \mathrm{DP}\Big(\alpha_0+1,\ \tfrac{\alpha_0 G_0+\delta_{\theta_1}}{\alpha_0+1}\Big)$$
Consider the partition $(\{\theta_1\},\ \Theta\setminus\{\theta_1\})$ of $\Theta$. Then, since $G_0$ is smooth ($G_0(\{\theta_1\}) = 0$),
$$\big(G(\{\theta_1\}),\, G(\Theta\setminus\{\theta_1\})\big) \sim \mathrm{Dir}\Big((\alpha_0+1)\tfrac{\alpha_0 G_0+\delta_{\theta_1}}{\alpha_0+1}(\{\theta_1\}),\ (\alpha_0+1)\tfrac{\alpha_0 G_0+\delta_{\theta_1}}{\alpha_0+1}(\Theta\setminus\{\theta_1\})\Big) = \mathrm{Dir}(1,\alpha_0) = \mathrm{Beta}(1,\alpha_0)$$

84. Stick Breaking Construction
Consider the partition $(\{\theta_1\},\ \Theta\setminus\{\theta_1\})$ of $\Theta$. Then $\big(G(\{\theta_1\}),\, G(\Theta\setminus\{\theta_1\})\big) = (\beta_1,\, 1-\beta_1)$ with $\beta_1 \sim \mathrm{Beta}(1,\alpha_0)$, so $G$ has a point mass located at $\theta_1$:
$$G = \beta_1\delta_{\theta_1} + (1-\beta_1)G',\qquad \beta_1 \sim \mathrm{Beta}(1,\alpha_0)$$
where $G'$ is the probability measure with the point mass $\theta_1$ removed. What is $G'$?
87. Stick Breaking Construction
Summary) posterior Dirichlet processes: $\theta_1 \sim G_0$ and $G = \beta_1\delta_{\theta_1} + (1-\beta_1)G'$ with $\beta_1 \sim \mathrm{Beta}(1,\alpha_0)$. Consider a further partition $(\{\theta_1\}, A_1,\dots,A_R)$ of $\Theta$:
$$\big(G(\{\theta_1\}), G(A_1),\dots,G(A_R)\big) = \big(\beta_1,\,(1-\beta_1)G'(A_1),\dots,(1-\beta_1)G'(A_R)\big) \sim \mathrm{Dir}\big(1,\alpha_0 G_0(A_1),\dots,\alpha_0 G_0(A_R)\big)$$
Using the decimative property of the Dirichlet distribution (proof),
$$(G'(A_1),\dots,G'(A_R)) \sim \mathrm{Dir}(\alpha_0 G_0(A_1),\dots,\alpha_0 G_0(A_R)) \quad\Rightarrow\quad G' \sim \mathrm{DP}(\alpha_0, G_0)$$

90. Stick Breaking Construction
Do this repeatedly with distinct values $\theta_1^*, \theta_2^*, \dots$ For $G \sim \mathrm{DP}(\alpha_0, G_0)$:
$$G = \beta_1\delta_{\theta_1^*} + (1-\beta_1)G_1' = \beta_1\delta_{\theta_1^*} + (1-\beta_1)\big(\beta_2\delta_{\theta_2^*} + (1-\beta_2)G_2'\big) = \cdots$$
$$G = \sum_{k=1}^\infty \pi_k\delta_{\theta_k^*},\qquad \pi_k = \beta_k\prod_{i=1}^{k-1}(1-\beta_i),\quad \sum_{k=1}^\infty \pi_k = 1,\quad \beta_k \sim \mathrm{Beta}(1,\alpha_0),\quad \theta_k^* \sim G_0$$
Draws from the DP look like a sum of point masses, with masses drawn from a stick-breaking construction.

91. Stick Breaking Construction
Summary)
$$G = \sum_{k=1}^\infty \pi_k\delta_{\theta_k^*},\qquad \pi_k = \beta_k\prod_{i=1}^{k-1}(1-\beta_i),\quad \beta_k \sim \mathrm{Beta}(1,\alpha_0),\quad \theta_k^* \sim G_0$$

92. Summary of DP
Definition: $G$ is a random probability measure over $(\Theta,\mathcal{B})$; $G \sim \mathrm{DP}(\alpha_0, G_0)$ if for any finite measurable partition $(A_1,\dots,A_R)$ of $\Theta$, $(G(A_1),\dots,G(A_R)) \sim \mathrm{Dir}(\alpha_0 G_0(A_1),\dots,\alpha_0 G_0(A_R))$. Two constructive views: the Chinese Restaurant Process and the Stick Breaking Construction.
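The summary $G = \sum_k \pi_k\delta_{\theta_k^*}$ can be sampled directly with a truncated stick-breaking construction; below is a numpy sketch, where the truncation level, the concentration $\alpha_0 = 2$, and the standard-normal base measure $G_0$ are all toy choices of mine:

```python
import numpy as np

rng = np.random.default_rng(4)

def stick_breaking(alpha0, truncation=50):
    """Truncated stick-breaking weights pi_k = beta_k * prod_{i<k} (1 - beta_i)."""
    betas = rng.beta(1.0, alpha0, size=truncation)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    return betas * remaining

alpha0 = 2.0
pi = stick_breaking(alpha0)
atoms = rng.normal(size=pi.size)        # theta*_k ~ G0, here G0 = N(0, 1) as a toy base measure

print("sum of weights (close to 1 under truncation):", pi.sum())
theta = rng.choice(atoms, p=pi / pi.sum())  # a draw theta ~ G picks atom k with probability pi_k
```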
94. Dirichlet Process Mixture Models
We model a data set $x_1,\dots,x_N$ using the following model [Nea00]:
$$x_n \sim F(\theta_n),\qquad \theta_n \sim G,\qquad G \sim \mathrm{DP}(\alpha_0, G_0)$$
Each $\theta_n$ is a latent parameter modelling $x_n$, while $G$ is the unknown distribution over parameters, modelled using a DP.

96. Dirichlet Process Mixture Models
Since $G$ is of the form $G = \sum_{k=1}^\infty \pi_k\delta_{\theta_k^*}$, we have $\theta_n = \theta_k^*$ with probability $\pi_k$. Let $z_n$ take on value $k$ with probability $\pi_k$; we can equivalently define $\theta_n = \theta_{z_n}^*$. An equivalent model:
$$x_n \sim F(\theta_{z_n}^*),\quad p(z_n = k) = \pi_k,\quad \pi_k = \beta_k\prod_{i=1}^{k-1}(1-\beta_i),\quad \beta_k \sim \mathrm{Beta}(1,\alpha_0),\quad \theta_k^* \sim G_0$$

101. Topic modeling with documents
Each document consists of bags of words. Each word in a document has a latent topic index, and the latent topics for words in a document can be grouped. Each document has a topic proportion; each topic has a word distribution. Topics must be shared across documents.

103. Problem of the Naive Dirichlet Process Mixture Model
Use a DP mixture for each document: $x_{dn} \sim F(\theta_{dn})$, $\theta_{dn} \sim G_d$, $G_d \sim \mathrm{DP}(\alpha_0, G_0)$. But there is no sharing of clusters across different groups, because $G_0$ is smooth:
$$G_1 = \sum_{k=1}^\infty \pi_{1k}\delta_{\theta_{1k}^*},\qquad G_2 = \sum_{k=1}^\infty \pi_{2k}\delta_{\theta_{2k}^*},\qquad \theta_{1k}^*, \theta_{2k}^* \sim G_0$$
(the atoms $\theta_{1k}^*$ and $\theta_{2k}^*$ almost surely never coincide).
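Continuing the toy numpy sketches, here is the generative side of a DP mixture in the equivalent indicator form, with $F = \mathcal{N}(\theta, 1)$ and $G_0 = \mathcal{N}(0, 10^2)$ as arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(5)

def stick_breaking(alpha0, truncation=50):
    betas = rng.beta(1.0, alpha0, size=truncation)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    return betas * remaining

# G ~ DP(alpha0, G0) via truncated stick-breaking; G0 = N(0, 10^2) (toy choice)
pi = stick_breaking(alpha0=1.0)
theta_star = rng.normal(0.0, 10.0, size=pi.size)

# Indicator form: z_n = k with probability pi_k, then x_n ~ F(theta*_{z_n}) = N(theta*_{z_n}, 1)
N = 200
z = rng.choice(pi.size, size=N, p=pi / pi.sum())
x = rng.normal(theta_star[z], 1.0)

print("distinct clusters used by the sample:", np.unique(z).size)
```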
105. Problem of the Naive Dirichlet Process Mixture Model
Solution: make the base distribution $G_0$ discrete by putting a DP prior on the common base distribution. Hierarchical Dirichlet Process:
$$G_0 \sim \mathrm{DP}(\gamma, H),\qquad G_1, G_2 \mid G_0 \sim \mathrm{DP}(\alpha_0, G_0)$$

107. Hierarchical Dirichlet Processes
Making $G_0$ discrete forces shared clusters between $G_1$ and $G_2$.

108. Stick Breaking Construction
A Hierarchical Dirichlet Process with documents $1,\dots,D$: $G_0 \sim \mathrm{DP}(\gamma, H)$ and $G_d \mid G_0 \sim \mathrm{DP}(\alpha_0, G_0)$. The stick-breaking construction for the HDP:
$$G_0 = \sum_{k=1}^\infty \beta_k\delta_{\phi_k},\qquad \phi_k \sim H,\qquad \beta_k = \beta_k'\prod_{i=1}^{k-1}(1-\beta_i'),\qquad \beta_k' \sim \mathrm{Beta}(1,\gamma)$$
$$G_d = \sum_{k=1}^\infty \pi_{dk}\delta_{\phi_k},\qquad \pi_{dk} = \pi_{dk}'\prod_{i=1}^{k-1}(1-\pi_{di}'),\qquad \pi_{dk}' \sim \mathrm{Beta}\Big(\alpha_0\beta_k,\ \alpha_0\big(1-\sum_{i=1}^{k}\beta_i\big)\Big)$$

109. Chinese Restaurant Franchise
$G_0 \sim \mathrm{DP}(\gamma, H)$ with $\phi_k \sim H$; $G_d \mid G_0 \sim \mathrm{DP}(\alpha_0, G_0)$ with $\psi_{dt} \sim G_0$. Draw $\theta_{d1},\theta_{d2},\dots$ from a Blackwell-MacQueen urn scheme; they induce the table parameters $\psi_{d1},\psi_{d2},\dots$ Then draw $\psi_{d1},\psi_{d2},\dots$ from a Blackwell-MacQueen urn scheme over $G_0$; they induce the dish parameters $\phi_1,\phi_2,\dots$

112. Chinese Restaurant Franchise
Chinese Restaurant Franchise interpretation: each restaurant has infinitely many tables, all restaurants share a food menu, and each customer sits at a table. Generating from the CRF, for each restaurant: the first customer sits at the first table and chooses a new menu item. The n-th customer sits at a new table with probability $\frac{\alpha_0}{\alpha_0+n-1}$, and at table $t$ with probability $\frac{n_{dt}}{\alpha_0+n-1}$, where $n_{dt}$ is the number of customers at table $t$. A customer opening a new table chooses a new menu item with probability $\frac{\gamma}{\gamma+m-1}$ and an existing menu item $k$ with probability $\frac{m_k}{\gamma+m-1}$, where $m$ is the number of tables in all restaurants and $m_k$ is the number of tables that chose menu item $k$ across all restaurants.
116. Chinese Restaurant Franchise

117. HDP for Topic modeling
Questions: What can we assume about the topics in a document? What can we assume about the words in the topics? Solution: each document consists of bags of words; each word in a document has a latent topic, and the latent topics for words in a document can be grouped. Each document has a topic proportion; each topic has a word distribution. Topics must be shared across documents.
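A toy simulation of the Chinese restaurant franchise above: customers are seated per restaurant with the CRP, and each newly opened table draws its dish from the franchise-wide menu counts. Concentrations and sizes are arbitrary toy values.

```python
import numpy as np

rng = np.random.default_rng(6)

def crf(num_restaurants, customers_per, alpha0, gamma):
    """Toy Chinese restaurant franchise sampler; returns per-restaurant dish assignments."""
    menu_counts = []                          # m_k: tables serving dish k, across all restaurants
    all_dishes = []
    for d in range(num_restaurants):
        table_sizes, table_dish = [], []      # per-restaurant table sizes and their dishes
        dish_of = []
        for n in range(customers_per):
            # seat the customer: existing table t with weight n_dt, new table with weight alpha0
            p = np.array(table_sizes + [alpha0], dtype=float)
            t = rng.choice(len(p), p=p / p.sum())
            if t == len(table_sizes):         # new table: choose its dish from the shared menu
                q = np.array(menu_counts + [gamma], dtype=float)
                k = rng.choice(len(q), p=q / q.sum())
                if k == len(menu_counts):
                    menu_counts.append(0)     # brand-new dish
                menu_counts[k] += 1
                table_sizes.append(0)
                table_dish.append(k)
            table_sizes[t] += 1
            dish_of.append(table_dish[t])
        all_dishes.append(dish_of)
    return all_dishes, menu_counts

dishes, menu = crf(num_restaurants=5, customers_per=50, alpha0=1.0, gamma=1.5)
print("shared dishes:", len(menu), "tables per dish:", menu)
```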
120. Gibbs Sampling
Definition: a special case of the Markov chain Monte Carlo (MCMC) method; an iterative algorithm that constructs a dependent sequence of parameter values whose distribution converges to the target joint posterior distribution [Hof09]. Algorithm: find the full conditional distribution of each latent variable under the target distribution; initialize all latent variables; then sample until converged, drawing each latent variable in turn from its full conditional distribution.

122. Collapsed Gibbs sampling
Collapsed Gibbs sampling integrates out one or more variables when sampling some other variable. Example) there are three latent variables A, B and C. Plain Gibbs samples p(A|B,C), p(B|A,C) and p(C|A,B) sequentially; but when we integrate out B, we sample only p(A|C) and p(C|A) sequentially.

123. Review) Dirichlet Process Mixture Models

124. Review) Blackwell-MacQueen Urn Scheme for DP

125. Review) Chinese Restaurant Franchise

126. Alternative form of HDP
$G_0 \sim \mathrm{DP}(\gamma, H)$ with $\psi_{dt} \sim G_0$, so
$$G_0 \mid \psi_{dt},\dots \sim \mathrm{DP}\Big(\gamma+m,\ \frac{\gamma H + \sum_{k=1}^K m_k\delta_{\phi_k}}{\gamma+m}\Big)$$
Then $G_0$ is given as
$$G_0 = \sum_{k=1}^K \beta_k\delta_{\phi_k} + \beta_u G_u,\qquad G_u \sim \mathrm{DP}(\gamma, H),\qquad \beta = (\beta_1,\dots,\beta_K,\beta_u) \sim \mathrm{Dir}(m_1,\dots,m_K,\gamma)$$
$$p(\phi_k \mid \cdot) \propto h(\phi_k)\prod_{dn:\, z_{dn}=k} f(x_{dn} \mid \phi_k)$$
128. Hierarchical Dirichlet Processes
$$x_{dn} \sim F(\theta_{dn}),\qquad \theta_{dn} \sim G_d,\qquad G_d \sim \mathrm{DP}(\alpha_0, G_0),\qquad G_0 \sim \mathrm{DP}(\gamma, H)$$
Equivalently, for topic modeling:
$$x_{dn} \sim \mathrm{Mult}(\phi_{z_{dn}}),\quad z_{dn} \sim \mathrm{Mult}(\theta_d),\quad \phi_k \sim \mathrm{Dir}(\eta),\quad \theta_d \sim \mathrm{Dir}(\alpha_0\beta),\quad \beta \sim \mathrm{Dir}(m_{\cdot 1},\dots,m_{\cdot K},\gamma)$$

129. Gibbs Sampling for HDP
Joint distribution:
$$p(\beta, z, \phi, x, \theta, m \mid \alpha_0, \gamma, \eta) = p(\beta \mid m, \gamma)\prod_{k=1}^K p(\phi_k \mid \eta)\prod_{d=1}^D\Big[p(\theta_d \mid \alpha_0, \beta)\prod_{n=1}^N p(z_{dn} \mid \theta_d)\,p(x_{dn} \mid z_{dn}, \phi)\Big]$$
Integrating out $\theta$ and $\phi$:
$$p(z, x, \beta, m \mid \alpha_0, \gamma, \eta) = \frac{\Gamma\big(\sum_{k} m_{\cdot k}+\gamma\big)}{\prod_{k}\Gamma(m_{\cdot k})\,\Gamma(\gamma)}\prod_{k=1}^K \beta_k^{m_{\cdot k}-1}\,\beta_{K+1}^{\gamma-1}\cdot\prod_{k=1}^K \frac{\Gamma\big(\sum_{v=1}^V\eta_v\big)}{\prod_{v=1}^V\Gamma(\eta_v)}\,\frac{\prod_{v=1}^V\Gamma\big(\eta_v+n^{k}_{(\cdot),v}\big)}{\Gamma\big(\sum_{v=1}^V(\eta_v+n^{k}_{(\cdot),v})\big)}\cdot\prod_{d=1}^D \frac{\Gamma\big(\sum_{k=1}^K\alpha_0\beta_k\big)}{\prod_{k=1}^K\Gamma(\alpha_0\beta_k)}\,\frac{\prod_{k=1}^K\Gamma\big(\alpha_0\beta_k+n^{k}_{d,(\cdot)}\big)}{\Gamma\big(\sum_{k=1}^K(\alpha_0\beta_k+n^{k}_{d,(\cdot)})\big)}$$
where $n^k_{d,v}$ counts the words of vocabulary type $v$ in document $d$ assigned to topic $k$, and $(\cdot)$ marginalizes an index.

130. Gibbs Sampling for HDP
Full conditional distribution of $z$:
$$p(z_{d'n'}=k' \mid z^{-(d'n')}, m, \beta, x, \eta) = \frac{p(z_{d'n'}=k', z^{-(d'n')}, m, \beta, x \mid \eta)}{p(z^{-(d'n')}, m, \beta, x \mid \eta)} \propto \Big(\alpha_0\beta_{k'}+n^{k',-(d'n')}_{d',(\cdot)}\Big)\frac{\eta_v+n^{k',-(d'n')}_{(\cdot),v}}{\sum_{v=1}^V\big(\eta_v+n^{k',-(d'n')}_{(\cdot),v}\big)}$$

131. Gibbs Sampling for HDP
Full conditional distribution of $m$. The probability that word $x_{d'n'}$ is assigned to some table $t$ such that $k_{d't}=k'$ is
$$p(\psi_{d'n'}=t \mid k_{d't}=k', \cdot) \propto n^{-(d'n')}_{d',(\cdot),t},\qquad p(\psi_{d'n'}=\text{new table} \mid k_{d't^{\mathrm{new}}}=k', \cdot) \propto \alpha_0\beta_{k'}$$
These equations form a Dirichlet process with concentration parameter $\alpha_0\beta_{k'}$ and assignments of $n^{-(d'n')}_{d',(\cdot),t}$ to components; the corresponding distribution over the number of components is the desired conditional distribution of $m_{dk}$. Antoniak [Ant74] has shown that
$$p(m_{d'k'}=m \mid z, m^{-(d'k')}, \beta) = \frac{\Gamma(\alpha_0\beta_{k'})}{\Gamma\big(\alpha_0\beta_{k'}+n^{k'}_{d',(\cdot)}\big)}\,s\big(n^{k'}_{d',(\cdot)}, m\big)\,(\alpha_0\beta_{k'})^m$$
where $s(n,m)$ is the unsigned Stirling number of the first kind.

134. Gibbs Sampling for HDP
Full conditional distribution of $\beta$:
$$(\beta_1,\beta_2,\dots,\beta_K,\beta_u) \mid \cdot\ \sim\ \mathrm{Dir}(m_{\cdot 1},m_{\cdot 2},\dots,m_{\cdot K},\gamma)$$
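To ground the $z$ update, here is a hedged numpy sketch of the unnormalized sampling probabilities for one word in a direct-assignment sampler. The count arrays and names (n_dk, n_kv, n_k) and the symmetric scalar eta are my own illustrative conventions, not the slides' code, and the new-topic term uses the prior predictive 1/V under that symmetric assumption.

```python
import numpy as np

def sample_z(rng, d, v, n_dk, n_kv, n_k, beta, alpha0, eta):
    """One collapsed Gibbs step for z_dn (word type v in document d); counts exclude this word.

    n_dk[d, k]: words in doc d assigned topic k;  n_kv[k, v]: topic-word counts;
    n_k[k]: total words in topic k;  beta: global topic weights (last entry = new-topic mass).
    """
    K, V = n_kv.shape
    # existing topics: (alpha0*beta_k + n_dk) * (eta + n_kv) / (V*eta + n_k)
    p = (alpha0 * beta[:K] + n_dk[d]) * (eta + n_kv[:, v]) / (V * eta + n_k)
    p_new = alpha0 * beta[K] / V        # brand-new topic under a symmetric Dir(eta) prior
    probs = np.append(p, p_new)
    return rng.choice(K + 1, p=probs / probs.sum())

# toy usage with fabricated counts
rng = np.random.default_rng(9)
K, V, D = 3, 5, 2
n_kv = rng.integers(0, 5, size=(K, V)).astype(float)
n_k = n_kv.sum(axis=1)
n_dk = rng.integers(0, 5, size=(D, K)).astype(float)
beta = np.full(K + 1, 1.0 / (K + 1))
print(sample_z(rng, d=0, v=2, n_dk=n_dk, n_kv=n_kv, n_k=n_k, beta=beta, alpha0=1.0, eta=0.1))
```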
135. Gibbs Sampling for HDP
Algorithm 1: Gibbs sampling for HDP
1: Initialize all latent variables at random
2: repeat
3: for each document d do
4: for each word n in document d do
5: Sample $z_{dn}$ from its full conditional, $p(z_{dn}=k \mid \cdot) \propto \big(\alpha_0\beta_k+n^{k,-(dn)}_{d,(\cdot)}\big)\frac{\eta_v+n^{k,-(dn)}_{(\cdot),v}}{\sum_{v=1}^V(\eta_v+n^{k,-(dn)}_{(\cdot),v})}$
6: end for
7: Sample $m_{dk}$ from the Antoniak distribution, $p(m_{dk}=m \mid \cdot) \propto s\big(n^k_{d,(\cdot)}, m\big)(\alpha_0\beta_k)^m$
8: Sample $\beta \sim \mathrm{Dir}(m_{\cdot 1},m_{\cdot 2},\dots,m_{\cdot K},\gamma)$
9: end for
10: until converged

137. Stick Breaking Construction
Review of the stick-breaking construction for the HDP (slide 108).

138. Alternative Stick Breaking Construction
Problem) in the original stick-breaking construction, the weights $\beta_k$ and $\pi_{dk}$ are tightly correlated:
$$\beta_k = \beta_k'\prod_{i=1}^{k-1}(1-\beta_i'),\ \beta_k' \sim \mathrm{Beta}(1,\gamma);\qquad \pi_{dk} = \pi_{dk}'\prod_{i=1}^{k-1}(1-\pi_{di}'),\ \pi_{dk}' \sim \mathrm{Beta}\Big(\alpha_0\beta_k,\ \alpha_0\big(1-\sum_{i=1}^{k}\beta_i\big)\Big)$$
Alternative stick-breaking construction for each document [FSJW08]:
$$\psi_{dt} \sim G_0,\qquad \pi_{dt} = \pi_{dt}'\prod_{i=1}^{t-1}(1-\pi_{di}'),\qquad \pi_{dt}' \sim \mathrm{Beta}(1,\alpha_0),\qquad G_d = \sum_{t=1}^\infty \pi_{dt}\delta_{\psi_{dt}}$$

140. Alternative Stick Breaking Construction
The stick-breaking construction for the HDP becomes: $G_0 = \sum_{k=1}^\infty \beta_k\delta_{\phi_k}$ with $\phi_k \sim H$, $\beta_k = \beta_k'\prod_{i=1}^{k-1}(1-\beta_i')$, $\beta_k' \sim \mathrm{Beta}(1,\gamma)$; and $G_d = \sum_{t=1}^\infty \pi_{dt}\delta_{\psi_{dt}}$ with $\psi_{dt} \sim G_0$ as above. To connect $\psi_{dt}$ and $\phi_k$, we add an auxiliary variable $c_{dt} \sim \mathrm{Mult}(\beta)$; then $\psi_{dt} = \phi_{c_{dt}}$.

141. Alternative Stick Breaking Construction
Generative process:
1 For each global-level topic $k \in \{1,\dots,\infty\}$: 1 draw topic word proportions $\phi_k \sim \mathrm{Dir}(\eta)$; 2 draw a corpus breaking proportion $\beta_k' \sim \mathrm{Beta}(1,\gamma)$.
2 For each document $d \in \{1,\dots,D\}$: 1 for each document-level topic $t \in \{1,\dots,\infty\}$: 1 draw a document-level topic index $c_{dt} \sim \mathrm{Mult}(\sigma(\beta'))$; 2 draw a document breaking proportion $\pi_{dt}' \sim \mathrm{Beta}(1,\alpha_0)$. 2 For each word $n \in \{1,\dots,N\}$: 1 draw a topic index $z_{dn} \sim \mathrm{Mult}(\sigma(\pi_d'))$; 2 generate a word $w_{dn} \sim \mathrm{Mult}(\phi_{c_{d z_{dn}}})$,
3 where $\sigma(\cdot)$ maps stick proportions to weights: $\sigma_k(\beta') = \beta_k'\prod_{i=1}^{k-1}(1-\beta_i')$ for $k \in \{1,2,\dots\}$.

142. Variational Inference
Main idea [JGJS98]: modify the original graphical model into a simpler model, and minimize the dissimilarity between the original and the modified one.² More formally: given observed data $X$ and latent variables $Z$, we want to compute $p(Z|X)$; we posit $q(Z)$ and minimize the dissimilarity between $p$ and $q$.²
² Commonly the KL-divergence of p from q, $D_{KL}(q\|p)$.
144. KL-divergence of p from q
Find a lower bound of the log evidence $\log p(X)$:
$$\log p(X) = \log\sum_{\{Z\}} p(Z,X) = \log\sum_{\{Z\}} p(Z,X)\,\frac{q(Z|X)}{q(Z|X)} = \log\sum_{\{Z\}} q(Z|X)\,\frac{p(Z,X)}{q(Z|X)} \ \ge\ \sum_{\{Z\}} q(Z|X)\log\frac{p(Z,X)}{q(Z|X)}\ {}^{3}$$
The gap between $\log p(X)$ and this lower bound is
$$\log p(X) - \sum_{\{Z\}} q(Z|X)\log\frac{p(Z,X)}{q(Z|X)} = \sum_{Z} q(Z)\log\frac{q(Z)}{p(Z|X)} = D_{KL}(q\|p)$$
³ using Jensen's inequality

146. KL-divergence of p from q
$$\log p(X) = \sum_{\{Z\}} q(Z|X)\log\frac{p(Z,X)}{q(Z|X)} + D_{KL}(q\|p)$$
The log evidence $\log p(X)$ is fixed with respect to $q$, so minimizing $D_{KL}(q\|p)$ is equivalent to maximizing the lower bound of $\log p(X)$.

147. Variational Inference
Main idea [JGJS98], restated: to compute $p(Z|X)$, make $q(Z)$, find the lower bound of $\log p(X)$, and maximize it.

148. Variational Inference for HDP
$$q(\beta',\phi,\pi',c,z) = \prod_{k=1}^K q(\phi_k|\lambda_k)\prod_{k=1}^{K-1} q(\beta_k'|a_k^1,a_k^2)\prod_{d=1}^D\Big[\prod_{t=1}^T q(c_{dt}|\varphi_{dt})\prod_{t=1}^{T-1} q(\pi_{dt}'|\nu_{dt}^1,\nu_{dt}^2)\prod_{n=1}^N q(z_{dn}|\zeta_{dn})\Big]$$

149. Variational Inference for HDP
Find a lower bound of $\log p(w \mid \alpha_0,\gamma,\eta)$:
$$\log p(w \mid \alpha_0,\gamma,\eta) = \log\int\sum_{c,z} p(w,\beta',\phi,\pi',c,z \mid \alpha_0,\gamma,\eta)\,d\beta'\,d\phi\,d\pi' \ \ge\ E_q[\log p(w,\beta',\phi,\pi',c,z \mid \alpha_0,\gamma,\eta)] - E_q[\log q(\beta',\phi,\pi',c,z)]$$
(multiplying and dividing by $q$ inside the integral and applying Jensen's inequality, as before).

150. Variational Inference for HDP
Expanding the bound:
$$\log p(w \mid \alpha_0,\gamma,\eta) \ \ge\ \sum_{d=1}^D\Big(E_q[\log p(\pi_d'|\alpha_0)] + E_q[\log p(c_d|\beta')] + E_q[\log p(w_d|c_d,z_d,\phi)] + E_q[\log p(z_d|\pi_d')] - E_q[\log q(c_d|\varphi_d)] - E_q[\log q(\pi_d'|\nu_d^1,\nu_d^2)] - E_q[\log q(z_d|\zeta_d)]\Big) + E_q[\log p(\phi|\eta)] + E_q[\log p(\beta'|\gamma)] - E_q[\log q(\phi|\lambda)] - E_q[\log q(\beta'|a^1,a^2)]$$
We can run variational EM to maximize this lower bound of $\log p(w \mid \alpha_0,\gamma,\eta)$.
151. Variational Inference for HDP
Maximize the lower bound of $\log p(w \mid \alpha_0,\gamma,\eta)$ by taking its derivative with respect to each variational parameter:
$$\nu_{dt}^1 = 1 + \sum_{n=1}^N \zeta_{dnt},\qquad \nu_{dt}^2 = \alpha_0 + \sum_{n=1}^N\sum_{b=t+1}^T \zeta_{dnb}$$
$$\varphi_{dtk} \propto \exp\Big\{\sum_{e=1}^{k-1}\big(\Psi(a_e^2)-\Psi(a_e^1+a_e^2)\big) + \big(\Psi(a_k^1)-\Psi(a_k^1+a_k^2)\big) + \sum_{n=1}^N\sum_{v=1}^V w_{dn}^v\zeta_{dnt}\big(\Psi(\lambda_{kv})-\Psi(\textstyle\sum_{l=1}^V\lambda_{kl})\big)\Big\}$$
$$\zeta_{dnt} \propto \exp\Big\{\sum_{h=1}^{t-1}\big(\Psi(\nu_{dh}^2)-\Psi(\nu_{dh}^1+\nu_{dh}^2)\big) + \big(\Psi(\nu_{dt}^1)-\Psi(\nu_{dt}^1+\nu_{dt}^2)\big) + \sum_{k=1}^K\sum_{v=1}^V w_{dn}^v\varphi_{dtk}\big(\Psi(\lambda_{kv})-\Psi(\textstyle\sum_{l=1}^V\lambda_{kl})\big)\Big\}$$
$$a_k^1 = 1 + \sum_{d=1}^D\sum_{t=1}^T \varphi_{dtk},\qquad a_k^2 = \gamma + \sum_{d=1}^D\sum_{t=1}^T\sum_{f=k+1}^K \varphi_{dtf},\qquad \lambda_{kv} = \eta_v + \sum_{d=1}^D\sum_{n=1}^N\sum_{t=1}^T w_{dn}^v\zeta_{dnt}\varphi_{dtk}$$
where $\Psi$ is the digamma function.

152. Variational Inference for HDP
Run variational EM. E step: compute the document-level parameters $\nu_{dt}^1,\nu_{dt}^2,\varphi_{dtk},\zeta_{dnt}$. M step: compute the corpus-level parameters $a_k^1,a_k^2,\lambda_{kv}$.
Algorithm 2: Variational inference for HDP
1: Initialize the variational parameters
2: repeat
3: for each document d do
4: repeat
5: Compute the document parameters $\nu_{dt}^1,\nu_{dt}^2,\varphi_{dtk},\zeta_{dnt}$
6: until converged
7: end for
8: Compute the topic parameters $a_k^1,a_k^2,\lambda_{kv}$
9: until converged

154. Online Variational Inference
Stochastic optimization of the variational objective [WPB11]: subsample the documents, compute an approximation of the gradient based on the subsample, and follow that gradient with a decreasing step size.

155. Variational Inference for HDP
The lower bound of $\log p(w \mid \alpha_0,\gamma,\eta)$ decomposes into per-document terms and corpus-level terms:
$$\log p(w \mid \alpha_0,\gamma,\eta) \ \ge\ \sum_{d=1}^D \mathcal{L}_d + \mathcal{L}_k = E_{j}\Big[D\Big(\mathcal{L}_j + \frac{1}{D}\mathcal{L}_k\Big)\Big]$$
where $j$ is a document index sampled uniformly.

156. Online Variational Inference for HDP
Online learning algorithm for HDP: sample a document $d$; compute its optimal document-level parameters $\nu_{dt}^1,\nu_{dt}^2,\varphi_{dtk},\zeta_{dnt}$; take the gradient⁵ of the corpus-level parameters $a_k^1,a_k^2,\lambda_{kv}$ with noise; and update the corpus-level parameters with a decreasing learning rate:
$$a_k^1 = (1-\rho_e)a_k^1 + \rho_e\Big(1 + D\sum_{t=1}^T \varphi_{dtk}\Big),\qquad a_k^2 = (1-\rho_e)a_k^2 + \rho_e\Big(\gamma + D\sum_{t=1}^T\sum_{f=k+1}^K \varphi_{dtf}\Big)$$
$$\lambda_{kv} = (1-\rho_e)\lambda_{kv} + \rho_e\Big(\eta_v + D\sum_{n=1}^N\sum_{t=1}^T w_{dn}^v\zeta_{dnt}\varphi_{dtk}\Big)$$
where $\rho_e$ is the learning rate, satisfying $\sum_{e=1}^\infty \rho_e = \infty$ and $\sum_{e=1}^\infty \rho_e^2 < \infty$.
⁵ The natural gradient, which is structurally equivalent to the variational-inference update.

157. Online Variational Inference for HDP
Algorithm 3: Online variational inference for HDP
1: Initialize the variational parameters
2: e = 0
3: for each document $d \in \{1,\dots,D\}$ do
4: repeat
5: Compute the document parameters $\nu_{dt}^1,\nu_{dt}^2,\varphi_{dtk},\zeta_{dnt}$
6: until converged
7: e = e + 1
8: Compute the learning rate $\rho_e = (\tau_0+e)^{-\kappa}$, where $\tau_0 > 0$ and $\kappa \in (0.5, 1]$
9: Update the topic parameters $a_k^1,a_k^2,\lambda_{kv}$
10: end for
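The step-size conditions above are satisfied by $\rho_e = (\tau_0+e)^{-\kappa}$. A minimal sketch of the online update schedule, with tau0 and kappa as example values; `noisy_estimate` is a stand-in of mine for the bracketed one-document target computed in step 5, not code from the slides:

```python
import numpy as np

def learning_rate(e, tau0=1.0, kappa=0.7):
    """Robbins-Monro step size: sum of rho_e diverges, sum of rho_e^2 converges."""
    return (tau0 + e) ** (-kappa)

def online_update(param, noisy_estimate, rho):
    """Stochastic update: move the corpus parameter toward the one-document estimate."""
    return (1.0 - rho) * param + rho * noisy_estimate

# toy usage: lambda_kv drifts toward per-document targets with decreasing steps
rng = np.random.default_rng(7)
lam = np.ones((3, 5))                        # toy corpus-level parameter
for e in range(1, 100):
    target = rng.gamma(2.0, size=lam.shape)  # stand-in for eta_v + D * sufficient statistics
    lam = online_update(lam, target, learning_rate(e))
```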
Outline
1 Introduction
  - Motivation
  - Topic Modeling
2 Background
  - Dirichlet Distribution
  - Dirichlet Processes
3 Hierarchical Dirichlet Processes
  - Dirichlet Process Mixture Models
  - Hierarchical Dirichlet Processes
4 Inference
  - Gibbs Sampling
  - Variational Inference
  - Online Learning
  - Distributed Online Learning
5 Practical Tips
6 Summary

Motivation
- Problem 1: Inference for HDP takes a long time
- Problem 2: A continuously expanding corpus necessitates continuous updates of the model parameters
  - Updating the model parameters is not possible with plain HDP
  - It must be re-trained with the entire updated corpus
- Our approach: combine distributed inference and online learning

Distributed Online HDP
- Based on variational inference
- Mini-batch updates via stochastic learning (variational EM)
- Distribute variational EM using MapReduce

Algorithm 4 Distributed Online HDP - Driver
1: Initialize the variational parameters
2: e = 0
3: while run forever do
4:   Collect new documents s \in \{1,\ldots,S\}
5:   e = e + 1
6:   Compute learning rate \rho_e = (\tau_0 + e)^{-\kappa} where \tau_0 > 0, \kappa \in (0.5, 1]
7:   Run the MapReduce job
8:   Get the result of the job and update the topic parameters
9: end while

Algorithm 5 Distributed Online HDP - Mapper
1: The mapper gets one document s \in \{1,\ldots,S\}
2: repeat
3:   Compute document parameters \gamma^1_{dt}, \gamma^2_{dt}, \zeta_{dtk}, \varphi_{dnt}
4: until converged
5: Output the sufficient statistics for the topic parameters

Algorithm 6 Distributed Online HDP - Reducer
1: The reducer gets the sufficient statistics for each topic parameter
2: Compute the change of the topic parameter from the sufficient statistics
3: Output the change of the topic parameter

A schematic single-process sketch of this mapper/reducer split follows.
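The sketch below mimics the data flow of Algorithms 5 and 6 in plain Python; the real system runs on Hadoop, and the names and the stand-in document statistics here are mine:

import numpy as np
from collections import defaultdict

def mapper(doc_stats):
    # One mapper per document: emit (topic index, sufficient-statistics row).
    # doc_stats stands in for the converged document-level E step output
    # (a K x V array of expected word counts per topic).
    for k, row in enumerate(doc_stats):
        yield k, row

def reduce_stats(pairs):
    # Sum the sufficient statistics per topic index across all mappers.
    totals = defaultdict(lambda: 0.0)
    for k, row in pairs:
        totals[k] = totals[k] + row
    return dict(totals)

# Toy mini-batch: S = 3 documents, K = 2 topics, V = 4 vocabulary words
rng = np.random.default_rng(0)
batch = [rng.random((2, 4)) for _ in range(3)]
pairs = (pair for doc in batch for pair in mapper(doc))
per_topic = reduce_stats(pairs)
print(per_topic[0])   # summed statistics for topic 0; the driver folds this
                      # into the online update with the current learning rate

Keying the emitted statistics by topic index is what lets the reducers aggregate independently per topic, so the expensive per-document E steps parallelize freely across mappers.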
Experimental Setup
- Data: 973,266 Twitter conversations, 7.54 tweets per conversation on average (approximately 7,297,000 tweets)
- 60-node Hadoop system, each node with 8 x 2.30 GHz cores

Result
- Distributed online HDP runs faster than online HDP
- Distributed online HDP preserves the quality of the result (perplexity)

Practical Tips

Until now, I talked about Bayesian nonparametric topic modeling:
- the concept of Hierarchical Dirichlet Processes
- how to infer the latent variables in HDP
These are theoretical interests. Someone who attended the last machine learning winter school said: "Wow! There are good and interesting machine learning topics! But I want to know about practical issues, because I am in the industrial field." So I prepared some tips for him/her and you.

Implementation
https://github.com/NoSyu/Topic_Models

Some tips for using topic models
- How to manage the hyper-parameters (Dirichlet parameters)?
- How to manage the learning rate and mini-batch size in online learning?

HDP
[Figure-only slide.]

Property of Dirichlet distribution
Sample pmfs from the Dirichlet distribution [BAFG10]
[Figure-only slide: sampled pmfs for different Dirichlet parameters.]

Assign Dirichlet parameters
- Set the Dirichlet parameters to less than 1
  - People usually use a few topics to write a document; they do not use all topics
  - Each topic usually uses a few words to represent itself; it does not use all words
- We can assign individual weights to topics/words
  - Some topics are more general than others
  - Some words are more general than others
  - Words with a positive/negative meaning show up in positive/negative sentiments [JO11]
A small sampling experiment below illustrates the sparsity effect of parameters less than 1.
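A quick numpy check of that intuition (my own illustration, not from the slides):

import numpy as np

# Sample topic proportions from symmetric Dirichlets with alpha < 1 and
# alpha > 1, and compare how much mass the two largest topics capture.
rng = np.random.default_rng(0)
K = 10
sparse = rng.dirichlet([0.1] * K, size=1000)   # alpha = 0.1 < 1
dense = rng.dirichlet([10.0] * K, size=1000)   # alpha = 10 > 1

def top2_mass(samples):
    # average total mass of each sample's two largest components
    return np.sort(samples, axis=1)[:, -2:].sum(axis=1).mean()

print(top2_mass(sparse))  # close to 1: each draw concentrates on few topics
print(top2_mass(dense))   # close to 2/K: mass spread over all topics

With alpha below 1, nearly all of the probability mass lands on a handful of topics, matching the assumption that a document uses only a few topics.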
Compute learning rate

\rho_e = (\tau_0 + e)^{-\kappa} \quad\text{where } \tau_0 > 0,\ \kappa \in (0.5, 1]

a^1_k = (1-\rho_e)\,a^1_k + \rho_e\Big(1 + D\sum_{t=1}^{T}\zeta_{dtk}\Big)
a^2_k = (1-\rho_e)\,a^2_k + \rho_e\Big(\gamma + D\sum_{t=1}^{T}\sum_{f=k+1}^{K}\zeta_{dtf}\Big)
\lambda_{kv} = (1-\rho_e)\,\lambda_{kv} + \rho_e\Big(\eta_v + D\sum_{n=1}^{N}\sum_{t=1}^{T} w^v_{dn}\,\varphi_{dnt}\,\zeta_{dtk}\Big)

Meaning of each parameter
- \tau_0 slows down the early iterations of the algorithm
- \kappa is the rate at which old values of the topic parameters are forgotten
- The best setting depends on the dataset; we usually set \tau_0 = 1.0, \kappa = 0.7

Mini-batch size
- With a larger mini-batch size, distributed online HDP runs faster
- Perplexity is similar across mini-batch sizes
(The sketch after this slide shows how the mini-batch size enters the update.)
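A sketch of how a mini-batch of S documents enters the \lambda update, following the common mini-batch generalization of the online update (the naming is mine): the batch's summed statistics are scaled by D/S so they estimate the contribution of the whole corpus.

import numpy as np

def minibatch_update_lambda(lam, batch_stats, eta, D, rho):
    # lam:         (K x V) current topic parameter lambda
    # batch_stats: list of (K x V) per-document sufficient statistics
    #              for one mini-batch of S documents
    # eta:         Dirichlet prior on topics; D: corpus size; rho: step size
    S = len(batch_stats)
    lam_hat = eta + (D / S) * np.sum(batch_stats, axis=0)
    return (1.0 - rho) * lam + rho * lam_hat

A larger S lowers the variance of lam_hat, so each step is more reliable and fewer synchronizations are needed, which is consistent with the speed-up reported above.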
Summary
- Bayesian nonparametric topic modeling
- Hierarchical Dirichlet Processes: Chinese Restaurant Franchise; Stick-Breaking Construction
- Posterior inference for HDP: Gibbs Sampling; Variational Inference; Online Learning
- Slides and other materials are uploaded at http://uilab.kaist.ac.kr/members/jinyeongbak
- Implementations are updated at http://github.com/NoSyu/Topic_Models

Further Reading
- Dirichlet Process: the Dirichlet distribution and the Dirichlet Process; the Indian Buffet Process
- Bayesian nonparametric models: Machine Learning Summer School lectures by Yee Whye Teh and by Peter Orbanz; introductory articles
- Inference: MCMC; variational inference

Thank You!
JinYeong Bak, [email protected], linkedin.com/in/jybak
Users & Information Lab, KAIST

References
[Ant74] Charles E. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics (1974), 1152-1174.
[BAFG10] Bela A. Frigyik, Amol Kapila, and Maya R. Gupta. Introduction to the Dirichlet distribution and related processes. Tech. Report UWEETR-2010-0006, Department of Electrical Engineering, University of Washington, Seattle, WA, December 2010.
[Bis06] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.
[BNJ03] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research 3 (2003), 993-1022.
[FSJW08] Emily B. Fox, Erik B. Sudderth, Michael I. Jordan, and Alan S. Willsky. An HDP-HMM for systems with state persistence. In Proceedings of the 25th International Conference on Machine Learning (ICML), 2008, 312-319.
[Hof09] Peter D. Hoff. A First Course in Bayesian Statistical Methods. Springer, 2009.
[JGJS98] Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Springer, 1998.
[JO11] Yohan Jo and Alice H. Oh. Aspect and sentiment unification model for online review analysis. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (WSDM), 2011, 815-824.
[Nea00] Radford M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics 9 (2000), no. 2, 249-265.
[TJBB06] Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association 101 (2006), no. 476.
[WPB11] Chong Wang, John W. Paisley, and David M. Blei. Online variational inference for the hierarchical Dirichlet process. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), 2011, 752-760.

Image sources
- http://christmasstockimages.com/free/ideas_concepts/slides/dice_throw.htm
- http://www.flickr.com/photos/autumn2may/3965964418/
- http://www.flickr.com/photos/ppix/1802571058/
- http://yesurakezu.deviantart.com/art/Domo-s-head-exploding-with-dice-298452871
- http://www.flickr.com/photos/jwight/2710392971/
- http://www.flickr.com/photos/jasohill/2511594886/
- http://en.wikipedia.org/wiki/Kim_Yuna
- http://en.wikipedia.org/wiki/Hand_in_Hand_%28Olympics%29
- http://en.wikipedia.org/wiki/Gangnam_Style

Measurable space (\Omega, \mathcal{B})

Def) A set considered together with a \sigma-algebra on the set (http://mathworld.wolfram.com/MeasurableSpace.html).
- \Omega: the set of all outcomes, the sample space
- \mathcal{B}: a \sigma-algebra over \Omega, a special kind of collection of subsets of the sample space, i.e., a collection of events
  - Closed under complement: if A \in \mathcal{B}, then A^C \in \mathcal{B}
  - Closed under countable unions and intersections: if A, B \in \mathcal{B}, then A \cup B \in \mathcal{B} and A \cap B \in \mathcal{B}
- Properties
  - Smallest possible \sigma-algebra: \{\Omega, \emptyset\}
  - Largest possible \sigma-algebra: the powerset 2^{\Omega}
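A small worked example of these closure properties (my own, not from the slides):

\[
\Omega = \{1, 2, 3\}, \qquad
\mathcal{B} = \bigl\{\, \emptyset,\ \{1\},\ \{2,3\},\ \Omega \,\bigr\}
\]

\(\mathcal{B}\) is a \(\sigma\)-algebra over \(\Omega\): it contains \(\Omega\), is closed under complement (\(\{1\}^{C} = \{2,3\}\), \(\emptyset^{C} = \Omega\)), and is closed under countable unions and intersections. The smallest \(\sigma\)-algebra over \(\Omega\) is \(\{\emptyset, \Omega\}\); the largest is the powerset \(2^{\Omega}\), which here has \(2^{3} = 8\) elements.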
Proof 1

Decimative property: let (\theta_1, \theta_2, \ldots, \theta_K) \sim \mathrm{Dir}(\alpha_1, \alpha_2, \ldots, \alpha_K) and (\tau_1, \tau_2) \sim \mathrm{Dir}(\alpha_1\beta_1, \alpha_1\beta_2) where \beta_1 + \beta_2 = 1; then

(\theta_1\tau_1,\ \theta_1\tau_2,\ \theta_2, \ldots, \theta_K) \sim \mathrm{Dir}(\alpha_1\beta_1,\ \alpha_1\beta_2,\ \alpha_2, \ldots, \alpha_K).

Then

(G(\{\theta_1\}),\ G(A_1), \ldots, G(A_R)) = (\pi_1,\ (1-\pi_1)G'(A_1), \ldots, (1-\pi_1)G'(A_R)) \sim \mathrm{Dir}(1,\ \alpha_0 G_0(A_1), \ldots, \alpha_0 G_0(A_R))

reduces to

(G'(A_1), \ldots, G'(A_R)) \sim \mathrm{Dir}(\alpha_0 G_0(A_1), \ldots, \alpha_0 G_0(A_R)), \quad\text{i.e.}\quad G' \sim \mathrm{DP}(\alpha_0, G_0),

using the decimative property with \alpha_1 = \alpha_0, \tau_1 = (1-\pi_1), \beta_k = G_0(A_k), and \theta_k = G'(A_k).
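A quick Monte Carlo sanity check of the decimative property (my own illustration, not from the slides):

import numpy as np

# Split the first coordinate of a Dirichlet draw with an independent
# Beta(alpha1*b1, alpha1*b2) weight, and compare moments with a direct
# draw from the decimated Dirichlet.
rng = np.random.default_rng(0)
n = 200_000
alpha = np.array([2.0, 3.0, 4.0])
b1, b2 = 0.4, 0.6                      # b1 + b2 = 1

theta = rng.dirichlet(alpha, size=n)   # (theta1, theta2, theta3)
tau = rng.beta(alpha[0] * b1, alpha[0] * b2, size=n)
split = np.column_stack([theta[:, 0] * tau,
                         theta[:, 0] * (1 - tau),
                         theta[:, 1],
                         theta[:, 2]])

direct = rng.dirichlet([alpha[0] * b1, alpha[0] * b2, alpha[1], alpha[2]],
                       size=n)
print(split.mean(axis=0))              # both close to (0.8, 1.2, 3, 4) / 9
print(direct.mean(axis=0))

The componentwise means of the two constructions agree (as do higher moments, with more work), which is what the equality in distribution asserts.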