
Distributed Gibbs Sampling of Latent Dirichlet Allocation: The Gritty Details

Yi Wang
[email protected]

THIS IS AN EARLY DRAFT. YOUR FEEDBACK IS HIGHLY APPRECIATED.

1 Introduction

I started to write this chapter in November 2007, right after my first MapReduce implementation of the AD-LDA algorithm [7]. I had worked hard to understand LDA, but could not find any document that was comprehensive, complete, and contained all the necessary details. It is true that there are open source implementations of LDA, for example, the LDA Gibbs sampler in Java¹ and GibbsLDA++², but code does not reveal the mathematical derivations. I read Griffiths' paper [4] and technical report [3], but realized that most derivations are skipped there. Luckily, the paper on Topics over Time [9], an extension of LDA, shows part of the derivation. That derivation is helpful for understanding LDA, but it is too short and not self-contained. A primer [5] by Gregor Heinrich contains more details than the documents mentioned above; indeed, most of my initial understanding of LDA comes from [5]. As [5] focuses more on the Bayesian background of latent variable modeling, this article focuses more on the derivation of the Gibbs sampling algorithm for LDA. You may notice that the Gibbs sampling updating rule we derive in (38) differs slightly from what is presented in [3] and [5], but matches that in [9].

2 LDA and Its Learning Problem

In the literature, the training documents of LDA are often denoted by a long and segmented vector $W$:

$$
W = \left\{
\begin{array}{ll}
w_1, \dots, w_{N_1}, & \text{words in the 1st document} \\
w_{N_1+1}, \dots, w_{N_1+N_2}, & \text{words in the 2nd document} \\
\qquad \vdots & \qquad \vdots \\
w_{1+\sum_{j=1}^{D-1} N_j}, \dots, w_{\sum_{j=1}^{D} N_j}, & \text{words in the } D\text{-th document}
\end{array}
\right\} , \tag{1}
$$

where $N_d$ denotes the number of words in the $d$-th document.³

¹http://arbylon.net/projects/LdaGibbsSampler.java
²http://gibbslda++.sourceforge.net
³Another representation of the training documents uses two aligned vectors, $d$ and $W$:
$$
\begin{bmatrix} d \\ W \end{bmatrix}
=
\begin{bmatrix} d_1, \dots, d_N \\ w_1, \dots, w_N \end{bmatrix} , \tag{2}
$$
where $N = \sum_{j=1}^{D} N_j$, each $d_i$ indexes a document, each $w_i$ indexes a word in the vocabulary, and the tuple $[d_i, w_i]^T$ denotes an edge between the two vertices indexed by $d_i$ and $w_i$, respectively. This representation is comprehensive and is used in many implementations of LDA, e.g., [8]. However, many papers (including this article) do not use it, because it misleads readers into considering two sets of random variables, $W$ and $d$. In fact, $d$ is the structure of $W$ and is highly dependent on $W$; $d$ is also the structure of the hidden variables $Z$, as shown by (3).

In the rest of this article, we consider the hidden variables

$$
Z = \left\{
\begin{array}{ll}
z_1, \dots, z_{N_1}, & \text{topics of words in the 1st document} \\
z_{N_1+1}, \dots, z_{N_1+N_2}, & \text{topics of words in the 2nd document} \\
\qquad \vdots & \qquad \vdots \\
z_{1+\sum_{j=1}^{D-1} N_j}, \dots, z_{\sum_{j=1}^{D} N_j}, & \text{topics of words in the } D\text{-th document}
\end{array}
\right\} , \tag{3}
$$

where each $z_i \in Z$ corresponds to a word $w_i \in W$ in (1).
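To make the flattened representation concrete, here is a minimal Python sketch (the toy documents and all variable names are illustrative only, not part of the original text) that builds $W$ and the aligned document-index vector $d$ of footnote 3; the topic vector $Z$ is index-aligned with $W$ in the same way:

    import random

    # toy corpus: two tokenized documents (illustrative only)
    docs = [["river", "bank", "water"],
            ["bank", "loan", "money", "bank"]]

    W, d = [], []   # flattened words, as in (1), and their document indices, as in (2)
    for m, doc in enumerate(docs):
        for w in doc:
            W.append(w)
            d.append(m)

    K = 2                                   # number of topics
    Z = [random.randrange(K) for _ in W]    # topic assignments aligned with W, as in (3)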

3 Dirichlet and Multinomial

To derive the Gibbs sampling algorithm for LDA, it is important to be familiar with the conjugacy between the Dirichlet distribution and the multinomial distribution.

The Dirichlet distribution is defined as:

$$
\mathrm{Dir}(p; \alpha) = \frac{1}{B(\alpha)} \prod_{t=1}^{|\alpha|} p_t^{\alpha_t - 1} , \tag{4}
$$

where the normalizing constant is the multinomial beta function, which can be expressed in terms of the gamma function:

$$
B(\alpha) = \frac{\prod_{i=1}^{|\alpha|} \Gamma(\alpha_i)}{\Gamma\!\big(\sum_{i=1}^{|\alpha|} \alpha_i\big)} . \tag{5}
$$

When the Dirichlet distribution is used in LDA to model the prior, the $\alpha_i$'s are positive integers. In this case, the gamma function degenerates to the factorial function:

$$
\Gamma(n) = (n - 1)! \; . \tag{6}
$$

The multinomial distribution is defined as:

$$
\mathrm{Mult}(x; p) = \frac{n!}{\prod_{i=1}^{K} x_i!} \prod_{i=1}^{K} p_i^{x_i} , \tag{7}
$$

where $x_i$ denotes the number of times that value $i$ appears in the samples drawn from the discrete distribution $p$, $K = |x| = |p| = |\alpha|$, and $n = \sum_{i=1}^{K} x_i$. The normalization constant comes from the product of combinatorial numbers.

We say that the Dirichlet distribution is the conjugate distribution of the multinomial distribution, because if the prior of $p$ is $\mathrm{Dir}(p; \alpha)$ and $x$ is generated by $\mathrm{Mult}(x; p)$, then the posterior distribution of $p$, $p(p \mid x, \alpha)$, is a Dirichlet distribution:

$$
p(p \mid x, \alpha) = \mathrm{Dir}(p; x + \alpha)
= \frac{1}{B(x + \alpha)} \prod_{t=1}^{|\alpha|} p_t^{x_t + \alpha_t - 1} . \tag{8}
$$

Because (8) is a probability density function, integrating it over $p$ should result in 1:

$$
\begin{aligned}
1 &= \int \frac{1}{B(x + \alpha)} \prod_{t=1}^{|\alpha|} p_t^{x_t + \alpha_t - 1} \, dp \\
  &= \frac{1}{B(x + \alpha)} \int \prod_{t=1}^{|\alpha|} p_t^{x_t + \alpha_t - 1} \, dp .
\end{aligned} \tag{9}
$$

This implies

$$
\int \prod_{t=1}^{|\alpha|} p_t^{x_t + \alpha_t - 1} \, dp = B(x + \alpha) . \tag{10}
$$

This property will be used in the following derivations.
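As a quick numerical illustration of this conjugacy, here is a minimal Python sketch (assuming NumPy and SciPy are available; the counts and hyperparameters below are made up for illustration and are not from the text):

    import numpy as np
    from scipy.special import gammaln

    def log_multinomial_beta(a):
        # log B(a) = sum_i log Gamma(a_i) - log Gamma(sum_i a_i), cf. (5)
        a = np.asarray(a, dtype=float)
        return gammaln(a).sum() - gammaln(a.sum())

    alpha = np.array([2.0, 3.0, 1.0])   # hypothetical Dirichlet prior
    x = np.array([5.0, 0.0, 2.0])       # hypothetical multinomial counts

    # The posterior (8) is again a Dirichlet, with parameters x + alpha,
    # and its normalizing constant B(x + alpha) is exactly the integral (10).
    print("posterior parameters:", x + alpha)
    print("log B(x + alpha):", log_multinomial_beta(x + alpha))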

4 Learning LDA by Gibbs Sampling

There have been three strategies to learn LDA: EM with variational inference [2], EM with expectation propagation [6], and Gibbs sampling [4]. In this article, we focus on the Gibbs sampling method, whose performance is comparable with that of the other two, but which is more robust to local optima.

Gibbs sampling is one of a class of sampling methods known as Markov chain Monte Carlo. We use it to sample from the posterior distribution $p(Z \mid W)$, given the training data $W$ represented in the form of (1). As will be shown in the following text, given a sample of $Z$ we can infer the model parameters $\Phi$ and $\Theta$. This forms a learning algorithm whose general framework is shown in Fig. 1.

Figure 1: The procedure of learning LDA by Gibbs sampling.

In order to sample from $p(Z \mid W)$ using the Gibbs sampling method, we need the full conditional posterior distribution $p(z_i \mid Z_{-i}, W)$, where $Z_{-i}$ denotes all the $z_j$'s with $j \neq i$. In particular, Gibbs sampling does not require knowing the exact form of $p(z_i \mid Z_{-i}, W)$; instead, it is enough to have a function $f(\cdot)$ with

$$
f(z_i \mid Z_{-i}, W) \propto p(z_i \mid Z_{-i}, W) . \tag{11}
$$

The following mathematical derivation is for such a function $f(\cdot)$.
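To see why (11) is enough in practice: once $f$ can be evaluated for each of the $K$ candidate topics, drawing $z_i$ only requires normalizing on the fly. A minimal sketch (the function name is illustrative, not from the text):

    import random

    def sample_discrete(weights):
        # Draw an index k with probability proportional to weights[k];
        # the weights do not need to be normalized, which is all (11) requires.
        total = sum(weights)
        r = random.random() * total
        acc = 0.0
        for k, w in enumerate(weights):
            acc += w
            if r <= acc:
                return k
        return len(weights) - 1   # guard against floating-point round-off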

The Joint Distribution of LDA. We start by deriving the joint distribution (the notation used in this article is identical with that used in [5]):

$$
p(Z, W \mid \alpha, \beta) = p(W \mid Z, \beta)\, p(Z \mid \alpha) , \tag{12}
$$

which is the basis of the derivation of the Gibbs updating rule and of the parameter estimation rule. As $p(W \mid Z, \beta)$ and $p(Z \mid \alpha)$ depend on $\Phi$ and $\Theta$ respectively, we derive them separately.

According to the definition of LDA, we have

$$
p(W \mid Z, \beta) = \int p(W \mid Z, \Phi)\, p(\Phi \mid \beta)\, d\Phi , \tag{13}
$$

where $p(\Phi \mid \beta)$ has a Dirichlet distribution:

$$
p(\Phi \mid \beta) = \prod_{k=1}^{K} p(\phi_k \mid \beta) = \prod_{k=1}^{K} \frac{1}{B(\beta)} \prod_{v=1}^{V} \phi_{k,v}^{\beta_v - 1} , \tag{14}
$$

and $p(W \mid Z, \Phi)$ has a multinomial form:

$$
p(W \mid Z, \Phi) = \prod_{i=1}^{N} \phi_{z_i, w_i} = \prod_{k=1}^{K} \prod_{v=1}^{V} \phi_{k,v}^{\Psi(k,v)} , \tag{15}
$$

where $\Psi$ is a $K \times V$ count matrix and $\Psi(k,v)$ is the number of times that topic $k$ is assigned to word $v$. With $W$ and $Z$ defined by (1) and (3) respectively, we can compute $\Psi(k,v)$ by

$$
\Psi(k, v) = \sum_{i=1}^{N} \mathbb{I}\{ w_i = v \wedge z_i = k \} , \tag{16}
$$

where $N$ is the corpus size in words.
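For concreteness, the count matrix $\Psi$ (together with the analogous document–topic matrix $\Omega$ that will be defined in (24)) can be accumulated in a single pass over the flattened corpus. A minimal sketch with illustrative names that are not from the text; W, Z and d follow (1), (3) and (2):

    import numpy as np

    def count_matrices(W, Z, d, K, V, D):
        # Psi[k, v] counts how often word v is assigned topic k, as in (16);
        # Omega[m, k] counts how often topic k occurs in document m, as in (24).
        Psi = np.zeros((K, V), dtype=np.int64)
        Omega = np.zeros((D, K), dtype=np.int64)
        for wi, zi, di in zip(W, Z, d):
            Psi[zi, wi] += 1
            Omega[di, zi] += 1
        return Psi, Omega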

Given (15) and (14), (13) becomes

$$
p(W \mid Z, \beta) = \int \prod_{k=1}^{K} \frac{1}{B(\beta)} \prod_{v=1}^{V} \phi_{k,v}^{\Psi(k,v) + \beta_v - 1} \, d\Phi . \tag{17}
$$

Using the property that the integral of a product of functions of disjoint variables factorizes, which we learned back in college,

$$
\int \prod_{k=1}^{K} f_k(\phi_k) \, d\phi_1 \cdots d\phi_K = \prod_{k=1}^{K} \int f_k(\phi_k) \, d\phi_k , \tag{18}
$$


we have

$$
\begin{aligned}
p(W \mid Z, \beta) &= \prod_{k=1}^{K} \left( \int \frac{1}{B(\beta)} \prod_{v=1}^{V} \phi_{k,v}^{\Psi(k,v) + \beta_v - 1} \, d\phi_k \right) \\
&= \prod_{k=1}^{K} \left( \frac{1}{B(\beta)} \int \prod_{v=1}^{V} \phi_{k,v}^{\Psi(k,v) + \beta_v - 1} \, d\phi_k \right) .
\end{aligned} \tag{19}
$$

Using property (10), we can compute the integration over $\phi_k$ in closed form, so

$$
p(W \mid Z, \beta) = \prod_{k=1}^{K} \frac{B(\Psi_k + \beta)}{B(\beta)} , \tag{20}
$$

where $\Psi_k$ denotes the $k$-th row of the matrix $\Psi$.

Now we derive $p(Z \mid \alpha)$ analogously to $p(W \mid Z, \beta)$. Similar to (14), we have

$$
p(\Theta \mid \alpha) = \prod_{d=1}^{D} p(\theta_d \mid \alpha) = \prod_{d=1}^{D} \frac{1}{B(\alpha)} \prod_{k=1}^{K} \theta_{d,k}^{\alpha_k - 1} ; \tag{21}
$$

similar to (15), we have

$$
p(Z \mid \Theta) = \prod_{i=1}^{N} \theta_{d_i, z_i} = \prod_{d=1}^{D} \prod_{k=1}^{K} \theta_{d,k}^{\Omega(d,k)} ; \tag{22}
$$

and similar to (20) we have

$$
\begin{aligned}
p(Z \mid \alpha) &= \int p(Z \mid \Theta)\, p(\Theta \mid \alpha)\, d\Theta \\
&= \prod_{d=1}^{D} \left( \int \frac{1}{B(\alpha)} \prod_{k=1}^{K} \theta_{d,k}^{\Omega(d,k) + \alpha_k - 1} \, d\theta_d \right) \\
&= \prod_{d=1}^{D} \frac{B(\Omega_d + \alpha)}{B(\alpha)} ,
\end{aligned} \tag{23}
$$

where $\Omega$ is a $D \times K$ count matrix and $\Omega(d,k)$ is the number of times that topic $k$ is assigned to words in document $d$; $\Omega_d$ denotes the $d$-th row of $\Omega$. With the input data defined by (2), $\Omega(d,k)$ can be computed as

$$
\Omega(d, k) = \sum_{i=1}^{N} \mathbb{I}\{ d_i = d \wedge z_i = k \} , \tag{24}
$$

where $N$ is the corpus size in words. Given (20) and (23), the joint distribution (12) becomes:

$$
\begin{aligned}
p(Z, W \mid \alpha, \beta) &= p(W \mid Z, \beta)\, p(Z \mid \alpha) \\
&= \prod_{k=1}^{K} \frac{B(\Psi_k + \beta)}{B(\beta)} \cdot \prod_{d=1}^{D} \frac{B(\Omega_d + \alpha)}{B(\alpha)} .
\end{aligned} \tag{25}
$$
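Because (25) is a product of ratios of multinomial beta functions, it is convenient in practice to evaluate its logarithm; a curve of this quantity over iterations is roughly what the left pane of Fig. 3 tracks. A minimal sketch, assuming strictly positive $\alpha$ and $\beta$, the count matrices from the earlier snippet, and SciPy's gammaln (none of this code is from the original text):

    import numpy as np
    from scipy.special import gammaln

    def log_joint(Psi, Omega, alpha, beta):
        # log p(Z, W | alpha, beta) as in (25), computed from the count matrices.
        def log_B(a):
            return gammaln(a).sum() - gammaln(a.sum())
        lp = sum(log_B(Psi[k] + beta) for k in range(Psi.shape[0])) \
             - Psi.shape[0] * log_B(beta)
        lp += sum(log_B(Omega[m] + alpha) for m in range(Omega.shape[0])) \
              - Omega.shape[0] * log_B(alpha)
        return lp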


The Gibbs Updating Rule. With (25), we can derive the Gibbs updating rule for LDA:

$$
p(z_i = k \mid Z_{-i}, W, \alpha, \beta) = \frac{p(z_i = k, Z_{-i}, W \mid \alpha, \beta)}{p(Z_{-i}, W \mid \alpha, \beta)} . \tag{26}
$$

Because $z_i$ depends only on $w_i$, (26) becomes

$$
p(z_i \mid Z_{-i}, W, \alpha, \beta) = \frac{p(Z, W \mid \alpha, \beta)}{p(Z_{-i}, W_{-i} \mid \alpha, \beta)} , \quad \text{where } z_i = k . \tag{27}
$$

The numerator in (27) is (25). The denominator has a similar form:

$$
\begin{aligned}
p(Z_{-i}, W_{-i} \mid \alpha, \beta) &= p(W_{-i} \mid Z_{-i}, \beta)\, p(Z_{-i} \mid \alpha) \\
&= \prod_{k=1}^{K} \frac{B(\Psi^{-i}_k + \beta)}{B(\beta)} \cdot \prod_{d=1}^{D} \frac{B(\Omega^{-i}_d + \alpha)}{B(\alpha)} ,
\end{aligned} \tag{28}
$$

where $\Psi^{-i}_k$ denotes the $k$-th row of the $K \times V$ count matrix $\Psi^{-i}$, and $\Psi^{-i}(k,v)$ is the number of times that topic $k$ is assigned to word $v$, but with the $i$-th word and its topic assignment excluded. Similarly, $\Omega^{-i}_d$ is the $d$-th row of the count matrix $\Omega^{-i}$, and $\Omega^{-i}(d,k)$ is the number of words in document $d$ that are assigned topic $k$, but with the $i$-th word and its topic assignment excluded. In more detail,

$$
\begin{aligned}
\Psi^{-i}(k, v) &= \sum_{\substack{1 \le j \le N \\ j \neq i}} \mathbb{I}\{ w_j = v \wedge z_j = k \} , \\
\Omega^{-i}(d, k) &= \sum_{\substack{1 \le j \le N \\ j \neq i}} \mathbb{I}\{ d_j = d \wedge z_j = k \} ,
\end{aligned} \tag{29}
$$

where $d_j$ denotes the document to which the $j$-th word in the corpus belongs. Some properties follow directly from these definitions:

$$
\begin{aligned}
\Psi(k, v) &= \begin{cases} \Psi^{-i}(k, v) + 1 & \text{if } v = w_i \text{ and } k = z_i; \\ \Psi^{-i}(k, v) & \text{otherwise.} \end{cases} \\
\Omega(d, k) &= \begin{cases} \Omega^{-i}(d, k) + 1 & \text{if } d = d_i \text{ and } k = z_i; \\ \Omega^{-i}(d, k) & \text{otherwise.} \end{cases}
\end{aligned} \tag{30}
$$

From here on it is convenient to switch to count notation: write $n(v; k) \equiv \Psi(k, v)$ for the number of times word $v$ is assigned topic $k$, and $n(k; d) \equiv \Omega(d, k)$ for the number of times topic $k$ appears in document $d$; $n^{-i}(\cdot\,;\cdot)$ denotes the same counts with the $i$-th word excluded. From (30) we have

$$
\begin{aligned}
\sum_{v=1}^{V} n(v; z_i) &= 1 + \sum_{v=1}^{V} n^{-i}(v; z_i) , \\
\sum_{k=1}^{K} n(k; d_i) &= 1 + \sum_{k=1}^{K} n^{-i}(k; d_i) .
\end{aligned} \tag{31}
$$

Both (30) and (31) are important for computing (27). Expanding the products in (25) and taking the ratio in (27), the factors with $k \neq z_i$ and $d \neq d_i$ cancel due to (30), and we get

$$
p(z_i \mid Z_{-i}, W, \alpha, \beta)
= \frac{B\big(n(\cdot\,; z_i) + \beta\big)}{B\big(n^{-i}(\cdot\,; z_i) + \beta\big)} \cdot \frac{B\big(n(\cdot\,; d_i) + \alpha\big)}{B\big(n^{-i}(\cdot\,; d_i) + \alpha\big)} . \tag{32}
$$


This equation can be further simplified by expanding $B(\cdot)$. Considering

$$
B(x) = \frac{\prod_{k=1}^{\dim x} \Gamma(x_k)}{\Gamma\!\big(\sum_{k=1}^{\dim x} x_k\big)} , \tag{33}
$$

(32) becomes

$$
\frac{\;\dfrac{\prod_{t=1}^{V} \Gamma\big(n(t; z_i) + \beta_t\big)}{\Gamma\big(\sum_{t=1}^{V} n(t; z_i) + \beta_t\big)}\;}
     {\;\dfrac{\prod_{t=1}^{V} \Gamma\big(n^{-i}(t; z_i) + \beta_t\big)}{\Gamma\big(\sum_{t=1}^{V} n^{-i}(t; z_i) + \beta_t\big)}\;}
\cdot
\frac{\;\dfrac{\prod_{k=1}^{K} \Gamma\big(n(k; d_i) + \alpha_k\big)}{\Gamma\big(\sum_{k=1}^{K} n(k; d_i) + \alpha_k\big)}\;}
     {\;\dfrac{\prod_{k=1}^{K} \Gamma\big(n^{-i}(k; d_i) + \alpha_k\big)}{\Gamma\big(\sum_{k=1}^{K} n^{-i}(k; d_i) + \alpha_k\big)}\;} \; . \tag{34}
$$

Using (30), we can remove most factors in the products of (34) and get

$$
\frac{\;\dfrac{\Gamma\big(n(w_i; z_i) + \beta_{w_i}\big)}{\Gamma\big(\sum_{t=1}^{V} n(t; z_i) + \beta_t\big)}\;}
     {\;\dfrac{\Gamma\big(n^{-i}(w_i; z_i) + \beta_{w_i}\big)}{\Gamma\big(\sum_{t=1}^{V} n^{-i}(t; z_i) + \beta_t\big)}\;}
\cdot
\frac{\;\dfrac{\Gamma\big(n(z_i; d_i) + \alpha_{z_i}\big)}{\Gamma\big(\sum_{k=1}^{K} n(k; d_i) + \alpha_k\big)}\;}
     {\;\dfrac{\Gamma\big(n^{-i}(z_i; d_i) + \alpha_{z_i}\big)}{\Gamma\big(\sum_{k=1}^{K} n^{-i}(k; d_i) + \alpha_k\big)}\;} \; . \tag{35}
$$

Here it is important to know that

$$
\Gamma(y) = (y - 1)! \quad \text{if } y \text{ is a positive integer} , \tag{36}
$$

so we can expand the gamma functions in (35) into factorials and use (30) to cancel most factors. This leads us to the Gibbs updating rule for LDA:

$$
p(z_i \mid Z_{-i}, W, \alpha, \beta)
= \frac{n(w_i; z_i) + \beta_{w_i} - 1}{\big[\sum_{t=1}^{V} n(t; z_i) + \beta_t\big] - 1}
\cdot \frac{n(z_i; d_i) + \alpha_{z_i} - 1}{\big[\sum_{k=1}^{K} n(k; d_i) + \alpha_k\big] - 1} \; . \tag{37}
$$

Note that the denominator of the second factor on the right-hand side of (37) does not depend on $z_i$, the argument of the function $p(z_i \mid Z_{-i}, W, \alpha, \beta)$. Because the Gibbs sampling method requires only a function $f(z_i) \propto p(z_i)$, cf. (11), we can use the following updating rule in practice:

$$
p(z_i \mid Z_{-i}, W, \alpha, \beta)
\propto \frac{n(w_i; z_i) + \beta_{w_i} - 1}{\big[\sum_{t=1}^{V} n(t; z_i) + \beta_t\big] - 1}
\cdot \big[ n(z_i; d_i) + \alpha_{z_i} - 1 \big] . \tag{38}
$$
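To connect (38) with working code: if the counts have already been computed with the $i$-th word excluded (the $n^{-i}$ quantities), the $-1$ terms are accounted for, and the unnormalized weights for all $K$ candidate topics can be evaluated at once. A minimal sketch with illustrative names, not code from the text:

    import numpy as np

    def conditional_weights(Psi_excl, Omega_excl_d, nz_excl, wi, alpha, beta):
        # Unnormalized p(z_i = k | Z_{-i}, W) for k = 0..K-1, cf. (38) and (44).
        #   Psi_excl[k, v] : n^{-i}(v; k), topic-word counts excluding word i
        #   Omega_excl_d[k]: n^{-i}(k; d_i), topic counts of document d_i excluding word i
        #   nz_excl[k]     : sum_t (n^{-i}(t; k) + beta_t), cached per topic
        return (Psi_excl[:, wi] + beta[wi]) / nz_excl * (Omega_excl_d + alpha)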

Parameter Estimation. Given the sampled topics $Z$, as well as the input $\alpha$, $\beta$, and $W$, we can estimate the model parameters $\Phi$ and $\Theta$.

$\Psi(k,v)$ counts the co-occurrences of word $v$ and topic $k$, and is therefore proportional to the joint probability $p(w = v, z = k)$. Similarly, $n(k; d)$ is proportional to $p(z = k, d)$. Therefore, we can estimate the model parameters by simply normalizing $\Psi(k,v)$ and $n(k; d)$; this leads to the estimation equations in (43).

A more precise understanding comes from [5]. According to the definitions $\phi_{k,t} = p(w = t \mid z = k)$ and $\theta_{m,k} = p(z = k \mid d = m)$, we can interpret $\phi_{k,t}$ and $\theta_{m,k}$ as the posterior predictive probabilities:

$$
\begin{aligned}
\phi_{k,t} &= p(w = t \mid z = k, W, Z, \beta) , \\
\theta_{m,k} &= p(z = k \mid d = m, Z, \alpha) .
\end{aligned} \tag{39}
$$


The following is a trick that derives (39) using the joint distribution (25):

$$
\begin{aligned}
\phi_{k,t} \cdot \theta_{m,k} &= p(w = t \mid z = k, W, Z, \beta) \cdot p(z = k \mid d = m, Z, \alpha) \\
&= p(w = t, z = k \mid d = m, W, Z, \alpha, \beta) \\
&= \frac{p(\widetilde{W}, \widetilde{Z} \mid \alpha, \beta)}{p(W, Z \mid \alpha, \beta)} ,
\end{aligned} \tag{40}
$$

where $\widetilde{W} = W \cup \{t\}$ and $\widetilde{Z} = Z \cup \{k\}$ denote the training data extended by one additional occurrence of word $t$ assigned to topic $k$ in document $d = m$. Both the numerator and the denominator can be computed using (25). Analogously to what we did to compute (27), (40) becomes:

$$
\phi_{k,t} \cdot \theta_{m,k}
= \frac{p(\widetilde{W}, \widetilde{Z} \mid \alpha, \beta)}{p(W, Z \mid \alpha, \beta)}
= \frac{\;\dfrac{\Gamma\big(n(t;k) + 1 + \beta_t\big)}{\Gamma\big(\sum_{t'=1}^{V} n(t';k) + \beta_{t'} + 1\big)}\;}
       {\;\dfrac{\Gamma\big(n(t;k) + \beta_t\big)}{\Gamma\big(\sum_{t'=1}^{V} n(t';k) + \beta_{t'}\big)}\;}
\cdot
\frac{\;\dfrac{\Gamma\big(n(k;m) + 1 + \alpha_k\big)}{\Gamma\big(\sum_{k'=1}^{K} n(k';m) + \alpha_{k'} + 1\big)}\;}
     {\;\dfrac{\Gamma\big(n(k;m) + \alpha_k\big)}{\Gamma\big(\sum_{k'=1}^{K} n(k';m) + \alpha_{k'}\big)}\;} \; . \tag{41}
$$

Note that the "+1" terms in (41) are due to $|\widetilde{W}| = |W| + 1$ and $|\widetilde{Z}| = |Z| + 1$. Analogous to the simplification we did for (34), simplifying (41) results in

$$
\phi_{k,t} \cdot \theta_{m,k}
= \frac{n(t;k) + \beta_t}{\sum_{t'=1}^{V} n(t';k) + \beta_{t'}}
\cdot \frac{n(k;m) + \alpha_k}{\sum_{k'=1}^{K} n(k';m) + \alpha_{k'}} . \tag{42}
$$

Compared with (38), (42) does not have the "−1" terms, because of the "+1" terms in (41).

Because the two factors on the right-hand side of (42) correspond to $p(w = t \mid z = k, W, Z, \beta)$ and $p(z = k \mid d = m, Z, \alpha)$ respectively, according to the definitions of $\phi_{k,t}$ and $\theta_{m,k}$ we have

$$
\begin{aligned}
\phi_{k,t} &= \frac{n(t;k) + \beta_t}{\sum_{t'=1}^{V} n(t';k) + \beta_{t'}} , \\
\theta_{m,k} &= \frac{n(k;m) + \alpha_k}{\sum_{k'=1}^{K} n(k';m) + \alpha_{k'}} .
\end{aligned} \tag{43}
$$
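In code, (43) amounts to a smoothed row normalization of the count arrays. A minimal sketch (the function name is illustrative; the array layout NWZ[w,z], NZM[z,m] anticipates the implementation discussed below):

    import numpy as np

    def estimate_parameters(NWZ, NZM, alpha, beta):
        # Read out Phi and Theta from the counts according to (43).
        #   NWZ[v, k]: n(v; k) topic-term counts, NZM[k, m]: n(k; m) document-topic counts.
        # Returns Phi[k, v] = p(w = v | z = k) and Theta[m, k] = p(z = k | d = m).
        Phi = (NWZ.T + beta) / (NWZ.sum(axis=0, keepdims=True).T + beta.sum())
        Theta = (NZM.T + alpha) / (NZM.sum(axis=0, keepdims=True).T + alpha.sum())
        return Phi, Theta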

The Algorithm. With (38) known, we can realize the learning procedure shown in Fig. 1 with the algorithm listed as Algorithm 1 below.

It is notable that this algorithm does not require $\alpha$ and $\beta$ as inputs, because it takes $\alpha$ and $\beta$ to be zero vectors. This simplification is reasonable because, by the conjugacy between the Dirichlet and multinomial distributions [1], we can interpret $\alpha_k$ and $\beta_t$ as the number of times that topic $k$ and word $t$, respectively, are observed outside of the training data set. Taking $\alpha$ and $\beta$ as zero vectors means that we have no such prior knowledge.

This simplification makes it easy to maintain the quantities used in (38). We use a 2D array NWZ[w,z] to maintain $n(w; z)$ and a 2D array NZM[z,m] to maintain $n(z; m)$. To accelerate the computation, we use a 1D array NZ[z] to maintain $n(z) = \sum_{t=1}^{V} n(t; z)$. Note that we do not need an array NM[m] for $n(m) = \sum_{k=1}^{K} n(k; m)$, which appears in (37) but is unnecessary in (38).


    zero all count variables NWZ, NZM, NZ;
    foreach document m ∈ [1, D] do
        foreach word n ∈ [1, N_m] in document m do
            sample topic index z_{m,n} ~ Mult(1/K) for word w_{m,n};
            increment document-topic count: NZM[z_{m,n}, m]++;
            increment topic-term count: NWZ[w_{m,n}, z_{m,n}]++;
            increment topic-term sum: NZ[z_{m,n}]++;
        end
    end
    while not finished do
        foreach document m ∈ [1, D] do
            foreach word n ∈ [1, N_m] in document m do
                NWZ[w_{m,n}, z_{m,n}]--, NZ[z_{m,n}]--, NZM[z_{m,n}, m]--;
                sample topic index z_{m,n} according to (44);
                NWZ[w_{m,n}, z_{m,n}]++, NZ[z_{m,n}]++, NZM[z_{m,n}, m]++;
            end
        end
        if converged and L sampling iterations since last read-out then
            read out the parameter sets Θ and Φ according to (43);
        end
    end

Algorithm 1: The Gibbs sampling algorithm that learns LDA.

If $\alpha$ and $\beta$ are given as non-zero vectors, we can easily modify the algorithm to make use of them: initialize NWZ[w,z] $= n(w;z) + \beta_w$, NZM[z,m] $= n(z;m) + \alpha_z$, and NZ[z] $= \sum_{t=1}^{V} n(t;z) + \beta_t$. Only the initialization part of the algorithm needs to be modified.

The input of this algorithm is not in the form of (1); instead, it is in a more convenient form: each document is represented by the sequence of words it contains, and the input is given as a set of documents. Denote the $n$-th word in the $m$-th document by $w_{m,n}$; the corresponding topic is denoted by $z_{m,n}$ and the corresponding document is $m$. With this correspondence known, the above derivations are easily implemented in the algorithm.

The sampling operation in this algorithm is not identical to the full conditional posterior distribution (38); instead, it does not contain the "−1" terms:

$$
p(z_i) \propto \frac{n(w_i; z_i) + \beta_{w_i}}{\sum_{t=1}^{V} n(t; z_i) + \beta_t} \cdot \big[ n(z_i; d_i) + \alpha_{z_i} \big] . \tag{44}
$$

This makes it elegant to maintain consistency between sampling new topics and updating the counts (NWZ, NZ and NZM) with three lines of code: decrease the counts, sample, and increase the counts. (This trick also illustrates the physical meaning of the "−1" terms in (38); I learned it from Heinrich's code at http://arbylon.net/projects/LdaGibbsSampler.java.)
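Putting the pieces together, the inner loop of Algorithm 1 can be sketched as follows. The array names mirror NWZ, NZM and NZ from the text, but the code is an illustrative sketch (it assumes strictly positive $\alpha$ and $\beta$, i.e. the non-zero-prior variant discussed above) rather than the reference implementation:

    import numpy as np

    def gibbs_sweep(W, d, Z, NWZ, NZM, NZ, alpha, beta):
        # One full Gibbs sweep over the corpus, sampling each z_i from (44).
        #   NWZ[v, k]: topic-term counts, NZM[k, m]: document-topic counts,
        #   NZ[k]: per-topic totals; all include the current assignments on entry.
        K = NZ.shape[0]
        for i, (wi, di) in enumerate(zip(W, d)):
            zi = Z[i]
            NWZ[wi, zi] -= 1; NZM[zi, di] -= 1; NZ[zi] -= 1   # exclude word i
            # unnormalized weights, cf. (44): no "-1" terms are needed here
            weights = (NWZ[wi, :] + beta[wi]) / (NZ + beta.sum()) * (NZM[:, di] + alpha)
            zi = np.random.choice(K, p=weights / weights.sum())
            Z[i] = zi
            NWZ[wi, zi] += 1; NZM[zi, di] += 1; NZ[zi] += 1   # put word i back
        return Z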


Figure 2: The ground-truth Φ used to synthesize our testing data.

5 Experiments

The synthetic image data mentioned in [4] is useful for testing the correctness of any implementation of an LDA learning algorithm. To synthesize this data set, we fix $\Phi = \{\phi_k\}_{k=1}^{K=10}$ as visualized in Fig. 2 (also Fig. 1 in [4]), set $\alpha = [1]$, and set the number of words/pixels in each document/image to 100. Because every image has $5 \times 5$ pixels, the vocabulary size is 25. (All these settings are identical to those described in [4].)
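A sketch of how such a data set can be generated, assuming (as in [4] and Fig. 2) that the ten ground-truth topics are the five horizontal and five vertical bars of the 5×5 grid; the number of documents is left as a parameter since the text does not specify it, and all names are illustrative:

    import numpy as np

    def make_bars_phi(K=10, side=5):
        # Ground-truth Phi: 5 horizontal and 5 vertical bars on a 5x5 grid (cf. Fig. 2).
        Phi = np.zeros((K, side, side))
        for r in range(side):
            Phi[r, r, :] = 1.0          # horizontal bar topic
            Phi[side + r, :, r] = 1.0   # vertical bar topic
        Phi = Phi.reshape(K, side * side)
        return Phi / Phi.sum(axis=1, keepdims=True)

    def synthesize(Phi, num_docs, doc_len=100, alpha=1.0, seed=0):
        # Draw documents from the LDA generative process with the given Phi and alpha = [1].
        rng = np.random.default_rng(seed)
        K, V = Phi.shape
        docs = []
        for _ in range(num_docs):
            theta = rng.dirichlet(alpha * np.ones(K))
            z = rng.choice(K, size=doc_len, p=theta)
            docs.append(np.array([rng.choice(V, p=Phi[k]) for k in z]))
        return docs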

Fig. 3 shows the result of running Gibbs sampling to learn an LDA model from this testing data set. The left pane shows the convergence of the likelihood, and the right pane shows how the estimate of $\Phi$ evolves over the iterations of the algorithm. The 20 rows of images in the right pane visualize $\Phi$ as estimated at iterations 1, 2, 3, 5, 7, 10, 15, 20, 25, 50, 75, 100, 125, 150, 175, 200, 250, 300, 400, and 499. From this figure we can see that from iteration 100 on, the estimated $\Phi$ is almost identical to Fig. 1(a) of [4].

6 Acknowledgement

Thanks go to Gregor Heinrich and Xuerui Wang for helping me grasp some details in the derivation of Gibbs sampling of LDA.

References

[1] C. Bishop. Pattern Recognition and Machine Learning. Springer, 2007.

[2] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. JMLR, 3:993–1022, 2003.

[3] T. Griffiths. Gibbs sampling in the generative model of latent Dirichlet allocation. Technical report, Stanford University, 2004.

[4] T. Griffiths and M. Steyvers. Finding scientific topics. PNAS, 101:5228–5235, 2004.

[5] G. Heinrich. Parameter estimation for text analysis. Technical report, vsonix GmbH and University of Leipzig, Germany, 2009.

[6] T. Minka and J. Lafferty. Expectation propagation for the generative aspect model. In UAI, 2002. https://research.microsoft.com/~minka/papers/aspect/minka-aspect.pdf.


Figure 3: The result of running the Gibbs sampling algorithm to learn an LDA model from synthetic data. Each row of 10 images in the right pane visualizes the $\Phi = \{\phi_k\}_{k=1}^{K=10}$ estimated at the iterations indicated by red circles in the left pane.

[7] D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed inference for latent Dirichlet allocation. In NIPS, 2007.


[8] M. Steyvers and T. Griffiths. Matlab Topic Modeling Toolbox. Technical report, University of California, Berkeley, 2007.

[9] X. Wang and A. McCallum. Topics over time: A non-Markov continuous-time model of topical trends. In KDD, 2006.
