
Nested Hierarchical Dirichlet Processes

John Paisley, Chong Wang, David M. Blei, and Michael I. Jordan

Review by David Carlson


Overview

- Dirichlet process (DP)
- Nested Chinese restaurant process topic model (nCRP)
- Hierarchical Dirichlet process topic model (HDP)
- Nested hierarchical Dirichlet process topic model (nHDP)
- Outline of the stochastic variational Bayesian procedure
- Results


Dirichlet Process

In general, a distribution $G$ drawn from a Dirichlet process can be written as:

$$G \sim \mathrm{DP}(\alpha G_0) \quad (1)$$

$$G = \sum_{i=1}^{\infty} p_i \,\delta_{\theta_i} \quad (2)$$

where $p_i$ is a probability and each $\theta_i$ is an atom. We can construct a Dirichlet process mixture model over data $W_1, \ldots, W_N$:

$$W_n \mid \phi_n \sim F_W(\phi_n) \quad (3)$$

$$\phi_n \mid G \sim G \quad (4)$$
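
As a concrete illustration (not from the paper), the Python sketch below draws data from a truncated DP mixture, approximating $G$ with the stick-breaking construction introduced on the next slide. The Gaussian base measure, Gaussian likelihood, and truncation level $T$ are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, T, N = 1.0, 100, 500           # concentration, truncation level, data size

# G = sum_i p_i * delta_{theta_i}: truncated stick-breaking weights and atoms
V = rng.beta(1.0, alpha, size=T)
p = V * np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))
p /= p.sum()                          # renormalize the truncated weights
theta = rng.normal(0.0, 5.0, size=T)  # atoms theta_i ~ G0 = N(0, 25), illustrative

# phi_n | G ~ G, then W_n | phi_n ~ F_W(phi_n), here N(phi_n, 1)
phi = theta[rng.choice(T, size=N, p=p)]
W = rng.normal(phi, 1.0)
```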


Generating the Dirichlet process

There are two common methods for generating the Dirichlet process. The first is the Chinese restaurant process (CRP), where we integrate out $G$ to get the distribution of $\phi_{n+1}$ given the previous values:

$$\phi_{n+1} \mid \phi_1, \ldots, \phi_n \sim \frac{\alpha}{\alpha + n}\, G_0 + \sum_{i=1}^{n} \frac{1}{\alpha + n}\, \delta_{\phi_i} \quad (5)$$

The second commonly used method is the stick-breaking construction. In this case, one can construct $G$ as:

$$G = \sum_{i=1}^{\infty} V_i \prod_{j=1}^{i-1} (1 - V_j)\, \delta_{\theta_i}, \qquad V_i \overset{iid}{\sim} \mathrm{Beta}(1, \alpha), \quad \theta_i \overset{iid}{\sim} G_0 \quad (6)$$

Because the stick-breaking construction maintains the independence among $\phi_1, \ldots, \phi_N$, it has advantages over the CRP during mean-field variational inference.
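
A minimal sketch of sequential sampling from the CRP predictive in Equation (5); the base measure $G_0 = N(0, 25)$ is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, N = 1.0, 500
phi = []
for n in range(N):
    # new atom from G0 with probability alpha / (alpha + n) ...
    if rng.random() < alpha / (alpha + n):
        phi.append(rng.normal(0.0, 5.0))
    # ... otherwise reuse a previous draw; each delta_{phi_i} carries mass 1/(alpha+n)
    else:
        phi.append(phi[rng.integers(n)])
phi = np.array(phi)
```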


Nested Chinese restaurant processes

The CRP (or DP) is a flat model. Often it is of interest to organize the topics (or atoms) hierarchically, so that subcategories of larger categories sit in a tree structure. One way to construct such a hierarchical structure is through the nested Chinese restaurant process (nCRP).


Nested Chinese restaurant processes

Consider an extension of the restaurant analogy. Each customer selects a table (parameter) according to the CRP. From that table, the customer moves to a restaurant accessible only from that table, where he or she chooses a table from that restaurant-specific CRP, and so on.

Fig. 1. An example of path structures for the nested Chinese restaurant process (nCRP) and the nested hierarchical Dirichlet process (nHDP) for hierarchical topic modeling. With the nCRP, the topics for a document are restricted to lying along a single path to a root node. With the nHDP, each document has access to the entire tree, but a document-specific distribution on paths will place high probability on a particular subtree. The goal of the nHDP is to learn a thematically consistent tree as achieved by the nCRP, while allowing for the cross-thematic borrowings that naturally occur within a document.

Hierarchical topic models do this by learning a tree structure for the underlying topics, with the inferential goal being that topics closer to the root are more general, and gradually become more specific in thematic content when following a path down the tree.

The nCRP is a Bayesian nonparametric prior for hierarchical topic models, but is limited in the hierarchies it can model. We illustrate this limitation in Figure 1. The nCRP models the topics that go into constructing a document as lying along one path of the tree. From a practical standpoint this is a disadvantage, since inference in trees over three levels is computationally hard [2], [1], and hence in practice each document is limited to only three underlying topics. Moreover, this is also a significant disadvantage from a modeling standpoint.

As a simple example, consider a document on ESPN.com about an injured player, compared with an article in a sports medicine journal. Both documents will contain words about medicine and words about sports. Should the nCRP select a path transitioning from sports to medicine, or vice versa? Depending on the article, both options are reasonable, and during the learning process the model will either acquire…

As shown in the figure, each customer (document) drawn through the nested CRP chooses a single path down the tree.


Modeling the nCRP

Let $i_l = (i_1, \ldots, i_l)$ be a path to a node at level $l$ of the tree. Then we can define the DP at the end of this path as:

$$G_{i_l} = \sum_{j=1}^{\infty} V_{(i_l, j)} \prod_{m=1}^{j-1} \left(1 - V_{(i_l, m)}\right) \delta_{\theta_{(i_l, j)}} \quad (7)$$

If the next node is child $j$, then the nCRP transitions to the DP $G_{i_{l+1}}$, where we define $i_{l+1} = (i_l, j)$.
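
To make the recursion concrete, here is a small Python sketch (an illustration, not the authors' code) that lazily instantiates the per-node stick-breaking weights of Equation (7) for a truncated tree and samples one root-to-leaf path; the truncation widths are an assumption borrowed from the experiments later in the review.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, widths = 1.0, (10, 7, 5)        # children per level (truncation), illustrative

def stick_weights(K, alpha, rng):
    """Truncated stick-breaking weights V_j * prod_{m<j} (1 - V_m)."""
    V = rng.beta(1.0, alpha, size=K)
    w = V * np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))
    return w / w.sum()

tree = {}                              # path tuple i_l -> child weights of G_{i_l}

def sample_path(tree, rng):
    path = ()
    for K in widths:
        if path not in tree:           # lazily create the DP at this node
            tree[path] = stick_weights(K, alpha, rng)
        j = int(rng.choice(K, p=tree[path]))
        path = path + (j,)             # i_{l+1} = (i_l, j)
    return path                        # e.g. (3, 0, 4)
```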


Nested CRP topic models

We can use the nCRP to define a path down a shared tree, but we want to use this tree to model data. One application of the tree structure is a topic model, where each atom $\theta_{(i_l, j)}$ defines a topic:

$$\theta_{(i_l, j)} \sim \mathrm{Dir}(\eta) \quad (8)$$

Each document in the nCRP chooses one path down the tree according to a Markov process, and the path provides a sequence of topics $\phi_d = (\phi_{d,1}, \phi_{d,2}, \ldots)$ which we can use to generate the words in the document. The distribution over these topics is provided by a new document-specific stick-breaking process:

$$G^{(d)} = \sum_{j=1}^{\infty} U_{d,j} \prod_{m=1}^{j-1} (1 - U_{d,m})\, \delta_{\phi_{d,j}}, \qquad U_{d,j} \overset{iid}{\sim} \mathrm{Beta}(\gamma_1, \gamma_2) \quad (9)$$
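
Continuing the sketch, a document under the nCRP topic model can be generated as below (illustrative Python, with hypothetical vocabulary size, word count, and hyperparameters): the path supplies one topic per level, and the document-level sticks of Equation (9) pick which level each word uses.

```python
import numpy as np

rng = np.random.default_rng(3)
gamma1, gamma2, eta = 1.0, 1.0, 0.1    # hyperparameters, illustrative
V_size, n_words, L = 1000, 200, 3      # vocabulary, words per document, tree depth

# topics along the document's path, theta ~ Dir(eta) as in Equation (8)
topics = rng.dirichlet(eta * np.ones(V_size), size=L)

# document-specific stick-breaking over the path's levels, Equation (9)
U = rng.beta(gamma1, gamma2, size=L)
q = U * np.concatenate(([1.0], np.cumprod(1.0 - U)[:-1]))
q /= q.sum()

levels = rng.choice(L, size=n_words, p=q)   # topic (level) for each word
words = np.array([rng.choice(V_size, p=topics[l]) for l in levels])
```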


Problems with nCRP

There are several problems with the nCRP:

- Each document is only allowed to follow one path down the tree, limiting the number of topics per document to the number of levels (typically ≤ 4), which can force topics to blend (have less specificity).
- Topics are often repeated in many different parts of the tree when they appear as random effects in documents.
- Although the tree is shared, very few topics are shared between a set of documents, because each document follows its own single path down the tree.

We would like to learn a distribution over the entire shared tree for each document, giving a more flexible modeling structure. The solution to this problem is the nested hierarchical Dirichlet process.


Hierarchical Dirichlet processes

The HDP is a multi-level version of the Dirichlet process, described by the hierarchical process:

$$G_d \mid G \sim \mathrm{DP}(\beta G), \qquad G \sim \mathrm{DP}(\alpha G_0) \quad (10)$$

In this case, each document has its own DP ($G_d$), drawn from a shared DP $G$. In this way, the weights on each topic (atom) are allowed to vary smoothly from document to document, while still sharing statistical strength.

This can be represented as a stick-breaking process as well:

$$G_d = \sum_{i=1}^{\infty} V_i^d \prod_{j=1}^{i-1} \left(1 - V_j^d\right) \delta_{\phi_i}, \qquad V_i^d \overset{iid}{\sim} \mathrm{Beta}(1, \beta), \quad \phi_i \overset{iid}{\sim} G \quad (11)$$
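
The key mechanism is that $G$ is discrete, so the document-level draws $\phi_i \sim G$ repeat the shared atoms. A truncated Python sketch, with illustrative choices of truncation and concentrations:

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, beta, T = 5.0, 1.0, 50          # concentrations and truncation, illustrative

def stick(K, a, rng):
    V = rng.beta(1.0, a, size=K)
    w = V * np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))
    return w / w.sum()

top_w = stick(T, alpha, rng)           # G = sum_i p_i delta_{theta_i}

def document_weights(rng):
    """Weights of G_d on the shared atoms theta_1, ..., theta_T (Eq. 11)."""
    Vd = stick(T, beta, rng)           # V_i^d ~ Beta(1, beta) sticks
    idx = rng.choice(T, size=T, p=top_w)   # phi_i ~ G: indices of shared atoms
    w = np.zeros(T)
    np.add.at(w, idx, Vd)              # collapse repeated atoms onto theta_i
    return w
```

Because the draws of `idx` concentrate on the high-probability atoms of $G$, every document places most of its mass on the same shared topics, but with document-specific weights.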


Nested Hierarchical Dirichlet Processes

The nHDP formulation allows (i) each word to follow its own path to a topic, and (ii) each document its own distribution over a shared tree.

To formulate the nHDP, let a tree $\mathcal{T}$ be a draw from the global nCRP with the stick-breaking construction. Instead of drawing a path for each document, we use each Dirichlet process in $\mathcal{T}$ as a base distribution for a second-level DP drawn independently for each document. In other words, each document $d$ has a tree $\mathcal{T}_d$, where for each $G_{i_l} \in \mathcal{T}$ we draw:

$$G^{(d)}_{i_l} \sim \mathrm{DP}(\beta G_{i_l}) \quad (12)$$


Nested Hierarchical Dirichlet Processes

We can write the second-level DP as:

$$G^{(d)}_{i_l} = \sum_{j=1}^{\infty} V^{(d)}_{(i_l, j)} \prod_{m=1}^{j-1} \left(1 - V^{(d)}_{(i_l, m)}\right) \delta_{\phi^{(d)}_{(i_l, j)}}, \qquad V^{(d)}_{(i_l, j)} \overset{iid}{\sim} \mathrm{Beta}(1, \beta), \quad \phi^{(d)}_{(i_l, j)} \overset{iid}{\sim} G_{i_l} \quad (13)$$

However, we would like to maintain the same tree structure in $\mathcal{T}_d$ as in $\mathcal{T}$. To do this, we can map the probabilities, so that the probability of being on node $\theta_{(i_l, j)}$ in document $d$ is:

$$G^{(d)}_{i_l}\!\left(\{\theta_{(i_l, j)}\}\right) = \sum_{m} G^{(d)}_{i_l}\!\left(\{\phi^{(d)}_{(i_l, m)}\}\right) \mathbb{I}\!\left(\phi^{(d)}_{(i_l, m)} = \theta_{(i_l, j)}\right) \quad (14)$$
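
Equations (13) and (14) mirror the HDP collapse from the earlier sketch, now applied at a single tree node: because the base $G_{i_l}$ is discrete, the second-level atoms repeat the node's children, and summing their sticks gives the document's child probabilities. An illustrative Python fragment (the base weights and truncation level are assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
beta, T = 1.0, 50
base_w = rng.dirichlet(np.ones(10))    # stand-in for G_{i_l}'s weights over children j

V = rng.beta(1.0, beta, size=T)        # V^{(d)}_{(i_l, j)} ~ Beta(1, beta), Eq. (13)
w = V * np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))
idx = rng.choice(len(base_w), size=T, p=base_w)   # phi^{(d)}_{(i_l, m)} ~ G_{i_l}

child = np.zeros(len(base_w))          # Eq. (14): total mass on each child theta_{(i_l, j)}
np.add.at(child, idx, w)
child /= child.sum()
```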


Generating a Document

After generating the tree $\mathcal{T}_d$ for document $d$, we draw document-specific beta random variables that act as stochastic switches. That is, if a word is at node $i_l$, the switch determines the probability that the word uses the topic at that node or continues down the tree. We stop at node $i_l$ with probability:

$$U_{d, i_l} \overset{iid}{\sim} \mathrm{Beta}(\gamma_1, \gamma_2) \quad (15)$$

From the stick-breaking construction, the probability that the topic $\phi_{d,n} = \theta_{i_l}$ for word $W_{d,n}$ is:

$$\Pr\left(\phi_{d,n} = \theta_{i_l} \mid \mathcal{T}_d, U_d\right) = \left[\, \prod_{i_m \subset i_l} G^{(d)}_{i_m}\!\left(\{\theta_{i_{m+1}}\}\right) \right] \left[\, U_{d, i_l} \prod_{m=1}^{l-1} \left(1 - U_{d, i_m}\right) \right] \quad (16)$$
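
The two bracketed factors of Equation (16) multiply the probability of reaching node $i_l$ by the probability of stopping exactly at level $l$. A tiny hypothetical helper makes this explicit (the input arrays are illustrative, not the authors' API):

```python
import numpy as np

def topic_prob(step_probs, U):
    """Pr(phi_{d,n} = theta_{i_l} | T_d, U_d) for one path, Equation (16).

    step_probs : the l child probabilities G^{(d)}_{i_m}({theta_{i_{m+1}}})
                 taken along the path down to node i_l
    U          : the stop probabilities U_{d,i_1}, ..., U_{d,i_l}
    """
    reach = np.prod(step_probs)              # probability of the path i_l
    U = np.asarray(U)
    stop = U[-1] * np.prod(1.0 - U[:-1])     # pass levels 1..l-1, stop at level l
    return reach * stop

# e.g. a depth-2 node reached with steps (0.6, 0.3) and stop probabilities (0.1, 0.5):
p = topic_prob([0.6, 0.3], [0.1, 0.5])       # 0.18 * 0.45 = 0.081
```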


Generative Procedure

Algorithm 1: Generating documents with the nested hierarchical Dirichlet process

Step 1. Generate a global tree $\mathcal{T}$ by constructing an nCRP as in Section II-B1 of the paper.
Step 2. Generate a document tree $\mathcal{T}_d$ and switching probabilities $U^{(d)}$. For document $d$:
  a) For each DP in $\mathcal{T}$, draw a second-level DP with this base distribution (Equation 12).
  b) For each node in $\mathcal{T}_d$ (equivalently $\mathcal{T}$), draw a beta random variable (Equation 15).
Step 3. Generate the documents. For word $n$ in document $d$:
  a) Sample atom $\phi_{d,n} = \theta_{i_l}$ with probability given in Equation (16).
  b) Sample $W_{d,n}$ from the discrete distribution with parameter $\phi_{d,n}$.

For each node $i_l$, a document-specific beta random variable acts as a stochastic switch: given that a word is at node $i_l$, it determines whether the word uses the topic at that node or continues down the tree. We use $i_m \subset i_l$ to indicate that the first $m$ values in $i_l$ equal $i_m$. In Equation (16), the leftmost term is the probability of path $i_l$; the rightmost term is the probability that the word does not select the first $l - 1$ topics but does select the $l$th. Since all the random variables are independent, a simple product form results that significantly aids the development of a posterior inference algorithm. The random variables $U$ are shared for the first $l$ levels by all paths in the subtree rooted at node $i_l$, which gives this stick-breaking construction its overlapping structure across levels. A similar tree-structured prior distribution was presented by Adams et al. [18], in which all groups shared the same distribution on a tree and entire objects (e.g., images or documents) were clustered within a single node.


Inference

With a large amount of data, it is difficult to use an MCMC algorithm to efficiently learn the model parameters. To solve this problem, the authors developed a stochastic variational Bayesian inference scheme that updates over a sub-batch of documents, denoted $C_s$:

for $s = 1, \ldots, \infty$ do
  for $d \in C_s$ do
    Update all local parameters for document $d$ ($z^{(d)}_{i,j}$, $c_{d,n}$, $V^{(d)}_{i,j}$, $U_{d,i}$) while holding the global variables constant
  end for
  Stochastic updates for the corpus variables: find a noisy estimate $\lambda'_i$ of the Dirichlet parameters of $q(\theta_i)$ (incorporating the prior $\lambda_0$), then blend it into the global parameters, $\lambda^{s+1}_{i,w} = (1 - \rho_s)\,\lambda^{s}_{i,w} + \rho_s\, \lambda'_{i,w}$. Likewise update the parameters of $q(V_{i_l, j})$.
end for
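
A minimal sketch of the global step in standard stochastic variational inference form; the step-size schedule $\rho_s = (s + \tau)^{-\kappa}$, the hyperparameter values, and the count scaling by $D/|C_s|$ are the usual SVI choices, shown here as assumptions rather than the paper's exact implementation.

```python
import numpy as np

def svi_step(lam, batch_counts, batch_size, D, s, lam0=0.01, kappa=0.75, tau=1.0):
    """One stochastic update of the topic Dirichlet parameters.

    lam          : (K, V) current global parameters of q(theta_i)
    batch_counts : (K, V) expected topic-word counts from the documents in C_s
    batch_size   : |C_s|;  D : total number of documents;  s : iteration index
    """
    rho = (s + tau) ** (-kappa)                         # step size rho_s
    lam_prime = lam0 + (D / batch_size) * batch_counts  # noisy estimate lambda'_{i,w}
    return (1.0 - rho) * lam + rho * lam_prime          # lambda^{s+1}_{i,w}
```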


Notes on inference

Initialization: A good initialization greatly benefits the stochastic VB algorithm. On a small set of documents, the authors apply hierarchical k-means clustering to define an initial tree, with $n_1$ clusters at the top level, $n_2$ clusters beneath each of those, and $n_3$ clusters at the last level.
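
A minimal sketch of such a hierarchical k-means initialization, assuming scikit-learn's KMeans and a matrix X of per-document vectors (e.g., normalized word counts); the function and its (10, 7, 5) default widths are illustrative, matching the small experiments noted below.

```python
import numpy as np
from sklearn.cluster import KMeans

def hierarchical_kmeans(X, widths=(10, 7, 5), depth=0):
    """Recursively cluster documents; returns {path tuple: cluster center}."""
    if depth == len(widths) or len(X) < widths[depth]:
        return {}
    km = KMeans(n_clusters=widths[depth], n_init=10).fit(X)
    tree = {}
    for j in range(widths[depth]):
        tree[(j,)] = km.cluster_centers_[j]            # initial topic for this node
        sub = hierarchical_kmeans(X[km.labels_ == j], widths, depth + 1)
        tree.update({(j,) + path: center for path, center in sub.items()})
    return tree
```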

In the small experiments, a truncated tree with level widths (10, 7, 5) was used in the inference, giving 430 possible nodes, whereas in the "big data" experiments the tree was truncated to (20, 10, 5).

To evaluate on held-out data, the authors completely held out a set of documents, learned each document's local parameters on 75% of its words, and tested on the remaining 25%. Predictive log-likelihood values are reported.


Results

TABLE II
Comparison of the nHDP with the nCRP on three smaller problems (per-word predictive log likelihood).

Method \ Data set    JACM              Psych. Review     PNAS
Variational nHDP     -5.405 ± 0.012    -5.674 ± 0.019    -6.304 ± 0.003
Variational nCRP     -5.433 ± 0.010    -5.843 ± 0.015    -6.574 ± 0.005
Gibbs nCRP           -5.392 ± 0.005    -5.783 ± 0.015    -6.496 ± 0.007

Following previous work on the nCRP, we truncate the tree to three levels: each first-level node has 7 children, which themselves each have 5 children, for a total of 420 nodes. Also, because these three data sets contain stop words, we follow [2] and [1] by including a root node shared by all documents for this batch problem. Following [2], we perform five-fold cross validation to evaluate performance on each corpus.

We present our results in Table II. We see that for all data sets, the variational nHDP outperforms the variational nCRP. For the two larger data sets, the variational nHDP also outperforms Gibbs sampling for the nCRP. Given the relative sizes of these corpora, we see that the benefit of learning a per-document distribution on the full tree rather than a single path appears to increase as the corpus size and document size increase. Since we are interested in the "Big Data" regime, this strongly hints at an advantage of our nHDP approach over the nCRP.

C. Stochastic inference for The New York Times and Wikipedia

We next present an evaluation of our stochastic variational inference algorithm on The New York Times and Wikipedia. These are both very large data sets: The New York Times contains roughly 1.8 million articles and Wikipedia roughly 3.3 million web pages. The average document is also somewhat larger than in our batch experiments, with a New York Times article containing 254 words on average from a vocabulary of size 8,000, and a Wikipedia page 164 words on average from a vocabulary of size 7,702.

1) Setup: We use the algorithm discussed in Section V-A to initialize a three-level tree with (20, 10, 5) child nodes per level, giving a total of 1,220 initial topics. For the Dirichlet processes, we set all top-level DP concentration parameters to $\alpha = 5$ and the second-level DP concentration parameters to $\beta = 1$. For the switching probabilities $U$, we set the beta distribution hyperparameters to $\gamma_1 = 2/3$ and $\gamma_2 = 4/3$, which takes the weight of a uniform prior and skews it toward smaller values, slightly encouraging a word to continue down the tree. For our greedy subtree selection algorithm, we stop adding nodes to the subtree when the marginal improvement to the lower bound falls below $10^{-2}$. When optimizing the local variational parameters of a document given its subtree, we continue iterating until the absolute change…


Results

Fig. 2. The New York Times: average per-word predictive log likelihood on a held-out test set as a function of training documents seen, comparing the nHDP with HDP and LDA (50, 100, and 150 topics).

Fig. 3. The New York Times: per-document statistics from the test set using the tree at the final step of the algorithm. (a) A histogram of the size of the subtree selected for a document. (b) The average number of nodes by level within the subtree (white), and the average number by level that have at least one expected observation (black). (c) The average number of words allocated to each level of the tree per document.


Results

Fig. 4. Tree-structured topics from The New York Times. The shaded node is the top-level node and lines indicate dependencies within the tree. In general, topics are learned in increasing levels of specificity. For clarity, we have removed grammatical variations of the same word, such as "scientist" and "scientists."


Results

Fig. 5. Tree size: the smallest number of nodes containing 90%, 99%, and 99.9% of all paths as a function of documents seen, for (a) The New York Times and (b) Wikipedia.

In Figure 8, we see example subtrees used by three documents. We note that the topics contain many more function words than for The New York Times, but an underlying hierarchical structure is uncovered that would be unlikely to arise along one path, as the nCRP would require. In Figure 5b we again show the size of the tree as a function of documents seen, using the number of nodes containing 90%, 99%, and 99.9% of all paths from the subtrees. As with The New York Times, the model simplifies significantly from the original 1,220 nodes initialized.

VI. CONCLUSION

We have presented the nested hierarchical Dirichlet process (nHDP), an extension of the nested Chinese restaurant process (nCRP) that allows each observation to follow its own path to a topic in the tree. Starting with a stick-breaking construction for the nCRP, the new model samples document-specific path distributions for a shared tree using a hierarchy of Dirichlet processes. By giving a document access to the entire tree, we are able to borrow thematic content from various parts of the tree in constructing a document. In our experiments we showed that this leads to a general improvement over the nCRP for hierarchical topic modeling. In addition, we have developed a stochastic variational inference algorithm that is scalable to very large data sets. We compared the stochastic nHDP topic model with stochastic LDA and HDP on large collections from The New York Times and Wikipedia, where we showed an improvement in predictive ability with our tree-structured prior. Qualitative results on these corpora indicate that the nHDP can learn meaningful topic hierarchies, and that documents benefit by taking advantage of the entire tree.


Results

Fig. 6. Wikipedia: average per-word predictive log likelihood on a held-out test set as a function of training documents seen, comparing the nHDP with HDP and LDA (50, 100, and 150 topics).

Fig. 7. Wikipedia: per-document statistics from the test set using the tree at the final step of the algorithm. (a) A histogram of the size of the subtree selected for a document. (b) The average number of nodes by level within the subtree (white), and the average number by level that have at least one expected observation (black). (c) The average number of words allocated to each level of the tree per document.


Results

Fig. 8. Examples of subtrees for three articles from Wikipedia. The three sizes of font differentiate the more probable topics from the less probable.

REFERENCES

[1] D. Blei, T. Griffiths, and M. Jordan, "The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies," Journal of the ACM, vol. 57, no. 2, pp. 7:1-30, 2010.
[2] C. Wang and D. Blei, "Variational inference for the nested Chinese restaurant process," in Advances in Neural Information Processing Systems, 2009.
[3] J. H. Kim, D. Kim, S. Kim, and A. Oh, "Modeling topic hierarchies with the recursive Chinese restaurant process," in International Conference on Information and Knowledge Management (CIKM), 2012.
[4] Y. Teh, M. Jordan, M. Beal, and D. Blei, "Hierarchical Dirichlet processes," Journal of the American Statistical Association, vol. 101, no. 476, pp. 1566-1581, 2006.


Conclusions

The nHDP eliminates some of the constraints of the nCRP and provides a more informative tree with higher predictive log-likelihood.

Using the complete tree was also explored in "Tree-Structured Stick Breaking for Hierarchical Data" by Adams et al., but the nHDP additionally allows documents to share statistical strength in their preferences over the tree structure.

The stochastic variational Bayesian algorithm allows efficient inference in this complicated model, and it appears to perform well.
