Post on 29-May-2020
Nested Hierarchical Dirichlet Processes
John Paisley, Chong Wang, David M. Blei, and Michael I. Jordan
Review by David Carlson
John Paisley, Chong Wang, David M. Blei, and Michael I. Jordan. Nested Hierarchical Dirichlet Processes. Review by David Carlson. 1 / 25
Overview
- Dirichlet process (DP)
- Nested Chinese restaurant process topic model (nCRP)
- Hierarchical Dirichlet process topic model (HDP)
- Nested hierarchical Dirichlet process topic model (nHDP)
- Outline of the stochastic variational Bayesian procedure
- Results
Dirichlet Process
In general, a distribution $G$ drawn from a Dirichlet process can be written as:

$$G \sim \mathrm{DP}(\alpha G_0) \tag{1}$$

$$G = \sum_{i=1}^{\infty} p_i\, \delta_{\theta_i} \tag{2}$$

where each $p_i$ is a probability and each $\theta_i$ is an atom. We can construct a Dirichlet process mixture model over data $W_1, \ldots, W_N$:

$$W_n \mid \phi_n \sim F_W(\phi_n) \tag{3}$$

$$\phi_n \mid G \sim G \tag{4}$$
Generating the Dirichlet process
There are two common methods for generating the Dirichlet process. The first is the Chinese restaurant process, where we integrate out $G$ to get the distribution for $\phi_{n+1}$ given the previous values:

$$\phi_{n+1} \mid \phi_1, \ldots, \phi_n \sim \frac{\alpha}{\alpha + n}\, G_0 + \sum_{i=1}^{n} \frac{1}{\alpha + n}\, \delta_{\phi_i} \tag{5}$$
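The predictive rule in Equation (5) can be simulated directly. The sketch below is illustrative only (the `crp` helper and its seeding are assumptions, not from the paper): each new customer joins an existing table with probability proportional to its occupancy, or opens a new table with probability proportional to α.

```python
import random

def crp(n, alpha, seed=0):
    """Simulate table assignments for n customers under a CRP(alpha).

    Customer n+1 joins existing table i with probability count_i / (alpha + n),
    or a new table with probability alpha / (alpha + n), as in Equation (5).
    """
    rng = random.Random(seed)
    tables = []       # tables[i] = number of customers seated at table i
    assignments = []  # table index chosen by each customer
    for n_seen in range(n):
        # Unnormalized weights: existing-table counts, then alpha for a new table.
        weights = tables + [alpha]
        r = rng.uniform(0, sum(weights))
        acc = 0.0
        for i, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if i == len(tables):
            tables.append(1)  # open a new table
        else:
            tables[i] += 1
        assignments.append(i)
    return assignments, tables

assignments, tables = crp(100, alpha=1.0)
```

The clustering behavior is visible in `tables`: a few tables accumulate most customers, while new tables open at a logarithmic rate in n.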
The second commonly used method is a stick-breaking construction. In this case, one can construct $G$ as:

$$G = \sum_{i=1}^{\infty} V_i \prod_{j=1}^{i-1} (1 - V_j)\, \delta_{\theta_i}, \qquad V_i \overset{\text{iid}}{\sim} \mathrm{Beta}(1, \alpha), \quad \theta_i \overset{\text{iid}}{\sim} G_0 \tag{6}$$

Because the stick-breaking construction maintains the independence among $\phi_1, \ldots, \phi_N$, it has advantages over the CRP during mean-field variational inference.
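A truncated version of Equation (6) is straightforward to sample. The sketch below caps the number of sticks and assigns all leftover mass to the final stick so the weights sum to one; the truncation level and the standard-normal base distribution are illustrative assumptions.

```python
import numpy as np

def stick_breaking_weights(alpha, truncation, rng):
    """Draw truncated stick-breaking weights p_i = V_i * prod_{j<i} (1 - V_j)."""
    V = rng.beta(1.0, alpha, size=truncation)
    V[-1] = 1.0  # absorb the remaining stick mass so the weights sum to 1
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - V[:-1])])
    return V * remaining

rng = np.random.default_rng(0)
p = stick_breaking_weights(alpha=2.0, truncation=50, rng=rng)
atoms = rng.standard_normal(50)  # theta_i ~ G0, here a standard normal base
```

The pair `(p, atoms)` is a finite approximation of a draw $G \sim \mathrm{DP}(\alpha G_0)$; larger truncations make the approximation tighter.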
Nested Chinese restaurant processes
The CRP (or DP) is a flat model. Often, it is of interest to organize the topics (or atoms) hierarchically, with subcategories of larger categories in a tree structure. One way to construct such a hierarchical data structure is through the nested Chinese restaurant process (nCRP).
Nested Chinese restaurant processes
As an analogy, consider an extension of the Chinese restaurant metaphor. Each customer selects a table (parameter) according to the CRP. From that table, the customer moves to a restaurant accessible only from that table, where he or she chooses a table from that restaurant-specific CRP.
Fig. 1. An example of path structures for the nested Chinese restaurant process (nCRP) and the nested hierarchical Dirichlet process (nHDP) for hierarchical topic modeling. With the nCRP, the topics for a document are restricted to lying along a single path to a root node. With the nHDP, each document has access to the entire tree, but a document-specific distribution on paths will place high probability on a particular subtree. The goal of the nHDP is to learn a thematically consistent tree as achieved by the nCRP, while allowing for the cross-thematic borrowings that naturally occur within a document.

From the paper: the nCRP learns a tree structure for the underlying topics, with the inferential goal being that topics closer to the root are more general, and gradually become more specific in thematic content when following a path down the tree. The nCRP is a Bayesian nonparametric prior for hierarchical topic models, but is limited in the hierarchies it can model, as illustrated in Figure 1. The nCRP models the topics that go into constructing a document as lying along one path of the tree. From a practical standpoint this is a disadvantage, since inference in trees over three levels is computationally hard [2][1], and hence in practice each document is limited to only three underlying topics. It is also a significant disadvantage from a modeling standpoint. As a simple example, consider a document on ESPN.com about an injured player, compared with an article in a sports medicine journal. Both documents will contain words about medicine and words about sports. Should the nCRP select a path transitioning from sports to medicine, or vice versa? Depending on the article, both options are reasonable.
As shown in the figure, each customer (document) drawing from the nCRP chooses a single path down the tree.
Modeling the nCRP
Let $i_l = (i_1, \ldots, i_l)$ be a path to a node at level $l$ of the tree. Then we can define the DP at the end of this path as:

$$G_{i_l} = \sum_{j=1}^{\infty} V_{(i_l, j)} \prod_{m=1}^{j-1} \bigl(1 - V_{(i_l, m)}\bigr)\, \delta_{\theta_{(i_l, j)}} \tag{7}$$

If the next node is child $j$, then the nCRP transitions to the DP $G_{i_{l+1}}$, where we define $i_{l+1} = (i_l, j)$.
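One way to sketch the nCRP walk down the tree: paths are tuples of child indices, and each node's DP is a truncated stick-breaking draw instantiated lazily the first time any document visits it. All names, the truncation level, and the depth are illustrative assumptions.

```python
import numpy as np

def node_weights(sticks, path, alpha, rng, max_children=20):
    """Lazily instantiate truncated stick-breaking weights for the DP G_{i_l}."""
    if path not in sticks:
        V = rng.beta(1.0, alpha, size=max_children)
        V[-1] = 1.0  # close the truncated stick so the weights sum to 1
        sticks[path] = V * np.concatenate([[1.0], np.cumprod(1.0 - V[:-1])])
    return sticks[path]

def sample_path(sticks, alpha, depth, rng):
    """Draw one root-to-leaf path (i_1, ..., i_depth) through the nCRP tree."""
    path = ()
    for _ in range(depth):
        w = node_weights(sticks, path, alpha, rng)
        path = path + (int(rng.choice(len(w), p=w)),)
    return path

rng = np.random.default_rng(1)
sticks = {}  # shared across documents, so all documents walk the same tree
paths = [sample_path(sticks, alpha=1.0, depth=3, rng=rng) for _ in range(5)]
```

Because `sticks` is shared, documents that reach the same node reuse its DP, which is exactly what makes the tree a shared structure rather than a per-document one.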
Nested CRP topic models
We can use the nCRP to define a path down a shared tree, but we want to use this tree to model the data. One application of the tree structure is a topic model, where each atom $\theta_{(i_l, j)}$ defines a topic:

$$\theta_{(i_l, j)} \sim \mathrm{Dir}(\eta) \tag{8}$$

Each document in the nCRP chooses one path down the tree according to a Markov process, and the path provides a sequence of topics $\phi_d = (\phi_{d,1}, \phi_{d,2}, \ldots)$ which we can use to generate the words in the document. The distribution over these topics is provided by a new document-specific stick-breaking process:

$$G^{(d)} = \sum_{j=1}^{\infty} U_{d,j} \prod_{m=1}^{j-1} (1 - U_{d,m})\, \delta_{\phi_{d,j}}, \qquad U_{d,j} \overset{\text{iid}}{\sim} \mathrm{Beta}(\gamma_1, \gamma_2) \tag{9}$$
Problems with nCRP
There are several problems with the nCRP, including:
- Each document is only allowed to follow one path down the tree, limiting the number of topics per document to the number of levels (typically ≤ 4), which can force topics to blend (have less specificity).
- Topics are often repeated on many different parts of the tree if they appear as random effects in documents.
- The tree is shared, but very few topics are shared between a set of documents, because each document follows its own independent single path down the tree.

We would like to learn a distribution over the entire shared tree for each document, giving a more flexible modeling structure. The solution to this problem is the nested hierarchical Dirichlet process.
Hierarchical Dirichlet processes
The HDP is a multi-level version of the Dirichlet process, described by the hierarchical process:

$$G_d \mid G \sim \mathrm{DP}(\beta G), \qquad G \sim \mathrm{DP}(\alpha G_0) \tag{10}$$

In this case, each document has its own DP ($G_d$), which is drawn from a shared DP $G$. In this way, the weights on each topic (atom) are allowed to vary smoothly from document to document while still sharing statistical strength.

This can be represented as a stick-breaking process as well:

$$G_d = \sum_{i=1}^{\infty} V_i^d \prod_{j=1}^{i-1} \bigl(1 - V_j^d\bigr)\, \delta_{\phi_i}, \qquad V_i^d \overset{\text{iid}}{\sim} \mathrm{Beta}(1, \beta), \quad \phi_i \overset{\text{iid}}{\sim} G \tag{11}$$
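A truncated sketch of the two-level construction in Equation (11): each document draws its own sticks and resamples its atoms from the global measure, so its mass concentrates on the shared global atoms. The truncation level K and the concentration values are illustrative assumptions.

```python
import numpy as np

def truncated_sticks(a, b, K, rng):
    """Truncated stick-breaking weights with V_i ~ Beta(a, b)."""
    V = rng.beta(a, b, size=K)
    V[-1] = 1.0  # close the stick so weights sum to 1
    return V * np.concatenate([[1.0], np.cumprod(1.0 - V[:-1])])

rng = np.random.default_rng(0)
K = 100
global_weights = truncated_sticks(1.0, 5.0, K, rng)  # G ~ DP(alpha G0), alpha = 5

def document_dp(beta, K, rng):
    """Draw Gd ~ DP(beta G): document sticks over atoms phi_i ~ G."""
    doc_sticks = truncated_sticks(1.0, beta, K, rng)  # V^d_i ~ Beta(1, beta)
    atom_idx = rng.choice(len(global_weights), size=K, p=global_weights)
    # Collapse repeated atoms: Gd's total mass on each global atom k.
    return np.bincount(atom_idx, weights=doc_sticks,
                       minlength=len(global_weights))

mass = document_dp(beta=1.0, K=K, rng=rng)
```

Because `atom_idx` is drawn from `global_weights`, popular global atoms tend to receive mass in every document, which is the "shared statistical strength" described above.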
Nested Hierarchical Dirichlet Processes
The nHDP formulation allows (i) each word to follow its own path to a topic, and (ii) each document its own distribution over a shared tree.

To formulate the nHDP, let a tree $\mathcal{T}$ be a draw from the global nCRP with stick-breaking construction. Instead of drawing a path for each document, we use each Dirichlet process in $\mathcal{T}$ as a base distribution for a second-level DP drawn independently for each document. In other words, each document $d$ has a tree $\mathcal{T}_d$, where for each $G_{i_l} \in \mathcal{T}$ we draw:

$$G_{i_l}^{(d)} \sim \mathrm{DP}(\beta G_{i_l}) \tag{12}$$
Nested Hierarchical Dirichlet Processes
We can write the second-level DP as:

$$G_{i_l}^{(d)} = \sum_{j=1}^{\infty} V_{(i_l, j)}^{(d)} \prod_{m=1}^{j-1} \bigl(1 - V_{(i_l, m)}^{(d)}\bigr)\, \delta_{\phi_{(i_l, j)}^{(d)}}, \qquad V_{(i_l, j)}^{(d)} \overset{\text{iid}}{\sim} \mathrm{Beta}(1, \beta), \quad \phi_{(i_l, j)}^{(d)} \overset{\text{iid}}{\sim} G_{i_l} \tag{13}$$

However, we would like to maintain the same tree structure in $\mathcal{T}_d$ as in $\mathcal{T}$. To do this, we can map the probabilities, so that the probability of node $\theta_{(i_l, j)}$ in document $d$ is:

$$G_{i_l}^{(d)}\bigl(\{\theta_{(i_l, j)}\}\bigr) = \sum_{m} G_{i_l}^{(d)}\bigl(\{\phi_{(i_l, m)}^{(d)}\}\bigr)\, \mathbb{I}\bigl(\phi_{(i_l, m)}^{(d)} = \theta_{(i_l, j)}\bigr) \tag{14}$$
Generating a Document
After generating the tree $\mathcal{T}_d$ for document $d$, we draw document-specific beta random variables that act as a stochastic switch: if a word is at node $i_l$, the switch determines whether the word uses the topic at that node or continues down the tree. We stop at node $i_l$ with probability:

$$U_{d, i_l} \overset{\text{iid}}{\sim} \mathrm{Beta}(\gamma_1, \gamma_2) \tag{15}$$

From the stick-breaking construction, the probability that the topic $\phi_{d,n} = \theta_{i_l}$ for word $W_{d,n}$ is:

$$\Pr(\phi_{d,n} = \theta_{i_l} \mid \mathcal{T}_d, U_d) = \Biggl[\prod_{i_m \subset i_l} G_{i_m}^{(d)}\bigl(\{\theta_{i_{m+1}}\}\bigr)\Biggr] \Biggl[U_{d, i_l} \prod_{m=1}^{l-1} (1 - U_{d, i_m})\Biggr] \tag{16}$$
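The product form of Equation (16) is easy to evaluate directly. Below is a minimal sketch; the dictionary encoding of the tree and all probability values are illustrative assumptions, not values from the paper.

```python
def stop_probability(path, trans, U):
    """Evaluate Eq. (16): the probability a word's topic sits at node `path`.

    trans[(parent, child)] : transition probability G^{(d)}_{i_m}({theta_{i_{m+1}}})
    U[node]                : switch probability of stopping (using the topic) there
    Nodes are tuples of child indices; () is the root.
    """
    p = 1.0
    prefix = ()
    for j in path:
        child = prefix + (j,)
        p *= trans[(prefix, child)]    # path term: walk one level down
        prefix = child
    for m in range(1, len(path)):
        p *= 1.0 - U[path[:m]]         # do not stop at any ancestor
    return p * U[path]                 # stop at the final node

# Toy two-level tree: root -> (0,) -> (0, 1)
trans = {((), (0,)): 0.6, ((0,), (0, 1)): 0.5}
U = {(0,): 0.3, (0, 1): 0.8}
prob = stop_probability((0, 1), trans, U)  # 0.6 * 0.5 * (1 - 0.3) * 0.8 = 0.168
```

Because every factor involves an independent random variable, this product factorizes cleanly in a mean-field variational posterior, which is the point the paper makes about Equation (16).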
Generative Procedure
Algorithm 1: Generating Documents with the Nested Hierarchical Dirichlet Process
Step 1. Generate a global tree T by constructing an nCRP as in Section II-B1.
Step 2. Generate document tree Td and switching probabilities U^(d). For document d,
  a) For each DP in T, draw a second-level DP with this base distribution (Equation 8).
  b) For each node in Td (equivalently T), draw a beta random variable (Equation 10).
Step 3. Generate the documents. For word n in document d,
  a) Sample atom $\phi_{d,n} = \theta_{i_l}$ with probability given in Equation (11).
  b) Sample $W_{d,n}$ from the discrete distribution with parameter $\phi_{d,n}$.

(The equation numbers cited in Algorithm 1 are the paper's own; its Equations 8, 10, and 11 correspond to Equations 12, 15, and 16 in this review.)

For each node $i_l$, we draw a document-specific beta random variable that acts as a stochastic switch; given a word is at node $i_l$, it determines the probability that the word uses the topic at that node or continues down the tree. That is, given the path for word $W_{d,n}$ is at node $i_l$, we stop with probability

$$U_{d, i_l} \overset{\text{iid}}{\sim} \mathrm{Beta}(\gamma_1, \gamma_2)$$

or continue by selecting node $i_{l+1}$ according to $G_{i_l}^{(d)}$. We observe the stick-breaking construction implicit in this construction; for word $n$ in document $d$, the probability that its topic $\phi_{d,n} = \theta_{i_l}$ is

$$\Pr(\phi_{d,n} = \theta_{i_l} \mid \mathcal{T}_d, U_d) = \Biggl[\prod_{i_m \subset i_l} G_{i_m}^{(d)}\bigl(\{\theta_{i_{m+1}}\}\bigr)\Biggr] \Biggl[U_{d, i_l} \prod_{m=1}^{l-1} (1 - U_{d, i_m})\Biggr]$$

We use $i_m \subset i_l$ to indicate that the first $m$ values in $i_l$ are equal to $i_m$. The leftmost term in this expression is the probability of path $i_l$; the right term is the probability that the word does not select the first $l-1$ topics, but does select the $l$th. Since all random variables are independent, a simple product form results that will significantly aid the development of a posterior inference algorithm. The overlapping nature of this stick-breaking construction on the levels of a sequence is evident from the fact that the random variables $U$ are shared for the first $l$ values by all paths along the subtree starting at node $i_l$. A similar tree-structured prior distribution was presented by Adams et al. [18], in which all groups shared the same distribution on a tree and entire objects (e.g. images or documents) were clustered within a single node. We summarize the model for generating documents with the nHDP in Algorithm 1.

IV. Stochastic Variational Inference for the Nested HDP

Many text corpora can be viewed as "Big Data": they are large data sets for which standard inference algorithms can be prohibitively slow. For example, Wikipedia currently indexes several million entries, and The New York Times has published almost two million articles in the last 20 years.
Inference
With a large amount of data, it is difficult to use an MCMC algorithm to efficiently learn the parameters of the model. To solve this problem, the authors developed a stochastic variational Bayesian inference scheme that updates over a sub-batch of documents denoted by $C_s$:

for s = 1, ..., ∞ do
  for d ∈ $C_s$ do
    Update all local parameters for document d: ($z_{i,j}^{(d)}$, $c_{d,n}$, $V_{i,j}^{(d)}$, $U_{d,i}$), holding the global variables constant
  end
  Stochastic updates for corpus variables: find a noisy estimate of the Dirichlet parameters $\lambda_i'$ of $q(\theta_i)$, then update the global parameters $\lambda_{i,w}^{s+1} = \lambda_0 + (1 - \rho_s)\lambda_{i,w}^{s} + \rho_s \lambda_{i,w}'$
  Likewise update the parameters for $q(V_{i_l, j})$
end
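As a hedged sketch, the global step can be written in the standard SVI form, where the noisy target $\lambda'$ folds in the prior $\lambda_0$ and a minibatch rescaling. The step-size schedule and the values of $\kappa$ and $\tau$ below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def svi_update(lam, suff_stats, lam0, D, batch_size, s, kappa=0.6, tau=1.0):
    """One stochastic natural-gradient step on a topic's Dirichlet parameters.

    lam        : current global parameters lambda^s (vocabulary-sized vector)
    suff_stats : expected word counts for this topic from the minibatch C_s
    The noisy target lambda' = lam0 + (D / |C_s|) * suff_stats rescales the
    minibatch as if the whole corpus of D documents looked like it.
    """
    rho = (tau + s) ** (-kappa)  # step size rho_s -> 0 (Robbins-Monro)
    lam_noisy = lam0 + (D / batch_size) * suff_stats
    return (1.0 - rho) * lam + rho * lam_noisy

lam = np.full(5, 0.1)                             # toy 5-word vocabulary
stats = np.array([2.0, 0.0, 1.0, 0.0, 0.5])       # minibatch topic-word counts
new_lam = svi_update(lam, stats, lam0=0.1, D=1000, batch_size=10, s=0)
```

At s = 0 the step size is 1, so the global parameters jump straight to the rescaled minibatch estimate; later steps blend the old and new values more conservatively.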
Notes on inference
Initialization: a good initialization greatly benefits the stochastic VB algorithm. On a small set of documents, the authors run a hierarchical k-means clustering to define an initial tree, with n1 clusters at the top level, n2 clusters beneath each of those, and n3 clusters at the last level.
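A rough sketch of such a hierarchical k-means initialization (the recursion structure, Lloyd's-iteration count, and toy widths are assumptions; this is not the authors' code):

```python
import numpy as np

def hierarchical_kmeans(X, widths, rng, n_iter=10):
    """Recursively cluster rows of X into a tree with widths[l] children per level.

    Returns a dict mapping a path tuple (i_1, ..., i_l) to that node's cluster
    mean, mirroring the (n1, n2, n3)-wide initialization tree described above.
    """
    tree = {}

    def kmeans(data, k):
        # Plain Lloyd's algorithm; enough for a rough initialization.
        centers = data[rng.choice(len(data), size=k, replace=False)]
        labels = np.zeros(len(data), dtype=int)
        for _ in range(n_iter):
            d = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            labels = d.argmin(1)
            for j in range(k):
                if (labels == j).any():
                    centers[j] = data[labels == j].mean(0)
        return centers, labels

    def recurse(data, path, level):
        if level == len(widths) or len(data) < widths[level]:
            return
        centers, labels = kmeans(data, widths[level])
        for j in range(widths[level]):
            tree[path + (j,)] = centers[j]
            recurse(data[labels == j], path + (j,), level + 1)

    recurse(X, (), 0)
    return tree

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))                  # toy "documents" in R^8
tree = hierarchical_kmeans(X, widths=(4, 3), rng=rng)
```

Each node's mean can then seed the corresponding topic's variational Dirichlet parameters, so the tree starts out thematically organized rather than random.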
In the small experiments, a truncated tree with level widths (10, 7, 5) was used in the inference results, giving 430 possible nodes, whereas in the "big data" experiments the tree was truncated to (20, 10, 5).
To evaluate on held-out data, the authors completely held out a set of documents, learned their local parameters on 75% of the held-out words, and tested on the remaining 25%. Predictive log-likelihood values are reported.
Results
TABLE II. Comparison of the nHDP with the nCRP on three smaller problems.

Method \ Data set   | JACM           | Psych. Review  | PNAS
Variational nHDP    | -5.405 ± 0.012 | -5.674 ± 0.019 | -6.304 ± 0.003
Variational nCRP    | -5.433 ± 0.010 | -5.843 ± 0.015 | -6.574 ± 0.005
Gibbs nCRP          | -5.392 ± 0.005 | -5.783 ± 0.015 | -6.496 ± 0.007

From the paper: following previous work on the nCRP, we truncate the tree to three levels. Also, because these three data sets contain stop words, we follow [2] and [1] by including a root node shared by all documents for this batch problem. Following [2], we perform five-fold cross validation to evaluate performance on each corpus.

We present our results in Table II. For all data sets, the variational nHDP outperforms the variational nCRP. For the two larger data sets, the variational nHDP also outperforms Gibbs sampling for the nCRP. Given the relative sizes of these corpora, the benefit of learning a per-document distribution on the full tree rather than a single path appears to increase as the corpus size and document size increase. Since we are interested in the "Big Data" regime, this strongly hints at an advantage of the nHDP approach over the nCRP.

C. Stochastic inference for The New York Times and Wikipedia

We next present an evaluation of the stochastic variational inference algorithm on The New York Times and Wikipedia. These are both very large data sets: The New York Times contains roughly 1.8 million articles and Wikipedia roughly 3.3 million web pages. The average document size is somewhat larger than in the batch experiments as well, with an article from The New York Times containing 254 words on average from a vocabulary of size 8,000, and a Wikipedia page 164 words on average from a vocabulary of size 7,702.

1) Setup: the authors use the initialization algorithm discussed in Section V-A to build a three-level tree with (20, 10, 5) child nodes per level, giving a total of 1,220 initial topics. For the Dirichlet processes, all top-level DP concentration parameters are set to α = 5 and the second-level DP concentration parameters to β = 1. For the switching probabilities U, the beta distribution hyperparameters are set to γ1 = 2/3 and γ2 = 4/3, which takes the weight of a uniform prior and skews it toward smaller values, slightly encouraging a word to continue down the tree. For the greedy subtree selection algorithm, nodes stop being added to the subtree when the marginal improvement to the lower bound falls below 10^-2.
Results

Fig. 2. The New York Times: Average per-word log likelihood on a held-out test set as a function of training documents seen. Curves compare the nHDP, HDP, and LDA with 50, 100, and 150 topics.

Fig. 3. The New York Times: Per-document statistics from the test set using the tree at the final step of the algorithm. (a) A histogram of the size of the subtree selected for a document. (b) The average number of nodes by level within the subtree (white), and the average number by level that have at least one expected observation (black). (c) The average number of words allocated to each level of the tree per document.
John Paisley, Chong Wang, David M. Blei, and Michael I. Jordan ()Nested Hierarchical Dirichlet Processes Review by David Carlson 18 / 25
Results

Fig. 4. Tree-structured topics from The New York Times. The shaded node is the top-level node and lines indicate dependencies within the tree. In general, topics are learned in increasing levels of specificity. For clarity, grammatical variations of the same word, such as "scientist" and "scientists," have been removed.
Results

Fig. 5. Tree size: The smallest number of nodes containing 90%, 99% and 99.9% of all paths as a function of documents seen for (a) The New York Times, and (b) Wikipedia.

From the paper: in Figure 8, we see example subtrees used by three Wikipedia documents. The topics contain many more function words than for The New York Times, but an underlying hierarchical structure is uncovered that would be unlikely to arise along one path, as the nCRP would require. Figure 5b again shows the size of the tree as a function of documents seen, via the number of nodes containing 90%, 99% and 99.9% of all paths from the subtrees. As with The New York Times, the model simplifies significantly from the original 1,220 nodes initialized.

VI. Conclusion

We have presented the nested hierarchical Dirichlet process (nHDP), an extension of the nested Chinese restaurant process (nCRP) that allows each observation to follow its own path to a topic in the tree. Starting with a stick-breaking construction for the nCRP, the new model samples document-specific path distributions for a shared tree using a hierarchy of Dirichlet processes. By giving a document access to the entire tree, we are able to borrow thematic content from various parts of the tree in constructing a document. In our experiments we showed that this led to a general improvement over the nCRP for hierarchical topic modeling. In addition, we have developed a stochastic variational inference algorithm that is scalable to very large data sets. We compared the stochastic nHDP topic model with stochastic LDA and HDP on large collections from The New York Times and Wikipedia, where we showed an improvement in predictive ability with our tree-structured prior.
Results

Fig. 6. Wikipedia: Average per-word log likelihood on a held-out test set as a function of training documents seen. Curves compare the nHDP, HDP, and LDA with 50, 100, and 150 topics.

Fig. 7. Wikipedia: Per-document statistics from the test set using the tree at the final step of the algorithm. (a) A histogram of the size of the subtree selected for a document. (b) The average number of nodes by level within the subtree (white), and the average number by level that have at least one expected observation (black). (c) The average number of words allocated to each level of the tree per document.
Results

Fig. 8. Examples of subtrees for three articles from Wikipedia. The three sizes of font differentiate the more probable topics from the less probable.

Qualitative results on these corpora indicate that the nHDP can learn meaningful topic hierarchies, and that documents benefit by taking advantage of the entire tree.

REFERENCES

[1] D. Blei, T. Griffiths, and M. Jordan, "The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies," Journal of the ACM, vol. 57, no. 2, pp. 7:1-30, 2010.
[2] C. Wang and D. Blei, "Variational inference for the nested Chinese restaurant process," in Advances in Neural Information Processing Systems, 2009.
[3] J. H. Kim, D. Kim, S. Kim, and A. Oh, "Modeling topic hierarchies with the recursive Chinese restaurant process," in International Conference on Information and Knowledge Management (CIKM), 2012.
[4] Y. Teh, M. Jordan, M. Beal, and D. Blei, "Hierarchical Dirichlet processes," Journal of the American Statistical Association, vol. 101, no. 476, pp. 1566-1581, 2006.
Conclusions
The nHDP eliminates some of the constraints of the nCRP and provides a more informative tree that gives a higher predictive log-likelihood.

Using the complete tree was also explored in "Tree-Structured Stick Breaking for Hierarchical Data" by Adams et al., but the nHDP additionally allows documents to share statistical strength in their preferences over the tree structure.

The stochastic variational Bayesian algorithm allows for efficient inference in this complicated model, and it appears to perform well.