
Page 1:

Efficient Methods for Incorporating Knowledge into Topic Models

[Yang, Downey and Boyd-Graber 2015]

2015/10/24

EMNLP 2015 Reading

@shuyo

Page 2:

Large-scale Topic Model

• In academic papers

– Up to 10^3 topics

• Industrial applications

– 10^5~10^6 topics!

– Search engines, online ads, and so on

– To capture infrequent topics

• This paper handles up to 500 topics... really?

Page 3:

(Standard) LDA [Blei+ 2003, Griffiths+ 2004]

• "Conventional" Gibbs sampling

$P(z = t \mid \boldsymbol{z}^-, w) \propto q_t := \frac{(n_{d,t} + \alpha)(n_{w,t} + \beta)}{n_t + V\beta}$

– $T$ : topic size

– Draw $U \sim \mathcal{U}\bigl(0, \sum_{z=1}^{T} q_z\bigr)$, then find $t$ s.t. $\sum_{z=1}^{t-1} q_z < U < \sum_{z=1}^{t} q_z$

• For large T, it is computationally intensive

– $n_{w,t}$ is sparse

– When $T$ is very large, $n_{d,t}$ is sparse too, e.g. $T = 10^6 > n_d$
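To make the cost concrete, here is a minimal sketch of this conventional draw (not the paper's code; `n_dt`, `n_wt`, `n_t` are the usual count arrays for the current document and word type):

import numpy as np

def draw_topic_standard(n_dt, n_wt, n_t, alpha, beta, V, rng):
    """Conventional collapsed Gibbs draw for one token: O(T) work.

    n_dt : topic counts in the current document, shape (T,)
    n_wt : topic counts of the current word type, shape (T,)
    n_t  : total token counts per topic, shape (T,)
    """
    # q_t = (n_{d,t} + alpha) * (n_{w,t} + beta) / (n_t + V * beta)
    q = (n_dt + alpha) * (n_wt + beta) / (n_t + V * beta)
    # Draw U ~ Uniform(0, sum_z q_z) and find the bucket it falls into.
    u = rng.uniform(0.0, q.sum())
    return int(np.searchsorted(np.cumsum(q), u))

Every token touches all $T$ entries of $q$, which is why this becomes expensive at $T = 10^5$ to $10^6$.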

Page 4:

SparseLDA [Yao+ 2009]

$P(z = t \mid \boldsymbol{z}^-, w) \propto s_t + r_t + q_t$, where

$s_t = \frac{\alpha\beta}{n_t + V\beta}, \quad r_t = \frac{n_{d,t}\,\beta}{n_t + V\beta}, \quad q_t = \frac{(n_{d,t} + \alpha)\, n_{w,t}}{n_t + V\beta}$

– $s_t$ is independent of $w$ and $d$; $r_t$ depends on $d$ only; $q_t$ depends on both

• $s = \sum_t s_t$, $r = \sum_t r_t$, $q = \sum_t q_t$

• Draw $U \sim \mathcal{U}(0, s + r + q)$:

– If $0 < U < s$, find $t$ s.t. $\sum_{z=1}^{t-1} s_z < U < \sum_{z=1}^{t} s_z$

– If $s < U < s + r$, find $t$ with $n_{d,t} > 0$ s.t. $\sum_{z=1}^{t-1} r_z < U - s < \sum_{z=1}^{t} r_z$

– If $s + r < U < s + r + q$, find $t$ with $n_{w,t} > 0$ s.t. $\sum_{z=1}^{t-1} q_z < U - s - r < \sum_{z=1}^{t} q_z$

• Faster because $n_{w,t}$ and $n_{d,t}$ are sparse
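A minimal sketch of the three-bucket draw (not the authors' implementation; `doc_topics` and `word_topics` are assumed to be sparse lists of (topic, count) pairs):

import numpy as np

def draw_topic_sparse(doc_topics, word_topics, n_t, alpha, beta, V, rng):
    """SparseLDA-style draw over the s (smoothing), r (document), q (word) buckets.

    doc_topics  : list of (t, n_dt) pairs with n_dt > 0 (short docs -> few entries)
    word_topics : list of (t, n_wt) pairs with n_wt > 0 (most words -> few topics)
    n_t         : dense per-topic totals, shape (T,)
    """
    denom = n_t + V * beta
    s_parts = alpha * beta / denom                # in real SparseLDA this is cached
    s_total = s_parts.sum()
    n_dt_map = dict(doc_topics)
    r = [(t, n_dt * beta / denom[t]) for t, n_dt in doc_topics]
    q = [(t, (n_dt_map.get(t, 0) + alpha) * n_wt / denom[t]) for t, n_wt in word_topics]
    r_total = sum(v for _, v in r)
    q_total = sum(v for _, v in q)

    u = rng.uniform(0.0, s_total + r_total + q_total)
    if u < s_total:                               # rare: dense scan over all T topics
        return int(np.searchsorted(np.cumsum(s_parts), u))
    u -= s_total
    if u < r_total:                               # only topics present in this document
        for t, v in r:
            u -= v
            if u <= 0:
                return t
    u -= r_total
    for t, v in q:                                # only topics where this word occurs
        u -= v
        if u <= 0:
            return t
    return q[-1][0] if q else 0                   # numerical safety fallback

In the actual SparseLDA, $s$ is cached across tokens and $r$ is updated incrementally within a document, so typically only the sparse $q$ bucket is recomputed per token.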

Page 5:

Leveraging Prior Knowledge

• The objective function of topic models does not correlate with human judgements

Page 6:

Word correlation prior knowledge

• Must-link

– "quarterback" and "fumble" are both related to American football

• Cannot-link

– "fumble" and "bank" imply two different topics

Page 7:

SC-LDA [Yang+ 2015]

• $m \in M$ : prior knowledge

• $f_m(z, w, d)$ : potential function of prior knowledge $m$ about word $w$ with topic $z$ in document $d$

• $\psi(\boldsymbol{z}, M) = \prod_{z \in \boldsymbol{z}} \exp f_m(z, w, d)$ (maybe the product should run over $m \in M$ and all $w$ with $z$ in all $d$)

• $P(\boldsymbol{w}, \boldsymbol{z} \mid \alpha, \beta, M) = P(\boldsymbol{w} \mid \boldsymbol{z}, \beta)\, P(\boldsymbol{z} \mid \alpha)\, \psi(\boldsymbol{z}, M)$ (maybe $\propto$)

• "SC" stands for Sparse Constrained

Page 8:

Inference for SC-LDA


Page 9:

Word correlation prior knowledge for SC-LDA

• $f_m(z, w, d) = \sum_{u \in M_w^m} \log \max(\lambda, n_{u,z}) + \sum_{v \in M_w^c} \log \frac{1}{\max(\lambda, n_{v,z})}$

– where $M_w^m$ : must-links of $w$, $M_w^c$ : cannot-links of $w$

• $P(z = t \mid \boldsymbol{z}^-, w, M) \propto \left( \frac{\alpha\beta}{n_t + V\beta} + \frac{n_{d,t}\,\beta}{n_t + V\beta} + \frac{(n_{d,t} + \alpha)\, n_{w,t}}{n_t + V\beta} \right) \prod_{u \in M_w^m} \max(\lambda, n_{u,t}) \prod_{v \in M_w^c} \frac{1}{\max(\lambda, n_{v,t})}$
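A minimal sketch of how this knowledge factor could be folded into the per-topic weights (illustrative only; `must_links` and `cannot_links`, mapping a word id to its linked word ids, are assumed inputs):

import numpy as np

def knowledge_factor(w, n_wt, must_links, cannot_links, lam):
    """Per-topic factor exp(f_m): boosts topics where w's must-linked words
    already have mass, damps topics where its cannot-linked words do.

    n_wt : word-topic count matrix, shape (V, T)
    """
    factor = np.ones(n_wt.shape[1])
    for u in must_links.get(w, ()):
        factor *= np.maximum(lam, n_wt[u])        # max(lambda, n_{u,t})
    for v in cannot_links.get(w, ()):
        factor /= np.maximum(lam, n_wt[v])        # 1 / max(lambda, n_{v,t})
    return factor

# The token's sampling weights are then the SparseLDA weights times this factor:
#   weights_t = (s_t + r_t + q_t) * knowledge_factor(w, ...)[t]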

Page 10:

Factor Graph

• The paper says prior knowledge is incorporated "by adding a factor graph to encode prior knowledge," but the factor graph is never actually drawn.

• The potential function $f_m(z, w, d)$ contains $n_{w,z}$, and $\varphi_{w,z} \propto n_{w,z} + \beta$.

• So the above model seems like Fig. b:

(Fig. a and Fig. b: two candidate graphical models, shown on the slide)

Page 11:

[Ramage+ 2009] Labeled LDA

• Supervised LDA for labeled documents

– It is equivalent to SC-LDA with the following potential function:

$f_m(z, w, d) = \begin{cases} 1 & \text{if } z \in m_d \\ -\infty & \text{otherwise} \end{cases}$

where $m_d$ specifies the label set of $d$
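Under this potential, $\exp f_m$ simply masks the sampling distribution to the document's label topics. A minimal sketch of that reading (names hypothetical):

import numpy as np

def labeled_lda_mask(T, doc_labels):
    """exp(f_m) up to a constant: allowed label topics keep their weight,
    all other topics get exp(-inf) = 0."""
    mask = np.zeros(T)
    mask[list(doc_labels)] = 1.0   # exp(1) is a constant factor, so a 0/1 mask suffices
    return mask

# weights = standard_lda_weights * labeled_lda_mask(T, doc_labels)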

Page 12:

Experiments

• Baselines

– Dirichlet Forest-LDA [Andrzejewski+ 2009]

– Logic-LDA [Andrzejewski+ 2011]

– MRF-LDA [Xie+ 2015]

• Encodes word correlations in LDA as an MRF

– SparseLDA

DATASET    | DOCS      | TYPES   | TOKENS (APPROX.) | EXPERIMENT
NIPS       | 1,500     | 12,419  | 1,900,000        | Word correlation
NYT-NEWS   | 3,000,000 | 102,660 | 100,000,000      | Word correlation
20NG       | 18,828    | 21,514  | 1,946,000        | Labeled docs

Page 13:

Generate Word Correlation

• Must-link

– Obtain synsets from WordNet 3.0

– Keep a pair when the similarity between the word and its synset members on word2vec embeddings is higher than the threshold 0.2

• Cannot-link

– Nothing? (how cannot-links are generated does not seem to be described)
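A minimal sketch of this must-link generation, assuming NLTK's WordNet interface and a gensim word2vec `KeyedVectors` model (the vector file path is hypothetical):

from nltk.corpus import wordnet as wn
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)  # hypothetical path

def must_links(word, threshold=0.2):
    """Candidate must-links for `word`: WordNet synset members whose
    word2vec similarity to `word` exceeds the threshold."""
    links = set()
    if word not in kv:
        return links
    for synset in wn.synsets(word):
        for lemma in synset.lemma_names():
            if lemma != word and lemma in kv and kv.similarity(word, lemma) > threshold:
                links.add(lemma)
    return links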

Page 14:

Convergence Speed

The average running time per iteration over 100 iterations, averaged over 5 seeds, on the 20NG dataset.

Page 15:

Coherence [Mimno+ 2011]

• $C(t; V^{(t)}) = \sum_{m=2}^{M} \sum_{l=1}^{m-1} \log \frac{F(v_m^{(t)}, v_l^{(t)}) + \epsilon}{F(v_l^{(t)})}$

– $F(v)$ : document frequency of word type $v$

– $F(v, v')$ : co-document frequency of word types $v$ and $v'$ (it means documents that "include" both?)

– $\epsilon$ is very small, like $10^{-12}$ [Röder+ 2015]

(Figure residue: reported coherence values -39.1 and -36.6)
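A minimal sketch of computing this coherence for one topic from precomputed document frequencies (names hypothetical):

import math

def coherence(top_words, doc_freq, co_doc_freq, eps=1e-12):
    """Mimno et al. (2011) coherence: sum over pairs (v_m, v_l), l < m,
    of log((F(v_m, v_l) + eps) / F(v_l)).

    top_words   : the topic's M most probable word types, most probable first
    doc_freq    : dict word -> number of documents containing it
    co_doc_freq : dict (word, word) -> number of documents containing both
    """
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            v_m, v_l = top_words[m], top_words[l]
            score += math.log((co_doc_freq.get((v_m, v_l), 0) + eps) / doc_freq[v_l])
    return score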

Page 16:

References

• [Yang+ 2015] Efficient methods for incorporating knowledge into topic models.

• [Blei+ 2003] Latent Dirichlet allocation.

• [Griffiths+ 2004] Finding scientific topics.

• [Yao+ 2009] Efficient methods for topic model inference on streaming document collections.

• [Ramage+ 2009] Labeled LDA: A supervised topic model for credit attribution in multilabeled corpora.

• [Andrzejewski+ 2009] Incorporating domain knowledge into topic modeling via Dirichlet forest priors.

• [Andrzejewski+ 2011] A framework for incorporating general domain knowledge into latent Dirichlet allocation using first-order logic.

• [Xie+ 2015] Incorporating word correlation knowledge into topic modeling.

• [Mimno+ 2011] Optimizing semantic coherence in topic models.

• [Röder+ 2015] Exploring the space of topic coherence measures.