TRANSCRIPT
MICHAEL PAUL AND ROXANA GIRJU, UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
A Two-Dimensional Topic-Aspect Model
for Discovering Multi-Faceted Topics
Probabilistic Topic Models
Each word token associated with hidden “topic” variable
Probabilistic approach to dimensionality reduction
Useful for uncovering latent structures in text
Basic formulation: P(w|d) = Σ_z P(w|z) P(z|d), summing over topics z
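As a concrete illustration of this mixture, here is a minimal numeric sketch; the probabilities below are invented for the example:

```python
import numpy as np

# Toy mixture: 2 topics over a 3-word vocabulary (numbers are made up).
p_w_given_z = np.array([[0.7, 0.2, 0.1],   # P(w|topic 0)
                        [0.1, 0.3, 0.6]])  # P(w|topic 1)
p_z_given_d = np.array([0.4, 0.6])         # P(topic|d): the document's mixture

# P(w|d) = sum_z P(w|z) P(z|d)
p_w_given_d = p_z_given_d @ p_w_given_z
print(p_w_given_d)  # [0.34 0.26 0.4 ]
```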
Probabilistic Topic Models
“Topics” are latent distributions over words
A topic can be interpreted as a cluster of words
Topic models often cluster words by what people would consider topicality
There are often other dimensions along which words could be clustered: sentiment, perspective, theme
What if we want to model both?
Previous Work
Topic-Sentiment Mixture Model (Mei et al., 2007): words come from either a topic distribution or a sentiment distribution
Topic+Perspective Model (Lin et al., 2008): words are weighted as topical vs. ideological
Previous Work
Cross-Collection LDA (Paul and Girju, 2009): each document belongs to a collection; each topic has a word distribution shared among collections plus distributions unique to each collection
What if the “collection” were a hidden variable? -> the Topic-Aspect Model (TAM)
Topic-Aspect Model
Each document has a multinomial topic mixture and a multinomial aspect mixture
Words may depend on both!
Topic-Aspect Model
Topic and aspect mixtures are drawn independently of one another
This differs from hierarchical topic models, where one depends on the other
Can be thought of as two separate clustering dimensions
Topic-Aspect Model
Each word token also has 2 binary variables:
the “level” (background or topical) denotes whether the word depends on the topic
the “route” (neutral or aspectual) denotes whether the word depends on the aspect
A word may depend on a topic, an aspect, both, or neither
Topic-Aspect Model
A word may depend on a topic, an aspect, both, or neither
“Computational” Aspect
Route / Level | Background          | Topical
Neutral       | paper, new, present | speech, recognition
Aspectual     | algorithm, model    | markov, hmm, error
Topic-Aspect Model
A word may depend on a topic, an aspect, both, or neither
Route / Level | Background           | Topical
Neutral       | paper, new, present  | speech, recognition
Aspectual     | language, linguistic | prosody, intonation, tone
“Linguistic” Aspect
Topic-Aspect Model
A word may depend on a topic, an aspect, both, or neither
Route / Level | Background           | Topical
Neutral       | paper, new, present  | communication, interaction
Aspectual     | language, linguistic | conversation, social
“Linguistic” Aspect
Topic-Aspect Model
A word may depend on a topic, an aspect, both, or neither
Route / Level | Background          | Topical
Neutral       | paper, new, present | communication, interaction
Aspectual     | algorithm, model    | dialogue, system, user
“Computational” Aspect
Topic-Aspect Model
Generative process, for each word token in a document d:
Sample a topic z from P(z|d)
Sample an aspect y from P(y|d)
Sample a level l from P(l|d)
Sample a route x from P(x|l,z)
Sample a word w from one of:
P(w|l=0,x=0) (background, neutral)
P(w|z,l=1,x=0) (topical, neutral)
P(w|y,l=0,x=1) (background, aspectual)
P(w|z,y,l=1,x=1) (topical, aspectual)
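A minimal sketch of this per-token process in Python; the parameter names (theta, psi, pi, sigma, phi) are placeholders for the model's distributions, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_token(theta, psi, pi, sigma, phi):
    """Draw one word token under TAM's generative process (schematic).
    theta = P(z|d), psi = P(y|d), pi = P(l|d), sigma[l, z] = P(x=1|l, z);
    phi holds the four kinds of word distributions."""
    z = rng.choice(len(theta), p=theta)   # topic
    y = rng.choice(len(psi), p=psi)       # aspect
    l = rng.choice(2, p=pi)               # level: 0 = background, 1 = topical
    x = int(rng.random() < sigma[l, z])   # route: 0 = neutral, 1 = aspectual
    if l == 0 and x == 0:
        dist = phi["bg_neutral"]           # P(w|l=0,x=0)
    elif l == 1 and x == 0:
        dist = phi["topic_neutral"][z]     # P(w|z,l=1,x=0)
    elif l == 0 and x == 1:
        dist = phi["bg_aspect"][y]         # P(w|y,l=0,x=1)
    else:
        dist = phi["topic_aspect"][z][y]   # P(w|z,y,l=1,x=1)
    return rng.choice(len(dist), p=dist)   # the word id
```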
Topic-Aspect Model
Distributions have Dirichlet/Beta priors, following the Latent Dirichlet Allocation framework
The numbers of aspects and topics are user-supplied parameters
Straightforward inference with Gibbs sampling
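TAM's full conditional is more involved, but the sampling pattern is the standard collapsed Gibbs loop. For reference, here is a minimal collapsed Gibbs sampler for plain LDA; TAM extends this loop by also resampling each token's aspect y, level l, and route x:

```python
import numpy as np

def lda_gibbs(docs, V, K, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for plain LDA (not TAM itself).
    docs: list of lists of word ids in [0, V)."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), K))   # document-topic counts
    nkw = np.zeros((K, V))           # topic-word counts
    nk = np.zeros(K)                 # per-topic totals
    z = [rng.integers(K, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):   # initialize counts from random assignments
        for i, w in enumerate(doc):
            ndk[d, z[d][i]] += 1
            nkw[z[d][i], w] += 1
            nk[z[d][i]] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]          # remove this token's counts
                ndk[d, k] -= 1
                nkw[k, w] -= 1
                nk[k] -= 1
                # P(z=k | rest) is proportional to
                # (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k          # reassign and restore counts
                ndk[d, k] += 1
                nkw[k, w] += 1
                nk[k] += 1
    return z, ndk, nkw
```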
Topic-Aspect Model
Semi-supervised TAM: when the aspect label is known, there are two options (see the sketch below):
Fix P(y|d)=1 for the correct aspect label and 0 otherwise; this behaves like ccLDA (Paul and Girju, 2009)
Define a prior for P(y|d) to bias it toward the true label
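A small sketch of option 2 as an asymmetric Dirichlet prior over the aspect mixture; the helper and its parameter names are hypothetical, not from the paper:

```python
import numpy as np

def aspect_prior(num_aspects, label, gamma=1.0, strength=5.0):
    """Hypothetical helper: an asymmetric Dirichlet prior over P(y|d) that
    biases a labeled document toward its known aspect (option 2 above).
    Option 1 would instead clamp every token's y to the label."""
    prior = np.full(num_aspects, gamma)
    prior[label] += strength        # extra pseudo-counts on the true aspect
    return prior

print(aspect_prior(2, label=0))     # [6. 1.]: biased toward aspect 0
```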
Experiments
Three Datasets:
4,247 abstracts from the ACL Anthology
2,173 abstracts from linguistics journals
594 articles from the Bitterlemons corpus (Lin et al., 2006), a collection of editorials on the Israeli/Palestinian conflict
Evaluation
Cluster coherence: the “word intrusion” method (Chang et al., 2009), with 5 human annotators (see the sketch below)
Compared against ccLDA and LDA
TAM clusters are as coherent as those of other established models
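A sketch of how one word-intrusion item might be constructed; Chang et al. (2009) describe the method, but the code below is an illustrative reconstruction, not their implementation:

```python
import numpy as np

def intrusion_item(topics, k, rng, top_n=5):
    """Build one intrusion item for topic k: its top_n words plus an
    'intruder' taken from another topic's top words. If annotators
    reliably spot the intruder, topic k is coherent.
    topics: (K, V) array of P(w|topic)."""
    top = list(np.argsort(topics[k])[::-1][:top_n])
    other = rng.choice([j for j in range(len(topics)) if j != k])
    intruder = next(w for w in np.argsort(topics[other])[::-1]
                    if w not in top)   # high in the other topic, not in k's top
    item = top + [intruder]
    rng.shuffle(item)                  # hide the intruder's position
    return item, intruder
```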
Evaluation
Document classification: classify Bitterlemons perspectives (Israeli vs. Palestinian)
Use TAM (2 aspects + 12 topics) output as input to an SVM, with aspect mixtures and topic mixtures as features (a sketch follows)
Compare against LDA
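A sketch of this setup with scikit-learn, using random stand-in mixtures since the actual TAM outputs aren't reproduced here; the variable names and the linear kernel are assumptions:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Stand-in mixtures; in the experiment these come from TAM
# (2 aspects + 12 topics) run on the 594 Bitterlemons articles.
rng = np.random.default_rng(0)
n_docs = 594
aspect_mix = rng.dirichlet(np.ones(2), size=n_docs)   # P(y|d) per document
topic_mix = rng.dirichlet(np.ones(12), size=n_docs)   # P(z|d) per document
labels = rng.integers(2, size=n_docs)                 # Israeli vs. Palestinian

X = np.hstack([aspect_mix, topic_mix])   # mixtures as SVM features
clf = SVC(kernel="linear")
print(cross_val_score(clf, X, labels, cv=10).mean())  # ~0.5 on random data
```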
Evaluation
Document classification: the 2 aspects from TAM are much more strongly associated with the true perspectives than the 2 topics from LDA
This suggests that TAM clusters along a different dimension than LDA by separating out another “topical” dimension (with 12 components)
Summary: Topic-Aspect Model
Can cluster along two independent dimensions
Words may be generated by both dimensions, so clusters can be inter-related
Cluster definitions are arbitrary; their structure depends on the data and the model parameterization (especially the number of aspects/topics)
Modeling with 2 aspects and many topics is shown to produce aspect clusters corresponding to document perspectives on certain corpora
References
Chang, J.; Boyd-Graber, J.; Gerrish, S.; Wang, C.; and Blei, D. 2009. Reading tea leaves: How humans interpret topic models. In Neural Information Processing Systems.
Lin, W.; Wilson, T.; Wiebe, J.; and Hauptmann, A. 2006. Which side are you on? Identifying perspectives at the document and sentence levels. In Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL).
Lin, W.; Xing, E.; and Hauptmann, A. 2008. A joint topic and perspective model for ideological discourse. In ECML PKDD ’08: Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, Part II, 17–32. Berlin, Heidelberg: Springer-Verlag.
Mei, Q.; Ling, X.; Wondra, M.; Su, H.; and Zhai, C. 2007. Topic sentiment mixture: modeling facets and opinions in weblogs. In WWW ’07: Proceedings of the 16th international conference on World Wide Web, 171–180.
Paul, M., and Girju, R. 2009. Cross-cultural analysis of blogs and forums with mixed-collection topic models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, 1408–1417.