2007 © ChengXiang Zhai LLNL, Aug 15, 2007 1
Contextual Text Mining with
Probabilistic Topic Models
ChengXiang Zhai
Department of Computer Science
Graduate School of Library & Information Science
Institute for Genomic Biology, Statistics
University of Illinois, Urbana-Champaign
http://www-faculty.cs.uiuc.edu/~czhai, [email protected]
Joint work with Qiaozhu Mei
Motivating Example: Comparing Product Reviews
Common Themes | “IBM” specific | “APPLE” specific | “DELL” specific
Battery Life | Long, 4-3 hrs | Medium, 3-2 hrs | Short, 2-1 hrs
Hard disk | Large, 80-100 GB | Small, 5-10 GB | Medium, 20-50 GB
Speed | Slow, 100-200 MHz | Very Fast, 3-4 GHz | Moderate, 1-2 GHz
IBM Laptop Reviews
APPLE Laptop Reviews
DELL Laptop Reviews
Unsupervised discovery of common topics and their variations
Motivating Example: Comparing News about Similar Topics
Common Themes | “Vietnam” specific | “Afghan” specific | “Iraq” specific
United Nations | … | … | …
Death of people | … | … | …
… | … | … | …
Collections: Vietnam War, Afghan War, Iraq War
Unsupervised discovery of common topics and their variations
Motivating Example: Discovering Topical Trends in Literature
Unsupervised discovery of topics and their temporal variations
[Figure: theme strength over time (axis marks at 1980, 1990, 1998, 2003) for the themes TF-IDF Retrieval, IR Applications, Language Model, and Text Categorization]
Motivating Example: Analyzing Spatial Topic Patterns
• How do blog writers in different states respond to topics such as “oil price increase during Hurricane Katrina”?
• Unsupervised discovery of topics and their variations in different locations
Motivating Example: Sentiment Summary
Unsupervised/Semi-supervised discovery of topics and different sentiments of the topics
Research Questions
• Can we model all these problems generally?
• Can we solve these problems with a unified approach?
• How can we bring humans into the loop?
Rest of Talk
• Contextual Text Mining
• The CPLSA Model
• Sample results of specific CPLSA models
• Discussion
Contextual Text Mining
• Given collections of text with contextual information (meta-data)
• Discover themes/subtopics/topics (interesting word clusters)
• Compute variations of themes over contexts
• Applications:
– Summarizing search results
– Federation of text information
– Opinion analysis
– Social network analysis
– Business intelligence
– …
Context Features of Text (Meta-data)
[Figure: context features attached to a weblog article: author, author’s occupation, location, time, communities, source]
Context = Partitioning of Text
[Figure: a collection of papers partitioned by year (1998, 1999, ……, 2005, 2006) and by venue (WWW, SIGIR, ACL, KDD, SIGMOD). Example contexts: papers written in 1998; papers written by authors in US; papers about Web]
Themes/Topics
• Uses of themes:
– Summarize topics/subtopics
– Navigate in a document space
– Retrieve documents
– Segment documents
– …
Theme 1: government 0.3, response 0.2, ..
Theme 2: donate 0.1, relief 0.05, help 0.02, ..
…
Theme k: city 0.2, new 0.1, orleans 0.05, ..
Background B: is 0.05, the 0.04, a 0.03, ..
[ Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response ] to the [ flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated ] …[ Over seventy countries pledged monetary donations or other assistance]. …
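A theme on this slide is a word distribution, and the bracketed passage shows text segmented by theme. The following is only a crude sketch of that idea: it tags each word with whichever distribution gives it the highest probability, whereas a mixture model would infer the assignment probabilistically. The theme names are hypothetical labels, and only the probabilities shown on the slide are used.

```python
# Themes as unigram language models (word -> probability), using the
# illustrative numbers from the slide; the theme names are hypothetical.
themes = {
    "government response": {"government": 0.3, "response": 0.2},
    "donation": {"donate": 0.1, "relief": 0.05, "help": 0.02},
    "new orleans": {"city": 0.2, "new": 0.1, "orleans": 0.05},
}
background = {"is": 0.05, "the": 0.04, "a": 0.03}


def most_likely_theme(word):
    """Return the distribution that gives `word` the highest probability."""
    candidates = dict(themes, background=background)
    best = max(candidates, key=lambda name: candidates[name].get(word, 0.0))
    # Words unseen by every distribution fall back to the background.
    return best if candidates[best].get(word, 0.0) > 0.0 else "background"


def tag(text):
    """Tag every whitespace-separated word with its most likely theme."""
    return [(w, most_likely_theme(w)) for w in text.lower().split()]


tagged = tag("the government response to new orleans relief")
for word, theme in tagged:
    print(f"{word:12s} {theme}")
```

Runs of consecutive words with the same tag would then form the bracketed segments.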
View of Themes: Context-Specific Version of Views
Theme 1: Retrieval Model: retrieve, model, relevance, document, query
Theme 2: Feedback: feedback, judge, expansion, pseudo, query
Context: Before 1998 (Traditional models)
– Theme 1: vector, space, TF-IDF, Okapi, LSI
– Theme 2: vector, Rocchio, weighting, feedback, term
Context: After 1998 (Language models)
– Theme 1: retrieval, language, model, smoothing, query, generation
– Theme 2: feedback, mixture, estimate, EM, pseudo, model
Coverage of Themes: Distribution over Themes
• Theme coverage can depend on context
[Figure: coverage of the themes Oil Price, Government Response, Aid and donation, and Background in the example text below, shown separately for Context: Texas and Context: Louisiana]
Criticism of government response to the hurricane primarily consisted of criticism of its response to … The total shut-in oil production from the Gulf of Mexico … approximately 24% of the annual production and the shut-in gas production … Over seventy countries pledged monetary donations or other assistance. …
General Tasks of Contextual Text Mining
• Theme Extraction: Extract the global salient themes
– Common information shared over all contexts
• View Comparison: Compare a theme from different views
– Analyze the content variation of themes over contexts
• Coverage Comparison: Compare the theme coverage of different contexts
– Reveal how closely a theme is associated with a context
• Others:
– Causal analysis
– Correlation analysis
A General Solution: CPLSA
• CPLSA = Contextual Probabilistic Latent Semantic Analysis
• An extension of the PLSA model ([Hofmann 99]) by
– Introducing context variables
– Modeling views of topics
– Modeling coverage variations of topics
• Process of contextual text mining
– Instantiation of CPLSA (context, views, coverage)
– Fit the model to text data (EM algorithm)
– Compute probabilistic topic patterns
“Generation” Process of CPLSA
[Figure: generating a document under CPLSA]
Document context: Time = July 2005; Location = Texas; Author = xxx; Occup. = Sociologist; Age Group = 45+; …
Views: View1, View2, View3 (tied to contexts such as Texas, July 2005, sociologist)
Themes:
– government: government 0.3, response 0.2, ..
– donation: donate 0.1, relief 0.05, help 0.02, ..
– New Orleans: city 0.2, new 0.1, orleans 0.05, ..
Theme coverages: Texas, July 2005, document, ……
Steps: Choose a view; Choose a Coverage; Choose a theme; Draw a word from θ_i (e.g., government, response, donate, aid, help, new, Orleans)
Example document: Criticism of government response to the hurricane primarily consisted of criticism of its response to … The total shut-in oil production from the Gulf of Mexico … approximately 24% of the annual production and the shut-in gas production … Over seventy countries pledged monetary donations or other assistance. …
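Probabilistic Model
• To generate a document D with context feature set C:
– Choose a view $v_i$ according to the view distribution $p(v_i \mid D, C)$
– Choose a coverage $\kappa_j$ according to the coverage distribution $p(\kappa_j \mid D, C)$
– Choose a theme $\theta_{il}$ according to the coverage $\kappa_j$
– Generate a word using $\theta_{il}$
– The likelihood of the document collection is:
$$\log p(\mathcal{D}) = \sum_{(D,C) \in \mathcal{D}} \sum_{w \in V} c(w, D) \log\Big( \sum_{i=1}^{n} p(v_i \mid D, C) \sum_{j=1}^{m} p(\kappa_j \mid D, C) \sum_{l=1}^{k} p(l \mid \kappa_j)\, p(w \mid \theta_{il}) \Big)$$
where $c(w, D)$ is the count of word $w$ in document $D$ and $V$ is the vocabulary.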
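The generation steps on this slide (view, coverage, theme, word) can be sketched as a small sampler. This is a minimal illustration under stated assumptions: the steps are repeated for every word, which is how the likelihood sums over views and coverages per word, and all view names, coverage names, and probabilities below are made-up toy values.

```python
import random


def draw(dist, rng):
    """Sample one key from a {outcome: probability} dict."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes], k=1)[0]


def generate_document(n_words, view_dist, coverage_dist, coverages, views, rng):
    """Generate words by repeating the four steps for each word.

    view_dist:     p(v_i | D, C)      as {view: prob}
    coverage_dist: p(kappa_j | D, C)  as {coverage: prob}
    coverages:     p(l | kappa_j)     as {coverage: {theme: prob}}
    views:         p(w | theta_il)    as {view: {theme: {word: prob}}}
    """
    words = []
    for _ in range(n_words):
        view = draw(view_dist, rng)               # choose a view
        coverage = draw(coverage_dist, rng)       # choose a coverage
        theme = draw(coverages[coverage], rng)    # choose a theme
        words.append(draw(views[view][theme], rng))  # draw a word
    return words


# Toy parameters (hypothetical values, for illustration only).
view_dist = {"general": 0.5, "texas": 0.5}
coverage_dist = {"texas": 0.7, "july_2005": 0.3}
coverages = {
    "texas": {"government": 0.6, "donation": 0.4},
    "july_2005": {"government": 0.2, "donation": 0.8},
}
views = {
    "general": {
        "government": {"government": 0.6, "response": 0.4},
        "donation": {"donate": 0.7, "relief": 0.3},
    },
    "texas": {
        "government": {"government": 0.3, "response": 0.7},
        "donation": {"donate": 0.4, "relief": 0.6},
    },
}

doc = generate_document(20, view_dist, coverage_dist, coverages, views, random.Random(0))
print(" ".join(doc))
```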
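Parameter Estimation: EM Algorithm
E-step:
$$p(z_{w,i,j,l} = 1) = \frac{p^{(t)}(v_i \mid D, C)\, p^{(t)}(\kappa_j \mid D, C)\, p^{(t)}(l \mid \kappa_j)\, p^{(t)}(w \mid \theta_{il})}{\sum_{i'=1}^{n} \sum_{j'=1}^{m} \sum_{l'=1}^{k} p^{(t)}(v_{i'} \mid D, C)\, p^{(t)}(\kappa_{j'} \mid D, C)\, p^{(t)}(l' \mid \kappa_{j'})\, p^{(t)}(w \mid \theta_{i'l'})}$$
M-step:
$$p^{(t+1)}(v_i \mid D, C) = \frac{\sum_{w \in V} c(w, D) \sum_{j=1}^{m} \sum_{l=1}^{k} p(z_{w,i,j,l} = 1)}{\sum_{i'=1}^{n} \sum_{w \in V} c(w, D) \sum_{j'=1}^{m} \sum_{l'=1}^{k} p(z_{w,i',j',l'} = 1)}$$
$$p^{(t+1)}(\kappa_j \mid D, C) = \frac{\sum_{w \in V} c(w, D) \sum_{i=1}^{n} \sum_{l=1}^{k} p(z_{w,i,j,l} = 1)}{\sum_{j'=1}^{m} \sum_{w \in V} c(w, D) \sum_{i'=1}^{n} \sum_{l'=1}^{k} p(z_{w,i',j',l'} = 1)}$$
$$p^{(t+1)}(l \mid \kappa_j) = \frac{\sum_{(D,C) \in \mathcal{D}} \sum_{w \in V} c(w, D) \sum_{i=1}^{n} p(z_{w,i,j,l} = 1)}{\sum_{l'=1}^{k} \sum_{(D,C) \in \mathcal{D}} \sum_{w \in V} c(w, D) \sum_{i'=1}^{n} p(z_{w,i',j,l'} = 1)}$$
$$p^{(t+1)}(w \mid \theta_{il}) = \frac{\sum_{(D,C) \in \mathcal{D}} c(w, D) \sum_{j=1}^{m} p(z_{w,i,j,l} = 1)}{\sum_{w' \in V} \sum_{(D,C) \in \mathcal{D}} c(w', D) \sum_{j'=1}^{m} p(z_{w',i,j',l} = 1)}$$
• Interesting patterns:
– Theme content variation for each view: $p(w \mid \theta_{il})$
– Theme strength variation for each context: $p(l \mid \kappa_j)$
• Prior from a user can be incorporated using MAP estimation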
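To make the EM procedure concrete, here is a small pure-Python sketch. The E-step computes, for each word occurrence, the posterior over (view, coverage, theme) triples; the M-step renormalizes the expected counts into the four parameter sets. It is an illustration rather than the authors' implementation, and it assumes that every document may use every view and every coverage, whereas the full model ties them to each document's context. The function name `em_cplsa` and the toy documents are hypothetical.

```python
import math
import random


def normalize(xs):
    s = sum(xs)
    return [x / s for x in xs]


def random_dist(size, rng):
    return normalize([rng.random() + 0.1 for _ in range(size)])


def em_cplsa(docs, vocab, n, m, k, iters=20, seed=0):
    """Run EM with n views, m coverages, k themes on docs = [{word: count}]."""
    rng = random.Random(seed)
    idx = {w: a for a, w in enumerate(vocab)}
    size = len(vocab)
    pv = [random_dist(n, rng) for _ in docs]        # p(v_i | D, C)
    pk = [random_dist(m, rng) for _ in docs]        # p(kappa_j | D, C)
    pl = [random_dist(k, rng) for _ in range(m)]    # p(l | kappa_j)
    pw = [[random_dist(size, rng) for _ in range(k)] for _ in range(n)]  # p(w | theta_il)
    history = []  # log-likelihood of the parameters entering each iteration
    for _ in range(iters):
        new_pv = [[0.0] * n for _ in docs]
        new_pk = [[0.0] * m for _ in docs]
        new_pl = [[0.0] * k for _ in range(m)]
        new_pw = [[[0.0] * size for _ in range(k)] for _ in range(n)]
        loglik = 0.0
        for d, doc in enumerate(docs):
            for word, count in doc.items():
                a = idx[word]
                # E-step: joint weight of every (view, coverage, theme) triple.
                joint = {
                    (i, j, th): pv[d][i] * pk[d][j] * pl[j][th] * pw[i][th][a]
                    for i in range(n) for j in range(m) for th in range(k)
                }
                total = sum(joint.values())
                loglik += count * math.log(total)
                # Accumulate expected counts c(w, D) * p(z = 1).
                for (i, j, th), value in joint.items():
                    r = count * value / total
                    new_pv[d][i] += r
                    new_pk[d][j] += r
                    new_pl[j][th] += r
                    new_pw[i][th][a] += r
        # M-step: renormalize the expected counts.
        pv = [normalize(row) for row in new_pv]
        pk = [normalize(row) for row in new_pk]
        pl = [normalize(row) for row in new_pl]
        pw = [[normalize(row) for row in view] for view in new_pw]
        history.append(loglik)
    return pv, pk, pl, pw, history


# Toy collection (hypothetical counts).
docs = [
    {"oil": 3, "price": 2, "gas": 1},
    {"government": 2, "response": 2, "oil": 1},
    {"donate": 2, "relief": 1, "government": 1},
]
vocab = sorted({w for d in docs for w in d})
pv, pk, pl, pw, history = em_cplsa(docs, vocab, n=2, m=2, k=2, iters=15)
print([round(x, 3) for x in history])
```

The log-likelihood recorded at each iteration should never decrease, which is a convenient sanity check for any EM implementation.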
Regularization of the Model
• Why?
– Generality → high complexity (inefficient, multiple local maxima)
– Real applications have domain constraints/knowledge
• Two useful simplifications:
– Fixed-Coverage: Only analyze the content variation of themes (e.g., author-topic analysis, cross-collection comparative analysis)
– Fixed-View: Only analyze the coverage variation of themes (e.g., spatiotemporal theme analysis)
• In general
– Impose priors on model parameters
– Support the whole spectrum from unsupervised to supervised learning
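One common way to impose a prior on a word distribution is to treat it as pseudo-counts added to the expected counts in the M-step (a conjugate Dirichlet prior). The sketch below assumes that form; the slides do not specify it. The function name, the `strength` parameter, and the numbers are all illustrative.

```python
def map_estimate(expected_counts, prior, strength):
    """MAP re-estimate of a word distribution under a conjugate prior.

    expected_counts: {word: expected count from the E-step}
    prior:           {word: prior probability}, assumed to sum to 1
    strength:        total pseudo-count mass given to the prior
    """
    vocab = set(expected_counts) | set(prior)
    total = sum(expected_counts.values()) + strength
    return {
        w: (expected_counts.get(w, 0.0) + strength * prior.get(w, 0.0)) / total
        for w in vocab
    }


# A user hints that a theme should be about battery life (hypothetical numbers).
counts = {"battery": 3.0, "life": 1.0}
prior = {"battery": 0.5, "hours": 0.5}

unsupervised = map_estimate(counts, prior, strength=0.0)  # plain ML estimate
guided = map_estimate(counts, prior, strength=4.0)        # prior worth 4 pseudo-counts
print(unsupervised)
print(guided)
```

With `strength=0` the estimate is purely data-driven, and as `strength` grows the estimate approaches the prior, which is one way to read the spectrum from unsupervised to supervised learning.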
Interpretation of Topics
[Figure: pipeline for labeling statistical (multinomial) topic models]
Example topic: term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, …
– Collection (Context) → NLP Chunker / Ngram stat. → Candidate label pool: database system, clustering algorithm, r tree, functional dependency, iceberg cube, concurrency control, index structure, …
– Candidate label pool + multinomial topic models → Relevance Score → Re-ranking (Coverage; Discrimination) → Ranked List of Labels: clustering algorithm; distance measure; …
Relevance: the Zero-Order Score
• Intuition: prefer phrases covering high probability words
Latent Topic $\theta$ with word distribution $p(w \mid \theta)$: clustering, dimensional, algorithm, birch, shape, …, body, …
Good Label ($l_1$): “clustering algorithm”
Bad Label ($l_2$): “body shape”
Score: $\dfrac{p(l \mid \theta)}{p(l)}$
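A minimal sketch of the zero-order score, comparing how likely a label is under the topic with how likely it is in general, p(l|θ)/p(l). The sketch works in log space and assumes the words of a label are independent, so the score becomes a sum of per-word log ratios. The probabilities and the `floor` parameter for unseen words are hypothetical.

```python
import math


def zero_order_score(label, topic, background, floor=1e-6):
    """Sum of log p(w | theta) / p(w) over the words of the label."""
    return sum(
        math.log(topic.get(w, floor) / background.get(w, floor))
        for w in label.split()
    )


# Hypothetical topic and background word probabilities.
topic = {
    "clustering": 0.12, "dimensional": 0.05, "algorithm": 0.08,
    "birch": 0.03, "shape": 0.02, "body": 0.0005,
}
background = {w: 0.01 for w in topic}

good = zero_order_score("clustering algorithm", topic, background)
bad = zero_order_score("body shape", topic, background)
print(round(good, 3), round(bad, 3))
```

A label made of words the topic favors scores above zero, while a label with a word that is rare in the topic is penalized.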
Relevance: the First-Order Score
• Intuition: prefer phrases with similar context (distribution)
[Figure: word distributions over clustering, dimension, partition, algorithm, hash, join, … for the topic, $P(w \mid \theta)$, and for each label, $P(w \mid l_1)$ and $P(w \mid l_2)$, estimated from the context C: SIGMOD Proceedings]
Good Label ($l_1$): “clustering algorithm”
Bad Label ($l_2$): “hash join”
$D(\theta \,\|\, l_1) < D(\theta \,\|\, l_2)$
$$\text{Score}(l, \theta) = \sum_{w} p(w \mid \theta)\, \mathrm{PMI}(w, l \mid C)$$
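The first-order score, the sum over words of p(w|θ) · PMI(w, l | C), can be sketched with document-level co-occurrence counts in the context collection C. The toy collection, the topic probabilities, and the crude `eps` smoothing (used only to avoid log(0)) are assumptions for illustration.

```python
import math


def pmi(word, label, docs, eps=0.5):
    """PMI(w, l | C) from document co-occurrence, with crude eps smoothing."""
    n = len(docs)
    has_w = [word in d.split() for d in docs]
    has_l = [label in d for d in docs]
    both = sum(1 for a, b in zip(has_w, has_l) if a and b)
    p_w = (sum(has_w) + eps) / (n + eps)
    p_l = (sum(has_l) + eps) / (n + eps)
    p_wl = (both + eps) / (n + eps)
    return math.log(p_wl / (p_w * p_l))


def first_order_score(label, topic, docs):
    """Expected PMI between the label and words drawn from the topic."""
    return sum(p * pmi(w, label, docs) for w, p in topic.items())


# Toy context collection and topic (hypothetical).
docs = [
    "clustering algorithm for high dimension data partition",
    "a clustering algorithm with partition of dimension",
    "hash join algorithm in query processing",
    "hash join and sort merge join",
    "clustering algorithm birch",
]
topic = {"clustering": 0.4, "dimension": 0.2, "partition": 0.2,
         "algorithm": 0.15, "hash": 0.05}

good = first_order_score("clustering algorithm", topic, docs)
bad = first_order_score("hash join", topic, docs)
print(round(good, 3), round(bad, 3))
```

The label whose co-occurring words match the topic's high-probability words gets the higher score, mirroring the divergence comparison on the slide.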
Sample Results
• Comparative text mining
• Spatiotemporal pattern mining
• Sentiment summary
• Event impact analysis
• Temporal author-topic analysis
Comparing News Articles: Iraq War (30 articles) vs. Afghan War (26 articles)
 | Cluster 1 | Cluster 2 | Cluster 3
Common Theme | united 0.042, nations 0.04, … | killed 0.035, month 0.032, deaths 0.023, … | …
Iraq Theme | n 0.03, Weapons 0.024, Inspections 0.023, … | troops 0.016, hoon 0.015, sanches 0.012, … | …
Afghan Theme | Northern 0.04, alliance 0.04, kabul 0.03, taleban 0.025, aid 0.02, … | taleban 0.026, rumsfeld 0.02, hotel 0.012, front 0.011, … | …
The common theme indicates that “United Nations” is involved in both wars
Collection-specific themes indicate different roles of “United Nations” in the two wars
Comparing Laptop Reviews
Top words serve as “labels” for common themes (e.g., [sound, speakers], [battery, hours], [cd, drive])
These word distributions can be used to segment text and add hyperlinks between documents
Spatiotemporal Patterns in Blog Articles
• Query= “Hurricane Katrina”
• Topics in the results:
• Spatiotemporal patterns
Theme Life Cycles for Hurricane Katrina
[Figure: strength over time of two themes]
New Orleans: city 0.0634, orleans 0.0541, new 0.0342, louisiana 0.0235, flood 0.0227, evacuate 0.0211, storm 0.0177, …
Oil Price: price 0.0772, oil 0.0643, gas 0.0454, increase 0.0210, product 0.0203, fuel 0.0188, company 0.0182, …
Theme Snapshots for Hurricane Katrina
Week1: The theme is the strongest along the Gulf of Mexico
Week2: The discussion moves towards the north and west
Week3: The theme distributes more uniformly over the states
Week4: The theme is again strong along the east coast and the Gulf of Mexico
Week5: The theme fades out in most states
Theme Life Cycles: KDD
[Figure: Global theme life cycles of KDD abstracts. x-axis: Time (year), 1999 to 2004; y-axis: Normalized Strength of Theme, 0 to 0.02. Themes plotted: Biology Data, Web Information, Time Series, Classification, Association Rule, Clustering, Business]
Example themes:
– gene 0.0173, expressions 0.0096, probability 0.0081, microarray 0.0038, …
– marketing 0.0087, customer 0.0086, model 0.0079, business 0.0048, …
– rules 0.0142, association 0.0064, support 0.0053, …
Theme Evolution Graph: KDD
[Figure: graph of themes evolving along a time axis marked 1999, …, 2000, 2001, 2002, 2003, 2004]
Themes shown:
– SVM 0.007, criteria 0.007, classification 0.006, linear 0.005, …
– decision 0.006, tree 0.006, classifier 0.005, class 0.005, Bayes 0.005, …
– classification 0.015, text 0.013, unlabeled 0.012, document 0.008, labeled 0.008, learning 0.007, …
– information 0.012, web 0.010, social 0.008, retrieval 0.007, distance 0.005, networks 0.004, …
– web 0.009, classification 0.007, features 0.006, topic 0.005, …
– mixture 0.005, random 0.006, cluster 0.006, clustering 0.005, variables 0.005, …
– topic 0.010, mixture 0.008, LDA 0.006, semantic 0.005, …
– ……
Blog Sentiment Summary (query=“Da Vinci Code”)
Facet 1: Movie
– Neutral: ... Ron Howards selection of Tom Hanks to play Robert Langdon. | Directed by: Ron Howard Writing credits: Akiva Goldsman ... | After watching the movie I went online and some research on ...
– Positive: Tom Hanks stars in the movie,who can be mad at that? | Tom Hanks, who is my favorite movie star act the leading role. | Anybody is interested in it?
– Negative: But the movie might get delayed, and even killed off if he loses. | protesting ... will lose your faith by ... watching the movie. | ... so sick of people making such a big deal about a FICTION book and movie.
Facet 2: Book
– Neutral: I remembered when i first read the book, I finished the book in two days. | I’m reading “Da Vinci Code” now.
– Positive: Awesome book. | So still a good book to past time.
– Negative: ... so sick of people making such a big deal about a FICTION book and movie. | This controversy book cause lots conflict in west society.
…
Results: Sentiment Dynamics
Facet: the book “the da vinci code” (Bursts during the movie, Pos > Neg)
Facet: religious beliefs (Bursts during the movie, Neg > Pos)
Event Impact Analysis: IR Research
Theme: retrieval models (SIGIR papers)
term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, …
Events on the time axis (year):
– 1992: Starting of the TREC conferences
– 1998: Publication of the paper “A language modeling approach to information retrieval”
Views of the theme around the events:
– vector 0.0514, concept 0.0298, extend 0.0297, model 0.0291, space 0.0236, boolean 0.0151, function 0.0123, feedback 0.0077, …
– xml 0.0678, email 0.0197, model 0.0191, collect 0.0187, judgment 0.0102, rank 0.0097, subtopic 0.0079, …
– probabilist 0.0778, model 0.0432, logic 0.0404, ir 0.0338, boolean 0.0281, algebra 0.0200, estimate 0.0119, weight 0.0111, …
– model 0.1687, language 0.0753, estimate 0.0520, parameter 0.0281, distribution 0.0268, probable 0.0205, smooth 0.0198, markov 0.0137, likelihood 0.0059, …
Temporal-Author-Topic Analysis
Global theme: frequent patterns
[Figure: views of the theme for two authors (labeled Author A and Author B; the names shown are Jiawei Han and Rakesh Agrawal) along a time axis marked at 2000]
Word distributions shown:
– pattern 0.1107, frequent 0.0406, frequent-pattern 0.039, sequential 0.0360, pattern-growth 0.0203, constraint 0.0184, push 0.0138, …
– project 0.0444, itemset 0.0433, intertransaction 0.0397, support 0.0264, associate 0.0258, frequent 0.0181, closet 0.0176, prefixspan 0.0170, …
– research 0.0551, next 0.0308, transaction 0.0308, panel 0.0275, technical 0.0275, article 0.0258, revolution 0.0154, innovate 0.0154, …
– close 0.0805, pattern 0.0720, sequential 0.0462, min_support 0.0353, threshold 0.0207, top-k 0.0176, fp-tree 0.0102, …
– index 0.0440, graph 0.0343, web 0.0307, gspan 0.0273, substructure 0.0201, gindex 0.0164, bide 0.0115, xml 0.0109, …
Related Work
• Specific Contextual Text Mining Problems
– Multi-collection Comparative Mining (e.g., [Zhai et al. 04])
– Spatiotemporal theme analysis (e.g., [Mei et al. 06])
– Author-topic analysis (e.g., [Steyvers et al. 04])
– …
• Probabilistic topic models:
– Probabilistic latent semantic analysis (PLSA) (e.g., [Hofmann 99])
– Latent Dirichlet allocation (LDA) (e.g., [Blei et al. 03])
Conclusions
• Defined a general text mining problem – contextual text mining
• Proposed a general solution
– Contextual probabilistic latent semantic analysis
– Probabilistic labeling of topics
• Many applications
• Future work
– Evaluation
– Similar extension to LDA
– Applications
References
• CPLSA
– Q. Mei, C. Zhai. A Mixture Model for Contextual Text Mining. In Proceedings of KDD '06.
• Labeling
– Q. Mei, X. Shen, C. Zhai. Automatic Labeling of Multinomial Topic Models. In Proceedings of KDD '07.
• Special cases:
– C. Zhai, A. Velivelli, B. Yu. A Cross-Collection Mixture Model for Comparative Text Mining. In Proceedings of KDD '04.
– Q. Mei, C. Zhai. Discovering Evolutionary Theme Patterns from Text: An Exploration of Temporal Text Mining. In Proceedings of KDD '05.
– Q. Mei, C. Liu, H. Su, C. Zhai. A Probabilistic Approach to Spatiotemporal Theme Pattern Mining on Weblogs. In Proceedings of WWW '06.
– Q. Mei, X. Ling, M. Wondra, H. Su, C. Zhai. Topic Sentiment Mixture: Modeling Facets and Opinions in Weblogs. In Proceedings of WWW '07.
Research of the IR group @ UIUC
[Figure: overview of the group's research. Labels shown: Text; Natural Language Content Analysis; Search, Filtering, Categorization, Summarization, Clustering, Extraction, Mining, Visualization; Search Applications, Mining Applications; Information Access, Knowledge Acquisition, Information Organization; Web, Email, and Bioinformatics; Entity/Relation Extraction]
Current focus: Personalized; Retrieval models; Difficult queries; Comparative text mining