Contextual Text Mining with Probabilistic Topic Models

2007 © ChengXiang Zhai LLNL, Aug 15, 2007 1 Contextual Text Mining with Probabilistic Topic Models ChengXiang Zhai Department of Computer Science Graduate School of Library & Information Science Institute for Genomic Biology, Statistics University of Illinois, Urbana-Champaign http://www-faculty.cs.uiuc.edu/~czhai, [email protected] Joint work with Qiaozhu Mei

Page 1: Contextual Text Mining with Probabilistic Topic Models

2007 © ChengXiang Zhai LLNL, Aug 15, 2007 1

Contextual Text Mining with Probabilistic Topic Models

ChengXiang Zhai

Department of Computer Science

Graduate School of Library & Information Science

Institute for Genomic Biology, Statistics

University of Illinois, Urbana-Champaign

http://www-faculty.cs.uiuc.edu/~czhai, [email protected]

Joint work with Qiaozhu Mei

Page 2: Contextual Text Mining with Probabilistic Topic Models

Motivating Example: Comparing Product Reviews

Input collections: IBM laptop reviews, APPLE laptop reviews, DELL laptop reviews

Common Themes | “IBM” specific    | “APPLE” specific   | “DELL” specific
Battery Life  | Long, 4-3 hrs     | Medium, 3-2 hrs    | Short, 2-1 hrs
Hard disk     | Large, 80-100 GB  | Small, 5-10 GB     | Medium, 20-50 GB
Speed         | Slow, 100-200 MHz | Very Fast, 3-4 GHz | Moderate, 1-2 GHz

Unsupervised discovery of common topics and their variations

Page 3: Contextual Text Mining with Probabilistic Topic Models

Motivating Example: Comparing News about Similar Topics

Input collections: Vietnam War, Afghan War, Iraq War

Common Themes   | “Vietnam” specific | “Afghan” specific | “Iraq” specific
United Nations  | …                  | …                 | …
Death of people | …                  | …                 | …
…               | …                  | …                 | …

Unsupervised discovery of common topics and their variations

Page 4: Contextual Text Mining with Probabilistic Topic Models

Motivating Example: Discovering Topical Trends in Literature

Unsupervised discovery of topics and their temporal variations

[Figure: theme strength (y-axis) over time (x-axis: 1980, 1990, 1998, 2003) for the themes TF-IDF Retrieval, IR Applications, Language Model, and Text Categorization]

Page 5: Contextual Text Mining with Probabilistic Topic Models

Motivating Example: Analyzing Spatial Topic Patterns

• How do blog writers in different states respond to topics such as “oil price increase during Hurricane Katrina”?

• Unsupervised discovery of topics and their variations in different locations

Page 6: Contextual Text Mining with Probabilistic Topic Models

Motivating Example: Sentiment Summary

Unsupervised/Semi-supervised discovery of topics and different sentiments of the topics

Page 7: Contextual Text Mining with Probabilistic Topic Models

Research Questions

• Can we model all these problems generally?

• Can we solve these problems with a unified approach?

• How can we bring humans into the loop?

Page 8: Contextual Text Mining with Probabilistic Topic Models

Rest of Talk

• Contextual Text Mining

• The CPLSA Model

• Sample results of specific CPLSA models

• Discussion

Page 9: Contextual Text Mining with Probabilistic Topic Models

Contextual Text Mining

• Given collections of text with contextual information (meta-data)

• Discover themes/subtopics/topics (interesting word clusters)

• Compute variations of themes over contexts

• Applications:

– Summarizing search results

– Federation of text information

– Opinion analysis

– Social network analysis

– Business intelligence

– …

Page 10: Contextual Text Mining with Probabilistic Topic Models

Context Features of Text (Meta-data)

[Figure: a weblog article surrounded by its context features: author, author’s occupation, location, time, communities, source]

Page 11: Contextual Text Mining with Probabilistic Topic Models

Context = Partitioning of Text

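[Figure: a collection of papers partitioned along different context dimensions: by year (1998, 1999, …, 2005, 2006) and by venue (WWW, SIGIR, ACL, KDD, SIGMOD). Example partitions: papers written in 1998; papers written by authors in US; papers about Web]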

Page 12: Contextual Text Mining with Probabilistic Topic Models

Themes/Topics

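• Uses of themes:

– Summarize topics/subtopics

– Navigate in a document space

– Retrieve documents

– Segment documents

– …

[Figure: a text covered by k themes plus a background model, each a word distribution; brackets in the example text mark segments covered by different themes]

Theme 1: government 0.3, response 0.2, …

Theme 2: donate 0.1, relief 0.05, help 0.02, …

Theme k: city 0.2, new 0.1, orleans 0.05, …

Background B: is 0.05, the 0.04, a 0.03, …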

[ Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response ] to the [ flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated ] …[ Over seventy countries pledged monetary donations or other assistance]. …

Page 13: Contextual Text Mining with Probabilistic Topic Models

View of Themes: Context-Specific Version of Views

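[Figure: two themes, each shown with a general word list and with context-specific views for two contexts]

Theme 1: Retrieval Model. Top words: retrieve, model, relevance, document, query

Theme 2: Feedback. Top words: feedback, judge, expansion, pseudo, query

Context: Before 1998 (Traditional models). View words: vector, space, TF-IDF, Okapi, LSI, vector, Rocchio, weighting, feedback, term

Context: After 1998 (Language models). View words: retrieval, feedback, language, model, smoothing, query, generation, mixture, estimate, EM, pseudo, model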

Page 14: Contextual Text Mining with Probabilistic Topic Models

Coverage of Themes: Distribution over Themes

• Theme coverage can depend on context

[Figure: coverage of a text as a distribution over the themes Oil Price, Government Response, Aid and donation, and Background, shown separately for two contexts: Texas and Louisiana]

Example text: Criticism of government response to the hurricane primarily consisted of criticism of its response to … The total shut-in oil production from the Gulf of Mexico … approximately 24% of the annual production and the shut-in gas production … Over seventy countries pledged monetary donations or other assistance. …

Page 15: Contextual Text Mining with Probabilistic Topic Models

General Tasks of Contextual Text Mining

• Theme Extraction: Extract the global salient themes

– Common information shared over all contexts

• View Comparison: Compare a theme from different views

– Analyze the content variation of themes over contexts

• Coverage Comparison: Compare the theme coverage of different contexts

– Reveal how closely a theme is associated with a context

• Others:

– Causal analysis

– Correlation analysis

Page 16: Contextual Text Mining with Probabilistic Topic Models

A General Solution: CPLSA

• CPLSA = Contextual Probabilistic Latent Semantic Analysis

• An extension of the PLSA model ([Hofmann 99]) by

– Introducing context variables

– Modeling views of topics

– Modeling coverage variations of topics

• Process of contextual text mining

– Instantiation of CPLSA (context, views, coverage)

– Fit the model to text data (EM algorithm)

– Compute probabilistic topic patterns

Page 17: Contextual Text Mining with Probabilistic Topic Models

“Generation” Process of CPLSA

[Figure: generation of a document from themes, views, and theme coverages]

Document context: Time = July 2005; Location = Texas; Author = xxx; Occup. = Sociologist; Age Group = 45+; …

Themes, each available in several views (View1, View2, View3; e.g., Texas, July 2005, sociologist):

– government: government 0.3, response 0.2, …

– donation: donate 0.1, relief 0.05, help 0.02, …

– New Orleans: city 0.2, new 0.1, orleans 0.05, …

Theme coverages: Texas, July 2005, document, …

Steps:

– Choose a view

– Choose a coverage

– Choose a theme

– Draw a word from $\theta_i$ (e.g., government, response; donate, aid, help; new, Orleans)

Example document: Criticism of government response to the hurricane primarily consisted of criticism of its response to … The total shut-in oil production from the Gulf of Mexico … approximately 24% of the annual production and the shut-in gas production … Over seventy countries pledged monetary donations or other assistance. …
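The generation steps on this slide can be sketched as a small sampler. This is only an illustration under stated assumptions: the helper names (`draw`, `generate_document`), the two toy views, and all probabilities are hypothetical, and the view and coverage are re-drawn for every word, which is how the per-word mixture in the model's likelihood reads.

```python
import random

def draw(dist, rng):
    """Sample one outcome from a dict mapping outcome -> probability."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]

def generate_document(length, view_dist, coverage_dist, coverages, themes, rng):
    """Generate `length` words by repeating the four steps for each word."""
    words = []
    for _ in range(length):
        view = draw(view_dist, rng)                    # choose a view
        coverage = draw(coverage_dist, rng)            # choose a coverage
        theme = draw(coverages[coverage], rng)         # choose a theme given the coverage
        words.append(draw(themes[view][theme], rng))   # draw a word from that view of the theme
    return words

# Hypothetical toy parameters (illustrative values, not taken from any experiment).
themes = {
    "view1": {"government": {"government": 0.6, "response": 0.4},
              "donation": {"donate": 0.5, "relief": 0.3, "help": 0.2}},
    "view2": {"government": {"government": 0.3, "response": 0.7},
              "donation": {"donate": 0.2, "aid": 0.5, "help": 0.3}},
}
coverages = {
    "texas": {"government": 0.3, "donation": 0.7},
    "july_2005": {"government": 0.8, "donation": 0.2},
}
view_dist = {"view1": 0.5, "view2": 0.5}           # view distribution for this document
coverage_dist = {"texas": 0.6, "july_2005": 0.4}   # coverage distribution for this document

rng = random.Random(0)
sample_doc = generate_document(10, view_dist, coverage_dist, coverages, themes, rng)
print(sample_doc)
```

Fitting the model runs this process in reverse: given only the words and the context, EM estimates the distributions that were fixed by hand here.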

Page 18: Contextual Text Mining with Probabilistic Topic Models

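Probabilistic Model

• To generate a document D with context feature set C:

– Choose a view $v_i$ according to the view distribution $p(v_i \mid D, C)$

– Choose a coverage $\kappa_j$ according to the coverage distribution $p(\kappa_j \mid D, C)$

– Choose a theme $\theta_{il}$ according to the coverage $\kappa_j$

– Generate a word using $\theta_{il}$

– The likelihood of the document collection is:

\[
\log p(\mathcal{D}) = \sum_{(D,C) \in \mathcal{D}} \sum_{w \in V} c(w, D) \log \Big( \sum_{i=1}^{n} p(v_i \mid D, C) \sum_{j=1}^{m} p(\kappa_j \mid D, C) \sum_{l=1}^{k} p(l \mid \kappa_j)\, p(w \mid \theta_{il}) \Big)
\]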
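To make the collection log-likelihood concrete, here is a minimal sketch that evaluates it with plain Python lists and dicts. The data layout is an assumption made for illustration: each document carries its own view and coverage distributions (standing in for p(v_i | D, C) and p(κ_j | D, C)), `p_theme[j][l]` stands for p(l | κ_j), and `p_word[i][l]` is a dict for p(w | θ_il). All numbers are toy values.

```python
import math

def word_prob(w, p_view, p_cov, p_theme, p_word):
    """Mixture probability of word w in one document: sum over views, coverages, themes."""
    total = 0.0
    for i, pv in enumerate(p_view):
        for j, pk in enumerate(p_cov):
            for l, pl in enumerate(p_theme[j]):
                total += pv * pk * pl * p_word[i][l].get(w, 0.0)
    return total

def log_likelihood(docs, p_theme, p_word):
    """docs is a list of (word_counts, p_view, p_cov) triples, one per document."""
    ll = 0.0
    for counts, p_view, p_cov in docs:
        for w, c in counts.items():
            ll += c * math.log(word_prob(w, p_view, p_cov, p_theme, p_word))
    return ll

# Hypothetical toy setup: 1 view, 2 coverages, 2 themes.
p_word = [[{"oil": 0.7, "aid": 0.3}, {"oil": 0.2, "aid": 0.8}]]
p_theme = [[0.9, 0.1], [0.1, 0.9]]
docs = [
    ({"oil": 3, "aid": 1}, [1.0], [0.8, 0.2]),
    ({"oil": 1, "aid": 4}, [1.0], [0.3, 0.7]),
]
ll = log_likelihood(docs, p_theme, p_word)
print(ll)
```

The triple loop costs O(n·m·k) per distinct word per document, which is one reason the unrestricted model is expensive to fit.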

Page 19: Contextual Text Mining with Probabilistic Topic Models

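Parameter Estimation: EM Algorithm

E-step:

\[
p(z_{w,i,j,l} = 1) = \frac{p^{(t)}(v_i \mid D, C)\, p^{(t)}(\kappa_j \mid D, C)\, p^{(t)}(l \mid \kappa_j)\, p^{(t)}(w \mid \theta_{il})}{\sum_{i'=1}^{n} \sum_{j'=1}^{m} \sum_{l'=1}^{k} p^{(t)}(v_{i'} \mid D, C)\, p^{(t)}(\kappa_{j'} \mid D, C)\, p^{(t)}(l' \mid \kappa_{j'})\, p^{(t)}(w \mid \theta_{i'l'})}
\]

M-step:

\[
p^{(t+1)}(v_i \mid D, C) = \frac{\sum_{w \in V} c(w, D) \sum_{j=1}^{m} \sum_{l=1}^{k} p(z_{w,i,j,l} = 1)}{\sum_{i'=1}^{n} \sum_{w \in V} c(w, D) \sum_{j'=1}^{m} \sum_{l'=1}^{k} p(z_{w,i',j',l'} = 1)}
\]

\[
p^{(t+1)}(\kappa_j \mid D, C) = \frac{\sum_{w \in V} c(w, D) \sum_{i=1}^{n} \sum_{l=1}^{k} p(z_{w,i,j,l} = 1)}{\sum_{j'=1}^{m} \sum_{w \in V} c(w, D) \sum_{i'=1}^{n} \sum_{l'=1}^{k} p(z_{w,i',j',l'} = 1)}
\]

\[
p^{(t+1)}(l \mid \kappa_j) = \frac{\sum_{(D,C) \in \mathcal{D}} \sum_{w \in V} c(w, D) \sum_{i=1}^{n} p(z_{w,i,j,l} = 1)}{\sum_{l'=1}^{k} \sum_{(D,C) \in \mathcal{D}} \sum_{w \in V} c(w, D) \sum_{i'=1}^{n} p(z_{w,i',j,l'} = 1)}
\]

\[
p^{(t+1)}(w \mid \theta_{il}) = \frac{\sum_{(D,C) \in \mathcal{D}} c(w, D) \sum_{j=1}^{m} p(z_{w,i,j,l} = 1)}{\sum_{w' \in V} \sum_{(D,C) \in \mathcal{D}} c(w', D) \sum_{j'=1}^{m} p(z_{w',i,j',l} = 1)}
\]

• Interesting patterns:

– Theme content variation for each view: $p(w \mid \theta_{il})$

– Theme strength variation for each context: $p(l \mid \kappa_j)$

• Prior from a user can be incorporated using MAP estimation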
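A single EM iteration for this kind of model can be sketched as below. This is an illustration under assumptions, not the authors' implementation: every document gets its own view and coverage distribution (a simplification of the context-dependent distributions), parameters live in plain lists and dicts, and the function name `em_step` and all toy numbers are hypothetical.

```python
from collections import defaultdict

def normalize(xs):
    """Scale a list of non-negative numbers to sum to 1 (uniform if all zero)."""
    s = sum(xs)
    return [x / s for x in xs] if s > 0 else [1.0 / len(xs)] * len(xs)

def em_step(docs, p_view, p_cov, p_theme, p_word):
    """docs[d]: word counts; p_view[d][i]; p_cov[d][j]; p_theme[j][l]; p_word[i][l][w]."""
    n, m, k = len(p_word), len(p_theme), len(p_theme[0])
    new_view = [[0.0] * n for _ in docs]
    new_cov = [[0.0] * m for _ in docs]
    new_theme = [[0.0] * k for _ in range(m)]
    new_word = [[defaultdict(float) for _ in range(k)] for _ in range(n)]
    for d, counts in enumerate(docs):
        for w, c in counts.items():
            # E-step: posterior over (view i, coverage j, theme l) for word w in doc d.
            joint = {}
            for i in range(n):
                for j in range(m):
                    for l in range(k):
                        joint[i, j, l] = (p_view[d][i] * p_cov[d][j]
                                          * p_theme[j][l] * p_word[i][l].get(w, 0.0))
            z = sum(joint.values())
            if z == 0.0:
                continue  # word has zero probability under the current model; skip it
            # Accumulate expected counts c(w, D) * p(z = 1) for the M-step.
            for (i, j, l), v in joint.items():
                r = c * v / z
                new_view[d][i] += r
                new_cov[d][j] += r
                new_theme[j][l] += r
                new_word[i][l][w] += r
    # M-step: renormalize the expected counts into distributions.
    upd_word = []
    for i in range(n):
        row = []
        for l in range(k):
            s = sum(new_word[i][l].values())
            row.append({w: v / s for w, v in new_word[i][l].items()} if s > 0
                       else dict(p_word[i][l]))
        upd_word.append(row)
    return ([normalize(r) for r in new_view], [normalize(r) for r in new_cov],
            [normalize(r) for r in new_theme], upd_word)

# Hypothetical toy run: 2 documents, 1 view, 2 coverages, 2 themes.
docs = [{"oil": 3, "price": 2, "aid": 1}, {"aid": 3, "donate": 2, "oil": 1}]
p_view = [[1.0], [1.0]]
p_cov = [[0.6, 0.4], [0.4, 0.6]]
p_theme = [[0.7, 0.3], [0.3, 0.7]]
p_word = [[{"oil": 0.4, "price": 0.3, "aid": 0.2, "donate": 0.1},
           {"oil": 0.1, "price": 0.2, "aid": 0.3, "donate": 0.4}]]
for _ in range(20):
    p_view, p_cov, p_theme, p_word = em_step(docs, p_view, p_cov, p_theme, p_word)
print(p_word[0][0])
```

The starting word distributions are deliberately asymmetric: with identical initial themes the posteriors would be identical too, and EM could not separate them.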

Page 20: Contextual Text Mining with Probabilistic Topic Models

Regularization of the Model

• Why?

– Generality leads to high complexity (inefficient, multiple local maxima)

– Real applications have domain constraints/knowledge

• Two useful simplifications:

– Fixed-Coverage: Only analyze the content variation of themes (e.g., author-topic analysis, cross-collection comparative analysis)

– Fixed-View: Only analyze the coverage variation of themes (e.g., spatiotemporal theme analysis)

• In general

– Impose priors on model parameters

– Support the whole spectrum from unsupervised to supervised learning
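One common way to impose a prior on a word distribution is a conjugate Dirichlet prior, which turns the M-step into a pseudo-count update. The sketch below assumes that form; the slides do not spell out the prior they use, and the function name `map_word_update`, the strength parameter `mu`, and the toy values are hypothetical.

```python
def map_word_update(expected_counts, prior, mu):
    """MAP update of p(w | theme) under an assumed Dirichlet prior.

    expected_counts: dict word -> expected count from the E-step
    prior: dict word -> prior probability (should sum to 1)
    mu: prior strength; 0 gives the maximum likelihood estimate,
        large values pull the estimate toward the prior
    """
    vocab = set(expected_counts) | set(prior)
    total = sum(expected_counts.values()) + mu
    return {w: (expected_counts.get(w, 0.0) + mu * prior.get(w, 0.0)) / total
            for w in vocab}

# Hypothetical example: a user wants one theme to be about battery life.
counts = {"battery": 3.0, "life": 1.0}
user_prior = {"battery": 0.5, "hours": 0.5}
unsupervised = map_word_update(counts, user_prior, mu=0.0)
guided = map_word_update(counts, user_prior, mu=4.0)
print(unsupervised)
print(guided)
```

Varying `mu` from 0 to a very large value moves the estimate from fully data-driven to fully prior-driven, which is one way to read the "whole spectrum from unsupervised to supervised learning".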

Page 21: Contextual Text Mining with Probabilistic Topic Models

Interpretation of Topics

[Figure: topic labeling pipeline]

– Input: multinomial topic models from statistical topic models, e.g., term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, …

– Candidate label pool: built from the collection (context) with an NLP chunker and n-gram statistics, e.g., database system, clustering algorithm, r tree, functional dependency, iceberg cube, concurrency control, index structure, …

– Relevance score

– Re-ranking: coverage; discrimination

– Output: ranked list of labels, e.g., clustering algorithm; distance measure; …

Page 22: Contextual Text Mining with Probabilistic Topic Models

Relevance: the Zero-Order Score

• Intuition: prefer phrases covering high probability words

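[Figure: a latent topic $\theta$ shown as a word distribution $p(w \mid \theta)$ over words such as clustering, dimensional, algorithm, birch, shape, …, body]

Good Label ($l_1$): “clustering algorithm”

Bad Label ($l_2$): “body shape”

Score: $\dfrac{p(l \mid \theta)}{p(l)}$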
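The zero-order intuition can be sketched with a few lines of code. This sketch assumes that the words of a label are independent, so that log p(l|θ)/p(l) becomes a sum of per-word log ratios; the function name `zero_order_score`, the background distribution, and all probabilities are hypothetical toy values.

```python
import math

def zero_order_score(label, topic, background, eps=1e-12):
    """Sum of log p(w | topic) / p(w) over the label's words (independence assumed)."""
    return sum(math.log(topic.get(w, eps) / background.get(w, eps))
               for w in label.split())

# Hypothetical topic and background word distributions (truncated toy values).
topic = {"clustering": 0.3, "algorithm": 0.2, "dimensional": 0.1,
         "birch": 0.05, "shape": 0.02, "body": 0.001}
background = {"clustering": 0.01, "algorithm": 0.02, "dimensional": 0.01,
              "birch": 0.001, "shape": 0.01, "body": 0.01}

good = zero_order_score("clustering algorithm", topic, background)
bad = zero_order_score("body shape", topic, background)
print(good, bad)
```

A label made of words that are much more likely under the topic than in general text gets a high score; "body shape" is penalized because "body" is rare in this topic.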

Page 23: Contextual Text Mining with Probabilistic Topic Models

Relevance: the First-Order Score

• Intuition: prefer phrases with similar context (distribution)

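[Figure: word distributions over clustering, dimension, partition, algorithm, hash, join, … for the topic, $P(w \mid \theta)$, and for two candidate labels, $P(w \mid l_1)$ and $P(w \mid l_2)$, estimated from the context collection C: SIGMOD Proceedings]

Good Label ($l_1$): “clustering algorithm”

Bad Label ($l_2$): “hash join”

$D(\theta \,\|\, l_1) < D(\theta \,\|\, l_2)$

\[
\mathrm{Score}(l, \theta) = \sum_{w} p(w \mid \theta)\, \mathrm{PMI}(w, l \mid C)
\]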
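The first-order score, a topic-weighted sum of PMI(w, l | C), can be sketched as follows. The estimation choices here are assumptions made for illustration: PMI is computed from document-level co-occurrence in a tiny toy collection, a label matches by simple substring test, and a word that never co-occurs with the label contributes 0 instead of negative infinity. The function names and the toy documents are hypothetical.

```python
import math

def pmi(word, label, docs):
    """Pointwise mutual information of a word and a label phrase within collection docs."""
    n = len(docs)
    has_w = sum(1 for d in docs if word in d.split())
    has_l = sum(1 for d in docs if label in d)
    both = sum(1 for d in docs if word in d.split() and label in d)
    if both == 0:
        return 0.0  # assumption: treat unseen pairs as neutral rather than -infinity
    return math.log(both * n / (has_w * has_l))

def first_order_score(label, topic, docs):
    """Sum over topic words of p(w | topic) * PMI(w, label | C)."""
    return sum(p * pmi(w, label, docs) for w, p in topic.items())

# Hypothetical context collection and topic (toy values).
docs = [
    "clustering algorithm for high dimension data partition",
    "a clustering algorithm with partition of dimension",
    "hash join algorithm for query processing",
    "hash join and index join in database",
    "clustering algorithm scales",
]
topic = {"clustering": 0.4, "dimension": 0.2, "partition": 0.2,
         "algorithm": 0.15, "hash": 0.05}

score_good = first_order_score("clustering algorithm", topic, docs)
score_bad = first_order_score("hash join", topic, docs)
print(score_good, score_bad)
```

Because the score weights each word's PMI by its topic probability, a label that co-occurs with the topic's high-probability words ranks above one that only matches a low-probability word such as "hash".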

Page 24: Contextual Text Mining with Probabilistic Topic Models

Sample Results

• Comparative text mining

• Spatiotemporal pattern mining

• Sentiment summary

• Event impact analysis

• Temporal author-topic analysis

Page 25: Contextual Text Mining with Probabilistic Topic Models

Comparing News Articles: Iraq War (30 articles) vs. Afghan War (26 articles)

             | Cluster 1 | Cluster 2 | Cluster 3
Common Theme | united 0.042, nations 0.04, … | killed 0.035, month 0.032, deaths 0.023, … |
Iraq Theme   | n 0.03, Weapons 0.024, Inspections 0.023, … | troops 0.016, hoon 0.015, sanches 0.012, … |
Afghan Theme | Northern 0.04, alliance 0.04, kabul 0.03, taleban 0.025, aid 0.02, … | taleban 0.026, rumsfeld 0.02, hotel 0.012, front 0.011, … |

The common theme indicates that “United Nations” is involved in both wars

Collection-specific themes indicate different roles of “United Nations” in the two wars

Page 26: Contextual Text Mining with Probabilistic Topic Models

Comparing Laptop Reviews

Top words serve as “labels” for common themes (e.g., [sound, speakers], [battery, hours], [cd, drive])

These word distributions can be used to segment text and add hyperlinks between documents

Page 27: Contextual Text Mining with Probabilistic Topic Models

Spatiotemporal Patterns in Blog Articles

• Query= “Hurricane Katrina”

• Topics in the results:

• Spatiotemporal patterns

Page 28: Contextual Text Mining with Probabilistic Topic Models

Theme Life Cycles for Hurricane Katrina

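[Figure: theme life cycles, i.e., theme strength over time, for two themes]

Oil Price: price 0.0772, oil 0.0643, gas 0.0454, increase 0.0210, product 0.0203, fuel 0.0188, company 0.0182, …

New Orleans: city 0.0634, orleans 0.0541, new 0.0342, louisiana 0.0235, flood 0.0227, evacuate 0.0211, storm 0.0177, …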

Page 29: Contextual Text Mining with Probabilistic Topic Models

Theme Snapshots for Hurricane Katrina

Week1: The theme is the strongest along the Gulf of Mexico

Week2: The discussion moves towards the north and west

Week3: The theme distributes more uniformly over the states

Week4: The theme is again strong along the east coast and the Gulf of Mexico

Week5: The theme fades out in most states

Page 30: Contextual Text Mining with Probabilistic Topic Models

Theme Life Cycles: KDD

[Figure: global theme life cycles of KDD abstracts. x-axis: Time (year), 1999 to 2004; y-axis: Normalized Strength of Theme, 0 to 0.02. Themes plotted: Biology Data, Web Information, Time Series, Classification, Association Rule, Clustering, Business]

Sample themes:

– gene 0.0173, expressions 0.0096, probability 0.0081, microarray 0.0038, …

– marketing 0.0087, customer 0.0086, model 0.0079, business 0.0048, …

– rules 0.0142, association 0.0064, support 0.0053, …

Page 31: Contextual Text Mining with Probabilistic Topic Models

Theme Evolution Graph: KDD

[Figure: theme evolution graph over the years 1999, 2000, 2001, 2002, 2003, 2004; each node is a theme shown by its top words]

– SVM 0.007, criteria 0.007, classification 0.006, linear 0.005, …

– decision 0.006, tree 0.006, classifier 0.005, class 0.005, Bayes 0.005, …

– classification 0.015, text 0.013, unlabeled 0.012, document 0.008, labeled 0.008, learning 0.007, …

– information 0.012, web 0.010, social 0.008, retrieval 0.007, distance 0.005, networks 0.004, …

– web 0.009, classification 0.007, features 0.006, topic 0.005, …

– mixture 0.005, random 0.006, cluster 0.006, clustering 0.005, variables 0.005, …

– topic 0.010, mixture 0.008, LDA 0.006, semantic 0.005, …

– ……

Page 32: Contextual Text Mining with Probabilistic Topic Models

Blog Sentiment Summary (query=“Da Vinci Code”)

Facet 1: Movie

– Neutral: ... Ron Howards selection of Tom Hanks to play Robert Langdon.

– Neutral: Directed by: Ron Howard Writing credits: Akiva Goldsman ...

– Neutral: After watching the movie I went online and some research on ...

– Positive: Tom Hanks stars in the movie,who can be mad at that?

– Positive: Tom Hanks, who is my favorite movie star act the leading role.

– Positive: Anybody is interested in it?

– Negative: But the movie might get delayed, and even killed off if he loses.

– Negative: protesting ... will lose your faith by ... watching the movie.

– Negative: ... so sick of people making such a big deal about a FICTION book and movie.

Facet 2: Book

– Neutral: I remembered when i first read the book, I finished the book in two days.

– Neutral: I’m reading “Da Vinci Code” now.

– Positive: Awesome book.

– Positive: So still a good book to past time.

– Negative: ... so sick of people making such a big deal about a FICTION book and movie.

– Negative: This controversy book cause lots conflict in west society.

Page 33: Contextual Text Mining with Probabilistic Topic Models

Results: Sentiment Dynamics

Facet: the book “the da vinci code” (Bursts during the movie, Pos > Neg)

Facet: religious beliefs (Bursts during the movie, Neg > Pos)

Page 34: Contextual Text Mining with Probabilistic Topic Models

Event Impact Analysis: IR Research

Theme: retrieval models (SIGIR papers)

[Figure: versions of the theme along a timeline (year) divided by two events: 1992, starting of the TREC conferences; 1998, publication of the paper “A language modeling approach to information retrieval”]

Word distributions shown:

– term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, …

– vector 0.0514, concept 0.0298, extend 0.0297, model 0.0291, space 0.0236, boolean 0.0151, function 0.0123, feedback 0.0077, …

– xml 0.0678, email 0.0197, model 0.0191, collect 0.0187, judgment 0.0102, rank 0.0097, subtopic 0.0079, …

– probabilist 0.0778, model 0.0432, logic 0.0404, ir 0.0338, boolean 0.0281, algebra 0.0200, estimate 0.0119, weight 0.0111, …

– model 0.1687, language 0.0753, estimate 0.0520, parameter 0.0281, distribution 0.0268, probable 0.0205, smooth 0.0198, markov 0.0137, likelihood 0.0059, …

Page 35: Contextual Text Mining with Probabilistic Topic Models

Temporal-Author-Topic Analysis

Global theme: frequent patterns

[Figure: views of the global theme by author (Author A, Author B) and by time (split at 2000). Authors named on the slide: Jiawei Han, Rakesh Agrawal]

Word distributions shown:

– pattern 0.1107, frequent 0.0406, frequent-pattern 0.039, sequential 0.0360, pattern-growth 0.0203, constraint 0.0184, push 0.0138, …

– project 0.0444, itemset 0.0433, intertransaction 0.0397, support 0.0264, associate 0.0258, frequent 0.0181, closet 0.0176, prefixspan 0.0170, …

– research 0.0551, next 0.0308, transaction 0.0308, panel 0.0275, technical 0.0275, article 0.0258, revolution 0.0154, innovate 0.0154, …

– close 0.0805, pattern 0.0720, sequential 0.0462, min_support 0.0353, threshold 0.0207, top-k 0.0176, fp-tree 0.0102, …

– index 0.0440, graph 0.0343, web 0.0307, gspan 0.0273, substructure 0.0201, gindex 0.0164, bide 0.0115, xml 0.0109, …

Page 36: Contextual Text Mining with Probabilistic Topic Models

Related Work

• Specific Contextual Text Mining Problems

– Multi-collection Comparative Mining (e.g., [Zhai et al. 04])

– Spatiotemporal theme analysis (e.g., [Mei et al. 06])

– Author-topic analysis (e.g., [Steyvers et al. 04])

– …

• Probabilistic topic models:

– Probabilistic latent semantic analysis (PLSA) (e.g., [Hofmann 99])

– Latent Dirichlet allocation (LDA) (e.g., [Blei et al. 03])

Page 37: Contextual Text Mining with Probabilistic Topic Models

Conclusions

• Defined a general text mining problem – contextual text mining

• Proposed a general solution

– Contextual probabilistic latent semantic analysis

– Probabilistic labeling of topics

• Many applications

• Future work

– Evaluation

– Similar extension to LDA

– Applications

Page 38: Contextual Text Mining with Probabilistic Topic Models

References

• CPLSA

– Q. Mei and C. Zhai. A Mixture Model for Contextual Text Mining. In Proceedings of KDD '06.

• Labeling

– Q. Mei, X. Shen, and C. Zhai. Automatic Labeling of Multinomial Topic Models. In Proceedings of KDD '07.

• Special cases:

– C. Zhai, A. Velivelli, and B. Yu. A Cross-Collection Mixture Model for Comparative Text Mining. In Proceedings of KDD '04.

– Q. Mei and C. Zhai. Discovering Evolutionary Theme Patterns from Text: An Exploration of Temporal Text Mining. In Proceedings of KDD '05.

– Q. Mei, C. Liu, H. Su, and C. Zhai. A Probabilistic Approach to Spatiotemporal Theme Pattern Mining on Weblogs. In Proceedings of WWW '06.

– Q. Mei, X. Ling, M. Wondra, H. Su, and C. Zhai. Topic Sentiment Mixture: Modeling Facets and Opinions in Weblogs. In Proceedings of WWW '07.

Page 39: Contextual Text Mining with Probabilistic Topic Models

Research of the IR group @ UIUC

[Figure: overview diagram of text information management built on Natural Language Content Analysis of Text. Components: Search, Filtering, Categorization, Summarization, Clustering, Extraction, Mining, Visualization. Groupings: Search Applications and Mining Applications; Information Access, Information Organization, and Knowledge Acquisition. Application domains: Web, Email, and Bioinformatics]

Current focus (search): Personalized; Retrieval models; Difficult queries

Current focus (mining): Comparative text mining

Also shown: Entity/Relation Extraction

Page 40: Contextual Text Mining with Probabilistic Topic Models

The End

Thank You!