
Data Mining: Concepts and Techniques

— Chapter 10 —10.3.2 Mining Text and Web Data (II)

Jiawei Han and Micheline Kamber

Department of Computer Science

University of Illinois at Urbana-Champaign

www.cs.uiuc.edu/~hanj

Acknowledgements: Slides by students at CS512 (Spring 2009)


Outline

• Probabilistic Topic Models (Yue Lu)

• Opinion Mining (Hyun Duk Kim)

• Mining Query Logs for Personalized Search (Yuanhua Lv)

• Online Analytical Processing on Multidimensional Text Database (Duo Zhang)


Probabilistic Topic Models

Yue LU

Department of Computer Science

University of Illinois, Urbana-Champaign

Many slides are adapted/taken from different sources, including presentations by ChengXiang Zhai, Qiaozhu Mei and Tom Griffiths

Intuition

• Documents exhibit multiple topics.

topic: Social network website

topic: education

topic: criticism

What is a Topic?

Topic: A broad concept/theme, semantically coherent, which is hidden in documents

Representation: a multinomial distribution of words, i.e., a unigram language model

Example: retrieval 0.2, information 0.15, model 0.08, query 0.07, language 0.06, feedback 0.03, …
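A minimal sketch of what "a topic is a unigram language model" means in practice. The six probabilities come from the slide; the `<other>` token is an assumption that stands in for the elided words so the distribution sums to 1, and the function names are illustrative only.

```python
import math
import random

# Word probabilities from the slide. "<other>" is a hypothetical stand-in
# for the words the slide elides ("……"), so that the model sums to 1.
topic = {"retrieval": 0.2, "information": 0.15, "model": 0.08,
         "query": 0.07, "language": 0.06, "feedback": 0.03}
topic["<other>"] = 1.0 - sum(topic.values())

def sample_words(lm, n, seed=0):
    """Draw n words independently from a unigram language model."""
    rng = random.Random(seed)
    words = list(lm)
    return rng.choices(words, weights=[lm[w] for w in words], k=n)

def log_prob(words, lm):
    """Log-probability of a word sequence: words are independent given the topic."""
    return sum(math.log(lm[w]) for w in words)

sampled = sample_words(topic, 5)
score = log_prob(["retrieval", "information"], topic)
```

Because the model is a unigram model, the probability of a text is just the product of its word probabilities, which is why word order plays no role in the topic.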

Organize Information with Topics

[Figure: units for organizing text at different resolutions (categories, topics, phrases, entities, words, patterns), annotated with how many of each occur in a document; the counts shown are 1, several, 50~100, hundreds, thousands, and many. Examples shown: "Natural hazards"; "oil price", "government response", "loss statistics", …; "new orleans", "president bush", "put together", …; "oil", "new", "put", "orleans", "is", …]

Example topic "oil price": price 0.0772, oil 0.0643, gas 0.0454, increase 0.0210, product 0.0203, fuel 0.0188, company 0.0182, …

The Usage of Topic Models

• Usage of a topic model:

– Summarize themes/aspects

– Navigate documents

– Retrieve documents

– Segment documents

– Document classification

– Document clustering

Topic 1: government 0.3, response 0.2, ...

Topic 2: city 0.2, new 0.1, orleans 0.05, ...

Topic k: donate 0.1, relief 0.05, help 0.02, ...

Background B: is 0.05, the 0.04, a 0.03, ...

Example document (brackets mark the segments that the figure links to topics):

[ Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response ] to the [ flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated ] … [ Over seventy countries pledged monetary donations or other assistance ]. …


General Idea of Probabilistic Topic Models

• Cast intuition into a generative probabilistic process (Generative Process)

– Each document is a mixture of corpus-wide topics (multinomial distribution/unigram LM)

– Each word is drawn from one of those topics

• Since we only observe the documents, need to figure out (Estimation/Inference)

– What are the topics?

– How are the documents divided according to those topics?

• Two basic models: PLSA and LDA
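The generative process above can be sketched in a few lines. This is an illustration only: the two toy topics, their probabilities, and the document's topic mixture are invented for the example, and `generate_document` is a hypothetical helper name.

```python
import random

def generate_document(topics, doc_topic_mix, length, seed=0):
    """Generate `length` words: for each word, first pick a topic from the
    document's topic mixture, then pick a word from that topic's unigram model."""
    rng = random.Random(seed)
    names = list(topics)
    doc = []
    for _ in range(length):
        z = rng.choices(names, weights=[doc_topic_mix[t] for t in names])[0]
        vocab = list(topics[z])
        w = rng.choices(vocab, weights=[topics[z][v] for v in vocab])[0]
        doc.append((z, w))  # keep the hidden topic next to the observed word
    return doc

# Hypothetical toy topics; the numbers are made up for illustration.
topics = {
    "government": {"government": 0.6, "response": 0.4},
    "donation": {"donate": 0.5, "relief": 0.3, "help": 0.2},
}
doc = generate_document(topics, {"government": 0.7, "donation": 0.3}, length=8)
```

Estimation runs this process in reverse: only the words are observed, and both the topics and the per-document mixtures have to be inferred.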


Probabilistic Latent Semantic Analysis/Indexing [Hofmann 99]

PLSA: Generation Process

[Figure: generating a word w in a document. With probability λB the word is drawn from the collection background model θB (is 0.05, the 0.04, a 0.03, ..); otherwise a topic θj is chosen according to the document's topic coverage πd,j and the word is drawn from it. Example topics: θ1 (battery 0.3, life 0.2, ..), θ2 (design 0.1, screen 0.05, ..), θk (price 0.2, purchase 0.15, ..).]

[Hofmann 99], [Zhai et al. 04]

Parameters: λB = noise level (manually set); the θ's and π's need to be estimated
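The formula on this slide was an image and did not survive extraction. As a sketch, the diagram corresponds to the following word distribution, written in the notation of [Zhai et al. 04]; the symbols λ_B (background weight), π_{d,j} (coverage of topic j in document d) and θ_j (topic j) are restored here as assumptions about the lost notation.

```latex
p(w \mid d) \;=\; \lambda_B \, p(w \mid \theta_B)
  \;+\; (1 - \lambda_B) \sum_{j=1}^{k} \pi_{d,j} \, p(w \mid \theta_j),
\qquad \sum_{j=1}^{k} \pi_{d,j} = 1 .
```

The first term absorbs common words such as "is" and "the", so the topics θ_1, …, θ_k can concentrate on content words.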

PLSA: Estimation

[Figure: the same generation diagram as on the previous slide, with the word probabilities and the document topic coverages replaced by "?" because they are unknown.]

[Hofmann 99], [Zhai et al. 04]

Objective: log-likelihood of the collection

Estimated with Maximum Likelihood Estimator (MLE) through an EM algorithm
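A compact sketch of such an EM procedure, assuming the background-mixture form of PLSA with a fixed background weight and a background model estimated from the whole collection. The collection log-likelihood it tracks is the sum over documents d and words w of c(w,d) · log p(w|d). The function name, the toy count matrix, and the default settings are assumptions for illustration, and every vocabulary word is assumed to occur at least once.

```python
import numpy as np

def plsa_em(counts, k, lambda_b=0.5, iters=50, seed=0):
    """EM for PLSA with a fixed background model.
    counts: (n_docs, n_words) term-count matrix, each word occurring at least once.
    Returns theta (k, n_words), pi (n_docs, k), and the log-likelihood per iteration."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    background = counts.sum(axis=0) / counts.sum()     # p(w | theta_B)
    theta = rng.dirichlet(np.ones(n_words), size=k)    # p(w | theta_j)
    pi = rng.dirichlet(np.ones(k), size=n_docs)        # pi_{d,j}
    history = []
    for _ in range(iters):
        # E-step: who generated each (document, word) pair?
        joint = pi[:, :, None] * theta[None, :, :]      # (docs, topics, words)
        topic_part = joint.sum(axis=1)                  # (docs, words)
        mix = lambda_b * background + (1 - lambda_b) * topic_part
        history.append(float((counts * np.log(mix)).sum()))
        p_not_bg = (1 - lambda_b) * topic_part / mix    # p(word not from background)
        p_z = joint / np.maximum(topic_part, 1e-300)[:, None, :]  # p(topic j | not background)
        # M-step: re-estimate from expected counts
        weighted = (counts * p_not_bg)[:, None, :] * p_z
        pi = weighted.sum(axis=2)
        pi /= pi.sum(axis=1, keepdims=True)
        theta = weighted.sum(axis=0)
        theta /= theta.sum(axis=1, keepdims=True)
    return theta, pi, history

# Toy collection: 4 documents over a 4-word vocabulary (made-up counts).
counts = np.array([[5, 4, 0, 1], [4, 5, 1, 0], [0, 1, 5, 4], [1, 0, 4, 5]])
theta, pi, history = plsa_em(counts, k=2)
```

Each iteration can only keep or increase the log-likelihood, which is a convenient sanity check when implementing EM; the result is a local maximum and depends on the random initialization.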


Problems with PLSA

– “Documents have no generative probabilistic semantics”

• i.e., a document is just a symbol

– The model has many parameters

• linear in the number of documents

• heuristic methods are needed to prevent overfitting

– Cannot generalize to new documents


Latent Dirichlet Allocation [Blei et al. 03]


Basic Idea of LDA

• Adding a Dirichlet Prior α on topic distribution in documents

• Adding a Dirichlet Prior β on word distribution in topics

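• α, β can be vectors, but for convenience, α = α1 = α2 = …; β = β1 = β2 = … (Smoothed LDA)

[Figure: the PLSA generation diagram (document, topic coverages d1, d2, …, dk, topics 1, 2, …, k, word w) with a Dirichlet prior β attached to the topic word distributions.]

[Blei et al. 03], [Griffiths&Steyvers 02, 03, 04]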


Dirichlet Hyperparameters α, β

• Generally have a smoothing effect on multinomial parameters

• Large α, β : more smoothed topic/word distribution

• Small α, β: more skewed topic/word distribution (e.g. bias towards a few words for each topic)

• Common settings: α=50/K, β=0.01

• PLSA is maximum a posteriori estimated LDA when using uniform prior: α=1, β=1
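The smoothing effect can be checked empirically. The sketch below draws from a symmetric Dirichlet with a small and a large concentration value and compares how much mass the largest component receives on average; K = 10 and the two values 0.1 and 10.0 are arbitrary choices for illustration, not settings from the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10  # number of topics (hypothetical)

def mean_largest_share(alpha, n_samples=2000):
    """Average size of the largest component of a symmetric Dirichlet(alpha) draw."""
    draws = rng.dirichlet(np.full(K, alpha), size=n_samples)
    return float(draws.max(axis=1).mean())

skewed = mean_largest_share(0.1)   # small alpha: mass piles onto a few topics
smooth = mean_largest_share(10.0)  # large alpha: mass spreads almost evenly
```

With the small value a typical draw is dominated by one or two topics, while with the large value every draw stays close to the uniform distribution over the K topics.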


Inference

• Exact inference is intractable

• Approximation techniques:

– Mean field variational methods (Blei et al., 2001, 2003)

– Expectation propagation (Minka and Lafferty, 2002)

– Collapsed Gibbs sampling (Griffiths and Steyvers, 2002)

– Collapsed variational inference (Teh et al., 2006)
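As one concrete instance of the techniques listed above, here is a minimal sketch of collapsed Gibbs sampling for LDA with symmetric priors. It is a teaching sketch rather than a tuned implementation: the function name, the default hyperparameters, and the toy corpus of word ids are assumptions.

```python
import numpy as np

def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA.
    docs: list of lists of word ids. Returns (doc-topic counts, topic-word counts)."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), n_topics))   # topic counts per document
    nkw = np.zeros((n_topics, vocab_size))  # word counts per topic
    nk = np.zeros(n_topics)                 # total words per topic
    z = [rng.integers(n_topics, size=len(d)) for d in docs]  # random initial topics
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1
            nkw[k, w] += 1
            nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the current assignment of this token from the counts.
                ndk[d, k] -= 1
                nkw[k, w] -= 1
                nk[k] -= 1
                # p(z = k | all other assignments), up to a constant.
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1
                nkw[k, w] += 1
                nk[k] += 1
    return ndk, nkw

# Toy corpus: word ids 0-1 co-occur, and word ids 2-3 co-occur.
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 1, 0]]
ndk, nkw = lda_gibbs(docs, n_topics=2, vocab_size=4, iters=20)
```

The topic distributions are "collapsed" (integrated out), so the sampler only tracks counts; estimates of the topic word distributions can be read off afterwards as (nkw + beta), normalized per topic.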


Would like to know more?

• “Parameter estimation for text analysis” by Gregor Heinrich

• “Probabilistic topic models” by Mark Steyvers


Opinion Mining

Hyun Duk Kim


Agenda

• Overview

• Opinion finding & sentiment classification

• Opinion Summarization

• Other works

• Discussion & Conclusion


Web 2.0

“ Web 2.0 is the business revolution in the computer industry caused by the move to the Internet as a platform, and an attempt to understand the rules for success on that new platform.” [Wikipedia]

• Users participate in content creation, e.g., blogs, reviews, Q&A forums


Opinion Mining

• Huge volume of opinions on the Web

– e.g., product reviews, blog posts about political issues

• Need a good technique to summarize them

• Example of a commercial system: MS Live Search


Usefulness of opinion mining

• Individuals

– Purchasing a product or service

– Tracking political topics

– Other decision-making tasks

• Businesses and organizations

– Product and service benchmarking

– Survey on a topic

• Ad placement

– Place an ad when one praises a product

– Place an ad from a competitor if one criticizes a product

[Kavita Ganesan & Hyun Duk Kim, Opinion Mining: A Short Tutorial, 2008]


Subtasks

• Opinion finding & sentiment classification

– Opinion finding: whether the target text is opinion or fact

– Sentiment classification: whether the opinion is positive or negative (in detail, ‘positive/negative/mixed’)

– Methods: lexicon-based method, machine learning

• Opinion summarization

– How to show opinion finding/classification results effectively

– Methods

    • Basic statistics

    • Feature-level summary [Hu & Liu, KDD'04 / Hu & Liu, AAAI'04]

    • Summary paragraph generation [Kim et al, TAC'08]

    • Probabilistic analysis [Mei et al, WWW'07]

• Other works


Opinion Finding

• Lexicon-based method

– Prepare an opinion word list

    • e.g., words: ‘good’, ‘bad’ / phrases: ‘I think’, ‘In my opinion’

– Check parts of speech that typically express opinions

    • e.g., adjectives: ‘excellent’, ‘horrible’ / verbs: ‘like’, ‘hate’

– Decide based on the occurrences of those words

– Lexicon sources

    • Manually classified word lists

    • WordNet

    • External sources: Wikipedia (objective), review data (subjective)

• Machine learning

– Train with tagged examples

– Main features

    • Opinion lexicons

    • Part-of-speech tags, punctuation (e.g., !), modifiers (e.g., not, very)

    • Word tokens, dependency
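A minimal sketch of the lexicon-based variant. The word and phrase lists reuse the slide's examples and are far too small for real use; the threshold `min_hits` and the function name are assumptions.

```python
import re

# Tiny illustrative lexicon built from the slide's examples (hypothetical;
# real systems use much larger, curated lists).
OPINION_WORDS = {"good", "bad", "excellent", "horrible", "like", "hate"}
OPINION_PHRASES = ("i think", "in my opinion")

def is_opinion(sentence, min_hits=1):
    """Flag a sentence as opinionated when enough lexicon entries occur in it."""
    text = sentence.lower()
    tokens = re.findall(r"[a-z']+", text)
    hits = sum(t in OPINION_WORDS for t in tokens)
    hits += sum(p in text for p in OPINION_PHRASES)
    return hits >= min_hits

opinionated = is_opinion("I think the screen is excellent")
factual = is_opinion("The player weighs 140 grams")
```

The machine-learning variant would use the same signals (lexicon hits, part-of-speech tags, punctuation, modifiers) as features of a classifier trained on tagged examples instead of a fixed threshold.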


Opinion Sentiment Classification

• Method

– Similar to opinion finding: lexicon-based method or machine learning

– Instead of ‘opinionated’ words/examples, use ‘positive’ and ‘negative’ words/examples

• If positive or negative words dominate -> positive or negative

• If both positive and negative words occur in substantial numbers -> mixed
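The dominance rule can be sketched as follows. The two word lists and the `dominance` ratio of 2 are assumptions chosen for illustration; the slide does not say how "dominant" is measured.

```python
import re

# Hypothetical toy polarity lexicons.
POSITIVE = {"good", "excellent", "like", "great"}
NEGATIVE = {"bad", "horrible", "hate", "poor"}

def classify_sentiment(text, dominance=2.0):
    """Return 'positive', 'negative', 'mixed', or 'none' from lexicon hit counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos == 0 and neg == 0:
        return "none"          # no polarity words at all
    if pos >= dominance * neg:
        return "positive"      # positive words clearly dominate
    if neg >= dominance * pos:
        return "negative"      # negative words clearly dominate
    return "mixed"             # both polarities are well represented

labels = [classify_sentiment(t) for t in (
    "excellent design, great screen",
    "horrible battery",
    "good screen but bad battery",
)]
```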


Opinion Sentiment Classification

• Query-dependent sentiment classification [Lee et al, TREC '08 / Jia et al, TREC '08]

• Motivation: sentiments are expressed differently for different queries

– e.g., ‘small’ can be good for iPod size, but bad for LCD monitor size

• Key ideas

– Use external web sources to obtain positive and negative opinionated lexicons

– Objective words: Wikipedia, product specification part of Amazon.com

– Subjective words: reviews from Amazon.com, Rateitall.com and Epinions.com

    • Reviews rated 4 or 5 out of 5: positive words

    • Reviews rated 1 or 2 out of 5: negative words

• Top ranked in the Text Retrieval Conference (TREC)

[Kavita Ganesan & Hyun Duk Kim, Opinion Mining: A Short Tutorial, 2008]
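The rating-based lexicon idea can be sketched as below. This is not the TREC systems' actual procedure: the selection rule (`min_ratio` with add-one smoothing), the function name, and the toy reviews are all assumptions made for illustration.

```python
import re
from collections import Counter

def build_polarity_lexicon(reviews, min_ratio=2.0):
    """reviews: iterable of (text, rating) with rating in 1..5.
    Words seen mostly in 4-5 star reviews become positive, words seen mostly
    in 1-2 star reviews become negative; 3-star reviews are ignored."""
    pos, neg = Counter(), Counter()
    for text, rating in reviews:
        tokens = set(re.findall(r"[a-z']+", text.lower()))  # count each word once per review
        if rating >= 4:
            pos.update(tokens)
        elif rating <= 2:
            neg.update(tokens)
    positive = {w for w in pos if pos[w] >= min_ratio * (neg[w] + 1)}
    negative = {w for w in neg if neg[w] >= min_ratio * (pos[w] + 1)}
    return positive, negative

# Made-up reviews for two different queries.
ipod_reviews = [("small and light", 5), ("small enough for my pocket", 4),
                ("battery died", 1)]
monitor_reviews = [("too small to read", 1), ("small and dim", 2),
                   ("sharp picture", 5)]
ipod_pos, ipod_neg = build_polarity_lexicon(ipod_reviews)
monitor_pos, monitor_neg = build_polarity_lexicon(monitor_reviews)
```

Building one lexicon per query (or product type) is what lets the same word, here "small", end up positive for one target and negative for another.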


Opinion Summarization

• Basic statistics

– Show the number of opinions of each polarity

– e.g., opinions about iPod:

Positive    Negative

80%         20%


Opinion Summarization (cont.)

• Feature-based summary [Hu & Liu, KDD '04 / Hu & Liu, AAAI '04]

– Find lower-level features and analyze the opinions about each of them

– Feature extraction

    • Usually nouns / noun phrases

    • Frequent feature identification: association mining

    • Feature pruning and infrequent feature identification based on heuristic rules

– Sentiment summary for each feature

– e.g., opinions about iPod:

Feature         Pos    Neg

Battery life    50%    50%

Design          95%    5%

Price           30%    70%
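Once each opinion sentence has been assigned a feature and a polarity, the summary table is a simple aggregation. The sketch below assumes that this assignment has already been done upstream; the input tuples are made up and the function name is hypothetical.

```python
from collections import defaultdict

def summarize_by_feature(opinions):
    """opinions: iterable of (feature, polarity) with polarity 'pos' or 'neg'.
    Returns {feature: (percent positive, percent negative)}."""
    counts = defaultdict(lambda: {"pos": 0, "neg": 0})
    for feature, polarity in opinions:
        counts[feature][polarity] += 1
    summary = {}
    for feature, c in counts.items():
        total = c["pos"] + c["neg"]
        summary[feature] = (100 * c["pos"] / total, 100 * c["neg"] / total)
    return summary

# Made-up classified opinions.
opinions = [("battery life", "pos"), ("battery life", "neg"), ("design", "pos")]
summary = summarize_by_feature(opinions)
```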


Opinion Summarization (cont.)

• Summary paragraph generation [Kim et al, TAC '08]

– General NLP summarization techniques

    • Sentence-extraction-based summary

– Opinion filtering

    • Show opinionated sentences

    • Show sentences having the same polarity as the goal of the summary

– Opinion ordering

    • Paragraph division by opinion polarity

    • e.g., [Paragraph1] … Following are positive opinions… Following are negative opinions… [Paragraph2] … Following are mixed opinions… …

Opinion Summarization (cont.)

• Probabilistic analysis

– Topic sentiment mixture model [Mei et al, WWW '07]

– Topic modeling with opinion priors

Figure. The generation process of the topic-sentiment mixture model


Other works

• Comparative analysis

– Focus on texts containing contradiction or comparison

– Finding comparative sentences [Jindal & Liu, SIGIR '06]

    • Comparison indicators such as ‘than’ or ‘as well as’

    • e.g., ‘Ipod’ is better than ‘Zune’.

    • Sequential patterns indicating comparative sentences, e.g., ⟨{NN}{VBZ}{RB}{moreJJR}{NN}{IN}{NN}⟩ → comparative

– Finding the preferred entity [Ganapathibhotla & Liu, COLING '08]

    • Rule-based approach

    • Context-dependent orientation finding using Pros and Cons reviews
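A minimal sketch of the indicator step only, using the two indicators named on the slide. It is a candidate filter, not the method of [Jindal & Liu, SIGIR '06]: sentences it accepts would still need to be checked, for example against learned sequential patterns over part-of-speech tags. The function name is hypothetical.

```python
import re

# Indicators taken from the slide; a real keyword list would be much longer.
INDICATOR = re.compile(r"\b(than|as well as)\b", re.IGNORECASE)

def is_comparative_candidate(sentence):
    """True if the sentence contains a comparison indicator."""
    return bool(INDICATOR.search(sentence))

hit = is_comparative_candidate("Ipod is better than Zune.")
miss = is_comparative_candidate("The Zune has a big screen.")
```

Keyword matching of this kind favors recall over precision ("as well as" often just means "and"), which is why a second, pattern-based step is needed.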


Other works

• Opinion integration [Lu & Zhai, WWW '08]

– Integrate expert reviews with an arbitrary text collection

    • Expert reviews: well structured, easy to find features, not often updated

    • Arbitrary text: unstructured, varied, frequently updated

– Semi-supervised topic model

    • Extract structured aspects (features) from the expert review to cluster general documents

    • Add supplementary opinions from general documents


Challenges in opinion mining

• Polarity terms are context sensitive

– e.g., ‘small’ can be good for iPod size, but bad for LCD monitor size

– Even within the same domain, different words are used depending on the target feature

    • e.g., long iPod battery life vs. long iPod loading time

– Partially solved (query-dependent sentiment classification)

• Implicit and complex opinion expressions

– Rhetorical expressions, metaphor, double negation

– e.g., ‘The food was like a stone’

– Need both good IR and NLP techniques for opinion mining

• Cannot divide into pos/neg clearly

– Not all opinions can be classified into two categories

– Interpretation can change depending on conditions

    • e.g., 1) The battery life is ‘long’ if you do not use LCD a lot (pos)

    • 2) The battery life is ‘short’ if you use LCD a lot (neg)

    • Current systems classify the first as positive and the second as negative, although both state the same fact

[Kavita Ganesan & Hyun Duk Kim, Opinion Mining: A Short Tutorial, 2008]

Discussion

• A difficult task

• Essential for many blog or review mining techniques

• Current stage of opinion finding

– Good performance at the sentence level, in specific domains, and on sub-problems

– Still low accuracy in the general case

    • MAP scores of the top-performing TREC '08 systems: opinion finding 0.4569, polarity finding 0.2297~0.2723

• A lot of room for improvement!

References

I. Ounis, C. Macdonald and I. Soboroff. Overview of the TREC 2008 Blog Track. TREC, 2008.

Opinion Mining and Summarization: Sentiment Analysis. Tutorial given at WWW-2008, April 21, 2008 in Beijing, China.

Qiaozhu Mei, Xu Ling, Matthew Wondra, Hang Su and ChengXiang Zhai. Topic Sentiment Mixture: Modeling Facets and Opinions in Weblogs. Proceedings of the 16th International World Wide Web Conference (WWW '07), pages 171-180, 2007.

Minqing Hu and Bing Liu. Mining and Summarizing Customer Reviews. To appear in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-2004, full paper), Seattle, Washington, USA, Aug 22-25, 2004.

Minqing Hu and Bing Liu. Mining Opinion Features in Customer Reviews. To appear in Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI-2004), San Jose, USA, July 2004.

Yue Lu and ChengXiang Zhai. Opinion Integration Through Semi-supervised Topic Modeling. Proceedings of the 17th International World Wide Web Conference (WWW '08).

Kavita Ganesan and Hyun Duk Kim. Opinion Mining: A Short Tutorial, 2008.

Hyun Duk Kim, Dae Hoon Park, V.G. Vinod Vydiswaran and ChengXiang Zhai. Opinion Summarization Using Entity Features and Probabilistic Sentence Coherence Optimization: UIUC at TAC 2008 Opinion Summarization Pilot. Text Analysis Conference (TAC), Maryland, USA.

Y. Lee, S.-H. Na, J. Kim, S.-H. Nam, H.-Y. Jung and J.-H. Lee. KLE at TREC 2008 Blog Track: Blog Post and Feed Retrieval. TREC, 2008.

L. Jia, C. Yu and W. Zhang. UIC at TREC 2008 Blog Track. TREC, 2008.

Nitin Jindal and Bing Liu. Identifying Comparative Sentences in Text Documents. To appear in Proceedings of the 29th Annual International ACM SIGIR Conference on Research & Development on Information Retrieval (SIGIR-06), Seattle, 2006.

Opinion Mining and Summarization (including review spam detection). Tutorial given at WWW-2008, April 21, 2008 in Beijing, China.

Murthy Ganapathibhotla and Bing Liu. Mining Opinions in Comparative Sentences. Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 241–248, Manchester, August 2008.

Thank you


Mining User Query Logs for Personalized Search

Yuanhua Lv

(Some slides are taken from Xuehua Shen, Bin Tan, and ChengXiang Zhai’s presentation)


Problem of Current Search Engines

[Figure: search results for the ambiguous query "Jaguar" mix several senses: car, Apple software, animal, chemistry software.]

Suppose we know:

1. Short-term query logs: previous query = “racing cars”. [Shen et al. 05]

2. Long-term query logs: “car” occurs far more frequently than “Apple” in the user’s query logs of the recent 2 months. [Tan et al. 06]


Problem Definition

• User queries: Q1, Q2, …, Qk (e.g., Apple software)

• User clickthrough: C1 = {C1,1, C1,2, C1,3, …}, C2 = {C2,1, C2,2, C2,3, …}, … (e.g., "Apple - Mac OS X: The Apple Mac OS X product page. Describes features in the current version of Mac OS X, a screenshot gallery, latest software downloads, and a directory of ...")

• User information need: ?

How to model and mine user query logs?


Retrieval Model

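Basis: Unigram language model + KL divergence

• The current query Qk gives a query model θQk and each document D gives a document model θD; results are ranked by the similarity measure D(θQk || θD)

• Without query logs: $p(w \mid \theta_{Q_k}) = p(w \mid Q_k)$

• Mining the query logs of user U to update the query model: $p(w \mid \theta'_{Q_k}) = p(w \mid Q_k, Q_1, \ldots, Q_{k-1}, C_1, \ldots, C_{k-1})$, with results ranked by D(θ'Qk || θD)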
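A small sketch of ranking by KL divergence between a query model and document models. Additive smoothing with a constant `eps` is an assumption made to keep the example short (retrieval systems typically use collection-based smoothing), and the function names and toy texts are illustrative.

```python
import math
from collections import Counter

def language_model(tokens, vocab, eps=0.01):
    """Additively smoothed unigram model over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + eps * len(vocab)
    return {w: (counts[w] + eps) / total for w in vocab}

def kl_divergence(p, q):
    """D(p || q) for two distributions over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)

def rank(query, docs):
    """Return docs ordered from smallest to largest D(theta_query || theta_doc)."""
    vocab = set(query.split()).union(*(d.split() for d in docs))
    theta_q = language_model(query.split(), vocab)
    scored = [(kl_divergence(theta_q, language_model(d.split(), vocab)), d)
              for d in docs]
    return [d for _, d in sorted(scored)]

ranking = rank("jaguar car", ["jaguar big cat", "jaguar car racing"])
```

Personalization leaves this ranking function untouched: mining the query logs only changes the query model that is passed in.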

Mining Short-term User Query Logs [Shen et al. 05]

[Figure: the current query Qk, the previous queries Q1, …, Qk-1 (query history HQ) and the previous clickthrough C1, …, Ck-1 (clickthrough history HC) are combined into the updated query model θk.]

• Average the user's previous queries: $p(w \mid H_Q) = \frac{1}{k-1} \sum_{i=1}^{k-1} p(w \mid Q_i)$

• Average the user's previous clickthrough: $p(w \mid H_C) = \frac{1}{k-1} \sum_{i=1}^{k-1} p(w \mid C_i)$

• Combine previous clickthrough and previous queries: $p(w \mid H) = \beta\, p(w \mid H_C) + (1-\beta)\, p(w \mid H_Q)$

• Linearly interpolate the current query and the history model: $p(w \mid \theta_k) = \alpha\, p(w \mid Q_k) + (1-\alpha)\, p(w \mid H)$
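The averaging and the two interpolation steps can be sketched directly. Maximum-likelihood word models, whitespace tokenization, and the helper names are assumptions for illustration; the sketch also assumes at least one previous query and one previous clickthrough text (k ≥ 2).

```python
from collections import Counter

def mle(text):
    """Maximum-likelihood unigram model of a text."""
    tokens = text.lower().split()
    return {w: c / len(tokens) for w, c in Counter(tokens).items()}

def average(models):
    """Uniform average of several unigram models."""
    out = Counter()
    for m in models:
        for w, p in m.items():
            out[w] += p / len(models)
    return dict(out)

def interpolate(a, b, weight):
    """weight * a + (1 - weight) * b, word by word."""
    return {w: weight * a.get(w, 0.0) + (1 - weight) * b.get(w, 0.0)
            for w in set(a) | set(b)}

def fixed_interpolation(current_query, past_queries, past_clicks, alpha, beta):
    h_q = average([mle(q) for q in past_queries])  # p(w | H_Q)
    h_c = average([mle(c) for c in past_clicks])   # p(w | H_C)
    history = interpolate(h_c, h_q, beta)          # p(w | H)
    return interpolate(mle(current_query), history, alpha)  # p(w | theta_k)

model = fixed_interpolation("jaguar", ["racing cars"],
                            ["jaguar car perfect for racing"],
                            alpha=0.5, beta=0.5)
```

In this toy run the history contributes "racing", "cars" and "car" to the model of the ambiguous query "jaguar", which is exactly the disambiguation effect the slides motivate.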


Four Heuristic Variants

• FixInt: fixed coefficient interpolation

• BayesInt: adapt the interpolation coefficient to the query length

– Intuition: if the current query Qk is longer, we should trust Qk more

• OnlineUp: assign more weight to more recent records

• BatchUp: the user becomes better and better at query formulation as time goes on, but we do not need to “decay” the clickthrough
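One way to realize the BayesInt intuition is to treat the history models as pseudo-counts added to the current query's word counts, so that a longer query automatically outweighs the history. The sketch below assumes this Dirichlet-prior style update; the parameter names `mu` and `nu`, their default values, and the toy models are assumptions and should be checked against [Shen et al. 05].

```python
def bayes_interpolation(query_counts, h_q, h_c, mu=0.2, nu=5.0):
    """Query model with history as pseudo-counts:
    (c(w, Q_k) + mu * p(w|H_Q) + nu * p(w|H_C)) / (|Q_k| + mu + nu)."""
    length = sum(query_counts.values())
    vocab = set(query_counts) | set(h_q) | set(h_c)
    return {w: (query_counts.get(w, 0) + mu * h_q.get(w, 0.0) + nu * h_c.get(w, 0.0))
               / (length + mu + nu)
            for w in vocab}

def query_weight(length, mu=0.2, nu=5.0):
    """Effective interpolation weight given to the current query."""
    return length / (length + mu + nu)

# Hypothetical history models.
h_q = {"racing": 0.5, "cars": 0.5}
h_c = {"jaguar": 0.2, "car": 0.8}
short = bayes_interpolation({"jaguar": 1}, h_q, h_c)
long_ = bayes_interpolation({"jaguar": 1, "dealer": 1, "champaign": 1}, h_q, h_c)
```

Compared with FixInt, the weight of the current query is no longer a constant: it grows with the query length, which matches the intuition stated in the bullet above.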

Data Set of Evaluation

• Data collection: TREC AP88-90

• Topics: 30 hard topics of TREC topics 1-150

• System: search engine + RDBMS

• Context: query and clickthrough history of 3 participants

Overall Effect of Search Context

Query       FixInt            BayesInt          OnlineUp          BatchUp
            (α=0.1, β=1.0)    (μ=0.2, ν=5.0)    (μ=5.0, ν=15.0)   (μ=2.0, ν=15.0)
            MAP     pr@20     MAP     pr@20     MAP     pr@20     MAP     pr@20
Q3          0.0421  0.1483    0.0421  0.1483    0.0421  0.1483    0.0421  0.1483
Q3+HQ+HC    0.0726  0.1967    0.0816  0.2067    0.0706  0.1783    0.0810  0.2067
Improve     72.4%   32.6%     93.8%   39.4%     67.7%   20.2%     92.4%   39.4%
Q4          0.0536  0.1933    0.0536  0.1933    0.0536  0.1933    0.0536  0.1933
Q4+HQ+HC    0.0891  0.2233    0.0955  0.2317    0.0792  0.2067    0.0950  0.2250
Improve     66.2%   15.5%     78.2%   19.9%     47.8%   6.9%      77.2%   16.4%

• Short-term query log helps system improve retrieval accuracy

• BayesInt better than FixInt; BatchUp better than OnlineUp
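The "Improve" rows are relative gains over the query-only baseline. A quick check of two MAP entries, using values copied from the table (the helper name is hypothetical):

```python
def relative_improvement(baseline, with_history):
    """Relative gain over the baseline, in percent."""
    return 100 * (with_history - baseline) / baseline

q3_fixint = round(relative_improvement(0.0421, 0.0726), 1)    # Q3, FixInt, MAP
q4_bayesint = round(relative_improvement(0.0536, 0.0955), 1)  # Q4, BayesInt, MAP
```

These reproduce the 72.4% and 78.2% entries of the table.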

Mining Long-term User Query Log [Tan et al. 06]

• Can we mine long-term user query log similarly?

• Challenge: long-term query log is noisy

– How do we handle the noise?

– Can we still improve performance?

• Solution:

– Assign weights to the query log data (EM algorithm)

Hierarchical History Models

[Figure: each past search unit Sk, built from qk, Dk and Ck for k = 1, …, t-1, yields a unit history model θSk; the current unit St consists of qt and Dt. The unit models are combined into θH, which is merged with θq into θq,H and compared with each document model in {θd} via D(θq,H || θd).]

• Unit history model: θSk ← qkDkCk

• Overall history model: θH = Σ wk θSk (wk: weights for query log units)

• Original query model: θq

• Contextual query model: θq,H

• Document model: θd

Discriminative Weighting with Mixture Model

[Figure: the unit history models θS1, …, θSt-1 (from q1D1C1, …, qt-1Dt-1Ct-1), the query model θq and a background model θB are combined into a mixture model θMix with unknown weights λ1, λ2, …, λt-1, λq, λB.]

• Select {λ} to fit the data: maximize p(Dt|θMix) with the EM algorithm

• Example documents in Dt: <d1> jaguar car perfect for racing; <d2> jaguar is a big cat ...; <d3> locate jaguar dealer in champaign ...
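A sketch of how EM can fit the mixture weights when the component models are held fixed. The component models and word counts below are made up, the function name is hypothetical, and the components are assumed to be smoothed so that every word has nonzero probability under at least one of them.

```python
import numpy as np

def fit_mixture_weights(word_counts, components, iters=100):
    """word_counts: (V,) counts of words in the current result documents D_t.
    components: (m, V) fixed unigram models (unit history models, query
    model, background). Returns weights lambda (m,) fitted by EM."""
    m = components.shape[0]
    lam = np.full(m, 1.0 / m)                # start from uniform weights
    for _ in range(iters):
        joint = lam[:, None] * components    # (m, V)
        resp = joint / joint.sum(axis=0, keepdims=True)  # E-step: p(component | word)
        lam = (resp * word_counts).sum(axis=1)           # M-step: expected counts
        lam /= lam.sum()
    return lam

# Two hypothetical components over a 3-word vocabulary; the observed counts
# match the first component, so it should receive most of the weight.
components = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.2, 0.7]])
word_counts = np.array([7.0, 2.0, 1.0])
lam = fit_mixture_weights(word_counts, components)
```

History units whose language model explains the current result documents well receive large weights, while unrelated (noisy) units are pushed toward zero.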

Experimental Results

• Two query types: recurring ≫ fresh

• History information: combination ≈ clickthrough > docs > query, contextless


Summary

• Mining user query logs can personalize search results and improve retrieval performance

– Four different models to exploit short-term query logs [Shen et al. 05]

– Assign weights to the long-term query logs to reduce the effect of noise [Tan et al. 06]


Reference

• Xuehua Shen, Bin Tan, ChengXiang Zhai: Context-sensitive information retrieval using implicit feedback. SIGIR 2005: 43-50

• Bin Tan, Xuehua Shen, ChengXiang Zhai: Mining long-term search history to improve search accuracy. KDD 2006: 718-723


Thank you !

The End


Data Mining: Concepts and Techniques

— Chapter 11 — 11.8. Online Analytical Processing on Multidimensional Text Database

Duo Zhang

Department of Computer Science

University of Illinois at Urbana-Champaign

http://sifaka.cs.uiuc.edu/~dzhang22/


Online Analytical Processing on Multidimensional Text Database

Motivation

Text Cube: Computing IR Measures for Multidimensional Text Database Analysis

Topic Cube: Topic Modeling for OLAP on Multidimensional Text Databases


Motivation

• Industry and commercial applications often collect huge amounts of data containing both structured data records and unstructured text data in a multidimensional text database

• Incident reports

• Job descriptions

• Product reviews

• Service feedback

• It is highly desirable and strategically important to support high-performance search and mining over such databases


Examples

Aviation Safety Reporting System

• How to organize the data to help experts efficiently explore and digest text information?

– e.g. compare the reports in 1998 and reports in 1999?

• How to help experts analyze a specific type of anomaly in different contexts?

– e.g. what did pilots say about anomaly “landing without clearance” during daylight vs. night?

Time | Location | Environment | … | Narrative
--- | --- | --- | --- | ---
199801 | TX | Daylight | … | …I TOLD HIM I WAS AT 2000 FT AND HE SAID OK…
199801 | LA | Daylight | … | …WE STOPPED THE DSCNT AT CIRCLING MINIMUMS…
199801 | LA | Night | … | …THE TAXI/LNDG LIGHTS VERY DIM. NO OTHER VISIBLE TFC IN SIGHT…
199902 | FL | Night | … | …I FEEL WE SHOULD ALL EDUCATE OURSELVES ON CHKLISTS…


Online Analytical Processing on Multidimensional Text Database

Motivation

Text Cube: Computing IR Measures for Multidimensional Text Database Analysis

C. Lin, B. Ding, J. Han, F. Zhu, and B. Zhao (ICDE’08)

Topic Cube: Topic Modeling for OLAP on Multidimensional Text Databases


Text Cube

• Text Cube: a novel data cube model integrating the power of traditional data cube and IR techniques for effective text mining

– Computing IR measures for multidimensional text database analysis

• Heterogeneous records to be examined

– Structured categorical attributes

– Unstructured free text

• IR statistics are evaluated

– TF-IDF

– Inverted index
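As a rough illustration of the two IR statistics named on this slide, the sketch below builds an inverted index and a TF-IDF score for the documents in one cell. The three sample documents are lowercased fragments of the narratives in the example table; the function names and the particular TF-IDF variant (raw count times log(N/df)) are assumptions for illustration, since the slides do not say which weighting is used.

```python
import math
from collections import Counter, defaultdict

# Hypothetical cell contents: doc id -> narrative text (lowercased fragments).
docs = {
    1: "i told him i was at 2000 ft and he said ok",
    2: "we stopped the dscnt at circling minimums",
    3: "the taxi lndg lights very dim no other visible tfc in sight",
}

def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(text.split()).items():
            index[term][doc_id] = count
    return index

def tf_idf(index, n_docs, term, doc_id):
    """Raw term frequency times log(N / document frequency); an assumed variant."""
    postings = index.get(term, {})
    if doc_id not in postings:
        return 0.0
    return postings[doc_id] * math.log(n_docs / len(postings))

index = build_inverted_index(docs)
score = tf_idf(index, len(docs), "ft", 1)
```

The postings for a term give both its document frequency (the number of entries) and its per-document counts, so the same structure serves both statistics.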


Text Cube - Implementation

• Preprocessing: stemming, stop-word elimination, TF-IDF weighting

• Concept hierarchy construction

– A dimension hierarchy takes the form of a tree or a DAG. An attribute at a lower level reveals more details

– Four operations are supported: roll-up, drill-down, slice and dice

• Term hierarchy construction

– A term hierarchy represents semantic levels of terms in the text and their correlations

– Infusion with expert knowledge

– Two novel operations: Pull-up & Push-down
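The hierarchy operations on this slide can be sketched as follows, with term-frequency counts as the cell measure. The sketch assumes that roll-up on the Time dimension merges month cells into year cells, and that pull-up merges term counts into their parent terms. The term hierarchy, the cell keys, and the function names are hypothetical and are not taken from the Text Cube paper.

```python
from collections import Counter

# Hypothetical cells keyed by (month, environment); values are term counts.
cells = {
    ("199801", "Daylight"): Counter({"ft": 2}),
    ("199802", "Daylight"): Counter({"ft": 1, "dscnt": 1}),
    ("199902", "Night"): Counter({"lights": 3}),
}

# Hypothetical term hierarchy: child term -> parent term.
TERM_PARENT = {"ft": "altitude", "dscnt": "altitude", "lights": "visibility"}

def roll_up_time(cells):
    """Roll up the Time dimension from month (yyyymm) to year (yyyy)."""
    rolled = {}
    for (month, env), tf in cells.items():
        rolled.setdefault((month[:4], env), Counter()).update(tf)
    return rolled

def pull_up(tf, term_parent):
    """Move term counts one level up the term hierarchy."""
    lifted = Counter()
    for term, count in tf.items():
        lifted[term_parent.get(term, term)] += count
    return lifted

by_year = roll_up_time(cells)
lifted = pull_up(by_year[("1998", "Daylight")], TERM_PARENT)
```

Drill-down and push-down are the reverse moves; they need the finer-grained cells or terms to be kept, since the summed counts alone cannot be split back apart.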


Text Cube - Implementation

Partial materialization: if a non-materialized cell is retrieved, we compute it on-the-fly based on the partially materialized cuboids

A balance between time and space: given a time threshold δ, we minimize storage size subject to the query-time bound δ for retrieving all cells of interest
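A minimal sketch of the on-the-fly computation, assuming the measure is a term-frequency count (which can be summed across cells) and that only the base cells are materialized. The `materialized` store, the `"*"` wildcard for an aggregated dimension, and the function names are hypothetical. The sketch does not address the storage-minimization problem under the bound δ.

```python
from collections import Counter

# Hypothetical store of materialized base cells: (location, environment) -> term counts.
materialized = {
    ("TX", "Daylight"): Counter({"ft": 3, "told": 1}),
    ("LA", "Daylight"): Counter({"dscnt": 2, "ft": 1}),
    ("LA", "Night"): Counter({"lights": 4}),
}

def covers(query, base):
    """True if the base cell falls inside the (possibly aggregated) query cell."""
    return all(q == "*" or q == b for q, b in zip(query, base))

def get_cell(query):
    """Return a stored cell, or compute it on the fly from stored base cells."""
    if query in materialized:
        return materialized[query]
    total = Counter()
    for base, tf in materialized.items():
        # Only base cells contribute, so no document is counted twice.
        if "*" not in base and covers(query, base):
            total.update(tf)
    return total

la_all = get_cell(("LA", "*"))
daylight_all = get_cell(("*", "Daylight"))
```

Materializing more aggregated cells would shorten the loop in `get_cell` at the cost of storage, which is the trade-off the slide describes.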


Experiment – Efficiency and Effectiveness

Compare avgTF under different “Environment: Weather Elements”

Compare avgTF under different “Supplementary: Problem Areas”


Online Analytical Processing on Multidimensional Text Database

Motivation

Text Cube: Computing IR Measures for Multidimensional Text Database Analysis

Topic Cube: Topic Modeling for OLAP on Multidimensional Text Databases

D. Zhang, C. Zhai, and J. Han (SDM’09)




Solution: Topic Cube

Challenges: How to support operations along the topic dimension? How to quickly extract semantic topics?

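[Figure: a topic cube with three dimensions (Time, Location, Topic) shown at two levels of granularity, connected by drill-down and roll-up.]

Fine level: Time = 98.01, 98.02, 99.01, 99.02; Location = LAX, SJC, MIA, AUS; Topic = overshoot, undershoot, birds, turbulence

Coarse level: Time = 1998, 1999; Location = CA, FL, TX; Topic = Deviation, Encounter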


Constructing Topic Cube

Time | Loc | Env | … | Narrative
--- | --- | --- | --- | ---
98.01 | TX | Daylight | … |
98.01 | LA | Daylight | … |
98.01 | LA | Night | … |
99.02 | FL | Night | … |

Topic hierarchy:

ALL

– Anomaly Altitude Deviation: Undershoot, ……, Overshoot

– Anomaly Maintenance Problem: Improper Documentation, Improper Maintenance

– Anomaly Inflight Encounter: Birds, Turbulence

– ……

Word distributions shown in the cube cells:

Descent 0.06, Cloud 0.03, Ft 0.01, …

Descent 0.05, System 0.02, View 0.01, …

Altitude 0.03, Ft 0.02, Climb 0.01, …

Altitude 0.04, Ft 0.03, Instruct 0.01, …

Operations along the topic hierarchy: drill-down, roll-up


Materialization

[Figure: materialization grid. Rows follow the standard dimension (Location); columns follow the topic dimension (Anomaly Event). Neighbouring cells in a row are linked by M_topic-agg, and neighbouring cells in a column by M_std-agg.]

C_LAX-overshoot | C_LAX-altitude | C_LAX-all
--- | --- | ---
C_CA-overshoot | C_CA-altitude | C_CA-all
C_US-overshoot | C_US-altitude | C_US-all

M_topic-agg:

\[
p^{(0)}_{c}\bigl(w \mid \theta^{(L)}_{i}\bigr)
= \frac{\sum_{d} \sum_{j' \in \{s^{(L+1)}_{i}, \dots, e^{(L+1)}_{i}\}} c(w,d)\, p(z_{d,w} = j')}
       {\sum_{w'} \sum_{d} \sum_{j' \in \{s^{(L+1)}_{i}, \dots, e^{(L+1)}_{i}\}} c(w',d)\, p(z_{d,w'} = j')}
\]

M_std-agg:

\[
p^{(0)}_{c_a}\bigl(w \mid \theta^{(L)}_{j}\bigr)
= \frac{\sum_{c_i} \sum_{d \in D_{c_i}} c(w,d)\, p(z_{d,w} = j)}
       {\sum_{w'} \sum_{c_i} \sum_{d \in D_{c_i}} c(w',d)\, p(z_{d,w'} = j)}
\]
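The two aggregation strategies on this slide can be sketched in code under one reading of the formulas: the starting word distribution p(0)(w|θ) of a cell is the normalized sum of expected counts c(w,d)·p(z_{d,w}=j) taken from cells that are already materialized. M_std-agg sums over the documents of the sub-cells for one topic j, and M_topic-agg sums over the child topics of topic i. The data layout, the topic names, and the function names below are assumptions for illustration, not the authors' implementation.

```python
from collections import defaultdict

def _normalize(acc):
    """Turn accumulated expected counts into a probability distribution."""
    total = sum(acc.values())
    return {w: v / total for w, v in acc.items()} if total else {}

def std_agg(sub_cells, topic):
    """Aggregate along a standard dimension for one topic.

    sub_cells: list of cells; each cell is a list of (counts, post) documents,
    where counts[w] = c(w, d) and post[w][j] = p(z_{d,w} = j).
    """
    acc = defaultdict(float)
    for docs in sub_cells:
        for counts, post in docs:
            for w, c in counts.items():
                acc[w] += c * post[w].get(topic, 0.0)
    return _normalize(acc)

def topic_agg(docs, child_topics):
    """Aggregate along the topic dimension by summing over child topics."""
    acc = defaultdict(float)
    for counts, post in docs:
        for w, c in counts.items():
            acc[w] += c * sum(post[w].get(j, 0.0) for j in child_topics)
    return _normalize(acc)

# Hypothetical document: word counts and per-word topic posteriors.
doc = (
    {"descent": 2, "ft": 1},
    {"descent": {"overshoot": 0.5, "undershoot": 0.5}, "ft": {"overshoot": 1.0}},
)
parent = topic_agg([doc], ["overshoot", "undershoot"])
merged = std_agg([[doc]], "overshoot")
```

The result is only a starting point for estimation (hence the superscript (0)); the sketch stops at that initialization step.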


Experimental Results

Topic “landing without clearance”: word distribution p(w|θ) by context

daylight: Tower 0.075, Pattern 0.061, Final 0.060, Runway 0.053, Land 0.052, Downwind 0.039

night: Tower 0.035, Runway 0.029, Light 0.027, Instrument Landing System 0.015, Beacon 0.014

[Figure: plots with axis labels Objective Function, Iterations, Time (sec.), Closeness to the optimum point]

…WINDS ALOFT AT PATTERN ALT OF 1000 FT MSL, WERE MUCH STRONGER AND A DIRECT XWIND. NEEDLESS TO SAY, THE PATTERNS AND LNDGS WERE DIFFICULT FOR MY STUDENT AND THERE WAS LIGHT TURB ON THE DOWNWIND…

…I LISTENED TO HWD ATIS AND FOUND THE TWR CLOSED AND AN ANNOUNCEMENT THAT THE HIGH INTENSITY LIGHTS FOR RWY 28L WERE INOP. BROADCASTING IN THE BLIND AND LOOKING FOR THE TWR BEACON AND LOW INTENSITY LIGHTS AGAINST A VERY BRIGHT BACKGROUND CLUTTER OF STREET LIGHTS, ETC…
