Intelligent Database Systems Presenter: WU, MIN-CONG Authors: Zhiyuan Liu, Wenyi Huang, Yabin Zheng and Maosong Sun 2010, ACM Automatic Keyphrase Extraction via Topic Decomposition


Intelligent Database Systems Lab

Presenter: WU, MIN-CONG

Authors: Zhiyuan Liu, Wenyi Huang, Yabin Zheng and Maosong Sun

2010, ACM

Automatic Keyphrase Extraction via Topic Decomposition


Outline

• Motivation
• Objectives
• Methodology
• Experiments
• Conclusions
• Comments


Motivation

• Existing graph-based ranking methods for keyphrase extraction compute only a single importance score for each word via a single random walk.

• The approach is motivated by the fact that both documents and words can be represented by a mixture of semantic topics.


Objectives

• Build a Topical PageRank (TPR) on the word graph to measure word importance with respect to different topics.

• Calculate the ranking scores of words and extract the top-ranked ones as keyphrases.


Methodology - Building Topic Interpreters

α, β: Dirichlet hyperparameters of LDA; inference by, e.g., Gibbs sampling.

LDA output:

• Topic-word distributions: Pr(w|z) ∈ ϕ(z) ∈ ϕ

• Document-topic distributions: Pr(z|d) ∈ θ(d) ∈ θ
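As a rough illustration of what the two LDA outputs look like, the sketch below turns Gibbs-sampling count tables into ϕ and θ using the usual smoothed-count estimates. The count values, the hyperparameter values, and the helper name `normalize_counts` are assumptions made for this example and do not come from the slides.

```python
def normalize_counts(counts, prior):
    """Turn rows of counts into smoothed probability rows.

    Each entry becomes (count + prior) / (row_sum + row_length * prior).
    """
    result = []
    for row in counts:
        total = sum(row) + len(row) * prior
        result.append([(c + prior) / total for c in row])
    return result


# Hypothetical Gibbs-sampling counts: K = 2 topics, V = 3 words, D = 2 documents.
topic_word_counts = [[8, 1, 1], [0, 5, 5]]  # n(z, w)
doc_topic_counts = [[9, 1], [2, 8]]         # n(d, z)

alpha, beta = 0.5, 0.1  # assumed hyperparameter values, for illustration only

phi = normalize_counts(topic_word_counts, beta)    # phi[z][w] = Pr(w|z)
theta = normalize_counts(doc_topic_counts, alpha)  # theta[d][z] = Pr(z|d)
```

Each row of `phi` is a distribution over words for one topic, and each row of `theta` is a distribution over topics for one document.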


Methodology - Topical PageRank for Keyphrase Extraction


Methodology - Constructing Word Graph

The document is regarded as a word sequence.

A sliding window moves over the sequence to link co-occurring words (sliding window size = 3 in the example).
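The word-graph construction can be sketched as follows. The link direction (from the first word in the window to the words that follow it) and the use of co-occurrence counts as link weights are assumptions of this sketch, as is the function name `build_word_graph`; the slide only states the window size.

```python
from collections import defaultdict


def build_word_graph(words, window=3):
    """Return {w_i: {w_j: weight}} with a link w_i -> w_j whenever w_j
    follows w_i inside a sliding window of `window` words."""
    graph = defaultdict(lambda: defaultdict(int))
    for i, source in enumerate(words):
        # The window starting at position i covers positions i .. i + window - 1.
        for target in words[i + 1:i + window]:
            if target != source:
                graph[source][target] += 1
    return {w: dict(links) for w, links in graph.items()}


tokens = ["topic", "model", "keyphrase", "extraction", "topic", "model"]
graph = build_word_graph(tokens, window=3)
```

With this toy sequence, "topic" is followed by "model" twice, so that link gets weight 2.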


Methodology - Topical PageRank (PageRank)

Define the weight of link (wi, wj) as e(wi, wj).


Methodology - Topical PageRank (PageRank)

• Out-degree of vertex

• Equal probabilities of random jump to all vertices
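The formula that these two annotations label did not survive in the transcript. As a sketch, assuming the standard weighted PageRank form with damping factor λ, link weight e(wj, wi), out-degree O(wj), and vertex set V, it can be written as:

```latex
R(w_i) = \lambda \sum_{j:\, w_j \to w_i} \frac{e(w_j, w_i)}{O(w_j)} \, R(w_j)
         + (1 - \lambda) \frac{1}{|V|},
\qquad
O(w_j) = \sum_{k:\, w_j \to w_k} e(w_j, w_k)
```

Here O(wj) is the out-degree of vertex wj, and 1/|V| is the equal probability of a random jump to any vertex.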


Methodology - Topical PageRank

Preference values, from LDA:

• pr(w|z) = pr(w, z) / pr(z): focuses on the word

• pr(z|w) = pr(w, z) / pr(w): focuses on the topic

• pr(w|z) × pr(z|w) (Cohn and Chang, 2000)
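A minimal sketch of one topic-specific random walk, assuming that TPR keeps the PageRank update but replaces the uniform random jump with the topic's preference values, i.e. R_z(wi) = λ Σ e(wj, wi)/O(wj) · R_z(wj) + (1 − λ) p_z(wi). The function name, the damping default of 0.85, the fixed iteration count, and the toy graph and preference values are all assumptions for illustration.

```python
def topical_pagerank(graph, preference, damping=0.85, iterations=100):
    """Power iteration for one topic.

    graph:      {w_j: {w_i: e(w_j, w_i)}} directed, weighted links.
    preference: {w: p_z(w)} random-jump preference for the topic; it is
                normalized here to sum to 1 and must have a positive total.
    """
    vertices = set(graph) | {t for links in graph.values() for t in links}
    total_pref = sum(preference.get(w, 0.0) for w in vertices)
    jump = {w: preference.get(w, 0.0) / total_pref for w in vertices}
    out_degree = {w: sum(graph.get(w, {}).values()) for w in vertices}

    scores = {w: 1.0 / len(vertices) for w in vertices}
    for _ in range(iterations):
        # Random-jump part, biased by the topic preference.
        new_scores = {w: (1.0 - damping) * jump[w] for w in vertices}
        # Link-following part, weighted by e(w_j, w_i) / O(w_j).
        for source, links in graph.items():
            for target, weight in links.items():
                new_scores[target] += (
                    damping * weight / out_degree[source] * scores[source]
                )
        scores = new_scores
    return scores


toy_graph = {
    "topic": {"model": 2, "keyphrase": 1},
    "model": {"keyphrase": 1, "extraction": 1},
    "keyphrase": {"extraction": 1, "topic": 1},
    "extraction": {"topic": 1, "model": 1},
}
uniform = topical_pagerank(toy_graph, {w: 1.0 for w in toy_graph})
biased = topical_pagerank(
    toy_graph,
    {"topic": 0.7, "model": 0.1, "keyphrase": 0.1, "extraction": 0.1},
)
```

With uniform preference values the walk reduces to ordinary PageRank; shifting preference toward "topic" raises its score relative to that baseline. Vertices without outgoing links are not handled specially in this sketch.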


Methodology - Extract Keyphrases Using Ranking Scores

Step 1. Annotate the document with POS tags.

Step 2. Select noun phrases.

Step 3. Compute the ranking scores of candidate keyphrases separately for each topic (PageRank → Topical PageRank).

Step 4. Integrate topic-specific rankings of candidate keyphrases into a final ranking.
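Steps 3 and 4 can be sketched as below. The slide does not give the formulas, so this assumes that a phrase's topic-specific score is the sum of its words' scores R_z(w) and that the final score weights each topic by Pr(z|d). The function name and all numbers are hypothetical.

```python
def rank_keyphrases(candidates, topic_word_scores, doc_topic_probs):
    """Rank candidate phrases by sum_z Pr(z|d) * sum_{w in phrase} R_z(w).

    candidates:        list of phrases, each a tuple of words.
    topic_word_scores: {z: {w: R_z(w)}}, one score table per topic.
    doc_topic_probs:   {z: Pr(z|d)} for the document being processed.
    """
    final = {}
    for phrase in candidates:
        total = 0.0
        for topic, word_scores in topic_word_scores.items():
            phrase_score = sum(word_scores.get(w, 0.0) for w in phrase)
            total += doc_topic_probs.get(topic, 0.0) * phrase_score
        final[phrase] = total
    return sorted(final.items(), key=lambda item: item[1], reverse=True)


candidates = [("topic", "model"), ("keyphrase", "extraction"), ("random", "walk")]
topic_word_scores = {
    0: {"topic": 0.4, "model": 0.3, "keyphrase": 0.1,
        "extraction": 0.1, "random": 0.05, "walk": 0.05},
    1: {"topic": 0.1, "model": 0.1, "keyphrase": 0.3,
        "extraction": 0.3, "random": 0.1, "walk": 0.1},
}
doc_topic_probs = {0: 0.25, 1: 0.75}
ranking = rank_keyphrases(candidates, topic_word_scores, doc_topic_probs)
```

Because the toy document leans toward topic 1, the phrase that scores well under topic 1 ends up first in `ranking`.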


Experiment - Datasets

Dataset:

Dataset     Articles   Keyphrases
NEWS        308        2488
RESEARCH    2000       19254

Topic model: build topic interpreters with LDA.

Corpus                              Web pages   Words   Topics
Wikipedia snapshot at March 2008    2122618     20000   50 to 1500


Experiment - Evaluation Metrics

However, precision, recall, and F-measure do not take the order of extracted keyphrases into account.

Larger values are better than smaller values.

The values lie between 0 and 1.
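To make the point about ordering concrete, here is a small sketch of set-based precision, recall, and F-measure for a single document. The function name and the example keyphrases are made up for illustration, and the extracted list is assumed to contain no duplicates.

```python
def precision_recall_f1(extracted, gold):
    """Set-based precision, recall, and F-measure for one document."""
    correct = len(set(extracted) & set(gold))
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = ["topic model", "keyphrase extraction"]
ranked = ["topic model", "random walk", "keyphrase extraction", "word graph"]

forward = precision_recall_f1(ranked, gold)
backward = precision_recall_f1(ranked[::-1], gold)  # same items, reversed order
```

Both calls return identical values, which is why an order-aware metric is needed in addition to these three.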


Experiment - Influence of Parameters on TPR

• Window Size W

• The Number of Topics K


Experiment - Influence of Parameters on TPR

• Damping Factor λ

• Preference Values

pr(w|z) = pr(w, z) / pr(z): focuses on the word

pr(z|w) = pr(w, z) / pr(w): focuses on the topic

Ex.: he, she


Experiment - Comparison with Baseline Methods

• TFIDF and PageRank do not use topic information.

• TPR enjoys the advantages of both LDA and TFIDF/PageRank.


Experiment - Extraction Example


Conclusions

• Experiments on two datasets show that TPR achieves better performance than the baseline methods.


Comments

• Advantages

– TPR incorporates topic information within the random walk for keyphrase extraction.

• Applications

– Automatic keyphrase extraction.