Automatic Keyphrase Extraction via Topic Decomposition
Presenter: Wu, Min-Cong
Authors: Zhiyuan Liu, Wenyi Huang, Yabin Zheng and Maosong Sun
2010, ACM
Intelligent Database Systems Lab
Outline
• Motivation
• Objectives
• Methodology
• Experiments
• Conclusions
• Comments
Motivation
• Existing graph-based ranking methods for keyphrase extraction compute only a single importance score for each word via a single random walk.
• Both documents and words can be represented by a mixture of semantic topics, which motivates ranking words per topic.
Objectives
• Build a Topical PageRank (TPR) on the word graph to measure word importance with respect to different topics.
• Then calculate the ranking scores of words and extract the top-ranked ones as keyphrases.
Methodology - Building Topic Interpreters
α, β: obtained by, e.g., Gibbs sampling
LDA output:
• Topic-word distribution: Pr(w|z) ∈ ϕ(z) ∈ ϕ
• Document-topic distribution: Pr(z|d) ∈ θ(d) ∈ θ
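
As a rough illustration of how the two LDA outputs can be read off a Gibbs sampler, the sketch below normalises count matrices into Pr(w|z) and Pr(z|d). The function name, the toy counts, and the use of symmetric scalar α and β are assumptions made for this example; they are not taken from the slides.

```python
import numpy as np

def lda_point_estimates(topic_word_counts, doc_topic_counts, alpha, beta):
    """Turn Gibbs-sampling count matrices into phi = Pr(w|z) and theta = Pr(z|d).

    topic_word_counts: K x V array, how often word w was assigned to topic z
    doc_topic_counts:  D x K array, how many tokens of document d were assigned to topic z
    alpha, beta:       symmetric Dirichlet hyperparameters (assumed to be scalars)
    """
    n_zw = np.asarray(topic_word_counts, dtype=float) + beta
    n_dz = np.asarray(doc_topic_counts, dtype=float) + alpha
    phi = n_zw / n_zw.sum(axis=1, keepdims=True)    # each topic row sums to 1
    theta = n_dz / n_dz.sum(axis=1, keepdims=True)  # each document row sums to 1
    return phi, theta

# Toy counts (made up): 2 topics, 3 words, 2 documents.
phi, theta = lda_point_estimates(
    topic_word_counts=[[8, 1, 1], [0, 5, 5]],
    doc_topic_counts=[[9, 1], [2, 8]],
    alpha=0.5,
    beta=0.1,
)
print(phi)
print(theta)
```

Smoothing with α and β keeps every probability strictly positive, which is convenient when the values are later used as jump probabilities.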
Methodology - Constructing Word Graph
• The document is regarded as a word sequence.
• Sliding window size = 3: words that co-occur within the window are linked.
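
A minimal sketch of the sliding-window construction, assuming each word links to the words that follow it inside the window and that the link weight counts such co-occurrences. The function name and the example sentence are hypothetical.

```python
from collections import defaultdict

def build_word_graph(words, window=3):
    """Return {(wi, wj): weight} for a directed co-occurrence graph.

    Assumption: a word points to every later word that is fewer than
    `window` positions away, and the weight counts how often that happens.
    """
    edges = defaultdict(float)
    for i, source in enumerate(words):
        for target in words[i + 1 : i + window]:
            if target != source:  # skip self-loops
                edges[(source, target)] += 1.0
    return dict(edges)

edges = build_word_graph(["topic", "model", "word", "graph", "topic"], window=3)
for (wi, wj), weight in sorted(edges.items()):
    print(wi, "->", wj, weight)
```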
Methodology - Topical PageRank (PageRank)
Define:
• e(wi, wj): weight of link (wi, wj)
• Out-degree of a vertex: the sum of the weights of its outgoing links
• Random jump: equal probability of jumping to any vertex
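
The pieces defined above (link weights, out-degree, uniform random jump) fit together roughly as in the following sketch of weighted PageRank. The function name, the default damping value, the fixed iteration count, and the toy graph are assumptions for illustration.

```python
def pagerank(edges, lam=0.85, iterations=100):
    """Weighted PageRank on a directed graph given as {(wi, wj): e(wi, wj)}.

    lam is the damping factor; with probability 1 - lam the walk jumps
    to a vertex chosen uniformly at random.
    Note: vertices without outgoing links simply leak their score here.
    """
    vertices = sorted({w for pair in edges for w in pair})
    out_degree = {w: 0.0 for w in vertices}
    for (wi, _), weight in edges.items():
        out_degree[wi] += weight  # sum of outgoing link weights

    scores = {w: 1.0 / len(vertices) for w in vertices}
    for _ in range(iterations):
        new_scores = {w: (1.0 - lam) / len(vertices) for w in vertices}
        for (wi, wj), weight in edges.items():
            new_scores[wj] += lam * weight / out_degree[wi] * scores[wi]
        scores = new_scores
    return scores

# Toy graph (made up): "c" receives links from both "a" and "b".
toy_edges = {("a", "b"): 1.0, ("a", "c"): 1.0, ("b", "c"): 1.0, ("c", "a"): 1.0}
scores = pagerank(toy_edges)
print(scores)
```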
Methodology - Topical PageRank
Preference values for the random jump, from LDA:
• Pr(w|z) = Pr(w, z)/Pr(z): focuses on word
• Pr(z|w) = Pr(w, z)/Pr(w): focuses on topic
• Pr(w|z) × Pr(z|w) (Cohn and Chang, 2000)
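
The sketch below shows one way the preference values could be computed from the LDA topic-word matrix and then used to bias the random jump toward a topic. The function names, the `mode` labels, the topic prior passed in as `topic_prior`, and the toy numbers are assumptions for this example; normalising the preference vector so it sums to 1 is also an assumption.

```python
import numpy as np

def preference_values(phi, topic_prior, z, mode="word"):
    """Preference vector over the vocabulary for topic z.

    phi:         K x V array with phi[z, w] = Pr(w|z), assumed strictly positive
    topic_prior: length-K array with Pr(z) (an assumed input)
    mode:        "word" -> Pr(w|z), "topic" -> Pr(z|w), "both" -> their product
    """
    phi = np.asarray(phi, dtype=float)
    prior = np.asarray(topic_prior, dtype=float)
    joint = phi * prior[:, None]                 # Pr(w, z)
    pr_z_given_w = joint[z] / joint.sum(axis=0)  # Pr(w, z) / Pr(w)
    if mode == "word":
        p = phi[z]
    elif mode == "topic":
        p = pr_z_given_w
    elif mode == "both":
        p = phi[z] * pr_z_given_w
    else:
        raise ValueError("mode must be 'word', 'topic' or 'both'")
    return p / p.sum()  # normalise so the jump probabilities sum to 1

def topical_pagerank(weights, preference, lam=0.85, iterations=100):
    """PageRank whose random jump follows `preference` instead of being uniform.

    weights[i, j] = e(w_i, w_j) for a link from word i to word j.
    """
    weights = np.asarray(weights, dtype=float)
    out_degree = weights.sum(axis=1, keepdims=True)
    transition = np.divide(weights, out_degree,
                           out=np.zeros_like(weights), where=out_degree > 0)
    scores = np.full(len(weights), 1.0 / len(weights))
    for _ in range(iterations):
        scores = lam * (transition.T @ scores) + (1.0 - lam) * preference
    return scores

# Toy setup (made up): 2 topics, 3 words, fully connected word graph.
phi = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
prior = [0.5, 0.5]
graph = np.ones((3, 3)) - np.eye(3)
scores_topic0 = topical_pagerank(graph, preference_values(phi, prior, 0, "word"))
scores_topic1 = topical_pagerank(graph, preference_values(phi, prior, 1, "word"))
print(scores_topic0, scores_topic1)
```

On this symmetric toy graph the link structure favours no word, so the topic-specific ranking is driven entirely by the preference values.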
Methodology - Extract Keyphrases Using Ranking Scores
Step 1. Annotate the document with POS tags.
Step 2. Select noun phrases as candidate keyphrases.
Step 3. Compute the ranking scores of candidate keyphrases separately for each topic (Topical PageRank in place of plain PageRank).
Step 4. Integrate the topic-specific rankings of candidate keyphrases into a final ranking.
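
Steps 3 and 4 can be sketched as follows, assuming the candidate noun phrases from Steps 1 and 2 are already available, that a phrase's topic-specific score is the sum of its words' scores, and that topics are combined by weighting with Pr(z|d). The function name and all numbers are hypothetical.

```python
def rank_keyphrases(candidates, topic_word_scores, doc_topic_dist, top_m=10):
    """Combine topic-specific word scores into one ranking of candidate phrases.

    candidates:        list of phrases, each a tuple of words (assumed given)
    topic_word_scores: one {word: score} dict per topic
    doc_topic_dist:    Pr(z|d) for the document, one value per topic
    """
    final_scores = {}
    for phrase in candidates:
        total = 0.0
        for word_scores, pr_z in zip(topic_word_scores, doc_topic_dist):
            phrase_score = sum(word_scores.get(w, 0.0) for w in phrase)  # Step 3
            total += phrase_score * pr_z                                 # Step 4
        final_scores[phrase] = total
    ranked = sorted(final_scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_m]

# Toy input (made up): two candidates, two topics, document mostly about topic 0.
ranking = rank_keyphrases(
    candidates=[("word", "graph"), ("topic", "model")],
    topic_word_scores=[
        {"topic": 0.4, "model": 0.3, "word": 0.2, "graph": 0.1},
        {"topic": 0.1, "model": 0.1, "word": 0.4, "graph": 0.4},
    ],
    doc_topic_dist=[0.9, 0.1],
)
print(ranking)
```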
Experiment - Datasets
Dataset:
Dataset     Articles   Keyphrases
NEWS        308        2488
RESEARCH    2000       19254
Topic model: build topic interpreters with LDA.
Corpus                              Web pages   Words   Topics
Wikipedia snapshot at March 2008    2122618     20000   50 to 1500
Experiment - Evaluation Metrics
• Precision, recall, and F-measure are used, but they do not take the order of extracted keyphrases into account.
• For the metrics used, larger values are better.
• Values lie between 0 and 1.
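
To illustrate why order matters, the sketch below computes precision, recall, and F-measure alongside reciprocal rank, which is used here only as one example of a rank-aware metric; the slide's own formulas did not survive extraction. Function names and the toy keyphrase lists are assumptions, and extracted lists are assumed to contain no duplicates.

```python
def precision_recall_f1(extracted, gold):
    """Set-based metrics: the order of `extracted` has no effect."""
    correct = len(set(extracted) & set(gold))
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def reciprocal_rank(extracted, gold):
    """1 / rank of the first correct keyphrase, or 0.0 if none is correct."""
    gold = set(gold)
    for rank, phrase in enumerate(extracted, start=1):
        if phrase in gold:
            return 1.0 / rank
    return 0.0

# Toy example (made up): the same phrases in two different orders.
gold = ["topic model", "keyphrase extraction"]
run_a = ["topic model", "random walk", "keyphrase extraction"]
run_b = ["random walk", "keyphrase extraction", "topic model"]
print(precision_recall_f1(run_a, gold), reciprocal_rank(run_a, gold))
print(precision_recall_f1(run_b, gold), reciprocal_rank(run_b, gold))
```

Both runs get identical precision, recall, and F-measure, while the reciprocal rank separates them. All of these values stay between 0 and 1, with larger being better.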
Experiment - Influences of Parameters on TPR
• Window size W
• Number of topics K
• Damping factor λ
• Preference values
Pr(w|z) = Pr(w, z)/Pr(z): focuses on word
Pr(z|w) = Pr(w, z)/Pr(w): focuses on topic
Ex.: he, she
Experiment - Comparing with Baseline Methods
• The TFIDF and PageRank baselines do not use topic information.
• TPR enjoys the advantages of both LDA and TFIDF/PageRank.
Conclusions
• Experiments on two datasets show that TPR achieves better performance than the baseline methods.