Automatic Keyphrase Extraction via Topic Decomposition

Intelligent Database Systems Presenter: WU, MIN-CONG Authors: Zhiyuan Liu, Wenyi Huang, Yabin Zheng and Maosong Sun 2010, ACM Automatic Keyphrase Extraction via Topic Decomposition


Intelligent Database Systems Lab

Presenter: WU, MIN-CONG

Authors: Zhiyuan Liu, Wenyi Huang, Yabin Zheng and Maosong Sun

2010, ACM

Automatic Keyphrase Extraction via Topic Decomposition


Outline

• Motivation

• Objectives

• Methodology

• Experiments

• Conclusions

• Comments


Motivation

• Existing graph-based ranking methods for keyphrase extraction compute only a single importance score for each word via a single random walk.

• Both documents and words can be represented by a mixture of semantic topics, which motivates ranking words separately for each topic.


Objectives

• Build a Topical PageRank (TPR) on the word graph to measure word importance with respect to different topics.

• Calculate the ranking scores of words and extract the top-ranked ones as keyphrases.


Methodology - Building Topic Interpreters

LDA output:

• Topic-word distributions: Pr(w|z) ∈ ϕ(z) ∈ ϕ

• Document-topic distributions: Pr(z|d) ∈ θ(d) ∈ θ

• α, β: hyperparameters of LDA; the model can be estimated by, e.g., Gibbs sampling
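To make the two LDA outputs concrete, the sketch below stores ϕ and θ(d) as plain dictionaries and combines them into Pr(w|d) = Σz Pr(w|z) Pr(z|d). The topic names, words and probabilities are invented for illustration and do not come from the paper.

```python
# Toy LDA output (hypothetical values, for illustration only).
phi = {  # Pr(w|z): one word distribution per topic
    "z1": {"graph": 0.7, "walk": 0.3},
    "z2": {"graph": 0.2, "walk": 0.8},
}
theta_d = {"z1": 0.75, "z2": 0.25}  # Pr(z|d): topic mixture of one document


def word_given_document(phi, theta_d):
    """Mix the topic-word distributions with the document's topic weights."""
    vocab = {w for dist in phi.values() for w in dist}
    return {
        w: sum(theta_d[z] * phi[z].get(w, 0.0) for z in theta_d)
        for w in vocab
    }


p_w_d = word_given_document(phi, theta_d)
```

Because every row of ϕ and θ(d) sums to 1, the resulting Pr(w|d) is again a probability distribution over the vocabulary.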

Methodology - Topical PageRank for Keyphrase Extraction

Methodology - Constructing Word Graph

• Sliding window size = 3

• The document is regarded as a word sequence
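A minimal sketch of the word-graph construction, under two assumptions that the slide does not state: links are directed from the first word of each window to the words that follow it, and the link weight is the co-occurrence count. The function name `build_word_graph` is hypothetical.

```python
from collections import defaultdict


def build_word_graph(words, window=3):
    """Return {source: {target: weight}} for a directed co-occurrence graph."""
    graph = defaultdict(lambda: defaultdict(int))
    for i, first in enumerate(words):
        # Assumption: the first word in the window points to the later words.
        for other in words[i + 1 : i + window]:
            if other != first:
                graph[first][other] += 1  # assumption: weight = co-occurrence count
    return {src: dict(targets) for src, targets in graph.items()}


edges = build_word_graph(["topic", "model", "ranks", "topic", "words"], window=3)
```

With a window of 3, each word is linked to at most the two words that follow it; the last word of the sequence therefore has no outgoing links.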


Methodology - Topical PageRank (PageRank)

Define:

• e(wi, wj): weight of link (wi, wj)

• Out-degree of vertex wi: total weight of its outgoing links

• In standard PageRank, the random jump goes to all vertices with equal probability.
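The slide's formula did not survive extraction, so the sketch below uses the usual weighted PageRank update as an assumption: R(wi) = λ Σj e(wj, wi) / O(wj) · R(wj) + (1 − λ) / |V|, where O(wj) is the out-degree of wj. The default damping factor of 0.85 is the conventional PageRank value, not necessarily the one used in the paper.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """graph: {source: {target: weight}}; returns {vertex: score}."""
    vertices = set(graph)
    for targets in graph.values():
        vertices.update(targets)
    # Out-degree: total weight of the links leaving each vertex.
    out_degree = {v: sum(graph.get(v, {}).values()) for v in vertices}
    jump = 1.0 / len(vertices)  # equal random-jump probability for every vertex
    scores = {v: jump for v in vertices}
    for _ in range(iterations):
        new_scores = {v: (1.0 - damping) * jump for v in vertices}
        for src, targets in graph.items():
            for dst, weight in targets.items():
                new_scores[dst] += damping * scores[src] * weight / out_degree[src]
        scores = new_scores
    # Note: vertices without outgoing links leak score mass in this simple sketch.
    return scores


scores = pagerank({"a": {"b": 1}, "b": {"a": 1, "c": 1}, "c": {"a": 1}})
```

Topical PageRank keeps the same propagation term but replaces the uniform jump 1/|V| with a topic-specific preference value for each word.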


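Methodology - Topical PageRank

Topic-specific preference values, from LDA:

• pr(w|z) = pr(w, z) / pr(z): how much topic z focuses on word w

• pr(z|w) = pr(w, z) / pr(w): how much word w focuses on topic z

• pr(w|z) × pr(z|w): product of the two values (Cohn and Chang, 2000).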
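The preference values can be derived from the LDA topic-word distributions with Bayes' rule. The sketch below assumes a known topic prior pr(z) and normalises the values so that they sum to 1 for each topic; the function name `preference_values`, the `kind` argument and the toy numbers are hypothetical.

```python
def preference_values(word_given_topic, topic_prior, kind="w|z"):
    """Return {topic: {word: preference}}, normalised to sum to 1 per topic."""
    prefs = {}
    for z, dist in word_given_topic.items():
        raw = {}
        for w, p_w_z in dist.items():
            # pr(w) by marginalising over topics, then pr(z|w) by Bayes' rule.
            p_w = sum(word_given_topic[t].get(w, 0.0) * topic_prior[t]
                      for t in topic_prior)
            p_z_w = p_w_z * topic_prior[z] / p_w if p_w > 0 else 0.0
            raw[w] = {"w|z": p_w_z, "z|w": p_z_w, "product": p_w_z * p_z_w}[kind]
        total = sum(raw.values())
        prefs[z] = {w: v / total for w, v in raw.items()} if total else raw
    return prefs


phi = {  # toy pr(w|z); "the" is equally likely under both topics
    "sports": {"game": 0.5, "the": 0.5},
    "politics": {"vote": 0.5, "the": 0.5},
}
prior = {"sports": 0.5, "politics": 0.5}
prefs_wz = preference_values(phi, prior, kind="w|z")
prefs_zw = preference_values(phi, prior, kind="z|w")
```

In this toy case pr(w|z) cannot tell "game" from "the" within the sports topic, while pr(z|w) prefers "game" because "the" is shared by both topics.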

Methodology - Extract Keyphrases Using Ranking Scores

Step 1. Annotate the document with POS tags.

Step 2. Select noun phrases as candidate keyphrases.

Step 3. Compute the ranking scores of candidate keyphrases separately for each topic.

Step 4. Integrate the topic-specific rankings of candidate keyphrases into a final ranking.
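Steps 3 and 4 can be sketched as follows, assuming that a phrase's score under a topic is the sum of its words' topic-specific scores and that the final score weights each topic by pr(z|d). The function name `rank_keyphrases` and the numbers are hypothetical; POS tagging and noun-phrase selection (Steps 1 and 2) are taken as already done.

```python
def rank_keyphrases(candidates, topic_word_scores, doc_topics, top_m=10):
    """Rank candidate phrases by their topic-weighted sum of word scores."""
    final = {}
    for phrase in candidates:
        words = phrase.split()
        # Step 3: per-topic phrase score = sum of the word scores under topic z.
        # Step 4: combine the per-topic scores, weighted by pr(z|d).
        final[phrase] = sum(
            p_z_d * sum(topic_word_scores[z].get(w, 0.0) for w in words)
            for z, p_z_d in doc_topics.items()
        )
    return sorted(final, key=final.get, reverse=True)[:top_m]


topic_word_scores = {  # toy topic-specific word scores
    0: {"topic": 0.5, "model": 0.3, "walk": 0.2},
    1: {"topic": 0.1, "model": 0.1, "walk": 0.8},
}
doc_topics = {0: 0.9, 1: 0.1}  # pr(z|d) for one document
ranked = rank_keyphrases(["walk", "topic model"], topic_word_scores, doc_topics)
```

Here the document is mostly about topic 0, so "topic model" outranks "walk" even though "walk" scores highest under topic 1.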

Experiment - Datasets

Dataset     Articles   Keyphrases
NEWS        308        2488
RESEARCH    2000       19254

Topic model: build topic interpreters with LDA.

Corpus                                 Web pages   Words   Topics
Wikipedia snapshot at March 2008       2122618     20000   50 to 1500

Experiment - Evaluation Metrics

• Precision, recall and F-measure do not take the order of the extracted keyphrases into account, so order-aware metrics are also reported.

• For these metrics, larger values are better.

• Values lie between 0 and 1.
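The slide's metric formulas were lost in extraction. As an illustration only, the sketch below contrasts the order-insensitive precision/recall/F-measure with reciprocal rank, one common order-aware metric; whether it matches the exact metrics on the slide is an assumption. The phrases are invented.

```python
def precision_recall_f1(extracted, gold):
    """Order-insensitive overlap between extracted and gold keyphrases."""
    correct = len(set(extracted) & set(gold))
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


def reciprocal_rank(extracted, gold):
    """1 / rank of the first correct keyphrase; 0 if none is correct."""
    gold = set(gold)
    for rank, phrase in enumerate(extracted, start=1):
        if phrase in gold:
            return 1.0 / rank
    return 0.0


gold = ["topic model", "keyphrase", "lda"]
extracted = ["random walk", "topic model", "keyphrase"]
p, r, f1 = precision_recall_f1(extracted, gold)
rr = reciprocal_rank(extracted, gold)
rr_reordered = reciprocal_rank(["topic model", "keyphrase", "random walk"], gold)
```

Reordering the extracted list leaves precision, recall and F-measure unchanged, but it changes the reciprocal rank, which is why an order-aware metric is useful alongside them. Averaging the reciprocal rank over all documents gives the mean reciprocal rank.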

Experiment - Influences of Parameters to TPR

• Window Size W

• The Number of Topics K

• Damping Factor λ

• Preference Values

– pr(w|z) = pr(w, z) / pr(z): how much topic z focuses on word w

– pr(z|w) = pr(w, z) / pr(w): how much word w focuses on topic z

Ex. he, she

Experiment - Comparing with Baseline Methods

• TFIDF and PageRank do not use topic information.

• TPR enjoys the advantages of both LDA and TFIDF/PageRank.

Experiment - Extracting Example

Conclusions

• Experiments on two datasets show that TPR achieves better performance than the baseline methods.

Comments

• Advantages

– TPR incorporates topic information within the random walk for keyphrase extraction.

• Applications

– Automatic keyphrase extraction.