AUTOMATIC KEYPHRASE EXTRACTION VIA TOPIC DECOMPOSITION

EMNLP '10: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

Reporter: Ying-Ying Chen

OUTLINE

Introduction
Building Topic Interpreters
Topical PageRank for Keyphrase Extraction
Experiments
Related Work
Conclusion

INTRODUCTION

Keyphrases are a set of terms in a document that give readers a brief summary of its content.

They are widely used in information retrieval and digital libraries. Keyphrase extraction is also an essential step in document categorization, clustering, and summarization.

Two principal approaches: supervised and unsupervised.

Supervised methods
  regard keyphrase extraction as a classification task
  require a document set with human-assigned keyphrases

INTRODUCTION

Unsupervised methods: graph-based ranking

Process:
1. Build a word graph according to word co-occurrences within the document.
2. Use random walk techniques to measure word importance.
3. Select the top-ranked words as keyphrases.

Problems: graph-based ranking alone does not ensure two properties.
  Keyphrases should be relevant to the major topics of the given document.
  Keyphrases should also have good coverage of the document’s major topics.

INTRODUCTION

To address these problems, it is intuitive to consider the topics of words and documents in the random walk for keyphrase extraction:
  decompose traditional PageRank into multiple PageRanks, each specific to one topic
  obtain the importance scores of words under different topics

We call the topic-decomposed PageRank Topical PageRank (TPR).

Moreover, TPR is unsupervised and language independent.

TPR for keyphrase extraction is a two-stage process:
1. Build a topic interpreter to acquire the topics of words and documents.
2. Perform TPR to extract keyphrases for documents.

BUILDING TOPIC INTERPRETERS

There are two methods to acquire the topic distributions of words:
  Use manually annotated knowledge bases, e.g. WordNet.
  Use unsupervised machine learning techniques to obtain word topics from a large-scale document collection, e.g. LSA (Latent Semantic Analysis), pLSA (probabilistic LSA), and LDA (Latent Dirichlet Allocation).

BUILDING TOPIC INTERPRETERS

LDA

Each word w of a document d is regarded as generated by first sampling a topic z from d’s topic distribution θ(d), and then sampling a word from the distribution over words φ(z) that characterizes topic z.

In LDA, θ(d) and φ(z) are drawn from conjugate Dirichlet priors α and β, respectively.

Therefore, θ and φ are integrated out, and the probability of word w given document d and the priors is:

$$ \mathrm{pr}(w \mid d, \alpha, \beta) = \sum_{z=1}^{K} \mathrm{pr}(w \mid z, \beta)\, \mathrm{pr}(z \mid d, \alpha) $$

where K is the number of topics.

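LDA (LATENT DIRICHLET ALLOCATION)

Dirichlet distribution
  The Dirichlet distribution is the conjugate prior of the multinomial distribution.
  If the prior is a Dirichlet distribution and the likelihood function is multinomial, then the posterior is again a Dirichlet distribution.
  P(Y|X): posterior probability; P(Y): prior probability; P(X|Y): likelihood function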
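As a small numerical illustration of this conjugacy, the sketch below updates a Dirichlet prior with multinomial counts. The function names and the example values are assumptions made for illustration; they do not come from the slides.

```python
def dirichlet_posterior(alpha, counts):
    """Posterior Dirichlet parameters after observing multinomial counts."""
    if len(alpha) != len(counts):
        raise ValueError("alpha and counts must have the same length")
    # Conjugacy: the posterior is Dirichlet(alpha + counts).
    return [a + n for a, n in zip(alpha, counts)]


def dirichlet_mean(alpha):
    """Expected value of each component under Dirichlet(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]


prior = [1.0, 1.0, 1.0]   # symmetric prior over three topics (made-up values)
counts = [5, 0, 2]        # hypothetical topic counts observed in one document
posterior = dirichlet_posterior(prior, counts)
print(posterior)                  # [6.0, 1.0, 3.0]
print(dirichlet_mean(posterior))  # [0.6, 0.1, 0.3]
```

The posterior stays in the Dirichlet family, which is why the update reduces to adding counts to the prior parameters.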

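LDA (LATENT DIRICHLET ALLOCATION)

LDA maps documents into a topic space: it assumes that a document is a random mixture of many topics, and it uses those topics to relate documents to one another.

LDA, LSA, and pLSA share the same bag-of-words premise, so none of them considers grammar or word order.

Differences between LDA and pLSA
  pLSA learns its per-document parameters only from the documents that appear in the training corpus.
  LDA assigns a probabilistic representation to documents that do not appear in the training corpus, so it needs fewer parameters.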

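LDA (LATENT DIRICHLET ALLOCATION)

LDA is a generative model: it can randomly generate observable data, that is, a document composed of multiple topics. Modeling works in reverse, fitting the generative model to a collection of texts. The generative steps are:

1. Choose N ~ Poisson(ξ), where N is the document length (number of words).
2. Choose θ ~ Dirichlet(α), where θ gives the probability of each topic and α is the parameter of the Dirichlet distribution.
3. For each of the N words:
   1. Choose a topic zn ~ Multinomial(θ); zn is the currently selected topic.
   2. Choose a word wn from p(wn|zn; β), a multinomial distribution conditioned on zn, where β is a K×V matrix with βij = P(wj=1|zi=1).

In LDA, each document has its own θ, and θ can be used to judge the similarity between documents.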
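The generative steps can be sketched in a few lines of standard-library Python. This is only an illustration under assumptions: the helper names, the tiny vocabulary, and the values of α, β, and ξ are invented, and the Poisson sampler uses Knuth's simple method rather than anything prescribed by the slides.

```python
import math
import random


def sample_poisson(rng, lam):
    """Knuth's method for a Poisson draw (adequate for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def sample_dirichlet(rng, alpha):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(rng, alpha, beta, vocab, xi):
    """Draw N, draw theta, then one topic and one word per position."""
    n_words = sample_poisson(rng, xi)               # step 1: N ~ Poisson(xi)
    theta = sample_dirichlet(rng, alpha)            # step 2: theta ~ Dirichlet(alpha)
    topics = range(len(alpha))
    words = []
    for _ in range(n_words):                        # step 3
        z = rng.choices(topics, weights=theta)[0]   # 3.1: z_n ~ Multinomial(theta)
        w = rng.choices(vocab, weights=beta[z])[0]  # 3.2: w_n ~ p(w | z_n; beta)
        words.append(w)
    return theta, words


rng = random.Random(0)
vocab = ["goal", "match", "vote", "party"]
beta = [[0.5, 0.5, 0.0, 0.0],   # hypothetical "sports" topic
        [0.0, 0.0, 0.5, 0.5]]   # hypothetical "politics" topic
theta, words = generate_document(rng, [0.5, 0.5], beta, vocab, xi=8)
print(theta, words)
```

Training LDA runs this story backwards: given only the words, it infers plausible values of θ for each document and of the topic-word matrix β.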

TOPICAL PAGERANK FOR KEYPHRASE EXTRACTION

Given a document d, the process of keyphrase extraction using TPR consists of the following four steps:
1. Construct a word graph for d according to word co-occurrences within d.
2. Perform TPR to calculate the importance scores for each word with respect to different topics.
3. Using the topic-specific importance scores of words, rank candidate keyphrases with respect to each topic separately.
4. Given the topics of document d, integrate the topic-specific rankings of candidate keyphrases into a final ranking; the top-ranked ones are selected as keyphrases.

TOPICAL PAGERANK FOR KEYPHRASE EXTRACTION

We construct a word graph according to word co-occurrences within the given document.
  Link weight: the co-occurrence count of two words within a sliding window of at most W words in the word sequence.
  Direction: when sliding a W-width window, at each position we add links from the first word to the other words within the window.
  Vertices: only adjectives and nouns are added to the word graph.
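A minimal sketch of this graph construction is shown below. It assumes the input list has already been reduced to adjectives and nouns, and it skips self-loops; both choices, like the function name, are assumptions for illustration rather than details given in the slides.

```python
from collections import defaultdict


def build_word_graph(words, window):
    """Return {(w_i, w_j): weight} for a directed co-occurrence graph.

    At each position the first word of the window links to every other
    word inside the window; the weight counts how often that happens.
    """
    edges = defaultdict(int)
    for i, head in enumerate(words):
        # The window covers positions i .. i + window - 1.
        for other in words[i + 1:i + window]:
            if other != head:  # assumption: no self-loops
                edges[(head, other)] += 1
    return dict(edges)


tokens = ["topical", "pagerank", "keyphrase", "extraction", "pagerank"]
graph = build_word_graph(tokens, window=3)
for (source, target), weight in sorted(graph.items()):
    print(source, "->", target, weight)
```

Storing edges as a dictionary keyed by (source, target) keeps the link weight e(wi, wj) directly addressable for a later random walk.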


PageRank

The basic idea of PageRank is that a vertex is important if other important vertices point to it. This can be regarded as voting or recommendation among vertices.

Notation:
  G = (V, E) is the graph of a document
  vertex set V = {w1, w2, · · · , wN}
  link set: (wi, wj) ∈ E if there is a link from wi to wj
  e(wi, wj) is the weight of link (wi, wj)
  the out-degree of vertex wi is

$$ O(w_i) = \sum_{j:\, w_i \to w_j} e(w_i, w_j) $$

The PageRank score R(wi) is

$$ R(w_i) = \lambda \sum_{j:\, w_j \to w_i} \frac{e(w_j, w_i)}{O(w_j)}\, R(w_j) + (1 - \lambda)\, \frac{1}{|V|} $$

where λ is a damping factor ranging from 0 to 1 and |V| is the number of vertices.


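Topical PageRank (TPR)

Each topic-specific PageRank prefers words with high relevance to the corresponding topic. In the PageRank of a specific topic z, we assign a topic-specific preference value pz(w) to each word w as its random jump probability, with

$$ \sum_{w \in V} p_z(w) = 1 $$

The topic-specific score of word wi is then

$$ R_z(w_i) = \lambda \sum_{j:\, w_j \to w_i} \frac{e(w_j, w_i)}{O(w_j)}\, R_z(w_j) + (1 - \lambda)\, p_z(w_i) $$

where e(wj, wi) is the link weight, O(wj) is the out-degree of wj, and λ is the damping factor.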


Topical PageRank (TPR)

We use three measures to set preference values for TPR:
  pz(w) = pr(w|z): this indicates how much topic z focuses on word w.
  pz(w) = pr(z|w): this indicates how much word w focuses on topic z.
  pz(w) = pr(w|z) * pr(z|w): this measure is inspired by the work in (Cohn and Chang, 2000).
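The three measures can be computed from a topic-word matrix as sketched below. The sketch assumes that pr(z|w) is obtained from pr(w|z) and a topic prior pr(z) via Bayes' rule and that the values are normalized to sum to one over words; the function name, the argument layout, and the toy numbers are illustrative assumptions, not details from the slides.

```python
def preference_values(phi, topic_prior, z, measure="z|w"):
    """Return normalized preference values p_z(w) for topic z.

    phi[k][w] is pr(w | topic k); topic_prior[k] is pr(topic k).
    measure is one of "w|z", "z|w", or "product".
    """
    raw = []
    for w in range(len(phi[z])):
        p_w_given_z = phi[z][w]
        # Marginal pr(w), summing over all topics.
        p_w = sum(phi[k][w] * topic_prior[k] for k in range(len(phi)))
        p_z_given_w = p_w_given_z * topic_prior[z] / p_w if p_w > 0 else 0.0
        if measure == "w|z":
            raw.append(p_w_given_z)
        elif measure == "z|w":
            raw.append(p_z_given_w)
        elif measure == "product":
            raw.append(p_w_given_z * p_z_given_w)
        else:
            raise ValueError(f"unknown measure: {measure}")
    total = sum(raw)
    if total == 0:
        return [1.0 / len(raw)] * len(raw)  # fall back to a uniform jump
    return [value / total for value in raw]


phi = [[0.7, 0.2, 0.1],   # hypothetical pr(w | topic 0)
       [0.1, 0.2, 0.7]]   # hypothetical pr(w | topic 1)
prior = [0.5, 0.5]
for measure in ("w|z", "z|w", "product"):
    print(measure, preference_values(phi, prior, z=0, measure=measure))
```

In this toy setup all three measures favor word 0 for topic 0, but they disagree on how strongly, which is the kind of difference the preference-value experiments examine.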

Termination conditions: iteration stops when
  the number of iterations reaches 100, or
  the difference of each vertex's score between two consecutive iterations is less than 0.001.
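The topic-biased random walk with these stopping rules might look like the sketch below. It assumes edges are stored as a {(source, target): weight} dictionary and preferences as a {word: value} dictionary; those data layouts and the function name are illustrative assumptions. The defaults for the iteration cap, tolerance, and damping factor follow the values quoted in these slides. Vertices without outgoing links simply pass on no score here, which is a simplification.

```python
def topical_pagerank(edges, preference, damping=0.3, max_iter=100, tol=1e-3):
    """Iterate R_z(w_i) = damping * sum_j e(w_j, w_i) / O(w_j) * R_z(w_j)
                          + (1 - damping) * p_z(w_i)
    until max_iter is reached or every score changes by less than tol."""
    vertices = sorted({v for edge in edges for v in edge})
    total_pref = sum(preference.get(v, 0.0) for v in vertices)
    if total_pref <= 0:
        raise ValueError("preference values must have positive total mass")
    # Normalize so the random-jump probabilities sum to one.
    jump = {v: preference.get(v, 0.0) / total_pref for v in vertices}

    out_weight = {v: 0.0 for v in vertices}
    for (src, _dst), weight in edges.items():
        out_weight[src] += weight

    scores = {v: 1.0 / len(vertices) for v in vertices}
    for _ in range(max_iter):
        new = {v: (1.0 - damping) * jump[v] for v in vertices}
        for (src, dst), weight in edges.items():
            new[dst] += damping * weight / out_weight[src] * scores[src]
        converged = max(abs(new[v] - scores[v]) for v in vertices) < tol
        scores = new
        if converged:
            break
    return scores


edges = {("a", "b"): 1, ("b", "a"): 1, ("a", "c"): 1, ("c", "a"): 1}
print(topical_pagerank(edges, {"a": 0.0, "b": 0.9, "c": 0.1}))
```

Passing equal preference values for every word makes the jump term uniform, so the same function then behaves like ordinary PageRank.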


Extract Keyphrases Using Ranking Scores

We select noun phrases from a document as candidate keyphrases for ranking:
1. Tokenize the document.
2. Annotate the document with part-of-speech (POS) tags.
3. Extract noun phrases matching the pattern (adjective)*(noun)+.

We regard these noun phrases as candidate keyphrases.
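The pattern (adjective)*(noun)+ can be matched with a single pass over tagged tokens, as sketched below. The sketch assumes the input is a list of (word, tag) pairs using Penn Treebank style tags, where adjectives start with JJ and nouns with NN; the tagger itself, the tag set, and the function name are assumptions, since the slides do not name them.

```python
def extract_noun_phrases(tagged_tokens):
    """Collect maximal runs matching (adjective)*(noun)+ from (word, tag) pairs."""
    phrases, current, has_noun = [], [], False
    for word, tag in tagged_tokens:
        if tag.startswith("NN"):
            current.append(word)
            has_noun = True
        elif tag.startswith("JJ"):
            if has_noun:  # an adjective after nouns starts a new phrase
                phrases.append(" ".join(current))
                current, has_noun = [], False
            current.append(word)
        else:
            if has_noun:
                phrases.append(" ".join(current))
            current, has_noun = [], False  # drop adjectives with no noun
    if has_noun:
        phrases.append(" ".join(current))
    return phrases


tagged = [("political", "JJ"), ("assassination", "NN"), ("is", "VBZ"),
          ("a", "DT"), ("serious", "JJ"), ("threat", "NN")]
print(extract_noun_phrases(tagged))  # ['political assassination', 'serious threat']
```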


Extract Keyphrases Using Ranking Scores

We rank the candidate keyphrases using the ranking scores obtained by TPR.

By considering the topic distribution of the document, we further integrate the topic-specific rankings of candidate keyphrases into a final ranking.
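One way to realize this integration is sketched below. The slides do not spell out the formulas, so the sketch assumes that a phrase's score under topic z is the sum of its words' topic-specific scores and that the topic-specific scores are combined with weights pr(z|d). The function name, the data layout, and the toy scores are likewise illustrative assumptions.

```python
def rank_keyphrases(candidates, word_scores_by_topic, doc_topic_probs, top_m=10):
    """Rank candidate phrases by sum_z pr(z | d) * sum_{w in phrase} R_z(w)."""
    final = {}
    for phrase in candidates:
        score = 0.0
        for z, p_z_given_d in enumerate(doc_topic_probs):
            word_scores = word_scores_by_topic[z]
            # Assumed phrase score under topic z: sum of its word scores.
            phrase_score = sum(word_scores.get(w, 0.0) for w in phrase.split())
            score += phrase_score * p_z_given_d
        final[phrase] = score
    ranked = sorted(final, key=final.get, reverse=True)
    return ranked[:top_m]


scores = [
    {"political": 0.3, "assassination": 0.4, "plo": 0.1, "officials": 0.1},
    {"political": 0.1, "assassination": 0.1, "plo": 0.5, "officials": 0.3},
]
candidates = ["political assassination", "plo officials"]
print(rank_keyphrases(candidates, scores, doc_topic_probs=[0.8, 0.2]))
```

Changing the document's topic distribution changes the final order, which is how the ranking reflects the document's major topics.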

EXPERIMENTS

Datasets
  One dataset was built by Wan and Xiao and used in (Wan and Xiao, 2008b). It contains 308 news articles from DUC2001 (Over et al., 2001) with 2,488 manually annotated keyphrases. There are at most 10 keyphrases for each document. In experiments we refer to this dataset as NEWS.
  The other dataset was built by Hulth and used in (Hulth, 2003). It contains 2,000 abstracts of research articles with 19,254 manually annotated keyphrases. In experiments we refer to this dataset as RESEARCH.

EXPERIMENTS

Dataset
  We use the Wikipedia snapshot of March 2008 to build topic interpreters with LDA.
  We collected 2,122,618 articles.
  We built the vocabulary by selecting 20,000 words according to their document frequency.
  We learned several models with different numbers of topics, from 50 to 1,500.

EXPERIMENTS

Influences of Parameters to TPR

Four parameters in TPR may influence the performance of keyphrase extraction:
1. window size W for constructing the word graph
2. the number of topics K learned by LDA
3. the setting of the preference values pz(w)
4. damping factor λ of TPR

Except for the parameter under investigation, we set the parameters to the following values: W = 10, K = 1000, λ = 0.3, and pz(w) = pr(z|w).

EXPERIMENTS

Window Size W

On NEWS, performance does not change much when W ranges from 5 to 20, as shown in Table 1.

Similarly, on RESEARCH performance does not change much when W ranges from 2 to 10, but it becomes poor when W = 20.
  Documents in RESEARCH (121 words) are much shorter than those in NEWS (704 words).
  With W = 20 the graph becomes fully connected, and the weights of links tend to be equal.

EXPERIMENTS

The Number of Topics K

We demonstrate the influence of the number of topics K of LDA models in Table 2.
  The influence is similar on RESEARCH.
  It indicates that LDA is appropriate for obtaining topics of words and documents for TPR to extract keyphrases.

EXPERIMENTS

Damping Factor λ

The damping factor λ of TPR reconciles the influences of graph walks and of preference values on the topic-specific PageRank scores.

EXPERIMENTS

Preference Values

Table 3 shows the influence of the preference value settings when the number of keyphrases M = 10 on NEWS.
  pr(w|z) assigns preference values according to how frequently words appear in the given topic.
  pr(z|w) prefers words that are focused on the given topic.

EXPERIMENTS

Comparing with Baseline Methods

We select three baseline methods to compare with TPR: TFIDF, PageRank, and LDA.
  TFIDF and PageRank do not use topic information.
  LDA computes the ranking score for each word using the topical similarity between the word and the document. We report the LDA baseline computed with cosine similarity, the measure that performs best.
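The LDA baseline can be pictured as ranking words by the cosine similarity between two topic vectors. The sketch below assumes each word is represented by pr(z|w) and the document by pr(z|d); that representation, the function names, and the toy vectors are assumptions made for illustration, since the slides only state that cosine similarity is used.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_words_by_topic_similarity(word_topics, doc_topics):
    """Sort words by similarity between pr(z | w) and pr(z | d), best first."""
    return sorted(word_topics,
                  key=lambda w: cosine_similarity(word_topics[w], doc_topics),
                  reverse=True)


word_topics = {
    "assassination": [0.9, 0.1],   # hypothetical pr(z | w)
    "weather": [0.1, 0.9],
}
print(rank_words_by_topic_similarity(word_topics, doc_topics=[0.8, 0.2]))
```

Because this score depends only on topic vectors, a word's frequency in the document plays no role in it.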

EXPERIMENTS

Tables 4 and 5 compare the four methods on NEWS and RESEARCH.
  The improvements of TPR are all statistically significant, tested with bootstrap re-sampling at 95% confidence.
  LDA performs as well as or better than TFIDF and PageRank under the precision/recall/F measures.
  Under MRR, however, LDA performs much worse than TFIDF and PageRank.

EXPERIMENTS

Figures 3 and 4 show the precision-recall relations of the four methods on NEWS and RESEARCH. Each point on a precision-recall curve is evaluated at a different number of extracted keyphrases M.

EXPERIMENTS

Table 6 shows an example of keyphrases extracted using TPR from a news article titled “Arafat Says U.S. Threatening to Kill PLO Officials”.

Top 3 topics: Palestine, Israel, terrorism

EXPERIMENTS

TFIDF
  considers only frequency
  ranks phrases containing “PLO” highly, because “PLO” appeared about 16 times in this article

LDA
  does not consider frequency
  failed to extract the keyphrase “political assassination”, even though the word “assassination” occurred 8 times in this article

RELATED WORK

1. Supervised methods regard keyphrase extraction as a classification task (Turney, 1999).
   They need a manually annotated training set, which is time-consuming to build.
2. Clustering techniques on word graphs for keyphrase extraction (Grineva et al., 2009; Liu et al., 2009).
   They performed well on short abstracts but poorly on long articles.
3. Topical PageRank with random jumps between topics (Nie et al., 2006).
   It did not help improve the performance of keyphrase extraction.

Peter D. Turney. 1999. Learning to extract keyphrases from text. National Research Council Canada, Institute for Information Technology, Technical Report ERB-1057.

M. Grineva, M. Grinev, and D. Lizorkin. 2009. Extracting key terms from noisy and multi-theme documents. In Proceedings of WWW, pages 661–670.

Lan Nie, Brian D. Davison, and Xiaoguang Qi. 2006. Topical link analysis for web search. In Proceedings of SIGIR, pages 91–98.

CONCLUSION

We propose a new graph-based framework, Topical PageRank.
We investigate the influence of various parameters on TPR.

Future work
  We plan to obtain topics using other machine learning methods and from other knowledge bases.
  We will consider topic information in other graph-based ranking algorithms such as HITS (Kleinberg, 1999).
  We will investigate the influence of corpus selection in training LDA for keyphrase extraction using TPR.

RELATED WORK

Topical link analysis for web search (Nie et al., 2006)

When a surfer follows a graph link from vertex wi to wj, the ranking score of wi on topic z has a higher probability of passing to the same topic of wj and a lower probability of passing to a different topic of wj.
