AUTOMATIC KEYPHRASE EXTRACTION VIA TOPIC DECOMPOSITION Proceeding EMNLP '10 Proceedings of the 2010...
AUTOMATIC KEYPHRASE EXTRACTION VIA TOPIC DECOMPOSITION
EMNLP '10: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing
Reporter: Ying-Ying Chen
OUTLINE
  Introduction
  Building Topic Interpreters
  Topical PageRank for Keyphrase Extraction
  Experiments
  Related work
  Conclusion
INTRODUCTION
Keyphrases are a set of terms in a document that give readers a brief summary of its content.
  They are widely used in information retrieval and digital libraries.
  Extracting them is also an essential step in document categorization, clustering, and summarization.
Two principal approaches: supervised and unsupervised.
Supervised methods
  regard keyphrase extraction as a classification task
  require a document set with human-assigned keyphrases
INTRODUCTION
Unsupervised methods
  Graph-based ranking
  Process:
    1. Build a word graph according to word co-occurrences within the document.
    2. Use random walk techniques to measure word importance.
    3. Select the top-ranked words as keyphrases.
  Problems (requirements that plain graph-based ranking does not guarantee):
    Keyphrases should be relevant to the major topics of the given document.
    Keyphrases should also have good coverage of the document's major topics.
INTRODUCTION
To address these problems, it is intuitive to consider the topics of words and documents in the random walk for keyphrase extraction:
  decompose traditional PageRank into multiple PageRanks, each specific to one topic
  obtain the importance scores of words under different topics
We call the topic-decomposed PageRank Topical PageRank (TPR).
Moreover, TPR is unsupervised and language independent.
TPR for keyphrase extraction is a two-stage process:
  1. Build a topic interpreter to acquire the topics of words and documents.
  2. Perform TPR to extract keyphrases for documents.
BUILDING TOPIC INTERPRETERS
There are two ways to acquire topic distributions of words:
  Use manually annotated knowledge bases.
    Ex. WordNet
  Use unsupervised machine learning techniques to obtain word topics from a large-scale document collection.
    LSA (Latent Semantic Analysis), pLSA (probabilistic LSA), LDA (Latent Dirichlet Allocation)
BUILDING TOPIC INTERPRETERS
LDA
  Each word w of a document d is regarded as generated by first sampling a topic z from d's topic distribution θ(d), and then sampling a word from the distribution over words φ(z) that characterizes topic z.
  In LDA, θ(d) and φ(z) are drawn from conjugate Dirichlet priors α and β, respectively.
  Therefore, θ and φ are integrated out, and the probability of word w given document d and the priors is:
    $$pr(w \mid d, \alpha, \beta) = \sum_{z=1}^{K} pr(w \mid z, \beta)\, pr(z \mid d, \alpha)$$
  where K is the number of topics.
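
A minimal sketch of how a word's probability in a document mixes over topics: each topic contributes pr(w|z) weighted by the document's pr(z|d). The two topics and all probabilities below are made-up toy values, and `word_given_doc` is a hypothetical helper name, not something from the slides.

```python
def word_given_doc(word, doc_topics, topic_words):
    """pr(w|d) as the sum over topics z of pr(z|d) * pr(w|z)."""
    return sum(p_z * topic_words[z].get(word, 0.0)
               for z, p_z in doc_topics.items())


# Toy distributions with K = 2 topics (illustrative numbers only).
topic_words = {
    "sports":   {"game": 0.5, "team": 0.4, "vote": 0.1},
    "politics": {"game": 0.1, "team": 0.1, "vote": 0.8},
}
doc_topics = {"sports": 0.25, "politics": 0.75}

p_vote = word_given_doc("vote", doc_topics, topic_words)
p_all = sum(word_given_doc(w, doc_topics, topic_words)
            for w in ("game", "team", "vote"))
```

Because each pr(w|z) and pr(z|d) is a proper distribution, the mixture over the whole vocabulary again sums to 1.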
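LDA (LATENT DIRICHLET ALLOCATION)
Dirichlet distribution
  The Dirichlet distribution is the conjugate prior of the multinomial distribution.
  If the prior is a Dirichlet distribution and the likelihood function is multinomial, then the posterior is also a Dirichlet distribution.
  In Bayes' rule P(Y|X) = P(X|Y) P(Y) / P(X): P(Y|X) is the posterior, P(Y) is the prior, and P(X|Y) is the likelihood.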
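
Conjugacy makes the posterior update simple bookkeeping: observed category counts are added to the prior parameters. The sketch below assumes a symmetric prior over three categories and a made-up list of observations; the function names are hypothetical.

```python
from collections import Counter


def dirichlet_posterior(alpha, observations):
    """Posterior Dirichlet parameters: alpha_k plus the count of category k."""
    counts = Counter(observations)
    return [a + counts.get(k, 0) for k, a in enumerate(alpha)]


def dirichlet_mean(alpha):
    """Expected multinomial parameters under Dirichlet(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]


alpha = [1.0, 1.0, 1.0]               # prior over 3 categories
observed = [0, 0, 2, 0, 1, 0]         # multinomial draws (category indices)
posterior = dirichlet_posterior(alpha, observed)
posterior_mean = dirichlet_mean(posterior)
```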
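LDA (LATENT DIRICHLET ALLOCATION)
LDA maps documents into a topic space: it treats each document as a random mixture of many topics and uses those topics to capture the relationships between documents.
LDA, LSA, and pLSA share the same bag-of-words assumption, so none of them considers grammar or word order.
Difference between LDA and pLSA
  pLSA learns its per-document parameters only for the documents that appear in the training corpus.
  LDA also gives documents that do not appear in the training corpus a probabilistic representation, so it needs fewer parameters.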
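LDA (LATENT DIRICHLET ALLOCATION)
LDA is a generative model: it can randomly generate observable data, that is, a document composed of multiple topics. Modeling runs in the opposite direction, fitting the generative model to a collection of documents. The generative steps are:
  1. Choose N ~ Poisson(ξ), where N is the document length (number of words).
  2. Choose θ ~ Dirichlet(α), where θ gives the probability of each topic and α is the parameter of the Dirichlet distribution.
  3. For each of the N words:
    1. Choose a topic z_n ~ Multinomial(θ); z_n is the topic selected for the current word.
    2. Choose a word w_n from p(w_n | z_n; β), a multinomial distribution conditioned on z_n, where β is a K×V matrix with β_ij = P(w^j = 1 | z^i = 1).
In LDA, each document has its own θ, and θ can be used to judge the similarity between documents.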
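
The generative steps can be sketched with the standard library alone. This is an illustration under stated assumptions, not a training procedure: the vocabulary, the matrix `beta`, and the values of `xi` and `alpha` are made up, and the Poisson and Dirichlet samplers are simple textbook constructions.

```python
import math
import random


def sample_poisson(rng, lam):
    """Knuth's multiplication method; adequate for small lam."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_dirichlet(rng, alpha):
    """Normalise independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]


def generate_document(rng, xi, alpha, beta, vocab):
    n_words = sample_poisson(rng, xi)             # step 1: document length
    theta = sample_dirichlet(rng, alpha)          # step 2: topic proportions
    words = []
    for _ in range(n_words):                      # step 3: one word at a time
        z = rng.choices(range(len(alpha)), weights=theta)[0]   # 3.1 topic
        words.append(rng.choices(vocab, weights=beta[z])[0])   # 3.2 word
    return theta, words


vocab = ["game", "team", "vote", "law"]
beta = [[0.5, 0.5, 0.0, 0.0],     # topic 0 only emits "game" / "team"
        [0.0, 0.0, 0.5, 0.5]]     # topic 1 only emits "vote" / "law"
theta, words = generate_document(random.Random(0), xi=8, alpha=[0.5, 0.5],
                                 beta=beta, vocab=vocab)
```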
TOPICAL PAGERANK FOR KEYPHRASE EXTRACTION
Given a document d, the process of keyphrase extraction using TPR consists of the following four steps:
  1. Construct a word graph for d according to word co-occurrences within d.
  2. Perform TPR to calculate the importance score of each word with respect to different topics.
  3. Using the topic-specific importance scores of words, rank candidate keyphrases with respect to each topic separately.
  4. Given the topics of document d, integrate the topic-specific rankings of candidate keyphrases into a final ranking; the top-ranked ones are selected as keyphrases.
TOPICAL PAGERANK FOR KEYPHRASE EXTRACTION
We construct a word graph according to word co-occurrences within the given document.
  Link weight between words
    the co-occurrence count within a sliding window of at most W words in the word sequence
  Direction
    When sliding a W-width window, at each position we add links from the first word to the other words within the window.
  Format
    Only adjectives and nouns are added to the word graph.
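
A rough sketch of the graph construction described above. Two points are assumptions, since the slide does not settle them: words are filtered to adjectives and nouns before the window slides, and the tag names `ADJ` and `NOUN` are placeholders for whatever the POS tagger emits. `build_word_graph` is a hypothetical helper.

```python
from collections import defaultdict


def build_word_graph(tagged_words, window=10, keep_tags=("ADJ", "NOUN")):
    """Directed graph: weights[a][b] counts windows that start at a and contain b."""
    words = [w.lower() for w, tag in tagged_words if tag in keep_tags]
    weights = defaultdict(lambda: defaultdict(int))
    for i, first in enumerate(words):
        # A window of W words holds the first word plus the W - 1 that follow.
        for other in words[i + 1:i + window]:
            if other != first:            # skip self-loops
                weights[first][other] += 1
    return {src: dict(targets) for src, targets in weights.items()}


tagged = [("Topical", "ADJ"), ("PageRank", "NOUN"), ("ranks", "VERB"),
          ("topical", "ADJ"), ("words", "NOUN")]
graph = build_word_graph(tagged, window=2)
```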
TOPICAL PAGERANK FOR KEYPHRASE EXTRACTION
PageRank
  The basic idea of PageRank is that a vertex is important if other important vertices point to it.
  This can be regarded as voting or recommendation among vertices.
  Let G = (V, E) be the graph of a document, with
    vertex set V = {w1, w2, ..., wN}
    link set E, where (wi, wj) ∈ E if there is a link from wi to wj
    e(wi, wj), the weight of link (wi, wj)
    O(wi), the out-degree of vertex wi:
      $$O(w_i) = \sum_{j:\, w_i \to w_j} e(w_i, w_j)$$
  The PageRank score of wi is
      $$R(w_i) = \lambda \sum_{j:\, w_j \to w_i} \frac{e(w_j, w_i)}{O(w_j)} R(w_j) + (1 - \lambda) \frac{1}{|V|}$$
  λ is a damping factor ranging from 0 to 1, and |V| is the number of vertices.
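
A minimal sketch of weighted PageRank by repeated updates, assuming the usual form in which each vertex passes its score along its out-links in proportion to link weight (with probability λ) and jumps uniformly otherwise (with probability 1 − λ). The stop thresholds and the toy graph are illustrative, and vertices without out-links simply lose their share here, which a fuller implementation would handle.

```python
def pagerank(weights, damping, max_iter=100, tol=1e-3):
    """weights[src][dst] is the link weight e(src, dst)."""
    vertices = set(weights)
    for targets in weights.values():
        vertices.update(targets)
    out_degree = {v: sum(weights.get(v, {}).values()) for v in vertices}
    scores = {v: 1.0 / len(vertices) for v in vertices}
    for _ in range(max_iter):
        # Uniform random-jump term, then add what each in-link passes along.
        new_scores = {v: (1.0 - damping) / len(vertices) for v in vertices}
        for src, targets in weights.items():
            for dst, w in targets.items():
                new_scores[dst] += damping * w / out_degree[src] * scores[src]
        converged = all(abs(new_scores[v] - scores[v]) < tol for v in vertices)
        scores = new_scores
        if converged:
            break
    return scores


cycle = pagerank({"a": {"b": 1}, "b": {"c": 1}, "c": {"a": 1}}, damping=0.85)
hub = pagerank({"a": {"c": 1}, "b": {"c": 1}, "c": {"a": 1}}, damping=0.85)
```

In the symmetric cycle every vertex keeps score 1/3; in the second graph `c` receives two votes and ends up above `b`, which receives none.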
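TOPICAL PAGERANK FOR KEYPHRASE EXTRACTION
Topical PageRank (TPR)
  Each topic-specific PageRank prefers words with high relevance to the corresponding topic.
  In the PageRank of a specific topic z, we assign a topic-specific preference value pz(w) to each word w as its random jump probability, with
    $$\sum_{w \in V} p_z(w) = 1$$
  The topic-specific score replaces the uniform jump of PageRank with the preference value:
    $$R_z(w_i) = \lambda \sum_{j:\, w_j \to w_i} \frac{e(w_j, w_i)}{O(w_j)} R_z(w_j) + (1 - \lambda)\, p_z(w_i)$$
  where e(wj, wi) is the weight of link (wj, wi), O(wj) is the out-degree of wj, and λ is the damping factor.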
TOPICAL PAGERANK FOR KEYPHRASE EXTRACTION
Topical PageRank (TPR)
  We use three measures to set preference values for TPR:
    pz(w) = pr(w|z)
      This indicates how much topic z focuses on word w.
    pz(w) = pr(z|w)
      This indicates how much word w focuses on topic z.
    pz(w) = pr(w|z) * pr(z|w)
      This measure is inspired by the work in (Cohn and Chang, 2000).
  Termination conditions (iteration stops when either holds):
    the number of iterations reaches 100
    the difference of each vertex's score between two consecutive iterations is less than 0.001
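
A sketch of one topic-specific PageRank using the three preference settings and the two stop conditions listed above. The assumptions are worth stating: pr(w|z) and pr(z|w) are estimated here from made-up word-topic counts (in the talk they come from LDA), preference values are normalised to sum to 1, and every graph vertex must have a preference value. Function names and toy data are hypothetical.

```python
def preference_values(word_topic_counts, topic, measure="z|w"):
    """Normalised p_z(w) for one topic; counts[word] is a list indexed by topic."""
    topic_total = sum(c[topic] for c in word_topic_counts.values())
    prefs = {}
    for word, counts in word_topic_counts.items():
        p_w_given_z = counts[topic] / topic_total
        p_z_given_w = counts[topic] / sum(counts)
        prefs[word] = {"w|z": p_w_given_z,
                       "z|w": p_z_given_w,
                       "product": p_w_given_z * p_z_given_w}[measure]
    norm = sum(prefs.values())
    return {w: p / norm for w, p in prefs.items()}


def topical_pagerank(weights, prefs, damping=0.3, max_iter=100, tol=1e-3):
    """PageRank whose random jump lands on word w with probability prefs[w]."""
    vertices = list(prefs)
    out_degree = {v: sum(weights.get(v, {}).values()) for v in vertices}
    scores = {v: 1.0 / len(vertices) for v in vertices}
    for _ in range(max_iter):                       # stop condition 1
        new_scores = {v: (1.0 - damping) * prefs[v] for v in vertices}
        for src, targets in weights.items():
            for dst, w in targets.items():
                new_scores[dst] += damping * w / out_degree[src] * scores[src]
        converged = all(abs(new_scores[v] - scores[v]) < tol for v in vertices)
        scores = new_scores
        if converged:                               # stop condition 2
            break
    return scores


counts = {"arafat": [9, 1], "game": [1, 9], "official": [10, 30]}
graph = {"arafat": {"official": 1}, "official": {"game": 1}, "game": {"arafat": 1}}
prefs_wz = preference_values(counts, topic=0, measure="w|z")
prefs_zw = preference_values(counts, topic=0, measure="z|w")
scores = topical_pagerank(graph, prefs_zw)
```

With these toy counts, pr(w|z) favours the frequent but unfocused word "official", while pr(z|w) favours "arafat", which occurs almost only under topic 0.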
TOPICAL PAGERANK FOR KEYPHRASE EXTRACTION
Extract Keyphrases Using Ranking Scores
  We select noun phrases from a document as candidate keyphrases for ranking:
    1. The document is first tokenized.
    2. We then annotate the document with part-of-speech (POS) tags.
    3. Finally, we extract noun phrases with the pattern (adjective)*(noun)+.
  We regard these noun phrases as candidate keyphrases.
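
The pattern (adjective)*(noun)+ can be matched with an ordinary regular expression once each tag is encoded as a single character. The sketch assumes tokenization and tagging are already done; the tag names `ADJ` and `NOUN` and the hand-tagged sentence are illustrative.

```python
import re


def candidate_phrases(tagged_words):
    """Return maximal (adjective)*(noun)+ spans as space-joined strings."""
    # One character per token so the regex runs over the tag sequence.
    code = "".join("A" if tag == "ADJ" else "N" if tag == "NOUN" else "x"
                   for _, tag in tagged_words)
    return [" ".join(word for word, _ in tagged_words[m.start():m.end()])
            for m in re.finditer(r"A*N+", code)]


tagged = [("The", "DET"), ("political", "ADJ"), ("assassination", "NOUN"),
          ("threatens", "VERB"), ("senior", "ADJ"), ("PLO", "NOUN"),
          ("officials", "NOUN")]
phrases = candidate_phrases(tagged)
```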
TOPICAL PAGERANK FOR KEYPHRASE EXTRACTION
Extract Keyphrases Using Ranking Scores
  We rank the candidate keyphrases using the ranking scores obtained by TPR.
  By considering the topic distribution of the document, we further integrate the topic-specific rankings of candidate keyphrases into a final ranking.
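
The slide does not spell out the integration formula, so the sketch below takes one natural reading as an assumption: a phrase's score under topic z is the sum of its words' topic-specific scores, and the final score weights each topic by pr(z|d). All scores and topic names are made up.

```python
def rank_phrases(phrases, topic_word_scores, doc_topics, top_m=10):
    """Sort phrases by the sum over z of pr(z|d) * (sum of R_z(w) for w in phrase)."""
    def phrase_score(phrase):
        return sum(
            p_z * sum(topic_word_scores[z].get(w, 0.0) for w in phrase.split())
            for z, p_z in doc_topics.items()
        )
    return sorted(phrases, key=phrase_score, reverse=True)[:top_m]


# Hypothetical topic-specific word scores R_z(w) and document topics pr(z|d).
topic_word_scores = {
    "mideast": {"plo": 0.5, "officials": 0.2, "weather": 0.05},
    "sports":  {"plo": 0.0, "officials": 0.3, "weather": 0.1},
}
doc_topics = {"mideast": 0.9, "sports": 0.1}
ranking = rank_phrases(["weather", "plo officials"], topic_word_scores, doc_topics)
```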
EXPERIMENTS
Datasets
  One dataset was built by Wan and Xiao and used in (Wan and Xiao, 2008b).
    It contains 308 news articles in DUC2001 (Over et al., 2001) with 2,488 manually annotated keyphrases.
    There are at most 10 keyphrases for each document.
    In experiments we refer to this dataset as NEWS.
  The other dataset was built by Hulth and used in (Hulth, 2003).
    It contains 2,000 abstracts of research articles with 19,254 manually annotated keyphrases.
    In experiments we refer to this dataset as RESEARCH.
EXPERIMENTS
Dataset
  We use the Wikipedia snapshot of March 2008 to build topic interpreters with LDA.
    collected 2,122,618 articles
    built the vocabulary by selecting 20,000 words according to their document frequency
    learned several models with different numbers of topics, from 50 to 1,500
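
Vocabulary selection by document frequency can be sketched as follows. The slide only says words were chosen "according to their document frequency", so keeping the most frequent ones is an assumption, and any stop-word filtering is left out.

```python
from collections import Counter


def build_vocabulary(documents, size):
    """Keep the `size` words that occur in the largest number of documents."""
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))      # count each word once per document
    return [word for word, _ in doc_freq.most_common(size)]


docs = [["topic", "model", "model"], ["topic", "graph"], ["topic", "model"]]
vocab = build_vocabulary(docs, size=2)
```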
EXPERIMENTS
Influences of Parameters on TPR
  There are four parameters in TPR that may influence the performance of keyphrase extraction:
    1. window size W for constructing the word graph
    2. the number of topics K learned by LDA
    3. different settings of preference values pz(w)
    4. damping factor λ of TPR
  Except for the parameter under investigation, we set the parameters to the following values: W = 10, K = 1000, λ = 0.3, and pz(w) = pr(z|w).
EXPERIMENTS
Window Size W
  In experiments on NEWS, the performance of TPR is similar when W ranges from 5 to 20, as shown in Table 1.
  Similarly, when W ranges from 2 to 10, the performance on RESEARCH does not change much, but it becomes poor when W = 20.
    Documents in RESEARCH (121 words) are much shorter than those in NEWS (704 words).
    With such a large window the graph becomes fully connected,
    and the weights of links tend to be equal.
EXPERIMENTS
The Number of Topics K
  We demonstrate the influence of the number of topics K of LDA models in Table 2.
  The influence is similar on RESEARCH.
  This indicates that LDA is appropriate for obtaining topics of words and documents for TPR to extract keyphrases.
EXPERIMENTS
Damping Factor λ
  The damping factor λ of TPR reconciles the influences of graph walks and of the preference values pz(w) on the topic-specific PageRank scores.
EXPERIMENTS
Preference Values
  In Table 3 we show the influence of the preference value setting when the number of keyphrases M = 10 on NEWS.
    pr(w|z) assigns preference values according to how frequently words appear in the given topic.
    pr(z|w) prefers words that are focused on the given topic.
EXPERIMENTS
Comparing with Baseline Methods
  We select three baseline methods to compare with TPR:
    TFIDF
    PageRank
      TFIDF and PageRank do not use topic information.
    LDA
      computes the ranking score of each word using the topical similarity between the word and the document
      The LDA baseline is calculated using cosine similarity, which performs the best.
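
The LDA baseline can be sketched as ranking words by the cosine similarity between a word's topic vector and the document's topic vector. Which vectors are compared is an assumption here (pr(z|w) for the word, pr(z|d) for the document), and the numbers are toy values.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def lda_baseline_rank(word_topics, doc_topics):
    """Words ordered by topical similarity to the document, most similar first."""
    return sorted(word_topics,
                  key=lambda w: cosine(word_topics[w], doc_topics),
                  reverse=True)


doc_topics = [0.8, 0.2]
word_topics = {"game": [0.1, 0.9], "plo": [0.9, 0.1], "official": [0.5, 0.5]}
ranking = lda_baseline_rank(word_topics, doc_topics)
```

Note that nothing in this score depends on how often a word occurs in the document, only on its topic profile.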
EXPERIMENTS
In Tables 4 and 5 we show the comparison results of the four methods on both NEWS and RESEARCH.
  The improvements of TPR are all statistically significant, tested with bootstrap re-sampling at 95% confidence.
  LDA performs equal to or better than TFIDF and PageRank under precision, recall, and F-measure.
  However, the performance of LDA under MRR is much worse than that of TFIDF and PageRank.
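
MRR depends only on the position of the first correct keyphrase in each document, which is why a method can score well on precision and recall yet poorly on MRR. A small sketch under the standard definition, with made-up rankings:

```python
def mean_reciprocal_rank(extracted_per_doc, gold_per_doc):
    """Average over documents of 1 / rank of the first correct keyphrase."""
    total = 0.0
    for extracted, gold in zip(extracted_per_doc, gold_per_doc):
        for rank, phrase in enumerate(extracted, start=1):
            if phrase in gold:
                total += 1.0 / rank
                break                 # only the first hit counts
    return total / len(extracted_per_doc)


extracted = [["plo", "arafat", "weather"],          # first hit at rank 1
             ["game", "city", "news", "election"]]  # first hit at rank 4
gold = [{"plo", "arafat"}, {"election"}]
mrr = mean_reciprocal_rank(extracted, gold)
```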
EXPERIMENTS
In Figures 3 and 4 we show the precision-recall relations of the four methods on NEWS and RESEARCH.
  Each point on a precision-recall curve is evaluated at a different number of extracted keyphrases M.
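
How one curve point is obtained for a given M can be sketched for a single document. This per-document version is a simplification; aggregating counts over a whole dataset is left out, and the phrase lists are made up.

```python
def precision_recall_f(extracted, gold, m):
    """Precision, recall and F-measure of the top-m extracted keyphrases."""
    top = extracted[:m]
    correct = sum(1 for phrase in top if phrase in gold)
    precision = correct / len(top) if top else 0.0
    recall = correct / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f


extracted = ["a", "b", "c", "d"]      # ranked output of some method
gold = {"a", "c", "e"}                # manually annotated keyphrases
# One (precision, recall) point per value of M.
curve = [precision_recall_f(extracted, gold, m)[:2]
         for m in range(1, len(extracted) + 1)]
```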
EXPERIMENTS
In Table 6 we show an example of keyphrases extracted using TPR from a news article titled “Arafat Says U.S. Threatening to Kill PLO Officials”.
  Top 3 topics: Palestine, Israel, terrorism
EXPERIMENTS
TFIDF
  considers only frequency
  ranks highly the phrases containing “PLO”, which appears about 16 times in this article
LDA
  does not consider frequency
  fails to extract the keyphrase “political assassination”, even though the word “assassination” occurs 8 times in this article
RELATED WORK
  1. Supervised methods regard keyphrase extraction as a classification task (Turney, 1999).
    They need a manually annotated training set, which is time-consuming to build.
  2. Clustering techniques on word graphs for keyphrase extraction (Grineva et al., 2009; Liu et al., 2009)
    performed well on short abstracts but poorly on long articles.
  3. Topical PageRank with random jumps between topics (Nie et al., 2006)
    did not help improve the performance of keyphrase extraction.

Peter D. Turney. 1999. Learning to extract keyphrases from text. National Research Council Canada, Institute for Information Technology, Technical Report ERB-1057.
M. Grineva, M. Grinev, and D. Lizorkin. 2009. Extracting key terms from noisy and multi-theme documents. In Proceedings of WWW, pages 661–670.
Lan Nie, Brian D. Davison, and Xiaoguang Qi. 2006. Topical link analysis for web search. In Proceedings of SIGIR, pages 91–98.
CONCLUSION
  We propose a new graph-based framework, Topical PageRank.
  We investigate the influence of various parameters on TPR.
Future work
  We plan to obtain topics using other machine learning methods and from other knowledge bases.
  We will consider topic information in other graph-based ranking algorithms such as HITS (Kleinberg, 1999).
  We will investigate the influence of corpus selection in training LDA for keyphrase extraction using TPR.
RELATED WORK
Topical link analysis for web search (Nie et al., 2006)
  When surfing along a graph link from vertex wi to wj, the ranking score of wi on topic z has a higher probability of passing to the same topic of wj and a lower probability of passing to a different topic of wj.