1 Centroid Based multi-document summarization: Efficient sentence extraction method Presenter: Chen...

1

Centroid Based multi-document summarization: Efficient sentence extraction method

Presenter: Chen Yi-Ting

2

Introduction

• Summaries save readers’ time.

• This is not a new phenomenon.

• A system that summarizes a large amount of news from different sources has been developed.

• This paper describes how multi-document summaries are built and evaluated.

• Text can be summarized by selecting the most important sentences of the documents.

• To do that, one measures the centroid of the words of the sentences.

3

Corpus Development Scheme

• Algorithm:

Get the user's input: the starting URL and the desired file type.
Add the URL to the currently empty list of URLs to search.
While the list of URLs to search is not empty, {
  1. Get the first URL in the list.
  2. Move the URL to the list of URLs already searched.
  3. Check the URL to make sure its protocol is HTTP (if not, break out of the loop, back to "While").
  4. See whether there's a robots.txt file at this site that includes a "Disallow" statement. (If so, break out of the loop, back to "While".)
  Try to "open" the URL (that is, retrieve that document from the Web). If it's not an HTML file, break out of the loop, back to "While."
  5. Step through the HTML file. While the HTML text contains another link {
    Validate the link's URL and make sure robots are allowed (just as in the outer loop).
    If it's an HTML file,
      If the URL isn't present in either the to-search list or the already-searched list, add it to the to-search list.
    Else if it's the type of the file the user requested,
      Add it to the list of files found.
  }
}

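The crawl loop on this slide can be sketched in Python roughly as follows. This is a minimal sketch, not the presenters' program: `fetch` and `allowed` are hypothetical callbacks standing in for the HTTP download and the robots.txt check, and a link is treated as an HTML file purely from its path extension, which is an assumption.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def looks_like_html(path):
    # Assumption: decide "HTML file" from the URL path alone.
    return path == "" or path.endswith((".html", ".htm", "/"))


def crawl(start_url, file_ext, fetch, allowed=lambda url: True):
    """Crawl from start_url and return the URLs whose path ends in file_ext.

    fetch(url) returns the page's HTML text, or None if it cannot be opened
    (hypothetical callback). allowed(url) returns False when robots.txt
    disallows the URL (hypothetical callback).
    """
    to_search = deque([start_url])
    searched = set()
    found = []
    while to_search:
        url = to_search.popleft()          # step 1
        searched.add(url)                  # step 2
        if urlparse(url).scheme != "http" or not allowed(url):
            continue                       # steps 3 and 4: back to "While"
        html = fetch(url)
        if html is None:
            continue
        parser = LinkParser()
        parser.feed(html)                  # step 5: walk the links
        for href in parser.links:
            link = urljoin(url, href)
            if urlparse(link).scheme != "http" or not allowed(link):
                continue
            path = urlparse(link).path
            if looks_like_html(path):
                if link not in searched and link not in to_search:
                    to_search.append(link)
            elif path.endswith(file_ext) and link not in found:
                found.append(link)
    return found
```

Passing the download and robots check as callbacks keeps the sketch runnable without network access; in a real crawler they could be backed by `urllib.request` and `urllib.robotparser`.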
4

Working principle of the system

• The program we developed does not accept search keywords.

• It takes only the address of the Yahoo news home page as input, searches that page and its links, and continues until no addresses are left to search.

• It first loads the HTML source code into a string variable and then searches for the keyword “class=storyheadline”, which marks where the main theme of the news is kept.
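The keyword search described above might look like the sketch below. It assumes that the headline text follows the closing ">" of the tag carrying the marker and runs up to the next "<"; the slide does not show the actual page layout, so `find_headlines` is only illustrative.

```python
def find_headlines(html, marker="class=storyheadline"):
    """Return the text that follows each tag containing the marker.

    Assumption: the headline is the text between the tag's closing '>'
    and the next '<'.
    """
    headlines = []
    pos = html.find(marker)
    while pos != -1:
        start = html.find(">", pos)
        if start == -1:
            break
        end = html.find("<", start)
        if end == -1:
            end = len(html)
        text = html[start + 1:end].strip()
        if text:
            headlines.append(text)
        # Continue the search after the current marker.
        pos = html.find(marker, pos + len(marker))
    return headlines
```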

5

Design of an application based on Corpus Development

• An application has been designed that searches all Yahoo news URLs and downloads each news story along with its title, writer, and time of occurrence.

• Special tags are kept as the template of the documents in the corpus: <storystart>, <title> </title>, <time> </time>, <writer> </writer>, <textstart> </textend>, <storyend>
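As an illustration of this template, one corpus record could be assembled as in the sketch below. The tag names are taken from the slide as written; the order of the fields, one tag per line, and the function name `format_story` are assumptions.

```python
def format_story(title, time, writer, text):
    """Wrap one news story in the corpus template tags.

    Assumption: fields appear in this order, one per line.
    """
    return "\n".join([
        "<storystart>",
        f"<title>{title}</title>",
        f"<time>{time}</time>",
        f"<writer>{writer}</writer>",
        f"<textstart>{text}</textend>",
        "<storyend>",
    ])
```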

6

Design of an application based on Corpus Development

• The corpus application we designed can resume downloading news from the point where a previous download was interrupted.

• It can download news only from Yahoo’s news website.

• A system that summarizes a set of related documents, mainly on the basis of the centroid, is introduced.

• The information about the words of the sentences (DF and count) can be stored in a database.

• CIDR computes Count*IDF in an iterative fashion, updating its values as more articles are inserted into a given cluster.
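The iterative update can be sketched as below, using the weighting count(w) * log(DN ⁄ df(w)). This is not CIDR itself: the class name `ClusterStats` is hypothetical, in-memory counters stand in for the database, and every inserted article is assumed to count toward both the cluster and the corpus.

```python
import math
from collections import Counter


class ClusterStats:
    """Word statistics that are updated as articles are inserted.

    Assumption: plain counters stand in for the database, and the
    cluster's articles are also the whole corpus.
    """

    def __init__(self):
        self.count = Counter()   # count(w): occurrences of w
        self.df = Counter()      # df(w): number of documents containing w
        self.num_docs = 0        # DN: number of documents seen so far

    def add_article(self, words):
        """Insert one tokenized article and update count, df, and DN."""
        self.num_docs += 1
        self.count.update(words)
        self.df.update(set(words))

    def count_idf(self, word):
        """Return count(w) * log(DN / df(w)), or 0.0 for an unseen word."""
        if self.df[word] == 0:
            return 0.0
        return self.count[word] * math.log(self.num_docs / self.df[word])
```

Because the values are computed from the stored counters on request, they always reflect every article inserted so far.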

7

Centroid-based algorithm

• INPUT: A collection of related documents.

• OUTPUT: A summary.

• STEPS TO SUMMARIZE:

– a. Finding Cluster Centroid:
Count * idf(w) = count(w) * log(DN ⁄ df(w))
where df(w) = document frequency of word w, DN = number of documents in the corpus.

– b. Finding Sentence Position Score: The score of the i-th sentence (Si) is computed as:
Pscore(Si) = max(1 ⁄ i, 1 ⁄ (n-i-1))
where i = sentence number, n = number of sentences.

– c. Finding Sentence Length Score: The length here means the number of characters in the sentence.
Lscore(Si) = 0 (if Li ≤ Lmin)
Lscore(Si) = (Li-Lmin) ⁄ Li (otherwise)
where Li = length of each sentence, Lmin = 20.
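Steps b and c can be sketched as two small functions. One assumption is made loudly here: the second denominator of Pscore is taken as n - i + 1 rather than the n - i - 1 written on the slide, because n - i - 1 is zero for i = n - 1 and negative for the last sentence. The function names are illustrative.

```python
def position_score(i, n):
    """Pscore for the i-th sentence (1-based) out of n sentences.

    Assumption: uses n - i + 1 in the second denominator; the slide's
    n - i - 1 would divide by zero when i = n - 1.
    """
    return max(1 / i, 1 / (n - i + 1))


def length_score(sentence, l_min=20):
    """Lscore: 0 for sentences of at most l_min characters, else (L - l_min) / L."""
    length = len(sentence)
    if length <= l_min:
        return 0.0
    return (length - l_min) / length
```

Under this assumption the first and last sentences both score 1, and the score falls toward the middle of the document.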

8

Centroid-based algorithm

• STEPS TO SUMMARIZE:

– d. Finding Headline Score:
Hscore(Si) = t ⁄ N
where t = number of words in the sentence that match the words in the headline, N = number of words in the sentence.

– e. Compute Sentence Score:
SCORE(S) = ∑ (wc.Ci + wp.Pi + wf.Fi + wl.Li), where 1 ≤ i ≤ n
n = number of sentences within the cluster, Ci = centroid value of the sentence, Pi = sentence position score, Fi = headline score, Li = sentence length score, wc = wp = wf = wl = 1.

– f. Extract Sentences: d = r * n, where r = compression rate and n = total number of sentences taken from the input documents.
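Steps d to f can be sketched as follows. The sketch reads the score formula as one weighted sum per sentence, rounds d down to a whole number, and returns the chosen sentences in their original order; all three are assumptions, as are the function names.

```python
import math


def headline_score(sentence_words, headline_words):
    """Hscore = t / N: share of the sentence's words that occur in the headline."""
    if not sentence_words:
        return 0.0
    headline = set(headline_words)
    t = sum(1 for w in sentence_words if w in headline)
    return t / len(sentence_words)


def sentence_score(c, p, f, l, wc=1, wp=1, wf=1, wl=1):
    """Weighted sum of centroid, position, headline, and length scores."""
    return wc * c + wp * p + wf * f + wl * l


def extract(sentences, scores, r):
    """Pick the d = r * n best-scoring sentences.

    Assumptions: d is rounded down, and the selected sentences are
    returned in their original order.
    """
    d = math.floor(r * len(sentences))
    ranked = sorted(range(len(sentences)), key=lambda k: scores[k], reverse=True)
    return [sentences[k] for k in sorted(ranked[:d])]
```

For example, with four sentences and a compression rate of 0.5, the two highest-scoring sentences are kept.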

9

Conclusion

• There are many other text summarization techniques based on the position or length of the sentences in the documents.

• Summarization would be more reliable if the sentences were parsed at the phrase level using the Link Grammar parser.

• The information of a word means its ‘subject’, ‘time’, ‘space/location’, ‘action’ (i.e. verb), etc.

• Using this information, the sentences are clustered on the basis of the same ‘subject’, ‘action’, etc.

• The clusters are extracted from the top down until the required summary length is reached.

• Experiments are also under way on several other features of sentences.

It will be very useful for busy people who have no time to go through all the news.
