
CHAPTER 6

DOCUMENT SUMMARIZATION BASED ON SENTENCE

RANKING USING VECTOR SPACE MODEL

The World Wide Web is a repository of a large collection of information available in the form of unstructured documents. Selecting the documents of interest from such a huge pool is a challenging task. Text summarization is used to speed up document retrieval: documents are ranked based on the summary or abstract provided by their authors. This is not always possible, because not all documents come with an abstract or summary. Moreover, when different summarization tools are used to summarize a document, not all the topics covered within the document are reflected in its summary. In this chapter, a method to automate text document summarization is proposed, based on term frequency within the document at two levels: paragraph and sentence. To summarize the document, the similarity between paragraphs, and between the sentences within a paragraph, is considered using the Vector Space Model. Evaluation of the proposed system on the standard reference corpus from DUC-2002 using the ROUGE package indicates avg. Recall, avg. Precision and avg. F-measure comparable to those of existing summarization tools (Copernic, SweSum, Extractor, MSWord AutoSummarizer, Intelligent, Brevity, Pertinence), taking the DUC-2002 (100 words) human summary as the baseline summary.

6.1 Introduction

In the 1960s, large numbers of scientific papers and books were stored digitally. However, the storage media needed for such a large database were very expensive. The concept of automatic shortening of texts was therefore introduced to store information about papers and books in limited storage space. Today, owing to advances in technology, storage media are no longer expensive and bulk information fits into large databases. But with the increased use of the Internet and the large amount of information available on the web, there is a need to represent each document by its summary to save the time and effort of searching for the right information. Automatic document summarization is extremely helpful in tackling the information overload problem. It is the technique of identifying the most important pieces of information in a document, omitting irrelevant information and minimizing details to generate a compact, coherent summary document.

There are different types of summarization approaches [11], [31], [37], depending on what the summarization method focuses on when producing the summary of the text.

i. Abstract vs. Extract summary - Abstraction is the process of paraphrasing sections of the source document, whereas extraction is the process of picking a subset of sentences from the source document and presenting them to the user as a summary that provides an overall sense of the document's content.

ii. Generic vs. Query-based summary - A generic summary does not target any particular group; it addresses a broad community of readers. Query- or topic-focused summaries are tailored to the specific needs of an individual or a particular group and represent a particular topic.

iii. Single vs. Multi-document summary - A single-document summary provides the most relevant information contained in a single document, which helps the user decide whether the document is related to the topic of interest. A multi-document summary helps to identify redundancy across documents and summarizes a set of related documents of a corpus so that it covers the major details of the events in the documents, taking into account some of the major issues [61]: the need to carefully eliminate redundant information from multiple documents and achieve high compression ratios; information about document and passage similarities, and weighting different passages accordingly; the importance of temporal information; and co-reference among entities and facts occurring across documents. Kumar et al. [82] studied a risk minimization framework for sentence extraction to produce a generic multi-document summary.


Automatic text summarization approaches [41] are also classified as:

i. Vector based approach - The summary generated for each document consists of sentences extracted from it using the Vector Space Model (VSM). After the preprocessing step, each text element (a sentence, in the case of text summarization) is considered as an N-dimensional vector [68]. The sentences are then ranked using the VSM according to their similarity within the document.

ii. Fuzzy based approach - All the rules needed for summarization are included in the knowledge base of the fuzzy system [17]. Different characteristics of a text, such as sentence length, location in the paragraph and similarity to key words, are given as input to the fuzzy system. For each sentence, a value from zero to one is obtained as output, based on the sentence characteristics and the rules available in the knowledge base. This output value determines the degree of importance of the sentence in the final summary.

iii. Genetic algorithm based approach - In a genetic algorithm, the solutions are called individuals or chromosomes. After the initial population is generated randomly, selection and variation functions are executed in a loop until some termination criterion is reached. Each run of the loop is called a generation. The selection operator is intended to improve the average quality of the population by giving individuals of higher quality a higher probability of being copied into the next generation. The quality of an individual is measured by a fitness function. Fattah and Ren [48] proposed an automatic text summarizer that uses several feature score functions, such as sentence position, positive and negative keywords and sentence centrality, to train genetic algorithm and mathematical regression models to obtain a suitable combination of feature weights.

iv. Neural Network based approach - A neural network is trained on a corpus of

documents. The neural network is then modified, through feature fusion, to

produce a summary of highly ranked sentences in the document. Through feature

fusion, the network discovers the importance (and unimportance) of various


features used to determine the summary-worthiness of each sentence [76]. The

input to the neural network can be either real or binary vectors.
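As a rough illustration of the vector based approach (item i), the sketch below represents each preprocessed sentence as a term-frequency vector and ranks the sentences by their cosine similarity to the vector of the whole document. Scoring against the whole-document vector with cosine similarity, and the function names `cosine` and `rank_sentences`, are assumptions made for this example; they are not taken from [68] or from the method proposed later in this chapter.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_sentences(tokenized_sentences):
    """Return sentence indices, most similar to the document first.

    Assumption: "similarity within the document" is approximated here by
    similarity to the sum of all sentence vectors.
    """
    vectors = [Counter(tokens) for tokens in tokenized_sentences]
    document = Counter()
    for vector in vectors:
        document.update(vector)
    scores = [cosine(vector, document) for vector in vectors]
    # Ties are broken by position, so earlier sentences come first.
    return sorted(range(len(vectors)), key=lambda k: (-scores[k], k))


sentences = [["storm", "wind"], ["storm", "storm", "rain"], ["coast"]]
order = rank_sentences(sentences)
```

In this toy input the second sentence shares the most weight with the document vector and is ranked first; the isolated sentence about the coast is ranked last.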

Based on the summarization approaches discussed above and the summarization techniques discussed in section 2.4, different automated tools have been developed to generate fixed-length or variable-size summaries of text documents. Some of the automated tools that produce fixed-length (100 words), generic, extract-based summaries of a single text document are briefly discussed below:

i. Brevity Text Summarizer - Brevity¹ works by comparing a document to a set of similar documents. It stores this document information in a Summary Dictionary. Several dictionaries, designed for general categories of documents, are included with Brevity. For example, to summarize a newsfeed of political news, it compares the text to other political news stories of the same type.

ii. Copernic Summarizer - Copernic summarization² technologies implement a wide range of heuristics to isolate sentences, bulleted lists, and special strings such as e-mail addresses and scientific formulas. In addition, they tokenize every word according to its context in order to identify actions, people, places and things. The set of concepts associated with the document's main topic forms the core information extracted by intelligent summarization. The more pertinent concepts a sentence exhibits, the better suited it is to developing important ideas and, consequently, the more likely it is to be retained for inclusion in the summary.

iii. Extractor Text Summarizer - Extractor³ is a software text summarization engine. It works on documents of different types (text, HTML, e-mail) and, using a patented genetic extraction algorithm (GenEx), analyzes the recurrence of words and phrases, their proximity to one another, and the uniqueness of the words to a particular document. The engine returns a list of the key words and phrases found in the document together with their relative ranking (how many times the word or phrase was found in the document), along with contextual links back to the position of the key word or phrase in the document itself.

1[http://www.lextek.com/manuals/brevity/functions.html]

2[http://www.copernic.com/data/pdf/summarization-whitepapereng.pdf]

3[http://www.componentsource.com/products/dbi-extractor/summary.html]


iv. Intelligent Text Summarizer - It generates two summaries [17]. First, a summary is generated by the fuzzy swarm module. It is given as input to the swarm diversity module, which uses the input sentences as initial centroids for a clustering process and generates the final summary after scoring these input sentences, filtering out similar sentences and selecting the most diverse ones.

v. MSWord AutoSummarizer - The AutoSummary Tool in Microsoft Office Word 2007 analyzes a document to identify keywords and then assigns a score to each word. Sentences containing the most frequent words in the document, which have the highest scores, are then selected for inclusion in the summary [55].

vi. SweSum Text Summarizer - SweSum⁴ is an automatic text summarizer based on statistical, linguistic and heuristic methods. The summarization system calculates how often certain key words appear in the document (the Swedish system has 700,000 possible Swedish entries pointing at 40,000 Swedish base key words). The key words belong to the so-called open class words. The system calculates the frequency of the key words in the text, which sentences they are present in, and where these sentences are in the text. It also considers whether the text is tagged with a bold text tag, a first paragraph tag or numerical values. All this information is compiled and used to summarize the original text. SweSum is available for Swedish, Danish, Norwegian, English, Spanish, French, Italian, Greek, Farsi (Persian) and German texts.

vii. Pertinence Summarizer - Pertinence Summarizer⁵ performs linguistic processing over a document and evaluates the pertinence (relevance) of its sentences. The process takes into account not only general and/or specialized linguistic markers, depending on the nature of the document analyzed, but also the user's keywords and, optionally, terminological bases, to enhance the relevance of the selected sentences.

4[http://people.dsv.su.se/~hercules/textsammanfattningeng.html]

5[http://www.pertinence.net/produits_en.html]


After a review of the different summarization approaches and automation tools, a new generic, extract-based, single-document summarization approach, based on statistical heuristics and the Vector Space Model, is proposed in section 6.2.

6.2 Proposed Method

Figure 6.1: Proposed System Architecture

[Block diagram. Input: Text Document, together with the % summary required. Preprocessing (using Stopwords and the Porter Stemming Algorithm): special character elimination, stopwords removal, stemming, tokenizer. Summarizer, Restructuring / Reorganizing: construct sentence term vector, construct paragraph term vector, construct document term vector, score sentences, ordering of sentences. Summarizer, Synthesizing: select sentences. Output: Summary.]

The proposed automatic summarization process has three phases:

a) Analyzing the source text (Preprocessing)

In the first phase, preprocessing of the text document is done to obtain a structured

representation of the original text. The preprocessing step includes:

i. Stop-word elimination - common words that carry no semantics and add no relevant information to the task, e.g. "this" and "is", are eliminated.

ii. Case folding - all characters are converted to the same letter case, i.e. either upper case or lower case.

iii. Stemming - syntactically similar words, such as plurals and verbal variations, are reduced to their stems.

After the preprocessing step, each text element (here, a sentence) is considered as an N-dimensional vector.
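The preprocessing phase can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the stop-word set is a tiny illustrative subset, and `crude_stem` is a deliberately simple suffix stripper standing in for the Porter stemming algorithm named in Figure 6.1, not an implementation of it. The names `STOPWORDS`, `crude_stem` and `preprocess` are hypothetical.

```python
import re

# Illustrative subset only; a real run would load a full stop-word list.
STOPWORDS = {"this", "is", "a", "an", "the", "of", "and", "to", "in"}


def crude_stem(word):
    """Strip one common suffix; a rough stand-in for the Porter stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(sentence):
    """Turn one raw sentence into a list of index terms."""
    text = re.sub(r"[^A-Za-z0-9\s]", " ", sentence)  # special character elimination
    tokens = text.lower().split()                     # case folding and tokenizing
    kept = [t for t in tokens if t not in STOPWORDS]  # stop-word elimination
    return [crude_stem(t) for t in kept]              # stemming


terms = preprocess("This is the storm, moving westward!")
```

Case folding is applied before stop-word elimination in this sketch so that capitalized stop words such as "This" are matched against the lower-case list.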

b) Determining the salient features (Restructuring/Reorganizing)

Sentences in the document are ranked according to their relevance to the document and to the paragraph that contains them. The steps to rank the sentences in the document are:

i. Compute the sentence index term vector, <t_i, f_i>

• Compute the frequency of occurrence (f_i) of each term (t_i) appearing in the sentence

ii. Compute the paragraph index term vector, <t_i, cs_i>

• Select the index term(s) with the highest frequency of occurrence from each sentence of the paragraph for inclusion in the paragraph index term vector

• Compute the frequency of occurrence (cs_i) of each selected term (t_i) as the number of member sentences of the paragraph that contain t_i in their index term vector

iii. If the number of index terms shared between any two paragraphs is greater than or equal to the size of the smaller of their paragraph index term vectors, merge the two paragraphs and re-compute the merged paragraph index term vector, keeping the index terms whose frequency of occurrence is greater than one

iv. Compute the document index term vector, <t_i, cp_i>

• Select the index term(s) (t_i) with the highest frequency of occurrence (cs_i) in each paragraph of the document

• Compute the frequency of occurrence (cp_i) of each selected term (t_i) as the number of paragraphs that contain t_i as an index term in their index term vector

v. Repeat until the required number of sentences/words has been included in the final summary

• Obtain the next highest unique value of term frequency occurrence (cp_x) from the document index term vector

• Select the term(s) {t_i} (T terms in total) from the document index term vector whose frequency of occurrence is greater than or equal to cp_x

• The new score of each sentence in the document is computed as

(6.1)

where

S_k is the score of the k-th sentence in the document

T is the total number of selected terms

cp_i is the frequency of occurrence of term t_i in the document index term vector

A is the adjustment factor, defined as:

(6.2)

vi. Arrange the sentences in decreasing order of their rank score
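Steps i, ii and iv, and the merge test of step iii, can be sketched as follows. This is a hedged reading of the prose above, not the author's implementation: the function names are hypothetical, each vector is held as a plain mapping from term to count, and the scoring of step v is left out because it depends on equations (6.1) and (6.2).

```python
from collections import Counter


def sentence_vector(tokens):
    """Step i: map each term t_i to its frequency f_i in one sentence."""
    return Counter(tokens)


def top_terms(vector):
    """All terms that reach the highest count in a vector."""
    if not vector:
        return set()
    best = max(vector.values())
    return {term for term, count in vector.items() if count == best}


def paragraph_vector(sentence_vectors):
    """Step ii: keep each sentence's top term(s); cs_i counts the
    sentences of the paragraph whose vector contains t_i."""
    selected = set()
    for sv in sentence_vectors:
        selected |= top_terms(sv)
    return {t: sum(1 for sv in sentence_vectors if t in sv) for t in selected}


def should_merge(pv_a, pv_b):
    """Step iii test: shared terms >= size of the smaller vector."""
    shared = set(pv_a) & set(pv_b)
    return len(shared) >= min(len(pv_a), len(pv_b))


def document_vector(paragraph_vectors):
    """Step iv: keep each paragraph's top term(s); cp_i counts the
    paragraphs whose vector contains t_i."""
    selected = set()
    for pv in paragraph_vectors:
        selected |= top_terms(pv)
    return {t: sum(1 for pv in paragraph_vectors if t in pv) for t in selected}


# Two toy paragraphs of already preprocessed sentences.
para1 = [["storm", "storm", "wind"], ["storm", "rain"]]
para2 = [["rain", "rain", "coast"], ["coast", "flood", "flood"]]
pv1 = paragraph_vector([sentence_vector(s) for s in para1])
pv2 = paragraph_vector([sentence_vector(s) for s in para2])
dv = document_vector([pv1, pv2])
```

In the toy data, "rain" is an index term of both paragraphs, so it receives cp = 2 in the document index term vector, while "storm" and "flood" each receive cp = 1.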

c) Synthesizing an appropriate output (Filtering)

Sentences with higher scores are included in the final summary in the order in which they occur in the original text document, to retain their semantic meaning.

Sentences are included in the final summary based on the following rules:

Rule 1 - Sentences are selected for inclusion in the summary in order of rank score, from highest to lowest.

Rule 2 - If more than one sentence in the same paragraph has the same rank score, the sentence appearing earlier in the paragraph is given preference over the sentence appearing later, so that a fixed-length summary can be generated.

Rule 3 - Sentences having the same rank score are selected based on their relative order of occurrence within the original document.

Rule 4 - If two or more sentences from the same paragraph have a non-zero sentence score and also share some index terms between their index term vectors, the rank score of the sentences appearing later in the paragraph is modified by subtracting, from their initial sentence score, the sum of the products of the shared terms and their frequency of occurrence in the document index term vector.

Rule 5 - If sentences of similar paragraphs have a non-zero rank score, the similar paragraphs are treated as a single merged paragraph, and sentence(s) from this merged paragraph are selected following Rules 1, 2 and 4.
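Rules 1 to 3, together with the requirement that selected sentences appear in their original order, can be sketched as below. The sketch assumes the sentence scores have already been computed; the Rule 4 adjustment and the paragraph merging of Rule 5 are not modelled. The function name `select_sentences` and the word-budget stopping test are assumptions made for illustration.

```python
def select_sentences(sentences, scores, word_limit=100):
    """Pick the highest-scoring sentences up to a word budget, then
    return them in their original document order."""
    # Rule 1: highest score first. Rules 2 and 3: ties go to the earlier sentence.
    ranked = sorted(range(len(sentences)), key=lambda k: (-scores[k], k))
    chosen, words = [], 0
    for k in ranked:
        if words >= word_limit:
            break
        chosen.append(k)
        words += len(sentences[k].split())
    # Restore the order of occurrence in the source document.
    return [sentences[k] for k in sorted(chosen)]


summary = select_sentences(
    ["a b c", "d e", "f g h i", "j"], scores=[2, 5, 5, 1], word_limit=5
)
```

Because a sentence is added whenever the budget has not yet been reached, the last sentence selected can push the summary somewhat past `word_limit`.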

The tasks discussed above were implemented on the DUC-2002 dataset to generate an extracted summary of 100 words. The intermediate steps of the algorithm applied to a DUC-2002 document are explained in section 6.3.2. The summaries generated by the proposed summarizer are compared with those of existing extract-based summarization tools in section 6.3.

6.3 Experiments

6.3.1 Experimental setup

The proposed single-document summarization system was evaluated on the standard reference corpus from DUC-2002. DUC-2002 included a single-document summarization task, in which 13 systems participated. 2002 was the last edition of DUC to include a single-document summarization evaluation of informative summaries. The DUC-2002 corpus used for the task contains 567 documents from different sources; 10 assessors were used to provide two 100-word human summaries for each document. We named these two model summaries H1 and H2. The human summary H2 is used as the benchmark against which the quality of our method's summary is measured, while the human summary H1 is used as the reference summary. In addition to the results of the 13 participating systems, the DUC organizers also distributed baseline summaries (the first 100 words of a document). The coverage of all the summaries was assessed by humans [124].


Automatic evaluation measures are used to assess the performance of automatic summarization tools and the quality of the generated summaries. There are different approaches to evaluating the overall quality of a summarization system [74], [100], [117]. In general, there are two categories of evaluation: intrinsic and extrinsic. In intrinsic approaches, the quality of the summarization is evaluated based on an analysis of the content of a summary. In extrinsic approaches, the quality of the summary is measured in a task-based setting, by determining its usefulness as part of an information browsing and access interface.

We used the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) toolkit to compare our system with other single-document, extract-based summarization tools (Copernic, SweSum, Extractor, MSWord AutoSummary tool, Intelligent, Brevity, Pertinence), taking the DUC-2002 (100 word) human summary as the baseline summary. ROUGE 1.5.5 is an intrinsic summarization evaluation toolkit. It calculates how much the tested summary (TEST) overlaps the model summary (MODEL). In ROUGE, a human reference summary is taken as MODEL, and a peer summary generated by a machine or a human is taken as TEST. The ROUGE evaluation measure⁶ (version 1.5.5) generates three scores for each summary: avg. Recall, avg. Precision and avg. F-measure. These measures help to quantify how closely the system's extract corresponds to the human's [9], [89]. ROUGE is the main metric in the DUC text summarization evaluations. Certain ROUGE configurations have been shown to correlate well with DUC coverage [100]. The measure is computed by counting the number of overlapping words between the computer-generated summary to be evaluated and the ideal summaries created by humans. ROUGE has different variants. In our experiment, we use ROUGE-N and ROUGE-L to compute avg. Recall, avg. Precision and avg. F-measure. ROUGE-N is an n-gram overlap measure between a candidate summary and a set of reference summaries.

6[http://www.isi.edu/~cyl/ROUGE]


ROUGE-N was selected because it works well for single-document summarization. N takes values from 1 to 8: ROUGE-1 is unigram-based, ROUGE-2 is bigram-based, and so on. Unigram recall is the proportion of words in X (the reference summary sentence) that are also present in Y (the candidate summary sentence), while unigram precision is the proportion of words in Y that are also in X. ROUGE-L is based on the longest common subsequence (LCS), i.e. the common subsequence of maximum length of the two sequences X and Y. Unigram recall and precision count all co-occurring words regardless of their order, whereas ROUGE-L counts only in-sequence co-occurrences.
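The definitions above can be made concrete with a small sketch for a single reference X and a single candidate Y. It is not the ROUGE 1.5.5 toolkit: it omits stemming, stop-word removal, multiple references and confidence intervals, and the balanced F-measure returned by `rouge_n` is an assumption about how recall and precision are combined. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """Recall, precision and balanced F for n-gram overlap."""
    ref, cand = ngrams(reference, n), ngrams(candidate, n)
    overlap = sum((ref & cand).values())  # clipped co-occurrence count
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    f = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return recall, precision, f


def lcs_length(x, y):
    """Length of the longest common subsequence of x and y."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        cur = [0]
        for j, yj in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if xi == yj else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(reference, candidate):
    """LCS-based recall and precision (in-sequence matches only)."""
    lcs = lcs_length(reference, candidate)
    recall = lcs / len(reference) if reference else 0.0
    precision = lcs / len(candidate) if candidate else 0.0
    return recall, precision


X = "gilbert moved westward at 15 mph".split()       # reference
Y = "gilbert was moving westward at 15 mph".split()  # candidate
r1, p1, f1 = rouge_n(X, Y, n=1)
r2, p2, f2 = rouge_n(X, Y, n=2)
rl, pl = rouge_l(X, Y)
```

For this pair, five of the six reference words appear in the candidate, so unigram recall is 5/6 and unigram precision is 5/7; only three bigrams are shared, which is why scores drop quickly as N grows.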

For data preprocessing, ROUGE-1.5.5 accepts the input formats SEE, SPL and ISI or SIMPLE. We used the SPL input format to evaluate the generated summaries.

When running ROUGE, the following evaluation setup is used:

i. Both model and system summaries are stemmed using the Porter stemmer before the statistics are computed.

ii. Stop words are removed from the model and system summaries before the statistics are computed.

iii. ROUGE-N (N=1 to 8) and ROUGE-L are computed.

iv. All systems are evaluated at the 95% confidence interval.

v. The model average scoring formula is used to compute Recall, Precision and F-measure, since there were 2 model summaries for each document set.

vi. Average ROUGE is computed by averaging sentence (unit) ROUGE scores.

ROUGE is freely available for download for research purposes at: http://www.isi.edu/~cyl/ROUGE.

6.3.2 Experimental Results

The avg. Recall, avg. Precision and avg. F-measure values for ROUGE-N (N=1 to 8) and ROUGE-L, obtained by applying the different summarization tools to the DUC-2002 dataset, are shown below.

Table 6.1: Comparison of different summarization tools: average Recall using ROUGE-1 to ROUGE-8 at the 95% confidence interval

Tool         ROUGE-1  ROUGE-2  ROUGE-3  ROUGE-4  ROUGE-5  ROUGE-6  ROUGE-7  ROUGE-8
Proposed     0.40160  0.22365  0.13257  0.07335  0.04269  0.02871  0.01582  0.00630
SweSum       0.33847  0.15899  0.08302  0.04148  0.02120  0.01210  0.00426  0.00000
Copernic     0.41421  0.20637  0.11124  0.06474  0.03732  0.02357  0.01093  0.00169
Extractor    0.38345  0.19179  0.10402  0.06227  0.03732  0.02357  0.01093  0.00169
Brevity      0.33814  0.15650  0.08279  0.04148  0.02120  0.01210  0.00426  0.00000
Intelligent  0.33975  0.15260  0.07810  0.04326  0.02398  0.01400  0.00696  0.00169
MSWord       0.45635  0.23939  0.13475  0.07495  0.04339  0.02908  0.01416  0.00632
Pertinence   0.25456  0.09172  0.03768  0.01631  0.01110  0.00567  0.00000  0.00000

Figure 6.2. Comparative chart for Recall scores obtained by different summarization tools

[Chart. x-axis: Text Summarizer; y-axis: Recall Measure (0 to 0.45); series: ROUGE-1 to ROUGE-8.]

Table 6.2: Summarization tools listed in decreasing order of average Recall for ROUGE-1 to ROUGE-8 at the 95% confidence interval

Rank  ROUGE-1      ROUGE-2      ROUGE-3      ROUGE-4      ROUGE-5      ROUGE-6      ROUGE-7      ROUGE-8
1     MSWord       MSWord       MSWord       MSWord       MSWord       MSWord       Proposed     MSWord
2     Copernic     Proposed     Proposed     Proposed     Proposed     Proposed     MSWord       Proposed
3     Proposed     Copernic     Copernic     Copernic     Copernic     Copernic     Copernic     Copernic
4     Extractor    Extractor    Extractor    Extractor    Extractor    Extractor    Extractor    Extractor
5     Intelligent  SweSum       SweSum       Intelligent  Intelligent  Intelligent  Intelligent  Intelligent
6     SweSum       Brevity      Brevity      Brevity      Brevity      Brevity      Brevity      Brevity
7     Brevity      Intelligent  Intelligent  SweSum       SweSum       SweSum       SweSum       Pertinence
8     Pertinence   Pertinence   Pertinence   Pertinence   Pertinence   Pertinence   Pertinence   SweSum
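The orderings in Table 6.2 follow from sorting each column of Table 6.1 in decreasing order. The short sketch below reproduces the ROUGE-1 and ROUGE-2 columns from the Recall values copied from Table 6.1; the dictionary layout and the function name `ranking` are choices made for this illustration.

```python
# avg. Recall from Table 6.1: (ROUGE-1, ROUGE-2) for each tool.
recall = {
    "Proposed":    (0.40160, 0.22365),
    "SweSum":      (0.33847, 0.15899),
    "Copernic":    (0.41421, 0.20637),
    "Extractor":   (0.38345, 0.19179),
    "Brevity":     (0.33814, 0.15650),
    "Intelligent": (0.33975, 0.15260),
    "MSWord":      (0.45635, 0.23939),
    "Pertinence":  (0.25456, 0.09172),
}


def ranking(scores, column):
    """Tool names sorted by one score column, highest first."""
    return sorted(scores, key=lambda tool: -scores[tool][column])


rouge1_order = ranking(recall, 0)
rouge2_order = ranking(recall, 1)
```

Both orderings agree with the first two columns of Table 6.2: MSWord leads in each, and the proposed method moves from third place on ROUGE-1 to second place on ROUGE-2.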

Table 6.3: Comparison of different summarization tools: average Precision using ROUGE-1 to ROUGE-8 at the 95% confidence interval

Tool         ROUGE-1  ROUGE-2  ROUGE-3  ROUGE-4  ROUGE-5  ROUGE-6  ROUGE-7  ROUGE-8
Proposed     0.41799  0.23141  0.13640  0.07420  0.04195  0.02803  0.01523  0.00577
SweSum       0.37573  0.17809  0.09396  0.04705  0.02362  0.01354  0.00492  0.00000
Copernic     0.43132  0.21493  0.11613  0.06771  0.03895  0.02481  0.01181  0.00187
Extractor    0.43079  0.21794  0.11798  0.06995  0.03810  0.02420  0.01145  0.00183
Brevity      0.37427  0.17515  0.09501  0.04991  0.02692  0.01602  0.00643  0.00000
Intelligent  0.35308  0.15650  0.08021  0.04461  0.02451  0.01433  0.00721  0.00173
MSWord       0.39189  0.20389  0.11441  0.06397  0.03736  0.02502  0.01224  0.00555
Pertinence   0.25054  0.08747  0.03586  0.01627  0.01107  0.00565  0.00000  0.00000

Figure 6.3. Comparative chart for Precision scores obtained by different summarization tools

[Chart. x-axis: Text Summarizer; y-axis: Precision Measure (0 to 1); series: ROUGE-1 to ROUGE-8.]

Table 6.4: Summarization tools listed in decreasing order of average Precision for ROUGE-1 to ROUGE-8 at the 95% confidence interval

Rank  ROUGE-1      ROUGE-2      ROUGE-3      ROUGE-4      ROUGE-5      ROUGE-6      ROUGE-7      ROUGE-8
1     Copernic     Proposed     Proposed     Proposed     Proposed     Proposed     Proposed     Proposed
2     Extractor    Extractor    Extractor    Extractor    Copernic     MSWord       MSWord       MSWord
3     Proposed     Copernic     Copernic     Copernic     Extractor    Copernic     Copernic     Copernic
4     MSWord       MSWord       MSWord       MSWord       MSWord       Extractor    Extractor    Extractor
5     SweSum       SweSum       Brevity      Brevity      Brevity      Brevity      Intelligent  Intelligent
6     Brevity      Brevity      SweSum       SweSum       Intelligent  Intelligent  Brevity      Brevity
7     Intelligent  Intelligent  Intelligent  Intelligent  SweSum       SweSum       SweSum       Pertinence
8     Pertinence   Pertinence   Pertinence   Pertinence   Pertinence   Pertinence   Pertinence   SweSum

Table 6.5: Comparison of different summarization tools: average F-measure using ROUGE-1 to ROUGE-8 at the 95% confidence interval

Tool         ROUGE-1  ROUGE-2  ROUGE-3  ROUGE-4  ROUGE-5  ROUGE-6  ROUGE-7  ROUGE-8
Proposed     0.40837  0.22676  0.13402  0.07353  0.04220  0.02828  0.01547  0.00602
SweSum       0.35477  0.16729  0.08775  0.04386  0.02223  0.01269  0.00452  0.00000
Copernic     0.42191  0.21017  0.11339  0.06604  0.03803  0.02412  0.01134  0.00178
Extractor    0.39893  0.19968  0.10803  0.06440  0.03766  0.02385  0.01118  0.00176
Brevity      0.35343  0.16431  0.08782  0.04485  0.02341  0.01359  0.00509  0.00000
Intelligent  0.34560  0.15435  0.07908  0.04389  0.02423  0.01416  0.00709  0.00171
MSWord       0.42099  0.21985  0.12354  0.06892  0.04011  0.02687  0.01312  0.00591
Pertinence   0.25131  0.08923  0.03665  0.01627  0.01107  0.00565  0.00000  0.00000

Figure 6.4. Comparative chart for F-measure scores obtained by different summarization tools

[Chart. x-axis: Text Summarizer; y-axis: F-measure (0 to 1); series: ROUGE-1 to ROUGE-8.]

Table 6.6: Summarization tools listed in decreasing order of average F-measure for ROUGE-1 to ROUGE-8 at the 95% confidence interval

Rank  ROUGE-1      ROUGE-2      ROUGE-3      ROUGE-4      ROUGE-5      ROUGE-6      ROUGE-7      ROUGE-8
1     Copernic     Proposed     Proposed     Proposed     Proposed     Proposed     Proposed     Proposed
2     MSWord       MSWord       MSWord       MSWord       MSWord       MSWord       MSWord       MSWord
3     Proposed     Copernic     Copernic     Copernic     Copernic     Copernic     Copernic     Copernic
4     Extractor    Extractor    Extractor    Extractor    Extractor    Extractor    Extractor    Extractor
5     SweSum       SweSum       Brevity      Brevity      Intelligent  Intelligent  Intelligent  Intelligent
6     Brevity      Brevity      SweSum       Intelligent  Brevity      Brevity      Brevity      Brevity
7     Intelligent  Intelligent  Intelligent  SweSum       SweSum       SweSum       SweSum       Pertinence
8     Pertinence   Pertinence   Pertinence   Pertinence   Pertinence   Pertinence   Pertinence   SweSum

Table 6.7: Comparison of different summarization tools: ROUGE-L at the 95% confidence interval

Tool         Avg Recall  Avg Precision  Avg F-measure
Proposed     0.38617     0.40184        0.39264
SweSum       0.32148     0.35763        0.33728
Copernic     0.39835     0.41507        0.40588
Extractor    0.36729     0.41377        0.38240
Brevity      0.32116     0.35531        0.33564
Intelligent  0.32422     0.33785        0.33024
MSWord       0.42567     0.36599        0.39295
Pertinence   0.23355     0.22809        0.22966

According to the Recall results for the different ROUGE-N values given in Tables 6.1 and 6.2, MSWord AutoSummarizer outperforms all the other summarization tools, including our proposed method, for all ROUGE-N (N=1 to 8) values except ROUGE-7. Our proposed summarizer shows the highest avg. Recall for ROUGE-7 among all the summarization tools. For ROUGE-2 to ROUGE-6 and ROUGE-8, our summarizer shows the next highest avg. Recall after MSWord AutoSummarizer. As shown in Figure 6.2, the difference in Recall between our proposed summarizer and MSWord AutoSummarizer shrinks considerably as N grows: from about 0.055 for ROUGE-1 to about 0.016 for ROUGE-2 and to less than 0.003 for ROUGE-3 to ROUGE-8. This slight variation may arise because the length of the summary is not exactly 100 words but varies between 95 and 112 words, depending on the length of the last sentence included in the final summary. Also, the human baseline summaries given in DUC-2002 and the summaries produced by MSWord AutoSummarizer are not purely extractive and sometimes contain a sentence that merges more than one sentence of the original document, whereas the summary generated by our proposed summarizer consists of original sentences from the input text document.

Our proposed method shows the highest avg. Precision for ROUGE-2 to ROUGE-8, as given in Tables 6.3 and 6.4. Likewise, in the avg. F-measure results shown in Tables 6.5 and 6.6, the proposed method outperforms all the other summarization tools for ROUGE-2 to ROUGE-8. MSWord AutoSummarizer shows a lower avg. Precision than our proposed method even for ROUGE-1. The comparative results of avg. Precision and avg. F-measure for the different summarization tools are shown in Figures 6.3 and 6.4 respectively. Table 6.7 shows that Copernic summarizer has the highest ROUGE-L values for avg. Precision and avg. F-measure, and MSWord AutoSummarizer has the highest ROUGE-L value for avg. Recall. For each ROUGE-L measure, our proposed method is slightly below the best value (the difference is less than 0.04).

To summarize, Copernic Summarizer is the highest-scoring tool in terms of ROUGE-L for avg. Precision and avg. F-measure, while MSWord AutoSummarizer scores highest on ROUGE-L avg. Recall (Table 6.7). Similarly, MSWord AutoSummarizer outperforms the other summarization tools in terms of ROUGE-N avg. Recall, except for ROUGE-7. Our proposed method shows the best avg. Precision and avg. F-measure for ROUGE-2 to ROUGE-8 among the listed summarization tools. The proposed method also shows ROUGE-L values comparable to Copernic summarizer and avg. Recall values comparable to MSWord AutoSummarizer.

To further evaluate the quality of the summaries produced by our proposed text summarizer, we compared the summaries produced by MSWord (which shows the best Recall) and by our text summarizer on 30 documents on different topics (computers, mobiles, festivals, etc.). On analyzing the summaries, we found that our text summarizer includes more meaningful information.

6.3.3 Discussion

A. A sample document from DUC-2002, d061j (AP880911-0016), and its 100-word summaries obtained from the different summarizing tools (our proposed summarizing model, the DUC-2002 human summaries (H2-H1), Copernic, SweSum, Extractor, MSWord AutoSummary, Intelligent, Brevity and Pertinence text summarizers) are given below for reference. The bold and italic text in each summary shows the lines of the respective text summarizer that match the human-generated summary given in DUC-2002.

Hurricane Gilbert swept toward the Dominican Republic Sunday, and the Civil

Defense alerted its heavily populated south coast to prepare for high winds, heavy

rains and high seas. The storm was approaching from the southeast with sustained

winds of 75 mph gusting to 92 mph.

``There is no need for alarm,'' Civil Defense Director Eugenio Cabral said in a

television alert shortly before midnight Saturday.

Cabral said residents of the province of Barahona should closely follow Gilbert's

movement. An estimated 100,000 people live in the province, including 70,000 in

the city of Barahona, about 125 miles west of Santo Domingo.

Tropical Storm Gilbert formed in the eastern Caribbean and strengthened into a

hurricane Saturday night. The National Hurricane Center in Miami reported its

position at 2 a.m. Sunday at latitude 16.1 north, longitude 67.5 west, about 140 miles

south of Ponce, Puerto Rico, and 200 miles southeast of Santo Domingo.

The National Weather Service in San Juan, Puerto Rico, said Gilbert was moving

westward at 15 mph with a ``broad area of cloudiness and heavy weather'' rotating

around the center of the storm.

The weather service issued a flash flood watch for Puerto Rico and the Virgin

Islands until at least 6 p.m. Sunday.

Strong winds associated with the Gilbert brought coastal flooding, strong southeast

winds and up to 12 feet feet to Puerto Rico's south coast. There were no reports of

casualties.

San Juan, on the north coast, had heavy rains and gusts Saturday, but they subsided

during the night.

On Saturday, Hurricane Florence was downgraded to a tropical storm and its

remnants pushed inland from the U.S. Gulf Coast. Residents returned home, happy

to find little damage from 80 mph winds and sheets of rain.

Florence, the sixth named storm of the 1988 Atlantic storm season, was the second

hurricane. The first, Debby, reached minimal hurricane strength briefly before

hitting the Mexican coast last month.

Figure 6.5: DUC-2002 d061j ( AP880911-0016 ) Original text Document

Tropical Storm Gilbert in the eastern Caribbean strengthened into a hurricane Saturday night.

The National Hurricane Center in Miami reported its position at 2 a.m. Sunday to be about 140 miles south of Puerto Rico and 200 miles southeast of Santo Domingo.

It is moving westward at 15mph with a broad area of cloudiness and heavy weather with sustained winds of 75mph gusting to 92mph.

The Dominican Republic's Civil Defense alerted that country's heavily populated south coast and the National Weather Service in San Juan, Puerto Rico issued a flood watch for Puerto Rico and the Virgin Islands until at least 6 p.m. Sunday.

Figure 6.6: d061j ( AP880911-0016 ) Human Summary (H1)

Hurricane Gilbert is moving toward the Dominican Republic , where the residents

of the south coast , especially the Barahona Province , have been alerted to prepare

for heavy rains , and high winds and seas .

Tropical Storm Gilbert formed in the eastern Caribbean and became a hurricane on

Saturday night .

By 2 a.m. Sunday it was about 200 miles southeast of Santo Domingo and moving

westward at 15 mph with winds of 75 mph .

Flooding is expected in Puerto Rico and the Virgin Islands .

The second hurricane of the season , Florence , is now over the southern United

States and downgraded to a tropical storm .

Figure 6.7: d061j ( AP880911-0016 ) Human Summary (H2)

** Shaded and underlined text marks the words that summaries H1 and H2 share. The two human-generated summaries have very few lines of the original document in common.



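Hurricane Gilbert swept toward the Dominican Republic Sunday , and the Civil

Defense alerted its heavily populated south coast to prepare for high winds , heavy

rains and high seas .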

The storm was approaching from the southeast with sustained winds of 75 mph

gusting to 92 mph .

There is no need for alarm,' Civil Defense Director Eugenio Cabral said in a television

alert shortly before midnight Saturday. Cabral said residents of the province of

Barahona should closely follow Gilbert's movement. An estimated 100,000 people live

in the province, including 70,000 in the city of Barahona, about 125 miles west of

Santo Domingo.

Figure 6.8: d061j ( AP880911-0016 ) Summary obtained from Brevity text Summarizer

Hurricane Gilbert swept toward the Dominican Republic Sunday , and the Civil

Defense alerted its heavily populated south coast to prepare for high winds , heavy



gusting to 92 mph .


hurricane Saturday night .

The National Weather Service in San Juan , Puerto Rico , said Gilbert was moving

westward at 15 mph with a ` ` broad area of cloudiness and heavy weather ' ' rotating

around the center of the storm .

Figure 6.9 : d061j ( AP880911-0016 ) Summary obtained from Copernic text Summarizer





gusting to 92 mph .




westward at 15 mph with a ` ` broad area of cloudiness and heavy weather ' ' rotating


Figure 6.10: d061j ( AP880911-0016 ) Summary obtained from Extractor text Summarizer





gusting to 92 mph .



The National Hurricane Center in Miami reported its position at 2 a.m. Sunday at

latitude 16.1 north , longitude 67.5 west , about 140 miles south of Ponce , Puerto

Rico , and 200 miles southeast of Santo Domingo .

The National Weather Service in San Juan , Puerto Rico, said Gilbert was moving

westward at 15 mph with a `` broad area of cloudiness and heavy weather '' rotating


Figure 6.11: d061j ( AP880911-0016 ) Summary obtained from Intelligent text Summarizer



rains and high seas.


gusting to 92 mph.


hurricane Saturday night. Strong winds associated with the Gilbert brought coastal

flooding, strong southeast winds and up to 12 feet feet to Puerto Rico's south coast.

Florence, the sixth named storm of the 1988 Atlantic storm season, was the second

hurricane.

Figure 6.12: d061j ( AP880911-0016 ) Summary obtained from MSWord AutoSummarizer





gusting to 92 mph .

`` There is no need for alarm , '' Civil Defense Director Eugenio Cabral said in a

television alert shortly before midnight Saturday .

Cabral said residents of the province of Barahona should closely follow Gilbert ' s

movement .

An estimated 100,000 people live in the province , including 70,000 in the city of

Barahona , about 125 miles west of Santo Domingo .

Figure 6.13: d061j ( AP880911-0016 ) Summary obtained from SweSum text Summarizer

``There is no need for alarm,'' Civil Defense Director Eugenio Cabral said in a

television alert shortly before midnight Saturday. Cabral said residents of the province

of Barahona should closely follow Gilbert's movement.

An estimated 100,000 people live in the province, including 70,000 in the city of

Barahona, about 125 miles west of Santo Domingo. Tropical Storm Gilbert formed in

the eastern Caribbean and strengthened into a hurricane Saturday night.

On Saturday, Hurricane Florence was downgraded to a tropical storm and its remnants

pushed inland from the U.S. Gulf Coast. Residents returned home, happy to find little

damage from 80 mph winds and sheets of rain.

Figure 6.14: d061j ( AP880911-0016 ) Summary obtained from Pertinence text Summarizer



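Hurricane Gilbert swept toward the Dominican Republic Sunday, and the Civil

Defense alerted its heavily populated south coast to prepare for high winds, heavy

rains and high seas.

The storm was approaching from the southeast with sustained winds of 75 mph

gusting to 92 mph.

Tropical Storm Gilbert formed in the eastern Caribbean and strengthened into a

hurricane Saturday night.

The National Weather Service in San Juan, Puerto Rico, said Gilbert was moving

westward at 15 mph with a ``broad area of cloudiness and heavy weather'' rotating

around the center of the storm.
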
On Saturday, Hurricane Florence was downgraded to a tropical storm and its

remnants pushed inland from the U.S. Gulf Coast.

Florence, the sixth named storm of the 1988 Atlantic storm season, was the

second hurricane.

Figure 6.15: d061j ( AP880911-0016 ) Summary obtained from our Proposed Text Summarizer

** Comparing the summaries obtained from the different summarization tools with the two human-generated summaries H1 and H2 (shown in bold and italics) shows that sentences 1, 2 and 8 occur most frequently among the sixteen sentences of the document in the summaries listed above. This indicates that sentences 1, 2 and 8 are the most relevant sentences of the document and should be included in the final summary. The summary generated by our proposed summarizer includes all three sentences, as shown in figure 6.15, while sentence 8 is missing from the summary obtained from MSWord AutoSummarizer.

To generate the extracted summary of 100 words, our proposed text summarizer computes the document, paragraph and sentence index term vectors shown below:

Document Name: d061j-AP880911-0016

Document Index term vector - < {storm, hurrican, coast}3, {wind, weather, tropic, mph, heavi}2 >

** The subscript integer value represents the frequency of occurrence of the corresponding terms in the index term vector.
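
The construction of an index term vector can be illustrated with a minimal sketch. It assumes the terms have already been stop-word filtered and stemmed (the stemmer is not shown), and the function names `index_term_vector` and `group_by_frequency` are hypothetical, not taken from the proposed system. The input is the stemmed content of sentence 2 of the document.

```python
from collections import Counter

def index_term_vector(stems):
    """Count how often each stemmed term occurs."""
    return Counter(stems)

def group_by_frequency(vector, min_freq=1):
    """Group terms by frequency, highest first, mirroring < {..}3, {..}2 >."""
    groups = {}
    for term, freq in vector.items():
        if freq >= min_freq:
            groups.setdefault(freq, []).append(term)
    return [(freq, sorted(groups[freq])) for freq in sorted(groups, reverse=True)]

# Stemmed terms of sentence 2 (assumed pre-processed; "mph" occurs twice).
sentence_2_stems = ["storm", "approach", "southeast", "sustain",
                    "wind", "mph", "gust", "mph"]

vector = index_term_vector(sentence_2_stems)
grouped = group_by_frequency(vector)
for freq, terms in grouped:
    print(freq, terms)
```

The grouping step only changes the presentation: terms with equal frequency are listed together, as in the vectors of this section.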

Table 6.8: Sentence Index term vector for document d061j-AP880911-0016

| Paragraph No. | Sentence No. | Sentence Index Term Vector | Sentence Score (1st Iteration) | Modified Sentence Score | No. of words | Included in Summary (Y/N) |
|---|---|---|---|---|---|---|
| 1 | 1 | < {heavi}2, {wind, swept, south, sea, republ, rain, prepar, popul, hurrican, gilbert, dominican, defens, coast, civil, alert}1 > | 6 | 6 | 28 | Y |
| 2 | 2 | < {mph}2, {wind, sustain, storm, southeast, gust, approach}1 > | 3 | 3 | 17 | Y |
| 3 | 3 | < {televis, shortli, midnight, eugenio, director, defens, civil, cabral, alert, alarm}1 > | 0 | 0 | - | N |
| 4 | 4 | < {resid, provinc, movement, gilbert, follow, close, cabral, barahona}1 > | 0 | 0 | - | N |
| 4 | 5 | < {west, santo, provinc, peopl, mile, live, includ, estim, domingo, citi, barahona}1 > | 0 | 0 | - | N |
| 5 | 6 | < {tropic, strengthen, storm, night, hurrican, gilbert, form, eastern, Caribbean}1 > | 6 | 6 | 15 | Y |
| 5 | 7 | < {mile}2, {west, southeast, south, santo, rico, report, puerto, posit, ponc, north, nation, miami, longitud, latitud, hurrican, domingo, center}1 > | 3 | 2 | - | N |
| 6 | 8 | < {weather}2, {westward, storm, servic, san, rotat, rico, puerto, nation, mph, move, juan, heavi, gilbert, cloudi, center}1 > | 3 | 3 | 38 | Y |
| 7 | 9 | < {weather, watch, virgin, servic, rico, puerto, issu, island, flood, flash}1 > | 0 | 0 | - | N |
| 8 | 10 | < {wind, strong, feet}2, {southeast, south, rico, puerto, gilbert, flood, coastal, coast, brought, associ}1 > | 3 | 3 | 23 | N |
| 8 | 11 | < {report, casualty}1 > | 0 | 0 | - | N |
| 9 | 12 | < {subsid, san, rain, north, night, juan, heavi, gust, coast}1 > | 3 | 2 | 18 | N |
| 10 | 13 | < {tropic, storm, remnant, push, inland, hurrican, gulf, florenc, downgrad, coast}1 > | 9 | 9 | 20 | Y |
| 10 | 14 | < {wind, sheet, return, resid, rain, mph, littl, home, happi, damage}1 > | 0 | 0 | - | N |
| 11 | 15 | < {storm}2, {sixth, season, name, hurrican, florenc, atlant}1 > | 9 | 9 | 15 | Y |
| 11 | 16 | < {strength, reach, month, minim, mexican, hurrican, hit, debbi, coast, briefly}1 > | 6 | 5 | 15 | N |

Table 6.9: Paragraph Index term vector for document d061j-AP880911-0016

| Paragraph No. | No. of sentences per paragraph | Paragraph Index Term Vector | Similar to paragraph No. |
|---|---|---|---|
| 1 | 1 | < {heavi}1 > | 9 |
| 2 | 1 | < {mph}1 > | - |
| 3 | 1 | < {alarm, alert, cabral, civil, defens, director, eugenio, midnight, shortli, televis}1 > | - |
| 4 | 2 | < {barahona, provinc}2, {west, santo, resid, peopl, movement, mile, live, includ, gilbert, follow, estim, domingo, close, citi, cabral}1 > | - |
| 5 | 2 | < {caribbean, eastern, form, gilbert, hurrican, mile, night, storm, strengthen, tropic}1 > | - |
| 6 | 1 | < {weather}1 > | - |
| 7 | 1 | < {flash, flood, island, issu, puerto, rico, servic, virgin, watch, weather}1 > | - |
| 8 | 2 | < {casualti, feet, report, strong, wind}1 > | - |
| 9 | 1 | < {coast, gust, heavi, juan, night, north, rain, san, subsid}1 > | 1 |
| 10 | 2 | < {coast, damag, downgrad, florenc, gulf, happi, home, hurrican, inland, littl, mph, push, rain, remnant, resid, return, sheet, storm, tropic, wind}1 > | - |
| 11 | 2 | < {briefli, coast, debbi, hit, hurrican, mexican, minim, month, reach, storm, strength}1 > | - |

Document d061j-AP880911-0016 contains 11 paragraphs, 16 sentences and 317 words. To generate a summary of 100 words, the proposed algorithm computes the document, paragraph and sentence index term vectors. The score of each sentence is computed using equations (6.1) and (6.2), and the selection of non-zero-score sentences for the final summary follows rules 1 to 5 listed in section 6.2. Paragraphs 1 and 9 are merged because they are found to be similar, as shown in table 6.9. The sentence scores are then modified based on rule 4.
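
The paragraph comparison behind the merge of paragraphs 1 and 9 can be illustrated with a small sketch. Cosine similarity is assumed here as the Vector Space Model measure; the measure and any threshold the chapter actually applies are defined in earlier sections and are not restated here. The function name `cosine_similarity` is hypothetical, and the vectors are the paragraph index term vectors of paragraphs 1, 2 and 9 from table 6.9.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term-frequency vectors (dicts)."""
    dot = sum(freq * vec_b.get(term, 0) for term, freq in vec_a.items())
    norm_a = math.sqrt(sum(f * f for f in vec_a.values()))
    norm_b = math.sqrt(sum(f * f for f in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Paragraph index term vectors taken from table 6.9.
paragraph_1 = {"heavi": 1}
paragraph_2 = {"mph": 1}
paragraph_9 = dict.fromkeys(
    ["coast", "gust", "heavi", "juan", "night", "north", "rain", "san", "subsid"], 1
)

sim_1_9 = cosine_similarity(paragraph_1, paragraph_9)
sim_1_2 = cosine_similarity(paragraph_1, paragraph_2)
print(round(sim_1_9, 3), sim_1_2)
```

Paragraphs 1 and 9 share the term heavi and therefore get a non-zero similarity, while paragraphs 1 and 2 share no term and score zero.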

From the sentence rank scores shown in table 6.8, sentences 13 and 15 have the highest rank score of 9, since all three highest-frequency terms of the paragraph index term vector <storm, hurrican, coast> match the sentence index term vector.

Using equations (6.1) and (6.2), the final score of sentence 13 is computed as shown below:

Score(sentence 13) = cp_storm + cp_hurrican + cp_coast = 3 + 3 + 3 = 9
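
The score computation can be sketched as follows, reading the score as a sum of term frequencies, which is the reading under which three values of 3 give 9. Equations (6.1) and (6.2) are not restated in this section, so the function below is an assumption rather than the chapter's definition: it adds the document-level frequency of each top term once per occurrence in the sentence, a choice made because it reproduces the first-iteration scores listed in table 6.8. The names `TOP_TERMS` and `sentence_score` are hypothetical.

```python
# Document-level frequencies (cp values) of the three highest-frequency terms,
# taken from the document index term vector.
TOP_TERMS = {"storm": 3, "hurrican": 3, "coast": 3}

def sentence_score(sentence_vector, top_terms=TOP_TERMS):
    """Sum the document frequency of each top term per occurrence in the sentence.

    Assumed additive form; not a restatement of equations (6.1) and (6.2).
    """
    total = 0
    for term, count in sentence_vector.items():
        if term in top_terms:
            total += top_terms[term] * count
    return total

# Sentence index term vectors from table 6.8.
sentence_13 = dict.fromkeys(
    ["tropic", "storm", "remnant", "push", "inland",
     "hurrican", "gulf", "florenc", "downgrad", "coast"], 1
)
sentence_15 = {"storm": 2, "sixth": 1, "season": 1, "name": 1,
               "hurrican": 1, "florenc": 1, "atlant": 1}
sentence_3 = dict.fromkeys(
    ["televis", "shortli", "midnight", "eugenio", "director",
     "defens", "civil", "cabral", "alert", "alarm"], 1
)

for name, vec in [("13", sentence_13), ("15", sentence_15), ("3", sentence_3)]:
    print("sentence", name, "score", sentence_score(vec))
```

Sentence 3 contains none of the three top terms, which is why it scores 0 and is not a candidate for the summary.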

Sentences 1 and 6 have the next highest rank score of 6 and so are selected for inclusion in the final summary. Since the selected sentences contain only 78 words in total (28+15+20+15), which is less than the required summary length of 100 words, sentences with the next highest rank score, i.e. 3, are selected following rules 1-5.

The second highest sentence score, 6, is taken by sentences 1, 6, 15 and 16. So these sentences are considered for the final summary, making a total of 88 (28+15+20+15+15) words in the final summary. Since the number of words in the final summary is still less than 100, the next sentence score, 3, is picked and sentences having a score of 3 are selected following rules 1-5. Although sentence 11 has a sentence score of 3, it is not included in the final summary, following rule 3. Sentence 7, which belongs to paragraph 5, initially has a rank score of 3, but the value is modified following rule 4 since it shares one term (hurrican) with sentence 6 of the same paragraph. So the modified score of sentence 7 is computed as shown below:

Score(sentence 7) = 3 - cp_hurrican = 3 - 3 = 0

Sentences with a sentence score of 0 are not included in the document summary.

B. Document-A (Topic: Plastic Bags)

Document A on the topic "Plastic Bags" is shown below along with the summaries obtained from our summarizing model and from MSWord AutoSummarizer.

Every once in a while the government here passes out an order banning shop keepers

from providing plastic bags to customers for carrying their purchases, with little lasting

effect. Plastic bags are very popular with both retailers as well as consumers because

they are cheap, strong, lightweight, functional, as well as a hygienic means of carrying

food as well as other goods. Even though they are one of the modern conveniences that

we seem to be unable to do without, they are responsible for causing pollution, killing

wildlife, and using up the precious resources of the earth.

About a hundred billion plastic bags are used each year in the US alone. And then, when

one considers the huge economies and populations of India, China, Europe, and other

parts of the world, the numbers can be staggering. The problem is further exacerbated by

the developed countries shipping off their plastic waste to developing countries like

India.

Here are some of the harmful effects of plastic bags:

Plastic bags litter the landscape. Once they are used, most plastic bags go into landfill,

or rubbish tips. Each year more and more plastic bags are ending up littering the

environment. Once they become litter, plastic bags find their way into our waterways,

parks, beaches, and streets. And, if they are burned, they infuse the air with toxic fumes.

Plastic bags kill animals. About 100,000 animals such as dolphins, turtles whales,

penguins are killed every year due to plastic bags. Many animals ingest plastic bags,

mistaking them for food, and therefore die. And worse, the ingested plastic bag remains

intact even after the death and decomposition of the animal. Thus, it lies around in the

landscape where another victim may ingest it.

Plastic bags are non-biodegradable. And one of the worst environmental effects of

plastic bags is that they are non-biodegradable. The decomposition of plastic bags takes

about 1000 years.

Petroleum is required to produce plastic bags. As it is, petroleum products are

diminishing and getting more expensive by the day, since we have been using this non-

renewable resource increasingly. Petroleum is vital for our modern way of life. It is

necessary for our energy requirements – for our factories, transport, heating, lighting,

and so on. Without viable alternative sources of energy yet on the horizon, if the supply

of petroleum were to be turned off, it would lead to practically the whole world grinding

to a halt. Surely, this precious resource should not be wasted on producing plastic bags,

should it?

So, What Can be Done about the Use of Plastic Bags?

Single-use plastic bags have become such a ubiquitous way of life that it seems as if we

simply cannot do without them. However, if we have the will, we can start reducing

their use in small ways. A tote bag can make a good substitute for holding the shopping.

You can keep the bag with the cahier, and then put your purchases into it instead of the

usual plastic bag. Recycling the plastic bags you already have is another good idea.

These can come into use for various purposes, like holding your garbage, instead of

purchasing new ones.

Figure 6.16: Original Document-A "Plastic Bags"


Plastic bags litter the landscape. Once they are used, most plastic bags go into

landfill, or rubbish tips. Plastic bags kill animals. Many animals ingest plastic

bags, mistaking them for food, and therefore die. Plastic bags are non-

biodegradable. The decomposition of plastic bags takes about 1000 years.

Petroleum is required to produce plastic bags. Surely, this precious resource

should not be wasted on producing plastic bags, should it?

So, What Can be Done about the Use of Plastic Bags?

Recycling the plastic bags you already have is another good idea.

Figure 6.17: Document-A Summary Obtained from MSWord AutoSummarizer

Every once in a while the government here passes out an order banning shop keepers

from providing plastic bags to customers for carrying their purchases , with little

lasting effect .

About a hundred billion plastic bags are used each year in the US alone.


Plastic bags litter the landscape.

Plastic bags kill animals.

Plastic bags are non-biodegradable.

Petroleum is required to produce plastic bags.

So , What Can be Done about the Use of Plastic Bags.

Single-use plastic bags have become such a ubiquitous way of life that it seems as if

we simply cannot do without them.

Figure 6.18: Document-A Summary Obtained from our Proposed text Summarizer

Consider the following two sentences in the original document shown in figure 6.16:

"Plastic bags kill animals."

"Many animals ingest plastic bags, mistaking them for food, and therefore die."

Both lines convey the same meaning, so only one of them should be included in the summary. The summary generated by MSWord AutoSummarizer, shown in figure 6.17, includes both lines, while our proposed text summarizer avoids such redundancy and includes only the first line in the final summary, as shown in figure 6.18. This improves the quality of the generated summary, since more information fits into a summary of fixed length.

6.4 Conclusion of this chapter

In this chapter, a new generic, extract-based, single-document summarization approach is proposed, based on statistical heuristics using the Vector Space Model. The generated summary clearly depicts the different topics included in the text document and shows the linkage between the sentences of the summary. The method is also independent of the structure of the text document and of the position of a sentence within the document: a sentence appearing later in the document can be included in the summary according to its importance within its paragraph. The method further reduces redundancy among sentences and paragraphs and hence allows more information to be included in a summary of fixed length. The proposed summarization approach is evaluated on the DUC-2002 corpus and shows satisfactory results when compared to all the reported summarization tools in terms of avg. Recall, avg. Precision and avg. F-measure for different values of ROUGE-N (N = 1 to 8) and ROUGE-L. Our proposed method shows the best results for avg. Precision and avg. F-measure among the discussed summarization tools. It also shows ROUGE-L values comparable to the Copernic summarizer and avg. Recall values comparable to MSWord AutoSummarizer.
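
For reference, the ROUGE-N recall, precision and F-measure mentioned above can be sketched as clipped n-gram overlap between a candidate summary and a reference summary. This is a simplified illustration, not the ROUGE package used in the evaluation: it assumes whitespace tokenization and lower-casing only, with no stemming or stop-word handling, and the function names `ngrams` and `rouge_n` are hypothetical. The two example strings are made up for the illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """Return (recall, precision, f_measure) based on clipped n-gram overlap."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # each n-gram counted at most min(cand, ref) times
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    if precision + recall == 0:
        return recall, precision, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure

candidate = "the storm moved west"
reference = "the storm moved westward at 15 mph"
recall, precision, f_measure = rouge_n(candidate, reference, n=1)
print(recall, precision, f_measure)
```

Three of the four candidate unigrams appear among the seven reference unigrams, so recall is 3/7 and precision is 3/4 in this example.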
