An Automatic Multidocument Text Summarization Approach Based ...

Research ArticleAn Automatic Multidocument Text Summarization ApproachBased on Nave Bayesian Classifier Using Timestamp Strategy

Nedunchelian Ramanujam1 and Manivannan Kaliappan2

1Department of Computer Science and Engineering Sri Venkateswara College of EngineeringPennalur Sriperumbudur TK 602117 India2Department of Information Technology RMK Engineering College Kavaraipettai 601206 India

Correspondence should be addressed to Nedunchelian Ramanujam nedunsvceacin

Received 20 October 2015 Revised 5 January 2016 Accepted 13 January 2016

Academic Editor Juan M Corchado

Copyright copy 2016 N Ramanujam and M Kaliappan This is an open access article distributed under the Creative CommonsAttribution License which permits unrestricted use distribution and reproduction in any medium provided the original work isproperly cited

Nowadays automatic multidocument text summarization systems can successfully retrieve the summary sentences from the inputdocuments But it has many limitations such as inaccurate extraction to essential sentences low coverage poor coherence amongthe sentences and redundancy This paper introduces a new concept of timestamp approach with Naıve Bayesian Classificationapproach for multidocument text summarization The timestamp provides the summary an ordered look which achieves thecoherent looking summary It extracts the more relevant information from the multiple documents Here scoring strategy isalso used to calculate the score for the words to obtain the word frequency The higher linguistic quality is estimated in terms ofreadability and comprehensibility In order to show the efficiency of the proposed method this paper presents the comparisonbetween the proposed methods with the existing MEAD algorithm The timestamp procedure is also applied on the MEADalgorithm and the results are examined with the proposed method The results show that the proposed method results in lessertime than the existing MEAD algorithm to execute the summarization process Moreover the proposed method results in betterprecision recall and 119865-score than the existing clustering with lexical chaining approach

1 Introduction

Data mining is the domain in which rapid changes areevolved in the recent years due to the enormous advancesin the software and hardware technology The advancementleads to the availability of various kinds of data which isespecially suitable for the instance of text data The softwareand hardware platforms used for the social networks andweb have facilitated the rapid generation of huge repositoriesof various types of data Generally the structured data aremanaged with the help of database system whereas text dataare generally managed by search engine due to the lack ofstructures The search engine allows the web user to identifythe necessary information from the collected works suitablywith the help of keyword query Text summarization isdefined as the process of refining themost useful information

from the source document to provide an abridged versionfor the specific task This paper focuses on developing amultidocument text summarization approach

Multidocument summarization (MDS) is an automaticprocess where the essential information is extracted fromthe multiple input documents Many approaches are alreadyproposed on retrieving the summary from the single ormultiple documents Single and multiple document sum-marization approaches have many challenges The mainproblem in MDS occurred due to the collection of mul-tiple resources from where the data is extracted becauseit may contain the risk of higher redundant informationthan it is generally found in single document Moreover theordering of extracted information into the coherent text toformulate a coherent summary is a nontrivial task Summa-rization can be executed as either abstractive or extractive

Hindawi Publishing Corporatione Scientific World JournalVolume 2016 Article ID 1784827 10 pageshttpdxdoiorg10115520161784827

2 The Scientific World Journal

Abstractive summarization generally requires informationcombination sentence compression and reformulation It iscomplex problem due to the deeper analysis of the inputdocuments and the concept-to-text formulation Extractivesummarization includes assigning saliency factors to someunits (paragraph sentences) of the documents The sentenceextraction is based on the highest score to include in the finalsummary

Nowadays most of the researchers focus their research inautomatic text summarization are extractive summarizationSome of the basic extractive processes are as follows

(i) Coverage extraction plays a major role in textsummarization process It recognizes the necessaryinformation that covers the diverse topics in inputdocuments It can be applied at any level of textparagraphs such as sentence paragraph word andphrase Numerous algorithms have been proposed torecognize the most important information from theinput documents

(ii) Coherency optimal ordering of retrieved sentences toformulate the coherent context flow is the complexissue In single document text summarization oneprobable ordering sentence is given by the inputtext document itself Still this process is a nontrivialtask In MDS the classification is categorized intotwo approaches learning the natural sequence ofthe sentence from the huge corpora and the use ofchronological information

(iii) Redundancy elimination due to the length limitationneeded for an effective summary and the existenceof the extracted sentences which contains the sameinformation most of the approaches utilize the simi-larity measure to identify the duplication informationfrom the input documents

This paper introduces an automatic text summarizationapproach to overcome the difficulties in the existing sum-marization approaches Here Naıve Bayesian Classificationapproach is utilized to identify the necessary keywordsfrom the text Bayes method is machine learning methodto estimate the distinguishing keyword features in a textand retrieves the keyword from the input based on thisinformation The features are generally independent anddistributed Scoring is estimated for the retrieved sentence tocompute the word frequency The combination of this NaıveBayesian scoring and timestamp concept helps to improvethe summarization accuracy The proposed summarizationmethod achieves better coverage and coherence using theNaıve Bayesian classifier and the concept of timestamphence it automatically eliminates the redundancy in the inputdocuments

The remainder of this paper is organized as followsSection 2 summarizes the related works in the multidoc-ument text summarization Section 3 shows the proposedNaıve Bayesian based text summarization approach and theintroduction of timestamp strategy Section 4 describes theperformance analysis And finally the paper ends with theconclusion and future work at Section 5

2 Related Works

This section discusses some of the existing MDS approachesRadev et al [1] discussed the MEAD algorithm which isthe platform for multidocument text summarization Thispaper provides the functionality of the MEAD a publicdomain comprehensive open source and multidocumentmultilingual summarization It has been widely used bymore than 500 companies and organizations It has beenapplied in various applications ranging from the mobilephone technologies to web page for summarization andalso for novelty identification Ma and Wu [2] combined 119899-gram and dependency word pair for multidocument sum-marization The dependency work pair defines the syntacticrelationships among the words Each feature reproduces thecooccurrence of the common topics of multidocuments indiverse perspective The sentence score was estimated basedon theweighted sumof the features Finally the summarywasformulated by retrieving the salient sentences based on thehigher significance score model

Alguliev et al [3] designed an evolutionary optimizationalgorithm formultidocument summarizationThis algorithmcreates a summary by collecting the salient sentence fromthemultiple documentsThis approach utilizes the summary-to-document collection sentence-to-document collectionand sentence-to-sentence collection to choose the mostimportant sentences from the multiple documents Henceit reduces the redundancy in the document summaryAccording to the individual fitness value the algorithmcan adaptively alter the crossover rate The authors in [4]also proposed constraint driven document summarizationmodelsThis approach was developed based on the followingtwo constraints (1) diversity in summarization to reducethe redundancy in the summary among the sentences (2)sufficient coverage to avoid the loss of document majorinformation to generate the summary In order to solvethe Quadratic Integer Programming (QIP) problem thisapproach utilized a discrete Particle Swarm Optimization(PSO) algorithm

Baruque and Corchado [5] presented a weighted votingsummation of Self-Organization Maps (SOMs) ensemblesWeighted voting superposition was used for the outcomesof an ensemble of SOM The objective of this algorithm wasto attain the minimal topographic error in the map Hencea weighted voting process was done among the neuronsto calculate the properties of the neurons located in theresulting map Xiong and Lu [6] introduced an approachfor multidocument summarization using Latent SemanticAnalysis (LSA) Among the existing multidocument summa-rization approaches the LSA was a unique concept whichuses the latent semantic information rather than the originalfeatures It has chosen the sentence individually to remove theredundant sentences Su and Xiaojun [7] discussed an extrac-tive multidocument summarization This approach uses thesemantic role information to improve the graph based rank-ing algorithm for summarization The sentence was parsedto obtain the semantic roles The SRRank algorithm wasproposed to rank the sentence words and semantic rolesimultaneously in a heterogeneous ranking manner

The Scientific World Journal 3

Yang et al [8] introduced a contextual topic modelfor multidocument summarization This method uses theconcept of 119899-grams into latent topics to extract the worddependencies placed in the local context of the word Heuet al [9] presented a multidocument summarization basedon social folksonomy These approaches use the machinelearning and probability based approach to summarize themultiple documents This approach includes the tag clustersused by a folksonomy Flickr system to detect the sentencesused in multiple documents A word frequency table wascreated to examine the contributions and semantics of wordswith the help of HITS algorithm The tag clusters wereexploited to analyze the semantic link among the words in thefrequency table At last a summary was created by examiningthe importance of every word and its semantic relationshipsto each other

Glavas and Snajder [10] proposed event graphs forinformation retrieval and multidocument summarizationAn event based document representation approach wasintroduced to filter and structure the details about the eventsexplained in the text Rule based models and machine learn-ing were integrated to extract the sentence level event andevaluate the temporal relations among them An informationretrieval approach was used to measure the similarity amongthe documents and queries by estimating the graph kernelsacross event graphs Ferreira et al [11] designed a multidocu-ment summarization model based on linguistic and statistictreatment This approach extracts the major concern of setof documents to avoid the problems of this kind of summa-rization such as diversity and information redundancy It wasobtained with the help of clustering algorithm which usesthe statistic similarities and linguistic treatment Meena andGopalani [12] designed a domain independent framework forautomatic text summarization This method was applied forboth the abstractive and extractive text summarization Thesource document text was categorized and set of rules areapplied for summarization

Sankarasubramaniam et al [13] introduced a text sum-marization using Wikipedia This approach constructs abipartite sentence concept graph and the input sentenceswereranked based on the iterative updates Here a personalizedand query focused summarization was considered for userqueries and their interests The Wikipedia based multi-document summarization algorithm was proposed whichpermits the real time incremental summarization Khan etal [14] presented a framework formultidocument abstractivesummarization based on semantic role labelling Labellingwas used for semantic illustration of text The semanticsimilarity measure was applied on the source text to clusterthe semantically similar sentences Finally they were rankedbased on the features weighted using Genetic Algorithm(GA) Erkan and Radev [15] proposed a Lexpage rankapproach formultidocument text summarization A sentenceconnectivity matrix was formulated based on the cosinesimilarity function If the cosine similarity among the twosentences goes beyond a specific threshold then the edge wasappended to the connectivity matrix

Zheng et al [17] exploited conceptual relations of sen-tences for multidocument summarization This concept was

composed of three major elements They were conceptclustering sentence concept semantic relation and summarygeneration The sentence concept semantic relation wasattained to formulate the sentence concept graph The graphweighting algorithm was run to obtain the ranked weightedconcepts and sentences Then clustering was applied toremove the redundancy and summary generation was con-ducted to retrieve the informative summary Celikyilmaz andHakkani-Tur [18] proposed an extractive summarization Anunsupervised probabilistic model was proposed to formu-late the hidden abstract concepts over the documents andalso the correlation among the concepts to create topicallynonredundant and coherent summariesThehigher linguisticquality was estimated in terms of readability coherence andredundancy

3 Materials and Methods An AutomaticMultidocument Text Summarization

31 Frequent Document Summarization Based on NaıveBayesian Classifier Keywords are the important tools whichprovide the shortest summary of the text document Thispaper proposes an automated keyword extraction It iden-tifies the keywords from the training set based on theirfrequencies and positions In order to extract the summaryfrom the frequently used documents this paper proposesa Naıve Bayesian classifier in addition to the supervisedlearning

311 Frequent Document Summarization Consider that theinput is the cluster of text documents on similar topic Herethe task is to extract the frequently used documents andgenerate a small paragraph which safeguards the major detailcontained in the source document Here a set of sevenmultistage compression steps are introduced

(1) Set of documents is provided for processing(2) From the set of documents frequently used related

documents are selected by the system for processing(3) Preprocessing work is done and the sentences are

broken into words(4) Score is calculated for each word using Bayesian

classifier(5) Score is calculated for each sentence(6) For each sentence group one sentence level illustra-

tion is selected(7) The sentence level illustration is either generated as a

linear sentence or further compressed if necessary

At each stage this compression process reduces the complex-ity in the representation using different techniques

The score value is estimated based on the followingequation

Score (119878119894) = sum(119908119888119862

119894119896+ 119908119901119875

119894119896+ 119908119891119865

119894119896) (1)

where 119862119894119896

is centroid value calculated from (2) 119875119894119896

ispositional value calculated from (3) 119865

119894119896is first sentence


overlap value calculated from (4) 119908119888 119908119901 119908119891 are weightsThese are the constant values assumed

The centroid value for sentence 119878119894119896

is defined as thenormalized sum of the centroid components It is defined inthe following

119862119894119896= sum(TF (120572

119894) lowast IDF (120572

119894))

|R| 120572 isin 119878

119894119896 (2)

where 119862119894119896is normalized sum of the centroid values

The positional value for the sentence is computed fromthe following

119875119894119896=119899 minus 119894 + 1

119899lowast 119862max (3)

where 119862max is the maximum centroid value of the sentenceThe overlap value is computed as the inner product of thesentence vectors for the current sentence 119894 and the firstsentence of the document The sentence vectors are the 119899dimensional representations of the words in each sentencewhereby the value at position 119894 of a sentence vector indicatesthe number of occurrences of that word in the sentence Thefirst sentence overlap is calculated from the following

119865119894119896=997888997888rarr119878119894119896lowast997888997888rarr1198781119896 (4)

312 Keyword Extraction The machine learning algorithmstake the keyword extraction as the classification problemThis process identifies the word in the document whichbelongs to the class of ordinary words or keywords Bayesiandecision theory is the basic statistical technique whichdepends on the tradeoffs between the classification decisionsusing the cost and probability that go along with the deci-sions It assumes that the problem is given in probabilisticterms and the necessary values are already given Then itdecides on the best class that gives the minimum error withthe given example In cases where there is no distinctionmade in terms of cost between classes for classification errorsit chooses the class that is most likely with the highestprobability

313 Preprocessing the Documents It is a known fact thatall the words in the document do not have the same priorchance to be selected as a keyword Initially all the tokensare identified with the help of delimiters like tabs new linesspaces dots and so forth Moreover TF lowast IDF score appearsas barrier for such words but may not be sufficient Hence itshould be removed by assigning the prior probability of zeroNumbers and nonalphanumeric characters also need to beeliminatedAfter the removal of thesewords in the documentthe model is accurately constructed It helps to improve thesummary generation

314 Model Construction The words in the documentshould be categorized with the help of keyword propertiesand features in order to recognize whether the word is labeledas the key or not The word frequency is estimated based onthe number of times the word appears in the text Generally

prepositions have no value as the keyword If a word hashigher frequency than the other documents this can beanother categorizing feature to decide the word is key ornot Combining the above two properties the metric TF lowastIDF is computed TF represents the term frequency and IDFrepresents the inverse document frequency It is the standardmeasure used for information extraction [19]

Word 120572 in documentR is defined as follows

TF lowast IDF (119875R) = 119875 (word in R is 120572)

lowast (minus log119875 (120572 in a document)) (5)

(i) First term in this equation it is estimated by com-puting the number of times the word placed in thedocument and splitting it to the total number ofwords

(ii) Second term in this equation it is estimated by count-ing the total number of documents in the trainingset where the word occurs except R and dividingthe value by the total number of documents in thetraining set

Another feature is the position of the word places in the doc-ument There are many options to characterize the positionThe first property is the word distance to the beginning ofthe text Generally the keyword is concerted in the beginningand end of the text Sometimes the more essential keywordsare identified in the beginning or end of the sentence Bayestheorem is applied to compute the probability that the wordis the keyword and it is defined as follows

119875 (key | 119879RPTPS)

=119875 (119879 | key) lowast 119875 (R | key) lowast 119875 (PT | key) lowast 119875 (PS | key)

119875 (119879RPTPS)

(6)

Notations section describes the list of notations

32 Visiting Time Examination Algorithms 1 and 2 are usedto calculate the visiting time for each document and also forselecting the document for processing

33 Summary Generation After performing the NaıveBayesian operation the summary generation is carried outbased on Score and applying timestamp Algorithm 3 whichis for score and timestamp calculation is discussed in thefollowing sections

331 Summary Generation Based on Score The final sum-mary is generated based on the score After processing thedocuments by using the Naıve Bayesian the sentences arearranged based on the score High score sentence will appearfirst next high score sentence and so on The summarygenerated by Naıve Bayesian includes the selected sentencesfor each document and result them in the order in the sourcedocument It should be noted that sentence present from thefirst document will be placed before the sentences selectedfrom the next document subsequently The sequence of the


For each created or visited documentsBegin

Check whether it is a newly created documentIf yes

Set the count value of the document as 0Else

Increment the count value by 1End

End for

Algorithm 1 For counting the visiting time of the document

sentences in the document summary may not be reasonablein occurrence In order to overcome this issue the timestampstrategy is implemented

332 Summary Generation Based on Timestamp The times-tamp strategy is implemented by assigning a value to everysentence of the document It is based on the chronologicalposition of the document Once the sentences are chosenimmediately they are arranged in the ascending order basedon the timestamp An ordered look helps to achieve thesummary which carries out a coherent looking summaryThetotal number of sentences in the summary is represented bythe compression rate

4 Results and Discussion

This section discussed the experimental results obtained forthe proposed Naıve Bayesian based multidocument sum-marization and the comparative results with the MEADalgorithm The timestamp procedure is also applied on theMEAD algorithm and the results are examined with theproposed method There are totally 20 input documentswhich are taken for analysis and the inputs are processedSome of the input documents are the following

Input Document 1

(1) Born in Mumbai (then Bombay) into a middle-classfamily Sachin Tendulkar was named after his familyrsquosfavorite music director Sachin Dev Burman

(2) Sachin Tendulkar went to Sharadashram Vidya-mandir School where he started his cricketing careerunder coach Ramakant Achrekar

(3) While at school Sachin Tendulkar was involved ina mammoth 664 run partnership in a Harris Shieldgame with friend and team mate Vinod Kambli

(4) In 19881989 Sachin Tendulkar scored 100 not-outin his debut first-class match for Bombay againstGujarat

(5) At 15 years and 232 days Sachin Tendulkar was theyoungest to score a century on debut

Input Document 2

(1) Sachin is the fastest to score 10000 runs in Test crickethistory He holds this record along with Brian LaraBoth of them achieved this feat in 195 innings

(2) Sachin scored 4th highest tally of runs in Test cricket(10323)

(3) Career Average 5579mdashSachin has the highest averageamong those who have scored over 10000 Test runs

(4) Tendulkar is the second Indian to make over 10000runs in Test matches

(5) Sachin has 37 Test wickets (14 Dec 2005)(6) Sachin is the second fastest player to reach 9000 runs

(Brian Lara made 9000 in 177 innings Sachin in 179)

Input Document 3

(1) Sachin has highest batting average among batsmenwith over 10000 ODI runs (as of March 17 2006)

(2) Sachin has scored highest individual score amongIndian batsmen (186lowast against New Zealand at Hyder-abad in 1999)

(3) Sachin holds the record for scoring 1000 ODI runs ina calendar year He has done it six timesmdash1994 19961997 1998 2000 and 2003

(4) In 1998 Sachin made 1894 ODI runs still the recordfor ODI runs by any batsman in any given calendaryear

(5) In 1998 Sachin hit 9 ODI centuries the highest by anyplayer in a year

Input Document 4

(1) Sachinrsquos first ODI century came on September 9 1994against Australia in Sri Lanka at Colombo It hadtaken Tendulkar 79 ODIs to score a century

(2) Sachin Tendulkar is the only player to score a centurywhile making his Ranji Trophy Duleep Trophy andIrani Trophy debut

(3) Wisden named Tendulkar as one of the Cricketers ofthe Year in 1997 the first calendar year in which hescored 1000 Test runs

41 Input Documents and Number of Times Visited Auto-matic process has been developed to monitor and calculatethe number of times the document has been read by the usersInput documents along with the number of times visited arecalculated and the results are given in Table 1

42 Frequent Documents Summarization Instead of takingup each sentence for comparison for summarization fromall documents it would be more than enough to summarizeonly the document which has been put to many numbers ofreaders Sincewe track the documentwhich is read frequentlyby many people it is supposed to provide all the necessary


For each documentBegin

Check the number of times the document is visitedSelect the top 119899most frequent visited document for processingCalculate the Centroid value First sentence overlap Position valueAdd the three values to get the scoreCalculate the penalty and subtract it from the scoreSort the sentence based on score and timestamp

EndEnd for

Algorithm 2 For selecting frequent document for processing

(1) For all the sentences in the cluster(2) Begin

(a) Sort the sentence in the descending order depending on the obtained score values afterthe reduction of the redundancy penalty

(3) End(4) Begin

(a) Get the compression rate from the user(b) Select the required number of sentences based on the compression rate(c) Sort the sentences in the ascending order depending on the timestamps(d) If the Timestamps are same(e) Begin

(i) Compare the score values(ii) Sentence with the higher score value will appear first

(f) End(g) End If

(5) End

Algorithm 3 Timestamp based summary generation

information about the topic to the user so the user neednot surf through other documents for information as thedocument in hand would be satisfactory

421 Frequent Documents Selected for Processing In the totalof 20 documents we have selected 10 (2 documents) ofdocuments as frequently used documents for processingSince the Input 6doc and Input 7doc are visited morenumber of times they have been considered for furtheranalysis It is shown in Table 2

43 Summary Generation The performance for summariza-tion of the input documents using Naıve Bayesian classifierand MEAD has been analyzed and compared with frequentdocuments using Naıve Bayesian classifier and MEAD Intotal there are 100 documents among them 10of documentsare selected as frequent documents The performance graphfor frequent document summarization is shown in Figure 1From the graph it visually shows that the proposed NaıveBayesian takes lesser time than the existingMEADalgorithm

44 Score Table for Multidocuments From Figure 2(a) it isunderstood that when the Naıve Bayesian classifier is appliedon the multidocuments the time taken to get the summary is80 sec which is higher than the time taken to summarize the

frequent documents only From Figure 2(b) it is understoodthat when the MEAD is applied on the multidocuments thetime taken to get the summary is 115 sec which is higher thanthe time taken to summarize the frequent documents onlyThe proposed Naıve Bayesian classifier takes 80 sec which islesser than the MEAD algorithm

Figure 3 shows the comparison of frequent and multi-document summarization using MEAD and Naıve Bayesianclassifier Comparison of frequent and multidocument sum-marization using MEAD and Naıve Bayesian classifier showsthat when the Naıve Bayesian classifier is applied on themultidocuments the time taken to get the summary is 80 secand the number of documents is 36 which is less than thetime taken by MEAD where the time taken is 115 sec for36 documents Also for summarizing only the 10 frequentdocuments the Naıve Bayesian classifier takes only 18 secwhich is less than the time taken by MEAD where thetime taken is 25 sec for 10 documents So it is proved thatsummarization of frequent documents using Naıve Bayesianclassifier is better when compared to MEAD

Table 3 compares the run time taken to summarize thedocuments using the two techniques From the table it isobserved that the proposed Naıve Bayesian classifier resultsin lesser time than the existing MEAD algorithm


Table 1 Input documents and number of times visited

Documents name Number of times visitedInput 1doc 11Input 2doc 8Input 3doc 7Input 4doc 9Input 5doc 11Input 6doc 12Input 7doc 12Input 8doc 7Input 9doc 10Input 10doc 9Input 11doc 8Input 12doc 5Input 13doc 6Input 14doc 5Input 15doc 10Input 16doc 4Input 17doc 11Input 18doc 8Input 19doc 10Input 20doc 7

Table 2 Frequently visited documents

Document name Number of times visitedInput 6doc 12Input 7doc 12

Table 3 Comparison of frequent versus multidocument summa-rization using MEAD and NAIVE Bayesian classifier

Summarizationtechniques

Multidocumenttime in seconds

Frequent documenttime in seconds

MEAD 115 25Naıve Bayesianclassifier 80 18

The performance measures such as precision recall and119865-score are calculated for the proposed method and com-pared with the existing method [16] Precision estimates thecorrectness percentage and recall measures the completenessof the summarizer Precision also is used to estimate theusefulness of the summarizer and recall is used to showeffectiveness of themodelWith the combination of precisionand recall 119865-score is evaluated The evaluation measures arecomputed based on the following equation

precision = relevant sentences cap retrieved sentencesretrieved sentences

recall = relevant sentences cap retrieved sentencesrelevant sentences

119865-Score = 2 lowast precision lowast recallprecision + recall

(7)

Table 4 Comparison of proposed Naıve Bayesian classifier andexisting method [16]

Metrics Proposed method Existing methodPrecision 854 78Recall 839 777119865-measure 92 86

Table 4 shows the comparison between the proposedmethod and the existing method From the results it isevidently proved that the proposedmethodworks better thanthe existing method

45 Evaluation of the Summary Multidocument summa-rization is carried out using MEAD algorithm and NaıveBayesian classifier for frequent documents with timestampThis is compared with human generated summary by humanassessors consisting of 5 professors 9 lecturers and 6 studentsto find out whether the output is an informative summarywhich can be a substitute of the original documents In orderto satisfy this the following points are considered importantand are assessed by human assessors

Comprehensibility The summary should include main con-tent of target documents exhaustively

Readability The summary should be a self-contained docu-ment and should be readable

Length of Summary to Be Generated The length of summarygenerated by human and the system is compared

Each participant generated summaries according to topicinformation given and submitted a set of summaries Eachtopic corresponds to one IR result which consists of thefollowing information

(i) topic ID(ii) list of keywords for query in IR(iii) brief description of the information needs(iv) set of documents IDs which are target documents

of summarization the number of documents variesfrom 3 to 20 according to the topic

(v) length of summary to be generated there are twolengths of summary ldquoLongrdquo and ldquoShortrdquo The lengthof ldquoLongrdquo is twice of ldquoShortrdquo

(vi) comprehensibility the summary has all the importantcontent or not

(vii) readability the summary is easy to read or not

451 Comparative Analysis of Summary Generated by Systemand Human About 20 human assessors are used to carryout the comparative study of summary generated by systemand human in terms of comprehensibility readability andlength of the summary Table 5 provides information aboutcomparison of length of summary by human assessors andsystem


Frequent documents

0

2

4

6

8

10

12

14

16

18

20Ti

me (

s)

2 4 6 8 10 120Number of documents

(a)

Frequent documents

0

5

10

15

20

25

30

Tim

e (s)


(b)

Figure 1 Frequent documents summarization (a) Naıve Bayesian classifier (proposed) and (b) MEAD (existing)

Multidocuments

5 10 15 20 25 30 35 400Number of documents

0

10

20

30

40

50

60

70

80

90

100

Tim

e (s)

(a)Multidocuments


0102030405060708090

100110120130

Tim

e (s)

(b)

Figure 2 Multidocument summarization (a) Naıve Bayesian classifier (proposed) and (b) MEAD (existing)

From Table 5 it is observed that the summary generatedby the system is shorter than summary produced by humanas 70 of the human assessors stated that the length of thesummary generated by the system is short

From Table 6 it is observed that the summary generatedby the system has all important contents as 90 of the humanassessors stated that the summary generated by the system iscomprehensible

From Table 7 it is observed that the summary generatedby the system is easy to read as 100 of the human assessorsstated that the summary generated by the system is readable

From this analysis done using human assessors it isproved that the summary generated by the system is short and

Table 5 Comparison of length of summary generated by humanand system

Length of summary generated by systemShort Long

Professor 3 2Lecturers 6 3Students 5 1

the quality of the summary generated is also good based onthe two factors readability and comprehensibility


MultidocumentsFrequent documents


0

10

20

30

40

50

60

70

80

90

100Ti

me (

s)

(a)



0

20

40

60

80

100

120

140

Tim

e (s)

(b)

Figure 3 Comparison of frequent versus multidocument summarization (a) Naıve Bayesian classifier (proposed) and (b) MEAD (existing)

Table 6 Evaluating the comprehensibility of summary generated bysystem

ComprehensibilityYes No

Professor 5 NilLecturers 9 NilStudents 4 2

Table 7 Evaluating the readability of summary generated by system

ReadabilityYes No

Professor 6 NilLecturers 9 NilStudents 4 Nil

5 Conclusion and Future Work

In this paper an automatic text summarization approachis proposed which uses the Naıve Bayesian Classificationwith the timestamp concept This summarizer works on awide variety of domains varying between international newspolitics sports entertainment and so on Another usefulfeature is that the length of the summary can be adapted to theuserrsquos needs as can the number of articles to be summarizedThe compression rate can be specified by the user so that hecan choose the amount of information he wants to imbibefrom the documents To show the system efficiency it iscompared with the existing MEAD algorithm The resultsshow that the proposed method yields better outputs thanthe MEAD algorithmThe proposed method results in betterprecision recall and 119865-score than the existing clustering andlexical chaining method

The proposed work does not involve a knowledge baseand therefore can be used to summarize articles from fields asdiverse as politics sports current affairs and finance How-ever it does cause a tradeoff between domain independenceand a knowledge based summary which would provide datain a form more easily understandable to the human userA possible application of this work can be made to makedata available on the move on a mobile network by evenshortening the sentences produced by our algorithm andthen shortening it Various NLP based algorithms can beused to achieve this objective Thus we would first producea summary by sentence extraction from various documentsand then abstractive methods are employed to shorten thosesentences produced This will ensure that the summaryproduced is to the highest condensed form which can bemade in the mobile industry

Notations

TF lowast IDF score (119879) Distance to the beginning of theparagraph (R)

PT Relative position of the word withrespect to the entire text

PS Sentence which exists in the entiretext

119875(key) Prior probability that the word is akeyword

119875(119879 | key) Probability of TFlowast IDF score 119879 giventhe word is the keyword

119875(R | key) Probability of having the neighbordistanceR

119875(PT | key) Probability of having the relativedistance

119875(PS | key) Probability of having relative distanceR to the beginning of the paragraph


119875(119879RPTPS) Probability that a word haveTF lowast IDF score 119879

119878119894119896 119894th sentence in the document

R Document119862max The maximum centroid value of the

sentenceTF Term frequencyIDF Inverse document frequency

Conflict of Interests

The authors declare that they have no conflict of interestsregarding the publication of this paper

References

[1] D Radev TAllison S Blair-Goldensohn et al et al ldquoMEADmdashaplatform for multidocument multilingual text summarizationrdquoin Proceedings of the Conference on Language Resources andEvaluation (LREC rsquo04) Lisbon Portugal May 2004

[2] YMa and JWu ldquoCombiningN-gamanddependencyword pairformulti-document Summarizationrdquo in Proceedings of the IEEE17th International Conference on Computational Science andEngineering (CSE rsquo14) pp 27ndash31 Chengdu China December2014

[3] R M Alguliev R M Aliguliyev and N R Isazade ldquoMultipledocuments summarization based on evolutionary optimizationalgorithmrdquo Expert Systems with Applications vol 40 no 5 pp1675ndash1689 2013

[4] R M Alguliev R M Aliguliyev and N R Isazade ldquoCDDSconstraint-driven document summarization modelsrdquo ExpertSystems with Applications vol 40 no 2 pp 458ndash465 2013

[5] B Baruque and E Corchado ldquoA weighted voting summariza-tion of SOMensemblesrdquoDataMining andKnowledgeDiscoveryvol 21 no 3 pp 398ndash426 2010

[6] S Xiong and Y Luo ldquoA new approach formulti-document sum-marization based on latent semantic analysisrdquo in Proceedings ofthe Seventh International Symposium on Computational Intel-ligence and Design (ISCID rsquo14) vol 1 pp 177ndash180 HangzhouChina December 2014

[7] Y Su and W Xiaojun ldquoSRRank leveraging semantic roles forextractive multi-document summarizationrdquo IEEEACM Trans-actions on Audio Speech and Language Processing vol 22 pp2048ndash2058 2014

[8] G Yang DWen Kinshuk N-S Chen and E Sutinen ldquoA novelcontextual topic model for multi-document summarizationrdquoExpert Systems with Applications vol 42 no 3 pp 1340ndash13522015

[9] J-U Heu I Qasim and D-H Lee ldquoFoDoSu multi-documentsummarization exploiting semantic analysis based on socialFolksonomyrdquo Information ProcessingampManagement vol 51 no1 pp 212ndash225 2015

[10] G Glavas and J Snajder ldquoEvent graphs for information retrievaland multi-document summarizationrdquo Expert Systems withApplications vol 41 no 15 pp 6904ndash6916 2014

[11] R Ferreira L de Souza Cabral F Freitas et al ldquoA multi-document summarization system based on statistics and lin-guistic treatmentrdquo Expert Systems with Applications vol 41 no13 pp 5780ndash5787 2014

[12] Y K Meena and D Gopalani ldquoDomain independent frame-work for automatic text summarizationrdquo Procedia ComputerScience vol 48 pp 722ndash727 2015

[13] Y Sankarasubramaniam K Ramanathan and S Ghosh ldquoTextsummarization using Wikipediardquo Information Processing ampManagement vol 50 no 3 pp 443ndash461 2014

[14] A Khan N Salim and Y Jaya Kumar ldquoA framework for multi-document abstractive summarization based on semantic rolelabellingrdquo Applied Soft Computing vol 30 pp 737ndash747 2015

[15] G Erkan and D R Radev ldquoLexPageRank prestige in multi-document text summarizationrdquo in Proceedings of the Conferenceon Empirical Methods in Natural Language Processing (EMNLPrsquo04) pp 365ndash371 Barcelona Spain July 2004

[16] S Saraswathi and R Arti ldquoMulti-document text summarizationusing clustering techniques and lexical chainingrdquo ICTACTJournal on Soft Computing vol 1 no 1 pp 23ndash29 2010

[17] H-T Zheng S-Q Gong J-M Guo andW-Z Wu ldquoExploitingconceptual relations of sentences for multi-document summa-rizationrdquo in Web-Age Information Management vol 9098 ofLecture Notes in Computer Science pp 506ndash510 Springer BaselSwitzerland 2015

[18] A Celikyilmaz and D Hakkani-Tur ldquoDiscovery of topicallycoherent sentences for extractive summarizationrdquo in Proceed-ings of the 49th Annual Meeting of the Association for Com-putational Linguistics Human Language Technologies vol 1Portland Ore USA 2011

[19] Y-F B Wu Q Li R S Bot and X Chen ldquoDomain-specifickeyphrase extractionrdquo in Proceedings of the 14th ACM Interna-tional Conference on Information and Knowledge Management(CIKM rsquo05) pp 283ndash284 November 2005

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014


Distributed Sensor Networks


Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014


ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014


Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014


Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications


Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia


Biomedical Imaging


ArtificialNeural Systems

Advances in


RoboticsJournal of



Computational Intelligence and Neuroscience

Industrial EngineeringJournal of


Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014


Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in



Abstractive summarization generally requires informationcombination sentence compression and reformulation It iscomplex problem due to the deeper analysis of the inputdocuments and the concept-to-text formulation Extractivesummarization includes assigning saliency factors to someunits (paragraph sentences) of the documents The sentenceextraction is based on the highest score to include in the finalsummary

Nowadays most of the researchers focus their research inautomatic text summarization are extractive summarizationSome of the basic extractive processes are as follows

(i) Coverage extraction plays a major role in textsummarization process It recognizes the necessaryinformation that covers the diverse topics in inputdocuments It can be applied at any level of textparagraphs such as sentence paragraph word andphrase Numerous algorithms have been proposed torecognize the most important information from theinput documents

(ii) Coherency optimal ordering of retrieved sentences toformulate the coherent context flow is the complexissue In single document text summarization oneprobable ordering sentence is given by the inputtext document itself Still this process is a nontrivialtask In MDS the classification is categorized intotwo approaches learning the natural sequence ofthe sentence from the huge corpora and the use ofchronological information

(iii) Redundancy elimination due to the length limitationneeded for an effective summary and the existenceof the extracted sentences which contains the sameinformation most of the approaches utilize the simi-larity measure to identify the duplication informationfrom the input documents

This paper introduces an automatic text summarizationapproach to overcome the difficulties in the existing sum-marization approaches Here Naıve Bayesian Classificationapproach is utilized to identify the necessary keywordsfrom the text Bayes method is machine learning methodto estimate the distinguishing keyword features in a textand retrieves the keyword from the input based on thisinformation The features are generally independent anddistributed Scoring is estimated for the retrieved sentence tocompute the word frequency The combination of this NaıveBayesian scoring and timestamp concept helps to improvethe summarization accuracy The proposed summarizationmethod achieves better coverage and coherence using theNaıve Bayesian classifier and the concept of timestamphence it automatically eliminates the redundancy in the inputdocuments

The remainder of this paper is organized as followsSection 2 summarizes the related works in the multidoc-ument text summarization Section 3 shows the proposedNaıve Bayesian based text summarization approach and theintroduction of timestamp strategy Section 4 describes theperformance analysis And finally the paper ends with theconclusion and future work at Section 5

2 Related Works

This section discusses some of the existing MDS approachesRadev et al [1] discussed the MEAD algorithm which isthe platform for multidocument text summarization Thispaper provides the functionality of the MEAD a publicdomain comprehensive open source and multidocumentmultilingual summarization It has been widely used bymore than 500 companies and organizations It has beenapplied in various applications ranging from the mobilephone technologies to web page for summarization andalso for novelty identification Ma and Wu [2] combined 119899-gram and dependency word pair for multidocument sum-marization The dependency work pair defines the syntacticrelationships among the words Each feature reproduces thecooccurrence of the common topics of multidocuments indiverse perspective The sentence score was estimated basedon theweighted sumof the features Finally the summarywasformulated by retrieving the salient sentences based on thehigher significance score model

Alguliev et al [3] designed an evolutionary optimizationalgorithm formultidocument summarizationThis algorithmcreates a summary by collecting the salient sentence fromthemultiple documentsThis approach utilizes the summary-to-document collection sentence-to-document collectionand sentence-to-sentence collection to choose the mostimportant sentences from the multiple documents Henceit reduces the redundancy in the document summaryAccording to the individual fitness value the algorithmcan adaptively alter the crossover rate The authors in [4]also proposed constraint driven document summarizationmodelsThis approach was developed based on the followingtwo constraints (1) diversity in summarization to reducethe redundancy in the summary among the sentences (2)sufficient coverage to avoid the loss of document majorinformation to generate the summary In order to solvethe Quadratic Integer Programming (QIP) problem thisapproach utilized a discrete Particle Swarm Optimization(PSO) algorithm

Baruque and Corchado [5] presented a weighted votingsummation of Self-Organization Maps (SOMs) ensemblesWeighted voting superposition was used for the outcomesof an ensemble of SOM The objective of this algorithm wasto attain the minimal topographic error in the map Hencea weighted voting process was done among the neuronsto calculate the properties of the neurons located in theresulting map Xiong and Lu [6] introduced an approachfor multidocument summarization using Latent SemanticAnalysis (LSA) Among the existing multidocument summa-rization approaches the LSA was a unique concept whichuses the latent semantic information rather than the originalfeatures It has chosen the sentence individually to remove theredundant sentences Su and Xiaojun [7] discussed an extrac-tive multidocument summarization This approach uses thesemantic role information to improve the graph based rank-ing algorithm for summarization The sentence was parsedto obtain the semantic roles The SRRank algorithm wasproposed to rank the sentence words and semantic rolesimultaneously in a heterogeneous ranking manner


















Score (119878119894) = sum(119908119888119862

119894119896+ 119908119901119875

119894119896+ 119908119891119865

119894119896) (1)

where 119862119894119896








119862119894119896= sum(TF (120572

119894) lowast IDF (120572

119894))

|R| 120572 isin 119878

119894119896 (2)



119875119894119896=119899 minus 119894 + 1

119899lowast 119862max (3)


119865119894119896=997888997888rarr119878119894119896lowast997888997888rarr1198781119896 (4)











119875 (key | 119879RPTPS)


119875 (119879RPTPS)

(6)










End for






Input Document 1






Input Document 2







Input Document 3






Input Document 4









EndEnd for




(3) End(4) Begin



(f) End(g) End If

(5) End























(7)
















Frequent documents

0

2

4

6

8

10

12

14

16

18

20Ti

me (

s)


(a)

Frequent documents

0

5

10

15

20

25

30

Tim

e (s)


(b)


Multidocuments


0

10

20

30

40

50

60

70

80

90

100

Tim

e (s)

(a)Multidocuments


0102030405060708090

100110120130

Tim

e (s)

(b)













0

10

20

30

40

50

60

70

80

90

100Ti

me (

s)

(a)



0

20

40

60

80

100

120

140

Tim

e (s)

(b)






ReadabilityYes No





Notations
















References



























Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in




















Score (119878119894) = sum(119908119888119862

119894119896+ 119908119901119875

119894119896+ 119908119891119865

119894119896) (1)

where 119862119894119896








119862119894119896= sum(TF (120572

119894) lowast IDF (120572

119894))

|R| 120572 isin 119878

119894119896 (2)



119875119894119896=119899 minus 119894 + 1

119899lowast 119862max (3)


119865119894119896=997888997888rarr119878119894119896lowast997888997888rarr1198781119896 (4)











119875 (key | 119879RPTPS)


119875 (119879RPTPS)

(6)










End for






Input Document 1






Input Document 2







Input Document 3






Input Document 4









EndEnd for




(3) End(4) Begin



(f) End(g) End If

(5) End























(7)
















Frequent documents

0

2

4

6

8

10

12

14

16

18

20Ti

me (

s)


(a)

Frequent documents

0

5

10

15

20

25

30

Tim

e (s)


(b)


Multidocuments


0

10

20

30

40

50

60

70

80

90

100

Tim

e (s)

(a)Multidocuments


0102030405060708090

100110120130

Tim

e (s)

(b)













0

10

20

30

40

50

60

70

80

90

100Ti

me (

s)

(a)



0

20

40

60

80

100

120

140

Tim

e (s)

(b)






ReadabilityYes No





Notations
















References



























Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in







119862119894119896= sum(TF (120572

119894) lowast IDF (120572

119894))

|R| 120572 isin 119878

119894119896 (2)



119875119894119896=119899 minus 119894 + 1

119899lowast 119862max (3)


119865119894119896=997888997888rarr119878119894119896lowast997888997888rarr1198781119896 (4)











119875 (key | 119879RPTPS)


119875 (119879RPTPS)

(6)










End for






Input Document 1






Input Document 2







Input Document 3






Input Document 4









EndEnd for




(3) End(4) Begin



(f) End(g) End If

(5) End























(7)
















Frequent documents

0

2

4

6

8

10

12

14

16

18

20Ti

me (

s)


(a)

Frequent documents

0

5

10

15

20

25

30

Tim

e (s)


(b)


Multidocuments


0

10

20

30

40

50

60

70

80

90

100

Tim

e (s)

(a)Multidocuments


0102030405060708090

100110120130

Tim

e (s)

(b)













0

10

20

30

40

50

60

70

80

90

100Ti

me (

s)

(a)



0

20

40

60

80

100

120

140

Tim

e (s)

(b)






ReadabilityYes No





Notations
















References



























Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in








End for






Input Document 1






Input Document 2







Input Document 3






Input Document 4









EndEnd for




(3) End(4) Begin



(f) End(g) End If

(5) End























(7)
















Frequent documents

0

2

4

6

8

10

12

14

16

18

20Ti

me (

s)


(a)

Frequent documents

0

5

10

15

20

25

30

Tim

e (s)


(b)


Multidocuments


0

10

20

30

40

50

60

70

80

90

100

Tim

e (s)

(a)Multidocuments


0102030405060708090

100110120130

Tim

e (s)

(b)













0

10

20

30

40

50

60

70

80

90

100Ti

me (

s)

(a)



0

20

40

60

80

100

120

140

Tim

e (s)

(b)






ReadabilityYes No





Notations
















References



























Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in






EndEnd for




(3) End(4) Begin



(f) End(g) End If

(5) End























(7)
















Frequent documents

0

2

4

6

8

10

12

14

16

18

20Ti

me (

s)


(a)

Frequent documents

0

5

10

15

20

25

30

Tim

e (s)


(b)


Multidocuments


0

10

20

30

40

50

60

70

80

90

100

Tim

e (s)

(a)Multidocuments


0102030405060708090

100110120130

Tim

e (s)

(b)













0

10

20

30

40

50

60

70

80

90

100Ti

me (

s)

(a)



0

20

40

60

80

100

120

140

Tim

e (s)

(b)






ReadabilityYes No





Notations
















References



























Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in

















(7)
















Frequent documents

0

2

4

6

8

10

12

14

16

18

20Ti

me (

s)


(a)

Frequent documents

0

5

10

15

20

25

30

Tim

e (s)


(b)


Multidocuments


0

10

20

30

40

50

60

70

80

90

100

Tim

e (s)

(a)Multidocuments


0102030405060708090

100110120130

Tim

e (s)

(b)













0

10

20

30

40

50

60

70

80

90

100Ti

me (

s)

(a)



0

20

40

60

80

100

120

140

Tim

e (s)

(b)






ReadabilityYes No





Notations
















References



























Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in




Frequent documents

0

2

4

6

8

10

12

14

16

18

20Ti

me (

s)


(a)

Frequent documents

0

5

10

15

20

25

30

Tim

e (s)


(b)


Multidocuments


0

10

20

30

40

50

60

70

80

90

100

Tim

e (s)

(a)Multidocuments


0102030405060708090

100110120130

Tim

e (s)

(b)













0

10

20

30

40

50

60

70

80

90

100Ti

me (

s)

(a)



0

20

40

60

80

100

120

140

Tim

e (s)

(b)






ReadabilityYes No





Notations
















References



























Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in






0

10

20

30

40

50

60

70

80

90

100Ti

me (

s)

(a)



0

20

40

60

80

100

120

140

Tim

e (s)

(b)






ReadabilityYes No





Notations
















References



























Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in










References



























Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in










Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in



An Automatic Multidocument Text Summarization Approach Based ...

Documents

Transcript of An Automatic Multidocument Text Summarization Approach Based ...