
2014 International Conference on Computer Communication and Informatics (ICCCI -2014), Jan. 03 – 05, 2014, Coimbatore, INDIA

Text Summarization using Enhanced MMR Technique

Rashmi Kurmi Department of Information Technology

S.A.T.I, College Vidisha, India

Pranita Jain Department of Information Technology

S.A.T.I, College Vidisha, India

Abstract— Nowadays, when a huge amount of documents and web content is available, reading the full content is difficult. Summarization is a way to produce an abstract form of a large document so that its essence can be communicated easily. Current research in automatic summarization is dominated by a few effective, yet simple, approaches: summarization through extraction, summarization through abstraction, and multi-document summarization. These techniques are used to build a summary of a document. A number of techniques have been implemented for the summarization of text, whether for a single document, for online web data, or for a particular language. In this paper we implement an efficient technique for text summarization that reduces computational cost, time, and storage requirements.

Index Terms—NLP, Summarization, MMR, snippets, Extraction.

I. INTRODUCTION

As the amount of on-line information increases, systems that can automatically summarize one or more documents become steadily more valuable. Recent research has investigated types of summaries, ways to create them, and procedures to evaluate them. A summary can be loosely defined as a text that is produced from one or more texts, that conveys the important information in the original text(s), and that is no longer than half of the original text(s) and usually considerably less than that.

The main objective of a summary is to present the most important ideas of a document in less space. If every sentence in a text document were of equal significance, producing a summary would not be very effective, as any reduction in the size of the document would carry a proportional decrease in its informativeness. However, information content in a document appears in bursts, and one can therefore distinguish between more and less informative segments. Identifying the informative segments at the expense of the rest is the main challenge in summarization [1].

Summarization has been investigated by the NLP community for many years. Radev et al. [1] define a summary as 'a text that is produced from one or more texts, that conveys important information in the original text(s), and that is no longer than half of the original text(s) and usually significantly less than that'. This definition captures three important aspects that characterize research on automatic summarization:

• Summaries may be produced from a single document or from multiple documents,

• Summaries should preserve important information,

• Summaries should be short.

Natural language generation techniques have been adapted to work with extracted textual phrases, rather than semantic representations, as input, and this allows researchers to experiment with approaches to abstraction. Techniques that have been developed for topic-oriented summaries are now being extended so that they can be applied to the construction of long answers for the question-answering task [1].

1.1 Single-Document Summarization through Extraction

Despite the beginnings of research on alternatives to extraction, most work today still relies on extracting sentences from the original document to form a summary. Most early extraction research focused on the development of relatively simple surface-level techniques that tend to signal important passages in the source text. Although most systems use sentences as units, some work with larger passages, typically paragraphs. Usually, a set of features is computed for each passage; these features are then normalized and summed, and the passages with the highest resulting scores are sorted and returned as the extract.
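As a rough illustration of this feature-scoring idea, the following sketch scores passages by summing normalized surface-level features and returns the highest-scoring passages as the extract. The particular features (position, length, keyword hits) and the equal weighting are illustrative assumptions, not the features used by any specific system.

from collections import Counter

def score_passages(passages, keywords, top_k=3):
    """Score passages by summing normalized surface-level features (illustrative only)."""
    def features(idx, text):
        words = text.lower().split()
        return {
            "position": 1.0 - idx / max(len(passages) - 1, 1),  # earlier passages score higher
            "length": float(len(words)),                         # longer passages carry more content
            "keyword_hits": float(sum(w in keywords for w in words)),
        }
    feats = [features(i, p) for i, p in enumerate(passages)]
    scores = []
    for f in feats:
        total = 0.0
        for name, value in f.items():
            max_value = max(g[name] for g in feats) or 1.0       # normalize each feature to [0, 1]
            total += value / max_value
        scores.append(total)
    ranked = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    return [passages[i] for i in ranked[:top_k]]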

1.2 Single-Document Summarization through Abstraction

In the early stages of research on summarization, any approach that did not use extraction was categorized as an abstractive approach. Abstractive approaches have used information extraction, ontological information, information fusion, and compression. Information extraction approaches can be characterized as 'top-down', since they look for a set of predefined information types to include in the summary. For each topic, the user predefines frames of expected information types, together with recognition criteria. For example, an earthquake frame may hold slots for location, earthquake magnitude, number of casualties, etc. The summarization engine must then locate the desired pieces of information, fill in the slots, and generate a summary. This method can produce high-quality and accurate summaries, albeit in restricted domains only.
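A minimal sketch of such a frame is shown below; the slot names follow the earthquake example above, while the class name and the generated sentence template are purely illustrative assumptions.

from dataclasses import dataclass
from typing import Optional

@dataclass
class EarthquakeFrame:
    """A predefined frame whose slots a domain summarizer fills from the source text (illustrative)."""
    location: Optional[str] = None
    magnitude: Optional[float] = None
    casualties: Optional[int] = None

    def to_summary(self) -> str:
        # Generate a summary sentence from the filled slots.
        return (f"An earthquake of magnitude {self.magnitude} struck {self.location}, "
                f"causing {self.casualties} casualties.")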

1.3 Multi-document Summarization

Multi-document summarization, the process of producing a single summary from a set of related source documents, is comparatively new. The three major problems introduced by having to handle multiple input documents are (1) recognizing and coping with redundancy, (2) identifying important differences among documents, and (3) ensuring summary coherence, even when material stems from different source documents.



In an early approach to multi-document summarization, information extraction was used to facilitate the identification of similarities and differences. As for single-document summarization, this approach produces more of a briefing than a summary, as it contains only pre-identified information types. Identities of slot values are used to decide when information is consistent enough to include in the summary. Later work combined information extraction approaches with regeneration of extracted text to enhance summary generation [2]. Important differences (e.g., updates, trends, direct contradictions) are identified through a set of rules. Recent work also follows this approach, using enhanced information extraction and additional forms of comparison [3].

1.4 Evaluation

Evaluating the quality of a summary has proved to be a difficult problem, primarily because there is no obvious "ideal" summary. Even for relatively straightforward news articles, human summarizers tend to agree only about 60% of the time when measuring sentence content overlap. The use of multiple reference models for system evaluation could help alleviate this predicament, but researchers also need to consider other methods that can yield more suitable models, perhaps using a task as motivation. Two broad classes of metrics have been developed: form metrics and content metrics. Form metrics focus on grammaticality, overall text coherence, and organization, and are frequently measured on a point scale. Content is more difficult to measure. Typically, system output is compared sentence by sentence or fragment by fragment to one or more human-made ideal abstracts, and, as in information retrieval, the percentage of the system summary's content that is relevant (precision) and the percentage of the important information from the ideal abstract that is included in the summary (recall) are recorded.
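For concreteness, precision, recall, and F-measure over extracted sentences can be computed as in the following sketch; this is a generic illustration of the standard definitions, not the evaluation code used for the experiments reported later in this paper.

def evaluate_summary(system_sentences, ideal_sentences):
    """Precision, recall and F-measure of a system extract against an ideal extract."""
    system, ideal = set(system_sentences), set(ideal_sentences)
    overlap = len(system & ideal)
    precision = overlap / len(system) if system else 0.0   # relevant fraction of the system summary
    recall = overlap / len(ideal) if ideal else 0.0        # fraction of the ideal that was captured
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure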

II. BACKGROUND

The large quantity of electronic information available on the Internet has increased the difficulty of dealing with it in recent years. Automatic Summarization (AS) helps users condense all this information and present it in a concise way, in order to make it easier to process the vast number of documents related to the same topic that exist today. The approaches found in the literature mostly detect and semantically interpret the segments (blocks) of a page, and fewer studies deal with the problem of removing noisy (non-informative) segments. The less sophisticated methods for web page segmentation rely on building wrappers for a specific type of web page. The requirement that the new material not be present explicitly in the text means that the system must have access to external information of various kinds, such as an ontology or a knowledge base, and must be able to perform combinatory inference. In fact, no large-scale resources of this kind exist yet.

III. RELATED WORK

The less sophisticated methods for web page segmentation rely on building wrappers for a specific type of web page. Some of these approaches rely on hand-crafted web scrapers that use hand-coded rules specific to certain template types [4]. The disadvantage of this approach is that such wrappers are very inflexible and unable to handle template changes in web pages.

The methods applied to the web page segmentation problem use a combination of non-visual and/or visual characteristics. Examples of non-visual methods are presented by Diao [5], who treats segments of web pages in a learning-based web query processing system and deals with the major types of HTML-related tags (<p>, <table>, etc.).

Lin [6] considers only the table tag and its descendants as a content block and uses an entropy-based approach to discover informative ones. Gibson et al. [7] consider element frequencies for template detection, while Debnath et al. [8] compute an inverse block frequency for classification.

In [9], Chakrabarti et al. determine the "templateness" of DOM nodes by regularized isotonic regression. Yi et al. [10] simplify the DOM structure by deriving a so-called Site Style Tree, which is then used for classification. Vineel proposed a DOM tree mining approach based on content size and entropy which is able to detect repetitive patterns [11].

Kang et al. also proposed a repetition-based approach for finding patterns in the DOM tree structure [12]. Alcic et al. investigate the problem from a clustering point of view, using distance measures for content units based on their DOM, geometric, and semantic properties [13].

Yan and Miao [14] proposed a multi-cue algorithm which uses several kinds of information: visual information (background color, font size), non-visual information (tags), text information, and link information. Cao et al. used vision and effective text information to locate the main text of a blog page, and used the information quantity of separators to detect the comments [15].

Zhang et al. [16] focused on precise web page segmentation based on semantic block header detection, using visual and structural features of the pages. Finally, Kohlschutter et al. proposed an approach building on methods from quantitative linguistics and computer vision [17].

Query-biased summarization approaches have been shown to perform better than generic summarization approaches in retrieval tasks. In [18], Tombros et al. compared query-biased summaries with static summaries composed of the title and first few sentences of retrieved documents, and found that query-biased summaries help users improve speed and accuracy in identifying relevant documents. Similar results were found in [19]. Major search engines including Google, Yahoo, and Bing usually summarize a search result by including the web page title, the URL, and a query-biased snippet in the summary [20].

There has been little work on summarizing structured documents until recently. Huang et al. explored the snippet generation problem in XML search in 2008 [21]. Their approaches are designed based on the assumption that a query result snippet should: 1) be a self-contained and logical information unit, 2) be distinguishable from other query results, and 3) be representative of the query result. The snippet retrieval track of INEX 2011 focuses on how best to generate informative snippets for XML search results, using the Wikipedia corpus.

IV. PROPOSED METHODOLOGY

The proposed methodology works on the concept of maximal marginal relevance between sentences or words. The key idea is to use a unit step function at each step to decide the maximal marginal relevance. The technique also maintains a database of words that are useless in the document, i.e., words whose elimination does not affect the meaning of the document.


The automatic text summarization process may contain the following steps.

1. Input a document to be summarized.
2. Traverse the document and eliminate the words stored in the database that are not useful.
3. Process the document sentence by sentence, starting from the first sentence until the document finishes.
4. Using the unit step function, calculate the relevant information required, i.e., the maximum number of words to be kept in the summary.

The unit step function used in the algorithm is given as

u(x) = 1 if x >= 0, and u(x) = 0 otherwise.

Algorithm:

Q: the user input document
U: the set of all information units in document d
M: the maximum number of information units allowed
E: the number of units in document d to be eliminated

1. Initialize k = 0.
2. Repeat S = S + E(Q), i.e., eliminate the useless words from the document.
3. While the size of S_k is smaller than M:
4.   Select the next unit u_{k+1} according to the above equation.
5.   U_{k+1} =
6.   k = k + 1.
7. End

Let us suppose a document 'd' containing a set of sentences. First of all, the entire document 'd' is scanned and a dictionary containing the useless words is maintained.

A       an    The
From    On    Operators
We      All   There

Table 1. Dictionary containing useless words

The useless words are then removed from the document. After the removal of the unnecessary words from the document, the frequency of each word is calculated.

Word       Frequency
File       2
Features   3

Table 2. File containing the frequency of each word

Finally, the unit step function given above is used to choose a sentence; a parse tree is created to check the grammar, and according to that the sentence can be reduced.
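To make the preprocessing of Tables 1 and 2 concrete, the following sketch removes dictionary ("useless") words and counts the frequency of the remaining words; the stop-word set here is only a small sample taken from Table 1, not the full dictionary used in the experiments.

from collections import Counter

# Sample dictionary of useless words, as in Table 1.
STOP_WORDS = {"a", "an", "the", "from", "on", "operators", "we", "all", "there"}

def word_frequencies(text):
    """Remove dictionary words and count the frequency of the remaining words (as in Table 2)."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    return Counter(w for w in words if w and w not in STOP_WORDS)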

V. RESULT ANALYSIS

The table below shows the evaluation of different parameters for the existing work. The algorithm is applied to 6 datasets of different sizes, and parameters such as precision, recall, and F-measure, along with accuracy, are calculated.

Dataset    Size (words)   Precision   Recall   F-measure   Accuracy
dataset1   1232           0.6309      0.9765   0.7665      63.0682
dataset2   1545           0.6807      0.9635   0.7978      67.055
dataset3   1878           0.6679      0.9814   0.7949      66.6134
dataset4   1208           0.656       0.9756   0.7845      65.3974
dataset5   6611           0.772       0.9701   0.8598      75.8433
dataset6   3458           0.7407      0.9732   0.8412      73.0769

Table 3. Parameter evaluation of the existing work

The table below shows the evaluation of the same parameters for the proposed work. The algorithm is applied to the same 6 datasets of different sizes, and precision, recall, F-measure, and accuracy are calculated.

Dataset    Size (words)   Precision   Recall   F-measure   Accuracy
dataset1   1232           0.6362      0.9922   0.7753      64.2857
dataset2   1545           0.6933      0.976    0.8107      69.2557
dataset3   1878           0.6769      0.975    0.7991      67.6784
dataset4   1208           0.6655      0.9846   0.7942      67.053
dataset5   6611           0.7788      0.9663   0.8625      76.4786
dataset6   3458           0.7448      0.9645   0.8405      73.1926

Table 4. Parameter evaluation of the proposed work

The figure below compares the running time of the existing work and the proposed work. The result analysis shows the efficiency of the proposed work in terms of time.


Figure 1. Time comparison

The figure below compares the number of words in the summarized file with the total number of words in the original dataset, for both the existing work and the proposed work. The result analysis shows the efficiency of the proposed work. The comparison also illustrates the compression factor.

Figure 2. Comparison of the number of words in the summarized file

VI. CONCLUSION

Text summarization provides a summary of a text document. In this paper an efficient technique for text summarization has been implemented. The proposed work is very efficient compared to the existing technique of text summarization. As shown in the tables and figures above, the proposed technique achieves a better compression factor. The technique has high precision, recall, and F-measure, and hence high accuracy. The proposed technique also takes less time to summarize the file.

REFERENCES

[1] Radev, D. R., Hovy, E., and McKeown, K. (2002) “Introduction to the special issue on summarization”, Journal Computational Linguistics – Summarization, Volume 28 Issue 4, pp. 399-408, December 2002.

[2] D. R. Radev and K. R. McKeown, "Generating natural language summaries from multiple on-line sources", Journal Computational Linguistics, Volume 24, Issue 3, pp. 469–500, 1998.

[3] M. White and C. Cardie, "Selecting sentences for multidocument summaries using randomized local search", In Proceedings of the Workshop on Automatic Summarization (including DUC 2002), Association for Computational Linguistics, New Brunswick, NJ, pp. 9–18, Philadelphia, July 2002.

[4] A. H. F. Laender, B. A. Ribeiro-Neto, A. S. da Silva, and J. S. Teixeira “A brief survey of web data extraction tools”, ACM SIGMOD Record Volume 31, Issue 2, pp. 84-93, June 2002.

[5] Y. Diao, H. Lu, S. Chen, and Z. Tian. Toward learning based web query processing. Proceedings of the 26th International Conference on Very Large Data Bases, pages 317-328, San Francisco, CA, USA, 2000.

[6] S.-H. Lin and J.-M. Ho. “Discovering informative content blocks from web documents”, Proceedings of the 8th international conference on Knowledge discovery and data mining (SIGKDD), pp. 588-593, 2002.

[7] D. Gibson, K. Punera, and A. Tomkins “The volume and evolution of web page templates” Special interest tracks of the 14th international conference on World Wide Web, pp. 830-839, New York, NY, USA, 2005.

[8] I. Debnath, P. Mitra, N. Pal, and C. L. Giles “Automatic identification of informative sections of web pages”, IEEE Transactions on Knowledge and Data Engineering, Volume 17, Issue 9, pp. 1233-1246, 2005.

[9] D. Chakrabarti, R. Kumar, and K. Punera “Page-level template detection via isotonic smoothing”, Proceedings of the 16th international conference on World Wide Web, pp. 61-70, Canada - 2007.

[10] L. Yi, B. Liu, and X. Li. “Eliminating noisy information in Web pages for data mining”, Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '03, page 296-305, 2003.

[11] G. Vineel. “Web page dom node characterization and its application to page segmentation”, In Proceedings of the 3rd IEEE international conference on Internet multimedia services architecture and applications, IMSAA'09, pp. 325-330, NJ, USA, 2009.

[12] J. Kang, J. Yang, and J. Choi. “Repetition-based web page segmentation by detecting tag patterns for small-screen devices”, IEEE Transactions on Consumer Electronics, Volume 56, Issue 2, pp. 980-986, 2010.

[13] S. Alcic and S. Conrad. “Page segmentation by web content clustering”, In Proceedings of the International Conference on Web Intelligence, Mining and Semantics, WIMS '11, pp.1-9, New York, NY, USA, 2011.


[14] H. Yan and M. Miao. Research and implementation on multi-cues based page segmentation algorithm. International Conference on Computational Intelligence and Software Engineering, 2009. CiSE 2009., pp. 1-4, 2009.

[15] D. Cao, X. Liao, H. Xu, and S. Bai. Blog post and comment extraction using information quantity of web format. In Proceedings of the 4th Asia information retrieval conference, AIRS'08, pp. 298-309, Berlin, Heidelberg, 2008.

[16] A. Zhang, J. Jing, L. Kang, and L. Zhang. “Precise web page segmentation based on semantic block headers detection”, 6th International Conference on Digital Content, Multimedia Technology and its Applications (IDC), pp. 63-68, 2010.

[17] C. Kohlschutter and W. Nejdl. A densitometric approach to web page segmentation. In Proceedings of the 17th ACM conference on Information and knowledge management, CIKM '08, pp. 1173-1182, New York, NY, USA, 2008.

[18] A. Tombros and M. Sanderson. “Advantages of query biased summaries in information retrieval”, Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR ’98), pp. 2- 10, 1998.

[19] R. W. White, J. M. Jose, and I. Ruthven. “A task-oriented study on the influencing effects of query-biased summarisation in web searching” Information Processing and Management: an International Journal, Volume 39 Issue 5, pp.707–733, September 2003.

[20] C. L. A. Clarke, E. Agichtein, S. Dumais, and R. W. White. “The influence of caption features on click through patterns in web search”, Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR ’07), pp. 135-142, 2007.

[21] Y. Huang, Z. Liu, and Y. Chen. “Query biased snippet generation in xml search”, Proceedings of the 2008 ACM SIGMOD international conference on Management of data (SIGMOD ’08), pp. 315-326, 2008.