DEVELOPING AN ADAPTIVE AND HIERARCHICAL SUMMARIZATION FRAMEWORK FOR SEARCH ENGINES
Tunga Güngör
Boğaziçi University, Computer Engineering Dept., Istanbul, Turkey
(Visiting Professor at TALP Research Center, UPC)
OUTLINE
• INTRODUCTION
• LITERATURE SURVEY
▫ Search Engines and Query Types
▫ Automatic Analysis of Documents
▫ Automatic Summarization
• OVERVIEW OF METHODOLOGY
▫ System Architecture
▫ Implementation
▫ Data Collection
• STRUCTURAL PROCESSING
▫ Rule-based Approach
▫ Machine Learning Approach
• SUMMARY EXTRACTION
• DISCUSSION
• FUTURE RESEARCH
INTRODUCTION
Introduction
• Rapid growth of information sources
▫ World Wide Web
▫ “information overload”
• 50% of documents viewed in search engine results
▫ not relevant (Jansen and Spink, 2005)
• Users are interested in different types of search
▫ rather than queries with commonplace answers
  e.g. capital city of Sweden
▫ specific and complex queries
  e.g. best countries for retirement
▫ tasks such as background search
  e.g. literature survey on Mexican air pollution
Introduction (cont.)
• Available search engines
▫ results in response to a user query
▫ each presented with a short ‘summary’
  2-3 line extracts
  document fragments containing query words
  fail to reveal their context within the whole document
• The users
▫ scroll down the results
▫ click those that seem relevant to their real information need
▫ inadequate summaries
  missing relevant documents
  spending time with irrelevant documents
  not feasible to open each link
Example Output of Google
Introduction (cont.)
• Automatic summarization
▫ as successful as humans
  long-term research direction (Sparck Jones, 1999)
▫ improve effectiveness of other tasks
  e.g. information retrieval
• Traditionally, automatic summarization research:
▫ general-purpose summaries
  e.g. the “abstract page” of a report
  But, need to bias towards user queries in an information retrieval paradigm
▫ a document is seen as a flat sequence of sentences
  ignoring the inherent structure
  But, Web documents
    complex organization of content
    sections and subsections with different topics and formatting
Research Goals
• a novel summarization approach for Web search
▫ combining these two aspects
  Document structure
  Query-biased techniques
▫ not investigated together in previous studies
• Intuition
▫ providing the context of searched terms
▫ preserving the structure of the document
  Sectional hierarchy and heading structure
▫ may help the users to determine the relevancy of results better
• Two-stage approach
▫ Structural processing
▫ Summary extraction
Research Goals (cont.)
• Web documents
▫ no domain restriction
▫ typically heterogeneous
  images, text in different formats, forms, menus, etc.
▫ diverse content
  with sections on different topics, advertisements, etc.
• Structural and semantic analysis of Web documents
▫ Heading-based sectional hierarchy
• Use of this structural and semantic information
▫ during summarization process
▫ in the output summaries
▫ query-biased techniques
Part of an Example Web Document
LITERATURE SURVEY
Search Engines
• Information retrieval (IR)
▫ storage, retrieval and maintenance of information
• differences on the Web
▫ distributed architecture
▫ the heterogeneity of the available information
▫ its size and growth rate, etc.
• Search engine
▫ allows the user to enter search terms (queries)
  run against a database
▫ retrieves Web pages that match the search terms
Query Types
• Boolean search
▫ keywords separated by (implicit or explicit) Boolean operators
• Phrase search
▫ a set of contiguous words
• Proximity search
• Range searching
• Field searching
• Natural language search
▫ Thesaurus search
▫ Fuzzy search
Information Needs of Users
• Categorization (Ingwersen & Järvelin, 2005)
▫ intentionality or goal of the searcher
▫ the kind of knowledge currently known by the searcher
▫ the quality of what is known
▫ well-defined knowledge of the user
  specific information sources are searched
▫ in ill-defined (muddled) cases
  the search process is exploratory
• Types of information need in Web search (White et al., 2003)
▫ search for a fact
▫ search for a number of items
▫ decision search
▫ background search
General Document Analysis
• physical components
▫ paragraphs, words, figures, etc.
• logical components
▫ titles, authors, sections, etc.
• as a syntactic analysis problem
• physical and logical components of a document
▫ ordered tree
• transformation-based learning
• generalized n-gram model
• probabilistic grammars
• incremental parsing
▫ syntactic parsing (Collins and Roark, 2004)
▫ generating table-of-contents for a long document (Branavan et al., 2007)
Web Document Analysis
• Web documents
▫ HTML (Hypertext Markup Language)
  presentation of content
▫ semi-structured documents
• Motivations
▫ to filter important content
▫ to convert HTML documents into semantically-rich XML documents
▫ obtaining a hierarchical structure for the documents
▫ display content in small-screen devices such as PDAs
▫ more intelligent retrieval of information, summarization, etc.
• Approaches
▫ HTML tags and DOM tree
▫ rule-based or machine learning-based
▫ certain domain or domain-independent
Web Document Analysis (cont.)
• Different from most previous work
▫ section and subsection headings
• HTML
▫ Markup tags, attributes and attribute values
▫ e.g. <font size = 3>
• Two types of HTML tags
▫ container tags (e.g. <table>, <td>, <tr>, etc.)
  contain other HTML tags or text
▫ format tags (e.g. <b>, <font>, <h1>, <h2>, etc.)
  usually concerned with the formatting of text
• DOM (Document Object Model)
▫ provides an interface as a tree
Automatic Summarization
• Process of distilling the most important information
▫ from a source (or sources) to produce a shortened version
▫ for particular users and tasks
• Uses
▫ as an aid for browsing
  single large documents or sets of documents
▫ in sifting process
  to locate useful documents in a large collection
▫ as an aid for report writers
  by providing abstracts
• related to and influenced by
▫ information retrieval
▫ information extraction
▫ text mining
Automatic Summarization (cont.)
• Types of summaries
▫ “Extract” vs “abstract”
▫ “Generic” vs “query-relevant”
▫ “Single-document” vs “multi-document”
▫ “Indicative” vs “informative”
• Phases of summarization
▫ Analysis of input text
▫ Transformation into a summary representation
▫ Synthesis of output summary
Automatic Summarization (cont.)
• Approaches
▫ Surface-level approaches
  use shallow features to identify important information in the text
  thematic features, location, background, cue words and phrases, etc.
▫ Entity-level approaches
  build an internal representation of the text by modeling text entities and their relationships
  e.g. using graph topology
▫ Discourse-level approaches
  global structure of the text and its relation to communicative goals
▫ Hybrid approaches
• Evaluation
▫ intrinsic
  the summary itself is evaluated
▫ extrinsic
  i.e. task-based evaluation
Recent Work on Summarization
• Mostly generic summaries
▫ based on sentence weighting
• Tombros & Sanderson, 1998
▫ query-biased summaries in information retrieval
• Google, Altavista
• White et al., 2003
▫ longer query-biased summaries
▫ summary window
• Alam et al., 2003
▫ structured and generic summaries
  “table of content”-like hierarchy of sections and subsections
Recent Work on Summarization (cont.)
• Yang & Wang, 2008
▫ fractal summarization
▫ hierarchical structure of document
  levels, chapters, sections, subsections, paragraphs, sentences and terms
▫ generic summaries
• Varadarajan & Hristidis, 2005
▫ adding structure
  document is divided into fragments (paragraphs)
  connecting related fragments as a graph (implicit structure)
▫ query-biased
• In this research, combining
▫ explicit document structure and query-biased techniques
OVERVIEW OF METHODOLOGY
System Architecture
Structural Processing
• Rule-based and machine learning-based approaches
• Input
▫ a Web document in HTML format
• Output
▫ a tree representing the sectional hierarchy of the document
  intermediate nodes: headings and subheadings
  leaves: other text units
Summarization
• Using the output of structural processing
▫ document tree
• indicative summaries
▫ extractive approach
• longer summaries
▫ in a separate frame
Implementation
• GATE (A General Architecture for Text Engineering)
▫ open source project using component-based technology in Java
▫ commonly used natural language functionalities
  Tokeniser, Sentence Splitter, Stemmer, etc.
• Cobra Java HTML Renderer and Parser
▫ open source project
▫ supports HTML 4, Javascript and Cascading Style Sheets (CSS)
• Implemented modules
▫ Structural analysis of HTML documents
▫ Summarization engine
Data Collection
English queries
1 Hubble telescope achievements
2 best retirement country
3 literary/journalistic plagiarism
4 Mexican air pollution
5 antibiotics bacteria disease
6 abuses of e-mail
7 declining birth rates
8 human genetic code
9 mental illness drugs
10 literacy rates
11 robotic technology
12 creativity
13 tourism, increase
14 newspapers electronic media
15 wildlife extinction
16 R&D drug prices
17 Amazon rain forest
18 Osteoporosis
19 alternative medicine
20 health and computer terminals
Turkish queries
1 Tsunami (tsunami)
2 ekonomik kriz (economic crisis)
3 Türkiye'de meydana gelen depremler (earthquakes in Turkey)
4 sanat ödülleri (art awards)
5 bilişim eğitimi ve projeleri (IT education and projects)
• Users
▫ mostly Boolean queries with 2-3 words
• Current search interests
▫ various domains
• English Collection
• Turkish Collection
• Extended English Collection
RULE-BASED APPROACH FOR STRUCTURAL PROCESSING
The Method
• A heuristic approach based on DOM processing
▫ Heading-based sectional hierarchy identification
• nontrivial task
▫ heterogeneity of Web documents
▫ the underlying HTML format
• Three steps
▫ DOM tree processing
▫ Heading identification
▫ Hierarchy restructuring
Step 1: DOM Tree Processing
• Semantically related parts
▫ same or neighboring container tags
• Traverse DOM tree in a breadth-first way
▫ Sentence boundaries
▫ Format tags such as <font> are passed as features
▫ Output: a simplified version of the original tree
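The traversal in Step 1 can be sketched as follows. This is a minimal illustration and not the Cobra/GATE implementation named in the slides: it uses Python's standard `html.parser`, the names `Node`, `TreeBuilder`, `text_units`, `FORMAT_TAGS` and `VOID_TAGS` are hypothetical, and the split into format tags is an assumption based on the examples given earlier. For brevity it returns a flat list of text units with their inherited format tags instead of a simplified tree.

```python
from collections import deque
from html.parser import HTMLParser

# Assumed set of format tags; the slides list only a few examples.
FORMAT_TAGS = {"b", "i", "u", "em", "strong", "font", "a",
               "h1", "h2", "h3", "h4", "h5", "h6"}
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None, text=""):
        self.tag, self.parent, self.text = tag, parent, text
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a bare-bones DOM-like tree (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:      # void elements never get children
            self.current = node

    def handle_endtag(self, tag):
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent        # tolerate unclosed inner tags
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(
                Node("#text", self.current, data.strip()))


def text_units(html):
    """Breadth-first walk returning (text, inherited format tags) pairs."""
    builder = TreeBuilder()
    builder.feed(html)
    units, queue = [], deque([(builder.root, frozenset())])
    while queue:
        node, feats = queue.popleft()
        if node.tag == "#text":
            units.append((node.text, set(feats)))
            continue
        if node.tag in FORMAT_TAGS:
            feats = feats | {node.tag}   # pass the format tag down as a feature
        queue.extend((child, feats) for child in node.children)
    return units
```

Note that a breadth-first walk groups units by depth, so text at a shallow level is emitted before deeper text even if it appears later in the page; document order would have to be restored separately where it matters.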
DOM Tree of an Example Document
Example Output of DOM Tree Processing
Step 2: Heading Identification
• Heading tags in HTML
▫ <h1> through <h6>
▫ rarely used for this purpose
• Headings
▫ formed by formatting them differently from surrounding text
▫ more emphasized than following content
• Heuristics
▫ if-then rules
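The slides do not list the actual if-then rules, so the following is only a sketch of what one such rule might look like. It encodes the stated intuition that a heading is more emphasized than the content that follows it. The `TextUnit` fields, the `emphasis` weights and the `max_words` cutoff are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TextUnit:
    text: str
    font_size: int = 3      # assumed default for <font size>
    bold: bool = False
    h_level: int = 0        # 1..6 for <h1>..<h6>, 0 if no heading tag
    all_upper: bool = False


def emphasis(unit):
    """Hypothetical emphasis score; the weights are illustrative only."""
    score = unit.font_size
    if unit.bold:
        score += 1
    if unit.all_upper:
        score += 1
    if unit.h_level:
        score += 7 - unit.h_level   # <h1> counts most
    return score


def is_heading(unit, following, max_words=15):
    """If-then rule: short, and more emphasized than the text that follows."""
    if len(unit.text.split()) > max_words:
        return False
    if following is None:           # nothing follows, so nothing to head
        return False
    return emphasis(unit) > emphasis(following)
```

For example, a bold unit followed by plain text of the same font size is flagged as a heading, while the plain text itself is not.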
Features for Identifying Text Format
Feature       Description                                        Data Type
h1            <h1>, level-1 heading                              Boolean
h2            <h2>, level-2 heading                              Boolean
h3            <h3>, level-3 heading                              Boolean
h4            <h4>, level-4 heading                              Boolean
h5            <h5>, level-5 heading                              Boolean
h6            <h6>, level-6 heading                              Boolean
B             <b>, bold                                          Boolean
strong        <strong>, strong emphasis                          Boolean
em            <em>, emphasis                                     Boolean
A             <a>, hyperlink                                     Boolean
U             <u>, underlined                                    Boolean
I             <i>, italic                                        Boolean
f_size        <font size=…>, font size                           Integer
f_color       <font color=…>, font color                         String
f_face        <font face=…>, font face                           String
allUpperCase  all the letters of the words are in uppercase      Boolean
cssId         CSS id attribute if used                           String
cssClass      CSS class attribute if used                        String
alignment     align attribute                                    String
li            <li>, different levels of list elements            Integer
Step 3: Hierarchy Restructuring
• Headings + feature set
▫ to differentiate different levels of heading
• Restructure the document tree
▫ bottom-up approach
Step 3: Hierarchy Restructuring (cont.)
Performance Measures
Heading Extraction
Proposed Method   Golden Standard: Heading   Golden Standard: Non-heading
Heading           TP                         FP
Non-heading       FN                         TN
$$R = \frac{TP}{TP + FN}, \qquad P = \frac{TP}{TP + FP}, \qquad F = \frac{2PR}{P + R}$$
Hierarchy Extraction
• Parent-child relationships in the document tree
▫ Heading-subheading
▫ Heading-underlying text
$$\mathrm{Hierarchy\_Accuracy}(i) = \frac{\sum_{(p,c) \in PC} e_i(p,c)}{|PC|}$$
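As a worked illustration of these measures, the sketch below computes recall, precision and F-measure from confusion counts, and a hierarchy accuracy over parent-child pairs. The function names are hypothetical, and the accuracy function assumes that the measure is the fraction of golden-standard (parent, child) pairs that the extracted tree reproduces, which is an interpretation rather than something the slide states in full.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-measure) from confusion counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def hierarchy_accuracy(gold_pairs, extracted_pairs):
    """Assumed reading: share of gold (parent, child) pairs that were recovered."""
    gold = set(gold_pairs)
    if not gold:
        return 0.0
    return len(gold & set(extracted_pairs)) / len(gold)
```

With 8 true positives, 2 false positives and 2 false negatives, all three heading measures come out at 0.8; an extracted tree that recovers two of three gold parent-child pairs scores 2/3.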
English Collection
Heading extraction
Document Set  Actual Number  Proposed Sys. Recall  Proposed Sys. Precision  Proposed Sys. F-measure  Baseline Recall
1             6.50           0.94                  0.60                     0.69                     0.51
2             11.30          0.80                  0.65                     0.67                     0.34
3             8.20           0.91                  0.56                     0.66                     0.68
4             3.60           0.89                  0.64                     0.73                     0.38
5             9.30           0.89                  0.58                     0.66                     0.57
6             18.10          0.82                  0.70                     0.73                     0.39
7             5.40           0.84                  0.59                     0.67                     0.27
8             6.90           0.98                  0.57                     0.68                     0.56
9             12.70          0.93                  0.76                     0.82                     0.38
10            6.20           0.84                  0.75                     0.77                     0.24
Average       8.82           0.88                  0.64                     0.71                     0.43
• Baseline
▫ using only heading tags <h1> through <h6>
• High value for heading recall
• Precision is lower
▫ cluttered organization in Web documents
English Collection (cont.)
Hierarchy extraction
Document Set  DOM Tree  Proposed Sys. Hierarchy  Baseline Hierarchy  Actual Hierarchy
1             15.80     5.50                     3.40                3.70
2             20.80     8.20                     3.10                4.20
3             12.10     7.30                     3.90                4.10
4             13.90     4.90                     3.40                3.90
5             13.20     6.10                     3.70                4.00
6             13.00     7.00                     3.60                4.40
7             19.20     6.20                     3.10                3.80
8             12.80     6.10                     3.70                4.20
9             17.50     7.10                     3.30                4.00
10            13.80     7.00                     2.90                4.80
Average       15.21     6.54                     3.41                4.11
Document Set  Baseline (only h tags)  Proposed System
1             0.57                    0.58
2             0.52                    0.81
3             0.64                    0.74
4             0.40                    0.66
5             0.51                    0.66
6             0.40                    0.65
7             0.54                    0.74
8             0.55                    0.69
9             0.48                    0.77
10            0.36                    0.78
Average       0.50                    0.71
• a significant improvement to accuracy
▫ compared to the baseline
Turkish Collection
Heading extraction
Document Set  Number of Headings  Recall  Precision  F-measure
1             7.60                0.81    0.56       0.64
2             5.40                0.67    0.63       0.61
3             5.10                0.84    0.49       0.66
4             4.90                0.89    0.54       0.68
5             9.20                0.89    0.68       0.73
Average       5.40                0.79    0.57       0.65
Hierarchy extraction
Document Set  DOM Tree Depth  Hierarchy Depth  Hierarchy Accuracy
1             17.6            6.5              0.49
2             16.2            5.0              0.61
3             20.4            7.5              0.78
4             18.8            5.6              0.80
5             19.2            5.1              0.81
Average       17.2            6.1              0.70
• Baseline method failed
▫ no <h> tags used
• Additional analysis
▫ 50 documents on boun.edu.tr domain
▫ 71% accuracy
MACHINE LEARNING APPROACH FOR STRUCTURAL PROCESSING
The Approach
• Machine learning
▫ can be more flexible
▫ by combining several features using a training corpus
  rather than predefined rules
• Extraction of sectional hierarchy of a Web document
▫ A tree-based learning approach needed
  as in syntactic parsing
▫ exponential search space
• incremental algorithm
▫ making a sequence of locally optimal choices
▫ to approximate a globally optimal solution
• Document
▫ as a sequence of text units
Example HTML document
Heading Extraction Model
• Binary classification
▫ As a sequence of text units
▫ Headings: positive examples
▫ Non-headings: negative examples
Hierarchy Extraction Model
• Learn a mapping from X (a set of documents) to Y (a set of possible sectional hierarchies of documents)
▫ Training examples (x_i, y_i) for i = 1…n
▫ A function GEN(x) enumerating a set of possible outputs for an input x
▫ A representation Φ mapping each (x_i, y_i) to a feature vector Φ(x_i, y_i)
▫ A parameter vector α
▫ Estimate α such that it will give highest scores to correct outputs:
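The scoring rule that the last bullet of this slide leads into is not preserved in the transcript. In the linear-model framework used for incremental parsing (Collins and Roark, 2004), which the notation on this slide matches, the output for an input x is commonly chosen as shown below. This is an assumed form and may differ from the slide's exact equation.

$$F(x) = \underset{y \in \mathrm{GEN}(x)}{\arg\max}\; \Phi(x, y) \cdot \alpha$$

Under this reading, estimating α well means that for each training pair (x_i, y_i) the product Φ(x_i, y_i) · α is at least as large as Φ(x_i, y) · α for every other candidate y in GEN(x_i).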
Features
• Unit features
▫ Formatting features
  e.g. font size, boldness, color, etc.
▫ DOM tree features
  e.g. DOM address, DOM path, etc.
▫ Content features
  e.g. cue words / phrases, number of characters, punctuation mark, etc.
▫ Other features
  Visual position in the rendered Web document
• Contextual features
▫ composite features of two units in context
  distance and difference between features
  u_ij : unit i levels above a unit u, and j units to its left
• Global features
▫ e.g. the depth of sectional hierarchy
Incremental Learning Approach
• Document graph
▫ left to right based on the order of appearance
▫ Positive and negative examples
  Parent-child relationships (based on golden standard hierarchy)
▫ Two constraints
  Document order
  Projectivity rule
    “When searching for the parent of a unit u_j, consider only the previous unit (u_{j-1}), the parent of u_{j-1}, that unit’s parent, and so on to the root of the tree.”
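The projectivity rule can be illustrated with a short sketch that lists the allowed parents for the next unit. The representation is an assumption made for this example: units are numbered in document order, `parent_of[k]` holds the index of unit k's parent, and -1 stands for the root. The function name `candidate_parents` is hypothetical.

```python
def candidate_parents(parent_of, j):
    """Return allowed parents for unit j, given the parents of units 0..j-1.

    parent_of[k] is the index of unit k's parent, or -1 for the root.
    """
    candidates = []
    node = j - 1                    # start from the previous unit
    while node != -1:
        candidates.append(node)
        node = parent_of[node]      # climb to that unit's parent
    candidates.append(-1)           # the root is always a candidate
    return candidates
```

With `parent_of = [-1, 0, 1, 0]` (a heading, its subheading, text under the subheading, then a second subheading), unit 4 may attach only to unit 3, unit 0, or the root; units 1 and 2 are no longer reachable.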
Incremental Learning Approach (cont.)
• Training set
▫ Web documents and corresponding golden standard hierarchies
• Algorithm
▫ works on units sequentially
Testing Approach
• Beam search
▫ Set of partial trees
▫ Beam width
▫ Two operations
  ADV (i.e. Advance)
    potential attachments of current unit to partial trees
  FILTER
    to prevent exponential growth of the set
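The two operations can be sketched as a small beam search over partial trees. This is an illustration under stated assumptions, not the system's code: a partial tree is a list of parent indices (-1 for the root), attachments are restricted to the previous unit and its ancestors, and `score_attachment` is a hypothetical stand-in for whatever score the learned model assigns to one attachment.

```python
def right_frontier(parents):
    """Units the next unit may attach to: previous unit, its ancestors, root (-1)."""
    frontier, node = [], len(parents) - 1
    while node != -1:
        frontier.append(node)
        node = parents[node]
    frontier.append(-1)
    return frontier


def beam_search(n_units, score_attachment, beam_width=10):
    """Return the best parent list found for units 0..n_units-1.

    score_attachment(parents, child, parent) -> float is a hypothetical
    stand-in for the learned model's score of one attachment.
    """
    beam = [(0.0, [])]                       # (total score, parents so far)
    for child in range(n_units):
        # ADV: extend every partial tree with every allowed attachment.
        advanced = []
        for total, parents in beam:
            for parent in right_frontier(parents):
                gain = score_attachment(parents, child, parent)
                advanced.append((total + gain, parents + [parent]))
        # FILTER: keep only the beam_width highest-scoring partial trees.
        advanced.sort(key=lambda item: item[0], reverse=True)
        beam = advanced[:beam_width]
    return beam[0][1]
```

A beam width of 1 reduces this to a greedy search; wider beams keep more alternatives alive at the cost of more scoring calls per unit.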
Testing Approach (cont.)
Variations
• M1
▫ probability value
• M2
▫ Run the algorithm in two levels
• M3
▫ integer ranks
▫ the times a tree obtains rank ‘1’
• M4
▫ integer ranks
▫ sum ranks obtained at each step
Testing Approach (cont.)
• Implementation
▫ Support Vector Machines
  SVM-light (Joachims, 1999)
▫ Perceptron
[Figure: algorithm flow with steps labeled “Process a unit” and “Update α”]
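The figure labels suggest that the perceptron variant updates α as units are processed. The slides do not give the update itself, so the sketch below assumes the standard perceptron update for ranking candidates: when the top-scoring candidate is not the gold one, add the gold feature vector to α and subtract the predicted one. Feature vectors are sparse dictionaries, and the names `dot` and `perceptron_step` are hypothetical.

```python
def dot(alpha, features):
    """Sparse dot product between the weights and one feature dict."""
    return sum(alpha.get(name, 0.0) * value for name, value in features.items())


def perceptron_step(alpha, candidates, gold_index):
    """One update over a list of sparse feature dicts, one per candidate.

    Returns True when the current weights already rank the gold candidate first.
    """
    predicted = max(range(len(candidates)),
                    key=lambda k: dot(alpha, candidates[k]))
    if predicted == gold_index:
        return True
    for name, value in candidates[gold_index].items():
        alpha[name] = alpha.get(name, 0.0) + value     # reward gold features
    for name, value in candidates[predicted].items():
        alpha[name] = alpha.get(name, 0.0) - value     # penalize wrong guess
    return False
```

Starting from empty weights, one mistaken prediction is enough in a two-candidate toy case for the next call to rank the gold candidate first.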
Evaluation
Heading Extraction
• 5-fold cross-validation
Statistics for Extended English Collection
Number of documents        500
Avg. number of text units  110.7
Avg. hierarchy depth       4.1
Avg. number of headings    10.6
Feature Set  Features                                Number of Features
Φ1           Fn, Fn(n+1)                             58
Φ2           Fn, Fn(n+1), Fn(n-1)                    86
Φ3           Fn, Fn(n+1), Fn(n+2)                    82
Φ4           Fn, Fn(n+1), Fn(n+2), Fn(n-1)           110
Φ5           Fn, Fn(n+1), Fn(n+2), Fn(n-1), Fn(n-2)  134
Method            Feature Set  Recall  Precision  F-measure
SVM – Linear      Φ1           0.85    0.78       0.81
SVM – Linear      Φ2           0.83    0.78       0.80
SVM – Linear      Φ3           0.81    0.77       0.79
SVM – Linear      Φ4           0.83    0.78       0.80
SVM – Linear      Φ5           0.83    0.78       0.80
SVM – Polynomial  Φ1           0.87    0.80       0.83
SVM – Polynomial  Φ2           0.85    0.80       0.82
SVM – Polynomial  Φ3           0.87    0.82       0.84
SVM – Polynomial  Φ4           0.85    0.80       0.82
SVM – Polynomial  Φ5           0.87    0.84       0.85
SVM – RBF         Φ1           0.84    0.76       0.80
SVM – RBF         Φ2           0.84    0.79       0.81
SVM – RBF         Φ3           0.87    0.81       0.84
SVM – RBF         Φ4           0.88    0.83       0.85
SVM – RBF         Φ5           0.87    0.83       0.85
Perceptron        Φ1           0.71    0.77       0.74
Perceptron        Φ2           0.70    0.78       0.74
Perceptron        Φ3           0.71    0.84       0.77
Perceptron        Φ4           0.78    0.82       0.80
Perceptron        Φ5           0.77    0.81       0.79
• Comparing with related work
▫ Xue et al., 2007: extraction of the main title (i.e. a single heading) from HTML documents; SVM and CRF; a maximum F-measure of 0.80
• A more general and challenging problem
▫ extraction of all the headings in a given HTML document
▫ obtained an F-measure of 0.85
Evaluation (cont.)
| Method | Recall | Precision | F-measure |
| --- | --- | --- | --- |
| SVM | 0.87 | 0.84 | 0.85 |
| Perceptron | 0.78 | 0.82 | 0.80 |
| Rule-based Approach | 0.72 | 0.64 | 0.68 |
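As a quick consistency check, the F-measure column can be recomputed from the recall and precision columns. The sketch below assumes the balanced F-measure (harmonic mean of recall and precision), which the slides do not state explicitly; the function name `f_measure` is illustrative.

```python
def f_measure(recall, precision):
    """Balanced F-measure: harmonic mean of recall and precision."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# (recall, precision) pairs taken from the comparison table.
reported = {
    "SVM": (0.87, 0.84),
    "Perceptron": (0.78, 0.82),
    "Rule-based Approach": (0.72, 0.64),
}

for method, (recall, precision) in reported.items():
    print(f"{method}: F = {f_measure(recall, precision):.2f}")
```

Under that assumption the recomputed values round to the 0.85, 0.80 and 0.68 reported in the table.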
![Page 55: DEVELOPING AN ADAPTIVE AND HIERARCHICAL SUMMARIZATION FRAMEWORK FOR SEARCH ENGINES Tunga Güngör Boğaziçi University, Computer Engineering Dept., Istanbul,](https://reader038.fdocuments.us/reader038/viewer/2022110320/56649cc25503460f9498998d/html5/thumbnails/55.jpg)
Evaluation (cont.)
Hierarchy extraction
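| Feature Set | Features | Number of Features |
| --- | --- | --- |
| Φ1 | F10 | 17 |
| Φ2 | F10, F01 | 40 |
| Φ3 | F10, F01, F20 | 57 |
| Φ4 | F10, F01, F20, F02 | 73 |

| Learning Algorithm | Φ1 | Φ2 | Φ3 | Φ4 |
| --- | --- | --- | --- | --- |
| SVM – Linear | 0.42 | 0.61 | 0.61 | 0.61 |
| SVM – Polynomial | 0.57 | 0.63 | 0.63 | 0.65 |
| SVM – RBF | 0.58 | 0.66 | 0.67 | 0.67 |
| Perceptron | 0.51 | 0.46 | 0.46 | 0.46 |

| Learning Algorithm | Beam width 1 | Beam width 10 | Beam width 20 | Beam width 50 | Beam width 100 |
| --- | --- | --- | --- | --- | --- |
| SVM – Polynomial | 0.64 | 0.65 | 0.65 | 0.65 | 0.65 |
| SVM – RBF | 0.66 | 0.66 | 0.66 | 0.66 | 0.67 |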
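To make the role of the beam width concrete, here is a generic beam-search sketch for assigning a depth to each heading in document order. It is not the authors' algorithm: the function `beam_search_depths`, the constraint that a heading sits at most one level below its predecessor, the toy scorer and the `level_hint` field are all assumptions made for illustration. In the actual system the step score would come from the trained SVM or perceptron model.

```python
def beam_search_depths(units, score_step, beam_width=10):
    """Assign a depth to each heading, keeping the best partial hierarchies."""
    beam = [(0.0, [])]  # (cumulative score, depths chosen so far)
    for unit in units:
        candidates = []
        for total, depths in beam:
            # Assumption: a heading is at most one level deeper than the previous one.
            max_depth = depths[-1] + 1 if depths else 1
            for depth in range(1, max_depth + 1):
                step = score_step(depths, depth, unit)
                candidates.append((total + step, depths + [depth]))
        # Keep only the `beam_width` highest-scoring partial hierarchies.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_width]
    return beam[0][1]


def toy_score(depths, depth, unit):
    """Hypothetical scorer: penalise distance from a per-heading level hint."""
    return -abs(depth - unit["level_hint"])


headings = [
    {"text": "Introduction", "level_hint": 1},
    {"text": "Background", "level_hint": 2},
    {"text": "Related work", "level_hint": 2},
    {"text": "Summarization", "level_hint": 3},
    {"text": "Conclusion", "level_hint": 1},
]

greedy = beam_search_depths(headings, toy_score, beam_width=1)
wide = beam_search_depths(headings, toy_score, beam_width=100)
print(greedy, wide)
```

With a scorer this simple the greedy search (beam width 1) already finds the best hierarchy, which is in line with the small differences across beam widths in the table.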
![Page 56: DEVELOPING AN ADAPTIVE AND HIERARCHICAL SUMMARIZATION FRAMEWORK FOR SEARCH ENGINES Tunga Güngör Boğaziçi University, Computer Engineering Dept., Istanbul,](https://reader038.fdocuments.us/reader038/viewer/2022110320/56649cc25503460f9498998d/html5/thumbnails/56.jpg)
Evaluation (cont.)
• Error analysis
▫ heading extraction: false negatives, false positives
▫ heuristic-based incremental approach
▫ cluttered Web documents with complex layouts
▫ errors made by Web document authors
• acceptable results as a fully automatic approach

| Method | Model 1 headings | Manual headings |
| --- | --- | --- |
| Rule-based Approach | 0.61 | 0.81 |
| Perceptron | 0.51 | 0.82 |
| SVM | 0.68 | 0.79 |

| Learning Algorithm | M0 | M1 | M2 | M3 | M4 |
| --- | --- | --- | --- | --- | --- |
| SVM – Polynomial | 0.65 | 0.67 | 0.59 | 0.64 | 0.68 |
| SVM – RBF | 0.67 | 0.67 | 0.59 | 0.67 | 0.66 |
![Page 57: DEVELOPING AN ADAPTIVE AND HIERARCHICAL SUMMARIZATION FRAMEWORK FOR SEARCH ENGINES Tunga Güngör Boğaziçi University, Computer Engineering Dept., Istanbul,](https://reader038.fdocuments.us/reader038/viewer/2022110320/56649cc25503460f9498998d/html5/thumbnails/57.jpg)
SUMMARY EXTRACTION
![Page 58: DEVELOPING AN ADAPTIVE AND HIERARCHICAL SUMMARIZATION FRAMEWORK FOR SEARCH ENGINES Tunga Güngör Boğaziçi University, Computer Engineering Dept., Istanbul,](https://reader038.fdocuments.us/reader038/viewer/2022110320/56649cc25503460f9498998d/html5/thumbnails/58.jpg)
Summarization Method
• Structural information
▫ used to determine important sentences and sections
▫ preserved in the output summaries
• Two levels of scoring
▫ Sentence scoring: determines important sentences; adapted to utilize the output of structural processing; combines the heading, location, term frequency and query methods
▫ Section scoring: determines important sections; the score of a section is the sum of the scores of the sentences in that section

$$s_{\text{sentence}} = s_{\text{heading}} \times w_{\text{heading}} + s_{\text{location}} \times w_{\text{location}} + s_{\text{tf}} \times w_{\text{tf}} + s_{\text{query}} \times w_{\text{query}}$$
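The two scoring levels can be sketched in a few lines. This is a minimal illustration of the weighted sum above, not the authors' implementation: the `SentenceScores` class, the function names and the component score values are hypothetical. The default weights (1, 1, 1 and 3 for the query method) are the ones used in the deck's worked example.

```python
from dataclasses import dataclass


@dataclass
class SentenceScores:
    """Component scores of one sentence (how each is computed is not shown here)."""
    heading: float
    location: float
    tf: float
    query: float


# Weights from the deck's example: heading = location = tf = 1, query = 3.
DEFAULT_WEIGHTS = {"heading": 1.0, "location": 1.0, "tf": 1.0, "query": 3.0}


def sentence_score(s, weights=DEFAULT_WEIGHTS):
    """Weighted sum of the four component scores."""
    return (s.heading * weights["heading"]
            + s.location * weights["location"]
            + s.tf * weights["tf"]
            + s.query * weights["query"])


def section_score(sentences, weights=DEFAULT_WEIGHTS):
    """A section's score is the sum of its sentences' scores."""
    return sum(sentence_score(s, weights) for s in sentences)


# Hypothetical component scores, chosen only to show the arithmetic.
section = [
    SentenceScores(heading=0.5, location=0.25, tf=0.25, query=1.0),
    SentenceScores(heading=0.0, location=0.5, tf=0.5, query=0.0),
]
print(sentence_score(section[0]), section_score(section))
```

Because the query weight is three times the others, a sentence that matches the query outranks one that only scores on position or term frequency.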
![Page 59: DEVELOPING AN ADAPTIVE AND HIERARCHICAL SUMMARIZATION FRAMEWORK FOR SEARCH ENGINES Tunga Güngör Boğaziçi University, Computer Engineering Dept., Istanbul,](https://reader038.fdocuments.us/reader038/viewer/2022110320/56649cc25503460f9498998d/html5/thumbnails/59.jpg)
Unstructured vs Structured Document
![Page 60: DEVELOPING AN ADAPTIVE AND HIERARCHICAL SUMMARIZATION FRAMEWORK FOR SEARCH ENGINES Tunga Güngör Boğaziçi University, Computer Engineering Dept., Istanbul,](https://reader038.fdocuments.us/reader038/viewer/2022110320/56649cc25503460f9498998d/html5/thumbnails/60.jpg)
Example Sentence Score Calculation
Query: antibiotics bacteria disease
Sentence: “These are the bacteria that are usually involved with bacterial disease such as ulcers, fin rot, acute septicaemia and bacterial gill disease.”
$w_{\text{heading}} = w_{\text{location}} = w_{\text{tf}} = 1$ and $w_{\text{query}} = 3$
![Page 61: DEVELOPING AN ADAPTIVE AND HIERARCHICAL SUMMARIZATION FRAMEWORK FOR SEARCH ENGINES Tunga Güngör Boğaziçi University, Computer Engineering Dept., Istanbul,](https://reader038.fdocuments.us/reader038/viewer/2022110320/56649cc25503460f9498998d/html5/thumbnails/61.jpg)
Summarization Experiment
![Page 62: DEVELOPING AN ADAPTIVE AND HIERARCHICAL SUMMARIZATION FRAMEWORK FOR SEARCH ENGINES Tunga Güngör Boğaziçi University, Computer Engineering Dept., Istanbul,](https://reader038.fdocuments.us/reader038/viewer/2022110320/56649cc25503460f9498998d/html5/thumbnails/62.jpg)
Summarization Experiment
• Task-based evaluation
▫ information retrieval tasks, according to usefulness in a search engine
▫ queries and documents used in structural processing experiments
• Four types of summaries
▫ Google – Query-biased extracts provided by Google
▫ Unstructured – Query-biased summaries without use of structural information
▫ Structured1 – Structure-preserving and query-biased summaries using output of structural processing step
▫ Structured2 – Structure-preserving and query-biased summaries using manually identified structure
• The summaries are about the same size
▫ except Google
▫ to make them comparable
![Page 63: DEVELOPING AN ADAPTIVE AND HIERARCHICAL SUMMARIZATION FRAMEWORK FOR SEARCH ENGINES Tunga Güngör Boğaziçi University, Computer Engineering Dept., Istanbul,](https://reader038.fdocuments.us/reader038/viewer/2022110320/56649cc25503460f9498998d/html5/thumbnails/63.jpg)
Example TREC Query
![Page 64: DEVELOPING AN ADAPTIVE AND HIERARCHICAL SUMMARIZATION FRAMEWORK FOR SEARCH ENGINES Tunga Güngör Boğaziçi University, Computer Engineering Dept., Istanbul,](https://reader038.fdocuments.us/reader038/viewer/2022110320/56649cc25503460f9498998d/html5/thumbnails/64.jpg)
Example Summary of Proposed System
• for the query “Antibiotics Bacteria Disease”
![Page 65: DEVELOPING AN ADAPTIVE AND HIERARCHICAL SUMMARIZATION FRAMEWORK FOR SEARCH ENGINES Tunga Güngör Boğaziçi University, Computer Engineering Dept., Istanbul,](https://reader038.fdocuments.us/reader038/viewer/2022110320/56649cc25503460f9498998d/html5/thumbnails/65.jpg)
Experimental Methodology
• Within-subjects (i.e. repeated measures) design
▫ to minimize the effects of differences among subjects
▫ summary types and documents were presented in a random order to reduce carryover effects
▫ the original full-text document is not displayed until all the summaries for that document have been displayed
▫ 4-10 subjects
• Using a web-based interface
▫ Decision times of users recorded automatically
• User poll
▫ Helpfulness of summaries
▫ Likert scale (1: not helpful, 5: very helpful)
![Page 66: DEVELOPING AN ADAPTIVE AND HIERARCHICAL SUMMARIZATION FRAMEWORK FOR SEARCH ENGINES Tunga Güngör Boğaziçi University, Computer Engineering Dept., Istanbul,](https://reader038.fdocuments.us/reader038/viewer/2022110320/56649cc25503460f9498998d/html5/thumbnails/66.jpg)
Performance Measures
• Relevance prediction (Hobson et al., 2007)
▫ compare the subject’s judgment on a summary with his or her own judgment on the original full-text document
▫ more suitable for a real-world scenario

| Summary judgment | Original document judged Relevant | Original document judged Irrelevant |
| --- | --- | --- |
| Relevant | TP | FP |
| Irrelevant | FN | TN |

$$A = \frac{TP + TN}{TP + TN + FP + FN} \qquad FNR = \frac{FN}{TP + FN} \qquad FPR = \frac{FP}{TN + FP}$$
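The three measures defined on this slide follow directly from the confusion-matrix counts. The sketch below is illustrative (the function name `relevance_measures` is an assumption) and uses the Google row of the English Collection results as input. Precision, recall and F-measure are left out because the slides do not say how they were aggregated.

```python
def relevance_measures(tp, fp, fn, tn):
    """Accuracy, false negative rate and false positive rate from raw counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "fnr": fn / (tp + fn),
        "fpr": fp / (tn + fp),
    }


# Counts from the Google row of the English Collection results.
google = relevance_measures(tp=107, fp=38, fn=60, tn=95)
for name, value in google.items():
    print(f"{name}: {value:.2f}")
```

Rounded to two decimals these reproduce the A, FNR and FPR values reported for Google (0.67, 0.36 and 0.29).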
![Page 67: DEVELOPING AN ADAPTIVE AND HIERARCHICAL SUMMARIZATION FRAMEWORK FOR SEARCH ENGINES Tunga Güngör Boğaziçi University, Computer Engineering Dept., Istanbul,](https://reader038.fdocuments.us/reader038/viewer/2022110320/56649cc25503460f9498998d/html5/thumbnails/67.jpg)
Experiment Results
English Collection
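| System | TP | FP | FN | TN | A | P | R | F |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Google | 107 | 38 | 60 | 95 | 0.67 | 0.73 | 0.62 | 0.63 |
| Unstructured | 131 | 28 | 36 | 105 | 0.79 | 0.82 | 0.76 | 0.77 |
| Structured1 | 137 | 25 | 30 | 108 | 0.82 | 0.85 | 0.80 | 0.80 |
| Structured2 | 138 | 23 | 29 | 110 | 0.83 | 0.85 | 0.83 | 0.82 |

| System | FNR | FPR |
| --- | --- | --- |
| Google | 0.36 | 0.29 |
| Unstructured | 0.22 | 0.21 |
| Structured1 | 0.18 | 0.19 |
| Structured2 | 0.17 | 0.17 |

Improvement of proposed system over other methods

| System | A | P | R | F | FNR | FPR |
| --- | --- | --- | --- | --- | --- | --- |
| Google | +22.39% | +16.44% | +29.03% | +26.98% | -50% | -34.48% |
| Unstructured | +3.80% | +3.66% | +5.26% | +3.90% | -18.18% | -9.52% |

| System | Time (seconds) | Size (words) |
| --- | --- | --- |
| Google | 14.58 | 41 |
| Unstructured | 27.24 | 278 |
| Structured1 | 27.60 | 264 |
| Structured2 | 28.58 | 253 |
| Original | 41.43 | 1566 |

Repeated measures ANOVA: p<0.001 for f-measure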
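The improvement figures appear to be relative changes of the proposed system (Structured1) with respect to each baseline. A small sketch, with the hypothetical helper `relative_change`, reproduces several of them from the rounded values on this slide. Not every cell can be recomputed exactly from rounded inputs, so this is a sanity check rather than the authors' calculation.

```python
def relative_change(proposed, baseline):
    """Percentage change of the proposed system relative to a baseline."""
    return (proposed - baseline) / baseline * 100.0


# Structured1 versus Google and Unstructured, English Collection.
accuracy_vs_google = relative_change(0.82, 0.67)
accuracy_vs_unstructured = relative_change(0.82, 0.79)
fnr_vs_google = relative_change(0.18, 0.36)

print(f"A vs Google: {accuracy_vs_google:+.2f}%")
print(f"A vs Unstructured: {accuracy_vs_unstructured:+.2f}%")
print(f"FNR vs Google: {fnr_vs_google:+.2f}%")
```

A negative change is the desired direction for FNR and FPR, since both are error rates.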
![Page 68: DEVELOPING AN ADAPTIVE AND HIERARCHICAL SUMMARIZATION FRAMEWORK FOR SEARCH ENGINES Tunga Güngör Boğaziçi University, Computer Engineering Dept., Istanbul,](https://reader038.fdocuments.us/reader038/viewer/2022110320/56649cc25503460f9498998d/html5/thumbnails/68.jpg)
Experiment Results (cont.)
Turkish Collection
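| System | TP | FP | FN | TN | A | P | R | F |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Google | 45 | 20 | 10 | 75 | 0.80 | 0.69 | 0.82 | 0.75 |
| Unstructured | 43 | 13 | 12 | 82 | 0.83 | 0.77 | 0.78 | 0.77 |
| Structured1 | 49 | 8 | 6 | 87 | 0.91 | 0.86 | 0.89 | 0.88 |
| Structured2 | 47 | 10 | 8 | 85 | 0.88 | 0.82 | 0.85 | 0.84 |

| System | FNR | FPR |
| --- | --- | --- |
| Google | 0.18 | 0.21 |
| Unstructured | 0.22 | 0.14 |
| Structured1 | 0.11 | 0.08 |
| Structured2 | 0.15 | 0.11 |

Improvement of proposed system over other methods

| System | A | P | R | F | FNR | FPR |
| --- | --- | --- | --- | --- | --- | --- |
| Google | +13.75% | +24.64% | +8.54% | +17.33% | -38.89% | -61.90% |
| Unstructured | +9.64% | +11.69% | +14.10% | +14.29% | -50% | -42.86% |

| System | Time (seconds) | Size (words) |
| --- | --- | --- |
| Google | 11.04 | 30 |
| Unstructured | 19.96 | 216 |
| Structured1 | 19.96 | 230 |
| Structured2 | 19.71 | 235 |
| Original | 24.53 | 900 |

Repeated measures ANOVA: p<0.05 for f-measure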
![Page 69: DEVELOPING AN ADAPTIVE AND HIERARCHICAL SUMMARIZATION FRAMEWORK FOR SEARCH ENGINES Tunga Güngör Boğaziçi University, Computer Engineering Dept., Istanbul,](https://reader038.fdocuments.us/reader038/viewer/2022110320/56649cc25503460f9498998d/html5/thumbnails/69.jpg)
Experiment Results (cont.)
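Extended English Collection

| System | TP | FP | FN | TN | A | P | R | F |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Google | 118 | 36 | 120 | 126 | 0.57 | 0.72 | 0.47 | 0.52 |
| Unstructured1 | 179 | 54 | 59 | 108 | 0.72 | 0.77 | 0.75 | 0.73 |
| Unstructured2 | 176 | 53 | 62 | 109 | 0.72 | 0.77 | 0.73 | 0.72 |
| Structured1 | 185 | 50 | 53 | 112 | 0.74 | 0.78 | 0.77 | 0.76 |
| Structured2 | 183 | 40 | 55 | 122 | 0.75 | 0.82 | 0.76 | 0.77 |

| System | FNR | FPR |
| --- | --- | --- |
| Google | 0.50 | 0.23 |
| Unstructured1 | 0.23 | 0.32 |
| Unstructured2 | 0.24 | 0.30 |
| Structured1 | 0.20 | 0.30 |
| Structured2 | 0.22 | 0.24 |

Improvement of proposed system over other methods

| System | A | P | R | F | FNR | FPR |
| --- | --- | --- | --- | --- | --- | --- |
| Google | +30.68% | +9.66% | +63.88% | +44.97% | -59.65% | +29.80% |
| Unstructured1 | +3.60% | +1.31% | +2.98% | +3.35% | -9.90% | -4.91% |
| Unstructured2 | +3.14% | +1.79% | +5.42% | +4.90% | -16.31% | -0.30% |

| System | Time (seconds) | Size (words) | Rating |
| --- | --- | --- | --- |
| Google | 10.20 | 30 | 2.60 |
| Unstructured1 | 17.70 | 298 | 2.77 |
| Unstructured2 | 18.44 | 306 | 2.77 |
| Structured1 | 17.51 | 277 | 3.03 |
| Structured2 | 17.02 | 274 | 3.12 |
| Original | 23.59 | 1340 | 3.10 |

Repeated measures ANOVA: p<0.05 for f-measure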
![Page 70: DEVELOPING AN ADAPTIVE AND HIERARCHICAL SUMMARIZATION FRAMEWORK FOR SEARCH ENGINES Tunga Güngör Boğaziçi University, Computer Engineering Dept., Istanbul,](https://reader038.fdocuments.us/reader038/viewer/2022110320/56649cc25503460f9498998d/html5/thumbnails/70.jpg)
DISCUSSION
![Page 71: DEVELOPING AN ADAPTIVE AND HIERARCHICAL SUMMARIZATION FRAMEWORK FOR SEARCH ENGINES Tunga Güngör Boğaziçi University, Computer Engineering Dept., Istanbul,](https://reader038.fdocuments.us/reader038/viewer/2022110320/56649cc25503460f9498998d/html5/thumbnails/71.jpg)
Discussion
• Longer summaries
▫ significant performance improvement
▫ compared to Google
• Structured summaries
▫ increased performance
▫ compared to unstructured summaries
▫ by providing an overview of the document
• Summary size
▫ 15-25% of the document on average
▫ 75-90% correct relevance judgments
• Proposed system summaries (Structured1)
▫ a fully automatic approach
▫ can be incorporated into a search engine
![Page 72: DEVELOPING AN ADAPTIVE AND HIERARCHICAL SUMMARIZATION FRAMEWORK FOR SEARCH ENGINES Tunga Güngör Boğaziçi University, Computer Engineering Dept., Istanbul,](https://reader038.fdocuments.us/reader038/viewer/2022110320/56649cc25503460f9498998d/html5/thumbnails/72.jpg)
Discussion (cont.)
• 6-9 times longer than Google extracts
▫ less than a twofold increase in response times
• Tradeoff: balancing the time spent against the accuracy
▫ $\text{Time Overhead} = \text{Number of Results Viewed} \cdot T_{\text{summary}} + FP \cdot (T_{\text{page\_load}} + T_{\text{document}})$
• Commonplace queries
▫ can be handled by viewing a few of the top results
• Complex queries and background search
▫ the accuracy becomes more important
▫ Proposed system: reduced number of missed items (false negative rates); users usually spend less time viewing irrelevant results (false positive rates)
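The time-overhead tradeoff can be explored numerically. In the sketch below the function `time_overhead` mirrors the formula on this slide; the summary and document reading times are the English Collection averages from the results slides, while the page-load time, the number of results viewed and the false positive counts are purely hypothetical inputs.

```python
def time_overhead(results_viewed, false_positives, t_summary, t_page_load, t_document):
    """Time spent reading summaries plus time lost opening irrelevant documents."""
    return results_viewed * t_summary + false_positives * (t_page_load + t_document)


T_DOCUMENT = 41.43   # avg. seconds on the original document (English Collection)
T_PAGE_LOAD = 2.0    # hypothetical page-load time in seconds

# Hypothetical session: 10 results viewed, with assumed false positive counts.
google = time_overhead(10, false_positives=3, t_summary=14.58,
                       t_page_load=T_PAGE_LOAD, t_document=T_DOCUMENT)
structured1 = time_overhead(10, false_positives=2, t_summary=27.60,
                            t_page_load=T_PAGE_LOAD, t_document=T_DOCUMENT)
print(f"Google: {google:.2f} s, Structured1: {structured1:.2f} s")
```

The first term grows with summary length and the second with the number of wrongly opened documents, so longer summaries pay off only when they prevent enough false positives.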
![Page 73: DEVELOPING AN ADAPTIVE AND HIERARCHICAL SUMMARIZATION FRAMEWORK FOR SEARCH ENGINES Tunga Güngör Boğaziçi University, Computer Engineering Dept., Istanbul,](https://reader038.fdocuments.us/reader038/viewer/2022110320/56649cc25503460f9498998d/html5/thumbnails/73.jpg)
Discussion (cont.)
• High user ratings
• Analysis of time complexity
▫ Structural processing stage: performed once beforehand, similar to the indexing phase of search engines
▫ Summary extraction stage: linear time complexity
![Page 74: DEVELOPING AN ADAPTIVE AND HIERARCHICAL SUMMARIZATION FRAMEWORK FOR SEARCH ENGINES Tunga Güngör Boğaziçi University, Computer Engineering Dept., Istanbul,](https://reader038.fdocuments.us/reader038/viewer/2022110320/56649cc25503460f9498998d/html5/thumbnails/74.jpg)
FUTURE RESEARCH
![Page 75: DEVELOPING AN ADAPTIVE AND HIERARCHICAL SUMMARIZATION FRAMEWORK FOR SEARCH ENGINES Tunga Güngör Boğaziçi University, Computer Engineering Dept., Istanbul,](https://reader038.fdocuments.us/reader038/viewer/2022110320/56649cc25503460f9498998d/html5/thumbnails/75.jpg)
Future Research
• Related to the research goals
▫ Automatic analysis of domain-independent Web documents to obtain a hierarchy of sections and subsections together with the headings: rule-based approach, machine learning approaches
▫ A novel summarization approach based on document structure and query-biased techniques
![Page 76: DEVELOPING AN ADAPTIVE AND HIERARCHICAL SUMMARIZATION FRAMEWORK FOR SEARCH ENGINES Tunga Güngör Boğaziçi University, Computer Engineering Dept., Istanbul,](https://reader038.fdocuments.us/reader038/viewer/2022110320/56649cc25503460f9498998d/html5/thumbnails/76.jpg)
Future Research (cont.)
• Extending structural processing
▫ Identify some document components (e.g. menus, references and advertisements) using machine learning techniques
• Summarization engine
▫ linguistic and semantic processing: expanding the queries using WordNet; ontology-driven search (e.g. Cyc ontology)
▫ more sophisticated query-biased methods
▫ different types of search tasks, e.g. searching for a particular fact or searching for background information about a subject
▫ different document types (i.e. genre) and formats (e.g. XML)
▫ automatic evaluation
![Page 77: DEVELOPING AN ADAPTIVE AND HIERARCHICAL SUMMARIZATION FRAMEWORK FOR SEARCH ENGINES Tunga Güngör Boğaziçi University, Computer Engineering Dept., Istanbul,](https://reader038.fdocuments.us/reader038/viewer/2022110320/56649cc25503460f9498998d/html5/thumbnails/77.jpg)
Future Research (cont.)
• Search engine integration
▫ Automatic display of hierarchical summaries: summary of each search result in a separate window; indexing mechanism; development of a user interface
• Adapting to other languages (e.g. Spanish)
▫ using NLP resources of different languages
▫ generating new knowledge sources for these languages, e.g. semantic knowledge base, ontology
![Page 78: DEVELOPING AN ADAPTIVE AND HIERARCHICAL SUMMARIZATION FRAMEWORK FOR SEARCH ENGINES Tunga Güngör Boğaziçi University, Computer Engineering Dept., Istanbul,](https://reader038.fdocuments.us/reader038/viewer/2022110320/56649cc25503460f9498998d/html5/thumbnails/78.jpg)
REFERENCES
![Page 79: DEVELOPING AN ADAPTIVE AND HIERARCHICAL SUMMARIZATION FRAMEWORK FOR SEARCH ENGINES Tunga Güngör Boğaziçi University, Computer Engineering Dept., Istanbul,](https://reader038.fdocuments.us/reader038/viewer/2022110320/56649cc25503460f9498998d/html5/thumbnails/79.jpg)
Alam, H., A. Kumar, M. Nakamura, A. F. R. Rahman, Y. Tarnikova and C. Wilcox, “Structured and Unstructured Document Summarization: Design of a Commercial Summarizer Using Lexical Chains”, Proceedings of the Seventh International Conference on Document Analysis and Recognition, pp. 1147-1150, 2003.
Branavan, S. R. K., P. Deshpande and R. Barzilay, “Generating a Table-of-Contents”, Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, 2007.
Collins, M. and B. Roark, “Incremental Parsing with the Perceptron Algorithm”, Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, 2004.
Hobson, S. P., B. J. Dorr, C. Monz and R. Schwartz, “Task-Based Evaluation of Text Summarization Using Relevance Prediction”, Information Processing and Management, Vol. 43, No. 6, pp. 1482-1499, 2007.
Ingwersen, P. and K. Järvelin, The Turn: Integration of Information Seeking and Retrieval in Context, Springer, Dordrecht, 2005.
Jansen, B. J. and A. Spink, “An Analysis of Web Searching by European AlltheWeb.com Users”, Information Processing and Management, Vol. 41, No. 2, pp. 361-381, 2005.
Joachims, T., “Making Large-Scale SVM Learning Practical”, in B. Schölkopf, C. Burges and A. Smola (eds.), Advances in Kernel Methods - Support Vector Learning, MIT Press, 1999.
Pembe, F. C. and T. Güngör, “A Tree Learning Approach to Web Document Sectional Hierarchy Extraction”, 2nd International Conference on Agents and Artificial Intelligence (ICAART 2010), Valencia, January 2010.
References
![Page 80: DEVELOPING AN ADAPTIVE AND HIERARCHICAL SUMMARIZATION FRAMEWORK FOR SEARCH ENGINES Tunga Güngör Boğaziçi University, Computer Engineering Dept., Istanbul,](https://reader038.fdocuments.us/reader038/viewer/2022110320/56649cc25503460f9498998d/html5/thumbnails/80.jpg)
References (cont.)
Pembe, F. C. and T. Güngör, “Structure-Preserving and Query-Biased Document Summarization for Web Search”, Online Information Review, Vol. 33, No. 4, pp. 696-719, 2009.
Sparck Jones, K., “Automatic Summarizing: Factors and Directions”, in I. Mani and M. T. Maybury (eds.), Advances in Automatic Text Summarization, pp. 1-12, MIT Press, Cambridge, 1999.
Tombros, A. and M. Sanderson, “Advantages of Query Biased Summaries in Information Retrieval”, Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, pp. 2-10, 1998.
Varadarajan, R. and V. Hristidis, “Structure-Based Query-Specific Document Summarization”, Proceedings of the 14th ACM International Conference on Information and Knowledge Management, 2005.
Xue, Y., Y. Hu, G. Xin, R. Song, S. Shi, Y. Cao, C. Y. Lin and H. Li, “Web Page Title Extraction and Its Application”, Information Processing and Management, Vol. 43, No. 5, pp. 1332-1347, 2007.
White, R. W., J. M. Jose and I. Ruthven, “A Task-oriented Study on the Influencing Effects of Query-biased Summarization in Web Searching”, Information Processing and Management, Vol. 39, No. 5, pp. 707-733, 2003.
Yang, C. C. and F. L. Wang, “Hierarchical Summarization of Large Documents”, Journal of the American Society for Information Science and Technology, Vol. 59, No. 6, pp. 887-902, 2008.
![Page 81: DEVELOPING AN ADAPTIVE AND HIERARCHICAL SUMMARIZATION FRAMEWORK FOR SEARCH ENGINES Tunga Güngör Boğaziçi University, Computer Engineering Dept., Istanbul,](https://reader038.fdocuments.us/reader038/viewer/2022110320/56649cc25503460f9498998d/html5/thumbnails/81.jpg)
Thank you