Post on 23-Feb-2016
A SUMMARIZATION JOURNEY
Search and Information Extraction Lab
IIIT Hyderabad
Information Overload
Explosive growth of information on the web
Failure of information retrieval systems to satisfy the user's information need
Need for sophisticated information access solutions
Summarization
A summary is a condensed version of the source document(s) with a recognizable genre; its purpose is to give the reader an exact and concise idea of the contents of the source.
Text interpretation
Extraction of Relevant information
Condensing Extracted Information
Summary Generation
Flavors of Summarization
Progressive
Single document
Query Focused
Opinion/Sentiment
Code
Comparative
Guided
Personalized
Extract vs. Abstract
Extract: a summary consisting entirely of material from the input text.
Abstract: a summary at least some of whose material is not present in the input, e.g. paraphrases of content or subject categories.
Towards Abstraction
Personalized, Cross Lingual Summarization
Guided Summarization
Code Summarization
Comparison Summarization
Blog Summarization
Progressive Summarization
Abstractive
Single Document, Query Focused Multi Document Summarization
Technological Aspects
Summarization
Support Vector Regression
Relevance based Language Models
External Knowledge: Web, Wikipedia
User Modeling
Statistics: word and document
Similarity measures, Novelty detection
Graph Clustering: Topic identification
EXTRACTIVE SUMMARIZERS
Query Focused Summarization
Documents should be ranked in order of probability of relevance to the request or information need, as calculated from whatever evidence is available to the system
Query Dependent ranking: Relevance Based Language Models (RBLM), Probabilistic HAL (PHAL)
Query Independent ranking: Sentence Prior
RBLM is an IR approach that computes the conditional probabilities of relevance from the document and the query
PHAL is a probabilistic extension to HAL spaces
HAL constructs dependencies of a term w on other terms based on their occurrence in its context in the corpus
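The HAL and PHAL idea above can be sketched roughly as follows. This is a minimal illustration, not the lab's implementation: it assumes a sliding window over preceding words only, a linear distance weighting, and a hypothetical `score_sentence` helper that averages P(w | q) over query terms. The names `build_hal`, `phal`, and `score_sentence` are introduced here for illustration.

```python
from collections import defaultdict


def build_hal(tokens, window=5):
    """Build a HAL-style matrix: hal[w][c] is the weighted count of
    context word c appearing within `window` positions before w.
    Closer words get a higher weight (assumed linear decay)."""
    hal = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(tokens):
        for d in range(1, window + 1):
            j = i - d
            if j < 0:
                break
            hal[w][tokens[j]] += window - d + 1
    return hal


def phal(hal, w):
    """Normalise the HAL row of w into a distribution P(c | w)."""
    row = hal.get(w, {})
    total = sum(row.values())
    return {c: v / total for c, v in row.items()} if total else {}


def score_sentence(sentence_tokens, query_tokens, hal):
    """Hypothetical query-dependent score: sum of P(w | q) over all
    query terms q and sentence words w, divided by sentence length."""
    total = 0.0
    for q in query_tokens:
        dist = phal(hal, q)
        total += sum(dist.get(w, 0.0) for w in sentence_tokens)
    return total / max(len(sentence_tokens), 1)
```

Sentences would then be ranked by `score_sentence` in decreasing order; a full system would combine this with a query-independent sentence prior, which is not sketched here.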
DUC Performance
38 systems participated in 2006
Significant difference between first two systems
Extract vs. Abstract Summarization
We conducted a study (post TAC 2006)
Generated the best possible extracts
Calculated the scores for these extracts
Evaluation with respect to the reference summaries

System          ROUGE-2    ROUGE-SU4
Human Answers   0.1025     0.1624
Best Answers    0.09965    0.15407
HAL Feature     0.07618    0.13805
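To make the scores above concrete, here is a simplified sketch of ROUGE-2 recall (clipped bigram overlap with the reference summaries) together with a greedy search for a high-scoring extract. The slides do not say how the best possible extracts were generated, so the greedy procedure is an assumption, and the tokenisation (lowercase, whitespace split) is far cruder than the official ROUGE toolkit.

```python
from collections import Counter


def bigrams(text):
    """Count word bigrams in a lowercased, whitespace-tokenised text."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2_recall(candidate, references):
    """Clipped bigram matches divided by the total number of
    reference bigrams, pooled over all references."""
    cand = bigrams(candidate)
    matched = total = 0
    for ref in references:
        ref_bi = bigrams(ref)
        matched += sum(min(count, cand[bg]) for bg, count in ref_bi.items())
        total += sum(ref_bi.values())
    return matched / total if total else 0.0


def greedy_oracle_extract(sentences, references, max_words=250):
    """Assumed procedure: repeatedly add the sentence that most improves
    ROUGE-2 recall, stopping at the word budget or when nothing helps.
    Joining sentences creates a few bigrams across sentence boundaries."""
    chosen, remaining, best = [], list(sentences), 0.0
    while remaining:
        used = sum(len(s.split()) for s in chosen)
        fits = [s for s in remaining if used + len(s.split()) <= max_words]
        if not fits:
            break
        top = max(fits, key=lambda s: rouge2_recall(" ".join(chosen + [s]), references))
        score = rouge2_recall(" ".join(chosen + [top]), references)
        if score <= best:
            break
        chosen.append(top)
        remaining.remove(top)
        best = score
    return chosen
```

A greedy search gives only a lower bound on the best achievable extract score; an exhaustive or ILP-based search could find higher-scoring extracts.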
Cross Lingual Summarization
A bridge between CLIR and MT
Extended our mono-lingual summarization framework to a cross-lingual setting within the RBLM framework
Designed a cross-lingual experimental setup using the DUC 2005 dataset
Experiments were conducted for the Telugu-English language pair
Compared with the mono-lingual baseline, the cross-lingual system reaches about 90% of its ROUGE-SU4 and about 85% of its ROUGE-2 F-measure
Progressive Summarization
Emerging area of research in summarization
Summarization with a sense of prior knowledge
Introduced as “Update Summarization” at DUC 2007, TAC 2008, TAC 2009
Generate a short summary of a set of newswire articles, under the assumption that the user has already read a given set of earlier articles.
To keep track of temporal news stories
Key challenge
To detect information that is not only relevant but also new, given the prior knowledge of the reader
Relevant and new vs. non-relevant and new vs. relevant and redundant
Three-level approach to Novelty Detection
Sentence Scoring: developing new features that capture novelty along with the relevance of a sentence (NF, NW)
Ranking: sentences are re-ranked based on the amount of novelty they contain (ITSim, CoSim)
Summary Generation: a selected pool of sentences that contain novel facts; all remaining sentences are filtered out
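The re-ranking and filtering steps can be illustrated with a small sketch. The slides name the features (NF, NW) and re-rankers (ITSim, CoSim) without defining them, so this example does not reproduce them; it substitutes a plain bag-of-words cosine similarity against the previously read sentences as a stand-in novelty measure. The function names and the `min_novelty` threshold are assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Bag-of-words cosine similarity between two sentences."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0


def novelty(sentence, prior_sentences):
    """1 minus the highest similarity to anything the reader has seen."""
    return 1.0 - max((cosine(sentence, p) for p in prior_sentences), default=0.0)


def rerank_and_filter(scored, prior_sentences, min_novelty=0.3):
    """scored: list of (sentence, relevance) pairs for the new cluster.
    Drops redundant sentences, then orders the rest by relevance * novelty."""
    rescored = []
    for sentence, relevance in scored:
        nov = novelty(sentence, prior_sentences)
        if nov >= min_novelty:
            rescored.append((sentence, relevance * nov))
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return [sentence for sentence, _ in rescored]
```

Multiplying relevance by novelty is one simple way to prefer "relevant and new" over "relevant and redundant"; other combinations (for example a weighted sum) would serve the same purpose.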
Evaluations
TAC 2008 Update Summarization data for training: 48 topics
Each topic divided into clusters A and B with 10 documents each
Summary for cluster A is a normal summary; summary for cluster B is an update summary
TAC 2009 Update Summarization data for testing: 44 topics
Baseline summarizer generates a summary by picking the first 100 words of the last document
Run1: DFS + SL1
Run2: PHAL + KL
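The baseline described above is simple enough to write down directly. A minimal sketch, assuming the documents of a cluster are passed in chronological order so that "last" means most recent, and that words are whitespace-separated tokens:

```python
def baseline_summary(documents, max_words=100):
    """Return the first `max_words` words of the last document.
    `documents` is a list of document texts, oldest first (assumed)."""
    if not documents:
        return ""
    return " ".join(documents[-1].split()[:max_words])
```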
Personalized Summarization
Perception of text differs with the background of the reader
Need to incorporate user background in the summarization process
Summarization is a function not only of the input text but also of the reader
Example: "Serve" as read by a tennis player, a hotel manager, a politician
Web-based profile creation: personal information available on the web, e.g. a conference page, a project page, an online paper, or even a weblog
Estimate the user model P(w | M_u) to incorporate the user in the sentence extraction process
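One way to estimate a user model of this kind is a smoothed unigram language model over the text gathered for the user's profile. The sketch below assumes Jelinek-Mercer interpolation with a background collection and scores a sentence by its average log-probability under the model; the smoothing method, the `lam` weight, and the function names are assumptions, not details given in the slides.

```python
import math
from collections import Counter


def user_model(profile_text, background_text, lam=0.7):
    """Estimate P(w | M_u) by interpolating profile and background
    unigram frequencies (Jelinek-Mercer smoothing, assumed)."""
    prof = Counter(profile_text.lower().split())
    back = Counter(background_text.lower().split())
    n_prof, n_back = sum(prof.values()), sum(back.values())
    model = {}
    for w in set(prof) | set(back):
        p_prof = prof[w] / n_prof if n_prof else 0.0
        p_back = back[w] / n_back if n_back else 0.0
        model[w] = lam * p_prof + (1 - lam) * p_back
    return model


def sentence_score(sentence, model, floor=1e-6):
    """Average log P(w | M_u) over the words of a sentence; words
    outside the model's vocabulary get a small floor probability."""
    tokens = sentence.lower().split()
    if not tokens:
        return float("-inf")
    return sum(math.log(model.get(w, floor)) for w in tokens) / len(tokens)
```

In a full summarizer this score would be one factor among several, combined with query relevance so that the summary reflects both the topic and the reader.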
Opinion summarization
Sentiment Analysis
User-generated content is growing rapidly through blogs
Sentiment analysis provides better access to information
Sentiment
Textual information on the Web can be categorized as facts and opinions
Computational study of opinions and sentiments from a market perspective
Optimization of sentiment in the summary to the maximum extent
Sentiment summarization as a two-stage classification problem at sentence level
Stages: Opinion/Fact classification, then Polarity Estimation (Positive/Negative)
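The two-stage structure can be shown with a deliberately tiny example. The slides do not say which classifiers were used, so this sketch stands in a hand-made word lexicon for both stages; the word lists `POSITIVE` and `NEGATIVE` are hypothetical, and a real system would use trained classifiers.

```python
import re

# Hypothetical toy lexicons, for illustration only.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}


def tokenize(sentence):
    """Lowercase and keep alphabetic tokens only."""
    return re.findall(r"[a-z']+", sentence.lower())


def is_opinion(sentence):
    """Stage 1: treat a sentence as an opinion if it has any sentiment word."""
    return any(w in POSITIVE or w in NEGATIVE for w in tokenize(sentence))


def polarity(sentence):
    """Stage 2: compare positive and negative word counts.
    Ties are arbitrarily labelled positive."""
    words = tokenize(sentence)
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score >= 0 else "negative"


def classify(sentence):
    """Run stage 2 only on sentences that stage 1 marks as opinions."""
    return polarity(sentence) if is_opinion(sentence) else "fact"
```

Separating the stages means factual sentences never receive a polarity label, which keeps the second classifier focused on opinionated text.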
SEMI ABSTRACTIVE SUMMARIZERS
Comparative summarization
Summaries for comparing multiple items belonging to a category
The category "Mobile phones" will have "Nokia" and "BlackBerry" as its items
Comparative summaries provide the properties or facts common to these items and their corresponding values with respect to each item, e.g. "Memory", "Display", "Battery Life"
Comparative Summaries Generation
Attribute Extraction: find the attributes of the product class
Attribute Ranking: rank the attributes according to importance in comparison
Summary Generation: find the occurrence of attributes in various products
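The three steps above can be strung together in a short sketch. It assumes a much simpler input than real product text: each item comes with a specification string of the form "Attribute: value; Attribute: value". Ranking by how many items mention an attribute is likewise an assumed stand-in for "importance in comparison". All function names are illustrative.

```python
from collections import Counter


def extract_attributes(spec_text):
    """Step 1: parse 'Attribute: value; ...' into a dict (assumed format)."""
    attrs = {}
    for part in spec_text.split(";"):
        if ":" in part:
            name, value = part.split(":", 1)
            attrs[name.strip().lower()] = value.strip()
    return attrs


def rank_attributes(specs):
    """Step 2: rank attributes by the number of items mentioning them,
    breaking ties alphabetically (assumed importance measure)."""
    counts = Counter(attr for item in specs.values() for attr in item)
    return [a for a, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


def comparative_summary(raw_specs, top_k=3):
    """Step 3: for the top-k attributes, collect each item's value."""
    specs = {item: extract_attributes(text) for item, text in raw_specs.items()}
    top = rank_attributes(specs)[:top_k]
    return {a: {item: specs[item].get(a, "n/a") for item in specs} for a in top}
```

The result maps each attribute to a row of per-item values, which is the tabular shape a comparative summary needs.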
Guided Summarization
Query Focused Summarization
User's information need expressed as a query along with a narrative
Set of documents related to the topic
Goal is to produce a short coherent summary focused on answering the query
Guided Summarization
Each topic is classified into a set of predefined categories
Each category has a template of important aspects about the topic
Summary is expected to answer all the aspects of template while containing other relevant information
Guided summarization
Encourages deeper linguistic and semantic analysis of the source documents instead of relying only on document word frequencies to select important concepts
Shares similarity with information extraction
Specific information from unstructured text is identified and consequently classified into a set of semantic labels (templates)
Makes information more suitable for other information processing tasks
A guided summarization system has to produce a readable summary encompassing all the information about the templates
Very few investigations exploring the potential of merging summarization with information extraction techniques
Our approach
Building a domain model: essential background knowledge for information extraction
Sentence Annotations: to identify sentences having answers to aspects of the template
Concept Mining: to use semantic concepts instead of words to calculate sentence importance
Summary Extraction: modification of the summary extraction algorithm to adapt to the requirements using sentence annotations
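A rough sketch of how sentence annotations might steer extraction is given below. It is not the approach's actual algorithm: the aspect template and its cue words in `ASPECTS` are hypothetical, annotation is reduced to keyword matching, and the extraction step is a greedy choice that prefers sentences covering aspects not yet answered, with sentence importance as the tie-breaker.

```python
import re

# Hypothetical template for one category; the cue words are made up.
ASPECTS = {
    "WHAT": {"crash", "collision", "explosion"},
    "WHEN": {"monday", "yesterday", "morning"},
    "WHERE": {"highway", "station", "city"},
}


def annotate(sentence, aspects=ASPECTS):
    """Label a sentence with every aspect whose cue words it contains."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return {name for name, cues in aspects.items() if words & cues}


def guided_extract(scored_sentences, aspects=ASPECTS, max_sentences=3):
    """scored_sentences: list of (sentence, importance) pairs.
    Greedily pick the sentence covering the most uncovered aspects,
    using importance to break ties."""
    pool = [(s, score, annotate(s, aspects)) for s, score in scored_sentences]
    summary, covered = [], set()
    while pool and len(summary) < max_sentences:
        best = max(pool, key=lambda t: (len(t[2] - covered), t[1]))
        summary.append(best[0])
        covered |= best[2]
        pool.remove(best)
    return summary, covered
```

Once every aspect is covered, the tie-breaker takes over and the remaining slots go to the most important sentences, which matches the goal of answering the template while still including other relevant information.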
THANKS