Modeling Novelty and feature combination using Support Vector Regression for Update Summarization
Praveen Bysani
Vijay Yaram
Vasudeva Varma
Search and Information Extraction Lab LTRC, IIIT Hyderabad
Outline
Text Summarization
Update Summarization
Support Vector Regression
Sentence Scoring Features
Novelty Factor (NF)
Experimental Results
Text Summarization
• Condensing a piece of text while retaining important information
• Types of summarization
– Extractive vs. Abstractive
– Single Document vs. Multi Document
– Query Focused vs. Query Independent
– Personalized vs. Generic
• Focus on Extractive Multi Document Summarization
• Dynamic Summarization, Update Summarization
Update Summarization
• Emerging area in summarization
• Summarization with a sense of prior knowledge
• Challenging – To detect information that is relevant and also novel
• Practical usage – monitor changes in temporally evolving topics, especially newswire
An example
Topic : YSR Chopper Missing
A helicopter carrying Andhra Pradesh chief minister Y S Rajashekhar Reddy, two of his staff and two pilots went missing in pouring rain Wednesday morning over the Naxal and tiger-infested Nalamalla forests and with no contact until early Thursday, experts and officials feared the worst. Multiple agencies of the state launched a massive hunt for possible wreckage in the desolate terrain. Apart from Reddy, the chopper was carrying principal secretary to CM S Subrahmanyam and YSR's chief security officer ASC Wesley.
Andhra Pradesh chief minister Y S Rajasekhara Reddy has died in an air crash. The bodies of 60-year-old Reddy, his special secretary P Subramanyam, chief security officer A S C Wesley, pilot Group Captain S K Bhatia and co-pilot M S Reddy were found on Rudrakonda Hill, 40 nautical miles east of here, besides the mangled remains of the helicopter. The central leadership of the Congress is understood to have cleared the name of Andhra Pradesh finance minister K Rosaiah as the caretaker CM of the state.
[Figure: summarization pipeline. Input documents → Pre Processing → Sentences → Sentence Scorers (feature 1, feature 2, ..., feature n) → Ranker → Ranked Set of Sentences → Summary Generator → Summary]
Sentence Ranking
• 4 stages of sentence extractive summarization
– Pre Processing
– Sentence Scoring
– Sentence Ranking
– Summary Generation
• Scores from features are manually weighted to compute sentence rank
• Instead, use a Machine Learning Algorithm to estimate rank from features
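The manually weighted approach mentioned above can be sketched as a linear combination of per-feature scores. This is only an illustration: the feature names, weights, and scores below are hypothetical and do not come from the slides.

```python
from typing import Dict, List


def weighted_rank(feature_scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine the feature scores of one sentence using hand-set weights."""
    return sum(weights[name] * score for name, score in feature_scores.items())


def rank_sentences(scored: List[Dict[str, float]], weights: Dict[str, float]) -> List[int]:
    """Return sentence indices ordered from highest to lowest combined score."""
    ranks = [weighted_rank(scores, weights) for scores in scored]
    return sorted(range(len(scored)), key=lambda i: ranks[i], reverse=True)


# Hypothetical weights and feature scores, for illustration only.
weights = {"position": 0.5, "frequency": 0.5}
scored = [
    {"position": 0.2, "frequency": 0.9},
    {"position": 0.9, "frequency": 0.8},
    {"position": 0.1, "frequency": 0.1},
]
order = rank_sentences(scored, weights)
```

The learned alternative replaces the hand-set `weights` with a regression model fitted to sentence ranks.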
Support Vector Regression (SVR)
• Regression analysis - modeling values of a dependent variable from one or more independent variables
• Regression is a function Y = f(X, β)
– The independent variables, X
– The dependent variable, Y
– Unknown parameters, β
• Regression using Support Vectors is Support Vector Regression (SVR)
• Estimate Sentence rank (dependent variable) using Scoring Features (independent variables) through Support Vector Regression (SVR)
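For reference, the standard ε-insensitive formulation of SVR in the linear case is given below. The slides do not state which SVR variant, kernel, or hyperparameters were used, so the symbols w, b, C, ε, and ξ are the usual textbook notation and not necessarily the authors' own. Here x_i is assumed to be the feature vector of sentence i and y_i its target sentence rank.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi,\,\xi^{*}} \quad & \tfrac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \left(\xi_i + \xi_i^{*}\right) \\
\text{subject to} \quad & y_i - (w^{\top} x_i + b) \le \varepsilon + \xi_i, \\
& (w^{\top} x_i + b) - y_i \le \varepsilon + \xi_i^{*}, \\
& \xi_i,\ \xi_i^{*} \ge 0, \quad i = 1, \dots, n .
\end{aligned}
```

Errors smaller than ε are not penalized, and C trades off flatness of the function against tolerance of larger errors.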
Estimating Sentence Rank
ROUGE – recall oriented metric which evaluates based on word overlap with models
ROUGE-2 and ROUGE-SU4 correlate highly with human evaluation
Sentence Rank (is) of a sentence s is
|Bigram_m ∩ Bigram_s| - number of bigrams shared by model m and sentence s
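The bigram overlap term can be sketched as follows. Only the overlap count is taken from the slide; the `sentence_rank` function, which sums the overlap over all models and divides by sentence length, is an assumption made for illustration, since the slide's formula did not survive in this transcript. Whitespace tokenization is also an assumption.

```python
from typing import List, Set, Tuple

Bigram = Tuple[str, str]


def bigrams(tokens: List[str]) -> Set[Bigram]:
    """Set of adjacent word pairs in a token list."""
    return set(zip(tokens, tokens[1:]))


def shared_bigrams(model_tokens: List[str], sentence_tokens: List[str]) -> int:
    """Number of distinct bigrams common to a model summary and a sentence."""
    return len(bigrams(model_tokens) & bigrams(sentence_tokens))


def sentence_rank(models: List[List[str]], sentence_tokens: List[str]) -> float:
    """ASSUMPTION: overlap summed over all models, normalized by sentence length."""
    if not sentence_tokens:
        return 0.0
    overlap = sum(shared_bigrams(m, sentence_tokens) for m in models)
    return overlap / len(sentence_tokens)


model = "the chief minister died in an air crash".split()
sentence = "the chief minister died on wednesday".split()
overlap = shared_bigrams(model, sentence)
rank = sentence_rank([model], sentence)
```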
Oracle summaries
Each oracle summary is the best summary that can be generated by any sentence extractive summarization system
Sentences are ranked using the ROUGE-2 score described above
Used to depict the gap and the scope for improvement
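One way to build a summary from ranked sentences is a greedy selection under a word budget, sketched below. The 100-word default and the choice to skip sentences that do not fit (instead of truncating them) are assumptions; the slides do not describe the selection procedure.

```python
from typing import List, Tuple


def build_summary(ranked: List[Tuple[float, str]], word_limit: int = 100) -> List[str]:
    """Pick sentences in descending score order while they fit in the word budget."""
    summary: List[str] = []
    used = 0
    for _, sentence in sorted(ranked, key=lambda pair: pair[0], reverse=True):
        length = len(sentence.split())
        if used + length <= word_limit:
            summary.append(sentence)
            used += length
    return summary


# Hypothetical (score, sentence) pairs with a tiny budget, for illustration.
ranked = [(0.1, "a b c"), (0.9, "d e"), (0.5, "f g h i")]
picked = build_summary(ranked, word_limit=5)
```

For an oracle summary, the scores would come from overlap with the model summaries; for a real system, from the learned ranker.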
Sentence Scoring Features
• Sentence Position is a popular and well studied feature
• Sentence Location 1 (SL1)
– First 3 sentences of document contain most informative content (also proved by analysis of oracle summaries)
– Score of a sentence 's' at position 'n' in document 'd' is
Score(s_nd) = 1 - n/1000   if n <= 3
            = n/1000       otherwise
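The SL1 rule translates directly into code. The sketch below assumes positions are 1-based (the first sentence has n = 1), which the slide does not state.

```python
def sl1_score(n: int) -> float:
    """Sentence Location 1 score for the sentence at position n (assumed 1-based)."""
    if n <= 3:
        return 1 - n / 1000
    return n / 1000


first = sl1_score(1)
tenth = sl1_score(10)
```

With this scoring, the first three sentences score close to 1, while later sentences receive small values that grow with position.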
• Sentence Location 2 (SL2)
– Positional index of sentence itself as value
– Model will learn optimum sentence position based on genre
– Not inclined to top or bottom as SL1
Score(s_nd) = n
• Sentence Frequency Score (SFS)
– Ratio of number of sentences in which a word occurred to total number of sentences in cluster
– SFS of word w is SFS(w) = |{s : w in s}| / N, where N is the total number of sentences in the cluster
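The SFS ratio for a single word can be computed as in the sketch below. Sentences are assumed to be pre-tokenized lists of words; how word-level SFS values are combined into a sentence score is not stated on the slide and is left out.

```python
from typing import List


def sfs(word: str, sentences: List[List[str]]) -> float:
    """Fraction of sentences in the cluster that contain the word."""
    if not sentences:
        return 0.0
    containing = sum(1 for tokens in sentences if word in tokens)
    return containing / len(sentences)


# Hypothetical tokenized cluster, for illustration only.
cluster = [
    ["helicopter", "went", "missing"],
    ["search", "for", "helicopter", "wreckage"],
    ["caretaker", "named"],
    ["rain", "hampered", "search"],
]
helicopter_sfs = sfs("helicopter", cluster)
```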
TF-IDF
• Popular measure to find relevance of document in IR
• Same analogy used to find relevance of sentence
• Term Frequency (TF_ij) of term (t_i) in document (d_j) is
• Inverse Document Frequency (IDF_i) of a term (t_i) is
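The formulas for TF and IDF are not preserved in this transcript. The sketch below uses the common textbook definitions (relative term count for TF, log of total documents over documents containing the term for IDF); whether the authors used exactly these variants is an assumption, as is the zero weight for unseen terms.

```python
import math
from typing import List


def term_frequency(term: str, document: List[str]) -> float:
    """Occurrences of the term divided by the number of terms in the document."""
    if not document:
        return 0.0
    return document.count(term) / len(document)


def inverse_document_frequency(term: str, documents: List[List[str]]) -> float:
    """log(N / df), where df is the number of documents containing the term."""
    df = sum(1 for doc in documents if term in doc)
    if df == 0:
        return 0.0  # assumption: terms unseen in the collection get zero weight
    return math.log(len(documents) / df)


def tf_idf(term: str, document: List[str], documents: List[List[str]]) -> float:
    """Product of term frequency and inverse document frequency."""
    return term_frequency(term, document) * inverse_document_frequency(term, documents)


# Hypothetical tokenized documents, for illustration only.
docs = [["crash", "crash", "rain"], ["rain"], ["rain", "hunt"]]
crash_weight = tf_idf("crash", docs[0], docs)
```

A term that occurs in every document, such as "rain" here, gets an IDF of zero and therefore contributes nothing.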
Document Frequency Score (DFS)
• Inverse of IDF
• Ratio of number of docs in which a term occurred to total number of docs
• Average DFS of words in sentence is its score
• Probabilistic Hyperspace Analogue to Language (PHAL) and Kullback-Leibler Divergence (KL) available as features
• Baseline summarizer generates summary by picking first 100 words of last document
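The DFS description above maps onto two small functions: a per-term ratio and a per-sentence average. This is a minimal sketch assuming pre-tokenized documents; stop-word handling and other preprocessing are not described on the slide and are omitted.

```python
from typing import List


def dfs(term: str, documents: List[List[str]]) -> float:
    """Fraction of documents in which the term occurs."""
    if not documents:
        return 0.0
    return sum(1 for doc in documents if term in doc) / len(documents)


def sentence_dfs(sentence: List[str], documents: List[List[str]]) -> float:
    """Average DFS of the words in the sentence."""
    if not sentence:
        return 0.0
    return sum(dfs(word, documents) for word in sentence) / len(sentence)


# Hypothetical tokenized documents, for illustration only.
docs = [
    ["chopper", "missing"],
    ["chopper", "found"],
    ["caretaker", "named"],
    ["chopper", "crash"],
]
score = sentence_dfs(["chopper", "missing"], docs)
```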
Novelty Factor
• An ad-hoc feature for Update summarization
• Consider a stream of articles published on a topic over time period T
• All articles published from time 0 to t are considered to be read previously (prior knowledge)
• Articles published from t to T are new and contain new information
• Let t_d represent the chronological time stamp of document d
Novelty Factor
NF of a word 'w' is
nd_t = { d : w in d and t_d > t }
pd_t = { d : w in d and t_d < t }
D = { d : t_d > t }
• |nd_t| signifies the importance of term in new cluster
• |pd_t| penalizes any term that occurs frequently in previous clusters
• |D| for smoothing
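The NF formula itself is not preserved in this transcript. The sketch below assumes the form |nd_t| / (|pd_t| + |D|), which matches the three roles described above (reward for occurrence in new documents, penalty for occurrence in previous ones, smoothing by the size of the new cluster), but it should be read as an assumption and not as the authors' exact definition. The set definitions follow the slide as written.

```python
from typing import List, Tuple

Document = Tuple[float, List[str]]  # (time stamp t_d, tokens)


def novelty_factor(word: str, documents: List[Document], t: float) -> float:
    """ASSUMED form: |nd_t| / (|pd_t| + |D|)."""
    nd_count = sum(1 for t_d, tokens in documents if t_d > t and word in tokens)
    pd_count = sum(1 for t_d, tokens in documents if t_d < t and word in tokens)
    new_count = sum(1 for t_d, _ in documents if t_d > t)  # |D|
    if pd_count + new_count == 0:
        return 0.0
    return nd_count / (pd_count + new_count)


# Hypothetical stream of (time stamp, tokens), split at t = 3.
stream = [
    (1, ["chopper", "missing"]),
    (2, ["chopper", "search"]),
    (4, ["chopper", "crash"]),
    (5, ["crash", "caretaker"]),
]
crash_nf = novelty_factor("crash", stream, t=3)
chopper_nf = novelty_factor("chopper", stream, t=3)
```

In this toy stream, "crash" appears only in new documents and scores highest, while "chopper" is penalized for its frequent earlier use.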
Training Data
• DUC 2007 Main task data for training
– 45 topics
– Each topic with 25 documents and a query
– Associated 4 model summaries, each 250 words
• DUC 2007 Update task data for training update specific features
– 10 topics
– Each topic divided into clusters A, B, C in chronological order with 10, 8, 7 docs respectively
– Associated 4 model summaries, each 100 words
Test Dataset
• TAC 2008 Update Summarization data for testing
– 48 topics
– Each topic divided into clusters A and B, with 10 documents each
– Summary for cluster A is a normal summary and summary for cluster B is an update summary
Experiments and Results
• For individual features

Feature     ROUGE-2     ROUGE-SU4
KL          0.09285     0.132325
DFS         0.092225    0.13281
NF          0.086155    0.126455
SL1         0.086245    0.12163
SL2         0.08599     0.12147
SFS         0.077745    0.12419
TF-IDF      0.07317     0.12604
PHAL        0.06505     0.10712
baseline    0.05865     0.09333
Experiments and Results
• For combination of features

Combination        ROUGE-2     ROUGE-SU4
DFS+SL1            0.102195    0.139205
NF+SL1             0.100845    0.13742
DFS+SL2            0.10126     0.13943
NF+SL2             0.0978      0.134925
DFS+TFIDF          0.0993      0.1383
PHAL+KL            0.094035    0.134275
DFS+SP+PHAL+KL     0.09749     0.13705

• Non-complementing features for KL
At cluster level: Cluster A

System       ROUGE-2    ROUGE-SU4
DFS+SL2      0.10604    0.13936
DFS+TFIDF    0.10633    0.14415
System-43    0.11137    0.14297
System-13    0.11045    0.13987
Oracle       0.17041    0.19616
At cluster level: Cluster B

System       ROUGE-2    ROUGE-SU4
DFS+SL1      0.10343    0.14267
NF+SL1       0.10055    0.13791
System-14    0.10111    0.13669
System-65    0.09675    0.13381
Oracle       0.17610    0.19877
Discussion
• Quality of training data for NF is poor compared to other features, which explains its lower performance relative to DFS
• Huge gap between oracle summaries and best peers
Contribution
Ad-hoc Feature (Novelty Factor) for modeling novelty along with relevance
Analyzing the effect of various feature combinations on the quality of summaries using Support Vector Regression
Depicting the scope of improvement in summarization
Thank You