Modeling Novelty and feature combination using Support Vector Regression for Update Summarization


Praveen Bysani

Vijay Yaram

Vasudeva Varma


Search and Information Extraction Lab LTRC, IIIT Hyderabad

Outline



Text Summarization

Update Summarization

Support Vector Regression

Sentence Scoring Features

Novelty Factor (NF)

Experimental Results

Text Summarization

• Condensing a piece of text while retaining important information
• Types of summarization
– Extractive vs. Abstractive
– Single Document vs. Multi Document
– Query Focused vs. Query Independent
– Personalized vs. Generic
• Focus on Extractive Multi Document Summarization
• Dynamic Summarization, Update Summarization


Update Summarization

• Emerging area in summarization
• Summarization with a sense of prior knowledge
• Challenge: detecting information that is both relevant and novel
• Practical usage: monitoring changes in temporally evolving topics, especially newswire


An example



Topic : YSR Chopper Missing

A helicopter carrying Andhra Pradesh chief minister Y S Rajashekhar Reddy, two of his staff and two pilots went missing in pouring rain Wednesday morning over the Naxal and tiger-infested Nalamalla forests and with no contact until early Thursday, experts and officials feared the worst. Multiple agencies of the state launched a massive hunt for possible wreckage in the desolate terrain. Apart from Reddy, the chopper was carrying principal secretary to CM S Subrahmanyam and YSR's chief security officer ASC Wesley.

Andhra Pradesh chief minister Y S Rajasekhara Reddy has died in an air crash. The bodies of 60-year-old Reddy, his special secretary P Subramanyam, chief security officer A S C Wesley, pilot Group Captain S K Bhatia and co-pilot M S Reddy were found on Rudrakonda Hill, 40 nautical miles east of here, besides the mangled remains of the helicopter. The central leadership of the Congress is understood to have cleared the name of Andhra Pradesh finance minister K Rosaiah as the caretaker CM of the state.

[Figure: system architecture. Documents (doc1 ... doc3) → Pre Processing → Sentences → Sentence Scorers (feature 1, feature 2, ..., feature n) → Ranker → Ranked Set of Sentences → Summary Generator → Summary]

Sentence Ranking

• 4 stages of sentence-extractive summarization
– Pre Processing
– Sentence Scoring
– Sentence Ranking
– Summary Generation
• Scores from features are manually weighted to compute the sentence rank
• Instead, use a machine learning algorithm to estimate the rank from the features
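The four stages, with manually weighted feature scores, can be sketched as follows. This is a minimal illustration, not the authors' system: the function names, the naive sentence splitter, and the idea of passing scorers as callables are all assumptions made for the example.

```python
import re

def split_sentences(text):
    # Pre-processing (assumed, naive): split after sentence-ending punctuation.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def rank_sentences(sentences, scorers, weights):
    # Sentence scoring and ranking: each scorer is a hypothetical callable
    # f(sentence, position) and the rank is a manually weighted sum.
    scored = []
    for position, sentence in enumerate(sentences, start=1):
        total = sum(w * f(sentence, position) for f, w in zip(scorers, weights))
        scored.append((total, sentence))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored]

def generate_summary(ranked, word_limit):
    # Summary generation: take sentences in rank order, skipping any
    # sentence that would push the summary past the word limit.
    summary, used = [], 0
    for sentence in ranked:
        length = len(sentence.split())
        if used + length > word_limit:
            continue
        summary.append(sentence)
        used += length
    return summary
```

Replacing the fixed `weights` with a learned regression model is the change the last bullet proposes.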


Support Vector Regression (SVR)

• Regression analysis: modeling the values of a dependent variable from one or more independent variables
• Regression is a function Y = f(X, β)
– X: the independent variables
– Y: the dependent variable
– β: the unknown parameters
• Regression using support vectors is Support Vector Regression (SVR)
• Estimate the sentence rank (dependent variable) from the scoring features (independent variables) through SVR
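To make the idea concrete, here is a small sketch of a linear regressor trained with the ε-insensitive loss that SVR is built on. It is an assumption-laden stand-in: the slides do not say which SVR implementation, kernel, or parameters were used, and a real system would normally rely on an existing SVR library rather than this plain subgradient descent. All names and default values below are hypothetical.

```python
def predict(w, b, x):
    # Linear model: estimated rank = w . x + b
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def train_linear_svr(X, y, epsilon=0.01, C=10.0, lr=0.001, epochs=5000):
    # Minimizes 0.5*||w||^2 + C * sum(max(0, |f(x_i) - y_i| - epsilon))
    # by per-sample subgradient steps (objective rescaled by 1/C).
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = predict(w, b, xi) - yi
            # Errors inside the epsilon tube contribute no loss gradient.
            g = 1.0 if err > epsilon else (-1.0 if err < -epsilon else 0.0)
            for j in range(dim):
                w[j] -= lr * (w[j] / (C * n) + g * xi[j])
            b -= lr * g
    return w, b

# Toy usage: rows are feature vectors of sentences, targets are their ranks.
features = [[0.0], [0.25], [0.5], [0.75], [1.0]]
ranks = [0.1, 0.2, 0.3, 0.4, 0.5]
w, b = train_linear_svr(features, ranks)
estimated_rank = predict(w, b, [0.6])
```

At test time, the trained model scores every candidate sentence from its feature vector, and the sentences are ordered by the estimated rank.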


Estimating Sentence Rank

ROUGE: a recall-oriented metric that evaluates a summary by its word overlap with model summaries

ROUGE-2 and ROUGE-SU4 correlate highly with human evaluation

Sentence Rank (is) of a sentence s is

|Bigram_m ∩ Bigram_s|: the number of bigrams shared by the model summary and the sentence
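The shared-bigram count can be computed as in the sketch below. It covers only the overlap term: the slides do not show how the count is normalized, so no normalization is attempted. Lowercasing, whitespace tokenization, and set (rather than multiset) intersection are assumptions.

```python
def bigrams(text):
    # Set of adjacent word pairs, using naive lowercase whitespace tokens.
    tokens = text.lower().split()
    return set(zip(tokens, tokens[1:]))

def shared_bigram_count(sentence, model_summary):
    # Size of the intersection of the two bigram sets.
    return len(bigrams(sentence) & bigrams(model_summary))
```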


Oracle summaries

Each oracle summary is the best summary that any sentence-extractive summarization system can generate

Sentences are ranked using the ROUGE-2 score described above

Oracle summaries are used to depict the gap and the scope for improvement


Sentence Scoring Features

Sentence position is a popular and well-studied feature

Sentence Location 1 (SL1)

• The first 3 sentences of a document contain the most informative content (also shown by analysis of oracle summaries)
• Score of a sentence s at position n in document d is

Score(s_nd) = 1 - n/1000   if n <= 3
            = n/1000       otherwise


Sentence Location 2 (SL2)

• The positional index of the sentence itself is the feature value
• The model learns the optimum sentence position based on genre
• Not inclined toward the top or bottom of the document, unlike SL1

Score(s_nd) = n
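Both positional features translate directly into code. The sketch below follows the two formulas as stated on the slides; treating n as a 1-based position is an assumption, and the function names are made up for the example.

```python
def sl1_score(n):
    # SL1: 1 - n/1000 for the first three sentences, n/1000 for the rest.
    # n is assumed to be the 1-based position of the sentence in its document.
    if n <= 3:
        return 1 - n / 1000
    return n / 1000

def sl2_score(n):
    # SL2: the raw positional index; the regression model decides how to use it.
    return n
```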

Sentence Frequency Score (SFS)

• Ratio of the number of sentences in which a word occurs to the total number of sentences in the cluster
• SFS of a word w is

SFS(w) = |{s : w ∈ s}| / |S|, where S is the set of sentences in the cluster
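A word-level SFS following the verbal definition above might look like this. Whitespace tokenization and lowercasing are assumptions, and the sketch stops at the word level because the slides do not say how word scores are aggregated into a sentence score.

```python
def sentence_frequency_score(word, sentences):
    # Fraction of the cluster's sentences that contain the word.
    if not sentences:
        return 0.0
    word = word.lower()
    containing = sum(1 for s in sentences if word in s.lower().split())
    return containing / len(sentences)
```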


TF-IDF

• Popular measure of the relevance of a document in IR
• The same analogy is used to find the relevance of a sentence
• Term Frequency (TF_ij) of term t_i in document d_j is
• Inverse Document Frequency (IDF_i) of a term t_i is
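The TF and IDF formulas did not survive in this transcript. The sketch below uses the commonly cited definitions (relative term count for TF, log of total documents over documents containing the term for IDF); whether the authors used exactly these variants is an assumption.

```python
import math
from collections import Counter

def term_frequency(term, document_tokens):
    # Occurrences of the term divided by the number of tokens in the document.
    if not document_tokens:
        return 0.0
    return Counter(document_tokens)[term] / len(document_tokens)

def inverse_document_frequency(term, documents):
    # log(total documents / documents containing the term); 0.0 if unseen.
    containing = sum(1 for tokens in documents if term in tokens)
    if containing == 0:
        return 0.0
    return math.log(len(documents) / containing)

def tf_idf(term, document_tokens, documents):
    return term_frequency(term, document_tokens) * inverse_document_frequency(term, documents)
```

Here `documents` is a list of token lists, one per document in the collection.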


Document Frequency Score (DFS)

• Inverse of IDF
• Ratio of the number of documents in which a term occurs to the total number of documents
• The average DFS of the words in a sentence is its score
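DFS as described can be sketched in a few lines. Representing each document as a set of tokens and leaving out stop-word filtering are assumptions of this example.

```python
def document_frequency_score(word, documents):
    # Fraction of documents (each a set of tokens) that contain the word.
    if not documents:
        return 0.0
    return sum(1 for doc in documents if word in doc) / len(documents)

def sentence_dfs(sentence_tokens, documents):
    # Sentence score: average DFS over the words of the sentence.
    if not sentence_tokens:
        return 0.0
    total = sum(document_frequency_score(w, documents) for w in sentence_tokens)
    return total / len(sentence_tokens)
```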

Probabilistic Hyperspace Analogue to Language (PHAL) and Kullback-Leibler Divergence (KL) are also available as features

The baseline summarizer generates a summary by picking the first 100 words of the last document
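The baseline is simple enough to write down directly. This sketch assumes the documents are given as strings in chronological order, so that the last list element is the "last document", and that words are whitespace-delimited.

```python
def baseline_summary(documents, word_limit=100):
    # First `word_limit` words of the last (assumed most recent) document.
    if not documents:
        return ""
    return " ".join(documents[-1].split()[:word_limit])
```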


Novelty Factor

• An ad-hoc feature for update summarization
• Consider a stream of articles published on a topic over a time period T
• All articles published from time 0 to t are considered to have been read previously (prior knowledge)
• Articles published from t to T are new and contain new information
• Let td represent the chronological time stamp of document d


Novelty Factor

NF of a word ‘w’ is

nd_t = { d : w ∈ d and td > t }
pd_t = { d : w ∈ d and td < t }
D = { d : td > t }

• |nd_t| signifies the importance of the term in the new cluster
• |pd_t| penalizes any term that occurs frequently in previous clusters
• |D| is used for smoothing
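The NF formula itself is missing from this transcript. The sketch below assumes the form NF(w) = |nd_t| / (|pd_t| + |D|), which matches the roles described for the three terms (numerator rewards new-cluster occurrences, denominator penalizes earlier occurrences and is smoothed by |D|) but is not confirmed by the slides. The document representation is also hypothetical.

```python
def novelty_factor(word, documents, t):
    # documents: list of (timestamp, set_of_words) pairs; t splits old from new.
    # ASSUMED formula: |nd_t| / (|pd_t| + |D|).
    nd = sum(1 for td, words in documents if td > t and word in words)
    pd = sum(1 for td, words in documents if td < t and word in words)
    new_docs = sum(1 for td, _ in documents if td > t)
    denominator = pd + new_docs
    return nd / denominator if denominator else 0.0
```

Under this assumed form, a word that appears in every new document and in no earlier one scores 1.0, while a word that was already frequent before t is pushed toward 0.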


Training Data

• DUC 2007 Main task data for training
– 45 topics
– Each topic with 25 documents and a query
– 4 associated model summaries of 250 words each
• DUC 2007 Update task data for training update-specific features
– 10 topics
– Each topic divided into clusters A, B, C in chronological order, with 10, 8, and 7 documents respectively
– 4 associated model summaries of 100 words each


Test Dataset

• TAC 2008 Update Summarization data for testing
– 48 topics
– Each topic divided into clusters A and B, with 10 documents each
– The summary for cluster A is a normal summary and the summary for cluster B is an update summary


Experiments and Results

Results for individual features

Feature     ROUGE-2     ROUGE-SU4
KL          0.09285     0.132325
DFS         0.092225    0.13281
NF          0.086155    0.126455
SL1         0.086245    0.12163
SL2         0.08599     0.12147
SFS         0.077745    0.12419
TF-IDF      0.07317     0.12604
PHAL        0.06505     0.10712
baseline    0.05865     0.09333


Experiments and Results

Results for combinations of features

Combination       ROUGE-2     ROUGE-SU4
DFS+SL1           0.102195    0.139205
NF+SL1            0.100845    0.13742
DFS+SL2           0.10126     0.13943
NF+SL2            0.0978      0.134925
DFS+TFIDF         0.0993      0.1383
PHAL+KL           0.094035    0.134275
DFS+SP+PHAL+KL    0.09749     0.13705

• Non-complementing features for KL


At cluster level: Cluster A

System       ROUGE-2    ROUGE-SU4
DFS+SL2      0.10604    0.13936
DFS+TFIDF    0.10633    0.14415
System-43    0.11137    0.14297
System-13    0.11045    0.13987
Oracle       0.17041    0.19616


At cluster level: Cluster B

System       ROUGE-2    ROUGE-SU4
DFS+SL1      0.10343    0.14267
NF+SL1       0.10055    0.13791
System-14    0.10111    0.13669
System-65    0.09675    0.13381
Oracle       0.17610    0.19877


Discussion

• The quality of the training data for NF is poor compared to that for other features, which explains its relatively lower performance compared to DFS
• Huge gap between oracle summaries and the best peers


Contribution



Ad-hoc Feature (Novelty Factor) for modeling novelty along with relevance

Analyzing the effect of various feature combinations on the quality of summaries using Support Vector Regression

Depicting the scope of improvement in summarization

Thank You

