Enhancing Single-document Summarization by Combining RankNet and Third-party Sources


Transcript of Enhancing Single-document Summarization by Combining RankNet and Third-party Sources

Page 1: Enhancing Single-document Summarization by Combining RankNet and Third-party Sources

Enhancing Single-document Summarization by Combining RankNet and Third-party Sources

Yu-Mei Chang, Lab Name
National Taiwan Normal University

Krysta M. Svore, Lucy Vanderwende, and Christopher J.C. Burges, "Enhancing Single-document Summarization by Combining RankNet and Third-party Sources," in Proc. of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).

Page 2: Enhancing Single-document Summarization by Combining RankNet and Third-party Sources

Abstract
• They present a new approach to automatic summarization based on neural nets, called NetSum.
• Features:
  – They extract a set of features from each sentence.
  – Novel features based on news search query logs and Wikipedia entities (third-party datasets) enhance the sentence features.
• They use the RankNet learning algorithm, a neural network ranking algorithm, as a pair-based sentence ranker to score every sentence in the document and identify the most important sentences.
• They apply their system to documents gathered from CNN.com, where each document includes highlights and an article.
• Their system significantly outperforms the standard baseline in the ROUGE-1 measure on over 70% of their document set.

Page 3: Enhancing Single-document Summarization by Combining RankNet and Third-party Sources

Their task (1)
• Focus on single-document summarization of newswire documents.
• Each document consists of three or four highlight sentences and the article text.
• Each highlight sentence is human-generated but is based on the article.
• The output of their system is purely extracted sentences; they do not perform any sentence compression or sentence generation.
• Their data consists of:
  – 1365 news documents gathered from CNN.com
  – hand-collected from December 2006 to February 2007
  – a maximum of 50 documents per day were collected.

Page 4: Enhancing Single-document Summarization by Combining RankNet and Third-party Sources

[Screenshot of a CNN.com article page, showing the Story Highlights box, the article text, and the timestamp.]
(http://edition.cnn.com/2009/WORLD/asiapcf/08/14/typhoon.village/index.html)

Page 5: Enhancing Single-document Summarization by Combining RankNet and Third-party Sources

Their task (2)
• They develop two separate problems based on their document set.
  1. Can they extract three sentences that best "match" the highlights as a whole?
     • They concatenate the three sentences produced by their system into a single summary, or block, and do the same with the three highlight sentences (this disregards sentence order).
  2. Can they extract three sentences that best "match" the three highlights, such that ordering is preserved?
     • The first sentence is compared against the first highlight, the second sentence against the second highlight, and the third sentence against the third highlight.

Page 6: Enhancing Single-document Summarization by Combining RankNet and Third-party Sources

The system
• Goal: to extract three sentences from a single news document that best match various characteristics of the three document highlights.
• A set of features is extracted from each sentence in the training and test sets, and the training set is used to train the system.
• The system learns from the training set the distribution of features for the best sentences and outputs a ranked list of sentences for each document.
• In this paper, they rank sentences using a neural network algorithm called RankNet (Burges et al., 2005), a pair-based neural network algorithm that ranks a set of inputs, in this case the set of sentences in a given document (see the pipeline sketch below).
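To make the flow above concrete, here is a minimal sketch of a NetSum-style extraction step. The function and parameter names (extract_features, ranker, k) are illustrative placeholders, not the authors' code; the trained ranker and the feature extractor stand in for RankNet and the feature set described on the following slides.

```python
# Minimal sketch of the extraction step described above (illustrative only).

def extract_summary(sentences, extract_features, ranker, k=3):
    """Score every sentence with a trained pair-based ranker and return the top k."""
    scored = []
    for i, sent in enumerate(sentences):
        feats = extract_features(sent, i, sentences)   # per-sentence feature vector
        scored.append((ranker(feats), i, sent))        # higher score = more important
    scored.sort(key=lambda t: t[0], reverse=True)      # rank sentences by score
    return [sent for _, _, sent in scored[:k]]         # the extracted highlights
```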

Page 7: Enhancing Single-document Summarization by Combining RankNet and Third-party Sources

Matching Extracted to Generated Sentences

• Choosing the three sentences most similar to the three highlights is very challenging:
  – the highlights include content that has been gathered across sentences and even paragraphs
  – the highlights include vocabulary that may not be present in the text
• For 300 news articles:
  – 19% of human-generated summary sentences contain no matching article sentence
  – only 42% of the summary sentences match the content of a single article sentence, and even then there are still semantic and syntactic transformations between the summary sentence and the article sentence.
• This is because each highlight is human-generated and does not exactly match any one sentence in the document.

Page 8: Enhancing Single-document Summarization by Combining RankNet and Third-party Sources

RankNet (1)
• The system is trained on pairs of sentences $(S_i, S_j)$, where $S_i$ should be ranked higher than or equal to $S_j$.
  – Pairs are generated between sentences in a single document, not across documents.
  – Each pair is determined from the input labels.
• Their sentences are labeled using ROUGE: if the ROUGE score of $S_i$ is greater than the ROUGE score of $S_j$, then $(S_i, S_j)$ is one input pair.
• The cost function for RankNet is the probabilistic cross-entropy cost function (see the sketch after this slide).
• Their system, NetSum, is a two-layer neural net trained using RankNet.
• To speed up training, they run RankNet in the framework of LambdaRank.
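As a concrete illustration of the pairwise training signal, here is a minimal sketch of the cross-entropy pair cost, written in the closed form given on the "A Probability Ranking Cost Function" slides later in this deck. It is a sketch, not the authors' implementation, which trains a two-layer net with the LambdaRank speed-up.

```python
import math

def ranknet_pair_cost(score_i, score_j, target_p=1.0):
    """Pairwise cross-entropy cost for one sentence pair (S_i, S_j).

    score_i, score_j are the model outputs f(S_i), f(S_j); target_p is the
    desired probability that S_i ranks above S_j (1.0 when S_i has the
    strictly higher ROUGE label).
    """
    o_ij = score_i - score_j                 # o_ij = f(S_i) - f(S_j)
    # Closed form of -P*log(P_ij) - (1-P)*log(1-P_ij) under the logistic map
    # P_ij = e^{o_ij} / (1 + e^{o_ij}).
    return -target_p * o_ij + math.log1p(math.exp(o_ij))

# Example: S_i is labeled above S_j and the model already agrees slightly.
print(ranknet_pair_cost(0.7, 0.5))           # small positive cost
```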

Page 9: Enhancing Single-document Summarization by Combining RankNet and Third-party Sources

RankNet (2)
• They implement 4 versions of NetSum.
  – NetSum(b):
    • Trained for their first summarization problem (b indicates "block").
    • They label each sentence $S_i$ by $l^1_i = \max_n R_1(S_i, H_n)$, the maximum ROUGE-1 score between $S_i$ and each highlight $H_n$, for $n \in \{1, 2, 3\}$, where ROUGE-N is given by
      $$\mathrm{ROUGE\text{-}N} = \frac{\sum_{S \in R}\sum_{gram_N \in S} Count_{match}(gram_N)}{\sum_{S \in R}\sum_{gram_N \in S} Count(gram_N)}$$
  – NetSum(n):
    • For the second problem of matching three sentences to the three highlights individually.
    • They label each sentence $S_i$ by $l^1_{n,i} = R_1(S_i, H_n)$, the ROUGE-1 score between $S_i$ and highlight $H_n$.
    • The ranker for highlight $n$, NetSum(n), is passed samples labeled using $l^1_{n,i}$ (see the labeling sketch below).
• Their reference summary $R$ is the set of human-generated highlights:
  – the highlights as a block (for NetSum(b))
  – a single highlight sentence (for NetSum(n))
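A minimal sketch of the two labeling schemes, using a simple whitespace-tokenized ROUGE-1 recall; the paper's official ROUGE tooling and preprocessing are not reproduced here.

```python
from collections import Counter

def rouge_1(candidate, reference):
    """ROUGE-1 recall: clipped unigram matches over unigrams in the reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    matched = sum(min(cand[w], count) for w, count in ref.items())
    return matched / max(sum(ref.values()), 1)

def label_block(sentence, highlights):
    """NetSum(b)-style label: maximum ROUGE-1 score against any highlight."""
    return max(rouge_1(sentence, h) for h in highlights)

def label_per_highlight(sentence, highlight_n):
    """NetSum(n)-style label: ROUGE-1 score against one specific highlight."""
    return rouge_1(sentence, highlight_n)
```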

Page 10: Enhancing Single-document Summarization by Combining RankNet and Third-party Sources

Features (1)
• RankNet takes as input a set of samples, where each sample contains a label and a feature vector.
• They generate 10 features for each sentence in each document, listed in Table 1.
1) They include a binary feature that equals 1 if $S_i$ is the first sentence of the document: $F_{S_i} = \delta_{i,1}$, where $\delta$ is the Kronecker delta function.
  – This feature is used only for NetSum(b) and NetSum(1).

P.S. The Kronecker delta function (Wikipedia):
$$\delta_{i,j} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}$$

Page 11: Enhancing Single-document Summarization by Combining RankNet and Third-party Sources

Features (2)
2) Sentence position.
  – They found in empirical studies that:
    • the sentence that best matches highlight $H_1$ is on average 10% down the article
    • the sentence that best matches $H_2$ is on average 20% down the article
    • the sentence that best matches $H_3$ is on average 31% down the article
  – The position of $S_i$ in document $D$ is $Pos(S_i) = i / l$, for $i = 1, \ldots, l$, where $l$ is the number of sentences in $D$.
3) SumBasic score of a sentence.
  – Estimates the importance of a sentence based on word frequency.
  – The SumBasic score of $S_i$ in document $D$ is
    $$SB(S_i) = \frac{\sum_{w \in S_i} p(w)}{|S_i|},$$
    where $p(w)$ is the probability of word $w$ in $D$ (its count in $D$ over the total number of words in $D$) and $|S_i|$ is the number of words in sentence $S_i$ (see the sketch below).
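A minimal sketch of the first-sentence flag from the previous slide together with the position and SumBasic features above. Whitespace tokenization and lowercasing here are simplifications of the paper's preprocessing.

```python
from collections import Counter

def basic_sentence_features(sentences, i):
    """First-sentence flag, relative position, and SumBasic score for sentence i (0-based)."""
    all_words = [w.lower() for s in sentences for w in s.split()]
    counts = Counter(all_words)
    total = max(len(all_words), 1)
    sent_words = [w.lower() for w in sentences[i].split()]

    is_first = 1.0 if i == 0 else 0.0            # F_{S_i} = delta_{i,1}
    position = (i + 1) / len(sentences)          # Pos(S_i) = i / l, with 1-based i
    sumbasic = sum(counts[w] / total for w in sent_words) / max(len(sent_words), 1)
    return {"is_first": is_first, "position": position, "sumbasic": sumbasic}
```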

Page 12: Enhancing Single-document Summarization by Combining RankNet and Third-party Sources

Features (3)
4) They also include the SumBasic score over bigrams, where the word probabilities in 3) are replaced by bigram probabilities and they normalize by the number of bigrams in $S_i$.
5) They compute the similarity of a sentence $S_i$ in document $D$ with the title $T$ of $D$ as the relative probability of title terms in $S_i$:
   $$Sim(S_i) = \sum_{t \in S_i} p(t), \qquad p(t) = \frac{Count_T(t)}{|T|},$$
   where $Count_T(t)$ is the number of times term $t$ appears in $T$ and $|T|$ is the number of terms in $T$ (see the sketch below).
• The remaining features they use are based on third-party data sources.
• They propose using news query logs and Wikipedia entities to enhance the features:
  – Several features are based on query terms frequently issued to Microsoft's news search engine, http://search.live.com/news
  – Others are based on entities found in the online open-source encyclopedia Wikipedia (Wikipedia.org, 2007)
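A minimal sketch of the title-similarity feature, assuming whitespace tokenization; the paper's exact term normalization is not reproduced.

```python
from collections import Counter

def title_similarity(sentence, title):
    """Sum over sentence terms of their relative probability in the title:
    Sim(S_i) = sum_{t in S_i} Count_T(t) / |T|."""
    title_terms = title.lower().split()
    title_counts = Counter(title_terms)
    n_title = max(len(title_terms), 1)
    return sum(title_counts[t] / n_title for t in sentence.lower().split())
```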

Page 13: Enhancing Single-document Summarization by Combining RankNet and Third-party Sources

Features (4)
6) They collected several hundred of the most frequently queried terms in February 2007 from the news query logs, taking the daily top 200 terms for 10 days.
  – They calculate the average probability of news query terms in $S_i$ as
    $$NT(S_i) = \frac{\sum_{q \in S_i} p(q)}{|S_i^q|},$$
    where $p(q)$ is the probability of news query term $q$ and $|S_i^q|$ is the number of news query terms in $S_i$.
  – $p(q) = Count_D(q) / |D^q|$, where $Count_D(q)$ is the number of times term $q$ appears in $D$ and $|D^q|$ is the number of news query terms in $D$.
7) They also include the sum of news query term probabilities in $S_i$, given by $NT_s(S_i) = \sum_{q \in S_i} p(q)$.
8) The relative probability of news query terms in $S_i$, given by $NT_r(S_i) = \frac{\sum_{q \in S_i} p(q)}{|S_i|}$ (see the sketch below).
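A minimal sketch of the three news-query-term features as reconstructed above. The query_terms set stands in for the terms harvested from the news query logs, and document_words is assumed to be the lowercased word list of the document.

```python
from collections import Counter

def news_term_features(sentence, document_words, query_terms):
    """Average, sum, and length-normalized probability of news query terms in a sentence."""
    doc_query_hits = [w for w in document_words if w in query_terms]
    counts = Counter(doc_query_hits)
    n_doc_query = max(len(doc_query_hits), 1)

    sent_words = sentence.lower().split()
    sent_query = [w for w in sent_words if w in query_terms]
    probs = [counts[w] / n_doc_query for w in sent_query]   # p(q) = Count_D(q) / |D^q|

    total = sum(probs)
    return {
        "nt_avg": total / max(len(sent_query), 1),          # averaged over query terms in S_i
        "nt_sum": total,
        "nt_rel": total / max(len(sent_words), 1),          # normalized by sentence length
    }
```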

Page 14: Enhancing Single-document Summarization by Combining RankNet and Third-party Sources

Features (5)
9) Sentences in a CNN document $D$ that contain Wikipedia entities that frequently appear in $D$ are considered important.
  – They calculate the average Wikipedia entity score for $S_i$ as
    $$WE(S_i) = \frac{\sum_{e \in S_i} p(e)}{|S_i^e|},$$
    where $p(e)$ is the probability of entity $e$, given by $p(e) = Count_D(e) / |D^e|$; here $Count_D(e)$ is the number of times entity $e$ appears in CNN document $D$, $|D^e|$ is the total number of entities in $D$, and $|S_i^e|$ is the number of entities in $S_i$.
10) They also include the sum of Wikipedia entity probabilities, given by $WE_s(S_i) = \sum_{e \in S_i} p(e)$ (see the sketch below).
• All features are computed over sentences where every word has been lowercased and punctuation has been removed after sentence breaking.
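A minimal sketch of the Wikipedia-entity features as reconstructed above; matching text spans against Wikipedia entity names is assumed to have been done upstream, so the function simply takes entity lists.

```python
from collections import Counter

def wikipedia_entity_features(sentence_entities, document_entities):
    """Average and sum of Wikipedia-entity probabilities for one sentence."""
    counts = Counter(document_entities)
    n_doc = max(len(document_entities), 1)
    probs = [counts[e] / n_doc for e in sentence_entities]   # p(e) = Count_D(e) / |D^e|
    total = sum(probs)
    return {"we_avg": total / max(len(sentence_entities), 1), "we_sum": total}
```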

Page 15: Enhancing Single-document Summarization by Combining RankNet and Third-party Sources

Evaluation
• For the first summarization task, they compare against the baseline of choosing the first three sentences as the block summary.
• For the second, highlights task, they compare NetSum(n) against the baseline of choosing sentence n to match highlight n.
• They perform all experiments using five-fold cross-validation on their dataset of 1365 documents.
  – They divide their corpus into five random sets and train on three combined sets, validate on one set, and test on the remaining set (see the sketch below).
• They consider ROUGE-1 to be the measure of importance and thus train the model on ROUGE-1 (to optimize ROUGE-1 scores) and likewise evaluate their system on ROUGE-1.
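A minimal sketch of the five-fold protocol described above, assuming the folds rotate so that each one serves once as the test set; the slide states the fold sizes but not the rotation, so that part is an assumption.

```python
import random

def five_fold_splits(documents, seed=0):
    """Yield (train, validation, test) splits: five random folds, train on three,
    validate on one, test on the remaining one."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    folds = [docs[i::5] for i in range(5)]                   # five roughly equal folds
    for test_idx in range(5):
        val_idx = (test_idx + 1) % 5                         # assumed rotation scheme
        train = [d for i in range(5) if i not in (test_idx, val_idx) for d in folds[i]]
        yield train, folds[val_idx], folds[test_idx]
```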

Page 16: Enhancing Single-document Summarization by Combining RankNet and Third-party Sources

Results: Summarization (1)
• Table 2 shows the average ROUGE-1 and ROUGE-2 scores obtained with NetSum(b) and the baseline.
  – NetSum(b) produces a higher-quality block on average under ROUGE-1.
• Table 3 lists the statistical tests performed and the significance of the performance differences between NetSum and the baseline at 95% confidence.

Page 17: Enhancing Single-document Summarization by Combining RankNet and Third-party Sources

Results: Summarization (2)
• Table 4 lists the sentences in the block produced by NetSum(b) and the baseline block, for the article shown in Figure 1.
• The NetSum(b) summary achieves a ROUGE-1 score of 0.52, while the baseline summary scores only 0.36.

Page 18: Enhancing Single-document Summarization by Combining RankNet and Third-party Sources

Result: Highlights (1)
• Their second task is to extract three sentences from a document that best match the three highlights in order.
• They train NetSum(n) for each highlight $n = 1, 2, 3$, and compare NetSum(n) with the baseline of picking the $n$th sentence of the document.
• NetSum(1) produces a sentence with a ROUGE-1 score that is equal to or better than the baseline score for 93.26% of documents.
• Under ROUGE-2, NetSum(1) performs equal to or better than the baseline on 94.21% of documents.
• Table 5 shows the average ROUGE-1 and ROUGE-2 scores obtained with NetSum(1) and the baseline.
  – NetSum(1) produces a higher-quality sentence on average under ROUGE-1.

Page 19: Enhancing Single-document Summarization by Combining RankNet and Third-party Sources

Result: Highlights (2)
• Table 6 gives the highlights produced by NetSum(n) and the highlights produced by the baseline, for the article shown in Figure 1.
• The NetSum(n) highlights produce ROUGE-1 scores equal to or higher than the baseline ROUGE-1 scores.
• They confirmed that the inclusion of news-based and Wikipedia-based features improves NetSum's performance:
  – They removed all news-based and Wikipedia-based features from NetSum(3).
  – The resulting performance moderately declined.

Page 20: Enhancing Single-document Summarization by Combining RankNet and Third-party Sources

Conclusions
• They have presented a novel approach to automatic single-document summarization based on neural networks, called NetSum.
• Their work is the first to use both neural networks for summarization and third-party datasets for features, using Wikipedia and news query logs.
• They have evaluated the system on two novel tasks:
  1) producing a block of highlights
  2) producing three ordered highlight sentences.
• Their experiments were run on previously unstudied data gathered from CNN.com.
• Their system shows remarkable performance over the baseline of choosing the first n sentences of the document, where the performance difference is statistically significant under ROUGE-1.

Page 21: Enhancing Single-document Summarization by Combining RankNet and Third-party Sources

RankNet

C.J.C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. 2005. Learning to Rank using Gradient Descent. In Proc. of ICML, pages 89–96. ACM.

Page 22: Enhancing Single-document Summarization by Combining RankNet and Third-party Sources

A Probability Ranking Cost Function (1)
• They consider models where the learning algorithm is given a set of pairs of samples $[A, B]$ in $\mathbb{R}^d$, together with target probabilities $\bar{P}_{AB}$ that sample A is to be ranked higher than sample B.
• They consider models $f : \mathbb{R}^d \to \mathbb{R}$ such that the rank order of a set of test samples is specified by the real values that $f$ takes; specifically, $f(x_1) > f(x_2)$ is taken to mean that the model asserts $x_1$ is ranked higher than $x_2$.
• Denote the modeled posterior $P(x_i \triangleright x_j)$ by $P_{ij}$, $i, j = 1, \ldots, m$, and let $\bar{P}_{ij}$ be the desired target values for those posteriors.
• Define $o_i \equiv f(x_i)$ and $o_{ij} \equiv f(x_i) - f(x_j)$. They use the cross-entropy cost function
  $$C_{ij} = -\bar{P}_{ij} \log P_{ij} - \left(1 - \bar{P}_{ij}\right) \log\left(1 - P_{ij}\right).$$

Page 23: Enhancing Single-document Summarization by Combining RankNet and Third-party Sources

A Probability Ranking Cost Function (2)
• The map from outputs to probabilities is modeled using a logistic function:
  $$P_{ij} \equiv \frac{e^{o_{ij}}}{1 + e^{o_{ij}}}$$
• $C_{ij}$ then becomes
  $$C_{ij} = -\bar{P}_{ij}\, o_{ij} + \log\left(1 + e^{o_{ij}}\right)$$
  – $C_{ij}$ asymptotes to a linear function for large $|o_{ij}|$ (a short derivation follows below).
• The target probability is defined from the pair's labels $s_i$, $s_j$ as:
  $$\bar{P}_{ij} = \begin{cases} 1 & \text{if } s_i > s_j \\ 1/2 & \text{if } s_i = s_j \\ 0 & \text{if } s_i < s_j \end{cases}$$
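For completeness, here is a short derivation (not on the slide) showing how the closed form above follows from substituting the logistic map into the cross-entropy cost of the previous slide:

```latex
\begin{aligned}
\log P_{ij}      &= o_{ij} - \log\bigl(1 + e^{o_{ij}}\bigr),\\
\log(1 - P_{ij}) &= -\log\bigl(1 + e^{o_{ij}}\bigr),\\
C_{ij} &= -\bar{P}_{ij}\bigl(o_{ij} - \log(1 + e^{o_{ij}})\bigr)
          + \bigl(1 - \bar{P}_{ij}\bigr)\log\bigl(1 + e^{o_{ij}}\bigr)\\
       &= -\bar{P}_{ij}\, o_{ij} + \log\bigl(1 + e^{o_{ij}}\bigr).
\end{aligned}
```

In particular, for large positive $o_{ij}$ the cost behaves like $(1 - \bar{P}_{ij})\, o_{ij}$ and for large negative $o_{ij}$ like $-\bar{P}_{ij}\, o_{ij}$, which is the asymptotic linearity the slide mentions.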

Page 24: Enhancing Single-document Summarization by Combining RankNet and Third-party Sources

RankNet: Learning to rank with neural nets (1)
• The above cost function is quite general; here they explore using it in neural network models, as motivated above.
• It is useful first to recall the back-propagation equations for a two-layer net with $q$ output nodes.
• For a given training sample, denote the outputs of the net by $o_i$, $i = 1, \ldots, q$, and the targets by $t_i$; let the transfer function of each node in the $j$th layer of nodes be $g^j$, and let the cost function be $\sum_{i=1}^{q} f(o_i, t_i)$.
• If $\alpha_k$ are the parameters of the model, then a gradient descent step amounts to $\delta\alpha_k = -\eta_k \,\partial f / \partial \alpha_k$, where the $\eta_k$ are positive learning rates.

Page 25: Enhancing Single-document Summarization by Combining RankNet and Third-party Sources

RankNet: Learning to rank with neural nets (2)
• The net embodies the function
  $$o_i = g^3\!\left(\sum_j w^{32}_{ij}\, g^2\!\left(\sum_k w^{21}_{jk}\, x_k + b^2_j\right) + b^3_i\right)$$
• where for the weights $w$ and offsets $b$, the upper indices index the node layer and the lower indices index the nodes within each corresponding layer (a forward-pass sketch follows below).
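A minimal sketch of the forward pass of the two-layer net above; tanh is an illustrative choice for the transfer functions $g^2$ and $g^3$, which the slide leaves unspecified.

```python
import math

def two_layer_net(x, w21, b2, w32, b3, g=math.tanh):
    """o_i = g(sum_j w32[i][j] * g(sum_k w21[j][k] * x[k] + b2[j]) + b3[i])."""
    hidden = [g(sum(w * xk for w, xk in zip(row, x)) + b) for row, b in zip(w21, b2)]
    return [g(sum(w * h for w, h in zip(row, hidden)) + b) for row, b in zip(w32, b3)]

# Example: 2 input features -> 3 hidden nodes -> 1 output (the sentence score).
score = two_layer_net(
    x=[0.4, 0.1],
    w21=[[0.2, -0.1], [0.5, 0.3], [-0.4, 0.8]], b2=[0.0, 0.1, -0.2],
    w32=[[0.7, -0.3, 0.5]], b3=[0.05],
)
print(score)
```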