
Query-Based Abstractive Summarization Using Neural Networks

Johan [email protected]

Niklas [email protected]

Mikael Kågebäck*

[email protected]

*Department of Computer Science and Engineering, Chalmers University of Technology

Abstract

In this paper, we present a model for generating summaries of text documents with respect to a query. This is known as query-based summarization. We adapt an existing dataset of news article summaries for the task and train a pointer-generator model using this dataset. The generated summaries are evaluated by measuring similarity to reference summaries. Our results show that a neural network summarization model, similar to existing neural network models for abstractive summarization, can be constructed to make use of queries to produce targeted summaries.

1 Introduction

Creating short summaries of documents with respect to a query has applications in, for example, search engines, where it may help inform users of the most relevant results. However, constructing such a summary automatically is a difficult problem yet to be fully solved. In this paper, a neural network model for this task is presented. More specifically, the model is designed for brief, commonly single-sentence, summaries. A situation where this may be useful is when a user has performed a search in a search engine and a set of documents has been returned. Concise summaries could then be displayed along with the search results, giving a quick overview of how each document is related to the search query. What is commonly done in search engines today is that text surrounding an occurrence of a search query in the document is displayed as a summary. This is an example of extractive summarization, which produces a summary that only contains parts of the original document. A significant difference in the model we present is that it generates an abstractive summary. This type of summary allows for rephrasing and using words not necessarily present in the original document, comparable to a human-written summary. This has the potential of summarizing documents in a more concise way than what is possible with an extractive summary, thereby making it easier for a reader to understand the relationship between a document and a query.

Automatic text summarization has been a research topic for many years. In general, the goal is to concisely represent the most important information in documents. Much previous work in summarization has used extractive methods (Nenkova and McKeown, 2012; Mogren et al., 2015). Commonly, individual sentences are extracted and composed together to form a summary. This gives sentences that are as grammatically correct as the source document. Extractive methods are, however, inherently limited, and cannot reproduce human-written summaries in general. Abstractive summarization in particular is closely related to natural language generation, and it would be desirable to reach human-level performance in writing summaries. It may, however, require human-level understanding of the context of documents to produce results comparable to human-written ones. An important advance in using neural network models for generating text is sequence-to-sequence learning, used by Sutskever et al. (2014) for machine translation. It is a way of mapping a varying-length input text to a varying-length output text, and it is applicable to machine translation as well as summarization. In recent years, progress has been made on using neural network models for text summarization and similar problems. Some examples are sequence-to-sequence models for non-query-based abstractive summarization by Rush et al. (2015) and Nallapati et al. (2016). Neural network models have additionally been used for generating image captions (Karpathy and Fei-Fei, 2015), which is a form of summary, and for question answering problems, such as by Hermann et al. (2015) and Tan et al. (2015). Inspired by this progress, we designed a model for query-based summarization using neural networks.

arXiv:1712.06100v1 [cs.CL] 17 Dec 2017

The main contributions of this work include: (1) A model for query-based abstractive summarization, presented in Section 3. (2) A dataset for query-based abstractive summarization, created by adapting an existing dataset originally used for question answering, described further in Section 4. (3) A quantitative evaluation of the performance of the proposed model compared with an extractive baseline and an uninformed abstractive model, presented in Section 6. (4) A qualitative analysis of the generated summaries.

1.1 Related Work

An early work evaluating several methods for extractive query-based summarization is presented by Goldstein et al. (1999). Besides "full queries", they use "short queries", which on average are 3.9 words long. These are similar in length to the types of queries used in the experiments of this thesis work. Besides the work by Otterbacher et al. (2009), recent work in query-based summarization has been done by Wang et al. (2013), using parse trees and sentence compression. It is described as not "pure extractive summarization". During the later stages of this thesis work, Nema et al. (2017) proposed a neural network model for query-based abstractive summarization, which has some similarities to the model we present. However, the dataset they use is smaller in both average document length and number of documents. Additionally, the types of queries used are different, in that they use complete questions as opposed to our single-entity queries.

The task of question answering is to produce an answer to a question posed in natural language. The task is very general and many other problems can be expressed as a question-answering problem. Summarizing with respect to a query may for instance be expressed as "What is a summary of the document with respect to the query X?", for the query X. If the answer to a question is a single complete sentence, then it is especially close to the types of query-based summaries considered in this thesis. Otterbacher et al. (2009) present a model, Biased LexRank, which they use for a form of question answering as well as extractive query-based summarization. The answers they generate are full sentences, which makes it similar to our task of query-based summarization. Hermann et al. (2015) present neural network models for question answering. For training these, they create a large dataset from CNN/Daily Mail news articles. We adapt this dataset for query-based summarization, as detailed in Section 4. Kumar et al. (2016) introduce Dynamic Memory Networks, which they show reached state-of-the-art performance in a variety of NLP tasks. We draw inspiration from their use of a question module when we incorporate query information in our model.

General abstractive summarization differs from query-based summarization in that a document is summarized without respect to a query. Nallapati et al. (2016) build upon a machine translation model by Bahdanau et al. (2015) and generate general abstractive summaries on multiple datasets, including the CNN/Daily Mail dataset by Hermann et al. (2015). Additions they make for their model include a pointer-generator mechanism (Gülçehre et al., 2016) that allows the model to copy words from the source document. See et al. (2017) propose a similar model, using a similar pointer-generator mechanism, that outperforms Nallapati et al. (2016) on a slightly different version of the CNN/Daily Mail dataset (making the result not "strictly comparable"). They also incorporate what they call coverage for avoiding repetitions in the output.

2 Background

In the following sections, various terms and concepts used throughout the paper are explained.

2.1 Named Entity Recognition

Information extraction is a class of tasks that involve extracting structured information from documents. An example of such a task is named entity recognition, which is the classification of parts of text into different categories, such as persons or locations, or no category. An example from the sentence "The mathematician Jeff Paris visited the city of Paris." is that "Jeff Paris" should be annotated as a person, and the last "Paris" as a location.

2.2 Gated Recurrent Units

The gated recurrent unit (GRU) is a type of recurrent neural network (RNN) that is designed to alleviate the vanishing/exploding gradient problem (Hochreiter, 1991; Bengio et al., 1994), which hinders the original RNN from capturing long-term dependencies. GRU is similar to the popular long short-term memory (LSTM) model but is simpler and less computationally intensive, while still achieving comparable results on many tasks (Chung et al., 2014; Kumar et al., 2016). The entire GRU architecture can be described by the formulas

$$
\begin{aligned}
r_t &= \sigma(W_r [x_t, h_{t-1}] + b_r) \\
z_t &= \sigma(W_z [x_t, h_{t-1}] + b_z) \\
h'_t &= \tanh(W_h [x_t, r_t \odot h_{t-1}] + b_h) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot h'_t.
\end{aligned}
$$

The vector $x_t$ is the input at time step $t$, and $h_t$ is the output, while $r_t$ and $z_t$ are scaling vectors, intended to regulate what information is let through. These can be described as gates. They have elements in $[0, 1]$. The vector $h'_t$ is instead intended to carry data. Its elements are in $[-1, 1]$, generated from a network with a tanh activation function. We denote an entire GRU update step as

$$h_t = \mathrm{GRU}(h_{t-1}, x_t).$$
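As an illustration of the four formulas above, the following is a minimal sketch of one GRU update in plain Python. The helper names (`affine`, `sigmoid`, `gru_step`) and the toy all-zero weights are assumptions made for this example and are not taken from the authors' implementation.

```python
import math

def sigmoid(v):
    """Element-wise logistic function."""
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def affine(W, x, b):
    """Compute W x + b for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def gru_step(h_prev, x, params):
    """One update h_t = GRU(h_{t-1}, x_t), following the formulas above."""
    W_r, b_r, W_z, b_z, W_h, b_h = params
    xh = x + h_prev                                   # concatenation [x_t, h_{t-1}]
    r = sigmoid(affine(W_r, xh, b_r))                 # reset gate r_t
    z = sigmoid(affine(W_z, xh, b_z))                 # update gate z_t
    gated = [ri * hi for ri, hi in zip(r, h_prev)]    # r_t (element-wise) h_{t-1}
    h_cand = [math.tanh(a) for a in affine(W_h, x + gated, b_h)]  # h'_t
    return [zi * hp + (1.0 - zi) * hc for zi, hp, hc in zip(z, h_prev, h_cand)]

# Toy example (hypothetical values): 2-dim input, 2-dim state, all weights zero.
zeros_W = [[0.0] * 4 for _ in range(2)]
zeros_b = [0.0, 0.0]
params = (zeros_W, zeros_b, zeros_W, zeros_b, zeros_W, zeros_b)
h = gru_step([1.0, -1.0], [0.5, 0.5], params)
```

With all-zero weights both gates equal 0.5 and the candidate state is zero, so the new state is simply half of the previous one.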

2.3 Word Embeddings

Given a vocabulary $V$, we can encode each word uniquely using a one-hot encoding. This gives a vector of length $|V|$ where every word in the vocabulary is mapped uniquely to some dimension, with a value of 1, while the other dimensions are 0. This vector can be transformed to an embedding for the word by multiplying it by an embedding matrix $W_{emb}$ of dimensionality $d_{emb} \times |V|$, where $d_{emb}$ is the word embedding dimensionality, commonly a hyperparameter in neural network models. The intention is that the embeddings capture some characteristics of words, giving useful vector representations. For instance, two related words such as football and soccer may be expected to be close to each other in the vector space. Two methods for generating word embeddings are word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014).
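The one-hot multiplication described above amounts to selecting a column of the embedding matrix. The short sketch below shows this equivalence on a toy vocabulary; the vocabulary, the matrix values, and the helper names are made up for illustration.

```python
def one_hot(index, size):
    """One-hot vector of length `size` with a 1 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def matvec(W, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

vocab = ["football", "soccer", "paris"]   # toy vocabulary V (assumption)
W_emb = [[0.9, 0.8, -0.3],                # toy embedding matrix, d_emb x |V| = 2 x 3
         [0.1, 0.2, 0.7]]

idx = vocab.index("soccer")
via_product = matvec(W_emb, one_hot(idx, len(vocab)))   # W_emb times one-hot vector
via_lookup = [row[idx] for row in W_emb]                # column idx of W_emb
```

In practice, implementations typically use the lookup form, since it avoids materializing vectors of length $|V|$.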

2.4 Attention

For many problems, it has been found to be beneficial to use more of the RNN states than the final fixed-size hidden state. Attention is a mechanism for allowing the model to access more information in the decoding process, by letting it identify relevant parts of the input and use the encoder hidden state at these locations. This technique has been used successfully for machine translation (Bahdanau et al., 2015) and image captioning (Xu et al., 2015).

3 Model

We propose a sequence-to-sequence model with attention and a pointer mechanism, making it a pointer-generator model. The input for the problem is a document and a query. These are sequences of words passed to a document encoder and a query encoder respectively. The encoders' outputs are then passed to the attentive decoder, which generates a summary. Both encoders, as well as the decoder, use RNNs with GRUs. Each occurrence of GRU, with a subscript, in the formulas in the following sections has separate weights and biases. The entire model is depicted in Figure 1. The different components and variables in the figure will be explained in detail throughout the section.

3.1 Document Encoder

The document encoder processes an input document, generating a state for each input word. To get a representation of the context around a word, we use a bidirectional RNN (Schuster and Paliwal, 1997) encoder, so both the context before and after contribute to the representation. This is used by Bahdanau et al. (2015) amongst others, achieving good results on a similar task related to text comprehension.

The combined RNN hidden state at time step $i$, $h_i$, and the intermediate states, $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$, from the forward reader and backward reader respectively, are computed as

$$
\begin{aligned}
\overrightarrow{h}_i &= \mathrm{GRU}_{\overrightarrow{doc}}(\overrightarrow{h}_{i-1}, E(w_i)) \\
\overleftarrow{h}_i &= \mathrm{GRU}_{\overleftarrow{doc}}(\overleftarrow{h}_{i-1}, E(\overleftarrow{w}_i)) \\
h_i &= [\overrightarrow{h}_i, \overleftarrow{h}_i],
\end{aligned}
$$

where $w_i \in V$, for the vocabulary $V$, is word $i$ in the input document; $\overleftarrow{w}_i$ is word $i$ in the reversed input; and $E(w_i)$ is the word embedding of $w_i$. The initial states $\overrightarrow{h}_0$ and $\overleftarrow{h}_0$ are zero vectors. Due to the concatenation, the combined state $h_i$ has twice the dimensionality of the state of each unidirectional encoder. The document encoder state dimensionality is denoted $d_{doc}$ and the word embedding dimensionality $d_{emb}$.

Figure 1: Overview of our model. It illustrates connections between parts of the model at a fixed decoder time step $t$. The bottom part, containing labeled boxes, corresponds to the different RNNs. The top part is intended to visualize the two ways the output word $y_t$ can be selected, through the pointer and generator mechanism, to the left and right respectively.
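The bidirectional encoding described in this section can be sketched as follows. The recurrent step functions are passed in as arguments (any function with the shape of a GRU update would do), and the toy running-sum step used in the example is purely illustrative. Aligning the backward states so that position $i$ pairs with the backward state computed at word $i$ is an assumption of this sketch, as is the function name `encode_bidirectional`.

```python
def encode_bidirectional(embeddings, step_fwd, step_bwd, state_size):
    """Return one combined state [forward, backward] per input position."""
    zero = [0.0] * state_size          # initial states are zero vectors
    fwd, h = [], zero
    for e in embeddings:               # forward reader
        h = step_fwd(h, e)
        fwd.append(h)
    bwd, h = [], zero
    for e in reversed(embeddings):     # backward reader over the reversed input
        h = step_bwd(h, e)
        bwd.append(h)
    bwd.reverse()                      # align backward states with positions (assumption)
    return [f + b for f, b in zip(fwd, bwd)]   # concatenation doubles the state size

# Toy stand-in for a GRU step: a running sum (hypothetical, for illustration only).
def toy_step(h, e):
    return [hi + ei for hi, ei in zip(h, e)]

states = encode_bidirectional([[1.0], [2.0], [3.0]], toy_step, toy_step, 1)
```

Each combined state has twice the dimensionality of a unidirectional state, matching the remark about concatenation above.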

3.2 Query Encoder

The query encoder is responsible for creating a fixed-size internal representation of the input query. Unlike the document encoder, the query encoder is a unidirectional RNN encoder since queries are relatively short compared to documents and we only use the final state to represent the whole query. The RNN state $h^Q_i$ at query word $i$ is updated according to

$$h^Q_i = \mathrm{GRU}_{que}(h^Q_{i-1}, E(w^Q_i)), \qquad q = h^Q_{N_Q},$$

where $w^Q$ is the input query and $N_Q$ is the length of the query. The initial state $h^Q_0$ is the zero vector. The query encoder state dimensionality is denoted $d_{que}$.

3.3 Decoder

The decoder is a unidirectional RNN for constructing a summary of the input document by depending on the final state of the input encoder and the query. It utilizes soft attention, in combination with a pointer mechanism, as well as a generator part similar to Bahdanau et al. (2015). The query embedding $q$ is fed as input at each decoder time step. This is similar to the answering module in a question answering model presented by Kumar et al. (2016), who use an RNN-encoded question representation as input at each decoder time step. In our model, the RNN state is updated according to

$$s_t = \mathrm{GRU}_{dec}(s_{t-1}, [c_t, q, E(y_{t-1})]),$$

where $s_0 = h_{N_D}$, the final document encoder state, $N_D$ being the number of input words; $y_0$ corresponds to a special <GO> token, used at the initial time step when no previous word has been predicted; $c_t$ is the context vector at time step $t$ from the attention mechanism, defined subsequently; and $y_{t-1} \in V$ is the predicted output word at time step $t-1$. This is either from the generator mechanism or the pointer mechanism, also defined subsequently. The word embeddings are the same as those used in the encoder.

The intention of the inclusion of $q$ in the input of $\mathrm{GRU}_{dec}$ is to give the decoder the ability to tune the structure of the output sequence to eventually output something concerning the query. For example, if the query is a location, the decoder can output words leading up to an appropriate inclusion of the location.

The generator outputs a word from a subset of the vocabulary $V_{gen} \subseteq V$ at each time step. The selection of the output words is done through a distribution over the words in $V_{gen}$, computed through a softmax as

$$p^{gen}_{tj} = \frac{\exp(z_{tj})}{\sum_k \exp(z_{tk})}, \quad \text{for } j \le |V_{gen}|,$$

where $j$ is an index uniquely mapped to a word $w \in V_{gen}$, and $z_{tj}$ is defined subsequently. Defining this as the probability $P^{gen}_t(w)$, we then select the output word $y^{gen}_t$ with the highest probability by

$$y^{gen}_t = \arg\max_{w \in V_{gen}} P^{gen}_t(w).$$

The softmax probability depends on $z_{tj}$, the output from two linear transformations on the decoder state and context vector, defined as

$$z_t = W^{(2)}_{gen}\left(W^{(1)}_{gen}[s_t, c_t] + b^{(1)}_{gen}\right) + b^{(2)}_{gen},$$

where $W^{(1)}_{gen} \in \mathbb{R}^{d_{gen} \times (d_{dec} + d_{doc})}$, $b^{(1)}_{gen} \in \mathbb{R}^{d_{gen}}$, $W^{(2)}_{gen} \in \mathbb{R}^{|V_{gen}| \times d_{gen}}$ and $b^{(2)}_{gen} \in \mathbb{R}^{|V_{gen}|}$ are trainable parameters, in which $d_{gen}$ is the dimensionality of the hidden layer. The main function of this layer is to reduce the dimensionality of the input, reducing computation time for the final layer of size $|V_{gen}|$.
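The generator computation above (two linear transformations followed by a softmax and an arg max) can be sketched in a few lines. The function names and the tiny weight matrices are assumptions chosen so that the result is easy to check by hand; they do not come from the authors' code.

```python
import math

def affine(W, x, b):
    """Compute W x + b for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(a - m) for a in z]
    total = sum(exps)
    return [e / total for e in exps]

def generator_distribution(s_t, c_t, W1, b1, W2, b2):
    """Distribution over V_gen from decoder state s_t and context vector c_t."""
    hidden = affine(W1, s_t + c_t, b1)      # first linear map, no nonlinearity in the formula
    return softmax(affine(W2, hidden, b2))  # second linear map, then softmax

# Toy sizes (hypothetical): d_dec = d_doc = 1, d_gen = 1, |V_gen| = 3.
probs = generator_distribution([1.0], [2.0],
                               [[1.0, 1.0]], [0.0],
                               [[1.0], [0.0], [-1.0]], [0.0, 0.0, 0.0])
y_gen_index = max(range(len(probs)), key=lambda j: probs[j])  # arg max over V_gen
```

Here the hidden layer has size 1 and the output layer size 3; in the paper's configuration the hidden layer is what keeps the final multiplication with $|V_{gen}|$ rows affordable.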

The model has a soft attention mechanism, based on one used by Bahdanau et al. (2015) for machine translation. The result of the attention mechanism is a context vector $c_t$ produced at each time step $t$, computed as

$$
\begin{aligned}
c_t &= \sum_i \alpha_{ti} h_i \\
\alpha_{ti} &= \frac{\exp(e_{ti})}{\sum_k \exp(e_{tk})} \\
e_{ti} &= \mathrm{score}(h_i, s_{t-1}, E(y_{t-1}), q),
\end{aligned}
$$

where $h_i$ is the document encoder hidden state at index $i$. The score function is defined as

$$\mathrm{score}(h, s, x, q) = v_{att}^{\top} \tanh(W_{att}[h, s, x, q] + b_{att}),$$

where $W_{att} \in \mathbb{R}^{d_{att} \times (d_{doc} + d_{dec} + d_{emb} + d_{que})}$ is a weight matrix, $v_{att} \in \mathbb{R}^{d_{att}}$ is a vector, and $b_{att}$ is a bias vector, all of which are trained together with the rest of the network. The query $q$ is included for the model to focus attention around query words when appropriate.
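To make the attention equations concrete, the sketch below computes the weights $\alpha_{ti}$ and the context vector $c_t$ for a two-word toy document. All names and numbers are illustrative assumptions; with zero weights every position receives the same score, so the attention is uniform.

```python
import math

def affine(W, x, b):
    """Compute W x + b for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(a - m) for a in z]
    total = sum(exps)
    return [e / total for e in exps]

def attention(H, s_prev, y_prev_emb, q, W_att, b_att, v_att):
    """Return attention weights alpha_t and context vector c_t."""
    scores = []
    for h in H:
        # score(h, s, x, q) = v^T tanh(W [h, s, x, q] + b)
        hidden = [math.tanh(a) for a in affine(W_att, h + s_prev + y_prev_emb + q, b_att)]
        scores.append(sum(v * a for v, a in zip(v_att, hidden)))
    alpha = softmax(scores)
    context = [sum(a * h[d] for a, h in zip(alpha, H)) for d in range(len(H[0]))]
    return alpha, context

# Toy example (hypothetical): two encoder states of size 2, other inputs of size 1.
H = [[1.0, 0.0], [0.0, 1.0]]
alpha, c_t = attention(H, [0.0], [0.0], [0.0],
                       [[0.0] * 5], [0.0], [1.0])
```

Because $q$ is part of the concatenated input, trained weights can shift the scores towards positions that relate to the query.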

3.4 Pointer Mechanism

A general issue is that with a generator mechanism limited to frequent words, infrequent words cannot be generated. Further, if the model needs to learn to output names, and there are many different ones and few occurrences of each in the training data, training a model to generate them correctly is problematic. A way to solve these issues is to allow the model to directly copy a word in the input document to the output summary, or point to it. This may additionally be viewed as using the input text as a secondary output vocabulary, in addition to $V_{gen}$.

The pointer mechanism adds a switch, $p^{ptr}_t \in (0, 1)$, at each decoder time step $t$, to the model. It is computed as the output of a linear transformation fed through a sigmoid activation function, as

$$p^{ptr}_t = \sigma(v_{ptr}^{\top}[s_t, E(y_{t-1}), c_t] + b_{ptr}),$$

where $v_{ptr} \in \mathbb{R}^{d_{dec} + d_{emb} + d_{doc}}$ is a weight vector and $b_{ptr}$ is a bias, both of which are trained together with the rest of the network.

If $p^{ptr}_t > 0.5$, a word is copied from the input; otherwise the generator output is used. What is copied from the input for the $t$th decoder word is determined by the attention distribution. Specifically, at time step $t$, we select the word at index $i'_t = \arg\max_i \alpha_{ti}$ in the document, where the attention is highest, as $y^{ptr}_t = w_{i'_t}$. The final output word can then be defined as

$$
y_t =
\begin{cases}
y^{ptr}_t & \text{if } p^{ptr}_t > 0.5 \\
y^{gen}_t & \text{otherwise.}
\end{cases}
$$
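The selection rule above is simple enough to state directly in code. The following sketch assumes the switch value, the attention weights, and the generator's choice have already been computed; the function name and the example values are hypothetical.

```python
def choose_output(p_ptr, alpha, document_words, y_gen):
    """Pick the output word y_t from the pointer or the generator."""
    if p_ptr > 0.5:
        # Pointer: copy the document word with the highest attention weight.
        i_best = max(range(len(alpha)), key=lambda i: alpha[i])
        return document_words[i_best]
    # Generator: use the most probable word from V_gen.
    return y_gen

doc = ["jurgen", "klinsmann", "named", "coach"]
copied = choose_output(0.9, [0.1, 0.7, 0.1, 0.1], doc, "the")
generated = choose_output(0.2, [0.1, 0.7, 0.1, 0.1], doc, "the")
```

Note that the switch is a hard decision at 0.5 here, so at inference time exactly one of the two mechanisms determines each output word.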

3.5 Training Loss

The model is trained on when to use the pointer mechanism in a supervised manner. We define an additional training input $x^{ptr}_t$ that is 1 if the pointer mechanism is set to be used for the $t$th word in the summary, and 0 otherwise. For training this, we define a loss function

$$L_{ptr} = \sum_{t=1}^{N_S} \left( x^{ptr}_t \left(-\log p^{ptr}_t\right) + \left(1 - x^{ptr}_t\right)\left(-\log\left(1 - p^{ptr}_t\right)\right) \right).$$

For training the generator mechanism, we define a loss over the generator softmax layer as

$$L_{gen} = \sum_{t=1}^{N_S} \left(1 - x^{ptr}_t\right)\left(-\log P^{gen}_t(w^*)\right),$$

where $N_S$ is the length of the target summary and $w^* \in V_{gen}$ is the $t$th word in the target summary. Multiplying by $(1 - x^{ptr}_t)$ excludes any addition to the loss when the pointer mechanism is set to be used.

We introduce a form of supervised attention for when the pointer mechanism is set to be used for an output word by introducing a loss function

$$L_{att} = \sum_{t=1}^{N_S} x^{ptr}_t \left(-\log \alpha_{ti^*}\right),$$

where $i^*$ is the index in the input document to point to.

The final loss function is the sum of the different losses, normalized by the length, computed as

$$L = \frac{1}{N_S}\left(L_{gen} + L_{att} + L_{ptr}\right).$$
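As a worked example of how the three losses combine, the sketch below evaluates $L$ for a two-word target summary. The function and argument names are assumptions for illustration: `p_gen_target[t]` stands for $P^{gen}_t(w^*)$ and `alpha_target[t]` for $\alpha_{ti^*}$. Terms that are multiplied by zero in the formulas are skipped rather than evaluated, which avoids taking the logarithm of zero.

```python
import math

def total_loss(x_ptr, p_ptr, p_gen_target, alpha_target):
    """L = (L_gen + L_att + L_ptr) / N_S for one target summary."""
    l_ptr = l_gen = l_att = 0.0
    for x, p, pg, a in zip(x_ptr, p_ptr, p_gen_target, alpha_target):
        if x == 1:
            l_ptr += -math.log(p)          # switch should be on
            l_att += -math.log(a)          # attention should peak at i*
        else:
            l_ptr += -math.log(1.0 - p)    # switch should be off
            l_gen += -math.log(pg)         # generator should predict w*
    return (l_gen + l_att + l_ptr) / len(x_ptr)

# Toy example (hypothetical values): word 1 is pointed to, word 2 is generated.
loss = total_loss(x_ptr=[1, 0],
                  p_ptr=[0.5, 0.5],
                  p_gen_target=[0.0, 0.5],
                  alpha_target=[0.5, 0.0])
```

Every active term here equals $\log 2$, and there are four of them over $N_S = 2$ words, so the result is $2 \log 2$.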

3.6 Generating Summaries

Summaries are considered complete when a special <EOS> token has been generated, or after a maximum output length is reached. Potential summaries are explored using beam search. However, for time steps where the pointer mechanism is used, the partial summaries are prioritized by probabilities as if the generator had been used instead, so $k$ partial summaries with different probabilities are created for the word chosen by the pointer mechanism. This is difficult to justify, but we hope that it gives a reasonable probability at time steps when the pointer mechanism is used, preventing summaries that use the pointer mechanism more from being prioritized.

Table 1: Highlights of a CNN article titled "Airline quality report sorts out the duds from the dynamos in 2012".

1. Hawaiian Airlines again lands at No. 1 in on-time performance
2. The Airline Quality Rankings Report looks at the 14 largest U.S. airlines
3. ExpressJet and American Airlines had the worst on-time performance
4. Virgin America had the best baggage handling; Southwest had lowest complaint rate

A slight deviation from what is presented in Section 3.4 is that when the pointer mechanism is used and the attended word is not in $V$, we do not output <UNK>, which is how the word is otherwise interpreted in the model, but rather the actual input word before it was converted to an index in the vocabulary. This may be viewed as a post-processing step.
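One way to realize this post-processing step is to keep the raw tokens alongside the vocabulary-mapped ones, so that a pointed-to position can be resolved back to the original word. The sketch below assumes this design; the helper names and the toy vocabulary are hypothetical.

```python
UNK = "<UNK>"

def to_model_tokens(raw_tokens, vocabulary):
    """What the model sees: out-of-vocabulary words are replaced by <UNK>."""
    return [t if t in vocabulary else UNK for t in raw_tokens]

def copy_word(raw_tokens, i_best):
    """What is emitted when pointing: the original word at the attended index."""
    return raw_tokens[i_best]

raw = ["jurgen", "klinsmann", "named", "coach"]
vocab = {"named", "coach"}                 # toy vocabulary V (assumption)
model_view = to_model_tokens(raw, vocab)
emitted = copy_word(raw, 1)
```

The model's own input at the next decoder step would still be the <UNK> embedding; only the emitted text changes.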

4 Dataset

The dataset constructed for this paper is based on that of Hermann et al. (2015) and consists of document-query-answer triples from CNN and Daily Mail news articles. Included with each published news article, there are a number of human-written highlights, which summarize different aspects of the article. Table 1 shows some example highlights for a single article. Hermann et al. construct a document-query-answer triple by considering a named entity in a highlight to be unknown, making the highlight into a Cloze-style question (Taylor, 1953), whose answer is the entity made unknown. An example document and a Cloze-style question and its answer can be seen in Table A.1. We propose using the CNN/Daily Mail dataset for query-based abstractive summarization by regarding each highlight as a summary of its document, and entities in the highlight as queries. For every occurrence of an entity in a highlight, we construct a document-query-summary triple for query-based summarization. Table A.1 shows, for a sample document, a Cloze-style question compared with the corresponding query-summary pair constructed by us. If an entity is mentioned in multiple highlights, we consider there to be multiple target references for the document-query pair. In contrast to Hermann et al. (2015), we do not translate entities into identifiers but use only minimal preprocessing in the form of tokenization and lowercasing. Further, we mix articles from CNN and Daily Mail, while Hermann et al. (2015) keep them separate. We decided to train our model on a mix of CNN and Daily Mail articles, with a proportion of them being reserved for validation and test sets. Which articles are included in the validation and test sets is determined randomly with equal probability for every article.

Table 2: Statistics of the dataset.

| | Training | Val. | Test |
|---|---|---|---|
| #doc | 300,805 | 4,652 | 4,652 |
| #doc-query pairs | 1,066,377 | 16,308 | 16,593 |
| #doc-query-sum | 1,294,730 | 19,827 | 20,046 |
| avg #words/doc | 773.02 | 778.78 | 775.70 |
| avg #words/query | 1.52 | 1.53 | 1.52 |
| avg #words/sum | 14.44 | 14.52 | 14.40 |

Some statistics of the resulting dataset can be seen in Table 2. The dataset can be reproduced using a script made available on GitHub (see footnote 1).
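The construction of query-summary pairs from highlights can be sketched as below. This is not the authors' script: the function name is hypothetical, entities are matched by plain substring search (the original dataset relies on entity annotations), and each highlight contributes at most one reference per entity.

```python
from collections import defaultdict

def build_references(highlights, entities):
    """Map each query (entity) to the highlights that mention it."""
    references = defaultdict(list)
    for highlight in highlights:
        for entity in entities:
            if entity in highlight:       # simplification: substring match
                references[entity].append(highlight)
    return dict(references)

# Lowercased highlights taken from Table 1; the entity list is an assumption.
highlights = [
    "hawaiian airlines again lands at no. 1 in on-time performance",
    "expressjet and american airlines had the worst on-time performance",
]
entities = ["hawaiian airlines", "american airlines", "expressjet"]
refs = build_references(highlights, entities)
```

An entity mentioned in several highlights ends up with several reference summaries, which corresponds to the multiple target references described in this section.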

5 Experiments

Two experiments were conducted. The first measures whether the model uses the information in the query (Section 5.1), and the second compares the model to an extractive baseline (Section 5.2). A beam width of $k = 5$ and a maximum output length of 32 were used.

5.1 Query Dependence

To determine whether incorporating a query benefits our model, we compare our proposed model to one where the query is corrupted. Instead of evaluating the generated summary for a document and a query with ID $n$ against the reference summaries for that query, we evaluate it against the reference summaries for query $n+1$, i.e. the query ID has been offset. For the query with the highest ID, the reference summaries for the first query are used. The idea is that if the score is lower than for the normal evaluation, then the model has made use of the additional information in the query. Table 3 shows, for an example document, 1, what the generated summaries are evaluated against during the query-dependence evaluation. It is worth mentioning that two reference summaries for different queries may be the same, as the same original highlight may be used as a reference summary for multiple queries. In these cases, the query will be appropriate for the summary and the model may have benefited from the query even in the query-offset evaluation.

1 https://github.com/helmertz/querysum-data

Table 3: Reference summaries used during normal evaluation compared to with offset queries.

| Query ID | Normal | Offset queries |
|---|---|---|
| 1.1 | A.1.1, B.1.1 | A.1.2 |
| 1.2 | A.1.2 | A.1.3, B.1.3 |
| 1.3 | A.1.3, B.1.3 | A.1.1, B.1.1 |
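The offset scheme can be expressed as a cyclic shift of the reference lists. The sketch below reproduces the mapping in Table 3 for one document; the function name is hypothetical, and applying the shift per document is an assumption based on that table.

```python
def offset_references(references_by_query):
    """Pair query n with the references of query n+1, wrapping around at the end."""
    n = len(references_by_query)
    return [references_by_query[(i + 1) % n] for i in range(n)]

# Reference summaries for queries 1.1, 1.2 and 1.3, as in Table 3.
normal = [["A.1.1", "B.1.1"], ["A.1.2"], ["A.1.3", "B.1.3"]]
offset = offset_references(normal)
```

The generated summary for each query is then scored against the shifted references instead of its own.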

5.2 Extractive Baseline

As a baseline, we compare the results to a simple extractive summary, designed specifically for the dataset used in this thesis work. The baseline summary is constructed by selecting the first sentence in the document containing the query, without restricting the length of the document. If no such sentence is found, i.e. the document does not contain the query, the first sentence of the document is used instead. This does occur in the dataset, but not frequently.
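A minimal sketch of this baseline is given below. It assumes the document has already been split into sentences and matches the query by substring search; both choices, and the function name, are assumptions rather than details given in the paper.

```python
def first_query_sentence(sentences, query):
    """Return the first sentence containing the query, else the first sentence."""
    for sentence in sentences:
        if query in sentence:
            return sentence
    return sentences[0]

sentences = [
    "bradley was relieved of his duties on thursday .",
    "the united states have named jurgen klinsmann as coach .",
]
hit = first_query_sentence(sentences, "united states")
fallback = first_query_sentence(sentences, "germany")
```

Since the queries are entities taken from highlights of the same article, the fallback branch is expected to be rare, as noted above.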

We additionally observe that the average length of baseline sentences using the CNN/Daily Mail dataset is commonly greater than for the reference summaries. The average number of words is 30.56 for the baseline summaries, while it is 14.44 for the reference summaries. It may be possible to gain a higher ROUGE score if fewer words around the query occurrence are selected, but the result might not form a complete sentence.

5.3 Evaluation Metric

Our results are evaluated using four different metrics provided by ROUGE (Recall-Oriented Understudy for Gisting Evaluation) (Lin, 2004), the de facto standard evaluation method for automatic summarization: ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-SU4. ROUGE-1 and ROUGE-2 are the scores for 1-grams and 2-grams respectively. ROUGE-L and ROUGE-SU4 are more complex metrics, detailed by Lin (2004).
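To give an intuition for ROUGE-1 and ROUGE-2, the sketch below computes a simplified n-gram recall against a single reference, using the output and reference from Table 6. It is not the official ROUGE toolkit: there is no stemming, no handling of multiple references, and no confidence intervals, and the paper does not state which ROUGE variant (recall, precision, or F-score) is reported.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n):
    """Fraction of reference n-grams that also occur in the candidate."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total

reference = "jurgen klinsmann is named as coach of the united states national side".split()
output = "klinsmann appointed as the new coach of united states".split()
r1 = rouge_n_recall(output, reference, 1)   # 7 of 12 reference unigrams matched
r2 = rouge_n_recall(output, reference, 2)   # 2 of 11 reference bigrams matched
```

The bigram score is much lower than the unigram score because only "coach of" and "united states" appear as consecutive pairs in both texts.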

5.4 Training Details

The vocabulary $V$ used for the input text contains the 150,000 most frequent words in the training set, while the generator vocabulary $V_{gen}$ consists of the 20,000 most frequent words. The generator can use a smaller vocabulary because the pointer mechanism handles infrequent words.

Word embeddings for the vocabulary words are initialized with 100-dimensional GloVe embeddings (see footnote 2), trained on "Wikipedia 2014 + Gigaword 5". If the word does not have a GloVe embedding, we initialize the word embedding by sampling the per-dimension univariate normal distributions with means and standard deviations of the entire collection of GloVe embeddings.
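The initialization of words without a GloVe embedding can be sketched as follows. The function name and toy vectors are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the paper does not say which estimator was used.

```python
import random
import statistics

def init_missing_embedding(known_embeddings, rng):
    """Sample each dimension from a normal distribution fitted to that dimension."""
    dims = list(zip(*known_embeddings))            # one tuple of values per dimension
    means = [statistics.mean(d) for d in dims]
    stds = [statistics.pstdev(d) for d in dims]    # population std (assumption)
    return [rng.gauss(m, s) for m, s in zip(means, stds)]

# Toy stand-in for the collection of pretrained vectors (2 words, 2 dimensions).
known = [[1.0, 0.0], [1.0, 2.0]]
vec = init_missing_embedding(known, random.Random(0))
```

Sampling per dimension keeps new vectors on the same scale as the pretrained ones without requiring a full covariance estimate.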

Both during training and test time, we limit the document length to the first 800 words, to reduce computation time.

The loss $L$ is minimized using the SGD-based Adam optimizer (Kingma and Ba, 2015). We used mini-batches of 30 samples, with an averaged loss over all the samples in the batch. The mini-batches remained the same over epochs, but the order in which they were trained on was randomized between every epoch.

Experiments have been run on a single Nvidia Tesla K80, with 12 GB of memory, and took about 54 hours to train. The model is implemented using TensorFlow (Abadi et al., 2015), and the complete source has been made available online (see footnote 3).

The hyperparameters used for the experiments are reported in Table 4. No extensive hyperparameter tuning has been performed; instead, we examined hyperparameters used for similar models, such as Nallapati et al. (2016) and See et al. (2017).

Table 4: Hyperparameter configuration used.

| Hyperparameter | Value |
|---|---|
| Word embedding size $d_{emb}$ | 100 |
| Document encoder size $d_{doc}$ | 512 |
| Query encoder size $d_{que}$ | 256 |
| Decoder size $d_{dec}$ | 512 |
| Attention hidden size $d_{att}$ | 256 |
| Generator hidden size $d_{gen}$ | 256 |

2 Downloadable as "glove.6B.zip" at: https://nlp.stanford.edu/projects/glove/

3 https://github.com/helmertz/querysum

Table 5: ROUGE scores of the evaluated models.

| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-SU4 |
|---|---|---|---|---|
| First query sentence | 33.81 | 18.19 | 29.22 | 17.49 |
| Our model | 18.25 | 5.04 | 16.17 | 6.13 |
| Offset queries | 16.06 | 3.89 | 14.25 | 5.18 |

Table 6: Example document-query pair.

Document: ( cnn ) – the united states have named former germany captain jurgen klinsmann as their new national coach , just a day after sacking bob bradley . bradley , who took over as coach in january 2007 , was relieved of his duties on thursday , and u.s. soccer federation president sunil gulati confirmed in a statement on friday that his replacement has already been appointed . [...]

Query: united states

Reference: jurgen klinsmann is named as coach of the united states national side

Output: klinsmann appointed as the new coach of united states

6 Results

The results from our experiments are summarized in Table 5. From the result of the query dependence evaluation ("offset queries"), described in Section 5.1, we can see that the ROUGE scores go down, with statistical significance according to the ROUGE-reported 95% confidence intervals, when the queries are offset. This indicates that the model benefits from the information provided by queries.

Further, we observe that our model scores lower than the baseline model, which we denote the first query sentence, described in Section 5.2. However, it should be noted that this baseline is expected to be strong given the nature of this dataset.

6.1 Further Analysis

We observe that the attention at a time step often appears to be highly focused on only a few words in the document. An example of an output summary can be seen in Table 6, and Figure A.1 shows the attention distribution over time for the same generated summary. Another observation we make is that the attention is often focused at the beginning of the documents. However, there are certainly instances when entities are selected from far back in documents. This bias may partly be due to our decision to point out the first occurrences of entities. It has, however, also been noted by Goldstein et al. (1999) that the beginning of news articles often summarizes the article quite well.

From examining some of the output summaries from our model, we see that they often strongly match the topic of the input documents, but they rarely succeed in generating summaries rephrasing something actually stated in the article. Table 7 shows an example output that is fairly grammatically correct, but not truthful with respect to the article.

Table 7

Document: president barack obama sided with open-internet activists on monday , urging the federal communications commission to draft new rules that would reclassify the broadband net to regulate it more like a public utility . the end result would tie the hands of internet service providers that want to cut special deals with services like netflix , youtube , hulu and amazon to push their streaming content along a ' fast lane ' that ordinary americans ca n't access . [...]

Query: netflix

Reference: obama 's vision would bar providers like verizon and comcast from cutting deals with hulu , netflix and amazon so their streaming content could be delivered along online ' fast lanes '

Output: obama 's chief executive of netflix has refused to allow users to access the service

Table 8

Document: february 13 , 2015 a breakthrough in belarus , a verdict in italy , and an expected veto in the u.s. all headline cnn student news this friday . [...]

Query: cnn student news roll call

Reference: at the bottom of the page , comment for a chance to be mentioned on cnn student news . you must be a teacher or a student age 13 or older to request a mention on the cnn student news roll call .

Output: at the bottom of the page , comment for a chance to be mentioned on cnn student news . you must be a teacher or a student age 13 or older to

We observe that the model manages to learn some of the dataset samples which are not actual summaries, described in Section 4, such as notices repeated over several articles. The generated summary shown in Table 8 is an example of this. Interestingly, the model manages to literally repeat the reference summary, up to the maximum output length limit. We can frequently see repetitions of the same phrases; an extreme example can be seen in Figure A.2. The model appears to get stuck trying to begin a summary. Additionally, the repetition is visible in the attention distribution as well. The same problem has been seen by Nallapati et al. (2016), who add temporal attention (Sankaran et al., 2016) to their model to alleviate repetitions. See et al. (2017) propose using coverage to solve the same issue.
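To make the coverage idea concrete, the following is a minimal sketch of the coverage loss as formulated by See et al. (2017): a coverage vector accumulates the attention of all previous decoder steps, and each step is penalized by the overlap between its attention and that accumulated coverage. This is not part of the model presented here; the function names and the plain-list representation are illustrative assumptions, and the sketch omits the part of their method that feeds the coverage vector back into the attention computation.

```python
def coverage_step(coverage, attention):
    """Return (loss for this step, updated coverage vector).

    coverage:  sum of attention distributions from previous decoder steps
    attention: attention distribution of the current decoder step
    """
    # Attention mass placed on positions that were already attended to.
    loss = sum(min(a, c) for a, c in zip(attention, coverage))
    new_coverage = [c + a for a, c in zip(attention, coverage)]
    return loss, new_coverage


def coverage_loss(attention_steps):
    """Sum the per-step coverage loss over a whole decoded sequence."""
    coverage = [0.0] * len(attention_steps[0])
    total = 0.0
    for attention in attention_steps:
        loss, coverage = coverage_step(coverage, attention)
        total += loss
    return total


# Attending to the same position twice is penalized ...
repeated = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
# ... while moving on to a new position is not.
moving = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(coverage_loss(repeated), coverage_loss(moving))
```

A decoder that keeps attending to the same document positions, as in Figure A.2, would accumulate a large loss under this term.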

Before running experiments, we suspected that it may be difficult for the pointer mechanism to sequentially point out words that make up longer entities. However, we see that this is done successfully quite often. For an example summary, the certainty of selecting a sequence of entity words can be seen in Figure A.3.
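As an illustration of how a pointer mechanism can copy the words of an entity one at a time, the sketch below mixes a generator distribution with the attention distribution, following the formulation of See et al. (2017). Whether the model in this paper combines the two distributions in exactly this way is an assumption; the function name, the dictionary representation, and the example numbers are hypothetical.

```python
from collections import defaultdict


def final_distribution(p_gen, vocab_dist, attention, source_words):
    """Mix generating from the vocabulary with copying from the source.

    p_gen:        probability of generating rather than pointing (0..1)
    vocab_dist:   dict mapping vocabulary word -> generation probability
    attention:    attention weights, one per source position
    source_words: source document words, aligned with `attention`
    """
    mixed = defaultdict(float)
    for word, prob in vocab_dist.items():
        mixed[word] += p_gen * prob
    # A source word that occurs several times collects all of its attention.
    for word, weight in zip(source_words, attention):
        mixed[word] += (1.0 - p_gen) * weight
    return dict(mixed)


dist = final_distribution(
    p_gen=0.5,
    vocab_dist={"the": 0.5, "coach": 0.5},
    attention=[0.75, 0.25],
    source_words=["klinsmann", "coach"],
)
print(dist)
```

In this formulation, an out-of-vocabulary name such as "klinsmann" can only receive probability through the pointing term, so pointing out a multi-word entity requires the attention to move to the right source position at each consecutive step.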

Compared to the reference summaries, the output is generally shorter. The average number of words in output summaries is 11.27, while the dataset average is 14.44. As noted by Wu et al. (2016), beam search commonly favors shorter summaries. They propose adding length normalization to reduce this tendency. Implementing such a measure may improve the results of our model as well.
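The length normalization of Wu et al. (2016) divides the log-probability of a hypothesis by a length penalty lp(Y) = (5 + |Y|)^α / (5 + 1)^α. The sketch below shows its effect on beam-search ranking; the default α = 0.6 and the example scores are assumptions chosen for illustration, and the coverage penalty that Wu et al. add to the same score is left out.

```python
def length_penalty(length, alpha=0.6):
    """Length penalty lp(Y) = ((5 + |Y|) / (5 + 1)) ** alpha."""
    return ((5.0 + length) / 6.0) ** alpha


def normalized_score(log_prob, length, alpha=0.6):
    """Beam-search score: log-probability divided by the length penalty."""
    return log_prob / length_penalty(length, alpha)


# Two hypothetical beam candidates as (total log-probability, length in words).
short = (-6.0, 5)
long_ = (-7.0, 12)

# Without normalization the shorter hypothesis has the higher score ...
print(short[0] > long_[0])
# ... with normalization the longer hypothesis is ranked first.
print(normalized_score(*long_) > normalized_score(*short))
```

Setting α = 0 disables the normalization, so the parameter can be tuned on a validation set to move the average output length closer to that of the references.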

In comparison to Nallapati et al. (2016) and See et al. (2017), our ROUGE scores are low. They use a different version of the dataset where all highlights are combined to form a single, often multi-sentence, summary. With similar models, they get ROUGE-1 results of around 35 on the general summarization task. However, while they always train the model to output the same summary for the same document, we often have completely different target summaries for different queries, where the queries make up a much smaller part of the input.
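For readers unfamiliar with the metric, ROUGE-1 measures unigram overlap between an output and a reference summary. The following is a simplified sketch, assuming whitespace-tokenized input and no stemming or stopword handling; it is not the ROUGE package used for the reported scores, and results from it are on a 0 to 1 scale rather than the 0 to 100 scale commonly reported.

```python
from collections import Counter


def rouge_1(candidate, reference):
    """Return (precision, recall, F1) of unigram overlap."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    # Multiset intersection: each word counts at most as often as in both.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if overlap == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


print(rouge_1("the cat sat", "the cat sat on the mat"))
```

Because recall is computed against the reference length, outputs that are shorter than the references tend to be limited in recall even when every generated word is correct.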

7 Conclusion

We have designed a model for query-based abstractive summarization and evaluated it on an adapted QA dataset, redesigned for query-based summarization. While the overall performance of the model is not enough to outperform our extractive baseline, we have shown that it can incorporate a query and utilize the information to create more focused summaries.

References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. International Conference on Learning Representations (ICLR 2015) arXiv:1409.0473.

Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5(2):157–166.

Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. ArXiv e-prints arXiv:1412.3555.

Jade Goldstein, Mark Kantrowitz, Vibhu Mittal, and Jaime Carbonell. 1999. Summarizing text documents: Sentence selection and evaluation metrics. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, USA, SIGIR ’99, pages 121–128.

Çaglar Gülçehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the unknown words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, pages 140–149.

Karl Moritz Hermann, Tomás Kociský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems. pages 1693–1701.

Sepp Hochreiter. 1991. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München.

Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR 2015) arXiv:1412.6980.

Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. 2016. Ask me anything: Dynamic memory networks for natural language processing. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning. PMLR, New York, New York, USA, volume 48 of Proceedings of Machine Learning Research, pages 1378–1387.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Stan Szpakowicz and Marie-Francine Moens, editors, Text Summarization Branches Out: Proceedings of the ACL-04 Workshop. Association for Computational Linguistics, Barcelona, Spain, pages 74–81.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems. Curran Associates Inc., USA, NIPS’13, pages 3111–3119.

Olof Mogren, Mikael Kågebäck, and Devdatt P Dubhashi. 2015. Extractive summarization by aggregating multiple similarities. In RANLP. pages 451–457.

Ramesh Nallapati, Bowen Zhou, Cícero Nogueira dos Santos, Çaglar Gülçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, CoNLL 2016, Berlin, Germany, August 11-12, 2016. pages 280–290.

Preksha Nema, Mitesh Khapra, Anirban Laha, and Balaraman Ravindran. 2017. Diversity driven Attention Model for Query-based Abstractive Summarization. ArXiv e-prints arXiv:1704.08300.

Ani Nenkova and Kathleen McKeown. 2012. A Survey of Text Summarization Techniques, Springer US, Boston, MA, pages 43–76.

Jahna Otterbacher, Gunes Erkan, and Dragomir R. Radev. 2009. Biased LexRank: Passage retrieval using random walks with question-based priors. Information Processing and Management 45(1):42–54.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP). pages 1532–1543.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Lisbon, Portugal, pages 379–389.

Baskaran Sankaran, Haitao Mi, Yaser Al-Onaizan, and Abe Ittycheriah. 2016. Temporal attention model for neural machine translation. ArXiv e-prints arXiv:1608.02927.

Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45(11):2673–2681.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get To The Point: Summarization with Pointer-Generator Networks. ArXiv e-prints arXiv:1704.04368.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems. MIT Press, Cambridge, MA, USA, NIPS’14, pages 3104–3112.

Ming Tan, Bing Xiang, and Bowen Zhou. 2015. LSTM-based deep learning models for non-factoid answer selection. ArXiv e-prints arXiv:1511.04108.

Wilson L Taylor. 1953. ‘Cloze procedure’: a new tool for measuring readability. Journalism Bulletin 30(4):415–433.

Lu Wang, Hema Raghavan, Vittorio Castelli, Radu Florian, and Claire Cardie. 2013. A sentence compression based framework to query-focused multi-document summarization. In ACL 2013.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. ArXiv e-prints arXiv:1609.08144.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning. pages 2048–2057.

A Supplemental Material

A.1 Dataset

An example of a record in the dataset is shown in Table A.1.

We organize the dataset triples hierarchically, first by document, then query, then reference. The documents and queries are numbered starting with 1, while the references are lettered alphabetically starting with A. Document 1 may have queries 1.1 and 1.2, and reference summaries A.1.1, B.1.1 and A.1.2⁴. The order is shuffled amongst document, query and reference IDs.

⁴ Selected for matching the format expected by pyrouge
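The naming scheme can be sketched as follows. The helper names are hypothetical, and the sketch only covers building the ID strings described above (reference letter, document number, query number); it does not reproduce any file layout required by pyrouge, and it assumes at most 26 references per query.

```python
from string import ascii_uppercase


def query_id(doc_no, query_no):
    """ID of a query, e.g. query 2 of document 1 -> '1.2'."""
    return f"{doc_no}.{query_no}"


def reference_ids(doc_no, query_no, n_refs):
    """IDs of the reference summaries for one document-query pair.

    References are lettered A, B, C, ... and prefixed to the query ID.
    Assumes n_refs <= 26 (hypothetical limit of this sketch).
    """
    qid = query_id(doc_no, query_no)
    return [f"{ascii_uppercase[k]}.{qid}" for k in range(n_refs)]


# Document 1 with two references for query 1.1 and one for query 1.2.
print(reference_ids(1, 1, 2) + reference_ids(1, 2, 1))
```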

A.2 Attention Visualisations

[Attention heatmap image omitted. Horizontal axis (document words): "( cnn ) -- the united states have named former germany captain jurgen klinsmann as their new national coach , just a day after sacking bob bradley . bradley , who took over as coach in january 2007 , was relieved of hi[...]". Vertical axis (output words): "<GO> klinsmann appointed as the new coach of united states".]

Figure A.1: Visualization of the attention distribution as the summary in Table 6 is generated. The words of the document are shown on the horizontal axis, from left to right. Only a limited number of document words are shown. The vertical axis shows the output words, from top to bottom, after the <GO> token. The darker a cell is, the higher the attention on that position.

[Attention heatmap image omitted. Horizontal axis (document words): "as any child of the eighties will tell you , neon is a staple of any after-dark celebration - but two foodies from melbourne have taken their love of it one step further to create a sweet treat that lights up the night . steve felice and glenn storey ,". Vertical axis (output words): "<GO> steve felice , steve felice , glenn storey , steve felice , glenn storey , steve felice ,".]

Figure A.2: Visualization of the attention distribution, $\alpha_{ti}$, as an output summary for a document-query pair is generated. The query is "australia". The format is the same as in Figure A.1.

[Attention heatmap image omitted. Horizontal axis (document words, two segments separated by an ellipsis): "[...] and if you heard him speak , his cockney accent would be as broad as the river thames . this is derek hockley , the wheeler-dealer who david jason has revealed he took as his inspiration when playing del boy in only fools and horses . as well as" ••• "by paul bentley published : 16:09 est , 18 october 2013 | updated : 16:51 est , 18 october 2013 he 's clearly a sharp dresser with his own distinct style . and if you heard him speak , his cockney accent would be as bro[...]". Vertical axis (output words): "<GO> he was inspired by a real east end of only fools and horses".]

Figure A.3: Visualization of the attention distribution, $\alpha_{ti}$, as an output summary for a document-query pair in the test set is generated. The query is "only fools and horses". The format is the same as in Figure A.1. The ellipsis signifies that parts of the attention distribution have been skipped.

Table A.1: Example of dataset samples generated from a document-query pair using our method compared to Hermann et al. (2015). In the Cloze-style questions, the entity corresponding to the answer has been replaced by X.

Document
( cnn ) former vice president walter mondale was released from the mayo clinic on saturday after being admitted with influenza , hospital spokeswoman kelley luckstein said . “ he ’s doing well . we treated him for flu and cold symptoms and he was released today , ” she said . mondale , 87 , was diagnosed after he went to the hospital for a routine checkup following a fever , former president jimmy carter said friday . “ he is in the bed right this moment , but looking forward to come back home , ” carter said during a speech at a nobel peace prize forum in minneapolis . “ he said tell everybody he is doing well . ” mondale underwent treatment at the mayo clinic in rochester , minnesota . the 42nd vice president served under carter between 1977 and 1981 , and later ran for president , but lost to ronald reagan . but not before he made history by naming a woman , u.s. rep. geraldine a. ferraro of new york , as his running mate . before that , the former lawyer was a u.s. senator from minnesota . his wife , joan mondale , died last year .

Highlight
walter mondale was released from the mayo clinic on saturday , hospital spokeswoman said

Cloze-style question
walter mondale was released from the X on saturday , hospital spokeswoman said

Cloze-style answer
mayo clinic

Our query
mayo clinic

Our target summary
walter mondale was released from the mayo clinic on saturday , hospital spokeswoman said
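The relationship between the two dataset formats in Table A.1 can be sketched as a small conversion function: the entity that a Cloze-style question blanks out becomes the query, and the unmodified highlight becomes the target summary. The function name and dictionary keys are hypothetical, and the sketch assumes the entity is already known; how entities are identified in the highlights is not covered here.

```python
def make_sample(document, highlight, entity):
    """Build both sample variants from one highlight and one of its entities.

    Returns a dict with the Cloze-style question/answer and the
    query/target-summary pair used for query-based summarization.
    """
    if entity not in highlight:
        raise ValueError("entity must occur in the highlight")
    return {
        "document": document,
        # Cloze style: the entity is replaced by the placeholder X.
        "cloze_question": highlight.replace(entity, "X", 1),
        "cloze_answer": entity,
        # Query-based summarization: entity as query, highlight as target.
        "query": entity,
        "target_summary": highlight,
    }


sample = make_sample(
    document="( cnn ) former vice president walter mondale was released [...]",
    highlight="walter mondale was released from the mayo clinic on saturday , "
              "hospital spokeswoman said",
    entity="mayo clinic",
)
print(sample["query"], "->", sample["target_summary"])
```

A highlight containing several entities would yield one sample per entity under this scheme, all sharing the same target summary but differing in the query.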
