
Proceedings of NAACL-HLT 2018, pages 1951–1963, New Orleans, Louisiana, June 1–6, 2018. © 2018 Association for Computational Linguistics

Unified Pragmatic Models for Generating and Following Instructions

Daniel Fried  Jacob Andreas  Dan Klein
Computer Science Division

University of California, Berkeley
{dfried,jda,klein}@cs.berkeley.edu

Abstract

We show that explicit pragmatic inference aids in correctly generating and following natural language instructions for complex, sequential tasks. Our pragmatics-enabled models reason about why speakers produce certain instructions, and about how listeners will react upon hearing them. Like previous pragmatic models, we use learned base listener and speaker models to build a pragmatic speaker that uses the base listener to simulate the interpretation of candidate descriptions, and a pragmatic listener that reasons counterfactually about alternative descriptions. We extend these models to tasks with sequential structure. Evaluation of language generation and interpretation shows that pragmatic inference improves state-of-the-art listener models (at correctly interpreting human instructions) and speaker models (at producing instructions correctly interpreted by humans) in diverse settings.

1 Introduction

How should speakers and listeners reason about each other when they communicate? A core insight of computational pragmatics is that speaker and listener agents operate within a cooperative game-theoretic context, and that each agent benefits from reasoning about others' intents and actions within that context. Pragmatic inference has been studied by a long line of work in linguistics, natural language processing, and cognitive science. In this paper, we present a technique for layering explicit pragmatic inference on top of models for complex, sequential instruction-following and instruction-generation tasks. We investigate a range of current data sets for both tasks, showing that pragmatic behavior arises naturally from this inference procedure, and gives rise to state-of-the-art results in a variety of domains.

Figure 1: Real samples for the SAIL navigation environments, comparing base models, without explicit pragmatic inference, to the rational pragmatic inference procedure. (a) Behavior; Base Speaker: "walk forward four times"; Rational Speaker: "go forward four segments to the intersection with the bare concrete hall". The rational speaker, which reasons about listener behavior, generates instructions which in this case are more robust to uncertainty about the listener's initial orientation. (b) Instruction: "walk along the blue carpet and you pass two objects". The base listener moves to an unintended position (even though it correctly passes two objects). The rational listener, which reasons about the speaker, infers that a route ending at the sofa would have been described differently, and stops earlier.

Consider the example shown in Figure 1a, in which a speaker agent must describe a route to a target position in a hallway. A conventional learned instruction-generating model produces a truthful description of the route (walk forward four times). But the pragmatic speaker in this paper, which is capable of reasoning about the listener, chooses to also include additional information (the intersection with the bare concrete hall), to reduce potential ambiguity and increase the odds that the listener reaches the correct destination.

This same reasoning procedure also allows a listener agent to overcome ambiguity in instructions by reasoning counterfactually about the speaker (Figure 1b). Given the command walk along the blue carpet and you pass two objects, a conventional learned instruction-following model is willing to consider all paths that pass two objects, and ultimately arrives at an unintended final position. But a pragmatic listener that reasons about the speaker can infer that the long path would have been more easily described as go to the sofa, and thus that the shorter path is probably intended. In these two examples, which are produced by the system we describe in this paper, a unified reasoning process (choose the output sequence which is most preferred by an embedded model of the other agent) produces pragmatic behavior for both speakers and listeners.

The application of models with explicit pragmatic reasoning abilities has so far been largely restricted to simple reference games, in which the listener's only task is to select the right item from among a small set of candidate referents given a single short utterance from the speaker. But as the example shows, there are real-world instruction following and generation tasks with rich action spaces that might also benefit from pragmatic modeling. Moreover, approaches that learn to map directly between human-annotated instructions and action sequences are ultimately limited by the effectiveness of the humans themselves. The promise of pragmatic modeling is that we can use these same annotations to build a model with a different (and perhaps even better) mechanism for interpreting and generating instructions.

The primary contribution of this work is to show how existing models of pragmatic reasoning can be extended to support instruction following and generation for challenging, multi-step, interactive tasks. Our experimental evaluation focuses on four instruction-following domains which have been studied using both semantic parsers and attentional neural models. We investigate the interrelated tasks of instruction following and instruction generation, and show that incorporating an explicit model of pragmatics helps in both cases. Reasoning about the human listener allows a speaker model to produce instructions that are easier for humans to interpret correctly in all domains (with absolute gains in accuracy ranging from 12% to 46%). Similarly, reasoning about the human speaker improves the accuracy of the listener models in interpreting instructions in most domains (with gains in accuracy of up to 10%). In all cases, the resulting systems are competitive with, and in many cases exceed, results from past state-of-the-art systems for these tasks.¹

2 Problem Formulation

Consider the instruction following and instruction generation tasks shown in Figure 1, where an agent must produce or interpret instructions about a structured world context (e.g. walk along the blue carpet and you pass two objects).

In the instruction following task, a listener agent begins in a world state (in Figure 1, an initial map location and orientation). The agent is then tasked with following a sequence of direction sentences d_1 . . . d_K produced by humans. At each time t the agent receives a percept y_t, which is a feature-based representation of the current world state, and chooses an action a_t (e.g. move forward, or turn). The agent succeeds if it is able to reach the correct final state described by the directions.

In the instruction generation task, the agent receives a sequence of actions a_1, . . . , a_T along with the world state y_1, . . . , y_T at each action, and must generate a sequence of direction sentences d_1, . . . , d_K describing the actions. The agent succeeds if a human listener is able to correctly follow those directions to the intended final state.

We evaluate models for both tasks in four domains. The first domain is the SAIL corpus of virtual environments and navigational directions (MacMahon et al., 2006; Chen and Mooney, 2011), where an agent navigates through a two-dimensional grid of hallways with patterned walls and floors and a discrete set of objects (Figure 1 shows a portion of one of these hallways).

In the three SCONE domains (Long et al., 2016), the world contains a number of objects with various properties, such as colored beakers which an agent can combine, drain, and mix. Instructions describe how these objects should be manipulated. These domains were designed to elicit instructions with a variety of context-dependent language phenomena, including ellipsis and coreference (Long et al., 2016), which we might expect a model of pragmatics to help resolve (Potts, 2011).

3 Related Work

The approach in this paper builds upon long lines of work in pragmatic modeling, instruction following, and instruction generation.

¹ Source code is available at http://github.com/dpfried/pragmatic-instructions


Pragmatics Our approach to pragmatics (Grice, 1975) belongs to a general category of rational speech acts models (Frank and Goodman, 2012), in which the interaction between speakers and listeners is modeled as a probabilistic process with Bayesian actors (Goodman and Stuhlmuller, 2013). Alternative formulations (e.g. with best-response rather than probabilistic dynamics) are also possible (Golland et al., 2010). Inference in these models is challenging even when the space of listener actions is extremely simple (Smith et al., 2013), and one of our goals in the present work is to show how this inference problem can be solved even in much richer action spaces than previously considered in computational pragmatics. This family of pragmatic models captures a number of important linguistic phenomena, especially those involving conversational implicature (Monroe and Potts, 2015); we note that many other topics studied under the broad heading of "pragmatics," including presupposition and indexicality, require different machinery.

Williams et al. (2015) use pragmatic reasoning with weighted inference rules to resolve ambiguity and generate clarification requests in a human-robot dialog task. Other recent work on pragmatic models focuses on the referring expression generation or "contrastive captioning" task introduced by Kazemzadeh et al. (2014). In this family are approaches that model the listener at training time (Mao et al., 2016), at evaluation time (Andreas and Klein, 2016; Monroe et al., 2017; Vedantam et al., 2017; Su et al., 2017) or both (Yu et al., 2017b; Luo and Shakhnarovich, 2017).

Other conditional sequence rescoring models that are structurally similar but motivated by concerns other than pragmatics include Li et al. (2016) and Yu et al. (2017a). Lewis et al. (2017) perform a similar inference procedure for a competitive negotiation task. The language learning model of Wang et al. (2016) also features a structured output space and uses pragmatics to improve online predictions for a semantic parsing model. Our approach in this paper performs both generation and interpretation, and investigates both structured and unstructured output representations.

Instruction following Work on instruction following tasks includes models that parse commands into structured representations processed by a rich execution model (Tellex et al., 2011; Chen, 2012; Artzi and Zettlemoyer, 2013; Guu et al., 2017), and models that map directly from instructions to a policy over primitive actions (Branavan et al., 2009), possibly mediated by an intermediate alignment or attention variable (Andreas and Klein, 2015; Mei et al., 2016). We use a model similar to Mei et al. (2016) as our base listener in this paper, evaluating on the SAIL navigation task (MacMahon et al., 2006) as they did, as well as the SCONE context-dependent execution domains (Long et al., 2016).

Instruction generation Previous work has also investigated the instruction generation task, in particular for navigational directions. The GIVE shared tasks (Byron et al., 2009; Koller et al., 2010; Striegnitz et al., 2011) have produced a large number of interactive direction-giving systems, both rule-based and learned. The work most immediately related to the generation task in this paper is that of Daniele et al. (2017), which also focuses on the SAIL dataset but requires substantial additional structured annotation for training, while both our base and pragmatic speaker models learn directly from strings and action sequences.

Older work has studied the properties of effective human strategies for generating navigational directions (Anderson et al., 1991). Instructions of this kind can be used to extract templates for generation (Look, 2008; Dale et al., 2005), while here we focus on the more challenging problem of learning to generate new instructions from scratch. Like our pragmatic speaker model, Goeddel and Olson (2012) also reason about listener behavior when generating navigational instructions, but rely on rule-based models for interpretation.

4 Pragmatic inference procedure

As a foundation for pragmatic inference, we assume that we have base listener and speaker models to map directions to actions and vice-versa. (Our notation for referring to models is adapted from Bergen et al. (2016).) The base listener, L0, produces a probability distribution over sequences of actions, conditioned on a representation of the directions and environment as seen before each action: P_{L0}(a_{1:T} | d_{1:K}, y_{1:T}). Similarly, the base speaker, S0, defines a distribution over possible descriptions conditioned on a representation of the actions and environment: P_{S0}(d_{1:K} | a_{1:T}, y_{1:T}).

Figure 2: (a) Rational pragmatic models embed base listeners and speakers. Potential candidate sequences are drawn from one base model, and then the other scores each candidate to simulate whether it produces the desired pragmatic behavior. (b) The base listener and speaker are neural sequence-to-sequence models which are largely symmetric to each other. Each produces a representation of its input sequence (a description, for the listener; actions with associated environmental percepts, for the speaker) using an LSTM encoder. The output sequence is generated by an LSTM decoder attending to the input.

Our pragmatic inference procedure requires these base models to produce candidate outputs from a given input (actions from descriptions, for the listener; descriptions from actions, for the speaker), and to calculate the probability of a fixed output given an input, but is otherwise agnostic to the form of the models.
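To make this requirement concrete, the sketch below (ours, not the authors' released code) shows the two operations each base model must support: propose candidates via beam search, and score a fixed output for a given input. Class and method names are illustrative, and percepts are omitted since they are fully determined by the actions.

```python
# Hypothetical interface assumed by the pragmatic inference procedure.
from typing import List, Sequence, Tuple


class BaseListener:
    def candidates(self, directions: Sequence[str], n: int) -> List[Tuple[Sequence[str], float]]:
        """Return n (action_sequence, log P_L0(actions | directions)) pairs from beam search."""
        raise NotImplementedError

    def score(self, actions: Sequence[str], directions: Sequence[str]) -> float:
        """Return log P_L0(actions | directions) for a fixed action sequence."""
        raise NotImplementedError


class BaseSpeaker:
    def candidates(self, actions: Sequence[str], n: int) -> List[Tuple[Sequence[str], float]]:
        """Return n (directions, log P_S0(directions | actions)) pairs from beam search."""
        raise NotImplementedError

    def score(self, directions: Sequence[str], actions: Sequence[str]) -> float:
        """Return log P_S0(directions | actions) for a fixed description."""
        raise NotImplementedError
```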

We use standard sequence-to-sequence models with attention for both the base listener and speaker (described in Section 5). Our models use segmented action sequences, with one segment (sub-sequence of actions) aligned with each description sentence d_j, for all j ∈ {1 . . . K}. This segmentation is either given as part of the training and testing data (in the instruction following task for the SAIL domain, and in both tasks for the SCONE domain, where each sentence corresponds to a single action), or is predicted by a separate segmentation model (in the generation task for the SAIL domain); see Section 5.

4.1 Models

Using these base models as self-contained modules, we derive a rational speaker and rational listener that perform inference using embedded instances of these base models (Figure 2a). When describing an action sequence, a rational speaker S1 chooses a description that has a high chance of causing the listener modeled by L0 to follow the given actions:

$$S_1(a_{1:T}) = \operatorname*{argmax}_{d_{1:K}} \; P_{L_0}(a_{1:T} \mid d_{1:K}, y_{1:T}) \qquad (1)$$

(noting that, in all settings we explore here, the percepts y_{1:T} are completely determined by the actions a_{1:T}). Conversely, a rational listener L1 follows a description by choosing an action sequence which has high probability of having caused the speaker, modeled by S0, to produce the description:

$$L_1(d_{1:K}) = \operatorname*{argmax}_{a_{1:T}} \; P_{S_0}(d_{1:K} \mid a_{1:T}, y_{1:T}) \qquad (2)$$

These optimization problems are intractable to solve for general base listener and speaker agents, including the sequence-to-sequence models we use, as they involve choosing an input (from a combinatorially large space of possible sequences) to maximize the probability of a fixed output sequence. We instead follow a simple approximate inference procedure, detailed in Section 4.2.

We also consider incorporating the scores of the base model used to produce the candidates. For the case of the speaker, we define a combined rational speaker, denoted S0 · S1, that selects the candidate that maximizes a weighted product of probabilities under both the base listener and the base speaker:

$$\operatorname*{argmax}_{d_{1:K}} \; P_{L_0}(a_{1:T} \mid d_{1:K}, y_{1:T})^{\lambda} \times P_{S_0}(d_{1:K} \mid a_{1:T}, y_{1:T})^{1-\lambda} \qquad (3)$$

for a fixed interpolation hyperparameter λ ∈ [0, 1]. There are several motivations for this combination with the base speaker score. First, as argued by Monroe et al. (2017), we would expect varying degrees of base and reasoned interpretation in human speech acts. Second, we want the descriptions produced by the model to be fluent descriptions of the actions. Since the base models are trained discriminatively, maximizing the probability of an output sequence for a fixed input sequence, their scoring behaviors for fixed outputs paired with inputs dissimilar to those seen in the training set may be poorly calibrated (for example when conditioning on ungrammatical descriptions). Incorporating the scores of the base model used to produce the candidates aims to prevent this behavior.

To define rational listeners, we use the symmetric formulation: first, draw candidate action sequences from L0. For L1, choose the actions that achieve the highest probability under S0; and for the combination model L0 · L1, choose the actions with the highest weighted combination of S0 and L0 (paralleling equation 3).

4.2 Inference

As in past work (Smith et al., 2013; Andreas and Klein, 2016; Monroe et al., 2017), we approximate the optimization problems in equations 1, 2, and 3: use the base models to generate candidates, and rescore them to find ones that are likely to produce the desired behavior.

In the case of the rational speaker S1, we use the base speaker S0 to produce a set of n candidate descriptions w^(1)_{1:K1}, . . . , w^(n)_{1:Kn} for the sequences a_{1:T}, y_{1:T}, using beam search. We then find the score of each description under P_{L0} (using it as the input sequence for the observed output actions we want the rational speaker to describe), or a weighted combination of P_{L0} and the original candidate score P_{S0}, and choose the description w^(j)_{1:Kj} with the largest score, approximately solving the maximizations in equations 1 or 3, respectively. We perform a symmetric procedure for the rational listener: produce action sequence candidates from the base listener, and rescore them using the base speaker.²
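A minimal sketch of this candidate-and-rescore procedure, assuming the base-model interface sketched earlier; `lam` plays the role of λ in equation 3 (with lam = 1 recovering the purely rational S1 and L1), and the function names are illustrative rather than taken from the released code.

```python
import math


def rational_speaker(actions, base_speaker, base_listener, n=20, lam=0.5):
    # Candidates from S0, rescored with a weighted combination of L0 and S0 (equation 3).
    candidates = base_speaker.candidates(actions, n)   # [(directions, log P_S0)]
    best, best_score = None, -math.inf
    for directions, log_p_s0 in candidates:
        log_p_l0 = base_listener.score(actions, directions)  # log P_L0(actions | directions)
        score = lam * log_p_l0 + (1.0 - lam) * log_p_s0      # log of the weighted product
        if score > best_score:
            best, best_score = directions, score
    return best


def rational_listener(directions, base_listener, base_speaker, n=40, lam=0.5):
    # Symmetric procedure: candidates from L0, rescored by S0 (and optionally L0).
    candidates = base_listener.candidates(directions, n)     # [(actions, log P_L0)]
    return max(
        candidates,
        key=lambda c: lam * base_speaker.score(directions, c[0]) + (1.0 - lam) * c[1],
    )[0]
```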

As the rational speaker must produce long output sequences (with multiple sentences), we interleave the speaker and listener in inference, determining each output sentence sequentially. From a list of candidate direction sentences from the base speaker for the current subsequence of actions, we choose the top-scoring direction under the listener model (which may also condition on the directions which have been output previously), and then move on to the next subsequence of actions.³

² We use ensembles of models for the base listener and speaker (subsection 5.3), and to obtain candidates that are high-scoring under the combination of models in the ensemble, we perform standard beam search using all models in lock-step. At every timestep of the beam search, each possible extension of an output sequence is scored using the product of the extension's conditional probabilities across all models in the ensemble.

³ We also experimented with sampling from the base models to produce these candidate lists, as was done in previous work (Andreas and Klein, 2016; Monroe et al., 2017). In early experiments, however, we found better performance with beam search in the rational models for all tasks.
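The interleaved decoding described above might look roughly as follows; `candidate_sentences` and `score_segment` are hypothetical helpers standing in for per-sentence beam search and per-segment listener scoring, not functions from the paper's code.

```python
def rational_speaker_interleaved(action_segments, base_speaker, base_listener, n=20, lam=0.5):
    directions = []
    for segment in action_segments:
        # Candidate sentences for this action subsequence, conditioned on what
        # has already been said (hypothetical helper).
        candidates = base_speaker.candidate_sentences(segment, prefix=directions, n=n)
        best_sentence = max(
            candidates,  # each candidate is a (sentence, log P_S0) pair
            key=lambda cand: lam * base_listener.score_segment(segment, directions + [cand[0]])
            + (1.0 - lam) * cand[1],
        )[0]
        directions.append(best_sentence)
    return directions
```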

5 Base model details

Given this framework, all that remains is to describe the base models L0 and S0. We implement these as sequence-to-sequence models that map directions to actions (for the listener) or actions to directions (for the speaker), additionally conditioning on the world state at each timestep.

5.1 Base listener

Our base listener model, L0, predicts action sequences conditioned on an encoded representation of the directions and the current world state. In the SAIL domain, this is the model of Mei et al. (2016) (illustrated in green in Figure 2b for a single sentence and its associated actions); see "domain specifics" below.

Encoder Each direction sentence is encoded separately with a bidirectional LSTM (Hochreiter and Schmidhuber, 1997); the LSTM's hidden states are reset for each sentence. We obtain a representation h^e_k for the kth word in the current sentence by concatenating an embedding for the word with its forward and backward LSTM outputs.

Decoder We generate actions incrementally using an LSTM decoder with monotonic alignment between the direction sentences and subsequences of actions; at each timestep the decoder predicts the next action for the current sentence w_{1:M} (including choosing to shift to the next sentence). The decoder takes as input at timestep t the current world state y_t and a representation z_t of the current sentence, updates the decoder state h^d, and outputs a distribution over possible actions:

$$h^d_t = \mathrm{LSTM}^d\!\left(h^d_{t-1}, [W_y y_t, z_t]\right)$$
$$q_t = W_o\left(W_y y_t + W_h h^d_t + W_z z_t\right)$$
$$p(a_t \mid a_{1:t-1}, y_{1:t}, w_{1:M}) \propto \exp(q_t)$$

where all weight matrices W are learned parameters. The sentence representation z_t is produced using an attention mechanism (Bahdanau et al., 2015) over the representation vectors h^e_1 . . . h^e_M for words in the current sentence:

$$\alpha_{t,k} \propto \exp\!\left(v \cdot \tanh(W_d h^d_{t-1} + W_e h^e_k)\right)$$
$$z_t = \sum_{k=1}^{M} \alpha_{t,k} \, h^e_k$$

where the attention weights α_{t,k} are normalized to sum to one across positions k in the input, and weight matrices W and vector v are learned.
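A small NumPy sketch of these two computations, the attention-based sentence summary and the action distribution; shapes and variable names mirror the equations, and this is an illustration rather than the authors' implementation.

```python
import numpy as np


def attend(h_dec_prev, h_enc, W_d, W_e, v):
    """alpha_{t,k} ∝ exp(v · tanh(W_d h^d_{t-1} + W_e h^e_k)); z_t = sum_k alpha_{t,k} h^e_k."""
    scores = np.array([v @ np.tanh(W_d @ h_dec_prev + W_e @ h_k) for h_k in h_enc])
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()              # normalize across input positions k
    return alpha @ h_enc              # z_t: weighted sum of encoder vectors


def action_distribution(y_t, h_dec_t, z_t, W_y, W_h, W_z, W_o):
    """p(a_t | ...) ∝ exp(W_o (W_y y_t + W_h h^d_t + W_z z_t))."""
    q = W_o @ (W_y @ y_t + W_h @ h_dec_t + W_z @ z_t)
    p = np.exp(q - q.max())
    return p / p.sum()
```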

Domain specifics For SAIL, we use the alignments between sentences and route segments annotated by Chen and Mooney (2011), which were also used in previous work (Artzi and Zettlemoyer, 2013; Artzi et al., 2014; Mei et al., 2016). Following Mei et al. (2016), we reset the decoder's hidden state for each sentence.

In the SCONE domains, which have a larger space of possible outputs than SAIL, we extend the decoder by: (i) decomposing each action into an action type and arguments for it, (ii) using separate attention mechanisms for types and arguments, and (iii) using state-dependent action embeddings. See Appendix A in the supplemental material for details. The SCONE domains are constructed so that each sentence corresponds to a single (non-decomposed) action; this provides our segmentation of the action sequence.

5.2 Base speaker

While previous work (Daniele et al., 2017) has relied on more structured approaches, we construct our base speaker model S0 using largely the same sequence-to-sequence machinery as above. S0 (illustrated in orange in Figure 2b) encodes a sequence of actions and world states, and then uses a decoder to output a description.

Encoder We encode the sequence of vector embeddings for the actions a_t and world states y_t using a bidirectional LSTM. Similar to the base listener's encoder, we then obtain a representation h^e_t for timestep t by concatenating a_t and y_t with the LSTM outputs at that position.

Decoder As in the listener, we use an LSTM decoder with monotonic alignment between direction sentences and subsequences of actions, and attention over the subsequences of actions. The decoder takes as input at position k an embedding for the previously generated word w_{k-1} and a representation z_k of the current subsequence of actions and world states, and produces a distribution over words (including ending the description for the current subsequence and advancing to the next). The decoder's output distribution is produced by:

$$h^d_k = \mathrm{LSTM}^d\!\left(h^d_{k-1}, [w_{k-1}, z_k]\right)$$
$$q_k = W_h h^d_k + W_z z_k$$
$$p(w_k \mid w_{1:k-1}, a_{1:T}, y_{1:T}) \propto \exp(q_k)$$

where all weight matrices W are learned parameters.⁴ As in the base listener, the input representation z_k is produced by attending to the vectors h^e_1 . . . h^e_T encoding the input sequence (here, encoding the subsequence of actions and world states to be described):

$$\alpha_{k,t} \propto \exp\!\left(v \cdot \tanh(W_d h^d_{k-1} + W_e h^e_t)\right)$$
$$z_k = \sum_{t=1}^{T} \alpha_{k,t} \, h^e_t$$

The decoder's LSTM state is reset at the beginning of each sentence.

Domain specifics In SAIL, for comparison to the generation system of Daniele et al. (2017), which did not use segmented routes, we train a route segmenter for use at test time. We also represent routes using a collapsed representation of action sequences. In the SCONE domains, we (i) use the same context-dependent action embeddings used in the listener, and (ii) don't require an attention mechanism, since only a single action is used to produce a given sentence within the sequence of direction sentences. See Appendix A for more details.

5.3 Training

The base listener and speaker models are trained independently to maximize the conditional likelihoods of the action–direction pairs in the training sets. See Appendix A for details on the optimization, LSTM variant, and hyperparameters.

We use ensembles for the base listener L0 and base speaker S0, where each ensemble consists of 10 models trained from separate random parameter initializations. This follows the experimental setup of Mei et al. (2016) for the SAIL base listener.

⁴ All parameters are distinct from those used in the base listener; the listener and speaker are trained separately.


               Single-sentence        Multi-sentence
listener        Rel       Abs          Rel       Abs

past work       69.98     65.28        26.07     35.44
                (MBW)     (AZ)         (MBW)     (ADP)
L0              68.40     59.62        24.79     13.53
L0 · L1         71.64     64.38        34.05     24.50

accuracy gain   +3.24     +4.76        +9.26     +10.97

Table 1: Instruction-following results on the SAIL dataset. The table shows cross-validation test accuracy for the base listener (L0) and pragmatic listeners (L0 · L1), along with the gain given by pragmatics. We report results for the single- and multi-sentence conditions, under the relative and absolute starting conditions⁵, comparing to the best-performing prior work by Artzi and Zettlemoyer (2013) (AZ), Artzi et al. (2014) (ADP), and Mei et al. (2016) (MBW). Bold numbers show new state-of-the-art results.

6 Experiments

We evaluate speaker and listener agents on both the instruction following and instruction generation tasks in the SAIL domain and three SCONE domains (Section 2). For all domains, we compare the rational listener and speaker against the base listener and speaker, as well as against past state-of-the-art results for each task and domain. Finally, we examine pragmatic inference from a model combination perspective, comparing the pragmatic reranking procedure to ensembles of a larger number of base speakers or listeners.

For all experiments, we use beam search both to generate candidate lists for the rational systems (section 4.2) and to generate the base model's output. We fix the beam size n to be the same in both the base and rational systems, using n = 20 for the speakers and n = 40 for the listeners. We tune the weight λ in the combined rational agents (L0 · L1 or S0 · S1) to maximize accuracy (for listener models) or BLEU (for speaker models) on each domain's development data.

6.1 Instruction following

We evaluate our listener models by their accuracy in carrying out human instructions: whether the systems were able to reach the final world state which the human was tasked with guiding them to.

SAIL We follow standard cross-validation evaluation for the instruction following task on the SAIL dataset (Artzi and Zettlemoyer, 2013; Artzi et al., 2014; Mei et al., 2016).⁵ Table 1 shows improvements over the base listener L0 when using the rational listener L0 · L1 in the single- and multi-sentence settings. We also report the best accuracies from past work. We see that the largest relative gains come in the multi-sentence setting, where handling ambiguity is potentially more important to avoid compounding errors. The rational model improves on the published results of Mei et al. (2016), and while it is still below the systems of Artzi and Zettlemoyer (2013) and Artzi et al. (2014), which use additional supervision in the form of hand-annotated seed lexicons and logical domain representations, it approaches their results in the single-sentence setting.

⁵ Past work has differed in the handling of undetermined orientations in the routes, which occur in the first state for multi-sentence routes and the first segment of their corresponding single-sentence routes. For comparison to both types of past work, we train and evaluate listeners in two settings: Abs, which sets these undetermined starting orientations to be a fixed absolute orientation, and Rel, where an undetermined starting orientation is set to be a 90 degree rotation from the next state in the true route.

listener        Alchemy   Scene   Tangrams

GPLL            52.9      46.2    37.3
L0              69.7      70.9    69.6
L0 · L1         72.0      72.7    69.6

accuracy gain   +2.3      +1.8    +0.0

Table 2: Instruction-following results in the SCONE domains. The table shows accuracy on the test set. For reference, we also show prior results from Guu et al. (2017) (GPLL), although our models use more supervision at training time.

Figure 3: Action traces produced by the base listener L0 and the rational listener L0 · L1 for a partial instruction sequence (two instructions out of five) in the Scene domain: "a red guy appears on the far left / then to orange's other side". The base listener moves the red figure to a position that is a marginal, but valid, interpretation of the directions. The rational listener correctly produces the action sequence the directions were intended to describe.

SCONE In the SCONE domains, past work has trained listener models with weak supervision (with no intermediate actions between start and end world states) on a subset of the full SCONE training data. We use the full training set, and to use a model and training procedure consistent with the SAIL setting, train listener and speaker models using the intermediate actions as supervision as well.⁶ The evaluation method and test data are the same as in past work on SCONE: models are provided with an initial world state and a sequence of 5 instructions to carry out, and are evaluated on their accuracy in reaching the intended final world state.

⁶ Since the pragmatic inference procedure we use is agnostic to the models' training method, it could also be applied to the models of Guu et al. (2017); however we find that pragmatic inference can improve even upon our stronger base listener models.

speaker           SAIL    Alchemy   Scene   Tangrams

DBW               70.9    —         —       —
S0                62.8    29.3      31.3    60.0
S0 · S1           75.2    75.3      69.3    88.0

accuracy gain     +12.4   +46.0     +38.0   +28.0

human-generated   73.2    83.3      78.0    66.0

Table 3: Instruction generation results. We report the accuracies of human evaluators at following the outputs of the speaker systems (as well as other humans) on 50-instance samples from the SAIL dataset and SCONE domains. DBW is the system of Daniele et al. (2017). Bold numbers are new state-of-the-art results.

Results are reported in Table 2. We see gains from the rational system L0 · L1 in both the Alchemy and Scene domains. The pragmatic inference procedure allows correcting errors or overly-literal interpretations from the base listener. An example is shown in Figure 3. The base listener (left) interprets then to orange's other side incorrectly, while the rational listener discounts this interpretation (it could, for example, be better described by to the left of blue) and produces the action the descriptions were meant to describe (right). To the extent that human annotators already account for pragmatic effects when generating instructions, examples like these suggest that our model's explicit reasoning is able to capture interpretation behavior that the base sequence-to-sequence listener model is unable to model.

6.2 Instruction generation

As our primary evaluation for the instruction generation task, we had Mechanical Turk workers carry out directions produced by the speaker models (and by other humans) in a simulated version of each domain. For SAIL, we use the simulator released by Daniele et al. (2017) which was used in their human evaluation results, and we construct simulators for the three SCONE domains. In all settings, we take a sample of 50 action sequences from the domain's test set (using the same sample as Daniele et al. (2017) for SAIL), and have three separate Turk workers attempt to follow the systems' directions for the action sequence.

speaker        SAIL    Alchemy   Scene   Tangrams

DBW            11.00   —         —       —
S0             12.04   19.34     18.09   21.75
S0 · S1        10.78   18.70     27.15   23.03

BLEU gain      -1.26   -0.64     +9.06   +1.28

accuracy gain  +12.4   +46.0     +38.0   +28.0
(from Table 3)

Table 4: Gains in how easy the directions are to follow are not always associated with a gain in BLEU. This table shows corpus-level 4-gram BLEU comparing outputs of the speaker systems to human-produced directions on the SAIL dataset and SCONE domains, compared to gains in accuracy when asking humans to carry out a sample of the systems' directions (see Table 3).

Table 3 gives the average accuracy of subjects in reaching the intended final world state across all sampled test instances, for each domain. The "human-generated" row reports subjects' accuracy at following the datasets' reference directions. The directions produced by the base speaker S0 are often much harder to follow than those produced by humans (e.g. 29.3% of S0's directions are correctly interpretable for Alchemy, vs. 83.3% of human directions). However, we see substantial gains from the rational speaker S0 · S1 over S0 in all cases (with absolute gains in accuracy ranging from 12.4% to 46.0%), and the average accuracy of humans at following the rational speaker's directions is substantially higher than for human-produced directions in the Tangrams domain. In the SAIL evaluation, we also include the directions produced by the system of Daniele et al. (2017) (DBW), and find that the rational speaker's directions are followable to comparable accuracy.

We also compare the directions produced by the systems to the reference instructions given by humans in the dataset, using 4-gram BLEU⁷ (Papineni et al., 2002) in Table 4. Consistent with past work (Krahmer and Theune, 2010), we find that BLEU score is a poor indicator of whether the directions can be correctly followed.

⁷ See Appendix A for details on evaluating BLEU in the SAIL setting, where there may be a different number of reference and predicted sentences for a given example.

Figure 4: Descriptions produced for a partial action sequence in the Tangrams domain. Human: "take away the last item / undo the last step". S0: "remove the last figure / add it back". S0 · S1: "remove the last figure / add it back in the 3rd position". Neither the human nor base speaker S0 correctly specifies where to add the shape in the second step, while the rational speaker S0 · S1 does.

Qualitatively, the rational inference procedure is most successful in fixing ambiguities in the base speaker model's descriptions. Figure 4 gives a typical example of this for the last few timesteps from a Tangrams instance. The base speaker correctly describes that the shape should be added back, but does not specify where to add it, which could lead a listener to add it in the same position it was deleted. The human speaker also makes this mistake in their description. This speaks to the difficulty of describing complex actions pragmatically, even for humans, in the Tangrams domain. The ability of the pragmatic speaker to produce directions that are easier to follow than humans' in this domain (Table 3) shows that the pragmatic model can generate something different (and in some cases better) than the training data.

6.3 Pragmatics as model combination

Finally, our rational models can be viewed as pragmatically-motivated model combinations, producing candidates using base listener or speaker models and reranking using a combination of scores from both. We want to verify that a rational listener using n ensembled base listeners and n base speakers outperforms a simple ensemble of 2n base listeners (and similarly for the rational speaker).

Fixing the total number of models to 20 in each listener experiment, we find that the rational listener (using an ensemble of 10 base listener models and 10 base speaker models) still substantially outperforms the ensembled base listener (using 20 base listener models): accuracy gains are 68.5 → 71.6%, 70.1 → 72.0%, 71.9 → 72.7%, and 69.1 → 69.6% for SAIL single-sentence Rel, Alchemy, Scene, and Tangrams, respectively.

For the speaker experiments, fixing the total number of models to 10 (since inference in the speaker models is more expensive than in the follower models), we find similar gains as well: the rational speaker improves human accuracy at following the generated instructions from 61.9 → 73.4%, 30.7 → 74.7%, 32.0 → 66.0%, and 58.7 → 92.7%, for SAIL, Alchemy, Scene, and Tangrams, respectively.⁸

7 Conclusion

We have demonstrated that a simple procedure for pragmatic inference, with a unified treatment for speakers and listeners, obtains improvements for instruction following as well as instruction generation in multiple settings. The inference procedure is capable of reasoning about sequential, interdependent actions in non-trivial world contexts. We find that pragmatics improves upon the performance of the base models for both tasks, in most cases substantially. While this is perhaps unsurprising for the generation task, which has been discussed from a pragmatic perspective in a variety of recent work in NLP, it is encouraging that pragmatic reasoning can also improve performance for a grounded listening task with sequential, structured output spaces.

Acknowledgments

We are grateful to Andrea Daniele for sharing the SAIL simulator and their system's outputs, to Hongyuan Mei for help with the dataset, and to Tom Griffiths and Chris Potts for helpful comments and discussion. This work was supported by DARPA through the Explainable Artificial Intelligence (XAI) program. DF is supported by a Huawei / Berkeley AI fellowship. JA is supported by a Facebook graduate fellowship.

⁸ The accuracies for the base speakers are slightly different than in Table 3, despite being produced by the same systems, since we reran experiments to control as much as possible for time variation in the pool of Mechanical Turk workers.


References

Anne H. Anderson, Miles Bader, Ellen Gurman Bard, Elizabeth Boyle, Gwyneth Doherty, Simon Garrod, Stephen Isard, Jacqueline Kowtko, Jan McAllister, Jim Miller, et al. 1991. The HCRC map task corpus. Language and Speech 34(4):351–366.

Jacob Andreas and Dan Klein. 2015. Alignment-based compositional semantics for instruction following. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Jacob Andreas and Dan Klein. 2016. Reasoning about pragmatics with neural listeners and speakers. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Yoav Artzi, Dipanjan Das, and Slav Petrov. 2014. Learning compact lexicons for CCG semantic parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Yoav Artzi and Luke Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics 1(1):49–62.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. International Conference on Learning Representations.

Leon Bergen, Roger Levy, and Noah Goodman. 2016. Pragmatic reasoning through semantic inference. Semantics and Pragmatics 9.

S.R.K. Branavan, Harr Chen, Luke S. Zettlemoyer, and Regina Barzilay. 2009. Reinforcement learning for mapping instructions to actions. In Proceedings of the Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pages 82–90.

Donna Byron, Alexander Koller, Kristina Striegnitz, Justine Cassell, Robert Dale, Johanna Moore, and Jon Oberlander. 2009. Report on the first NLG challenge on generating instructions in virtual environments (GIVE). In Proceedings of the 12th European Workshop on Natural Language Generation. Association for Computational Linguistics, pages 165–173.

David L. Chen. 2012. Fast online lexicon learning for grounded language acquisition. In Proceedings of the Annual Meeting of the Association for Computational Linguistics. pages 430–439.

David L. Chen and Raymond J. Mooney. 2011. Learning to interpret natural language navigation instructions from observations. In Proceedings of the Meeting of the Association for the Advancement of Artificial Intelligence. volume 2, pages 1–2.

Robert Dale, Sabine Geldof, and Jean-Philippe Prost. 2005. Using natural language generation in automatic route description. Journal of Research and Practice in Information Technology 37(1):89.

Andrea F. Daniele, Mohit Bansal, and Matthew R. Walter. 2017. Navigational instruction generation as inverse reinforcement learning with neural machine translation. Proceedings of Human-Robot Interaction.

Michael C. Frank and Noah D. Goodman. 2012. Predicting pragmatic reasoning in language games. Science 336(6084):998–998.

Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems 29 (NIPS).

Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In AISTATS. volume 9, pages 249–256.

Robert Goeddel and Edwin Olson. 2012. Dart: A particle-based method for generating easy-to-follow directions. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on. IEEE, pages 1213–1219.

Dave Golland, Percy Liang, and Dan Klein. 2010. A game-theoretic approach to generating spatial descriptions. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 410–419.

Noah D. Goodman and Andreas Stuhlmuller. 2013. Knowledge and implicature: Modeling language understanding as social cognition. Topics in Cognitive Science 5(1):173–184.

Klaus Greff, Rupesh K. Srivastava, Jan Koutnik, Bas R. Steunebrink, and Jurgen Schmidhuber. 2016. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems.

H. P. Grice. 1975. Logic and conversation. In P. Cole and J. L. Morgan, editors, Syntax and Semantics: Vol. 3: Speech Acts, Academic Press, San Diego, CA, pages 41–58.

Kelvin Guu, Panupong Pasupat, Evan Zheran Liu, and Percy Liang. 2017. From language to programs: Bridging reinforcement learning and maximum marginal likelihood. In Association for Computational Linguistics (ACL).

Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.

Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara L. Berg. 2014. ReferItGame: Referring to objects in photographs of natural scenes. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. pages 787–798.


Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. International Conference on Learning Representations.

Alexander Koller, Kristina Striegnitz, Andrew Gargett, Donna Byron, Justine Cassell, Robert Dale, Johanna Moore, and Jon Oberlander. 2010. Report on the second NLG challenge on generating instructions in virtual environments (GIVE-2). In Proceedings of the 6th International Natural Language Generation Conference. Association for Computational Linguistics, pages 243–250.

Emiel Krahmer and Mariet Theune, editors. 2010. Empirical Methods in Natural Language Generation: Data-oriented Methods and Empirical Evaluation. Springer-Verlag, Berlin, Heidelberg.

Mike Lewis, Denis Yarats, Yann N. Dauphin, Devi Parikh, and Dhruv Batra. 2017. Deal or no deal? End-to-end learning for negotiation dialogues. In Empirical Methods in Natural Language Processing (EMNLP).

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Proceedings of the Annual Meeting of the North American Chapter of the Association for Computational Linguistics.

Reginald Long, Panupong Pasupat, and Percy Liang. 2016. Simpler context-dependent logical forms via model projections. In Association for Computational Linguistics (ACL).

Gary Wai Keung Look. 2008. Cognitively-inspired direction giving. Ph.D. thesis, Massachusetts Institute of Technology.

Ruotian Luo and Gregory Shakhnarovich. 2017. Comprehension-guided referring expressions. In Computer Vision and Pattern Recognition.

Matt MacMahon, Brian Stankiewicz, and Benjamin Kuipers. 2006. Walk the talk: Connecting language, knowledge, and action in route instructions. Proceedings of the Meeting of the Association for the Advancement of Artificial Intelligence 2(6):4.

Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan Yuille, and Kevin Murphy. 2016. Generation and comprehension of unambiguous object descriptions. In Computer Vision and Pattern Recognition.

Hongyuan Mei, Mohit Bansal, and Matthew Walter. 2016. Listen, attend, and walk: Neural mapping of navigational instructions to action sequences. In Proceedings of the Meeting of the Association for the Advancement of Artificial Intelligence.

Will Monroe, Robert X.D. Hawkins, Noah D. Goodman, and Christopher Potts. 2017. Colors in context: A pragmatic neural model for grounded language understanding. Transactions of the Association for Computational Linguistics.

Will Monroe and Christopher Potts. 2015. Learning in the Rational Speech Acts model. In Proceedings of the 20th Amsterdam Colloquium. ILLC, Amsterdam.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pages 311–318.

Christopher Potts. 2011. Pragmatics. Oxford University Press.

Nathaniel J. Smith, Noah Goodman, and Michael Frank. 2013. Learning and using language via recursive pragmatic reasoning about other agents. In Advances in Neural Information Processing Systems. pages 3039–3047.

Kristina Striegnitz, Alexandre Denis, Andrew Gargett, Konstantina Garoufi, Alexander Koller, and Mariet Theune. 2011. Report on the second second challenge on generating instructions in virtual environments GIVE-2.5. In Proceedings of the 13th European Workshop on Natural Language Generation. Association for Computational Linguistics, pages 270–279.

Jong-Chyi Su, Chenyun Wu, Huaizu Jiang, and Subhransu Maji. 2017. Reasoning about fine-grained attribute phrases using reference games. In International Conference on Computer Vision.

Stefanie Tellex, Thomas Kollar, Steven Dickerson, Matthew R. Walter, Ashis Gopal Banerjee, Seth Teller, and Nicholas Roy. 2011. Understanding natural language commands for robotic navigation and mobile manipulation. In Proceedings of the National Conference on Artificial Intelligence.

Ramakrishna Vedantam, Samy Bengio, Kevin Murphy, Devi Parikh, and Gal Chechik. 2017. Context-aware captions from context-agnostic supervision. In Computer Vision and Pattern Recognition (CVPR). volume 3.

Sida I. Wang, Percy Liang, and Christopher D. Manning. 2016. Learning language games through interaction. In Association for Computational Linguistics (ACL).

Tom Williams, Gordon Briggs, Bradley Oosterveld, and Matthias Scheutz. 2015. Going beyond literal command-based instructions: Extending robotic natural language interaction capabilities. In AAAI. pages 1387–1393.

Lei Yu, Phil Blunsom, Chris Dyer, Edward Grefenstette, and Tomas Kocisky. 2017a. The neural noisy channel. International Conference on Learning Representations.

Licheng Yu, Hao Tan, Mohit Bansal, and Tamara L. Berg. 2017b. A joint speaker-listener-reinforcer model for referring expressions. In Computer Vision and Pattern Recognition.


A Supplemental Material

A.1 SCONE listener details

We factor action production in each of the three SCONE domains, separately predicting the action type and the arguments specific to that action type. Action types and arguments are listed in the first two columns of Table 5. For example, Alchemy's actions involve predicting the action type, a potential source beaker index i and target beaker index j, and potential amount to drain a. All factors of the action (the type and options for each argument) are predicted using separate attention mechanisms, which produce a vector q_f giving unnormalized scores for factor f (e.g. scoring each possible type, or each possible choice for the argument).

Domain     type      arguments              contextual embedding

Alchemy    MIX       source i               contents of i
           POUR      source i, target j     contents of i and j
           DRAIN     amount a, source i     a, contents of i

Scene      ENTER     color c, source i      people at i−1 and i+1
           EXIT      source i               people at i, i−1, i+1
           MOVE      source i, target j     people at i, j−1, j+1
           SWITCH    source i, target j     people at i and j
           TAKEHAT   source i, target j     people at i and j

Tangrams   REMOVE    position i             —
           SWAP      positions i and j      —
           INSERT    position i, shape s    index of step when s was removed

Table 5: Action types, arguments, and elements of the world state or action history that are extracted to produce contextual action embeddings.

We also obtain state-specific embeddings of actions, to make it easier for the model to learn relevant features from the state embeddings (e.g. rather than needing to learn to select the region of the state vector corresponding to the 5th beaker in the action MIX(5) in Alchemy, this action's contextual embedding encodes the current content of the 5th beaker). We incorporate these state-specific embeddings into computation of the action probabilities using a bilinear bonus score:

$$b(a) = q^{\top} W_{qa}\, a + w_a^{\top} a$$

where q is the concatenation of all q_f factor scoring vectors, and W_{qa} and w_a are a learned parameter matrix and vector, respectively. This bonus score b(a) for each action is added to the unnormalized score for the corresponding action a (computed by summing the entries of the q_f vectors which correspond to the factored action components), and the normalized output distribution is then produced using a softmax over all valid actions.
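A NumPy sketch of how this bonus might be folded into the output distribution; variable names follow the equation, and the masking convention for invalid actions is our assumption rather than a detail from the paper.

```python
import numpy as np


def action_probs(q, factored_scores, action_embeddings, valid_mask, W_qa, w_a):
    # factored_scores[i]: sum of the q_f entries selected by action i's type and arguments
    # action_embeddings[i]: contextual embedding of action i (one row per action)
    bonus = np.array([q @ W_qa @ a + w_a @ a for a in action_embeddings])  # b(a)
    logits = factored_scores + bonus
    logits = np.where(valid_mask, logits, -np.inf)   # assumed convention: mask out invalid actions
    p = np.exp(logits - logits[valid_mask].max())    # softmax over valid actions only
    return p / p.sum()
```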

A.2 SAIL speaker details

Since our speaker model operates on segmented action sequences, we train a route segmenter on the training data and then predict segmentations for the test data. This provides a closer comparison to the generation system of Daniele et al. (2017), which did not use segmented routes. The route segmenter runs a bidirectional LSTM over the concatenated state and action embeddings (as in the speaker encoder), then uses a logistic output layer to classify whether the route should be split at each possible timestep. We also collapse consecutive sequences of forward movement actions into single actions (e.g. MOVE4 representing four consecutive forward movements), which we found helped prevent counting errors (such as outputting move forward three when the correct route moved forward four steps).
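The collapsing step might be implemented along the following lines; the MOVE token names are illustrative rather than taken from the released code.

```python
def collapse_moves(actions):
    """Merge runs of forward movements into single parameterized actions, e.g. MOVE, MOVE -> MOVE2."""
    collapsed, run = [], 0
    for a in actions + ["<end>"]:        # sentinel flushes a trailing run
        if a == "MOVE":
            run += 1
        else:
            if run > 0:
                collapsed.append(f"MOVE{run}")
                run = 0
            if a != "<end>":
                collapsed.append(a)
    return collapsed


# Example: ["MOVE", "MOVE", "MOVE", "MOVE", "TURN_LEFT"] -> ["MOVE4", "TURN_LEFT"]
```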

A.3 SCONE speaker details

We use a one-hot representation of the arguments (see Table 5) and contextual embedding (as described in A.1) for each action a_t as input to the SCONE speaker encoder at time t (along with the representation e_t of the world state, as in SAIL). Since SCONE uses a monotonic, one-to-one alignment between actions and direction sentences, the decoder does not use a learned attention mechanism but fixes the contextual representation z_k to be the encoded vector at the action corresponding to the sentence currently being generated.

A.4 Training details

We optimize model parameters using ADAM (Kingma and Ba, 2015) with default hyperparameters and the initialization scheme of Glorot and Bengio (2010). All LSTMs have one layer. The LSTM cells in both the listener and the follower use coupled input and forget gates, and peephole connections to the cell state (Greff et al., 2016). We also apply the LSTM variational dropout scheme of Gal and Ghahramani (2016), using the same dropout rate for inputs, outputs, and recurrent connections. See Table 6 for hyperparameters. We perform early stopping using the evaluation metric (accuracy for the listener and BLEU score for the speaker) on the development set.

model   domain     dropout rate   hidden dim   attention dim

L0      SAIL       0.25           100          100
L0      Alchemy    0.1            50           50
L0      Scene      0.1            100          100
L0      Tangrams   0.3            50           100

S0      SAIL       0.25           100          100
S0      Alchemy    0.3            100          –
S0      Scene      0.3            100          –
S0      Tangrams   0.3            50           –

Table 6: Hyperparameters for the base listener (L0) and speaker (S0) models. The SCONE speakers do not use an attention mechanism.

A.5 Computing BLEU for SAIL

To compute BLEU in the SAIL experiments, as the speaker models may choose to produce a different number of sentences for each route than in the true description, we obtain a single sequence of words from a multi-sentence description produced for a route by concatenating the sentences, separated by end-of-sentence tokens. We then calculate corpus-level 4-gram BLEU between all these sequences in the test set and the true multi-sentence descriptions (concatenated in the same way).
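A sketch of this evaluation using NLTK's corpus-level BLEU; the toolkit choice and the end-of-sentence token are our assumptions, since the paper does not name a specific implementation.

```python
from nltk.translate.bleu_score import corpus_bleu


def flatten(description, eos="</s>"):
    """Concatenate a multi-sentence description into one token sequence, with eos separators."""
    return [tok for sentence in description for tok in sentence.split() + [eos]]


def sail_bleu(references, predictions):
    # references, predictions: lists of multi-sentence descriptions (each a list of sentence strings)
    refs = [[flatten(r)] for r in references]   # one reference description per test instance
    hyps = [flatten(p) for p in predictions]
    return corpus_bleu(refs, hyps)              # corpus-level 4-gram BLEU by default
```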
