RMM: A Recursive Mental Model for Dialog Navigation · 2020-05-05


RMM: A Recursive Mental Model for Dialog Navigation

Homero Roman Roman1 Yonatan Bisk1,2

Jesse Thomason3 Asli Celikyilmaz1 Jianfeng Gao1

1Microsoft Research AI  2Carnegie Mellon University  3University of Washington
[email protected]  [email protected]

Abstract

Fluent communication requires understanding your audience. In the new collaborative task of Vision-and-Dialog Navigation, one agent must ask questions and follow instructive answers, while the other must provide those answers. We introduce the first true dialog navigation agents in the literature which generate full conversations, and introduce the Recursive Mental Model (RMM) to conduct these dialogs. RMM dramatically improves generated language questions and answers by recursively propagating reward signals to find the question expected to elicit the best answer, and the answer expected to elicit the best navigation. Additionally, we provide baselines for future work to build on when investigating the unique challenges of embodied visual agents that not only interpret instructions but also ask questions in natural language.

1 Introduction

A key challenge for embodied language is moving beyond instruction following to instruction generation. This dialog paradigm raises a myriad of new research questions, from grounded versions of traditional problems like co-reference resolution (Das et al., 2017a) to modeling theory of mind to consider the listener (Bisk et al., 2020).

In this work, we develop end-to-end dialog agents to navigate photorealistic, indoor scenes to reach goal rooms, trained on human-human dialog from the Collaborative Vision-and-Dialog Navigation (CVDN) (Thomason et al., 2019) dataset. Previous work considers only the vision-and-language navigation task, conditioned on single instructions (Wang et al., 2019; Ma et al., 2019; Fried et al., 2018; Anderson et al., 2018) or dialog histories (Hao et al., 2020; Zhu et al., 2020; Wang et al., 2020; Thomason et al., 2019). Towards dialog, recent work has modelled question answering

[Figure 1 omitted: the RMM agent, a baseline, and humans exchange questions and answers such as "Should I head forward?", "Yes, all the way down the hall", and "Should I turn left here?".]

Figure 1: The RMM agent recursively models conversations with instances of itself to choose the right questions to ask (and answers to give) to reach the goal.

in addition to navigation (Chi et al., 2020; Nguyen and Daumé III, 2019; Nguyen et al., 2019). Closing the loop, our work is the first to train agents to perform end-to-end, collaborative dialogs with question generation, question answering, and navigation conditioned on dialog history.

Theory of mind (Gopnik and Wellman, 1992) guides human communication. Efficient questions and answers build on a shared world of experiences and referents. We formalize this notion through a Recursive Mental Model (RMM) of a conversational partner. With this formalism, an agent spawns instances of itself to converse with to posit the effects of dialog acts before asking a question or generating an answer, thus enabling conversational planning to achieve the desired navigation result.

2 Related Work and Background

This work builds on research in multimodal navigation, multimodal dialog, and the wider goal-oriented dialog literature.

Instruction Following tasks an embodied agent with interpreting natural language instructions

arXiv:2005.00728v1 [cs.CL] 2 May 2020


along with visual observations to reach a goal (Anderson et al., 2018; Chen and Mooney, 2011). These instructions describe step-by-step actions the agent needs to take. This paradigm has been extended to longer trajectories and outdoor environments (Chen et al., 2019), as well as to agents in the real world (Chai et al., 2018; Tellex et al., 2014). In this work, we focus on the simulated, photorealistic indoor environments of the MatterPort dataset (Chang et al., 2017), and go beyond instruction following to a cooperative, two-agent dialog setting.

Navigation Dialogs task a navigator and a guide to cooperate to find a destination. Agents can be trained on human-human dialogs, but previous work either includes substantial information asymmetry between the navigator and oracle (de Vries et al., 2018; Narayan-Chen et al., 2019) or only investigates the navigation portion of the dialog without considering question generation and answering (Thomason et al., 2019). The latter approach treats dialog histories as longer and more ambiguous forms of static instructions. No text is generated to approach such navigation-only tasks. Going beyond models that perform navigation from dialog history alone (Wang et al., 2020; Zhu et al., 2020; Hao et al., 2020), in this work we train two agents: a navigator agent that asks questions, and a guide agent that answers those questions.

Multimodal Dialog takes several forms. In Visual Dialog (Das et al., 2017a), an agent answers a series of questions about an image while accounting for dialog context in the process. Reinforcement learning (Das et al., 2017b) has proved essential to strong performance on this task, and such paradigms have been extended to producing multi-domain visual dialog agents (Ju et al., 2019). GuessWhat (de Vries et al., 2017) presents a similar paradigm, where agents use visual properties of objects to reason about which referent meets various constraints. Identifying visual attributes can also lead to emergent communication between pairs of learning agents (Cao et al., 2018).

Goal-Oriented Dialog Goal-oriented dialog systems, or chatbots, help a user achieve a predefined goal, like booking flights, within a closed domain (Gao et al., 2019; Vlad Serban et al., 2015; Bordes and Weston, 2017) while trying to limit the number of questions asked of the user. Modeling goal-oriented dialog requires skills that go beyond language modeling, such as asking questions to clearly define a user request, querying knowledge bases, and interpreting results from queries as options to complete a transaction. Most current task-oriented systems are data-driven and mostly trained in end-to-end fashion using semi-supervised or transfer learning methods (Ham et al., 2020; Mrksic et al., 2017). However, these data-driven approaches may lack grounding between the text and the current state of the environment. Reinforcement learning-based dialog modeling (Su et al., 2016; Peng et al., 2017; Liu et al., 2017) can improve completion rate and user experience by helping ground conversational data to environments.

Algorithm 1: Dialog Navigation
    loc = p_0; hist = t_0
    a⃗ ∼ N(hist)
    loc, hist = update(a⃗, loc, hist)
    while a⃗ ≠ STOP and len(hist) < 20 do
        q ∼ Q(hist, loc)                      // Question
        s⃗ = path(loc, goal, horizon = 5)
        o ∼ O(hist, loc, q, s⃗)                // Answer
        hist ← hist + (q, o)
        for a ∈ N(hist) do
            loc ← loc + a                     // Move
            hist ← hist + a
        end
    end
    return (goal − t_0) − (loc − t_0)
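The control flow of Algorithm 1 can be sketched in Python. Every interface here (`navigator`, `questioner`, `guide`, `env` and its methods) is a hypothetical stand-in for the N, Q, and O models and the simulator, not the paper's actual implementation.

```python
def dialog_navigation(navigator, questioner, guide, env, max_history=20):
    """Sketch of Algorithm 1: alternate navigation with Q&A exchanges.

    navigator/questioner/guide stand in for the N, Q, and O models;
    env supplies the start location, target hint, planner, and goal.
    """
    loc = env.start
    hist = [env.target_hint]            # dialog history seeded with t_0
    actions = navigator(hist, loc)      # initial action sequence from N
    loc = env.step(loc, actions)
    while actions[-1] != "stop" and len(hist) < max_history:
        q = questioner(hist, loc)                         # Question
        shortest = env.planner(loc, env.goal, horizon=5)  # next 5 planner steps
        o = guide(hist, loc, q, shortest)                 # Answer
        hist += [q, o]
        actions = navigator(hist, loc)
        for a in actions:                                 # Move
            loc = env.step(loc, [a])
            hist.append(a)
    # "progress to goal": distance reduction relative to the start
    return env.dist(env.start, env.goal) - env.dist(loc, env.goal)
```

The return value matches the evaluation used throughout the paper: how much closer the agent ended up to the goal than where it started.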

3 Task and Data

Our work creates a two-agent dialog task, building on the CVDN dataset (Thomason et al., 2019) of human-human dialogs. In that dataset, a human Navigator and Guide collaborate to find a goal room containing a target object, such as a plant. The Navigator moves through the environment, and the Guide views this navigation until the Navigator asks a question in natural language. Then, the Guide can see the next few steps a shortest path planner would take towards the goal, and writes a natural language response. This dialog continues until the Navigator arrives at the goal.

We model this dialog between two agents:

1. Navigator (N) & Questioner (Q)

2. Guide (G)

We split the first agent into its two roles: navigation and question asking. The agents receive the same input as their human counterparts in CVDN.


In particular, both agents (and all three roles) have access to the entire dialog and visual navigation histories, in addition to a textual description of the target object (e.g., a plant). The Navigator uses this information to decide on a series of actions: forward, left, right, look up, look down, and stop. The Questioner asks for specific guidance from the Guide. The Guide is presented not only with the navigation/dialog history but also the next five shortest path steps to the goal.

Agents are trained on real human dialogs of natural language questions and answers from CVDN. Individual question-answer exchanges in that dataset are underspecified and rarely provide simple step-by-step instructions like "straight, straight, right, ...". Instead, exchanges rely on assumptions of world knowledge and shared context (Frank and Goodman, 2012; Grice et al., 1975), which manifest as instructions full of visual-linguistic co-references such as "should I go back to the room I just passed or continue on?"

The CVDN release does not provide any baselines or evaluations for this interactive dialog setting, focusing instead solely on the navigation component of the task. It evaluates navigation agents by "progress to goal" in meters: the reduction in distance to the goal location between the agent's starting position and its ending position.

Dialog navigation proceeds by iterating through the three roles until either the navigator chooses to stop or a maximum number of turns is played (Algorithm 1). Upon terminating, the "progress to goal" is returned for evaluation. We also report BLEU scores (Papineni et al., 2002) for evaluating the generation of questions and answers by comparing against human questions and answers.
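The "progress to goal" metric can be written out directly. `dist` is any distance over positions (geodesic in CVDN; a plain metric here for illustration, and the function name is our own):

```python
def progress_to_goal(dist, start, end, goal):
    """Distance reduction toward the goal: dist(start, goal) - dist(end, goal).

    Positive values mean the agent ended closer to the goal than it started;
    zero means no progress, negative means it moved away.
    """
    return dist(start, goal) - dist(end, goal)
```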

Conditioning Context In our experiments, we define three different notions of dialog context (tO, QAi-1, and QA1:i-1) to evaluate how well agents utilize, or are confused by, the generated conversations.

tO The agent must navigate to the goal while knowing only what type of object they are looking for (e.g., a plant).

QAi-1 The agent has access to their previous Question-Answer exchange. They can condition on this information to both generate the next exchange and then navigate towards the goal.

QA1:i-1 This is the "full" evaluation paradigm in which an agent has access to the entire dialog when interacting. This context also affords the most potential distractor information.
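The three settings can be made concrete with a small helper. The function and mode names are illustrative (not from the paper's code), with `exchanges` holding (question, answer) pairs:

```python
def build_context(target_hint, exchanges, mode):
    """Assemble the dialog context for one of the three evaluation settings."""
    if mode == "tO":                 # target object only
        return [target_hint]
    if mode == "QA_prev":            # QAi-1: previous exchange only
        last = list(exchanges[-1]) if exchanges else []
        return [target_hint] + last
    if mode == "QA_full":            # QA1:i-1: entire dialog so far
        return [target_hint] + [t for qa in exchanges for t in qa]
    raise ValueError(f"unknown mode: {mode}")
```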

4 Models

We introduce the Recursive Mental Model (RMM) as an initial approach to our full dialog task formulation of the CVDN dataset. Key to this approach is allowing component models (Navigator, Questioner, and Guide) to learn from each other and roll out possible dialogs and trajectories. We compare our model to a traditional sequence-to-sequence baseline, and we explore Speaker-Follower data augmentation (Fried et al., 2018).

4.1 Sequence-to-Sequence Architecture

The underlying architecture, shown in Figure 2, is shared across all approaches. The core dialog tasks are navigation action decoding and language generation for asking and answering questions. We present three sequence-to-sequence (Bahdanau et al., 2015) models to perform as Navigator, Questioner, and Guide. The models rely on an LSTM (Hochreiter and Schmidhuber, 1997) encoder for the dialog history and a ResNet backbone (He et al., 2015) for processing the visual surroundings; we take the penultimate ResNet layer as image observations.

Navigation Action Decoding Initially, the dialog context is a target object tO that can be found in the goal room, for example "plant", indicating that the goal room contains a plant. As questions are asked and answered, the dialog context grows. Following prior work (Anderson et al., 2018; Thomason et al., 2019), dialog history words ~w are embedded as 256-dimensional vectors and passed through an LSTM to produce context vectors ~u and a final hidden state hN. The hidden state hN is used to initialize the LSTM decoder. At every timestep the decoder is updated with the previous action at−1 and current image It. The hidden state is used to attend over the language ~u and predict the next action at (Figure 2a).
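One decoding step of this attend-then-predict scheme can be sketched with NumPy. All weight names and shapes here are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

# The discrete navigation actions available to the Navigator.
ACTIONS = ["forward", "left", "right", "look up", "look down", "stop"]

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_step(h, U, W_att, W_act):
    """Attend over language context vectors U with decoder state h,
    then score the discrete navigation actions.

    h: (d,) decoder hidden state; U: (n_words, d) encoded dialog words;
    W_att: (d, d) attention bilinear map; W_act: (|A|, 2d) output layer.
    """
    alpha = softmax(U @ (W_att @ h))      # attention weights over dialog tokens
    context = alpha @ U                   # weighted sum of context vectors
    logits = W_act @ np.concatenate([h, context])
    return ACTIONS[int(np.argmax(logits))], logits
```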

We pretrain the decoder on the navigation task alone (Thomason et al., 2019) before fine-tuning in the full dialog setting we introduce in this paper. The next action is sampled from the model's predicted logits, and the episode ends when either a stop action is sampled or 80 steps are taken.

Page 4: RMM: A Recursive Mental Model for Dialog Navigation · 2020-05-05 · RMM: A Recursive Mental Model for Dialog Navigation Homero Roman Roman 1Yonatan Bisk;2 Jesse Thomason3 Asli Celikyilmaz1

Action Decoder

<TAR> Plant <NAV> forward? <ORA> Yes

Attend

[!! … !"]Previous Action

'#

ResNet

Encoder

Decoder dt<latexit sha1_base64="pf2IqhhBVI4lW/UO77wvpjq9eRs=">AAAB6nicbVBNS8NAEJ3Ur1q/qh69LBbBU0mqoMeiF48V7Qe0oWw2m3bpZhN2J0Ip/QlePCji1V/kzX/jts1BWx8MPN6bYWZekEph0HW/ncLa+sbmVnG7tLO7t39QPjxqmSTTjDdZIhPdCajhUijeRIGSd1LNaRxI3g5GtzO//cS1EYl6xHHK/ZgOlIgEo2ilh7CP/XLFrbpzkFXi5aQCORr98lcvTFgWc4VMUmO6npuiP6EaBZN8WuplhqeUjeiAdy1VNObGn8xPnZIzq4QkSrQthWSu/p6Y0NiYcRzYzpji0Cx7M/E/r5thdO1PhEoz5IotFkWZJJiQ2d8kFJozlGNLKNPC3krYkGrK0KZTsiF4yy+vklat6l1Ua/eXlfpNHkcRTuAUzsGDK6jDHTSgCQwG8Ayv8OZI58V5dz4WrQUnnzmGP3A+fwBUEo3T</latexit>

d0 = hN<latexit sha1_base64="dtrSuflFJlptIOttgqnKDiF9FRc=">AAAB8HicbVBNSwMxEJ2tX7V+VT16CRbBU9mtgl6EohdPUsF+SLss2WzahibZJckKZemv8OJBEa/+HG/+G9N2D9r6YODx3gwz88KEM21c99sprKyurW8UN0tb2zu7e+X9g5aOU0Vok8Q8Vp0Qa8qZpE3DDKedRFEsQk7b4ehm6refqNIslg9mnFBf4IFkfUawsdJjFLjoCg2Du6BccavuDGiZeDmpQI5GUP7qRTFJBZWGcKx113MT42dYGUY4nZR6qaYJJiM8oF1LJRZU+9ns4Ak6sUqE+rGyJQ2aqb8nMiy0HovQdgpshnrRm4r/ed3U9C/9jMkkNVSS+aJ+ypGJ0fR7FDFFieFjSzBRzN6KyBArTIzNqGRD8BZfXiatWtU7q9buzyv16zyOIhzBMZyCBxdQh1toQBMICHiGV3hzlPPivDsf89aCk88cwh84nz802o9d</latexit>

hN<latexit sha1_base64="YbtmBRlboZ8IE3o5aJie73VbYQY=">AAAB6nicbVBNS8NAEJ34WetX1aOXxSJ4KkkV9Fj04kkq2g9oQ9lsN+3SzSbsToQS+hO8eFDEq7/Im//GbZuDtj4YeLw3w8y8IJHCoOt+Oyura+sbm4Wt4vbO7t5+6eCwaeJUM95gsYx1O6CGS6F4AwVK3k40p1EgeSsY3Uz91hPXRsTqEccJ9yM6UCIUjKKVHoa9u16p7FbcGcgy8XJShhz1Xumr249ZGnGFTFJjOp6boJ9RjYJJPil2U8MTykZ0wDuWKhpx42ezUyfk1Cp9EsbalkIyU39PZDQyZhwFtjOiODSL3lT8z+ukGF75mVBJilyx+aIwlQRjMv2b9IXmDOXYEsq0sLcSNqSaMrTpFG0I3uLLy6RZrXjnler9Rbl2ncdRgGM4gTPw4BJqcAt1aACDATzDK7w50nlx3p2PeeuKk88cwR84nz8gko2x</latexit>

dT<latexit sha1_base64="0pf0LGokFUdCNn2ez6RBSqImxTc=">AAAB6nicbVBNS8NAEJ34WetX1aOXxSJ4KkkV9Fj04rFiv6ANZbOZtEs3m7C7EUrpT/DiQRGv/iJv/hu3bQ7a+mDg8d4MM/OCVHBtXPfbWVvf2NzaLuwUd/f2Dw5LR8ctnWSKYZMlIlGdgGoUXGLTcCOwkyqkcSCwHYzuZn77CZXmiWyYcYp+TAeSR5xRY6XHsN/ol8puxZ2DrBIvJ2XIUe+XvnphwrIYpWGCat313NT4E6oMZwKnxV6mMaVsRAfYtVTSGLU/mZ86JedWCUmUKFvSkLn6e2JCY63HcWA7Y2qGetmbif953cxEN/6EyzQzKNliUZQJYhIy+5uEXCEzYmwJZYrbWwkbUkWZsekUbQje8surpFWteJeV6sNVuXabx1GAUziDC/DgGmpwD3VoAoMBPMMrvDnCeXHenY9F65qTz5zAHzifPyOSjbM=</latexit>

at�1<latexit sha1_base64="7S5AJeAlbdv4KStbxzMLyzuTedI=">AAAB7nicbVBNS8NAEJ3Ur1q/qh69LBbBiyWpgh6LXjxWsB/QhrLZbtqlm03YnQgl9Ed48aCIV3+PN/+N2zYHbX0w8Hhvhpl5QSKFQdf9dgpr6xubW8Xt0s7u3v5B+fCoZeJUM95ksYx1J6CGS6F4EwVK3kk0p1EgeTsY38389hPXRsTqEScJ9yM6VCIUjKKV2rSf4YU37ZcrbtWdg6wSLycVyNHol796g5ilEVfIJDWm67kJ+hnVKJjk01IvNTyhbEyHvGupohE3fjY/d0rOrDIgYaxtKSRz9fdERiNjJlFgOyOKI7PszcT/vG6K4Y2fCZWkyBVbLApTSTAms9/JQGjOUE4soUwLeythI6opQ5tQyYbgLb+8Slq1qndZrT1cVeq3eRxFOIFTOAcPrqEO99CAJjAYwzO8wpuTOC/Ou/OxaC04+cwx/IHz+QPwdY9O</latexit>

It<latexit sha1_base64="mCkAjnwpsHBN5GVZjZGo+Afi3to=">AAAB6nicbVBNS8NAEJ34WetX1aOXxSJ4KkkV9Fj0oreK9gPaUDbbTbt0swm7E6GE/gQvHhTx6i/y5r9x2+agrQ8GHu/NMDMvSKQw6Lrfzsrq2vrGZmGruL2zu7dfOjhsmjjVjDdYLGPdDqjhUijeQIGStxPNaRRI3gpGN1O/9cS1EbF6xHHC/YgOlAgFo2ilh7se9kplt+LOQJaJl5My5Kj3Sl/dfszSiCtkkhrT8dwE/YxqFEzySbGbGp5QNqID3rFU0YgbP5udOiGnVumTMNa2FJKZ+nsio5Ex4yiwnRHFoVn0puJ/XifF8MrPhEpS5IrNF4WpJBiT6d+kLzRnKMeWUKaFvZWwIdWUoU2naEPwFl9eJs1qxTuvVO8vyrXrPI4CHMMJnIEHl1CDW6hDAxgM4Ble4c2Rzovz7nzMW1ecfOYI/sD5/AEq8I24</latexit>

at�1<latexit sha1_base64="7S5AJeAlbdv4KStbxzMLyzuTedI=">AAAB7nicbVBNS8NAEJ3Ur1q/qh69LBbBiyWpgh6LXjxWsB/QhrLZbtqlm03YnQgl9Ed48aCIV3+PN/+N2zYHbX0w8Hhvhpl5QSKFQdf9dgpr6xubW8Xt0s7u3v5B+fCoZeJUM95ksYx1J6CGS6F4EwVK3kk0p1EgeTsY38389hPXRsTqEScJ9yM6VCIUjKKV2rSf4YU37ZcrbtWdg6wSLycVyNHol796g5ilEVfIJDWm67kJ+hnVKJjk01IvNTyhbEyHvGupohE3fjY/d0rOrDIgYaxtKSRz9fdERiNjJlFgOyOKI7PszcT/vG6K4Y2fCZWkyBVbLApTSTAms9/JQGjOUE4soUwLeythI6opQ5tQyYbgLb+8Slq1qndZrT1cVeq3eRxFOIFTOAcPrqEO99CAJjAYwzO8wpuTOC/Ou/OxaC04+cwx/IHz+QPwdY9O</latexit>

Decoder

Encoder

wi�1

<latexit sha1_base64="blX3D2da4rwrwTeL2nuqI423rn4=">AAAB7nicbVBNS8NAEJ34WetX1aOXxSJ4sSRSUG9FLx4r2A9oQ9lsJ+3SzSbsbpQS+iO8eFDEq7/Hm//GbZuDtj4YeLw3w8y8IBFcG9f9dlZW19Y3Ngtbxe2d3b390sFhU8epYthgsYhVO6AaBZfYMNwIbCcKaRQIbAWj26nfekSleSwfzDhBP6IDyUPOqLFS66mX8XNv0iuV3Yo7A1kmXk7KkKPeK311+zFLI5SGCap1x3MT42dUGc4ETordVGNC2YgOsGOppBFqP5udOyGnVumTMFa2pCEz9fdERiOtx1FgOyNqhnrRm4r/eZ3UhFd+xmWSGpRsvihMBTExmf5O+lwhM2JsCWWK21sJG1JFmbEJFW0I3uLLy6R5UfGqlev7arl2k8dRgGM4gTPw4BJqcAd1aACDETzDK7w5ifPivDsf89YVJ585gj9wPn8ABCWPYQ==</latexit>

Attend

wi

<latexit sha1_base64="yuGyfGzG4Lfdn99vXTg9T6IPfz4=">AAAB7HicbVBNS8NAEJ3Ur1q/qh69LBbBU0mkoN6KXjxWMG2hDWWznbZLN5uwu1FK6G/w4kERr/4gb/4bt20O2vpg4PHeDDPzwkRwbVz32ymsrW9sbhW3Szu7e/sH5cOjpo5TxdBnsYhVO6QaBZfoG24EthOFNAoFtsLx7cxvPaLSPJYPZpJgENGh5APOqLGS/9TL+LRXrrhVdw6ySrycVCBHo1f+6vZjlkYoDRNU647nJibIqDKcCZyWuqnGhLIxHWLHUkkj1EE2P3ZKzqzSJ4NY2ZKGzNXfExmNtJ5Eoe2MqBnpZW8m/ud1UjO4CjIuk9SgZItFg1QQE5PZ56TPFTIjJpZQpri9lbARVZQZm0/JhuAtv7xKmhdVr1a9vq9V6jd5HEU4gVM4Bw8uoQ530AAfGHB4hld4c6Tz4rw7H4vWgpPPHMMfOJ8/J8WO7w==</latexit>

wi

<latexit sha1_base64="yuGyfGzG4Lfdn99vXTg9T6IPfz4=">AAAB7HicbVBNS8NAEJ3Ur1q/qh69LBbBU0mkoN6KXjxWMG2hDWWznbZLN5uwu1FK6G/w4kERr/4gb/4bt20O2vpg4PHeDDPzwkRwbVz32ymsrW9sbhW3Szu7e/sH5cOjpo5TxdBnsYhVO6QaBZfoG24EthOFNAoFtsLx7cxvPaLSPJYPZpJgENGh5APOqLGS/9TL+LRXrrhVdw6ySrycVCBHo1f+6vZjlkYoDRNU647nJibIqDKcCZyWuqnGhLIxHWLHUkkj1EE2P3ZKzqzSJ4NY2ZKGzNXfExmNtJ5Eoe2MqBnpZW8m/ud1UjO4CjIuk9SgZItFg1QQE5PZ56TPFTIjJpZQpri9lbARVZQZm0/JhuAtv7xKmhdVr1a9vq9V6jd5HEU4gVM4Bw8uoQ530AAfGHB4hld4c6Tz4rw7H4vWgpPPHMMfOJ8/J8WO7w==</latexit>

at

<latexit sha1_base64="AbZP0o7VVguVOGdpFn5wrRUyKCo=">AAAB6nicbVBNS8NAEJ3Ur1q/qh69LBbBU0mkoN6KXjxWtB/QhrLZbtqlm03YnQgl9Cd48aCIV3+RN/+N2zYHbX0w8Hhvhpl5QSKFQdf9dgpr6xubW8Xt0s7u3v5B+fCoZeJUM95ksYx1J6CGS6F4EwVK3kk0p1EgeTsY38789hPXRsTqEScJ9yM6VCIUjKKVHmgf++WKW3XnIKvEy0kFcjT65a/eIGZpxBUySY3pem6CfkY1Cib5tNRLDU8oG9Mh71qqaMSNn81PnZIzqwxIGGtbCslc/T2R0ciYSRTYzojiyCx7M/E/r5tieOVnQiUpcsUWi8JUEozJ7G8yEJozlBNLKNPC3krYiGrK0KZTsiF4yy+vktZF1atVr+9rlfpNHkcRTuAUzsGDS6jDHTSgCQyG8Ayv8OZI58V5dz4WrQUnnzmGP3A+fwBSEo3Y</latexit>

at�2

<latexit sha1_base64="Pj7+jl4asdgRVmjTsRDUo4iTtyo=">AAAB7nicbVBNS8NAEJ34WetX1aOXxSJ4sSSloN6KXjxWsB/QhrLZbtqlm03YnQgl9Ed48aCIV3+PN/+N2zYHbX0w8Hhvhpl5QSKFQdf9dtbWNza3tgs7xd29/YPD0tFxy8SpZrzJYhnrTkANl0LxJgqUvJNoTqNA8nYwvpv57SeujYjVI04S7kd0qEQoGEUrtWk/w8vqtF8quxV3DrJKvJyUIUejX/rqDWKWRlwhk9SYrucm6GdUo2CST4u91PCEsjEd8q6likbc+Nn83Ck5t8qAhLG2pZDM1d8TGY2MmUSB7YwojsyyNxP/87ophtd+JlSSIldssShMJcGYzH4nA6E5QzmxhDIt7K2EjaimDG1CRRuCt/zyKmlVK16tcvNQK9dv8zgKcApncAEeXEEd7qEBTWAwhmd4hTcncV6cd+dj0brm5DMn8AfO5w/0jI9X</latexit>

at�3

<latexit sha1_base64="0FwYk9F8tT8gCLqebq7jAbyscMM=">AAAB7nicbVBNS8NAEJ34WetX1aOXxSJ4sSRaUG9FLx4r2A9oQ9lsN+3SzSbsToQS+iO8eFDEq7/Hm//GbZuDtj4YeLw3w8y8IJHCoOt+Oyura+sbm4Wt4vbO7t5+6eCwaeJUM95gsYx1O6CGS6F4AwVK3k40p1EgeSsY3U391hPXRsTqEccJ9yM6UCIUjKKVWrSX4fnlpFcquxV3BrJMvJyUIUe9V/rq9mOWRlwhk9SYjucm6GdUo2CST4rd1PCEshEd8I6likbc+Nns3Ak5tUqfhLG2pZDM1N8TGY2MGUeB7YwoDs2iNxX/8zophtd+JlSSIldsvihMJcGYTH8nfaE5Qzm2hDIt7K2EDammDG1CRRuCt/jyMmleVLxq5eahWq7d5nEU4BhO4Aw8uIIa3EMdGsBgBM/wCm9O4rw4787HvHXFyWeO4A+czx/2EY9Y</latexit>

at�4

<latexit sha1_base64="98M72Q9j3DdRP+xJVobn7ODGa2o=">AAAB7nicbVBNS8NAEJ3Ur1q/qh69LBbBiyWRgnorevFYwX5AG8pmu2mXbjZhdyKU0B/hxYMiXv093vw3btsctPXBwOO9GWbmBYkUBl332ymsrW9sbhW3Szu7e/sH5cOjlolTzXiTxTLWnYAaLoXiTRQoeSfRnEaB5O1gfDfz209cGxGrR5wk3I/oUIlQMIpWatN+hhe1ab9ccavuHGSVeDmpQI5Gv/zVG8QsjbhCJqkxXc9N0M+oRsEkn5Z6qeEJZWM65F1LFY248bP5uVNyZpUBCWNtSyGZq78nMhoZM4kC2xlRHJllbyb+53VTDK/9TKgkRa7YYlGYSoIxmf1OBkJzhnJiCWVa2FsJG1FNGdqESjYEb/nlVdK6rHq16s1DrVK/zeMowgmcwjl4cAV1uIcGNIHBGJ7hFd6cxHlx3p2PRWvByWeO4Q+czx/3lo9Z</latexit>

at�5

<latexit sha1_base64="xoP2QZjvPY3360MB4rbuikAMdMQ=">AAAB7nicbVBNS8NAEJ34WetX1aOXxSJ4sSRSUW9FLx4r2A9oQ9lsN+3SzSbsToQS+iO8eFDEq7/Hm//GbZuDtj4YeLw3w8y8IJHCoOt+Oyura+sbm4Wt4vbO7t5+6eCwaeJUM95gsYx1O6CGS6F4AwVK3k40p1EgeSsY3U391hPXRsTqEccJ9yM6UCIUjKKVWrSX4fnlpFcquxV3BrJMvJyUIUe9V/rq9mOWRlwhk9SYjucm6GdUo2CST4rd1PCEshEd8I6likbc+Nns3Ak5tUqfhLG2pZDM1N8TGY2MGUeB7YwoDs2iNxX/8zophtd+JlSSIldsvihMJcGYTH8nfaE5Qzm2hDIt7K2EDammDG1CRRuCt/jyMmleVLxq5eahWq7d5nEU4BhO4Aw8uIIa3EMdGsBgBM/wCm9O4rw4787HvHXFyWeO4A+czx/5G49a</latexit>

(a) Dialogue and action histories combined with the currentobservation are used to predict the next navigation action.

Action Decoder

<TAR> Plant <NAV> forward? <ORA> Yes

Attend

[!! … !"]Previous Action

'#

ResNet

Encoder

Decoder dt<latexit sha1_base64="pf2IqhhBVI4lW/UO77wvpjq9eRs=">AAAB6nicbVBNS8NAEJ3Ur1q/qh69LBbBU0mqoMeiF48V7Qe0oWw2m3bpZhN2J0Ip/QlePCji1V/kzX/jts1BWx8MPN6bYWZekEph0HW/ncLa+sbmVnG7tLO7t39QPjxqmSTTjDdZIhPdCajhUijeRIGSd1LNaRxI3g5GtzO//cS1EYl6xHHK/ZgOlIgEo2ilh7CP/XLFrbpzkFXi5aQCORr98lcvTFgWc4VMUmO6npuiP6EaBZN8WuplhqeUjeiAdy1VNObGn8xPnZIzq4QkSrQthWSu/p6Y0NiYcRzYzpji0Cx7M/E/r5thdO1PhEoz5IotFkWZJJiQ2d8kFJozlGNLKNPC3krYkGrK0KZTsiF4yy+vklat6l1Ua/eXlfpNHkcRTuAUzsGDK6jDHTSgCQwG8Ayv8OZI58V5dz4WrQUnnzmGP3A+fwBUEo3T</latexit>

d0 = hN<latexit sha1_base64="dtrSuflFJlptIOttgqnKDiF9FRc=">AAAB8HicbVBNSwMxEJ2tX7V+VT16CRbBU9mtgl6EohdPUsF+SLss2WzahibZJckKZemv8OJBEa/+HG/+G9N2D9r6YODx3gwz88KEM21c99sprKyurW8UN0tb2zu7e+X9g5aOU0Vok8Q8Vp0Qa8qZpE3DDKedRFEsQk7b4ehm6refqNIslg9mnFBf4IFkfUawsdJjFLjoCg2Du6BccavuDGiZeDmpQI5GUP7qRTFJBZWGcKx113MT42dYGUY4nZR6qaYJJiM8oF1LJRZU+9ns4Ak6sUqE+rGyJQ2aqb8nMiy0HovQdgpshnrRm4r/ed3U9C/9jMkkNVSS+aJ+ypGJ0fR7FDFFieFjSzBRzN6KyBArTIzNqGRD8BZfXiatWtU7q9buzyv16zyOIhzBMZyCBxdQh1toQBMICHiGV3hzlPPivDsf89aCk88cwh84nz802o9d</latexit>

hN<latexit sha1_base64="YbtmBRlboZ8IE3o5aJie73VbYQY=">AAAB6nicbVBNS8NAEJ34WetX1aOXxSJ4KkkV9Fj04kkq2g9oQ9lsN+3SzSbsToQS+hO8eFDEq7/Im//GbZuDtj4YeLw3w8y8IJHCoOt+Oyura+sbm4Wt4vbO7t5+6eCwaeJUM95gsYx1O6CGS6F4AwVK3k40p1EgeSsY3Uz91hPXRsTqEccJ9yM6UCIUjKKVHoa9u16p7FbcGcgy8XJShhz1Xumr249ZGnGFTFJjOp6boJ9RjYJJPil2U8MTykZ0wDuWKhpx42ezUyfk1Cp9EsbalkIyU39PZDQyZhwFtjOiODSL3lT8z+ukGF75mVBJilyx+aIwlQRjMv2b9IXmDOXYEsq0sLcSNqSaMrTpFG0I3uLLy6RZrXjnler9Rbl2ncdRgGM4gTPw4BJqcAt1aACDATzDK7w50nlx3p2PeeuKk88cwR84nz8gko2x</latexit>

dT<latexit sha1_base64="0pf0LGokFUdCNn2ez6RBSqImxTc=">AAAB6nicbVBNS8NAEJ34WetX1aOXxSJ4KkkV9Fj04rFiv6ANZbOZtEs3m7C7EUrpT/DiQRGv/iJv/hu3bQ7a+mDg8d4MM/OCVHBtXPfbWVvf2NzaLuwUd/f2Dw5LR8ctnWSKYZMlIlGdgGoUXGLTcCOwkyqkcSCwHYzuZn77CZXmiWyYcYp+TAeSR5xRY6XHsN/ol8puxZ2DrBIvJ2XIUe+XvnphwrIYpWGCat313NT4E6oMZwKnxV6mMaVsRAfYtVTSGLU/mZ86JedWCUmUKFvSkLn6e2JCY63HcWA7Y2qGetmbif953cxEN/6EyzQzKNliUZQJYhIy+5uEXCEzYmwJZYrbWwkbUkWZsekUbQje8surpFWteJeV6sNVuXabx1GAUziDC/DgGmpwD3VoAoMBPMMrvDnCeXHenY9F65qTz5zAHzifPyOSjbM=</latexit>

at�1<latexit sha1_base64="7S5AJeAlbdv4KStbxzMLyzuTedI=">AAAB7nicbVBNS8NAEJ3Ur1q/qh69LBbBiyWpgh6LXjxWsB/QhrLZbtqlm03YnQgl9Ed48aCIV3+PN/+N2zYHbX0w8Hhvhpl5QSKFQdf9dgpr6xubW8Xt0s7u3v5B+fCoZeJUM95ksYx1J6CGS6F4EwVK3kk0p1EgeTsY38389hPXRsTqEScJ9yM6VCIUjKKV2rSf4YU37ZcrbtWdg6wSLycVyNHol796g5ilEVfIJDWm67kJ+hnVKJjk01IvNTyhbEyHvGupohE3fjY/d0rOrDIgYaxtKSRz9fdERiNjJlFgOyOKI7PszcT/vG6K4Y2fCZWkyBVbLApTSTAms9/JQGjOUE4soUwLeythI6opQ5tQyYbgLb+8Slq1qndZrT1cVeq3eRxFOIFTOAcPrqEO99CAJjAYwzO8wpuTOC/Ou/OxaC04+cwx/IHz+QPwdY9O</latexit>

It<latexit sha1_base64="mCkAjnwpsHBN5GVZjZGo+Afi3to=">AAAB6nicbVBNS8NAEJ34WetX1aOXxSJ4KkkV9Fj0oreK9gPaUDbbTbt0swm7E6GE/gQvHhTx6i/y5r9x2+agrQ8GHu/NMDMvSKQw6Lrfzsrq2vrGZmGruL2zu7dfOjhsmjjVjDdYLGPdDqjhUijeQIGStxPNaRRI3gpGN1O/9cS1EbF6xHHC/YgOlAgFo2ilh7se9kplt+LOQJaJl5My5Kj3Sl/dfszSiCtkkhrT8dwE/YxqFEzySbGbGp5QNqID3rFU0YgbP5udOiGnVumTMNa2FJKZ+nsio5Ex4yiwnRHFoVn0puJ/XifF8MrPhEpS5IrNF4WpJBiT6d+kLzRnKMeWUKaFvZWwIdWUoU2naEPwFl9eJs1qxTuvVO8vyrXrPI4CHMMJnIEHl1CDW6hDAxgM4Ble4c2Rzovz7nzMW1ecfOYI/sD5/AEq8I24</latexit>

at�1<latexit sha1_base64="7S5AJeAlbdv4KStbxzMLyzuTedI=">AAAB7nicbVBNS8NAEJ3Ur1q/qh69LBbBiyWpgh6LXjxWsB/QhrLZbtqlm03YnQgl9Ed48aCIV3+PN/+N2zYHbX0w8Hhvhpl5QSKFQdf9dgpr6xubW8Xt0s7u3v5B+fCoZeJUM95ksYx1J6CGS6F4EwVK3kk0p1EgeTsY38389hPXRsTqEScJ9yM6VCIUjKKV2rSf4YU37ZcrbtWdg6wSLycVyNHol796g5ilEVfIJDWm67kJ+hnVKJjk01IvNTyhbEyHvGupohE3fjY/d0rOrDIgYaxtKSRz9fdERiNjJlFgOyOKI7PszcT/vG6K4Y2fCZWkyBVbLApTSTAms9/JQGjOUE4soUwLeythI6opQ5tQyYbgLb+8Slq1qndZrT1cVeq3eRxFOIFTOAcPrqEO99CAJjAYwzO8wpuTOC/Ou/OxaC04+cwx/IHz+QPwdY9O</latexit>

Decoder

Encoder

wi�1

<latexit sha1_base64="blX3D2da4rwrwTeL2nuqI423rn4=">AAAB7nicbVBNS8NAEJ34WetX1aOXxSJ4sSRSUG9FLx4r2A9oQ9lsJ+3SzSbsbpQS+iO8eFDEq7/Hm//GbZuDtj4YeLw3w8y8IBFcG9f9dlZW19Y3Ngtbxe2d3b390sFhU8epYthgsYhVO6AaBZfYMNwIbCcKaRQIbAWj26nfekSleSwfzDhBP6IDyUPOqLFS66mX8XNv0iuV3Yo7A1kmXk7KkKPeK311+zFLI5SGCap1x3MT42dUGc4ETordVGNC2YgOsGOppBFqP5udOyGnVumTMFa2pCEz9fdERiOtx1FgOyNqhnrRm4r/eZ3UhFd+xmWSGpRsvihMBTExmf5O+lwhM2JsCWWK21sJG1JFmbEJFW0I3uLLy6R5UfGqlev7arl2k8dRgGM4gTPw4BJqcAd1aACDETzDK7w5ifPivDsf89YVJ585gj9wPn8ABCWPYQ==</latexit>

Attend

wi

<latexit sha1_base64="yuGyfGzG4Lfdn99vXTg9T6IPfz4=">AAAB7HicbVBNS8NAEJ3Ur1q/qh69LBbBU0mkoN6KXjxWMG2hDWWznbZLN5uwu1FK6G/w4kERr/4gb/4bt20O2vpg4PHeDDPzwkRwbVz32ymsrW9sbhW3Szu7e/sH5cOjpo5TxdBnsYhVO6QaBZfoG24EthOFNAoFtsLx7cxvPaLSPJYPZpJgENGh5APOqLGS/9TL+LRXrrhVdw6ySrycVCBHo1f+6vZjlkYoDRNU647nJibIqDKcCZyWuqnGhLIxHWLHUkkj1EE2P3ZKzqzSJ4NY2ZKGzNXfExmNtJ5Eoe2MqBnpZW8m/ud1UjO4CjIuk9SgZItFg1QQE5PZ56TPFTIjJpZQpri9lbARVZQZm0/JhuAtv7xKmhdVr1a9vq9V6jd5HEU4gVM4Bw8uoQ530AAfGHB4hld4c6Tz4rw7H4vWgpPPHMMfOJ8/J8WO7w==</latexit>

wi

<latexit sha1_base64="yuGyfGzG4Lfdn99vXTg9T6IPfz4=">AAAB7HicbVBNS8NAEJ3Ur1q/qh69LBbBU0mkoN6KXjxWMG2hDWWznbZLN5uwu1FK6G/w4kERr/4gb/4bt20O2vpg4PHeDDPzwkRwbVz32ymsrW9sbhW3Szu7e/sH5cOjpo5TxdBnsYhVO6QaBZfoG24EthOFNAoFtsLx7cxvPaLSPJYPZpJgENGh5APOqLGS/9TL+LRXrrhVdw6ySrycVCBHo1f+6vZjlkYoDRNU647nJibIqDKcCZyWuqnGhLIxHWLHUkkj1EE2P3ZKzqzSJ4NY2ZKGzNXfExmNtJ5Eoe2MqBnpZW8m/ud1UjO4CjIuk9SgZItFg1QQE5PZ56TPFTIjJpZQpri9lbARVZQZm0/JhuAtv7xKmhdVr1a9vq9V6jd5HEU4gVM4Bw8uoQ530AAfGHB4hld4c6Tz4rw7H4vWgpPPHMMfOJ8/J8WO7w==</latexit>

at

<latexit sha1_base64="AbZP0o7VVguVOGdpFn5wrRUyKCo=">AAAB6nicbVBNS8NAEJ3Ur1q/qh69LBbBU0mkoN6KXjxWtB/QhrLZbtqlm03YnQgl9Cd48aCIV3+RN/+N2zYHbX0w8Hhvhpl5QSKFQdf9dgpr6xubW8Xt0s7u3v5B+fCoZeJUM95ksYx1J6CGS6F4EwVK3kk0p1EgeTsY38789hPXRsTqEScJ9yM6VCIUjKKVHmgf++WKW3XnIKvEy0kFcjT65a/eIGZpxBUySY3pem6CfkY1Cib5tNRLDU8oG9Mh71qqaMSNn81PnZIzqwxIGGtbCslc/T2R0ciYSRTYzojiyCx7M/E/r5tieOVnQiUpcsUWi8JUEozJ7G8yEJozlBNLKNPC3krYiGrK0KZTsiF4yy+vktZF1atVr+9rlfpNHkcRTuAUzsGDS6jDHTSgCQyG8Ayv8OZI58V5dz4WrQUnnzmGP3A+fwBSEo3Y</latexit>

at�2

<latexit sha1_base64="Pj7+jl4asdgRVmjTsRDUo4iTtyo=">AAAB7nicbVBNS8NAEJ34WetX1aOXxSJ4sSSloN6KXjxWsB/QhrLZbtqlm03YnQgl9Ed48aCIV3+PN/+N2zYHbX0w8Hhvhpl5QSKFQdf9dtbWNza3tgs7xd29/YPD0tFxy8SpZrzJYhnrTkANl0LxJgqUvJNoTqNA8nYwvpv57SeujYjVI04S7kd0qEQoGEUrtWk/w8vqtF8quxV3DrJKvJyUIUejX/rqDWKWRlwhk9SYrucm6GdUo2CST4u91PCEsjEd8q6likbc+Nn83Ck5t8qAhLG2pZDM1d8TGY2MmUSB7YwojsyyNxP/87ophtd+JlSSIldssShMJcGYzH4nA6E5QzmxhDIt7K2EjaimDG1CRRuCt/zyKmlVK16tcvNQK9dv8zgKcApncAEeXEEd7qEBTWAwhmd4hTcncV6cd+dj0brm5DMn8AfO5w/0jI9X</latexit>

at�3

<latexit sha1_base64="0FwYk9F8tT8gCLqebq7jAbyscMM=">AAAB7nicbVBNS8NAEJ34WetX1aOXxSJ4sSRaUG9FLx4r2A9oQ9lsN+3SzSbsToQS+iO8eFDEq7/Hm//GbZuDtj4YeLw3w8y8IJHCoOt+Oyura+sbm4Wt4vbO7t5+6eCwaeJUM95gsYx1O6CGS6F4AwVK3k40p1EgeSsY3U391hPXRsTqEccJ9yM6UCIUjKKVWrSX4fnlpFcquxV3BrJMvJyUIUe9V/rq9mOWRlwhk9SYjucm6GdUo2CST4rd1PCEshEd8I6likbc+Nns3Ak5tUqfhLG2pZDM1N8TGY2MGUeB7YwoDs2iNxX/8zophtd+JlSSIldsvihMJcGYTH8nfaE5Qzm2hDIt7K2EDammDG1CRRuCt/jyMmleVLxq5eahWq7d5nEU4BhO4Aw8uIIa3EMdGsBgBM/wCm9O4rw4787HvHXFyWeO4A+czx/2EY9Y</latexit>

at�4

<latexit sha1_base64="98M72Q9j3DdRP+xJVobn7ODGa2o=">AAAB7nicbVBNS8NAEJ3Ur1q/qh69LBbBiyWRgnorevFYwX5AG8pmu2mXbjZhdyKU0B/hxYMiXv093vw3btsctPXBwOO9GWbmBYkUBl332ymsrW9sbhW3Szu7e/sH5cOjlolTzXiTxTLWnYAaLoXiTRQoeSfRnEaB5O1gfDfz209cGxGrR5wk3I/oUIlQMIpWatN+hhe1ab9ccavuHGSVeDmpQI5Gv/zVG8QsjbhCJqkxXc9N0M+oRsEkn5Z6qeEJZWM65F1LFY248bP5uVNyZpUBCWNtSyGZq78nMhoZM4kC2xlRHJllbyb+53VTDK/9TKgkRa7YYlGYSoIxmf1OBkJzhnJiCWVa2FsJG1FNGdqESjYEb/nlVdK6rHq16s1DrVK/zeMowgmcwjl4cAV1uIcGNIHBGJ7hFd6cxHlx3p2PRWvByWeO4Q+czx/3lo9Z</latexit>

at�5

<latexit sha1_base64="xoP2QZjvPY3360MB4rbuikAMdMQ=">AAAB7nicbVBNS8NAEJ34WetX1aOXxSJ4sSRSUW9FLx4r2A9oQ9lsN+3SzSbsToQS+iO8eFDEq7/Hm//GbZuDtj4YeLw3w8y8IJHCoOt+Oyura+sbm4Wt4vbO7t5+6eCwaeJUM95gsYx1O6CGS6F4AwVK3k40p1EgeSsY3U391hPXRsTqEccJ9yM6UCIUjKKVWrSX4fnlpFcquxV3BrJMvJyUIUe9V/rq9mOWRlwhk9SYjucm6GdUo2CST4rd1PCEshEd8I6likbc+Nns3Ak5tUqfhLG2pZDM1N8TGY2MGUeB7YwoDs2iNxX/8zophtd+JlSSIldsvihMJcGYTH8nfaE5Qzm2hDIt7K2EDammDG1CRRuCt/jyMmleVLxq5eahWq7d5nEU4BhO4Aw8uIIa3EMdGsBgBM/wCm9O4rw4787HvHXFyWeO4A+czx/5G49a</latexit>

(b) A Bi-LSTM over the path is attended to during decoding for question and instruction generation.

Figure 2: Our backbone Seq2Seq architectures are provided visual observations and have access to the dialogue history when taking actions or asking/answering questions (Thomason et al., 2019).

Language Generation To generate questions and answers, we train sequence-to-sequence models (Figure 2b) where an encoder takes in a sequence of images and a decoder produces a sequence of word tokens. At each decoding timestep, the decoder attends over the input images to predict the next word of the question or answer. This model is also initialized via training on CVDN dialogs. In particular, question asking (Questioner) encodes the images of the current viewpoint where a question is asked, and then decodes the question asked by the human Navigator. Question answering (Guide) is encoded by viewing images of the next five steps the shortest path planner would take towards the goal, then decoding the language answer produced by the human Guide.

Pretraining initializes the lexical embeddings and attention alignments before fine-tuning in the collaborative, turn-taking setting we introduce in this paper. We experimented with several beam and temperature based sampling methods, but saw only minor effects; hence, we use direct sampling from the model's predicted logits for this paper.
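A minimal sketch of this direct-sampling decoding step, assuming a plain list of per-word logits at one timestep (the function name and toy vocabulary are illustrative, not from the paper's code):

```python
import math
import random

def sample_next_token(logits, temperature=1.0, rng=random):
    """Draw the next word id directly from the decoder's predicted logits,
    the direct-sampling strategy the paper settles on (no beam search).
    Lower temperatures sharpen the distribution toward greedy decoding."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Inverse-CDF sampling over the vocabulary
    r, cum = rng.random(), 0.0
    for idx, p in enumerate(probs):
        cum += p
        if r <= cum:
            return idx
    return len(probs) - 1

# A sharply peaked logit vector almost always selects index 2.
token = sample_next_token([0.1, 0.2, 8.0, 0.1, 0.3])
```

At temperature 1.0 this is exactly sampling from the softmax of the logits, matching the paper's stated choice.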

4.2 Data Augmentation (DA)

Navigation agents can benefit from data augmentation produced by a learned agent that provides additional, generated language instructions (Fried et al., 2018). Data pairs of generated novel language with visual observations along random routes in the environment can help with navigation generalization. We assess the effectiveness of such data augmentation in our two-agent dialog task.

To augment navigation training data, we choose a CVDN conversation but initialize the navigation agent in a random location in the environment, then sample multiple action trajectories and evaluate their progress towards the conversation goal location. In practice, we sample two trajectories and additionally consider the trajectory obtained from picking the top predicted action each time without sampling. We give the visual observations of the best path to the pretrained Questioner model to produce a relevant instruction. This augmentation allows the agent to explore and collect alternative routes to the goal location. We downweight the contributions of these noisier trajectories to the overall loss, so loss = λ · generations + (1 − λ) · human. We explored different ratios before settling on λ = 0.1. The choice of λ affects the fluency of the language generated, because a navigator too tuned to generated language leads to deviation from grammatically valid English and a lack of diversity (Section 6 and Appendix).
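The weighting can be sketched directly; the function name is illustrative, and `generated_loss` / `human_loss` stand for the batch losses computed on generated and human data respectively:

```python
def mixed_navigation_loss(generated_loss, human_loss, lam=0.1):
    """Down-weight the noisier self-generated trajectories relative to human
    dialog data: loss = lam * generated + (1 - lam) * human.
    lam = 0.1 is the ratio the paper settles on; larger values were found to
    degrade the fluency and diversity of the generated language."""
    return lam * generated_loss + (1.0 - lam) * human_loss
```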

4.3 Recursive Mental Model

We introduce the Recursive Mental Model agent (RMM),1 which is trained with reinforcement learning to propagate feedback through all three component models: Navigator, Questioner, and Guide. In this way, the training signal for question generation includes the training signal for answer generation, which in turn has access to the training signal from navigation error. Over training, the agent's progress towards the goal in the environment informs the dialog itself; each model educates the others (Figure 3). This model does not use any data augmentation but still explores the world and updates its representations and language.

Each model among the Navigator, Questioner, and Guide may sample N trajectories or generations of max length L. These samples in turn are considered recursively by the RMM agent, leading to N^T possible dialog trajectories, where T is at most the maximum trajectory length. To prevent unbounded exponential growth during training, each model is limited to a maximum number of total recursive calls per run. Search techniques, such as frontiers (Ke et al., 2019), could be employed in future work to guide the agent.

1 https://github.com/HomeroRR/rmm

Figure 3: The Recursive Mental Model allows for each sampled generation to spawn a new dialog and corresponding trajectory to the goal. The dialog that leads to the most goal progress is followed by the agent.
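The bounded recursion can be sketched as follows. `sample_continuations` and `progress` are hypothetical stand-ins for the learned Questioner/Guide/Navigator samplers and the environment's goal-progress signal, and the toy numeric "world" exists only to make the control flow runnable:

```python
def rmm_rollout(state, sample_continuations, progress, n_samples=3,
                max_depth=2, budget=None):
    """Hypothetical sketch of the RMM recursion. Each of the N sampled
    continuations (a question/answer/navigation exchange in the paper)
    spawns its own sub-dialog; the branch reaching the most goal progress
    is kept. `budget` caps the total number of recursive calls so the
    N^T tree cannot grow unboundedly during training."""
    if budget is None:
        budget = [8]                      # shared mutable call counter
    if max_depth == 0 or budget[0] <= 0:
        return progress(state), state
    budget[0] -= 1
    best = (progress(state), state)
    for nxt in sample_continuations(state, n_samples):
        score, leaf = rmm_rollout(nxt, sample_continuations, progress,
                                  n_samples, max_depth - 1, budget)
        if score > best[0]:
            best = (score, leaf)
    return best

# Toy world: states are integers, "progress" is closeness to goal state 10.
best_score, best_state = rmm_rollout(
    0,
    sample_continuations=lambda s, n: [s + k for k in range(1, n + 1)],
    progress=lambda s: -abs(10 - s),
)
```

With N = 3 and depth 2 the search keeps the branch ending nearest the goal; the budget counter plays the role of the paper's cap on total recursive calls per run.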

Training In the dialog task we introduce, the agents begin only knowing the name of the target object. The Navigator agent must move towards the goal room containing the target object, and can ask questions using the Questioner model. The Guide agent answers those questions given a privileged view of the next steps in the shortest path to the goal, rendered as visual observations.

We train using a reinforcement learning objective to learn a policy π_θ(τ|G) which maximizes the log-likelihood of the shortest path trajectory τ (Eq. 1), where a_t = f_{θ_D}(z_t, s_t) is the action decoder, z_t = f_{θ_E}(w_{1:t}) is the language encoder, and w_{1:t} is the dialog context at time t.

\log \pi_\theta(\tau \mid G) = \sum_{t=1}^{T} \log \pi_\theta(a_t \mid s_t, G, w_{1:t})    (1)

We can calculate the cross entropy loss between the generated action and the shortest path action at time t to do behavioral cloning before sampling the next action from the Navigator predictions.

Reward Shaping with Advantage Actor Critic As part of the Navigator loss, the goal progress can be leveraged for reward shaping. We use the Advantage Actor Critic (Sutton and Barto, 1998) formulation with regularization (Eq. 2), where A = Q^π(s_t, a_t) − V^π(s_t) is the advantage function:

-\sum_{t=1}^{T} A \log \pi_\theta(a_t \mid s_t, G, w_{1:t}) + \frac{1}{2} \sum_{t=1}^{T} A^2    (2)

The RL agent loss can then be expressed as the sum of the A2C loss with regularization and the cross entropy loss between the ground truth and the generated trajectories, CE(τ̂, τ). This is then propagated through the generation models Questioner and Guide as well, by simply accumulating the RL navigator loss on top of the standard generation cross entropy CE(Ŵ, W).
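As a sketch (not the authors' implementation), the regularized A2C term of Eq. 2 can be computed from per-timestep action log-probabilities and hypothetical critic estimates Q and V:

```python
def a2c_navigator_loss(log_probs, q_values, state_values):
    """Regularized advantage actor-critic loss (Eq. 2):
    -sum_t A_t * log pi(a_t | s_t, G, w_1:t) + 0.5 * sum_t A_t^2,
    with advantage A_t = Q(s_t, a_t) - V(s_t). Inputs are per-timestep
    scalars; in the paper this term is summed with the cross-entropy
    losses of the Navigator, Questioner, and Guide."""
    loss = 0.0
    for lp, q, v in zip(log_probs, q_values, state_values):
        advantage = q - v
        loss += -advantage * lp + 0.5 * advantage ** 2
    return loss
```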

Inference During training, exact environmental feedback can be used to evaluate samples and trajectories. This information is not available at inference, so we instead rely on the navigator's confidence to determine which of several sampled paths should be explored. Specifically, for every question-answer pair sampled, the agent rolls forward five navigation actions, and the probabilities of all resulting navigation sequences are compared. The trajectory with the highest probability is used for the next timestep. Note that this does not guarantee that the model is actually progressing towards the goal, but rather that the agent is confident that it is acting correctly given the dialog context and target object hint.
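A minimal sketch of this confidence-based selection, assuming each candidate rollout is summarized by the Navigator's per-action probabilities (names and the toy numbers are illustrative):

```python
import math

def pick_rollout(rollouts):
    """At inference, pick among sampled question-answer pairs by rolling the
    Navigator forward (five actions in the paper) and keeping the action
    sequence to which the model assigns the highest total probability.
    Each rollout is a list of per-action probabilities; comparing summed
    log-probs is the numerically stable equivalent of comparing products."""
    def seq_logprob(action_probs):
        return sum(math.log(p) for p in action_probs)
    return max(range(len(rollouts)), key=lambda i: seq_logprob(rollouts[i]))

# Three hypothetical 5-step rollouts; the second is the most confident.
best = pick_rollout([
    [0.5, 0.4, 0.6, 0.5, 0.3],
    [0.9, 0.8, 0.9, 0.7, 0.8],
    [0.6, 0.6, 0.5, 0.4, 0.5],
])
```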

4.4 Gameplay

As is common in dialog settings, there are several moving pieces and a growing notion of state throughout training and evaluation. In addition to the Navigator, Questioner, and Guide, ideally there should also be a model which generates the target object and one which determines when it is best to ask a question. We leave these two components for future work and instead assume we have access to the human provided target (e.g. a plant) and set the number of steps before asking a question to four, based on the human average of 4.5 in CVDN.

Setting a maximum trajectory length is required due to computational constraints, as the language context w_{1:j} grows with every exchange. Following Thomason et al. (2019), we use a maximum navigation length of 80 steps, leading to a maximum of 80/4 = 20 dialog question-answer pairs.

To simplify training we use a single model for both question and answer generation, and indicate the role of spans of text by prepending <NAV> (Questioner asks during navigation) or <ORA> (Guide answers to questions) tags (Figure 2a). During roll-outs the model is reinitialized to prevent information sharing via the hidden units.

                 |            Goal Progress (m) ↑             |      BLEU ↑
Model            | t_O   QA_{i-1}  QA_{1:i-1}  +Oracle Stop.  | QA_{i-1}  QA_{1:i-1}
-----------------+--------------------------------------------+---------------------
Val Seen
  Baseline       | 20.1    10.5      15.0         22.9        |   0.9       0.8
  Data Aug.      | 20.1    10.5      10.0         14.2        |   1.3       1.3
  RMM (N=1)      | 18.7    10.0      13.3         20.4        |   3.3       3.0
  RMM (N=3)      | 18.9    11.5      14.0         16.8        |   3.4       3.6
  Shortest Path  |              32.8                          |
Val Unseen
  Baseline       |  6.8     4.7       4.6          6.3        |   0.5       0.5
  Data Aug.      |  6.8     5.6       4.4          6.5        |   1.3       1.1
  RMM (N=1)      |  6.1     6.1       5.1          6.0        |   2.6       2.8
  RMM (N=3)      |  7.3     5.5       5.6          8.9        |   2.9       2.9
  Shortest Path  |              29.3                          |

Table 1: Gameplay results on CVDN, evaluated when the agent voluntarily stops or at 80 steps. Full evaluations are highlighted in gray with the best results in blue; the remaining white columns are ablation results.
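The role tagging can be sketched as below; the helper names are illustrative, and only the <NAV>/<ORA> tags themselves come from the paper:

```python
def tag_utterance(role, text):
    """Prepend the speaker-role tag used to share one generation model
    between the Questioner (<NAV>) and the Guide (<ORA>)."""
    tags = {"navigator": "<NAV>", "oracle": "<ORA>"}
    return f"{tags[role]} {text}"

def build_dialog_context(turns):
    """Flatten (role, utterance) pairs into a tagged dialog history w_1:t."""
    return " ".join(tag_utterance(role, text) for role, text in turns)

context = build_dialog_context([
    ("navigator", "should i go into the room?"),
    ("oracle", "you are in the goal room."),
])
# context == "<NAV> should i go into the room? <ORA> you are in the goal room."
```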

5 Results

In Table 1 we present gameplay results for our RMM model and competitive baselines. We report two main results and four ablations for seen and unseen house evaluations; the former are novel dialogs in houses seen at training time, while the latter are novel dialogs in novel houses.

Full Evaluation The full evaluation paradigm corresponds to QA_{1:i-1} for goal progress and BLEU. In this setup, the agent has access to and is attending over the entire dialog history up until the current timestep, in addition to the original target object t_O. We present three models and two conditions for RMM (N = 1 and N = 3). N refers to the number of samples explored in our recursive calls, so N = 1 corresponds to simply taking the single maximum prediction while N = 3 allows the agent to explore. In the second condition, the choice of path/dialog is determined by the probabilities assigned by the Navigator (Section 4.3).

An additional challenge for navigation agents is knowing when to stop. Following previous work (Anderson et al., 2018), we report Oracle Success Rates measuring the best goal progress the agents achieve along the trajectory, rather than the goal progress when the stop action is taken.

In unseen environments, the RMM based agent makes the most progress towards the goal and benefits from exploration during inference. During inference the agent is not provided any additional supervision, but still makes noticeable gains by evaluating trajectories based on learned Navigator confidence. Additionally, we see that while low, the BLEU scores are better for RMM based agents across settings.

Figure 4: Log-frequency of words generated by human speakers as compared to the Data Augmentation (DA) and our Recursive Mental Model (RMM) models.

Ablations We also include two simpler results: t_O, where the agent is only provided the target object and explores based on this simple goal, and QA_{i-1}, where the agent is only provided the previous question-answer pair. Both of these settings simplify the learning and evaluation by focusing the agent on search and less ambiguous language, respectively. There are two results to note. First, even in the simple case of t_O the RMM trained model generalizes best to unseen environments. In this setting, during inference all models have the same limited information, so the RL loss and exploration have better equipped RMM to generalize.

Second, several trends invert between the seen and unseen scenarios. Specifically, the simplest model with the least information performs best overall in seen houses. This high performance coupled with weak language appears to indicate the models are learning a different (perhaps search based) strategy rather than how to communicate via, and effectively utilize, dialog. In the QA_{i-1} and QA_{1:i-1} settings, the agent generates a question-answer pair before navigating, so the relative strength of the RMM model's communication becomes clear. We next analyze the language and behavior of our models to investigate these results.

6 Analysis

We analyze the lexical diversity and effectiveness of the questions generated by RMM, and present a qualitative inspection of generated dialogs.


(a) Normalized plot of goal progress and the number of questions asked by humans. Note, even for long dialogs, most questions lead to substantial progress towards the goal.

(b) DA and RMM generated dialogs make slower but consistent progress (ending below 25% of total goal progress).

Figure 5: Effectiveness of human dialogs (left) vs our models (right) at achieving the goal. The slopes indicate the effectiveness of each dialog exchange in reaching the target.

6.1 Lexical Diversity

Both RMM and Data Augmentation introduce new language by exploring the environment and generating dialogs. In the case of RMM, an RL loss is used to update the models based on the most successful dialog. In the Data Augmentation strategy, the best generations are simply appended to the dataset for one epoch and weighted appropriately for standard, supervised training. The augmentation strategy leads to a small boost in BLEU performance and goal progress in several settings (Table 1), but the language appears to collapse to repetitive and generic interactions. We see this manifest rather dramatically in Figure 4, where DA is limited to only 22 lexical types. In contrast, the Recursive Mental Model continues to produce over 500 unique lexical types, much closer to the nearly 900 of humans.
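A lexical-type count of this kind can be computed with a simple token counter (a sketch, assuming whitespace-tokenized utterances; the toy dialog is illustrative):

```python
from collections import Counter

def lexical_types(utterances):
    """Count word-type frequencies across generated utterances, the measure
    behind Figure 4 (DA collapses to ~22 types vs 500+ for RMM)."""
    return Counter(tok for u in utterances for tok in u.lower().split())

counts = lexical_types([
    "should i go into the room ?",
    "should i go into the room ?",
    "you are in the goal room .",
])
num_types = len(counts)   # number of distinct word types
```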

6.2 Effective Questions

A novel component of a dialog paradigm is assessing the efficacy of every speech act in accomplishing a goal. Specifically, the optimal question should elicit the optimal response, which in turn maximizes the progress towards the goal room. If agents were truly effective at modeling each other, we would expect the number of dialog acts to be kept to a minimum. We plot the percent of questions asked against the percent of goal progress in Figures 5a and 5b. Human conversations in CVDN always reach the goal location, and usually with only 3-4 questions (Figure 5a). We see that the relationship between questions and progress is roughly linear, excusing the occasional lost and confused human teams. The final human-human question is often simply confirmation that navigation has arrived successfully at the goal room.

Figure 6: Generated trajectories in an unseen environment (target: fire extinguisher). The red stop-sign is the target, while the black stop-signs are distractors (other fire extinguishers) that may confuse the agents. The white dashed trajectory is the human path from CVDN, black is the baseline model, and green is our RMM with N = 3.

In Figure 5b, we plot dialogs for the Baseline, Data Augmentation, and RMM agents against percent goal progress. The RMM consistently outperforms the other two agents in terms of goal progress for each dialog act. We see an increase in progress for the first 10 to 15 questions before the model levels off. In contrast, the other agents exhibit shallower curves and fail to reach the same level of performance.

6.2.1 Qualitative Results

Figure 1 gives a cherry-picked example trajectory, and Figure 6 gives a lemon-picked example trajectory, from the unseen validation environments.

We discuss the successes and failures of the lemon-picked Figure 6. As with all CVDN instances, there are multiple target object candidates (here, "fire extinguisher") but only one valid goal


Model   Conversation                                                        GP

Human   Q: Do I go in between the ropes to my right or straight forward?
        A: straight forward through the next room                           0
        Q: Should I proceed down the hall to the left or turn right?
        A: head down the hall to your right into the next room              13.3
        Q: Should I go through the open doors that are the closest to me?
        A: You are in the goal room                                         29.1

DA      Q: should i go into the room?
        A: you are in the goal room.                                        5.7
        Q: should i go into the room?
        A: you are in the goal room.                                        0.0

RMM     Q: should i head forward or bedroom the next hallway in front
           of me?
        A: yes, all the way down the small hall.                            4.0
        Q: should i turn left here?
        A: head into the house, then you will find a doorway at the goal
           staircase. go through the doors before those two small exit
           chairs, about half way down the hall.                            5.7
        Q: lots of sink in this house, or wrong did. ok which way do i go
        A: go down the hallway, take a left and go down the next hallway
           and up the stairs on the right.                                  8.8

Table 2: Dialog samples for Figure 6 with corresponding Goal Progress (GP) - see appendix for complete outputs.

room. Goal progress is measured against the goal room. When the Guide is shown the next few shortest path steps to communicate, those steps are towards the goal room. As can be seen in Figure 6, the learned agents have difficulty in deciding when to stop and begin retracing their steps.

This distinction is most obvious when comparing the language generated. Table 2 shows generated conversations along with the Goal Progress (GP) at each point when a question was asked. Note that the generation procedure for all models is the same sampler, and they start training from the same checkpoint, so the relatively coherent nature of the RMM as compared to the simple repetitiveness of the Data Augmentation is entirely due to the recursive calls and RL loss. No model has access to length penalties or other generation tricks to avoid degeneration.

7 Conclusions and Future Work

In this paper, we present a two-agent task paradigm for cooperative vision-and-dialog navigation. Previous work was limited to navigation only, or to navigation with limited additional instructions, while our work involves navigation, question asking, and question answering components for a full, end-to-end dialog. We demonstrate that a simple speaker model for data augmentation is insufficient for the dialog setting, and instead see promising results from a recursive RL formulation with turn taking informed by theory of mind. This task presents novel and complex challenges for future work, including modeling uncertainty for when to ask a question, incorporating world knowledge for richer notions of common ground, and increasingly effective question-answer generation.

ReferencesPeter Anderson, Qi Wu, Damien Teney, Jake Bruce,

Mark Johnson, Niko Sunderhauf, Ian Reid, StephenGould, and Anton van den Hengel. 2018. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environ-ments. In Computer Vision and Pattern Recognition(CVPR).

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben-gio. 2015. Neural machine translation by jointlylearning to align and translate. In International Con-ference on Learning Representations (ICLR).

Yonatan Bisk, Ari Holtzman, Jesse Thomason, JacobAndreas, Yoshua Bengio, Joyce Chai, Mirella Lap-ata, Angeliki Lazaridou, Jonathan May, AleksandrNisnevich, Nicolas Pinto, and Joseph Turian. 2020.Experience Grounds Language. ArXiv.

Antoine Bordes and Jason Weston. 2017. Learningend-to-end goal-oriented dialog. In InternationalConference on Learning Representations (ICLR).

Kris Cao, Angeliki Lazaridou, Marc Lanctot, Joel ZLeibo, Karl Tuyls, and Stephen Clark. 2018. Emer-gent communication through negotiation. In Inter-national Conference on Learning Representations(ICLR).

Joyce Y Chai, Qiaozi Gao, Lanbo She, Shaohua Yang,Sari Saba-Sadiya, and Guangyue Xu. 2018. Lan-guage to action: Towards interactive task learningwith physical agents. In International Joint Confer-ence on Artificial Intelligence (IJCAI).

Angel Chang, Angela Dai, Thomas Funkhouser, Ma-ciej Halber, Matthias Niener, Manolis Savva, ShuranSong, Andy Zeng, and Yinda Zhang. 2017. Matter-port3d: Learning from rgb-d data in indoor environ-ments. In International Conference on 3D Vision.

David Chen and Raymond J Mooney. 2011. Learningto interpret natural language navigation instructionsfrom observations. In Conference on Artificial Intel-ligence (AAAI).

Page 9: RMM: A Recursive Mental Model for Dialog Navigation · 2020-05-05 · RMM: A Recursive Mental Model for Dialog Navigation Homero Roman Roman 1Yonatan Bisk;2 Jesse Thomason3 Asli Celikyilmaz1

Howard Chen, Alane Suhr, Dipendra Misra, NoahSnavely, and Yoav Artzi. 2019. Touchdown: Nat-ural language navigation and spatial reasoning in vi-sual street environments. In Computer Vision andPattern Recognition (CVPR).

Ta-Chung Chi, Mihail Eric, Seokhwan Kim, MinminShen, and Dilek Hakkani-tur. 2020. Just ask: Aninteractive learning framework for vision and lan-guage navigation. In Conference on Artificial Intel-ligence (AAAI).

Abhishek Das, Satwik Kottur, Khushi Gupta, AviSingh, Deshraj Yadav, Jose M.F. Moura, DeviParikh, and Dhruv Batra. 2017a. Visual Dialog. InComputer Vision and Pattern Recognition (CVPR).

Abhishek Das, Satwik Kottur, Jose M.F. Moura, Ste-fan Lee, and Dhruv Batra. 2017b. Learning coop-erative visual dialog agents with deep reinforcementlearning. In International Conference on ComputerVision (ICCV).

Michael C. Frank and Noah D. Goodman. 2012. Pre-dicting pragmatic reasoning in language games. Sci-ence, 336(6084):998–998.

Daniel Fried, Ronghang Hu, Volkan Cirik, AnnaRohrbach, Jacob Andreas, Louis-Philippe Morency,Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein,and Trevor Darrell. 2018. Speaker-follower modelsfor vision-and-language navigation. In Neural Infor-mation Processing Systems (NeurIPS).

Jianfeng Gao, Michel Galley, and Lihong Li. 2019.Neural approaches to conversational AI. Founda-tions and Trends in Information Retrieval.

Alison Gopnik and Henry M Wellman. 1992. Why thechilds theory of mind really is a theory. Mind ‘I&’Language, 7 (1-2):145171.

Herbert Paul Grice, P Cole, and J J Morgan. 1975.Logic and conversation. Syntax and Semantics, vol-ume 3: Speech Acts, pages 41–58.

Donghoon Ham, Jeong-Gwan Lee, and Youngsoo JangandKee Eung Kim. 2020. End-to-end neuralpipeline for goal-oriented dialogue system using gpt-2. In Conference on Association for the Advance-ment of Artificial Intelligence (AAAI).

Weituo Hao, Chunyuan Li, Xiujun Li, Lawrence Carin,and Jianfeng Gao. 2020. Towards Learning aGeneric Agent for Vision-and-Language Navigationvia Pre-training. In Computer Vision and PatternRecognition (CVPR).

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition. arXiv:1512.03385.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Da Young Ju, Kurt Shuster, Y-Lan Boureau, and Jason Weston. 2019. All-in-one image-grounded conversational agents. arXiv:1912.12394.

Liyiming Ke, Xiujun Li, Yonatan Bisk, Ari Holtzman, Zhe Gan, Jingjing Liu, Jianfeng Gao, Yejin Choi, and Siddhartha Srinivasa. 2019. Tactical rewind: Self-correction via backtracking in vision-and-language navigation. In Computer Vision and Pattern Recognition (CVPR).

Bing Liu, Gokhan Tur, Dilek Hakkani-Tur, Pararth Shah, and Larry P. Heck. 2017. End-to-end optimization of task-oriented dialogue model with deep reinforcement learning. arXiv:1711.10712.

Chih-Yao Ma, Zuxuan Wu, Ghassan AlRegib, Caiming Xiong, and Zsolt Kira. 2019. The regretful agent: Heuristic-aided navigation through progress estimation. In Computer Vision and Pattern Recognition (CVPR).

Nikola Mrkšić, Diarmuid Ó Séaghdha, Tsung-Hsien Wen, Blaise Thomson, and Steve J. Young. 2017. Neural belief tracker: Data-driven dialogue state tracking. In Association for Computational Linguistics (ACL).

Anjali Narayan-Chen, Prashant Jayannavar, and Julia Hockenmaier. 2019. Collaborative dialogue in Minecraft. In Association for Computational Linguistics (ACL).

Khanh Nguyen and Hal Daumé III. 2019. Help, Anna! Visual navigation with natural multimodal assistance via retrospective curiosity-encouraging imitation learning. In Empirical Methods in Natural Language Processing (EMNLP).

Khanh Nguyen, Debadeepta Dey, Chris Brockett, and Bill Dolan. 2019. Vision-based navigation with language-based assistance via imitation learning with indirect intervention. In Computer Vision and Pattern Recognition (CVPR).

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Association for Computational Linguistics (ACL).

Baolin Peng, Xiujun Li, Lihong Li, Jianfeng Gao, Asli Celikyilmaz, Sungjin Lee, and Kam-Fai Wong. 2017. Composite task-completion dialogue system via hierarchical deep reinforcement learning. In Empirical Methods in Natural Language Processing (EMNLP).

Pei-Hao Su, Milica Gasic, Nikola Mrkšić, Lina Maria Rojas-Barahona, Stefan Ultes, David Vandyke, Tsung-Hsien Wen, and Steve J. Young. 2016. On-line active reward learning for policy optimisation in spoken dialogue systems. In Association for Computational Linguistics (ACL).

Richard S. Sutton and Andrew G. Barto. 1998. Reinforcement learning: An introduction. MIT Press, Cambridge.


Stefanie Tellex, Ross A. Knepper, Adrian Li, Thomas M. Howard, Daniela Rus, and Nicholas Roy. 2014. Asking for help using inverse semantics. In Robotics: Science and Systems (RSS).

Jesse Thomason, Michael Murray, Maya Cakmak, and Luke Zettlemoyer. 2019. Vision-and-dialog navigation. In Conference on Robot Learning (CoRL).

I. Vlad Serban, R. Lowe, P. Henderson, L. Charlin, and J. Pineau. 2015. A Survey of Available Corpora for Building Data-Driven Dialogue Systems. arXiv.

Harm de Vries, Kurt Shuster, Dhruv Batra, Devi Parikh, Jason Weston, and Douwe Kiela. 2018. Talk the walk: Navigating New York City through grounded dialogue. arXiv:1807.03367.

Harm de Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron C. Courville. 2017. GuessWhat?! Visual object discovery through multi-modal dialogue. In Computer Vision and Pattern Recognition (CVPR).

Xin Wang, Qiuyuan Huang, Asli Celikyilmaz, Jianfeng Gao, Dinghan Shen, Yuan-Fang Wang, William Yang Wang, and Lei Zhang. 2019. Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation. In Computer Vision and Pattern Recognition (CVPR).

Xin Wang, Vihan Jain, Eugene Ie, William Yang Wang, Zornitsa Kozareva, and Sujith Ravi. 2020. Environment-agnostic Multitask Learning for Natural Language Grounded Navigation. arXiv:2003.00443.

Yi Zhu, Fengda Zhu, Zhaohuan Zhan, Bingqian Lin, Jianbin Jiao, Xiaojun Chang, and Xiaodan Liang. 2020. Vision-Dialog Navigation by Exploring Cross-modal Memory. In Computer Vision and Pattern Recognition (CVPR).


A Appendix

A.1 Training Details

Action Decoder Pretraining. We train with a batch size of 100 for 20,000 iterations. The model updates after every 100 iterations using RMSProp with a learning rate of 0.0001 and weight decay of 0.0005.
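As a rough illustration of the optimizer settings above, here is a minimal pure-Python sketch of a single-parameter RMSProp update with L2 weight decay. The function name, the smoothing constant `ALPHA`, the epsilon, and the toy objective are illustrative assumptions (not from the paper); in practice a library optimizer such as `torch.optim.RMSprop` would be used with the stated learning rate and weight decay.

```python
import math

# Hyperparameters stated in A.1 (learning rate, weight decay); ALPHA and
# EPS are assumed defaults, not values given in the paper.
LR = 1e-4
WEIGHT_DECAY = 5e-4
ALPHA = 0.99
EPS = 1e-8

def rmsprop_step(param, grad, sq_avg):
    """One RMSProp update for a single scalar parameter.

    Keeps a running average of squared gradients (sq_avg) and scales
    the step by its square root, after applying L2 weight decay.
    """
    grad = grad + WEIGHT_DECAY * param
    sq_avg = ALPHA * sq_avg + (1.0 - ALPHA) * grad ** 2
    param = param - LR * grad / (math.sqrt(sq_avg) + EPS)
    return param, sq_avg

# Toy usage: minimize f(x) = x^2, whose gradient is 2x.
x, s = 5.0, 0.0
for _ in range(1000):
    x, s = rmsprop_step(x, 2.0 * x, s)
```

After the loop, `x` has moved toward the minimum at 0 but remains positive, since the effective per-step size is bounded near the learning rate once the squared-gradient average warms up.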

A.2 Conversation Examples

Full human and baseline conversations for Figure 6.

Conversation (GP = goal progress)

Human
Q: Do I go in between the ropes to my right or straight forward? [GP: 0]
A: straight forward through the next room
Q: Should I proceed down the hall to the left of turn right? [GP: 13.31]
A: head down the hall to your right into the next room
Q: Should I go through the open doors that are the closest to me? [GP: 29.05]
A: You are in the goal room

Baseline
Q: ok now which way ? [GP: 4.0]
A: keep going forward and then into a bedroom on your left . it ’ s to the left of the bed .
Q: do i continue going to the right into the living room ? [GP: 5.7]
A: go into the door in front of you , then an right , the bathroom ’ s there .
Q: do i keep going to the right into the living room ? [GP: 8.8]
A: it looks like that is the goal room .
Q: should i go into the laundry room to my right , or do i need to go in ? [GP: 7.5]
A: i see many ways to tell . you just go around . you ’ re in the right room .
Q: ok now which way ? [GP: 4.0]
A: go into the room near you . that should be the goal
Q: okay i see any set of stairs . should i go up them ? [GP: 0.4]
A: turn right , then another right , the bathroom ’ s there .
Q: ok . can you go more directions . i see many rooms on this floor . [GP: 0]
A: turn right , then another right , the bathroom ’ s there .
Q: should i go the room ? or into the door to my left ? [GP: 4.0]
A: turn right , then another right , the bathroom ’ s there .
Q: should i stay in this room or go out ? [GP: 8.8]
A: turn right , then another right , the bathroom ’ s there .
Q: ok , should i go in this direction ? [GP: 9.9]
A: turn right , go past the pool table , into the room just towards the stairs . that should be the goal .
Q: should i go left toward the room , or past the painting and straight ? or to my right ? [GP: 7.5]
A: turn right , then another right , the bathroom ’ s there .
Q: should i take a left here in the room ? [GP: 5.7]
A: that is the goal room
Q: should i stay in this room or go out ? [GP: 8.8]
A: you see the front , that is the room . its goal room
Q: the room ? or into the door to my left ? [GP: 13.3]
A: go up the stairs in front of you , then to your left , you ’ ll see a bathroom . i think that bathroom is the goal
Q: should i go left toward the room , or past the painting and straight ? or to my right ? [GP: 9.3]
A: go towards the table in front of me of the hallway after take a right , then make a left into the sitting room .
Q: should i go to the right towards the blue painting ? [GP: 9.3]
A: continue going up , go to the bedroom
Q: should i go into the door on the left , or keep going forward ? [GP: 11.1]
A: go into the bedroom near you . that should be the goal
Q: do i need to go up ? [GP: 8.8]
A: go out in the room , make a right past the blue painting and go up the stairs
Q: ok now which way ? [GP: 7.5]
A: go up the spiral stairs again when you ’ go into that bedroom straight ahead .
Q: should i go outside ? [GP: 8.8]
A: a bathroom on pretty left , go in it and then into the bathroom in that bedroom


Data Augmentation and RMM conversations for Figure 6.

Conversation (GP = goal progress)

DA
Q: should i go into the room? [GP: 5.66]
A: you are in the goal room.
Q: should i go into the room? [GP: 0.03]
A: you are in the goal room.
...

RMM
Q: should i head forward or bedroom the next hallway in front of me ? [GP: 4.0]
A: yes , all the way down the small hall .
Q: should i turn left here ? [GP: 5.7]
A: head into the house , then you will find a doorway at the goal staircase . go through the doors before those two small exit chairs , about half way down the hall .
Q: lots of sink in this house , or wrong did . ok which way do i go [GP: 8.8]
A: go down the hallway , take a left and go down the next hallway and up the stairs on the right .
Q: right place now ? [GP: 7.5]
A: turn left and when you reach to the bathroom . before these go up the stairs on your left , and walk passed the mirror and chairs on the left into hallway across from the stairs
Q: do i go left , right , or up the stairs ? [GP: 4.0]
A: turn to your left and keep going straight
Q: do i make a right turn or shall i enter the room to my left ? [GP: 0.4]
A: turn left to go down the hall , then that fire bed has that
Q: left or continue ? [GP: 0]
A: yes , go past the dining table and take an immediate right . head through the small door to the left of the window before those two way doors behind you , go up the second small set of stairs .
Q: should i go downstairs or turn left ? [GP: 4.0]
A: go to the right and go to the front doorway .
Q: should i go down the stairs or stay on this floor ? [GP: 8.8]
A: take a left towards the bathroom , then take a left before it and go all the way down the hall
Q: do i go up these is to the right or right of the steps ? [GP: 9.9]
A: go to the left side of the staircase and turn left in the doorway before the two small office chairs , about half way down the hall .
Q: should i turn left , go straight into the living room , or up the stairs ? [GP: 7.5]
A: turn to your right and go straight down the hall
Q: do i go out into the hallway ? [GP: 5.7]
A: go left down the hall where the office floor . and pass the second door after a right and table .
Q: ok , should i go right or left next ? [GP: 8.8]
A: go back to the staircase . go through the doorway you and before the hallway on the right .
Q: do i make a left turn or shall i enter the room to my left ? [GP: 13.3]
A: go down the hall and turn right into the bedroom
Q: should i go to the left or the right ? [GP: 9.3]
A: yes , go out of this room , turn right and go down the white hall before the staircase stairs , then go down the way down that way you get .
Q: ok i was a in by this office painting , or i just in the second hallway in front of me ? [GP: 9.3]
A: okay .
Q: which way do i go in , or do i head up the stairs ? [GP: 11.1]
A: go all the way to the one of the staircase . turn left in the doorway before the two two office chairs , about half way down the hall .
Q: ok wrong far which way do i go [GP: 8.8]
A: right then at the top of the stairs .
Q: left or continue ? [GP: 7.5]
A: yes . go down the hall and stop at the landing of the stairs .