



Proceedings of the SIGDIAL 2016 Conference, pages 339–349, Los Angeles, USA, 13-15 September 2016. © 2016 Association for Computational Linguistics

Training an adaptive dialogue policy for interactive learning of visually grounded word meanings

Yanchao Yu, Interaction Lab, Heriot-Watt University, [email protected]
Arash Eshghi, Interaction Lab, Heriot-Watt University, [email protected]
Oliver Lemon, Interaction Lab, Heriot-Watt University, [email protected]

Abstract

We present a multi-modal dialogue system for interactive learning of perceptually grounded word meanings from a human tutor. The system integrates an incremental, semantic parsing/generation framework – Dynamic Syntax and Type Theory with Records (DS-TTR) – with a set of visual classifiers that are learned throughout the interaction and which ground the meaning representations that it produces. We use this system in interaction with a simulated human tutor to study the effects of different dialogue policies and capabilities on accuracy of learned meanings, learning rates, and efforts/costs to the tutor. We show that the overall performance of the learning agent is affected by (1) who takes initiative in the dialogues; (2) the ability to express/use their confidence level about visual attributes; and (3) the ability to process elliptical and incrementally constructed dialogue turns. Ultimately, we train an adaptive dialogue policy which optimises the trade-off between classifier accuracy and tutoring costs.

1 Introduction

Identifying, classifying, and talking about objects or events in the surrounding environment are key capabilities for intelligent, goal-driven systems that interact with other agents and the external world (e.g. robots, smart spaces, and other automated systems). To this end, there has recently been a surge of interest and significant progress made on a variety of related tasks, including generation of Natural Language (NL) descriptions of images, or identifying images based on NL descriptions (Karpathy and Fei-Fei, 2014; Bruni et al., 2014; Socher et al., 2014; Farhadi et al., 2009; Silberer and Lapata, 2014; Sun et al., 2013).

Figure 1: Example dialogues & interactively agreed semantic contents.

Our goal is to build interactive systems that can learn grounded word meanings relating to their perceptions of real-world objects – this is different from previous work such as e.g. (Roy, 2002), that learns groundings from descriptions without any interaction, and more recent work using Deep Learning methods (e.g. (Socher et al., 2014)).

Most of these systems rely on training data of high quantity with no possibility of online error correction. Furthermore, they are unsuitable for robots and multimodal systems that need to continuously and incrementally learn from the environment, and may encounter objects they haven't seen in training data. These limitations are likely to be alleviated if systems can learn concepts, as and when needed, from situated dialogue with humans. Interaction with a human tutor also enables systems to take initiative and seek the particular information they need or lack by e.g. asking questions with the highest information gain (see e.g. (Skocaj et al., 2011), and Fig. 1). For example, a robot could ask questions to learn the colour of a "square" or request to be presented with more "red" things to improve its performance on the concept (see e.g. Fig. 1). Furthermore, such systems could allow for meaning negotiation in the



form of clarification interactions with the tutor.

This setting means that the system must be trainable from little data, compositional, adaptive, and able to handle natural human dialogue with all its glorious context-sensitivity and messiness – for instance so that it can learn visual concepts suitable for specific tasks/domains, or even those specific to a particular user. Interactive systems that learn continuously, and over the long run, from humans need to do so incrementally, quickly, and with minimal effort/cost to human tutors.

In this paper, we first outline an implemented dialogue system that integrates an incremental, semantic grammar framework, especially suited to dialogue processing – Dynamic Syntax and Type Theory with Records (DS-TTR¹ (Kempson et al., 2001; Eshghi et al., 2012)) – with visual classifiers which are learned during the interaction, and which provide perceptual grounding for the basic semantic atoms in the semantic representations (Record Types in TTR) produced by the parser (see Fig. 1, Fig. 2 and section 3).

¹Download from http://dylan.sourceforge.net

We then use this system in interaction with a simulated human tutor to test hypotheses about how the accuracy of learned meanings, learning rates, and the overall cost/effort for the human tutor are affected by different dialogue policies and capabilities: (1) who takes initiative in the dialogues; (2) the agent's ability to utilise its level of uncertainty about an object's attributes; and (3) its ability to process elliptical as well as incrementally constructed dialogue turns. The results show that differences along these dimensions have significant impact both on the accuracy of the grounded word meanings that are learned, and on the processing effort required of the tutor.

In section 4.3 we train an adaptive dialogue strategy that finds a better trade-off between classifier accuracy and tutoring cost.

2 Related work

In this section, we present an overview of vision and language processing systems, as well as multi-modal systems that learn to associate them. We compare them along two main dimensions: visual classification methods (offline vs. online), and the kinds of representation learned/used.

Online vs. Offline Learning. A number of implemented systems have shown good performance on classification as well as NL-description of novel physical objects and their attributes, either using offline methods as in (Farhadi et al., 2009; Lampert et al., 2014; Socher et al., 2013; Kong et al., 2013), or through an incremental learning process, where the system's parameters are updated after each training example is presented to the system (Furao and Hasegawa, 2006; Zheng et al., 2013; Kristan and Leonardis, 2014). For the interactive learning task presented here, only the latter is appropriate, as the system is expected to learn from its interactions with a human tutor over a period of time. Shen & Hasegawa (2006) propose the SOINN-SVM model that re-trains linear SVM classifiers with data points that are clustered together with all the examples seen so far. The clustering is done incrementally, but the system needs to keep all the examples seen so far in memory. Kristan & Leonardis (2014), on the other hand, propose the oKDE model that continuously learns categorical knowledge about visual attributes as probability distributions over the categories (e.g. colours). However, when learning from scratch, it is unrealistic to predefine these concept groups (e.g. that red, blue, and green are colours). Systems need to learn for themselves that, e.g., colour is grounded in a specific sub-space of an object's features. For the visual classifiers, we therefore assume no such category groupings here, and instead learn individual binary classifiers for each visual attribute (see section 3.1 for details).

Distributional vs. Logical Representations. Learning to ground natural language in perception is one of the fundamental problems in Artificial Intelligence. There are two main strands of work that address this problem: (1) those that learn distributional representations using Deep Learning methods: this often works by projecting vector representations from different modalities (e.g. vision and language) into the same space in order to be able to retrieve one from the other (Socher et al., 2014; Karpathy and Li, 2015; Silberer and Lapata, 2014); (2) those that attempt to ground symbolic logical forms, obtained through semantic parsing (Tellex et al., 2014; Kollar et al., 2013; Matuszek et al., 2014), in classifiers of various entity types/events/relations in a segment of an image or a video. Perhaps one advantage of the latter over the former method is that it is strictly compositional, i.e. the contribution of the meaning of an individual word, or semantic atom, to the whole representation is clear, whereas this is hard to say about the distributional models. As noted, our work also uses the latter methodology, though it is dialogue, rather than sentence semantics, that we care about. Most similar to our work is probably that of Kennington & Schlangen (2015), who learn a mapping between individual words – rather than logical atoms – and low-level visual features (e.g. colour values) directly. The system is compositional, yet does not use a grammar (the compositions are defined by hand). Further, the groundings are learned from pairings of object references in NL and images rather than from dialogue.

What sets our approach apart from others is: a) that we use a domain-general, incremental semantic grammar with principled mechanisms for parsing and generation; b) that, given the DS model of dialogue (Eshghi et al., 2015), representations are constructed jointly and interactively by the tutor and system over the course of several turns (see Fig. 1); c) that perception and NL-semantics are modelled in a single logical formalism (TTR); and d) that we effectively induce an ontology of atomic types in TTR, which can be combined in arbitrarily complex ways for the generation of complex descriptions of arbitrarily complex visual scenes (see e.g. (Dobnik et al., 2012), and compare this with (Kennington and Schlangen, 2015), who do not use a grammar and therefore do not have logical structure over grounded meanings).

3 System Architecture

We have developed a system to support an attribute-based object learning process through natural, incremental spoken dialogue interaction. The architecture of the system is shown in Fig. 2. The system has two main modules: a vision module for visual feature extraction, classification, and learning; and a dialogue system module using DS-TTR. Below we describe these components individually and then explain how they interact.

3.1 Attribute-based Classifiers used

Yu et al. (2015a; 2015b) point out that neither multi-label classification models nor 'zero-shot' learning models show acceptable performance on attribute-based learning tasks. Here, we instead use Logistic Regression SVM classifiers with Stochastic Gradient Descent (SGD) (Zhang, 2004) to incrementally learn attribute predictions.

All classifiers output attribute-based label sets and corresponding probabilities for novel unseen images by predicting binary label vectors. We build visual feature representations to learn classifiers for particular attributes, as explained in the following subsections.
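To make this concrete, here is a minimal sketch (not the authors' released code) of per-attribute binary classifiers trained incrementally, using scikit-learn's SGDClassifier with logistic loss as a stand-in for the logistic-regression-with-SGD classifiers described above; the class and method names are our own:

```python
# A sketch of incrementally trained, per-attribute binary classifiers.
import numpy as np
from sklearn.linear_model import SGDClassifier

class AttributeClassifiers:
    """One binary classifier per visual attribute (e.g. 'red', 'square')."""

    def __init__(self):
        self.models = {}  # attribute label -> classifier

    def update(self, attribute, features, is_positive):
        """Incrementally update one attribute classifier with one example."""
        if attribute not in self.models:
            # logistic loss ('log' in older scikit-learn versions)
            self.models[attribute] = SGDClassifier(loss="log_loss")
        X = np.asarray(features, dtype=float).reshape(1, -1)
        y = np.asarray([1 if is_positive else 0])
        # partial_fit performs one SGD step; classes must be given up front
        self.models[attribute].partial_fit(X, y, classes=np.array([0, 1]))

    def predict(self, features):
        """Return {attribute: P(positive)} for a novel image."""
        X = np.asarray(features, dtype=float).reshape(1, -1)
        return {attr: float(clf.predict_proba(X)[0, 1])
                for attr, clf in self.models.items()}
```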

3.1.1 Visual Feature Representation

In contrast to previous work (Yu et al., 2015a; Yu et al., 2015b), to reduce feature noise through the learning process, we simplify the feature extraction method to two base feature categories: the colour space for colour attributes, and a 'bag of visual words' for object shapes/class.

Colour descriptors, consisting of HSV colour space values, are extracted for each pixel and then quantized into a 16×4×4 HSV matrix. The descriptors inside the bounding box are binned into individual histograms. Meanwhile, a bag of visual words is built from PHOW descriptors using a visual dictionary (pre-defined with a hand-made image set). These visual words are calculated using 2×2 blocks and a 4-pixel step size, and quantized into 1024 k-means centres. The feature extractor in the vision module produces a 1280-dimensional feature vector for a single training/test instance by stacking all quantized features, as shown in Figure 2.
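The following sketch illustrates how such a stacked 1280-dimensional vector could be assembled: a 16×4×4 HSV histogram (256 bins) concatenated with a 1024-bin bag of visual words. The PHOW descriptor extraction and the pre-trained k-means vocabulary are assumed to come from elsewhere (e.g. a VLFeat-style pipeline) and are not specified further in the text:

```python
# A sketch of the 1280-dim feature vector: 256-bin HSV histogram
# stacked with a 1024-bin bag of visual words.
import numpy as np

def hsv_histogram(hsv_pixels):
    """hsv_pixels: (N, 3) array with H, S, V each scaled to [0, 1)."""
    bins = (16, 4, 4)  # per-channel quantisation from the text
    hist, _ = np.histogramdd(hsv_pixels, bins=bins, range=[(0, 1)] * 3)
    return hist.ravel() / max(hist.sum(), 1)  # 256 dims, L1-normalised

def bag_of_visual_words(descriptors, vocab):
    """Assign each PHOW descriptor to its nearest of 1024 k-means centres."""
    # (D, 1024) squared distances between descriptors and centres
    d = ((descriptors[:, None, :] - vocab[None, :, :]) ** 2).sum(-1)
    words = d.argmin(axis=1)
    hist = np.bincount(words, minlength=vocab.shape[0])
    return hist / max(hist.sum(), 1)  # 1024 dims

def feature_vector(hsv_pixels, descriptors, vocab):
    # stacked 256 + 1024 = 1280-dimensional instance representation
    return np.concatenate([hsv_histogram(hsv_pixels),
                           bag_of_visual_words(descriptors, vocab)])
```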

3.2 Dynamic Syntax and Type Theory with Records

Dynamic Syntax (DS) is a word-by-word incremental semantic parser/generator, based around the Dynamic Syntax grammar framework (Cann et al., 2005), especially suited to the fragmentary and highly contextual nature of dialogue. In DS, dialogue is modelled as the interactive and incremental construction of contextual and semantic representations (Eshghi et al., 2015). The contextual representations afforded by DS are of the fine-grained semantic content that is jointly negotiated/agreed upon by the interlocutors, as a result of processing questions and answers, clarification requests, corrections, acceptances, etc. We cannot go into any further detail due to lack of space, but proceed to introduce Type Theory with Records, the formalism in which the DS contextual/semantic representations are couched, but also that within which perception is modelled here.

Figure 2: Architecture of the teachable system

Type Theory with Records (TTR) is an extension of standard type theory shown to be useful in semantics and dialogue modelling (Cooper, 2005; Ginzburg, 2012). TTR is particularly well-suited to our problem here as it allows information from various modalities, including vision and language, to be represented within a single semantic framework (see e.g. Larsson (2013) and Dobnik et al. (2012), who use it to model the semantics of spatial language and perceptual classification).

In TTR, logical forms are specified as record types (RTs), which are sequences of fields of the form [ l : T ] containing a label l and a type T. RTs can be witnessed (i.e. judged true) by records of that type, where a record is a sequence of label-value pairs [ l = v ]. We say that [ l = v ] is of type [ l : T ] just in case v is of type T.

$$R_1 : \begin{bmatrix} l_1 &:& T_1 \\ l_2{=}a &:& T_2 \\ l_3{=}p(l_2) &:& T_3 \end{bmatrix} \qquad R_2 : \begin{bmatrix} l_1 &:& T_1 \\ l_2 &:& T_{2'} \end{bmatrix} \qquad R_3 : [\,]$$

Figure 3: Example TTR record types

Fields can be manifest, i.e. given a singleton type, e.g. [ l : T_a ] where T_a is the type of which only a is a member; here, we write this using the syntactic sugar [ l=a : T ]. Fields can also be dependent on fields preceding them (i.e. higher) in the record type (see Fig. 3).

The standard subtype relation ⊑ can be defined for record types: R1 ⊑ R2 if, for all fields [ l : T2 ] in R2, R1 contains [ l : T1 ] where T1 ⊑ T2. In Figure 3, R1 ⊑ R2 if T2 ⊑ T2′, and both R1 and R2 are subtypes of R3. This subtyping relation allows semantic information to be incrementally specified, i.e. record types can be indefinitely extended with more information/constraints. For us here, this is a key feature since it allows the system to encode partial knowledge about objects, and for this knowledge (e.g. object attributes) to be extended in a principled way, as and when this information becomes available.
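As an illustration only, the subtype check just described can be rendered over a much-simplified encoding of record types as label-to-type maps (dependent and manifest fields omitted; type names hypothetical):

```python
# A simplified rendering of TTR record-type subtyping: R1 is a subtype
# of R2 iff every field of R2 appears in R1 with a compatible type.
def is_subtype(t1, t2, atomic_subtype=lambda a, b: a == b):
    """Record types are dicts mapping labels to types (strings or dicts)."""
    if isinstance(t2, dict):  # record type: check all of its fields
        return (isinstance(t1, dict) and
                all(l in t1 and is_subtype(t1[l], t2[l], atomic_subtype)
                    for l in t2))
    return atomic_subtype(t1, t2)  # atomic type comparison

R3 = {}                                 # the empty record type
R2 = {"l1": "T1", "l2": "T2"}
R1 = {"l1": "T1", "l2": "T2", "l3": "T3"}

assert is_subtype(R1, R2)               # R1 extends R2 with more constraints
assert is_subtype(R1, R3) and is_subtype(R2, R3)  # everything subtypes []
```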

3.3 Integration

Fig. 2 shows how the various parts of the system interact. At any point in time, the system has access to an ontology of (object) types and attributes encoded as a set of TTR Record Types, whose individual atomic symbols, such as 'red' or 'square', are grounded in the set of classifiers trained so far.

Given a set of individuated objects in a scene, encoded as a TTR Record, the system can utilise its existing ontology to output a Record Type which maximally characterises the scene (see e.g. Fig. 1). Dynamic Syntax operates over the same representations, which thus provide a direct interface between perceptual classification and semantic processing in dialogue: this representation acts not only as (1) the non-linguistic (here, visual) context of the dialogue for the resolution of e.g. definite reference and indexicals (see (Hough and Purver, 2014)); but also as (2) the logical database from which the system can generate utterances (descriptions), and ask or answer questions about the objects. Fig. 4 illustrates how the semantics of the answer to a question is retrieved from the visual context through unification (this uses the standard subtype checking operation within TTR).

Figure 4: Question Answering by the system

Conversely, for concept learning, the DS-TTR parser incrementally produces Record Types (RTs), representing the meaning jointly established by the tutor and the system so far. In this domain, this is ultimately one or more type judgements, i.e. that some scene/image/object is judged to be of a particular type, e.g. in Fig. 1 that the individuated object o1 is a red square. These jointly negotiated type judgements then go on to provide training instances for the classifiers. In general, the training instances are of the form $\langle O, T \rangle$, where $O$ is an image/scene segment (an object or TTR Record), and $T$ a record type. $T$ is then decomposed into its constituent atomic types $T_1 \ldots T_n$, s.t. $\bigwedge_i T_i = T$, where $\wedge$ is the so-called meet operation corresponding to type conjunction. The judgements $O : T_i$ are then used directly to train the classifiers that ground the $T_i$.
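A sketch of this decomposition step, under the simplified encoding used earlier: the agreed record type is split into its atomic conjuncts, and each judgement O : Ti becomes a training instance for the classifier grounding Ti. The treatment of corrections as negative instances is our assumption, not a detail given in the text:

```python
# A sketch of turning an agreed type judgement O : T into per-attribute
# training instances; builds on the AttributeClassifiers sketch above.
def train_from_judgement(classifiers, features, record_type):
    """record_type: e.g. {'p1': 'red', 'p2': 'square'} - the meet of
    these atomic conjuncts is the agreed type of the object O."""
    for atomic_type in record_type.values():
        # each judgement O : Ti is a positive instance for Ti's classifier
        classifiers.update(atomic_type, features, is_positive=True)

def train_from_correction(classifiers, features, rejected, corrected):
    """e.g. tutor says "no, it is blue": negative for the rejected
    attribute, positive for the corrected one (our assumption)."""
    classifiers.update(rejected, features, is_positive=False)
    classifiers.update(corrected, features, is_positive=True)
```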

4 Experimental Setup

As noted in the introduction, interactive systems that learn continuously, and over the long run, from humans need to do so incrementally, as quickly as possible, and with as little effort/cost to the human tutor as possible. In addition, when learning takes place through dialogue, the dialogue needs to be as human-like/natural as possible.

In general, there are several different dialogue capabilities and policies that a concept-learning agent might adopt, and these will lead to different outcomes for the accuracy of the learned concepts/meanings, learning rates, and cost to the tutor – with trade-offs between these. Our goal in this paper is therefore an experimental study of the effect of different dialogue policies and capabilities on the overall performance of the learning agent, which, as we describe below, is a measure capturing the trade-off between accuracy of learned meanings and the cost of tutoring.

Design. We use the dialogue system outlined above to carry out our main experiment with a 2×2×2 factorial design, i.e. with three factors, each with two levels. Together, these factors determine the learner's dialogue behaviour: (1) Initiative (Learner/Tutor): determines who takes initiative in the dialogues. When the tutor takes initiative, s/he is the one that drives the conversation forward, by asking questions of the learner (e.g. "What colour is this?" or "So this is a ....") or making a statement about the attributes of the object. On the other hand, when the learner has initiative, it makes statements, asks questions, initiates topics, etc. (2) Uncertainty (+UC/-UC): determines whether the learner takes into account, in its dialogue behaviour, its own subjective confidence about the attributes of the presented object. The confidence is the probability assigned by any of its attribute classifiers of the object being a positive instance of an attribute (e.g. 'red') – see below for how a confidence threshold is used here. In +UC, the agent will not ask a question if it is confident about the answer, and it will hedge the answer to a tutor question if it is not confident, e.g. "T: What is this? L: errm, maybe a square?". In -UC, the agent always takes itself to know the attributes of the given object (as given by its currently trained classifiers), and behaves according to that assumption. (3) Context-Dependency (+CD/-CD): determines whether the learner can process (produce/parse) context-dependent expressions such as short answers and incrementally constructed turns, e.g. "T: What is this? L: a square", or "T: So this one is ...? L: red/a circle". This setting can be turned off/on in the DS-TTR dialogue model.

Figure 5: Example dialogues in different conditions

Tutor Simulation and Policy. To run our experiment on a large scale, we have hand-crafted an Interactive Tutoring Simulator, which simulates the behaviour of a human tutor.² The tutor policy is kept constant across all conditions. Its policy is that of an always truthful, helpful and omniscient one: it (1) has complete access to the labels of each object; (2) always acts as the context of the dialogue dictates: answers any question asked, and confirms or rejects when the learner describes an object; and (3) always corrects the learner when it describes an object erroneously.

²The experiment involves hundreds of dialogues, so running this experiment with real human tutors has proven too costly at this juncture, though we plan to do this for a full evaluation of our system in the future.

Dependent Measures. We now go on to describe the dependent measures in our experiment: classifier accuracy/score, tutoring cost, and the overall performance measure, which combines the former two.

Confidence Threshold. To determine when the agent takes itself to be confident in an attribute prediction, we use confidence-score thresholds. These consist of two values: a base threshold (e.g. 0.5) and a positive threshold (e.g. 0.9).

If the confidence scores of all classifiers are under the base threshold (i.e. the learner has no attribute label that it is confident about), the agent will ask for information directly from the tutor via questions (e.g. "L: what is this?").

On the other hand, if one or more classifiers score above the base threshold, then the positive threshold is used to judge to what extent the agent trusts its prediction. If the confidence score of a classifier is between the positive and base thresholds, the learner is not very confident about its knowledge, and will check with the tutor, e.g. "L: is this red?". However, if the confidence score of a classifier is above the positive threshold, the learner is confident enough in its knowledge not to bother verifying it with the tutor. This leads to less effort needed from the tutor as the learner becomes more confident about its knowledge. However, since a learning agent that has high confidence about a prediction will not ask for assistance from the tutor, a low positive threshold would reduce the chances for the tutor to correct the learner's mistakes. We therefore tested different fixed values for the confidence threshold: a 0.5 base threshold and a 0.9 positive threshold were deemed the most appropriate values for an interactive learning process, i.e. these values preserved good classifier accuracy while not requiring much effort from the tutor. See Section 4.3 below for how an adaptive policy was learned that adjusts the agent's confidence threshold dynamically over time.
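The two-threshold decision rule just described can be summarised as follows (a sketch; the utterance strings and function name are illustrative, with the 0.5/0.9 defaults taken from the text):

```python
# A sketch of the two-threshold rule mapping classifier confidence
# to the learner's next dialogue move.
def learner_dialogue_move(predictions, base=0.5, positive=0.9):
    """predictions: {attribute: confidence} from the classifiers."""
    confident = {a: p for a, p in predictions.items() if p >= base}
    if not confident:
        return ("ask_wh", "What is this?")         # nothing above base: ask
    best_attr, best_p = max(confident.items(), key=lambda kv: kv[1])
    if best_p < positive:
        return ("check", f"Is this {best_attr}?")  # unsure: verify with tutor
    return ("assert", f"This is {best_attr}.")     # confident: just state it
```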

4.1 Evaluation Metrics

To test how the different dialogue capabilities and strategies affect the learning process, we consider both the cost to the tutor and the accuracy of the learned meanings, i.e. the classifiers that ground our colour and shape concepts.

Cost. The cost measure reflects the effort needed by a human tutor in interacting with the system. Skocaj et al. (2009) point out that a comprehensive teachable system should learn as autonomously as possible, rather than involving the human tutor too frequently. There are several possible costs that the tutor might incur, see Table 1: C_inf refers to the cost of the tutor providing information on a single attribute concept (e.g. "this is red" or "this is a square"); C_ack is the cost of a simple confirmation (like "yes", "right") or rejection (such as "no"); C_crt is the cost of a correction of a single concept (e.g. "no, it is blue" or "no, it is a circle"). We associate a higher cost with correction of statements than with that of polar questions. This is to penalise the learning agent when it confidently makes a false statement – thereby incorporating an aspect of trust in the metric (humans will not trust systems which confidently make false statements). Finally, parsing (C_parse) as well as production (C_production) costs for the tutor are taken into account: each single word costs 0.5 when parsed by the tutor, and 1 when generated (production costs twice as much as parsing). These exact values are based on intuition but are kept constant across the experimental conditions and therefore do not confound the results reported below.

Table 1: Tutoring Cost Table

  C_inf    C_ack    C_crt    C_parsing    C_production
  1        0.25     1        0.5          1

Learning Performance. As mentioned above, an efficient learner dialogue policy should consider both classification accuracy and tutor effort (Cost). We thus define an integrated measure – the Overall Performance Ratio ($R_{perf}$) – that we use to compare the learner's overall performance across the different conditions:

$$R_{perf} = \frac{\Delta Acc}{C_{tutor}}$$

i.e. the increase in accuracy per unit of cost, or, equivalently, the gradient of the curve in Fig. 6c. We seek dialogue strategies that maximise this.
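Putting Table 1 and this definition together, the bookkeeping amounts to something like the following sketch; the turn encoding is hypothetical:

```python
# A sketch of cumulative tutoring-cost bookkeeping (values from Table 1)
# and the Overall Performance Ratio defined above.
COSTS = {"inf": 1.0, "ack": 0.25, "crt": 1.0}  # per dialogue-act costs
PARSE_COST, PRODUCTION_COST = 0.5, 1.0         # per word parsed/produced

def dialogue_cost(tutor_turns):
    """tutor_turns: list of (act, produced_words, parsed_words) tuples,
    one per tutor turn; a hypothetical encoding of one dialogue."""
    total = 0.0
    for act, produced, parsed in tutor_turns:
        total += COSTS.get(act, 0.0)
        total += PRODUCTION_COST * len(produced) + PARSE_COST * len(parsed)
    return total

def overall_performance_ratio(delta_accuracy, total_cost):
    return delta_accuracy / total_cost  # R_perf = dAcc / C_tutor
```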

Dataset. The dataset used here comprises 600 images of single, simple handmade objects with a white background (see Fig. 1).³ There are nine attributes considered in this dataset: 6 colours (black, blue, green, orange, purple and red) and 3 shapes (circle, square and triangle), with a relative balance in the number of instances per attribute.

³All data from this paper will be made freely available.

4.2 Evaluation and Cross-validation

In each round, the system is trained using 500 training instances, with the rest set aside for testing. For each training instance, the system interacts (only through dialogue) with the simulated tutor. Each dialogue about an object ends either when both the shape and the colour of the object have been discussed and agreed upon, or when the learner requests to be presented with the next image (this happens only in the Learner-initiative conditions). We define a Learning Step as comprising 10 such dialogues. At the end of each learning step, the system is tested using the test set (100 test instances).

This process is repeated 20 times, i.e. for 20 rounds/folds, each time with a different, random 500/100 split, thus resulting in 20 data points for cost and accuracy after every learning step. The values reported below, including those in the plots in Fig. 6a, 6b and 6c, are averages across the 20 folds.
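Schematically, the evaluation protocol is the following loop (a sketch: fresh_learner, run_dialogue — assumed to return the tutor cost of one dialogue — and test_accuracy are placeholders for components described elsewhere in the paper):

```python
# A compact rendering of the evaluation protocol: 20 random 500/100
# splits, dialogues grouped into Learning Steps of 10, testing after
# each step.
import random

def evaluate(dataset, fresh_learner, run_dialogue, test_accuracy,
             n_folds=20, n_train=500, step=10):
    curves = []                       # one (cost, accuracy) curve per fold
    for _ in range(n_folds):
        data = dataset[:]
        random.shuffle(data)
        train, test = data[:n_train], data[n_train:]
        learner, curve, cost = fresh_learner(), [], 0.0
        for i, instance in enumerate(train, start=1):
            cost += run_dialogue(learner, instance)  # tutor cost, one dialogue
            if i % step == 0:                        # end of a Learning Step
                curve.append((cost, test_accuracy(learner, test)))
        curves.append(curve)
    return curves  # averaged across folds for Figs. 6a-6c
```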

4.3 Learning an Adaptive Policy for a Dynamic Confidence Threshold

In the experiment presented above, the learning agent's positive confidence threshold was held constant, at 0.9. However, since the confidence threshold itself becomes more reliable as the agent is exposed to more training instances, we further hypothesised that a threshold that changes dynamically over time should lead to a better trade-off between classification accuracy and cost to the tutor, i.e. a better Overall Performance Ratio (see above). For example, lower positive thresholds may be more appropriate at the later stages of training, when the agent is already performing well and its attribute classifiers are more reliable. This leads to different dialogue behaviours, as the learner takes different decisions as it encounters more training examples.

To test this hypothesis, we further trained and evaluated an adaptive policy that adjusts the learning agent's confidence threshold as it interacts with the tutor (in the +UC conditions only). This optimization used a Markov Decision Process (MDP) model and Reinforcement Learning⁴, where: (1) the state space was determined by variables for the number of training instances seen so far, and the agent's current confidence threshold; (2) the actions were either to increase or decrease the confidence threshold by 0.05, or keep it the same; (3) the local reward signal was directly proportional to the agent's Overall Performance Ratio over the previous Learning Step (10 training instances, see above); and (4) the SARSA algorithm (Sutton and Barto, 1998) was chosen for learning, with each episode defined as a complete run through the 500 training instances.

⁴A reviewer points out that one can handle uncertainty in a more principled way, possibly with better results, using POMDPs. Another reviewer points out that the policy learned only adapts the confidence threshold, and not the other conditions (uncertainty, initiative, context-dependency). We point out that we are addressing both of these limitations in work in progress, where we feed each classifier's outputted confidence level as a continuous feature in a (continuous-space) MDP for full dialogue control.

Figure 6: Evolution of Learning Performance: (a) Accuracy, (b) Tutoring Cost, (c) Overall Performance
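For concreteness, a minimal tabular rendering of this SARSA set-up is sketched below; the ε-greedy action selection and the α, γ values are standard defaults, not details given in the paper:

```python
# A sketch of the SARSA set-up: states pair the number of training
# instances seen with the current threshold; actions nudge the positive
# threshold by +/-0.05 or keep it; the reward is the Overall Performance
# Ratio over the last Learning Step.
import random
from collections import defaultdict

ACTIONS = (+0.05, 0.0, -0.05)  # raise / keep / lower the threshold
Q = defaultdict(float)          # Q[(state, action)], tabular

def choose(state, epsilon=0.1):
    """epsilon-greedy action selection over the three threshold moves."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def sarsa_update(s, a, reward, s_next, a_next, alpha=0.1, gamma=0.99):
    """One SARSA step: Q(s,a) += alpha * (r + gamma*Q(s',a') - Q(s,a))."""
    Q[(s, a)] += alpha * (reward + gamma * Q[(s_next, a_next)] - Q[(s, a)])
```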

5 Results

Fig. 5 shows example interactions between the learner and the tutor in some of the experimental conditions. Note how the system is able to deal with (parse and generate) utterance continuations as in T+UC+CD, short answers as in L+UC+CD, and polar answers as in T+UC+CD.

Fig. 6a and 6b plot the progression of average Accuracy and (cumulative) Tutoring Cost for each of the 8 conditions in our main experiment, as the system interacts over time with the tutor about each of the 500 training instances. The ninth curve in red (L+UC(Adaptive)+CD) shows the same for the learning agent with a dynamic confidence threshold, using the policy trained with Reinforcement Learning (section 4.3); the latter is only compared below to the dark blue curve (L+UC+CD). As noted in passing, the vertical axes in these graphs are based on averages across the 20 folds – recall that for Accuracy the system was tested, in each fold, at every learning step, i.e. after every 10 training instances.

Fig. 6c, on the other hand, plots Accuracy against Tutoring Cost directly. Note that it is to be expected that the curves do not terminate in the same place on the x-axis, since the different conditions incur different total costs for the tutor across the 500 training instances. The gradient of this curve corresponds to the increase in Accuracy per unit of Tutoring Cost. It is the gradient of the line drawn from the beginning to the end of each curve (tan(β) in Fig. 6c) that constitutes our main evaluation measure of the system's overall performance in each condition, and it is this measure for which we report statistical significance results:

A between-subjects Analysis of Variance (ANOVA) shows significant main effects of Initiative (p < 0.01; F = 448.33), Uncertainty (p < 0.01; F = 206.06) and Context-Dependency (p < 0.05; F = 4.31) on the system's overall performance. There is also a significant Initiative × Uncertainty interaction (p < 0.01; F = 194.31).

Keeping all other conditions constant (L+UC+CD), there is also a significant main effect of Confidence Threshold type (Constant vs. Adaptive) on the same measure (p < 0.01; F = 206.06). The mean gradient of the red, adaptive curve is actually slightly lower than that of its constant-threshold counterpart, the blue curve – discussed below.

6 Discussion

Tutoring Cost. As can be seen in Fig. 6b, the cumulative cost for the tutor progresses more slowly when the learner has initiative (L) and takes its confidence into account in its behaviour (+UC) – the grey, blue, and red curves. This is because a form of active learning is taking place: the learner only asks a question about an attribute if it isn't already confident enough about that attribute. This also explains the slight decrease in the gradients of the curves as the agent is exposed to more and more training instances: its subjective confidence about its own predictions increases over time, and thus there is progressively less need for tutoring.

Accuracy. On the other hand, the L+UC curves (grey and blue) in Fig. 6a show the slowest increase in accuracy and flatten out at about 0.76. This is because the agent's confidence score is unreliable at the beginning, when the agent has only seen a few training instances: in many cases it doesn't query the tutor, or have any interaction with it whatsoever, and so there are informative examples that it is never exposed to. In contrast, the L+UC(adaptive)+CD curve (red) achieves much better accuracy.

Comparing the gradients of the curves in Fig. 6c shows that the overall performance of the agent on the gradient measure is significantly better than the others in the L+UC conditions (recall the significant Initiative × Uncertainty interaction). However, while the agent with an adaptive threshold (red/L+UC(adaptive)+CD) achieves a slightly lower overall gradient than its constant-threshold counterpart (blue/L+UC+CD), it achieves much higher Accuracy overall, and does so much faster in the first 1000 units of cost (roughly the total cost in the L+UC+CD condition). We therefore conclude that the adaptive policy is the more desirable one. Finally, the significant main effect of Context-Dependency on overall performance is explained by the fact that in the +CD conditions, the agent is able to process context-dependent and incrementally constructed turns, leading to less repetition, shorter dialogues, and therefore better overall performance.

7 Conclusion and Future work

We have presented a multi-modal dialogue system that learns grounded word meanings from a human tutor, incrementally, over time, and employs a dynamic dialogue policy (optimised using Reinforcement Learning). The system integrates a semantic grammar for dialogue (DS) and a logical theory of types (TTR) with a set of visual classifiers in which the TTR semantic representations are grounded. We used this implemented system to study the effect of different dialogue policies and capabilities on the overall performance of a learning agent – a combined measure of accuracy and cost. The results show that in order to maximise its performance, the agent needs to take initiative in the dialogues, take into account its changing confidence about its predictions, and be able to process natural, human-like dialogue.

Ongoing work further uses Reinforcement Learning to learn complete, incremental dialogue policies, i.e. ones which choose the system's output at the lexical level (Eshghi and Lemon, 2014). To deal with uncertainty, this system takes all the classifiers' outputted confidence levels directly as features in a continuous-space MDP.

Acknowledgements

This research is supported by the EPSRC, under grant number EP/M01553X/1 (BABBLE project⁵), and by the European Union's Horizon 2020 research and innovation programme under grant agreement No. 688147 (MuMMER project⁶).

⁵https://sites.google.com/site/hwinteractionlab/babble
⁶http://mummer-project.eu/

347


References

Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research (JAIR), 49:1–47.

Ronnie Cann, Ruth Kempson, and Lutz Marten. 2005. The Dynamics of Language. Elsevier, Oxford.

Robin Cooper. 2005. Records and record types in semantic theory. Journal of Logic and Computation, 15(2):99–112.

Simon Dobnik, Robin Cooper, and Staffan Larsson. 2012. Modelling language, action, and perception in type theory with records. In Proceedings of the 7th International Workshop on Constraint Solving and Language Processing (CSLP'12), pages 51–63.

Arash Eshghi and Oliver Lemon. 2014. How domain-general can we be? Learning incremental dialogue systems without dialogue acts. In Proceedings of SemDial.

Arash Eshghi, Julian Hough, Matthew Purver, Ruth Kempson, and Eleni Gregoromichelaki. 2012. Conversational interactions: Capturing dialogue dynamics. In S. Larsson and L. Borin, editors, From Quantification to Conversation: Festschrift for Robin Cooper on the occasion of his 65th birthday, volume 19 of Tributes, pages 325–349. College Publications, London.

A. Eshghi, C. Howes, E. Gregoromichelaki, J. Hough, and M. Purver. 2015. Feedback in conversation as incremental semantic update. In Proceedings of the 11th International Conference on Computational Semantics (IWCS 2015), London, UK. Association for Computational Linguistics.

Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. 2009. Describing objects by their attributes. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR).

Shen Furao and Osamu Hasegawa. 2006. An incremental network for on-line unsupervised classification and topology learning. Neural Networks, 19(1):90–106.

Jonathan Ginzburg. 2012. The Interactive Stance: Meaning for Conversation. Oxford University Press.

Julian Hough and Matthew Purver. 2014. Probabilistic type theory for incremental dialogue processing. In Proceedings of the EACL 2014 Workshop on Type Theory and Natural Language Semantics (TTNLS), pages 80–88, Gothenburg, Sweden, April. Association for Computational Linguistics.

Andrej Karpathy and Li Fei-Fei. 2014. Deep visual-semantic alignments for generating image descriptions. CoRR, abs/1412.2306.

Andrej Karpathy and Fei-Fei Li. 2015. Deep visual-semantic alignments for generating image descriptions. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 3128–3137.

Ruth Kempson, Wilfried Meyer-Viol, and Dov Gabbay. 2001. Dynamic Syntax: The Flow of Language Understanding. Blackwell.

Casey Kennington and David Schlangen. 2015. Simple learning and compositional application of perceptually grounded word meanings for incremental reference resolution. In Proceedings of the Conference for the Association for Computational Linguistics (ACL-IJCNLP). Association for Computational Linguistics.

Thomas Kollar, Jayant Krishnamurthy, and Grant Strimel. 2013. Toward interactive grounded language acquisition. In Robotics: Science and Systems.

Xiangnan Kong, Michael K. Ng, and Zhi-Hua Zhou. 2013. Transductive multilabel learning via label set propagation. IEEE Trans. Knowl. Data Eng., 25(3):704–719.

Matej Kristan and Ales Leonardis. 2014. Online discriminative kernel density estimator with Gaussian kernels. IEEE Trans. Cybernetics, 44(3):355–365.

Christoph H. Lampert, Hannes Nickisch, and Stefan Harmeling. 2014. Attribute-based classification for zero-shot visual object categorization. IEEE Trans. Pattern Anal. Mach. Intell., 36(3):453–465.

Staffan Larsson. 2013. Formal semantics for perceptual classification. Journal of Logic and Computation.

Cynthia Matuszek, Liefeng Bo, Luke Zettlemoyer, and Dieter Fox. 2014. Learning from unscripted deictic gesture and language for human-robot interactions. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, July 27-31, 2014, Quebec City, Quebec, Canada, pages 2556–2563.

Deb Roy. 2002. A trainable visually-grounded spoken language generation system. In Proceedings of the International Conference of Spoken Language Processing.

Carina Silberer and Mirella Lapata. 2014. Learning grounded meaning representations with autoencoders. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 721–732, Baltimore, Maryland, June. Association for Computational Linguistics.

Danijel Skocaj, Matej Kristan, Alen Vrecko, Marko Mahnic, Miroslav Janíček, Geert-Jan M. Kruijff, Marc Hanheide, Nick Hawes, Thomas Keller, Michael Zillich, and Kai Zhou. 2011. A system for interactive learning in dialogue with a tutor. In 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2011, San Francisco, CA, USA, September 25-30, 2011, pages 3387–3394.

Danijel Skocaj, Matej Kristan, and Ales Leonardis. 2009. Formalization of different learning strategies in a continuous learning framework. In Proceedings of the Ninth International Conference on Epigenetic Robotics: Modeling Cognitive Development in Robotic Systems, pages 153–160. Lund University Cognitive Studies.

Richard Socher, Milind Ganjoo, Christopher D. Manning, and Andrew Y. Ng. 2013. Zero-shot learning through cross-modal transfer. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems, pages 935–943, Lake Tahoe, Nevada, USA.

Richard Socher, Andrej Karpathy, Quoc V. Le, Christopher D. Manning, and Andrew Y. Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics, 2:207–218.

Yuyin Sun, Liefeng Bo, and Dieter Fox. 2013. Attribute based object identification. In Robotics and Automation (ICRA), 2013 IEEE International Conference on, pages 2096–2103. IEEE.

Richard S. Sutton and Andrew G. Barto. 1998. Reinforcement Learning: An Introduction. MIT Press.

Stefanie Tellex, Pratiksha Thaker, Joshua Mason Joseph, and Nicholas Roy. 2014. Learning perceptually grounded word meanings from unaligned parallel data. Machine Learning, 94(2):151–167.

Yanchao Yu, Arash Eshghi, and Oliver Lemon. 2015a. Comparing attribute classifiers for interactive language grounding. In Proceedings of the Fourth Workshop on Vision and Language, pages 60–69, Lisbon, Portugal, September. Association for Computational Linguistics.

Yanchao Yu, Oliver Lemon, and Arash Eshghi. 2015b. Interactive learning through dialogue for multimodal language grounding. In SemDial 2015, Proceedings of the 19th Workshop on the Semantics and Pragmatics of Dialogue, Gothenburg, Sweden, August 24-26 2015, pages 214–215.

Tong Zhang. 2004. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning, page 116. ACM.

Jun Zheng, Furao Shen, Hongjun Fan, and Jinxi Zhao. 2013. An online incremental learning support vector machine for large-scale data. Neural Computing and Applications, 22(5):1023–1035.
