
The VQA-Machine: Learning How to Use Existing Vision Algorithms to Answer New Questions

Peng Wang∗, Qi Wu∗, Chunhua Shen, Anton van den Hengel
School of Computer Science, The University of Adelaide, Australia

{p.wang,qi.wu01,chunhua.shen,anton.vandenhengel}@adelaide.edu.au

Abstract

One of the most intriguing features of the Visual Question Answering (VQA) challenge is the unpredictability of the questions. Extracting the information required to answer them demands a variety of image operations, from detection and counting to segmentation and reconstruction. To train a method to perform even one of these operations accurately from {image, question, answer} tuples would be challenging, but to aim to achieve them all with a limited set of such training data seems ambitious at best. We propose here instead a more general and scalable approach which exploits the fact that very good methods to achieve these operations already exist, and thus do not need to be trained. Our method learns how to exploit a set of external off-the-shelf algorithms to achieve its goal, an approach that has something in common with the Neural Turing Machine [10]. The core of our proposed method is a new co-attention model. In addition, the proposed approach generates human-readable reasons for its decisions, and can still be trained end-to-end without ground-truth reasons being given. We demonstrate the effectiveness of the approach on two publicly available datasets, Visual Genome and VQA, and show that it produces state-of-the-art results in both cases.

1. Introduction

Visual Question Answering (VQA) is an AI-complete task lying at the intersection of computer vision (CV) and natural language processing (NLP). Current VQA approaches are predominantly based on a joint embedding [3, 9, 19, 23, 35, 41] of image features and question representations into the same space, the result of which is used to predict the answer. One of the advantages of this approach is its ability to exploit a pre-trained CNN model.

In contrast to this joint embedding approach, we propose a co-attention based method which learns how to use a set of off-the-shelf CV methods in answering image-based questions. Applying the existing CV methods to the image generates a variety of information, which we label the image facts.

∗ indicates equal contribution.

Figure 1: Two real example results of our proposed model. Given an image-question pair, our model generates not only an answer, but also a set of reasons (as text) and visual attention maps. The colored words in the question have the Top-3 weights, ordered as red, blue and cyan. The highlighted area in the attention map indicates the attention weights on the image regions. The Top-3 weighted visual facts are re-formulated as human-readable reasons. The comparison between the results for different questions relating to the same image shows that our model can produce highly informative reasons relating to the specifics of each question.

Inevitably, much of this information is not relevant to the particular question asked, so part of the role of the attention mechanism is to determine which types of facts are useful in answering a given question.

The fact that the VQA-Machine is able to exploit a set of off-the-shelf CV methods in answering a question means that it does not need to learn how to perform these functions itself. The method instead learns to predict the appropriate combination of algorithms to exploit in response to a previously unseen question and image. It thus represents a step towards a Neural Network capable of learning an algorithm for solving its problem. In this sense it is comparable to the Neural Turing Machine [10] (NTM), whereby an RNN is trained to use an associative memory module in solving its larger task.


The method that we propose does not alter the parameters of the external modules it uses, but it is nonetheless able to exploit a much wider variety of module types.

In order to enable the VQA-Machine to exploit a wide variety of available CV methods, and to provide a compact but flexible interface, we formulate the visual facts as triplets. The advantages of this approach are threefold. Firstly, many relevant off-the-shelf CV methods produce outputs that can be reformulated as triplets (see Table 1). Secondly, such a compact format is human-readable and interpretable, which allows us to provide human-readable reasons along with the answers (see Fig. 1 for an example). Finally, the proposed triplet representation is similar to the triplet representations used in some knowledge bases, such as <cat, eat, fish>. The method might thus be extended to accept information from these sources.

To select the facts which are relevant to answering a specific question, we employ a co-attention mechanism. This is achieved by extending the approach of Lu et al. [16], which proposed a co-attention mechanism that jointly reasons about the image and question, to also reason over a set of facts. Specifically, we design a sequential co-attention mechanism (see Fig. 3) which aims to ensure that attention can be passed effectively between all three forms of data. The initial question representation (without attention) is thus first used to guide the fact weighting. The weighted facts and the initial question representation are then combined to guide the image weighting. The weighted facts and image regions are then jointly used to guide the question attention. All that remains is to run the fact attention again, now informed by the question and image attention weights; this completes the circle and means that each attention process has access to the outputs of all the others. All of the weighted features are then fed into a multi-layer perceptron (MLP) to predict the answer.

One of the advantages of this approach is that the question, image, and facts are interpreted together, which means in particular that information extracted from the image (and represented as facts) can guide the question interpretation. A question such as 'Who is looking at the man with a telescope?' means different things when combined with an image of a man holding a telescope, rather than an image of a man viewed through a telescope.

Our main contributions are as follows:
• We propose a new VQA model which is able to learn to adaptively combine multiple off-the-shelf CV methods to answer questions.
• To achieve this, we extend the co-attention mechanism to a higher order which is able to jointly process questions, images, and facts.
• The method that we propose generates not only an answer to the posed question, but also a set of supporting information, including the visual (attention) reasoning and human-readable textual reasons. To the best of our knowledge, this is the first VQA model that is capable of outputting human-readable reasons for free-form open-ended visual questions.
• Finally, we evaluate our proposed model on two VQA datasets. Our model achieves the state of the art in both cases. A human agreement study is conducted to evaluate the reason-generation ability of our model.

2. Related Work

Joint Embedding  Most recent methods are based on a joint embedding of the image and the question using a deep neural network. Practically, image representations are obtained through CNNs that have been pre-trained on an object recognition task. The question is typically passed through an RNN, which produces a fixed-length vector representation. These two representations are jointly embedded into the same space and fed into a classifier which predicts the final answer. Many previous works [3, 9, 19, 23, 33, 41] adopt this approach, while some [8, 13, 24] have proposed modifications of this basic idea. However, these modifications either focus on developing more advanced embedding techniques or employ different question encoding methods; only very few methods actually aim to improve the visual information available [21, 30, 35]. This is a surprising situation given that VQA can be seen as encompassing the vast majority of CV tasks (by phrasing each task as a question, for instance). It is hard to imagine that a single pre-trained CNN model (such as VGG [26] or ResNet [11]) would suffice for all such tasks, or be able to recover all of the required visual information. The approach that we propose here, in contrast, is able to exploit the wealth of methods that already exist to extract useful information from images, and thus does not need to learn to perform these operations from a dataset that is ill-suited to the task.

Attention Mechanisms  Instead of directly using the holistic global image embedding from the fully-connected layer of a CNN, several recent works [12, 16, 25, 38, 39, 42] have explored image attention models for VQA. Specifically, the feature map (normally a convolutional layer of a pre-trained deep CNN) is used together with the question to determine spatial weights that reflect the most relevant regions of the image. Recently, Lu et al. [16] determined attention weights on both image regions and question words. In our work, we extend this co-attention to a higher order so that the image, question and facts can be jointly weighted.

Figure 2: The proposed VQA model. The input question, facts and image features are weighted at the three question-encoding levels. Given the co-weighted features at all levels, a multi-layer perceptron (MLP) classifier is used to predict answers, and the ranked facts are then used to generate reasons. (In the illustrated example, the question "What is the girl doing?" is answered with "riding a horse", supported by facts such as <_img, _contain, horse>, <_img, _att, riding> and <girl, on, horse>, which are formatted into the reasons "The image contains the object of horse", "The image contains the action riding" and "The girl is on the horse".)

Modular Architecture and Memory Networks  Neural Module Networks (NMNs) were introduced by Andreas et al. in [1, 2]. In NMNs, the question parse tree is turned into an assembly of modules from a predefined set, which are then used to answer the question. Dynamic Memory Networks (DMNs) [37] retrieve the 'facts' required to answer the question, where the 'facts' are simply CNN features calculated over small image patches. In comparison to NMNs and DMNs, our method uses a set of external algorithms that does not depend on the question. The set of algorithms used is larger, however, and the method for combining their outputs is more flexible, varying in response to the question, the image, and the facts.

Explicit Reasoning  One of the limitations of most VQA methods is that it is impossible to distinguish between an answer which has arisen as a result of the image content and one selected because it occurs frequently in the training set [28]. This is a significant limitation to the practical application of the technology, particularly in fields such as medicine or defence, as it makes it impossible to have any faith in the answers provided. One solution to this problem is to provide human-readable reasoning to justify or explain the answer, which has been a longstanding goal in neural networks (see [7, 29], for example). Wang et al. [31] propose a VQA framework named "Ahab" that uses explicit reasoning over an RDF (Resource Description Framework) knowledge base to derive the answer, which naturally gives rise to a reasoning chain. This approach is limited to a hand-crafted set of question templates, however. FVQA [32] uses an LSTM and a data-driven approach to learn the mapping of images/questions to RDF queries, but only considers questions relating to specified knowledge bases. In this work, we employ attention mechanisms over facts provided by multiple off-the-shelf CV methods. The facts are formulated as human-understandable structured triplets and are further processed into human-readable reasons. To the best of our knowledge, this is the first approach that can provide human-readable reasons for the open-ended VQA problem. A human agreement study demonstrating the performance of this approach is reported in Sec. 4.2.3.

3. Models

In this section, we introduce the proposed VQA model, which takes questions, images and facts as inputs and outputs a predicted answer with ranked reasons. The overall framework is described in Sec. 3.1, while Sec. 3.2 describes how the three types of input are jointly embedded using the proposed sequential co-attention model. Finally, the module used for generating answers and reasons is introduced in Sec. 3.3.

3.1. Overall Framework

The entire model is shown in Fig. 2. In the first step, the input question is encoded at three different levels. At each level, the question features are embedded jointly with the image and facts via the proposed sequential co-attention model. Finally, a multi-layer perceptron (MLP) is used to predict answers based on the outputs (i.e., the weighted question, image and fact features) of the co-attention models at all levels. Reasons are generated by ranking and re-formulating the weighted facts.

Hierarchical Question Encoding  We apply a hierarchical question encoding [16] to effectively capture the information in a question at multiple scales, i.e., the word, phrase and sentence level. Firstly, the one-hot vectors of the question words $Q = [q_1, \ldots, q_T]$ are embedded individually into continuous vectors $Q^w = [q^w_1, \ldots, q^w_T]$. Then 1-D convolutions with different filter sizes (unigram, bigram and trigram) are applied to the word-level embeddings $Q^w$, followed by max-pooling over the different filters at each word location, to form the phrase-level features $Q^p = [q^p_1, \ldots, q^p_T]$. Finally, the phrase-level features are further encoded by an LSTM, resulting in the question-level features $Q^q = [q^q_1, \ldots, q^q_T]$.
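The hierarchical encoding above can be sketched roughly as follows. This is a minimal PyTorch illustration, not the authors' Torch7 implementation; the class name, padding choices and default dimensions (e.g. `embed_dim = 512`, motivated by Sec. 4.1) are assumptions.

```python
import torch
import torch.nn as nn


class HierarchicalQuestionEncoder(nn.Module):
    """Word-, phrase- and question-level encodings of a tokenised question.

    A rough sketch of the hierarchical encoding of Lu et al. [16] as used here;
    the vocabulary size and dimensions are placeholders.
    """

    def __init__(self, vocab_size=10000, embed_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)            # word level
        # unigram / bigram / trigram 1-D convolutions over the word embeddings
        self.convs = nn.ModuleList([
            nn.Conv1d(embed_dim, embed_dim, kernel_size=k, padding=k // 2)
            for k in (1, 2, 3)
        ])
        self.lstm = nn.LSTM(embed_dim, embed_dim, batch_first=True)  # question level

    def forward(self, question_tokens):            # (batch, T) integer word ids
        q_w = self.embed(question_tokens)           # (batch, T, d)  word level
        x = q_w.transpose(1, 2)                     # (batch, d, T)  for Conv1d
        # max over the {1,2,3}-gram responses at every word position
        grams = [torch.tanh(conv(x))[:, :, :x.size(2)] for conv in self.convs]
        q_p = torch.stack(grams, dim=0).max(dim=0).values.transpose(1, 2)  # phrase level
        q_q, _ = self.lstm(q_p)                     # (batch, T, d)  question level
        return q_w, q_p, q_q
```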


Triplet                     Example
(_img, _scene, img_scn)     (_img, _scene, office)
(_img, _att, img_att)       (_img, _att, wedding)
(_img, _contain, obj)       (_img, _contain, dog)
(obj, _att, obj_att)        (shirt, _att, red)
(obj1, rel, obj2)           (man, hold, umbrella)

Table 1: Facts represented by triplets. _img, _scene, _att and _contain are specific tokens, while img_scn, obj, img_att, obj_att and rel refer to vocabularies describing image scenes, objects, image/object attributes and relationships between objects.

Encoding Image Regions  Following [16, 39], the input image is resized to 448 × 448 and divided into 14 × 14 regions. The corresponding regions of the last pooling layer of the VGG-19 [26] or ResNet-100 [11] networks are extracted and further embedded using a learnable embedding weight. The outputs of the embedding layer, denoted $V = [v_1, \ldots, v_N]$ ($N = 196$ is the number of image regions), are taken as the image features.

Encoding Facts  In this work, we use triplets of the form (subject, relation, object) to represent facts in an image, where subject and object denote two visual concepts and relation represents a relationship between these two concepts. This triplet format is very general and is widely used in large-scale structured knowledge graphs (such as DBpedia [4], Freebase [5] and YAGO [18]) to record a surprising variety of information. In this work, we consider five types of visual concepts, as shown in Table 1, which respectively record information about the image scene, the objects in the image, the attributes of the objects and of the whole image, and the relationships between pairs of objects. Vocabularies are constructed for subject, relation and object, and the entities in a triplet are represented by individual one-hot vectors. Three embeddings are learned end-to-end to project the three triplet entities to continuous-valued vectors ($f_s$ for subject, $f_r$ for relation, $f_o$ for object). The concatenated vector $f = [f_s; f_r; f_o]$ is then used to represent the corresponding fact. By applying different types of visual models, we obtain a list of encoded fact features $F = [f_1, \ldots, f_M]$, where $M$ is the number of extracted facts. Note that this approach can easily be extended to use any existing vision method whose output can usefully be recorded as a triplet or, more generally, any method whose output can be encoded as a fixed-length vector.
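A fact can therefore be encoded with three small embedding tables, one per triplet entry. The following sketch illustrates the idea; the per-entry dimensions follow Sec. 4.1 (128/128/256), the appended confidence channel follows the description there, and everything else (class and argument names, the resulting 513-d size) is an assumption rather than the released implementation.

```python
import torch
import torch.nn as nn


class FactEncoder(nn.Module):
    """Embed (subject, relation, object) triplets into fixed-length fact vectors."""

    def __init__(self, n_subj, n_rel, n_obj):
        super().__init__()
        self.subj = nn.Embedding(n_subj, 128)   # f_s
        self.rel = nn.Embedding(n_rel, 128)     # f_r
        self.obj = nn.Embedding(n_obj, 256)     # f_o

    def forward(self, subj_ids, rel_ids, obj_ids, confidence):
        # subj_ids, rel_ids, obj_ids: (batch, M) indices of the M facts of an image
        # confidence: (batch, M) detector confidence, appended as one extra dimension
        f = torch.cat([self.subj(subj_ids),
                       self.rel(rel_ids),
                       self.obj(obj_ids),
                       confidence.unsqueeze(-1)], dim=-1)   # (batch, M, 128+128+256+1)
        return f
```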

3.2. Sequential Co-attention

Figure 3: The sequential co-attention module. Given the feature sequences for the question ($Q$), the facts ($F$) and the image ($V$), the module sequentially applies $\text{Atten}(Q, 0, 0)$, $\text{Atten}(F, \hat{q}^0, 0)$, $\text{Atten}(V, \hat{q}^0, \hat{f}^0)$, $\text{Atten}(Q, \hat{v}, \hat{f}^0)$ and $\text{Atten}(F, \hat{v}, \hat{q})$ to generate the weighted features ($\hat{q}$, $\hat{v}$, $\hat{f}$).

Given the encoded question/image/fact features, the proposed co-attention approach sequentially generates attention weights for each feature type using the other two as guidance, as shown in Fig. 3. The operation in each of the five attention modules is denoted $\hat{x} = \text{Atten}(X, g_1, g_2)$, which can be expressed as follows:

$H_i = \tanh(W_x x_i + W_{g_1} g_1 + W_{g_2} g_2)$,  (1a)
$\alpha_i = \text{softmax}(w^{\top} H_i), \quad i = 1, \ldots, N$,  (1b)
$\hat{x} = \sum_{i=1}^{N} \alpha_i x_i$,  (1c)

where $X = [x_1, \ldots, x_N] \in \mathbb{R}^{d \times N}$ is the input sequence, and the fixed-length vectors $g_1, g_2 \in \mathbb{R}^d$ are the attention guidances. $W_x, W_{g_1}, W_{g_2} \in \mathbb{R}^{h \times d}$ and $w \in \mathbb{R}^h$ are embedding parameters to be learned. Here $\alpha_i$ are the attention weights over the input sequence features, and the weighted sum $\hat{x}$ is the resulting weighted feature.

In the proposed co-attention approach, the encoded question/image/fact features (see Sec. 3.1) are sequentially fed into the attention module (Eq. 1) as input sequences, and the weighted features from the previous two steps are used as guidance. Firstly, the question features are summarized without any guidance ($\hat{q}^0 = \text{Atten}(Q, 0, 0)$). At the second step, the fact features are weighted based on the summarized question features ($\hat{f}^0 = \text{Atten}(F, \hat{q}^0, 0)$). Next, the weighted image features are generated with the weighted fact features and the summarized question features as guidance ($\hat{v} = \text{Atten}(V, \hat{q}^0, \hat{f}^0)$). In step 4 ($\hat{q} = \text{Atten}(Q, \hat{v}, \hat{f}^0)$) and step 5 ($\hat{f} = \text{Atten}(F, \hat{v}, \hat{q})$), the question and fact features are re-weighted based on the outputs of the previous steps. Finally, the weighted question/image/fact features ($\hat{q}$, $\hat{v}$, $\hat{f}$) are used for answer prediction, and the attention weights $\alpha_f$ of the last attention module are used for reason generation.
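A compact sketch of the attention operator of Eq. (1) and of the five sequential calls above is given below. This is a PyTorch approximation rather than the released Torch7 code; the zero guidance tensors stand in for the 0 arguments of the first two steps, and the module/function names are assumptions.

```python
import torch
import torch.nn as nn


class Atten(nn.Module):
    """x_hat = Atten(X, g1, g2) as in Eq. (1): guided soft attention over a sequence."""

    def __init__(self, d=512, h=512):
        super().__init__()
        self.Wx = nn.Linear(d, h, bias=False)
        self.Wg1 = nn.Linear(d, h, bias=False)
        self.Wg2 = nn.Linear(d, h, bias=False)
        self.w = nn.Linear(h, 1, bias=False)

    def forward(self, X, g1, g2):
        # X: (batch, N, d); g1, g2: (batch, d) guidance vectors (possibly zeros)
        H = torch.tanh(self.Wx(X) + self.Wg1(g1).unsqueeze(1) + self.Wg2(g2).unsqueeze(1))
        alpha = torch.softmax(self.w(H).squeeze(-1), dim=1)       # (batch, N), Eq. (1b)
        x_hat = torch.bmm(alpha.unsqueeze(1), X).squeeze(1)       # (batch, d),  Eq. (1c)
        return x_hat, alpha


def sequential_coattention(Q, F, V, attens):
    """Apply the five attention steps of Sec. 3.2; `attens` is a list of five Atten modules."""
    zero = Q.new_zeros(Q.size(0), Q.size(2))
    q0, _ = attens[0](Q, zero, zero)       # summarise the question
    f0, _ = attens[1](F, q0, zero)         # weight facts given the question
    v, _ = attens[2](V, q0, f0)            # weight image regions
    q, _ = attens[3](Q, v, f0)             # re-weight the question
    f, alpha_f = attens[4](F, v, q)        # re-weight facts; alpha_f is used for reasons
    return q, v, f, alpha_f
```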

3.3. Answer Prediction and Reason Generation

As in many previous VQA models [16, 22, 23, 41], the answer prediction process is treated as a multi-class classification problem in which each class corresponds to a distinct answer. Given the weighted features generated at the word/phrase/question levels, a multi-layer perceptron (MLP) is used for classification:

$h^w = \tanh\big(W_w(\hat{q}^w + \hat{v}^w + \hat{f}^w)\big)$,  (2a)
$h^p = \tanh\big(W_p[(\hat{q}^p + \hat{v}^p + \hat{f}^p); h^w]\big)$,  (2b)
$h^q = \tanh\big(W_q[(\hat{q}^q + \hat{v}^q + \hat{f}^q); h^p]\big)$,  (2c)
$p = \text{softmax}(W_h h^q)$,  (2d)

where $\{\hat{q}^w, \hat{v}^w, \hat{f}^w\}$, $\{\hat{q}^p, \hat{v}^p, \hat{f}^p\}$ and $\{\hat{q}^q, \hat{v}^q, \hat{f}^q\}$ are the weighted features from the three levels, $W_w$, $W_p$, $W_q$ and $W_h$ are parameters to be learned, and $p$ is the probability vector over the candidate answers.
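A sketch of the prediction MLP of Eq. (2), assuming the weighted features from the three levels have already been computed. Dimensions follow Sec. 4.1 (e.g. the 1024-d $h^q$ used for the VQA dataset); the class name and the use of tuples for the per-level features are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class AnswerMLP(nn.Module):
    """Eq. (2): fuse the weighted features of the three levels and classify the answer."""

    def __init__(self, d=512, d_q=1024, n_answers=3000):
        super().__init__()
        self.Ww = nn.Linear(d, d)            # Eq. (2a)
        self.Wp = nn.Linear(2 * d, d)        # Eq. (2b): [(q_p + v_p + f_p); h_w]
        self.Wq = nn.Linear(2 * d, d_q)      # Eq. (2c)
        self.Wh = nn.Linear(d_q, n_answers)  # Eq. (2d)

    def forward(self, word, phrase, question):
        # each argument is a tuple (q_hat, v_hat, f_hat) of (batch, d) tensors
        h_w = torch.tanh(self.Ww(sum(word)))
        h_p = torch.tanh(self.Wp(torch.cat([sum(phrase), h_w], dim=1)))
        h_q = torch.tanh(self.Wq(torch.cat([sum(question), h_p], dim=1)))
        return torch.softmax(self.Wh(h_q), dim=1)   # probabilities over candidate answers
```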

As shown in Fig. 2, the attention weights of the facts from all three levels ($\alpha^w_f$, $\alpha^p_f$, $\alpha^q_f$) are summed together. The top-3 ranked facts are then automatically formulated into human-readable sentences by a simple rule-based approach (see Appendix A) and presented as reasons.

4. Experiments

We evaluate our models on two datasets, Visual Genome QA [14] and VQA-real [3]. Visual Genome QA contains 1,445,322 questions on 108,077 images. Since an official split of the Visual Genome dataset has not been released, we randomly generate our own split, with 723,060 training, 54,506 validation and 667,753 testing question/answer examples, based on 54,038 training, 4,039 validation and 50,000 testing images. The VQA dataset [3] is one of the most widely used VQA datasets and comprises two parts, one using natural images and a second using cartoon images. In this paper, we only evaluate our models on the real-image subset, which we label VQA-real. VQA-real comprises 123,287 training and 81,434 test images.

4.1. Implementation Details

Facts Extraction  Using the training split of the Visual Genome dataset, we trained three naive multi-label CNN models to extract objects, object attributes and object-object relations. Specifically, we extract the top-5000/10000/15000 triplets of the forms (_img, _contain, obj), (obj, _att, obj_att) and (obj1, rel, obj2), respectively, based on the annotations of Visual Genome, and use them as class labels. We then formulate three separate multi-label classification problems using an element-wise logistic loss function. The pre-trained VGG-16 model is used as initialization and only the fully-connected layers are fine-tuned. We also use the scene classification model of [40] and the image attribute model of [33] to extract facts not covered by the Visual Genome annotations (i.e., image scenes and image attributes). Thresholds are set for these visual models, and on average 76 facts are extracted per image. The confidence score of each predicted fact is added as an additional dimension of the encoded fact feature $f$.
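The fact extractors are plain multi-label classifiers over the most frequent triplets. The sketch below shows the element-wise logistic loss set-up described above; it is a hypothetical re-implementation in PyTorch/torchvision, and the backbone choice (the 5000-way object head shown here), the input size and the training loop are assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Multi-label classifier over the top-K most frequent (_img, _contain, obj) triplets.
# The same recipe would be used for attribute and relationship triplets with larger K.
K = 5000
backbone = models.vgg16(pretrained=True)
backbone.classifier[6] = nn.Linear(4096, K)        # replace the final FC layer

# Only the fully-connected layers are fine-tuned; convolutional layers stay fixed.
for p in backbone.features.parameters():
    p.requires_grad = False

criterion = nn.BCEWithLogitsLoss()                 # element-wise logistic loss


def training_step(images, labels, optimizer):
    """images: (B, 3, 224, 224); labels: (B, K) multi-hot triplet annotations."""
    logits = backbone(images)
    loss = criterion(logits, labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```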

Training Parameters  In our system, the dimensions of the encoded question/image/fact features ($d$ in Eq. 1) and of the hidden layers of the LSTM and co-attention models ($h$ in Eq. 1) are set to 512. For facts, the subject/relation/object entities are embedded into 128/128/256-dimensional vectors respectively and concatenated to form 512-d vectors. We use a two-layer LSTM model. For the MLP in Eq. (2), the dimensions of $h^w$ and $h^p$ are also 512, while the dimension of $h^q$ is set to 1024 for the VQA dataset and 2048 for the Visual Genome dataset. For prediction, we take the top 3000 answers for the VQA dataset and the top 5000 answers for the Visual Genome dataset. The whole system is implemented in Torch7 [6]¹ and trained end-to-end, but with fixed CNN features. For optimization, the RMSProp method is used with a base learning rate of 2 × 10−4 and momentum 0.99. The model is trained for up to 256 epochs, or until the validation error has not improved in the last 5 epochs.
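For reference, the optimisation settings above correspond roughly to the following loop, expressed with PyTorch's RMSprop purely for illustration. The original Torch7 optimiser may map the 0.99 value to its smoothing constant rather than to momentum, and `model`, `train_one_epoch` and `evaluate` are hypothetical stand-ins for the (unreleased) training and validation routines.

```python
import torch


def train(model, train_one_epoch, evaluate, max_epochs=256, patience=5):
    """Training loop sketch: RMSProp with early stopping on the validation error."""
    # lr and the 0.99 value are taken from Sec. 4.1; their exact mapping to
    # RMSProp's momentum vs. smoothing constant in the original code is not stated.
    optimizer = torch.optim.RMSprop(model.parameters(), lr=2e-4, momentum=0.99)

    best_val, bad_epochs = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch(model, optimizer)
        val_err = evaluate(model)
        if val_err < best_val:
            best_val, bad_epochs = val_err, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:       # stop after 5 epochs without improvement
                break
    return model
```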

4.2. Results on the Visual Genome QA

Metrics  We use accuracy and the Wu-Palmer similarity (WUPS) [36] to measure performance on the Visual Genome QA (see Table 2). Before comparison, all responses are lowercased, numbers are converted to digits, and punctuation and articles are removed. Accuracies broken down by question type are also reported.
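The answer normalisation applied before comparison can be sketched as follows. This is a plausible implementation of the described steps, not the authors' exact script; in particular, the number-word list is an assumption.

```python
import re

ARTICLES = {"a", "an", "the"}
NUM_WORDS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
             "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
             "ten": "10"}


def normalize_answer(ans: str) -> str:
    """Lowercase, convert number words to digits, drop punctuation and articles."""
    ans = ans.lower()
    ans = re.sub(r"[^\w\s]", " ", ans)                    # strip punctuation
    words = [NUM_WORDS.get(w, w) for w in ans.split()]    # numbers to digits
    words = [w for w in words if w not in ARTICLES]       # remove articles
    return " ".join(words)


assert normalize_answer("A Riding horse!") == "riding horse"
```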

Baselines and State-of-the-art  The first baseline method is VGG+LSTM from Antol et al. [3], which uses a two-layer LSTM to encode the questions and the last hidden layer of VGG [26] to encode the images. The image features are then ℓ2-normalized. We use the author-provided code² to train the model on the Visual Genome QA training split. VGG+Obj+Att+Rel+Extra+LSTM uses the same configuration as VGG+LSTM, except that we concatenate additional image features extracted by VGG-16 models (fc7) that have been pre-trained on the different CV tasks described in the previous section. HieCoAtt-VGG is the original model presented in [16], which is the current state of the art. The authors' implementation³ is used to train the model.

4.2.1 Ablation Studies with Ground Truth Facts

In this section, we conduct an ablation study to evaluate the effectiveness of incorporating different types of facts. To avoid the bias that would be caused by the varying accuracy with which the facts are predicted, we use the ground-truth facts provided by the Visual Genome dataset as the inputs to our proposed models for this ablation testing.

GtFact(Obj) is the initial implementation of our proposed co-attention model, using only the 'object' facts, which means that the co-attention mechanism operates only between the question and the input object facts.

¹ The code and pre-trained models will be released upon acceptance of the paper.

² https://github.com/VT-vision-lab/VQA_LSTM_CNN
³ https://github.com/jiasenlu/HieCoAttenVQA


                                          Accuracy (%)                                              WUPS (%)
Methods                                   What    Where   When    Who     Why     How     Overall | @0.9    @0.0
                                          (60.5%) (17.0%) (3.5%)  (5.5%)  (2.7%)  (10.8%)
VGG+LSTM [3]                              35.12   16.33   52.71   30.03   11.55   42.69   32.46   | 38.30   58.39
VGG+Obj+Att+Rel+Extra+LSTM                36.88   16.85   52.74   32.30   11.65   44.00   33.88   | 39.61   58.89
HieCoAtt-VGG [16]                         39.72   17.53   52.53   33.80   12.62   45.14   35.94   | 41.75   59.97
Ours-GtFact(Obj)                          37.82   17.73   51.48   37.32   12.84   43.10   34.77   | 40.83   59.69
Ours-GtFact(Obj+Att)                      42.21   17.56   51.89   37.45   12.93   43.90   37.50   | 43.24   60.39
Ours-GtFact(Obj+Rel)                      38.25   18.10   51.13   38.22   12.86   43.32   35.15   | 41.25   59.91
Ours-GtFact(Obj+Att+Rel)                  42.86   18.22   51.06   38.26   13.02   44.26   38.06   | 43.86   60.72
Ours-GtFact(Obj+Att+Rel)+VGG              44.28   18.87   52.06   38.87   12.93   46.08   39.30   | 44.94   61.21
Ours-PredFact(Obj+Att+Rel)                37.13   16.99   51.70   33.87   12.73   42.87   34.01   | 39.92   59.20
Ours-PredFact(Obj+Att+Rel+Extra)          38.52   17.86   51.55   34.65   12.87   44.34   35.20   | 41.08   59.75
Ours-PredFact(Obj+Att+Rel)+VGG            40.34   17.80   52.12   34.98   12.78   45.37   36.44   | 42.16   60.09
Ours-PredFact(Obj+Att+Rel+Extra)+VGG      40.91   18.33   52.33   35.50   12.88   46.04   36.99   | 42.73   60.39

Table 2: Ablation study on the Visual Genome QA dataset. Accuracies for the different question types are shown; the percentage of questions of each type is given in parentheses. We additionally report the overall WUPS at 0.9 and at 0.0 for each model.

The overall accuracy of this model on the test split is 34.77%, which already outperforms the baseline model VGG+LSTM. However, there is still a gap between this model and HieCoAtt-VGG (35.94%), which applies the co-attention mechanism to the questions and the whole-image features. Considering that the image features are not used at this stage, this result is reasonable.

Based on this initial model, we add the 'attribute' and 'relationship' facts separately, producing two models, GtFact(Obj+Att) and GtFact(Obj+Rel). Table 2 shows that the former performs better than the latter (37.50% vs. 35.15%), although both outperform the previous 'object'-only model. This suggests that the 'attribute' facts are more effective than the 'relationship' facts. However, for questions starting with 'where' and 'who', GtFact(Obj+Rel) performs slightly better (18.10% vs. 17.56% and 38.22% vs. 37.45%), which suggests that the 'relationship' facts play a more important role for these question types. This makes sense because 'relationship' facts naturally encode location and identity information, for example <cat, on, mat> and <boy, holding, ball>. The GtFact(Obj+Att) model exceeds the image-feature-based co-attention model HieCoAtt-VGG by a large margin of 1.56%. This is mainly due to the substantial improvement on the 'what' questions, from 39.72% to 42.21%, which is not surprising because the 'attribute' facts include detailed information relevant to 'what' questions, such as color, shape, size and material.

In GtFact(Obj+Att+Rel), all of the facts are fed into the proposed model, which brings further improvements for nearly all question types, achieving an overall accuracy of 38.06%. Compared with the baseline model VGG+Obj+Att+Rel+Extra+LSTM, which uses a conventional method (concatenation and embedding) to introduce the additional features into VQA, our model is better by 4.18%, a large gap. However, for the 'when' questions, we find that the performance of our fact-based models is always lower than that of the image-based ones. We observe that this is mainly because the annotated facts in Visual Genome normally do not cover 'when'-related information, such as the time of day. The survey paper [34] makes a similar observation: "98.8% of the 'when' questions in the Visual Genome can not be directly answered from the provided scene graph annotations". Hence, in the final model GtFact(Obj+Att+Rel)+VGG, we add the image features back, allowing a higher-order co-attention between the image, the question and the facts, and achieving an overall accuracy of 39.30%. This also brings a 1% improvement on the 'when' questions. The WUPS evaluation exhibits the same trend as the above results; the question-type-specific results can be found in Appendix B.

4.2.2 Evaluation with Predicted Facts

In a normal VQA setting the ground-truth facts are not provided, so we now evaluate our model using predicted facts. All facts were predicted by models that have been pre-trained on different computer vision tasks, e.g., object detection, attribute prediction, relationship prediction and scene classification (see Sec. 4.1 for details).

The PredFact(Obj+Att+Rel) model uses all of the predicted facts as input, while PredFact(Obj+Att+Rel+Extra) additionally uses the predicted scene category, which is trained on a different data source, MIT Places 205 [40]. From Table 2, we see that although all facts are included, the performance of these two facts-only models is still lower than that of the previous state-of-the-art model HieCoAtt-VGG. Considering that errors in fact prediction may propagate to the question-answering part and that the image features are not used, these results are reasonable. If the fact prediction models perform better, the final answer accuracy will be higher, as shown in the previous ground-truth facts experiments.


Figure 4: Some qualitative results produced by our complete model on the Visual Genome QA test split. For each example, the image, the question-answer pair, the attention map and the predicted Top-3 reasons are shown. Our model is capable of co-attending between the question, the image and the supporting facts. The colored words in the question have the Top-3 identified weights, ordered as red, blue and cyan. The highlighted area in the attention map indicates the attention weights on the image regions (from red: high, to blue: low). The Top-3 identified facts are re-formulated as human-readable reasons, shown as bullets. (For example, for "Q: What is in the water? A: boat", the generated reasons are "This image contains the object of boat", "The boat is on the water" and "This image happens in the scene of harbor".)

PredFact(Obj+Att+Rel+Extra) outperforms the PredFact(Obj+Att+Rel) model by 1.2% because the extra scene-category facts are included. This suggests that our model benefits from using multiple off-the-shelf CV methods, even though they are trained on different data sources and for different tasks, and that as more facts are added, our model performs better. Compared with the baseline model VGG+Obj+Att+Rel+Extra+LSTM, our PredFact(Obj+Att+Rel+Extra) outperforms it by a large margin, which suggests that our co-attention model can more effectively exploit the fact-based information.

Image features are further input to our model, producing two variants, PredFact(Obj+Att+Rel)+VGG and PredFact(Obj+Att+Rel+Extra)+VGG. Both of these outperform the state of the art, and our complete model PredFact(Obj+Att+Rel+Extra)+VGG outperforms it by 1%, i.e., more than 6,000 additional questions are correctly answered.

Please note that all of these results are produced using the naive fact extraction models described in Sec. 4.1. We believe that as better fact extraction models (such as the relationship prediction model of [15], for instance) become available, the results will improve further.

4.2.3 Human Agreement on Predicted Reasons

A key differentiator of our proposed model is that the weights resulting from the fact attention process can be used to generate human-interpretable reasons for the answer produced. To evaluate human agreement with these generated reasons, we sampled 1,000 questions that were correctly answered by our PredFact(Obj+Att+Rel+Extra)+VGG model and conducted a human agreement study on the generated reasons.


Method                 | Test-dev Open-Ended       | Test-dev Multiple-Choice  | Test-std Open-Ended       | Test-std Multiple-Choice
                       | Y/N  Num  Other All       | Y/N  Num  Other All       | Y/N  Num  Other All       | Y/N  Num  Other All
iBOWIMG [41]           | 76.6 35.0 42.6  55.7      | 76.7 37.1 54.4  61.7      | 76.8 35.0 42.6  55.9      | 76.9 37.3 54.6  62.0
MCB-VGG [8]            | -    -    -     57.1      | -    -    -     -         | -    -    -     -         | -    -    -     -
DPPnet [22]            | 80.7 37.2 41.7  57.2      | 80.8 38.9 52.2  62.5      | 80.3 36.9 42.2  57.4      | 80.4 38.8 52.8  62.7
D-NMN [1]              | 80.5 37.4 43.1  57.9      | -    -    -     -         | -    -    -     58.0      | -    -    -     -
VQA team [3]           | 80.5 36.8 43.1  57.8      | 80.5 38.2 53.0  62.7      | 80.6 36.4 43.7  58.2      | 80.6 37.7 53.6  63.1
SMem [38]              | 80.9 37.3 43.1  58.0      | -    -    -     -         | 80.8 37.3 43.1  58.2      | -    -    -     -
SAN [39]               | 79.3 36.6 46.1  58.7      | -    -    -     -         | -    -    -     58.9      | -    -    -     -
ACK [35]               | 81.0 38.4 45.2  59.2      | -    -    -     -         | 81.1 37.1 45.8  59.4      | -    -    -     -
DMN+ [37]              | 80.5 36.8 48.3  60.3      | -    -    -     -         | -    -    -     60.4      | -    -    -     -
MRN-VGG [13]           | 82.5 38.3 46.8  60.5      | 82.6 39.9 55.2  64.8      | -    -    -     -         | -    -    -     -
HieCoAtt-VGG [16]      | 79.6 38.4 49.1  60.5      | 79.7 40.1 57.6  64.9      | -    -    -     -         | -    -    -     -
Re-Ask-ResNet [20]     | 78.4 36.4 46.3  58.4      | -    -    -     -         | 78.2 36.3 46.3  58.4      | -    -    -     -
FDA-ResNet [12]        | 81.1 36.2 45.8  59.2      | -    -    -     -         | -    -    -     59.5      | -    -    -     -
MRN-ResNet [13]        | 82.4 38.4 49.3  61.5      | 82.4 39.7 57.2  65.6      | 82.4 38.2 49.4  61.8      | 82.4 39.6 58.4  66.3
HieCoAtt-ResNet [16]   | 79.7 38.7 51.7  61.8      | 79.7 40.0 59.8  65.8      | -    -    -     62.1      | -    -    -     66.1
MCB-Att-ResNet [8]     | 82.5 37.6 55.6  64.7      | -    -    -     69.1      | -    -    -     -         | -    -    -     -
Ours-VGG               | 81.2 37.7 50.5  61.7      | 81.3 39.9 60.5  66.8      | 81.3 36.7 50.9  61.9      | 81.4 39.0 60.8  67.0
Ours-ResNet            | 81.5 38.4 53.0  63.1      | 81.5 40.0 62.2  67.7      | 81.4 38.2 53.2  63.3      | 81.4 39.8 62.3  67.8

Table 3: Single-model performance on the VQA-real test set in the open-ended and multiple-choice settings.

Since this is the first VQA model that can generate human-readable reasons, there is no previous work that we can follow to perform the human evaluation. We have thus designed the following human agreement protocol. First, an image together with the question and our correctly generated answer is given to a human subject. Then a list of human-readable reasons (ranging from 20 to 40) is shown. These reasons are formulated from facts predicted by the fact extraction models introduced in Sec. 4.1. Although these reasons are all related to the image, not all of them are relevant to the particular question asked. The task of the human subjects is thus to choose the reasons that are relevant to answering the question. Human agreement is calculated by matching the human-selected reasons against the reasons ranked by our model. In order to ease the human subjects' workload and to give them a unique guide for selecting 'reasons', we ask them to select only the top-1 reason relevant to answering the question; they may choose nothing if they think none of the provided reasons is useful.

Finally, in the evaluation, we calculate the rate at which the human-selected reason can be found in our generated top-1/3/5 reasons. We find that the human-selected top-1 reason matches our model's top-1 ranked reason in 30.1% of cases; for the top-3 and top-5, the matching rates are 54.2% and 70.9%, respectively. This suggests that the generated reasons are both interpretable and informative. Figure 4 shows some example reasons generated by our model.
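The reported matching rates amount to the simple computation sketched below; the data structures (`human_choice` as the single reason picked by the annotator, or None, and `model_ranked` as the model's reason ranking) are hypothetical.

```python
def matching_rate(samples, k):
    """Fraction of questions whose human-selected reason is in the model's top-k reasons."""
    hits = sum(1 for human_choice, model_ranked in samples
               if human_choice is not None and human_choice in model_ranked[:k])
    return hits / len(samples)


# samples = [(human_choice, model_ranked), ...] over the 1,000 evaluated questions
# top1, top3, top5 = matching_rate(samples, 1), matching_rate(samples, 3), matching_rate(samples, 5)
```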

4.3. Results on the VQA-real

Table 3 compares our approach with the state of the art on the VQA-real dataset. Since we do not use any ensemble models, we only compare with the single models on the VQA test leaderboard. The test-dev split is normally used for validation, while test-standard is the default test data. The first section of Table 3 shows the state-of-the-art methods that use VGG features, except iBOWIMG [41], which uses GoogLeNet features [27]. The second section gives the results of models that use ResNet [11] features. Two versions of our complete model are evaluated in the last section, using VGG and ResNet features respectively.

Ours-VGG produces the best result on all of the splits among models using the same VGG image encoding. Ours-ResNet ranks second amongst the single models using ResNet features on the test-dev split, but achieves state-of-the-art results on test-std for both the Open-Ended and Multiple-Choice questions. The best result on test-dev with ResNet features is achieved by the Multimodal Compact Bilinear (MCB) pooling model with visual attention [8]. We believe that MCB could be integrated within our proposed co-attention model, by replacing the linear embedding steps in Eqs. 1 and 2, but we leave this as future work.

5. Conclusion

We have proposed a new approach which is capable of adaptively combining the outputs of other algorithms in order to solve a new, more complex problem. We have shown that the approach can be applied to the problem of Visual Question Answering, and that in doing so it achieves state-of-the-art results. Visual Question Answering is a particularly interesting application of the approach, as in this case the new problem to be solved is not completely specified until run time. In retrospect, it seems strange to attempt to answer general questions about images without first providing access to readily available image information that might assist in the process. In developing our approach we proposed a co-attention method applicable to questions, images and facts jointly. We also showed that the attention-weighted facts serve to illuminate why the method reached its conclusion, which is critical if such techniques are to be used in practice.

References

[1] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Learning to compose neural networks for question answering. In Proc. Conf. of North American Chapter of Association for Computational Linguistics, 2016.
[2] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Neural Module Networks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
[3] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual Question Answering. In Proc. IEEE Int. Conf. Comp. Vis., 2015.
[4] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. Springer, 2007.
[5] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In ACM SIGMOD International Conference on Management of Data, pages 1247-1250. ACM, 2008.
[6] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A Matlab-like environment for machine learning. In Proc. Advances in Neural Inf. Process. Syst. Workshop, 2011.
[7] M. Craven and J. W. Shavlik. Using sampling and queries to extract rules from trained neural networks. In Proc. Int. Conf. Mach. Learn., pages 37-45, 1994.
[8] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. In Proc. Conf. Empirical Methods in Natural Language Processing, 2016.
[9] H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu. Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering. In Proc. Advances in Neural Inf. Process. Syst., 2015.
[10] A. Graves, G. Wayne, and I. Danihelka. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
[12] I. Ilievski, S. Yan, and J. Feng. A focused dynamic attention model for visual question answering. arXiv preprint arXiv:1604.01485, 2016.
[13] J.-H. Kim, S.-W. Lee, D.-H. Kwak, M.-O. Heo, J. Kim, J.-W. Ha, and B.-T. Zhang. Multimodal residual learning for visual QA. In Proc. Advances in Neural Inf. Process. Syst., 2016.
[14] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, M. Bernstein, and L. Fei-Fei. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. arXiv preprint arXiv:1602.07332, 2016.
[15] C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei. Visual relationship detection with language priors. In Proc. Eur. Conf. Comp. Vis., 2016.
[16] J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical question-image co-attention for visual question answering. In Proc. Advances in Neural Inf. Process. Syst., 2016.
[17] L. Ma, Z. Lu, and H. Li. Learning to Answer Questions From Image using Convolutional Neural Network. In Proc. Conf. AAAI, 2016.
[18] F. Mahdisoltani, J. Biega, and F. Suchanek. YAGO3: A knowledge base from multilingual Wikipedias. In CIDR, 2015.
[19] M. Malinowski, M. Rohrbach, and M. Fritz. Ask Your Neurons: A Neural-based Approach to Answering Questions about Images. In Proc. IEEE Int. Conf. Comp. Vis., 2015.
[20] M. Malinowski, M. Rohrbach, and M. Fritz. Ask your neurons: A deep learning approach to visual question answering. arXiv preprint arXiv:1605.02697, 2016.
[21] A. Mallya and S. Lazebnik. Learning models for actions and person-object interactions with transfer to question answering. In Proc. Eur. Conf. Comp. Vis., 2016.
[22] H. Noh, P. H. Seo, and B. Han. Image Question Answering using Convolutional Neural Network with Dynamic Parameter Prediction. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
[23] M. Ren, R. Kiros, and R. Zemel. Image Question Answering: A Visual Semantic Embedding Model and a New Dataset. In Proc. Advances in Neural Inf. Process. Syst., 2015.
[24] K. Saito, A. Shin, Y. Ushiku, and T. Harada. DualNet: Domain-invariant network for visual question answering. arXiv preprint arXiv:1606.06108, 2016.
[25] K. J. Shih, S. Singh, and D. Hoiem. Where to look: Focus regions for visual question answering. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
[26] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Proc. Int. Conf. Learn. Representations, 2015.
[27] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2015.
[28] D. Teney, L. Liu, and A. van den Hengel. Graph-structured representations for visual question answering. CoRR, abs/1609.05600, 2016.
[29] S. Thrun. Extracting rules from artificial neural networks with distributed representations. In Proc. Advances in Neural Inf. Process. Syst., pages 505-512, 1995.
[30] T. Tommasi, A. Mallya, B. Plummer, S. Lazebnik, A. C. Berg, and T. L. Berg. Solving visual Madlibs with multiple cues. In Proc. British Machine Vis. Conf., 2016.
[31] P. Wang, Q. Wu, C. Shen, A. v. d. Hengel, and A. Dick. Explicit knowledge-based reasoning for visual question answering. arXiv preprint arXiv:1511.02570, 2015.
[32] P. Wang, Q. Wu, C. Shen, A. v. d. Hengel, and A. Dick. FVQA: Fact-based visual question answering. arXiv preprint arXiv:1606.05433, 2016.
[33] Q. Wu, C. Shen, A. v. d. Hengel, L. Liu, and A. Dick. What Value Do Explicit High Level Concepts Have in Vision to Language Problems? In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
[34] Q. Wu, D. Teney, P. Wang, C. Shen, A. Dick, and A. v. d. Hengel. Visual question answering: A survey of methods and datasets. arXiv preprint arXiv:1607.05910, 2016.
[35] Q. Wu, P. Wang, C. Shen, A. Dick, and A. v. d. Hengel. Ask Me Anything: Free-form Visual Question Answering Based on Knowledge from External Sources. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
[36] Z. Wu and M. Palmer. Verbs semantics and lexical selection. In Proc. Conf. Association for Computational Linguistics, 1994.
[37] C. Xiong, S. Merity, and R. Socher. Dynamic memory networks for visual and textual question answering. In Proc. Int. Conf. Mach. Learn., 2016.
[38] H. Xu and K. Saenko. Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering. In Proc. Eur. Conf. Comp. Vis., 2016.
[39] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked Attention Networks for Image Question Answering. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
[40] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using Places database. In Proc. Advances in Neural Inf. Process. Syst., 2014.
[41] B. Zhou, Y. Tian, S. Sukhbaatar, A. Szlam, and R. Fergus. Simple baseline for visual question answering. arXiv preprint arXiv:1512.02167, 2015.
[42] Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei. Visual7W: Grounded Question Answering in Images. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.

A. Reformatting Facts to Reasons

In this section we provide additional detail on the method used to reformulate the ranked facts into human-readable reasons using a template-based approach.

Each of the extracted facts is encoded as a structured triplet (subject, relation, object) (see Sec. 3.1 for more details), where subject and object denote two visual concepts and relation represents a relationship between these two concepts. This structured representation enables us to easily reformulate a fact into a human-readable sentence with a pre-defined template. All of the templates are shown in Table 4.

For the (_img, _att, img_att) triplet, since we also have the super-class label of the img_att vocabulary, such as 'action', 'number', 'attribute', etc., we can generate more varied sentences. For example, we can generate a reason such as 'This image contains the action of surfing', because 'surfing' belongs to the super-class 'action'.

Fact Triplet                Reason Template
(_img, _scene, img_scn)     This image happens in the scene of img_scn.
(_img, _att, img_att)       This image contains the attribute of img_att.
(_img, _contain, obj)       This image contains the object of obj.
(obj, _att, obj_att)        The obj is obj_att.
(obj1, rel, obj2)           The obj1 is rel the obj2.

Table 4: Facts represented by triplets and the corresponding reason templates. _img, _scene, _att and _contain are specific tokens, while img_scn, obj, img_att, obj_att and rel refer to vocabularies describing image scenes, objects, image/object attributes and relationships between objects. These words are filled into the pre-defined templates to generate reasons.

Example Fact                Example Reason
(_img, _scene, office)      This image happens in the scene of office.
(_img, _att, wedding)       This image contains the attribute of wedding.
(_img, _contain, dog)       This image contains the object of dog.
(shirt, _att, red)          The shirt is red.
(man, hold, umbrella)       The man is hold the umbrella.

Table 5: Example facts in triplet form and the corresponding reasons generated from the templates in Table 4.
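This template-based formatting can be sketched as follows. The templates follow Table 4; the handling of image-attribute super-classes uses an assumed `superclass` lookup value, and the function name is hypothetical.

```python
def fact_to_reason(subj, rel, obj, superclass=None):
    """Turn a (subject, relation, object) fact triplet into a human-readable reason,
    following the templates of Table 4. `superclass` is an assumed lookup result
    for image-attribute facts (e.g. 'action' for 'surfing')."""
    if rel == "_scene":
        return "This image happens in the scene of {}.".format(obj)
    if rel == "_contain":
        return "This image contains the object of {}.".format(obj)
    if rel == "_att":
        if subj == "_img":
            # use the super-class of the image attribute when it is known
            return "This image contains the {} of {}.".format(superclass or "attribute", obj)
        return "The {} is {}.".format(subj, obj)
    return "The {} is {} the {}.".format(subj, rel, obj)


print(fact_to_reason("shirt", "_att", "red"))                # The shirt is red.
print(fact_to_reason("_img", "_att", "surfing", "action"))   # ...the action of surfing.
print(fact_to_reason("man", "hold", "umbrella"))             # The man is hold the umbrella.
```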

B. WUPS Evaluation on the Visual Genome QA

The WUPS score calculates the similarity between two words based on their common subsequence in the WordNet taxonomy tree. If the similarity between two words is greater than a threshold, the candidate answer is considered to be correct. We report results at thresholds 0.9 and 0.0, following [17, 23]. Tables 6 and 7 show the question-type-specific results.
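A simplified, single-word sketch of the thresholded Wu-Palmer score using NLTK's WordNet interface is shown below. The full WUPS metric of [36], as used in [17, 23], additionally takes products over multi-word answers; the 0.1 down-weighting of below-threshold scores follows the common usage of the metric and is an assumption here, so treat this purely as an illustration.

```python
from nltk.corpus import wordnet as wn


def wup(word_a, word_b):
    """Maximum Wu-Palmer similarity over all synset pairs of the two words."""
    scores = [a.wup_similarity(b) or 0.0
              for a in wn.synsets(word_a) for b in wn.synsets(word_b)]
    return max(scores, default=0.0)


def wups_score(answer, ground_truth, threshold=0.9):
    """Full credit if the WUP similarity exceeds the threshold, soft credit otherwise."""
    s = wup(answer, ground_truth)
    return 1.0 if s >= threshold else 0.1 * s


# e.g. wups_score("puppy", "dog", threshold=0.9)
```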

WUPS @ 0.9 (%)
Methods                                   What    Where   When    Who     Why     How     Overall
                                          (60.5%) (17.0%) (3.5%)  (5.5%)  (2.7%)  (10.8%)
VGG+LSTM [3]                              42.84   18.61   55.81   35.75   12.53   45.78   38.30
VGG+Obj+Att+Rel+Extra+LSTM                44.44   19.12   55.85   37.94   12.62   46.98   39.61
HieCoAtt-VGG [16]                         47.38   19.89   55.83   39.40   13.72   48.15   41.75
GtFact(Obj)                               45.86   20.05   55.19   42.99   13.86   48.80   40.83
GtFact(Obj+Att)                           49.74   19.87   55.53   42.95   13.98   47.00   43.24
GtFact(Obj+Rel)                           46.35   20.45   54.93   43.75   13.89   46.42   41.25
GtFact(Obj+Att+Rel)                       50.45   20.55   54.98   43.81   14.15   47.37   43.86
GtFact(Obj+Att+Rel)+VGG                   51.69   21.18   55.90   44.06   14.05   49.03   44.94
PredFact(Obj+Att+Rel)                     44.87   19.46   54.97   39.70   13.89   46.04   39.92
PredFact(Obj+Att+Rel+Extra)               46.23   20.27   55.30   40.31   13.95   47.37   41.08
PredFact(Obj+Att+Rel)+VGG                 47.86   20.17   55.77   40.35   13.81   48.34   42.16
PredFact(Obj+Att+Rel+Extra)+VGG           48.48   20.69   55.76   40.96   13.97   48.97   42.73

Table 6: Ablation study on the Visual Genome QA dataset. WUPS at 0.9 for the different question types is shown. The percentage of questions of each type is given in parentheses.

WUPS @ 0.0 (%)
Methods                                   What    Where   When    Who     Why     How     Overall
                                          (60.5%) (17.0%) (3.5%)  (5.5%)  (2.7%)  (10.8%)
VGG+LSTM [3]                              66.49   27.39   63.29   55.25   15.99   72.25   58.39
VGG+Obj+Att+Rel+Extra+LSTM                67.10   27.64   63.25   56.05   16.12   72.61   58.89
HieCoAtt-VGG [16]                         68.41   28.50   62.97   56.83   17.41   73.33   59.97
GtFact(Obj)                               67.94   28.57   62.27   58.19   17.42   72.78   59.69
GtFact(Obj+Att)                           69.07   28.37   62.68   58.35   17.62   73.06   60.39
GtFact(Obj+Rel)                           68.22   28.85   62.10   58.60   17.51   72.66   59.91
GtFact(Obj+Att+Rel)                       69.43   28.85   62.10   58.67   18.00   73.24   60.72
GtFact(Obj+Att+Rel)+VGG                   69.99   29.29   62.91   58.56   17.70   73.80   61.21
PredFact(Obj+Att+Rel)                     67.36   28.27   62.30   56.73   17.54   72.68   59.20
PredFact(Obj+Att+Rel+Extra)               67.98   28.91   62.56   57.11   17.70   72.96   59.75
PredFact(Obj+Att+Rel)+VGG                 68.55   28.71   62.70   57.01   17.43   73.32   60.09
PredFact(Obj+Att+Rel+Extra)+VGG           68.88   29.07   62.91   57.28   17.76   73.38   60.39

Table 7: Ablation study on the Visual Genome QA dataset. WUPS at 0.0 for the different question types is shown. The percentage of questions of each type is given in parentheses.