
Focal Visual-Text Attention for Visual Question Answering

Junwei Liang1 Lu Jiang2 Liangliang Cao3 Li-Jia Li2 Alexander Hauptmann1

1Carnegie Mellon University  2Google Inc.  3HelloVera AI
{junweil,alex}@cs.cmu.edu, {lujiang,lijiali}@google.com, [email protected]

Abstract

Recent insights on language and vision with neural networks have been successfully applied to simple single-image visual question answering. However, to tackle real-life question answering problems on multimedia collections such as personal photos, we have to look at whole collections with sequences of photos or videos. When answering questions from a large collection, a natural problem is to identify snippets to support the answer. In this paper, we describe a novel neural network called the Focal Visual-Text Attention network (FVTA) for collective reasoning in visual question answering, where both visual and text sequence information, such as images and text metadata, are presented. FVTA introduces an end-to-end approach that uses a hierarchical process to dynamically determine what media and what time to focus on in the sequential data to answer the question. FVTA not only answers the questions well but also provides the justifications on which the system's answers are based. FVTA achieves state-of-the-art performance on the MemexQA dataset and competitive results on the MovieQA dataset.

1. Introduction

Language and vision have emerged as a popular research area in computer vision. Visual question answering (VQA) [2] is a successful direction utilizing both computer vision and natural language processing techniques to solve an interesting problem: given an image and a question (in natural language), the goal is to learn an inference model that can answer the question according to cues discovered from the image. A variety of methods have been proposed to address the challenges from different aspects [5, 27, 14, 6, 20, 3, 16, 13], with remarkable progress on answering questions about a single image.

Extending from VQA on a single image, this paper considers the following problem: Suppose a user's photos and videos are organized in a sequence ordered by their creation time. Some photos or videos may be associated with meta labels or annotations such as time, GPS, captions, comments, and a meaningful title. We are interested in training

Figure 1. Focal Visual-Text Attention (FVTA) Mechanism. Given the visual-text sequences input and the question, our temporal visual-text attention tensor captures the temporal constraint in the question and emphasizes the most recent image with a "bar" scene visible. Then FVTA selects the appropriate attention region (the "date") and finds the correct answer.

a model to answer questions about these images and texts, e.g., "when was the last time I went to a bar?" or "what did my son do after his 2017 Halloween dinner party?"

There are two challenges in solving the above problem. First, the input is provided in an unstructured form. The question is associated with multiple sequences, in the form of videos or images. Such sequences are temporally ordered, and each sequence contains multiple time steps. At each time step there are visual data, text annotations and other metadata. In this paper, we call this format visual-text sequence data. Note that not all the photos and videos are annotated, which requires a robust method to leverage inconsistently available multimodal data.

The second challenge is to provide interpretable justifications in addition to a direct answer based on the sequence data. To help users with a lot of photos and videos, a natural requirement is to identify the supporting evidence for the answer. An example question, as shown in Fig. 1, is "when was the last time I went to a bar?" From the users' viewpoint, a good QA system should not only give a definite answer (e.g., January 20, 2016), but also ground evidential images or text snippets in the input sequence to justify the reasoning process. Given


imperfect VQA models, humans often want to verify the answer. The inspection process may be trivial for a single image but can take a significant amount of time when every image and all the text have to be examined.

To address these two challenges, we propose a Focal Visual-Text Attention (FVTA) model for sequential data1. Our model is motivated by the reasoning process of humans. In order to answer a question, a human would first quickly skim the input and then focus on a few, small temporal regions in the visual-text sequences to derive an answer. In fact, statistics suggest that, on average, humans only need 1.5 images to answer a question after skimming [9]. Inspired by this process, FVTA first learns to localize relevant information within a few, small, temporally consecutive regions over the input sequences, and then learns to infer an answer based on the cross-modal statistics pooled from these regions. FVTA proposes a novel kernel to compute the attention tensor that jointly models the latent information in three sources: 1) answer-signaling words in the question, 2) temporal correlation within a sequence, and 3) cross-modal interaction between text and image. FVTA allows for collective reasoning with an attention kernel learned over a few, small, consecutive sub-sequences of text and images. It can also produce a list of evidential images/texts to justify the reasoning. As shown in Fig. 1, the highlighted cubes are regions of high activation in the proposed FVTA. To summarize, the contribution of this paper is threefold:

• We propose a novel attention kernel for VQA on visual-text data. Experiments show that it outperforms existing attention methods.

• The proposed attention tensor can be used to localize evidential image and text snippets to explain the reasoning process. We quantitatively verify that the evidence produced by our method is more strongly correlated with that of human annotators.

• Our method achieves state-of-the-art results on two VQA benchmarks.

2. Related Work

Visual Question Answering. Image-based visual question answering has received a large amount of interest in the computer vision community. A lot of effort has been devoted to single-image QA datasets [2, 12, 31, 17, 26, 1], where a common practice is to train a classifier combining both question features and visual features. A recent direction is question answering based on videos, which is more relevant to this work. A number of research studies have been carried out on MovieQA [22, 10, 15], with movie clips, scripts, and descriptions. Because it is expensive to

1Code and models are released at https://memexqa.cs.cmu.edu/fvta.html

annotate video-based QA datasets, some research studies generate QA datasets by harvesting online videos and descriptions [30, 29], while a recent study [7] considers question answering using animated GIFs. This work differs from existing video-based QA in two aspects: (1) video-based QA answers questions based on a single video, while our work can handle general visual-text sequences, where one user may have more than one video or album of photos; (2) most existing video-based QA methods map one video sequence with text into a context feature vector, while our work explores a more fine-grained model by modeling the correlation between the query and the sequence data at every time step. To this end, we experiment on the MemexQA dataset [9]. The sequential data in MemexQA involves multiple modalities, including titles, timestamps, GPS and visual content, rendering it an ideal test bed for QA research over visual-text sequence data. Unlike the model in [9], our method also uses the text embedding of the answer choices as input to answer a question.

Attention Mechanism. This work can be viewed as a novel attention model for multiple variable-length sequential inputs, taking into account not only the visual-text information but also the temporal dependency. Our work extends previous studies of using attention models for image QA [20, 4, 26, 13, 27, 16, 5, 3]. A key difference between our method and classical attention models lies in the fact that we model the correlation at every time step, across multiple sequences. Existing attention mechanisms for VQA mainly focus on attention within spatial regions of an image [31] or within a single sequence [7], and hence may not fully exploit the multiple-sequence, multiple-time-step nature of the data. As Fig. 3 shows, our attention is applied to a three-dimensional tensor, while the classic soft attention model is applied to a vector or matrix.

3. Approach

3.1. Problem Formulation

We start the discussion by formally defining the problem. Let Q = q_1, ..., q_M represent a question of M words, Q ∈ Z^M, where each word is an integer index in the vocabulary. Define a context visual-text sequence of T examples X = x_1, ..., x_T, where for each example x_t^{img} represents an image and x_t^{txt} is its corresponding text sentence, whose i-th word is indexed by x_{ti}^{txt}. Following [2, 31], the answer to a question is an integer y ∈ [1, L] over the answer vocabulary of size L. Given a collection of n questions and their context sequences, we are interested in learning a model maximizing the following likelihood:

\arg\max_{\Theta} \sum_{i=1}^{n} \log P(y_i \mid Q_i, X_i; \Theta)    (1)

where Θ represents the model parameters. Given the visual-text sequence input X^{img}, X^{txt}, we obtain a good joint representation by the attention model.
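For concreteness, the following minimal sketch (our own illustrative layout, not the released code) shows one way to organize a question, its visual-text context sequence, and the answer index defined above.

```python
from dataclasses import dataclass
from typing import List, Optional
import numpy as np

@dataclass
class VisualTextStep:
    """One time step x_t of the context sequence X (illustrative layout)."""
    image_feature: np.ndarray   # x_t^img, e.g. a pre-extracted CNN feature vector
    text_word_ids: List[int]    # x_t^txt, word indices of the associated metadata text

@dataclass
class QAExample:
    """One (Q, X, y) triple from the formulation above."""
    question_word_ids: List[int]      # Q = q_1, ..., q_M
    context: List[VisualTextStep]     # X = x_1, ..., x_T, ordered by creation time
    answer_id: Optional[int] = None   # y in [1, L]; None at test time
```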


Figure 2. An overview of the Focal Visual-Text Attention (FVTA) model. For visual-text embedding, we use a pre-trained convolutional neural network to embed the photos and pre-trained word vectors to embed the words. We use a bi-directional LSTM as the sequence encoder. All hidden states from the question and the context are used to calculate the FVTA tensor. Based on the FVTA attention, both the question and the context are summarized into single vectors for the output layer to produce the final answer. The output layer is used for multiple-choice question classification. The text embedding of the answer choice is also used as input; this input is not shown in the figure.

With FVTA attention, the model takes into account the sequential dependency within the image and text sequences, respectively, as well as the cross-modal visual-text correlations. Meanwhile, the computed attention weights over the input sequences can be utilized to derive meaningful justifications.

3.2. Network Architecture

This subsection discusses our overall neural network architecture. As shown in Fig. 2, the proposed network consists of the following layers.

Visual-Text Embedding. Every image or video frame is encoded with a pre-trained Convolutional Neural Network. Both word-level and character-level embeddings [11] are used to represent the words in the text and the question.

Sequence Encoder. We use separate LSTM networks to encode the visual and text sequences, respectively, to capture the temporal dependency within each individual sequence. The inputs to the LSTM units are the image/text embeddings produced by the previous layer. Let d denote the size of the hidden state of the LSTM unit; the question Q is represented as a matrix Q of concatenated bi-directional LSTM outputs at each step, i.e., Q ∈ R^{2d×M}, where M is the maximum length of the question. Likewise, the sequentially encoded text and images are represented by H ∈ R^{2d×T×2}, where T is the maximum length of the sequence.

Focal Visual-Text Attention. FVTA is a novel layer implementing the proposed attention mechanism. It models the correlations between the question and the multi-dimensional context and produces the summarized input to the final output layer, i.e., h ∈ R^{2d} and q ∈ R^{2d}. We will discuss FVTA in the next section.

Output Layer. After summarizing the input using the FVTA attention, we use a feed-forward layer to obtain the answer. For multiple-choice questions, the task is to select one answer from a few candidate choices given the context and the question. Let k denote the number of candidate answers; we utilize a bi-directional LSTM to encode each answer choice and use the last hidden state as the representation of the answers, E ∈ R^{k×2d}. We tile the attended context representation h and the attended question representation q, k times each, into H ∈ R^{k×2d} and Q ∈ R^{k×2d} to compute the classification probability over the k choices. In practice we find the following simple equation works better than a fully connected layer or straightforward concatenation:

p = \mathrm{softmax}\big(\mathbf{w}_p^\top [\mathbf{Q}; \mathbf{H}; \mathbf{E}; \mathbf{Q} \odot \mathbf{E}; \mathbf{H} \odot \mathbf{E}]\big)    (2)

where the operator [·; ·] represents the concatenation of matrices along the last dimension, ⊙ is the element-wise multiplication, w_p is the weight vector to learn, and p is a vector of classification probabilities. After obtaining the answer probabilities, the model can be trained end-to-end using the cross-entropy loss function.
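As an illustration, here is a minimal NumPy sketch of the scoring in Eq. (2); the variable names and toy dimensions are ours and do not come from the released implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

def answer_scores(q, h, E, w_p):
    """Sketch of Eq. (2): score k answer choices against the attended
    question vector q and context vector h.

    q   : (2d,)   attended question representation
    h   : (2d,)   attended context representation
    E   : (k, 2d) encoded answer choices (last LSTM states)
    w_p : (10d,)  learned weight vector
    """
    k = E.shape[0]
    Q = np.tile(q, (k, 1))                                   # (k, 2d)
    H = np.tile(h, (k, 1))                                   # (k, 2d)
    feats = np.concatenate([Q, H, E, Q * E, H * E], axis=1)  # (k, 10d)
    return softmax(feats @ w_p)                              # probabilities over k choices

# toy usage with hidden size d = 2 and k = 4 candidate answers
d, k = 2, 4
rng = np.random.default_rng(0)
p = answer_scores(rng.normal(size=2 * d), rng.normal(size=2 * d),
                  rng.normal(size=(k, 2 * d)), rng.normal(size=10 * d))
print(p, p.sum())  # probabilities sum to 1
```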

4. Focal Visual-Text Attention

This section discusses the details of the FVTA model as the key module in our VQA system. We first introduce a similarity metric between visual and text features, and then discuss constructing the attention tensor that captures both intra-sequence dependency and inter-sequence interaction.

4.1. Similarity between visual and text features

To compute the similarity across different modalities, i.e., visual and text, we first encode every modality by LSTM networks with hidden states of the same size. Then we measure the differences between these hidden state variables. Following the study on text sequence matching [24], we aggregate both the cosine similarity and the Euclidean distance to compare the features. Moreover, we choose to keep the


vector information instead of summing up after the operation. The vector representation can be used as the input of a learning model, whose inner product represents the similarity between these features. More specifically, we use the following equation to compute the similarity representation between two hidden state vectors v_1 and v_2. The result is a vector of twice the hidden size:

s(\mathbf{v}_1, \mathbf{v}_2) = [(\mathbf{v}_1 \odot \mathbf{v}_2); (\mathbf{v}_1 - \mathbf{v}_2) \odot (\mathbf{v}_1 - \mathbf{v}_2)].    (3)
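A minimal sketch of this similarity representation, assuming plain NumPy vectors for the hidden states:

```python
import numpy as np

def similarity_rep(v1, v2):
    """Sketch of Eq. (3): concatenate the element-wise product (a cosine-like
    term) with the squared element-wise difference (a Euclidean-like term),
    keeping a vector of twice the hidden size instead of a single scalar."""
    return np.concatenate([v1 * v2, (v1 - v2) ** 2])

v1, v2 = np.ones(4), np.arange(4.0)
print(similarity_rep(v1, v2).shape)  # (8,), i.e. twice the hidden size
```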

4.2. Intra-sequence temporal dependency

Our visual-text attention layer is designed to let the model select the related visual-text region or time step based on each word of the question. Such fine-grained attention is in general nontrivial to learn. Meanwhile, most answers for visual-text sequence inputs are constrained and restricted to a short temporal period. We learn such a localized representation, called the focal context representation, to emphasize relevant context states based on the question.

First, we introduce a temporal correlation matrix C ∈ R^{T×T}, a symmetric matrix where each entry c_{ij} measures the correlation between the context's i-th step and j-th step for a question. Let h_i = H_{:i:} ∈ R^{2d×2} denote the visual/text representation for the i-th time step in H. For notational convenience, : is a slicing operator that extracts all elements from a dimension. For example, h_{i1} = H_{:i1} represents the vector representation of the i-th time step of the visual sequence. Here we use the last index 1 for the visual and 2 for the textual modality. Each entry C_{ij} (∀i, j ∈ [1, T]) is then calculated by:

C_{ij} = \tanh\Big( \sum_{k=1}^{2} \mathbf{w}_c^\top \big( \mathbf{w}_h^\top s(\mathbf{h}_{ik}, \mathbf{h}_{jk}) + \mathbf{Q}_{:M} \big) \Big)    (4)

where w_c ∈ R^{2d×1} and w_h ∈ R^{4d×2d} are parameters to learn. The temporal correlation matrix captures the temporal dependency among the question, image and text sequences.
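The following sketch spells out Eq. (4) with explicit loops; it is our own reading of the equation, with the parameter shapes taken from the text and Q_{:M} taken as the last question hidden state.

```python
import numpy as np

def temporal_correlation(H, q_last, w_c, w_h):
    """Sketch of Eq. (4): pairwise correlation between context time steps.

    H      : (2d, T, 2) context hidden states (visual and text)
    q_last : (2d,)      last question hidden state, Q_{:M}
    w_c    : (2d,)      learned projection to a scalar
    w_h    : (4d, 2d)   learned projection of the similarity vector of Eq. (3)
    """
    _, T, K = H.shape
    C = np.zeros((T, T))
    for i in range(T):
        for j in range(T):
            total = 0.0
            for k in range(K):  # sum over the visual (k=0) and text (k=1) modalities
                sim = np.concatenate([H[:, i, k] * H[:, j, k],
                                      (H[:, i, k] - H[:, j, k]) ** 2])  # s(h_ik, h_jk)
                total += w_c @ (w_h.T @ sim + q_last)
            C[i, j] = np.tanh(total)
    return C

# toy usage: hidden size d = 3, T = 4 time steps
d, T = 3, 4
rng = np.random.default_rng(0)
C = temporal_correlation(rng.normal(size=(2 * d, T, 2)), rng.normal(size=2 * d),
                         rng.normal(size=2 * d), rng.normal(size=(4 * d, 2 * d)))
print(C.shape)  # (4, 4)
```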

To allow the model to capture the context between time steps based on the question, we introduce temporal focal pooling to connect neighboring hidden states if they are related to the question. For example, it can capture the relevance between the moment "dinner" and the later moment "went dancing", given the question "What did we do after the dinner on Ben's birthday?". Formally, given the temporal correlation matrix C and the context representation H, we introduce a temporal focal pooling function g to obtain the focal representation F ∈ R^{2d×T×2}. Each vector entry F_{:tk} (∀t ∈ [1, T], ∀k ∈ [1, 2]) in F is calculated by:

F_{:tk} = g(\mathbf{H}; \mathbf{C}, t, k) \in \mathbb{R}^{2d},    (5)

g(\mathbf{H}; \mathbf{C}, t, k) = \sum_{s=1}^{T} \mathbb{1}[s \in [t-c, t+c]] \, C_{st} \mathbf{h}_{sk},    (6)

where F_{:tk} is the focal context representation at the t-th time step for the visual (k = 1) or text (k = 2) modality, 1[·] is the indicator function, and c stands for the size of the temporal window

Figure 3. Comparison of our FVTA and the classical VQA attention mechanism. FVTA considers both visual-text intra-sequence correlations and cross-sequence interaction, and focuses on a few, small regions. In FVTA, the multi-modal feature representation in the sequence data is preserved without losing information.

that is learned end-to-end with the other parameters. We constrain the model to focus on a few small temporal context windows of learnable window size 2c + 1.
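Below is a sketch of the temporal focal pooling in Eqs. (5)-(6) under the simplifying assumption of a fixed window half-size c (in the paper, c is learned end-to-end with the other parameters):

```python
import numpy as np

def temporal_focal_pooling(H, C, c):
    """Sketch of Eqs. (5)-(6): pool neighboring hidden states, weighted by the
    temporal correlation matrix, within a window of size 2c + 1.

    H : (2d, T, 2) context hidden states (visual and text)
    C : (T, T)     temporal correlation matrix from Eq. (4)
    c : int        temporal window half-size (fixed here for simplicity)
    returns F : (2d, T, 2) focal context representation
    """
    _, T, K = H.shape
    F = np.zeros_like(H)
    for t in range(T):
        lo, hi = max(0, t - c), min(T, t + c + 1)      # indices s with s in [t-c, t+c]
        for k in range(K):
            F[:, t, k] = H[:, lo:hi, k] @ C[lo:hi, t]  # weighted sum of neighbors
    return F

# toy usage: 2d = 4 hidden dims, T = 6 steps, half-size c = 1
rng = np.random.default_rng(0)
F = temporal_focal_pooling(rng.normal(size=(4, 6, 2)),
                           np.tanh(rng.normal(size=(6, 6))), c=1)
print(F.shape)  # (4, 6, 2)
```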

4.3. Cross Sequence Interaction

In this section, we introduce the attention mechanism that captures the important correlations between the visual and textual sequences. We apply attention over the focal context representation to summarize the information important for answering the question. We obtain the attention weights based on a correlation tensor S between each word of the question and each time step of the visual-text sequences. The attention at each time step only considers the context and the question, and does not depend on the attention at the previous time step. The intuition of using such a memory-less attention mechanism is that it simplifies the attention and lets the model focus on learning the correlation between the context and the question. Such a mechanism has been proven useful in text question answering [19]. We compute a kernel tensor S ∈ R^{M×T×2} between the input question and the focal context representation F, where each entry s_{mtk} of the kernel models the correlation between the m-th word in the question and the t-th time step of modality k (images or text words). Let v_{tk} denote the focal context representation F_{:tk} at the t-th time step for the visual or text modality. Each entry s_{mtk} in S is calculated by:

s_{mtk} = \kappa(\mathbf{F}_{:tk}, \mathbf{Q}_{:m}) = \kappa(\mathbf{v}_{tk}, \mathbf{q}) = \tanh\big(\mathbf{w}_s^\top s(\mathbf{v}_{tk}, \mathbf{q}) + b_s\big)    (7)


where κ is a function to compute the correlation between the question and the context, w_s ∈ R^{4d×1} are the learned weights and b_s is the bias term. s is the mapping defined in Eq. (3). As explained for Eq. (4), we use such similarity representations since they capture both the cosine similarity and the Euclidean distance information. We obtain the visual-text sequence attention matrix A ∈ R^{T×2} by

A = \mathrm{softmax}\big(\max_{i=1}^{M} \mathbf{S}_{i::}\big)

and the visual-text attention vector B ∈ R^{2} by

B = \mathrm{softmax}\big(\max_{i=1}^{T} \max_{j=1}^{M} \mathbf{S}_{ji:}\big),

where the softmax operation is applied to the first dimension and the maximum function max_i is used to reduce the first dimension of the high-dimensional tensor. Then the attended context vector is given by:

\mathbf{h} = \sum_{k=1}^{2} B_k \sum_{t=1}^{T} A_{tk} \mathbf{F}_{:tk} \in \mathbb{R}^{2d}    (8)

The visual-text attention is computed based on the correlation between the question and the focal context representation, which aligns with our observation that questions often provide the constraint of a limited time window for the answers. Similarly, we compute the question attention D ∈ R^{M} by

D = \mathrm{softmax}\big(\max_{i=1}^{T} \max_{j=1}^{2} \mathbf{S}_{:ij}\big)

and the summarized question vector is given by:

\mathbf{q} = \sum_{m=1}^{M} D_m \mathbf{Q}_{:m} \in \mathbb{R}^{2d}    (9)
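To make the tensor shapes concrete, here is a NumPy sketch of Eq. (7) together with the reductions in Eqs. (8)-(9); the loop structure and toy sizes are ours, not the released implementation.

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kernel_tensor(F, Q, w_s, b_s):
    """Sketch of Eq. (7): correlation between every question word and every
    context time step / modality.

    F : (2d, T, 2) focal context representation, Q : (2d, M) encoded question
    w_s : (4d,) learned weights, b_s : scalar bias
    returns S : (M, T, 2)
    """
    _, T, K = F.shape
    M = Q.shape[1]
    S = np.zeros((M, T, K))
    for m in range(M):
        for t in range(T):
            for k in range(K):
                v, q = F[:, t, k], Q[:, m]
                sim = np.concatenate([v * q, (v - q) ** 2])  # s(v_tk, q), Eq. (3)
                S[m, t, k] = np.tanh(w_s @ sim + b_s)
    return S

def fvta_reduce(S, F, Q):
    """Sketch of Eqs. (8)-(9): attended context vector h and question vector q."""
    A = softmax(S.max(axis=0), axis=0)       # (T, 2) attention over time steps
    B = softmax(S.max(axis=(0, 1)), axis=0)  # (2,)   attention over modalities
    D = softmax(S.max(axis=(1, 2)), axis=0)  # (M,)   attention over question words
    h = np.einsum('k,dtk,tk->d', B, F, A)    # Eq. (8)
    q = Q @ D                                # Eq. (9)
    return h, q

# toy usage: hidden size d = 3, T = 6 time steps, M = 5 question words
rng = np.random.default_rng(0)
F = rng.normal(size=(6, 6, 2))
Q = rng.normal(size=(6, 5))
S = kernel_tensor(F, Q, w_s=rng.normal(size=12), b_s=0.1)
h, q = fvta_reduce(S, F, Q)
print(S.shape, h.shape, q.shape)  # (5, 6, 2) (6,) (6,)
```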

Algorithm 1 summarizes the steps to compute the proposed FVTA attention. To obtain the final context representation, we first summarize the focal context representation separately for the visual sequence and the text sequence, emphasizing the most important information using the intra-sequence attention. Then, we obtain the final representation by summing the sequence vector representations weighted by the inter-sequence importance. Fig. 3 illustrates the difference between the FVTA attention tensor and a one-dimensional soft attention vector. Both mechanisms compute attention, but FVTA considers both visual-text intra-sequence correlations and cross-sequence interaction.

Algorithm 1: Computation of Focal Visual-Text Attention.

input : Input visual-text sequence X, question Q
output: The FVTA vector h

1 Encode X into H by the visual-text embedding and sequence encoder in Sec. 3.2;
2 Encode Q into Q by the question encoder;
3 Compute C by Eq. (4)  // temporal correlation
4 Compute F by Eq. (5)  // intra-sequence dependency
5 Compute S by Eq. (7)  // cross-sequence interaction
6 Reduce F with S to the FVTA vector h by Eq. (8);
7 return h;

5. Experiments

5.1. MemexQA

Dataset. MemexQA [9] is a recently proposed visual-text question answering dataset. The dataset consists of 20,860 questions about 13,591 personal photos belonging to 101 real Flickr users. These personal photos capture a variety of key moments of their lives, such as a trip to Japan, wedding ceremonies, family parties, etc. Each album and photo comes with comprehensive visual-text information, including a timestamp, GPS, a photo title, an album title and a description. The metadata is incomplete: GPS, the photo title, the album title and the description may not be present for every photo.

MemexQA provides 4 answer choices with only one correct answer for each question. The dataset also provides more than one ground truth grounding image for each question. There are five types of questions, corresponding to the frequent search terms discovered in the Flickr search logs [8]. The input visual-text sequence length varies across questions. Some questions are about images taken on a certain date, e.g., "what did we do after the 2006 Halloween party?"; others are about all images, e.g., "what was the last time we drove to a bar?".

Baseline Methods. A large proportion of existing solutions project images or videos into an embedding space and train a classification model using these embeddings. We implement the following methods as baselines: Logistic Regression predicts the answer from concatenated image, question and metadata features, as reported in [9]. Embedding + LSTM utilizes word embeddings and character embeddings, along with the same visual embeddings used in FVTA. Embeddings are encoded by an LSTM and averaged to get the final context representation. Embedding + LSTM + Concat concatenates the last LSTM outputs from the different modalities to produce the final output. On the other hand, we compare the proposed model to a rich collection of VQA attention models: Classic Soft Attention uses the classic one-dimensional question-to-context attention to summarize the context for question answering. A correlation matrix between each question word and the context is used to compute the attention, as in [19, 26]. DMN+ is the improved dynamic memory network [25], one of the representative architectures that achieve good performance on the VQA task. We implement the DMN+ network with each sentence and each photo representation used in our proposed network as supporting-facts input. Multimodal Compact Bilinear Pooling [5] is the state-of-the-art method on the VQA [2] dataset. The spatial attention in the original model is directly used on the sequential image input. The hyperparameters, including the output dimension of MCB and the hidden size of the LSTM, are selected based on the validation results. Bi-directional Attention Flow implements


Method                                     how many   what     when     where    who      overall
                                           (11.8%)    (41.9%)  (16.2%)  (17.2%)  (12.9%)

Logistic Regression                        0.645      0.241    0.217    0.277    0.260    0.295
Embedding + LSTM                           0.771      0.564    0.349    0.314    0.310    0.478
Embedding + LSTM + Concat                  0.776      0.668    0.398    0.433    0.409    0.563
DMN+ [25]                                  0.792      0.616    0.346    0.248    0.224    0.480
Multimodal Compact Bilinear Pooling [5]    0.773      0.618    0.250    0.229    0.248    0.462
Bi-directional Attention Flow [19]         0.790      0.689    0.356    0.567    0.468    0.598
Soft Attention                             0.795      0.697    0.346    0.604    0.582    0.621
TGIF Temporal Attention [7]                0.761      0.700    0.522    0.582    0.477    0.630
FVTA                                       0.761      0.714    0.476    0.676    0.668    0.669

Table 1. Comparison of different methods on MemexQA by question type. The first three methods do not use the attention mechanism.

the single-modal attention flow model [19] over all concatenated context representations, with the same embeddings as in the FVTA network. TGIF Temporal Attention [7] is a recently proposed spatio-temporal reasoning network for sequential animated image QA. Since the other baseline methods do not use spatial attention, we compare with the TGIF network using temporal attention only. TGIF temporal attention uses a simple MLP to compute the attention and only considers the last hidden state of the question. We compute the attention following [7] and use the same output layer as in our method.

Implementation Details. In the MemexQA dataset, each question is asked over a sequence of photos organized in albums. A photo may have 5 types of textual metadata: the album title, the album description, the GPS location, the timestamp and a photo title. We use N to denote the maximum number of albums, K the maximum number of photos in an album and V the maximum number of words. For album-level textual sequences like album titles and descriptions, the K dimension only has one item and the others are zero-padded. We also use zeros to pad positions with no word/image. We encode GPS locations using words. The photos and their corresponding metadata form the visual-text sequences. All questions, textual context and answers are tokenized using the Stanford word tokenizer. We use pre-trained GloVe word embeddings [18], which are fixed during training. For image/video embedding, we extract fixed-size features using the pre-trained CNN model Inception-ResNet [21], by concatenating the pool5 layer and the classification layer's output before softmax. We then use a linear transformation to compress the image feature to 100 dimensions. A bi-directional LSTM is then used for each modality to obtain contextual representations. Given a hidden state size d, which is set to 50, we concatenate the outputs of both directions of the LSTM and get a question matrix Q ∈ R^{2d×M} and a context tensor H ∈ R^{2d×V×K×N×6} for all media documents. We reshape the context tensor into H ∈ R^{2d×T×6}. To select the best hyperparameters, we randomly select 20% of the official training set as the validation set. We use the AdaDelta [28] optimizer with an initial learning rate of 0.5 and train for 200 epochs with a dropout rate of 0.3.

5.1.1 Comparison to the state-of-the-art

Table 1 compares the accuracy on MemexQA. As we can see, the proposed method consistently outperforms the baseline methods and achieves state-of-the-art accuracy on this dataset. The first 3 methods in the table show the performance of embedding methods without any attention. Although embedding methods are relatively simple to implement, their performance is much lower than that of the proposed FVTA model. The experimental results support the use of attention across images and image sequences. Compared to previous attention models, our FVTA network significantly outperforms the other methods, which demonstrates the efficacy of the proposed approach.

Method            HIT@1     HIT@3     mAP
Soft Attention     1.16%    12.60%    0.168±0.002
MCB               11.98%    30.54%    0.269±0.005
TGIF Temporal     13.28%    32.83%    0.289±0.005
FVTA              15.48%    35.66%    0.312±0.005

Table 2. The quality comparison of the learned FVTA and classic attention. We compare the image with the highest activation in a learned attention to the ground truth evidence photos which humans used to answer the question. HIT@1 means the rate at which the top attended image is found among the ground truth evidence photos. AP is computed on the photos ranked by their attention activation.
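For clarity, here is a minimal sketch of the HIT@k metric as we read it from the caption above (the function name and signature are ours):

```python
import numpy as np

def hit_at_k(attention_scores, evidence_ids, k=1):
    """Return 1.0 if any of the k most-attended photos is a ground-truth
    evidence photo, else 0.0; averaged over questions this gives HIT@k.

    attention_scores : (num_photos,) attention activation per photo
    evidence_ids     : set of indices of ground-truth evidence photos
    """
    top_k = np.argsort(-attention_scores)[:k]
    return float(any(int(i) in evidence_ids for i in top_k))

# toy usage: photo 1 gets the highest attention and is in the evidence set
scores = np.array([0.10, 0.70, 0.05, 0.15])
print(hit_at_k(scores, evidence_ids={1, 3}, k=1))  # 1.0
```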

The MemexQA dataset provides ground truth evidence photos for every question. We can therefore compare the photos with the highest attention weights against the ground truth photos needed to correctly answer a question. An ideal VQA model should not only enjoy a high accuracy in answering questions (Table 1) but also find images that are highly correlated with the ground-truth evidence photos. Table 2 lists the accuracy of whether a model puts its focus on the correct photos. FVTA outperforms the other attention models at finding the relevant photos for the question. The results show that the proposed attention can capture salient information for answering the question. For a qualitative comparison, we select some representative questions and show both the answer and the top retrieved images based on the attention weights in Fig. 4. As shown in the


first example, the system has to find the correct photo and visually identify the object to answer the question "what did the daughter eat while her dad was watching during the trip in June 2010?". The FVTA attention puts a high weight on the correct photo of the girl eating a corn, which leads to correctly answering the question. For soft attention, in contrast, the one-dimensional attention network selects the wrong image and gets the wrong answer. This example shows the advantage of FVTA's modeling of the correlation at every time step, across visual-text sequences, over traditional one-dimensional attention.

5.1.2 Ablation Study

Table 3 shows the performance of the FVTA mechanism and its ablations on the MemexQA dataset. To evaluate the FVTA attention mechanism, we first replace our kernel tensor with a simple cosine similarity function. The results show that the standard cosine similarity is inferior to our similarity function. For ablating the intra-sequence dependency, we use the representations from the last time step of each context document. For ablating the cross-sequence interaction, we average all attended context representations from the different modalities to get the final context vector. Both aspects of correlation in the FVTA attention tensor contribute to the model's performance, with the intra-sequence dependency being more important in this experiment. We evaluate the effectiveness of the context-aware question attention by removing the question attention and using the last time step of the LSTM output from the question as the question representation. It shows that the question attention provides a slight improvement. Finally, we train FVTA without photos to see the contribution of the visual information. The result is quite good, but this is perhaps not surprising due to the language bias in the questions and answers of the dataset, which is not uncommon in VQA datasets [2] and in Visual7W [31]. This also leaves significant room for improvement with visual information.

Ablations                      Accuracy     ∆

FVTA w/ Cosine Similarity      0.619        -4.9%
FVTA w/o Intra-seq             0.569       -10.0%
FVTA w/o Cross-seq             0.604        -6.5%
FVTA w/o Question Attention    0.629        -4.0%
FVTA w/o Photos                0.577        -9.1%

Table 3. Ablation studies of the proposed FVTA method on the MemexQA dataset. The last column shows the performance drop.

5.2. MovieQA

Dataset. The MovieQA dataset consists of 140 movies and 6,462 multiple-choice QA pairs. Each QA pair contains five answer choices with only one correct answer. Systems are required to answer the questions given a number of movie

Method               Val      Test
SSCB [22]            0.219    -
MemN2N [22]          0.342    -
DEMN [10]            -        0.300
Soft Attention       0.321    -
MCB [5]              0.362    -
TGIF Temporal [7]    0.371    -
RWMN [15]            0.387    0.363
FVTA                 0.410    0.373

Table 4. Accuracy comparison on the validation and test sets of the MovieQA dataset. The test set performance can only be evaluated on the MovieQA server, and thus not all studies report test set accuracy.

clips from the same movie and the corresponding subtitles. More details of the dataset can be found in [22].

Implementation Details. In the MovieQA dataset, each QA is given a set of N movie clips of the same movie, and each clip comes with subtitles. We implement the FVTA network for the MovieQA task with a modality number of 2 (video & text). We set the maximum number of movie clips per question to N = 20, the maximum number of frames to consider to F = 10, the maximum number of subtitle sentences in a clip to K = 100 and the maximum number of words to V = 10. Visual and text sequences are encoded in the same way as in the MemexQA [9] experiment. We use the AdaDelta [28] optimizer with a minibatch size of 16 and an initial learning rate of 0.5, and train for 300 epochs. The dropout rate is set to 0.2 during training. The official training/validation/test split is used in our experiments.

Experimental Results. We compare FVTA with recent results on the MovieQA dataset, including the End-to-End Memory Network (MemN2N) [23], the Deep Embedded Memory Network (DEMN) [10], and the Read-Write Memory Network (RWMN) [15]. Table 4 shows the detailed comparison of MovieQA results using both videos and subtitles. The FVTA model outperforms all baseline methods and achieves comparable performance to the state-of-the-art result2 on the MovieQA test server. Notably, RWMN [15] is a very recent work that uses a memory net to cache the sequential input, with high capacity and flexibility due to its read and write networks. Our accuracy is 0.410 (vs 0.387 by RWMN) on the validation set and 0.373 (vs 0.363) on the test set. Benefiting from its modeling ability, FVTA consistently outperforms the classical attention models, including soft attention, MCB [5] and TGIF [7]. The result demonstrates the consistent advantage of FVTA over other attention models in question answering for multiple-sequence data.

Fig. 5 illustrates the output of our FVTA model. FVTA can not only predict the correct answer, but also identify the most relevant subtitle description as well as the movie

2The best test accuracy on the leaderboard by the time of paper submission (Nov. 2017) is 0.39 (Layered Memory Networks). It is not included in the table as there is no publication to cite.


Figure 4. Qualitative comparison of the FVTA model and other attention models on the MemexQA dataset. For each question, we show the answer and the images with the highest attention weights. Images are ranked from left to right based on the attention weights. The correct images and answers have green borders whereas the incorrect ones are surrounded by red borders.

Figure 5. Qualitative analysis of FVTA on the MovieQA dataset. It shows the visual justification (movie clip frames) and text justification (subtitles) based on the top attention activations. Both justifications provide supporting evidence for the system to reach the correct answer.

clip frames. As shown in Fig. 5, FVTA can provide fine-grained justifications such as the most informative movie frames or subtitle sentences, whereas most existing methods cannot find fine-grained justifications from attention computed at the movie clip level. We believe these results show the benefits and potential of the FVTA model.

6. Conclusions and future work

In this paper, we introduced a novel neural network model called the Focal Visual-Text Attention network for answering questions over visual-text sequences. FVTA employs a hierarchical process to dynamically determine which modality and which snippets to focus on in the sequential data to answer the question, and hence can not only predict the correct answers but also find the correct supporting justifications to help users verify the system's results. The comprehensive experimental results demonstrate that FVTA achieves results comparable to or even better than the state of the art on two major question answering benchmarks of sequential visual-text data. Our future work includes extending FVTA to large-scale, long visual-text sequences and removing the use of answer choice embeddings as input.

Acknowledgements. We would like to thank the anonymous reviewers for useful comments and Google Cloud for providing GCP research credits.


References

[1] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Learning to compose neural networks for question answering. arXiv preprint arXiv:1601.01705, 2016. 2
[2] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. VQA: Visual question answering. In CVPR, 2015. 1, 2, 5, 7
[3] H. Ben-younes, R. Cadene, M. Cord, and N. Thome. MUTAN: Multimodal Tucker fusion for visual question answering. arXiv preprint arXiv:1705.06676, 2017. 1, 2
[4] A. Das, H. Agrawal, L. Zitnick, D. Parikh, and D. Batra. Human attention in visual question answering: Do humans and deep networks look at the same regions? Computer Vision and Image Understanding, 2017. 2
[5] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847, 2016. 1, 2, 5, 6, 7
[6] K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. Teaching machines to read and comprehend. In NIPS, 2015. 1
[7] Y. Jang, Y. Song, Y. Yu, Y. Kim, and G. Kim. TGIF-QA: Toward spatio-temporal reasoning in visual question answering. In CVPR, 2017. 2, 6, 7
[8] L. Jiang, Y. Kalantidis, L. Cao, S. Farfade, J. Tang, and A. G. Hauptmann. Delving deep into personal photo and video search. In WSDM, 2017. 5
[9] L. Jiang, J. Liang, L. Cao, Y. Kalantidis, S. Farfade, and A. G. Hauptmann. MemexQA: Visual memex question answering. arXiv preprint arXiv:1708.01336, 2017. 2, 5, 7
[10] K.-M. Kim, M.-O. Heo, S.-H. Choi, and B.-T. Zhang. DeepStory: Video story QA by deep embedded memory networks. arXiv preprint arXiv:1707.00836, 2017. 2, 7
[11] Y. Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014. 3
[12] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32-73, 2017. 2
[13] J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical question-image co-attention for visual question answering. In NIPS, 2016. 1, 2
[14] M. Malinowski, M. Rohrbach, and M. Fritz. Ask your neurons: A neural-based approach to answering questions about images. In ICCV, 2015. 1
[15] S. Na, S. Lee, J. Kim, and G. Kim. A read-write memory network for movie story understanding. arXiv preprint arXiv:1709.09345, 2017. 2, 7
[16] H. Nam, J.-W. Ha, and J. Kim. Dual attention networks for multimodal reasoning and matching. arXiv preprint arXiv:1611.00471, 2016. 1, 2
[17] H. Noh, P. Hongsuck Seo, and B. Han. Image question answering using convolutional neural network with dynamic parameter prediction. In CVPR, 2016. 2
[18] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In EMNLP, 2014. 6
[19] M. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi. Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603, 2016. 4, 5, 6
[20] K. J. Shih, S. Singh, and D. Hoiem. Where to look: Focus regions for visual question answering. In CVPR, 2016. 1, 2
[21] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, 2017. 6
[22] M. Tapaswi, Y. Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, and S. Fidler. MovieQA: Understanding stories in movies through question-answering. In CVPR, 2016. 2, 7
[23] M. Tapaswi, Y. Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, and S. Fidler. MovieQA: Understanding stories in movies through question-answering. In CVPR, 2016. 7
[24] S. Wang and J. Jiang. A compare-aggregate model for matching text sequences. arXiv preprint arXiv:1611.01747, 2016. 3
[25] C. Xiong, S. Merity, and R. Socher. Dynamic memory networks for visual and textual question answering. In ICML, 2016. 5, 6
[26] H. Xu and K. Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In ECCV, 2016. 2, 5
[27] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. In CVPR, 2016. 1, 2
[28] M. D. Zeiler. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012. 6, 7
[29] K.-H. Zeng, T.-H. Chen, C.-Y. Chuang, Y.-H. Liao, J. C. Niebles, and M. Sun. Leveraging video descriptions to learn video question answering. In AAAI, 2017. 2
[30] L. Zhu, Z. Xu, Y. Yang, and A. G. Hauptmann. Uncovering temporal context for video question and answering. arXiv preprint arXiv:1511.04670, 2015. 2
[31] Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei. Visual7W: Grounded question answering in images. In CVPR, 2016. 2, 7