
Hao Tan Mohit Bansal

UNC Chapel Hill

NAACL 2018

Object Ordering with Bidirectional Matchings for Visual Reasoning


Problem Description

Statement: There is a box with a yellow circle, a yellow square and two black items.

Answer: True

[Image]

Suhr et al. (2017). A Corpus of Natural Language for Visual Reasoning.

Problem Description

Statement: At least one of the towers with exactly three blocks has a blue block in the middle.

Answer: False

[Image]

Suhr et al. (2017). A Corpus of Natural Language for Visual Reasoning.

Two Representations

Raw RGB Image | Structured Representation

*Only one sub-image is shown here.

Shape    Location   Size    Color
(icon)   Top        Large   Blue
(icon)   Top        Small   Blue
(icon)   Right      Small   Blue
(icon)   Bottom     Small   Blue
(icon)   Bottom     Large   Black

Two Representations

Raw RGB Image | Structured Representation (The Dataset View)

*Only one sub-image is shown here.


0.3, 0.3 3 1

0.4, 0.3 1 1

0.7, 0.9 1 1

0.8, 0.3 1 1

0.8, 0.7 3 2
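A minimal sketch of how rows like the ones above could be loaded into per-object records. The class name `SceneObject`, the helper `parse_row`, and the field order (two location coordinates, then size, then color) are assumptions read off the column headers of the previous slide. The shape column does not appear in these rows, so it is left out.

```python
from dataclasses import dataclass


@dataclass
class SceneObject:
    # Field names and order are assumptions, not taken from the dataset docs.
    x: float    # first location coordinate
    y: float    # second location coordinate
    size: int   # size code as printed in the row
    color: int  # color code as printed in the row


def parse_row(row: str) -> SceneObject:
    """Parse a row such as '0.3, 0.3 3 1' into a SceneObject."""
    parts = row.replace(",", " ").split()
    return SceneObject(float(parts[0]), float(parts[1]), int(parts[2]), int(parts[3]))


rows = ["0.3, 0.3 3 1", "0.4, 0.3 1 1", "0.7, 0.9 1 1", "0.8, 0.3 1 1", "0.8, 0.7 3 2"]
objects = [parse_row(r) for r in rows]
```

Each sub-image then becomes a variable-length list of such records, which is the kind of input the object encoder on the later slides has to handle.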


Methods

• For Structured Representation:
  1. Bidirectional Attention
  2. Object Ordering with Pointer Network

• For Raw Image:
  3. CNN-Bidirectional Attention

Bidirectional Attention

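Example statement: There is a box with three circles between a square and a triangle.

[Figure: full bidirectional attention model. Labeled components: OBJ LSTM and LANG LSTM encoders, an FC + Attention block on each side, two LSTMs each followed by MAX pooling, an MLP, and a MAX POOL that outputs the probability.]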

[Figure: BiDAF Model. Layers, bottom to top: Character Embed Layer (Char-CNN), Word Embed Layer (GloVe), Contextual Embed Layer (LSTMs over the context x1 ... xT and the query q1 ... qJ, giving h1 ... hT and u1 ... uJ), Attention Flow Layer (Query2Context and Context2Query attention, giving g1 ... gT), Modeling Layer (LSTMs, giving m1 ... mT), Output Layer (Dense + Softmax for Start, LSTM + Softmax for End).]

Seo et al. (2016). Bi-Directional Attention Flow for Machine Comprehension.

Backbone of BiDAF (SQuAD)

Context: The Apollo program, also known as Project Apollo, was the third United States human spaceflight program carried out by the National ………. address to Congress. Project Mercury was followed by the two-man Project Gemini (1962–66). The first manned flight of Apollo was in 1968.

Question: What project put the first Americans into space?

Answer: Project Mercury

[Figure: BiDAF backbone. Labeled components: QUE LSTM and CTX LSTM (Encoding Layer), FC + Attention, LSTM (Modeling Layer).]

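Basic Model with BiDAF

Suppose there is only one image (instead of 3 sub-images).

[Figure: BiDAF backbone applied to the task. Labeled components: LANG LSTM (Encoding Layer), FC + Attention (annotated "Unidirectional Attention Here"), LSTM (Modeling Layer). The output is still labeled Answer: Project Mercury.]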

Modification 1: Probability Prediction

We added the max-pooling layer followed by the MLP.

Example statement: There is a box with three circles between a square and a triangle.

[Figure: labeled components: OBJ ENC and LANG LSTM (Encoding Layer), FC + Attention, LSTM (Modeling Layer), MAX, MLP, Prob.]

Modification 2: Object Encoder

The object encoder is a recurrent neural network.

[Figure: labeled components: OBJ LSTM and LANG LSTM (Encoding Layer), FC + Attention, LSTM (Modeling Layer), MAX, MLP, Prob.]

Object Encoder: LSTM (First Approach)

[Figure: a chain of five LSTM cells reading the Structured Representation.]

Structured Representation

0.3, 0.3 3 1

0.4, 0.3 1 1

0.7, 0.9 1 1

0.8, 0.3 1 1

0.8, 0.7 3 2

Ordering: Random

Why?
• LSTM could handle variable length
• LSTM could learn object relationships

Second Approach: Learning the order via pointer network

Modification 2: Object Encoder

The obj-sequence is processed by LSTM in random order.

[Figure: labeled components: OBJ LSTM and LANG LSTM, FC + Attention (annotated "Unidirectional Attention"), LSTM, MAX, MLP, Probability.]

Modification 3: Bidirectional Attention


Object-to-language attention is added.

[Figure: labeled components: OBJ LSTM and LANG LSTM, an FC + Attention block on each side, two LSTMs each followed by MAX, MLP, Probability.]

Bidirectional Attention Details

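Example statement: There is exactly one black triangle not touching any edge

[Figure: on both the LANG side ({h_i}) and the OBJ side ({g_k}), an LSTM produces Features, attention produces Attentive Features, and a further LSTM produces Contextualized Features.]

Text2Img Attention

\[ \alpha_{i,k} = \mathrm{softmax}_k \left( h_i^\top B_1 \, g_k \right) \]

\[ c_i = \sum_k \alpha_{i,k} \cdot g_k \]

\[ \hat{h}_i = \mathrm{relu} \left( W_{\mathrm{LANG}} \, [h_i ;\; c_i ;\; h_i - c_i ;\; h_i \odot c_i] \right) \]

Img2Text Attention

\[ \beta_{k,i} = \mathrm{softmax}_i \left( g_k^\top B_2 \, h_i \right) \]

\[ d_k = \sum_i \beta_{k,i} \cdot h_i \]

\[ \hat{g}_k = \mathrm{relu} \left( W_{\mathrm{OBJ}} \, [g_k ;\; d_k ;\; g_k - d_k ;\; g_k \odot d_k] \right) \]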
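The two attention directions can be sketched in plain Python, with lists standing in for vectors. This is a toy illustration of the bilinear score, weighted sum, and fusion step, not the authors' implementation. The helper names (`attend`, `fuse`), the identity matrices used for B1 and B2, and the values in `W_lang` and `W_obj` are all assumptions chosen for readability.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(queries, keys, B):
    """For each query q: weights = softmax over keys of q^T B key; return weighted sums of keys."""
    contexts = []
    dim = len(keys[0])
    for q in queries:
        weights = softmax([dot(q, matvec(B, k)) for k in keys])
        contexts.append([sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)])
    return contexts


def fuse(W, x, c):
    """relu(W [x; c; x - c; x * c]), where x * c is the element-wise product."""
    feats = x + c + [a - b for a, b in zip(x, c)] + [a * b for a, b in zip(x, c)]
    return [max(0.0, v) for v in matvec(W, feats)]


# Toy inputs (hypothetical): 2 word vectors h_i and 3 object vectors g_k, both of size 2.
h = [[1.0, 0.0], [0.0, 1.0]]
g = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]      # stands in for B1 and B2
W_lang = [[0.1] * 8, [-0.1] * 8]         # 2 x 8, arbitrary values
W_obj = [[0.2] * 8, [-0.2] * 8]          # 2 x 8, arbitrary values

c = attend(h, g, identity)               # each word attends over the objects
d = attend(g, h, identity)               # each object attends over the words
h_hat = [fuse(W_lang, hi, ci) for hi, ci in zip(h, c)]
g_hat = [fuse(W_obj, gk, dk) for gk, dk in zip(g, d)]
```

Both directions reuse the same `attend` routine with the roles of the two sequences swapped, which is what makes the matching bidirectional.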

Three Sub-Images


A max-pooling layer is added to combine the sub-image scores.

[Figure: labeled components: OBJ LSTM and LANG LSTM, an FC + Attention block on each side, two LSTMs each followed by MAX, MLP, MAX POOL, Probability.]

Why max-pooling?
1. Permutation invariant
2. Dataset statistics: majority about existence
(Also tried min/mean/early-pooling, LSTM, concatenation)
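A small sketch of the first point, permutation invariance. The function name `combine_subimage_scores` and the three scores are hypothetical. The check confirms that reordering the sub-images cannot change the pooled score.

```python
import itertools


def combine_subimage_scores(scores):
    """Final score is the highest score among the sub-images."""
    return max(scores)


scores = [0.12, 0.91, 0.05]  # hypothetical per-sub-image scores
final = combine_subimage_scores(scores)

# Every ordering of the sub-images gives the same pooled score.
invariant = all(
    combine_subimage_scores(list(p)) == final
    for p in itertools.permutations(scores)
)
```

Because only the best-scoring sub-image contributes, this pooling suits statements about existence in a single sub-image.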

Object Ordering with Pointer Network


[Figure: labeled components: Pointer Network (Encoder, Decoder), OBJ LSTM and LANG LSTM, an FC + Attention block on each side, two LSTMs each followed by MAX, MLP, MAX POOL, Probability.]

Pointer Network

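[Figure: Seq-to-Seq Model compared with Pointer Network.]

\[ u^i_j = v^\top \tanh \left( W_1 e_j + W_2 d_i \right), \quad j \in (1, \ldots, n) \]

\[ p(C_i \mid C_1, \ldots, C_{i-1}, \mathcal{P}) = \mathrm{softmax}(u^i) \]

Vinyals et al. (2016). Pointer Networks.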
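The pointer score can be computed directly from its definition. The sketch below is a toy version with identity matrices for W1 and W2 and made-up encoder and decoder states. The function name `pointer_distribution` and all numeric values are assumptions for illustration.

```python
import math


def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def pointer_distribution(encoder_states, decoder_state, W1, W2, v):
    """Return softmax over j of v^T tanh(W1 e_j + W2 d_i)."""
    proj_d = matvec(W2, decoder_state)
    scores = []
    for e_j in encoder_states:
        proj_e = matvec(W1, e_j)
        hidden = [math.tanh(a + b) for a, b in zip(proj_e, proj_d)]
        scores.append(sum(v_k * h_k for v_k, h_k in zip(v, hidden)))
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


# Toy values (hypothetical): three encoder states e_j and one decoder state d_i.
e = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
d_i = [0.5, -0.5]
identity = [[1.0, 0.0], [0.0, 1.0]]
probs = pointer_distribution(e, d_i, identity, identity, [1.0, 1.0])
```

The output is a distribution over input positions rather than over a fixed vocabulary, which is what lets the network point at one of the input objects.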

Object Ordering: Pointer Network

At each decoding step, the decoder will select an object without replacement.

[Figure: LANG LSTM above a row of LSTM cells labeled Encoder and Decoder.]
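Selection without replacement can be sketched by masking out already chosen objects before each softmax. In this toy version the scores are a fixed table, `scores[i][j]`, giving the score of object j at step i. In a real pointer network they would come from the decoder state at each step. The function name `sample_permutation` and the score values are hypothetical.

```python
import math
import random


def sample_permutation(scores, rng):
    """Sample an ordering of the objects and return it with its log-probability."""
    n = len(scores)
    remaining = list(range(n))
    order, log_prob = [], 0.0
    for i in range(n):
        # Only objects that have not been picked yet compete at this step.
        u = [scores[i][j] for j in remaining]
        m = max(u)
        exps = [math.exp(x - m) for x in u]
        total = sum(exps)
        probs = [x / total for x in exps]
        pick = rng.choices(range(len(remaining)), weights=probs, k=1)[0]
        log_prob += math.log(probs[pick])
        order.append(remaining.pop(pick))
    return order, log_prob


scores = [[0.2, 1.0, -0.5], [0.0, 0.3, 0.9], [0.1, 0.1, 0.1]]  # hypothetical
order, log_prob = sample_permutation(scores, random.Random(0))
```

The returned log-probability is the sum of the per-step log-probabilities, which is the quantity a policy-gradient update needs.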

Object Ordering: Pointer Network

Original: OBJ LSTM

New: Pointer Network (Encoder, Decoder), then OBJ LSTM

Pointer Network (Full Model)

Before the OBJ-LSTM, the objects are sorted by the pointer network.

[Figure: labeled components: Pointer Network, LANG LSTM, an FC + Attention block on each side, two LSTMs each followed by MAX, MLP, MAX POOL, Probability.]

Pointer Network Optimization

Mathematical Perspective

\[ p(\pi \mid s, o) = \prod_i p\left(\pi(i) \mid \pi(<i), s, o\right) \]

\[ L_{RL}(s, o, y) = \mathbb{E}_{\pi \sim p(\cdot \mid s, o)} \, L(s, o[\pi], y) \]

\[ R = -L(s, o[\pi^*], y) \]

\[ \nabla_\theta L_{RL}(s, o, y) \approx -(R - b) \, \nabla_\theta \log p(\pi^* \mid s, o) + \nabla_\theta L(s, o[\pi^*], y) \]

1. Sample a permutation of the objects
2. Calculate the loss
3. Policy gradient with the reward

Programming Perspective

[Figure: diagram with the labels π, L(·, π, ·), and −L(·, π, ·).]

Bello et al. (2016). Neural combinatorial optimization with reinforcement learning.
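The three steps can be mirrored by a scalar surrogate whose gradient matches the estimator above. This is a sketch in plain Python with no automatic differentiation. In a framework with autodiff, the advantage (R − b) would be treated as a constant. The function names, the moving-average baseline update, and the numbers are assumptions. The slide names the baseline b but does not say how it is computed.

```python
def reinforce_surrogate(loss, log_prob, baseline):
    """Scalar -(R - b) * log p(pi*) + L for one sampled permutation pi*."""
    reward = -loss                  # R = -L(s, o[pi*], y)
    advantage = reward - baseline   # held constant when differentiating
    return -advantage * log_prob + loss


def update_baseline(baseline, reward, decay=0.9):
    """Exponential moving average of rewards (one common choice, assumed here)."""
    return decay * baseline + (1.0 - decay) * reward


# Hypothetical values for one training example.
loss = 0.7        # step 2: loss under the sampled ordering
log_prob = -1.2   # log p(pi* | s, o) from step 1
baseline = -0.5
surrogate = reinforce_surrogate(loss, log_prob, baseline)
new_baseline = update_baseline(baseline, -loss)
```

The first term only reaches the pointer network through log p(π* | s, o), while the second term trains the rest of the model on the sampled ordering.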

Structured Representation: Full Model



[Figure: labeled components: Pointer Network, LANG LSTM, an FC + Attention block on each side, two LSTMs each followed by MAX, MLP, MAX POOL, Probability.]

Raw Image: CNN-Bidirectional Attention


[Figure: labeled components: OBJ CNN and LANG LSTM, an FC + Attention block on each side, an LSTM and a CNN each followed by MAX, MLP, MAX POOL, Probability.]

Structured (LSTM) vs. Image (CNN)

[Figure: left, the Structured Representation read by a chain of five LSTM cells; right, the Raw RGB Image passed through a CNN to give an Image Spatial Feature Map.]

Bidirectional Attention for Raw Image Model

Example statement: There is exactly one black triangle not touching any edge

[Figure: labeled components: CNN, Image Spatial Feature Map, LSTM, Attentive Feature Map, Contextualized Feature Map.]

Results

Model                                 Dev      Test-P   Test-U

STRUCTURED REPRESENTATIONS DATASET
MAXENT (Suhr et al., 2017)            68.0%    67.7%    67.8%
MLP (Suhr et al., 2017)               67.5%    66.3%    65.3%
ImageFeat+RNN (Suhr et al., 2017)     57.7%    57.6%    56.3%
RelationNet (Santoro et al., 2017)    65.1%    62.7%    -
BiDAF (Seo et al., 2016)              66.5%    68.4%    -
BiENC Model                           65.1%    63.4%    -
BiATT Model                           72.6%    72.3%    -
BiATT-Pointer Model                   74.6%    73.9%    71.8%

RAW IMAGE DATASET
CNN+RNN (Suhr et al., 2017)           56.6%    58.0%    56.3%
NMN (Suhr et al., 2017)               63.1%    66.1%    62.0%
CNN-BiENC Model                       58.7%    58.7%    -
CNN-BiATT Model                       66.9%    69.7%    66.1%

Table 1: Dev, Test-P (public), and Test-U (unreleased) results of our model on the structured-representation and raw-image datasets, compared to the previous SotA results and other reimplemented baselines.

Negative Examples

• The top of the three towers are not the same. Correct Answer: True
• There are 2 boxes with at least 2 blue items. Correct Answer: True
• There is a blue object touching the base. Correct Answer: False
• There are at least three yellow objects touching any edge. Correct Answer: True

Figure 3: Incorrectly-classified examples.

Results on Raw Images Dataset: To further show the effectiveness of our BiATT model, we apply this model to the raw image version of the NLVR dataset, with minimal modification. We simply replace each object-related LSTM with a visual feature CNN that directly learns the structure via pixel-level, spatial filters (instead of a pointer network which addresses an unordered sequence of structured object representations). As shown in Table 1, this CNN-BiATT model outperforms the neural module networks (NMN) (Andreas et al., 2016) previous-best result by 3.6% on the public test set and 4.1% on the unreleased test set. More details and the model figure are in the appendix.

Output Example Analysis: Finally, in Fig. 1, we show some output examples which were successfully solved by our BiATT-Pointer model but failed in our strong baselines. The left two examples in Fig. 1 could not be handled by the BiENC model. The right two examples are incorrect for the BiATT model without the ordering-based pointer network. Our model can quite successfully understand the complex meanings of the attributes and their relationships with the diverse objects, as well as count the occurrence of and reason over objects without any specialized features.

Next, in Fig. 3, we also show some negative examples on which our model fails to predict the correct answer. The top two examples involve complex high-level phrases e.g., “touching any edge” or “touching the base”, which are hard for an end-to-end model to capture, given that such statements are rare in the training data. Based on the result of the validation set, the max-pooling layer is selected as the combination method in our model. The max-pooling layer will choose the highest score from the sub-images as the final score. Thus, the layer could easily handle statements about single-subimage-existence based reasoning (e.g., the 4 positively-classified examples in Fig. 1). However, the bottom two negatively-classified examples in Fig. 3 could not be resolved because of the limitation of the max-pooling layer on scenarios that consider multiple-subimage-existence. We did try multiple other pooling and combination methods, as mentioned in Sec. 3.1. Among these methods, the concatenation, early pooling and LSTM-fusion approaches might have the ability to solve these particular bottom-two failed statements. In our future work, we are addressing multiple types of pooling methods jointly.

6 Conclusion

We presented a novel end-to-end model with joint bidirectional attention and object-ordering pointer networks for visual reasoning. We evaluate our model on both the structured-representation and raw-image versions of the NLVR dataset and achieve substantial improvements over the previous end-to-end state-of-the-art results.

Acknowledgments

We thank the anonymous reviewers for their helpful comments. This work was supported by a Google Faculty Research Award, a Bloomberg Data Science Research Grant, an IBM Faculty Award, and NVidia GPU awards.

*Goldman et al. (ACL 2018) use extra, manually-labeled semantic parsing data to achieve a better structured rep. result.

Positive Examples

• There is at least one tower which has blocks of all three colors. Answer: True
• There is a box with a yellow circle, a yellow square and two black items. Answer: True
• At least one of tower with exactly three blocks has a blue block in the middle. Answer: False
• There is a black block attach to a yellow block that is attach to a blue block. Answer: True

Negative Examples

• The top of the three towers are not the same. Correct Answer: True
• There are 2 boxes with at least 2 blue items. Correct Answer: True
• There is a blue object touching the base. Correct Answer: False
• There are at least three yellow objects touching any edge. Correct Answer: True

Thank You!

