Hao Tan Mohit Bansal
UNC Chapel Hill
NAACL 2018
Object Ordering with Bidirectional Matchings for Visual Reasoning
1
Problem Description
There is a box with a yellow circle, a yellow square and two black items.
True
Image
Statement
Answer
Suhr et al. (2017). A Corpus of Natural Language for Visual Reasoning.
2
Problem Description
At least one of the towers with exactly three blocks has a blue block in the middle.
Image
Statement
False
Answer
Suhr et al. (2017). A Corpus of Natural Language for Visual Reasoning.
3
Two Representations
Raw RGB Image
Structured Representation
*Only one sub-image is shown here.
Shape Location Size Color
Top Large Blue
Top Small Blue
Right Small Blue
Bottom Small Blue
Bottom Large Black
4
Two Representations
Raw RGB Image
Structured Representation
(The Dataset View)
*Only one sub-image is shown here.
Shape Location Size Color
0.3, 0.3 3 1
0.4, 0.3 1 1
0.7, 0.9 1 1
0.8, 0.3 1 1
0.8, 0.7 3 2
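The dataset view above stores each object as a short row of numbers. A minimal sketch of reading such rows in Python follows. The class name `SceneObject`, the field names, and the helper `parse_row` are hypothetical and chosen only for illustration; the shape column is left out because its values are not present in this transcript.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SceneObject:
    # Hypothetical field names: the slide only labels the columns
    # Location, Size and Color.
    loc_a: float
    loc_b: float
    size: int
    color: int

    def features(self) -> List[float]:
        """Flatten the record into a numeric feature vector."""
        return [self.loc_a, self.loc_b, float(self.size), float(self.color)]


def parse_row(row: str) -> SceneObject:
    """Parse a row such as '0.3, 0.3 3 1' from the dataset view."""
    loc, size, color = row.rsplit(" ", 2)
    a, b = (float(x) for x in loc.split(","))
    return SceneObject(a, b, int(size), int(color))


# The five rows shown on the slide.
ROWS = ["0.3, 0.3 3 1", "0.4, 0.3 1 1", "0.7, 0.9 1 1",
        "0.8, 0.3 1 1", "0.8, 0.7 3 2"]
objects = [parse_row(r) for r in ROWS]
```

Each `features()` vector is the kind of per-object input an object encoder could consume; the actual embedding used by the model is not specified on the slide.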
5
Methods
• For Structured Representation: 1. Bidirectional Attention 2. Object Ordering with Pointer Network
• For Raw Image: 3. CNN-Bidirectional Attention
6
Bidirectional Attention
There is a box with three circles between a square and a triangle.
[Model diagram; labels: OBJ LSTM, LANG LSTM, FC, Attention (twice), Bidirectional Attention, LSTM (twice), MAX (twice), MLP, MAX POOL, Probability]
7
Seo et al. (2016). Bi-Directional Attention Flow for Machine Comprehension.
[Figure: BiDAF architecture. Layers: Character Embed Layer (Char-CNN), Word Embed Layer (GloVe), Contextual Embed Layer (LSTM), Attention Flow Layer (Query2Context and Context2Query attention), Modeling Layer (LSTM), Output Layer (Dense + Softmax, LSTM + Softmax; Start, End). Inputs: context x1 ... xT and query q1 ... qJ, with contextual states h1 ... hT and u1 ... uJ, attention outputs g1 ... gT, and modeling outputs m1 ... mT.]
BiDAF Model
8
Backbone of BiDAF (SQuAD)
Context: The Apollo program, also known as Project Apollo, was the third United States human spaceflight program carried out by the National ………. address to Congress. Project Mercury was followed by the two-man Project Gemini (1962–66). The first manned flight of Apollo was in 1968.
Question: What project put the first Americans into space?
Answer: Project Mercury
[Model diagram; labels: CTX LSTM, QUE LSTM, Encoding Layer, FC, Attention, Modeling Layer, LSTM]
9
Unidirectional Attention Here
Basic Model with BiDAF
[Model diagram; labels: LANG LSTM, Encoding Layer, FC, Attention, Modeling Layer, LSTM]
Suppose there is only one image (instead of 3 sub-images).
Project Mercury
Answer
There is a box with three circles between a square and a triangle.
10
Modification 1: Probability Prediction
There is a box with three circles between a square and a triangle.
We added the max-pooling layer followed by the MLP.
[Model diagram; labels: OBJ ENC, LANG LSTM, Encoding Layer, FC, Attention, Modeling Layer, LSTM, MAX, MLP, Prob]
11
Modification 2: Object Encoder
The object encoder is a recurrent neural network.
There is a box with three circles between a square and a triangle.
[Model diagram; labels: OBJ LSTM, LANG LSTM, Encoding Layer, FC, Attention, Modeling Layer, LSTM, MAX, MLP, Prob]
12
Object Encoder: LSTM (First Approach)
LSTM LSTM LSTM LSTM LSTM
Structured Representation
Ordering: Random
Why?
• LSTM could handle variable length
• LSTM could learn object relationships
Shape Location Size Color
0.3, 0.3 3 1
0.4, 0.3 1 1
0.7, 0.9 1 1
0.8, 0.3 1 1
0.8, 0.7 3 2
Second Approach: Learning the order via pointer network
13
Modification 2: Object Encoder
The obj-sequence is processed by LSTM in random order.
There is a box with three circles between a square and a triangle.
[Model diagram; labels: OBJ LSTM, LANG LSTM, FC, Attention, Unidirectional Attention, LSTM, MAX, MLP, Probability]
14
Modification 3: Bidirectional Attention
There is a box with three circles between a square and a triangle.
[Model diagram; labels: OBJ LSTM, LANG LSTM, FC, Attention (twice), Bidirectional Attention, LSTM (twice), MAX (twice), MLP, Probability]
Object-to-language attention is added.
15
Bidirectional Attention Details
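There is exactly one black triangle not touching any edge
[Diagram; labels: Features, LSTM, Bidirectional Attention, Attentive Features, LSTM, Contextualized Features; LANG states {h_i}, OBJ states {g_k}]
Text2Img Attention
\alpha_{i,k} = \mathrm{softmax}_k\left(h_i^\top B_1 g_k\right)
c_i = \sum_k \alpha_{i,k} \cdot g_k
\hat{h}_i = \mathrm{relu}\left(W_{\mathrm{LANG}} [h_i; c_i; h_i - c_i; h_i \circ c_i]\right)
Img2Text Attention
\beta_{k,i} = \mathrm{softmax}_i\left(g_k^\top B_2 h_i\right)
d_k = \sum_i \beta_{k,i} \cdot h_i
\hat{g}_k = \mathrm{relu}\left(W_{\mathrm{OBJ}} [g_k; d_k; g_k - d_k; g_k \circ d_k]\right)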
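One direction of the attention on this slide can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the language and object states are assumed to share the same dimension, the last feature is read as an element-wise product, and the final relu(W [...]) projection is left out.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bilinear(h: Vector, B: Matrix, g: Vector) -> float:
    """Compute the bilinear score h^T B g."""
    return sum(h[r] * B[r][c] * g[c]
               for r in range(len(h)) for c in range(len(g)))


def attend(h_i: Vector, G: List[Vector], B: Matrix) -> Vector:
    """Return [h; c; h - c; h * c] for one language state over objects G."""
    alpha = softmax([bilinear(h_i, B, g_k) for g_k in G])
    dim = len(G[0])
    # Attention-weighted sum of the object states.
    c_i = [sum(a * g[d] for a, g in zip(alpha, G)) for d in range(dim)]
    diff = [h - c for h, c in zip(h_i, c_i)]
    prod = [h * c for h, c in zip(h_i, c_i)]
    return h_i + c_i + diff + prod  # list concatenation


# Toy inputs, chosen only for illustration.
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
out = attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], IDENTITY)
```

The other direction follows by swapping the roles of the two sequences: call `attend` with an object state, the list of language states, and a second matrix in place of `B`.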
16
Three Sub-Images
There is a box with three circles between a square and a triangle.
[Model diagram; labels: OBJ LSTM, LANG LSTM, FC, Attention (twice), Bidirectional Attention, LSTM (twice), MAX (twice), MLP, MAX POOL, Probability]
A max-pooling layer is added to combine the sub-image scores.
17
Why max-pooling?
1. Permutation invariant
2. Dataset statistics: majority about existence
(Also tried min/mean/early-pooling, LSTM, concatenation)
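The permutation-invariance argument for max-pooling can be checked with a few lines of Python. This is a small sketch with made-up scores; the function name `combine_max` is hypothetical.

```python
from itertools import permutations
from typing import Sequence


def combine_max(sub_image_scores: Sequence[float]) -> float:
    """Use the highest sub-image score as the final score."""
    return max(sub_image_scores)


scores = [0.12, 0.91, 0.30]  # made-up scores for three sub-images
final = combine_max(scores)

# The result is the same for every ordering of the sub-images.
order_free = all(combine_max(p) == final for p in permutations(scores))
```

Because the maximum fires as soon as one sub-image scores high, this combination suits statements about existence in a single sub-image.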
Object Ordering with Pointer Network
There is a box with three circles between a square and a triangle.
[Model diagram; labels: Pointer Network (Encoder, Decoder), OBJ LSTM, LANG LSTM, FC, Attention (twice), Bidirectional Attention, LSTM (twice), MAX (twice), MLP, MAX POOL, Probability]
18
Pointer Network
Vinyals et al. (2016). Pointer Networks.
[Figure panels: Seq-to-Seq Model, Pointer Network]
u^i_j = v^\top \tanh(W_1 e_j + W_2 d_i), \quad j \in (1, \ldots, n)
p(C_i \mid C_1, \ldots, C_{i-1}, \mathcal{P}) = \mathrm{softmax}(u^i)
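The two pointer-network formulas can be traced with a small pure-Python sketch. The function names and the toy weights below are assumptions for illustration only; a real model would learn W1, W2 and v.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W: Matrix, x: Vector) -> Vector:
    """Matrix-vector product W x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def pointer_distribution(E: List[Vector], d_i: Vector,
                         W1: Matrix, W2: Matrix, v: Vector) -> Vector:
    """softmax over u_j = v^T tanh(W1 e_j + W2 d_i) for every encoder state e_j."""
    proj_d = matvec(W2, d_i)
    scores = []
    for e_j in E:
        hidden = [math.tanh(a + b) for a, b in zip(matvec(W1, e_j), proj_d)]
        scores.append(sum(vk * hk for vk, hk in zip(v, hidden)))
    return softmax(scores)


# Toy values, chosen only for illustration.
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
d = [0.5, -0.5]
I2 = [[1.0, 0.0], [0.0, 1.0]]
probs = pointer_distribution(E, d, I2, I2, v=[1.0, 1.0])
```

The output has one probability per input element, which is what lets the decoder point at an input position instead of emitting a token from a fixed vocabulary.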
19
Object Ordering: Pointer Network
LANG LSTM
LSTM LSTM LSTM LSTM LSTM LSTM LSTM LSTM LSTM LSTM
At each decoding step, the decoder will select an object without replacement.
There is a box with three circles between a square and a triangle.
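Selecting objects without replacement can be sketched as a masked softmax followed by a greedy pick. This is a simplified illustration: the function names are hypothetical, and the per-step scores are given as a fixed table, whereas in a pointer network they would come from the decoder state at each step.

```python
import math
from typing import List


def masked_softmax(scores: List[float], available: List[bool]) -> List[float]:
    """Softmax over the still-available objects; chosen objects get probability 0."""
    top = max(s for s, ok in zip(scores, available) if ok)
    exps = [math.exp(s - top) if ok else 0.0
            for s, ok in zip(scores, available)]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_order(step_scores: List[List[float]]) -> List[int]:
    """Pick one object per decoding step, never picking the same object twice."""
    n = len(step_scores)
    available = [True] * n
    order = []
    for scores in step_scores:
        probs = masked_softmax(scores, available)
        choice = max(range(n), key=lambda j: probs[j])
        order.append(choice)
        available[choice] = False  # remove the chosen object from later steps
    return order


# Hypothetical scores for three objects over three decoding steps.
order = greedy_order([[0.0, 5.0, 1.0], [9.0, 9.0, 2.0], [0.0, 0.0, 0.0]])
```

The mask guarantees that the output is a permutation of the input objects, so the OBJ LSTM sees every object exactly once.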
20
Object Ordering: Pointer Network
Original: OBJ LSTM
New: Pointer Network (Encoder, Decoder), then OBJ LSTM
21
Pointer Network (Full Model)
Before the OBJ-LSTM, the objects are sorted by the pointer network.
There is a box with three circles between a square and a triangle.
[Model diagram; labels: Pointer Network (Encoder, Decoder), OBJ LSTM, LANG LSTM, FC, Attention (twice), Bidirectional Attention, LSTM (twice), MAX (twice), MLP, MAX POOL, Probability]
22
Pointer Network Optimization
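p(\pi \mid s, o) = \prod_i p\left(\pi(i) \mid \pi(<i), s, o\right)
L_{RL}(s, o, y) = \mathbb{E}_{\pi \sim p(\cdot \mid s, o)} L(s, o[\pi], y)
R = -L(s, o[\pi^*], y)
\nabla_\theta L_{RL}(s, o, y) \approx -(R - b) \nabla_\theta \log p(\pi^* \mid s, o) + \nabla_\theta L(s, o[\pi^*], y)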
1. Sample a permutation of the objects
2. Calculate the loss
3. Policy gradient with the reward
Bello et al. (2016). Neural combinatorial optimization with reinforcement learning.
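The three steps can be sketched in plain Python under a strong simplification: each object gets one fixed logit, so the step probabilities depend only on which objects remain. In the model, these probabilities would come from the pointer-network decoder conditioned on the statement and objects. All function names here are hypothetical.

```python
import math
import random
from typing import List, Sequence


def sample_permutation(logits: Sequence[float], rng: random.Random) -> List[int]:
    """Step 1: sample objects one at a time, without replacement."""
    remaining = list(range(len(logits)))
    perm = []
    while remaining:
        weights = [math.exp(logits[j]) for j in remaining]
        idx = rng.choices(remaining, weights=weights, k=1)[0]
        perm.append(idx)
        remaining.remove(idx)
    return perm


def log_prob(perm: Sequence[int], logits: Sequence[float]) -> float:
    """log p(pi) as the sum of per-step log-probabilities."""
    remaining = list(range(len(logits)))
    total = 0.0
    for idx in perm:
        log_z = math.log(sum(math.exp(logits[j]) for j in remaining))
        total += logits[idx] - log_z
        remaining.remove(idx)
    return total


def reinforce_weight(loss: float, baseline: float) -> float:
    """Step 3: coefficient -(R - b) on grad log p(pi*), with reward R = -loss."""
    reward = -loss
    return -(reward - baseline)


rng = random.Random(0)
logits = [0.5, -1.0, 2.0]           # made-up ordering scores
perm = sample_permutation(logits, rng)
loss = 0.7                          # step 2: stand-in for the model's loss on o[perm]
weight = reinforce_weight(loss, baseline=-0.5)
```

In an autodiff framework, `weight * log_prob(perm, logits)` plus the ordinary loss would serve as the surrogate objective whose gradient matches the approximation above.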
Mathematical Perspective
Programming Perspective
23
[Diagram labels: L(·, π, ·), −L(·, π, ·), π]
Structured Representation: Full Model
There is a box with three circles between a square and a triangle.
[Model diagram; labels: Pointer Network (Encoder, Decoder), OBJ LSTM, LANG LSTM, FC, Attention (twice), Bidirectional Attention, LSTM (twice), MAX (twice), MLP, MAX POOL, Probability]
24
Raw Image: CNN-Bidirectional Attention
There is a box with three circles between a square and a triangle.
[Model diagram; labels: OBJ CNN, LANG LSTM, FC, Attention (twice), Bidirectional Attention, LSTM, CNN, MAX (twice), MLP, MAX POOL, Probability]
25
Structured (LSTM) vs. Image (CNN)
Structured Representation: LSTM over the object sequence
Raw RGB Image: CNN, producing an Image Spatial Feature Map
26
Bidirectional Attention for Raw Image Model
There is exactly one black triangle not touching any edge
[Model diagram; labels: CNN, Image Spatial Feature Map, LSTM, Bidirectional Attention, Attentive Feature Map, CNN, LSTM, Contextualized Feature Map]
27
Results
28
Model                                 Dev     Test-P  Test-U
STRUCTURED REPRESENTATIONS DATASET
MAXENT (Suhr et al., 2017)            68.0%   67.7%   67.8%
MLP (Suhr et al., 2017)               67.5%   66.3%   65.3%
ImageFeat+RNN (Suhr et al., 2017)     57.7%   57.6%   56.3%
RelationNet (Santoro et al., 2017)    65.1%   62.7%   -
BiDAF (Seo et al., 2016)              66.5%   68.4%   -
BiENC Model                           65.1%   63.4%   -
BiATT Model                           72.6%   72.3%   -
BiATT-Pointer Model                   74.6%   73.9%   71.8%
RAW IMAGE DATASET
CNN+RNN (Suhr et al., 2017)           56.6%   58.0%   56.3%
NMN (Suhr et al., 2017)               63.1%   66.1%   62.0%
CNN-BiENC Model                       58.7%   58.7%   -
CNN-BiATT Model                       66.9%   69.7%   66.1%
Table 1: Dev, Test-P (public), and Test-U (unreleased) results of our model on the structured-representation and raw-image datasets, compared to the previous SotA results and other reimplemented baselines.
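The margins quoted for the raw-image model can be recomputed from the Table 1 entries. The snippet below is only an arithmetic check; the dictionary layout and the helper name `gain` are choices made for this illustration.

```python
# Accuracies (%) copied from Table 1.
TABLE = {
    "NMN": {"test_p": 66.1, "test_u": 62.0},
    "CNN-BiATT": {"test_p": 69.7, "test_u": 66.1},
}


def gain(model: str, baseline: str, split: str) -> float:
    """Absolute accuracy difference in percentage points, rounded to 0.1."""
    return round(TABLE[model][split] - TABLE[baseline][split], 1)


public_gain = gain("CNN-BiATT", "NMN", "test_p")      # 3.6
unreleased_gain = gain("CNN-BiATT", "NMN", "test_u")  # 4.1
```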
Negative Examples
• The top of the three towers are not the same. Correct Answer: True
• There are 2 boxes with at least 2 blue items. Correct Answer: True
• There is a blue object touching the base. Correct Answer: False
• There are at least three yellow objects touching any edge. Correct Answer: True
Figure 3: Incorrectly-classified examples.
Results on Raw Images Dataset: To further show the effectiveness of our BiATT model, we apply this model to the raw image version of the NLVR dataset, with minimal modification. We simply replace each object-related LSTM with a visual feature CNN that directly learns the structure via pixel-level, spatial filters (instead of a pointer network which addresses an unordered sequence of structured object representations). As shown in Table 1, this CNN-BiATT model outperforms the neural module networks (NMN) (Andreas et al., 2016) previous-best result by 3.6% on the public test set and 4.1% on the unreleased test set. More details and the model figure are in the appendix.
Output Example Analysis: Finally, in Fig. 1, we show some output examples which were successfully solved by our BiATT-Pointer model but failed in our strong baselines. The left two examples in Fig. 1 could not be handled by the BiENC model. The right two examples are incorrect for the BiATT model without the ordering-based pointer network. Our model can quite successfully understand the complex meanings of the attributes and their relationships with the diverse objects, as well as count the occurrence of and reason over objects without any specialized features.
Next, in Fig. 3, we also show some negative examples on which our model fails to predict the correct answer. The top two examples involve complex high-level phrases e.g., “touching any edge” or “touching the base”, which are hard for an end-to-end model to capture, given that such statements are rare in the training data. Based on the result of the validation set, the max-pooling layer is selected as the combination method in our model. The max-pooling layer will choose the highest score from the sub-images as the final score. Thus, the layer could easily handle statements about single-subimage-existence based reasoning (e.g., the 4 positively-classified examples in Fig. 1). However, the bottom two negatively-classified examples in Fig. 3 could not be resolved because of the limitation of the max-pooling layer on scenarios that consider multiple-subimage-existence. We did try multiple other pooling and combination methods, as mentioned in Sec. 3.1. Among these methods, the concatenation, early pooling and LSTM-fusion approaches might have the ability to solve these particular bottom-two failed statements. In our future work, we are addressing multiple types of pooling methods jointly.
6 Conclusion
We presented a novel end-to-end model with joint bidirectional attention and object-ordering pointer networks for visual reasoning. We evaluate our model on both the structured-representation and raw-image versions of the NLVR dataset and achieve substantial improvements over the previous end-to-end state-of-the-art results.
Acknowledgments
We thank the anonymous reviewers for their helpful comments. This work was supported by a Google Faculty Research Award, a Bloomberg Data Science Research Grant, an IBM Faculty Award, and NVidia GPU awards.
*Goldman et al. (ACL 2018) use extra, manually-labeled semantic parsing data to achieve a better structured rep. result.
Positive Examples
There is at least one tower which has blocks of all three colors
Answer: True
There is a box with a yellow circle, a yellow square and two black items.
Answer: True
At least one of tower with exactly three blocks has a blue block in the middle
Answer: False
There is a black block attach to a yellow block that is attach to a blue block.
Answer: True
29
Negative Examples
The top of the three towers are not the same.
Correct Answer: True
There are 2 boxes with at least 2 blue items.
Correct Answer: True
There is a blue object touching the base.
Correct Answer: False
There are at least three yellow objects touching any edge.
Correct Answer: True
30
Thank You!
31