Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does...
Transcript of Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does...
![Page 1: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/1.jpg)
Designing deep architectures for Visual Question Answering
Matthieu CordSorbonne University
valeo.ai research lab. Paris
Thanks to H. Ben-younes, R. Cadène
![Page 2: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/2.jpg)
Visual Question Answering
Question Answering:What does Claudia do?+
![Page 3: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/3.jpg)
Visual Question Answering:What does Claudia do?+
Visual Question Answering
![Page 4: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/4.jpg)
Visual Question Answering:What does Claudia do?+
Visual Question Answering
Sitting at the bottomStanding at the back…
![Page 5: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/5.jpg)
Visual Question Answering:What does Claudia do?+
Visual Question Answering
Sitting at the bottomStanding at the back…
Solving this task interesting for:- Study of deep learning models in a multimodal context - Improving human-machine interaction- One step to build visual assistant for blind people
Deep ML
![Page 6: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/6.jpg)
Outline
1. Multimodal embedding• Deep nets to align text+image• learning
2. VQA framework• Task modeling• Fusion in VQA• Reasoning in VQA
![Page 7: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/7.jpg)
Deep semantic-visual embedding
ConvNet RNN
![Page 8: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/8.jpg)
ConvNet RNN
Deep semantic-visual embedding
Semantic of distanceRetrieval by NN search
![Page 9: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/9.jpg)
A cat on a sofa A dog
playing
A car
2D Semantic visual space example:• Distance in the space has a semantic interpretation• Retrieval is done by finding nearest neighbors
Deep semantic-visual embedding
![Page 10: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/10.jpg)
• Designing image and text embedding architectures
• Learning scheme for these deep hybrid nets
Deep semantic-visual embedding
![Page 11: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/11.jpg)
Visual pipeline:
• ResNet-152 pretrained
• Weldon spatial pooling
• Affine projection
• normalization
Textual pipeline:
• Pretrained word embedding
• Simple Recurrent Unit (SRU)
• Normalization
ResNet conv poolaffine+norm.
(a, man, in, ski, gear,skiing, on, snow)
w2v SRU+norm
!0: 2 and ϕ arethetrainedparameters
Deep semantic-visual embeddingDeViSE: A Deep Visual-Semantic Embedding Model, A. Frome et al, NIPS 2013
Finding beans in burgers: Deep semantic-visual embedding with localization, M. Engilberge et al, CVPR 2018
RNN
![Page 12: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/12.jpg)
Visual pipeline:
• ResNet-152 pretrained
• Weldon spatial pooling
• Affine projection
• normalization
Textual pipeline:
• Pretrained word embedding
• Simple Recurrent Unit (SRU)
• Normalization
ResNet conv poolaffine+norm.
(a, man, in, ski, gear,skiing, on, snow)
w2v SRU+norm
Deep semantic-visual embeddingFinding beans in burgers: Deep semantic-visual embedding with localization,M. Engilberge et al, CVPR 2018
![Page 13: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/13.jpg)
Visual pipeline:
• ResNet-152 pretrained
• Weldon spatial pooling
• Affine projection
• normalization
Textual pipeline:
• Pretrained word embedding
• Simple Recurrent Unit (SRU)
• Normalization
ResNet conv poolaffine+norm.
(a, man, in, ski, gear,skiing, on, snow)
w2v SRU+norm
!0: 2 and ϕ =>Learningusingatrainingset
Deep semantic-visual embeddingFinding beans in burgers: Deep semantic-visual embedding with localization,M. Engilberge et al, CVPR 2018
![Page 14: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/14.jpg)
How to get large training datasets?
Learning Cross-modal Embeddings for Cooking Recipes and Food Images. A. Salvador, et al. CVPR 2017
Cross-modal retrieval in the cooking context: Learning semantic text-image embeddings M. Carvalho, R. Cadene, D.
Picard, L. Soulier, N. Thome, M. Cord, SIGIR (2018)
Cooking recipes: easy
to get large
multimodal datasets
with aligned data
![Page 15: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/15.jpg)
Deep semantic-visual embedding
DemoVisiir.lip6.fr
![Page 16: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/16.jpg)
A dog playing with a frisbee
A plane in a cloudy sky
1. A herd of sheep standing on top of snow covered field.2. There are sheep standing in the grass near a fence.3. some black and white sheep a fence dirt and grass
Query Closest elements
Cross-modal retrieval
![Page 17: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/17.jpg)
Visual grounding examples:
• Generating multiple heat maps with different textual queries
Cross-modal retrieval and localization
Finding beans in burgers: Deep semantic-visual embedding with localization, M. Engilberge et al, CVPR 2018
![Page 18: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/18.jpg)
Emergence of color understanding:
Cross-modal retrieval and localization
![Page 19: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/19.jpg)
Outline
1. Multimodal embedding• Deep nets to align text+image• Learning
2. Visual Question Answering• Task modeling• Fusion in VQA• Reasoning in VQA
![Page 20: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/20.jpg)
VQA
![Page 21: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/21.jpg)
VQA
Context - Vision and Language MUTAN Experiments References
Visual Question Answering
Very complex task, that requires :
precise image and text modelshigh level interaction modeling
What color is the fire hydranton the right ? yellow
What color is the fire hydranton the left ? green
full scene understandingreasoning...
MLIA/LIP6 activities on Visual Question Answering 14/46
What color is the fire Hydrant on the left?
Green
![Page 22: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/22.jpg)
VQA
Context - Vision and Language MUTAN Experiments References
Visual Question Answering
Very complex task, that requires :
precise image and text modelshigh level interaction modeling
What color is the fire hydranton the right ? yellow
What color is the fire hydranton the left ? green
full scene understandingreasoning...
MLIA/LIP6 activities on Visual Question Answering 14/46
What color is the fire Hydrant on the right?
Yellow
![Page 23: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/23.jpg)
womanDifferent answers
Similar images
man
Who is wearing glasses?
@VQA workshop, CVPR 2017
ÞNeed very good Visual and Question (deep) representationsÞ Full scene understanding
ÞNeed High level multimodal interaction modelingÞMerging operators, attention and reasoning
![Page 24: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/24.jpg)
Vanilla VQA scheme: 2 deep + fusion
Question Representation Image
![Page 25: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/25.jpg)
VQA
Question: Is the lady with the blue fur wearing glasses ?
Image representation
Question representation
VQA: the output space
Yes
![Page 26: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/26.jpg)
VQA: the output space
![Page 27: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/27.jpg)
VQA: the output space
Output space representation:=> Classify over the most frequent answers (3000/95%)
![Page 28: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/28.jpg)
VQA
ClassesQuestion: Is the lady with the
blue fur wearing glasses ?
Image representation
Question representation
VQA: the output space
![Page 29: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/29.jpg)
Image● Convolutional Network (VGG, ResNet,....) ● Detection system (EdgeBoxes, Faster-RCNN, …)
Question● Bag-of-words● Recurrent Network (RNN, LSTM, GRU, SRU, …)
Learning● Fixed answer vocabulary● Classification (cross-entropy)
VQA processing
Multimodal Fusion
Reasoning
![Page 30: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/30.jpg)
Fusion in VQA
![Page 31: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/31.jpg)
VQA: fusion
FusionConcat+ Proj
Element-Wise Concat+MLP
Is the lady with the purple fur wearing glasses ?
![Page 32: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/32.jpg)
VQA: fusion
FusionConcat+ Proj
Element-Wise Concat+MLP
Is the lady with the purple fur wearing glasses ?
![Page 33: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/33.jpg)
VQA: bilinear fusion[Fukui, Akira et al. Multimodal Compact Bilinear Pooling for Visual Question
Answering and Visual Grounding, CVPR 2016]
[Kim, Jin-Hwa et al. Hadamard Product for Low-rank Bilinear Pooling, ICLR 2017]
Bilinear model:
Is the lady with the purple
fur wearing glasses ?
![Page 34: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/34.jpg)
VQA: bilinear fusion
Learn the 3-ways tensor coeff. • Different than the Signal Proc. Tensor analysis
(representation)
Problem: q, v and y are of dimension ~ 2000=> 8 billion free parameters in the tensor
Need to reduce the tensor size:• Idea: structure the tensor to reduce the number of
parameters
![Page 35: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/35.jpg)
Tucker decomposition:
Tensor structure:
⇔ constrain the rank of each unfolding of
VQA: bilinear fusion
![Page 36: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/36.jpg)
=
VQA: bilinear fusion
![Page 37: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/37.jpg)
=
VQA: bilinear fusion
![Page 38: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/38.jpg)
=
VQA: bilinear fusion
![Page 39: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/39.jpg)
=
VQA: bilinear fusion
![Page 40: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/40.jpg)
Compact Bilinear
Pooling
(MCB)
Low-Rank
Bilinear Pooling
(MLB)
Tucker
Decomposition
(MUTAN)
Other ways of structuring the tensor of parameters
VQA: bilinear fusion
Ben-younes H.* Cadene R.*, Thome N., Cord M., MUTAN: Multimodal Tucker Fusion for Visual Question Answering, ICCV 2017
![Page 41: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/41.jpg)
VQA: bilinear fusion [AAAI 2019]
![Page 42: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/42.jpg)
VQA: BLOCK fusion [AAAI 2019]
A
B
C
![Page 43: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/43.jpg)
![Page 44: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/44.jpg)
VQA: bilinear fusion
Is the lady with the purple fur wearing glasses ?
![Page 45: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/45.jpg)
Reasoning in VQA
![Page 46: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/46.jpg)
What is reasoning (for VQA)?
Attentional reasoning
Relational reasoning
Iterative reasoning
Compositional reasoning
VQA: reasoning
![Page 47: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/47.jpg)
What is reasoning (for VQA)?
Attentional reasoning: given a certain context (i.e. Q), focus only on the relevant subparts of the image
Relational reasoning
Iterative reasoning
Compositional reasoning
VQA: reasoning
![Page 48: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/48.jpg)
Idea: focusing only on parts of the image relevant to Q● Each region scored according to the question
● Representation = sum of all (weighted) embeddings
VQA: attentional reasoning
What is sitting on the desk in front of the boys ?
![Page 49: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/49.jpg)
VQA: attentional reasoning
What is sitting on the desk in front of the boys ? GRU
ResNet
![Page 50: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/50.jpg)
VQA: attentional reasoning
What is sitting on the desk in front of the boys ? GRU
ResNet
![Page 51: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/51.jpg)
MUTAN Fusion
MUTAN Fusion
What is sitting on the desk in
front of the boys ?
GRU
ResNet
“laptop”
Attention mechanism
Attentional glimpse in most of recent strategies [MLB, MCB, MUTAN…]
VQA: attentional reasoning
![Page 52: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/52.jpg)
VQA: attentional reasoning
![Page 53: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/53.jpg)
Focusing on multiple regions: Multi-glimpse attention
Where is the smoke
coming from ?
VQA: attentional reasoning
![Page 54: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/54.jpg)
MUTAN Fusion
MUTAN Fusion
Where is the smoke coming
from ?GRU
ResNet
“train”
Attention mechanism
Focus on the train
Focus on the smoke
VQA: attentional reasoning with Multi-glimpse attention
![Page 55: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/55.jpg)
VQA: attentional reasoning with Multi-glimpse attention
![Page 56: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/56.jpg)
VQA: attentional reasoningEvaluation on VQA dataset:Best MUTAN score of 67.36% on test-std
Human performances about 83% on this dataset
The winner of the VQA Challenge in CVPR 2017 (and CVPR 2018) integrates adaptive grid selection from additional region detection learning process
![Page 57: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/57.jpg)
VQA: attentional reasoning
![Page 58: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/58.jpg)
What is reasoning (for VQA)?
Attentional reasoning: given a certain context (i.e. Q), focus only on the relevant subparts of the image
Relational reasoning: object detection + mutual relationships (spatial, semantic,...), merging both with Q
Iterative reasoning
Compositional reasoning
VQA: reasoning
![Page 59: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/59.jpg)
Determine the answer using relevant objects and relationships
Bottom-up and Relational reasoning
![Page 60: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/60.jpg)
What is reasoning (for VQA)?
Attentional reasoning: given a certain context (i.e. Q), focus only on the relevant subparts of the image
Relational reasoning: object detection + mutual relationships (spatial, semantic,...), merging both with Q
Iterative reasoning: refining the attention step-by-step, each time extracting a different piece of information from the image
VQA: reasoning
![Page 61: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/61.jpg)
Iterative Reasoning
At least 3 elementary steps are required to answer the question
● Find bicycles● Find the bicycle that has a basket● Find what is in this basket
Stacked attention: iteratively refining visual attention and question representation
What are sitting in the basket on a
bicycle ?
Zichao Yang et. al., Stacked Attention Networks for Image Question Answering, CVPR 2016
![Page 62: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/62.jpg)
Multimodal Relational Reasoning for VQA
MUREL system:• Vector representation for Attention process• Spatial and semantic contexts to model relations between
image regions• Iterative process /Multistep reasoning
Cadene et al., MuRel: Multimodal Relational Reasoning for Visual Question Answering CVPR 2019
![Page 63: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/63.jpg)
MuRel: Multimodal Relational Reasoning for VQA
Bilinear Fusion
Pairwise Relational Modeling
+
skip-connection
point-wise addition
![Page 64: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/64.jpg)
MuRel: Multimodal Relational Reasoning for VQA
![Page 65: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/65.jpg)
MuRel: Multimodal Relational Reasoning for VQA
![Page 66: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/66.jpg)
TDIUC dataset (12 different categories)VQA v2.0 dataset
![Page 67: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/67.jpg)
Datasets and challengesMany initiatives to improve datasets and evaluate reasoning as:VQA v2.0 [Y. Goyal, D. Batra, D. Parikh, CVPR 2017]
TDIUC dataset and challenge (Task DrivenImage Understanding Challenge)CLEVR dataset [J. Johnson et al, CVPR 2017]• Questions about visual reasoning including
attribute identification, counting, comparison, spatial relationships, and logical operations.
GQA dataset (2019) for compositionalQ answering over real-world images• 22M diverse reasoning questions generated from
a scene graph
Visual dialogue task: to hold a dialog withhumans in natural, conversational languageabout visual content
Are there an equal number of large things and metal spheres?
![Page 68: Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …](https://reader034.fdocuments.us/reader034/viewer/2022042220/5ec6a2f51e6ca6370b5bc4ee/html5/thumbnails/68.jpg)
MLIA/Chordettes team: Matthieu Cord http://webia.lip6.fr/~cord
A. Dapogny (Postdoc), PhD T. Robert, T. Mordan, H. BenYounes, R. Cadene, E. Mehr, M. Engilberge, Y. Chen, A. Saporta, N. Thome (CNAM Pr 10% associate)
CVPR 2019 MUREL: Multimodal Relational Reasoning for Visual Question AnsweringR. Cadene, H. Ben-younes, N. Thome, M. Cord
AAAI 2019 BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection , H. Ben-younes, R. Cadene, N. Thome, M. Cord
ICCV 2017 MUTAN: Multimodal Tucker Fusion for Visual Question AnsweringH. Ben-Younes*, R. Cadene*, N. Thome, M. Cord
Pytorch code: https://github.com/Cadene
Our Deep Recipe Reco on your mobile: visiir.lip6.fr