Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

Peter Anderson 1†, Xiaodong He 2‡, Chris Buehler 2, Damien Teney 3, Mark Johnson 4, Stephen Gould 1, Lei Zhang 2
1 Australian National University, 2 Microsoft Research, 3 University of Adelaide, 4 Macquarie University
† Transitioning to Georgia Tech, ‡ Now at JD AI Research

2. Generation of attention candidates, V

Typical: spatial output of a CNN. V = {v_1, ..., v_100}: a 2048-dimensional feature vector at each position of a 10×10 spatial grid.

Ours: bottom-up attention (using Faster R-CNN [5]). V = {v_1, ..., v_k}: a 2048-dimensional feature vector for each of the k detected image regions (v_1, v_2, v_3, ...).
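Both forms of V are simply sets of 2048-dimensional vectors; only the way the vectors are produced differs. A minimal NumPy sketch of the two shapes (the grid size, the value of k, and the random arrays standing in for real CNN and detector outputs are illustrative assumptions, not values from the poster):

```python
import numpy as np

# Typical: the spatial output of a CNN gives a fixed grid of features.
# Assume a 10x10 grid of 2048-d vectors, flattened into 100 candidates.
cnn_feature_map = np.random.rand(10, 10, 2048)   # stand-in for a real CNN feature map
V_grid = cnn_feature_map.reshape(-1, 2048)       # V = {v_1, ..., v_100}, shape (100, 2048)

# Ours: bottom-up attention gives one 2048-d vector per detected region,
# so the number of candidates k varies per image (k = 36 is illustrative).
k = 36
V_regions = np.random.rand(k, 2048)              # V = {v_1, ..., v_k}, shape (k, 2048)

print(V_grid.shape, V_regions.shape)             # (100, 2048) (36, 2048)
```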

4. Pre-training Faster R-CNN

We pre-train Faster R-CNN on Visual Genome [6] data, using:
• 1600 object classes
• 400 attribute classes

To select the k attention candidates, a detection confidence threshold is used, so k varies per image (sketched below).

Example training data: Visual Genome images with region-level object and attribute annotations.
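A minimal sketch of this candidate-selection step, assuming per-region features and detection confidences are already available from the pre-trained Faster R-CNN; the function name, the 0.2 threshold and the random inputs are illustrative stand-ins, not values from the poster:

```python
import numpy as np

def select_candidates(region_features, confidences, threshold=0.2):
    """Keep the regions whose detection confidence exceeds the threshold.

    region_features: (n, 2048) per-region features from Faster R-CNN.
    confidences:     (n,) detection confidences (e.g. max class probability).
    Returns V with shape (k, 2048), where k varies per image.
    """
    keep = confidences > threshold
    return region_features[keep]

# Illustrative usage with random stand-ins for real detector outputs.
features = np.random.rand(50, 2048)
scores = np.random.rand(50)
V = select_candidates(features, scores)
print(V.shape)   # (k, 2048); k depends on how many scores exceed the threshold
```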

6. Qualitative results

Example outputs:

Image captioning:
ResNet: A man sitting on a toilet in a bathroom.
Up-Down: A man sitting on a couch in a bathroom.

VQA:
Q: What color is illuminated on the traffic light?
A: green
A: red

5. Quantitative results

VQA v2 val set (single-model):

                 Yes/No   Number   Other   Overall
ResNet (1×1)      76.0     36.5     46.8    56.3
ResNet (14×14)    76.6     36.2     49.5    57.9
ResNet (7×7)      77.6     37.7     51.5    59.4
Up-Down (Ours)    80.3     42.8     55.8    63.2

COCO Captions "Karpathy" test set (single-model):

                 BLEU-4   METEOR   CIDEr   SPICE
ResNet (10×10)    34.0     26.5    111.1    20.2
Up-Down (Ours)    36.3     27.7    120.1    21.4

In both cases this is roughly a +6% relative improvement over the strongest ResNet baseline.

• 1st place, 2017 VQA Challenge (June 2017)
• 1st place, COCO Captions leaderboard (July 2017)
• The Up-Down approach is now incorporated into many other models (including many 2018 VQA Challenge entries)

1. Visual attention

Visual attention mechanisms learn to focus on image regions that are relevant to the task. They require:
1. A learned attention function (network), f
2. A set of attention candidates, V
3. A task context representation, h

Given these, the attended feature is v̂ = f(h, V): the candidates V are combined, under the task context h, into a single attended feature v̂.
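As a toy illustration of the interface v̂ = f(h, V), the sketch below uses a simple dot-product scoring function; this is not the paper's attention (the Attend block in panel 3 is additive), and the dimensions and random inputs are assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_dot(h, V):
    """Toy attention function f: score each candidate v_i against the task
    context h with a dot product, normalise with a softmax, and return the
    weighted sum of the candidates (the attended feature v_hat)."""
    scores = V @ h           # one score per candidate, shape (k,)
    alpha = softmax(scores)  # attention weights
    return alpha @ V         # attended feature, shape (d,)

# Illustrative usage: k = 36 candidates and a task context of matching dimension.
V = np.random.rand(36, 2048)
h = np.random.rand(2048)
v_hat = attend_dot(h, V)
print(v_hat.shape)           # (2048,)
```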

Refer also to our related work: Tips and Tricks for Visual Question Answering: Learnings From the 2017 Challenge. Poster J21, Wednesday June 20, 10:10-12:30, Poster Session P2-1.

Code, models and pre-trained features available: http://www.panderson.me/up-down-attention

[5] Ren et al., NIPS 2015.
[6] Krishna et al., arXiv:1602.07332, 2016.

Example training data annotation (Visual Genome): a region labeled "bench" with the attributes "worn", "wooden", "grey", "weathered".

3. Captioning and VQA models

Image captioning model (a two-layer LSTM decoder):
• The Top-Down Attention LSTM receives the concatenation of the previous Language LSTM state h^2_{t-1}, the mean-pooled image feature v̄, and the embedded previous word (embedded word_{t-1}), and produces h^1_t.
• The Attend block uses h^1_t to weight the candidates V, giving the attended image feature v̂_t.
• The Language LSTM receives the concatenation of v̂_t and h^1_t, produces h^2_t, and a softmax over h^2_t predicts word_t.
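The wiring of one captioning step can be sketched as follows; simple tanh recurrences stand in for the two LSTM cells so the sketch stays self-contained, and all sizes, weights and the tiny vocabulary are illustrative assumptions rather than the paper's parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d_v, d_h, vocab = 2048, 512, 10                 # illustrative feature, hidden, vocab sizes

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def rnn_cell(x, h_prev, W):
    """Stand-in for an LSTM cell: a single tanh recurrence over [x, h_prev]."""
    return np.tanh(W @ np.concatenate([x, h_prev]))

def attend(h, V, W_v, W_h, w_a):
    """Additive Attend block: a_i = w_a^T tanh(W_v v_i + W_h h)."""
    a = np.tanh(V @ W_v.T + W_h @ h) @ w_a
    alpha = softmax(a)
    return alpha @ V

# Illustrative parameters and inputs.
k = 36
V = rng.random((k, d_v))                        # bottom-up image features
v_bar = V.mean(axis=0)                          # mean-pooled image feature
E = rng.random((vocab, d_h))                    # word embeddings
W1 = rng.random((d_h, d_h + d_v + d_h + d_h))   # Top-Down Attention LSTM weights (input + state)
W2 = rng.random((d_h, d_v + d_h + d_h))         # Language LSTM weights (input + state)
W_v, W_h, w_a = rng.random((d_h, d_v)), rng.random((d_h, d_h)), rng.random(d_h)
W_out = rng.random((vocab, d_h))

h1, h2 = np.zeros(d_h), np.zeros(d_h)           # LSTM states
prev_word = 0                                   # index of the previous word

# One decoding step of the Up-Down captioning model.
x1 = np.concatenate([h2, v_bar, E[prev_word]])  # [h^2_{t-1}, v_bar, embedded word_{t-1}]
h1 = rnn_cell(x1, h1, W1)                       # Top-Down Attention LSTM
v_hat = attend(h1, V, W_v, W_h, w_a)            # attended image feature v_hat_t
x2 = np.concatenate([v_hat, h1])                # [v_hat_t, h^1_t]
h2 = rnn_cell(x2, h2, W2)                       # Language LSTM
word_probs = softmax(W_out @ h2)                # softmax over the vocabulary
print(word_probs.argmax())                      # predicted word_t (illustrative)
```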

VQA model:
• The question word embeddings are encoded with a GRU, giving the task context h.
• The Attend block uses h to weight the candidates V, giving the attended image feature v̂.
• The question and image representations are combined by an element-wise product, passed through a feed-forward net, and a sigmoid output scores the candidate answers.
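A matching sketch of the VQA forward pass, again with a tanh recurrence standing in for the GRU; the sizes, projection matrices, question length and answer set are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
d_v, d_h, n_answers = 2048, 512, 3000           # illustrative sizes

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_like(x, h_prev, W):
    """Stand-in for a GRU cell: a single tanh recurrence over [x, h_prev]."""
    return np.tanh(W @ np.concatenate([x, h_prev]))

def attend(h, V, W_v, W_h, w_a):
    """Additive Attend block, as in the captioning sketch."""
    a = np.tanh(V @ W_v.T + W_h @ h) @ w_a
    return softmax(a) @ V

# Illustrative parameters and inputs.
V = rng.random((36, d_v))                       # bottom-up image features
question = rng.random((12, d_h))                # 12 question word embeddings (stand-ins)
W_gru = rng.random((d_h, d_h + d_h))
W_v, W_h, w_a = rng.random((d_h, d_v)), rng.random((d_h, d_h)), rng.random(d_h)
W_q, W_i = rng.random((d_h, d_h)), rng.random((d_h, d_v))   # project question / image features
W_ff, W_ans = rng.random((d_h, d_h)), rng.random((n_answers, d_h))

# 1. Encode the question word embeddings with the (stand-in) GRU.
h = np.zeros(d_h)
for w in question:
    h = gru_like(w, h, W_gru)

# 2. Attend over the image features using the question encoding.
v_hat = attend(h, V, W_v, W_h, w_a)

# 3. Element-wise product of projected question and image features,
#    then a feed-forward net and a sigmoid over candidate answers.
joint = (W_q @ h) * (W_i @ v_hat)
scores = sigmoid(W_ans @ np.tanh(W_ff @ joint))
print(scores.argmax())                          # index of the highest-scoring answer
```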

Attend block (used in both models above):

a_i = w_a^T tanh(W_v v_i + W_h h)
α = softmax(a)
v̂ = Σ_{i=1}^{k} α_i v_i
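Written out directly in NumPy, the Attend block looks as follows; the dimensions and random inputs are illustrative:

```python
import numpy as np

def attend(h, V, W_v, W_h, w_a):
    """Attend block: a_i = w_a^T tanh(W_v v_i + W_h h),
    alpha = softmax(a), v_hat = sum_i alpha_i v_i."""
    a = np.tanh(V @ W_v.T + W_h @ h) @ w_a      # unnormalised scores a_i, shape (k,)
    e = np.exp(a - a.max())
    alpha = e / e.sum()                         # attention weights alpha
    return alpha @ V                            # attended feature v_hat

# Illustrative usage: k = 36 candidates of dimension 2048, 512-d context and hidden size.
rng = np.random.default_rng(2)
V = rng.random((36, 2048))
h = rng.random(512)
W_v, W_h, w_a = rng.random((512, 2048)), rng.random((512, 512)), rng.random(512)
print(attend(h, V, W_v, W_h, w_a).shape)        # (2048,)
```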