Captioning Images with Diverse Objects


Transcript of Captioning Images with Diverse Objects

Page 1: Captioning Images with Diverse Objects

Lisa Anne Hendricks

Subhashini Venugopalan

Marcus Rohrbach

Raymond Mooney

Kate Saenko

Trevor Darrell

UT Austin UC Berkeley Boston Univ.

Page 2: Captioning Images with Diverse Objects

Object Recognition

Object recognition systems can identify hundreds of categories of objects.

ImageNet: 14M images, 22K classes [Deng et al. CVPR’09]

Page 3: Captioning Images with Diverse Objects

Visual Description

Berkeley LRCN [Donahue et al. CVPR’15]: A brown bear standing on top of a lush green field.

MSR CaptionBot [http://captionbot.ai/]: A large brown bear walking through a forest.

MSCOCO: 80 classes

Page 4: Captioning Images with Diverse Objects

Novel Object Captioner (NOC)

We present the Novel Object Captioner (NOC), which can compose descriptions of hundreds of objects in context.

[Diagram] An existing captioner (init + train on MSCOCO) describes an okapi image as: A horse standing in the dirt.
NOC (ours) combines visual classifiers with existing captioners (MSCOCO) to describe novel objects without paired image-caption data: An okapi standing in the middle of a field.

Page 5: Captioning Images with Diverse Objects

Insights

1. Need to recognize and describe objects outside of image-caption datasets (e.g., okapi).

Page 6: Captioning Images with Diverse Objects

Insight 1: Train effectively on external sources

[Diagram] Two parallel branches trained on external sources: a CNN + Embed branch with an image-specific loss (visual features from unpaired image data) and an Embed + LSTM branch with a text-specific loss (language model from unannotated text data).


Page 7: Captioning Images with Diverse Objects

Insights

2. Describe unseen objects that are similar to objects seen in image-caption datasets (e.g., okapi is similar to zebra).

Page 8: Captioning Images with Diverse Objects

Insight 2: Capture semantic similarity of words

[Diagram] The architecture from Insight 1, now with the GloVe word embedding: the LSTM branch uses W_glove at its input and its transpose W_glove^T at its output (text-specific loss), alongside the CNN + Embed branch (image-specific loss).

Semantically similar words lie close together in the embedding space: zebra / okapi, dress / tutu, cake / scone.
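To make this insight concrete, here is a minimal sketch (my illustration, not the authors' code) that checks how close an unseen word like "okapi" sits to a seen word like "zebra" in GloVe space; the embedding file name and helper names are assumptions.

```python
# Hedged sketch: cosine similarity between GloVe vectors, assuming a local
# "glove.6B.300d.txt" file (each line: a word followed by 300 floats).
import numpy as np

def load_glove(path, vocab):
    """Load GloVe vectors only for the words of interest."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            if word in vocab:
                vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

glove = load_glove("glove.6B.300d.txt", {"okapi", "zebra", "cake", "scone"})
# An unseen word ("okapi") should be far closer to "zebra" than to "cake",
# which is what lets the model borrow caption context from seen objects.
print(cosine(glove["okapi"], glove["zebra"]))
print(cosine(glove["okapi"], glove["cake"]))
```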

Page 9: Captioning Images with Diverse Objects

Insight 2: Capture semantic similarity of words

[Diagram] Same architecture, now connected to MSCOCO: the word clusters (zebra / okapi, dress / tutu, cake / scone) sit in the shared GloVe embedding space used by the LSTM branch (W_glove, W_glove^T, text-specific loss) and the CNN + Embed branch (image-specific loss).

Page 10: Captioning Images with Diverse Objects

Combine to form a Caption Model

[Diagram] The pre-trained branches initialize a caption model: the CNN + Embed branch (image-specific loss) and the Embed → LSTM → W_glove^T branch (text-specific loss) provide the init parameters for a model trained on MSCOCO, where the two branches' word scores are combined by an elementwise sum under an image-text loss.

So far this is not different from existing caption models. Problem: forgetting.

Page 11: Captioning Images with Diverse Objects

Insights

3. Overcome “forgetting”, since pre-training alone is not sufficient.

[Catastrophic Forgetting in Neural Networks. Kirkpatrick et al. PNAS 2017]

Page 12: Captioning Images with Diverse Objects

Insight 3: Jointly train on multiple sources

[Diagram] Instead of only initializing, the three networks are trained jointly: the image network (CNN + Embed, image-specific loss), the caption network on MSCOCO (elementwise sum, image-text loss), and the language network (LSTM with W_glove / W_glove^T, text-specific loss), with the CNN, Embed, and LSTM parameters shared across networks.

Page 13: Captioning Images with Diverse Objects

Novel Object Captioner (NOC) Model

[Diagram] The full NOC model: the image network (image-specific loss), the caption network on MSCOCO (image-text loss), and the language network (text-specific loss) share the CNN, Embed, and LSTM parameters, are trained jointly, and are optimized under a single joint-objective loss.
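Written out (my notation, hedged; the slides only name the three losses), the joint objective is plausibly the sum of the per-source losses over the shared parameters θ:

```latex
\mathcal{L}_{\mathrm{joint}}(\theta) \;=\;
  \mathcal{L}_{\mathrm{image}}(\theta)
  + \mathcal{L}_{\mathrm{image\text{-}text}}(\theta)
  + \mathcal{L}_{\mathrm{text}}(\theta)
```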

Page 14: Captioning Images with Diverse Objects

Visual Network

[Diagram] CNN + Embed, trained with the image-specific loss.

Network: VGG-16 with a multi-label loss [sigmoid cross-entropy loss]
Training Data: Unpaired image data
Output: Vector with activations corresponding to scores for words in the vocabulary, e.g. impala: 0.86, green: 0.72, ..., cut: 0.04
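A hedged PyTorch sketch of such a multi-label visual network (torchvision's VGG-16 with its classifier head replaced); the vocabulary size, variable names, and training details are assumptions rather than the authors' released code.

```python
# Hedged sketch: VGG-16 scoring every vocabulary word, trained with a
# sigmoid cross-entropy (multi-label) loss on unpaired image data.
import torch
import torch.nn as nn
from torchvision import models

VOCAB_SIZE = 8800  # assumption: size of the caption vocabulary

class VisualNetwork(nn.Module):
    def __init__(self, vocab_size=VOCAB_SIZE):
        super().__init__()
        vgg = models.vgg16(weights=None)  # optionally load ImageNet-pretrained weights
        # Replace the 1000-way ImageNet classifier with a vocabulary-sized head.
        vgg.classifier[-1] = nn.Linear(4096, vocab_size)
        self.net = vgg

    def forward(self, images):
        return self.net(images)  # raw per-word scores (logits), shape (batch, vocab)

model = VisualNetwork()
criterion = nn.BCEWithLogitsLoss()       # sigmoid cross-entropy (multi-label)
images = torch.randn(2, 3, 224, 224)     # dummy image batch
labels = torch.zeros(2, VOCAB_SIZE)      # multi-hot word labels per image
labels[0, 42] = 1.0                      # e.g. image 0 is labeled with word id 42
loss = criterion(model(images), labels)
loss.backward()
```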

Page 15: Captioning Images with Diverse Objects

Language Model

[Diagram] Embed (W_glove) → LSTM → W_glove^T, trained with the text-specific loss.

Network: A single LSTM layer that predicts the next word w_{t+1} given the previous words w_0 … w_t, i.e. P(w_{t+1} | w_0, …, w_t).
(W_glove)^T: output weights shared with the input embedding W_glove.

Training Data: Unannotated text data (BNC, ukWac, Wikipedia, Gigaword)

Output: Vector with activations corresponding to scores for words in the vocabulary.

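A hedged PyTorch sketch of such a next-word LSTM with its output projection tied to the (GloVe-initialized) input embedding, matching the W_glove / W_glove^T sharing on the slide; dimensions and names are assumptions.

```python
# Hedged sketch: single-layer LSTM language model with tied input/output
# embeddings (the W_glove / W_glove^T sharing shown on the slide).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # W_glove (initialize from GloVe)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size, bias=False)
        self.out.weight = self.embed.weight               # tie output to W_glove (acts as W_glove^T)

    def forward(self, tokens):
        hidden, _ = self.lstm(self.embed(tokens))
        return self.out(hidden)                           # per-step scores over the vocabulary

vocab_size = 8800                                         # assumption
lm = LanguageModel(vocab_size)
tokens = torch.randint(0, vocab_size, (2, 12))            # dummy tokenized sentences
logits = lm(tokens[:, :-1])                               # predict w_{t+1} from w_0..w_t
loss = F.cross_entropy(logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
loss.backward()
```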

Page 16: Captioning Images with Diverse Objects

Caption Network

[Diagram] The CNN + Embed branch and the Embed (W_glove) → LSTM → W_glove^T branch are combined by an elementwise sum on MSCOCO, under the image-text loss.

Network: Combines the outputs of the visual and text networks. (softmax + cross-entropy loss)
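A hedged sketch of how the two per-word score vectors could be combined at a single decoding step; the elementwise sum followed by softmax cross-entropy follows the slide, while the function name, shapes, and vocabulary size are assumptions.

```python
# Hedged sketch: combine visual and language word scores by elementwise sum,
# then train with softmax cross-entropy on a paired MSCOCO caption step.
import torch
import torch.nn.functional as F

def caption_step_loss(visual_logits, language_logits, next_word):
    """visual_logits, language_logits: (batch, vocab) scores from the two branches;
    next_word: (batch,) ground-truth next-word ids from a paired caption."""
    combined = visual_logits + language_logits   # elementwise sum
    return F.cross_entropy(combined, next_word)  # softmax + cross-entropy

# Dummy example with the illustrative vocabulary size used above:
vocab_size = 8800
vis = torch.randn(2, vocab_size)
lang = torch.randn(2, vocab_size)
target = torch.randint(0, vocab_size, (2,))
print(caption_step_loss(vis, lang, target).item())
```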

Page 17: Captioning Images with Diverse Objects

Caption Model

[Diagram] CNN + Embed and Embed (W_glove) → LSTM → W_glove^T, combined by an elementwise sum on MSCOCO under the image-text loss.

Training Data: COCO images with multiple labels, e.g. bear, brown, field, grassy, trees, walking.
Training Data: Captions from COCO, e.g. ”A brown bear walking on a grassy field next to trees”.

Page 18: Captioning Images with Diverse Objects

NOC Model: Train simultaneously

[Diagram] All three networks are trained simultaneously with shared CNN, Embed, and LSTM parameters: the image network (image-specific loss), the caption network on MSCOCO (elementwise sum, image-text loss), and the language network (text-specific loss), under the joint-objective loss.
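A hedged sketch of what training simultaneously on the three data sources could look like, reusing the illustrative VisualNetwork and LanguageModel modules sketched earlier; the data loaders, the equal weighting of the losses, and the omission of explicit cross-branch weight sharing are all simplifying assumptions.

```python
# Hedged sketch: one optimizer over the parameters, one summed loss per step,
# drawing a mini-batch from each of the three data sources.
import itertools
import torch
import torch.nn.functional as F

def train_noc(visual_net, language_model, image_loader, text_loader, coco_loader,
              steps=1000, lr=1e-3):
    params = list(visual_net.parameters()) + list(language_model.parameters())
    optimizer = torch.optim.SGD(params, lr=lr)
    batches = zip(itertools.cycle(image_loader),
                  itertools.cycle(text_loader),
                  itertools.cycle(coco_loader))
    for _, (img_batch, text_batch, coco_batch) in zip(range(steps), batches):
        images, word_labels = img_batch            # unpaired images + multi-hot word labels
        tokens = text_batch                        # unannotated sentences (word ids)
        coco_images, coco_tokens = coco_batch      # paired MSCOCO images and captions

        # Image-specific loss: multi-label sigmoid cross-entropy.
        image_loss = F.binary_cross_entropy_with_logits(visual_net(images), word_labels)

        # Text-specific loss: next-word prediction on unannotated text.
        lang_logits = language_model(tokens[:, :-1])
        text_loss = F.cross_entropy(lang_logits.reshape(-1, lang_logits.size(-1)),
                                    tokens[:, 1:].reshape(-1))

        # Image-text loss: elementwise sum of both branches' word scores per step.
        step_logits = (visual_net(coco_images).unsqueeze(1)
                       + language_model(coco_tokens[:, :-1]))
        image_text_loss = F.cross_entropy(step_logits.reshape(-1, step_logits.size(-1)),
                                          coco_tokens[:, 1:].reshape(-1))

        # Joint objective: one backward pass updates all parameters together.
        loss = image_loss + image_text_loss + text_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```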

Page 19: Captioning Images with Diverse Objects

Evaluation

● Empirical: COCO held-out objects
○ In-domain [Use images from COCO]
○ Out-of-domain [Use ImageNet images for held-out concepts]
● Ablations
○ Embedding & Joint training contribution
● ImageNet
○ Quantitative
○ Human Evaluation - Objects not in COCO
○ Rare objects in COCO

Page 20: Captioning Images with Diverse Objects

Evaluation

● Empirical: COCO held-out objects
○ In-domain [Use images from COCO]
○ Out-of-domain [Use ImageNet images for held-out concepts]
● Ablations
○ Embedding & Joint training contribution
● ImageNet
○ Quantitative
○ Human Evaluation - Objects not in COCO
○ Rare objects in COCO

Page 21: Captioning Images with Diverse Objects

Empirical Evaluation: COCO dataset, In-Domain setting

MSCOCO Paired Image-Sentence Data:
”An elephant galloping in the green grass”
”Two people playing ball in a field”
”A black train stopped on the tracks”
”Someone is about to eat some pizza”

MSCOCO Unpaired Image Data (labels):
Elephant, Galloping, Green, Grass
People, Playing, Ball, Field
Black, Train, Tracks
Eat, Pizza
Kitchen, Microwave

MSCOCO Unpaired Text Data:
”An elephant galloping in the green grass”
”Two people playing ball in a field”
”A black train stopped on the tracks”
”Someone is about to eat some pizza”
”A microwave is sitting on top of a kitchen counter”
”A kitchen counter with a microwave on it”

Page 22: Captioning Images with Diverse Objects

Empirical Evaluation: COCO held-out dataset

MSCOCO Paired Image-Sentence Data:
”An elephant galloping in the green grass”
”Two people playing ball in a field”
”A black train stopped on the tracks”
”Someone is about to eat some pizza”

MSCOCO Unpaired Image Data (labels):
Elephant, Galloping, Green, Grass
People, Playing, Ball, Field
Black, Train, Tracks
Pizza
Microwave

MSCOCO Unpaired Text Data:
”An elephant galloping in the green grass”
”Two people playing ball in a field”
”A black train stopped on the tracks”
”A white plate topped with cheesy pizza and toppings.”
”A white refrigerator, stove, oven dishwasher and microwave”
”A kitchen counter with a microwave on it”

Held-out: the held-out object (microwave) appears only in the unpaired image and text data, not in the paired image-sentence data.

Page 23: Captioning Images with Diverse Objects

Empirical Evaluation: COCO

MSCOCO Paired Image-Sentence Data:
”An elephant galloping in the green grass”
”Two people playing ball in a field”
”A black train stopped on the tracks”

MSCOCO Unpaired Image Data (labels):
Baseball, batting, boy, swinging
Black, Train, Tracks
Pizza
Microwave
Two, elephants, Path, walking

MSCOCO Unpaired Text Data:
”A small elephant standing on top of a dirt field”
”A hitter swinging his bat to hit the ball”
”A black train stopped on the tracks”
”A white plate topped with cheesy pizza and toppings.”
”A white refrigerator, stove, oven dishwasher and microwave”

● CNN is pre-trained on ImageNet

Page 24: Captioning Images with Diverse Objects

Empirical Evaluation: Metrics

F1 (Utility): Ability to recognize and incorporate new words. (Is the word/object mentioned in the caption?)

METEOR: Fluency and sentence quality.
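As a hedged illustration of the word-incorporation F1 idea (not the talk's exact evaluation script), the sketch below scores, for one held-out word, how often generated captions mention it for images that do and do not contain the object.

```python
# Hedged sketch: precision/recall/F1 for mentioning a held-out word in captions.
# `captions` are generated descriptions; `has_object` says whether each image
# truly contains the held-out object (e.g. "microwave").
from typing import List

def word_f1(word: str, captions: List[str], has_object: List[bool]) -> float:
    mentions = [word in c.lower().split() for c in captions]
    tp = sum(m and h for m, h in zip(mentions, has_object))
    fp = sum(m and not h for m, h in zip(mentions, has_object))
    fn = sum((not m) and h for m, h in zip(mentions, has_object))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(word_f1("microwave",
              ["a microwave on a kitchen counter", "a cat on a couch"],
              [True, False]))  # 1.0 in this toy example
```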

Page 25: Captioning Images with Diverse Objects

Empirical Evaluation: Baselines

[1] J. Donahue, L.A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, T. Darrell. CVPR’15
[2] L.A. Hendricks, S. Venugopalan, M. Rohrbach, R. Mooney, K. Saenko, T. Darrell. CVPR’16

LRCN [1]: Does not caption novel objects.

DCC [2]: Copies parameters for the novel object from a similar object seen in training (also not end-to-end).

Page 26: Captioning Images with Diverse Objects

Empirical Evaluation: Results

[1] J. Donahue, L.A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, T. Darrell. CVPR’15
[2] L.A. Hendricks, S. Venugopalan, M. Rohrbach, R. Mooney, K. Saenko, T. Darrell. CVPR’16

[Chart] Results: F1 (Utility) and METEOR (Fluency).

Page 27: Captioning Images with Diverse Objects

Ablations


Evaluated on held-out COCO objects.

Page 28: Captioning Images with Diverse Objects

Ablation: Language Embedding

[Chart] Ablation variants compared on held-out MSCOCO objects: Language Embedding, Frozen CNN, Joint Training.

Page 29: Captioning Images with Diverse Objects

Ablation: Freeze CNN after pre-training

[Chart] Ablation variants compared on held-out MSCOCO objects: Language Embedding, Frozen CNN (CNN fixed after pre-training), Joint Training.

[Catastrophic forgetting in Neural Networks. Kirkpatrick et al. PNAS 2017]

Page 30: Captioning Images with Diverse Objects

Ablation: Joint Training

[Chart] Ablation variants compared on held-out MSCOCO objects: Language Embedding, Frozen CNN, Joint Training.

Page 31: Captioning Images with Diverse Objects

ImageNet: Human Evaluations


● ImageNet: 638 object classes not mentioned in COCO

NOC can describe 582 of these object classes (60% more objects than prior work).

Page 32: Captioning Images with Diverse Objects

ImageNet: Human Evaluations


● ImageNet: 638 object classes not mentioned in COCO

● Word Incorporation: Which model incorporates the word (name of the object) in the sentence better?

● Image Description: Which sentence (model) describes the image better?

Page 33: Captioning Images with Diverse Objects

ImageNet: Human Evaluations

[Chart] Human evaluation results (pairwise preferences, %): Word Incorporation — 43.78 / 25.74 / 6.10 / 24.37; Image Description — 40.16 / 59.84.

Page 34: Captioning Images with Diverse Objects

Qualitative Evaluation: ImageNet


Page 35: Captioning Images with Diverse Objects

Qualitative Evaluation: ImageNet


Page 36: Captioning Images with Diverse Objects

Qualitative Examples: Errors


Sunglass (n04355933) Error: Grammar

NOC: A sunglass mirror reflection of a mirror in a mirror.

Gymnast (n10153594) Error: Gender, Hallucination

NOC: A man gymnast in a blue shirt doing a trick on a skateboard.

Balaclava (n02776825) Error: Repetition

NOC: A balaclava black and white photo of a man in a balaclava.

Cougar (n02125311) Error: Description

NOC: A cougar with a cougar in its mouth.

Page 37: Captioning Images with Diverse Objects

Novel Object Captioner - Take away

An okapi standing in the middle of a field.

Semantic embeddings and joint training to caption hundreds of objects.

Page 38: Captioning Images with Diverse Objects


Poster 11