Compositional Captioning - University of...

Compositional Captioning: Describing Novel Object Categories without Paired Training Data

MLSLP 2016

Lisa Anne Hendricks¹, Subhashini Venugopalan², Marcus Rohrbach¹, Raymond Mooney², Kate Saenko³, Trevor Darrell¹

¹ University of California, Berkeley  ² University of Texas at Austin  ³ Boston University

Visual Description

Berkeley LRCN: A brown bear standing on top of a lush green field.

MS CaptionBot: A large brown bear walking through a forest.

LRCN: Donahue, Jeff et al. CVPR 2015. Microsoft CaptionBot: http://captionbot.ai/

A brown bear walking across a lush green field.

A large brown bear walking through a forest.

A brown bear sitting on top of a green field.

A brown bear walks in the grass in front of trees.

A brown bear walking on a grassy field next to trees.

A large brown bear walking across a lush green field.

Problems with Visual Description

LRCN: Donahue, Jeff et al. CVPR 2015. CaptionBot: http://captionbot.ai/

Berkeley LRCN: “A black bear is standing in the grass.”

MS CaptionBot: “A bear that is eating some grass.”

Ours: “A anteater is standing in the grass.”

We present the Deep Compositional Captioner (DCC), which can compose descriptions about novel objects in context.

Existing Methods

Paired Image-Sentence Data:
A green and white bus driving down the street.
A brown table with lots of bottles on it.

Deep Compositional Captioner

Unpaired Image Data: bottle, otter, toad, bus

Unpaired Text Data:
A bus is a road vehicle designed to carry many passengers.
Otters live in a variety of aquatic environments.

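DCC Key Insights

1. Effectively train with outside data
   Learn image features with unpaired image data
   Learn language features with unpaired text data

2. Transfer knowledge between related concepts
   giraffe, impala
   dress, tutu
   cake, scone

[Figure: multimodal unit combining the image feature $f_I$ and the language feature $f_L$ (computed from the previous word) to produce the predicted word.]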

Lexical Classifier

Training Data: Unpaired Image Data

Network: VGG + multilabel loss (sigmoid cross-entropy)

Feature: Vector with activations corresponding to scores for visual concepts in an image.

[Figure: CNN → Classification Layer → $f_I$, e.g. Impala: 0.86, Sunny: 0.72, …, Bus: 0.04]
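The multilabel loss named on this slide can be illustrated with a minimal NumPy sketch. The three-concept vocabulary and the logit values are made up for illustration (chosen so the sigmoid scores land near the slide's 0.86, 0.72, and 0.04), and no VGG network is involved here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multilabel_sigmoid_cross_entropy(logits, labels):
    """Mean sigmoid cross-entropy over concepts; labels are 0/1 per concept."""
    # Numerically stable form of -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(x).
    per_concept = (np.maximum(logits, 0) - logits * labels
                   + np.log1p(np.exp(-np.abs(logits))))
    return per_concept.mean()

# Hypothetical concepts and classifier logits for a single image.
concepts = ["impala", "sunny", "bus"]
logits = np.array([1.8, 0.95, -3.2])
labels = np.array([1.0, 1.0, 0.0])  # the image shows an impala on a sunny day

f_I = sigmoid(logits)  # image feature: one score per visual concept
loss = multilabel_sigmoid_cross_entropy(logits, labels)

for name, score in zip(concepts, f_I):
    print(f"{name}: {score:.2f}")
print(f"loss: {loss:.4f}")
```

Each concept gets its own independent sigmoid, so several concepts can be active for one image, unlike a softmax over mutually exclusive classes.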

Language Model

Training Data: Unpaired Text Data

Network: Embed layer + LSTM unit. Model trained to predict a word, $w_t$, given the previous words in a sentence, $w_{0:t-1}$.

Feature: Vector which encodes previous words in the sentence.

[Figure: Previous Word → Embed → LSTM → $f_L$ → $W_L$ → Predicted Word]
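To make the next-word objective concrete, the sketch below turns one unpaired sentence into (previous words, target word) training examples. The start token `<s>` and the simple whitespace tokenization are assumptions of this sketch, not details given on the slides.

```python
def next_word_examples(sentence, start_token="<s>"):
    """Return (previous_words, target_word) pairs for next-word prediction."""
    words = [start_token] + sentence.lower().rstrip(".").split()
    # At step t the model sees words[0:t] and is trained to predict words[t].
    return [(words[:t], words[t]) for t in range(1, len(words))]

examples = next_word_examples("Otters live in a variety of aquatic environments.")
for previous, target in examples[:3]:
    print(previous, "->", target)
```

No image is needed to build these examples, which is why the language model can be trained on text-only corpora.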

[Figure: DCC architecture. Language Model: Previous Word → Embed → LSTM → $f_L$ → $W_L$ → Predicted Word (trained with unpaired text data). Lexical Classifier: CNN → Classification Layer → $f_I$ (trained with unpaired image data). Caption Model: the Multimodal Unit combines $f_I$ and $f_L$ through $W_I$ and $W_L$ to produce the Predicted Word (trained with paired image-sentence data).]

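Multimodal Unit

Previous words: A brown

$S(w_t \mid I, w_{0:t-1}) = f_L W_L + f_I W_I + b$

Language feature $f_L$: $f_L W_L$ is large for Giraffe, Horse, Couch, …, Standing

Image feature $f_I$: $f_I W_I$ is large for Giraffe, Trees, Standing, …, Couch

[Figure: multimodal unit combining $f_L$ and $f_I$ through $W_L$ and $W_I$ to produce the predicted word.]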
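The score computed by the multimodal unit can be sketched in a few lines of NumPy. The vocabulary, feature sizes, and random weights below are placeholders, and the softmax that turns scores into word probabilities is an assumption of this sketch rather than something shown on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and feature dimensions.
vocab = ["a", "brown", "giraffe", "horse", "couch", "trees", "standing"]
d_L, d_I, V = 4, 5, len(vocab)

f_L = rng.normal(size=d_L)       # language feature for the previous words
f_I = rng.normal(size=d_I)       # image feature from the lexical classifier
W_L = rng.normal(size=(d_L, V))  # language-to-word weights
W_I = rng.normal(size=(d_I, V))  # image-to-word weights
b = np.zeros(V)

def multimodal_scores(f_L, W_L, f_I, W_I, b):
    """One score per vocabulary word: f_L W_L + f_I W_I + b."""
    return f_L @ W_L + f_I @ W_I + b

def softmax(s):
    e = np.exp(s - s.max())  # subtract the max for numerical stability
    return e / e.sum()

scores = multimodal_scores(f_L, W_L, f_I, W_I, b)
probs = softmax(scores)
predicted = vocab[int(np.argmax(probs))]
print(predicted)
```

Because the two terms are added, a word scores highest when both the language context and the image content support it, which matches the slide's example where "giraffe" is large under both features.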


Weight Transfer

Transfer pair (giraffe, impala) chosen using word2vec.

$S(w_t = \text{giraffe} \mid I, w_{0:t-1}) = f_L W_L[:, v_g] + f_I W_I[:, v_g] + b_g$

$S(w_t = \text{impala} \mid I, w_{0:t-1}) = f_L W_L[:, v_i] + f_I W_I[:, v_i] + b_i$

Here $v_g$ and $v_i$ index giraffe and impala in the vocabulary.

[Figure: columns $W_L[:, v_g]$, $W_I[:, v_g]$ (giraffe) and $W_L[:, v_i]$, $W_I[:, v_i]$ (impala) of the multimodal unit.]
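A rough sketch of the transfer step follows. The three-dimensional embeddings stand in for word2vec vectors, the zeroed impala columns model a word never seen in paired data, and copying the columns unchanged is this sketch's assumption about how the transfer is carried out; any further adjustment the method applies is omitted.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def closest_known_word(new_word, known_words, embeddings):
    """Pick the known word whose embedding is most similar to the new word's."""
    return max(known_words, key=lambda w: cosine(embeddings[new_word], embeddings[w]))

def transfer_weights(W_L, W_I, b, src, dst):
    """Copy the multimodal-unit weights of word index src to word index dst."""
    W_L[:, dst] = W_L[:, src]
    W_I[:, dst] = W_I[:, src]
    b[dst] = b[src]

vocab = ["giraffe", "couch", "impala"]
index = {w: i for i, w in enumerate(vocab)}

# Hypothetical embeddings standing in for word2vec vectors.
embeddings = {
    "giraffe": np.array([0.9, 0.1, 0.0]),
    "couch": np.array([0.0, 0.2, 0.9]),
    "impala": np.array([0.8, 0.3, 0.1]),
}

rng = np.random.default_rng(0)
W_L = rng.normal(size=(4, 3))
W_I = rng.normal(size=(5, 3))
b = rng.normal(size=3)
# Assumption: the novel word starts with zero weights.
W_L[:, index["impala"]] = 0.0
W_I[:, index["impala"]] = 0.0
b[index["impala"]] = 0.0

source = closest_known_word("impala", ["giraffe", "couch"], embeddings)
transfer_weights(W_L, W_I, b, index[source], index["impala"])

f_L, f_I = rng.normal(size=4), rng.normal(size=5)
scores = f_L @ W_L + f_I @ W_I + b
print(source, scores[index["giraffe"]], scores[index["impala"]])
```

After a plain copy the two words receive identical scores for any input, so telling an impala from a giraffe would have to rely on something beyond the copied columns.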

Evaluation

MSCOCO Paired Image-Sentence Data
MSCOCO Unpaired Image Data
MSCOCO Unpaired Text Data

"An elephant galloping in the green grass"
"Two people playing ball in a field"
"A black train stopped on the tracks"
"Someone is about to eat some pizza"

Elephant, Galloping, Green, Grass
People, Playing, Ball, Field
Black, Train, Tracks
Eat, Pizza

"A microwave is sitting on top of a kitchen counter"
"A kitchen counter with a microwave on it"
Kitchen, Microwave










Evaluation: Held-out dataset

Black, Train, Tracks
Pizza

"A microwave is sitting on top of a kitchen counter"
"A kitchen counter with a microwave on it"
Microwave
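One way to picture the held-out setup is as a filter over captions. In this sketch the held-out set lists only the two objects visible on these slides, and the exact-token matching rule is an assumption; the slides do not say how sentences were selected.

```python
# Only two of the held-out objects appear on these slides; the set is illustrative.
HELD_OUT = {"pizza", "microwave"}

def mentions_held_out(caption, held_out=HELD_OUT):
    """True if any token of the caption is a held-out object word."""
    words = caption.lower().replace(".", " ").split()
    return any(w in held_out for w in words)

captions = [
    "An elephant galloping in the green grass",
    "Two people playing ball in a field",
    "A black train stopped on the tracks",
    "Someone is about to eat some pizza",
    "A microwave is sitting on top of a kitchen counter",
    "A kitchen counter with a microwave on it",
]

# Captions that stay in the paired training data versus those held out for evaluation.
paired_train = [c for c in captions if not mentions_held_out(c)]
held_out_eval = [c for c in captions if mentions_held_out(c)]
print(len(paired_train), len(held_out_eval))
```

Exact-token matching would miss plurals such as "pizzas"; a real split would need a more careful rule.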

Results: MSCOCO In-Domain

Comparison of DCC to LRCN and DCC with no transfer.
- High F1 score indicates DCC can describe words outside of paired image-sentence data
- Increased METEOR indicates DCC produces better sentences

                            LRCN    DCC (No Transfer)   DCC (Ours)
Efficacy (F1)               0.00    0.00                39.78
Sentence Quality (METEOR)   19.33   19.90               21.00
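The F1 figures can be read as a word-mention score. The sketch below assumes a generated caption counts as a positive prediction when it contains the held-out word, and that each image carries a flag saying whether the object is really present. This scoring rule and the third caption are assumptions for illustration, not definitions from the slides.

```python
def f1_for_word(word, generated_captions, object_present):
    """F1 of mentioning `word` exactly when the object is present in the image."""
    tp = fp = fn = 0
    for caption, present in zip(generated_captions, object_present):
        mentioned = word in caption.lower().rstrip(".").split()
        if mentioned and present:
            tp += 1
        elif mentioned and not present:
            fp += 1
        elif present:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

generated = [
    "A white microwave sitting on a brick wall.",
    "A white and black cat is sitting on a toilet.",
    "A microwave on a counter.",  # hypothetical caption
]
present = [True, True, False]

score = f1_for_word("microwave", generated, present)
never_mentions = f1_for_word("microwave", ["A cat on a toilet."] * 3, present)
print(score, never_mentions)
```

Under this reading, a model that never produces the held-out word scores 0.0, which is consistent with the 0.00 entries for LRCN and DCC (No Transfer).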




Empirical Evaluation: Out-of-Domain Held-Out Dataset

Black, Train, Tracks

"A kitchen counter with a microwave on it"

Pizza

"Pepperoni is a popular pizza topping."

"All microwaves use a timer for the cooking time"

Microwave

Results: MSCOCO Out-of-Domain

                    Unpaired Image Data   Unpaired Text Data   METEOR   F1
LRCN                N/A                   N/A                  19.33    0.00
DCC (No Transfer)   MSCOCO                MSCOCO               19.90    0.00
DCC (Ours)          MSCOCO                MSCOCO               21.00    39.78
DCC (Ours)          ImageNet              MSCOCO               20.71    33.60
DCC (Ours)          ImageNet              Caption Txt          20.66    35.53
DCC (Ours)          ImageNet              Web Corpus           20.66    34.94

DCC performs well when using out-of-domain data to train the lexical classifier and language model.

No transfer: A green and white street sign on a city street.
DCC: A green and white bus parked on the side of the street.

No transfer: A dog lying on a bed with a large brown dog.
DCC: A dog lying on a couch with a large window in the background.

No transfer: Two giraffes are eating grass in the field.
DCC: Two zebra grazing in a green grass field.

No transfer: A white and black cat is sitting on a toilet.
DCC: A white microwave sitting on a brick wall.

DCC can describe over 300 ImageNet visual concepts in diverse contexts.

DCC: A person is holding a gecko in their hand.
Berkeley LRCN: A person holding a piece of food in their hand.
MS CaptionBot: A close up of a person holding a baby.

DCC: A gecko is standing on a branch of a tree.
Berkeley LRCN: A bird is standing on the edge of a rock.
MS CaptionBot: A bird that is standing in the water.

A woman in a chiffon tutu.

A white centrifuge is sitting on the table.

A bunch of a lychee are in a market.

A group of people standing around a baobab in a field.

A brown bobcat in a green field.

A close up of a wooden table with a bottle of whisky.

A close up of a scone on a plate.

A black and white photo of a candelabra in a room.

Failure Cases

A woman is riding a unicycle on a unicycle.

A group of people standing around a foxhunting on a field.

Results: Video Description

                                METEOR   F1
Baseline (No Transfer)          28.80    0.0
+ DCC (ours)                    28.9     6.0
+ ILSVRC Videos (No Transfer)   29.0     0.0
+ DCC (ours) + ILSVRC Videos    29.10    22.2

Novel Object Captioner

“Captioning Images with Diverse Objects”, Venugopalan 2016, http://arxiv.org/abs/1606.07770

DCC Issue: Not End-to-End Trainable

[Figure: DCC pipeline. Language Model: Previous Word → Embed → LSTM → $f_L$ → $W_L$ → Predicted Word. Lexical Classifier: CNN → Classification Layer → $f_I$. Caption Model: Multimodal Unit combining $f_I$ and $f_L$ through $W_I$ and $W_L$ → Predicted Word.]

NOC Solution: Joint Objective Loss

[Figure: NOC network built from CNN, Embed, and LSTM layers and trained with a Joint Objective Loss made up of an Image-Specific Loss, an Image-Text Loss, and a Text-Specific Loss.]

DCC Issue: Transfer Mechanism

A man is playing racket on a racket.

NOC Solution: Semantic Embedding

[Figure: Previous Word → $W_{glove}$ → LSTM → $W_{glove}^T$ → Predicted Word, in place of the Embed layers.]

Training

[Figure: NOC trained with the Joint Objective Loss, which combines the Image-Specific Loss, the Image-Text Loss, and the Text-Specific Loss.]
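The two NOC ideas on these slides can be sketched together. Below, a random matrix stands in for pretrained GloVe vectors, `tanh` is a placeholder for the LSTM, and the joint objective is written as an unweighted sum of the three losses; all three choices are assumptions of this sketch, since the slides only name the components.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 4                        # hypothetical vocabulary size and embedding size
W_glove = rng.normal(size=(V, d))  # stand-in for pretrained GloVe vectors

def embed(word_id):
    """Input embedding: look up the word's row of W_glove."""
    return W_glove[word_id]

def output_scores(h):
    """Output layer reusing the same matrix transposed: h W_glove^T."""
    return W_glove @ h

def joint_objective(image_loss, text_loss, image_text_loss):
    """Assumed unweighted sum of the three losses named on the slide."""
    return image_loss + text_loss + image_text_loss

h = np.tanh(embed(2))  # placeholder for the LSTM hidden state
scores = output_scores(h)
total = joint_objective(0.5, 1.0, 0.25)
print(scores.shape, total)
```

Sharing one embedding matrix between input and output means words with similar vectors receive similar output scores, which may be what lets related words share knowledge without an explicit copy step.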

F1 Scores for NOC and DCC

       Bottle   Bus     Couch   Microwave   Pizza   Racket   Suitcase   Zebra   Average
DCC    4.63     29.79   45.87   28.09       64.59   52.24    13.16      79.88   39.78
NOC    19.02    69.34   33.25   26.46       69.16   62.45    34.65      89.78   50.51
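As a quick arithmetic check, the Average column can be recomputed from the per-object scores in the table, and the same data shows where NOC improves on DCC. The numbers are copied from the table; the helper name `mean_f1` is just for this sketch.

```python
dcc = {"bottle": 4.63, "bus": 29.79, "couch": 45.87, "microwave": 28.09,
       "pizza": 64.59, "racket": 52.24, "suitcase": 13.16, "zebra": 79.88}
noc = {"bottle": 19.02, "bus": 69.34, "couch": 33.25, "microwave": 26.46,
       "pizza": 69.16, "racket": 62.45, "suitcase": 34.65, "zebra": 89.78}

def mean_f1(scores):
    """Unweighted mean over the held-out objects."""
    return sum(scores.values()) / len(scores)

# Objects where NOC scores higher than DCC, and where it scores lower.
improved = [name for name in dcc if noc[name] > dcc[name]]
regressed = [name for name in dcc if noc[name] < dcc[name]]

print(round(mean_f1(dcc), 2), round(mean_f1(noc), 2))
print("NOC higher:", improved)
print("NOC lower:", regressed)
```

The recomputed means match the table's Average column, and the comparison shows NOC ahead on six of the eight objects, with couch and microwave as the exceptions.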

Ablation: Auxiliary Objective

Columns: Contributing Factor | Glove | LM Pretrain | Image Pretrain | Auxiliary Objective | Meteor | F1

Pretraining & Glove   X X X       19.80   25.38
Fix Image Model       X X Fixed   18.91   39.70
All                   X X X X     20.69   50.51

Ablation: Glove Embedding

Columns: Contributing Factor | Glove | LM Pretrain | Image Pretrain | Auxiliary Objective | Meteor | F1

Auxiliary Objective   X X       15.78   14.41
Glove                 X X X     19.69   47.02
All                   X X X X   20.69   50.51

Training with Outside Data

Image Data   Text Data    Meteor   F1
MSCOCO       MSCOCO       20.69    50.51
MSCOCO       Web Corpus   19.15    41.74
ImageNet     Web Corpus   17.55    36.50

Describing ImageNet

A otter is sitting on a rock in the sun.

A large flounder is resting on a rock.

A table with a plate of sashimi and vegetables.

A large glacier with a mountain in the background.

A man is standing on a beach holding a snapper.

A group of people standing around a large white warship.

Errors

A chainsaw is sitting on a chainsaw near a chainsaw.

A volcano view of a volcano in the sun.

Our Team:

Lisa Anne Hendricks

Subhashini Venugopalan

Marcus Rohrbach

Raymond Mooney

Kate Saenko

Trevor Darrell

Existing Methods

Compositional Captioner

A anteater is standing in the grass.

LRCN: A black bear is standing in the grass.
CaptionBot: A bear that is eating some grass.

Paired Image-Sentence Data, Unpaired Image Data, Unpaired Text Data
