Compositional Captioning: Describing Novel Object Categories without Paired Training Data
MLSLP 2016
Lisa Anne Hendricks¹, Subhashini Venugopalan², Marcus Rohrbach¹, Raymond Mooney², Kate Saenko³, Trevor Darrell¹
¹ University of California, Berkeley  ² University of Texas at Austin  ³ Boston University
Visual Description
Berkeley LRCN: A brown bear standing on top of a lush green field.
MS CaptionBot: A large brown bear walking through a forest.
LRCN: Donahue, Jeff et al. CVPR 2015. Microsoft CaptionBot: http://captionbot.ai/
A brown bear walking across a lush green field.
A large brown bear walking through a forest.
A brown bear sitting on top of a green field.
A brown bear walks in the grass in front of trees.
A brown bear walking on a grassy field next to trees.
A large brown bear walking across a lush green field.
Problems with Visual Description
LRCN: Donahue, Jeff et al. CVPR 2015. CaptionBot: http://captionbot.ai/
Berkeley LRCN: “A black bear is standing in the grass.”
MS CaptionBot: “A bear that is eating some grass.”
Ours: “A anteater is standing in the grass.”
We present the Deep Compositional Captioner (DCC), which can compose descriptions about novel objects in context.
Existing Methods
Paired Image-Sentence Data
A green and white bus driving down the street.
A brown table with lots of bottles on it.
Deep Compositional Captioner
Unpaired Image Data
bottle
otter
toad
bus
Unpaired Text Data
A bus is a road vehicle designed to carry many passengers.
Otters live in a variety of aquatic environments.
DCC Key Insights
1. Effectively train with outside data
Learn image features with unpaired image data
Learn language features with unpaired text data
2. Transfer knowledge between related concepts
giraffe, impala
dress, tutu
cake, scone
[Diagram: Previous Word, f_I, f_L, Multimodal Unit, Predicted Word]
Lexical Classifier
Training Data: Unpaired Image Data
Network: VGG + multilabel loss (sigmoid cross-entropy)
Feature: Vector with activations corresponding to scores for visual concepts in an image.
Example scores: Impala: 0.86, Sunny: 0.72, …, Bus: 0.04
[Diagram: CNN -> Classification Layer -> f_I]
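The multilabel loss named on this slide can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the logit values are made up so that the sigmoid outputs round to the example scores on the slide, and the target labels are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def multilabel_sigmoid_cross_entropy(logits, targets, eps=1e-12):
    """Mean binary cross-entropy over independent per-concept sigmoids."""
    probs = sigmoid(logits)
    losses = -(targets * np.log(probs + eps)
               + (1.0 - targets) * np.log(1.0 - probs + eps))
    return float(losses.mean())

# Hypothetical classification-layer outputs for one image.
concepts = ["impala", "sunny", "bus"]
logits = np.array([1.8, 0.95, -3.2])   # made-up values
targets = np.array([1.0, 1.0, 0.0])    # several concepts can be present at once

f_I = sigmoid(logits)                  # per-concept scores, used as the image feature
loss = multilabel_sigmoid_cross_entropy(logits, targets)

for name, score in zip(concepts, f_I):
    print(f"{name}: {score:.2f}")
print(f"loss: {loss:.3f}")
```

Unlike a softmax, each concept gets its own sigmoid, so the scores need not sum to one and an image can score highly for both "impala" and "sunny".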
Language Model
Training Data: Unpaired Text Data
Network: Embed layer + LSTM unit. Model trained to predict a word, w_t, given the previous words in a sentence, w_{0:t-1}.
Feature: Vector which encodes previous words in the sentence.
[Diagram: Previous Word -> Embed -> LSTM -> f_L -> W_L -> Predicted Word]
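The next-word training setup can be illustrated by turning one unpaired sentence into (previous words, target word) examples. This is only a sketch of the data layout: the `<s>` start token standing in for w_0 and the whitespace tokenization are assumptions, and the LSTM itself is not shown.

```python
def next_word_examples(sentence, start_token="<s>"):
    """Return (w_{0:t-1}, w_t) pairs for every position t in the sentence.

    The start token plays the role of w_0 (an assumption for illustration).
    """
    words = [start_token] + sentence.lower().rstrip(".").split()
    return [(words[:t], words[t]) for t in range(1, len(words))]

examples = next_word_examples("Otters live in a variety of aquatic environments.")
for previous_words, target in examples[:3]:
    print(previous_words, "->", target)
```

Each pair is one prediction problem for the language model; no image is involved, which is why plain text corpora are enough to train this part.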
Caption Model
Language Model: trained with unpaired text data
Lexical Classifier: trained with unpaired image data
Multimodal Unit: trained with paired image-sentence data
[Diagram: Language Model (Previous Word -> Embed -> LSTM -> f_L), Lexical Classifier (CNN -> Classification Layer -> f_I), and Multimodal Unit (f_I and f_L, weighted by W_I and W_L -> Predicted Word)]
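Multimodal Unit
S(w_t | I, w_{0:t-1}) = f_L W_L + f_I W_I + b
f_L: Language Feature; f_I: Image Feature
Example (previous words: "A brown"):
f_L W_L large for: Giraffe, Horse, Couch, …, Standing
f_I W_I large for: Giraffe, Trees, Standing, …, Couch
[Diagram: f_L and f_I pass through W_L and W_I in the Multimodal Unit to give the Predicted Word]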
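The multimodal unit score, S = f_L W_L + f_I W_I + b, can be written out directly. The sketch below uses random toy features and weights (all values and dimensions are hypothetical), and it applies a softmax to turn scores into word probabilities, which is an assumption not shown on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["giraffe", "horse", "couch", "trees", "standing"]  # toy vocabulary

d_L, d_I = 4, 5                             # hypothetical feature sizes
f_L = rng.normal(size=d_L)                  # language feature (from the LSTM)
f_I = rng.normal(size=d_I)                  # image feature (lexical classifier scores)
W_L = rng.normal(size=(d_L, len(vocab)))    # language weights, one column per word
W_I = rng.normal(size=(d_I, len(vocab)))    # image weights, one column per word
b = np.zeros(len(vocab))

def multimodal_scores(f_L, W_L, f_I, W_I, b):
    """Per-word score: language term plus image term plus bias."""
    return f_L @ W_L + f_I @ W_I + b

def softmax(scores):
    """Numerically stable softmax (assumed normalization)."""
    shifted = np.exp(scores - scores.max())
    return shifted / shifted.sum()

scores = multimodal_scores(f_L, W_L, f_I, W_I, b)
probs = softmax(scores)
predicted_word = vocab[int(np.argmax(scores))]
print(predicted_word, probs.round(3))
```

Because the two terms are simply added, a word scores highly only when the language context and the image evidence both support it, or when one of them supports it very strongly.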
Weight Transfer
Transfer pair chosen using word2vec
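Choosing a transfer pair by embedding similarity might look like the sketch below. The three-dimensional vectors are hand-made stand-ins, not real word2vec embeddings, and the use of cosine similarity is an assumption; the slide only states that word2vec is used.

```python
import numpy as np

# Toy embeddings (hypothetical values, NOT real word2vec vectors).
known_words = {
    "giraffe": np.array([0.9, 0.1, 0.0]),
    "cake":    np.array([0.0, 0.9, 0.2]),
    "dress":   np.array([0.1, 0.0, 0.9]),
}
novel_words = {
    "impala": np.array([0.8, 0.2, 0.1]),
    "scone":  np.array([0.1, 0.8, 0.1]),
    "tutu":   np.array([0.0, 0.1, 0.8]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def closest_known_word(novel_vec, known):
    """Known word whose embedding is most similar to the novel word's."""
    return max(known, key=lambda word: cosine(novel_vec, known[word]))

pairs = {novel: closest_known_word(vec, known_words)
         for novel, vec in novel_words.items()}
print(pairs)
```

With these toy vectors the pairs match the ones listed in the deck (giraffe and impala, cake and scone, dress and tutu), but only because the vectors were chosen to make that happen.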
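S(w_t = impala | I, w_{0:t-1}) = f_L W_L[:, v_impala] + f_I W_I[:, v_impala] + b_impala
S(w_t = giraffe | I, w_{0:t-1}) = f_L W_L[:, v_giraffe] + f_I W_I[:, v_giraffe] + b_giraffe
[Diagram: Multimodal Unit with the weight columns W_L[:, v_giraffe], W_I[:, v_giraffe] and W_L[:, v_impala], W_I[:, v_impala] highlighted, and two entries marked 0]
Transfer pair: giraffe, impala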
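A minimal sketch of the column copy suggested by the two score equations: the giraffe columns of W_L and W_I, plus the giraffe bias, are copied into the impala slots. The toy sizes, the random weights, and the zero-initialized impala columns are assumptions for illustration, and the sketch does not attempt to reproduce the two entries marked 0 on the slide.

```python
import numpy as np

vocab = ["giraffe", "impala", "bus"]             # toy vocabulary (hypothetical)
v = {word: i for i, word in enumerate(vocab)}    # word -> column index

rng = np.random.default_rng(0)
W_L = rng.normal(size=(4, len(vocab)))
W_I = rng.normal(size=(5, len(vocab)))
b = rng.normal(size=len(vocab))

# Assumption: "impala" never appears in paired data, so its weights are
# untrained; zeros stand in for that here.
W_L[:, v["impala"]] = 0.0
W_I[:, v["impala"]] = 0.0
b[v["impala"]] = 0.0

def transfer_weights(W_L, W_I, b, known, novel):
    """Copy the known word's columns and bias into the novel word's slots."""
    W_L, W_I, b = W_L.copy(), W_I.copy(), b.copy()
    W_L[:, novel] = W_L[:, known]
    W_I[:, novel] = W_I[:, known]
    b[novel] = b[known]
    return W_L, W_I, b

def score(word, f_L, f_I, W_L, W_I, b):
    """S(w_t = word | I, w_{0:t-1}) for one word."""
    j = v[word]
    return float(f_L @ W_L[:, j] + f_I @ W_I[:, j] + b[j])

f_L = rng.normal(size=4)
f_I = rng.normal(size=5)
W_L2, W_I2, b2 = transfer_weights(W_L, W_I, b, v["giraffe"], v["impala"])

before = score("impala", f_L, f_I, W_L, W_I, b)
after = score("impala", f_L, f_I, W_L2, W_I2, b2)
print(before, after, score("giraffe", f_L, f_I, W_L2, W_I2, b2))
```

After a plain copy the two words receive identical scores for any input, so some further adjustment is needed for the image evidence to tell them apart; that step is outside this sketch.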
Evaluation
MSCOCO Paired Image-Sentence Data
MSCOCO Unpaired Image Data
MSCOCO Unpaired Text Data
“An elephant galloping in the green grass”
“Two people playing ball in a field”
“A black train stopped on the tracks”
“Someone is about to eat some pizza”
Elephant, Galloping, Green, Grass
People, Playing, Ball, Field
Black, Train, Tracks
Eat, Pizza
“An elephant galloping in the green grass”
“Two people playing ball in a field”
“A black train stopped on the tracks”
“Someone is about to eat some pizza”
“A microwave is sitting on top of a kitchen counter”
“A kitchen counter with a microwave on it”
Kitchen, Microwave
Evaluation
MSCOCO Paired Image-Sentence Data
MSCOCO Unpaired Image Data
MSCOCO Unpaired Text Data
“An elephant galloping in the green grass”
“Two people playing ball in a field”
“A black train stopped on the tracks”
“Someone is about to eat some pizza”
Elephant, Galloping, Green, Grass
People, Playing, Ball, Field
Black, Train, Tracks
Pizza
“An elephant galloping in the green grass”
“Two people playing ball in a field”
“A black train stopped on the tracks”
“Someone is about to eat some pizza”
“A microwave is sitting on top of a kitchen counter”
“A kitchen counter with a microwave on it”
Microwave
Held-out dataset
Results: MSCOCO In-Domain
                             LRCN    DCC (No Transfer)   DCC (Ours)
Efficacy (F1)                0.00    0.00                39.78
Sentence Quality (METEOR)    19.33   19.90               21.00
Comparison of DCC to LRCN and DCC with no transfer.
- High F1 score indicates DCC can describe words outside of paired image-sentence data
- Increased METEOR indicates DCC produces better sentences
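To make the F1 column concrete, the sketch below computes an F1 score for one held-out word from a handful of captions. How the deck actually counts hits is not stated; this assumes a caption counts as a positive prediction when it contains the word, and that the ground truth is whether the object is in the image. The captions and labels are hypothetical.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def novel_word_f1(captions, object_present, word):
    """F1 for whether generated captions mention `word` when the object is present."""
    tp = fp = fn = 0
    for caption, present in zip(captions, object_present):
        mentioned = word in caption.lower().rstrip(".").split()
        if mentioned and present:
            tp += 1
        elif mentioned and not present:
            fp += 1
        elif present:
            fn += 1
    return f1_score(tp, fp, fn)

# Hypothetical generated captions and ground-truth presence of a microwave.
captions = [
    "A white microwave sitting on a brick wall.",
    "A kitchen counter with a toaster on it.",
    "A microwave on a toilet.",
    "A dog lying on a couch.",
]
object_present = [True, True, False, False]

print(novel_word_f1(captions, object_present, "microwave"))
```

Under this definition a model that never produces the held-out word has no true positives and scores 0.0, which is consistent with the 0.00 entries for LRCN and DCC (No Transfer).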
Empirical Evaluation
MSCOCO Paired Image-Sentence Data
MSCOCO Unpaired Image Data
MSCOCO Unpaired Text Data
“An elephant galloping in the green grass”
“Two people playing ball in a field”
“A black train stopped on the tracks”
“Someone is about to eat some pizza”
Elephant, Galloping, Green, Grass
People, Playing, Ball, Field
Black, Train, Tracks
“An elephant galloping in the green grass”
“Two people playing ball in a field”
“A black train stopped on the tracks”
“A kitchen counter with a microwave on it”
Out-of-Domain Held Out Dataset
Pizza
“Pepperoni is a popular pizza topping.”
“All microwaves use a timer for the cooking time”
Microwave
Results: MSCOCO Out-of-Domain
                     Unpaired Image Data   Unpaired Text Data   METEOR   F1
LRCN                 N/A                   N/A                  19.33    0.00
DCC (No Transfer)    MSCOCO                MSCOCO               19.90    0.00
DCC (Ours)           MSCOCO                MSCOCO               21.00    39.78
DCC (Ours)           ImageNet              MSCOCO               20.71    33.60
DCC (Ours)           ImageNet              CaptionTxt           20.66    35.53
DCC (Ours)           ImageNet              WebCorpus            20.66    34.94
DCC performs well when using out-of-domain data to train the lexical classifier and language model.
No transfer: A green and white street sign on a city street.
DCC: A green and white bus parked on the side of the street.
No transfer: A dog lying on a bed with a large brown dog.
DCC: A dog lying on a couch with a large window in the background.
No transfer: Two giraffes are eating grass in the field.
DCC: Two zebra grazing in a green grass field.
No transfer: A white and black cat is sitting on a toilet.
DCC: A white microwave sitting on a brick wall.
DCC can describe over 300 ImageNet visual concepts in diverse contexts.
DCC: A person is holding a gecko in their hand.
Berkeley LRCN: A person holding a piece of food in their hand.
MS CaptionBot: A close up of a person holding a baby.
DCC: A gecko is standing on a branch of a tree.
Berkeley LRCN: A bird is standing on the edge of a rock.
MS CaptionBot: A bird that is standing in the water.
A woman in a chiffon tutu.
A white centrifuge is sitting on the table.
A bunch of a lychee are in a market.
A group of people standing around a baobab in a field.
A brown bobcat in a green field.
A close up of a wooden table with a bottle of whisky.
A close up of a scone on a plate.
A black and white photo of a candelabra in a room.
A woman is riding a unicycle on a unicycle.
A group of people standing around a fox hunting on a field.
Failure Cases
Results: Video Description
                                   METEOR   F1
Baseline (No Transfer)             28.80    0.0
+ DCC (ours)                       28.9     6.0
+ ILSVRC Videos (No Transfer)      29.0     0.0
+ DCC (ours) + ILSVRC Videos       29.10    22.2
Novel Object Captioner (NOC)
“Captioning Images with Diverse Objects”, Venugopalan 2016. http://arxiv.org/abs/1606.07770
DCC Issue: Not End-to-End Trainable
[Diagram: Language Model (Previous Word -> Embed -> LSTM -> f_L, W_L -> Predicted Word), Lexical Classifier (CNN -> Classification Layer -> f_I), and Caption Model (f_I and f_L -> Multimodal Unit with W_I, W_L -> Predicted Word)]
NOC Solution: Joint Objective Loss
Joint Objective Loss: Image-Specific Loss, Image-Text Loss, Text-Specific Loss
[Diagram: networks built from CNN, Embed, and LSTM layers, taking the Previous Word as input and producing a Predicted Word]
DCC Issue: Transfer Mechanism
Example: A man is playing racket on a racket.
NOC Solution: Semantic Embedding
[Diagram: Previous Word -> Embed -> LSTM -> Embed -> Predicted Word, with embedding matrices W_glove and W_glove^T]
Training
Joint Objective Loss: Image-Specific Loss, Image-Text Loss, Text-Specific Loss
[Diagram: networks built from CNN, Embed, and LSTM layers, taking the Previous Word as input and producing a Predicted Word]
F1 Scores for NOC and DCC
       Bottle   Bus     Couch   Microwave   Pizza   Racket   Suitcase   Zebra   Average
DCC    4.63     29.79   45.87   28.09       64.59   52.24    13.16      79.88   39.78
NOC    19.02    69.34   33.25   26.46       69.16   62.45    34.65      89.78   50.51
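As a quick arithmetic check on this table, the Average column can be recomputed from the eight per-object scores. The snippet below uses only the numbers shown on the slide; the variable and function names are illustrative.

```python
# Per-object F1 scores copied from the table.
dcc = {"bottle": 4.63, "bus": 29.79, "couch": 45.87, "microwave": 28.09,
       "pizza": 64.59, "racket": 52.24, "suitcase": 13.16, "zebra": 79.88}
noc = {"bottle": 19.02, "bus": 69.34, "couch": 33.25, "microwave": 26.46,
       "pizza": 69.16, "racket": 62.45, "suitcase": 34.65, "zebra": 89.78}

def mean_f1(scores):
    """Unweighted mean over the held-out objects."""
    return sum(scores.values()) / len(scores)

dcc_average = round(mean_f1(dcc), 2)
noc_average = round(mean_f1(noc), 2)

# Objects where NOC scores lower than DCC.
noc_lower = [name for name in dcc if noc[name] < dcc[name]]

print(dcc_average, noc_average, noc_lower)
```

The recomputed means agree with the reported 39.78 and 50.51, and the comparison shows that NOC improves on six of the eight objects, with couch and microwave as the exceptions.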
Ablation: Auxiliary Objective
Columns: Contributing Factor | Glove, LM Pretrain, Image Pretrain, Auxiliary Objective | Meteor | F1
Pretraining & Glove | X X X | 19.80 | 25.38
Fix Image Model | X X Fixed | 18.91 | 39.70
All | X X X X | 20.69 | 50.51
Ablation: Glove Embedding
Columns: Contributing Factor | Glove, LM Pretrain, Image Pretrain, Auxiliary Objective | Meteor | F1
Auxiliary Objective | X X | 15.78 | 14.41
Glove | X X X | 19.69 | 47.02
All | X X X X | 20.69 | 50.51
Training with Outside Data
Image Data   Text Data   Meteor   F1
MSCOCO       MSCOCO      20.69    50.51
MSCOCO       WebCorpus   19.15    41.74
ImageNet     WebCorpus   17.55    36.50
Describing ImageNet
A otter is sitting on a rock in the sun.
A large flounder is resting on a rock.
A table with a plate of sashimi and vegetables.
A large glacier with a mountain in the background.
A man is standing on a beach holding a snapper.
A group of people standing around a large white warship.
Errors
A chainsaw is sitting on a chainsaw near a chainsaw.
A volcano view of a volcano in the sun.
Our Team:
Lisa Anne Hendricks
Subhashini Venugopalan
Marcus Rohrbach
Raymond Mooney
Kate Saenko
Trevor Darrell
Existing Methods
Compositional Captioner
A anteater is standing in the grass.
LRCN: A black bear is standing in the grass.
CaptionBot: A bear that is eating some grass.
Paired Image-Sentence Data, Unpaired Image Data, Unpaired Text Data