TRANSCRIPT
Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning
The University of Tokyo
Yoshitaka Ushiku
losnuevetoros
Documents = Vision + Language
Vision & Language:
an emerging topic
• Integration of CV, NLP, and ML techniques
• Several background factors:
– Impact of Deep Learning
• Image recognition (CV)
• Machine translation (NLP)
– Growth of user-generated content
– Exploratory research on Vision and Language
2012: Impact of Deep Learning
Academic AI startup A famous company
Many slides refer to the first use of CNN (AlexNet) on ImageNet
Large gap of error rates
on ImageNet
1st team: 15.3%
2nd team: 26.2%
2012: Impact of Deep Learning
According to the official site…
1st team w/ DL
Error rate: 15%
2nd team w/o DL
Error rate: 26%
[http://image-net.org/challenges/LSVRC/2012/results.html]
It’s me!!
2014: Another impact of Deep Learning
• Deep learning appears in machine translation [Sutskever+, NIPS 2014]
– LSTM [Hochreiter+Schmidhuber, 1997] solves the vanishing gradient problem in RNNs
→ Deals with relations between distant words in a sentence
– A four-layer LSTM is trained in an end-to-end manner
→ Comparable to the state of the art (English to French)
• Emergence of common techs such as CNN/RNN
Reduction of barriers to get into CV+NLP
Growth of user-generated content
Especially in content posting/sharing services
• Facebook: 300 million photos per day
• YouTube: 400 hours of video per minute
Pōhutukawa blooms this time of the year in New Zealand. As the flowers fall, the ground underneath the trees looks spectacular.
Pairs of a sentence + a video/photo → collectable in large quantities
Exploratory research on Vision and Language
Captioning an image associated with its article [Feng+Lapata, ACL 2010]
• Input: article + image / Output: caption for the image
• Dataset: Sets of article + image + caption
× 3361
King Tupou IV died at the age of 88 last week.
As a result of these backgrounds:
Various research topics such as …
Image Captioning
Group of people sitting at a table with a dinner.
Tourists are standing on the middle of a flat desert.
[Ushiku+, ICCV 2015]
Video Captioning
A man is holding a box of doughnuts.
Then he and a woman are standing next each other.
Then she is holding a plate of food.
[Shin+, ICIP 2016]
Multilingual + Image Caption Translation
Ein Masten mit zwei Ampeln
für Autofahrer. (German)
A pole with two lights
for drivers. (English)
[Hitschler+, ACL 2016]
Visual Question Answering[Fukui+, EMNLP 2016]
Image Generation from Captions
This bird is blue with white
and has a very short beak.
This flower is white and
yellow in color, with petals
that are wavy and smooth.
[Zhang+, 2016]
Goal of this keynote
Looking over research on vision & language
• Historical flow of each area
• Changes by Deep Learning
× Deep Learning enabled these researches
✓ Deep Learning boosted these researches
1. Image Captioning
2. Video Captioning
3. Multilingual + Image Caption Translation
4. Visual Question Answering
5. Image Generation from Captions
Frontiers of Vision and Language 1
Image Captioning
Every picture tells a story
Dataset: Images + <object, action, scene> + Captions
1. Predict <object, action, scene> for an input image using an MRF
2. Retrieve an existing caption associated with a similar <object, action, scene>
<Horse, Ride, Field>
[Farhadi+, ECCV 2010]
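The retrieval step above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the similarity is a simple slot-match count, and the triples and captions below are made up for the example.

```python
# Sketch of triple-based caption retrieval as in [Farhadi+, ECCV 2010]:
# given a predicted <object, action, scene> triple, return the dataset
# caption whose associated triple matches best.

def triple_similarity(t1, t2):
    """Count how many of the three slots (object, action, scene) agree."""
    return sum(a == b for a, b in zip(t1, t2))

def retrieve_caption(predicted, dataset):
    """dataset: list of (triple, caption) pairs; return the best caption."""
    _, best_caption = max(
        dataset, key=lambda pair: triple_similarity(predicted, pair[0]))
    return best_caption

# Illustrative dataset (not from the paper)
dataset = [
    (("horse", "ride", "field"), "A person rides a horse in a field."),
    (("dog", "sleep", "ground"), "A dog sleeps on the ground."),
    (("train", "move", "track"), "A train moves along the track."),
]
print(retrieve_caption(("horse", "ride", "field"), dataset))
# → A person rides a horse in a field.
```

The real system predicts the triple with an MRF and uses a learned similarity; the retrieval loop itself is as simple as shown.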
Every picture tells a story
<pet, sleep, ground>
See something unexpected.
<transportation, move, track>
A man stands next to a train
on a cloudy day.
[Farhadi+, ECCV 2010]
Retrieve? Generate?
• Retrieve
– A small gray dog on a leash.
• Generate
– Template-based: e.g. a Subject+Verb sentence; dog + stand ⇒ A dog stands.
– Template-free: A small white dog standing on a leash.
Input image / Dataset captions:
A small gray dog on a leash.
A black dog standing in grassy area.
A small white dog wearing a flannel warmer.
Captioning with multi-keyphrases [Ushiku+, ACM MM 2012]
Benefits of Deep Learning
• Refinement of image recognition [Krizhevsky+, NIPS 2012]
• Deep learning appears in machine translation [Sutskever+, NIPS 2014]
– LSTM [Hochreiter+Schmidhuber, 1997] solves the vanishing gradient problem in RNNs
→ Deals with relations between distant words in a sentence
– A four-layer LSTM is trained in an end-to-end manner
→ Comparable to the state of the art (English to French)
Emergence of common techs such as CNN/RNN
Reduction of barriers to get into CV+NLP
Google NIC
Concatenation of Google’s methods
• GoogLeNet [Szegedy+, CVPR 2015]
• MT with LSTM [Sutskever+, NIPS 2014]
Caption (word sequence) S_0 … S_N for image I
S_0: beginning of the sentence
S_1 = LSTM(CNN(I))
S_t = LSTM(S_{t-1}), t = 2 … N − 1
S_N: end of the sentence
[Vinyals+, CVPR 2015]
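The recurrence above can be sketched as a toy decoding loop. This is a stand-in, not the released NIC code: all weights are random, so the emitted word ids are meaningless; the point is only how the CNN feature seeds the LSTM and decoding stops at the end-of-sentence token (NIC itself uses beam search rather than this greedy argmax).

```python
# Toy sketch of NIC-style decoding [Vinyals+, CVPR 2015]:
# S_1 = LSTM(CNN(I)), then S_t = LSTM(S_{t-1}) until S_N (end token).
import numpy as np

rng = np.random.default_rng(0)
V, D, H = 6, 8, 8            # vocab size, embedding dim, hidden dim
BOS, EOS = 0, 1              # special tokens S_0 and S_N

# Random stand-ins for trained parameters
W_embed = rng.normal(size=(V, D))
W_x = rng.normal(size=(4 * H, D)) * 0.1   # input-to-gates weights
W_h = rng.normal(size=(4 * H, H)) * 0.1   # hidden-to-gates weights
W_out = rng.normal(size=(V, H)) * 0.1     # hidden-to-vocab weights
cnn_feat = rng.normal(size=D)             # pretend CNN(I) output

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c):
    gates = W_x @ x + W_h @ h
    i, f, o, g = np.split(gates, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

def greedy_caption(max_len=10):
    h, c = np.zeros(H), np.zeros(H)
    h, c = lstm_step(cnn_feat, h, c)       # feed the image feature first
    word, caption = BOS, []
    for _ in range(max_len):
        h, c = lstm_step(W_embed[word], h, c)
        word = int(np.argmax(W_out @ h))   # greedy choice (NIC uses beam search)
        if word == EOS:
            break
        caption.append(word)
    return caption

print(greedy_caption())  # a short list of word ids
```

In a trained model the word ids would index a real vocabulary and the loop would track several beam hypotheses instead of one.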
Examples of generated captions
[https://github.com/tensorflow/models/tree/master/im2txt]
[Vinyals+, CVPR 2015]
Comparison to [Ushiku+, ACM MM 2012]
Input image
[Ushiku+, ACM MM 2012]:
Conventional object recognition
Fisher Vector + Linear classifier
Neural image captioning:
Conventional object recognition
Convolutional Neural Network
Neural image captioning
Conventional machine translation
Recurrent Neural Network + beam search
[Ushiku+, ACM MM 2012]:
Conventional machine translation
Log Linear Model + beam search
Estimation of important words → Connect the words with a grammar model
• Trained using only images and captions
• Approaches are similar to each other
Current development: Accuracy
• Attention-based captioning [Xu+, ICML 2015]
– Focus on some areas for predicting each word!
– Both attention and caption models are trained
using pairs of an image & caption
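The "focus on some areas" idea can be made concrete with a tiny soft-attention function. This is a minimal sketch in the spirit of [Xu+, ICML 2015], with made-up shapes and a random score matrix; the real model learns the scoring parameters jointly with the caption decoder.

```python
# Minimal soft attention over image regions: scores from the decoder
# state are softmax-normalized into weights over regions, and the
# attended context is the weighted sum of region features.
import numpy as np

def soft_attention(regions, state, W):
    """regions: (R, D) region features; state: (H,) decoder state;
    W: (D, H) score matrix (a stand-in for the learned scorer)."""
    scores = regions @ (W @ state)          # (R,) one score per region
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax over regions
    context = weights @ regions             # (D,) attended feature
    return context, weights

rng = np.random.default_rng(1)
regions = rng.normal(size=(5, 4))           # 5 image regions, 4-dim each
state = rng.normal(size=3)
W = rng.normal(size=(4, 3))
context, weights = soft_attention(regions, state, W)
print(round(weights.sum(), 6))  # → 1.0 (a distribution over regions)
```

At each decoding step the weights shift, which is what lets the model "look at" different areas while predicting each word.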
Current development: Problem setting
Dense captioning
[Lin+, BMVC 2015] [Johnson+, CVPR 2016]
Current development: Problem setting
Generating captions for a photo sequence [Park+Kim, NIPS 2015] [Huang+, NAACL 2016]
The family got together for a cookout. They had a lot of delicious food. The dog was happy to be there. They had a great time on the beach. They even had a swim in the water.
Current development: Problem setting
Captioning using sentiment terms
[Mathews+, AAAI 2016][Shin+, BMVC 2016]
Neutral caption
Positive caption
Frontiers of Vision and Language 2
Video Captioning
Before Deep Learning
• Grounding of language and objects in videos [Yu+Siskind, ACL 2013]
– Learning from only videos and their captions
– Experiments on a controlled, small dataset with few objects
• Deep Learning is well suited to this problem:
– Image Captioning: single image → word sequence
– Video Captioning: image sequence → word sequence
End-to-end learning by Deep Learning
• LRCN[Donahue+, CVPR 2015]
– CNN+RNN for
• Action recognition
• Image / Video
Captioning
• Video to Text [Venugopalan+, ICCV 2015]
– CNNs to recognize
• Objects from RGB frames
• Actions from flow images
– RNN for captioning
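The image-to-video generalization can be shown in a few lines. This sketch uses random stand-ins for the two CNN streams and the simplest possible aggregation, mean pooling; S2VT [Venugopalan+, ICCV 2015] instead encodes the frame sequence with an RNN, but the pooled variant makes the pipeline shape clear.

```python
# Sketch of a two-stream video feature for captioning: per-frame CNN
# features from RGB frames (objects) and flow images (actions) are
# mean-pooled and concatenated into one vector for the RNN decoder.
import numpy as np

rng = np.random.default_rng(2)
frames = rng.normal(size=(16, 512))   # 16 frames, 512-dim CNN feature each
flow = rng.normal(size=(16, 512))     # flow-image features (action stream)

video_feat = np.concatenate([frames.mean(axis=0), flow.mean(axis=0)])
print(video_feat.shape)  # → (1024,) one vector seeding the decoder
```

Once the video is reduced to (or encoded as) a feature, the decoding side is the same word-by-word RNN as in image captioning.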
Video Captioning
A man is holding a box of doughnuts.
Then he and a woman are standing next each other.
Then she is holding a plate of food.
[Shin+, ICIP 2016]
Video Captioning
A boat is floating on the water near a mountain.
And a man riding a wave on top of a surfboard.
Then he on the surfboard in the water.
[Shin+, ICIP 2016]
Video Retrieval from Caption
• Input: Captions
• Output: A video related to the caption
10 sec video clip from 40 min database!
• Video captioning is also addressed
A woman in blue is
playing ping pong in a
room.
A guy is skiing with no
shirt on and yellow
snow pants.
A man is water skiing
while attached to a
long rope.
[Yamaguchi+, ICCV 2017]
Frontiers of Vision and Language 3
Multilingual +
Image Caption Translation
Towards multiple languages
Datasets with multilingual captions
• IAPR TC12 [Grubinger+, 2006]: English + German
• Multi30K [Elliott+, 2016]: English + German
• STAIR Captions [Yoshikawa+, 2017]: English + Japanese
Development of cross-lingual tasks
• Non-English caption generation
• Image Caption Translation
Input: a pair of a caption in Language A + an image, or a caption in Language A alone
Output: Caption in Language B
Non-English caption generation
Most research generates English captions; other languages are emerging:
• Japanese [Miyazaki+Shimizu, ACL 2016]
• Chinese [Li+, ICMR 2016]
• Turkish [Unal+, SIU 2016]
Çimlerde koşan bir köpek (A dog running on the grass)
金色头发的小女孩 (A little girl with golden hair)
柵の中にキリンが一頭立っています (A giraffe is standing inside the fence)
Just collecting non-English captions?
Transfer learning among languages [Miyazaki+Shimizu, ACL 2016]
• The vision-language grounding W_im is transferred
• Efficient learning using a small amount of captions
e.g. partial decodings: "an elephant is …" (English) / 「一匹の象が土の…」 (an elephant … in the dirt) (Japanese)
Image Caption Translation
Machine translation via visual data
Images can boost MT [Calixto+, 2012]
• Example below (English to Portuguese):
Does the word “seal” in English
– mean “seal” similar to “stamp”?
– mean “seal” which is a sea animal?
• [Calixto+, 2012] argue that this mistranslation can be avoided using a related image (a claim without experiments)
Mistranslation!
Input: Caption in Language A + image
• Caption translation via an associated image [Elliott+, 2015] [Hitschler+, ACL 2016]
– Generate translation candidates
– Re-rank the candidates using similar images’
captions in Language B
Eine Person in
einem Anzug
und Krawatte
und einem Rock.
(In German)
Translation w/o the related image
A person in a suit and tie
and a rock.
Translation with the related image
A person in a suit and tie
and a skirt.
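The generate-then-re-rank step above can be sketched as follows. This is only an illustration of the idea in [Hitschler+, ACL 2016]: the word-overlap scorer stands in for the paper's actual retrieval-based scoring, and the candidate/retrieved strings below are modeled on the slide's suit-and-skirt example.

```python
# Sketch of image-assisted MT re-ranking: candidate translations that
# share more words with captions of visually similar images win.
import re

def words(s):
    """Lowercase word set, ignoring punctuation."""
    return set(re.findall(r"[a-z]+", s.lower()))

def overlap(candidate, references):
    """Score a candidate by its best word overlap with retrieved captions."""
    return max(len(words(candidate) & words(r)) for r in references)

def rerank(candidates, retrieved_captions):
    return max(candidates, key=lambda c: overlap(c, retrieved_captions))

candidates = [
    "A person in a suit and tie and a rock.",   # mistranslation of German "Rock"
    "A person in a suit and tie and a skirt.",
]
retrieved = ["A woman wearing a skirt and a jacket."]  # caption of a similar image
print(rerank(candidates, retrieved))
# → A person in a suit and tie and a skirt.
```

The visual retrieval supplies target-language evidence ("skirt") that the text-only MT system lacked, which is exactly how the related image resolves the "Rock" ambiguity.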
Input: Caption in Language A
• Cross-lingual document retrieval via images [Funaki+Nakayama, EMNLP 2015]
• Zero-shot machine translation [Nakayama+Nishida, 2017]
Frontiers of Vision and Language 4
Visual Question Answering
Visual Question Answering (VQA)
Proposed in Human-Computer Interfaces
• VizWiz [Bigham+, UIST 2010]
Manually solved on AMT
• Automation for the first time (w/o Deep Learning) [Malinowski+Fritz, NIPS 2014]
• Similar term: Visual Turing Test [Malinowski+Fritz, 2014]
VQA: Visual Question Answering
• Established VQA as an AI problem
– Provided a benchmark dataset
– Experimental results with reasonable baselines
• Portal web site is also organized
– http://www.visualqa.org/
– Annual competition for VQA accuracy
[Antol+, ICCV 2015]
What color are her eyes? / What is the mustache made of?
VQA Dataset
Collected questions and answers on AMT
• Over 100K real images and 30K abstract images
• About 700K questions, with 10 answers for each
VQA = Multiclass Classification
The integrated feature z_{I+Q} is fed to a standard classifier
Question Q: What objects are found on the bed?
Answer A: bed sheets, pillow
Image I → image feature x_I
Question Q → question feature x_Q
Both → integrated feature z_{I+Q}
Development of VQA
How to calculate the integrated feature z_{I+Q}?
• Concatenation: VQA [Antol+, ICCV 2015] simply concatenates them: z_{I+Q} = [x_I; x_Q]
• Summation: e.g. summation of an attention-weighted image feature and a question feature [Xu+Saenko, ECCV 2016]: z_{I+Q} = x_I + x_Q
• Multiplication: e.g. bilinear multiplication using DFT [Fukui+, EMNLP 2016]: z_{I+Q} = x_I ∘ x_Q
• Hybrid of summation and multiplication: e.g. concatenation of the sum and the product [Saito+, ICME 2017]: z_{I+Q} = [x_I + x_Q; x_I ∘ x_Q]
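The four fusion operators can be computed on toy vectors in a few lines. Note one simplification: MCB [Fukui+, EMNLP 2016] compresses the full bilinear (outer) product via count sketches and FFT; the plain element-wise product below is only a stand-in for that multiplicative family.

```python
# The four z_{I+Q} fusion operators on toy image/question features.
import numpy as np

x_I = np.array([1.0, 2.0, 3.0])   # toy image feature
x_Q = np.array([0.5, 0.5, 1.0])   # toy question feature

z_concat = np.concatenate([x_I, x_Q])      # concatenation [Antol+, ICCV 2015]
z_sum = x_I + x_Q                          # summation-style fusion
z_mul = x_I * x_Q                          # multiplicative fusion (MCB stand-in)
z_hybrid = np.concatenate([z_sum, z_mul])  # sum + product [Saito+, ICME 2017]

print(z_concat.shape)  # → (6,)
print(z_sum)           # → [1.5 2.5 4. ]
print(z_mul)           # → [0.5 1.  3. ]
```

Whichever operator is used, z_{I+Q} is then fed to an ordinary multiclass classifier over answers.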
VQA Challenge
Examples from competition results
Q: What is the woman holding? GT A: laptop / Machine A: laptop
Q: Is it going to rain soon? GT A: yes / Machine A: yes
VQA Challenge
Examples from competition results
Q: Why is there snow on one side of the stream and clear grass on the other? GT A: shade / Machine A: yes
Q: Is the hydrant painted a new color? GT A: yes / Machine A: no
Frontiers of Vision and Language 5
Image Generation from Captions
Image generation from input caption
Photo-realistic image generation itself is difficult
• [Mansimov+, ICLR 2016]: Incrementally draw using LSTM
• N.B. Photo synthesis is well studied [Hays+Efros, 2007]
Generative Adversarial Networks (GAN) [Goodfellow+, NIPS 2014]
• Unconditional generative model
• Adversarial learning of Generator and Discriminator
• GAN using convolution … DCGAN [Radford+, ICLR 2016]
Before Conditional Generative Models
Generator
Random vector → Image
Discriminator
Discriminates real or fake
is a fake
image from Generator!
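The adversarial game above can be made numeric with the two losses from [Goodfellow+, NIPS 2014]. The "discriminator outputs" below are fixed numbers, not a trained network; the sketch only shows that the Discriminator wants D(real) → 1 and D(fake) → 0 while the Generator pushes D(fake) → 1, and that at equilibrium D outputs 0.5 everywhere.

```python
# Numeric sketch of the GAN objective: binary cross-entropy losses for
# the Discriminator and the (non-saturating) Generator.
import numpy as np

def discriminator_loss(d_real, d_fake):
    """D wants real → 1 and fake → 0."""
    return -(np.log(d_real) + np.log(1.0 - d_fake))

def generator_loss(d_fake):
    """G wants D to call its fakes real (non-saturating form)."""
    return -np.log(d_fake)

# Early in training: D confidently separates real from fake
print(discriminator_loss(0.9, 0.1))  # ≈ 0.21, small D loss
print(generator_loss(0.1))           # ≈ 2.30, large G loss
# At equilibrium D(x) = 0.5 everywhere: D can no longer tell
print(discriminator_loss(0.5, 0.5))  # = 2 log 2 ≈ 1.386
```

Training alternates gradient steps on these two losses; DCGAN [Radford+, ICLR 2016] keeps the same objective and swaps in convolutional networks for G and D.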
As training progresses, the Discriminator can no longer tell: "is a … hmm"
Add a Caption to Generator and Discriminator
Conditional Generative Models
Generator: tries to generate an image that is photo-realistic and related to the caption
Discriminator: tries to detect an image that is fake or unrelated to the caption
[Reed+, ICML 2016]
Examples of generated images
• Birds (CUB) / Flowers (Oxford-102)
– About 10K images & 5 captions for each image
– 200 kinds of birds / 102 kinds of flowers
A tiny bird, with a tiny beak,
tarsus and feet, a blue crown,
blue coverts, and black
cheek patch
Bright droopy yellow petals
with burgundy streaks, and a
yellow stigma
[Reed+, ICML 2016]
Towards more realistic image generation
StackGAN [Zhang+, 2016]
Two-step GANs
• First GAN generates small and fuzzy image
• Second GAN enlarges and refines it
Examples of generated images
This bird is blue with white
and has a very short beak.
This flower is white and
yellow in color, with petals
that are wavy and smooth.
[Zhang+, 2016]
N.B. These results use datasets specialized in birds / flowers
→ Further breakthroughs are necessary to generate general images
Take-home Messages
• Looked over research on vision and language
1. Image Captioning
2. Video Captioning
3. Multilingual + Image Caption Translation
4. Visual Question Answering
5. Image Generation from Captions
• Contributions of Deep Learning
– Most research themes existed before Deep Learning
– Commodity techniques for processing images, videos, and natural language
– Evolution of recognition and generation
Towards a new stage among vision and language!