TRANSCRIPT
Recognize, Describe, and Generate:
Introduction of Recent Work at MIL
The University of Tokyo
Yoshitaka Ushiku
Journalist Robot
• Born in 2006
• Objective: publishing news automatically
– Recognize
• Objects, people, actions
– Describe
• What is happening
– Generate
• Content as humans do
Outline
• Journalist Robot: ancestor of current work at MIL
• Current research at MIL originates with this robot
– Recognize
• Basic: Framework for DL, Domain Adaptation
• Classification: Single-modality, Multi-modalities
– Describe
• Image Captioning
• Video Captioning
– Generate
• Image Reconstruction
• Video Generation
Recognize
MILJS: JavaScript × Deep Learning
[Hidaka+, ICLR Workshop 2017]
MILJS: JavaScript × Deep Learning
• Support for both learning and inference
• Support for nodes with GPGPUs
– Currently WebCL is utilized.
– Now working on WebGPU.
• Support for nodes w/o GPGPUs
• No requirement to install any software
– Even ResNet with 152 layers can be trained
[Hidaka+, ICLR Workshop 2017]
WebDNN: Fastest Inference Framework on Web Browser
Optimizes trained models for inference only
• Caffe, Keras, Chainer
• IE, Edge, Safari, Chrome, Firefox on PCs and mobiles
• GPU computation
• Web camera
[Kikura+ 2017]
Asymmetric Tri-training for Domain Adaptation
• Unsupervised domain adaptation
Trained on MNIST → works on SVHN?
– Ground-truth labels are associated with the source (MNIST)
– However, there are no labels for the target (SVHN)
[Saito+, ICML 2017]
Asymmetric Tri-training for Domain Adaptation
• Asymmetric Tri-training: pseudo labels for target domain
[Saito+, ICML 2017]
Asymmetric Tri-training for Domain Adaptation
1st: Training on MNIST → add pseudo labels for easy target samples
2nd~: Training on MNIST + pseudo-labeled samples → add more pseudo labels
[Saito+, ICML 2017]
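The pseudo-labeling step can be sketched in NumPy. This is a toy illustration, not the paper's implementation: the two labelers F1 and F2 are stand-in random linear classifiers, while the rule — keep only target samples where both labelers agree with high confidence — follows the asymmetric tri-training idea.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_proba(W, x):
    """Softmax scores of a toy linear classifier: x (N, D) @ W (D, C)."""
    z = x @ W
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def pseudo_label(W1, W2, target_x, threshold=0.9):
    """Keep target samples where the two labelers agree confidently."""
    p1, p2 = predict_proba(W1, target_x), predict_proba(W2, target_x)
    y1, y2 = p1.argmax(axis=1), p2.argmax(axis=1)
    conf = np.maximum(p1.max(axis=1), p2.max(axis=1))
    keep = (y1 == y2) & (conf > threshold)
    return target_x[keep], y1[keep]

X_t = rng.normal(size=(1000, 64))       # unlabeled target features (toy stand-in)
W1, W2 = rng.normal(size=(2, 64, 10))   # two labeler classifiers, 10 classes
X_pseudo, y_pseudo = pseudo_label(W1, W2, X_t)
# X_pseudo / y_pseudo would be added to the labeled set for the next round
```

In the paper the two labelers share a feature extractor and are kept diverse by a weight penalty; here they are simply two random weight matrices.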
End-to-end learning for environmental sound classification
Existing methods for speech / sound recognition:
① Feature extraction: Fourier transformation (log-mel features)
② Classification: CNN on the extracted feature map
[Tokozume+, ICASSP 2017]
Log-mel features are suitable for human speech; but for environmental sounds…?
End-to-end learning for environmental sound classification
Proposed approach (EnvNet): CNN for both ① feature map extraction and ② classification
[Tokozume+, ICASSP 2017]
(Figure: the feature map extracted by EnvNet from the raw waveform)
End-to-end learning for environmental sound classification
Comparison of accuracy [%] on ESC-50 [Piczak, ACM MM 2015]
[Tokozume+, ICASSP 2017]
log-mel feature + CNN [Piczak, MLSP 2015]: 64.5
End-to-end CNN (Ours): 64.0
End-to-end CNN & log-mel feature + CNN (Ours): 71.0
EnvNet can extract discriminative features for environmental sounds
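A minimal NumPy sketch of the end-to-end idea: learnable 1-D filters applied directly to the raw waveform replace the hand-crafted log-mel front end. The filter count, window size, and strides are illustrative assumptions, not EnvNet's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, kernels, stride=1):
    """Valid 1-D convolution: x (T,), kernels (K, W) -> (K, T_out)."""
    K, W = kernels.shape
    T_out = (len(x) - W) // stride + 1
    windows = np.stack([x[i * stride:i * stride + W] for i in range(T_out)])
    return kernels @ windows.T

# One second of raw audio at 16 kHz -- the network's direct input (step ①)
wave = rng.normal(size=16000)

# Learnable filters take the role of the Fourier / mel filterbank
filters = rng.normal(size=(40, 64))            # 40 filters, 64-sample windows
fmap = np.maximum(conv1d(wave, filters, stride=32), 0.0)   # ReLU feature map

# Temporal max pooling before the classification layers (step ②)
T = fmap.shape[1] - fmap.shape[1] % 4
pooled = fmap[:, :T].reshape(40, -1, 4).max(axis=2)
```

With trained (rather than random) filters, `fmap` plays the role of the learned "feature map" shown on the slide.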
Multispectral Segmentation & Detection
• Robust Segmentation & Detection
– Daytime and nighttime
– Fog
• Use of multispectral images
– RGB
– Far-infrared (FIR)
– Mid-infrared (MIR)
– Near-infrared (NIR)
[Ha+, IROS 2017] [Karasawa+, submitted to ACM MM 2017]
(Figure: example RGB, MIR, NIR, and FIR images)
• How to combine multispectral images?
– Just concatenate them into a multi-channel image?
– Develop some sophisticated method?
• To capture multispectral images:
– Develop a novel single camera
– Use different cameras: low cost, but many problems:
• Camera parameters
• Capture timings
• Ours: Multispectral Fusion Network
– Baseline: SegNet [Badrinarayanan+, PAMI 2017]
– Independent encoders for each camera
Multispectral Segmentation & Detection [Ha+, IROS 2017] [Karasawa+, submitted to ACM MM 2017]
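The fusion structure can be sketched as follows. The 1×1-convolution "encoders" and the channel counts are toy stand-ins for the SegNet-based encoders of the actual network, but the shape of the computation — one independent encoder per camera, features fused before per-pixel classification — mirrors the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(img, W):
    """Toy per-modality encoder: a 1x1 convolution + ReLU.
    img: (H, W_, C_in), W: (C_in, C_feat) -> (H, W_, C_feat)."""
    return np.maximum(img @ W, 0.0)

H, W_ = 32, 32
modalities = {                       # one image per spectral band
    "rgb": rng.normal(size=(H, W_, 3)),
    "nir": rng.normal(size=(H, W_, 1)),
    "mir": rng.normal(size=(H, W_, 1)),
    "fir": rng.normal(size=(H, W_, 1)),
}
# Independent encoder weights for each camera
weights = {k: rng.normal(size=(v.shape[-1], 16)) for k, v in modalities.items()}

feats = [encoder(modalities[k], weights[k]) for k in modalities]
fused = np.concatenate(feats, axis=-1)       # (H, W_, 64) fused feature map

n_classes = 5
W_out = rng.normal(size=(fused.shape[-1], n_classes))
seg = (fused @ W_out).argmax(axis=-1)        # per-pixel class map (H, W_)
```

Concatenating a multi-channel input would instead feed all bands through one shared encoder; separate encoders let each spectrum develop its own low-level features before fusion.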
Visual Question Answering (VQA)
Question answering system for
• Associated image
• Question by natural language
[Saito+, ICME 2017]
Q: Is it going to rain soon?
Ground Truth A: yes
Q: Why is there snow on one side of the stream and clear grass on the other?
Ground Truth A: shade
Visual Question Answering (VQA)
After integrating x_I and x_Q into z_{I+Q}: usual classification
(Figure: image I → image feature x_I; question Q "What objects are found on the bed?" → question feature x_Q; the two are integrated into z_{I+Q} and classified into answer A "bed sheets, pillow")
[Saito+, ICME 2017]
VQA = Multi-class classification
Visual Question Answering (VQA)
Current advancement: improving how to integrate x_I and x_Q
• Concatenation, e.g. [Antol+, ICCV 2015]: z_{I+Q} = [x_I; x_Q]
• Summation, e.g. image feature (with attention) + question feature [Xu+Saenko, ECCV 2016]: z_{I+Q} = x_I + x_Q
• Multiplication, e.g. bilinear multiplication [Fukui+, EMNLP 2016]: z_{I+Q} = x_I ⊙ x_Q
• This work: DualNet combines sum, multiplication, and concatenation: z_{I+Q} = [x_I + x_Q; x_I ⊙ x_Q]
[Saito+, ICME 2017]
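A sketch of the DualNet-style integration: the element-wise sum and product of the two feature vectors are concatenated, and the answer is read off as a multi-class prediction. The feature dimension, answer vocabulary, and random classifier weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def integrate_dualnet(x_i, x_q):
    """Concatenate the element-wise sum and element-wise product
    of the image and question features (DualNet-style fusion)."""
    return np.concatenate([x_i + x_q, x_i * x_q])

x_i = rng.normal(size=512)    # image feature (e.g. from a CNN)
x_q = rng.normal(size=512)    # question feature (e.g. from an RNN)
z = integrate_dualnet(x_i, x_q)          # (1024,) integrated vector

# VQA as multi-class classification over a fixed answer vocabulary
answers = ["yes", "no", "banana", "2", "teddy bear"]
W = rng.normal(size=(len(z), len(answers)))
pred = answers[int(np.argmax(z @ W))]
```

The sum and product capture different interactions between the modalities; concatenating both lets the classifier use whichever is informative for a given question.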
Visual Question Answering (VQA)
VQA Challenge 2016 (in CVPR 2016)
Won 1st place on abstract images without an attention mechanism
[Saito+, ICME 2017]
Q: What fruit is yellow and brown?
A: banana
Q: How many screens are there?
A: 2
Q: What is the boy playing with?
A: teddy bear
Q: Are there any animals swimming in the pond?
A: no
Describe
Automatic Image Captioning
[Ushiku+, ACM MM 2011]
Training Dataset
A woman posing on a red scooter.
White and gray kitten lying on its side.
A white van parked in an empty lot.
A white cat rests head on a stone.
Silver car parked on side of road.
A small gray dog on a leash.
A black dog standing in a grassy area.
A small white dog wearing a flannel warmer.
Input Image
Nearest Captions:
A small white dog wearing a flannel warmer.
A small gray dog on a leash.
A black dog standing in a grassy area.
Output: A small white dog standing on a leash.
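The retrieval step — finding the nearest captions for an input image — can be sketched like this. The random 128-d vectors stand in for real image features, and the final step of the actual method, combining phrases from the retrieved captions into a new sentence, is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Captioned training set; features are toy stand-ins for real image features
captions = [
    "A small white dog wearing a flannel warmer.",
    "A small gray dog on a leash.",
    "A black dog standing in a grassy area.",
    "A white van parked in an empty lot.",
]
train_feats = rng.normal(size=(len(captions), 128))

def nearest_captions(query, feats, caps, k=3):
    """Return the captions of the k training images closest to the query."""
    d = np.linalg.norm(feats - query, axis=1)
    return [caps[i] for i in np.argsort(d)[:k]]

# An input image whose feature is close to the first training image
query = train_feats[0] + 0.1 * rng.normal(size=128)
top3 = nearest_captions(query, train_feats, captions)
print(top3)
```

With real features, the retrieved captions provide the phrase pool from which the output sentence is assembled.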
Automatic Image Captioning [ACM MM 2012, ICCV 2015]
Group of people sitting at a table with a dinner.
Tourists are standing on the middle of a flat desert.
Image Captioning + Sentiment Terms [Shin+, BMVC 2016]
A confused man in a blue shirt is sitting on a bench.
A man in a blue shirt and blue jeans is standing in the overlooked water.
A zebra standing in a field with a tree in the dirty background.
Image Captioning + Sentiment Terms
Two steps for adding a sentiment term
1. Usual image captioning using CNN+RNN
[Shin+, BMVC 2016]
The most probable noun is memorized
Image Captioning + Sentiment Terms
Two steps for adding a sentiment term
1. Usual image captioning using CNN+RNN
2. The model is forced to predict a sentiment term before the noun
[Shin+, BMVC 2016]
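A toy sketch of step 2: once the most probable noun has been memorized, the sentiment term is inserted right before it. The noun and sentiment word here are hard-coded stand-ins for the CNN+RNN predictions of the actual model.

```python
def add_sentiment(caption, noun, sentiment):
    """Insert `sentiment` right before the first occurrence of `noun`."""
    words = caption.split()
    i = words.index(noun)
    return " ".join(words[:i] + [sentiment] + words[i:])

plain = "A man in a blue shirt is sitting on a bench."
print(add_sentiment(plain, "man", "confused"))
# → "A confused man in a blue shirt is sitting on a bench."
```

In the real system the sentiment term is predicted by the language model conditioned on the image, not chosen by hand.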
Beyond Caption to Narrative
A man is holding a box of doughnuts.
Then he and a woman are standing next each other.
Then she is holding a plate of food.
[Shin+, ICIP 2016]
Beyond Caption to Narrative [Shin+, ICIP 2016]
A man is holding a box of doughnuts.
he and a woman are standing next each other.
she is holding a plate of food.
→ Narrative
Beyond Caption to Narrative
A boat is floating on the water near a mountain.
And a man riding a wave on top of a surfboard.
Then he on the surfboard in the water.
[Shin+, ICIP 2016]
Generate
Image Reconstruction [Kato+, CVPR 2014]
Traditional pipeline for image classification:
Extracting local descriptors → Collecting descriptors → Calculating global feature → Classifying images
(Figure: local descriptors d_1, d_2, …, d_N from a camera image are modeled by p(d; θ); the global feature x_j = f(θ_j) is classified as "Cat")
Image Reconstruction [Kato+, CVPR 2014]
Inverse problem: image reconstruction from a label
(Figure: the same pipeline run backwards, from the label "Pot" through the global feature x_j = f(θ_j), p(d; θ), and descriptors d_1, …, d_N back to an image)
Image Reconstruction[Kato+, CVPR 2014]
Reconstruction from the label "Pot"
Optimized arrangement using: global location cost + adjacency cost
Other examples: cat (bombay), camera, grand piano, headphone, joshua tree, pyramid, wheel chair, gramophone
Video Generation
• Image generation is still challenging
Only successful for controlled settings:
– Human faces
– Birds
– Flowers
• Video generation is …
– Additionally requiring temporal consistency
– Extremely challenging
[Yamamoto+, ACMMM 2016]
(Figure: generation examples from [Vondrick+, NIPS 2016], BEGAN [Berthelot+, 2017 Mar.], and StackGAN [Zhang+, 2016 Dec.])
Video Generation
• This work: generating simple videos
– C3D (3D convolutional neural network)
for conditional generation with an input label
– tempCAE (temporal convolutional auto-encoder)
for regularizing video to improve its naturalness
[Yamamoto+, ACMMM 2016]
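The temporal-consistency requirement can be made concrete with a toy measure: the mean squared difference between consecutive frames. This only illustrates the kind of signal a tempCAE-style regularizer encourages; it is not the actual auto-encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

def temporal_smoothness(video):
    """Mean squared difference between consecutive frames (lower = smoother)."""
    diffs = np.diff(video, axis=0)
    return float(np.mean(diffs ** 2))

noise = rng.normal(size=(16, 32, 32))                 # 16 frames of pure noise
drift = np.cumsum(0.01 * rng.normal(size=(16, 32, 32)), axis=0)  # slow drift

# A temporally consistent video scores far lower than frame-wise noise
print(temporal_smoothness(drift), temporal_smoothness(noise))
```

Independently generated frames score high on this measure; the regularization pushes generated videos toward the smooth regime.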
Video Generation [Yamamoto+, ACMMM 2016]
(Figure: videos generated for the labels "Car runs to left" and "Rocket flies up", comparing Ours (C3D+tempCAE) with Only C3D)
Conclusion
• MIL: Machine Intelligence Laboratory
Beyond Human Intelligence Based on Cyber-Physical Systems
• This talk introduces some of the current research
– Recognize
• Basic: Framework for DL, Domain Adaptation
• Classification: Single-modality, Multi-modalities
– Describe
• Image Captioning, Video Captioning
– Generate
• Image Reconstruction, Video Generation