TRANSCRIPT
Recognize, Describe, and Generate:
Introduction of Recent Work at MIL
The University of Tokyo
Yoshitaka Ushiku
Journalist Robot
• Born in 2006
• Objective: publishing news automatically
– Recognize
• Objects, people, actions
– Describe
• What is happening
– Generate
• Content as humans do
Outline
• Journalist Robot: ancestor of current work at MIL
• Current research at MIL originates with this robot
– Recognize
• Basic: Framework for DL, Domain Adaptation
• Classification: Single-modality, Multi-modalities
– Describe
• Image Captioning
• Video Captioning
– Generate
• Image Reconstruction
• Video Generation
Recognize
MILJS: JavaScript × Deep Learning
[Hidaka+, ICLR Workshop 2017]
MILJS: JavaScript × Deep Learning
• Support for both learning and inference
• Support for nodes with GPGPUs
– Currently WebCL is utilized.
– Now working on WebGPU.
• Support for nodes w/o GPGPUs
• No requirement to install any software
– Even ResNet with 152 layers can be trained
[Hidaka+, ICLR Workshop 2017]
WebDNN: Fastest Inference Framework on Web Browser
Optimizes trained models for inference only
• Caffe, Keras, Chainer
• IE, Edge, Safari, Chrome, Firefox on PCs and mobiles
• GPU computation
• Web camera
[Kikura+ 2017]
Asymmetric Tri-training for Domain Adaptation
• Unsupervised domain adaptation
Trained on MNIST → works on SVHN?
– Ground-truth labels are associated with the source (MNIST)
– However, there are no labels for the target (SVHN)
[Saito+, ICML 2017]
Asymmetric Tri-training for Domain Adaptation
• Asymmetric Tri-training: pseudo labels for target domain
[Saito+, ICML 2017]
Asymmetric Tri-training for Domain Adaptation
1st: Training on MNIST → add pseudo labels for easy target samples
2nd~: Training on MNIST + pseudo-labeled samples → add more pseudo labels
[Saito+, ICML 2017]
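The pseudo-labeling step can be sketched in NumPy. This is a toy illustration, not the paper's implementation: the two labelers F1 and F2 are stand-in random linear classifiers, while the rule — keep only target samples where both labelers agree with high confidence — follows the asymmetric tri-training idea.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_proba(W, x):
    """Softmax scores of a toy linear classifier: x (N, D) @ W (D, C)."""
    z = x @ W
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def pseudo_label(W1, W2, target_x, threshold=0.9):
    """Keep target samples where the two labelers agree confidently."""
    p1, p2 = predict_proba(W1, target_x), predict_proba(W2, target_x)
    y1, y2 = p1.argmax(axis=1), p2.argmax(axis=1)
    conf = np.maximum(p1.max(axis=1), p2.max(axis=1))
    keep = (y1 == y2) & (conf > threshold)
    return target_x[keep], y1[keep]

X_t = rng.normal(size=(1000, 64))       # unlabeled target features (toy stand-in)
W1, W2 = rng.normal(size=(2, 64, 10))   # two labeler classifiers, 10 classes
X_pseudo, y_pseudo = pseudo_label(W1, W2, X_t)
# X_pseudo / y_pseudo would be added to the labeled set for the next round
```

In the paper the two labelers share a feature extractor and are kept diverse by a weight penalty; here they are simply two random weight matrices.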
End-to-end learning for environmental sound classification
Existing methods for speech / sound recognition:
① Feature extraction: Fourier transformation (log-mel features)
② Classification: CNN on the extracted feature map
[Tokozume+, ICASSP 2017]
Log-mel features are suitable for human speech; but for environmental sounds…?
End-to-end learning for environmental sound classification
Proposed approach (EnvNet): CNN for both ① feature map extraction and ② classification
[Tokozume+, ICASSP 2017]
(Figure: the feature map extracted by EnvNet from the raw waveform)
End-to-end learning for environmental sound classification
Comparison of accuracy [%] on ESC-50 [Piczak, ACM MM 2015]
[Tokozume+, ICASSP 2017]
log-mel feature + CNN [Piczak, MLSP 2015]: 64.5
End-to-end CNN (Ours): 64.0
End-to-end CNN & log-mel feature + CNN (Ours): 71.0
EnvNet can extract discriminative features for environmental sounds
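A minimal NumPy sketch of the end-to-end idea: learnable 1-D filters applied directly to the raw waveform replace the hand-crafted log-mel front end. The filter count, window size, and strides are illustrative assumptions, not EnvNet's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, kernels, stride=1):
    """Valid 1-D convolution: x (T,), kernels (K, W) -> (K, T_out)."""
    K, W = kernels.shape
    T_out = (len(x) - W) // stride + 1
    windows = np.stack([x[i * stride:i * stride + W] for i in range(T_out)])
    return kernels @ windows.T

# One second of raw audio at 16 kHz -- the network's direct input (step ①)
wave = rng.normal(size=16000)

# Learnable filters take the role of the Fourier / mel filterbank
filters = rng.normal(size=(40, 64))            # 40 filters, 64-sample windows
fmap = np.maximum(conv1d(wave, filters, stride=32), 0.0)   # ReLU feature map

# Temporal max pooling before the classification layers (step ②)
T = fmap.shape[1] - fmap.shape[1] % 4
pooled = fmap[:, :T].reshape(40, -1, 4).max(axis=2)
```

With trained (rather than random) filters, `fmap` plays the role of the learned "feature map" shown on the slide.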
Multispectral Segmentation & Detection
• Robust Segmentation & Detection
– Daytime and nighttime
– Fog
• Use of multispectral images
– RGB
– Far-infrared (FIR)
– Mid-infrared (MIR)
– Near-infrared (NIR)
[Ha+, IROS 2017] [Karasawa+, submitted to ACM MM 2017]
(Figure: example RGB, MIR, NIR, and FIR images)
• How to combine multispectral images?
– Just concatenate them into a multi-channel image?
– Develop some sophisticated method?
• To capture multispectral images:
– Develop a novel single camera
– Use different cameras: low cost, but many problems:
• Camera parameters
• Capture timings
• Ours: Multispectral Fusion Network
– Baseline: SegNet [Badrinarayanan+, PAMI 2017]
– Independent encoders for each camera
Multispectral Segmentation & Detection [Ha+, IROS 2017] [Karasawa+, submitted to ACM MM 2017]
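The fusion structure can be sketched as follows. The 1×1-convolution "encoders" and the channel counts are toy stand-ins for the SegNet-based encoders of the actual network, but the shape of the computation — one independent encoder per camera, features fused before per-pixel classification — mirrors the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(img, W):
    """Toy per-modality encoder: a 1x1 convolution + ReLU.
    img: (H, W_, C_in), W: (C_in, C_feat) -> (H, W_, C_feat)."""
    return np.maximum(img @ W, 0.0)

H, W_ = 32, 32
modalities = {                       # one image per spectral band
    "rgb": rng.normal(size=(H, W_, 3)),
    "nir": rng.normal(size=(H, W_, 1)),
    "mir": rng.normal(size=(H, W_, 1)),
    "fir": rng.normal(size=(H, W_, 1)),
}
# Independent encoder weights for each camera
weights = {k: rng.normal(size=(v.shape[-1], 16)) for k, v in modalities.items()}

feats = [encoder(modalities[k], weights[k]) for k in modalities]
fused = np.concatenate(feats, axis=-1)       # (H, W_, 64) fused feature map

n_classes = 5
W_out = rng.normal(size=(fused.shape[-1], n_classes))
seg = (fused @ W_out).argmax(axis=-1)        # per-pixel class map (H, W_)
```

Concatenating a multi-channel input would instead feed all bands through one shared encoder; separate encoders let each spectrum develop its own low-level features before fusion.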
Visual Question Answering (VQA)
Question answering system for
• Associated image
• Question by natural language
[Saito+, ICME 2017]
Q: Is it going to rain soon?
Ground Truth A: yes
Q: Why is there snow on one side of the stream and clear grass on the other?
Ground Truth A: shade
Visual Question Answering (VQA)
After integrating x_I and x_Q into z_{I+Q}: usual classification
(Figure: image I → image feature x_I; question Q "What objects are found on the bed?" → question feature x_Q; the two are integrated into z_{I+Q} and classified into answer A "bed sheets, pillow")
[Saito+, ICME 2017]
VQA = Multi-class classification
Visual Question Answering (VQA)
Current advancement: improving how to integrate x_I and x_Q
• Concatenation, e.g. [Antol+, ICCV 2015]: z_{I+Q} = [x_I; x_Q]
• Summation, e.g. image feature (with attention) + question feature [Xu+Saenko, ECCV 2016]: z_{I+Q} = x_I + x_Q
• Multiplication, e.g. bilinear multiplication [Fukui+, EMNLP 2016]: z_{I+Q} = x_I ⊙ x_Q
• This work: DualNet combines sum, multiplication, and concatenation: z_{I+Q} = [x_I + x_Q; x_I ⊙ x_Q]
[Saito+, ICME 2017]
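A sketch of the DualNet-style integration: the element-wise sum and product of the two feature vectors are concatenated, and the answer is read off as a multi-class prediction. The feature dimension, answer vocabulary, and random classifier weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def integrate_dualnet(x_i, x_q):
    """Concatenate the element-wise sum and element-wise product
    of the image and question features (DualNet-style fusion)."""
    return np.concatenate([x_i + x_q, x_i * x_q])

x_i = rng.normal(size=512)    # image feature (e.g. from a CNN)
x_q = rng.normal(size=512)    # question feature (e.g. from an RNN)
z = integrate_dualnet(x_i, x_q)          # (1024,) integrated vector

# VQA as multi-class classification over a fixed answer vocabulary
answers = ["yes", "no", "banana", "2", "teddy bear"]
W = rng.normal(size=(len(z), len(answers)))
pred = answers[int(np.argmax(z @ W))]
```

The sum and product capture different interactions between the modalities; concatenating both lets the classifier use whichever is informative for a given question.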
Visual Question Answering (VQA)
VQA Challenge 2016 (in CVPR 2016)
Won 1st place on abstract images without an attention mechanism
[Saito+, ICME 2017]
Q: What fruit is yellow and brown?
A: banana
Q: How many screens are there?
A: 2
Q: What is the boy playing with?
A: teddy bear
Q: Are there any animals swimming in the pond?
A: no
Describe
Automatic Image Captioning
[Ushiku+, ACM MM 2011]
Training Dataset
A woman posing on a red scooter.
White and gray kitten lying on its side.
A white van parked in an empty lot.
A white cat rests head on a stone.
Silver car parked on side of road.
A small gray dog on a leash.
A black dog standing in a grassy area.
A small white dog wearing a flannel warmer.
Input Image
Nearest Captions:
A small white dog wearing a flannel warmer.
A small gray dog on a leash.
A black dog standing in a grassy area.
Output: A small white dog standing on a leash.
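The retrieval step — finding the nearest captions for an input image — can be sketched like this. The random 128-d vectors stand in for real image features, and the final step of the actual method, combining phrases from the retrieved captions into a new sentence, is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Captioned training set; features are toy stand-ins for real image features
captions = [
    "A small white dog wearing a flannel warmer.",
    "A small gray dog on a leash.",
    "A black dog standing in a grassy area.",
    "A white van parked in an empty lot.",
]
train_feats = rng.normal(size=(len(captions), 128))

def nearest_captions(query, feats, caps, k=3):
    """Return the captions of the k training images closest to the query."""
    d = np.linalg.norm(feats - query, axis=1)
    return [caps[i] for i in np.argsort(d)[:k]]

# An input image whose feature is close to the first training image
query = train_feats[0] + 0.1 * rng.normal(size=128)
top3 = nearest_captions(query, train_feats, captions)
print(top3)
```

With real features, the retrieved captions provide the phrase pool from which the output sentence is assembled.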
Automatic Image Captioning [ACM MM 2012, ICCV 2015]
Group of people sitting at a table with a dinner.
Tourists are standing on the middle of a flat desert.
Image Captioning + Sentiment Terms [Shin+, BMVC 2016]
A confused man in a blue shirt is sitting on a bench.
A man in a blue shirt and blue jeans is standing in the overlooked water.
A zebra standing in a field with a tree in the dirty background.
Image Captioning + Sentiment Terms
Two steps for adding a sentiment term
1. Usual image captioning using CNN+RNN
[Shin+, BMVC 2016]
The most probable noun is memorized
Image Captioning + Sentiment Terms
Two steps for adding a sentiment term
1. Usual image captioning using CNN+RNN
2. The model is forced to predict a sentiment term before the noun
[Shin+, BMVC 2016]
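A toy sketch of step 2: once the most probable noun has been memorized, the sentiment term is inserted right before it. The noun and sentiment word here are hard-coded stand-ins for the CNN+RNN predictions of the actual model.

```python
def add_sentiment(caption, noun, sentiment):
    """Insert `sentiment` right before the first occurrence of `noun`."""
    words = caption.split()
    i = words.index(noun)
    return " ".join(words[:i] + [sentiment] + words[i:])

plain = "A man in a blue shirt is sitting on a bench."
print(add_sentiment(plain, "man", "confused"))
# → "A confused man in a blue shirt is sitting on a bench."
```

In the real system the sentiment term is predicted by the language model conditioned on the image, not chosen by hand.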
Beyond Caption to Narrative
A man is holding a box of doughnuts.
Then he and a woman are standing next each other.
Then she is holding a plate of food.
[Shin+, ICIP 2016]
Beyond Caption to Narrative [Shin+, ICIP 2016]
A man is holding a box of doughnuts.
he and a woman are standing next each other.
she is holding a plate of food.
→ Narrative
Beyond Caption to Narrative
A boat is floating on the water near a mountain.
And a man riding a wave on top of a surfboard.
Then he on the surfboard in the water.
[Shin+, ICIP 2016]
Generate
Image Reconstruction [Kato+, CVPR 2014]
Traditional pipeline for image classification:
Extracting local descriptors → Collecting descriptors → Calculating global feature → Classifying images
(Figure: local descriptors d_1, d_2, …, d_N from a camera image are modeled by p(d; θ); the global feature x_j = f(θ_j) is classified as "Cat")
Image Reconstruction [Kato+, CVPR 2014]
Inverse problem: image reconstruction from a label
(Figure: the same pipeline run backwards, from the label "Pot" through the global feature x_j = f(θ_j), p(d; θ), and descriptors d_1, …, d_N back to an image)
Image Reconstruction[Kato+, CVPR 2014]
Reconstruction from the label "Pot"
Optimized arrangement using: global location cost + adjacency cost
Other examples: cat (bombay), camera, grand piano, headphone, joshua tree, pyramid, wheel chair, gramophone
Video Generation
• Image generation is still challenging
Only successful for controlled settings:
– Human faces
– Birds
– Flowers
• Video generation is …
– Additionally requiring temporal consistency
– Extremely challenging
[Yamamoto+, ACMMM 2016]
(Figure: generation examples from [Vondrick+, NIPS 2016], BEGAN [Berthelot+, 2017 Mar.], and StackGAN [Zhang+, 2016 Dec.])
Video Generation
• This work: generating simple videos
– C3D (3D convolutional neural network)
for conditional generation with an input label
– tempCAE (temporal convolutional auto-encoder)
for regularizing video to improve its naturalness
[Yamamoto+, ACMMM 2016]
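The temporal-consistency requirement can be made concrete with a toy measure: the mean squared difference between consecutive frames. This only illustrates the kind of signal a tempCAE-style regularizer encourages; it is not the actual auto-encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

def temporal_smoothness(video):
    """Mean squared difference between consecutive frames (lower = smoother)."""
    diffs = np.diff(video, axis=0)
    return float(np.mean(diffs ** 2))

noise = rng.normal(size=(16, 32, 32))                 # 16 frames of pure noise
drift = np.cumsum(0.01 * rng.normal(size=(16, 32, 32)), axis=0)  # slow drift

# A temporally consistent video scores far lower than frame-wise noise
print(temporal_smoothness(drift), temporal_smoothness(noise))
```

Independently generated frames score high on this measure; the regularization pushes generated videos toward the smooth regime.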
Video Generation [Yamamoto+, ACMMM 2016]
(Figure: videos generated for the labels "Car runs to left" and "Rocket flies up", comparing Ours (C3D+tempCAE) with Only C3D)
Conclusion
• MIL: Machine Intelligence Laboratory
Beyond Human Intelligence Based on Cyber-Physical Systems
• This talk introduces some of the current research
– Recognize
• Basic: Framework for DL, Domain Adaptation
• Classification: Single-modality, Multi-modalities
– Describe
• Image Captioning, Video Captioning
– Generate
• Image Reconstruction, Video Generation