Multimodal Deep Learning
Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee & Andrew Ng
Stanford University



McGurk Effect


Audio-Visual Speech Recognition


Feature Challenge

Classifier (e.g. SVM)


Representing Lips

• Can we learn better representations for audio/visual speech recognition?

• How can multimodal data (multiple sources of input) be used to find better features?


Unsupervised Feature Learning

[Figure: example learned feature vectors]



Multimodal Features

[Figure: example multimodal feature vector]


Cross-Modality Feature Learning

[Figure: example feature vector]


Feature Learning Models


Feature Learning with Autoencoders

[Diagram: Audio Input → Audio Reconstruction; Video Input → Video Reconstruction]


Bimodal Autoencoder

[Diagram: Audio Input + Video Input → Hidden Representation → Audio Reconstruction + Video Reconstruction]



Shallow Learning

[Figure labels: Hidden Units; Video Input; Audio Input]

• Mostly unimodal features learned



Bimodal Autoencoder

[Diagram: Video Input → Hidden Representation → Audio Reconstruction + Video Reconstruction]

Cross-modality Learning: Learn better video features by using audio as a cue


Cross-modality Deep Autoencoder

[Diagram: Video Input → Learned Representation → Audio Reconstruction + Video Reconstruction]


Cross-modality Deep Autoencoder

[Diagram: Audio Input → Learned Representation → Audio Reconstruction + Video Reconstruction]


Bimodal Deep Autoencoders

[Diagram: Audio Input (“Phonemes”) + Video Input (“Visemes”, mouth shapes) → Shared Representation → Audio Reconstruction + Video Reconstruction]


Bimodal Deep Autoencoders

[Diagram: Video Input (“Visemes”, mouth shapes) → Audio Reconstruction + Video Reconstruction]


Bimodal Deep Autoencoders

[Diagram: Audio Input (“Phonemes”) → Audio Reconstruction + Video Reconstruction]



Training Bimodal Deep Autoencoder

[Diagram 1: Audio Input → Shared Representation → Audio Reconstruction + Video Reconstruction]

[Diagram 2: Video Input → Shared Representation → Audio Reconstruction + Video Reconstruction]

[Diagram 3: Audio Input + Video Input → Shared Representation → Audio Reconstruction + Video Reconstruction]

• Train a single model to perform all 3 tasks

• Similar in spirit to denoising autoencoders (Vincent et al., 2008)
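
The three training settings above (audio-only input, video-only input, and both inputs, each reconstructing both modalities) can be sketched as a data-augmentation step. This is a minimal sketch, not the authors' code: it assumes an absent modality is fed to the network as an all-zero vector, and the helper name `augment_with_missing_modalities` and the toy vectors are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
# One training example: (audio_input, video_input, audio_target, video_target)
Example = Tuple[Vector, Vector, Vector, Vector]


def augment_with_missing_modalities(pairs: List[Tuple[Vector, Vector]]) -> List[Example]:
    """Expand each (audio, video) pair into three training examples.

    Assumption: an absent modality is represented by an all-zero input.
    The reconstruction targets are always the original audio and video.
    """
    examples: List[Example] = []
    for audio, video in pairs:
        zero_audio = [0.0] * len(audio)
        zero_video = [0.0] * len(video)
        examples.append((audio, zero_video, audio, video))  # audio-only input
        examples.append((zero_audio, video, audio, video))  # video-only input
        examples.append((audio, video, audio, video))       # both inputs
    return examples


# Toy data (made up for illustration): one audio/video pair.
pairs = [([0.2, 0.7], [0.1, 0.4, 0.9])]
training_set = augment_with_missing_modalities(pairs)
```

A single model trained on `training_set` sees all three input settings while always being asked to reconstruct both modalities, which is where the resemblance to a denoising autoencoder comes from.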


Evaluations


Visualizations of Learned Features

[Figure: Audio (spectrogram) and video features learned over 100 ms windows, shown at 0 ms, 33 ms, 67 ms, and 100 ms]


Lip-reading with AVLetters

• AVLetters:
  – 26-way Letter Classification
  – 10 Speakers
  – 60x80 pixels lip regions

• Cross-modality learning

[Diagram: Video Input → Learned Representation → Audio Reconstruction + Video Reconstruction]

Feature Learning    Supervised Learning    Testing
Audio + Video       Video                  Video


Lip-reading with AVLetters

Feature Representation                                  Classification Accuracy
Multiscale Spatial Analysis (Matthews et al., 2002)     44.6%
Local Binary Pattern (Zhao & Barnard, 2009)             58.5%
Video-Only Learning (Single Modality Learning)          54.2%
Our Features (Cross Modality Learning)                  64.4%


Lip-reading with CUAVE

• CUAVE:
  – 10-way Digit Classification
  – 36 Speakers

• Cross Modality Learning

[Diagram: Video Input → Learned Representation → Audio Reconstruction + Video Reconstruction]

Feature Learning    Supervised Learning    Testing
Audio + Video       Video                  Video


Lip-reading with CUAVE

Feature Representation                              Classification Accuracy
Baseline Preprocessed Video                         58.5%
Video-Only Learning (Single Modality Learning)      65.4%
Our Features (Cross Modality Learning)              68.7%
Discrete Cosine Transform (Gurban & Thiran, 2009)   64.0%
Visemic AAM (Papandreou et al., 2009)               83.0%


Multimodal Recognition

• CUAVE:
  – 10-way Digit Classification
  – 36 Speakers

• Evaluate in clean and noisy audio scenarios
  – In the clean audio scenario, audio performs extremely well alone

Feature Learning    Supervised Learning    Testing
Audio + Video       Audio + Video          Audio + Video

[Diagram: Audio Input + Video Input → Shared Representation → Audio Reconstruction + Video Reconstruction]


Multimodal Recognition

Feature Representation                              Classification Accuracy (Noisy Audio at 0 dB SNR)
Audio Features (RBM)                                75.8%
Our Best Video Features                             68.7%
Bimodal Deep Autoencoder                            77.3%
Bimodal Deep Autoencoder + Audio Features (RBM)     82.2%
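
For readers unfamiliar with the "0 dB SNR" condition: with the usual definition SNR_dB = 10 * log10(P_signal / P_noise), 0 dB means the added noise carries the same average power as the clean audio. The slides do not say how the noisy audio was produced, so the following is only a generic sketch of mixing noise at a target SNR; the function names and the toy waveform are hypothetical.

```python
import math
from typing import List


def mean_power(signal: List[float]) -> float:
    """Average power of a sampled signal (mean of squared samples)."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(clean: List[float], noise: List[float], snr_db: float) -> List[float]:
    """Scale `noise` so that 10*log10(P_clean / P_noise) equals snr_db, then add it."""
    target_noise_power = mean_power(clean) / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / mean_power(noise))
    return [c + scale * n for c, n in zip(clean, noise)]


# Toy samples (made up for illustration), mixed at 0 dB SNR.
clean = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
noisy = mix_at_snr(clean, noise, snr_db=0.0)
added_noise = [y - c for y, c in zip(noisy, clean)]
```

At `snr_db=0.0` the power of `added_noise` matches the power of `clean`, which is why audio-only recognition degrades in this setting.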


Shared Representation Evaluation

[Diagram: Audio and Video each map to the Shared Representation; a Linear Classifier is trained on one modality (Supervised) and tested on the other (Testing)]

• Method: Learned Features + Canonical Correlation Analysis

Feature Learning    Supervised Learning    Testing    Accuracy
Audio + Video       Audio                  Video      57.3%
Audio + Video       Video                  Audio      91.7%
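
The evaluation protocol above (fit a linear classifier on shared-representation features computed from one modality, then test on features computed from the other) can be illustrated with a toy sketch. Everything here is an assumption for illustration: the feature vectors are made up, a simple perceptron stands in for the linear classifier, and neither the feature learning nor the Canonical Correlation Analysis step is implemented.

```python
from typing import List, Tuple

Vector = List[float]


def train_perceptron(features: List[Vector], labels: List[int],
                     epochs: int = 20) -> Tuple[Vector, float]:
    """Fit a binary linear classifier; labels are -1 or +1."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * score <= 0:  # misclassified (or on the boundary): update
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
    return weights, bias


def accuracy(weights: Vector, bias: float,
             features: List[Vector], labels: List[int]) -> float:
    """Fraction of examples whose predicted sign matches the label."""
    correct = 0
    for x, y in zip(features, labels):
        score = sum(w * xi for w, xi in zip(weights, x)) + bias
        prediction = 1 if score > 0 else -1
        correct += int(prediction == y)
    return correct / len(labels)


# Hypothetical shared-representation features: the training features are
# imagined to come from audio-only input, the test features from video-only input.
audio_shared = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
video_shared = [[0.8, 0.3], [0.7, 0.1], [0.3, 0.9], [0.2, 0.7]]
labels = [1, 1, -1, -1]

weights, bias = train_perceptron(audio_shared, labels)  # supervised on audio
cross_modal_accuracy = accuracy(weights, bias, video_shared, labels)  # test on video
```

The classifier transfers across modalities only to the extent that audio and video land in similar regions of the shared representation, which is what the accuracies in the table measure.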


McGurk Effect

A visual /ga/ combined with an audio /ba/ is often perceived as /da/.

Video Input    Audio Input    Model Predictions
                              /ga/     /ba/     /da/
/ga/           /ga/           82.6%    2.2%     15.2%
/ba/           /ba/           4.4%     89.1%    6.5%
/ga/           /ba/           28.3%    13.0%    58.7%
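
Reading the table as a lookup makes the effect explicit: for matched inputs the most likely prediction is the syllable that was presented, while for the mismatched pair it is /da/. The snippet below is only a small sketch over the percentages listed above; the dictionary layout and the `most_likely` helper are hypothetical, and the keys follow the sentence above in treating the mismatched case as visual /ga/ with audio /ba/.

```python
from typing import Dict, Tuple

# Prediction percentages copied from the table; keys are (video_input, audio_input).
predictions: Dict[Tuple[str, str], Dict[str, float]] = {
    ("ga", "ga"): {"ga": 82.6, "ba": 2.2, "da": 15.2},
    ("ba", "ba"): {"ga": 4.4, "ba": 89.1, "da": 6.5},
    ("ga", "ba"): {"ga": 28.3, "ba": 13.0, "da": 58.7},
}


def most_likely(video: str, audio: str) -> str:
    """Return the syllable with the highest predicted percentage."""
    scores = predictions[(video, audio)]
    return max(scores, key=lambda syllable: scores[syllable])


matched_ga = most_likely("ga", "ga")
matched_ba = most_likely("ba", "ba")
mismatched = most_likely("ga", "ba")
```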


Conclusion

• Applied deep autoencoders to discover features in multimodal data

• Cross-modality Learning: We obtained better video features (for lip-reading) using audio as a cue

• Multimodal Feature Learning: Learn representations that relate across audio and video data

[Diagram: Video Input → Learned Representation → Audio Reconstruction + Video Reconstruction]

[Diagram: Audio Input + Video Input → Shared Representation → Audio Reconstruction + Video Reconstruction]


Bimodal Learning with RBMs

[Diagram: Hidden Units connected to Audio Input and Video Input]