Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

Day 4 Lecture 4 · Multimodal Deep Learning · Xavier Giró-i-Nieto [course site]

Uploaded by xavier-giro on 07-Feb-2017

TRANSCRIPT


Multimedia: Text, Audio, Vision … and ratings, geolocation, time stamps


Language & Vision: Encoder-Decoder

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

Representation or Embedding

Lectures D3L4 & D4L2 by Marta Ruiz on Neural Machine Translation
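The encoder-decoder pattern can be made concrete with a tiny numpy sketch: an encoder folds a variable-length input into one fixed-size vector (the "representation or embedding"), and a decoder unrolls outputs from it. All sizes and weights here are illustrative stand-ins, not Cho's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 16  # hidden size (illustrative)

# Toy RNN encoder: fold a sequence of input vectors into one hidden state.
W_in, W_h = rng.normal(size=(H, 8)), rng.normal(size=(H, H))

def encode(xs):
    h = np.zeros(H)
    for x in xs:
        h = np.tanh(W_in @ x + W_h @ h)
    return h  # the fixed-size "representation or embedding"

# Toy RNN decoder: unroll a few steps starting from the encoder state.
W_out = rng.normal(size=(8, H))

def decode(h, steps):
    ys = []
    for _ in range(steps):
        h = np.tanh(W_h @ h)
        ys.append(W_out @ h)
    return ys

source = [rng.normal(size=8) for _ in range(5)]
z = encode(source)            # same size regardless of input length
targets = decode(z, steps=3)
print(z.shape, len(targets))  # (16,) 3
```

Swapping the text encoder for a CNN over an image is exactly the move that turns this translation architecture into a captioning one.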

Language & Vision: Encoder-Decoder

Captioning: DeepImageSent

(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015.

Multimodal Recurrent Neural Network: the image features are only taken into account in the first hidden state.
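The "first hidden state" point can be sketched in numpy: the CNN image vector conditions the recurrence only at t = 0, after which only word inputs drive it. Dimensions and weights are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
H, D_img, D_word = 16, 32, 8

W_img = rng.normal(size=(H, D_img))  # projects CNN features into the RNN state
W_w, W_h = rng.normal(size=(H, D_word)), rng.normal(size=(H, H))

def caption_states(img_feat, words):
    # The image enters ONLY through the initial hidden state...
    h = np.tanh(W_img @ img_feat)
    states = [h]
    for w in words:
        # ...afterwards the recurrence sees word embeddings only.
        h = np.tanh(W_w @ w + W_h @ h)
        states.append(h)
    return states

img = rng.normal(size=D_img)
words = [rng.normal(size=D_word) for _ in range(4)]
states = caption_states(img, words)
print(len(states))  # 5: the image state at t = 0 plus one state per word
```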

Captioning: Show & Tell

Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015.


Captioning: Show, Attend & Tell

Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015.


Captioning: LSTM with image & video

Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. "Long-term Recurrent Convolutional Networks for Visual Recognition and Description." CVPR 2015 [code].

Captioning (+ Detection): DenseCap

Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.


XAVI: "man has short hair", "man with short hair"

AMAIA: "a woman wearing a black shirt"

BOTH: "two men wearing black glasses"

Captioning (+ Retrieval): DenseCap

Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.

Captioning: HRNE

(Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. "Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning." CVPR 2016.

[Figure: a first-layer LSTM runs over each chunk of frames from t = 1 to t = T; a second-layer LSTM unit reads only the hidden state at t = T of each chunk.]
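The hierarchy can be sketched as two stacked recurrences: a first layer summarizes each chunk of frames into its final state, and a second layer runs over those chunk-level states. For brevity this sketch uses plain tanh RNN cells in place of LSTMs, with invented dimensions.

```python
import numpy as np

rng = np.random.default_rng(2)
H, D = 12, 6  # hidden size and frame-feature size (illustrative)

W_x, W_h = rng.normal(size=(H, D)), rng.normal(size=(H, H))  # layer 1
V_x, V_h = rng.normal(size=(H, H)), rng.normal(size=(H, H))  # layer 2

def rnn_last(xs, Wx, Wh):
    # Run a simple recurrence and keep only the final hidden state (t = T).
    h = np.zeros(H)
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h)
    return h

def hrne(chunks):
    # Layer 1: one final hidden state per chunk of frames.
    chunk_states = [rnn_last(c, W_x, W_h) for c in chunks]
    # Layer 2: a second recurrence over the chunk-level states.
    return rnn_last(chunk_states, V_x, V_h)

video = [[rng.normal(size=D) for _ in range(8)] for _ in range(3)]  # 3 chunks x 8 frames
rep = hrne(video)
print(rep.shape)  # (12,)
```

The second layer thus sees a much shorter sequence (one element per chunk), which is the point of the hierarchical encoder.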

Visual Question Answering

[Figure: two encoders map the visual input and the question "Is economic growth decreasing?" into sequences [z1, z2, …, zN] and [y1, y2, …, yM]; a decoder produces the answer "Yes".]

Visual Question Answering

[Figure: extract visual features from the image, embed the question, merge both representations, and predict the answer. Question: "What object is flying?" Answer: "Kite".]

Slide credit: Issey Masuda
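The four-stage pipeline (visual features, question embedding, merge, answer prediction) amounts to a classifier over a joint representation. A minimal numpy sketch with invented sizes, random weights, and a made-up three-word answer vocabulary:

```python
import numpy as np

rng = np.random.default_rng(3)
D_img, D_q, D_joint, n_answers = 32, 16, 24, 3
answers = ["kite", "bird", "plane"]  # toy answer vocabulary (invented)

W_i = rng.normal(size=(D_joint, D_img))      # visual feature embedding
W_q = rng.normal(size=(D_joint, D_q))        # question embedding
W_a = rng.normal(size=(n_answers, D_joint))  # answer classifier

def vqa(img_feat, q_feat):
    # Merge: here an elementwise product of the two embedded modalities.
    joint = np.tanh(W_i @ img_feat) * np.tanh(W_q @ q_feat)
    logits = W_a @ joint  # predict answer as a classification over the vocabulary
    return answers[int(np.argmax(logits))]

ans = vqa(rng.normal(size=D_img), rng.normal(size=D_q))
print(ans)
```

Real systems differ mainly in how the merge step is done (concatenation, product, attention); the overall shape of the computation is the same.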

Visual Question Answering: DPPnet

Noh, H., Seo, P. H., & Han, B. "Image question answering using convolutional neural network with dynamic parameter prediction." CVPR 2016.

Dynamic Parameter Prediction Network (DPPnet)

Visual Question Answering: Dynamic Memory Networks

(Slides and Slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016.

Main idea: split the image into local regions and consider each region equivalent to a sentence.

Local region feature extraction: CNN (VGG-19). (1) Rescale the input to 448×448. (2) Take the output of the last pooling layer → D = 512×14×14 → 196 local region vectors of 512 dimensions each.

Visual feature embedding: a matrix W projects the image features to the "q" textual space.
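The shape arithmetic above checks out directly: a 512×14×14 feature map flattens into 14·14 = 196 region vectors of dimension 512, which W then projects into the textual space. A numpy sketch with a fake feature map standing in for the VGG-19 output (no actual network here; the 300-d textual dimension is an assumption):

```python
import numpy as np

rng = np.random.default_rng(4)

# Stand-in for the last VGG-19 pooling output on a 448x448 input.
feat_map = rng.normal(size=(512, 14, 14))

# Flatten the spatial grid: each of the 196 positions is a 512-d region vector.
regions = feat_map.reshape(512, 14 * 14).T
print(regions.shape)  # (196, 512)

# W projects every region vector into the textual ("q") space.
D_text = 300  # illustrative textual-embedding size
W = rng.normal(size=(D_text, 512))
projected = regions @ W.T
print(projected.shape)  # (196, 300)
```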

Visual Question Answering: Grounded

(Slides and Screencast by Issey Masuda) Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded Question Answering in Images." CVPR 2016.

Challenges: Visual Question Answering

Accuracy on the VQA challenge (%):

Humans: 83.30
UC Berkeley & Sony: 66.47
Baseline LSTM&CNN: 54.06
Baseline, nearest neighbor: 42.85
Baseline, prior per question type: 37.47
Baseline, all "yes": 29.88
I. Masuda-Mora, "Open-Ended Visual Question-Answering," BSc thesis, ETSETB 2016 [Keras]: 53.62

Vision and language: Embeddings

F. Perronnin, "Output embedding for LSVR." CVPR'14 Tutorial on LSVR.

Vision and language: Embeddings

A. Krizhevsky, I. Sutskever, G. E. Hinton. "ImageNet classification with deep convolutional neural networks." NIPS 2012.

Vision and language: Embeddings

Lecture D2L4 by Toni Bonafonte on Word Embeddings

Christopher Olah, "Visualizing Representations"

Vision and language: DeViSE

One-hot encoding vs. embedding space
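The contrast is worth spelling out: under one-hot encoding every pair of distinct classes is equally far apart, while a learned embedding space can place semantically related classes near each other. A tiny illustration with invented embedding values:

```python
import numpy as np

vocab = ["cat", "dog", "car"]

# One-hot: all pairs of distinct classes are equidistant.
one_hot = np.eye(len(vocab))

# Dense embeddings (values invented for illustration): semantics shape distances.
emb = np.array([[0.9, 0.1],   # cat
                [0.8, 0.2],   # dog
                [0.0, 1.0]])  # car

def dist(M, i, j):
    return float(np.linalg.norm(M[i] - M[j]))

print(dist(one_hot, 0, 1), dist(one_hot, 0, 2))  # identical distances
print(dist(emb, 0, 1) < dist(emb, 0, 2))         # True: cat is closer to dog than to car
```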


Vision and language: DeViSE

Frome, Andrea, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, and Tomas Mikolov. "DeViSE: A deep visual-semantic embedding model." NIPS 2013.
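DeViSE learns a linear map from CNN image features into the word-embedding space with a hinge rank loss: the projected image should score higher against its own label's embedding than against wrong labels, by a margin. A minimal sketch of one loss evaluation on random data (the projection M and all vectors here are untrained stand-ins):

```python
import numpy as np

rng = np.random.default_rng(5)
D_img, D_emb = 32, 10  # illustrative sizes

M = rng.normal(size=(D_emb, D_img), scale=0.1)  # the learned projection (random here)

def devise_loss(img_feat, true_emb, wrong_embs, margin=0.1):
    v = M @ img_feat  # project the image into the word-embedding space
    s_true = true_emb @ v
    # Hinge rank loss: penalize wrong labels that score within the margin of the true one.
    return sum(max(0.0, margin - s_true + w @ v) for w in wrong_embs)

img = rng.normal(size=D_img)
label = rng.normal(size=D_emb)
wrong = [rng.normal(size=D_emb) for _ in range(4)]
loss = devise_loss(img, label, wrong)
print(loss >= 0.0)  # hinge losses are non-negative
```

At test time the same projection lets the model answer with labels it never saw during training, by nearest-neighbor search in the embedding space.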

Vision and language: Embedding

Víctor Campos, Dèlia Fernàndez, Jordi Torres, Xavier Giró-i-Nieto, Brendan Jou, and Shih-Fu Chang (work in progress)

Learn more: Julia Hockenmaier (UIUC), "Vision to Language" (Microsoft Research)

Multimedia: Text, Audio, Vision … and ratings, geolocation, time stamps

Audio and Video: SoundNet

Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.

Object and scene recognition in videos by analysing the audio track (only).

The training videos are unlabeled; SoundNet relies on CNNs trained on labeled images.
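The supervision trick deserves a concrete picture: pretrained vision CNNs produce class distributions for the video frames, and the audio network is trained to match those distributions from the sound track alone, so no human labels are needed. A sketch of that distribution-matching objective (both networks faked with random logits; a KL divergence between teacher and student posteriors, which is how this kind of transfer is typically set up):

```python
import numpy as np

rng = np.random.default_rng(6)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Teacher: a vision CNN's class posterior for a video frame (stand-in logits).
teacher = softmax(rng.normal(size=5))

# Student: the audio network's posterior for the same clip (stand-in logits).
student = softmax(rng.normal(size=5))

# Per-clip training objective: KL(teacher || student) — no human labels involved.
kl = float(np.sum(teacher * (np.log(teacher) - np.log(student))))
print(kl >= 0.0)  # KL divergence is non-negative
```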

Hidden layers of SoundNet are used as features to train a standard SVM classifier that outperforms the state of the art.
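Feeding hidden-layer activations to an SVM is a standard transfer recipe. A runnable sketch with a tiny linear SVM trained by subgradient descent on synthetic "activations" (random vectors standing in for SoundNet features of two acoustic scenes; nothing here is the actual SoundNet pipeline):

```python
import numpy as np

rng = np.random.default_rng(7)

# Stand-ins for SoundNet hidden-layer activations of two acoustic scene classes.
X = np.vstack([rng.normal(loc=+1.0, size=(50, 20)),
               rng.normal(loc=-1.0, size=(50, 20))])
y = np.array([+1] * 50 + [-1] * 50)

# Tiny linear SVM: subgradient descent on the regularized hinge loss.
w, b, lr, lam = np.zeros(20), 0.0, 0.01, 1e-3
for _ in range(200):
    for xi, yi in zip(X, y):
        if yi * (xi @ w + b) < 1:          # margin violated: hinge gradient step
            w += lr * (yi * xi - lam * w)
            b += lr * yi
        else:                              # margin satisfied: shrink weights only
            w -= lr * lam * w

acc = np.mean(np.sign(X @ w + b) == y)
print(acc)  # this easy synthetic data is linearly separable
```

In practice one would use an off-the-shelf SVM (e.g. a linear kernel over conv5/conv7 activations); the point is only that the learned representation, not the classifier, does the heavy lifting.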

Visualization of the 1D filters over raw audio in conv1.

Visualization of the video frames associated with the sounds that activate some of the last hidden units (conv7).

Audio and Video: Sonorization

Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016.

Learning to synthesize sounds from videos of people hitting objects with a drumstick.

Audio and Video: Visual Sounds

Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016.

Not trained end-to-end.


Speech and Video: Vid2Speech

Ephrat, Ariel, and Shmuel Peleg. "Vid2Speech: Speech Reconstruction from Silent Video." ICASSP 2017.

[Figure: a CNN (VGG) maps each frame from a silent video to an audio feature.]

Not trained end-to-end.


Speech and Video: LipNet

Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: Sentence-level Lipreading." arXiv:1611.01599 (2016).

Speech and Video: Watch, Listen, Attend & Spell

Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." arXiv:1611.05358 (2016).

Conclusions

Learn more: Ruslan Salakhutdinov, "Multimodal Machine Learning" (NIPS 2015 Workshop)

Thanks! Q&A. Follow me at https://imatge.upc.edu/web/people/xavier-giro (@DocXavi, @ProfessorXavi)

Page 2: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

2

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

3

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

4

5

Language amp Vision Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Representation or Embedding

Lectures D3L4 amp D4L2 by Marta Ruiz on Neural Machine Translation

6

Language amp Vision Encoder-Decoder

7

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

8

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into account image features in the first hidden state

Multimodal Recurrent Neural Network

9

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

10

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

11

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

12

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

13

Captioning LSTM with image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

14

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

15

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

16

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 3: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

3

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

4

5

Language amp Vision Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Representation or Embedding

Lectures D3L4 amp D4L2 by Marta Ruiz on Neural Machine Translation

6

Language amp Vision Encoder-Decoder

7

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

8

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into account image features in the first hidden state

Multimodal Recurrent Neural Network

9

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

10

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

11

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

12

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

13

Captioning LSTM with image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

14

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

15

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

16

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 7: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

7

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

8

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into account image features in the first hidden state

Multimodal Recurrent Neural Network

9

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

10

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

11

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

12

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

13

Captioning LSTM with image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

14

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

15

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

16

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning: HRNE

(Slides by Marc Bolaños) Pan, Pingbo, Zhongwen Xu, Yi Yang, Fei Wu, and Yueting Zhuang. "Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning." CVPR 2016.

[Figure: HRNE encoder — a first-layer LSTM reads the frame sequence over time t = 1 … T in chunks; a second-layer LSTM unit consumes each chunk's hidden state at t = T, so the first chunk of data is summarized before the next level.]
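A toy sketch of this hierarchical encoding, with scalar "RNNs" standing in for the paper's LSTM layers: the first layer encodes fixed-size chunks of frames, and the second layer encodes the sequence of chunk states.

```python
import math

def rnn_last_state(xs, w_in=0.5, w_rec=0.5):
    # toy scalar RNN: fold a sequence into its hidden state at t = T
    h = 0.0
    for x in xs:
        h = math.tanh(w_in * x + w_rec * h)
    return h

def hierarchical_encode(seq, chunk=4):
    # layer 1: one RNN pass per chunk of frames, keeping each chunk's last state
    chunk_states = [rnn_last_state(seq[i:i + chunk])
                    for i in range(0, len(seq), chunk)]
    # layer 2: a second RNN pass over the chunk states summarizes the video
    return rnn_last_state(chunk_states)

video = [0.1 * t for t in range(16)]  # pretend per-frame features
code = hierarchical_encode(video)
print(code)
```

The payoff of the two-level structure is that the second layer sees a sequence of length T/chunk instead of T, shortening the paths gradients must travel through time.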

19

Visual Question Answering

[Figure: encoder–decoder VQA — one encoder for the visual input [z1, z2, …, zN], another for the question "Is economic growth decreasing?" [y1, y2, …, yM]; the decoder outputs "Yes".]

20

[Figure: extract visual features from the image; embed the question ("What object is flying?"); merge both representations; predict the answer ("Kite").]

Visual Question Answering

Slide credit Issey Masuda
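A minimal sketch of that extract–embed–merge–predict pipeline, with hypothetical toy embeddings and weights (the real systems use CNN features and learned layers; here the question is embedded as a bag of words and merged elementwise with the image features):

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def answer(img_feat, q_words, vocab_emb, W_ans, answers):
    # embed the question as the average of its word embeddings (bag of words)
    q = [sum(vocab_emb[w][d] for w in q_words) / len(q_words)
         for d in range(len(img_feat))]
    merged = [a * b for a, b in zip(img_feat, q)]  # elementwise merge
    scores = [sum(w * m for w, m in zip(row, merged)) for row in W_ans]
    probs = softmax(scores)
    return answers[probs.index(max(probs))]

# hypothetical toy vocabulary, answer set, and weights
vocab_emb = {"what": [0.1, 0.0], "object": [0.0, 0.3], "is": [0.1, 0.1],
             "flying": [0.0, 0.9]}
answers = ["kite", "dog"]
W_ans = [[0.0, 1.0],   # "kite" responds to the 2nd merged dimension
         [1.0, 0.0]]
img_feat = [0.2, 0.8]  # pretend CNN features with a strong "sky/kite" component
print(answer(img_feat, ["what", "object", "is", "flying"], vocab_emb, W_ans, answers))
```

Treating VQA as classification over a fixed answer vocabulary, as this sketch does, is the dominant formulation in the systems on these slides.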

21

Visual Question Answering

Noh, H., Seo, P. H., and Han, B. "Image question answering using convolutional neural network with dynamic parameter prediction." CVPR 2016.

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering: Dynamic

(Slides and Slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016.

23

Visual Question Answering: Dynamic

(Slides and Slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016.

Main idea: split the image into local regions and consider each region equivalent to a sentence.

Local region feature extraction: CNN (VGG-19). (1) Rescale the input to 448x448. (2) Take the output of the last pooling layer → D = 512x14x14 → 196 local region vectors of 512 dimensions.

Visual feature embedding: a matrix W projects the image features into the "q" (textual) space.
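Using the dimensions stated on the slide, the region extraction and projection can be sketched as follows (deterministic toy values stand in for real VGG-19 activations and for the learned matrix W; the projection size Q is an arbitrary toy choice):

```python
# dimensions from the slide: a D x 14 x 14 conv feature map with D = 512
D, S, Q = 512, 14, 8  # Q: toy size of the textual ("q") space
feat_map = [[[((c + i + j) % 7) / 7.0 for j in range(S)] for i in range(S)]
            for c in range(D)]

# (1) flatten the 14x14 grid into 196 local-region vectors of dimension 512
regions = [[feat_map[c][i][j] for c in range(D)]
           for i in range(S) for j in range(S)]

# (2) project each region into the textual space with a matrix W (toy values)
W = [[0.01 if (q + c) % 3 == 0 else 0.0 for c in range(D)] for q in range(Q)]
projected = [[sum(W[q][c] * r[c] for c in range(D)) for q in range(Q)]
             for r in regions]
print(len(regions), len(regions[0]), len(projected[0]))
```

Each of the 196 projected vectors is then treated like a "sentence" by the memory network's input module.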

24

Visual Question Answering: Grounded

(Slides and Screencast by Issey Masuda) Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded Question Answering in Images." CVPR 2016.

25

Challenges: Visual Question Answering

Visual Question Answering

26

[Chart: accuracy (%) on the Visual Question Answering challenge, scale 0–100.0]

Humans: 83.30
UC Berkeley & Sony: 66.47
Baseline LSTM & CNN: 54.06
Baseline Nearest neighbor: 42.85
Baseline Prior per question type: 37.47
Baseline All yes: 29.88
I. Masuda-Mora, "Open-Ended Visual Question-Answering." BSc thesis, ETSETB, 2016 [Keras]: 53.62

Challenges: Visual Question Answering

27

Vision and language Embeddings

Perronnin, F. "Output embedding for LSVR." CVPR'14 Tutorial on LSVR.

28

Vision and language Embeddings

A. Krizhevsky, I. Sutskever, and G. E. Hinton. "ImageNet classification with deep convolutional neural networks." NIPS 2012.

29

Vision and language: Embeddings. Lecture D2L4 by Toni Bonafonte on Word Embeddings.

Christopher Olah, "Visualizing Representations."

30

Vision and language: DeViSE

One-hot encoding → embedding space

31

Vision and language: DeViSE

A. Krizhevsky, I. Sutskever, and G. E. Hinton. "ImageNet classification with deep convolutional neural networks." NIPS 2012.

32

Vision and language: DeViSE

Frome, Andrea, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, and Tomas Mikolov. "DeViSE: A deep visual-semantic embedding model." NIPS 2013.
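The core DeViSE idea — project the image into a shared embedding space and rank labels by similarity to their word vectors, instead of scoring a fixed one-hot class set — can be sketched with hypothetical word vectors (DeViSE learns its vectors from text with a skip-gram model):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# hypothetical word embeddings standing in for learned text vectors
word_vecs = {"cat": [0.9, 0.1, 0.0], "dog": [0.8, 0.3, 0.1],
             "car": [0.0, 0.1, 0.9]}

def predict_label(visual_emb):
    # rank all labels by similarity in the shared space; labels never seen
    # at training time can be ranked the same way (zero-shot recognition)
    return max(word_vecs, key=lambda w: cosine(visual_emb, word_vecs[w]))

print(predict_label([0.9, 0.05, 0.0]))
```

This is what makes zero-shot prediction possible: a new label only needs a word vector, not retraining of the classifier.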

33

Vision and language: Embedding

Víctor Campos, Dèlia Fernàndez, Jordi Torres, Xavier Giró-i-Nieto, Brendan Jou, and Shih-Fu Chang (work in progress).

34

Learn more: Julia Hockenmaier (UIUC), Vision to Language (Microsoft Research).

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video: SoundNet

Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.

Object & scene recognition in videos by analysing the audio track (only).

37

Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.

Videos for training are unlabeled. Relies on CNNs trained on labeled images.

Audio and Video: SoundNet

38

Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.

Videos for training are unlabeled. Relies on CNNs trained on labeled images.

Audio and Video: SoundNet

39

Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.

Audio and Video: SoundNet

40

Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.

Hidden layers of SoundNet are used to train a standard SVM classifier that outperforms the state of the art.

Audio and Video: SoundNet
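That evaluation recipe can be sketched as: freeze the audio network, take one hidden layer's activations as features, and fit a linear classifier on top. The paper uses an SVM; a dependency-free perceptron on made-up features stands in for it here:

```python
def perceptron_fit(X, y, epochs=20, lr=0.1):
    # stand-in for the paper's SVM: any linear classifier on frozen features
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else -1
            if pred != yi:
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

# pretend hidden-layer activations from the (frozen) audio CNN
X = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
y = [1, 1, -1, -1]  # e.g. "street" vs "forest" scene labels
w, b = perceptron_fit(X, y)
preds = [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else -1 for xi in X]
print(preds)
```

The point of the design is that all the representation learning happens without labels; only this small linear layer needs labeled data.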

41

Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.

Visualization of the 1D filters over raw audio in conv1.

Audio and Video: SoundNet

42

Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.

Visualization of the 1D filters over raw audio in conv1.

Audio and Video: SoundNet

43

Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.

Visualization of the 1D filters over raw audio in conv1.

Audio and Video: SoundNet

44

Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.

Visualization of the video frames associated with the sounds that activate some of the last hidden units (conv7).

Audio and Video: SoundNet

45

Audio and Video: Sonorization

Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016.

Learns to synthesize sounds from videos of people hitting objects with a drumstick.

46

Audio and Video: Visual Sounds

Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016.

Not end-to-end.

47

Audio and Video: Visual Sounds

Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016.

48

Ephrat, Ariel, and Shmuel Peleg. "Vid2Speech: Speech Reconstruction from Silent Video." ICASSP 2017.

Speech and Video: Vid2Speech

[Figure: frame from a silent video → CNN (VGG) → audio feature.]
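The mapping in that figure — video frame in, audio feature out — is a regression problem. A toy linear regressor on made-up "pixels" stands in for the paper's VGG-style CNN, just to make the frame-to-audio-feature supervision concrete:

```python
def train(frames, targets, steps=500, lr=0.05):
    # plain SGD on a single linear layer (the paper uses a VGG-style CNN)
    w = [0.0] * len(frames[0])
    for _ in range(steps):
        for x, t in zip(frames, targets):
            err = sum(wi * xi for wi, xi in zip(w, x)) - t
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

# made-up "frames": the first pixel tracks mouth opening, which here
# correlates with a louder (hypothetical) audio-energy target feature
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
targets = [1.0, 0.2, 1.2]
w = train(frames, targets)
pred = sum(wi * xi for wi, xi in zip(w, [1.0, 0.0]))
print(round(pred, 2))
```

In the real system the predicted acoustic features are then passed to a vocoder to reconstruct an audible speech waveform.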

49

Ephrat, Ariel, and Shmuel Peleg. "Vid2Speech: Speech Reconstruction from Silent Video." ICASSP 2017.

Speech and Video: Vid2Speech

Not end-to-end.

50

Speech and Video: Vid2Speech

Ephrat, Ariel, and Shmuel Peleg. "Vid2Speech: Speech Reconstruction from Silent Video." ICASSP 2017.

51

Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: Sentence-level Lipreading." arXiv preprint arXiv:1611.01599 (2016).

Speech and Video: LipNet

52

Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: Sentence-level Lipreading." arXiv preprint arXiv:1611.01599 (2016).

Speech and Video: LipNet

53

Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." arXiv preprint arXiv:1611.05358 (2016).

Speech and Video: Watch, Listen, Attend & Spell

54

Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." arXiv preprint arXiv:1611.05358 (2016).

Speech and Video: Watch, Listen, Attend & Spell

55

Conclusions

56

Conclusions

57

Conclusions


58

Learn more: Russ Salakhutdinov, "Multimodal Machine Learning" (NIPS 2015 Workshop).

59

Thanks! Q&A. Follow me at:

https://imatge.upc.edu/web/people/xavier-giro

DocXavi / ProfessorXavi

Page 8: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

8

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into account image features in the first hidden state

Multimodal Recurrent Neural Network

9

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

10

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

11

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

12

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

13

Captioning LSTM with image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

14

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

15

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

16

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 9: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

9

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

10

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

11

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

12

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

13

Captioning LSTM with image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

14

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

15

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

16

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 10: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

10

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

11

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

12

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

13

Captioning LSTM with image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

14

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

15

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

16

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object & scene recognition in videos by analysing only the audio track.

37

Videos for training are unlabeled. Relies on CNNs trained on labeled images.

Audio and Video Soundnet

38

Videos for training are unlabeled. Relies on CNNs trained on labeled images.

Audio and Video Soundnet

39

Audio and Video Soundnet

40

Hidden layers of SoundNet are used to train a standard SVM classifier that outperforms the state of the art.

Audio and Video Soundnet
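SoundNet trains the audio network as a student: vision CNNs (trained on labeled images) label each video frame, and the audio CNN learns to match those class distributions from the raw waveform alone, typically with a KL-divergence objective. A toy illustration of that teacher-student loss (the distributions here are invented):

```python
import math

def kl(p, q):
    # KL(p || q): how far the student's distribution q is from the teacher's p
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# teacher: posterior of a vision CNN (e.g. over scene classes) on a video frame
p_teacher = [0.7, 0.2, 0.1]
# student: the audio CNN's prediction for the same moment, from the waveform
q_student = [0.5, 0.3, 0.2]

loss = kl(p_teacher, q_student)
print(round(loss, 4))
```

Minimizing this loss over millions of unlabeled videos is what transfers the visual labels into the audio network without any audio annotation.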

41

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44

Visualization of the video frames associated with the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorization

Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually Indicated Sounds." CVPR 2016.

Learns to synthesize sounds from videos of people hitting objects with a drumstick.

46

Audio and Video Visual Sounds


Not trained end-to-end.
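Because the pipeline is not end-to-end, the system predicts a sound feature from the video and then retrieves the closest training sound to play back, rather than generating the waveform directly. A toy nearest-neighbour retrieval sketch (the 2-d feature vectors are invented stand-ins):

```python
def nearest(pred, bank):
    # index of the training sound whose feature is closest (squared L2) to pred
    def d2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(bank)), key=lambda i: d2(pred, bank[i]))

feature_bank = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # features of training clips
pred = [0.9, 0.1]   # feature regressed from the video frames
print(nearest(pred, feature_bank))  # -> 1
```

The retrieved clip's waveform is what the viewer actually hears, which sidesteps direct waveform synthesis at the cost of being limited to sounds seen in training.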

47

Audio and Video Visual Sounds


48
Ephrat, Ariel, and Shmuel Peleg. "Vid2speech: Speech Reconstruction from Silent Video." ICASSP 2017.

Speech and Video Vid2Speech

CNN(VGG)

Frame from a silent video

Audio feature
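Vid2speech maps each silent-video frame to a vector of acoustic features, and a vocoder then turns the feature sequence back into audio. A minimal sketch of the frame-to-feature step (the dimensions and the single random linear map are placeholders for the trained VGG-style CNN, not the paper's architecture):

```python
import random

random.seed(1)

# Toy dimensions: a flattened frame and a small acoustic feature vector
# (e.g. LSP coefficients in the paper).
D_PIXELS, D_ACOUSTIC = 16, 4
W = [[random.gauss(0, 0.1) for _ in range(D_PIXELS)] for _ in range(D_ACOUSTIC)]

def predict_acoustic(frame):
    # one linear layer as a stand-in for the frame-to-feature CNN
    return [sum(w * p for w, p in zip(row, frame)) for row in W]

video = [[random.random() for _ in range(D_PIXELS)] for _ in range(3)]  # 3 frames
features = [predict_acoustic(f) for f in video]  # one acoustic vector per frame
print(len(features), len(features[0]))
```

Predicting compact acoustic features instead of raw samples is what makes the regression tractable, but it also means the audio quality is bounded by the vocoder.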

49

Speech and Video Vid2Speech

Not trained end-to-end.

50

Speech and Video Vid2Speech

Ephrat, Ariel, and Shmuel Peleg. "Vid2speech: Speech Reconstruction from Silent Video." ICASSP 2017.

51
Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: Sentence-level Lipreading." arXiv:1611.01599, 2016.

Speech and Video LipNet

52

Speech and Video LipNet

53
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip Reading Sentences in the Wild." arXiv:1611.05358, 2016.

Speech and Video Watch, Listen, Attend & Spell

54

Speech and Video Watch, Listen, Attend & Spell

55

Conclusions

56

Conclusions

57

Conclusions


58

Learn more: Ruslan Salakhutdinov, "Multimodal Machine Learning" (NIPS 2015 Workshop)

59

Thanks! Q&A. Follow me at:

https://imatge.upc.edu/web/people/xavier-giro

DocXavi / ProfessorXavi

Page 11: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

11

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

12

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

13

Captioning LSTM with image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

14

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

15

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

16

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 12: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

12

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

13

Captioning LSTM with image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

14

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

15

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

16

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 13: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

13

Captioning LSTM with image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

14

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

15

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

16

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 18: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Merge → Predict answer

Question: What object is flying?

Answer: Kite

Visual Question Answering

Slide credit Issey Masuda
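The extract–embed–merge–predict pipeline above can be sketched in a few lines of NumPy. Everything below (dimensions, the average-embedding question encoder, the tanh fusion, the random weights) is an illustrative assumption, not the exact model from the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_question(token_ids, embedding_matrix):
    """Toy question encoder: average the word embeddings of the tokens."""
    return embedding_matrix[token_ids].mean(axis=0)

def vqa_predict(img_feat, q_feat, W_img, W_q, W_out):
    """Merge visual and textual features, then score a fixed answer set."""
    merged = np.tanh(W_img @ img_feat + W_q @ q_feat)  # fuse the two modalities
    scores = W_out @ merged                            # linear answer classifier
    return int(np.argmax(scores))

# Toy dimensions: 4096-d CNN feature, 300-d word embeddings, 100 answers.
vocab, d_emb, d_img, d_h, n_ans = 50, 300, 4096, 512, 100
E = rng.standard_normal((vocab, d_emb))
W_img = 0.01 * rng.standard_normal((d_h, d_img))
W_q = 0.01 * rng.standard_normal((d_h, d_emb))
W_out = 0.01 * rng.standard_normal((n_ans, d_h))

img_feat = rng.standard_normal(d_img)    # e.g. a VGG fc7 activation
q_feat = embed_question([3, 7, 12], E)   # token ids for "what object is flying"
answer_id = vqa_predict(img_feat, q_feat, W_img, W_q, W_out)
```

In a trained system W_img, W_q, and W_out would be learned end-to-end; random weights here only exercise the data flow.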

21

Visual Question Answering

Noh, H., Seo, P. H., & Han, B. Image question answering using convolutional neural network with dynamic parameter prediction. CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. Dynamic Memory Networks for Visual and Textual Question Answering. arXiv:1603.01417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. Dynamic Memory Networks for Visual and Textual Question Answering. ICML 2016

Main idea: split the image into local regions and consider each region equivalent to a sentence.

Local region feature extraction: CNN (VGG-19). (1) Rescale the input to 448×448. (2) Take the output of the last pooling layer → D = 512×14×14 → 196 local region vectors of 512 dimensions.

Visual feature embedding: a W matrix projects the image features into the "q" textual space.
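The 196-region computation described above is just a reshape of the pool5 tensor plus a learned projection. A minimal sketch, with a random stand-in for the VGG-19 activations:

```python
import numpy as np

# Stand-in for the VGG-19 last pooling output on a 448x448 input:
# shape (512, 14, 14) = D x H x W.
pool5 = np.random.default_rng(1).standard_normal((512, 14, 14))

# Flatten the 14x14 spatial grid into 196 local region vectors of size 512.
regions = pool5.reshape(512, 14 * 14).T            # (196, 512)

# Project each region into the question ("q") textual space with a matrix W.
# W would be learned jointly with the rest of the network; random here.
W = 0.01 * np.random.default_rng(2).standard_normal((512, 512))
regions_textual = np.tanh(regions @ W)             # (196, 512)
```

Each of the 196 rows then plays the role of one "sentence" for the memory network's input module.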

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. Visual7W: Grounded Question Answering in Images. CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

Results of the VQA challenge (accuracy, %):

Humans: 83.30
UC Berkeley & Sony: 66.47
Baseline LSTM&CNN: 54.06
I. Masuda-Mora (Keras): 53.62
Baseline Nearest neighbor: 42.85
Baseline Prior per question type: 37.47
Baseline All yes: 29.88

I. Masuda-Mora, "Open-Ended Visual Question-Answering", BSc thesis, ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronnin, F. CVPR'14 Tutorial on LSVR: Output embedding for LSVR

28

Vision and language Embeddings

A. Krizhevsky, I. Sutskever, G. E. Hinton. "ImageNet classification with deep convolutional neural networks". Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language: Embeddings. Lecture D2L4 by Toni Bonafonte on Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language: DeViSE

One-hot encoding vs. embedding space
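The contrast on this slide is easy to verify numerically: one-hot codes make every pair of distinct words equally dissimilar, while an embedding space can encode similarity. A tiny hand-made example (the 2-d vectors are invented purely for illustration):

```python
import numpy as np

vocab = ["cat", "dog", "car"]
one_hot = np.eye(len(vocab))      # one-hot: orthogonal rows, no shared structure
E = np.array([[0.9, 0.1],         # "cat"   toy 2-d embeddings, chosen so that
              [0.8, 0.2],         # "dog"   cat/dog land close together and
              [0.0, 1.0]])        # "car"   car lands far away

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_onehot = cos(one_hot[0], one_hot[1])   # 0.0 for ANY pair of distinct words
sim_cat_dog = cos(E[0], E[1])
sim_cat_car = cos(E[0], E[2])
```

It is exactly this similarity structure that DeViSE exploits by projecting images into the embedding space rather than onto one-hot labels.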

31

Vision and language: DeViSE

A. Krizhevsky, I. Sutskever, G. E. Hinton. "ImageNet classification with deep convolutional neural networks". Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language: DeViSE

Frome, Andrea, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, and Tomas Mikolov. DeViSE: A deep visual-semantic embedding model. NIPS 2013
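DeViSE's core idea — project CNN image features into the word-embedding space with a learned linear map, trained with a hinge ranking loss — fits in a short sketch. Dimensions and data below are illustrative stand-ins, not the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(3)

def devise_score(img_feat, M, label_emb):
    """Cosine similarity between the projected image and a label embedding."""
    v = M @ img_feat
    return float(v @ label_emb / (np.linalg.norm(v) * np.linalg.norm(label_emb)))

def devise_rank_loss(img_feat, M, true_emb, wrong_embs, margin=0.1):
    """Hinge ranking loss: the true label should outscore every wrong label."""
    s_true = devise_score(img_feat, M, true_emb)
    return sum(max(0.0, margin - s_true + devise_score(img_feat, M, w))
               for w in wrong_embs)

d_img, d_txt = 4096, 300
M = 0.01 * rng.standard_normal((d_txt, d_img))   # the learned linear projection
img = rng.standard_normal(d_img)                 # CNN feature of one image
true_emb = rng.standard_normal(d_txt)            # embedding of its label
wrongs = [rng.standard_normal(d_txt) for _ in range(5)]
loss = devise_rank_loss(img, M, true_emb, wrongs)
```

At test time the same score ranks all label embeddings, which is what lets DeViSE predict labels never seen during visual training (zero-shot recognition).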

33

Vision and language Embedding

Víctor Campos, Dèlia Fernàndez, Jordi Torres, Xavier Giró-i-Nieto, Brendan Jou, and Shih-Fu Chang (work in progress)

34

Learn more: Julia Hockenmaier (UIUC), Vision to Language (Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video: SoundNet

Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. SoundNet: Learning sound representations from unlabeled video. NIPS 2016

Object & scene recognition in videos by analysing the audio track (only)

37-39

Videos for training are unlabeled. Relies on CNNs trained on labeled images.

40

Hidden layers of SoundNet are used to train a standard SVM classifier that outperforms the state of the art.

41-43

Visualization of the 1D filters over raw audio in conv1

44

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)
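The "hidden layers + SVM" recipe can be illustrated end to end: compute pooled 1-D conv features from raw audio, then hand them to any off-the-shelf classifier. The single conv layer below is a toy stand-in for SoundNet (the real network is a deep 1-D convnet, and scikit-learn's LinearSVC would be the natural classifier on top):

```python
import numpy as np

rng = np.random.default_rng(4)

def soundnet_features(waveform, filters, pool=8):
    """Toy hidden-layer features: 1-D conv + ReLU + max-pool + global mean."""
    n_f, k = filters.shape
    windows = np.lib.stride_tricks.sliding_window_view(waveform, k)
    fmap = np.maximum(windows @ filters.T, 0.0)        # (T-k+1, n_f) activations
    t = (fmap.shape[0] // pool) * pool                 # trim to a multiple of pool
    return fmap[:t].reshape(-1, pool, n_f).max(axis=1).mean(axis=0)

filters = 0.1 * rng.standard_normal((16, 64))          # conv1-style 1-D filters
tone = np.sin(np.linspace(0, 40 * np.pi, 2000))        # an audio "scene": a tone
noise = rng.standard_normal(2000)                      # another "scene": noise
f_tone = soundnet_features(tone, filters)              # (16,) feature vector
f_noise = soundnet_features(noise, filters)            # an SVM would separate these
```

The fixed-length feature vectors are what get fed to the SVM in the paper's recognition experiments.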

45

Audio and Video: Sonorization

Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. Visually indicated sounds. CVPR 2016

Learn to synthesize sounds from videos of people hitting objects with a drumstick

46-47

Audio and Video: Visual Sounds

Not end-to-end

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 19: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 20: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 21: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 22: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video: SoundNet

Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016, pp. 892-900.

Object & scene recognition in videos by analysing only the audio track.

Training videos are unlabeled: SoundNet relies on CNNs trained on labeled images, whose outputs supervise the audio network.

Hidden layers of SoundNet are used to train a standard SVM classifier that outperforms the state of the art.
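The cross-modal supervision can be sketched as a teacher-student (distillation) objective: a vision CNN pretrained on labeled images produces a class posterior for the video frames, and the audio network is trained to match it with a KL divergence on the same unlabeled clip. A toy numpy version (the logits below are placeholders; the real model uses ImageNet- and Places-trained teachers over millions of videos):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_kl(teacher_logits, student_logits, eps=1e-12):
    """KL(teacher || student): the image network's posterior supervises the audio network."""
    p = softmax(teacher_logits)   # vision teacher run on the video frames
    q = softmax(student_logits)   # 1D audio CNN run on the raw waveform
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))
```

Minimizing this loss over unlabeled video transfers visual knowledge into the audio representation, whose hidden activations then feed the SVM classifier.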

Audio and Video: SoundNet

Visualization of the 1D filters over raw audio in conv1.

Visualization of the video frames associated with the sounds that activate some of the last hidden units (conv7).

45

Audio and Video: Sonorization

Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016.

Learns to synthesize sounds from videos of people hitting objects with a drumstick.

46

Audio and Video: Visual Sounds

Not end-to-end.

48

Ephrat, Ariel, and Shmuel Peleg. "Vid2speech: Speech Reconstruction from Silent Video." ICASSP 2017.

Speech and Video: Vid2Speech

Pipeline: frame from a silent video → CNN (VGG) → audio feature.
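At its core this pipeline is regression from video frames to acoustic features. A toy numpy sketch of that formulation, with a single linear layer standing in for the VGG-based CNN and random arrays standing in for the frames and audio features (all shapes are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-ins: 32 flattened grayscale mouth crops -> 8 acoustic features per frame.
X = rng.normal(size=(32, 64))   # "video frames" (hypothetical 8x8 crops, flattened)
Y = rng.normal(size=(32, 8))    # "audio features" to regress per frame
W = np.zeros((64, 8))           # one linear layer stands in for the CNN

def mse(W):
    return float(((X @ W - Y) ** 2).mean())

history = [mse(W)]
for _ in range(200):            # gradient descent on the (scaled) MSE objective
    grad = 2.0 * X.T @ (X @ W - Y) / X.shape[0]
    W -= 0.01 * grad
    history.append(mse(W))
```

The real system predicts spectral features that a separate synthesis step turns back into a waveform; only the frame-to-features regression is sketched here.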

49

Speech and Video: Vid2Speech

Not end-to-end.

51

Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: Sentence-level Lipreading." arXiv preprint arXiv:1611.01599 (2016).

Speech and Video: LipNet

53

Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." arXiv preprint arXiv:1611.05358 (2016).

Speech and Video: Watch, Listen, Attend & Spell

55

Conclusions

56

Conclusions

57

Conclusions


58

Learn more: Ruslan Salakhutdinov, "Multimodal Machine Learning" (NIPS 2015 Workshop).

59

Thanks! Q&A. Follow me at:

https://imatge.upc.edu/web/people/xavier-giro

DocXavi / ProfessorXavi

Page 23: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

23

Visual Question Answering: Dynamic

(Slides and slidecast by Santi Pascual.) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016.

Main idea: split the image into local regions and treat each region as equivalent to a sentence.

Local region feature extraction with a CNN (VGG-19): (1) rescale the input to 448×448; (2) take the output of the last pooling layer → D = 512×14×14 → 196 local region vectors of 512 dimensions each.

Visual feature embedding: a learned matrix W projects the image features into the "q" textual space.
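The reshaping and projection steps can be sketched directly in numpy; the 300-d "question space", the tanh nonlinearity, and the random W are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
fmap = rng.normal(size=(512, 14, 14))        # VGG-19 last-pooling output for a 448x448 input
regions = fmap.reshape(512, 196).T           # 196 local region vectors, 512-d each
W = rng.normal(scale=0.01, size=(512, 300))  # hypothetical projection into the question space
region_facts = np.tanh(regions @ W)          # each region now lives in a 300-d textual space
```

Each row of `region_facts` is then treated like one "sentence" by the dynamic memory network's input module.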

24

Visual Question Answering: Grounded

(Slides and screencast by Issey Masuda.) Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded Question Answering in Images." CVPR 2016.

25

Challenges: Visual Question Answering

Visual Question Answering

26

VQA challenge accuracies (%): Humans 83.30; UC Berkeley & Sony 66.47; Baseline LSTM & CNN 54.06; Baseline nearest neighbor 42.85; Baseline prior per question type 37.47; Baseline "all yes" 29.88; 53.62 by I. Masuda-Mora, "Open-Ended Visual Question-Answering," BSc thesis, ETSETB 2016. [Keras]

Challenges: Visual Question Answering

27

Vision and language: Embeddings

Perronin, F. "Output embedding for LSVR." CVPR'14 Tutorial on LSVR.

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 24: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 25: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 26: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 27: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51

Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: Sentence-level Lipreading." arXiv preprint arXiv:1611.01599 (2016).

Speech and Video: LipNet

52

Speech and Video: LipNet
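LipNet maps the frame sequence to a character sequence with a CTC loss, so decoding collapses repeated per-frame labels and removes blanks. A minimal sketch of that greedy collapse step (label values here are illustrative):

```python
def ctc_greedy_collapse(frame_labels, blank=0):
    """Collapse per-frame argmax labels: merge adjacent repeats, then drop blanks."""
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# Per-frame predictions over a toy alphabet {0: blank, 1: 'a', 2: 'b'}.
# The blank between the two 1s keeps them as two separate characters.
print(ctc_greedy_collapse([1, 1, 0, 1, 2, 2, 0]))  # [1, 1, 2]
```

CTC lets the network emit one label per video frame without needing frame-level alignments between lips and characters.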

53

Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip Reading Sentences in the Wild." arXiv preprint arXiv:1611.05358 (2016).

Speech and Video: Watch, Listen, Attend & Spell

54

Speech and Video: Watch, Listen, Attend & Spell
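Watch, Listen, Attend and Spell is an attention-based sequence-to-sequence model: at each output step the decoder weights the encoder states by their similarity to its current query. A minimal dot-product version of that attention step (random stand-in vectors, not the model's learned states):

```python
import numpy as np

def attend(query, encoder_states):
    """Dot-product attention: softmax-weighted mix of encoder states."""
    scores = encoder_states @ query          # one score per encoder time step
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ encoder_states          # context vector for the decoder

rng = np.random.default_rng(2)
states = rng.standard_normal((10, 8))        # T=10 encoder steps, dim 8
context = attend(rng.standard_normal(8), states)
print(context.shape)  # (8,)
```

In the full model there are two such attention heads, one over video (watch) and one over audio (listen), whose context vectors feed the character-spelling decoder.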

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn more: Ruslan Salakhutdinov, "Multimodal Machine Learning" (NIPS 2015 Workshop)

59

Thanks! Q&A. Follow me at:

https://imatge.upc.edu/web/people/xavier-giro

@DocXavi / @ProfessorXavi

Page 28: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 29: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 30: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 31: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 32: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 33: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 34: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41. Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning Sound Representations from Unlabeled Video." NIPS 2016.

Visualization of the 1D filters over raw audio in conv1.

Audio and Video: SoundNet

42. Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning Sound Representations from Unlabeled Video." NIPS 2016.

Visualization of the 1D filters over raw audio in conv1.

Audio and Video: SoundNet

43. Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning Sound Representations from Unlabeled Video." NIPS 2016.

Visualization of the 1D filters over raw audio in conv1.

Audio and Video: SoundNet

44. Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning Sound Representations from Unlabeled Video." In Advances in Neural Information Processing Systems, pp. 892-900. 2016.

Visualization of the video frames associated with the sounds that activate some of the last hidden units (conv7).

Audio and Video: SoundNet

45

Audio and Video: Sonorization

Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually Indicated Sounds." CVPR 2016.

Learns to synthesize sounds from videos of people hitting objects with a drumstick.

46

Audio and Video: Visual Sounds

Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually Indicated Sounds." CVPR 2016.

Not end-to-end.

47

Audio and Video: Visual Sounds

Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually Indicated Sounds." CVPR 2016.
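The "not end-to-end" pipeline noted on these slides, regress sound features from the video, then pick the closest real sound rather than generating a waveform, can be sketched as nearest-neighbour retrieval. In this NumPy sketch the video-side regressor is assumed to exist and is replaced by a noisy oracle; the bank size and feature dimension are made up.

```python
import numpy as np

rng = np.random.default_rng(2)

# A bank of real impact sounds, each summarized by a feature vector
# (a compressed sound-feature representation in the paper; random here).
sound_bank = rng.normal(size=(50, 32))

def retrieve(predicted_feat, bank):
    """Example-based synthesis: index of the bank sound closest to the prediction."""
    dists = np.linalg.norm(bank - predicted_feat, axis=1)
    return int(np.argmin(dists))

# Pretend the video model predicted features close to bank entry 7.
predicted = sound_bank[7] + 0.05 * rng.normal(size=32)
idx = retrieve(predicted, sound_bank)
```

Retrieval sidesteps waveform generation entirely: as long as the predicted features land near the right neighbourhood, the played sound is a real recording.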

48. Ephrat, Ariel, and Shmuel Peleg. "Vid2Speech: Speech Reconstruction from Silent Video." ICASSP 2017.

Speech and Video: Vid2Speech

(Figure: a CNN (VGG) maps each frame of a silent video to an audio feature.)

49. Ephrat, Ariel, and Shmuel Peleg. "Vid2Speech: Speech Reconstruction from Silent Video." ICASSP 2017.

Speech and Video: Vid2Speech

Not end-to-end.
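The per-frame mapping on these slides, a CNN regressing audio features from a silent video frame, can be illustrated with its simplest linear analogue: closed-form least squares from pixels to spectral features. All shapes below are toy assumptions; the real model is a VGG-style CNN and the features describe the speech spectrum.

```python
import numpy as np

rng = np.random.default_rng(3)

n_frames, n_pixels, n_audio = 100, 64, 9   # toy sizes, not the paper's
frames = rng.normal(size=(n_frames, n_pixels))       # flattened mouth-region frames
true_map = rng.normal(size=(n_pixels, n_audio))
audio_feats = frames @ true_map                      # synthetic target audio features

# Fit the linear "network" by ordinary least squares (closed-form training).
W, *_ = np.linalg.lstsq(frames, audio_feats, rcond=None)
pred = frames @ W
err = float(np.mean((pred - audio_feats) ** 2))      # near zero on this synthetic data
```

The regression view also explains the "not end-to-end" label on the next slide: the network stops at audio features, and a separate synthesis stage turns them into a waveform.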

50

Speech and Video: Vid2Speech

Ephrat, Ariel, and Shmuel Peleg. "Vid2Speech: Speech Reconstruction from Silent Video." ICASSP 2017.

51. Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: Sentence-level Lipreading." arXiv preprint arXiv:1611.01599 (2016).

Speech and Video: LipNet

52. Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: Sentence-level Lipreading." arXiv preprint arXiv:1611.01599 (2016).

Speech and Video: LipNet
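LipNet trains its spatiotemporal ConvNet and recurrent layers end-to-end with the CTC loss, which lets it emit character sequences without frame-level alignments. The decoding side of CTC (merge repeated labels, then drop blanks) is compact enough to sketch exactly:

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Collapse a per-frame best-path labelling into an output sequence:
    merge consecutive duplicates, then remove blank symbols."""
    out, prev = [], None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out

# Per-frame argmaxes over an alphabet {blank=0, 'a'=1, 'b'=2}:
decoded = ctc_greedy_decode([1, 1, 0, 2, 2, 0, 2])  # -> [1, 2, 2]
```

Note how the blank symbol is what allows genuinely repeated characters ('b', 'b') to survive the collapse, the core trick that makes CTC alignment-free.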

53. Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip Reading Sentences in the Wild." arXiv preprint arXiv:1611.05358 (2016).

Speech and Video: Watch, Listen, Attend & Spell
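Watch, Listen, Attend & Spell attends over both the lip-video and the audio encoder states when spelling each character. One attention step over the two modalities might look like the following NumPy sketch (dot-product attention and all shapes are illustrative assumptions, not the paper's exact architecture):

```python
import numpy as np

rng = np.random.default_rng(4)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(query, keys):
    """Dot-product attention: weight encoder states by similarity to the decoder state."""
    weights = softmax(keys @ query)
    return weights @ keys, weights

T_video, T_audio, d = 12, 30, 8
video_states = rng.normal(size=(T_video, d))   # lip-reading ("watch") encoder outputs
audio_states = rng.normal(size=(T_audio, d))   # audio ("listen") encoder outputs
decoder_state = rng.normal(size=d)             # speller state before emitting a character

ctx_video, w_video = attend(decoder_state, video_states)
ctx_audio, w_audio = attend(decoder_state, audio_states)
fused = np.concatenate([ctx_video, ctx_audio]) # fed to the character output layer
```

Keeping a separate attention head per modality is what lets the speller lean on lips when audio is noisy, and on audio when the mouth is occluded.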


Page 35: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 36: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 37: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 38: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 39: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 40: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 41: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 42: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 43: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

Page 44: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

44

Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.

Visualization of the video frames associated with the sounds that activate some of the last hidden units (conv7)

Audio and Video: SoundNet
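SoundNet is trained by cross-modal teacher-student transfer: a pretrained vision network predicts a class distribution for each video frame, and the audio network learns to match it on the synchronized sound track by minimizing a KL divergence. A toy NumPy sketch of that loss; the logits below are made up:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q):
    """KL(p || q) between two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

# Hypothetical logits: a pretrained vision "teacher" on a video frame,
# and the audio "student" on the synchronized sound track.
teacher = softmax(np.array([2.0, 0.5, -1.0]))
student = softmax(np.array([1.5, 0.7, -0.5]))
loss = kl(teacher, student)   # minimized w.r.t. the student during training
print(loss)                   # small positive value
```

Driving this loss toward zero makes sound-only units respond to the same semantic categories the vision network sees, which is why frames can be retrieved from conv7 activations.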

Page 45: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

45

Audio and Video: Sonorization

Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016.

Learns to synthesize sounds from videos of people hitting objects with a drumstick.
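The model regresses sound features frame by frame: per-frame CNN features feed a recurrent network whose output at each step is a vector of sound features. A toy NumPy sketch of such an unrolled recurrence; all sizes and weights are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: CNN frame features (d_in) -> RNN state (d_h) -> sound features (d_out)
d_in, d_h, d_out, T = 8, 16, 4, 10
Wx = rng.normal(size=(d_h, d_in))
Wh = rng.normal(size=(d_h, d_h))
Wo = rng.normal(size=(d_out, d_h))

frames = rng.normal(size=(T, d_in))   # stand-in for per-frame CNN features
h = np.zeros(d_h)
sound = []
for x in frames:                      # simple tanh RNN unrolled over the video
    h = np.tanh(Wx @ x + Wh @ h)
    sound.append(Wo @ h)              # predicted sound features at this frame
sound = np.stack(sound)
print(sound.shape)                    # (10, 4)
```

The predicted feature sequence is then converted back to a waveform (in the paper, by matching against real recorded sounds) rather than synthesized end-to-end.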

Page 46: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

46

Audio and Video: Visual Sounds

Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016.

Not trained end-to-end.

Page 47: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

47

Audio and Video: Visual Sounds

Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016.

Page 48: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

48

Ephrat, Ariel, and Shmuel Peleg. "Vid2Speech: Speech reconstruction from silent video." ICASSP 2017.

Speech and Video: Vid2Speech

CNN (VGG)

Frame from a silent video

Audio feature
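Vid2Speech maps each frame of the silent video through a VGG-style CNN to a vector of acoustic features. A toy NumPy sketch of that frame-to-audio-features mapping, using one naive convolutional layer plus a linear readout; all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def conv2d_valid(img, kern):
    """Naive 'valid' 2-D cross-correlation of a single-channel image."""
    H, W = img.shape
    k = kern.shape[0]
    out = np.empty((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + k, j:j + k] * kern)
    return out

frame = rng.normal(size=(32, 32))               # stand-in for a mouth-region frame
kern = rng.normal(size=(5, 5))                  # one hypothetical learned filter
fmap = np.maximum(conv2d_valid(frame, kern), 0) # conv + ReLU
W_out = rng.normal(size=(9, fmap.size))         # hypothetical linear readout
audio_feat = W_out @ fmap.ravel()               # e.g. a few spectral coefficients
print(audio_feat.shape)                         # (9,)
```

A vocoder then turns the predicted acoustic feature sequence into an audible waveform, which is why the pipeline is not end-to-end.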

Page 49: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

49

Ephrat, Ariel, and Shmuel Peleg. "Vid2Speech: Speech reconstruction from silent video." ICASSP 2017.

Speech and Video: Vid2Speech

Not trained end-to-end.

Page 50: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

50

Speech and Video: Vid2Speech

Ephrat, Ariel, and Shmuel Peleg. "Vid2Speech: Speech reconstruction from silent video." ICASSP 2017.

Page 51: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

51

Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: Sentence-level lipreading." arXiv preprint arXiv:1611.01599 (2016).

Speech and Video: LipNet

Page 52: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

52

Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: Sentence-level lipreading." arXiv preprint arXiv:1611.01599 (2016).

Speech and Video: LipNet
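LipNet predicts a character distribution per video frame and is trained with the CTC loss, so transcription does not need frame-level alignments: decoding collapses repeated labels and removes blanks. A minimal sketch of that greedy CTC decoding rule:

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Collapse a per-frame argmax label sequence into a transcript:
    merge consecutive repeats, then drop blanks (the CTC decoding rule)."""
    out, prev = [], None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out

# 0 = blank; e.g. frames spelling "h e l l o" with repeats and blanks
print(ctc_greedy_decode([0, 8, 8, 0, 5, 12, 12, 0, 12, 15, 0]))  # [8, 5, 12, 12, 15]
```

Note that the blank between the two 12s is what lets the decoder emit a genuine double letter ("ll") rather than collapsing it.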

Page 53: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

53

Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." arXiv preprint arXiv:1611.05358 (2016).

Speech and Video: Watch, Listen, Attend & Spell

Page 54: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

54

Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." arXiv preprint arXiv:1611.05358 (2016).

Speech and Video: Watch, Listen, Attend & Spell
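Watch, Listen, Attend & Spell attends over the encoded lip (and audio) sequences at every output character: the decoder state scores each encoder state, a softmax turns the scores into weights, and their weighted sum becomes the context vector fed to the speller. A minimal NumPy sketch of dot-product attention over one modality; sizes are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(2)
T, d = 6, 8                              # 6 encoded lip frames, dimension 8
video_enc = rng.normal(size=(T, d))      # stand-in encoder states ("watch")
dec_state = rng.normal(size=(d,))        # stand-in decoder state ("spell")

scores = video_enc @ dec_state           # one attention score per frame
alpha = softmax(scores)                  # normalized attention weights
context = alpha @ video_enc              # weighted sum fed to the speller
print(alpha.sum(), context.shape)
```

The same mechanism runs in parallel over the audio encoder ("listen"), and both context vectors condition each emitted character.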

Page 55: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

55

Conclusions

Page 56: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

56

Conclusions

Page 57: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

57

Conclusions

[course site]

Page 58: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

58

Learn more: Ruslan Salakhutdinov, "Multimodal Machine Learning" (NIPS 2015 Workshop)

Page 59: Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

59

Thanks! Q&A. Follow me at:

https://imatge.upc.edu/web/people/xavier-giro

@DocXavi / @ProfessorXavi