
Page 1:

Video Captioning

Erin Grant

March 1st, 2016

Page 2:

Last Class: Image Captioning

From Kiros et al. [2014]

Page 3:

This Week: Video Captioning
AKA: Image captioning through time!

[Figure: example captioned videos grouped into (a) correct descriptions, (b) relevant but incorrect descriptions, and (c) irrelevant descriptions.]

From Venugopalan et al. [2015]

Page 4:

Related Work (1)

Toronto: Joint Embedding from Skip-thoughts + CNN:

From Zhu et al. [2015]: Aligning books and movies: Towards story-like visual explanations by watching movies and reading books

Page 5:

Related Work (2)
Berkeley: Long-term Recurrent Convolutional Networks:

[Figure: the LRCN framework applied to three tasks. Activity recognition: input sequence of frames, CNN + LSTM, output label. Image description: input image, CNN + LSTM, output sentence ("A large building with a clock on the front of it"). Video description: input video, CRF + LSTM, output sentence ("A man juiced the orange").]

From Donahue et al. [2015]: Long-term recurrent convolutional networks for visual recognition and description

Page 6:

Related Work (3)

MPI: Ensemble of weak classifiers + LSTM:

[Figure: scores from place (PLACES), object (LSDA), and verb (DT) classifiers are filtered to a robust subset, concatenated, and fed to an LSTM that generates the caption word by word ("Someone enters the room").]

From Rohrbach et al. [2015]: The long-short story of movie description

Page 7:

Related Work (4)
Montreal: (SIFT, HOG) Features + 3-D CNN + LSTM + Attention:

[Figure: features are extracted from the video, a soft-attention mechanism weights them at each step, and the caption ("a man ...") is generated word by word.]

From Yao et al. [2015]: Video description generation incorporating spatio-temporal features and a soft-attention mechanism

Page 8:

We can simplify the problem...

In captioning, we translate one modality (image) to another (text).

Image captioning: fixed-length sequence (image) to variable-length sequence (words).

Video captioning: variable-length sequence (video frames) to variable-length sequence (words).

Page 9:

Formulation

- Let (x_1, ..., x_n) be the sequence of video frames.

- Let (y_1, ..., y_m) be the sequence of words, e.g.

  (The, cat, is, afraid, of, the, cucumber.)

- We want to maximise p(y_1, ..., y_m | x_1, ..., x_n) (factored step by step below).
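Since words are emitted one at a time, this conditional is usually factored autoregressively; this is a standard decomposition, stated here only to connect with the decoding objective on Page 12:

p(y_1, ..., y_m | x_1, ..., x_n) = Π_{t=1}^{m} p(y_t | y_1, ..., y_{t-1}, x_1, ..., x_n)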

Page 10:

Formulation contd.

Idea:

- Accumulate the sequence of video frames into a single encoded vector.

- Decode that vector into words one-by-one (a minimal sketch follows below):

  The ⇒ cat ⇒ is ⇒ afraid ⇒ of ...
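To make the encode-then-decode idea concrete, here is a minimal PyTorch sketch. It assumes frame features have already been extracted by a CNN; the module name, layer sizes, and vocabulary size are illustrative choices, not taken from any of the cited papers.

import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):          # illustrative, not the exact S2VT architecture
    def __init__(self, feat_dim=4096, hidden_dim=512, vocab_size=10000, embed_dim=256):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, word_ids):
        # Encode: accumulate the frame features (B, n, feat_dim) into a single state.
        _, state = self.encoder(frame_feats)
        # Decode: score each word conditioned on the encoding and the previous words.
        dec_out, _ = self.decoder(self.embed(word_ids), state)   # word_ids: (B, m)
        return self.out(dec_out)                                 # (B, m, vocab_size)

model = EncoderDecoder()
frames = torch.randn(2, 30, 4096)            # 2 clips, 30 frame-feature vectors each
words = torch.randint(0, 10000, (2, 8))      # 2 partial captions, 8 tokens each
scores = model(frames, words)                # (2, 8, 10000)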

Page 11:

The S2VT Model

[Figure: raw frames pass through an object-pretrained CNN and optical-flow images through an action-pretrained CNN; the CNN outputs feed a stack of LSTMs that emits the caption "A man is cutting a bottle <eos>".]

Our LSTM network is connected to a CNN for RGB frames or a CNN for optical flow images.

From Venugopalan et al. [2015]: Sequence to sequence - video to text
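As a rough illustration of the per-frame feature-extraction step, the sketch below pulls 4096-dimensional activations from a torchvision VGG-16 for each frame. The choice of network, layer, and preprocessing here is an assumption; the exact pretrained CNNs used for RGB and flow in the paper may differ.

import torch
import torchvision.models as models

# ImageNet-pretrained VGG-16 as the frame encoder (older torchvision: pretrained=True).
vgg = models.vgg16(weights="IMAGENET1K_V1")
vgg.eval()

# Keep everything up to, but not including, the final 1000-way classification
# layer, giving one 4096-dimensional feature vector per frame.
feature_extractor = torch.nn.Sequential(
    vgg.features,
    vgg.avgpool,
    torch.nn.Flatten(),
    *list(vgg.classifier[:-1]),
)

with torch.no_grad():
    frames = torch.randn(30, 3, 224, 224)       # 30 preprocessed frames from one clip
    frame_feats = feature_extractor(frames)     # (30, 4096), one vector per time step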

Page 12:

Optimization

During decoding, maximise

log p(y_1, ..., y_m | x_1, ..., x_n) = Σ_{t=1}^{m} log p(y_t | h_{n+t-1}, y_{t-1})

Train using stochastic gradient descent.

Encoder weights are jointly updated with decoder weights because we are backpropagating through time.
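A minimal training-step sketch of this objective, assuming pre-extracted frame features and illustrative layer sizes: the per-word negative log-likelihoods form a cross-entropy loss, and a single backward pass sends gradients through the decoder and the encoder, so both are updated jointly.

import torch
import torch.nn as nn

feat_dim, hidden, vocab, embed = 4096, 512, 10000, 256   # illustrative sizes
encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
embedder = nn.Embedding(vocab, embed)
decoder = nn.LSTM(embed, hidden, batch_first=True)
out = nn.Linear(hidden, vocab)

params = (list(encoder.parameters()) + list(embedder.parameters())
          + list(decoder.parameters()) + list(out.parameters()))
optimizer = torch.optim.SGD(params, lr=0.01)
criterion = nn.CrossEntropyLoss()            # negative log-likelihood per target word

frames = torch.randn(2, 30, feat_dim)        # (B, n, feat_dim) frame features
caption = torch.randint(0, vocab, (2, 9))    # (B, m+1) token ids, incl. <BOS>/<EOS>
inputs, targets = caption[:, :-1], caption[:, 1:]   # predict y_t from y_{t-1}

_, state = encoder(frames)                          # encode the frame sequence
hidden_seq, _ = decoder(embedder(inputs), state)    # decode given the encoding
scores = out(hidden_seq)                            # (B, m, vocab)

loss = criterion(scores.reshape(-1, vocab), targets.reshape(-1))
optimizer.zero_grad()
loss.backward()     # one backward pass through decoder *and* encoder (BPTT)
optimizer.step()    # encoder and decoder weights are updated jointly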

Page 13:

S2VT Model in Detail

[Figure: two stacked LSTMs unrolled over time. In the encoding stage the frames are fed to the first LSTM while the word input is <pad>; in the decoding stage the frame input is <pad> while the caption "<BOS> A man is talking <EOS>" is fed to and predicted by the second LSTM.]

From Venugopalan et al. [2015]
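A rough sketch of this layout, with illustrative dimensions: the two LSTMs are unrolled over the n encoding steps plus the m decoding steps, and <pad> inputs (zeros here) fill the slots that are unused in each stage.

import torch
import torch.nn as nn

B, n, m = 2, 30, 8                  # batch size, no. of frames, no. of caption tokens
F, E, H, V = 4096, 256, 512, 10000  # feature, embedding, hidden, vocabulary sizes

lstm1 = nn.LSTM(F, H, batch_first=True)       # sees frame features, then <pad>
lstm2 = nn.LSTM(H + E, H, batch_first=True)   # sees lstm1 output + word embedding
embed = nn.Embedding(V, E)
out = nn.Linear(H, V)

frame_feats = torch.randn(B, n, F)            # per-frame CNN features
words = torch.randint(0, V, (B, m))           # <BOS>, A, man, is, talking, ...

# Encoding stage: frames into layer 1, padded word slots into layer 2.
# Decoding stage: padded frame slots into layer 1, word embeddings into layer 2.
layer1_in = torch.cat([frame_feats, torch.zeros(B, m, F)], dim=1)   # (B, n+m, F)
layer1_out, _ = lstm1(layer1_in)                                    # (B, n+m, H)

word_in = torch.cat([torch.zeros(B, n, E), embed(words)], dim=1)    # (B, n+m, E)
layer2_out, _ = lstm2(torch.cat([layer1_out, word_in], dim=2))      # (B, n+m, H)

scores = out(layer2_out[:, n:, :])    # word predictions only at the m decoding steps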

Page 14:

S2VT Results (Qualitative)

[Figure: example outputs grouped into (a) correct descriptions, (b) relevant but incorrect descriptions, and (c) irrelevant descriptions.]

From Venugopalan et al. [2015]

Page 15:

S2VT Results (Quantitative)

Model                                                      METEOR
FGM (Thomason et al. [2014])                               23.9
Mean pool
- AlexNet (Venugopalan et al. [2015])                      26.9
- VGG                                                      27.7
- AlexNet COCO pre-trained (Venugopalan et al. [2015])     29.1
- GoogleNet (Yao et al. [2015])                            28.7
Temporal attention
- GoogleNet (Yao et al. [2015])                            29.0
- GoogleNet + 3D-CNN (Yao et al. [2015])                   29.6
S2VT
- Flow (AlexNet)                                           24.3
- RGB (AlexNet)                                            27.9
- RGB (VGG), random frame order                            28.2
- RGB (VGG)                                                29.2
- RGB (VGG) + Flow (AlexNet)                               29.8

Table: Microsoft Video Description (MSVD) dataset (METEOR in %, higher is better).

From Venugopalan et al. [2015]

Page 16:

Datasets

- Microsoft Video Description corpus (MSVD), Chen and Dolan [2011]

  - web clips with human-annotated sentences

- MPII Movie Description Corpus (MPII-MD), Rohrbach et al. [2015], and Montreal Video Annotation Dataset (M-VAD), Yao et al. [2015]

  - movie clips with captions sourced from audio/script

Page 18:

References I

D. L. Chen and W. B. Dolan. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 190–200. Association for Computational Linguistics, 2011.

J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2625–2634, 2015.

R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.

Page 19:

References II

A. Rohrbach, M. Rohrbach, and B. Schiele. The long-short story of movie description. In Pattern Recognition, pages 209–221. Springer, 2015.

J. Thomason, S. Venugopalan, S. Guadarrama, K. Saenko, and R. J. Mooney. Integrating language and vision to generate natural language descriptions of videos in the wild. In COLING, volume 2, page 9, 2014.

S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko. Sequence to sequence - video to text. In Proceedings of the IEEE International Conference on Computer Vision, pages 4534–4542, 2015.

L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville. Video description generation incorporating spatio-temporal features and a soft-attention mechanism. arXiv preprint arXiv:1502.08029, 2015.

Page 20:

References III

Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19–27, 2015.