Paper introduction: Sequence to Sequence – Video to Text (ICCV 2015)
Sequence to Sequence – Video to Text
Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, Kate Saenko
ICCV 2015
M2 Soichiro Murakami
10/14/16
Introduction
Video
Text
A monkey is pulling a dog’s tail and is chased by the dog.
Main contribution
• To propose a novel model, which learns to directly map a sequence of frames to a sequence of words
General seq2seq model
a. handle a variable number of frames
b. learn and use the temporal structure of the video
c. learn a language model to generate natural and grammatical sentences
Fig.1
Related work 1/2
• Image caption [8, 40]
1. generate a fixed-length vector representation of an image
2. decode this vector into a sequence of words
• FGM [36]
1. identify the semantic content (subject, verb, object, scene)
2. combine them with confidences from a language model using a factor graph to infer the most likely tuple in the video
3. generate a sentence based on a template
• Mean Pool [39]
• LSTMs are used to generate video descriptions by pooling the representations of individual frames
Related work 2/2
• Temporal-Attention [43] (ICCV 2015)
• employ a 3-D convnet model that incorporates spatiotemporal motion features to extract dense trajectory features (HoG, HoF, MBH)
• use an attention mechanism that learns to weight the frame features
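The frame-weighting idea can be sketched minimally as a softmax-weighted sum. The scores and features below are made up for illustration; in [43] the weights come from a learned scoring network conditioned on the decoder state.

```python
import numpy as np

def attend(frame_features, scores):
    """Softmax-normalize per-frame scores and return the
    weighted sum of frame features (a single context vector)."""
    w = np.exp(scores - scores.max())
    w = w / w.sum()
    return w @ frame_features  # (n_frames, d) -> (d,)

# 4 frames with 3-dim features; the score strongly favors frame 1
feats = np.array([[1., 0., 0.],
                  [0., 1., 0.],
                  [0., 0., 1.],
                  [1., 1., 1.]])
context = attend(feats, np.array([0., 5., 0., 0.]))
```

With these scores the context vector is dominated by frame 1's feature, which is the point of attention: frames judged relevant to the word being generated contribute most.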
Approach 1/2
• 3.1 LSTM for sequence modeling
• 3.2 Sequence to sequence video to text
p(y1, ..., ym | x1, ..., xn), where x1, ..., xn is the sequence of video frames and y1, ..., ym is the sequence of words
Fig. 2
Fig. 2: the first LSTM layer's output is concatenated with the word embedding before entering the second layer; zt is the output of the second LSTM layer.
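A shape-level sketch of one time step of the stacked two-layer model: during encoding the frames are fed in and the word input is zero-padded, during decoding the frame input is zero-padded and the words are fed in. Dimensions are illustrative, and `lstm_step` is a stand-in for a learned LSTM cell, not the paper's implementation.

```python
import numpy as np

FEAT, EMB, HID = 4096, 500, 1000  # illustrative sizes

def lstm_step(x, h):
    """Stand-in for a learned LSTM cell: just mixes input and state."""
    W = np.ones((HID, x.size)) / x.size  # learned weights in practice
    return np.tanh(W @ x + h)

def s2vt_step(frame_feat, word_emb, h1, h2):
    """One S2VT time step: layer 1 sees the frame feature; layer 2
    sees layer 1's output concatenated with the word embedding."""
    h1 = lstm_step(frame_feat, h1)
    h2 = lstm_step(np.concatenate([h1, word_emb]), h2)
    return h1, h2  # h2 plays the role of z_t

h1, h2 = np.zeros(HID), np.zeros(HID)
# encoding step: frame in, word input zero-padded
h1, h2 = s2vt_step(np.ones(FEAT), np.zeros(EMB), h1, h2)
# decoding step: frame input zero-padded, word in
h1, h2 = s2vt_step(np.zeros(FEAT), np.ones(EMB), h1, h2)
```

The same pair of LSTMs is unrolled over the whole sequence, so encoding and decoding share parameters and differ only in which input is zero-padded.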
Approach 2/2
• 3.3 Video and text representation
• RGB frames
• apply a pre-trained CNN (AlexNet, 16-layer VGG model) to input images and provide the output of the top layer as input to the LSTM units
• Optical Flow
• first extract classical variational optical flow features [2]
• then create flow images and apply a pre-trained CNN
• Text
• embed words into a lower, 500-dimensional space by applying a linear transformation to the input data
(Both RGB and flow features are used for the combined model.)
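The linear word embedding amounts to multiplying a one-hot vector by a weight matrix, which reduces to a row lookup. The vocabulary size and weights below are illustrative; in the paper the embedding is learned jointly with the LSTM parameters.

```python
import numpy as np

VOCAB, EMB = 10000, 500  # vocabulary size here is illustrative

rng = np.random.default_rng(0)
W = rng.standard_normal((VOCAB, EMB)) * 0.01  # learned in practice

def embed(word_id):
    """one_hot(word_id) @ W reduces to selecting row word_id,
    i.e. a linear transformation of the one-hot input."""
    return W[word_id]

vec = embed(42)  # a 500-dim representation of word 42
```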
Experimental Setup (1/3)
• Video description datasets
• Microsoft Video Description Corpus (MSVD)
• a collection of YouTube clips & single-sentence descriptions from annotators
• MPII Movie Description Dataset (MPII-MD)
• Hollywood movies & movie scripts and audio description data
• Montreal Video Annotation Dataset (M-VAD)
• Hollywood movies & audio description data for the visually impaired
Ø They used a single sentence as the target sentence for each video.
Experimental Setup (2/3)
Table 1. Corpus Statistics
Example of MPII-MD
( A Dataset for Movie Description, Anna Rohrbach, Marcus Rohrbach, Niket Tandon, Bernt Schiele, CVPR 2015)
Experimental Setup (3/3)
• Evaluation Metrics
• METEOR [7]
• METEOR compares exact token matches, stemmed tokens, and paraphrase matches, as well as semantically similar matches using WordNet synonyms.
• Experimental details of the models
• unroll the LSTM to a fixed 80 time steps during training
• for longer videos, truncate the number of frames
• for shorter videos, pad the remaining inputs with zeros
• mini-batch size: up to 8 for AlexNet, up to 3 for the flow model
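The truncate/zero-pad preprocessing above can be sketched as follows. Note that in the paper the 80 steps cover both the frame sequence and the word sequence; here only the frame sequence is fit to a fixed length, for illustration.

```python
import numpy as np

MAX_STEPS = 80  # the fixed unrolling length used during training

def fit_to_steps(frames, max_steps=MAX_STEPS):
    """Truncate longer frame sequences and zero-pad shorter ones
    so every input spans exactly max_steps time steps."""
    n, d = frames.shape
    if n >= max_steps:
        return frames[:max_steps]
    return np.vstack([frames, np.zeros((max_steps - n, d))])

short = fit_to_steps(np.ones((30, 4096)))   # zero-padded to 80
long_ = fit_to_steps(np.ones((200, 4096)))  # truncated to 80
```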
Results and Discussion – MSVD dataset
• S2VT AlexNet model on RGB video frames achieves 27.9% METEOR.
• The flow model alone achieves low performance.
• Polysemous words
• playing a guitar
• playing golf
Results and Discussion – Movie description datasets
• It was best to use dropout at the inputs and outputs of both LSTM layers.
• SMT [28]
• translate holistic video representations to a single sentence
• Visual-Labels [27]
• LSTM-based approach which uses no temporal encoding, but more diverse visual features, namely object detectors as well as activity and scene classifiers
Conclusion
• They construct descriptions using a sequence to sequence model, where frames are first read sequentially and then words are generated sequentially.
• Their model achieves state-of-the-art performance on the MSVD dataset.
• For further information:
• https://www.cs.utexas.edu/~vsub/s2vt.html