describing videos by exploiting temporal structure

Describing Videos by Exploiting Temporal Structure. Slides by Alberto Montes, Computer Vision Group, April 12th, 2016. [arXiv] [GitXiv] [video] [code] Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, Aaron Courville

Upload: xavier-giro

Post on 13-Apr-2017


TRANSCRIPT

Page 1: Describing videos by exploiting temporal structure

Describing Videos by Exploiting Temporal Structure

Slides by Alberto Montes

Computer Vision Group, April 12th, 2016

[arXiv] [GitXiv] [video] [code]

Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, Aaron Courville

Page 2: Describing videos by exploiting temporal structure

Introduction

Page 3: Describing videos by exploiting temporal structure

Introduction

Goal: Generate captions from videos.

Page 4: Describing videos by exploiting temporal structure

Video Description Generation Framework

Page 5: Describing videos by exploiting temporal structure

Encoder-Decoder Framework

Encoder: Convolutional Neural Network

Basic approach:

Deep CNN over frames

Decoder: Long Short-Term Memory Network
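A minimal sketch of the encoder side of this basic setup, assuming the per-frame CNN features are simply averaged into one video vector before being handed to the decoder. The slide does not state the pooling, so the averaging, the function name `encode_video_mean_pool`, and the sizes (26 frames, 1024-dimensional features) are illustrative assumptions; random numbers stand in for real CNN outputs.

```python
import numpy as np

def encode_video_mean_pool(frame_features: np.ndarray) -> np.ndarray:
    """Collapse per-frame features of shape (n_frames, feat_dim) into one vector.

    Assumption: plain averaging over time. This discards frame order,
    which is the limitation the temporal-structure slides address.
    """
    if frame_features.ndim != 2 or frame_features.shape[0] == 0:
        raise ValueError("expected a non-empty (n_frames, feat_dim) array")
    return frame_features.mean(axis=0)

# Hypothetical stand-in for CNN outputs: 26 frames, 1024 features each.
rng = np.random.default_rng(0)
frame_features = rng.standard_normal((26, 1024))

video_vector = encode_video_mean_pool(frame_features)
print(video_vector.shape)
```

Because the average is the same for any ordering of the frames, a decoder fed this vector cannot tell a clip from its shuffled or reversed version.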

Page 6: Describing videos by exploiting temporal structure

Long Short Term Memory

Page 7: Describing videos by exploiting temporal structure

Long Short Term Memory

Forget Gate:

Page 8: Describing videos by exploiting temporal structure

Long Short Term Memory

Input Gate Layer

New candidates for cell state

Page 9: Describing videos by exploiting temporal structure

Long Short Term Memory

Update Memory Content:
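The equations for the gate slides did not survive in this transcript. For reference, the textbook LSTM formulation is sketched below; the symbols ($x_t$ for the input, $W$, $U$, $b$ for weights and biases) are the conventional ones and are an assumption here, not copied from the slides, whose version also feeds in a context term from the encoder.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) && \text{(input gate)} \\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) && \text{(new candidates for cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(update memory content)}
\end{aligned}
```

Here $\sigma$ is the logistic sigmoid and $\odot$ is element-wise multiplication.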

Page 10: Describing videos by exploiting temporal structure

Long Short Term Memory

E[y_t]: word embedding matrix

Labels on the equation terms:

input

previous hidden state

weight matrices

context from encoder

bias
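The labelled terms above (word embedding as input, previous hidden state, weight matrices, context from the encoder, bias) can be put together into one decoder step. The sketch below is a hedged NumPy illustration, not the authors' code: the names `lstm_step`, `W`, `U`, `A`, `b`, the output gate, and the toy sizes are assumptions chosen to match the standard LSTM with an extra context input.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(y_emb, h_prev, c_prev, context, params):
    """One LSTM decoder step conditioned on an encoder context vector.

    params maps a gate name ("f", "i", "o", "c") to a tuple (W, U, A, b):
    W acts on the word embedding, U on the previous hidden state,
    A on the encoder context, and b is the bias.
    """
    def pre_activation(gate):
        W, U, A, b = params[gate]
        return W @ y_emb + U @ h_prev + A @ context + b

    f = sigmoid(pre_activation("f"))        # forget gate
    i = sigmoid(pre_activation("i"))        # input gate
    o = sigmoid(pre_activation("o"))        # output gate
    c_tilde = np.tanh(pre_activation("c"))  # new candidates for cell state
    c = f * c_prev + i * c_tilde            # update memory content
    h = o * np.tanh(c)                      # new hidden state
    return h, c

# Toy sizes, chosen only for illustration.
emb_dim, hid_dim, ctx_dim = 4, 3, 5
rng = np.random.default_rng(0)
params = {
    gate: (
        rng.standard_normal((hid_dim, emb_dim)),
        rng.standard_normal((hid_dim, hid_dim)),
        rng.standard_normal((hid_dim, ctx_dim)),
        np.zeros(hid_dim),
    )
    for gate in "fioc"
}
h, c = lstm_step(
    rng.standard_normal(emb_dim),
    np.zeros(hid_dim),
    np.zeros(hid_dim),
    rng.standard_normal(ctx_dim),
    params,
)
print(h.shape, c.shape)
```

In a caption generator the new hidden state would then be projected to a distribution over the vocabulary to pick the next word; that projection is left out here.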

Page 11: Describing videos by exploiting temporal structure

Exploiting Temporal Structure

Page 12: Describing videos by exploiting temporal structure

Exploiting Local Features

● Trained for activity recognition.
● Only the conv layers will be used.

Histograms of Oriented Gradients (HOG)

Histograms of Oriented Flow (HOF)

Motion Boundary Histograms (MBH)

A Spatio-Temporal Convolutional Neural Net

Page 13: Describing videos by exploiting temporal structure

Exploiting Global Structure

Attention Mechanism

Update of attention weights:
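The update equation itself is missing from this transcript. A common form of soft temporal attention scores each frame feature against the decoder's previous hidden state, normalises the scores with a softmax, and takes the weighted sum of the features as the context. The sketch below assumes that form; `temporal_attention`, `W_a`, `U_a`, `b_a`, `w` and the toy sizes are hypothetical names and values, not taken from the slides.

```python
import numpy as np

def temporal_attention(features, h_prev, W_a, U_a, b_a, w):
    """Soft attention over frames.

    features: (n_frames, feat_dim) per-frame feature vectors
    h_prev:   (hid_dim,) previous decoder hidden state
    W_a:      (att_dim, hid_dim), U_a: (att_dim, feat_dim)
    b_a, w:   (att_dim,)
    Returns (weights, context): weights has shape (n_frames,) and sums to 1,
    context has shape (feat_dim,).
    """
    # Unnormalised relevance score of every frame for the current step.
    scores = np.tanh(h_prev @ W_a.T + features @ U_a.T + b_a) @ w
    # Numerically stable softmax over the time axis.
    exp_scores = np.exp(scores - scores.max())
    weights = exp_scores / exp_scores.sum()
    # Context vector: attention-weighted sum of the frame features.
    context = weights @ features
    return weights, context

# Toy sizes for illustration only.
n_frames, feat_dim, hid_dim, att_dim = 6, 8, 3, 4
rng = np.random.default_rng(0)
features = rng.standard_normal((n_frames, feat_dim))
weights, context = temporal_attention(
    features,
    rng.standard_normal(hid_dim),
    rng.standard_normal((att_dim, hid_dim)),
    rng.standard_normal((att_dim, feat_dim)),
    np.zeros(att_dim),
    rng.standard_normal(att_dim),
)
print(weights.round(3), context.shape)
```

The weights are recomputed at every decoding step because they depend on the previous hidden state, so the decoder can focus on different frames for different words.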

Page 14: Describing videos by exploiting temporal structure

Experiments

Page 15: Describing videos by exploiting temporal structure

Datasets

YouTube2Text

1,970 video clips with multiple descriptions

Training set: 1,200 video clips

Validation set: 100 video clips

Test set: 670 video clips (the remainder)

DVS

Videos taken from DVDs

49,000 video clips

Training set: 39,000 video clips

Validation set: 5,000 video clips

Test set: 5,000 video clips

Page 16: Describing videos by exploiting temporal structure

Setup and Training

4 setups:

◉ Basic (2D GoogLeNet CNN)
◉ Local (+ 3D CNN features)
◉ Global (+ temporal attention mechanism)
◉ Local + Global

Training

- Adadelta gradient
- Loss function:
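The loss formula is not preserved in this transcript. Caption models of this kind are usually trained by minimising the negative log-likelihood of the reference words, so the sketch below assumes that loss and pairs it with the standard Adadelta update rule. The function names, `rho`, and `eps` are illustrative assumptions rather than the settings used in the experiments.

```python
import numpy as np

def caption_nll(word_probs, target_ids):
    """Negative log-likelihood of one reference caption.

    word_probs: (n_words, vocab_size); row t is the model's distribution over
                the vocabulary at step t, given the video and earlier words.
    target_ids: (n_words,) indices of the reference words.
    """
    picked = word_probs[np.arange(len(target_ids)), target_ids]
    return -np.log(picked).sum()

def adadelta_step(param, grad, state, rho=0.95, eps=1e-6):
    """One Adadelta update. state = (running avg of grad^2, running avg of delta^2)."""
    acc_grad, acc_delta = state
    acc_grad = rho * acc_grad + (1.0 - rho) * grad ** 2
    # Step size is the ratio of the two running RMS values; no learning rate.
    delta = -np.sqrt(acc_delta + eps) / np.sqrt(acc_grad + eps) * grad
    acc_delta = rho * acc_delta + (1.0 - rho) * delta ** 2
    return param + delta, (acc_grad, acc_delta)

# Toy caption of two words over a two-word vocabulary.
word_probs = np.array([[0.5, 0.5], [0.25, 0.75]])
loss = caption_nll(word_probs, np.array([0, 1]))

# Toy optimisation step on f(x) = x^2 at x = 1, where the gradient is 2.
param = np.array([1.0])
state = (np.zeros(1), np.zeros(1))
new_param, state = adadelta_step(param, 2.0 * param, state)
print(loss, new_param)
```

Adadelta needs no hand-tuned learning rate, which is one common reason for choosing it when training recurrent decoders.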

Page 17: Describing videos by exploiting temporal structure

Results

Page 18: Describing videos by exploiting temporal structure

Evaluation

Page 19: Describing videos by exploiting temporal structure

Evaluation

Page 20: Describing videos by exploiting temporal structure

Conclusions

Propose a 3D CNN to capture local fine-grained motion information.

A temporal attention mechanism to capture global information.

State-of-the-art results on YouTube2Text with a combination of both approaches.

Page 21: Describing videos by exploiting temporal structure

Thank you!

Questions?