Person Action Recognition/Detection
Fabrício Ceschin
Computer Vision - Prof. David Menotti
Departamento de Informática - Universidade Federal do Paraná



● In object recognition: “is there a chair in the image?”
● In object detection: “is there a chair, and where is it in the image?”
● In action recognition: “is there an action present in the video?”
● In action detection: “is there an action, and where is it in the video?”


Datasets


KTH

● Six types of human actions: walking, jogging, running, boxing, hand waving and hand clapping

● Four different scenarios: outdoors s1, outdoors with scale variation s2, outdoors with different clothes s3 and indoors s4.

● 2391 sequences taken with a static camera at 25 fps.


Hollywood2

● 12 classes of human actions and 10 classes of scenes, distributed over 3669 video clips from 69 movies.
● Approximately 20.1 hours of video in total.


UCF Sports Action Data Set

● Set of actions collected from various sports which are typically featured on broadcast television channels such as the BBC and ESPN.

● 10 classes of human actions.
● 150 sequences with a resolution of 720 x 480.


UCF YouTube Action Data Set

● 11 action categories collected from YouTube and personal videos.

● Challenging due to large variations in camera motion, object pose and appearance, object scale, viewpoint, cluttered background, illumination conditions, etc.


JHMDB

● 51 categories, 928 clips, 33183 frames.

● Puppet flow per frame (approximated optical flow on the person).

● Puppet mask per frame.
● Joint positions per frame.
● Action label per clip.
● Meta label per clip (camera motion, visible body parts, camera viewpoint, number of people, video quality).


Articles


Timeline

● 2008: Learning Realistic Human Actions from Movies. Ivan Laptev, Marcin Marszałek, Cordelia Schmid and Benjamin Rozenfeld. CVPR 2008.

● 2013: Dense Trajectories and Motion Boundary Descriptors for Action Recognition. Heng Wang, Alexander Kläser, Cordelia Schmid and Cheng-Lin Liu. IJCV 2013.

● 2014: Two-Stream Convolutional Networks for Action Recognition in Videos. Karen Simonyan and Andrew Zisserman. NIPS 2014.

● 2015: Finding Action Tubes. Georgia Gkioxari and Jitendra Malik. CVPR 2015.

Learning Realistic Human Actions from Movies
Ivan Laptev, Marcin Marszałek, Cordelia Schmid and Benjamin Rozenfeld
CVPR 2008


Learning Realistic Human Actions from Movies

Introduction & Dataset Generation

● Inspired by new robust methods for image description and classification.
● First version of the Hollywood dataset.
● Movies contain a rich variety and a large number of realistic human actions.
● To avoid the difficulty of manual annotation, the dataset was built using script-based action annotation.
● Time information is transferred from subtitles to scripts, and time intervals for scene descriptions are then inferred; 60% precision is achieved.


Learning Realistic Human Actions from Movies

Script-based Action Annotation

Example of matching speech sections (green) in subtitles and scripts. Time information (blue) from adjacent speech sections is used to estimate time intervals of scene descriptions (yellow).


Learning Realistic Human Actions from Movies

Space-time Features

● Interest points are detected using a space-time extension of the Harris operator (a sketch of this criterion follows below).
● Histogram descriptors of space-time volumes are computed in the neighborhood of the detected points (the size of each volume is related to the detection scales).
● Each volume is subdivided into an (Nx, Ny, Nt) grid of cuboids; for each cuboid, HOG and HOF histograms are computed and concatenated into a descriptor vector.
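A minimal sketch (not the authors' implementation) of a space-time Harris-style response for a video volume: spatio-temporal gradients are aggregated into a 3x3 second-moment matrix per voxel and scored with det(M) - k·trace(M)³. The smoothing scales sigma/tau and the constant k are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def spacetime_harris(volume, sigma=2.0, tau=1.5, k=0.005):
    """Toy space-time Harris response for a grayscale video volume (T, H, W).

    Spatio-temporal gradients are accumulated into a 3x3 second-moment matrix
    per voxel; the interest operator is det(M) - k * trace(M)^3.
    Interest points would be local maxima of the returned response.
    """
    v = gaussian_filter(volume.astype(np.float32), sigma=(tau, sigma, sigma))
    Lt, Ly, Lx = np.gradient(v)  # temporal and spatial derivatives
    smooth = lambda a: gaussian_filter(a, sigma=(2 * tau, 2 * sigma, 2 * sigma))
    Mxx, Myy, Mtt = smooth(Lx * Lx), smooth(Ly * Ly), smooth(Lt * Lt)
    Mxy, Mxt, Myt = smooth(Lx * Ly), smooth(Lx * Lt), smooth(Ly * Lt)
    det = (Mxx * (Myy * Mtt - Myt ** 2)
           - Mxy * (Mxy * Mtt - Myt * Mxt)
           + Mxt * (Mxy * Myt - Myy * Mxt))
    trace = Mxx + Myy + Mtt
    return det - k * trace ** 3
```

Local maxima of this response (e.g. via scipy.ndimage.maximum_filter plus a threshold) would give the candidate space-time interest points.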


Learning Realistic Human Actions from Movies

Space-time Features

1. Space-time interest points detected for two video frames with human actions hand shake (left) and get out car (right).

2. Result of detecting the strongest spatio-temporal interest points in a football sequence with a player heading the ball (a) and in a hand clapping sequence (b).


Learning Realistic Human Actions from Movies

Spatio-temporal Bag-of-features

● A visual vocabulary is built by clustering a subset of 100k features sampled from the training videos with the k-means algorithm, with k = 4000 (see the sketch below).
● BoF assigns each feature to the closest vocabulary word (Euclidean distance) and computes the histogram of visual word occurrences over a space-time volume, corresponding either to the entire video sequence or to subsequences defined by a spatio-temporal grid.
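A minimal sketch of this bag-of-features step, assuming the per-video local descriptors are available as NumPy arrays. MiniBatchKMeans is used here as a practical stand-in for the plain k-means mentioned in the slide; k = 4000 and the 100k-feature subset follow the slide.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_vocabulary(train_descriptors, k=4000, n_samples=100_000, seed=0):
    """Cluster a random subset of training descriptors into k visual words."""
    all_desc = np.vstack(train_descriptors)
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(all_desc), size=min(n_samples, len(all_desc)), replace=False)
    return MiniBatchKMeans(n_clusters=k, random_state=seed).fit(all_desc[idx])

def bof_histogram(descriptors, vocabulary):
    """Assign each descriptor to its nearest visual word and build a normalised
    histogram of word occurrences for the video (or space-time sub-volume)."""
    words = vocabulary.predict(descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(np.float64)
    return hist / max(hist.sum(), 1.0)
```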


Learning Realistic Human Actions from Movies

Spatio-temporal Bag-of-features

Bag-of-features illustration.


Learning Realistic Human Actions from Movies

Classification

● A Support Vector Machine (SVM) with a multi-channel χ² kernel is used to combine channels (a sketch follows after the definitions below).
● The kernel is defined by:

K(H_i, H_j) = \exp\left( -\sum_{c} \frac{1}{A_c} \, D_c(H_i, H_j) \right)

● where Hi = {hin} and Hj = {hjn} are the histograms for channel c, A_c is a normalisation factor (the mean χ² distance between training samples for channel c), and Dc(Hi, Hj) is the χ² distance, defined as:

D_c(H_i, H_j) = \frac{1}{2} \sum_{n} \frac{(h_{in} - h_{jn})^2}{h_{in} + h_{jn}}
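A minimal sketch of this classifier under the formulas above, using scikit-learn's SVC with a precomputed kernel. Each "channel" is assumed to be an array of BoF histograms (one row per video); computing A_c as the mean pairwise training distance is an assumption consistent with the normalisation described above.

```python
import numpy as np
from sklearn.svm import SVC

def chi2_distance_matrix(H_a, H_b):
    """Pairwise chi^2 distances D(Hi, Hj) = 1/2 * sum_n (hin - hjn)^2 / (hin + hjn)."""
    diff = H_a[:, None, :] - H_b[None, :, :]
    summ = H_a[:, None, :] + H_b[None, :, :] + 1e-10
    return 0.5 * np.sum(diff ** 2 / summ, axis=2)

def multichannel_chi2_kernel(channels_a, channels_b, A=None):
    """K(Hi, Hj) = exp(-sum_c D_c(Hi, Hj) / A_c) over a list of channels."""
    dists = [chi2_distance_matrix(a, b) for a, b in zip(channels_a, channels_b)]
    if A is None:  # mean training distance per channel (assumed normalisation)
        A = [d.mean() for d in dists]
    return np.exp(-sum(d / a_c for d, a_c in zip(dists, A))), A

# Usage sketch: channels_train is a list of (n_videos, vocab_size) histograms,
# one array per channel (e.g. per descriptor type / spatio-temporal grid).
# K_train, A = multichannel_chi2_kernel(channels_train, channels_train)
# clf = SVC(kernel="precomputed").fit(K_train, labels)
# K_test, _ = multichannel_chi2_kernel(channels_test, channels_train, A=A)
# predictions = clf.predict(K_test)
```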

Learning Realistic Human Actions from Movies

Results


● Average class accuracy on the KTH actions dataset:

Method      Schuldt et al.   Niebles et al.   Wong et al.   This work
Accuracy    71.7%            81.5%            86.7%         91.8%

Learning Realistic Human Actions from Movies

Results


Average precision (AP) per action class on the test set, for clean (annotated) training data, automatic training data, and a random classifier (chance):

Action        Clean    Automatic   Chance
AnswerPhone   32.1%    16.4%       10.6%
GetOutCar     41.5%    16.4%       6.0%
HandShake     32.3%    9.9%        8.8%
HugPerson     40.6%    26.8%       10.1%
Kiss          53.3%    45.1%       23.5%
SitDown       38.6%    24.8%       13.8%
SitUp         18.2%    10.4%       4.6%
StandUp       50.5%    33.6%       22.6%

Dense Trajectories and Motion Boundary Descriptors for Action Recognition
Heng Wang, Alexander Kläser, Cordelia Schmid and Cheng-Lin Liu
IJCV 2013


Introduction

● Bag-of-features achieves state-of-the-art performance.
● Feature trajectories have been shown to be efficient for representing videos.
● They are generally extracted using the KLT tracker or by matching SIFT descriptors between frames; however, their quantity and quality are not sufficient.
● The paper instead describes videos with dense trajectories.


Dense Trajectories

● Feature points are sampled on a grid spaced by W pixels (W = 5). Sampling is carried out on each spatial scale separately, and the goal is to track all sampled points through the video.
● Points in areas without any structure are removed (i.e., when the eigenvalues of the auto-correlation matrix are very small, the area carries little information).
● Feature points are tracked on each spatial scale separately (see the sketch after this list).
● Features are extracted using grids of cuboids, similar to the previous article.
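A minimal OpenCV sketch of the sampling and tracking steps for a single scale and a single frame pair: dense grid sampling with a minimum-eigenvalue check, then point propagation by the median-filtered dense flow. The quality threshold, filter sizes, and Farneback flow are illustrative assumptions, not the paper's exact settings.

```python
import cv2
import numpy as np

def sample_dense_points(gray, step=5, quality=0.001):
    """Sample points every `step` pixels, discarding areas without structure
    (small minimum eigenvalue of the auto-correlation matrix)."""
    eig = cv2.cornerMinEigenVal(gray, 3)
    thresh = quality * eig.max()
    ys, xs = np.mgrid[step // 2:gray.shape[0]:step, step // 2:gray.shape[1]:step]
    pts = np.stack([xs.ravel(), ys.ravel()], axis=1)
    keep = eig[pts[:, 1], pts[:, 0]] > thresh
    return pts[keep].astype(np.float32)

def track_points(points, flow):
    """Move each point by the median-filtered dense optical flow at its location."""
    flow_med = np.dstack([cv2.medianBlur(np.ascontiguousarray(flow[..., c]), 5)
                          for c in (0, 1)])
    xs = np.clip(points[:, 0].astype(int), 0, flow.shape[1] - 1)
    ys = np.clip(points[:, 1].astype(int), 0, flow.shape[0] - 1)
    return points + flow_med[ys, xs]

# prev, curr: consecutive grayscale frames (uint8)
# flow = cv2.calcOpticalFlowFarneback(prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)
# pts = sample_dense_points(prev)
# pts_next = track_points(pts, flow)
```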


Dense Trajectories


*Motion boundary histograms (MBH) are extracted by computing derivatives separately for the horizontal and vertical components of the optical flow.

● Left: feature points are densely sampled on a grid for each spatial scale.

● Middle: tracking is carried out in the corresponding spatial scale for L frames by median filtering in a dense optical flow field.

● Right: the trajectory shape is represented by relative point coordinates. The descriptors (HOG, HOF, MBH) are computed along the trajectory in an N×N pixel neighborhood, which is divided into a grid of cuboids.
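A minimal sketch of the MBH idea mentioned above: for each flow component (horizontal and vertical), spatial derivatives are computed and quantised into magnitude-weighted orientation histograms over a cell grid. The bin count, grid size, and Sobel derivatives are illustrative assumptions.

```python
import cv2
import numpy as np

def mbh_descriptor(flow, bins=8, grid=2):
    """Motion boundary histogram for a dense flow field of shape (H, W, 2):
    compute spatial derivatives of each flow component, quantise their
    orientation into `bins`, and build one histogram per cell of a grid x grid
    layout; MBHx and MBHy histograms are concatenated and L2-normalised."""
    h, w = flow.shape[:2]
    descriptor = []
    for c in (0, 1):  # MBHx and MBHy
        comp = np.ascontiguousarray(flow[..., c].astype(np.float32))
        dx = cv2.Sobel(comp, cv2.CV_32F, 1, 0, ksize=3)
        dy = cv2.Sobel(comp, cv2.CV_32F, 0, 1, ksize=3)
        mag = np.sqrt(dx ** 2 + dy ** 2)
        ang = np.mod(np.arctan2(dy, dx), 2 * np.pi)
        bin_idx = np.minimum((ang / (2 * np.pi) * bins).astype(int), bins - 1)
        for gy in range(grid):
            for gx in range(grid):
                sl = (slice(gy * h // grid, (gy + 1) * h // grid),
                      slice(gx * w // grid, (gx + 1) * w // grid))
                hist = np.bincount(bin_idx[sl].ravel(),
                                   weights=mag[sl].ravel(), minlength=bins)
                descriptor.append(hist)
    d = np.concatenate(descriptor)
    return d / (np.linalg.norm(d) + 1e-8)
```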

Dense Trajectories


Results

● Comparison of different descriptors and methods for extracting trajectories on nine datasets. Mean average precision over all classes (mAP) is reported for Hollywood2 and Olympic Sports, and average accuracy over all classes for the other seven datasets. The three best results for each dataset are shown in bold.


Two-Stream Convolutional Networks for Action Recognition in Videos
Karen Simonyan and Andrew Zisserman
NIPS 2014


Introduction

● CNNs work very well for image recognition.
● The goal is to extend CNNs to action recognition in video.
● Two separate recognition streams are used, related to the two-stream hypothesis:
○ Spatial stream: an appearance recognition ConvNet.
○ Temporal stream: a motion recognition ConvNet.


Two-stream Hypothesis

● The ventral pathway (purple, the “what” pathway) responds to shape, color and texture.
● The dorsal pathway (green, the “where” pathway) responds to spatial transformations and movement.


Two-stream Architecture for Video Recognition

● The spatial part, in the form of individual frame appearance, carries information about the scenes and objects depicted in the video.
● The temporal part, in the form of motion across frames, carries information about the movement of the camera and of the objects.


● Each stream is implemented as a deep ConvNet; their softmax scores are combined by fusion methods (see the sketch below).
● Two fusion methods are proposed:
○ Averaging the scores.
○ Training a multiclass linear SVM on stacked L2-normalised softmax scores used as features.
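A minimal sketch of the two fusion options, assuming the per-video softmax scores of each stream are available as (n_videos, n_classes) arrays; scikit-learn's LinearSVC stands in for the multiclass linear SVM.

```python
import numpy as np
from sklearn.svm import LinearSVC

def fuse_by_averaging(spatial_scores, temporal_scores):
    """Average the two streams' softmax scores and pick the top class."""
    return np.argmax((spatial_scores + temporal_scores) / 2.0, axis=1)

def l2_normalise(x):
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)

def fuse_by_svm(train_spatial, train_temporal, train_labels,
                test_spatial, test_temporal):
    """Train a multiclass linear SVM on stacked L2-normalised softmax scores."""
    train_feats = np.hstack([l2_normalise(train_spatial), l2_normalise(train_temporal)])
    test_feats = np.hstack([l2_normalise(test_spatial), l2_normalise(test_temporal)])
    clf = LinearSVC().fit(train_feats, train_labels)
    return clf.predict(test_feats)
```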


The Spatial Stream ConvNet

● Similar to models used for image classification.
● Operates on individual video frames.
● Static appearance is a useful cue, since many actions are strongly associated with particular objects.
● The network is pre-trained on a large image classification dataset, such as the ImageNet challenge dataset.


The Temporal Stream ConvNet

● The input of the ConvNet is a stack of optical flow displacement fields between several consecutive frames.
● This input explicitly describes the motion between video frames.
● Motion representations considered:
○ Optical flow stacking: the displacement vector fields dtx and dty of L consecutive frames are stacked, creating a total of 2L input channels.
○ Trajectory stacking: flow sampled along motion trajectories (trajectory-based descriptors).
○ Bi-directional optical flow and mean flow subtraction.


Optical Flow Stacking

● The displacement vector fields dtx and dty of L consecutive frames are stacked, creating a total of 2L input channels (see the sketch below).
● In the example figures, higher intensity corresponds to positive values and lower intensity to negative values:

(a) Horizontal component dx of the displacement vector field.

(b) Vertical component dy of the displacement vector field.
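A minimal sketch of assembling this 2L-channel input from L+1 consecutive grayscale frames. OpenCV's Farneback flow is used here as a stand-in for the paper's optical flow method, and the mean-flow subtraction mentioned above is applied at the end.

```python
import cv2
import numpy as np

def stacked_flow_input(gray_frames):
    """Build the temporal-stream input from L+1 consecutive grayscale frames:
    for each of the L frame pairs, the horizontal (dtx) and vertical (dty)
    displacement fields are appended as two channels, giving 2L channels;
    the mean flow is subtracted to compensate for global camera motion."""
    channels = []
    for prev, curr in zip(gray_frames[:-1], gray_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        channels.append(flow[..., 0])  # dtx
        channels.append(flow[..., 1])  # dty
    stacked = np.stack(channels, axis=-1)  # shape (H, W, 2L)
    return stacked - stacked.mean()
```

For example, with 11 frames (L = 10, the value used in the paper) this produces a 20-channel input volume.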

Multi-task Learning

● Unlike the spatial stream ConvNet, which can be pre-trained on a large still image classification dataset (such as ImageNet), the temporal ConvNet needs to be trained on video data.
● Available datasets for video action classification are still rather small:
○ The UCF-101 and HMDB-51 datasets have only 9.5K and 3.7K videos, respectively.
● The ConvNet architecture is modified so that it has two softmax classification layers on top of the last fully-connected layer: one softmax layer computes the HMDB-51 classification scores, the other the UCF-101 scores.
● Each layer is equipped with its own loss function, which operates only on the videos coming from the respective dataset. The overall training loss is computed as the sum of the individual tasks' losses (see the sketch below).
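A minimal PyTorch sketch of this two-head setup (the framework and the 4096-d feature size are assumptions, not prescribed by the slides): two dataset-specific classification heads share the trunk features, each head's cross-entropy loss only sees samples from its own dataset, and the two losses are summed.

```python
import torch
import torch.nn as nn

class TwoHeadClassifier(nn.Module):
    """Two dataset-specific heads (UCF-101, HMDB-51) on shared trunk features.

    `feats` stands in for the output of the last fully-connected layer of the
    temporal ConvNet (assumed 4096-d here for illustration).
    """
    def __init__(self, in_features=4096):
        super().__init__()
        self.head_ucf = nn.Linear(in_features, 101)
        self.head_hmdb = nn.Linear(in_features, 51)

    def forward(self, feats):
        return self.head_ucf(feats), self.head_hmdb(feats)

def multitask_loss(model, feats, labels, dataset_ids):
    """Sum of per-dataset cross-entropy losses; dataset_ids: 0 = UCF-101, 1 = HMDB-51."""
    criterion = nn.CrossEntropyLoss()
    ucf_logits, hmdb_logits = model(feats)
    loss = feats.new_zeros(())
    ucf_mask, hmdb_mask = dataset_ids == 0, dataset_ids == 1
    if ucf_mask.any():
        loss = loss + criterion(ucf_logits[ucf_mask], labels[ucf_mask])
    if hmdb_mask.any():
        loss = loss + criterion(hmdb_logits[hmdb_mask], labels[hmdb_mask])
    return loss
```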


Results


Finding Action Tubes
Georgia Gkioxari and Jitendra Malik
CVPR 2015


Introduction

● Image region proposals: regions that are motion salient are more likely to contain the action, so they are selected.

● Significant reduction in the number of regions being processed and faster computations.

● The detection pipeline is also inspired by the human visual system.
● It outperforms other techniques on the task of action detection.


Regions of Interest

● Selective search is used on the RGB frames to generate approximately 2K regions per frame.
● Regions that are void of motion are discarded using the optical flow signal.
● Motion saliency algorithm (see the sketch after this list):
○ The normalized magnitude of the optical flow signal, fm, is viewed as a heat map at the pixel level.
○ If R is a region, then fm(R) = 1/|R| ∑_{i∈R} fm(i) is a measure of how motion salient R is.
○ R is discarded if fm(R) < α. For α = 0.3, approximately 85% of the boxes are discarded.
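A minimal sketch of this motion-saliency filter, assuming boxes in (x1, y1, x2, y2) pixel coordinates and a per-pixel optical flow magnitude map.

```python
import numpy as np

def filter_regions_by_motion(regions, flow_magnitude, alpha=0.3):
    """Keep only motion-salient region proposals.

    flow_magnitude: per-pixel optical flow magnitude (normalised to [0, 1] here).
    regions: list of boxes (x1, y1, x2, y2). A region R is kept when
    fm(R) = 1/|R| * sum_i fm(i) is at least alpha.
    """
    fm = flow_magnitude / (flow_magnitude.max() + 1e-8)
    kept = []
    for (x1, y1, x2, y2) in regions:
        patch = fm[y1:y2, x1:x2]
        if patch.size and patch.mean() >= alpha:
            kept.append((x1, y1, x2, y2))
    return kept
```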


Feature Extraction

(a) Candidate regions are fed into action specific classifiers, which make predictions using static and motion cues.

(b) The regions are linked across frames based on the action predictions and their spatial overlap. Action tubes are produced for each action and each video.
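The paper links per-frame detections into tubes using their action scores and spatial overlap; below is a simplified greedy sketch of that linking step (the paper solves it as an optimal path problem, so this is only an illustration). Detections are assumed to be per-frame lists of (box, score) pairs for one action class, and the overlap weight is an assumption.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-8)

def link_action_tube(detections_per_frame, overlap_weight=1.0):
    """Greedily link per-frame (box, score) detections of one action class into
    a tube by maximising score + overlap_weight * IoU with the previous box."""
    tube = [max(detections_per_frame[0], key=lambda d: d[1])]
    for frame_dets in detections_per_frame[1:]:
        if not frame_dets:           # no detection in this frame: keep previous box
            tube.append(tube[-1])
            continue
        prev_box = tube[-1][0]
        best = max(frame_dets,
                   key=lambda d: d[1] + overlap_weight * iou(prev_box, d[0]))
        tube.append(best)
    return tube  # list of (box, score), one per frame
```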


Action Detection Model

● Action specific SVM classifiers are used on spatio-temporal features.

● The features are extracted from the fc7 layer of two CNNs, spatial-CNN and motion-CNN, which were trained to detect actions using static and motion cues, respectively.

● The architecture of spatial-CNN and motion-CNN is similar to those used for image classification.


“This approach yields an accuracy of 62.5%, averaged over the three splits of JHMDB.”


General Results


Dataset       Ivan Laptev et al. 2008   Heng Wang et al. 2013   Karen Simonyan et al. 2014   Georgia Gkioxari et al. 2015
KTH           91.8%                     95.0%                   -                            -
Hollywood2    38.38%*                   58.2%                   -                            -
UCF YouTube   -                         84.1%                   -                            -
UCF Sports    -                         88.0%                   88.0%                        75.8%
JHMDB         -                         46.6%                   59.4%                        62.5%

*First version of Hollywood2.

References

Articles

● Learning Realistic Human Actions from Movies - Ivan Laptev, Marcin Marszałek, Cordelia Schmid and Benjamin Rozenfeld - CVPR 2008.

● Action Recognition with Improved Trajectories - Heng Wang and Cordelia Schmid - ICCV 2013.

● Dense Trajectories and Motion Boundary Descriptors for Action Recognition - Heng Wang, Alexander Kläser, Cordelia Schmid and Cheng-Lin Liu - IJCV 2013.

● Two-Stream Convolutional Networks for Action Recognition in Videos - Karen Simonyan and Andrew Zisserman - NIPS 2014.

● Finding Action Tubes - Georgia Gkioxari and Jitendra Malik - CVPR 2015.
