
DONG XU, MEMBER, IEEE, AND SHIH-FU CHANG, FELLOW, IEEE

Video Event Recognition Using Kernel Methods with Multilevel Temporal Alignment

Outline

1. Introduction
2. Scene-Level Concept Score Feature
3. Single-Level Earth Mover’s Distance in the Temporal Domain
4. Temporally Aligned Pyramid Matching
5. Experiments
6. Contributions and Conclusion

1. Introduction

Previous work on video event recognition can be roughly classified as either activity recognition or abnormal event recognition

Model-based

Abnormal event recognition
- Zhang et al. [1] propose a semisupervised adapted Hidden Markov Model (HMM) framework

Activity recognition
- HMM
- Coupled HMM
- Dynamic Bayesian Network

Appearance-based

Abnormal event recognition
- Boiman and Irani [7]

Activity recognition
- Ke et al. [8]
- Efros et al. [9]
- Other

Event recognition in broadcast news video

Rich information

Emerging applications of open source intelligence

Online video search

LSCOM ontology

Large-Scale Concept Ontology for Multimedia

Defined 56 event/activity concepts

Manual annotation of such event concepts has been completed for a large data set in TRECVID 2005 [15]

Challenges of events in news video

Large variations of scenes and activities

Difficult to
- reliably track moving objects
- detect the salient spatiotemporal interest regions
- extract the spatial-temporal features

Address the challenges of news video

Ebadollahi et al. [17]

Midlevel concept score (CS)

Nonparametric approach

Bag-of-words model

Bag-of-words model

Represent one video clip as a bag of orderless features, extracted from all of the frames

Earth Mover’s Distance (EMD) [21]

Single-level EMD (SLEMD)

Support Vector Machine (SVM)

Temporally Aligned Pyramid Matching (TAPM)


2. Scene-Level Concept Score Feature

Holistic features to represent content in constituent image frames

Multilevel temporal alignment framework to match temporal characteristics of various events

Three low-level global features

Grid Color Moment

Gabor Texture

Edge Direction Histogram

We use these features because they are

Efficiently extracted over the large video corpus

Effective for detecting several concepts

Suitable for capturing the characteristics of scenes
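The slides only name the three low-level features. As an illustration of what a global, grid-based feature looks like, here is a minimal sketch of grid color moments. The 5x5 grid, the choice of three moments (mean, standard deviation, and the cube root of the third central moment), and the function names are assumptions for illustration, not details given in the slides.

```python
import math

def color_moments(values):
    """First three moments of a list of channel values (assumed definition)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    third = sum((v - mean) ** 3 for v in values) / n
    std = math.sqrt(var)
    # Signed cube root of the third central moment.
    skew = math.copysign(abs(third) ** (1.0 / 3.0), third)
    return [mean, std, skew]

def grid_color_moments(image, grid=5):
    """image: H x W nested list of 3-channel pixels, with H and W >= grid.

    Returns grid * grid * 3 channels * 3 moments values (225 for grid=5).
    """
    height, width = len(image), len(image[0])
    features = []
    for gy in range(grid):
        for gx in range(grid):
            y0, y1 = gy * height // grid, (gy + 1) * height // grid
            x0, x1 = gx * width // grid, (gx + 1) * width // grid
            for c in range(3):
                cell = [image[y][x][c] for y in range(y0, y1) for x in range(x0, x1)]
                features.extend(color_moments(cell))
    return features
```

Because each cell is summarized independently, the feature keeps a coarse spatial layout of the scene while staying cheap to compute for every frame.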

3. Single-Level Earth Mover’s Distance in the Temporal Domain

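One video clip P can be represented as a signature:

$$P = \{(p_1, w_{p_1}), \ldots, (p_m, w_{p_m})\}$$

where $m$ is the total number of frames, $p_i$ is the feature extracted from the $i$th frame, and $w_{p_i}$ is the weight of the $i$th frame.

We also represent another video clip Q as a signature:

$$Q = \{(q_1, w_{q_1}), \ldots, (q_n, w_{q_n})\}$$

where $n$ is the total number of frames and $q_j$, $w_{q_j}$ are defined analogously.

$d_{ij}$ is the ground distance between $p_i$ and $q_j$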
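To make the distance between two signatures concrete, the following is a minimal sketch of EMD computed as a transportation problem: find the flow between frames of P and frames of Q that minimizes the total of flow times ground distance, then divide by the total flow. It assumes equal frame weights (1/m and 1/n) and a Euclidean ground distance; the successive-shortest-path solver and all function names are illustrative choices, not something the slides specify.

```python
import math

def euclidean(p, q):
    """Ground distance d_ij between two frame features (assumed Euclidean)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def emd(P, Q, ground=euclidean):
    """EMD between clips P and Q, each a non-empty list of frame features.

    Assumes equal weights 1/m and 1/n; they are scaled by m*n so that
    every flow in the transportation problem is an integer.
    """
    m, n = len(P), len(Q)
    d = [[ground(p, q) for q in Q] for p in P]
    supply = [n] * m   # weight 1/m per frame of P, scaled by m*n
    demand = [m] * n   # weight 1/n per frame of Q, scaled by m*n
    flow = [[0] * n for _ in range(m)]
    inf, eps = float("inf"), 1e-12
    remaining = m * n
    while remaining > 0:
        # Bellman-Ford shortest paths in the residual graph.
        dist_p = [0.0 if supply[i] > 0 else inf for i in range(m)]
        dist_q = [inf] * n
        pred_p = [None] * m   # Q node reached before P node i (backward edge)
        pred_q = [None] * n   # P node reached before Q node j (forward edge)
        for _ in range(m + n):
            changed = False
            for i in range(m):
                for j in range(n):
                    if dist_p[i] + d[i][j] < dist_q[j] - eps:
                        dist_q[j], pred_q[j], changed = dist_p[i] + d[i][j], i, True
            for j in range(n):
                for i in range(m):
                    if flow[i][j] > 0 and dist_q[j] - d[i][j] < dist_p[i] - eps:
                        dist_p[i], pred_p[i], changed = dist_q[j] - d[i][j], j, True
            if not changed:
                break
        # Cheapest frame of Q that still has unmet demand.
        end = min((j for j in range(n) if demand[j] > 0), key=lambda j: dist_q[j])
        path, j, amount = [], end, demand[end]
        while True:
            i = pred_q[j]
            path.append((i, j, 1))
            if pred_p[i] is None:          # reached a frame of P with spare supply
                amount = min(amount, supply[i])
                break
            j = pred_p[i]
            path.append((i, j, -1))        # undo part of an earlier flow
            amount = min(amount, flow[i][j])
        for a, b, sign in path:
            flow[a][b] += sign * amount
        supply[i] -= amount
        demand[end] -= amount
        remaining -= amount
    cost = sum(flow[i][j] * d[i][j] for i in range(m) for j in range(n))
    return cost / (m * n)
```

Because the clip is treated as an orderless set of frames, two clips with the same frames in a different temporal order get distance zero; this is the limitation that the multilevel matching in the next section addresses.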

SVM classification
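The slide does not show how the EMD feeds the SVM. A common construction, assumed here, is to turn pairwise distances into a kernel of the form K(P, Q) = exp(-D(P, Q) / A). In this sketch the scale A is set to a multiple of the mean pairwise distance; that choice, the `scale` parameter, and the function name are assumptions for illustration.

```python
import math

def emd_kernel_matrix(distances, scale=1.0):
    """Convert a square matrix of pairwise EMD values into kernel values.

    distances: n x n nested list with D[i][j] = EMD between clips i and j.
    scale: hypothetical multiplier on the mean distance used as A.
    Requires at least one non-zero distance so that A > 0.
    """
    n = len(distances)
    mean_distance = sum(sum(row) for row in distances) / (n * n)
    a = scale * mean_distance
    return [[math.exp(-distances[i][j] / a) for j in range(n)] for i in range(n)]
```

The resulting matrix can be handed to any SVM implementation that accepts a precomputed kernel, for example scikit-learn's `SVC(kernel="precomputed")`.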

4. Temporally Aligned Pyramid Matching

Spatial Pyramid Matching (SPM)

Pyramid Match Kernel (PMK)

Temporally Constrained Hierarchical Agglomerative Clustering (T-HAC)

T-HAC
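The idea behind temporally constrained clustering can be sketched as ordinary agglomerative clustering in which only temporally adjacent segments may be merged, so every cluster is a contiguous subclip. The merge criterion below (Euclidean distance between segment means) and the function name are assumptions; the slides do not state which linkage is used.

```python
import math

def t_hac(frames, num_subclips):
    """Split a clip into contiguous subclips by merging adjacent segments.

    frames: list of per-frame feature vectors, in temporal order.
    Returns a list of (start, end) frame index ranges, end exclusive.
    """
    segments = [[i] for i in range(len(frames))]

    def mean(segment):
        dim = len(frames[0])
        return [sum(frames[i][k] for i in segment) / len(segment) for k in range(dim)]

    def gap(k):
        # Distance between the means of segment k and its right neighbour.
        a, b = mean(segments[k]), mean(segments[k + 1])
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    while len(segments) > num_subclips:
        k = min(range(len(segments) - 1), key=gap)
        # Only neighbours in time are ever merged, which keeps subclips contiguous.
        segments[k:k + 2] = [segments[k] + segments[k + 1]]
    return [(seg[0], seg[-1] + 1) for seg in segments]
```

For a pyramid, the function would be called with 2 subclips for level 1 and 4 subclips for level 2; level 0 is the whole clip.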

Alignment of Different Subclips

Principal Component Analysis (PCA)

Integer-value-constrained EMD
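When both clips have the same number of equally weighted subclips, constraining the EMD flows to integer values amounts to a one-to-one matching between subclips. The sketch below finds that matching by brute force over permutations, which is only practical for the handful of subclips per level used here; the brute-force search, the averaging of the matched distances, and the function name are illustrative assumptions rather than the method stated in the slides.

```python
from itertools import permutations

def align_subclips(D):
    """One-to-one alignment of subclips from a square distance matrix.

    D[r][c] is the distance (for example, an EMD value) between subclip r
    of one clip and subclip c of the other.
    Returns (matching, distance): matching[r] is the subclip aligned with r,
    and distance is the mean distance over matched pairs.
    """
    size = len(D)

    def total(perm):
        return sum(D[r][perm[r]] for r in range(size))

    best = min(permutations(range(size)), key=total)
    return list(best), total(best) / size
```

A matching other than the identity means the two clips show similar content in a different temporal order, for instance `align_subclips([[5, 1], [1, 5]])` pairs the first half of one clip with the second half of the other.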

Fusion of Information from Different Levels

$h_l$ is the weight for level $l$
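The fusion formula itself is not reproduced in this transcript. As a rough illustration only, the sketch below assumes that each level produces one score per test clip and that the levels are combined by a weighted sum with the level weights; what exactly is being summed (raw decision values, probabilities, or kernel values) is an assumption left open here.

```python
def fuse_levels(level_scores, level_weights):
    """Weighted sum of per-level scores for one test clip.

    level_scores: scores from level 0, level 1, ... (hypothetical inputs).
    level_weights: the corresponding level weights, one per level.
    """
    if len(level_scores) != len(level_weights):
        raise ValueError("need exactly one weight per level")
    return sum(h * s for h, s in zip(level_weights, level_scores))
```

With the weight settings listed in the experiments, equal weighting corresponds to `fuse_levels(scores, [1, 1, 1])` and the variant that emphasizes the finest level to `fuse_levels(scores, [1, 1, 2])`.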

TAPM

5. Experiments

Compare the SLEMD algorithm with simplistic detectors that use a single keyframe or multiple keyframes

Compare multilevel TAPM with the SLEMD method

Compare the midlevel CS feature with three low-level features

Single-Level EMD versus Keyframe-Based Algorithm

SLEMD algorithm, i.e., TAPM at level 0

Keyframe-based algorithm (KF-CS)

Multiframe-based representation (MF-CS)

Multilevel Matching versus Single-Level EMD

Level-0 (L0), level-1 (L1), level-2 (L2)

Combination of L0 and L1 (L0+L1)
- $h_0 = h_1 = 1$

Combination of L0, L1 and L2 (L0+L1+L2)
- $h_0 = h_1 = h_2 = 1$

Combination of L0, L1 and L2 (L0+L1+L2-d)
- $h_0 = h_1 = 1$, $h_2 = 2$

Sensitivity to Clustering Method and Boundary Precision

The Effect of Temporal Alignment

Algorithmic Complexity Analysis and Speedup

Concept Score Feature versus Low-Level Features

6. Contributions and Conclusion

First systematic studies of diverse visual event recognition in the unconstrained broadcast news domain with clear performance improvements