DONG XU, MEMBER, IEEE, AND SHIH-FU CHANG, FELLOW, IEEE
Video Event Recognition Using Kernel Methods
with Multilevel Temporal Alignment
Outline
1. Introduction
2. Scene-Level Concept Score Feature
3. Single-Level Earth Mover’s Distance in The Temporal Domain
4. Temporally Aligned Pyramid Matching
5. Experiments
6. Contributions and Conclusion
1. Introduction
Previous work on video event recognition can be roughly classified as either activity recognition or abnormal event recognition
Model-based
Abnormal event recognition
- Zhang et al. [1] propose a semisupervised adapted Hidden Markov Model (HMM) framework
Activity recognition
- HMM
- coupled HMM
- Dynamic Bayesian Network
Appearance-based
Abnormal event recognition
- Boiman and Irani [7]
Activity recognition
- Ke et al. [8]
- Efros et al. [9]
- Other
Event recognition in broadcast news video
Rich information
Emerging applications of open source intelligence
Online video search
LSCOM ontology
Large-Scale Concept Ontology for Multimedia
Defined 56 event/activity concepts
Manual annotation of such event concepts has been completed for a large data set in TRECVID 2005 [15]
Challenges of events in news video
Large variations of scenes and activities
Difficult to
- reliably track moving objects
- detect the salient spatiotemporal interest regions
- extract the spatial-temporal features
Address the challenges of news video
Ebadollahi et al. [17]
midlevel Concept score (CS)
nonparametric approach
bag-of-words model
Bag-of-words model
Represent one video clip as a bag of orderless features, extracted from all of the frames
Earth Mover’s Distance (EMD) [21]
Single-level EMD (SLEMD)
Support Vector Machine (SVM)
Temporally Aligned Pyramid Matching (TAPM)
2. Scene-Level Concept Score Feature
Holistic features to represent content in constituent image frames
Multilevel temporal alignment framework to match temporal characteristics of various events
We used these holistic features because they are
Efficiently extracted over the large video corpus
Effective for detecting several concepts
Suitable for capturing the characteristics of scenes
3. Single-Level Earth Mover’s Distance in The Temporal Domain
One video clip P can be represented as a signature: $P = \{(p_1, w_{p_1}), \ldots, (p_m, w_{p_m})\}$
m is the total number of frames, $p_i$ is the feature extracted from the ith frame, and $w_{p_i}$ is the weight of the ith frame
We also represent another video clip Q as a signature: $Q = \{(q_1, w_{q_1}), \ldots, (q_n, w_{q_n})\}$
n is the total number of frames
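For reference, the standard Earth Mover's Distance between two such signatures can be written as below. This is the general definition from the EMD literature [21], not a formula copied from these slides. The symbols $q_j$ and $w_{q_j}$ (feature and weight of the jth frame of Q), the ground distance $d_{ij}$ between $p_i$ and $q_j$, and the flow $f_{ij}$ are notation assumed here for illustration.

```latex
% Optimal flow: move as much weight as possible at minimum total cost
\hat{f} = \arg\min_{f} \sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}\, d_{ij}
\quad \text{subject to} \quad
\begin{cases}
f_{ij} \ge 0, & 1 \le i \le m,\ 1 \le j \le n \\
\sum_{j=1}^{n} f_{ij} \le w_{p_i}, & 1 \le i \le m \\
\sum_{i=1}^{m} f_{ij} \le w_{q_j}, & 1 \le j \le n \\
\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} = \min\!\left( \sum_{i=1}^{m} w_{p_i},\ \sum_{j=1}^{n} w_{q_j} \right)
\end{cases}

% EMD: cost of the optimal flow, normalized by the total flow
D(P, Q) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} \hat{f}_{ij}\, d_{ij}}
               {\sum_{i=1}^{m} \sum_{j=1}^{n} \hat{f}_{ij}}
```

If every frame is weighted equally, so that the weights of each clip sum to 1, the denominator equals 1 and the distance is simply the cost of the optimal flow.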
4. Temporally Aligned Pyramid Matching
Spatial Pyramid Matching (SPM)
Pyramid Match Kernel (PMK)
Temporally Constrained Hierarchical Agglomerative Clustering (T-HAC)
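The idea behind a temporally constrained agglomerative clustering can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that only temporally adjacent clusters may be merged, and it uses the Euclidean distance between cluster centroids as the merge criterion, which is an assumption made here. The function names `temporally_constrained_hac`, `centroid` and `euclidean` are hypothetical.

```python
import math


def centroid(frames):
    """Mean feature vector of a non-empty list of frame features."""
    dim = len(frames[0])
    return [sum(f[k] for f in frames) / len(frames) for k in range(dim)]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def temporally_constrained_hac(frames, num_subclips):
    """Partition a frame sequence into contiguous subclips.

    Returns a list of half-open (start, end) index ranges in temporal order.
    Assumption: merge criterion is the centroid distance of adjacent clusters.
    """
    if not 1 <= num_subclips <= len(frames):
        raise ValueError("num_subclips must be between 1 and len(frames)")

    # Start with every frame as its own cluster.
    clusters = [(i, i + 1) for i in range(len(frames))]

    def gap(i):
        # Distance between cluster i and its right-hand temporal neighbor.
        left = frames[clusters[i][0]:clusters[i][1]]
        right = frames[clusters[i + 1][0]:clusters[i + 1][1]]
        return euclidean(centroid(left), centroid(right))

    while len(clusters) > num_subclips:
        # Temporal constraint: only neighboring clusters are merge candidates.
        best = min(range(len(clusters) - 1), key=gap)
        merged = (clusters[best][0], clusters[best + 1][1])
        clusters[best:best + 2] = [merged]

    return clusters


# Toy one-dimensional "features" with three visually distinct stages.
toy_frames = [[0.0], [1.0], [2.0], [50.0], [51.0], [90.0]]
subclips = temporally_constrained_hac(toy_frames, 3)
```

Because merges are restricted to neighbors, every resulting subclip is a contiguous run of frames, which is what makes the subclips usable as temporal stages of an event.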
5. Experiments
SLEMD algorithm versus the simplistic detectors that use a single keyframe or multiple keyframes
Multilevel TAPM versus the SLEMD method
Midlevel CS feature versus three low-level features
Single-Level EMD versus Keyframe-Based Algorithm
SLEMD algorithm, i.e., TAPM at level-0
Keyframe-based algorithm (KF-CS)
Multiframe-based representation (MF-CS)
Multilevel Matching versus Single-Level EMD
Level-0 (L0), level-1 (L1), level-2 (L2)
Combination of L0 and L1 (L0+L1): h0 = h1 = 1
Combination of L0, L1 and L2 (L0+L1+L2): h0 = h1 = h2 = 1
Combination of L0, L1 and L2 (L0+L1+L2-d): h0 = h1 = 1, h2 = 2
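The weight settings above can be illustrated with a small sketch. It assumes that the per-level distances are fused as a plain weighted sum with weights h0, h1, h2; the exact fusion rule is not stated on these slides, so treat this as one plausible reading. The function name `fuse_level_distances` and the example distance values are hypothetical.

```python
def fuse_level_distances(level_distances, level_weights):
    """Weighted sum of per-level distances (assumed fusion rule).

    level_distances: distances between two clips at levels 0, 1, 2, ...
    level_weights:   the corresponding weights h0, h1, h2, ...
    """
    if len(level_distances) != len(level_weights):
        raise ValueError("need exactly one weight per level")
    return sum(h * d for h, d in zip(level_weights, level_distances))


# Hypothetical distances between two clips at levels 0, 1 and 2.
distances = [4.0, 3.0, 2.0]

fused_l0_l1 = fuse_level_distances(distances[:2], [1, 1])   # L0+L1
fused_equal = fuse_level_distances(distances, [1, 1, 1])    # L0+L1+L2
fused_dyadic = fuse_level_distances(distances, [1, 1, 2])   # L0+L1+L2-d
```

With h2 = 2, the finest level contributes twice as much as the other two, so the L0+L1+L2-d setting puts more emphasis on the most detailed temporal alignment.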