Learning and Recognizing Activities in Streams of Video

Dinesh Govindaraju

Motivation

Activity recognition from video for higher functionality

Who is presenting agenda item

Attendee interest levels

Motivation

Want recognition to be automatic, without hand generation of models

Hand generation is impractical when there are many activities

Hand generation is less versatile, as you might be constrained to particular aspects of the problem

Problem Definition

Video data

Observations are extracted (movement deltas via face tracking)

Hand label training segments

Learn underlying models from training segments

Carry out activity recognition

Approach - Learning

Assume underlying models can be approximated by HMMs

Use Baum-Welch to learn the best model from the training segments

Need to find observation space and number of states

Approach - Learning

To find observation space:

Run through all training segments and add observations

For a new observation when doing recognition, augment learned observation matrices
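The two steps above can be sketched as follows. This is a minimal illustration, not the author's implementation: the function names are hypothetical, observations are assumed to be hashable values such as discretized (dx, dy) movement deltas, and the small probability `floor` given to an unseen observation is an assumption, since the slides do not say how the new matrix column is filled.

```python
def build_observation_space(training_segments):
    """Map each distinct observation seen in training to a column index."""
    index = {}
    for segment in training_segments:
        for obs in segment:
            if obs not in index:
                index[obs] = len(index)
    return index


def augment_emissions(emissions, index, new_obs, floor=1e-6):
    """Add a column for an observation first seen at recognition time.

    Every state gets the small probability `floor` for the new symbol
    (an assumed choice), and each row is renormalized to sum to 1.
    Returns new objects; the inputs are left unchanged.
    """
    if new_obs in index:
        return emissions, index
    new_index = dict(index)
    new_index[new_obs] = len(index)
    augmented = []
    for row in emissions:
        extended = list(row) + [floor]
        total = sum(extended)
        augmented.append([p / total for p in extended])
    return augmented, new_index
```

Renormalizing each row keeps the observation matrix a valid probability distribution per state, so likelihoods computed with the augmented matrix remain comparable across activities.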

Approach - Learning

To find number of states, Q (for each activity):

Set upper bound as length of longest training segment

Iterate over values and generate most likely model using Baum-Welch

Approach - Learning

To find number of states, Q (for each activity):

Choose best Q using N-fold cross validation using criterion of discriminative power

With best Q, run Baum-Welch using a number of sets of randomly initialized parameters to get λ_a, the model for activity a
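The search over Q might look like the sketch below. It is an illustration under stated assumptions: `train` (for example, Baum-Welch training with Q states) and `score` (for example, a measure of discriminative power on held-out segments) are hypothetical callables supplied by the caller, higher scores are assumed to be better, and folds are formed by simple striding. The slides do not define any of these details.

```python
def choose_num_states(segments, n_folds, train, score):
    """Pick the number of states Q by N-fold cross-validation.

    train(training_segments, q) -> model and score(model, held_out) -> float
    are hypothetical callables; higher score is assumed to be better.
    Q ranges from 1 up to the length of the longest training segment.
    """
    max_q = max(len(s) for s in segments)
    best_q, best_score = None, float("-inf")
    for q in range(1, max_q + 1):
        fold_scores = []
        for k in range(n_folds):
            # Every n_folds-th segment, starting at k, is held out.
            held_out = segments[k::n_folds]
            training = [s for i, s in enumerate(segments) if i % n_folds != k]
            if not held_out or not training:
                continue
            model = train(training, q)
            fold_scores.append(score(model, held_out))
        if not fold_scores:
            continue
        mean = sum(fold_scores) / len(fold_scores)
        if mean > best_score:
            best_q, best_score = q, mean
    return best_q
```

Ties go to the smaller Q because a later value must score strictly higher to replace the current best, which favors the simpler model.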

Approach - Recognition

Define a window width, w

From the beginning, sequentially consider windows of observations (where L is length of entire sequence)


Calculate likelihood of each window segment

L. R. Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceedings of the IEEE, 1989
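The likelihood of a window under a discrete HMM can be computed with the forward algorithm described in the tutorial cited above. The sketch below is a minimal version with per-step scaling to avoid underflow. The function name and the list-of-lists layout for the parameters are assumptions made for illustration.

```python
import math


def log_likelihood(obs, pi, A, B):
    """Return log P(obs | model) using the scaled forward algorithm.

    obs: list of observation indices
    pi[i]: probability of starting in state i
    A[i][j]: probability of moving from state i to state j
    B[i][k]: probability that state i emits observation k
    """
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate one step, then weight by the emission probability.
            alpha = [
                sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot produce this window
        log_prob += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_prob
```

Working in log space matters here because the product of many small probabilities underflows quickly, even for windows of a few dozen frames.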


Label middle frame in each window with activity with highest likelihood
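The windowing and labeling steps can be sketched as below. The sketch assumes the window advances one frame at a time, which gives L - w + 1 windows. That count is an assumption, since the expression on the original slide did not survive. `scorers` is a hypothetical mapping from activity name to a function returning the log-likelihood of a window under that activity's model.

```python
def label_frames(observations, scorers, w):
    """Label the middle frame of every length-w window.

    scorers maps an activity name to a function that returns the
    log-likelihood of a window under that activity's model. Frames that
    are never the middle of a window (near both ends) stay None.
    """
    L = len(observations)
    labels = [None] * L
    for start in range(L - w + 1):
        window = observations[start:start + w]
        # Pick the activity whose model scores this window highest.
        best = max(scorers, key=lambda name: scorers[name](window))
        labels[start + w // 2] = best
    return labels
```

With an odd w, the labeled frame is the exact center of the window. The first and last w // 2 frames receive no label in this sketch.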

Evaluation and Results

Activities being observed:


Observation stream obtained from 87 second long image sequence

1296 individual frames

Example frames after face detection:


Observation sequence first hand labeled

Segments showing same activity extracted

4 training segments used to learn each activity


Once underlying models were learned, calculate likelihood using sliding window

A value of 21 was used for the window width, w, as this was the average length of the training segments


Carry out recognition using the likelihoods by assigning activities to the frames

Compare against hand assigned labels

Accuracy approximately 76%
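A frame-level accuracy of this kind can be computed as in the sketch below. How unlabeled frames were handled in the reported figure is not stated, so skipping frames whose predicted label is None is an assumption, as is the function name.

```python
def frame_accuracy(predicted, hand_labels):
    """Fraction of labeled frames whose predicted activity matches the hand label.

    Frames with a predicted label of None are skipped (an assumed convention).
    """
    pairs = [(p, h) for p, h in zip(predicted, hand_labels) if p is not None]
    if not pairs:
        return 0.0
    return sum(p == h for p, h in pairs) / len(pairs)
```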


[Figure: frame-by-frame comparison of labels. Algorithm assigned: different from hand label / same as hand label. Hand assigned: different from algorithm label / same as algorithm label.]

Future Work

Learn underlying model generating sequence of activities themselves

Standardize lengths of training segments using Dynamic Time Warping and use that as the window width
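Dynamic Time Warping is named only as future work, and the slides do not say how it would be applied. As background, the sketch below computes the classic DTW alignment cost between two sequences of different lengths. One-dimensional numeric observations and an absolute-difference local cost are assumptions made for illustration.

```python
def dtw_distance(a, b):
    """Return the DTW alignment cost between numeric sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]
```

Two segments that differ only in how long each value is held align at zero cost, which is the property that makes DTW a candidate for bringing training segments to a common length.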

The End

Questions
