learning dynamics for exemplar-based gesture recognition 2005-07-27 · learning dynamics for...

8
Learning Dynamics for Exemplar-based Gesture Recognition Ahmed Elgammal Vinay Shet Yaser Yacoob Larry S. Davis Department of Computer Science, Rutgers University, Piscataway, NJ, USA Computer Vision Laboratory, University of Maryland, College Park, MD, USA Abstract This paper addresses the problem of capturing the dy- namics for exemplar-based recognition systems. Tradi- tional HMM provides a probabilistic tool to capture sys- tem dynamics and in exemplar paradigm, HMM states are typically coupled with the exemplars. Alternatively, we pro- pose a non-parametric HMM approach that uses a discrete HMM with arbitrary states (decoupled from exemplars) to capture the dynamics over a large exemplar space where a nonparametric estimation approach is used to model the ex- emplar distribution. This reduces the need for lengthy and non-optimal training of the HMM observation model. We used the proposed approach for view-based recognition of gestures. The approach is based on representing each ges- ture as a sequence of learned body poses (exemplars). The gestures are recognized through a probabilistic framework for matching these body poses and for imposing temporal constraints between different poses using the proposed non- parametric HMM. 1 Introduction The recognition of human gestures has many applica- tions in human computer interaction, virtual reality and in robotics. In the last decade there has been a notable ex- tensive interest in gesture recognition in the computer vi- sion community [5, 6, 21, 23, 24, 2] as part of a wider interest in the analysis of human motion in general. The approaches used for gesture recognition and analysis of hu- man motion in general can be classified into three major categories: model-based, appearance-based, and motion- based. Model-based approaches focus on recovering three- dimensional configuration of articulated body parts, e.g., [17]. Appearance-based approaches uses two dimensional information such as gray scale images or body silhouettes and edges, e.g., [21]. In contrast, motion based approaches attempt to recognize the gesture directly from the motion without any structural information about the physical body, e.g., [15, 2]. In all these approaches, the temporal prop- erties of the gesture are typically handled using Dynamic Time Warping (DTW) or statistically using Hidden Markov Models (HMM) such as [21, 3, 11, 24, 23] This paper presents an exemplar-based approach for view-based recognition of gestures. The approach is based on representing each gesture as a sequence of body poses (exemplars) through a probabilistic framework for matching these body poses to the the image data. Previous exemplar- based approaches, such as [7, 22], couple the system dy- namics with the exemplars where training data are used to learn both the system dynamics and the exemplar represen- tation. Alternatively, we use a model where the dynamics of the system is decoupled from the exemplars in order to achieve orthogonality between the spatial and temporal do- mains. The main contribution of the paper is introducing a non-parametric estimation approach for learning the dy- namics from large exemplar spaces where the exemplars are decoupled from the dynamics. This approach has ad- vantages that will be pointed out through the paper. The paper is organized as follows: Section 2 review the exemplar-based tracking paradigm and emphasizes some of the draw backs that arises when using it for gesture recogni- tion. Section 3 introduced the decoupled model and the pro- posed learning approach. Sections 4 and 5 present details about the observation model and the gesture classification. Section 6 contains some experimental results with simula- tion data and results of the proposed approach for recogniz- ing arm gestures. 2 Exemplar-based Model We use the definition of [8, 7]: An exemplar space is specified by a set of “exemplars”, X = {x k ,k =1 ··· K}, containing representatives of the training data, and a dis- tance function ρ that measures the distortion between any two points in the space. The work of [7] was a major step to- wards learning probabilistic models for exemplars and their spatial transformations where exemplars are considered to be centers of a probabilistic mixture. The work of [22] was another major step that introduces the use of exemplars in a metric space within the same framework. Figure 1 shows the probabilistic graphical model for exemplar-based track- ing as introduced by [7, 22]. The observation z t at time t is considered to be drawn from a probabilistic mixture, i.e., z t T α x t where x t X is the exemplar at time t and T α is Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’03) 1063-6919/03 $17.00 © 2003 IEEE

Upload: others

Post on 10-Jul-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Learning Dynamics for Exemplar-based Gesture Recognition 2005-07-27 · Learning Dynamics for Exemplar-based Gesture Recognition Ahmed Elgammal† Vinay Shet ‡ Yaser Yacoob ‡ Larry

Learning Dynamics for Exemplar-based Gesture Recognition

Ahmed Elgammal † Vinay Shet ‡ Yaser Yacoob ‡ Larry S. Davis ‡

† Department of Computer Science, Rutgers University, Piscataway, NJ, USA‡ Computer Vision Laboratory, University of Maryland, College Park, MD, USA

Abstract

This paper addresses the problem of capturing the dy-namics for exemplar-based recognition systems. Tradi-tional HMM provides a probabilistic tool to capture sys-tem dynamics and in exemplar paradigm, HMM states aretypically coupled with the exemplars. Alternatively, we pro-pose a non-parametric HMM approach that uses a discreteHMM with arbitrary states (decoupled from exemplars) tocapture the dynamics over a large exemplar space where anonparametric estimation approach is used to model the ex-emplar distribution. This reduces the need for lengthy andnon-optimal training of the HMM observation model. Weused the proposed approach for view-based recognition ofgestures. The approach is based on representing each ges-ture as a sequence of learned body poses (exemplars). Thegestures are recognized through a probabilistic frameworkfor matching these body poses and for imposing temporalconstraints between different poses using the proposed non-parametric HMM.

1 Introduction

The recognition of human gestures has many applica-tions in human computer interaction, virtual reality and inrobotics. In the last decade there has been a notable ex-tensive interest in gesture recognition in the computer vi-sion community [5, 6, 21, 23, 24, 2] as part of a widerinterest in the analysis of human motion in general. Theapproaches used for gesture recognition and analysis of hu-man motion in general can be classified into three majorcategories: model-based, appearance-based, and motion-based. Model-based approaches focus on recovering three-dimensional configuration of articulated body parts, e.g.,[17]. Appearance-based approaches uses two dimensionalinformation such as gray scale images or body silhouettesand edges, e.g., [21]. In contrast, motion based approachesattempt to recognize the gesture directly from the motionwithout any structural information about the physical body,e.g., [15, 2]. In all these approaches, the temporal prop-erties of the gesture are typically handled using DynamicTime Warping (DTW) or statistically using Hidden Markov

Models (HMM) such as [21, 3, 11, 24, 23]This paper presents an exemplar-based approach for

view-based recognition of gestures. The approach is basedon representing each gesture as a sequence of body poses(exemplars) through a probabilistic framework for matchingthese body poses to the the image data. Previous exemplar-based approaches, such as [7, 22], couple the system dy-namics with the exemplars where training data are used tolearn both the system dynamics and the exemplar represen-tation. Alternatively, we use a model where the dynamicsof the system is decoupled from the exemplars in order toachieve orthogonality between the spatial and temporal do-mains. The main contribution of the paper is introducinga non-parametric estimation approach for learning the dy-namics from large exemplar spaces where the exemplarsare decoupled from the dynamics. This approach has ad-vantages that will be pointed out through the paper.

The paper is organized as follows: Section 2 review theexemplar-based tracking paradigm and emphasizes some ofthe draw backs that arises when using it for gesture recogni-tion. Section 3 introduced the decoupled model and the pro-posed learning approach. Sections 4 and 5 present detailsabout the observation model and the gesture classification.Section 6 contains some experimental results with simula-tion data and results of the proposed approach for recogniz-ing arm gestures.

2 Exemplar-based Model

We use the definition of [8, 7]: An exemplar space isspecified by a set of “exemplars”, X = {xk, k = 1 · · ·K},containing representatives of the training data, and a dis-tance function ρ that measures the distortion between anytwo points in the space. The work of [7] was a major step to-wards learning probabilistic models for exemplars and theirspatial transformations where exemplars are considered tobe centers of a probabilistic mixture. The work of [22] wasanother major step that introduces the use of exemplars ina metric space within the same framework. Figure 1 showsthe probabilistic graphical model for exemplar-based track-ing as introduced by [7, 22]. The observation zt at time tis considered to be drawn from a probabilistic mixture, i.e.,zt ≈ Tαxt where xt ∈ X is the exemplar at time t and Tα is

Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’03) 1063-6919/03 $17.00 © 2003 IEEE

Page 2: Learning Dynamics for Exemplar-based Gesture Recognition 2005-07-27 · Learning Dynamics for Exemplar-based Gesture Recognition Ahmed Elgammal† Vinay Shet ‡ Yaser Yacoob ‡ Larry

Figure 1. Graphical model for exemplar-basedtracking

a geometric transformation with parameter α. In this model,kt is the exemplar index at time t and αt is the transforma-tion parameter at time t for the geometric transformationTα. Therefore the system state Xt = (kt, αt) at time t isboth the exemplar index kt and transformation parametersαt. Learning this probabilistic model involves learning theexemplars as a small set of representatives from the train-ing set, learning the dynamics in the form of P (Xt|Xt−1)which, typically, is assumed to be a Markovian process.

In gesture recognition, in general, there are two orthog-onal types of style variability of performing the actions:

1. Temporal style variation: variations due to how fastor how slow individuals perform different parts of thegesture.

2. Spatial style variation: due to the physical constraintsof the body, the appearance of the body at correspond-ing points of time is different between different indi-viduals. Here we are concerned with the spatial varia-tions that can not be modelled by the geometric trans-formation parameter α.

The way the exemplar-based paradigm, as defined in [22,7], handles this problem is to cluster the training data intorepresentative exemplars in the spatial domain, assumingthat the transformation parameters α will take care of thedifferent spatial styles and that the learning will lead tocompact clusters. The dynamics is learned through learn-ing the transition P (xt|xt−1) between different exemplars.So, learning the dynamics is coupled with the exemplars,which are driven by spatial similarity, i.e., the orthogonalitybetween the spatial domain and the temporal domain is notmet. This presents a major draw back in the probabilisticmodel introduced by [22, 7] when used in recognition.

In order to solve this problem, we use an alternativeprobabilistic model that decouples the state variables fromthe exemplars as shown in figure 2. In this case, the statevariable qt at time t is an abstract variable that is indepen-dent of the exemplars as in the traditional sense of an Hid-den Markov model state while the exemplars are interme-diate observations that are being emitted by the underlying

process. The final observation, zt, remains as a probabilis-tic mixture of the exemplars. This way, the orthogonalitybetween the spatial domain (the exemplars) and the tem-poral domain (the states) can be achieved. At each pointof time, the system state is an independent abstract conceptand therefore at certain point of time during the time span ofthe gesture different clusters (exemplars) might occur. Onthe other hand, the same cluster might occur at differentpoints in time.

Given this decoupled model, a leaning approach isneeded to learn P (xt|qt), i.e., the intermediate observation(exemplar) probability given the state, and the dynamicsP (qt|qt−1). If the number of exemplars is small, then this isthe same as the traditional discrete output HMM learning,where each exemplar can be treated as a discrete symboland the learning is done using the tradition Baum-Welchapproach. Unfortunately, this is not the case that we are in-terested in. Instead, we are interested in the case where thenumber of exemplars is large for the reasons pointed below.This requires an alternative approach to learn the interme-diate observation probabilities. This will be introduced inSection 3.2

Here we introduce three motivating scenarios for whichour approach will be advantageous:Large Number of Clusters: Consider the case where thereare many different spatial styles of performing a gesture,where these variations in style can not be captured by thegeometric transformation Tα. This will leads to many ex-emplars, i.e., K is large. Learning the dynamics in the formof P (Xt|Xt−1) or P (xt|xt−1) will be problematic becausewe will not have enough data per cluster to learn the tran-sitions. Therefore the learned model of the dynamics is ex-pected to specialize to the training sequences and to havepoor generalization. The approach we propose is advan-tageous in this case because system state, qt is decoupledfrom the exemplars and therefore the number of possiblestates can be limited which will lead to better generaliza-tion.Data-driven Dynamics: Consider the case where we wantto learn the dynamics from the training data directly, i.e.,no clustering is performed. This is necessary if it is notpossible to cluster the data in a meaningful way due to thehigh dimensionality of the data. For example, in the metricmixture case [22], a one dimensional distance function ρ isused to compute the exemplar distance in spite of the factthat the underlying dimensionality is high. This is becauseclustering the data in the original space will require complexintermediate representations such as parameterized contourmodels or a 3D articulated model as was indicated by [22].As a result of using such a low dimensional function, clus-ters will not necessarily correspond to the real clusters in theoriginal high dimensional space, specially if we have highvariability in the data. So the approach provided in this pa-per moves one step forward to learn the dynamics from the

Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’03) 1063-6919/03 $17.00 © 2003 IEEE

Page 3: Learning Dynamics for Exemplar-based Gesture Recognition 2005-07-27 · Learning Dynamics for Exemplar-based Gesture Recognition Ahmed Elgammal† Vinay Shet ‡ Yaser Yacoob ‡ Larry

Figure 2. Graphical model with hidden statedecoupled from exemplars

training data directly, without the need for clustering. Wewill use a low dimensional distance function, ρ, but in thiscase, since no clustering is needed, the effect of this func-tion on learning the dynamics is minimal.Imposing Constraints on the Dynamics: Dynamics inthe form of P (Xt|Xt−1), as used in the exemplar-basedparadigm as presented in [22], do not facilitate imposingconstraints on the learned dynamics. For example, in manygestures, the progress of the gesture is moving forward intime and there is no meaning to going backward. To imposesuch a constraint, certain HMM topologies might be better.For example, a left-right model as shown in figure 5 or othertopologies. By decoupling the states from the exemplars,this can be achieved as traditionally done in HMMs

3 Model Dynamics3.1 Probabilistic Model

As illustrated in figure 2, at each discrete time, t, thesystem state is denoted by the pair (qt, αt) and xt denotesthe exemplar at time t. The hidden variable qt, represent-ing a Markov stochastic process, can take any value froma set of M distinct abstract states, S = {s1, s2, · · · , sM}, .The R.V. xt can be any exemplar from the set of exemplarsX = {xk, k = 1, · · ·K}. So, there is no coupling betweenthe states and the exemplars. The system dynamics is nowdefined by the transitions P (qt|qt−1) and P (αt|αt−1).

The observation zt at time t is a probabilistic mixturefrom all the exemplars and can be calculated using

P (zt|qt, αt) =K∑

k=1

P (zt|xt = xk, αt)P (xt = xk|qt, αt)

(1)We will drop the transformation parameter α from the equa-tions since handling this parameter is well studied by thework of [7]. So the observation at time zt is

P (zt|qt) =K∑

k=1

P (zt|xt = xk)P (xt = xk|qt) (2)

We will call the term P (xt = xk|qt) the intermediate obser-vation probability.

3.2 Nonparametric Exemplar Density Estimation

Learning involves learning the transition matrix, i.e,P (qt|qt−1), learning the initial state distribution and learn-ing the intermediate observation probabilities P (xt =xk|qt). If the number of exemplars are small then this isthe same setting as the traditional discrete output HMM andlearning can be done using Baum-Welch method [16]. Aswas discussed above we are not interested in this case. In-stead, we are interested in learning the dynamics directlyfrom the whole training set, i.e., the set of exemplars isthe whole set of training examples. There is one funda-mental issue that needs to be addressed: In this case, thenumber of exemplars (discrete symbols) K is much largerthan the number of states N , i.e, K � N . In fact, ifwe use the whole training data as the exemplars set X,then during the training each exemplar will be seen exactlyonce each iteration, i.e., the intermediate observation matrixB = {bkj = P (xt = xk|qt = j)} will have only one entryacross each row. In general, if K � N the probability ofobserving certain exemplar xi might be very small althoughthe probability of observing another exemplar xk which isvery similar might be high. This is because the exemplarsare treated as discrete symbols without any similarity metricimposed.

This problem arises because of the discrete manipulationof the data. In fact, the actual observation is a continuousfunction. Treating the output as continuous requires esti-mating the pdf of the state output (observation model). Tra-ditionally, continuous output HMM has been used to modelspeech signal for speech recognition and image signals forgesture recognition. Most of these systems use parametricforms for the observation model pdf. Typically a mixtureof gaussian is used to model this density where the trainingdata is used to estimate the Gaussian mixture parameters.This process in not optimal since both the Markovian as-sumption as well as the pdf form are not good models of thegesture. The same applies for speech signal as was notedby [18]

To overcome the problem of estimating observationmodel pdf, non-parametric density estimation can be used,given that large sets of observations can be obtained. In thiscase, the training data is not used to estimate pdf parame-ters, instead the data is used to obtain an estimate of the pdfdirectly. One main advantage of this approach is that newestimates of the pdf can obtained as more data is obtained.Different Non-parametric approaches have been used to es-timate state output pdf in speech recognition as in the workof [20, 4]

Given the set of all training data X = {xk, k = 1 · · ·K}an estimate of any data point, x , (exemplar) can be obtainedusing the estimator

P̂ (x) =1K

K∑

i=1

ψh(ρ(x, xi))

Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’03) 1063-6919/03 $17.00 © 2003 IEEE

Page 4: Learning Dynamics for Exemplar-based Gesture Recognition 2005-07-27 · Learning Dynamics for Exemplar-based Gesture Recognition Ahmed Elgammal† Vinay Shet ‡ Yaser Yacoob ‡ Larry

where ψh is a kernel function with bandwidth h applied onthe exemplar distance function ρ. Let ξ(j, i) be the expectednumber of times in state j and observing exemplar i, andξ(j) be the expected number of times in state j during onecycle of the training and let M be the total number of obser-vations 1. We can obtain an estimate of the joint distributionas

P̂ (xt = xk, qt = j) =1M

K∑

i=1

ψh(ρ(xk, xi))) · ξ(j, i)

Therefore we can obtain an estimate for the exemplar ob-servation probability b̂kj using

b̂kj = P̂ (xt = xk|qt = j) =P̂ (xt = xk, qt = j)

P̂ (qt = j)

=1M

∑Ki=1 ψh(ρ(xk, xi))) · ξ(j, i)

ξ(j)M

=K∑

i=1

Cji · ψh(ρ(xk, xi))) (3)

We call Cji the occupancy coefficients which can be com-puted during the training as:

Cji =ξ(j, i)ξ(j)

(4)

Simply, the occupancy coefficients are the traditional dis-crete HMM observation probabilities.

We need to modify the Baum-Welch learning approachas follows:

Expectation Step use the estimate P̂ (xt = xk|qt = j)from equation 3 to evaluate the observation probabili-ties of the training sequences.

Maximization Step update only the coefficient matrixC = {Cji} as in the traditional Baum-Welch usingequation 4.

If we take a closer look at Equation 3 we can see thatit resembles the observation probability of the continuousoutput HMM model, which is computed as a mixture of mdistributions

bj(O) =M∑

m=1

Cjmφ(O; µm, Σm) (5)

where φ is typically a Guassian with mean µm and covari-ance Σm for the mth mixture and Cjm is the mixture coef-ficient for the mth mixture in state j2. This form of HMM

1M = K if we use the whole training data as the exemplars2This particular case of HMM where the mixture is shared among all

states is called semi-continuous HMM

requires estimation of the parameters µm and Σm for eachof the distributions as part of the training process, which isa lengthy and non-optimal process [18]. Instead, the non-parametric formalization of equation 3 avoids this parame-ter estimation process.

4 Observation Model

4.1 Pose Likelihood

Figure 3. Pose exemplars registered to an im-age

We represent each gesture as a sequence of body poses(exemplars). The temporal relation between these bodyposes is enforced using the probabilistic model presentedin section 3. This section focuses on matching individualbody poses. The objective is to evaluate all different exem-plars with respect to each new frame at time t in order toobtain estimate of observation likelihood given each poseexemplar P (zt|xt = xk).

Let the set of shape exemplars be X = {xk, k =1, · · · , K} which contains all learned body poses for all thegestures to be recognized. Each pose (exemplar) is an edgetemplate representing the body silhouette, i.e., each pose isrepresented as a finite set of contour points image coordi-nates.

xk = {yk1 , yk

2 , · · · , ykmk},

where yki ∈ R2 and mk is the number of points along

the contour for pose exemplar k. Figure 4 shows examplepose templates for two different gestures. All the poses arealigned to each other during the learning so aligning onepose to any new image will therefore align the rest of theposes. Registering these poses to the images is done whilethe person is not performing any gesture (idle). In this case,the matching is performed using an idle pose (shown in fig-ure 4, first pose on top) through a coarse to fine search usingan image pyramid. Figure 3 shows the registered poses to anew frame.

At each new image, I , it is desired to find a probabilisticmatching score for each pose. Let dF (x) be the distancetransformed image at image location, x, given the set ofedge features, F , detected at image I . For each edge fea-ture yk

i in pose model xk, the measurement Dki = dF (yk

i )

Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’03) 1063-6919/03 $17.00 © 2003 IEEE

Page 5: Learning Dynamics for Exemplar-based Gesture Recognition 2005-07-27 · Learning Dynamics for Exemplar-based Gesture Recognition Ahmed Elgammal† Vinay Shet ‡ Yaser Yacoob ‡ Larry

Figure 4. Example body poses from two different Gesture

is the distance to the nearest edge feature in the image. Aperfect model to image match will have Dk

i = 0 for allmodel edge features. Consider the random variable associ-ated with this distance measurement, and let the associatedprobability density function (PDF) be pk

i . We assume thatthese random variables are independent. This assumptionwas used in [12, 14] based on the results obtained in [13].Therefore, the observation likelihood (the probability of theobservation given the exemplar xk) can be defined as theproduct of these PDFs as

P (zt|xt = xk) =mk∏

i=1

pki (Dk

i ) (6)

The PDF pki for the distance between model features

and nearest image feature location is defined for each fea-ture i in each pose model k. We use a PDF of the form

pki (D) = c1 +

1σ√

2πe−D2/2(σ)2

Since the distance D can become arbitrary large, the prob-ability can become very small and therefore the constant c1

is used as a lower bound on the probability. This makes thelikelihood function robust to outliers.

This probabilistic formulation was first introducedin [12] and was used in a Hausdorff matching context to findthe best transformation of an edge template using maximumlikelihood estimation. Equation 6 represents a probabilis-tic formulation for Chamfer matching. Chamfer distancehas been used extensively in object detection, for examplein [10]. In [9] the matching was generalized to include mul-tiple feature types, (for example, oriented edges) by match-ing each individual feature template with its correspondingdistance transformed image and combining the results. Alsothe matching was generalized in [9, 10] to match multipletemplates through a hierarchical template structure.

4.2 Weighted Matching

Our objective is to match multiple pose exemplars to thesame image location in order to evaluate the likelihood ofthe observation given each of these poses. Typically, thedifferent pose templates are similar in some parts and dif-ferent in another parts in the templates. For example, the

head, torso and bottom parts of the body are likely to besimilar in different pose templates, while articulated bodyparts that are involved in the gesture, such as the arm, willbe at different positions at different pose templates. For ex-ample, see figure 3. Since the articulated part, such as thearm, is represented by a small number of features with re-spect to the whole pose templates, the matching is likelyto be biased by the major body parts. Instead, it is desiredto make the matching biased more by articulated parts in-volved in performing the gesture since these parts will bemore discriminating between different poses templates.

To achieve this goal, different weights are assigned todifferent feature points in each pose exemplar. Thereforeeach pose exemplar, xk, is represented as a set of featurelocations as well as a set of weights, {wk

1 , wk2 , · · · , wk

mk},

corresponding to each feature where∑mk

i=1 wki = 1. The

likelihood equation 6 can be written in terms of weightedlog-likelihood as

log P (zt|xk) =mk∑

i=1

wki log pk

i (Dki ) (7)

In our case, the set of all recognized poses does not havea common correspondence frame. For example, some fea-tures in one pose might not have corresponding features inanother pose. Also we do not restrict the pose templates tohave the same number of features. Therefore we drive theweights with respect to the image locations.

Let X be the set of all features in all registered poses inthe training data, i.e.,

X =⋃

k

xk = {x1, x2, · · ·xm}

where each xi is the image location of an edge feature.Given this sample of edge feature locations, the edge proba-bility distribution f(y) (the probability to see an edge at cer-tain image location, y) can be estimated using kernel densityestimation [19] as

f̂(y) =1m

m∑

i=1

Kh(y − xi)

Where Kh is a kernel function with a scale variable h. We

Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’03) 1063-6919/03 $17.00 © 2003 IEEE

Page 6: Learning Dynamics for Exemplar-based Gesture Recognition 2005-07-27 · Learning Dynamics for Exemplar-based Gesture Recognition Ahmed Elgammal† Vinay Shet ‡ Yaser Yacoob ‡ Larry

used a Gaussian kernel Kh(t) = 1√2πh

e−1/2( th )2 for this

probability estimation.The weight assigned to each feature point is based on

the information this feature provides. Given the estimatededge probability distribution, f̂(y), at any image pixel, y,the weight for a certain feature i at a certain pose k is theratio of the information given by this feature to the totalinformation by that pose, i.e.,

wki =

log f̂(xki )

∑mk

j=1 log f̂(xkj )

5 Gesture Classification

Figure 5. Left-Right HMM

This section summarizes the gesture classification pro-cedure. We represent each gesture g by an exemplar space,(Xg, ρ), and an HMM , λg . The exemplar space is definedby a set of shape exemplars Xg = {xk, k = 1, · · · , K},representing different body poses during the gesture, anda distance function ρ which is defined as the symmetricchamfer distance, i.e., given exemplar xk = {yk

1 , · · · , ykmk}

and exemplar xl = {yl1, · · · , yl

ml} the symmetric distanceρ(xk, xl) is computed as,

ρ(xk, xl) =1

ml

ml∑

i

dxk(yli) +

1mk

mk∑

i

dxl(yki ).

The HMM hidden states correspond to the progress of thegesture with time. We used a left-right model as in figure 5to impose a constraint on the dynamics which leads to bettergeneralization since there are less transitions to adjust. Notethat the number of states is different from the number ofposes as mentioned above. Training sequences of poses areused to learn the model parameters:

1. The state transition probabilities A = {aij} whereaij = P [qt+1 = sj |qt = si] ∀i, j = 1 · · ·N .

2. The initial state distribution π where πj = P [q1 =sj ] ∀j = 1 · · ·N .

3. The probability of each pose xk given the states, B ={bkj = P (xk|sj) ∀j = 1 · · ·N, ∀k = 1 · · ·K}

The learning is performed using the nonparametric learningapproach in section 3.2.

The actual observation zt is the detected edge featuresat each new frame, which is a probabilistic function of the

current state of the gesture as defined in equation 2. Theobservation probabilities given the exemplar, P (zt|xk), areobtained using the likelihood equations 6 and 7 as was de-scribed in section 4

Given a set of observations Z = z1z2, · · · , zT and givena set of HMM models λg corresponding to different gesture,the objective is to determine the probability of that obser-vation sequence given each of the models, i.e., P (Z|λg)∀g.This is a traditional problem for HMM and can be solved ef-ficiently through the Forward-Backward procedure [16, 1].

6 Experimental results

−0.2 0 0.2 0.4 0.6 0.8 1 1.2−1.5

−1

−0.5

0

0.5

1

1.5

Figure 6. Simulation data.

We performed simulation experiments to evaluate theperformance of the proposed learning approach. We usedsynthetic data to emulate a gesture progressing over timewith spatial and temporal variations. We used data ofthe form y = a sin(bx) + c where the parameters a, b, cwere generated randomly. The variation parameters a, b, care meant to emulate actual gesture spatial variations thatcannot be recovered through the geometric transformationparameter α as discussed in section 2. Figure 6-a illus-trates the used data (50 sequences of 20 points each). Wemodelled the dynamics of the data using both the coupledexemplar-based model (shown in figure 1) and using an theuncoupled model (shown in figure 2). For the uncoupledmodel, we estimated the intermediate observations usingthe non-parametric approach presented in section 3.2. Inboth cases we performed cross-evaluation by splitting thedata into training and test sets. For the coupled model,we varied the number of clusters at each learning experi-ment where the cluster centers are considered to be the ex-emplars. Figure 7-a shows the log-likelihood for both thetraining data and the test data for different number of clus-ters. As expected, as the number of clusters increases, themodel specializes to the training data and yields poor gen-eralization. For the uncoupled nonparametric model thereis no clustering performed and the learning is done from thewhole training data. Figures 7-b,c show the learning curves(in terms of log-likelihood) with two different kernels: aGaussian kernel and a Parzen window respectively. Eachfigure shows the log-likelihood for both the training and testdata. As can be noticed from the figures, in both cases the

Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’03) 1063-6919/03 $17.00 © 2003 IEEE

Page 7: Learning Dynamics for Exemplar-based Gesture Recognition 2005-07-27 · Learning Dynamics for Exemplar-based Gesture Recognition Ahmed Elgammal† Vinay Shet ‡ Yaser Yacoob ‡ Larry

0 10 20 30 40 50 60 70 80 90 100−45

−40

−35

−30

−25

−20

−15

−10

Number of Clusters

Log−

likel

ihoo

d

Evaluation SetTraining Set

0 2 4 6 8 10 12 14 16 18−65

−60

−55

−50

−45

−40

−35

Itrations

Log−

Like

lihoo

d

Training SetEvaluation Set

0 5 10 15−65

−60

−55

−50

−45

−40

−35

Iterations

Log

− li

kelih

ood

TrainingEvaluation

(a) (b) (c)

Figure 7. (a) Coupled model evaluation with different clustering. (b,c) Nonparametric learning ofuncoupled model with Gaussian kernel and a Parzen window

leaning curves converge which shows the generalization.

The proposed approach was experimented with real im-ages for arm gesture classification. Here we show two dif-ferent gesture classification experiments. In the first exper-iment, the proposed approach was used to classify eightarm gestures. Basically, the eight recognized gestures aresimilar to the ones shown in figure 3, performed with eacharm, in upward motion and downward motion. Figure 8shows some pose classification results for different people.The figures shows the pose with the highest likelihood scoreoverlaid over the original image.

Figure 8. Pose matching results

Figure 9 shows the gesture likelihood probabilities forthe eight gesture classes. from the graphs, all the gestureswere close in likelihood at the beginning of the action butas the gesture progresses with time, the likelihood of thecorrect gesture increases, and the the likelihood of the othergesture decreases as a result of the temporal constrains im-posed by the HMM for each gesture. Each plot shows twoconsecutive gestures performed and as one gesture was rec-ognized all the HMMs were reset.

In the second experiment we used six gestures as shownin figure 10 where in this case each gesture consists of anupward followed by a downward arm motion. In this ex-periments, pose exemplars were obtained from five differ-

0 5 10 15 20 25 30 35 400

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Frames

Gestu

re Lik

eliho

od

Turn Left UpTurn Right UpStop Left UpStop Right UpTurn Left DownTurn Right DownStop Left DownStop Right Down

0 5 10 15 20 25 30 35 400

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9Turn Left UpTurn Right UpStop Left UpStop Right UpTurn Left DownTurn Right DownStop Left DownStop Right Down

Figure 9. Gesture classification resultsent people where the total number of exemplars were 203,195, 194, 216, 232, 208 exemplars for each of the six ges-ture classes respectively. Six fixed topology HMMs with13 states each were used (one for each gesture class) wherelearning the HMM parameters and the intermediate obser-vation probabilities was performed using the approach de-scribed in section 3.2. We trained five sets of HMMs whereeach HMM set was trained with one person’s exemplars leftout. For the evaluation we used 150 gesture sequences (5people × 5 cycles × 6 gestures. For classifying any ges-ture sequence for a particular person, we used an HMM setthat was trained with this person’s exemplars left out. Theclassification results are shown in figure 6 in terms of a con-fusion matrix.

Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’03) 1063-6919/03 $17.00 © 2003 IEEE

Page 8: Learning Dynamics for Exemplar-based Gesture Recognition 2005-07-27 · Learning Dynamics for Exemplar-based Gesture Recognition Ahmed Elgammal† Vinay Shet ‡ Yaser Yacoob ‡ Larry

50 100 150 200 250 300

50

100

150

200

50 100 150 200 250 300

50

100

150

200

50 100 150 200 250 300

50

100

150

200

Turn Left Turn right Turn both

50 100 150 200 250 300

50

100

150

200

50 100 150 200 250 300

50

100

150

200

50 100 150 200 250 300

50

100

150

200

Stop left Stop right Stop both

Figure 10. Six gestures: Each gesture con-sists of an upward followed by a downwardarm motion

turn left 23 0 0 2 0 0turn right 0 19 0 0 6 0turn both 2 0 21 0 2 0stop left 2 0 0 23 0 0stop right 0 2 0 0 23 0stop both 0 0 2 2 2 19

Figure 11. Confusion Matrix

7 Conclusion

The paper presented an approach for learning the dy-namics for exemplar-based recognition systems and its ap-plication to gesture recognition. The key contribution ofthe paper is an approach for learning HMM parameters thatutilizes nonparametric density estimation for modeling theobservation density. This facilitates learning the dynamicsfrom large exemplar spaces where nonparametric estima-tion is used to model the exemplar distribution. The proba-bilistic model used decouples the system dynamics from theexemplar space in order to achieve orthogonality betweenthe spatial and temporal domains. The approach was ex-perimented with simulation data and was used to recognizesimple arm gestures. The experiments showed the ability ofthe learning approach to capture the system dynamics.

References

[1] Y. Bengio. Markovian models for sequential data. NueralComputing Suveys, pages 129–162, 1999.

[2] A. F. Bobick and J. W. Davis. The recognition of humanmovement using temporal templates. IEEE Transactions onPattern Analysis and Machine Intelligence, 23(3):257–267,2001.

[3] M. Brand, N. Oliver, and A. Pentland. Coupled hiddenmarkov models for complex action recognition. In Proc.CVPR, 1997.

[4] M. C., C. M.-J., and L. F. K-nn versus gaussian in a hmm-based recognition system. In Eurospeech, Rhodes, pages529–532, 1997.

[5] T. Darrell and A. Pentland. Space-time gesture. In ProcIEEE CVPR, 1993.

[6] J. Davis and M. Shah. Visual gesture recognition. Vision,Image and Signal Processing, 141(2):101–106, 1994.

[7] B. J. Frey and N. Jojic. Learning graphical models of im-ages, videos and their spatial transformation. In Proceed-ings of the Sixteenth Conference on Uncertainty in ArtificialIntelligence - San Francisco, CA, 2000.

[8] B. J. Frey and N. Jojic. Flexible models: A powerful alter-native to exemplars and explicit models. In IEEE ComputerSociety Workshop on Models vs. Exemplars in Computer Vi-sion, pages 34–41, 2001.

[9] D. Gavrila. Multi-feature hierarchical template matching us-ing distance transforms. In Proc. of the International Con-ference on Pattern Recognition, Brisbane, Australia,1998.,pages 439–444.

[10] D. Gavrila and V. Philomin. Real-time object detection for“smart“ vehicles. In ICCV99, pages 87–93.

[11] C. Morimoto, Y. Yacoob, and L. Davis. Recognition of headgestures using hidden markov models international confer-ence on pattern recognition. In International Conference onPattern Recognition, Vienna, Austria, August 1996, pages461–465, 1996.

[12] C. Olson. A probabilistic formulation for hausdorff match-ing. In CVPR98, pages 150–156.

[13] C. Olson and D. Huttenlocher. Automatic target recognitionby matching oriented edge pixels. IEEE Transactions onImage Processing, (1):103–113, 1997.

[14] C. F. Olson. Maximum-likelihood template matching. InIEEE International Conference on Computer Vision andPattern Recognition, volume 2, pages 52–57, 2000.

[15] R. Polana and R. C. Nelson. Detecting activities. Journalof Visual Communication and Image Representation, June1994.

[16] L. R. Rabiner. A tutorial on hidden markov models and se-lected applications in speech recognition. Proceedings of theIEEE, 77(2):257–285, February 1989.

[17] J. M. Rehg and T. Kanade. Model-based tracking of self-occluding articulated objects. In ICCV, pages 612–617,1995.

[18] S. Renals, N. Morgan, H. Bourlard, M. Cohen, andH. Franco. Connectionist probability estimators in HMMspeech recognition. IEEE Transactions Speech and AudioProcessing, 1993.

[19] D. W. Scott. Mulivariate Density Estimation. Wiley-Interscience, 1992.

[20] S. Soudoplatoff. Markov modelling of continous parame-ters in speech recognition. In Proceedings of IEEE ICASSP,pages 45–48, 1986.

[21] T. Starner and A. Pentland. Real-time american sign lan-guage recognition from video using hidden markov models.In SCV95, page 5B Systems and Applications, 1995.

[22] K. Toyama and A. Blake. Probabilistic tracking in a metricspace. In ICCV, pages 50–59, 2001.

[23] C. Vogler and D. N. Metaxas. Parallel hidden markov mod-els for american sign language recognition. In ICCV (1),pages 116–122, 1999.

[24] A. D. Wilson and A. F. Bobick. Parametric hidden markovmodels for gesture recognition. IEEE Transactions onPattern Analysis and Machine Intelligence, 21(9):884–900,1999.

Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’03) 1063-6919/03 $17.00 © 2003 IEEE