
Pattern Recognition Letters 34 (2013) 1780–1788


A template matching approach of one-shot-learning gesture recognition

Upal Mahbub a,*, Hafiz Imtiaz a, Tonmoy Roy a, Md. Shafiur Rahman a, Md. Atiqur Rahman Ahad b

a Department of Electrical & Electronic Engineering, Bangladesh University of Engineering & Technology, Dhaka-1000, Bangladesh
b Department of Applied Physics, Electronics & Communication Engineering, University of Dhaka, Bangladesh

* Corresponding author. E-mail addresses: [email protected] (U. Mahbub), [email protected] (H. Imtiaz), [email protected] (T. Roy), [email protected] (M.S. Rahman), [email protected] (M.A. Rahman Ahad).

0167-8655/$ - see front matter © 2012 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.patrec.2012.09.014

Article info

Article history: Available online 2 October 2012

Keywords: Gesture recognition; Depth image; Motion history image; 2D Fourier transform

Abstract

This paper proposes a novel approach for gesture recognition from motion depth images based on template matching. Gestures can be represented with image templates, which in turn can be used to compare and match gestures. The proposed method uses a single example of an action as a query to find similar matches and is thus termed one-shot-learning gesture recognition. It does not require prior knowledge about actions, foreground/background segmentation, or any motion estimation or tracking. The proposed method also introduces a novel approach to separate different gestures from a single video. Moreover, the method is based on the computation of space–time descriptors from the query video which measure the likeness of a gesture in a lexicon. These descriptors include the standard deviation of the depth images of a gesture as well as the motion history image. Furthermore, the two-dimensional discrete Fourier transform is employed to reduce the effect of camera shift. The comparison is based on the correlation coefficient of the image templates, and an intelligent classifier is proposed to ensure better recognition accuracy. Extensive experimentation is done on a very complicated and diversified dataset to establish the effectiveness of the proposed methods.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

Gestures are non-verbal bodily actions used to communicate among humans and with computers. Automatic gesture recognition systems are necessary and integral parts of any human–machine interaction system. A human gesture recognition algorithm converts a high-bandwidth video into a compact description of the presence and movement of the human subjects in that scene. This information enables a human user to control a variety of devices using gestures alone (Ahad, 2011; Ahad et al., 2008). Gesture recognition systems have a wide range of applications, from the control of home appliances to marshaling aircraft, the development of intelligent surveillance systems, and machine interaction with kids for game control, to communication with deaf people through sign languages. Nowadays, automatic gesture recognition is also widely applied for advanced analysis in fields such as biomechanics, medicine, sports, and cinema.

Understanding human motion is crucial since machines are required to interact more and more intelligently and effortlessly with a human-inhabited environment. However, most machine-extracted human movement information comes from static events, such as key presses. In order to improve real-time machine capabilities, it is important to represent motion. However, due to various limitations, no single approach seems to work sufficiently well in understanding and recognizing action. Existing methods can be classified into several categories, such as view/appearance-based, model-based, space–time volume-based, or direct motion-based methods (Ahad et al., 2012). Template-matching approaches (Bobick and Davis, 2001) consist of simpler and faster algorithms for motion analysis or recognition that can represent an entire video sequence in a single image format. Nowadays, approaches related to spatio-temporal interest points (STIP) are becoming promising for action representation (Laptev and Lindeberg, 2003).

In this paper, an efficient one-shot-learning mechanism is developed based on a sequence of depth images of human gestures. This involves the use of depth images obtained utilizing infrared depth sensors. Motion-based template matching techniques are employed on the training videos to obtain the key features from the gesture sequences. The feature vector is formed by employing statistical operations and spatio-temporal motion information. The testing phase is divided into several steps. First, different gestures are separated from the long test sequences. Then, similarly to the training feature vectors, test feature vectors are generated for each gesture. Finally, every test feature vector is compared to each training feature vector for the different gestures, and a classifier is employed to find the best possible match of a gesture from the given training vocabulary. The performance of the recognition algorithm is evaluated with respect to the percentage of Levenshtein distance (LD), and a satisfactory recognition performance is achieved, even though the learning procedure is one-shot, for a wide variety of gestures such as sign languages, cluttered background scenarios, partially visible human figures, etc.

2. Related works

Action and gesture recognition are subjects of extensive research, and a number of methods have been developed (Ahad, 2011), some of which are already being implemented in practical applications. Approaches differ mainly on the basis of the representation of the action, which is commonly based on appearance, shape, spatio-temporal orientation, optical flow, interest points, and volume. For example, in the spatio-temporal template based approaches, an image sequence is used to prepare a motion energy image (MEI) and a motion history image (MHI), which indicate the regions of motion and preserve the time information of the motion as well (Ahad et al., 2008; Bobick and Davis, 2001). An extension of this approach to 3D was proposed in (Weinland et al., 2006). Recently, some researchers have used optical flow-based motion detection and localization methods and frequency domain representations of gestures (Mahbub et al., 2011; Imtiaz et al., 2011).

The model-based approach, on the other hand, identifies the body limbs and tracks them using a 2D or 3D model of the body. The models are largely based on pose estimation and figure-centric measurements. Some well-known model types are patches, cylinders, stick figures, super-quadratic ellipsoids, clay patterns, etc. (Ahad, 2011). Human motions are split into atomic parts, each of which represents a basic movement. Zhou et al. (2009) presented a tensor analysis based approach for tracking the object trajectory, which can also be applied to tracking body parts for action recognition.

The bag-of-words approach, based on local features, is robust to background clutter, partial occlusion, and viewpoint changes. It detects local salient descriptors, such as visual words, which are then used to recognize the actions. Bag-of-words has been used successfully for object categorization in (Willamowski et al., 2004). However, this approach lacks the relations between the features in the spatial and the temporal domains, which are helpful for recognition. It is a sparse representation method and is less informative compared to the global representation. In (Shao and Chen, 2010), the bag-of-words model is combined with an action representation based on a histogram of body poses sampled from silhouettes in the video sequence. Among other methods, Seo and Milanfar (2011) proposed a method for segmenting a periodic action into cycles and thereby classifying an action. Some authors took advantage of both the probabilistic distribution and the temporal relationship of human poses to develop a representation of human actions using Correlograms of Body Poses (CBP) (Shao et al., 2011). Some recent works include action recognition exploiting human pose (Wu and Shao, 2012).

A number of the available gesture recognition methods have achieved high accuracies on different datasets. However, most of them depend on a large amount of input to train the system and do not perform very well if the training data are limited. Yang et al. (2009) proposed a method for action recognition using a patch-based motion descriptor and matching scheme with a single clip as the template. Seo and Milanfar (2011) proposed an approach using only one training data set by employing space–time local steering kernels, which robustly capture the underlying space–time data structure (Shao et al., 2012).

Since the release of the Kinect sensor (http://www.xbox.com/en-US/Xbox360/Accessories/Kinect) with a mounted depth sensor, some algorithms have been developed to take advantage of the depth information. Many works have succeeded in identifying body parts and in using the sensor to track them in order to recognize actions and gestures. Tang (2011) proposed a method for hand detection by integrating RGB and depth data. It involved finding possible hand pixels by skin color detection and then removing false positives, such as the neck and other skin-colored clothing, from the depth data, based on the fact that these parts of the body will be behind the hand in most cases. Other methods of identifying hands by the fusion of RGB and depth data have also been proposed. Another robust method for detecting fingers to easily identify sophisticated and confusing gestures was proposed in (Ren et al., 2011), namely the Finger-Earth Mover's Distance (FEMD) (Zhou et al., 2011). The gesture recognition process was even extended to real-time application in a multi-user environment for sign language recognition. However, gesture recognition in these cases was simpler, since a small set of gestures was used with very little variation in type, and the gestures were performed by extending the hands in front of the body to ease detection.

This paper addresses a novel gesture recognition technique based on statistical evaluation of the frequency domain representation of spatio-temporal motion information.

3. Proposed method for one-shot-learning gesture recognition

In this paper, a method of gesture recognition using one-shot-learning from a small vocabulary of gestures is proposed. Every application needs a specialized gesture vocabulary. If gesture recognition is to become part of everyday life, gesture recognition machines are needed that can easily be tailored to new gesture vocabularies. For this reason, one-shot gesture recognition is important. In this context, one-shot gesture recognition means learning to recognize new categories of gestures from a single video clip of each gesture. The gestures are drawn from a small vocabulary of gestures, generally related to a particular task, for instance, hand signals used by divers, finger codes to represent numerals, signals used by referees, or marshaling signals to guide vehicles or aircraft. Supplying a lot of training examples may be impractical in many applications where recording data and/or labeling them is tedious or expensive. Many consumer applications of gesture recognition will become possible only if systems can be trained to recognize new gestures with very few examples, ideally just one.

Here, it is assumed that in a given dataset both RGB and depth images are available. Although training the system with only one example is difficult, the availability of the depth image in addition to the traditional RGB image opens the door to many new possibilities. The gestures have to be recognized from a set of gestures and matched with the vocabulary. The proposed algorithm focuses only on the depth data from the sensor for the gesture recognition model.

3.1. Gesture dataset

There are some clearly defined hand, body, or head gesture datasets, e.g., the Cambridge gesture dataset (Kim and Cipolla, 2009), the Naval Air Training and Operating Procedures Standardization (NATOPS) aircraft handling signals database (Song et al., 2011), the Keck gesture dataset (Jiang et al., 2012), the Korea University Gesture (KUG) database (Hwang et al., 2006), etc. Among these datasets, the Cambridge Gesture dataset contains 900 video sequences partitioned into five different illumination subsets with nine classes of gestures. The NATOPS aircraft handling signals database contains 24 gestures, with each gesture performed by 20 subjects 20 times, resulting in 400 samples per gesture. It is a body-and-hand gesture dataset containing an official gesture vocabulary used for communication between carrier deck personnel and Navy pilots (e.g., yes/no signs, taxiing signs, fueling signs, etc.). The Keck gesture dataset consists of 14 different gesture classes, which are a subset of military signals. There are 126 training sequences captured using a fixed camera with a static background and 168 testing sequences captured from a moving camera in the presence of background clutter. The Korea University Gesture (KUG) database includes 14 normal gestures (such as sitting and walking), 10 abnormal gestures (different forms of falling, such as forward, backward, or from a chair), and 30 command gestures (well-defined and commonly used in gesture-based studies, such as yes, no, pointing, and drawing numbers) performed by 20 different subjects (Ahad, 2011).

Though well-known for their contents and complexities, all of these datasets address a particular type of gesture limited to a number of classes and application domains. Therefore, for the simulation of the proposed method, a rich but extremely complicated dataset, namely the ChaLearn Gesture Dataset (CGD2011), is considered (ChaLearn Gesture Dataset, 2011). The ChaLearn Gesture database contains nine categories of gestures corresponding to various settings or application domains. The categories are (1) body language gestures (like scratching the head or crossing arms), (2) gesticulations performed to accompany speech, (3) illustrators (like Italian gestures), (4) emblems (like Indian Mudras), (5) signs (from sign languages for the deaf), (6) signals (like referee signals, diving signals, or marshaling signals to guide machinery or vehicles), (7) actions (like drinking or writing), (8) pantomimes (gestures made to mimic actions), and (9) dance postures. Some frames of the video samples of the ChaLearn Gesture Dataset are shown in Fig. 1. The dataset consists of videos from the RGB and depth cameras of a Microsoft Kinect™ sensor. Each set of data contains a number of actions presented separately and only once for training purposes. Combinations of one or more of those actions performed in a video sequence are available for recognition.

In the videos, a single user is portrayed in front of a fixed camera, interacting with a computer by performing gestures to play a game, control remote appliances or robots, or learn to perform gestures from educational software. The videos are a collection of a large dataset of gestures including RGB and depth videos of 50,000 gestures recorded with the Kinect™ camera with image sizes of 240 × 320 pixels at 10 frames per second. The videos are recorded by 20 different users and grouped in 500 batches of 100 gestures. The data are available from ChaLearn Gesture Dataset (2011) in two formats: a lossy compressed AVI format and a quasi-lossless AVI format. To get sufficient spatial resolution, only the upper body is framed.

The data are divided into different batches. The development data consist of batches devel01, devel02, ..., devel20. For the 'develXX' batches, all the labels are provided by the dataset creators. Each batch, including 47 sequences of 1–5 gestures, is drawn from various small gesture vocabularies of 8–12 unique gestures, called a lexicon. There are over 30 different gesture vocabularies. For instance, a gesture vocabulary could consist of the signs to referee volleyball games or the signs to represent small animals in the sign language for the deaf. Sample frames of both the RGB and depth data of the first 20 lexicons of the development data are shown in Fig. 1.

The complexities in this dataset are introduced by the wide variation of the types of actions, the environment, and the position of the performer. While some of the actions include motion over a large area (development data lexicons 1, 2, and 7), some are limited to gestures performed by fingers only (development data lexicons 3 and 6). The background of a particular set is constant throughout the whole set, but the performer is positioned in different places in different videos (development data lexicons 1 and 2). Some backgrounds also include different objects both in front of and behind the performer (development data lexicons 7 and 8). The environments are different, and some contain poor lighting (development data lexicon 7). The colors¹ of the apparel of the performers also change during a set (development data lexicons 3 and 4). The postures of the body are different in different sets, mostly displaying the upper frame of the body, but some even display only a part of the body, or it is shown from behind. In addition, in many cases the videos are subject to self-occlusion. Thus, the ChaLearn Gesture Dataset is undoubtedly one of the most complex gesture datasets available to date.

¹ For interpretation of color in Figs. 1, 3, 6, and 7, the reader is referred to the web version of this article.

3.2. Feature extraction and training

The depth data provided in the dataset are in RGB format and therefore need to be converted to grayscale before processing. The grayscale depth data are a true representation of object distance from the camera, with pixel intensity varying from dark to bright for near to far objects, respectively. This fact also makes it easy to apply a grayscale threshold on each frame of the gesture sample to separate the foreground from the background. The threshold level is obtained by using Otsu's method (Otsu, 1979), which chooses the threshold that minimizes the intra-class variance of the black and white pixels when making a binary image. The binary image is used as a mask to filter out the background from the depth image and thereby separate the human subject, as shown in Fig. 2.
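As an illustration of this masking step, a minimal Python/OpenCV sketch is given below. It assumes the depth frames have already been loaded as 8-bit grayscale NumPy arrays and that nearer pixels are darker, as described above; the function name and the use of OpenCV's built-in Otsu thresholding are illustrative choices, not taken from the paper's implementation.

```python
import cv2
import numpy as np

def subtract_background(depth_frame):
    """Separate the human subject from the background of one depth frame.

    depth_frame: 2-D uint8 array (grayscale depth; dark = near, bright = far,
                 which is an assumption based on the dataset description).
    Returns the masked depth frame with background pixels set to zero.
    """
    # Otsu's method picks the threshold that minimizes intra-class variance.
    _, mask = cv2.threshold(depth_frame, 0, 255,
                            cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # The inverted binary mask keeps the darker (nearer) region,
    # assumed to be the human subject.
    return cv2.bitwise_and(depth_frame, depth_frame, mask=mask)

# Hypothetical usage on a stack of frames of shape (T, H, W):
# masked = np.stack([subtract_background(f) for f in depth_video])
```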

The proposed methods of gesture recognition mainly focus on three types of operation for obtaining feature vectors from the training samples. The operations are (a) computing the standard deviation (STD) of each pixel value across time; (b) obtaining the motion history image (MHI) from the gesture sequences; and (c) performing the two-dimensional fast Fourier transform (2D FFT) on the frames. The process of feature extraction, training, and classification is shown in brief in Fig. 3.

STD is a very simple, yet widely used, statistical measure. It extracts the information on the deviation of values from the mean in an array. In gesture recognition, the STD computed on each pixel over a certain time duration can be utilized to track the change in the subject's position within that time. It generates an image quite similar to the motion energy image (MEI), but with color tones depicting abrupt to slow changes by varying from red to blue regions. For the $i$-th gesture at lexicon $L$ consisting of $T$ frames, the standard deviation $T\sigma^2_i(x, y)$ of pixel $(x, y)$ across the frames is given by

$$T\sigma^2_i(x, y) = \frac{\sum_{t=1}^{T} \left( I_{xy}(t) - \bar{I}_{xy} \right)^2}{T}. \tag{1}$$

Here, $I_{xy}(t)$ is the pixel value at location $(x, y)$ of the frame at time $t$, where $x = 1, 2, 3, \ldots, m$, $y = 1, 2, 3, \ldots, n$ and $t = 1, 2, 3, \ldots, T$, and $\bar{I}_{xy}$ is the average of all $I_{xy}(t)$ values along time $t$. Therefore, for the whole frame, the matrix obtained is defined as

$$TD^L_i = \begin{pmatrix}
T\sigma^2_i(1,1) & T\sigma^2_i(1,2) & \cdots & T\sigma^2_i(1,n) \\
T\sigma^2_i(2,1) & T\sigma^2_i(2,2) & \cdots & T\sigma^2_i(2,n) \\
\vdots & \vdots & \ddots & \vdots \\
T\sigma^2_i(m,1) & T\sigma^2_i(m,2) & \cdots & T\sigma^2_i(m,n)
\end{pmatrix}. \tag{2}$$

This matrix itself can be used as a feature for each of the lexicons. The matrices obtained by performing the standard deviation operation on the training samples of development data lexicon 1 of the ChaLearn Gesture Dataset are shown in Fig. 4. It is evident from Fig. 4 that performing the standard deviation (STD) across frames enhances the information of movement across the frames while suppressing the static parts. One can easily gain an insight into the path of motion flow and the type of the gesture by simply observing these figures. Thus, STD across frames is one of the proposed feature vector development methods.

Fig. 1. Extracts from the dataset: RGB and depth images from the first 20 development data sets.

Fig. 2. Otsu's threshold to subtract background from depth image.
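A minimal NumPy sketch of this temporal deviation template (Eqs. (1) and (2)) is given below. The assumption that a gesture is stored as a T × m × n array, and the function name, are illustrative and not from the paper.

```python
import numpy as np

def std_template(frames):
    """Temporal deviation template of Eqs. (1)-(2).

    frames: array of shape (T, m, n) holding the background-subtracted
            depth frames of one gesture.
    Returns an (m, n) map of the mean squared deviation of every pixel
    over the T frames, which highlights the moving regions.
    """
    frames = frames.astype(np.float64)
    mean = frames.mean(axis=0)               # per-pixel temporal mean
    return ((frames - mean) ** 2).mean(axis=0)   # divides by T, as in Eq. (1)
```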

A similar feature can be obtained from the gesture samples by taking the absolute value of the two-dimensional discrete Fourier transform (DFT), computed via the fast Fourier transform, of each background-subtracted depth image frame and then calculating the standard deviation of each point in the spectral domain across frames. For an image of size $m \times n$, the 2D discrete Fourier transform is given by

$$F_{kl}(t) = \frac{1}{mn} \sum_{x=0}^{m-1} \sum_{y=0}^{n-1} I_{xy}(t)\, e^{-j 2\pi \left( \frac{kx}{m} + \frac{ly}{n} \right)}, \tag{3}$$

where $I_{xy}(t)$ is the value of the $(x, y)$ pixel of the image in the spatial domain for the frame at time $t$, with $t = 1, 2, \ldots, T$, and the exponential term is the basis function corresponding to each point $F_{kl}(t)$ in the Fourier space (Gonzalez and Woods, 2001). The advantage of performing the Fourier transform prior to taking the STD is that, when the depth values of the person are taken to the frequency domain, the position of the person becomes irrelevant, i.e., any slight vertical or horizontal movement of the camera would be nullified in the frequency response. For example, if the camera is steady in the training samples but moves a little in the test samples, the STD of the temporal values of the test data would be difficult to project onto the corresponding point of the training data for finding a match. However, transforming the background-subtracted depth data into the spectral domain before taking the standard deviation across frames will suppress the time resolution and thereby reduce the effect of camera movement, bringing the training and test data onto similar ground for comparison.

Fig. 3. Proposed method of gesture recognition.

Fig. 4. Silhouettes of training gestures employing STD on each pixel across time.
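The shift-robust variant described above can be sketched as follows: each background-subtracted frame is taken to the 2-D spectral domain, only the magnitude is kept (a spatial shift of the image changes only the phase of its DFT), and the temporal deviation is then computed point-wise. The array layout and function name are assumptions of this sketch.

```python
import numpy as np

def fft_std_template(frames):
    """2-D FFT magnitude per frame, then per-point deviation across time.

    frames: array of shape (T, m, n) of background-subtracted depth frames.
    Returns an (m, n) spectral-domain template that is insensitive to small
    horizontal or vertical shifts of the subject or the camera.
    """
    # Magnitude of the 2-D DFT of every frame; translation affects only the phase.
    spectra = np.abs(np.fft.fft2(frames, axes=(1, 2)))
    # Temporal variance of each spectral point, analogous to Eq. (1).
    return spectra.var(axis=0)
```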

It is, however, evident that the outcome of performing STD across frames does not reveal the direction of the flow of motion, and thereby the temporal position information of the moving parts gets lost, as can be seen from Fig. 4. Therefore, the operation of taking the motion history image (MHI) of the lexicon is proposed along with the previous two approaches. In the MHI, the silhouette sequence is condensed into grayscale images, while dominant motion information is preserved. Therefore, it can represent a motion sequence in a compact manner. The MHI template is also not very sensitive to silhouette noise, like holes, shadows, and missing parts. These advantages make such templates suitable candidates for motion and gait analysis (Liu et al., 2009; Liu and Zhang, 2007). The advantage of the MHI representation is that a range of times may be encoded in a single frame, and as a result, the MHI spans the time scale of human gestures (Bradski and Davis, 2002). It expresses the motion flow or sequence through the intensity of every pixel in a temporal manner. The motion history recognizes general patterns of movement; therefore, it can be implemented with cheap cameras and lower-powered CPUs (Bradski and Davis, 2002), and in low-light areas where structure cannot be easily detected. Another benefit of using the grayscale MHI is that it is sensitive to the direction of motion, unlike the STD approach, and hence the MHI is preferred for discriminating between actions of opposite directions (e.g., moving a hand from left to right and from right to left) (Bobick and Davis, 1996; Ahad, 2011).

To obtain the motion history image from the background-subtracted depth images of the gesture samples, first a binarized image $\Psi(x, y, t)$ for the pixel $(x, y)$ and the $t$-th frame is obtained from frame subtraction using a threshold $\phi$, i.e.,

$$\Psi(x, y, t) = \begin{cases} 1 & \text{if } z(x, y, t) \geq \phi, \\ 0 & \text{otherwise,} \end{cases} \tag{4}$$

where $z(x, y, t)$ is defined using the difference distance $\Delta$ as follows:

$$z(x, y, t) = \left| I_{xy}(t) - I_{xy}(t - \Delta) \right|. \tag{5}$$

The MHI $H(x, y, t)$ can be computed from the update function $\Psi(x, y, t)$ as

$$H(x, y, t) = \begin{cases} \tau & \text{if } \Psi(x, y, t) = 1, \\ \max\left(0, H(x, y, t-1) - \delta\right) & \text{otherwise,} \end{cases} \tag{6}$$

where $\Psi(x, y, t)$ signals object presence (or motion) in the current video image, the duration $\tau$ decides the temporal extent of the movement (e.g., in terms of frames), and $\delta$ is the decay parameter. This update function is called for every new video frame analyzed in the sequence (Ahad, 2011). Examples of the final MHI images, containing information about the temporal history of motion, are shown in Fig. 5.
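A minimal sketch of the MHI recursion in Eqs. (4)-(6) follows. The frame distance Δ, threshold φ, duration τ, and decay δ are left as parameters with placeholder default values, since the paper does not report the exact settings used.

```python
import numpy as np

def motion_history_image(frames, delta=1, phi=30, tau=255, decay=1):
    """Motion history image of one gesture, following Eqs. (4)-(6).

    frames: (T, m, n) array of background-subtracted depth frames.
    delta:  frame distance used for differencing (Eq. (5)).
    phi:    threshold on the absolute frame difference (Eq. (4)).
    tau:    value written where motion is detected (Eq. (6)).
    decay:  amount subtracted elsewhere at every step (Eq. (6)).
    """
    frames = frames.astype(np.float64)
    H = np.zeros(frames.shape[1:])
    for t in range(delta, frames.shape[0]):
        diff = np.abs(frames[t] - frames[t - delta])   # z(x, y, t) of Eq. (5)
        psi = diff >= phi                              # update function of Eq. (4)
        # Eq. (6): refresh moving pixels to tau, decay everything else.
        H = np.where(psi, float(tau), np.maximum(0.0, H - decay))
    return H
```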

The feature matrices for gesture recognition are obtained by employing the three operations, 2D STD, 2D FFT, and MHI, both individually and in different combinations. For example, though each gesture in the ChaLearn Gesture Dataset ends with a return to the resting pose, the overall position of the subject in the frame may shift slightly. As the proposed STD and MHI based methods largely assume a steady body position (while the arms and legs move to perform a certain gesture), this may affect the classification results significantly. Therefore, to make the method shift-invariant, the 2D FFT is performed on the images prior to the 2D STD and after generating the MHI. A feature obtained in this manner should theoretically provide a better recognition result.

Fig. 5. MHI of the training gestures (Method 4).

3.3. Finding action boundary

The test samples provided in the ChaLearn Gesture Dataset contain one or more gestures. Therefore, the first task is to separate the different gestures from the gesture sequences. Fortunately, the gestures provided in the ChaLearn Gesture Dataset are separated by a return to a resting position (ChaLearn Gesture Dataset, 2011). Thus, subtracting the initial frame from each frame of the movie can be a good means of finding the gesture boundary. For an $m \times n$ depth image, if $I_{xy}(t)$ is the pixel value (depth) of position $(x, y)$ at time $t$, where $t = 2, 3, \ldots, T$, the following formulas can be used to generate a variable $S_t$, which provides an amplitude of change from the reference frame:

$$s_1 = (I_{11}(t) - I_{11}(1))^2 + (I_{12}(t) - I_{12}(1))^2 + (I_{13}(t) - I_{13}(1))^2 + \cdots + (I_{1n}(t) - I_{1n}(1))^2$$
$$s_2 = (I_{21}(t) - I_{21}(1))^2 + (I_{22}(t) - I_{22}(1))^2 + (I_{23}(t) - I_{23}(1))^2 + \cdots + (I_{2n}(t) - I_{2n}(1))^2$$
$$s_3 = (I_{31}(t) - I_{31}(1))^2 + (I_{32}(t) - I_{32}(1))^2 + (I_{33}(t) - I_{33}(1))^2 + \cdots + (I_{3n}(t) - I_{3n}(1))^2$$
$$\vdots$$
$$s_m = (I_{m1}(t) - I_{m1}(1))^2 + (I_{m2}(t) - I_{m2}(1))^2 + (I_{m3}(t) - I_{m3}(1))^2 + \cdots + (I_{mn}(t) - I_{mn}(1))^2.$$

Thus, the sum of the squared differences between the current frame and the reference frame can be obtained by

$$S_t = \sum_{k=1}^{m} s_k. \tag{7}$$

The value of $S_t$ is then normalized. The smoothed graph of normalized $S_t$ vs. $t$ is shown in Fig. 6(a) and (b) for two different development data samples. Every crest in these curves represents a gesture boundary. Using the locations of these crests in time, along with some intelligent decision-making on the curve, the action boundaries can be found with fairly high accuracy. In Fig. 6, both the actual boundary (determined by observing the gesture sequence) and the boundary detected by the proposed algorithm are shown.
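One way to realize this boundary search is sketched below: $S_t$ from Eq. (7) is computed against the first (resting) frame, normalized, smoothed with a moving average, and the prominent peaks of the smoothed curve are returned as candidate boundaries. Normalizing by the maximum, the window length, and the peak-prominence criterion are assumptions of this sketch; the paper's additional decision-making rules are not reproduced.

```python
import numpy as np
from scipy.signal import find_peaks

def boundary_scores(frames):
    """Normalized sum of squared differences from the first frame, Eq. (7)."""
    ref = frames[0].astype(np.float64)
    s = ((frames.astype(np.float64) - ref) ** 2).sum(axis=(1, 2))
    return s / s.max()   # normalization by the maximum is an assumption

def candidate_boundaries(frames, smooth=5, prominence=0.1):
    """Smooth the S_t curve and return candidate boundary frame indices."""
    s = boundary_scores(frames)
    kernel = np.ones(smooth) / smooth
    s_smooth = np.convolve(s, kernel, mode="same")     # simple moving average
    peaks, _ = find_peaks(s_smooth, prominence=prominence)
    return peaks
```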

3.4. Classification

After extracting the features from the gestures in the training dataset of a particular lexicon, a feature vector table is formed for that lexicon, as shown in Fig. 3. Then, for the test samples, the gestures are first separated if multiple gestures exist. Then, features are extracted for each gesture in a manner similar to that done for the training samples. For gesture sequences of frame size $m \times n$, the feature vectors obtained from the test gestures are also $m \times n$ matrices for each type of feature. Next, the correlation coefficient is calculated between each feature obtained from the test gesture and the corresponding feature obtained from each of the training gestures. The coefficient of correlation, $r$, is a mathematical measure of how much one value is expected to be influenced by a change in another. It is a widely used measure for image and gesture recognition (Gonzalez and Woods, 2001). The correlation coefficient between two images $A$ and $B$ is defined as

$$r = \frac{\sum_{k=0}^{m-1} \sum_{l=0}^{n-1} (A_{kl} - \bar{A})(B_{kl} - \bar{B})}
{\sqrt{\sum_{k=0}^{m-1} \sum_{l=0}^{n-1} (A_{kl} - \bar{A})^2}\;\sqrt{\sum_{k=0}^{m-1} \sum_{l=0}^{n-1} (B_{kl} - \bar{B})^2}}, \tag{8}$$

where $A_{kl}$ is the intensity of pixel $(k, l)$ in image $A$, $B_{kl}$ is the intensity of pixel $(k, l)$ in image $B$, and $\bar{A}$ and $\bar{B}$ are the mean intensities of all the pixels of images $A$ and $B$, respectively. If $r = \pm 1$, there is a strong positive/negative correlation between the two images, i.e., one is identical to the other or to its negative. If $r$ is zero, there is no correlation between the matrices. A high value of the correlation coefficient between a feature of the test sample and the same feature of one of the training samples indicates a higher probability of the two gestures being identical, and vice versa for a low value of the correlation coefficient. In the proposed method, when a combination of more than one feature is used, like features are matched using the correlation coefficient separately, and a decision is then taken from the two arrays of correlation coefficients to identify the appropriate match in such a way that the results of both matches influence the decision making.

Fig. 6. Action boundary detection for development set 01, actions (a) 15 and (b) 17.
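Since Eq. (8) is exactly the Pearson correlation of the two templates viewed as vectors, the matching step can be sketched with np.corrcoef, as below. The dictionary layout of the training lexicon is assumed only for illustration.

```python
import numpy as np

def template_correlation(a, b):
    """Correlation coefficient r of Eq. (8) between two equal-size templates."""
    return np.corrcoef(a.ravel(), b.ravel())[0, 1]

def best_match(test_template, train_templates):
    """Return the training gesture label with the highest correlation.

    train_templates: dict mapping gesture label -> template of the same size
                     (an assumed data layout, not prescribed by the paper).
    """
    scores = {label: template_correlation(test_template, tpl)
              for label, tpl in train_templates.items()}
    return max(scores, key=scores.get), scores
```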

4. Experimental results

For the purpose of evaluating the gesture recognition performance of the proposed method, the Levenshtein distance (LD) measure is employed. The Levenshtein distance, also known as the edit distance, is the number of deletions, insertions, or substitutions required to match one array with another (Levenshtein, 1966). The LD measure has a wide range of applications, such as spell checkers, correction systems for optical character recognition, and other similar systems (Golic and Mihaljevic, 1991; Marzal et al., 1993). For the proposed gesture recognition system, if the list of labels of the true gestures in a test sequence is T, while the list of labels corresponding to the recognized gestures for the same sequence is R, then the Levenshtein distance L(R, T) represents the minimum number of edit operations that one has to perform to go from R to T (or vice versa). For example, L([1 5 4], [1 2]) = 2, as two edit operations are needed to get [1 2] from [1 5 4]. The evaluation criterion is the percentage LD, which is the sum of all the LDs obtained from a lexicon divided by the total number of true gestures in that lexicon and multiplied by 100. It is evident that the higher the value of LD, the more wrong estimations there are. Thus, a lower value of LD is desired (Levenshtein, 1966).
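For completeness, a short sketch of the edit-distance computation and the percentage score described above is given; the dynamic-programming implementation is the standard one and is not specific to the paper.

```python
def levenshtein(r, t):
    """Minimum number of insertions, deletions and substitutions turning r into t."""
    m, n = len(r), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if r[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]

def percent_ld(recognized_seqs, true_seqs):
    """Sum of per-sequence edit distances over the total number of true gestures, x100."""
    total_ld = sum(levenshtein(r, t) for r, t in zip(recognized_seqs, true_seqs))
    total_true = sum(len(t) for t in true_seqs)
    return 100.0 * total_ld / total_true

# Example from the text: levenshtein([1, 5, 4], [1, 2]) == 2
```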

For the purpose of analysis, the first 20 lexicons, namely devel01 to devel20 from the CGD2011 dataset (ChaLearn Gesture Dataset, 2011), are considered. In this dataset, each action in a set is assigned a number as a label, and a string containing the labels of the actions is provided for each gesture sequence. For the experimentation, seven sets of features are formed by combining the three basic operations. The method of classification also varies slightly with the type of features. Thus, seven different methods are proposed. These methods are:

1. Taking the 2D STD of the depth image across time and using the maximum value of the correlation coefficient as the best match.

2. Taking the 2D STD of the absolute value of the frequency domain spectrum of the depth image across time and using the maximum value of the correlation coefficient as the best match.

3. Measuring the 2D STD of both the time and spectral domain images across time, measuring (a) the correlation coefficients for the spectral domain feature and (b) the product of the correlation coefficients for both features, and then making an intelligent selection from the three highest values of (a) by observing the corresponding values at (b) (a sketch of this selection step is given after this list).

4. Constructing the MHI and using the maximum value of the correlation coefficient as the best match.

5. Taking the absolute value of the 2D FFT of the MHI and using the maximum value of the correlation coefficient as the best match.

6. Constructing the MHI and taking the 2D STD of the absolute value of the frequency domain spectrum of the depth image, then measuring (a) the correlation coefficients for the spectral domain feature and (b) the product of the correlation coefficients for both features, and then making an intelligent selection.

7. Taking the absolute value of the 2D FFT of the MHI and that of the 2D STD of the frequency domain spectrum of the depth image, then measuring (a) the correlation coefficients for the spectral domain feature and (b) the product of the correlation coefficients for both features, and then making an intelligent selection.
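The 'intelligent selection' used in Methods 3, 6, and 7 is described only at a high level, so the sketch below shows one plausible reading: rank the training gestures by the correlation of the primary feature, keep the three highest-ranked candidates, and among them pick the one with the largest product of the two correlation scores. Any deviation from the authors' exact rule is an artifact of this sketch.

```python
import numpy as np

def intelligent_selection(corr_a, corr_b, top_k=3):
    """Pick a training gesture index from two arrays of correlation scores.

    corr_a: correlations of the test gesture with every training gesture
            for the primary (spectral-domain) feature.
    corr_b: correlations for the secondary feature (same ordering).
    """
    corr_a = np.asarray(corr_a)
    corr_b = np.asarray(corr_b)
    # Three highest-scoring candidates according to the primary feature.
    candidates = np.argsort(corr_a)[::-1][:top_k]
    # Among them, let the product of both scores decide the final match.
    products = corr_a[candidates] * corr_b[candidates]
    return candidates[np.argmax(products)]
```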

Fig. 7. Levenshtein distances (%) for different lexicons.

Table 1. Summary of the proposed methods.

Method                                              1      2      3      4      5      6      7
RGB image                                           –      –      –      –      –      –      –
Depth image                                         ✓      ✓      ✓      ✓      ✓      ✓      ✓
Background separation                               ✓      ✓      ✓      ✓      ✓      ✓      ✓
2D STD                                              ✓      –      ✓      –      –      –      –
2D FFT (frame) followed by 2D STD (across time)     –      ✓      ✓      –      –      ✓      ✓
MHI                                                 –      –      –      ✓      –      ✓      –
MHI followed by 2D FFT (MHI image)                  –      –      –      –      ✓      –      ✓
Intelligent selection                               –      –      ✓      –      –      ✓      ✓
Maximum correlation coefficient classifier          ✓      ✓      ✓      ✓      ✓      ✓      ✓
Average Levenshtein distance (%)                    48.73  38.95  43.43  55.3   52.21  45.05  37.46


Simulation is performed on the dataset using these methods, and the results obtained in terms of percentage of Levenshtein distance (LD) are shown in Fig. 7 for each lexicon of the first 20 development data batches. It can be seen from the figure that the result varies widely from set to set. The wide variation is caused by the different types of motion in different sets. In some sets (devel03, devel07, devel10, devel14), the percentage of LD is higher than in the rest because the gestures of these sets are mostly hand gestures similar to sign language, which are very difficult to differentiate. Also, in many samples the same gestures are performed in different positions relative to the body. Therefore, the features extracted by the MHI are different for the same gesture in different sequences and, hence, can hardly be matched with the original training gesture. The performance of the proposed method is also limited by the accuracy of the separation of different actions from the videos and by unexpected movements of the performer. Sometimes the gestures are performed without a clear indication of the gesture's end, which causes erroneous segmentation of the action sequences in a video. Also, in some videos the performer moved while performing a gesture, which added unexpected features to the model and gave a wrong match. The results might have been improved if these issues could be overcome.

The best results are obtained for Method 7, which is very much expected (Fig. 7), because, due to the spectral domain operation, the features in this method are robust enough to perform well against camera movement. In Table 1, the proposed methods and their recognition accuracies are summarized. It can be seen from the table that in Method 7 all the proposed features, i.e., 2D FFT, 2D STD and MHI, are applied simultaneously. However, the classification is done in steps, as it is evident from the average Levenshtein distances of Method 2 and Method 5 that employing 2D STD on spectral data gives better recognition accuracy than employing the 2D FFT on the MHI. Therefore, if these features were simply concatenated into a single vector, the recognition accuracy would quite obviously not be better than that of Method 2. Thus, an intelligent selection is made to extract only the good aspects of these two features in Method 7 and thereby obtain the best recognition accuracy among all the methods. It can also be seen that the second best result is obtained for Method 2, i.e., employing only the STD on the spectral domain representation of frames, which again indicates the effectiveness of operating in the transform domain.

5. Conclusion

In this paper, an approach for gesture recognition is proposed. It employs combinations of statistical measures, frequency domain transformation, and motion representations obtained from depth motion images. The proposed method first utilizes parameters extracted from the depth motion sequence to separate different gestures. Next, it subtracts the background by employing a grayscale threshold. Then, features from each gesture sequence are extracted based on three basic operations: calculating the standard deviation across frames, taking the two-dimensional Fourier transform, and constructing motion history images. These operations are then combined to find the most useful method for recognizing the gesture. The usefulness of applying an intelligent classifier based on the correlation coefficients of the standard deviation of the two-dimensional Fourier transform of the images and of the motion history image is apparent from the results. The dataset targeted for experimental evaluation, namely the ChaLearn Gesture Dataset 2011, is a very rich but difficult dataset to handle, having a lot of variation and many classes. On top of that, to the knowledge of the authors, this dataset had not been addressed before. This fact presented unprecedented obstacles, which had to be overcome systematically with different methods. Through extensive evaluation it is shown that the proposed combination of features can provide a satisfactory level of recognition performance on one of the most complex gesture datasets to date.

References

Ahad, M.A.R., 2011. Computer vision and action recognition: a guide for image processing and computer vision community for action understanding. Atlantis Ambient and Pervasive Intelligence. Atlantis Press.
Ahad, M.A.R., Tan, J., Kim, H., Ishikawa, S., 2008. Human activity recognition: various paradigms. In: Internat. Conf. on Control, Automation and Systems, ICCAS 2008, pp. 1896–1901. http://dx.doi.org/10.1109/ICCAS.2008.4694407.
Ahad, M.A.R., Tan, J.K., Kim, H., Ishikawa, S., 2012. Motion history image: its variants and applications. Machine Vision App. 23 (2), 255–281. http://dx.doi.org/10.1007/s00138-010-0298-4.
Bobick, A., Davis, J., 1996. An appearance-based representation of action. In: Proc. 1996 Internat. Conf. on Pattern Recognition (ICPR '96), vol. I. IEEE Computer Society, Washington, DC, USA, pp. 307–312.
Bobick, A.F., Davis, J.W., 2001. The recognition of human movement using temporal templates. IEEE Trans. Pattern Anal. Machine Intell. 23 (3), 257–267. http://dx.doi.org/10.1109/34.910878.
Bradski, G.R., Davis, J.W., 2002. Motion segmentation and pose recognition with motion history gradients. Machine Vision App. 13 (3), 174–184. http://dx.doi.org/10.1007/s001380100064.
ChaLearn Gesture Dataset (CGD2011), ChaLearn, California, 2011.
ChaLearn Gesture Dataset, ChaLearn, California, 2011. Available from: <http://www.kaggle.com/c/GestureChallenge>.
Golic, J.D., Mihaljevic, M.J., 1991. A generalized correlation attack on a class of stream ciphers based on the Levenshtein distance. J. Cryptol. 3, 201–212. http://dx.doi.org/10.1007/BF00196912.
Gonzalez, R.C., Woods, R.E., 2001. Digital Image Processing. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.
Hwang, B.-W., Kim, S., Lee, S.-W., 2006. A full-body gesture database for automatic gesture recognition. In: Proc. 7th Internat. Conf. on Automatic Face and Gesture Recognition, FGR '06. IEEE Computer Society, Washington, DC, USA, pp. 243–248. http://dx.doi.org/10.1109/FGR.2006.8.
Imtiaz, H., Mahbub, U., Ahad, M., 2011. Action recognition algorithm based on optical flow and RANSAC in frequency domain. In: Proc. SICE Annual Conf. (SICE), pp. 1627–1631.
Jiang, Z., Lin, Z., Davis, L., 2012. Recognizing human actions by learning and matching shape-motion prototype trees. IEEE Trans. Pattern Anal. Machine Intell. 34 (3), 533–547. http://dx.doi.org/10.1109/TPAMI.2011.147.
Kim, T.-K., Cipolla, R., 2009. Canonical correlation analysis of video volume tensors for action categorization and detection. IEEE Trans. Pattern Anal. Machine Intell. 31 (8), 1415–1428. http://dx.doi.org/10.1109/TPAMI.2008.167.
Laptev, I., Lindeberg, T., 2003. Space-time interest points. In: IEEE Internat. Conf. on Computer Vision, vol. 1, p. 432.
Levenshtein, V.I., 1966. Binary codes capable of correcting deletions, insertions, and reversals. Sov. Phys. Dokl. 10 (8), 707–710.
Liu, J., Zhang, N., 2007. Gait history image: a novel temporal template for gait recognition. In: Proc. IEEE Internat. Conf. on Multimedia and Expo, pp. 663–666.
Liu, L.-F., Jia, W., Zhu, Y.-H., 2009. Survey of gait recognition. In: Proc. Intelligent Computing, 5th Internat. Conf. on Emerging Intelligent Computing Technology and Applications, ICIC '09. Springer-Verlag, Berlin, Heidelberg, pp. 652–659.
Mahbub, U., Imtiaz, H., Ahad, M.A.R., 2011. An optical flow based approach for action recognition. In: 14th Internat. Conf. on Computer and Information Technology (ICCIT), pp. 646–651. http://dx.doi.org/10.1109/ICCITechn.2011.6164868.
Marzal, A., Vidal, E., 1993. Computation of normalized edit distance and applications. IEEE Trans. Pattern Anal. Machine Intell. 15 (9), 926–932. http://dx.doi.org/10.1109/34.232078.
Otsu, N., 1979. A threshold selection method from gray-level histograms. IEEE Trans. Systems Man Cybernet. 9 (1), 62–66.
Ren, Z., Yuan, J., Zhang, Z., 2011. Robust hand gesture recognition based on finger-earth mover's distance with a commodity depth camera. In: Proc. 19th ACM Internat. Conf. on Multimedia, MM '11. ACM, New York, NY, USA, pp. 1093–1096. http://dx.doi.org/10.1145/2072298.2071946.
Seo, H.J., Milanfar, P., 2011. Action recognition from one example. IEEE Trans. Pattern Anal. Machine Intell. 33 (5), 867–882. http://dx.doi.org/10.1109/TPAMI.2010.156.
Shao, L., Chen, X., 2010. Histogram of body poses and spectral regression discriminant analysis for human action categorization. In: Proc. British Machine Vision Conf. BMVA Press, pp. 88.1–88.11. http://dx.doi.org/10.5244/C.24.88.
Shao, L., Wu, D., Chen, X., 2011. Action recognition using correlogram of body poses and spectral regression. In: 18th IEEE Internat. Conf. on Image Processing (ICIP), pp. 209–212. http://dx.doi.org/10.1109/ICIP.2011.6116023.
Shao, L., Ji, L., Liu, Y., Zhang, J., 2012. Human action segmentation and recognition via motion and shape analysis. Pattern Recognition Lett. 33 (4), 438–445. http://dx.doi.org/10.1016/j.patrec.2011.05.015.
Song, Y., Demirdjian, D., Davis, R., 2011. Tracking body and hands for gesture recognition: NATOPS aircraft handling signals database. In: FG, IEEE, pp. 500–506.
Tang, M., 2011. Recognizing hand gestures with Microsoft's Kinect. Stanford.edu 14 (4), 303–313.
Weinland, D., Ronfard, R., Boyer, E., 2006. Free viewpoint action recognition using motion history volumes. Computer Vision and Image Understanding 104 (2), 249–257. http://dx.doi.org/10.1016/j.cviu.2006.07.013.
Willamowski, J., Arregui, D., Csurka, G., Dance, C., Fan, L., 2004. Categorizing nine visual classes using local appearance descriptors. In: Workshop on Learning for Adaptable Visual Systems, IEEE Internat. Conf. on Pattern Recognition.
Wu, D., Shao, L., 2012. Silhouette analysis based action recognition via exploiting human poses. IEEE Trans. Circuits and Systems for Video Technology.
Yang, W., Wang, Y., Mori, G., 2009. Human action recognition from a single clip per action. Learning, 482–489.
Zhou, H., Tao, D., Yuan, Y., Li, X., 2009. Object trajectory clustering via tensor analysis. In: 16th IEEE Internat. Conf. on Image Processing (ICIP), pp. 1925–1928.
Zhou, R., Meng, J., Yuan, J., 2011. Depth camera based hand gesture recognition and its applications in human-computer-interaction. In: 8th Internat. Conf. on Information, Communications and Signal Processing (ICICS), pp. 1–5.