

Proceedings of 14th International Conference on Computer and Information Technology (ICCIT 2011) 22-24 December, 2011, Dhaka, Bangladesh

An Optical Flow Based Approach for Action Recognition

Upal Mahbub*, Hafiz Imtiaz* and Md. Atiqur Rahman Ahad†, Member, IEEE

*Bangladesh University of Engineering and Technology, Dhaka-1000, Bangladesh
E-mail: [email protected]; [email protected]

†Kyushu Institute of Technology, Kitakyushu, Japan
E-mail: [email protected]

Abstract-A new approach for motion-based representation on the basis of optical flow analysis and the random sample consensus (RANSAC) method is proposed in this paper. Optical flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer (an eye or a camera) and the scene. It is intuitive that an action can be characterized by the frequent movement of the optical flow points, or interest points, at different areas of the human figure. Additionally, RANSAC, an iterative method for estimating the parameters of a mathematical model from a set of observed data containing inliers and outliers, can be used to filter out unwanted interest points around the scene and keep only those related to the particular human's motion. In this manner, the area of the human body within the frame is estimated, and this rectangular area is segmented into a number of smaller regions or blocks. The percentage of change of interest points in each block from frame to frame is then recorded. A similar procedure is repeated for different persons performing the same action, and the corresponding values are averaged for the respective blocks. A matrix constructed by this strategy is used as a feature vector for that particular action. Afterwards, for the purpose of recognition using the extracted feature vectors, a distance-based similarity measure and a support vector machine (SVM)-based classification technique are exploited. Extensive experiments on standard motion databases show that the proposed method offers not only a very high degree of accuracy but also computational savings.

Index Terms-Motion-based representation, action recognition, optical flow, RANSAC, SVM.

I. INTRODUCTION

Recognizing the identity of individuals, as well as the actions, activities and behaviors performed by one or more persons in video sequences, is very important for various applications. Surveillance, robotics, rehabilitation, video indexing, biomechanics, medicine, sports analysis, film, games and mixed reality are among the key application arenas of human motion recognition [1]. The importance of human motion classification is evident from the increasing requirement for machines to interact intelligently and effortlessly with a human-inhabited environment. However, most of the information extracted by machines from human movement has come from static events, such as key presses. In order to improve machine capabilities in real time, it is desirable to represent motion. However, due to various limitations and constraints, no single approach seems sufficient for wider applications in action understanding and recognition.

Present methods can be classified into view/appearance-based, model-based, space-time volume-based, or direct motion-based approaches [2]. Template-matching approaches [3] are simple and fast algorithms for motion analysis or recognition that can represent an entire video sequence in a single image format. Recently, approaches related to spatio-temporal interest points (STIP) have become prominent for action representation [4].

In this paper, a novel action clustering-based human action recognition algorithm is presented, which employs optical flow and RANSAC for determining the apparent motion of a human. In order to detect the presence and direction of motion, optical flow is employed. RANSAC is used for further localization and identification of the most prominent motions within the frame. From the density of optical flow interest points, the probable position of the person along the horizontal direction in the frame is determined. Further localization is then done based on the mean and standard deviation of the positions of the interest points, both horizontally and vertically. A small rectangular area is thus obtained within which the person performs his/her actions. This area is divided into a number of small blocks, and the percentage of change in the number of interest points within each block is calculated frame by frame. All the matrices formed this way from similar actions are averaged and used as the feature for that respective action. Finally, simple classifiers are utilized for the classification task.

II. RELATED WORKS

Human action recognition from video sequences has been a major field of research in recent years for different real-life applications. Initially, the motion history image (MHI) method was used to recognize various actions [3]. Later, this method was used for the recognition of human movements and moving-object tracking by various groups (e.g., Refs. [5][6]). Bradski and Davis [6] and Davis [7] improved the MHI in various ways for recognizing various gestures. Various interactive systems have been successfully constructed using motion history templates as a primary sensing mechanism.

Nguyen et al. [8] introduced the concept of a motion swarm, a swarm of particles that moves in response to the field representing a motion history image. They created interactive art that can be enjoyed by groups, such as audiences at public events.



Fig. 1. The main components of the feature extraction and the human action recognition system: locating the position of the person in the frame, body-area localization from statistical properties, and the percentage change of interest points frame by frame in each segment, followed by matching and classification against the template feature space by (a) Euclidean distance and (b) support vector machine (SVM)

Another interactive art demonstration has been constructed from motion templates by [9].

On the other hand, some researchers in the computer vision community have used bag-of-words models for various recognition problems. Fei-Fei and Perona [10] use a variant of LDA for natural scene categorization, while Sivic et al. [11] use pLSI for unsupervised object class recognition and segmentation. Optical flow-based human action detection has also been investigated, mainly because of the simplicity of optical flow-based algorithms [12]; especially for real-time surveillance scenes [13][14], optical flow-based algorithms have proved fruitful. Various other approaches have been applied to action recognition and understanding, and some of these, as above, are employed in computer vision-based computer games and interactive systems that run in real time or pseudo-real time and can significantly reduce system cost through simple approaches. This paper addresses a novel action representation technique based on optical flow, RANSAC and simple statistical evaluation, and thereby recognizes various actions.

III. PROPOSED METHOD

A. Feature Extraction and Training

In general, any recognition technique consists of two major parts: training and testing. The training phase can be divided into two sub-sections: feature extraction, and learning or feature vector formation. Feature extraction is the most crucial part of any recognition system, as it directly dictates the overall accuracy. Figure 1 shows the overall flow diagram of the proposed system, which consists of feature extraction, learning and recognition phases. The objective of the proposed method is to extract the variations present in different human actions by developing a successful measure to follow the movement of different body parts in different directions during an action. During an action, not all the body parts move significantly. Any significant movement anywhere within a frame can be detected by optical flow analysis. Optical flow, or optic flow, is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer (an eye or a camera) and the scene. In computer vision, optical flow means tracking specific features (points) in an image across multiple frames. It is widely used to find moving objects from one frame to another, to calculate the speed and direction of movement and to determine the structure of the environment. In principle, optical flow techniques can be developed using four classes of methods, namely, phase correlation methods, block-based methods, differential methods and discrete optimization methods [15]. Among these, a widely used differential method is the Lucas-Kanade (L-K) method for optical flow estimation, developed by Bruce D. Lucas and Takeo Kanade [16][17], which stands out for its simplicity and its few assumptions about the underlying image. It assumes that the flow is essentially constant in a local neighborhood of the point p under consideration, so the optical flow equation can be assumed to hold for all pixels within a window centered at p. Namely, the local image flow (velocity) vector $(V_x, V_y)$ must satisfy

$$\begin{aligned} I_x(s_1)V_x + I_y(s_1)V_y &= -I_t(s_1) \\ I_x(s_2)V_x + I_y(s_2)V_y &= -I_t(s_2) \\ &\;\;\vdots \\ I_x(s_d)V_x + I_y(s_d)V_y &= -I_t(s_d) \end{aligned} \tag{1}$$

where $s_1, s_2, \ldots, s_d$ are the pixels inside the window, and $I_x(s_i)$, $I_y(s_i)$, $I_t(s_i)$ are the partial derivatives of the image $I$ with respect to position $x$, $y$ and time $t$, evaluated at the point $s_i$ and at the current time. These equations can be written in matrix form $Av = b$, where

$$A = \begin{bmatrix} I_x(s_1) & I_y(s_1) \\ I_x(s_2) & I_y(s_2) \\ \vdots & \vdots \\ I_x(s_d) & I_y(s_d) \end{bmatrix}, \quad v = \begin{bmatrix} V_x \\ V_y \end{bmatrix}, \tag{2}$$

$$b = \begin{bmatrix} -I_t(s_1) \\ -I_t(s_2) \\ \vdots \\ -I_t(s_d) \end{bmatrix}. \tag{3}$$


TABLE I
THE RANSAC ALGORITHM

1: Randomly select the minimum number of points required to determine the model parameters.
2: Solve for the parameters of the model.
3: Determine how many points from the set of all points fit with a predefined tolerance ε.
4: If the fraction of the number of inliers over the total number of points in the set exceeds a predefined threshold T, re-estimate the model parameters using all the identified inliers and terminate.
5: Otherwise, repeat steps 1 through 4 (a maximum of N times).

This system has more equations than unknowns and is thus usually over-determined. The L-K method obtains a compromise solution by the least-squares principle; namely, it solves the 2 × 2 system

$$A^T A v = A^T b, \tag{4}$$

where $A^T$ is the transpose of matrix $A$. That is, it computes

$$\begin{bmatrix} V_x \\ V_y \end{bmatrix} = \begin{bmatrix} \sum_i I_x(s_i)^2 & \sum_i I_x(s_i) I_y(s_i) \\ \sum_i I_x(s_i) I_y(s_i) & \sum_i I_y(s_i)^2 \end{bmatrix}^{-1} \begin{bmatrix} -\sum_i I_x(s_i) I_t(s_i) \\ -\sum_i I_y(s_i) I_t(s_i) \end{bmatrix}, \tag{5}$$

with the sums running from $i = 1$ to $d$. By combining information from several nearby pixels, the L-K method can often resolve the inherent ambiguity of the optical flow equation, and it is less sensitive to image noise than point-wise methods [15].
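As an illustration of this estimation step, the following is a minimal sketch of frame-to-frame interest point tracking. The paper does not name an implementation, so the choices below (OpenCV's pyramidal L-K tracker, Shi-Tomasi corners as interest points, window size and corner parameters) are assumptions.

```python
# Minimal sketch: frame-to-frame interest-point tracking with the
# pyramidal Lucas-Kanade method in OpenCV. Implementation choices
# (Shi-Tomasi corners, window size, pyramid depth) are assumptions.
import cv2
import numpy as np

def track_interest_points(prev_gray, next_gray):
    """Detect corners in prev_gray and track them into next_gray."""
    p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                 qualityLevel=0.01, minDistance=5)
    if p0 is None:
        return np.empty((0, 2)), np.empty((0, 2))
    # Each point is tracked by solving the least-squares system of
    # Eq. (5) over a 15x15 window, with pyramids for larger motions.
    p1, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, next_gray, p0, None, winSize=(15, 15), maxLevel=2)
    good = status.ravel() == 1      # keep successfully tracked points
    return p0[good].reshape(-1, 2), p1[good].reshape(-1, 2)
```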

However, the optical flow method is sensitive to even slight background or camera movement, so the RANdom SAmple Consensus (RANSAC) algorithm is applied next to further purify the motion detection. A basic assumption of RANSAC is that the data consist of inliers, i.e., data whose distribution can be explained by some set of model parameters, and outliers, i.e., data that do not fit the model; in addition, the data can be subject to noise. The outliers can come, for example, from extreme values of the noise, from erroneous measurements, or from incorrect hypotheses about the interpretation of the data. RANSAC also assumes that, given a (usually small) set of inliers, there exists a procedure that can estimate the parameters of a model that optimally explains or fits these data. The RANSAC algorithm is robust at removing outliers. As pointed out by Fischler and Bolles [18], unlike conventional sampling techniques that use as much of the data as possible to obtain an initial solution and then proceed to prune outliers, RANSAC uses the smallest set possible and proceeds to enlarge this set with consistent data points. The basic algorithm of RANSAC is summarized in Table I. The number of iterations, N, is chosen high enough to ensure a probability $P_{in}$ (usually set to 0.99) that at least one of the sets of random samples does not include an outlier. Let u represent the probability that any selected data point is an inlier and v = 1 − u the probability of observing an outlier.

N iterations, each drawing the minimum number of points $q_{min}$, are then required, where

$$1 - P_{in} = \left( 1 - (1 - v)^{q_{min}} \right)^N, \tag{6}$$

and thus, with some manipulation,

$$N = \frac{\log(1 - P_{in})}{\log\left( 1 - (1 - v)^{q_{min}} \right)}. \tag{7}$$
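A minimal sketch of how Eq. (7) and the loop of Table I could be realized follows. The translation motion model and the tolerance/threshold values are illustrative assumptions, since the paper does not state which model RANSAC fits to the flow vectors.

```python
# Sketch: adaptive iteration count of Eq. (7) and the RANSAC loop of
# Table I, here fitting a simple translation model to the point
# displacements. Model choice and parameter values are assumptions.
import math
import numpy as np

def ransac_iterations(p_in=0.99, v=0.5, q_min=1):
    """N such that one outlier-free sample occurs with probability p_in."""
    return math.ceil(math.log(1.0 - p_in) /
                     math.log(1.0 - (1.0 - v) ** q_min))

def ransac_inliers(disp, eps=2.0, tau=0.5):
    """disp: (N, 2) displacements p1 - p0; returns a boolean inlier mask."""
    rng = np.random.default_rng(0)
    best = np.zeros(len(disp), dtype=bool)
    for _ in range(ransac_iterations()):           # step 5: at most N trials
        model = disp[rng.integers(len(disp))]      # steps 1-2: minimal sample
        mask = np.linalg.norm(disp - model, axis=1) < eps   # step 3
        if mask.sum() > best.sum():
            best = mask
        if mask.mean() > tau:                      # step 4: re-estimate, stop
            model = disp[mask].mean(axis=0)
            return np.linalg.norm(disp - model, axis=1) < eps
    return best
```

For $P_{in} = 0.99$, $v = 0.5$ and $q_{min} = 1$, Eq. (7) gives N = 7.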

This way, some interest points are obtained that appear tagged to the moving object and move in whatever direction the object moves. In this case, the objects are different body parts of a human; for example, during a running action the legs and hands move significantly, so most of the interest points, after performing RANSAC, gather around the hand and leg areas of the body. It is also intuitive that the interest points as a whole gather around the body, which provides the scope to detect the position of the human subject in the scene by calculating the density of the horizontal-axis values of the points. To do this, a window of fixed length is taken along the X-axis (preferably wider than the width of the human body in the scene), and the number of interest points inside the window is counted. The window is then shifted by one pixel and the count is repeated. Carried out across the entire X-axis, this process returns the window position with the maximum number of interest points, which is taken as the position of the human subject.
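A sketch of this sliding-window search, under the assumption that the inlier points are available as an (N, 2) array; the window width W is an assumption, as the paper only requires it to exceed the subject's width.

```python
# Sketch: sliding a fixed-width window along the X-axis one pixel at a
# time and keeping the placement that contains the most interest points.
import numpy as np

def locate_subject_x(points, frame_width, W=60):
    """points: (N, 2) array of (x, y) inlier interest points."""
    xs = points[:, 0]
    best_x0, best_count = 0, -1
    for x0 in range(frame_width - W + 1):
        count = int(np.count_nonzero((xs >= x0) & (xs < x0 + W)))
        if count > best_count:
            best_x0, best_count = x0, count
    return best_x0, best_x0 + W     # horizontal extent of the subject
```

The same search could equivalently be computed by convolving the histogram of x-coordinates with a length-W box filter; the loop form is kept here to mirror the description above.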

Next, the mean ($\theta_y$) and standard deviation ($\sigma_y$) of the Y-axis values of all the points in this window are calculated, giving the vertical distribution of the interest points along the human body. The Y-axis is divided into n segments spanning from $\theta_y - \sigma_y - \delta_y$ to $\theta_y + \sigma_y + \delta_y$. The value $\delta_y$ is an adjustment constant chosen empirically based on the height of the frame and the probable height of the human subject. Within the window along the X-axis and the limits imposed along the Y-axis, the mean ($\theta_x$) and standard deviation ($\sigma_x$) of the X-axis values of all the points are calculated, and the X-axis is likewise divided into n segments spanning from $\theta_x - \sigma_x - \delta_x$ to $\theta_x + \sigma_x + \delta_x$, with $\delta_x$ another adjustment constant chosen empirically based on the width of the frame and the probable width of the human subject. The human subject is thus encapsulated within a rectangular area divided into n × n smaller blocks or segments. Fig. 2 shows the original image sequence, the optical flow output, the RANSAC output and, finally, the output of the localization and segmentation operation performed on the action sequences. The procedure is repeated in each frame. If the number of interest points within block k at the i-th frame is $IP_k^i$, then the cumulative change in the number of interest points in each block, $\varphi_k^i$, is calculated frame by frame using the following equation,

$$\varphi_k^i = \varphi_k^{i-1} + \left| IP_k^i - IP_k^{i-1} \right|, \tag{8}$$

where $k = 1, 2, \ldots, n^2$ and $i = 2, 3, \ldots, r$, with $r$ being the total number of frames. Also, the total change in interest points, $\psi^i$, for a given frame $i$ is calculated and accumulated throughout the operation using the following equation,

$$\psi^i = \sum_{k=1}^{n^2} \left| IP_k^i - IP_k^{i-1} \right|. \tag{9}$$

Finally, the percentage of change in the number of interest points in each block is calculated employing

$$\mathrm{PercentChange}_k = \sum_{i=1}^{r} \left( \frac{\varphi_k^i}{\psi^i} \right) \times 100\%. \tag{10}$$

Fig. 2. Feature extraction for a person waving both hands (Weizmann Dataset)

The above operation is done for several persons performing the same action, and the matrices of the percentage change of interest points obtained from all the persons are averaged. In this way, a feature vector is obtained for the respective action which contains information about how the body parts of a person typically move while performing that particular action. The whole process is repeated for several actions and a feature vector table is constructed (as shown in Fig. 1). The system is thus trained to perform the action classification and recognition phase.
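Under the assumption that the per-block counts $IP_k^i$ have already been gathered into an (r, n²) array, Eqs. (8)-(10) reduce to a few array operations. The sketch below reads Eq. (9) as a per-frame total, treats $\varphi_k^1 = 0$, and guards against frames with no change; these are interpretive assumptions.

```python
# Sketch of Eqs. (8)-(10): cumulative per-block change, per-frame total
# change, and the resulting per-block percentage feature. counts[i, k]
# is assumed to hold IP_k^i for frame i and block k.
import numpy as np

def percent_change_feature(counts):
    diff = np.abs(np.diff(counts, axis=0))  # |IP_k^i - IP_k^{i-1}|, i = 2..r
    phi = np.cumsum(diff, axis=0)           # Eq. (8): cumulative change
    psi = diff.sum(axis=1)                  # Eq. (9): total change per frame
    psi[psi == 0] = 1                       # avoid division by zero
    return (phi / psi[:, None]).sum(axis=0) * 100.0   # Eq. (10), shape (n*n,)
```

Averaging this feature over all persons performing the same action yields the feature vector stored in the table of Fig. 1.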

B. Action Classification

For the purpose of recognition using the extracted features, a distance-based similarity measure and a support vector machine (SVM)-based classifier are utilized in the proposed method (see Fig. 1). Let the z-dimensional feature vector for the m-th sample action image of the j-th action be $\mu_j^m(1), \mu_j^m(2), \ldots, \mu_j^m(z)$, and let an f-th test sample action image have the feature vector $x_f(1), x_f(2), \ldots, x_f(z)$. A similarity measure between the test action image f of the unknown action and the sample images of the j-th action is defined as

$$D_j = \sum_{m=1}^{q} \sum_{l=1}^{z} \left| \mu_j^m(l) - x_f(l) \right|^2, \tag{11}$$

where a particular class represents an action with q sample action images. Therefore, according to (12), given the f-th test action image, the unknown action is classified as the action g among the p classes when

$$D_g < D_j, \quad \forall j \neq g, \quad g \in \{1, 2, \ldots, p\}. \tag{12}$$
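A sketch of this nearest-action rule, assuming the training samples of each action are stacked into a (q, z) array:

```python
# Sketch of Eqs. (11)-(12): total squared Euclidean distance from the
# test vector to every training sample of each action; the action with
# the smallest total wins. The array layout is an assumption.
import numpy as np

def classify_euclidean(templates, x):
    """templates: list of (q, z) arrays, one per action; x: (z,) vector."""
    D = [np.sum((T - x) ** 2) for T in templates]   # Eq. (11)
    return int(np.argmin(D))                        # Eq. (12): action g
```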

SVM [19][20] can also be used for action classification using the proposed features. After the reduction of the feature space as stated in the previous sections, the SVM is trained with some randomly picked action images, and the system is then tested using the rest of the images.

In our experiments with the KTH database [21] and the Weizmann database [22], a polynomial kernel function of order 3 produced results similar to those of the Euclidean distance-based classifier.
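A sketch of this SVM stage with a third-order polynomial kernel; scikit-learn is an assumption, as the paper does not name a library.

```python
# Sketch: SVM classification with a polynomial kernel of order 3.
from sklearn.svm import SVC

def train_action_svm(X_train, y_train):
    """X_train: (num_samples, z) feature matrix; y_train: action labels."""
    clf = SVC(kernel='poly', degree=3)
    return clf.fit(X_train, y_train)

# predictions = train_action_svm(X_train, y_train).predict(X_test)
```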

IV. EXPERIMENTAL RESULTS AND ANALYSIS

In this experiment, six actions, namely boxing, handclapping, running, jogging, handwaving and walking, performed outdoors by 25 individuals, are considered from the KTH database [21] for both the training and testing phases. All sequences of the KTH database were taken over homogeneous backgrounds with a static camera at a 25 fps frame rate. The sequences were downsampled to a spatial resolution of 160 × 120 pixels. Sample actions from the database and their corresponding localization output images are shown in Fig. 3.

Another database, namely the Weizmann database, of 90 low-resolution (180 × 144, deinterlaced, 50 fps) video sequences with nine different people, each performing 10 natural actions, such as run, walk, skip, jumping-jack (or jack), jump-forward-on-two-legs (or jump), jump-in-place-on-two-legs (or pjump), gallop-sideways (or side), wave-two-hands (or wave2), wave-one-hand (or wave1) and bend, has also been considered. Extensive simulations have been carried out in order to demonstrate the effectiveness of the method proposed in Section III for human action recognition. During experimentation, the values of the adjustment constants $\delta_x$ and $\delta_y$ were set empirically to two different values for the two datasets, considering the average size of a human subject within the frame and the frame size.


TABLE II
COMPARISON OF RECOGNITION RATES OF VARIOUS ACTIONS OF THE KTH DATABASE

Action              Boxing    Running   Handclapping  Walking   Handwaving  Jogging
Euclidean Distance  76%       96%       88%           72%       100%        76%
SVM                 83.784%   83.784%   86.496%       83.784%   86.486%     82.432%

TABLE III
COMPARISON OF RECOGNITION RATES OF VARIOUS ACTIONS OF THE WEIZMANN DATASET

Action              Run     Side    Skip    Jump    Pjump   Bend    Jack    Walk    Wave1   Wave2
Euclidean Distance  100%    100%    100%    100%    100%    100%    100%    100%    100%    100%
SVM                 91.67%  91.67%  91.67%  91.67%  91.67%  91.67%  91.67%  91.67%  91.67%  91.67%

Fig. 3. Sample action images from the KTH database (boxing, handclapping, running, jogging, handwaving, walking) along with the final localization and segmentation results

The performance of the proposed method is investigated in terms of recognition accuracy. The classification task was performed following the leave-one-out test rule. The recognition accuracies obtained by the proposed method for the different actions of the KTH database and the Weizmann database are listed in Tables II and III, respectively.
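The leave-one-out protocol can be sketched as follows, assuming one feature vector per sequence and any classifier function of the hypothetical form classify_fn(train_X, train_y, test_x):

```python
# Sketch: leave-one-out evaluation. Each sample is held out once and
# classified against the rest. The feature/label layout and the
# classify_fn signature are assumptions.
import numpy as np

def leave_one_out_accuracy(X, y, classify_fn):
    correct = 0
    for i in range(len(X)):
        keep = np.arange(len(X)) != i           # hold out sample i
        correct += classify_fn(X[keep], y[keep], X[i]) == y[i]
    return correct / len(X)
```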

It can be seen from the results on the KTH database (see Table II) that the recognition accuracies based on Euclidean distance are high for the handwaving, running and handclapping actions, while they deteriorate for walking, jogging and boxing. The reason behind this shortcoming is that the exact localization of the human subject in the scene was sometimes hampered by very slight camera or background movement, and different people have different body shapes, clothes and action variations. Sometimes there were also moving shadows in the scenes, which led to partial localization of the human subject and thus added erroneous features to the feature vector. The Weizmann database, on the other hand, provided image sequences with a very steady camera and an almost static background. Thus, the localization was mostly accurate, and the recognition accuracies were 100% for all the actions with the simple Euclidean distance measure (see Table III), while with SVM the accuracy rate is 91.67%. This signifies the superiority of the extracted features over the classifier. The major factor behind the better performance of Euclidean distance over SVM is that the Euclidean distance classifier uses the average of all the feature vectors obtained from different persons performing a certain action as the feature matrix for that action, while the SVM-based classifier considers each person to be a different entity and compares the action of the test subject with the features obtained from each person. As the averaged feature matrix contains more generalized information about the movement pattern of the body during an action, the Euclidean distance measure, in most cases, surpasses the SVM method. Table IV compares the performance of the proposed method with some other promising methods. It can easily be seen that the proposed method outperforms the average accuracy rate of all the other methods on the Weizmann database when classified by Euclidean distance. Among the other methods, Ali et al. [25] also used optical flow for action recognition; however, the proposed method provides a superior feature extraction technique and thus produces more accurate results. For the KTH database, the performance of the proposed method is average, as can be seen from the table. However, it can be anticipated from the results that if a more robust localization of the human subject can be achieved using more boundary conditions on the statistical data, the proposed method can produce better results for the KTH database as well.


TABLE IV
COMPARISON OF THE PROPOSED METHOD WITH OTHER ALGORITHMS IN TERMS OF AVERAGE ACCURACY

Method                                  KTH dataset   Weizmann dataset
Proposed Method (Euclidean Distance)    84.67%        100%
Proposed Method (SVM)                   84.46%        91.67%
Dollar et al. [23]                      81.17%        -
Niebles et al. [24]                     81.5%         -
Ali et al. [25]                         87.7%         94.75%
Wong et al. [26]                        80.99%        -
Seo et al. [27]                         95.1%         97.5%
Junejo et al. [28]                      -             95.33%
Laptev et al. [29]                      91.8%         -

V. CONCLUSIONS

This paper presents a novel motion-based human action representation approach based on evaluating the statistical properties of a combination of the optical flow and RANSAC algorithms. Optical flow and RANSAC are well-established algorithms that impose very little computational burden and can be implemented in hardware. In order to detect the presence and direction of motion, optical flow is employed; RANSAC is used for further localization and identification of the motions. Interest points depicting the movement of pixels from frame to frame are obtained using these two algorithms. In this paper, statistical evaluation of the positions of the interest points has been suggested to estimate the position and area of the body of the human subject. The total body area is then divided into small segments, and the rate of change of the interest points in each segment is calculated. The feature vector formed this way has been used to classify several actions, and the experimental results show a high degree of accuracy for both the SVM and the Euclidean distance-based classifiers on the KTH and Weizmann databases. The major advantage of the proposed method is that it is simple yet efficient. It identifies an action by tracking the movement of the arms, legs, head and other body parts. Also, because of the object localization, the method is robust enough to identify the same action performed anywhere within the frame. However, the method is yet to be tested on more complex actions in cluttered outdoor environments, and its performance against the self-occlusion problem needs to be investigated. More robust statistical evaluation may also be applied. Finally, based on the experiments, it can be strongly claimed that the proposed method can be useful for various applications related to gesture and action understanding in the future.

REFERENCES

[1] M. A. R. Ahad, J. Tan, H. Kim, and S. Ishikawa, "Human activity recognition: various paradigms," Int'l Conf. Control, Automation and Systems, pp. 1896-1901, 2008.

[2] M. A. R. Ahad, J. Tan, H. Kim, and S. Ishikawa, "Motion history image: its variants and applications," Machine Vision and Applications, pp. 1-27, 2010.

[3] A. Bobick and J. Davis, "The recognition of human movement using temporal templates," IEEE PAMI, vol. 23, pp. 257-267, 2001.

[4] I. Laptev and T. Lindeberg, "Space-time interest points," Int'l Conf. on Computer Vision, vol. 1, 2003.

[5] J. Liu and N. Zhang, "Gait history image: a novel temporal template for gait recognition," Proc. IEEE Int'l Conf. on Multimedia and Expo, pp. 663-666, 2007.

[6] G. Bradski and J. Davis, "Motion segmentation and pose recognition with motion history gradients," Machine Vision and Applications, pp. 174-184, 2002.

[7] J. Davis, "Hierarchical motion history images for recognizing human motion," IEEE Workshop on Detection and Recognition of Events in Video, pp. 39-46, 2001.

[8] Q. Nguyen, S. Novakowski, J. Boyd, C. Jacob, and G. Hushlak, "Motion swarms: video interaction for art in complex environments," Proc. ACM Int'l Conf. Multimedia, CA, pp. 461-469, 2006.

[9] J. Davis and G. Bradski, "Real-time motion template gradients using Intel CVlib," Int'l Workshop on Frame-rate Vision with Int'l Conf. on Computer Vision, CA, pp. 1-20, 1999.

[10] L. Fei-Fei and P. Perona, "A Bayesian hierarchical model for learning natural scene categories," IEEE CVPR, pp. 524-531, 2005.

[11] J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman, "Discovering objects and their location in images," IEEE Int'l Conf. on Computer Vision, pp. 370-377, 2005.

[12] K. Guo, P. Ishwar, and J. Konrad, "Action recognition using sparse representation on covariance manifolds of optical flow," 7th IEEE Int'l Conf. on Advanced Video and Signal-Based Surveillance, Aug 2010.

[13] S. Danafar and N. Gheissari, "Action recognition for surveillance applications using optic flow and SVM," Proc. of the 8th Asian Conference on Computer Vision, vol. 2, 2007.

[14] S. Wang, K. Huang, and T. Tan, "A compact optical flow based motion representation for real-time action recognition in surveillance scenes," Proc. Int'l Conf. on Image Processing (ICIP), Cairo, Egypt, Nov 2009.

[15] Wikipedia, The Free Encyclopedia. Optical flow. [Online]. Available: http://en.wikipedia.org/wiki/Optical_flow

[16] B. D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," Proc. Imaging Understanding Workshop, pp. 121-130, 1981.

[17] B. D. Lucas, "Generalized image matching by the method of differences," Ph.D. dissertation, Robotics Institute, Carnegie Mellon University, July 1984.

[18] M. Fischler and R. Bolles, "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography," Communications of the ACM, pp. 381-395, 1981.

[19] J. Suykens and J. Vandewalle, "Least squares support vector machine classifiers," Neural Processing Letters, vol. 9, pp. 293-300, 1999.

[20] M. Awad, X. Jiang, and Y. Motai, "Incremental support vector machine framework for visual sensor networks," EURASIP J. Appl. Signal Processing, vol. 2007, pp. 222-222, January 2007.

[21] I. Laptev and B. Caputo. (2004) KTH action database. [Online]. Available: http://www.nada.kth.se/cvap/actions/

[22] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri, "Actions as space-time shapes," IEEE PAMI, vol. 29, no. 12, pp. 2247-2253, December 2007.

[23] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior recognition via sparse spatio-temporal features," IEEE Int'l Workshop VS-PETS, pp. 65-72, 2005.

[24] J. Niebles, H. Wang, and L. Fei-Fei, "Unsupervised learning of human action categories using spatial-temporal words," Int'l J. on Comput. Vis., Mar 2008.

[25] S. Ali and M. Shah, "Human action recognition in videos using kinematic features and multiple instance learning," IEEE PAMI, pp. 288-303, Feb 2010.

[26] S. F. Wong and R. Cipolla, "Extracting spatio-temporal interest points using global information," Oct 2007.

[27] H. J. Seo and P. Milanfar, "Action recognition from one example," IEEE PAMI, vol. 33, no. 5, May 2011.

[28] I. Junejo, E. Dexter, I. Laptev, and P. Perez, "View-independent action recognition from temporal self-similarities," IEEE PAMI, vol. 33, no. 1, pp. 172-185, Jan 2011.

[29] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, "Learning realistic human actions from movies," IEEE CVPR, 2008.