
Hindawi, Journal of Robotics, Volume 2020, Article ID 3096858. https://doi.org/10.1155/2020/3096858

Research Article

Enhanced Human Action Recognition Using Fusion of Skeletal Joint Dynamics and Structural Features

S. N. Muralikrishna,1 Balachandra Muniyal,2 U. Dinesh Acharya,1 and Raghurama Holla3

1Department of Computer Science and Engineering, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, India
2Department of Information and Communication Technology, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, India
3Department of Computer Applications, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, India

Correspondence should be addressed to Raghurama Holla; raghu247@gmail.com

Received 14 October 2019; Revised 12 June 2020; Accepted 9 July 2020; Published 1 August 2020

Academic Editor: L. Fortuna

Copyright © 2020 S. N. Muralikrishna et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In this research work, we propose a method for human action recognition based on the combination of structural and temporal features. The pose sequence in the video is considered to identify the action type. The structural variation features are obtained by detecting the angle made between the joints during the action, where the angle binning is performed using multiple thresholds. The displacement vector of joint locations is used to compute the temporal features. The structural variation features and the temporal variation features are fused using a neural network to perform action classification. We conducted the experiments on different categories of datasets, namely, KTH, UTKinect, and MSR Action3D datasets. The experimental results exhibit the superiority of the proposed method over some of the existing state-of-the-art techniques.

1. Introduction

The rapid growth in hardware and software technologies has resulted in the continuous generation of a huge amount of video data through video capturing devices such as smartphones and CCTV cameras. Also, a large amount of video content is being uploaded to YouTube every minute. Therefore, it is very important to extract useful information from these huge video databases and to recognize high-level activities for various applications such as automated surveillance systems, human-computer interaction, sports video analysis, real-time patient/children monitoring, shopping-behavior analysis, and dynamical systems [1]. Hence, human action recognition (HAR) from videos is an active area of research, as it has attracted the attention of several researchers in recent years.

Human action recognition focuses on detecting and tracking people and, in particular, understanding human behaviors from a video sequence. The research in this area focuses mainly on the development of techniques for an automated visual surveillance system. It requires a combination of computer vision and pattern recognition algorithms. However, in the literature, activity, behavior, action, gesture, and 'primitive/complex event' are frequently used to describe essentially the same concepts. HAR is challenging because of intraclass variation and interclass similarity. The same activity may vary from subject to subject, which is known as intraclass variation. Without contextual information, different activities may look similar, which leads to interclass similarity, for example, playing and running. There are many challenges in HAR, such as multisubject interactions, group activities, and complex visual backgrounds.

The two main approaches used for HAR are based on global descriptors and local descriptors. The local descriptors are robust to noise and can be applied to a wide range of action recognition problems. However, in recent years, skeleton-based approaches have been widely used due to the availability of depth sensors.


Several datasets are available for the evaluation of action recognition algorithms. They vary in terms of the number of classes, the sensors used, the duration of the action, the viewpoint, the complexity of the action performed, and so on. In this work, we address the problem of action recognition using a skeleton-based approach.

Contributions. (a) We propose a method for human action recognition based on encoded joint angle information and the joint displacement vector. (b) A neural network-based method to perform score-level fusion for action classification is proposed. (c) We experimentally show that the proposed method can be applied to datasets containing skeletal joint information acquired using Kinect sensors and also to datasets where explicit pose estimation needs to be done. Thus, the proposed method can be used with a vision-based sensor or a Kinect sensor.

The rest of the paper is organized as follows. Section 2 gives an overview of the existing techniques for human action recognition. Section 3 describes the proposed approach. The experimental results are demonstrated in Section 4. The conclusions and discussions are given in Section 5.

2. Review of Existing Techniques

Human activities can be broadly classified into four categories: gestures, actions, interactions (with objects and others), and group activities. Early approaches developed in the 1990s mainly focused on identifying gestures and simple actions based on motion analysis. A detailed review of motion analysis-based techniques is presented by Aggarwal and Cai [2]. However, the motion analysis-based methodologies were found to be less robust, as they were insufficient to describe human activities containing complex structures. Therefore, an improved approach was discussed by Aggarwal and Ryoo [3], who focused on methodologies to perform high-level activity recognition designed for the analysis of human actions, interactions, and group activities.

Ben-Arie et al. [4] have proposed a technique to perform human action recognition by computing a set of pose and velocity vectors for body parts such as hands, legs, and torso. These features are stored in a multidimensional hash table to achieve indexing- and sequence-based voting. Kellokumpu et al. [5] proposed another approach based on a texture descriptor by combining motion and appearance cues. The movement dynamics are captured using temporal templates, and the observed movements are characterized using texture features. A spatiotemporal space is considered, and the human movements are described with dynamic texture features. Also, the use of motion energy features for human activity analysis is presented by Gao et al. [6]. The motion energy template is constructed for the video using a filter bank, and the actions are classified using an SVM. Xu et al. [7] have proposed a hierarchical spatiotemporal model for human activity recognition. The model consists of a two-layer hidden conditional random field (HCRF), where the bottom layer is used to describe the spatial relations in each frame and the top layer uses high-level features for characterizing the temporal relations throughout the video sequence. The bottom layer also provides high-level semantic representations. A learning algorithm is used, and human activities are identified. To improve the robustness of the action recognition task, a combination of features consisting of dense trajectories and motion boundary histogram descriptors has been used by Wang et al. [8]. The descriptor captures different kinds of information, such as shape, appearance, and motion, to address the problem of camera motion.

Deep learning models gained popularity because of their superior performance in the field of pattern recognition and computer vision research. A review by Guo et al. [9] highlights the important developments in deep neural models. Ji et al. [10] proposed a 3D CNN model for human action recognition. The features are extracted from both spatial and temporal dimensions using 3D convolutions, thus capturing discriminative features. In another work, Wang et al. [11] proposed a technique where the spatiotemporal information obtained from 3D skeleton sequences is encoded into multiple 2D images forming Joint Trajectory Maps (JTMs), and ConvNets are applied to accomplish the action recognition task. As Joint Distance Maps (JDMs) describe texture features which are less sensitive to view variations, Li et al. [12] have developed an approach for action recognition by encoding the spatiotemporal information of skeleton sequences into color texture images. Then, using convolutional neural networks, the discriminative features are obtained from the JDMs for achieving both single-view and cross-view action recognition. Hou et al. [13] have proposed a method for effective action recognition based on skeleton optical spectra (SOS), where discriminative features are learned using convolutional neural networks (ConvNets). The spatiotemporal information of a skeleton sequence is effectively captured using skeleton optical spectra. This method is more suitable in the case of limited annotated training video data. Wang et al. [14] have presented a detailed survey of recent advances in RGB-D based motion recognition using deep learning techniques. In another approach, Rahmani et al. [15] have developed an improved version of a deep learning model based on nonlinear knowledge transfer model learning, achieving invariance to viewpoint change. A general codebook is generated using k-means to encode the action trajectories, and then the same codebook is used for encoding action trajectories of real videos. Li et al. [16] have used multiple deep neural networks to achieve multiview learning for three-dimensional human action recognition. These multiple networks help to effectively learn the discriminative features and also capture spatial and temporal information. The recognition scores of all views are combined using multiply fusion. Xiao et al. [17] have introduced an end-to-end trainable architecture-based model for human action recognition. The model consists of deep neural networks and attention models for learning spatiotemporal features from the skeleton data. Li et al. [18] have proposed an approach for skeleton-based human action recognition. A deep model, namely 3DConvLSTM, is used to learn spatiotemporal features from the video sequences, and an attention-based dynamic map is built for action classification.


An approach for online action recognition has been proposed by Tang et al. [19] based on a weighted covariance descriptor, by considering the importance of frame sequences with respect to their temporal order and discriminativeness. The combination of nearest neighbour search and Log-Euclidean kernel-based SVM is used for classification. In another work, an optical acceleration-based descriptor has been used by Edison and Jiji [20] for human action recognition. Two descriptors have been computed for effectively capturing the motion information, namely the histogram of optical acceleration and the histogram of spatial gradient acceleration. An approach based on a rank pooling method was introduced by Fernando et al. [21] for action recognition, which is capable of capturing both the appearance and the temporal dynamics of the video. A ranking function generated by the ranking machine provides important information about actions. In another work, Wang et al. [22] have presented a technique for action recognition based on order-aware convolutional pooling, focusing mainly on effectively capturing the dynamic information present in the video. After extracting features from each video frame, a convolutional filter bank is applied to each feature dimension, and then the filter responses are aggregated. Hu et al. [23] introduced a new approach for early action prediction based on soft regression applied on RGB-D channels. Here, the depth information is considered to achieve more robustness and discriminative power. Finally, a Multiple Soft labels Recurrent Neural Network (MSRNN) model is constructed, where feature extraction is done based on the Local Accumulative Frame Feature (LAFF). Some more approaches for action recognition have been developed based on sparse coding (Yang and Tian [24]), exemplar modeling (Hu et al. [25]), max-margin learning (Zhu et al. [26]), Fisher vectors (Wang and Schmid [27]), and block-level dense connections (Hao and Zhang [28]). Through the literature survey, it is found that several techniques have been proposed for human action recognition. Detailed reviews of action recognition research are reported by Ramanathan et al. [29], Gowsikhaa et al. [30], and Fu [31].

Many approaches are available in the literature for human action recognition. Most of the existing techniques use either the local features extracted temporally or the skeleton representation of the human pose in the temporal sequence. However, the combination of temporal features and spatial features provides a better recognition rate. In this direction, we propose a method to recognize human actions based on the combination of appearance and temporal features at the classifier decision level.

3. Proposed Work

In this work, we propose a method for human action recognition by considering the structural variation feature and the temporal displacement feature. The proposed method extracts features from the pose sequence in a given video. Figure 1 depicts the methodology of the proposed system. We extract the structural variation feature by detecting the angle made between the joints during an action. There are several methods available to estimate the pose: some of the pose estimation techniques found in the literature are based on sensor readings, and others are vision-based.

3.1. Pose Estimation for Action Recognition. The OpenPose library [32, 33] is one of the well-known vision-based libraries used to extract the skeletal joints. The performance of the OpenPose library in detecting the joint locations is limited when compared to sensor-based methods. It uses a VGG-19 deep neural network model to estimate the pose. The COCO model [34] consists of 18 skeletal joints, whereas the BODY_25 model gives 25 skeletal joint locations. In our experiments, we have used OpenPose to estimate the pose for the KTH dataset; however, for the other datasets, the pose information is taken from sensor readings. In the following section, we present the idea of structural feature extraction.
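The short sketch below shows one way to pull BODY_25 joints out of a single frame with the OpenPose Python bindings (pyopenpose). It is an illustrative sketch rather than the authors' released code: the model folder path is a placeholder, and the wrapper API differs slightly across OpenPose releases, so the calls should be checked against the installed version.

import cv2
import pyopenpose as op  # available when OpenPose is built with Python support

# Placeholder configuration; "models/" must point to the OpenPose model folder.
params = {"model_folder": "models/", "model_pose": "BODY_25"}
wrapper = op.WrapperPython()
wrapper.configure(params)
wrapper.start()

def joints_from_frame(frame):
    # Returns an array of shape (num_people, 25, 3) holding (x, y, confidence).
    datum = op.Datum()
    datum.cvInputData = frame
    # Newer releases expect op.VectorDatum([...]); older builds accept a plain list.
    wrapper.emplaceAndPop(op.VectorDatum([datum]))
    return datum.poseKeypoints

frame = cv2.imread("frame.png")        # e.g., one frame of a KTH video
keypoints = joints_from_frame(frame)   # the (x, y) columns feed the features below

Only the 2D (x, y) coordinates are used by the features described next; the confidence column can optionally be used to discard unreliable joint detections.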

3.2. Structural Variation Feature Extraction. Let us consider the skeleton represented by a set of points S = {j_i | i = 1, 2, ..., n_s} having n_s joints, where j_i = (x_i, y_i) denotes the estimated joint location in the 2D image plane. Our goal is to obtain the angle between the joint j_k and a set of joints J = {j_i | i = 1, 2, ..., n_s and i ≠ k}, which contributes to the structural variation in the skeleton. In a video having N frames, the angle θ^k_ij is found, where θ^k_ij represents the angle between the joints j_i and j_j in the kth frame, with k = 1, 2, 3, ..., N. For each joint j_i, a binary vector \vec{v}_i is computed using

\vec{v}_i = [v_1, v_2, \ldots, v_p, \ldots, v_{n_s}],   (1)

where v_p is given by

v_p = \begin{cases} 1, & \text{if } \frac{1}{N} \sum_{k=1}^{N} \left| \theta^{k}_{ip} \right| \ge T, \\ 0, & \text{otherwise.} \end{cases}   (2)

The procedure followed to fix the threshold T is given in Section 3.2.1. The feature vectors \vec{v}_i for i = 1, 2, ..., n_s are concatenated to obtain the structural feature vector \vec{v}. The dimension of \vec{v} is (n_s - 1) × (n_s - 1).

3.2.1. Feature Extraction Based on Angle Binning. The vector \vec{v} does not capture the variation in the angle at a finer level, as it is binarized with a single threshold value. Accordingly, we perform angle binning, where multiple thresholds are used in (3) to quantize the angle to a b-bit number by modifying (2). This captures the angle between the joints at a finer level; at the same time, the quantization helps to suppress minute variations in the angle during an action. The process of feature extraction is shown in Figure 2.

v_p = \sum_{l=0}^{b-1} f\left(\hat{\theta}_{ip}, T_l\right) \cdot 2^{l},   (3)

where T_l = l \cdot (\theta_{\max} / b) and 0 \le \theta_{\max} \le \pi/2. The terms \hat{\theta}_{ij} and f(x, y) are defined as

\hat{\theta}_{ij} = \frac{1}{N} \sum_{k=1}^{N} \left| \theta^{k}_{ij} \right|,   (4)

f(x, y) = \begin{cases} 1, & \text{if } x > y, \\ 0, & \text{otherwise.} \end{cases}   (5)
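To make the binning concrete, the following NumPy sketch computes the quantized structural feature of equations (3)-(5). It is not the authors' released implementation; in particular, the pairwise angle θ^k_ij is taken here as the orientation of the vector from joint j_i to joint j_j with respect to the x-axis, which is one plausible reading of Figure 2(b).

import numpy as np

def joint_angles(frames):
    # frames: array of shape (N, ns, 2) holding 2D joint locations over N frames.
    # Returns theta of shape (N, ns, ns), where theta[k, i, j] is the orientation
    # (in degrees) of the vector j_j - j_i with respect to the x-axis.
    d = frames[:, None, :, :] - frames[:, :, None, :]   # d[k, i, j] = j_j - j_i
    return np.degrees(np.arctan2(d[..., 1], d[..., 0]))

def structural_feature(frames, b=8, theta_max=90.0):
    # Equations (3)-(5): average absolute pairwise angles over the frames (4),
    # compare against the thresholds T_l = l * theta_max / b using f(x, y) (5),
    # and pack the comparison bits into a b-bit number per joint pair (3).
    theta_hat = np.abs(joint_angles(frames)).mean(axis=0)          # (ns, ns)
    thresholds = np.arange(b) * (theta_max / b)                    # T_0 ... T_{b-1}
    bits = (theta_hat[..., None] > thresholds).astype(np.int64)    # (ns, ns, b)
    v = (bits * (2 ** np.arange(b))).sum(axis=-1)                  # quantized angles
    off_diag = ~np.eye(frames.shape[1], dtype=bool)                # drop self-pairs
    return v[off_diag].astype(np.float64)                          # structural vector

For example, with \hat{\theta}_{ip} = 50°, b = 8, and θ_max = 90°, the thresholds are 0, 11.25, ..., 78.75, five comparisons succeed, and v_p = 31, which matches the worked example in Figure 2(c).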

The temporal variation feature captures the dynamics of individual joints by tracking them through the frames. This process is explained in the following section.

3.3. Temporal Feature Extraction. The temporal feature extraction looks at the change in the location of a joint j_i from frame t to frame t + 1. We consider the location of the joint j_i in two successive frames to find the relative position of the joint; this is effectively the tracking of the joint location. A histogram of the 2D displacement orientation of the joint locations in the X-Y plane is constructed to capture the temporal dynamics. The vector representing the joint displacement is computed for the sequence of video frames. For each displacement vector of a joint, we obtain the orientation pair consisting of the orientation angle and the magnitude, represented by (θ_i, ρ_i). The orientation angles θ_i, i = 1, 2, ..., n_s, for all the joints are used as the temporal features. The angle θ^t_i for joint j_i at time instance t + 1 is computed using

\theta^{t}_{i} = \arctan\left(\frac{\Delta y}{\Delta x}\right).   (6)

The displacement vector v_t of joint j_i at time instance t + 1 is calculated using

v_t = \begin{bmatrix} \Delta x_i \\ \Delta y_i \end{bmatrix} = \begin{bmatrix} x^{t+1}_i - x^{t}_i \\ y^{t+1}_i - y^{t}_i \end{bmatrix}.   (7)

The feature vector \vec{f}_i for every joint location j_i is given by

\vec{f}_i = \left[\theta^{1}_{i}, \theta^{2}_{i}, \ldots, \theta^{t}_{i}, \ldots, \theta^{N}_{i}\right].   (8)

Figure 1: Overall methodology of action classification. A video is processed to obtain pose information; structural features from joint angles and temporal features from joint displacement vectors are classified by SVM classifier-1 and SVM classifier-2, and the predicted classification scores are fused using a neural network to produce the predicted label.

Figure 2: Structural feature extraction and temporal feature extraction from joint locations. (a) An example of a skeletal joint model at times t and t + 1. (b) Computing the angle information θ^k_ij between joint pairs for structural feature extraction using (3). (c) An example of angle binning with b = 8 and θ_max = π/2: for θ^k_ij = 50°, the thresholds are T_0 = 0, T_1 = 11.25, T_2 = 22.5, ..., T_7 = 78.75, giving v_p = 31. (d) An example of computing the displacement angle θ^t_4 of joint j_4 for temporal feature extraction.



A k-bin histogram is created for every joint j_i from the feature vector \vec{f}_i. These histograms are concatenated to form a temporal feature vector \vec{f} representing an action. Since the joint locations are sparse when compared to traditional optical flow-based methods, the feature extraction process is computationally more efficient.
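A minimal sketch of this step, under the same (N, n_s, 2) joint-array convention as before, is given below. It uses arctan2 instead of the plain arctan of equation (6) to keep the full range of displacement directions and normalizes each per-joint histogram; both are implementation choices rather than details stated in the paper.

import numpy as np

def temporal_feature(frames, k=5):
    # Equations (6)-(8): per-joint displacement vectors between successive
    # frames, their orientation angles, and a k-bin histogram per joint.
    diff = np.diff(frames, axis=0)                                 # (N-1, ns, 2), equation (7)
    angles = np.degrees(np.arctan2(diff[..., 1], diff[..., 0]))    # equation (6)
    hists = []
    for j in range(frames.shape[1]):
        h, _ = np.histogram(angles[:, j], bins=k, range=(-180.0, 180.0))
        hists.append(h / max(h.sum(), 1))                          # normalized k-bin histogram
    return np.concatenate(hists)                                   # temporal feature vector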

3.4. Score-Level Fusion Using Neural Network. We combine the structural features and the temporal features at the score level. For every sample j, the classifier i assigns a score ranging from -∞ to +∞. The score is the signed distance of the observation j to the decision boundary: a positive score indicates that the sample j belongs to class i, while a negative score gives the distance of j from the decision boundary on the other side. The score-level fusion is performed using a neural network, which assigns significance to the classifiers based on the structural and temporal features. The structural features are less discriminative for describing actions having similar body part movements, such as walking, running, and jogging. The optimal fusion of temporal and structural features therefore helps in better recognition.

To generalize the classifier fusion, we consider a multiclass classification problem with c classes and n classifiers. In our case, we have used the scores from two SVM classifiers for fusion. The class prediction score for a sample j from the ith classifier is

\vec{x}_{ij} = \left[ x^{(1)}_{ij}, x^{(2)}_{ij}, \ldots, x^{(c)}_{ij} \right],   (9)

where each x^{(t)}_{ij} is a prediction score corresponding to the class t. The input to the neural network for the sample j is given by

\vec{v}_j = \left[ \vec{x}_{1j}, \vec{x}_{2j}, \ldots, \vec{x}_{nj} \right]^{T},   (10)

where \vec{v}_j \in \mathbb{R}^{nc}.

The predicted label at the output layer of the neural network is given by \vec{y}'_j \in \mathbb{R}^{c}. To get the optimal fusion score, we need to solve the objective function given in (11) for the N training samples in the action recognition dataset:

\text{minimize} \sum_{j=1}^{N} \left\| \vec{y}_j - \vec{y}'_j \right\|,   (11)

where \vec{y}_j is the actual label at the output layer for the sample j.

For a neuron k in the hidden layer t, the output θ_k of the neuron is given by

\theta_k = \sigma\left( \vec{w}^{(t-1)}_{k} \cdot \vec{v}_j \right),   (12)

where

\vec{w}^{(t-1)}_{k} = \left[ w^{(t-1)}_{1k}, w^{(t-1)}_{2k}, \ldots, w^{(t-1)}_{(nc)k} \right]   (13)

represents the synaptic weights from the previous layer to the neuron k, and σ(·) is the sigmoid activation function. For a neuron p at the output layer, the predicted label o_p is given by

o_{pj} = S\left( \vec{w}^{(l)}_{p} \cdot \vec{v}^{(l)}_{j} \right),   (14)

where

\vec{w}^{(l)}_{p} = \left[ w^{(l)}_{1p}, w^{(l)}_{2p}, \ldots, w^{(l)}_{(nc)p} \right]   (15)

represents the synaptic weights from the last hidden layer l to the output neuron p, \vec{v}^{(l)}_{j} is the input from the last hidden layer, and S(·) is the softmax function. The output of this layer, \vec{y}'_j, for a sample j is given by

\vec{y}'_j = \left[ o_{1j}, o_{2j}, \ldots, o_{pj}, \ldots, o_{cj} \right].   (16)

The neural network uses the backpropagation algorithm to learn the network parameters. An example of the neural network architecture used in the proposed model is shown in Figure 3.
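The sketch below mirrors this fusion pipeline with scikit-learn: two RBF-kernel SVMs produce per-class scores (equation (9)), the scores are concatenated (equation (10)), and a small feed-forward network with sigmoid hidden units and a softmax output is trained on them. The hidden-layer widths and the use of scikit-learn are illustrative assumptions; only the overall architecture follows Figure 3 and Section 4.2, and this is not the authors' released code.

import numpy as np
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

def train_fusion(X_struct, X_temp, y):
    # Two per-feature SVMs with RBF kernels (Section 4.2).
    svm_s = SVC(kernel="rbf", decision_function_shape="ovr").fit(X_struct, y)
    svm_t = SVC(kernel="rbf", decision_function_shape="ovr").fit(X_temp, y)
    # Signed distances to the decision boundaries, one score per class,
    # concatenated across the two classifiers (equations (9) and (10)).
    scores = np.hstack([svm_s.decision_function(X_struct),
                        svm_t.decision_function(X_temp)])
    # Feed-forward fusion network: 3 sigmoid hidden layers, softmax output.
    # The hidden-layer widths here are illustrative, not the authors' settings.
    fusion = MLPClassifier(hidden_layer_sizes=(16, 16, 16),
                           activation="logistic", max_iter=50)
    fusion.fit(scores, y)
    return svm_s, svm_t, fusion

def predict_fusion(svm_s, svm_t, fusion, X_struct, X_temp):
    scores = np.hstack([svm_s.decision_function(X_struct),
                        svm_t.decision_function(X_temp)])
    return fusion.predict(scores)

In practice, the scores used to train the fusion network should come from held-out (for example, cross-validated) predictions rather than from the SVMs' own training data; otherwise, the fusion layer is trained on over-optimistic scores.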

4. Experiments and Results

To demonstrate the performance of the proposed model, we carried out experiments on three publicly available datasets, namely, KTH [35], UTKinect [36], and the MSR Action3D dataset [37]. The KTH dataset requires explicit pose estimation, whereas the UTKinect dataset contains pose information captured using Kinect sensors. The source code of our implementation is available at https://github.com/muralikrishnasn/HARJointDynamics.git

4.1. Datasets. The KTH dataset contains six action types performed by 25 subjects under four different conditions. The skeletal joint information is not included in this dataset, unlike the other datasets used in the experiment. The UTKinect dataset is acquired using a Kinect sensor; it contains skeletal joint information for 10 types of actions performed by 10 subjects, repeated twice per action. The MSR Action3D dataset contains skeleton data for 20 action types performed by 10 subjects, where each action is performed 2 to 3 times. The dataset contains 20 joint locations per frame, captured using a sensor similar to a Kinect device.

4.2. Experimental Setup and Results. In our experiments, the OpenPose library [33] is used to estimate the pose for the KTH dataset. A pretrained network with the BODY_25 model is used in our experiments. The parameters of the experiment have been set as described in [35]. The deep neural network to detect the joints is executed on a Tesla P100 GPU. Support Vector Machine (SVM) classifiers are used to classify the structural and temporal features, and the predicted scores from these SVM classifiers are combined using a neural network. We used the radial basis function kernel in the SVM classifiers. A simple feed-forward network with sigmoid functions at the hidden layers and softmax output neurons is used to solve (11). In the experiments, the neural network has been trained with 50 epochs.


Figure 3: An example of the neural network architecture for the KTH dataset, with 12 input neurons, 3 hidden layers, and 6 output neurons.

Figure 4: A plot of epochs versus cross-entropy for the neural network (training, validation, and test curves); the best validation performance is 0.011578 at epoch 23.


A plot of epochs versus cross-entropy is shown in Figure 4. The results of the experiments are shown in Figures 5(a)-5(c), summarizing the confusion matrices for the structural features, the temporal features, and the score-level fusion, respectively. It can be seen that the misclassifications are between highly similar actions like running and jogging. The proposed model has achieved an accuracy of ≈90.3% on the KTH dataset.

We have conducted experiments on the UTKinect dataset in a similar manner to that shown in [36, 44]. The confusion matrix considering the structural features is presented in Figure 5(d), and the results for the temporal features and the score-level fusion using the neural network are shown in Figures 6(a) and 6(b).


Figure 5: (a) Confusion matrix for the structural features. Dataset: KTH; bits used for encoding b = 8; total number of joints considered: 25 (2D joint locations). (b) Confusion matrix for the temporal features. Dataset: KTH; bins k = 5; features considered: orientation angle only. (c) Confusion matrix for the score-level fusion using the neural network. Dataset: KTH; number of hidden layers: 3; epochs: 20. (d) Confusion matrix for the structural features. Dataset: UTKinect; bits used for encoding b = 8.


Figure 6: (a) Confusion matrix for the temporal features. Dataset: UTKinect; bins k = 5; features considered: orientation angle only. (b) Confusion matrix for the score-level fusion using the neural network. Dataset: UTKinect; number of hidden layers: 3; epochs: 50.

Table 1: Experimental results (accuracy, %) for the MSR Action3D dataset, cross-subject test.

  AS1: 89.6
  AS2: 83.2
  AS3: 98.2
  Overall: 90.33


The accuracy of the proposed method on the UTKinect dataset is ≈91.3%, with a deviation of ±1.5%.

The experiment on the MSR Action3D dataset has been conducted using the cross-subject test as described in [37], unlike the leave-one-subject-out cross-validation (LOOCV) method given in [40]. The actions are grouped into three subsets: AS1, AS2, and AS3. AS1 and AS2 have less interclass variation, whereas AS3 contains complex actions. The obtained results are listed in Table 1. A summary of the results from all three datasets is reported in Table 2. From Table 2, it is observed that the proposed method outperforms the existing methods for human action recognition.

Table 2: Experimental results (accuracy, %) of the proposed method vs. other techniques for human action recognition.

KTH dataset [35]
  STIP, Schuldt [38]: 73.6
  Efficient motion features [39]: 87.3
  Proposed method: 90.30

UTKinect dataset [36]
  Histogram of 3D joints [40]: 90.92
  Random forest [41]: 87.90
  Proposed method: 91.30

MSR Action3D dataset
  Histogram of 3D joints [40]: 78.97
  Eigen joints [42]: 82.30
  Joint angle similarities [43]: 83.53
  Proposed method: 90.33

Figure 7: Accuracy of action recognition using only the joint angle feature with respect to the quantization parameter b, for the KTH, UTKinect, and MSR Action3D (AS1, AS2, AS3) datasets.

Figure 8: Accuracy of action recognition by varying the number of bins k in the joint displacement feature for the KTH dataset (joint angles, joint displacement, and decision-level fusion).

Figure 9: Accuracy of action recognition by varying the number of bins k in the joint displacement feature for the UTKinect dataset (joint angles, joint displacement, and decision-level fusion).


Figure 10: Accuracy of action recognition by varying the number of bins k in the joint displacement feature for the MSR Action3D (AS1) dataset (joint angles, joint displacement, and decision-level fusion).

Figure 11: Accuracy of action recognition by varying the number of bins k in the joint displacement feature for the MSR Action3D (AS2) dataset (joint angles, joint displacement, and decision-level fusion).

Figure 12: Accuracy of action recognition by varying the number of bins k in the joint displacement feature for the MSR Action3D (AS3) dataset (joint angles, joint displacement, and decision-level fusion).


We used similar classifier settings for the other datasets in the experiment. The experimental results show that the proposed method outperforms some of the state-of-the-art techniques on all three datasets considered in the experiments. For the MSR Action3D dataset, our method gives an accuracy of ≈90.33%, with a deviation of ±2.5%, which is better than the methods listed in Table 2 by more than ≈5%. Moreover, the fusion of classifiers shows better performance than a single classifier.

4.3. Influence of Quantization Parameter b and Histogram Bins k on Accuracy. The performance of SVM classifier-1 shown in Figure 1 is analyzed by varying the quantization parameter b. The number of bits b used in quantization versus the accuracy is plotted in Figure 7. It is observed that the parameter has no influence on the results beyond b = 8 for the KTH and MSR Action3D datasets. However, the optimal value of b for the UTKinect dataset is 16. This is due to the variations in the range of data values for the location coordinates.

A plot of the number of bins k in the joint displacement feature versus the accuracy is shown in Figures 8-12. The displacement vectors provide complementary information to the joint angles. Most of the pose estimation algorithms fail to detect joints that are hidden due to occlusion or self-occlusion; normally, such algorithms return a zero value for these joint locations. These hidden joint locations act as noise and may degrade the performance of the action recognition algorithm.

4.4. Analysis of Most Significant Joints. In the KTH dataset, the hand-waving action is mainly due to the movement of joints j_3 to j_8; the other joints do not contribute to the action. The most important joints involved in an action are depicted in Figure 13. It can be observed that the actions walking, running, and jogging have similar characteristics in terms of angular movements. This is very useful in identifying outliers while detecting abnormalities in actions. (Dominant joints with respect to angular movement for the other datasets are shown in Figures 14-17.)

Figure 13: Visualization of dominant joints with respect to angular movement for the KTH dataset (handwaving, walking, jogging, running, handclapping, and boxing).


Figure 14: Visualization of dominant joints with respect to angular movement for the UTKinect dataset (clap hands, wave hands, pull, push, throw, carry, pick up, stand up, sit down, and walk).

Figure 15: Visualization of dominant joints with respect to angular movement for the MSR Action3D (AS1) dataset (actions a02, a03, a05, a06, a10, a13, a18, and a20).


Figure 16: Visualization of dominant joints with respect to angular movement for the MSR Action3D (AS2) dataset (actions a01, a04, a07, a08, a09, a11, a12, and a14).

Figure 17: Visualization of dominant joints with respect to angular movement for the MSR Action3D (AS3) dataset (actions a06, a14, a15, a16, a17, a18, a19, and a20).


Figure 18: A comparison of decision-level fusion using a neural network and score averaging [45] across the KTH, UTKinect, and MSR Action3D (AS1, AS2, AS3) datasets.

Figure 19: A graph demonstrating the improvement in accuracy using decision-level fusion for the KTH dataset (joint angles, joint displacement, and decision-level fusion versus the parameter b).

Figure 20: A graph demonstrating the improvement in accuracy using decision-level fusion for the UTKinect dataset (joint angles, joint displacement, and decision-level fusion versus the parameter b).

Figure 21: A graph demonstrating the improvement in accuracy using decision-level fusion for the MSR Action3D (AS1) dataset (joint angles, joint displacement, and decision-level fusion versus the parameter b).


The accuracy of the proposed system has been analyzed using two types of combiners: a trainable combiner using a neural network and a fixed combiner using score averaging [45]. This comparison is shown in Figure 18. The neural network is a better combiner, as it is able to find the optimal weights for the fusion, whereas score averaging works as a fixed combiner that gives equal importance to both classifiers and shows lower accuracy. The neural network-based fusion thus enhances the performance in terms of accuracy. It can be seen from Figures 19-23 that the fusion technique results in better performance.

A correlation analysis is performed on the outputs of the two SVM classifiers; the result is listed in Table 3. The analysis shows that the average correlation is less than 0.5, which indicates that the classifiers agree only moderately on the classification. Consequently, the fusion of these scores leads to an improvement in the overall accuracy of the system.
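As a rough illustration, the sketch below computes the Pearson correlation between the two classifiers' per-class score matrices; treating the flattened score matrices as the paired variables is only one plausible reading of how the coefficients in Table 3 were obtained.

import numpy as np

def classifier_agreement(scores_struct, scores_temp):
    # scores_*: arrays of shape (n_samples, c) of per-class SVM scores.
    # Returns the Pearson correlation between the two flattened score matrices.
    return np.corrcoef(scores_struct.ravel(), scores_temp.ravel())[0, 1]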

5. Conclusions

We have developed a method for human action recognition based on skeletal joints. The proposed method extracts structural and temporal features: the structural variations are captured using joint angles, and the temporal variations are represented using joint displacement vectors. The proposed approach is simple, as it uses single-view 2D joint locations, and yet it outperforms some of the state-of-the-art techniques. We also showed that, in the absence of a Kinect sensor, a pose estimation algorithm can be used as a preliminary step. The proposed method shows promising results for action recognition tasks when the temporal features and structural features are fused at the score level. Thus, the proposed method is suitable for robust human action recognition tasks.

Data Availability

The references of the datasets used in the experiment are provided in the reference list.

Conflicts of Interest

The authors have no conflicts of interest regarding the publication of this paper.

Figure 22: A graph demonstrating the improvement in accuracy using decision-level fusion for the MSR Action3D (AS2) dataset (joint angles, joint displacement, and decision-level fusion versus the parameter b).

Figure 23: A graph demonstrating the improvement in accuracy using decision-level fusion for the MSR Action3D (AS3) dataset (joint angles, joint displacement, and decision-level fusion versus the parameter b).

Table 3: Correlation analysis of the classifier outputs to find the classifier agreement for the SVM classifiers shown in Figure 1.

  KTH [35]: 0.6308
  UTKinect [36]: 0.2938
  MSR Action3D [37], AS1: 0.4059
  MSR Action3D [37], AS2: 0.5619
  MSR Action3D [37], AS3: 0.5478
  Average: 0.48804


References

[1] M. Bucolo, A. Buscarino, C. Famoso, L. Fortuna, and M. Frasca, "Control of imperfect dynamical systems," Nonlinear Dynamics, vol. 98, no. 4, pp. 2989-2999, 2019.

[2] J. K. Aggarwal and Q. Cai, "Human motion analysis: a review," Computer Vision and Image Understanding, vol. 73, no. 3, pp. 428-440, 1999.

[3] J. K. Aggarwal and M. S. Ryoo, "Human activity analysis: a review," ACM Computing Surveys, vol. 43, no. 3, pp. 1-43, 2011.

[4] J. Ben-Arie, Z. Zhiqian Wang, P. Pandit, and S. Rajaram, "Human activity recognition using multidimensional indexing," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 8, pp. 1091-1104, 2002.

[5] V. Kellokumpu, G. Zhao, and M. Pietikainen, "Recognition of human actions using texture descriptors," Machine Vision and Applications, vol. 22, no. 5, pp. 767-780, 2011.

[6] C. Gao, Y. Shao, and Y. Guo, "Human action recognition using motion energy template," Optical Engineering, vol. 54, no. 6, pp. 1-10, 2015.

[7] W. Xu, Z. Miao, X.-P. Zhang, and Y. Tian, "A hierarchical spatio-temporal model for human activity recognition," IEEE Transactions on Multimedia, vol. 19, no. 7, pp. 1494-1509, 2017.

[8] H. Wang, A. Klaser, C. Schmid, and C.-L. Liu, "Dense trajectories and motion boundary descriptors for action recognition," International Journal of Computer Vision, vol. 103, no. 1, pp. 60-79, 2013.

[9] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, and M. S. Lew, "Deep learning for visual understanding: a review," Neurocomputing, vol. 187, pp. 27-48, 2016.

[10] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221-231, 2013.

[11] P. Wang, Z. Li, Y. Hou, and W. Li, "Action recognition based on joint trajectory maps using convolutional neural networks," in Proceedings of the 24th ACM International Conference on Multimedia (MM '16), pp. 102-106, Association for Computing Machinery, New York, NY, USA, October 2016.

[12] C. Li, Y. Hou, P. Wang, and W. Li, "Joint distance maps based action recognition with convolutional neural networks," IEEE Signal Processing Letters, vol. 24, no. 5, pp. 624-628, 2017.

[13] Y. Hou, Z. Li, P. Wang, and W. Li, "Skeleton optical spectra-based action recognition using convolutional neural networks," IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 3, pp. 807-811, 2018.

[14] P. Wang, W. Li, P. Ogunbona, J. Wan, and S. Escalera, "RGB-D-based human motion recognition with deep learning: a survey," Computer Vision and Image Understanding, vol. 171, pp. 118-139, 2018.

[15] H. Rahmani, A. Mian, and M. Shah, "Learning a deep model for human action recognition from novel viewpoints," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 3, pp. 667-681, 2018.

[16] C. Li, Y. Hou, P. Wang, and W. Li, "Multiview-based 3-D action recognition using deep networks," IEEE Transactions on Human-Machine Systems, vol. 49, no. 1, pp. 95-104, 2019.

[17] R. Xiao, Y. Hou, Z. Guo, C. Li, P. Wang, and W. Li, "Self-attention guided deep features for action recognition," in Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME), pp. 1060-1065, Shanghai, China, July 2019.

[18] C. Li, Y. Hou, W. Li, and P. Wang, "Learning attentive dynamic maps (ADMs) for understanding human actions," Journal of Visual Communication and Image Representation, vol. 65, Article ID 102640, 2019.

[19] C. Tang, W. Li, P. Wang, and L. Wang, "Online human action recognition based on incremental learning of weighted covariance descriptors," Information Sciences, vol. 467, pp. 219-237, 2018.

[20] A. Edison and C. V. Jiji, "Automated video analysis for action recognition using descriptors derived from optical acceleration," Signal, Image and Video Processing, vol. 13, no. 5, pp. 915-922, 2019.

[21] B. Fernando, E. Gavves, M. J. Oramas, A. Ghodrati, and T. Tuytelaars, "Rank pooling for action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 773-787, 2017.

[22] P. Wang, L. Liu, C. Shen, and H. T. Shen, "Order-aware convolutional pooling for video based action recognition," Pattern Recognition, vol. 91, pp. 357-365, 2019.

[23] J. Hu, W. Zheng, L. Ma, G. Wang, J. Lai, and J. Zhang, "Early action prediction by soft regression," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 11, pp. 2568-2586, 2019.

[24] X. Yang and Y. L. Tian, "Action recognition using super sparse coding vector with spatio-temporal awareness," in Computer Vision-ECCV 2014, pp. 727-741, Springer International Publishing, Berlin, Germany, 2014.

[25] J.-F. Hu, W.-S. Zheng, J. Lai, S. Gong, and T. Xiang, "Exemplar-based recognition of human-object interactions," IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 4, pp. 647-660, 2016.

[26] J. Zhu, B. Wang, X. Yang, W. Zhang, and Z. Tu, "Action recognition with actons," in Proceedings of the 2013 IEEE International Conference on Computer Vision, pp. 3559-3566, Sydney, Australia, December 2013.

[27] H. Wang and C. Schmid, "Action recognition with improved trajectories," in Proceedings of the 2013 IEEE International Conference on Computer Vision, pp. 3551-3558, Sydney, Australia, December 2013.

[28] W. Hao and Z. Zhang, "Spatiotemporal distilled dense-connectivity network for video action recognition," Pattern Recognition, vol. 92, pp. 13-24, 2019.

[29] M. Ramanathan, W.-Y. Yau, and E. K. Teoh, "Human action recognition with video data: research and evaluation challenges," IEEE Transactions on Human-Machine Systems, vol. 44, no. 5, pp. 650-663, 2014.

[30] D. Gowsikhaa, S. Abirami, and R. Baskaran, "Automated human behavior analysis from surveillance videos: a survey," Artificial Intelligence Review, vol. 42, no. 4, pp. 747-765, 2014.

[31] Y. Fu, Human Activity Recognition and Prediction, Springer, Berlin, Germany, 2016.

[32] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, "OpenPose: realtime multi-person 2D pose estimation using part affinity fields," 2018, https://arxiv.org/abs/1812.08008.

[33] Z. Cao, T. Simon, S. E. Wei, and Y. Sheikh, "Realtime multi-person 2D pose estimation using part affinity fields," in Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), pp. 1302-1310, Honolulu, HI, USA, July 2017.

[34] "MSCOCO keypoint evaluation metric," http://cocodataset.org/#keypoints-eval.

[35] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: a local SVM approach," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), vol. 3, pp. 32-36, Cambridge, UK, September 2004.

[36] Y. Zhu, W. Chen, and G. Guo, "Fusing spatiotemporal features and joints for 3D action recognition," in Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 486-491, Portland, OR, USA, June 2013.

[37] W. Li, Z. Zhang, and Z. Liu, "Action recognition based on a bag of 3D points," in Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 9-14, San Francisco, CA, USA, June 2010.

[38] C. Schuldt, B. Caputo, C. Sch, and L. Barbara, "Recognizing human actions: a local SVM approach," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), vol. 3, pp. 3-7, Cambridge, UK, September 2004.

[39] M. Yang, F. Lv, W. Xu, K. Yu, and Y. Gong, "Human action detection by boosting efficient motion features," in Proceedings of the 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops), pp. 522-529, Kyoto, Japan, September 2009.

[40] L. Xia, C. C. Chen, and J. K. Aggarwal, "View invariant human action recognition using histograms of 3D joints," in Proceedings of the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 20-27, IEEE, Providence, RI, USA, June 2012.

[41] Y. Zhu, W. Chen, and G. Guo, "Fusing spatiotemporal features and joints for 3D action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Portland, OR, USA, June 2013.

[42] X. Yang and Y. Tian, "Effective 3D action recognition using Eigenjoints," Journal of Visual Communication and Image Representation, vol. 25, no. 1, pp. 2-11, 2014.

[43] E. Ohn-Bar and M. M. Trivedi, "Joint angles similarities and HOG2 for action recognition," in Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW '13), pp. 465-470, IEEE Computer Society, Washington, DC, USA, June 2013.

[44] R. Vemulapalli, F. Arrate, and R. Chellappa, "Human action recognition by representing 3D skeletons as points in a Lie group," in Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 588-595, Columbus, OH, USA, June 2014.

[45] L. I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, John Wiley & Sons, Hoboken, NJ, USA, 2004.


Page 2: EnhancedHumanActionRecognitionUsingFusionofSkeletal ...downloads.hindawi.com/journals/jr/2020/3096858.pdf · e displacement vector of joint locations is used to compute the temporal

the evaluation of action recognition algorithmsey vary interms of the number of classes sensors used duration ofaction view point complexity of action performed and soon In this work we address the problem of action recog-nition using skeleton-based approach

Contributions (a) We propose a method for humanaction recognition based on encoded joint angle informationand joint displacement vector (b) A neural network-basedmethod to perform score-level fusion for action classifica-tion is proposed (c) We experimentally show that theproposed method can be applied on datasets containing theskeletal joint information acquired using Kinect sensors andalso on datasets where explicit pose estimation needs to bedone us the proposed method can be used with a vision-based sensor or Kinect sensor

e rest of the paper is organized as follows Section 2gives an overview of the existing techniques for humanaction recognition Section 3 describes the proposed ap-proach e experimental results are demonstrated in Sec-tion 4 e conclusions and discussions are given in Section5

2 Review of Existing Techniques

Human activities can be broadly classified into four cate-gories gestures actions interactions (with objects andothers) and group activities Early approaches developed in1990s mainly focused on identifying gestures and simpleactions based on motion analysis A detailed review ofmotion analysis-based techniques is presented by Aggarwaland Cai [2] However the motion analysis-based method-ologies were found to be less robust as they were insufficientto describe human activities containing complex structureserefore an improved approach was discussed byAggarwal and Ryoo [3] who focused on methodologies toperform high-level activity recognition designed for theanalysis of human actions interactions and group activities

Ben-Arie et al [4] have proposed a technique toperform human action recognition by computing a set ofpose and velocity vectors for body parts such as handslegs and torso ese features are stored in a multidi-mensional hash table to achieve indexing- and sequence-based voting Kellokumpu et al [5] proposed anotherapproach based on texture descriptor by combiningmotion and appearance cues e movement dynamics arecaptured using temporal templates and the observedmovements are characterized using texture features Aspatiotemporal space is considered and the humanmovements are described with dynamic texture featuresAlso the use of motion energy features for human activityanalysis is presented by Gao et al [6] e motion energytemplate is constructed for the video using a filter bankand the actions are classified using SVM Xu et al [7] haveproposed a hierarchical spatiotemporal model for humanactivity recognition e model consists of a two-layerhidden conditional random field (HCRF) where thebottom layer is used to describe the spatial relations ineach frame and the top layer uses high-level features forcharacterizing the temporal relations throughout the

video sequence e bottom layer also provides high-levelsemantic representations A learning algorithm is usedand human activities are identified To improve the ro-bustness of action recognition task a combination offeatures consisting of dense trajectories and motionboundary histogram descriptors has been used by Wanget al [8] e descriptor captures different kinds of in-formation such as shape appearance and motion toaddress the problem of camera motion

e deep learning models gained popularity because oftheir superior performance in the field of pattern recognitionand computer vision research A review by Guo et al [9]highlights the important developments in deep neuralmodels Ji et al [10] proposed a 3D CNN model for humanaction recognition e features are extracted from bothspatial and temporal dimensions using 3D convolutionsthus capturing discriminative features In another workWang et al [11] proposed a technique where the spatio-temporal information obtained from 3D skeleton sequencesis encoded into multiple 2D images forming Joint TrajectoryMaps (JTMs) and ConvNets are applied to accomplish theaction recognition task As Joint Distance Maps (JDMs)describe texture features which are less sensitive to viewvariations Li et al [12] have developed an approach foraction recognition by encoding spatiotemporal informationof skeleton sequences into color texture images en usingconvolutional neural networks the discriminative featuresare obtained from the JDMs for achieving both single-viewand cross-view action recognition Hou et al [13] haveproposed a method for effective action recognition based onskeleton optical spectra (SOS) where discriminative featuresare learned using convolutional neural networks (Con-vNets) e spatiotemporal information of a skeleton se-quence is effectively captured using skeleton optical spectrais method is more suitable in case of limited annotatedtraining video data Wang et al [14] have presented a de-tailed survey of recent advances in RGB-D based motionrecognition using deep learning techniques In anotherapproach Rahmani et al [15] have developed an improvedversion of deep learning model based on nonlinearknowledge transfer model learning achieving invariance toviewpoint change A general codebook is generated usingk-means to encode the action trajectories and then the samecodebook is used for encoding action trajectories of realvideos Li et al [16] have used multiple deep neural networksto achieve multiview learning for three-dimensional humanaction recognition ese multiple networks help to effec-tively learn the discriminative features and also capturespatial and temporal information e recognition scores ofall views are combined using multiply fusion Xiao et al [17]have introduced an end-to-end trainable architecture-basedmodel for human action recognition e model consists ofdeep neural networks and attention models for learningspatiotemporal features from the skeleton data Li et al [18]have proposed an approach for skeleton-based human ac-tion recognition A deep model namely 3DConvLSTM isused to learn spatiotemporal features from the video se-quences and an attention-based dynamic map is built foraction classification


An approach for online action recognition has been proposed by Tang et al. [19] based on a weighted covariance descriptor, considering the importance of frame sequences with respect to their temporal order and discriminativeness. A combination of nearest neighbour search and a Log-Euclidean kernel-based SVM is used for classification. In another work, an optical acceleration-based descriptor has been used by Edison and Jiji [20] for human action recognition. Two descriptors have been computed to effectively capture the motion information, namely, the histogram of optical acceleration and the histogram of spatial gradient acceleration. An approach based on a rank pooling method was introduced by Fernando et al. [21] for action recognition, which is capable of capturing both the appearance and the temporal dynamics of the video. A ranking function generated by the ranking machine provides important information about actions. In another work, Wang et al. [22] have presented a technique for action recognition based on order-aware convolutional pooling, focusing mainly on effectively capturing the dynamic information present in the video. After extracting features from each video frame, a convolutional filter bank is applied to each feature dimension, and then the filter responses are aggregated. Hu et al. [23] introduced a new approach for early action prediction based on soft regression applied on RGB-D channels. Here, the depth information is considered to achieve more robustness and discriminative power. Finally, a Multiple Soft labels Recurrent Neural Network (MSRNN) model is constructed, where feature extraction is done based on the Local Accumulative Frame Feature (LAFF). Further approaches for action recognition have been developed based on sparse coding (Yang and Tian [24]), exemplar modeling (Hu et al. [25]), max-margin learning (Zhu et al. [26]), Fisher vectors (Wang and Schmid [27]), and block-level dense connections (Hao and Zhang [28]). Through the literature survey, it is found that several techniques have been proposed for human action recognition. Detailed reviews of action recognition research are reported by Ramanathan et al. [29], Gowsikhaa et al. [30], and Fu [31].

Although many approaches are available in the literature for human action recognition, most existing techniques use either local features extracted temporally or the skeleton representation of the human pose in the temporal sequence. However, combining temporal features with spatial (structural) features yields a better recognition rate. In this direction, we propose a method to recognize human actions based on the combination of appearance and temporal features at the classifier decision level.

3. Proposed Work

In this work, we propose a method for human action recognition that considers a structural variation feature and a temporal displacement feature. The proposed method extracts features from the pose sequence in a given video. Figure 1 depicts the methodology of the proposed system. We extract the structural variation feature by detecting the angles made between the joints during an action. There are several methods available to estimate the pose: some of the pose estimation techniques found in the literature are based on sensor readings, and other methods are vision based.

3.1. Pose Estimation for Action Recognition. The OpenPose library [32, 33] is one of the well-known vision-based libraries used to extract the skeletal joints. The accuracy of the OpenPose library in detecting the joint locations is limited when compared to sensor-based methods. It uses a VGG-19 deep neural network model to estimate the pose. The COCO model [34] consists of 18 skeletal joints, whereas the BODY_25 model gives 25 skeletal joint locations. In our experiments, we have used OpenPose to estimate the pose for the KTH dataset; for the other datasets, the pose information is taken from sensor readings. In the following section, we present the idea of structural feature extraction.
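As a rough illustration of how the pose stage feeds the later feature extraction, the sketch below (not part of the paper's released code) collects per-frame 2D joint locations into an N × n_s × 2 array. Here, estimate_pose is a hypothetical stand-in for whatever backend supplies the keypoints, such as the OpenPose BODY_25 output for KTH or the Kinect skeleton stream for the other datasets.

```python
# Minimal sketch (not the authors' implementation): gather per-frame 2D joints.
# `estimate_pose` is a hypothetical callable returning an (n_joints, 2) array,
# or None when no person is detected in the frame.
import cv2
import numpy as np

def collect_joint_sequence(video_path, estimate_pose, n_joints=25):
    """Return an array of shape (N, n_joints, 2) with (x, y) per joint per frame."""
    cap = cv2.VideoCapture(video_path)
    frames_joints = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        joints = estimate_pose(frame)            # expected shape: (n_joints, 2)
        if joints is None:                       # undetected pose: fall back to zeros,
            joints = np.zeros((n_joints, 2))     # mirroring how hidden joints are handled
        frames_joints.append(np.asarray(joints, dtype=float))
    cap.release()
    return np.stack(frames_joints) if frames_joints else np.empty((0, n_joints, 2))
```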

3.2. Structural Variation Feature Extraction. Let us consider the skeleton represented by a set of points $S = \{j_i \mid i = 1, 2, \ldots, n_s\}$ having $n_s$ joints, where $j_i = (x_i, y_i)$ indicates the estimated joint location in the 2D image. Our goal is to obtain the angle between a joint $j_k$ and the set of joints $J = \{j_i \mid i = 1, 2, \ldots, n_s \text{ and } i \neq k\}$, which contributes to the structural variation in the skeleton. In a video having $N$ frames, the angle $\theta^{k}_{ij}$ is found, where $\theta^{k}_{ij}$ represents the angle between the joints $j_i$ and $j_j$ in the $k$-th frame, $k = 1, 2, 3, \ldots, N$. For each joint $j_i$, a binary vector $\vec{v}_i$ is computed using

$\vec{v}_i = [v_1, v_2, \ldots, v_p, \ldots, v_{n_s}]$,   (1)

where $v_p$ is given by

$v_p = 1$ if $\frac{1}{N}\sum_{k=1}^{N} \left|\theta^{k}_{ip}\right| \geq T$, and $v_p = 0$ otherwise.   (2)

The procedure followed to fix the threshold $T$ is given in Section 3.2.1. The feature vectors $\vec{v}_i$, for $i = 1, 2, \ldots, n_s$, are concatenated to obtain the structural feature vector $\vec{v}$. The dimension of $\vec{v}$ is $(n_s - 1) \times (n_s - 1)$.
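To make equations (1) and (2) concrete, the following sketch computes the per-frame pairwise angle matrix and the thresholded binary vectors. It assumes that $\theta^{k}_{ij}$ is the orientation of the segment from joint $j_i$ to joint $j_j$ in frame $k$, which is our reading rather than something the paper states explicitly.

```python
# Sketch of equations (1)-(2), assuming theta^k_ij is the orientation of the
# segment joining joints j_i and j_j in frame k (an assumption of this sketch).
import numpy as np

def pairwise_angles(joints_seq):
    """joints_seq: (N, ns, 2) array -> angles of shape (N, ns, ns), in radians."""
    diff = joints_seq[:, None, :, :] - joints_seq[:, :, None, :]   # vector j_j - j_i per frame
    return np.arctan2(diff[..., 1], diff[..., 0])

def binary_structural_vector(joints_seq, T):
    """Concatenated binary vectors v_i: entry (i, p) is 1 when mean |theta_ip| >= T."""
    mean_abs = np.abs(pairwise_angles(joints_seq)).mean(axis=0)    # (ns, ns)
    return (mean_abs >= T).astype(np.uint8).ravel()
```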

3.2.1. Feature Extraction Based on Angle Binning. The vector $\vec{v}$ does not capture the variation in the angle at a finer level, as it is binarized with a single threshold value. Accordingly, we perform angle binning, where multiple thresholds are used in (3) to quantize the angle to a $b$-bit number by modifying (2). This captures the angle between the joints at a finer level; at the same time, the quantization helps to suppress minute variations in the angle during an action. The process of feature extraction is shown in Figure 2.

$v_p = \sum_{l=0}^{b-1} f\left(\hat{\theta}^{k}_{ij}, T_l\right) \cdot 2^{l}$,   (3)

where $T_l = l \cdot (\theta_{\max}/b)$ and $0 \leq \theta_{\max} \leq \pi/2$. The terms $\hat{\theta}^{k}_{ij}$ and $f(x, y)$ are defined using

$\hat{\theta}^{k}_{ij} = \frac{1}{N}\sum_{k=1}^{N} \left|\theta^{k}_{ij}\right|$,   (4)

$f(x, y) = \begin{cases} 1 & \text{if } x > y \\ 0 & \text{otherwise} \end{cases}$.   (5)
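A sketch of the refined $b$-bit encoding of equations (3)-(5) follows; it reuses pairwise_angles from the previous sketch and packs the per-threshold comparison bits into an integer. With the worked values of Figure 2(c) (a mean angle of 50°, b = 8, θmax = 90°), it reproduces v_p = 31.

```python
# Sketch of the angle-binning feature of eq. (3)-(5); reuses pairwise_angles().
import numpy as np

def angle_binned_features(joints_seq, b=8, theta_max=np.pi / 2):
    theta_hat = np.abs(pairwise_angles(joints_seq)).mean(axis=0)    # eq. (4): (ns, ns)
    thresholds = np.arange(b) * (theta_max / b)                     # T_l = l * theta_max / b
    bits = (theta_hat[..., None] > thresholds).astype(np.uint32)    # f(theta_hat, T_l), eq. (5)
    weights = 2 ** np.arange(b, dtype=np.uint32)                    # 2^l
    return (bits * weights).sum(axis=-1).ravel()                    # eq. (3), concatenated
```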

The temporal variation feature captures the dynamics of individual joints by tracking them through the frames. This process is explained in the following section.

3.3. Temporal Feature Extraction. The temporal feature extraction looks at the change in the location of a joint $j_i$ from frame $t$ to frame $t + 1$. We consider the location of the joint $j_i$ in two successive frames to find the relative position of the joint; this is effectively the tracking of the joint location. A histogram of the 2D displacement orientation of the joint location in the X-Y plane is constructed to capture the temporal dynamics. The vector representing the joint displacement is computed for the sequence of video frames. For each displacement vector of a joint, we obtain the orientation pair consisting of the orientation angle and the magnitude, represented by $(\theta_i, \rho_i)$. The orientation angles $\theta_i$, $i = 1, 2, \ldots, n_s$, for all the joints are used as the temporal features. The angle $\theta^{t}_{i}$ for joint $j_i$ at time instance $t + 1$ is computed using

$\theta^{t}_{i} = \arctan\left(\frac{\Delta y}{\Delta x}\right)$.   (6)

The displacement vector $v_t$ of joint $j_i$ at time instance $t + 1$ is calculated using

$v_t = \begin{bmatrix} \Delta x_i \\ \Delta y_i \end{bmatrix} = \begin{bmatrix} x^{t+1}_{i} - x^{t}_{i} \\ y^{t+1}_{i} - y^{t}_{i} \end{bmatrix}$.   (7)

[Figure 1: Overall methodology of action classification. Video → pose information → structural feature using joint angles and temporal feature using joint displacement vector → SVM classifier-1 and SVM classifier-2 → predicted classification scores → score fusion using neural network → predicted label.]

[Figure 2: Structural feature extraction and temporal feature extraction from joint locations. (a) An example of a skeletal joint model at time t and t + 1. (b) Computing the angle information θ^k_ij between joint pairs in each frame for structural feature extraction using (3). (c) An example of angle binning with b = 8 and θmax = π/2: for a mean angle of 50°, the thresholds T0 = 0, T1 = 11.25, T2 = 22.5, ..., T7 = 78.75 give the bit pattern 00011111, i.e., v_p = 31. (d) An example of computing the displacement angle for temporal feature extraction.]


The feature vector $\vec{f}_i$ for every joint location $j_i$ is given by

$\vec{f}_i = [\theta^{1}_{i}, \theta^{2}_{i}, \ldots, \theta^{t}_{i}, \ldots, \theta^{N}_{i}]$.   (8)

A $k$-bin histogram is created for every joint $j_i$ from the feature vector $\vec{f}_i$. These histograms are concatenated to form a temporal feature vector $\vec{f}$ representing an action. Since the joint locations are sparse compared to traditional optical flow-based methods, the feature extraction process is computationally more efficient.
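The displacement-orientation histogram of equations (6)-(8) can be sketched as below. arctan2 is used in place of arctan(Δy/Δx) so that the full orientation range is preserved, and the bin count k is the histogram parameter studied later in Section 4.3.

```python
# Sketch of the temporal feature of eq. (6)-(8): per-joint displacement
# orientations between consecutive frames, summarized as k-bin histograms.
import numpy as np

def temporal_feature(joints_seq, k=5):
    """joints_seq: (N, ns, 2) -> concatenated k-bin orientation histograms, shape (ns * k,)."""
    disp = np.diff(joints_seq, axis=0)                  # eq. (7): (N-1, ns, 2)
    theta = np.arctan2(disp[..., 1], disp[..., 0])      # eq. (6): orientation per joint
    hists = [np.histogram(theta[:, i], bins=k, range=(-np.pi, np.pi))[0]
             for i in range(joints_seq.shape[1])]
    return np.concatenate(hists).astype(float)          # temporal feature vector f
```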

3.4. Score-Level Fusion Using Neural Network. We combine the structural features and the temporal features at the score level. For every sample $j$, classifier $i$ assigns a score ranging from $-\infty$ to $+\infty$. The score is the signed distance of the observation $j$ to the decision boundary: a positive score indicates that the sample $j$ belongs to class $i$, while a negative score gives the distance of $j$ on the other side of the decision boundary. The score-level fusion is performed using a neural network, which assigns significance to the classifiers based on the structural and temporal features. The structural features are less discriminative for actions with similar body part movements, such as walking, running, and jogging; an optimal fusion of the temporal and structural features therefore helps in better recognition.

To generalize the classifier fusion, we consider a multiclass classification problem with $c$ classes and $n$ classifiers. In our case, we have used the scores from two SVM classifiers for fusion. The class prediction score for a sample $j$ from the $i$-th classifier is

$\vec{x}_{ij} = [x^{(1)}_{ij}, x^{(2)}_{ij}, \ldots, x^{(c)}_{ij}]$,   (9)

where each $x^{(t)}_{ij}$ is the prediction score corresponding to class $t$. The input to the neural network for the sample $j$ is given by

$\vec{v}_j = [\vec{x}_{1j}, \vec{x}_{2j}, \ldots, \vec{x}_{nj}]^{T}$,   (10)

where $\vec{v}_j \in \mathbb{R}^{nc}$. The predicted label at the output layer of the neural network is given by $\vec{y}\,'_j \in \mathbb{R}^{c}$. To get the optimal fusion score, we need to solve the objective function given in (11) for the $N$ training samples in the action recognition dataset:

$\text{minimize} \sum_{j=1}^{N} \left\| \vec{y}_j - \vec{y}\,'_j \right\|$,   (11)

where $\vec{y}_j$ is the actual label at the output layer for the sample $j$.

For a neuron $k$ in the hidden layer $t$, the output $\theta_k$ of the neuron is given by

$\theta_k = \sigma\left(\vec{w}^{(t-1)}_{k} \cdot \vec{v}_j\right)$,   (12)

where

$\vec{w}^{(t-1)}_{k} = [w^{(t-1)}_{1k}, w^{(t-1)}_{2k}, \ldots, w^{(t-1)}_{nc\,k}]$   (13)

represents the synaptic weights from the previous layer to the neuron $k$, and $\sigma(\cdot)$ is the sigmoid activation function. For a neuron $p$ at the output layer, the predicted label $o_p$ is given by

$o_{pj} = S\left(\vec{w}^{(l)}_{p} \cdot \vec{v}^{(l)}_{j}\right)$,   (14)

where

$\vec{w}^{(l)}_{p} = [w^{(l)}_{1p}, w^{(l)}_{2p}, \ldots, w^{(l)}_{nc\,p}]$   (15)

represents the synaptic weights from the last hidden layer $l$ to the output neuron $p$, $\vec{v}^{(l)}_{j}$ is the input from the last hidden layer, and $S(\cdot)$ is the softmax function. The output of this layer, $\vec{y}\,'_j$, for a sample $j$ is given by

$\vec{y}\,'_j = [o_{1j}, o_{2j}, \ldots, o_{pj}, \ldots, o_{cj}]$.   (16)

The neural network uses the backpropagation algorithm to learn the network parameters. An example of the neural network architecture used in the proposed model is shown in Figure 3.
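As an approximation of this fusion stage (a sketch under our own choices, not the authors' released implementation), the per-class decision scores of the two RBF-SVMs can be concatenated and fed to a small sigmoid/softmax network; the hidden-layer sizes below are placeholders.

```python
# Sketch of the score-level fusion: two RBF-SVMs produce per-class signed
# distances (the scores x_ij), which a small feed-forward network fuses.
# Assumes a multiclass problem (c > 2), so decision_function returns (n, c).
import numpy as np
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

def train_fusion(X_struct, X_temp, y, hidden=(16, 16, 16)):
    svm_struct = SVC(kernel="rbf", decision_function_shape="ovr").fit(X_struct, y)
    svm_temp = SVC(kernel="rbf", decision_function_shape="ovr").fit(X_temp, y)
    scores = np.hstack([svm_struct.decision_function(X_struct),   # classifier 1 scores
                        svm_temp.decision_function(X_temp)])      # classifier 2 scores
    fusion = MLPClassifier(hidden_layer_sizes=hidden, activation="logistic",
                           max_iter=500).fit(scores, y)           # sigmoid hidden, softmax out
    return svm_struct, svm_temp, fusion

def predict_fusion(models, X_struct, X_temp):
    svm_struct, svm_temp, fusion = models
    scores = np.hstack([svm_struct.decision_function(X_struct),
                        svm_temp.decision_function(X_temp)])
    return fusion.predict(scores)
```

Note that MLPClassifier minimizes a cross-entropy loss (consistent with the plot in Figure 4), whereas objective (11) is written as a distance between label vectors, so this sketch is only an approximation of the fusion network in Figure 3.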

4. Experiments and Results

To demonstrate the performance of the proposed model, we carried out experiments on three publicly available datasets, namely, KTH [35], UTKinect [36], and the MSR Action3D dataset [37]. The KTH dataset requires explicit pose estimation, whereas the UTKinect dataset contains pose information captured using Kinect sensors. The source code of our implementation is available at https://github.com/muralikrishnasn/HARJointDynamics.git.

4.1. Datasets. The KTH dataset contains six action types performed by 25 subjects under four different conditions. Unlike the other datasets used in the experiment, skeletal joint information is not included in this dataset. The UTKinect dataset is acquired using a Kinect sensor and contains skeletal joint information for 10 types of actions performed by 10 subjects, repeated twice per action. The MSR Action3D dataset contains skeleton data for 20 action types performed by 10 subjects, where each action is performed 2 to 3 times. The dataset contains 20 joint locations per frame, captured using a sensor similar to the Kinect device.

4.2. Experimental Setup and Results. In our experiments, the OpenPose library [33] is used to estimate the pose for the KTH dataset; a pretrained network with the BODY_25 model is used. The parameters of the experiment have been set as described in [35]. The deep neural network that detects the joints is executed on a Tesla P100 GPU. Support Vector Machine (SVM) classifiers with a radial basis function kernel are trained separately on the structural and the temporal features, and the predicted scores from these SVM classifiers are combined using a neural network. A simple feed-forward network with sigmoid activations at the hidden layers and softmax output neurons is used to solve (11). In the experiments, the neural network has been trained for 50 epochs.


[Figure 3: An example of the neural network architecture for the KTH dataset, with 12 input neurons, 3 hidden layers, and 6 output neurons.]

[Figure 4: A plot of epochs versus cross-entropy for the neural network (train, validation, and test curves); the best validation performance of 0.011578 is reached at epoch 23.]

[Figures 5(a) and 5(b): confusion matrices for the KTH dataset; see the caption following Figure 5(d).]

A plot of epochs versus cross-entropy is shown in Figure 4. The results of the experiments are shown in Figures 5(a)-5(c), summarizing the confusion matrices for the structural features, the temporal features, and the score-level fusion, respectively. It can be seen that the misclassifications occur between highly similar actions such as running and jogging. The proposed model has achieved an accuracy of ≈90.3% on the KTH dataset.

We have conducted experiments on the UTKinect dataset in a manner similar to that described in [36, 44]. The confusion matrix for the structural features is presented in Figure 5(d). The results for the temporal features and the score-level fusion using the neural network are shown in Figures 6(a) and 6(b). The accuracy of the proposed method on the UTKinect dataset is ≈91.3%, with a deviation of ±1.5%.

[Figure 5: (a) Confusion matrix for structural features (dataset: KTH; bits used for encoding b = 8; 25 joints; 2D joint locations; accuracy = 0.65). (b) Confusion matrix for temporal features (dataset: KTH; bins k = 5; orientation angle only; accuracy = 0.84). (c) Confusion matrix for the score-level fusion using the neural network (dataset: KTH; 3 hidden layers; 20 epochs; accuracy = 0.91). (d) Confusion matrix for structural features (dataset: UTKinect; bits used for encoding b = 8; accuracy = 0.65).]

[Figure 6: (a) Confusion matrix for temporal features (dataset: UTKinect; bins k = 5; orientation angle only; accuracy = 0.66). (b) Confusion matrix for the score-level fusion using the neural network (dataset: UTKinect; 3 hidden layers; 50 epochs; accuracy = 0.91).]

Table 1: Experimental results for the MSR Action3D dataset (cross-subject test, accuracy %).
  AS1: 89.6
  AS2: 83.2
  AS3: 98.2
  Overall: 90.33


The experiment on the MSR Action3D dataset has been conducted using the cross-subject test as described in [37], unlike the leave-one-subject-out cross-validation (LOOCV) method given in [40]. The actions are grouped into three subsets, AS1, AS2, and AS3. AS1 and AS2 have less interclass variation, whereas AS3 contains complex actions. The obtained results are listed in Table 1. A summary of the results from all three datasets is reported in Table 2. From Table 2, it is observed that the proposed method outperforms the existing methods for human action recognition.

Table 2: Experimental results of the proposed method vs. other techniques for human action recognition (accuracy, %).
KTH dataset [35]
  STIP, Schuldt [38]: 73.6
  Efficient motion features [39]: 87.3
  Proposed method: 90.30
UTKinect dataset [36]
  Histogram of 3D joints [40]: 90.92
  Random forest [41]: 87.90
  Proposed method: 91.30
MSR Action3D dataset
  Histogram of 3D joints [40]: 78.97
  Eigen joints [42]: 82.30
  Joint angle similarities [43]: 83.53
  Proposed method: 90.33

[Figure 7: Accuracy of action recognition using only the joint angle feature as a function of the quantization parameter b, for KTH, MSR Action3D (AS1-AS3), and UTKinect.]

[Figure 8: Accuracy of action recognition as the number of bins k in the joint displacement feature varies, for the KTH dataset (joint angles, joint displacement, and decision-level fusion).]

[Figure 9: Accuracy of action recognition as the number of bins k in the joint displacement feature varies, for the UTKinect dataset (joint angles, joint displacement, and decision-level fusion).]


[Figure 10: Accuracy of action recognition as the number of bins k in the joint displacement feature varies, for the MSR Action3D (AS1) dataset (joint angles, joint displacement, and decision-level fusion).]

[Figure 11: Accuracy of action recognition as the number of bins k in the joint displacement feature varies, for the MSR Action3D (AS2) dataset (joint angles, joint displacement, and decision-level fusion).]

[Figure 12: Accuracy of action recognition as the number of bins k in the joint displacement feature varies, for the MSR Action3D (AS3) dataset (joint angles, joint displacement, and decision-level fusion).]


We used similar classifier settings for the other datasets in the experiment. The experimental results show that the proposed method outperforms some of the state-of-the-art techniques on all three datasets considered. For the MSR Action3D dataset, our method gives an accuracy of ≈90.33% with a deviation of ±2.5%, which is better than the methods listed in Table 2 by more than ≈5%. Moreover, the fusion of the classifiers shows better performance than either single classifier.

4.3. Influence of Quantization Parameter b and Histogram Bins k on Accuracy. The performance of SVM classifier-1 shown in Figure 1 is analyzed by varying the quantization parameter b. The number of bits b used in quantization versus accuracy is plotted in Figure 7. It is observed that the parameter has no influence on the results beyond b = 8 for the KTH and MSR Action3D datasets. However, the optimal value of b for the UTKinect dataset is 16. This is due to the variations in the range of data values for the location coordinates.

A plot of the number of bins k in the joint displacement feature versus accuracy is shown in Figures 8-12. The displacement vectors provide complementary information to the joint angles. Most pose estimation algorithms fail to detect joints that are hidden due to occlusion or self-occlusion and normally return a zero value for such joint locations. These hidden joint locations act as noise and may degrade the performance of the action recognition algorithm.

4.4. Analysis of Most Significant Joints. In the KTH dataset, the hand-waving action is mainly due to the movement of joints j3 to j8; the other joints do not contribute to the action. The most important joints involved in an action are depicted in Figure 13. It can be observed that the actions walking, running, and jogging have similar characteristics in terms of angular movements. This is very useful in identifying outliers while detecting abnormalities in actions. (Dominant joints with respect to angular movement for the other datasets are shown in Figures 14-17.)

[Figure 13: Visualization of dominant joints with respect to angular movement for the KTH dataset (panels: handwaving, walking, jogging, running, handclapping, boxing).]


[Figure 14: Visualization of dominant joints with respect to angular movement for the UTKinect dataset (panels: wave hands, clap hands, push, pull, carry, throw, stand up, pick up, walk, sit down).]

[Figure 15: Visualization of dominant joints with respect to angular movement for the MSR Action3D (AS1) dataset (actions a02, a03, a05, a06, a10, a13, a18, a20).]


[Figure 16: Visualization of dominant joints with respect to angular movement for the MSR Action3D (AS2) dataset (actions a01, a04, a07, a08, a09, a11, a12, a14).]

[Figure 17: Visualization of dominant joints with respect to angular movement for the MSR Action3D (AS3) dataset (actions a06, a14, a15, a16, a17, a18, a19, a20).]


[Figure 18: A comparison of decision-level fusion using the neural network and score averaging [45], across KTH, UTKinect, and MSR Action3D AS1-AS3.]

[Figure 19: Improvement in accuracy using decision-level fusion for the KTH dataset (joint angles, joint displacement, and decision-level fusion versus parameter b).]

[Figure 20: Improvement in accuracy using decision-level fusion for the UTKinect dataset (joint angles, joint displacement, and decision-level fusion versus parameter b).]

[Figure 21: Improvement in accuracy using decision-level fusion for the MSR Action3D (AS1) dataset (joint angles, joint displacement, and decision-level fusion versus parameter b).]


The accuracy of the proposed system has been analyzed using two types of combiners: a trainable combiner using a neural network and a fixed combiner using score averaging [45]. This comparison is shown in Figure 18. The neural network is the better combiner, as it is able to find the optimal weights for the fusion, whereas score averaging acts as a fixed combiner that gives equal importance to both classifiers and therefore shows lower accuracy. The neural network-based fusion enhances the performance in terms of accuracy. It can be seen from Figures 19-23 that the fusion technique results in better performance.

Correlation analysis is performed on the outputs of the two SVM classifiers, and the result is listed in Table 3. The analysis shows that the average correlation is less than 0.5, which indicates that the classifiers only moderately agree on the classification. Consequently, the fusion of these scores leads to an improvement in the overall accuracy of the system.
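The agreement measure reported in Table 3 can be reproduced along the following lines; the paper does not state the exact formulation, so this sketch assumes a Pearson correlation between the two classifiers' flattened per-class score matrices.

```python
# Sketch (assumption: Pearson correlation over flattened per-class SVM scores).
import numpy as np

def classifier_score_correlation(scores_struct, scores_temp):
    """Both inputs: (n_samples, n_classes) decision scores from the two SVMs."""
    a = np.asarray(scores_struct, dtype=float).ravel()
    b = np.asarray(scores_temp, dtype=float).ravel()
    return np.corrcoef(a, b)[0, 1]
```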

5. Conclusions

We have developed a method for human action recognition based on skeletal joints. The proposed method extracts structural and temporal features: the structural variations are captured using joint angles, and the temporal variations are represented using joint displacement vectors. The proposed approach is simple, as it uses single-view 2D joint locations, and yet it outperforms some of the state-of-the-art techniques. We also showed that, in the absence of a Kinect sensor, a pose estimation algorithm can be used as a preliminary step. The proposed method shows promising results for action recognition when the temporal features and the structural features are fused at the score level. Thus, the proposed method is suitable for robust human action recognition tasks.

Data Availability

The references of the datasets used in the experiment are provided in the reference list.

Conflicts of Interest

The authors have no conflicts of interest regarding the publication of this paper.

[Figure 22: Improvement in accuracy using decision-level fusion for the MSR Action3D (AS2) dataset (joint angles, joint displacement, and decision-level fusion versus parameter b).]

02

03

04

05

06

07

08

09

1

Accu

racy

5 10 15 20 25 30 35 40 450Parameter b

Joint anglesJoint displacementDecision level fusion

Figure 23 A graph demonstrating improvement in accuracy usingdecision level fusion for the MSR Action3D (AS3) dataset

Table 3: Correlation analysis of the classifier outputs to find the classifier agreement for the SVM classifiers shown in Figure 1 (correlation coefficient of classifier output).
  KTH [35]: 0.6308
  UTKinect [36]: 0.2938
  MSR Action3D [37], AS1: 0.4059
  MSR Action3D [37], AS2: 0.5619
  MSR Action3D [37], AS3: 0.5478
  Average: 0.48804


References

[1] M. Bucolo, A. Buscarino, C. Famoso, L. Fortuna, and M. Frasca, "Control of imperfect dynamical systems," Nonlinear Dynamics, vol. 98, no. 4, pp. 2989-2999, 2019.
[2] J. K. Aggarwal and Q. Cai, "Human motion analysis: a review," Computer Vision and Image Understanding, vol. 73, no. 3, pp. 428-440, 1999.
[3] J. K. Aggarwal and M. S. Ryoo, "Human activity analysis: a review," ACM Computing Surveys, vol. 43, no. 3, pp. 1-43, 2011.
[4] J. Ben-Arie, Z. Zhiqian Wang, P. Pandit, and S. Rajaram, "Human activity recognition using multidimensional indexing," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 8, pp. 1091-1104, 2002.
[5] V. Kellokumpu, G. Zhao, and M. Pietikainen, "Recognition of human actions using texture descriptors," Machine Vision and Applications, vol. 22, no. 5, pp. 767-780, 2011.
[6] C. Gao, Y. Shao, and Y. Guo, "Human action recognition using motion energy template," Optical Engineering, vol. 54, no. 6, pp. 1-10, 2015.
[7] W. Xu, Z. Miao, X.-P. Zhang, and Y. Tian, "A hierarchical spatio-temporal model for human activity recognition," IEEE Transactions on Multimedia, vol. 19, no. 7, pp. 1494-1509, 2017.
[8] H. Wang, A. Klaser, C. Schmid, and C.-L. Liu, "Dense trajectories and motion boundary descriptors for action recognition," International Journal of Computer Vision, vol. 103, no. 1, pp. 60-79, 2013.
[9] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, and M. S. Lew, "Deep learning for visual understanding: a review," Neurocomputing, vol. 187, pp. 27-48, 2016.
[10] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221-231, 2013.
[11] P. Wang, Z. Li, Y. Hou, and W. Li, "Action recognition based on joint trajectory maps using convolutional neural networks," in Proceedings of the 24th ACM International Conference on Multimedia (MM '16), pp. 102-106, Association for Computing Machinery, New York, NY, USA, October 2016.
[12] C. Li, Y. Hou, P. Wang, and W. Li, "Joint distance maps based action recognition with convolutional neural networks," IEEE Signal Processing Letters, vol. 24, no. 5, pp. 624-628, 2017.
[13] Y. Hou, Z. Li, P. Wang, and W. Li, "Skeleton optical spectra-based action recognition using convolutional neural networks," IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 3, pp. 807-811, 2018.
[14] P. Wang, W. Li, P. Ogunbona, J. Wan, and S. Escalera, "RGB-D-based human motion recognition with deep learning: a survey," Computer Vision and Image Understanding, vol. 171, pp. 118-139, 2018.
[15] H. Rahmani, A. Mian, and M. Shah, "Learning a deep model for human action recognition from novel viewpoints," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 3, pp. 667-681, 2018.
[16] C. Li, Y. Hou, P. Wang, and W. Li, "Multiview-based 3-D action recognition using deep networks," IEEE Transactions on Human-Machine Systems, vol. 49, no. 1, pp. 95-104, 2019.
[17] R. Xiao, Y. Hou, Z. Guo, C. Li, P. Wang, and W. Li, "Self-attention guided deep features for action recognition," in Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME), pp. 1060-1065, Shanghai, China, July 2019.
[18] C. Li, Y. Hou, W. Li, and P. Wang, "Learning attentive dynamic maps (ADMs) for understanding human actions," Journal of Visual Communication and Image Representation, vol. 65, Article ID 102640, 2019.
[19] C. Tang, W. Li, P. Wang, and L. Wang, "Online human action recognition based on incremental learning of weighted covariance descriptors," Information Sciences, vol. 467, pp. 219-237, 2018.
[20] A. Edison and C. V. Jiji, "Automated video analysis for action recognition using descriptors derived from optical acceleration," Signal, Image and Video Processing, vol. 13, no. 5, pp. 915-922, 2019.
[21] B. Fernando, E. Gavves, M. J. Oramas, A. Ghodrati, and T. Tuytelaars, "Rank pooling for action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 773-787, 2017.
[22] P. Wang, L. Liu, C. Shen, and H. T. Shen, "Order-aware convolutional pooling for video based action recognition," Pattern Recognition, vol. 91, pp. 357-365, 2019.
[23] J. Hu, W. Zheng, L. Ma, G. Wang, J. Lai, and J. Zhang, "Early action prediction by soft regression," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 11, pp. 2568-2586, 2019.
[24] X. Yang and Y. L. Tian, "Action recognition using super sparse coding vector with spatio-temporal awareness," in Computer Vision - ECCV 2014, pp. 727-741, Springer International Publishing, Berlin, Germany, 2014.
[25] J.-F. Hu, W.-S. Zheng, J. Lai, S. Gong, and T. Xiang, "Exemplar-based recognition of human-object interactions," IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 4, pp. 647-660, 2016.
[26] J. Zhu, B. Wang, X. Yang, W. Zhang, and Z. Tu, "Action recognition with actons," in Proceedings of the 2013 IEEE International Conference on Computer Vision, pp. 3559-3566, Sydney, Australia, December 2013.
[27] H. Wang and C. Schmid, "Action recognition with improved trajectories," in Proceedings of the 2013 IEEE International Conference on Computer Vision, pp. 3551-3558, Sydney, Australia, December 2013.
[28] W. Hao and Z. Zhang, "Spatiotemporal distilled dense-connectivity network for video action recognition," Pattern Recognition, vol. 92, pp. 13-24, 2019.
[29] M. Ramanathan, W.-Y. Yau, and E. K. Teoh, "Human action recognition with video data: research and evaluation challenges," IEEE Transactions on Human-Machine Systems, vol. 44, no. 5, pp. 650-663, 2014.
[30] D. Gowsikhaa, S. Abirami, and R. Baskaran, "Automated human behavior analysis from surveillance videos: a survey," Artificial Intelligence Review, vol. 42, no. 4, pp. 747-765, 2014.
[31] Y. Fu, Human Activity Recognition and Prediction, Springer, Berlin, Germany, 2016.
[32] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, "OpenPose: realtime multi-person 2D pose estimation using part affinity fields," 2018, https://arxiv.org/abs/1812.08008.
[33] Z. Cao, T. Simon, S. E. Wei, and Y. Sheikh, "Realtime multi-person 2D pose estimation using part affinity fields," in Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), pp. 1302-1310, Honolulu, HI, USA, July 2017.
[34] "MS COCO keypoint evaluation metric," http://cocodataset.org/#keypoints-eval.
[35] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: a local SVM approach," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), vol. 3, pp. 32-36, Cambridge, UK, September 2004.
[36] Y. Zhu, W. Chen, and G. Guo, "Fusing spatiotemporal features and joints for 3D action recognition," in Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 486-491, Portland, OR, USA, June 2013.
[37] W. Li, Z. Zhang, and Z. Liu, "Action recognition based on a bag of 3D points," in Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 9-14, San Francisco, CA, USA, June 2010.
[38] C. Schuldt, B. Caputo, C. Sch, and L. Barbara, "Recognizing human actions: a local SVM approach," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), vol. 3, pp. 3-7, Cambridge, UK, September 2004.
[39] M. Yang, F. Lv, W. Xu, K. Yu, and Y. Gong, "Human action detection by boosting efficient motion features," in Proceedings of the 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops), pp. 522-529, Kyoto, Japan, September 2009.
[40] L. Xia, C. C. Chen, and J. K. Aggarwal, "View invariant human action recognition using histograms of 3D joints," in Proceedings of the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 20-27, IEEE, Providence, RI, USA, June 2012.
[41] Y. Zhu, W. Chen, and G. Guo, "Fusing spatiotemporal features and joints for 3D action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Portland, OR, USA, June 2013.
[42] X. Yang and Y. Tian, "Effective 3D action recognition using eigenjoints," Journal of Visual Communication and Image Representation, vol. 25, no. 1, pp. 2-11, 2014.
[43] E. Ohn-Bar and M. M. Trivedi, "Joint angles similarities and HOG2 for action recognition," in Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW '13), pp. 465-470, IEEE Computer Society, Washington, DC, USA, June 2013.
[44] R. Vemulapalli, F. Arrate, and R. Chellappa, "Human action recognition by representing 3D skeletons as points in a Lie group," in Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 588-595, Columbus, OH, USA, June 2014.
[45] L. I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, John Wiley & Sons, Hoboken, NJ, USA, 2004.

16 Journal of Robotics

Page 3: EnhancedHumanActionRecognitionUsingFusionofSkeletal ...downloads.hindawi.com/journals/jr/2020/3096858.pdf · e displacement vector of joint locations is used to compute the temporal

An approach for online action recognition has beenproposed by Tang et al [19] based on weighted covariancedescriptor by considering the importance of frame se-quences with respect to their temporal order and discrim-inativeness e combination of nearest neighbour searchand Log-Euclidean kernelndashbased SVM is used for classifi-cation In another work an optical acceleration-based de-scriptor has been used by Edison and Jiji [20] for humanaction recognition Two descriptors have been computed foreffectively capturing the motion information namely thehistogram of optical acceleration and histogram of spatialgradient acceleration An approach based on rank poolingmethod was introduced by Fernando et al [21] for actionrecognition which is capable of capturing both the ap-pearance and the temporal dynamics of the video A rankingfunction generated by the ranking machine provides im-portant information about actions In another work Wanget al [22] have presented a technique for action recognitionbased on order-aware convolutional pooling focusingmainly on effectively capturing the dynamic informationpresent in the video After extracting features from eachvideo frame a convolutional filter bank is applied to eachfeature dimension and then filter responses are aggregatedHu et al [23] introduced a new approach for early actionprediction based on soft regression applied on RGB-Dchannels Here the depth information is considered toachieve more robustness and discriminative power FinallyMultiple Soft labels Recurrent Neural Network (MSRNN)model is constructed where feature extraction is done basedon Local Accumulative Frame Feature (LAFF) Some moreapproaches for action recognition can be found which havebeen developed based on sparse coding Yang and Tian [24]exemplar modeling Hu et al [25] max-margin learningZhu et al [26] Fisher vector Wang and Schmid [27] andblock-level dense connections Hao and Zhang [28]rough literature survey it is found that several techniqueshave been proposed for human action recognition A de-tailed review on action recognition research is reported byRamanathan et al [29] Gowsikhaa et al [30] and Fu [31]

A lot of approaches are available in the literature forhuman action recognition Most of the existing techniquesuse either the local features extracted temporally or theskeleton representation of the human pose in the temporalsequence However the combination of temporal featuresand spatial features provides better recognition rate In thisdirection we propose a method to recognize human actionbased on the combination of appearance and temporalfeatures at the classifier decision level

3 Proposed Work

In this work we propose a method for human action rec-ognition by considering the structural variation feature andthe temporal displacement feature e proposed methodextracts features from the pose sequence in a given videoFigure 1 depicts the methodology of the proposed systemWe extract the structural variation feature by detecting theangle made between the joints during an action ere areseveral methods available to estimate the pose Some of the

pose estimation techniques found in the literature are basedon sensor readings and other methods are based on vision-based techniques

31 Pose Estimation for Action Recognition e OpenPoselibrary [32 33] is one of the well-known vision-based li-braries used to extract the skeletal joints e performanceof the OpenPose library to detect the joint locations islimited when compared to sensor based methods It usesVGG-19 deep neural network model to estimate the posee COCO model [34] consists of 18 skeletal jointswhereas the BODY_25 model gives 25 skeletal joint lo-cations In our experiments we have used OpenPose toestimate the pose for the KTH dataset however for theother datasets the pose information is taken from sensorreadings In the following section we present the idea ofstructural feature extraction

32 Structural Variation Feature Extraction Let us considerthe skeleton represented by a set of pointsS ji | i 1 2 ns1113864 1113865 having ns joints where ji (xi yi)

indicating the estimated joint location in the 2D imagelocation Our goal is to obtain the angle between the joint jk

and a set of joints J ji | i 1 2 ns and ine k1113864 1113865 whichcontributes to structural variation in the skeleton In a videohaving N frames the angle θk

ij is found where θkij represents

the angle between the joints ji and jj in the kth frame wherek 1 2 3 N For each joint ji a binary vector vi

rarr iscomputed using

virarr

v1 v2 vp vns1113960 1113961 (1)

where vp is given by

vp 1N

1113944

N

k1θk

ip

11138681113868111386811138681113868

11138681113868111386811138681113868geT (2)

e procedure followed to fix the threshold ldquoTrdquo is givenin Section 321 e feature vectors vi

rarr for i 1 2 nsare concatenated to obtain the structural feature vector v

rarre dimension of v

rarr is (ns minus 1) times (ns minus 1)

321 Feature Extraction Based on Angle Binning e vectorvrarr does not provide the variation in the angle at a finer levelas it is binarized with a single threshold value Accordinglywe perform angle binning where multiple thresholds areused in (3) to quantize the angle to a b-bit number bymodifying (2)is captures the angle between the joints at afiner level at the same time the quantization helps tosuppress minute variations in the angle during an actione process of feature extraction is shown in Figure 2

vp 1113944bminus1

l0f 1113954θ

k

ij Tl1113874 1113875lowast 2l (3)

where Tl llowast(θmaxb) 0le θmax le (π2) e terms 1113954θk

ij andf(x y) are defined using

Journal of Robotics 3

1113954θk

ij 1N

1113944

N

k1θk

ij

11138681113868111386811138681113868

11138681113868111386811138681113868 (4)

f(x y) 1 if xgty

0 otherwise1113896 (5)

e temporal variation feature captures the dynamics ofindividual joint by tracking them through the frames isprocess is explained in the following section

33 Temporal Feature Extraction e temporal feature ex-traction looks at the change in the joint location for a joint ji

from the frame t to t + 1 We consider the location of thejoint ji in two successive frames to find the relative positionof the joint is is effectively the tracking of the joint lo-cation A histogram of 2D displacement orientation of jointlocation in the XndashY plane is constructed to capture the

temporal dynamics e vector representing the joint dis-placement is computed for the sequence of video frames Foreach displacement vector of joints we obtain the orientationpair consisting of orientation angle and the magnituderepresented by (θi ρi) e orientation angles θii 1 2 ns for all the joints are used as the temporalfeatures e θt

i for joint ji at time instance t + 1 is computedusing

θti arctan

ΔyΔx

1113874 1113875 (6)

e displacement vector vt of joint ji at time instancet + 1 is calculated using

vt Δxi

Δyi

1113890 1113891 xt+1

i minus xti

yt+1i minus yt

i

⎡⎣ ⎤⎦ (7)

e feature vector fi

rarrfor every joint location ji is given

by

Video Poseinformation

Structuralfeature usingjoint angles

Temporal featureusing joint

displacement vector

SVM classifier-1

SVM classifier-2

Predictedclassification

scores

Score fusionusing neural

network

Predictedlabel

Figure 1 Overall methodology of action classification

j16 j17 j16 j17j19j18 j1

j19j18 j1

j3 j2j4

j4

j5j5j10

j10j9 j6

j11j11

j12 j15j24 j24j23 j23

j22j22j20

j20

j21 j21

j15 j15j25 j25

j14j14

j13 j13

j6 j3j2

j6j7

j8

j8j7

At time t At time t + 1

t

(a)

j1j1

j1j1 j2 j3

j2

j2

ji

jn

jn

ji

ji jn

j2 j3 jj jnYt

XO

θk11

θ111

θ121

θ1i1

θ1n1 θ1

nj θ1nn

θ1ij θ1

in

θ112 θ1

13 θ11j θ1

1n

θk21

θki1

θkn1 θknj θknn

θkij θkin

θk12 θk13 θk1j θk1n

Frame 1

Frame k

Y

O X

jj = (xjt yj

t)

ji = (xit yi

t)

θtij

(b)

0 0 0 1 1 1 1 1

up = 31

MSB

θkij = 50 b = 8 θmax = 90 T0 = 0 T1 = 1125 T2 = 225 T7 = 7875

(c)

j2 j2j3 j3

j5j5

j4 = (x4t y4

t)

j4 = (x4t+1 y4

t+1)

j4 = (x4t y4

t)

j4 = (x4t+1 y4

t+1)

At time t At time t + 1

Y

X

θ4t

Displacement vector

(d)

Figure 2 Structural feature extraction and temporal feature extraction from joint locations (a) An example of a skeletal joint model (b)Computing the angle information for structural feature extraction using (3) (c) An example of angle binning with b 8 and θmax (π2)(d) An example of computing the displacement angle for temporal feature extraction

4 Journal of Robotics

fi

rarr θ1i θ2i θt

i θNi1113960 1113961 (8)

A k-bin histogram is created for every joint ji from thefeature vector fi

rarr is is concatenated to form a temporal

feature vector frarr

representing an action It is clear that thejoint locations are sparse when compared to the traditionaloptical flow-based methods us the feature extractionprocess is computationally more efficient

34 Score-Level Fusion Using Neural Network We combinethe structural features and the temporal features at the scorelevel For every sample j the classifier i assigns a score rangingfrom minusinf to +inf e score is the signed distance of theobservation j to the decision boundary A positive score indi-cates that the sample j belongs to class i A negative score givesthe distance of j from decision boundarye score-level fusionis performed using a neural networke neural network assignssignificance scores to the classifiers based on structural andtemporal features e structural features are less discriminativefor describing actions having similar body part movements suchas walk run and jogging e optimal fusion of temporal andstructural features would help in better recognition

To generalize the classifier fusion we consider a mul-ticlass classification problem with c classes and n classifiersIn our case we have used scores from two SVM classifiers forfusion e class prediction score for a sample j from ith

classifier is

xijrarr

x(1)ij x

(2)ij x

(c)ij1113960 1113961 (9)

where each x(t)ij is a prediction score corresponding to the

class t e input to the neural network for the sample j isgiven by

vjrarr

x1j x2jrarrrarr

xncjrarr1113876 1113877

T

(10)

where vjrarr isin Rnc

e predicted label at the output layer of the neuralnetwork is given by yj

primerarrisin Rc To get the optimal fusion score

we need to solve the objective function given in (11) for theN training samples in the action recognition dataset

minimize1113944N

j1yjrarr

minus yjprime

rarr

(11)

where yjrarr is the actual label at the output layer for the sample

jFor a neuron k in the hidden layer t the output θk of the

neuron is given by

θk σ wkrarr(tminus 1)

vjrarr

1113874 1113875 (12)

where

wkrarr(tminus 1)

w(tminus1)1k w

(tminus1)2k w

(tminus1)nck

1113960 1113961 (13)

represents the synaptic weights from the previous layer tothe neuron k and σ() is the sigmoid activation function For

a neuron p at the output layer the predicted label op is givenby

opj S wprarrl

vjrarrl

1113874 1113875 (14)

where

wprarrl

w(l)1p w

(l)2p w(l)

ncp1113960 1113961 (15)

represents the synaptic weights from the last hidden layer l tothe output neuron p vj

rarrl is the input from last hidden layer

S() is the softmax functione output of this layer yjprime

rarr for a

sample j is given by

yjprime

rarr o1j o2j opj ocj1113960 1113961 (16)

e neural network uses the backpropagation algorithmto learn the network parameters An example of neuralnetwork architecture used in the proposed model is shownin Figure 3

4 Experiments and Results

To demonstrate the performance of the proposed model wecarried out experiments on three publicly available datasetsnamely KTH [35] UTKinect [36] and MSR Action3Ddataset [37] e KTH dataset requires explicit pose esti-mation However UTKinect dataset contains the pose in-formation captured using Kinect sensors e source code ofour implementation is available at httpsgithubcommuralikrishnasnHARJointDynamicsgit

41 Datasets e KTH dataset contains six action typesperformed by 25 subjects under four different conditionse skeletal joint information is not included in the datasetunlike other datasets used in the experiment e UTKinectdataset is acquired using a Kinect sensor e datasetcontains skeletal joint information for 10 types of actionsperformed by 10 subjects repeated twice per actioneMSRAction3D dataset contains skeleton data for 20 action typesperformed by 10 subjects where each action is performed 2to 3 times e dataset contains 20 joint locations per framecaptured using a sensor similar to Kinect device

42 Experimental Setup and Results In our experimentsOpenPose library [33] is used to estimate the pose for theKTH dataset A pretrained network with BODY_25 model isused in our experiments e parameters of the experimenthave been set as described in [35] e deep neural networkto detect the joints is executed on a Tesla P100 GPU eSupport Vector Machine (SVM) classifiers are used to ex-tract the structural and temporal features e predictedscores from these SVM classifiers are combined using aneural network We used radial basis function kernel in theSVM classifiers A simple feed-forward network with sig-moid function at the hidden layers and softmax outputneurons is used to solve (11) In the experiments the neuralnetwork has been trained with 50 epochs A plot of epochs

Journal of Robotics 5

Input Hidden Output Output

w

b

w

b

12 3 6 6

Figure 3 An example of neural network architecture for the KTH dataset with 12 input neurons 3 hidden layers and 6 output neurons

Best validation performance is 0011578 at epoch 23

5 10 15 20 25029 epochs

10ndash4

10ndash3

10ndash2

10ndash1

100

Cros

s-en

tropy

ValidationTrain Test

Best

Figure 4 A plot of epochs vs cross-entropy for the neural network

Confusion matrix

Jogging

Han

dclap

ping

Han

dwav

ing

Runn

ing

Wal

king

Jogg

ing

Boxi

ng

True labelAccuracy = 065 misclass = 035

Boxing

Handclapping

Running

Walking

Handwaving

Pred

icte

d la

bel

0

5

10

15

20

25

30

35064

000

000

000

000

000

000

053

031

000

000

069

000

011

031

033

000

000

000

000

011

031

000

000

031

000

000

000

094

000

006

019

006

006

006

100

(a)

Confusion matrix

Boxing

Handclapping

Running

Jogging

Walking

Handwaving

Pred

icte

d la

bel

Han

dclap

ping

Boxi

ng

Runn

ing

Wal

king

Jogg

ing

Han

dwav

ing

True labelAccuracy = 084 misclass = 016

0

5

10

15

20

25

30

35097 000 000 000 000

006

000

003

000

100

003

000

000

000

000

089

000

014

067

000

000

017

028

000

000

075011

003

000

000

078000

000

000

011

000

(b)

Figure 5 Continued

6 Journal of Robotics

versus cross-entropy is shown in Figure 4 e results of theexperiments are shown in Figures 5(a)ndash5(c) summarizingthe confusion matrices for structural features temporalfeatures and the score-level fusion respectively It can beseen that the misclassifications are between highly similaractions like running and jogging e proposed model hasachieved an accuracy of asymp 903 on the KTH dataset

We have conducted experiments on UTKinect datasetin a similar manner to that shown in [36 44] e con-fusion matrix considering the structural features is

presented in Figure 5(d) e results for temporal featuresand score-level fusion using neural network are shown inFigures 6(a) and 6(b) e accuracy of the proposed

Confusion matrix

Boxing

Handclapping

Running

Jogging

Walking

Handwaving

Pred

icte

d la

bel

Han

dclap

ping

Boxi

ng

Runn

ing

Wal

king

Jogg

ing

Han

dwav

ing

True labelAccuracy = 091 misclass = 009

0

10

20

30

40

50

60097 000

009

004

000

004

099000

000

001

000

000

000

000

015

000

000

078

001

000

013

000

000

079

001

000

000

000

000

100

000

001

000

000

000

096

(c)

Confusion matrix

ClaphandsWavehands

PullPush

ThrowCarry

PickupStandupSitdown

Walk

Pred

icte

d la

bel

Carr

y

Wav

ehan

ds

Wal

k

Stan

dup

Thro

wPu

shPu

ll

Sitd

own

Clap

hand

s

Pick

up

True labelAccuracy = 065 misclass = 037

0

1

2

3

4

5

6

7086 000

000

000000 000

000000 000

029 029 029

071

014

014

000

000 000

000 000

000

071

014

014

014

014

014 014

000 000 000

000

000

000

000

000

000

000 000

000 000 000 000

000

086

000

000

000 000 000

000

000 014

014

000000014000000000

000 000 000 000 000 000 000 100 000000

000 000 000 000 014 000 000 000 029 057

014 000 029

000000071

014 014

000 000

000

000

000

029

014 000

000

000

000

014

(d)

Figure 5 (a) Confusionmatrix for structural features Dataset KTH bits used for encoding b 8 total number of joints considered 25 2Dlocation of joints (b) Confusion matrix for temporal features Dataset KTH bins k 5 features considered orientation angle only (c)Confusion matrix for the score-level fusion using neural network Dataset KTH number of hidden layers 3 epochs 20 (d) Confusionmatrix for structural features Dataset UTKinect bits used for encoding b 8

Claphands

Wavehands

Pull

Push

row

Carry

Pickup

Standup

Sitdown

Walk

Pred

icte

d la

bel

Pick

up

Stan

dup

ro

w

Clap

hand

s

Wav

ehan

ds

Carr

y

Wal

k

Push

Sitd

own

Pull

True labelAccuracy = 066 misclass = 034

0

1

2

3

4

5

6

7Confusion matrix100 000

100000

000

014

043

000

014

000

000

014

000

000 000 000 000

000 000

000 000

000 000 000

000

014

014

000

000

000

000

014

000

014

000

014 014

086

029

029

071

029

014

014

000

000 000

000

071

000

014

000

000

086

000

000

000

000

029

029

014 000 014 000 000

000 000 000

000 000 000

000

057

000 000

000 000

000

029

029

000

000

000

014

000

000

000 000 000

000 000 000

000 000 000 000 000 000 000 000

(a)

Claphands

Wavehands

Pull

Push

row

Carry

Pickup

Standup

Sitdown

Walk

Pred

icte

d la

bel

Sitd

own

Pull

Push

Carr

y

Wav

ehan

ds

Wal

k

Pick

up

Stan

dup

Clap

hand

s

ro

w

True labelAccuracy = 091 misclass = 009

0

2

4

6

8

10

12

14Confusion matrix

093

000

000

000

000 000 000 000

000

000 000

007 000

007

000 000 000

000 000 000 000

000 000

000 000

000 000

007

000

000

000 000 000 000 000

000

007

000 000

000 000

000

093

000

000 000 093

100

000

100

000

000 000

000 000 000 000 000 000

000 000 000 000 000 000

000 000 000 000 000 000 000

000

100

000 000 000 000 000

000

080

000

013

067

007

000 000 000

000

020

087

000 000

000 000

000

100

000

000

007

(b)

Figure 6 (a) Confusion matrix for temporal features Dataset UTKinect bins k 5 features considered orientation angle only (b)Confusion matrix for the score-level fusion using neural network Dataset UTKinect number of hidden layers 3 epochs 50

Table 1: Experimental results for the MSR Action3D dataset (cross-subject test accuracy, %).
AS1: 89.6
AS2: 83.2
AS3: 98.2
Overall: 90.33


The accuracy of the proposed method on the UTKinect dataset is ≈91.3%, with a deviation of ±1.5%.

The experiment on the MSR Action3D dataset has been conducted using the cross-subject test as described in [37], unlike the leave-one-subject-out cross-validation (LOOCV) method given in [40]. The actions are grouped into three subsets: AS1, AS2, and AS3. The AS1 and AS2 subsets have less interclass variation, whereas AS3 contains complex actions. The obtained results are listed in Table 1. A summary of the results from all three datasets is reported in Table 2. From Table 2, it is observed that the proposed method outperforms the existing methods for human action recognition.
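As an illustration of this evaluation protocol (a sketch, not the authors' implementation; the particular subject partition shown is an assumption, and the authoritative split is the one defined in [37]), a cross-subject test simply trains on one half of the subjects and tests on the remaining ones:

# Hypothetical helper: train on a fixed subset of subject IDs, test on the rest.
TRAIN_SUBJECTS = {1, 3, 5, 7, 9}  # assumed split; see [37] for the actual protocol

def cross_subject_split(samples):
    # samples: iterable of (feature_vector, label, subject_id) tuples
    train = [(x, y) for x, y, s in samples if s in TRAIN_SUBJECTS]
    test = [(x, y) for x, y, s in samples if s not in TRAIN_SUBJECTS]
    return train, test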

Table 2: Experimental results of the proposed method vs. other techniques for human action recognition (accuracy, %).
KTH dataset [35]:
  STIP, Schuldt [38]: 73.6
  Efficient motion features [39]: 87.3
  Proposed method: 90.30
UTKinect dataset [36]:
  Histogram of 3D joints [40]: 90.92
  Random forest [41]: 87.90
  Proposed method: 91.30
MSR Action3D dataset:
  Histogram of 3D joints [40]: 78.97
  Eigen joints [42]: 82.30
  Joint angle similarities [43]: 83.53
  Proposed method: 90.33

Figure 7: Accuracy of action recognition using only the joint angle feature with respect to the quantization parameter b (for the KTH, UTKinect, and MSR Action3D AS1-AS3 datasets).

Figure 8: Accuracy of action recognition by varying the number of bins k in the joint displacement feature for the KTH dataset.

Figure 9: Accuracy of action recognition by varying the number of bins k in the joint displacement feature for the UTKinect dataset.


Figure 10: Accuracy of action recognition by varying the number of bins k in the joint displacement feature for the MSR Action3D (AS1) dataset.

Figure 11: Accuracy of action recognition by varying the number of bins k in the joint displacement feature for the MSR Action3D (AS2) dataset.

Figure 12: Accuracy of action recognition by varying the number of bins k in the joint displacement feature for the MSR Action3D (AS3) dataset.


We used similar classifier settings for the other datasets in the experiment. The experimental results show that the proposed method outperforms some of the state-of-the-art techniques on all three datasets considered in the experiments. For the MSR Action3D dataset, our method gives an accuracy of ≈90.33% with a deviation of ±2.5%, which is better than the methods listed in Table 2 by more than ≈5%. Moreover, the fusion of classifiers shows better performance than either single classifier.

4.3. Influence of Quantization Parameter b and Histogram Bins k on Accuracy. The performance of SVM classifier-1 shown in Figure 1 is analyzed by varying the quantization parameter b. The number of bits b used in quantization versus accuracy is plotted in Figure 7. It is observed that the parameter has no influence on the results beyond b = 8 for the KTH and MSR Action3D datasets. However, the optimal value of b for the UTKinect dataset is 16. This is due to the variations in the range of data values for the location coordinates.
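To make the role of b concrete, the following minimal sketch encodes a joint-pair angle with b bits using uniformly spaced thresholds over [0, theta_max); the uniform spacing and the helper name are illustrative assumptions rather than the exact implementation:

import numpy as np

def encode_angle(theta, b=8, theta_max=90.0):
    # b thresholds T_i = i * theta_max / b (assumed spacing); bit i is set when
    # theta exceeds T_i, so a larger b gives a finer quantization of the angle.
    thresholds = np.arange(b) * (theta_max / b)
    return (theta > thresholds).astype(int)  # e.g. theta=50, b=8 -> [1 1 1 1 1 0 0 0]

Under this reading, increasing b beyond 8 changes very few of the resulting bits for typical angle ranges, which is one plausible explanation for the saturation seen in Figure 7.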

A plot of the number of bins k in the joint displacement feature versus the accuracy is shown in Figures 8-12. The displacement vectors provide complementary information to the joint angles. Most pose estimation algorithms fail to detect joints that are hidden due to occlusion or self-occlusion and normally return a zero value for such joint locations. These hidden joint locations act as noise and may degrade the performance of the action recognition algorithm.
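A minimal sketch of the k-bin displacement-orientation histogram is given below, assuming (as noted above) that undetected joints are reported as (0, 0) and should be skipped; the normalization and the bin range are assumptions made for illustration:

import numpy as np

def displacement_histogram(joints, k=5):
    # joints: array of shape (T, J, 2) with 2D joint locations per frame;
    # missing joints are assumed to be reported as (0, 0) by the pose estimator.
    d = np.diff(joints, axis=0)                   # frame-to-frame displacement
    theta = np.arctan2(d[..., 1], d[..., 0])      # orientation of each displacement
    valid = ~np.all(joints[1:] == 0, axis=-1) & ~np.all(joints[:-1] == 0, axis=-1)
    hists = []
    for j in range(joints.shape[1]):              # one k-bin histogram per joint
        h, _ = np.histogram(theta[valid[:, j], j], bins=k, range=(-np.pi, np.pi))
        hists.append(h / max(h.sum(), 1))         # normalize; guard empty joints
    return np.concatenate(hists)                  # temporal feature vector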

4.4. Analysis of Most Significant Joints. In the KTH dataset, the hand-waving action is mainly due to the movement of joints j3 to j8; the other joints do not contribute to the action. The most important joints involved in an action are depicted in Figure 13. It can be observed that the actions walking, running, and jogging have similar characteristics in terms of angular movement. This is very useful in identifying outliers while detecting abnormalities in actions. (Dominant joints with respect to angular movement for the other datasets are included in Figures 14-17.)
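The ranking of the most significant joints can be approximated by scoring each joint's angular variability over time; the variance-based score below is an illustrative assumption and not necessarily the exact criterion used to produce Figures 13-17:

import numpy as np

def rank_dominant_joints(joint_angles, top_n=6):
    # joint_angles: array of shape (T, J) with a per-frame angle summary per joint
    # (for example, the mean angle a joint makes with all other joints).
    movement = joint_angles.std(axis=0)      # angular variability of each joint
    order = np.argsort(movement)[::-1]       # most active joints first
    return order[:top_n]

For hand waving on KTH, such a ranking would be expected to favour the arm joints (roughly j3 to j8), in line with the observation above.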

Figure 13: Visualization of dominant joints with respect to angular movement for the KTH dataset (handwaving, walking, jogging, running, handclapping, and boxing).


Figure 14: Visualization of dominant joints with respect to angular movement for the UTKinect dataset.

Figure 15: Visualization of dominant joints with respect to angular movement for the MSR Action3D (AS1) dataset.


Figure 16: Visualization of dominant joints with respect to angular movement for the MSR Action3D (AS2) dataset.

Figure 17: Visualization of dominant joints with respect to angular movement for the MSR Action3D (AS3) dataset.


Figure 18: A comparison of decision-level fusion using the neural network and score averaging [45] on the KTH, UTKinect, and MSR Action3D (AS1-AS3) datasets.

Figure 19: A graph demonstrating the improvement in accuracy using decision-level fusion for the KTH dataset.

Figure 20: A graph demonstrating the improvement in accuracy using decision-level fusion for the UTKinect dataset.

Figure 21: A graph demonstrating the improvement in accuracy using decision-level fusion for the MSR Action3D (AS1) dataset.


The accuracy of the proposed system has been analyzed using two types of combiners: a trainable combiner using a neural network and a fixed combiner using score averaging [45]. This comparison is shown in Figure 18. The neural network is the better combiner, as it is able to find the optimal weights for the fusion, whereas score averaging works as a fixed combiner that gives equal importance to both classifiers and therefore shows lower accuracy. The neural network-based fusion enhances the performance in terms of accuracy. It can be seen from Figures 19-23 that the fusion technique results in better performance.
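The two combiners compared in Figure 18 can be summarized as follows. This is a hedged sketch rather than the authors' code: the hidden-layer sizes, iteration count, and use of scikit-learn's MLPClassifier are assumptions (the paper only states that a feed-forward network with three hidden layers is used).

import numpy as np
from sklearn.neural_network import MLPClassifier

# scores_1, scores_2: per-class SVM scores of shape (n_samples, n_classes),
# one matrix per feature stream (structural and temporal).

def fuse_by_averaging(scores_1, scores_2):
    # Fixed combiner: both classifiers receive equal weight.
    return np.argmax((scores_1 + scores_2) / 2.0, axis=1)

def train_fusion_network(scores_1, scores_2, labels):
    # Trainable combiner: a small feed-forward network learns the fusion weights
    # from the concatenated class scores of the two SVMs.
    fusion_input = np.hstack([scores_1, scores_2])
    net = MLPClassifier(hidden_layer_sizes=(16, 16, 16), max_iter=500)
    net.fit(fusion_input, labels)
    return net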

The correlation analysis is performed on the outputs of the two SVM classifiers, and the result is listed in Table 3. The analysis shows that the average correlation is less than 0.5, which indicates that the classifiers only moderately agree on the classification. Consequently, the fusion of these scores leads to an improvement in the overall accuracy of the system.
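The agreement measure in Table 3 can be reproduced, under the assumption that it is the Pearson correlation between the two classifiers' flattened score matrices, with a short helper such as:

import numpy as np

def classifier_agreement(scores_1, scores_2):
    # Pearson correlation between the two SVMs' score outputs; values well
    # below 1 suggest the feature streams make partly independent errors,
    # which is what makes the score-level fusion effective.
    return float(np.corrcoef(scores_1.ravel(), scores_2.ravel())[0, 1])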

5. Conclusions

We have developed a method for human action recognition based on skeletal joints. The proposed method extracts structural and temporal features: the structural variations are captured using joint angles, and the temporal variations are represented using the joint displacement vector. The proposed approach is simple, as it uses single-view 2D joint locations, and yet it outperforms some of the state-of-the-art techniques. We also showed that, in the absence of a Kinect sensor, a pose estimation algorithm can be used as a preliminary step. The proposed method shows promising results for action recognition tasks when the temporal and structural features are fused at the score level. Thus, the proposed method is suitable for robust human action recognition tasks.

Data Availability

The references of the datasets used in the experiment are provided in the reference list.

Conflicts of Interest

The authors have no conflicts of interest regarding the publication of this paper.

Figure 22: A graph demonstrating the improvement in accuracy using decision-level fusion for the MSR Action3D (AS2) dataset.

Figure 23: A graph demonstrating the improvement in accuracy using decision-level fusion for the MSR Action3D (AS3) dataset.

Table 3: Correlation analysis of the classifier output to find the classifier agreement for the SVM classifiers shown in Figure 1.
Dataset: correlation coefficient of classifier outputs
KTH [35]: 0.6308
UTKinect [36]: 0.2938
MSR Action3D [37], AS1: 0.4059
MSR Action3D [37], AS2: 0.5619
MSR Action3D [37], AS3: 0.5478
Average: 0.48804


References

[1] M. Bucolo, A. Buscarino, C. Famoso, L. Fortuna, and M. Frasca, "Control of imperfect dynamical systems," Nonlinear Dynamics, vol. 98, no. 4, pp. 2989-2999, 2019.
[2] J. K. Aggarwal and Q. Cai, "Human motion analysis: a review," Computer Vision and Image Understanding, vol. 73, no. 3, pp. 428-440, 1999.
[3] J. K. Aggarwal and M. S. Ryoo, "Human activity analysis: a review," ACM Computing Surveys, vol. 43, no. 3, pp. 1-43, 2011.
[4] J. Ben-Arie, Z. Zhiqian Wang, P. Pandit, and S. Rajaram, "Human activity recognition using multidimensional indexing," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 8, pp. 1091-1104, 2002.
[5] V. Kellokumpu, G. Zhao, and M. Pietikainen, "Recognition of human actions using texture descriptors," Machine Vision and Applications, vol. 22, no. 5, pp. 767-780, 2011.
[6] C. Gao, Y. Shao, and Y. Guo, "Human action recognition using motion energy template," Optical Engineering, vol. 54, no. 6, pp. 1-10, 2015.
[7] W. Xu, Z. Miao, X.-P. Zhang, and Y. Tian, "A hierarchical spatio-temporal model for human activity recognition," IEEE Transactions on Multimedia, vol. 19, no. 7, pp. 1494-1509, 2017.
[8] H. Wang, A. Klaser, C. Schmid, and C.-L. Liu, "Dense trajectories and motion boundary descriptors for action recognition," International Journal of Computer Vision, vol. 103, no. 1, pp. 60-79, 2013.
[9] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, and M. S. Lew, "Deep learning for visual understanding: a review," Neurocomputing, vol. 187, pp. 27-48, 2016.
[10] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221-231, 2013.
[11] P. Wang, Z. Li, Y. Hou, and W. Li, "Action recognition based on joint trajectory maps using convolutional neural networks," in Proceedings of the 24th ACM International Conference on Multimedia, MM '16, pp. 102-106, Association for Computing Machinery, New York, NY, USA, October 2016.
[12] C. Li, Y. Hou, P. Wang, and W. Li, "Joint distance maps based action recognition with convolutional neural networks," IEEE Signal Processing Letters, vol. 24, no. 5, pp. 624-628, 2017.
[13] Y. Hou, Z. Li, P. Wang, and W. Li, "Skeleton optical spectra-based action recognition using convolutional neural networks," IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 3, pp. 807-811, 2018.
[14] P. Wang, W. Li, P. Ogunbona, J. Wan, and S. Escalera, "RGB-D-based human motion recognition with deep learning: a survey," Computer Vision and Image Understanding, vol. 171, pp. 118-139, 2018.
[15] H. Rahmani, A. Mian, and M. Shah, "Learning a deep model for human action recognition from novel viewpoints," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 3, pp. 667-681, 2018.
[16] C. Li, Y. Hou, P. Wang, and W. Li, "Multiview-based 3-D action recognition using deep networks," IEEE Transactions on Human-Machine Systems, vol. 49, no. 1, pp. 95-104, 2019.
[17] R. Xiao, Y. Hou, Z. Guo, C. Li, P. Wang, and W. Li, "Self-attention guided deep features for action recognition," in Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME), pp. 1060-1065, Shanghai, China, July 2019.
[18] C. Li, Y. Hou, W. Li, and P. Wang, "Learning attentive dynamic maps (ADMs) for understanding human actions," Journal of Visual Communication and Image Representation, vol. 65, Article ID 102640, 2019.
[19] C. Tang, W. Li, P. Wang, and L. Wang, "Online human action recognition based on incremental learning of weighted covariance descriptors," Information Sciences, vol. 467, pp. 219-237, 2018.
[20] A. Edison and C. V. Jiji, "Automated video analysis for action recognition using descriptors derived from optical acceleration," Signal, Image and Video Processing, vol. 13, no. 5, pp. 915-922, 2019.
[21] B. Fernando, E. Gavves, M. J. Oramas, A. Ghodrati, and T. Tuytelaars, "Rank pooling for action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 773-787, 2017.
[22] P. Wang, L. Liu, C. Shen, and H. T. Shen, "Order-aware convolutional pooling for video based action recognition," Pattern Recognition, vol. 91, pp. 357-365, 2019.
[23] J. Hu, W. Zheng, L. Ma, G. Wang, J. Lai, and J. Zhang, "Early action prediction by soft regression," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 11, pp. 2568-2586, 2019.
[24] X. Yang and Y. L. Tian, "Action recognition using super sparse coding vector with spatio-temporal awareness," in Computer Vision-ECCV 2014, pp. 727-741, Springer International Publishing, Berlin, Germany, 2014.
[25] J.-F. Hu, W.-S. Zheng, J. Lai, S. Gong, and T. Xiang, "Exemplar-based recognition of human-object interactions," IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 4, pp. 647-660, 2016.
[26] J. Zhu, B. Wang, X. Yang, W. Zhang, and Z. Tu, "Action recognition with actons," in Proceedings of the 2013 IEEE International Conference on Computer Vision, pp. 3559-3566, Sydney, Australia, December 2013.
[27] H. Wang and C. Schmid, "Action recognition with improved trajectories," in Proceedings of the 2013 IEEE International Conference on Computer Vision, pp. 3551-3558, Sydney, Australia, December 2013.
[28] W. Hao and Z. Zhang, "Spatiotemporal distilled dense-connectivity network for video action recognition," Pattern Recognition, vol. 92, pp. 13-24, 2019.
[29] M. Ramanathan, W.-Y. Yau, and E. K. Teoh, "Human action recognition with video data: research and evaluation challenges," IEEE Transactions on Human-Machine Systems, vol. 44, no. 5, pp. 650-663, 2014.
[30] D. Gowsikhaa, S. Abirami, and R. Baskaran, "Automated human behavior analysis from surveillance videos: a survey," Artificial Intelligence Review, vol. 42, no. 4, pp. 747-765, 2014.
[31] Y. Fu, Human Activity Recognition and Prediction, Springer, Berlin, Germany, 2016.
[32] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, "OpenPose: realtime multi-person 2D pose estimation using part affinity fields," 2018, https://arxiv.org/abs/1812.08008.
[33] Z. Cao, T. Simon, S. E. Wei, and Y. Sheikh, "Realtime multi-person 2D pose estimation using part affinity fields," in Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, pp. 1302-1310, Honolulu, HI, USA, July 2017.
[34] "MSCOCO keypoint evaluation metric," http://cocodataset.org/#keypoints-eval.
[35] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: a local SVM approach," in Proceedings of the 17th International Conference on Pattern Recognition, 2004, ICPR 2004, vol. 3, pp. 32-36, Cambridge, UK, September 2004.
[36] Y. Zhu, W. Chen, and G. Guo, "Fusing spatiotemporal features and joints for 3D action recognition," in Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 486-491, Portland, OR, USA, June 2013.
[37] W. Li, Z. Zhang, and Z. Liu, "Action recognition based on a bag of 3D points," in Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, pp. 9-14, San Francisco, CA, USA, June 2010.
[38] C. Schuldt, B. Caputo, C. Sch, and L. Barbara, "Recognizing human actions: a local SVM approach," in Proceedings of the 17th International Conference on Pattern Recognition, 2004 (ICPR 2004), vol. 3, pp. 3-7, Cambridge, UK, September 2004.
[39] M. Yang, F. Lv, W. Xu, K. Yu, and Y. Gong, "Human action detection by boosting efficient motion features," in Proceedings of the 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops, pp. 522-529, Kyoto, Japan, September 2009.
[40] L. Xia, C. C. Chen, and J. K. Aggarwal, "View invariant human action recognition using histograms of 3D joints," in Proceedings of the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 20-27, IEEE, Providence, RI, USA, June 2012.
[41] Y. Zhu, W. Chen, and G. Guo, "Fusing spatiotemporal features and joints for 3D action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Portland, OR, USA, June 2013.
[42] X. Yang and Y. Tian, "Effective 3D action recognition using eigenjoints," Journal of Visual Communication and Image Representation, vol. 25, no. 1, pp. 2-11, 2014.
[43] E. Ohn-Bar and M. M. Trivedi, "Joint angles similarities and HOG2 for action recognition," in Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPRW '13, pp. 465-470, IEEE Computer Society, Washington, DC, USA, June 2013.
[44] R. Vemulapalli, F. Arrate, and R. Chellappa, "Human action recognition by representing 3D skeletons as points in a Lie group," in Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 588-595, Columbus, OH, USA, June 2014.
[45] L. I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, John Wiley & Sons, Hoboken, NJ, USA, 2004.




Page 5: EnhancedHumanActionRecognitionUsingFusionofSkeletal ...downloads.hindawi.com/journals/jr/2020/3096858.pdf · e displacement vector of joint locations is used to compute the temporal

fi

rarr θ1i θ2i θt

i θNi1113960 1113961 (8)

A k-bin histogram is created for every joint ji from thefeature vector fi

rarr is is concatenated to form a temporal

feature vector frarr

representing an action It is clear that thejoint locations are sparse when compared to the traditionaloptical flow-based methods us the feature extractionprocess is computationally more efficient

34 Score-Level Fusion Using Neural Network We combinethe structural features and the temporal features at the scorelevel For every sample j the classifier i assigns a score rangingfrom minusinf to +inf e score is the signed distance of theobservation j to the decision boundary A positive score indi-cates that the sample j belongs to class i A negative score givesthe distance of j from decision boundarye score-level fusionis performed using a neural networke neural network assignssignificance scores to the classifiers based on structural andtemporal features e structural features are less discriminativefor describing actions having similar body part movements suchas walk run and jogging e optimal fusion of temporal andstructural features would help in better recognition

To generalize the classifier fusion we consider a mul-ticlass classification problem with c classes and n classifiersIn our case we have used scores from two SVM classifiers forfusion e class prediction score for a sample j from ith

classifier is

xijrarr

x(1)ij x

(2)ij x

(c)ij1113960 1113961 (9)

where each x(t)ij is a prediction score corresponding to the

class t e input to the neural network for the sample j isgiven by

vjrarr

x1j x2jrarrrarr

xncjrarr1113876 1113877

T

(10)

where vjrarr isin Rnc

e predicted label at the output layer of the neuralnetwork is given by yj

primerarrisin Rc To get the optimal fusion score

we need to solve the objective function given in (11) for theN training samples in the action recognition dataset

minimize1113944N

j1yjrarr

minus yjprime

rarr

(11)

where yjrarr is the actual label at the output layer for the sample

jFor a neuron k in the hidden layer t the output θk of the

neuron is given by

θk σ wkrarr(tminus 1)

vjrarr

1113874 1113875 (12)

where

wkrarr(tminus 1)

w(tminus1)1k w

(tminus1)2k w

(tminus1)nck

1113960 1113961 (13)

represents the synaptic weights from the previous layer tothe neuron k and σ() is the sigmoid activation function For

a neuron p at the output layer the predicted label op is givenby

opj S wprarrl

vjrarrl

1113874 1113875 (14)

where

wprarrl

w(l)1p w

(l)2p w(l)

ncp1113960 1113961 (15)

represents the synaptic weights from the last hidden layer l tothe output neuron p vj

rarrl is the input from last hidden layer

S() is the softmax functione output of this layer yjprime

rarr for a

sample j is given by

yjprime

rarr o1j o2j opj ocj1113960 1113961 (16)

e neural network uses the backpropagation algorithmto learn the network parameters An example of neuralnetwork architecture used in the proposed model is shownin Figure 3

4 Experiments and Results

To demonstrate the performance of the proposed model wecarried out experiments on three publicly available datasetsnamely KTH [35] UTKinect [36] and MSR Action3Ddataset [37] e KTH dataset requires explicit pose esti-mation However UTKinect dataset contains the pose in-formation captured using Kinect sensors e source code ofour implementation is available at httpsgithubcommuralikrishnasnHARJointDynamicsgit

41 Datasets e KTH dataset contains six action typesperformed by 25 subjects under four different conditionse skeletal joint information is not included in the datasetunlike other datasets used in the experiment e UTKinectdataset is acquired using a Kinect sensor e datasetcontains skeletal joint information for 10 types of actionsperformed by 10 subjects repeated twice per actioneMSRAction3D dataset contains skeleton data for 20 action typesperformed by 10 subjects where each action is performed 2to 3 times e dataset contains 20 joint locations per framecaptured using a sensor similar to Kinect device

42 Experimental Setup and Results In our experimentsOpenPose library [33] is used to estimate the pose for theKTH dataset A pretrained network with BODY_25 model isused in our experiments e parameters of the experimenthave been set as described in [35] e deep neural networkto detect the joints is executed on a Tesla P100 GPU eSupport Vector Machine (SVM) classifiers are used to ex-tract the structural and temporal features e predictedscores from these SVM classifiers are combined using aneural network We used radial basis function kernel in theSVM classifiers A simple feed-forward network with sig-moid function at the hidden layers and softmax outputneurons is used to solve (11) In the experiments the neuralnetwork has been trained with 50 epochs A plot of epochs

Journal of Robotics 5

[Figure 3: An example of the neural network architecture for the KTH dataset, with 12 input neurons, 3 hidden layers, and 6 output neurons.]

[Figure 4: A plot of epochs versus cross-entropy for the neural network; best validation performance of 0.011578 at epoch 23.]

[Figure 5(a): confusion matrix for structural features on KTH (in-figure accuracy = 0.65).]

[Figure 5(b): confusion matrix for temporal features on KTH (in-figure accuracy = 0.84).]

The results of the experiments are shown in Figures 5(a)–5(c), summarizing the confusion matrices for the structural features, the temporal features, and the score-level fusion, respectively. It can be seen that the misclassifications occur between highly similar actions such as running and jogging. The proposed model has achieved an accuracy of ≈ 90.3% on the KTH dataset.

We have conducted experiments on the UTKinect dataset in a manner similar to that in [36, 44]. The confusion matrix for the structural features is presented in Figure 5(d). The results for the temporal features and for the score-level fusion using the neural network are shown in Figures 6(a) and 6(b). The accuracy of the proposed method on the UTKinect dataset is ≈ 91.3% with a deviation of ±1.5%.

[Figure 5(c): confusion matrix for the score-level fusion on KTH (in-figure accuracy = 0.91).]

[Figure 5(d): confusion matrix for structural features on UTKinect (in-figure accuracy = 0.65).]

Figure 5: (a) Confusion matrix for structural features; dataset: KTH, bits used for encoding b = 8, total number of joints considered: 25 (2D joint locations). (b) Confusion matrix for temporal features; dataset: KTH, bins k = 5, features considered: orientation angle only. (c) Confusion matrix for the score-level fusion using the neural network; dataset: KTH, number of hidden layers: 3, epochs: 20. (d) Confusion matrix for structural features; dataset: UTKinect, bits used for encoding b = 8.

[Figure 6(a): confusion matrix for temporal features on UTKinect (in-figure accuracy = 0.66).]

[Figure 6(b): confusion matrix for the score-level fusion on UTKinect (in-figure accuracy = 0.91).]

Figure 6: (a) Confusion matrix for temporal features; dataset: UTKinect, bins k = 5, features considered: orientation angle only. (b) Confusion matrix for the score-level fusion using the neural network; dataset: UTKinect, number of hidden layers: 3, epochs: 50.

Table 1: Experimental results for the MSR Action3D dataset.

Dataset    Cross-subject test (%)
AS1        89.6
AS2        83.2
AS3        98.2
Overall    90.33


The experiment on the MSR Action3D dataset has been conducted using the cross-subject test as described in [37], unlike the leave-one-subject-out cross-validation (LOOCV) method given in [40]. The actions are grouped into three subsets: AS1, AS2, and AS3. AS1 and AS2 have less interclass variation, whereas AS3 contains complex actions. The obtained results are listed in Table 1. A summary of the results from all three datasets is reported in Table 2. From Table 2, it is observed that the proposed method outperforms the existing methods for human action recognition.
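The cross-subject protocol can be sketched as follows; the odd/even subject split shown here is an assumption following common practice for MSR Action3D, not a detail quoted from this paper.

```python
# Hypothetical cross-subject split: half of the subjects for training, the
# remaining half for testing, applied separately within each action subset
# (AS1, AS2, AS3). Subject IDs below are placeholders.
TRAIN_SUBJECTS = {1, 3, 5, 7, 9}
TEST_SUBJECTS = {2, 4, 6, 8, 10}

def cross_subject_split(samples):
    """samples: iterable of (subject_id, feature_vector, label) tuples."""
    train = [(f, y) for s, f, y in samples if s in TRAIN_SUBJECTS]
    test = [(f, y) for s, f, y in samples if s in TEST_SUBJECTS]
    return train, test
```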

Table 2: Experimental results of the proposed method vs. other techniques for human action recognition.

KTH dataset [35]
  STIP, Schuldt [38]               73.6
  Efficient motion features [39]   87.3
  Proposed method                  90.30

UTKinect dataset [36]
  Histogram of 3D joints [40]      90.92
  Random forest [41]               87.90
  Proposed method                  91.30

MSR Action3D dataset
  Histogram of 3D joints [40]      78.97
  Eigen joints [42]                82.30
  Joint angle similarities [43]    83.53
  Proposed method                  90.33

[Figure 7: Accuracy of action recognition using only the joint angle feature with respect to the quantization parameter b (curves for KTH, UTKinect, and MSR Action3D AS1–AS3).]

[Figure 8: Accuracy of action recognition by varying the number of bins k in the joint displacement feature for the KTH dataset (joint angles, joint displacement, and decision-level fusion).]

[Figure 9: Accuracy of action recognition by varying the number of bins k in the joint displacement feature for the UTKinect dataset.]

[Figure 10: Accuracy of action recognition by varying the number of bins k in the joint displacement feature for the MSR Action3D (AS1) dataset.]

[Figure 11: Accuracy of action recognition by varying the number of bins k in the joint displacement feature for the MSR Action3D (AS2) dataset.]

[Figure 12: Accuracy of action recognition by varying the number of bins k in the joint displacement feature for the MSR Action3D (AS3) dataset.]

We used similar classifier settings for the other datasets in the experiment. The experimental results show that the proposed method outperforms some of the state-of-the-art techniques on all three datasets considered in the experiments. For the MSR Action3D dataset, our method gives an accuracy of ≈ 90.33% with a deviation of ±2.5%, which is better than the methods listed in Table 2 by more than ≈ 5%. Moreover, the fusion of the classifiers shows better performance than either single classifier.

4.3. Influence of Quantization Parameter b and Histogram Bins k on Accuracy. The performance of SVM classifier-1 shown in Figure 1 is analyzed by varying the quantization parameter b. The number of bits b used in quantization versus accuracy is plotted in Figure 7. It is observed that the parameter has no influence on the results beyond b = 8 for the KTH and MSR Action3D datasets. However, the optimal value of b for the UTKinect dataset is 16. This is due to the variations in the range of data values for the location coordinates.

A plot of the number of bins k in the joint displacement feature versus the accuracy is shown in Figures 8–12. The displacement vectors provide complementary information to the joint angles. Most pose estimation algorithms fail to detect joints that are hidden due to occlusion or self-occlusion and normally return a zero value for such joint locations. These hidden joint locations act as noise and may degrade the performance of the action recognition algorithm.
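The sketch below only illustrates the two parameters being varied; it is not the paper's exact feature definition. In particular, the multi-threshold angle binning described earlier is simplified here to uniform bins, and b is treated directly as the number of angle bins rather than an encoding bit width.

```python
import numpy as np

def quantize_joint_angles(angles_rad, b):
    """Map joint angles to discrete bins controlled by the parameter b.

    Illustrative only: b is treated as the number of uniform bins over
    [0, 2*pi); the paper's structural feature uses multi-threshold binning.
    """
    angles = np.asarray(angles_rad) % (2.0 * np.pi)
    return np.floor(angles / (2.0 * np.pi / b)).astype(int)

def displacement_histogram(joints_t, joints_t1, k):
    """k-bin histogram of per-joint displacement magnitudes between two frames.

    joints_t, joints_t1 : (num_joints, 2) arrays of 2D joint locations.
    Joints reported as (0, 0) by the pose estimator (occluded or undetected)
    produce spurious displacements here, which is the noise effect noted in
    the text.
    """
    disp = np.linalg.norm(np.asarray(joints_t1) - np.asarray(joints_t), axis=1)
    hist, _ = np.histogram(disp, bins=k)
    return hist / max(hist.sum(), 1)   # normalized temporal feature per frame pair
```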

4.4. Analysis of Most Significant Joints. In the KTH dataset, the hand-waving action is mainly due to the movement of joints j3 to j8; the other joints do not contribute to the action. The most important joints involved in an action are depicted in Figure 13. It can be observed that the actions walking, running, and jogging have similar characteristics in terms of angular movement. This is very useful for identifying outliers while detecting abnormalities in actions. (Dominant joints with respect to angular movement for the other datasets are shown in Figures 14–17.)
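As a hypothetical illustration of this analysis (the excerpt does not specify the exact statistic behind Figures 13–17), one could rank joints by how much their angle varies over a clip:

```python
import numpy as np

def dominant_joints(angle_sequence, top_n=6):
    """Rank joints by angular variability over an action clip.

    angle_sequence : (num_frames, num_joints) array of per-frame joint angles.
    Returns the indices of the top_n most active joints and the normalized
    per-joint movement scores (a rough analogue of the per-joint activity
    visualized in Figures 13-17).
    """
    movement = np.std(angle_sequence, axis=0)        # angular variability per joint
    scores = movement / (movement.max() + 1e-12)     # scale to [0, 1]
    return np.argsort(scores)[::-1][:top_n], scores
```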

[Figure 13: Visualization of dominant joints with respect to angular movement for the KTH dataset (actions: handwaving, walking, jogging, running, handclapping, boxing).]

[Figure 14: Visualization of dominant joints with respect to angular movement for the UTKinect dataset (actions: wavehands, claphands, push, pull, carry, throw, standup, pickup, walk, sitdown).]

[Figure 15: Visualization of dominant joints with respect to angular movement for the MSR Action3D (AS1) dataset (actions a02, a03, a05, a06, a10, a13, a18, a20).]

[Figure 16: Visualization of dominant joints with respect to angular movement for the MSR Action3D (AS2) dataset (actions a01, a04, a07, a08, a09, a11, a12, a14).]

[Figure 17: Visualization of dominant joints with respect to angular movement for the MSR Action3D (AS3) dataset (actions a06, a14, a15, a16, a17, a18, a19, a20).]

[Figure 18: A comparison of decision-level fusion using the neural network and score averaging [45] across KTH, UTKinect, and MSR Action3D AS1–AS3.]

[Figure 19: Improvement in accuracy using decision-level fusion for the KTH dataset (joint angles, joint displacement, and decision-level fusion versus parameter b).]

[Figure 20: Improvement in accuracy using decision-level fusion for the UTKinect dataset.]

[Figure 21: Improvement in accuracy using decision-level fusion for the MSR Action3D (AS1) dataset.]

The accuracy of the proposed system has been analyzed using two types of combiners: a trainable combiner using a neural network and a fixed combiner using score averaging [45]. This comparison is shown in Figure 18. The neural network is the better combiner, as it is able to find the optimal weights for the fusion, whereas score averaging acts as a fixed combiner that gives equal importance to both classifiers and therefore shows lower accuracy. The neural network-based fusion enhances the performance in terms of accuracy. It can be seen from Figures 19–23 that the fusion technique results in better performance.
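The difference between the two combiners can be made concrete with a small sketch (the variable names are placeholders; `scores_struct` and `scores_temp` stand for the per-sample class-score matrices of the two SVMs):

```python
import numpy as np

def average_fusion(scores_struct, scores_temp):
    """Fixed combiner [45]: equal-weight average of the two score matrices."""
    return np.argmax(0.5 * scores_struct + 0.5 * scores_temp, axis=1)

def weighted_fusion(scores_struct, scores_temp, w):
    """Intermediate case: a single weight w in (0, 1). The trainable neural
    combiner used here generalizes this further by learning a nonlinear
    mapping from the concatenated scores to class probabilities."""
    return np.argmax(w * scores_struct + (1.0 - w) * scores_temp, axis=1)
```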

A correlation analysis is performed on the outputs of the two SVM classifiers; the result is listed in Table 3. The analysis shows that the average correlation is less than 0.5, which indicates that the classifiers only moderately agree on the classification. Consequently, the fusion of these scores leads to an improvement in the overall accuracy of the system.
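The exact correlation statistic is not spelled out in this excerpt; assuming a Pearson correlation over the flattened score matrices, the agreement measure could be computed as follows:

```python
import numpy as np

def classifier_agreement(scores_1, scores_2):
    """Pearson correlation between two classifiers' score matrices.

    scores_1, scores_2 : (num_samples, num_classes) prediction-score arrays.
    Values well below 1 suggest the classifiers carry complementary
    information, which is when score-level fusion tends to pay off.
    """
    return float(np.corrcoef(scores_1.ravel(), scores_2.ravel())[0, 1])
```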

5. Conclusions

We have developed a method for human action recognition based on skeletal joints. The proposed method extracts structural and temporal features: the structural variations are captured using joint angles, and the temporal variations are represented using joint displacement vectors. The proposed approach is simple, as it uses single-view 2D joint locations, and yet it outperforms some of the state-of-the-art techniques. We also showed that, in the absence of a Kinect sensor, a pose estimation algorithm can be used as a preliminary step. The proposed method shows promising results for action recognition when the temporal and structural features are fused at the score level. Thus, the proposed method is suitable for robust human action recognition tasks.

Data Availability

The references of the datasets used in the experiments are provided in the reference list.

Conflicts of Interest

The authors have no conflicts of interest regarding the publication of this paper.

[Figure 22: Improvement in accuracy using decision-level fusion for the MSR Action3D (AS2) dataset.]

[Figure 23: Improvement in accuracy using decision-level fusion for the MSR Action3D (AS3) dataset.]

Table 3: Correlation analysis of the classifier output to find the classifier agreement for the SVM classifiers shown in Figure 1.

Dataset              Correlation coefficient of classifier outputs
KTH [35]             0.6308
UTKinect [36]        0.2938
MSR Action3D [37]
  AS1                0.4059
  AS2                0.5619
  AS3                0.5478
Average              0.48804


References

[1] M. Bucolo, A. Buscarino, C. Famoso, L. Fortuna, and M. Frasca, "Control of imperfect dynamical systems," Nonlinear Dynamics, vol. 98, no. 4, pp. 2989–2999, 2019.

[2] J. K. Aggarwal and Q. Cai, "Human motion analysis: a review," Computer Vision and Image Understanding, vol. 73, no. 3, pp. 428–440, 1999.

[3] J. K. Aggarwal and M. S. Ryoo, "Human activity analysis: a review," ACM Computing Surveys, vol. 43, no. 3, pp. 1–43, 2011.

[4] J. Ben-Arie, Z. Zhiqian Wang, P. Pandit, and S. Rajaram, "Human activity recognition using multidimensional indexing," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 8, pp. 1091–1104, 2002.

[5] V. Kellokumpu, G. Zhao, and M. Pietikainen, "Recognition of human actions using texture descriptors," Machine Vision and Applications, vol. 22, no. 5, pp. 767–780, 2011.

[6] C. Gao, Y. Shao, and Y. Guo, "Human action recognition using motion energy template," Optical Engineering, vol. 54, no. 6, pp. 1–10, 2015.

[7] W. Xu, Z. Miao, X.-P. Zhang, and Y. Tian, "A hierarchical spatio-temporal model for human activity recognition," IEEE Transactions on Multimedia, vol. 19, no. 7, pp. 1494–1509, 2017.

[8] H. Wang, A. Klaser, C. Schmid, and C.-L. Liu, "Dense trajectories and motion boundary descriptors for action recognition," International Journal of Computer Vision, vol. 103, no. 1, pp. 60–79, 2013.

[9] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, and M. S. Lew, "Deep learning for visual understanding: a review," Neurocomputing, vol. 187, pp. 27–48, 2016.

[10] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221–231, 2013.

[11] P. Wang, Z. Li, Y. Hou, and W. Li, "Action recognition based on joint trajectory maps using convolutional neural networks," in Proceedings of the 24th ACM International Conference on Multimedia (MM '16), pp. 102–106, Association for Computing Machinery, New York, NY, USA, October 2016.

[12] C. Li, Y. Hou, P. Wang, and W. Li, "Joint distance maps based action recognition with convolutional neural networks," IEEE Signal Processing Letters, vol. 24, no. 5, pp. 624–628, 2017.

[13] Y. Hou, Z. Li, P. Wang, and W. Li, "Skeleton optical spectra-based action recognition using convolutional neural networks," IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 3, pp. 807–811, 2018.

[14] P. Wang, W. Li, P. Ogunbona, J. Wan, and S. Escalera, "RGB-D-based human motion recognition with deep learning: a survey," Computer Vision and Image Understanding, vol. 171, pp. 118–139, 2018.

[15] H. Rahmani, A. Mian, and M. Shah, "Learning a deep model for human action recognition from novel viewpoints," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 3, pp. 667–681, 2018.

[16] C. Li, Y. Hou, P. Wang, and W. Li, "Multiview-based 3-D action recognition using deep networks," IEEE Transactions on Human-Machine Systems, vol. 49, no. 1, pp. 95–104, 2019.

[17] R. Xiao, Y. Hou, Z. Guo, C. Li, P. Wang, and W. Li, "Self-attention guided deep features for action recognition," in Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME), pp. 1060–1065, Shanghai, China, July 2019.

[18] C. Li, Y. Hou, W. Li, and P. Wang, "Learning attentive dynamic maps (ADMs) for understanding human actions," Journal of Visual Communication and Image Representation, vol. 65, Article ID 102640, 2019.

[19] C. Tang, W. Li, P. Wang, and L. Wang, "Online human action recognition based on incremental learning of weighted covariance descriptors," Information Sciences, vol. 467, pp. 219–237, 2018.

[20] A. Edison and C. V. Jiji, "Automated video analysis for action recognition using descriptors derived from optical acceleration," Signal, Image and Video Processing, vol. 13, no. 5, pp. 915–922, 2019.

[21] B. Fernando, E. Gavves, M. J. Oramas, A. Ghodrati, and T. Tuytelaars, "Rank pooling for action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 773–787, 2017.

[22] P. Wang, L. Liu, C. Shen, and H. T. Shen, "Order-aware convolutional pooling for video based action recognition," Pattern Recognition, vol. 91, pp. 357–365, 2019.

[23] J. Hu, W. Zheng, L. Ma, G. Wang, J. Lai, and J. Zhang, "Early action prediction by soft regression," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 11, pp. 2568–2586, 2019.

[24] X. Yang and Y. L. Tian, "Action recognition using super sparse coding vector with spatio-temporal awareness," in Computer Vision–ECCV 2014, pp. 727–741, Springer International Publishing, Berlin, Germany, 2014.

[25] J.-F. Hu, W.-S. Zheng, J. Lai, S. Gong, and T. Xiang, "Exemplar-based recognition of human-object interactions," IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 4, pp. 647–660, 2016.

[26] J. Zhu, B. Wang, X. Yang, W. Zhang, and Z. Tu, "Action recognition with actons," in Proceedings of the 2013 IEEE International Conference on Computer Vision, pp. 3559–3566, Sydney, Australia, December 2013.

[27] H. Wang and C. Schmid, "Action recognition with improved trajectories," in Proceedings of the 2013 IEEE International Conference on Computer Vision, pp. 3551–3558, Sydney, Australia, December 2013.

[28] W. Hao and Z. Zhang, "Spatiotemporal distilled dense-connectivity network for video action recognition," Pattern Recognition, vol. 92, pp. 13–24, 2019.

[29] M. Ramanathan, W.-Y. Yau, and E. K. Teoh, "Human action recognition with video data: research and evaluation challenges," IEEE Transactions on Human-Machine Systems, vol. 44, no. 5, pp. 650–663, 2014.

[30] D. Gowsikhaa, S. Abirami, and R. Baskaran, "Automated human behavior analysis from surveillance videos: a survey," Artificial Intelligence Review, vol. 42, no. 4, pp. 747–765, 2014.

[31] Y. Fu, Human Activity Recognition and Prediction, Springer, Berlin, Germany, 2016.

[32] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, "OpenPose: realtime multi-person 2D pose estimation using part affinity fields," 2018, https://arxiv.org/abs/1812.08008.

[33] Z. Cao, T. Simon, S. E. Wei, and Y. Sheikh, "Realtime multi-person 2D pose estimation using part affinity fields," in Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), pp. 1302–1310, Honolulu, HI, USA, July 2017.

[34] MSCOCO keypoint evaluation metric, http://cocodataset.org/#keypoints-eval.

[35] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: a local SVM approach," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), vol. 3, pp. 32–36, Cambridge, UK, September 2004.

[36] Y. Zhu, W. Chen, and G. Guo, "Fusing spatiotemporal features and joints for 3D action recognition," in Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 486–491, Portland, OR, USA, June 2013.

[37] W. Li, Z. Zhang, and Z. Liu, "Action recognition based on a bag of 3D points," in Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 9–14, San Francisco, CA, USA, June 2010.

[38] C. Schuldt, B. Caputo, C. Sch, and L. Barbara, "Recognizing human actions: a local SVM approach," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), vol. 3, pp. 3–7, Cambridge, UK, September 2004.

[39] M. Yang, F. Lv, W. Xu, K. Yu, and Y. Gong, "Human action detection by boosting efficient motion features," in Proceedings of the 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops), pp. 522–529, Kyoto, Japan, September 2009.

[40] L. Xia, C. C. Chen, and J. K. Aggarwal, "View invariant human action recognition using histograms of 3D joints," in Proceedings of the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 20–27, IEEE, Providence, RI, USA, June 2012.

[41] Y. Zhu, W. Chen, and G. Guo, "Fusing spatiotemporal features and joints for 3D action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Portland, OR, USA, June 2013.

[42] X. Yang and Y. Tian, "Effective 3D action recognition using eigenjoints," Journal of Visual Communication and Image Representation, vol. 25, no. 1, pp. 2–11, 2014.

[43] E. Ohn-Bar and M. M. Trivedi, "Joint angles similarities and HOG2 for action recognition," in Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW '13), pp. 465–470, IEEE Computer Society, Washington, DC, USA, June 2013.

[44] R. Vemulapalli, F. Arrate, and R. Chellappa, "Human action recognition by representing 3D skeletons as points in a Lie group," in Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 588–595, Columbus, OH, USA, June 2014.

[45] L. I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, John Wiley & Sons, Hoboken, NJ, USA, 2004.

16 Journal of Robotics

Page 6: EnhancedHumanActionRecognitionUsingFusionofSkeletal ...downloads.hindawi.com/journals/jr/2020/3096858.pdf · e displacement vector of joint locations is used to compute the temporal

Input Hidden Output Output

w

b

w

b

12 3 6 6

Figure 3 An example of neural network architecture for the KTH dataset with 12 input neurons 3 hidden layers and 6 output neurons

Best validation performance is 0011578 at epoch 23

5 10 15 20 25029 epochs

10ndash4

10ndash3

10ndash2

10ndash1

100

Cros

s-en

tropy

ValidationTrain Test

Best

Figure 4 A plot of epochs vs cross-entropy for the neural network

Confusion matrix

Jogging

Han

dclap

ping

Han

dwav

ing

Runn

ing

Wal

king

Jogg

ing

Boxi

ng

True labelAccuracy = 065 misclass = 035

Boxing

Handclapping

Running

Walking

Handwaving

Pred

icte

d la

bel

0

5

10

15

20

25

30

35064

000

000

000

000

000

000

053

031

000

000

069

000

011

031

033

000

000

000

000

011

031

000

000

031

000

000

000

094

000

006

019

006

006

006

100

(a)

Confusion matrix

Boxing

Handclapping

Running

Jogging

Walking

Handwaving

Pred

icte

d la

bel

Han

dclap

ping

Boxi

ng

Runn

ing

Wal

king

Jogg

ing

Han

dwav

ing

True labelAccuracy = 084 misclass = 016

0

5

10

15

20

25

30

35097 000 000 000 000

006

000

003

000

100

003

000

000

000

000

089

000

014

067

000

000

017

028

000

000

075011

003

000

000

078000

000

000

011

000

(b)

Figure 5 Continued

6 Journal of Robotics

versus cross-entropy is shown in Figure 4 e results of theexperiments are shown in Figures 5(a)ndash5(c) summarizingthe confusion matrices for structural features temporalfeatures and the score-level fusion respectively It can beseen that the misclassifications are between highly similaractions like running and jogging e proposed model hasachieved an accuracy of asymp 903 on the KTH dataset

We have conducted experiments on UTKinect datasetin a similar manner to that shown in [36 44] e con-fusion matrix considering the structural features is

presented in Figure 5(d) e results for temporal featuresand score-level fusion using neural network are shown inFigures 6(a) and 6(b) e accuracy of the proposed

Confusion matrix

Boxing

Handclapping

Running

Jogging

Walking

Handwaving

Pred

icte

d la

bel

Han

dclap

ping

Boxi

ng

Runn

ing

Wal

king

Jogg

ing

Han

dwav

ing

True labelAccuracy = 091 misclass = 009

0

10

20

30

40

50

60097 000

009

004

000

004

099000

000

001

000

000

000

000

015

000

000

078

001

000

013

000

000

079

001

000

000

000

000

100

000

001

000

000

000

096

(c)

Confusion matrix

ClaphandsWavehands

PullPush

ThrowCarry

PickupStandupSitdown

Walk

Pred

icte

d la

bel

Carr

y

Wav

ehan

ds

Wal

k

Stan

dup

Thro

wPu

shPu

ll

Sitd

own

Clap

hand

s

Pick

up

True labelAccuracy = 065 misclass = 037

0

1

2

3

4

5

6

7086 000

000

000000 000

000000 000

029 029 029

071

014

014

000

000 000

000 000

000

071

014

014

014

014

014 014

000 000 000

000

000

000

000

000

000

000 000

000 000 000 000

000

086

000

000

000 000 000

000

000 014

014

000000014000000000

000 000 000 000 000 000 000 100 000000

000 000 000 000 014 000 000 000 029 057

014 000 029

000000071

014 014

000 000

000

000

000

029

014 000

000

000

000

014

(d)

Figure 5 (a) Confusionmatrix for structural features Dataset KTH bits used for encoding b 8 total number of joints considered 25 2Dlocation of joints (b) Confusion matrix for temporal features Dataset KTH bins k 5 features considered orientation angle only (c)Confusion matrix for the score-level fusion using neural network Dataset KTH number of hidden layers 3 epochs 20 (d) Confusionmatrix for structural features Dataset UTKinect bits used for encoding b 8

Claphands

Wavehands

Pull

Push

row

Carry

Pickup

Standup

Sitdown

Walk

Pred

icte

d la

bel

Pick

up

Stan

dup

ro

w

Clap

hand

s

Wav

ehan

ds

Carr

y

Wal

k

Push

Sitd

own

Pull

True labelAccuracy = 066 misclass = 034

0

1

2

3

4

5

6

7Confusion matrix100 000

100000

000

014

043

000

014

000

000

014

000

000 000 000 000

000 000

000 000

000 000 000

000

014

014

000

000

000

000

014

000

014

000

014 014

086

029

029

071

029

014

014

000

000 000

000

071

000

014

000

000

086

000

000

000

000

029

029

014 000 014 000 000

000 000 000

000 000 000

000

057

000 000

000 000

000

029

029

000

000

000

014

000

000

000 000 000

000 000 000

000 000 000 000 000 000 000 000

(a)

Claphands

Wavehands

Pull

Push

row

Carry

Pickup

Standup

Sitdown

Walk

Pred

icte

d la

bel

Sitd

own

Pull

Push

Carr

y

Wav

ehan

ds

Wal

k

Pick

up

Stan

dup

Clap

hand

s

ro

w

True labelAccuracy = 091 misclass = 009

0

2

4

6

8

10

12

14Confusion matrix

093

000

000

000

000 000 000 000

000

000 000

007 000

007

000 000 000

000 000 000 000

000 000

000 000

000 000

007

000

000

000 000 000 000 000

000

007

000 000

000 000

000

093

000

000 000 093

100

000

100

000

000 000

000 000 000 000 000 000

000 000 000 000 000 000

000 000 000 000 000 000 000

000

100

000 000 000 000 000

000

080

000

013

067

007

000 000 000

000

020

087

000 000

000 000

000

100

000

000

007

(b)

Figure 6 (a) Confusion matrix for temporal features Dataset UTKinect bins k 5 features considered orientation angle only (b)Confusion matrix for the score-level fusion using neural network Dataset UTKinect number of hidden layers 3 epochs 50

Table 1 Experimental results for the MSR Action3D dataset

Dataset Cross-subject testAS1 896AS2 832AS3 982Overall 9033

Journal of Robotics 7

method on the UTKinect dataset is asymp 913 with a devi-ation of plusmn15

e experiment on MSR Action3D dataset has beenconducted using cross-subject test as described in [37] unlikethe leave-one-subject-out cross-validation (LOOCV) methodgiven in [40] e actions are grouped into three subsets AS1AS2 and AS3e AS1 and AS2 have less interclass variationswhereas AS3 contains complex actions e obtained resultsare listed in Table 1 A summary of the results from all the threedatasets is reported in Table 2 From Table 2 it is observed thatthe proposed method outperforms the existing methods for

Table 2 Experimental results of the proposed method vs othertechniques for human action recognitionKTH dataset [35]STIP Schuldt [38] 736Efficient motion features [39] 873Proposed method 9030UTKinect dataset [36]Histogram of 3D joints [40] 9092Random forest [41] 8790Proposed method 9130MSR Action3D datasetHistogram of 3D joints [40] 7897Eigen joints [42] 8230Joint angle similarities [43] 8353Proposed method 9033

01

02

03

04

05

06

07

08

09

Accu

racy

5 10 15 20 25 30 35 40 45 0Parameter b

KTHMSR Action3D (AS1)MSR Action3D (AS2)

MSR Action3D (AS3)UTKinect

Figure 7 Accuracy of action recognition using only joint anglefeature with respect to the quantization parameter b

065

07

075

08

085

09

Accu

racy

10 20 30 40 50 600Parameter k

Joint anglesJoint displacementDecision level fusion

Figure 8 Accuracy of action recognition by varying number ofbins k in joint displacement feature for KTH dataset

10 20 30 40 50 600Parameter k

06

065

07

075

08

085

09

Accu

racy

Joint anglesJoint displacementDecision level fusion

Figure 9 Accuracy of action recognition by varying number ofbins k in joint displacement feature for UTKinect dataset

8 Journal of Robotics

10 20 30 40 50 600Parameter k

Joint anglesJoint displacementDecision level fusion

045

05

055

06

065

07

075

08

085

09

Accu

racy

Figure 10 Accuracy of action recognition by varying number of bins k in joint displacement feature for MSR Action3D (AS1) dataset

10 20 30 40 50 600Parameter k

Joint anglesJoint displacementDecision level fusion

045

05

055

06

065

07

075

08

Accu

racy

Figure 11 Accuracy of action recognition by varying number ofbins k in joint displacement feature for MSR Action3D (AS2)dataset

06

065

07

075

08

085

09

095

Accu

racy

10 20 30 40 50 600Parameter k

Joint anglesJoint displacementDecision level fusion

Figure 12 Accuracy of action recognition by varying number ofbins k in joint displacement feature for MSR Action3D (AS3)dataset

Journal of Robotics 9

human action recognition We used similar classifier settingsfor the other datasets in the experiment e experimentalresults show that the proposed method outperforms some ofthe state-of-the-art techniques for all the three datasets con-sidered in the experiments For the MSR Action3D datasetour method gives an accuracy of asymp 9033 with a deviationof plusmn25 which is better than the listed methods in Table 2 bymore than asymp 5 However the fusion of classifiers showsbetter performance than the single classifier

43 Influence of Quantization Parameter b and HistogramBins k on Accuracy e performance of SVM classifier-1shown in Figure 1 is analyzed by varying the quantizationparameter b e number of bits b used in quantizationversus accuracy is plotted in Figure 7 It is observed that theparameter has no influence on the results beyond b 8 forKTH and MSR Action3D datasets However the optimalvalue of b for UTKinect dataset is 16 is is due to thevariations in the range of data values for the locationcoordinates

A plot of number of bins k in joint displacement featureversus the accuracy is shown in Figures 8ndash12 e dis-placement vectors provide complementary information tojoint angles Most of the pose estimation algorithms fail todetect the joints that are hidden due to occlusion or self-occlusion Normally the pose estimation algorithms resultin a zero value for such joint locations ese hidden jointlocations act as noise and may degrade the performance ofthe action recognition algorithm

44 Analysis of Most Significant Joints In KTH dataset thehand-waving action is mainly due the movement of jointsj3 to j8 e other joints do not contribute to the actione most important joints involved in an action aredepicted in Figure 13 It can be observed that actionswalking running and jogging have similar characteristics interms of angular movements is is very useful in iden-tifying any outliers while detecting abnormalities in ac-tions (Dominant joints with respect to angular movementfor other datasets are included in Figures 14ndash17)

Handwaving Walking

Jogging Running

Handclapping Boxing

25

20

15

10

5

10 15 20 2550

05

1

25

20

15

10

5

10 15 20 2550

05

1

25

20

15

10

5

10 15 20 2550

05

1

0

05

1

25

20

15

10

5

10 15 20 255

25

20

15

10

5

10 15 20 2550

05

1

25

20

15

10

5

10 15 20 2550

05

1

Figure 13 Visualization of dominant joints with respect to angular movement for KTH dataset

10 Journal of Robotics

6 8 10 12 14 16 18 20

6 8 10 12 14 16 18 20

6 8 10 12 14 16 18 20

6 8 10 12 14 16 18 20

6 8 10 12 14 16 18 20

1

05

0

1

05

0

1

05

0

1

05

0

1

05

0

10

20

10

20

10

20

10

20

10

20

2 4 6 8 10 12 14 16

2 4 6 8 10 12 14 16

2 4 6 8 10 12 14 16

2 4 6 8 10 12 14 16

2 4 6 8 10 12 14 16

Wavehands

Push

Carry

Standup

Walk

Claphands

Pull

Throw

Pickup

Sitdown

Figure 14 Visualization of dominant joints with respect to angular movement for UTKinect dataset

a02 a03

a05 a06

a10 a13

a18 a20

20

10

20100

05

1

20

10

20100

05

1

20

10

20100

05

1

20

10

20100

05

1

20

10

20

10

20

10

20

10

0

05

1

0

05

1

0

05

1

0

05

1

2010

2010

2010

2010

Figure 15 Visualization of dominant joints with respect to angular movement for MSR Action3D (AS1) dataset

Journal of Robotics 11

a01 a04

a07 a08

a09 a11

a12 a14

0

05

1

20

10

20100

05

1

20

10

20100

05

1

0

05

1

20

10

20100

05

1

201020

10

20

10

2010

20

10

20100

05

1

20

10

20100

05

1

20

10

2010

0

05

1

Figure 16 Visualization of dominant joints with respect to angular movement for MSR Action3D (AS2) dataset

a06 a14

a15 a16

a17 a18

a19 a20

20

10

201020

10

20

10

20

10

20

10

20

10

20

10

20

10

0

05

1

0

05

1

0

05

1

0

05

1

2010

2010

2010

20100

05

1

0

05

1

0

05

1

0

05

1

2010

2010

2010

Figure 17 Visualization of dominant joints with respect to angular movement for MSR Action3D (AS3) dataset

12 Journal of Robotics

0

02

04

06

08

1

KTH

UTK

inec

t

MSR

-AS1

MSR

-AS2

MSR

-AS3

Fusion using score averagingFusion using neural network

Figure 18 A comparison of decision level fusion using neuralnetwork and score averaging [45]

5 10 15 20 25 30 35 40 450Parameter b

04

05

06

07

08

09

Accu

racy

Joint anglesJoint displacementDecision level fusion

Figure 19 A graph demonstrating improvement in accuracy usingdecision level fusion for the KTH dataset

01

02

03

04

05

06

07

08

09

Accu

racy

5 10 15 20 25 30 35 40 450Parameter b

Joint anglesJoint displacementDecision level fusion

Figure 20 A graph demonstrating improvement in accuracy usingdecision level fusion for the UTKinect dataset

02

03

04

05

06

07

08

09

Accu

racy

5 10 15 20 25 30 35 40 450Parameter b

Joint anglesJoint displacementDecision level fusion

Figure 21 A graph demonstrating improvement in accuracy usingdecision level fusion for the MSR Action3D (AS1) dataset

Journal of Robotics 13

e accuracy of the proposed system has been analyzedusing two types of combiners a trainable combiner using aneural network and a fixed combiner using score averaging[45]is is shown in Figure 18e neural network is a bettercombiner as it is able to find the optimal weights for the fusionwhereas score averaging works as a fixed combiner with equalimportance to both classifiers showing lower accuracy eneural network-based fusion enhances the performance interms of accuracy It can be seen from Figures 19ndash23 that thefusion technique results in better performance

e correlation analysis is performed on the output oftwo SVM classifiers e result is listed in Table 3 eanalysis shows that the average correlation is less than 05is indicates that the classifiers moderately agree on theclassification Consequently the fusion of these scoresleads to improvement in the overall accuracy of thesystem

5 Conclusions

We have developed a method for human action recog-nition based on skeletal joints e proposed methodextracts structural and temporal features e structuralvariations are captured using joint angle and the tem-poral variations are represented using joint displacementvector e proposed approach is found to be simple as ituses single-view 2D joint locations and yet outperformssome of the state-of-the-art techniques Also we showedthat in the absence of Kinect sensor pose estimationalgorithm can be used as a preliminary step e proposedmethod shows promising results for action recognition taskswhen temporal features and structural features are fused atthe score level us the proposed method is suitable forrobust human action recognition tasks

Data Availability

e references of the datasets used in the experiment areprovided in the reference list

Conflicts of Interest

e authors have no conflicts of interest regarding thepublication of this paper

02

03

04

05

06

07

08

Accu

racy

5 10 15 20 25 30 35 40 450Parameter b

Joint anglesJoint displacementDecision level fusion

Figure 22 A graph demonstrating improvement in accuracy usingdecision level fusion for the MSR Action3D (AS2) dataset

01

02

03

04

05

06

07

08

09

1

Accu

racy

5 10 15 20 25 30 35 40 450Parameter b

Joint anglesJoint displacementDecision level fusion

Figure 23 A graph demonstrating improvement in accuracy usingdecision level fusion for the MSR Action3D (AS3) dataset

Table 3 Correlation analysis of the classifier output to find theclassifier agreement for the SVM classifiers shown in Figure 1

Dataset Correlation coefficients of classifier outputKTH [35] 06308UTKinect [36] 02938MSR Action3D [37]

AS1 04059AS2 05619AS3 05478

Average 048804

14 Journal of Robotics

References

[1] M Bucolo A Buscarino C Famoso L Fortuna andM Frasca ldquoControl of imperfect dynamical systemsrdquo Non-linear Dynamics vol 98 no 4 pp 2989ndash2999 2019

[2] J K Aggarwal and Q Cai ldquoHuman motion analysis a re-viewrdquo Computer Vision and Image Understanding vol 73no 3 pp 428ndash440 1999

[3] J K Aggarwal and M S Ryoo ldquoHuman activity analysis areviewrdquo ACM Computing Surveys vol 43 no 3 pp 1ndash432011

[4] J Ben-Arie Z Zhiqian Wang P Pandit and S RajaramldquoHuman activity recognition using multidimensional index-ingrdquo IEEE Transactions on Pattern Analysis and MachineIntelligence vol 24 no 8 pp 1091ndash1104 2002

[5] V Kellokumpu G Zhao and M Pietikainen ldquoRecognition ofhuman actions using texture descriptorsrdquoMachine Vision andApplications vol 22 no 5 pp 767ndash780 2011

[6] C Gao Y Shao and Y Guo ldquoHuman action recognitionusing motion energy templaterdquo Optical Engineering vol 54no 6 pp 1ndash10 2015

[7] W Xu Z Miao X-P Zhang and Y Tian ldquoA hierarchicalspatio-temporal model for human activity recognitionrdquo IEEETransactions on Multimedia vol 19 no 7 pp 1494ndash15092017

[8] H Wang A Klaser C Schmid and C-L Liu ldquoDense tra-jectories and motion boundary descriptors for action rec-ognitionrdquo International Journal of Computer Vision vol 103no 1 pp 60ndash79 2013

[9] Y Guo Y Liu A Oerlemans S Lao S Wu and M S LewldquoDeep learning for visual understanding a reviewrdquo Neuro-computing vol 187 pp 27ndash48 2016

[10] S Ji W Xu M Yang and K Yu ldquo3D convolutional neuralnetworks for human action recognitionrdquo IEEE Transactionson Pattern Analysis and Machine Intelligence vol 35 no 1pp 221ndash231 2013

[11] P Wang Z Li Y Hou and W Li ldquoAction recognition basedon joint trajectory maps using convolutional neural net-worksrdquo in Proceedings of the 24th ACM International Con-ference on Multimedia MM rsquo16 pp 102ndash106 Association forComputing Machinery New York NY USA October 2016

[12] C Li Y Hou P Wang andW Li ldquoJoint distance maps basedaction recognition with convolutional neural networksrdquo IEEESignal Processing Letters vol 24 no 5 pp 624ndash628 2017

[13] Y Hou Z Li P Wang and W Li ldquoSkeleton optical spectra-based action recognition using convolutional neural net-worksrdquo IEEE Transactions on Circuits and Systems for VideoTechnology vol 28 no 3 pp 807ndash811 2018

[14] P Wang W Li P Ogunbona J Wan and S Escalera ldquoRGB-D-based human motion recognition with deep learning asurveyrdquo Computer Vision and Image Understanding vol 171pp 118ndash139 2018

[15] H Rahmani A Mian and M Shah ldquoLearning a deep modelfor human action recognition from novel viewpointsrdquo IEEETransactions on Pattern Analysis and Machine Intelligencevol 40 no 3 pp 667ndash681 2018

[16] C Li Y Hou P Wang and W Li ldquoMultiview-based 3-Daction recognition using deep networksrdquo IEEE Transactionson Human-Machine Systems vol 49 no 1 pp 95ndash104 2019

[17] R Xiao Y Hou Z Guo C Li P Wang and W Li ldquoSelf-attention guided deep features for action recognitionrdquo inProceedings of the 2019 IEEE International Conference onMultimedia and Expo (ICME) pp 1060ndash1065 ShanghaiChina July 2019

[18] C Li Y Hou W Li and P Wang ldquoLearning attentive dy-namic maps (adms) for understanding human actionsrdquoJournal of Visual Communication and Image Representationvol 65 Article ID 102640 2019

[19] C Tang W Li P Wang and L Wang ldquoOnline human actionrecognition based on incremental learning of weighted co-variance descriptorsrdquo Information Sciences vol 467pp 219ndash237 2018

[20] A Edison and C V Jiji ldquoAutomated video analysis for actionrecognition using descriptors derived from optical accelera-tionrdquo Signal Image and Video Processing vol 13 no 5pp 915ndash922 2019

[21] B Fernando E Gavves M J Oramas A Ghodrati andT Tuytelaars ldquoRank pooling for action recognitionrdquo IEEETransactions on Pattern Analysis and Machine Intelligencevol 39 no 4 pp 773ndash787 2017

[22] P Wang L Liu C Shen and H T Shen ldquoOrder-awareconvolutional pooling for video based action recognitionrdquoPattern Recognition vol 91 pp 357ndash365 2019

[23] J Hu W Zheng L Ma G Wang J Lai and J Zhang ldquoEarlyaction prediction by soft regressionrdquo IEEE Transactions onPattern Analysis and Machine Intelligence vol 41 no 11pp 2568ndash2586 2019

[24] X Yang and Y L Tian ldquoAction recognition using super sparsecoding vector with spatio-temporal awarenessrdquo in ComputerVisionndashECCV 2014 pp 727ndash741 Springer InternationalPublishing Berlin Germany 2014

[25] J-F Hu W-S Zheng J Lai S Gong and T Xiang ldquoEx-emplar-based recognition of human-object interactionsrdquoIEEE Transactions on Circuits and Systems for Video Tech-nology vol 26 no 4 pp 647ndash660 2016

[26] J Zhu B Wang X Yang W Zhang and Z Tu ldquoActionrecognition with actonsrdquo in Proceedings of the 2013 IEEEInternational Conference on Computer Vision pp 3559ndash3566Sydney Australia December 2013

[27] H Wang and C Schmid ldquoAction recognition with improvedtrajectoriesrdquo in Proceedings of the 2013 IEEE InternationalConference on Computer Vision pp 3551ndash3558 SydneyAustralia December 2013

[28] W Hao and Z Zhang ldquoSpatiotemporal distilled dense-con-nectivity network for video action recognitionrdquo PatternRecognition vol 92 pp 13ndash24 2019

[29] M Ramanathan W-Y Yau and E K Teoh ldquoHuman actionrecognition with video data research and evaluation chal-lengesrdquo IEEE Transactions on Human-Machine Systemsvol 44 no 5 pp 650ndash663 2014

[30] D Gowsikhaa S Abirami and R Baskaran ldquoAutomatedhuman behavior analysis from surveillance videos a surveyrdquoArtificial Intelligence Review vol 42 no 4 pp 747ndash765 2014

[31] Y Fu Human Activity Recognition and Prediction SpringerBerlin Germany 2016

[32] Z Cao G Hidalgo T Simon S-E Wei and Y SheikhldquoOpenPose realtime multi-person 2D pose estimation usingpart affinity fieldsrdquo 2018 httpsarxivorgabs181208008

[33] Z Cao T Simon S E Wei and Y Sheikh ldquoRealtime multi-person 2D pose estimation using part affinity fieldsrdquo inProceedings of the 30th IEEE Conference on Computer Visionand Pattern Recognition CVPR 2017 pp 1302ndash1310 Hono-lulu HI USA July 2017

[34] MSCOCO keypoint evaluation metricrdquo httpcocodatasetorgkeypoints-eval

[35] C Schuldt I Laptev and B Caputo ldquoRecognizing humanactions a local SVM approachrdquo in Proceedings of the 17th

Journal of Robotics 15

International Conference on Pattern Recognition 2004 ICPR2004 vol 3 pp 32ndash36 Cambridge UK September 2004

[36] Y Zhu W Chen and G Guo ldquoFusing spatiotemporal fea-tures and joints for 3D action recognitionrdquo in Proceedings ofthe 2013 IEEE Conference on Computer Vision and PatternRecognition Workshops pp 486ndash491 Portland OR USAJune 2013

[37] W Li Z Zhang and Z Liu ldquoAction recognition based on abag of 3D pointsrdquo in Proceedings of the 2010 IEEE ComputerSociety Conference on Computer Vision and Pattern Recog-nition-Workshops pp 9ndash14 San Francisco CA USA June2010

[38] C Schuldt B Caputo C Sch and L Barbara ldquoRecognizinghuman actions a local SVM approach recognizing humanactionsrdquo in Proceedings of the 17th International Conferenceon Pattern Recognition 2004 (ICPR 2004) vol 3 pp 3ndash7Cambridge UK September 2004

[39] M Yang F Lv W Xu K Yu and Y Gong ldquoHuman actiondetection by boosting efficient motion featuresrdquo in Pro-ceedings of the 2009 IEEE 12th International Conference onComputer Vision Workshops ICCV Workshops pp 522ndash529Kyoto Japan September 2009

[40] L Xia C C Chen and J K Aggarwal ldquoView invariant humanaction recognition using histograms of 3D jointsrdquo in Pro-ceedings of the 2012 IEEE Computer Society Conference onComputer Vision and Pattern Recognition Workshopspp 20ndash27 IEEE Providence RI USA June 2012

[41] Y Zhu W Chen and G Guo ldquoFusing spatiotemporal fea-tures and joints for 3D action recognitionrdquo in Proceedings ofthe IEEE Conference on Computer Vision and Pattern Rec-ognition (CVPR) Workshops Portland OR USA June 2013

[42] X Yang and Y Tian ldquoEffective 3d action recognition usingeigenjointsrdquo Journal of Visual Communication and ImageRepresentation vol 25 no 1 pp 2ndash11 2014

[43] E Ohn-Bar and M M Trivedi ldquoJoint angles similarities andhog2 for action recognitionrdquo in Proceedings of the 2013 IEEEConference on Computer Vision and Pattern RecognitionWorkshops CVPRWrsquo13 pp 465ndash470 IEEE Computer Soci-ety Washington DC USA June 2013

[44] R Vemulapalli F Arrate and R Chellappa ldquoHuman actionrecognition by representing 3D skeletons as points in a liegrouprdquo in Proceedings of the 2014 IEEE Conference onComputer Vision and Pattern Recognition pp 588ndash595Columbus OH USA June 2014

[45] L I Kuncheva Combining Pattern Classifiers Methods andAlgorithms John Wiley amp Sons Hoboken NJ USA 2004

16 Journal of Robotics

Page 7: EnhancedHumanActionRecognitionUsingFusionofSkeletal ...downloads.hindawi.com/journals/jr/2020/3096858.pdf · e displacement vector of joint locations is used to compute the temporal

versus cross-entropy is shown in Figure 4 e results of theexperiments are shown in Figures 5(a)ndash5(c) summarizingthe confusion matrices for structural features temporalfeatures and the score-level fusion respectively It can beseen that the misclassifications are between highly similaractions like running and jogging e proposed model hasachieved an accuracy of asymp 903 on the KTH dataset

We have conducted experiments on UTKinect datasetin a similar manner to that shown in [36 44] e con-fusion matrix considering the structural features is

presented in Figure 5(d) e results for temporal featuresand score-level fusion using neural network are shown inFigures 6(a) and 6(b) e accuracy of the proposed

Confusion matrix

Boxing

Handclapping

Running

Jogging

Walking

Handwaving

Pred

icte

d la

bel

Han

dclap

ping

Boxi

ng

Runn

ing

Wal

king

Jogg

ing

Han

dwav

ing

True labelAccuracy = 091 misclass = 009

0

10

20

30

40

50

60097 000

009

004

000

004

099000

000

001

000

000

000

000

015

000

000

078

001

000

013

000

000

079

001

000

000

000

000

100

000

001

000

000

000

096

(c)

Confusion matrix

ClaphandsWavehands

PullPush

ThrowCarry

PickupStandupSitdown

Walk

Pred

icte

d la

bel

Carr

y

Wav

ehan

ds

Wal

k

Stan

dup

Thro

wPu

shPu

ll

Sitd

own

Clap

hand

s

Pick

up

True labelAccuracy = 065 misclass = 037

0

1

2

3

4

5

6

7086 000

000

000000 000

000000 000

029 029 029

071

014

014

000

000 000

000 000

000

071

014

014

014

014

014 014

000 000 000

000

000

000

000

000

000

000 000

000 000 000 000

000

086

000

000

000 000 000

000

000 014

014

000000014000000000

000 000 000 000 000 000 000 100 000000

000 000 000 000 014 000 000 000 029 057

014 000 029

000000071

014 014

000 000

000

000

000

029

014 000

000

000

000

014

(d)

Figure 5 (a) Confusionmatrix for structural features Dataset KTH bits used for encoding b 8 total number of joints considered 25 2Dlocation of joints (b) Confusion matrix for temporal features Dataset KTH bins k 5 features considered orientation angle only (c)Confusion matrix for the score-level fusion using neural network Dataset KTH number of hidden layers 3 epochs 20 (d) Confusionmatrix for structural features Dataset UTKinect bits used for encoding b 8

Claphands

Wavehands

Pull

Push

row

Carry

Pickup

Standup

Sitdown

Walk

Pred

icte

d la

bel

Pick

up

Stan

dup

ro

w

Clap

hand

s

Wav

ehan

ds

Carr

y

Wal

k

Push

Sitd

own

Pull

True labelAccuracy = 066 misclass = 034

0

1

2

3

4

5

6

7Confusion matrix100 000

100000

000

014

043

000

014

000

000

014

000

000 000 000 000

000 000

000 000

000 000 000

000

014

014

000

000

000

000

014

000

014

000

014 014

086

029

029

071

029

014

014

000

000 000

000

071

000

014

000

000

086

000

000

000

000

029

029

014 000 014 000 000

000 000 000

000 000 000

000

057

000 000

000 000

000

029

029

000

000

000

014

000

000

000 000 000

000 000 000

000 000 000 000 000 000 000 000

(a)

Claphands

Wavehands

Pull

Push

row

Carry

Pickup

Standup

Sitdown

Walk

Pred

icte

d la

bel

Sitd

own

Pull

Push

Carr

y

Wav

ehan

ds

Wal

k

Pick

up

Stan

dup

Clap

hand

s

ro

w

True labelAccuracy = 091 misclass = 009

0

2

4

6

8

10

12

14Confusion matrix

093

000

000

000

000 000 000 000

000

000 000

007 000

007

000 000 000

000 000 000 000

000 000

000 000

000 000

007

000

000

000 000 000 000 000

000

007

000 000

000 000

000

093

000

000 000 093

100

000

100

000

000 000

000 000 000 000 000 000

000 000 000 000 000 000

000 000 000 000 000 000 000

000

100

000 000 000 000 000

000

080

000

013

067

007

000 000 000

000

020

087

000 000

000 000

000

100

000

000

007

(b)

Figure 6 (a) Confusion matrix for temporal features Dataset UTKinect bins k 5 features considered orientation angle only (b)Confusion matrix for the score-level fusion using neural network Dataset UTKinect number of hidden layers 3 epochs 50

Table 1 Experimental results for the MSR Action3D dataset

Dataset Cross-subject testAS1 896AS2 832AS3 982Overall 9033

Journal of Robotics 7

method on the UTKinect dataset is asymp 913 with a devi-ation of plusmn15

e experiment on MSR Action3D dataset has beenconducted using cross-subject test as described in [37] unlikethe leave-one-subject-out cross-validation (LOOCV) methodgiven in [40] e actions are grouped into three subsets AS1AS2 and AS3e AS1 and AS2 have less interclass variationswhereas AS3 contains complex actions e obtained resultsare listed in Table 1 A summary of the results from all the threedatasets is reported in Table 2 From Table 2 it is observed thatthe proposed method outperforms the existing methods for

Table 2 Experimental results of the proposed method vs othertechniques for human action recognitionKTH dataset [35]STIP Schuldt [38] 736Efficient motion features [39] 873Proposed method 9030UTKinect dataset [36]Histogram of 3D joints [40] 9092Random forest [41] 8790Proposed method 9130MSR Action3D datasetHistogram of 3D joints [40] 7897Eigen joints [42] 8230Joint angle similarities [43] 8353Proposed method 9033

01

02

03

04

05

06

07

08

09

Accu

racy

5 10 15 20 25 30 35 40 45 0Parameter b

KTHMSR Action3D (AS1)MSR Action3D (AS2)

MSR Action3D (AS3)UTKinect

Figure 7 Accuracy of action recognition using only joint anglefeature with respect to the quantization parameter b

065

07

075

08

085

09

Accu

racy

10 20 30 40 50 600Parameter k

Joint anglesJoint displacementDecision level fusion

Figure 8 Accuracy of action recognition by varying number ofbins k in joint displacement feature for KTH dataset

10 20 30 40 50 600Parameter k

06

065

07

075

08

085

09

Accu

racy

Joint anglesJoint displacementDecision level fusion

Figure 9 Accuracy of action recognition by varying number ofbins k in joint displacement feature for UTKinect dataset

8 Journal of Robotics

10 20 30 40 50 600Parameter k

Joint anglesJoint displacementDecision level fusion

045

05

055

06

065

07

075

08

085

09

Accu

racy

Figure 10 Accuracy of action recognition by varying number of bins k in joint displacement feature for MSR Action3D (AS1) dataset

10 20 30 40 50 600Parameter k

Joint anglesJoint displacementDecision level fusion

045

05

055

06

065

07

075

08

Accu

racy

Figure 11 Accuracy of action recognition by varying number ofbins k in joint displacement feature for MSR Action3D (AS2)dataset

06

065

07

075

08

085

09

095

Accu

racy

10 20 30 40 50 600Parameter k

Joint anglesJoint displacementDecision level fusion

Figure 12 Accuracy of action recognition by varying number ofbins k in joint displacement feature for MSR Action3D (AS3)dataset

Journal of Robotics 9

human action recognition We used similar classifier settingsfor the other datasets in the experiment e experimentalresults show that the proposed method outperforms some ofthe state-of-the-art techniques for all the three datasets con-sidered in the experiments For the MSR Action3D datasetour method gives an accuracy of asymp 9033 with a deviationof plusmn25 which is better than the listed methods in Table 2 bymore than asymp 5 However the fusion of classifiers showsbetter performance than the single classifier

4.3. Influence of Quantization Parameter b and Histogram Bins k on Accuracy. The performance of SVM classifier-1 shown in Figure 1 is analyzed by varying the quantization parameter b. The number of bits b used in quantization versus accuracy is plotted in Figure 7. It is observed that the parameter has no influence on the results beyond b = 8 for the KTH and MSR Action3D datasets. However, the optimal value of b for the UTKinect dataset is 16. This is due to the variations in the range of data values for the location coordinates.
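The role of b can be illustrated with a small sketch of a uniform b-bit quantizer. Applying it to joint-angle values over the range (0, 2π) is an assumption made for illustration; the authors' exact quantization target and thresholds are not reproduced here.

```python
import numpy as np

def quantize(values, b, value_range):
    """Uniformly quantize `values` to 2**b levels over `value_range` = (lo, hi).

    A sketch of a b-bit uniform quantizer; applying it to joint angles
    (range (0, 2*pi)) is an assumption made for illustration.
    """
    lo, hi = value_range
    levels = 2 ** b
    # Map each value to an integer code in [0, levels - 1].
    scaled = (np.asarray(values, dtype=float) - lo) / (hi - lo)
    codes = np.clip((scaled * levels).astype(int), 0, levels - 1)
    return codes

angles = np.array([0.1, 1.7, 3.2, 6.0])            # example joint angles (radians)
print(quantize(angles, b=8, value_range=(0.0, 2 * np.pi)))
```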

A plot of the number of bins k in the joint displacement feature versus the accuracy is shown in Figures 8–12. The displacement vectors provide complementary information to the joint angles. Most pose estimation algorithms fail to detect joints that are hidden due to occlusion or self-occlusion; normally, they report a zero value for such joint locations. These hidden joint locations act as noise and may degrade the performance of the action recognition algorithm.
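To make the role of k concrete, the following is a minimal sketch of a k-bin histogram over joint displacement magnitudes in which zero-valued (hidden) joint locations are excluded. The binning range, normalisation, and the `(T, J, 2)` layout of the joint array are illustrative assumptions, not the authors' exact feature construction.

```python
import numpy as np

def displacement_histogram(joints, k, max_disp=1.0):
    """Build a k-bin histogram of joint displacement magnitudes.

    joints: array of shape (T, J, 2) with 2D joint locations over T frames.
    Hidden joints are assumed to be reported as (0, 0) by the pose estimator
    and are excluded so they do not act as noise.
    """
    joints = np.asarray(joints, dtype=float)
    disp = joints[1:] - joints[:-1]                   # frame-to-frame displacement
    valid = (joints[1:] != 0).any(-1) & (joints[:-1] != 0).any(-1)
    mags = np.linalg.norm(disp, axis=-1)[valid]       # keep detected joints only
    hist, _ = np.histogram(mags, bins=k, range=(0.0, max_disp))
    return hist / max(hist.sum(), 1)                  # normalised histogram

# Example: 3 frames, 2 joints; the second joint is hidden in the first frame.
seq = [[[0.1, 0.2], [0.0, 0.0]],
       [[0.15, 0.25], [0.4, 0.4]],
       [[0.2, 0.3], [0.45, 0.42]]]
print(displacement_histogram(seq, k=10, max_disp=0.5))
```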

4.4. Analysis of Most Significant Joints. In the KTH dataset, the hand-waving action is mainly due to the movement of joints j3 to j8; the other joints do not contribute to the action. The most important joints involved in an action are depicted in Figure 13. It can be observed that the actions walking, running, and jogging have similar characteristics in terms of angular movements. This is very useful in identifying outliers while detecting abnormalities in actions. (Dominant joints with respect to angular movement for the other datasets are included in Figures 14–17.)
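One way to obtain such a ranking of dominant joints is to measure how much each joint's orientation angle varies over the sequence. The sketch below does this with the angle of every joint relative to a reference joint (for example the torso); this particular angle definition and the `dominant_joints` helper are assumptions for illustration.

```python
import numpy as np

def dominant_joints(joints, ref_idx=0, top_n=5):
    """Rank joints by the variance of their orientation angle over time.

    joints: array of shape (T, J, 2) with 2D joint locations.
    ref_idx: index of a reference joint (e.g. the torso); using it as the
    angle origin is an assumption made for this sketch. Angles are treated
    as plain values, ignoring wrap-around, for simplicity.
    """
    joints = np.asarray(joints, dtype=float)
    rel = joints - joints[:, ref_idx:ref_idx + 1, :]      # coordinates relative to reference
    angles = np.arctan2(rel[..., 1], rel[..., 0])         # (T, J) orientation angles
    movement = angles.var(axis=0)                         # angular variance per joint
    ranked = np.argsort(movement)[::-1]                   # most active joints first
    return ranked[:top_n], movement

# Example: a random walk of 25 joints over 40 frames.
rng = np.random.default_rng(0)
seq = np.cumsum(rng.normal(scale=0.05, size=(40, 25, 2)), axis=0) + 1.0
top, var_per_joint = dominant_joints(seq, ref_idx=0, top_n=5)
print("most active joints:", top)
```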

Figure 13: Visualization of dominant joints with respect to angular movement for the KTH dataset (panels: handwaving, walking, jogging, running, handclapping, boxing).


Figure 14: Visualization of dominant joints with respect to angular movement for the UTKinect dataset (panels: wave hands, push, carry, stand up, walk, clap hands, pull, throw, pick up, sit down).

Figure 15: Visualization of dominant joints with respect to angular movement for the MSR Action3D (AS1) dataset (panels: actions a02, a03, a05, a06, a10, a13, a18, a20).


Figure 16: Visualization of dominant joints with respect to angular movement for the MSR Action3D (AS2) dataset (panels: actions a01, a04, a07, a08, a09, a11, a12, a14).

Figure 17: Visualization of dominant joints with respect to angular movement for the MSR Action3D (AS3) dataset (panels: actions a06, a14, a15, a16, a17, a18, a19, a20).


Figure 18: A comparison of decision-level fusion using a neural network and score averaging [45] (x-axis: dataset — KTH, UTKinect, MSR-AS1, MSR-AS2, MSR-AS3; y-axis: accuracy).

Figure 19: A graph demonstrating the improvement in accuracy using decision-level fusion for the KTH dataset (x-axis: parameter b; y-axis: accuracy; curves: joint angles, joint displacement, decision-level fusion).

Figure 20: A graph demonstrating the improvement in accuracy using decision-level fusion for the UTKinect dataset (x-axis: parameter b; y-axis: accuracy; curves: joint angles, joint displacement, decision-level fusion).

Figure 21: A graph demonstrating the improvement in accuracy using decision-level fusion for the MSR Action3D (AS1) dataset (x-axis: parameter b; y-axis: accuracy; curves: joint angles, joint displacement, decision-level fusion).


The accuracy of the proposed system has been analyzed using two types of combiners: a trainable combiner using a neural network and a fixed combiner using score averaging [45]. This is shown in Figure 18. The neural network is the better combiner, as it is able to find the optimal weights for the fusion, whereas score averaging works as a fixed combiner that gives equal importance to both classifiers and therefore shows lower accuracy. The neural network-based fusion thus enhances the performance in terms of accuracy. It can be seen from Figures 19–23 that the fusion technique results in better performance.
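The two combiners can be sketched as follows: a fixed combiner that averages the per-class scores of the two SVMs with equal weight, and a trainable combiner that learns the fusion weights from the concatenated score vectors. The layer sizes and training settings in this sketch are placeholders and are not meant to reproduce the configuration reported in Figure 6 (3 hidden layers, 50 epochs); the score-matrix shapes are likewise assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def fuse_by_averaging(scores_1, scores_2):
    """Fixed combiner: average the per-class scores of the two classifiers
    (equal weight to both) and pick the highest-scoring class."""
    avg = (np.asarray(scores_1) + np.asarray(scores_2)) / 2.0
    return avg.argmax(axis=1)

def fuse_by_neural_network(train_s1, train_s2, train_y, test_s1, test_s2):
    """Trainable combiner: a small MLP learns the fusion weights from the
    concatenated score vectors. Layer sizes and epoch count are placeholders."""
    x_train = np.hstack([train_s1, train_s2])
    x_test = np.hstack([test_s1, test_s2])
    mlp = MLPClassifier(hidden_layer_sizes=(32, 32, 32), max_iter=500)
    mlp.fit(x_train, train_y)
    return mlp.predict(x_test)

# Example with random scores for 6 classes (e.g. the KTH actions).
rng = np.random.default_rng(1)
s1, s2 = rng.random((20, 6)), rng.random((20, 6))
labels = rng.integers(0, 6, size=20)
print(fuse_by_averaging(s1, s2)[:5])
print(fuse_by_neural_network(s1, s2, labels, s1, s2)[:5])
```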

The correlation analysis is performed on the outputs of the two SVM classifiers, and the result is listed in Table 3. The analysis shows that the average correlation is less than 0.5, which indicates that the classifiers only moderately agree on the classification. Consequently, the fusion of these scores leads to an improvement in the overall accuracy of the system.
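A minimal sketch of this agreement analysis is given below: it computes the Pearson correlation between the two classifiers' score outputs over a test set. Correlating the raw per-class scores (rather than some other agreement measure) is an assumption made for illustration.

```python
import numpy as np

def classifier_correlation(scores_1, scores_2):
    """Pearson correlation between the (flattened) score outputs of two
    classifiers; a low value suggests they make complementary errors,
    so fusing their scores can improve overall accuracy."""
    a = np.asarray(scores_1, dtype=float).ravel()
    b = np.asarray(scores_2, dtype=float).ravel()
    return np.corrcoef(a, b)[0, 1]

# Example with partially correlated random scores.
rng = np.random.default_rng(2)
s1 = rng.random((50, 6))
s2 = 0.5 * s1 + 0.5 * rng.random((50, 6))      # moderately correlated
print(round(classifier_correlation(s1, s2), 4))
```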

5. Conclusions

We have developed a method for human action recognition based on skeletal joints. The proposed method extracts structural and temporal features: the structural variations are captured using joint angles, and the temporal variations are represented using joint displacement vectors. The proposed approach is simple, as it uses single-view 2D joint locations, and yet it outperforms some of the state-of-the-art techniques. We also showed that, in the absence of a Kinect sensor, a pose estimation algorithm can be used as a preliminary step. The proposed method shows promising results for action recognition tasks when the temporal and structural features are fused at the score level. Thus, the proposed method is suitable for robust human action recognition tasks.

Data Availability

The references of the datasets used in the experiments are provided in the reference list.

Conflicts of Interest

The authors have no conflicts of interest regarding the publication of this paper.

Figure 22: A graph demonstrating the improvement in accuracy using decision-level fusion for the MSR Action3D (AS2) dataset (x-axis: parameter b; y-axis: accuracy; curves: joint angles, joint displacement, decision-level fusion).

Figure 23: A graph demonstrating the improvement in accuracy using decision-level fusion for the MSR Action3D (AS3) dataset (x-axis: parameter b; y-axis: accuracy; curves: joint angles, joint displacement, decision-level fusion).

Table 3: Correlation analysis of the classifier outputs to find the classifier agreement for the SVM classifiers shown in Figure 1.

Dataset                      Correlation coefficient of classifier outputs
KTH [35]                     0.6308
UTKinect [36]                0.2938
MSR Action3D [37], AS1       0.4059
MSR Action3D [37], AS2       0.5619
MSR Action3D [37], AS3       0.5478
Average                      0.48804


References

[1] M. Bucolo, A. Buscarino, C. Famoso, L. Fortuna, and M. Frasca, "Control of imperfect dynamical systems," Nonlinear Dynamics, vol. 98, no. 4, pp. 2989–2999, 2019.

[2] J. K. Aggarwal and Q. Cai, "Human motion analysis: a review," Computer Vision and Image Understanding, vol. 73, no. 3, pp. 428–440, 1999.

[3] J. K. Aggarwal and M. S. Ryoo, "Human activity analysis: a review," ACM Computing Surveys, vol. 43, no. 3, pp. 1–43, 2011.

[4] J. Ben-Arie, Z. Zhiqian Wang, P. Pandit, and S. Rajaram, "Human activity recognition using multidimensional indexing," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 8, pp. 1091–1104, 2002.

[5] V. Kellokumpu, G. Zhao, and M. Pietikainen, "Recognition of human actions using texture descriptors," Machine Vision and Applications, vol. 22, no. 5, pp. 767–780, 2011.

[6] C. Gao, Y. Shao, and Y. Guo, "Human action recognition using motion energy template," Optical Engineering, vol. 54, no. 6, pp. 1–10, 2015.

[7] W. Xu, Z. Miao, X.-P. Zhang, and Y. Tian, "A hierarchical spatio-temporal model for human activity recognition," IEEE Transactions on Multimedia, vol. 19, no. 7, pp. 1494–1509, 2017.

[8] H. Wang, A. Klaser, C. Schmid, and C.-L. Liu, "Dense trajectories and motion boundary descriptors for action recognition," International Journal of Computer Vision, vol. 103, no. 1, pp. 60–79, 2013.

[9] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, and M. S. Lew, "Deep learning for visual understanding: a review," Neurocomputing, vol. 187, pp. 27–48, 2016.

[10] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221–231, 2013.

[11] P. Wang, Z. Li, Y. Hou, and W. Li, "Action recognition based on joint trajectory maps using convolutional neural networks," in Proceedings of the 24th ACM International Conference on Multimedia (MM '16), pp. 102–106, Association for Computing Machinery, New York, NY, USA, October 2016.

[12] C. Li, Y. Hou, P. Wang, and W. Li, "Joint distance maps based action recognition with convolutional neural networks," IEEE Signal Processing Letters, vol. 24, no. 5, pp. 624–628, 2017.

[13] Y. Hou, Z. Li, P. Wang, and W. Li, "Skeleton optical spectra-based action recognition using convolutional neural networks," IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 3, pp. 807–811, 2018.

[14] P. Wang, W. Li, P. Ogunbona, J. Wan, and S. Escalera, "RGB-D-based human motion recognition with deep learning: a survey," Computer Vision and Image Understanding, vol. 171, pp. 118–139, 2018.

[15] H. Rahmani, A. Mian, and M. Shah, "Learning a deep model for human action recognition from novel viewpoints," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 3, pp. 667–681, 2018.

[16] C. Li, Y. Hou, P. Wang, and W. Li, "Multiview-based 3-D action recognition using deep networks," IEEE Transactions on Human-Machine Systems, vol. 49, no. 1, pp. 95–104, 2019.

[17] R. Xiao, Y. Hou, Z. Guo, C. Li, P. Wang, and W. Li, "Self-attention guided deep features for action recognition," in Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME), pp. 1060–1065, Shanghai, China, July 2019.

[18] C. Li, Y. Hou, W. Li, and P. Wang, "Learning attentive dynamic maps (ADMs) for understanding human actions," Journal of Visual Communication and Image Representation, vol. 65, Article ID 102640, 2019.

[19] C. Tang, W. Li, P. Wang, and L. Wang, "Online human action recognition based on incremental learning of weighted covariance descriptors," Information Sciences, vol. 467, pp. 219–237, 2018.

[20] A. Edison and C. V. Jiji, "Automated video analysis for action recognition using descriptors derived from optical acceleration," Signal, Image and Video Processing, vol. 13, no. 5, pp. 915–922, 2019.

[21] B. Fernando, E. Gavves, M. J. Oramas, A. Ghodrati, and T. Tuytelaars, "Rank pooling for action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 773–787, 2017.

[22] P. Wang, L. Liu, C. Shen, and H. T. Shen, "Order-aware convolutional pooling for video based action recognition," Pattern Recognition, vol. 91, pp. 357–365, 2019.

[23] J. Hu, W. Zheng, L. Ma, G. Wang, J. Lai, and J. Zhang, "Early action prediction by soft regression," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 11, pp. 2568–2586, 2019.

[24] X. Yang and Y. L. Tian, "Action recognition using super sparse coding vector with spatio-temporal awareness," in Computer Vision–ECCV 2014, pp. 727–741, Springer International Publishing, Berlin, Germany, 2014.

[25] J.-F. Hu, W.-S. Zheng, J. Lai, S. Gong, and T. Xiang, "Exemplar-based recognition of human-object interactions," IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 4, pp. 647–660, 2016.

[26] J. Zhu, B. Wang, X. Yang, W. Zhang, and Z. Tu, "Action recognition with actons," in Proceedings of the 2013 IEEE International Conference on Computer Vision, pp. 3559–3566, Sydney, Australia, December 2013.

[27] H. Wang and C. Schmid, "Action recognition with improved trajectories," in Proceedings of the 2013 IEEE International Conference on Computer Vision, pp. 3551–3558, Sydney, Australia, December 2013.

[28] W. Hao and Z. Zhang, "Spatiotemporal distilled dense-connectivity network for video action recognition," Pattern Recognition, vol. 92, pp. 13–24, 2019.

[29] M. Ramanathan, W.-Y. Yau, and E. K. Teoh, "Human action recognition with video data: research and evaluation challenges," IEEE Transactions on Human-Machine Systems, vol. 44, no. 5, pp. 650–663, 2014.

[30] D. Gowsikhaa, S. Abirami, and R. Baskaran, "Automated human behavior analysis from surveillance videos: a survey," Artificial Intelligence Review, vol. 42, no. 4, pp. 747–765, 2014.

[31] Y. Fu, Human Activity Recognition and Prediction, Springer, Berlin, Germany, 2016.

[32] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, "OpenPose: realtime multi-person 2D pose estimation using part affinity fields," 2018, https://arxiv.org/abs/1812.08008.

[33] Z. Cao, T. Simon, S. E. Wei, and Y. Sheikh, "Realtime multi-person 2D pose estimation using part affinity fields," in Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), pp. 1302–1310, Honolulu, HI, USA, July 2017.

[34] "MS COCO keypoint evaluation metric," http://cocodataset.org/#keypoints-eval.

[35] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: a local SVM approach," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), vol. 3, pp. 32–36, Cambridge, UK, September 2004.

[36] Y. Zhu, W. Chen, and G. Guo, "Fusing spatiotemporal features and joints for 3D action recognition," in Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 486–491, Portland, OR, USA, June 2013.

[37] W. Li, Z. Zhang, and Z. Liu, "Action recognition based on a bag of 3D points," in Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 9–14, San Francisco, CA, USA, June 2010.

[38] C. Schuldt, B. Caputo, C. Sch, and L. Barbara, "Recognizing human actions: a local SVM approach," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), vol. 3, pp. 3–7, Cambridge, UK, September 2004.

[39] M. Yang, F. Lv, W. Xu, K. Yu, and Y. Gong, "Human action detection by boosting efficient motion features," in Proceedings of the 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops), pp. 522–529, Kyoto, Japan, September 2009.

[40] L. Xia, C. C. Chen, and J. K. Aggarwal, "View invariant human action recognition using histograms of 3D joints," in Proceedings of the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 20–27, IEEE, Providence, RI, USA, June 2012.

[41] Y. Zhu, W. Chen, and G. Guo, "Fusing spatiotemporal features and joints for 3D action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Portland, OR, USA, June 2013.

[42] X. Yang and Y. Tian, "Effective 3D action recognition using eigenjoints," Journal of Visual Communication and Image Representation, vol. 25, no. 1, pp. 2–11, 2014.

[43] E. Ohn-Bar and M. M. Trivedi, "Joint angles similarities and HOG2 for action recognition," in Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW '13), pp. 465–470, IEEE Computer Society, Washington, DC, USA, June 2013.

[44] R. Vemulapalli, F. Arrate, and R. Chellappa, "Human action recognition by representing 3D skeletons as points in a Lie group," in Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 588–595, Columbus, OH, USA, June 2014.

[45] L. I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, John Wiley & Sons, Hoboken, NJ, USA, 2004.

16 Journal of Robotics

Page 8: EnhancedHumanActionRecognitionUsingFusionofSkeletal ...downloads.hindawi.com/journals/jr/2020/3096858.pdf · e displacement vector of joint locations is used to compute the temporal

method on the UTKinect dataset is asymp 913 with a devi-ation of plusmn15

e experiment on MSR Action3D dataset has beenconducted using cross-subject test as described in [37] unlikethe leave-one-subject-out cross-validation (LOOCV) methodgiven in [40] e actions are grouped into three subsets AS1AS2 and AS3e AS1 and AS2 have less interclass variationswhereas AS3 contains complex actions e obtained resultsare listed in Table 1 A summary of the results from all the threedatasets is reported in Table 2 From Table 2 it is observed thatthe proposed method outperforms the existing methods for

Table 2 Experimental results of the proposed method vs othertechniques for human action recognitionKTH dataset [35]STIP Schuldt [38] 736Efficient motion features [39] 873Proposed method 9030UTKinect dataset [36]Histogram of 3D joints [40] 9092Random forest [41] 8790Proposed method 9130MSR Action3D datasetHistogram of 3D joints [40] 7897Eigen joints [42] 8230Joint angle similarities [43] 8353Proposed method 9033

01

02

03

04

05

06

07

08

09

Accu

racy

5 10 15 20 25 30 35 40 45 0Parameter b

KTHMSR Action3D (AS1)MSR Action3D (AS2)

MSR Action3D (AS3)UTKinect

Figure 7 Accuracy of action recognition using only joint anglefeature with respect to the quantization parameter b

065

07

075

08

085

09

Accu

racy

10 20 30 40 50 600Parameter k

Joint anglesJoint displacementDecision level fusion

Figure 8 Accuracy of action recognition by varying number ofbins k in joint displacement feature for KTH dataset

10 20 30 40 50 600Parameter k

06

065

07

075

08

085

09

Accu

racy

Joint anglesJoint displacementDecision level fusion

Figure 9 Accuracy of action recognition by varying number ofbins k in joint displacement feature for UTKinect dataset

8 Journal of Robotics

10 20 30 40 50 600Parameter k

Joint anglesJoint displacementDecision level fusion

045

05

055

06

065

07

075

08

085

09

Accu

racy

Figure 10 Accuracy of action recognition by varying number of bins k in joint displacement feature for MSR Action3D (AS1) dataset

10 20 30 40 50 600Parameter k

Joint anglesJoint displacementDecision level fusion

045

05

055

06

065

07

075

08

Accu

racy

Figure 11 Accuracy of action recognition by varying number ofbins k in joint displacement feature for MSR Action3D (AS2)dataset

06

065

07

075

08

085

09

095

Accu

racy

10 20 30 40 50 600Parameter k

Joint anglesJoint displacementDecision level fusion

Figure 12 Accuracy of action recognition by varying number ofbins k in joint displacement feature for MSR Action3D (AS3)dataset

Journal of Robotics 9

human action recognition We used similar classifier settingsfor the other datasets in the experiment e experimentalresults show that the proposed method outperforms some ofthe state-of-the-art techniques for all the three datasets con-sidered in the experiments For the MSR Action3D datasetour method gives an accuracy of asymp 9033 with a deviationof plusmn25 which is better than the listed methods in Table 2 bymore than asymp 5 However the fusion of classifiers showsbetter performance than the single classifier

43 Influence of Quantization Parameter b and HistogramBins k on Accuracy e performance of SVM classifier-1shown in Figure 1 is analyzed by varying the quantizationparameter b e number of bits b used in quantizationversus accuracy is plotted in Figure 7 It is observed that theparameter has no influence on the results beyond b 8 forKTH and MSR Action3D datasets However the optimalvalue of b for UTKinect dataset is 16 is is due to thevariations in the range of data values for the locationcoordinates

A plot of number of bins k in joint displacement featureversus the accuracy is shown in Figures 8ndash12 e dis-placement vectors provide complementary information tojoint angles Most of the pose estimation algorithms fail todetect the joints that are hidden due to occlusion or self-occlusion Normally the pose estimation algorithms resultin a zero value for such joint locations ese hidden jointlocations act as noise and may degrade the performance ofthe action recognition algorithm

44 Analysis of Most Significant Joints In KTH dataset thehand-waving action is mainly due the movement of jointsj3 to j8 e other joints do not contribute to the actione most important joints involved in an action aredepicted in Figure 13 It can be observed that actionswalking running and jogging have similar characteristics interms of angular movements is is very useful in iden-tifying any outliers while detecting abnormalities in ac-tions (Dominant joints with respect to angular movementfor other datasets are included in Figures 14ndash17)

Handwaving Walking

Jogging Running

Handclapping Boxing

25

20

15

10

5

10 15 20 2550

05

1

25

20

15

10

5

10 15 20 2550

05

1

25

20

15

10

5

10 15 20 2550

05

1

0

05

1

25

20

15

10

5

10 15 20 255

25

20

15

10

5

10 15 20 2550

05

1

25

20

15

10

5

10 15 20 2550

05

1

Figure 13 Visualization of dominant joints with respect to angular movement for KTH dataset

10 Journal of Robotics

6 8 10 12 14 16 18 20

6 8 10 12 14 16 18 20

6 8 10 12 14 16 18 20

6 8 10 12 14 16 18 20

6 8 10 12 14 16 18 20

1

05

0

1

05

0

1

05

0

1

05

0

1

05

0

10

20

10

20

10

20

10

20

10

20

2 4 6 8 10 12 14 16

2 4 6 8 10 12 14 16

2 4 6 8 10 12 14 16

2 4 6 8 10 12 14 16

2 4 6 8 10 12 14 16

Wavehands

Push

Carry

Standup

Walk

Claphands

Pull

Throw

Pickup

Sitdown

Figure 14 Visualization of dominant joints with respect to angular movement for UTKinect dataset

a02 a03

a05 a06

a10 a13

a18 a20

20

10

20100

05

1

20

10

20100

05

1

20

10

20100

05

1

20

10

20100

05

1

20

10

20

10

20

10

20

10

0

05

1

0

05

1

0

05

1

0

05

1

2010

2010

2010

2010

Figure 15 Visualization of dominant joints with respect to angular movement for MSR Action3D (AS1) dataset

Journal of Robotics 11

a01 a04

a07 a08

a09 a11

a12 a14

0

05

1

20

10

20100

05

1

20

10

20100

05

1

0

05

1

20

10

20100

05

1

201020

10

20

10

2010

20

10

20100

05

1

20

10

20100

05

1

20

10

2010

0

05

1

Figure 16 Visualization of dominant joints with respect to angular movement for MSR Action3D (AS2) dataset

a06 a14

a15 a16

a17 a18

a19 a20

20

10

201020

10

20

10

20

10

20

10

20

10

20

10

20

10

0

05

1

0

05

1

0

05

1

0

05

1

2010

2010

2010

20100

05

1

0

05

1

0

05

1

0

05

1

2010

2010

2010

Figure 17 Visualization of dominant joints with respect to angular movement for MSR Action3D (AS3) dataset

12 Journal of Robotics

0

02

04

06

08

1

KTH

UTK

inec

t

MSR

-AS1

MSR

-AS2

MSR

-AS3

Fusion using score averagingFusion using neural network

Figure 18 A comparison of decision level fusion using neuralnetwork and score averaging [45]

5 10 15 20 25 30 35 40 450Parameter b

04

05

06

07

08

09

Accu

racy

Joint anglesJoint displacementDecision level fusion

Figure 19 A graph demonstrating improvement in accuracy usingdecision level fusion for the KTH dataset

01

02

03

04

05

06

07

08

09

Accu

racy

5 10 15 20 25 30 35 40 450Parameter b

Joint anglesJoint displacementDecision level fusion

Figure 20 A graph demonstrating improvement in accuracy usingdecision level fusion for the UTKinect dataset

02

03

04

05

06

07

08

09

Accu

racy

5 10 15 20 25 30 35 40 450Parameter b

Joint anglesJoint displacementDecision level fusion

Figure 21 A graph demonstrating improvement in accuracy usingdecision level fusion for the MSR Action3D (AS1) dataset

Journal of Robotics 13

e accuracy of the proposed system has been analyzedusing two types of combiners a trainable combiner using aneural network and a fixed combiner using score averaging[45]is is shown in Figure 18e neural network is a bettercombiner as it is able to find the optimal weights for the fusionwhereas score averaging works as a fixed combiner with equalimportance to both classifiers showing lower accuracy eneural network-based fusion enhances the performance interms of accuracy It can be seen from Figures 19ndash23 that thefusion technique results in better performance

e correlation analysis is performed on the output oftwo SVM classifiers e result is listed in Table 3 eanalysis shows that the average correlation is less than 05is indicates that the classifiers moderately agree on theclassification Consequently the fusion of these scoresleads to improvement in the overall accuracy of thesystem

5 Conclusions

We have developed a method for human action recog-nition based on skeletal joints e proposed methodextracts structural and temporal features e structuralvariations are captured using joint angle and the tem-poral variations are represented using joint displacementvector e proposed approach is found to be simple as ituses single-view 2D joint locations and yet outperformssome of the state-of-the-art techniques Also we showedthat in the absence of Kinect sensor pose estimationalgorithm can be used as a preliminary step e proposedmethod shows promising results for action recognition taskswhen temporal features and structural features are fused atthe score level us the proposed method is suitable forrobust human action recognition tasks

Data Availability

e references of the datasets used in the experiment areprovided in the reference list

Conflicts of Interest

e authors have no conflicts of interest regarding thepublication of this paper

02

03

04

05

06

07

08

Accu

racy

5 10 15 20 25 30 35 40 450Parameter b

Joint anglesJoint displacementDecision level fusion

Figure 22 A graph demonstrating improvement in accuracy usingdecision level fusion for the MSR Action3D (AS2) dataset

01

02

03

04

05

06

07

08

09

1

Accu

racy

5 10 15 20 25 30 35 40 450Parameter b

Joint anglesJoint displacementDecision level fusion

Figure 23 A graph demonstrating improvement in accuracy usingdecision level fusion for the MSR Action3D (AS3) dataset

Table 3 Correlation analysis of the classifier output to find theclassifier agreement for the SVM classifiers shown in Figure 1

Dataset Correlation coefficients of classifier outputKTH [35] 06308UTKinect [36] 02938MSR Action3D [37]

AS1 04059AS2 05619AS3 05478

Average 048804

14 Journal of Robotics

References

[1] M Bucolo A Buscarino C Famoso L Fortuna andM Frasca ldquoControl of imperfect dynamical systemsrdquo Non-linear Dynamics vol 98 no 4 pp 2989ndash2999 2019

[2] J K Aggarwal and Q Cai ldquoHuman motion analysis a re-viewrdquo Computer Vision and Image Understanding vol 73no 3 pp 428ndash440 1999

[3] J K Aggarwal and M S Ryoo ldquoHuman activity analysis areviewrdquo ACM Computing Surveys vol 43 no 3 pp 1ndash432011

[4] J Ben-Arie Z Zhiqian Wang P Pandit and S RajaramldquoHuman activity recognition using multidimensional index-ingrdquo IEEE Transactions on Pattern Analysis and MachineIntelligence vol 24 no 8 pp 1091ndash1104 2002

[5] V Kellokumpu G Zhao and M Pietikainen ldquoRecognition ofhuman actions using texture descriptorsrdquoMachine Vision andApplications vol 22 no 5 pp 767ndash780 2011

[6] C Gao Y Shao and Y Guo ldquoHuman action recognitionusing motion energy templaterdquo Optical Engineering vol 54no 6 pp 1ndash10 2015

[7] W Xu Z Miao X-P Zhang and Y Tian ldquoA hierarchicalspatio-temporal model for human activity recognitionrdquo IEEETransactions on Multimedia vol 19 no 7 pp 1494ndash15092017

[8] H Wang A Klaser C Schmid and C-L Liu ldquoDense tra-jectories and motion boundary descriptors for action rec-ognitionrdquo International Journal of Computer Vision vol 103no 1 pp 60ndash79 2013

[9] Y Guo Y Liu A Oerlemans S Lao S Wu and M S LewldquoDeep learning for visual understanding a reviewrdquo Neuro-computing vol 187 pp 27ndash48 2016

[10] S Ji W Xu M Yang and K Yu ldquo3D convolutional neuralnetworks for human action recognitionrdquo IEEE Transactionson Pattern Analysis and Machine Intelligence vol 35 no 1pp 221ndash231 2013

[11] P Wang Z Li Y Hou and W Li ldquoAction recognition basedon joint trajectory maps using convolutional neural net-worksrdquo in Proceedings of the 24th ACM International Con-ference on Multimedia MM rsquo16 pp 102ndash106 Association forComputing Machinery New York NY USA October 2016

[12] C Li Y Hou P Wang andW Li ldquoJoint distance maps basedaction recognition with convolutional neural networksrdquo IEEESignal Processing Letters vol 24 no 5 pp 624ndash628 2017

[13] Y Hou Z Li P Wang and W Li ldquoSkeleton optical spectra-based action recognition using convolutional neural net-worksrdquo IEEE Transactions on Circuits and Systems for VideoTechnology vol 28 no 3 pp 807ndash811 2018

[14] P Wang W Li P Ogunbona J Wan and S Escalera ldquoRGB-D-based human motion recognition with deep learning asurveyrdquo Computer Vision and Image Understanding vol 171pp 118ndash139 2018

[15] H Rahmani A Mian and M Shah ldquoLearning a deep modelfor human action recognition from novel viewpointsrdquo IEEETransactions on Pattern Analysis and Machine Intelligencevol 40 no 3 pp 667ndash681 2018

[16] C Li Y Hou P Wang and W Li ldquoMultiview-based 3-Daction recognition using deep networksrdquo IEEE Transactionson Human-Machine Systems vol 49 no 1 pp 95ndash104 2019

[17] R Xiao Y Hou Z Guo C Li P Wang and W Li ldquoSelf-attention guided deep features for action recognitionrdquo inProceedings of the 2019 IEEE International Conference onMultimedia and Expo (ICME) pp 1060ndash1065 ShanghaiChina July 2019

[18] C Li Y Hou W Li and P Wang ldquoLearning attentive dy-namic maps (adms) for understanding human actionsrdquoJournal of Visual Communication and Image Representationvol 65 Article ID 102640 2019

[19] C Tang W Li P Wang and L Wang ldquoOnline human actionrecognition based on incremental learning of weighted co-variance descriptorsrdquo Information Sciences vol 467pp 219ndash237 2018

[20] A Edison and C V Jiji ldquoAutomated video analysis for actionrecognition using descriptors derived from optical accelera-tionrdquo Signal Image and Video Processing vol 13 no 5pp 915ndash922 2019

[21] B Fernando E Gavves M J Oramas A Ghodrati andT Tuytelaars ldquoRank pooling for action recognitionrdquo IEEETransactions on Pattern Analysis and Machine Intelligencevol 39 no 4 pp 773ndash787 2017

[22] P Wang L Liu C Shen and H T Shen ldquoOrder-awareconvolutional pooling for video based action recognitionrdquoPattern Recognition vol 91 pp 357ndash365 2019

[23] J Hu W Zheng L Ma G Wang J Lai and J Zhang ldquoEarlyaction prediction by soft regressionrdquo IEEE Transactions onPattern Analysis and Machine Intelligence vol 41 no 11pp 2568ndash2586 2019

[24] X Yang and Y L Tian ldquoAction recognition using super sparsecoding vector with spatio-temporal awarenessrdquo in ComputerVisionndashECCV 2014 pp 727ndash741 Springer InternationalPublishing Berlin Germany 2014

[25] J-F Hu W-S Zheng J Lai S Gong and T Xiang ldquoEx-emplar-based recognition of human-object interactionsrdquoIEEE Transactions on Circuits and Systems for Video Tech-nology vol 26 no 4 pp 647ndash660 2016

[26] J Zhu B Wang X Yang W Zhang and Z Tu ldquoActionrecognition with actonsrdquo in Proceedings of the 2013 IEEEInternational Conference on Computer Vision pp 3559ndash3566Sydney Australia December 2013

[27] H Wang and C Schmid ldquoAction recognition with improvedtrajectoriesrdquo in Proceedings of the 2013 IEEE InternationalConference on Computer Vision pp 3551ndash3558 SydneyAustralia December 2013

[28] W Hao and Z Zhang ldquoSpatiotemporal distilled dense-con-nectivity network for video action recognitionrdquo PatternRecognition vol 92 pp 13ndash24 2019

[29] M Ramanathan W-Y Yau and E K Teoh ldquoHuman actionrecognition with video data research and evaluation chal-lengesrdquo IEEE Transactions on Human-Machine Systemsvol 44 no 5 pp 650ndash663 2014

[30] D Gowsikhaa S Abirami and R Baskaran ldquoAutomatedhuman behavior analysis from surveillance videos a surveyrdquoArtificial Intelligence Review vol 42 no 4 pp 747ndash765 2014

[31] Y Fu Human Activity Recognition and Prediction SpringerBerlin Germany 2016

[32] Z Cao G Hidalgo T Simon S-E Wei and Y SheikhldquoOpenPose realtime multi-person 2D pose estimation usingpart affinity fieldsrdquo 2018 httpsarxivorgabs181208008

[33] Z Cao T Simon S E Wei and Y Sheikh ldquoRealtime multi-person 2D pose estimation using part affinity fieldsrdquo inProceedings of the 30th IEEE Conference on Computer Visionand Pattern Recognition CVPR 2017 pp 1302ndash1310 Hono-lulu HI USA July 2017

[34] MSCOCO keypoint evaluation metricrdquo httpcocodatasetorgkeypoints-eval

[35] C Schuldt I Laptev and B Caputo ldquoRecognizing humanactions a local SVM approachrdquo in Proceedings of the 17th

Journal of Robotics 15

International Conference on Pattern Recognition 2004 ICPR2004 vol 3 pp 32ndash36 Cambridge UK September 2004

[36] Y Zhu W Chen and G Guo ldquoFusing spatiotemporal fea-tures and joints for 3D action recognitionrdquo in Proceedings ofthe 2013 IEEE Conference on Computer Vision and PatternRecognition Workshops pp 486ndash491 Portland OR USAJune 2013

[37] W Li Z Zhang and Z Liu ldquoAction recognition based on abag of 3D pointsrdquo in Proceedings of the 2010 IEEE ComputerSociety Conference on Computer Vision and Pattern Recog-nition-Workshops pp 9ndash14 San Francisco CA USA June2010

[38] C Schuldt B Caputo C Sch and L Barbara ldquoRecognizinghuman actions a local SVM approach recognizing humanactionsrdquo in Proceedings of the 17th International Conferenceon Pattern Recognition 2004 (ICPR 2004) vol 3 pp 3ndash7Cambridge UK September 2004

[39] M Yang F Lv W Xu K Yu and Y Gong ldquoHuman actiondetection by boosting efficient motion featuresrdquo in Pro-ceedings of the 2009 IEEE 12th International Conference onComputer Vision Workshops ICCV Workshops pp 522ndash529Kyoto Japan September 2009

[40] L Xia C C Chen and J K Aggarwal ldquoView invariant humanaction recognition using histograms of 3D jointsrdquo in Pro-ceedings of the 2012 IEEE Computer Society Conference onComputer Vision and Pattern Recognition Workshopspp 20ndash27 IEEE Providence RI USA June 2012

[41] Y Zhu W Chen and G Guo ldquoFusing spatiotemporal fea-tures and joints for 3D action recognitionrdquo in Proceedings ofthe IEEE Conference on Computer Vision and Pattern Rec-ognition (CVPR) Workshops Portland OR USA June 2013

[42] X Yang and Y Tian ldquoEffective 3d action recognition usingeigenjointsrdquo Journal of Visual Communication and ImageRepresentation vol 25 no 1 pp 2ndash11 2014

[43] E Ohn-Bar and M M Trivedi ldquoJoint angles similarities andhog2 for action recognitionrdquo in Proceedings of the 2013 IEEEConference on Computer Vision and Pattern RecognitionWorkshops CVPRWrsquo13 pp 465ndash470 IEEE Computer Soci-ety Washington DC USA June 2013

[44] R Vemulapalli F Arrate and R Chellappa ldquoHuman actionrecognition by representing 3D skeletons as points in a liegrouprdquo in Proceedings of the 2014 IEEE Conference onComputer Vision and Pattern Recognition pp 588ndash595Columbus OH USA June 2014

[45] L I Kuncheva Combining Pattern Classifiers Methods andAlgorithms John Wiley amp Sons Hoboken NJ USA 2004

16 Journal of Robotics

Page 9: EnhancedHumanActionRecognitionUsingFusionofSkeletal ...downloads.hindawi.com/journals/jr/2020/3096858.pdf · e displacement vector of joint locations is used to compute the temporal

10 20 30 40 50 600Parameter k

Joint anglesJoint displacementDecision level fusion

045

05

055

06

065

07

075

08

085

09

Accu

racy

Figure 10 Accuracy of action recognition by varying number of bins k in joint displacement feature for MSR Action3D (AS1) dataset

10 20 30 40 50 600Parameter k

Joint anglesJoint displacementDecision level fusion

045

05

055

06

065

07

075

08

Accu

racy

Figure 11 Accuracy of action recognition by varying number ofbins k in joint displacement feature for MSR Action3D (AS2)dataset

06

065

07

075

08

085

09

095

Accu

racy

10 20 30 40 50 600Parameter k

Joint anglesJoint displacementDecision level fusion

Figure 12 Accuracy of action recognition by varying number ofbins k in joint displacement feature for MSR Action3D (AS3)dataset

Journal of Robotics 9

human action recognition We used similar classifier settingsfor the other datasets in the experiment e experimentalresults show that the proposed method outperforms some ofthe state-of-the-art techniques for all the three datasets con-sidered in the experiments For the MSR Action3D datasetour method gives an accuracy of asymp 9033 with a deviationof plusmn25 which is better than the listed methods in Table 2 bymore than asymp 5 However the fusion of classifiers showsbetter performance than the single classifier

43 Influence of Quantization Parameter b and HistogramBins k on Accuracy e performance of SVM classifier-1shown in Figure 1 is analyzed by varying the quantizationparameter b e number of bits b used in quantizationversus accuracy is plotted in Figure 7 It is observed that theparameter has no influence on the results beyond b 8 forKTH and MSR Action3D datasets However the optimalvalue of b for UTKinect dataset is 16 is is due to thevariations in the range of data values for the locationcoordinates

A plot of number of bins k in joint displacement featureversus the accuracy is shown in Figures 8ndash12 e dis-placement vectors provide complementary information tojoint angles Most of the pose estimation algorithms fail todetect the joints that are hidden due to occlusion or self-occlusion Normally the pose estimation algorithms resultin a zero value for such joint locations ese hidden jointlocations act as noise and may degrade the performance ofthe action recognition algorithm

44 Analysis of Most Significant Joints In KTH dataset thehand-waving action is mainly due the movement of jointsj3 to j8 e other joints do not contribute to the actione most important joints involved in an action aredepicted in Figure 13 It can be observed that actionswalking running and jogging have similar characteristics interms of angular movements is is very useful in iden-tifying any outliers while detecting abnormalities in ac-tions (Dominant joints with respect to angular movementfor other datasets are included in Figures 14ndash17)

Handwaving Walking

Jogging Running

Handclapping Boxing

25

20

15

10

5

10 15 20 2550

05

1

25

20

15

10

5

10 15 20 2550

05

1

25

20

15

10

5

10 15 20 2550

05

1

0

05

1

25

20

15

10

5

10 15 20 255

25

20

15

10

5

10 15 20 2550

05

1

25

20

15

10

5

10 15 20 2550

05

1

Figure 13 Visualization of dominant joints with respect to angular movement for KTH dataset

10 Journal of Robotics

6 8 10 12 14 16 18 20

6 8 10 12 14 16 18 20

6 8 10 12 14 16 18 20

6 8 10 12 14 16 18 20

6 8 10 12 14 16 18 20

1

05

0

1

05

0

1

05

0

1

05

0

1

05

0

10

20

10

20

10

20

10

20

10

20

2 4 6 8 10 12 14 16

2 4 6 8 10 12 14 16

2 4 6 8 10 12 14 16

2 4 6 8 10 12 14 16

2 4 6 8 10 12 14 16

Wavehands

Push

Carry

Standup

Walk

Claphands

Pull

Throw

Pickup

Sitdown

Figure 14 Visualization of dominant joints with respect to angular movement for UTKinect dataset

a02 a03

a05 a06

a10 a13

a18 a20

20

10

20100

05

1

20

10

20100

05

1

20

10

20100

05

1

20

10

20100

05

1

20

10

20

10

20

10

20

10

0

05

1

0

05

1

0

05

1

0

05

1

2010

2010

2010

2010

Figure 15 Visualization of dominant joints with respect to angular movement for MSR Action3D (AS1) dataset

Journal of Robotics 11

a01 a04

a07 a08

a09 a11

a12 a14

0

05

1

20

10

20100

05

1

20

10

20100

05

1

0

05

1

20

10

20100

05

1

201020

10

20

10

2010

20

10

20100

05

1

20

10

20100

05

1

20

10

2010

0

05

1

Figure 16 Visualization of dominant joints with respect to angular movement for MSR Action3D (AS2) dataset

a06 a14

a15 a16

a17 a18

a19 a20

20

10

201020

10

20

10

20

10

20

10

20

10

20

10

20

10

0

05

1

0

05

1

0

05

1

0

05

1

2010

2010

2010

20100

05

1

0

05

1

0

05

1

0

05

1

2010

2010

2010

Figure 17 Visualization of dominant joints with respect to angular movement for MSR Action3D (AS3) dataset

12 Journal of Robotics

0

02

04

06

08

1

KTH

UTK

inec

t

MSR

-AS1

MSR

-AS2

MSR

-AS3

Fusion using score averagingFusion using neural network

Figure 18 A comparison of decision level fusion using neuralnetwork and score averaging [45]

5 10 15 20 25 30 35 40 450Parameter b

04

05

06

07

08

09

Accu

racy

Joint anglesJoint displacementDecision level fusion

Figure 19 A graph demonstrating improvement in accuracy usingdecision level fusion for the KTH dataset

01

02

03

04

05

06

07

08

09

Accu

racy

5 10 15 20 25 30 35 40 450Parameter b

Joint anglesJoint displacementDecision level fusion

Figure 20 A graph demonstrating improvement in accuracy usingdecision level fusion for the UTKinect dataset

02

03

04

05

06

07

08

09

Accu

racy

5 10 15 20 25 30 35 40 450Parameter b

Joint anglesJoint displacementDecision level fusion

Figure 21 A graph demonstrating improvement in accuracy usingdecision level fusion for the MSR Action3D (AS1) dataset

Journal of Robotics 13

e accuracy of the proposed system has been analyzedusing two types of combiners a trainable combiner using aneural network and a fixed combiner using score averaging[45]is is shown in Figure 18e neural network is a bettercombiner as it is able to find the optimal weights for the fusionwhereas score averaging works as a fixed combiner with equalimportance to both classifiers showing lower accuracy eneural network-based fusion enhances the performance interms of accuracy It can be seen from Figures 19ndash23 that thefusion technique results in better performance

e correlation analysis is performed on the output oftwo SVM classifiers e result is listed in Table 3 eanalysis shows that the average correlation is less than 05is indicates that the classifiers moderately agree on theclassification Consequently the fusion of these scoresleads to improvement in the overall accuracy of thesystem

5 Conclusions

We have developed a method for human action recog-nition based on skeletal joints e proposed methodextracts structural and temporal features e structuralvariations are captured using joint angle and the tem-poral variations are represented using joint displacementvector e proposed approach is found to be simple as ituses single-view 2D joint locations and yet outperformssome of the state-of-the-art techniques Also we showedthat in the absence of Kinect sensor pose estimationalgorithm can be used as a preliminary step e proposedmethod shows promising results for action recognition taskswhen temporal features and structural features are fused atthe score level us the proposed method is suitable forrobust human action recognition tasks

Data Availability

e references of the datasets used in the experiment areprovided in the reference list

Conflicts of Interest

e authors have no conflicts of interest regarding thepublication of this paper

02

03

04

05

06

07

08

Accu

racy

5 10 15 20 25 30 35 40 450Parameter b

Joint anglesJoint displacementDecision level fusion

Figure 22 A graph demonstrating improvement in accuracy usingdecision level fusion for the MSR Action3D (AS2) dataset

01

02

03

04

05

06

07

08

09

1

Accu

racy

5 10 15 20 25 30 35 40 450Parameter b

Joint anglesJoint displacementDecision level fusion

Figure 23 A graph demonstrating improvement in accuracy usingdecision level fusion for the MSR Action3D (AS3) dataset

Table 3 Correlation analysis of the classifier output to find theclassifier agreement for the SVM classifiers shown in Figure 1

Dataset Correlation coefficients of classifier outputKTH [35] 06308UTKinect [36] 02938MSR Action3D [37]

AS1 04059AS2 05619AS3 05478

Average 048804

14 Journal of Robotics

References

[1] M Bucolo A Buscarino C Famoso L Fortuna andM Frasca ldquoControl of imperfect dynamical systemsrdquo Non-linear Dynamics vol 98 no 4 pp 2989ndash2999 2019

[2] J K Aggarwal and Q Cai ldquoHuman motion analysis a re-viewrdquo Computer Vision and Image Understanding vol 73no 3 pp 428ndash440 1999

[3] J K Aggarwal and M S Ryoo ldquoHuman activity analysis areviewrdquo ACM Computing Surveys vol 43 no 3 pp 1ndash432011

[4] J Ben-Arie Z Zhiqian Wang P Pandit and S RajaramldquoHuman activity recognition using multidimensional index-ingrdquo IEEE Transactions on Pattern Analysis and MachineIntelligence vol 24 no 8 pp 1091ndash1104 2002

[5] V Kellokumpu G Zhao and M Pietikainen ldquoRecognition ofhuman actions using texture descriptorsrdquoMachine Vision andApplications vol 22 no 5 pp 767ndash780 2011

[6] C Gao Y Shao and Y Guo ldquoHuman action recognitionusing motion energy templaterdquo Optical Engineering vol 54no 6 pp 1ndash10 2015

[7] W Xu Z Miao X-P Zhang and Y Tian ldquoA hierarchicalspatio-temporal model for human activity recognitionrdquo IEEETransactions on Multimedia vol 19 no 7 pp 1494ndash15092017

[8] H Wang A Klaser C Schmid and C-L Liu ldquoDense tra-jectories and motion boundary descriptors for action rec-ognitionrdquo International Journal of Computer Vision vol 103no 1 pp 60ndash79 2013

[9] Y Guo Y Liu A Oerlemans S Lao S Wu and M S LewldquoDeep learning for visual understanding a reviewrdquo Neuro-computing vol 187 pp 27ndash48 2016

[10] S Ji W Xu M Yang and K Yu ldquo3D convolutional neuralnetworks for human action recognitionrdquo IEEE Transactionson Pattern Analysis and Machine Intelligence vol 35 no 1pp 221ndash231 2013

[11] P Wang Z Li Y Hou and W Li ldquoAction recognition basedon joint trajectory maps using convolutional neural net-worksrdquo in Proceedings of the 24th ACM International Con-ference on Multimedia MM rsquo16 pp 102ndash106 Association forComputing Machinery New York NY USA October 2016

[12] C Li Y Hou P Wang andW Li ldquoJoint distance maps basedaction recognition with convolutional neural networksrdquo IEEESignal Processing Letters vol 24 no 5 pp 624ndash628 2017

[13] Y Hou Z Li P Wang and W Li ldquoSkeleton optical spectra-based action recognition using convolutional neural net-worksrdquo IEEE Transactions on Circuits and Systems for VideoTechnology vol 28 no 3 pp 807ndash811 2018

[14] P Wang W Li P Ogunbona J Wan and S Escalera ldquoRGB-D-based human motion recognition with deep learning asurveyrdquo Computer Vision and Image Understanding vol 171pp 118ndash139 2018

[15] H Rahmani A Mian and M Shah ldquoLearning a deep modelfor human action recognition from novel viewpointsrdquo IEEETransactions on Pattern Analysis and Machine Intelligencevol 40 no 3 pp 667ndash681 2018

[16] C Li Y Hou P Wang and W Li ldquoMultiview-based 3-Daction recognition using deep networksrdquo IEEE Transactionson Human-Machine Systems vol 49 no 1 pp 95ndash104 2019

[17] R Xiao Y Hou Z Guo C Li P Wang and W Li ldquoSelf-attention guided deep features for action recognitionrdquo inProceedings of the 2019 IEEE International Conference onMultimedia and Expo (ICME) pp 1060ndash1065 ShanghaiChina July 2019

[18] C Li Y Hou W Li and P Wang ldquoLearning attentive dy-namic maps (adms) for understanding human actionsrdquoJournal of Visual Communication and Image Representationvol 65 Article ID 102640 2019

[19] C Tang W Li P Wang and L Wang ldquoOnline human actionrecognition based on incremental learning of weighted co-variance descriptorsrdquo Information Sciences vol 467pp 219ndash237 2018

[20] A Edison and C V Jiji ldquoAutomated video analysis for actionrecognition using descriptors derived from optical accelera-tionrdquo Signal Image and Video Processing vol 13 no 5pp 915ndash922 2019

[21] B Fernando E Gavves M J Oramas A Ghodrati andT Tuytelaars ldquoRank pooling for action recognitionrdquo IEEETransactions on Pattern Analysis and Machine Intelligencevol 39 no 4 pp 773ndash787 2017

[22] P Wang L Liu C Shen and H T Shen ldquoOrder-awareconvolutional pooling for video based action recognitionrdquoPattern Recognition vol 91 pp 357ndash365 2019

[23] J Hu W Zheng L Ma G Wang J Lai and J Zhang ldquoEarlyaction prediction by soft regressionrdquo IEEE Transactions onPattern Analysis and Machine Intelligence vol 41 no 11pp 2568ndash2586 2019

[24] X Yang and Y L Tian ldquoAction recognition using super sparsecoding vector with spatio-temporal awarenessrdquo in ComputerVisionndashECCV 2014 pp 727ndash741 Springer InternationalPublishing Berlin Germany 2014

[25] J-F Hu W-S Zheng J Lai S Gong and T Xiang ldquoEx-emplar-based recognition of human-object interactionsrdquoIEEE Transactions on Circuits and Systems for Video Tech-nology vol 26 no 4 pp 647ndash660 2016

[26] J Zhu B Wang X Yang W Zhang and Z Tu ldquoActionrecognition with actonsrdquo in Proceedings of the 2013 IEEEInternational Conference on Computer Vision pp 3559ndash3566Sydney Australia December 2013

[27] H Wang and C Schmid ldquoAction recognition with improvedtrajectoriesrdquo in Proceedings of the 2013 IEEE InternationalConference on Computer Vision pp 3551ndash3558 SydneyAustralia December 2013

[28] W Hao and Z Zhang ldquoSpatiotemporal distilled dense-con-nectivity network for video action recognitionrdquo PatternRecognition vol 92 pp 13ndash24 2019

[29] M Ramanathan W-Y Yau and E K Teoh ldquoHuman actionrecognition with video data research and evaluation chal-lengesrdquo IEEE Transactions on Human-Machine Systemsvol 44 no 5 pp 650ndash663 2014

[30] D Gowsikhaa S Abirami and R Baskaran ldquoAutomatedhuman behavior analysis from surveillance videos a surveyrdquoArtificial Intelligence Review vol 42 no 4 pp 747ndash765 2014

[31] Y Fu Human Activity Recognition and Prediction SpringerBerlin Germany 2016

[32] Z Cao G Hidalgo T Simon S-E Wei and Y SheikhldquoOpenPose realtime multi-person 2D pose estimation usingpart affinity fieldsrdquo 2018 httpsarxivorgabs181208008

[33] Z Cao T Simon S E Wei and Y Sheikh ldquoRealtime multi-person 2D pose estimation using part affinity fieldsrdquo inProceedings of the 30th IEEE Conference on Computer Visionand Pattern Recognition CVPR 2017 pp 1302ndash1310 Hono-lulu HI USA July 2017

[34] MSCOCO keypoint evaluation metricrdquo httpcocodatasetorgkeypoints-eval

[35] C Schuldt I Laptev and B Caputo ldquoRecognizing humanactions a local SVM approachrdquo in Proceedings of the 17th

Journal of Robotics 15

International Conference on Pattern Recognition 2004 ICPR2004 vol 3 pp 32ndash36 Cambridge UK September 2004

[36] Y Zhu W Chen and G Guo ldquoFusing spatiotemporal fea-tures and joints for 3D action recognitionrdquo in Proceedings ofthe 2013 IEEE Conference on Computer Vision and PatternRecognition Workshops pp 486ndash491 Portland OR USAJune 2013

[37] W Li Z Zhang and Z Liu ldquoAction recognition based on abag of 3D pointsrdquo in Proceedings of the 2010 IEEE ComputerSociety Conference on Computer Vision and Pattern Recog-nition-Workshops pp 9ndash14 San Francisco CA USA June2010

[38] C Schuldt B Caputo C Sch and L Barbara ldquoRecognizinghuman actions a local SVM approach recognizing humanactionsrdquo in Proceedings of the 17th International Conferenceon Pattern Recognition 2004 (ICPR 2004) vol 3 pp 3ndash7Cambridge UK September 2004

[39] M Yang F Lv W Xu K Yu and Y Gong ldquoHuman actiondetection by boosting efficient motion featuresrdquo in Pro-ceedings of the 2009 IEEE 12th International Conference onComputer Vision Workshops ICCV Workshops pp 522ndash529Kyoto Japan September 2009

[40] L Xia C C Chen and J K Aggarwal ldquoView invariant humanaction recognition using histograms of 3D jointsrdquo in Pro-ceedings of the 2012 IEEE Computer Society Conference onComputer Vision and Pattern Recognition Workshopspp 20ndash27 IEEE Providence RI USA June 2012

[41] Y Zhu W Chen and G Guo ldquoFusing spatiotemporal fea-tures and joints for 3D action recognitionrdquo in Proceedings ofthe IEEE Conference on Computer Vision and Pattern Rec-ognition (CVPR) Workshops Portland OR USA June 2013

[42] X Yang and Y Tian ldquoEffective 3d action recognition usingeigenjointsrdquo Journal of Visual Communication and ImageRepresentation vol 25 no 1 pp 2ndash11 2014

[43] E Ohn-Bar and M M Trivedi ldquoJoint angles similarities andhog2 for action recognitionrdquo in Proceedings of the 2013 IEEEConference on Computer Vision and Pattern RecognitionWorkshops CVPRWrsquo13 pp 465ndash470 IEEE Computer Soci-ety Washington DC USA June 2013

[44] R Vemulapalli F Arrate and R Chellappa ldquoHuman actionrecognition by representing 3D skeletons as points in a liegrouprdquo in Proceedings of the 2014 IEEE Conference onComputer Vision and Pattern Recognition pp 588ndash595Columbus OH USA June 2014

[45] L I Kuncheva Combining Pattern Classifiers Methods andAlgorithms John Wiley amp Sons Hoboken NJ USA 2004

16 Journal of Robotics

Page 10: EnhancedHumanActionRecognitionUsingFusionofSkeletal ...downloads.hindawi.com/journals/jr/2020/3096858.pdf · e displacement vector of joint locations is used to compute the temporal

human action recognition We used similar classifier settingsfor the other datasets in the experiment e experimentalresults show that the proposed method outperforms some ofthe state-of-the-art techniques for all the three datasets con-sidered in the experiments For the MSR Action3D datasetour method gives an accuracy of asymp 9033 with a deviationof plusmn25 which is better than the listed methods in Table 2 bymore than asymp 5 However the fusion of classifiers showsbetter performance than the single classifier

43 Influence of Quantization Parameter b and HistogramBins k on Accuracy e performance of SVM classifier-1shown in Figure 1 is analyzed by varying the quantizationparameter b e number of bits b used in quantizationversus accuracy is plotted in Figure 7 It is observed that theparameter has no influence on the results beyond b 8 forKTH and MSR Action3D datasets However the optimalvalue of b for UTKinect dataset is 16 is is due to thevariations in the range of data values for the locationcoordinates

A plot of number of bins k in joint displacement featureversus the accuracy is shown in Figures 8ndash12 e dis-placement vectors provide complementary information tojoint angles Most of the pose estimation algorithms fail todetect the joints that are hidden due to occlusion or self-occlusion Normally the pose estimation algorithms resultin a zero value for such joint locations ese hidden jointlocations act as noise and may degrade the performance ofthe action recognition algorithm

44 Analysis of Most Significant Joints In KTH dataset thehand-waving action is mainly due the movement of jointsj3 to j8 e other joints do not contribute to the actione most important joints involved in an action aredepicted in Figure 13 It can be observed that actionswalking running and jogging have similar characteristics interms of angular movements is is very useful in iden-tifying any outliers while detecting abnormalities in ac-tions (Dominant joints with respect to angular movementfor other datasets are included in Figures 14ndash17)

Handwaving Walking

Jogging Running

Handclapping Boxing

25

20

15

10

5

10 15 20 2550

05

1

25

20

15

10

5

10 15 20 2550

05

1

25

20

15

10

5

10 15 20 2550

05

1

0

05

1

25

20

15

10

5

10 15 20 255

25

20

15

10

5

10 15 20 2550

05

1

25

20

15

10

5

10 15 20 2550

05

1

Figure 13 Visualization of dominant joints with respect to angular movement for KTH dataset

10 Journal of Robotics

6 8 10 12 14 16 18 20

6 8 10 12 14 16 18 20

6 8 10 12 14 16 18 20

6 8 10 12 14 16 18 20

6 8 10 12 14 16 18 20

1

05

0

1

05

0

1

05

0

1

05

0

1

05

0

10

20

10

20

10

20

10

20

10

20

2 4 6 8 10 12 14 16

2 4 6 8 10 12 14 16

2 4 6 8 10 12 14 16

2 4 6 8 10 12 14 16

2 4 6 8 10 12 14 16

Wavehands

Push

Carry

Standup

Walk

Claphands

Pull

Throw

Pickup

Sitdown

Figure 14 Visualization of dominant joints with respect to angular movement for UTKinect dataset

a02 a03

a05 a06

a10 a13

a18 a20

20

10

20100

05

1

20

10

20100

05

1

20

10

20100

05

1

20

10

20100

05

1

20

10

20

10

20

10

20

10

0

05

1

0

05

1

0

05

1

0

05

1

2010

2010

2010

2010

Figure 15 Visualization of dominant joints with respect to angular movement for MSR Action3D (AS1) dataset

Journal of Robotics 11

a01 a04

a07 a08

a09 a11

a12 a14

0

05

1

20

10

20100

05

1

20

10

20100

05

1

0

05

1

20

10

20100

05

1

201020

10

20

10

2010

20

10

20100

05

1

20

10

20100

05

1

20

10

2010

0

05

1

Figure 16 Visualization of dominant joints with respect to angular movement for MSR Action3D (AS2) dataset

Figure 17: Visualization of dominant joints with respect to angular movement for the MSR Action3D (AS3) dataset (actions a06, a14, a15, a16, a17, a18, a19, a20).


Figure 18: A comparison of decision-level fusion using a neural network and score averaging [45] across the KTH, UTKinect, and MSR Action3D (AS1, AS2, AS3) datasets.

Figure 19: Improvement in accuracy using decision-level fusion for the KTH dataset (accuracy versus parameter b for joint angles, joint displacement, and decision-level fusion).

Figure 20: Improvement in accuracy using decision-level fusion for the UTKinect dataset (accuracy versus parameter b for joint angles, joint displacement, and decision-level fusion).

Figure 21: Improvement in accuracy using decision-level fusion for the MSR Action3D (AS1) dataset (accuracy versus parameter b for joint angles, joint displacement, and decision-level fusion).


The accuracy of the proposed system has been analyzed using two types of combiners: a trainable combiner using a neural network and a fixed combiner using score averaging [45]. The comparison is shown in Figure 18. The neural network is the better combiner because it learns the optimal weights for the fusion, whereas score averaging acts as a fixed combiner that gives equal importance to both classifiers and therefore yields lower accuracy. The neural network-based fusion thus enhances the performance in terms of accuracy. It can be seen from Figures 19–23 that the fusion technique results in better performance.
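To illustrate the difference between the two combiners, here is a hedged sketch in which score averaging is a fixed rule while a small neural network is trained on the concatenated SVM scores. The hidden-layer size and the use of scikit-learn are illustrative assumptions, not the configuration reported in the paper.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def fuse_by_averaging(scores_angle, scores_disp):
    """Fixed combiner: average the per-class scores of the two SVMs."""
    return np.argmax((scores_angle + scores_disp) / 2.0, axis=1)

def train_nn_combiner(scores_angle, scores_disp, labels):
    """Trainable combiner: a small MLP learns how to weight the scores."""
    fused_input = np.hstack([scores_angle, scores_disp])   # concatenate per-class score vectors
    combiner = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000)
    combiner.fit(fused_input, labels)
    return combiner

# scores_angle and scores_disp would be the class scores produced by the
# joint-angle SVM and the joint-displacement SVM on the same clips;
# fused predictions then come from combiner.predict(np.hstack([...])).
```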

Correlation analysis is performed on the outputs of the two SVM classifiers; the results are listed in Table 3. The analysis shows that the average correlation is less than 0.5, which indicates that the classifiers only moderately agree on the classification. Consequently, the fusion of these scores leads to an improvement in the overall accuracy of the system.
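A small sketch of one plausible way to compute the agreement reported in Table 3, namely the Pearson correlation between the two classifiers' flattened score vectors; the paper does not spell out the exact computation, so treat this as an assumption.

```python
import numpy as np

def classifier_agreement(scores_angle, scores_disp):
    """Pearson correlation between the two classifiers' score vectors.

    A value near 1 means the classifiers are largely redundant, while a
    moderate value (below 0.5 on average here) suggests they carry
    complementary information that is worth fusing.
    """
    return np.corrcoef(scores_angle.ravel(), scores_disp.ravel())[0, 1]
```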

5. Conclusions

We have developed a method for human action recognition based on skeletal joints. The proposed method extracts structural and temporal features: the structural variations are captured using joint angles, and the temporal variations are represented using joint displacement vectors. The proposed approach is simple, as it uses single-view 2D joint locations, and yet it outperforms some of the state-of-the-art techniques. We also showed that, in the absence of a Kinect sensor, a pose estimation algorithm can be used as a preliminary step. The proposed method shows promising results for action recognition when the temporal and structural features are fused at the score level. Thus, the proposed method is suitable for robust human action recognition tasks.

Data Availability

The references of the datasets used in the experiments are provided in the reference list.

Conflicts of Interest

The authors have no conflicts of interest regarding the publication of this paper.

Figure 22: Improvement in accuracy using decision-level fusion for the MSR Action3D (AS2) dataset (accuracy versus parameter b for joint angles, joint displacement, and decision-level fusion).

Figure 23: Improvement in accuracy using decision-level fusion for the MSR Action3D (AS3) dataset (accuracy versus parameter b for joint angles, joint displacement, and decision-level fusion).

Table 3: Correlation analysis of the classifier output to find the classifier agreement for the SVM classifiers shown in Figure 1.

Dataset                     Correlation coefficient of classifier outputs
KTH [35]                    0.6308
UTKinect [36]               0.2938
MSR Action3D [37], AS1      0.4059
MSR Action3D [37], AS2      0.5619
MSR Action3D [37], AS3      0.5478
Average                     0.48804


References

[1] M. Bucolo, A. Buscarino, C. Famoso, L. Fortuna, and M. Frasca, "Control of imperfect dynamical systems," Nonlinear Dynamics, vol. 98, no. 4, pp. 2989–2999, 2019.

[2] J. K. Aggarwal and Q. Cai, "Human motion analysis: a review," Computer Vision and Image Understanding, vol. 73, no. 3, pp. 428–440, 1999.

[3] J. K. Aggarwal and M. S. Ryoo, "Human activity analysis: a review," ACM Computing Surveys, vol. 43, no. 3, pp. 1–43, 2011.

[4] J. Ben-Arie, Z. Zhiqian Wang, P. Pandit, and S. Rajaram, "Human activity recognition using multidimensional indexing," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 8, pp. 1091–1104, 2002.

[5] V. Kellokumpu, G. Zhao, and M. Pietikainen, "Recognition of human actions using texture descriptors," Machine Vision and Applications, vol. 22, no. 5, pp. 767–780, 2011.

[6] C. Gao, Y. Shao, and Y. Guo, "Human action recognition using motion energy template," Optical Engineering, vol. 54, no. 6, pp. 1–10, 2015.

[7] W. Xu, Z. Miao, X.-P. Zhang, and Y. Tian, "A hierarchical spatio-temporal model for human activity recognition," IEEE Transactions on Multimedia, vol. 19, no. 7, pp. 1494–1509, 2017.

[8] H. Wang, A. Klaser, C. Schmid, and C.-L. Liu, "Dense trajectories and motion boundary descriptors for action recognition," International Journal of Computer Vision, vol. 103, no. 1, pp. 60–79, 2013.

[9] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, and M. S. Lew, "Deep learning for visual understanding: a review," Neurocomputing, vol. 187, pp. 27–48, 2016.

[10] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221–231, 2013.

[11] P. Wang, Z. Li, Y. Hou, and W. Li, "Action recognition based on joint trajectory maps using convolutional neural networks," in Proceedings of the 24th ACM International Conference on Multimedia (MM '16), pp. 102–106, Association for Computing Machinery, New York, NY, USA, October 2016.

[12] C. Li, Y. Hou, P. Wang, and W. Li, "Joint distance maps based action recognition with convolutional neural networks," IEEE Signal Processing Letters, vol. 24, no. 5, pp. 624–628, 2017.

[13] Y. Hou, Z. Li, P. Wang, and W. Li, "Skeleton optical spectra-based action recognition using convolutional neural networks," IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 3, pp. 807–811, 2018.

[14] P. Wang, W. Li, P. Ogunbona, J. Wan, and S. Escalera, "RGB-D-based human motion recognition with deep learning: a survey," Computer Vision and Image Understanding, vol. 171, pp. 118–139, 2018.

[15] H. Rahmani, A. Mian, and M. Shah, "Learning a deep model for human action recognition from novel viewpoints," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 3, pp. 667–681, 2018.

[16] C. Li, Y. Hou, P. Wang, and W. Li, "Multiview-based 3-D action recognition using deep networks," IEEE Transactions on Human-Machine Systems, vol. 49, no. 1, pp. 95–104, 2019.

[17] R. Xiao, Y. Hou, Z. Guo, C. Li, P. Wang, and W. Li, "Self-attention guided deep features for action recognition," in Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME), pp. 1060–1065, Shanghai, China, July 2019.

[18] C. Li, Y. Hou, W. Li, and P. Wang, "Learning attentive dynamic maps (ADMs) for understanding human actions," Journal of Visual Communication and Image Representation, vol. 65, Article ID 102640, 2019.

[19] C. Tang, W. Li, P. Wang, and L. Wang, "Online human action recognition based on incremental learning of weighted covariance descriptors," Information Sciences, vol. 467, pp. 219–237, 2018.

[20] A. Edison and C. V. Jiji, "Automated video analysis for action recognition using descriptors derived from optical acceleration," Signal, Image and Video Processing, vol. 13, no. 5, pp. 915–922, 2019.

[21] B. Fernando, E. Gavves, M. J. Oramas, A. Ghodrati, and T. Tuytelaars, "Rank pooling for action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 773–787, 2017.

[22] P. Wang, L. Liu, C. Shen, and H. T. Shen, "Order-aware convolutional pooling for video based action recognition," Pattern Recognition, vol. 91, pp. 357–365, 2019.

[23] J. Hu, W. Zheng, L. Ma, G. Wang, J. Lai, and J. Zhang, "Early action prediction by soft regression," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 11, pp. 2568–2586, 2019.

[24] X. Yang and Y. L. Tian, "Action recognition using super sparse coding vector with spatio-temporal awareness," in Computer Vision–ECCV 2014, pp. 727–741, Springer International Publishing, Berlin, Germany, 2014.

[25] J.-F. Hu, W.-S. Zheng, J. Lai, S. Gong, and T. Xiang, "Exemplar-based recognition of human-object interactions," IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 4, pp. 647–660, 2016.

[26] J. Zhu, B. Wang, X. Yang, W. Zhang, and Z. Tu, "Action recognition with actons," in Proceedings of the 2013 IEEE International Conference on Computer Vision, pp. 3559–3566, Sydney, Australia, December 2013.

[27] H. Wang and C. Schmid, "Action recognition with improved trajectories," in Proceedings of the 2013 IEEE International Conference on Computer Vision, pp. 3551–3558, Sydney, Australia, December 2013.

[28] W. Hao and Z. Zhang, "Spatiotemporal distilled dense-connectivity network for video action recognition," Pattern Recognition, vol. 92, pp. 13–24, 2019.

[29] M. Ramanathan, W.-Y. Yau, and E. K. Teoh, "Human action recognition with video data: research and evaluation challenges," IEEE Transactions on Human-Machine Systems, vol. 44, no. 5, pp. 650–663, 2014.

[30] D. Gowsikhaa, S. Abirami, and R. Baskaran, "Automated human behavior analysis from surveillance videos: a survey," Artificial Intelligence Review, vol. 42, no. 4, pp. 747–765, 2014.

[31] Y. Fu, Human Activity Recognition and Prediction, Springer, Berlin, Germany, 2016.

[32] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, "OpenPose: realtime multi-person 2D pose estimation using part affinity fields," 2018, https://arxiv.org/abs/1812.08008.

[33] Z. Cao, T. Simon, S. E. Wei, and Y. Sheikh, "Realtime multi-person 2D pose estimation using part affinity fields," in Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), pp. 1302–1310, Honolulu, HI, USA, July 2017.

[34] MS COCO keypoint evaluation metric, http://cocodataset.org/#keypoints-eval.

[35] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: a local SVM approach," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), vol. 3, pp. 32–36, Cambridge, UK, September 2004.

[36] Y. Zhu, W. Chen, and G. Guo, "Fusing spatiotemporal features and joints for 3D action recognition," in Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 486–491, Portland, OR, USA, June 2013.

[37] W. Li, Z. Zhang, and Z. Liu, "Action recognition based on a bag of 3D points," in Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 9–14, San Francisco, CA, USA, June 2010.

[38] C. Schuldt, B. Caputo, C. Sch, and L. Barbara, "Recognizing human actions: a local SVM approach," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), vol. 3, pp. 3–7, Cambridge, UK, September 2004.

[39] M. Yang, F. Lv, W. Xu, K. Yu, and Y. Gong, "Human action detection by boosting efficient motion features," in Proceedings of the 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops), pp. 522–529, Kyoto, Japan, September 2009.

[40] L. Xia, C. C. Chen, and J. K. Aggarwal, "View invariant human action recognition using histograms of 3D joints," in Proceedings of the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 20–27, IEEE, Providence, RI, USA, June 2012.

[41] Y. Zhu, W. Chen, and G. Guo, "Fusing spatiotemporal features and joints for 3D action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Portland, OR, USA, June 2013.

[42] X. Yang and Y. Tian, "Effective 3D action recognition using eigenjoints," Journal of Visual Communication and Image Representation, vol. 25, no. 1, pp. 2–11, 2014.

[43] E. Ohn-Bar and M. M. Trivedi, "Joint angles similarities and HOG2 for action recognition," in Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW '13), pp. 465–470, IEEE Computer Society, Washington, DC, USA, June 2013.

[44] R. Vemulapalli, F. Arrate, and R. Chellappa, "Human action recognition by representing 3D skeletons as points in a Lie group," in Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 588–595, Columbus, OH, USA, June 2014.

[45] L. I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, John Wiley & Sons, Hoboken, NJ, USA, 2004.


Page 11: EnhancedHumanActionRecognitionUsingFusionofSkeletal ...downloads.hindawi.com/journals/jr/2020/3096858.pdf · e displacement vector of joint locations is used to compute the temporal

6 8 10 12 14 16 18 20

6 8 10 12 14 16 18 20

6 8 10 12 14 16 18 20

6 8 10 12 14 16 18 20

6 8 10 12 14 16 18 20

1

05

0

1

05

0

1

05

0

1

05

0

1

05

0

10

20

10

20

10

20

10

20

10

20

2 4 6 8 10 12 14 16

2 4 6 8 10 12 14 16

2 4 6 8 10 12 14 16

2 4 6 8 10 12 14 16

2 4 6 8 10 12 14 16

Wavehands

Push

Carry

Standup

Walk

Claphands

Pull

Throw

Pickup

Sitdown

Figure 14 Visualization of dominant joints with respect to angular movement for UTKinect dataset

a02 a03

a05 a06

a10 a13

a18 a20

20

10

20100

05

1

20

10

20100

05

1

20

10

20100

05

1

20

10

20100

05

1

20

10

20

10

20

10

20

10

0

05

1

0

05

1

0

05

1

0

05

1

2010

2010

2010

2010

Figure 15 Visualization of dominant joints with respect to angular movement for MSR Action3D (AS1) dataset

Journal of Robotics 11

a01 a04

a07 a08

a09 a11

a12 a14

0

05

1

20

10

20100

05

1

20

10

20100

05

1

0

05

1

20

10

20100

05

1

201020

10

20

10

2010

20

10

20100

05

1

20

10

20100

05

1

20

10

2010

0

05

1

Figure 16 Visualization of dominant joints with respect to angular movement for MSR Action3D (AS2) dataset

a06 a14

a15 a16

a17 a18

a19 a20

20

10

201020

10

20

10

20

10

20

10

20

10

20

10

20

10

0

05

1

0

05

1

0

05

1

0

05

1

2010

2010

2010

20100

05

1

0

05

1

0

05

1

0

05

1

2010

2010

2010

Figure 17 Visualization of dominant joints with respect to angular movement for MSR Action3D (AS3) dataset

12 Journal of Robotics

0

02

04

06

08

1

KTH

UTK

inec

t

MSR

-AS1

MSR

-AS2

MSR

-AS3

Fusion using score averagingFusion using neural network

Figure 18 A comparison of decision level fusion using neuralnetwork and score averaging [45]

5 10 15 20 25 30 35 40 450Parameter b

04

05

06

07

08

09

Accu

racy

Joint anglesJoint displacementDecision level fusion

Figure 19 A graph demonstrating improvement in accuracy usingdecision level fusion for the KTH dataset

01

02

03

04

05

06

07

08

09

Accu

racy

5 10 15 20 25 30 35 40 450Parameter b

Joint anglesJoint displacementDecision level fusion

Figure 20 A graph demonstrating improvement in accuracy usingdecision level fusion for the UTKinect dataset

02

03

04

05

06

07

08

09

Accu

racy

5 10 15 20 25 30 35 40 450Parameter b

Joint anglesJoint displacementDecision level fusion

Figure 21 A graph demonstrating improvement in accuracy usingdecision level fusion for the MSR Action3D (AS1) dataset

Journal of Robotics 13

e accuracy of the proposed system has been analyzedusing two types of combiners a trainable combiner using aneural network and a fixed combiner using score averaging[45]is is shown in Figure 18e neural network is a bettercombiner as it is able to find the optimal weights for the fusionwhereas score averaging works as a fixed combiner with equalimportance to both classifiers showing lower accuracy eneural network-based fusion enhances the performance interms of accuracy It can be seen from Figures 19ndash23 that thefusion technique results in better performance

e correlation analysis is performed on the output oftwo SVM classifiers e result is listed in Table 3 eanalysis shows that the average correlation is less than 05is indicates that the classifiers moderately agree on theclassification Consequently the fusion of these scoresleads to improvement in the overall accuracy of thesystem

5 Conclusions

We have developed a method for human action recog-nition based on skeletal joints e proposed methodextracts structural and temporal features e structuralvariations are captured using joint angle and the tem-poral variations are represented using joint displacementvector e proposed approach is found to be simple as ituses single-view 2D joint locations and yet outperformssome of the state-of-the-art techniques Also we showedthat in the absence of Kinect sensor pose estimationalgorithm can be used as a preliminary step e proposedmethod shows promising results for action recognition taskswhen temporal features and structural features are fused atthe score level us the proposed method is suitable forrobust human action recognition tasks

Data Availability

e references of the datasets used in the experiment areprovided in the reference list

Conflicts of Interest

e authors have no conflicts of interest regarding thepublication of this paper

02

03

04

05

06

07

08

Accu

racy

5 10 15 20 25 30 35 40 450Parameter b

Joint anglesJoint displacementDecision level fusion

Figure 22 A graph demonstrating improvement in accuracy usingdecision level fusion for the MSR Action3D (AS2) dataset

01

02

03

04

05

06

07

08

09

1

Accu

racy

5 10 15 20 25 30 35 40 450Parameter b

Joint anglesJoint displacementDecision level fusion

Figure 23 A graph demonstrating improvement in accuracy usingdecision level fusion for the MSR Action3D (AS3) dataset

Table 3 Correlation analysis of the classifier output to find theclassifier agreement for the SVM classifiers shown in Figure 1

Dataset Correlation coefficients of classifier outputKTH [35] 06308UTKinect [36] 02938MSR Action3D [37]

AS1 04059AS2 05619AS3 05478

Average 048804

14 Journal of Robotics

References

[1] M Bucolo A Buscarino C Famoso L Fortuna andM Frasca ldquoControl of imperfect dynamical systemsrdquo Non-linear Dynamics vol 98 no 4 pp 2989ndash2999 2019

[2] J K Aggarwal and Q Cai ldquoHuman motion analysis a re-viewrdquo Computer Vision and Image Understanding vol 73no 3 pp 428ndash440 1999

[3] J K Aggarwal and M S Ryoo ldquoHuman activity analysis areviewrdquo ACM Computing Surveys vol 43 no 3 pp 1ndash432011

[4] J Ben-Arie Z Zhiqian Wang P Pandit and S RajaramldquoHuman activity recognition using multidimensional index-ingrdquo IEEE Transactions on Pattern Analysis and MachineIntelligence vol 24 no 8 pp 1091ndash1104 2002

[5] V Kellokumpu G Zhao and M Pietikainen ldquoRecognition ofhuman actions using texture descriptorsrdquoMachine Vision andApplications vol 22 no 5 pp 767ndash780 2011

[6] C Gao Y Shao and Y Guo ldquoHuman action recognitionusing motion energy templaterdquo Optical Engineering vol 54no 6 pp 1ndash10 2015

[7] W Xu Z Miao X-P Zhang and Y Tian ldquoA hierarchicalspatio-temporal model for human activity recognitionrdquo IEEETransactions on Multimedia vol 19 no 7 pp 1494ndash15092017

[8] H Wang A Klaser C Schmid and C-L Liu ldquoDense tra-jectories and motion boundary descriptors for action rec-ognitionrdquo International Journal of Computer Vision vol 103no 1 pp 60ndash79 2013

[9] Y Guo Y Liu A Oerlemans S Lao S Wu and M S LewldquoDeep learning for visual understanding a reviewrdquo Neuro-computing vol 187 pp 27ndash48 2016

[10] S Ji W Xu M Yang and K Yu ldquo3D convolutional neuralnetworks for human action recognitionrdquo IEEE Transactionson Pattern Analysis and Machine Intelligence vol 35 no 1pp 221ndash231 2013

[11] P Wang Z Li Y Hou and W Li ldquoAction recognition basedon joint trajectory maps using convolutional neural net-worksrdquo in Proceedings of the 24th ACM International Con-ference on Multimedia MM rsquo16 pp 102ndash106 Association forComputing Machinery New York NY USA October 2016

[12] C Li Y Hou P Wang andW Li ldquoJoint distance maps basedaction recognition with convolutional neural networksrdquo IEEESignal Processing Letters vol 24 no 5 pp 624ndash628 2017

[13] Y Hou Z Li P Wang and W Li ldquoSkeleton optical spectra-based action recognition using convolutional neural net-worksrdquo IEEE Transactions on Circuits and Systems for VideoTechnology vol 28 no 3 pp 807ndash811 2018

[14] P Wang W Li P Ogunbona J Wan and S Escalera ldquoRGB-D-based human motion recognition with deep learning asurveyrdquo Computer Vision and Image Understanding vol 171pp 118ndash139 2018

[15] H Rahmani A Mian and M Shah ldquoLearning a deep modelfor human action recognition from novel viewpointsrdquo IEEETransactions on Pattern Analysis and Machine Intelligencevol 40 no 3 pp 667ndash681 2018

[16] C Li Y Hou P Wang and W Li ldquoMultiview-based 3-Daction recognition using deep networksrdquo IEEE Transactionson Human-Machine Systems vol 49 no 1 pp 95ndash104 2019

[17] R Xiao Y Hou Z Guo C Li P Wang and W Li ldquoSelf-attention guided deep features for action recognitionrdquo inProceedings of the 2019 IEEE International Conference onMultimedia and Expo (ICME) pp 1060ndash1065 ShanghaiChina July 2019

[18] C Li Y Hou W Li and P Wang ldquoLearning attentive dy-namic maps (adms) for understanding human actionsrdquoJournal of Visual Communication and Image Representationvol 65 Article ID 102640 2019

[19] C Tang W Li P Wang and L Wang ldquoOnline human actionrecognition based on incremental learning of weighted co-variance descriptorsrdquo Information Sciences vol 467pp 219ndash237 2018

[20] A Edison and C V Jiji ldquoAutomated video analysis for actionrecognition using descriptors derived from optical accelera-tionrdquo Signal Image and Video Processing vol 13 no 5pp 915ndash922 2019

[21] B Fernando E Gavves M J Oramas A Ghodrati andT Tuytelaars ldquoRank pooling for action recognitionrdquo IEEETransactions on Pattern Analysis and Machine Intelligencevol 39 no 4 pp 773ndash787 2017

[22] P Wang L Liu C Shen and H T Shen ldquoOrder-awareconvolutional pooling for video based action recognitionrdquoPattern Recognition vol 91 pp 357ndash365 2019

[23] J Hu W Zheng L Ma G Wang J Lai and J Zhang ldquoEarlyaction prediction by soft regressionrdquo IEEE Transactions onPattern Analysis and Machine Intelligence vol 41 no 11pp 2568ndash2586 2019

[24] X Yang and Y L Tian ldquoAction recognition using super sparsecoding vector with spatio-temporal awarenessrdquo in ComputerVisionndashECCV 2014 pp 727ndash741 Springer InternationalPublishing Berlin Germany 2014

[25] J-F Hu W-S Zheng J Lai S Gong and T Xiang ldquoEx-emplar-based recognition of human-object interactionsrdquoIEEE Transactions on Circuits and Systems for Video Tech-nology vol 26 no 4 pp 647ndash660 2016

[26] J Zhu B Wang X Yang W Zhang and Z Tu ldquoActionrecognition with actonsrdquo in Proceedings of the 2013 IEEEInternational Conference on Computer Vision pp 3559ndash3566Sydney Australia December 2013

[27] H Wang and C Schmid ldquoAction recognition with improvedtrajectoriesrdquo in Proceedings of the 2013 IEEE InternationalConference on Computer Vision pp 3551ndash3558 SydneyAustralia December 2013

[28] W Hao and Z Zhang ldquoSpatiotemporal distilled dense-con-nectivity network for video action recognitionrdquo PatternRecognition vol 92 pp 13ndash24 2019

[29] M Ramanathan W-Y Yau and E K Teoh ldquoHuman actionrecognition with video data research and evaluation chal-lengesrdquo IEEE Transactions on Human-Machine Systemsvol 44 no 5 pp 650ndash663 2014

[30] D Gowsikhaa S Abirami and R Baskaran ldquoAutomatedhuman behavior analysis from surveillance videos a surveyrdquoArtificial Intelligence Review vol 42 no 4 pp 747ndash765 2014

[31] Y Fu Human Activity Recognition and Prediction SpringerBerlin Germany 2016

[32] Z Cao G Hidalgo T Simon S-E Wei and Y SheikhldquoOpenPose realtime multi-person 2D pose estimation usingpart affinity fieldsrdquo 2018 httpsarxivorgabs181208008

[33] Z Cao T Simon S E Wei and Y Sheikh ldquoRealtime multi-person 2D pose estimation using part affinity fieldsrdquo inProceedings of the 30th IEEE Conference on Computer Visionand Pattern Recognition CVPR 2017 pp 1302ndash1310 Hono-lulu HI USA July 2017

[34] MSCOCO keypoint evaluation metricrdquo httpcocodatasetorgkeypoints-eval

[35] C Schuldt I Laptev and B Caputo ldquoRecognizing humanactions a local SVM approachrdquo in Proceedings of the 17th

Journal of Robotics 15

International Conference on Pattern Recognition 2004 ICPR2004 vol 3 pp 32ndash36 Cambridge UK September 2004

[36] Y Zhu W Chen and G Guo ldquoFusing spatiotemporal fea-tures and joints for 3D action recognitionrdquo in Proceedings ofthe 2013 IEEE Conference on Computer Vision and PatternRecognition Workshops pp 486ndash491 Portland OR USAJune 2013

[37] W Li Z Zhang and Z Liu ldquoAction recognition based on abag of 3D pointsrdquo in Proceedings of the 2010 IEEE ComputerSociety Conference on Computer Vision and Pattern Recog-nition-Workshops pp 9ndash14 San Francisco CA USA June2010

[38] C Schuldt B Caputo C Sch and L Barbara ldquoRecognizinghuman actions a local SVM approach recognizing humanactionsrdquo in Proceedings of the 17th International Conferenceon Pattern Recognition 2004 (ICPR 2004) vol 3 pp 3ndash7Cambridge UK September 2004

[39] M Yang F Lv W Xu K Yu and Y Gong ldquoHuman actiondetection by boosting efficient motion featuresrdquo in Pro-ceedings of the 2009 IEEE 12th International Conference onComputer Vision Workshops ICCV Workshops pp 522ndash529Kyoto Japan September 2009

[40] L Xia C C Chen and J K Aggarwal ldquoView invariant humanaction recognition using histograms of 3D jointsrdquo in Pro-ceedings of the 2012 IEEE Computer Society Conference onComputer Vision and Pattern Recognition Workshopspp 20ndash27 IEEE Providence RI USA June 2012

[41] Y Zhu W Chen and G Guo ldquoFusing spatiotemporal fea-tures and joints for 3D action recognitionrdquo in Proceedings ofthe IEEE Conference on Computer Vision and Pattern Rec-ognition (CVPR) Workshops Portland OR USA June 2013

[42] X Yang and Y Tian ldquoEffective 3d action recognition usingeigenjointsrdquo Journal of Visual Communication and ImageRepresentation vol 25 no 1 pp 2ndash11 2014

[43] E Ohn-Bar and M M Trivedi ldquoJoint angles similarities andhog2 for action recognitionrdquo in Proceedings of the 2013 IEEEConference on Computer Vision and Pattern RecognitionWorkshops CVPRWrsquo13 pp 465ndash470 IEEE Computer Soci-ety Washington DC USA June 2013

[44] R Vemulapalli F Arrate and R Chellappa ldquoHuman actionrecognition by representing 3D skeletons as points in a liegrouprdquo in Proceedings of the 2014 IEEE Conference onComputer Vision and Pattern Recognition pp 588ndash595Columbus OH USA June 2014

[45] L I Kuncheva Combining Pattern Classifiers Methods andAlgorithms John Wiley amp Sons Hoboken NJ USA 2004

16 Journal of Robotics

Page 12: EnhancedHumanActionRecognitionUsingFusionofSkeletal ...downloads.hindawi.com/journals/jr/2020/3096858.pdf · e displacement vector of joint locations is used to compute the temporal

a01 a04

a07 a08

a09 a11

a12 a14

0

05

1

20

10

20100

05

1

20

10

20100

05

1

0

05

1

20

10

20100

05

1

201020

10

20

10

2010

20

10

20100

05

1

20

10

20100

05

1

20

10

2010

0

05

1

Figure 16 Visualization of dominant joints with respect to angular movement for MSR Action3D (AS2) dataset

a06 a14

a15 a16

a17 a18

a19 a20

20

10

201020

10

20

10

20

10

20

10

20

10

20

10

20

10

0

05

1

0

05

1

0

05

1

0

05

1

2010

2010

2010

20100

05

1

0

05

1

0

05

1

0

05

1

2010

2010

2010

Figure 17 Visualization of dominant joints with respect to angular movement for MSR Action3D (AS3) dataset

12 Journal of Robotics

0

02

04

06

08

1

KTH

UTK

inec

t

MSR

-AS1

MSR

-AS2

MSR

-AS3

Fusion using score averagingFusion using neural network

Figure 18 A comparison of decision level fusion using neuralnetwork and score averaging [45]

5 10 15 20 25 30 35 40 450Parameter b

04

05

06

07

08

09

Accu

racy

Joint anglesJoint displacementDecision level fusion

Figure 19 A graph demonstrating improvement in accuracy usingdecision level fusion for the KTH dataset

01

02

03

04

05

06

07

08

09

Accu

racy

5 10 15 20 25 30 35 40 450Parameter b

Joint anglesJoint displacementDecision level fusion

Figure 20 A graph demonstrating improvement in accuracy usingdecision level fusion for the UTKinect dataset

02

03

04

05

06

07

08

09

Accu

racy

5 10 15 20 25 30 35 40 450Parameter b

Joint anglesJoint displacementDecision level fusion

Figure 21 A graph demonstrating improvement in accuracy usingdecision level fusion for the MSR Action3D (AS1) dataset

Journal of Robotics 13

e accuracy of the proposed system has been analyzedusing two types of combiners a trainable combiner using aneural network and a fixed combiner using score averaging[45]is is shown in Figure 18e neural network is a bettercombiner as it is able to find the optimal weights for the fusionwhereas score averaging works as a fixed combiner with equalimportance to both classifiers showing lower accuracy eneural network-based fusion enhances the performance interms of accuracy It can be seen from Figures 19ndash23 that thefusion technique results in better performance

e correlation analysis is performed on the output oftwo SVM classifiers e result is listed in Table 3 eanalysis shows that the average correlation is less than 05is indicates that the classifiers moderately agree on theclassification Consequently the fusion of these scoresleads to improvement in the overall accuracy of thesystem

5 Conclusions

We have developed a method for human action recog-nition based on skeletal joints e proposed methodextracts structural and temporal features e structuralvariations are captured using joint angle and the tem-poral variations are represented using joint displacementvector e proposed approach is found to be simple as ituses single-view 2D joint locations and yet outperformssome of the state-of-the-art techniques Also we showedthat in the absence of Kinect sensor pose estimationalgorithm can be used as a preliminary step e proposedmethod shows promising results for action recognition taskswhen temporal features and structural features are fused atthe score level us the proposed method is suitable forrobust human action recognition tasks

Data Availability

e references of the datasets used in the experiment areprovided in the reference list

Conflicts of Interest

e authors have no conflicts of interest regarding thepublication of this paper

02

03

04

05

06

07

08

Accu

racy

5 10 15 20 25 30 35 40 450Parameter b

Joint anglesJoint displacementDecision level fusion

Figure 22 A graph demonstrating improvement in accuracy usingdecision level fusion for the MSR Action3D (AS2) dataset

01

02

03

04

05

06

07

08

09

1

Accu

racy

5 10 15 20 25 30 35 40 450Parameter b

Joint anglesJoint displacementDecision level fusion

Figure 23 A graph demonstrating improvement in accuracy usingdecision level fusion for the MSR Action3D (AS3) dataset

Table 3 Correlation analysis of the classifier output to find theclassifier agreement for the SVM classifiers shown in Figure 1

Dataset Correlation coefficients of classifier outputKTH [35] 06308UTKinect [36] 02938MSR Action3D [37]

AS1 04059AS2 05619AS3 05478

Average 048804

14 Journal of Robotics

References

[1] M Bucolo A Buscarino C Famoso L Fortuna andM Frasca ldquoControl of imperfect dynamical systemsrdquo Non-linear Dynamics vol 98 no 4 pp 2989ndash2999 2019

[2] J K Aggarwal and Q Cai ldquoHuman motion analysis a re-viewrdquo Computer Vision and Image Understanding vol 73no 3 pp 428ndash440 1999

[3] J K Aggarwal and M S Ryoo ldquoHuman activity analysis areviewrdquo ACM Computing Surveys vol 43 no 3 pp 1ndash432011

[4] J Ben-Arie Z Zhiqian Wang P Pandit and S RajaramldquoHuman activity recognition using multidimensional index-ingrdquo IEEE Transactions on Pattern Analysis and MachineIntelligence vol 24 no 8 pp 1091ndash1104 2002

[5] V Kellokumpu G Zhao and M Pietikainen ldquoRecognition ofhuman actions using texture descriptorsrdquoMachine Vision andApplications vol 22 no 5 pp 767ndash780 2011

[6] C Gao Y Shao and Y Guo ldquoHuman action recognitionusing motion energy templaterdquo Optical Engineering vol 54no 6 pp 1ndash10 2015

[7] W Xu Z Miao X-P Zhang and Y Tian ldquoA hierarchicalspatio-temporal model for human activity recognitionrdquo IEEETransactions on Multimedia vol 19 no 7 pp 1494ndash15092017

[8] H Wang A Klaser C Schmid and C-L Liu ldquoDense tra-jectories and motion boundary descriptors for action rec-ognitionrdquo International Journal of Computer Vision vol 103no 1 pp 60ndash79 2013

[9] Y Guo Y Liu A Oerlemans S Lao S Wu and M S LewldquoDeep learning for visual understanding a reviewrdquo Neuro-computing vol 187 pp 27ndash48 2016

[10] S Ji W Xu M Yang and K Yu ldquo3D convolutional neuralnetworks for human action recognitionrdquo IEEE Transactionson Pattern Analysis and Machine Intelligence vol 35 no 1pp 221ndash231 2013

[11] P Wang Z Li Y Hou and W Li ldquoAction recognition basedon joint trajectory maps using convolutional neural net-worksrdquo in Proceedings of the 24th ACM International Con-ference on Multimedia MM rsquo16 pp 102ndash106 Association forComputing Machinery New York NY USA October 2016

[12] C Li Y Hou P Wang andW Li ldquoJoint distance maps basedaction recognition with convolutional neural networksrdquo IEEESignal Processing Letters vol 24 no 5 pp 624ndash628 2017

[13] Y Hou Z Li P Wang and W Li ldquoSkeleton optical spectra-based action recognition using convolutional neural net-worksrdquo IEEE Transactions on Circuits and Systems for VideoTechnology vol 28 no 3 pp 807ndash811 2018

[14] P Wang W Li P Ogunbona J Wan and S Escalera ldquoRGB-D-based human motion recognition with deep learning asurveyrdquo Computer Vision and Image Understanding vol 171pp 118ndash139 2018

[15] H Rahmani A Mian and M Shah ldquoLearning a deep modelfor human action recognition from novel viewpointsrdquo IEEETransactions on Pattern Analysis and Machine Intelligencevol 40 no 3 pp 667ndash681 2018

[16] C Li Y Hou P Wang and W Li ldquoMultiview-based 3-Daction recognition using deep networksrdquo IEEE Transactionson Human-Machine Systems vol 49 no 1 pp 95ndash104 2019

[17] R Xiao Y Hou Z Guo C Li P Wang and W Li ldquoSelf-attention guided deep features for action recognitionrdquo inProceedings of the 2019 IEEE International Conference onMultimedia and Expo (ICME) pp 1060ndash1065 ShanghaiChina July 2019

[18] C Li Y Hou W Li and P Wang ldquoLearning attentive dy-namic maps (adms) for understanding human actionsrdquoJournal of Visual Communication and Image Representationvol 65 Article ID 102640 2019

[19] C Tang W Li P Wang and L Wang ldquoOnline human actionrecognition based on incremental learning of weighted co-variance descriptorsrdquo Information Sciences vol 467pp 219ndash237 2018

[20] A Edison and C V Jiji ldquoAutomated video analysis for actionrecognition using descriptors derived from optical accelera-tionrdquo Signal Image and Video Processing vol 13 no 5pp 915ndash922 2019

[21] B Fernando E Gavves M J Oramas A Ghodrati andT Tuytelaars ldquoRank pooling for action recognitionrdquo IEEETransactions on Pattern Analysis and Machine Intelligencevol 39 no 4 pp 773ndash787 2017

[22] P Wang L Liu C Shen and H T Shen ldquoOrder-awareconvolutional pooling for video based action recognitionrdquoPattern Recognition vol 91 pp 357ndash365 2019

[23] J Hu W Zheng L Ma G Wang J Lai and J Zhang ldquoEarlyaction prediction by soft regressionrdquo IEEE Transactions onPattern Analysis and Machine Intelligence vol 41 no 11pp 2568ndash2586 2019

[24] X Yang and Y L Tian ldquoAction recognition using super sparsecoding vector with spatio-temporal awarenessrdquo in ComputerVisionndashECCV 2014 pp 727ndash741 Springer InternationalPublishing Berlin Germany 2014

[25] J-F Hu W-S Zheng J Lai S Gong and T Xiang ldquoEx-emplar-based recognition of human-object interactionsrdquoIEEE Transactions on Circuits and Systems for Video Tech-nology vol 26 no 4 pp 647ndash660 2016

[26] J Zhu B Wang X Yang W Zhang and Z Tu ldquoActionrecognition with actonsrdquo in Proceedings of the 2013 IEEEInternational Conference on Computer Vision pp 3559ndash3566Sydney Australia December 2013

[27] H Wang and C Schmid ldquoAction recognition with improvedtrajectoriesrdquo in Proceedings of the 2013 IEEE InternationalConference on Computer Vision pp 3551ndash3558 SydneyAustralia December 2013

[28] W Hao and Z Zhang ldquoSpatiotemporal distilled dense-con-nectivity network for video action recognitionrdquo PatternRecognition vol 92 pp 13ndash24 2019

[29] M Ramanathan W-Y Yau and E K Teoh ldquoHuman actionrecognition with video data research and evaluation chal-lengesrdquo IEEE Transactions on Human-Machine Systemsvol 44 no 5 pp 650ndash663 2014

[30] D Gowsikhaa S Abirami and R Baskaran ldquoAutomatedhuman behavior analysis from surveillance videos a surveyrdquoArtificial Intelligence Review vol 42 no 4 pp 747ndash765 2014

[31] Y Fu Human Activity Recognition and Prediction SpringerBerlin Germany 2016

[32] Z Cao G Hidalgo T Simon S-E Wei and Y SheikhldquoOpenPose realtime multi-person 2D pose estimation usingpart affinity fieldsrdquo 2018 httpsarxivorgabs181208008

[33] Z Cao T Simon S E Wei and Y Sheikh ldquoRealtime multi-person 2D pose estimation using part affinity fieldsrdquo inProceedings of the 30th IEEE Conference on Computer Visionand Pattern Recognition CVPR 2017 pp 1302ndash1310 Hono-lulu HI USA July 2017

[34] MSCOCO keypoint evaluation metricrdquo httpcocodatasetorgkeypoints-eval

[35] C Schuldt I Laptev and B Caputo ldquoRecognizing humanactions a local SVM approachrdquo in Proceedings of the 17th

Journal of Robotics 15

International Conference on Pattern Recognition 2004 ICPR2004 vol 3 pp 32ndash36 Cambridge UK September 2004

[36] Y Zhu W Chen and G Guo ldquoFusing spatiotemporal fea-tures and joints for 3D action recognitionrdquo in Proceedings ofthe 2013 IEEE Conference on Computer Vision and PatternRecognition Workshops pp 486ndash491 Portland OR USAJune 2013

[37] W Li Z Zhang and Z Liu ldquoAction recognition based on abag of 3D pointsrdquo in Proceedings of the 2010 IEEE ComputerSociety Conference on Computer Vision and Pattern Recog-nition-Workshops pp 9ndash14 San Francisco CA USA June2010

[38] C Schuldt B Caputo C Sch and L Barbara ldquoRecognizinghuman actions a local SVM approach recognizing humanactionsrdquo in Proceedings of the 17th International Conferenceon Pattern Recognition 2004 (ICPR 2004) vol 3 pp 3ndash7Cambridge UK September 2004

[39] M Yang F Lv W Xu K Yu and Y Gong ldquoHuman actiondetection by boosting efficient motion featuresrdquo in Pro-ceedings of the 2009 IEEE 12th International Conference onComputer Vision Workshops ICCV Workshops pp 522ndash529Kyoto Japan September 2009

[40] L Xia C C Chen and J K Aggarwal ldquoView invariant humanaction recognition using histograms of 3D jointsrdquo in Pro-ceedings of the 2012 IEEE Computer Society Conference onComputer Vision and Pattern Recognition Workshopspp 20ndash27 IEEE Providence RI USA June 2012

[41] Y Zhu W Chen and G Guo ldquoFusing spatiotemporal fea-tures and joints for 3D action recognitionrdquo in Proceedings ofthe IEEE Conference on Computer Vision and Pattern Rec-ognition (CVPR) Workshops Portland OR USA June 2013

[42] X Yang and Y Tian ldquoEffective 3d action recognition usingeigenjointsrdquo Journal of Visual Communication and ImageRepresentation vol 25 no 1 pp 2ndash11 2014

[43] E Ohn-Bar and M M Trivedi ldquoJoint angles similarities andhog2 for action recognitionrdquo in Proceedings of the 2013 IEEEConference on Computer Vision and Pattern RecognitionWorkshops CVPRWrsquo13 pp 465ndash470 IEEE Computer Soci-ety Washington DC USA June 2013

[44] R Vemulapalli F Arrate and R Chellappa ldquoHuman actionrecognition by representing 3D skeletons as points in a liegrouprdquo in Proceedings of the 2014 IEEE Conference onComputer Vision and Pattern Recognition pp 588ndash595Columbus OH USA June 2014

[45] L I Kuncheva Combining Pattern Classifiers Methods andAlgorithms John Wiley amp Sons Hoboken NJ USA 2004

16 Journal of Robotics

Page 13: EnhancedHumanActionRecognitionUsingFusionofSkeletal ...downloads.hindawi.com/journals/jr/2020/3096858.pdf · e displacement vector of joint locations is used to compute the temporal

0

02

04

06

08

1

KTH

UTK

inec

t

MSR

-AS1

MSR

-AS2

MSR

-AS3

Fusion using score averagingFusion using neural network

Figure 18 A comparison of decision level fusion using neuralnetwork and score averaging [45]

5 10 15 20 25 30 35 40 450Parameter b

04

05

06

07

08

09

Accu

racy

Joint anglesJoint displacementDecision level fusion

Figure 19 A graph demonstrating improvement in accuracy usingdecision level fusion for the KTH dataset

01

02

03

04

05

06

07

08

09

Accu

racy

5 10 15 20 25 30 35 40 450Parameter b

Joint anglesJoint displacementDecision level fusion

Figure 20 A graph demonstrating improvement in accuracy usingdecision level fusion for the UTKinect dataset

02

03

04

05

06

07

08

09

Accu

racy

5 10 15 20 25 30 35 40 450Parameter b

Joint anglesJoint displacementDecision level fusion

Figure 21 A graph demonstrating improvement in accuracy usingdecision level fusion for the MSR Action3D (AS1) dataset

Journal of Robotics 13

e accuracy of the proposed system has been analyzedusing two types of combiners a trainable combiner using aneural network and a fixed combiner using score averaging[45]is is shown in Figure 18e neural network is a bettercombiner as it is able to find the optimal weights for the fusionwhereas score averaging works as a fixed combiner with equalimportance to both classifiers showing lower accuracy eneural network-based fusion enhances the performance interms of accuracy It can be seen from Figures 19ndash23 that thefusion technique results in better performance

e correlation analysis is performed on the output oftwo SVM classifiers e result is listed in Table 3 eanalysis shows that the average correlation is less than 05is indicates that the classifiers moderately agree on theclassification Consequently the fusion of these scoresleads to improvement in the overall accuracy of thesystem

5 Conclusions

We have developed a method for human action recog-nition based on skeletal joints e proposed methodextracts structural and temporal features e structuralvariations are captured using joint angle and the tem-poral variations are represented using joint displacementvector e proposed approach is found to be simple as ituses single-view 2D joint locations and yet outperformssome of the state-of-the-art techniques Also we showedthat in the absence of Kinect sensor pose estimationalgorithm can be used as a preliminary step e proposedmethod shows promising results for action recognition taskswhen temporal features and structural features are fused atthe score level us the proposed method is suitable forrobust human action recognition tasks

Data Availability

e references of the datasets used in the experiment areprovided in the reference list

Conflicts of Interest

e authors have no conflicts of interest regarding thepublication of this paper

02

03

04

05

06

07

08

Accu

racy

5 10 15 20 25 30 35 40 450Parameter b

Joint anglesJoint displacementDecision level fusion

Figure 22 A graph demonstrating improvement in accuracy usingdecision level fusion for the MSR Action3D (AS2) dataset

01

02

03

04

05

06

07

08

09

1

Accu

racy

5 10 15 20 25 30 35 40 450Parameter b

Joint anglesJoint displacementDecision level fusion

Figure 23 A graph demonstrating improvement in accuracy usingdecision level fusion for the MSR Action3D (AS3) dataset

Table 3 Correlation analysis of the classifier output to find theclassifier agreement for the SVM classifiers shown in Figure 1

Dataset Correlation coefficients of classifier outputKTH [35] 06308UTKinect [36] 02938MSR Action3D [37]

AS1 04059AS2 05619AS3 05478

Average 048804

14 Journal of Robotics

References

[1] M Bucolo A Buscarino C Famoso L Fortuna andM Frasca ldquoControl of imperfect dynamical systemsrdquo Non-linear Dynamics vol 98 no 4 pp 2989ndash2999 2019

[2] J K Aggarwal and Q Cai ldquoHuman motion analysis a re-viewrdquo Computer Vision and Image Understanding vol 73no 3 pp 428ndash440 1999

[3] J K Aggarwal and M S Ryoo ldquoHuman activity analysis areviewrdquo ACM Computing Surveys vol 43 no 3 pp 1ndash432011

[4] J Ben-Arie Z Zhiqian Wang P Pandit and S RajaramldquoHuman activity recognition using multidimensional index-ingrdquo IEEE Transactions on Pattern Analysis and MachineIntelligence vol 24 no 8 pp 1091ndash1104 2002

[5] V Kellokumpu G Zhao and M Pietikainen ldquoRecognition ofhuman actions using texture descriptorsrdquoMachine Vision andApplications vol 22 no 5 pp 767ndash780 2011

[6] C Gao Y Shao and Y Guo ldquoHuman action recognitionusing motion energy templaterdquo Optical Engineering vol 54no 6 pp 1ndash10 2015

[7] W Xu Z Miao X-P Zhang and Y Tian ldquoA hierarchicalspatio-temporal model for human activity recognitionrdquo IEEETransactions on Multimedia vol 19 no 7 pp 1494ndash15092017

[8] H Wang A Klaser C Schmid and C-L Liu ldquoDense tra-jectories and motion boundary descriptors for action rec-ognitionrdquo International Journal of Computer Vision vol 103no 1 pp 60ndash79 2013

[9] Y Guo Y Liu A Oerlemans S Lao S Wu and M S LewldquoDeep learning for visual understanding a reviewrdquo Neuro-computing vol 187 pp 27ndash48 2016

[10] S Ji W Xu M Yang and K Yu ldquo3D convolutional neuralnetworks for human action recognitionrdquo IEEE Transactionson Pattern Analysis and Machine Intelligence vol 35 no 1pp 221ndash231 2013

[11] P Wang Z Li Y Hou and W Li ldquoAction recognition basedon joint trajectory maps using convolutional neural net-worksrdquo in Proceedings of the 24th ACM International Con-ference on Multimedia MM rsquo16 pp 102ndash106 Association forComputing Machinery New York NY USA October 2016

[12] C Li Y Hou P Wang andW Li ldquoJoint distance maps basedaction recognition with convolutional neural networksrdquo IEEESignal Processing Letters vol 24 no 5 pp 624ndash628 2017

[13] Y Hou Z Li P Wang and W Li ldquoSkeleton optical spectra-based action recognition using convolutional neural net-worksrdquo IEEE Transactions on Circuits and Systems for VideoTechnology vol 28 no 3 pp 807ndash811 2018

[14] P Wang W Li P Ogunbona J Wan and S Escalera ldquoRGB-D-based human motion recognition with deep learning asurveyrdquo Computer Vision and Image Understanding vol 171pp 118ndash139 2018

[15] H Rahmani A Mian and M Shah ldquoLearning a deep modelfor human action recognition from novel viewpointsrdquo IEEETransactions on Pattern Analysis and Machine Intelligencevol 40 no 3 pp 667ndash681 2018

[16] C Li Y Hou P Wang and W Li ldquoMultiview-based 3-Daction recognition using deep networksrdquo IEEE Transactionson Human-Machine Systems vol 49 no 1 pp 95ndash104 2019

[17] R Xiao Y Hou Z Guo C Li P Wang and W Li ldquoSelf-attention guided deep features for action recognitionrdquo inProceedings of the 2019 IEEE International Conference onMultimedia and Expo (ICME) pp 1060ndash1065 ShanghaiChina July 2019

[18] C Li Y Hou W Li and P Wang ldquoLearning attentive dy-namic maps (adms) for understanding human actionsrdquoJournal of Visual Communication and Image Representationvol 65 Article ID 102640 2019

[19] C Tang W Li P Wang and L Wang ldquoOnline human actionrecognition based on incremental learning of weighted co-variance descriptorsrdquo Information Sciences vol 467pp 219ndash237 2018

[20] A Edison and C V Jiji ldquoAutomated video analysis for actionrecognition using descriptors derived from optical accelera-tionrdquo Signal Image and Video Processing vol 13 no 5pp 915ndash922 2019

[21] B Fernando E Gavves M J Oramas A Ghodrati andT Tuytelaars ldquoRank pooling for action recognitionrdquo IEEETransactions on Pattern Analysis and Machine Intelligencevol 39 no 4 pp 773ndash787 2017

[22] P Wang L Liu C Shen and H T Shen ldquoOrder-awareconvolutional pooling for video based action recognitionrdquoPattern Recognition vol 91 pp 357ndash365 2019

[23] J Hu W Zheng L Ma G Wang J Lai and J Zhang ldquoEarlyaction prediction by soft regressionrdquo IEEE Transactions onPattern Analysis and Machine Intelligence vol 41 no 11pp 2568ndash2586 2019

[24] X Yang and Y L Tian ldquoAction recognition using super sparsecoding vector with spatio-temporal awarenessrdquo in ComputerVisionndashECCV 2014 pp 727ndash741 Springer InternationalPublishing Berlin Germany 2014

[25] J-F Hu W-S Zheng J Lai S Gong and T Xiang ldquoEx-emplar-based recognition of human-object interactionsrdquoIEEE Transactions on Circuits and Systems for Video Tech-nology vol 26 no 4 pp 647ndash660 2016

[26] J Zhu B Wang X Yang W Zhang and Z Tu ldquoActionrecognition with actonsrdquo in Proceedings of the 2013 IEEEInternational Conference on Computer Vision pp 3559ndash3566Sydney Australia December 2013

[27] H Wang and C Schmid ldquoAction recognition with improvedtrajectoriesrdquo in Proceedings of the 2013 IEEE InternationalConference on Computer Vision pp 3551ndash3558 SydneyAustralia December 2013

[28] W Hao and Z Zhang ldquoSpatiotemporal distilled dense-con-nectivity network for video action recognitionrdquo PatternRecognition vol 92 pp 13ndash24 2019

[29] M Ramanathan W-Y Yau and E K Teoh ldquoHuman actionrecognition with video data research and evaluation chal-lengesrdquo IEEE Transactions on Human-Machine Systemsvol 44 no 5 pp 650ndash663 2014

[30] D Gowsikhaa S Abirami and R Baskaran ldquoAutomatedhuman behavior analysis from surveillance videos a surveyrdquoArtificial Intelligence Review vol 42 no 4 pp 747ndash765 2014

[31] Y Fu Human Activity Recognition and Prediction SpringerBerlin Germany 2016

[32] Z Cao G Hidalgo T Simon S-E Wei and Y SheikhldquoOpenPose realtime multi-person 2D pose estimation usingpart affinity fieldsrdquo 2018 httpsarxivorgabs181208008

[33] Z Cao T Simon S E Wei and Y Sheikh ldquoRealtime multi-person 2D pose estimation using part affinity fieldsrdquo inProceedings of the 30th IEEE Conference on Computer Visionand Pattern Recognition CVPR 2017 pp 1302ndash1310 Hono-lulu HI USA July 2017

[34] MSCOCO keypoint evaluation metricrdquo httpcocodatasetorgkeypoints-eval

[35] C Schuldt I Laptev and B Caputo ldquoRecognizing humanactions a local SVM approachrdquo in Proceedings of the 17th

Journal of Robotics 15

International Conference on Pattern Recognition 2004 ICPR2004 vol 3 pp 32ndash36 Cambridge UK September 2004

[36] Y Zhu W Chen and G Guo ldquoFusing spatiotemporal fea-tures and joints for 3D action recognitionrdquo in Proceedings ofthe 2013 IEEE Conference on Computer Vision and PatternRecognition Workshops pp 486ndash491 Portland OR USAJune 2013

[37] W Li Z Zhang and Z Liu ldquoAction recognition based on abag of 3D pointsrdquo in Proceedings of the 2010 IEEE ComputerSociety Conference on Computer Vision and Pattern Recog-nition-Workshops pp 9ndash14 San Francisco CA USA June2010

[38] C Schuldt B Caputo C Sch and L Barbara ldquoRecognizinghuman actions a local SVM approach recognizing humanactionsrdquo in Proceedings of the 17th International Conferenceon Pattern Recognition 2004 (ICPR 2004) vol 3 pp 3ndash7Cambridge UK September 2004

[39] M Yang F Lv W Xu K Yu and Y Gong ldquoHuman actiondetection by boosting efficient motion featuresrdquo in Pro-ceedings of the 2009 IEEE 12th International Conference onComputer Vision Workshops ICCV Workshops pp 522ndash529Kyoto Japan September 2009

[40] L Xia C C Chen and J K Aggarwal ldquoView invariant humanaction recognition using histograms of 3D jointsrdquo in Pro-ceedings of the 2012 IEEE Computer Society Conference onComputer Vision and Pattern Recognition Workshopspp 20ndash27 IEEE Providence RI USA June 2012

[41] Y Zhu W Chen and G Guo ldquoFusing spatiotemporal fea-tures and joints for 3D action recognitionrdquo in Proceedings ofthe IEEE Conference on Computer Vision and Pattern Rec-ognition (CVPR) Workshops Portland OR USA June 2013

[42] X Yang and Y Tian ldquoEffective 3d action recognition usingeigenjointsrdquo Journal of Visual Communication and ImageRepresentation vol 25 no 1 pp 2ndash11 2014

[43] E Ohn-Bar and M M Trivedi ldquoJoint angles similarities andhog2 for action recognitionrdquo in Proceedings of the 2013 IEEEConference on Computer Vision and Pattern RecognitionWorkshops CVPRWrsquo13 pp 465ndash470 IEEE Computer Soci-ety Washington DC USA June 2013

[44] R Vemulapalli F Arrate and R Chellappa ldquoHuman actionrecognition by representing 3D skeletons as points in a liegrouprdquo in Proceedings of the 2014 IEEE Conference onComputer Vision and Pattern Recognition pp 588ndash595Columbus OH USA June 2014

[45] L I Kuncheva Combining Pattern Classifiers Methods andAlgorithms John Wiley amp Sons Hoboken NJ USA 2004

16 Journal of Robotics

Page 14: EnhancedHumanActionRecognitionUsingFusionofSkeletal ...downloads.hindawi.com/journals/jr/2020/3096858.pdf · e displacement vector of joint locations is used to compute the temporal

e accuracy of the proposed system has been analyzedusing two types of combiners a trainable combiner using aneural network and a fixed combiner using score averaging[45]is is shown in Figure 18e neural network is a bettercombiner as it is able to find the optimal weights for the fusionwhereas score averaging works as a fixed combiner with equalimportance to both classifiers showing lower accuracy eneural network-based fusion enhances the performance interms of accuracy It can be seen from Figures 19ndash23 that thefusion technique results in better performance

e correlation analysis is performed on the output oftwo SVM classifiers e result is listed in Table 3 eanalysis shows that the average correlation is less than 05is indicates that the classifiers moderately agree on theclassification Consequently the fusion of these scoresleads to improvement in the overall accuracy of thesystem

5 Conclusions

We have developed a method for human action recog-nition based on skeletal joints e proposed methodextracts structural and temporal features e structuralvariations are captured using joint angle and the tem-poral variations are represented using joint displacementvector e proposed approach is found to be simple as ituses single-view 2D joint locations and yet outperformssome of the state-of-the-art techniques Also we showedthat in the absence of Kinect sensor pose estimationalgorithm can be used as a preliminary step e proposedmethod shows promising results for action recognition taskswhen temporal features and structural features are fused atthe score level us the proposed method is suitable forrobust human action recognition tasks

Data Availability

e references of the datasets used in the experiment areprovided in the reference list

Conflicts of Interest

e authors have no conflicts of interest regarding thepublication of this paper

02

03

04

05

06

07

08

Accu

racy

5 10 15 20 25 30 35 40 450Parameter b

Joint anglesJoint displacementDecision level fusion

Figure 22 A graph demonstrating improvement in accuracy usingdecision level fusion for the MSR Action3D (AS2) dataset

01

02

03

04

05

06

07

08

09

1

Accu

racy

5 10 15 20 25 30 35 40 450Parameter b

Joint anglesJoint displacementDecision level fusion

Figure 23 A graph demonstrating improvement in accuracy usingdecision level fusion for the MSR Action3D (AS3) dataset

Table 3 Correlation analysis of the classifier output to find theclassifier agreement for the SVM classifiers shown in Figure 1

Dataset Correlation coefficients of classifier outputKTH [35] 06308UTKinect [36] 02938MSR Action3D [37]

AS1 04059AS2 05619AS3 05478

Average 048804

14 Journal of Robotics

References

[1] M. Bucolo, A. Buscarino, C. Famoso, L. Fortuna, and M. Frasca, "Control of imperfect dynamical systems," Nonlinear Dynamics, vol. 98, no. 4, pp. 2989–2999, 2019.
[2] J. K. Aggarwal and Q. Cai, "Human motion analysis: a review," Computer Vision and Image Understanding, vol. 73, no. 3, pp. 428–440, 1999.
[3] J. K. Aggarwal and M. S. Ryoo, "Human activity analysis: a review," ACM Computing Surveys, vol. 43, no. 3, pp. 1–43, 2011.
[4] J. Ben-Arie, Z. Zhiqian Wang, P. Pandit, and S. Rajaram, "Human activity recognition using multidimensional indexing," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 8, pp. 1091–1104, 2002.
[5] V. Kellokumpu, G. Zhao, and M. Pietikainen, "Recognition of human actions using texture descriptors," Machine Vision and Applications, vol. 22, no. 5, pp. 767–780, 2011.
[6] C. Gao, Y. Shao, and Y. Guo, "Human action recognition using motion energy template," Optical Engineering, vol. 54, no. 6, pp. 1–10, 2015.
[7] W. Xu, Z. Miao, X.-P. Zhang, and Y. Tian, "A hierarchical spatio-temporal model for human activity recognition," IEEE Transactions on Multimedia, vol. 19, no. 7, pp. 1494–1509, 2017.
[8] H. Wang, A. Klaser, C. Schmid, and C.-L. Liu, "Dense trajectories and motion boundary descriptors for action recognition," International Journal of Computer Vision, vol. 103, no. 1, pp. 60–79, 2013.
[9] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, and M. S. Lew, "Deep learning for visual understanding: a review," Neurocomputing, vol. 187, pp. 27–48, 2016.
[10] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221–231, 2013.
[11] P. Wang, Z. Li, Y. Hou, and W. Li, "Action recognition based on joint trajectory maps using convolutional neural networks," in Proceedings of the 24th ACM International Conference on Multimedia (MM '16), pp. 102–106, Association for Computing Machinery, New York, NY, USA, October 2016.
[12] C. Li, Y. Hou, P. Wang, and W. Li, "Joint distance maps based action recognition with convolutional neural networks," IEEE Signal Processing Letters, vol. 24, no. 5, pp. 624–628, 2017.
[13] Y. Hou, Z. Li, P. Wang, and W. Li, "Skeleton optical spectra-based action recognition using convolutional neural networks," IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 3, pp. 807–811, 2018.
[14] P. Wang, W. Li, P. Ogunbona, J. Wan, and S. Escalera, "RGB-D-based human motion recognition with deep learning: a survey," Computer Vision and Image Understanding, vol. 171, pp. 118–139, 2018.
[15] H. Rahmani, A. Mian, and M. Shah, "Learning a deep model for human action recognition from novel viewpoints," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 3, pp. 667–681, 2018.
[16] C. Li, Y. Hou, P. Wang, and W. Li, "Multiview-based 3-D action recognition using deep networks," IEEE Transactions on Human-Machine Systems, vol. 49, no. 1, pp. 95–104, 2019.
[17] R. Xiao, Y. Hou, Z. Guo, C. Li, P. Wang, and W. Li, "Self-attention guided deep features for action recognition," in Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME), pp. 1060–1065, Shanghai, China, July 2019.
[18] C. Li, Y. Hou, W. Li, and P. Wang, "Learning attentive dynamic maps (ADMs) for understanding human actions," Journal of Visual Communication and Image Representation, vol. 65, Article ID 102640, 2019.
[19] C. Tang, W. Li, P. Wang, and L. Wang, "Online human action recognition based on incremental learning of weighted covariance descriptors," Information Sciences, vol. 467, pp. 219–237, 2018.
[20] A. Edison and C. V. Jiji, "Automated video analysis for action recognition using descriptors derived from optical acceleration," Signal, Image and Video Processing, vol. 13, no. 5, pp. 915–922, 2019.
[21] B. Fernando, E. Gavves, M. J. Oramas, A. Ghodrati, and T. Tuytelaars, "Rank pooling for action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 773–787, 2017.
[22] P. Wang, L. Liu, C. Shen, and H. T. Shen, "Order-aware convolutional pooling for video based action recognition," Pattern Recognition, vol. 91, pp. 357–365, 2019.
[23] J. Hu, W. Zheng, L. Ma, G. Wang, J. Lai, and J. Zhang, "Early action prediction by soft regression," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 11, pp. 2568–2586, 2019.
[24] X. Yang and Y. L. Tian, "Action recognition using super sparse coding vector with spatio-temporal awareness," in Computer Vision–ECCV 2014, pp. 727–741, Springer International Publishing, Berlin, Germany, 2014.
[25] J.-F. Hu, W.-S. Zheng, J. Lai, S. Gong, and T. Xiang, "Exemplar-based recognition of human-object interactions," IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 4, pp. 647–660, 2016.
[26] J. Zhu, B. Wang, X. Yang, W. Zhang, and Z. Tu, "Action recognition with actons," in Proceedings of the 2013 IEEE International Conference on Computer Vision, pp. 3559–3566, Sydney, Australia, December 2013.
[27] H. Wang and C. Schmid, "Action recognition with improved trajectories," in Proceedings of the 2013 IEEE International Conference on Computer Vision, pp. 3551–3558, Sydney, Australia, December 2013.
[28] W. Hao and Z. Zhang, "Spatiotemporal distilled dense-connectivity network for video action recognition," Pattern Recognition, vol. 92, pp. 13–24, 2019.
[29] M. Ramanathan, W.-Y. Yau, and E. K. Teoh, "Human action recognition with video data: research and evaluation challenges," IEEE Transactions on Human-Machine Systems, vol. 44, no. 5, pp. 650–663, 2014.
[30] D. Gowsikhaa, S. Abirami, and R. Baskaran, "Automated human behavior analysis from surveillance videos: a survey," Artificial Intelligence Review, vol. 42, no. 4, pp. 747–765, 2014.
[31] Y. Fu, Human Activity Recognition and Prediction, Springer, Berlin, Germany, 2016.
[32] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, "OpenPose: realtime multi-person 2D pose estimation using part affinity fields," 2018, https://arxiv.org/abs/1812.08008.
[33] Z. Cao, T. Simon, S. E. Wei, and Y. Sheikh, "Realtime multi-person 2D pose estimation using part affinity fields," in Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), pp. 1302–1310, Honolulu, HI, USA, July 2017.
[34] MSCOCO keypoint evaluation metric, http://cocodataset.org/#keypoints-eval.
[35] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: a local SVM approach," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), vol. 3, pp. 32–36, Cambridge, UK, September 2004.
[36] Y. Zhu, W. Chen, and G. Guo, "Fusing spatiotemporal features and joints for 3D action recognition," in Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 486–491, Portland, OR, USA, June 2013.
[37] W. Li, Z. Zhang, and Z. Liu, "Action recognition based on a bag of 3D points," in Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 9–14, San Francisco, CA, USA, June 2010.
[38] C. Schuldt, B. Caputo, C. Sch, and L. Barbara, "Recognizing human actions: a local SVM approach," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), vol. 3, pp. 3–7, Cambridge, UK, September 2004.
[39] M. Yang, F. Lv, W. Xu, K. Yu, and Y. Gong, "Human action detection by boosting efficient motion features," in Proceedings of the 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops), pp. 522–529, Kyoto, Japan, September 2009.
[40] L. Xia, C. C. Chen, and J. K. Aggarwal, "View invariant human action recognition using histograms of 3D joints," in Proceedings of the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 20–27, IEEE, Providence, RI, USA, June 2012.
[41] Y. Zhu, W. Chen, and G. Guo, "Fusing spatiotemporal features and joints for 3D action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Portland, OR, USA, June 2013.
[42] X. Yang and Y. Tian, "Effective 3D action recognition using EigenJoints," Journal of Visual Communication and Image Representation, vol. 25, no. 1, pp. 2–11, 2014.
[43] E. Ohn-Bar and M. M. Trivedi, "Joint angles similarities and HOG2 for action recognition," in Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW '13), pp. 465–470, IEEE Computer Society, Washington, DC, USA, June 2013.
[44] R. Vemulapalli, F. Arrate, and R. Chellappa, "Human action recognition by representing 3D skeletons as points in a Lie group," in Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 588–595, Columbus, OH, USA, June 2014.
[45] L. I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, John Wiley & Sons, Hoboken, NJ, USA, 2004.
