
arXiv:1608.06338v2 [cs.CV] 12 Sep 2016

Large-scale Continuous Gesture Recognition Using Convolutional Neural Networks

Pichao Wang, Wanqing Li, Song Liu, Yuyao Zhang, Zhimin Gao and Philip Ogunbona

Advanced Multimedia Research Lab, University of Wollongong, Australia

[email protected],{wanqing, songl}@uow.edu.au,{yz606, [email protected]}, [email protected]

Abstract—This paper addresses the problem of continuous gesture recognition from sequences of depth maps using Convolutional Neural Networks (ConvNets). The proposed method first segments individual gestures from a depth sequence based on quantity of movement (QOM). For each segmented gesture, an Improved Depth Motion Map (IDMM), which converts the depth sequence into one image, is constructed and fed to a ConvNet for recognition. The IDMM effectively encodes both spatial and temporal information and allows fine-tuning of existing ConvNet models for classification without introducing millions of parameters to learn. The proposed method was evaluated on the Large-scale Continuous Gesture Recognition task of the ChaLearn Looking at People (LAP) challenge 2016. It achieved a performance of 0.2655 (Mean Jaccard Index) and ranked 3rd in this challenge.

Index Terms—gesture recognition; depth map sequence; convolutional neural network; depth motion map

I. INTRODUCTION

Gesture and human action recognition from visual information is an active research topic in Computer Vision and Machine Learning. It has many potential applications, including human-computer interaction and robotics. Since the first work [1] on action recognition from depth data captured by commodity depth sensors (e.g., Kinect) in 2010, many methods [2], [3], [4], [5], [6], [7], [8], [9] have been proposed based on specific hand-crafted feature descriptors extracted from depth or skeleton data. With the recent development of deep learning, a few methods have also been developed based on Convolutional Neural Networks (ConvNets) [10], [11], [12], [13], [14] and Recurrent Neural Networks (RNNs) [15], [16], [17], [18]. However, most of the work on gesture/action recognition reported to date focuses on the classification of individual gestures or actions, assuming that instances of individual gestures and actions have been isolated or segmented from a video or a stream of depth maps/skeletons before classification. In continuous recognition, the input stream usually contains an unknown number of gestures or actions, in unknown order and with unknown boundaries, and both segmentation and recognition have to be solved at the same time.

There are three common approaches to continuous recognition. The first approach is to tackle the segmentation and classification of the gestures or actions separately and sequentially. The key advantages of this approach are that different features can be used for segmentation and classification, and existing classification methods can be leveraged. The disadvantages are that both segmentation and classification can become the bottleneck of the system and they cannot be optimized jointly. The second approach is to apply classification to a sliding temporal window and aggregate the window-based classifications to achieve the final segmentation and classification. One of the key challenges in this approach is setting the size of the sliding window, because the durations of different gestures or actions can vary significantly. The third approach is to perform segmentation and classification simultaneously.

This paper adopts the first approach and focuses on robust classification using ConvNets that are insensitive to inaccurate segmentation of gestures. Specifically, individual gestures are first segmented based on quantity of movement (QOM) [19] from a stream of depth maps. For each segmented gesture, an Improved Depth Motion Map (IDMM) is constructed from its sequence of depth maps. ConvNets are used to learn the dynamic features from the IDMM for effective classification. Fig. 1 shows the framework of the proposed method.

The rest of this paper is organized as follows. Section II briefly reviews related work on video segmentation and gesture/action recognition based on depth data and deep learning. Details of the proposed method are presented in Section III. Experimental results on the dataset provided by the challenge are reported in Section IV, and Section V concludes the paper.

II. RELATED WORK

A. Video Segmentation

There are many methods proposed for segmenting individual gestures from a video. The popular and widely used method employs dynamic time warping (DTW) to decide the delimitation frames of individual gestures [20], [21], [22]. Difference images are first obtained by subtracting two consecutive grey-scale images, and each difference image is partitioned into a grid of 3 × 3 cells. Each cell is then represented by the average value of the pixels within this cell. The matrix of the cells in a difference image is flattened


[Fig. 1 diagram: input video → video segmentation → IDMM → pseudo-coloring → ConvNet (conv1–conv5, fc6–fc8; feature maps/units: 96, 256, 384, 384, 256, 4096, 4096) → class labels]

Fig. 1: The framework of the proposed method.

as a vector, called the motion feature, and calculated for each frame in the video excluding the final frame. This results in a 9 × (K − 1) matrix of motion features for a video with K frames. The motion feature matrix is extracted from both the test video and the training video, which consists of multiple gestures. The two matrices are treated as two temporal sequences, with each motion feature as a feature vector at an instant of time. The distance between two feature vectors is defined as the negative Euclidean distance, and a matrix containing DTW distances (measuring similarity between the two temporal sequences) is then calculated and analysed by the Viterbi algorithm [23] to segment the gestures.
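As a rough illustration of these DTW-based methods, the sketch below computes the 9 × (K − 1) motion-feature matrix described above; frames are assumed to be grayscale NumPy arrays, taking the absolute difference is our reading of the text, the function name is ours, and the subsequent DTW/Viterbi alignment is omitted.

```python
import numpy as np

def motion_features(frames):
    """Per-frame motion features for DTW-based segmentation (sketch).

    Each consecutive pair of grayscale frames yields a difference image,
    which is split into a 3 x 3 grid; every cell is summarised by its mean,
    giving a 9-dim feature per transition and a 9 x (K-1) matrix overall."""
    feats = []
    for prev, curr in zip(frames[:-1], frames[1:]):
        diff = np.abs(curr.astype(np.float32) - prev.astype(np.float32))
        h, w = diff.shape
        cells = [diff[i * h // 3:(i + 1) * h // 3,
                      j * w // 3:(j + 1) * w // 3].mean()
                 for i in range(3) for j in range(3)]
        feats.append(cells)
    return np.asarray(feats).T  # shape (9, K-1)
```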

Another category of gesture segmentation methods for multi-gesture videos is based on appearance. Under the general assumption that the start and end frames of adjacent gestures are similar, correlation coefficients [24] and a K-nearest neighbour algorithm with histograms of oriented gradients (HOG) [25] are used to identify the start and end frames of gestures. Jiang et al. [19] proposed a method based on quantity of movement (QOM) by assuming the same start pose among different gestures. Candidate delimitation frames are chosen based on the global QOM. After a refining stage, which employs a sliding window to keep the frame with minimum QOM in each windowing session, the start and end frames are assumed to be the remaining frames. This paper adopts the QOM-based method; its details are presented in Section III-A.

B. Depth Based Action Recognition

With Microsoft Kinect sensors, researchers have developed methods for depth map-based action recognition. Li et al. [1] sampled points from a depth map to obtain a bag of 3D points to encode spatial information and employed an expandable graphical model to encode temporal information [26]. Yang et al. [3] stacked differences between projected depth maps as a depth motion map (DMM) and then used HOG to extract relevant features from the DMM. This method transforms the problem of action recognition from spatio-temporal space to spatial space. In [4], a feature called Histogram of Oriented 4D Normals (HON4D) was proposed; the surface normal is extended to 4D space and quantized by regular polychorons. Following this method, Yang and Tian [5] clustered hypersurface normals to form the polynormal, which can be used to jointly capture local motion and geometry information. A Super Normal Vector (SNV) is generated by aggregating the low-level polynormals. In [7], a fast binary range-sample feature was proposed based on a test statistic, with a carefully designed sampling scheme that excludes most pixels falling into the background and incorporates spatio-temporal cues.

C. Deep Learning Based Recognition

Existing deep learning approaches can be generally divided into four categories based on how the video is represented and fed to a deep neural network. The first category views a video either as a set of still images [27] or as a short and smooth transition between similar frames [28]; each color channel of the images is fed to one channel of a ConvNet. Although obviously suboptimal, considering the video as a bag of static frames performs reasonably well. The second category represents a video as a volume and extends ConvNets to a third, temporal dimension [29], [30], replacing 2D filters with 3D equivalents. So far, this approach has produced little benefit, probably due to the lack of annotated training data. The third category treats a video as a sequence of images and feeds the sequence to an RNN [31], [15], [16], [17], [18]. An RNN is typically built from memory cells, which are sensitive to both short- and long-term patterns; it parses the video frames sequentially and encodes the frame-level information in its memory. However, using RNNs did not give an improvement over temporal pooling of convolutional features [27] or over hand-crafted features. The last category represents a video as one or multiple compact images and adopts available pre-trained ConvNet architectures for fine-tuning [10], [11], [12], [32]. This category has achieved state-of-the-art results for action recognition on many RGB and depth/skeleton datasets. The proposed gesture classification in this paper falls into the last category.

III. PROPOSED METHOD

The proposed method consists of two major components, as illustrated in Fig. 1: video segmentation and construction of an Improved Depth Motion Map (IDMM) from a sequence of depth maps as the input to a ConvNet. Given a sequence of depth maps consisting of multiple gestures, the start and end frames of each gesture are identified based on quantity of movement (QOM) [19]. Then, one IDMM is constructed by accumulating the absolute depth difference between the current frame and the start frame of each gesture segment. The IDMM goes through a pseudo-color coding process to become a pseudo-RGB image, which serves as the input to a ConvNet for classification. The main objective of pseudo-color coding is to enhance the motion pattern captured by the IDMM. In the rest of this section, video segmentation, construction of the IDMM, pseudo-color coding of the IDMM, and training of the ConvNet are explained in detail.

A. Video Segmentation

Given a sequence of depth maps that contains multiple gestures, the start and end frames of each gesture are detected based on quantity of movement (QOM) [19] by assuming that all gestures start from a similar pose, referred to as the neutral pose. The QOM between two frames is a binary image obtained by applying f(·) pixel-wise to the difference image of the two depth maps, where f(·) is a step function from 0 to 1 at the ad hoc threshold of 60. The global QOM of a frame at time t is defined as the QOM between frame t and the very first frame of the whole video sequence. A set of candidate delimitation frames is initialised by choosing frames whose global QOM is lower than a threshold. The threshold is calculated by adding the mean to twice the standard deviation of the global QOMs extracted from the first and last 12.5% of the averaged gesture sequence length L, which is calculated from the training gestures. A sliding window of size L/2 is then used to refine the candidate set; in each windowing session, only the index of the frame with the minimum global QOM is retained. After the refinement, the remaining frames are taken as the delimiting frames of the gestures.
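A minimal NumPy sketch of this segmentation step is given below. It assumes the per-frame global QOM is the count of pixels whose depth change from the first frame exceeds 60, that the near-neutral statistics are taken from the first and last 0.125·L frames of the test video, and that the L/2 refinement uses non-overlapping windows; these readings of the text, and all function names, are ours.

```python
import numpy as np

def global_qom(depth_seq, thresh=60):
    """Global QOM per frame: number of pixels whose absolute depth difference
    from the first (neutral) frame exceeds `thresh` (pixel count is assumed)."""
    ref = depth_seq[0].astype(np.int32)
    diff = np.abs(depth_seq.astype(np.int32) - ref)
    return (diff > thresh).reshape(len(depth_seq), -1).sum(axis=1)

def segment_boundaries(depth_seq, L):
    """Sketch of the QOM-based delimitation-frame selection.
    `L` is the average gesture length estimated from the training set."""
    qom = global_qom(depth_seq)
    k = max(1, int(round(0.125 * L)))
    ends = np.concatenate([qom[:k], qom[-k:]])      # near-neutral frames
    thr = ends.mean() + 2 * ends.std()              # mean + 2 * std
    candidates = np.where(qom < thr)[0]
    boundaries, win = [], max(1, int(L // 2))
    for start in range(0, len(depth_seq), win):     # L/2 refinement windows
        in_win = candidates[(candidates >= start) & (candidates < start + win)]
        if len(in_win):                             # keep the minimum-QOM frame
            boundaries.append(int(in_win[np.argmin(qom[in_win])]))
    return boundaries
```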

B. Construction of IDMM

Unlike the Depth Motion Map (DMM) [3], which is calculated by accumulating the thresholded difference between consecutive frames, two extensions are proposed to construct an IDMM. First, the motion energy is calculated by accumulating the absolute difference between the current frame and the neutral pose frame. This measures slow motion better than the original DMM. Second, to preserve subtle motion information, the motion energy is calculated without thresholding. The calculation of the IDMM can be expressed as:

\mathrm{IDMM} = \sum_{i=1}^{N} \left| \text{depth of frame}_{i} - \text{depth of neutral frame} \right|          (1)

where i denotes the index of the frame and N represents the total number of frames in the segmented gesture. For simplicity, the first frame of each segment is taken as the neutral frame.
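A minimal sketch of Eq. (1), assuming the segmented gesture is available as an (N, H, W) NumPy array of depth maps (the function name is ours):

```python
import numpy as np

def construct_idmm(depth_seq):
    """Eq. (1): accumulate the absolute depth difference between every frame
    and the neutral (first) frame of the segment, without thresholding."""
    neutral = depth_seq[0].astype(np.float64)
    idmm = np.zeros_like(neutral)
    for frame in depth_seq:
        idmm += np.abs(frame.astype(np.float64) - neutral)
    return idmm
```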

C. Pseudocoloring

In their work, Abidi et al. [33] used color-coding to harness the perceptual capabilities of the human visual system and extracted more information from gray images; the detailed texture patterns in the image are significantly enhanced. Using this as motivation, it is proposed in this paper to code an IDMM into a pseudo-color image and effectively exploit/enhance the texture in the IDMM that corresponds to the motion patterns of actions. In this work, a power rainbow transform is adopted. For a given intensity I, the power rainbow transform encodes it into a normalized color (R∗, G∗, B∗) as follows:

R^{*} = \{(1 + \cos(\tfrac{4\pi}{3 \times 255} I))/2\}^{2}
G^{*} = \{(1 + \cos(\tfrac{4\pi}{3 \times 255} I - \tfrac{2\pi}{3}))/2\}^{2}
B^{*} = \{(1 + \cos(\tfrac{4\pi}{3 \times 255} I - \tfrac{4\pi}{3}))/2\}^{2},          (2)

where R∗, G∗ and B∗ are the normalized RGB values produced by the power rainbow transform. To code an IDMM, a linear mapping is used to convert the IDMM values to I ∈ [0, 255] within each image.
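A sketch of this coding step is shown below; it assumes the reconstructed phase offsets of 2π/3 and 4π/3 in Eq. (2), and the function name is ours.

```python
import numpy as np

def pseudocolor_idmm(idmm):
    """Linearly map the IDMM to I in [0, 255], then apply the power rainbow
    transform of Eq. (2) to obtain a normalized pseudo-RGB image."""
    rng = max(float(idmm.max() - idmm.min()), 1e-6)
    I = 255.0 * (idmm - idmm.min()) / rng
    w = 4.0 * np.pi / (3.0 * 255.0)
    r = ((1 + np.cos(w * I)) / 2.0) ** 2
    g = ((1 + np.cos(w * I - 2 * np.pi / 3)) / 2.0) ** 2
    b = ((1 + np.cos(w * I - 4 * np.pi / 3)) / 2.0) ** 2
    return np.stack([r, g, b], axis=-1)  # normalized RGB values in [0, 1]
```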

The resulting IDMMs are illustrated in Fig. 2.

D. Network Training & Classification

One ConvNet is trained on the pseudo-color coded IDMMs. The layer configuration of the ConvNet follows that in [34]: it contains eight layers, of which the first five are convolutional layers and the remaining three are fully-connected layers. The implementation is derived from the publicly available Caffe toolbox [35] and runs on one NVIDIA Tesla K40 GPU card.

The training procedure is similar to that in [34]. The network weights are learned using mini-batch stochastic gradient descent with the momentum set to 0.9 and weight decay set to 0.0005. All hidden weight layers use the rectified linear unit (ReLU) activation function. At each iteration, a mini-batch of 256 samples is constructed by sampling 256 shuffled training color-coded IDMMs. All color-coded IDMMs are resized to 256 × 256. The learning rate for fine-tuning is set to 10^{-3} with models pre-trained on ILSVRC-2012, and is then decreased according to a fixed schedule, which is kept the same for all training sets. The training undergoes 20K iterations and the learning rate decreases every 5K iterations. For all experiments, the dropout regularisation ratio was set to


TABLE I: Information of the ChaLearn LAP ConGD Dataset.

Sets        # of labels  # of gestures  # of RGB videos  # of depth videos  # of subjects  label provided  temporal segment provided
Training    249          30442          14134            14134              17             Yes             Yes
Validation  249          8889           4179             4179               2              No              No
Testing     249          8602           4042             4042               2              No              No
All         249          47933          22535            22535              21             -               -

Fig. 2: Samples of IDMMs for six different gestures. The labels from top left to bottom right are: Mudra1/Ardhachandra; Mudra1/Ardhapataka; Mudra1/Chandrakala; Mudra1/Chatura; Mudra1/Kartarimukha; Mudra1/Pataka.

0.5 in order to reduce complex co-adaptations of neurons in the network.
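The hyperparameters stated above can be summarised as in the sketch below; the decay factor applied every 5K iterations is not given in the text, so the value of 0.1 is an assumption, as is the dictionary layout itself.

```python
# Hyperparameters as stated in the text; gamma (the per-step decay factor)
# is not given in the paper, so 0.1 is an assumed value.
solver = {
    "base_lr": 1e-3,          # fine-tuning learning rate
    "stepsize": 5000,         # decrease the learning rate every 5K iterations
    "gamma": 0.1,             # assumed decay factor
    "max_iter": 20000,        # 20K iterations in total
    "momentum": 0.9,
    "weight_decay": 0.0005,
    "batch_size": 256,        # 256 shuffled color-coded IDMMs per mini-batch
    "input_size": (256, 256),
    "dropout": 0.5,
}

def learning_rate(iteration, s=solver):
    """Step schedule implied by the text: base_lr * gamma**(iteration // stepsize)."""
    return s["base_lr"] * s["gamma"] ** (iteration // s["stepsize"])
```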

Given a test depth sequence, a pseudo-colored IDMM is constructed for each segmented gesture and the trained ConvNet is used to predict the gesture label of the segment.
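Putting the pieces together, a hypothetical test-time pipeline could look like the sketch below; it reuses the helper functions sketched in the previous subsections and a caller-supplied `classify` wrapper around the trained ConvNet, all of which are our own names rather than the authors' code.

```python
def recognize_sequence(depth_seq, L, classify):
    """QOM segmentation -> IDMM -> pseudo-coloring -> ConvNet prediction."""
    cuts = segment_boundaries(depth_seq, L)   # delimitation frame indices
    results = []
    for start, end in zip(cuts[:-1], cuts[1:]):
        idmm = construct_idmm(depth_seq[start:end + 1])
        rgb = pseudocolor_idmm(idmm)
        results.append((start, end, classify(rgb)))
    return results  # [(start_frame, end_frame, predicted_label), ...]
```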

IV. EXPERIMENTS

In this section, the Large-scale Continuous Gesture Recognition Dataset of the ChaLearn LAP challenge 2016 (ChaLearn LAP ConGD Dataset) [36] and the evaluation protocol are described. The experimental results of the proposed method on this dataset are reported and compared with the baselines recommended by the challenge organisers.

A. Dataset

The ChaLearn LAP ConGD Dataset is derived from the ChaLearn Gesture Dataset (CGD) [37]. It has 47933 RGB-D gesture instances in 22535 RGB-D gesture videos. Each RGB-D video may contain one or more gestures. There are 249 gestures performed by 21 different individuals. The detailed information of this dataset is shown in Table I. In this paper, only depth data was used in the proposed method. Some samples of depth maps are shown in Fig. 3.

B. Evaluation Protocol

The dataset was divided into training, validation and test sets by the challenge organizers. All three sets include data from different subjects, and the gestures of a subject in the validation and test sets do not appear in the training set.

The Jaccard index (the higher the better) is adopted to measure the performance. The Jaccard index measures the average relative overlap between the true and predicted sequences of frames for a given gesture. For a sequence s, let G_{s,i} and P_{s,i} be binary indicator vectors in which 1-values correspond to frames in which the i-th gesture label is being performed. The Jaccard index for the i-th class is defined for the sequence s as:

J_{s,i} = \frac{G_{s,i} \cap P_{s,i}}{G_{s,i} \cup P_{s,i}},          (3)

where G_{s,i} is the ground truth of the i-th gesture label in sequence s, and P_{s,i} is the prediction for the i-th label in sequence s.

When G_{s,i} and P_{s,i} are both empty, J_{s,i} is defined to be 0. Then, for a sequence s with l_s true labels, the Jaccard index J_s is calculated as:

J_{s} = \frac{1}{l_{s}} \sum_{i=1}^{L} J_{s,i}.          (4)

For all test sequences S = s_1, \ldots, s_n with n gestures, the mean Jaccard index J_{S} is used as the evaluation criterion and is calculated as:

J_{S} = \frac{1}{n} \sum_{j=1}^{n} J_{s_{j}}.          (5)
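As a hedged illustration of Eqs. (3)–(5), the sketch below assumes each sequence's ground truth G and prediction P are stored as binary arrays of shape (num_classes, num_frames); this representation and the function name are ours.

```python
import numpy as np

def mean_jaccard_index(gt_sequences, pred_sequences):
    """Mean Jaccard index over sequences, following Eqs. (3)-(5)."""
    per_sequence = []
    for G, P in zip(gt_sequences, pred_sequences):
        l_s = max(int(G.any(axis=1).sum()), 1)            # number of true labels
        classes = np.where(G.any(axis=1) | P.any(axis=1))[0]
        j_s = 0.0
        for i in classes:                                 # other classes add 0
            inter = np.logical_and(G[i], P[i]).sum()
            union = np.logical_or(G[i], P[i]).sum()
            j_s += inter / union if union else 0.0        # Eq. (3)
        per_sequence.append(j_s / l_s)                    # Eq. (4)
    return float(np.mean(per_sequence))                   # Eq. (5)
```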


Fig. 3: Sample depth maps from three sequences, each containing 5 different gestures. Each row corresponds to one depth video sequence. The labels from top left to bottom right are: (a) CraneHandSignals/EverythingSlow; (b) RefereeVolleyballSignals2/BallServedIntoNetPlayerTouchingNet; (c) GestunoDisaster/110earthquaketremblementdeterre; (d) DivingSignals2/You; (e) RefereeVolleyballSignals2/BallServedIntoNetPlayerTouchingNet; (f) Mudra2/Sandamsha; (g) Mudra2/Sandamsha; (h) DivingSignals2/CannotOpenReserve; (i) GestunoTopography/95region region; (j) DivingSignals2/Meet; (k) RefereeVolleyballSignals1/Timeout; (l) SwatHandSignals1/DogNeeded; (m) DivingSignals2/ReserveOpened; (n) DivingSignals1/ComeHere; (o) DivingSignals1/Watch, SwatHandSignals2/LookSearch.

C. Experimental Results

The results of the proposed method on the validation and test sets, and their comparison to the results of the baseline methods [38] (MFSK and MFSK+DeepID), are shown in Table II. The code and models can be downloaded from the author's homepage: https://sites.google.com/site/pichaossites/.

TABLE II: Accuracies of the proposed method and baseline methods on the ChaLearn LAP ConGD Dataset.

Method            Set          Mean Jaccard Index J_S
MFSK              Validation   0.0918
MFSK+DeepID       Validation   0.0902
Proposed Method   Validation   0.2403
MFSK              Testing      0.1464
MFSK+DeepID       Testing      0.1435
Proposed Method   Testing      0.2655

The results show that the proposed method significantly outperformed the baseline methods, even though only a single modality, i.e. depth data, was used, while the baseline methods used both RGB and depth videos.

The results of the first three winners are summarized in Table III. Our method is among the top performers and its recognition rate is very close to the best performance of this challenge (0.265506 vs. 0.269235 and 0.286915), even though only depth data was used by the proposed method. Regarding computational cost, our implementation is based on CUDA 7.5 and Matlab 2015b, and it takes about 0.8 s to process one depth sequence at test time on a workstation equipped with an 8-core CPU, 64 GB RAM, and a Tesla K40 GPU.

TABLE III: Comparison of the performances of the first three winners of this challenge. Our team ranked third in the ICPR ChaLearn LAP challenge 2016.

Rank  Team            Mean Jaccard Index J_S
1     ICT NHCI [39]   0.286915
2     TARDIS [40]     0.269235
3     AMRL (ours)     0.265506

V. CONCLUSIONS

This paper presents an effective yet simple method for continuous gesture recognition using only depth map sequences. Depth sequences are first segmented so that each segment contains only one gesture, and a ConvNet is used for feature extraction and classification. The proposed construction of the IDMM enables the use of available pre-trained models for fine-tuning without learning afresh. Experimental results on the ChaLearn LAP ConGD Dataset verified the effectiveness of the proposed method. How to extract the neutral pose exactly and how to fuse different modalities to improve the accuracy will be our future work.

ACKNOWLEDGMENT

The authors would like to thank NVIDIA Corporation for the donation of a Tesla K40 GPU card used in this challenge.

REFERENCES

[1] W. Li, Z. Zhang, and Z. Liu, "Action recognition based on a bag of 3D points," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2010, pp. 9–14.

[2] J. Wang, Z. Liu, Y. Wu, and J. Yuan, "Mining actionlet ensemble for action recognition with depth cameras," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 1290–1297.

[3] X. Yang, C. Zhang, and Y. Tian, "Recognizing actions using depth motion maps-based histograms of oriented gradients," in Proc. ACM International Conference on Multimedia (ACM MM), 2012, pp. 1057–1060.

[4] O. Oreifej and Z. Liu, "HON4D: Histogram of oriented 4D normals for activity recognition from depth sequences," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 716–723.

[5] X. Yang and Y. Tian, "Super normal vector for activity recognition using depth sequences," in Proc. IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 804–811.

[6] P. Wang, W. Li, P. Ogunbona, Z. Gao, and H. Zhang, "Mining mid-level features for action recognition based on effective skeleton representation," in Proc. International Conference on Digital Image Computing: Techniques and Applications (DICTA), 2014, pp. 1–8.

[7] C. Lu, J. Jia, and C.-K. Tang, "Range-sample depth feature for action recognition," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 772–779.

[8] R. Vemulapalli and R. Chellappa, "Rolling rotations for recognizing human actions from 3D skeletal data," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1–9.

[9] J. Zhang, W. Li, P. O. Ogunbona, P. Wang, and C. Tang, "RGB-D-based action recognition datasets: A survey," Pattern Recognition, vol. 60, pp. 86–105, 2016.

[10] P. Wang, W. Li, Z. Gao, C. Tang, J. Zhang, and P. O. Ogunbona, "ConvNets-based action recognition from depth maps through virtual cameras and pseudocoloring," in Proc. ACM International Conference on Multimedia (ACM MM), 2015, pp. 1119–1122.

[11] P. Wang, W. Li, Z. Gao, J. Zhang, C. Tang, and P. Ogunbona, "Action recognition from depth maps using deep convolutional neural networks," IEEE Transactions on Human-Machine Systems, vol. 46, no. 4, pp. 498–509, 2016.

[12] P. Wang, Z. Li, Y. Hou, and W. Li, "Action recognition based on joint trajectory maps using convolutional neural networks," in Proc. ACM International Conference on Multimedia (ACM MM), 2016, pp. 1–5.

[13] P. Wang, W. Li, S. Liu, Z. Gao, C. Tang, and P. Ogunbona, "Large-scale isolated gesture recognition using convolutional neural networks," in Proceedings of ICPRW, 2016.

[14] Y. Hou, Z. Li, P. Wang, and W. Li, "Skeleton optical spectra based action recognition using convolutional neural networks," IEEE Transactions on Circuits and Systems for Video Technology, 2016, pp. 1–5.

[15] Y. Du, W. Wang, and L. Wang, "Hierarchical recurrent neural network for skeleton based action recognition," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1110–1118.

[16] V. Veeriah, N. Zhuang, and G.-J. Qi, "Differential recurrent neural networks for action recognition," in Proc. IEEE International Conference on Computer Vision (ICCV), 2015, pp. 4041–4049.

[17] W. Zhu, C. Lan, J. Xing, W. Zeng, Y. Li, L. Shen, and X. Xie, "Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks," in The 30th AAAI Conference on Artificial Intelligence (AAAI), 2016.

[18] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, "NTU RGB+D: A large scale dataset for 3D human activity analysis," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[19] F. Jiang, S. Zhang, S. Wu, Y. Gao, and D. Zhao, "Multi-layered gesture recognition with Kinect," Journal of Machine Learning Research, vol. 16, no. 2, pp. 227–254, 2015.

[20] J. Wan, Q. Ruan, W. Li, and S. Deng, "One-shot learning gesture recognition from RGB-D data using bag of features," Journal of Machine Learning Research, vol. 14, pp. 2549–2582, 2013.

[21] J. Wan, V. Athitsos, P. Jangyodsuk, H. J. Escalante, Q. Ruan, and I. Guyon, "CSMMI: Class-specific maximization of mutual information for action and gesture recognition," IEEE Transactions on Image Processing, vol. 23, no. 7, pp. 3152–3165, July 2014.

[22] H. J. Escalante, I. Guyon, V. Athitsos, P. Jangyodsuk, and J. Wan, "Principal motion components for one-shot gesture recognition," Pattern Analysis and Applications, pp. 1–16, 2015.

[23] G. D. Forney, "The Viterbi algorithm," Proceedings of the IEEE, vol. 61, no. 3, pp. 268–278, March 1973.

[24] Y. M. Lui, "Human gesture recognition on product manifolds," Journal of Machine Learning Research, vol. 13, no. 1, pp. 3297–3321, Nov 2012.

[25] D. Wu, F. Zhu, and L. Shao, "One shot learning gesture recognition from RGBD images," in 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, June 2012, pp. 7–12.

[26] W. Li, Z. Zhang, and Z. Liu, "Expandable data-driven graphical modeling of human actions based on salient postures," IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 11, pp. 1499–1510, 2008.

[27] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, "Beyond short snippets: Deep networks for video classification," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 4694–4702.

[28] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in Proc. Annual Conference on Neural Information Processing Systems (NIPS), 2014, pp. 568–576.

[29] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221–231, 2013.

[30] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," in Proc. IEEE International Conference on Computer Vision (ICCV), 2015, pp. 4489–4497.

[31] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, "Long-term recurrent convolutional networks for visual recognition and description," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 2625–2634.

[32] H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and S. Gould, "Dynamic image networks for action recognition," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[33] B. R. Abidi, Y. Zheng, A. V. Gribok, and M. A. Abidi, "Improving weapon detection in single energy X-ray images through pseudocoloring," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 36, no. 6, pp. 784–796, 2006.

[34] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Annual Conference on Neural Information Processing Systems (NIPS), 2012, pp. 1106–1114.

[35] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proc. ACM International Conference on Multimedia (ACM MM), 2014, pp. 675–678.

[36] H. J. Escalante, V. Ponce-López, J. Wan, M. A. Riegler, B. Chen, A. Clapés, S. Escalera, I. Guyon, X. Baró, P. Halvorsen, H. Müller, and M. Larson, "ChaLearn joint contest on multimedia challenges beyond visual analysis: An overview," in Proceedings of ICPRW, 2016.

[37] I. Guyon, V. Athitsos, P. Jangyodsuk, and H. J. Escalante, "The ChaLearn gesture dataset (CGD 2011)," Machine Vision and Applications, vol. 25, no. 8, pp. 1929–1951, 2014.

[38] J. Wan, G. Guo, and S. Z. Li, "Explore efficient local features from RGB-D data for one-shot learning gesture recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 8, pp. 1626–1639, Aug 2016.

[39] X. Chai, Z. Liu, F. Yin, Z. Liu, and X. Chen, "Two streams recurrent neural networks for large-scale continuous gesture recognition," in Proceedings of ICPRW, 2016.

[40] N. C. Camgoz, S. Hadfield, O. Koller, and R. Bowden, "Using convolutional 3D neural networks for user-independent continuous gesture recognition," in Proceedings of ICPRW, 2016.