

UNIVERSITY OF OULU, P.O. Box 8000, FI-90014 UNIVERSITY OF OULU, FINLAND

ACTA UNIVERSITATIS OULUENSIS

University Lecturer Tuomo Glumoff

University Lecturer Santeri Palviainen

Senior research fellow Jari Juuti

Professor Olli Vuolteenaho

University Lecturer Veli-Matti Ulvinen

Planning Director Pertti Tikkanen

Professor Jari Juga

University Lecturer Anu Soikkeli

Professor Olli Vuolteenaho

Publications Editor Kirsti Nurkkala

ISBN 978-952-62-2200-4 (Paperback)
ISBN 978-952-62-2201-1 (PDF)
ISSN 0355-3213 (Print)
ISSN 1796-2226 (Online)

ACTA UNIVERSITATIS OULUENSIS
C Technica
OULU 2019
C 699

Xin Liu

HUMAN MOTION DETECTION AND GESTURE RECOGNITION USING COMPUTER VISION METHODS

UNIVERSITY OF OULU GRADUATE SCHOOL;
UNIVERSITY OF OULU, FACULTY OF INFORMATION TECHNOLOGY AND ELECTRICAL ENGINEERING;
UNIVERSITY OF OULU, INFOTECH OULU


ACTA UNIVERSITATIS OULUENSIS C Technica 699

XIN LIU

HUMAN MOTION DETECTION AND GESTURE RECOGNITION USING COMPUTER VISION METHODS

Academic dissertation to be presented with the assent of the Doctoral Training Committee of Information Technology and Electrical Engineering of the University of Oulu for public defence in the OP auditorium (L10), Linnanmaa, on 8 March 2019, at 12 noon

UNIVERSITY OF OULU, OULU 2019


Copyright © 2019
Acta Univ. Oul. C 699, 2019

Supervised by
Professor Guoying Zhao

Reviewed by
Professor Karen Egiazarian
Professor Stan Z. Li

ISBN 978-952-62-2200-4 (Paperback)
ISBN 978-952-62-2201-1 (PDF)

ISSN 0355-3213 (Printed)
ISSN 1796-2226 (Online)

Cover Design
Raimo Ahonen

JUVENES PRINT
TAMPERE 2019

Opponent
Docent Jorma Laaksonen


Liu, Xin, Human motion detection and gesture recognition using computer vision methods.
University of Oulu Graduate School; University of Oulu, Faculty of Information Technology and Electrical Engineering; University of Oulu, Infotech Oulu
Acta Univ. Oul. C 699, 2019
University of Oulu, P.O. Box 8000, FI-90014 University of Oulu, Finland

Abstract

Gestures are present in most daily human activities, and automatic gesture analysis is a significant research topic whose goal is to make the interaction between humans and computers as natural as the communication between humans. From a computer vision perspective, a gesture analysis system is typically composed of two stages: a low-level stage for human motion detection and a high-level stage for understanding human gestures. Therefore, this thesis contributes to the research on gesture analysis from two aspects: 1) Detection: human motion segmentation from video sequences, and 2) Understanding: gesture cue extraction and recognition.

In the first part of this thesis, two sparse signal recovery based human motion detection methods are presented. In real videos, the foreground (human motion) pixels are often not randomly distributed but have group properties in both the spatial and temporal domains. Based on this observation, a spatio-temporal group sparsity recovery model is proposed, which explicitly considers the foreground pixels' group clustering priors of spatial coherence and temporal contiguity. Moreover, a pixel should be considered as a multi-channel signal: if a pixel is equal to its adjacent ones, all three of its RGB coefficients should be equal as well. Motivated by this observation, a multi-channel fused Lasso regularizer is developed to explore the smoothness of multi-channel signals.

In the second part of this thesis, two human gesture recognition methods are presented to resolve the issue of temporal dynamics, which is crucial to the interpretation of the observed gestures. In the first study, a gesture skeletal sequence is characterized by a trajectory on a Riemannian manifold, and a time-warping invariant metric on the Riemannian manifold is proposed. Furthermore, a sparse coding scheme for skeletal trajectories is presented that explicitly considers the labelling information, with the aim of enforcing the discriminant validity of the dictionary. In the second work, based on the observation that a gesture is a time series with distinctly defined phases, a low-rank matrix decomposition model is proposed to build temporal compositions of gestures. In this way, a more appropriate alignment of hidden states for a hidden Markov model can be achieved.

Keywords: 3D skeleton, action recognition, background subtraction, human gesture recognition, manifold learning, motion detection, sparse recovery


Liu, Xin, Human motion detection and gesture recognition using computer vision methods (Finnish title: Ihmisen liikkeiden havaitseminen ja eleiden tunnistaminen konenäön menetelmiä hyödyntäen).
University of Oulu Graduate School; University of Oulu, Faculty of Information Technology and Electrical Engineering; University of Oulu, Infotech Oulu
Acta Univ. Oul. C 699, 2019
University of Oulu, P.O. Box 8000, FI-90014 University of Oulu

Tiivistelmä (Abstract in Finnish, translated)

Gestures are present in most everyday human activities. Automatic gesture analysis is needed to improve the interaction between devices and humans, with the goal of making this interaction as natural as the interaction between humans. From a computer vision perspective, a gesture analysis system consists of detecting human motion and recognizing gestures. This doctoral thesis advances gesture analysis research from two perspectives in particular: 1) Detection: segmenting human motion from a video sequence. 2) Understanding: extracting and recognizing gesture cues.

The first part of the thesis presents two motion detection methods based on sparse signal recovery. The foreground (human motion) pixels of a video are usually not randomly distributed but exhibit mutually dependent properties in the spatial and temporal domains. Based on this observation, a spatio-temporal sparse recovery model is presented, which incorporates the clustering of foreground pixels on the basis of spatial coherence and temporal continuity. In addition, a pixel is assumed to be a multi-channel signal (RGB colour values): when a pixel is similar to its neighbouring pixels, their colour channel values are also similar. Building on this observation, a channel-fusing Lasso regularizer was developed, which makes it possible to study the smoothness of a multi-channel signal.

The second part of the thesis presents two methods for recognizing human gestures. The methods can be used to resolve problems caused by the temporal dynamics of gestures (variation in the speed of gestures), which is of primary importance for interpreting observed gestures correctly. In the first method, a gesture is described as the trajectory of a skeleton model on a Riemannian manifold, using a metric that is tolerant to time warping. In addition, sparse coding of the skeleton-model trajectories is presented; the sparse coding is based on labelling information, with the aim of ensuring the discriminative independence of the coding dictionary. The second method starts from the observation that a gesture is a temporal sequence of clearly definable phases. A low-rank matrix decomposition model is proposed for combining the phases, so that the hidden states can be fitted better to a hidden Markov model.

Keywords: 3D skeleton model, sparse recovery, human gesture recognition, motion detection, manifold learning, background subtraction, action recognition


To my father Bofeng Liu


Acknowledgements

This work was carried out at the Center for Machine Vision and Signal Analysis (CMVS) and the Faculty of Information Technology and Electrical Engineering (ITEE) at the University of Oulu, Finland.

I would like to express my sincere gratitude to my supervisor Prof. Guoying Zhao for her supervision and tremendous support during my doctoral study. Her insightful and valuable advice has greatly broadened my horizons and deeply influenced my systematic thinking. I would also like to gratefully acknowledge Professor Matti Pietikäinen for offering me the opportunity to work in the research group. I would also like to thank Prof. Xilin Chen for constructive discussions on my research topics. His continuous encouragement played a critical role in my doctoral study.

I would like to thank Prof. Juho Kannala, Dr. Sami Huttunen, and Dr. Xiaopeng Hong, who served as my follow-up group members and provided a lot of valuable suggestions for my research and study. I am grateful to Dr. Vili-Petteri Kellokumpu, Dr. Li Liu, Dr. Wei Ke, Dr. Qing Liu, Dr. Xianbiao Qi, and Dr. Jiguang He for their useful discussions and comments on my doctoral research. I wish to express my appreciation to the co-authors, Dr. Jie Chen, Dr. Xiaobai Li, Dr. Jingang Shi, Dr. Jiawen Yao, Dr. Xiaohua Huang, Dr. Ziheng Zhou, Dr. Zhiyuan Zha, Dr. Yuan Zong, Ms. Yingyue Xu, Mr. Henglin Shi, and Mr. Haoyu Chen. I am grateful to all my CMVS colleagues for creating an inspiring and supportive atmosphere.

I am also grateful to Prof. Dacheng Tao for hosting me during my study in his research group at the University of Sydney, Australia.

I would like to gratefully acknowledge the thesis reviewers, Prof. Stan Z. Li and Prof. Karen Egiazarian. Their valuable suggestions and comments helped to improve the quality of the thesis significantly.

I would like to express my appreciation to Prof. Jorma Laaksonen for serving as the opponent in the defence.

The financial support provided by the Academy of Finland, Infotech Oulu, the Finnish Foundation for Technology Promotion (Tekniikan Edistämissäätiö), the Nokia Foundation, the Tauno Tönning Foundation, the Otto A. Malm Foundation, the Riitta ja Jorma J. Takanen Foundation, and the Endeavour Research Fellowship from the Australian Government Department of Education and Training is gratefully acknowledged.


I would like to express my deepest gratitude to my parents Bofeng Liu and Chunzhen Huang for their unconditional support over the years. Unfortunately, my beloved father could not see this thesis completed. I would also like to thank my little sister, Wen, for her constant support over the years. I would like to say thank you to my lovely children Zihan and Ziru. They have always been a source of joy for me. Last but not least, I would like to thank my wife Wenting Tao for her endless support and selfless love.

Oulu, November 2018


List of abbreviations

ALM Augmented Lagrange Multiplier

BP Basis Pursuit

BRMF Bayesian Robust Matrix Factorization

CNN Convolutional Neural Network

CoSaMP Compressive Sampling Matching Pursuit

CS Compressive Sensing

CRF Conditional Random Field

DBN Deep Belief Network

DGS Dynamic Group Sparsity

DTW Dynamic Time Warping

DMW Dynamic Manifold Warping

DP Dynamic Programming

FTP Fourier Temporal Pyramid

GMM Gaussian Mixture Model

HCI Human Computer Interaction

HMM Hidden Markov Models

HRI Human Robot Interaction

KNN K-Nearest-Neighbors

LDA Latent Dirichlet Allocation

LBD Low-rank and Block-sparse matrix Decomposition

LSD Low-rank and Structured sparsity Decomposition

LM3TL Latent Max-margin MultiTask Learning

LaMP Lattice Matching Pursuit


LSTM Long Short-Term Memory

MAP Maximum A Posteriori

MEMM Maximum Entropy Markov Model

MRF Markov Random Fields

MDL Minimum Description Length

NBNN Naive Bayes Nearest Neighbor

NP hard Non-deterministic Polynomial-time hard

OMP Orthogonal Matching Pursuit

PCP Principal Component Pursuit

ProxFlow Proximal Operator using Network Flow

RNN Recurrent Neural Network

RRV Rotation and Relative Velocity

ROI Region Of Interest

RPCA Robust Principal Component Analysis

SMIJ Sequence of Most Informative Joints

SP Subspace Pursuit

SPD Symmetric Positive Definite

SSR Sparse Signal Recovery

SRV Square Root Velocity

STGS Spatio-Temporal Group Sparsity

SVD Singular Value Decomposition

TV Total Variation

TSRVF Transported Square-Root Vector Field


List of original articles

This thesis is based on the following articles, which are referred to in the text by their Roman numerals (I–IV):

I Liu X., Yao J., Hong X., Huang X., Zhou Z., Qi C., & Zhao G. (2018). Background subtraction using spatio-temporal group sparsity recovery. IEEE Transactions on Circuits and Systems for Video Technology, 28(8), 1737–1751. IEEE.

II Liu X. & Zhao G. (2019). Background subtraction using multi-channel fused Lasso. In Proceedings of the IS&T 2019 International Symposium on Electronic Imaging (EI 2019). IS&T. Accepted for publication.

III Liu X. & Zhao G. (2019). 3D skeletal gesture recognition using sparse coding of time-warping invariant Riemannian trajectories. In Proceedings of the 2019 International Conference on Multimedia Modeling (MMM 2019), 678–690. Springer, Cham.

IV Liu X., Shi H., Hong X., Chen H., Tao D., & Zhao G. (2019). Hidden states exploration for 3D skeleton-based gesture recognition. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV 2019). IEEE. Accepted for publication.

The author of the dissertation is the first author in articles I-IV and had the main role in implementing the ideas, deriving the equations, and writing the papers. However, many ideas presented in the articles have been developed as teamwork. For article I, all the coding and paper writing work were done by the present author. The present author also carried out part of the evaluation experiments with the help of the second author Dr. Jiawen Yao. The role of the other co-authors was to provide guidance and comments during the research and writing process. For publications II and III, the present author had the major role in creating the ideas, conducting the experiments, and presenting all the results with discussions and conclusions, while valuable comments and suggestions were given by Prof. Guoying Zhao. For article IV, the present author played an important role in the creation of the ideas, as well as in the code implementation, and designed the structure and wrote the paper. The second author Mr. Henglin Shi greatly helped with implementing the code and conducting the experiments. The ideas were developed together with the co-authors, who gave advice and guidance throughout the work.


Contents

Abstract
Tiivistelmä
Acknowledgements
List of abbreviations
List of original articles
Contents
1 Introduction
1.1 Background
1.2 Contributions of the thesis
1.3 Summary of original articles
1.4 Organization of the thesis
2 Compressive sensing for motion detection
2.1 Introduction
2.2 Related methods
2.2.1 Sparse signal recovery
2.2.2 Robust principal component analysis
2.3 Spatio-temporal group sparsity recovery
2.3.1 Motivation
2.3.2 Spatio-temporal group sparsity
2.3.3 Two-pass framework for motion detection
2.4 Multi-channel fused sparsity recovery
2.4.1 Motivation
2.4.2 Multi-channel fused Lasso
2.4.3 Two-pass framework for motion detection
2.5 Experimental results and analysis
2.6 Conclusion
3 Riemannian trajectory analysis for gesture recognition
3.1 Introduction and motivation
3.2 Related methods
3.2.1 Local temporal modeling
3.2.2 Generative models
3.2.3 Recurrent neural network
3.2.4 Manifold space
3.3 Lie group based representation
3.4 TSRVF for Riemannian trajectory analysis
3.5 Sparse coding of 3D skeletal trajectories
3.6 Experimental results and analysis
3.7 Conclusion
4 Hidden states exploration for gesture recognition
4.1 Introduction and motivation
4.2 Hidden Markov models for gesture recognition
4.3 Lie algebra based representation
4.4 Low-rank decomposition for exploring gesture temporal structures
4.5 Hidden states learning via LSTM
4.6 Experimental results and analysis
4.7 Conclusion
5 Summary
5.1 Contributions
5.2 Limitations and future work
References
Original articles


1 Introduction

1.1 Background

Gestures are naturally performed by humans in most daily activities. This non-verbal behaviour is a fundamental aspect of human interaction, either complementing speech or substituting for spoken language in environments which require silent communication (or for people with hearing impairments). Gesture recognition (Lee & Kim, 1999; Mitra & Acharya, 2007; Morency, Quattoni, & Darrell, 2007; Rautaray & Agrawal, 2015; S. Wang, Quattoni, Morency, Demirdjian, & Darrell, 2006; Y. Wu & Huang, 1999) is a technique which enables humans to communicate with computers. It is a significant and central research topic in computer science and is widely utilized in Human Computer Interaction (HCI), Human Robot Interaction (HRI), gaming, security, and sports, as well as in sign language interpretation, artificial agents and companions, driver assistance, assisted living, and business negotiations. Some examples of applications are illustrated in Fig. 1.

Fig. 1. Applications of gesture recognition (example panels: Human Computer Interaction, Gaming, Driver assistance, Assisted living, Artificial human companion, Business negotiation).

The accuracy of gesture recognition can significantly affect the overall performance of the applications employing it. From a machine learning point of view, recognizing a gesture from a set of labelled gestures is a typical supervised learning problem. Although existing methods differ in their use of data (multi-modalities), feature representations and modelings, there are still a number of challenges that require careful discussion, including but not limited to the temporal dynamics of gestures (execution speed, duration, starting and ending points) of an individual performer or between performers, variabilities in geometry (subject size), incorporating context information, and learning from a limited set of training data. With the aim of dealing with the above issues of human gesture recognition, in this thesis we explore new solutions via computer vision methods.

As shown in Fig. 2, a standard gesture analysis system is composed of two stages: the low-level stage for region of interest (ROI) extraction, such as human motion detection, and the high-level stage for understanding human gestures. Therefore, this thesis concerns two core objectives: (1) Detection: human motion segmentation from video sequences, and (2) Recognition: gesture modeling and understanding.

Fig. 2. A standard pipeline for human gesture recognition: input video frames are passed to ROI extraction (e.g. human motion detection, 3D skeleton), and human gesture recognition is then performed on the extracted ROI to produce the output.

Objective 1. Gestures involve the motion of all parts of the body, not only the arms and hands. Thus, human motion detection is a step prior to gesture recognition. Although profound progress in human motion detection has been achieved, ensuring segmentation accuracy is a challenging task, due to numerous variables including lighting variations, background dynamics, crowded environments, noisy images, and camera movements. In the theory of compressive sensing (CS) (Candès, Romberg, & Tao, 2006; Donoho, 2006), only very few specific assumptions are made about the background of video sequences, and human motion (foreground) detection is formulated as a decomposition and optimization problem. CS can capture correlations between video frames, and can therefore deal with global variations in the background, such as illumination changes and dynamic background movements. In the preliminary methods using CS to represent the background, no prior knowledge of the spatial and temporal distribution of outliers was considered. Nevertheless, in real videos, the interesting foreground pixels are often not randomly distributed but have group properties of spatial coherence and temporal contiguity. Therefore, this thesis considers a CS-based model with group clustering to extract interesting objects.

Objective 2. The main challenges in gesture recognition are extracting invariant features, handling temporal dynamics, and modeling the context. Recently, due to the development of sensors such as the Kinect (Z. Zhang, 2012), the task of explicitly localizing the human body has been simplified from using classical RGB data to 3D skeletal data (Shotton et al., 2011). This thesis studies novel features (such as a manifold based representation) to represent a 3D human skeleton. The temporal dynamics of gestures are crucial for the interpretation of the observed behaviour, since a gesture performed even by the same person may have different execution rates, let alone when the gesture is performed by different subjects. Context information is also very important for time series modeling. However, relatively few approaches explicitly take into account the above two factors. Therefore, this thesis focuses on combining temporal dynamics and understanding the context to investigate new methods that describe gestures. Furthermore, deep learning methods such as the recurrent neural network (RNN) (Hochreiter & Schmidhuber, 1997) are studied in this thesis due to their power in modelling the context of sequential data, and learning RNN parameters from limited training data is considered.

1.2 Contributions of the thesis

Regarding the above-mentioned research objectives, the contributions of this thesis are mainly two-fold:

The first main contribution consists of two compressive sensing-based methods for motion detection. The motivation for this part is twofold: (1) most sparse signal recovery-based algorithms do not consider the group properties in both the spatial and temporal domains; (2) existing methods include few considerations of the homogeneity of the channels. When dealing with colour images, a typical option is to convert the RGB frame to grayscale, and another way is to apply sparse recovery to each channel independently. Regarding the first motivation, a greedy pursuit-based method is proposed, which takes into account the group properties of foreground (motion) signals in both the spatial and temporal domains. For the second motivation, a multi-channel fused Lasso regularizer is proposed to explicitly reconstruct multi-channel foreground signals with a spatial structure that reflects smooth changes along the group features. The performance of these methods is evaluated on the I2R (L. Li, Huang, Gu, & Tian, 2004) and CDnet 2012 (Goyette, Jodoin, Porikli, Konrad, & Ishwar, 2012) datasets. Experimental results on motion detection show the effectiveness of the two proposed methods.

The other main contribution of this thesis consists of two methods for gesture recognition. There are four motivations for this part of the thesis: (1) Existing 3D skeletal-based methods utilize absolute coordinates to represent human motion features, but gestures should be independent of the performer's location, and the features should be invariant to the body size of the performer. (2) Dynamic time warping (DTW)-based methods heavily depend on the metric used to measure the similarity of frames, while temporal dynamics can significantly distort the distance metric when comparing and identifying gestures. (3) In most hidden Markov model (HMM)-based methods, the input sequences have to be segmented beforehand on the basis of specific clustering metrics, or the sequences need to be divided equally using fixed anchors for obtaining hidden states, which makes it hard to depict the explicit temporal structure of gestures. (4) The RNN is a powerful framework for modeling sequential data, but it is still arduous to learn the information of an entire sequence with many sub-events. For the first motivation, a Lie group (manifold space)-based representation is introduced; this feature is independent of the performer's location and can explicitly model the 3D geometric relationships between body parts. For the second motivation, in order to identify gestures in a time-warping invariant manner, the transported square-root vector field is extended to the space of the Lie group to obtain a re-parametrization invariant metric. For the third motivation, a new formulation is proposed to build temporal compositions of gestures using a low-rank matrix decomposition. For the last motivation, a new gesture recognition framework is proposed by absorbing the powers of the HMM and RNN. Rather than modelling whole sequences (a gesture) within the RNN as existing methods did, we feed the network with hidden states that have shorter temporal lengths and more training samples. The performance of the two proposed methods is validated on the widely used ChaLearn 2014 gesture (Escalera et al., 2014) and MSR Action3D (W. Li, Zhang, & Liu, 2010) datasets.

1.3 Summary of original articles

Four articles (I-IV) were published (or accepted) (X. Liu et al., 2019, 2018; X. Liu & Zhao, 2019a, 2019b) according to the objectives described above. In addition to these articles, which comprise the main content of the thesis, other authored or co-authored publications (H. Chen, Liu, Li, Shi, & Zhao, 2019; H. Chen, Liu, & Zhao, 2018; X. Huang et al., 2017; X. Liu, Zhao, Yao, & Qi, 2015; H. Shi, Liu, Hong, & Zhao, 2018; J. Shi, Liu, Zong, Qi, & Zhao, 2018; Y. Xu, Hong, Liu, & Zhao, 2018; Y. Xu, Hong, Porikli, et al., 2018; Zha, Liu, Huang, et al., 2017; Zha, Liu, et al., 2018; Zha, Liu, Zhou, et al., 2017; Zha, Zhang, Wang, Bai, et al., 2018; Zha, Zhang, Wang, Tang, & Liu, 2018; Zha, Zhang, Wu, et al., 2018) are not included in the thesis. Paper I and paper II fall under the scope of the first objective concerning human motion detection. Paper III and paper IV belong to the scope of the second objective concerning human gesture recognition.

The main contributions and novelty of these methods (in I-IV) are highly related to the theory of sparsity recovery. In fact, the scientific research theme of this thesis is sparsity recovery for spatio-temporal sequence analysis. An overall diagram illustrating this research theme is shown in Fig. 3. The contents of each article are briefly summarized below.

Fig. 3. Scientific research theme of this thesis (with four papers I-IV): sparsity recovery for spatio-temporal sequence analysis, comprising spatio-temporal group sparsity recovery (I), multi-channel fused Lasso (II), discriminative sparse coding (III), and low-rank and column-block sparsity decomposition (IV).

Paper I proposes a spatio-temporal group sparsity recovery method for motion detection, which takes into account the group properties of the foreground signal in both the spatial and temporal domains. Moreover, a two-pass framework is proposed to meet the online processing requirement.

Paper II proposes a multi-channel fused Lasso regularizer to enforce a multi-channel foreground to be piece-wise constant at the group level, with adjacent groups equal on all the different channels at the same time. Furthermore, a two-pass framework is proposed to improve the accuracy of background subtraction. Firstly, a low-rank and sparse matrix decomposition is utilized on video slices along the X-T and Y-T planes. Secondly, a new sparse signal recovery method with the proposed regularizer is used to refine the foreground detection.

Paper III proposes a gesture recognition method. In this paper, a 3D human skeleton is represented as a point in the product space of the special orthogonal group SO(3), so that a human gesture sequence can be characterized as a trajectory in a Riemannian manifold space. To take re-parametrization invariance into account in the trajectory analysis, this paper generalizes the transported square-root vector field to obtain a time-warping invariant metric for comparing trajectories. Moreover, a sparse coding scheme for skeletal trajectories is proposed by explicitly considering the labelling information of each atom, in order to enforce the discriminant validity of the dictionary.

Paper IV focuses on studying HMM-based approaches to explore a more appropriate hidden-state alignment of skeletal action data. Gestures are explained as a sequence of separate sub-gestures or phases, each of which is associated with a video segment of unfixed length. Based on this observation, a low-rank and column-block sparsity matrix decomposition method is proposed to build temporal structures of gestures with semantically meaningful and discriminative concepts. Additionally, a novel skeleton-based recognition framework is proposed which integrates the powers of generative models (HMM) and deep recurrent neural networks (LSTM).

1.4 Organization of the thesis

This thesis is composed of two parts. The first part consists of five chapters that review the state of the art of human motion detection and gesture recognition, and present the motivation, ideas and results of the original papers (I-IV). The original papers (I-IV) form the second part of this thesis.

Chapter 1 briefly introduces the contents of this thesis. It presents the background to the topic and the objectives, summarizes the contributions and provides a summary of the original papers.

Chapter 2 describes the work on human motion detection, including a literature review of existing research, and two proposed methods for motion detection based on compressive sensing. This part includes the methods and experimental results that were originally presented in Paper I and Paper II.

Chapter 3 presents the work on human gesture recognition, including a literature review of existing research on 3D skeletal data, and the proposed method for gesture recognition using Riemannian trajectory analysis. The method and experimental results presented in this chapter are from Paper III.

Chapter 4 presents a method for gesture recognition via hidden states exploration. This part includes the method and experimental results that were originally presented in Paper IV.

Finally, conclusions are drawn in Chapter 5 with a discussion of the limitations and possible future research directions/extensions.


2 Compressive sensing for motion detection

Compressive sensing (CS) (Candès et al., 2006; Donoho, 2006) based methods, which make very few specific assumptions about the background of video sequences, have recently attracted wide attention in motion detection. Within the framework of compressive sensing, background subtraction is solved as a decomposition and optimization problem, where the foreground is typically modelled as pixel-wise sparse outliers. However, in real-world captured videos, the foreground pixels are often not randomly distributed but group clustered. In this chapter, we present our findings in Papers I and II on human motion detection under the framework of compressive sensing.

2.1 Introduction

Motion detection is a fundamental step in automatically detecting and tracking moving objects, with applications in intelligent video surveillance and human behaviour analysis. Background subtraction is a commonly used method for segmenting foreground in video sequences from static cameras. Its performance is mainly related to the algorithms utilized for background modeling.

Over the past few decades, a considerable number of algorithms have been explored. One of the most famous pixel-based approaches is the Gaussian Mixture Model (H. Lin, Chuang, & Liu, 2011; Stauffer & Grimson, 1999), which uses a mixture of Gaussian probability density functions to model colour intensity variations of individual pixels. In (L. Li et al., 2004), Li et al. utilized spatial and temporal features to model a dynamic background and used a Bayesian rule to estimate the probability distributions. The code-book methods by (J. Guo, Liu, Hsia, Shih, & Hsu, 2011; Kim, Chalidabhongse, Harwood, & Davis, 2005) record the background states of each pixel with a certain number of codewords, and the foreground is detected using a distance measurement. The MAP-MRF algorithm in (Sheikh & Shah, 2005) used a maximum a posteriori Markov random fields (MRF) decision framework to determine pixels as belonging to either the background or foreground. The self-organizing artificial neural network method in (Maddalena & Petrosino, 2008, 2012) presented another alternative for solving the background subtraction problem. In ViBe (Barnich & Van Droogenbroeck, 2011) and PBAS (Hofmann, Tiefenbacher, & Rigoll, 2012), background modeling is based on the selection and updating of pixel samples. A more detailed discussion of these conventional techniques can be found in recent surveys (Bouwmans, 2011, 2014; Brutzer, Hoferlin, & Heidemann, 2011; Y. Xu, Dong, Zhang, & Xu, 2016).

Although there have been significant advancements in foreground detection, accurately detecting the foreground in unconstrained settings still remains challenging. Most models fail to work well in complex environments with dynamic background variations. The reason for this is that these methods often make overly restrictive assumptions about the background (Bouwmans, Sobral, Javed, Jung, & Zahzah, 2016; Gao, Cheong, & Wang, 2014), and that little attention is paid to the relationship of the background across frames. However, real backgrounds in complex environments usually include distracting variations. The background model should adaptively detect shadows cast by moving objects and deal with different variations, such as a new object being introduced into the background or an old one being removed from it. Moreover, an ideal background model needs to tolerate sudden background changes, such as a light source being switched off and on, without losing sensitivity in detecting real foreground objects. Considering these factors, it is very difficult to obtain a robust background model by making specific assumptions.

Lately, the technology of compressive sensing (Candès et al., 2006; Donoho, 2006) has demonstrated promising performance in many computer vision and machine learning tasks. In this framework, the foreground detection task is formulated as a classic problem of learning a low-dimensional linear model from high-dimensional inputs. On the basis of different assumptions about the background, compressive sensing-based approaches can be divided into two major groups, Robust Principal Component Analysis (RPCA) and Sparse Signal Recovery (SSR). In RPCA (Wright, Ganesh, Rao, Peng, & Ma, 2009), the only assumption about the background is that any change in background appearance is highly constrained and can be captured by the low-rank condition of a suitably formulated matrix (Gao et al., 2014). In contrast to RPCA, sparse signal recovery only assumes that a newly observed frame can be modelled as a sparse linear combination of a few preceding frames (dictionary) plus a sparse outlier term (J. Huang, Huang, & Metaxas, 2009). CS-based methods can deal with most of the challenges mentioned above, since they make very few specific assumptions about the background (Bouwmans et al., 2016; Gao et al., 2014). However, the constraints on the foreground signal tend to treat each entry independently, without considering priors on the foreground pixels, such as group and structure properties. Specifically, few works have explored the group properties in both the spatial and temporal domains. In this chapter, we explore these priors of the foreground to improve the accuracy of human motion detection.

2.2 Related methods

According to different assumptions about the background, compressive sensing-based methods can be grouped into two major categories, namely sparse signal recovery and robust principal component analysis.

2.2.1 Sparse signal recovery

Sparse signal recovery provides a comprehensive solution for various tasks in signal processing. Under this framework, Dikmen and Huang (Dikmen & Huang, 2008) first imposed a sparsity constraint on the residual (foreground) term, where an incoming frame y ∈ R^m can be modelled as a sparse linear combination of p previous frames X ∈ R^{m×p}, plus a sparse outlier term (residual) e ∈ R^m, which can be expressed as

$y = Xw + e$,   (1)

where w ∈ R^p is the coefficient vector. The term Xw corresponds to the background parts, while the sparse outlier e accounts for the foreground in y. The goal of the method is to compute the coefficients from a training set by minimizing a given objective function. The Lasso model (Tibshirani, 1996) is the most commonly used objective function, which can be expressed as

$w = \arg\min_{w} \left( \| y - Xw \|_2^2 + \lambda \| w \|_1 \right)$,   (2)

where ‖·‖_2 denotes the ℓ_2-norm. The estimated coefficient w can be obtained according to (2), and the foreground is then obtained as e = y − Xw. However, in previous methods, no prior knowledge of the spatial distribution of the outliers e was utilized. Aiming to tackle this issue, Huang et al. proposed a Dynamic Group Sparsity (DGS) (J. Huang et al., 2009; J. Huang, Zhang, & Metaxas, 2011) recovery method that considers group clustering priors. Nevertheless, in DGS, the sparsity degree must be given in advance; otherwise, the method needs to set the lower bound of the sparsity range to zero and iterate by incrementing the step size until certain halting conditions are satisfied, which results in a very low processing speed. Another limitation of DGS is that training data composed of clean background images is required.
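As a concrete illustration of Eqs. (1)–(2), the sketch below recovers the foreground of a single vectorized frame with an off-the-shelf Lasso solver. It is a minimal toy example rather than the implementation used in the original papers: the frame size, the number of past frames, the regularization weight and the detection threshold are illustrative choices, and scikit-learn's Lasso scales the data-fit term by 1/(2m), which only rescales the effective λ.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_foreground(y, X, lam=0.01):
    """Recover the foreground of frame y given p previous frames X (Eqs. (1)-(2)).

    y : (m,) vectorized current frame
    X : (m, p) matrix whose columns are vectorized previous frames
    Returns the background coefficients w and the outlier term e = y - Xw.
    """
    # scikit-learn's Lasso minimizes (1/(2m))*||y - Xw||_2^2 + lam*||w||_1,
    # i.e. Eq. (2) up to a rescaling of the regularization weight.
    model = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
    model.fit(X, y)
    w = model.coef_          # sparse combination of the previous (background) frames
    e = y - X @ w            # residual = foreground, Eq. (1)
    return w, e

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, p = 64 * 48, 20                              # toy frame size and number of past frames
    bg = rng.random(m)
    X = np.column_stack([bg + 0.01 * rng.standard_normal(m) for _ in range(p)])
    y = bg.copy()
    y[:200] += 0.8                                  # a small "moving object" in the new frame
    w, e = lasso_foreground(y, X)
    print("foreground pixels detected:", int((np.abs(e) > 0.4).sum()))
```

Note that this per-pixel treatment of e is exactly the limitation discussed above: nothing in the sketch encourages the recovered outliers to form spatially or temporally coherent groups.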

In (Mairal, Jenatton, Bach, & Obozinski, 2010), the ProxFlow method solves the following optimization problem

$\min_{w,e} \; \frac{1}{2} \| y - Xw - e \|_2^2 + \lambda_1 \| w \|_1 + \lambda_2 \| e \|_{\ell_1/\ell_\infty}$,   (3)

where ‖·‖_{ℓ_1/ℓ_∞} is a structured sparsity-inducing norm. This constraint was first developed in (Mairal et al., 2010), but ProxFlow needs a training sequence which does not contain any foreground objects. In fact, a training sequence X containing only the background may not be easily obtained in real applications.
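For intuition on the regularizer in Eq. (3): the mixed ℓ1/ℓ∞ norm sums, over a set of pixel groups, the largest absolute value inside each group, which encourages whole groups of foreground coefficients to switch on or off together. The snippet below merely evaluates this norm on a toy, non-overlapping grouping; ProxFlow itself minimizes Eq. (3) by computing the proximal operator of this norm with network-flow techniques, which is not reproduced here.

```python
import numpy as np

def l1_linf_norm(e, groups):
    """Mixed l1/l_inf norm: the sum over groups of the largest absolute value in each group.

    e      : (m,) foreground vector
    groups : list of index arrays defining (here non-overlapping) pixel groups
    """
    return float(sum(np.max(np.abs(e[g])) for g in groups))

if __name__ == "__main__":
    e = np.array([0.0, 0.0, 0.9, 1.1, 0.0, 0.05, 0.0, 0.0])
    groups = [np.arange(0, 4), np.arange(4, 8)]   # two toy pixel groups
    print(l1_linf_norm(e, groups))                # 1.1 + 0.05 = 1.15
```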

2.2.2 Robust principal component analysis

Another important branch of CS is robust principal component analysis, which considers motion detection from the viewpoint of matrix decomposition and optimization. It can be expressed as

$D = L + S$,   (4)

where D ∈ R^{m×n} is the observed video sequence with n frames, and L and S denote the background and foreground parts, respectively. Many properties and constraints of L and S have been explored for the matrix decomposition. A typical model is Robust PCA via Principal Component Pursuit (PCP) (Candès, Li, Ma, & Wright, 2011), which first proposed using an ℓ_1-norm to constrain the foreground, since these signals must form a sparse matrix with a small fraction of nonzero entries. The authors of (Candès et al., 2011) also assumed that the background images are linearly correlated with each other, forming a low-rank matrix L. In this case, the matrix decomposition can be expressed as

$\min_{L,S} \| L \|_* + \lambda \| S \|_1 \quad \text{s.t.} \quad D = L + S$,   (5)

where ‖L‖_* denotes the nuclear norm of matrix L, the sum of its singular values, and ‖S‖_1 denotes the ℓ_1-norm of S.
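A minimal NumPy sketch of solving the PCP problem in Eq. (5) with a standard ADMM/ALM-style iteration: singular value thresholding for L, entry-wise soft thresholding for S, and a dual update on the constraint D = L + S. The penalty heuristic, tolerance and iteration cap are common illustrative defaults, not the settings of any particular paper.

```python
import numpy as np

def soft_threshold(X, tau):
    """Entry-wise soft thresholding: the proximal operator of the l1-norm."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svt(X, tau):
    """Singular value thresholding: the proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def rpca_pcp(D, lam=None, max_iter=200, tol=1e-7):
    """Decompose D into a low-rank background L and a sparse foreground S, Eqs. (4)-(5)."""
    m, n = D.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    mu = 0.25 * m * n / (np.abs(D).sum() + 1e-12)         # common penalty heuristic
    Y = np.zeros_like(D)                                  # Lagrange multipliers
    L = np.zeros_like(D)
    S = np.zeros_like(D)
    for _ in range(max_iter):
        L = svt(D - S + Y / mu, 1.0 / mu)                 # low-rank (background) update
        S = soft_threshold(D - L + Y / mu, lam / mu)      # sparse (foreground) update
        R = D - L - S
        Y += mu * R                                       # dual ascent on the constraint
        if np.linalg.norm(R) <= tol * np.linalg.norm(D):
            break
    return L, S

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    bg = np.outer(rng.random(500), np.ones(60))           # rank-1 "background" of 60 frames
    fg = np.zeros_like(bg)
    idx = rng.choice(bg.size, size=600, replace=False)
    fg.flat[idx] = 1.0                                    # scattered sparse "foreground" outliers
    L, S = rpca_pcp(bg + fg)
    print("recovered rank:", np.linalg.matrix_rank(L, tol=1e-3),
          "foreground pixels:", int((np.abs(S) > 0.5).sum()))
```

This toy decomposition also makes the batch nature of RPCA visible, which is the starting point of the motivation discussed in the next section: all frames must be stacked into D before anything can be decomposed.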


2.3 Spatio-temporal group sparsity recovery

2.3.1 Motivation

As discussed above, RPCA-based methods are batch models (Ebadi & Izquierdo, 2016; Guyon, Bouwmans, & Zahzah, 2012; W. Hu, Yang, Zhang, & Xie, 2017; X. Liu et al., 2015; G. Tang & Nehorai, 2011; N. Wang & Yeung, 2013; Wright et al., 2009; Zhou, Yang, & Yu, 2013), which emphasize the low-rank property of an input matrix stacked from a large number of frames (Wright et al., 2009). In that way, the matrix decomposition can be carried out only after a predefined number of frames has been collected, which makes online processing impossible. The computational cost of RPCA-based approaches is dominated by the evaluation of the low-rank component of the matrix, typically via singular value decomposition (SVD) with a large time complexity of O(mnr) (Rodriguez & Wohlberg, 2015), where m is the size of a frame, n is the number of frames, and r is the rank of the matrix. It is also worth mentioning the large memory footprint due to the size of the input matrix; e.g., in the video "Fall" of the CDnet (Goyette et al., 2012) dataset, the frame resolution is 720×480, hence the size of an input matrix with 400 frames is 1036800×400, which is equivalent to 2.17 GB in single-precision floating point. Recently, some online and incremental RPCA-based methods have been studied, but they still need a batch initialization step (H. Guo, Qiu, & Vaswani, 2014; Hage, Seidel, & Kleinsteuber, 2014; He, Balzano, & Szlam, 2012; Y. Hu, Sirlantzis, Howells, Ragot, & Rodriguez, 2015; Qiu & Vaswani, 2011; Seidel, Hage, & Kleinsteuber, 2014; J. Xu, Ithapu, Mukherjee, Rehg, & Singh, 2013) to guarantee the accuracy of the background recovery. Otherwise they need other post-processing to refine the foreground results, such as using MRF (Javed, Oh, Bouwmans, & Jung, 2015; Javed, Oh, Sobral, Bouwmans, & Jung, 2014) or median filters (Hage et al., 2014; Javed, Sobral, Bouwmans, & Jung, 2015; Seidel et al., 2014). Also, some incremental algorithms have utilized a pre-processing step to improve the speed and quality of their results, for instance, employing super-pixels to obtain homogeneous regions (group or structural information) (Javed, Oh, Sobral, Bouwmans, & Jung, 2015; J. Xu et al., 2013), or using a saliency map to locate salient foreground objects in advance (Pang, Ye, Li, & Pan, 2016).

Different from the RPCA approach, sparse signal recovery-based algorithms process observed frames sequentially, which makes them naturally suited to online processing. Furthermore, for RPCA-based models, the final foreground mask depends on a thresholding operation to remove background noise. However, in sparse signal recovery, the foreground is identified by non-zero coefficients with K sparsity, so a thresholding operation is not needed, which makes the approach robust against background noise. To the best of our knowledge, the performance of existing online RPCA-based approaches without batch initializations or any pre- and post-processing is still inferior to that of traditional batch-based RPCA approaches.

Fig. 4. An example illustrating the difference between our method and ProxFlow (Mairal et al., 2010) and DGS (J. Huang et al., 2009) on the sequence "Shopping mall" (L. Li et al., 2004); for the input frame, the recovered background (Bg) and foreground (Fg) are shown for ProxFlow, DGS, and ours (X. Liu et al., 2018) © 2018, IEEE.

It should be noted that a training set with foreground-free frames (containing no foreground objects) is needed for sparse signal recovery. However, in real video scenes, a well-defined training sequence is not easily obtained. Fig. 4 shows such a video sequence from an indoor surveillance camera, where crowds of people are always in the scene. "Ghosts" (see the red rectangle) appear in the backgrounds produced by ProxFlow (Mairal et al., 2010) and DGS (J. Huang et al., 2009), resulting in incomplete foreground detection.

On the other hand, in most existing sparse signal recovery methods (Dai & Milenkovic, 2009; J. Huang et al., 2009; Needell & Tropp, 2009), the sparsity degree K must be known in advance. Otherwise the method has to initialize the lower bound of the sparsity range to zero and search for the real sparsity degree by incrementing the step size until certain halting conditions are satisfied. It is very hard to maintain the balance between the computation time and the recovery accuracy (Gao et al., 2014). Typically, a very small step size is used to ensure recovery performance, which leads to a very long processing time.


Similarly to RPCA, another limitation is that many sparse signal recovery based methods do not consider any prior knowledge of the spatial distribution of the outliers (foreground pixels). To handle this issue, the methods in (J. Huang et al., 2009; Mairal et al., 2010; J. Xu et al., 2013) explored the relationships between subsets of the non-zero entries in the spatial domain. Indeed, such group information does exist in real scenarios. Fig. 5 provides an example. It can be seen that the non-zero coefficients (white circles) are not randomly distributed but clustered spatially in the foreground image. However, few methods have considered the group clustering priors in the temporal domain. Obviously, we can easily find the temporal continuity of non-zero coefficients across consecutive frames, as shown in Fig. 5.

Fig. 5. Illustration of spatio-temporal group properties of the foreground on two consecutive frames (t−1 and t) of the sequence "Campus" (L. Li et al., 2004): (a) Frame, (b) Foreground, (c) Spatial-temporal group sparsity (X. Liu et al., 2018) © 2018, IEEE.

In this section, a new model for human motion detection is proposed, which falls into the category of sparse signal recovery, and we name it Spatio-Temporal Group Sparsity recovery. It explicitly considers group properties of sparse outliers (foreground pixels) in both spatial and temporal domains for better sparse recovery.

2.3.2 Spatio-temporal group sparsity

In the framework of sparse signal recovery (Dikmen, Tsai, & Huang, 2009; J. Huang et al., 2009), a newly observed frame y_t ∈ R^m at time t (y_t is the 2D image frame reshaped into a vector of size m = w × h, where w and h are the width and height of the frame) can be modelled as a sparse linear combination of n preceding image frames D = [y_{t−n}, ..., y_{t−1}] ∈ R^{m×n}, plus a sparse outlier term e ∈ R^m, which can be expressed as

$y_t = Dx + e$,   (6)

where x ∈ R^n is the vector of coefficients. The term Dx means the background part, and x should be a k_x-sparse vector with k_x ≪ n. Also, the sparse outlier e accounts for the foreground part in y_t. In this chapter, we employ an identity matrix I ∈ R^{m×m} as the complete dictionary (basis) for the foreground signals; thus, Eq. (6) can be rewritten as

$y_t = \Phi z, \qquad \Phi = [D, I], \quad z = \begin{bmatrix} x \\ e \end{bmatrix}$,   (7)

where Φ ∈ R^{m×(n+m)} and z ∈ R^{n+m}. Here, we use φ_1, ..., φ_{n+m} to denote the columns of Φ. For background subtraction, the goal of this method is to estimate the coefficient vector z from a training data set by minimizing the following objective function

$(x, e) = \arg\min \| z \|_0 \quad \text{s.t.} \quad \| y_t - \Phi z \|_2^2 < \varepsilon$,   (8)

where z should be a (k_x + k_e)-sparse vector with (k_x + k_e) ≪ (n + m). Eq. (8) is an NP-hard problem due to the non-convexity of the ℓ_0-norm. In order to determine the optimal solution, many algorithms have been introduced. One family of approaches tries to seek the sparsest solution by performing a basis pursuit (BP) (S. Chen, Donoho, & Saunders, 1998) based ℓ_1-minimization instead of ℓ_0 (Donoho, 2006), with the objective function

$(x, e) = \arg\min \| z \|_1 \quad \text{s.t.} \quad \| y_t - \Phi z \|_2^2 < \varepsilon$.   (9)

BP provides strong guarantees and stability, but because it relies on linear programming, it does not yet have strongly polynomially bounded run-times (Needell & Tropp, 2009).
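To make Eq. (7) concrete, the sketch below stacks a toy background dictionary D with an identity basis for the foreground and solves an unconstrained (Lagrangian) relaxation of the ℓ1 problem in Eq. (9) with plain iterative soft thresholding (ISTA). The dimensions, λ and iteration count are illustrative, and a dedicated BP/BPDN solver would normally be used instead of this simple loop.

```python
import numpy as np

def ista_l1(Phi, y, lam=0.05, n_iter=5000):
    """Solve min_z 0.5*||y - Phi z||_2^2 + lam*||z||_1 by iterative soft thresholding."""
    step = 1.0 / np.linalg.norm(Phi, 2) ** 2        # 1/L, with L the gradient Lipschitz constant
    z = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        z = z - step * (Phi.T @ (Phi @ z - y))      # gradient step on the data-fit term
        z = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft thresholding
    return z

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    m, n = 400, 10                                  # toy frame length and number of past frames
    bg = rng.random(m)
    D = np.column_stack([bg + 0.01 * rng.standard_normal(m) for _ in range(n)])
    y = bg.copy()
    y[50:70] += 1.0                                 # true foreground support
    Phi = np.hstack([D, np.eye(m)])                 # Eq. (7): Phi = [D, I]
    z = ista_l1(Phi, y)
    x, e = z[:n], z[n:]                             # z = [x; e]: background coeffs and foreground
    print("estimated foreground support:", np.flatnonzero(np.abs(e) > 0.5))
```

The split z = [x; e] makes the role of the identity block explicit: the background is explained through D, while the foreground appears directly as non-zero entries of e.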

Another category of methods is the iterative greedy pursuit approach. A representative model is orthogonal matching pursuit (OMP) (Tropp & Gilbert, 2007), which utilises the observation vector to iteratively compute the support of the signal. During each iteration, OMP selects the largest component of the observation vector to be in the support, using

$\upsilon^c = \arg\max_{l = 1, \dots, m+n} \left| \left\langle y_r^{c-1}, \varphi_l \right\rangle \right|$,   (10)

$T^c = T^{c-1} \cup \upsilon^c$,   (11)

where c is the iteration counter, y_r is the signal residual, ⟨·,·⟩ denotes the inner product, and the support set T is defined as the set of indices corresponding to the non-zero elements of the signal, which can be used to reconstruct the signal. From the above, we see that OMP chooses the column that most strongly correlates with the remaining part of y; OMP then subtracts its contribution and iterates on the residual.
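A compact NumPy sketch of the OMP iteration in Eqs. (10)-(11): pick the column most correlated with the current residual, add it to the support, re-estimate the coefficients by least squares on that support, and update the residual. The fixed-sparsity stopping rule and the random toy dictionary are illustrative only.

```python
import numpy as np

def omp(Phi, y, K):
    """Orthogonal matching pursuit: greedy K-sparse recovery of z from y = Phi z."""
    m, N = Phi.shape
    residual = y.copy()
    support = []
    z = np.zeros(N)
    for _ in range(K):
        # Eq. (10): the column most correlated with the residual
        v = int(np.argmax(np.abs(Phi.T @ residual)))
        if v not in support:
            support.append(v)                        # Eq. (11): grow the support
        # re-estimate the coefficients by least squares on the current support
        coeffs, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        z[:] = 0.0
        z[support] = coeffs
        residual = y - Phi @ z                       # subtract the explained part
    return z, support

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, N, K = 100, 300, 5
    Phi = rng.standard_normal((m, N)) / np.sqrt(m)
    true_support = rng.choice(N, K, replace=False)
    z_true = np.zeros(N)
    z_true[true_support] = rng.standard_normal(K) * 3.0
    z_hat, support = omp(Phi, Phi @ z_true, K)
    print("recovered:", sorted(support), "true:", sorted(true_support.tolist()))
```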

OMP is a very fast greedy pursuit algorithm, but the key issue is that it lacks provable recovery guarantees and requires more measurements for perfect recovery (Dai & Milenkovic, 2009; J. Huang et al., 2009; Needell & Tropp, 2009). In order to close this gap, a backtracking strategy was introduced in subspace pursuit (SP) (Dai & Milenkovic, 2009) and compressive sampling matching pursuit (CoSaMP) (Needell & Tropp, 2009), in which many columns are selected (by maintaining a list of indices) from the observation vector to be in the support (candidates) at each iteration, using

T_M = supp(Φ^* y_r^{c−1}, K),   (12)

T^c = T^{c−1} ∪ T_M,   (13)

where supp(X, K) returns the support of the K largest-magnitude elements of X, K is the sparsity degree, and ^* stands for matrix transposition. It can be concluded from Eqs. (10) and (11) that in OMP, once an index is added to the support set, it remains in this set throughout the remainder of the process. In the SP and CoSaMP algorithms, by contrast, a list of indices of size K is maintained and refined during each iteration. An index considered reliable in previous iterations but shown to be wrong at a later iteration can be freely added to or removed from the estimated support set (Dai & Milenkovic, 2009). SP and CoSaMP provide guarantees as strong as those of the ℓ1 approach, and their computational complexity is comparable to that of the greedy pursuit algorithms (J. Huang et al., 2009). However, all of the above methods treat each non-zero coefficient independently, and the confidence score in Eqs. (10) and (12) relies on the result of the inner product (projection), which can be expressed as

p(l) = |⟨y_r^{c−1}, ϕ_l⟩|.   (14)


Neither of these takes into account any possible relations between subsets of the entries in the sparse signal.

In this chapter, we propose a new greedy sparse recovery algorithm, which is motivated by the observation that in real video sequences, the sparse outliers treated as the foreground are not randomly distributed, but often have group properties of spatial coherence and temporal contiguity. Based on this observation, we assume that group signals live in a union of subspaces, which implies that if a point lives in this union of subspaces (see the red point in Fig. 5), its neighbouring (spatial and temporal) points will also live in this union of subspaces with high probability, and vice versa. Inspired by (Dai & Milenkovic, 2009; J. Huang et al., 2009; Needell & Tropp, 2009), we propose a Spatio-Temporal Group Sparsity (STGS) recovery algorithm that includes five main steps in each iteration (see Algorithm 1). These are outlined below.

Algorithm 1 Spatio-temporal group sparsity recovery algorithm (STGS) (X. Liu et al., 2018).
Input: K, Φ, y where y = Φz; K is the sparsity degree of z; ω are the weights for the neighbours; τ controls the size of the neighbourhood; σ encourages temporal continuity.
Output: z: K-sparse approximation of z.
Initialize the support T^0 = ∅, the residual y_r^0 = y, and set c = 0.
while the halting criterion is not satisfied do
    c = c + 1
    1) Find new support elements:
       Compute the subspace projection: P = Φ^* y_r^{c−1}
       For each entry p(i, j) in P, calculate the confidence score by Eq. (18)
       T_M = supp(E, K)
    2) Update the support sets: T^c = T^{c−1} ∪ T_M
    3) Compute the representation: z_v = Φ†_{T^c} y
       For each entry p(i, j) in z_v, calculate the confidence score by Eq. (18)
    4) Prune small entries in the representation: T^c = supp(z_v, K)
    5) Update the residual: y_r^c = y − y_p  (y_p = Φ_{T^c} Φ†_{T^c} y)
end while
Form the final solution: z_{T^c} = Φ†_{T^c} y

1) Find new support elements. The purpose of signal reconstruction is to identify the locations of the largest components in the target signal. At each iteration c of STGS, the current approximation induces a residual y_r^{c−1}, which is the part of the target signal that has not been approximated. STGS first computes the subspace projection


coefficients by the inner products of the residual signal with the columns of Φ, using

P = Φ^* y_r^{c−1}.   (15)

Then, the STGS algorithm attempts to collect a set of candidate non-zeros (K columns of Φ) that are most correlated with the residual. This idea is similar to the SP and CoSaMP approaches. However, in background subtraction, the relationship between these non-zeros (foreground pixels) is ignored by SP and CoSaMP. In fact, these non-zero coefficients (foreground pixels) are clustered into G (G < K) groups. Therefore, instead of using only the inner product, STGS calculates a confidence score by combining each entry with its neighbours in both the spatial and temporal domains (see Eq. (18) and its explanation in Step 3). More specifically, if a pixel is foreground, its neighbouring (spatial and temporal) pixels will also belong to the foreground with high probability, and vice versa. Hence, STGS selects the new support set T_M that reflects the current residual in terms of the order statistics of the newly defined confidence score of Eq. (18).

2) Update the support sets. STGS maintains a list T^{c−1} with K columns of Φ (a temporary solution with K non-zero entries) and adds the additional set T_M of K columns obtained from the last step. As such, the size of the merged support set T^c satisfies K ≤ size(T^c) ≤ 2K.

3) Compute the representation. STGS calculates the non-zero signal coefficients (representation) by applying a pseudo-inversion process, which can be expressed as

z_v = Φ†_{T^c} y,   (16)

where the matrix Φ_{T^c} consists of the columns of Φ with indices in T^c, and Φ†_{T^c} denotes the pseudo-inverse of Φ_{T^c}, obtained by

Φ†_{T^c} = (Φ^*_{T^c} Φ_{T^c})^{−1} Φ^*_{T^c}.   (17)

Similarly to Step 1), STGS calculates the confidence score of each entry, not only using the coefficient itself but also combining it with its neighbours in both the spatial and temporal domains. The confidence score is defined as

p(i, j) = p^2(i, j) + Σ_{a=i−τ}^{i+τ} Σ_{b=j−τ}^{j+τ} ω^2(a, b)(1 + σ) p^2(a, b).   (18)


Please note that z_v (and P in Eq. (15)) is a 1-D vector, but we use p(i, j) rather than p(l) as in Eq. (14) to represent its entries. The reason is that we want to preserve the 2-D structure and neighbouring information of the pixels (in Eq. (7), z = [x, e]′ is a 1-D vector; the e part, of length m, can be regarded as, or reshaped to, a w×h matrix, where w and h are the frame size and m = w×h, so the entries of e can be regarded as pixels with 2-D coordinates in an image). In Eq. (18), τ controls the size of the spatial neighbourhood. According to the experimental results, an 8-connected neighbourhood proved to be satisfactory for our approach (see the experiments in Section 2.5), so we set τ = 1. The parameter σ is used to encourage temporal continuity, which is very important for dealing with noise and dynamic background motion: noise or background motion is usually smaller, shorter and more irregular than real foreground object motion, whereas foreground objects commonly exhibit temporal consistency. According to the experimental results (see Section 2.5), if pixel (a, b) was foreground in the previous frame, we set σ = 0.2; otherwise σ = 0. Please note that ω is the weight of the neighbours, which is employed to control the balance between the sparsity prior and the group clustering prior. The strategy for setting ω is reported subsequently.
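As an illustration, a small NumPy sketch of the confidence score of Eq. (18) evaluated on a frame-sized map of projection magnitudes. The function name, the wrap-around border handling (np.roll), the per-pixel weight map and the choice to exclude the centre pixel from the neighbourhood sum are our simplifying assumptions, not part of the original formulation.

```python
import numpy as np

def confidence_scores(p, weights, prev_fg, tau=1, sigma_fg=0.2):
    """Eq. (18): combine each squared entry with its (2*tau+1)^2 - 1 spatial
    neighbours, boosting neighbours that were foreground in the previous frame."""
    p2 = p ** 2
    score = p2.copy()
    for a in range(-tau, tau + 1):
        for b in range(-tau, tau + 1):
            if a == 0 and b == 0:
                continue
            p2_n = np.roll(np.roll(p2, a, axis=0), b, axis=1)
            w_n = np.roll(np.roll(weights, a, axis=0), b, axis=1)
            fg_n = np.roll(np.roll(prev_fg, a, axis=0), b, axis=1)
            sigma = np.where(fg_n, sigma_fg, 0.0)         # temporal continuity term
            score += (w_n ** 2) * (1.0 + sigma) * p2_n    # omega^2 (1 + sigma) p^2
    return score
```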

4) Prune small entries in the representation. In this step, STGS produces a new approximation (a support set T^c with K columns of Φ) by retaining only the K largest entries. By enforcing the spatio-temporal group sparsity constraint mentioned in the last step, the signal pruning process can be accelerated, since the sparse signals have been significantly reduced to a narrower union of subspaces (J. Huang et al., 2009). Moreover, false positives caused by noise are reduced because STGS only permits certain combinations of its support set rather than arbitrary entries. The experimental results (Fig. 9 and Fig. 10) show better foreground detection compared to other related approaches.

It should be noted that STGS extends the backtracking scheme of SP and CoSaMP. In traditional greedy pursuit algorithms, such as the well-known OMP approach, the method selects only one (or several) column(s) during each iteration that most strongly relate to the residual, and once an index is added to the support set, it cannot be removed throughout the remainder of the process. As a result, strict inclusion rules are needed to ensure that a significant fraction of the newly added indices (columns) belong to the correct support set (Dai & Milenkovic, 2009). In contrast, STGS maintains a candidate list with K non-zero entries, and in each iteration it adds an additional set of K candidate non-zeros that are most correlated with the residual; STGS then refines this list back to K elements. These recursive refinements of the estimate of the support set lead to subspaces with a strictly decreasing distance from the measurement vector (Dai & Milenkovic, 2009).

5) Update the residual. In this step, STGS first calculates a new projection y_p of y onto span(Φ_{T^c}), using

y_p = proj(y, Φ_{T^c}) := Φ_{T^c} Φ†_{T^c} y.   (19)

The space spanned by the columns of Φ_{T^c} (the support set) is denoted by span(Φ_{T^c}) (Dai & Milenkovic, 2009). Next, STGS calculates the residual vector as the outlier (foreground candidate), which is the part of the target signal that has not been approximated, as

y_r = resid(y, Φ_{T^c}) := y − y_p.   (20)

These five steps are repeated until the halting criterion is triggered, and STGS uses the final support set T^c to obtain the sparse non-zero entries z_{T^c} by

z_{T^c} = Φ†_{T^c} y.   (21)

The halting criterion recommended in (Dai & Milenkovic, 2009) is adopted. The whole algorithm is summarized in Algorithm 1.
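Putting the five steps together, the following is a compact sketch of one possible STGS-style iteration. For clarity it assumes the foreground-part setting Φ = I (as used in the second-pass below), so every coefficient maps to one pixel; the helper confidence_scores is the sketch given after Eq. (18), and the halting test and other details are simplifications rather than the exact thesis implementation.

```python
import numpy as np

def stgs_foreground(y, shape, K, weights, prev_fg, max_iter=20, tol=1e-6):
    """STGS-style recovery sketch for the foreground part (Phi = I)."""
    m = y.size
    support = np.zeros(m, dtype=bool)
    z = np.zeros(m)
    residual = y.copy()
    for _ in range(max_iter):
        # Step 1: with Phi = I the projection (Eq. (15)) is the residual itself;
        # score it with the spatio-temporal neighbourhood support of Eq. (18).
        score = confidence_scores(residual.reshape(shape), weights, prev_fg).ravel()
        new_idx = np.argpartition(score, -K)[-K:]
        # Step 2: merge the K candidates with the current support (<= 2K entries).
        merged = support.copy()
        merged[new_idx] = True
        # Step 3: least squares on the support reduces to a copy when Phi = I.
        zv = np.zeros(m)
        zv[merged] = y[merged]
        score_v = confidence_scores(zv.reshape(shape), weights, prev_fg).ravel()
        score_v[~merged] = -np.inf
        # Step 4: prune back to the K highest-confidence entries.
        keep = np.argpartition(score_v, -K)[-K:]
        support = np.zeros(m, dtype=bool)
        support[keep] = True
        z_new = np.zeros(m)
        z_new[support] = y[support]
        # Step 5: update the residual and test a simple halting criterion.
        residual = y - z_new
        if np.linalg.norm(z_new - z) <= tol:
            z = z_new
            break
        z = z_new
    return z
```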

The works closest to ours are SP, CoSaMP, and DGS. However, the group sparsity property of the foreground signals is ignored by SP and CoSaMP, and in DGS the temporal context information is missing. In fact, it is difficult to deal with dynamic background variations without considering the group property in the temporal domain (see the experimental results in Section 2.5). Furthermore, there are two issues with these methods that need to be addressed. Firstly, they need to be given the sparsity degree in advance; however, knowing the sparsity degree before identifying the foreground is a typical chicken-and-egg problem. Usually, a multiple iterative approximation procedure is used to ensure the quality of recovery (J. Huang et al., 2009), which leads to a very low processing speed. Secondly, these methods assume that the background can be modelled as a sparse linear combination of atoms from a dictionary D in Eq. (6), and they obtain the atoms directly from n preceding original frames, which may contain both background and unwanted foreground pixels. There is no strategy to train a background dictionary, which leads to inaccurate background recovery and incomplete foreground detection.


2.3.3 Two-pass framework for motion detection

Inspired by (He et al., 2012; Qiu & Vaswani, 2011; J. Xu et al., 2013), a two-pass scheme is proposed to solve the aforementioned problems. The framework is illustrated in Fig. 6. In the first-pass, a fast background modelling method is utilized to quickly segment the likely foreground regions. This operation yields several advantages:
∗ The obtained number of foreground pixels can be used to estimate a bound on, or narrow the range of, the sparsity degree, so the number of iterations is significantly reduced.
∗ The locations of foreground pixels can be used as a prior on the group structure.
∗ The obtained background image can be used as a candidate atom for the background dictionary.

Thus, in the first-pass, a pixel-intensity-sampling-based method is utilized. This model records a history of recently observed pixel intensities and employs a random aggregation strategy to update the background samples. The spirit is similar to ViBe (Barnich & Van Droogenbroeck, 2011). The main reason we utilize this method is its high processing frame rate, about 200 FPS for frames with a resolution of 640×480, as noted in (Barnich & Van Droogenbroeck, 2011).

Fig. 6. Illustration of the framework of the proposed method. The first-pass introduces a high-speed method (Barnich & Van Droogenbroeck, 2011) to estimate the background and foreground roughly, and the spatio-temporal group sparsity recovery is proposed to detect the foreground with high accuracy in the second-pass (X. Liu et al., 2018) © 2018 IEEE.

In the second-pass, instead of using the original video frame to create the atoms of the dictionary, as traditional methods do, we employ the background image b_{ft} from the


first-pass to update the background dictionary D. Next, we briefly explain our learning strategy for the background dictionary. Many previous methods (J. Huang et al., 2009; Mairal et al., 2010) used first-in, first-out policies to update their dictionaries. At the same time, a good dictionary should contain samples (atoms) from recent past background images, and older samples should not necessarily be discarded. Motivated by (Barnich & Van Droogenbroeck, 2011), we randomly choose an atom to be discarded according to a uniform probability density function (see Eq. (23)) rather than removing the oldest atom from the dictionary. The reason behind this is that it can handle a wide range of events in the background scene. With this random updating strategy, the probability that an atom present in the dictionary at time t is preserved after the next update (t + 1) is (N − 1)/N, where N is the number of atoms in the dictionary. Thus, the probability of this atom still being in the dictionary after a time interval dt is

Pr(t, t + dt) = e^{−ln(N/(N−1)) dt}.   (22)

As reported in (Barnich & Van Droogenbroeck, 2011), the above expression shows that the probability of an atom being preserved over the interval (t, t + dt) is independent of t, assuming that it was included in the dictionary prior to time t. In other words, an atom is not discarded after a fixed number of frames but is instead given an expected lifespan, which maintains the diversity of the background states. In many real videos, the old history is meaningful and useful, especially for multi-modal background variations. In addition, this random update mechanism extends the time window covered by the background dictionary without increasing the number of atoms, which is key to controlling the computational burden and memory requirements. Furthermore, background modelling often encounters challenges from sudden background variations, such as changes in illumination. To respond rapidly to these variations, a threshold Th is employed: when the foreground ratio F_num/m (F_num is the number of foreground pixels and m denotes the size of the frame) is greater than Th, the original frame y_t is used to update the background dictionary rather than the background image b_{ft} produced by the first-pass. Therefore, the update of the background dictionary is expressed as

D_t = D_{t−1}{ ∀ b_i | b_i = b_t, with probability 1/N }
(if F_num/m > Th, b_t = y_t; else b_t = b_{ft}).   (23)
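A small sketch of the random atom replacement of Eq. (23): one uniformly chosen atom is overwritten per update, so each atom survives an update with probability (N − 1)/N. The threshold value and variable names are illustrative assumptions, not values taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng()

def update_dictionary(D, y_t, b_ft, fg_count, Th=0.3):
    """Eq. (23): overwrite one randomly chosen atom of D with the new background
    sample (the original frame is used instead when too much foreground is present)."""
    m, N = D.shape                        # frame size and number of atoms
    b_t = y_t if fg_count / m > Th else b_ft
    i = rng.integers(N)                   # each atom is replaced with probability 1/N,
    D[:, i] = b_t                         # i.e. it survives one update with prob. (N-1)/N
    return D
```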


As mentioned above, for the outlier part e in Eq. (6), we utilize a parameter ω to control the balance between the sparsity prior and the group clustering prior. This means that if the degree of group clustering in the sparse signal is higher, ω should be assigned a greater value, and vice versa. Based on this assumption, we propose an adaptive setting of ω for image regions with distinct properties in each frame. More specifically, in our method, each frame is divided into small blocks of the same size B_s, and in each block we count the number of foreground pixels Bf_i produced by the first-pass. Then, at time t, the weight for the pixels in block i is computed by

ω_{B_i,t} = Bf_{i,t}/B_s.   (24)
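A minimal sketch of the block-wise weights of Eq. (24), broadcast back to a per-pixel weight map; the block size and names are illustrative choices.

```python
import numpy as np

def block_weights(fg_mask, block=16):
    """Eq. (24): per-block weight = fraction of first-pass foreground pixels in the block."""
    h, w = fg_mask.shape
    weights = np.zeros((h, w))
    for i in range(0, h, block):
        for j in range(0, w, block):
            patch = fg_mask[i:i + block, j:j + block]
            weights[i:i + block, j:j + block] = patch.mean()   # Bf_i / B_s
    return weights
```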

Up to now, we have obtained all of the inputs (parameters) needed in the second-pass, including the weights ω_f (for the x-related (background) part, ω can be set to zero), the background dictionary D, and the estimate of the foreground pixel number k_e from the first-pass. In our method, there are two iteration loops. During each stage (the inner loop), the sparse data is iteratively optimized with a fixed sparsity number until the halting condition within that stage is satisfied. The method then switches to the next stage (the outer loop) after adding the step size ∆k to the current sparsity number.

Usually, the range of the sparsity would start from zero, as it does in existing SSR-based methods, so the gap between the starting and ending values of the iteration is K. Thanks to the estimate of the foreground pixel number k_e from the first-pass, for the foreground part we can initialize the sparsity number (outer loop) as k_e. The gap between the sparsity K and k_e is

d = |K− ke|. (25)

Since d is usually much smaller than K, it can be concluded that the iteration time can be significantly reduced by the proposed two-pass framework.

The same two loops are utilized in the second-pass. For the background, the sparsity degree k_x can start from 1 in its outer loop. For the foreground, we need to determine whether k_e is less than or greater than the sparsity degree K. Hence, we check whether the halting condition of the outer loop is satisfied after adding the step size ∆k at the first iteration. If it is satisfied, we subtract a step size from k_e, and the whole iterative process stops whenever the halting condition is no longer satisfied, and vice versa.


The proposed two-pass framework is summarized in Algorithm 2. The experimental results of our method are reported in Section 2.5.

Algorithm 2 Two-pass framework for background subtraction (X. Liu et al., 2018).
Input: time t, video frame y_t.
Output: e_t: sparse approximation of the foreground e_t.
First-pass:
    Foreground detection (Barnich & Van Droogenbroeck, 2011)
    Results: rough estimation of the foreground, including the foreground number k_e and the weights ω_f by Eq. (24), and the background image b_{ft}
Second-pass:
    Background part:
        Init: use b_{ft} and Eq. (23) to update the background dictionary D_t; C = 0 (iterations of the outer loop); support set T_C = ∅; set Φ = D_t, y = y_t, ω = 0, and K = 1 as input of STGS
        repeat
            C = C + 1
            Perform STGS recovery with sparsity K to obtain z_{T_C}
            Update T_C, z_{T_C} = Φ†_{T_C} y, and K = K + ∆k
        until the halting criterion (‖z_{T_C} − z_{T_{C−1}}‖_2 ≤ ε) is true
        Results: sparse coefficients of the background x = z_{T_C} = Φ†_{T_C} y; foreground candidate (residual) y_t − D_t x
    Foreground part:
        Init: C = 0; support set T_C = ∅; Φ = I (identity matrix); y = y_t − D_t x; ω = ω_f; K = k_e as input of STGS
        Perform STGS with sparsity K + ∆k and K to obtain z_{T_1} and z_{T_2}, respectively; now C = 2
        if ‖z_{T_2} − z_{T_1}‖_2 ≤ ε is satisfied
            repeat
                C = C + 1, K = K − ∆k
                Perform STGS with sparsity K to obtain z_{T_C}
                Update T_C, z_{T_C} = Φ†_{T_C} y
            until the halting criterion (‖z_{T_C} − z_{T_{C−1}}‖_2 > ε) is true
            T_C = T_{C−1}
        else
            repeat
                C = C + 1, K = K + ∆k
                Perform STGS with sparsity K to obtain z_{T_C}
                Update T_C, z_{T_C} = Φ†_{T_C} y
            until the halting criterion (‖z_{T_C} − z_{T_{C−1}}‖_2 ≤ ε) is true
        Form the final solution: e_t = Φ†_{T_C} y


2.4 Multi-channel fused sparsity recovery

2.4.1 Motivation

From the signal processing point of view, foreground detection can be regarded as separating a source signal from a mixture of sources, which can be expressed in general as

Y = B+F + ε, (26)

where Y is the observed signal (a video frame), which is composed of individual sources, namely the background B, the foreground F and the noise ε. Given the assumption that foreground objects are usually sparse, the signals of the foreground and noise can be considered as the residual R between the frame and the background, as

R = Y −B. (27)

According to the framework of sparse signal recovery, at time t, given a residual signal R_t ∈ R^s (s = w×h×C, where w, h and C are the width, height and number of channels of an input frame), its binarization for obtaining a foreground mask can be modelled by a denoising process, as

Rt = Φx+ εt , (28)

where Φx accounts for the recovered foreground signal F_t, and ε_t is the noise. Here x ∈ R^s is the coefficient vector, and x should be a k_x-sparse vector with k_x ≪ s. In other words, the computed non-zero part of x can be utilized to binarize the foreground mask. Here, we employ an identity matrix I ∈ R^{s×s} as the complete dictionary Φ for the foreground signals.

The moving foreground objects are spatially coherent clusters, namely, if a pixel is foreground, its neighbouring pixels will also tend to belong to the foreground, and vice versa. Therefore, a variety of constraints have been utilized to enforce spatial contiguity between neighbouring foreground pixels, such as the well-known Fused Lasso (Tibshirani, Saunders, Rosset, Zhu, & Knight, 2005), whose objective function is

min_x  (1/2)‖R_t − Φx‖_2^2 + λ_1‖x‖_1 + λ_2‖Dx‖_1,   (29)


where the term ‖R_t − Φx‖_2^2 is the ordinary least-squares criterion accounting for the reconstruction error, with ‖·‖_2 denoting the ℓ2-norm. The term ‖x‖_1 is the sparsity constraint on the coefficients, with ‖·‖_1 denoting the ℓ1-norm. The parameters λ_1 and λ_2 are regularization parameters which control the relative contributions of the corresponding terms. The term ‖Dx‖_1 is the Total Variation (TV) regularizer, which penalizes the differences between consecutive coefficients, where D ∈ R^{(s−1)×s} is the differencing matrix, that is, D_{i,i} = −1, D_{i,i+1} = 1 and D_{i,j} = 0 elsewhere. Then D is defined as

D =
[ −1   1
       −1   1
            ⋱    ⋱
                 −1   1 ].

In addition, in the DECOLOR model (Zhou et al., 2013), Zhou et al. employed Markov Random Fields (MRFs) to impose smoothness on the foreground matrix, and a group Lasso regularization was applied to model the foreground in GOSUS (J. Xu et al., 2013). Liu et al. (X. Liu et al., 2015) proposed a low-rank and structured sparse decomposition where a matrix of stacked frames is divided into overlapping groups of pixels to enforce structural sparsity constraints. In (X. Guo, Wang, Yang, Cao, & Ma, 2014; Xin, Tian, Wang, & Gao, 2015), the local sparseness constraint was exploited by a total variation penalty (TV-RPCA) and a generalized fused Lasso (GFL) to better deal with corrupted data. However, the existing methods give little consideration to the homogeneity of the channels. When dealing with colour images, a typical option is to convert the RGB frame to a gray frame (X. Liu et al., 2015; J. Xu et al., 2013; Zhou et al., 2013); another way is to apply sparse recovery independently to each of the three RGB channels (X. Guo et al., 2014; Xin et al., 2015). In fact, a pixel should be considered as a multi-channel feature: if a pixel equals its adjacent ones, all three RGB coefficients should be equal at the same time. It is therefore necessary to enforce the homogeneity of the channels at the group level. This thesis aims to exploit a multi-channel group-sparsity prior for motion detection.


2.4.2 Multi-channel fused Lasso

Consider that the foreground signal F_t has N pixels (N = w×h) and that a pixel p_i of F_t has C channels; therefore

F_t = (p_{1,1}, p_{1,2}, · · · , p_{1,C}, p_{2,1}, p_{2,2}, · · · , p_{2,C}, · · · , p_{N,1}, p_{N,2}, · · · , p_{N,C})^⊤,

where the first C entries form the group p_1, the next C entries form p_2, and so on up to p_N.

From the above, we can see that F_t has a group structure, namely, F_t has NC components that come in N groups with C channels. As such, the multiple channels of a pixel should be considered relevant or irrelevant as a whole, and not each component independently as in the traditional model. In other words, all the coefficients of a particular group (pixel) should be zero, or non-zero, at the same time, so the sparsity of x is achieved at the group level. Based on this observation, in this chapter we propose a Multi-Channel Fused Lasso (MCFL) model to explore the smoothness of multi-channel signals. The objective function is defined as

min_x  (1/2)‖R_t − Φx‖_2^2 + λ_1‖x‖_{2,1} + λ_2‖Gx‖_{2,1},   (30)

where ‖·‖_{2,1} denotes the ℓ2,1-norm. Since the sparsity should be achieved at the group level, for a coefficient vector x the term

‖x‖_{2,1} = Σ_{n=1}^{N} ‖x_n‖_2 = Σ_{n=1}^{N} √( Σ_{c=1}^{C} x_{n,c}^2 ),   (31)

which is the group Lasso model, i.e., the ℓ1-norm of the ℓ2 group norms. In contrast to the TV regularizer, the term ‖Gx‖_{2,1} enforces similarity between the coefficients corresponding to nearby groups, namely the differences between consecutive groups should be identically zero, as

‖Gx‖_{2,1} = Σ_{n=2}^{N} √( Σ_{c=1}^{C} (x_{n,c} − x_{n−1,c})^2 ),

with

G =
[ −I   I
       −I   I
            ⋱    ⋱
                 −I   I ],   (32)


where G ∈ R^{(N−1)C×NC} is a group differencing matrix, and I ∈ R^{C×C} denotes the identity matrix.
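To illustrate the group structure, a NumPy sketch that builds the group differencing matrix G of Eq. (32) and evaluates the objective of Eq. (30) with Φ = I; the function and variable names are ours.

```python
import numpy as np

def group_diff_matrix(N, C):
    """G of Eq. (32): an (N-1)C x NC block matrix with -I, I on consecutive groups."""
    G = np.zeros(((N - 1) * C, N * C))
    I = np.eye(C)
    for n in range(N - 1):
        G[n * C:(n + 1) * C, n * C:(n + 1) * C] = -I
        G[n * C:(n + 1) * C, (n + 1) * C:(n + 2) * C] = I
    return G

def l21_norm(v, C):
    """Eq. (31): sum of the l2 norms of the C-channel groups of v."""
    return np.linalg.norm(v.reshape(-1, C), axis=1).sum()

def mcfl_objective(x, R, lam1, lam2, N, C):
    """Eq. (30) with Phi = I: data term + group Lasso + group total variation."""
    G = group_diff_matrix(N, C)
    return (0.5 * np.sum((R - x) ** 2)
            + lam1 * l21_norm(x, C)
            + lam2 * l21_norm(G @ x, C))
```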

As is well known, the ℓ1 regularizer is not differentiable, which rules out conventional smooth optimization techniques. In this chapter, we introduce the proximal splitting method (Combettes & Pesquet, 2011) for the optimization of Eq. (30), which can be formulated as a convex optimization problem of the form

min_{x∈R^M}  θ_1(x) + · · · + θ_m(x).   (33)

The proximal splitting method is designed to split the objective into the functions θ_1(x), · · · , θ_m(x) (minimizing them independently) so as to yield an easily implementable algorithm, and each non-smooth function in (33) is handled via its proximity operator (Combettes & Pesquet, 2011). Specifically, if θ_i is a convex, lower semi-continuous function, its proximity operator (denoted prox_{γ;θ_i}) at x with step γ > 0 is defined as

z_x = prox_{γ;θ_i}(x) = argmin_{z∈R^M}  (1/2)‖z − x‖_2^2 + γ θ_i(z).   (34)

Recall that in Eq. (30), θ_1(x) = ‖R_t − Φx‖_2^2 is differentiable, while θ_2(x) = λ_1‖x‖_{2,1} and θ_3(x) = λ_2‖Gx‖_{2,1} are convex but non-smooth functions. Here, we denote θ_+(x) = θ_2(x) + θ_3(x). Since both θ_1(x) and θ_+(x) are convex, according to the proximal splitting method, the optimization of (30), namely the minimization of the sum of θ_1(x) and θ_+(x), can be achieved through the iterative minimization of θ_1(x) and θ_+(x) individually. Based on the proximal gradient method (Combettes & Pesquet, 2011) and the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) (Beck & Teboulle, 2009), an optimal x^* can be obtained by optimizing the variable x and updating the variables z

and t iteratively, which solves the following three sub-problems:

x_k = prox_{γ;θ_+}(z_k − γ∇θ_1(z_k)),
z_{k+1} = x_k + ((t_k − 1)/t_{k+1})(x_k − x_{k−1}),
t_{k+1} = (1 + √(1 + 4t_k^2))/2,   (35)

where ∇ denotes the differential operator and γ = 1/L, with L a Lipschitz constant of ∇θ_1. Here, we take z_1 = x_0 and t_1 = 1 for parameter initialization.
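A compact sketch of the FISTA recursion of Eq. (35). Here grad_theta1 and prox_theta_plus stand for ∇θ_1 and prox_{γ;θ_+} and are passed in as callables (the latter is realized by the Dykstra-like step discussed next); this illustrates the standard FISTA scheme under our naming, not the exact thesis code.

```python
import numpy as np

def fista(x0, grad_theta1, prox_theta_plus, L, n_iter=100):
    """FISTA (Beck & Teboulle, 2009) applied to theta_1 + theta_plus, Eq. (35)."""
    gamma = 1.0 / L            # step size from the Lipschitz constant of grad(theta_1)
    x_prev = x0.copy()
    z = x0.copy()              # z_1 = x_0
    t = 1.0                    # t_1 = 1
    for _ in range(n_iter):
        x = prox_theta_plus(z - gamma * grad_theta1(z), gamma)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = x + ((t - 1.0) / t_next) * (x - x_prev)
        x_prev, t = x, t_next
    return x_prev
```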


For the term prox_{γ;θ_+}, the proximity operator of the sum of θ_2(x) and θ_3(x) is needed. For that, we employ the Dykstra-like Proximal (DP) algorithm (Combettes & Pesquet, 2011), which computes the proximity operator of the sum of two (or more) functions by combining their individual proximity operators in an iterative way. In our case, the problem is

z_x = prox_{θ_+}(x) = argmin_{z∈R^M}  (1/2)‖z − x‖_2^2 + θ_2(z) + θ_3(z).   (36)

Based on the DP algorithm, an optimal z^* can be obtained by alternating between optimizing the variables z, y and updating the variables α, β, which solves the following four sub-problems

y_k = prox_{γ;θ_2}(z_k + α_k),
α_{k+1} = z_k + α_k − y_k,
z_{k+1} = prox_{γ;θ_3}(y_k + β_k),
β_{k+1} = y_k + β_k − z_{k+1}.   (37)

Here we set z_1 = x, α_1 = 0 and β_1 = 0 for parameter initialization. In our case, the term prox_{γ;θ_2}, namely the proximity operator of ‖x‖_{2,1}, is the group soft-thresholding operator (Bach, Jenatton, Mairal, & Obozinski, 2011), defined as

prox_{γ;‖·‖_{2,1}}(x_{n,c}) = max(0, 1 − γ/‖x_n‖_2) x_{n,c},   (38)

which indicates that any group x_n with an ℓ2-norm less than γ will be zeroed.
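A small sketch of the group soft-thresholding operator of Eq. (38), applied group by group over the C channels; names are illustrative.

```python
import numpy as np

def prox_group_l21(x, gamma, C):
    """Eq. (38): scale each C-channel group by max(0, 1 - gamma / ||x_n||_2)."""
    groups = x.reshape(-1, C)
    norms = np.linalg.norm(groups, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - gamma / np.maximum(norms, 1e-12))
    return (groups * scale).ravel()
```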

For the term prox_{γ;θ_3}, we need to solve

prox_{γ;θ_3}(x) = argmin_{z∈R^M}  (1/2)‖z − x‖_2^2 + γ‖Gz‖_{2,1},   (39)

which is a particular case of the more general problem inf_{z,y} {η(z) + γδ(y)} s.t. y = Gz, where η(z) ≡ (1/2)‖z − x‖_2^2 and δ(·) ≡ ‖·‖_{2,1}, so that y ∈ R^{(N−1)C}. Then, we can determine its Lagrangian as L(z, y; µ) = η(z) + γδ(y) + µ·(Gz − y) with µ ∈ R^{(N−1)C}. Inspired by (Alaíz, Barbero, & Dorronsoro, 2013; Rockafellar, 2015), we can transform the equivalent saddle-point problem inf_{z,y}{sup_µ L(z, y; µ)} into a dual problem, as

inf_µ { η^*(−G^⊤µ) + γδ^*(µ) },   (40)


Fig. 7. Illustration of the framework of the proposed method. The first-pass introduces a RPCA-PCP method to estimate the raw foreground signal (the residual R_t), and a new sparse signal recovery with the MCFL regularizer is proposed to obtain foreground masks in the second-pass (X. Liu & Zhao, 2019b) © 2019 IS&T.

In terms of the Fenchel conjugate (Alaíz et al., 2013; Bauschke & Combettes, 2011), the dual problem can be transformed into

min_µ { (1/2)‖G^⊤µ − x‖_2^2 },   (41)

which is quadratic with simple convex constraints (Alaíz et al., 2013) and can be solved by the projected gradient method. Thus, following from the condition 0 = ∇_z L = z_x − x + G^⊤µ^*, the proximity operator of (39) can be recovered from the dual solution µ^* through the equality (Alaíz et al., 2013)

z_x = prox_{γ;θ_3}(x) = x − G^⊤µ^*.   (42)
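Combining the two proximity operators, a sketch of the Dykstra-like alternation of Eq. (37); prox_theta2 is the group soft-thresholding above and prox_theta3 stands for the dual projected-gradient solver of Eqs. (40)-(42), both passed in as callables under our naming.

```python
import numpy as np

def dykstra_prox(x, prox_theta2, prox_theta3, n_iter=50):
    """Dykstra-like proximal algorithm for prox of theta_2 + theta_3, Eq. (37)."""
    z = x.copy()                  # z_1 = x
    alpha = np.zeros_like(x)      # alpha_1 = 0
    beta = np.zeros_like(x)       # beta_1 = 0
    for _ in range(n_iter):
        y = prox_theta2(z + alpha)
        alpha = z + alpha - y
        z = prox_theta3(y + beta)
        beta = y + beta - z
    return z
```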

2.4.3 Two-pass framework for motion detection

We propose a two-pass framework for motion detection. The framework is illustrated in Fig. 7. In the first-pass, a low-rank and sparse matrix decomposition is introduced. In RPCA (Wright et al., 2009), Wright et al. considered background subtraction from the viewpoint of a matrix decomposition problem, which can be expressed as

min_{B,F}  ‖B‖_* + κ‖F‖_1   s.t.  D = B + F,   (43)


Fig. 8. Illustration of the matrix decomposition (first-pass) results on temporal slices Y-T and X-T: (a) frames; (b) Y-T slice D (x=78); (c) background B; (d) foreground F; (e) X-T slice D (y=100); (f) background B; (g) foreground F (X. Liu & Zhao, 2019b) © 2019 IS&T.

where D ∈ R^{s×p} is the observed video matrix formed by stacking p frames, s is the size of a frame, and κ is a regularization parameter. B and F denote the background matrix and the foreground matrix, respectively. It is assumed that the background images are linearly correlated with each other, forming a low-rank matrix B (‖·‖_* is the nuclear norm). The ℓ1-norm is employed to constrain the foreground, since these regions should form a sparse matrix with a small fraction of non-zero entries. However, this method ignores the temporal continuity of foreground pixels.
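For reference, a minimal sketch of one standard way of solving Eq. (43), the inexact augmented Lagrange multiplier (ALM) scheme with singular-value thresholding; the parameter choices are common defaults and not necessarily those used for the first-pass here.

```python
import numpy as np

def svt(M, tau):
    """Singular-value thresholding: proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def soft(M, tau):
    """Elementwise soft-thresholding: proximal operator of the l1 norm."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def rpca_pcp(D, kappa=None, tol=1e-7, max_iter=500):
    """Inexact-ALM solver for  min ||B||_* + kappa ||F||_1  s.t.  D = B + F  (Eq. (43))."""
    m, n = D.shape
    if kappa is None:
        kappa = 1.0 / np.sqrt(max(m, n))      # common default weight
    mu = 0.25 * m * n / (np.abs(D).sum() + 1e-12)
    Y = np.zeros_like(D)                      # Lagrange multipliers
    B = np.zeros_like(D)
    F = np.zeros_like(D)
    for _ in range(max_iter):
        B = svt(D - F + Y / mu, 1.0 / mu)     # low-rank (background) update
        F = soft(D - B + Y / mu, kappa / mu)  # sparse (foreground) update
        R = D - B - F
        Y += mu * R                           # dual ascent on the constraint
        if np.linalg.norm(R) <= tol * max(np.linalg.norm(D), 1.0):
            break
    return B, F
```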

Inspired by (Xue, Guo, & Cao, 2012), we stack the temporal slices (over T frames) along X-T (D ∈ R^{h×T}) and Y-T (D ∈ R^{w×T}) as the matrices D. Similarly to Eq. (43), D can be decomposed into the low-rank part B representing the background and the sparse component F corresponding to the moving objects in the foreground. As illustrated in Fig. 8, since background motion is usually smaller and more regular than foreground object motion, a foreground object forms a trajectory distinct from the background in a temporal slice on the X-T and Y-T planes.

The motion matrices obtained from the X-T and Y-T slices (planes) are integrated as the residual R_t, which is the input of the second-pass. Then, in the second-pass, we utilize the proposed sparse signal recovery with the MCFL regularizer to segment the foreground masks.


Fig. 9. Detected foreground results on videos from the I2R (L. Li et al., 2004) data-set: (a) frame; (b) ground truth; (c) STGS (ours); (d) MCFL (ours); (e) PCP; (f) DGS; (g) LSD; (h) LBD; (i) DEC; (j) GOS; (k) GFL; (l) TV; (m) ORM; (n) IPC (X. Liu et al., 2018) © 2018 IEEE.


Fig. 10. Detected foreground results on videos from the CDnet (Goyette et al., 2012) data-set: (a) frame; (b) ground truth; (c) STGS (ours); (d) MCFL (ours); (e) PCP; (f) DGS; (g) LSD; (h) LBD; (i) DEC; (j) GOS; (k) GFL; (l) TV; (m) ORM; (n) IPC (X. Liu et al., 2018) © 2018 IEEE.


2.5 Experimental results and analysis

In order to verify the performance of the proposed algorithms, comparison experiments were conducted qualitatively and quantitatively on videos from the I2R data-set (L. Li et al., 2004) and the CDnet 2012 data-set (Goyette et al., 2012).

Qualitatively, the proposed STGS and MCFL were compared with 10 state-of-the-art methods, including RPCA-PCP (PCP) (Wright et al., 2009), LBD (Guyon et al., 2012), DGS (J. Huang et al., 2009), GOSUS (GOS) (J. Xu et al., 2013), DECOLOR (DEC) (Zhou et al., 2013), TV-RPCA (TV) (X. Guo et al., 2014), LSD (X. Liu et al., 2015), GFL (Xin et al., 2015), incPCP (IPC) (Rodriguez & Wohlberg, 2014, 2015), and OR-PCA with MRF (ORM) (Javed et al., 2014). PCP, LBD, GOSUS, DECOLOR, TV-RPCA, LSD, GFL, incPCP, and ORM are RPCA-based models. GOSUS, incPCP, ORM and DGS are incremental (online) models, whereas PCP, LBD, DECOLOR, TV-RPCA, GFL and LSD are batch models. DGS is a sparse signal recovery-based method. For all compared algorithms, the results were produced with the implementations released by the authors, using the default parameter settings in their code. Apart from ORM, which utilizes a 5×5 median filter to post-process the binary foreground mask (Javed et al., 2014), none of the methods performs any further processing steps (e.g., morphological operations).

Firstly, as shown in Fig. 9, we present the detected foreground masks on sequences from the I2R data-set (L. Li et al., 2004). It includes nine videos: Bootstrap, Campus, Curtain, Escalator, Fountain, Hall, Lobby, Shopping Mall, and Water Surface. These videos cover a wide range of challenging scenarios: highly dynamic backgrounds, outdoor wild environments, sudden light variations, etc. The first three rows in Fig. 9 show indoor scenes where people are walking in and out. As noted in Section 2.4, a key difference between the proposed STGS and DGS is the assumption about the availability of training sequences with/without foreground objects. Typically, sparse signal recovery-based methods need a set of background frames without any foreground, which is not always available for the surveillance of crowded scenes, making it hard to recover an accurate background. In contrast, RPCA-based methods can obtain a clean background from occluded data. However, due to the smoothness constraint imposed on the foreground, DECOLOR always produces a high number of false alarms. LBD, PCP, and incPCP fail to obtain complete foregrounds without a group or structured sparsity constraint. GOSUS, GFL, and ORM cannot obtain complete foreground results. The proposed MCFL and STGS achieved a performance similar to TV-RPCA,


and were found to be better than the others. The next five rows of Fig. 9 show scenes with dynamic backgrounds, caused by the motion of tree branches, a curtain, a fountain, a water surface, and an escalator. We can see that GOSUS, PCP, LBD, TV-RPCA, and incPCP find it difficult to suppress false positives. DECOLOR, PCP, and incPCP fail to segment the person in the “WaterSurface”, and over-smoothed results are produced by DECOLOR in the “Curtain”. GFL, with a single-channel model using the fused Lasso, failed to detect the people in the “Campus” and “Fountain”. In contrast to GFL, the proposed multi-channel fused Lasso clearly improves the detection results. In the “Escalator”, the background with working escalators is very hard to remove; moreover, the background is challenging to model when there is a steady stream of people on the escalators. The detected results show that our methods (MCFL and STGS) and ORM provided better foreground masks than the others. As shown in Fig. 9, the last row is a video with a light being turned on and off. It is noted that DGS and GOSUS had difficulties dealing with such sudden light changes; it is not surprising that training data covering the background variations are needed for these sparse signal recovery and batch initialization-based approaches. It can be seen that the proposed methods can remove dynamic backgrounds effectively and obtain the silhouettes completely.

Secondly, we evaluated the proposed MCFL and STGS on another widely used data-set, CDnet 2012 (Goyette et al., 2012). This dataset includes 90000 frames in 31 sequences covering six categories: Baseline (BL), Camera Jitter (CJ), Dynamic Background (DB), Intermittent Object Motion (IM), Shadow (SD) and Thermal (TM). In Fig. 10, we illustrate the foreground results on ten sequences, which represent challenging scenes for the task of video surveillance. The first row shows the foreground masks for the “Highway” sequence from the Baseline category. Obviously, all the compared methods performed well on this relatively unchallenging sequence. The second and third rows show the foregrounds of the “Boulevard” and “Traffic” sequences (Camera Jitter category). We can see that PCP, LBD, TV-RPCA, and incPCP are sensitive to the jitter of the camera and produce many false alarms. DGS and GFL could not obtain complete masks of the car in the “Traffic”, and DGS also finds it hard to handle the “Boulevard”. GOSUS can tolerate the camera motion, but it yields incomplete masks in the “Traffic”. It can be seen that our methods (MCFL and STGS), ORM and DECOLOR achieve almost perfect foreground masks with only a few false positives. The next two rows in Fig. 10 show the results for the “Canoe” and “Fall” sequences. It seems that PCP, TV-RPCA, and incPCP find it difficult to deal with dynamic backgrounds with


water and tree movements. LBD also struggles to suppress the background dynamics. In fact, the ℓ2,1-norm is employed in LBD to detect the foreground with column-wise sparsity, but this constraint still has no structured or group prior with which to model the foreground outliers. DECOLOR could segment most of the foreground; however, it yielded many false positives due to the smoothness constraint imposed on the foreground, and it struggles to detect the “canoe” correctly. Although GOSUS could restrain the dynamic motion of the background, it missed a lot of true foreground pixels. It is notable that ORM was one of the top-performing algorithms according to the evaluation results on CDnet (Goyette et al., 2012). The proposed STGS and ORM were able to filter out most of the background motion, while ORM lost some sensitivity in detecting the foreground completely, as shown in the “Fall”. We would like to point out a weakness of the proposed MCFL: as shown in the “Fall”, MCFL cannot restrain background movements entirely when they take up a large portion of the frame. The sixth and seventh rows of Fig. 10 are results from the Intermittent Object Motion category, which represents cases in which background objects move away, or foreground objects stop for a short while and then move away. It can be seen that the GOSUS, LBD, DECOLOR, and incPCP methods failed to detect the abandoned box in the “Sofa”. Additionally, in the “Parking” video, TV-RPCA, GFL, DGS and GOSUS lost the truck (foreground) in the parking lot when the car (background object) moved away. As shown in the next two rows of Fig. 10, PCP and ORM produced a lot of false alarms caused by the shadows in the “Backdoor” and “Bus Station” sequences. The proposed methods also could not ignore shadows completely; this is because we did not employ any sophisticated features to classify shadows. The last row shows a thermal video captured by a far-infrared camera. Our methods (MCFL and STGS) were able to segment such a large foreground object and achieve more complete foreground masks than the other methods. It is noted that DGS misses the foreground in the “Corridor”. The reason is that DGS collects the atoms of the background dictionary directly from n preceding images, which may contain foreground objects; clearly, this leads to poor background recovery.

From the above we can conclude that, qualitatively, the results of the proposed methods are the closest to the ground-truth references. In order to better understand the performance of the proposed approaches, we conducted a quantitative evaluation in terms of the F-measure, the harmonic mean of precision and recall, defined as 2·precision·recall/(precision + recall).
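For completeness, a small sketch of the F-measure computed from a binary detection mask and its ground truth; names are illustrative.

```python
import numpy as np

def f_measure(pred, gt):
    """F = 2 * precision * recall / (precision + recall) for binary masks."""
    tp = np.logical_and(pred, gt).sum()
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(gt.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```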


Table 1. Performance of the F-Measure (%) on the I2R data-set (L. Li et al., 2004) (best: bold, second best: underline).

Methods                                                F-Measure
DGS (J. Huang et al., 2009)                            57.35
PCP (Wright et al., 2009)                              58.74
LBD (Guyon et al., 2012)                               60.25
DEC (Zhou et al., 2013)                                73.08
TV (X. Guo et al., 2014)                               74.89
LSD (X. Liu et al., 2015)                              75.94
TLS (W. Hu et al., 2017)                               76.22
SM (Pang et al., 2016)                                 77.11
GOS (J. Xu et al., 2013)                               77.67
STGS (X. Liu et al., 2018) (Ours)                      77.81
GFL (Xin et al., 2015)∗                                85.15
MCFL (X. Liu & Zhao, 2019b) (Ours)                     86.39

ORM (Javed et al., 2014)†                              79.46
SOD (Javed, Oh, Sobral, et al., 2015)§                 87.51
DSP (Ebadi & Izquierdo, 2016)§                         88.73
OMM (Javed, Oh, Bouwmans, & Jung, 2015)†               89.21

∗ The F-Measure score reported in (W. Hu et al., 2017) is 70.01.
† As reported in (Javed, Oh, Bouwmans, & Jung, 2015; Javed et al., 2014), a 5×5 median filter was applied as a post-processing step on the binary foreground mask.
§ As reported in (Ebadi & Izquierdo, 2016; Javed, Oh, Sobral, et al., 2015), superpixels were utilized as the input of their models.

Here, we compared additional state-of-the-art methods, including two online algorithms, OR-PCA with MRF via multiple features (OMM) (Javed, Oh, Bouwmans, & Jung, 2015) and SODSC (SOD) (Javed, Oh, Sobral, et al., 2015), and three further algorithms, DSPSS (DSP) (Ebadi & Izquierdo, 2016), SM-RPCA (SM) (Pang et al., 2016), and TLSFSD (TLS) (W. Hu et al., 2017). As noted in SODSC (Javed, Oh, Sobral, et al., 2015) and DSPSS (Ebadi & Izquierdo, 2016), superpixels of the frames were used as the input of their algorithms, and in OMM (Javed, Oh, Bouwmans, & Jung, 2015), a 5×5 median filter was utilized as a post-processing step to refine the binary foreground mask. These pre- or post-processing steps can undoubtedly improve foreground detection. In fact, most of the compared algorithms, like ours, only use the pixel intensities and do not employ any refinement procedures to improve precision. Table 1 gives the average F-measure scores of the methods on the I2R data-set (L. Li et al., 2004). We can see that, among the eleven methods that do not use any pre- or post-processing, the proposed MCFL achieves the best score and STGS is the second runner-up. Under the same conditions, MCFL yields the best average score on the CDnet (Goyette et al., 2012) data-set, as shown in Table 2, and the proposed STGS is ranked second. We cannot report the results of TLSFSD (TLS) (W. Hu et al., 2017) and SM-RPCA


Table 2. Performance of the F-Measure (%) on the CDnet data-set (Goyette et al., 2012) (best: bold, second best: underline).

Methods                                                F-Measure
TLS (W. Hu et al., 2017)                               –
SM (Pang et al., 2016)                                 –
DGS (J. Huang et al., 2009)                            68.19
LBD (Guyon et al., 2012)                               70.91
PCP (Wright et al., 2009)                              72.86
GFL (Xin et al., 2015)                                 72.96
TV (X. Guo et al., 2014)                               73.42
GOS (J. Xu et al., 2013)                               75.36
DEC (Zhou et al., 2013)                                75.70
LSD (X. Liu et al., 2015)                              77.93
STGS (X. Liu et al., 2018) (Ours)                      81.22
MCFL (X. Liu & Zhao, 2019b) (Ours)                     83.81

ORM (Javed et al., 2014)†                              73.52
OMM (Javed, Oh, Bouwmans, & Jung, 2015)†               80.01
SOD (Javed, Oh, Sobral, et al., 2015)§                 81.39
DSP (Ebadi & Izquierdo, 2016)§                         86.26

† As reported in (Javed, Oh, Bouwmans, & Jung, 2015; Javed et al., 2014), a 5×5 median filter was applied as a post-processing step on the binary foreground mask.
§ As reported in (Ebadi & Izquierdo, 2016; Javed, Oh, Sobral, et al., 2015), superpixels were utilized as the input of their models.

(SM) (Pang et al., 2016), since the source code of these methods is not publicly available.

2.6 Conclusion

In this chapter, we presented two novel methods for human motion (foreground) detection, which fall into the category of compressive sensing-based methods. First, we proposed a novel method for foreground detection using sparse signal recovery, formulating the problem in a unified framework named Spatio-Temporal Group Sparsity recovery (STGS). The main contributions are summarized as follows:

1. We propose a new formulation of foreground detection via a fast greedy pursuit algorithm. It explicitly considers group properties of sparse outliers (foreground) in both spatial and temporal domains for better sparse recovery, instead of merely considering the spatial domain, as conventional methods do.

2. We formulate the background modelling as a dictionary learning problem, so that a training sequence without any foreground is not required. Furthermore, a random update policy is employed to extend the time windows covered by the background


dictionary; therefore, it can deal with a wide range of events in the background scene, such as dynamic background motion and illumination variations. This background model can respond rapidly to sudden background changes.

Second, we take into account a multi-channel group-sparsity prior on the foreground and propose a Multi-Channel Fused Lasso (MCFL) regularizer, which enforces the multi-channel foreground to be piece-wise constant at the group level, with adjacent groups equal across all channels at the same time. The main contributions are summarized as follows:

1. We propose a new formulation of sparse signal recovery via a Multi-Channel Fused Lasso (MCFL) regularizer. It explicitly reconstructs multi-channel foreground signals with a spatial structure that reflects smooth changes in group features.

2. The experimental results on two benchmarks show that the proposed method works well in a wide range of complex environments and achieves state-of-the-art performance for background subtraction.


3 Riemannian trajectory analysis for gesture recognition

3D skeleton-based gesture recognition has attracted attention due to its invariance to camera viewpoint and video background dynamics. Many methods use absolute coordinates to extract human motion features. Nevertheless, gestures are independent of the performer's location, and the features should be invariant to the performer's body size. In addition, the issue of temporal dynamics can significantly distort the distance metric when comparing and identifying gestures. In this chapter, we represent gesture skeletal sequences as trajectories on a Riemannian manifold and present our findings in Paper III for human body gesture recognition.

3.1 Introduction and motivation

Human body gesture analysis is a key area in computer vision research and has been widely used in applications such as human-computer interfaces, artificial companions, business negotiations, and gaming. 3D skeletal data is gaining popularity since it simplifies the task by moving from monocular RGB cameras to more sophisticated sensors, such as the Kinect, which can explicitly localize gesture performers and produce the trajectories of human skeletal joints. Compared to RGB input, skeletal data is robust to background dynamics and invariant to camera view.

In recent years, a large number of 3D skeleton-based models have been proposed, ranging from handcrafted feature representations, such as the histogram of 3D joints (HOJ3D) (Xia, Chen, & Aggarwal, 2012), EigenJoints by principal component analysis (PCA) (X. Yang & Tian, 2012), manifold representations (Amor, Su, & Srivastava, 2016; Devanne et al., 2015; Gong, Medioni, & Zhao, 2014; Vemulapalli, Arrate, & Chellappa, 2014), discriminative key-frames (Zanfir, Leordeanu, & Sminchisescu, 2013), the histogram of oriented 4D normals (HON4D) (Oreifej & Liu, 2013), the sequence of most informative joints (SMIJ) (Ofli, Chaudhry, Kurillo, Vidal, & Bajcsy, 2014), and rotation and relative velocity (RRV) (Y. Guo, Li, & Shao, 2017); to various forms of parametric approaches such as the actionlets ensemble (J. Wang, Liu, Wu, & Yuan, 2012, 2014), the maximum entropy Markov model (MEMM) (Sung, Ponce, Selman, & Saxena, 2012), latent structural SVM (pose-based) (Packer, Saenko, & Koller, 2012),


hidden Markov models (HMM) (Lv & Nevatia, 2006; Piyathilaka & Kodagoda, 2013), conditional random fields (CRF) (Koppula & Saxena, 2013, 2016), latent Dirichlet allocation (LDA) (C. Wu, Zhang, Savarese, & Saxena, 2015), naive Bayes nearest neighbour (NBNN) (Weng, Weng, & Yuan, 2017), and latent max-margin multitask learning (LM3TL) (Y. Yang et al., 2017); as well as numerous deep learning methods, e.g., deep belief networks (DBN) (D. Wu et al., 2016; D. Wu & Shao, 2014), convolutional neural networks (CNN) (Ke, Bennamoun, An, Sohel, & Boussaid, 2017; Neverova, Wolf, Taylor, & Nebout, 2016), and recurrent neural networks (RNN) (Du, Wang, & Wang, 2015; Y. Li et al., 2016; J. Liu, Shahroudy, Xu, & Wang, 2016; J. Liu, Wang, Hu, Duan, & Kot, 2017; Mahasseni & Todorovic, 2016; Shahroudy, Liu, Ng, & Wang, 2016; Zhu et al., 2016). Rather than covering all works exhaustively, we refer interested readers to recent surveys (Han, Reily, Hoff, & Zhang, 2017; Presti & La Cascia, 2016).

While there have been significant advances in this area, accurate recognition of human gestures in unconstrained settings is still challenging. There are two problems which need to be carefully discussed:

1. One problem in human gesture recognition is the feature representation used to capture the variability of the 3D human body (skeleton) and its dynamics. Existing methods commonly use absolute (real-world) coordinates to extract human motion features. Nevertheless, activities are independent of the performer's position, and the features should be invariant to the lengths of body parts (the size of the performer).

2. Another problem in human gesture recognition is the temporal dynamics. For example, even the same gesture performed by the same person can occur at different speeds and with different starting/ending points, let alone when performed by different people. Therefore, the variance within a category of human behaviour can be very large, and if temporal dynamics are ignored, the accuracy of recognition will undoubtedly deteriorate.

To handle the first issue, a common solution is to transform all 3D joint coordinates from the world coordinate system to a performer-centric coordinate system, for example by placing the hip centre at the origin. However, its success heavily depends on the precise positioning of the specific point (the human hip centre). Another scheme is to consider the relative geometry between different body parts (bones), such as the Lie group representation (Vemulapalli et al., 2014), which uses rotations and translations (rigid-body transformations) to represent the 3D geometric relationships. Nevertheless, the translation is not a scale-invariant feature, since the size of the skeleton varies from subject to subject. In (Vemulapalli et al., 2014), the authors selected one of the skeletons from the training


sets as a reference, but this empirical operation can hardly normalize the skeletal data to explicitly handle scale variations.

To solve the second problem, a typical approach is to utilize a graphical model to describe the presence of sub-states (events), where time series are reorganized by a sequential prototype, and the temporal dynamics of gestures are trained as a set of transitions among these prototypes (Amor et al., 2016). Representative models include the hidden Markov model (HMM) (D. Wu et al., 2016; D. Wu & Shao, 2014) and the conditional random field (CRF) (Koppula & Saxena, 2016). Nevertheless, the input sequences of an HMM have to be segmented in advance, according to some specific clustering metrics or discriminative states, which in itself is a challenging problem. Recently, many studies (Du et al., 2015; Y. Li et al., 2016; J. Liu et al., 2016, 2017; Shahroudy et al., 2016) have addressed the issue of temporal dynamics via recurrent neural networks (RNN), such as the long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997). The RNN is a powerful framework for modelling sequential data, but it is still challenging to learn the information of an entire sequence with many sub-states. In fact, the most common solution to temporal dynamics is to resort to Dynamic Time Warping (DTW) (Y. Guo et al., 2017; Vemulapalli et al., 2014), which needs to select a nominal temporal alignment, after which all the sequences of a category are warped to that alignment. Obviously, the performance of DTW depends highly on the selection of the reference sequence, and such a reference is commonly chosen based on experience.

3.2 Related methods

For addressing the temporal dynamics problem of human activity analysis (including actions and gestures) on 3D skeleton data, a large body of work has been proposed. We provide a categorized overview of the related literature, mainly on local temporal modeling, generative models, recurrent neural networks, and manifold space-based models.

3.2.1 Local temporal modeling

To account for temporal dynamics, a widely used solution is dynamic time warping (DTW), as in the Lie group (Vemulapalli et al., 2014), RRV (Y. Guo et al., 2017), and LM3TL (Y. Yang et al., 2017). DTW resorts to seeking an optimal temporal alignment and then warps all sequences of the same category to that reference.


Finally, a classifier such as an SVM is typically employed to complete the recognition task. Nevertheless, the performance of DTW is highly related to the metric used to measure the similarity of frames (the feature of a frame). Furthermore, for periodic gestures, DTW may yield large temporal misalignments, which may encumber the accuracy of classification (J. Wang et al., 2012). Wang et al. (J. Wang et al., 2012, 2014) introduced the local occupancy pattern (LOP) to represent 3D human activities, and proposed the Fourier temporal pyramid (FTP) to capture local temporal patterns, which is more robust to noise and temporal misalignments than DTW. However, FTP is restricted by the width of the time window and can only use limited contextual information (Du et al., 2015). In (Zanfir et al., 2013), Zanfir et al. proposed a moving pose descriptor by integrating the normalized positions of joints from discriminative key-frames, as well as their velocities and accelerations. Then, a non-parametric K-nearest-neighbors (KNN) classifier is employed for action classification. Leveraging key frames can help to exclude frames which are less relevant to the underlying gestures, but in comparison to holistic-based approaches, the loss of essential information is inevitable. In these methods, the local temporal dynamics are represented within a certain time window, so they cannot globally capture the temporal evolution of gestures (Du et al., 2015).

3.2.2 Generative models

One widely used scheme to deal with the issue of temporal dynamics is generative models, where time series are reorganized by a sequential prototype. Thus, the temporal dynamics of gestures are learned as a set of transitions among these prototypes (Amor et al., 2016). A representative work is the hidden Markov model (HMM). It can globally model the temporal evolution of gestures, which is more robust to temporal warping of the sequence. This algorithm has been utilized by (Lv & Nevatia, 2006; Piyathilaka & Kodagoda, 2013; D. Wu et al., 2016; D. Wu & Shao, 2014; Xia et al., 2012). However, in an HMM the input sequences have to be segmented beforehand, which in itself is a challenging problem. Commonly, HMM-based methods divide each sequence into a fixed number of equal-length segments. However, this may be hard for complex gestures composed of phases with diverse temporal durations. Koppula and Saxena (Koppula & Saxena, 2013, 2016) modeled human-object interactions with a spatio-temporal conditional random field (CRF), which is also a popular generative model. Nevertheless, the structure of the graphs has to be fully known, making this method highly dependent on the quality of annotated video data. Wu et al. (C. Wu et


al., 2015) presented a latent Dirichlet allocation (LDA) based method to model the co-occurrence and temporal relations among short actions (states) in long-term videos, and the authors employed K-means-based clustering to model action states and learn activities as sequences of these states. It is noted that a standard K-means segmentation method lacks temporal information, and thus its performance is usually inferior to traditional transition state clustering (Murali et al., 2016). Actually, existing methods always face the same difficulty in determining the accurate states from observations without careful selection of the features, which undermines the performance of such generative models (J. Wang et al., 2012).

3.2.3 Recurrent neural network

Another popular technique for addressing the issue of temporal dynamics is recurrent neural networks (RNNs) (Du et al., 2015; Y. Li et al., 2016; J. Liu et al., 2016, 2017; Mahasseni & Todorovic, 2016; Shahroudy et al., 2016; Zhu et al., 2016). Specifically, the long short-term memory (LSTM) designs a suite of schemes to memorize contextual information observed from previous inputs, which enables tracking long-term temporal dependencies. In (Du et al., 2015; Hochreiter & Schmidhuber, 1997), Du et al. presented bi-directional LSTMs for action recognition, where the entire skeleton was divided into five groups of joints and each group was fed into a specific LSTM subnetwork. The system then fused the outputs of these subnetworks hierarchically and finally fed them into another set of higher-level LSTMs to represent the global body movements. In (J. Liu et al., 2016), Liu et al. incorporated a trust gate into the framework of LSTM to learn the reliability of the inputs and accordingly adjust their confidence when updating the context information. Zhu et al. (Zhu et al., 2016) introduced a sparse regularization term into the cost function of LSTM, enabling the network to learn the co-occurrence of discriminative skeleton joints. In (Y. Li et al., 2016), Li et al. employed a Gaussian-like curve to represent the confidences of the starting and ending frames of actions, and proposed a joint classification-regression LSTM to handle the online action detection and recognition problem. In (Mahasseni & Todorovic, 2016), an encoder/decoder LSTM framework is proposed for recognizing actions. The encoder is trained in an unsupervised manner using the 3D skeleton data. Then, the manifold is utilized to regularize the supervised learning of the decoder LSTM for RGB data based recognition. In the recent work (J. Liu et al., 2017), Liu et al.

presented a global context-aware attention LSTM (GCA-LSTM), which aims to handle


LSTM’s limitation in keeping the global contextual information. LSTM is powerful in modeling sequential data, but it is still hard to preserve the information of an entire sequence with many states (Ke et al., 2017; Weston, Chopra, & Bordes, 2014). In addition, compared with the progress of data augmentation technologies for RGB data, studies on skeletal data are at an early stage. Therefore, it is difficult to learn the parameters of an LSTM from limited data (Shahroudy et al., 2016; J. Wang et al., 2012).

3.2.4 Manifold space

Recently, manifold-based methods have shown promising performance regarding action and gesture recognition. A popular model is the Lie group (Vemulapalli et al., 2014), which employs the special Euclidean group SE(3) to represent the geometric relationships between body parts. The authors utilized DTW and the Fourier temporal pyramid (FTP) to deal with the temporal dynamics issues of gesture recognition. However, as reported above, the success of DTW heavily relies on the selection of the nominal temporal alignment, and FTP is restricted by the width of the time window and can only utilize limited contextual information (Du et al., 2015). Gong et al. (Gong et al., 2014) proposed a dynamic manifold warping (DMW) method to compute the motion similarity among video sequences, which is an adaptation of DTW methods to the manifold space. Anirudh et al. (Anirudh, Turaga, Su, & Srivastava, 2015) incorporated the transported square-root velocity fields (TSRVF) (Su, Kurtek, Klassen, & Srivastava, 2014) to analyze trajectories (gesture sequences) lying on Lie groups, so that the distance between two trajectories is invariant to identical time warping. Finally, principal component analysis (PCA) is used to reduce the dimension of the feature vectors. However, PCA is an unsupervised model, and thus the discriminability of the dictionary cannot be boosted through labeled training. In terms of the square root velocity (SRV) framework (Srivastava, Klassen, Joshi, & Jermyn, 2011), in (Devanne et al., 2015) trajectories are transported to a reference tangent space attached to the Kendall's shape space at a fixed point. This operation may introduce distortions when points are not close to the fixed reference point. Another branch is based on kernel functions to embed the Riemannian manifolds into a reproducing kernel Hilbert space (RKHS) (M. Harandi & Salzmann, 2015; M. T. Harandi, Salzmann, & Hartley, 2014). However, due to the computationally intensive kernel functions, the input manifold of symmetric positive definite (SPD) matrices is from a covariance descriptor calculated using only a few skeleton joints.


3.3 Lie group based representation

One important issue in a recognition task is the choice of representation models to capture the variability of the 3D skeleton and its dynamics, within and across gesture classes. To the best of our knowledge, most of the existing methods utilize absolute coordinates to represent human motion features. However, gestures are independent of the performer's location, so it is necessary to transform raw skeletal data from real-world coordinates to human-centered coordinates.

Inspired by rigid body kinematics (Murray, Li, Sastry, & Sastry, 1994), the relative geometry between different body parts (Vemulapalli et al., 2014) is introduced, since the relative geometry has the property of view-invariance, which guarantees the uniqueness of the motion representation and can thus provide a better reflection of human gestures than absolute locations and absolute motion.


Fig. 11. (a) Illustration of a skeleton consisting of 20 joints and 19 bones, (b) Representation of bone bm in the local coordinate system of bn, (c) Representation of bone bn in the local coordinate system of bm (X. Liu & Zhao, 2019a) © 2019, Springer, Cham.

Mathematically, any rigid body displacement can be realized by a rotation about an axis combined with a translation parallel to that axis. Such 3D rigid body displacements form SE(3), the special Euclidean group in three dimensions (Murray et al., 1994). SE(3) can be identified with the space of 4×4 matrices of the form


\[ P(R,\vec{v}) = \begin{bmatrix} R & \vec{v} \\ 0 & 1 \end{bmatrix}, \quad (44) \]

where R ∈ SO(3) is a point in the special orthogonal group SO(3) and denotes the rotation matrix, and ~v ∈ R3 denotes the translation vector.

The human skeleton can be modeled by an articulated system of rigid segments connected by joints. As such, let S = (J, B) be a skeleton, where J = {j1, ..., jN} indicates the set of body joints, and B = {b1, ..., bM} indicates the set of body bones (oriented edges). As studied in (Vemulapalli et al., 2014), the relative geometry between a pair of body parts (bones) can be represented as a point in SE(3). More specifically, given a pair of bones bm and bn, their relative geometry can be represented in a local coordinate system attached to the other bone (Vemulapalli et al., 2014). Let bi1 ∈ R3 and bi2 ∈ R3 denote the starting and end points of bone bi, respectively. The local coordinate system of bone bn is calculated by rotating (with minimum rotation) and translating the global coordinate system so that bn1 acts as the origin and bn coincides with the x-axis; Fig. 11 explains this pictorially. As such, at time t, in the representation of bone bm in the local coordinate system of bn (Fig. 11 (b)), the starting point b^n_{m1}(t) ∈ R3 and end point b^n_{m2}(t) ∈ R3 are given by

\[ \begin{bmatrix} b^n_{m1}(t) & b^n_{m2}(t) \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} R_{m,n}(t) & \vec{v}_{m,n}(t) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 0 & l_m \\ 0 & 0 \\ 0 & 0 \\ 1 & 1 \end{bmatrix}, \quad (45) \]

where R_{m,n}(t) and ~v_{m,n}(t) denote the rotation and translation measured in the local coordinate system attached to bn, respectively, and l_m is the length of bm. In the same way, the representation of bone bn in the local coordinate system of bm can be obtained from R_{n,m}(t), ~v_{n,m}(t), and l_n (Fig. 11 (c)). According to the theory of rigid body kinematics, the lengths of bones (body parts) do not vary with time. Therefore, the relative geometry


of bm and bn at time t can be described by

\[ P_{m,n}(t) = \begin{bmatrix} R_{m,n}(t) & \vec{v}_{m,n}(t) \\ 0 & 1 \end{bmatrix} \in SE(3), \qquad P_{n,m}(t) = \begin{bmatrix} R_{n,m}(t) & \vec{v}_{n,m}(t) \\ 0 & 1 \end{bmatrix} \in SE(3). \quad (46) \]

These relative geometries have a natural stability and consistency. For example, if a pair of bones undergoes the same rotation, their relative geometry matrix is not altered. However, one restriction of this motion feature is that the translation ~v is relative to the size of the performer (subject). As we know, it is very important to obtain a scale-invariant skeletal representation for recognition tasks in an unconstrained environment. To remove the skeleton scaling variations, in this chapter we discard the translation from the motion representation, so the relative geometry of bm and bn at time t is described by the rotations R_{m,n}(t) and R_{n,m}(t), expressed as elements of SO(3). Then, letting M denote the number of bones, the resulting feature for an entire human skeleton is interpreted as the relative geometry between all pairs of bones, i.e., a point C(t) = (R_{1,2}(t), R_{2,1}(t), ..., R_{M−1,M}(t), R_{M,M−1}(t)) on the curved product space SO(3)×···×SO(3), where the number of SO(3) factors is 2C²_M, with C²_M the number of two-element combinations of M bones.
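As a concrete illustration of this rotation-only representation, the following minimal Python sketch (not the thesis implementation; the bone list, joint array layout, and helper names are assumptions) builds the local frame of each bone by the minimum rotation taking its direction onto the x-axis and then collects the relative rotations R_{m,n} for all ordered bone pairs.

import numpy as np

def min_rotation(a, b, eps=1e-8):
    """Minimal rotation matrix taking unit vector a onto unit vector b (Rodrigues formula)."""
    v = np.cross(a, b)
    s, c = np.linalg.norm(v), float(np.dot(a, b))
    if s < eps:
        if c > 0:
            return np.eye(3)                               # already aligned
        u = np.array([1.0, 0.0, 0.0]) if abs(a[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
        u = u - a * np.dot(u, a)
        u = u / np.linalg.norm(u)
        return 2.0 * np.outer(u, u) - np.eye(3)            # rotation by pi about an axis orthogonal to a
    K = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
    return np.eye(3) + K + K @ K * ((1.0 - c) / s ** 2)

def skeleton_feature(joints, bones):
    """joints: (N, 3) joint positions at one time step; bones: list of (start, end) joint indices.
    Returns the relative rotations R_{m,n} for all ordered bone pairs, i.e. a point on SO(3)^(2*C(M,2))."""
    x = np.array([1.0, 0.0, 0.0])
    dirs = [(joints[e] - joints[s]) / np.linalg.norm(joints[e] - joints[s]) for s, e in bones]
    frames = [min_rotation(d, x) for d in dirs]            # frame of bone n: rotate its direction onto the x-axis
    feats = []
    for n, Q_n in enumerate(frames):
        for m, d_m in enumerate(dirs):
            if m != n:
                feats.append(min_rotation(x, Q_n @ d_m))   # R_{m,n}: bone m seen in bone n's frame
    return feats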

3.4 TSRVF for Riemannian trajectory analysis

As presented in the last section, a gesture video sequence can be characterized as a trajectory on a Riemannian manifold. Therefore, the gesture recognition task is to compute the similarity between the shapes of trajectories. The basis for these comparability determinations is a distance function on the Riemannian manifold. A few Riemannian metrics (Bauer, Bruveris, & Michor, 2014) have been proposed for addressing this problem, but accurately modeling the temporal dynamics of gesture trajectories is still challenging.

Specifically, let α denote a smooth oriented curve (trajectory) on a Riemannian manifold M, and let ℳ denote the set of all such trajectories: ℳ = {α : [0,1] → M | α is smooth}. Reparameterizations are performed by increasing diffeomorphisms γ : [0,1] → [0,1], and the set of all these orientation-preserving diffeomorphisms is


denoted by Γ = {γ : [0,1] → [0,1]}. In fact, γ plays the role of a time-warping operation, where γ(0) = 0 and γ(1) = 1 so that the end points of the curve are preserved. As such, if α, in the form of time observations α(t1), ..., α(tn), is a trajectory on M, the composition α ◦ γ, in the form of the time-warped trajectory α(γ(t1)), ..., α(γ(tn)), is also a trajectory that goes through the same sequence of points as α but at the evolution rate governed by γ (Su et al., 2014).

For identifying trajectories, a metric is needed to describe the variation of a class of trajectories and to quantify the information contained within a trajectory. A direct and common solution is to calculate a point-wise difference. Since M is a Riemannian manifold, a natural distance dm between points on M (Su et al., 2014) can be employed. Then, for any two trajectories α1, α2 : [0,1] → M, the distance dx between them can be calculated by

\[ d_x(\alpha_1,\alpha_2) = \int_0^1 d_m\big(\alpha_1(t),\alpha_2(t)\big)\, dt. \quad (47) \]

This quantity gives a natural extension of dm from M to M^[0,1]. However, it suffers from the issue that dx(α1, α2) ≠ dx(α1 ◦ γ1, α2 ◦ γ2). As analyzed in Section 3.1, for the task of gesture recognition, temporal dynamics is a central problem that needs to be solved when a trajectory (gesture) α is observed as α ◦ γ, at a random temporal evolution γ. In other words, for arbitrary temporal re-parameterizations γ1, γ2 and arbitrary trajectories α1, α2, a distance d(·, ·) is needed such that

d(α1,α2) = d(α1 ◦ γ1,α2 ◦ γ2). (48)

Thanks to the Square Root Velocity (SRV) framework (Srivastava et al., 2011), the concept of elastic trajectories is particularly well-suited for our goal. In order to extend the original Euclidean-metric-based SRV to the manifold space, Su et al. (Su et al., 2014) proposed the Transported Square-Root Vector Field (TSRVF). Specifically, for a smooth trajectory α ∈ ℳ, the TSRVF is a parallel transport of a scaled velocity vector field of α to a reference point c ∈ M according to

\[ h_\alpha(t) = \frac{\dot{\alpha}(t)_{\alpha(t)\to c}}{\sqrt{|\dot{\alpha}(t)|}} \in T_c(M), \quad (49) \]

where α̇(t) is the velocity vector along the trajectory at time t, α̇(t)_{α(t)→c} is its parallel transport from the point α(t) to c along a geodesic path, | · | denotes the norm related to the Riemannian metric on M, and Tc(M) denotes the tangent space of M at


c. In particular, when |α̇(t)| = 0, hα(t) = 0 ∈ Tc(M). Let H ⊂ Tc(M)^[0,1] be the set of smooth curves in Tc(M) obtained as TSRVFs of trajectories in ℳ, H = {hα | α ∈ ℳ} (Su et al., 2014). By means of the TSRVF, two trajectories such as α1 and α2 can be mapped into the tangent space Tc(M) as two corresponding TSRVFs, hα1 and hα2. The distance between them can be measured by the ℓ2-norm in the usual vector space, as

\[ d_h(h_{\alpha_1}, h_{\alpha_2}) = \sqrt{\int_0^1 \big|h_{\alpha_1}(t) - h_{\alpha_2}(t)\big|^2 dt}. \quad (50) \]

In fact, the main motivation for the TSRVF representation comes from the following fact. If a trajectory α is warped by γ, resulting in α ◦ γ, the TSRVF of α ◦ γ is given by

\[ h_{\alpha\circ\gamma}(t) = h_\alpha(\gamma(t))\sqrt{\dot{\gamma}(t)}. \quad (51) \]

Then, for any α1, α2 ∈ ℳ and γ ∈ Γ, the distance dh satisfies

\[ d_h(h_{\alpha_1\circ\gamma}, h_{\alpha_2\circ\gamma}) = \sqrt{\int_0^1 \big|h_{\alpha_1}(s) - h_{\alpha_2}(s)\big|^2 ds} = d_h(h_{\alpha_1}, h_{\alpha_2}), \quad (52) \]

where s = γ(t). For the proof of this equality, we refer the interested reader to (Srivastava et al., 2011; Su et al., 2014). From the geometric point of view, this equality implies that the action of Γ on H under the ℓ2 metric is by isometries. This enables us to develop a distance that is fully invariant to time warping and to use it to properly register trajectories (Su et al., 2014). Additionally, this invariance to execution rates is crucial for statistical analyses, such as sample means and covariances. Then, we define the equivalence class [hα] (or, equivalently, [α]) to denote the set of all trajectories that are equivalent to a given hα ∈ H (or α ∈ ℳ), as

\[ [h_\alpha] = \big\{ h_{\alpha\circ\gamma} \mid \gamma \in \Gamma \big\}. \quad (53) \]

Obviously, such an equivalence class [hα] (or [α]) is associated with a category of gesture. Under this scheme, the comparison of two trajectories is performed by comparing their equivalence classes. Namely, an optimal reparameterization γ* needs to be obtained to minimize the cost function dh(hα1, hα2◦γ). Let H/∼ be the corresponding quotient space, which can be bijectively identified with the set ℳ/∼ using [hα] ↦ [α] (Anirudh et al., 2015). The distance ds on H/∼ (or ℳ/∼) is the


shortest dh distance between equivalence classes in H (Su et al., 2014), given by

\[ d_s([\alpha_1],[\alpha_2]) \equiv d_s([h_{\alpha_1}],[h_{\alpha_2}]) = \inf_{\gamma\in\Gamma} d_h(h_{\alpha_1}, h_{\alpha_2\circ\gamma}) = \inf_{\gamma\in\Gamma} \left( \int_0^1 \big| h_{\alpha_1}(t) - h_{\alpha_2}(\gamma(t))\sqrt{\dot{\gamma}(t)} \big|^2 dt \right)^{1/2}. \quad (54) \]

In practice, the minimization over Γ is solved using dynamic programming. In this chapter, we provide only a brief description of SRV and TSRVF; interested readers are referred to the original papers (Srivastava et al., 2011; Su et al., 2014).
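To make the alignment step concrete, the sketch below (an illustrative assumption, not the thesis code) approximates the infimum in Eq. (54) with a standard dynamic-programming (DTW-style) recursion over two sampled TSRVFs; for simplicity it drops the √γ̇ Jacobian factor, so it only demonstrates the alignment search rather than the exact elastic distance.

import numpy as np

def align_tsrvf(h1, h2):
    """h1: (T1, d), h2: (T2, d) sampled TSRVFs in the common tangent space T_c(M).
    Returns an approximate squared alignment cost between the two trajectories."""
    T1, T2 = len(h1), len(h2)
    cost = np.sum((h1[:, None, :] - h2[None, :, :]) ** 2, axis=2)   # pairwise frame costs
    acc = np.full((T1, T2), np.inf)
    acc[0, 0] = cost[0, 0]
    for i in range(T1):
        for j in range(T2):
            if i == 0 and j == 0:
                continue
            prev = min(acc[i - 1, j] if i > 0 else np.inf,
                       acc[i, j - 1] if j > 0 else np.inf,
                       acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
            acc[i, j] = cost[i, j] + prev                           # best monotone warping path
    return acc[-1, -1]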

One can see that an important parameter of the TSRVF is the reference point c, which should remain unchanged throughout the entire computation. Since the selection of c can potentially affect the results, a point is typically a natural candidate for c if most trajectories pass close to it. In this chapter, the Karcher mean (Karcher, 1977), as the Riemannian center of mass, is utilized, since it is equally distant from all the points, thereby minimizing the possible distortions.

Given a set {αi(t), t = 1, ..., n}, i = 1, ..., m, of sequences (trajectories) of gestures (or actions), its Karcher mean µ(t) is calculated using the TSRVF representation with respect to ds in H/∼, defined as

\[ h_\mu = \arg\min_{[h_\alpha]\in H/\sim} \sum_{i=1}^{m} d_s([h_\alpha],[h_{\alpha_i}])^2. \quad (55) \]

As a result, each trajectory is recursively aligned to the mean µ(t); thus, another output of the Karcher mean computation is the set of aligned trajectories {αi(t), t = 1, ..., n}, i = 1, ..., m. Following (Srivastava et al., 2011; Su et al., 2014), for each aligned trajectory αi(t) at time t, the shooting vector vi(t) ∈ T_{µ(t)}(M) is computed such that a geodesic with initial velocity vi(t) goes from µ(t) to αi(t) in unit time (Su et al., 2014), as

\[ v_i(t) = \exp^{-1}_{\mu(t)}(\alpha_i(t)). \quad (56) \]

Then, the combined shooting vector V(i) = [vi(1)^T vi(2)^T ... vi(n)^T]^T is the final feature of the trajectory αi.
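On the product space of SO(3), the inverse exponential map in Eq. (56) can be computed factor-wise. The following hedged sketch (helper names and array layouts are assumptions, not the thesis code) uses the rotation vector of µᵀα as the Lie-algebra coordinates of the shooting vector and stacks them over time into V(i).

import numpy as np
from scipy.spatial.transform import Rotation as R

def shooting_vector(mu_t, alpha_t):
    """mu_t, alpha_t: arrays of shape (P, 3, 3), one rotation matrix per SO(3) factor.
    Returns a 3P-dimensional tangent vector v(t) = exp_mu(t)^{-1}(alpha(t))."""
    vecs = [R.from_matrix(m.T @ a).as_rotvec() for m, a in zip(mu_t, alpha_t)]
    return np.concatenate(vecs)

def trajectory_feature(mu, alpha):
    """Stack the shooting vectors over all n time steps into the final feature V(i)."""
    return np.concatenate([shooting_vector(mu[t], alpha[t]) for t in range(len(alpha))])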

68

Page 71: C 699 ACTA - University of Oulujultika.oulu.fi/files/isbn9789526222011.pdf · 2019. 2. 21. · distinctly defined phases, a low-rank matrix decomposition model is proposed to build

3.5 Sparse coding of 3D skeletal trajectories

It is noted that the feature of a trajectory (gesture sequence) lies in a high-dimensional space. A common solution for reducing the dimension is principal component analysis (PCA), as applied in (Srivastava et al., 2011; Su et al., 2014). Nevertheless, PCA is an unsupervised learning model that does not use the label information. Compared to component analysis techniques, a sparse coding representation with labeled training is better able to capture inherent relationships among the input data and their labels. To the best of our knowledge, few manifold representation-based models have considered the connection between labels and dictionary learning. In this section, we attempt to associate label information with each dictionary atom to enforce the discriminability of the sparse codes during dictionary learning.

Specifically, given a set of observations (feature vectors of gestures) Y = {yi}, i = 1, ..., N, where yi ∈ R^n, let D = {di}, i = 1, ..., K, be a set of vectors in R^n denoting a dictionary of K atoms. The learning of the dictionary D for sparse representation of Y can be expressed as

\[ \langle \mathcal{D}, X\rangle = \arg\min_{\mathcal{D},X} \|Y - \mathcal{D}X\|_2^2 \quad \text{s.t.} \quad \forall i,\; \|x_i\|_0 \le T, \quad (57) \]

where X = [x1, ..., xN] ∈ R^{K×N} denotes the sparse codes of the observations Y, and T is a sparsity constraint factor. The construction of D is achieved by minimizing the reconstruction error ‖Y − DX‖²₂ while satisfying the sparsity constraints. The K-SVD algorithm (Aharon, Elad, & Bruckstein, 2006) is a commonly used solution to (57).
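For illustration, the snippet below shows how a problem of the form (57) can be approximated with off-the-shelf tools. K-SVD itself is not shipped with scikit-learn, so this hedged sketch substitutes DictionaryLearning with OMP-based sparse coding; K and T mirror the symbols in the text, and the row-wise data layout follows scikit-learn's convention rather than the column-wise one used above.

import numpy as np
from sklearn.decomposition import DictionaryLearning

def learn_dictionary(Y, K=200, T=50):
    """Y: (N, n) matrix of trajectory features (one observation per row).
    Returns the dictionary D (K, n) and the sparse codes X (N, K)."""
    dl = DictionaryLearning(n_components=K,
                            transform_algorithm='omp',
                            transform_n_nonzero_coefs=T,
                            max_iter=60)
    X = dl.fit_transform(Y)          # sparse codes of the observations
    D = dl.components_               # learned dictionary atoms
    return D, X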

Inspired by (Jiang, Lin, & Davis, 2013; Q. Zhang & Li, 2010), the classification error and label consistency regularization are introduced into the objective function

\[ \langle \mathcal{D}, W, A, X\rangle = \arg\min_{\mathcal{D},W,A,X} \|Y - \mathcal{D}X\|_2^2 + \beta\|L - WX\|_2^2 + \tau\|Q - AX\|_2^2 \quad \text{s.t.} \quad \forall i,\; \|x_i\|_0 \le T, \quad (58) \]

where W ∈ R^{C×K} represents the classifier parameters and C denotes the number of categories. L = [l1, ..., lN] ∈ R^{C×N} denotes the class labels of the observations Y, and li = [0, ..., 1, ..., 0]^T ∈ R^C is a label vector corresponding to an observation yi, where the nonzero position (index) indicates the class of yi. The additional term ‖L − WX‖²₂ thus denotes the classification error for the label information.

The last term is ‖Q − AX‖²₂, where Q = [q1, ..., qN] ∈ R^{K×N} and qi = [0, ..., 1, ..., 1, ..., 0]^T ∈ R^K is a discriminative sparse code corresponding to an observation yi for classification. The


purpose of setting nonzero elements is to enforce the "discriminability" of the sparse codes (Jiang et al., 2013). Specifically, the nonzero elements of qi occur at those indices where the corresponding dictionary atom dn shares the same label as the observation yi. A denotes a K×K transformation matrix, which is utilized to transform the original sparse code x into a discriminative one. Thus, the term ‖Q − AX‖²₂ represents the discriminative sparse code error, which enforces that the transformed sparse codes AX approximate the discriminative sparse codes Q. This forces signals from the same class to have similar sparse representations. β and τ are regularization parameters which control the relative contributions of the corresponding terms. Equation (58) can be rewritten as

\[ \langle \mathcal{D}, W, A, X\rangle = \arg\min_{\mathcal{D},W,A,X} \left\| \begin{pmatrix} Y \\ \sqrt{\beta}\,L \\ \sqrt{\tau}\,Q \end{pmatrix} - \begin{pmatrix} \mathcal{D} \\ \sqrt{\beta}\,W \\ \sqrt{\tau}\,A \end{pmatrix} X \right\|_2^2 \quad \text{s.t.} \quad \forall i,\; \|x_i\|_0 \le T. \quad (59) \]

Let Y′ = (Y^T, √β L^T, √τ Q^T)^T and D′ = (D^T, √β W^T, √τ A^T)^T. Then, the optimization of Equation (59) is equivalent to solving (57) (replacing Y and D with Y′ and D′, respectively), which is exactly the problem that K-SVD (Aharon et al., 2006) solves. In this section, an initialization and optimization solution similar to the K-SVD procedure described in (Jiang et al., 2013) is adopted. For the parameter settings, the maximal number of iterations equals 60, the sparsity factor T = 50 is used, and β and τ are set to 1.0 in our experiments.
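The stacking in Eq. (59) can be written down directly; the following minimal sketch (an assumption-laden illustration, not the thesis code) builds the augmented pair (Y′, D′) so that a standard K-SVD/OMP solver can be reused on the label-consistent problem.

import numpy as np

def stack_lcksvd(Y, L, Q, D, W, A, beta=1.0, tau=1.0):
    """Y: (n, N) observations, L: (C, N) one-hot labels, Q: (K, N) discriminative codes,
    D: (n, K), W: (C, K), A: (K, K). Returns the augmented pair (Y', D') of Eq. (59)."""
    Y_aug = np.vstack([Y, np.sqrt(beta) * L, np.sqrt(tau) * Q])
    D_aug = np.vstack([D, np.sqrt(beta) * W, np.sqrt(tau) * A])
    return Y_aug, D_aug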

3.6 Experimental results and analysis

In order to verify the effectiveness of the proposed method, we compared it with 18 methods on two public datasets, the ChaLearn 2014 gesture (Escalera et al., 2014) and the MSR Action3D (W. Li et al., 2010) datasets. We divided the 18 state-of-the-art methods into three groups.

The first group contains the algorithms most related to ours, including four Lie group based methods: the Lie group using DTW (Vemulapalli et al., 2014) (Lie group-DTW), the Lie group with TSRVF (Su et al., 2014) (Lie group-TSRVF), with PCA for dimensionality reduction (Anirudh et al., 2015) (Lie group-TSRVF-PCA), and with K-SVD for sparse coding (Aharon et al., 2006) (Lie group-TSRVF-SVD). Additionally, two TSRVF-related methods are included: the body part features with SRV and k-nearest


neighbors clustering (Devanne et al., 2015) (SRV-KNN), and TSRVF on Kendall's shape space (Amor et al., 2016) (Kendall-TSRVF).

The methods in the second group are related to classic feature representations, such as the histogram of 3D joints (HOJ3D) (Xia et al., 2012), EigenJoints (X. Yang & Tian, 2012), the actionlet ensemble (Actionlet) (J. Wang et al., 2014), the histogram of oriented 4D normals (HON4D) (Oreifej & Liu, 2013), rotation and relative velocity with DTW (RRV-DTW) (Y. Guo et al., 2017), and the naive Bayes nearest neighbor (NBNN) (Weng et al., 2017).

The last group includes deep learning methods, namely the convolutional neural network based ModDrop (CNN) (Neverova et al., 2016), HMM with deep belief network (HMM-DBN) (D. Wu & Shao, 2014), LSTM (Hochreiter & Schmidhuber, 1997), the hierarchical recurrent neural network (HBRNN) (Du et al., 2015), spatio-temporal LSTM with trust gates (ST-LSTM-TG) (J. Liu et al., 2016), and the global context-aware attention LSTM (GCA-LSTM) (J. Liu et al., 2017). The baseline results are reported from their original papers.

To evaluate the performance of the TSRVF on the product space SO(3)×···×SO(3) (SO3-TSRVF), we present its discriminative performance without any further steps (such as PCA or sparse coding) on the two datasets. For comparison of the dictionary learning ability, we also report the results of classic coding such as K-SVD (Aharon et al., 2006) (SO3-TSRVF-SVD) and the proposed sparse coding scheme (SO3-TSRVF-SC). For a fair comparison, we follow the same classification setup as in (Vemulapalli et al., 2014; Anirudh et al., 2015; Amor et al., 2016; Su et al., 2014; Aharon et al., 2006); namely, we utilized a one-vs-all linear SVM classifier (with the parameter C set to 1.0).

The ChaLearn 2014 (Escalera et al., 2014) is a gesture dataset with multi-modality data, including audio, RGB, depth, human body mask maps, and 3D skeletal joints. This dataset collects 13585 gesture video segments (Italian cultural gestures) from 20 classes. We followed the evaluation protocol provided by the dataset, which assigns 7754 gesture sequences for training, 3362 sequences for validation, and 2742 sequences for testing. A detailed comparison with other approaches is shown in Table 3. It can be seen that the proposed method achieves the highest recognition accuracy of 93.2%. Compared to the Lie group-based methods, the effectiveness of SO3-TSRVF is confirmed by the experimental results. It is noted that the accuracy of Lie group-DTW (Vemulapalli et al., 2014) is only 79.2%; this is because the performance of DTW highly depends on the reference sequences for each category, and that empirical selection task becomes increasingly


Table 3. Comparison of recognition accuracy (%) with existing skeleton-based methods on the ChaLearn 2014 (Escalera et al., 2014) dataset (best: bold, second best: underline) (X. Liu & Zhao, 2019a).

Methods                                          Accuracy
Lie group-DTW (Vemulapalli et al., 2014)         79.2
Lie group-TSRVF (Su et al., 2014)                91.8
Lie group-TSRVF-PCA (Anirudh et al., 2015)       90.4
Lie group-TSRVF-SVD (Aharon et al., 2006)        91.5
SRV-KNN (Devanne et al., 2015)                   –
Kendall-TSRVF (Amor et al., 2016)                –
EigenJoints (X. Yang & Tian, 2012)               59.3
Actionlet (J. Wang et al., 2014)*                –
HOJ3D (Xia et al., 2012)                         –
HON4D (Oreifej & Liu, 2013)*                     –
RRV-DTW (Y. Guo et al., 2017)                    –
NBNN (Weng et al., 2017)                         –
ModDrop (CNN) (Neverova et al., 2016)*           93.1
HMM-DBN (D. Wu & Shao, 2014)                     83.6
HBRNN (Du et al., 2015)                          –
LSTM (Hochreiter & Schmidhuber, 1997)            87.1
ST-LSTM-TG (J. Liu et al., 2016)                 92.0
Ours (SO3-TSRVF)                                 92.1
Ours (SO3-TSRVF-SVD)                             92.8
Ours (SO3-TSRVF-SC)                              93.2

* The methods use skeleton and RGB-D data.

difficult as the size of the dataset grows. It can also be observed that the accuracy of the LSTM (Hochreiter & Schmidhuber, 1997) is about 6 percentage points lower than that of the proposed method. Although LSTM is designed to perceive contextual information, it is still challenging for it to model sequences with temporal dynamics, especially when training data are limited. It is important to mention that ModDrop (Neverova et al., 2016) ranked first in the Looking at People challenge (Escalera et al., 2014), which uses the ChaLearn 2014 dataset as the benchmark. Please note that our method achieves a higher score than ModDrop without using RGB-D and audio data.

The MSR Action3D (W. Li et al., 2010) is a commonly used dataset for evaluating the performance of action recognition. The dataset is challenging in that actions are highly similar to each other and typically have large temporal misalignments. It comprises 567 pre-segmented action instances, with 10 people performing 20 classes of actions. The MSR Action3D dataset is so popular that many researchers have reported their results on it. For a fair comparison, the same evaluation protocol, namely the cross-subject testing described in (W. Li et al., 2010), is followed, where half of


Table 4. Comparison of recognition accuracy (%) with existing skeleton-based methods on the MSR Action3D (W. Li et al., 2010) dataset (best: bold, second best: underline) (X. Liu & Zhao, 2019a).

Methods                                          Accuracy
Lie group-DTW (Vemulapalli et al., 2014)         92.5
Lie group-TSRVF (Su et al., 2014)                87.7
Lie group-TSRVF-PCA (Anirudh et al., 2015)       88.3
Lie group-TSRVF-SVD (Aharon et al., 2006)        87.6
SRV-KNN (Devanne et al., 2015)                   92.1
Kendall-TSRVF (Amor et al., 2016)                89.9
EigenJoints (X. Yang & Tian, 2012)               82.3
Actionlet (J. Wang et al., 2014)*                88.2
HOJ3D (Xia et al., 2012)                         78.9
HON4D (Oreifej & Liu, 2013)*                     88.9
RRV-DTW (Y. Guo et al., 2017)                    93.4
NBNN (Weng et al., 2017)                         94.8
ModDrop (CNN) (Neverova et al., 2016)*           –
HMM-DBN (D. Wu & Shao, 2014)                     82.0
HBRNN (Du et al., 2015)                          94.5
LSTM (Hochreiter & Schmidhuber, 1997)            88.9
ST-LSTM-TG (J. Liu et al., 2016)                 94.8
Ours (SO3-TSRVF)                                 93.4
Ours (SO3-TSRVF-SVD)                             93.7
Ours (SO3-TSRVF-SC)                              94.6

* The methods use skeleton and RGB-D data.

the subjects are used for training (subjects 1, 3, 5, 7, 9) and the remainder for testing (2, 4, 6, 8, 10). We compare the proposed method with the state of the art; the recognition accuracies on the MSR Action3D dataset are recorded in Table 4. We can see that the proposed method achieves better performance than the Lie group based and classical feature representation approaches. Again, the performance of the proposed sparse coding is superior to the K-SVD and PCA based coding methods. In fact, the recognition accuracy of the proposed method is only 0.2 percentage points lower than the recently proposed NBNN (Weng et al., 2017) and ST-LSTM-TG (J. Liu et al., 2016).

3.7 Conclusion

In this chapter, we propose a novel method for gesture recognition. We represent a 3D human skeleton as a point in the product space of the special orthogonal group SO(3). As such, a human gesture can be characterized as a trajectory in this Riemannian manifold space. To obtain reparameterization invariance properties for trajectory analysis, we generalize


the transported square-root vector field to obtain a time-warping invariant metric for comparing trajectories. Moreover, a sparse coding scheme for skeletal trajectories is proposed by explicitly associating the labeling information with each atom to enforce the discriminative validity of the dictionary. The main contributions of this method are summarized as follows:

1. We represent a human skeleton as a point on the product space of the special orthogonal group SO(3), which is a Riemannian manifold. This representation is independent of the performer's location, and explicitly models the 3D geometric relationships between body parts using rotations. A gesture (skeletal sequence) can then be represented by a trajectory composed of these points, and the gesture recognition task is formulated as the problem of computing the similarity between the shapes of trajectories.

2. We extend the transported square-root vector field (TSRVF) representation for comparing trajectories on the product space SO(3)×···×SO(3). As such, the temporal dynamics issue of gesture recognition can be solved by this time-warping invariant feature.

3. We present a sparse coding of skeletal trajectories by explicitly associating the labeling information with each atom to enforce the discriminative validity of the dictionary. The comparative experimental results on two challenging datasets demonstrate that the proposed method achieves state-of-the-art performance.


4 Hidden states exploration for gesture recognition

One of the main difficulties in human body gesture recognition is the high intra-class variance caused by temporal dynamics. A popular solution to this problem is to use generative models, such as the hidden Markov model. Nevertheless, most existing work assumes fixed anchors for each hidden state, which makes it difficult to describe the explicit temporal structure of gestures. In this chapter, based on the observation that a gesture is a time series with distinctly defined phases, we propose a new formulation to build temporal compositions of gestures by low-rank matrix decomposition, and present our findings for gesture recognition in Paper IV.

4.1 Introduction and motivation

Human body gesture analysis is a fundamental study which has been widely applied in a variety of computer vision applications. As reported in the last chapter, encouraging progress has been made by various researchers, but it is still challenging to accurately recognize human gestures. One open issue in human gesture recognition lies in the temporal dynamics. For instance, even the same gesture performed by the same person can have different execution rates and different starting/end points, let alone when it comes to different performers.

Recently, researchers have been resorting to modelling human behaviors by studying temporal structures, e.g. (Gong et al., 2014; K. Tang, Fei-Fei, & Koller, 2012; Weng et al., 2017). However, most of these models focus on human actions rather than body gestures. Compared to actions, the structural property of gestures is more semantically meaningful and discriminative. According to the research on gesture movements (Kendon, 1980; Kita, Van Gijn, & Van der Hulst, 1997), a gesture instance can be decomposed into the following gesticular phases (see examples in Fig. 12):

1) Resting, see Fig. 12 (a).
2) Preparation: the hands move to the initial position of the stroke, see Fig. 12 (a)→(b).
3) Pre-stroke hold¹: a brief pause at the end of preparation, see Fig. 12 (b).
4) Stroke: the hand movement that expresses the meaning of the gesture, see Fig. 12 (b)→(c)→(d).
5) Post-stroke hold (Hold): a brief pause at the end of a stroke, maintaining the hands' configuration and position, see Fig. 12 (d).
6) Retraction: the hands move back to a rest position to conclude a gesture unit, see Fig. 12 (d)→(e)→(f).
7) Resting, see Fig. 12 (f).

¹ The “Pre-stroke hold” phase is optional and can be merged into the “Preparation” phase (Kita et al., 1997).

From the above definitions, we can conclude that the three phases (2, 4, 6) with hand movements, namely Preparation, Stroke, and Retraction, are partitioned by the four "hold" phases (1, 3, 5, 7) with static poses: Resting (Independent hold (Kita et al., 1997)), Pre-stroke hold, and Post-stroke hold. In other words, the temporal structure of a gesture can be obtained once we can identify these "hold" phases.


Fig. 12. Frames (cropped) selected from two gestures (Escalera et al., 2014) representing the meanings of "basta (enough)" and "furbo (clever)", respectively. These frames illustrate that a gesture consists of a series of gesticular phases: "Resting"→"Preparation"→"Pre-stroke hold"→"Stroke"→"Post-stroke hold"→"Retraction"→"Resting" (Kendon, 1980; Kita et al., 1997). (a) "Resting", (a)→(b) "Preparation", (b) "Pre-stroke hold", (b)→(c)→(d) "Stroke", (d) "Post-stroke hold", (d)→(e)→(f) "Retraction", (f) "Resting" (X. Liu et al., 2019) © 2019, IEEE.

Based on this observation, in this chapter we develop a novel model for human gesture recognition aiming to address the difficulties of modeling temporal dynamics. We treat a human gesture as a series of separated phases, each of which is associated with a segment of unfixed length, as Fig. 13 (c) illustrates, and we propose to globally capture the temporal evolution of gestures by a generative model which is built upon a recurrent neural network to memorize contextual information for better prediction of transition and emission probabilities. We formulate the problem in a unified framework named Hidden States Learning by Long Short-Term Memory (HSL-LSTM).


4.2 Hidden Markov models for gesture recognition

In this chapter, gesture modeling via HMM is formulated with the following definitions:

– Given a set Θ = {θ1, θ2, ..., θK−1, θK} which contains K gesture sequences with varied lengths.
– Any gesture sequence θk can be denoted as θk = {fk,1, fk,2, ..., fk,Tk−1, fk,Tk}, where fk,t is the t-th frame (or its feature representation) of θk and Tk denotes its length.
– For any θk from Θ, its label δc satisfies δc ∈ ∆, where ∆ is the set of C gesture labels, denoted as ∆ = {δ1, δ2, ..., δC−1, δC}.

Specifically, given an observation of a gesture sequence X = {x1, x2, ..., xT−1, xT}, where X ∈ Θ, we utilize the HMM to infer a hidden state sequence H = {h1, h2, ..., hT−1, hT}. Any state ht from H fulfills ht ∈ Ψ (1 ≤ t ≤ T), where Ψ denotes a universal set which contains all possible Markov hidden states.

Fig. 13. Illustration of the phases (hidden states) of a gesture sequence with temporal structure. (a) Frames selected from a gesture (Escalera et al., 2014) representing the meaning of "basta (enough)", (b) Skeletons (corresponding to the selected frames) of static poses from "hold" phases, (c) Temporal structure (phase) segmentation by the proposed method, resulting in hidden states h1 ("Resting"), h2 ("Preparation", "Stroke"), h3 ("Post-stroke hold"), h4 ("Retraction"), h5 ("Resting"), (d) Fixed-anchor based methods with equal-sized segmentation, resulting in hidden states h′1, h′2, h′3, h′4, h′5 (X. Liu et al., 2019) © 2019, IEEE.

Typically, the state alignment is conducted based on the hypothesis that gestures are completed by uniformly performing Z defined hidden states in order, and that hidden states from different gesture classes do not overlap. Then, for gestures from class δc with a unique hidden state set {ψc,1, ψc,2, ..., ψc,Z−1, ψc,Z}, we generalize this concept for all gesture classes and define a universal set of hidden states for all gesture


classes, as

\[ \Psi = \begin{bmatrix} \psi_{1,1} & \psi_{1,2} & \cdots & \psi_{1,Z-1} & \psi_{1,Z} \\ \psi_{2,1} & \psi_{2,2} & \cdots & \psi_{2,Z-1} & \psi_{2,Z} \\ \cdots & \cdots & \psi_{c,z} & \cdots & \cdots \\ \psi_{C-1,1} & \psi_{C-1,2} & \cdots & \psi_{C-1,Z-1} & \psi_{C-1,Z} \\ \psi_{C,1} & \psi_{C,2} & \cdots & \psi_{C,Z-1} & \psi_{C,Z} \end{bmatrix}, \]

where ψc,z denotes the z-th hidden state of gesture class δc, 1 ≤ δc ≤ C and 1 ≤ z ≤ Z. So in total there are E state types for all gestures, where E = Z×C.

Thus, according to the HMM full probability model

\[ P(H,X) = P(h_1)P(x_1|h_1)\prod_{t=2}^{T} P(h_t|h_{t-1})P(x_t|h_t), \quad (60) \]

where the goal of the gesture modeling problem is to find the optimal hidden state sequence H* that maximizes the joint probability P(H,X), based on a given set of observations X. Because the observation X is the same for all hidden state combinations H, the optimization problem for solving H* can be rewritten as

\[ H^{*} = \arg\max_{H} P(H|X) \propto \arg\max_{H} P(H,X). \quad (61) \]

From the above discussion, we can conclude that HMM-based gesture recognition has two important problems which need to be carefully solved:

– Given an observation of a gesture sequence, how can a corresponding hidden state sequence be selected that is optimal in some meaningful sense, i.e., that best explains the observation?

– Three sets of parameters need to be estimated to complete the specification of an HMM, namely the initial probability of the first hidden state prior P(h1), the hidden state transition probability P(ht|ht−1), and the emission probability P(xt|ht) of generating an observation at time t given the hidden state ht. How can these parameters (distributions) be efficiently computed?

For the first problem, Wu et al. (D. Wu et al., 2016; D. Wu & Shao, 2014) employed a deep neural network to estimate the probability distributions over the states of the HMM, while using a forced alignment scheme to divide video sequences into temporally equal parts. In HOJ3D (Xia et al., 2012), a spherical histogram of the locations of 12 manually selected 3D skeleton joints is computed. These histograms are projected


using linear discriminant analysis and then clustered into K posture words. Finally, each action is characterized as a time series of these words (hidden states). Nevertheless, this one-frame-one-posture-label (state) tactic cannot fully model motion temporality, since it ignores the contextual information. Actually, an explicit definition of the hidden states of the sequences is necessary, including the number of states and the number of distinct frames per state. Although the states are hidden, for many practical applications there is often some physical significance attached to them. For gesture recognition, the gestures themselves exhibit internal temporal structure. As defined in Section 4.1, gestures typically have definite gesticular phases with varied durations and starting/end times. To illustrate this, two examples are given in Figs. 12 and 13. Based on this observation, in this chapter gestures are modeled as compositions of different gesticular phases. Once the gestures have distinct phases, models that exploit hidden states are advantageous. As such, the different gesticular phases correspond to the different hidden states of the HMM, and the usage of the HMM allows heterogeneous information of one gesture class to be distributed over many states (phases), which is key to improving the ability to model complex patterns.

For the second issue, Gaussian mixture models (Lv & Nevatia, 2006; Murphy, 2012; Piyathilaka & Kodagoda, 2013) are widely used as the dominant technique for modeling the emission distribution of the HMM. In (D. Wu et al., 2016; D. Wu & Shao, 2014), a deep belief network (DBN) is utilized as a generative model to replace the traditional GMM for estimating the emission probability. However, any frame within a sequence usually carries contextual information and depends on previous frames; this is ignored in previous works. Both the DBN and the GMM treat input frames at each time step as independent variables, so that the output emission probability in the current time step only relates to the current input. To handle this issue and acquire the emission probability more appropriately, in this chapter the LSTM (Hochreiter & Schmidhuber, 1997) is used for its stronger contextual information modeling ability. As a special type of RNN, the LSTM utilizes memory cells to store contextual information learned from previous sequential inputs, and the stored information can affect the output of the network.

4.3 Lie algebra based representation

As reported in Section 3.3, the relative geometry between different body parts (the Lie group representation) (Vemulapalli et al., 2014) is introduced. However, the Lie group SE(3)


is endowed with a Riemannian manifold structure, such that standard classification and clustering algorithms are not directly applicable to this non-Euclidean space (Vemulapalli et al., 2014).

Specifically, the tangent space of SE(3) at the identity I4 is called its Lie algebra and is denoted by se(3); it is isomorphic to the space of twists and therefore provides a natural setting for the analysis of instantaneous motions (Murray et al., 1994), as Fig. 14 (a) illustrates. In that way, the former classification tasks in the manifold curve space are converted into classification problems in a typical vector space. The elements of se(3) can be identified with 4×4 matrices of the form

\[ \xi = \begin{bmatrix} \omega & \vec{v} \\ 0 & 0 \end{bmatrix} = \begin{bmatrix} 0 & -\omega_3 & \omega_2 & v_1 \\ \omega_3 & 0 & -\omega_1 & v_2 \\ -\omega_2 & \omega_1 & 0 & v_3 \\ 0 & 0 & 0 & 0 \end{bmatrix}, \quad (62) \]

where ω is a 3×3 skew-symmetric matrix, which can thus be identified with a vector ω = [ω1, ω2, ω3]^T ∈ R3, and ~v ∈ R3. In other words, each element of se(3) can be identified with a vector ξ = [ω1, ω2, ω3, v1, v2, v3]^T ∈ R6. The logarithm map log_P : SE(3) → se(3) between the Lie group and the Lie algebra (Murray et al., 1994) is given by

\[ \xi = \log \begin{bmatrix} R & \vec{v} \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} \omega & A^{-1}\vec{v} \\ 0 & 0 \end{bmatrix}, \quad (63) \]

where ω = log R and

\[ A^{-1} = I - \frac{1}{2}\omega + \frac{2\sin\|\omega\| - \|\omega\|(1+\cos\|\omega\|)}{2\|\omega\|^2 \sin\|\omega\|}\,\omega^2, \quad \omega \neq 0. \quad (64) \]

If ω = 0, then A = I. Here, since log_P is not unique, the value with the smallest norm is typically used (Vemulapalli et al., 2014). The inverse of the logarithm map, namely the exponential map exp ξ : se(3) → SE(3) between the Lie algebra and the Lie group, is given by

\[ \exp \xi = \begin{bmatrix} I & \vec{v} \\ 0 & 1 \end{bmatrix} \;\; (\omega = 0) \quad \text{and} \quad \exp \xi = \begin{bmatrix} e^{\omega} & A\vec{v} \\ 0 & 1 \end{bmatrix} \;\; (\omega \neq 0), \quad (65) \]


where eω is given by Rodrigues’s formula

\[ e^{\omega} = I + \frac{\omega}{\|\omega\|}\sin\|\omega\| + \frac{\omega^2}{\|\omega\|^2}(1-\cos\|\omega\|), \quad (66) \]

and A

\[ A = I + \frac{\omega}{\|\omega\|^2}(1-\cos\|\omega\|) + \frac{\omega^2}{\|\omega\|^3}(\|\omega\| - \sin\|\omega\|). \quad (67) \]

For more details on Lie groups and Lie algebras, we refer interested readers to (Murray et al., 1994).

Fig. 14. (a) Representation of a gesture (skeletal sequence) as a curve on the Lie group SE(3)×···×SE(3) (manifold curved space), which can be mapped into its Lie algebra (vector space). (b) Illustration of the matrix decomposition for exploring hidden states.

As a result, a skeleton can be represented by a point in the product space of the Lie group SE(3)×···×SE(3), where the number of SE(3) factors is 2C²_M, with C²_M the number of two-element combinations of M bones. Furthermore, this SE(3)×···×SE(3) can be mapped to its Lie algebra se(3)×···×se(3), and each se(3) can be identified with a vector [ω1, ω2, ω3, v1, v2, v3]^T ∈ R6. As such, at time t, a human skeleton can be modeled by a G-dimensional vector with G = 6M(M−1), i.e., a point in R^G.
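A hedged sketch of the SE(3) logarithm map of Eqs. (63)–(64) is given below: a 4×4 rigid transform is mapped to the 6-vector ξ = [ω1, ω2, ω3, v1, v2, v3]. SciPy's Rotation is used to obtain log R as the smallest-norm rotation vector; the helper names are assumptions for illustration, not the thesis implementation.

import numpy as np
from scipy.spatial.transform import Rotation

def skew(w):
    """3x3 skew-symmetric matrix of a 3-vector w."""
    return np.array([[0, -w[2], w[1]], [w[2], 0, -w[0]], [-w[1], w[0], 0]])

def se3_log(P):
    """P: 4x4 matrix [R v; 0 1]. Returns the 6-dim Lie-algebra vector of Eq. (63)."""
    Rm, v = P[:3, :3], P[:3, 3]
    w = Rotation.from_matrix(Rm).as_rotvec()        # rotation vector, i.e. log R with the smallest norm
    theta = np.linalg.norm(w)
    if theta < 1e-8:
        A_inv = np.eye(3)                           # limit case omega = 0
    else:
        W = skew(w)
        A_inv = (np.eye(3) - 0.5 * W
                 + (2 * np.sin(theta) - theta * (1 + np.cos(theta)))
                 / (2 * theta**2 * np.sin(theta)) * (W @ W))   # Eq. (64)
    return np.concatenate([w, A_inv @ v])

Applying this map to every relative-geometry matrix of a frame and concatenating the results yields the G-dimensional vector used to build the matrix D in the next section.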


4.4 Low-rank decomposition for exploring gesture temporal structures

In this section, we attempt to discover the temporal structures (phases) of gesture sequences, and formulate a model over the temporal domain which is able to explore the hidden states of gestures.

Given an observed sequence (T frames) from a gesture performer, we can construct a matrix D by stacking the (Lie algebra based) skeletal representations of every frame horizontally (column-wise), so that D ∈ R^{G×T}. Since a gesture's "hold" phases consist of (more or less) static poses, and the Lie group (algebra) based representation has the properties of view-invariance and stability under dynamics, these static poses ("hold" phases) should be captured by a low-rank matrix, whereas the hand movement (phases) corresponds to gesture changes which cannot be fitted into the low-rank model of static poses and should thus be treated as outliers, as Fig. 14 (b) illustrates.

Algorithm 3 Low-rank and column-block sparsity matrix decomposition.

Input: matrix D ∈ R^{G×T} and the parameters κ, λ.
Output: estimate of (L, S).

1: Parameter initialization: S0 = Y0 = 0; L0 = 0; µ0 = 40/‖sign(D)‖2; ρ > 1; κ = 0.041; λ = 0.73; k = 0.
2: While not converged do
3:   // Lines 4–11 solve L_{k+1} = argmin_L f_µ(L, S_k, Y_k), as in Eq. (72).
4:   G_L = D − S_k + µ_k^{-1} Y_k.
5:   j ← 0, L^{(0)}_{k+1} = G_L.
6:   While not converged do
7:     L^{(j+1/2)}_{k+1} = U S_β(Σ) V^T, where L^{(j)}_{k+1} = U Σ V^T is the SVD of L^{(j)}_{k+1}.
8:     Π = α ( τ_{βκ(1−λ)/(1+βµ_k)} ( (2L^{(j+1/2)}_{k+1} − L^{(j)}_{k+1} + βµ_k G_L) / (1+βµ_k) ) − L^{(j+1/2)}_{k+1} ).
9:     L^{(j+1)}_{k+1} = L^{(j)}_{k+1} + Π.
10:    j ← j+1.
11:  end while
12:  L_{k+1} = L^{(j+1/2)}_{k+1}.
13:  // Lines 14–15 solve S_{k+1} = argmin_S f_µ(L_{k+1}, S, Y_k).
14:  G_S = D − L_{k+1} + µ_k^{-1} Y_k.
15:  S_{k+1} = τ_{κλ/µ_k}(G_S).
16:  Y_{k+1} = Y_k + µ_k (D − L_{k+1} − S_{k+1}).
17:  µ_{k+1} = ρ µ_k; k ← k+1.
18: end while
19: L ← L_k, S ← S_k.


Based on this observation, we consider the hidden state exploration from the viewpoint of a matrix decomposition and optimization problem, which can be expressed as

D = L+S, (68)

where L and S denote the "hold" states (phases) and the hand movement signals (phases), respectively. We assume that the static poses of the "hold" states are linearly correlated with each other, forming a low-rank matrix L. The component S should be a column-block sparse matrix with non-zero columns corresponding to the outliers. In order to eliminate ambiguity, the columns of the low-rank matrix L corresponding to the outlier columns are assumed to be zero. To formalize column-block priors on the outliers, we introduce the ℓ2,1-norm and propose a Low-rank and Column-Block sparsity matrix Decomposition (LCBD) method, as

\[ \min_{L,S} \|L\|_* + \kappa\lambda\|S\|_{2,1} + \kappa(1-\lambda)\|L\|_{2,1} \quad \text{s.t.} \quad D = L + S, \quad (69) \]

where ‖L‖∗ denotes the nuclear norm of the matrix L, i.e., the sum of its singular values, and ‖S‖2,1 denotes the ℓ1-norm of the vector formed by taking the ℓ2-norms of the columns of the matrix S, as

\[ \|S\|_{2,1} = \sum_{i=1}^{T} \|S_i\|_2, \quad (70) \]

where Si denotes the i-th column of S. Inspired by the methods (Z. Lin, Chen, & Ma, 2010; Yao, Liu, & Qi, 2014), the additionally introduced term κ(1−λ)‖L‖2,1 ensures that the recovered matrix L has exactly zero columns corresponding to S. Eq. (69) is an optimization problem, and we solve it based on the augmented Lagrange multiplier (ALM) method (Z. Lin et al., 2010), for which the augmented Lagrangian is defined as

\[ \mathcal{L}(L,S,Y;\mu) = \|L\|_* + \kappa\lambda\|S\|_{2,1} + \kappa(1-\lambda)\|L\|_{2,1} + \langle Y, D-L-S\rangle + \frac{\mu}{2}\|D-L-S\|_F^2, \quad (71) \]

where Y is the matrix of Lagrange multipliers and µ is a positive scalar. ALM solves (71) by alternating between optimizing the primal variables L, S and updating the dual variable Y, which amounts to solving the following three sub-problems


\[ L_{k+1} = \arg\min_L \mathcal{L}(L, S_k, Y_k; \mu), \qquad S_{k+1} = \arg\min_S \mathcal{L}(L_{k+1}, S, Y_k; \mu), \qquad Y_{k+1} = Y_k + \mu(D - L_{k+1} - S_{k+1}). \quad (72) \]

The first problem in (72), which solves for L at fixed S and Y, can be explicitly expressed in the following form

\[ \min_L \|L\|_* + \kappa(1-\lambda)\|L\|_{2,1} + \frac{\mu}{2}\big\|\big(D - S_k + \mu^{-1}Y_k\big) - L\big\|_F^2. \quad (73) \]

In each iteration, (73) can be rewritten as

\[ L_{k+1} = \arg\min_L \left\{ \|L\|_* + \kappa(1-\lambda)\|L\|_{2,1} + \frac{\mu_k}{2}\big\|G_L - L\big\|_F^2 \right\}, \quad (74) \]

where G_L = D − S_k + µ_k^{-1}Y_k. We use the Douglas/Peaceman–Rachford (DR) monotone operator splitting method (Combettes & Pesquet, 2007; Fadili & Starck, 2009) to iteratively solve (74).

Define f_1(L) = κ(1−λ)‖L‖2,1 + (µ_k/2)‖G_L − L‖²_F and f_2(L) = ‖L‖∗. For β > 0 and a sequence α_j ∈ (0,2), the DR iteration for (74) is expressed as

\[ L^{(j+1/2)} = \operatorname{prox}_{\beta f_2}\big(L^{(j)}\big), \qquad L^{(j+1)} = L^{(j)} + \alpha_j\Big(\operatorname{prox}_{\beta f_1}\big(2L^{(j+1/2)} - L^{(j)}\big) - L^{(j+1/2)}\Big), \quad (75) \]

where the two proximity operators involved in the DR iteration are defined as

\[ \operatorname{prox}_{\beta f_1}(L) = \tau_{\frac{\beta\kappa(1-\lambda)}{1+\beta\mu_k}}\left(\frac{L + \beta\mu_k G_L}{1+\beta\mu_k}\right), \qquad \operatorname{prox}_{\beta f_2}(L) = U S_\beta(\Sigma) V^{T}, \]
\[ \tau_\eta(G_p) = G_p \max\left(0,\, 1 - \frac{\eta}{\|G_p\|_2}\right),\; p = 1,2,\ldots,n, \qquad S_\beta(x) = \max(0,\, x-\beta),\; x \ge 0,\; \beta > 0, \quad (76) \]

In the same way as for (73), the second problem in (72) can be written in the following equivalent form

\[ \min_S \frac{\mu}{2}\big\|\big(D - L_{k+1} + \mu^{-1}Y_k\big) - S\big\|_F^2 + \kappa\lambda\|S\|_{2,1}. \quad (77) \]

Similarly, let G_S = D − L_{k+1} + µ_k^{-1}Y_k. Then, S can be obtained by


\[ S = \tau_{\frac{\kappa\lambda}{\mu_k}}(G_S). \quad (78) \]

The whole procedure is shown in Algorithm 3. The error in the outer loop is computed as ‖D − L_k − S_k‖_F / ‖D‖_F; the outer loop stops when this value falls below 10^{-7} or the maximal iteration number of 500 is reached. The inner loop stops when the difference between successive matrices L^{(j)}_{k+1} falls below 10^{-6} or a maximum of 20 iterations is reached. The tuning parameters κ and λ are set to 0.041 and 0.73, respectively. For the DR iteration, α and β are set to 1 and 0.57, respectively. The ALM parameter is ρ = 1.1. Please refer to (Combettes & Pesquet, 2007; Fadili & Starck, 2009; Z. Lin et al., 2010) for more details.
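As a compact illustration of Algorithm 3, the sketch below (a minimal NumPy version under the stated assumptions, not the thesis implementation) implements the two proximal operators of Eq. (76) and one outer ALM pass; parameter names follow the text.

import numpy as np

def col_shrink(G, eta):
    """tau_eta: scale each column toward zero by its l2-norm (column-block shrinkage)."""
    norms = np.linalg.norm(G, axis=0, keepdims=True)
    scale = np.maximum(0.0, 1.0 - eta / np.maximum(norms, 1e-12))
    return G * scale

def svt(L, beta):
    """Proximal operator of the nuclear norm: soft-threshold the singular values by beta."""
    U, s, Vt = np.linalg.svd(L, full_matrices=False)
    return U @ np.diag(np.maximum(s - beta, 0.0)) @ Vt

def lcbd_outer_step(D, L, S, Y, mu, kappa=0.041, lam=0.73, beta=0.57, alpha=1.0, inner_iters=20):
    """One outer ALM iteration: DR inner loop for L, column shrinkage for S, dual update for Y."""
    GL = D - S + Y / mu
    Lj = GL.copy()
    for _ in range(inner_iters):                       # Douglas-Rachford iterations, Eq. (75)
        Lh = svt(Lj, beta)                             # prox of beta*f2
        prox1 = col_shrink((2 * Lh - Lj + beta * mu * GL) / (1 + beta * mu),
                           beta * kappa * (1 - lam) / (1 + beta * mu))  # prox of beta*f1
        Lj = Lj + alpha * (prox1 - Lh)
    L = Lh
    GS = D - L + Y / mu
    S = col_shrink(GS, kappa * lam / mu)               # Eq. (78)
    Y = Y + mu * (D - L - S)                           # dual ascent step
    return L, S, Y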

4.5 Hidden states learning via LSTM

As reported in the last section, we initialize the hidden states of the temporal segments for each training sample according to the most discriminative phases of the sequences, as presented in Section 4.4. Based on these hidden states, we can calculate the three sets of HMM parameters in a more meaningful sense than in previous methods.

To represent the prior probability of the first hidden state, we use $\pi = (\pi_i)_{E\times 1}$, where $\pi_i = P(h_1 = \psi_i)$, and $\psi_i$ is the $i$th state of the hidden state set $\Psi$. We can then estimate $\pi_i$ by calculating

$$\pi_i = \frac{\sum_{k=1}^{K} (h_{k,1} == \psi_i)}{K}, \qquad (79)$$

where $k$ denotes the index of an observation, and $K$ is the total number of observations (gesture sequences).

Next, the hidden state transition parameter (matrix) is denoted by $A = [a_{i,j}]_{E\times E}$, where $a_{i,j} = P(h_t = \psi_j \mid h_{t-1} = \psi_i)$. We can calculate $a_{i,j}$ by

$$a_{i,j} = \frac{\sum_{k=1}^{K}\sum_{t=2}^{T_k}\left((h_{k,t-1} == \psi_i)\ \mathrm{AND}\ (h_{k,t} == \psi_j)\right)}{\sum_{k=1}^{K}\sum_{t=2}^{T_k}(h_{k,t-1} == \psi_i)}. \qquad (80)$$
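As an illustration of Eqs. (79) and (80), the counting estimators could be implemented as in the following sketch, which assumes that each training gesture is available as an integer-valued state sequence produced by the initialization of Section 4.4; the function and variable names are illustrative only.

```python
import numpy as np

# Sketch of the counting estimators in Eqs. (79)-(80). Each training gesture k
# is assumed to be given as an integer state sequence with values 0..E-1.

def estimate_hmm_priors(state_sequences, E):
    pi = np.zeros(E)
    A = np.zeros((E, E))
    for seq in state_sequences:
        pi[seq[0]] += 1.0                    # count of first hidden states, Eq. (79)
        for t in range(1, len(seq)):
            A[seq[t - 1], seq[t]] += 1.0     # transition counts, Eq. (80)
    pi /= len(state_sequences)
    row_sums = A.sum(axis=1, keepdims=True)
    A = np.divide(A, row_sums, out=np.zeros_like(A), where=row_sums > 0)
    return pi, A

# Toy example with E = 5 hidden states (phases) per gesture
pi, A = estimate_hmm_priors([[0, 0, 1, 2, 3, 4], [0, 1, 1, 2, 3, 4]], E=5)
```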

Another key parameter is the emission probability. Compared to the DBN and GMM models widely utilized in existing methods, LSTM can learn contextual information from sequential data, which provides a powerful model for sequential data modeling. On one hand, it receives the output from the previous time step and uses it as a part of the input at the current time step. On the other hand, it uses memory cells to


store contextual information learned from the input and uses gate units to maintain the stored contextual information. Fig. 15 illustrates the architecture of an example LSTM unit with one memory cell. An LSTM unit typically consists of three gates, namely the input gate $I_t$, the forget gate $F_t$, and the output gate $O_t$. Each gate, as well as the tanh function $G_t$, receives the same input, consisting of the data of the current time step and the network output from the previous time step. The removal of previous information from, or the addition of current information to, the cell state is regulated with linear interactions by the forget gate $F_t$ and the input gate $I_t$. In our method, the LSTM does not include the memory cell output (from the previous time step) in the input (of the current time step). The forward pass is defined as follows

$$
\begin{aligned}
I_t &= \sigma\left(W_{I,H} H_{t-1} + W_{I,X} X_t + B_I\right), \qquad &(81)\\
F_t &= \sigma\left(W_{F,H} H_{t-1} + W_{F,X} X_t + B_F\right), \qquad &(82)\\
O_t &= \sigma\left(W_{O,H} H_{t-1} + W_{O,X} X_t + B_O\right), \qquad &(83)\\
G_t &= \tanh\left(W_{G,H} H_{t-1} + W_{G,X} X_t + B_G\right), \qquad &(84)\\
C_t &= F_t C_{t-1} + I_t G_t, \qquad &(85)\\
H_t &= O_t \tanh(C_t). \qquad &(86)
\end{aligned}
$$
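The forward pass (81)–(86) can be sketched in a few lines of NumPy, as below. The weight shapes, random initialization, and example dimensions are illustrative assumptions only; in practice the weights are learned with a deep learning framework via backpropagation.

```python
import numpy as np

# Minimal sketch of the forward pass (81)-(86) of one LSTM unit.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, B):
    """One time step; W maps gate name -> (W_{.,H}, W_{.,X}), B maps gate -> bias."""
    i_t = sigmoid(W['I'][0] @ h_prev + W['I'][1] @ x_t + B['I'])   # input gate,  Eq. (81)
    f_t = sigmoid(W['F'][0] @ h_prev + W['F'][1] @ x_t + B['F'])   # forget gate, Eq. (82)
    o_t = sigmoid(W['O'][0] @ h_prev + W['O'][1] @ x_t + B['O'])   # output gate, Eq. (83)
    g_t = np.tanh(W['G'][0] @ h_prev + W['G'][1] @ x_t + B['G'])   # candidate,   Eq. (84)
    c_t = f_t * c_prev + i_t * g_t                                 # cell state,  Eq. (85)
    h_t = o_t * np.tanh(c_t)                                       # hidden out,  Eq. (86)
    return h_t, c_t

# Example: illustrative input dimension d, hidden size 512 (as in Section 4.6)
d, hdim = 60, 512
rng = np.random.default_rng(0)
W = {g: (0.01 * rng.standard_normal((hdim, hdim)),
         0.01 * rng.standard_normal((hdim, d))) for g in 'IFOG'}
B = {g: np.zeros(hdim) for g in 'IFOG'}
h_t, c_t = lstm_step(rng.standard_normal(d), np.zeros(hdim), np.zeros(hdim), W, B)
```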

Fig. 15. The architecture of an example LSTM unit with one memory cell. The memory cell is denoted as $C_t$ in the figure. $I_t$, $O_t$, $F_t$, and $G_t$ respectively represent the input gate, output gate, forget gate, and the tanh function.

In order to ensure that the LSTM generates outputs in the form of the emission probability $P(x_t \mid h_t)$, we use a softmax loss function to train the network. It can instruct


the LSTM network to generate a posterior distribution $P(h_t \mid x_t, \zeta)$, where $\zeta$ denotes the network parameters, which are shared by all time steps. Thus, we can use such a network to infer the emission probability by

$$P(x_t \mid h_t) = \frac{P(h_t \mid x_t)\, P(x_t)}{P(h_t)} \;\propto_{\zeta,\, x_t}\; \frac{P(h_t \mid x_t)}{P(h_t)}. \qquad (87)$$

Lastly, by combining (60), (61), and (87), we arrive at our final objective function as follows

$$\mathbf{H} = \arg\max_{\mathbf{H}}\; P(h_1 \mid x_1) \prod_{t=2}^{T} P(h_t \mid h_{t-1})\, \frac{P(h_t \mid x_t)}{P(h_t)}, \qquad (88)$$

where $\mathbf{H}$ denotes the optimal hidden state sequence. The optimization problem in (88) can be easily solved by Viterbi path decoding (Murphy, 2012). The pipelines (training and testing) of the proposed method are shown in Fig. 16.
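A minimal log-space Viterbi sketch for (88) is given below. It assumes that the transition matrix from Eq. (80) and the per-frame scores produced by the pre-trained LSTM are already available in log form; all names are illustrative rather than taken from the original implementation.

```python
import numpy as np

# Log-space Viterbi decoding for Eq. (88). Assumed inputs:
#   start_scores (E,):   log P(h_1 | x_1) from the pre-trained LSTM,
#   log_A (E, E):        log of the transition matrix A from Eq. (80),
#   frame_scores (T, E): log( P(h_t | x_t) / P(h_t) ) per frame
#                        (row 0 is unused because the first frame uses start_scores).

def viterbi_decode(start_scores, log_A, frame_scores):
    T, E = frame_scores.shape
    delta = np.full((T, E), -np.inf)
    backptr = np.zeros((T, E), dtype=int)
    delta[0] = start_scores
    for t in range(1, T):
        cand = delta[t - 1][:, None] + log_A        # score of moving state i -> j
        backptr[t] = cand.argmax(axis=0)
        delta[t] = cand.max(axis=0) + frame_scores[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()                   # best final state
    for t in range(T - 2, -1, -1):                  # backtrack the optimal H
        path[t] = backptr[t + 1, path[t + 1]]
    return path
```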

It is noted that LSTM-based methods commonly feed the network with a whole gesture or action sequence. Although LSTMs are designed to learn long-term temporal dependencies, it is still challenging for an LSTM to memorize the information of an entire sequence with many states (Ke et al., 2017; Weston et al., 2014). In addition, with a limited amount of training data, training an LSTM is prone to overfitting (Shahroudy et al., 2016; J. Wang et al., 2014). In our scheme, shorter video segments (states) are fed into the network to bypass the difficulty LSTM has when modeling long-term gestures with temporal dynamics. Furthermore, this state-based feeding enlarges the number of training samples without any data augmentation operations. Experiments demonstrate that our method outperforms an LSTM trained in the simple mode of feeding the whole sequence.


[Figure 16: (a) Training pipeline — 1) 3D skeletal input (category c); 2) Lie group based relative geometry representation, mapping the Lie group $SE(3)\times\cdots\times SE(3)$ to the Lie algebra $\mathfrak{se}(3)\times\cdots\times\mathfrak{se}(3)$; 3) hidden states exploration by low-rank matrix decomposition; 4) HMM parameter learning, with emission probabilities learned via an LSTM. (b) Testing pipeline — 1) 3D skeletal input; 2) Lie group based relative geometry representation; 3) emission probability estimation by the pre-trained LSTM; 4) a first-order HMM.]

Fig. 16. Illustration of the pipelines of the proposed method. (a) Training pipeline; (b) testing pipeline. Please note that we use $f_i$ rather than $x_i$ in $P(h_i \mid f_i)$ to emphasize the Lie group based representation (X. Liu et al., 2019) ©2019, IEEE.


4.6 Experimental results and analysis

In this section, a series of experiments is performed to evaluate the proposed approach. Two benchmark datasets, the ChaLearn 2014 gesture dataset (Escalera et al., 2014) and MSR Action3D (W. Li et al., 2010), are used for evaluation purposes.

In the proposed method, the emission probability is estimated by a recurrent neural network with four layers, connected in the following order: one LSTM layer with 512 units, a fully connected layer with 256 neurons, a dropout layer with a dropout ratio of 50%, and a softmax loss layer to force the network to generate the likelihood $P(h_t \mid x_t, \zeta)$. During network training, the batch size is set to 400. The learning rate is fixed to 0.01 for the ChaLearn 2014 gesture dataset and 0.002 for the MSR Action3D dataset. The network is trained until the validation accuracy and the loss are stable, after a number of epochs that depends on the size of the training data. We set 70 as the maximum number of training epochs for the ChaLearn 2014 gesture dataset due to its large training data size. For the MSR Action3D dataset, the maximum number of training epochs is set to 200.
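A hedged sketch of this network configuration, written with tf.keras purely for illustration, could look as follows; the framework, the activation of the fully connected layer, and the placeholder dimensions feat_dim and num_states are assumptions, not specifications from the thesis.

```python
import tensorflow as tf

# Illustrative sketch of the emission-probability network described above.
# feat_dim (input feature size) and num_states (number of HMM hidden states)
# are placeholders; the ReLU activation is an assumption.
feat_dim, num_states = 60, 100

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(512, input_shape=(None, feat_dim)),    # LSTM layer, 512 units
    tf.keras.layers.Dense(256, activation='relu'),              # fully connected, 256 neurons
    tf.keras.layers.Dropout(0.5),                               # dropout ratio 50%
    tf.keras.layers.Dense(num_states, activation='softmax'),    # softmax over hidden states
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),  # 0.002 for MSR Action3D
              loss='sparse_categorical_crossentropy')
# model.fit(x_train, y_train, batch_size=400, epochs=70)  # 200 epochs for MSR Action3D
```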

Fig. 17. Illustration of the arrangement of matrix L. (a) The original L with several chunks (continuous frames with the same property); the columns in gray are the low-rank part, and the white columns denote the non-low-rank part. (b) "Dilation" operation with interval threshold 2 to enlarge the boundaries of chunks. (c) "Erosion" operation with length threshold 2 to erode away the isolated small chunks. (d) "Dilation" operation with interval threshold 3.

An important hyperparameter is the number of hidden states. As mentioned in Section 4.1, a gesture is typically composed of seven phases, but in the experiments we found that the Preparation, Pre-stroke hold, and Stroke phases are usually mixed together (they can be merged into the Stroke phase). Therefore, the number of hidden states for each gesture is set to five. The output $L$ of the matrix decomposition is not always in the form of five chunks (continuous frames with the same property) because of disturbances from noise and misalignments of the skeleton, so we need a


post-processing step to arrange the matrix L (an example is illustrated in Fig. 17 (a)). From the phase definition of a gesture we can conclude that the beginning and ending phases are Resting. We can also easily observe that performers are always static at the starting and ending frames of each gesture sequence. Based on this observation, the first and last several frames (at least the first and last frame) of each gesture sequence should belong to the low-rank part of the matrix L (the columns in gray as illustrated in Fig. 17 (a)), and they can be initialized as Resting phases. To further reduce the complexity of the arrangement, the longest low-rank chunk of matrix L is picked out to initialize the Post-stroke hold phase. Therefore, the final arrangement should make matrix L have three low-rank chunks (two Resting phases and a Post-stroke hold phase) and two non-low-rank ones (Stroke and Retraction phases), as shown in the example in Fig. 17 (d). After the initialization of phases, there are two special situations in which the number of chunks is smaller than 5. Firstly, there is only one chunk when all columns of matrix L belong to the low-rank part. Secondly, there may be two low-rank chunks (Resting phases) and one non-low-rank chunk in matrix L. For these two cases, an equal division scheme can be utilized to obtain 5 chunks. More commonly, however, the number of chunks will be greater than 5. Inspired by morphology in binary image processing, for a low-rank chunk an operation similar to "dilation" is adopted to enlarge its boundaries by merging the adjacent low-rank chunk (see Fig. 17 (b) and (d)) if the interval between them is smaller than a threshold. The merged chunk repeats this operation until the above condition can no longer be satisfied. In this chapter, the "dilation" is operated only on the three chunks corresponding to the initialized low-rank phases. Also, an "erosion" operation is utilized to erode away isolated small chunks whose lengths are smaller than a threshold (see Fig. 17 (c)). We employ an iterative process to achieve the arrangement: more specifically, a "dilation" and an "erosion" operation are executed successively in each iteration, and we increase the interval and chunk-length thresholds after each iteration. In our experiments, the starting thresholds and the step size are set to 1. A simplified sketch of this arrangement procedure is given below.
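The following is a minimal sketch of the iterative "dilation"/"erosion" arrangement, assuming the decomposition result is summarized as a boolean vector marking which columns of L are low-rank; it omits details such as restricting dilation to the three initialized low-rank chunks, and all function names are illustrative.

```python
import numpy as np

# Simplified sketch of the chunk arrangement: alternate "dilation" (fill short
# gaps between low-rank chunks) and "erosion" (drop isolated short low-rank
# chunks), increasing both thresholds by 1 per iteration, until five chunks remain.

def to_chunks(mask):
    """List the runs of a boolean vector as (value, start, length) tuples."""
    chunks, start = [], 0
    for t in range(1, len(mask) + 1):
        if t == len(mask) or mask[t] != mask[start]:
            chunks.append((bool(mask[start]), start, t - start))
            start = t
    return chunks

def arrange_phases(low_rank, max_iter=10):
    mask = np.asarray(low_rank, dtype=bool).copy()
    gap_thr, len_thr = 1, 1                          # starting thresholds and step size 1
    for _ in range(max_iter):
        if len(to_chunks(mask)) <= 5:
            break
        # "Dilation": fill short non-low-rank gaps between low-rank chunks
        for value, start, length in to_chunks(mask):
            if not value and length <= gap_thr and 0 < start and start + length < len(mask):
                mask[start:start + length] = True
        # "Erosion": remove isolated short low-rank chunks
        for value, start, length in to_chunks(mask):
            if value and length <= len_thr and 0 < start and start + length < len(mask):
                mask[start:start + length] = False
        gap_thr += 1
        len_thr += 1
    return to_chunks(mask)
```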

The proposed method is compared with many state-of-the-art approaches, including four HMM-related methods, namely HMM with Gaussian mixture model (HMM-GMM) (Murphy, 2012), HMM with AdaBoost (HMM-AdaBoost) (Lv & Nevatia, 2006), HMM with deep belief network (HMM-DBN) (D. Wu & Shao, 2014) and its extension (HMM-DBN-ext) (D. Wu et al., 2016), as well as histogram of 3D joints (HOJ3D) (Xia et al., 2012), EigenJoints (X. Yang & Tian, 2012), actionlet ensemble (Actionlet) (J. Wang et al., 2012, 2014), and histogram of oriented 4D normals (HON4D) (Oreifej & Liu, 2013).


Table 5. Comparison of recognition accuracy (%) with existing skeleton-based methods on the ChaLearn 2014 (Escalera et al., 2014) dataset (best: bold, second best: underline) (X. Liu et al., 2019).

Methods                                       Accuracy
HMM-AdaBoost (Lv & Nevatia, 2006)             –
HMM-GMM (Murphy, 2012)                        49.1
HMM-DBN (D. Wu & Shao, 2014)                  83.6
HMM-DBN-ext (D. Wu et al., 2016)*             86.4
EigenJoints (X. Yang & Tian, 2012)            59.3
Actionlet (J. Wang et al., 2012, 2014)*       –
HOJ3D (Xia et al., 2012)                      –
HON4D (Oreifej & Liu, 2013)*                  –
Key-frames (Zanfir et al., 2013)              –
Lie group (Vemulapalli et al., 2014)          79.2
Manifold (Devanne et al., 2015)               –
RVV+DTW (Y. Guo et al., 2017)                 –
LM3TL (Y. Yang et al., 2017)                  –
ST-NBNN (Weng et al., 2017)                   –
ModDrop (CNN) (Neverova et al., 2016)*        93.1
LSTM (Hochreiter & Schmidhuber, 1997)         82.0
HBRNN (Du et al., 2015)                       –
ST-LSTM-TG (J. Liu et al., 2016)              –
Ours                                          93.8

* The methods use skeleton and RGB-D data.

Further compared approaches include discriminative key-frames (Key-frames) (Zanfir et al., 2013), Lie group (Vemulapalli et al., 2014), Riemannian manifold (Manifold) (Devanne et al., 2015), rotation and relative velocity with DTW (RVV+DTW) (Y. Guo et al., 2017), latent max-margin multitask learning (LM3TL) (Y. Yang et al., 2017), and spatio-temporal naive-Bayes nearest-neighbor (ST-NBNN) (Weng et al., 2017). The proposed method is also compared with four deep learning methods: the convolutional neural network based ModDrop (CNN) (Neverova et al., 2016), LSTM (Hochreiter & Schmidhuber, 1997), hierarchical recurrent neural network (HBRNN) (Du et al., 2015), and spatio-temporal LSTM with trust gates (ST-LSTM-TG) (J. Liu et al., 2016). The baseline results are taken from the original papers. Note that some of the compared methods were developed for multi-modal data; for example, HMM-DBN-ext (D. Wu et al., 2016) utilizes multiple modalities (RGB and skeleton), while the proposed method is based only on 3D skeleton data.

Firstly, we give the experimental results on the ChaLearn 2014 gesture dataset (Escalera et al., 2014), following the same testing protocol as described in Section 3.6.


To verify the effectiveness of the hidden states exploration for the HMM, we compared the proposed method with three HMM-based state-of-the-art methods. As shown in Table 5, the recognition accuracies of HMM with Gaussian mixture model (HMM-GMM) (Murphy, 2012) and with deep belief network (HMM-DBN) (D. Wu & Shao, 2014) are only 49.1% and 83.6%. This is because both DBN and GMM treat the input frames at each time step as independent variables, so the contextual information is missing when learning the emission probability. The extension of HMM-DBN (HMM-DBN-ext) (D. Wu et al., 2016) reaches up to 86.4%, although it used not only skeleton data but also RGB and depth data. It can also be observed that the accuracy of the LSTM (Hochreiter & Schmidhuber, 1997) is 11 percentage points lower than that of the proposed method. Although LSTM is designed to perceive contextual information, it is still challenging for it to model sequences with many states, especially when training data is limited. The method of Vemulapalli et al. (2014) utilized the same Lie group representation of 3D skeletons as ours and employed DTW to deal with the temporal dynamics issue of human gesture recognition. However, DTW cannot globally capture the temporal evolution of whole sequences, so its performance is inferior to that of the proposed method. It is notable that ModDrop (Neverova et al., 2016) was the winner of the 2014 LAP Challenge (track 3). The proposed method achieves performance comparable to ModDrop without using the RGB-D and audio data.

Secondly, we report the performance of the proposed method on the MSR Action3D (W. Li et al., 2010) dataset, which is a commonly used action recognition dataset, especially for evaluating the effectiveness of temporal dynamics modeling techniques. We compare the proposed method with the state-of-the-art methods under the same evaluation protocol as noted in Section 3.6, and the recognition accuracies are recorded in Table 6. It can be seen that the proposed method achieves better performance than DTW-based recognition approaches, such as Lie group (Vemulapalli et al., 2014) and RVV+DTW (Y. Guo et al., 2017). In (Zanfir et al., 2013), the authors emphasized the importance of discriminative key-frames for action recognition. However, key frame selection is itself a difficult task, which usually suffers from information loss. HMM-DBN (D. Wu & Shao, 2014) employed a deep neural network to learn the parameters of the HMM, but it utilized fixed anchors for obtaining the hidden states. In contrast, we formulate a model over the temporal domain that is able to capture the static poses between sub-gestures; therefore, a gesture sequence can be segmented into temporal compositions (states) with semantically meaningful and discriminative concepts. Compared with HMM-DBN, the experimental results on the MSR Action3D


Table 6. Comparison of recognition accuracy (%) with existing skeleton-based methods on the MSR Action3D (W. Li et al., 2010) dataset (best: bold, second best: underline) (X. Liu et al., 2019).

Methods                                       Accuracy
HMM-AdaBoost (Lv & Nevatia, 2006)             63.0
HMM-GMM (Murphy, 2012)                        81.5
HMM-DBN (D. Wu & Shao, 2014)                  82.0
HMM-DBN-ext (D. Wu et al., 2016)*             –
EigenJoints (X. Yang & Tian, 2012)            82.3
Actionlet (J. Wang et al., 2012, 2014)*       88.2
HOJ3D (Xia et al., 2012)                      78.9
HON4D (Oreifej & Liu, 2013)*                  88.9
Key-frames (Zanfir et al., 2013)              91.7
Lie group (Vemulapalli et al., 2014)          92.5
Manifold (Devanne et al., 2015)               92.1
RVV+DTW (Y. Guo et al., 2017)                 93.4
LM3TL (Y. Yang et al., 2017)                  95.6
ST-NBNN (Weng et al., 2017)                   94.8
ModDrop (CNN) (Neverova et al., 2016)*        –
LSTM (Hochreiter & Schmidhuber, 1997)         88.9
HBRNN (Du et al., 2015)                       94.5
ST-LSTM-TG (J. Liu et al., 2016)              94.8
Ours                                          96.3

* The methods use skeleton and RGB-D data.

dataset again verified the effectiveness of the proposed method. As can be seen, among all of the 17 compared methods, our model achieves the highest recognition accuracy.

4.7 Conclusion

In this chapter, we propose a novel skeleton-based recognition framework which integrates the strengths of a generative model (HMM) and a deep recurrent neural network (LSTM).

The main contributions are summarized as follows:

1. We propose a new formulation to build temporal structure models based on a low-rank matrix decomposition algorithm. The only assumption is that the gesture's "hold" phases with (more or less) static poses are linearly correlated with each other, which can be captured by the low-rank matrix. We also explicitly consider the column-block prior of the outlier signals, i.e., the hand movement parts (phases) which cannot be fitted into the low-rank model. Thus, the temporal structure alignment is interpreted as a binary


clustering problem. In contrast to conventional methods using fixed anchors (see Fig. 13 (d)), a gesture sequence can be segmented into temporal compositions (phases) with semantically meaningful and discriminative concepts (see Fig. 13 (c)).

2. We propose a new hidden state learning model based on a recurrent neural network. The different temporal compositions correspond to the different hidden states of the HMM. The use of the HMM allows heterogeneous information of one gesture class to be distributed over many states (phases), which is key to improving the capability of modeling complex patterns. In contrast to a traditional HMM using a Gaussian mixture model (GMM) (Murphy, 2012), which ignores temporal contextual information, the LSTM is utilized to learn probability distributions over the states of the HMM, as it provides robust classification of small temporal chunks.

3. We propose a new human gesture recognition framework by combining the strengths of the HMM and the LSTM. Rather than modeling whole sequences (a gesture) within the LSTM as conventional RNN methods do, we feed the network with temporal compositions (hidden states) that have shorter temporal lengths and provide more training samples. As such, learning LSTM parameters from a large amount of training data is not needed. In addition, we introduce a Lie group based feature to better represent the 3D geometric relationships between various body parts. Experiments demonstrate that our approach achieves state-of-the-art performance on 3D skeleton based human gesture recognition benchmarks.


5 Summary

As a long-lasting, ongoing problem in computer vision, human motion detection and gesture recognition is very significant due to its wide applicability in many real-world domains. According to the typical pipeline of gesture analysis, this thesis presents the work done on motion detection and gesture understanding. In summary, two motion detection methods based on compressive sensing have been presented, and two methods using 3D skeletons for gesture recognition have been proposed.

5.1 Contributions

The contributions of this thesis can be summarized from two perspectives.

For motion detection:

We propose a new formulation of motion detection via a fast greedy pursuit algorithm. It explicitly considers group properties of sparse outliers (foreground) in both the spatial and temporal domains for better sparse recovery, instead of merely considering the spatial domain, as conventional methods do.

We formulate background modeling as a dictionary learning problem, so that a training sequence without any foreground is no longer required. Furthermore, a random update policy is employed to deal with a wide range of events in the background scene. This background model can respond rapidly to sudden background changes.

We propose a new formulation of sparse signal recovery via the Multi-Channel Fused Lasso (MCFL) regularizer. It explicitly reconstructs multi-channel foreground signals with a spatial structure that reflects smooth changes along the group features.

For gesture recognition:

We represent a human skeleton as a point in the product space of the special orthogonal group SO(3), which is a Riemannian manifold. This representation is independent of the performer's location, and can explicitly model the 3D geometric relationships between body parts using rotations. A gesture (skeletal sequence) can then be represented by a trajectory composed of these points. The gesture recognition task is formulated as a problem of computing the similarity between the shapes of trajectories.

We extend the TSRVF representation for comparing trajectories on the product space SO(3)×···×SO(3). Therefore, the temporal dynamics issue of gesture recognition can be solved with this time-warping invariant feature.


We present a sparse coding of skeletal trajectories that explicitly considers the labelling information of each atom to enforce the discriminant validity of the dictionary. Comparative experimental results on several challenging datasets demonstrate that the proposed method achieves state-of-the-art performance.

We propose a new formulation to build temporal structures based on a low-rank matrix decomposition algorithm. The only assumption is that the gesture's "hold" phases with static poses are linearly correlated with each other, which can be captured by the low-rank matrix. We also explicitly consider the column-block prior of the outlier signals, the part of hand movements (phases) which cannot be fitted into the low-rank model. Thus, the temporal structure alignment is interpreted as a binary clustering problem. In contrast to conventional methods using fixed anchors, the proposed method can segment a gesture sequence into temporal compositions (phases) with semantically meaningful and discriminative concepts.

We propose a new hidden state learning model based on a recurrent neural network. Different temporal compositions correspond to the different hidden states of the HMM. The use of the HMM allows heterogeneous information of one gesture class to be distributed over many states (phases), and is key to improving the capability of modeling complex patterns. In contrast to a traditional HMM using the Gaussian mixture model (GMM), which ignores the temporal contextual information and uses a specific distance metric for clustering, the LSTM is utilized to enhance the HMM by generating better emission probabilities, as it provides robust classification of small temporal chunks.

We propose a new gesture recognition framework by combining the advantages of the HMM and the LSTM. Rather than modeling whole sequences (a gesture) within the LSTM as conventional RNN methods do, we feed the network with temporal compositions (hidden states) that have shorter temporal lengths and provide more training samples. Therefore, learning LSTM parameters from a large amount of training data is not needed. In addition, we introduce a Lie group based feature to better represent the 3D geometric relationships between various body parts. Experiments demonstrate that our approach achieves state-of-the-art performance on 3D skeleton based human gesture recognition benchmarks.

5.2 Limitations and future work

In conclusion, the methods presented in this thesis address a number of challenging problems in gesture analysis, including human motion segmentation with dynamic


backgrounds, dealing with the gestures' variability caused by temporal dynamics, incorporating geometric invariance (features invariant to the performer's size), preserving context information, and learning from a limited amount of training data.

However, a lot of work still needs to be done to realize the full potential of gesture understanding in real-world applications. For example, the methods proposed in this thesis do not provide an online model for detecting and classifying the temporal segmentation of gestures. In other words, a predefined starting and ending frame is needed for each gesture. Therefore, real-time inference and detecting gestures with low latency is a potential future research topic.

Another future area of research could be to develop advanced socially intelligent technologies based on affective computing and social signal processing techniques to endow machines with the ability of social awareness. Like facial expressions, which have been widely used for human emotion perception, human body gestures are also extremely critical cues for understanding human social attitudes. In future work, body behavior cues which emerge in social communication as social gestures could be studied, and attempts could be made to conduct automatic analysis on them by leveraging computer vision and machine learning techniques. Currently, affective computing oriented automatic gesture analysis is still a research gap. Even though communicative gestures have been widely studied in behavioral sciences, criminology and other areas of social science related research, only rare studies have been conducted in the computer science and engineering fields. In recent years, the most closely related studies have been carried out in analyzing sign language or actions. However, they are either related to the language for the deaf or conducted without any affective analysis. As a result, it would be highly useful to conduct research to fill this gap. Social gestures can be performed intentionally or unintentionally. Gestures performed intentionally during social communications are used for expressing a subject's specific social attitudes or for assisting verbal communication. We refer to such gestures as communicative gestures. Moreover, those performed unintentionally are termed micro-gestures and are usually suppressed by humans. Such micro-gestures usually reveal a person's true social attitudes and emotional moods. Both of these types of social gestures are important cues in affective computing and human computer interaction. However, both of them also have large intra-class variability and inter-class similarity. For example, different subjects may display different habits concerning their body behaviour, or the same gesture in different subjects may indicate different emotions. Thus, performing


automatic analysis on such social gestures is a highly challenging yet very interesting area of research.


References

Aharon, M., Elad, M., & Bruckstein, A. (2006). K-SVD: An algorithm for designingovercomplete dictionaries for sparse representation. IEEE Trans. Signal Process.,54(11), 4311–4322.

Alaíz, C. M., Barbero, A., & Dorronsoro, J. R. (2013). Group fused lasso. International

Conference on Artificial Neural Networks, 66–73.Amor, B. B., Su, J., & Srivastava, A. (2016). Action recognition using rate-invariant

analysis of skeletal shape trajectories. IEEE Trans. Pattern Anal. Mach. Intell.,38(1), 1–13.

Anirudh, R., Turaga, P., Su, J., & Srivastava, A. (2015). Elastic functional coding ofhuman actions: From vector-fields to latent variables. Proc. IEEE Conf. Comput.

Vis. Pattern Recognit., 3147–3155.Bach, F., Jenatton, R., Mairal, J., & Obozinski, G. (2011). Convex optimization with

sparsity-inducing norms. Optimization for Machine Learning, 5, 19–53.Barnich, O., & Van Droogenbroeck, M. (2011). ViBe: A universal background

subtraction algorithm for video sequences. IEEE Trans. Image Process., 20(6),1709–1724.

Bauer, M., Bruveris, M., & Michor, P. W. (2014). Overview of the geometries of shapespaces and diffeomorphism groups. J. Math. Imaging Vis., 50(1-2), 60–97.

Bauschke, H. H., & Combettes, P. L. (2011). Convex analysis and monotone operatortheory in hilbert spaces. , 408.

Beck, A., & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm forlinear inverse problems. SIAM journal on imaging sciences, 2(1), 183–202.

Bouwmans, T. (2011). Recent advanced statistical background modeling for foregrounddetection: A systematic survey. Recent Patents on Computer Science, 4(3),147–176.

Bouwmans, T. (2014). Traditional and recent approaches in background modeling forforeground detection: An overview. Computer Science Review, 11, 31–66.

Bouwmans, T., Sobral, A., Javed, S., Jung, S. K., & Zahzah, E.-H. (2016). Decomposi-tion into low-rank plus additive matrices for background/foreground separation: Areview for a comparative evaluation with a large-scale dataset. Computer Science

Review.


Brutzer, S., Hoferlin, B., & Heidemann, G. (2011). Evaluation of backgroundsubtraction techniques for video surveillance. Proc. IEEE Conf. Comput. Vis.

Pattern Recognit., 1937–1944.Candès, E. J., Li, X., Ma, Y., & Wright, J. (2011). Robust principal component analysis?

J. ACM, 58(3), 11.Candès, E. J., Romberg, J., & Tao, T. (2006). Robust uncertainty principles: Exact

signal reconstruction from highly incomplete frequency information. IEEE Trans.

Inf. Theory, 52(2), 489–509.Chen, H., Liu, X., Li, X., Shi, H., & Zhao, G. (2019). Analyze spontaneous gestures for

emotional stress state recognition: A micro-gesture dataset and analysis withdeep learning. IEEE International Conference on Automatic Face & Gesture

Recognition (FG).Chen, H., Liu, X., & Zhao, G. (2018). Temporal hierarchical dictionary with HMM

for fast gesture recognition. International Conference on Pattern Recognition

(ICPR).Chen, S., Donoho, D., & Saunders, M. A. (1998). Atomic decomposition by basis

pursuit. SIAM journal on scientific computing, 20(1), 33–61.Combettes, P. L., & Pesquet, J.-C. (2007). A Douglas–Rachford splitting approach

to nonsmooth convex variational signal recovery. IEEE J. Sel. Topics Signal

Process., 1(4), 564–574.Combettes, P. L., & Pesquet, J.-C. (2011). Proximal splitting methods in signal

processing. In (pp. 185–212). Springer.Dai, W., & Milenkovic, O. (2009). Subspace pursuit for compressive sensing signal

reconstruction. IEEE Trans. Inf. Theory, 55(5), 2230–2249.Devanne, M., Wannous, H., Berretti, S., Pala, P., Daoudi, M., & Del Bimbo, A.

(2015). 3D human action recognition by shape analysis of motion trajectories onRiemannian manifold. IEEE Trans. Cybern., 45(7), 1340–1352.

Dikmen, M., & Huang, T. S. (2008). Robust estimation of foreground in surveillancevideos by sparse error estimation. Proc. IAPR Int. Conf. Pattern Recognit., 1–4.

Dikmen, M., Tsai, S.-F., & Huang, T. S. (2009). Base selection in estimating sparseforeground in video. IEEE Int. Conf. Image Process., 3217–3220.

Donoho, D. L. (2006). Compressed sensing. IEEE Trans. Inf. Theory, 52(4), 1289–1306.Du, Y., Wang, W., & Wang, L. (2015). Hierarchical recurrent neural network for skeleton

based action recognition. Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,1110–1118.


Ebadi, S. E., & Izquierdo, E. (2016). Foreground segmentation via dynamic tree-structured sparse RPCA. Proc. Eur. Conf. Comput. Vis., 314–329.

Escalera, S., Baró, X., Gonzalez, J., Bautista, M. A., Madadi, M., Reyes, M., . . . Guyon,I. (2014). Chalearn looking at people challenge 2014: Dataset and results. Proc.

Eur. Conf. Comput. Vis. Workshops, 459–473.Fadili, M.-J., & Starck, J.-L. (2009). Monotone operator splitting for optimization

problems in sparse recovery. Proc. IEEE Int. Conf. Image Process., 1461–1464.Gao, Z., Cheong, L., & Wang, Y. (2014). Block-sparse RPCA for salient motion

detection. IEEE Trans. Pattern Anal. Mach. Intell., 36(10), 1975–1987.Gong, D., Medioni, G., & Zhao, X. (2014). Structured time series analysis for human

action segmentation and recognition. IEEE Trans. Pattern Anal. Mach. Intell.,36(7), 1414–1427.

Goyette, N., Jodoin, P., Porikli, F., Konrad, J., & Ishwar, P. (2012). Changedetection.net: A new change detection benchmark dataset. Proc. IEEE Conf. Comput. Vis.

Pattern Recognit. Workshops, 1–8.Guo, H., Qiu, C., & Vaswani, N. (2014). An online algorithm for separating sparse and

low-dimensional signal sequences from their sum. IEEE Trans. Signal Process.,62(16), 4284–4297.

Guo, J., Liu, Y., Hsia, C., Shih, M., & Hsu, C. (2011). Hierarchical method forforeground detection using codebook model. IEEE Trans. Circuits Syst. Video

Technol., 21(6), 804–815.Guo, X., Wang, X., Yang, L., Cao, X., & Ma, Y. (2014). Robust foreground detection

using smoothness and arbitrariness constraints. Proc. Eur. Conf. Comput. Vis.,535–550.

Guo, Y., Li, Y., & Shao, Z. (2017). RRV: A spatiotemporal descriptor for rigid bodymotion recognition. IEEE Trans. Cybern..

Guyon, C., Bouwmans, T., & Zahzah, E. (2012). Foreground detection based onlow-rank and block-sparse matrix decomposition. Proc. IEEE Int. Conf. Image

Process., 1225–1228.Hage, C., Seidel, F., & Kleinsteuber, M. (2014). GPU Implementation for Background-

Foreground-Separation via Robust PCA and Robust Subspace Tracking. Back-

ground Modeling and Foreground Detection for Video Surveillance.Han, F., Reily, B., Hoff, W., & Zhang, H. (2017). Space-time representation of people

based on 3D skeletal data: A review. Comput. Vis. Image Underst., 158, 85–105.Harandi, M., & Salzmann, M. (2015). Riemannian coding and dictionary learning:


Kernels to the rescue. Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 3926–3935.

Harandi, M. T., Salzmann, M., & Hartley, R. (2014). From manifold to manifold:Geometry-aware dimensionality reduction for spd matrices. Proc. Eur. Conf.

Comput. Vis., 17–32.He, J., Balzano, L., & Szlam, A. (2012). Incremental gradient on the Grassmannian for

online foreground and background separation in subsampled video. Proc. IEEE

Conf. Comput. Vis. Pattern Recognit., 1568–1575.Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Comput.,

9(8), 1735–1780.Hofmann, M., Tiefenbacher, P., & Rigoll, G. (2012). Background segmentation with

feedback: The pixel-based adaptive segmenter. Proc. IEEE Conf. Comput. Vis.

Pattern Recognit. Workshops, 38–43.Hu, W., Yang, Y., Zhang, W., & Xie, Y. (2017). Moving object detection using

tensor-based low-rank and saliently fused-sparse decomposition. IEEE Trans.

Image Process., 26(2), 724–737.Hu, Y., Sirlantzis, K., Howells, G., Ragot, N., & Rodriguez, P. (2015). An online

background subtraction algorithm using a contiguously weighted linear regressionmodel. Signal Processing Conference (EUSIPCO), 2015 23rd European, 1845–1849.

Huang, J., Huang, X., & Metaxas, D. (2009). Learning with dynamic group sparsity.Proc. IEEE Int. Conf. Comput. Vis., 64–71.

Huang, J., Zhang, T., & Metaxas, D. (2011). Learning with structured sparsity. J. Mach.

Learn. Res., 12, 3371–3412.Huang, X., Wang, S.-J., Liu, X., Zhao, G., Feng, X., & Pietikainen, M. (2017). Discrim-

inative spatiotemporal local binary pattern with revisited integral projection forspontaneous facial micro-expression recognition. IEEE Trans. Affect. Comput..

Javed, S., Oh, S., Bouwmans, T., & Jung, S. K. (2015). Robust background subtractionto global illumination changes via multiple features-based online robust principalcomponents analysis with markov random field. J. Electron. Imaging, 24(4),043011–043011.

Javed, S., Oh, S., Sobral, A., Bouwmans, T., & Jung, S. (2015). Background subtractionvia superpixel-based online matrix decomposition with structured foregroundconstraints. Proc. IEEE Int. Conf. Comput. Vis. Workshops, 90–98.

Javed, S., Oh, S., Sobral, A., Bouwmans, T., & Jung, S. K. (2014). OR-PCA with MRF


for robust foreground detection in highly dynamic backgrounds. In (pp. 284–299).Springer.

Javed, S., Sobral, A., Bouwmans, T., & Jung, S. (2015). OR-PCA with dynamic featureselection for robust background subtraction. Proceedings of the 30th Annual ACM

Symposium on Applied Computing, 86–91.Jiang, Z., Lin, Z., & Davis, L. S. (2013). Label consistent K-SVD: Learning a

discriminative dictionary for recognition. IEEE Trans. Pattern Anal. Mach. Intell.,35(11), 2651–2664.

Karcher, H. (1977). Riemannian center of mass and mollifier smoothing. Commun.

Pure Appl. Math., 30(5), 509–541.Ke, Q., Bennamoun, M., An, S., Sohel, F., & Boussaid, F. (2017). A new representation

of skeleton sequences for 3D action recognition. Proc. IEEE Conf. Comput. Vis.

Pattern Recognit..Kendon, A. (1980). Gesticulation and speech: Two aspects of the process of utterance.

The relationship of verbal and nonverbal communication, 25(1980), 207–227.Kim, K., Chalidabhongse, T. H., Harwood, D., & Davis, L. (2005). Real-time

foreground–background segmentation using codebook model. Real-time imaging,11(3), 172–185.

Kita, S., Van Gijn, I., & Van der Hulst, H. (1997). Movement phases in signs andco-speech gestures, and their transcription by human coders. International Gesture

Workshop, 23–35.Koppula, H., & Saxena, A. (2013). Learning spatio-temporal structure from RGB-D

videos for human activity detection and anticipation. Proc. Int. Conf. Mach.

Learn., 792–800.Koppula, H., & Saxena, A. (2016). Anticipating human activities using object

affordances for reactive robotic response. IEEE Trans. Pattern Anal. Mach. Intell.,38(1), 14–29.

Lee, H.-K., & Kim, J.-H. (1999). An HMM-based threshold model approach for gesturerecognition. IEEE Trans. Pattern Anal. Mach. Intell., 21(10), 961–973.

Li, L., Huang, W., Gu, I., & Tian, Q. (2004). Statistical modeling of complexbackgrounds for foreground object detection. IEEE Trans. Image Process., 13(11),1459–1472.

Li, W., Zhang, Z., & Liu, Z. (2010). Action recognition based on a bag of 3D points.Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 9–14.

Li, Y., Lan, C., Xing, J., Zeng, W., Yuan, C., & Liu, J. (2016). Online human action


detection using joint classification-regression recurrent neural networks. Proc.

Eur. Conf. Comput. Vis., 203–220.Lin, H., Chuang, J., & Liu, T. (2011). Regularized background adaptation: a novel

learning rate control scheme for Gaussian mixture modeling. IEEE Trans. Image

Process., 20(3), 822–836.Lin, Z., Chen, M., & Ma, Y. (2010). The augmented lagrange multiplier method for

exact recovery of corrupted low-rank matrices. arXiv preprint arXiv:1009.5055.Liu, J., Shahroudy, A., Xu, D., & Wang, G. (2016). Spatio-temporal LSTM with trust

gates for 3D human action recognition. Proc. Eur. Conf. Comput. Vis., 816–833.Liu, J., Wang, G., Hu, P., Duan, L.-Y., & Kot, A. C. (2017). Global context-aware

attention LSTM networks for 3D action recognition. Proc. IEEE Conf. Comput.

Vis. Pattern Recognit., 1647–1656.Liu, X., Shi, H., Hong, X., Chen, H., Tao, D., & Zhao, G. (2019). Hidden states

exploration for 3D skeleton-based gesture recognition. IEEE Winter Conf. Appl.

Comput. Vis..Liu, X., Yao, J., Hong, X., Huang, X., Zhou, Z., Qi, C., & Zhao, G. (2018). Background

subtraction using spatio-temporal group sparsity recovery. IEEE Trans. Circuits

Syst. Video Technol., 28(8), 1737–1751.Liu, X., & Zhao, G. (2019a). 3D skeletal gesture recognition via sparse coding of

time-warping invariant Riemannian trajectories. International Conference on

Multimedia Modeling, 678–690.Liu, X., & Zhao, G. (2019b). Background subtraction using multi-channel fused lasso.

IS&T Int. Symposium on Electronic Imaging.Liu, X., Zhao, G., Yao, J., & Qi, C. (2015). Background subtraction based on low-

rank and structured sparse decomposition. IEEE Trans. Image Process., 24(8),2502–2514.

Lv, F., & Nevatia, R. (2006). Recognition and segmentation of 3D human action usingHMM and multi-class adaboost. Proc. Eur. Conf. Comput. Vis., 359–372.

Maddalena, L., & Petrosino, A. (2008). A self-organizing approach to backgroundsubtraction for visual surveillance applications. IEEE Trans. Image Process.,17(7), 1168–1177.

Maddalena, L., & Petrosino, A. (2012). The SOBS algorithm: what are the limits? Proc.

IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 21–26.Mahasseni, B., & Todorovic, S. (2016, June). Regularizing long short term memory

with 3D human-skeleton sequences for action recognition. Proc. IEEE Conf.


Comput. Vis. Pattern Recognit., 3054-3062.Mairal, J., Jenatton, R., Bach, F. R., & Obozinski, G. R. (2010). Network flow algorithms

for structured sparsity. Proc. Adv. Neural Inf. Process. Syst., 1558–1566.Mitra, S., & Acharya, T. (2007). Gesture recognition: A survey. IEEE Trans. Syst., Man,

Cybern., Part C, 37(3), 311–324.Morency, L.-P., Quattoni, A., & Darrell, T. (2007). Latent-dynamic discriminative

models for continuous gesture recognition. Proc. IEEE Conf. Comput. Vis. Pattern

Recognit., 1–8.Murali, A., Garg, A., Krishnan, S., Pokorny, F. T., Abbeel, P., Darrell, T., & Goldberg, K.

(2016). TSC-DL: Unsupervised trajectory segmentation of multi-modal surgicaldemonstrations with deep learning. Proc. IEEE Conf. Robot. Autom., 4150–4157.

Murphy, K. P. (2012). Machine learning: A probabilistic perspective. , 27-71.Murray, R. M., Li, Z., Sastry, S. S., & Sastry, S. S. (1994). A mathematical introduction

to robotic manipulation.Needell, D., & Tropp, J. (2009). CoSaMP: Iterative signal recovery from incomplete

and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3),301–321.

Neverova, N., Wolf, C., Taylor, G., & Nebout, F. (2016). ModDrop: Adaptivemulti-modal gesture recognition. IEEE Trans. Pattern Anal. Mach. Intell., 38(8),1692–1706.

Ofli, F., Chaudhry, R., Kurillo, G., Vidal, R., & Bajcsy, R. (2014). Sequence of themost informative joints (SMIJ): A new representation for human skeletal actionrecognition. J Vis. Commun. Image Represent., 25(1), 24–38.

Oreifej, O., & Liu, Z. (2013). HON4D: Histogram of oriented 4D normals for activityrecognition from depth sequences. Proc. IEEE Conf. Comput. Vis. Pattern

Recognit., 716–723.Packer, B., Saenko, K., & Koller, D. (2012). A combined pose, object, and feature

model for action understanding. Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,1378–1385.

Pang, Y., Ye, L., Li, X., & Pan, J. (2016). Incremental learning with saliency map formoving object detection. IEEE Trans. Circuits Syst. Video Technol..

Piyathilaka, L., & Kodagoda, S. (2013). Gaussian mixture based HMM for human dailyactivity recognition using 3D skeleton features. Proc. IEEE Conf. Ind. Electron.

Appl., 567–572.Presti, L. L., & La Cascia, M. (2016). 3D skeleton-based human action classification: A


survey. Pattern Recognit., 53, 130–147.Qiu, C., & Vaswani, N. (2011). ReProCS: A missing link between recursive Robust

PCA and recursive sparse recovery in large but correlated noise. arXiv preprint

arXiv:1106.3286.Rautaray, S. S., & Agrawal, A. (2015). Vision based hand gesture recognition for

human computer interaction: a survey. Artif. Intell. Rev., 43(1), 1–54.Rockafellar, R. T. (2015). Convex analysis.Rodriguez, P., & Wohlberg, B. (2014). A Matlab implementation of a fast incremental

principal component pursuit algorithm for video background modeling. Proc.

IEEE Int. Conf. Image Process., 3414–3416.Rodriguez, P., & Wohlberg, B. (2015). Incremental principal component pursuit for

video background modeling. J. Math. Imaging Vis., 1–18.Seidel, F., Hage, C., & Kleinsteuber, M. (2014, July). pROST : A Smoothed `p-norm

Robust Online Subspace Tracking Method for Realtime Background Subtractionin Video. Mach. Vis. Appl., 25(5), 1227–1240.

Shahroudy, A., Liu, J., Ng, T.-T., & Wang, G. (2016). NTU RGB+D: A large scaledataset for 3D human activity analysis. Proc. IEEE Conf. Comput. Vis. Pattern

Recognit., 1010–1019.Sheikh, Y., & Shah, M. (2005). Bayesian modeling of dynamic scenes for object

detection. IEEE Trans. Pattern Anal. Mach. Intell., 27(11), 1778–1792.Shi, H., Liu, X., Hong, X., & Zhao, G. (2018). Bidirectional long short-term memory

variational autoencoder. British Machine Vision Conference (BMVC).Shi, J., Liu, X., Zong, Y., Qi, C., & Zhao, G. (2018). Hallucinating face image

by regularization models in high-resolution feature space. IEEE Trans. Image

Process., 27(6), 2980–2995.Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., . . . Blake, A.

(2011). Real-time human pose recognition in parts from single depth images.Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 1297–1304.

Srivastava, A., Klassen, E., Joshi, S. H., & Jermyn, I. H. (2011). Shape analysis ofelastic curves in Euclidean spaces. IEEE Trans. Pattern Anal. Mach. Intell., 33(7),1415–1428.

Stauffer, C., & Grimson, W. E. L. (1999). Adaptive background mixture models forreal-time tracking. Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2.

Su, J., Kurtek, S., Klassen, E., & Srivastava, A. (2014). Statistical analysis oftrajectories on Riemannian manifolds: bird migration, hurricane tracking and


video surveillance. Ann. Appl. Stat., 530–552.Sung, J., Ponce, C., Selman, B., & Saxena, A. (2012). Unstructured human activity

detection from RBGD images. Proc. IEEE Conf. Robot. Autom., 842–849.Tang, G., & Nehorai, A. (2011). Robust principal component analysis based on low-rank

and block-sparse matrix decomposition. Proc. Conf. Infor, Sci. Syst., 1–5.Tang, K., Fei-Fei, L., & Koller, D. (2012). Learning latent temporal structure for

complex event detection. Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,1250–1257.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the

Royal Statistical Society. Series B (Methodological), 267–288.Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., & Knight, K. (2005). Sparsity and

smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B

(Statistical Methodology), 67(1), 91–108.Tropp, J., & Gilbert, A. (2007). Signal recovery from random measurements via

orthogonal matching pursuit. IEEE Trans. Inf. Theory, 53(12), 4655–4666.Vemulapalli, R., Arrate, F., & Chellappa, R. (2014). Human action recognition by

representing 3D skeletons as points in a Lie group. Proc. IEEE Conf. Comput. Vis.

Pattern Recognit., 588–595.Wang, J., Liu, Z., Wu, Y., & Yuan, J. (2012). Mining actionlet ensemble for action

recognition with depth cameras. Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,1290–1297.

Wang, J., Liu, Z., Wu, Y., & Yuan, J. (2014). Learning actionlet ensemble for 3D humanaction recognition. IEEE Trans. Pattern Anal. Mach. Intell., 36(5), 914–927.

Wang, N., & Yeung, D. (2013). Bayesian robust matrix factorization for image andvideo processing. Proc. IEEE Int. Conf. Comput. Vis., 1785–1792.

Wang, S., Quattoni, A., Morency, L.-P., Demirdjian, D., & Darrell, T. (2006). Hiddenconditional random fields for gesture recognition. Proc. IEEE Conf. Comput. Vis.

Pattern Recognit., 1521–1527.Weng, J., Weng, C., & Yuan, J. (2017). Spatio-temporal naive-Bayes nearest-neighbor

(ST-NBNN) for skeleton-based action recognition. Proc. IEEE Conf. Comput. Vis.

Pattern Recognit..Weston, J., Chopra, S., & Bordes, A. (2014). Memory networks. arXiv preprint

arXiv:1410.3916.Wright, J., Ganesh, A., Rao, S., Peng, Y., & Ma, Y. (2009). Robust principal component

analysis: Exact recovery of corrupted low-rank matrices via convex optimization.


Proc. Adv. Neural Inf. Process. Syst., 2080–2088.Wu, C., Zhang, J., Savarese, S., & Saxena, A. (2015). Watch-n-patch: Unsupervised

understanding of actions and relations. Proc. IEEE Conf. Comput. Vis. Pattern

Recognit., 4362–4370.Wu, D., Pigou, L., Kindermans, P. J., Le, N., Shao, L., Dambre, J., & Odobez, J. M.

(2016). Deep dynamic neural networks for multimodal gesture segmentation andrecognition. IEEE Trans. Pattern Anal. Mach. Intell., 38(8), 1583–1597.

Wu, D., & Shao, L. (2014). Leveraging hierarchical parametric networks for skeletaljoints based action segmentation and recognition. Proc. IEEE Conf. Comput. Vis.

Pattern Recognit., 724–731.Wu, Y., & Huang, T. S. (1999). Vision-based gesture recognition: A review. International

Gesture Workshop, 103–115.Xia, L., Chen, C. C., & Aggarwal, J. K. (2012). View invariant human action recognition

using histograms of 3D joints. Proc. IEEE Conf. Comput. Vis. Pattern Recognit.

Workshops, 20-27.Xin, B., Tian, Y., Wang, Y., & Gao, W. (2015). Background subtraction via generalized

fused lasso foreground modeling. Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,4676–4684.

Xu, J., Ithapu, V., Mukherjee, L., Rehg, J. M., & Singh, V. (2013). GOSUS: Grass-mannian online subspace updates with structured-sparsity. Proc. IEEE Int. Conf.

Comput. Vis., 3376–3383.Xu, Y., Dong, J., Zhang, B., & Xu, D. (2016). Background modeling methods in

video analysis: A review and comparative evaluation. CAAI Transactions on

Intelligence Technology.Xu, Y., Hong, X., Liu, X., & Zhao, G. (2018). Saliency detection via bi-directional

propagation. J. Vis. Commun. Image Represent., 53, 113–121.Xu, Y., Hong, X., Porikli, F., Liu, X., Chen, J., & Zhao, G. (2018). Saliency integration:

An arbitrator model. IEEE Trans. Multimed..Xue, Y., Guo, X., & Cao, X. (2012). Motion saliency detection using low rank and

sparse decomposition. Proc. IEEE Int. Conf. Acoust., Speech, Signal Process.,1485–1488.

Yang, X., & Tian, Y. (2012). Eigenjoints-based action recognition using naive-Bayes-nearest-neighbor. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops,14–19.

Yang, Y., Deng, C., Tao, D., Zhang, S., Liu, W., & Gao, X. (2017). Latent max-margin


multitask learning with skelets for 3-D action recognition. IEEE Trans. Cybern.,47(2), 439–448.

Yao, J., Liu, X., & Qi, C. (2014). Foreground detection using low rank and structuredsparsity. Proc. IEEE Int. Conf. Multimed. Expo., 1–6.

Zanfir, M., Leordeanu, M., & Sminchisescu, C. (2013). The moving pose: An efficient3D kinematics descriptor for low-latency action recognition and detection. Proc.

IEEE Int. Conf. Comput. Vis., 2752–2759.Zha, Z., Liu, X., Huang, X., Shi, H., Xu, Y., Wang, Q., . . . Zhang, X. (2017). Analyzing

the group sparsity based on the rank minimization methods. Proc. IEEE Int. Conf.

Multimed. Expo., 883–888.Zha, Z., Liu, X., Zhang, X., Chen, Y., Tang, L., Bai, Y., . . . Shang, Z. (2018). Compressed

sensing image reconstruction via adaptive sparse nonlocal regularization. Visual

Comput., 34, 117–137.Zha, Z., Liu, X., Zhou, Z., Huang, X., Shi, J., Shang, Z., . . . Zhang, X. (2017). Image

denoising via group sparsity residual constraint. Proc. IEEE Int. Conf. Acoust.,

Speech, Signal Process., 1787–1791.Zha, Z., Zhang, X., Wang, Q., Bai, Y., Chen, Y., Tang, L., & Liu, X. (2018). Group

sparsity residual constraint for image denoising with external nonlocal self-similarity prior. Neurocomputing, 275, 2294–2306.

Zha, Z., Zhang, X., Wang, Q., Tang, L., & Liu, X. (2018). Group-based sparserepresentation for image compressive sensing reconstruction with non-convexregularization. Neurocomputing, 296, 55–63.

Zha, Z., Zhang, X., Wu, Y., Wang, Q., Liu, X., Tang, L., & Yuan, X. (2018). Non-convex weighted lp nuclear norm based ADMM framework for image restoration.Neurocomputing.

Zhang, Q., & Li, B. (2010). Discriminative K-SVD for dictionary learning in facerecognition. Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2691–2698.

Zhang, Z. (2012). Microsoft Kinect sensor and its effect. IEEE Multimed., 19(2), 4–10.Zhou, X., Yang, C., & Yu, W. (2013). Moving object detection by detecting contiguous

outliers in the low-rank representation. IEEE Trans. Pattern Anal. Mach. Intell.,35(3), 597–610.

Zhu, W., Lan, C., Xing, J., Zeng, W., Li, Y., Shen, L., & Xie, X. (2016). Co-occurrencefeature learning for skeleton based action recognition using regularized deepLSTM networks. Proc. AAAI Conf. Artif. Intell., 2, 8.


Original articles

I Liu X., Yao J., Hong X., Huang X., Zhou Z., Qi C., & Zhao G. (2018). Background subtraction using spatio-temporal group sparsity recovery. IEEE Transactions on Circuits and Systems for Video Technology, 28(8), 1737–1751. IEEE.

II Liu X. & Zhao G. (2019). Background subtraction using multi-channel fused Lasso. In Proceedings of the IS&T 2019 International Symposium on Electronic Imaging (EI 2019). IS&T. Accepted for publication.

III Liu X. & Zhao G. (2019). 3D skeletal gesture recognition using sparse coding of time-warping invariant Riemannian trajectories. In Proceedings of the 2019 International Conference on Multimedia Modeling (MMM 2019), 678–690. Springer, Cham.

IV Liu X., Shi H., Hong X., Chen H., Tao D., & Zhao G. (2019). Hidden states exploration for 3D skeleton-based gesture recognition. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV 2019). IEEE. Accepted for publication.

Reprinted with permission from IEEE (I, IV), IS&T (II), and Springer, Cham (III).

Original publications are not included in the electronic version of the dissertation.


A C T A U N I V E R S I T A T I S O U L U E N S I S

Book orders: Granum: Virtual book store, http://granum.uta.fi/granum/

S E R I E S C T E C H N I C A

683. Tomperi, Jani (2018) Predicting the treated wastewater quality utilizing optical monitoring of the activated sludge process

684. Fazel Modares, Nasim (2018) The role of climate and land use change in Lake Urmia desiccation

685. Kärnä, Aki (2018) Modelling of supersonic top lance and the heat-up stage of the CAS-OB process

686. Silvola, Risto (2018) One product data for integrated business processes

687. Hildebrandt, Nils Christoph (2018) Paper-based composites via the partial dissolution route with NaOH/urea

688. El Assal, Zouhair (2018) Synthesis and characterization of catalysts for the total oxidation of chlorinated volatile organic compounds

689. Akanegbu, Justice Orazulukwe (2018) Development of a precipitation index-based conceptual model to overcome sparse data barriers in runoff prediction in cold climate

690. Niva, Laura (2018) Self-optimizing control of oxy-combustion in circulating fluidized bed boilers

691. Alavesa, Paula (2018) Playful appropriations of hybrid space : combining virtual and physical environments in urban pervasive games

692. Sethi, Jatin (2018) Cellulose nanopapers with improved preparation time, mechanical properties, and water resistance

693. Sanguanpuak, Tachporn (2019) Radio resource sharing with edge caching for multi-operator in large cellular networks

694. Hintikka, Mikko (2019) Integrated CMOS receiver techniques for sub-ns based pulsed time-of-flight laser rangefinding

695. Järvenpää, Antti (2019) Microstructures, mechanical stability and strength of low-temperature reversion-treated AISI 301LN stainless steel under monotonic and dynamic loading

696. Klakegg, Simon (2019) Enabling awareness in nursing homes with mobile health technologies

697. Goldmann Valdés, Werner Marcelo (2019) Valorization of pine kraft lignin by fractionation and partial depolymerization

698. Mekonnen, Tenager (2019) Efficient resource management in Multimedia Internet of Things
