
Micro Hand Gesture Recognition System Using Ultrasonic Active Sensing

Yu Sang, Laixi Shi, Yimin Liu, Member, IEEE

Abstract—In this paper, we propose a micro hand gesture recognition system using ultrasonic active sensing, which uses micro dynamic hand gestures within a time interval for classification and recognition to achieve Human-Computer Interaction (HCI). The implemented system, called Hand-Ultrasonic-Gesture (HUG), consists of ultrasonic active sensing, pulsed radar signal processing, and time-sequence pattern recognition by machine learning. We adopted lower-frequency (less than 1 MHz) ultrasonic active sensing to obtain range-Doppler image features, detecting micro finger motions at a fine resolution of range and velocity. Making use of the high-resolution sequential range-Doppler features, we propose a state-transition-based Hidden Markov Model for classification, in which the high-dimensional features are symbolized, achieving a competitive accuracy of nearly 90% while significantly reducing computational complexity and power consumption. Furthermore, we utilized an End-to-End neural network model for classification and reached an accuracy of 96.34%. Besides offline analysis, a real-time prototype was released to verify our method's potential for real-world application.

Index Terms—Ultrasonic active sensing, range-Doppler, micro hand gesture, Hidden Markov Model, neural network.

I. INTRODUCTION

RECENTLY, human-computer interaction (HCI) has become more and more influential in research and daily life. Human hand gesture recognition (HGR) is one of the main topics in computer science as an effective approach to HCI. HGR can significantly facilitate applications in fields such as automobiles (driving assistance and safety), smart homes, wearable devices, and virtual reality. In application scenarios where haptic controls and touch screens may be physically or mentally limiting, the touchless input modality of hand gestures is far more convenient and attractive.

Considering precision and capability, we would like to address more promising technologies based on micro dynamic hand gestures. Rather than wide-range, large-shift actions such as waving or rolling hands, these micro hand gestures, involving movements of multiple fingers such as tapping, clicking, rotating, pressing, and rubbing, are exactly how people normally use their hands in the physical world. Directly using the hand and multiple fingers to control devices is a natural, flexible, and user-friendly way that requires no extra learning effort.

In the hand gesture recognition research area, numerous optical sensing methods based on computer vision, utilizing RGB cameras [1], [2] and depth cameras [3], [4] such as infrared cameras, have been proposed and even commercialized.

Yu Sang, Laixi Shi and Yimin Liu are with the Department of Electronic Engineering, Tsinghua University, Beijing, 100086 China.

e-mail:[email protected]

Examples include Microsoft Kinect and Leap Motion. These optical approaches with RGB images or depth information perform well on relatively wide-range gestures and static gestures [3]. However, to the best of our knowledge, their reliability in harsh environments, owing to the variance of lighting conditions such as dark nights or direct sunlight, remains a problem that must be taken into consideration. These approaches are also not energy efficient enough for continuous real-time hand gesture recognition in HCI, such as on wearable and portable devices.

In comparison, active sensing, which controls its own signal transceiver, is more robust under different circumstances (lighting, weather, etc.). Aware of this, HGR based on radar technologies has begun to attract a great amount of interest in microwave engineering. Approaches based on sound or radio-frequency (RF) active sensing can obtain features such as range profile, Doppler, micro-Doppler, and angle of arrival (AOA) from the received signal. Such features are competitive and well suited to identifying the signatures of hand gestures [6], [8], [12]. However, the bandwidth demands on RF waves are rigorous, and it is thus hard to meet the resolution requirements of micro hand gesture recognition. Ultrasonic active sensing can meet these resolution requirements thanks to its low propagation velocity. What's more, it costs much less and has great potential for use in wearable and portable devices [18]. The hardware can be integrated and miniaturized with MMIC or MEMS techniques, consuming much less energy (400 µW at 30 fps for a 1 m maximum range with ultrasonic MEMS techniques) than CMOS image sensors (> 250 mW) [5].

In this work, we present our Project HUG (Hand-Ultrasonic-Gesture), which builds a micro hand gesture recognition system with active ultrasonic sensors, radar signal processing techniques, and time-sequence pattern recognition methods. Having received the waves from the ultrasonic receiver, we apply pulsed radar signal processing to obtain time-sequential range-Doppler features from the ultrasonic waves, measuring objects' distances and velocities precisely and simultaneously through a single channel. Based on these high-resolution features, we propose a state-transition-based Hidden Markov Model (HMM) approach for micro dynamic gesture recognition and achieve competitive accuracy. For comparison and evaluation, we also implemented a random forest, a convolutional neural network (CNN), long short-term memory (LSTM), and an End-to-End neural network of a CNN followed by an LSTM. The main contributions of this work are summarized as follows:

• We implemented a micro hand gesture recognition system using ultrasonic active sensing.



Fig. 1: Pipeline design block diagram

Based on the high-resolution range-Doppler features, we propose an HMM model based on a state transition mechanism to symbolize the features, which significantly compresses the data and extracts the most intrinsic signatures. This approach achieved 89.38% accuracy on the dataset of seven micro hand gestures and greatly improved the computational and energy efficiency of the recognition part, offering more potential for portable and wearable devices. We also implemented the End-to-End neural network model and achieved 96.34% accuracy on the same dataset.

• To verify our system's real-time feasibility, we released a real-time micro hand gesture recognition prototype to demonstrate the real-time control of a music player.

II. RELATED WORK

As radar signal processing becomes ubiquitously involved in HCI, many radar signal processing technologies combined with machine learning methods or neural network models have been used for hand gesture recognition, especially those exploiting the Doppler and micro-Doppler effects. Different kinds of radars, including continuous-wave radars [6], Doppler radars [7], and FMCW radars [8], [9], have been utilized for HGR with competitive results. A K-band continuous-wave (CW) radar achieved up to 90% accuracy on 4 wide-range hand gestures [6]. Beyond radar radio-frequency (RF) signals, other kinds of signals have also been applied to HGR. Microsoft Research introduced a work that used acoustic Doppler methods and leveraged the speakers and microphones already embedded in commodity devices to sense in-air gestures, achieving plausible accuracy but producing noise from its continuous audible waves [10]. Pu et al. developed WiSee, a novel gesture recognition system based on dual-channel Doppler radar that leveraged wireless signals to enable whole-home sensing and recognition under complex conditions [11]. These works aroused further interest in applying ultrasound or RF signals to HGR. However, most of them focus on relatively wide-range, large-shift human gestures. As most radar signals' frequencies and bandwidths limit the feature properties, they are not suitable for recognizing micro hand gestures composed of subtle motions of several fingers, as they can hardly distinguish between fingers. This limitation of the signal and features is a big challenge for micro hand gesture recognition.

Fig. 2: Platform Design

Similar to our ideas, Google announced Project Soli in 2015, aiming at controlling smart devices and wearables through human micro hand gestures [12], [13]. Our project differs from Google's Soli in several aspects. We obtain millimeter-level range and centimeter-per-second-level velocity resolution by choosing an ultrasound signal, which propagates at a much slower speed than light, performing much better than Soli's millimeter-wave radar, which is constrained by its platform. Benefiting from better features, we mainly propose a particular HMM model for gesture recognition and achieve comparable results against four other methods. We also achieve a much higher accuracy of 96.34% using the End-to-End network, owing to our more discriminative features. For potential applications, we can implement viable devices at much lower frequencies (less than 1 MHz), which reduces front-end circuit complexity. We developed a pulsed radar based on ultrasonic sensors that simultaneously measures range and Doppler with high resolution in one channel, further reducing hardware complexity.

III. METHODOLOGY

A. Platform Design and Implementation

To obtain high-quality, low-cost raw data, we designed an architecture implementing ultrasonic active sensing for raw signal acquisition, followed by signal processing and recognition stages. Fig. 1 illustrates the block diagram of the whole prototype.

In order to meet the high resolution required for micro hand gesture recognition, we carefully designed the hardware composition and parameters. The velocity resolution v_d and range resolution R_d are given by (1):

v_d = \frac{\lambda}{2MT}, \qquad R_d = \frac{c}{2B} \qquad (1)

where c is the propagation speed, λ is the wavelength of the ultrasound, B is the bandwidth, T stands for the pulse repetition interval (PRI), and M is the total number of pulses, carefully chosen to meet the stop-and-hop assumption. To clearly differentiate fingers' positions and motions, millimeter-level range resolution (< 1 cm) and centimeter-per-second-level velocity resolution (< 0.1 m/s) are needed; in addition, high precision in range and velocity is also recommended.


To meet these range and velocity resolution requirements, we chose an ultrasonic frequency of 300 kHz and a bandwidth of 20 kHz; the pulse repetition interval (PRI) is 600 µs and the pulse number M is 12.
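As a quick check of these choices, the resolutions in equation (1) can be computed directly from the stated parameters. The minimal sketch below assumes a speed of sound in air of roughly 343 m/s, which is not stated in the paper.

```python
# Sketch: range/velocity resolution from Eq. (1), assuming c ~= 343 m/s in air.
c = 343.0   # propagation speed of sound in air (m/s), assumed
f = 300e3   # carrier frequency (Hz)
B = 20e3    # bandwidth (Hz)
T = 600e-6  # pulse repetition interval, PRI (s)
M = 12      # number of pulses per coherent processing interval

wavelength = c / f              # lambda ~= 1.14 mm
R_d = c / (2 * B)               # range resolution ~= 8.6 mm (< 1 cm)
v_d = wavelength / (2 * M * T)  # velocity resolution ~= 7.9 cm/s

print(f"range resolution    R_d = {R_d * 1e3:.2f} mm")
print(f"velocity resolution v_d = {v_d * 1e2:.2f} cm/s")
```

Under the assumed speed of sound, the computed range resolution is sub-centimeter and the velocity resolution is on the order of centimeters per second, matching the levels required above.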

After rigorous design, all the requirements are met, as illustrated in Table I. Compared to Soli's millimeter-wave radar's range resolution of 2.14 cm, the HUG system provides much more precise features, which are sufficient for micro hand gesture recognition.

TABLE I: Parameter setting and system performance

The whole system is illustrated in Fig. 2. To ensure accurate clock synchronization for the quality of the Doppler features, we use a customized FPGA for the DAC and ADC modules. For vibration offset, we use digital canceling to avoid ultrasonic transceiver leakage. After the parameters are set via the serial port, the DAC produces the modulated pulse signal, which is amplified by the front-end circuit. The amplified signal stimulates the MA300D1-1 Murata ultrasonic transceiver to emit ultrasonic waves. The ultrasonic waves are reflected from the micro hand gesture along with interference noise. Received waves are amplified and modulated by the front-end circuits and then passed to the ADC. After sampling, the raw signal is transferred to the PC over the UDP protocol, using parallel computation to achieve real-time capability. The transceiver module passes the raw signal to the signal processing module described next.

B. Pulsed Radar Signal Processing and Feature Extraction

In our system, we mainly use range-Doppler pulsed radar signal processing to detect palm and finger movements and extract features for the subsequent classifiers. Pulsed radar measures target range via round-trip delay and uses the Doppler effect to determine velocity, emitting pulses and processing the returned signals at a specific pulse repetition frequency. Two important procedures in pulsed radar signal processing are fast-time sampling and slow-time sampling: the former refers to sampling inside a PRI and determines the range; the latter refers to sampling across multiple pulses, using the PRI as the sampling period, to determine the Doppler shift and hence the velocity.

To overcome the weakness of the echoes and attenuate the influence of noise, chirp pulses are applied as the emitted signal to improve the SNR. The baseband waveform can be expressed as:

x(t) = \sum_{m=0}^{M-1} a(t - mT) \qquad (2)

where

a(t) = \cos\left(\frac{\pi B}{\tau} t^2 - \pi B t\right), \quad 0 \le t \le \tau \qquad (3)

In these expressions, a(t) is a single pulse with time duration τ, and B, T, M, and PRI are the same as defined above.
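For illustration, a minimal numpy sketch of the pulse train in equations (2)-(3) follows. The pulse duration τ is an assumed placeholder, since its value is not given here.

```python
import numpy as np

# Sketch: baseband chirp pulse train of Eqs. (2)-(3).
Fs  = 400e3   # fast-time sampling rate (Hz), from the paper
B   = 20e3    # sweep bandwidth (Hz)
T   = 600e-6  # pulse repetition interval (s)
M   = 12      # number of pulses
tau = 200e-6  # single-pulse duration (s): an assumed value, not given here

t = np.arange(0, M * T, 1 / Fs)  # time axis over the whole M-pulse burst
x = np.zeros_like(t)
for m in range(M):
    tm = t - m * T                  # time within the m-th PRI
    gate = (tm >= 0) & (tm <= tau)  # pulse is on only for 0 <= t' <= tau
    # a(t') = cos(pi*B/tau * t'^2 - pi*B*t'), per Eq. (3)
    x[gate] += np.cos(np.pi * B / tau * tm[gate] ** 2 - np.pi * B * tm[gate])
```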

After receiving the waves reflected from the hand, the signal processing proceeds as follows. The received reflections of the ultrasonic waves are processed by an IQ quadrature demodulator and a low-pass filter to obtain baseband signals.

Fig. 3: Range-Doppler feature

With fast-time sampling at a frequency Fs of 400 kHz, matched filters are applied to detect the ranges of the palm and fingers. With slow-time sampling, N-point FFTs are applied to the samples of different pulses to detect objects' velocities. After this two-dimensional sampling, range-Doppler image features are generated at the frame level. Every range-Doppler frame has the shape 256 × 180, as in Fig. 3, which shows the region of interest (ROI) of the range-Doppler image at a specific time t. The data of a micro hand gesture is a three-dimensional (range-Doppler-time) cube, the stack of range-Doppler planes along the time dimension. Fig. 4 shows an example "button off" gesture, with four range-Doppler planes sampled at different times from the data cube.
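The fast-time/slow-time processing above can be sketched as follows. This is a minimal illustration, assuming the demodulated echoes are already arranged as a pulses-by-samples matrix; it is not the system's actual implementation.

```python
import numpy as np

# Sketch: frame-level range-Doppler map from demodulated baseband echoes.
# `echoes` is an (M, K) array of M slow-time pulses x K fast-time samples,
# and `a` is the sampled reference chirp (both assumed available).
def range_doppler(echoes: np.ndarray, a: np.ndarray, n_fft: int) -> np.ndarray:
    # Fast time: matched-filter each pulse against the transmitted chirp
    # (np.correlate conjugates its second argument).
    mf = np.apply_along_axis(
        lambda row: np.correlate(row, a, mode="same"), 1, echoes)
    # Slow time: N-point FFT across pulses at each range bin -> Doppler axis.
    rd = np.fft.fftshift(np.fft.fft(mf, n=n_fft, axis=0), axes=0)
    return np.abs(rd)  # magnitude range-Doppler image (Doppler x range)
```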

C. Micro Hand Gesture Recognition

A pattern recognition module is cascaded after the pulsed signal processing to classify the time-sequence range-Doppler signatures. We propose both machine learning and deep learning architectures designed for micro gesture recognition with the HUG system; generalizing the approaches to other RF or ultrasonic signals is straightforward. There is a large number of powerful classification algorithms that can be applied to temporal hand gesture recognition.

1) Hidden Markov Model: Taking advantage of the high-resolution range-Doppler planes resulting from ultrasonic active sensing, we are capable of separating contiguous scattering points. We aim to find a method that extracts and compresses the features, retaining more useful information and less noise, before feeding them into a robust classification architecture. Since the scattering points of the ultrasonic waves are the user's fingers, the peaks detected in the range-Doppler images illustrate the fingers' motion.


Fig. 4: Range-Doppler feature frames sampled in the "button off" (see Fig. 8) gesture. The first subfigure shows the index finger and thumb separated at a 3 cm distance. Noise will be removed using time smoothing to increase robustness, as with the 3 marked "noise" objects in the first subfigure. The second and third subfigures show the index finger moving down with an acceleration while the thumb stays almost static, moving up with a tiny velocity. The last subfigure shows the two fingers touching. Note that an object's trajectory is always a curve in the range-Doppler plane. The symbolized states are labeled at the center of the detected objects.

Fig. 5: State transition diagram, where states represent the kinds of scattering points and the arrows represent the allowed transitions. The states are as follows. Initialization states include state 'new' and state '0': initial states, uncertain states, and the first missing-scattering-point state. Missing states include states '-1' and '-2', meaning the scattering points of some frames are missing. Tracking states include states '1', '2', '3', and '4', meaning the tracking and locking of static and dynamic objects. Ending states include '8' and 'del', meaning the end of a motion, when the dynamic object's speed drops below the threshold or the object is missing for more than 3 frames.

The micro hand gestures' structure and features are explicitly demonstrated in these detected finger motions. Besides the precise ranges and velocities of the fingers, the moving tendencies and relations of the fingers are exactly the indispensable features for recognition. Based on the analysis above, we propose a state-transition-based HMM for gesture classification. This method consists of three parts: a state transition mechanism, feature embedding, and the HMM.

• Finite-state Transition Machine

We propose a state transition mechanism to summarize each detected object's state. The state of one detected object is judged by the following information. Its moving direction and discretized velocity define whether the object is static or characterize its dynamic condition. The distance in the range-Doppler image between objects of adjacent frames is used to judge whether an object should be merged with or separated from other objects. The reflection intensity and range help to remove noise, etc. There is also a tracking mechanism to complement and improve the robustness of the whole state transition process. The detailed design is shown in Fig. 5. Through this extraction process, the features of every frame's range-Doppler image are reduced to several integers. For a gesture, we compress the data into a feature sequence

P = \{p_t\}_{t=1}^{N},

where the range-Doppler plane in frame t is extracted as a symbol p_t, described as (4):

p_t = L(n_t, v_t, r_t) \qquad (4)

• Mapping into Symbol Sequence

To fit the features into the HMM model, a mapping function σ discretizes and maps each frame's features to a summary symbol s_t. The symbols from all frames form the final feature sequence

S = \{s_t\}_{t=1}^{N},

defined as (5):

s_t = \sigma(p_t) \qquad (5)

The features of every frame are thus mapped into an integer, which forms that frame's observation for the HMM model. Features in this form save tremendous computation compared with other methods; a small sketch of this symbolization follows.
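Since the functions L and σ are only loosely specified above, the quantization grid and dictionary-based mapping below are hypothetical choices, given purely for illustration.

```python
# Sketch of Eqs. (4)-(5): map each frame's (n_t, v_t, r_t) tuple to one
# integer symbol via a dictionary built during training. The velocity/range
# bin widths here are hypothetical, not the paper's actual choices.
def embed(n_objects: int, velocity: float, rng: float,
          dictionary: dict, v_bin: float = 0.02, r_bin: float = 0.01) -> int:
    # p_t = L(n_t, v_t, r_t): count of objects plus quantized velocity/range
    p_t = (n_objects, round(velocity / v_bin), round(rng / r_bin))
    if p_t not in dictionary:      # grow the symbol dictionary on the fly
        dictionary[p_t] = len(dictionary)
    return dictionary[p_t]         # s_t = sigma(p_t)

symbols: dict = {}
# Two example frames: 2 detected objects with small velocities at ~3 cm range.
S = [embed(n, v, r, symbols) for (n, v, r) in [(2, 0.04, 0.03), (2, -0.02, 0.03)]]
```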

• HMM for Classification

We use the HMM to dig out the patterns of the symbolized states; Fig. 6 illustrates the workflow. As described in equation (6), the HMM used in our module is composed of a hidden-state transition model and a multinomial emission model, where X = \{x_t\}_{t=1}^{N} is the observation sequence, Z = \{z_t\}_{t=1}^{N} is the hidden state sequence, A is the transition probability matrix of the hidden states Z, and φ represents the multinomial probability distribution. θ = \{π, A, φ\} collects all the parameters of the model, and π is the initial hidden state distribution.


Fig. 6: Feature extraction and classification workflow

For each pre-defined micro hand gesture, an HMM is trained with the Baum-Welch algorithm. For an input gesture whose ground truth is class C, the likelihood under each model of kind n is calculated with the forward algorithm and compared from a Bayesian posterior perspective, as described by equation (6):

p(X, Z \mid \theta) = p(z_1 \mid \pi) \left[\prod_{n=2}^{N} p(z_n \mid z_{n-1}, A)\right] \prod_{m=1}^{N} p(x_m \mid z_m, \phi),

p(C = n \mid X) = \frac{p(X \mid C = n)\, p(C = n)}{\sum_{n} p(X \mid C = n)\, p(C = n)} \qquad (6)
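As a rough illustration of this classification step, the sketch below scores a symbol sequence under per-gesture HMMs with the scaled forward algorithm and picks the maximum-posterior class; the trained parameters (π, A, φ) are assumed already given by Baum-Welch.

```python
import numpy as np

# Sketch of Eq. (6): per-gesture HMM likelihoods via the forward algorithm,
# compared through the Bayesian posterior. pi: (S,), A: (S, S), phi: (S, V)
# for S hidden states and V observation symbols.
def forward_loglik(obs, pi, A, phi):
    """log p(X | theta) for a discrete-emission HMM (scaled forward pass)."""
    loglik = 0.0
    alpha = pi * phi[:, obs[0]]              # initialization with x_1
    for x in obs[1:]:
        s = alpha.sum()                      # rescale to avoid underflow
        loglik += np.log(s)
        alpha = ((alpha / s) @ A) * phi[:, x]  # predict, weight by emission
    return loglik + np.log(alpha.sum())

def classify(obs, models, priors):
    """argmax_n p(C = n | X) over the per-gesture HMMs (pi, A, phi) tuples."""
    logpost = [forward_loglik(obs, *m) + np.log(p)
               for m, p in zip(models, priors)]
    return int(np.argmax(logpost))           # denominator of Eq. (6) is shared
```

Because the posterior's denominator in (6) is common to all classes, comparing log-likelihood plus log-prior suffices.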

2) End-to-End Network: The detailed network is shown in Fig. 7. Combined networks increasingly outperform standalone CNN or LSTM networks [16]. We cascade the automatic feature extraction DCNN part with the LSTM dynamic gesture classification to form the end-to-end network. In the experiments, we train the two parts jointly instead of training them individually. We pass the output of the fifth convolutional layer of the DCNN as the feature sequence of one frame into the LSTM, which produces the frame-level classification of the hand gesture through a softmax layer. To improve the accuracy of dynamic hand gesture classification, we add a time-weight filter over a time window of 30 frames to get the gesture-level classification. The filter parameters are set empirically, giving the nearest frames high weights; the time-differential pooling filter is designed as:

g_{\text{out}} = \sum_{i=1}^{N} w_i\, \sigma[g(i)], \qquad w_i = \frac{2i}{N}

where g_out is the gesture prediction, g(i) is the per-frame prediction, and w_i are the filter weights.
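A small sketch of this gesture-level pooling, assuming the per-frame softmax outputs σ[g(i)] are already available as an array:

```python
import numpy as np

# Sketch: time-weighted pooling of per-frame predictions into one
# gesture-level prediction, with weights w_i = 2i/N favoring recent frames.
def gesture_prediction(frame_probs: np.ndarray) -> int:
    """frame_probs: (N, n_classes) per-frame softmax outputs, oldest first."""
    N = frame_probs.shape[0]           # window length, e.g. N = 30
    w = 2.0 * np.arange(1, N + 1) / N  # w_i = 2i / N
    g_out = (w[:, None] * frame_probs).sum(axis=0)
    return int(np.argmax(g_out))       # predicted gesture class
```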

3) Comparison Methods: For comparison with the methods mentioned above, we implemented another three frequently used methods: random forest, convolutional neural network, and long short-term memory.

Fig. 7: End-to-End neural network architecture

The random forest has been proven time- and energy-efficient in image classification tasks [20]. As range-Doppler features are images as well, we chose it as one approach to gesture recognition. The random forest consists of 50 trees, and the model size is less than 2 MB.
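A minimal sketch of this baseline, using scikit-learn's RandomForestClassifier with random placeholder data standing in for the real flattened range-Doppler samples (64 × 45 × 30 per sample, as described in the experiments section):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Sketch: the 50-tree random forest baseline on flattened, downsampled
# range-Doppler cubes. Random data replaces the real dataset here.
X = np.random.rand(100, 64 * 45 * 30)  # 100 flattened gesture samples
y = np.random.randint(0, 7, size=100)  # 7 gesture classes (0 = no gesture)

rf = RandomForestClassifier(n_estimators=50, max_depth=None)
rf.fit(X[:80], y[:80])
print("validation accuracy:", rf.score(X[80:], y[80:]))
```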

Among deep learning algorithms, the convolutional neural network is one of the most powerful methods for image classification, automatically extracting features through multiple convolutional and pooling layers. Popular models such as LeNet, AlexNet [15], VGG, ResNet, and GoogLeNet have achieved notable success in numerous image and video classification tasks. Considering that the range-Doppler data cube is similar to video and already has visible features, a DCNN was chosen for our system. Considering the complexity of the features, we implemented a network consisting of 5 convolutional layers and 3 pooling layers. The concrete architecture is given in Table II.

Another architecture we implemented, suited to time-sequential classification, is the recurrent neural network (RNN); in particular, long short-term memory (LSTM) [14] is designed to handle long-term dependencies. The RNN architecture differs from other feed-forward networks in that it contains loops, allowing information to accumulate and persist. The network produces a frame-level classification of the gesture by feeding the output y_t into a softmax layer. We add the same filter as in the End-to-End model to get the gesture-level classification.

IV. EXPERIMENTS AND PERFORMANCE EVALUATION

A. Gesture Set

Seven gestures are evaluated in our system: one is no-gesture, and the others, shown in Fig. 8, are finger press, button on, button off, motion up, motion down, and screw. The principles used in the design are as follows: (1) the chosen micro-gestures involve subtle finger motions extending no further than the wrist, instead of large muscle groups like the arm; (2) most gestures are dynamic, enabling the waves to capture the Doppler features of micro finger motions for the subsequent recognition architecture.

B. Data Acquisition

To ensure adequate variation in gesture execution resulting from individual characteristics and other environmental facets, we collected repeated executions of all the gestures from various subjects.


Fig. 8: Examples of micro hand gestures, named finger, button on (BtnOn), button off (BtnOff), motion up (MtnUp), motion down (MtnDn), and screw, respectively.

We invited 9 subjects from our lab to execute the 6 designed gestures, giving only sketchy instructions on how to perform them. Each kind of gesture was labeled, without removing any data from the dataset, to ensure large variance. We recorded raw range-Doppler images, collecting 50 samples of every gesture kind from each subject, resulting in 9 × 6 × 50 = 2700 sample sequences. Besides these 6 gestures, we added no-gesture as another kind (labeled 0). We collected 9 × 50 × 6 such sequences, which include data from different circumstances, such as no hand, randomly performed gestures, random noise, etc. We shuffled them into each subject's sub-dataset. Finally, the dataset of 5400 samples in total is used in all of our experiments as the raw input.

For training and validation of all the methods, we use standard k-fold leave-one-subject-out cross-validation, where k is the 9 subjects in our experiments.

C. Experimental Implementation

For our HMM model, since it specializes in handling inputs of variable length, we feed feature sequences of different lengths into the models, training seven models for the seven micro gestures respectively. After extracting frame-level features through the transition machine, we embed them into the final feature sequence S according to the produced dictionary. Training stops after about 10 iteration loops out of a maximum iteration number of 1000, much quicker than the neural network models. The classification is based on the maximum of the seven output posterior probabilities, adjusted by the prior probability, since no-gesture is the usual circumstance.

We implemented the CNN architecture (shown in Table II), adapted from networks widely used in video action recognition and other image classification assignments [15]. The network contains 8 layers, comprising convolutional layers and fully connected layers, with a rectified linear (ReLU) activation function after every layer. Some of the convolutional layers use padding split evenly left and right, others use no padding, and max pooling layers are used. As the gesture samples in our dataset have variable length, mainly from 30 to 120 frames, we downsampled the raw RDI data to a fixed size of 256 × 180 × 30 as the high-dimensional

input for the CNN architecture. The samples are stochastically shuffled while maintaining order within each sequence. For compatibility with our features, we reduce the feature maps' dimensions as well. The alignment of samples makes mini-batch training possible and accelerates the training process. To avoid local optima, dampen oscillation, and improve training efficiency, we use mini-batch Adaptive Moment Estimation (Adam) [17] with the two thresholds beta1 = 0.9 and beta2 = 0.999, an initial learning rate of 0.001, and a batch size of 16 or 32 for all the neural networks, including the standalone CNN, the RNN, and the End-to-End model.

For the random forest, before feeding in the features, we downsampled the time-sequence signal cube to a tensor of size 64 × 45 × 30 (height, width, time) and flattened it into a one-dimensional sequence, using a forest size of 50 with no maximum depth. To reduce the feature size and obtain more data for the LSTM, we divided each raw data sequence into a uniform length of 30 frames and subsampled to 128 × 90 × 30 (for the RNN). For the End-to-End architecture, the input data is the same as for the CNN model, while the fully connected layer after the convolutional layers of the CNN is connected to the LSTM directly and trained jointly. The loss is an aggregate over frames, just as in the LSTM model.
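A compact PyTorch sketch of such a jointly trained CNN + LSTM pipeline follows. The layer widths and kernel sizes are illustrative placeholders, not the exact architecture of Table II and Fig. 7.

```python
import torch
import torch.nn as nn

# Sketch: End-to-End CNN + LSTM classifier in the spirit of Fig. 7, with
# per-frame losses aggregated over the sequence. Layer sizes are assumed.
class EndToEnd(nn.Module):
    def __init__(self, n_classes: int = 7):
        super().__init__()
        self.cnn = nn.Sequential(              # per-frame feature extractor
            nn.Conv2d(1, 16, 5, stride=2), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.lstm = nn.LSTM(32 * 4 * 4, 128, batch_first=True)
        self.head = nn.Linear(128, n_classes)  # per-frame class logits

    def forward(self, x):                      # x: (batch, time, H, W)
        b, t, h, w = x.shape
        f = self.cnn(x.reshape(b * t, 1, h, w)).reshape(b, t, -1)
        out, _ = self.lstm(f)                  # one hidden state per frame
        return self.head(out)                  # (batch, time, n_classes)

model = EndToEnd()
opt = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
logits = model(torch.randn(2, 30, 256, 180))   # 30-frame RDI sequences
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 7),
                             torch.randint(0, 7, (2 * 30,)))
loss.backward(); opt.step()                    # joint CNN + LSTM update
```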

D. Real-Time Capability and Computation Complexity

All the training and testing were done on an Nvidia Tesla K20 GPU. All five methods achieved real-time micro hand gesture recognition in our system. The models are relatively small, under 150 MB. In terms of model size, the random forest, standalone CNN, standalone LSTM, and End-to-End models are on the order of megabytes, while our proposed HMM-based method is only on the order of kilobytes, much smaller than the other models (40 KB vs. 20 MB for our End-to-End model and 689 MB for Soli's End-to-End model [12]). For time performance, Soli's End-to-End model predicts frames at 150 Hz [12], while our End-to-End model achieves 450 Hz. At the gesture level, the prediction rate of Soli's End-to-End model would be slower than 150 Hz; our proposed HMM model predicts gestures at a much higher rate than our End-to-End system (250 Hz vs. 15 Hz). Our models are not restricted to high-end GPUs but can also run purely on a CPU, especially the HMM model, whose parameters occupy only 40 KB, giving it the potential to be embedded into wearable and portable devices.

E. Results and Performance Evaluation

In the related work on Google's Soli [12], the system used millimeter-wave radar to classify 11 hand gestures. Apart from 4 relatively wide-range hand gestures, there are actually 6 micro hand gestures and a static gesture (palm hold) related to our work. For comparison, we removed the results of the 4 kinds of wide-range hand gestures and reassigned the confusions of the other 7 kinds of gestures with them according to the confusion matrix. The resulting classification accuracy of the 7 micro hand gestures is nearly 88%.

Fig. 9 shows the confusion matrix of the HMM model. The proposed HMM model achieves competitive classification accuracy compared to Soli and reaches a balance between accuracy and time consumption.


TABLE II: Network Architecture used in our experiments

Although the end-to-end model outperforms the HMM model by 7% in accuracy, the HMM model is almost 500 times smaller and achieves an 18 times higher recognition speed. It has more potential for wearable and embedded systems such as smart watches or virtual reality systems.

Fig. 9: Confusion matrix of accuracy under the HMM model

Fig. 10 shows the confusion matrix of the end-to-end model, whose accuracy is 96.34%, stable enough for real-time gesture recognition without personalized training. Compared with the accuracy of Google Soli's millimeter-wave radar, and profiting from the high resolution of the ultrasonic signal, a similar End-to-End architecture achieves more accurate classification. With nearly 100% accuracy when there is no gesture, the network ensures gesture recognition from random signals with low false positives.

Fig. 10: Confusion matrix of accuracy under the End-to-End model

Fig. 11 shows the overall classification accuracy of the different architectures along with each gesture's accuracy.

We directly use sequence-level recognition instead of frame-level classification, since our micro hand gestures are all dynamic gestures that can be distinguished by temporal-spatial features. We feed 30-frame gesture sequences into the CNN architecture, which automatically uses the temporal information, and achieve 89.10% average accuracy over the seven gestures. As the LSTM takes contextual and temporal information into consideration directly, it also achieves a usable classification result. The first gesture is actually random hand movement labeled no-gesture, which achieves high accuracy across all the architectures. The lower accuracy of the HMM model on the first gesture is caused by its much smaller number of model parameters: increasing the amount of no-gesture data does not continuously improve its accuracy as it does for the other methods.

Fig. 11: The accuracy of each gesture using different architectures, all using leave-one-subject-out cross-validation: (1) random forest with per-gesture sequence accuracy; (2) CNN with per-gesture sequence accuracy; (3) RNN with sequence pooling of per-frame predictions; (4) End-to-End with sequence pooling of per-frame predictions.

Fig. 12 illustrates the per-subject classification validation accuracy under all the architectures, using leave-one-subject-out cross-validation. Owing to the personal variance of gesture execution, we use 8 subjects' samples for training and leave the last subject's samples out for validation. This experiment is a good predictor of real-world gesture recognition, as the validation set is entirely separated from the training set. The variance between subjects, shown in Fig. 12, is less than 7% for these methods, which verifies that our architectures can handle various users with great generalization.

We implemented the prototype for real-time micro hand gesture recognition by calling system functions. We use micro hand gestures to control a music player in real time to verify the system's performance. A demo video has been uploaded [19].


Fig. 12: Accuracies of the various architectures for 11 subjects and their average; fold1-fold11 represent the 11 subjects.

V. CONCLUSION

In this paper we have proposed the first micro hand gesture recognition system using ultrasonic active sensing, capable of recognizing various micro hand gestures. Owing to the high resolution, which yields more efficient features, our proposed HMM model, being concise and consuming less time and energy, achieved a comparable 89.38% accuracy, while the end-to-end model achieved 96.34%, on our dataset of 9 subjects and 7 gestures. By applying the HUG system to a real-world application, the music player, we verified the system's capacity for real-time micro hand gesture recognition and its future potential in ubiquitous artificial intelligence applications and wearable devices.

VI. FUTURE WORK

Despite the advantages of the ultrasonic signal mentioned above, some problems remain to be addressed, such as vibration and the relatively narrow beam width of the ultrasonic transceiver system. The hardware system also needs to become more robust for real-world applications.

ACKNOWLEDGMENT

We are grateful to all the participants for their efforts in collecting hand gesture data for our experiments.

REFERENCES

[1] Chen, Feng Sheng and Fu, Chih Ming and Huang, Chung Lin, Hand gesture recognition using a real-time tracking method and hidden Markov models. Image and Vision Computing, 2003.

[2] Yao, Yi and Li, Chang Tsun, Hand gesture recognition and spotting in uncontrolled environments based on classifier weighting. IEEE International Conference on Image Processing, 2015.

[3] Ren, Zhou and Meng, Jingjing and Yuan, Junsong, Depth camera based hand gesture recognition and its applications in Human-Computer-Interaction. Communications and Signal Processing, 2011.

[4] Suarez, Jesus and Murphy, Robin R, Hand gesture recognition with depth images: A review. RO-MAN, 2012.

[5] Przybyla, Richard J and Tang, Hao-Yen and Guedes, Andre and Shelton, Stefon E and Horsley, David A and Boser, Bernhard E, 3D ultrasonic rangefinder on a chip. IEEE Journal of Solid-State Circuits, 2015.

[6] Li, Gang and Zhang, Rui and Ritchie, Matthew and Griffiths, Hugh, Sparsity-based dynamic hand gesture recognition using micro-Doppler signatures. IEEE Radar Conference, 2017.

[7] Kim, Youngwook and Toomajian, Brian, Hand gesture recognition using micro-Doppler signatures with convolutional neural network. IEEE Access, 2016.

[8] Molchanov, Pavlo and Gupta, Shalini and Kim, Kihwan and Pulli, Kari, Short-range FMCW monopulse radar for hand-gesture sensing. IEEE Radar Conference, 2015.

[9] Peng, Zhengyu and Li, Changzhi and Munoz-Ferreras, Jose Maria and Gomez-Garcia, Roberto, An FMCW radar sensor for human gesture recognition in the presence of multiple targets. Microwave Bio Conference, 2017.

[10] Gupta, Sidhant and Morris, Daniel and Patel, Shwetak and Tan, Desney, SoundWave: using the Doppler effect to sense gestures. 2012.

[11] Pu, Qifan and Gupta, Sidhant and Gollakota, Shyamnath and Patel, Shwetak, Whole-home gesture recognition using wireless signals. Proceedings of the 19th Annual International Conference on Mobile Computing and Networking, 2013.

[12] Wang, Saiwen and Song, Jie and Lien, Jaime and Poupyrev, Ivan and Hilliges, Otmar, Interacting with Soli: Exploring fine-grained dynamic gesture recognition in the radio-frequency spectrum. Symposium on User Interface Software and Technology, 2016.

[13] Lien, Jaime and Gillian, Nicholas and Karagozler, Mustafa Emre and Amihood, Patrick and Schwesig, Carsten and Olson, Erik and Raja, Hakim and Poupyrev, Ivan, Soli: Ubiquitous gesture sensing with millimeter wave radar. ACM Transactions on Graphics, 2016.

[14] Graves, Alex and Mohamed, Abdel Rahman and Hinton, Geoffrey, Speech recognition with deep recurrent neural networks. IEEE International Conference on Acoustics, Speech and Signal Processing, 2013.

[15] Krizhevsky, Alex and Sutskever, Ilya and Hinton, Geoffrey E, ImageNet classification with deep convolutional neural networks. Communications of the ACM, 2017.

[16] Vu, Ngoc Thang and Adel, Heike and Gupta, Pankaj and Schutze, Hinrich, Combining recurrent and convolutional neural networks for relation classification. NAACL, 2016.

[17] Kingma, Diederik P and Ba, Jimmy, Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[18] Qualcomm, Snapdragon Sense ID. https://www.qualcomm.com/products/features/security/fingerprint-sensors, 2015.

[19] HUG, HUG music player controlling. https://www.youtube.com/watch?v=8FgdiIb9WqY, 2016.

[20] Bosch, Anna and Zisserman, Andrew and Munoz, Xavier, Image classification using random forests and ferns. IEEE International Conference on Computer Vision, 2007.