

Hand Detection in First Person Vision

Pietro Morerio, Lucio Marcenaro, Carlo S. Regazzoni
DITEN, University of Genoa
Via Opera Pia 11A, 16145 Genoa, Italy
Email: {pmorerio, mlucio}@ginevra.dibe.unige.it

Abstract—The emergence of new pervasive wearable technologies (e.g. action cameras and smart glasses) calls attention to the so-called First Person Vision (FPV). In the future, more and more everyday-life videos will be shot from a first-person point of view, overturning the classical fixed-camera understanding of Vision, specializing the existing knowledge of moving cameras and bringing new challenges in the field of video processing. The trend in research is going to be oriented towards a new type of computer vision, centred on moving sensors and driven by the need for new applications for wearable devices. We identify hand tracking and gesture recognition as an essential topic in this field, motivated by the simple realization that we often look at our hands, even while performing the simplest tasks in everyday life. In addition, the next frontier in user interfaces is hands-free devices.

In this work we argue that applications based on FPV may involve information fusion at various complexity and abstraction levels, ranging from pure image processing to inference over patterns. We address the lowest, by proposing a first investigation on hand detection from a first-person point-of-view sensor and some preliminary results obtained fusing colour and optic flow information.

    I. INTRODUCTION

Portable head-mounted action cameras, built for recording dynamic high-quality first-person videos, have become a common item among many sportsmen, and have lately aroused the interest of researchers [1]. Such cameras allow recording first-person point-of-view experiences, nothing more.

    Fig. 1. Google Glasses (U.S. Patent D659,741 - May 15, 2012).

Last year Google announced Project Glass and started publishing short teasers on the web. Its goal was to build a wearable computer (Figure 1) that records your perspective of the world, and also delivers information through a head-up display. The idea is not completely new, as back in 2004 Pentland et al. [2] proposed a wearable data collection system called InSense. The concept of "wearable" has surely evolved since then.

Fig. 2. The schematic drawing for Microsoft's patent on augmented reality glasses for 'live events' (U.S. Patent Application 20120293548 - November 22, 2012).

A developer edition of Google Glasses is ready by now, and the hashtag "#IfIHadGlass" was lately flooding Twitter and Google+. Unfortunately, only American citizens can apply for getting a prototype, and the final version is not due until 2014 at the earliest. Microsoft obviously set itself off in the chase and got a patent on augmented reality glasses at the end of last year (Figure 2). This new technology trend has already attracted attention in the world of research [3], and someone has already tried to build his own prototype [4]. On the other hand, someone else has been experimenting with wearable computing and home-made augmented reality for the past 20 years [5]. As the market of apps for smartphones has already had its boom, we suspect another boom is to be expected very soon for wearable-device applications. What is more, interest in the first-person perspective was also lately awakened in the field of Cognitive Computation: Froese et al. propose in [6] to address the challenge of investigating first-person experience by designing new human-computer interfaces, and widely discuss the deep meaning of what they call mind-as-it-could-be from the first-person perspective.

Based on these considerations, we see the emergence of a new field of research in computer vision, focused on the users' point of view. Such a First Person Computer Vision (FPCV) is de facto solicited, and at the same time made possible, by the above-mentioned brand-new available technology, namely a wearable computer equipped with a first-person camera framing everyday life. Noticeably, FPV is somehow more specific than simple vision from moving cameras, due to the constraints linking the device to the subject and his body parts. This specificity might represent an exploitable added value in processing the information gathered, especially concerning interactions between the user and the outside world.

As already mentioned, FPV from wearable devices involves information fusion at various complexity and abstraction levels, ranging from pure image processing to inference over patterns. We stress here three points and then start to address the first one throughout the rest of the paper.

• At a low level, image processing must be exploited for object detection, localization, tracking and recognition. Here the problem of information fusion [7] appears in a manifold of aspects [8] [9] [10].

• Wearable devices such as the above-described glasses are going to be equipped with other sensors than a camera. These may include an accelerometer, a gyroscope, a magnetometer, Wi-Fi, Bluetooth, GPS, a barometer, a microphone and many others. Fusing data from this variety of sensors will become an issue in designing applications, as it already is for smartphones. Multisensor data fusion is a well-known issue and has always drawn the attention of researchers in many fields [11] [12].

• At a higher level of abstraction, scene and behaviour understanding is required for cognitive applications. Here context-based information fusion plays a central role. In more detail, the modelling of the interaction between the user and the outside world often relies on the fusion of "external" and "internal" information (e.g. [13] and [14]).

We start to address the first level of complexity and focus our investigation on hand detection. The reason we focus on hands is exceptionally simple: hands are almost always in our field of view, and are involved in the majority of everyday tasks (just think about writing, lacing shoes, driving, eating ...). Hands are maybe the principal means we employ to actively interact with the surrounding world, things and persons. So we claim that hands are the best starting point for implementing context-aware functionalities on wearable devices. In addition, new technologies such as the Microsoft Kinect and the brand-new Leap Motion controller [15] seem to be pushing towards hands-free or even device-free interfaces for an enhanced human-computer interaction. In this context, gesture recognition will play a central role.

The remainder of this work is organized as follows. Section II presents a review of state-of-the-art techniques for hand detection and tracking, together with some considerations on the design of such algorithms for a wearable device. Section III presents the proposed approach for hand detection and shows preliminary results. Finally, conclusions and additional considerations are drawn in Section IV.

    II. HAND DETECTION AND TRACKING

Hand detection, tracking and extended tracking (i.e. tracking of hand sub-parts) are problems which have been widely addressed in computer vision. At present, perfect hand segmentation or accurate hand localization is hardly achieved, especially under complex conditions. Past approaches mainly focused on detecting hands from a fixed camera framing a whole person, and thus relied on prior knowledge of human silhouettes [16]. The credit for the best degree of accuracy surely goes to depth sensors, which allow for a 3D reconstruction of the scene and a more accurate segmentation of hand shapes [17], thanks to the additional information carried by the depth channel. Yet, we focus on old-fashioned digital cameras, as for the moment no such depth sensor is expected to be wearable.

    Fig. 3. Hand detection in a human silhouette [17]

Most 2D approaches for hand tracking rely on skin colour features, which seems natural. However, colour features are sensitive to variations in illumination and shadows. A colour correction method is proposed in [18] to overcome this difficulty. It is there claimed that a Gaussian Mixture Model (GMM) can well capture the complex variations caused by differences of human race, gender, age etc. In [19] a Skin HOG model (SHOG) is proposed to construct a robust and efficient hand detector. However, a hybrid approach seems preferable, as also claimed in [20], where a Viola-Jones-like object detection scheme (originally applied to face detection [21]) is combined with a colour-based detector, giving satisfactory results for the set of postures considered.

As already pointed out, in the near future more and more videos will be shot by wearable devices, from a first-person point of view. FPV video processing will be more and more requested, raising, among other things, considerable privacy issues. To the best of our knowledge, the problem of detecting and tracking hands from an FPV point of view from a wearable device has never been investigated. We address here some points of this issue and make some considerations.

The good news is that the first-person perspective can be somehow exploited. Besides, hand tracking is a very peculiar problem. It is common understanding that the more specific a problem is, the more the constraints from the problem itself can be taken advantage of. General issues are often more difficult to address.

• Number of targets. Although FPV hand tracking is not a problem with a fixed number of targets, priors on it can be guessed. No more than two targets are allowed, and an equal probability of having zero, one or two targets in the scene can be conjectured.

• Shape. Hands from a first-person point of view are often framed together with the naked arm. A typical oblong silhouette is shown in Figure 9.

• Occlusions. Hands from a first-person point of view are hardly occluded. Hand-to-hand or hand-to-object occlusions may of course occur; however, hand-to-body occlusions hardly happen.

• Interactions. Hands often interact with each other while performing basic tasks. It has been shown [13] that target interactions can be exploited for improving object tracking.

• Geometrical constraints. Hands are linked through arms to the trunk, on which our head is mounted. Degrees of freedom in moving them are thus limited by geometrical constraints, which can then be exploited in hand tracking.

• Personalization. Skin colour features vary a lot from person to person and it is difficult to capture such complex variations, caused by differences of human race, gender, etc. However, wearable devices are personal (as mobile phones and smartphones are), and customized colour models learned from a single user are simpler and more reliable, as will be shown in Section III-A.

On the other hand, some bad news emerges when considering the first-person perspective from a wearable device.

• Moving camera. The problem is a typical moving-camera issue, which has been widely addressed in the literature, especially for what concerns moving vehicles. However, constant speed, or at least a certain amount of regularity in the motion, must often be hypothesised (see e.g. [22]). In any case, many standard methods such as old-fashioned background subtraction for change detection cannot be employed in these circumstances, unless an accurate off-line training phase is performed [23].

• Framing. As depicted in Figure 3, hand detection is often performed on well-framed ad hoc images. This hardly happens in shooting first-person videos, except for specific gestures that could be required by a hypothetical device-free interface.

• Camera motion estimation. This is complicated by the fact that complex roto-translation matrices are involved, as a head often moves sharply and brusquely. Even integrating data with other sensors such as an accelerometer may not be straightforward. The most natural local frame of reference would be the one integral with the device (and thus with the user).

• Perspective. The majority of hand detection and gesture recognition methods in the literature address the problem from the perspective shown in Figure 3, i.e. framing the palm of the hand from the front. From a first-person perspective, the back (or, even worse, the side) of the hand is much more often seen, making gesture recognition harder.

• Real-time requirements. The video processing algorithm must meet real-time requirements, while dealing with the limited computational resources and power supply of a wearable device.

    III. PROPOSED APPROACH

Based on the above considerations, we investigate how it could be possible to design a hand tracker, starting from hand detection.

We identify skin colour as the most distinctive and significant feature to be exploited, being also one of the most commonly used in the literature. After concentrating on RGB colour for a while [24], researchers realized that other colour spaces such as L*a*b, HSV [25] and YCbCr [26] proved to be more suitable for colour-based segmentation, not only in the field of skin detection [27]. Various models have been proposed for capturing the information carried by colour, the most common being the GMM, as already mentioned [18], while non-parametric belief propagation [28] and the Viterbi algorithm [29] turned out to be powerful tools for hand tracking.

However, we realized that colour alone does not bring enough information to reliably detect hands, or better, to reliably detect hands only. For this reason we exploit optic flow information in order to filter out false detections by, roughly speaking, subtracting the global motion of the camera where possible. The way skin-like coloured targets from the background are removed will be explained in detail in Section III-B.

    A. Colour

Although a GMM better captures complex variations of skin colour due to suntan, gender, age etc. [18], we have already argued that for a single user a single Gaussian is enough to satisfactorily grasp the relevant colour information. The space which best shows clustering of skin pixels turns out to be, in our experience, the CbCr subplane of YCbCr.

The experiment that was carried out is relatively simple and is depicted in Figures 4, 5 and 6. The device used is a GoPro Hero, outputting an 848x480 video at 50 fps (bitrate is approximately 8000-9000 kbps). Many video sequences were shot, framing a slightly moving hand and gradually changing the luminosity in the environment, as shown in Figure 4. It can be seen that illumination conditions were stressed to a good extent. Statistics were calculated over the manually drawn red box (the box is drawn in the first frame; it is then checked that it only encompasses skin pixels throughout the video sequence). The typical resulting colour histogram for a generic frame is shown in Figure 5. As one can see, very peaked distributions appear for the Cb and Cr channels, while the changing luminosity results in a broader Y histogram. Mean and standard deviation were calculated for each channel for each frame. Means are plotted in Figure 6 for one of the five 1500-frame sequences. Standard deviations are always around 3 (2.95 on average) for the Cb channel and around 4 (4.3 on average) for the Cr channel. This gives an idea of the extent to which illumination was altered.

Figure 7 shows how the vectors (µCb, µCr) cluster in the (Cb, Cr) plane. The covariance matrix of the 2D distribution clearly has eigenvectors which are not parallel to the axes, thus we opted for a bi-dimensional Gaussian to describe the colour model, instead of two separate Gaussians, one for each channel.
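As a rough sketch of how such a single-Gaussian colour model could be estimated in practice, the snippet below collects the (Cb, Cr) values of the pixels inside a manually drawn skin ROI over a sequence and computes their mean vector and full 2x2 covariance. It uses OpenCV and NumPy; the function name, the ROI format and the example coordinates are ours, introduced only for illustration.

```python
import cv2
import numpy as np

def fit_skin_gaussian(video_path, roi):
    """Fit a bivariate Gaussian to the (Cb, Cr) values of a hand ROI.

    roi is (x, y, w, h), a manually drawn box assumed to contain only
    skin pixels throughout the sequence (as in the experiment described above).
    Returns (mu, sigma): a 2-vector mean and a 2x2 covariance matrix.
    """
    x, y, w, h = roi
    samples = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        ycrcb = cv2.cvtColor(frame, cv2.COLOR_BGR2YCrCb)  # OpenCV channel order: Y, Cr, Cb
        patch = ycrcb[y:y + h, x:x + w]
        # Keep only the chroma channels, reordered as (Cb, Cr).
        cbcr = patch[:, :, [2, 1]].reshape(-1, 2).astype(np.float64)
        samples.append(cbcr)
    cap.release()
    data = np.vstack(samples)
    mu = data.mean(axis=0)
    sigma = np.cov(data, rowvar=False)  # full 2x2 covariance: axes need not be aligned
    return mu, sigma

# Hypothetical usage: ROI drawn on the first frame of a GoPro sequence.
# mu, sigma = fit_skin_gaussian("hand_sequence.mp4", roi=(400, 250, 60, 60))
```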

Fig. 4. Experiment: hand colour characterization.

Fig. 5. Histogram of hand pixels' colour (relative to a single frame; the ROI is shown in Figure 4). Blue line is the Y channel, green is Cr and red is Cb.

Figure 8 shows instead how the scaled features (Cb/Y, Cr/Y) cluster along what seems to be a straight line (three different hands, three different coefficients). This model does not introduce much improvement, thus we set aside this observation for future work, which may include classifying hand sides based on colour.

The model was tested on several sequences shot while performing different activities, like drawing, writing on a whiteboard and typing. Figure 9 (a) shows a sample frame. Figure 9 (b) shows colour-based segmentation using a single Gaussian. Given a model (µ, Σ), segmentation is obtained by imposing the condition

$(x - \mu)^T \Sigma^{-1} (x - \mu) \leq 2, \qquad x \in (Cb \otimes Cr) \subset \mathbb{R}^2. \quad (1)$

Fig. 6. Average hand pixels' colour frame by frame, with changing illumination in the scene.

Fig. 7. Clustering of Cb and Cr features.

Fig. 8. Clustering of Cb/Y and Cr/Y features.

It can be clearly seen that objects with skin-like colour (such as the mouse pad, for instance) are segmented as well as the two hands in the proposed sequence. The way uninteresting targets from the surrounding environment are filtered out is explained below.
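As an illustration of how the condition in Eq. (1) could be evaluated pixel-wise to produce a binary skin mask, a minimal sketch follows (OpenCV/NumPy; the threshold of 2 comes from the equation, while the function name and the optional morphological clean-up are our own assumptions).

```python
import cv2
import numpy as np

def skin_mask(frame_bgr, mu, sigma, thresh=2.0):
    """Binary mask of pixels satisfying (x - mu)^T Sigma^-1 (x - mu) <= thresh
    in the (Cb, Cr) plane, i.e. the condition of Eq. (1)."""
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    cbcr = ycrcb[:, :, [2, 1]].astype(np.float64)   # (H, W, 2) in (Cb, Cr) order
    diff = cbcr - mu                                 # broadcast subtraction of the mean
    sigma_inv = np.linalg.inv(sigma)
    # Squared Mahalanobis distance, computed per pixel without explicit loops.
    d2 = np.einsum('hwi,ij,hwj->hw', diff, sigma_inv, diff)
    return (d2 <= thresh).astype(np.uint8) * 255

# Hypothetical usage, reusing the model fitted above:
# mask = skin_mask(frame, mu, sigma)
# mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))  # optional clean-up
```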

Fig. 9. Colour-based segmentation. (a) Sample frame. (b) Colour-based segmentation. (c) Optic-flow-improved segmentation.

    B. Optic flow

We argue that hands doing things hardly move jointly with the head; rather, they show different displacements from one frame to another. For this reason we employ optic flow to estimate the average movement of the camera, based on the flow vectors calculated through the method proposed in [30] and exploiting the features proposed in [31] and refined in [32]. This way, things moving disjointly from the head can be identified, since they show different optic flow vectors associated with their interest points. Results are shown in Figure 10. The head is almost still, thus the majority of the flow vectors have very small magnitude and a direction opposite to the one towards which the head is (slightly) moving. On the other hand, the moving hand shows marked vectors, which are the vector sum of its own and the head's movements.
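A possible sketch of this step is given below, using OpenCV's Shi-Tomasi corner detector and pyramidal Lucas-Kanade tracker in the spirit of [30]-[32]; taking the median flow vector as the global (camera) motion estimate is our simplification and not necessarily the exact choice made here.

```python
import cv2
import numpy as np

def sparse_flow(prev_gray, curr_gray, max_corners=300):
    """Track Shi-Tomasi features between two frames with pyramidal Lucas-Kanade.
    Returns the tracked point positions, their flow vectors and a global-motion estimate."""
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=max_corners,
                                  qualityLevel=0.01, minDistance=7)
    if pts is None:
        return np.empty((0, 2)), np.empty((0, 2)), np.zeros(2)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None,
                                              winSize=(21, 21), maxLevel=3)
    good = status.ravel() == 1
    p0 = pts.reshape(-1, 2)[good]
    p1 = nxt.reshape(-1, 2)[good]
    flow = p1 - p0
    # Robust estimate of the camera motion: median flow over all tracked points.
    global_motion = np.median(flow, axis=0) if len(flow) else np.zeros(2)
    return p0, flow, global_motion
```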

Blobs which show no interest points, or whose flow is similar to the global one, are eliminated, as shown in Figure 9 (c). Unfortunately, in most frames this leads to the removal of the blob generated by the left hand, which lies still on the table. This suggests that the proposed method only works for hands which are acting in a relevant way.
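A rough sketch of this blob-filtering rule, assuming the skin mask and the sparse flow produced by the previous snippets, could look as follows; the similarity threshold (in pixels) is an arbitrary value chosen for illustration.

```python
import cv2
import numpy as np

def filter_static_blobs(mask, points, flow, global_motion, min_diff=2.0):
    """Keep only connected components of `mask` that contain at least one
    tracked point whose flow differs from the global (camera) motion.
    Blobs with no interest points, or whose flow matches the global one,
    are removed, as described in the text."""
    num, labels = cv2.connectedComponents(mask)
    keep = np.zeros(num, dtype=bool)
    for (x, y), v in zip(points.astype(int), flow):
        lbl = labels[y, x]
        if lbl == 0:
            continue  # point fell on the background, not on a skin blob
        if np.linalg.norm(v - global_motion) > min_diff:
            keep[lbl] = True  # this blob moves differently from the camera
    out = np.zeros_like(mask)
    out[np.isin(labels, np.flatnonzero(keep))] = 255
    return out
```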

The two algorithms do not require massive computational resources; however, the 50 fps of the GoPro camera are not supported. On average, frame processing time is around 50 ms.

    Fig. 10. Optic flow

    IV. CONCLUSION

In this work we have presented a global view of the current trend in wearable technologies and examined its effects on computer vision. We are confident that in the future the portion of computer vision applied to first-person vision will become more and more significant, based on the expectation that wearable technologies will soon have a boom similar to the one smartphones recently had. An overview of the issues arising in FPCV was given, together with considerations on how, conversely, the first-person perspective could be exploited in such a field. The importance of hand detection in this context was discussed at length.

Finally, an approach for hand detection in first-person videos was proposed and tested on sequences recorded with a GoPro wearable camera. Such an approach strongly relies on the most natural feature usable for skin detection, namely colour. Fusion with camera movement information, extracted from optic flow vectors, allows for filtering out undesired detected targets which are integral with the environment. Unfortunately, this also leads to the removal of inactive hand targets, which could be seen either as a negative or as a positive point.

As a future investigation and continuation of this work, we would like to address the problem of training Haar classifiers [21] for recurring back views of hands and specific hand shapes.

REFERENCES

[1] K. Kitani, T. Okabe, Y. Sato, and A. Sugimoto, "Fast unsupervised ego-action learning for first-person sports videos," in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, June 2011, pp. 3241–3248.

[2] M. Blum, A. Pentland, and G. Troster, "InSense: Interest-based life logging," MultiMedia, IEEE, vol. 13, no. 4, pp. 40–48, 2006.

[3] E. Ackerman, "Google gets in your face [2013 tech to watch]," Spectrum, IEEE, vol. 50, no. 1, pp. 26–29, 2013.

[4] R. Furlan, "Build your own Google Glass," Spectrum, IEEE, vol. 50, no. 1, pp. 20–21, 2013.

[5] S. Mann, "Through the glass, lightly [viewpoint]," Technology and Society Magazine, IEEE, vol. 31, no. 3, pp. 10–14, 2012.

[6] T. Froese, K. Suzuki, Y. Ogai, and T. Ikegami, "Using human-computer interfaces to investigate mind-as-it-could-be from the first-person perspective," Cognitive Computation, vol. 4, pp. 365–382, 2012. [Online]. Available: http://dx.doi.org/10.1007/s12559-012-9153-4

[7] I. Bloch and H. Maitre, "Data fusion in 2D and 3D image processing: an overview," in Computer Graphics and Image Processing, 1997. Proceedings., X Brazilian Symposium on, Oct. 1997, pp. 127–134.

[8] A. Del Bue, D. Comaniciu, V. Ramesh, and C. Regazzoni, "Smart cameras with real-time video object generation," vol. 3, 2002, pp. III/429–III/432.

[9] M. Spirito, C. S. Regazzoni, and L. Marcenaro, "Automatic detection of dangerous events for underground surveillance," in IEEE Conference on Advanced Video and Signal Based Surveillance (AVSS 2005). Como, Italy: IEEE Computer Society, September 2005, pp. 195–200.

[10] A. Dore, M. Soto, and C. Regazzoni, "Bayesian tracking for video analytics," Signal Processing Magazine, IEEE, vol. 27, no. 5, pp. 46–55, Sept. 2010.

[11] G. Foresti and C. Regazzoni, "Multisensor data fusion for autonomous vehicle navigation in risky environments," IEEE Transactions on Vehicular Technology, vol. 51, no. 5, pp. 1165–1185, 2002.

[12] G. L. Foresti, C. S. Regazzoni, and P. K. Varshney, Multisensor Surveillance Systems: The Fusion Perspective. Kluwer Academic, Boston, 2003.

[13] L. Marcenaro, L. Marchesotti, and C. S. Regazzoni, "Self-organizing shape description for tracking and classifying multiple interacting objects," Image Vision Comput., vol. 24, no. 11, pp. 1179–1191, 2006.

[14] S. Chiappino, P. Morerio, L. Marcenaro, E. Fuiano, G. Repetto, and C. Regazzoni, "A multi-sensor cognitive approach for active security monitoring of abnormal overcrowding situations in critical infrastructure," 15th International Conference on Information Fusion, Jul. 2012.

[15] "News briefs," Computer, vol. 45, no. 7, pp. 21–23, 2012.

[16] W. Kong, A. Hussain, M. Saad, and N. Tahir, "Hand detection from silhouette for video surveillance application," in Signal Processing and its Applications (CSPA), 2012 IEEE 8th International Colloquium on, March 2012, pp. 514–518.

[17] H. Li, L. Yang, X. Wu, and J. Zhai, "Hands detection based on statistical learning," in Computational Intelligence and Design (ISCID), 2012 Fifth International Symposium on, vol. 2, Oct. 2012, pp. 227–230.

[18] S. Xie and J. Pan, "Hand detection using robust color correction and Gaussian mixture model," in Image and Graphics (ICIG), 2011 Sixth International Conference on, Aug. 2011, pp. 553–557.

[19] X. Meng, J. Lin, and Y. Ding, "An extended HOG model: SCHOG for human hand detection," in Systems and Informatics (ICSAI), 2012 International Conference on, May 2012, pp. 2593–2596.

[20] S. Bilal, R. Akmeliawati, M. Salami, A. Shafie, and E. Bouhabba, "A hybrid method using Haar-like and skin-color algorithm for hand posture detection, recognition and tracking," in Mechatronics and Automation (ICMA), 2010 International Conference on, Aug. 2010, pp. 934–939.

[21] P. Viola and M. Jones, "Robust real-time face detection," in Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, vol. 2, 2001, pp. 747–747.

[22] A. Dani, Z. Kan, N. Fischer, and W. Dixon, "Structure and motion estimation of a moving object using a moving camera," in American Control Conference (ACC), 2010, pp. 6962–6967.

[23] L. Marcenaro, F. Oberti, and C. S. Regazzoni, "Change detection methods for automatic scene analysis by using mobile surveillance cameras," in ICIP, 2000, pp. 1244–1247.

[24] M. Jones and J. Rehg, "Statistical color models with application to skin detection," in Computer Vision and Pattern Recognition, 1999. IEEE Computer Society Conference on, vol. 1, 1999.

[25] Y.-T. Pai, L.-T. Lee, S.-J. Ruan, Y.-H. Chen, S. Mohanty, and E. Kougianos, "Honeycomb model based skin color detector for face detection," in Mechatronics and Machine Vision in Practice, 2008. M2VIP 2008. 15th International Conference on, Dec. 2008, pp. 11–16.

[26] G. Yang, H. Li, L. Zhang, and Y. Cao, "Research on a skin color detection algorithm based on self-adaptive skin color model," in Communications and Intelligence Information Security (ICCIIS), 2010 International Conference on, Oct. 2010, pp. 266–270.

[27] P. Morerio, L. Marcenaro, C. S. Regazzoni, and G. Gera, "Early fire and smoke detection based on colour features and motion analysis," in Image Processing (ICIP), 2012 19th IEEE International Conference on, 2012, pp. 1041–1044.

[28] E. Sudderth, M. Mandel, W. Freeman, and A. Willsky, "Visual hand tracking using nonparametric belief propagation," in Computer Vision and Pattern Recognition Workshop, 2004. CVPRW '04. Conference on, June 2004, pp. 189–189.

[29] Q. Yuan, S. Sclaroff, and V. Athitsos, "Automatic 2D hand tracking in video sequences," in Application of Computer Vision, 2005. WACV/MOTIONS '05 Volume 1. Seventh IEEE Workshops on, vol. 1, Jan. 2005, pp. 250–256.

[30] S. Tamgade and V. Bora, "Motion vector estimation of video image by pyramidal implementation of Lucas-Kanade optical flow," in Emerging Trends in Engineering and Technology (ICETET), 2009 2nd International Conference on, Dec. 2009, pp. 914–917.

[31] J. Shi and C. Tomasi, "Good features to track," in Computer Vision and Pattern Recognition, 1994. Proceedings CVPR '94., 1994 IEEE Computer Society Conference on, Jun. 1994, pp. 593–600.

[32] J. Jiang and A. Yilmaz, "Good features to track: A view geometric approach," in Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, Nov. 2011, pp. 72–79.