
Master in 3D Multimedia Technology

Physical interaction in augmented environments

Master Thesis Report

Presented by

Samuel Jiménez

and defended at

Gjøvik University College

Academic Supervisor(s): Simon McCallum, Konstantinos Boletsis

Jury Committee:


Physical interaction in augmented environments

Samuel Jiménez

2014/07/15


Abstract

Augmented Reality (AR) is a technology that superimposes virtual imagery onto the real world by registering and tracking features from it in real time. It has been studied for many years, but with recent technologies its availability and application have spread more strongly across different fields. Nevertheless, there is still a break of illusion when a user wants to interact with the virtual world, which has not been tackled completely by using hand gestures. With the inclusion of depth-sensing devices such as the Kinect or the Leap Motion, and their relatively low cost, the possibilities of interaction within an augmented reality environment are a great opportunity to explore. In this work we propose and implement a solution that integrates a hand-tracking technology within the augmented reality environment to offer a more robust and seamless integration of the user's gestures with the virtual objects, specifically using the Leap Motion sensing device.

Keywords

IEEE Keywords: [Computers and information processing] Augmented Reality, [Computers and information processing] Gesture recognition, [Systems, man and cybernetics] Human computer interaction.
General Keywords: Gesture-based interaction, Augmented Reality, Depth sensor device, Interaction on AR, Tangible AR.


Preface

I would like to thank all the people who were directly or indirectly involved in the development of this work; my family and friends for their support and motivation; my professors from Université Jean Monnet Saint-Etienne and Gjøvik University College; and, with special thanks, my supervisors Simon McCallum and Konstantinos Boletsis for their vast knowledge, great ideas and continuous guidance during this time.

Gjøvik, July 2014


Contents

Abstract
Preface
Contents
List of Figures
List of Tables
1 Introduction
   1.1 Problem Description
   1.2 Objectives and scope delimitation
   1.3 Outline
2 Background and Related Work
   2.1 Augmented Reality
   2.2 Leap Motion Controller
   2.3 Gesture recognition
      2.3.1 Gestures classification
      2.3.2 Gestures acceptance and intuitiveness
   2.4 Pinch Gesture
   2.5 Related Work
      2.5.1 AR interaction
      2.5.2 Leap Motion Approach
      2.5.3 AR Gesture-based Interaction
3 Design and Methods
   3.1 Guidelines
   3.2 Choice of frameworks
   3.3 AR Workflow
      3.3.1 Feature tracking (Image target)
   3.4 Hand-motion tracking Workflow
      3.4.1 Interaction Box
   3.5 Pinch Gesture
   3.6 AR - Hand-motion tracking integration
   3.7 Proof of Concept for Mobile Devices
4 Implementation and Evaluation
   4.1 Development Tools
   4.2 AR-Leap Motion Scene
   4.3 Experimental Set-up
   4.4 Evaluation
      4.4.1 Test No. 1
      4.4.2 Test No. 2
5 Results and Discussion
   5.1 Test No. 1
   5.2 Test No. 2
   5.3 SUS Assessment
   5.4 Questionnaire and Observations
   5.5 Discussion
      5.5.1 Limitations
6 Conclusion and Future Work
   6.1 Contributions
   6.2 Future work
Bibliography


List of Figures

1 a) Fiducial Marker and b) Marker-less tracking[47]
2 Tracking Coordinate System
3 Leap Motion's a) View[1], b) Structure[2]
4 Pointable fingers
5 Leap Motion's skeletal tracking
6 Pinch Gesture sequence
7 Analysis of accuracy from measured points, with a fixed orientation marked in red. a) xy-plane; b) xz-plane; c) yz-plane[3]
8 Vuforia architecture (yellow highlight)
9 Feature matching using SURF feature detector
10 Features: a square has 4 (its corners), a circle has none, and the third shape has 2
11 Stones image target and its distributed features on the right side
12 Leap Motion Architecture
13 Interaction Box within the Leap's field of view
14 Pinch Gesture Algorithm
15 Pinch Threshold
16 Leap Motion and Image target set-up
17 General Application Architecture
18 Image target within perspective frustum
19 AR interaction Scenario
20 Mobile Architecture Concept
21 Scene Example
22 Experiment Set-up
23 3D scene of Test No. 1
24 Participant performing test
25 Sequence grab-move-release
26 Cubes distribution along the interaction space
27 Participant showing difficulties in the edges
28 Frequency distribution on scene A and scene B
29 Frequency distribution of time from each attempt on Scene A
30 Means comparison of a) Grab-release and b) Hand-grab movements
31 Means comparison of attempts from scene A and B
32 Time-distance per user
33 SUS Average Results per question


List of Tables

1 Summary of related work
2 Interaction box size (mm)
3 Event log entry parameters
4 Event Data Type collected on test No. 1
5 Event Data Type collected on test No. 2
6 Scenes comparison
7 SUS results
8 User's previous experience


1 Introduction

Augmented Reality (AR) has been studied for many years, but only recently has it become widely available, mainly on the web and on mobile phones. User interaction in AR, i.e. the manipulation of virtual objects, has been carried out using non-natural interaction techniques that follow the capabilities of current mobile/desktop devices[4][5], such as touch-screen input or mouse/keyboard respectively, which breaks the illusion that users can interact directly with the virtual objects in the real world [5].

Research on gesture-based interaction techniques in AR has started only recently and is considered a key topic for the future of AR. There have been different approaches using computer vision and image processing techniques to detect gestures[6][7][8][9]; however, in the last few years, low-cost depth-sensing devices such as the Kinect[10] or the Leap Motion[1] have offered great opportunities to track hand gestures and natural free-hand interactions[11][12] that can be applied to AR environments.

1.1 Problem Description

Most of the advances in augmented reality (AR) are related to tracking techniques and display technologies[13][4]; however, the interaction with virtual objects -usually limited to touch-screen displays- is still a challenging area that needs further improvements in order to achieve seamless interaction between the user and the augmented environment.

In previous works[6][7][8][9], gesture recognition has been implemented using image processing techniques to detect hand gestures with a single camera. The current possibilities offered by low-cost depth sensors like the Kinect or the Leap Motion can help to locate gestures within a certain space (our Augmented Reality space) and, in this manner, enable us to use the pose information of the hand and fingers to support the interaction with the virtual content. Research has been done in this area using the Kinect[14][15], for example to infer the physical objects of a tabletop [16] and use that information to place the virtual content.

Interacting with virtual objects through physical gestures in an augmented reality environment is a challenging task that faces several problems which break the user experience [5]. According to [17], hand and finger interaction within augmented environments faces two major challenges:

1. The user's fingers should be able to physically interact with virtual and real objects in an almost seamless way; physically correct collision detection and processing with virtual objects is key[17].

2. The mutual visual occlusion between virtual and real elements has to be of convincing quality; the occlusions between the user's fingers and the virtual objects in the AR space should be as correct as possible[17].


Although both problems are linked, in this project we focus on the first goal, considering a natural gesture to interact with content; through an evaluation we also expect to give an insight into the occlusion problem for later improvements. This second problem arises when we try to manipulate virtual objects as if they were physical ones: for example, when we try to grab a teapot (a virtual object), our real hand is occluded completely by the teapot, because the virtual imagery is rendered in front of the real image. To tackle this, the teapot should be seen between our thumb and the other fingers, by segmenting the thumb so that it remains visible while the other fingers stay behind the object. This problem is not part of the scope, but it is important to be aware of its existence.

In addition, hand gesture recognition faces two main issues: hand detection and gesture recognition. There are several approaches to perform both tasks, depending on the context and application. In our AR context, for example, the current version of the Kinect (April 2014) works very well for body tracking, but it is difficult to detect small objects like the human hand due to the low resolution of its depth map (640x480)[18]. Another effective way to enable more robust hand gesture recognition is to use data gloves; compared to optical sensors, this option is more reliable and is not affected by lighting conditions or cluttered backgrounds, however it requires calibration and can obstruct the naturalness of the user's gestures. These tools are also expensive, which makes them an unpopular solution for gesture recognition[18].

1.2 Objectives and scope delimitation

Within the scope of this project, some concepts of value and application are considered in order to present a final result of promising advantages: depth-sensing devices in AR[5][19], gesture-based user interfaces that provide new opportunities for specific areas such as entertainment, learning, health and engineering[2], and target device implementations, firstly on a desktop prototype (for our purpose) but considering a future implementation on mobile devices.

The main contribution of this thesis is the integration of a depth-sensing device into an AR environment in order to provide better interaction with the virtual content. The authors of [16] faced this problem in a similar way using the Kinect, with good results. Nevertheless, the Leap Motion claims to offer better accuracy and has the potential of being portable in comparison to the Kinect. At the moment, there are not many studies related to the use of the Leap Motion controller, which is an advantage for conducting future studies with this technology.

The gesture-based interaction is narrowed down to the design and use of a single pinch gesture, which is one of the most common gestures and a natural sign for selecting and grabbing objects in the real world, as well as a known metaphor for interacting with virtual content.

The intention of this work is to study, implement and integrate a gesture-tracking technique in the AR space, using a depth sensor device, ultimately aiming for the optimisation of the interaction between the user and the virtual objects in an AR environment, focusing on the interaction with a single pinch gesture.

Therefore, our formulated goals and research problems can be stated as:

1. How could we effectively integrate virtual content positioning in the AR space when using a depth sensor that outputs the pose (palm and fingers) of the hand?

2. How is the integration of a natural pinch gesture perceived by the users, and how can we improve it?

3. Which difficulties does the interaction of a hand-tracking technology face in an Augmented Reality scenario?

We define two hypotheses that could give an early insight into these questions: 1) the pinch gesture is fundamental in interaction and its application to virtual objects should be seamless with the real world, hence easy to perform; 2) the user's interaction within the virtual space improves easily with time, and its difficulties should rely only on technological limitations rather than on the natural actions performed by the user.

For this purpose, the Leap Motion controller [1] -a depth-sensing device that recognizes the position of hands, fingers and sharp tools in real time within a spatial range around it- is used to detect the pinch gesture. We also build on the development concept proposed by [20][21] and some of the findings of [22][23], where the acceptance of gesture interaction and user-defined gestures serve as a basis for a better design.

The project follows a technical approach that intends to evaluate the problems of interacting with virtual content in AR and to develop an application using a single pinch gesture.

1.3 Outline

The thesis is divided into six chapters: 1) Introduction to the current work; 2) Background concepts and Related Work on state-of-the-art implementations in AR and gesture interaction within AR environments; 3) Design and Methods, explaining the guidelines to follow, the AR-hand tracking integration, the pinch gesture algorithm and its limitations; 4) Implementation and Evaluation, with the description of the developed product and test scenarios; 5) Results and Discussion of the findings from the experiment carried out; and 6) Conclusion and Future Work, showing directions to improve the proposed work.


2 Background and Related Work

This chapter explains the basic concepts and technologies involved in this project: Augmented Reality, the Leap Motion Controller's technology, and the definition and categorization of gesture recognition, targeting a pinch gesture for our purpose. The final section reviews the relevant related work.

2.1 Augmented Reality

The most common definition, given by [24], refers to Augmented Reality as an interactive technology that combines real and virtual imagery in real time by registering virtual content onto the real world. In a typical AR system, the working process can be divided into two main tasks: tracking and registration, with the aim of detecting a known pattern stored in memory (marker tracking)[25]. The technique commonly used to perform tracking nowadays is based on optical devices (conventional web cameras, time-of-flight or structured-light sensors) applying image processing and computer vision algorithms.

The simplest tracking method in AR uses fiducial markers, Figure 1. Its basic working process consists of a tracking algorithm that performs a series of image processing tasks over each frame of a video sequence in order to detect where a marker is, and a registration step which correlates a virtual object with the physical marker[25].

The tracking algorithm thresholds the image and looks for edges using a scanline search. With the known edges of the marker, it fits a rectangle by finding the corners based on the maximum diagonals between the points on the edges; this delimits the rectangle in which a pattern is recognized and matched against a pattern stored in memory. Once the pattern is recognized, the pose of the marker is estimated by calculating the rotation and translation parameters relative to the camera. This is done using a transformation matrix formed from the rotation angles and the translation vector. In other words, we want to know the coordinates of the marker with respect to the camera coordinates using these transformation parameters, called extrinsic parameters (they vary according to the movement of the camera), while the intrinsic parameters are used to move from the camera coordinates to the image coordinates and are fixed for the camera by calibration; these are the focal distance, the scale parameters and the (x,y) centre coordinates of the projected image.
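To make the roles of these parameters explicit, the relation between a marker point and its projection in the image can be written with the standard pinhole camera model; this is a generic textbook formulation, not the notation of any particular tracking library:

\[
s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
=
\underbrace{\begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}}_{\text{intrinsic matrix } K}
\;
\underbrace{\begin{bmatrix} R \;|\; \mathbf{t} \end{bmatrix}}_{\text{extrinsic parameters}}
\begin{bmatrix} X_m \\ Y_m \\ Z_m \\ 1 \end{bmatrix}
\]

where \((X_m, Y_m, Z_m)\) is a point in marker coordinates, \(R\) and \(\mathbf{t}\) are the rotation and translation of the marker relative to the camera (re-estimated on every frame), \(f_x, f_y\) and \((c_x, c_y)\) are the focal and centre parameters obtained by calibration, and \((u, v)\) is the resulting image coordinate up to the scale factor \(s\).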

With the position of the marker detected, a virtual model is placed at the corresponding position in image coordinates relative to the marker. This gives the impression that the virtual object is stuck to the marker, and the process is called registration.

The resulting image is shown on a display in which the user sees the virtual object superposed over the real world. The process is done dynamically and fast enough to give the impression of real-virtual world correlation. It is not computationally expensive, but the marker must be easily recognizable, i.e. the lighting conditions must be favourable and the marker should not be partially occluded, otherwise the recognition can fail.


Figure 1: a) Fiducial Marker and b) Marker-less tracking[47]

Another tracking method is marker-less tracking, Figure 1, which consists of detecting features of the real environment (edges, shapes, colour, texture, interest points) to extract key points that are usually stored beforehand using pattern recognition techniques; these key points are used to decide whether a virtual model must be placed over the real image or not. It is more complex and computationally expensive than marker tracking, but it overcomes occlusion problems as the recognition is more robust.

Figure 2: Tracking Coordinate System

2.2 Leap Motion Controller

The Leap Motion Controller is a small peripheral that connects over a USB port. Using two cameras to capture motion information and three infra-red LEDs as light sources, the system tracks the movements of hands, fingers, pens and several other objects in an area of about 60 cm in front of, to the side of, and above the device in real time. It detects small motions and is stated to be accurate to within 0.01 mm[26].

In comparison with gestural systems that use large cameras extracting a lot of data from an area and requiring considerable computational analysis, which can cause latency, the Leap Motion Controller exhibits little latency because its algorithms extract only the data required for the task at hand. Because the system is small and largely software-based, it could potentially be embedded in many types of devices.

The technology could make gaming more interactive and immersive. People who are disabled could use gestural interfaces to work with devices they could not otherwise operate.

Figure 3: Leap Motion’s a) View[1], b) Structure[2]

Because of pending patent restrictions, only a few details are known about the functioning principle and algorithms; however, it is known that the device uses three separate infra-red (IR) LEDs in conjunction with two IR cameras to determine the position of predefined shapes such as fingers and thin tools, returning discrete positions[2]. It can be categorized as an optical tracking system based on the stereo vision principle. Its field of view is an inverted pyramid centred on the controller, and the effective range of the controller extends from 25 to 600 mm above the device. It is accessed through an Application Programming Interface (API) with support for different programming languages. It is important to note that the sampling frequency is not stable and cannot be set through the API[2].

Figure 4: Pointable fingers

Figure 5: Leap Motion’s skeletal tracking


To keep the processing load off the device itself, the images captured by the cameras are post-processed on the host computer to remove noise and to construct a model of the hands, fingers and pointable tools[27]. The API provides access to the abstract models created by the controller's software and is able to detect:

1. Hands detected in a frame, including rotation, position, velocity and movement.
2. Fingers recognized and attached to each hand, including rotation, position and velocity.
3. The pixel location on a display pointed at by a finger or tool.
4. Basic recognition of gestures such as swipes and taps.
5. Changes of position and orientation between frames[27].
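As an illustration of how this positional data is exposed through the API, the following minimal C# sketch polls a frame and prints palm and fingertip positions. It assumes the Leap SDK C# bindings of that period (Controller, Frame, Hand, Finger, Vector) and is not taken from the thesis implementation.

using Leap;

class FramePollingExample
{
    static void Main()
    {
        // The Controller object maintains the connection to the Leap Motion service.
        Controller controller = new Controller();

        // Poll the most recent frame (the SDK also offers a listener/callback model).
        Frame frame = controller.Frame();

        foreach (Hand hand in frame.Hands)
        {
            // Palm position is given in millimetres, relative to the device origin.
            Vector palm = hand.PalmPosition;
            System.Console.WriteLine("Palm at ({0}, {1}, {2}) mm", palm.x, palm.y, palm.z);

            foreach (Finger finger in hand.Fingers)
            {
                // Each pointable exposes a tip position (and a velocity vector, not shown here).
                Vector tip = finger.TipPosition;
                System.Console.WriteLine("  Finger tip at ({0}, {1}, {2}) mm", tip.x, tip.y, tip.z);
            }
        }
    }
}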

2.3 Gesture recognition

A simple definition of a gesture, given by [28], is a motion of the body that contains information, while gesture recognition is the mathematical interpretation of a human motion by a computing device; it is successfully accurate when a repeated movement is interpreted in the same way across different people[23].

2.3.1 Gestures classification

A challenge in gesture recognition systems is that there is no exact, universally understood meaning for a gesture, i.e. the same gesture can have different meanings for one or several tasks depending on the context or the task we want to accomplish. However, a distinction can be established between continuous (on-line) and discrete (off-line) gestures[23][29]:

• On-line gestures (continuous): are evaluated while they are being performed, e.g. a zooming metaphor driven by the movement of the fingers.

• Off-line gestures (discrete): are evaluated after they have been completely performed, e.g. pointing gestures towards a screen, clicking, or static symbols that execute commands[23][29].

A gesture taxonomy based on user-defined gestures for interacting in AR environments has been built by [22], [30], based on sets of identical gestures across different tasks performed by users. These are characterized by:

• Reversible gestures: those that, when performed in the opposite direction, yield opposite effects, e.g. rotation, scaling, increasing/decreasing speed, etc.

• Reusable gestures: those which were used commonly for tasks which were different, but which participants felt had common attributes, e.g. increase speed/redo, decrease speed/undo and insert/paste.

• Virtual object size: the size influenced the decision of users, especially with regards to the number of hands they would use to manipulate the object; for example, for a scaling task the majority of gestures performed use two hands.

• For particular tasks such as "delete", users tend to use known metaphors from existing user interfaces, like double-clicking on a Windows OS or double-tapping on a touch phone.

• As gestures in 3D space allow higher expressiveness, users tend to perform gestures from the real world, such as the symbolic thumbs up/down for accept/reject. These are known as concrete gestures.

• Nevertheless, abstract gestures, such as a "box selection" made by defining the boundary of the object, have no natural meaning; in this case, finding a commonly used gesture needs further study. Abstract gestures need to be learned by the user, while concrete gestures are intuitive[30][22].

With the previous taxonomy in mind, the fundamental interactions[31][32] -rotation, translation and scaling- are performed as follows. For rotation, a user would pinch or grasp the object with at least two contact points and move their hand or turn their wrist accordingly. For scaling on three axes, participants would grasp or use open hands aligned with the sides of the object and increase or decrease the distance between them to enlarge or shrink it in the same direction as the transformation. Uniform scaling is less obvious: for example, some users prefer using open hands moving along a single axis in front of them, while others grasp the object at opposing diagonal corners and move along diagonal lines [22], [30].

2.3.2 Gestures acceptance and intuitiveness

As we aim to interact naturally with virtual objects in the same way as with real objects, the design should be centred on the user's perception of the virtual content in the real space; thus, the acceptance of specific gestures and the intuitiveness of performing them are critical to offer a natural interaction in AR.

[23] studied the acceptance of gesture interaction according to the age, gender and previous experience of the users, from which important assumptions can be taken into account:

• People can adapt quickly to a gesture-based interaction technique.

• Too many gestures for interacting with a system is undesirable, as it is more difficult for the user to learn the gestures and apply them accordingly.

• When designing a 3D gesture system, the tool itself should be designed for a usage and the targeted users, not for the technology.

• There are no significant differences in age or gender that influence the performance of the gestures.

• When gestures are natural, the interaction experience is positive.

The implementation of touch-less gestures has challenges in achieving accurate and meaningful gesture recognition and in identifying natural[33], intuitive and meaningful gesture vocabularies appropriate for the tasks in question[34].

[34] tried to understand what makes touch-less gesture production natural -understood as spontaneous- and intuitive -coming naturally without excessive deliberation- in order to provide design guidelines for such interfaces.

Based on assumptions from neuro-psychology, [34] hypothesize that body-part-as-object gestures are generally not as intuitive as hand actions that hold imagined objects, i.e. we see gesture forms in terms of their relationship to what is being communicated or represented. For example, in the task of brushing teeth with a toothbrush, it is more natural for the hand to hold an imagined object (an imaginary toothbrush) than to represent the object with a body part (a finger representing the toothbrush). The former are called transitive gestures.


2.4 Pinch Gesture

The pinch gesture is one of the most common gestures for interaction with digital interfaces. It is defined as the movement of expansion and contraction of a finger spread[35]. It has been used for different purposes depending on the target application, e.g. the zooming metaphor by contracting and expanding, scaling or picking. Moreover, with the use of multi-touch devices[35], its metaphors have become ubiquitous for interacting with virtual content. When interacting with real objects, it is used to grab small pieces or malleable objects like fabrics. It resembles a grabbing or picking action and offers a natural signal to select or move an object in an interactive system[36], and due to the nature of the thumb and index fingers, as well as the large amount of experience people have with it, pinch grabbing is precise and has high performance[37].

Once we have the hand information from the sensing device, it is relatively easy to detect a pinch, as there is little ambiguity about whether the thumb and another finger are touching or not[36].
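A minimal way to exploit this low ambiguity is to threshold the distance between the thumb tip and the index fingertip. The C# sketch below illustrates the idea with a small hysteresis band so the state does not flicker around a single threshold; the threshold values and class structure are illustrative assumptions and not the pinch algorithm defined later in Chapter 3.

using UnityEngine;

// Illustrative pinch detector with hysteresis: the pinch engages when the thumb and
// index fingertips come closer than EngageDistanceMm and releases only when they
// separate beyond ReleaseDistanceMm.
public class PinchDetector
{
    const float EngageDistanceMm = 25.0f;   // assumed value, tuned empirically
    const float ReleaseDistanceMm = 40.0f;  // assumed value, tuned empirically

    public bool IsPinching { get; private set; }

    // Call once per tracking frame with fingertip positions in millimetres.
    public void Update(Vector3 thumbTip, Vector3 indexTip)
    {
        float d = Vector3.Distance(thumbTip, indexTip);

        if (!IsPinching && d < EngageDistanceMm)
            IsPinching = true;          // fingers closed: start of a grab
        else if (IsPinching && d > ReleaseDistanceMm)
            IsPinching = false;         // fingers opened: object is released
    }
}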

For our purpose, and based on the commonly used gestures described in the taxonomy of [22], [30], the pinch is a pivotal gesture for implementing interactions within our Augmented Reality space.

Figure 6: Pinch Gesture sequence

2.5 Related Work

From previous studies, different approaches have been proposed to provide an understanding of natural hand gestures for interacting with virtual objects in an AR space. From the literature review, we can distinguish and classify three main areas: AR interaction, the Leap Motion approach (general information and gesture recognition) and AR gesture-based interaction.

2.5.1 AR interaction

In [8], the authors proposed a Virtual English Classroom system using marker-less AR to enhance learning experiences. The 3D pose of the hands is estimated through HSV colour segmentation of the skin, hand contour estimation, and high-curvature points to detect the fingertips, which are labelled in order to estimate a 3x4 projection matrix of the position and orientation of the palm; a coordinate system is then set onto the palm[8]. They designed four hand gestures to manipulate virtual objects and teaching activities based on Google Lit Trips to test their system. They also proposed an AR interaction classification[8]:

• Users interact by means of fiducial markers.
• Users interact through special sensing equipment such as data gloves or depth-sensing devices.
• Users interact using bare-hand gestures.


A tangible tabletop interface system for patients in rehabilitation was developed by [38], featuring various physical interaction objects that are manipulated on a table's surface, emulating a physical environment in order to overcome the barriers of input-output interfaces (keyboard, mouse, desktop screen). This concept is based on "Tangible bits", which discusses the advantages of tangible virtual media.

[38] explored the motivation and possibilities, as well as the methods and facilities, of tangible tabletops in the rehabilitation of people with visual impairments and perception problems (e.g. stroke patients), finding that therapists consider visual and audio feedback an important feature for adapting to the patients' needs.

They present three concepts based on therapy models, tested with non-patient users using the Wizard of Oz technique, by comparing patterns of physical objects against projected objects on the tabletop surface and by composing cubes according to a pattern projected on the surface.

In order to provide tangible user interaction in collaborative AR environments, previous research has worked with hard-wired devices such as vibro-tactile feedback gloves for virtual object manipulation.

Continuing with tangible interaction, [39] developed a solution for removing the visual confusion caused by hand occlusion. It detects hand regions in a real-world image based on skin colour and, aware that hand occlusion occurs only in the region on which virtual objects are rendered, subtracts the hand regions from the rendered image of the virtual objects; it then superimposes the refined image onto the real-world one to obtain a final image without hand occlusion. From their experiment, which consisted of selecting numbers from a virtual keyboard, they found that the interaction using a fingertip was less accurate than using a physical pen, but no statistically significant difference in time or number of wrong selections was found for different button sizes, concluding that the hand occlusion solver was helpful to produce better immersion without visual confusion.

The authors note that a next step could use a depth camera to recognise the fingers and compute the distance from the fingertip to a virtual object based on the depth information.

2.5.2 Leap Motion Approach

Few studies using this technology have been published so far (April 2014); they show the feasibility, limitations and future directions of this field. Considering these capabilities and limitations, we are able to conduct our implementation efficiently, as seen in the next chapter.

[17] showed an implementation of the Leap Motion supporting an AR interface where the interaction is achieved by creating voxels1 at the tracked fingers in order to correct the occlusion of the hand. They developed an application in the context of physical rehabilitation where parts of the fingers are rendered in 2D (video see-through) using an OpenCV implementation and the interacting parts (fingertips) are converted into voxels. They show that rendering a high number of voxels requires more computational resources, which limits the performance or quality of the rendered video. There is no gesture interaction implemented, but the rendering approach with fingertip voxelization gives a good degree of immersion.

1 A voxel is the 3D conceptual counterpart of the 2D pixel, the smallest distinguishable box-shaped part of a three-dimensional image. Source: http://labs.cs.sunysb.edu/labs/projects/volume/Papers/Voxel/

Outside the AR scope, but regarding gesture recognition, [40] made an initial qualitative evaluation of the controller to observe its potential for Auslan sign language recognition, concluding that its API is not yet ready to interpret the full range of the sign language; it can be used for basic sign recognition, but not for complex signs such as those that require face or body contact. They state that it could become useful in the near future as the technology develops[40]. An important distinction is mentioned here (and proved in practice): finger recognition fails when the hand is placed perpendicular to the controller, as the stereo-matching algorithms work on fingers parallel to the device.

Due to pending patent issues, full information about the Leap Motion controller is not available; however, to study whether this technology can be relied upon, an analysis of the precision and reliability of the controller in static and dynamic conditions was made [2], with the aid of a professional motion tracking system, to determine its suitability as an economically attractive finger/hand and object tracking sensor. In a static scenario (acquisition of a limited number of static points in space) there is a significant increase in the standard deviation when moving away from the controller and when moving to the far left and right of the controller, and tracking of objects in front of the controller (z>0) was unstable. In a dynamic scenario (tracking of moving objects with a constant distance between each other within the calibrated space), the authors evaluated the distortion of the controller's perception of space (measured as the deviation of the distance between two markers located at the tips of a V-tool), concluding that there is a significant drop in accuracy when the objects are above 250 mm from the controller[2].

In a similar way, [3] used a robotic arm holding a pen to precisely analyse the controller's accuracy and repeatability by setting different point coordinates in static (fixed point in the scene) and dynamic (moving to a different position) scenarios, finding standard deviations of 0.2 mm and below 0.7 mm per axis respectively. This indicates that under real conditions the theoretical accuracy of 0.01 mm is not achieved; however, its precision is higher than that of controllers in the same price range, such as the Microsoft Kinect. In Figure 7, some of the results indicate the deviation along the three planes, showing that the yz-plane has more variation in accuracy.

Figure 7: Analysis of accuracy from measured points, with a fixed orientation marked in red. a) xy-plane; b) xz-plane; c) yz-plane[3]

2.5.3 AR Gesture-based Interaction

Former studies have implemented image processing and computer vision techniques to recognise and segment fingers in video sequences. More recently, optical tracking approaches have been used extensively to tackle recognition challenges, e.g. with the Microsoft Kinect sensor [10], [18]. [11] proposed a shape distance metric -based on the Earth Mover's Distance (EMD) [41]- called the Finger-Earth Mover's Distance (FEMD) to measure the dissimilarity between different hand shapes. It represents a hand shape as a signature where each finger is a cluster, and the dissimilarity between two hand shapes is defined as the sum of the work needed to move the earth piles plus a penalty on the unmatched fingers; each input hand is finally recognized by template matching. They demonstrate their gesture recognition with an application for arithmetic operations that uses gesture input as commands instead of mouse and keyboard; a rock-paper-scissors game was also implemented, recognizing three hand gestures.

In [16], starting from the premise that computer-vision-based marker algorithms are able to calculate the pose of a given target but have no awareness of the environment in which the target exists, the authors developed a framework using the Microsoft Kinect[10] to obtain a mapped representation of a physical terrain (on a tabletop) and to place the virtual content using a physics simulation. In one application, they use coloured markers on the fingertips to manipulate the content with a set of predefined gestures that are detected with the RGB camera of the Kinect and colour-segmented. Although they use a depth-sensing camera to reconstruct the terrain and coloured markers to detect the fingertips, there are still issues that can be addressed, such as avoiding the coloured markers on the fingertips and segmenting objects for occlusion.

It is observed in [14] that, for a correct occlusion solver, collision detection, realistic illumination and shadowing effects are cues that establish a stronger connection between real and virtual content. A general remark on previous approaches includes contour-based object segmentation and depth information from stereo and time-of-flight cameras, which are automatic techniques that do not require offline calibration and allow the systems to process the environment even when the objects change. For their purpose, [14] investigate additional methods of object segmentation to provide more realistic interactions between the real and virtual worlds through hand gestures.

Many prototypes implemented with sophisticated computer vision algorithms to robustly recognize hand gestures have demonstrated that gesture recognition is rather complex and computationally expensive. Less demanding and simpler approaches have been suggested, using fixed skin-colour segmentation and morphological filtering to find fingertips[42].

[42] proposed an interaction methodology using the hand and fingers for mobile phones, keeping the hand detection simple in order to reduce resource consumption and provide good performance. Their system consists of three input gestures common on touch screens, chosen to optimize the learnability of the system (swiping, scaling, sliding). Once the markers are detected with ARToolKit, they apply background subtraction using projective texture mapping and skin-colour segmentation in (normalized) RGB space to detect the hands in the scene; they then restrict a region of interest (ROI) around the virtual object of three times the area of the marker, use a search grid of an experimentally determined size and record how the hand occludes each grid cell (in a continuous mode). In this manner, each cell is governed by the amount of hand occlusion in order to detect a hit on the grid.

[9] led another approach, proposing a method for 3D computer-vision-based natural interaction with finger pointing and direct touch of augmented objects. It consists of a tracking system of four steps: 1) segmentation of skin colour (in HSV colour space); 2) finding feature points for the palm centre and fingertips (finding the palm's centre and mapping to a disparity map to estimate 3D information for each point); 3) finding the hand direction (calculated using parametric equations); 4) simple collision detection (finger ray casting for pointing and selecting distant objects). It is integrated into a multi-modal system that reacts to speech and hand gestures in 3D space to move virtual objects using a BumbleBee2 stereo camera; however, their fingertip tracking was unstable and prone to errors because the finger was too thin to calculate the stereo disparity. With the Leap Motion controller, this problem should hypothetically be solved.

A similar approach by [43] segments the user's hand and tracks the 3D position of the fingertips using an RGB-D camera, subsequently mapping the fingertip coordinates onto the world coordinate system of the tracked AR marker. They find that the interaction is more natural and enjoyable than with a 2D touch-based interface. Their prototype was evaluated in a user study with two manipulation techniques (3D gesture and 2D screen touch) and a task scenario (rotate, scale, translate).

[12], with the Kinect[10] controller, used a pointer to select virtual objects for an assembly task, in a set-up environment that tracks the user's hand gestures to rotate, translate and scale the items and place them in a corresponding position in space. Its working principle is fairly similar to other approaches that use the depth data of the Kinect; the gesture interaction is limited to open-close hands, which worked well, but for more precise interactions, such as using the fingers, further improvements need to be addressed.

[31] studied the potential of gesture interaction based on finger tracking in front of a phone's camera. This approach offers a smooth integration but also contains potential issues, such as robust finger tracking and the actual interaction experience itself. They focus on the second issue by evaluating "canonical interactions" (i.e. translating, scaling, rotating), identifying typical scenarios as well as potential limitations.

Their experiments consist of performing the interactions translate, scale and rotate on a pawn in mid-air and on a real board. Accuracy is measured in two subjectively interpreted categories: a perfect solution (perfect position, size and angle for the translation, scaling and rotation tasks respectively), and a solution that would be considered good enough, measured using experimentally determined thresholds.

One of the most important issues, and still an open question, that [31] detected is how to switch between different operations, i.e. how to best combine the canonical interactions to enable the user to seamlessly rotate and scale a virtual object.

A summary of the relevant literature reviewed can be seen in Table 1.


Area: AR interaction
Description: A general survey of different implementations (not particularly related to each other) for interaction, ranging from tangible tabletops, gesture metaphors, hand occlusions and feedback, showed relevant concepts to consider for an AR interaction design.
Contribution: General concepts in AR.

Area: Leap Motion approach
Description: Few papers written so far (April 2014), mostly analysing its capabilities for precise tracking, its feasibility for sign language gesture recognition, and one actual implementation using voxelization at a first stage of development.
Contribution: Device capabilities, one Leap implementation.

Area: AR gesture-based interaction
Description: An overview showing gesture-based approaches only, most of them based on image processing and computer vision techniques using depth cameras (Kinect) to detect the hands and, subsequently, the gestures.
Contribution: There are no documented implementations besides [17] using a Leap Motion approach that relate to it; the controller deals with the image processing/computer vision and exposes only the positional data we need.

Table 1: Summary of related work


3 Design and Methods

In this chapter we describe the considerations to follow in an interactive Augmented Reality scenario and the integration of Augmented Reality and hand-motion tracking technologies, along with an algorithm suitable for our purpose as stated in the problem description, Chapter 1, focusing on a single pinch gesture. We also describe the basis of the intended system development in a technical workflow, as well as the limitations of the design process.

3.1 Guidelines

Based on the survey of related work performed [16][22][44][23][31], we can distinguish fundamental guidelines to consider in the design of an AR interaction application, which can be summarized as follows:

1. The manipulation of virtual objects is usually based on the user's knowledge of and skills with the physical world. The user's hands can produce gestures that the system recognizes and processes as natural inputs. Previous work [16] [22] has shown that physical input devices encourage actions that users normally apply to physical objects, which are not necessarily supported by an AR system.

2. Direct manipulation of virtual objects is enabled by employing tracking with six degrees of freedom (6DOF): translation forward/backward, up/down and left/right, plus rotation about these axes[44].

3. The design of environments that support gesture-based interaction systems may depend on knowledge of the gesture space, and further research can determine whether gestures in virtual realities differ from those in real scenarios [23].

4. To overcome the problem of occlusion, the depth of the real world from the user's point of view is employed. As a first step, the virtual imagery needs to share the same coordinate system as the real one; afterwards, the occlusion between virtual and real objects must be corrected [45].

5. The delimitation of the interaction space (technical and user-centred) should be considered in the design; the Least Distance of Distinctive Vision (LDDV) describes the closest distance at which a person with normal vision can comfortably look at something, commonly around 25 cm[31].

6. The use of natural gestures rather than abstract gestures is desirable, as it reduces the cognitive load during interaction. Some of people's expectations of spatial interfaces are inherited from human behaviour as a response from the objects. For non-trained interaction, humans can effectively manage 4-5 degrees of freedom without any training; for trained tasks we can certainly manage more (humans can naturally handle up to 20 degrees of freedom, usually for multi-touch interactions)[46].


The basic interaction concepts in an AR environment identified in [22][32], and defined as canonical interactions[31], are:

• Selection: the initial event which enables the interaction, by identifying the collision of a virtual object with the tracked hand (a virtual hand in this case).

• Translation: moving an object along the x, y, z axes.

• Scaling: can be done only with virtual objects, therefore there is no natural equivalent. In a 2D-3D space, it is done by grabbing the object from two different sides and changing the distance between two fingers to scale it.

• Rotation: usually done by grabbing the object with two fingers and turning it around.

• Lack of haptic feedback: to deal with this issue, which is one of the major problems of AR interactions, a particular visualization is shown depending on the status of the object, i.e. whether it is collided, selected, grabbed, etc. A tolerance distance between the hand model and the object can be considered by defining a bounding area1 around the object, so that the reaction to a specific task does not necessarily depend on touching the object directly but on entering the area around it (see the sketch below).
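The following C# sketch illustrates the bounding-area idea in Unity terms: the object's bounding volume is expanded by a tolerance margin so that a grab can be triggered near the object rather than only on exact contact. The margin value and component layout are assumptions for illustration, not the thesis implementation.

using UnityEngine;

// Selection test that compensates for the lack of haptic feedback by expanding the
// object's bounds with a tolerance margin. Attach to a GameObject with a Collider.
public class TolerantSelector : MonoBehaviour
{
    [SerializeField] float toleranceMargin = 0.02f; // assumed margin in Unity world units

    // 'fingertipWorldPos' would come from the tracked (virtual) hand.
    public bool IsSelected(Vector3 fingertipWorldPos)
    {
        Bounds bounds = GetComponent<Collider>().bounds;
        bounds.Expand(toleranceMargin * 2f); // Expand grows each axis by the given amount (half per side)
        return bounds.Contains(fingertipWorldPos);
    }
}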

3.2 Choice of frameworks

Prior to the design and during a process of testing, the libraries and frameworks to use was anessential aspect to consider to carry out this project.

As explained in the previous chapter 2.2, the Leap Motion device was chosen because of itscapabilities against other tracking devices, it provides support for different languages and due toits size, is potentially portable; despite it doesn’t provides direct manipulation of the raw data,the exposed information is robust enough as it provides the coordinates, velocities and directionvectors of the tracked fingers an palm. Nonetheless, we needed to look for options compatiblewith its development platforms.

Among the variety of Augmented Reality libraries, the Qualcomm Vuforia library [47] has been chosen due to its robust tracking of fiducial markers, images and text, compared with others such as ARToolkit, which only tracks fiducial markers. The current version 2.8 (April 2014) supports different development platforms (Java, C++, Unity3D) and devices, is well documented, and its API provides a comprehensive guide through its different capabilities; additionally, and in contrast to Metaio, it has a free license, even for commercial purposes [47].

On the other hand, at an initial stage, the 3D graphics library OpenGL ES (a version of OpenGL for mobile devices), which is well known and widely available on multiple platforms, seemed the best choice to work with. However, OpenGL exposes low-level functions to render graphics, and non-trivial interaction tasks would have increased the programming complexity. To facilitate the development, the solution was to use a game engine: Unity3D has multi-platform support and a well-documented scripting API, is suitable for the rapid development of good-quality applications, and both Vuforia and Leap Motion provide support for Unity3D.

1 Source: wikipedia.org/wiki/Bounding_volume


3.3 AR Workflow

The Vuforia AR platform v2.8 [47] is a Software Development Kit (SDK) that enables building Augmented Reality applications on mobile devices, with support for Android, iOS and a Unity3D extension for development. It has a robust architecture that optimizes tracking, registration and graphics visualization, offering a variety of features to track and register frame markers (fiducial markers), images, text and cylinder targets.

The main workflow of the architecture [47] is divided into the two core processes essential in any AR application: tracking and registration. The tracking process consists of a 1) Camera module, which acquires each frame from the camera device and outputs an image with a defined format and size depending on the device; a 2) Pixel format converter, which transforms the image from the camera format (e.g. YUV12) to a format suitable for OpenGL ES rendering (e.g. RGB5652) and for internal tracking, and also down-samples the image into the different resolutions available; a 3) Tracker component, which detects and tracks the real-world targets and markers using computer vision algorithms (no further information available) on each camera frame and stores the results in a 4) State object stack; finally, a 5) Video background renderer component renders the image frame stored in the State object, representing the real world behind the virtual 3D objects. The registration is achieved in a 6) Application segment, which updates the State object and calls a rendering method on each processed frame; for this purpose, the State object must be queried for newly detected targets, the application logic updated according to each target, and the augmented objects rendered in the scene.

Figure 8: Vuforia architecture (yellow highlight)

2 RGB565 is a raw 16-bit pixel format (5 bits red, 6 bits green, 5 bits blue) without any header information


3.3.1 Feature tracking (Image target)

During the design process an iterative development has been carried out, identifying that the frame marker recognition for AR, chapter 2, shows relevant issues during testing when the marker is occluded by the hand. To solve this problem, we have decided to use the feature-based tracking offered by the AR framework, which is robust against occlusions. The AR framework, Qualcomm Vuforia v2.8 [47], uses its own algorithm to detect and track features in an image. The SDK recognizes the image target by comparing its natural features against a known target resource database, which is pre-processed in advance from the image. Once the image target is detected, the SDK will track it as long as it is at least partially within the camera's field of view.

An example of feature extraction and matching can be seen in Figure 9, using SURF (Speeded-Up Robust Features) [48], where the local features in the bottom-right corner are matched between a reference and a target image.

Figure 9: Feature matching using SURF feature detector

These image targets must meet certain requirements so that features naturally found in the image can be extracted efficiently. They are rated accordingly by an online tool, the Target Manager, which processes the image, creates a dataset with the image features (subsequently loaded into the AR application) and rates it on a quality range [0,5], where 0 means a bad target and 5 an easily tracked target. An acceptable target should have certain characteristics to be recognized correctly: the features considered are spiked, sharp-cornered details such as the ones present in textured objects. The more features present, the better; however, the details should not create a repeating pattern, as this inhibits detection, and soft or round details that do not contain relevant features should be avoided.

Figure 10: Features: a square has 4, one for each corner; a circle has no features; the third shape has 2.

A higher local contrast in the details provides better feature recognition, as the image is identified with sharper edges. Finally, distributing the features along the whole image is better than concentrating them in a single area. The target size parameter is very important, as the pose information returned during tracking will be in the same scale: e.g., if the target image is 16 units wide, moving the camera from the left border of the target to the right border changes the returned position by 16 units along the x-axis [47].

A predefined stones image with a good rating is used for our purpose; Figure 11 shows the stones image with its coordinate axes and the features detected on it.

Figure 11: Stones image target and its distributed features on the right side

Additionally, Vuforia provides an extended tracking feature for an image target once it has been detected. It uses information from the surrounding area to infer the position of the target when it is out of the camera's view. This option makes the tracking more robust in static environments by building a map around the target for this purpose.

As our prototyping environment is static and our hands move above the target, it is important to keep a persistent position of the virtual content with respect to the real world and the reference image target. Extended tracking enhances the visualization of dynamic content efficiently in a static environment; however, it can slow tracking down slightly in a dynamic one, since the area around the target needs to be recalculated.

3.4 Hand-motion tracking Workflow

The Leap Motion controller's API v2.0 [1] provides access to the filtered and processed data received from the device.

As indicated previously in chapter 2, the Leap Motion runs over a USB port and, on the Windows platform, a service receives motion tracking data from the device; a DLL connects to the service and exposes the data through a variety of available languages (C# with a Unity3D plugin, Java, Python, C++, Objective-C, JavaScript).

The architecture shown in Figure 12 consists of: 1) the Leap Service, which receives motion data from the device and exposes it to running applications; 2) the Leap Control Panel, which configures the device tracking settings and troubleshooting and, as applications run independently from the service, controls it directly; 3) a Foreground Application, which gets motion data directly from the service while it has focus; and 4) a Background Application, which keeps receiving data from the service even when it is out of focus or running in the background.

It uses a right-handed coordinate system with its origin at the top center of the controller itself, and it provides the coordinates of the finger positions in millimetres within the device's field of view.

Figure 12: Leap Motion Architecture

3.4.1 Interaction Box

Within the Leap Motion's field of view, the fully operational area for correct hand recognition is called the interaction box; it is a rectangular prism region inside the field of view which provides real and normalized coordinates for the hands. Its size is determined by the field of view and the height setting in the device's control panel (max. 25 cm). The controller software adjusts the size of the box based on the height to keep the bottom corners within the field of view [1].

Height      Width       Depth       Center
235.2471    235.2471    147.7511    (0, 200, 0)

Table 2: Interaction box size (mm)

To map the coordinates to the interaction box, the points should be normalized to the range [0..1], moving the origin to the bottom-left-back corner, and subsequently converted to the application (Unity3D) coordinate system by multiplying the normalized coordinates by the maximum range of each axis in the application and translating the origin of the interaction box to the center; because the Unity3D environment uses a left-handed coordinate system, the Leap Motion's z-axis coordinates should be multiplied by -1 before normalization so that they are mapped correctly in Unity3D. A sketch of this mapping is shown below.
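The following Unity C# sketch illustrates this mapping using the interaction box dimensions of Table 2; the class name, the MapToUnity method and the size of the target Unity box are illustrative assumptions, not the exact code of the implementation.

using UnityEngine;

// Illustrative sketch: maps a Leap Motion point (millimetres, right-handed,
// origin on top of the device) into a Unity box (left-handed) using the
// interaction-box dimensions from Table 2.
public static class LeapToUnityMapper
{
    // Interaction box size in mm (Table 2) and its center.
    static readonly Vector3 boxSize   = new Vector3(235.2471f, 235.2471f, 147.7511f);
    static readonly Vector3 boxCenter = new Vector3(0f, 200f, 0f);

    // Size of the target volume in Unity units (assumed value for the prototype scene).
    static readonly Vector3 unityBoxSize = new Vector3(10f, 10f, 6f);

    public static Vector3 MapToUnity(Vector3 leapPointMm)
    {
        // Flip the z-axis: Leap is right-handed, Unity is left-handed.
        leapPointMm.z = -leapPointMm.z;

        // Move the origin to the bottom-left-back corner and normalize to [0..1].
        Vector3 corner = boxCenter - boxSize * 0.5f;
        Vector3 normalized = new Vector3(
            Mathf.Clamp01((leapPointMm.x - corner.x) / boxSize.x),
            Mathf.Clamp01((leapPointMm.y - corner.y) / boxSize.y),
            Mathf.Clamp01((leapPointMm.z - corner.z) / boxSize.z));

        // Scale to the Unity box and re-center it on the origin.
        return Vector3.Scale(normalized, unityBoxSize) - unityBoxSize * 0.5f;
    }
}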

3.5 Pinch Gesture

The pinch is a basic gesture that can be used to interact in a 3D space, in contrast to the pointing gesture usually used in 2D view systems, which is simpler and well performed. In a user-based gesture taxonomy [22], [30], several gestures such as pointing, pinch, grab and stretch (using both hands) have been classified according to user experience. In an AR interaction space the pinch gesture is performed repeatedly to express natural movements and other gestures such as selection, shrink/stretch along different axes and rotation, using either one hand or both [30]. Hence, its recognition is fundamental for the subsequent basic interactions.

Figure 13: Interaction Box within the Leap's field of view

In this work, we have implemented a pinch recognition algorithm that triggers a grabbing action to move and release virtual objects in the AR scene, based on the hand coordinates exposed by the Leap Motion API.

The algorithm, Figure 14, runs on a Hand Controller (explained in the next section) which receives the tracking information from the device. Once a hand is recognized, a list of fingers ordered from thumb[0] to pinky[4] is obtained.

It consists of identifying the thumb fingertip (thumb = finger[0]); calculating a fixed threshold value, experimentally set to a proportion of 0.7 times the thumb's length; then, for each remaining finger in the Fingers list, obtaining its fingertip position and comparing the Euclidean distance, equation 3.1, from the thumb to the current fingertip with the threshold value. If the resulting PinchDistance is lower, a flag is activated, indicating that a pinch has been performed.

d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2} \qquad (3.1)

Afterwards, the grabbing function is executed while the pinch flag is active. It generates a bounding sphere around the pinch position of the thumb and initializes a vector GrabDistance that stores the distance from the pinch position to the collided object found (this is used to grab only the single object closest to the pinch position); it then identifies each object that collides with the bounding sphere and updates the GrabDistance vector with the difference between the pinch position and the collided object. If the updated distance is less than the GrabDistance of the previously collided object, the object is set as grabbed, and a force (a vector calculated from the difference between the pinch position and the position of the grabbed object) is added to the rigid body of the object so that it follows the position of the pinch. When the PinchDistance becomes higher than the threshold, the grabbing and pinching flags are disabled. This grabbing function is based on the example algorithms in [1].
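For illustration, the following Unity C# sketch reproduces the pinch detection step just described, assuming the Hand Controller already exposes the fingertip positions in Unity coordinates; the PinchDetector class, the fingertipPositions array and the thumbLength parameter are illustrative placeholders rather than the actual fields of the implementation.

using UnityEngine;

// Illustrative sketch of the pinch detection described above: a pinch is
// reported when any fingertip comes closer to the thumb tip than a
// threshold proportional to the thumb's length.
public class PinchDetector
{
    const float ThresholdProportion = 0.7f;   // experimentally chosen proportion

    // fingertipPositions[0] is the thumb tip, [1]..[4] are index..pinky tips.
    public bool IsPinching(Vector3[] fingertipPositions, float thumbLength)
    {
        Vector3 thumbTip = fingertipPositions[0];
        float threshold = ThresholdProportion * thumbLength;

        for (int i = 1; i < fingertipPositions.Length; i++)
        {
            // Euclidean distance (equation 3.1) between thumb tip and current fingertip.
            float pinchDistance = Vector3.Distance(thumbTip, fingertipPositions[i]);
            if (pinchDistance < threshold)
                return true;   // pinch flag activated
        }
        return false;          // no fingertip close enough: pinch released
    }
}

A companion sketch of the grabbing step (bounding sphere, closest object, applied force) is given in section 4.2.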

Figure 14: Pinch Gesture Algorithm

The design is compared with a built-in pinch strength measure from the SDK: the most recent version of the Leap Motion SDK provides an automatic pinch strength value that ranges from [0-1], where 0 corresponds to an open hand with fingers extended and 1 to a full pinch, varying as the thumb and the other fingers move toward each other. The evaluation is explained in more detail in chapter 4.
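A minimal sketch of this alternative, assuming the PinchStrength property exposed for a hand by the Leap C# API v2 and the 0.5 threshold used later in the evaluation (the surrounding class is illustrative):

using Leap;

// Illustrative sketch: the alternative detector simply thresholds the
// SDK's built-in pinch strength value ([0-1]).
public class PinchStrengthDetector
{
    const float PinchThreshold = 0.5f;   // value chosen experimentally in the evaluation

    public bool IsPinching(Hand hand)
    {
        // PinchStrength is 0 for an open hand and approaches 1 for a full pinch.
        return hand.PinchStrength > PinchThreshold;
    }
}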

3.6 AR - Hand-motion tracking integration

The design approach consists of integrating the hand and finger information provided by our sensor into the AR scenario. With the sensor data, a virtual hand skeleton is created every time a real hand is detected in the sensor space, and it is drawn into the AR camera view according to the position of the image target.

The design process has relied on the SDKs available for this purpose to achieve the integration of both tracking technologies and the interaction with virtual objects within the created environment.


Figure 15: Pinch Threshold

Coordinate Systems Matching

A first step is to establish a unified coordinate system between the AR target and the hand tracking sensor. The simplest coordinate system matching is based on the assumption that the AR marker and the sensor are placed at the same origin. When this is not the case, i.e. the sensor and the marker are placed in different areas from the camera view perspective, a translation of the Leap Motion's pose matrix needs to be performed using the AR marker as a reference.

A solution to this is to use the image target as the world center and to transfer the Leap Motion's coordinate system to it by multiplying the inverse pose of our main target (the image target) by the pose of the Leap Motion's target. This creates an offset matrix that can be used to bring points from the Leap Motion to the image target coordinate system [47].
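The following Unity C# fragment sketches this offset computation with 4x4 matrices; the pose matrices passed in are placeholders for the values that the trackers would provide at runtime.

using UnityEngine;

// Illustrative sketch: compute the offset matrix that maps points from the
// Leap Motion coordinate system into the image target coordinate system.
public static class CoordinateOffset
{
    public static Matrix4x4 ComputeOffset(Matrix4x4 imageTargetPose, Matrix4x4 leapTargetPose)
    {
        // offset = inverse(image target pose) * Leap target pose
        return imageTargetPose.inverse * leapTargetPose;
    }

    public static Vector3 ToImageTargetSpace(Matrix4x4 offset, Vector3 leapPoint)
    {
        // Transform a Leap point into the image target coordinate system.
        return offset.MultiplyPoint3x4(leapPoint);
    }
}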

Nevertheless, for our prototype it is assumed that the AR marker and the sensor are placed at the same origin, which matches both coordinate systems. This is done both programmatically, through the game engine, and physically, where they are aligned at the same origin from the camera view perspective, as seen in Figure 16.

Figure 16: Leap Motion and Image target set-up


General Architecture

The proposed architecture describes the steps that integrate the previously mentioned technologies and method. It essentially follows the game engine's workflow, in which each element's behaviour works independently and is controlled by scripting; all the elements are integrated into a final embedded product at build time.

It consists of two tracking inputs: a conventional web camera that detects an Image Target, the feature-based marker registered in the application beforehand, and the Leap Motion Controller, which sends motion tracking data through a background service in the operating system. The Image Target registers virtual content according to its pose information in space; the content includes the Virtual Objects, i.e. any 3D model, and the Virtual Hand, a skeleton-hand model whose origin coordinate system lies in the same space as our Image Target, so if the Image Target is out of the camera's field of view, the Virtual Hand will not be drawn on the screen, since it depends on the marker to be drawn over the foreground video. The Hand Controller accesses the Leap Motion service through a .dll library and transforms the points to Unity3D units; it controls the motion of the Virtual Hand and remains persistent while there are no hands in the field of view. Additionally, it fires the Pinch Gesture recognition when it detects thumb and fingers close together. When a pinch gesture is detected, it triggers a flag to look for hittable objects within a boundary, as explained in the previous section 3.5, and attaches the object to the pinch position. Collision detection relies on the properties of each virtual object, which are treated as rigid bodies affected by the physics simulation of the game engine, allowing forces to be added to translate an object when it is grabbed with a pinch. The visual cue indicating to the user that a pinch-collision-grab action has been performed is given by changing the color of the affected virtual object. The AR scene is displayed on a screen by the Video Background renderer. A sketch of how the Virtual Hand's visibility can be tied to the Image Target is shown below.
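As an illustration of this dependency, the sketch below enables or disables the Virtual Hand's renderers and colliders when the marker is found or lost; in the actual Unity set-up this would be driven by Vuforia's trackable-state callbacks, and the class name and the way the callback is wired here are assumptions made for illustration.

using UnityEngine;

// Illustrative sketch: show the skeleton hand only while the Image Target
// is being tracked, mirroring the behaviour described above.
public class HandVisibilityController : MonoBehaviour
{
    public GameObject virtualHand;   // the skeleton-hand model registered to the target

    // Expected to be called from the AR framework's tracking-state callback
    // (e.g. when the image target is found or lost).
    public void OnTargetTrackingChanged(bool targetFound)
    {
        foreach (Renderer r in virtualHand.GetComponentsInChildren<Renderer>())
            r.enabled = targetFound;

        foreach (Collider c in virtualHand.GetComponentsInChildren<Collider>())
            c.enabled = targetFound;
    }
}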

Figure 17: General Application Architecture


AR-Leap Motion Tracking and Rendering

Although the Unity platform deals with the graphics and the integration of libraries for this project, it is worth reviewing the process in more depth by explaining the target pose calculation and content positioning in the AR framework and the hand controller's integration.

The AR framework exposes a 3x4 pose matrix, equation 3.2, that represents the pose of a target with respect to the camera plane. The left 3x3 block is the rotation matrix, which indicates how the target is rotated, while the right column is the translation vector, i.e. the position of the target as seen from the camera [47]; e.g. an identity rotation indicates that the target is parallel to the camera plane, and a translation vector (0,0,0) indicates that the camera and the target are at the same position.

\begin{pmatrix} x \\ y \\ z \end{pmatrix} =
\begin{pmatrix}
r_1 & r_2 & r_3 & t_x \\
r_4 & r_5 & r_6 & t_y \\
r_7 & r_8 & r_9 & t_z
\end{pmatrix}
\begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}
\qquad (3.2)

To position content on the target, the first step is to obtain the projection matrix (a 4x4 matrix that projects the scene onto the image plane of the camera [49]), which is created from the intrinsic camera calibration parameters. Vuforia uses a right-handed coordinate system (with the camera looking along the positive z-axis, the x-axis to the right and the y-axis downwards), which is also the coordinate system for targets, just rotated 90 degrees around the x-axis (x-axis to the right, y-axis upwards and z-axis pointing out of the target plane; in Unity, the z-axis is upwards and the y-axis points out of the target) [47]. Once the projection matrix is known, a perspective frustum must be set from it, with near and far planes defined so that the size of our trackable (defined in the dataset with its features) falls within these two planes, Figure 18.

Figure 18: Image target within the perspective frustum

With these requirements in mind, the virtual content (vertices, normals, indices, texture coordinates) is placed according to the pose matrix calculated at runtime. A small sketch of this composition is given below.
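For illustration, the following fragment shows how a vertex of the virtual content would be transformed by the pose (model-view) and projection matrices before rendering; the matrices are placeholders for the values delivered by the tracker and the camera calibration at runtime.

using UnityEngine;

// Illustrative sketch: place a vertex of the virtual content using the
// pose (model-view) matrix from the tracker and the camera projection matrix.
public static class ContentPlacement
{
    public static Vector3 ProjectVertex(Matrix4x4 projection, Matrix4x4 modelView, Vector3 vertex)
    {
        // Model-view-projection composition, applied per frame with the
        // pose matrix returned by the tracker for the detected target.
        Matrix4x4 mvp = projection * modelView;
        return mvp.MultiplyPoint(vertex);   // homogeneous divide included
    }
}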

The Leap Motion coordinates can then be used to move a virtual object by first normalizing and transforming them to suitable coordinates in the interaction box, Figure 13, as explained previously.


Algorithm 1 shows the AR and Leap Motion interaction (without pinch gesture recognition).

Init: Initialization()
    Load camera parameters;
    Load vertex, normals, textures handlers;
    Create virtual objects;
    Initialize Leap Motion controller;

Main loop (per frame): RenderFrame()
    while trackables found (image targets) do
        Get trackable result with id;
        Get trackable pose matrix (modelViewMatrix);
        Clean vertex, normals, indices, texture buffers;
        Assign vertex, normals, indices, textures of virtual object to trackable;
        LeapFingersService()  // get finger coordinates
        Translate, rotate, scale model view matrix  // multiply by pose matrix
        Multiply by projection matrix;
        Draw elements;
    end

Leap Controller: LeapFingersService()
    Get fingers data;
    Normalize and transform coordinates;
    return finger positions;

Destroy: Destroy()
    Clean buffers;
    Stop trackers;

Algorithm 1: AR-Leap Motion Algorithm

3.7 Proof of Concept for Mobile Devices

Aside from the design discussed in this chapter, at an early stage we integrated the Leap Motion control independently of a mobile device and developed the required services for an Android system to receive coordinate data directly from the sensor through a wireless connection. At this stage, the mobile device only receives the finger coordinates; however, for more precise tracking of the fingers, more information is required (velocities, direction vectors), and the amount of data transferred then reduces the performance on the receiving device. Therefore, a filtering process needs to be implemented on the server side, as well as a controller to manage the tracked data alongside the video frames captured by the device asynchronously.

These performance and integration-complexity reasons led us to continue the development process in a desktop environment, where we have full support and resources for prototyping purposes.

The ideal architecture proposed consists of a conventional laptop computer to which the Leap Motion controller is connected through a USB 3.0 port. The controller's libraries run on a Windows 7 OS. A JavaScript web server (Node.js) runs a script which collects the data received from the device and broadcasts it wirelessly through a websocket in JSON format3.

The Android application consists of the application itself and two services which handle the connection through the websocket. The first service, which we called LeapService, is launched before the application starts in order to create the connection with the JavaScript web server. It connects and creates a socket from which an input stream is obtained and re-sent as a broadcast message on the Android device. The second service, LeapReceiver, is actually a plugin (.jar file) created for the Unity3D application that remains persistent, getting the data from the LeapService and assigning it to our Unity3D variables every time a broadcast message is received.

Figure 19: AR interaction Scenario

Figure 20: Mobile Architecture Concept

3 JSON (JavaScript Object Notation) - http://json.org/


4 Implementation and Evaluation

The architecture and frameworks presented previously, chapter 3, are implemented in this section, starting with a summary of the development tools and equipment required. First we show the implementation of the basic scenario in the Unity engine, after which we present the description of the evaluation, performed with two tasks, and the experimental set-up used.

4.1 Development Tools

• Unity Game Engine [50]. A fully integrated development engine for creating games and interactive 2D-3D content. It provides an editing interface to create content easily and add extra functionality through C# scripting.

• Vuforia AR platform v2.8 [47]. Its SDK has a Unity Extension to integrate AR components, test and assemble applications.

• Leap Motion Controller v2.0 [1]. Provides a Unity Extension with ready-to-use components such as virtual skeleton-hand models and a Hand Controller to deal with the motions directly.

• Windows 7 OS Home edition, running on an Asus laptop with an Intel Core i5 processor and 4 GB of RAM.

• Leap Motion sensor, connected through a USB 3.0 port.

• Logitech QuickCam Pro webcam with 1600x1200-pixel video capture.

• Vuforia Stones Image Target printed on conventional A4-size paper.

4.2 AR-Leap Motion Scene

Our basic scene is composed of the following main elements, called GameObjects:

1. AR Camera. A prefab (an asset that stores a GameObject with its components and properties in Unity) from Vuforia that sets the field of view of the scene and renders the video with the virtual content overlaid; it is basically the Augmented Reality tracker that controls the AR functionality through a script (QCARBehaviour.cs). The Vuforia libraries only work in applications for mobile devices; nevertheless, for test and prototyping purposes it is possible to run the application on a desktop platform using the Unity Play Mode feature1, which uses the webcam available on the system and renders the augmented video on the screen. This is controlled by a WebCamBehaviour.cs script. Finally, the AR Camera loads the corresponding datasets to be tracked, and its field of view points at those trackable components, i.e. the Image Target. Additionally, a sceneControl.cs script has been used to capture keyboard input to start, stop and restart elements in the scene.

1 Vuforia Apps require a Unity Pro license to use this feature

2. Image Target. This Vuforia prefab is placed within the field of view of the AR Camera; its size is fixed so that it falls within the range of the clipping planes of the AR Camera frustum. The clipping planes can be adjusted on the AR Camera, while the size of the Image Target is adjusted through the object's scale transform. It is positioned at the origin (0,0,0) to match the coordinate system of the Hand Controller afterwards. The registration of the virtual objects relative to the Image Target is achieved by simply dragging them onto it as children in the scene graph. Their position can then be adjusted relative to the target (local position) and relative to the world coordinate system. In the scene, we consider the interaction area of the Leap Motion and define a semi-transparent box following the measures of Figure 13, into which we place the other virtual components.

Figure 21: Scene Example

To enable movement and collision detection, each virtual object requires an attached rigid body that fits its shape. The rigid bodies react to Unity's physics engine and can be moved in a realistic way by receiving forces.

3. Hand Controller. Another prefab, provided by the Leap extension, controls the hand motions through a .dll library loaded into Unity. It is composed of a graphical skeleton-hand model, SkeletalHand, used for visualization, and a physics hand model, RigidHand, that works as a rigid body to enable physical contact with other rigid bodies. It is placed at the origin (0,0,0) to match the coordinates of the Image Target.

4. Pinch Recognizer. This is a script attached to the RigidHand that enables the recognition of the pinch gesture as described in section 3.5. When a pinch is performed and the RigidHand is close to an object's rigid body, an OverlapSphere detects whether the RigidHand is colliding with an object; if it is, the script changes the object's color and adds a force to move the object so that it follows the pinch position. When unpinched, the object is released and its original color is restored (a sketch of this behaviour is given after this list).

5. Logging Control. A configuration script that logs events from the scene; it is used for evaluation purposes.

6. Directional Light. A component from the engine that illuminates the whole scene.
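The fragment below sketches the grabbing behaviour of the Pinch Recognizer described in item 4: an overlap sphere around the pinch position selects the closest rigid body, the visual cue is given by tinting it, and a force pulls it towards the pinch. The field names (grabRadius, followStrength) and the simple colour handling are illustrative assumptions rather than the exact implementation.

using UnityEngine;

// Illustrative sketch of the grab behaviour: pick the closest rigid body
// inside a sphere around the pinch position, tint it as a visual cue and
// pull it towards the pinch with a force.
public class GrabController : MonoBehaviour
{
    public float grabRadius = 0.5f;        // bounding sphere around the pinch position
    public float followStrength = 100f;    // scale of the force applied to the grabbed body

    Rigidbody grabbed;
    Color originalColor;

    public void UpdateGrab(bool pinching, Vector3 pinchPosition)
    {
        if (pinching && grabbed == null)
        {
            float closestDistance = grabRadius;
            foreach (Collider hit in Physics.OverlapSphere(pinchPosition, grabRadius))
            {
                if (hit.attachedRigidbody == null) continue;
                float distance = Vector3.Distance(pinchPosition, hit.transform.position);
                if (distance < closestDistance)
                {
                    closestDistance = distance;
                    grabbed = hit.attachedRigidbody;
                }
            }
            if (grabbed != null)
            {
                // Visual cue: tint the grabbed object.
                originalColor = grabbed.GetComponent<Renderer>().material.color;
                grabbed.GetComponent<Renderer>().material.color = Color.green;
            }
        }
        else if (!pinching && grabbed != null)
        {
            // Release: restore the original color and drop the object.
            grabbed.GetComponent<Renderer>().material.color = originalColor;
            grabbed = null;
        }

        if (grabbed != null)
        {
            // Force proportional to the offset between pinch and object.
            Vector3 force = (pinchPosition - grabbed.transform.position) * followStrength;
            grabbed.AddForce(force);
        }
    }
}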


4.3 Experimental Set-up

In order to interact with the content and evaluate the pinch gesture interaction, we have set up a testing area in which a fixed 30x30 cm tabletop is placed on a table, easily reachable by a user sitting in front of it. The tabletop holds a "stones" image target from the Vuforia SDK and the Leap Motion sensor in the center, at the same height, with the aim of preserving the same origin. A webcam stands above the table, pointing its field of view towards the image target so as to encompass the space above the tabletop; the rectangular interaction area above the target given by the sensor is 235.2471 mm high, 235.2471 mm wide and 147.7511 mm deep, with its center at (0,200,0) mm. Finally, a screen is placed in front of the user, where the augmented video captured by the camera is displayed. The devices are connected to a laptop which runs the application and logs the events performed to an external server.

Figure 22: Experiment Set-up

4.4 Evaluation

The evaluation of the pinch gesture in the interaction scene is divided into two tasks developed on top of the AR-Leap Motion scene implementation, section 4.2, measuring the performance of a grab-translate-release action on an object moved from one point to another in the AR space, and analysing the accuracy of the gesture in relation to its positional data, respectively.

A larger group of participants was required for the first task, as it is a more general analysis, while the second task included a few subjects, each performing several movements.

Additionally, a small qualitative analysis based on a questionnaire and observation has been carried out, as well as a System Usability Scale [51] assessment, to obtain information about the user experience with the application.

4.4.1 Test No.1

The first experiment aims to provide a general understanding of the performance of the pinch gesture interaction by measuring the time it takes to perform a grab-translate-release action with two pinch gesture approaches, to evaluate the user's performance over time and practice, and to gain insight into which hands are mostly used to accomplish better results.

This test compares the pinch algorithm of section 3.5 (scene A) with a slightly different one (scene B) that instead uses the automatic pinch strength [1] measure from the Leap Motion SDK, ranging from [0-1], where a fixed threshold of 0.5 enables the grabbing action. This value has been chosen experimentally for being robust enough to be considered a pinch. For both scene A and scene B, the task consists of two cubes placed at the front and back of the interaction area respectively, and two marked zones of the same colors as the cubes placed in the opposite positions. The user is required to pinch each cube with the virtual-hand model drawn over the real hand when it is detected, move it to the area of the same color and release it by un-pinching. To indicate that a correct pinch-grabbing action has been performed, the object turns green, and when it touches the corresponding area, the area turns green as well. The user can use whichever hand feels more comfortable.

Figure 23: 3D scene of Test No. 1

Participants

A group of 20 subjects (young adults and middle-aged adults) was willing to participate in the test; the majority did not have experience with Augmented Reality or hand tracking devices.

Procedure

The test for each participant runs as follows:

• An initial trial to become familiar with the environment.

• Perform the task in scene A three times.

• Perform the task in scene B three times.

• Answer a brief questionnaire, explained later in section 4.4.1.

A variation has been included to counterbalance the practice effect of testing one scene after another: half of the participants start with scene A and the other half start with scene B.


Figure 24: Participant performing test

Participant instructions

1. Spread your hand(s) in front of the table (15 cm above it). Two virtual hand models will be shown.

2. Use a pinch gesture to grab the blue and yellow boxes in the middle of the scene with whichever hand you feel comfortable; the boxes will turn green when grabbed.

3. Move the cubes through the space (as if holding them with the pinch) to the areas indicated in blue and yellow respectively.

4. Un-pinch the objects over the coloured area. The area will turn green when the cube falls into it.

Figure 25: Sequence grab-move release

Data

The collected data is gathered from a series of events registered in the application. Each event is stored on an external server through an internet connection.

Each event entry consists of the parameters shown in Table 3, while the type of data collected for each event is shown in Table 4.

Parameter    Description
User         Participant number
Time         Local time
miniID       Scene ID: 0 (Scene A), 1 (Scene B)
Level        Attempt: 1, 2, 3
GameID       24
EventType    Event reference: 1, 2, 3, 4, 5, 6, 100
EventData    Custom data, depends on EventType

Table 3: Event log entry parameters

Event Type ID   Reference            Description
0               STARTSCENE           Time
1               HANDRECOGNIZED       Right hand found, left hand found - Time
2               GESTURERECOGNITION   Pinch performed - Time
3               GRABSUCCESS          Object grabbed - Time
4               RELEASESUCCESS       Object released - Time
5               OBJECTSUCCESS        Object coordinates (x,y,z) when it collides with the area - Zone: 0=yellow, 1=blue - Time
6               STOPSCENE            Time
100             RESTART              Restart objects position - Time

Table 4: Event Data Type collected on test No. 1
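As an illustration of how such an event entry could be sent to the external server from Unity, the following sketch posts the parameters of Table 3 as a web form; the server URL and the exact field handling are assumptions, not the actual logging code.

using System.Collections;
using UnityEngine;

// Illustrative sketch: send one event entry (Table 3) to an external
// logging server as a web form. The URL is a placeholder.
public class EventLogger : MonoBehaviour
{
    const string ServerUrl = "http://example.com/log";   // hypothetical endpoint

    public void Log(int user, int sceneId, int attempt, int eventType, string eventData)
    {
        StartCoroutine(Send(user, sceneId, attempt, eventType, eventData));
    }

    IEnumerator Send(int user, int sceneId, int attempt, int eventType, string eventData)
    {
        WWWForm form = new WWWForm();
        form.AddField("User", user);
        form.AddField("Time", System.DateTime.Now.ToString("HH:mm:ss"));
        form.AddField("miniID", sceneId);      // 0 = Scene A, 1 = Scene B
        form.AddField("Level", attempt);       // attempt 1, 2 or 3
        form.AddField("GameID", 24);
        form.AddField("EventType", eventType); // 1..6, 100
        form.AddField("EventData", eventData);

        WWW request = new WWW(ServerUrl, form);
        yield return request;                  // wait for the upload to finish
    }
}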

System Usability Scale

The System Usability Scale (SUS) [52][51] is a common, free and reliable questionnaire that assesses the usability of a given product or service in a quick and easy manner. It is widely used due to its flexibility to assess a vast range of interface technologies, and it is easily understood as it provides a single score on an easily interpreted scale [51]. It is composed of 10 statements that are scored on a 5-point scale of strength of agreement (strongly disagree, somewhat disagree, neutral, somewhat agree, strongly agree). Its final score can range from [0-100], where higher scores indicate better usability [51]. The questions are as follows:

1. I think that I would like to use this system frequently.
2. I found the system unnecessarily complex.
3. I thought the system was easy to use.
4. I think that I would need the support of a technical person to be able to use this system.
5. I found the various functions in this system were well integrated.
6. I thought there was too much inconsistency in this system.
7. I would imagine that most people would learn to use this system very quickly.
8. I found the system very cumbersome to use.
9. I felt very confident using the system.

10. I needed to learn a lot of things before I could get going with this system.

Its scoring method consists of taking the odd questions, i.e. 1, 3, 5, 7, 9, and subtracting one (-1) from the chosen point on the scale, while for the even questions, i.e. 2, 4, 6, 8, 10, the selected answer is subtracted from five (5); the ten results are summed up, giving a score on a scale of [0-40]. Finally, to convert the range to [0-100], the result is multiplied by 2.5 [53]. A worked example of this computation is given below.
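As a check of this scoring rule, the short sketch below computes the SUS score from a set of ten answers; applied to the first participant of Table 7 (answers 4, 1, 5, 1, 5, 1, 5, 1, 5, 1) it yields 97.5. The class and method names are illustrative.

// Illustrative sketch of the SUS scoring rule described above.
public static class SusScore
{
    // answers[0..9] hold the selected points (1-5) for questions 1..10.
    public static float Compute(int[] answers)
    {
        int sum = 0;
        for (int i = 0; i < 10; i++)
        {
            if (i % 2 == 0)
                sum += answers[i] - 1;   // odd-numbered questions: answer - 1
            else
                sum += 5 - answers[i];   // even-numbered questions: 5 - answer
        }
        return sum * 2.5f;               // scale [0-40] up to [0-100]
    }
}

// Example: the first participant of Table 7 answered 4,1,5,1,5,1,5,1,5,1,
// which gives (3+4+4+4+4) + (4+4+4+4+4) = 39, i.e. 39 * 2.5 = 97.5.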

As part of the current work, we expect to obtain information about the participants' opinions, since a work of this nature relies notably on the end-user experience, and SUS is a simple and reliable method to assess usability that suits our purpose accordingly.

Questionnaire and observation

Additionally, to support the SUS, we have elaborated an open questionnaire to gather the impressions and thoughts of the participants. It is organized into two categories, focusing on the pinch gesture itself and on the AR scene perception, to summarize their negative and positive aspects.

The following questions are given to the participant right after finishing the test:

• Do you have any previous experience with Augmented Reality?

• Do you have any previous experience with motion sensors using hand gestures?

• Is it intuitive?

• Does the gesture feel responsive enough?

• Was the depth perceivable?

• Was the occlusion of the virtual hand within the object perceived?

• Did it feel frustrating or tiring?

• Do you think it needs practice?

During the experiment, observations and notes are taken.

4.4.2 Test No. 2

The second test is more detailed, in the sense that the positions at which the pinch gesture, section 3.5, is performed are evaluated along with time in different locations of the interaction space.

The task is fairly similar to the previous test; in this case, our coloured target mark is placed in the center of the scene, and 30 cubes located at fixed positions are shown to the user in sequence, one each time he/she moves the current cube to the mark, i.e. the user starts by grabbing the first cube, fixed at one position, with the pinch gesture and translates it to the mark in the center; then a new cube is shown at a different position and the action is repeated.

Figure 26: Cubes distribution along the interaction space


Event Type ID   Reference            Description
0               STARTSCENE           Time
1               HANDRECOGNIZED       Right hand found, left hand found - Time
2               GESTURERECOGNITION   Pinch performed - Time - Pinch coordinates (x,y,z)
3               GRABSUCCESS          Object grabbed - Time - Pinch coordinates (x,y,z)
4               RELEASESUCCESS       Object released - Time - Pinch coordinates (x,y,z)
5               OBJECTSUCCESS        Object coordinates (x,y,z) when it collides with the area - Zone: 0=yellow, 1=blue - Time
6               STOPSCENE            Time
8               CUBEPOSITION         Initial object coordinates (x,y,z)
100             RESTART              Restart objects position - Time

Table 5: Event Data Type collected on test No. 2

The aim is to verify whether the time for pinch recognition and for the object's translation varies significantly with the distance from the target zone (center), and whether this is due to the device's limitations or to the performed gesture itself. The cube positions are fixed in order to test the space throughout, up to the boundaries of interaction.

Participants

Four subjects (young adults) participated in this test, performing 30 movements each: two without any experience and two with previous involvement in the first test. The instructions from test No. 1 are used here, with the only difference being that there is a single cube to move at a time.

Figure 27: Participant showing difficulties in the edges

Data

The data is gathered in the same way as in the previous test, with a difference in the type of data, which now includes the coordinates of each pinch gesture and of each object in the scene, Table 5.


5 Results and Discussion

In this section, we describe and discuss the experimental results obtained, as well as the findings and qualitative observations gathered through the previous evaluation.

5.1 Test No. 1

240 samples gathered across the two scenes presented to the users have been analysed. Each scene is composed of three attempts at the same task and two movements in the space for each attempt (move a cube from the back-left corner to the front and another from the front-right corner to the back).

From the dataset collected during the experiment, an outlier has been identified and discarded to avoid skewing the results.

Scene   Algorithm        Mean     Std. Dev.
A       Pinch            6.0167   4.7023
B       Pinch Strength   5.4160   2.7091

Table 6: Scenes comparison

Figure 28 shows the frequency distribution of the time to perform the tasks in both scenes, showing that there is no significant difference between the two algorithms.

The results showed very little variation, and both frequency distributions overlap in the same time range. Additionally, we tried to assess the learnability of, and familiarization with, the system over the three attempts by measuring the time it takes to move the object to the target mark (grab-release) in the different scenes.

The assumption we made in section 4.4.1 turned out to be false, as the data did not follow the expected pattern of decreasing time on each attempt. In Figure 29 we see the samples taken from each attempt of scene A, where the mean time is 5.9, 6.5 and 7.9 s respectively and the data is dispersed in all the attempts, so more sampling would be needed to give a better answer. However, with the additional data we had already collected, we decided to measure the time it takes to perform the gesture to grab an object by computing the time difference between when a pinch-grab is detected and when a hand is recognized by the system (hand-grab). From our observation, this action usually took more time than the object's movement but had not been taken into account previously. The results in Figure 30 show that there was, in fact, a decrease in time between the attempts for the different movements; even though the collected data is not homogeneous, a reduction in time is easily perceivable, and compared with the grab-release movement, which looks somewhat stable (5.8-6 s), as can be inferred from the shape of the frequency distribution in Figure 28, this behaviour supports the idea that learnability and familiarization with the system reduce the time to perform a task in the Augmented Reality scenario.

Figure 29: Frequency distribution of time from each attempt on Scene A

Figure 30: Means comparison of a) Grab-release and b) Hand-grab movements


Figure 28: Frequency distribution on scene A and scene B

Moreover, when comparing the hand-grab movement in scenes A and B, a more significant result has been observed, giving us the opportunity to verify our first hypothesis, with the implemented algorithm performing better.

Figure 31: Means comparison of attempts from scene A and B

5.2 Test No. 2

For this experiment, 120 samples were taken from 4 participants (30 each), where each sample is the grab-translate-release task from a different position to the middle, ranging from the central area of the interaction space to its boundaries. Distance against time is measured. An increase in time with distance is expected because of the difficulty of tracking the gesture at the boundaries of the interaction space.

Users 1 and 4 had gained experience in the previous test, while users 2 and 3 are novices who interacted with the system after a small training session to become familiar with it.

Figure 32 shows the time and distance measurements, with time increasing progressively with distance and relatively similar jumps at the longer distances, where the boundaries affect the tracking. Except for user 4, whose time increases practically from the first try, very similar values are constant among the users at shorter distances [5.6 cm-10.6 cm], suggesting that the tracking performance and the gesture recognition are good in this range; beyond it, larger variations start to appear. The biggest difference is perceived at 18 cm, where only user 1 is able to complete the movements in a shorter time, while the tracking problems started to be more noticeable.

Figure 32: Time-distance per user

5.3 SUS Assessment

In the System Usability Scale questionnaire, the average score obtained from the 20 participants was 70.87, with a standard deviation of 12.60 and a median of 71.25, which according to [51] is an acceptable result (scores above 70 are considered acceptable). Further information can be inferred from the questionnaire.

USER     Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9  Q10  Score
USER1    4   1   5   1   5   1   5   1   5   1    97.5
USER2    2   3   3   2   4   4   4   3   3   1    57.5
USER3    3   1   4   1   4   3   5   2   5   1    82.5
USER4    4   3   2   2   3   2   4   2   3   3    60
USER5    5   1   5   3   5   2   5   1   4   1    90
USER6    4   2   5   3   4   2   5   3   3   1    75
USER7    2   3   2   1   4   3   4   2   3   2    60
USER8    2   2   2   5   4   3   4   3   4   1    55
USER9    4   2   4   4   4   2   4   3   4   2    67.5
USER10   5   2   2   4   2   1   2   4   3   2    52.5
USER11   4   3   4   4   3   2   3   3   4   3    57.5
USER12   2   3   2   2   3   3   4   2   2   2    52.5
USER13   3   1   4   2   4   1   4   2   5   1    82.5
USER14   4   2   4   1   4   4   5   1   5   1    82.5
USER15   4   2   4   1   2   4   4   2   4   1    70
USER16   4   1   3   1   4   1   2   2   2   1    72.5
USER17   4   2   4   2   4   1   5   4   4   2    75
USER18   4   3   4   1   4   3   4   1   4   1    77.5
USER19   3   1   5   5   5   2   4   3   5   3    70
USER20   4   1   4   2   3   2   4   2   5   1    80

Table 7: SUS results

Table 7 shows the results of the SUS, where the columns Q1...Q10 present the answer selected by each user and the corresponding calculated score. As explained in the previous chapter 4, the assessment is based on a ranking of even and odd questions with different weights according to the selected answer. The odd questions tend to express positive aspects while the even questions express negative ones. In Figure 33, the average of the results for each statement shows that the lowest ranks correspond to even questions, meaning that most participants agreed with negative aspects: "the need to learn a lot of things before getting going with the system" and "unnecessarily complex" were ranked lowest; statements such as the "need for technical support", "inconsistency in the system" and "cumbersome handling" were also ranked negatively, but to a lower degree. This could also have been skewed by users who had more difficulties during the experiment and thus agreed strongly with the negative aspects. In any case, the tendency shows the need to solve most of the technical issues present in the implementation, which also caused scattered results in the previous test. On the other hand, the positive statements also present good results that tip the balance up. Although they contradict each other (positive and negative), the positives are ranked somewhat higher; however, we should consider that the majority of users had no experience with these technologies, and a first impression could have led to more optimistic answers.

Figure 33: SUS Average Results per question

5.4 Questionnaire and Observations

As additional support for gathering information about the user experience, the open questionnaire described in chapter 4 was given to the participants along with the SUS test after the conclusion of the experiment. No personal data was collected during this process; only information about the users' previous experience, Table 8, was requested.

No. of users   Previous experience
14             Do not have experience with AR.
4              Do have experience with AR.
2              Know the concept, but have no experience with it.
16             Do not have experience with hand tracking devices.
4              Have used Kinect mainly, but nobody had used the Leap Motion Controller before.

Table 8: User's previous experience

We have decided to categorize the users' answers and the conductor's observations into two sections to summarize the relevant findings:

1. Pinch Gesture Performance: The responsiveness of the gesture was well perceived; nevertheless, when some participants used their hands too close to the borders and tried to grab the objects, the tracking failed, especially when performing the gesture as a horizontal movement from outside into the space. The frustration and tiredness caused by keeping the hand in the air while pinching really depended on how the user adapted to the system. People who started by slowly touching the objects performed better, improved easily and did not report problems of tiredness, while others who started grabbing with faster movements revealed more frustration and stated that they were somewhat tired. The majority agreed that practice is necessary for control, but that it was easy to get used to, just like using any other interaction device for the first time; over the series of attempts in both scenes, the users felt more comfortable and performed the task with some degree of ease.

2. AR Scene Perception: The main problem identified during the experiment was depth perception, since the user's field of view differs from the camera's. It manifested differently from one user to another, either in front-back or in up-down movements, i.e. while a certain user is confident that he is moving the object towards the back, he is in fact moving it upwards. Moreover, in the words of one participant, "Once I identified where the object was in relation to the real video, using the marker on the table as a reference, it was easier to get the idea of the virtual world", a thought shared among participants. Here, two main problems are identified: the lack of visual feedback from the environment (not from the interaction) and the position of the camera in relation to the user's field of view. The occlusion problem was not perceived as expected, mainly because the virtual hand overlaying the real one was used to interact with the content; only a few noticed the problem at first sight because, at some point, they were trying to use their real hand instead of the virtual one.

Additionally, two experienced users gave their opinions on the overall application for improvement, suggesting deactivating the collision detection when not grabbing an object, to avoid bouncing effects produced by the physics engine in Unity. To tackle the problem of depth perception, another user suggested the use of 3D glasses; although this requires a special display and lenses, which is out of scope, it could be of interest for later improvements.

Depth perception [54] is a common issue to solve in this scenario: more visual cues should be implemented to indicate to the user how the objects are located in space, and it should be considered whether the camera's field of view should be located at the position of the eyes (glasses) to ease the adjustment of the user's perception when moving between the real and virtual worlds. However, the use of the real tabletop as a reference for perceiving the virtual objects in space succeeded in conveying the essence of the definition of Augmented Reality.

5.5 Discussion

From the results and findings described previously for Test No. 1, section 5.1, there was no significant difference in time between the pinch-release movement tasks performed with the two algorithms, invalidating our first expectation. However, after looking at the learnability aspect of the pinch-release movement within our implementation, which turned out to be non-significant as well, the research focused on other clues that could sustain our hypothesis that the user's interaction within the virtual space improves easily with time, chapter 1. This led us to examine the additional data collected and to measure the time between when a hand is recognized and when the pinch gesture is performed. The new results showed that our algorithm performed well across the attempts, although there were random results in the attempts that do not precisely reflect whether it became easier or not (in general, this behaviour was seen in both cases). We conclude that there is indeed a decrease in time on each attempt when performing the gesture, validating the learnability aspect of the gesture performance, but not when moving the object to the specified area; in that case the results were very smooth, meaning that the average time it took to grab-move-release was almost the same in all the samples.

Furthermore, when comparing these results between the two algorithms, we found that there was a relevant improvement with our algorithm, validating our first assumption, but for different events in the scene.

Regarding the qualitative results, the agreement with the learning effect is shared among the users, who stated that they improved to some extent on each attempt, while in the SUS the detection of problems that influenced our results corroborates the dispersion of our data, especially during the grabbing task; it can be inferred from the time it took that it was more difficult to initiate the grabbing action than to manipulate the object with it.

One problem that affected performance initially was the users' depth perception: the point of view of the camera did not help the user to perceive the exact location of the objects along the z-axis (depth) and y-axis (height). Even though farther objects were rendered smaller than closer ones, users were frequently observed to struggle with this problem; only after recognizing the environment did users take the tabletop marker as a reference to infer the locations above it. This achievement is important, as it shows the inner relationship between the real and virtual worlds perceived by the users; however, the awareness was not immediate, and in this sense more visual cues, such as illumination and shadowing [55] or meshes in the scene [32], should guide the user through the augmentation to offer a proper perception of depth.

Although the occlusion problem is not considered in this project, the little evidence of its effects should be taken into consideration. During the tests, we could see that users become immediately and sufficiently aware that a virtual hand is drawn on top of their real one and is controlled by their hand movements, while other virtual objects react accordingly. However, a few participants initially tried to grab the objects by focusing on their real hand rather than using the virtual one, thereby exposing the occlusion problem at the moment of trying to grab the objects and decreasing the level of immersion at a first stage. Using only the real hand would improve the sensation of grabbing the object as in the real world.

Lastly, the results of the second test show the effect of the limits imposed by the interaction space of the device itself: there is a progressive increase in time to move objects away from the centre of the scenario, with increasing time intervals when moving objects at the boundaries of the interaction space.

5.5.1 Limitations

The following limitations were considered for design, implementation and testing purposes:

• As pointed out in [40], the Leap Motion's API stores detected information in constructed data types that are not modifiable, and there is no easy way to access a full depth map in order to potentially correct the detected data or otherwise make use of it. However, the confidence in the exposed data is sufficient to work efficiently at the moment (a short sketch of how this read-only data is consumed is shown after this list).

• As seen before, the Vuforia framework is intended to work with mobile devices rather than desktop applications; the see-through concept in AR is fundamental and it cannot be fully achieved using a desktop screen. However, the proposed integration with mobile devices is difficult to a certain extent, as the Leap Motion requires a USB connection. We managed to send coordinates through a wireless connection, but latency issues [54] and the lack of support libraries for mobile platforms require a higher degree of knowledge to overcome; additionally, the performance at the prototyping stage decreased with this approach. Hence, it was decided to work in a desktop environment using multi-platform tools, ultimately managing the interaction in a fully controlled environment and without the lack of resources that would influence the performance of the application. The details of the proof of concept for mobile devices were explained in Section 3.7.

• The Leap Motion has limitations at certain hand positions: the tracking fails when the hand is placed perpendicularly to the sensor, as the finger shapes are then outside its field of view.

• Test No. 2 was intended to include a deeper analysis beyond measuring position data and time: a description of which hand is used in the right/left regions of the space, to observe performance with outside-in/inside-out movements directed into the space. Sometimes users tend to perform the pinch-grab gesture from outside the range of the sensor, and the easiest solution was to use the left hand to move content on the right and vice versa. However, due to time constraints, this was not possible.
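Regarding the first point of the list above, the following sketch shows how the exposed, read-only hand data of the Leap C# API of that period is typically consumed; it is an illustration only and omits any application logic:

    using Leap;

    // Minimal illustration of the constraint above: the Leap API exposes hands
    // and fingers as read-only frame data (no depth map), so the application can
    // only consume positions, not correct or re-process the underlying sensing.
    class LeapFrameReader
    {
        static void Main()
        {
            Controller controller = new Controller();
            Frame frame = controller.Frame();          // latest tracked frame

            foreach (Hand hand in frame.Hands)
            {
                Vector palm = hand.PalmPosition;       // millimetres, device coordinates
                System.Console.WriteLine("Palm at " + palm.x + ", " + palm.y + ", " + palm.z);

                foreach (Finger finger in hand.Fingers)
                {
                    Vector tip = finger.TipPosition;   // read-only; cannot be modified
                    System.Console.WriteLine("  Tip at " + tip.x + ", " + tip.y + ", " + tip.z);
                }
            }
        }
    }

Frames obtained this way are snapshots; any correction would have to happen in the application's own data structures, since the Leap objects themselves cannot be modified.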


6 Conclusion and Future Work

In the current work we have presented the design and implementation of an AR interface to interact with virtual content by means of a pinch gesture performed with bare hands, using the Leap Motion Controller, a depth-sensing device, to obtain information about hands, fingers and articulations.

Firstly, we introduced the basic concepts of interaction in AR, gesture recognition and its acceptance for interaction. A review of different approaches was presented, revealing scarce implementations with this specific device; most of the research found applied computer vision techniques to depth cameras that usually expose raw data, whereas this device delivers the information of hands and fingers directly. Afterwards, we defined a set of guidelines to follow based on perceptual issues and recommendations from previous works, a description of the development frameworks, and an architecture to integrate the AR and hand-tracking technologies, including the definition of an algorithm to recognize pinch gestures through the device. An evaluation was conducted among different users in order to test the performance of the prototype with a simple grab-move-release task designed for this purpose. We found positive results related to the gesture but various issues in the AR perception.
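To make the gesture step concrete, the following is a minimal sketch of the general idea behind distance-based pinch detection, not the exact algorithm defined in this work; the thresholds are hypothetical and the fingertip positions are assumed to come from the tracking data:

    using UnityEngine;

    // General idea behind distance-based pinch detection: the gesture is
    // triggered when thumb and index fingertips come closer than a threshold,
    // and released when they separate again, with a small hysteresis gap so
    // the state does not flicker at the boundary.
    public class PinchGestureSketch
    {
        private const float PinchOnDistance  = 0.025f; // metres, hypothetical threshold
        private const float PinchOffDistance = 0.040f; // larger release threshold (hysteresis)

        public bool IsPinching { get; private set; }

        public void Update(Vector3 thumbTip, Vector3 indexTip)
        {
            float distance = Vector3.Distance(thumbTip, indexTip);

            if (!IsPinching && distance < PinchOnDistance)
                IsPinching = true;                     // fingers closed: start grabbing
            else if (IsPinching && distance > PinchOffDistance)
                IsPinching = false;                    // fingers opened: release
        }
    }

The two thresholds form a hysteresis gap so that the grab state does not flicker when the fingertip distance hovers around a single boundary value.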

The Leap Motion technology is very promising in the sense that it has potential for a wide range of applications in gesture interaction, virtual and augmented reality environments, and more robust and serious applications such as sign language detection or device control. Combined with an AR scenario, it is potentially useful for virtual modelling and prototyping, collaborative environments, or gaming.

6.1 Contributions

We have designed and developed a prototype for a gesture-based AR application. Despite being at an early stage, it could be used as a basis for future developments in this area. Furthermore, an integration architecture for mobile devices was designed and presented as a proof of concept in which the data is sent through a wireless connection; this will be useful in the near future, as development in this area is currently being carried out1.

To support the development, we have reviewed and analysed the capabilities of the device, based on the few works published at the moment, and presented our experimental findings in the field of AR.

Additionally, we conducted a usability study (SUS) to identify the related user-experience issues and to serve as a basis for further improvements.
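For reference, the standard SUS scoring used for such questionnaires can be computed as in the following sketch (ten responses on a 1-5 scale; odd items contribute response minus 1, even items 5 minus response, and the sum is scaled by 2.5):

    using System;

    class SusCalculator
    {
        // Standard SUS scoring: odd items contribute (response - 1), even items
        // (5 - response); the sum is scaled by 2.5, giving a score from 0 to 100.
        static double SusScore(int[] responses)         // ten responses, each 1..5
        {
            int sum = 0;
            for (int i = 0; i < 10; i++)
            {
                bool oddItem = (i % 2 == 0);            // items 1, 3, 5, ... (0-based index)
                sum += oddItem ? responses[i] - 1 : 5 - responses[i];
            }
            return sum * 2.5;
        }

        static void Main()
        {
            int[] example = { 4, 2, 4, 1, 5, 2, 4, 2, 4, 2 };
            Console.WriteLine("SUS score: " + SusScore(example)); // prints 80
        }
    }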

1 http://www.pocket-lint.com/news/125716-handsfree-control-from-leap-motion-coming-to-tablets-in-q3-4-2014 (last visited June 2014)


6.2 Future work

Based on the results obtained, the future work includes several improvements to the current prototype and additional features to achieve seamless interaction in AR.

Full mobile compatibility is a goal that should be addressed; as mentioned previously, integration on mobile devices would permit the design of a variety of applications, and an early approach combined with AR is an area of value.

Occlusion handling of the real hand is an important next step to replace the virtual hand. Vuforia provides access to the background video, which can be used as a texture; the virtual hand could then be replaced by a visualization of the real hand, obtained by segmenting it from that texture and rendering it with dedicated shaders.
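A rough sketch of the masking step, under strongly simplified assumptions (a naive CPU-side skin-colour test on a readable copy of the camera image; the Vuforia background-texture access and the actual shader pass are omitted):

    using UnityEngine;

    // Rough sketch of hand masking for occlusion: pixels of the camera image
    // that look skin-coloured are kept opaque, everything else is made
    // transparent, so the real hand can be composited over the virtual objects.
    public class HandMaskSketch
    {
        // Assumes cameraImage is CPU-readable (GetPixels would fail otherwise).
        public static Texture2D BuildMask(Texture2D cameraImage)
        {
            Color[] pixels = cameraImage.GetPixels();
            for (int i = 0; i < pixels.Length; i++)
            {
                Color c = pixels[i];
                // Very naive skin test: red clearly dominates green and blue.
                bool skin = c.r > 0.35f && c.r > c.g + 0.08f && c.r > c.b + 0.1f;
                pixels[i].a = skin ? 1f : 0f;
            }
            Texture2D mask = new Texture2D(cameraImage.width, cameraImage.height, TextureFormat.RGBA32, false);
            mask.SetPixels(pixels);
            mask.Apply();
            return mask;
        }
    }

In practice the segmentation would be done on the GPU in a shader for performance, but the principle of building an alpha mask from the background video is the same.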

Recognition should be extended to more gestures, based on the taxonomy for AR interaction reviewed in Chapter 3; this would consist of implementing the gestures and creating a gesture manager to deal with the control and correct recognition of the different gestures.
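A sketch of what such a gesture manager could look like; the interface, the hand-state parameters and the recognizer registration are hypothetical names, not part of the current prototype:

    using System.Collections.Generic;
    using UnityEngine;

    // Sketch of the proposed gesture manager: each recognizer inspects the
    // current hand state and the manager reports which gestures are active, so
    // new gestures from the reviewed taxonomy can be plugged in without
    // touching the interaction code.
    public interface IGestureRecognizer
    {
        string Name { get; }
        bool Recognize(Vector3 thumbTip, Vector3 indexTip, Vector3 palm);
    }

    public class GestureManager
    {
        private readonly List<IGestureRecognizer> recognizers = new List<IGestureRecognizer>();

        public void Register(IGestureRecognizer recognizer)
        {
            recognizers.Add(recognizer);
        }

        // Returns the names of all gestures recognized for the current frame.
        public List<string> Update(Vector3 thumbTip, Vector3 indexTip, Vector3 palm)
        {
            var active = new List<string>();
            foreach (var r in recognizers)
                if (r.Recognize(thumbTip, indexTip, palm))
                    active.Add(r.Name);
            return active;
        }
    }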

Machine learning techniques could be used to obtain better gesture recognition results. Although the current gesture recognition is based on the pinch positions, machine learning could be employed to infer gestures over time, defining a ground-truth data set of movement directions specific to the Leap Motion; even though the raw data is not exposed, by relying on the delivered data, machine learning techniques could effectively improve the gesture recognition.
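As an illustration of this idea, and not a proposal tied to any particular library, a minimal nearest-neighbour sketch over fixed-length sequences of movement directions could look as follows; the feature layout and the labelled templates are assumptions:

    using System;

    // 1-nearest-neighbour classification over fixed-length sequences of movement
    // directions (unit vectors flattened into a feature array), matched against a
    // small labelled ground truth. Templates are assumed to have the same length
    // as the sample.
    class DirectionSequenceClassifier
    {
        private readonly string[] labels;
        private readonly double[][] templates;

        public DirectionSequenceClassifier(string[] labels, double[][] templates)
        {
            this.labels = labels;
            this.templates = templates;
        }

        public string Classify(double[] sample)
        {
            int best = 0;
            double bestDistance = double.MaxValue;
            for (int i = 0; i < templates.Length; i++)
            {
                double d = 0;
                for (int j = 0; j < sample.Length; j++)
                {
                    double diff = sample[j] - templates[i][j];
                    d += diff * diff;               // squared Euclidean distance
                }
                if (d < bestDistance) { bestDistance = d; best = i; }
            }
            return labels[best];
        }
    }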

Finally, the visual feedback should be improved to enhance the user's perception of the virtual world, such as the treatment of shadows and illumination in AR. Some implementations that deal with this feature have already been carried out.


Bibliography

[1] Leap Motion Controller - https://developer.leapmotion.com/ (last visited July 2014).

[2] Guna, J., Jakus, G., Pogacnik, M., Tomažic, S., & Sodnik, J. 2014. An analysis of the precision and reliability of the leap motion sensor and its suitability for static and dynamic tracking. Sensors, 14(2), 3702–3720.

[3] Weichert, F., Bachmann, D., Rudak, B., & Fisseler, D. 2013. Analysis of the accuracy and robustness of the leap motion controller. Sensors, 13(5), 6380–6393. URL: http://www.mdpi.com/1424-8220/13/5/6380, doi:10.3390/s130506380.

[4] Zhou, F., Duh, H. B.-L., & Billinghurst, M. 2008. Trends in augmented reality tracking, interaction and display: A review of ten years of ISMAR. In Proceedings of the 7th IEEE/ACM International Symposium on Mixed and Augmented Reality, ISMAR '08, 193–202, Washington, DC, USA. IEEE Computer Society. URL: http://dx.doi.org/10.1109/ISMAR.2008.4637362, doi:10.1109/ISMAR.2008.4637362.

[5] Billinghurst, M. 2013. Hands and speech in space: Multimodal interaction with augmented reality interfaces. In Proceedings of the 15th ACM on International Conference on Multimodal Interaction, ICMI '13, 379–380, New York, NY, USA. ACM. URL: http://doi.acm.org/10.1145/2522848.2532202, doi:10.1145/2522848.2532202.

[6] Lee, B. & Chun, J. April 2010. Interactive manipulation of augmented objects in markerless AR using vision-based hand interaction. In Information Technology: New Generations (ITNG), 2010 Seventh International Conference on, 398–403. doi:10.1109/ITNG.2010.36.

[7] Störring, M., Moeslund, T. B., Liu, Y., & Granum, E. Computer vision-based gesture recognition for an augmented reality interface.

[8] Yang, M.-T., Liao, W.-C., & Shih, Y.-C. 2013. VECAR: Virtual English classroom with markerless augmented reality and intuitive gesture interaction. In Advanced Learning Technologies (ICALT), 2013 IEEE 13th International Conference on, 439–440. doi:10.1109/ICALT.2013.134.

[9] Lee, M., Green, R., & Billinghurst, M. 2008. 3D natural hand interaction for AR applications. In Image and Vision Computing New Zealand, 2008. IVCNZ 2008. 23rd International Conference, 1–6. doi:10.1109/IVCNZ.2008.4762125.

[10] Microsoft Kinect - http://www.xbox.com/en-US/kinect (last visited April 2014).


[11] Ren, Z., Meng, J., Yuan, J., & Zhang, Z. 2011. Robust hand gesture recognition with Kinect sensor. In Proceedings of the 19th ACM International Conference on Multimedia, MM '11, 759–760, New York, NY, USA. ACM. URL: http://doi.acm.org/10.1145/2072298.2072443, doi:10.1145/2072298.2072443.

[12] Radkowski, R. & Stritzke, C. 2012. Interactive hand gesture-based assembly for augmented reality applications. In ACHI 2012, The Fifth International Conference on Advances in Computer-Human Interactions, 303–308.

[13] Wang, X., Kim, M. J., Love, P. E., & Kang, S.-C. 2013. Augmented reality in built environment: Classification and implications for future research. Automation in Construction, 32(0), 1–13. URL: http://www.sciencedirect.com/science/article/pii/S0926580512002166, doi:10.1016/j.autcon.2012.11.021.

[14] Clark, A. & Piumsomboon, T. 2011. A realistic augmented reality racing game using a depth-sensing camera. In Proceedings of the 10th International Conference on Virtual Reality Continuum and Its Applications in Industry, VRCAI '11, 499–502, New York, NY, USA. ACM. URL: http://doi.acm.org/10.1145/2087756.2087851, doi:10.1145/2087756.2087851.

[15] Lee, G. A., Bai, H., & Billinghurst, M. 2012. Automatic zooming interface for tangible augmented reality applications. In Proceedings of the 11th ACM SIGGRAPH International Conference on Virtual-Reality Continuum and Its Applications in Industry, VRCAI '12, 9–12, New York, NY, USA. ACM. URL: http://doi.acm.org/10.1145/2407516.2407518, doi:10.1145/2407516.2407518.

[16] Piumsomboon, T., Clark, A., & Billinghurst, M. Dec 2011. Physically-based Interaction for Tabletop Augmented Reality Using a Depth-sensing Camera for Environment Mapping. In Proc. Image and Vision Computing New Zealand (IVCNZ-2011), 161–166, Auckland. URL: http://www.ivs.auckland.ac.nz/ivcnz2011_temp/uploads/1345/3-Physically-based_Interaction_for_Tabletop_Augmented_Reality_Using_a_De.pdf.

[17] Regenbrecht, H., Collins, J., & Hoermann, S. 2013. A leap-supported, hybrid AR interface approach. In Proceedings of the 25th Australian Computer-Human Interaction Conference: Augmentation, Application, Innovation, Collaboration, OzCHI '13, 281–284, New York, NY, USA. ACM. URL: http://doi.acm.org/10.1145/2541016.2541053, doi:10.1145/2541016.2541053.

[18] Ren, Z., Yuan, J., & Zhang, Z. 2011. Robust hand gesture recognition based on finger-earth mover's distance with a commodity depth camera. In Proceedings of the 19th ACM International Conference on Multimedia, MM '11, 1093–1096, New York, NY, USA. ACM. URL: http://doi.acm.org/10.1145/2072298.2071946, doi:10.1145/2072298.2071946.

[19] Meng, Y. Gestures interaction analysis and design strategy.


[20] McCallum, S. & Boletsis, C. 2013. Augmented reality & gesture-based architecture in games for the elderly. Studies in Health Technology and Informatics 2013, IOS Press, Volume 189, 139–144.

[21] Boletsis, C. & McCallum, S. 2014. Augmented reality cube game for cognitive training: An interaction study. In Studies in Health Technology and Informatics, Volume 200: pHealth 2014, IOS Press, 81–87.

[22] Piumsomboon, T., Clark, A., Billinghurst, M., & Cockburn, A. 2013. User-defined gestures for augmented reality. In Human-Computer Interaction - INTERACT 2013, Kotzé, P., Marsden, G., Lindgaard, G., Wesson, J., & Winckler, M., eds, volume 8118 of Lecture Notes in Computer Science, 282–299. Springer Berlin Heidelberg. URL: http://dx.doi.org/10.1007/978-3-642-40480-1_18, doi:10.1007/978-3-642-40480-1_18.

[23] Comtet, H. Acceptance of 3D-gestures based on age, gender and experience. Master's thesis, Department of Computer Science and Media Technology, Gjøvik University College, 2013. URL: http://www.nb.no/idtjeneste/URN:NBN:no-bibsys_brage_44195.

[24] Azuma, R. T. et al. A survey of augmented reality.

[25] Van Krevelen, D. & Poelman, R. 2010. A survey of augmented reality technologies, applications and limitations. International Journal of Virtual Reality, 9(2), 1.

[26] Garber, L. 2013. Gestural technology: Moving interfaces in a new direction [technology news]. Computer, 46(10), 22–25. doi:10.1109/MC.2013.352.

[27] Spiegelmock, M. 2013. Leap Motion Development Essentials. Packt Publishing Ltd.

[28] Billinghurst, M. & Buxton, B. 2011. Gesture based interaction. Haptic input, 24.

[29] Ruiz, J., Li, Y., & Lank, E. 2011. User-defined motion gestures for mobile interaction. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '11, 197–206, New York, NY, USA. ACM. URL: http://doi.acm.org/10.1145/1978942.1978971, doi:10.1145/1978942.1978971.

[30] Piumsomboon, T., Clark, A., Billinghurst, M., & Cockburn, A. 2013. User-defined gestures for augmented reality. In CHI '13 Extended Abstracts on Human Factors in Computing Systems, CHI EA '13, 955–960, New York, NY, USA. ACM. URL: http://doi.acm.org/10.1145/2468356.2468527, doi:10.1145/2468356.2468527.

[31] Hürst, W. & Wezel, C. 2013. Gesture-based interaction via finger tracking for mobile augmented reality. Multimedia Tools and Applications, 62(1), 233–258. URL: http://dx.doi.org/10.1007/s11042-011-0983-y, doi:10.1007/s11042-011-0983-y.

[32] Hürst, W. & Wezel, C. 2011. Multimodal interaction concepts for mobile augmented reality applications. In Advances in Multimedia Modeling, Lee, K.-T., Tsai, W.-H., Liao, H.-Y., Chen, T., Hsieh, J.-W., & Tseng, C.-C., eds, volume 6524 of Lecture Notes in Computer Science, 157–167. Springer Berlin Heidelberg. URL: http://dx.doi.org/10.1007/978-3-642-17829-0_15, doi:10.1007/978-3-642-17829-0_15.


[33] Reifinger, S., Wallhoff, F., Ablassmeier, M., Poitschke, T., & Rigoll, G. 2007. Static and dynamic hand-gesture recognition for augmented reality applications. In Human-Computer Interaction. HCI Intelligent Multimodal Interaction Environments, Jacko, J., ed, volume 4552 of Lecture Notes in Computer Science, 728–737. Springer Berlin Heidelberg. URL: http://dx.doi.org/10.1007/978-3-540-73110-8_79, doi:10.1007/978-3-540-73110-8_79.

[34] Grandhi, S. A., Joue, G., & Mittelberg, I. 2011. Understanding naturalness and intuitiveness in gesture production: Insights for touchless gestural interfaces. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '11, 821–824, New York, NY, USA. ACM. URL: http://doi.acm.org/10.1145/1978942.1979061, doi:10.1145/1978942.1979061.

[35] Hoggan, E., Nacenta, M., Kristensson, P. O., Williamson, J., Oulasvirta, A., & Lehtiö, A. 2013. Multi-touch pinch gestures: Performance and ergonomics. In Proceedings of the 2013 ACM International Conference on Interactive Tabletops and Surfaces, ITS '13, 219–222, New York, NY, USA. ACM. URL: http://doi.acm.org/10.1145/2512349.2512817, doi:10.1145/2512349.2512817.

[36] Wilson, A. D. 2006. Robust computer vision-based detection of pinching for one and two-handed gesture input. In Proceedings of the 19th Annual ACM Symposium on User Interface Software and Technology, UIST '06, 255–258, New York, NY, USA. ACM. URL: http://doi.acm.org/10.1145/1166253.1166292, doi:10.1145/1166253.1166292.

[37] Balakrishnan, R. & MacKenzie, I. S. 1997. Performance differences in the fingers, wrist, and forearm in computer input control. In Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems, CHI '97, 303–310, New York, NY, USA. ACM. URL: http://doi.acm.org/10.1145/258549.258764, doi:10.1145/258549.258764.

[38] Leitner, M., Tomitsch, M., Költringer, T., Kappel, K., & Grechenig, T. 2007. Designing tangible table-top interfaces for patients in rehabilitation. In CVHI.

[39] Park, H. & Moon, H.-C. 2013. Design evaluation of information appliances using augmented reality-based tangible interaction. Computers in Industry, 64(7), 854–868. URL: http://www.sciencedirect.com/science/article/pii/S0166361513001127, doi:10.1016/j.compind.2013.05.006.

[40] Potter, L. E., Araullo, J., & Carter, L. 2013. The leap motion controller: A view on sign language. In Proceedings of the 25th Australian Computer-Human Interaction Conference: Augmentation, Application, Innovation, Collaboration, OzCHI '13, 175–178, New York, NY, USA. ACM. URL: http://doi.acm.org/10.1145/2541016.2541072, doi:10.1145/2541016.2541072.

[41] Rubner, Y., Tomasi, C., & Guibas, L. 2000. The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision, 40(2), 99–121. URL: http://dx.doi.org/10.1023/A%3A1026543900054, doi:10.1023/A:1026543900054.


[42] Chun, W. H. & Höllerer, T. 2013. Real-time hand interaction for augmented reality on mobile phones. In Proceedings of the 2013 International Conference on Intelligent User Interfaces, IUI '13, 307–314, New York, NY, USA. ACM. URL: http://doi.acm.org/10.1145/2449396.2449435, doi:10.1145/2449396.2449435.

[43] Bai, H., Gao, L., El-Sana, J., & Billinghurst, M. 2013. Markerless 3D gesture-based interaction for handheld augmented reality interfaces. In SIGGRAPH Asia 2013 Symposium on Mobile Graphics and Interactive Applications, SA '13, 22:1–22:1, New York, NY, USA. ACM. URL: http://doi.acm.org/10.1145/2543651.2543678, doi:10.1145/2543651.2543678.

[44] Dorfmuller-Ulhaas, K. & Schmalstieg, D. 2001. Finger tracking for interaction in augmented environments. In Augmented Reality, 2001. Proceedings. IEEE and ACM International Symposium on, 55–64. doi:10.1109/ISAR.2001.970515.

[45] Hayashi, K., Kato, H., & Nishida, S. 2005. Occlusion detection of real objects using contour-based stereo matching. In Proceedings of the 2005 International Conference on Augmented Tele-existence, ICAT '05, 180–186, New York, NY, USA. ACM. URL: http://doi.acm.org/10.1145/1152399.1152432, doi:10.1145/1152399.1152432.

[46] Rovelo, G. May 2014. Multi-viewer gesture-based interaction for omni-directional video. In Forum sur l'Interaction Tactile et Gestuelle, Lille. Hasselt University. URL: http://fitg.lille.inria.fr/exposes/rovelo/index.html.

[47] Vuforia SDK - https://developer.vuforia.com/ (last visited July 2014).

[48] Bay, H., Tuytelaars, T., & Van Gool, L. 2006. SURF: Speeded up robust features. In Computer Vision – ECCV 2006, Leonardis, A., Bischof, H., & Pinz, A., eds, volume 3951 of Lecture Notes in Computer Science, 404–417. Springer Berlin Heidelberg. URL: http://dx.doi.org/10.1007/11744023_32, doi:10.1007/11744023_32.

[49] Pulli, K., Aarnio, T., Miettinen, V., Roimela, K., & Vaarala, J. 2007. Mobile 3D graphics with OpenGL ES and M3G. Morgan Kaufmann Series in Computer Graphics. Elsevier, Burlington, MA.

[50] Unity Game Engine - http://unity3d.com/learn/documentation (last visited July 2014).

[51] Bangor, A., Kortum, P. T., & Miller, J. T. 2008. An empirical evaluation of the system usability scale. International Journal of Human-Computer Interaction, 24(6), 574–594. URL: http://www.tandfonline.com/doi/abs/10.1080/10447310802205776, doi:10.1080/10447310802205776.

[52] Finstad, K. 2006. The system usability scale and non-native English speakers. Journal of Usability Studies, 1(4), 185–188.

[53] Measuring usability with the system usability scale (SUS) - http://www.measuringusability.com/sus.php (last visited June 2014).


[54] Kruijff, E., Swan, J., & Feiner, S. Oct 2010. Perceptual issues in augmented reality revisited. In Mixed and Augmented Reality (ISMAR), 2010 9th IEEE International Symposium on, 3–12. doi:10.1109/ISMAR.2010.5643530.

[55] Arief, I. A novel shadow-based illumination estimation method for mobile augmented reality system. Master's thesis, Department of Computer Science and Media Technology, Gjøvik University College, 2011.
