paper robust finger tracking for gesture control of mobile

Paper

Robust Finger Tracking for Gesture Control of Mobile

Devices Using Contour and Interior Information of a

Finger

Masakazu Higuchi †, Takashi Komuro (member)†

Abstract In this paper, we propose a method for robust tracking of a moving finger in an image sequence. The method

is suitable for application to our input interface system, which recognizes a moving finger in the air. The proposed method

extracts edges from input images, and then estimates the position and rotation of a finger in the input images by matching

points in a template to edges. The most remarkable feature of our method is that it also takes into account the presence

or absence of edges in regions in the input images corresponding to the inside of the finger in the template for estimating.

This makes it possible to estimate the position and rotation of a finger exactly in images with complex backgrounds. Our

method successfully tracked a finger in several situations with an average processing time of 6.32 [ms/frame], and the finger

was tracked with good accuracy.

Key words: Finger tracking, Edge detection, Chamfer distance, Template matching, Pointing device.

1. Introduction

In recent years, high-performance mobile devices,

such as smart phones, have become popular as tech-

nological advances have allowed computing devices to

become smaller and lighter. However, miniature mobile

devices lack usability because of the small operating (in-

put) areas on their surfaces. Therefore, several input

interface systems providing larger operating area and

greater flexibility have been proposed in order to im-

prove usability8)11)15). These systems typically project

virtual objects on the hands or arms of users, or on sur-

faces of objects in the real world, allowing users to op-

erate them in the real world. As a result, users are able

to give inputs to mobile devices in three-dimensional

space via virtual objects, which improves the operat-

ing area and flexibility of the input interface. However,

there are several drawbacks with these systems. For ex-

ample, they require users to wear special devices, such

as a small projector, or they work only with limited

shapes of image planes for projecting virtual objects.

To overcome these problems, we have proposed a new

input interface system which recognizes a moving fin-

ger in the air by processing images captured from a

monocular camera12)18). The system is shown in Fig. 1.

Received December 28, 2012; Revised March 19, 2013; Accepted April

18, 2013

†Graduate School of Science and Engineering, Saitama University

(255, Shimo-Okubo, Sakura-ku, Saitama-shi, 338-8570, Japan)

Fig. 1 Our input interface system which recognizes a

moving finger in the air.

This experimental system consists of a small monitor, a

monocular camera, and a PC. The monocular camera is

attached to the monitor. The camera captures images

including the user’s finger held in front of the camera,

and then the PC processes the images. The result of

the processing is fed to an application displayed on the

monitor, allowing the user to use the application by

moving his or her finger in the air. This system makes

use of a large operating area, namely, the space in front

of the camera.

There are several challenges for integrating this sys-

tem into mobile devices. First, the face and clothes of

the user and lights on the ceiling may also be included in

images captured by the camera when capturing images

of the user’s finger. The captured images would then

have complex backgrounds, making it difficult to rec-

ITE Trans. on MTA Vol. 1, No. 3, pp. 226-236 (2013)

226

Copyright © 2013 by ITE Transactions on Media Technology and Applications (MTA)

ognize the finger in the images. Second, it is necessary

to reduce the computational complexity of the finger

recognition process in order to use a small amount of

computational resources on mobile devices. In addition,

since our system is intended to be used as a pointing

device, high-speed response with high accuracy of finger

tracking is required, unlike systems such as gesture in-

put systems, which often require only rough recognized

results of hand poses or fingers.

To resolve these problems, in this paper, we propose

a method of estimating the position and rotation of a

finger using the contour and interior information of the

finger in template matching, and tracking the finger in

an image sequence.

The organization of this paper is as follows: In Sec-

tion 2, we describe related research. In Section 3, we

describe the proposed method. In Section 4, we de-

scribe an experiment conducted to evaluate the effec-

tiveness of the proposed method and show the results

of the experiment. In Section 5, we conclude this paper.

2. Related research

In this section, we describe research related to our

work.

There have been many studies on the detection of

hands or fingers in images5)7)9)10)13)14)17). In these stud-

ies, the image features that are mainly used are edges

and color. In5)10)13), only color information was used in

the detection of hands or fingers. The disadvantages

of using only color information are that false detection

may occur in regions having a color close to skin color,

and the method is not robust to illumination changes.

On the other hand, edges are perhaps the most low-

level image features, so it is relatively easy to extract

them in images. Also, one of the advantages of using

edges is that they are robust to a narrow range of il-

lumination changes. In7)9)14)17), edges were used in the

detection of hands or fingers. Stenger et al. made a like-

lihood function using edge and color information and

detected various hand poses in an image sequence by

combining hierarchical template matching and Bayesian

filtering9). Do et al. extracted circular features in edge

images by using the Hough transform and tracked fin-

gers with these features in an image sequence by using

a particle filter14). Athitsos et al. estimated a hand pose

by retrieving the image closest to an input image from

a large database of synthetic hand images by using edge

and line matching7). Oikonomidis et al. simultaneously

tracked a hand and a simple-shaped object manipulated

with the hand in image sequences captured by a multi-

camera system by using edge and color information17).

The common idea in7)9)17) is to generate a parametric

model (template) of hand poses or objects and to esti-

mate parameters to make the best match between the

model and an input image.

As in17), this idea can also be applied to the detection

of general objects in images, and there have been many

studies taking this approach1) 3)6)16). In the detection of

general objects in an image, Steger described a way of

making robust similarity measures against various dis-

turbances and showed that it is effective to use edges

in making similarity measures6). The chamfer distance

used by Borgefors1), in particular, makes smooth edge

distance functions, i.e., similarity measures. Smooth

functions are suitable for iterative optimization algo-

rithms for minimizing an n-dimensional function, such

as Newton’s method, because these functions give the

gradient at each point on domains of functions. There-

fore, an initial search point can be made to converge

toward an optimal solution in the domain space with a

light computational load. In7)9)17), similarity measures

were made using the chamfer distance, and results in

detecting objects showed the effectiveness of this ap-

proach.

Edges are useful in various object detection problems;

however, in the case of finger detection, false detection

often occurs in images with complex backgrounds. This

is because images with complex backgrounds include

many regions similar to the form of a finger itself be-

cause a finger has a simpler form than the whole hand

or general objects.

Also, the non-sequence-based search in7) faces the

problem that the detection processing time is very long.

In9), a sequence-based search, which uses information

between images generated sequentially, was adopted,

but the method requires a heavy computational load

for estimating hand poses in an image sequence, making

the method unsuitable for integration into mobile de-

vices having a small amount of computational resources.

As mentioned above, the problems of lack of robust-

ness in detecting a finger in images with complex back-

grounds, and the long processing time in the detection

of general objects (including fingers) remain unsolved.

On the other hand, in12)18), we have proposed a new

input interface system which captures images including

the user’s finger by the camera attached to a mobile

device, and that recognizes the moving finger in the

air by image processing. In12), since the finger is rec-

227

Paper » Robust Finger Tracking for Gesture Control of Mobile Devices Using Contour and Interior Information of a Finger

ognized by using color information, the system often

fails to recognize the finger in front of the user’s face.

In18), the finger region is extracted by using IR LEDs

and an IR filter, and the finger pose is estimated. In

this system, by turning on and off the IR LEDs in syn-

chronization with capturing images and taking the dif-

ference between two successive images, the system can

detect only the object near the camera, i.e., the finger.

This makes it possible to estimate the finger pose with

a little disturbance of backgrounds. However, when the

finger moves away from the camera, the camera cannot

capture the finger, and detection of the finger becomes

difficult.

In our approach, the finger is detected by using edges

from input images. In addition, the information of the

presence or absence of edges inside the finger is also

used in matching, and it enables robust estimation of

the finger pose against complex backgrounds.

3. Estimation and tracking of finger posi-

tion and rotation

In this section, we introduce the proposed method,

which estimates the position and rotation of a finger

using the contour and interior information of the finger

in template matching, and tracks the finger in an image

sequence.

3. 1 Outline of the proposed method

The proposed method is a kind of template matching

based on the idea of the chamfer distance. A template

of a finger is prepared as a binary image.

Our method estimates four parameters for recogniz-

ing a finger: translations in two directions in a plane

perpendicular to the optical axis of a monocular camera

which captures a sequence of images containing the fin-

ger, a scale change along the optical axis, and a rotation

around the optical axis. In general, it is necessary to

take the rotation of a finger in three-dimensional space

into account so as to estimate the position and rotation

of the finger moving in the space, and therefore, affine

matching is often used. In our method, on the other

hand, since the template is a simple binary image, we

restrict the motions given to our input interface system

to simple motions, e.g., a rotation around only one axis

and not a bent finger, so that the above four parameters

are sufficient for estimating the position and rotation of

the finger.

The estimated parameters are denoted as follows:

p = (tx, ty, s, θ)T , (1)

where tx and ty are the horizontal and vertical trans-

lations of the finger, respectively, and s and θ are the

scale change and the rotation of the finger, respectively.

To solve the problem of non-robustness in images

with complex backgrounds mentioned in Section 2, our

method uses edges extracted from input images and in-

formation about the presence or absence of edges in

regions in the input images corresponding to the inside

of a finger in the template when the template is super-

imposed on the input images. This has the advantage of

fewer false matches. In general, there are few edges in

the region inside a finger. Hence, when the template is

superimposed on input images, if there are many edges

in regions in the input images corresponding to the in-

side of the finger in the template, these regions are rec-

ognized as not being a finger in our method. Therefore,

it is possible to estimate the position and rotation of a

finger exactly in images with complex backgrounds.

In our method, the similarity measure is an edge

distance function, which is smooth, so that an initial

search point converges toward an optimal solution with

a low computational load by using a suitable iterative

optimization algorithm.

Also, our system is intended to be used as a pointing

device, which requires high-speed response and high-

accuracy finger tracking. Our method is a sequence-

based search whose advantage is that the region used

for estimating the position and rotation of a finger is

limited to the region around the finger. Therefore, we

can set an initial search point near the optimal solution

in the finger parameter space, which makes it possible

to converge the point toward the optimal solution with

a small number of iterations and a simple computation.

As a result, we can achieve high-speed response and

high-accuracy finger tracking.

Accordingly, the proposed method achieves robust-

ness in complex backgrounds and high-speed process-

ing in finger recognition, which are necessary features

for integrating the method into mobile devices. More-

over, our method also achieves high-speed response and

high-accuracy finger tracking, which are necessary fea-

tures for pointing devices.

The flow of the proposed method is as follows:

Step 1. Set an initial search point in the finger pa-

rameter space for an input image captured by a

monocular camera.

Step 2. Focus on a region of interest (ROI) in the

input image.

Step 3. Extract edges of the ROI, and make a dis-

ITE Trans. on MTA Vol. 1, No. 3 (2013)

228

tance map of the edge image.

Step 4. Estimate parameters for the position and

rotation of a finger by template matching.

Step 5. Get the next input image, and return to

Step 1.

The details of each step are described in the follow-

ing subsections. The most remarkable feature of our

method is the use of information about the presence or

absence of edges in regions in the input images corre-

sponding to the inside of a finger in the template when

the template is superimposed on the input images in

Step 4. Details of this are described in Subsections 3. 5

and 3. 6.

3. 2 The initial search point for estimating

parameters

An image captured by the camera at the current time

is denoted as frame k, and the captured image at the

previous time is frame k − 1. Let pstk and p̂k−1 be an

initial search point for estimating parameters for the

position and rotation of the finger in frame k and a pa-

rameter estimation point in frame k − 1, respectively.

In our method, we have

pstk = p̂k−1. (2)

Excessively fast input motions are not allowed in our

input interface system, so it is possible to assume that

the difference between frames is small. Therefore, the

initial search point given by Eq. (2) is close to a param-

eter estimation point in frame k in the parameter space.

This condition leads to an estimation result with high

accuracy and reduced search time.

3. 3 ROI

Elements of pstk are denoted as follows:

pstk = (txstk , ty

stk , s

stk , θ

stk )

T . (3)

We focus on a ROI that is a rectangular region whose

center is the pixel position (txstk , ty

stk ) in frame k. The

height and width of the ROI are

ROIH = 2TH · sstk and ROIW = 2TW · sstk , (4)

respectively, where TH and TW are the height and width

of the template, respectively. Eq. (4) means that the

size of the ROI depends on that of the finger.

3. 4 Edge extraction and distance map

Since the performance of the proposed method is de-

pendent on the accuracy of edge extraction, we use color

information of captured images to extract edges from

them. In brief, we apply a Sobel filter to each color

component (RGB) of the ROI and compute gradient

Edges from grayscale image Edges from color image

Fig. 2 Comparison between edges from gray-scale and

color images.

Edge image Distance map

Fig. 3 Distance map corresponding to edge image.

intensities of pixels in each color component; in other

words, pixels in the ROI have three (RGB) gradient in-

tensities. For each pixel in the ROI, we select the max-

imum from three values as the gradient intensity of the

pixel. Then, edges of the ROI are obtained by thresh-

olding the gradient intensities of pixels in the ROI4).

As shown in Fig. 2, this gives edges with good accu-

racy compared with the conventional way that extracts

edges using the pixel brightness.

Next, we make the distance map for the edge image of

the ROI. In the edge image, the distance to the closest

edge pixel, called the chamfer distance, is calculated at

each non-edge pixel. The distance is zero at each edge

pixel. The distance map is a map generated by giving

the chamfer distance for each pixel in the edge image1).

In the distance map, the distance increases with the dis-

tance from the edge pixels. Fig. 3 shows an example of

the distance map corresponding to an edge image. The

distance map is used to make the similarity measure.

Using the chamfer distance results in a smooth distance

function that is suitable for iterative optimization algo-

rithms for minimizing an n-dimensional function.

3. 5 Template matching using the contour

and interior information of a finger

Estimation of the position and rotation of a finger

is achieved by template matching with a binary tem-

plate image of a finger. In our method, we take edges

in the inside of a finger into account. Though there are

variations among individuals, in general, there are few

edges in the inside of a finger, as shown in Fig. 4. We

229


Input image Edge image

Fig. 4 The interiors of fingers, showing lack of edges.

Template image (binary)

Contour points C

Interior points IN

Fig. 5 Template image.

build this condition into the similarity measure to esti-

mate the position and rotation of a finger, which makes

it possible to recognize a finger exactly in images with

complex backgrounds.

3. 6 Optimization by gradient method

We make the similarity measure for estimating the

position and rotation of a finger as follows: First, as

shown in Fig. 5, we register the positions of the con-

tour and interior points of a finger in the template. Let

C and IN be the set of all contour points and the set

of all interior points, respectively.

Next, we compose the following similarity measure,

E, as a real-valued function depending on p:

E(p) = E1(p) + E2(p) + E3(p) (5)

where

E1(p) =∑x∈C

{ID (W (x, p))}2 , (6)

E2(p) =∑x∈IN

λx,p, (7)

E3(p) = h · ty. (8)

Here

W is a warp function

W (x, p) = sR(θ)x+ (tx, ty)T , (9)

where R(θ) is the rotation matrix depending on θ;

ID is a distance map;

λx,p is a positive constant at an edge point and

zero at a non-edge point in ID, i.e.,

λx,p

{= 0 (ID (W (x, p)) |= 0)

> 0 (ID (W (x, p)) = 0); (10)

and h is a positive correction weight toward the

top of the finger.

We estimate p that minimizes E. Eq. (6) is the term

for estimating the position and rotation of the finger.

The value of this term is small when new positions

of contour points of a finger in the template trans-

formed by the warp function are close to positions of

edge points in the edged ROI. Eq. (7) is a penalty term

for the finger interior. The value of this term is large

when positions in the edged ROI corresponding to new

positions of interior points of a finger in the template

transformed by the warp function are the positions of

edge points. In general, since there are few edges in

the inside of a finger, if there are many edges in the

corresponding region, that region is recognized as not

being a finger. Therefore, it is possible to estimate the

position and rotation of a finger exactly in images with

complex backgrounds. Eq. (8) is a term for prevent-

ing a finger from being recognized in the region below

the first knuckle joint of the finger. By increasing this

value, a finger is more recognizable in the region above

the first knuckle joint of the finger.

The similarity measure given by Eq. (5) is a smooth

function because it is made using the chamfer distance,

which makes it suitable for iterative optimization al-

gorithms. Fig. 6 shows an example of our similarity

measure. The graph in Fig. 6 shows the variation of E

depending on tx of p giving an estimation solution, with

the other parameters (ty, s, and θ) fixed. In the graph,

0

20000

40000

60000

80000

100000

120000

20 30 40 50 60 70 80 90 100

ener

gy

tx

Fig. 6 An example of our similarity measure.


230

there are few ups and downs around the optimal solu-

tion. Hence, by choosing a suitable initial parameter

search point, the point converges toward the optimal

solution with a small number of iterations, which leads

to reduced search time.

The similarity measure is minimized by using a gra-

dient method. In the gradient method, the rule for

updating parameters is as follows:

p← p− α(Δtx,Δty, ωsΔs, ωθΔθ)T (11)

where

Δtx =∂E(p)

∂tx, Δty =

∂E(p)

∂ty, (12)

Δs =∂E(p)

∂s, Δθ =

∂E(p)

∂θ. (13)

Here, ωs and ωθ are positive update weights for s and

θ, respectively, and α is a positive ratio of parameter

update sizes. The solution is sensitive to scale and ro-

tation changes, so it is necessary to adjust the amounts

of these changes by weighting them.

3. 7 Initial detection and evaluating suc-

cess/failure of finger tracking

We give an initial position and rotation of a finger by

holding the finger over a specific region in the captured

images. This is shown in Fig. 7. The left image in

Fig. 7 is an image used for detecting an initial finger.

Then, a ROI is centered in the image, and the template

image is displayed in the ROI with a suitable size and

color (red in the example shown). An initial finger is

detected by fitting the finger to the red template. For

each frame, the matching score is constantly computed

by estimating finger parameters with a local random

search in the ROI. The condition for the initial finger

detection is as follows: Let

pst1 = ( 12IW , 12IH , 1, 0)T , (14)

and if

Fig. 7 The initial detection of a finger.

∃p ∈ N(pst1 ) s.t. E1(p) + E2(p) <= TS , (15)

then an initial finger is detected, and tracking of the

finger starts with p, where IH and IW are the height

and width of the captured images, respectively, N(pst1 )

is the neighborhood of pst1 , and TS is a positive thresh-

old for detecting an initial finger. When a finger enters

the ROI, the computed matching scores are so small as

to satisfy Eq. (15).

Also, if tracking failure is detected in a certain num-

ber of sequential frames, then the tracking is judged to

have failed, and the method returns to the initial finger

detection. The condition for evaluating the success or

failure of finger tracking is as follows: If, for a frame,

E1(p) + E2(p) >= TF , (16)

then the tracking is considered to have failed in that

frame, where TF is a positive threshold for evaluating

the tracking.

4. Performance evaluations

In this section, we describe an experiment conducted

to evaluate the effectiveness of the proposed method,

and we show the results of the experiment. The effec-

tiveness of the proposed method was tested for several

pseudo-input motions in assumed situations by using

the experimental system shown in Fig. 1. In the exper-

iment, we handled our system as if it were operating

on an actual mobile device. Two pseudo-input motions

with specific backgrounds were captured by the monoc-

ular camera several times for each motion. Then, we

applied the proposed method to image sequences of the

two kinds of motion.

4. 1 Experimental settings

In the experiment, we used the backgrounds, input

motions, camera, lens, and PC shown in Table 1. We

considered combinations of backgrounds and input mo-

tions, i.e., a total of 6 situations. For each of the situa-

tions of input motions on clothes, 15 videos were made,

and for each of the situations of input motions on face

Table 1 Experimental settings.

Background Clothes, Face, Light

Input motion Slide, Push

Camera Firefly MV from Point Grey Research Inc.

Frame size: 640 × 480 pixels

Frame rate: 60 fps

Lens Focal distance: 4 mm

PC Optiplex390 from Dell Inc.

CPU: Intel(R) Core(TM) i3-2120

(3 M Cache, 3.30 GHz)

RAM: 3.00 GB

OS: 32 bit Windows 7 Professional

231


Frame 60 Frame 80 Frame 100


On clothes

On face


On light


Fig. 8 Snapshots of finger tracking (the number of iterations in each frame was 10).

and light, 5 videos were made, i.e., a total of 50 videos.

The subjects of the experiment were 3 persons. We ap-

plied the proposed method to each video and evaluated

the success rate and accuracy of finger tracking. The

length of each video was 2 seconds, i.e., 120 frames.

4. 2 Results

( 1 ) Robustness

Fig. 8 shows snapshots of the tracking of a finger in

images containing clothes, a face, and a light. In the fig-

ure, the monochrome regions are edged ROIs, and red

lines in these regions are transformed templates match-

ing a finger in frames. The number of iterations in the

gradient method for estimating finger parameters was

10 in each frame. Finger tracking was successful in each

of the backgrounds shown in the figure. Table 2 shows

the success rates of finger tracking. Success ratios were


232

Table 2 Finger tracking success rates (%).

Number of Success rate

iterations No-inside Proposed

5 42 84

10 50 90

15 50 96

20 52 96

Table 3 The average processing time per frame

(ms/frame).

ProcessingNumber of iterations

5 10 15 20

Edge extraction 3.71 3.91 4.11 4.19

Distance map 0.50 0.51 0.53 0.54

Matching 0.96 1.90 2.83 3.76

Total 5.17 6.32 7.47 8.49

computed for tracking results with 5, 10, 15, and 20

iterations in the gradient method. In the table, values

in the row labeled “Proposed” are the success rates of

finger tracking using the proposed method. We also

computed success rates for finger tracking without us-

ing information of the interior of a finger, i.e., the case

where E2(p) = 0 in Eq. (5). These are the values in the

row labeled “No-inside”. Tracking was judged to have

failed when the red template obviously came off the fin-

ger in subjective visual evaluation. The results in the

table show that our method successfully tracked a finger

in almost all of the test videos with 10 or more itera-

tions. Also, the results show the effectiveness of using

the information of the finger interior in finger tracking

in images with complex backgrounds.

( 2 ) High-speed processing

Table 3 shows the processing time per frame when

our method was applied to test videos. The processing

time includes the time for extracting edges, the time

for making a distance map, and the time for estimating

finger parameters (Matching). In the case of 10 itera-

tions, which is the smallest number of iterations that

gave good performance in finger tracking, the average

processing time was 6.32 [ms/frame] in our method. Al-

though parameter estimation (Matching) often requires

a long time compared with the other processing steps,

the results in Table 3 demonstrate that we were able

to greatly reduce this processing time. The majority

of the processing time was the edge extraction time. If

this could be reduced, it would be possible to speed up

our method even more in the future. The performance

of mobile devices is progressing dramatically, and our

system is promising for integrating into mobile devices.

( 3 ) Tracking accuracy

Fig. 9 shows an example of a comparison of track-

ing paths obtained by an exhaustive search and by

Table 4 Root mean square (RMS) error.

ParameterNumber of iterations

5 10 15 20

tx 1.65 0.97 0.72 0.66

ty 9.65 6.43 4.29 3.64

s 0.21 0.15 0.11 0.09

θ 2.75 1.73 1.53 1.41

our method. In the figure, paths labeled “ES” are

those obtained by the exhaustive search, and paths la-

beled “PM–n times” are those obtained by the proposed

method with n iterations of the gradient method. In

the exhaustive search, we searched for finger parame-

ters everywhere in the neighborhood of an initial search

point in each frame. Parameter step-lengths in the ex-

haustive search were 0.1 pixels for translations, 0.01 for

scale change, and 0.1 degrees for rotation. The exhaus-

tive search successfully tracked a finger in all of the

test videos. We regarded the result obtained by the ex-

haustive search as a correct finger tracking result. The

figure shows that the paths for translations (tx and ty)

obtained by our method with 10 iterations were sim-

ilar to those obtained by the exhaustive search. The

paths for scale change (s) obtained by our method were

very different in shape from those obtained by the ex-

haustive search, but the absolute values differed only

slightly. Fig. 10 shows frame 60, which corresponds

to the positions of the dashed lines in Fig. 9, in track-

ing videos using the exhaustive search and our method

with 10 iterations. The figure shows that there is a

slight difference between the exhaustive search and our

method, but that the template almost fits the finger in

our method.

Table 4 shows the accuracy of the finger tracking,

given by the root mean square (RMS) error between

the tracking paths obtained by the exhaustive search

and our method. The RMS error is computed as fol-

lows:

RMS =

√1N

∑Ni=1(p̂i,ES − p̂i,PM)2, (17)

where p̂i,ES and p̂i,PM are parameter solutions esti-

mated by the exhaustive search and the proposed

method in frame i, respectively, and N is the total num-

ber of frames. The RMS error was computed for each

parameter, i.e., translations, scale change, and rotation,

in each test video. Thus, four RMS values are obtained

for each test video. We averaged all RMS values ob-

tained from all test videos for each parameter. The

table shows that the finger tracking accuracy improved

as the number of iterations increased. The accuracy for

vertical translation was low compared with that of the

233


Fig. 9 Comparison of tracking paths for tx, ty, s and θ.

other parameters. The finger template tends to shake

up and down during finger tracking because the finger

shape is simple. It is difficult to prevent this behav-

ior. However, the tracking itself did not fail, and the

Exhaustive search Proposed method

Fig. 10 Comparison between frame 60 in tracking

videos using the exhaustive search and our

method with 10 iterations, from which the

graphs in Fig. 9 are obtained.

accuracy of the vertical translation was acceptable.

The above results showed that our method achieved

finger tracking with good accuracy and short processing

time.

4. 3 Comparison with conventional methods

We compare our method with others described in7)9)

and17), as conventional methods of the hand pose de-

tection, and describe the superiority of our method to

them.

These conventional methods have a common feature

that they use the chamfer distance to calculate the simi-

larity measure for estimating hand poses as our method

does, but do not use information of the finger interior.

As seen in the experimental result in Subsection 4. 2,

finger tracking becomes unstable without information

of the finger interior.

These conventional methods make a model of the

whole hand in the high dimensional parameter space,

and then estimate parameters that match best with

an input image. Accurate estimation is not always re-

quired in these conventional methods. In7), a hand pose

is estimated by retrieving the image closest to an input

image from a large database of synthetic hand images

by using both the chamfer distance and line matching

cost. The hand parameters corresponding to the re-

trieved image are discrete and low resolution. In9), a

likelihood function is made of the chamfer distance and

color information, and then various hand poses are de-

tected in an image sequence by combining hierarchical

template matching and Bayesian filtering. Since a hi-

erarchy of templates is constructed by partitioning the

parameter space, the solution is also discrete. In17),

the similarity measure is also made of the chamfer dis-

tance and color information, and then hand pose pa-

rameters are estimated by using particle swarm opti-

mization (PSO). The accuracy of estimation solutions

in7) and9) is not sufficient because the estimated param-


234

eters are discrete and low resolution. The accuracy in17)

is bad when using real image data.

Our method is developed specific to a pointing de-

vice, and the aim is different from those of conventional

methods above. In a pointing device, high accuracy is

needed in estimating the pose of a finger. Our tracking-

based method achieves the estimation of finger parame-

ters with higher-accuracy compared to the conventional

methods.

5. Conclusion

In this paper, we have proposed a method for robust

tracking of a moving finger in an image sequence. The

method is suitable for implementing in our input inter-

face system, which recognizes a moving finger in the air

by processing images captured by a monocular camera.

The proposed method extracts edges from input images,

then estimates the position and rotation of a finger in

input images by matching points in a template to edges,

and tracks the finger.

By using information about the presence or absence

of edges in regions in input images corresponding to the

inside of a finger in the template when the template

is superimposed on the input images, it is possible to

estimate the position and rotation of a finger exactly

in images with complex backgrounds. The similarity

measure is an edge distance function using the cham-

fer distance; in other words, it is a smooth function,

so that iterative optimization algorithms can be used

to minimize an n-dimensional function. Moreover, the

proposed method is a sequence-based search whose ad-

vantage is that the region used to estimate the position

and rotation of a finger is limited to the region around

the finger. Therefore, we can set an initial search point

near the optimal solution in the finger parameter space,

and hence, it is possible to converge the point toward

the optimal solution with a small number of iterations

and a simple computation. As a result, our method can

achieve finger tracking with high-speed response and

high accuracy.

The effectiveness of the proposed method was tested

in several situations by using an experimental system

including a PC. By applying our method to test videos

including pseudo-input motions with complex back-

grounds, our method successfully tracked a finger in

all test videos with an average processing time of 6.32

[ms/frame], and the finger was tracked with good accu-

racy.

Accordingly, the proposed method showed robust-

ness with complex backgrounds and achieved high-

speed processing in finger recognition. Moreover, our

method also achieved finger tracking with high-speed

response and high accuracy, which are necessary fea-

tures for pointing devices.

In addition, we greatly reduced the parameter esti-

mation time. However, there is still room for reducing

the whole processing time. Also, the performance of

mobile devices is progressing dramatically year by year,

and our system is promising for integrating into mobile

devices.

Acknowledgment

This work is joint research with Exvision Inc.

References

1) G. Borgefors : “Hierarchical Chamfer Matching: A Paramet-

ric Edge Matching Algorithm” , IEEE Transactions on Pattern

Analysis and Machine Intelligence, 10, 6, pp. 849–865 (1988)

2) D. G. Lowe : “Fitting Parameterized Three-Dimensional Models

to Images” , IEEE Transactions on Pattern Analysis and Machine

Intelligence, 13, 5, pp. 441–450 (1991)

3) D. M. Gavrila, V. Philomin : “Real-Time Object Detection for

“Smart” Vehicles” , The Proceedings of the Seventh IEEE Inter-

national Conference on Computer Vision, 1, pp. 87–93 (1999)

4) S. Zhu, K. N. Plataniotis, A. N. Venetsanopoulos : “Compre-

hensive Analysis of Edge Detection in Color Image Processing” ,

Optical Engineering, 38, 4, pp. 612–625 (1999)

5) L. Bretzner, I. Laptev, T. Lindeberg : “Hand Gesture Recog-

nition using Multi-Scale Colour Features, Hierarchical Models

and Particle Filtering” , Proceedings of Fifth IEEE Interna-

tional Conference on Automatic Face and Gesture Recognition,

pp. 423–428 (2002)

6) C. Steger : “Occlusion, Clutter, and Illumination Invariant Ob-

ject Recognition” , International Archives of Photogrammetry,

Remote Sensing, and Spatial Information Sciences, XXXIV, 3A,

pp. 345–350 (2002)

7) V. Athitsos, S. Sclaroff : “Estimating 3D Hand Pose from a

Cluttered Image” , Proceedings of 2003 IEEE Computer Soci-

ety Conference on Computer Vision and Pattern Recognition, 2,

pp. 432–439 (2003)

8) H. Roeber, J. Bacus, C. Tomasi : “Typing in Thin Air: The

Canesta Projection Keyboard – A New Method of Interaction

with Electronic Devices” , CHI’03 Extended Abstracts on Hu-

man Factors in Computing Systems, pp. 712–713 (2003)

9) B. Stenger, A. Thayananthan, P. H. S. Torr, R. Cipolla : “Model-

Based Hand Tracking Using a Hierarchical Bayesian Filter” ,

IEEE Transactions on Pattern Analysis and Machine Intelligence,

28, 9, pp. 1372–1384 (2006)

10) O. Gallo, S. M. Arteaga, J. E. Davis : “A Camera-Based Pointing

Interface for Mobile Devices” , Proceedings of the International

Conference of Image Processing, pp. 1420–1423 (2008)

11) P. Mistry, P. Maes : “SixthSense: A Wearable Gestural Inter-

face” , ACM SIGGRAPH ASIA 2009 Sketches Article, 11 (2009)

12) T. Niikura, Y. Hirobe, A. Cassinelli, Y. Watanabe, T. Komuro,

M. Ishikawa : “In-Air Typing Interface for Mobile Devices with

Vibration Feedback” , ACM SIGGRAPH 2010 Emerging Tech-

nologies Article, 15 (2010)

13) J. Ahn, J. Kim : “A Stable Hand Tracking Method by Skin

Color Blob Matching” , Pacific Science Review, 12, 2, pp. 146–

151 (2011)

14) M. Do, T. Asfour, R. Dillmann : “Particle Filter-Based Fingertip

Tracking with Circular Hough Transform Features” , The 12th

IAPR Conference on Machine Vision Applications, pp. 471–474

(2011)

15) C. Harrison, H. Benko, A. D. Wilson : “OmniTouch: Wearable

Multitouch Interaction Everywhere” , Proceedings of the 24th

Annual ACM Symposium on User Interface Software and Tech-

nology, pp. 441–450 (2011)

235


16) S. Hinterstoisser, S. Holzer, C. Cagniart, S. Ilic, K. Kono-

lige, N. Navab, V. Lepetit : “Multimodal Templates for Real-

Time Detection of Texture-less Objects in Heavily Cluttered

Scenes” , 2011 IEEE International Conference on Computer Vi-

sion, pp. 858–865 (2011)

17) I. Oikonomidis, N. Kyriazis, A. A. Argyros : “Full DOF tracking

of a hand interacting with an object by modeling occlusions and

physical constraints” , 2011 IEEE International Conference on

Computer Vision, pp. 2088–2095 (2011)

18) T. Niikura, Y. Watanabe, T. Komuro, M. Ishikawa : “In-air Typ-

ing Interface: Realizing 3D operation for mobile devices” , 2012

IEEE 1st Global Conference on Consumer Electronics, pp. 228–

232 (2012)

Masakazu Higuchi Masakazu Higuchi re-ceived the PhD degree in Science from the Ni-igata university in 2005. He is currently a re-searcher of graduate school of science and engineer-ing at Saitama University. His research interestsinclude 3D user interfaces and theory of multicrite-ria games.

Takashi Komuro Takashi Komuro receivedthe B.E., M.E., and Ph.D. degrees in mathemat-ical engineering and information physics from theUniversity of Tokyo in 1996, 1998, and 2001, re-spectively. At present he is an associate profes-sor of mathematics, electronics and informatics atSaitama University. His current research interestsinclude image sensing and 3D user interfaces.


236

paper robust finger tracking for gesture control of mobile

Documents