paper robust finger tracking for gesture control of mobile
TRANSCRIPT
Paper
Robust Finger Tracking for Gesture Control of Mobile
Devices Using Contour and Interior Information of a
Finger
Masakazu Higuchi †, Takashi Komuro (member)†
Abstract In this paper, we propose a method for robust tracking of a moving finger in an image sequence. The method
is suitable for application to our input interface system, which recognizes a moving finger in the air. The proposed method
extracts edges from input images, and then estimates the position and rotation of a finger in the input images by matching
points in a template to edges. The most remarkable feature of our method is that it also takes into account the presence
or absence of edges in regions in the input images corresponding to the inside of the finger in the template for estimating.
This makes it possible to estimate the position and rotation of a finger exactly in images with complex backgrounds. Our
method successfully tracked a finger in several situations with an average processing time of 6.32 [ms/frame], and the finger
was tracked with good accuracy.
Key words: Finger tracking, Edge detection, Chamfer distance, Template matching, Pointing device.
1. Introduction
In recent years, high-performance mobile devices,
such as smart phones, have become popular as tech-
nological advances have allowed computing devices to
become smaller and lighter. However, miniature mobile
devices lack usability because of the small operating (in-
put) areas on their surfaces. Therefore, several input
interface systems providing larger operating area and
greater flexibility have been proposed in order to im-
prove usability8)11)15). These systems typically project
virtual objects on the hands or arms of users, or on sur-
faces of objects in the real world, allowing users to op-
erate them in the real world. As a result, users are able
to give inputs to mobile devices in three-dimensional
space via virtual objects, which improves the operat-
ing area and flexibility of the input interface. However,
there are several drawbacks with these systems. For ex-
ample, they require users to wear special devices, such
as a small projector, or they work only with limited
shapes of image planes for projecting virtual objects.
To overcome these problems, we have proposed a new
input interface system which recognizes a moving fin-
ger in the air by processing images captured from a
monocular camera12)18). The system is shown in Fig. 1.
Received December 28, 2012; Revised March 19, 2013; Accepted April
18, 2013
†Graduate School of Science and Engineering, Saitama University
(255, Shimo-Okubo, Sakura-ku, Saitama-shi, 338-8570, Japan)
Fig. 1 Our input interface system which recognizes a
moving finger in the air.
This experimental system consists of a small monitor, a
monocular camera, and a PC. The monocular camera is
attached to the monitor. The camera captures images
including the user’s finger held in front of the camera,
and then the PC processes the images. The result of
the processing is fed to an application displayed on the
monitor, allowing the user to use the application by
moving his or her finger in the air. This system makes
use of a large operating area, namely, the space in front
of the camera.
There are several challenges for integrating this sys-
tem into mobile devices. First, the face and clothes of
the user and lights on the ceiling may also be included in
images captured by the camera when capturing images
of the user’s finger. The captured images would then
have complex backgrounds, making it difficult to rec-
ITE Trans. on MTA Vol. 1, No. 3, pp. 226-236 (2013)
226
Copyright © 2013 by ITE Transactions on Media Technology and Applications (MTA)
ognize the finger in the images. Second, it is necessary
to reduce the computational complexity of the finger
recognition process in order to use a small amount of
computational resources on mobile devices. In addition,
since our system is intended to be used as a pointing
device, high-speed response with high accuracy of finger
tracking is required, unlike systems such as gesture in-
put systems, which often require only rough recognized
results of hand poses or fingers.
To resolve these problems, in this paper, we propose
a method of estimating the position and rotation of a
finger using the contour and interior information of the
finger in template matching, and tracking the finger in
an image sequence.
The organization of this paper is as follows: In Sec-
tion 2, we describe related research. In Section 3, we
describe the proposed method. In Section 4, we de-
scribe an experiment conducted to evaluate the effec-
tiveness of the proposed method and show the results
of the experiment. In Section 5, we conclude this paper.
2. Related research
In this section, we describe research related to our
work.
There have been many studies on the detection of
hands or fingers in images5)7)9)10)13)14)17). In these stud-
ies, the image features that are mainly used are edges
and color. In5)10)13), only color information was used in
the detection of hands or fingers. The disadvantages
of using only color information are that false detection
may occur in regions having a color close to skin color,
and the method is not robust to illumination changes.
On the other hand, edges are perhaps the most low-
level image features, so it is relatively easy to extract
them in images. Also, one of the advantages of using
edges is that they are robust to a narrow range of il-
lumination changes. In7)9)14)17), edges were used in the
detection of hands or fingers. Stenger et al. made a like-
lihood function using edge and color information and
detected various hand poses in an image sequence by
combining hierarchical template matching and Bayesian
filtering9). Do et al. extracted circular features in edge
images by using the Hough transform and tracked fin-
gers with these features in an image sequence by using
a particle filter14). Athitsos et al. estimated a hand pose
by retrieving the image closest to an input image from
a large database of synthetic hand images by using edge
and line matching7). Oikonomidis et al. simultaneously
tracked a hand and a simple-shaped object manipulated
with the hand in image sequences captured by a multi-
camera system by using edge and color information17).
The common idea in7)9)17) is to generate a parametric
model (template) of hand poses or objects and to esti-
mate parameters to make the best match between the
model and an input image.
As in17), this idea can also be applied to the detection
of general objects in images, and there have been many
studies taking this approach1) 3)6)16). In the detection of
general objects in an image, Steger described a way of
making robust similarity measures against various dis-
turbances and showed that it is effective to use edges
in making similarity measures6). The chamfer distance
used by Borgefors1), in particular, makes smooth edge
distance functions, i.e., similarity measures. Smooth
functions are suitable for iterative optimization algo-
rithms for minimizing an n-dimensional function, such
as Newton’s method, because these functions give the
gradient at each point on domains of functions. There-
fore, an initial search point can be made to converge
toward an optimal solution in the domain space with a
light computational load. In7)9)17), similarity measures
were made using the chamfer distance, and results in
detecting objects showed the effectiveness of this ap-
proach.
Edges are useful in various object detection problems;
however, in the case of finger detection, false detection
often occurs in images with complex backgrounds. This
is because images with complex backgrounds include
many regions similar to the form of a finger itself be-
cause a finger has a simpler form than the whole hand
or general objects.
Also, the non-sequence-based search in7) faces the
problem that the detection processing time is very long.
In9), a sequence-based search, which uses information
between images generated sequentially, was adopted,
but the method requires a heavy computational load
for estimating hand poses in an image sequence, making
the method unsuitable for integration into mobile de-
vices having a small amount of computational resources.
As mentioned above, the problems of lack of robust-
ness in detecting a finger in images with complex back-
grounds, and the long processing time in the detection
of general objects (including fingers) remain unsolved.
On the other hand, in12)18), we have proposed a new
input interface system which captures images including
the user’s finger by the camera attached to a mobile
device, and that recognizes the moving finger in the
air by image processing. In12), since the finger is rec-
227
Paper » Robust Finger Tracking for Gesture Control of Mobile Devices Using Contour and Interior Information of a Finger
ognized by using color information, the system often
fails to recognize the finger in front of the user’s face.
In18), the finger region is extracted by using IR LEDs
and an IR filter, and the finger pose is estimated. In
this system, by turning on and off the IR LEDs in syn-
chronization with capturing images and taking the dif-
ference between two successive images, the system can
detect only the object near the camera, i.e., the finger.
This makes it possible to estimate the finger pose with
a little disturbance of backgrounds. However, when the
finger moves away from the camera, the camera cannot
capture the finger, and detection of the finger becomes
difficult.
In our approach, the finger is detected by using edges
from input images. In addition, the information of the
presence or absence of edges inside the finger is also
used in matching, and it enables robust estimation of
the finger pose against complex backgrounds.
3. Estimation and tracking of finger posi-
tion and rotation
In this section, we introduce the proposed method,
which estimates the position and rotation of a finger
using the contour and interior information of the finger
in template matching, and tracks the finger in an image
sequence.
3. 1 Outline of the proposed method
The proposed method is a kind of template matching
based on the idea of the chamfer distance. A template
of a finger is prepared as a binary image.
Our method estimates four parameters for recogniz-
ing a finger: translations in two directions in a plane
perpendicular to the optical axis of a monocular camera
which captures a sequence of images containing the fin-
ger, a scale change along the optical axis, and a rotation
around the optical axis. In general, it is necessary to
take the rotation of a finger in three-dimensional space
into account so as to estimate the position and rotation
of the finger moving in the space, and therefore, affine
matching is often used. In our method, on the other
hand, since the template is a simple binary image, we
restrict the motions given to our input interface system
to simple motions, e.g., a rotation around only one axis
and not a bent finger, so that the above four parameters
are sufficient for estimating the position and rotation of
the finger.
The estimated parameters are denoted as follows:
p = (tx, ty, s, θ)T , (1)
where tx and ty are the horizontal and vertical trans-
lations of the finger, respectively, and s and θ are the
scale change and the rotation of the finger, respectively.
To solve the problem of non-robustness in images
with complex backgrounds mentioned in Section 2, our
method uses edges extracted from input images and in-
formation about the presence or absence of edges in
regions in the input images corresponding to the inside
of a finger in the template when the template is super-
imposed on the input images. This has the advantage of
fewer false matches. In general, there are few edges in
the region inside a finger. Hence, when the template is
superimposed on input images, if there are many edges
in regions in the input images corresponding to the in-
side of the finger in the template, these regions are rec-
ognized as not being a finger in our method. Therefore,
it is possible to estimate the position and rotation of a
finger exactly in images with complex backgrounds.
In our method, the similarity measure is an edge
distance function, which is smooth, so that an initial
search point converges toward an optimal solution with
a low computational load by using a suitable iterative
optimization algorithm.
Also, our system is intended to be used as a pointing
device, which requires high-speed response and high-
accuracy finger tracking. Our method is a sequence-
based search whose advantage is that the region used
for estimating the position and rotation of a finger is
limited to the region around the finger. Therefore, we
can set an initial search point near the optimal solution
in the finger parameter space, which makes it possible
to converge the point toward the optimal solution with
a small number of iterations and a simple computation.
As a result, we can achieve high-speed response and
high-accuracy finger tracking.
Accordingly, the proposed method achieves robust-
ness in complex backgrounds and high-speed process-
ing in finger recognition, which are necessary features
for integrating the method into mobile devices. More-
over, our method also achieves high-speed response and
high-accuracy finger tracking, which are necessary fea-
tures for pointing devices.
The flow of the proposed method is as follows:
Step 1. Set an initial search point in the finger pa-
rameter space for an input image captured by a
monocular camera.
Step 2. Focus on a region of interest (ROI) in the
input image.
Step 3. Extract edges of the ROI, and make a dis-
ITE Trans. on MTA Vol. 1, No. 3 (2013)
228
tance map of the edge image.
Step 4. Estimate parameters for the position and
rotation of a finger by template matching.
Step 5. Get the next input image, and return to
Step 1.
The details of each step are described in the follow-
ing subsections. The most remarkable feature of our
method is the use of information about the presence or
absence of edges in regions in the input images corre-
sponding to the inside of a finger in the template when
the template is superimposed on the input images in
Step 4. Details of this are described in Subsections 3. 5
and 3. 6.
3. 2 The initial search point for estimating
parameters
An image captured by the camera at the current time
is denoted as frame k, and the captured image at the
previous time is frame k − 1. Let pstk and p̂k−1 be an
initial search point for estimating parameters for the
position and rotation of the finger in frame k and a pa-
rameter estimation point in frame k − 1, respectively.
In our method, we have
pstk = p̂k−1. (2)
Excessively fast input motions are not allowed in our
input interface system, so it is possible to assume that
the difference between frames is small. Therefore, the
initial search point given by Eq. (2) is close to a param-
eter estimation point in frame k in the parameter space.
This condition leads to an estimation result with high
accuracy and reduced search time.
3. 3 ROI
Elements of pstk are denoted as follows:
pstk = (txstk , ty
stk , s
stk , θ
stk )
T . (3)
We focus on a ROI that is a rectangular region whose
center is the pixel position (txstk , ty
stk ) in frame k. The
height and width of the ROI are
ROIH = 2TH · sstk and ROIW = 2TW · sstk , (4)
respectively, where TH and TW are the height and width
of the template, respectively. Eq. (4) means that the
size of the ROI depends on that of the finger.
3. 4 Edge extraction and distance map
Since the performance of the proposed method is de-
pendent on the accuracy of edge extraction, we use color
information of captured images to extract edges from
them. In brief, we apply a Sobel filter to each color
component (RGB) of the ROI and compute gradient
Edges from grayscale image Edges from color image
Fig. 2 Comparison between edges from gray-scale and
color images.
Edge image Distance map
Fig. 3 Distance map corresponding to edge image.
intensities of pixels in each color component; in other
words, pixels in the ROI have three (RGB) gradient in-
tensities. For each pixel in the ROI, we select the max-
imum from three values as the gradient intensity of the
pixel. Then, edges of the ROI are obtained by thresh-
olding the gradient intensities of pixels in the ROI4).
As shown in Fig. 2, this gives edges with good accu-
racy compared with the conventional way that extracts
edges using the pixel brightness.
Next, we make the distance map for the edge image of
the ROI. In the edge image, the distance to the closest
edge pixel, called the chamfer distance, is calculated at
each non-edge pixel. The distance is zero at each edge
pixel. The distance map is a map generated by giving
the chamfer distance for each pixel in the edge image1).
In the distance map, the distance increases with the dis-
tance from the edge pixels. Fig. 3 shows an example of
the distance map corresponding to an edge image. The
distance map is used to make the similarity measure.
Using the chamfer distance results in a smooth distance
function that is suitable for iterative optimization algo-
rithms for minimizing an n-dimensional function.
3. 5 Template matching using the contour
and interior information of a finger
Estimation of the position and rotation of a finger
is achieved by template matching with a binary tem-
plate image of a finger. In our method, we take edges
in the inside of a finger into account. Though there are
variations among individuals, in general, there are few
edges in the inside of a finger, as shown in Fig. 4. We
229
Paper » Robust Finger Tracking for Gesture Control of Mobile Devices Using Contour and Interior Information of a Finger
Input image Edge image
Fig. 4 The interiors of fingers, showing lack of edges.
Template image (binary)
Contour points C
Interior points IN
Fig. 5 Template image.
build this condition into the similarity measure to esti-
mate the position and rotation of a finger, which makes
it possible to recognize a finger exactly in images with
complex backgrounds.
3. 6 Optimization by gradient method
We make the similarity measure for estimating the
position and rotation of a finger as follows: First, as
shown in Fig. 5, we register the positions of the con-
tour and interior points of a finger in the template. Let
C and IN be the set of all contour points and the set
of all interior points, respectively.
Next, we compose the following similarity measure,
E, as a real-valued function depending on p:
E(p) = E1(p) + E2(p) + E3(p) (5)
where
E1(p) =∑x∈C
{ID (W (x, p))}2 , (6)
E2(p) =∑x∈IN
λx,p, (7)
E3(p) = h · ty. (8)
Here
W is a warp function
W (x, p) = sR(θ)x+ (tx, ty)T , (9)
where R(θ) is the rotation matrix depending on θ;
ID is a distance map;
λx,p is a positive constant at an edge point and
zero at a non-edge point in ID, i.e.,
λx,p
{= 0 (ID (W (x, p)) |= 0)
> 0 (ID (W (x, p)) = 0); (10)
and h is a positive correction weight toward the
top of the finger.
We estimate p that minimizes E. Eq. (6) is the term
for estimating the position and rotation of the finger.
The value of this term is small when new positions
of contour points of a finger in the template trans-
formed by the warp function are close to positions of
edge points in the edged ROI. Eq. (7) is a penalty term
for the finger interior. The value of this term is large
when positions in the edged ROI corresponding to new
positions of interior points of a finger in the template
transformed by the warp function are the positions of
edge points. In general, since there are few edges in
the inside of a finger, if there are many edges in the
corresponding region, that region is recognized as not
being a finger. Therefore, it is possible to estimate the
position and rotation of a finger exactly in images with
complex backgrounds. Eq. (8) is a term for prevent-
ing a finger from being recognized in the region below
the first knuckle joint of the finger. By increasing this
value, a finger is more recognizable in the region above
the first knuckle joint of the finger.
The similarity measure given by Eq. (5) is a smooth
function because it is made using the chamfer distance,
which makes it suitable for iterative optimization al-
gorithms. Fig. 6 shows an example of our similarity
measure. The graph in Fig. 6 shows the variation of E
depending on tx of p giving an estimation solution, with
the other parameters (ty, s, and θ) fixed. In the graph,
0
20000
40000
60000
80000
100000
120000
20 30 40 50 60 70 80 90 100
ener
gy
tx
Fig. 6 An example of our similarity measure.
ITE Trans. on MTA Vol. 1, No. 3 (2013)
230
there are few ups and downs around the optimal solu-
tion. Hence, by choosing a suitable initial parameter
search point, the point converges toward the optimal
solution with a small number of iterations, which leads
to reduced search time.
The similarity measure is minimized by using a gra-
dient method. In the gradient method, the rule for
updating parameters is as follows:
p← p− α(Δtx,Δty, ωsΔs, ωθΔθ)T (11)
where
Δtx =∂E(p)
∂tx, Δty =
∂E(p)
∂ty, (12)
Δs =∂E(p)
∂s, Δθ =
∂E(p)
∂θ. (13)
Here, ωs and ωθ are positive update weights for s and
θ, respectively, and α is a positive ratio of parameter
update sizes. The solution is sensitive to scale and ro-
tation changes, so it is necessary to adjust the amounts
of these changes by weighting them.
3. 7 Initial detection and evaluating suc-
cess/failure of finger tracking
We give an initial position and rotation of a finger by
holding the finger over a specific region in the captured
images. This is shown in Fig. 7. The left image in
Fig. 7 is an image used for detecting an initial finger.
Then, a ROI is centered in the image, and the template
image is displayed in the ROI with a suitable size and
color (red in the example shown). An initial finger is
detected by fitting the finger to the red template. For
each frame, the matching score is constantly computed
by estimating finger parameters with a local random
search in the ROI. The condition for the initial finger
detection is as follows: Let
pst1 = ( 12IW , 12IH , 1, 0)T , (14)
and if
Fig. 7 The initial detection of a finger.
∃p ∈ N(pst1 ) s.t. E1(p) + E2(p) <= TS , (15)
then an initial finger is detected, and tracking of the
finger starts with p, where IH and IW are the height
and width of the captured images, respectively, N(pst1 )
is the neighborhood of pst1 , and TS is a positive thresh-
old for detecting an initial finger. When a finger enters
the ROI, the computed matching scores are so small as
to satisfy Eq. (15).
Also, if tracking failure is detected in a certain num-
ber of sequential frames, then the tracking is judged to
have failed, and the method returns to the initial finger
detection. The condition for evaluating the success or
failure of finger tracking is as follows: If, for a frame,
E1(p) + E2(p) >= TF , (16)
then the tracking is considered to have failed in that
frame, where TF is a positive threshold for evaluating
the tracking.
4. Performance evaluations
In this section, we describe an experiment conducted
to evaluate the effectiveness of the proposed method,
and we show the results of the experiment. The effec-
tiveness of the proposed method was tested for several
pseudo-input motions in assumed situations by using
the experimental system shown in Fig. 1. In the exper-
iment, we handled our system as if it were operating
on an actual mobile device. Two pseudo-input motions
with specific backgrounds were captured by the monoc-
ular camera several times for each motion. Then, we
applied the proposed method to image sequences of the
two kinds of motion.
4. 1 Experimental settings
In the experiment, we used the backgrounds, input
motions, camera, lens, and PC shown in Table 1. We
considered combinations of backgrounds and input mo-
tions, i.e., a total of 6 situations. For each of the situa-
tions of input motions on clothes, 15 videos were made,
and for each of the situations of input motions on face
Table 1 Experimental settings.
Background Clothes, Face, Light
Input motion Slide, Push
Camera Firefly MV from Point Grey Research Inc.
Frame size: 640 × 480 pixels
Frame rate: 60 fps
Lens Focal distance: 4 mm
PC Optiplex390 from Dell Inc.
CPU: Intel(R) Core(TM) i3-2120
(3 M Cache, 3.30 GHz)
RAM: 3.00 GB
OS: 32 bit Windows 7 Professional
231
Paper » Robust Finger Tracking for Gesture Control of Mobile Devices Using Contour and Interior Information of a Finger
Frame 60 Frame 80 Frame 100
Frame 60 Frame 80 Frame 100
On clothes
On face
Frame 60 Frame 80 Frame 100
On light
Frame 60 Frame 80 Frame 100
Fig. 8 Snapshots of finger tracking (the number of iterations in each frame was 10).
and light, 5 videos were made, i.e., a total of 50 videos.
The subjects of the experiment were 3 persons. We ap-
plied the proposed method to each video and evaluated
the success rate and accuracy of finger tracking. The
length of each video was 2 seconds, i.e., 120 frames.
4. 2 Results
( 1 ) Robustness
Fig. 8 shows snapshots of the tracking of a finger in
images containing clothes, a face, and a light. In the fig-
ure, the monochrome regions are edged ROIs, and red
lines in these regions are transformed templates match-
ing a finger in frames. The number of iterations in the
gradient method for estimating finger parameters was
10 in each frame. Finger tracking was successful in each
of the backgrounds shown in the figure. Table 2 shows
the success rates of finger tracking. Success ratios were
ITE Trans. on MTA Vol. 1, No. 3 (2013)
232
Table 2 Finger tracking success rates (%).
Number of Success rate
iterations No-inside Proposed
5 42 84
10 50 90
15 50 96
20 52 96
Table 3 The average processing time per frame
(ms/frame).
ProcessingNumber of iterations
5 10 15 20
Edge extraction 3.71 3.91 4.11 4.19
Distance map 0.50 0.51 0.53 0.54
Matching 0.96 1.90 2.83 3.76
Total 5.17 6.32 7.47 8.49
computed for tracking results with 5, 10, 15, and 20
iterations in the gradient method. In the table, values
in the row labeled “Proposed” are the success rates of
finger tracking using the proposed method. We also
computed success rates for finger tracking without us-
ing information of the interior of a finger, i.e., the case
where E2(p) = 0 in Eq. (5). These are the values in the
row labeled “No-inside”. Tracking was judged to have
failed when the red template obviously came off the fin-
ger in subjective visual evaluation. The results in the
table show that our method successfully tracked a finger
in almost all of the test videos with 10 or more itera-
tions. Also, the results show the effectiveness of using
the information of the finger interior in finger tracking
in images with complex backgrounds.
( 2 ) High-speed processing
Table 3 shows the processing time per frame when
our method was applied to test videos. The processing
time includes the time for extracting edges, the time
for making a distance map, and the time for estimating
finger parameters (Matching). In the case of 10 itera-
tions, which is the smallest number of iterations that
gave good performance in finger tracking, the average
processing time was 6.32 [ms/frame] in our method. Al-
though parameter estimation (Matching) often requires
a long time compared with the other processing steps,
the results in Table 3 demonstrate that we were able
to greatly reduce this processing time. The majority
of the processing time was the edge extraction time. If
this could be reduced, it would be possible to speed up
our method even more in the future. The performance
of mobile devices is progressing dramatically, and our
system is promising for integrating into mobile devices.
( 3 ) Tracking accuracy
Fig. 9 shows an example of a comparison of track-
ing paths obtained by an exhaustive search and by
Table 4 Root mean square (RMS) error.
ParameterNumber of iterations
5 10 15 20
tx 1.65 0.97 0.72 0.66
ty 9.65 6.43 4.29 3.64
s 0.21 0.15 0.11 0.09
θ 2.75 1.73 1.53 1.41
our method. In the figure, paths labeled “ES” are
those obtained by the exhaustive search, and paths la-
beled “PM–n times” are those obtained by the proposed
method with n iterations of the gradient method. In
the exhaustive search, we searched for finger parame-
ters everywhere in the neighborhood of an initial search
point in each frame. Parameter step-lengths in the ex-
haustive search were 0.1 pixels for translations, 0.01 for
scale change, and 0.1 degrees for rotation. The exhaus-
tive search successfully tracked a finger in all of the
test videos. We regarded the result obtained by the ex-
haustive search as a correct finger tracking result. The
figure shows that the paths for translations (tx and ty)
obtained by our method with 10 iterations were sim-
ilar to those obtained by the exhaustive search. The
paths for scale change (s) obtained by our method were
very different in shape from those obtained by the ex-
haustive search, but the absolute values differed only
slightly. Fig. 10 shows frame 60, which corresponds
to the positions of the dashed lines in Fig. 9, in track-
ing videos using the exhaustive search and our method
with 10 iterations. The figure shows that there is a
slight difference between the exhaustive search and our
method, but that the template almost fits the finger in
our method.
Table 4 shows the accuracy of the finger tracking,
given by the root mean square (RMS) error between
the tracking paths obtained by the exhaustive search
and our method. The RMS error is computed as fol-
lows:
RMS =
√1N
∑Ni=1(p̂i,ES − p̂i,PM)2, (17)
where p̂i,ES and p̂i,PM are parameter solutions esti-
mated by the exhaustive search and the proposed
method in frame i, respectively, and N is the total num-
ber of frames. The RMS error was computed for each
parameter, i.e., translations, scale change, and rotation,
in each test video. Thus, four RMS values are obtained
for each test video. We averaged all RMS values ob-
tained from all test videos for each parameter. The
table shows that the finger tracking accuracy improved
as the number of iterations increased. The accuracy for
vertical translation was low compared with that of the
233
Paper » Robust Finger Tracking for Gesture Control of Mobile Devices Using Contour and Interior Information of a Finger
Fig. 9 Comparison of tracking paths for tx, ty, s and θ.
other parameters. The finger template tends to shake
up and down during finger tracking because the finger
shape is simple. It is difficult to prevent this behav-
ior. However, the tracking itself did not fail, and the
Exhaustive search Proposed method
Fig. 10 Comparison between frame 60 in tracking
videos using the exhaustive search and our
method with 10 iterations, from which the
graphs in Fig. 9 are obtained.
accuracy of the vertical translation was acceptable.
The above results showed that our method achieved
finger tracking with good accuracy and short processing
time.
4. 3 Comparison with conventional methods
We compare our method with others described in7)9)
and17), as conventional methods of the hand pose de-
tection, and describe the superiority of our method to
them.
These conventional methods have a common feature
that they use the chamfer distance to calculate the simi-
larity measure for estimating hand poses as our method
does, but do not use information of the finger interior.
As seen in the experimental result in Subsection 4. 2,
finger tracking becomes unstable without information
of the finger interior.
These conventional methods make a model of the
whole hand in the high dimensional parameter space,
and then estimate parameters that match best with
an input image. Accurate estimation is not always re-
quired in these conventional methods. In7), a hand pose
is estimated by retrieving the image closest to an input
image from a large database of synthetic hand images
by using both the chamfer distance and line matching
cost. The hand parameters corresponding to the re-
trieved image are discrete and low resolution. In9), a
likelihood function is made of the chamfer distance and
color information, and then various hand poses are de-
tected in an image sequence by combining hierarchical
template matching and Bayesian filtering. Since a hi-
erarchy of templates is constructed by partitioning the
parameter space, the solution is also discrete. In17),
the similarity measure is also made of the chamfer dis-
tance and color information, and then hand pose pa-
rameters are estimated by using particle swarm opti-
mization (PSO). The accuracy of estimation solutions
in7) and9) is not sufficient because the estimated param-
ITE Trans. on MTA Vol. 1, No. 3 (2013)
234
eters are discrete and low resolution. The accuracy in17)
is bad when using real image data.
Our method is developed specific to a pointing de-
vice, and the aim is different from those of conventional
methods above. In a pointing device, high accuracy is
needed in estimating the pose of a finger. Our tracking-
based method achieves the estimation of finger parame-
ters with higher-accuracy compared to the conventional
methods.
5. Conclusion
In this paper, we have proposed a method for robust
tracking of a moving finger in an image sequence. The
method is suitable for implementing in our input inter-
face system, which recognizes a moving finger in the air
by processing images captured by a monocular camera.
The proposed method extracts edges from input images,
then estimates the position and rotation of a finger in
input images by matching points in a template to edges,
and tracks the finger.
By using information about the presence or absence
of edges in regions in input images corresponding to the
inside of a finger in the template when the template
is superimposed on the input images, it is possible to
estimate the position and rotation of a finger exactly
in images with complex backgrounds. The similarity
measure is an edge distance function using the cham-
fer distance; in other words, it is a smooth function,
so that iterative optimization algorithms can be used
to minimize an n-dimensional function. Moreover, the
proposed method is a sequence-based search whose ad-
vantage is that the region used to estimate the position
and rotation of a finger is limited to the region around
the finger. Therefore, we can set an initial search point
near the optimal solution in the finger parameter space,
and hence, it is possible to converge the point toward
the optimal solution with a small number of iterations
and a simple computation. As a result, our method can
achieve finger tracking with high-speed response and
high accuracy.
The effectiveness of the proposed method was tested
in several situations by using an experimental system
including a PC. By applying our method to test videos
including pseudo-input motions with complex back-
grounds, our method successfully tracked a finger in
all test videos with an average processing time of 6.32
[ms/frame], and the finger was tracked with good accu-
racy.
Accordingly, the proposed method showed robust-
ness with complex backgrounds and achieved high-
speed processing in finger recognition. Moreover, our
method also achieved finger tracking with high-speed
response and high accuracy, which are necessary fea-
tures for pointing devices.
In addition, we greatly reduced the parameter esti-
mation time. However, there is still room for reducing
the whole processing time. Also, the performance of
mobile devices is progressing dramatically year by year,
and our system is promising for integrating into mobile
devices.
Acknowledgment
This work is joint research with Exvision Inc.
References
1) G. Borgefors : “Hierarchical Chamfer Matching: A Paramet-
ric Edge Matching Algorithm” , IEEE Transactions on Pattern
Analysis and Machine Intelligence, 10, 6, pp. 849–865 (1988)
2) D. G. Lowe : “Fitting Parameterized Three-Dimensional Models
to Images” , IEEE Transactions on Pattern Analysis and Machine
Intelligence, 13, 5, pp. 441–450 (1991)
3) D. M. Gavrila, V. Philomin : “Real-Time Object Detection for
“Smart” Vehicles” , The Proceedings of the Seventh IEEE Inter-
national Conference on Computer Vision, 1, pp. 87–93 (1999)
4) S. Zhu, K. N. Plataniotis, A. N. Venetsanopoulos : “Compre-
hensive Analysis of Edge Detection in Color Image Processing” ,
Optical Engineering, 38, 4, pp. 612–625 (1999)
5) L. Bretzner, I. Laptev, T. Lindeberg : “Hand Gesture Recog-
nition using Multi-Scale Colour Features, Hierarchical Models
and Particle Filtering” , Proceedings of Fifth IEEE Interna-
tional Conference on Automatic Face and Gesture Recognition,
pp. 423–428 (2002)
6) C. Steger : “Occlusion, Clutter, and Illumination Invariant Ob-
ject Recognition” , International Archives of Photogrammetry,
Remote Sensing, and Spatial Information Sciences, XXXIV, 3A,
pp. 345–350 (2002)
7) V. Athitsos, S. Sclaroff : “Estimating 3D Hand Pose from a
Cluttered Image” , Proceedings of 2003 IEEE Computer Soci-
ety Conference on Computer Vision and Pattern Recognition, 2,
pp. 432–439 (2003)
8) H. Roeber, J. Bacus, C. Tomasi : “Typing in Thin Air: The
Canesta Projection Keyboard – A New Method of Interaction
with Electronic Devices” , CHI’03 Extended Abstracts on Hu-
man Factors in Computing Systems, pp. 712–713 (2003)
9) B. Stenger, A. Thayananthan, P. H. S. Torr, R. Cipolla : “Model-
Based Hand Tracking Using a Hierarchical Bayesian Filter” ,
IEEE Transactions on Pattern Analysis and Machine Intelligence,
28, 9, pp. 1372–1384 (2006)
10) O. Gallo, S. M. Arteaga, J. E. Davis : “A Camera-Based Pointing
Interface for Mobile Devices” , Proceedings of the International
Conference of Image Processing, pp. 1420–1423 (2008)
11) P. Mistry, P. Maes : “SixthSense: A Wearable Gestural Inter-
face” , ACM SIGGRAPH ASIA 2009 Sketches Article, 11 (2009)
12) T. Niikura, Y. Hirobe, A. Cassinelli, Y. Watanabe, T. Komuro,
M. Ishikawa : “In-Air Typing Interface for Mobile Devices with
Vibration Feedback” , ACM SIGGRAPH 2010 Emerging Tech-
nologies Article, 15 (2010)
13) J. Ahn, J. Kim : “A Stable Hand Tracking Method by Skin
Color Blob Matching” , Pacific Science Review, 12, 2, pp. 146–
151 (2011)
14) M. Do, T. Asfour, R. Dillmann : “Particle Filter-Based Fingertip
Tracking with Circular Hough Transform Features” , The 12th
IAPR Conference on Machine Vision Applications, pp. 471–474
(2011)
15) C. Harrison, H. Benko, A. D. Wilson : “OmniTouch: Wearable
Multitouch Interaction Everywhere” , Proceedings of the 24th
Annual ACM Symposium on User Interface Software and Tech-
nology, pp. 441–450 (2011)
235
Paper » Robust Finger Tracking for Gesture Control of Mobile Devices Using Contour and Interior Information of a Finger
16) S. Hinterstoisser, S. Holzer, C. Cagniart, S. Ilic, K. Kono-
lige, N. Navab, V. Lepetit : “Multimodal Templates for Real-
Time Detection of Texture-less Objects in Heavily Cluttered
Scenes” , 2011 IEEE International Conference on Computer Vi-
sion, pp. 858–865 (2011)
17) I. Oikonomidis, N. Kyriazis, A. A. Argyros : “Full DOF tracking
of a hand interacting with an object by modeling occlusions and
physical constraints” , 2011 IEEE International Conference on
Computer Vision, pp. 2088–2095 (2011)
18) T. Niikura, Y. Watanabe, T. Komuro, M. Ishikawa : “In-air Typ-
ing Interface: Realizing 3D operation for mobile devices” , 2012
IEEE 1st Global Conference on Consumer Electronics, pp. 228–
232 (2012)
Masakazu Higuchi Masakazu Higuchi re-ceived the PhD degree in Science from the Ni-igata university in 2005. He is currently a re-searcher of graduate school of science and engineer-ing at Saitama University. His research interestsinclude 3D user interfaces and theory of multicrite-ria games.
Takashi Komuro Takashi Komuro receivedthe B.E., M.E., and Ph.D. degrees in mathemat-ical engineering and information physics from theUniversity of Tokyo in 1996, 1998, and 2001, re-spectively. At present he is an associate profes-sor of mathematics, electronics and informatics atSaitama University. His current research interestsinclude image sensing and 3D user interfaces.
ITE Trans. on MTA Vol. 1, No. 3 (2013)
236