Image Processing & Communication, vol. 18, no. 1, pp. 31-36, DOI: 10.2478/v10248-012-0088-x
OBJECT RECOGNITION ALGORITHM FOR MOBILE DEVICES
RAFAŁ KOZIK ADAM MARCHEWKA
Institute of Telecommunications, University of Technology & Life Sciences in Bydgoszcz, ul. Kaliskiego 7, 85-789 Bydgoszcz, Poland
Abstract. In this paper an object recognition algorithm for mobile devices is presented. The algorithm is based on the hierarchical approach to visual information coding proposed by Riesenhuber and Poggio [1] and later extended by Serre et al. [2]. The proposed method adapts an efficient algorithm to extract information about local gradients, which allows it to approximate the behaviour of the simple-cells layer of the Riesenhuber and Poggio model. Moreover, it is demonstrated that the proposed algorithm can be successfully deployed on a low-cost Android smartphone.
1 Introduction
This paper is a continuation of the authors' recent research on object detection algorithms dedicated to mobile devices. Tentative results were presented in [4]. In contrast to that previous work, modifications to the overall method architecture are proposed, and the experiments described in this paper have been conducted on an extended database of images.
The proposed approach follows the idea of the HMAX model proposed by Riesenhuber and Poggio [1] and adopts its hierarchical feed-forward processing. However, modifications are introduced in order to reduce the computational complexity of the original algorithm.
Moreover, in contrast to the Mutch and Lowe [10] model, an additional layer that mimics the "retina coding" mechanism is added. The results showed that this step increases the robustness of the proposed method. However, it is considered an optional step of visual information processing, because it introduces an additional computational burden that can increase the response time (the frame rate drops significantly) on low-cost mobile devices.
In contrast to the original HMAX model, a different method for calculating the S1 layer response is proposed. The responses of four Gabor filters are approximated with an algorithm that mimics the behaviour of HOG descriptors. HOG [3] descriptors are commonly used for object recognition problems: the image is decomposed into small cells, and in each cell a histogram of oriented gradients is computed. Typically the result is normalised and a descriptor is returned for each cell. However, in contrast to the original HOG descriptor proposed by Dalal and Triggs, in this method (in order to simplify the computations) the blocks of cells do not overlap and the input image is not normalised.
Moreover, in previous work [7] the S2 and C2 layers were replaced with a machine-learned classifier. Such an approach significantly simplifies the learning process and decreases the amount of computation. However, it impacts the invariance to shape changes of a recognised object, and the feature vectors presented to the classifier are of relatively greater dimensionality.
2 The HMAX model overview
The HMAX Visual Cortex model proposed by Riesenhuber and Poggio [1] exploits a hierarchical structure for image processing and coding. It is arranged in several layers that process information in a bottom-up manner. The lowest layer is fed with a grayscale image. The higher layers of the model are called either "S" or "C". These names correspond to the simple (S) and complex (C) cells discovered by Hubel and Wiesel [9]. Both types of cells are located in the striate cortex (called V1), the part of the visual cortex that lies in the most posterior area of the occipital lobe.
The simple cells located in "S" layers apply local filters whose responses form a vector of texture features. As noted by Hubel and Wiesel, individual cells in the cortex respond to the presence of edges. They also discovered that these cells are sensitive to edge orientation (some cells fire only when a given orientation of an edge is observed). The complex cells located in "C" layers calculate, within a limited range of the previous layer, the strongest responses of a given type (orientation). That way, more complex combinations of simple features are obtained.
The hierarchical HMAX model proposed by Mutch and Lowe [10] is a modification of the model presented by Serre et al. in [2]. It introduces two layers of simple cells (S1 and S2) and two layers of complex cells (see Fig. 1), and it uses a set of filters designed to emulate V1 simple cells.

Fig. 1: The structure of the hierarchical model proposed by Mutch and Lowe [10]. (Symbols: I - image, S - simple cells, C - complex cells, GF - Gabor filters, F - prototype feature vectors, ⊗ - convolution operation)

The complex layers are computed using a hard max filter [11]: the "C" cell responses are the maximum values of the associated "S" cells. As shown in Fig. 1, the images are processed by the subsequent simple and complex cell layers and reduced to feature vectors, which are further used in the classification process. The set of features (F) is shared across all images and object categories. Features are computed hierarchically in subsequent layers, each built from the previous one by alternating template matching and max-pooling operations.
The S1 layer in the Mutch and Lowe [10] model adopts 2D Gabor filters computed for four orientations (horizontal, vertical, and two diagonals) at each possible position and scale. The Gabor filters are 11×11 in size and are described by:

G(x, y) = e^(-(X^2 + γY^2)/(2σ^2)) · cos(2πX/λ)    (1)

where X = x cos φ − y sin φ and Y = x sin φ + y cos φ; x, y ∈ [−5; 5], and φ ∈ [0; π]. The aspect ratio (γ), effective width (σ), and wavelength (λ) are set to 0.3, 4.5 and 5.6 respectively. The response of a patch of pixels X to a particular filter G is computed using formula (2):

R(X, G) = | (Σ_i X_i G_i) / sqrt(Σ_i X_i^2) |    (2)
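As an illustration, the filter bank of Eq. (1) and the response measure of Eq. (2) can be sketched in a few lines of numpy; the function names and the discrete grid construction are our own, not part of the model specification.

```python
import numpy as np

def gabor_filter(phi, size=11, gamma=0.3, sigma=4.5, lam=5.6):
    """Build one 11x11 Gabor filter for orientation phi (Eq. 1)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]   # x, y in [-5; 5]
    X = x * np.cos(phi) - y * np.sin(phi)
    Y = x * np.sin(phi) + y * np.cos(phi)
    return np.exp(-(X**2 + gamma * Y**2) / (2 * sigma**2)) \
        * np.cos(2 * np.pi * X / lam)

def patch_response(patch, G):
    """Absolute normalised response of a pixel patch to filter G (Eq. 2)."""
    denom = np.sqrt((patch**2).sum())
    return 0.0 if denom == 0 else float(abs((patch * G).sum() / denom))

# the four orientations: horizontal, two diagonals, vertical
filters = [gabor_filter(phi) for phi in (0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)]
```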
The complex cells located in the C1 layer pool associated units in the S1 layer. For each orientation, the S1 responses are convolved with a max filter that is 10×10 in size in the x, y dimensions (position) and 2 units deep in scale.
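A minimal sketch of this pooling step for a single orientation, assuming non-overlapping 10×10 spatial blocks and pairs of adjacent scales (the text does not specify the filter stride, so the block layout here is our simplification):

```python
import numpy as np

def c1_pool(s1_scales, pool=10):
    """Max-pool S1 maps for one orientation: `pool` x `pool` spatially,
    2 units deep in scale (adjacent scale pairs)."""
    pooled = []
    for a, b in zip(s1_scales, s1_scales[1:]):        # scale depth of 2
        h, w = min(a.shape[0], b.shape[0]), min(a.shape[1], b.shape[1])
        m = np.maximum(a[:h, :w], b[:h, :w])          # max across the scale pair
        H, W = (h // pool) * pool, (w // pool) * pool
        blocks = m[:H, :W].reshape(H // pool, pool, W // pool, pool)
        pooled.append(blocks.max(axis=(1, 3)))        # spatial max per block
    return pooled
```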
As shown in Fig. 1, the intermediate S2 layer is formed by convolving the C1 layer response with a set of intermediate-level features (depicted as F in Fig. 1). This set is established during the learning phase: for a given set of learning images, C1 responses are computed and the most meaningful features are selected using SVM weighting. Mutch and Lowe [10] suggested sub-sampling the C1 responses before feature selection; therefore, the authors select, at random positions and scales of the C1 layer, patches of size 4×4, 8×8, 12×12, and 16×16. The selected and weighted features compose so-called prototypes, which are used in the Mutch and Lowe model as filters whose responses create the S2 layer. The C2 layer composes a global feature vector whose elements correspond to the maximum responses to the given prototype patches. In order to identify a visual object on the basis of the feature vector, a classifier (e.g. an SVM) is trained.
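The C2 stage can be sketched as follows. The text does not give an explicit S2 response formula, so a normalised correlation between each C1 window and each prototype is used here as a stand-in response measure; all names are ours.

```python
import numpy as np

def c2_vector(c1_maps, prototypes):
    """C2: for each prototype patch, the maximum response over all
    positions and scales of the C1 maps."""
    c2 = []
    for p in prototypes:                      # p: one 2-D prototype patch
        ph, pw = p.shape
        best = 0.0
        for c1 in c1_maps:                    # one C1 map per scale
            H, W = c1.shape
            for i in range(H - ph + 1):
                for j in range(W - pw + 1):
                    win = c1[i:i + ph, j:j + pw]
                    denom = np.linalg.norm(win) * np.linalg.norm(p)
                    r = abs((win * p).sum() / denom) if denom else 0.0
                    best = max(best, r)       # global max pooling
        c2.append(best)
    return np.array(c2)
```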
3 The proposed method
In order to approximate the S1 layer behaviour of the HMAX model, an algorithm that mimics HOG descriptors is used. However, to make this step as simple as possible (due to the low computational power of mobile devices), only the gradient orientation binning scheme is adopted. In fact, the intention here is not to compute a HOG descriptor, but to approximate the responses of the four Gabor filters used in the original HMAX model.

In the S1 layer there are N × M × 4 simple cells arranged in a grid of N × M blocks. In each block there are 4 cells, and each cell is assigned a receptive field (the pixels inside the block). Each cell activates depending on a stimulus; there are four possible stimuli, namely vertical, horizontal, left-diagonal, and right-diagonal edges. As a result, the S1 layer output is four-dimensional (x, y, scale, and cell type). In order to compute the responses of all four cells inside a given block (receptive field), Algorithm 1 is applied.
Algorithm 1 Calculating the S1 response
Require: Grayscale image I of size W × H
Ensure: S1 layer of size N × M × 4
  Assign Gmin the low threshold for the gradient magnitude
  for each pixel (x, y) in image I do
    Compute the horizontal Gx and vertical Gy gradients using the Prewitt operator
    Compute the gradient magnitude |Gx,y| at point (x, y)
    n ← x · N/W; m ← y · M/H
    if |Gx,y| < Gmin then
      go to next pixel
    else
      active ← get_cell_type(Gx, Gy)
      S1[n, m, active] ← S1[n, m, active] + |Gx,y|
    end if
  end for
The algorithm computes the responses of all cells using only one iteration over the whole input image I. For each pixel at position (x, y), the vertical and horizontal gradients Gx and Gy are computed. Given the pixel position (x, y) and the gradient vector [Gx, Gy], the algorithm indicates the block position (n, m) and the type of cell (active) whose response has to be incremented by |Gx,y|.

In order to classify a given gradient vector [Gx, Gy] as horizontal, diagonal or vertical, the get_cell_type(·, ·) function uses a simple voting scheme. If the point (|Gx|, |Gy|) is located between the lines y = 0.3·x and y = 3.3·x, the vector is classified as a diagonal; if Gy is positive it is a right diagonal, otherwise a left diagonal. If the point (|Gx|, |Gy|) is located above the line y = 3.3·x, the gradient vector is classified as vertical, and as horizontal when it lies below the line y = 0.3·x.
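Algorithm 1 together with the voting scheme above can be sketched in Python as follows; the Prewitt stencils, the cell-index convention and the default threshold value are our assumptions.

```python
import numpy as np

# Cell indices (our convention): 0 = horizontal gradient, 1 = vertical,
# 2 = right diagonal, 3 = left diagonal
def get_cell_type(gx, gy):
    """Voting scheme from the text: the lines y = 0.3x and y = 3.3x
    split the (|Gx|, |Gy|) plane into three zones."""
    ax, ay = abs(gx), abs(gy)
    if ay < 0.3 * ax:
        return 0                      # below y = 0.3x: horizontal
    if ay > 3.3 * ax:
        return 1                      # above y = 3.3x: vertical
    return 2 if gy > 0 else 3         # between: right or left diagonal

def s1_layer(I, N, M, g_min=10.0):
    """Algorithm 1: one pass over a grayscale image I (H x W array)."""
    H, W = I.shape
    S1 = np.zeros((N, M, 4))
    for y in range(1, H - 1):
        for x in range(1, W - 1):
            # Prewitt gradients: column/row sums of the 3x3 neighbourhood
            gx = I[y-1:y+2, x+1].sum() - I[y-1:y+2, x-1].sum()
            gy = I[y+1, x-1:x+2].sum() - I[y-1, x-1:x+2].sum()
            mag = np.hypot(gx, gy)
            if mag < g_min:
                continue              # go to next pixel
            n = min(x * N // W, N - 1)
            m = min(y * M // H, M - 1)
            S1[n, m, get_cell_type(gx, gy)] += mag
    return S1
```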
The processing in the S1 layer is applied at three scales (the original image, and two images scaled by factors of 0.7 and 1.5). This is necessary to obtain the response of the C1 layer, which is computed by applying a max-pooling filter to the S1 layer. This pooling allows the algorithm to achieve scale invariance.
The output of the C1 layer cells is used to select features that are meaningful for a given object detection task. First, randomly selected images of an object are reduced to feature vectors by the consecutive layers of the proposed model. Afterwards, PCA is applied to calculate the eigenvectors, and each eigenvector is used as a single feature in the feature vector. The response of the S2 layer is obtained by computing the dot product of the eigenvectors with the C1 layer response.
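A minimal numpy sketch of this PCA step, assuming the C1 responses are flattened into row vectors; the function names are ours:

```python
import numpy as np

def fit_eigenvectors(c1_vectors, k):
    """PCA on the flattened C1 responses of the training images;
    returns the top-k eigenvectors as rows."""
    X = np.asarray(c1_vectors, dtype=float)
    X = X - X.mean(axis=0)                  # centre the data
    # eigenvectors of the covariance matrix via SVD (rows of Vt)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:k]

def s2_response(c1_vector, eigenvectors):
    """S2: dot product of a C1 response with each eigenvector."""
    return eigenvectors @ np.asarray(c1_vector, dtype=float)
```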
Finally, a classifier is trained to recognise an object using the S2 layer response. In this paper, different machine-learning algorithms are investigated.
4 Experiments and Results
In contrast to the previous work [4], the database of images has been extended and new experiments have been conducted, aimed at evaluating the effectiveness of different classifiers. In this work, tree-based and rule-based classification approaches have been compared, namely N (5, 10, 15) Random Trees, J48 and PART (previously used by the authors in [4, 5, 6]). The comparison is presented in Fig. 2.
For the experiments, the dataset was randomly split into training and testing samples (in the proportion 70% to 30% respectively). For the training samples, the C1 layer responses were calculated and eigenvectors were extracted. The eigenvectors were used to calculate the S2 layer responses, which were further used to train the classifier. During the testing phase, the eigenvectors were again used to extract the S2 layer responses for the testing samples. The whole procedure was repeated 10 times (the dataset was randomly reshuffled before each split) and the results were averaged.
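This evaluation protocol (ten random 70/30 splits with averaged detection and false-positive rates) can be sketched as follows. The classifier is passed in as a pair of hypothetical train/predict callbacks, since the actual Random Trees, J48 and PART implementations come from an external toolkit.

```python
import numpy as np

def evaluate(features, labels, train_fn, predict_fn,
             runs=10, train_frac=0.7, seed=0):
    """Average detection rate (TPR) and false-positive rate over
    `runs` random train/test splits (70/30 by default)."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    dr, fpr = [], []
    for _ in range(runs):
        idx = rng.permutation(n)              # reshuffle before each split
        cut = int(train_frac * n)
        tr, te = idx[:cut], idx[cut:]
        model = train_fn(features[tr], labels[tr])
        pred = predict_fn(model, features[te])
        pos = labels[te] == 1
        if pos.any():
            dr.append(float((pred[pos] == 1).mean()))
        if (~pos).any():
            fpr.append(float((pred[~pos] == 1).mean()))
    return float(np.mean(dr)), float(np.mean(fpr))
```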
The best results were obtained for 10 Random Trees: a detection rate of 95.6% with 2.3% false positives. It can be noticed that adding more trees does not improve the effectiveness.
An early prototype of the proposed algorithm was deployed on an Android device. The code was written in pure Java and tested on a Samsung Galaxy Ace, which is equipped with an 800 MHz CPU, 278 MB of RAM and the Android 2.3 operating system. The current version implements a brute-force Nearest Neighbour classifier, operates at only one scale (the PC version scans three scales: the original image and two images scaled by factors of 0.7 and 1.5), and achieves about 10 FPS when fewer than 10 training samples are provided. Some examples of object detection are shown in Fig. 3. During testing it was noticed that the algorithm is able to recognise an object on a cluttered background even if only a few learning samples are provided.
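The brute-force Nearest Neighbour step mentioned above reduces to a few lines. The Euclidean distance metric is our assumption, and the sketch is in Python rather than the Java used on the device.

```python
import numpy as np

def nn_classify(query, train_vectors, train_labels):
    """Brute-force 1-NN over S2 feature vectors: return the label of
    the training vector closest to the query (Euclidean distance)."""
    d = np.linalg.norm(train_vectors - query, axis=1)
    return train_labels[int(np.argmin(d))]
```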
Fig. 3: Examples of object detection (a mug and doors) with the proposed algorithm deployed on an Android device
Fig. 2: Effectiveness of the different classifiers (false positive rate vs. detection rate for 5, 10 and 15 Random Trees, J48 and PART, with exponential trend lines)

5 Conclusions

In this paper an algorithm for object recognition dedicated to mobile devices was presented. The algorithm is based on the hierarchical approach to visual information coding proposed by Riesenhuber and Poggio [1] and later extended by Serre et al. [2]. The experiments show that the introduced algorithm allows for efficient feature extraction and visual information coding. Moreover, it was shown that the proposed approach can be deployed on a low-cost mobile device. The results are promising and show that the algorithm described in this paper can be further investigated as one of the assistive solutions dedicated to visually impaired people.
References
[1] M. Riesenhuber, T. Poggio, Hierarchical models of
object recognition in cortex, Nat Neurosci, Vol. 2,
pp.1019–1025, 1999
[2] T. Serre, G. Kreiman, M. Kouh, C. Cadieu, U. Knoblich, T. Poggio, A quantitative theory of immediate visual recognition, In: Progress in Brain Research, Computational Neuroscience: Theoretical Insights into Brain Function, Vol. 165, pp. 33–56, 2007
[3] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), Vol. 1, pp. 886–893, June 2005
[4] R. Kozik, A Simplified Visual Cortex Model for
Efficient Image Codding and Object Recognition,
Image Processing and Communications Challenges
5, Advances in Intelligent Systems and Computing,
pp. 271–278, 2013
[5] R. Kozik, Rapid Threat Detection for Stereovision Mobility Aid System, In: T. Czachorski et al. (Eds.), Man-Machine Interactions 2, AISC 103, pp. 115–123, 2011
[6] R. Kozik, Stereovision system for visually impaired, In: R. Burduk et al. (Eds.), Computer Recognition Systems 4, Advances in Intelligent and Soft Computing 95, pp. 459–468, 2011
[7] R. Kozik, A Proposal of Biologically Inspired Hierarchical Approach to Object Recognition, Journal of Medical Informatics & Technologies, Vol. 22/2013, ISSN 1642-6037, pp. 171–176, 2013
[8] R. Kozik, A Simplified Visual Cortex Model for Efficient Image Codding and Object Recognition, In: R.S. Choras (Ed.), Image Processing and Communications Challenges 5, Advances in Intelligent Systems and Computing, pp. 271–278, 2013
[9] D.H. Hubel, T.N. Wiesel, Receptive fields, binocular interaction and functional architecture in the cat's visual cortex, J Physiol, Vol. 160, pp. 106–154, 1962
[10] J. Mutch, D.G. Lowe, Object class recognition and localization using sparse features with limited receptive fields, International Journal of Computer Vision (IJCV), Vol. 80, No. 1, pp. 45–57, October 2008
[11] MAX pooling, http://ufldl.stanford.edu/wiki/index.php/Pooling