Image Processing & Communication, vol. 18, no. 1, pp. 31-36, DOI: 10.2478/v10248-012-0088-x
OBJECT RECOGNITION ALGORITHM FOR MOBILE DEVICES
RAFAŁ KOZIK ADAM MARCHEWKA
Institute of Telecommunications, University of Technology & Life Sciences in Bydgoszcz, ul. Kaliskiego 7, 85-789 Bydgoszcz, Poland
Abstract. In this paper an object recognition algorithm for mobile devices is presented. The algorithm is based on the hierarchical approach to visual information coding proposed by Riesenhuber and Poggio [1] and later extended by Serre et al. [2]. The proposed method adapts an efficient algorithm to extract information about local gradients, which allows it to approximate the behaviour of the simple-cells layer of the Riesenhuber and Poggio model. Moreover, it is demonstrated that the proposed algorithm can be successfully deployed on a low-cost Android smartphone.
1 Introduction
This paper is a continuation of the authors' recent research on object detection algorithms dedicated to mobile devices. Tentative results were presented in [4]. In contrast to that previous work, modifications to the overall method architecture are proposed, and the experiments described in this paper have been conducted on an extended database of images.
The proposed approach follows the idea of the HMAX model proposed by Riesenhuber and Poggio [1] and adopts its hierarchical feed-forward processing. However, modifications are introduced in order to reduce the computational complexity of the original algorithm.
Moreover, in contrast to the Mutch and Lowe [10] model, an additional layer that mimics the "retina coding" mechanism is added. The results showed that this step increases the robustness of the proposed method. However, it is considered an optional step of visual information processing, because it introduces an additional computational burden that can increase the response time (the frame rate drops significantly) on low-cost mobile devices.
In contrast to the original HMAX model, a different method for calculating the S1 layer response is proposed. The responses of four Gabor filters are approximated with an algorithm that mimics the behaviour of HOG descriptors. HOG [3] descriptors are commonly used for object recognition problems: the image is decomposed into small cells, and in each cell a histogram of oriented gradients is computed. Typically the result is normalised and a descriptor is returned for each cell. However, in contrast to the original HOG descriptor proposed by Dalal and Triggs, in this method (in order to simplify the computations) the blocks of cells do not overlap and the input image is not normalised.
Moreover, in previous work [7] the S2 and C2 layers were replaced with a machine-learned classifier. Such an approach significantly simplifies the learning process and decreases the amount of computation. However, it impacts the invariance to shape changes of a recognised object, and the feature vectors presented to the classifier are of relatively greater dimensionality.
2 The HMAX model overview
The HMAX Visual Cortex model proposed by Riesenhuber and Poggio [1] exploits a hierarchical structure for image processing and coding. It is arranged in several layers that process information in a bottom-up manner. The lowest layer is fed with a grayscale image. The higher layers of the model are called either "S" or "C". These names correspond to the simple (S) and complex (C) cells discovered by Hubel and Wiesel [9]. Both types of cells are located in the striate cortex (called V1), the part of the visual cortex that lies in the most posterior area of the occipital lobe.
The simple cells located in "S" layers apply local filters whose responses form a vector of texture features. As noted by Hubel and Wiesel, individual cells in the cortex respond to the presence of edges. They also discovered that these cells are sensitive to edge orientation (some cells fire only when a given orientation of an edge is observed). The complex cells located in "C" layers calculate, within a limited range of the previous layer, the strongest responses of a given type (orientation). That way, more complex combinations of simple features are obtained.
The hierarchical HMAX model proposed by Mutch and Lowe [10] is a modification of the model presented by Serre et al. in [2]. It introduces two layers of simple cells (S1 and S2) and two layers of complex cells (see Fig. 1), and it uses a set of filters designed to emulate V1 simple cells.

Fig. 1: The structure of the hierarchical model proposed by Mutch and Lowe [10]. (Symbols: I - image, S - simple cells, C - complex cells, GF - Gabor filters, F - prototype feature vectors, ⊗ - convolution operation)

The complex layers are computed using a hard max filter [11]: the "C" cell responses are the maximum values of the associated "S" cells. As shown in Fig. 1, the images are processed by the subsequent simple and complex cell layers and reduced to feature vectors, which are further used in the classification process. The set of features (F) is shared across all images and object categories. Features are computed hierarchically in subsequent layers, each built from the previous one by alternating template matching and max-pooling operations.
The S1 layer in the Mutch and Lowe [10] model adopts 2D Gabor filters computed for four orientations (horizontal, vertical, and two diagonals) at each possible position and scale. The Gabor filters are 11×11 in size and are described by:

G(x, y) = e^(-(X^2 + γY^2)/(2σ^2)) · cos(2πX/λ)    (1)

where X = x cos φ − y sin φ and Y = x sin φ + y cos φ; x, y ∈ [−5; 5], and φ ∈ [0; π]. The aspect ratio (γ), effective width (σ), and wavelength (λ) are set to 0.3, 4.5 and 5.6 respectively. The response of a patch of pixels X to a particular filter G is computed using formula (2):

R(X, G) = | (Σ_i X_i G_i) / sqrt(Σ_i X_i^2) |    (2)
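As an illustration, the filter bank of Eq. (1) and the response measure of Eq. (2) can be sketched in a few lines of numpy; the function names and the discrete grid construction are our own, not part of the model specification.

```python
import numpy as np

def gabor_filter(phi, size=11, gamma=0.3, sigma=4.5, lam=5.6):
    """Build one 11x11 Gabor filter for orientation phi (Eq. 1)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]   # x, y in [-5; 5]
    X = x * np.cos(phi) - y * np.sin(phi)
    Y = x * np.sin(phi) + y * np.cos(phi)
    return np.exp(-(X**2 + gamma * Y**2) / (2 * sigma**2)) \
        * np.cos(2 * np.pi * X / lam)

def patch_response(patch, G):
    """Absolute normalised response of a pixel patch to filter G (Eq. 2)."""
    denom = np.sqrt((patch**2).sum())
    return 0.0 if denom == 0 else float(abs((patch * G).sum() / denom))

# the four orientations: horizontal, two diagonals, vertical
filters = [gabor_filter(phi) for phi in (0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)]
```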
The complex cells located in the C1 layer pool associated units in the S1 layer. For each orientation, the S1 responses are convolved with a max filter that is 10×10 in size in the x, y dimensions (position) and 2 units deep in scale.
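A minimal sketch of this pooling step for a single orientation, assuming non-overlapping 10×10 spatial blocks and pairs of adjacent scales (the text does not specify the filter stride, so the block layout here is our simplification):

```python
import numpy as np

def c1_pool(s1_scales, pool=10):
    """Max-pool S1 maps for one orientation: `pool` x `pool` spatially,
    2 units deep in scale (adjacent scale pairs)."""
    pooled = []
    for a, b in zip(s1_scales, s1_scales[1:]):        # scale depth of 2
        h, w = min(a.shape[0], b.shape[0]), min(a.shape[1], b.shape[1])
        m = np.maximum(a[:h, :w], b[:h, :w])          # max across the scale pair
        H, W = (h // pool) * pool, (w // pool) * pool
        blocks = m[:H, :W].reshape(H // pool, pool, W // pool, pool)
        pooled.append(blocks.max(axis=(1, 3)))        # spatial max per block
    return pooled
```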
As shown in Fig. 1, the intermediate S2 layer is formed by convolving the C1 layer response with a set of intermediate-level features (depicted as F in Fig. 1). This set is established during the learning phase: for a given set of learning images, C1 responses are computed and the most meaningful features are selected using SVM weighting. Mutch and Lowe [10] suggested sub-sampling the C1 responses before feature selection; therefore, the authors select, at random positions and scales of the C1 layer, patches of size 4×4, 8×8, 12×12, and 16×16. The selected and weighted features compose so-called prototypes, which are used in the Mutch and Lowe model as filters whose responses create the S2 layer. The C2 layer composes a global feature vector whose elements correspond to the maximum responses to the given prototype patches. In order to identify a visual object on the basis of the feature vector, a classifier (e.g. an SVM) is trained.
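The C2 stage can be sketched as follows. The text does not give an explicit S2 response formula, so a normalised correlation between each C1 window and each prototype is used here as a stand-in response measure; all names are ours.

```python
import numpy as np

def c2_vector(c1_maps, prototypes):
    """C2: for each prototype patch, the maximum response over all
    positions and scales of the C1 maps."""
    c2 = []
    for p in prototypes:                      # p: one 2-D prototype patch
        ph, pw = p.shape
        best = 0.0
        for c1 in c1_maps:                    # one C1 map per scale
            H, W = c1.shape
            for i in range(H - ph + 1):
                for j in range(W - pw + 1):
                    win = c1[i:i + ph, j:j + pw]
                    denom = np.linalg.norm(win) * np.linalg.norm(p)
                    r = abs((win * p).sum() / denom) if denom else 0.0
                    best = max(best, r)       # global max pooling
        c2.append(best)
    return np.array(c2)
```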
3 The proposed method
In order to approximate the S1 layer behaviour of the HMAX model, an algorithm that mimics HOG descriptors is used. However, to make this step as simple as possible (due to the low computational power of mobile devices), only the gradient orientation binning scheme is adopted. In fact, the intention here is not to compute a HOG descriptor, but to approximate the responses of the four Gabor filters used in the original HMAX model.

In the S1 layer there are N × M × 4 simple cells arranged in a grid of N × M blocks. In each block there are 4 cells, and each cell is assigned a receptive field (the pixels inside the block). Each cell activates depending on a stimulus; there are four possible stimuli, namely vertical, horizontal, left-diagonal, and right-diagonal edges. As a result, the S1 layer output is four-dimensional (x, y, scale, and cell type). In order to compute the responses of all four cells inside a given block (receptive field), Algorithm 1 is applied.
Algorithm 1 Calculating the S1 response
Require: Grayscale image I of size W × H
Ensure: S1 layer of size N × M × 4
  Assign Gmin the low threshold for the gradient magnitude
  for each pixel (x, y) in image I do
    Compute the horizontal Gx and vertical Gy gradients using the Prewitt operator
    Compute the gradient magnitude |Gx,y| at point (x, y)
    n ← x · N/W; m ← y · M/H
    if |Gx,y| < Gmin then
      go to next pixel
    else
      active ← get_cell_type(Gx, Gy)
      S1[n, m, active] ← S1[n, m, active] + |Gx,y|
    end if
  end for
The algorithm computes the responses of all cells using only one iteration over the whole input image I. For each pixel at position (x, y), the vertical and horizontal gradients Gx and Gy are computed. Given the pixel position (x, y) and the gradient vector [Gx, Gy], the algorithm indicates the block position (n, m) and the type of cell (active) whose response has to be incremented by |Gx,y|.

In order to classify a given gradient vector [Gx, Gy] as horizontal, diagonal or vertical, the get_cell_type(·, ·) function uses a simple voting scheme. If the point (|Gx|, |Gy|) is located between the lines y = 0.3·x and y = 3.3·x, the vector is classified as a diagonal; if Gy is positive it is a right diagonal, otherwise a left diagonal. If the point (|Gx|, |Gy|) is located above the line y = 3.3·x, the gradient vector is classified as vertical, and as horizontal when it lies below the line y = 0.3·x.
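Algorithm 1 together with the voting scheme above can be sketched in Python as follows; the Prewitt stencils, the cell-index convention and the default threshold value are our assumptions.

```python
import numpy as np

# Cell indices (our convention): 0 = horizontal gradient, 1 = vertical,
# 2 = right diagonal, 3 = left diagonal
def get_cell_type(gx, gy):
    """Voting scheme from the text: the lines y = 0.3x and y = 3.3x
    split the (|Gx|, |Gy|) plane into three zones."""
    ax, ay = abs(gx), abs(gy)
    if ay < 0.3 * ax:
        return 0                      # below y = 0.3x: horizontal
    if ay > 3.3 * ax:
        return 1                      # above y = 3.3x: vertical
    return 2 if gy > 0 else 3         # between: right or left diagonal

def s1_layer(I, N, M, g_min=10.0):
    """Algorithm 1: one pass over a grayscale image I (H x W array)."""
    H, W = I.shape
    S1 = np.zeros((N, M, 4))
    for y in range(1, H - 1):
        for x in range(1, W - 1):
            # Prewitt gradients: column/row sums of the 3x3 neighbourhood
            gx = I[y-1:y+2, x+1].sum() - I[y-1:y+2, x-1].sum()
            gy = I[y+1, x-1:x+2].sum() - I[y-1, x-1:x+2].sum()
            mag = np.hypot(gx, gy)
            if mag < g_min:
                continue              # go to next pixel
            n = min(x * N // W, N - 1)
            m = min(y * M // H, M - 1)
            S1[n, m, get_cell_type(gx, gy)] += mag
    return S1
```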
The processing in the S1 layer is applied at three scales (the original image, and two images scaled by factors of 0.7 and 1.5). This is necessary to obtain the response of the C1 layer, which is computed by applying a max-pooling filter to the S1 layer. This pooling allows the algorithm to achieve scale invariance.
The output of the C1 layer cells is used to select features that are meaningful for a given object detection task. First, randomly selected images of an object are reduced to feature vectors by the consecutive layers of the proposed model. Afterwards, PCA is applied to calculate the eigenvectors, and each eigenvector is used as a single feature in the feature vector. The response of the S2 layer is obtained by computing the dot product of the eigenvectors with the C1 layer response.
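A minimal numpy sketch of this PCA step, assuming the C1 responses are flattened into row vectors; the function names are ours:

```python
import numpy as np

def fit_eigenvectors(c1_vectors, k):
    """PCA on the flattened C1 responses of the training images;
    returns the top-k eigenvectors as rows."""
    X = np.asarray(c1_vectors, dtype=float)
    X = X - X.mean(axis=0)                  # centre the data
    # eigenvectors of the covariance matrix via SVD (rows of Vt)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:k]

def s2_response(c1_vector, eigenvectors):
    """S2: dot product of a C1 response with each eigenvector."""
    return eigenvectors @ np.asarray(c1_vector, dtype=float)
```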
Finally, a classifier is trained to recognise an object using the S2 layer response. In this paper, different machine-learning algorithms are investigated.
4 Experiments and Results
In contrast to the previous work [4], the database of images has been extended and new experiments have been conducted, aimed at evaluating the effectiveness of different classifiers. In this work, tree-based and rule-based classification approaches have been compared, namely N (5, 10, 15) Random Trees, J48 and PART (previously used by the authors in [4, 5, 6]). The comparison is presented in Fig. 2.
For the experiments, the dataset was randomly split into training and testing samples (in the proportion 70% to 30% respectively). For the training samples, the C1 layer responses were calculated and eigenvectors were extracted. The eigenvectors were used to calculate the S2 layer responses, which were further used to train the classifier. During the testing phase, the eigenvectors were again used to extract the S2 layer responses for the testing samples. The whole procedure was repeated 10 times (the dataset was randomly reshuffled before each split) and the results were averaged.
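This evaluation protocol (ten random 70/30 splits with averaged detection and false-positive rates) can be sketched as follows. The classifier is passed in as a pair of hypothetical train/predict callbacks, since the actual Random Trees, J48 and PART implementations come from an external toolkit.

```python
import numpy as np

def evaluate(features, labels, train_fn, predict_fn,
             runs=10, train_frac=0.7, seed=0):
    """Average detection rate (TPR) and false-positive rate over
    `runs` random train/test splits (70/30 by default)."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    dr, fpr = [], []
    for _ in range(runs):
        idx = rng.permutation(n)              # reshuffle before each split
        cut = int(train_frac * n)
        tr, te = idx[:cut], idx[cut:]
        model = train_fn(features[tr], labels[tr])
        pred = predict_fn(model, features[te])
        pos = labels[te] == 1
        if pos.any():
            dr.append(float((pred[pos] == 1).mean()))
        if (~pos).any():
            fpr.append(float((pred[~pos] == 1).mean()))
    return float(np.mean(dr)), float(np.mean(fpr))
```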
The best results were obtained for 10 Random Trees: a detection rate of 95.6% with 2.3% false positives. It can be noticed that adding more trees does not improve the effectiveness.
An early prototype of the proposed algorithm was deployed on an Android device. The code was written in pure Java and tested on a Samsung Galaxy Ace, which is equipped with an 800 MHz CPU, 278 MB of RAM and the Android 2.3 operating system. The current version implements a brute-force Nearest Neighbour classifier, operates at only one scale (the PC version scans three scales: the original image and two images scaled by factors of 0.7 and 1.5), and achieves about 10 FPS when fewer than 10 training samples are provided. Some examples of object detection are shown in Fig. 3. During testing it was noticed that the algorithm is able to recognise an object on a cluttered background even if only a few learning samples are provided.
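The brute-force Nearest Neighbour step mentioned above reduces to a few lines. The Euclidean distance metric is our assumption, and the sketch is in Python rather than the Java used on the device.

```python
import numpy as np

def nn_classify(query, train_vectors, train_labels):
    """Brute-force 1-NN over S2 feature vectors: return the label of
    the training vector closest to the query (Euclidean distance)."""
    d = np.linalg.norm(train_vectors - query, axis=1)
    return train_labels[int(np.argmin(d))]
```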
Fig. 3: Examples of object detection (a mug and doors) with the proposed algorithm deployed on an Android device
Fig. 2: Effectiveness of the different classifiers (false positive rate vs. detection rate for 5, 10 and 15 Random Trees, J48 and PART, with exponential trend lines)

5 Conclusions

In this paper an algorithm for object recognition dedicated to mobile devices was presented. The algorithm is based on the hierarchical approach to visual information coding proposed by Riesenhuber and Poggio [1] and later extended by Serre et al. [2]. The experiments show that the introduced algorithm allows for efficient feature extraction and visual information coding. Moreover, it was shown that the proposed approach can be deployed on a low-cost mobile device. The results are promising and show that the algorithm described in this paper can be further investigated as one of the assistive solutions dedicated to visually impaired people.
References
[1] M. Riesenhuber, T. Poggio, Hierarchical models of
object recognition in cortex, Nat Neurosci, Vol. 2,
pp.1019–1025, 1999
[2] T. Serre, G. Kreiman, M. Kouh, C. Cadieu, U. Knoblich, T. Poggio, A quantitative theory of immediate visual recognition, In: Progress in Brain Research, Computational Neuroscience: Theoretical Insights into Brain Function, Vol. 165, pp. 33–56, 2007
[3] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), Vol. 1, pp. 886–893, June 2005
[4] R. Kozik, A Simplified Visual Cortex Model for
Efficient Image Codding and Object Recognition,
Image Processing and Communications Challenges
5, Advances in Intelligent Systems and Computing,
pp. 271–278, 2013
[5] R. Kozik, Rapid Threat Detection for Stereovision Mobility Aid System, In: T. Czachorski et al. (Eds.), Man-Machine Interactions 2, AISC 103, pp. 115–123, 2011
[6] R. Kozik, Stereovision system for visually impaired, In: R. Burduk et al. (Eds.), Computer Recognition Systems 4, Advances in Intelligent and Soft Computing 95, pp. 459–468, 2011
[7] R. Kozik, A Proposal of Biologically Inspired Hierarchical Approach to Object Recognition, Journal of Medical Informatics & Technologies, Vol. 22/2013, ISSN 1642-6037, pp. 171–176, 2013
[8] R. Kozik, A Simplified Visual Cortex Model for Efficient Image Codding and Object Recognition, In: R.S. Choras (Ed.), Image Processing and Communications Challenges 5, Advances in Intelligent Systems and Computing, pp. 271–278, 2013
[9] D.H. Hubel, T.N. Wiesel, Receptive fields, binocular interaction and functional architecture in the cat's visual cortex, J Physiol, Vol. 160, pp. 106–154, 1962
[10] J. Mutch, D.G. Lowe, Object class recognition and localization using sparse features with limited receptive fields, International Journal of Computer Vision (IJCV), Vol. 80, No. 1, pp. 45–57, October 2008
[11] MAX pooling, http://ufldl.stanford.edu/wiki/index.php/Pooling