
Appearance Based Object Recognition in Domestic Environments

Utseendebaserad objektigenkänning i vardagsmiljöer

Master's thesis in Computer Science

Niklas Hallenfur

30th January 2003

Examiner: Prof. Jan-Olof Eklund

Supervisor: Dr. Danica Kragic


Abstract

Appearance Based Object Recognition in Domestic Environments

This master's thesis examines the usefulness of three approaches to appearance-based object recognition for the task of object recognition and detection in a domestic environment. The three approaches are Principal Component Analysis, Histograms and Local Image Features. Each approach is outlined with a presentation of the theory and of how it has been used before.

The algorithms were implemented to perform two main tasks. The first was object detection, that is, locating an object within an image. The second was object recognition, that is, identifying an object depicted in a given image.

The principal component analysis-based algorithm was implemented to perform both of these tasks. The histogram-based algorithm focused on object detection, while the feature-based algorithm focused on object recognition.

The three algorithms were tested to determine their detection rates and recognition rates, that is, the percentage of winning hypotheses that are correct.

The principal component analysis-based algorithm gave lower detection rates than the histogram-based algorithm, but positioned the objects more precisely. The feature-based algorithm gave much better recognition results than the other two. For the sake of comparison, the histogram-based algorithm was also tested for object recognition, although it was not designed for that task.

Summary

Utseendebaserad objektigenkänning i vardagsmiljöer

This master's thesis investigates how useful three different approaches to appearance-based object recognition are for performing object recognition and localization in an everyday environment. The three approaches are Principal Component Analysis, histograms and local image features. Each approach is presented briefly together with the underlying theory and with how others have applied it previously.

The implemented algorithms were mainly intended to perform two tasks. The first was to localize objects in images. The second was object recognition, that is, identifying an object in a given image.

The principal component analysis-based algorithm was implemented to perform both of these tasks. The histogram-based algorithm was made mainly for localization, while the feature-based algorithm was implemented for object recognition.

The three algorithms were tested to determine their localization rates and recognition rates, that is, the percentage of winning hypotheses that were correct.

The principal component analysis-based algorithm turned out to give a lower localization rate than the histogram-based one, but on the other hand it localized the objects more precisely. The feature-based algorithm gave a much better recognition rate than the other two. The histogram-based algorithm was also tested for its recognition rate, although it was not intended for that task, to allow comparison with the others.

Contents

1 Introduction
    1.1 Problem
    1.2 Outline

2 Representation
    2.1 Principal Component Analysis (PCA)
    2.2 Histogram Based Approaches
    2.3 Feature Based Approaches
        2.3.1 Local Features
        2.3.2 Key Locations
        2.3.3 Applications
    2.4 Discussion

3 Classification
    3.1 Nearest Neighbour Classifier
    3.2 Learning Vector Quantization
    3.3 Neural Networks
    3.4 Support Vector Machines
    3.5 Discussion

4 Implementation
    4.1 Principal Component Analysis
    4.2 Colour Histograms
    4.3 Local Features
        4.3.1 Verification using local histograms

5 Experimental Evaluation
    5.1 PCA
    5.2 Colour Histograms
    5.3 Local Features

6 Discussion

7 Summary

References

List of Figures

1 The XR4000 robotic platform

2 Objects to recognize.

3 The first four eigenvectors from a set of images of rice packages under different rotation and illumination.

4 Gray-scale image of a rice package and corresponding histogram with 50 bins. The two dominant peaks correspond to object and background respectively.

5 a) An image with two striped regions. b) Convolved by a derivative filter along y-axis. c) Absolute value of convolution.

6 Using NNC, the star is classified to belong to the same class as the triangles.

7 Using kNN, with k = 5, the star is classified to belong to the same class as the squares.

8 Separating hyper-plane (solid line) and margin (dotted lines). The data lying on the margin are the support vectors.

9 Raisins before (top) and after (bottom) multiplication by a Gaussian mask.

10 Example detection results for raisins (left) and rice (right) when not using Gaussian mask. Corresponding maps are shown in the bottom row.

11 Example detection results for raisins (left) and rice (right) when using a Gaussian mask. Corresponding maps are shown in the bottom row.

12 Two training images, and the results of calculating Harris corner strength measure.

13 One of the images used for detection tests.

14 Examples of real-world test images.

15 Detection results for rice package in the example image using the PCA implementation.

16 Recognition rates with varying number of eigenvectors.

17 Recognition rates with varying σ for Gaussian mask

18 Detection results for rice package in the example image using the histogram implementation.

19 Detection results for the rice package after using the existing co-occurrence based segmentation algorithm (left), and results after applying the feature-based verification step (right).

20 Recognition rate using different values of �

21 Recognition rate using segmented images.

22 Recognition rate using real-world images with background.

List of Tables

1 Detection rate (%) for the different objects, using 20 eigenvectors and σ = 8

2 Detection rates (%) of the different objects, using average histograms and Euclidean distance.

3 Detection rates (%) of the different objects, using average histograms and histogram intersection.

4 Recognition rates (%) using different distance measures.

5 Detection rates (%) using co-occurrence histogram, and a verification step based on local features.

1 Introduction

Figure 1: The XR4000 robotic platform (labelled parts: stereo head, pan-tilt unit, eye-in-hand camera, force-torque sensor, Barrett hand)

Object recognition has been an active field of research in computer vision for the last few decades. However, there is still no general solution for object recognition that can be applied in realistic everyday environments.

One of the projects at the Centre for Autonomous Systems (CAS) is the Intelligent Service Robot (ISR) project. It concentrates primarily on systems integration and perception in domestic environments (homes, offices, hospitals, etc.). The long-term goal is to show, by implementing a basic robot architecture and a few prototypical tasks, that it is possible to create a household robot that works robustly.

A real-world environment is needed in order to perform realistic tests with the robots. For tests in an office environment, the existing offices at CVAP are used. A home environment has been created by turning a room in the lab into a "living room".

The main robot platforms are a Nomadic 200 and a Nomadic XR4000. The larger XR4000 is equipped with a Puma 560 robot arm so that it can perform more advanced tasks in which objects are manipulated. The XR4000 platform is shown in Figure 1.

As stated earlier, a few prototypical tasks have been implemented. The idea behind these tasks is that they must be useful and pose a challenge from a scientific point of view. The goal is to show that such tasks can be implemented robustly, so the solutions must work in a variety of house and office environments without the need to engineer the environment. Examples of tasks:

• Go fetch the milk in the refrigerator.

• Find the remote control and bring it back.

• Deliver mail or printer output to any person in the lab.

• Go to the next floor using the elevator.

Accomplishing the above tasks requires research in several fields:

• Localization

• Navigation

• Object manipulation and grasping

To be able to pick up objects and perform fetch-and-carry tasks, as well as search for objects, the robot has to be able to locate and recognize objects in the environment. Computer vision is by far the best suited sensory modality for this purpose. Images are acquired from two cameras for stereo vision mounted on top of the robot and from a camera mounted on the robot arm.

1.1 Problem

The problem pursued here is to test a few approaches for systems that perform segmentation and object recognition in images. The segmentation module is supposed to generate hypotheses about object locations. After segmentation, the hypotheses are used as input to the recognition module, which decides what objects are present.

The system should be designed for indoor environments, where no strong assumptions can be made about illumination or background, so the system must handle a wide range of illumination, background and occlusions.

The objects to be recognized are a limited set of everyday objects, selected because they can be gripped by the robot arm. The task for the system is to recognize these specific object instances. The chosen objects are primarily the following:

• Raisins package

• Rice package

• Cup

• Soda bottle

• Cleaner bottle

These objects are shown in Figure 2.

The tasks of this master's project are to:

• Implement a few methods for object recognition, suitable for the types of objects considered here.

• Evaluate the implemented methods.

1.2 Outline

This thesis is organized as follows.

Section 2 gives an introduction to some approaches used for object representation in the context of the object recognition problem. The approaches are first presented along with the theory behind them, followed by examples of applications. The section closes with a discussion of the usability of the presented methods for this thesis.

Section 3 outlines the theory of some methods for classification. The section closes with a short discussion motivating the choice of classifier for this work.

Section 4 presents the implementations of the chosen methods.

Figure 2: Objects to recognize (labelled: raisins, cup, soda can, cleaner, fruit can, rice, soda bottle, soup).

Section 5 presents how the implementations were tested, along with the data acquired from the tests. Some examples of test images are shown, as well as results of applying the implemented methods to the images.

Section 6 discusses the results of the experimental evaluation and the strengths and weaknesses of the implemented algorithms. It also discusses why the algorithms gave the results they did, and how they could be improved.

The thesis is finally summed up in Section 7.


2 Representation

The brief introduction to object recognition presented below is by no means intended as a general presentation of this huge subject, but rather as a path down to the methods discussed further in this section.

Two main approaches for representing objects for object recognition are:

• Geometry-based models

• Appearance-based models

The geometry-based approaches rely on matching image features to a model of the object, often a wire-frame model. The matching requires the shape and contour of the object to be extracted from the image, which in a general setting is a difficult task. With varying backgrounds, it is very difficult to tell whether the extracted contours belong to the object or the background. In addition, what can be extracted is not invariant to variations in illumination and shading. The geometry-based approach to object recognition therefore works best in a structured environment with constant illumination, such as a factory or laboratory.

Appearance-based approaches rely on photometric properties instead of shape. The representation of an object is based on properties extracted from training images of the object. The reliability of the matching is still subject to variations in background and illumination, as well as projective transformations, but research in the field has produced representations which are more robust to such distractions.

There are many difficulties with representing objects. An ideal representation should be the same no matter what the conditions are, i.e. the representation should be invariant to varying lighting, rotations, scale changes, background, translation, etc. However, it has proven difficult to achieve invariance to most of these varying conditions, let alone all of them at the same time. Therefore, work in the area of object recognition focuses on finding representations that are robust to varying conditions.

Since this work only considers appearance-based representations, this section only covers a few of the techniques used for such representations.

This section focuses on presenting the basic principles of each covered method, followed by a short summary of how the method has been used by others. The methods below have been studied in the literature and are considered in this thesis.

2.1 Principal Component Analysis (PCA)

PCA is a mathematical method that, for a population of data, finds the axes along which the data has the most variance. These axes are called the eigenvectors of the data set. The data can then be represented by its projections onto the eigenvectors. When the data is correlated, the number of eigenvectors can often be greatly reduced while still representing the data well, which makes PCA useful for dimension reduction. In object recognition, training images can thus be represented by their projections onto the eigenvectors instead of by the images themselves. The eigenvectors used then span an image subspace, called the eigenspace.

To compute the eigenspace, the average $c$ of all training images is subtracted from each image. An image matrix $P$ is then computed by stacking the images column-wise:

$$P = \{I_1 - c,\; I_2 - c,\; \ldots,\; I_M - c\}$$

$P$ is an $N \times M$ matrix, where $N$ is the number of pixels in each image and $M$ is the number of images. To compute the eigenvectors for the training images, we compute the covariance matrix $Q$:

$$Q = P P^{T}$$

The covariance matrix $Q$ is an $N \times N$ matrix. Since $N$ is the number of pixels in a training image, $Q$ is a large matrix. The eigenvalues $\lambda$ and the corresponding eigenvectors $e$ are computed by solving the equation:

$$\lambda_k e_k = Q e_k$$

The eigenvalue $\lambda_i$ is, up to the constant factor $1/M$, the variance of the data (in this case the training images) along axis $e_i$. The eigenspace is therefore determined by choosing the $K$ eigenvectors with the highest eigenvalues. $K$ can often be chosen orders of magnitude smaller than the original dimensionality and still capture the majority of the variance in the data.
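The eigenspace computation described above can be sketched in NumPy as follows. This is a minimal illustration, not the implementation used in this thesis: the function names are hypothetical, and the code assumes that each training image has been flattened into one row of an array. Because $Q$ is $N \times N$, the sketch uses the common shortcut of diagonalizing the much smaller $M \times M$ matrix $P^{T}P$ instead (an addition not taken from the text above); it therefore assumes $K < M$.

```python
import numpy as np

def compute_eigenspace(images, k):
    """images: array of shape (M, N), one flattened training image per row.
    Returns the mean image c (N,), the K leading unit eigenvectors of
    Q = P P^T as columns of E (N, K), and their eigenvalues (K,)."""
    c = images.mean(axis=0)
    P = (images - c).T                     # N x M, columns are I_m - c
    # Assumption: use the M x M matrix P^T P. It shares its non-zero
    # eigenvalues with Q, and an eigenvector v of P^T P gives the
    # eigenvector P v of Q.
    eigvals, v = np.linalg.eigh(P.T @ P)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]  # indices of the K largest
    E = P @ v[:, order]
    E /= np.linalg.norm(E, axis=0)         # normalize each eigenvector
    return c, E, eigvals[order]

def project(image, c, E):
    """Coordinates of a flattened image in the eigenspace."""
    return E.T @ (image - c)
```

Two images can then be compared through the Euclidean distance between their `project` outputs, which involves only $K$ numbers per image instead of $N$.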

Figure 3 shows an example of eigenvectors of a set of images of rice packages. It shows the four eigenvectors with the largest eigenvalues.

Figure 3: The first four eigenvectors from a set of images of rice packages under different rotation and illumination.

The dimension reduction enables fast comparison between test and training images, since only the projections into the eigenspace have to be compared. The method is straightforward to implement since it is mathematically simple and well defined. However, it is essentially template matching and is therefore very sensitive to cluttered or varying backgrounds, rotations and translations of objects within the image, scale changes and lighting changes, i.e. to all the conditions a good representation should be invariant to. Because of its simplicity, the method can nevertheless be very useful for recognition when the objects can be expected to be centered in the image, with a static background and at the same scale and rotation as in the training data.

PCA has often been applied to the face recognition problem, by Turk and Pentland (1991) among others. Their idea was to compute a subspace from a training set of faces and map face images to that subspace to perform matching. The subspace was defined by the eigenvectors of the set of training images, as described above. Each face to be recognized was projected onto that space. The projection was compared to the projections of known faces using Euclidean distance, to identify a match.

They also proposed a method for face localization. It was based on the observation that a face image could be reconstructed using the eigenspace. On the other hand, images that were not faces could not be reconstructed to look like the original image. Therefore, the Euclidean distance between an image and the reconstruction of that image from the face eigenspace could measure how much the image looked like a face.

Given a large image, the method was to calculate, for each location, the distance from a window to its reconstruction from face space. The resulting map gave a "faceness" measure, with peaks indicating a high probability that a face was present at the peak location.
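The localization idea can be sketched as a sliding-window computation. The code below is an illustration under stated assumptions, not Turk and Pentland's implementation: the function names are hypothetical, `c` is the mean training image and `E` holds orthonormal eigenvectors as columns (both flattened from `h` x `w` windows), and the map stores the raw distance, so likely object locations appear as minima rather than peaks.

```python
import numpy as np

def distance_from_eigenspace(window, c, E):
    """Euclidean distance between a flattened window and its
    reconstruction from the eigenspace spanned by the columns of E."""
    centred = window - c
    reconstruction = E @ (E.T @ centred)   # project, then map back
    return np.linalg.norm(centred - reconstruction)

def distance_map(image, c, E, h, w):
    """Slide an h x w window over a 2-D grey-scale image and store the
    distance at every window position (top-left corner indexing)."""
    rows = image.shape[0] - h + 1
    cols = image.shape[1] - w + 1
    out = np.empty((rows, cols))
    for y in range(rows):
        for x in range(cols):
            patch = image[y:y + h, x:x + w].ravel()
            out[y, x] = distance_from_eigenspace(patch, c, E)
    return out
```

A measure with peaks at likely locations, as in the description above, could be obtained by negating or inverting the distance; which transformation to use is a design choice left open here.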

2.2 Histogram Based Approaches

Histograms are a way of approximating distributions by counting occurrences of values in sample data. The variable whose distribution is to be approximated is often continuous, while a histogram can only make a discrete approximation. The range of values the continuous variable can assume must therefore be divided into a discrete number of bins. Each bin stores the number of samples that fall into it. The histogram can then be normalized by dividing each bin by the total number of samples.

One method for deciding the similarity of histograms is called histogram intersection. Given two histograms, $I$ and $M$, each containing $n$ bins that are not normalized, the intersection of the two histograms is defined by Swain and Ballard [Swain and Ballard, 1991, p. 15] as:

$$\sum_{i=1}^{n} \min(I_i, M_i)$$

The result is the number of pixel pairs of the same colour that can be formed with one pixel from each image. To get a fractional match value ranging from 0 to 1, the intersection can be normalized by the number of pixels in the model histogram $M$. The match value is then defined as:

$$\frac{\sum_{i=1}^{n} \min(I_i, M_i)}{\sum_{i=1}^{n} M_i}$$
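As a small worked example of the two formulas, the following NumPy sketch computes the intersection and the normalized match value for two toy histograms. The function names and the bin counts are made up for illustration.

```python
import numpy as np

def histogram_intersection(image_hist, model_hist):
    """Sum over bins of min(I_i, M_i) for unnormalized histograms."""
    return np.minimum(image_hist, model_hist).sum()

def match_value(image_hist, model_hist):
    """Intersection normalized by the pixel count of the model histogram."""
    return histogram_intersection(image_hist, model_hist) / model_hist.sum()

I = np.array([4, 0, 6, 2])   # toy image histogram
M = np.array([3, 1, 5, 3])   # toy model histogram, 12 pixels in total

print(histogram_intersection(I, M))  # 3 + 0 + 5 + 2 = 10
print(match_value(I, M))             # 10 / 12
```

Matching a histogram against itself gives a match value of exactly 1, which is the upper bound of the measure.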

In the field of computer vision, histograms are commonly used to approximate colour and intensity distributions. A simple form of histogram approximates the distribution of intensity in a gray-scale image, as in Figure 4.

Figure 4: Gray-scale image of a rice package and corresponding histogram with 50 bins. The two dominant peaks correspond to object and background respectively.

Histograms can be used directly on each of the three colour channels or on combinations of them in order to make more efficient representations for matching.

Colour histograms can be useful for object recognition when the objects have sufficiently specific colours. Computing a colour histogram is fast and efficient, since it is only a matter of counting occurrences of colours in the image. If each pixel is taken into account, the work of constructing a histogram is proportional to the number of pixels in the image. An advantage for object recognition is that histograms are invariant to rotation around the viewing axis and change only slowly when the object in the image is rotated around the other axes. Depending on how the matching is done, histograms can also be semi-invariant to scale changes.

The main weakness of using colour histograms for object recognition is their inherent sensitivity to variations in illumination. This can to some extent be compensated for by pre-processing the images. The most straightforward such pre-processing algorithm normalizes the red, green and blue components by their sum, for each pixel:

$$r' = \frac{r}{r + g + b} \tag{1}$$

$$g' = \frac{g}{r + g + b} \tag{2}$$

$$b' = \frac{b}{r + g + b} \tag{3}$$

These are called the chromaticity coordinates. When processing images in chromaticity space, normally only the $r'$ and $g'$ components are used. Only two degrees of freedom remain, since $r' + g' + b' = 1$ and any two values therefore determine the third, so the $b'$ component is usually ignored. In addition, the blue channel of a camera is usually noisy, in low light conditions in particular [Christensen et al. 2002]. The resulting representation is less sensitive to varying illumination.
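The normalization in equations (1) to (3) can be sketched as below. The function name is hypothetical, and mapping completely black pixels (where the sum is zero) to zero is an assumption made here to avoid a division by zero; the text does not say how such pixels should be handled.

```python
import numpy as np

def chromaticity_rg(rgb):
    """rgb: array of shape (..., 3) with red, green and blue values.
    Returns an array of shape (..., 2) holding r' and g' per pixel."""
    rgb = np.asarray(rgb, dtype=float)
    total = rgb.sum(axis=-1, keepdims=True)
    # Assumption: black pixels (r + g + b = 0) are mapped to (0, 0).
    safe_total = np.where(total == 0, 1.0, total)
    return (rgb / safe_total)[..., :2]
```

Scaling a pixel's three components by a common factor, as a change in illumination intensity roughly does, leaves the returned pair unchanged, which is the property the representation relies on.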

Another weakness of using colour histograms is that they are dependent on background, unless figure-ground segmentation is used. This property makes histograms less useful for recognition in a cluttered environment where the background can be expected to change between different images of the same object.


Colour histograms have previously been used for object recognition by Swain and Ballard [Swain and Ballard, 1991]. They used histograms both for object identification and for object localization. The former is the task of determining, given an image window containing an object, what object is in the window. The latter is the task of determining, given an image, the location of an object within the image.

For the identification task, they used three-dimensional colour histograms and histogram intersection for matching histograms. The three dimensions of the histograms were the three opponent colour axes, defined as:

$$\begin{aligned}
rg &= r - g \\
by &= 2b - r - g \\
wb &= r + g + b
\end{aligned}$$

These colour axes were used to allow the intensity ($wb$) axis to be more coarsely sampled, since it is more sensitive to varying illumination and shadows than the other two. The $rg$ and $by$ axes were each divided into 16 bins and the $wb$ axis into 8 bins. They used 66 pre-segmented model images and 32 test images without prior segmentation. The test images were matched to the model images with histogram intersection, and 29 were correctly classified as best match. The three objects that were not correctly classified by the best match were all second-best matches for the correct models.
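A three-dimensional histogram over the opponent colour axes, with the 16 x 16 x 8 binning mentioned above, might be built as in this sketch. The function name is hypothetical, and the axis ranges assume 8-bit colour channels (values 0 to 255); Swain and Ballard's exact bin boundaries are not given in the text.

```python
import numpy as np

def opponent_histogram(rgb, bins=(16, 16, 8)):
    """rgb: array of shape (H, W, 3) with 8-bit channel values.
    Returns a 3-D histogram over the (rg, by, wb) opponent axes."""
    r, g, b = (rgb[..., i].astype(float) for i in range(3))
    rg = r - g            # range [-255, 255] for 8-bit input
    by = 2 * b - r - g    # range [-510, 510]
    wb = r + g + b        # range [0, 765]
    samples = np.stack([rg.ravel(), by.ravel(), wb.ravel()], axis=1)
    # Assumption: bins are spread evenly over the full value range.
    ranges = [(-255, 255), (-510, 510), (0, 765)]
    hist, _ = np.histogramdd(samples, bins=bins, range=ranges)
    return hist
```

Two such histograms can be flattened and compared with histogram intersection as defined earlier in this section.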

They also tried normalizing the red, green and blue components in order to achieve better robustness to illumination changes. With the same model and test images, 15 objects were correctly classified as best match, 7 as second-best match, 3 as third-best match and 7 worse than third-best match.

To test the dependence on object rotation, they tried matching objects that were successively rotated away from the model view of the object. The dependence on scale was also tested by rescaling the test images prior to matching.

For the localization task they used a method called histogram backprojection. With the same model histogram $M$ for each model as before and the histogram $I$ of the test image, a ratio histogram was defined:

$$R_i = \min\left(\frac{M_i}{I_i},\; 1\right)$$

This histogram is backprojected onto the image. That is, the image values are replaced by the values of $R$ that they index. With this technique, colours that are in the model histogram but are rare in the image get a strong response and are considered a strong clue to the presence of the object. Colours that are very common in the image, but may still be in the model histogram, are considered a weak clue and hence get a weak response.

After backprojection of the ratio histogram, the resulting intensity image was convolved with a disc of the same area as the object was expected to have in the image. Probable locations of the object were found by looking at peaks in the resulting image. For the same test objects, the correct location was found as the highest peak in 28 cases. The locations of the remaining four objects were either the second or third highest peak.
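The backprojection procedure can be sketched as follows. This is an illustrative reading of the description above, not Swain and Ballard's code: the function names are hypothetical, each pixel is assumed to have already been mapped to a flat histogram bin index smaller than the model histogram's length, and the disc convolution is written as a sum of shifted copies to avoid extra dependencies.

```python
import numpy as np

def ratio_histogram(model_hist, image_hist):
    """R_i = min(M_i / I_i, 1). Bins that are empty in the image are
    never indexed by a pixel, so they are simply left at 0."""
    ratio = np.zeros(model_hist.shape, dtype=float)
    used = image_hist > 0
    ratio[used] = np.minimum(model_hist[used] / image_hist[used], 1.0)
    return ratio

def disc_sum(img, radius):
    """Convolve img with a binary disc by summing shifted copies."""
    h, w = img.shape
    padded = np.pad(img, radius, mode="constant")
    out = np.zeros((h, w))
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy * dy + dx * dx <= radius * radius:
                out += padded[radius + dy:radius + dy + h,
                              radius + dx:radius + dx + w]
    return out

def locate(bin_index, model_hist, radius):
    """bin_index: (H, W) integer array with each pixel's histogram bin.
    Returns the (row, column) of the highest peak and the response map."""
    image_hist = np.bincount(bin_index.ravel(), minlength=model_hist.size)
    backprojected = ratio_histogram(model_hist, image_hist)[bin_index]
    response = disc_sum(backprojected, radius)
    peak = np.unravel_index(np.argmax(response), response.shape)
    return tuple(int(v) for v in peak), response
```

The disc radius plays the role of the expected object size; choosing it requires knowing roughly how large the object appears in the image.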


2.3 Feature Based Approaches

This section outlines the basic approach of using features, i.e. local image descriptors. This approach has been used in many different ways, not all of which can be covered here. Therefore, only a brief summary of how features can be used is given, followed by a few examples of applications.

Features are a way of representing objects by calculating values from the image in the form of descriptors. In order to make the features sufficiently discriminating, feature vectors can be formed from values of different features. Local descriptors are often used as features for object recognition, so one image of an object can generate several feature vectors sampled at different key locations (see Section 2.3.2).

2.3.1 Local Features

A feature can be anything from a direct pixel value to derivatives, contours, curvature measures etc. Responses from Gaussian derivative filters have commonly been used as features ([Schiele and Pentland, 1999], [Rao and Ballard, 1995]). Other types of features that have been used include colour, contour measures, angle measures, corner measures, blob measures etc., which were all used by Mel [Mel, 1997].

Figure 5: a) An image with two striped regions. b) Convolved by a derivative filter along y-axis. c) Absolute value of convolution.

To clarify the concept of features we give an example. Assume one wants a measure of how striped a region in an image is (see the example image in Figure 5a). If the image is differentiated along the vertical axis, by convolution with a derivative filter, the result is an image showing the derivative response (Figure 5b). Homogeneous regions and vertical lines get no response, while the edges of horizontal lines get a strong response, either positive or negative. By averaging the absolute value in the desired region, a measure of horizontal stripes in that region is obtained (Figure 5c). That measure is a feature. In Figure 5c, the region with vertical lines gets no response, while the region with horizontal lines gets a positive response.
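The stripe measure from this example can be written in a few lines. The sketch below assumes the simplest possible derivative filter, the difference between vertically adjacent pixels; the filter actually used for Figure 5 is not specified, and the function name is hypothetical.

```python
import numpy as np

def horizontal_stripe_measure(region):
    """Mean absolute vertical derivative of a 2-D grey-scale region."""
    # Assumption: a plain first difference along the y-axis (rows).
    dy = np.diff(region.astype(float), axis=0)
    return np.abs(dy).mean()

# Rows alternate between 0 and 1, giving horizontal stripes.
horizontal = np.tile(np.array([[0], [1]]), (4, 8))
# Transposing turns them into vertical stripes.
vertical = horizontal.T

print(horizontal_stripe_measure(horizontal))  # strong response
print(horizontal_stripe_measure(vertical))    # no response
```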

In order to make local feature representations of images, features are sampled at several locations. There are several approaches for choosing such locations, the two main ones being:

• Choose pre-determined locations

• Choose locations dynamically

The most straightforward approach is to sample the features at pre-determined locations, for example in a grid. Rao and Ballard [Rao and Ballard, 1995] used a radial grid for key locations, while Mel [Mel, 1997] sampled features at each pixel location in order to build histograms. The approach used by Rao and Ballard greatly simplifies the matching of features, since each one corresponds to exactly one known location in the image. However, this approach requires training and test images of a pre-determined size, with the object centered in every image.

2.3.2 Key Locations

When images cannot be expected to be of a given size and the object might be translated or rotated within the image, key locations must be chosen dynamically. Finding methods and algorithms for this problem is an active field of research and new methods are reported continuously. The current focus of research is to find key locations that are invariant to scale. For features to be a useful representation, the key locations must be repeatable over rotation, translation, scale and illumination, or the sampled features would not be comparable.

When performing recognition on realistic images, the objects will not have the same position in each image. To be comparable, the features therefore have to be sampled at corresponding locations within the object.

Rotation invariance is important for object recognition if objects need to be recognized at different angles relative to the training images. For rotation invariance in the key localization algorithm to be useful, the features themselves must of course be rotation invariant.

Scale invariance means that the corresponding key locations will be found in the same image, even after it is rescaled. This is important for recognizing objects at varying distances from the camera. A recognition system for real-world images cannot assume that objects are always at the same distance from the camera, and it may also be desirable to use the camera's zoom.

There are many proposed solutions to the problem of finding key locations in a robust manner. Perhaps one of the most commonly used algorithms for this task is the Harris corner detector [Harris and Stevens, 1988]. Others include finding line segments, finding extrema in difference-of-Gaussian functions, etc.
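To make the Harris corner detector more concrete, the sketch below computes the usual corner strength, $\det(A) - k\,\mathrm{trace}(A)^2$, where $A$ is the locally smoothed matrix of gradient products. It is a minimal NumPy illustration with hypothetical function names; the Gaussian window with `sigma=1.0` and the constant `k=0.04` are commonly used values assumed here, not parameters taken from this thesis. The image is assumed to be larger than the smoothing kernel.

```python
import numpy as np

def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian truncated at about three sigma."""
    radius = int(3 * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x**2 / (2 * sigma**2))
    return kernel / kernel.sum()

def smooth(img, sigma):
    """Separable Gaussian smoothing with zero padding at the borders."""
    k = gaussian_kernel(sigma)
    tmp = np.apply_along_axis(np.convolve, 0, img, k, mode="same")
    return np.apply_along_axis(np.convolve, 1, tmp, k, mode="same")

def harris_response(img, sigma=1.0, k=0.04):
    """Corner strength per pixel: positive at corners, negative along
    straight edges and close to zero in flat regions."""
    iy, ix = np.gradient(img.astype(float))   # derivatives along y and x
    sxx = smooth(ix * ix, sigma)
    syy = smooth(iy * iy, sigma)
    sxy = smooth(ix * iy, sigma)
    det = sxx * syy - sxy**2
    trace = sxx + syy
    return det - k * trace**2
```

Key locations would then be taken as local maxima of the response above some threshold; how that threshold is chosen is left open in this sketch.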


2.3.3 Applications

Schmid and Mohr [Schmid and Mohr, 1997] used local greyvalue invariants computedat interest points, for image retrieval from an image database. The database they usedcontained more than 1000 images, ranging from paintings to aerial images and 3D ob-jects. As interest point detector, they used the Harris corner detector with motivationthat under varying conditions, the most repeatable results are obtained from that detec-tor. In order to obtain invariance under rigid displacements of the image, differentialinvariants were used for features. The features were computed at multiple scales toreduce the sensitivity to scale. They used feature vectors with eight elements, eachfeature based on different Gaussian derivatives, to achieve rotational invariance.

Matching of images were performed in the following way:

• For the test image, a set of feature vectors was computed at the locations that correspond to the extracted interest points.

• For each such vector, a distance measure (the Mahalanobis distance) was computed to each model vector corresponding to images in the database.

• Voting took place when the distance measure was below a certain threshold, in which case the corresponding model received a vote.

• After the voting process, a winner-take-all strategy was used to determine the best match.

In order to improve matching, a semi-local constraint was used on the neighbouring vectors. For a matching vector to give a model a vote, the angles between matched neighbouring vectors had to be consistent with the angles between the corresponding model vectors.

For testing, they used the database images under different modifications. They were cropped, scaled and rotated, and for the 3D objects the viewpoint was changed between images. For all of these tests, recognition rates were between 99 and 100%. The change of scale seemed to be limited to a factor of two, which they explained by the stability of the Harris corner detector decreasing rapidly when the scale change was greater than 1.6.

Schiele and Crowley [Schiele and Crowley, 2000] used feature histograms to represent objects, thereby combining the use of histograms and the use of features. The argument is that colour histograms have proven to give good recognition results over different scales and some amount of occlusion, but that colour alone is not sufficient for discriminating all object classes. Instead they used multidimensional histograms to estimate the distribution of the responses of first order Gaussian derivatives.

The methods they used for matching included a histogram similarity measure, probabilistic matching using only a few test features, and an extension of that method for use in cluttered scenes.

The distance measure they used between histograms was the $\chi^2$-divergence, and they used the following formula to calculate it:

\[
\chi^2(Q, V) = \sum_{i=1}^{n} \frac{(q_i - v_i)^2}{q_i + v_i}
\]

where $V$ is the model histogram, $Q$ is the histogram to be matched, $q_i$ and $v_i$ are bin $i$ in the respective histograms and $n$ is the number of bins in the histogram.
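As an illustration of the divergence defined above, the following is a minimal Python sketch, not the implementation used by Schiele and Crowley. The function name `chi2_divergence` is hypothetical, and skipping bins that are empty in both histograms is an assumption made here to avoid division by zero.

```python
def chi2_divergence(q, v):
    """Chi-square divergence between two histograms with the same bins."""
    if len(q) != len(v):
        raise ValueError("histograms must have the same number of bins")
    total = 0.0
    for qi, vi in zip(q, v):
        denom = qi + vi
        # Assumption: a bin that is empty in both histograms contributes nothing.
        if denom > 0:
            total += (qi - vi) ** 2 / denom
    return total


if __name__ == "__main__":
    model = [0.5, 0.3, 0.2, 0.0]
    query = [0.4, 0.4, 0.2, 0.0]
    print(chi2_divergence(query, model))
```

Identical histograms give a divergence of zero, and two normalized histograms with no overlapping bins give the maximum value of 2.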

The probabilistic matching method they used was based on single, arbitrarily chosen features in the image. Using that single feature, the probability of each object in the database can be calculated using Bayes' rule:

\[
p(o_n \mid m_k) = \frac{p(m_k \mid o_n)\, p(o_n)}{p(m_k)} = \frac{p(m_k \mid o_n)\, p(o_n)}{\sum_i p(m_k \mid o_i)\, p(o_i)}
\]

where $p(o_n)$ is the a priori probability of object $o_n$, $p(m_k)$ is the a priori probability of feature vector $m_k$ and $p(m_k \mid o_n)$ is the probability density function of object $o_n$. This density function was approximated with a normalized histogram.

To be able to recognize an object, one feature is not enough, so a function is needed to calculate the probability of an object given many features. Having $K$ independent feature vectors $m_1, m_2, \ldots, m_K$ and assuming $p(o_n)$ to be $\frac{1}{N}$, where $N$ is the number of objects, the function for calculating the probability of an object can be written:

\[
p\left(o_n \,\Big|\, \bigwedge_k m_k\right) = \frac{\prod_k p(m_k \mid o_n)}{\sum_i \prod_k p(m_k \mid o_i)}
\]
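The combination rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `object_posteriors` and the nested-list layout of the likelihoods are hypothetical, and the likelihood values $p(m_k \mid o_i)$ are taken as given (in the cited work they were read from normalized histograms).

```python
def object_posteriors(likelihoods):
    """Posterior of each object given K independent features, uniform prior.

    likelihoods[i][k] holds p(m_k | o_i) for object i and feature k.
    """
    products = []
    for per_object in likelihoods:
        product = 1.0
        for p in per_object:
            product *= p
        products.append(product)
    total = sum(products)
    if total == 0.0:
        raise ValueError("all objects have zero likelihood for these features")
    # The uniform prior 1/N cancels between numerator and denominator.
    return [product / total for product in products]


if __name__ == "__main__":
    # Two objects, two observed features.
    print(object_posteriors([[0.2, 0.5], [0.1, 0.5]]))
```

For many features the products become very small, so a practical variant would probably sum logarithms instead; that is a design choice outside what the text describes.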

The testing was performed with 1327 test images and 2130 training images, representing 103 objects. The test images were cut in order to test the performance when varying parts of the objects were shown. When the entire objects were shown, the histogram intersection method, the $\chi^2$ method and the probabilistic method all had a recognition rate of 100%. When the part of the objects shown was reduced to 62%, the intersection measure and the probabilistic method still provided a recognition rate of 100%, while the rate for $\chi^2$ dropped to 99%. When 33% of the object was visible, the probabilistic matching still obtained about 99%, histogram intersection provided 94% and $\chi^2$ obtained 84% recognition.

Lowe [Lowe, 1999] addressed the problem of scale invariance and presented a method to achieve features that were more robust when subject to changes in scale and 3D projection. The suggested method for feature generation is called Scale Invariant Feature Transform (SIFT).

The first step in the method was to find the key locations in scale-space. This was done by finding maxima and minima of a difference of Gaussian function applied in scale space. The features were computed at these locations, see Lowe [Lowe, 1999, p. 1154] for details.

For matching of the features, a form of nearest-neighbour search was used. The matching feature indexed a hash table in which features that agreed on a model pose were clustered. Once the matching between features was done, the clusters in the hash table were sorted according to size. Each model hypothesis was then verified by performing a least-squares solution for the affine projection relating the model to the image.

Mikolajczyk and Schmid [Mikolajczyk and Schmid, 2001] also addressed the problem of scale invariance and presented a method similar to that of Lowe. The proposed method for finding key locations was a combination of the Harris corner detector and searching for maxima of the Laplacian over scales. The motivation for using the Harris corner detector was that it is robust to rotations. Since the Harris detector is not robust over scales, it was only used to locate points at each scale, and the points were then kept if they were a maximum of the Laplacian over scales. The repeatability of this Harris-Laplacian detector was better than that of only searching for maxima of the Laplacian in scale-space.

The local image features, computed at key locations and characteristic scale, were Gaussian derivatives of up to 4th order. The matching between features was done by nearest-neighbour using the Mahalanobis distance, and matches between images were either verified or rejected using RANdom SAmple Consensus (RANSAC).

Testing was performed with a database of 5000 images and 10 test sequences used for matching at different scales. At a scale factor of 1.4, 60% of the images were retrieved as the most similar one, while a scale factor of 4.4 retrieved 30% as the most similar. The correct image was among the five best matches for 100% of the images at a scale factor of 1.4 and among the five best matches for 50% at a scale factor of 4.4.

2.4 Discussion

The different methods for representation outlined in the previous sections have all been used more or less successfully for the task of object recognition. They are not directly comparable, since each method can be used in many different ways. Therefore, instead of looking at what recognition rates have been achieved with different implementations, it is more interesting to look at how well each approach handles the different variations of viewing conditions, such as transformations, lighting, etc.

The PCA approach is, because of its simplicity, an attractive method to implement and use as a basis for comparison with more advanced methods. It is very sensitive to object location, which is not necessarily a drawback since it gives very precise object positioning, however at the expense of computation time. The inherent sensitivity to scale changes is not only a weakness of PCA, but is more or less present in the other approaches as well. The scale invariance problem can be dealt with by performing segmentation and recognition at many scales. That is however not computationally efficient and makes the concept of finding characteristic scales at key locations very interesting. Using segmentation and recognition at different scales is not only used for PCA, but has been used as well by for example Mel [Mel, 1997] for feature computation and Roobaert [Roobaert, 2001] for recognition with a support vector machine (SVM) classifier. Histograms are more robust to small scale changes than PCA, since the characteristic distribution of colours in a window does not change drastically when the window size changes. Depending on the features used, a feature based method may or may not be very sensitive to scale.

PCA is very sensitive to the other conditions as well, i.e. varying lighting and object rotation. This could at least partly be overcome by using training data where the object is subject to different rotations and lighting. However, the sensitivity to background is not easily overcome, which might make PCA a little less suitable for real-world conditions. But it cannot be completely discarded because of that, since Turk and Pentland [Turk and Pentland, 1991] used PCA for real-time face tracking.

Colour histograms are an approach that has been proven to be able to distinguish between a number of objects [Swain and Ballard, 1991] and has also quite successfully been used for object detection [Swain and Ballard, 1991]. Since no geometric information is used, histograms are invariant to translation and rotation around the viewing axis. Histograms are semi-invariant to rotations around the two other axes, since after small rotations most of the same colours are still seen. At rotations of more than 180° however, no same surfaces are seen and at 90°, most surfaces are different. So to represent objects around the whole viewing sphere, several images are needed, but still considerably fewer than needed for representing the whole viewing sphere of an object using PCA. Computing a histogram is fast and linear in the number of pixels in the image. The major drawbacks of histograms are the sensitivity to changes of lighting condition and the sensitivity to background. Sensitivity to illumination intensity can be reduced in several ways, e.g. by using chromatic colour space. Dealing with differently coloured light sources is difficult, but on the other hand, this applies to other approaches as well.

When using colour histograms for object recognition, one assumption must be made: the colours of the objects must be sufficiently distinguishing, both between objects and between objects and background. It was stated in Section 1.1 that such assumptions about objects should not be made. However, it could still be useful to use colour histograms for detection of objects and use another method for recognition of the detected objects.

Using local image features has worked very well in several applications. By careful selection of features, rotation invariance can be obtained. Translation invariance can be obtained through the use of dynamically selected key locations, which is the way that has been used for real-world images and lately even on image database images. Invariance to background can be achieved, provided enough key locations can be detected within an object, since only features computed at the object borders depend on the background. Several methods have been used to enable detection and recognition at different scales. The most straightforward approach is to re-scale the test image and compute features at several scales. Lately, more sophisticated methods have been proposed, searching for characteristic scales at which features are computed. So the feature based approach is interesting for implementation, but careful thought must be given to how to compute features, decide key locations and handle scale.

3 Classification

This section presents some methods for classification and a brief summary of how they work. The problem of classification is the problem of, given data $x = (x_1, x_2, \ldots, x_n)$, deciding which class $C_i$ it belongs to from a set of classes $\{C_1, C_2, \ldots, C_m\}$. The set of methods presented are supervised learning algorithms, except for LVQ. The training data for such algorithms consist of pairs of data $x$ and the correct class $C_i$, provided by a “supervisor”.

For the task of object recognition, $x$ might be a histogram, a feature, a set of weights for eigenvectors etc.

3.1 Nearest Neighbour Classifier

The most naive method for classification is to directly compare the input data to the training data by computing a distance measure. The class of the nearest training data determines the class of the new point, hence the name Nearest Neighbour Classifier (NNC). A typical distance measure is the Euclidean distance. An example classification using NNC with Euclidean distance is seen in Figure 6, where the star is classified to belong to the class of triangles.

Figure 6: Using NNC, the star is classified to belong to the same class as the triangles.

An obvious drawback of this method is that a distance has to be computed to all training data, so matching may become very slow when the training data is large. There exist several methods to work around this problem.

Figure 7: Using kNN, with k = 5, the star is classified to belong to the same class as the squares.

Another problem is that no learning takes place, so the ability to generalize is generally poor.

In order to address the problem of learning, a variation is to look at the $k$ nearest neighbours and use majority voting on these points to determine the class of the new point. This method is called the k-Nearest Neighbour Classifier (kNN). A classification using kNN is seen in Figure 7, using the same training data and test data as in Figure 6. The star is now classified to belong to the class of squares, which by inspection seems to be more appropriate.
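The difference between NNC and kNN can be illustrated with a short Python sketch. The function name `knn_classify` and the toy coordinates are made up for this example (they only mimic the situation in Figures 6 and 7); with `k=1` the function behaves as the plain NNC.

```python
import math
from collections import Counter


def knn_classify(training, query, k=1):
    """Classify `query` by majority vote among its k nearest training points.

    `training` is a list of (vector, label) pairs; Euclidean distance is used.
    """
    ranked = sorted(training, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical data: one stray triangle lies inside the cluster of squares.
    training = [
        ((1.0, 1.0), "triangle"),
        ((1.5, 1.5), "square"), ((2.0, 1.0), "square"),
        ((1.0, 2.0), "square"), ((2.0, 2.0), "square"),
        ((5.0, 5.0), "triangle"), ((5.0, 6.0), "triangle"),
        ((6.0, 5.0), "triangle"),
    ]
    star = (1.2, 1.2)
    print(knn_classify(training, star, k=1))  # nearest point is the stray triangle
    print(knn_classify(training, star, k=5))  # majority of the 5 nearest are squares
```

Note that every call still computes the distance to all training points, which is the speed drawback mentioned earlier.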

Although not the best method for consistent learning and generalization, variations of NNC have often been used for object recognition purposes because of their simplicity and often very good results. Most of the examples in Section 2 use variations of NNC.

3.2 Learning Vector Quantization

One approach to address the main drawbacks of NNC is to use prototype vectors, instead of all the training data, together with NNC. By using a smaller number of data, the comparison process becomes less computationally expensive and outliers can be eliminated. Learning Vector Quantization (LVQ) is a method for construction of such prototype vectors [Kohonen, 1988]. The idea is to construct a small number $k$ of prototype vectors through an iterative training process. Once the prototype points are determined, they are considered the training points for NNC.

3.3 Neural Networks

The text in this section is based on Gurney [Gurney, 1997, Sections 2–3, and 6].

There exist several network architectures for classification tasks, where the multi-layer perceptron consisting of threshold logic units (TLUs) is one of them. It is a mathematical model used to imitate the function of biological neural networks. The TLU is a model for the behaviour of a single biological neuron and works as follows:

• The input vector $x = (x_1, x_2, \ldots, x_n)$ is summed with weights $w = (w_1, w_2, \ldots, w_n)$, forming an activation $a$ in the following way:

\[
a = \sum_{i=1}^{n} w_i x_i
\]

• The activation $a$ is then transformed, often using the sigmoid function $\sigma$, such that $y = \sigma(a)$. Functions other than the sigmoid can be used, but it is probably the most commonly used one.

• The transformed activation $y$ is thresholded by a threshold $\theta$, to get the value 0 or 1. This step is not always needed or desirable, for example when the unit is used in a network.
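The three steps above can be sketched as a single Python function. The name `tlu` and the convention of passing `theta=None` to skip the optional thresholding step are choices made for this illustration only.

```python
import math


def tlu(x, w, theta=None):
    """Threshold logic unit: weighted sum, sigmoid, optional threshold."""
    # Step 1: activation a = sum_i w_i * x_i
    a = sum(wi * xi for wi, xi in zip(w, x))
    # Step 2: squash the activation with the sigmoid function
    y = 1.0 / (1.0 + math.exp(-a))
    # Step 3 (optional): threshold the output to 0 or 1
    if theta is None:
        return y
    return 1 if y >= theta else 0


if __name__ == "__main__":
    print(tlu([1.0, 1.0], [2.0, 2.0]))             # graded output, as used inside a network
    print(tlu([1.0, 1.0], [2.0, 2.0], theta=0.5))  # thresholded output
```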

In a feed-forward network, the input goes into the input layer of TLUs. The output from the input layer is fed to the next layer, either a hidden layer or the output layer. There may be zero or more hidden layers, but finally the output from the output layer is computed. For the task of classification, each node (TLU) in the output layer might correspond to one class. The output can be thresholded to obtain the winning class, or kept as-is to get some kind of probability measure for class membership.

For information on training a feed-forward network, see Gurney [Gurney, 1997, Sections 4–6].

The training of neural networks in general includes much parameter tuning.

3.4 Support Vector Machines

Support vector machines (SVMs) are another method for classification, based on hyper-planes separating the classes of data with a margin. The margin is the distance between the hyper-plane and the closest training data. In the two-dimensional case, the hyper-plane is a line separating the two classes, see Figure 8. In the general case, the hyper-plane can be written:

\[
f(x) = w \cdot x + b
\]

where $x = (x_1, x_2, \ldots, x_n)$ is the data, $w = (w_1, w_2, \ldots, w_n)$ is the weight vector of the hyper-plane and $b$ is a real number.

The training data for an SVM consist of positive and negative examples, that is, the correct class $y$ is either $1$ or $-1$. If the two classes of data are linearly separable, a separating hyper-plane satisfies:

\[
\begin{aligned}
w \cdot x + b &\geq 1, && \text{if } y = 1 \\
w \cdot x + b &\leq -1, && \text{if } y = -1
\end{aligned}
\]

or written more compactly:

\[
y\,(w \cdot x + b) \geq 1 \tag{4}
\]

Figure 8: Separating hyper-plane (solid line) and margin (dotted lines). The data lying on the margin are the support vectors.

In order to obtain a margin, it is not enough to find a hyper-plane that satisfies the above condition; the hyper-plane must also maximize the margin. It can be shown that minimizing $\|w\|$ is equivalent to maximizing the margin [Roobaert, 2001, p. 32]. So the problem of finding the optimal separating hyper-plane becomes the problem of minimizing $\|w\|$ under condition (4).
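To make condition (4) and the role of the norm of the weight vector concrete, here is a small Python sketch that checks a given hyper-plane rather than training one. The function names are hypothetical. It relies on the standard fact that, when the closest points satisfy (4) with equality, their distance to the hyper-plane is one divided by the norm of the weight vector, which is why a smaller norm means a larger margin.

```python
import math


def decision_value(w, b, x):
    """f(x) = w . x + b"""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def satisfies_condition_4(w, b, samples):
    """True if y * (w . x + b) >= 1 holds for every (x, y) training pair."""
    return all(y * decision_value(w, b, x) >= 1 for x, y in samples)


def margin(w):
    """Distance from the hyper-plane to points with y * f(x) = 1."""
    return 1.0 / math.sqrt(sum(wi * wi for wi in w))


if __name__ == "__main__":
    samples = [((1.0, 0.0), 1), ((2.0, 1.0), 1), ((-1.0, 0.0), -1)]
    for w in [(1.0, 0.0), (2.0, 0.0)]:
        print(w, satisfies_condition_4(w, 0.0, samples), margin(w))
```

Both weight vectors in the usage example separate the data, but the shorter one gives the wider margin, so it is the one the optimization would prefer.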

How to compute the solution for the hyper-plane is not covered here, but the resulting vector $w^*$ will be a linear combination of the training vectors $x_i$, where only the support vectors have coefficients not equal to zero. The support vectors are the ones that lie at the margin distance from the hyper-plane.

For the case where the classes are not linearly separable, a kernel function can be used to map the data to a higher dimensional space where the classes are linearly separable and thus solvable with a linear SVM.

Roobaert [Roobaert, 2001] used support vector machines for object detection and classification, without using any intermediate representation. The idea was to use a pure learning approach, in order to eliminate the a priori knowledge built into the system. This approach was used to create a system which was supposed to be as general as possible. On the other hand, training data had to be selected in such a way that sufficient learning was accomplished.

3.5 Discussion

Many of the examples in Section 2 use NNC. When the key research topic is representation, NNC is commonly used as the classification strategy. When using NNC for PCA, classification is simply the task of comparing projections in the eigenspace. But when using features, the classification of features is done by NNC, while the classification of the object is often based on the results from the feature classification. Therefore some kind of voting algorithm is usually used, along with a verification which checks if the features voting for an object are consistent. The verification may consist of some geometrical constraints. Using NNC is interesting because of its simplicity and the good results that have been achieved by others, despite the simplicity. However, the drawbacks of this approach are that all training data must be stored and that the matching process is relatively slow, since all training examples must be considered.

Neural networks are known to be good at generalization and are therefore commonly used when there are many training examples. There are however many drawbacks. As mentioned earlier, training a neural network generally includes much fine-tuning of parameters and is often a slow process even without the tuning. For the classification task, it is also difficult to get some kind of certainty measure, as the network works much like a black box and the activity within is difficult to interpret. Although a well trained neural network can be a very good classifier, the drawbacks of added complexity and training time rule out the neural network as a classifier in this work. Too much focus would probably be stolen from the representation problem, and no classifier will make up for an insufficient representation.

The support vector machine, like the neural network, is good at generalization. And also like the neural network, implementing a classifier using SVM would steal too much time from the representation problem. However, Roobaert [Roobaert, 2001] used SVM for object detection and classification, without any intermediate steps of representation, focusing only on the SVM. The results were impressive and show that the representation is not necessarily the most important thing to consider.

For these reasons, the simplicity of NNC makes it the best choice for this work, in order to focus more on the representation problem.

4 Implementation

This section briefly covers the implemented methods and their implementation details.

All the implementations described in this section have in common that they use the NN-classifier. The motivation for not using a more sophisticated classifier is, as stated in Section 3.5, to be able to focus on the representation instead.

4.1 Principal Component Analysis

Figure 9: Raisins before (top) and after (bottom) multiplication by a Gaussian mask.

The motivation for implementing a recognition algorithm using PCA was that it is a straightforward method to implement. It was also interesting to compare to the other implementations.

Figure 10: Example detection results for raisins (left) and rice (right) when not using Gaussian mask. Corresponding maps are shown in the bottom row. [Values annotated in the panels: raisins 0.02, 0.02, 0.02; rice 0.01, 0.01, 0.01.]

The training data consisted of a set of $n$ colour images of the objects, each of size 35x35 pixels, with different rotations around the vertical axis and different lighting. For computation of the eigenspace, the images were loaded and rearranged into a matrix $A$ of size 3675x$n$, with each column containing one image (35 · 35 · 3 = 3675 values). The average of all columns was subtracted from each column of $A$. The first 30 eigenvectors were extracted from the covariance matrix $AA^T$, using the trick from Turk and Pentland [Turk and Pentland, 1991] of first calculating the eigenvectors of the matrix $A^TA$. The size of $AA^T$ was 3675x3675, versus $n$x$n$ for $A^TA$, resulting in a much faster computation of the eigenvectors, since $n$ was typically much smaller than 3000. Let $e$ and $\lambda$ be the eigenvectors and corresponding eigenvalues of $A^TA$; then the eigenvectors $e'$ of $AA^T$ can be computed using linear combinations of $e$. This was implemented by calculating $e' = Ae$, which was then normalized to unit length. For testing, between five and twenty eigenvectors were used to define the eigenspace.
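The reason the small matrix can stand in for the large one follows from a short derivation, given here as a worked step; it uses only the symbols already introduced ($A$, $e$, $e'$ and the eigenvalue $\lambda$) and assumes $e$ has unit length and $\lambda > 0$.

```latex
\begin{align*}
A^{T}A\,e &= \lambda\, e \\
A\,(A^{T}A\,e) &= \lambda\,(A e) \\
(AA^{T})\,(Ae) &= \lambda\,(Ae)
\end{align*}
% Hence e' = Ae is an eigenvector of AA^T with the same eigenvalue,
% provided Ae is not the zero vector. Its length follows from
\[
\lVert Ae \rVert^{2} = e^{T}A^{T}A\,e = \lambda\, e^{T}e = \lambda
\quad\Longrightarrow\quad
\frac{e'}{\lVert e' \rVert} = \frac{Ae}{\sqrt{\lambda}}
\]
```

So normalizing $e' = Ae$ to unit length amounts to dividing by $\sqrt{\lambda}$, and only an $n$x$n$ eigenproblem has to be solved.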

The representation of the objects was then formed by the projections of the training images onto the eigenspace. For detection and recognition, NNC was used on the Euclidean distance between the projection of the test image and the projections of the training images.

For the detection, the projection of a window in the test image was calculated at each point. The distance to the nearest projection of the training data, for each such point, was stored in a map. Local minima in that map corresponded to points in the test image where there was a high probability that a learned object was present.

For recognition, the nearest neighbour classifier was used on the projections onto the eigenspace.

A variation of this detection and recognition algorithm was implemented. When testing the above implementation, it became obvious that the background had too much impact on the results. The objects in the training images typically only occupied 25 to 50 percent of the image, with the rest being background, which suggested giving different weights to different parts of the image. Since the objects were centered in the training images, the weighting method chosen was to use a Gaussian mask on the training images and test images, to give more weight to the center parts of the image in the recognition process.
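The detection procedure described above can be sketched as follows. This is a simplified illustration and not the thesis code: images are plain nested lists, and the callable `nearest_distance` is a hypothetical stand-in for "project the window onto the eigenspace and return the distance to the nearest training projection". Treating only strict minima over the 8-neighbourhood as hypotheses is also an assumption.

```python
def detection_map(image, size, nearest_distance):
    """Slide a size x size window over `image` and store one distance per position."""
    rows = len(image) - size + 1
    cols = len(image[0]) - size + 1
    dmap = []
    for r in range(rows):
        row = []
        for c in range(cols):
            window = [line[c:c + size] for line in image[r:r + size]]
            row.append(nearest_distance(window))
        dmap.append(row)
    return dmap


def local_minima(dmap):
    """Positions whose value is strictly smaller than all 8-connected neighbours."""
    minima = []
    for r, row in enumerate(dmap):
        for c, value in enumerate(row):
            neighbours = [
                dmap[rr][cc]
                for rr in range(max(0, r - 1), min(len(dmap), r + 2))
                for cc in range(max(0, c - 1), min(len(row), c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(value < n for n in neighbours):
                minima.append((r, c))
    return minima


if __name__ == "__main__":
    template = [[1, 1], [1, 1]]  # toy "learned object"

    def toy_distance(window):
        # Sum of absolute differences, standing in for the eigenspace distance.
        return sum(abs(p - t) for wr, tr in zip(window, template) for p, t in zip(wr, tr))

    image = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
    dmap = detection_map(image, 2, toy_distance)
    print(dmap)
    print(local_minima(dmap))
```

Evaluating the distance at every window position is what makes this approach precise in position but costly in computation time.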

Figure 11: Example detection results for raisins (left) and rice (right) when using a Gaussian mask. Corresponding maps are shown in the bottom row. [Values annotated in the panels: raisins 1.84, 1.05, 0.73; rice 3.00, 2.11, 1.97.]

This idea was implemented and tested. Every image, training images as well as test images, was multiplied by the Gaussian mask before being used as input to the recognition system in the same way as described above.
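A minimal Python sketch of such a mask is given below. The width `sigma` is a hypothetical parameter, since the value used is not stated here, and the sketch handles a single channel; for colour images the same mask would presumably be applied to each channel.

```python
import math


def gaussian_mask(size, sigma):
    """size x size mask with value 1 at the centre, falling off as a Gaussian."""
    centre = (size - 1) / 2.0
    return [
        [
            math.exp(-((x - centre) ** 2 + (y - centre) ** 2) / (2.0 * sigma ** 2))
            for x in range(size)
        ]
        for y in range(size)
    ]


def apply_mask(image, mask):
    """Multiply an image by a mask of the same size, pixel by pixel."""
    return [
        [pixel * weight for pixel, weight in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask)
    ]


if __name__ == "__main__":
    mask = gaussian_mask(35, sigma=8.0)  # sigma chosen arbitrarily for the example
    print(mask[17][17], mask[0][0])      # centre weight versus corner weight
```

A small `sigma` suppresses more background but also discards more of the object, which matches the trade-off visible in Figure 9.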

The results of applying the Gaussian mask to two training images are shown in Figure 9. The two bottom images are clearly more similar than the two top images, since the background has been eliminated. However, some information about the raisins package is also lost.

The difference in detection results between not using the mask and using it is illustrated in Figures 10 and 11. Figure 10 shows the hypotheses for the raisins package and the rice package and the corresponding maps, when not using the Gaussian mask. Figure 11 shows the results for the same objects and image when using the Gaussian mask. The peaks are clearly more distinct.

4.2 Colour Histograms

The second method which was implemented was a segmentation algorithm based on colour histograms. The motivation for that was that many of the objects (Figure 2) have very distinct colours, making them easy to segment from the background.

For the histogram representation, chromatic colour space was used. Training images for an object were first normalized according to Eq. 1, 2 and 3.

Since only two degrees of freedom remain when using chromatic colour space, the histograms used were two-dimensional, with red and green being the two axes. Each dimension of the histogram was divided into ten bins, for a total of 100 bins.

A histogram was computed for each training image and normalized so the sum of the bins was one.

For each object, an average histogram was computed from the histograms of the different training images. The motivation for an averaged histogram was that objects do not have the same colours on every side and under different illumination. The precision of the distance measure was reduced by this, but on the other hand, the idea was to enable detection of objects under different poses and lighting conditions. By selection of training images, the impact of the averaged histogram could be controlled. Giving only one training image, the average histogram simply was the histogram of that image, while giving many images with different viewing conditions made the average histogram correspond to the object under more general conditions. Since this algorithm was only supposed to be used for initial segmentation, it did not matter much if it could not identify the objects precisely, since a recognition algorithm was supposed to do that. Of most importance was that the algorithm would generate good hypotheses about object locations.

As distance measure, the Euclidean distance as well as histogram intersection was used. The segmentation was done in much the same way as in Section 4.1: a map of an object similarity measure was computed for each object. In this case the measure was the distance between the histogram of the window and the average histogram of each object respectively.

Local minima or maxima (depending on whether Euclidean distance or histogram intersection was used) in these computed maps were the hypotheses about object locations.
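The histogram representation and the two comparison measures can be sketched in Python as below. This is an illustration under assumptions: the chromatic normalization is taken to be the usual r = R/(R+G+B), g = G/(R+G+B) (Eq. 1 to 3 are defined elsewhere in the thesis), pixels with zero intensity are skipped, and all function names are hypothetical.

```python
import math


def chromatic_histogram(pixels, bins=10):
    """Normalized bins x bins histogram over chromatic red and green."""
    hist = [[0.0] * bins for _ in range(bins)]
    count = 0
    for r, g, b in pixels:
        s = r + g + b
        if s == 0:
            continue  # assumption: black pixels carry no chromatic information
        ri = min(int(r / s * bins), bins - 1)
        gi = min(int(g / s * bins), bins - 1)
        hist[ri][gi] += 1.0
        count += 1
    if count:
        hist = [[value / count for value in row] for row in hist]
    return hist


def average_histogram(histograms):
    """Bin-wise mean of several histograms of the same size."""
    n = len(histograms)
    bins = len(histograms[0])
    return [[sum(h[i][j] for h in histograms) / n for j in range(bins)] for i in range(bins)]


def euclidean_distance(h1, h2):
    """Small value means similar histograms (look for minima in the map)."""
    return math.sqrt(sum((a - b) ** 2 for r1, r2 in zip(h1, h2) for a, b in zip(r1, r2)))


def intersection(h1, h2):
    """Large value means similar histograms (look for maxima in the map)."""
    return sum(min(a, b) for r1, r2 in zip(h1, h2) for a, b in zip(r1, r2))


if __name__ == "__main__":
    red = chromatic_histogram([(255, 0, 0)] * 4)
    green = chromatic_histogram([(0, 255, 0)] * 4)
    model = average_histogram([red, green])
    print(euclidean_distance(red, model), intersection(red, model))
```

The usage example also shows the loss of precision caused by averaging: a purely red window only reaches an intersection of 0.5 with a model averaged over a red and a green view.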

4.3 Local Features

Figure 12: Two training images, and the results of calculating Harris corner strength measure.

The third method which was implemented was a recognition algorithm based on local features for representation. Also, an extension to this used local colour histograms for verification of matched features. The motivation for using local features was that it is a method that in the literature has shown good results regarding robustness to different transforms and in particular to partial occlusion. The motivation for using local colour histograms was to see if the recognition rate is improved.

The training images were represented by the extracted features. In each training image, features were computed at key locations. Since the training images were only of size 35x35 pixels, only between four and ten corners were detected in each image. Therefore the maximum number of key locations used in each image was limited to the five strongest corners, in order to suppress the impact of noise and background. Figure 12 shows typical corner strength measures on two training images. For the detection of key locations, the Harris corner detector was used. Schmid and Mohr [Schmid and Mohr, 1997] performed tests that showed that the repeatability of the Harris corner detector was about 90% under different rotations.

The idea behind the Harris corner detector is to use auto-correlation to find locations where the signal (image) changes in two directions. A matrix related to the auto-correlation function is computed [Schmid and Mohr, 1997]:

\[
\exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right) \otimes
\begin{bmatrix}
I_x^2 & I_x I_y \\
I_x I_y & I_y^2
\end{bmatrix}
\]

where $I_x$ and $I_y$ are the first derivatives of the image in the x- and y-directions and $\otimes$ denotes convolution. The eigenvalues of this matrix are the principal curvatures of the auto-correlation function. For convenience, a corner measure can be defined to eliminate the need to define corner strength in terms of eigenvalues [Harris and Stevens, 1988]:

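\[
R = AB - C^2 - k\,(A + B)^2
\]

where

\[
A = I_x^2 \otimes w, \qquad
B = I_y^2 \otimes w, \qquad
C = I_x I_y \otimes w, \qquad
w_{x,y} = \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right)
\]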

The value of $k$ has been suggested to be $0.04$. A local maximum in $R$ indicates a corner point. This choice of $k$ may seem arbitrary, so another measure has been suggested [Kovesi]:

\[
R = \frac{AB - C^2}{(A + B)^2}
\]

That is the measure which was used here for detecting corners.

Once the key locations were identified, the features were computed. They consisted of nine Gaussian derivatives of up to third order, on each of the red and green colour channels. Also, the mean red and green value in chromatic colour space, in a local neighbourhood, was computed. This made up a feature vector of twenty dimensions in total.
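To compare the two corner measures discussed in this section, the following Python sketch evaluates both for given values of $A$, $B$ and $C$ at a single pixel. It does not compute the image derivatives or the Gaussian smoothing; the function names are hypothetical, and returning zero when $A + B = 0$ (a completely flat region) is an assumption.

```python
def harris_measure(a, b, c, k=0.04):
    """R = AB - C^2 - k (A + B)^2, with the suggested k = 0.04."""
    return a * b - c * c - k * (a + b) ** 2


def ratio_measure(a, b, c):
    """R = (AB - C^2) / (A + B)^2, which needs no constant k."""
    trace = a + b
    if trace == 0:
        return 0.0  # assumption: a flat region gets zero corner strength
    return (a * b - c * c) / trace ** 2


if __name__ == "__main__":
    # Strong gradients in both directions (corner-like) versus one direction (edge-like).
    for name, (a, b, c) in {"corner": (1.0, 1.0, 0.0), "edge": (1.0, 0.0, 0.0)}.items():
        print(name, harris_measure(a, b, c), ratio_measure(a, b, c))
```

The ratio form is bounded above by 0.25, reached when the gradient energy is equal in both principal directions, so it ranks corners without depending on the overall image contrast.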

Each of the five objects was represented by 300 training images, for a total of 1500 training images, generating a total of 7500 features.

For recognition, features were computed in the test images in the same way as for the training images. The only difference was that the number of key locations was not limited to five; instead, only the corners with a corner measure higher than the average corner measure were used. A number of different approaches were tested, but this approach showed the highest robustness.

The features from the test image were compared to the features from the training images using NNC. To perform classification of a test image, a variant of majority voting was used. Instead of giving each matched feature one vote, the object was given a vote inversely proportional to the distance between the feature from the test image and the feature from the training images. The idea was to give close matches a strong vote and far matches a weak vote, in order to minimize misclassification caused by false matches from the background.
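The weighted voting scheme can be sketched as follows. This is an illustration, not the original implementation: the function name, the two-dimensional toy features and the small constant `eps` (added to avoid division by zero for exact matches) are assumptions made for the example.

```python
import math
from collections import defaultdict


def classify_by_weighted_votes(test_features, training_features, eps=1e-9):
    """Each test feature votes for the object of its nearest training feature.

    `training_features` is a list of (vector, object_label) pairs. The vote
    strength is inversely proportional to the Euclidean distance of the match.
    """
    votes = defaultdict(float)
    for feature in test_features:
        nearest_vec, nearest_label = min(
            training_features, key=lambda item: math.dist(item[0], feature)
        )
        votes[nearest_label] += 1.0 / (math.dist(nearest_vec, feature) + eps)
    winner = max(votes, key=votes.get)
    return winner, dict(votes)


if __name__ == "__main__":
    training = [((0.0, 0.0), "mug"), ((10.0, 10.0), "rice")]
    # One very close match to "mug" and two distant matches to "rice".
    test = [(0.0, 0.1), (6.0, 6.0), (6.0, 7.0)]
    print(classify_by_weighted_votes(test, training))
```

In the usage example plain majority voting would pick "rice" (two matches against one), while the weighted scheme picks "mug" because its single match is far closer.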

4.3.1 Verification using local histograms

Since no geometric constraints were used in the plain matching of features, an attempt was made to improve classification by verifying matched features through matching of the corresponding local colour histograms.

The features were computed as described above, but in addition, local histograms were computed in the five by five pixel neighbourhood of the key locations. Two-dimensional histograms with ten bins along each axis were used, normalized to sum up to one.

For the matching, features were computed in the same way as described previously and the corresponding local colour histograms were computed. NNC was used for matching features in training images to each test feature. Voting was performed in a slightly different way than before. Instead of giving each vote a strength inversely proportional to the distance between the matching features, the similarity of the corresponding local colour histograms determined the vote strength. The similarity measure used was the Euclidean distance, which was thresholded using two thresholds. A distance below the low threshold gave a strong vote, while a distance between the thresholds gave a medium vote. If the distance was above both thresholds, a weak vote was given. The idea behind using those thresholds was to favour matches that were close in colour. Weak matches were still given votes, since histogram matching is not a perfect way of verification and is sensitive to varying lighting and background.

An object probability measure was defined for the task of verification, i.e. answering the question “Does this image depict object $k$?”. The purpose of this measure was to be used together with a detection algorithm, to verify that the detected object locations were in fact correct. The measure was defined as:

$$M_k = \frac{V_k}{\sum_i V_i}$$

where $M_k$ is the pseudo-probability that the image depicts object $k$, and $V_i$ is the number of votes for object $i$. The question was answered yes if the measure $M$ was larger for object $k$ than for all other objects, and no otherwise. This verification measure is clearly naive, and is probably not a very good one. The motivation for using it was to quickly run some tests together with an object detection algorithm, since time for this was running out.
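As a small worked example of the measure, the sketch below normalizes a set of vote totals and answers the verification question for one object. The function names and the vote values are made up for illustration.

```python
def object_measure(votes):
    """Return the pseudo-probability of each object: its votes divided by all votes."""
    total = sum(votes.values())
    if total == 0:
        return {k: 0.0 for k in votes}
    return {k: v / total for k, v in votes.items()}

def verify(votes, k):
    """Answer yes only if object k has a strictly larger measure than every other object."""
    m = object_measure(votes)
    return all(m[k] > m[j] for j in m if j != k)
```

For example, vote totals of 6, 3 and 1 for rice, mug and fanta give measures of 0.6, 0.3 and 0.1, so the hypothesis "rice" is accepted and the hypothesis "mug" is rejected.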


5 Experimental Evaluation

All tests were performed using training images from a set of images of size 35x35 pixels, of five different objects, namely a raisins box, a rice box, a mug, a fanta bottle and a cleaner bottle, see Fig. 2. The images were acquired under a few different lighting conditions and with both black and white backgrounds. The training images were not perfectly segmented. For each lighting and background condition, the objects were rotated 12° around the vertical axis between each image, for a total of 360°.

Figure 13: One of the images used for detection tests.

For object detection tests, thirteen images from a large set of test images were used. They consisted of all the objects arranged in different ways in an indoor environment, e.g. on a table or on a shelf, and with different lighting. The objects were partly occluded in some images, including occlusions by another object. Figure 13 shows one of the test images. The results of performing detection of the rice package in this image are shown for each method in the following sections.

Figure 14: Examples of real-world test images.

For recognition tests, two sets of fifty images each were used, with ten images for each object. One set consisted of objects on a homogeneous background and the other of objects in a real-world environment. The real-world test images were obtained from the detection test images by cutting out the objects so that they were roughly centered in the new images. Hence, the test images included cluttered backgrounds and different lighting. In some cases, the test images contained parts of other objects, making it more difficult for the recognition algorithm to give a correct classification. The test images with homogeneous background were taken from the same set of images as the training set, but no images in the training data set were used for testing. For a few examples of test images, see Fig. 14.

Recognition rate in these tests is defined as the percentage of object images that were correctly classified by the best match.

5.1 PCA

[Figure 15 image: panel titled “rice”; the three marked matches are labelled 3.00, 2.11 and 1.97.]

Figure 15: Detection results for the rice package in the example image using the PCA implementation.

The training set for the PCA recognition consisted of 300 images for each object, for a total of 1500 images. The eigenspace was computed from every fifth of these images, for a total of 300 images.
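The subsampling and projection steps can be sketched as below. This is only an illustration under stated assumptions: the eigenvectors are obtained here with a singular value decomposition in NumPy, which may differ from how the eigenspace was actually computed, and all function names are hypothetical.

```python
import numpy as np

def build_eigenspace(images, n_vectors=20, step=5):
    """Compute a mean image and the leading eigenvectors from every `step`-th image.

    images: (N, 35*35) array of flattened training images.
    """
    subset = images[::step]
    mean = subset.mean(axis=0)
    # Rows of vt are the principal directions of the mean-centred subset.
    _, _, vt = np.linalg.svd(subset - mean, full_matrices=False)
    return mean, vt[:n_vectors]

def project(images, mean, basis):
    """Project flattened images into the eigenspace."""
    return (images - mean) @ basis.T

def nearest_neighbour(test_coords, train_coords, train_labels):
    """Label a projected test image by its nearest projected training image."""
    dists = np.linalg.norm(train_coords - test_coords, axis=1)
    return train_labels[int(np.argmin(dists))]
```

Note that only the eigenspace is computed from the subset; all training images can still be projected into it and used for nearest-neighbour matching.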

The test images used for recognition tests were the same as mentioned above, with the modification that they were resized to 35x35 pixels. This was done by padding the images and stretching them to size, which introduced small deformations. All tests presented in this section were performed with all five objects.

[Figure 16 plot: recognition rate (%) against number of eigenvectors (10 to 50), with one curve for real images and one for segmented images.]

Figure 16: Recognition rates with varying number of eigenvectors.

Early tests of the PCA recognition implementation on the real-world images gave a recognition rate of 32 percent, using ten eigenvectors to define the eigenspace. The result was so low that no further tests were performed using this algorithm. Instead, as mentioned in Section 4.1, the algorithm was modified by multiplying all test and training images by a Gaussian mask before using them. All further testing was done with this modified algorithm.
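A Gaussian mask of this kind could look roughly as follows. The sketch assumes a square greyscale image, a mask centred on the image with a peak value of one, and σ measured in pixels; these details and the function names are assumptions.

```python
import numpy as np

def gaussian_mask(size=35, sigma=6.0):
    """Centred two-dimensional Gaussian weight mask with peak value one."""
    coords = np.arange(size) - (size - 1) / 2.0
    yy, xx = np.meshgrid(coords, coords, indexing="ij")
    return np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))

def apply_mask(image, sigma=6.0):
    """Suppress the borders of a square greyscale image, where background dominates."""
    return image * gaussian_mask(image.shape[0], sigma)
```

A small σ keeps only the centre of the image, while a large σ lets more of the object, and eventually the background, contribute to the match.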

To decide on an appropriate number of eigenvectors to define the eigenspace, recognition tests were performed on both test sets with a varying number of eigenvectors, ranging from ten to fifty. The value of σ for the Gaussian mask was initially chosen to be six pixels in these tests. The results of varying the number of eigenvectors can be seen in Figure 16.

To decide on an appropriate value of σ for the Gaussian mask, recognition tests were performed on both test sets with varying values of σ, ranging from four to ten. In these tests, 20 eigenvectors were used to define the eigenspace. The recognition results for these tests are presented in Figure 17.

As seen in Figure 16, recognition results did not improve much when increasing the number of eigenvectors used to define the eigenspace. The rate lies steadily around 70 percent for the real-world images, with a slight drop in performance when ten eigenvectors are used, and a slight increase when 40 or more vectors are used. The case of images with homogeneous background is almost identical, but with a recognition rate that lies steadily around 80 percent.

Since one of the goals was to make a system capable of operating fast, as few eigenvectors as possible should be chosen to define the eigenspace. Using only ten eigenvectors did show a slight drop in performance on the test images. Using 20 eigenvectors gave the best performance for both the real-world test images and the test images with homogeneous background, until 40 eigenvectors were used. However, doubling the computational burden did not seem justified by the very small increase in performance. Moreover, since only 50 test images were used for each set, it is possible that the increase in performance would not carry over to other sets of test images. Therefore, 20 eigenvectors were chosen to define the eigenspace for further testing, in order to balance speed and performance.

[Figure 17 plot: recognition rate (%) against value of sigma (4 to 10), with one curve for real images and one for segmented images.]

Figure 17: Recognition rates with varying σ for the Gaussian mask.

The results of varying σ for the Gaussian mask were less similar between the real-world images and those with homogeneous background than the results of varying the number of eigenvectors. For the real-world images, the recognition rate was quite steady for σ between four and eight. But when σ was increased to nine and ten, the recognition rates dropped noticeably. For the test images with homogeneous background, the curve looked a bit different. The recognition rate was 74 percent when σ was four, and increased steadily until σ equaled nine, where the recognition rate was 88 percent. The recognition rate dropped to 84 percent when σ was increased to ten.

The different-looking curves for the two test sets could be explained by the high sensitivity to background. For the homogeneous background images, using a small σ removed the background influence completely. Increasing σ meant that a larger part of the object was used for recognition, hence improving the ability to discriminate between objects. Since the background did not change much between images, it did not have much impact on the recognition rate when only a small portion of the background was included. However, when σ was increased to ten, the background influence became too large and the recognition rate dropped.

Table 1: Detection rate (%) for the different objects, using 20 eigenvectors and σ = 8.

object     highest peak    three highest peaks
raisins    62              69
rice       77              77
mug        15              23
fanta      46              54
cleaner    54              69
total      51              58

For the real-world images, the background was very cluttered and changed much between test images. Therefore the recognition rate dropped when σ was increased enough to include even small parts of the background.

For the detection tests, the thirteen test images described above were used. They were searched at five different scales with a ratio of 1.2 between successive scales. An object was considered correctly detected if the highest peak in the voting matrix was within the boundaries of the object, regardless of scale. The results are shown in Table 1. The second column shows the detection rate if the three highest peaks in the voting matrix are considered.
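The scale ladder and the correctness criterion can be sketched as below. The base scale of 1.0 and the bounding-box format for the object boundaries are assumptions made for illustration; the text only specifies five scales with a ratio of 1.2.

```python
import numpy as np

def search_scales(n_scales=5, ratio=1.2, base=1.0):
    """Geometric ladder of search scales; the base scale is a hypothetical choice."""
    return [base * ratio ** i for i in range(n_scales)]

def peak_is_correct(vote_matrix, box):
    """Check whether the highest peak of the voting matrix lies inside the object.

    box: (row_min, row_max, col_min, col_max), inclusive; a hypothetical
    ground-truth format for the object boundaries.
    """
    row, col = np.unravel_index(np.argmax(vote_matrix), vote_matrix.shape)
    r0, r1, c0, c1 = box
    return bool(r0 <= row <= r1 and c0 <= col <= c1)
```

With these assumptions the largest search scale is 1.2 to the fourth power, or roughly twice the smallest.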

Clearly, some objects are easier to find than others using this algorithm. The rice package is found correctly in 77 percent of the test images, while the mug is found correctly in only 15 percent of the cases.

Figure 15 shows the three highest matching locations for the rice package in the example image. The different sized boxes correspond to matches at different scales.
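Since only thirteen test images were used, each detection corresponds to a step of roughly 7.7 percentage points. The sketch below uses hypothetical hit counts, which are not reported in the text but were chosen so that the rounded rates reproduce the highest-peak column of Table 1.

```python
N_IMAGES = 13

# Hypothetical hit counts (not reported in the text), consistent with Table 1.
hits_top1 = {"raisins": 8, "rice": 10, "mug": 2, "fanta": 6, "cleaner": 7}

def rate(hits, n):
    """Detection rate in whole percent."""
    return round(100.0 * hits / n)

rates = {name: rate(h, N_IMAGES) for name, h in hits_top1.items()}
total = rate(sum(hits_top1.values()), N_IMAGES * len(hits_top1))
```

Under this reading, the difference between the mug and the rice package amounts to eight images out of thirteen, which is worth keeping in mind when comparing the rates.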

5.2 Colour Histograms

Table 2: Detection rates (%) of the different objects, using average histograms and Euclidean distance.

object     highest peak    three highest peaks
raisins    38              62
rice       77              100
mug        100             100
fanta      54              62
cleaner    100             100
total      74              85

The primary objective of the implemented histogram algorithm was to perform object detection in images. However, recognition tests were also performed on the two sets of fifty test images, to allow a comparison with the other implementations.

For the recognition test, two different distance measures were used. The first was the Euclidean distance, and the second was histogram intersection, described in Section 2.2. The results of these tests can be found in Table 4.

[Figure 18 image: panel titled “rice”; the three marked matches are labelled 46.22, 36.18 and 30.25.]

Figure 18: Detection results for the rice package in the example image using the histogram implementation.

For the detection tests, the Euclidean distance and histogram intersection were used. The same thirteen test images as for the PCA tests were used. The resulting detection rates for the objects are shown in Table 2 for the Euclidean distance, and in Table 3 for histogram intersection. As was the case with PCA, some objects were more easily detected with this algorithm than others. The mug and the cleaner were both found correctly in all images. The overall performance of the histogram approach used here was better than that of the PCA approach. The overall difference between using Euclidean distance and histogram intersection is hardly noticeable. Using histogram intersection does improve the detection rate for the raisins, but on the other hand decreases the detection rate for the rice, compared to Euclidean distance.
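The two measures can be compared on a small example. The sketch assumes histograms that are normalized to sum to one, in which case histogram intersection reduces to the sum of bin-wise minima; the function names are illustrative.

```python
import numpy as np

def euclidean_distance(h1, h2):
    """Dissimilarity: zero for identical histograms, larger for more different ones."""
    return float(np.linalg.norm(np.asarray(h1, float) - np.asarray(h2, float)))

def histogram_intersection(h1, h2):
    """Similarity in [0, 1] for histograms that sum to one; one means identical."""
    return float(np.minimum(np.asarray(h1, float), np.asarray(h2, float)).sum())
```

Note the opposite orientation of the two measures: the best match minimizes the Euclidean distance but maximizes the intersection.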

Figure 18 shows the three best matching locations for the rice package in the image. As can be seen, the windows used for computing the histograms were small compared to the objects.

5.3 Local Features

The objective of the feature-based implementation was object recognition. No tests for object detection rates were performed, since this approach was not suitable for such tasks the way it was implemented. However, the recognition algorithm was modified to give a verification of a hypothesis of a certain object in an image. This was used together with an existing detection algorithm based on co-occurrence histograms, to see if detection rates could be improved. The test images used for those tests were the same as for the PCA and histogram tests.

Table 3: Detection rates (%) of the different objects, using average histograms and histogram intersection.

object     highest peak    three highest peaks
raisins    46              62
rice       69              100
mug        100             100
fanta      54              69
cleaner    100             100
total      74              86

Table 4: Recognition rates (%) using different distance measures.

test set                  Euclidean    intersection
real-world images         34           52
homogeneous background    34           52

Recognition tests were first performed with three objects, varying the value of σ for the Gaussian derivatives to find an appropriate value. The real-world images were used. The resulting recognition rates are shown in Figure 20. For further tests, the value of σ was chosen to be five, since that provided the best recognition rate in these tests.

Tests were then performed using different numbers of objects, both for the feature-only algorithm and for the algorithm that used local colour histograms for verification. In addition, these tests were also performed using only the local colour histograms as features. The tests were performed using the real-world images and the images with homogeneous background. The recognition rates are shown in Figure 21 for the segmented images and in Figure 22 for the real-world images.

It is clear that it was an easy task to classify the test images with homogeneous background, both for the feature-only algorithm and for the algorithm verifying feature matches by local histograms. In all cases, a 100 percent recognition rate was achieved. The very naive test of using local colour histograms as features gave an 84 percent recognition rate for five objects. That is fully comparable to the best recognition rate using PCA for the same set of test images.

For the real-world images, it becomes obvious that a verification step for matching features is in order. When only feature matching was used, the recognition rate was 70 percent using five objects. When the very simple verification step by local histogram matching was used, the recognition rate increased to 80 percent. Using only local colour histograms as features gave a recognition rate of 56 percent for five objects.

The detection tests were performed by first using the co-occurrence histogram algorithm to find possible object locations. The first column of Table 5 shows the rates at which the correct location was chosen as best match by this algorithm. After possible object locations were found, each of them was verified using local features with histogram verification. The second column of Table 5 shows the rate at which the correct location was chosen as best match after this verification process.

[Figure 19 images: left panel titled “segmentation for rice”, with marked locations labelled 78.52, 65.80 and 61.50; right panel titled “After verification, object rice”, with marked locations labelled 0.93, 0.57 and 0.41.]

Figure 19: Detection results for the rice package after using the existing co-occurrence based segmentation algorithm (left), and results after applying the feature-based verification step (right).

Table 5: Detection rates (%) using co-occurrence histogram, and a verification step based on local features.

object     after detection    after verification
raisins    85                 85
rice       92                 85
mug        92                 85
fanta      54                 15
cleaner    92                 100
total      83                 74

Figure 19 shows the result of detecting the rice package in the example test image with the existing co-occurrence based algorithm, and then performing the verification step.

[Figure 20 plot: recognition rate (%) against value of sigma (4 to 7).]

Figure 20: Recognition rate using different values of σ.

6 Discussion

From the test results, it is clear that none of the implemented algorithms provides good enough results to be used directly in the service robot. The tests do, however, say something about the feasibility of each approach for further research. Looking only at the numbers, the histogram-based approaches show the best results for object detection, with a 74 percent detection rate for the r-g histogram approach in chromatic colour space that was implemented in this work, and an 83 percent detection rate for the existing co-occurrence histogram algorithm. Histogram-based object detection can be implemented to run in time proportional to the number of pixels in the image to search.

The PCA-based implementation showed, as expected, high sensitivity to translations. The sensitivity to background was greatly reduced by using a Gaussian mask to remove as much of the background as possible without removing too much of the object itself. Still, using a Gaussian mask to remove the background is a crude way of approaching the problem of background sensitivity. Since matching using PCA is essentially template matching, it is perhaps surprising that the recognition rate was only 86 percent for objects on a homogeneous background similar to the training set. It is possible that the choice of images used to compute the eigenspace was poor, since only every fifth training image was used. Because of the periodicity of the training images, the result might have been that some object rotations were excluded.

For the real-world test images, featuring cluttered backgrounds, some occlusions, objects not perfectly centered and stretched objects, the recognition rate was higher than expected. The Gaussian mask certainly did improve the performance. Another factor that could possibly explain the relatively good results was that the objects were essentially of one colour. That made the matching work well even if the object was slightly translated in the image, since essentially the same colours were seen. If the objects had been less homogeneous in colour, the recognition rate would most likely have been a lot worse for small transformations.

[Figure 21 plot: recognition rate (%) against number of objects (2 to 5), with curves for features only, local histograms only, and features with histogram verification.]

Figure 21: Recognition rate using segmented images.

The overall detection rate of 51 percent was a low, but not completely unexpected, result. The conditions in the test images were not at all ideal for using PCA, with cluttered backgrounds, occlusions, different lighting, and objects at different distances. When the objects were correctly detected, however, they were often located quite precisely both in spatial position and scale. This result is probably a consequence of the inherent sensitivity to translation and scale.

The histogram detection algorithm implemented in this work did give some interesting results. Apparently, the average-histogram approach was not appropriate for the task of object classification. Still, it gave the better overall detection rate of the two detection algorithms implemented in this work. The co-occurrence histogram based approach gave a better overall detection rate, but for the fanta and cleaner objects, the histogram algorithm implemented in this work gave better detection rates. This shows that the co-occurrence based approach could be the better one in general, but the histogram-only approach can be better when certain constraints apply to the objects. What those constraints were in the case of the fanta and cleaner objects is not known, but they evidently exist. However, as stated in Section 1.1, there should not be strong assumptions about the objects integrated in the system. The idea of using an average histogram over a large set of training images was possibly a bad one. It could have resulted in the histogram becoming too general.

[Figure 22 plot: recognition rate (%) against number of objects (2 to 5), with curves for features only, local histograms only, and features with histogram verification.]

Figure 22: Recognition rate using real-world images with background.

The recognition tests that were based on feature matching only did not give particularly good results, even with only five objects. That was understandable, since no verification of the matching features was done. Schmid and Mohr [Schmid and Mohr, 1997] showed that using local neighbourhood constraints greatly reduced the number of mismatches and hence improved recognition rates. In order to improve the recognition rate of the feature-based algorithm in this work, verification by local colour histograms was introduced. The verification process certainly improved the recognition rate, from 70 to 80 percent for five objects. The recognition rate could probably be further improved by using some form of geometric constraints for verification. It is, however, possible that the approach in this work, using 300 training images for each object, is not the best one for geometric verification. The idea behind this was to be able to represent an object in all its possible rotations around the vertical axis, and under many lighting conditions, at once. The result of matching a set of test features to the learned ones might be that different matches correspond to different object rotations. When this is the case, it does not make sense to apply geometric constraints between the different features, since they have no relationship to each other within a single image.

Another possible explanation for the mediocre recognition results is that the real-world test images were so cluttered: 24 out of the 50 test images contained parts of at least one other object that the system was trained to recognize, in addition to the correct one. The backgrounds also contained many other objects, some of similar colour to the trained objects.

One approach of this work that might not be feasible when using features was to use training images that were only 35x35 pixels in size. The resulting number of detected corners not caused by noise was about four to six in each image. This is probably too small a number for reliable object recognition. As a comparison, Lowe [Lowe, 1999] used on the order of 1000 features for each image. The idea behind using such small training images was that it could not be assumed that any more information than that would be available in the test images. But perhaps larger training images should have been used, in order to use as much information as possible from the test images when available. For further research, it is suggested that larger and fewer training images are used, and that the sets of features for each image are kept separate so that geometric constraints can be applied. Even though only five features from each training image were used, 300 training images for each object resulted in a total of 1500 features for each object. However, most of the features represent the same group of corners, and could probably have been successfully clustered for a more efficient matching process.

For both the PCA and the feature-based algorithms, the NN classifier is used, in different ways. Using a training set as large as the one used here may not be appropriate with NNC, since no consistent learning is guaranteed [Roobaert, 2001]. A more appropriate approach would, again, probably have been to choose a limited set of training images for the learning of each object.

The tests using co-occurrence histogram detection and feature verification seemed to indicate that it would not be useful to use the implemented feature recognition algorithm to improve detection rates. However, the feature-based object recognition algorithm was not constructed with the intention of verifying hypotheses, so the object probability measure that was constructed was not well-founded. The feature-based algorithm was constructed to decide which one of a set of known objects an image depicted.

A more suitable approach to developing a hypothesis verification algorithm would of course be to design it from the start to answer the question “Does this image depict this certain object?”.

7 Summary

This thesis presented the theory behind three different approaches to the appearance-based object recognition problem in Section 2. The presented approaches were PCA, histograms, and local features. They were presented because they were all interesting approaches to test for the task at hand in this thesis.

Section 3 presented some methods that could be suitable for this thesis. It was however concluded that NNC was to be used for classification, because its simplicity would allow more time to be spent on the representation.

Three different algorithms were implemented, each based on one of the proposed methods for representation. The implementations were presented in Section 4. The PCA-based algorithm used a Gaussian mask to minimize background dependency in order to improve recognition and detection rates. The histogram-based algorithm was constructed for the purpose of object detection. In order to be able to detect the objects under a wide range of viewing conditions, average histograms were introduced, using the histograms from many training images at once. The feature-based algorithm was implemented for the task of object identification. Two variations of the algorithm were implemented. In the first, the objects were given votes inversely proportional to the distance between matching features. The second instead used verification by local colour histograms to determine vote strength.

The results from performing tests on each of the three algorithms were presented in Section 5. The PCA-based algorithm gave an average detection rate of 51 percent, while the histogram-based algorithm gave an average detection rate of 74 percent, both when using histogram intersection and when using Euclidean distance as the distance measure. The recognition rate for the real-world images, using the PCA-based algorithm and all objects, was 72 percent for the selected parameters. The corresponding results for the histogram-based and the feature-based algorithms were 34 and 80 percent respectively.

Finally, Section 6 discussed the experimental results, what was good about the implementations and what was less good, as well as suggestions for how they could be improved.

References

[Christensen et al., 2002] CHRISTENSEN H I, KRAGIC D, SANDBERG F, 2002. Vision for Interaction. Sensor Based Intelligent Robots 2002, LNCS 2238, p 51-73.

[Gurney, 1997] GURNEY K, 1997. An Introduction to Neural Networks. New Press Limited, London.

[Harris and Stevens, 1988] HARRIS & STEPHENS, 1988. A combined corner and edge detector. Proc 4th Alvey Vision Conf 1988, p 147-151.

[ISR CAS webpage] Centre for Autonomous Systems. ISR. http://www.cas.kth.se/isr/isr.htm Information about the ISR project.

[Kohonen, 1988] KOHONEN T, 1988. Learning Vector Quantization. Neural Networks 1:1988, p 303.

[Kovesi] KOVESI P. MATLAB Functions for Computer Vision and Image Analysis. http://www.cs.uwa.edu.au/~pk/Research/MatlabFns/ Last visited 2002-11-14.

[Lindeberg, 1998] LINDEBERG T, 1998. Feature Detection with Automatic Scale Selection. International Journal of Computer Vision, 2:1998, p 79-116.

[Lowe, 1999] LOWE D G, 1999. Object Recognition from Local Scale-Invariant Features. International Conference on Computer Vision (ICCV) 1999, p 1150-1157.

[Mel, 1997] MEL B W, 1997. Seemore: Combining Color, Shape and Texture Histogramming in a Neurally Inspired Approach to Visual Object Recognition. Neural Computation 9:1997, p 777-804.

[Mikolajczyk and Schmid, 2001] MIKOLAJCZYK K & SCHMID C, 2001. Indexing based on scale invariant interest points. IEEE 0-7695-1143-0/01.

[Nayar, Nene and Murase, 1994] NAYAR S K & NENE S A & MURASE H, 1994. Subspace Methods for Robot Vision. Technical report CUCS-019-95, September 1994.

[Rao and Ballard, 1995] RAO R P N & BALLARD D H, 1995. Object Indexing using an Iconic Sparse Distributed Memory. Proc 5th International Conference on Computer Vision (ICCV).

[Roobaert, 2001] ROOBAERT D, 2001. Pedagogical support vector learning: A pure learning approach to object recognition. Doctoral dissertation, NADA, KTH. ISRN KTH/NA/P-01/15-SE.

[Sanger, 1989] SANGER T D, 1989. Optimal Unsupervised Learning in a Single-Layer Linear Feedforward Neural Network. Neural Networks, 2:1989, p 459-473.

[Schiele and Crowley, 2000] SCHIELE B & CROWLEY J L, 2000. Recognition without Correspondence using Multidimensional Receptive Field Histograms. International Journal of Computer Vision 1:2000, p 31-52.

[Schiele and Pentland, 1999] SCHIELE B & PENTLAND A, 1999. Probabilistic Object Recognition and Localization. M.I.T. Media Laboratory Perceptual Computing Section Technical Report No. 499.

[Schmid and Mohr, 1997] SCHMID C & MOHR R, 1997. Local Greyvalue Invariants for Image Retrieval. IEEE PAMI vol 19, 5:1997, p 530-534.

[Swain and Ballard, 1991] SWAIN M J & BALLARD D H, 1991. Color indexing. International Journal of Computer Vision 7:1991, p 11-32.

[Turk and Pentland, 1991] TURK M & PENTLAND A, 1991. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3:1991, p 71-86.
