
Implementing LVQ for Age Classification

Olle Pettersson

Master of Science Thesis Stockholm, Sweden 2007


Master’s Thesis in Computer Science (30 ECTS credits)
at the School of Computer Science and Engineering, Royal Institute of Technology, 2007
Supervisor at CSC was Örjan Ekeberg
Examiner was Anders Lansner

TRITA-CSC-E 2007:109
ISRN-KTH/CSC/E--07/109--SE
ISSN-1653-5715

Royal Institute of Technology
School of Computer Science and Communication
KTH CSC
SE-100 44 Stockholm, Sweden
URL: www.csc.kth.se


Abstract

This thesis was commissioned by the Mitsubishi Electric Corporation, Japan. Its subject is the problem of Age Classification. Specifically, the task is to train and evaluate a classifier for grayscale face images. Furthermore, the classifier will be developed for integration into a Demo Application for gender and age classification. The Demo Application is a live classification system that captures images through a web camera.

Initial studies show that Age Classification is a new problem within the machine learning field; as such, specific approaches to this classification task are largely undocumented. This thesis therefore gives an overview of related techniques for face recognition and face classification.

The method of Learning Vector Quantization (LVQ) is modified and implemented as a classifier with an optional pre-processing step using Principal Component Analysis (PCA). The current implementation of PCA as a pre-processing step did not improve the classifier's performance. While LVQ performs fairly well as an age classifier given its inherent simplicity, the conclusion is that some type of pre-processing is required for optimal performance, either by finding a different approach to pre-processing or by further developing the PCA pre-processing step.


Sammanfattning

This thesis project was carried out on behalf of the Mitsubishi Electric Corporation, Japan. The purpose of the work was to study the problem of age classification. The specific task treated in this report is to train and evaluate a classifier for grayscale images of faces. The classifier will also be integrated into a demo application for gender and age classification. The demo application is a real-time classification system that captures images for classification through a web camera.

Preliminary studies show that the problem of age classification is very new within the machine learning field and that approaches to the problem are largely undocumented. The report gives an overview of related techniques for face recognition and classification.

The Learning Vector Quantization (LVQ) algorithm is modified and implemented as a classifier, together with an optional pre-processing step based on Principal Component Analysis (PCA). Analysis shows that the current PCA pre-processing does not increase the classifier's performance. The conclusion drawn is that LVQ gives good results despite its simple structure, but that some form of pre-processing of the data is probably required for optimal performance. This can be achieved either with an alternative pre-processing of the data or by further developing the PCA pre-processing step.


Preface

This thesis project was suggested and sponsored by Mitsubishi Electric Corporation Advanced Technology R&D Center, Hyogo, Japan. I would like to thank my supervisor Hiroshi Kage and his staff for their encouragement and vital support during my work on this thesis.

Also, I would like to thank Ronald Trumpf-Nordqvist, International Coordinator at KTH, and Professor Hiroyasu Funakubo for helping me find the opportunity to do my final thesis project at Mitsubishi.

Finally, I want to thank my academic supervisor Örjan Ekeberg for his support and help in keeping this thesis project up to academic standards.

Olle Pettersson, October 2007, Osaka, Japan


Table of Contents

1 Introduction
  1.1 Background
  1.2 Problem Definition
2 Theory
  2.1 Templates
  2.2 Image Invariance
  2.3 Learning Vector Quantization
  2.4 Principal Component Analysis
  2.5 Demo Application Technology
3 Methods and Architecture
  3.1 Face Image Definitions
  3.2 The LVQ Learner
  3.3 PCA Pre-processing
  3.4 Integration with Demo Application
4 Results
  4.1 Face Region Evaluation
  4.2 LVQ 2.1
  4.3 Pre-processing with PCA
  4.4 Testing of Demo Application
5 Discussion
  5.1 The Difficulty of Age Classification
  5.2 Strict Class Boundaries
  5.3 The Eigenfaces of Age
  5.4 Future Improvements
6 Conclusion
Bibliography


1 Introduction

This chapter gives a general introduction to the field of face recognition, covers approaches to previous similar problems, and ends with a more detailed problem description.

The purpose of this Master’s Thesis is to examine the feasibility of implementing a simple Machine Learning method for age classification of face images. The task of classifying faces by age is a largely unexplored part of the face recognition field; therefore, the initial research is focused on different but related face recognition tasks.

The actual use and value of an age classifier will not be dealt with in this thesis, although many possible applications exist, e.g. as one of the metrics in a personal identification system. The current goal is to create a prototype classifier to integrate into a Demo Application developed at Mitsubishi Electric. The purpose of this particular implementation is to do a preliminary investigation into the feasibility of the Machine Learning method of Learning Vector Quantization as an age classification system.

What follows is a brief background on face classification, the Demo Application, and the constraints imposed on the prototype age classifier.

1.1 Background

1.1.1 Face classification

How is a face classified, and on what criteria is the classification done? Face classification has a very broad field of applications and approaches. Face classification differs from face identification in that, instead of trying to match an unknown face to a library of known faces, the unknown face is classified by various criteria (e.g. gender, expression or age). Specifically, the task of face classification becomes the task of extracting a set of parameters which describes the difference between groups of faces rather than the difference between individual faces.

There is of course the choice of how to represent a person’s face. One common approach is to use grayscale images to represent the faces. This is a very simplified representation of the data, but at the same time it keeps much of the relevant information by preserving the luminosity of the pixels (Freund et al. 1996, Jones et al. 2001). The full RGB color representation can also be used, but with both grayscale and RGB color image representations some type of pre-processing is appropriate (Heseltine et al. 2002).

1.1.2 Classifier requirements

The main requirement imposed on the classifier is one of simplicity. One of the goals set out by Mitsubishi Electric is to implement its classifiers on embedded systems, and the Age Classifier should be developed with this in mind. On low-powered embedded devices both processing power and memory are scarce resources; therefore the decision rule and any pre-processing step should be of moderate complexity. The classifier should also be implemented in a language close to machine level for efficiency reasons.


Also, because the available training data consists only of grayscale images, the classifier can only perform classification on images in the grayscale format. The classifier will therefore always assume its input to be grayscale images. Even though the Demo Application captures images in full RGB color through a web camera, these images will be down-sampled and handled internally as grayscale images.

1.1.3 The Demo Application

Mitsubishi Electric has developed a Demo Application to demonstrate a face extraction algorithm and a gender classifier (a more detailed description of the underlying technology of the Demo Application is given in Section 2.5). The Application is set up to capture images using an ordinary web camera. Face classification is initiated by pressing the Run button in the Application’s interface; an image is then captured with the web camera. The Application then employs two main steps for image classification:

• Face Extraction
• Face Classification

In the Face Extraction step the captured image (from now on referred to as the scene image) is searched for the presence of faces using an efficient search algorithm. If a face is found in the scene image, that face is passed on to the Face Classification step. The current face classification is based on gender, and the face is determined to be either male or female.

The goal is to integrate an Age Classifier alongside the current Gender Classifier. A mockup of the desired structure of the finished Demo Application with age classification integrated can be seen in Figure 1.1. The top half contains a view of the captured scene image, and in the lower right corner the extracted face image is displayed. Classification results for the extracted face image are displayed in the lower left corner.


Figure 1.1. A mockup of the Demo Application. The bottom left corner displays the age classification.

1.2 Problem Definition

The goal is to create an Age Classifier for grayscale face images and integrate it into the Demo Application. Since the already finished Face Extractor and Gender Classifier are written in C and C++, the Age Classifier should also be implemented in these languages.

For maximum portability to the architecture of embedded devices, the C language should be used, but because the goal is to create a prototype – and also to facilitate an easy implementation – the Age Classifier will be implemented in C++. This makes the classifier easy to integrate with the current Demo Application, and the architecture is reasonably easy to transfer to the C language. This will be useful if the ultimate goal is to implement the algorithm on an embedded device.

Grayscale face images will be used for training; these are all frontal faces taken under similar lighting conditions. The images will be described in greater detail in Section 3.1. The performance of the Age Classifier will be measured by how well it classifies these images.

The classifier will also be integrated into the Demo Application, where it will classify the face image extracted from a scene image by the Face Extraction step. Because of the robustness of the Face Extraction step, the extracted faces can have significantly different pose and orientation compared to faces in the training data. Also, the lighting conditions of a face image extracted from a scene image may differ greatly from the training images. The Age Classifier is not expected to perform well under these circumstances, because such extracted face images differ greatly from the training data. Therefore the performance of the Age Classifier when integrated into the Demo Application will not be thoroughly evaluated; only an informal observation of its performance will be given.


2 Theory

This chapter contains the relevant Machine Learning techniques that will be used in this thesis. The material presented here covers the subjects of face classification and face recognition with a Machine Learning approach, as researched for this thesis. Sections 2.1 to 2.4 deal with Machine Learning techniques applied to the problem of face classification. Section 2.5 describes the current version of the Demo Application and the technology behind the Face Extraction.

Common to all the techniques covered in Sections 2.1 to 2.4 is that the face classification operates only on face images, not entire scene images. These techniques all assume that the face images have been aligned and scaled to similar proportions. In order to apply these techniques in a classification system that captures live images (scene images), an additional Face Extraction step is needed; this step finds and extracts areas of interest in the scene image. Such an extracted area of interest is called a candidate image, in the sense that it is the most prominent face image found in the whole scene image. Some of the techniques described below can be adapted to perform this task, but a fast and accurate algorithm for it is already given in the form of the Face Extraction step in the Demo Application; the focus in this Thesis will therefore remain on face classification algorithms. The search for candidate images in the whole scene image will not be a task for the Age Classifier and will mainly be brought up to explain the functionality of the Demo Application.

2.1 Templates

Sung et al. (1994) go through many of the techniques in the early attempts to tackle the face recognition problem. One approach is to use different forms of templates. These templates incorporate some type of knowledge about the problem domain of face recognition; some of this knowledge is encoded manually into the template approach (e.g. creating templates for eyes, nose or mouth).

2.1.1 Correlation Templates

Correlation Templates are a filter technique: the difference between the candidate image and a face template is calculated by comparing the pixel values in the Correlation Template to the pixels in the candidate image. The measured difference is then thresholded to determine whether a match was found or not.

A face is too complicated a pattern to model with just one filter; therefore the Correlation Templates usually model sub-features of a face, with many templates for each feature. The assumption is that a few sub-features are enough to describe a face. The presence of a face can then be inferred by analyzing how well the templates match a candidate image.
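As a minimal illustration of this matching scheme, the sketch below compares a template patch to an equally sized candidate patch pixel by pixel and thresholds the squared difference. The types, names and the row-major pixel layout are illustrative assumptions, not the implementation used later in this thesis.

#include <cstddef>
#include <vector>

// Illustrative correlation-template match: the template and the
// candidate patch are grayscale pixel vectors of equal size, and a
// match is declared when their squared difference is below a
// manually chosen threshold.
struct GrayPatch {
    std::size_t width = 0, height = 0;
    std::vector<float> pixels;        // row-major intensities, e.g. 0..255
};

// Sum of squared pixel differences between template and candidate patch.
float patchDistance(const GrayPatch& templ, const GrayPatch& candidate) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < templ.pixels.size(); ++i) {
        const float d = templ.pixels[i] - candidate.pixels[i];
        sum += d * d;
    }
    return sum;
}

// Thresholded decision: true if the candidate matches the template.
bool matchesTemplate(const GrayPatch& templ, const GrayPatch& candidate,
                     float threshold) {
    return patchDistance(templ, candidate) < threshold;
}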

Decomposing the problem in this way reduces the statistical variance caused by background disturbance. A recognizer also becomes somewhat more robust to occlusion of various parts of the face (Heseltine et al. 2002).


Finding the optimal positions and properties for these templates is not a trivial task; one approach is to create Correlation Templates for interesting face regions like the eyes, nose and mouth, and then to train the parameters of the Correlation Templates through some machine learning method. This is the general approach used in this thesis; in Section 2.3, LVQ is examined as the training method for the templates.

2.1.2 Deformable Templates

Deformable Templates work in a similar fashion to Correlation Templates but instead use parameterized curves to model features. The main concept behind Deformable Templates is to attempt to conform a template to a target candidate image while reducing the stress in the template (the freedom of the parameters of the template). The end result is a less rigid model of the face. A threshold value for the maximum allowed stress in the global model is then used to determine whether a match is found in a candidate image or not (i.e. if the stress in the template is low enough, a match is found).

Since the values of the template’s parameters are relative to the location of the face’s descriptive features, the template becomes less sensitive to the scale and rotation of the face. The template only requires good starting values for its parameters, which are easy to guess based on assumptions about the general structure of a face.

2.2 Image Invariance

The approach of using Image Invariance focuses not on the traditional features of a face like the eyes, nose or mouth but mainly relies on the brightness invariance between different parts of the face. The model then aims to encode this invariance. If the brightness relationships in the candidate image match those in the template model, a match is made.

One way of finding this invariance is through Principal Component Analysis; this technique can be used for either face identification (Heseltine et al. 2002) or pre-processing of face images before classification by a different Machine Learning technique (Sirovich et al. 1987). Principal Component Analysis will be described in Section 2.4 along with how it can be applied to faces.

2.3 Learning Vector Quantization

Learning Vector Quantization is an extended application of Correlation Templates. The theory described here is largely based on the LVQ PAK compendium – a manual for a C library for training and classifying with the LVQ algorithm developed by Kohonen et al. (1995), which also contains a good introduction to the theory of LVQ. In this Section LVQ is given a general overview, and the example given here deals with the 2-dimensional case. LVQ is not bound by dimensionality, however, and is just as applicable to higher-dimensional data such as face images or face features. In this Thesis it will be used to train templates for parts of faces.

The core task of the LVQ algorithm is finding the optimal placement of Feature Vectors (free parameter vectors) in input space in order to approximate the different class domains in the vector space that the training samples reside in. An example of this can be seen in Figure 2.1 where the input space is two dimensional. Typically, multiple Feature Vectors are assigned to each class when attempting to model the class domains.


Another benefit of the LVQ algorithm is that it can give a nonlinear separation of the sample space. This means that the decision boundary derived from training need not be linear (in Figure 2.1 the sample space would otherwise have to be separated by a line; in the general case it would be a hyperplane). This property of the LVQ algorithm becomes useful when dealing with complex class domains where a linear decision border is not sufficient, as illustrated in Figure 2.1.

Figure 2.1(a) A sample distribution of two classes (positive and negative) in 2D-space. (b) Positions of Feature Vectors that might be found by the LVQ algorithm. The class borders are defined by the Nearest-neighbor rule and are illustrated by dashed lines.

After training, the classification of an unknown sample x is determined by the Nearest-neighbor rule by comparing x to all Feature Vectors vi in the sample space.

$$w = \arg\min_i \{\, \lVert x - v_i \rVert \,\}$$

Equation 1. The index w of the Feature Vector closest to sample x is chosen by measuring the Euclidean distance.

The closest Feature Vector vw then determines the class of the unknown sample x.
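A minimal sketch of this decision rule is given below, assuming each Feature Vector stores its position and a class label: the unknown sample simply takes the label of the nearest Feature Vector under the Euclidean distance. All names are illustrative, not taken from the thesis code.

#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// A Feature Vector: a point in input space together with a class label.
struct FeatureVector {
    std::vector<float> position;
    int classLabel;
};

float euclideanDistance(const std::vector<float>& a, const std::vector<float>& b) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}

// Nearest-neighbour classification (Equation 1): the sample is assigned
// the class of the closest Feature Vector in the codebook.
int classify(const std::vector<float>& sample,
             const std::vector<FeatureVector>& codebook) {
    float best = std::numeric_limits<float>::max();
    int label = -1;
    for (const FeatureVector& v : codebook) {
        const float d = euclideanDistance(sample, v.position);
        if (d < best) { best = d; label = v.classLabel; }
    }
    return label;
}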

In the next section the Nearest-neighbor rule, which is an essential principle of the LVQ classification rule, is explained. Then follow the various update rules that can be employed by LVQ when positioning the Feature Vectors before classification.

2.3.1 Nearest-neighbor Rule

Used frequently in the field of pattern recognition and machine learning, the Nearest-neighbor rule (sometimes called the k-Nearest-neighbor rule) is not specific to the LVQ algorithm but will be described in this section because of its relevance to this algorithm.

The Nearest-neighbor rule is based on the assumption that when observing samples x_i of different classes c_i, it is reasonable to assume that observations which are close together – measured by some appropriate metric – are of the same class. Therefore, when evaluating an unknown sample x, its classification is based upon the class of nearby samples.

The LVQ’s decision is based on this assumption. But instead of making use of points in the training set when classifying new unknown points, LVQ relies on a good choice of Feature Vectors to approximate the class domains and to help classify an unknown sample xi.

2.3.2 LVQ1

The core of the LVQ algorithm is the competitive learning process of positioning the Feature Vectors. Here the most basic version of LVQ, called LVQ1, is explained, but the principle of competitive learning holds true for later versions of the LVQ algorithm as well.

During training, pairs of sample points x and their class labels are shown to the learner, one sample at a time. The closest Feature Vector v_w, also referred to as the winning Feature Vector, is updated according to one of the following equations, depending on whether v_w classifies the sample point x correctly or not.

$$v_w(t+1) = \begin{cases} v_w(t) + \alpha(t)\,[x(t) - v_w(t)] & \text{if } x \text{ and } v_w \text{ belong to the same class} \\ v_w(t) - \alpha(t)\,[x(t) - v_w(t)] & \text{if } x \text{ and } v_w \text{ belong to different classes} \end{cases}$$

Equation 2. The LVQ1 update rule. Only the winning Feature Vector v_w is modified.

The term α(t) controls how large the movement of the Feature Vector should be in the update. The value of α(t) is updated after each training step as follows:

$$\alpha(t+1) = \frac{\alpha(t)}{1 + p(t)\,\alpha(t)}, \qquad \text{where } p(t) = \begin{cases} +1 & \text{if sample point } x \text{ is classified correctly} \\ -1 & \text{otherwise} \end{cases}$$

Equation 3. The update of the α-value depends on the classification of x by the winning Feature Vector.

This causes α(t) to decrease when a sample is classified correctly and to increase when an incorrect classification is made by the winning Feature Vector. As a result, the Feature Vectors move significantly more in the earlier stages of training, when α(t) is still high, and later slow down as good positions for the Feature Vectors are found and fewer incorrect classifications are made. A good starting value α(0) is usually in the range 0.1 to 0.3.
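The sketch below combines Equations 2 and 3 into one LVQ1 training step: find the winning Feature Vector, attract or repel it, and update the learning rate. The codebook layout and names are illustrative assumptions; the actual thesis implementation is described in Chapter 3.

#include <cstddef>
#include <limits>
#include <vector>

// A codebook entry: a position in input space plus its class label.
struct CodebookVector {
    std::vector<float> position;
    int classLabel;
};

// Returns the index of the Feature Vector closest to the sample
// (squared Euclidean distance is enough for the arg-min).
std::size_t winner(const std::vector<float>& x,
                   const std::vector<CodebookVector>& codebook) {
    std::size_t best = 0;
    float bestDist = std::numeric_limits<float>::max();
    for (std::size_t i = 0; i < codebook.size(); ++i) {
        float d = 0.0f;
        for (std::size_t k = 0; k < x.size(); ++k) {
            const float diff = x[k] - codebook[i].position[k];
            d += diff * diff;
        }
        if (d < bestDist) { bestDist = d; best = i; }
    }
    return best;
}

// One LVQ1 update (Equation 2): move the winner towards a correctly
// classified sample, away from an incorrectly classified one, then
// decrease or increase the learning rate alpha (Equation 3).
void lvq1Step(const std::vector<float>& x, int xClass,
              std::vector<CodebookVector>& codebook, float& alpha) {
    CodebookVector& w = codebook[winner(x, codebook)];
    const float sign = (w.classLabel == xClass) ? 1.0f : -1.0f;
    for (std::size_t k = 0; k < x.size(); ++k)
        w.position[k] += sign * alpha * (x[k] - w.position[k]);
    const float p = (w.classLabel == xClass) ? 1.0f : -1.0f;
    alpha = alpha / (1.0f + p * alpha);   // Equation 3
}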

2.3.3 LVQ2.1

LVQ2.1 is a further development of the LVQ algorithm suggested by Kohonen et al. (1995). LVQ2.1 selects the two closest Feature Vectors v_l and v_m by the Nearest-neighbor rule in Equation 1. Two additional conditions are introduced which must be met before an update takes place.

The first restriction is that the two closest Feature Vectors found must be of different classes; in the case when the sample space is divided into only two classes, this means that one of the Feature Vectors must classify the sample x correctly. When the Feature Vectors v_l and v_m classify sample x correctly and incorrectly, respectively, the update rule is as follows:

$$\begin{aligned} v_l(t+1) &= v_l(t) + \alpha(t)\,[x(t) - v_l(t)], && v_l \text{ belongs to the same class as } x \\ v_m(t+1) &= v_m(t) - \alpha(t)\,[x(t) - v_m(t)], && v_m \text{ does not belong to the same class as } x \end{aligned}$$

Equation 4. The LVQ2.1 update rule focuses on the two Feature Vectors closest to the training sample x.

If the class labels of v_l and v_m are equal, or if neither of them matches the class of x, no update is done. In this Thesis the sample space contains four different classes (generations 20, 30, 40 and 50), which makes the above requirement much harder to fulfill. In order to relax this update criterion, some changes to the LVQ2.1 update rule are proposed in Section 3.2.2.

The second condition which needs to be met before an update takes place depends on the position of the sample point x relative to the two closest Feature Vectors. The sample must fall within a window (defined by the two closest Feature Vectors) for an update to take place. Equation 5 defines the inequality that must be satisfied in order to allow an update.

$$\min\!\left( \frac{\lVert x - v_l \rVert}{\lVert x - v_m \rVert},\; \frac{\lVert x - v_m \rVert}{\lVert x - v_l \rVert} \right) > s, \qquad \text{where } s = \frac{1 - w}{1 + w}$$

Equation 5. The window is defined by the ‘width’ w. Typical values for w are 0.2 to 0.3.

This rule allows the Feature Vectors to be updated only when a training sample x lies on the class border between the Feature Vectors; this area is illustrated in gray in Figure 2.2.

Figure 2.2. The Class boundary for two Feature Vectors of opposing class (marked by x and o) as described by the LVQ2.1 update rule.
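A sketch of one LVQ2.1 training step under these two conditions is given below, assuming a codebook of at least two Feature Vectors with class labels; the names and layout are illustrative only.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// One LVQ2.1 training step (Equations 4 and 5): the two nearest
// Feature Vectors are updated only if they carry different class
// labels, exactly one of them matches the class of the sample, and
// the sample falls inside the window of width `window`.
struct CodebookVector {
    std::vector<float> position;
    int classLabel;
};

static float euclidean(const std::vector<float>& a, const std::vector<float>& b) {
    float s = 0.0f;
    for (std::size_t k = 0; k < a.size(); ++k) { const float d = a[k] - b[k]; s += d * d; }
    return std::sqrt(s);
}

void lvq21Step(const std::vector<float>& x, int xClass,
               std::vector<CodebookVector>& codebook,   // at least two vectors
               float alpha, float window /* typically 0.2 to 0.3 */) {
    // Find the indices l and m of the two closest Feature Vectors.
    std::size_t l = 0, m = 1;
    if (euclidean(x, codebook[m].position) < euclidean(x, codebook[l].position))
        std::swap(l, m);
    for (std::size_t i = 2; i < codebook.size(); ++i) {
        const float d = euclidean(x, codebook[i].position);
        if (d < euclidean(x, codebook[l].position))      { m = l; l = i; }
        else if (d < euclidean(x, codebook[m].position)) { m = i; }
    }
    CodebookVector& vNear = codebook[l];
    CodebookVector& vNext = codebook[m];

    // Condition 1: different classes, and exactly one matches the sample.
    if (vNear.classLabel == vNext.classLabel) return;
    CodebookVector* correct;
    CodebookVector* incorrect;
    if (vNear.classLabel == xClass)      { correct = &vNear; incorrect = &vNext; }
    else if (vNext.classLabel == xClass) { correct = &vNext; incorrect = &vNear; }
    else return;

    // Condition 2 (Equation 5): x must fall inside the window.
    const float dl = euclidean(x, vNear.position);
    const float dm = euclidean(x, vNext.position);
    const float s = (1.0f - window) / (1.0f + window);
    if (std::min(dl / dm, dm / dl) <= s) return;

    // Equation 4: attract the correct vector, repel the incorrect one.
    for (std::size_t k = 0; k < x.size(); ++k) {
        correct->position[k]   += alpha * (x[k] - correct->position[k]);
        incorrect->position[k] -= alpha * (x[k] - incorrect->position[k]);
    }
}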


This prevents the Feature Vectors from diverging from their optimal positions (Kohonen et al. 1995). Sambu et al. (2006) also showed that with this limitation on the update rule, LVQ obtains margin maximization properties similar to those of the Support Vector Machine (SVM) algorithm. SVM is another Machine Learning technique useful for finding class borders in a sample space. The difference from the LVQ approach is that SVM selects the samples that lie on the class border and uses them to classify any unknown sample; LVQ2.1 instead updates the Feature Vectors only according to samples that lie on the class border, to achieve similar results. SVM has been used successfully in a face recognition system by Osuna et al. (1997) but will not be discussed further in this Thesis.

The LVQ algorithm fits the criterion of simplicity stated in the problem definition in Section 1.2 by having a simple decision rule; therefore LVQ2.1 will be implemented and tested in this thesis.

2.4 Principal Component Analysis

Principal Component Analysis (PCA) is a technique used to reduce the dimensionality of a multidimensional dataset for further analysis; this is often desired when dealing with datasets of very large dimensions (like the hundreds or thousands of pixels in an image). PCA provides a method for this dimensionality reduction while at the same time preserving the variance of the dataset. The assumption of PCA is that a high-dimensional dataset (e.g. images of faces) has an underlying, much lower dimensionality (e.g. age or race); PCA is a way of finding these dimensions – the Principal Components of the dataset. In this Thesis PCA will be used to attempt to find the Principal Components of age in the dataset of face images described in Section 3.1.

In this section the general theory and application of PCA will be described first with the mathematical background in Section 2.4.1 and then the more specific application of PCA to face images in Section 2.4.2, introducing the concept of Eigenfaces.

For a thorough description of the application of PCA in this manner see the work by O’Toole et al. (1993) where PCA is applied to faces and the properties of the Eigenfaces are further discussed. For a more thorough description of the concept of PCA see the Principal Component Analysis tutorial by Smith (2002) which provides a good introduction to PCA.

2.4.1 Method Overview

The main goal of the PCA approach is to find the transformation matrix S, also known as the score matrix, which projects the high dimensional sample data into a lower dimensional vector space.

First the dataset is ordered in the sample matrix X, where the rows contain the m samples labeled x_1, x_2, ... , x_m. Each row x_i in X is standardized by subtracting the mean value and dividing by the standard deviation of the sample set. X has n columns, which correspond to the dimensionality of the original dataset. The first step of PCA is then to calculate the covariance matrix C of the standardized dataset X.


$$C = (c_{i,j}), \qquad c_{i,j} = \mathrm{cov}(d_i, d_j), \qquad \mathrm{cov}(d_a, d_b) = \frac{\sum_{k=1}^{m} d_{a,k}\, d_{b,k}}{m - 1}$$

Covariance is a measurement of the degree of similarity between the columns d_i and d_j in the sample matrix X. In fact, when i = j the covariance measurement reduces to the variance of the column d_k in X, where k = i = j. The dimensions of C are bound by the number of dimensions of the dataset, and C is therefore an n × n matrix that is symmetric around its diagonal; the elements c_{i,j} contain the covariance between columns i and j in X.
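As a concrete illustration of this first step, the sketch below standardizes the columns of a sample matrix X (rows are samples, columns are dimensions) and forms the covariance matrix C. The plain-vector matrix layout and names are assumptions for illustration; the thesis relies on Matlab's existing PCA library for this part.

#include <cmath>
#include <cstddef>
#include <vector>

// Row-major sample matrix: Matrix[row][column].
using Matrix = std::vector<std::vector<double>>;

// Standardize each column of X and return the n x n covariance matrix C.
Matrix covarianceMatrix(Matrix X) {
    const std::size_t m = X.size();       // number of samples (rows)
    const std::size_t n = X[0].size();    // original dimensionality (columns)

    // Standardize each column: subtract its mean, divide by its std dev.
    for (std::size_t j = 0; j < n; ++j) {
        double mean = 0.0;
        for (std::size_t i = 0; i < m; ++i) mean += X[i][j];
        mean /= static_cast<double>(m);
        double var = 0.0;
        for (std::size_t i = 0; i < m; ++i) {
            X[i][j] -= mean;
            var += X[i][j] * X[i][j];
        }
        const double stddev = std::sqrt(var / static_cast<double>(m - 1));
        if (stddev > 0.0)
            for (std::size_t i = 0; i < m; ++i) X[i][j] /= stddev;
    }

    // c[a][b]: sum over the samples of column a times column b, over (m - 1).
    Matrix C(n, std::vector<double>(n, 0.0));
    for (std::size_t a = 0; a < n; ++a)
        for (std::size_t b = 0; b < n; ++b) {
            double sum = 0.0;
            for (std::size_t i = 0; i < m; ++i) sum += X[i][a] * X[i][b];
            C[a][b] = sum / static_cast<double>(m - 1);
        }
    return C;
}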

By finding the most significant eigenvectors of C we find the eigenvectors that best describe the dataset in X. These are the Principal Components of the dataset; the more redundant the data, the fewer Principal Components are needed to model it. The Principal Components are the basis vectors which form the linear span of the new sub-space in which the samples x_i end up. This subspace is commonly referred to as PC-space.

For a visual illustration of how these eigenvectors can describe a dataset see the next section about Eigenfaces – the special case of PCA being applied to face images.

2.4.2 Eigenfaces

Since images of faces have many common features, the raw pixel data will contain much redundancy, and a set of face images is therefore a good candidate for dimensionality reduction through PCA. The term Eigenfaces refers to the Principal Components found when PCA is applied to sets of face images, because of their resemblance to human faces. They do not look like typical faces, yet some of their characteristics make them unmistakably face-like. See Figure 2.3 for some examples of Eigenfaces. The terms Eigenface and Principal Component will be used somewhat interchangeably throughout this Thesis.

The use of PCA for face recognition was first introduced by Sirovich et al. (1987) and was shown to be a very successful approach to the problem of face recognition. However, it imposes some restrictions on the face images. The lighting conditions under which all the images are captured need to be the same in order not to distort the similarities of the features. The images also need to be aligned so that face features are in the same place in all images. Both these conditions are fulfilled by the face images available at Mitsubishi, which makes this approach feasible as a pre-processing step for the LVQ algorithm. The task is then for the LVQ algorithm to find Feature Vector positions in the new vector space defined by the Principal Components (Eigenfaces) found by analyzing the training set of face images.

It should be noted that by keeping all the Eigenfaces – not just the most significant – any face in the training dataset can be completely reconstructed by a linear combination of these Eigenfaces. This is shown in Figure 2.3. The n Eigenfaces are stored after they are obtained from the Principal Component Analysis; the coefficients s_1, ... , s_n (obtained by multiplying the face sample with the Score Matrix S) can then be used to combine the Eigenfaces to recreate the face.


Figure 2.3 A reconstruction of a face by Eigenfaces generated by the PCA algorithm. The most significant information is contained in the first Eigenfaces (AT&T Laboratories, 2007).

2.4.3 Reducing the Dimensions of a Face

By keeping the k most important Eigenfaces, as in the previous section, we find the Eigenfaces that can best be used to approximate any face in the training dataset. The assumption is that the Eigenfaces’ ability to represent faces in the training set will carry over to unknown faces as well. Hopefully, any face image captured under the same conditions will be adequately represented by these k Eigenfaces. The value of k varies from dataset to dataset but is usually significantly smaller than n, the total number of pixels in a face image.
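The sketch below illustrates this use of the k most significant Eigenfaces: a face is projected onto them to obtain k coefficients, and an approximation of the face can be rebuilt from those coefficients. The Eigenfaces are assumed to be orthonormal and of the same length as the mean-subtracted face image; all names are illustrative.

#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Project a face onto the k most significant Eigenfaces, giving its
// k-dimensional representation in PC-space.
Vec projectOntoEigenfaces(const Vec& face, const Vec& meanFace,
                          const std::vector<Vec>& eigenfaces, std::size_t k) {
    Vec coeffs(k, 0.0);
    for (std::size_t j = 0; j < k; ++j)
        for (std::size_t p = 0; p < face.size(); ++p)
            coeffs[j] += (face[p] - meanFace[p]) * eigenfaces[j][p];
    return coeffs;
}

// Rebuild an approximation of the face as the mean face plus a linear
// combination of the Eigenfaces weighted by the coefficients.
Vec reconstructFromEigenfaces(const Vec& coeffs, const Vec& meanFace,
                              const std::vector<Vec>& eigenfaces) {
    Vec face = meanFace;
    for (std::size_t j = 0; j < coeffs.size(); ++j)
        for (std::size_t p = 0; p < face.size(); ++p)
            face[p] += coeffs[j] * eigenfaces[j][p];
    return face;
}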

Sirovich et al. (1987) showed that face recognition can be done using only a subset of these eigenvectors. They also observed good performance even under slightly different lighting. Using fewer than the 50 most significant coefficients, they were able to store the identity of a face.

The underlying ground truth (like gender or face identity) of a face that is sampled by taking a photo is not guaranteed to be found by PCA but it might be modeled by one or more Eigenfaces. And different ground truths might not be modeled by the same Eigenface or combination of Eigenfaces. Therefore special care has to be taken when choosing principal components depending on the type of recognition being implemented (O’Toole et al. 1993).

2.5 Demo Application Technology

The Demo Application developed at Mitsubishi Electric currently performs face extraction and classification by gender. Specific details of this approach can be found in the work by Jones et al. (2001), where the technique was successfully implemented as a fast recognizer of faces in scene images.


2.5.1 Face Extraction

The approach to fast face extraction can be broken down into three main components, all necessary for both fast and accurate results. They are listed below and each is explained briefly further down.

• Simple decision rule (Haar-like Features)
• New image representation (Integral Image)
• Boosted cascade of classifiers (based on the Haar-like Features)

Haar-like Features

Haar-like Features are simple rectangular filters and have previously been used by Papageorgiou et al. (1998) in application to the face recognition problem. A Haar-like Feature contains regions of different intensity; Figure 2.4 depicts the three types of filters used by the face recognition system.

Figure 2.4. Some samples of Haar-like Features used to construct a classifier. When applied to an image, the sum of the pixels that lie within the white region is subtracted from the sum of the pixels that lie within the black region.

The Integral Image Format

By changing the image representation, the sums of the rectangles in the Haar-like Features can be calculated very quickly. Viola et al. (2001) introduce the concept of the Integral Image for this purpose. The Integral Image has the same dimensions as the original image, but instead of each pixel containing an intensity value, each pixel in the Integral Image contains the cumulative sum of all pixels above and to the left of it, with the top left corner defined as pixel (0,0).

$$I_{\mathrm{int}}(0,0) = I_{\mathrm{org}}(0,0)$$
$$I_{\mathrm{int}}(x,y) = I_{\mathrm{org}}(x,y) + I_{\mathrm{int}}(x-1,y) + I_{\mathrm{int}}(x,y-1) - I_{\mathrm{int}}(x-1,y-1)$$

Equation 6. Creating the Integral Image I_int from the original image I_org. Pixels outside of the image dimensions are defined as 0.

With the Integral Image format, the intensity sum of any rectangle from (0,0) to coordinate (x,y) is contained in coordinate (x,y). With this information it is a matter of simple geometry to calculate the sum of the regions within the Haar-like Features. So instead of calculating the sum of a black or white feature rectangle by referencing each pixel in the original image, only four references are needed in the Integral Image (one to each corner of the feature rectangle).

Figure 2.5. The rectangle areas needed to calculate the sum of the pixels in the striped area are all contained in the four corners of the rectangle. The area of any given rectangle in the Integral Image is D-C-B+A.
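A small sketch of the Integral Image idea follows: the cumulative-sum image is built once with the recurrence in Equation 6, after which the sum of any rectangle is read with the four corner look-ups from Figure 2.5. Types and names are illustrative assumptions.

#include <cstddef>
#include <vector>

// A grayscale image stored row-major; out-of-bounds reads return 0,
// as required by Equation 6.
struct GrayImage {
    std::size_t width, height;
    std::vector<long long> data;  // data[y * width + x]
    long long at(long long x, long long y) const {
        if (x < 0 || y < 0) return 0;
        return data[static_cast<std::size_t>(y) * width + static_cast<std::size_t>(x)];
    }
};

// Build the Integral Image with the recurrence of Equation 6.
GrayImage integralImage(const GrayImage& org) {
    GrayImage out{org.width, org.height,
                  std::vector<long long>(org.width * org.height, 0)};
    for (long long y = 0; y < static_cast<long long>(org.height); ++y)
        for (long long x = 0; x < static_cast<long long>(org.width); ++x)
            out.data[static_cast<std::size_t>(y) * org.width + static_cast<std::size_t>(x)] =
                org.at(x, y) + out.at(x - 1, y) + out.at(x, y - 1) - out.at(x - 1, y - 1);
    return out;
}

// Sum of the pixels in the rectangle spanning (x0,y0)..(x1,y1) inclusive,
// computed as D - C - B + A from the four corners of the Integral Image.
long long rectangleSum(const GrayImage& iint,
                       long long x0, long long y0, long long x1, long long y1) {
    return iint.at(x1, y1) - iint.at(x0 - 1, y1)
         - iint.at(x1, y0 - 1) + iint.at(x0 - 1, y0 - 1);
}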

Adaboost

A detailed description of the Adaboost algorithm can be found in the work by Freund et al. (1996).

The Haar-like Features need to have meaningful positions and dimensions in order to describe some sort of feature in a face image, although it is not apparent which features to choose. Adaboost can be used for this task: Adaboost combines many so-called weak classifiers (in this case the Haar-like Features) into one strong classifier. In this way, each possible position of a Haar-like Feature makes up one weak classifier.

A weak classifier h_j(x) of a face image x has an associated threshold θ_j, the response of a Haar-like Feature f_j(x), and an integer s_j defining the direction of the inequality sign associated with it. s_j is needed because the value of f_j(x) needs to be bound by either a maximum or a minimum, depending on what kind of face feature f_j(x) tries to model.

$$h_j(x) = \begin{cases} 1 & \text{if } s_j f_j(x) < s_j \theta_j \\ 0 & \text{otherwise} \end{cases}$$

Equation 7. The weak classifier used by Adaboost.

The useful Haar-like Features are searched for with the Adaboost algorithm to construct a strong classifier from all possible Haar-like features in the face image. The weak classifiers hj(x) are only required to perform better than chance for the Adaboost algorithm to be able to combine them into a strong classifier.
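A minimal sketch of the weak classifier in Equation 7 is given below, with illustrative names: featureResponse is the Haar-like Feature value f_j(x), theta the learned threshold and s (+1 or -1) the direction of the inequality.

// Weak classifier of Equation 7: returns 1 when the feature response
// lies on the face-like side of the threshold, 0 otherwise.
int weakClassifier(double featureResponse, double theta, int s) {
    return (s * featureResponse < s * theta) ? 1 : 0;
}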

Adaboost returns a linear combination of features. These are then used to create the cascade of classifiers, which can be seen in Figure 2.6. The arrangement of the classifiers into this cascade is the key to the fast recognition time of this approach; how this is achieved is explained below.

The Cascade of Classifiers

A cascade of classifiers, here referred to as the Attention Cascade, is used to evaluate all the sub-windows in the scene to determine whether they contain a face or not. The Adaboost training returns this cascade of increasingly complex classifiers. The classifiers at the early stages of the cascade reject a significant number of negative examples early. This reduces the time spent evaluating negative sub-windows, because these will be discarded early in the cascade. Only sub-windows that are difficult to determine travel far in the cascade, and sub-windows that actually do contain a face will be classified as positive by all stages of the cascade.

Figure 2.6. An illustration of the Attention Cascade. The cascade consists of n simple classifiers of increasing complexity. If a classifier Ci does not reject a sub-window, it is passed along in the Attention Cascade. This reduces the time spent evaluating negative sub-windows, as only interesting sub-windows are sent forward in the cascade for further processing.
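The sketch below captures the control flow of such a cascade, assuming each stage is a boolean classifier of increasing complexity; stage internals and names are illustrative, not Mitsubishi's implementation.

#include <functional>
#include <vector>

// A rectangular sub-window of the scene image.
struct SubWindow { int x, y, width, height; };

// Each cascade stage accepts or rejects a sub-window.
using StageClassifier = std::function<bool(const SubWindow&)>;

// Attention Cascade: a sub-window is accepted as a candidate face only
// if every stage accepts it; rejection at an early stage stops further
// processing, which is what makes the cascade fast on negative windows.
bool attentionCascade(const SubWindow& window,
                      const std::vector<StageClassifier>& stages) {
    for (const StageClassifier& stage : stages) {
        if (!stage(window))
            return false;   // rejected early, no further processing
    }
    return true;            // passed all stages: candidate face
}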

Mitsubishi’s Demo Application returns the biggest sub-window classified as a candidate image by the attention cascade. This sub-window is then normalized and re-sampled to the dimensions required by the Gender Classifier and then the resulting image is finally classified by gender.

The Attention Cascade gives very fast and accurate face recognition, and the architecture easily lends itself to implementation on embedded devices. One common application of this technique is in digital cameras, where it is used to locate faces in an image before a picture is captured, in order to correctly set the lens’s focal point.

This approach is also applicable to other tasks besides face recognition; Lalonde et al. (2006) used a similar cascade of increasingly complex classifiers for spotting text in video frames. Here the cascade was used to locate interesting regions for further processing with an appropriate Optical Character Recognition technique.

Page 23: Implementing LVQ for Age Classification - KTH · Implementing LVQ for Age Classification ... att LVQ ger bra resultat trots sin enkla struktur, ... final thesis project at Mitsubishi

Methods and Architecture

17

3 Methods and Architecture

This chapter details the implementation of the chosen methods and the general architecture of the Age Classifier. The general architecture of the training and classification is described here. In Section 3.1 the sample data is thoroughly defined and the approach of decomposing the problem using Face Regions is explained. Section 3.2 deals with the LVQ algorithm chosen as the decision rule. Section 3.3 explains the optional pre-processing step employing Principal Component Analysis. Finally, some comments are given on how the LVQ Classifier was integrated into the Demo Application.

3.1 Face Image Definitions

The dataset contains 324 images of males, all of Japanese ethnicity. Each sample image in the dataset belongs to one of four different classes, defined as four generations:

• Generation 20 – ages 20 to 29
• Generation 30 – ages 30 to 39
• Generation 40 – ages 40 to 49
• Generation 50 – ages 50 to 59

Further properties of the faces are that they are always expressionless and have no glasses or other discriminating features (like a significant amount of facial hair or scars). The faces are aligned and scaled to maintain the same spatial proportions between the eyes and mouth in all images.

The dimensions of the images are 64 x 80 pixels (width x height); the bottom left corner is defined as the origin at (0,0).

3.1.1 Face Regions

In order to decompose the problem of classifying an entire face image, multiple sub-windows of the same face image can be used for training and classification. In this Thesis these sub-windows will be referred to as Face Regions. This cropping reduces the inherent variance that comes from dealing with the full face image.

Another benefit of using Face Regions is that attention can be focused on interesting areas of the images. Although this has to be done manually by selecting appropriate dimensions for the Face Regions, the images are relatively uniform and it is easy to find these interesting regions.

The most discriminating part of a face image is the central part of the face, as found by Heseltine et al. (2002); this is also intuitively plausible (it is the location of the eyes, nose, mouth, etc.), and the Face Regions used will therefore be located around these parts of the face.

Even though the Face Regions represent sub-windows of the original image, they do not have to be separate; the regions can overlap, as seen in the example regions in Figure 3.1. However, the overlapping should be kept to a minimum, since if the overlap is too great the Face Regions essentially represent the same area.


Figure 3.1. (left) A face with Eyes Region and Right Face Strip Region depicted. (right) The defining points of a Face Region

Table 3.1. Definitions of Face Regions in the sample images (bounding boxes).

Region             xstart  ystart  xend  yend
Eyes                   10      45    53    64
Nose                   10      35    53    44
Mouth                  10      10    53    35
Right Face Strip       32      10    53    64
Left Face Strip        10      10    31    64
Eyes With Hair         10      45    53    79
Big Face Region        10      10    53    54

The Face Regions defined in Table 3.1 will be evaluated in Section 4.1 before the choice is made of which regions to use in the final classifier. The program is constructed in such a way that the Face Regions can easily be modified with respect to their dimensions. Also, to make any combination of face features possible, the number of Face Regions to train and classify by can be varied freely.
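A sketch of how such Face Region definitions might be represented and cropped out of a face image is given below. The storage layout (row-major with the origin in the bottom left corner, inclusive coordinates) and all names are assumptions for illustration.

#include <cstddef>
#include <vector>

// A Face Region bounding box in the coordinate system of Table 3.1.
struct FaceRegion {
    const char* name;
    int xStart, yStart, xEnd, yEnd;
};

// Extract the pixels of a region from a face image stored row-major,
// with row 0 at the bottom and both coordinate bounds inclusive.
std::vector<float> cropRegion(const std::vector<float>& image,
                              std::size_t imageWidth,
                              const FaceRegion& region) {
    std::vector<float> pixels;
    for (int y = region.yStart; y <= region.yEnd; ++y)
        for (int x = region.xStart; x <= region.xEnd; ++x)
            pixels.push_back(image[static_cast<std::size_t>(y) * imageWidth + x]);
    return pixels;
}

// Example region definitions taken from Table 3.1.
const FaceRegion kEyes  = {"Eyes",  10, 45, 53, 64};
const FaceRegion kMouth = {"Mouth", 10, 10, 53, 35};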


3.2 The LVQ Learner

According to the requirements of simplicity and compatibility with the existing architecture stated in Section 1.2, the decision algorithm was to be implemented in C++. However, this does not prevent the training algorithm from being written in another language. Therefore the goal of the implementation should be a modular approach where the decision rule can easily be separated and integrated into the architecture of the Demo Application. In order to achieve this, information about the Face Regions and the corresponding Feature Vectors needs to be exported in a pre-defined format after training, making the classifier completely independent of the training architecture.

Principal Component Analysis will be implemented as a pre-processing step, but because of the complexity of such an algorithm that part of the training will be implemented in the Matlab language, which already has a good library for performing PCA on sets of data. This will also allow for easier visualization of the results from pre-processing. For simplicity, the training and testing mechanics for the LVQ algorithm with PCA pre-processing will also be implemented in the Matlab language. It is important to note that by keeping the classifier general enough, it can easily be modified to incorporate a PCA pre-processing step.

The goal is to create and evaluate a Classifier using the LVQ decision rule, with an optional PCA pre-processing step. Finally, a classification component that is found to work well should be integrated – together with its parameters – into the Demo Application for demonstration purposes.

3.2.1 Implementing the LVQ algorithm

The specific algorithm chosen was LVQ2.1. Both training and testing of the LVQ2.1 algorithm without any pre-processing step will be implemented in C++.

The general architecture is shown in Figure 3.2 and an explanation of the various components is given below.


Figure 3.2. A schematic illustration of the LVQ Training and Classification architecture.

LVQ Learners

One Learner is created for each Face Region in the Face Definitions. A set of Feature Vectors Vi is trained for each Learner Li.

LVQ Classifiers

The LVQ Classifiers are defined by the face regions and their corresponding set of Feature Vectors.

Training Samples

This is the training partition of the available samples. The batch of training samples is given to the program which extracts the regions and then shows them to the LVQ Learners.


Test Samples

The test samples shown are from the test partition. Any grayscale image is accepted; the dimensions have to be 80 x 64 pixels.

Region Definitions

These are the definitions of the Face Regions to be used during both training and testing. They define how many LVQ Learners and LVQ Classifiers to initialize for training and classification, respectively.

3.2.2 Modifications to the LVQ2.1 algorithm

In order to better accommodate a multitude of classes it might be possible to modify the LVQ2.1 update rule. How these modifications affect the margin maximizing properties of the LVQ2.1 update rule will not be investigated in this thesis; however, the change in performance will be evaluated. This version of the LVQ algorithm will from now on be referred to as Modified LVQ2.1, and its performance is documented in Section 4.2.1.

Update Conditions

The LVQ2.1 update rule is very restrictive as to when an update is allowed to take place. The update rule states that the Feature Vectors may be updated only when the two Feature Vectors closest to the shown training sample x are of different classes. Also, one of the Feature Vectors has to classify the sample correctly, and the sample has to lie within the window of the two Feature Vectors.

The criterion that one of the two closest Feature Vectors must be of the same class as a shown sample x becomes less and less likely to be met as more classes are used. This is especially true in the earlier stages of training, when the Feature Vectors have not yet approached their optimal positions.

The modification made to the LVQ2.1 update rule is that if the two closest Feature Vectors are of different classes but neither of them classifies the sample x correctly, the normal LVQ1 update rule is used (on only the winning Feature Vector).

Feature Vector Movement

Another change in the update rule is the amount by which Feature Vectors are moved when updated. The original LVQ2.1 update rule states that Feature Vectors should always be moved by an equal amount when moving towards (a correct classification) or away from (a wrong classification) a sample.

If the sample space is polluted (class regions intersect), there is a risk that Feature Vectors on average move away from the entire sample cluster and never obtain meaningful positions.

To counter this risk, the amount of movement when moving away from a sample (wrong classification) will be 1/3 of the movement when moving towards a sample (correct classification).
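A sketch of this asymmetric movement is given below, with illustrative names: a correct winner is attracted by the full step alpha, while an incorrect one is repelled by only a third of that step.

#include <cstddef>
#include <vector>

// Modified movement rule: attract towards a correctly classified
// sample by alpha, repel from an incorrectly classified sample by
// alpha / 3, to keep Feature Vectors from drifting away from a
// polluted sample cluster.
void moveFeatureVector(std::vector<float>& featureVector,
                       const std::vector<float>& sample,
                       bool correctClass, float alpha) {
    const float step = correctClass ? alpha : -(alpha / 3.0f);
    for (std::size_t k = 0; k < featureVector.size(); ++k)
        featureVector[k] += step * (sample[k] - featureVector[k]);
}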

3.2.3 Starting Values

Feature Vectors are initialized to the class average. In order to spread out the Feature Vectors while at the same time keeping them close to their corresponding class regions, random noise can be added before training starts.

Page 28: Implementing LVQ for Age Classification - KTH · Implementing LVQ for Age Classification ... att LVQ ger bra resultat trots sin enkla struktur, ... final thesis project at Mitsubishi

Methods and Architecture

22

3.2.4 Format of the Feature Vectors

After a training session is complete, the Feature Vectors need to be stored for later retrieval by the classifier. Each Feature Vector is an n-dimensional array of float values representing a point in the feature space; when dealing with Face Regions, n is usually on the order of a thousand dimensions. With four classes and four to five Feature Vectors per class as a typical maximum, the total number of Feature Vectors remains low and does not present an unmanageable amount of data. Therefore the Feature Vectors are stored as raw data in individual files; a minimal sketch of such a storage format is given after the list below. The information to store is:

• The dimension and location of the Face Features on which the Feature Vector operates. This is information used by the LVQ Learner to create and train the Feature Vectors, later the same information is used to set up the LVQ Classifier.

• The properties of each individual Feature Vector. This data is created by the LVQ Learner and is stored to a pre-defined location for the LVQ Classifier to retrieve later.
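The sketch below shows one possible way of writing a single Feature Vector to its own binary file, holding the Face Region bounding box, the class label and the raw float values. The exact layout is an assumption for illustration, not the format used by the thesis code.

#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// One trained Feature Vector together with the Face Region it operates on.
struct StoredFeatureVector {
    std::int32_t xStart, yStart, xEnd, yEnd;  // Face Region bounding box
    std::int32_t classLabel;                  // e.g. generation 20/30/40/50
    std::vector<float> values;                // the Feature Vector itself
};

// Write the Feature Vector as raw binary data to an individual file.
bool saveFeatureVector(const StoredFeatureVector& fv, const std::string& path) {
    std::ofstream out(path, std::ios::binary);
    if (!out) return false;
    out.write(reinterpret_cast<const char*>(&fv.xStart), sizeof(fv.xStart));
    out.write(reinterpret_cast<const char*>(&fv.yStart), sizeof(fv.yStart));
    out.write(reinterpret_cast<const char*>(&fv.xEnd),   sizeof(fv.xEnd));
    out.write(reinterpret_cast<const char*>(&fv.yEnd),   sizeof(fv.yEnd));
    out.write(reinterpret_cast<const char*>(&fv.classLabel), sizeof(fv.classLabel));
    const std::int32_t count = static_cast<std::int32_t>(fv.values.size());
    out.write(reinterpret_cast<const char*>(&count), sizeof(count));
    out.write(reinterpret_cast<const char*>(fv.values.data()),
              static_cast<std::streamsize>(count * sizeof(float)));
    return static_cast<bool>(out);
}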

3.2.5 Classification

By handling the Feature Vectors in this way, the LVQ Classifier can easily support additional pre-processing by PCA, because the core classifier does not depend on the dimensionality of the feature space in which it operates. Therefore any type of Feature Vectors can be used by the LVQ Classifier as long as they are properly defined. The additional component needed for PCA pre-processing is the Score Matrix S, used to transform the test samples into the subspace in which the Feature Vectors reside. Details of these additional components are given in Section 3.3.

With this additional matrix multiplication of the samples, the LVQ Classifier can classify samples in PC-space. The restriction imposed by the requirement of a simple decision rule is still reasonably upheld, because the only extra step required in the decision rule is a matrix multiplication to transform the samples into the new PC-space. As an added bonus, the Feature Vectors shrink when the dimensionality is lowered: instead of containing hundreds of elements when dealing with raw pixel data, a Feature Vector now has tens of elements in PC-space.

Figure 3.3. Structure of the decision algorithm. Each Classifier Ci classifies Region Ri by the Nearest-neighbor rule. The final classification is obtained by a majority vote over the n region classifiers.
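The decision rule of Figure 3.3 can be sketched as follows. The helper names and the optional Score Matrix argument are illustrative, centring before projection is omitted (it does not change nearest-neighbour distances as long as the Feature Vectors use the same convention), and ties in the vote are broken arbitrarily since this section does not specify how they are resolved.

import numpy as np
from collections import Counter

def classify_region(region, protos, proto_labels, score_matrix=None):
    """Nearest-neighbor classification of one Face Region.

    If a Score Matrix is given, the region is first projected into PC-space;
    the Feature Vectors are then assumed to live in that same subspace.
    """
    x = region if score_matrix is None else score_matrix.T @ region
    winner = np.argmin(np.linalg.norm(protos - x, axis=1))
    return proto_labels[winner]

def classify_face(regions, classifiers):
    """Majority vote over the per-region classifiers (C1 ... Cn).

    regions     : list of flattened Face Region vectors extracted from a face
    classifiers : list of (protos, proto_labels, score_matrix) tuples
    """
    votes = [classify_region(r, p, l, s)
             for r, (p, l, s) in zip(regions, classifiers)]
    return Counter(votes).most_common(1)[0][0]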


3.3 PCA Pre-processing

In order to avoid the complex task of implementing Principal Component Analysis in the C++ language, the choice was made to use Matlab for the development and testing of the PCA pre-processing.

When PCA pre-processing is used, the Feature Vectors obtained from training have been trained in PC-space, so the test samples must also be converted into this subspace before any classification can be done. This is done by importing the Score Matrix S obtained from the Principal Component Analysis of the original training set. Therefore the LVQ Classifier needs both a Score Matrix and a set of Feature Vectors for initialization.

This means that not much modification of the original architecture is needed in order to support PCA pre-processing. PCA is performed on the training data to find S, and a module that transforms samples using S is added to both the LVQ Learner and the LVQ Classifier. The architecture of the program with PCA pre-processing added can be seen in Figure 3.4. A description of the various components follows below.

Figure 3.4. The general architecture of LVQ classification and training with a PCA Pre-processing step.

Train Samples/Test Sample

These are all the available samples partitioned into train and test groups. The train samples are used by the PCA step to construct the Score Matrix. The samples can then be transformed for training and classification by the LVQ Learner and the LVQ Classifier, respectively.

PCA

The PCA step finds the best linear transform for the samples. The transformation matrix from the analysis, the Score Matrix S, is used to transform the raw training samples before training and the test samples before classification (only the most significant Principal Components are used for transformation, in order to reduce the dimensionality of the Face Regions). This step was implemented in Matlab; an equivalent computation is sketched after these component descriptions.

PCA Pre-processing

The PCA Pre-processing transforms a cropped sample image into PC-space by applying the transformation defined by the Score Matrix.

Score Matrix

The Score Matrix is the transformation matrix for the raw data. It is used by the PCA Pre-processing step.

LVQ Learner

The Learner accepts a batch of training samples. Regions are first transformed into PC-space by the Score Matrix before training takes place.

LVQ Classifier

The LVQ Classifier loads the set of Feature Vectors and then extracts the regions from the test images. When PCA is used for pre-processing the extracted region is first transformed into PC-space by the Score Matrix. Classification is then done by the Feature Vectors positioned in the subspace of these Principal Components.
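The Score Matrix itself is computed in Matlab in the thesis; the following NumPy sketch shows an equivalent computation under the assumption that the Score Matrix is simply the matrix of the k most significant eigenvectors of the training data's covariance matrix (function names and the centring convention are illustrative):

import numpy as np

def fit_score_matrix(train_samples, k):
    """Compute a (d, k) Score Matrix from flattened training regions.

    train_samples : (n, d) array, one flattened Face Region per row
    k             : number of Principal Components to keep (10-30 here)
    Returns the mean vector and the matrix of the k leading eigenvectors.
    """
    mean = train_samples.mean(axis=0)
    centered = train_samples - mean
    cov = centered.T @ centered / (len(train_samples) - 1)
    # For very high-dimensional regions the eigenvectors can instead be
    # obtained from the smaller Gram matrix (the "snapshot" method).
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    score_matrix = eigvecs[:, ::-1][:, :k]       # keep the k most significant PCs
    return mean, score_matrix

def to_pc_space(samples, mean, score_matrix):
    """Project samples into PC-space; used before both training and testing."""
    return (samples - mean) @ score_matrix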

3.4 Integration with Demo Application

The classifier can be integrated into the Demo Application by creating a wrapper class responsible for initializing all the LVQ Classifiers by reading the Face Region definitions and the corresponding Feature Vectors. The Feature Vectors used are the ones obtained from an earlier training session. The parameter settings that gave the best classification performance will be used in the Demo Application.

To display the results from classification, the Demo Application's interface is extended to display an age classification along with the already existing gender classification.


4 Results

This chapter contains a compilation of the most relevant results.

Section 4.1 documents the classification performance of single Face Features; this is a preliminary evaluation of Face Regions in order to motivate the choice of Face Features for the final implementation. Section 4.2 contains the results of the LVQ2.1 algorithm without any pre-processing. In Section 4.3, PCA is used as a pre-processing step before classification is made with the LVQ2.1 decision rule. Finally, in Section 4.4 an observation is made about the age classifier's performance in a live setting.

All test setups use the same batch of samples (324 images) partitioned into train and test samples at a 90/10 ratio (292 train images, 32 test images). The partition is random in each new training session, but samples of each generation are always evenly distributed across the train and test sets.

Common to all tests is that, in order to make better use of the available samples, phase-shifting was used to create 600 extra images from the original training set by randomly shifting the original images by 1 pixel.
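A sketch of this sample handling, assuming a per-generation (stratified) 90/10 split and a 1-pixel shift in a random direction for the augmentation; the use of np.roll (which wraps at the image border) and all names are illustrative simplifications:

import numpy as np

def stratified_split(images, labels, test_fraction=0.1, seed=None):
    """Random 90/10 split with each generation evenly represented."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        n_test = int(round(test_fraction * len(idx)))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)

def phase_shift(images, labels, n_extra=600, seed=None):
    """Create extra training images by shifting originals by 1 pixel."""
    rng = np.random.default_rng(seed)
    directions = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    extra_imgs, extra_labels = [], []
    for _ in range(n_extra):
        i = rng.integers(len(images))
        dy, dx = directions[rng.integers(len(directions))]
        extra_imgs.append(np.roll(images[i], shift=(dy, dx), axis=(0, 1)))
        extra_labels.append(labels[i])
    return np.array(extra_imgs), np.array(extra_labels)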

Performance is always measured by the ratio of correctly classified images in the Test Set.

4.1 Face Region Evaluation

Using the Face Region definitions from Table 3.1, the LVQ Learner is set up to train and test one Face Region at a time.

The initial learning rate αstart was set to 0.02 and the LVQ Learners were trained for 2000 epochs, which seemed to be enough time for the Feature Vectors to converge, as α becomes too low to allow further significant movement. Three Feature Vectors were used for each class, and the Feature Vectors were initialized to the class average before training started. The results in Table 4.1 are averaged over 50 training and test sessions.

Table 4.1. Performance of single face features, averaged over 50 training and test sessions.

Region              Performance
Eyes                0.45
Nose                0.40
Mouth               0.43
Right Face Strip    0.42
Left Face Strip     0.43
Eyes With Hair      0.40
Big Face Region     0.36


4.2 LVQ 2.1

Here multiple Face Features will be used for the classification of a single face. Motivated by the preliminary results obtained in Section 4.1, the four regions chosen were: Eyes, Mouth, Right Face Strip and Left Face Strip. This means that there are four sets of Feature Vectors for each image; the sets vary in size depending on how many Feature Vectors per class are used.

Two versions of LVQ2.1 will be evaluated in this section: the original LVQ2.1 and the Modified LVQ2.1, both described earlier in Section 2.3.

4.2.1 Original LVQ2.1

The Feature Vectors are all initialized to the class average as in Section 4.1, but before training starts noise with a maximum range of ±2% is added to the Feature Vectors. The sets of Feature Vectors for each Face Region are then trained independently, but on the same batch of training samples. The initial learning rate αstart is set to 0.02 for each set.

The results displayed in Table 4.2 are averaged over 50 training and test sessions. The standard deviation of the performance with this setup is around 0.15.

Table 4.2. The unmodified implementation of LVQ2.1 trained with an increasing number of epochs.

             Feature Vectors per Class
Epochs       1       2       3       4
1 000        0.43    0.44    0.48    0.46
2 000        0.38    0.44    0.42    0.35
4 000        0.40    0.40    0.28    0.31


Figure 4.1. The unmodified implementation of LVQ2.1 trained with an increasing number of epochs (1000, 2000 and 4000); correct classifications plotted against Feature Vectors per class. (The figure corresponding to Table 4.2.)

When more noise is added to the Feature Vectors before training, overall performance worsens (see Figure 4.2). The final value of the learning rate α also decreases less when more noise is added, as can be seen in Table 4.3.


Figure 4.2. The unmodified implementation of LVQ2.1 trained for 1000 epochs with random noise (±2%, ±5%, ±10% and ±15%) added to the Feature Vectors before the training step; correct classifications plotted against Feature Vectors per class.

Table 4.3. Average values of the learning rate α after training for 1000 epochs, αstart = 0.02.

             Feature Vectors per Class
Noise        1         2         3         4
2%           0.0031    0.0034    0.0041    0.0036
5%           0.0040    0.0039    0.0049    0.0057
10%          0.0054    0.0078    0.0200    0.0200
15%          0.0200    0.0200    0.0200    0.0200

4.2.2 Modified LVQ2.1

The Feature Vectors are initialized in the same way as in the previous test setup. First a fixed amount of noise of ±10% is added, then the learner is trained for varying numbers of epochs. The results obtained can be viewed in Figure 4.3.

The standard deviation of the performance decreases with the Modified LVQ2.1 algorithm, from around 0.15 with the standard LVQ2.1 algorithm down to around 0.08.


Figure 4.3. The Modified LVQ2.1 trained with an increasing number of epochs (1000, 2000, 4000 and 10 000); correct classifications plotted against Feature Vectors per class. Feature Vectors are initialized to the class average.

When more noise is added, performance stays high and the learning rate α decreases drastically even with many Feature Vectors per class, and does not exhibit the behavior observed in Table 4.3. After training for more than 10 000 epochs, the low value of α prevents any further significant movement of the Feature Vectors. Also, with too many Feature Vectors (six or more) it becomes hard for the LVQ Learner to stabilize, as α fails to decrease and its final value stays close to αstart.

The best performance achieved with noise added to the Feature Vectors before training is 54% correct classification, as can be seen in Table 4.4 and its corresponding plot in Figure 4.4. Here, three Feature Vectors per class with added noise of around ±10% give good results.

Table 4.4. Classification performance of the Modified LVQ2.1 update rule. Training was done for 4000 epochs, with an increasing amount of noise added.

             Feature Vectors per Class
Noise        1       2       3       4
2%           0.43    0.46    0.49    0.48
5%           0.43    0.47    0.50    0.51
10%          0.45    0.45    0.54    0.51
15%          0.43    0.43    0.52    0.46


Figure 4.4. The Modified LVQ2.1 trained for 4000 epochs with varying amounts of noise (±2%, ±5%, ±10% and ±15%) added to the Feature Vectors before the training step; correct classifications plotted against Feature Vectors per class.

4.3 Pre-processing with PCA

In this section the PCA pre-processing step will be evaluated. With PCA pre-processing, the choice of Face Regions must be re-evaluated, because PCA has the inherent ability to find discriminating features in an entire face image (Heseltine et al. 2002). Therefore it might be wise to use as much information as possible when performing PCA on faces, and the Big Face Region, which covers the main part of the face, will be used here (defined in Table 3.1). The reason the Big Face Region is used instead of the whole face image is that processing time during training becomes very long with the whole image, and the Big Face Region is assumed to retain most of the face characteristics while reducing the sample size.

Before the results from the PCA pre-processing are presented, the Eigenfaces that were obtained will be inspected. This was not part of the actual process of training and classification, but it is included here because it provides a good foundation for understanding the results, as well as a basis for the discussion about PCA pre-processing in Chapter 5.

4.3.1 The Eigenfaces

By normalizing the Principal Components obtained from the Principal Component Analysis of the Face Regions, the Eigenfaces can easily be visualized as images.


Figure 4.5. The first 10 Eigenfaces (Principal Components) from a test set. The Eigenfaces are normalized and mapped to a grayscale image representation. Some Eigenfaces describe common facial features (e.g. in (1) the dark areas at the top represent the hair on the head, and (10) models facial hair in the form of eyebrows and mustache).
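A minimal sketch of this visualization step, assuming each Principal Component is a flattened Face Region that is linearly rescaled to the 0-255 range and reshaped back to its image dimensions (names and scaling convention are illustrative):

import numpy as np

def eigenface_image(component, height, width):
    """Map one Principal Component (a flattened region) to an 8-bit image.

    The component is linearly rescaled so that its minimum becomes 0 and
    its maximum 255, then reshaped to the original region dimensions.
    """
    v = component.astype(np.float64)
    v = (v - v.min()) / (v.max() - v.min() + 1e-12)
    return (v * 255.0).round().astype(np.uint8).reshape(height, width)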

By viewing the Face Regions' new positions in the vector space of the Principal Components, some separation based on age can be observed.

However, the Principal Components do not stay the same between training sessions. In Figure 4.6, Principal Components 3 and 7 gave a noticeable separation of the samples based on age for one training session, but the separating components could be different in another session. Common to all training sessions, however, was that the Principal Components representing age could usually be found among the first 20 most significant Principal Components.

Searching for the specific Principal Components that may represent age lies outside the scope of this thesis; therefore the k most significant Principal Components are given to the LVQ Learner, where k is in the range of 10 to 30.


Figure 4.6. The training samples transformed into the vector space of Principal Components (Eigenfaces) 3 and 7. Here there is a minor separation between generation 20 (circles) and generation 50 (stars) but the class regions clearly intersect.

4.3.2 Results with PCA Pre-processing

All the training samples are used to create the Score Matrix for sample transformation. The training samples are then transformed into the desired vector space by keeping only the most significant eigenvectors in the Score Matrix. Finally, in the new vector space, positions for twelve Feature Vectors (three per class) are searched for by the Modified LVQ2.1 algorithm. The Feature Vectors are initialized to the class average with ±10% random noise added, the initial learning rate αstart is set to 0.01, and training is done for 4000 epochs. The dimensionality of the samples was reduced to 10, 15, 20 and 30.

The test samples are then transformed into the same vector space and classified by the trained Feature Vectors; Table 4.5 contains the average performance over 10 training sessions.

Table 4.5. Results after training on the Big Face Region with reduced dimensions.

Dimensions     10      15      20      30
Performance    0.32    0.33    0.27    0.29

Attempting to do PCA on a subset of the training data with greater age variation (e.g. using only the 20 and 50 generations in the dataset), in order to force a more significant separation by age, does not improve performance; results remain at around these levels.


4.4 Testing of the Demo Application

No thorough performance evaluation of the Demo Application was done, only an observation about its performance. As noted earlier in Section 1.2, the restriction to frontal faces does not make the Generation Classifier robust in live settings.

Sample images were inserted into an artificial scene image where the face was positioned on a white background. This is a very easy task for the face recognizer, but its performance is not of interest in this evaluation. The images extracted from this easy scene image were then classified by the Modified LVQ2.1 decision rule, since it performed best in the previous tests.

Observations showed that the classification of a test image extracted from such an easy scene image was the same as when classification was done directly on the test image. However, age classification is somewhat erratic when performed on real persons (images captured by web-camera), especially if the pose of the face is not frontal.


5 Discussion

This chapter contains an analysis and discussion of the results obtained in the previous chapter. First, the difficulty of the problem of age classification using LVQ is examined. Since the approach without PCA pre-processing was shown to be the most successful, the discussion will mainly focus on those findings. The poor performance of the PCA pre-processing step is then looked into. Finally, further improvements to this approach to age classification are presented in Section 5.4.

5.1 The Difficulty of Age Classification

Determining the age of an individual based only on the face presents a significant challenge, and it might not be surprising that the LVQ algorithm at best classifies only around 54% of the samples correctly.

It might not be reasonable to expect the LVQ algorithm to come close to perfect classification of the test samples, but it is hard to determine how well the LVQ algorithm performs given this complex task. Is 54% correct classification to be considered good or bad performance? This is hard to judge, because the performance of the LVQ algorithm is not measured against any other classification system in this thesis.

The lower bound for classification performance is 25%, which is equivalent to randomly classifying the samples by generation, or to classifying all samples into a single one of the four generations. When the Feature Vectors are only initialized to the class average and no training is done by the LVQ algorithm, essentially making one average template for each class, performance is already at 43%. Although this does not say much about how well LVQ works as a training algorithm, it gives a new lower boundary for classification using templates (Face Regions).

5.1.1 Human performance

But how high a performance can be expected from a classification system operating on these images? One way of finding out what level can be considered good performance is to compare against the level of correct classifications made by a human.

This measurement of human performance was not conducted in a thorough manner and is therefore not part of the results in Chapter 4. It is only brought up here to serve as a basis for discussing the LVQ algorithm's performance.

The test issued to a human subject consists of two parts: (1) Training – the subject is shown a few samples from each generation. (2) Testing – the subject is shown 40 unknown images with 10 images from each generation and is asked to label them according to generation.

When the test is given to persons of Japanese nationality, who would arguably have an easier task of determining the age of images of people of Japanese ethnicity, the rate of correct classification of the face images is usually around 75%.

This shows that the performance of the LVQ algorithm at 54% is lower than what can be expected from a human performing the same task, but not a totally disappointing figure either. In this regard, however, the use of only grayscale images seems to limit the maximum possible performance of any classification system.

5.2 Strict Class Boundaries

It might well be that the classification problem is too difficult to be solved using grayscale face images alone. But the separation into four distinct classes (Generation 20, 30, 40 and 50) further complicates the task of finding good positions for the Feature Vectors with the LVQ algorithm. This strict separation is not optimal when considering samples that lie on the class borders.

If, for example, an image of a 49-year-old person is classified as belonging to generation 50, that mistake is treated the same as if it had been an image of a 40-year-old. The information about exact age is not available to the LVQ algorithm, as the samples are partitioned into generations and no other age information is available.

During training this makes it hard for the Feature Vectors to stabilize, because samples that lie on the class border might not be far away from the opposing class. With a more continuous representation of the sample data this problem might be avoided. In Figure 5.1 it is apparent that even when a sample is classified incorrectly, it is often assigned a class close to its correct label.

Figure 5.1. A plot of the test results after PCA Pre-processing. Along the x-axis lie the 32 test samples. The y-axis is the age of a sample. The age classification of a sample done by the LVQ-algorithm is marked by a star. The circles represent which generation the sample actually belongs to.

5.3 The Eigenfaces of Age

The poor performance with PCA pre-processing might suggest that it is not a good method for finding the characteristics of a face that represent age. However, the inferior performance compared to when no pre-processing is used might depend on which of the Eigenfaces are chosen.

As stated earlier in Section 2.4, when using Eigenfaces for face classification the choice of which Eigenfaces to keep in order to best separate the data becomes very important. Due to time constraints and the complexity of searching for the appropriate Eigenfaces, only the most significant Eigenfaces were kept instead. The assumption was made that LVQ would successfully stabilize at meaningful positions in this new vector space of Eigenfaces.

But the failure to position the Feature Vectors is most likely due to the fact that, in the whole set of Eigenfaces, there are still many more Eigenfaces that say very little about age than Eigenfaces that represent age. The noise from these unimportant Eigenfaces, which may represent other facial characteristics not specific to age, prevents the LVQ Learner from stabilizing the Feature Vectors.

5.4 Future Improvements

As a general note, the total number of samples available, 324, is a very modest amount. Increasing the number of available samples cannot in itself be considered an improvement to the method, but it would help to better evaluate the feasibility of the approach used in this thesis and of any further improvements made to it.

The main improvement suggested here has to do with the search for Eigenfaces that are descriptive of the age features of a face.

5.4.1 Searching for Discriminating Eigenfaces

Although the limit of what LVQ can achieve on the Face Regions' raw pixel data has probably been reached, it is probable that further effort put into the PCA pre-processing step could help improve classification performance. Specifically, a search for the Eigenface or Eigenfaces that best represent age could be implemented after PCA is performed on the training data.

Attempts were made to find these Eigenfaces by providing only samples from generation 20 and generation 50, thereby making the age variance in this new sample group higher. Unfortunately this method did not find better Eigenfaces, as classification performance on the test samples remained the same.

It is not apparent how this search should be done, but as seen in Figure 4.6, some Eigenfaces do display some separation by age among the samples. It might well be that more Eigenfaces of this kind exist among all the Eigenfaces found by PCA. If these age-related Eigenfaces could be found, the LVQ algorithm would have an easier task of finding the Feature Vector positions in a subspace where only the Eigenfaces of age are the basis vectors.
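One conceivable way to sketch such a search, not taken from the thesis, is to score each Principal Component by the ratio of between-generation to within-generation variance of the projected training samples and keep the highest-scoring components. The following heuristic is purely illustrative:

import numpy as np

def age_separation_scores(projections, labels):
    """Score each Principal Component by between-/within-class variance.

    projections : (n, k) training samples projected onto k Eigenfaces
    labels      : (n,)   generation label of each sample
    Returns one score per component; higher means better age separation.
    """
    overall_mean = projections.mean(axis=0)
    between = np.zeros(projections.shape[1])
    within = np.zeros(projections.shape[1])
    for c in np.unique(labels):
        cls = projections[labels == c]
        between += len(cls) * (cls.mean(axis=0) - overall_mean) ** 2
        within += ((cls - cls.mean(axis=0)) ** 2).sum(axis=0)
    return between / (within + 1e-12)

# Example: keep the 10 Eigenfaces that separate the generations best.
# best = np.argsort(age_separation_scores(Z_train, y_train))[::-1][:10]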


6 Conclusion

The conclusion of this thesis is that although the LVQ algorithm is very simple by design, it performs fairly well at the task of age classification. However, it by no means constitutes a robust classification system. It should also be noted that this performance is achieved with only 324 face samples, which is a significant limitation on the performance of the LVQ algorithm.

Although PCA pre-processing does not improve the performance of the original LVQ algorithm as it is implemented in this thesis, it is believed that additional effort put into the search for Principal Components that represent age would make PCA a good choice for pre-processing Face Regions. This, coupled with more face samples to train on, would probably further improve performance and make the approach a good candidate for a simple, fast and accurate age classification system.


Bibliography

Y. ADINI, Y. MOSES AND S. ULLMAN. 1997. Face Recognition: the Problem of Compensating for Changes in Illumination Direction. In IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, pp. 721-732.

AT&T LABORATORIES. The ORL Database of Faces. http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html Last Accessed on June 17th 2007.

C. BURGES. 1996. Simplified Support Vector Decision Rules. In International Conference on Machine Learning, pp. 71-77

Y. FREUND AND R. SCHAPIRE. 1996. Experiments with a New Boosting Algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference, pp. 148-156

R. GILAD-BACHRACH, A. NAVOT AND N. TISHBY. 2004. Margin Based Feature Selection – Theory and Algorithms. In Proc. 21'st International Conference on Machine Learning (ICML)

T. HESELTINE, N. PEARS AND J. AUSTIN. 2002. Evaluation of image pre-processing techniques for eigenface based face recognition. In Proceedings of the Second International Conference on Image and Graphics, SPIE vol. 4875, pp. 677-685.

M. JONES AND P. VIOLA. 2001. Rapid Object Detection using a Boosted Cascade of Simple Features. In Conference on Computer Vision and Pattern Recognition

T. KOHONEN, J. HYNNINEN, J. KANGAS, J. LAAKSONEN AND K. TORKKOLA. 1995. LVQ PAK: The Learning Vector Quantization Program Package. Version 3.1. Laboratory of Computer and Information Science. Helsinki University of Technology.

M. LALONDE AND L. GAGNON. 2006. Key-text spotting in documentary videos using Adaboost. Proceedings of the SPIE, Volume 6064, pp. 507-514

R. LIENHART AND J. MAYDT. 2002. An Extended set of Haar-like Features for Rapid Object Detection. In Proc. ICIP, pp. 900-903

J. LIU, X. HUANG, Y. WANG, M. JORGE S., P. NICOLAS, P. PEDRO. 2005. Removing shadows from face images using ICA, In Iberian conference on pattern recognition and image analysis, vol. 3523, pp. 703, ISBN 3-540-26153-2

Page 45: Implementing LVQ for Age Classification - KTH · Implementing LVQ for Age Classification ... att LVQ ger bra resultat trots sin enkla struktur, ... final thesis project at Mitsubishi

Bibliography

39

E. OSUNA, R. FREUND AND F. GIROSI. 1997. Training Support Vector Machines: an Application to Face Detection. In Proceedings of CVPR.

A. O’TOOLE, H. ABDI, K. DEFFENBACHER AND D. VALENTIN. 1993. Low-dimensional representation of faces in higher dimensions of the face space. In Journal of the Optical Society of America A, vol. 10, pp. 405-410

A. O'TOOLE, H. BÜLTHOFF, N. TROJE, T. VETTER. 1995. Face Recognition across Large Viewpoint Changes. Proceedings of the International Workshop on Automatic Face and Gesture Recognition. Pp. 326-331

C. PAPAGEORGIOU, M. OREN AND T. POGGIO. 1998. A general framework for object detection (abstract only). In International Conference on Computer Vision. pp. 555.

S. SEO AND K. OBERMAYER. 2006. Dynamic Hyperparameter Scaling Method for LVQ Algorithms. In IJCNN '06, International Joint Conference on Neural Networks, pp. 3196-3203.

L. SIROVICH AND M. KIRBY. 1987. Low-dimensional procedure for the characterization of human faces. In J. Opt. Soc. Am. A, vol. 4, pp. 519-524.

L. SMITH. 2002. A tutorial on Principal Components Analysis. http://csnet.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf Last Accessed on June 11th 2007.

K. SUNG AND T. POGGIO. 1994. Example-based Learning for View-Based Human Face Detection. In AI Memo No. 5121, MIT A.I. Lab., December 1994.

WOLFRAM MATHWORLD. Haar Functions. http://mathworld.wolfram.com/HaarFunction.html Last Accessed on June 7th 2007.

Z. ZHU, T. MORIMOTO, H. ADACHI, O. KIRIYAMA, T. KOIDE AND H. J. MATTAUSCH. 2005. Multi-view Face Detection and Recognition using Haar-like Features. In Signal and Image Processing, pp. 479.
