face recognition

VISVESVARAYA TECHNOLOGICAL UNIVERSITY

BELGAUM-590014

Project Report

On

Face Recognition using Artificial Neural Networks

Submitted in partial fulfilment of the requirement for the award of degree of

BACHELOR OF ENGINEERING

In

ELECTRONICS & COMMUNICATION

By

Prashanth Namit

1PI05EC072 1PI05EC057

Under the Guidance of

Prof. Koshy George

PES Centre for Intelligent Systems; Dept. of Telecommunication Engineering

DEPARTMENT OF ELECTRONICS AND COMMUNICATION

PES INSTITUTE OF TECHNOLOGY BANGALORE 560 085

2008- 2009

PES INSTITUTE OF TECHNOLOGY BANGALORE 560 085

CERTIFICATE

Certified that the project work entitled Face Recognition Techniques using Neural Networks is a bona fide work carried out by Prashanth H, Namit Chauhan bearing USN 1PI05EC072 and 1PI05EC043 in partial fulfilment for the award of degree of BACHELOR OF ENGINEERING in ELECTRONICS & COMMUNICATION of Visvesvaraya Technological University, during the academic year 2008-2009. It is certified that all corrections/suggestions indicated for internal assessment have been incorporated in the report. The project report has been approved as it satisfies the academic requirements with respect to project work prescribed for the mentioned degree.

Dr Koshy George PES Centre for Intelligent Systems; Dept. of Telecom. Engg., PESIT

Dr A. Vijaykumar HOD,

Dept. of E&C Engg., PESIT

Dr K. N. B. Murthy Principal PESIT

Prashanth

1PI05EC072

Namit Chauhan

1PI05EC057

External Viva

Name of the Examiners Signature with Date

1._____________________ _____________________

2. _____________________ _____________________

ACKNOWLEDGEMENTS

We would like to express our sincere thanks to Dr. Koshy George, Professor, TE

Department, PESIT, Bangalore, for his kind and constant support and guidance during the

course of this project, without which the project would not have been a success.

We would like to thank and extend our heart-felt gratitude to Dr. A Vijaya Kumar,

HOD of the Department of Electronics and Communication, for his kind, inspiring and

illustrious guidance and ample encouragement.

We would like to thank Prof. D Jawahar, CEO, PES Group of Institutions, and

Dr. K N Balasubramanya Murthy, Director & Principal, PESIT, for the valuable

resources provided for the completion of the project.

We would like to express our sincere thanks and deep sense of gratitude to our parents

for their everlasting support.

And finally, it gives us immense pleasure to thank our friends and everyone who has

been instrumental in the successful completion of the project.

Table of Contents:

1 Face Recognition: An Introduction
1.1 Classification
1.1.1 Applications of face recognition
1.1.2 Adopted approach
1.2 Literature survey
1.2.1 Existing algorithms
1.2.2 Existing techniques
1.3 Organization of the report
2 Neural Networks: An Introduction
2.1 Multilayer Perceptrons
2.2 Derivation of Error back propagation algorithm
2.3 Activation function
2.4 Rate of learning
2.5 Modes of Training
2.6 Local minima of Error Surface
2.7 Supervised learning viewed as an optimization problem
2.8 Face recognition using Neural Networks
2.9 Conclusions
3 Principal Components Analysis: An Introduction
3.1 Derivation of Principal components
3.2 Face Recognition using Eigen faces
3.3 Euclidean Distance method for Classification
3.4 Conclusions
4 Support Vector Machines: An Introduction
4.1 Basics
4.2 Linear Support Vector Machines: The Separable Case
4.3 Quadratic Optimization for Finding the Optimal Hyperplane
4.4 The Karush-Kuhn-Tucker Conditions
4.5 The Non Separable Case
4.6 Nonlinear Support Vector Machines
4.6.1 Inner product Kernel
4.6.2 Mercer's Theorem
4.6.3 Summary of Inner-Product Kernels
4.7 Multiple Class Support Vector Machine
4.8 Binary Search Tree
4.9 SVM for Face Recognition
4.10 Conclusions
5 Image Preprocessing
5.1 Histogram Equalization
5.2 Median Filtering
5.3 Bi-Cubic Interpolation
5.4 Conclusions
6 Results
6.1 MLP
6.2 PCA
6.3 SVMs
6.4 PCA+MLP
7 Conclusions and Future work
8 References
Appendix

List of Figures

Figure 1.1: Applications of Face Recognition
Figure 1.2: Supervised Learning
Figure 1.3: Unsupervised Learning
Figure 1.4: (a) Good Generalization, (b) Bad Generalization
Figure 2.1: Block diagram of nervous system
Figure 2.2: Single-Layer Feedforward Networks, Multilayer Feedforward Networks, Recurrent Networks
Figure 2.3: Nonlinear model of a neuron
Figure 2.4: MLP with one hidden layer
Figure 2.5: Signal-flow graph highlighting the details of output neuron j
Figure 2.7: Signal-flow graphical summary of back-propagation learning. Top part of the graph: forward pass. Bottom part of the graph: backward pass.
Figure 2.8: Hyperbolic tangent function
Figure 2.9: Error surface
Figure 3.1: (a) The 1st PC z1 is a minimum distance fit to a line in X space. (b) The 2nd PC z2 is a minimum distance fit to a line in the plane perpendicular to the 1st PC.
Figure 3.2: Eigen faces
Figure 4.1: General structure of SVM
Figure 4.2: Linear separating hyperplane for separable case
Figure 4.3: Linear separating hyperplane for non-separable case
Figure 4.4: Mapping into higher space
Figure 4.5: Decision surface of Kernels
Figure 4.6: Decision surface using one SVM
Figure 4.7: Problem encountered when one SVM is used for multi-classification
Figure 4.8: Multi-classification using Binary classifiers
Figure 4.9: Binary tree structure for 8 classes
Figure 5.1: Image after Histogram Equalization
Figure 5.2: Median filtering
Figure 6.1: Face recognition accuracy with multilayer feedforward networks with histogram equalization and median filtering.


Figure 6.2: Face recognition accuracy with multilayer feedforward networks with only median filtering.

Figure 6.3: Face recognition accuracy with multilayer feedforward networks with no pre-processing

Figure 6.4: Face recognition accuracy with PCA with histogram equalization and median filtering.

Figure 6.5: Face recognition accuracy with PCA with only median filtering.

Figure 6.6: Face recognition accuracy with PCA and no pre-processing

Figure 6.7: Face recognition accuracy with PCA+MLP with histogram equalization and median filtering.

Figure 6.8: Face recognition accuracy with PCA+MLP with only median filtering.

Face Recognition using Artificial Neural Network

Dept. of ECE Jan June 09 Page 1

CHAPTER I

Face Recognition: An Introduction

Human beings have recognition capabilities that are unparalleled in the modern

computing era. These are mainly due to the high degree of interconnectivity, adaptive nature,

learning skills and generalization capabilities of the nervous system. The human brain has

numerous highly interconnected biological neurons which, on some specific tasks, can

outperform supercomputers. A child can accurately identify a face, but for a computer it is a

cumbersome task. Therefore, the main idea is to engineer a system which can emulate what a

child can do. Advancements in computing capability over the past few decades have enabled

comparable recognition capabilities from such engineered systems quite successfully. Early

face recognition algorithms used simple geometric models, but the recognition

process has now matured into a science of sophisticated mathematical representations and

matching processes. Major advancements and initiatives have propelled face recognition

technology into the spotlight.

Face recognition technology can be used in a wide range of applications. (A few

example applications are shown in Fig 1.1.) An important aspect is that such technology

should be able to deal with various changes in face images, such as rotation and changes in

expression. Surprisingly, the mathematical variations between the images of the same face

due to illumination and viewing direction are almost always larger than image variations due

to changes in face identity. This presents a great challenge to face recognition. At the core,

two issues are central to successful face recognition algorithms: First, the choice of features

used to represent a face. Since images are subject to changes in viewpoint, illumination, and

expression, an effective representation should be able to deal with these possible changes.

Secondly, the classification of a new face image using the chosen representation.



Face Recognition can be of two types:

Feature Based (Geometric)

Template Based (Photometric)

In geometric or feature-based methods, facial features such as eyes, nose, mouth, and chin

are detected. Properties and relations such as areas, distances, and angles between the

features are used as descriptors of faces. Although this class of methods is economical and

efficient in achieving data reduction and is insensitive to variations in illumination and

viewpoint, it relies heavily on the extraction and measurement of facial features.

Unfortunately, feature extraction and measurement techniques and algorithms developed to

date have not been reliable enough to cater to this need.

In contrast, template matching and neural methods generally operate directly on an

image-based representation of faces, i.e., pixel intensity array. Because the detection and

measurement of geometric facial features are not required, this type of method has been more

practical and easier to implement when compared to geometric feature-based methods.

Figure 1.1: Applications of face recognition.



1.1 Classification

Face recognition is nothing but the ability of a machine to successfully categorize a set

of images based on certain discriminatory features. Classification or pattern recognition can

be a very difficult problem and is still a very active field of research. The aim of the current

project can be considered as an attempt to simulate the recognition capability of the human

brain. A human being has the ability to put a certain scenario in a context and identify its

components. Of course it is not entirely obvious how to make the machine discriminate

between elements of different classes. The classification task may be

mathematically represented as the mapping:

$$f : A \times B \to C \tag{1.1}$$

where A represents the elements that have to be classified, i.e. the pixel intensity vectors,

using B as the function parameters. The output C has to discriminate between images of

different classes so that the class of each element can be determined. Classification by

machines requires two steps: First, the properties which distinguish an element of one class

from that of another class have to be identified. Secondly, the machine has to be trained to

know how to discriminate between the classes by defining a learning model. The learning

model describes the procedure that has to be used for the actual training. Basically there are

two types of training often referred to as supervised and unsupervised learning.

Supervised Learning: When dealing with supervised learning, the trainer feeds the machine a sample and the machine classifies it. The trainer then tells the machine

whether it has classified the sample correctly or not. In the case that it has

misclassified the sample, the machine has to adjust its classifier parameters to better

classify the given sample. Supervised learning is illustrated in Fig. 1.2. Examples of

these include Bayes Classifier and Neural Networks. Another important aspect

involved is the method of training imparted to these learning machines. It is important

to have a training set which is representative of the given classes so that future

classification can be successful.



Figure 1.2: Supervised Learning
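The trainer-and-machine loop described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a perceptron-style error-correction rule and a toy two-class data set; the data values, the learning rate and the names `train_supervised` and `classify` are hypothetical and are not taken from the project.

```python
import numpy as np

def train_supervised(samples, labels, learning_rate=0.1, epochs=20):
    """Error-correction training of a single linear classifier.

    samples: (N, m) array of feature vectors; labels: (N,) array of +1/-1.
    """
    weights = np.zeros(samples.shape[1])
    bias = 0.0
    for _ in range(epochs):
        for x, d in zip(samples, labels):
            # The machine classifies the sample.
            y = 1.0 if weights @ x + bias >= 0 else -1.0
            # The trainer reports the error; it is zero for a correct answer,
            # so the parameters change only after a misclassification.
            error = d - y
            weights += learning_rate * error * x
            bias += learning_rate * error
    return weights, bias

def classify(samples, weights, bias):
    return np.where(samples @ weights + bias >= 0, 1.0, -1.0)

# Toy, linearly separable data (hypothetical values).
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
d = np.array([1.0, 1.0, -1.0, -1.0])
w, b = train_supervised(X, d)
predictions = classify(X, w, b)
```

On this toy set the classifier settles after a single correction; real face data would need a representative training set, as noted above.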

Unsupervised Learning: In unsupervised or self-organized learning there is no external trainer to oversee the learning process. Rather a task-independent measure of

the quality of classification is computed, and the free parameters of the network are

optimized with respect to that measure. Once the network has become tuned to the

statistical regularities of the input data, it develops the ability to form internal

representations for encoding features of the input and thereby create new classes

automatically. One technique for this is called clustering. In this case the trainer has

nothing to do with the classification. Instead the learning machine determines which

elements should belong to the same class and assigns a class accordingly.

Unsupervised learning is illustrated in Fig. 1.3.

Figure 1.3: Unsupervised Learning
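As a rough illustration of clustering, the sketch below uses plain k-means, which is one common clustering technique and is assumed here only for the sake of example; the report does not state which clustering method is meant. The task-independent quality measure in this case is the distance of each point to its cluster centre. The data points and the function name `kmeans` are hypothetical.

```python
import numpy as np

def kmeans(points, initial_centres, iterations=10):
    """Plain k-means: alternate between assignment and centre update."""
    centres = np.array(initial_centres, dtype=float)
    for _ in range(iterations):
        # Distance from every point to every centre, shape (N, k).
        distances = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        assignment = distances.argmin(axis=1)
        for c in range(len(centres)):
            members = points[assignment == c]
            if len(members) > 0:  # keep the old centre if a cluster is empty
                centres[c] = members.mean(axis=0)
    return assignment, centres

# Two well-separated groups of unlabeled points (hypothetical values).
points = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                   [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])
assignment, centres = kmeans(points, initial_centres=points[[0, 3]])
```

No class labels are supplied; the two classes emerge from the structure of the data alone.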

It is to be noted that over-training leads to poor generalization. For example, an over-

trained machine would return different classes for the same image subjected to different

modifications. As illustrated in Fig. 1.4, over-training essentially amounts to memorizing the data,

and leads to poor generalization.



Figure 1.4: (a) Good Generalization, (b) Bad Generalization

Face recognition is a very complex form of pattern recognition. It consists of

classifying highly ambiguous input signals, with multiple dimensions and matching them

with known signals. The following are the problems that occur:

The intrinsic 2D structure of an image matrix is more often than not removed. Consequently, the spatial information stored therein is discarded and not effectively

utilized.

Curse of Dimensionality: Each image sample is typically modeled as a point in a high-dimensional space. Consequently, a large number of training samples are often needed to get a reliable and robust estimate of the characteristics of the data distribution.

Usually very limited amounts of data are available in real applications such as face recognition, image retrieval, and image classification.

A number of ways have been proposed to solve these problems. Finding an effective means

to reduce the dimensionality is the first step in face recognition.

1.1.1 Applications of Face Recognition

Some of the applications are as follows:

Trying to find a face within a large database of faces. In this approach the system returns a possible list of faces from the database. The most useful applications include crowd surveillance, video content indexing, personal identification (example: driver's license), mug-shot matching, etc.

Real time face recognition: Here, face recognition is used to identify a person on the spot and grant access to a building or a compound, thus avoiding security hassles. In this case the face is compared against multiple training samples of a person.

1.1.2 Adopted Approach

Face recognition is done naturally by humans. However, developing a computer

algorithm to achieve the same purpose is a rather difficult task. Starting with images, the aim

is to distinguish between faces of different people. One class of methods presupposes the

existence of certain features in the image, i.e. eyes, nose, mouth, hair, and an algorithm is

devised to find and characterize these features. A second class of methods also assumes that

there are features to be found but does not predefine what these features are or how to

measure them. The approach followed in this project belongs to the latter class of methods

since it is much less restrictive than the former. Specifically, given a set of training images, the strategy is to develop algorithms that construct a different set of images spanning a subspace that best approximates the given image data set. The training set is then projected onto this subspace. To query a new image, its projection onto this subspace is determined, and a training image whose projection is closest to that of the new image is sought.
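The project-and-compare strategy above can be sketched as follows. This is a minimal illustration rather than the project's implementation: it assumes the subspace is obtained from a singular value decomposition of the mean-centred training images, uses random arrays in place of face images, and the function names are hypothetical.

```python
import numpy as np

def fit_subspace(train_images, num_components):
    """Return the mean image and an orthonormal basis (rows) of the subspace."""
    mean = train_images.mean(axis=0)
    centred = train_images - mean
    # Rows of vt are orthonormal directions ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return mean, vt[:num_components]

def project(images, mean, basis):
    """Coordinates of each (flattened) image in the subspace."""
    return (images - mean) @ basis.T

def nearest_training_image(query, train_proj, mean, basis):
    """Index of the training image whose projection is closest to the query's."""
    q = project(query[None, :], mean, basis)
    distances = np.linalg.norm(train_proj - q, axis=1)
    return int(distances.argmin())

rng = np.random.default_rng(0)
train_images = rng.random((6, 64))        # six stand-in 8x8 images, flattened
mean, basis = fit_subspace(train_images, num_components=3)
train_proj = project(train_images, mean, basis)
query = train_images[4] + 0.01 * rng.standard_normal(64)   # noisy copy of image 4
match = nearest_training_image(query, train_proj, mean, basis)
```

Each 64-element image is compared using only three coordinates, which is the dimensionality reduction the approach relies on.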



1.2 Literature Survey

Several algorithms and techniques for face recognition have been developed in the past

by researchers. These are discussed briefly in this section.

1.2.1 Existing Algorithms

Face Recognition Based on Principal Component Analysis: Principal Component

Analysis (PCA) is a well known algorithm used in face recognition. The basic idea in PCA is

to determine a vector of much lower dimension that best approximates in some sense a given

data vector. Thus, in face recognition, it takes an s-dimensional vector representation of each

face in a training set of images as input, and determines a t-dimensional subspace whose basis vectors correspond to the directions of maximum variance in the original image space. The dimension of this new subspace is lower than the original one (t < s).



Elastic Bunch Graph Matching: All human faces share a similar topological structure.

Faces are represented as graphs, with nodes positioned at fiducial points (eyes, nose, etc.)

and edges labeled with 2-D distance vectors. Each node contains a set of 40 complex Gabor

wavelet coefficients at different scales and orientations (phase, amplitude). They are called

"jets". Recognition is based on labeled graphs. A labeled graph is a set of nodes connected by

edges; nodes are labeled with jets, and edges are labeled with distances [20].

Kernel Methods: The face manifold in subspace need not be linear. Kernel methods are a

generalization of linear methods. Direct non-linear manifold schemes are explored to learn

this non-linear manifold [4],[5],[10],[11].

Trace Transform: The Trace transform, a generalization of the Radon transform, is a new

tool for image processing which can be used for recognizing objects under transformations,

e.g. rotation, translation and scaling. To produce the Trace transform one computes a

functional along tracing lines of an image. Different Trace transforms can be produced from

an image using different trace functionals [21],[22].

1.2.2 Existing Techniques

Eigenfaces: Many approaches to the overall face recognition problem have been devised over the years, but one of the most accurate and fastest ways to identify faces is to use the so-called eigenface technique. This uses a combination of linear algebra and statistical analysis to generate a set of basis faces, the eigenfaces, against which inputs are tested. The eigenface technique mostly uses PCA and ICA tools for face recognition, and the efficiency obtained with these varies. Although ICA has been reported to significantly outperform standard PCA, it has been claimed that the performance of ICA strongly depends on the PCA process involved, and that the pure ICA projection has little effect on the performance of face recognition.



Range Imaging: Range imaging is the name for a collection of techniques, used to produce

a 2D image showing the distance to a set of points in a scene from a specific point, normally

associated with some type of sensor device. The resulting image, the range image, has pixel

values which correspond to the distance, e.g., brighter values mean shorter distance, or vice

versa. If the sensor which is used to produce the range image is properly calibrated, the pixel

values can be given directly in physical units such as centimeters [23].

Line edge map: This is an image-based face recognition algorithm that uses a set of random rectilinear

line segments of 2D face image views as the underlying image representation, together with a

nearest-neighbor classifier as the line-matching scheme. The combination of 1D line

segments exploits the inherent coherence in one or more 2D face image views in the viewing

sphere [24].

Neural Network based Face Recognition Techniques: Neural networks are used to create

the face database and recognize the face. A separate network for each person is built. The

input face is projected onto the eigenface space first to get a new descriptor. This descriptor

is used as network input and applied to each person's network. The one with maximum

output is selected and reported as the host if it is larger than a predefined recognition

threshold [3],[1],[13],[12].

Gabor wavelet networks (GWN): The Gabor wavelet network is used for an effective

object representation. The Gabor wavelet network has several advantages such as invariance

to some degree with respect to translation, rotation and dilation. Furthermore, it has the

ability to generalize and to abstract from the training data and to assure, for a given network.



Sparse Representation: The input image is decomposed into a sparse representation using L1 minimization, and this representation is then compared with the sparse representations of the training data. Many other techniques have also been tried for face recognition. So far, from our research, the eigenface technique has shown faster and more accurate results than the other techniques. In recent times, researchers have also begun to focus on the Human Vision System (HVS) for face recognition. It has not yet been implemented in practical face recognition, but it has laid the foundation for future studies [25].

1.3 Organization of the Report

The report is organized as follows: The basics of neural networks and the learning processes involved in neural networks are discussed in Chapter 2. The back propagation learning and conjugate gradient algorithms are also dealt with. The process of face recognition using neural networks is then described. Chapter 3 covers a widely used method for face recognition, namely the eigenface approach. Both principal components and nearest neighbour classification are dealt with. Support vector machines together with the

various optimization techniques involved are presented in Chapter 4. Both linear and

nonlinear cases are explained, as well as the kernels involved in support vector machines

(SVMs). Face recognition using binary SVMs for multi-classification purposes is also

explained. Face recognition involves several pre-processing steps. These are detailed in

Chapter 5. The results obtained using a number of techniques presented in the previous

chapters are compared in Chapter 6. Concluding comments and directions for future work are

indicated in Chapter 7.



CHAPTER II

Artificial Neural Networks

Neural Networks are nonlinear models of the neural pathways in the nervous system. The

flow of signals in the nervous system is as shown in Fig. 2.1. The propagation of stimulus, in

either direction throughout the system of sense organs, is due to the firing of simple

elements operating in parallel. These elements can be structured as:

Single-Layer Feedforward Networks

Multilayer Feedforward Networks

Recurrent Networks

These structures are shown in Fig. 2.2, where each node represents a mathematical model of a neuron.

Figure 2.1: Block diagram of nervous system

Figure 2.2: Single-Layer Feedforward Networks, Multilayer Feedforward Networks, Recurrent

Networks



The manner in which this transmission proceeds is determined by the learning abilities of these neurons. Learning may be error-correction learning, memory-based learning or Hebbian learning; it may also follow paradigms such as supervised learning, unsupervised learning, and reinforcement learning through continued interaction with the environment.

Rosenblatt's perceptron is a basic form of a neural network, built around a nonlinear model of a neuron, namely, the McCulloch-Pitts model of a neuron as shown in Fig. 2.3. The adder node of the neuron model computes a linear combination of the input data applied to the synapses, offset by a bias b; the result is called the induced local field. During a classification task, this induced local field is converted to a binary signal, +1 or -1, which, through the synaptic weights, determines the capacity to activate the neighboring neuron.

Figure 2.3: Nonlinear model of a neuron
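The neuron model of Fig. 2.3 can be written out directly. The following is a minimal sketch, assuming a hard limiter that maps a non-negative induced local field to +1 and a negative one to -1; the weight and bias values and the name `neuron_output` are hypothetical.

```python
import numpy as np

def neuron_output(inputs, weights, bias):
    """Adder followed by a hard limiter, as in the neuron model above."""
    induced_local_field = np.dot(weights, inputs) + bias
    return 1 if induced_local_field >= 0 else -1

weights = np.array([0.5, -1.0, 0.25])    # hypothetical synaptic weights
bias = 0.1

# Induced local field: 0.5 - 0.2 + 0.1 = 0.4, so the neuron outputs +1.
y_pos = neuron_output(np.array([1.0, 0.2, 0.0]), weights, bias)
# Induced local field: -1.0 + 0.25 + 0.1 = -0.65, so the neuron outputs -1.
y_neg = neuron_output(np.array([0.0, 1.0, 1.0]), weights, bias)
```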

In its most general form, a neural network is a machine that is designed to model the

way in which the brain performs a particular task or function of interest. It resembles the

brain in the following two respects:

1. Knowledge is acquired by the network from its environment through a learning

process.

2. Interneuron connection strengths, known as synaptic weights, are used to store the

acquired knowledge.

The procedure used to perform the learning process is called a learning algorithm, the

function of which is to modify the synaptic weights of the network in an orderly fashion to



attain a desired design objective. The modification of synaptic weights provides the

traditional basis for the design of neural networks. Further, it is also possible for a neural

network to modify its own topology, which is motivated by the fact that neurons in the

human brain can die and that new synaptic connections can grow.

2.1 Multilayer Perceptrons

Multilayer perceptrons are multilayer feedforward networks consisting of at least three

layers: the input layer, the hidden layer(s) and the output layer:

1. The input layer consists of input sensory nodes to which data, represented by pixel

intensity vectors of pre-processed images, is presented. These signals propagate through the

neural network in a forward direction, on a layer by layer basis; hence the name

multi-layer feedforward networks.

2. The hidden layer performs a non-linear transformation on the input signal into a new

space called the feature space. More than one hidden layer may be used to achieve this purpose. As learning progresses, the hidden neurons gradually discover the features that characterize the images. The number of hidden neurons in each hidden layer and the number of hidden layers are critical to the learning process, and are varied for higher accuracy.

3. At the output layer, the output/functional signal(s), expressed as a continuous

nonlinear function of the input signal and synaptic weights associated with that

neuron, is obtained.

Fig. 2.4 shows the architectural graph of a multilayer perceptron. The network shown is fully

connected, that is, a neuron in any layer of the network is connected to all the neurons in the

previous layer.
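The forward propagation through the three kinds of layers can be sketched as below. This is an illustrative sketch only: it assumes a single hidden layer, a hyperbolic tangent activation in both layers, and small random weights; the layer sizes and the name `mlp_forward` are hypothetical.

```python
import numpy as np

def mlp_forward(x, w_hidden, b_hidden, w_output, b_output):
    """Forward pass through one tanh hidden layer and a tanh output layer."""
    v_hidden = w_hidden @ x + b_hidden        # induced local fields, hidden layer
    y_hidden = np.tanh(v_hidden)              # point in the feature space
    v_output = w_output @ y_hidden + b_output
    y_output = np.tanh(v_output)              # one output neuron per class
    return y_hidden, y_output

rng = np.random.default_rng(0)
num_pixels, num_hidden, num_classes = 16, 5, 3    # hypothetical sizes
x = rng.random(num_pixels)                         # stand-in for a pixel intensity vector
w_hidden = 0.1 * rng.standard_normal((num_hidden, num_pixels))
b_hidden = np.zeros(num_hidden)
w_output = 0.1 * rng.standard_normal((num_classes, num_hidden))
b_output = np.zeros(num_classes)

y_hidden, y_output = mlp_forward(x, w_hidden, b_hidden, w_output, b_output)
predicted_class = int(np.argmax(y_output))         # class with the largest output
```

Because every row of each weight matrix multiplies the whole output vector of the previous layer, the network in this sketch is fully connected in the sense described above.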

In the case of supervised learning, this output signal is subtracted from the desired

response, to yield the error signal. In the classification problem, the desired response

corresponds to the classes of images. For this specific problem, the number of output neurons

equals the number of classes of images. The backward propagation of the estimate of the

error gradient vector, that is, the gradients of the error surface with respect to the weights

connected to the inputs of a neuron, forms the core of the supervised learning process. In the



training phase, the optimum weights of the hidden and output layers are obtained, which

minimize the error estimate for desired results. The optimum weights are learnt by

propagating the error signal backwards, against the synaptic connections. This supervised

learning algorithm is known as error back propagation algorithm, which is described in detail

in Section 2.2.

Figure 2.4: MLP with one hidden layer



2.2 Derivation of Error Back Propagation Algorithm

The error signal at the output of neuron j at iteration n, that is, at the presentation of the nth training example, is defined by:

$$e_j(n) = d_j(n) - y_j(n) \tag{2.1}$$

where $d_j(n)$ is the desired response and $y_j(n)$ is the output signal of neuron j. The instantaneous value of the error energy for neuron j is defined as $\tfrac{1}{2} e_j^2(n)$. Correspondingly, the instantaneous value $E(n)$ of the total error energy is obtained by summing $\tfrac{1}{2} e_j^2(n)$ over all neurons in the output layer. Thus:

$$E(n) = \frac{1}{2} \sum_{j \in C} e_j^2(n) \tag{2.2}$$

where the set C includes all the neurons in the output layer of the network. Let N denote the total number of patterns contained in the training set. The average squared error energy is obtained by summing $E(n)$ over all n and then normalizing with respect to the set size N, as shown by:

$$E_{av} = \frac{1}{N} \sum_{n=1}^{N} E(n) \tag{2.3}$$

The instantaneous error energy $E(n)$, and therefore the average error energy $E_{av}$, is a function of all the free parameters (i.e. synaptic weights and bias levels) of the network. For a given training set, $E_{av}$ represents the cost function as a measure of learning performance. The objective of the learning process is to adjust the free parameters of the network to minimize $E_{av}$. A simple method of training is considered in which the weights are updated on a pattern by pattern basis until one epoch, that is, one complete presentation of the entire training set, has been dealt with. The adjustments to the weights are made in accordance with the respective errors computed for each pattern presented to the network.
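The two error measures in equations (2.2) and (2.3) can be checked numerically with a short sketch. The desired and actual outputs below are hypothetical values for a network with two output neurons and two training patterns; the function names are chosen for illustration only.

```python
import numpy as np

def instantaneous_error_energy(desired, actual):
    """E(n): half the sum of squared errors over the output neurons."""
    error = desired - actual
    return 0.5 * np.sum(error ** 2)

def average_error_energy(desired_set, actual_set):
    """E_av: mean of E(n) over the N training patterns."""
    energies = [instantaneous_error_energy(d, y)
                for d, y in zip(desired_set, actual_set)]
    return sum(energies) / len(energies)

desired_set = np.array([[1.0, -1.0], [-1.0, 1.0]])
actual_set = np.array([[0.5, -1.0], [-1.0, 0.0]])

e_first = instantaneous_error_energy(desired_set[0], actual_set[0])  # 0.5 * 0.5**2
e_av = average_error_energy(desired_set, actual_set)                 # (0.125 + 0.5) / 2
```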



Figure 2.5: Signal-flow graph highlighting the details of output neuron j

The arithmetic average of these individual weight changes over the training set is therefore an estimate of the true change that would result from modifying the weights based on minimizing the cost function $E_{av}$ over the entire training set. Consider Fig. 2.5, which depicts neuron j being fed by a set of function signals produced by a layer of neurons to its left. The induced local field $v_j(n)$ produced at the input of the activation function associated with neuron j is therefore:

$$v_j(n) = \sum_{i=0}^{m} w_{ji}(n)\, y_i(n) \tag{2.4}$$

where m is the total number of inputs (excluding the bias) applied to neuron j. The synaptic weight $w_{j0}$ (corresponding to the fixed input $y_0 = +1$) equals the bias $b_j$ applied to neuron j. Hence the function signal $y_j(n)$ appearing at the output of neuron j at iteration n is:

$$y_j(n) = \varphi_j(v_j(n)) \tag{2.5}$$

where $\varphi_j(\cdot)$ denotes the activation function of neuron j.

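The back-propagation algorithm applies a correction $\Delta w_{ji}(n)$ to the synaptic weight $w_{ji}(n)$, which is proportional to the partial derivative $\partial E(n)/\partial w_{ji}(n)$. According to the chain rule of calculus, one can express this gradient as:

$$\frac{\partial E(n)}{\partial w_{ji}(n)} = \frac{\partial E(n)}{\partial e_j(n)} \, \frac{\partial e_j(n)}{\partial y_j(n)} \, \frac{\partial y_j(n)}{\partial v_j(n)} \, \frac{\partial v_j(n)}{\partial w_{ji}(n)} \tag{2.6}$$

The derivative $\partial E(n)/\partial w_{ji}(n)$ represents the sensitivity factor, determining the direction of search in weight space for the synaptic weight $w_{ji}$. Differentiating both sides of equation (2.2) with respect to $e_j(n)$, equation (2.1) with respect to $y_j(n)$, equation (2.5) with respect to $v_j(n)$, and equation (2.4) with respect to $w_{ji}(n)$, one respectively gets:

$$\frac{\partial E(n)}{\partial e_j(n)} = e_j(n) \tag{2.7}$$

$$\frac{\partial e_j(n)}{\partial y_j(n)} = -1 \tag{2.8}$$

$$\frac{\partial y_j(n)}{\partial v_j(n)} = \varphi'_j(v_j(n)) \tag{2.9}$$

$$\frac{\partial v_j(n)}{\partial w_{ji}(n)} = y_i(n) \tag{2.10}$$

Here $\varphi_j$ is the activation function of neuron j, and the prime denotes differentiation with respect to its argument. The use of equations (2.7) to (2.10) in (2.6) yields:

$$\frac{\partial E(n)}{\partial w_{ji}(n)} = -e_j(n)\, \varphi'_j(v_j(n))\, y_i(n) \tag{2.11}$$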

The correction $\Delta w_{ji}(n)$ applied to $w_{ji}(n)$ is defined by the delta rule:

$$\Delta w_{ji}(n) = -\eta \, \frac{\partial E(n)}{\partial w_{ji}(n)} \tag{2.12}$$

where $\eta$ is the learning-rate parameter of the back propagation algorithm. The use of the minus sign accounts for the gradient descent in weight space. Accordingly, the use of (2.11) in (2.12) yields:

$$\Delta w_{ji}(n) = \eta\, \delta_j(n)\, y_i(n) \tag{2.13}$$

where the local gradient $\delta_j(n)$ is defined by:

$$\delta_j(n) = -\frac{\partial E(n)}{\partial v_j(n)} = e_j(n)\, \varphi'_j(v_j(n)) \tag{2.14}$$

The local gradient points to required changes in synaptic weights. According to (2.14), the local gradient $\delta_j(n)$ for output neuron j is equal to the product of the corresponding error signal $e_j(n)$ for that neuron and the derivative $\varphi'_j(v_j(n))$ of the associated activation function.
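The delta rule for an output neuron can be verified numerically. The sketch below is illustrative only: it assumes a single output neuron with a hyperbolic tangent activation, for which the derivative is 1 - tanh(v)^2, and compares the analytic product of local gradient and input against a finite-difference estimate of the gradient of the error energy. All numerical values and function names are hypothetical.

```python
import numpy as np

def output_neuron_update(w, y_in, d, eta):
    """One delta-rule step for a single tanh output neuron.

    w and y_in include the bias weight and the fixed input +1 at index 0.
    """
    v = w @ y_in                        # induced local field
    y = np.tanh(v)                      # function signal
    e = d - y                           # error signal
    delta = e * (1.0 - y ** 2)          # local gradient: error times tanh'(v)
    return w + eta * delta * y_in, delta

def error_energy(w, y_in, d):
    return 0.5 * (d - np.tanh(w @ y_in)) ** 2

w = np.array([0.1, -0.3, 0.2])
y_in = np.array([1.0, 0.5, -0.4])       # y_in[0] = +1 carries the bias
d, eta = 1.0, 0.5
w_new, delta = output_neuron_update(w, y_in, d, eta)

# Central finite-difference estimate of dE/dw, one weight at a time.
h = 1e-6
numeric_grad = np.array([
    (error_energy(w + h * np.eye(3)[i], y_in, d)
     - error_energy(w - h * np.eye(3)[i], y_in, d)) / (2 * h)
    for i in range(3)
])
```

The negative of `numeric_grad` should agree with `delta * y_in`, which is the direction the delta rule moves the weights in.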

Case 1: Neuron j is an output node

When neuron j is located in the output layer of the network, it is supplied with a

desired response of its own. Equation (2.1) is used to compute the error signal $e_j(n)$

associated with this neuron.



Case 2: Neuron j is a hidden node

When neuron j is located in a hidden layer of the network, there is no specified desired response for that neuron. Accordingly, the error signal for a hidden neuron would have to be determined recursively in terms of the error signals of all the neurons to which that hidden neuron is directly connected; this is where the development of the back-propagation algorithm gets complicated. Consider the situation depicted in Fig 2.6, which depicts neuron j as a hidden node of the network. According to equation (2.14), the local gradient $\delta_j(n)$ for hidden neuron j is redefined as:

$$\delta_j(n) = -\frac{\partial E(n)}{\partial y_j(n)} \, \frac{\partial y_j(n)}{\partial v_j(n)} = -\frac{\partial E(n)}{\partial y_j(n)} \, \varphi'_j(v_j(n)), \quad \text{neuron } j \text{ is hidden} \tag{2.15}$$

where the second equality uses equation (2.9). To calculate the partial derivative $\partial E(n)/\partial y_j(n)$, one proceeds as follows. From equation (2.2):

$$E(n) = \frac{1}{2} \sum_{k \in C} e_k^2(n), \quad \text{neuron } k \text{ is an output node} \tag{2.16}$$

Differentiating equation (2.16) with respect to the function signal $y_j(n)$:

$$\frac{\partial E(n)}{\partial y_j(n)} = \sum_{k} e_k(n) \, \frac{\partial e_k(n)}{\partial y_j(n)} \tag{2.17}$$

Using the chain rule for the partial derivative $\partial e_k(n)/\partial y_j(n)$, and rewriting in the equivalent form, the following is obtained:

$$\frac{\partial E(n)}{\partial y_j(n)} = \sum_{k} e_k(n) \, \frac{\partial e_k(n)}{\partial v_k(n)} \, \frac{\partial v_k(n)}{\partial y_j(n)} \tag{2.18}$$

However,

$$e_k(n) = d_k(n) - y_k(n) = d_k(n) - \varphi_k(v_k(n)), \quad \text{neuron } k \text{ is an output node} \tag{2.19}$$

Hence:

$$\frac{\partial e_k(n)}{\partial v_k(n)} = -\varphi'_k(v_k(n)) \tag{2.20}$$

Note that for neuron k the induced local field is:

$$v_k(n) = \sum_{j=0}^{m} w_{kj}(n)\, y_j(n) \tag{2.21}$$



Figure 2.6: Signal-flow graph highlighting the details of output neuron k connected to hidden neuron j

where m is the total number of inputs (excluding the bias) applied to neuron k. Here again, the synaptic weight $w_{k0}(n)$ is equal to the bias $b_k(n)$ applied to neuron k, and the corresponding input is fixed at the value +1. Differentiating equation (2.21) with respect to $y_j(n)$ yields:

$\partial v_k(n)/\partial y_j(n) = w_{kj}(n)$ (2.22)

By using equation (2.20) and equation (2.22) in equation (2.18), the desired partial derivative is obtained:

$\partial E(n)/\partial y_j(n) = -\sum_k e_k(n)\,\varphi_k'(v_k(n))\,w_{kj}(n)$

$= -\sum_k \delta_k(n)\,w_{kj}(n)$ (2.23)

where in the second line the definition of the local gradient $\delta_k(n)$ given in equation (2.14), with the index k substituted for j, is used. Finally, using (2.23) in (2.15), the back-propagation formula for the local gradient $\delta_j(n)$ is obtained:

$\delta_j(n) = \varphi_j'(v_j(n)) \sum_k \delta_k(n)\,w_{kj}(n)$, neuron j is hidden (2.24)

The factor $\varphi_j'(v_j(n))$ involved in the computation of the local gradient $\delta_j(n)$ in (2.24) depends solely on the activation function associated with hidden neuron j. The remaining factor involved in this computation, namely the summation over k, depends on two sets of terms. The first set of terms, the $\delta_k(n)$, requires knowledge of the error signals $e_k(n)$ for all neurons that lie in the layer to the immediate right of hidden neuron j and that are directly connected to neuron j. The second set of terms, the $w_{kj}(n)$, consists of the synaptic weights associated with these connections. Figure 2.7 summarizes back-propagation learning.

Figure 2.7: Signal-flow graphical summary of back-propagation learning. Top part of the graph: forward

pass. Bottom part of the graph: backward pass.
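The forward and backward passes summarized in Figure 2.7 can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in this project: it assumes a single hidden layer, tanh activations with a = b = 1 (so that the derivative is 1 - y^2), made-up weights, and hypothetical helper names (`forward`, `local_gradients`, `backprop_step`, `error_energy`).

```python
import math

def forward(x, W1, W2):
    """Forward pass. Each weight row is [bias, w_1, ..., w_m]; the bias input is fixed at +1."""
    y0 = [1.0] + list(x)
    y1 = [1.0] + [math.tanh(sum(w * y for w, y in zip(row, y0))) for row in W1]
    y2 = [math.tanh(sum(w * y for w, y in zip(row, y1))) for row in W2]
    return y0, y1, y2

def local_gradients(d, y1, y2, W2):
    """Output deltas follow (2.14), hidden deltas follow (2.24); tanh'(v) = 1 - y^2."""
    delta2 = [(dk - yk) * (1.0 - yk * yk) for dk, yk in zip(d, y2)]
    delta1 = []
    for j in range(1, len(y1)):  # index 0 is the fixed bias input, not a neuron
        back = sum(delta2[k] * W2[k][j] for k in range(len(W2)))
        delta1.append((1.0 - y1[j] ** 2) * back)
    return delta1, delta2

def backprop_step(x, d, W1, W2, eta=0.1):
    """One sequential-mode update using the delta rule (2.13)."""
    y0, y1, y2 = forward(x, W1, W2)
    delta1, delta2 = local_gradients(d, y1, y2, W2)
    W1_new = [[w + eta * dj * yi for w, yi in zip(row, y0)] for row, dj in zip(W1, delta1)]
    W2_new = [[w + eta * dk * yj for w, yj in zip(row, y1)] for row, dk in zip(W2, delta2)]
    return W1_new, W2_new

def error_energy(x, d, W1, W2):
    """Instantaneous error energy E(n) = 0.5 * sum_k e_k(n)^2."""
    y2 = forward(x, W1, W2)[2]
    return 0.5 * sum((dk - yk) ** 2 for dk, yk in zip(d, y2))

# Toy network: 2 inputs, 2 hidden neurons, 1 output (all values are made up).
W1 = [[0.1, 0.2, -0.3], [-0.2, 0.4, 0.1]]
W2 = [[0.05, -0.4, 0.3]]
x, d = [0.5, -1.0], [0.8]
before = error_energy(x, d, W1, W2)
W1_new, W2_new = backprop_step(x, d, W1, W2)
after = error_energy(x, d, W1_new, W2_new)
print(before, after)  # the error on this example should drop after the update
```

Because $\delta_j(n) = -\partial E(n)/\partial v_j(n)$, the quantity `-delta * y` can be compared against a finite-difference estimate of $\partial E(n)/\partial w_{ji}(n)$ as a quick sanity check of the backward pass.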

2.3 Activation function

The computation of the local gradient $\delta$ for each neuron of a multilayer perceptron requires knowledge of the derivative of the activation function $\varphi(\cdot)$ associated with that neuron. For this derivative to exist, the function $\varphi(\cdot)$ must be continuous; in basic terms, differentiability is the only requirement that an activation function has to satisfy.

The hyperbolic tangent function shown in Fig 2.8, an antisymmetric function, is a commonly used form of sigmoidal non-linearity, which in its most general form is defined by:

$\varphi_j(v_j(n)) = a\tanh(b\,v_j(n))$, $(a, b) > 0$ (2.25)

where a and b are constants. Its derivative with respect to $v_j(n)$ is given by:

$\varphi_j'(v_j(n)) = ab\,\mathrm{sech}^2(b\,v_j(n))$

$= ab\,(1 - \tanh^2(b\,v_j(n)))$

$= (b/a)\,[a - y_j(n)]\,[a + y_j(n)]$ (2.26)

For a neuron j located in the output layer, $y_j(n) = o_j(n)$, and the local gradient is therefore:

$\delta_j(n) = e_j(n)\,\varphi_j'(v_j(n))$

$= (b/a)\,[d_j(n) - o_j(n)]\,[a - o_j(n)]\,[a + o_j(n)]$ (2.27)

For a neuron j in the hidden layer,

$\delta_j(n) = \varphi_j'(v_j(n)) \sum_k \delta_k(n)\,w_{kj}(n)$

$= (b/a)\,[a - y_j(n)]\,[a + y_j(n)] \sum_k \delta_k(n)\,w_{kj}(n)$, neuron j is hidden (2.28)

Figure 2.8: Hyperbolic tangent function
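The point of writing the derivative in terms of the output $y_j(n)$ is that no extra evaluation of tanh is needed during the backward pass. The short sketch below checks this form of (2.26) numerically. The constants a = 1.7159 and b = 2/3 are commonly quoted values chosen here purely for illustration (the report does not state which constants were used), and the function names are hypothetical.

```python
import math

def phi(v, a=1.7159, b=2.0 / 3.0):
    """Hyperbolic tangent activation, equation (2.25)."""
    return a * math.tanh(b * v)

def phi_prime(v, a=1.7159, b=2.0 / 3.0):
    """Derivative (2.26), written in terms of the output y = phi(v)."""
    y = phi(v, a, b)
    return (b / a) * (a - y) * (a + y)

def output_delta(d, o, a=1.7159, b=2.0 / 3.0):
    """Local gradient of an output neuron, equation (2.27)."""
    return (b / a) * (d - o) * (a - o) * (a + o)

v = 0.3
numeric = (phi(v + 1e-6) - phi(v - 1e-6)) / 2e-6  # central-difference estimate
print(phi_prime(v), numeric)
```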



2.4 Rate of learning

The back-propagation algorithm provides an approximation to the trajectory in weight space computed by the method of steepest descent. The smaller the learning-rate parameter $\eta$ is made, the smaller the changes to the synaptic weights in the network will be from one iteration to the next, and the smoother will be the trajectory in weight space. This improvement, however, is attained at the cost of a slower rate of learning. If, on the other hand, the learning-rate parameter $\eta$ is made too large in order to speed up the rate of learning, the resulting large changes in the synaptic weights may make the network unstable. A simple method of increasing the rate of learning while avoiding the danger of instability is to modify the delta rule by including a momentum term as follows:

$\Delta w_{ji}(n) = \alpha\,\Delta w_{ji}(n-1) + \eta\,\delta_j(n)\,y_i(n)$ (2.29)

where $\alpha$ is a positive number called the momentum constant. Equation (2.29) may be viewed as a first-order difference equation in the weight correction $\Delta w_{ji}(n)$. Solving it as a time series with index t, where t goes from the initial time 0 to the current time n, gives:

$\Delta w_{ji}(n) = \eta \sum_{t=0}^{n} \alpha^{n-t}\,\delta_j(t)\,y_i(t)$

$= -\eta \sum_{t=0}^{n} \alpha^{n-t}\,\partial E(t)/\partial w_{ji}(t)$ (2.30)

Based on this relation, the following conclusions are made:

1. The current adjustment $\Delta w_{ji}(n)$ represents the sum of an exponentially weighted time series. For the time series to be convergent, the momentum constant must be restricted to the range $0 \le |\alpha| < 1$.
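The equivalence between the recursive momentum update (2.29) and the exponentially weighted sum (2.30) can be checked with a small sketch. It assumes that the correction before time 0 is zero; the function names and the sample values standing in for $\delta_j(t)\,y_i(t)$ are made up for illustration.

```python
def momentum_updates(terms, eta, alpha):
    """Apply (2.29) recursively; terms[t] stands for delta_j(t) * y_i(t)."""
    dw, history = 0.0, []
    for g in terms:
        dw = alpha * dw + eta * g
        history.append(dw)
    return history

def momentum_series(terms, eta, alpha, n):
    """Closed form (2.30): exponentially weighted sum of the terms up to time n."""
    return eta * sum(alpha ** (n - t) * terms[t] for t in range(n + 1))

terms = [0.5, -0.2, 0.4, 0.1, 0.3]  # made-up values of delta_j(t) * y_i(t)
eta, alpha = 0.1, 0.9
recursive = momentum_updates(terms, eta, alpha)
closed = [momentum_series(terms, eta, alpha, n) for n in range(len(terms))]
print(recursive[-1], closed[-1])

# With a constant term g the correction tends to eta * g / (1 - alpha),
# which is finite only when |alpha| < 1.
steady = momentum_updates([1.0] * 500, eta, alpha)[-1]
print(steady)
```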



2.5 Modes of Training

One complete presentation of the entire training set during the learning process is called

an epoch. The learning process is maintained on an epoch-by-epoch basis until the synaptic

weights and bias levels of the network stabilize and the averaged mean squared error over the

entire training set converges to some minimum value. It is good practice to randomize the

order of presentation of training examples from one epoch to the next.

For a given training set, back-propagation learning may proceed in one of two basic

ways:

1. Sequential mode: The sequential mode of back-propagation learning is also referred

to as on-line, pattern or stochastic mode. In this mode of operation weight updating is

performed after the presentation of each training example.

2. Batch mode: In the batch mode of back-propagation learning, weight updating is

performed after the presentation of all training examples that constitute an epoch. For

a particular epoch, the cost function is defined as the average squared error of equations

(2.2) and (2.3):

$E_{av} = \frac{1}{2N}\sum_{n=1}^{N}\sum_{j} e_j^2(n)$ (2.31)

where the error signal $e_j(n)$ pertains to output neuron j for training example n, and N is the number of examples in the epoch.

From an on-line operational point of view, the sequential mode of training is

preferred over the batch mode because it requires less local storage for each synaptic

connection. Moreover, given that the patterns are presented to the network in a random

manner, the use of pattern by pattern updating of weights makes the search in weight space

stochastic in nature. This in turn makes it less likely for the back-propagation algorithm to be

trapped in a local minimum.

By the same token, the stochastic nature of the sequential mode makes it difficult to

establish theoretical conditions for convergence of the algorithm. In contrast, the use of batch

mode of training provides an accurate estimate of the gradient vector; convergence to a local

minimum is guaranteed under simple conditions.
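The difference between the two modes is easiest to see on a deliberately tiny problem. The sketch below is an assumption-laden toy rather than the MLP training used in this project: it trains a single linear neuron y = w x on made-up samples of d = 2x, once with an update after every (shuffled) example and once with a single update per epoch along the negative gradient of $E_{av}$.

```python
import random

def average_error(w, data):
    """Averaged squared error (2.31) for a single linear neuron y = w * x."""
    return sum((d - w * x) ** 2 for x, d in data) / (2 * len(data))

def sequential_epoch(w, data, eta, rng):
    """Sequential mode: shuffle the epoch, then update after every example."""
    order = list(data)
    rng.shuffle(order)
    for x, d in order:
        w += eta * (d - w * x) * x
    return w

def batch_epoch(w, data, eta):
    """Batch mode: one update per epoch along the negative gradient of E_av."""
    grad = -sum((d - w * x) * x for x, d in data) / len(data)
    return w - eta * grad

data = [(-1.0, -2.0), (0.5, 1.0), (1.0, 2.0), (2.0, 4.0)]  # made-up samples of d = 2x
rng = random.Random(0)
w_seq = w_bat = 0.0
for _ in range(200):
    w_seq = sequential_epoch(w_seq, data, 0.1, rng)
    w_bat = batch_epoch(w_bat, data, 0.1)
print(w_seq, w_bat, average_error(w_bat, data))
```

On this noise-free toy both modes reach the same weight; the differences discussed above (storage, stochastic search, convergence guarantees) only matter on realistic error surfaces.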



2.6 Local Minima of Error Surface

A peculiarity of the error surface (Fig 2.9) that impacts the performance of the back-propagation algorithm is the presence of local minima in addition to global minima. The algorithm runs the risk of being trapped in a local minimum, where every small change in synaptic weights increases the cost function. But somewhere else in the weight space there exists another set of synaptic weights for which the cost function is smaller than at the local minimum in which the network is stuck. It is clearly undesirable to have the learning process terminate at a local minimum, especially if it is located far above a global minimum.

Figure 2.9: Error surface

2.7 Supervised Learning Viewed as an Optimization Problem

Here, the supervised training of a multilayer perceptron is viewed as a problem in numerical optimization. The error surface of a multilayer perceptron is a highly nonlinear function of the synaptic weight vector w. Let $E_{av}(w)$ denote the cost function, averaged over the training sample. Using a Taylor series, $E_{av}(w)$ is expanded about the current point w(n) on the error surface as:

$E_{av}(w(n) + \Delta w(n)) = E_{av}(w(n)) + g^T(n)\,\Delta w(n) + \frac{1}{2}\,\Delta w^T(n)\,H(n)\,\Delta w(n)$ + (third- and higher-order terms) (2.32)

where g(n) is the local gradient vector defined by:

$g(n) = \partial E_{av}(w)/\partial w$, at w = w(n) (2.33)

and H(n) is the local Hessian matrix defined by:

$H(n) = \partial^2 E_{av}(w)/\partial w^2$, at w = w(n) (2.34)

The use of an ensemble-averaged cost function Eav(w) presumes a batch mode of

operation. The basic back-propagation algorithm adjusts the weights in the steepest descent

direction (negative of the gradient). This is the direction in which the performance function is

decreasing most rapidly. It turns out that, although the function decreases most rapidly along

the negative of the gradient, this does not necessarily produce the fastest convergence. In the

conjugate gradient type of algorithms a search is performed along conjugate directions,

which produces generally faster convergence than steepest descent directions.

In most of the training algorithms, a learning rate is used to determine the length of

the weight update (step size). In most of the conjugate gradient algorithms, the step size is

adjusted at the end of each iteration. A search is made along the conjugate gradient direction

to determine the step size, which minimizes the performance function along that line.

Fletcher-Reeves Update: All of the conjugate gradient algorithms start out by searching in the steepest descent direction (negative of the gradient) on the first iteration:

$p_0 = -g_0$ (2.35)

A line search is then performed to determine the optimal distance to move along the current search direction:

$x_{k+1} = x_k + \alpha_k\,p_k$ (2.36)

Then the next search direction is determined so that it is conjugate to previous search directions. The general procedure for determining the new search direction is to combine the new steepest descent direction with the previous search direction:

$p_k = -g_k + \beta_k\,p_{k-1}$ (2.37)

The various versions of conjugate gradient are distinguished by the manner in which the constant $\beta_k$ is computed. For the Fletcher-Reeves update the procedure is:

$\beta_k = g_k^T g_k \,/\, g_{k-1}^T g_{k-1}$ (2.38)

This is the ratio of the norm squared of the current gradient to the norm squared of the previous gradient.

Once the MLP has been trained, it can be tested on a database of images. The output neuron which fires corresponds to the class of image.
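Equations (2.35) to (2.38) can be exercised on a small quadratic function, where the line search has a closed form. The sketch below makes that simplifying assumption (for an MLP the line search has to be done numerically) and minimizes $f(x) = \frac{1}{2}x^T Q x - b^T x$ for a made-up symmetric positive definite Q; the function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(Q, x):
    return [dot(row, x) for row in Q]

def fletcher_reeves(Q, b, x, iters):
    """Conjugate gradient with the Fletcher-Reeves update on f(x) = 0.5 x^T Q x - b^T x."""
    g = [qx - bi for qx, bi in zip(matvec(Q, x), b)]   # gradient of f at x
    p = [-gi for gi in g]                               # (2.35) first direction
    for _ in range(iters):
        if dot(g, g) < 1e-24:                           # already at the minimum
            break
        Qp = matvec(Q, p)
        alpha = -dot(g, p) / dot(p, Qp)                 # exact line search (quadratic only)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]   # (2.36)
        g_new = [gi + alpha * qpi for gi, qpi in zip(g, Qp)]
        beta = dot(g_new, g_new) / dot(g, g)            # (2.38)
        p = [-gn + beta * pi for gn, pi in zip(g_new, p)]  # (2.37)
        g = g_new
    return x

Q = [[4.0, 1.0], [1.0, 3.0]]   # made-up symmetric positive definite matrix
b = [1.0, 2.0]
x_min = fletcher_reeves(Q, b, [2.0, 1.0], iters=2)
print(x_min)  # should solve Q x = b, i.e. x = (1/11, 7/11)
```

With an exact line search, conjugate gradient minimizes an n-dimensional quadratic in at most n iterations, which is why two iterations suffice here; on the non-quadratic error surface of an MLP this property holds only approximately.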

2.8 Face Recognition Using Neural Networks

The Cambridge ORL database consists of 400 frontal images of 40 people, i.e. 10

images of each person, of which 5 are used for training and the other 5 are used for testing

purposes. Since there are 40 people that need to be identified, the number of classes that a

neural network must distinguish is also 40. The pixel intensity vectors of the pre-processed

training images are obtained and the MLP is thus trained. The mode of training used is batch

mode, and using the back-propagation algorithm, in conjunction with Fletcher-Reeves

update, the optimum weights for the MLP are obtained. In order to improve results, the training

examples are presented in random order. Training is conducted for 2500 epochs.

2.9 Conclusions

This chapter dealt with artificial neural networks. Specifically, the architecture of multilayered feedforward neural networks was introduced. Supervised learning via the back-propagation algorithm was discussed along with the modes of training, the choice of learning rate, and the potential pitfall of getting trapped in local minima. Finally, the use of artificial neural networks for face recognition was introduced. The results are presented in Chapter 6. A technique that is frequently used for analysis of images is that of principal components. This is the topic of the next chapter.



CHAPTER III

Principal Components Analysis

Principal components analysis is a method of unsupervised learning. The main idea is

to discover significant patterns or features in the input data without an external aid. To do

so, the algorithm is provided with a set of rules of a local nature, which enables it to learn to

compute an input-output mapping with specific desirable properties; the term local implies

that the change applied to the synaptic weight of a neuron is confined to the immediate

neighborhood of that neuron.

Feature selection refers to a process whereby the given data is compressed to features

or patterns fewer in number than the given data. Thus, the data space is transformed into a

feature space and undergoes a dimensionality reduction. Evidently, this data compression is

lossy, and it can be shown that in the case of principal component analysis, the mean-square

of the resulting error equals the sum of the variances of the elements of that part of the data

vector that is eventually eliminated. Therefore one seeks a transformation that is optimum in

the mean squared sense; i.e., the eliminated component must have total variance that is less

than a predefined threshold. Principal components analysis computes the basis of a space

which represents the training vectors. These basis vectors are eigenvectors of a related

covariance matrix.

When applied to face recognition, one determines the principal components of the

distribution of faces treating the image as a point in a high dimensional space; i.e., the

eigenvectors of the covariance matrix of the set of face images are computed. These

eigenvectors, referred to as eigenfaces in this context, contain relevant discriminatory

information extracted from the images. Thus, characteristic features representing the

variation in the collection of faces are captured, and this information is used to encode and

compare individual faces. It is to be noted that the features (i.e., the eigenvectors) are ordered

with respect to the corresponding eigenvalues, and only those features that are significant (in



the sense of the value of the eigenvalues) are considered. Recognition is performed by

projecting a new image on to the subspace spanned by the eigenfaces and then classifying the

face by comparing its position in the face space with the position of known individuals.

Figure 3.1: (a) The 1st PC z1 is a minimum distance fit to a line in X space. (b) The 2nd PC z2 is a

minimum distance fit to a line in the plane perpendicular to the 1st PC.

3.1 Derivation of Principal Components (PC)

Given a sample of n observations on a vector of p variables:

$x = (x_1, x_2, \ldots, x_p)$ (3.1)

define the first principal component of the sample by the linear transformation:

$z_1 = a_1^T x = \sum_{i=1}^{p} a_{i1}\,x_i$ (3.2)

where the vector $a_1 = (a_{11}, a_{21}, \ldots, a_{p1})$ is chosen such that var[$z_1$] is maximum. Likewise, define the kth PC of the sample by the linear transformation:

$z_k = a_k^T x$, $k = 1, \ldots, p$ (3.3)

where the vector $a_k = (a_{1k}, a_{2k}, \ldots, a_{pk})$ is chosen such that var[$z_k$] is maximum, subject to:

cov[$z_k$, $z_l$] = 0 for $k > l \ge 1$, and $a_k^T a_k = 1$ (3.4)

To find $a_1$, first note that:

var[$z_1$] $= \langle z_1^2 \rangle - \langle z_1 \rangle^2$

$= \sum_{i,j=1}^{p} a_{i1}\,a_{j1}\,S_{ij}$, where $S_{ij} = \sigma_{x_i x_j} = \langle x_i x_j \rangle - \langle x_i \rangle \langle x_j \rangle$

$= a_1^T S a_1$ (3.5)

where S is the covariance matrix for x.

To find $a_1$, maximize var[$z_1$] subject to $a_1^T a_1 = 1$. This constrained optimization problem can be transformed into an unconstrained one by using a Lagrange multiplier $\lambda$. Thus, the problem is reduced to the maximization of $a_1^T S a_1 - \lambda(a_1^T a_1 - 1)$. Differentiating with respect to both the vector $a_1$ and the Lagrange multiplier $\lambda$, and setting the derivatives to zero, one obtains

$S a_1 - \lambda a_1 = 0$

$(S - \lambda I_p)\,a_1 = 0$ (3.6)

Therefore $a_1$ is an eigenvector of S corresponding to the eigenvalue $\lambda = \lambda_1$. Note that the following has been maximized:

var[$z_1$] $= a_1^T S a_1 = a_1^T \lambda_1 a_1 = \lambda_1$ (3.7)

So $\lambda_1$ is the largest eigenvalue of S. The first PC $z_1$ retains the greatest amount of variation in the sample. An example is illustrated in Fig. 3.1(a). To find the next coefficient vector $a_2$, maximize var[$z_2$] subject to:

cov[$z_2$, $z_1$] = 0 and $a_2^T a_2 = 1$ (3.8)

First note that:

cov[$z_2$, $z_1$] $= a_1^T S a_2 = \lambda_1\,a_1^T a_2$ (3.9)

then let $\lambda$ and $\phi$ be Lagrange multipliers, and maximize $a_2^T S a_2 - \lambda(a_2^T a_2 - 1) - \phi\,a_2^T a_1$. It is found that $a_2$ is also an eigenvector of S, whose eigenvalue $\lambda = \lambda_2$ is the second largest, as illustrated in Fig. 3.1(b) in the two dimensional case. In general:

var[$z_k$] $= a_k^T S a_k = \lambda_k$ (3.10)

The kth largest eigenvalue of S is the variance of the kth PC. The kth PC $z_k$ retains the kth greatest fraction of the variation in the sample.

Given a sample of n observations on a vector of p variables $x = (x_1, x_2, \ldots, x_p)$, define a vector of p PCs $z = (z_1, z_2, \ldots, z_p)$ according to $z = A^T x$, where A is an orthogonal p x p matrix whose kth column is the kth eigenvector $a_k$ of S. Then $\Lambda = A^T S A$ is the covariance matrix of the PCs; the matrix is diagonal with elements $\Lambda_{ij} = \lambda_i\,\delta_{ij}$.
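The statements var[$z_k$] $= \lambda_k$ and "the covariance matrix of the PCs is diagonal" can be verified on a small two-variable sample. The sketch below is illustrative only: the data are made up, the function names are hypothetical, and the closed-form rotation angle it uses for the eigenvectors applies to the 2 x 2 symmetric case only.

```python
import math

def covariance_2d(samples):
    """Sample covariance S_ij = <x_i x_j> - <x_i><x_j> for two variables."""
    n = len(samples)
    m0 = sum(s[0] for s in samples) / n
    m1 = sum(s[1] for s in samples) / n
    s00 = sum((s[0] - m0) ** 2 for s in samples) / n
    s11 = sum((s[1] - m1) ** 2 for s in samples) / n
    s01 = sum((s[0] - m0) * (s[1] - m1) for s in samples) / n
    return [[s00, s01], [s01, s11]]

def principal_axes_2d(S):
    """Unit eigenvectors a1, a2 of a symmetric 2x2 matrix, largest eigenvalue first."""
    theta = 0.5 * math.atan2(2.0 * S[0][1], S[0][0] - S[1][1])
    a1 = (math.cos(theta), math.sin(theta))
    a2 = (-math.sin(theta), math.cos(theta))
    return a1, a2

def quadratic_form(a, S):
    """a^T S a, i.e. the variance of the PC z = a^T x as in (3.10)."""
    return sum(a[i] * S[i][j] * a[j] for i in range(2) for j in range(2))

samples = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0), (2.3, 2.7)]  # made-up data
S = covariance_2d(samples)
a1, a2 = principal_axes_2d(S)
lam1, lam2 = quadratic_form(a1, S), quadratic_form(a2, S)
z1 = [a1[0] * x + a1[1] * y for x, y in samples]
z2 = [a2[0] * x + a2[1] * y for x, y in samples]
cov_z = covariance_2d(list(zip(z1, z2)))  # should be diagonal with entries lam1, lam2
print(lam1, lam2, cov_z)
```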

3.2 Face Recognition using Eigenfaces

Let a face image $I(x, y)$ be a two-dimensional N by N array of intensity values. By stacking the columns of this array into a single vector, an image may also be considered as a

vector of dimension $N^2$. Thus, a typical image of size 256 by 256 becomes a vector of

dimension 65,536, or equivalently, a point in 65,536-dimensional space. An ensemble of

images, then, maps to a collection of points in this vector space of a relatively large

dimension.

Images of faces, being similar in overall configuration, will not be randomly

distributed in this huge image space and thus can be described by a relatively low

dimensional subspace. The main idea of the principal component analysis is to find a set of

basis vectors that best accounts for the distribution of face images within the entire image

space. These vectors define the subspace of face images, which is generally referred to as the

face space. Each vector is of length $N^2$, describes an N by N image, and is a linear

combination of the original face images. Because these vectors are the eigenvectors of the

covariance matrix corresponding to the original face images, and because they are face-like

in appearance, they are typically referred to as eigenfaces.

As described earlier, a 2-D facial image can be represented as a 1-D vector by concatenating each row (or column). Let the training set of face images be $\Gamma_1, \Gamma_2, \ldots, \Gamma_M$. The average face of the set is defined by $\Psi = \frac{1}{M}\sum_{n=1}^{M}\Gamma_n$. Each face differs from the average by the vector $\Phi_n = \Gamma_n - \Psi$. This set of very large vectors is then subject to principal component analysis, which seeks a set of M orthonormal vectors $u_n$ which best describes the distribution of the data. The kth vector $u_k$ is chosen such that:

$\lambda_k = \frac{1}{M}\sum_{n=1}^{M} (u_k^T \Phi_n)^2$ (3.11)

is a maximum, subject to:

$u_l^T u_k = \delta_{lk}$, where $\delta_{lk} = 1$ if $l = k$ and 0 otherwise (3.12)

The vectors $u_k$ and scalars $\lambda_k$ are the eigenvectors and eigenvalues, respectively, of the covariance matrix:

$C = \frac{1}{M}\sum_{n=1}^{M} \Phi_n \Phi_n^T = A A^T$ (3.13)

where the matrix $A = [\Phi_1\ \Phi_2\ \ldots\ \Phi_M]$. The matrix C is however $N^2$ by $N^2$, and determining the $N^2$ eigenvectors and eigenvalues is an intractable task for typical image sizes. A computationally feasible method is needed to find these eigenvectors.

If the number of data points in the image space is less than the dimension of the space ($M < N^2$), there will be at most M, rather than $N^2$, meaningful eigenvectors, in the sense that the remaining eigenvectors are associated with zero eigenvalues. Accordingly, it is computationally better to determine the eigenvalues and eigenvectors of a much smaller matrix. For example, in the situation outlined earlier, one computes the eigenvalues and the corresponding eigenvectors of an M x M matrix as opposed to a 65,536 x 65,536 matrix. Consider the eigenvectors $v_n$ of $A^T A$ such that:

$A^T A\,v_n = \mu_n\,v_n$ (3.14)

Premultiplying both sides by A,

$A A^T A\,v_n = \mu_n\,A v_n$ (3.15)

from which it can be inferred that the $A v_n$ are the eigenvectors of $C = A A^T$. Following this analysis, construct the M by M matrix $L = A^T A$, where $L_{mn} = \Phi_m^T \Phi_n$, and find the M eigenvectors $v_n$ of L. These vectors determine linear combinations of the M training set face images to form the eigenfaces $u_n$:

$u_n = \sum_{k=1}^{M} v_{nk}\,\Phi_k = A v_n$, $n = 1, \ldots, M$ (3.16)

With this analysis the calculations are greatly reduced, from the order of the number of pixels in the images ($N^2$) to the order of the number of images in the training set (M). In practice, the training set of face images will be relatively small ($M < N^2$), and the calculations become quite manageable. The associated eigenvalues allow the eigenvectors to be ordered according to their usefulness in characterizing the variation among the images. It is to be emphasized at this juncture that such an ordering is possible because all the eigenvalues of $A^T A$ (and hence of $A A^T$) are real and non-negative, these matrices being symmetric and positive semidefinite.
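The reduction from an $N^2$ x $N^2$ eigenproblem to an M x M one can be demonstrated on toy data. The sketch below is a hedged illustration, not the project's implementation: the three six-pixel "images" are made up, only the leading eigenface is computed, and a plain power iteration (an assumed stand-in for a proper eigensolver) is used so that the example needs no external libraries. The 1/M factor of (3.13) is omitted since it only rescales the eigenvalues.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def dominant_eigenpair(L, iters=200):
    """Power iteration for the largest eigenvalue of a symmetric matrix L."""
    v = normalize([1.0 + 0.1 * i for i in range(len(L))])
    for _ in range(iters):
        v = normalize([dot(row, v) for row in L])
    lam = dot(v, [dot(row, v) for row in L])
    return lam, v

def top_eigenface(faces):
    """Leading eigenvector of A A^T obtained from the small M x M matrix L = A^T A."""
    M = len(faces)
    mean = [sum(col) / M for col in zip(*faces)]                      # average face
    Phi = [[x - m for x, m in zip(face, mean)] for face in faces]    # Phi[n] is column n of A
    L = [[dot(Phi[m], Phi[n]) for n in range(M)] for m in range(M)]  # L_mn = Phi_m^T Phi_n
    lam, v = dominant_eigenpair(L)
    u = normalize([sum(v[k] * Phi[k][i] for k in range(M)) for i in range(len(mean))])  # (3.16)
    return lam, u, Phi

faces = [[1, 2, 3, 4, 5, 6], [2, 1, 4, 3, 6, 5], [6, 5, 4, 3, 2, 0]]  # M = 3 toy images, 6 pixels each
lam, u, Phi = top_eigenface(faces)
Cu = [sum(p[i] * dot(p, u) for p in Phi) for i in range(len(u))]      # (A A^T) u without forming A A^T
residual = max(abs(c - lam * ui) for c, ui in zip(Cu, u))
print(lam, residual)  # residual of A A^T u = lam u should be close to zero
```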

The purpose of PCA is to reduce the large dimensionality of the data space (observed variables) to the smaller intrinsic dimensionality of the feature space (independent variables) needed to describe the data economically. Such a reduction is possible when there is a strong correlation between the observed variables. Therefore, the goal is to find a set of $e_i$'s which have the largest possible projection onto each of the $w_i$'s. The eigenvectors corresponding to nonzero eigenvalues of the covariance matrix produce an orthonormal basis for the subspace within which most image data can be represented with a small amount of error. The eigenvectors are sorted from high to low according to their corresponding eigenvalues. The eigenvector associated with the largest eigenvalue is the one that reflects the greatest variance in the images; conversely, the smallest eigenvalue is associated with the eigenvector that captures the least variance. Experience shows that the usefulness of the eigenvectors decreases in a roughly exponential fashion; i.e., roughly 90% of the total variance is contained in the first 5% to 10% of the dimensions.

3.3 Euclidean Distance Method for Classification

The simplest method for determining which face class provides the best description of an input facial image is to find the face class k that minimizes the Euclidean distance $\epsilon_k = \|\Omega - \Omega_k\|$, where $\Omega$ is the vector of weights obtained by projecting the input image onto the eigenfaces and $\Omega_k$ is a vector describing the kth face class. If $\epsilon_k$ is less than some predefined threshold $\theta$, the face is classified as belonging to class k.
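A minimal sketch of this nearest-class rule is given below. The function name, the two-dimensional class descriptors and the threshold value are all made up for illustration; in practice each $\Omega_k$ would hold the eigenface weights of a known individual.

```python
import math

def nearest_face_class(omega, class_vectors, threshold):
    """Return (k, eps_k) for the closest class, or (None, eps_k) if eps_k >= threshold."""
    distances = [math.sqrt(sum((a - b) ** 2 for a, b in zip(omega, omega_k)))
                 for omega_k in class_vectors]
    k = min(range(len(distances)), key=distances.__getitem__)
    if distances[k] < threshold:
        return k, distances[k]
    return None, distances[k]

class_vectors = [[0.0, 0.0], [3.0, 4.0], [10.0, 0.0]]  # made-up class descriptors Omega_k
known = nearest_face_class([2.5, 4.0], class_vectors, threshold=1.0)
unknown = nearest_face_class([6.0, 9.0], class_vectors, threshold=1.0)
print(known, unknown)
```

Returning `None` when the smallest distance exceeds the threshold lets the caller treat the input as an unknown face instead of forcing it into the nearest class.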

Note: Eigenfaces, once obtained, can also be used to train a multilayer perceptron for

classification purposes. This type of training incorporates both supervised and unsupervised

learning. Once trained, it can be tested on a database of images.

3.4 Conclusions

In this chapter the eigenface approach to face recognition was dealt with. This is the first statistics-based face recognition technique to have been proposed by researchers. A principal advantage of this approach is that it is amenable to real-time face recognition. The eigenface approach followed in this project uses principal components and a nearest-neighbour classification technique. In the next chapter, face recognition using support vector machines is presented.



CHAPTER IV

Support Vector Machines

Support vector machines are universal feedforward networks, which can be used for pattern

classification and non-linear regression. Basically the support vector machine is a linear

machine with some suitable properties. To explain how it works, it is perhaps easiest to start

with the case of separable patterns that could arise in the context of pattern classification. In

this context, the main idea of a support vector machine is to construct a decision surface,

called a hyperplane, in such a way that the margin of separation between positive and

negative examples is maximized.

The machine achieves this desirable property by following a principled approach

rooted in statistical learning theory. More precisely, support vector machine is an

approximate implementation of the method of structural risk minimization. The induction

principle is based on the fact that the error-rate of a learning machine on test data is bounded

by the sum of the training-error rate and a term that depends on the Vapnik-Chervonenkis

(VC) dimension; in the case of separable patterns, a support vector machine produces a value

of zero for the first term and minimizes the second term. Accordingly, the support vector

machine can provide a good generalization on pattern classification problems despite the fact

that it does not incorporate problem-domain knowledge. This attribute is unique to support

vector machines.

A notion that is central to the construction of the support vector learning algorithm is

the inner-product kernel between a support vector xi and the vector x drawn from the input

space. The support vectors consist of a small subset of the training data extracted by the

algorithm. Depending on how this inner-product kernel is generated, different learning

machines can be constructed, each characterized by its own nonlinear decision surface. In

particular, the support vector learning algorithm is used to construct the following three types

of learning machines:

1. Polynomial learning machines

2. Radial-basis function networks

3. Two-layer perceptrons with a single hidden layer.

That is, for each of these feedforward networks the support vector learning algorithm is used

to implement the learning process using a given set of training data, automatically

determining the required number of hidden units. Stated in another way, whereas the back-

propagation algorithm is devised specifically to train a multilayer perceptron, the support

vector learning algorithm is of a more generic nature because it has wider applicability.

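Support vector machines are rooted in statistical learning theory, which gives a family of bounds that govern the learning capacity of the machine. The expected (actual) risk of a machine $f(x, \alpha)$ with adjustable parameters $\alpha$ is:

$R(\alpha) = \int \frac{1}{2}\,|y - f(x, \alpha)|\,dP(x, y)$ (4.1)

With probability $1 - \eta$, this risk is bounded from above by the sum of the empirical (training) risk $R_{emp}(\alpha)$ and a confidence term that depends on the VC dimension h, the number of training examples l and $\eta$:

$R(\alpha) \le R_{emp}(\alpha) + \sqrt{\left(h(\ln(2l/h) + 1) - \ln(\eta/4)\right)/l}$

Properties of the bound:

1. It is independent of P(x,y). It assumes only that both the training data and the test data are drawn independently according to some P(x,y).

2. It is usually not possible to compute the left hand side.

3. If h is known, the right hand side can be easily computed.

Thus, given several different learning machines and a fixed, sufficiently small $\eta$, the machine chosen is the one which gives the lowest upper bound on the actual risk. This gives a principled method for choosing a learning machine for a given task, and is the essential idea of structural risk minimization.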

The general structure of SVM is as shown in Fig. 4.1.



Figure 4.1: General structure of SVM

4.1 Basics

Let $X \subseteq R^n$ denote the set of possible inputs to the support vector machine. These inputs are pixel intensity vectors obtained after image pre-processing. An SVM is trained with images belonging to two classes at a time; a single machine is not efficient at encoding multiple classes. Assume that a point $x \in X$ is associated with one of two possible classes, denoted -1 and +1. This means that it is sufficient for the classifier to produce an output $Y_{estimate} \in \{-1, +1\}$. Consider two stochastic variables: X, representing the point, and Y, representing the class label. These variables give rise to the conditional distributions p(X=x|Y=1) and p(X=x|Y=-1). There are two phases concerning support vector machines: first training, and then classifying. Letting f: X -> $Y_{estimate}$ represent the discriminating function, the task during training is to minimize the probability of returning a wrong classification for the elements of the training set. This means that for the training set $Y_{estimate}$ should be identical to Y in as many cases as possible.

During training, SVMs solve this problem by finding a function f which, for every point $x_k = [x_{k1}, x_{k2}, \ldots, x_{kn}]^T$ with corresponding class label $y_k$ in the training set $S = ((x_1, y_1), \ldots, (x_l, y_l))$, has the following property: $f(x_k) \ge 0$ if $y_k = 1$, and $f(x_k) < 0$ if $y_k = -1$.



To formulate this, suppose that all the training data satisfy the following constraints:

$w \cdot x_i + b \ge +1$ for $y_i = +1$ (4.3.a)

$w \cdot x_i + b \le -1$ for $y_i = -1$ (4.3.b)

These can be combined into one set of inequalities:

$y_i(w \cdot x_i + b) - 1 \ge 0$, $i = 1, \ldots, N$ (4.3)

Now consider the points for which the equality in (4.3.a) holds. These points are the support vectors which lie on the hyperplane H1: $w \cdot x_i + b = +1$, with normal w and perpendicular distance from the origin $|1 - b|/\|w\|$. Similarly, the points for which the equality in (4.3.b) holds are also support vectors, which lie on the hyperplane H2: $w \cdot x_i + b = -1$, again with normal w, and perpendicular distance from the origin $|-1 - b|/\|w\|$. Hence $d_+ = d_- = 1/\|w\|$ and the margin is simply $2/\|w\|$. Note that H1 and H2 are parallel, as they have the same normal, and that no training points fall between them. Thus, the pair of hyperplanes which gives the maximum margin can be found by minimizing the objective function $0.5\|w\|^2$ subject to the constraints in equation (4.3). The solution for a typical two dimensional case is expected to have the form shown in Figure 4.2. Those training points for which the equality in (4.3) holds (i.e. those which wind up lying on one of the hyperplanes H1, H2), and whose removal would change the solution found, are called support vectors; they are indicated in Figure 4.2 by the extra circles.

Fig. 4.2: Linear separating hyperplanes for the separable case. The support vectors are circled.
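The constraints (4.3), the margin $2/\|w\|$ and the notion of a support vector can be checked on a hand-made example. In the sketch below the four training points and the hyperplane (w, b) are made up for illustration, and the function names are hypothetical; the hyperplane was derived by hand for this data rather than produced by a training algorithm.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def functional_margins(w, b, points, labels):
    """y_i (w . x_i + b) for every training point; (4.3) requires each value >= 1."""
    return [y * (dot(w, x) + b) for x, y in zip(points, labels)]

def geometric_margin(w):
    """Distance between the hyperplanes H1 and H2, i.e. 2 / ||w||."""
    return 2.0 / math.sqrt(dot(w, w))

points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]  # made-up separable data
labels = [+1, +1, -1, -1]
w, b = (0.5, 0.5), -1.0                                       # hand-derived hyperplane

margins = functional_margins(w, b, points, labels)
support = [x for x, m in zip(points, margins) if abs(m - 1.0) < 1e-12]  # points on H1 or H2
print(margins, support, geometric_margin(w))
```

Here the two points that meet (4.3) with equality are (2, 2) and (0, 0), and the margin equals the distance between them, as expected when the closest points of the two classes face each other directly.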



4.3 Quadratic Optimization for Finding the Optimal Hyperplane

A Lagrangian formulation is used to obtain the solution. There are two reasons for doing this. The first is that the constraints (4.3) are replaced by constraints on the Lagrange multipliers themselves, which are much easier to handle. The second is that in this reformulation of the problem, the training data appear (in the actual training and test algorithms) only in the form of dot products between vectors. This is a crucial property which allows one to generalize the procedure to the nonlinear case.

Thus, positive Lagrange multipliers $\alpha_i$, $i = 1, \ldots, N$, are introduced, one for each of the inequality constraints in equation (4.3). Recall that the rule is that for constraints of the form $c_i \ge 0$, the constraint equations are multiplied by positive Lagrange multipliers and subtracted from the objective function to form the Lagrangian. For equality constraints, the Lagrange multipliers are unconstrained. This gives the Lagrangian:

$L_P = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i\,y_i\,(w \cdot x_i + b) + \sum_{i=1}^{N} \alpha_i$ (4.4)

The solution to the constrained optimization problem is determined by the saddle point of $L_P$: $L_P$ must be minimized with respect to w and b, while simultaneously requiring that the derivatives of $L_P$ with respect to all the $\alpha_i$ vanish, all subject to the constraints $\alpha_i \ge 0$. Suppose that this set of constraints is referred to as C1. Now this is a convex quadratic programming problem, since the objective function is itself convex, and those points which satisfy the constraints also form a convex set. This means that the following dual problem can be equivalently solved: maximize $L_P$, subject to the constraints that the gradient of $L_P$ with respect to w and b vanish, and subject also to the constraints $\alpha_i \ge 0$. Suppose that this set of constraints is referred to as C2. This particular dual formulation of the problem is called the Wolfe dual. It has the property that the maximum of $L_P$, subject to constraints C2, occurs at the same values of w, b and $\alpha$ as the minimum of $L_P$, subject to constraints C1.



Requiring that the gradient of LP with respect to w and b vanish, gives the conditions:

1. w = yixi (4.5)

2. yi = 0 (4.6)

Since these are equality constraints in the dual formulation substituting them into equation to

give:

LD = 0.5 jyiyjxi.xj (4.7)

Maximize (4.7) subject to constraints:

1. yi = 0 (4.8)

2. i0 (4.9)

Note that there are different labels for Lagrangian (LP for primal, LD for dual), to emphasize

that the two formulations are different: LP and LD arise from the same objective function but

with different constraints, and the solution is found by minimizing LP or by maximizing LD.

Note also that if the problem is formulated with b=0, which amounts to requiring that all

hyperplanes contain the origin, the constraint (4.8) does not appear.

Support vector training (for the separable, linear case) therefore amounts to maximizing $L_D$ with respect to the $\alpha_i$, subject to constraint (4.8) and positivity of the $\alpha_i$, with the solution given by (4.5). Notice that there is a Lagrange multiplier $\alpha_i$ for every training point. In the solution, those points for which $\alpha_i > 0$ are called support vectors, and lie on one of the hyperplanes $H_1$, $H_2$. All other training points have $\alpha_i = 0$ and lie on that side of $H_1$ or $H_2$ such that the strict inequality in Equation (4.3) holds. For these machines, the support vectors are the critical elements of the training set. They lie closest to the decision boundary; if all other training points are removed or moved around (but so as not to cross $H_1$ or $H_2$) and training is repeated, the same separating hyperplane is obtained.
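The role of the support vectors can be illustrated with a small training sketch. To keep it short, the sketch assumes the simplified formulation with b = 0 (all hyperplanes pass through the origin), so that the equality constraint disappears and only positivity of the multipliers has to be enforced. It uses plain coordinate ascent on the dual objective; this is an illustrative solver chosen for brevity, not the training method used in this project, and the data set and function names are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def train_through_origin(X, y, passes=100):
    """Maximize L_D subject to alpha_i >= 0 only (b = 0 assumed).

    Each step maximizes L_D exactly along one coordinate alpha_i and then
    projects the result back onto alpha_i >= 0.
    """
    alpha = [0.0] * len(X)
    w = [0.0] * len(X[0])              # w = sum_i alpha_i y_i x_i, kept in sync
    for _ in range(passes):
        for i in range(len(X)):
            grad = 1.0 - y[i] * dot(w, X[i])      # dL_D / d alpha_i
            new_alpha = max(0.0, alpha[i] + grad / dot(X[i], X[i]))
            delta = new_alpha - alpha[i]
            w = [wk + delta * y[i] * xk for wk, xk in zip(w, X[i])]
            alpha[i] = new_alpha
    return alpha, w


# Hypothetical data: two positive points near the boundary, one distant negative.
X = [[2.0, 0.0], [0.0, 2.0], [-3.0, -3.0]]
y = [1, 1, -1]

alpha, w = train_through_origin(X, y)
support = [i for i, a in enumerate(alpha) if a > 1e-9]
margins = [t * dot(w, x) for x, t in zip(X, y)]

print("alpha   =", alpha)
print("w       =", w)
print("support =", support)
print("margins =", margins)
```

On this data the first two points end up with nonzero multipliers and a margin of exactly one, while the third point has a zero multiplier and a margin greater than one. Removing or moving the third point without crossing the margin leaves the computed hyperplane unchanged.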



4.4 The Karush-Kuhn-Tucker Conditions

The Karush-Kuhn-Tucker (KKT) conditions play a central role in both the theory and practice of constrained optimization. For the primal problem described in the previous section, the KKT conditions may be stated as:

$\alpha_i \, [\, y_i (w \cdot x_i + b) - 1 \,] = 0, \quad i = 1, \dots, N$ (4.10)

The solution vector $w$ is defined in terms of an expansion that involves the $N$ training examples. Note, however, that although this solution is unique by virtue of the convexity of the Lagrangian, the same cannot be said about the Lagrange coefficients $\alpha_i$. It is important to note that at the saddle point, for each Lagrange multiplier $\alpha_i$, the product of that multiplier with the corresponding constraint vanishes, as shown by (4.10). Therefore, only those multipliers whose constraints are met with equality, $y_i (w \cdot x_i + b) = 1$, can assume nonzero values.
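The vanishing product of multiplier and constraint can be checked numerically. The sketch below assumes a small hypothetical data set for which the optimal solution is known by inspection (w = (1, 0), b = -2); the data, the multipliers and the function name are illustrative only.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def complementarity_products(alpha, X, y, w, b):
    """Return alpha_i * (y_i (w . x_i + b) - 1) for every training point."""
    return [a * (t * (dot(w, x) + b) - 1.0) for a, x, t in zip(alpha, X, y)]


# Hypothetical separable data with a known solution.
X = [[3.0, 0.0], [1.0, 0.0], [5.0, 1.0]]
y = [1, -1, 1]
alpha = [0.5, 0.5, 0.0]   # third point is not a support vector
w = [1.0, 0.0]
b = -2.0

margins = [t * (dot(w, x) + b) for x, t in zip(X, y)]
products = complementarity_products(alpha, X, y, w, b)

print("margins  =", margins)
print("products =", products)
```

The first two points meet their constraints with equality and carry nonzero multipliers; the third point satisfies its constraint strictly, so its multiplier must be zero for the product to vanish.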

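Having determined the optimal Lagrange multipliers $\alpha_{o,i}$, the optimum weight vector is computed as:

$w_o = \sum_{i=1}^{N_s} \alpha_{o,i} y_i x_i$, where $N_s$ is the number of support vectors (4.11)

The optimum bias can be calculated as:

$b_o = 1 - w_o \cdot x^{(s)}$ for $d^{(s)} = 1$ (4.12)

Here $x^{(s)}$ denotes a support vector and $d^{(s)}$ its class label.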
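The two formulas translate directly into code. The following sketch recovers the weight vector and the bias from a given set of multipliers; the data set, the multipliers (assumed to come from some solver of the dual problem) and the function names are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def optimum_weights(alpha, X, y, tol=1e-9):
    """w_o = sum over support vectors of alpha_i y_i x_i."""
    w = [0.0] * len(X[0])
    for a, x, t in zip(alpha, X, y):
        if a > tol:                       # only support vectors contribute
            w = [wk + a * t * xk for wk, xk in zip(w, x)]
    return w


def optimum_bias(alpha, X, y, w, tol=1e-9):
    """b_o = 1 - w_o . x_s for a support vector x_s with label +1."""
    for a, x, t in zip(alpha, X, y):
        if a > tol and t == 1:
            return 1.0 - dot(w, x)
    raise ValueError("no support vector with label +1 found")


# Hypothetical data and multipliers for a separable problem.
X = [[3.0, 0.0], [1.0, 0.0], [5.0, 1.0]]
y = [1, -1, 1]
alpha = [0.5, 0.5, 0.0]

w_o = optimum_weights(alpha, X, y)
b_o = optimum_bias(alpha, X, y, w_o)
print("w_o =", w_o, " b_o =", b_o)
```

With these values the decision function $w_o \cdot x + b_o$ evaluates to +1 and -1 on the two support vectors, as required.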



4.5 Linear Support Vector Machines: The Non-separable Case

The margin of separation between classes is said to be soft if a data point $(x_i, y_i)$ violates the following condition:

$y_i (w \cdot x_i + b) \ge +1, \quad i = 1, \dots, N$

This violation can arise in one of two ways, as shown in Figure 4.3:

1. The data point falls inside the region of separation but on the correct side of the decision surface.

2. The data point falls on the wrong side of the decision surface, resulting in misclassification.

This can be rectified by introducing positive slack variables $\xi_i$, $i = 1, \dots, N$, in the constraints, which then become:

$w \cdot x_i + b \ge +1 - \xi_i$ for $y_i = +1$

$w \cdot x_i + b \le -1 + \xi_i$ for $y_i = -1$

These can be combined into one set of inequalities:

$y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad i = 1, \dots, N$, where the slack variables satisfy $\xi_i \ge 0$ for all $i$. (4.13)



Fig. 4.3: Linear separating hyperplanes for the non-separable case.

Thus, for an error to occur, the corresponding $\xi_i$ must exceed unity, so $\sum_i \xi_i$ is an upper bound on the number of training errors. Hence a natural way to assign an extra cost for errors is to change the objective function to be minimized from $0.5\|w\|^2$ to $0.5\|w\|^2 + C(\sum_i \xi_i)^k$, where $C$ is a parameter to be chosen by the user, a larger $C$ corresponding to assigning a higher penalty to errors. As before, minimizing the first term is related to minimizing the VC dimension of the support vector machine. The second term, as noted, is an upper bound on the number of training errors. This formulation of the cost function is therefore in accord with the principle of structural risk minimization.
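The relation between the slack variables and the training errors can be illustrated as follows. The sketch computes the smallest slack that satisfies (4.13) for a fixed hyperplane, counts the misclassified points and evaluates the penalized objective for k = 1. The hyperplane, the data and the value of C are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def slack_variables(w, b, X, y):
    """Smallest xi_i >= 0 with y_i (w . x_i + b) >= 1 - xi_i."""
    return [max(0.0, 1.0 - t * (dot(w, x) + b)) for x, t in zip(X, y)]


def soft_margin_objective(w, xi, C, k=1):
    """0.5 ||w||^2 + C (sum_i xi_i)^k"""
    return 0.5 * dot(w, w) + C * sum(xi) ** k


# Hypothetical hyperplane and data.
w, b = [1.0, 0.0], 0.0
X = [[2.0, 0.0],    # outside the margin, correct side     -> xi = 0
     [0.5, 0.0],    # inside the margin, correct side      -> 0 < xi <= 1
     [-1.0, 0.0],   # wrong side of the decision surface   -> xi > 1
     [-2.0, 0.0]]   # outside the margin, correct side     -> xi = 0
y = [1, 1, 1, -1]

xi = slack_variables(w, b, X, y)
errors = sum(1 for s in xi if s > 1.0)

print("xi      =", xi)
print("errors  =", errors, " sum of slacks =", sum(xi))
print("cost    =", soft_margin_objective(w, xi, C=10.0))
```

Only the third point is misclassified, and the sum of the slack variables (2.5) is indeed no smaller than the number of training errors (1).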

The parameter $C$ controls the trade-off between the complexity of the machine and the number of non-separable points; it may therefore be viewed as a form of regularization parameter. The parameter $C$ has to be selected by the user. This can be done in one of two ways:

1. The parameter $C$ is determined experimentally via the standard use of a training/(validation) test set, which is a crude form of resampling.

2. It is determined analytically by estimating the VC dimension and then using the bounds thus determined.



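For $k = 1$, neither the slack variables $\xi_i$ nor their Lagrange multipliers appear in the Wolfe dual problem, which becomes:

Maximize:

$L_D = \sum_i \alpha_i - 0.5 \sum_i \sum_j \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$ (4.14)

Subject to the constraints:

1. $0 \le \alpha_i \le C$ (4.15)

2. $\sum_i \alpha_i y_i = 0$ (4.16)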
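The only change with respect to the separable case is the upper bound C on the multipliers. The sketch below illustrates this with coordinate ascent in which each multiplier is clipped to the interval [0, C]. As a simplifying assumption, the bias is fixed at b = 0, so the equality constraint (4.16) is not enforced; a general solver would have to handle it. The solver, the one-dimensional data and the value of C are hypothetical and are not the setup used in this project.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def train_soft_margin_through_origin(X, y, C, passes=100):
    """Maximize L_D subject to 0 <= alpha_i <= C (b = 0 assumed)."""
    alpha = [0.0] * len(X)
    w = [0.0] * len(X[0])              # w = sum_i alpha_i y_i x_i, kept in sync
    for _ in range(passes):
        for i in range(len(X)):
            grad = 1.0 - y[i] * dot(w, X[i])          # dL_D / d alpha_i
            step = alpha[i] + grad / dot(X[i], X[i])
            new_alpha = min(C, max(0.0, step))        # box constraint (4.15)
            delta = new_alpha - alpha[i]
            w = [wk + delta * y[i] * xk for wk, xk in zip(w, X[i])]
            alpha[i] = new_alpha
    return alpha, w


# Hypothetical 1-D data: the third point lies on the wrong side.
X = [[2.0], [-2.0], [1.0]]
y = [1, -1, -1]
C = 10.0

alpha, w = train_soft_margin_through_origin(X, y, C)
xi = [max(0.0, 1.0 - t * dot(w, x)) for x, t in zip(X, y)]

print("alpha =", alpha)
print("w     =", w)
print("xi    =", xi)
```

The misclassified third point ends up with its multiplier at the upper bound C and a slack greater than one, whereas the multipliers of the remaining points stay below C.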

Having determined the optimal Lagrange multipliers $\alpha_{o,i}$, the optimum weight vector is computed as:

$w_o = \sum_{i=1}^{N_s} \alpha_{o,i} y_i x_i$ (4.17)

The optimum bias can be calculated as:

$b_o = 1 - w_o \cdot x^{(s)}$ for $d^{(s)} = 1$ (4.18)

The Karush-Kuhn-Tucker conditions are needed for the primal problem. The primal Lagrangian is:

$L_P = 0.5\|w\|^2 + C \sum_i \xi_i - \sum_i \alpha_i \{ y_i (w \cdot x_i + b) - 1 + \xi_i \} - \sum_i \mu_i \xi_i$ (4.19)

where the $\mu_i$ are the Lagrange multipliers introduced to enforce positivity of the $\xi_i$. The KKT conditions for the primal problem are therefore:



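$\alpha_i \{ y_i (w \cdot x_i + b) - 1 + \xi_i \} = 0, \quad i = 1, \dots, N$ (4.20)

and

$\mu_i \xi_i = 0, \quad i = 1, \dots, N$ (4.21)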

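The optimum bias $b_o$ is determined by taking any data point in the training set for which $0 < \alpha_{o,i} < C$, and therefore $\xi_i = 0$ (setting the derivative of $L_P$ with respect to $\xi_i$ to zero gives $\mu_i = C - \alpha_i > 0$, so that $\mu_i \xi_i = 0$ forces $\xi_i = 0$), and using that data point in the KKT conditions. However, from a numerical perspective it is better to take the mean value of $b_o$ resulting from all such data points in the sample.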
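The averaging of the bias can be sketched as follows. For a point with $0 < \alpha_i < C$ the slack is zero, so $y_i (w \cdot x_i + b) = 1$, which gives $b = y_i - w \cdot x_i$ because $y_i^2 = 1$. The one-dimensional data set, the value of C and the multipliers (assumed to come from a dual solver and chosen so that the KKT conditions hold) are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_bias(alpha, X, y, C, tol=1e-9):
    """Average b = y_i - w . x_i over all points with 0 < alpha_i < C."""
    w = [0.0] * len(X[0])
    for a, x, t in zip(alpha, X, y):
        w = [wk + a * t * xk for wk, xk in zip(w, x)]
    candidates = [t - dot(w, x)
                  for a, x, t in zip(alpha, X, y)
                  if tol < a < C - tol]       # unbounded support vectors only
    if not candidates:
        raise ValueError("no multiplier strictly between 0 and C")
    return w, sum(candidates) / len(candidates), candidates


# Hypothetical soft-margin solution: two points on the margin,
# two points inside the margin with their multipliers at the bound C.
X = [[3.0], [1.0], [1.5], [2.5]]
y = [1, -1, -1, 1]
C = 0.5
alpha = [0.25, 0.25, 0.5, 0.5]

w_o, b_o, per_point = mean_bias(alpha, X, y, C)
print("w_o =", w_o, " b_o =", b_o, " per-point estimates =", per_point)
```

In exact arithmetic every unbounded support vector yields the same bias; with multipliers obtained from a numerical solver the individual estimates differ slightly, which is why the mean is preferred.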

4.6 Nonlinear Support Vector Machines

Basically, the idea of a nonlinear support vector machine hinges on two mathematical operations, summarized here:

1. Nonlinear mapping of an input vector into a high-dimensional feature space that is hidden from both input and output.

2. Construction of an optimal hyperplane for separating the features discovered in the previous step.

An important question at this juncture is whether or not the methods discussed in earlier sections can be generalized to the case where the decision function is not a linear function of the data, or, in other words, when the input space is made up of nonlinearly separable patterns. Cover's theorem states that such a multi-dimensional space may be transformed into a new feature space where the patterns are linearly separable with high probability, provided two conditions are satisfied. First, the transformation is nonlinear. Second, the dimensionality of the feature space is high enough. The second operation exploits the idea of building an optimal separating hyperplane in accordance with the theory described, but with a fundamental difference: the separating hyperplane is now defined as a linear function of vectors drawn from the feature space rather than the original input space.
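The effect of a nonlinear mapping into a higher-dimensional space can be illustrated with the XOR problem, which is a standard textbook example and not part of the experiments in this report. The sketch below assumes the mapping (x1, x2) -> (x1, x2, x1*x2); the mapping, the grid of candidate hyperplanes and the function names are illustrative choices.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def separates(w, b, points, labels):
    """True if the hyperplane w . x + b = 0 classifies every point correctly."""
    return all(t * (dot(w, p) + b) > 0 for p, t in zip(points, labels))


def feature_map(x):
    """Assumed nonlinear map from the 2-D input space to a 3-D feature space."""
    return [x[0], x[1], x[0] * x[1]]


# XOR pattern with inputs and labels coded as +1 / -1.
X = [[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]]
y = [-1, 1, 1, -1]

# In the input space, no hyperplane on a coarse grid of candidates works.
grid = [k * 0.5 for k in range(-10, 11)]
found_in_input_space = any(separates([w1, w2], b, X, y)
                           for w1 in grid for w2 in grid for b in grid)

# In the feature space, a single hyperplane separates the two classes.
w_feature, b_feature = [0.0, 0.0, -1.0], 0.0
found_in_feature_space = separates(w_feature, b_feature,
                                   [feature_map(x) for x in X], y)

print("separable in input space (grid search):", found_in_input_space)
print("separable in feature space:", found_in_feature_space)
```

The grid search is only an illustration of the fact that the XOR pattern is not linearly separable in the input space; the separating hyperplane in the feature space uses only the added product coordinate.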

4.6.1 Inner Product Kernel

Observe that the data appears in the training problem only in the form of dot products, $x_i \cdot x_j$. Now suppose the data is first mapped to some other (possibly infinite dimensional) Euclidean space $H$, using a mapping $\Phi: R^d \to H$. Then of course the training algorithm would only depend on the data through dot products in $H$, i.e. on functions of the form $\Phi(x_i) \cdot \Phi(x_j)$. Now if there were a kernel function $K$ such that $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$, only $K$ would need to be used in the training algorithm, and one would never need to know explicitly what $\Phi$ is. One example is $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$. In this particular example, $H$ is infinite dimensional, so it would not be very easy to work with $\Phi$ explicitly. However, if one replaces $x_i \cdot x_j$ by $K(x_i, x_j)$ everywhere in the training algorithm, the algorithm produces a support vector machine in an infinite dimensional space, and furthermore does so in roughly the same amount of time it would take to train on the unmapped data.
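The kernel idea can be checked numerically on a kernel whose feature map is finite dimensional. The sketch below assumes the homogeneous quadratic kernel $K(x, z) = (x \cdot z)^2$ on two-dimensional inputs, for which one explicit map is $\Phi(x) = (x_1^2, \sqrt{2} x_1 x_2, x_2^2)$; this kernel is an illustrative choice and is not claimed to be the one used in this project. A Gaussian kernel is included for comparison, written in the common form $\exp(-\|x - z\|^2 / 2\sigma^2)$, which is an assumption about the intended parameterization.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def quadratic_kernel(x, z):
    """K(x, z) = (x . z)^2, evaluated in the input space."""
    return dot(x, z) ** 2


def quadratic_feature_map(x):
    """Explicit map for 2-D inputs whose dot product reproduces K."""
    return [x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2]


def gaussian_kernel(x, z, sigma):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2)); no finite explicit map."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


x = [1.0, 2.0]
z = [3.0, -1.0]

k_input = quadratic_kernel(x, z)
k_feature = dot(quadratic_feature_map(x), quadratic_feature_map(z))

print("kernel in input space     :", k_input)
print("dot product in feature sp.:", k_feature)
print("Gaussian kernel           :", gaussian_kernel(x, z, sigma=2.0))
```

Both evaluations of the quadratic kernel agree up to rounding, although the first never forms the feature vectors. For the Gaussian kernel only the input-space evaluation is available, which is exactly the situation in which the kernel substitution is useful.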

Suppose that the data belongs to a space denoted $L$. (The notation $L$ is a mnemonic for low-dimensional, as the notation $H$ for the range space is a mnemonic for high-dimensional.) Evidently, the map $\Phi$ is not necessarily onto; i.e., there need not exist an element in $L$ that is mapped onto a specific element in $H$.

Let $\varphi_j(x)$, $j = 1, \dots, M$, where $M$ is the number of hidden units, denote a set of nonlinear transformations. Define a hyperplane acting as the decision surface as follows: $\sum_{j=1}^{M} w_j \varphi_j(x) + b = 0$, where $w$ denotes the vector of linear weights connecting the feature space to the output space, and $b$ is the bias. The quantity $\varphi_j(x)$ represents the input supplied to the weight $w_j$ via the feature space. In effect, the vector $\varphi(x)$ represents the image induced in the feature space due to the input vector $x$. Thus, in terms of this image define the decision
