face recognition
DESCRIPTION
Face recognitionTRANSCRIPT
-
VISVESVARAYA TECHNOLOGICAL UNIVERSITY
BELGAUM-590014
Project Report
On
Face Recognition using Artificial Neural Networks Submitted in partial fulfilment of the requirement for the award of degree of
BACHELOR OF ENGINEERING
In
ELECTRONICS & COMMUNICATION
By
Prashanth Namit
1PI05EC072 1PI05EC057 Under the Guidance of
Prof. Koshy George
PES Centre for Intelligent Systems; Dept. of Telecommunication Engineering
DEPARTMENT OF ELECTRONICS AND COMMUNICATION
PES INSTITUTE OF TECHNOLOGY BANGALORE 560 085
2008- 2009
-
PES INSTITUTE OF TECHNOLOGY BANGALORE 560 085
CERTIFICATE
Certified that the project work entitled Face Recognition Techniques using Neural Networks is a bona fide work carried out by Prashanth H, Namit Chauhan bearing USN 1PI05EC072 and 1PI05EC043 in partial fulfilment for the award of degree of BACHELOR OF ENGINEERING in ELECTRONICS & COMMUNICATION of Visvesvaraya Technological University, during the academic year 2008-2009. It is certified that all corrections/suggestions indicated for internal assessment have been incorporated in the report. The project report has been approved as it satisfies the academic requirements with respect to project work prescribed for the mentioned degree.
Dr Koshy George PES Centre for Intelligent Systems; Dept. of Telecom. Engg., PESIT
Dr A. Vijaykumar HOD,
Dept. of E&C Engg., PESIT
Dr K. N. B. Murthy Principal PESIT
Prashanth
1PI05EC072
Namit Chauhan
1PI05EC057
External Viva
Name of the Examiners Signature with Date
1._____________________ _____________________
2. _____________________ _____________________
-
ACKNOWLEDGEMENTS
We would like to express our sincere thanks to Dr. Koshy George Professor, TE
Department, PESIT, Bangalore, for his kind and constant support and guidance during the
course of this project without which, the project would not have been a success
We would like to thank and extend 1our heart-felt gratitude to Dr. A Vijaya Kumar,
HOD of the Department of Electronics and Communication, for his kind, inspiring and
illustrious guidance and ample encouragement.
We would like to thank Prof. D Jawahar, CEO, PES Group of Institutions, and
Dr. K N Balasubramanya Murthy, Director & Principal, PESIT, for the valuable
resources provided for the completion of the project.
We would like to express our sincere thanks and deep sense of gratitude to our parents
for their everlasting support.
And finally it gives us immense pleasure to thank our friends and everyone who have
been instrumental in the successful completion of the project.
-
Table of Contents:
Serial
Number Topic Page Number
1 Face Recognition: An Introduction 1
1.1 Classification 3
1.1.1 Application of face recognition 6
1.1.2 Adopted approach 6
1.2 Literature survey 7
1.2.1 Existing algorithms 7
1.2.2 Existing techniques 8
1.3 Organization of the report 10
2 Neural Networks: An Introduction 11
2.1 Multilayer Perceptrons 13
2.2 Derivation of Error back propagation
algorithm 15
2.3 Activation function 20
2.4 Rate of learning 22
2.5 Modes of Training 23
2.6 Local minima of Error Surface 24
2.7 Supervised learning viewed as an
optimization problem 24
2.8 Face recognition using Neural Networks 26
2.9 Conclusions 26
3 Principal Components Analysis: An
Introduction 27
3.1 Derivation of Principal components 28
3.2 Face Recognition using Eigen faces 30
3.3 Euclidean Distance method for
Classification 33
-
ii
3.4 Conclusions 33
4 Support Vector Machines: An Introduction 34
4.1 Basics 36
4.2 Linear Support Vector Machines:
The Separable Case 37
4.3 Quadratic Optimization for Finding the
Optimal Hyperplane 39
4.4 The Karush-Kuhn-Tucker Conditions 41
4.5 The Non Separable Case 42
4.6 Nonlinear Support Vector Machines 45
4.6.1 Inner product Kernel 46
4.6.2 Mercers Theorem 48
4.6.3 Summary of Inner-Product Kernels 49
4.7 Multiple Class Support Vector Machine 49
4.8 Binary Search Tree 52
4.9 SVM for Face Recognition 53
4.10 Conclusions 53
5 Image Preprocessing 54
5.1 Histogram Equalization 54
5.2 Median Filtering 57
5.3 Bi-Cubic Interpolation 57
5.4 Conclusions 58
6 Results 59
6.1 MLP 59
6.2 PCA 63
6.3 SVMs 67
6.4 PCA+MLP 70
7 Conclusions and Future work 73
8 References 75
Appendix 78
-
iii
List of Figures
Figure 1.1: Applications of Face Recognition Figure 1.2: Supervised Learning Figure 1.3: Unsupervised Learning Figure 1.4: (a) Good Generalization, (b) Bad Generalization Figure 2.1: Block diagram of nervous system Figure 2.2: Single-Layer Feedforward Networks, Multilayer Feedforward Networks,
Recurrent Networks
Figure 2.3: Nonlinear model of a neuron Figure 2.4: MLP with one hidden layer Figure 2.5: Signal-flow graph highlighting the details of output neuron j Figure 2.7: Signal-flow graphical summary of back-propogation learning. Top part of
the graph: forward pass. Bottom part of the graph: backward pass.
Figure 2.8: Hyperbolic tangent function Figure 2.9: Error surface Figure 3.1: (a) The 1st PC z1 is a minimum distance fit to a line in X space. (b) The 2nd
PC z2 is a minimum distance fit to a line in the plane perpendicular to the 1st PC.
Figure 3.2: Eigen faces Figure 4.1: General structure of SVM Figure 4.2: Linear separating hyperplane for separable case Figure 4.3: Linear separating hyperplane for non-separable case Figure 4.4: Mapping into higher space Figure 4.5: Decision surface of Kernels Figure 4.6: Decision surface using one SVM Figure 4.7: Problem encountered when one SVM is used for multi- classification Figure 4.8: Multi-classification using Binary classifiers Figure 4.9: Binary tree structure for 8 classes Figure 5.1: Image after Histogram Equalization Figure 5.2: Median filtering Figure 6.1: Face recognition accuracy with multilayer feedforward networks with
histogram equalization and median filtering.
-
iv
Figure 6.2: Face recognition accuracy with multilayer feedforward networks with only median filtering.
Figure 6.3: Face recognition accuracy with multilayer feedforward networks with no pre-processing
Figure 6.4: Face recognition accuracy with PCA with histogram equalization and median filtering.
Figure 6.5: Face recognition accuracy with PCA with only median filtering. Figure 6.6: Face recognition accuracy with PCA and no pre-processing Figure 6.7: Face recognition accuracy with PCA+MLP with histogram equalization
and median filtering.
Figure 6.8: Face recognition accuracy with PCA+MLP with only median filtering.
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 1
CHAPTER I
Face Recognition: An Introduction
Human beings have recognition capabilities that are unparalleled in the modern
computing era. These are mainly due to the high degree of interconnectivity, adaptive nature,
learning skills and generalization capabilities of the nervous system. The human brain has
numerous highly interconnected biological neurons which, on some specific tasks, can
outperform super computers. A child can accurately identify a face, but for a computer it is a
cumbersome task. Therefore, the main idea is to engineer a system which can emulate what a
child can do. Advancements in computing capability over the past few decades have enabled
comparable recognition capabilities from such engineered systems quite successfully. Early
face recognition algorithms used simple geometric models, but recently the recognition
process has now matured into a science of sophisticated mathematical representations and
matching processes. Major advancements and initiatives have propelled face recognition
technology into the spotlight.
Face recognition technology can be used in wide range of applications. (A few
example applications are shown in Fig 1.1.) An important aspect is that such technology
should be able to deal with various changes in face images, like rotation, changes in
expression. Surprisingly, the mathematical variations between the images of the same face
due to illumination and viewing direction are almost always larger than image variations due
to changes in face identity. This presents a great challenge to face recognition. At the core,
two issues are central to successful face recognition algorithms: First, the choice of features
used to represent a face. Since images are subject to changes in viewpoint, illumination, and
expression, an effective representation should be able to deal with these possible changes.
Secondly, the classification of a new face image using the chosen representation.
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 2
Face Recognition can be of two types:
Feature Based (Geometric) Template Based (Photometric)
In geometric or feature-based methods, facial features such as eyes, nose, mouth, and chin
are detected. Properties and relations such as areas, distances, and angles between the
features are used as descriptors of faces. Although this class of methods is economical and
efficient in achieving data reduction and is insensitive to variations in illumination and
viewpoint, it relies heavily on the extraction and measurement of facial features.
Unfortunately, feature extraction and measurement techniques and algorithms developed to
date have not been reliable enough to cater to this need.
In contrast, template matching and neural methods generally operate directly on an
image-based representation of faces, i.e., pixel intensity array. Because the detection and
measurement of geometric facial features are not required, this type of method has been more
practical and easier to implement when compared to geometric feature-based methods.
Figure 1.1: Applications of face recognition.
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 3
1.1 Classification Face recognition is nothing but the ability of a machine to successfully categorize a set
of images based on certain discriminatory features. Classification or pattern recognition can
be a very difficult problem and is still a very active field of research. The aim of the current
project can be considered as an attempt to simulate the recognition capability of the human
brain. A human being has the ability to put a certain scenario in a context and identify its
components. Of course it is not entirely obvious how to make the machine discriminate
between elements of different objects of different classes. The classification task may be
mathematically represented as the mapping:
f : A x B = C (1.1)
where A represents the elements that have to be classified, i.e. the pixel intensity vectors,
using B as the function parameters. The output C has to discriminate between images of
different classes so that the class of each element can be determined. Classification by
machines requires two steps: First, the properties which distinguish an element of one class
from that of another class have to be identified. Secondly, the machine has to be trained to
know how to discriminate between the classes by defining a learning model. The learning
model describes the procedure that has to be used for the actual training. Basically there are
two types of training often referred to as supervised and unsupervised learning.
Supervised Learning: When dealing with supervised learning, the trainer feeds the machine a sample and the machine classifies it. The trainer then tells the machine
whether it has classified the sample correctly or not. In the case that it has
misclassified the sample, the machine has to adjust its classifier parameters to better
classify the given sample. Supervised learning is illustrated in Fig. 1.2. Examples of
these include Bayes Classifier and Neural Networks. Another important aspect
involved is the method of training imparted to these learning machines. It is important
to have a training set which is representative for the given classes so that future
classification can be successful.
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 4
Figure 1.2: Supervised Learning
Unsupervised Learning: In unsupervised or self-organized learning there is no external trainer to oversee the learning process. Rather a task-independent measure of
the quality of classification is computed, and the free parameters of the network are
optimized with respect to that measure. Once the network has become tuned to the
statistical regularities of the input data, it develops the ability to form internal
representations for encoding features of the input and thereby create new classes
automatically. One technique for this is called clustering. In this case the trainer has
nothing to do with the classification. Instead the learning machine determines which
elements should belong to the same class and assigns a class accordingly.
Unsupervised learning is as shown below in Fig. 1.3.
Figure 1.3: Unsupervised Learning
It is to be noted that over-training leads to poor generalization. For example, an over-
trained machine would return different classes for the same image subjected to different
modifications. As illustrated in the Fig 1.4, over-training essentially is memorizing the data,
and leads to poor generalization.
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 5
Figure 1.4: (a) Good Generalization, (b) Bad Generalization
Face recognition is a very complex form of pattern recognition. It consists of
classifying highly ambiguous input signals, with multiple dimensions and matching them
with known signals. The following are the problems that occur:
The intrinsic 2D structure of an image matrix is more often than not removed. Consequently, the spatial information stored therein is discarded and not effectively
utilized.
Curse of Dimensionality: Each image sample is modeled is typically a point in a high-dimensional space. Consequently, a large number of training samples are often
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 6
needed to get reliable and robust estimation about the characteristics of data
distribution.
Usually very limited amounts of data are available in real applications such as face recognition, image retrieval, and image classification.
A number of ways have been proposed to solve these problems. Finding an effective means
to reduce the dimensionality is the first step in face recognition.
1.1.1 Applications of Face Recognition Some of the applications are as follows:
Trying to find a face within a large database of faces. In this approach the system returns a possible list of faces from the database. The most useful applications contain
crowd surveillance, video content indexing, personal identification (example: drivers
license), mug shots matching, etc.
Real time face recognition: Here, face recognition is used to identify a person on the spot and grant access to a building or a compound, thus avoiding security hassles. In
this case the face is compared against a multiple training samples of a person.
1.1.2 Adopted Approach Face recognition is done naturally by humans. However, developing a computer
algorithm to achieve the same purpose is a rather difficult task. Starting with images, the aim
is to distinguish between faces of different people. One class of methods presupposes the
existence of certain features in the image, i.e. eyes, nose, mouth, hair, and an algorithm is
devised to find and characterize these features. A second class of methods also assumes that
there are features to be found but does not predefine what these features are or how to
measure them. The approach followed in this project belongs to the latter class of methods
since it much less restrictive than the former. Specifically, given a set of training images, the
strategy is to develop algorithms that construct a different set of images which
approximates best the given image data set. The training set is then projected onto this
subspace. To query a new image, its projection onto this subspace is determined, and a
training image whose projection is closest to that of the new image is sought.
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 7
1.2 Literature Survey Several algorithms and techniques for face recognition have been developed in the past
by researchers. These are discussed briefly in this section.
1.2.1 Existing Algorithms
Face Recognition Based on Principal Component Analysis: Principal Component
Analysis (PCA) is a well known algorithm used in face recognition. The basic idea in PCA is
to determine a vector of much lower dimension that best approximates in some sense a given
data vector. Thus, in face recognition, it takes an s-dimensional vector representation of each
face in a training set of images as input, and determines a t-dimensional subspace whose
basis vector is maximum corresponding to the original image. The dimension of this new
subspace is lower than the original one (t
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 8
Elastic Bunch Graph Matching: All human faces share a similar topological structure.
Faces are represented as graphs, with nodes positioned at fiducially points (eyes, nose, etc.)
and edges labeled with 2-D distance vectors. Each node contains a set of 40 complex Gabor
wavelet coefficients at different scales and orientations (phase, amplitude). They are called
"jets". Recognition is based on labeled graphs. A labeled graph is a set of nodes connected by
edges; each node is labeled as jets, edges are labeled as distances [20].
Kernel Methods: The face manifold in subspace need not be linear. Kernel methods are a
generalization of linear methods. Direct non-linear manifold schemes are explored to learn
this non-linear manifold [4],[5],[10],[11].
Trace Transform: The Trace transform, a generalization of the Radon transform, is a new
tool for image processing which can be used for recognizing objects under transformations,
e.g. rotation, translation and scaling. To produce the Trace transform one computes a
functional along tracing lines of an image. Different Trace transforms can be produced from
an image using different trace functional [21],[22].
1.2.2. Existing Techniques Eigenfaces: Many approaches to the overall face recognition problem have been devised
over the years, but one of the most accurate and fastest ways to identify faces is to use the so-
called eigenface technique. This uses a combination of linear algebra and statistical analysis
to generate a set of basis faces the eigenfaces against which inputs are tested. Eigenface
mostly uses PCA and ICA tools for face recognition. Using these, the efficiency is variable.
Although ICA significantly outperforms the standard PCA, it has been claimed that that the
performance of ICA strongly depends on its involved PCA process. The pure ICA projection
has little effect on the performance of face recognition.
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 9
Range Imaging: Range imaging is the name for a collection of techniques, used to produce
a 2D image showing the distance to a set of points in a scene from a specific point, normally
associated with some type of sensor device. The resulting image, the range image, has pixel
values which correspond to the distance, e.g., brighter values mean shorter distance, or vice
versa. If the sensor which is used for produce the range image is properly calibrated, the pixel
values can be given directly in physical units such as centimeters [23].
Line edge map: Image-based face recognition algorithm that uses a set of random rectilinear
line segments of 2D face image views as the underlying image representation, together with a
nearest-neighbor classifier as the line-matching scheme. The combination of 1D line
segments exploits the inherent coherence in one or more 2D face image views in the viewing
sphere [24].
Neural Network based Face Recognition Techniques: Neural networks are used to create
the face database and recognize the face. A separate network for each person is built. The
input face is projected onto the eigenface space first to get a new descriptor. This descriptor
is used as network input and applied to each person's network. The one with maximum
output is selected and reported as the host if it is larger than a predefined recognition
threshold [3],[1],[13],[12].
Gabor wavelet networks (GWN): The Gabor wavelet network is used for an effective
object representation. The Gabor wavelet network has several advantages such as invariance
to some degree with respect to translation, rotation and dilation. Furthermore, it has the
ability to generalize and to abstract from the training data and to assure, for a given network.
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 10
Sparse Representation: The input image is divided with L-1 minimization. And then those
sparse will be compared with training datas sparse. There are also many techniques which
has been tried to use for facial recognition techniques, but among all of them, the eigenface
technique shows the fastest and most accurate results than other techniques. So far, from our
research, we have seen that the eigenface technique has faster performance than other
techniques. But in recent times the researchers and scientists are trying to focus on Human
Vision System (HVS) for face recognition. It has not been implemented in practical field of
face recognition yet but has laid the foundation for future studies [25].
1.3 Organization of the Report The report is organized as follows: The basics of neural networks and the learning
processes involved in neural networks are discussed in Chapter 2. The back propagation
learning and conjugate gradient algorithms are also dealt with. Then the process of face
recognition using neural networks is then described. Chapter 3 covers a famous and widely
used method for face recognition i.e. eigenface approach. Both principal components and
nearest neighbour classification are dealt with. Support vector machines together with the
various optimization techniques involved are presented in Chapter 4. Both linear and
nonlinear cases are explained, as well as the kernels involved in support vector machines
(SVMs). Face recognition using binary SVMs for multi-classification purposes is also
explained. Face recognition involves several pre-processing steps. These are detailed in
Chapter 5. The results obtained using a number of techniques presented in the previous
chapters are compared in Chapter 6. Concluding comments and directions for future work are
indicated in Chapter 7.
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 11
CHAPTER II
Artificial Neural Networks
Neural Networks are nonlinear models of the neural pathways in the nervous system. The
flow of signals in the nervous system is as shown in Fig. 2.1. The propagation of stimulus, in
either direction throughout the system of sense organs, is due to the firing of simple
elements operating in parallel. These elements can be structured as:
Single-Layer Feedforward Networks Multilayer Feedforward Networks Recurrent Networks These structures are shown in Fig. 2.2, where each node represents a mathematical model
of a neuron.
Figure 2.1: Block diagram of nervous system
Figure 2.2: Single-Layer Feedforward Networks, Multilayer Feedforward Networks, Recurrent
Networks
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 12
The manner in which this transmission proceeds is determined by the learning
abilities of these neurons. Learning may be error-correction learning, memory-based
learning, Hebbian learning; or based on paradigms as supervised learning, unsupervised
learning, reinforcement learning through continued interaction with the environment.
Rosenblatts perceptron is a basic form of a neural network, built around a nonlinear
model of a neuron, namely, the McCulloh-Pitts model of a neuron as shown in Fig. 2.3. The
adder node of the neuron model computes a linear combination of input data applied to the
synapses and the tendency to be biased, given by the value b, during a classification task.
This induced local field is converted to a binary signal, +1 or -1, which determines the
weighted capacity to activate the neighboring neuron.
Figure 2.3: Nonlinear model of a neuron
In its most general form, a neural network is a machine that is designed to model the
way in which the brain performs a particular task or function of interest. It resembles the
brain in the following two respects:
1. Knowledge is acquired by the network from its environment through a learning
process.
2. Interneuron connection strengths, known as synaptic weights, are used to store the
acquired knowledge.
The procedure used to perform the learning process is called a learning algorithm, the
function of which is to modify the synaptic weights of the network in an orderly fashion to
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 13
attain a desired design objective. The modification of synaptic weights provides the
traditional basis for the design of neural networks. Further, it is also possible for a neural
network to modify its own topology, which is motivated by the fact that neurons in the
human brain can die and that new synaptic connections can grow.
2.1 Multilayer Perceptrons Multilayer perceptrons are multilayer feedforward networks consisting of at least three
layers: the input layer, the hidden layer(s) and the output layer:
1. The input layer consists of input sensory nodes to which data, represented by pixel
intensity vectors of pre-processed images, is presented. These propagate through the
neural network in a forward direction, on a layer by layer basis, hence known as
multi-layer feedforward networks.
2. The hidden layer performs a non-linear transformation on the input signal into a new
space called the feature space. More than one hidden layer maybe used to achieve
this purpose. As learning progresses, the hidden neurons gradually discover the
features that characterize the images. The number of hidden neurons in each hidden
layer, and the number of hidden layers is critical to the learning process, and is varied
for higher accuracy.
3. At the output layer, the output/functional signal(s), expressed as a continuous
nonlinear function of the input signal and synaptic weights associated with that
neuron, is obtained.
Fig. 2.4 shows the architectural graph of a multilayer perceptron. The network shown is fully
connected, that is, a neuron in any layer of the network is connected to all the neurons in the
previous layer.
In the case of supervised learning, this output signal is subtracted from the desired
response, to yield the error signal. In the classification problem, the desired response
corresponds to the classes of images. For this specific problem, the number of output neurons
equals the number of classes of images. The backward propagation of the estimate of the
error gradient vector, that is, the gradients of the error surface with respect to the weights
connected to the inputs of a neuron, forms the core of the supervised learning process. In the
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 14
training phase, the optimum weights of the hidden and output layers are obtained, which
minimize the error estimate for desired results. The optimum weights are learnt by
propagating the error signal backwards, against the synaptic connections. This supervised
learning algorithm is known as error back propagation algorithm, which is described in detail
in Section 2.2.
Figure 2.4: MLP with one hidden layer
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 15
2.2 Derivation of Error Back Propagation Algorithm The error signal at the output of neuron j at iteration n, that is, presentation of the nth training
example, is defined by:
ej(n)=dj(n)-yj(n) (2.1)
The instantaneous value of the error energy for neuron j is defined as 0.5e2j(n).
Correspondingly, the instantaneous value of the error energy E(n) of the total error energy is
obtained by summing 0.5e2j(n) over all neurons in the output layer. Thus:
E(n)=0.5 (2.2)
where the set C includes all the neurons in the output layer of the network. Let N denote the
total number of patterns contained in the training set. The average squared error energy is
obtained by summing E(n) over all n and then normalizing with respect to the set size N, as
shown by:
Eav=1/N (2.3)
The instantaneous error energy E(n), and therefore the average error energy Eav, is a function
of all the free parameters (i.e. synaptic weights and bias levels) of the network. For a given
training set, Eav represents the cost function as a measure of learning performance. The
objective of the learning process is to adjust the free parameters of the network to minimize
Eav. A simple method of training is considered in which the weights are updated on a pattern
by pattern basis until one epoch, that is, one complete presentation of one training set has
been dealt with. The adjustments to the weights are made in accordance with the respective
errors computed for each pattern presented to the network.
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 16
Figure 2.5: Signal-flow graph highlighting the details of output neuron j
The arithmetic average of these individual weight changes over the training set is
therefore an estimate of the true change that would result in from modifying the weights
based on minimizing the cost function Eav over the entire training set. Consider Fig. 2.5
which depicts neuron j being fed by a set of function signals produced by a layer of neurons
to its left. The induced local field vj(n) produced at the input of the activation function
associated with neuron j is therefore:
vj(n)= (2.4)
and m is the total number of inputs (excluding the bias) applied to neuron j. The synaptic
weight wj0 (corresponding to the fixed input y0=+1) equals the bias bj applied to neuron j.
Hence the function signal yj(n) appearing at the output of neuron j at iteration n is:
yj(n)=j(vj(n)) (2.5)
The back-propagation algorithm applies a correction wji(n) to the synaptic weight wji(n),
which is proportional to the partial derivative E(n)/dwji(n). According to the chain rule of
calculus, one can express this gradient as:
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 17
E(n)/wji(n)= E(n)/ ej(n). ej(n)/ yi(n). yj(n)/ vj(n). vj(n)/ wji(n) (2.6)
The derivative E(n)/ wji(n) represents the sensitivity factor, determining the direction of
search in weight space for the synaptic weight wji. Differentiating both sides of equation (2.2)
with respect to ej(n), equation (2.1) with respect to yj(n), equation (2.5) with respect to
yj(n), and equation (2.4) with respect to wji(n) one respectively gets:
E(n)/ ej(n)=ej(n) (2.7)
ej(n)/ yj(n)=-1 (2.8)
yj(n)/ vj(n)=j(vj(n)) (2.9)
vj(n)/ wji(n)=yi(n) (2.10)
The use of equations (2.7) and (2.10) in (6) yields:
E(n)/ wji(n)=-ej(n)(vj(n))yi(n) (2.11)
The correction wji(n) applied to wji(n) is defined by the delta rule:
wji(n)=-E(n)/wji(n) (2.12)
where is the learning-rate parameter of the back propagation algorithm. The use of the
minus sign accounts for the gradient descent in weight space. Accordingly, the use of (2.11)
in (2.12) yields:
wji(n)= j(n)yi(n) (2.13)
where local gradient j(n) is defined by:
j(n)=- E(n)/ vj(n)=ej(n)j(vj(n)) (2.14)
The local gradient points to required changes in synaptic weights. According to (2.14), the
local gradient j(n) for output neuron j is equal to the product of the corresponding error
signal ej(n) for that neuron and the derivative j(vj(n)) of the associated activation function.
Case 1: Neuron j is an output node
When neuron j is located in the output layer of the network, it is supplied with a
desired response of its own. Equation (2.1) is used to compute the error signal ej(n)
associated with this neuron.
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 18
Case 2: Neuron j is the hidden node
When neuron j is located in a hidden layer of the network, there is no specified
desired response for that neuron. Accordingly, the error signal for a hidden neuron would
have to be determined recursively in terms of the error signals of all the neurons to which
that hidden neuron is directly connected; this is where the development of the back-
propagation algorithm gets complicated. Consider the situation depicted in Fig 2.6, which
depicts neuron j as a hidden node of the network. According to equation (2.14), the local
gradient j(n) for hidden neuron j is redefined as:
j(n)=-[E(n)/ yj(n)].[yj(n)/ vj(n)]
=-[E(n)/ yj(n)].j(vj(n)), neuron j is hidden (2.15)
where in the second line equation (2.9) is used. To calculate the partial derivative E(n)/
yj(n), one proceeds as follows. From equation (2.2):
E(n)=0.5 , neuron k is an output node (2.16)
Differentiating equation (2.16) with respect to the function signal yj(n):
E(n)/ yj(n)= [ek(n)/ yj(n)] (2.17)
Using the chain rule for the partial derivative ek(n)/ yj(n), and rewriting in the equivalent
form the following is obtained:
E(n)/yj(n)= .ek(n)/vk(n).vk(n)/yj(n) (2.18)
However,
ek(n)=dk(n)-yk(n)
=dk(n)-k(vk(n)), neuron k is output node (2.19)
Hence:
ek(n)/ vk(n)=-k(vk(n)) (2.20)
Note that for neuron k the induced local field is:
vk(n)= (2.21)
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 19
Figure 2.6: Signal-flow graph highlighting the details of output neuron k connected to hidden neuron j
where m is the total number of inputs (excluding the bias) applied to neuron k. Here again,
the synaptic weight wko(n) is equal to the bias bk(n) applied to neuron k, and the
corresponding input is fixed at the value +1. Differentiating equation (2.21) with respect to
yj(n) yields:
vk(n)/ yj(n)=wkj(n) (2.22)
By using equation (2.20) and equation (2.22) in equation (2.18), the desired partial derivative
is obtained:
E(n)/ yj(n)= - k(vk(n))wkj(n)
= - wkj(n) (2.23)
where in the second line the definition of the local gradient k(n) given in equation (2.14)
with the index k substituted for j is used. Finally, the back-propagation formula for the local
gradient j(n) is obtained as described:
j(n)= (vj(n)) wkj(n) (2.24)
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 20
The factor j(vj(n)) involved in the computation of the local gradient j(n) in (2.24) depends
solely on the activation function associated with hidden neuron j. The remaining factor
involved in this computation, namely the summation over k, depends on two sets of terms.
The first set of terms, the k(n), requires knowledge of the error signals ek(n), for all neurons
that lie in the layer to the immediate right of hidden neuron j, and that are directly connected
to neuron j. The second set of terms, the wkj(n), consists of synaptic weights associated with
these connections. Figure 2.7 summarizes back-propagation learning.
Figure 2.7: Signal-flow graphical summary of back-propagation learning. Top part of the graph: forward
pass. Bottom part of the graph: backward pass.
2.3 Activation function The computation of for each neuron of a multilayer perceptron requires knowledge
of the derivative of the activation function (.) associated with that neuron. For this
derivative to exist, the function (.) must be continuous. Therefore, differentiability is the
only requirement that an activation function has to satisfy.
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 21
Hyperbolic tangent function as shown in Fig 2.8, an antisymmetric function, is a
commonly used form of sigmoidal non-linearity, which in its most general form is defined
by:
j(vj(n))=atanh(bvj(n)), (a,b)>0 (2.25)
where a and b are constants. Its derivative with respect to vj(n) is given by:
j(vj(n))=absec2h(bvj(n))
=ab(1-tanh2(bvj(n)))
=b/a[a-vj(n)][a+yj(n)] (2.26)
For a neuron j located in the output layer yj(n)=oj(n), the local gradient is therefore:
j(n)=ej(n)j(vj(n))
=b/a[dj(n)-oj(n)][a-oj(n)][a+oj(n)] (2.27)
For a neuron j in the hidden layer,
j(n)=j(vj(n)) wkj(n)
=[b/a][a-yj(n)][a+yj(n)] wkj(n), neuron j is hidden (2.28)
Figure 2.8: Hyperbolic tangent function
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 22
2.4 Rate of learning The back-propagation algorithm provides an approximation to the trajectory in weight
space computed by the method of steepest descent. The smaller the learning-rate parameter
is made, the smaller the changes to the synaptic weights in the network will be from one
iteration to the next, and the smoother will be the trajectory in weight space. The
improvement, however is attained at the cost of a slower rate of learning. If, on the other
hand, the learning-rate parameter is made too large in order to speed up the rate of learning,
the resulting large changes in the synaptic weights assume such a form that the network may
become unstable. A simple method of increasing the rate of learning yet avoiding the danger
of instability is to modify the delta rule, by including a momentum term as follows:
wji(n)= wji(n-1)+ j(n)yi(n) (2.29)
where is a positive number called the momentum constant. Rewriting the above equation
(2.29) as a time series with index t. The index t goes from the initial time 0 to current time n.
It maybe viewed as a first-order difference equation in the weight correction wji(n):
wji(n)= (j(t)yi(t))
=- (E(t)/wji(t)) (2.30)
Based on this relation, the following conclusions are made:
1. The current adjustment wji(n) represents the sum of an exponentially weighted time
series. For the time series to be convergent, the momentum constant must be
restricted to the range 0
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 23
2.5 Modes of Training One complete representation of the entire training set during the learning process is called
an epoch. The learning process is maintained on an epoch by epoch basis until the synaptic
weights and bias levels of the network stabilize and the averaged mean squared error over the
entire training set converges to some minimum value. It is good practice to randomize the
order of presentation of training examples from one epoch to the next.
For a given training set, back-propagation learning may proceed in one of two basic
ways:
1. Sequential mode: The sequential mode of back-propagation learning is also referred
to as on-line, pattern or stochastic mode. In this mode of operation weight updating is
performed after the presentation of each training example.
2. Batch mode: In the batch mode of back-propagation learning, weight updating is
performed after the presentation of all training examples that constitute an epoch. For
a particular, the cost function is defined as the averaged squared error of equations
(2.2) and (2.3):
Eav=1/2N (2.31)
where the error signal pertains to the output neuron j for the training example n.
For an on-line operational point of view, the sequential mode of training is
preferred over the batch mode because it requires less local storage for each synaptic
connection. Moreover, given that the patterns are presented to the network in a random
manner, the use of pattern by pattern updating of weights makes the search in weight space
stochastic in nature. This in turn makes it less likely for the back-propagation algorithm to be
trapped in a local minimum.
In the same way, the stochastic nature of the sequential mode makes it difficult to
establish theoretical conditions for convergence of the algorithm, In contrast, the use of batch
mode of training provides an accurate estimate of the gradient vector; convergence to a local
minimum is guaranteed under simple conditions.
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 24
2.6 Local Minima of Error Surface A peculiarity of the error surface (Fig 2.9) that impacts the performance of the back-
propagation algorithm is the presence of local minima in addition to global minima. It runs
the risk of being trapped in a local minimum where even small change in synaptic weights
increases the cost function. But somewhere else in the weight space there exists another set
of synaptic weights for which the cost function is smaller than the local minima in which the
network is stuck. It is clearly undesirable to have the learning process terminate at local
minima, especially if it is located far above a global minimum.
Figure 2.9: Error surface
2.7 Supervised Learning Viewed as an Optimization Problem In this case, the supervised training of a multilayer perceptron is viewed as a problem
in numerical optimization. The error surface of a multilayer perceptron is a highly nonlinear
function of the synaptic weight vector w. Let Eav(w) denote the cost function, averaged over
the training sample. Using the Taylor series expand Eav(w) about the current point on the
error surface w(n) as:
Eav(w(n)+w(n)) =
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 25
Eav(w(n))+gT(n)w(n)+0.5wT(n)H(n)w(n) (2.32)
where g(n) is the local gradient vector defined by:
g(n)=dEav(w)/dw, at w=w(n) (2.33)
and H(n) is the local hessian matrix defined by:
H(n)=d2Eav(w)/dw2, at w=w(n) (2.34)
The use of an ensemble-averaged cost function Eav(w) presumes a batch mode of
operation. The basic back-propagation algorithm adjusts the weights in the steepest descent
direction (negative of the gradient). This is the direction in which the performance function is
decreasing most rapidly. It turns out that, although the function decreases most rapidly along
the negative of the gradient, this does not necessarily produce the fastest convergence. In the
conjugate gradient type of algorithms a search is performed along conjugate directions,
which produces generally faster convergence than steepest descent directions.
In most of the training algorithms, a learning rate is used to determine the length of
the weight update (step size). In most of the conjugate gradient algorithms, the step size is
adjusted at the end of each iteration. A search is made along the conjugate gradient direction
to determine the step size, which minimizes the performance function along that line.
Fletcher-Reeves Update: All of the conjugate gradient algorithms start out by searching in
the steepest descent direction (negative of the gradient) on the first iteration:
po=-go (2.35)
A line search is then performed to determine the optimal distance to move along the current
search direction:
xk+1=xk+kpk (2.36)
Then the next search direction is determined so that it is conjugate to previous search
directions. The general procedure for determining the new search direction is to combine the
new steepest descent direction with the previous search direction:
Pk=-gk+kpk-1 (2.37)
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 26
The various versions of conjugate gradient are distinguished by the manner in which the
constant is computed. For the Fletcher-Reeves update the procedure is:
k=gTkgk /gTk-1gk-1 (2.38)
This is the ratio of the norm squared of the current gradient to the norm squared of the
previous gradient. Once the MLP has been trained, it can be tested on a database of images.
The output neuron which fires corresponds to the class of image.
2.8 Face Recognition Using Neural Networks The Cambridge ORL database consists of 400 frontal images of 40 people, i.e. 10
images of each person, of which 5 are used for training and the other 5 are used for testing
purposes. Since there are 40 people that need to be identified, the number of classes that a
neural network must distinguish is also 40. The pixel intensity vectors of the pre-processed
training images are obtained and the MLP is thus trained. The mode of training used is batch
mode, and using the back-propagation algorithm, in conjunction with Fletcher-Reeves
update, the optimum weights for MLP are obtained. In order to improve results, the inputs
are fed at random. Training is conducted for 2500 epochs.
2.9 Conclusions This chapter dealt with artificial neural networks. Specifically, the architecture of
multilayered feedforward neural networks is introduced. Supervised learning via the back
propagation algorithm is discussed along with the modes of training, the choice of learning
rate, and the potential pitfalls of getting trapped in local minima. Finally, the use of artificial
neural networks for face recognition is also introduced. The results are presented in a Chapter
6. A technique that is frequently used for analysis of images is that of principal components.
This is the topic of the next chapter.
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 27
CHAPTER III
Principal Components Analysis
Principal components analysis is a method of unsupervised learning. The main idea is
to discover significant patterns or features in the input data without an external aide. To do
so, the algorithm is provided with a set of rules of a local nature, which enables it to learn to
compute an input-output mapping with specific desirable properties; the term local implies
that the change applied to the synaptic weight of a neuron is confined to the immediate
neighborhood of that neuron.
Feature selection refers to a process whereby the given data is compressed to features
or patterns fewer in number than the given data; Thus, the data space is transformed into a
feature space and undergoes a dimensionality reduction. Evidently, this data compression is
lossy, and it can be shown that in the case of principal component analysis, the mean-square
of the resulting error equals the sum of the variances of the elements of that part of the data
vector that is eventually eliminated. Therefore one seeks a transformation that is optimum in
the mean squared sense; i.e., the eliminated component must have total variance that is less
than a predefined threshold. Principal components analysis computes the basis of a space
which represents the training vectors. These basis vectors are eigenvectors of a related
covariance matrix.
When applied to face recognition, one determines the principal components of the
distribution of faces treating the image as a point in a high dimensional space; i.e., the
eigenvectors of the covariance matrix of the set of face images are computed. These
eigenvectors referred to as eigenfaces in this context contain relevant discriminatory
information extracted from the images. Thus, characteristic features representing the
variation in the collection of faces are captured, and this information is used to encode and
compare individual faces. It is to be noted that the features (i.e., the eigenvectors) are ordered
with respect to the corresponding eigenvalues, and only those features that are significant (in
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 28
the sense of the value of the eigenvalues) are considered. Recognition is performed by
projecting a new image on to the subspace spanned by the eigenfaces and then classifying the
face by comparing its position in the face space with the position of known individuals.
Figure 3.1: (a) The 1st PC z1 is a minimum distance fit to a line in X space. (b) The 2nd PC z2 is a
minimum distance fit to a line in the plane perpendicular to the 1st PC.
3.1 Derivation of Principal Components (PC)
Given a sample of n observations on a vector of p variable:
x=(x1, x2,, xp) (3.1)
define the first principal component of the sample by the linear transformation:
z1= a1Tx= (3.2)
where the vector a1=(a11,a21,,ap1) is chosen such that var[z1] is maximum. Likewise, define
the kth PC of the sample by the linear transformation:
zk=akTx, k=1p (3.3)
where the vector ak=(a1k,a2k,,apk) is chosen such that var[z1] is maximum:
cov[zk,zl]=0, for k1, and akTak=1 (3.4)
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 29
To find a1 first note that:
var[z1] = (z12) (z1)2
= , where Sij=xixj=(xixj) - (xi)(xj)
= a1TSa1 (3.5)
where S is the is the covariance matrix for x.
To find a1 maximize var[z1], subject to a1Ta1=1. This constrained optimization problem can
be transformed into an unconstrained one by using the Lagrange multiplier . Thus, the
problem is reduced to maximization of a1TSa1 (a1Ta1 1). Differentiating with respect to
both the vector a1 and the Lagrange multiplier and setting the derivatives to zero one obtains
Sa1 a1 = 0
(S Ip)a1 = 0 (3.6)
Therefore a1 is an eigenvector of S corresponding to eigenvalue =1. Note that the following
has been maximized:
var[z1]= a1TSa1 =a1T1a1 = 1 (3.7)
So 1 is the largest eigenvalue of S. The first PC z1 retains the greatest amount of variation in
the sample. An example is illustrated in Fig. 3.1(a). To find the next coefficient vector a2
maximize var[z2] subject to:
cov[z2,z1]=0 and a1Ta1 =1 (3.8)
First note that:
cov[z2,z1]= a1TSa2= 1a1Ta2 (3.9)
then let and be Lagrange multipliers, and maximize a2TSa2 (a2Ta2 1) - a2Ta1. It is
found that find that a2 is also an eigenvector of S whose eigenvalue =2 and is the second
largest, as illustrated in Fig. 3.1(b) in the two dimensional case. In general:
var[zk]= akTSak =k (3.10)
The kth largest eigenvalue of S is the variance of the kth PC. The kth PC zk retains the kth greatest fraction of the variation in the sample.
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 30
Given a sample of n observations on a vector of p variables x=(x1, x2,, xp), define a vector
of p PCs z=(z1, z2,, zp), according to Z=ATX, where A is an orthogonal p x p matrix whose
kth column is the kth eigenvector ak of S. Then = ATSA is the covariance matrix of the PCs; the matrix is diagonal with elements ij=ij ij.
3.2 Face Recognition using Eigenfaces
Let a face image (x,y) be a two-dimensional N by N array of intensity values. By stacking the columns of this array into a single vector, an image may also be considered as a
vector of dimension 2N . Thus, a typical image of size 256 by 256 becomes a vector of
dimension 65,536, or equivalently, a point in 65,536-dimensional space. An ensemble of
images, then, maps to a collection of points in this vector space of a relatively large
dimension.
Images of faces, being similar in overall configuration, will not be randomly
distributed in this huge image space and thus can be described by a relatively low
dimensional subspace. The main idea of the principal component analysis is to find a set of
basis vectors that best accounts for the distribution of face images within the entire image
space. These vectors define the subspace of face images, which is generally referred to as the
face space. Each vector is of length 2N , describes an N by N image, and is a linear
combination of the original face images. Because these vectors are the eigenvectors of the
covariance matrix corresponding to the original face images, and because they are face-like
in appearance, they are typically referred to as eigenfaces.
As described earlier, a 2-D facial image can be represented as 1-D vector by
concatenating each row (or column). Let the training set of face images be M ,,, 21 K .
The average face of the set is defined by ==
M
nnM 1
1 . Each face differs from the average by
the vector = nn . This set of very large vectors is then subject to principal component analysis, which seeks a set of M orthonormal vectors n which best describes the distribution of the data. The kth vector k is chosen such that:
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 31
=
=M
nn
Tkk M 1
2)(1 (3.11)
is a maximum, subject to:
==
otherwisekl
kTl ,0
,1 (3.12)
The vectors k and scalars k are the eigenvectors and eigenvalues, respectively, of the covariance matrix:
=
==M
n
TTnn AAM
C1
1
(3.13)
where the matrix ]...[ 21 MA = . The matrix C is however 2N by 2N , and determining the 2N eigenvectors and eigenvalues is an intractable task for typical image sizes. A
computationally feasible method is needed to find these eigenvectors.
If the number of data points in the image space is less than the dimension of the space
( 2NM < ), there will be only M , rather than 2N , meaningful eigenvectors in the sense that the remaining eigenvectors are associated with zero eigenvalues. Accordingly, it is
computationally better to determine the eigenvalues and eigenvectors of much smaller
matrix. For example, in the situation outlined earlier, one computes the eigenvalues and the
corresponding eigenvectors of an M x M matrix as opposed to a 65,536 x 65,536 matrix.
Consider the eigenvectors n of AAT such that:
nnnT AA = (3.14)
Premultiplying both sides by A,
nnnT AAAA = (3.15)
from which it can be inferred that nA are the eigenvectors of TAAC = . Following this analysis, construct the M by M matrix AAL T= , where nTmmnL = , and find the M eigenvectors n of L. These vectors determine linear combinations of the M training set face images to form the eigenfaces n :
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 32
MnA nM
kknkn ,......,1,
1
=== =
(3.16)
With this analysis the calculations are greatly reduced, from the order of the number of pixels
in the images ( 2N ) to the order of the number of images in the training set (M). In practice,
the training set of face images will be relatively small ( 2NM < ), and the calculations become quite manageable. The value of the associated eigenvalues allows the ordering of the
eigenvectors according to their usefulness in characterizing the variation among the images.
It is to be emphasized at this juncture that such a ordering is possible as all the eigenvalues of
AAT (and hence TAA ) are positive due to the sign-definiteness of these matrices.
The purpose of PCA is to reduce the large dimensionality of the data space (observed
variables) to the smaller intrinsic dimensionality of feature space (independent variables),
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 33
which are needed to describe the data economically. This is the case when there is a strong
correlation between observed variables. Therefore, the goal is to find a set of eis which have
the largest possible projection onto each of the wis. The eigenvectors corresponding to
nonzero eigenvalues of the covariance matrix produce an orthonormal basis for the subspace
within which most image data can be represented with a small amount of error. The
eigenvectors are sorted from high to low according to their corresponding eigenvalues. The
eigenvector associated with the largest eigenvalue is one that reflects the greatest variance in
the image. That is, the smallest eigenvalue is associated with the eigenvector that finds the
least variance. It has been the experience that the usefulness of the eigenvectors decreases in
exponential fashion; i.e., roughly 90% of the total variance is contained in the first 5% to
10% of the dimensions.
3.3 Euclidean Distance Method for Classification The simplest method for determining which face class provides the best description of
an input facial image is to find the face class k that minimizes the Euclidean distance k=|-
k|, where k is a vector describing the kth face class. If k is less than some predefined
threshold , a face is classified as belonging to the class k.
Note: Eigenfaces, once obtained, can also be used to train a multilayer perceptron for
classification purposes. This type of training incorporates both supervised and unsupervised
learning. Once trained, it can be tested on a database of images.
3.4 Conclusions In this chapter the eigenface approach to face recognition is dealt with. This is the
first statistics-based face recognition technique to have been proposed by researchers. A
principal advantage of this is that it is amenable to real-time face recognition. The eigenface
approach followed in this project uses principal components and nearest-neighbour
classification technique. In this next chapter, face recognition using support vector machines
is presented.
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 34
CHAPTER IV
Support Vector Machines
Support vector machines are universal feedforward networks, which can be used for pattern
classification and non-linear regression. Basically the support vector machine is a linear
machine with some suitable properties. To explain how it works, it is perhaps easiest to start
with the case of separable patterns that could arise in the context of pattern classification. In
this context, the main idea of a support vector machine is to construct a decision surface,
called a hyperplane, in such a way that the margin of separation between positive and
negative examples is maximized.
The machine achieves this desirable property by following a principled approach
rooted in statistical learning theory. More precisely, support vector machine is an
approximate implementation of the method of structural risk minimization. The induction
principle is based on the fact that the error-rate of a learning machine on test data is bounded
by the sum of the training-error rate and a term that depends on the Vapnik-Chervonenkis
(VC) dimension; in the case of separable patterns, a support vector machine produces a value
of zero for the first term and minimizes the second term. Accordingly, the support vector
machine can provide a good generalization on pattern classification problems despite the fact
that it does not incorporate problem-domain knowledge. This attribute is unique to support
vector machines.
A notion that is central to the construction of the support vector learning algorithm is
the inner-product kernel between a support vector xi and the vector x drawn from the input
space. The support vectors consist of a small subset of the training data extracted by the
algorithm. Depending on how this inner-product kernel is generated, construction of different
learning machines is characterized by their respective nonlinear decision surfaces. In
particular, the support vector learning algorithm is used to construct the following three types
of learning machines:
Polynomial learning machines
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 35
Radial-basis function networks Two-layer perceptrons with a single hidden layer.
That is, for each of these feedforward networks the support vector learning algorithm is used
to implement the learning process using a given set of training data, automatically
determining the required number of hidden units. Stated in another way, whereas the back-
propagation algorithm is devised specifically to train a multilayer perceptron, the support
vector learning algorithm is of a more generic nature because it has wider applicability.
Support Vector machines are rooted in statistical learning theory which gives a family
of bounds which govern the learning capacity of the machine:
R(alpha) = 0.5|y-f(x,alpha)|.dP(x,y) (4.1)
Properties of the bound:
1. It is independent of P(x,y). It assumes only that both the training data and the test data
are drawn independently according to some P(x,y)
2. It is usually not possible to compute the left hand side.
3. If h is known, the right hand side can be easily computed.
Thus given several different learning machines, and choosing a fixed, sufficiently small ,
that machine is chosen which gives the lowest upper bound on the actual risk. This gives a
principled method for choosing a learning machine for a given task, and is the essential idea
of structural risk minimization.
The general structure of SVM is as shown in Fig. 4.1.
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 36
Figure 4.1: General structure of SVM
4.1 Basics Let X Rn denote the possible input to the Support Vector Machine. These inputs are pixel
intensity vectors obtained after image pre-processing. An SVM is trained with images
belonging to two classes at a time. A single machine is not efficient at encoding multiple
classes. Assume that a point x X is associated with one of two possible classes, denoted -1
and +1. This means that it is sufficient to have the output Yestimate={-1,1} from the classifier.
Consider two stochastic variables, representing the point denoted X, Y representing class
label. These variables then give rise to the following conditional distributions, p(X=x|Y=1),
p(X=x |Y=-1). There are two phases concerning support vector machines, first training and
then classifying. By letting f: X -> Yestimate represent the discriminating function, the task
during training is to minimize the probability of returning wrong classification for all
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 37
elements included in the training set. This means that for the training set Yestimate should be
identical to Y in as many cases as possible.
During training, SVMs solve this problem by simply finding a function f, which for
every point xi=[xi1, xi2, , xin]T, with corresponding class label yk, in the training set
S=((x1,y1),,(xl,yl)) has the following property, f(xk)0, if yk=1 and f(xk)
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 38
To formulate this, suppose that all the training data satisfy the following constraints:
w.xi + b +1 for yi = +1 (4.3.a)
w.xi + b -1 for yi = -1 (4.3.b)
These can be combined into one set of inequalities:
yi(w.xi + b) 1 0, i=1,,N (4.3)
Now consider the points for which the equality in (4.3.a) holds. These points are the support
vectors which lie on the hyperplane H1: w.xi + b = +1, with normal w and perpendicular
distance from the origin |1 - b|/||w||. Similarly, the points for which the equality in (4.3.b)
holds are also the support vectors which lie on the hyperplane H2: w.xi + b = -1 with
normal again w and perpendicular distance from the origin |-1-b|/||w||. Hence d+ = d- = 1/||w||
and the margin is simply 2/||w||. Note that H1 and H2 are parallel as they have the same
normal and that no training points fall between them. Thus, the pair of hyperplanes which
gives the maximum margin by minimizing the objective function 0.5||w||2 subject to
constraints in equation can be found. Thus the solution for a typical two dimensional case is
expected to have the form shown in Figure 4.2. Those training points for which the equality
in (4.3) holds (i.e. those which wind up lying on one of the hyperplanes H1, H2) and whose
removal would change the solution found, are called support vectors; they are indicated in
Figure 4.2 by the extra circles.
Fig. 4.2: Linear separating hyperplanes for the separable case. The support vectors are circled.
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 39
4.3 Quadratic Optimization for Finding the Optimal Hyperplane A Lagrangian formulation is used to obtain the solution. There are two reasons for
doing this. The first is that the constraints variables are replaced by constraints on the
Lagrange multipliers themselves, which are much easier to handle. The second is that in this
reformulation of the problem, the training data appears (in the actual training and test
algorithms) in the form of dot products between vectors. This is a crucial property which will
allows one to generalize the procedure to the nonlinear case.
Thus, introducing positive Lagrange multipliers i, i=1.N; one for each of the
inequality constraints in equation. Recall that the rule is that for constraints of the form ci0,
the constraint equations are multiplied by positive Lagrange multipliers and subtracted from
the objective function, to form the Lagrangian. For equality constraints, the Lagrange
multipliers are unconstrained. This gives Lagrangian:
LP = 0.5||w||2 yi(w.xi + b) + (4.4)
The solution to the constrained optimization problem is determined by the saddle
point of LP, which must be minimized with respect to w, b, and simultaneously require that
the derivatives of LP with respect to all the i vanish, all subject to the constraints i0.
Suppose that the set of constraints is referred to as C1. Now this is a convex quadratic
programming problem, since the objective function is itself convex, and those points which
satisfy the constraints also form a convex set. This means that following dual problem can
be equivalently solved: maximize LP, subject to the constraints that the gradient of LP with
respect to w and b vanish, and subject also to the constraints i0. Suppose that this set of
constraints is referred to as C2. This particular dual formulation of the problem is called the
Wolfe dual. It has the property that the maximum of LP, subject to constraints C2, occurs at
the same values of w, b and , as the minimum of LP, subject to constraints C1.
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 40
Requiring that the gradient of LP with respect to w and b vanish, gives the conditions:
1. w = yixi (4.5)
2. yi = 0 (4.6)
Since these are equality constraints in the dual formulation substituting them into equation to
give:
LD = 0.5 jyiyjxi.xj (4.7)
Maximize (4.7) subject to constraints:
1. yi = 0 (4.8)
2. i0 (4.9)
Note that there are different labels for Lagrangian (LP for primal, LD for dual), to emphasize
that the two formulations are different: LP and LD arise from the same objective function but
with different constraints, and the solution is found by minimizing LP or by maximizing LD.
Note also that if the problem is formulated with b=0, which amounts to requiring that all
hyperplanes contain the origin, the constraint (4.8) does not appear.
Support vector training (for the separable, linear case) therefore amounts to
maximizing LD with respect to i subject to constraints (4.8) and positivity of i with
solution given by (4.5). Notice that there is a Lagrange multiplier i for every training point.
In the solution, those points for which i>0 are called support vectors, and lie on one of the
hyperplanes H1, H2. All other training points have i=0 and lie on that side of H1 or H2 such
that the strict inequality in Equation (4.3) holds. For these machines, the support vectors are
the critical elements of the training set. They lie closest to the decision boundary, if all other
training points are removed or moved around, but so as not to cross H1 or H2, and training
repeated, the same separating hyperplane is obtained.
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 41
4.4 The Karush-Kuhn-Tucker Conditions The Karush-Kuhn-Tucker (KKT) conditions play a central role in both the theory and
practice of constrained optimization. For the primal problem described in the previous
section, the KKT conditions may be stated as:
(4.10)
The solution vector w is defined in terms of an expansion that involves the N training
examples. Note, however, that although this solution is unique by virtue of convexity of the
Lagrangian, the same cannot be said about the Lagrange coefficients, i. It is important to
note that at the saddle point, for each i, the product of that multiplier with the corresponding
constraint vanishes, as shown by (4.10). Therefore, only those multipliers exactly meeting the
above equation can assume nonzero values.
Having determined the optimal Lagrange multipliers, o,i, computing the optimum
weight vector as:
wo = yi xi, where Ns is the support vectors (4.11)
The optimum bias can be calculated as:
bo = 1 wo.x(s) for d(s)=1 (4.12)
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 42
4.5 Linear Support Vector Machines: The Non-separable Case The margin of separation between classes is said to be soft if a data point (xi,yi)
violates the following condition:
yi(w.xi + b) + 1, i=1,,N
This violation can arise in one of two ways, as shown in Figure 4.3:
The data point falls inside the region but on the right hand side of the decision surface.
The data point falls on the wrong side of the decision surface, resulting in misclassification.
This can be rectified by introducing positive slack variables i, i=1,,N, in the constraints
which then become:
w.xi + b +1 - i for yi =+1
w.xi + b -1 + i for yi =-1
These can be combined into one set of inequalities:
yi(w.xi + b) 1 - i, i=1,,N; where the slack variables 0 for all i. (4.13)
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 43
Fig. 4.3: Linear separating hyperplanes for the non-separable case.
Thus, for an error to occur, the corresponding i must exceed unity, so i is an upper bound
on the number of training errors. Hence a natural way to assign an extra cost for errors is to
change the objective function to be minimized to 0.5||w||2 to 0.5||w||2 + C(i)k, where C is a
parameter to be chosen by the user, a larger C corresponding to assigning a higher penalty to
errors. As before minimizing the first term is related to minimizing the VC dimension of the
support vector machine. As for the second term, it is an upper bound on the number of test
errors. Formulation of the cost function is therefore in perfect accord with the principle of
structural risk minimization.
The parameter C controls the trade-off between the complexity of the machine and
the number of non-separable points; it may therefore be viewed as a form of a
regularization parameter. The parameter C has to be selected by the user. This can be done
in one of two ways:
The parameter C is determined experimentally via the standard use of a training/(validation) test set, which is a crude form of resampling.
It is determined analytically by estimating the VC dimension and then using the bounds thus determined.
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 44
Maximize:
LD = 0.5 jyiyjxi.xj (4.14)
Subject to the constraints:
1. 0 i C (4.15)
2. yi = 0 (4.16)
Having determined the optimal Lagrange multipliers, o,i, computing the optimum weight
vector as:
wo = yi xi (4.17)
The optimum bias can be calculated as:
bo = 1 wo.x(s) for d(s)=1 (4.18)
The Karush-Kuhn-Tucker conditions are needed for the primal problem. The primal
Lagrangian is:
LP = 0.5||w||2 + C - {yi(w.xi + b)1+i} - i (4.19)
where i are the Lagrange multipliers introduced to enforce positivity of i. The KKT
conditions for the primal problem are therefore:
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 45
(4.20) and (4.21)
The optimum bias b0 is determined by taking any data point in the training set for which 0
0,i C and therefore i=0 and using that data point in the KKT conditions. However, from a
numerical perspective it is better to take the mean value of b0 resulting from all such data
points in the sample.
4.6 Nonlinear Support Vector Machines Basically the idea of a nonlinear support vector machine hinges on two mathematical
operations summarized here.
Nonlinear mapping of an input vector into a high-dimensional feature space that is hidden from both input and output.
Construction of an optimal hyperplane for separating the features discovered in previous step.
An important question at this juncture is whether or not the methods discussed in earlier
sections can be generalized to the case where the decision function is not a linear function of
the data, or, in other words, when the input space is made up of nonlinearly separable
patterns. Covers theorem states that such a multi-dimensional space maybe transformed into
a new feature space where the patterns are linearly separable with high probability, provided
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 46
two conditions are satisfied. First, the transformation is nonlinear. Second, the dimensionality
of the feature space is high enough. The next operation exploits the idea of building an
optimal separating hyperplane in accordance with the theory described, but with a
fundamental difference: the separating hyperplane is now defined as a linear function of
vectors drawn from the feature space rather than the original input space.
4.6.1 Inner Product Kernel Observe that the data appears in the training problem only in the form of dot products,
xi.xj. Now suppose the data is first mapped to some other (possibly infinite dimensional)
Euclidean space H, using a mapping : Rd to H. Then of course the training algorithm would
only depend on the data through dot products in H, i.e. on functions of the form (xi). (xj).
Now if there were a kernel function K such that K(xi,xj)= (xi). (xj) only K need to be
used in the training algorithm, and would never need to explicitly even know what is. One
example is: K(xi,xj)=exp(-0.5||xi-xj||2/2*sigma2). In this particular example, H is infinite
dimensional, so it would not be very easy to work with explicitly. However, if one replaces
xi.xj by K(xi,xj) everywhere in the training algorithm, the algorithm produces a support vector
machine in an infinite dimensional space, and furthermore do so in roughly the same amount
of time it would take to train on the unmapped data.
Suppose that the data belongs to a space denoted L. (The notation L is a mnemonic
for low-dimensional, as is the notation for the range space H for high-dimensional.)
Evidently, a map is not necessarily onto; i.e., there need exist an element in L that is
mapped onto a specific element in H.
Let j(x), j=1,,M; where M is the number of hidden units, denote a set of nonlinear
transformations, define a hyperplane acting as the decision surface as follows: j j(x)
+ b=0, where w defines the vector of linear weights connecting the feature space to the
output space, and b is the bias. The quantity j(x) represents the input supplied to the weight
wj via the feature space. In effect, the vector (x), represents the image induced in the
-
Face Recognition using Artificial Neural Network
Dept. of ECE Jan June 09 Page 47
feature space due to the input vector x. Thus, in terms of this image define the decision