
VISVESVARAYA TECHNOLOGICAL UNIVERSITY

    BELGAUM-590014

    Project Report

    On

Face Recognition using Artificial Neural Networks

Submitted in partial fulfilment of the requirement for the award of the degree of

    BACHELOR OF ENGINEERING

    In

    ELECTRONICS & COMMUNICATION

    By

Prashanth H (1PI05EC072)

Namit Chauhan (1PI05EC057)

Under the Guidance of

    Prof. Koshy George

    PES Centre for Intelligent Systems; Dept. of Telecommunication Engineering

    DEPARTMENT OF ELECTRONICS AND COMMUNICATION

PES INSTITUTE OF TECHNOLOGY, BANGALORE – 560 085

2008–2009

PES INSTITUTE OF TECHNOLOGY, BANGALORE – 560 085

    CERTIFICATE

Certified that the project work entitled "Face Recognition Techniques using Neural Networks" is a bona fide work carried out by Prashanth H and Namit Chauhan, bearing USNs 1PI05EC072 and 1PI05EC057 respectively, in partial fulfilment for the award of the degree of BACHELOR OF ENGINEERING in ELECTRONICS & COMMUNICATION of Visvesvaraya Technological University, during the academic year 2008-2009. It is certified that all corrections/suggestions indicated for internal assessment have been incorporated in the report. The project report has been approved as it satisfies the academic requirements with respect to the project work prescribed for the mentioned degree.

    Dr Koshy George PES Centre for Intelligent Systems; Dept. of Telecom. Engg., PESIT

    Dr A. Vijaykumar HOD,

    Dept. of E&C Engg., PESIT

    Dr K. N. B. Murthy Principal PESIT

    Prashanth

    1PI05EC072

    Namit Chauhan

    1PI05EC057

    External Viva

    Name of the Examiners Signature with Date

    1._____________________ _____________________

    2. _____________________ _____________________

ACKNOWLEDGEMENTS

We would like to express our sincere thanks to Dr. Koshy George, Professor, TE Department, PESIT, Bangalore, for his kind and constant support and guidance during the course of this project, without which the project would not have been a success.

We would like to thank and extend our heartfelt gratitude to Dr. A. Vijaya Kumar, HOD of the Department of Electronics and Communication, for his kind, inspiring and illustrious guidance and ample encouragement.

We would like to thank Prof. D. Jawahar, CEO, PES Group of Institutions, and Dr. K. N. Balasubramanya Murthy, Director & Principal, PESIT, for the valuable resources provided for the completion of the project.

We would like to express our sincere thanks and deep sense of gratitude to our parents for their everlasting support.

And finally, it gives us immense pleasure to thank our friends and everyone who has been instrumental in the successful completion of the project.

Table of Contents

1 Face Recognition: An Introduction
  1.1 Classification
    1.1.1 Applications of Face Recognition
    1.1.2 Adopted Approach
  1.2 Literature Survey
    1.2.1 Existing Algorithms
    1.2.2 Existing Techniques
  1.3 Organization of the Report
2 Neural Networks: An Introduction
  2.1 Multilayer Perceptrons
  2.2 Derivation of the Error Back-Propagation Algorithm
  2.3 Activation Function
  2.4 Rate of Learning
  2.5 Modes of Training
  2.6 Local Minima of the Error Surface
  2.7 Supervised Learning Viewed as an Optimization Problem
  2.8 Face Recognition using Neural Networks
  2.9 Conclusions
3 Principal Components Analysis: An Introduction
  3.1 Derivation of Principal Components
  3.2 Face Recognition using Eigenfaces
  3.3 Euclidean Distance Method for Classification
  3.4 Conclusions
4 Support Vector Machines: An Introduction
  4.1 Basics
  4.2 Linear Support Vector Machines: The Separable Case
  4.3 Quadratic Optimization for Finding the Optimal Hyperplane
  4.4 The Karush-Kuhn-Tucker Conditions
  4.5 The Non-Separable Case
  4.6 Nonlinear Support Vector Machines
    4.6.1 Inner-Product Kernels
    4.6.2 Mercer's Theorem
    4.6.3 Summary of Inner-Product Kernels
  4.7 Multiple-Class Support Vector Machines
  4.8 Binary Search Tree
  4.9 SVMs for Face Recognition
  4.10 Conclusions
5 Image Preprocessing
  5.1 Histogram Equalization
  5.2 Median Filtering
  5.3 Bi-Cubic Interpolation
  5.4 Conclusions
6 Results
  6.1 MLP
  6.2 PCA
  6.3 SVMs
  6.4 PCA+MLP
7 Conclusions and Future Work
8 References
Appendix

List of Figures

Figure 1.1: Applications of Face Recognition
Figure 1.2: Supervised Learning
Figure 1.3: Unsupervised Learning
Figure 1.4: (a) Good Generalization, (b) Bad Generalization
Figure 2.1: Block diagram of the nervous system
Figure 2.2: Single-Layer Feedforward Networks, Multilayer Feedforward Networks, Recurrent Networks
Figure 2.3: Nonlinear model of a neuron
Figure 2.4: MLP with one hidden layer
Figure 2.5: Signal-flow graph highlighting the details of output neuron j
Figure 2.6: Signal-flow graph highlighting the details of output neuron k connected to hidden neuron j
Figure 2.7: Signal-flow graphical summary of back-propagation learning. Top part of the graph: forward pass. Bottom part of the graph: backward pass.
Figure 2.8: Hyperbolic tangent function
Figure 2.9: Error surface
Figure 3.1: (a) The 1st PC z1 is a minimum-distance fit to a line in X space. (b) The 2nd PC z2 is a minimum-distance fit to a line in the plane perpendicular to the 1st PC.
Figure 3.2: Eigenfaces
Figure 4.1: General structure of an SVM
Figure 4.2: Linear separating hyperplane for the separable case
Figure 4.3: Linear separating hyperplane for the non-separable case
Figure 4.4: Mapping into a higher-dimensional space
Figure 4.5: Decision surfaces of kernels
Figure 4.6: Decision surface using one SVM
Figure 4.7: Problem encountered when one SVM is used for multi-class classification
Figure 4.8: Multi-class classification using binary classifiers
Figure 4.9: Binary tree structure for 8 classes
Figure 5.1: Image after histogram equalization
Figure 5.2: Median filtering
Figure 6.1: Face recognition accuracy with multilayer feedforward networks with histogram equalization and median filtering
Figure 6.2: Face recognition accuracy with multilayer feedforward networks with only median filtering
Figure 6.3: Face recognition accuracy with multilayer feedforward networks with no pre-processing
Figure 6.4: Face recognition accuracy with PCA with histogram equalization and median filtering
Figure 6.5: Face recognition accuracy with PCA with only median filtering
Figure 6.6: Face recognition accuracy with PCA and no pre-processing
Figure 6.7: Face recognition accuracy with PCA+MLP with histogram equalization and median filtering
Figure 6.8: Face recognition accuracy with PCA+MLP with only median filtering


    CHAPTER I

    Face Recognition: An Introduction

    Human beings have recognition capabilities that are unparalleled in the modern

    computing era. These are mainly due to the high degree of interconnectivity, adaptive nature,

    learning skills and generalization capabilities of the nervous system. The human brain has

    numerous highly interconnected biological neurons which, on some specific tasks, can

    outperform super computers. A child can accurately identify a face, but for a computer it is a

    cumbersome task. Therefore, the main idea is to engineer a system which can emulate what a

child can do. Advancements in computing capability over the past few decades have enabled such engineered systems to achieve comparable recognition capabilities. Early face recognition algorithms used simple geometric models, but the recognition process has since matured into a science of sophisticated mathematical representations and

    matching processes. Major advancements and initiatives have propelled face recognition

    technology into the spotlight.

Face recognition technology can be used in a wide range of applications. (A few

    example applications are shown in Fig 1.1.) An important aspect is that such technology

should be able to deal with various changes in face images, such as rotation and changes in expression. Surprisingly, the mathematical variations between images of the same face

    due to illumination and viewing direction are almost always larger than image variations due

    to changes in face identity. This presents a great challenge to face recognition. At the core,

    two issues are central to successful face recognition algorithms: First, the choice of features

    used to represent a face. Since images are subject to changes in viewpoint, illumination, and

    expression, an effective representation should be able to deal with these possible changes.

    Secondly, the classification of a new face image using the chosen representation.


Face recognition can be of two types:

1. Feature-based (geometric)
2. Template-based (photometric)

    In geometric or feature-based methods, facial features such as eyes, nose, mouth, and chin

    are detected. Properties and relations such as areas, distances, and angles between the

    features are used as descriptors of faces. Although this class of methods is economical and

    efficient in achieving data reduction and is insensitive to variations in illumination and

    viewpoint, it relies heavily on the extraction and measurement of facial features.

    Unfortunately, feature extraction and measurement techniques and algorithms developed to

    date have not been reliable enough to cater to this need.

    In contrast, template matching and neural methods generally operate directly on an

    image-based representation of faces, i.e., pixel intensity array. Because the detection and

    measurement of geometric facial features are not required, this type of method has been more

    practical and easier to implement when compared to geometric feature-based methods.

    Figure 1.1: Applications of face recognition.


1.1 Classification

Face recognition is essentially the ability of a machine to successfully categorize a set of images based on certain discriminatory features. Classification or pattern recognition can

    be a very difficult problem and is still a very active field of research. The aim of the current

    project can be considered as an attempt to simulate the recognition capability of the human

    brain. A human being has the ability to put a certain scenario in a context and identify its

    components. Of course it is not entirely obvious how to make the machine discriminate

    between elements of different objects of different classes. The classification task may be

    mathematically represented as the mapping:

$f : A \times B \to C$   (1.1)

    where A represents the elements that have to be classified, i.e. the pixel intensity vectors,

    using B as the function parameters. The output C has to discriminate between images of

    different classes so that the class of each element can be determined. Classification by

    machines requires two steps: First, the properties which distinguish an element of one class

    from that of another class have to be identified. Secondly, the machine has to be trained to

    know how to discriminate between the classes by defining a learning model. The learning

    model describes the procedure that has to be used for the actual training. Basically there are

    two types of training often referred to as supervised and unsupervised learning.

    Supervised Learning: When dealing with supervised learning, the trainer feeds the machine a sample and the machine classifies it. The trainer then tells the machine

    whether it has classified the sample correctly or not. In the case that it has

    misclassified the sample, the machine has to adjust its classifier parameters to better

classify the given sample. Supervised learning is illustrated in Fig. 1.2. Examples of such classifiers include the Bayes classifier and neural networks. Another important aspect

    involved is the method of training imparted to these learning machines. It is important

to have a training set that is representative of the given classes so that future

    classification can be successful.


    Figure 1.2: Supervised Learning

Unsupervised Learning: In unsupervised or self-organized learning there is no external trainer to oversee the learning process. Rather, a task-independent measure of

    the quality of classification is computed, and the free parameters of the network are

    optimized with respect to that measure. Once the network has become tuned to the

    statistical regularities of the input data, it develops the ability to form internal

    representations for encoding features of the input and thereby create new classes

    automatically. One technique for this is called clustering. In this case the trainer has

    nothing to do with the classification. Instead the learning machine determines which

    elements should belong to the same class and assigns a class accordingly.

Unsupervised learning is shown in Fig. 1.3.

    Figure 1.3: Unsupervised Learning

    It is to be noted that over-training leads to poor generalization. For example, an over-

    trained machine would return different classes for the same image subjected to different

modifications. As illustrated in Fig. 1.4, over-training essentially amounts to memorizing the data, which leads to poor generalization.


    Figure 1.4: (a) Good Generalization, (b) Bad Generalization

Face recognition is a very complex form of pattern recognition. It consists of classifying highly ambiguous, high-dimensional input signals and matching them with known signals. The following problems occur:

    The intrinsic 2D structure of an image matrix is more often than not removed. Consequently, the spatial information stored therein is discarded and not effectively

    utilized.

Curse of Dimensionality: Each image sample is typically modeled as a point in a high-dimensional space. Consequently, a large number of training samples are often


needed to obtain reliable and robust estimates of the characteristics of the data distribution.

    Usually very limited amounts of data are available in real applications such as face recognition, image retrieval, and image classification.

    A number of ways have been proposed to solve these problems. Finding an effective means

    to reduce the dimensionality is the first step in face recognition.

1.1.1 Applications of Face Recognition

Some of the applications are as follows:

Identifying a face within a large database of faces: In this approach the system returns a list of possible matches from the database. Applications include crowd surveillance, video content indexing, personal identification (for example, driver's licenses), mug shot matching, etc.

    Real time face recognition: Here, face recognition is used to identify a person on the spot and grant access to a building or a compound, thus avoiding security hassles. In

this case the face is compared against multiple training samples of a person.

1.1.2 Adopted Approach

Face recognition is done naturally by humans. However, developing a computer

    algorithm to achieve the same purpose is a rather difficult task. Starting with images, the aim

    is to distinguish between faces of different people. One class of methods presupposes the

    existence of certain features in the image, i.e. eyes, nose, mouth, hair, and an algorithm is

    devised to find and characterize these features. A second class of methods also assumes that

    there are features to be found but does not predefine what these features are or how to

    measure them. The approach followed in this project belongs to the latter class of methods

since it is much less restrictive than the former. Specifically, given a set of training images, the strategy is to develop algorithms that construct a set of basis images that best approximates the given image data set. The training set is then projected onto this

    subspace. To query a new image, its projection onto this subspace is determined, and a

    training image whose projection is closest to that of the new image is sought.


1.2 Literature Survey

Several algorithms and techniques for face recognition have been developed in the past

    by researchers. These are discussed briefly in this section.

    1.2.1 Existing Algorithms

    Face Recognition Based on Principal Component Analysis: Principal Component

    Analysis (PCA) is a well known algorithm used in face recognition. The basic idea in PCA is

    to determine a vector of much lower dimension that best approximates in some sense a given

data vector. Thus, in face recognition, it takes an s-dimensional vector representation of each face in a training set of images as input, and determines a t-dimensional subspace whose basis vectors capture the maximum variance of the original images. The dimension of this new subspace is lower than that of the original one (t < s).


    Elastic Bunch Graph Matching: All human faces share a similar topological structure.

Faces are represented as graphs, with nodes positioned at fiducial points (eyes, nose, etc.)

    and edges labeled with 2-D distance vectors. Each node contains a set of 40 complex Gabor

    wavelet coefficients at different scales and orientations (phase, amplitude). They are called

    "jets". Recognition is based on labeled graphs. A labeled graph is a set of nodes connected by

edges; each node is labeled with a jet, and each edge is labeled with a distance [20].

    Kernel Methods: The face manifold in subspace need not be linear. Kernel methods are a

    generalization of linear methods. Direct non-linear manifold schemes are explored to learn

    this non-linear manifold [4],[5],[10],[11].

    Trace Transform: The Trace transform, a generalization of the Radon transform, is a new

    tool for image processing which can be used for recognizing objects under transformations,

    e.g. rotation, translation and scaling. To produce the Trace transform one computes a

    functional along tracing lines of an image. Different Trace transforms can be produced from

an image using different trace functionals [21],[22].

1.2.2 Existing Techniques

Eigenfaces: Many approaches to the overall face recognition problem have been devised

    over the years, but one of the most accurate and fastest ways to identify faces is to use the so-

    called eigenface technique. This uses a combination of linear algebra and statistical analysis

to generate a set of basis faces, the eigenfaces, against which inputs are tested. The eigenface approach mostly uses PCA and ICA tools for face recognition, with varying efficiency. Although ICA can significantly outperform standard PCA, it has been claimed that the

    performance of ICA strongly depends on its involved PCA process. The pure ICA projection

    has little effect on the performance of face recognition.


    Range Imaging: Range imaging is the name for a collection of techniques, used to produce

    a 2D image showing the distance to a set of points in a scene from a specific point, normally

    associated with some type of sensor device. The resulting image, the range image, has pixel

    values which correspond to the distance, e.g., brighter values mean shorter distance, or vice

versa. If the sensor which is used to produce the range image is properly calibrated, the pixel

    values can be given directly in physical units such as centimeters [23].

Line edge map: This is an image-based face recognition algorithm that uses a set of random rectilinear

    line segments of 2D face image views as the underlying image representation, together with a

    nearest-neighbor classifier as the line-matching scheme. The combination of 1D line

    segments exploits the inherent coherence in one or more 2D face image views in the viewing

    sphere [24].

    Neural Network based Face Recognition Techniques: Neural networks are used to create

    the face database and recognize the face. A separate network for each person is built. The

    input face is projected onto the eigenface space first to get a new descriptor. This descriptor

    is used as network input and applied to each person's network. The one with maximum

    output is selected and reported as the host if it is larger than a predefined recognition

    threshold [3],[1],[13],[12].

    Gabor wavelet networks (GWN): The Gabor wavelet network is used for an effective

    object representation. The Gabor wavelet network has several advantages such as invariance

to some degree with respect to translation, rotation and dilation. Furthermore, it has the ability to generalize and to abstract from the training data.


Sparse Representation: A sparse representation of the input image is computed via ℓ1-minimization, and this sparse code is then compared with the sparse codes of the training data. Many other techniques have been tried for facial recognition, but among all of them, the eigenface

technique has shown the fastest and most accurate results. In recent times, researchers have begun to focus on the Human

Vision System (HVS) for face recognition. It has not yet been implemented in practical face recognition systems but has laid the foundation for future studies [25].

1.3 Organization of the Report

The report is organized as follows: The basics of neural networks and the learning

    processes involved in neural networks are discussed in Chapter 2. The back propagation

learning and conjugate gradient algorithms are also dealt with. The process of face

    recognition using neural networks is then described. Chapter 3 covers a famous and widely

used method for face recognition, i.e., the eigenface approach. Both principal components and

    nearest neighbour classification are dealt with. Support vector machines together with the

    various optimization techniques involved are presented in Chapter 4. Both linear and

    nonlinear cases are explained, as well as the kernels involved in support vector machines

    (SVMs). Face recognition using binary SVMs for multi-classification purposes is also

    explained. Face recognition involves several pre-processing steps. These are detailed in

    Chapter 5. The results obtained using a number of techniques presented in the previous

    chapters are compared in Chapter 6. Concluding comments and directions for future work are

    indicated in Chapter 7.


    CHAPTER II

    Artificial Neural Networks

    Neural Networks are nonlinear models of the neural pathways in the nervous system. The

    flow of signals in the nervous system is as shown in Fig. 2.1. The propagation of stimulus, in

    either direction throughout the system of sense organs, is due to the firing of simple

    elements operating in parallel. These elements can be structured as:

1. Single-Layer Feedforward Networks
2. Multilayer Feedforward Networks
3. Recurrent Networks

These structures are shown in Fig. 2.2, where each node represents a mathematical model

    of a neuron.

    Figure 2.1: Block diagram of nervous system

    Figure 2.2: Single-Layer Feedforward Networks, Multilayer Feedforward Networks, Recurrent

    Networks


    The manner in which this transmission proceeds is determined by the learning

    abilities of these neurons. Learning may be error-correction learning, memory-based

    learning, Hebbian learning; or based on paradigms as supervised learning, unsupervised

    learning, reinforcement learning through continued interaction with the environment.

Rosenblatt's perceptron is a basic form of a neural network, built around a nonlinear model of a neuron, namely, the McCulloch-Pitts model shown in Fig. 2.3. The adder node of the neuron model computes a linear combination of the input data applied to the synapses, plus a bias given by the value b. This induced local field is converted to a binary signal, +1 or -1, which determines whether the neighboring neuron is activated.

    Figure 2.3: Nonlinear model of a neuron

    In its most general form, a neural network is a machine that is designed to model the

    way in which the brain performs a particular task or function of interest. It resembles the

    brain in the following two respects:

    1. Knowledge is acquired by the network from its environment through a learning

    process.

    2. Interneuron connection strengths, known as synaptic weights, are used to store the

    acquired knowledge.

    The procedure used to perform the learning process is called a learning algorithm, the

    function of which is to modify the synaptic weights of the network in an orderly fashion to


    attain a desired design objective. The modification of synaptic weights provides the

    traditional basis for the design of neural networks. Further, it is also possible for a neural

    network to modify its own topology, which is motivated by the fact that neurons in the

    human brain can die and that new synaptic connections can grow.

2.1 Multilayer Perceptrons

Multilayer perceptrons are multilayer feedforward networks consisting of at least three

    layers: the input layer, the hidden layer(s) and the output layer:

    1. The input layer consists of input sensory nodes to which data, represented by pixel

    intensity vectors of pre-processed images, is presented. These propagate through the

    neural network in a forward direction, on a layer by layer basis, hence known as

    multi-layer feedforward networks.

    2. The hidden layer performs a non-linear transformation on the input signal into a new

space called the feature space. More than one hidden layer may be used to achieve

    this purpose. As learning progresses, the hidden neurons gradually discover the

features that characterize the images. The number of hidden neurons in each hidden layer, and the number of hidden layers, are critical to the learning process and are varied for higher accuracy.

    3. At the output layer, the output/functional signal(s), expressed as a continuous

    nonlinear function of the input signal and synaptic weights associated with that

    neuron, is obtained.

    Fig. 2.4 shows the architectural graph of a multilayer perceptron. The network shown is fully

    connected, that is, a neuron in any layer of the network is connected to all the neurons in the

    previous layer.

    In the case of supervised learning, this output signal is subtracted from the desired

    response, to yield the error signal. In the classification problem, the desired response

    corresponds to the classes of images. For this specific problem, the number of output neurons

    equals the number of classes of images. The backward propagation of the estimate of the

    error gradient vector, that is, the gradients of the error surface with respect to the weights

    connected to the inputs of a neuron, forms the core of the supervised learning process. In the


    training phase, the optimum weights of the hidden and output layers are obtained, which

    minimize the error estimate for desired results. The optimum weights are learnt by

    propagating the error signal backwards, against the synaptic connections. This supervised

    learning algorithm is known as error back propagation algorithm, which is described in detail

    in Section 2.2.

    Figure 2.4: MLP with one hidden layer


2.2 Derivation of the Error Back-Propagation Algorithm

The error signal at the output of neuron j at iteration n, that is, on presentation of the nth training example, is defined by:

$e_j(n) = d_j(n) - y_j(n)$   (2.1)

The instantaneous value of the error energy for neuron j is defined as $\frac{1}{2}e_j^2(n)$. Correspondingly, the instantaneous value E(n) of the total error energy is obtained by summing $\frac{1}{2}e_j^2(n)$ over all neurons in the output layer. Thus:

$E(n) = \frac{1}{2}\sum_{j \in C} e_j^2(n)$   (2.2)

where the set C includes all the neurons in the output layer of the network. Let N denote the total number of patterns contained in the training set. The average squared error energy is obtained by summing E(n) over all n and then normalizing with respect to the set size N, as shown by:

$E_{av} = \frac{1}{N}\sum_{n=1}^{N} E(n)$   (2.3)

    The instantaneous error energy E(n), and therefore the average error energy Eav, is a function

    of all the free parameters (i.e. synaptic weights and bias levels) of the network. For a given

    training set, Eav represents the cost function as a measure of learning performance. The

    objective of the learning process is to adjust the free parameters of the network to minimize

Eav. A simple method of training is considered in which the weights are updated on a pattern-by-pattern basis until one epoch, that is, one complete presentation of the entire training set, has

    been dealt with. The adjustments to the weights are made in accordance with the respective

    errors computed for each pattern presented to the network.
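As a minimal numerical illustration of equations (2.1)–(2.3) (our own sketch, not code from the report; names are assumptions), the instantaneous and average error energies can be computed as follows:

```python
import numpy as np

def error_energies(d, y):
    """Instantaneous error energies E(n), eq. (2.2), and their
    average E_av, eq. (2.3), for N patterns and q output neurons.
    d, y: (N, q) arrays of desired and actual outputs."""
    e = d - y                          # error signals, eq. (2.1)
    E_n = 0.5 * np.sum(e**2, axis=1)   # E(n), summed over output neurons
    return E_n, E_n.mean()             # E_av = (1/N) * sum_n E(n)

# Toy usage: 3 patterns, 2 output neurons
d = np.array([[1.0, -1.0], [1.0, 1.0], [-1.0, 1.0]])
y = np.array([[0.8, -0.6], [0.9, 0.7], [-0.5, 0.4]])
print(error_energies(d, y))
```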


    Figure 2.5: Signal-flow graph highlighting the details of output neuron j

    The arithmetic average of these individual weight changes over the training set is

    therefore an estimate of the true change that would result in from modifying the weights

    based on minimizing the cost function Eav over the entire training set. Consider Fig. 2.5

    which depicts neuron j being fed by a set of function signals produced by a layer of neurons

to its left. The induced local field $v_j(n)$ produced at the input of the activation function associated with neuron j is therefore:

$v_j(n) = \sum_{i=0}^{m} w_{ji}(n)\, y_i(n)$   (2.4)

where m is the total number of inputs (excluding the bias) applied to neuron j. The synaptic weight $w_{j0}$ (corresponding to the fixed input $y_0 = +1$) equals the bias $b_j$ applied to neuron j. Hence the function signal $y_j(n)$ appearing at the output of neuron j at iteration n is:

$y_j(n) = \varphi_j(v_j(n))$   (2.5)

The back-propagation algorithm applies a correction $\Delta w_{ji}(n)$ to the synaptic weight $w_{ji}(n)$, which is proportional to the partial derivative $\partial E(n)/\partial w_{ji}(n)$. According to the chain rule of calculus, one can express this gradient as:


$\frac{\partial E(n)}{\partial w_{ji}(n)} = \frac{\partial E(n)}{\partial e_j(n)} \frac{\partial e_j(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} \frac{\partial v_j(n)}{\partial w_{ji}(n)}$   (2.6)

The derivative $\partial E(n)/\partial w_{ji}(n)$ represents the sensitivity factor, determining the direction of search in weight space for the synaptic weight $w_{ji}$. Differentiating both sides of equation (2.2) with respect to $e_j(n)$, equation (2.1) with respect to $y_j(n)$, equation (2.5) with respect to $v_j(n)$, and equation (2.4) with respect to $w_{ji}(n)$, one respectively gets:

$\partial E(n)/\partial e_j(n) = e_j(n)$   (2.7)

$\partial e_j(n)/\partial y_j(n) = -1$   (2.8)

$\partial y_j(n)/\partial v_j(n) = \varphi_j'(v_j(n))$   (2.9)

$\partial v_j(n)/\partial w_{ji}(n) = y_i(n)$   (2.10)

The use of equations (2.7) to (2.10) in (2.6) yields:

$\frac{\partial E(n)}{\partial w_{ji}(n)} = -e_j(n)\,\varphi_j'(v_j(n))\, y_i(n)$   (2.11)

The correction $\Delta w_{ji}(n)$ applied to $w_{ji}(n)$ is defined by the delta rule:

$\Delta w_{ji}(n) = -\eta\,\frac{\partial E(n)}{\partial w_{ji}(n)}$   (2.12)

where $\eta$ is the learning-rate parameter of the back-propagation algorithm. The use of the minus sign accounts for gradient descent in weight space. Accordingly, the use of (2.11) in (2.12) yields:

$\Delta w_{ji}(n) = \eta\,\delta_j(n)\, y_i(n)$   (2.13)

where the local gradient $\delta_j(n)$ is defined by:

$\delta_j(n) = -\frac{\partial E(n)}{\partial v_j(n)} = e_j(n)\,\varphi_j'(v_j(n))$   (2.14)

The local gradient points to the required changes in synaptic weights. According to (2.14), the local gradient $\delta_j(n)$ for output neuron j is equal to the product of the corresponding error signal $e_j(n)$ for that neuron and the derivative $\varphi_j'(v_j(n))$ of the associated activation function.

    Case 1: Neuron j is an output node

    When neuron j is located in the output layer of the network, it is supplied with a

    desired response of its own. Equation (2.1) is used to compute the error signal ej(n)

    associated with this neuron.


Case 2: Neuron j is a hidden node

    When neuron j is located in a hidden layer of the network, there is no specified

    desired response for that neuron. Accordingly, the error signal for a hidden neuron would

    have to be determined recursively in terms of the error signals of all the neurons to which

    that hidden neuron is directly connected; this is where the development of the back-

    propagation algorithm gets complicated. Consider the situation depicted in Fig 2.6, which

depicts neuron j as a hidden node of the network. According to equation (2.14), the local gradient $\delta_j(n)$ for hidden neuron j is redefined as:

$\delta_j(n) = -\frac{\partial E(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} = -\frac{\partial E(n)}{\partial y_j(n)}\,\varphi_j'(v_j(n))$, neuron j hidden   (2.15)

where in the second line equation (2.9) is used. To calculate the partial derivative $\partial E(n)/\partial y_j(n)$, one proceeds as follows. From equation (2.2):

$E(n) = \frac{1}{2}\sum_{k \in C} e_k^2(n)$, neuron k an output node   (2.16)

Differentiating equation (2.16) with respect to the function signal $y_j(n)$:

$\frac{\partial E(n)}{\partial y_j(n)} = \sum_{k} e_k(n)\,\frac{\partial e_k(n)}{\partial y_j(n)}$   (2.17)

Using the chain rule for the partial derivative $\partial e_k(n)/\partial y_j(n)$, and rewriting in the equivalent form, the following is obtained:

$\frac{\partial E(n)}{\partial y_j(n)} = \sum_{k} e_k(n)\,\frac{\partial e_k(n)}{\partial v_k(n)} \frac{\partial v_k(n)}{\partial y_j(n)}$   (2.18)

However,

$e_k(n) = d_k(n) - y_k(n) = d_k(n) - \varphi_k(v_k(n))$, neuron k an output node   (2.19)

Hence:

$\frac{\partial e_k(n)}{\partial v_k(n)} = -\varphi_k'(v_k(n))$   (2.20)

Note that for neuron k the induced local field is:

$v_k(n) = \sum_{j=0}^{m} w_{kj}(n)\, y_j(n)$   (2.21)


    Figure 2.6: Signal-flow graph highlighting the details of output neuron k connected to hidden neuron j

    where m is the total number of inputs (excluding the bias) applied to neuron k. Here again,

the synaptic weight $w_{k0}(n)$ is equal to the bias $b_k(n)$ applied to neuron k, and the corresponding input is fixed at the value +1. Differentiating equation (2.21) with respect to $y_j(n)$ yields:

$\frac{\partial v_k(n)}{\partial y_j(n)} = w_{kj}(n)$   (2.22)

By using equations (2.20) and (2.22) in equation (2.18), the desired partial derivative is obtained:

$\frac{\partial E(n)}{\partial y_j(n)} = -\sum_{k} e_k(n)\,\varphi_k'(v_k(n))\, w_{kj}(n) = -\sum_{k} \delta_k(n)\, w_{kj}(n)$   (2.23)

where in the second line the definition of the local gradient $\delta_k(n)$ given in equation (2.14), with the index k substituted for j, is used. Finally, the back-propagation formula for the local gradient $\delta_j(n)$ is obtained:

$\delta_j(n) = \varphi_j'(v_j(n)) \sum_{k} \delta_k(n)\, w_{kj}(n)$   (2.24)


The factor $\varphi_j'(v_j(n))$ involved in the computation of the local gradient $\delta_j(n)$ in (2.24) depends solely on the activation function associated with hidden neuron j. The remaining factor involved in this computation, namely the summation over k, depends on two sets of terms. The first set of terms, the $\delta_k(n)$, requires knowledge of the error signals $e_k(n)$ for all neurons that lie in the layer to the immediate right of hidden neuron j, and that are directly connected to neuron j. The second set of terms, the $w_{kj}(n)$, consists of the synaptic weights associated with these connections. Figure 2.7 summarizes back-propagation learning.

    Figure 2.7: Signal-flow graphical summary of back-propagation learning. Top part of the graph: forward

    pass. Bottom part of the graph: backward pass.
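To make equations (2.13), (2.14) and (2.24) concrete, the following NumPy sketch (our own illustration, not code from the report; biases are omitted for brevity) performs one pattern-mode back-propagation update for an MLP with a single hidden layer and tanh activations:

```python
import numpy as np

def backprop_step(x, d, W1, W2, eta=0.01):
    """One pattern-mode back-propagation update.
    x: input vector; d: desired output; W1, W2: weight matrices."""
    # Forward pass, eqs. (2.4)-(2.5), with phi = tanh
    v1 = W1 @ x;  y1 = np.tanh(v1)            # hidden layer
    v2 = W2 @ y1; y2 = np.tanh(v2)            # output layer

    # Local gradients; for phi(v) = tanh(v), phi'(v) = 1 - tanh^2(v)
    e = d - y2                                # error signal, eq. (2.1)
    delta2 = e * (1.0 - y2**2)                # output neurons, eq. (2.14)
    delta1 = (1.0 - y1**2) * (W2.T @ delta2)  # hidden neurons, eq. (2.24)

    # Delta rule, eq. (2.13): correction = eta * delta * input signal
    W2 += eta * np.outer(delta2, y1)
    W1 += eta * np.outer(delta1, x)
    return W1, W2
```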

2.3 Activation Function

The computation of the local gradient $\delta$ for each neuron of a multilayer perceptron requires knowledge of the derivative of the activation function $\varphi(\cdot)$ associated with that neuron. For this derivative to exist, the function $\varphi(\cdot)$ must be continuous. In effect, differentiability is the only requirement that an activation function has to satisfy.


The hyperbolic tangent function shown in Fig. 2.8, an antisymmetric function, is a commonly used form of sigmoidal non-linearity, which in its most general form is defined by:

$\varphi_j(v_j(n)) = a \tanh(b\, v_j(n))$, with $a, b > 0$   (2.25)

where a and b are constants. Its derivative with respect to $v_j(n)$ is given by:

$\varphi_j'(v_j(n)) = ab\,\mathrm{sech}^2(b\, v_j(n)) = ab\,(1 - \tanh^2(b\, v_j(n))) = \frac{b}{a}\,[a - y_j(n)][a + y_j(n)]$   (2.26)

For a neuron j located in the output layer, $y_j(n) = o_j(n)$, and the local gradient is therefore:

$\delta_j(n) = e_j(n)\,\varphi_j'(v_j(n)) = \frac{b}{a}\,[d_j(n) - o_j(n)][a - o_j(n)][a + o_j(n)]$   (2.27)

For a neuron j in the hidden layer,

$\delta_j(n) = \varphi_j'(v_j(n)) \sum_{k} \delta_k(n)\, w_{kj}(n) = \frac{b}{a}\,[a - y_j(n)][a + y_j(n)] \sum_{k} \delta_k(n)\, w_{kj}(n)$, neuron j hidden   (2.28)

    Figure 2.8: Hyperbolic tangent function
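Equation (2.26) can be checked numerically; the sketch below (our own, with a = 1.7159 and b = 2/3 as illustrative constants often recommended for this activation) verifies that the output-based form $(b/a)[a - y_j][a + y_j]$ agrees with the direct derivative:

```python
import numpy as np

a, b = 1.7159, 2.0 / 3.0         # illustrative constants for eq. (2.25)

def phi(v):
    return a * np.tanh(b * v)    # eq. (2.25)

def phi_prime(v):
    return a * b / np.cosh(b * v) ** 2   # a*b*sech^2(b v)

v = np.linspace(-3, 3, 7)
y = phi(v)
# Output-based form of eq. (2.26): (b/a) [a - y][a + y]
assert np.allclose(phi_prime(v), (b / a) * (a - y) * (a + y))
```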


2.4 Rate of Learning

The back-propagation algorithm provides an approximation to the trajectory in weight space computed by the method of steepest descent. The smaller the learning-rate parameter $\eta$ is made, the smaller the changes to the synaptic weights in the network will be from one

    iteration to the next, and the smoother will be the trajectory in weight space. The

improvement, however, is attained at the cost of a slower rate of learning. If, on the other

    hand, the learning-rate parameter is made too large in order to speed up the rate of learning,

    the resulting large changes in the synaptic weights assume such a form that the network may

    become unstable. A simple method of increasing the rate of learning yet avoiding the danger

    of instability is to modify the delta rule, by including a momentum term as follows:

$\Delta w_{ji}(n) = \alpha\,\Delta w_{ji}(n-1) + \eta\,\delta_j(n)\, y_i(n)$   (2.29)

where $\alpha$ is a positive number called the momentum constant. Equation (2.29) may be rewritten as a time series with index t, where t goes from the initial time 0 to the current time n. It may be viewed as a first-order difference equation in the weight correction $\Delta w_{ji}(n)$, whose solution is:

$\Delta w_{ji}(n) = \eta \sum_{t=0}^{n} \alpha^{n-t}\, \delta_j(t)\, y_i(t) = -\eta \sum_{t=0}^{n} \alpha^{n-t}\, \frac{\partial E(t)}{\partial w_{ji}(t)}$   (2.30)

Based on this relation, the following conclusion may be drawn: the current adjustment $\Delta w_{ji}(n)$ represents the sum of an exponentially weighted time series. For the time series to be convergent, the momentum constant must be restricted to the range $0 \le |\alpha| < 1$.
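A minimal sketch of the momentum-modified delta rule of equation (2.29) (our own illustration; variable names are assumptions):

```python
import numpy as np

def momentum_update(dW_prev, delta, y_in, eta=0.01, alpha=0.9):
    """Delta rule with momentum, eq. (2.29):
    dW(n) = alpha * dW(n-1) + eta * delta(n) * y(n),
    with 0 <= |alpha| < 1 required for convergence."""
    return alpha * dW_prev + eta * np.outer(delta, y_in)

# Usage inside a training loop: dW = momentum_update(dW, delta, y_in),
# followed by W += dW.
```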


2.5 Modes of Training

One complete presentation of the entire training set during the learning process is called

    an epoch. The learning process is maintained on an epoch by epoch basis until the synaptic

    weights and bias levels of the network stabilize and the averaged mean squared error over the

    entire training set converges to some minimum value. It is good practice to randomize the

    order of presentation of training examples from one epoch to the next.

    For a given training set, back-propagation learning may proceed in one of two basic

    ways:

    1. Sequential mode: The sequential mode of back-propagation learning is also referred

    to as on-line, pattern or stochastic mode. In this mode of operation weight updating is

    performed after the presentation of each training example.

    2. Batch mode: In the batch mode of back-propagation learning, weight updating is

    performed after the presentation of all training examples that constitute an epoch. For

a particular epoch, the cost function is defined as the average squared error of equations (2.2) and (2.3):

$E_{av} = \frac{1}{2N} \sum_{n=1}^{N} \sum_{j \in C} e_j^2(n)$   (2.31)

    where the error signal pertains to the output neuron j for the training example n.

    For an on-line operational point of view, the sequential mode of training is

    preferred over the batch mode because it requires less local storage for each synaptic

    connection. Moreover, given that the patterns are presented to the network in a random

    manner, the use of pattern by pattern updating of weights makes the search in weight space

    stochastic in nature. This in turn makes it less likely for the back-propagation algorithm to be

    trapped in a local minimum.

On the other hand, the stochastic nature of the sequential mode makes it difficult to establish theoretical conditions for convergence of the algorithm. In contrast, the use of the batch

    mode of training provides an accurate estimate of the gradient vector; convergence to a local

    minimum is guaranteed under simple conditions.
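The two modes can be contrasted in a short sketch (our own pseudocode-style illustration; `update` and `grad` stand for a per-pattern weight update and a gradient routine such as the ones sketched earlier):

```python
import random

def train_sequential(patterns, W, update, epochs=1):
    # Sequential (on-line) mode: update the weights after EACH example,
    # presenting the patterns in random order every epoch.
    for _ in range(epochs):
        random.shuffle(patterns)            # patterns: list of (x, d) pairs
        for x, d in patterns:
            W = update(W, x, d)
    return W

def train_batch(patterns, W, grad, eta=0.01, epochs=1):
    # Batch mode: accumulate the gradient over the WHOLE epoch, then
    # apply one update that descends the average error E_av, eq. (2.31).
    for _ in range(epochs):
        g = sum(grad(W, x, d) for x, d in patterns)
        W = W - eta * g / len(patterns)
    return W
```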


2.6 Local Minima of the Error Surface

A peculiarity of the error surface (Fig. 2.9) that impacts the performance of the back-propagation algorithm is the presence of local minima in addition to global minima. The algorithm runs the risk of being trapped in a local minimum, where even a small change in the synaptic weights increases the cost function, even though somewhere else in the weight space there exists another set of synaptic weights for which the cost function is smaller than at the local minimum in which the network is stuck. It is clearly undesirable to have the learning process terminate at a local minimum, especially if it is located far above a global minimum.

    Figure 2.9: Error surface

2.7 Supervised Learning Viewed as an Optimization Problem

Here, the supervised training of a multilayer perceptron is viewed as a problem in numerical optimization. The error surface of a multilayer perceptron is a highly nonlinear function of the synaptic weight vector $\mathbf{w}$. Let $E_{av}(\mathbf{w})$ denote the cost function, averaged over the training sample. Using a Taylor series, expand $E_{av}(\mathbf{w})$ about the current point on the error surface $\mathbf{w}(n)$ as:

$E_{av}(\mathbf{w}(n) + \Delta\mathbf{w}(n)) = E_{av}(\mathbf{w}(n)) + \mathbf{g}^{T}(n)\,\Delta\mathbf{w}(n) + \frac{1}{2}\,\Delta\mathbf{w}^{T}(n)\,\mathbf{H}(n)\,\Delta\mathbf{w}(n)$   (2.32)

    where g(n) is the local gradient vector defined by:

$\mathbf{g}(n) = \left.\dfrac{\partial E_{av}(\mathbf{w})}{\partial \mathbf{w}}\right|_{\mathbf{w}=\mathbf{w}(n)}$   (2.33)

and H(n) is the local Hessian matrix defined by:

$\mathbf{H}(n) = \left.\dfrac{\partial^2 E_{av}(\mathbf{w})}{\partial \mathbf{w}^2}\right|_{\mathbf{w}=\mathbf{w}(n)}$   (2.34)

    The use of an ensemble-averaged cost function Eav(w) presumes a batch mode of

    operation. The basic back-propagation algorithm adjusts the weights in the steepest descent

    direction (negative of the gradient). This is the direction in which the performance function is

    decreasing most rapidly. It turns out that, although the function decreases most rapidly along

    the negative of the gradient, this does not necessarily produce the fastest convergence. In the

    conjugate gradient type of algorithms a search is performed along conjugate directions,

    which produces generally faster convergence than steepest descent directions.

    In most of the training algorithms, a learning rate is used to determine the length of

    the weight update (step size). In most of the conjugate gradient algorithms, the step size is

    adjusted at the end of each iteration. A search is made along the conjugate gradient direction

    to determine the step size, which minimizes the performance function along that line.

    Fletcher-Reeves Update: All of the conjugate gradient algorithms start out by searching in

    the steepest descent direction (negative of the gradient) on the first iteration:

$\mathbf{p}_0 = -\mathbf{g}_0$   (2.35)

    A line search is then performed to determine the optimal distance to move along the current

    search direction:

$\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k\,\mathbf{p}_k$   (2.36)

    Then the next search direction is determined so that it is conjugate to previous search

    directions. The general procedure for determining the new search direction is to combine the

    new steepest descent direction with the previous search direction:

$\mathbf{p}_k = -\mathbf{g}_k + \beta_k\,\mathbf{p}_{k-1}$   (2.37)


The various versions of conjugate gradient are distinguished by the manner in which the constant $\beta_k$ is computed. For the Fletcher-Reeves update the procedure is:

$\beta_k = \dfrac{\mathbf{g}_k^T \mathbf{g}_k}{\mathbf{g}_{k-1}^T \mathbf{g}_{k-1}}$   (2.38)

    This is the ratio of the norm squared of the current gradient to the norm squared of the

    previous gradient. Once the MLP has been trained, it can be tested on a database of images.

The output neuron which fires corresponds to the class of the image.
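A compact sketch of the Fletcher-Reeves iteration of equations (2.35)–(2.38) on a generic differentiable cost (our own illustration; the exact line search of eq. (2.36) is replaced by a crude step-halving stand-in):

```python
import numpy as np

def fletcher_reeves(grad, x, iters=50, step=1.0):
    """Conjugate-gradient minimization with the Fletcher-Reeves update."""
    g = grad(x)
    p = -g                                   # eq. (2.35): steepest descent first
    for _ in range(iters):
        alpha = step                         # stand-in for the line search
        while np.dot(grad(x + alpha * p), p) > 0 and alpha > 1e-8:
            alpha *= 0.5                     # halve until the minimum is not overshot
        x = x + alpha * p                    # eq. (2.36)
        g_new = grad(x)
        beta = (g_new @ g_new) / (g @ g)     # eq. (2.38)
        p = -g_new + beta * p                # eq. (2.37)
        g = g_new
    return x

# Toy usage: minimize 0.5 x^T A x - b^T x, whose gradient is A x - b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_min = fletcher_reeves(lambda x: A @ x - b, np.zeros(2))
```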

2.8 Face Recognition Using Neural Networks

The Cambridge ORL database consists of 400 frontal images of 40 people, i.e., 10

    images of each person, of which 5 are used for training and the other 5 are used for testing

    purposes. Since there are 40 people that need to be identified, the number of classes that a

    neural network must distinguish is also 40. The pixel intensity vectors of the pre-processed

training images are obtained and the MLP is trained on them. The mode of training used is batch mode, and using the back-propagation algorithm in conjunction with the Fletcher-Reeves update, the optimum weights for the MLP are obtained. In order to improve results, the inputs are presented in random order. Training is conducted for 2500 epochs.
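For readers wishing to reproduce a comparable (not identical) experiment, the sketch below uses scikit-learn's MLPClassifier with the 5/5 per-subject split described above. The report's own implementation uses batch back-propagation with the Fletcher-Reeves update, which scikit-learn does not provide; the 'lbfgs' solver is used here only as a rough batch-mode substitute, and the image-loading code and hidden-layer size are assumptions:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def split_orl(y, per_subject=10, n_train=5):
    """Indices of the first 5 images per subject for training, rest for testing."""
    tr, te = [], []
    for s in range(40):
        idx = np.where(y == s)[0][:per_subject]
        tr.extend(idx[:n_train]); te.extend(idx[n_train:])
    return np.array(tr), np.array(te)

def run(X, y):
    """X: (400, p) flattened pre-processed ORL images; y: labels 0..39."""
    tr, te = split_orl(y)
    clf = MLPClassifier(hidden_layer_sizes=(100,), activation='tanh',
                        solver='lbfgs', max_iter=2500)  # batch-mode stand-in
    clf.fit(X[tr], y[tr])
    return clf.score(X[te], y[te])   # fraction of test faces recognized
```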

2.9 Conclusions

This chapter dealt with artificial neural networks. Specifically, the architecture of

    multilayered feedforward neural networks is introduced. Supervised learning via the back

    propagation algorithm is discussed along with the modes of training, the choice of learning

    rate, and the potential pitfalls of getting trapped in local minima. Finally, the use of artificial

neural networks for face recognition is introduced. The results are presented in Chapter 6. A technique that is frequently used for the analysis of images is that of principal components.

    This is the topic of the next chapter.


    CHAPTER III

    Principal Components Analysis

    Principal components analysis is a method of unsupervised learning. The main idea is

to discover significant patterns or features in the input data without an external aid. To do

    so, the algorithm is provided with a set of rules of a local nature, which enables it to learn to

    compute an input-output mapping with specific desirable properties; the term local implies

    that the change applied to the synaptic weight of a neuron is confined to the immediate

    neighborhood of that neuron.

    Feature selection refers to a process whereby the given data is compressed to features

or patterns fewer in number than the given data; thus, the data space is transformed into a

    feature space and undergoes a dimensionality reduction. Evidently, this data compression is

    lossy, and it can be shown that in the case of principal component analysis, the mean-square

    of the resulting error equals the sum of the variances of the elements of that part of the data

    vector that is eventually eliminated. Therefore one seeks a transformation that is optimum in

    the mean squared sense; i.e., the eliminated component must have total variance that is less

    than a predefined threshold. Principal components analysis computes the basis of a space

    which represents the training vectors. These basis vectors are eigenvectors of a related

    covariance matrix.

    When applied to face recognition, one determines the principal components of the

    distribution of faces treating the image as a point in a high dimensional space; i.e., the

    eigenvectors of the covariance matrix of the set of face images are computed. These

eigenvectors, referred to as eigenfaces in this context, contain relevant discriminatory

    information extracted from the images. Thus, characteristic features representing the

    variation in the collection of faces are captured, and this information is used to encode and

    compare individual faces. It is to be noted that the features (i.e., the eigenvectors) are ordered

    with respect to the corresponding eigenvalues, and only those features that are significant (in


    the sense of the value of the eigenvalues) are considered. Recognition is performed by

    projecting a new image on to the subspace spanned by the eigenfaces and then classifying the

    face by comparing its position in the face space with the position of known individuals.

    Figure 3.1: (a) The 1st PC z1 is a minimum distance fit to a line in X space. (b) The 2nd PC z2 is a

    minimum distance fit to a line in the plane perpendicular to the 1st PC.

    3.1 Derivation of Principal Components (PC)

Given a sample of n observations on a vector of p variables:

$\mathbf{x} = (x_1, x_2, \ldots, x_p)^T$   (3.1)

define the first principal component of the sample by the linear transformation:

$z_1 = \mathbf{a}_1^T \mathbf{x} = \sum_{i=1}^{p} a_{i1}\, x_i$   (3.2)

where the vector $\mathbf{a}_1 = (a_{11}, a_{21}, \ldots, a_{p1})^T$ is chosen such that $\mathrm{var}[z_1]$ is maximum. Likewise, define the kth PC of the sample by the linear transformation:

$z_k = \mathbf{a}_k^T \mathbf{x}, \quad k = 1, \ldots, p$   (3.3)

where the vector $\mathbf{a}_k = (a_{1k}, a_{2k}, \ldots, a_{pk})^T$ is chosen such that $\mathrm{var}[z_k]$ is maximum, subject to:

$\mathrm{cov}[z_k, z_l] = 0 \ \text{for}\ l < k, \quad \text{and} \quad \mathbf{a}_k^T \mathbf{a}_k = 1$   (3.4)


To find $\mathbf{a}_1$, first note that:

$\mathrm{var}[z_1] = \mathbb{E}(z_1^2) - (\mathbb{E}\, z_1)^2 = \sum_{i,j} a_{i1}\, a_{j1}\, S_{ij} = \mathbf{a}_1^T S\, \mathbf{a}_1$   (3.5)

where $S_{ij} = \mathbb{E}(x_i x_j) - \mathbb{E}(x_i)\,\mathbb{E}(x_j)$ and S is the covariance matrix for $\mathbf{x}$.

To find $\mathbf{a}_1$, maximize $\mathrm{var}[z_1]$ subject to $\mathbf{a}_1^T \mathbf{a}_1 = 1$. This constrained optimization problem can be transformed into an unconstrained one by using the Lagrange multiplier $\lambda$. Thus, the problem is reduced to the maximization of $\mathbf{a}_1^T S\, \mathbf{a}_1 - \lambda(\mathbf{a}_1^T \mathbf{a}_1 - 1)$. Differentiating with respect to the vector $\mathbf{a}_1$ and setting the derivative to zero, one obtains:

$S\, \mathbf{a}_1 - \lambda\, \mathbf{a}_1 = 0$, i.e., $(S - \lambda I_p)\,\mathbf{a}_1 = 0$   (3.6)

Therefore $\mathbf{a}_1$ is an eigenvector of S corresponding to the eigenvalue $\lambda = \lambda_1$. Note that the following has been maximized:

$\mathrm{var}[z_1] = \mathbf{a}_1^T S\, \mathbf{a}_1 = \lambda_1\, \mathbf{a}_1^T \mathbf{a}_1 = \lambda_1$   (3.7)

So $\lambda_1$ is the largest eigenvalue of S. The first PC $z_1$ retains the greatest amount of variation in the sample. An example is illustrated in Fig. 3.1(a). To find the next coefficient vector $\mathbf{a}_2$, maximize $\mathrm{var}[z_2]$ subject to:

$\mathrm{cov}[z_2, z_1] = 0 \quad \text{and} \quad \mathbf{a}_2^T \mathbf{a}_2 = 1$   (3.8)

First note that:

$\mathrm{cov}[z_2, z_1] = \mathbf{a}_1^T S\, \mathbf{a}_2 = \lambda_1\, \mathbf{a}_1^T \mathbf{a}_2$   (3.9)

Then let $\lambda$ and $\phi$ be Lagrange multipliers, and maximize $\mathbf{a}_2^T S\, \mathbf{a}_2 - \lambda(\mathbf{a}_2^T \mathbf{a}_2 - 1) - \phi\, \mathbf{a}_2^T \mathbf{a}_1$. It is found that $\mathbf{a}_2$ is also an eigenvector of S, whose eigenvalue $\lambda = \lambda_2$ is the second largest, as illustrated in Fig. 3.1(b) for the two-dimensional case. In general:

$\mathrm{var}[z_k] = \mathbf{a}_k^T S\, \mathbf{a}_k = \lambda_k$   (3.10)

The kth largest eigenvalue of S is the variance of the kth PC, and the kth PC $z_k$ retains the kth greatest fraction of the variation in the sample.


Given a sample of n observations on a vector of p variables $\mathbf{x} = (x_1, x_2, \ldots, x_p)^T$, define a vector of p PCs $\mathbf{z} = (z_1, z_2, \ldots, z_p)^T$ according to $\mathbf{z} = A^T \mathbf{x}$, where A is an orthogonal $p \times p$ matrix whose kth column is the kth eigenvector $\mathbf{a}_k$ of S. Then $\Lambda = A^T S A$ is the covariance matrix of the PCs; the matrix $\Lambda$ is diagonal, with elements $\Lambda_{ij} = \lambda_i\, \delta_{ij}$.
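A compact NumPy sketch of this construction (our own illustration; names are assumptions): estimate S from centered data, eigendecompose it, and project onto the leading eigenvectors:

```python
import numpy as np

def principal_components(X, t):
    """X: (n, p) data matrix, one observation per row.
    Returns the top-t eigenvectors of the covariance matrix S
    and the PC scores of the data projected onto them."""
    Xc = X - X.mean(axis=0)            # center the data
    S = np.cov(Xc, rowvar=False)       # p x p covariance matrix
    lam, A = np.linalg.eigh(S)         # eigh, since S is symmetric
    order = np.argsort(lam)[::-1]      # sort by decreasing eigenvalue
    A_t = A[:, order[:t]]              # leading t eigenvectors a_k
    return A_t, Xc @ A_t               # scores z_k = a_k^T x
```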

    3.2 Face Recognition using Eigenfaces

Let a face image $\Gamma(x, y)$ be a two-dimensional N by N array of intensity values. By stacking the columns of this array into a single vector, an image may also be considered as a vector of dimension $N^2$. Thus, a typical image of size 256 by 256 becomes a vector of dimension 65,536, or equivalently, a point in 65,536-dimensional space. An ensemble of

    images, then, maps to a collection of points in this vector space of a relatively large

    dimension.

    Images of faces, being similar in overall configuration, will not be randomly

    distributed in this huge image space and thus can be described by a relatively low

    dimensional subspace. The main idea of the principal component analysis is to find a set of

    basis vectors that best accounts for the distribution of face images within the entire image

    space. These vectors define the subspace of face images, which is generally referred to as the

face space. Each vector is of length $N^2$, describes an N by N image, and is a linear

    combination of the original face images. Because these vectors are the eigenvectors of the

    covariance matrix corresponding to the original face images, and because they are face-like

    in appearance, they are typically referred to as eigenfaces.

As described earlier, a 2-D facial image can be represented as a 1-D vector by concatenating each row (or column). Let the training set of face images be $\Gamma_1, \Gamma_2, \ldots, \Gamma_M$. The average face of the set is defined by $\Psi = \frac{1}{M}\sum_{n=1}^{M} \Gamma_n$. Each face differs from the average by the vector $\Phi_n = \Gamma_n - \Psi$. This set of very large vectors is then subjected to principal component analysis, which seeks a set of M orthonormal vectors $\mathbf{u}_k$ which best describes the distribution of the data. The kth vector $\mathbf{u}_k$ is chosen such that:


$\lambda_k = \frac{1}{M}\sum_{n=1}^{M} (\mathbf{u}_k^T \Phi_n)^2$   (3.11)

is a maximum, subject to:

$\mathbf{u}_l^T \mathbf{u}_k = \begin{cases} 1, & l = k \\ 0, & \text{otherwise} \end{cases}$   (3.12)

The vectors $\mathbf{u}_k$ and scalars $\lambda_k$ are the eigenvectors and eigenvalues, respectively, of the covariance matrix:

$C = \frac{1}{M}\sum_{n=1}^{M} \Phi_n \Phi_n^T = A A^T$   (3.13)

where the matrix $A = [\Phi_1\ \Phi_2\ \ldots\ \Phi_M]$. The matrix C is, however, $N^2$ by $N^2$, and determining the $N^2$ eigenvectors and eigenvalues is an intractable task for typical image sizes. A computationally feasible method is needed to find these eigenvectors.

If the number of data points in the image space is less than the dimension of the space (M < N²), there will be only M, rather than N², meaningful eigenvectors, in the sense that the remaining eigenvectors are associated with zero eigenvalues. Accordingly, it is computationally better to determine the eigenvalues and eigenvectors of a much smaller matrix. For example, in the situation outlined earlier, one computes the eigenvalues and the corresponding eigenvectors of an M x M matrix as opposed to a 65,536 x 65,536 matrix.

Consider the eigenvectors $v_n$ of $A^T A$ such that:

$A^T A\, v_n = \mu_n v_n$ (3.14)

Premultiplying both sides by A,

$A A^T (A v_n) = \mu_n (A v_n)$ (3.15)

from which it can be inferred that $A v_n$ are the eigenvectors of $C = A A^T$. Following this analysis, construct the M by M matrix $L = A^T A$, where $L_{mn} = \Phi_m^T \Phi_n$, and find the M eigenvectors $v_n$ of L. These vectors determine linear combinations of the M training set face images to form the eigenfaces $u_n$:


$u_n = \sum_{k=1}^{M} v_{nk} \Phi_k = A v_n, \quad n = 1, \ldots, M$ (3.16)

With this analysis the calculations are greatly reduced, from the order of the number of pixels in the images (N²) to the order of the number of images in the training set (M). In practice, the training set of face images will be relatively small (M < N²), and the calculations become quite manageable. The values of the associated eigenvalues allow the eigenvectors to be ordered according to their usefulness in characterizing the variation among the images. It is to be emphasized at this juncture that such an ordering is possible because all the eigenvalues of $A^T A$ (and hence $A A^T$) are non-negative, these matrices being positive semi-definite.
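The procedure above translates directly into a few lines of linear algebra. The following is a minimal sketch (not the project's actual code; the "images" are synthetic random vectors standing in for a real training set) of the small-matrix trick and of eq. (3.16):

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 20, 64
faces = rng.random((M, N * N))      # each row: one N^2-dim image vector Gamma_n

psi = faces.mean(axis=0)            # average face (Psi)
A = (faces - psi).T                 # N^2 x M matrix of difference vectors Phi_n

L = A.T @ A                         # M x M matrix, instead of N^2 x N^2
mu, V = np.linalg.eigh(L)           # eigenvectors v_n of A^T A
order = np.argsort(mu)[::-1]        # order by decreasing eigenvalue
mu, V = mu[order], V[:, order]

U = A @ V                           # eigenfaces u_n = A v_n, as in eq. (3.16)
U /= np.linalg.norm(U, axis=0)      # normalise each eigenface to unit length
```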

    The purpose of PCA is to reduce the large dimensionality of the data space (observed

    variables) to the smaller intrinsic dimensionality of feature space (independent variables),


    which are needed to describe the data economically. This is the case when there is a strong

correlation between observed variables. Therefore, the goal is to find the set of vectors $e_i$ which have the largest possible projection onto each of the $w_i$. The eigenvectors corresponding to

    nonzero eigenvalues of the covariance matrix produce an orthonormal basis for the subspace

    within which most image data can be represented with a small amount of error. The

    eigenvectors are sorted from high to low according to their corresponding eigenvalues. The

    eigenvector associated with the largest eigenvalue is one that reflects the greatest variance in

    the image. That is, the smallest eigenvalue is associated with the eigenvector that finds the

least variance. Experience shows that the usefulness of the eigenvectors decreases roughly exponentially; i.e., roughly 90% of the total variance is contained in the first 5% to

    10% of the dimensions.
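Continuing the sketch above (this snippet reuses mu and U from it), one plausible way to exploit this property is to keep only enough eigenfaces to retain a chosen fraction, say 90%, of the total variance; the threshold is a design choice, not a value fixed by the report:

```python
import numpy as np

explained = mu / mu.sum()                 # fraction of variance per eigenface
k = int(np.searchsorted(np.cumsum(explained), 0.90) + 1)
U_k = U[:, :k]                            # reduced face-space basis
```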

3.3 Euclidean Distance Method for Classification

The simplest method for determining which face class provides the best description of an input facial image is to find the face class k that minimizes the Euclidean distance $\epsilon_k = \|\Omega - \Omega_k\|$, where $\Omega$ is the projection of the input image onto the face space and $\Omega_k$ is a vector describing the kth face class. If $\epsilon_k$ is less than some predefined threshold $\theta$, the face is classified as belonging to class k.
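A minimal sketch of this rule follows; the function and variable names (classify, class_means, theta) are illustrative, not taken from the report:

```python
import numpy as np

def classify(omega, class_means, theta):
    """Nearest-class rule: omega is the face-space projection of the input
    image; class_means[k] is Omega_k; theta is the rejection threshold."""
    dists = np.linalg.norm(class_means - omega, axis=1)  # epsilon_k per class
    k = int(np.argmin(dists))
    return k if dists[k] < theta else None               # None: no class matched

# Hypothetical usage with the earlier sketch's quantities:
# omega = U_k.T @ (image - psi)           # project mean-subtracted image
# label = classify(omega, class_means, theta=10.0)
```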

    Note: Eigenfaces, once obtained, can also be used to train a multilayer perceptron for

    classification purposes. This type of training incorporates both supervised and unsupervised

    learning. Once trained, it can be tested on a database of images.

3.4 Conclusions

In this chapter the eigenface approach to face recognition was dealt with. This was the first statistics-based face recognition technique to be proposed. A principal advantage of this approach is that it is amenable to real-time face recognition. The eigenface approach followed in this project uses principal components and the nearest-neighbour classification technique. In the next chapter, face recognition using support vector machines is presented.


    CHAPTER IV

    Support Vector Machines

    Support vector machines are universal feedforward networks, which can be used for pattern

classification and nonlinear regression. Basically, the support vector machine is a linear machine with some useful properties. To explain how it works, it is perhaps easiest to start

    with the case of separable patterns that could arise in the context of pattern classification. In

    this context, the main idea of a support vector machine is to construct a decision surface,

    called a hyperplane, in such a way that the margin of separation between positive and

    negative examples is maximized.

    The machine achieves this desirable property by following a principled approach

rooted in statistical learning theory. More precisely, the support vector machine is an

    approximate implementation of the method of structural risk minimization. The induction

    principle is based on the fact that the error-rate of a learning machine on test data is bounded

    by the sum of the training-error rate and a term that depends on the Vapnik-Chervonenkis

    (VC) dimension; in the case of separable patterns, a support vector machine produces a value

    of zero for the first term and minimizes the second term. Accordingly, the support vector

    machine can provide a good generalization on pattern classification problems despite the fact

    that it does not incorporate problem-domain knowledge. This attribute is unique to support

    vector machines.

    A notion that is central to the construction of the support vector learning algorithm is

    the inner-product kernel between a support vector xi and the vector x drawn from the input

    space. The support vectors consist of a small subset of the training data extracted by the

algorithm. Depending on how this inner-product kernel is generated, different learning machines, characterized by their respective nonlinear decision surfaces, may be constructed. In

    particular, the support vector learning algorithm is used to construct the following three types

    of learning machines:

• Polynomial learning machines
• Radial-basis function networks
• Two-layer perceptrons with a single hidden layer

    That is, for each of these feedforward networks the support vector learning algorithm is used

    to implement the learning process using a given set of training data, automatically

    determining the required number of hidden units. Stated in another way, whereas the back-

    propagation algorithm is devised specifically to train a multilayer perceptron, the support

    vector learning algorithm is of a more generic nature because it has wider applicability.

Support vector machines are rooted in statistical learning theory, which gives a family of bounds that govern the learning capacity of the machine. The quantity of interest is the expected risk:

$R(\alpha) = \int \frac{1}{2}\,|y - f(x, \alpha)|\; dP(x, y)$ (4.1)

With probability $1 - \eta$, this risk is bounded above by the empirical (training) risk plus a confidence term that depends on the Vapnik-Chervonenkis dimension h:

$R(\alpha) \leq R_{emp}(\alpha) + \sqrt{\frac{h(\ln(2l/h) + 1) - \ln(\eta/4)}{l}}$

where l is the number of training examples. Properties of the bound:

1. It is independent of P(x,y). It assumes only that both the training data and the test data are drawn independently according to some P(x,y).
2. It is usually not possible to compute the left-hand side.
3. If h is known, the right-hand side can easily be computed.

Thus, given several different learning machines, and choosing a fixed, sufficiently small $\eta$, that machine is chosen which gives the lowest upper bound on the actual risk. This gives a principled method for choosing a learning machine for a given task, and is the essential idea of structural risk minimization.

The general structure of an SVM is as shown in Fig. 4.1.


    Figure 4.1: General structure of SVM

4.1 Basics

Let $X \subset R^n$ denote the possible inputs to the support vector machine. These inputs are pixel intensity vectors obtained after image pre-processing. An SVM is trained with images belonging to two classes at a time; a single machine is not efficient at encoding multiple classes. Assume that a point $x \in X$ is associated with one of two possible classes, denoted -1 and +1. This means that it is sufficient to have the output $Y_{estimate} \in \{-1, +1\}$ from the classifier. Consider two stochastic variables: X, representing the point, and Y, representing the class label. These variables then give rise to the conditional distributions p(X=x | Y=1) and p(X=x | Y=-1). There are two phases concerning support vector machines: first training and then classifying. By letting $f: X \to Y_{estimate}$ represent the discriminating function, the task during training is to minimize the probability of returning a wrong classification for all


elements included in the training set. This means that for the training set $Y_{estimate}$ should be identical to Y in as many cases as possible.

During training, SVMs solve this problem by simply finding a function f which, for every point $x_k = [x_{k1}, x_{k2}, \ldots, x_{kn}]^T$ with corresponding class label $y_k$ in the training set $S = ((x_1, y_1), \ldots, (x_l, y_l))$, has the following property: $f(x_k) \geq 0$ if $y_k = +1$, and $f(x_k) < 0$ if $y_k = -1$.


    To formulate this, suppose that all the training data satisfy the following constraints:

$w \cdot x_i + b \geq +1$ for $y_i = +1$ (4.3.a)
$w \cdot x_i + b \leq -1$ for $y_i = -1$ (4.3.b)

These can be combined into one set of inequalities:

$y_i(w \cdot x_i + b) - 1 \geq 0, \quad i = 1, \ldots, N$ (4.3)

Now consider the points for which the equality in (4.3.a) holds. These points are support vectors which lie on the hyperplane $H_1: w \cdot x + b = +1$, with normal w and perpendicular distance from the origin $|1 - b|/\|w\|$. Similarly, the points for which the equality in (4.3.b) holds are support vectors which lie on the hyperplane $H_2: w \cdot x + b = -1$, again with normal w and perpendicular distance from the origin $|-1 - b|/\|w\|$. Hence the shortest distances from the separating hyperplane to the closest positive and negative examples are equal, $d_+ = d_- = 1/\|w\|$, and the margin is simply $2/\|w\|$. Note that $H_1$ and $H_2$ are parallel, as they have the same normal, and that no training points fall between them. Thus, the pair of hyperplanes which gives the maximum margin can be found by minimizing the objective function $0.5\|w\|^2$ subject to the constraints in (4.3). Thus the solution for a typical two-dimensional case is

    expected to have the form shown in Figure 4.2. Those training points for which the equality

    in (4.3) holds (i.e. those which wind up lying on one of the hyperplanes H1, H2) and whose

    removal would change the solution found, are called support vectors; they are indicated in

    Figure 4.2 by the extra circles.

    Fig. 4.2: Linear separating hyperplanes for the separable case. The support vectors are circled.
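As an illustration (not the report's implementation), the following sketch trains a linear SVM on synthetic separable 2-D data with scikit-learn; a very large C approximates the hard-margin, separable case, and the margin 2/||w|| and the support vectors can then be read off the fitted model:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X_pos = rng.normal(loc=[2, 2], scale=0.3, size=(20, 2))    # class +1
X_neg = rng.normal(loc=[-2, -2], scale=0.3, size=(20, 2))  # class -1
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # huge C: near hard margin
w, b = clf.coef_[0], clf.intercept_[0]
print("margin 2/||w|| =", 2 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)  # points on H1 and H2
```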


4.3 Quadratic Optimization for Finding the Optimal Hyperplane

A Lagrangian formulation is used to obtain the solution. There are two reasons for doing this. The first is that the constraints in (4.3) are replaced by constraints on the Lagrange multipliers themselves, which are much easier to handle. The second is that in this reformulation of the problem, the training data appears (in the actual training and test algorithms) only in the form of dot products between vectors. This is a crucial property which allows one to generalize the procedure to the nonlinear case.

Thus, positive Lagrange multipliers $\alpha_i$, $i = 1, \ldots, N$, are introduced, one for each of the inequality constraints in (4.3). Recall that the rule is that for constraints of the form $c_i \geq 0$, the constraint equations are multiplied by positive Lagrange multipliers and subtracted from the objective function to form the Lagrangian. For equality constraints, the Lagrange multipliers are unconstrained. This gives the Lagrangian:

$L_P = 0.5\|w\|^2 - \sum_i \alpha_i y_i (w \cdot x_i + b) + \sum_i \alpha_i$ (4.4)

The solution to the constrained optimization problem is determined by the saddle point of $L_P$, which must be minimized with respect to w and b, while simultaneously requiring that the derivatives of $L_P$ with respect to all the $\alpha_i$ vanish, all subject to the constraints $\alpha_i \geq 0$. Suppose that this set of constraints is referred to as $C_1$. Now this is a convex quadratic programming problem, since the objective function is itself convex, and those points which satisfy the constraints also form a convex set. This means that the following dual problem can be equivalently solved: maximize $L_P$, subject to the constraints that the gradient of $L_P$ with respect to w and b vanish, and subject also to the constraints $\alpha_i \geq 0$. Suppose that this set of constraints is referred to as $C_2$. This particular dual formulation of the problem is called the Wolfe dual. It has the property that the maximum of $L_P$, subject to constraints $C_2$, occurs at the same values of w, b and $\alpha$ as the minimum of $L_P$, subject to constraints $C_1$.


Requiring that the gradient of $L_P$ with respect to w and b vanish gives the conditions:

1. $w = \sum_i \alpha_i y_i x_i$ (4.5)
2. $\sum_i \alpha_i y_i = 0$ (4.6)

Since these are equality constraints in the dual formulation, substituting them into (4.4) gives:

$L_D = \sum_i \alpha_i - 0.5 \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$ (4.7)

Maximize (4.7) subject to the constraints:

1. $\sum_i \alpha_i y_i = 0$ (4.8)
2. $\alpha_i \geq 0$ (4.9)

Note that different labels are used for the Lagrangian ($L_P$ for primal, $L_D$ for dual) to emphasize that the two formulations are different: $L_P$ and $L_D$ arise from the same objective function but with different constraints, and the solution is found by minimizing $L_P$ or by maximizing $L_D$. Note also that if the problem is formulated with b = 0, which amounts to requiring that all hyperplanes contain the origin, the constraint (4.8) does not appear.

Support vector training (for the separable, linear case) therefore amounts to maximizing $L_D$ with respect to the $\alpha_i$, subject to constraint (4.8) and positivity of the $\alpha_i$, with the solution given by (4.5). Notice that there is a Lagrange multiplier $\alpha_i$ for every training point. In the solution, those points for which $\alpha_i > 0$ are called support vectors, and lie on one of the hyperplanes $H_1$, $H_2$. All other training points have $\alpha_i = 0$ and lie on that side of $H_1$ or $H_2$ such that the strict inequality in (4.3) holds. For these machines, the support vectors are the critical elements of the training set. They lie closest to the decision boundary; if all other training points were removed (or moved around without crossing $H_1$ or $H_2$) and training repeated, the same separating hyperplane would be obtained.
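For small problems the dual (4.7)-(4.9) can be solved directly with a general-purpose optimiser. The sketch below is an illustration under that assumption, not the report's method (production SVMs use specialised QP solvers); it recovers w via (4.5) and identifies the support vectors as the points with alpha_i > 0:

```python
import numpy as np
from scipy.optimize import minimize

def train_dual(X, y):
    """Solve the dual (4.7)-(4.9) for the separable linear case."""
    N = len(y)
    G = (y[:, None] * X) @ (y[:, None] * X).T        # G_ij = y_i y_j x_i.x_j

    def neg_LD(a):                                   # minimise -L_D
        return -(a.sum() - 0.5 * a @ G @ a)

    cons = {"type": "eq", "fun": lambda a: a @ y}    # sum_i alpha_i y_i = 0 (4.8)
    bounds = [(0, None)] * N                         # alpha_i >= 0 (4.9)
    res = minimize(neg_LD, np.zeros(N), bounds=bounds,
                   constraints=cons, method="SLSQP")
    alpha = res.x
    w = (alpha * y) @ X                              # w = sum_i alpha_i y_i x_i (4.5)
    sv = alpha > 1e-6                                # support vectors: alpha_i > 0
    b = np.mean(y[sv] - X[sv] @ w)                   # from y_i(w.x_i + b) = 1 on H1/H2
    return w, b, alpha
```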


4.4 The Karush-Kuhn-Tucker Conditions

The Karush-Kuhn-Tucker (KKT) conditions play a central role in both the theory and practice of constrained optimization. For the primal problem described in the previous section, the KKT conditions may be stated as:

$\partial L_P / \partial w = w - \sum_i \alpha_i y_i x_i = 0$
$\partial L_P / \partial b = -\sum_i \alpha_i y_i = 0$
$y_i (w \cdot x_i + b) - 1 \geq 0, \quad i = 1, \ldots, N$
$\alpha_i \geq 0$
$\alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right] = 0$ (4.10)

The solution vector w is defined in terms of an expansion that involves the N training examples. Note, however, that although this solution is unique by virtue of the convexity of the Lagrangian, the same cannot be said about the Lagrange coefficients $\alpha_i$. It is important to note that at the saddle point, for each multiplier $\alpha_i$, the product of that multiplier with the corresponding constraint vanishes, as shown by the last line of (4.10). Therefore, only those multipliers exactly meeting this condition can assume nonzero values.

Having determined the optimal Lagrange multipliers $\alpha_{o,i}$, the optimum weight vector is computed as:

$w_o = \sum_{i=1}^{N_s} \alpha_{o,i} y_i x_i$, where $N_s$ is the number of support vectors (4.11)

The optimum bias can be calculated as:

$b_o = 1 - w_o \cdot x^{(s)}$ for $d^{(s)} = 1$ (4.12)


4.5 Linear Support Vector Machines: The Non-separable Case

The margin of separation between classes is said to be soft if a data point $(x_i, y_i)$ violates the following condition:

$y_i(w \cdot x_i + b) \geq +1, \quad i = 1, \ldots, N$

This violation can arise in one of two ways, as shown in Figure 4.3:

• The data point falls inside the region of separation but on the correct side of the decision surface.
• The data point falls on the wrong side of the decision surface, resulting in misclassification.

This can be rectified by introducing positive slack variables $\xi_i$, $i = 1, \ldots, N$, in the constraints, which then become:

$w \cdot x_i + b \geq +1 - \xi_i$ for $y_i = +1$
$w \cdot x_i + b \leq -1 + \xi_i$ for $y_i = -1$

These can be combined into one set of inequalities:

$y_i(w \cdot x_i + b) \geq 1 - \xi_i, \quad i = 1, \ldots, N$; where the slack variables $\xi_i \geq 0$ for all i. (4.13)


    Fig. 4.3: Linear separating hyperplanes for the non-separable case.

Thus, for an error to occur, the corresponding $\xi_i$ must exceed unity, so $\sum_i \xi_i$ is an upper bound on the number of training errors. Hence a natural way to assign an extra cost for errors is to change the objective function to be minimized from $0.5\|w\|^2$ to $0.5\|w\|^2 + C(\sum_i \xi_i)^k$, where C is a parameter to be chosen by the user, a larger C corresponding to assigning a higher penalty to errors. As before, minimizing the first term is related to minimizing the VC dimension of the support vector machine. As for the second term, it is an upper bound on the number of test errors. Formulation of the cost function is therefore in perfect accord with the principle of structural risk minimization.

The parameter C controls the trade-off between the complexity of the machine and the number of non-separable points; it may therefore be viewed as a form of regularization parameter. The parameter C has to be selected by the user. This can be done in one of two ways (a sketch of the first option follows this list):

• The parameter C is determined experimentally via the standard use of a training/(validation) test set, which is a crude form of resampling.
• It is determined analytically by estimating the VC dimension and then using the bounds thus determined.
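Here is a sketch of the first option; X and y are assumed to be preprocessed feature vectors and class labels, and the candidate grid for C is an arbitrary illustrative choice:

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def select_C(X, y, candidates=(0.01, 0.1, 1, 10, 100)):
    """Pick C by accuracy on a held-out validation set (crude resampling)."""
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3,
                                                random_state=0)
    scores = {C: SVC(kernel="linear", C=C).fit(X_tr, y_tr).score(X_val, y_val)
              for C in candidates}
    return max(scores, key=scores.get)    # C with best validation accuracy
```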


For the non-separable case, the dual problem is to maximize:

$L_D = \sum_i \alpha_i - 0.5 \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$ (4.14)

subject to the constraints:

1. $0 \leq \alpha_i \leq C$ (4.15)
2. $\sum_i \alpha_i y_i = 0$ (4.16)

Having determined the optimal Lagrange multipliers $\alpha_{o,i}$, the optimum weight vector is computed as:

$w_o = \sum_{i=1}^{N_s} \alpha_{o,i} y_i x_i$ (4.17)

The optimum bias can be calculated as:

$b_o = 1 - w_o \cdot x^{(s)}$ for $d^{(s)} = 1$ (4.18)

The Karush-Kuhn-Tucker conditions are again needed for the primal problem. The primal Lagrangian is:

$L_P = 0.5\|w\|^2 + C \sum_i \xi_i - \sum_i \alpha_i \{ y_i (w \cdot x_i + b) - 1 + \xi_i \} - \sum_i \mu_i \xi_i$ (4.19)

where the $\mu_i$ are the Lagrange multipliers introduced to enforce positivity of the $\xi_i$. The KKT conditions for the primal problem are therefore:


$\alpha_i \{ y_i (w \cdot x_i + b) - 1 + \xi_i \} = 0$ (4.20)
$\mu_i \xi_i = 0$ (4.21)

The optimum bias $b_o$ is determined by taking any data point in the training set for which $0 < \alpha_{o,i} < C$, and therefore $\xi_i = 0$, and using that data point in the KKT conditions. However, from a numerical perspective it is better to take the mean value of $b_o$ resulting from all such data points in the sample.

4.6 Nonlinear Support Vector Machines

Basically, the idea of a nonlinear support vector machine hinges on two mathematical operations, summarized here:

• Nonlinear mapping of an input vector into a high-dimensional feature space that is hidden from both input and output.
• Construction of an optimal hyperplane for separating the features discovered in the previous step.

An important question at this juncture is whether or not the methods discussed in earlier sections can be generalized to the case where the decision function is not a linear function of the data, or, in other words, when the input space is made up of nonlinearly separable patterns. Cover's theorem states that such a multi-dimensional space may be transformed into a new feature space where the patterns are linearly separable with high probability, provided


    two conditions are satisfied. First, the transformation is nonlinear. Second, the dimensionality

    of the feature space is high enough. The next operation exploits the idea of building an

    optimal separating hyperplane in accordance with the theory described, but with a

    fundamental difference: the separating hyperplane is now defined as a linear function of

    vectors drawn from the feature space rather than the original input space.

4.6.1 Inner Product Kernel

Observe that the data appears in the training problem only in the form of dot products, $x_i \cdot x_j$. Now suppose the data is first mapped to some other (possibly infinite-dimensional) Euclidean space H, using a mapping $\Phi: R^d \to H$. Then the training algorithm would only depend on the data through dot products in H, i.e., on functions of the form $\Phi(x_i) \cdot \Phi(x_j)$. Now, if there were a kernel function K such that $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$, only K would need to be used in the training algorithm, and one would never need to know $\Phi$ explicitly. One example is $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$. In this particular example, H is infinite-dimensional, so it would not be very easy to work with $\Phi$ explicitly. However, if one replaces $x_i \cdot x_j$ by $K(x_i, x_j)$ everywhere in the training algorithm, the algorithm produces a support vector machine in an infinite-dimensional space, and furthermore does so in roughly the same amount of time it would take to train on the unmapped data.
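The Gaussian kernel above is easy to compute directly, and most libraries expose it under a slightly different parameterisation. The sketch below (illustrative, with hypothetical names) builds the kernel matrix and notes the correspondence gamma = 1/(2*sigma^2) used by scikit-learn's SVC:

```python
import numpy as np
from sklearn.svm import SVC

def rbf_kernel_matrix(X1, X2, sigma=1.0):
    """Gaussian kernel K(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2))."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

# Equivalent off-the-shelf training: sklearn writes the same kernel as
# exp(-gamma ||xi - xj||^2), so set gamma = 1 / (2 * sigma**2).
# clf = SVC(kernel="rbf", gamma=1 / (2 * sigma**2)).fit(X, y)
```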

Suppose that the data belongs to a space denoted L. (The notation L is a mnemonic for low-dimensional, as the notation H for the range space is for high-dimensional.) Evidently, a map $\Phi$ is not necessarily onto; i.e., there need not exist an element in L that is mapped onto a specific element in H.

Let $\varphi_j(x)$, $j = 1, \ldots, M$, where M is the number of hidden units, denote a set of nonlinear transformations, and define a hyperplane acting as the decision surface as follows: $\sum_{j=1}^{M} w_j \varphi_j(x) + b = 0$, where w defines the vector of linear weights connecting the feature space to the output space, and b is the bias. The quantity $\varphi_j(x)$ represents the input supplied to the weight $w_j$ via the feature space. In effect, the vector $\varphi(x)$ represents the image induced in the feature space due to the input vector x. Thus, in terms of this image define the decision