
A Correlation Based Stereo Vision System for Face Recognition Applications

Daniel Bardsley (djb01u) [email protected]

Supervised by: Bai Li [email protected]

April 2004, University of Nottingham


Contents

1 Abstract
2 Introduction
3 Goals and Motivation
4 Literature Review
4.1 Face Recognition
4.2 3D Reconstruction
4.3 Surface Estimation
4.4 Summary
5 System Outline
6 Calibration
6.1 Intrinsic and Extrinsic Parameters
6.2 Parameter Estimation
6.3 Calibration Testing
7 Rectification
8 Correlation
8.1 Input Point Detection
8.2 Intensity Based Pixel-wise Correlation
8.3 SSD
8.4 ZMNCC
8.5 Correspondence Testing
8.6 Matching Constraints
8.7 Constraint Testing
8.8 Alternative Correspondence Measures
9 Projective Reconstruction
9.1 Reconstruction Testing
10 Surface Estimation
10.1 Surface Estimation Testing
10.2 Texture Mapping
11 Implementation
11.1 Design Choices
11.2 Application Architecture
11.3 Data Structures and Algorithms
11.4 Implementation Results
12 Software Libraries
13 Results
14 Conclusions and Future Work
15 Bibliography


List of Figures

Figure 1: High level outline of the reconstruction system
Figure 2: Calibration Dialog Screenshot
Figure 3: Artificial Test Rig Camera Configuration
Figure 4: Graphical representation of epipolar geometry
Figure 5: A Rectified Input Image Pair
Figure 6: Stereo pair input image (left) and ground truth disparity data (right)
Figure 7: SSD (left) and ZMNCC (right) disparity maps
Figure 8: Effects of constraint application
Figure 9: 3D Studio Max cube reconstruction. Test input (purple cube) and reconstructed output (red spheres) shown from left, right, top and perspective views
Figure 10: Reconstructed cube output after mesh triangulation and surface construction
Figure 11: Original face rendering (left) and the reconstructed mesh (middle) along with a full surface reconstruction (right)
Figure 12: Texture mapped model reconstruction
Figure 13: The raw data view displays actual pixel co-ordinates of the point matches, reconstructed points, normalised model co-ordinates and raw calibration data
Figure 14: Simplified UML diagram of the user interface / MFC portion of the FaceScanner application. Some fields and methods have been omitted for conciseness
Figure 15: Simplified UML diagram of VisionLib, the library containing all the computer vision related code within the project
Figure 16: FaceScanner application screenshot
Figure 17: Fully automatic reconstruction of a synthesized face from stereo images. White dots on the 3D model show initial point match positions


1 Abstract

Three-dimensional reconstruction using stereo vision is an important research topic in computer science. A computer's ability to perceive the world in which it is situated has applications in many areas of industry; face recognition is an area of comparable interest. The fusion of the two subject areas should allow the differing techniques to complement each other, improving recognition results and robustness against varying recognition conditions. We explore the reconstruction process in general and the specifics of implementing a stereo vision system aimed at reconstructing face surfaces in three dimensions, with particular attention paid to the suitability of the output models for face recognition systems.

2 Introduction

Computer vision is one of the fastest growing areas within computer science. Aided by rapid recent progress in hardware and software design, computer vision projects are making use of vast increases in processing and memory capacity to enhance their performance. For computers to effectively process, segment and analyse visual input of their environments, it is often a requirement that the system can obtain data about the surrounding world in a format that can be easily related to the actual environment in which the system finds itself. For many vision systems this is a three-dimensional representation of the real world. Humans achieve this quite naturally from an early age, and it soon becomes second nature for us to accurately judge distance, perspective and space; however, when the human visual system is analysed it becomes apparent that the brain uses a multitude of techniques to give us a sense of the three-dimensional world in which we live.

A vision system can obtain depth data from a scene using a number of different techniques. Three-dimensional scene data can be recovered from sources including object shading, motion parallax, structured light and laser range finders. Perhaps the most obvious technique, however, is stereo vision. In a system analogous to a pair of human eyes, the input to two cameras observing the same scene can be analysed and the differences between the two images used to compute object depth, and hence a model of the scene the system is viewing. The uses of a robust implementation of such a system are many, and potentially include applications in areas such as space flight [18], face recognition [23], immersive video conferencing [56] and industrial inspection [20], to name just a few.


3 Goals and Motivation

Traditionally, face recognition algorithms have achieved high levels of accuracy when the subject face is presented in a frontal pose. Balasuriya and Kodikara describe one such system in [6], which utilises principal component analysis to achieve reasonable levels of accuracy. Indeed, [45] furthers this work to produce a system capable of face authentication under difficult image conditions such as "linear and non-linear illumination change, white Gaussian noise and compression artefacts" [45]. Both of these systems, and many others, suffer reduced accuracy when a non-frontal face pose is used as input, limiting their usefulness in many application areas where they might otherwise have been deployed. Despite work to develop more advanced algorithms that display a higher degree of pose invariance [32, 43, 58], the greatest accuracy seems achievable only when the face is presented frontally. Work to create systems with a higher degree of pose invariance has led down a number of paths, including, for example, the development of systems that synthesize face images under varying pose conditions and then use these synthesized images as a basis for recognition [34]. Other systems fully reconstruct a 3D face surface, either from scratch [1] or by deforming a generic head model [23]. Input into the various face reconstruction systems ranges from structured light and range scanner data to standard video images. The first two of these options require hardware in addition to image capture devices to obtain range data, and often require the subject to be positioned in a controlled environment whilst the data is acquired. As a more versatile solution, applicable to a wider variety of applications, the use of standard images as input is preferable. This method, however, suffers reduced reconstruction accuracy compared to hardware range finders, due to a greater number of interfering factors (illumination, pose, background clutter, etc.) and inherent difficulties in implementing successful correlation techniques.

Our work will consider generating a face surface model, initially without the aid of a generic head model, from a single pair of calibrated stereo images captured from standard CCTV cameras. Output will take the form of a 3D, texture-mapped surface model of the subject face, with the additional aim that the models produced be suitable for later input into a recognition system. The motivation behind this choice is the desire to produce high quality reconstructions without dedicated range-finding hardware, and to potentially use the reconstructions as part of a pose-invariant recognition system. To implement such a system, algorithms and software will need to handle initial calibration, the correspondence problem, mesh generation and object reconstruction. Since a wide variety of methods and techniques are required simply to obtain a reconstruction, the process of recognition is beyond the scope of this project. Furthermore, since a wide variety of algorithms could be implemented at each stage of the reconstruction, the project will focus


some attention on the development of a robust reconstruction framework for testing, analysis and comparison of each algorithm. A final aim throughout development is to ensure that general computer vision algorithms which may prove useful in other applications are implemented in a reusable library, enabling their functionality to be leveraged in future projects.

As a final general aim, it is important that the application interface is of commercial standard. Reconstruction entails high data collection, processing and display demands, as well as relatively complicated user interaction with this data. To this end, data must be displayed and collected in an intelligent and intuitive environment so that the volume of data can be analysed in a useful manner.

In summary, the high-level aims of the project are:

• Research and develop methods of stereo camera rig calibration using standard CCTV cameras and appropriate capture cards.
• Develop solutions to the correspondence problem.
• Research appropriate mesh generation and surface reconstruction algorithms.
• Produce surface models suitable for input into a potentially pose-invariant recognition system.
• Create a working implementation of a face surface reconstruction application with a commercial-strength user interface.

Application specific implementation goals are discussed in greater detail in the

implementation section.


4 Literature Review

Due to the wide variety of possible implementations of 3D vision systems, much research has been devoted to the field. The effectiveness of different methods and their appropriateness for specific applications has been widely considered. Moyung provides an excellent overview of many of the fundamental problems associated with stereo vision [42]; here, however, we consider the development of a multitude of algorithms and techniques both for the recovery of scene depth information and for the reconstruction of the acquired three-dimensional data, paying particular attention to the suitability of the data for face recognition applications.

4.1 Face Recognition

Analysis of the human visual system shows that we can quickly recognise a large number of faces, suggesting the human brain may use only a small number of parameters to identify each face [30]. The problem of compressing face data to a few parameters without reducing accuracy is non-trivial. Principal component analysis (PCA) can be utilised to aid the parameterisation of this data, and, through feature comparison, early face recognition systems often used this and related techniques to achieve recognition [3, 29]. Principal component analysis involves selecting image features such that the original data is represented accurately by a reduced data set. For example, if one image feature can be accurately predicted from another, then that feature is clearly redundant and need not be included in the dataset. Furthering the idea of removing redundant image features, we can create new features as functions of the old ones. Forsyth and Ponce state: "in principal component analysis, we take a set of data points and construct a lower dimensional linear subspace that best explains the variations of these data points from their mean" [19]. The PCA-compressed forms of the images were known originally as eigenpictures and, more recently when applied specifically to face recognition, eigenfaces. Faces are indexed, stored and compared in eigenface form for recognition. Yambor, Draper and Beveridge discuss a variety of recent PCA-based approaches to face recognition in [57].
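The subspace construction described above can be sketched in a few lines of NumPy. This is an illustrative sketch only: the random "face" vectors and the choice of five components are invented stand-ins, not data from this project.

```python
import numpy as np

# Hypothetical data: 20 synthetic "face" vectors of 64 pixels each.
rng = np.random.default_rng(0)
faces = rng.normal(size=(20, 64))

# Centre the data about the mean face, as PCA requires.
mean_face = faces.mean(axis=0)
centred = faces - mean_face

# The principal components (eigenfaces) are the right singular
# vectors of the centred data matrix.
_, singular_values, components = np.linalg.svd(centred, full_matrices=False)

# Keep only the k components that best explain the variance.
k = 5
eigenfaces = components[:k]

# Each face compresses to k coefficients; projecting back gives the
# best rank-k approximation of the original face.
weights = centred @ eigenfaces.T            # shape (20, k)
reconstructed = mean_face + weights @ eigenfaces
```

Recognition then amounts to comparing the k-dimensional weight vectors rather than the raw pixel data.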

Whilst it is possible to achieve good results using these methods, faces are generally required to be in a frontal pose. Several modifications have been proposed to overcome this limitation. Investigations into eigenspaces as a solution to this and other variant image property problems are discussed in [43] and [58], with the latter paying particular attention to the robustness of the eigenspace solution under variant pose conditions. However, whilst these solutions improve pose invariance to a degree, they do not provide a general solution to the frontal pose problem.


Taking a different approach, several projects have investigated the potential of utilising a true three-dimensional representation of the face in an effort to give recognition systems a greater degree of pose invariance. Beumier and Acheroy [8] present such a system for automatic face recognition from 3D surface models. The system projects parallel white light stripes and reconstructs depth information by analysing the deformation of the projected pattern. The authors describe the development of a system which is "adapted to facial surface acquisition" [8] to make use of domain-specific knowledge. Examples of this type of domain knowledge include the utilisation of symmetric face properties and other invariant face features to aid the feature correspondence stages of the reconstruction. In contrast to this technique, [23] utilises basic stereo image pairs as input, without additional reconstruction aids. Furthermore, rather than attempting to use the 3D data in a direct reconstruction, the system deforms a generic head model to the parameters specified by the acquired data. The advantages of such a system usually include increased accuracy and speed, due to the amount of data initially available to the system in the form of the generic model. However, the system is specifically designed to consider only head models and as such does not have as wide an application outside the face recognition field as it might. Lee and Ranganath propose "a novel, pose-invariant face recognition system based on a deformable, generic 3D face model" in [32], which is comparable to the system described in [23] and, according to the authors, is capable of a recognition success rate of 92.3% over a test data set of 660 images.

4.2 3D Reconstruction

To utilize 3D data in any of the recognition systems described above, it is first necessary to obtain the data. It can come from a number of sources, including hardware range finders and standard image capture devices. Using standard image capture devices, our input is restricted to two dimensions; a number of techniques can, however, be used to analyze the image data in order to deduce 3D information.

Systems that utilise motion cues to directly reconstruct 3D data exist, but are not appropriate or accurate enough to handle the intricacies of the human face. [49] describes such a system, which utilises motion between image frames to simultaneously segment objects in a scene and produce a relative depth ordering of them. Whilst the data available from motion cues could potentially be useful in segmentation, feature extraction and layer recovery, it is not a suitable technique for capturing face features and as such is of little use for 3D surface recognition systems, except perhaps as a tool for initial segmentation.

A traditional and much more common approach to 3D reconstruction is represented by the mass of stereo correspondence based reconstruction techniques. Image points are matched across stereo image pairs and then reconstructed in three dimensions. The most common class of correspondence measures are pixel-based algorithms [13, 28], which compare similarity between pixels across images in order to deduce likely matching image points. The problem of matching 2D camera projections of real-world points across stereo image pairs raises a host of additional issues, including input point selection and "good" match selection. Keller conducts a comprehensive evaluation of matching algorithms and match quality measures in [27].
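The pixel-based matching idea can be sketched for a single rectified scanline. SSD and ZMNCC are the two window measures examined later in this report; the synthetic scanlines, window size and disparity range below are invented for illustration and are not this project's implementation.

```python
import numpy as np

def ssd(a, b):
    # Sum of squared differences: lower means more similar.
    return float(np.sum((a - b) ** 2))

def zmncc(a, b):
    # Zero-mean normalised cross-correlation: higher means more similar,
    # and the zero-mean normalisation gives some illumination invariance.
    a0, b0 = a - a.mean(), b - b.mean()
    denom = np.sqrt(np.sum(a0 ** 2) * np.sum(b0 ** 2))
    return float(np.sum(a0 * b0) / denom) if denom else 0.0

def match_disparity(left, right, x, half=2, max_disp=8):
    # Compare the window around x in the left scanline with windows at
    # x - d in the right scanline; return the disparity d minimising SSD.
    win_l = left[x - half:x + half + 1]
    best_d, best_score = 0, np.inf
    for d in range(0, max_disp + 1):
        if x - d - half < 0:
            break
        win_r = right[x - d - half:x - d + half + 1]
        score = ssd(win_l, win_r)
        if score < best_score:
            best_d, best_score = d, score
    return best_d

# Synthetic rectified scanlines: the right view sees the feature 3 px
# to the left of where the left view sees it (disparity 3).
left = np.array([0, 0, 0, 0, 1, 5, 9, 5, 1, 0, 0, 0], dtype=float)
right = np.roll(left, -3)
```

Repeating `match_disparity` for every pixel of every scanline yields the disparity maps compared in the correlation section.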

As an alternative to pixel-based correspondence measures, feature-based approaches [21, 47] have also been considered. Here, common face features are detected first and their relative positions used to calculate the head pose. Advanced vision techniques such as Gabor jet features and bunch graph matching can be used to aid reconstruction [16]. Once the pose has been calculated from a set of feature points, it becomes possible to synthesize the face in any orientation. This kind of technique, when combined with a deformable model to produce a much more accurate face description, results in some of the most accurate facial reconstructions available to date.

A prerequisite for some reconstruction stages is camera calibration. This involves automatically calculating properties of the stereo camera rig. Several techniques have been proposed, both for the mathematics behind calibration (linear, non-linear and two-step methods) and for obtaining calibration data (from motion, from calibration patterns or directly from a scene). Multiple-stage calibration procedures which seek to minimise an error function over time are the current norm [5, 53]: multiple images of a calibration pattern are captured and used as input into a constrained set of equations. An alternative to using a calibration pattern is to perform "on the job" calibration whilst the reconstruction target object is being viewed. This fully automatic approach, described by Maas [35], is only possible under multiple camera geometries; however, it "can be considered a versatile and reliable method for the calibration of photo-grammetric systems" [35]. Since much of the accuracy of the overall system is dependent on the quality of the calibration, it is essential that this stage of the reconstruction is accurately implemented.
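What calibration ultimately delivers can be illustrated with the pinhole projection model: the intrinsic matrix K and the extrinsic pair (R, t) combine into a 3x4 projection matrix P = K[R|t]. The parameter values below are arbitrary stand-ins for illustration, not measured calibration output from this project.

```python
import numpy as np

# Assumed intrinsic parameters: focal lengths in pixels, principal point.
fx, fy, cx, cy = 800.0, 800.0, 320.0, 240.0
K = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])

# Assumed extrinsic parameters: identity rotation, camera 1 m behind
# the world origin along the optical (Z) axis.
R = np.eye(3)
t = np.array([[0.0], [0.0], [1.0]])
P = K @ np.hstack([R, t])          # 3x4 projection matrix

def project(P, X):
    # Project a 3D world point and dehomogenise to pixel coordinates.
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# A world point on the optical axis projects to the principal point.
pixel = project(P, np.array([0.0, 0.0, 1.0]))
```

Calibration is the inverse problem: given many observed `pixel` / world-point pairs from a calibration pattern, recover K, R and t by minimising the reprojection error.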

A number of research papers devoted to the development of stereo vision systems discuss problems associated with the reconstruction process [37, 59]. Zu, Ku and Chen implement a system which utilises stereo camera input to seed an SSD intensity-based correspondence matching stage. They state that their results were "not optimal" [59] and contained high levels of noise in the model output, citing the correspondence algorithms as the under-performing sub-section of the system. McLauchlan suggests a recursive approach to tackling these problems and attempts to "develop algorithmic and statistical tools that combine data from multiple images" [37] in order to accumulate scene information over a given time window. He shows that the recursive approach to scene reconstruction increases system accuracy. Other research by the same author details further reconstruction techniques which utilise data over a number of image frames to recursively improve captured data [38-40].

4.3 Surface Estimation

In addition to the vast amount of literature on the reconstruction of three-dimensional data, a large amount of research has gone into the development of algorithms to convert the possibly incomplete point cloud data produced by the earlier system stages into more useable forms, such as meshes or other 3D surfaces. One possible technique for this process is discussed in [14], where simulated annealing is used to create an optimal surface mesh. Much more advanced techniques, capable of dealing with situations such as incomplete meshes or other errors, are also available. One such technique is discussed in [12], where surfaces are represented completely by polyharmonic radial basis functions (RBFs). Fast methods for fitting and evaluating RBFs have been developed which allow such techniques to be implemented quickly and efficiently; this type of representation also lends itself to the efficient processing of large data sets. Since we expect to be matching a large number of face points, a solution such as this for representing face models may be required in the future.
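A minimal sketch of the RBF idea, assuming the linear polyharmonic kernel phi(r) = r and a handful of invented sample points. Production RBF fitters such as the one in [12] also add a low-degree polynomial term and fast evaluation structures, both omitted here.

```python
import numpy as np

def fit_rbf(points, values):
    # Polyharmonic kernel phi(r) = r. Solving A w = v, where
    # A[i, j] = |p_i - p_j|, makes s(x) = sum_i w_i * phi(|x - p_i|)
    # interpolate every sample exactly.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    return np.linalg.solve(dists, values)

def eval_rbf(points, weights, x):
    # Evaluate the fitted surface at an arbitrary location x.
    r = np.linalg.norm(points - x, axis=1)
    return float(r @ weights)

# Invented scattered samples of the plane z = x + y.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([0.0, 1.0, 1.0, 2.0])
weights = fit_rbf(points, values)
```

The appeal for face data is that the surface is defined everywhere, so holes left by failed correspondences are filled smoothly by the interpolant.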

In addition to recent advancements in mesh generation and surface reconstruction, a number of algorithms developed some time ago are still proving useful. Convex hulls are an important topic in computational geometry and form the basis of a number of calculations relating to mesh construction. QuickHull is a widely used algorithm for computing the convex hull of a point set and is defined in greater detail in [7]. Delaunay triangulations are an example of a set of algorithms with their mathematical basis in convex hull calculations. The Delaunay method subdivides the volume defined by the input point cloud into tetrahedra with the property that the circumsphere of every tetrahedron contains no other point of the triangulation. Various authors have developed constraints on this basic method to improve triangulation accuracy and efficiency; Kallmann, Bier and Thalmann discuss algorithms for "the efficient insertion and removal of constraints in Delaunay Triangulations" in [26]. With the addition of a set of constraints, Delaunay triangulations are capable of generating meshes suitable for our surface requirements. Further to this, Bourke provides an algorithm for efficient triangulation of irregularly spaced data points in [10]; Bourke's work has specific applications in terrain modelling but is based on the Delaunay method and as such is relevant to the general surface construction problem.
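The empty-circumcircle property that defines the triangulation can be checked with the classic determinant predicate, shown here in 2D for brevity (the 3D case uses the analogous 5x5 circumsphere determinant). The test points are illustrative.

```python
import numpy as np

def in_circumcircle(a, b, c, p):
    # Determinant predicate: positive when p lies strictly inside the
    # circumcircle of the counter-clockwise triangle (a, b, c).
    # A triangulation is Delaunay when no point of the input set lies
    # inside the circumcircle of any of its triangles.
    m = np.array([
        [a[0] - p[0], a[1] - p[1], (a[0] - p[0])**2 + (a[1] - p[1])**2],
        [b[0] - p[0], b[1] - p[1], (b[0] - p[0])**2 + (b[1] - p[1])**2],
        [c[0] - p[0], c[1] - p[1], (c[0] - p[0])**2 + (c[1] - p[1])**2],
    ])
    return np.linalg.det(m) > 0
```

Incremental Delaunay algorithms apply exactly this test to decide which existing triangles a newly inserted point invalidates.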


Another volumetric reconstruction method that has been used effectively in past work is the marching cubes algorithm [33]. As with the Delaunay methods, marching cubes has been subject to numerous modifications and algorithmic improvements [11, 50]. The basic form of the algorithm splits the data space into a series of sub-cubes. The eight sample points, known as voxels, that form each sub-cube are considered for triangulation. When one sub-cube is fully processed the algorithm moves ("marches") on to the next, until a complete surface has been reconstructed. The original marching cubes technique "did not resolve ambiguous cases… resulting in spurious holes and surfaces in the surface representation for some datasets" [11]; however, several recently proposed improvements deal with such cases [11, 50, 51] to provide more complete surface reconstructions.
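The marching step can be sketched as follows. This sketch only classifies each sub-cube against the iso-value; the spherical test field, grid size and iso-value are invented for illustration, and the 256-case triangle lookup that the full algorithm applies to each crossed cube is omitted.

```python
import numpy as np

def surface_cubes(field, iso):
    # March over every sub-cube of the voxel grid. A cube whose eight
    # corner samples are neither all above nor all below the iso-value
    # is crossed by the surface; the full algorithm would then emit
    # triangles for it from a 256-entry configuration table.
    nx, ny, nz = field.shape
    crossed = []
    for i in range(nx - 1):
        for j in range(ny - 1):
            for k in range(nz - 1):
                corners = field[i:i + 2, j:j + 2, k:k + 2]
                inside = corners < iso
                if inside.any() and not inside.all():
                    crossed.append((i, j, k))
    return crossed

# Scalar field: distance from the grid centre, so the iso-surface
# at 2.5 is a sphere of that radius.
n = 8
coords = np.indices((n, n, n)) - (n - 1) / 2.0
field = np.sqrt((coords ** 2).sum(axis=0))
cubes = surface_cubes(field, iso=2.5)
```

Only the crossed cubes need further processing, which is what makes the method efficient on large voxel grids.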

4.4 Summary

The work described above covers the majority of the techniques required to implement our stereo vision system. From calibration and correspondence to reconstruction, vast quantities of research have been carried out to achieve maximum performance and accuracy. Some stages of the reconstruction process are now considered solved: for example, re-creating a set of points in three dimensions, once a suitable calibration and the locations of matching image points are available, is trivial and only requires manipulation of the appropriate projection equation. Other stages of the process are not so completely understood. The correspondence problem is widely regarded as the most error-prone stage of any reconstruction: intensity-based measures tend to fail on natural images, whilst other methods, such as wavelet matching [9], have only recently been investigated and still require additional development before gaining acceptance in the computer vision community. A number of systems, both commercial and research-based, provide almost fully autonomous face scanning to a reasonable degree of accuracy. Research-based face recognition systems which utilise 3D scanning as input to a recognition sub-system have very recently attracted much interest, and it is likely that commercial implementations of such systems will appear over the next year.


5 System Outline

In this paper we consider acquiring stereo image pairs from a set of automatically calibrated CCTV cameras, investigating various stereo correspondence algorithms and reconstructing the three dimensional data to produce a fully textured model of the face. Initially the 3D data will be used to directly reconstruct the face rather than using any deformable model techniques, although this may be a consideration for future work. The basic system outline is detailed in Figure 1.

Figure 1: High level outline of the reconstruction system.

Initially the system has no knowledge of either the type or position of the cameras which will be used as input to the system. The first task, therefore, is to calibrate the cameras. This involves obtaining a set of internal camera parameters followed by a set of external parameters. Once calibration is complete we have a set of parameters which completely define the setup of the camera rig. The calibration parameters are used at later stages of the reconstruction process, both to optimise the efficiency of correspondence matching and during the actual reconstruction calculations.


Following calibration the next stage is to capture the input images of the object we are trying to reconstruct and find corresponding points between the two images. Once we have a set of matching points we can use the calibration data along with the point match data to calculate the 3D position of each of the points for which we have a match. At this stage in the process we have an unorganised cloud of 3D points. In order to make the data more useful the point cloud is transformed into a meshed surface representing the original object. The final task of the reconstruction routines is to map the newly created surface with a texture lifted from the original input images to give the reconstruction a sense of realism. Once this stage has been reached we should be left with a textured surface that accurately represents the original object.


6 Calibration

In order for it to be possible to reconstruct a scene from a stereo image pair, several important properties of each of the cameras must be known. Obtaining values for these properties is known as calibration. Techniques for camera calibration loosely fall into three categories: linear, non-linear and two-step. Linear techniques assume a simple pinhole camera model [52] and do not account for lens distortion effects, which turn out to be “significant in most off-the-shelf charge coupled devices” [41]. In non-linear methods a relationship between parameters is established and then an iterative solution is calculated through minimisation. Many early vision systems used this non-linear technique and have since been modified to take into account camera lens distortions; however, in order for the minimisation to function correctly, good initial estimates of the camera parameters must be made to avoid converging to an incorrect solution. Finally, two-step techniques use a combination of linear and non-linear methods to find a direct solution for some parameters and iteratively estimate others. The two-step method is the most commonly implemented solution at present.

6.1 Intrinsic and Extrinsic Parameters

In order for reconstruction to take place we effectively need to be able to translate from the image co-ordinates as seen by the camera system into real world 3D co-ordinates. The co-ordinate systems that we need to translate between are related by two sets of parameters, intrinsic and extrinsic. Camera calibration is an optimisation process in which the error between observed image features and their theoretical positions is minimised with respect to these parameters. The intrinsic parameters are determined by the optics and digital sensor of each camera. These parameters determine the perspective projection of a three dimensional point onto a two-dimensional image plane. The required variables for each camera are the focal length, the effective pixel width and height and the principal point of the camera. The extrinsic camera parameters consist of a 3×3 orthogonal rotation matrix and a translation vector describing the transformation required to move from one co-ordinate system to the other. Essentially calibration is the process of calculating two matrices which fully represent both the internal and external parameters of the cameras being calibrated. The two matrices which require calculation take the following forms:

M_int = | f/s_x    0     o_x |
        |   0    f/s_y   o_y |
        |   0      0      1  |

M_ext = | r_11  r_12  r_13  t_1 |
        | r_21  r_22  r_23  t_2 |
        | r_31  r_32  r_33  t_3 |

where f is the focal length, s_x and s_y the effective pixel width and height, (o_x, o_y) the principal point, r_ij the entries of the rotation matrix and t_i those of the translation vector.

In order to obtain these values in a straightforward manner it is possible to utilise a calibration pattern with known control point positions. Whilst there are a number of possible calibration


patterns, the simplest to use involves a planar calibration pattern (a chessboard). When all the chessboard's squares are visible to both cameras it is possible to capture a stereo image pair, analyse both images with a corner detector and begin making deductions about camera parameters. This is possible since we know the physical properties of the chessboard and can deduce its position and rotation entirely from the relative location of the internal board corners (the control points). Utilising this data from both cameras it then becomes possible, over a sequence of frames, to further deduce the relative position and rotation of the two cameras (and hence the extrinsic parameters) as well as the focal length, pixel width, height and principal point (and hence the intrinsic parameters). Once this process is finished the calibration stage is complete and the obtained data can be reused whilst the camera setup remains the same.

6.2 Parameter Estimation

Our implementation of the calibration stage of the system relies heavily on a number of functions found in the Intel OpenCV image library. A multi-plane approach to calibration which draws on both photogrammetric and self-calibration techniques is described below and is the basis of our calibration functionality. A planar calibration pattern is used since, unlike other alternatives, it can be printed on standard paper and fixed to a rigid object rather than requiring a more complicated construction. Initially multiple images of the calibration pattern are captured and control points in each of the images are found. The algorithm in question is based on a homography which maps points on one plane to points on another plane using a linear transformation. The following description of the calibration routines is based on that of [55]. To begin we must consider the following definitions:

m = [u, v]^T        M = [x, y, z]^T        A = M_int

m~ = [u, v, 1]^T    M~ = [x, y, z, 1]^T

Where m is a 2D co-ordinate, M a 3D co-ordinate and s some arbitrary scale factor. The camera's projection equation can therefore be written as:

s m~ = A [R t] M~

This approach is based on the first fundamental theorem of projective geometry which states “There exists a unique homography that performs a change of basis between two projective spaces of the same dimension” [17]. Thus given any plane in world space, there is a mapping between the plane and any additional images of it. This mapping is defined up to the scale factor s and can be derived through the expansion of the camera projection equation detailed above. Expanding the projection equation gives us:


s (u, v, 1)^T = H (x, y, 1)^T

and thus a point on the world plane is mapped to the image plane with:

s m~ = H M~

H in the above equation is a homography. Homographies can be estimated from four points, and over a sequence of frames it becomes possible to build a system of homogeneous equations such that we can estimate both the intrinsic and extrinsic parameters of the cameras and hence calculate our calibration matrices for later use.
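The estimation of a homography from four or more point pairs can be sketched with the standard direct linear transform (DLT). The following numpy fragment is illustrative only (the system itself uses OpenCV routines for this step), and the function name is our own:

```python
import numpy as np

def estimate_homography(src, dst):
    """Estimate H (up to scale) from >= 4 point pairs via the DLT.

    src, dst: (N, 2) arrays of corresponding plane points.
    """
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence contributes two rows of the system A h = 0,
        # obtained from u (h3 . x~) = h1 . x~ and v (h3 . x~) = h2 . x~.
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.asarray(rows, dtype=float)
    # h is the right singular vector of A with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 3)
```

With exact correspondences the recovered matrix equals the true homography up to scale, which is the ambiguity expressed by the factor s above.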

From the above equation we can expand to:

[h_1 h_2 h_3] = s A [r_1 r_2 t]

Re-writing the homography equations in column form:

h_1 = s A r_1
h_2 = s A r_2
h_3 = s A t

Some basic constraints on the camera parameters can now be stated. Since the columns of a rotation matrix are orthonormal:

r_i^T r_j = 0  (i ≠ j)
r_i^T r_i = r_j^T r_j = 1

It then becomes possible to derive two basic constraints on the parameters we are trying to obtain. Rearranging the column equations gives

r_1 = (1/s) A^-1 h_1
r_2 = (1/s) A^-1 h_2

and substituting these into r_1^T r_2 = 0 and r_1^T r_1 = r_2^T r_2 yields

h_1^T A^-T A^-1 h_2 = 0
h_1^T A^-T A^-1 h_1 = h_2^T A^-T A^-1 h_2


The two constraints are represented by h_1^T A^-T A^-1 h_2 = 0 and h_1^T A^-T A^-1 h_1 = h_2^T A^-T A^-1 h_2. Each homography has eight degrees of freedom, of which six are taken up by the extrinsic parameters (rotation and translation); each known homography therefore provides two constraints on the five intrinsic parameters. Hence three or more homographies are required to determine the intrinsic parameters.

A closed form solution of the camera calibration can then be obtained. Let B = A^-T A^-1. Since B is symmetric it is fully described by a vector of 6 parameters:

b = [B_11, B_12, B_22, B_13, B_23, B_33]^T

Writing h_i^T B h_j = v_ij^T b, where v_ij is built from the entries of the homography columns h_i and h_j, the two constraints on the intrinsic parameters can be used to build a system of homogeneous equations:

| v_12^T          |
| (v_11 - v_22)^T |  b = 0

For each image in the calibration it is possible to stack the corresponding pair of equations into the system above and thus solve for b. Once b has been obtained we can solve for the intrinsic parameters.
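The stacking of the two constraints per homography into the homogeneous system V b = 0 can be sketched as follows. This is an illustrative numpy fragment consistent with the notation of [55] above, not the OpenCV implementation the system actually relies on:

```python
import numpy as np

def v_ij(H, i, j):
    """Row vector v_ij such that h_i^T B h_j = v_ij . b,
    with b = [B11, B12, B22, B13, B23, B33]."""
    hi, hj = H[:, i], H[:, j]
    return np.array([
        hi[0] * hj[0],
        hi[0] * hj[1] + hi[1] * hj[0],
        hi[1] * hj[1],
        hi[2] * hj[0] + hi[0] * hj[2],
        hi[2] * hj[1] + hi[1] * hj[2],
        hi[2] * hj[2],
    ])

def stack_constraints(Hs):
    """Stack the two intrinsic-parameter constraints per homography
    into the matrix V of the homogeneous system V b = 0."""
    rows = []
    for H in Hs:
        rows.append(v_ij(H, 0, 1))                  # h1^T B h2 = 0
        rows.append(v_ij(H, 0, 0) - v_ij(H, 1, 1))  # h1^T B h1 = h2^T B h2
    return np.vstack(rows)
```

Solving V b = 0 in the least squares sense (the right singular vector of V with smallest singular value) recovers b up to scale, from which the intrinsic matrix A can be decomposed.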

6.3 Calibration Testing

Figure 2 shows the calibration process in progress. The left and right cameras in the stereo rig capture simultaneous input of the calibration pattern. The two images then undergo thresholding in order to deduce the chessboard corner positions. As can be seen in Figure 2, the corners of each square on the board are marked and matched to the corresponding corner on the opposite input image. The order of the points as they appear in both images is used as a constraint to ensure that each square corner is matched correctly.


Figure 2: Calibration Dialog Screenshot

In order to test the calibration algorithms it is necessary to attempt a calibration on a series of images for which the calibration results are already known, so that our calculations can be compared for accuracy. Since it is difficult to obtain actual data regarding intrinsic parameters, and similarly difficult to make measurements relating the position of the two cameras in the stereo rig, the calibration routines were tested using a sequence of synthesized images for which the actual calibration parameters were known.

The results obtained from the test calibration sequence are shown in Table 1. Since the cameras are virtual, and the left camera is an exact copy of the right camera, it can be assumed that the intrinsic parameters of both cameras will be identical. It can be seen from the results that the calibration procedure produces almost identical values for left and right camera intrinsic parameters. Since the values between left and right cameras are almost equal it can be assumed that the automatic detection of these parameters is being performed correctly. Since the calibration can be viewed as a process of constrained optimisation over a sequence of input images, it can be assumed that the slight errors in the calculations are a result of either slight discrepancies in these optimisations, inaccuracies in the chessboard corner finding algorithms, or, since the input images were rendered using JPEG compression, a result of image quality degradation. Despite these minor inaccuracies the calculation of intrinsic parameters appears to be correct.

                   Left Camera                    Right Camera

Intrinsic      771.642    0      322.097     770.628    0      322.251
Parameters       0      772.772  239.779       0      773.177  239.432
                 0        0        1           0        0        1

Translation   -15.984  -14.67    66.785       3.3    -14.635   66.955

Rotation        0.961   -0.019    0.276       0.961   -0.02     0.277
               -0.026    0.987    0.159      -0.026    0.987    0.161
               -0.276   -0.16     0.948      -0.276   -0.162    0.947

Table 1. Calibration Test Sequence Results


The geometry of the test camera rig was such that both cameras lay at the same position on the vertical axis and were positioned the same distance from the target object (equal positions on the z axis). The only displacement between camera positions was on the horizontal axis. Figure 3 shows a graphical representation of the camera rig.

Figure 3: Artificial Test Rig Camera Configuration

Under these conditions it can be seen that the extrinsic parameters automatically calculated for this rig are correct. The translation vector that is calculated indicates that the only translation between cameras is on the horizontal axis. Furthermore, it should be noted that the original horizontal axis displacement of the two cameras was 20 units, whereas the calibration finds this displacement to be 19.284 units. This demonstrates a satisfactory level of accuracy throughout the calibration process; however, appropriate tests and research should be carried out regarding non-parallel camera geometries and which geometries have the most positive effect on the reconstruction results. The calibration process was also tested using camera captured images, and returned results closely approximating those that would be expected (i.e. a logically correct ratio between x, y and z displacement and approximately correct rotation matrices); however, no precise physical measurements of our stereo rig were possible and hence it proved difficult to assess performance under actual calibrations with no correct results for comparison. Despite this, observable evidence and results suggest that the calibration procedures detailed in this section function in a satisfactory manner.


7 Rectification

Implementations of the correlation algorithms required in the next stage of the reconstruction process can be greatly simplified if the input images are rectified. The process of rectification involves a 2D transformation of the input images such that corresponding image points are located on equivalent image scan lines. Utilising geometric properties inherent to epipolar geometry (see Figure 4), given a point and its projected location on one image plane, it is possible to calculate on which epipolar line in the other image plane the point will appear. This epipolar constraint allows us to calculate and perform the rectifying 2D transformation of the original input images.

Figure 4: Graphical representation of epipolar geometry.

The epipolar constraint expresses the relation between two images of the same scene. The

plane marked by COP1, COP2 and P, shown in Figure 4, represents the epipolar plane. The

intersection of this plane with the two image planes represents the epipolar lines.

Figure 5: A Rectified Input Image Pair


The effect of the rectification is that the correspondence problem is reduced to one dimension, since we only have to search for matching points across a single horizontal line of the matching input image. Figure 5 shows the results of rectifying some input images after calibration of a stereo rig and the capture of a stereo image pair. Analysis of the rectified image pairs shows that matching points are indeed positioned on matching scan lines, confirming that the rectification is useful. With the rectification of the input images complete it then becomes possible to begin attempting additional stages in the reconstruction process.
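As a minimal illustration of the benefit, the following sketch performs the one-dimensional correspondence search that a rectified pair permits: for a pixel in the left image, candidate matches are examined only along the same scan line of the right image. The window size, disparity range and SSD cost used here are illustrative choices, not the system's actual parameters:

```python
import numpy as np

def scanline_match(left_row, right_row, x, half_win, max_disp):
    """Find the disparity of left pixel x by a 1-D SSD search along
    the matching scan line of the right image (rectified pair)."""
    patch = left_row[x - half_win : x + half_win + 1]
    best_d, best_cost = 0, np.inf
    for d in range(0, max_disp + 1):
        xr = x - d  # candidate position on the same line of the right image
        cand = right_row[xr - half_win : xr + half_win + 1]
        cost = np.sum((patch - cand) ** 2)
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d
```

Without rectification the equivalent search would have to cover a two-dimensional region of the right image, which is both slower and more prone to false matches.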


8 Correlation

In order to calculate the depth of a point in the scene we have to find points in both the left and right camera images which represent the same real world co-ordinate. Perhaps the most important contributing factor in terms of the accuracy of the final reconstruction is a system's ability to find a comprehensive solution to this problem. There is a vast array of available correlation algorithms, including local window based methods [36] and feature point based techniques [2]. A number of the other available methods for matching points between images are discussed in [31] by Laganiere and Vincent. Since initially we do not know where we might find a correlating image point, the search space for matching a point is relatively large. In order to constrain the size of the search, the left and right camera images can be rectified. This process involves transforming the left and right camera images in accordance with parameters obtained at the calibration stage. The rectification of the images ensures that matching points can be found on identical raster scan lines in both images. This causes a large improvement in the performance of the point matching algorithms since the correlation search space can be reduced to one dimension.

8.1 Input Point Detection

Before we consider correlation it should be obvious that we need to select a set of feature points which we are going to attempt to match. This is not as trivial a problem as it might first seem. A good selection of input points makes finding correspondences easier. The term used to describe the suitability of a point for matching is its saliency. Hall, Leibe and Schiele state that the saliency of an image feature can be defined to be “inversely proportional to the probability of occurrence of that image feature” [22]. These authors go on to give a formal definition of saliency in the early part of their paper and show that good candidate input points are usually those with high saliency. Sebe and Lew provide a good comparison of a number of salient point detectors in [46], including the Haar feature detector, the Harris feature detector, random point selection and others. The authors also propose a method based on analysis of the image using wavelet decomposition. A number of these feature point selectors will be implemented and compared within our vision system.
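As an illustration of saliency, the Harris corner response can be computed from the local structure tensor of the image gradients: corners (high response) make better match candidates than edges (negative response) or flat regions (zero response). This is a simplified sketch using a crude box filter rather than Gaussian weighting, not the detector implementation used in the system:

```python
import numpy as np

def harris_response(img, k=0.04):
    """Harris corner response R = det(M) - k * trace(M)^2, where M is the
    locally smoothed structure tensor of the image gradients."""
    Iy, Ix = np.gradient(img.astype(float))  # axis 0 = rows (y), axis 1 = cols (x)

    def box(a):
        # Crude 3x3 box filter with edge padding.
        p = np.pad(a, 1, mode="edge")
        return sum(p[i:i + a.shape[0], j:j + a.shape[1]]
                   for i in range(3) for j in range(3)) / 9.0

    Sxx, Syy, Sxy = box(Ix * Ix), box(Iy * Iy), box(Ix * Iy)
    return Sxx * Syy - Sxy ** 2 - k * (Sxx + Syy) ** 2
```

On a synthetic white square, the response is positive at a corner, negative along an edge and zero in flat regions, which is exactly the ranking a saliency-driven point selector exploits.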

8.2 Intensity Based Pixel-wise Correlation

The simplest correlation algorithms rely on local window based matching techniques which consider the “similarity” of a local window surrounding a potential point correlation. The easiest similarity measure to implement would be one that, given a pixel to match, simply tries to find a pixel with the same colour and intensity in the matching input image. This technique will obviously have problems when there are one or more pixels with similar or identical properties in the match search space. An advance on this technique is to also consider the


values of pixels surrounding the pixel that we are trying to match; in this manner we should be able to differentiate between the pixel we are looking for and pixels which are simply similar in colour and intensity. Pixel-wise image correspondence methods were among the first used to attempt to solve the stereo correspondence problem.

Initially two different intensity based correlation matching algorithms were considered. The first is the sum of squared differences (SSD) similarity measure, which calculates the difference between pixels in an image window on each half of the image pair and then sums the squares of these differences to decide how similar two given regions are. The second algorithm investigated is the zero mean normalised cross correlation algorithm (ZMNCC), which attempts to compensate for differences in average intensity across image pairs whilst calculating matching points.

8.3 SSD

The SSD algorithm is defined as follows:

SSD(i, j, d) = Σ_{k=-W..W} Σ_{l=-W..W} [ I_l(i+k, j+l) - I_r(i+k, j+l-d) ]^2

where (2W+1) is the width of the correlation window, I_l and I_r are the intensities of the left and right image pixels, and [i, j] are the coordinates of the left image pixel. The following definitions complete the algorithm: d is the relative displacement (disparity) between left and right image pixels, and the SSD correlation function selects, for each left image point, the displacement d that minimises SSD(i, j, d).

This algorithm functions by assuming that correlating image points will be surrounded by a window of other image points which, when subtracted from their respective pixels in the matching correlation window, can then be squared and the results summed to measure the similarity of the two points at the centre of each window.
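A minimal sketch of the SSD measure on one window pair (assuming rectified images, so the candidate match lies on the same scan line, displaced by d columns). The function name and argument layout are our own:

```python
import numpy as np

def ssd(I_left, I_right, i, j, d, W):
    """Sum of squared differences over a (2W+1)x(2W+1) window centred on
    left pixel (i, j) and right pixel (i, j - d); lower is more similar."""
    wl = I_left[i - W : i + W + 1, j - W : j + W + 1].astype(float)
    wr = I_right[i - W : i + W + 1, j - d - W : j - d + W + 1].astype(float)
    return np.sum((wl - wr) ** 2)
```

The cost is zero for a perfect match and grows with any intensity difference, which is why raw SSD is sensitive to illumination changes between the two views.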

8.4 ZMNCC

The alternative to the SSD similarity measure considered here is the Zero Mean Normalised Cross Correlation algorithm, defined as follows:

ZMNCC(f_l, f_r) = (f_l - mean(f_l)) · (f_r - mean(f_r)) / ( ||f_l - mean(f_l)|| ||f_r - mean(f_r)|| )


Here f_l and f_r represent vectors containing the intensity levels of pixels in the left and right correlation windows. The ZMNCC algorithm subtracts the average intensity of each correlation window from the pixels within that correlation window before computing point similarity from the intensity vectors. This is an attempt to compensate for consistent changes in intensity surrounding points that may occur between images in a stereo pair due to scene illumination, light source direction or a number of additional factors.
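A minimal sketch of the ZMNCC score for two window intensity vectors. Note that removing the window means and normalising makes the score invariant to a constant intensity offset (and gain) between the two windows, which is exactly the compensation described above:

```python
import numpy as np

def zmncc(fl, fr):
    """Zero-mean normalised cross-correlation of two intensity vectors.

    Returns a score in [-1, 1]; 1 indicates a perfect match that is
    invariant to a constant offset or gain between the two windows."""
    a = np.asarray(fl, dtype=float)
    b = np.asarray(fr, dtype=float)
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Unlike SSD, a uniformly brighter copy of a window still scores 1, so ZMNCC degrades more gracefully under illumination differences between the stereo pair.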

8.5 Correspondence Testing

In order to test the performance of the two algorithms a common stereo image pair was selected. One of the input images from the pair is shown on the left of Figure 6. The ground truth disparity map of the scene, which was computed by a laser range finder, is shown on the right of the same figure.

The difference in position between corresponding points can be used as the intensity of a pixel at each location in order to produce a disparity map representing the depth of objects in a scene. Whilst this is not a full 3D reconstruction it is the first step, and the quality of a disparity map is usually representative of the effectiveness of a point matching algorithm. The output of the SSD similarity measure and the ZMNCC algorithm is shown in Figure 7.

Figure 6: Stereo pair input image (left) and ground truth disparity data (right).


As expected, neither of the algorithms produces perfect results. A fairly large number of points are incorrectly matched to their corresponding points. This is especially apparent around the edges of objects and in areas of similar or low texture. Reasons for these errors include insufficient differences between image window intensities, illumination differences, image noise and occluded points. Both algorithms do, however, produce recognisable output, and the depths of the majority of the scene's objects can be observed in the resultant disparity maps.

Figure 7: SSD (left) and ZMNCC (right) disparity maps.

Furthermore, in order to produce the disparity map an attempt was made to match every single pixel in one image to an appropriate pixel in the opposite image. Matching every pixel will not be a priority when we are attempting to reconstruct the surface of the face; we need only select strong candidates for a match and then interpolate depths between these matched points. This should give us a higher rate of accuracy during correspondence than is evident in the disparity maps of Figure 7.


8.6 Matching Constraints

It can be seen from the results that whilst in simple cases pixel-wise correspondence matching is capable of finding corresponding points, we can expect a number of erroneous correspondences to be found. Depending on the severity of the mismatches this could have drastic effects on our final model if we are not able to discern which of our correspondences are likely to be correct and which are likely to be errors. In order to make this decision it is possible to impose a number of constraints on the matches so as to eliminate erroneous points. The constraints which are likely to have the most positive effect are those that make assumptions about the world we are viewing and hence are able to determine which matches violate properties we expect to observe in the data.

Constraints appropriate for pixel-wise correspondence techniques include the following:

• Similarity: For intensity based approaches, the similarity of two corresponding points is completely defined by some measure of how similar a set of pixel intensities are. It is possible to eliminate some weak matches by specifying a minimum match strength threshold under which matches are marked as invalid.

• Uniqueness: Almost always a given pixel should match at most one corresponding pixel in the match image. Occluded points and partially transparent objects can violate this constraint.

• Continuity: A property of most natural objects (including the human face) is that the disparity of matches should vary smoothly over the object.

• Ordering: The order of points on the original image will, almost always, be preserved in the matching image. This constraint fails when points lie in what is known as “the forbidden zone.”

• Left / Right Consistency: A point that is matched from the left to the right image should be in the same location if the point is matched from the right to the left image.

• Statistical: Assuming a certain distribution for the reconstructed points can help to remove false matches. For example, removing points that fall outside of the standard deviation of a point cloud can help eliminate spurious matches if we expect points in the cloud to be normally distributed.
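The left / right consistency constraint in particular is straightforward to sketch: given disparity maps computed in each direction, keep only those left-image pixels whose match maps back to (approximately) where it started. The tolerance parameter here is an illustrative choice:

```python
import numpy as np

def lr_consistency(disp_l, disp_r, tol=1):
    """Boolean mask of left-image pixels whose left->right match maps
    back to (approximately) the same pixel when matched right->left."""
    h, w = disp_l.shape
    cols = np.arange(w)
    valid = np.zeros_like(disp_l, dtype=bool)
    for y in range(h):
        xr = cols - disp_l[y]                  # where each left pixel lands
        inside = (xr >= 0) & (xr < w)
        back = np.zeros(w, dtype=disp_l.dtype)
        back[inside] = disp_r[y, xr[inside]]   # disparity seen from the right
        valid[y] = inside & (np.abs(disp_l[y] - back) <= tol)
    return valid
```

Pixels rejected by the mask are typically occlusions or outright mismatches, which is why this check is effective at removing the isolated spurious points visible in unconstrained reconstructions.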

8.7 Constraint Testing

It should be noted that under certain conditions any or all of these constraints can fail to eliminate incorrect matches and / or eliminate correct matches. However, it has been shown that the introduction of constraints into a system does serve to reduce the number of incorrectly matched points. A disadvantage of combining many constraints is that this “results in more thresholds, and thus a greater need for tuning” [31]. This does remove a


degree of automation from the process; however, it appears to be an essential step in the absence of any perfect matching strategy. Figure 8 shows the potential importance of the implementation of constraints. The left image shows the results of a reconstruction with no constraints applied, whereas the right image demonstrates the improvements to the point cloud results when constraints are applied. The initial correspondence match produces a couple of matching errors, and the reconstruction of the 3D points (discussed in the following section) shows a group of points to one side of the reconstruction and a pair of mismatched points a long distance from the actual model. This sort of matching error leads to the production of an unsuitable model; however, with the addition of some simple constraints to the matching process a much improved starting point for model generation is created.

Figure 8: Effects of constraint application

The development of the constraints further shows that the process of stereo reconstruction is one of constrained optimisation. We can never produce perfect calibrations, point matches and reconstructions from actual imagery, and hence the best results are produced when we can constrain our results to such an extent that the majority of errors are eliminated or reduced to the point that they do not produce observable errors in our final output.

8.8 Alternative Correspondence Measures

Despite the introduction of constraints to the matching process, the inaccuracies of the algorithms discussed so far will have a fairly major impact on the final output of the system. The difficulty of finding correct correlations on real imagery is probably a problem which intensity based matching algorithms cannot fully overcome. Differences between stereo image pairs due to illumination, point occlusion and image noise can be such that matching points based on the intensity of a window surrounding those points cannot yield a high rate of accuracy. Solving this problem has been the subject of much research, with the most promising results coming from investigations into the use of wavelet decomposition of the input images in order to find points of interest and also the corresponding point match. [9, 48] discuss this approach in more detail. Their research is based around multiscale matching


techniques which seek to decompose an image into several parts containing copies of the image under certain scale changes and gaussian filtering [44]. Utilising a dyadic wavelet transform as the basis for a matching algorithm, Bhatti claims to demonstrate “the viability of applying the wavelet transform approach to stereo matching” [9]. As such this appears to be a promising direction in which to take future development of more robust, automatic and accurate correspondence matching algorithms.


9 Projective Reconstruction

The initial task of the reconstruction stage is to calculate the 3D position of the correlated image points. In order to achieve this, data from all the previous stages is utilised. Both the cameras' internal and external parameters are used, along with a vector of co-ordinates matching corresponding points between the left and right input images. In order to recover the x, y and z real world co-ordinates an over-constrained set of linear equations must be solved. Since the system of equations is over-constrained it is necessary to obtain a best fit least squares estimate of the results. The following calculations demonstrate the algebraic reconstruction of the corresponding image points:

Let $(c''_l\; r''_l\; 1)^T$ and $(c''_r\; r''_r\; 1)^T$ represent the corresponding image points on the left and right rectified input images. They are related to the original non-rectified image co-ordinates $(c_l\; r_l\; 1)^T$ and $(c_r\; r_r\; 1)^T$ by:

$$(c''_l\; r''_l\; 1)^T = W'_l R_{rect\,l} W_l^{-1} (c_l\; r_l\; 1)^T$$

$$(c''_r\; r''_r\; 1)^T = W'_r R_{rect\,r} W_r^{-1} (c_r\; r_r\; 1)^T$$

where $W$ is the $3 \times 3$ matrix representing the left or right camera's intrinsic parameters ($M_{int}$ in the calibration section), $W'$ the corresponding rectified intrinsics, $R_{rect}$ the rectifying rotation, and $M$ represents the camera's extrinsic parameters ($M_{ext}$ in the calibration section).

The perspective projection equation is defined as:

$$\lambda (c\; r\; 1)^T = W M (x\; y\; z\; 1)^T$$

The combination of the above equations then yields the following:

$$\lambda_l (c''_l\; r''_l\; 1)^T = \lambda_l W'_l R_{rect\,l} W_l^{-1} (c_l\; r_l\; 1)^T = W'_l R_{rect\,l} W_l^{-1} W_l M_l (x\; y\; z\; 1)^T = W'_l R_{rect\,l} M_l (x\; y\; z\; 1)^T$$

and


$$\lambda_r (c''_r\; r''_r\; 1)^T = \lambda_r W'_r R_{rect\,r} W_r^{-1} (c_r\; r_r\; 1)^T = W'_r R_{rect\,r} W_r^{-1} W_r M_r (x\; y\; z\; 1)^T = W'_r R_{rect\,r} M_r (x\; y\; z\; 1)^T$$

This leaves us with five unknowns ($x$, $y$, $z$, $\lambda_l$ and $\lambda_r$) for which we have six equations. As stated above, the solution can be obtained using the least squares method.

If

$$P_l = W'_l R_{rect\,l} M_l \quad \text{and} \quad P_r = W'_r R_{rect\,r} M_r$$

then, writing $P_l = [p_{l\,ij}]$ and $P_r = [p_{r\,ij}]$:

$$\lambda_l \begin{pmatrix} c''_l \\ r''_l \\ 1 \end{pmatrix} = P_l \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} = \begin{pmatrix} p_{l11} & p_{l12} & p_{l13} & p_{l14} \\ p_{l21} & p_{l22} & p_{l23} & p_{l24} \\ p_{l31} & p_{l32} & p_{l33} & p_{l34} \end{pmatrix} \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix}, \qquad P_l \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} - \lambda_l \begin{pmatrix} c''_l \\ r''_l \\ 1 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}$$

$$\lambda_r \begin{pmatrix} c''_r \\ r''_r \\ 1 \end{pmatrix} = P_r \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} = \begin{pmatrix} p_{r11} & p_{r12} & p_{r13} & p_{r14} \\ p_{r21} & p_{r22} & p_{r23} & p_{r24} \\ p_{r31} & p_{r32} & p_{r33} & p_{r34} \end{pmatrix} \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix}, \qquad P_r \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} - \lambda_r \begin{pmatrix} c''_r \\ r''_r \\ 1 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}$$

These two equation systems can be combined to help us obtain the final solution:

$$\begin{pmatrix} p_{l11} & p_{l12} & p_{l13} & -c''_l & 0 \\ p_{l21} & p_{l22} & p_{l23} & -r''_l & 0 \\ p_{l31} & p_{l32} & p_{l33} & -1 & 0 \\ p_{r11} & p_{r12} & p_{r13} & 0 & -c''_r \\ p_{r21} & p_{r22} & p_{r23} & 0 & -r''_r \\ p_{r31} & p_{r32} & p_{r33} & 0 & -1 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \\ \lambda_l \\ \lambda_r \end{pmatrix} = \begin{pmatrix} -p_{l14} \\ -p_{l24} \\ -p_{l34} \\ -p_{r14} \\ -p_{r24} \\ -p_{r34} \end{pmatrix}$$


The least squares solution to the linear equation system $AX = B$ is given by:

$$X = (A^T A)^{-1} A^T B$$

Solving this system for each of our corresponding image points allows each of the points to be projected back into three dimensions. Whilst the linear system has to be solved for each point, many of the values in the system remain constant whilst a consistent calibration is maintained, and hence the results can be calculated more efficiently if we do not recalculate all values in the system for each point being reconstructed.
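As a sketch of how the per-point solve might look (Python/NumPy for illustration only; the dissertation's implementation is C++), assuming `Pl` and `Pr` are the 3×4 matrices defined above and the image points are rectified co-ordinates:

```python
import numpy as np

def triangulate(Pl, Pr, ptl, ptr):
    """Solve the over-constrained 6x5 system A(x, y, z, l_l, l_r)^T = B
    for one correspondence. Pl, Pr are 3x4 projection matrices;
    ptl, ptr are (column, row) rectified image co-ordinates."""
    cl, rl = ptl
    cr, rr = ptr
    A = np.zeros((6, 5))
    B = np.zeros(6)
    A[0:3, 0:3] = Pl[:, 0:3]
    A[3:6, 0:3] = Pr[:, 0:3]
    A[0:3, 3] = -np.array([cl, rl, 1.0])   # -lambda_l column
    A[3:6, 4] = -np.array([cr, rr, 1.0])   # -lambda_r column
    B[0:3] = -Pl[:, 3]
    B[3:6] = -Pr[:, 3]
    # least squares solution X = (A^T A)^-1 A^T B
    X, *_ = np.linalg.lstsq(A, B, rcond=None)
    return X[:3]   # the recovered (x, y, z)
```

With two hypothetical unit-intrinsic cameras, `Pl = [I | 0]` and `Pr = [I | (-1, 0, 0)]`, a world point (1, 2, 5) projects to (0.2, 0.4) and (0, 0.4), and `triangulate` recovers it.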

9.1 Reconstruction Testing

In order to test the correctness of this reconstruction technique, a scene with known 3D parameters was created and fed as input into the system.

Figure 9: 3D Studio Max cube reconstruction. Test input (purple cube) and reconstructed output (red spheres) shown from left, right, top and perspective views.

The system was also calibrated using generated images to ensure that errors in the calibration stage were kept to a minimum, and corresponding points were manually matched to ensure that any errors in the results were not due to erroneous point correlations. The virtual stereo rig was identical to that shown in Figure 3. A cube was chosen as input for the test reconstruction since only a small number of points need to be considered.

The test data was produced and rendered using Discreet's 3D Studio Max [15]. In order to test the validity of the results, the calculated 3D co-ordinates were fully reconstructed, imported back into the original 3D Studio Max scene and compared with the original 3D model. Figure 9 shows the results of importing the reconstructed data set into 3D Studio Max, with the cube representing the original data and the spheres showing the locations of the reconstructed points.

As can be seen from the output, the system produces results which are almost identical to the initial 3D points, with the centre of each sphere falling almost exactly on the vertices of the original cube. Table 2 shows the reconstructed co-ordinates compared to the original cube co-ordinates. The reconstructed points all fall within approximately one unit of their original position on each axis. This demonstrates a low level of error, which is most likely due to slight inaccuracies in manual correspondence matching, lack of input resolution or image compression artefacts. Thus the reconstruction algorithms exhibit a satisfactory level of accuracy.

Original Co-ordinates    Reconstructed Co-ordinates
(5, 11, 20)              (4.362, 10.943, 20.157)
(5, -14, 20)             (4.695, -13.404, 20.039)
(-20, 10, 20)            (-20.061, 10.671, 20.014)
(5, -14, -4)             (4.888, -12.994, -4.375)
(-20, -14, -4)           (-19.351, -13.27, -4.68)
(-20, 11, -4)            (-20.299, 10.597, -4.014)
(5, 11, -4)              (4.566, 11.256, -4.056)

Table 2: Reconstructed Co-ordinate Comparison
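The error level can be quantified directly from Table 2; the following Python snippet (an after-the-fact check, not part of the original system) reproduces the table values and computes the deviations:

```python
import numpy as np

# Co-ordinate pairs copied from Table 2
original = np.array([(5, 11, 20), (5, -14, 20), (-20, 10, 20), (5, -14, -4),
                     (-20, -14, -4), (-20, 11, -4), (5, 11, -4)], dtype=float)
reconstructed = np.array([(4.362, 10.943, 20.157), (4.695, -13.404, 20.039),
                          (-20.061, 10.671, 20.014), (4.888, -12.994, -4.375),
                          (-19.351, -13.27, -4.68), (-20.299, 10.597, -4.014),
                          (4.566, 11.256, -4.056)], dtype=float)

per_axis = np.abs(original - reconstructed)                 # per-axis deviation
euclid = np.linalg.norm(original - reconstructed, axis=1)   # Euclidean error
print(round(per_axis.max(), 3))   # → 1.006  (worst single-axis deviation)
print(round(euclid.max(), 3))     # → 1.19   (worst Euclidean error)
```

The worst single-axis deviation is almost exactly one unit, consistent with the claim above, although the worst Euclidean distance is slightly larger.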


10 Surface Estimation

Once the projective reconstruction procedures have been carried out, our results take the form of an unorganised three-dimensional point cloud. In order to continue with the reconstruction it is necessary to estimate properties of the original surface and hence decide which points from the cloud should be interconnected. The choice of polygon with which we attempt to construct our surface is not of great importance, as long as the shape can correctly represent the surface we are attempting to reconstruct. Since the majority of 3D rendering algorithms are optimised for dealing with meshes constructed from triangles, and a number of suitable surface construction algorithms generate their output as a list of connected triangles, the triangle is the most suitable mesh construction primitive. The process of triangulation "involves creating from the sample points a set of non-overlapping triangularly bounded facets", such that "the vertices of the triangles are the input sample points" [10]. Whilst there are a number of algorithms readily available for triangulation, "the more popular algorithms are the radial sweep method and the Watson algorithm which implement Delaunay triangulation" [10]. A large number of surface construction algorithms are based on Delaunay triangulations, which have been the subject of much research aimed at optimising and constraining the original algorithm to achieve fast and accurate surface representations. For the purpose of reconstructing our face surfaces, Bourke's modification of Delaunay's method is suitable for our irregularly spaced dataset. A more detailed description of Bourke's triangulation technique is available in [10]; however, he summarises the work as follows:

"The Delauney triangulation is closely related geometrically to the Direchlet tessellation also known as the Voronoi or Theissen tessellations. These tessellations split the plane into a number of polygonal regions called tiles. Each tile has one sample point in its interior called a generating point. All other points inside the polygonal tile are closer to the generating point than to any other. The Delauney triangulation is created by connecting all generating points which share a common tile edge. Thus formed, the triangle edges are perpendicular bisectors of the tile edges."

At this stage it is impossible to tell which surface construction algorithm will prove most suitable for recognition tasks. Since Bourke's method provides visually satisfactory results, a slightly modified implementation of his code has been used; however, as with all other algorithms that have been implemented within the system, it is straightforward to add alternative algorithms should Bourke's method not provide satisfactory recognition and reconstruction results at a later stage.
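Central to Watson-style Delaunay triangulation is the test of whether a candidate point falls inside a triangle's circumcircle (a triangle is Delaunay only if no other sample point does). A minimal Python version of that test, using the standard determinant formulation, might look like:

```python
def in_circumcircle(p, a, b, c):
    """Return True if point p lies strictly inside the circumcircle of
    the counter-clockwise triangle (a, b, c). Each argument is an
    (x, y) pair. Uses the standard 3x3 determinant test with all
    points translated so that p is at the origin."""
    ax, ay = a[0] - p[0], a[1] - p[1]
    bx, by = b[0] - p[0], b[1] - p[1]
    cx, cy = c[0] - p[0], c[1] - p[1]
    det = ((ax * ax + ay * ay) * (bx * cy - cx * by)
         - (bx * bx + by * by) * (ax * cy - cx * ay)
         + (cx * cx + cy * cy) * (ax * by - bx * ay))
    return det > 0
```

For the triangle (0,0), (1,0), (0,1) the circumcentre lies at (0.5, 0.5), so that point is inside while (2, 2) is not.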

10.1 Surface Estimation Testing

In order to test the quality of the mesh construction, initial tests were carried out using the same artificially rendered cube as was reconstructed in Figure 9. Once a surface has been created it becomes trivial to fill the surface and calculate lighting effects to add a degree of realism to the scene. Figure 10 shows the reconstruction and meshing process on the cube data.

Figure 10: Reconstructed Cube output after mesh triangulation and surface construction.

Inspection of the output model shows the triangulation method provides accurate results under simple conditions. Further testing was also carried out on a more complicated reconstruction, under identical conditions except that the cube model was replaced by a head model. The reconstruction of this more complex model is shown in Figure 11 and, for the number of input points, provides a good reconstruction of the original head model.

Figure 11: Original face rendering (left) and the reconstructed mesh (middle) along with a full

surface reconstruction (right).

The texture displayed on the original face image was chosen to make the point matching stage as simple and error free as possible, with matching points being selected on the chequered pattern vertices. Since a relatively low number of points were matched across the input images, the resultant model is of relatively low resolution; however, increased resolution is simply a matter of matching a greater number of input points. Satisfactory surface models are obtained using the described method, but Bourke's implementation, despite creating an accurate mesh, is somewhat simplistic. In order to create more accurate models, more sophisticated meshing algorithms are required. Better surface reconstruction algorithms would include features such as smoothing, to help reduce the effect of noisy model data, or better interpolation between data points, to allow the creation of correctly curved face surfaces. Finally, as an alternative to surface estimation, deforming a generic head model would mostly eliminate the need for any surface reconstruction and should be considered for future work.

10.2 Texture Mapping

The final stage of the reconstruction process involves applying a texture to the surface model. Since we already have the 2D and 3D co-ordinates for each of the points in the reconstruction, texture mapping simply involves extracting the 2D texture data from the input images and applying it to the corresponding surface on the 3D model. More sophisticated techniques that blend textures from both input images are possible and would improve the final output; however, at this stage taking the texture from a single input image and applying it to the 3D model provides satisfactory results. Figure 12 shows the reconstruction from Figure 11 after an appropriate texture has been applied.

Figure 12: Texture mapped model reconstruction
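Because each reconstructed vertex retains its known 2D position in the input image, per-vertex texture co-ordinates follow by normalisation alone. A hypothetical helper (names illustrative; it assumes texture co-ordinates in [0, 1] with v increasing upwards, whereas image rows grow downwards):

```python
def image_to_uv(points_2d, img_w, img_h):
    """Map each vertex's known (x, y) position in the source image to
    normalised (u, v) texture co-ordinates, flipping v because image
    rows grow downwards while texture v grows upwards."""
    return [(x / float(img_w), 1.0 - y / float(img_h)) for (x, y) in points_2d]
```

Each triangle of the mesh can then be rendered with the texture region bounded by its three (u, v) pairs.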

Whilst the reconstruction process involves a number of steps and a large number of calculations, the volume of research carried out on the subject has led to the development of many techniques which prove suitable for our application. The actual 3D point calculations can be considered correct, since the perspective projection equations have been well researched and successfully applied to a multitude of applications. The surface meshing algorithms appear successful at this stage; however, only further testing will prove whether they remain so in more complex, non-simulated situations. Finally, the approach to texture mapping succeeds in producing a more realistic output and complements the mesh in providing a realistic reconstruction.


11 Implementation

Major goals of the implementation of this stereo vision system include:

• The development of a complete system capable of reconstructing the visible surface of an object given left and right images of a stereo pair and calibration data regarding the camera rig under which the stereo pair was captured.

• The system should be able to obtain the camera calibration data automatically from a sequence of calibration images.

• An implicit assumption of the implementation should be that the specific algorithms which have been selected may not be ideal, and hence the application framework should be developed in a manner which allows future expansion and fast integration of improved algorithms.

• A degree of platform independence should be achieved.

• The system should be geared towards producing output which provides a suitable basis for input into a pose-invariant face recognition system.

11.1 Design Choices

Initial choices of development language, target platform and development environment were made easily. C++ is one of the most widely used application development languages and as such has a wide variety of libraries whose power can be harnessed during the development phase. Many of the calculations which we are required to carry out have been implemented many times over by a variety of authors. For example, the linear algebra and least squares estimation which we require in the reconstruction stage need not be re-programmed, since libraries such as LAPACK [4] (amongst others) handle this sort of calculation efficiently. Examples of other useful libraries include OpenCV [25], Intel's Image Processing Library (IPL) [24], and a number of others aimed squarely at developers creating computer vision applications. A number of these libraries are available only in C++, and as such the choice of C++ as our development language was heavily weighted by the existence of such libraries, giving us more scope when deciding how various features should be implemented. In addition, C++ is considered faster than alternatives such as Java under intensive processing conditions. Also, during the early development stages C# was still a mostly unproven technology and hence was not a primary consideration for this implementation. A more complete description of the libraries which have been used can be found in the software libraries section of this report.

Whilst one of the goals of the implementation is to provide a system which maximises the degree of cross-platform compatibility, it should be noted that a relatively extensive GUI is required to handle the display of large quantities of data in a graphical manner, and hence high compatibility is difficult to achieve. In order to combat this difficulty, all algorithms which play an important part in the reconstruction (SSD cross-correlation, ZMNCC, Delaunay triangulation etc.) have been implemented in a library separate from the user interface code. This achieves a software separation between vision code and GUI code, and hence whilst the interface is application and platform specific, the actual vision algorithms retain a degree of cross-compatibility and reusability. The current implementation contains all the vision code in a library named "VisionLib" and compiles the code to a DLL file, enabling other applications to use the latest version of the library without storing multiple versions of the compiled code on a single computer. The main application is called "FaceScanner" and is linked at compile time to the VisionLib library.

11.2 Application Architecture

The initial implementation of the system is compiled for the Windows platform. Since Microsoft's Visual Studio is the IDE of choice for programming C++ under Windows, this software was selected for development, and hence it was logical to utilise the visual development features of the environment. To this end the application makes extensive use of Microsoft Foundation Class (MFC) functionality. Furthermore, the Microsoft MDI Document / View application architecture has been adhered to. In essence this architecture divides an application into a document part (derived from the Document super-class), which contains all user data pertaining to the contents of the work space. In our case this is the currently matched points, current matching algorithm selection data, calibration information and all the rest of the data associated with the reconstruction currently being worked on. The document section of the application has a close association with the vision library, since the data stored in the document must be manipulated by the algorithms available in the VisionLib DLL. The second part of the architecture contains a series of "views" on the document, which display and optionally edit parts of the data stored in the document. The views are derived from the FormView super-class. Each view is associated with a document and defines the way in which the user is presented with the information contained within the document. Examples of views in our application include the raw data viewer, shown here in Figure 13, the 3D data viewer and the input image view, amongst others. An advantage of this approach is that as functionality is added to the application, new views can be created in order to allow the user to interact with the new functionality and data as easily as possible. Additionally, more views can be added at a later date to allow new ways for data to be entered into the program and new ways for the data to be examined. For example, it would be trivial to add a view which enabled the user to select point matches by hand without changing any of the code already in place. Since the new view would simply make changes to the data structures already available in the document, all previous views would already be equipped to deal with displaying the data in a manner appropriate to that view. This advantage is crucial in fulfilling the implementation requirement that we make no assumptions about the specific algorithms which will be in place in the final application.

Figure 13: The raw data view displays actual pixel co-ordinates of the point matches,

reconstructed points, normalised model co-ordinates and raw calibration data.

Data structure design is one of the most important aspects of any application, but plays a particularly important role in graphically intensive applications. Much of the reason for this lies in the complexity of many graphics objects. Not only does the data need to be stored, but it also needs to be processed quickly, leading to the requirement that all data structures should be optimised for fast processing. This can lead to some difficulties in designing appropriate data structures, as many require constant redesign to cater for additional algorithms' processing requirements. Apart from fundamental data objects such as images (which are handled by the IplImage data structure, part of the IPL image library), custom data structures were implemented for the majority of objects. Readers should refer directly to the code to observe implementation-specific data structures. It should be noted, however, that a number of libraries exist aimed at providing many of the data structures which have been implemented. To this end it probably would have been more efficient to use some of these data structures "out of the box" rather than relying on custom implementations which required constant rewrites as the application grew. Despite this possible oversight, the data structures in place adequately represent the user data and perform processing in a timely manner, suggesting that they are indeed suitable for the task in hand.

The main application structure is shown in the simplified UML class diagram of Figure 14. The FaceScannerApp class is the main class of the application and brings together all other elements of the program, including the document and the available views. The document object is perhaps the most important in this structure, since this is where all data on the current reconstruction is held. In general the document holds all the objects defined in the VisionLib library relating to the reconstruction. Objects representing the calibration, the image pre-processing algorithm, the point input algorithm, the point match algorithm, the surface reconstruction algorithm, a list of applied constraints and the input image data are all stored in the document, along with data regarding actual point matches, current 3D points and so on. All the data types for these objects are defined in the VisionLib library and therefore are reusable in alternative applications. Furthermore, through support provided by the MFC document / view architecture, it becomes simple to save the state of a reconstruction by serialising the document object, thus saving the state of the currently active algorithms for reloading in a future session. Also, since we are using the multiple document / view version of the architecture, it is possible to work on multiple documents (and hence multiple reconstructions) in the same workspace, allowing direct comparison between different algorithms and datasets. A final advantage of grouping the majority of the reconstruction-specific data into the same document object is that this constrains interaction with the external VisionLib library to a single object and thus reduces the complexity of the interaction between the two separate components of the application.

[UML class diagram omitted: the Doc class (holding constraints, calibration, algorithm selections and point matches) with view classes View3D, ViewAlgs, ViewCalib, ViewData, ViewImages and ViewRender3D, framed by MDIChildFrame, MainFrame and FaceScannerApp over MFC and VisionLib.]

Figure 14: Simplified UML diagram of the user interface / MFC portion of the FaceScanner Application. Some fields and methods have been omitted for conciseness.


The inclusion of a variety of views on the document object allows the data contained within the document to be presented to the user in a manageable fashion. Each view is tailored to displaying portions of the document data in a unique manner, and each optionally provides the user with ways to interact with the data. In essence the above completely defines the main application structure, i.e. a document and some views. User interaction with the views triggers application events which interface with the vision library code to manipulate the document data and thus guide the reconstruction process.

Figure 15 shows the more complex class interactions found within the Vision library (VisionLib). As well as containing all code relating to specific vision algorithms, VisionLib contains the data structures required by each of the algorithms.

[UML class diagram omitted: the Stereo base class and its derived algorithm families (Input, Match, Constraint2D and Constraint3D, with subclasses such as FaceFind, PatternGrid, CVFeatures, File, SSD, ZMNCC, HistNorm, HistMatch, NoDuplicates, NoWeakMatches and NoOutliers), together with the ImageData, MatchingPoint, PrePro, Reconstruct and Tools classes.]

Figure 15: Simplified UML diagram of VisionLib, the library containing all the computer vision related code within the project.


11.3 Data Structures and Algorithms

The major data structures defined within VisionLib are as follows:

• CvCalibFilter: This is the calibration object. VisionLib utilises this DirectShow filter provided by OpenCV to obtain stereo rig calibration data for use throughout the library.

• ImageData: This class holds all the image data required by the reconstruction.

• MatchingPoint: All of the data regarding point correspondence matches is stored in this object. This includes x and y co-ordinates on both the left and the right images, whether the point is deemed valid, and the calculated strength of the match. This object also contains methods for storing the calculated 3D position of the matched point.

• Reconstruct: This object contains data structures and methods for surface generation, including a linked list of joined 3D points specifying the triangular surface mesh.

The majority of the remaining classes interact with these four data structures to progress through the different stages of the reconstruction. It is also the data in these four structures that the views in the main application interface are designed to display and interact with, via the application document object.
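The MatchingPoint record described above can be sketched as follows (a Python dataclass for illustration; the field names are illustrative, not the C++ originals):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class MatchingPoint:
    """Sketch of the MatchingPoint record: a correspondence between the
    left and right images plus its validity flag, match strength and,
    once reconstructed, its 3D position."""
    pt_left: Tuple[int, int]                      # (x, y) in the left image
    pt_right: Tuple[int, int]                     # (x, y) in the right image
    valid: bool = True                            # set False by constraints
    strength: float = 0.0                         # correlation match strength
    pos_3d: Optional[Tuple[float, float, float]] = None  # filled in later

    def disparity(self) -> float:
        """Horizontal disparity between the two image positions."""
        return float(self.pt_left[0] - self.pt_right[0])
```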

In order to support the fast interchange and comparison of algorithms, the VisionLib library exhibits a high degree of polymorphism. The set of objects relating to stereo reconstruction is derived from the Stereo class, which defines a set of virtual methods that derived objects must implement. This allows for a consistent interface to each of the different algorithms. The three sets of algorithms derived from the Stereo class are Input, Match and Constraint. Input contains subclasses for handling input point selection, i.e. which initial points we will attempt to correlate; Match contains subclasses for finding corresponding image points; and Constraint handles constraining these matches. Each of these three classes is further subclassed to provide the actual functionality. For example, the Match class is subclassed to provide implementations of the actual matching algorithms; currently implemented here are the SSD and ZMNCC algorithms investigated in the Correlation section of this report.
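The virtual-method pattern described above can be sketched in Python with an abstract base class (names illustrative, mirroring the design rather than the C++ source):

```python
from abc import ABC, abstractmethod

class Stereo(ABC):
    """Base class idea: every reconstruction stage exposes the same two
    virtual methods, so algorithms can be swapped without touching the
    calling code."""
    @abstractmethod
    def show_settings(self): ...
    @abstractmethod
    def update(self, matches): ...

class NoWeakMatches(Stereo):
    """Example constraint: flag matches below a strength threshold as
    invalid. Each match is a (point, strength, valid) tuple here."""
    def __init__(self, thresh):
        self.thresh = thresh
    def show_settings(self):
        print("threshold =", self.thresh)
    def update(self, matches):
        return [(pt, s, s >= self.thresh) for (pt, s, _) in matches]
```

Because the caller only sees the Stereo interface, an alternative constraint or matcher can be dropped in by implementing the same two methods.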

Some of the more important aspects of specific algorithm implementations are described below:


Input Objects:

• CvGoodTrackingFeatures: This algorithm is based on a method within the

OpenCV library. It takes an image which has had a binary threshold filter applied

to it and uses a modification of the Harris detector to find feature points with a

good chance of being matched. The binary threshold filter and point detector can

be applied under at a number of different thresholds simultaneously to provide a

point set which is evenly distributed over the target object.

• PatternGrid: Feature points are chosen by overlaying the input image with

regularly spaced input points which form a rectangular grid. This allows every

pixel in the input image to be selected, either for reconstruction purposes or to

attempt to create a dense disparity map. This is only useful when we need

regular input point spacing, since it makes no guarantees that the points will

make good matches.

• Manual: Input points can be manually entered via a text file in the form of 2D co-

ordinates.

Match Objects:

• SSD and ZMNCC: Both these correlation algorithms behave as described in the

appropriate section of this report except they have both been modified for

improved performance on colour images. The algorithms can optionally take into

consideration information from all three colour channels to help differentiate

between closely contested best matches.

• Manual: Matching points can be entered via a text file containing a list of 2D co-

ordinates.

Constraint Objects:

• Similarity: During the correlation phase a match strength is calculated for each candidate point based on the similarity of the point and its surrounding area. When the similarity constraint is applied, all points with a match strength below a given value are marked as invalid.

• Uniqueness: Each point in a dataset is tested for uniqueness against all other co-ordinates in the dataset.

• Statistical: Certain assumptions can be made about the reconstructed data. Assuming a normal distribution of the 3D points, we can eliminate points that fall outside a given number of standard deviations and hence remove some points that may have been matched incorrectly.
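The statistical constraint can be sketched as follows. This is not the VisionLib implementation but a minimal illustration of the idea: treat the distances of the 3D points from their centroid as roughly normally distributed, and discard any point whose distance deviates from the mean by more than k standard deviations (Point3 and statisticalConstraint are invented names).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Point3 { double x, y, z; };

// Remove points whose distance from the centroid lies more than
// k standard deviations away from the mean distance.
std::vector<Point3> statisticalConstraint(const std::vector<Point3>& pts, double k) {
    const std::size_t n = pts.size();
    if (n == 0) return {};

    // Centroid of the cloud.
    Point3 c{0, 0, 0};
    for (const Point3& p : pts) { c.x += p.x; c.y += p.y; c.z += p.z; }
    c.x /= n; c.y /= n; c.z /= n;

    // Distance of each point from the centroid, plus the mean distance.
    std::vector<double> d(n);
    double mean = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double dx = pts[i].x - c.x, dy = pts[i].y - c.y, dz = pts[i].z - c.z;
        d[i] = std::sqrt(dx * dx + dy * dy + dz * dz);
        mean += d[i];
    }
    mean /= n;

    // Standard deviation of the distances.
    double var = 0.0;
    for (double di : d) var += (di - mean) * (di - mean);
    const double sd = std::sqrt(var / n);

    // Keep only points within k standard deviations of the mean distance.
    std::vector<Point3> kept;
    for (std::size_t i = 0; i < n; ++i)
        if (std::fabs(d[i] - mean) <= k * sd) kept.push_back(pts[i]);
    return kept;
}
```

The threshold k trades completeness for reliability: a small k removes more mismatched points but also discards valid surface detail at the extremities of the face.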

A number of other important classes exist in the library which are not specific to stereo matching and hence are not derived from the Stereo object. These are Tools, PrePro and Reconstruct. Tools contains miscellaneous methods which do not fall into other categories but are useful vision algorithms nonetheless. For example, in some cases it may be useful to find a face within a given image, and so the Tools class contains methods for performing such tasks. The face finding algorithm is based on a Haar feature cascade [54] and is a direct implementation of functionality provided by the OpenCV library. The PrePro class contains functionality for 2D image manipulation which may be useful in stages prior to matching. Functionality such as histogram matching is provided, which has potential uses in illumination-invariant matching across input images. Finally, the Reconstruct object forms the basis for a set of reconstruction algorithms. These objects take 3D point cloud data as input and return a predicted surface. Bourke's modified version of Delaunay triangulation is implemented here.
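The PrePro histogram matching routine is not listed in this report; the sketch below illustrates one standard way such a routine might work on 8-bit grey levels, mapping each source level to the reference level with the nearest cumulative histogram value so that both images of a stereo pair share a similar intensity distribution before correlation (matchHistogram is an invented name, not the VisionLib method).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Remap the grey levels of src so that its histogram approximates
// that of ref (both images are flat vectors of 8-bit grey levels).
std::vector<int> matchHistogram(const std::vector<int>& src, const std::vector<int>& ref) {
    // Cumulative distribution of an image's 256-bin histogram.
    auto cdf = [](const std::vector<int>& img) {
        std::vector<double> c(256, 0.0);
        for (int v : img) c[v] += 1.0;
        for (int g = 1; g < 256; ++g) c[g] += c[g - 1];
        for (int g = 0; g < 256; ++g) c[g] /= img.size();
        return c;
    };
    const std::vector<double> cs = cdf(src), cr = cdf(ref);

    // Look-up table: each source level maps to the first reference
    // level whose CDF reaches the source level's CDF.
    std::vector<int> lut(256, 255);
    for (int v = 0; v < 256; ++v)
        for (int g = 0; g < 256; ++g)
            if (cr[g] >= cs[v]) { lut[v] = g; break; }

    std::vector<int> out(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) out[i] = lut[src[i]];
    return out;
}
```

Applied to the darker image of a stereo pair with the brighter image as reference, this pushes both images into the same intensity range, reducing the illumination differences that trouble intensity-based matching.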

11.4 Implementation Results

The current implementation of the face scanner vision system meets most of the requirements specified as goals at the beginning of the implementation. We have indeed implemented a system that is "capable of reconstructing the visible surface of an object given left and right images of a stereo pair and calibration data." We have not implemented a general vision system, instead focussing on the reconstruction of face objects. Furthermore, the implementation is capable of obtaining accurate calibration data from a sequence of images containing an appropriate calibration pattern. The application is also successful in separating vision code from GUI code to ensure maximum re-usability of promising vision algorithms. The structure of the vision library code also satisfies the goal that the system "allows future expansion and fast integration of improved algorithms." This is achieved through the implementation of a polymorphic class structure within the library.

This enables every component of a vision process to interact with every other component without regard for the specific algorithms in use. Finally, the separation of the vision code from the GUI code has the additional benefit of slightly increasing the ease with which the vision code could be ported to another platform, since only the GUI code is heavily platform dependent.

With regard to the goal that the application should provide output suitable as the basis of a pose-invariant face recognition system, it is unclear whether the current application would meet these requirements. Since we currently have no frontal face recognition system available for testing purposes, the requirements for "good" models for recognition are unclear, and hence our application cannot be tested for suitability. It is likely, however, that a number of additional features would be required. For example, when a face is viewed in a non-frontal pose and then rotated to a frontal pose prior to recognition, it is possible that previously non-visible surfaces become visible.

Figure 16: FaceScanner application screenshot

These surfaces must be estimated to allow proper recognition to take place. The symmetric properties of the face suggest that we could estimate the missing surfaces from the data already available. At this stage, algorithms aimed at solving this set of missing-data surface reconstruction problems are a target for future work. The current implementation does, however, show some promise in this area, since the calibration, reconstruction and application framework currently in place have proved to be both correct and accurate. The point correlation algorithms, whilst correct, are relatively basic, and more advanced correlation algorithms need to be implemented. The application is, however, relatively successful in meeting the demands set by the implementation goals. Figure 16 shows a screenshot of the application running with a reconstruction in progress.


12 Software Libraries

A number of libraries and APIs were found to be useful during the development of the stereo vision system. The libraries which were used are listed below.

Intel's Image Processing Library (v2.5)

This library provides low-level functionality for processing bitmaps, JPEGs and other image formats. Whilst the stereo vision system does not use much IPL functionality directly, libraries such as OpenCV rely heavily on the functionality it provides.

Intel's OpenCV (v3.1.Beta)

This open source computer vision library contains a mass of functions for many computer vision related tasks, ranging from camera calibration and disparity estimation to the computation of optical flow. This library has since been superseded by the Intel Performance Primitives library; however, much of the functionality is reportedly identical to that of OpenCV. Much of the OpenCV functionality is based on lower-level functions provided by the Intel Performance Primitives library.

Intel's Math Kernel Library (v6.1.009)

Our application makes use of the linear algebra sections of this maths library. The linear algebra / least squares technique is used to project correlated image points back into 3D space. The linear algebra routines in MKL are based on those implemented in LAPACK.
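The MKL calls themselves are not shown in this report, but the least-squares back-projection can be illustrated with a self-contained sketch. Each camera's 3×4 projection matrix and matched image point contribute two linear equations in the unknown 3D position; the resulting 4×3 system is solved via the normal equations, with a hand-rolled 3×3 elimination standing in for LAPACK. The triangulate, Mat34 and Vec3 names are invented for illustration.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <utility>

using Mat34 = std::array<std::array<double, 4>, 3>;  // 3x4 projection matrix
using Vec3  = std::array<double, 3>;

// Linear triangulation: each camera contributes two equations,
// u*(P row 3).X = (P row 1).X and v*(P row 3).X = (P row 2).X,
// with the homogeneous coordinate fixed to 1, giving A X = b.
Vec3 triangulate(const Mat34& Pl, double ul, double vl,
                 const Mat34& Pr, double ur, double vr) {
    double A[4][3], b[4];
    const Mat34* Ps[2] = { &Pl, &Pr };
    const double uv[2][2] = { { ul, vl }, { ur, vr } };
    for (int c = 0; c < 2; ++c) {
        const Mat34& P = *Ps[c];
        for (int r = 0; r < 2; ++r) {          // r = 0: u row, r = 1: v row
            const double m = uv[c][r];
            for (int j = 0; j < 3; ++j)
                A[2 * c + r][j] = m * P[2][j] - P[r][j];
            b[2 * c + r] = P[r][3] - m * P[2][3];
        }
    }
    // Normal equations: (A^T A) X = A^T b.
    double N[3][3] = {}, y[3] = {};
    for (int i = 0; i < 3; ++i)
        for (int k = 0; k < 4; ++k) {
            y[i] += A[k][i] * b[k];
            for (int j = 0; j < 3; ++j) N[i][j] += A[k][i] * A[k][j];
        }
    // Gaussian elimination with partial pivoting on the 3x3 system.
    int idx[3] = { 0, 1, 2 };
    for (int col = 0; col < 3; ++col) {
        int piv = col;
        for (int r = col + 1; r < 3; ++r)
            if (std::fabs(N[idx[r]][col]) > std::fabs(N[idx[piv]][col])) piv = r;
        std::swap(idx[col], idx[piv]);
        for (int r = col + 1; r < 3; ++r) {
            const double f = N[idx[r]][col] / N[idx[col]][col];
            for (int j = col; j < 3; ++j) N[idx[r]][j] -= f * N[idx[col]][j];
            y[idx[r]] -= f * y[idx[col]];
        }
    }
    // Back-substitution.
    Vec3 X{};
    for (int col = 2; col >= 0; --col) {
        double s = y[idx[col]];
        for (int j = col + 1; j < 3; ++j) s -= N[idx[col]][j] * X[j];
        X[col] = s / N[idx[col]][col];
    }
    return X;
}
```

In the real system the over-determined solve is delegated to the MKL/LAPACK least-squares routines rather than formed and eliminated by hand as above.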

Microsoft DirectX SDK (v9.0)

Several of the image capture and camera control routines are based on the DirectX SDK. The calibration process is also implemented as a DirectX filter.

Microsoft Foundation Classes

The Windows interface is programmed to take full advantage of available MFC resources, with the current implementation supporting the Multiple Document Interface Document/View architecture to allow simultaneous, complex views of the large datasets which must be dealt with throughout the course of a reconstruction.

OpenGL

All 3D views of our data are rendered using the OpenGL library. OpenGL was selected over alternatives such as Direct3D because of its programming simplicity and its wide support from both application programmers and hardware designers. Furthermore, OpenGL is much more platform independent than the Direct3D library.


A number of applications were utilized in the development of the stereo vision system.

Microsoft Visual Studio 6.0

C++ application development was carried out exclusively in this industry-standard development tool, selected primarily for its MFC support and visual application development features. C++ was selected as the development language due to its speed, versatility and support, which surpass those achievable with interpreted languages such as Java.

Mathworks Matlab 12

The more mathematically based algorithms within the system were tested for correctness and underwent fast-track development using this matrix evaluation application.


13 Results

The system was tested with a variety of data under a number of conditions. Most of the system tests were implemented using synthesized images, since these make it easier to eliminate unwanted input features such as noise or illumination variations. Furthermore, with synthesized images a correct version of the model we are trying to reconstruct already exists, and hence we have data with which we can compare our results. Much of the output from the system has been included in earlier sections in order to demonstrate the correctness of various sub-systems. The project meets the majority of the goals we set out to achieve. Significant research and development has been carried out into stereo camera calibration, the correspondence problem, 3D projective reconstruction and surface estimation. Further to this, an implementation of a vision system aimed at tackling the problems brought about by the reconstruction process has been developed, with the results obtained from the program demonstrating an acceptable level of accuracy.

Practical evidence suggests that the calibration process performs correctly. The calibration section of this report details some results from a sequence of synthesized images with a known rig calibration and demonstrates that the results obtained represent the properties of the actual rig correctly. Testing of the calibration procedures on both real image sequences and live video has also produced accurate results. Furthermore, testing of the system with a number of varying camera rigs demonstrated that this implementation of the calibration routines works for the majority of general stereo rig configurations.

Perhaps the most error-prone area of the reconstruction process at present is the correspondence matching phase. The correspondence problem is widely considered the most difficult area of reconstruction, and this is demonstrated by our implementation. As shown in the correspondence section of this report, both the SSD and ZMNCC algorithms perform well and produce good disparity maps; however, these simple pixel-wise intensity-based algorithms prove too lightweight to perform well under general reconstruction conditions. Simple intensity-based algorithms are unlikely to yield a quality solution to this particular correspondence problem, chiefly because correct point matches are often too similar to incorrect candidate matches for intensity-based methods to differentiate between them. The situation is complicated further by noise during the image capture process and by varying light levels between the images of a stereo pair. These image features cause major errors in the correspondence phase which propagate to other stages. The addition of numerous matching constraints improves matters at the cost of reducing the degree of automation present in the system; however, they do not provide a perfect solution and erroneous points are still matched. The system is unlikely to be improved much further through the addition of more constraints, and as such the development of more advanced matching algorithms using non-intensity-based methods is going to be a primary goal of any additional work. In order for the currently implemented correlation techniques to be effective we need to match only a small number of highly salient input points. This would increase the likelihood of obtaining an accurate result set at the expense of having a smaller number of points to work with, and hence a less accurate resultant surface.

Once a set of matches has been found and constrained we can commence with the actual reconstruction. The calculations behind the 3D projection have been well researched, and with the appropriate projection equations in place, reconstruction to a 3D point cloud is relatively trivial. The reconstruction stage demonstrates a cube and a face model reconstruction which serve to demonstrate the accuracy of the technique. Surface reconstruction from the point cloud provides adequate results using Bourke's Delaunay implementation. The meshing algorithm should take some account of potential errors in previous stages; however, at this time this is not taking place, and hence the surface reconstruction stage does nothing to "smooth over" the errors introduced in the correspondence stage. To this end, a more sophisticated algorithm could be implemented which attempts to create a smooth surface, possibly using techniques such as Bézier curves. An implementation of the marching cubes algorithm would also provide an interesting comparison in terms of producing a surface with more desirable properties than the mesh currently produced. Furthermore, if the system were to be used as a recognition subsystem it would be essential to consider algorithms for hidden surface reconstruction, to enable rendering of surfaces of the face not initially visible in the input images. As an alternative to estimating the hidden surface, it may well prove useful to implement a system that utilizes a generic head model to aid reconstruction. This may also prove a viable solution to increasing the effectiveness of the currently implemented intensity-based correlation algorithms, since fewer matched points would be required given the volume of data already available to the system. The current system is already capable of selecting only input points with a high chance of being matched (using the GoodTrackingFeatures input algorithm), and as such the framework is already suitable for the addition of a generic head model.

Figure 17 shows the results of a fully automated reconstruction after a number of thresholds were set and calibration data acquired. The images used were again synthesized, since at this point it is difficult to obtain accurate reconstructions from real imagery due to the correspondence problems described above. The images were also created in such a manner that the correspondence algorithms would find matches easier to make.


Figure 17: Fully automatic reconstruction of a synthesized face from stereo images. White dots on the 3D model show initial point match positions.

It should be obvious from the output produced that the correspondence algorithms struggled even under these constrained conditions and still produced numerous incorrect matches. Many of these were eliminated with the application of constraints; however, the correlation algorithms are simply not accurate enough to perform on real world data at this point. Despite some correlation accuracy problems, the system performs well throughout. It should be noted, however, that problems with the correspondence stage of the reconstruction appear to be due to properties of the matching algorithms involved rather than fundamental problems with the system. Furthermore, we have been successful in creating an application framework in which new and more efficient algorithms can be implemented easily and integrated with the system without major problems. This has the advantage that, despite the underperformance of the current correspondence algorithms, new algorithms which show promise, such as wavelet decomposition and matching, can be implemented and integrated into the system so that their performance can be analyzed. Thus, despite some errors within certain areas of the system, these errors can be observed and algorithms with lower error rates introduced into the system with ease.

The implementation of the FaceScanner application and the development of the VisionLib library have led to the creation of a successful architecture for investigating various vision algorithms and reconstruction techniques, and thus have proved to be a useful implementation. Furthermore, the interface through which the user interacts with the system is of potential commercial quality, allowing simple guidance of the reconstruction processes supported by intuitive data representation. The reusable nature of each of the system components allows future development within the current application framework to improve current performance.


14 Conclusions and Future Work

The majority of the goals specified at the beginning of this report have been met. Successful research has gone into each stage of the reconstruction process and the system is geared towards working with a set of CCTV cameras. At this stage, autonomous reconstruction from real imagery has not been achieved due to a partially inadequate solution to the correspondence problem; however, the framework is in place and capable of supporting future reconstructions once more powerful matching algorithms are available. The system has, however, proved a number of techniques to be correct and well suited to the task of facial reconstruction. Testing carried out on synthesized input yielded accurate results in the areas of the system where they were expected. A working implementation of a vision system capable of stereo calibration and reconstruction has been developed and adheres to the design goals specified in the implementation section.

With regard to the usefulness of the system output as input to a recognition subsystem, the results are inconclusive. Additional features would certainly have to be implemented and point matching improved to ensure we could construct an accurate face model; however, the basis for such a system is in place. The addition of hidden surface reconstruction, possibly through the use of a generic head model, would be essential in ensuring we can recreate recognizable face models. The development of more accurate point matching algorithms should probably be the focus of future work, since this is the area where the application currently struggles to perform. This could be further aided by the use of different algorithms in the feature point detection stage, despite the current algorithm performing reasonably well. Finally, improvements to surface estimation, with the addition of mesh smoothing to eliminate errors from earlier stages, would probably produce a system very suitable for use in a pose-invariant face recognition system, despite the fact that the system is not currently at this stage.

The stereo reconstruction problem is one of constrained optimization. The existence of numerous stages in the system leads to the propagation of estimation errors throughout. By increasing accuracy wherever possible and constraining each part of the system to eliminate most errors, we have produced a system which is, to a degree, capable of face surface reconstructions. Despite containing stages which, under some conditions, fail to perform, the system produces accurate results in general. The majority of initial design goals have been satisfied, and with the implementation of some additional algorithms the system could be made to completely fulfil all aims and goals and find application in face recognition utilities.


15 Bibliography

1. A Novel Technique For Face Recognition Using Range Imaging,

http://lcv.stat.fsu.edu/publications/paperfiles/pcarecog.pdf, Last Accessed: 09/04/04

2. Adjouadi and F. Candocia, A Similarity Measure for Stereo Feature Matching. IEEE Transactions on Image Processing, 1997. 6(10).

3. Akamatsu, S., H.F. T. Sasaki, and Y. Suenaga, A Robust Face Identification Scheme - KL expansion of an invariant feature space. SPIE Proceedings of Intelligent Robots and Computer Vision X: Algorithms and Techniques, 1991. 1607: p. 71-84.

4. Anderson, E., Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J.D. Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen, LAPACK User Guide. Society for Industrial and Applied Mathematics, 1999.

5. Bacakoglu, H. and M.S. Kamel, A three-step camera calibration method. IEEE Trans. Instrumentation and Measurement, 1997. 46: p. 1165-1172.

6. Balasuriya, L.S. and D.N.D. Kodikara, Frontal View Human Face Detection and Recognition. 2001.

7. Barber, C.B., D.P. Dobkin, and H. Huhdanpaa, The Quickhull Algorithm for Convex Hulls. 1996.

8. Beumier, C. and M. Acheroy. Automatic Face Authentication from 3D surfaces. in British Machine Vision Conference. 2001. Royal Military Academy, Signal & Image Centre (c/o ELEC) Avenue de la Renaissance, 30 B1000 Brussels, Belgium.

9. Bhatti, A., S. Nahavandi, and H. Zheng. Image Matching using TI Multi-Wavelet Transform. in VIIth Digital Image Computing: Techniques and Applications. 2003. Sydney.

10. Bourke, P., An Algorithm for Interpolating Irregularly-Spaced Data with Applications in Terrain Modelling, 1989, http://astronomy.swin.edu.au/~pbourke/terrain/triangulate/, Last Accessed: 07/04/2004

11. Bouview, D.J., Double-Time Cubes: A Fast 3D Surface Construction Algorithm for Volume Visualization. 1994.

12. Carr, J.C., R.K. Beatson, J.B. Cherrie, T.J. Mitchell, W.R. Fright, B.C. McCallum, and T.R. Evans, Reconstruction and Representation of 3D Objects with Radial Basis Functions, Applied Research Associates, University of Canterbury NZ.

13. Chan, S.O.-Y., Y.-P. Wong, and J.K. Daniel, Dense Stereo Correspondence Based on Recursive Adaptive Size Multi-Windowing. 2000.

14. Cooper, O., N. Cambell, and D. Gibson, Automated Meshing of Sparse 3D Point Clouds, University of Bristol.

15. Discreet, Homepage for the makers of 3D Studio Max, 2004, http://www.discreet.com/, Last Accessed: 15/04/04

16. Elagin, E., J. Steffens, and H. Neven. Automatic Pose Estimation System For Human Faces Based on Bunch Graph Matching Technology. in Proceedings of the Third International Conference on Automatic Face and Gesture Recognition. 1998. Nara, Japan.


17. Faugeras, O. and Q.-T. Luong, The Geometry of Multiple Images. MIT Press, 2001.

18. Fieguth, P.W. and T.J. Moyung, Incremental Shape Reconstruction Using Stereo Image Sequences. Department of Systems Design Engineering, University of Waterloo, Ontario, Canada.

19. Forsyth, D. and J. Ponce, Computer Vision: A Modern Approach. 2003: Prentice Hall.

20. Fraser, C. Automated Vision Metrology: A Mature Technology For Industrial Inspection and Engineering Surveys. in 6th South East Asian Surveyors Congress Fremantle. 1999. Department of Geomatics, University of Melbourne, Western Australia.

21. Galo, M. and C.L. Tozzi, Feature Based Matching: A Sequential approach based on relaxation labeling and relative orientation. 1997.

22. Hall, D., B. Leibe, and B. Schiele. Saliency of Interest Points under Scale Changes. in British Machine Vision Conference 2002. 2002.

23. Huang, J., V. Blanz, and B. Heisele, Face Recognition with Support Vector Machines and 3D Head Models. Center for Biological and Computer Learning, M.I.T, Cambridge, MA, USA and Computer Graphics Research Group, University of Freiburg, Freiburg, Germany.

24. Intel, Image Processing Library, 2000, http://developer.intel.com/software/products/perflib/ijl/, Last Accessed: 04-2004

25. Intel, Open Source Computer Vision Library, 2000, http://www.intel.com/research/mrl/research/opencv/, Last Accessed: 04-2004

26. Kallmann, M., H. Bieri, and D. Thalmann, Fully Dynamic Constrained Delaunay Triangulations. 2002.

27. Keller, M.G., Matching Algorithms and Feature Match Quality Measures For Model Based Object Recognition with Applications to Automatic Target Recognition, in Courant Institute of Mathematical Sciences. 1999, New York University.

28. Kim, J., V. Kolmogorov, and R. Zabih, Visual Correspondence Using Energy Minimization and Mutual Information. 2003.

29. Kirby, M. and L. Sirovich, Application of the Karhunen-Loeve Procedure for the Characterization of Human Faces. Pattern Analysis and Machine Intelligence, 1990. 12: p. 103-108.

30. Kirby, M. and L. Sirovich, Low dimensional procedure for the characterization of human faces. Opt. Soc, 1987. 2(A): p. 586-591.

31. Laganiere, R. and E. Vincent, Matching Feature Points in Stereo Pairs: A Comparative Study of Some Matching Strategies. 2001, School of Information Technology and Engineering, University of Ottawa.

32. Lee, M.W. and S. Ranganath, Pose-invariant face recognition using a 3D deformable model. Department of Electrical and Computer Engineering, National University of Singapore, Pattern Recognition, 2003. 36: p. 1835-1846.

33. Lorensen, W.E. and H.E. Cline, Marching Cubes: a high resolution 3d Surface Reconstruction Algorithm. Computer Graphics, 1987. 21: p. 163-169.


34. Lu, X., R.-L. Hsu, A.K. Jain, B. Kamgar-Parsi, and B. Kamgar-Parsi, Face Recognition with 3D Model-Based Synthesis. 2002.

35. Maas, H.-G., Image sequence based automatic multi-camera system calibration techniques. 1997, Delft University of Technology, The Netherlands.

36. Mattoccia, S., M. Marchionni, G. Neri, and D. Stefano, A Fast Area Based Stereo Matching Algorithm. 2002.

37. McLauchlan, P.F., A Batch/Recursive Algorithm for 3D Scene Reconstruction, in School of Electrical Engineering. 2001, University of Surrey.

38. McLauchlan, P.F., The variable state dimension filter., in VSSP 4/99. 1999, University of Surrey.

39. McLauchlan, P.F. and A. Jaenicke, Accurate mosaicing using structure from motion methods, in VSSP 5/99. 1999, University of Surrey.

40. McLauchlan, P.F. and D. Murray. A unifying framework for structure and motion recovery from image sequences. in 5th International Conference on Computer Vision. 1995. Boston.

41. Memony, Q. and S. Khanz, Camera calibration and three-dimensional world reconstruction of stereo-vision using neural networks. International Journal of Systems Science, 2001. 32(9): p. 1155-1159.

42. Moyung, T., Incremental 3D Reconstruction Using Stereo Image Sequences. 2000, University of Waterloo: Ontario, Canada.

43. Pentland, A., B. Moghaddam, T. Starner, O. Oliyide, and M. Turk, View-based and Modular Eigenspaces for Face Recognition, in Technical Report No 245. 1994: MIT Media Laboratory, Perceptual Computing Section.

44. Rosenfeld, A. and M. Thurston, Coarse-fine Template Matching. IEEE Trans. Systems, Man and Cybernetics, 1977. 7: p. 104-107.

45. Sanderson, C. and S. Bengio, Robust Features For Frontal Face Authentication in Difficult Image Conditions. 2003.

46. Sebe, N. and M.S. Lew, Comparing Salient Point Detectors. ICME, 2001.

47. Sharghi, S.D. and F.A. Kamangar, Geometric Feature-Based Matching in Stereo Images. 1999.

48. Shi, F., N.R. Hughes, and G. Roberts, SSD Matching Using Shift-Invariant Wavelet Transform: Mechatronics Research Centre, University of Wales College, Newport, Allt-Yr-Yn Campus.

49. Smith, P., T. Drummond, and R. Cipolla. Segmentation of Multiple Motions by Edge Tracking between Two Frames. in British Machine Vision Conference. 2000.

50. Theisel, H., Exact Isosurfaces for Marching Cubes. Computer Graphics Forum, 2002. 21(1): p. 19-31.

51. Treece, G.M., R.W. Prager, and A.H. Gee, Regularised marching tetrahedra: Improved iso-surface extraction. 1998.


52. Trucco, E. and A. Verri, Introductory Techniques for 3-D Computer Vision. 1998: Prentice Hall.

53. Tsai, R.Y., A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV camera and lenses. IEEE Trans. Robot Automation, 1987. RA-3: p. 323-344.

54. Viola, P. and M. Jones, Rapid Object Detection using a Boosted Cascade of Simple Features. Accepted Conference on Computer Vision and Pattern Recognition, 2001.

55. Wild, D., Realtime 3D Reconstruction From Stereo. 2003, University of York.

56. Xu, L.-Q., B. Lei, and E. Hendriks, Computer vision for a 3-D visualisation and telepresence collaborative working environment. BT Technology Journal, 2002. 20(1): p. 64-74.

57. Yambor, W., B. Draper, and R. Beveridge, Analyzing PCA-based Face Recognition Algorithms: Eigenvector Selection and Distance Measures. 2000.

58. Yan, J. and H. Zhang, Synthesized Virtual View-Based EigenSpace for Face Recognition. 1997.

59. Zou, J., P.-J. Ku, and L. Chen, 3D Face Reconstruction Using Passive Stereo. ECSE 6650 - Computer Vision, 2001.