Real-Time Face Tracking with GPU Acceleration




CUDA Accelerated Face Recognition

Numaan A.
Third Year Undergraduate

Department of Electrical Engineering
Indian Institute of Technology, Madras, India

[email protected]

Sibi A.
NeST-NVIDIA Centre for GPU Computing

NeST, India
[email protected]

July 26, 2010

Abstract

This paper presents a study of the efficiency and performance speedup achieved by applying Graphics Processing Units to face recognition solutions. We explore one possibility for parallelizing and optimizing a well-known face recognition algorithm, Principal Component Analysis (PCA) with Eigenfaces.

1 Introduction

In recent years, the Graphics Processing Unit (GPU) has been the subject of extensive research, and the computation speed of GPUs has been rapidly increasing. The computational power of the latest generation of GPUs, measured in FLOPS¹, is several times that of a high-end CPU, and for this reason they are increasingly used for non-graphics applications, or general-purpose computing (GPGPU). Traditionally, this power could only be harnessed through graphics APIs and was used primarily by professionals familiar with those APIs.

¹ Floating-point operations per second.

CUDA (Compute Unified Device Architecture), developed by NVIDIA, addresses this issue by introducing a familiar C-like development environment for GPGPU programming, allowing programmers to launch hundreds of concurrent threads on the "massively" parallel NVIDIA GPUs with very little software overhead. This paper describes our efforts to use this power to tame a computationally intensive, yet highly parallelizable, PCA-based algorithm used in face recognition solutions. We developed both serial CPU code and parallel GPU code to compare the execution time in each case and measure the speedup achieved by the GPU over the CPU.
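To give a flavor of this programming model, here is a minimal, illustrative CUDA kernel (not taken from the paper): each of many concurrent threads handles one array element, and the host launches them all with a single C-like call.

    #include <cuda_runtime.h>

    // Each thread scales and accumulates one element of the array.
    __global__ void saxpy(float a, const float *x, float *y, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    // Host side: launch enough 256-thread blocks to cover all n elements.
    // saxpy<<<(n + 255) / 256, 256>>>(2.0f, d_x, d_y, n);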

2 PCA Theory

2.1 Introduction

Principal Component Analysis (PCA) is one of the earliest and most successful techniques used for face recognition. PCA aims to reduce the dimensionality of data so that it can be economically represented and processed. The information contained in a human face is highly redundant, with each pixel highly correlated to its neighboring pixels, and the main idea of using PCA for face recognition is to remove this redundancy and extract the features required for comparing faces. To increase the accuracy of the algorithm, we use a slightly modified method called Improved PCA, in which the images used for training are grouped into classes, each class containing multiple images of a single person with different facial expressions. The mathematics behind PCA is described in the following subsections.

2.2 Mathematics of PCA

First, the 2-D facial images are resized using bilinear interpolation to reduce the dimension of the data and increase the speed of computation. Each resized image is represented as a 1-D vector by concatenating its rows into one long vector. Suppose we have M training samples per class, each of size N (the total number of pixels in the resized image). Let the training vectors be represented as $x_i$, where the $p_j$ are pixel values.

$$x_i = [\,p_1 \;\ldots\; p_N\,]^T, \qquad i = 1, \ldots, M$$

The training vectors are then normalized by subtracting the class mean image $m$ from them.

$$m = \frac{1}{M} \sum_{i=1}^{M} x_i$$

Let $w_i$ denote the normalized images.

$$w_i = x_i - m$$

The matrix $W$ is formed by placing the normalized image vectors $w_i$ side by side. The eigenvectors and eigenvalues of the covariance matrix $C$ are then computed.

$$C = W W^T$$

The size of $C$ is $N \times N$, which can be enormous; for example, images of size $16 \times 16$ give a covariance matrix of size $256 \times 256$. It is not practical to solve for the eigenvectors of $C$ directly. Hence the eigenvectors of the surrogate matrix $W^T W$, of size $M \times M$, are computed, and the first $M-1$ eigenvectors and eigenvalues of $C$ are given by $W d_i$ and $\mu_i$, where $d_i$ and $\mu_i$ are the eigenvectors and eigenvalues of the surrogate matrix, respectively.
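The surrogate trick rests on a one-line identity: premultiplying the surrogate eigenvalue equation by $W$ shows that $W d_i$ is an eigenvector of $C = W W^T$ with the same eigenvalue.

$$W^T W d_i = \mu_i d_i \;\Longrightarrow\; C\,(W d_i) = W\,(W^T W d_i) = \mu_i\,(W d_i)$$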

The eigenvectors corresponding to non-zero eigenvalues of the covariance matrix form an orthonormal basis for the subspace within which most of the image data can be represented with minimum error. The eigenvector associated with the highest eigenvalue reflects the greatest variance in the image. These eigenvectors are known as eigenimages or eigenfaces, and when normalized they look like faces. The eigenvectors of all the classes are computed similarly, and all of them are placed side by side to make up the eigenspace $S$.

The mean image of the entire training set, $m'$, is computed, and each training vector $x_i$ is normalized. The normalized training vectors $w'_i$ are projected onto the eigenspace $S$ to obtain the projected feature vectors $y_i$ of the training samples.

$$w'_i = x_i - m'$$

$$y_i = S^T w'_i$$

The simplest method for determining which class a test face falls under is to find the class $k$ that minimizes the Euclidean distance. The test image is projected onto the eigenspace, and the Euclidean distance between the projected test image and each of the projected training samples is computed. If the minimum Euclidean distance falls under a predefined threshold $\theta$, the face is classified as belonging to the class that contained the feature vector yielding the minimum distance.
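In symbols, with $y$ denoting the projected test image, the matching rule reads:

$$k^\ast = \arg\min_i \,\lVert y - y_i \rVert, \qquad \text{match accepted iff } \lVert y - y_{k^\ast} \rVert < \theta$$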

3 CPU Implementation

3.1 Database

For our implementation, we use one of the most common databases for testing face recognition solutions, the ORL Face Database (now distributed as the AT&T Database of Faces). It contains 400 grayscale images of resolution 112 × 92: 40 subjects with 10 images each. The images were taken under varying conditions: different times, angles, expressions (happy, angry, surprised, etc.) and facial details (with/without spectacles, with/without beard, different hair styles, etc.). To truly show the power of a GPU, the image database has to be large, so in our implementation we scaled the ORL Database multiple times by copying and concatenating it to create bigger databases. This allowed us to measure GPU performance for databases as large as 15,000 images.

3.2 Training phase

The most time-consuming operation in the training phase is the extraction of feature vectors from the training samples by projecting each one of them onto the eigenspace. The computation of the eigenfaces and eigenspace is relatively less intensive, since we use the surrogate-matrix workaround to decrease the size of the matrix and compute the eigenvectors with ease. Hence we decided to parallelize only the projection step of the training process. The steps up to the projection of the training samples are done in MATLAB, and the projection is written in C.

The MATLAB routine acquires the images from files, resizes them to a standard 16 × 16 resolution, and computes the eigenfaces and eigenspace. The data required for the projection step, namely the resized training samples, the database mean image and the eigenvectors, are then dumped into a binary file which is later read by the C routine to complete the training process. The C routine reads the data dumped by the MATLAB routine and extracts the feature vectors by normalizing the resized training samples and projecting each of them onto the eigenspace. With this the training process is complete, and the feature vectors are dumped into a binary file to be used in the testing process.

3.3 Testing phase

The entire testing process is written in C. OpenCV is used to acquire the test images from file and resize them to the standard 16 × 16 resolution using bilinear interpolation. The resized image is normalized with the database mean image and projected onto the eigenspace computed in the training phase. The Euclidean distances between the test image's feature vector and the training samples' feature vectors are computed, and the index of the feature vector yielding the minimum Euclidean distance is found. The face that yielded this feature vector is the most probable match for the input test face.
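As a rough sketch of this serial pipeline (not the paper's actual code), the projection and nearest-neighbor search might look as follows in C; the array names, the column-major layout of the eigenspace S, and the row layout of the projected training samples Y are our own illustrative assumptions.

    #include <float.h>
    #include <stdlib.h>

    /* Serially project a resized test image and find the closest training
     * sample. N = pixels per image, K = feature-vector length, M = number
     * of training samples. Returns the index of the best match. */
    int classify(const float *test, const float *mean, const float *S,
                 const float *Y, int N, int K, int M)
    {
        float *y = (float *)malloc(K * sizeof(float));

        /* y = S^T (test - mean): one inner product per feature element */
        for (int j = 0; j < K; ++j) {
            float sum = 0.0f;
            for (int p = 0; p < N; ++p)
                sum += S[j * N + p] * (test[p] - mean[p]);
            y[j] = sum;
        }

        /* nearest training feature vector by squared Euclidean distance */
        int best = -1;
        float bestDist = FLT_MAX;
        for (int i = 0; i < M; ++i) {
            float d = 0.0f;
            for (int j = 0; j < K; ++j) {
                float diff = y[j] - Y[i * K + j];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = i; }
        }
        free(y);
        return best;
    }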

4 GPU Implementation

4.1 Training phase

4.1.1 Introduction

As mentioned in Section 3.2, only the final feature-extraction step of the training phase is parallelized. Before the training samples can be projected, all the data required for the projection process is copied to the device's global memory, and the time taken for copying is noted. As all the data is read-only, it is bound as texture to take advantage of the cached texture memory.
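Concretely, with the texture-reference API current at the time of writing, the binding might look like the following sketch; the buffer name d_eig and its size are illustrative assumptions.

    // File-scope texture reference; reads through it go via the texture cache.
    texture<float, 1, cudaReadModeElementType> texEig;

    // Host side: copy the (read-only) eigenvectors to global memory,
    // then bind that buffer to the texture reference.
    float *d_eig;
    cudaMalloc((void **)&d_eig, nBytes);
    cudaMemcpy(d_eig, h_eig, nBytes, cudaMemcpyHostToDevice);
    cudaBindTexture(NULL, texEig, d_eig, nBytes);

    // Device side: float v = tex1Dfetch(texEig, i);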

4.1.2 Kernel

The projection process is highly parallelizable, and can be parallelized in two ways. Threads can be launched to parallelize the computation of a particular feature vector, wherein each thread computes a single element of the feature vector. Or, threads can be launched to parallelize the projection of multiple training samples, wherein each thread projects and computes the feature vector of a particular training sample. Since the number of training samples is large, the latter is adopted for the projection operation in the training phase. We adopt the former in the testing phase, where only one image has to be projected; the details are explained in Section 4.2.

Before the projection kernel is called, the execution configuration is set. The number of threads per block, $T_1$, is set to a standard 256, and the total number of blocks is $B_1 = \lceil N_1 / T_1 \rceil$, where $N_1$ is the total number of training samples.


Each thread projects and computes the feature vector of a particular training sample by serially computing the elements of the feature vector one by one. Each element of the feature vector is obtained by taking the inner product of the training image vector and the corresponding eigenvector in the eigenspace. The training sample is normalized with the database mean image element by element as it is fetched from texture memory, and the intermediate sum of the inner product with the eigenvector is stored in shared memory. After each element of the feature vector is computed, the result is written back to global memory and the next element is computed. All the data is aligned in a columnar fashion to avoid uncoalesced memory accesses and shared memory bank conflicts.
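A hedged sketch of this kernel is shown below (the paper's actual source is not reproduced here); the texture references, the 256-thread block size from the execution configuration above, and the columnar layouts are our assumptions from the text.

    texture<float, 1, cudaReadModeElementType> texImgs;  // resized samples
    texture<float, 1, cudaReadModeElementType> texMean;  // database mean
    texture<float, 1, cudaReadModeElementType> texEig;   // eigenspace S

    // One thread per training sample; each thread serially computes the
    // nFeat elements of its sample's feature vector.
    __global__ void projectTraining(float *feat, int nSamples,
                                    int nPixels, int nFeat)
    {
        int s = blockIdx.x * blockDim.x + threadIdx.x;   // sample index
        if (s >= nSamples) return;

        __shared__ float acc[256];       // per-thread running inner product
        for (int j = 0; j < nFeat; ++j) {
            acc[threadIdx.x] = 0.0f;
            for (int p = 0; p < nPixels; ++p) {
                float pix = tex1Dfetch(texImgs, s * nPixels + p);
                float mu  = tex1Dfetch(texMean, p);
                float e   = tex1Dfetch(texEig,  j * nPixels + p);
                acc[threadIdx.x] += (pix - mu) * e;  // normalize, then dot
            }
            feat[j * nSamples + s] = acc[threadIdx.x];   // columnar store
        }
    }

    // launch: projectTraining<<<B1, T1>>>(d_feat, N1, 256, K);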

Figure 1: Threads in Projection Kernel (Training)

After the kernel has finished running on the device, the entire result is copied back to host memory and dumped as a binary file to be used in the testing phase.

4.2 Testing phase

4.2.1 Introduction

The testing process is run entirely on the GPU and is handled by three kernels. The first kernel normalizes the test image, projects it onto the eigenspace, and extracts its feature vector. The second kernel computes, in parallel, the Euclidean distances between the feature vector of the test image and those of the training images. The final kernel finds the minimum of the Euclidean distances and the index of the training sample which yielded that minimum. The resized test image, the database mean image, the eigenvectors and the projected training samples are first copied to device memory, and the test image, mean image and eigenvectors are bound as texture to take advantage of the cached texture memory. Due to the relatively larger size of the projected training-sample data and the limit on maximum texture size, the projected training samples are not bound as texture.

Figure 2: Recognition pipeline

4.2.2 Projection Kernel

As mentioned in Section 4.1.2, the projection kernel in the testing process is parallelized to compute the elements of the feature vector concurrently. The number of threads per block, $T_2$, is set to a standard 256, and the total number of blocks is $B_2 = \lceil N_2 / T_2 \rceil$, where $N_2$ is the size of the feature vector.

Each thread computes one element of the feature vector, which is obtained by taking the inner product of the test image vector and the corresponding eigenvector in the eigenspace. The test image is normalized with the database mean image element by element as it is fetched from texture memory, and the intermediate sum of the inner product with the eigenvector is stored in shared memory.

Figure 3: Threads in Projection Kernel (Testing)

After the entire feature vector is computed, the data is written back to global memory. The columnar alignment of all the eigenvectors avoids uncoalesced memory accesses and shared memory bank conflicts.
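A corresponding sketch (again under our assumed texture bindings and layouts) inverts the mapping used in training: one thread per feature-vector element rather than one per sample.

    texture<float, 1, cudaReadModeElementType> texTest;  // resized test image

    // One thread per feature element: each computes one inner product of
    // the normalized test image with the corresponding eigenvector.
    __global__ void projectTest(float *feat, int nPixels, int nFeat)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x;   // feature element
        if (j >= nFeat) return;

        __shared__ float acc[256];      // intermediate sums, per the text
        acc[threadIdx.x] = 0.0f;
        for (int p = 0; p < nPixels; ++p) {
            float pix = tex1Dfetch(texTest, p);
            float mu  = tex1Dfetch(texMean, p);
            float e   = tex1Dfetch(texEig, j * nPixels + p);
            acc[threadIdx.x] += (pix - mu) * e;
        }
        feat[j] = acc[threadIdx.x];      // write back to global memory
    }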

4.2.3 Euclidean Distance Kernel

The kernel for computing the Euclidean distances is very similar to the projection kernel used in the training phase. Threads are launched to concurrently compute the Euclidean distances between the test image's feature vector and the training samples' feature vectors. The number of threads per block, $T_3$, is set to a standard 256, and the total number of blocks is $B_3 = \lceil N_3 / T_3 \rceil$, where $N_3$ is the total number of training samples.

Each thread computes a particular Euclidean distance serially. The differences between corresponding elements of the two vectors are computed, squared and summed, with the intermediate sum stored in shared memory. After all the Euclidean distances are computed, the data is written from shared memory to global memory. The columnar alignment of the training-sample feature vectors avoids uncoalesced memory accesses and shared memory bank conflicts.
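A sketch under the same assumptions follows; note the columnar indexing of the training feature vectors, which is what gives the coalesced accesses described above.

    // One thread per training sample: accumulate the squared distance
    // between the test feature vector and that sample's feature vector.
    __global__ void euclidean(const float *testFeat, const float *trainFeat,
                              float *dist, int nSamples, int nFeat)
    {
        int s = blockIdx.x * blockDim.x + threadIdx.x;   // sample index
        if (s >= nSamples) return;

        __shared__ float acc[256];
        acc[threadIdx.x] = 0.0f;
        for (int j = 0; j < nFeat; ++j) {
            float d = testFeat[j] - trainFeat[j * nSamples + s]; // columnar
            acc[threadIdx.x] += d * d;
        }
        dist[s] = acc[threadIdx.x];  // sqrt omitted: it preserves the argmin
    }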

Figure 4: Threads in Euclidean Distance Kernel

4.2.4 Minimum Kernel

The minimum kernel computes the minimum value of the Euclidean distances and its index. The vector containing the Euclidean distances is divided into smaller blocks, and each thread serially finds the value and index of the minimum within a particular block. The kernel is called iteratively with fewer and fewer threads until only one block is left. After execution of the kernel, the minimum value and its index are copied back to host memory. The training sample at the index computed by the kernel is the most probable match for the test image.
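A sketch of one iteration is below; CHUNK, the paired value/index arrays and the NULL convention on the first pass are our own illustrative choices. The host relaunches the kernel on the shrinking output until a single (value, index) pair remains.

    #define CHUNK 256   // elements scanned serially by each thread

    // Each thread finds the minimum value and its index within its chunk.
    __global__ void minReduce(const float *val, const int *idx,
                              float *outVal, int *outIdx, int n)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        int start = t * CHUNK;
        if (start >= n) return;

        float best = val[start];
        int bestIdx = idx ? idx[start] : start;  // idx == NULL on first pass
        int end = min(start + CHUNK, n);
        for (int i = start + 1; i < end; ++i)
            if (val[i] < best) {
                best = val[i];
                bestIdx = idx ? idx[i] : i;
            }
        outVal[t] = best;
        outIdx[t] = bestIdx;
    }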

5 Performance Statistics

To test the performance of the CPU and GPU, 5 images per subject from the ORL Database were selected as the training set. This set of 200 images was then replicated and concatenated to create databases ranging in size from 1,000 to 15,000 images. The eigenvectors corresponding to the 4 highest eigenvalues per class were selected to form the eigenspace; this led to feature vectors that grew in size as the database grew. The replicated databases were trained with the CPU and the GPU, the execution times were noted, and the GPU speedup for the training process was calculated.


For testing, one image per subject from the ORL Database was selected, and the total time taken by the CPU and GPU to test all 40 test images was noted and used to calculate the GPU speedup for the testing process.

To get accurate performance statistics, the training and testing processes were run multiple times on different CPUs and GPUs. The following graphs were plotted with the data obtained from the performance tests. All CPU times are based on single-core performance.

Fig. 5 shows the time taken by three different CPUs to execute the projection process during training.

Figure 5: Training time for different CPUs

Fig. 6 shows the time taken by different NVIDIA GPUs to execute the projection process during training. It includes the time taken for data transfers to and from the device.

Figure 6: Training time for different GPUs

Fig. 7 shows the performance speedup of different GPUs over the Intel Core 2 Quad Q9550 CPU when training databases of varying sizes.

Figure 7: Training Speedup

Fig. 8 shows the total time taken by different CPUs to test 40 images.


Figure 8: Testing using CPU

Fig. 9 shows the total time taken by GPUs to test 40 images. For this test, the trained database was copied to device memory once, and the 40 images were tested one by one. The time includes transferring each test image to the device and getting the match index back from the device.

Figure 9: Testing using GPU

Fig. 10 shows the performance speedup of different GPUs over the Intel Core 2 Quad Q9550 CPU when testing 40 images.

Figure 10: Testing Speedup

Fig. 11 shows the execution time of the recognition pipeline on the CPU for varying database sizes.

Figure 11: Recognition pipeline on CPU


Fig. 12 shows the execution time of the recognition pipeline on the GPU for varying database sizes. This is the time taken to transfer the test image to the device, find the match, and transfer its index back to the host.

Figure 12: Recognition Pipeline on GPU

Fig. 13 shows the performance speedup of different GPUs over the Intel Core 2 Quad Q9550 CPU when executing the recognition pipeline.

Figure 13: Recognition Pipeline Speedup

6 Conclusion

The recognition rate of a PCA-based face recognition solution depends heavily on the exhaustiveness of the training samples: the higher the number of training samples, the higher the recognition rate. But as the number of training samples increases, CPUs become highly strained and the training process takes several minutes to complete (see Fig. 5). The same process, when run on a GPU, completes in a matter of seconds (see Fig. 6). The highest speedups, achieved on the latest GeForce GTX 480 GPU with a database of 15,000 images, were 207x for the training process, 330x for the recognition pipeline and 165x for the overall testing process.

The execution time of the recognition pipeline on the GPU is on the order of a few milliseconds even for very large databases, which allows GPU-based testing to be integrated with real-time video and used for other applications involving large volumes of test images. Our primary purpose in writing this paper is to make clear the high performance boosts that can be obtained by developing GPU-based face recognition solutions.


7 Future Work

Our future plans in this field include the parallelization of other face recognition algorithms, such as LDA (Linear Discriminant Analysis), and the replacement of the Euclidean-distance-based matching process with a neural-network-based one. We feel that algorithms with a high degree of inherent parallelism, like neural networks, will benefit immensely if implemented on the GPU. We are also working on integrating the GPU recognition pipeline with real-time video.
