Real-Time Masked Face Recognition Using
Machine Learning Support Vector Machine (SVM)
BASSAM AL-ASADI
Master’s thesis
November 2020
Information Technology
Full Stack Software Development
Description
Author(s) Al-Asadi, Bassam
Type of publication Master’s thesis
Date November 2020
Language of publication: English
Number of pages 88
Permission for web publication: x
Title of publication Real-Time Masked Face Recognition Using Machine Learning
Support Vector Machine (SVM)
Degree programme Full Stack Software Development
Supervisor(s) Huotari, Jouni. Kotikoski, Sampo.
Assigned by
Abstract
An enormous number of robust face recognition systems have been deployed to help authorities and commercial companies recognize people. While these systems work very well, a new problem emerged that forced people everywhere to wear medical face masks to prevent the spread of the COVID-19 pandemic. These systems began to fail in their predictions, which led to many problems. The efficiency of facial recognition systems can deteriorate significantly due to occlusions such as medical masks, hats, facial hair, and sunglasses. Major companies started gathering photographs of people wearing medical masks posted on their accounts to develop their facial recognition technologies, and they are struggling to keep the recognition technology up to date and appropriate.
Essential data were collected from theories, observations, and reviews to develop methods and solve the recognition problem using quantitative experimental research. These experiments were applied in a restricted real-time streaming system. In this thesis, the author aims to develop a model that can detect faces with and without a mask and classify each face by the identity it belongs to. Detection was the first step: extracting faces from a frame image to feed them into the recognition phase. The recognition model was trained on different pre-trained object recognition models with the same data and evaluated in multiple environments to achieve good accuracy for a limited set of identities.
Keywords/tags (subjects) Face Recognition, Deep Learning, Machine Learning, Datasets, Support Vector Machine (SVM)
Miscellaneous (Confidential information)
Abbreviations and Acronyms
AAM Active Appearance Models
AI Artificial Intelligence
ANN Artificial Neural Network
AFW Annotated Face in-the-Wild dataset
CNN Convolutional Neural Network
ConvL Convolutional layer
ConvNet Convolutional Network
CPU Central Processing Unit
DFW Disguised Faces in the Wild dataset
DL Deep Learning
DNN Deep Neural Network
EBGM Elastic Bunch Graph Matching
FERET Facial Recognition Technology dataset
FR Facial Recognition
GPU Graphics Processing Unit
HMM Hidden Markov Models
HPO Head Poses and Occlusion
IBUG Tight face bounding box dataset
IOU Intersection over Union
JFA Joint Head Pose Estimation and Face Alignment Framework
L1-SVM Linear norm Support Vector Machine
L2-SVM Square norm Support Vector Machine
LAG Large Age-Gap Database
LBF Local Binary Features
LDA Linear Discriminant Analysis
LFPW Labelled Face Parts in the Wild
ML Machine Learning
OFR Occluded Face Recognition
PCA Principal Component Analysis
R-CNN Region-Based Convolutional Neural Networks
ReLU Rectified Linear Unit
ROI Regions of Interest
RPN Region Proposal Network
SCfaceDB Surveillance Cameras Face Database
SIFT Scale-Invariant Feature Transform
SOF Specs on Faces dataset
SVM Support Vector Machine
XM2VTS Extended Multi-modal face database
YOLO You Only Look Once
Contents
1 INTRODUCTION ........................................................................................................................ 5
2 RESEARCH ................................................................................................................................. 7
2.1 PURPOSE ....................................................................................................................................... 7
2.2 OBJECTIVES .................................................................................................................................... 7
2.3 RESEARCH QUESTIONS ...................................................................................................................... 8
2.4 RESEARCH METHODS ........................................................................................................................ 9
3 BACKGROUND ........................................................................................................................ 11
3.1 NEURAL NETWORKS ....................................................................................................................... 11
3.1.1 Deep FeedForward Networks .......................................................................................... 12
3.1.2 Activation Function ......................................................................................................... 13
3.1.4 Optimization (Gradient Descent) .................................................................................... 16
3.1.5 Backward Propagation .................................................................................................... 18
3.2 CONVOLUTIONAL NEURAL NETWORKS (CNN) .................................................................................... 19
4 FACE RECOGNITION ALGORITHMS ......................................................................................... 24
4.1 AAM - ACTIVE APPEARANCE MODELS ............................................................................................... 25
4.2 HMM - HIDDEN MARKOV MODELS.................................................................................................. 26
4.3 PCA - PRINCIPAL COMPONENT ANALYSIS ............................................................................................ 28
4.4 LDA - LINEAR DISCRIMINANT ANALYSIS ............................................................................................. 30
4.5 EBGM - ELASTIC BUNCH GRAPH MATCHING ...................................................................................... 31
5 STANDARD BENCHMARKS...................................................................................................... 33
5.1 FERET DATABASE ......................................................................................................................... 33
5.2 SCFACEDB LANDMARKS ................................................................................................................. 34
5.3 SPECS ON FACES (SOF) DATASET ...................................................................................................... 35
5.4 LARGE AGE-GAP DATABASE (LAG) ................................................................................................... 36
5.5 DISGUISED FACES IN THE WILD (DFW) .............................................................................................. 37
5.6 EURECOM VISIBLE AND THERMAL PAIRED FACE DATABASE .................................................................. 39
6 FACE RECOGNITION PIPELINE ................................................................................................. 40
6.1 DETECTION AND LOCALIZATION ........................................................................................................ 40
6.1.1 Viola-Jones ...................................................................................................................... 41
6.1.2 You Only Look Once (YOLO) ............................................................................................. 42
6.1.3 Faster R-CNN ................................................................................................................... 44
Summary ....................................................................................................................................... 47
6.2 ALIGNMENT .................................................................................................................................. 48
6.2.1 Supervised Descent Method and its Applications to Face Alignment ............................. 50
6.2.2 Face Alignment at 3000 FPS via Regressing Local Binary Features ................................. 51
6.2.3 Robust Facial Landmark Detection under Significant Head Poses and Occlusion ........... 52
6.2.4 Joint Head Pose Estimation and Face Alignment Framework ......................................... 53
Summary ....................................................................................................................................... 55
7 EXPERIMENT AND RESULT ..................................................................................................... 56
7.1 DETECTION AND EXTRACTION........................................................................................................... 57
7.1.1 Data ................................................................................................................................. 57
7.1.2 Implementation ............................................................................................................... 59
7.2 RECOGNITION ............................................................................................................................... 65
7.2.1 Landmarks ....................................................................................................................... 66
7.2.2 Visible features embedding ............................................................................................. 67
7.2.3 Classification ................................................................................................................... 69
8 DISCUSSION .................................................................................................................... 71
8.1 ANSWERS TO RESEARCH QUESTIONS ...................................................................................................... 71
8.2 CONCLUSION .................................................................................................................................... 74
8.3 RECOMMENDATION FOR FUTURE WORK ................................................................................................. 76
8.4 SUMMARY ....................................................................................................................................... 77
REFERENCES ..................................................................................................................................... 78
APPENDICES ..................................................................................................................................... 83
Figures
FIGURE 1. SIMPLE NEURAL NETWORK. ........................................................................................................ 12
FIGURE 2. NEURAL NETWORK STRUCTURE ................................................................................................... 13
FIGURE 3. LINEAR ACTIVATION FUNCTION. ................................................................................................... 14
FIGURE 4. GRAPH OF SIGMOID, TANH, AND RELU FUNCTIONS (NON-LINEAR ACTIVATION FUNCTION). ..................... 15
FIGURE 5. FIVE ITERATIONS OF GRADIENT DESCENT ....................................................................................... 17
FIGURE 6. BACKWARD PROPAGATION ......................................................................................................... 18
FIGURE 7. THE STRUCTURE OF A CNN, CONSISTING OF CONVOLUTIONAL, POOLING, AND FULLY-CONNECTED LAYERS. . 20
FIGURE 8. MAX POOLING AND AVERAGE POOLING. ....................................................................................... 21
FIGURE 9. FULLY-CONNECTED LAYER ........................................................................................................... 21
FIGURE 10. CONVOLUTION LAYER OPERATION. ............................................................................................. 22
FIGURE 11. SHAPE AND LABELLED IMAGE. .................................................................................................... 25
FIGURE 12. SAMPLE OF TRAINING DATA FOR ERGODIC HMM 2- LEFT-TO-RIGHT MODELS. ................................... 27
FIGURE 13. SAMPLE OF TRAINING DATA FOR TOP-TO-BOTTOM HMM. .............................................................. 28
FIGURE 14. LDA INFLUENCE ON THE DATA TO SEPARATE THE CLASSES, CONSIDERING EACH COLOUR IS A VARIABLE. .... 30
FIGURE 16. EXAMPLE OF DIFFERENT CATEGORIES OF PHOTOS FOR ONE INDIVIDUAL. ............................................. 34
FIGURE 17. EXAMPLE OF DIFFERENT POSE IMAGES. ........................................................................................ 35
FIGURE 18. SAMPLES OF THE SPECS ON FACES (SOF) DATASET. ....................................................................... 36
FIGURE 19. EXAMPLES OF FACE CROPS FOR MATCHING PAIRS. .......................................................................... 36
FIGURE 20. SAMPLE IMAGES OF THREE SUBJECTS FROM THE DFW DATASET. ...................................................... 37
FIGURE 21. VISIBLE AND THERMAL FACE IMAGES. .......................................................................................... 39
FIGURE 22. RECTANGLE FEATURES FOR OBJECT DETECTION. ............................................................................. 42
FIGURE 23. YOLO BOUNDING BOXES, CONFIDENCE, AND CLASS PROBABILITY MAP............................................... 43
FIGURE 24. INTERSECTION OVER UNION (IOU). ............................................................................................ 44
FIGURE 25. OBJECT DETECTION BY FASTER R-CNN. ...................................................................................... 45
FIGURE 26. FACE ALIGNMENT AND LANDMARK. ............................................................................................ 49
FIGURE 27. A COMPARISON OF GRADIENT DESCENT (GREEN) AND NEWTON'S METHOD (RED) FOR MINIMIZING A
FUNCTION. ............................................................................................................................................. 50
FIGURE 28. FACIAL LANDMARK DETECTION AND OCCLUSION PREDICTION IN DIFFERENT ITERATIONS. ........................ 52
FIGURE 29. THE TOP IMAGES USE (DLIB IMPLEMENTATION) AND BOTTOM IMAGES USE JFA ALGORITHM FOR LANDMARK
DETECTION. ............................................................................................................................................ 54
FIGURE 30. LFW DATASET WITH MEDICAL MASKS. ........................................................................................ 58
FIGURE 31. FACE MASK DETECTION DATASET. .............................................................................................. 58
FIGURE 32. MODELS COMPLEXITY (PARAMETERS SIZE) COMPARING TO TOTAL MEMORY UTILIZATION. ..................... 60
FIGURE 33. RED AND GREEN BOUNDING BOXES WITH DIFFERENT MODELS. ......................................................... 63
FIGURE 34. (A) ORIGINAL IMAGE (B) FACES' CROPPED WITH 224X224 PIXELS DIMENSION. .................................... 64
FIGURE 35. VALIDATION ACCURACY IMPLEMENTED BY THREE MODELS (INCEPTION V3, MOBILENETV2, VGG16) ON
MASKED FACES DATASET FOR 20 EPOCHS. ................................................................................................... 65
FIGURE 36. VALIDATION LOSS IMPLEMENTED BY THREE MODELS (INCEPTION V3, MOBILENETV2, VGG16) ON
MASKED FACES DATASET FOR 20 EPOCHS. ................................................................................................... 65
FIGURE 37. CROPPED THE LOCAL VISIBLE FEATURES FROM EXTRACTED FACE. ....................................................... 66
FIGURE 38. 2D FACE LANDMARKS (A) 68 LANDMARKS FOR THE ENTIRE FACE (B) 24 LANDMARKS FOR VISIBLE PARTS OF
THE FACE (C) CROPPED THE VISIBLE PARTS WITH THE LANDMARKS. .................................................................... 67
FIGURE 39. EMBEDDING A FACE IMAGE TO 128-DIMENSIONAL VECTOR. ........................................................... 69
Tables
TABLE 1: IMAGES IN THE TRAINING AND TESTING PARTITION. ........................................................................... 38
TABLE 2: AVERAGE ACCURACY OF FACE AND HEAD DETECTION ON THE FDDB DATASET AND CASABLANCA DATASET. .. 47
TABLE 3: AVERAGE TIME AND MEMORY COMPLEXITY FOR FACE DETECTION ON FDDB. ......................................... 48
TABLE 4: FACIAL LANDMARKS DETECTION ERROR. .......................................................................................... 55
TABLE 5: HEAD POSE VARIATIONS. .............................................................................................................. 56
TABLE 6: A CSV FILE REPRESENTS ALL THE ANNOTATIONS DETAILS FOR (FACE MASK DETECTION DATASET) IMAGES
NAME, DIMENSIONS, FACES' CATEGORY AND COORDINATES EXTRACTED FROM XML FILES IN THE DATASET. ............... 59
TABLE 7: FOUR EPOCHS FOR MASKED FACE DETECTION AND LOCALIZATION ON A SMALL DATASET, PRESENTING THE
VALIDATION ACCURACY AND LOSS FOR EACH MODEL. ...................................................................................... 61
TABLE 8: INCEPTIONV3, MOBILENETV2, AND VGG16 PARAMETERS WITH NUMBER OF CONVLS. .......................... 62
TABLE 9: ACCURACY AND STANDARD DEVIATION FOR FIVE DIFFERENT MODELS OVER MASKED FACES DATASET. .......... 70
1 Introduction
Artificial Intelligence and neural networks have already transformed internet technologies across the globe from something useful into something significant in our lives. Artificial intelligence innovations have intervened in all fields, ranging from improved health care, where machines can be better suited to cancer detection than any doctor, to self-driving cars that are considered safer than human drivers, not to mention the practical assistance of AI in simulation, measuring, monitoring, and resource management for climate change, conservation, and the environment.
Artificial neural networks are among the most important technologies ever developed that can operate without human intervention. In conventional programming techniques, we tell the machine the steps for breaking big problems down into small ones in different scenarios, and the device then uses its computational capabilities to help users manage data more rapidly and effectively. Machine learning, by contrast, uses a massive amount of data to develop a model that can classify and predict, solving complicated problems without human interference. We therefore train the machine and create various models that predict the outcome with high accuracy, which is a significant aid in accomplishing complicated missions. The computer can find a solution to many problems when supplied with an appropriate amount of labelled data and trained with supervised learning techniques.
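As a minimal illustration of supervised learning on labelled data, the following sketch uses scikit-learn's SVC, since the Support Vector Machine is the classifier family central to this thesis. The toy feature vectors and identity labels are invented purely for illustration; they are not the thesis data.

```python
from sklearn.svm import SVC

# Toy labelled data: each row is a feature vector, each label an identity.
# (Hypothetical values chosen only to demonstrate the fit/predict workflow.)
X = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
y = ["alice", "alice", "bob", "bob"]

clf = SVC(kernel="linear")  # a linear Support Vector Machine
clf.fit(X, y)               # learn a separating pattern from the labelled data

print(clf.predict([[0.85, 0.95]]))  # prints ['bob']
```

Given labelled examples, the classifier learns a decision boundary and can then assign an identity to a previously unseen feature vector.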
Nonetheless, the modern world has had its share of problems not in space but on the ground. Over the last twenty years, we have faced a spike in global terrorism. This problem affects law-enforcement agencies at airports, border crossings, and ports across the globe that handle millions of people travelling daily.
According to the United States Department of Transportation, around 900 million passengers travelled through airports in 2019; hence, managing the traffic flow by humans was a complicated task to accomplish. To avoid security breaches and human mistakes, governments began funding agencies and companies to develop these technologies, to enhance human work, and to fulfil the growing need for improved artificial intelligence. Since iris and fingerprint recognition are slow and not safe enough for the task, authorities started to use a new biometric recognition method called "Face Recognition", and these techniques have become popular and ubiquitous in the last five years.
At the end of 2019, the world was confronted with a problem unprecedented in the history of humanity: the coronavirus pandemic (COVID-19). The COVID-19 virus affects the respiratory system and spreads through close contact. As a result, most people began to use face masks for protection; thus, facial recognition systems struggle to identify faces wearing medical masks.
This thesis aims to explore and test a new detection model and evaluation algorithms and their efficacy on masked faces. The author creates a masked faces dataset by collecting images from the internet for use with the recognition models, and a masked face detection dataset is used to train different models to detect masked faces and recognize them. It was a complicated problem, since there were no labelled datasets of masked faces on which to train the recognition model, nor was there enough earlier research on this subject for comparison and evaluation.
The methodology of developing artificial intelligence systems is correlated with full-stack development approaches in several respects, such as providing front-end applications to the end users and hosting the AI models on a back end built to run on a server. In general, FR systems use many software development methodologies to improve system performance, such as the agile approach, which lets facial recognition systems be deployed in easily manageable environments and delivers results quickly and under constant control. Agile software development is appropriate for computer vision systems because end users need to be involved early for further examination, adjustments, improvements, and, finally, evaluation of the facial recognition models. Many developers nowadays use facial recognition technologies to grant access to various applications.
2 Research
2.1 Purpose
Face recognition is one of the computer vision problems that has gained a lot of attention in the previous decade. Many researchers have contributed to developing and innovating new theories to improve computer vision and have applied these theories in real applications. Various studies have been published on face recognition, focusing on developing methods to ascertain the presence of a face and recognize it. This study aims to clarify the obstacles of detection and identification and to introduce an approach capable of developing a new recognition system method for both circumstances (full-visible features and half-visible features).
2.2 Objectives
The thesis aims to develop a new facial recognition system capable of recognizing an individual wearing a medical mask. To understand how the facial recognition system functions, the author divides the issue into four stages. The purpose of the division is to make the problem understandable and open to experiment by taking each stage on its own and clarifying its characteristics and the experiments that need to be carried out. The thesis therefore presents these steps, beginning with some popular algorithms used previously to implement facial recognition systems, alongside the datasets used to build and evaluate them. The next step is to research the detection and alignment processes, to understand the mechanisms behind them and how they can be applied in the system.
2.3 Research Questions
According to Dawson R. Hancock and Bob Algozzine (2017, 3), various types of questions (What? How? Why?) have driven scholars to explore why things have happened and to create more specific approaches. Usually, when specialists analyse a topic, they are seeking feedback for a better understanding of the subject, alternative scenarios for analysis, and possible explanations for review. This feedback and these questions have driven the study to draw conclusions that are reliable, practical, and interpretable.
According to Hancock et al. (2017, 4), a research effort should not be carried out without an organizational paradigm. This paradigm lays out for the researcher the distinguishing characteristics of the study and the possibilities for obtaining answers to the questions. Therefore, the author determines a paradigm to conduct systematic research that investigates the study's topic by identifying three critical pillars to design the research process:
• Study methods
• Gathering information
• Confirmation of results
These three essential pillars drove the study to comprehend research methods, data, and forms of analysis, to illustrate the OFR stages, and to understand the data required to build a highly accurate model. Following the previous steps, the research questions based on this triple paradigm are:
1. What are the best approaches used to develop an OFR system?
2. How do the quantity and quality of data affect the OFR system?
3. How does the OFR system detect and extract visual facial features?
4. How is the OFR system's performance evaluated?
The questions lead the study to dive deeper into theories and algorithms to determine
the usefulness and viability of using them in the research.
2.4 Research methods
By addressing a question such as "What is this study trying to do?", the author came up with specific research methods to find the answer. The answer to that question, and the purpose of this study, is to devise a new approach to solving the OFR problem. Therefore, the study conducts quantitative research by setting up controlled experiments and collecting data through different resources.
According to Claes Wohlin, Martin Höst, and Kennet Henningsson (2003, 2), "quantitative data promotes comparisons and statistical analysis". By utilizing the literature available from the different contributors to the computer vision community, such as Stanford University, Google AI research labs, the Massachusetts Institute of Technology, and numerous researchers' works, the author was able to collect essential data for the research and enhance the system's performance by comparing and analysing the findings. The study considered the collected materials noteworthy, particularly the work of Gregory Koch (2015) in his thesis "Siamese Neural Networks for One-Shot Image Recognition", where Koch explored a method to classify the similarity between two figures, and the dissertation of Ali Sharif Razavian (2017), "Convolutional Network Representation for Visual Recognition", which described the representation of the convolutional network in visual recognition from an empirical perspective.
The study proceeds with testable background information to use as a basis for implementing the OFR system in real time and addressing the challenges. The research focuses on several different insights that combine to respond to the questions, clarify the working methods and mechanisms used to develop the algorithms, and present the study's recommendations. The study uses experiments, which require several variables to conduct and to review the effects. These variables, such as data, hypotheses, reviews, device performance, and environmental circumstances, are more critical in the research than the experiment itself.
Two quantitative research methods have been applied to conduct the experiments and analyse the findings. According to Claes Wohlin, Martin Höst, and Kennet Henningsson (2003, 9), the empirical research method can be mapped to the following steps: Definition, Planning, Operation, Analysis, and Conclusion. The objective of empirical research is to manipulate one or more variables while controlling all other variables at a fixed level, and the effect of the manipulation can be measured based on statistical results. Since we compare different types of methods in each stage and analyse the outcomes to calculate the ratios, the empirical research method was the proper pattern for this quantitative research. The data used for this study were collected from different resources (books, articles, reviews, and the accumulated experience of the author) and were obtained by using specific keywords such as artificial intelligence, neural network, machine learning, deep learning, computer vision, object detection and localisation, object recognition and comparison, algorithm reviews, and, finally, datasets.
The empirical research method is used to obtain the evidence and observe the scientific data from the experiments; these findings are then reviewed using the secondary data analysis research method. The empirical process begins by responding to the research concerns that need to be investigated in order to determine the direction of the research, highlighting the fundamental goals of the systematic investigation.
The second step is to reanalyse the previously collected data and compare them with the findings. Therefore, the author uses secondary data analysis, a widely used data collection technique in science research. According to Melissa P. Johnston (2014, 8), "The major advantages associated with secondary analysis are the cost-effectiveness and convenience it provides". Secondary analysis is the best technique for gathering data for several purposes, such as offering validation opportunities for replications, which is substantial in this study, as study findings are credible if they occur in a variety of other studies. The author needs to consider how the data are categorized and organized, how this might affect the results, and where necessary to adjust the data in order to conduct the study in the right way. The chosen methods bring some considerations for findings and evaluations, to implement new experiments and to determine the research process.
3 Background
This chapter discusses the theoretical framework used in the research, starting with the Deep Learning approach, which mimics certain aspects of human brain function in the recognition of objects, images, voices, or patterns, whether supervised or unsupervised. Let us start with neural networks.
3.1 Neural networks
In order to understand the functionality of the Neural network, it is essential to start
describing the single neuron, which is a mathematical process modelled as a
representation of biological neurons. The mathematical mechanism that occurs within
the artificial neuron is a necessary process for all machine learning algorithms.
According to Dilip Singh Sisodia, Ram Bilas Pachori, and Lalit Garg, (2020, 123) “A
neural network is a series of algorithms that endeavors to recognize underlying
relationships in a set of data through a process that mimics the way the human brain
operates”. Neural networks are a simplified representation of the architecture of
machine learning, representing a small network containing two layers (the hidden
layer and the output layer). Data flow from the input or previous layers, presented
as a one-dimensional column vector (tensor), which is processed inside the artificial
neuron using a mathematical equation, such as the logistic regression equation. Once
the parameters (weights as a two-dimensional tensor, bias as a one-dimensional
tensor) are added to the equation, the output of the
equation should be within a specified range, so an activation function is applied to the
equation to bound the output's value. Figure 1 shows how a single neuron operates
on data received from the one-dimensional tensor and how the forward propagation
phase is implemented. The architecture of the neural network has four primary phases,
which will be described in detail as the critical pillars of the development of deep
learning models, from the forward propagation step to backward propagation. These
phases are explained in detail by the author:
Figure 1. Simple Neural network.
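The single-neuron computation described above (a weighted sum of the inputs plus a bias, followed by an activation) can be sketched in NumPy as follows. This is an illustrative sketch, not code from the thesis; the sigmoid activation, input values, weights, and bias are all chosen only for demonstration:

```python
import numpy as np

def sigmoid(z):
    # squash any real value into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))

def neuron_forward(x, w, b):
    """Forward pass of a single artificial neuron:
    weighted sum of the inputs plus bias, then an activation."""
    z = np.dot(w, x) + b   # linear combination (logistic regression form)
    return sigmoid(z)      # bounded output

# Illustrative 3-input neuron
x = np.array([0.5, -1.0, 2.0])   # input as a one-dimensional tensor
w = np.array([0.2, 0.4, -0.1])   # weights
b = 0.1                          # bias
a = neuron_forward(x, w, b)      # output lies strictly between 0 and 1
```

The same computation repeated across many neurons, layer after layer, is exactly the forward propagation phase described in the next section.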
3.1.1 Deep FeedForward Networks
The term feedforward refers to the mathematical calculation of the intermediate
variables (weights and biases) applied to the input data, including the outputs of
the previous layer. As a result of this process, the coefficients' data are stored in
the neuron to be used again when the network processes the backward propagation.
Describing the architecture of feedforward networks (Figure 2) requires defining the
depth, width, and activation functions of each hidden layer. The depth is the number of
hidden layers, the width is the number of neurons in each hidden layer, and the
activation function is the function within each neuron. According to Ian Goodfellow,
Yoshua Bengio, and Aaron Courville (2016, 165), the name "deep learning" emerged
from the chain of functions used to build the feedforward network.
Figure 2. Neural network Structure
3.1.2 Activation Function
The activation function used in feedforward networks bounds the outcome's value
and determines the output behaviour of each node. Activation functions are
essential components of an artificial neural network, allowing it to learn complex,
non-linear mappings between inputs and response coefficients. The primary objective
of the activation function within each node is to process the output of the equation
and rescale the result into a specific range with respect to the input data.
Activation functions can be divided into two groups: linear and non-linear operations
(Figures 3 and 4).
In this section, the author lists some of the common activation functions used in deep
learning:
• Linear Activation Function
Linear activation functions multiply the input data by the intermediate coefficients
within each neuron and generate an output proportional to the input.
These functions have two significant disadvantages: they cannot be used for backward
propagation (the derivative is a constant), and all layers will collapse into one layer.
𝑓(𝑥) = 𝑥
Figure 3. Linear activation function.
(Source: https://towardsdatascience.com )
• Non-linear Activation Functions
Unlike linear activation functions, the purpose of these activation functions is to
generate an output within a limited range. The non-linear activation function ensures
that the neural network layers will not behave as a single layer; on the other hand,
backward propagation will run normally. In this section, the author describes three
non-linear activation functions:
1. Sigmoid or logistic activation function - the sigmoid function maps the
input value to a new value between 0 and 1.
𝑓(𝑥) = 1 / (1 + 𝑒⁻ˣ)
2. Tanh or hyperbolic tangent activation function - the hyperbolic tangent
function rescales the sigmoid function's output range to between -1 and 1.
𝑓(𝑥) = (𝑒ˣ − 𝑒⁻ˣ) / (𝑒ˣ + 𝑒⁻ˣ)
3. ReLU (Rectified Linear Unit) activation function - ReLU is not a linear function;
it often achieves results comparable to the sigmoid with superior computational
performance.
𝑓(𝑥) = max (0, 𝑥)
Figure 4. Graph of Sigmoid, Tanh, and Relu functions (non-linear activation function).
(Source: https://www.researchgate.net )
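The three non-linear functions above translate directly into NumPy. A minimal sketch, with an illustrative input vector:

```python
import numpy as np

def sigmoid(z):
    # maps any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # rescales the curve into (-1, 1)
    return np.tanh(z)

def relu(z):
    # passes positive values through, clamps negatives to 0
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])   # illustrative pre-activation values
s, t, r = sigmoid(z), tanh(z), relu(z)
```

Evaluating the three functions on the same inputs makes the difference in output ranges visible: the sigmoid stays in (0, 1), tanh is centred on 0, and ReLU simply zeroes the negative values.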
3.1.3 Cost Functions
The purpose of the cost function is to measure the neural network's error ratio. The
general formula for the cost function represents the average of the loss, which
measures the difference between the real value and the predicted value.
Loss (error) function, for a single training example:
𝐿𝑜𝑠𝑠(ŷ, 𝑦) = −(𝑦 log ŷ + (1 − 𝑦) log(1 − ŷ))
Cost function, for the entire training set:
𝐽(𝑤, 𝑏) = (1/𝑚) ∑ᵢ₌₁ᵐ 𝐿𝑜𝑠𝑠(ŷ⁽ⁱ⁾, 𝑦⁽ⁱ⁾)
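The two formulas above (cross-entropy loss per example, then the average over the training set) can be sketched in NumPy. The predictions and labels below are illustrative:

```python
import numpy as np

def loss(y_hat, y):
    """Cross-entropy loss for each training example."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(y_hat, y):
    """Average of the per-example losses over the m training examples."""
    m = y.shape[0]
    return np.sum(loss(y_hat, y)) / m

# Illustrative labels and predictions for m = 3 examples
y     = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.1, 0.8])
J = cost(y_hat, y)   # a small positive number, since predictions are close
```

Note that the loss is zero only when the prediction matches the label exactly, and grows without bound as the prediction approaches the wrong extreme.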
3.1.4 Optimization (Gradient Descent)
Optimisation is the first step in backward propagation, which tries to find the best
value from some set of obtainable values. The gradient descent algorithm is one of
the most widely used optimisation algorithms for updating the model's parameters
(weights and biases). The algorithm starts at an initial point and takes a step in the
direction of the negative slope (Figure 5); after many iterations of gradient descent,
it may end up converging to an optimum. Backpropagation in neural networks uses a
gradient descent algorithm, and gradient descent is also applied massively in linear
regression and classification algorithms. It defines how the parameters should be
improved and updates them so that the loss is reduced towards a minimum. The cons
of gradient descent are that it requires a large amount of memory to calculate the
gradient for the entire dataset, and the weights are updated only after each full
gradient measurement, which can take a very long time if applied to a large dataset.
On the other hand, the pros of this optimisation function are that it is easy to
implement, understand, and compute. It is a basic component used in different models
but is not sufficient on its own for massive projects.
Figure 5. Five iterations of Gradient Descent
To understand the optimiser function, the author expresses gradient descent
mathematically by setting out these equations.
Update parameters (w, b):
𝑤 = 𝑤 − 𝛼 𝜕𝐽(𝑤, 𝑏)/𝜕𝑤
𝑏 = 𝑏 − 𝛼 𝜕𝐽(𝑤, 𝑏)/𝜕𝑏
where 𝛼 is the learning rate and 𝜕 denotes the partial derivative.
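The update rule above can be sketched as a short NumPy loop. The toy objective J(w) = w², whose gradient is 2w, is chosen only for illustration; it lets the iteration be watched converging towards the minimum at w = 0:

```python
def gradient_descent_step(w, b, dw, db, alpha):
    """One parameter update: step against the gradient,
    scaled by the learning rate alpha."""
    w = w - alpha * dw
    b = b - alpha * db
    return w, b

# Toy example: minimise J(w) = w**2, for which dJ/dw = 2*w
w, b, alpha = 4.0, 0.0, 0.1
for _ in range(100):
    w, b = gradient_descent_step(w, b, dw=2 * w, db=0.0, alpha=alpha)
```

Each step shrinks w by the factor (1 − 2·alpha), so after 100 iterations w is vanishingly close to the optimum.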
3.1.5 Backward Propagation
According to David E. Rumelhart and Yves Chauvin (1995, 1), "backpropagation
terminology derives from the Rosenblatt attempt (1962) to generalize the perceptron
learning algorithm to the multilayer case". The researchers' objective was to replace
hand-engineered features with trainable multilayer networks; according to David E. R.
et al. (1995, 2), the solution was not widely understood until 1986, when Rumelhart
published a paper explaining in more detail how this algorithm works in real-world
applications. The idea behind backward propagation is to minimise the cost function's
error ratio and modify the coefficients iteratively by implementing the optimisation
algorithm of the cost function (Figure 6). As it turns out, multilayer architectures can
be trained by straightforward stochastic gradient descent or, if preferred, a
hill-climbing optimiser. Backward propagation includes two steps:
1- Compute the partial derivatives ( 𝜕𝑍ˡ, 𝜕𝑊ˡ, 𝜕𝑏ˡ ).
2- Update the weight matrix and bias (W, b).
Figure 6. Backward Propagation
Backward propagation calculates the derivatives needed to update the parameters
(weights and biases) by using these equations.
First, calculate the derivative (𝜕Z) for the output layer (Z):
𝜕𝑍ˡ = 𝑎ˡ − 𝑦
Second, calculate the previous layer's derivatives and update the weight parameters
by multiplying the derivative of the final layer with the activations of the previous
layer; the biases take the derivative value of the final layer directly:
𝜕𝑊ˡ = 𝜕𝑍ˡ · 𝑎[ˡ⁻¹]𝑇
𝜕𝑏ˡ = 𝜕𝑍ˡ
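Assuming a sigmoid output unit with the cross-entropy cost (for which 𝜕Z = a − y holds), a one-layer sketch of these update equations in NumPy could look like this; the input, weights, label, and learning rate are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass for one training example through one layer
x = np.array([[0.5], [1.5]])   # previous-layer activations a^[l-1], shape (2, 1)
W = np.array([[0.1, -0.2]])    # weights, shape (1, 2)
b = np.array([[0.05]])         # bias
y = np.array([[1.0]])          # target label

z = W @ x + b                  # linear step
a = sigmoid(z)                 # output-layer activation

# Backward pass, following the equations above
dZ = a - y                     # derivative at the output layer
dW = dZ @ x.T                  # gradient w.r.t. the weights
db = dZ                        # gradient w.r.t. the bias

# One gradient-descent update of the parameters
alpha = 0.5
W = W - alpha * dW
b = b - alpha * db
```

Since the label is 1 and the activation lies below it, dZ is negative and the update pushes the weights towards a larger output.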
3.2 Convolutional Neural Networks (CNN)
According to Ian Goodfellow, Yoshua Bengio, and Aaron Courville (2016, 326),
“Convolutional networks are simply neural networks that use convolution in place of
general matrix multiplication in at least one of their layers.” Convolutional neural
networks are designed to process incoming data in the shape of multiple arrays; this
approach has produced innovative results over the past few years in image processing,
voice recognition, and other pattern recognition tasks. The main feature of a CNN is
the reduction of the number of parameters compared with a plain ANN (artificial
neural network).
CNN is one of the categories of deep learning networks, mostly used to analyse images
and video frames. CNN builds on the assumption that the input will always be images,
which leads the architecture to be set up in a way that best suits this specific type of
data. The input data for a CNN layer form a four-dimensional tensor (x, h, w, d). The x
dimension denotes the number of samples, h and w are the image's height and width,
and the last dimension is the image's depth, which represents the pixel colour
channels of the input image (red, green, blue). For instance, say we use an image with
dimensions (x, 32, 32, 3). The first hidden layer will be (x, 28, 28, 6), where the depth
dimension in the first hidden layer represents the
filters (feature maps), and the output layer will be (x, 1, 1, n), where n represents the
possible number of classes. Convolutional neural networks are shaped from three
types of layers (convolutional layers, pooling layers, and fully connected layers); these
three stacked together form the CNN architecture. Figure 7 simplifies the CNN
architecture.
Figure 7. The structure of a CNN, consisting of convolutional, pooling, and fully-
connected layers.
(Source: https://www.mdpi.com )
A basic CNN implementation will be:
1- The input layer does not make any change to the pixel values of the image.
2- The convolutional layer's parameters consist of a group of learnable filters. Every
filter has a width and height, and the depth of the input volume. For example, say the
first filter in a ConvNet has size 5x5x3; this filter slides over the input volume (input
layer), and the output of the computation is a two-dimensional activation map that
presents the filter's response at every position. The network enhances the filters that
activate when determining a type of visual attribute, such as an edge orientation, a
significant colour, or some other pattern in the layers of the network. In each ConvNet
layer, a set of filters is stacked together to produce the output volume.
3- The pooling layer simplifies the ConvNet filters to progressively reduce the spatial
size of the representation and reduce the number of parameters in the network, and
hence to control overfitting. Common approaches used in pooling are max pooling
and average pooling.
Figure 8. Max Pooling and Average Pooling.
( Source: https://www.researchgate.net )
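The two pooling operations in Figure 8 can be sketched minimally in NumPy, assuming non-overlapping 2x2 windows (stride equal to the window size); the function name and example feature map are illustrative:

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Downsample a 2-D feature map with non-overlapping
    size x size windows (stride == size)."""
    h, w = x.shape
    out = x[:h - h % size, :w - w % size]                # trim ragged edges
    out = out.reshape(h // size, size, w // size, size)  # split into windows
    if mode == "max":
        return out.max(axis=(1, 3))    # max pooling
    return out.mean(axis=(1, 3))       # average ("mean") pooling

# Illustrative 4x4 feature map
fmap = np.array([[1., 3., 2., 0.],
                 [5., 6., 1., 2.],
                 [7., 2., 4., 8.],
                 [0., 1., 3., 5.]])
```

Both modes halve each spatial dimension here: max pooling keeps the strongest response in each window, while average pooling keeps the mean.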
4- The fully connected layer is essential to check the incoming data from the previous
layer and compare it with the stored labelled data inside it; it divides and analyses the
image into classes.
Figure 9. Fully-connected layer
Convolutional layers are the primary constituent elements of convolutional neural
networks. A convolution is the simple application of a filter. Iterative
application of the same filter to an input layer produces a feature map, denoting the
place and strength of detected features in an image. A CNN is capable of updating
many filters in parallel, and the outcome is a set of features detected anywhere in
the image.
The ConvL filters shift to the right with a specific stride value until they parse the
entire width. Then they jump down to the beginning left of the image by the same
stride value and iterate this operation over the entire image.
ConvNet layers reduce the complexity of the model by optimising the
hyperparameters: depth, stride, and zero-padding.
For example, say the input image (source layer) has size 8x8x3 and the filter size is
3x3; then each node in the ConvL will have weights to a 3x3x3 region of the input
volume, so the total number of weights will be 27, plus 1 for the bias factor, and the
destination layer (feature map) will have size 6x6xn.
Figure 10. Convolution layer operation.
( Source: https://www.researchgate.net )
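The worked example above (an 8x8x3 source layer and one 3x3x3 filter giving a 6x6 feature map) can be sketched naively in NumPy. The function names and random inputs are illustrative, not a library API:

```python
import numpy as np

def conv_output_size(n, f, stride=1, pad=0):
    """Spatial output size of a convolution:
    floor((n + 2*pad - f) / stride) + 1."""
    return (n + 2 * pad - f) // stride + 1

def conv2d_single(x, kernel, bias=0.0):
    """Naive valid convolution of one 3-D input volume (h, w, d)
    with one filter of shape (f, f, d), stride 1, no padding."""
    h, w, d = x.shape
    f = kernel.shape[0]
    out = conv_output_size(h, f)
    fmap = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            # element-wise multiply the receptive field with the kernel
            fmap[i, j] = np.sum(x[i:i + f, j:j + f, :] * kernel) + bias
    return fmap

x = np.random.rand(8, 8, 3)   # source layer, as in the example
k = np.random.rand(3, 3, 3)   # one learnable filter: 27 weights (+ 1 bias)
fmap = conv2d_single(x, k)    # 6x6 feature map
```

Stacking n such filters would give the 6x6xn destination layer described in the text; the same size formula also reproduces the earlier 32 → 28 example with a 5x5 filter.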
Over the last two decades, CNNs have been applied with tremendous success to
detect, segment, and recognise objects in images. These implementations were
feasible to attain because labelled data were plentiful and accurate. These data have
been used to detect individuals, faces, and moving objects. Jonathan Tompson, Ross
Goroshin, Arjun Jain, Yann LeCun, and Christoph Bregler (2014, 2) applied a novel
ConvNet architecture to human body pose detection; they showed the precision lost
due to pooling in ConvNet architectures. Convolutional networks have been
successfully applied to a multitude of other problems, such as recommender systems
as indicated in (Ayush Singhal et al., 2017, 20), natural language understanding in
(Ronan Collobert et al., 2011, 2503), and speech recognition as described in (Tara N.
Sainath et al., 2013, 4).
4 Face Recognition Algorithms
For the last decade, face recognition has been one of the most researched topics in
the fields of computer vision and biometrics. Traditional approaches, based on hand-crafted
features and conventional machine learning techniques such as feature-based
and geometry-based methods, were the first steps towards the deep neural network
techniques that are considered the backbone of all computer vision applications. In
this section, the author explains five conventional machine learning algorithms that
have been widely used for face recognition.
W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld (2003, 400) state that the face
recognition problem can be formulated as follows: "given still or video images of a
scene, identify or verify one or more persons in the scene using a stored database of
faces". From this perspective, we understand that the challenges of developing an
accurate model depend on occlusions, poses, illumination, ageing, conditions, and
facial expressions. Over the last decade, researchers have concentrated on methods
that use image processing technology to describe the geometry of the face in order
to match the exact faces from an image.
In this chapter, the author briefly explains some algorithms used to detect and
recognise faces in an image, and also goes through some advantages and
disadvantages of each one. In general, all algorithms are affected by factors that
debase recognition accuracy, such as low resolution, illumination, and expressions.
Face recognition from still and moving faces is a tremendous task; many machine
learning experts have already accomplished very high accuracy on frontal face
images, but with the factors mentioned previously, the accuracy will not be reliable.
4.1 AAM - Active Appearance Models
According to Timothy F. Cootes, Gareth J. Edwards, and Christopher J. Taylor (2001,
681), the Active Appearance Model has an integrated statistical model that combines
shape variations with appearance variations; the AAM also contains a monochrome
representation of the region of interest. Cootes et al. do not attempt to solve a
comprehensive optimisation every time they want to fit a new image to the model;
instead, they exploit the fact that the optimisation problem is always identical, and
hence they perform the optimisation step offline.
The AAM structure depends on a Gaussian image pyramid, which achieves a rapid and
robust multiresolution approach for identification. The difficulty of this approach is to
understand and learn the differences between real variation in the data and changes
due only to systematic and unsystematic distortion. The AAM method is one of the
oldest statistical matching methods; it achieved 88% accuracy when trained on 100
hand-labelled face images, with an equal number of images for the testing set,
including facial expressions.
The Active Appearance Model algorithm has two procedures: modelling and fitting.
In the modelling step, the AAM separates an object into two parts: the first is the
shape, which is a vector established by connecting the facial landmarks; the second
part is the texture, which is the measure of pixels represented by the density of
colours. Once the model is formed, it is necessary to fit the model to different images,
which is vital to finding the most realistic parameters of the face.
Figure 11. Shape and labelled image.
(Source: http://pages.cs.wisc.edu )
Xinbo Gao, Ya Su, Xuelong Li, and Dacheng Tao (2010, 147-151) state that the AAM is
commonly used in the modelling of deformable objects, as it has an effective
representation and reliable fitting capability. Recent improvements address the
difficulties and extend its ability in three aspects:
• Efficiency - to increase the efficiency of the AAM, various enhancements have
been proposed to make the algorithm capable of fitting an image successfully; these
enhancements consider the reduction of the computational cost by optimising the
algorithm, the texture representation, and the model training.
• Discrimination - to improve the model's discrimination, many improvements
have been made to increase the accuracy, such as shape priors, texture
representation, nonlinear modelling, etc.
• Robustness - many significant improvements have been made to the AAM to
improve its robustness to changing circumstances, such as condition changes, pose
variations, missing features, and low resolution.
4.2 HMM - Hidden Markov Models
The hidden Markov model (HMM) is a mathematical approach used to model
sequential data. It is named after the mathematician Andrey Andreyevich Markov,
who developed much of the relevant statistical theory, and it was first applied to
speech recognition systems in the 1980s by L. R. Rabiner and B. H. Juang (1986). An
HMM is one solution for mathematically modelling a sequence of observable signals;
according to Rabiner, L. (1989, 257), "the signals can be discrete in nature (e.g.,
characters from a finite alphabet) or continuous in nature (e.g., temperature
measurements)". HMMs have been used efficiently with one-dimensional data and
have accomplished important outcomes in activity recognition and voice recognition.
They have also been used for face detection and recognition. In Claudia Iancu's view
(2011, 4), "HMM techniques remain mathematically complex
even in the one-dimensional form. The extension of HMM to two-dimensional model
structures is exponentially more complex", but researchers and scientists have used
this model to develop facial recognition models despite the mathematical complexity,
as can be seen in the experiment of F. H. Alhadi, W. Fakhr, and A. Farag (2005, 2).
Implementing an HMM for face recognition has two significant disadvantages:
1- Pixel values are not considered a useful feature in face recognition methodology
due to image conditions (illumination, shift, noise, rotation).
2- The vast vector dimension engenders a high computational cost for the detection
system.
Ferdinando Samaria and Frank Fallside (2007, 2-3) initialised two sets of HMMs,
trained for each identity in the dataset. Two models were used for training and
testing:
1- Ergodic models:
In this model, the authors trained an HMM on 10 training images per identity, of size
256*256 with 8-bit grey levels. Each image was divided into 64*64 windows; a window
slides over the given image by 58 pixels from the left to the right of the image, then
moves down by 48 pixels and starts moving again, but from right to left (Figure 12).
Figure 12. Sample of training data for ergodic HMM.
(Source: https://www.researchgate.net)
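The window extraction described above can be sketched as follows; this is an illustrative reconstruction, not the authors' original code, and the right-to-left zig-zag scan is omitted since it changes only the order in which the windows are visited, not the windows themselves:

```python
import numpy as np

def extract_windows(img, win=64, x_step=58, y_step=48):
    """Extract win x win observation windows from a grey-level image,
    stepping x_step pixels horizontally and y_step pixels vertically."""
    h, w = img.shape
    windows = []
    for top in range(0, h - win + 1, y_step):
        for left in range(0, w - win + 1, x_step):
            windows.append(img[top:top + win, left:left + win])
    return windows

# Stand-in for one 256*256, 8-bit grey-level face image
img = np.zeros((256, 256), dtype=np.uint8)
wins = extract_windows(img)
```

With these step sizes, adjacent windows overlap, which is what allows the HMM to model the transitions between neighbouring face regions.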
2- Top-to-bottom models:
For each identity in the dataset, five variant images of size 184*224 with 8-bit grey
levels were used to train each HMM; these images were analysed into 16 lines of
blocks spatially ordered in a top-to-bottom direction (Figure 13).
Figure 13. Sample of training data for top-to-bottom HMM.
(Source: https://www.researchgate.net )
4.3 PCA - Principal component analysis
Principal component analysis, or PCA, is a statistical approach for decreasing the
number of parameters in face recognition. According to Kim Esbensen and Paul Geladi
(1987, 37-38), PCA was developed to model data that are distinguished by significant
interrelationships between the parameters concerned. According to C. Li, Y. Diao, H.
Ma, and Y. Li (2008, 376), "PCA is a classical feature extraction and data representation
technique widely used in the areas of pattern recognition and computer vision such as
face recognition". In PCA, every image in the training dataset is represented as a linear
combination of weighted eigenvectors called eigenfaces; the author explains
eigenfaces in the next chapter. As a multivariate data analysis technique, PCA is used
for exploratory data analysis, anomaly detection, classification, and dimensionality
reduction for regression.
According to Sasan Karamizadeh, Shahidan M. Abdullah, Azizah A. Manaf, Mazdak
Zamani, and Alireza Hooman (2013, 173-174), in their paper "An Overview of
Principal Component Analysis", the mathematical processes for this
algorithm can be described by presenting a set of M images (B1, B2, ..., BM) of size
N * N; the training set image average (𝜇) will be:
𝜇 = (1/𝑀) ∑ₙ₌₁ᴹ 𝐵ₙ
Subtracting the average image from each training image gives a new vector (Wᵢ) for
each training image, described by:
𝑊𝑖 = 𝐵𝑖 − 𝜇
The authors calculate the covariance matrix by:
𝐶 = ∑ₙ₌₁ᴹ 𝑊ₙ𝑊ₙᵀ = 𝐴𝐴ᵀ
where A = [𝑊1, 𝑊2, 𝑊3, ..., 𝑊𝑀]
Then the eigenvectors 𝑈𝐿 and eigenvalues 𝜆𝐿 of the covariance matrix are
calculated. The last equation measures the vector of weights used for image
classification:
Ω𝑇 = [𝑤1, 𝑤2, ..., 𝑤𝑀]
whereby
𝑤𝑘 = 𝑈𝑘𝑇(𝐵 − 𝜇), k = 1, 2, ..., M
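The steps above can be sketched in NumPy on a toy training set. The M x M decomposition shortcut used here is the standard eigenface trick (decomposing AᵀA instead of the large AAᵀ); all sizes, the seed, and the variable names are illustrative:

```python
import numpy as np

# Toy training set: M flattened images of N*N pixels each
rng = np.random.default_rng(0)
M, N = 6, 4
B = rng.random((M, N * N))            # rows are the images B_1 .. B_M

mu = B.mean(axis=0)                   # training-set average image (mu)
W = B - mu                            # centred images  W_i = B_i - mu

# Covariance C = A A^T with A = [W_1 ... W_M]; decompose the small
# M x M matrix A^T A instead (the classic eigenface shortcut)
A = W.T
vals, vecs = np.linalg.eigh(A.T @ A)
idx = np.argsort(vals)[::-1][:M - 1]  # keep the M-1 informative components
U = A @ vecs[:, idx]                  # eigenvectors of C: the eigenfaces
U /= np.linalg.norm(U, axis=0)        # normalise each eigenface

# Weight vector Omega for one image: projections onto the eigenfaces
omega = U.T @ (B[0] - mu)
```

In a real eigenface system, recognition then compares the weight vector of a probe image against the stored weight vectors of the training identities.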
Like other algorithms, PCA has disadvantages: the mathematical calculations are
complex, and round-off errors tend to accumulate at each step of the algorithm, which
makes it a complex task to evaluate the scatter matrix (covariance matrix) accurately.
Moreover, PCA cannot capture even the simplest invariance unless this information is
provided in the training data.
4.4 LDA - Linear Discriminant Analysis
LDA is a technique used to reduce the dimensions of a dataset while keeping as much
information as possible. It generates an ideal linear discriminant function that maps
the input data into the classification space; the input data are handled via scatter
matrix analysis, and the matching used in this approach can be a simple Euclidean
distance. According to N. Mohanty, A. Lee-St. John, R. Manmatha, and T. M. Rath
(2013, 253), "LDA provides class separability by drawing a decision region between the
different classes. LDA tries to maximize the ratio of the between-class variance and
the within-class variance". In this situation, LDA creates a linear combination of the
features which produces the largest average variation between the classes. The LDA
approach has been successfully used in several applications, such as image
identification, pattern recognition, data classification, and bioinformatics.
According to Alok Sharma and Kuldip K. Paliwal (2013, 1), the orientation Z in the LDA
technique transforms the higher-dimensional feature vectors of the different classes
to a lower-dimensional feature space, in which the lower-dimensional feature vectors
separate the classes. Hence, for the reduction of a d-dimensional space (Rd) to an
h-dimensional space (Rh), where d > h, the size of the Z orientation will be h*d. For
more simplicity, LDA creates a diagonal axis and projects the information from both
features onto it to reduce the variance and increase the distance between the two
classes, as shown below.
Figure 14. LDA influence on the data to separate the classes, considering each colour
is a variable.
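A two-class Fisher discriminant sketch in NumPy illustrates this projection onto a single discriminant axis; the synthetic data, seed, and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
# Two classes of 2-D feature vectors with well-separated means
X0 = rng.normal([0.0, 0.0], 0.5, size=(50, 2))
X1 = rng.normal([2.0, 2.0], 0.5, size=(50, 2))

m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
# Within-class scatter matrix S_w
Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
# Fisher direction: maximises between-class over within-class variance
w = np.linalg.solve(Sw, m1 - m0)
w /= np.linalg.norm(w)

# Project both classes onto the 1-D discriminant axis
p0, p1 = X0 @ w, X1 @ w
```

On the projected axis the two classes form two well-separated clusters of scalars, which is exactly the behaviour Figure 14 depicts.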
4.5 EBGM - Elastic Bunch Graph Matching
The EBGM algorithm is established on the basis that real faces in images have a variety
of nonlinear features, such as pose and illumination, that are not captured by linear
analysis methods like LDA. The algorithm identifies faces by localising a set of
landmark features and then calculating the similarity between these features. The
information extracted for the faces in an image can be represented by a bunch graph;
the bunch graph provides a database for each landmark that can be used to locate the
features of new faces in an image. "A person is recognised correctly if the correct
model yields the highest graph similarity, i.e., if it is of rank one. A confidence criterion
on how reliably a person is recognised can easily be derived from the statistics of the
ranking" (David S. Bolme 2003, 3).
The algorithm proved effective in the detection of faces in the FERET (Face
Recognition Technology) evaluation, because the algorithm perceives faces by
comparing their parts instead of performing extensive whole-image matching. The
nodes in EBGM are labelled with a set of Gabor wavelet parameters called a jet, which
is used for matching and recognition. According to Laurenz Wiskott, Jean-Marc
Fellous, Norbert Kruger, and Christoph von der Malsburg (1999, 2), "The
representation of local features is based on the Gabor wavelet transform. Gabor
wavelets are biologically motivated convolution kernels in the shape of plane waves
restricted by a Gaussian envelope function". Figure 15 clarifies that when the EBGM
algorithm is applied in a face recognition system, a set of processes defines the graph
representation of a face (Gabor wavelet transform, convolution with wavelet kernels).
Figure 15. The two major keys to representing a face in EBGM are the Gabor wavelet
transform and convolution with a set of wavelet kernels.
(Source: https://www.researchgate.net )
5 Standard Benchmarks
In this chapter, the author analyses and explains six dataset benchmarks used to
train and test face recognition algorithms and to evaluate their performance.
Systematic benchmark studies are the most accurate, and they are considered the
means by which algorithms demonstrate their efficiency and robustness. A reliable
benchmark should be authorised as open source, its data should be explicitly
labelled, and it should have been applied to at least one machine learning algorithm.
The datasets used in this chapter were extracted from www.face-rec.org/databases/;
the first dataset to be broached is FERET, which was mentioned in section 4.5.
5.1 FERET Database
The FERET database was established in 1993 under a collaborative effort of Dr.
Wechsler, H., and Dr. Phillips, J., assuming that a database should serve both
development and testing by providing the algorithms with sequestered images.
According to P. Jonathon Phillips, Hyeonjoon Moon, Patrick Rauss, and Syed A. Rizvi
(1998, 137), the dataset includes 14,126 images of 1,199 individuals and serves as a
standard dataset of face images for researchers to develop facial recognition
algorithms and evaluate the outcomes. A colour, high-resolution (512*768 pixels)
FERET dataset was released in 2003. The FERET dataset is split into two parts: the first
is the development set served to researchers, and the second consists of isolated
images for the testing set. To collect the images, the founders of the FERET database
used "a 35-mm Kodak camera and processed them into CD-ROM via Kodak's
multiresolution technique for digitising and storing digital imagery. The colour images
were retrieved from the CD-ROM and converted into 8-bit grayscale images" (ibid.,
138). In order to preserve a level of consistency across the database, the images were
captured in a semi-controlled environment, with the same physical configuration used
for each photographic session. For each session the equipment had to be
reassembled; therefore, there was a small divergence between images captured on
various days.
Figure 16. Example of different categories of photos for one individual.
(Source: https://www.researchgate.net )
5.2 SCfaceDB Landmarks
Released in 2011, the SCface surveillance cameras face database (SCfaceDB) was
designed primarily to evaluate the robustness of face recognition algorithms in real-world
monitoring, but it has also been used to assess other recognition algorithms,
such as face recognition algorithms for head poses and illumination normalisation
algorithms. According to Mislav Grgic, Kresimir Delac, and Sonja Grgic (2009, 863),
"Images from different quality cameras should mimic real-world conditions and
enable robust face recognition algorithms testing, emphasizing different law
enforcement and surveillance use case scenarios.".
The authors used six surveillance cameras, a professional digital video surveillance
recorder, a professional high-quality photo camera, and a computer to capture 4,160
static images of different quality for 130 individuals: 114 males and 16 females. The
participants in this work were students, employees, and professors from the
University of Zagreb, Croatia, with ages ranging from 20 to 75. SCfaceDB was
considered one of the unique databases for face recognition in 2009, since the
authors distributed with the database a text file containing the birthday of each
participant, a feature not available in many face databases. It also holds additional
information about gender, glasses, and facial hair (beard, moustache). (ibid., 870.)
Figure 17. Example of different pose images.
(Source: https://www.researchgate.net)
5.3 Specs on Faces (SoF) Dataset
The SoF dataset was collected from April 2015 to October 2016; the images were
captured in different countries over a long period to test face detection, recognition,
and classification algorithms. The SoF has been assembled from 112 individuals (66
males and 46 females) who wear glasses, in various lighting conditions, and comprises
42,592 images (640 * 480 pixels). The dataset is dedicated to solving gender
classification problems in cases of face occlusion and highly varying illumination,
across various ages. The authors of the dataset used many occlusion techniques to
conceal features of the faces, but the primary occlusion was glasses.
According to Mahmoud Afifi and Abdelrahman Abdelhamed (2017, 15), "The SoF
dataset involves handcrafted metadata that contains subject ID, view (frontal/near-frontal)
label, 17 facial feature points, face and glasses rectangle, gender and age
labels, illumination quality, and facial emotion for each subject". The authors applied
three filters to the original images to generate more challenging artificial images that
may circumvent face detection systems.
Figure 18. Samples of the Specs on Faces (SoF) dataset.
(Source: https://arxiv.org/pdf)
5.4 Large Age-Gap Database (LAG)
The Large Age-Gap (LAG) database was presented in "Large age-gap face verification
by feature injection in deep networks" by Simone Bianco (2016, 1). Bianco introduces
a face verification method that works across significant age gaps. He also assembled
a dataset containing age variations in the wild, collecting face images ranging from
childhood to old age and including pictures of celebrities found via the Google image
search engine and YouTube by adding "adult" and "childhood" keywords to the search
query. Bianco checked his dataset and removed all noisy and duplicate images
manually; subsequently, he obtained 3,828 images of 1,010 celebrities. The LAG
dataset is highly relevant to applications used by law enforcement. It is intractable
even for a human to recognise faces across ageing; it is therefore a challenging task
for computer vision systems, because of the age-related biological transformations in
the presence of the other variations in appearance.
Figure 19. Examples of face crops for matching pairs.
(Source: https://arxiv.org )
The paper presents a novel method for face verification over the age gap by exploiting
a deep convolutional neural network (DCNN) trained in a Siamese architecture with
multiple loss functions. The method has been evaluated by comparison with different
techniques such as high-dimensional local binary features (HDLBP), the One-Shot
Similarity Kernel, Joint Bayesian, and Cross-Age Reference Coding (CARC).
5.5 Disguised Faces in the Wild (DFW)
Having a purpose similar to LAG, the DFW dataset was assembled on a large scale in
unconstrained scenarios to address the recognition of faces under the covariate of
disguise. The dataset contains a wide range of unrestricted disguised faces; the main
body of the dataset was collected from the internet and comprises 11,157 images of
1,000 individuals, primarily of Indian or Caucasian origin. According to Kushwaha,
Maneet Singh, Richa Singh, and Mayank Vatsa (2018, 1), "DFW is a first-of-a-kind
dataset containing images pertaining to both obfuscation and impersonation for
understanding the effect of disguise variations.". The dataset contains concealment
variations concerning hairstyles, facial hair (beard, moustache, goatee), make-up,
hats, veils, glasses, etc. These variations make facial recognition a challenging task,
even more so if we consider other differences that make it arduous to recognise
faces, such as illumination, head pose, ethnicity, age, gender, facial expression, and
camera quality. (ibid., 2)
Figure 20. Sample images of three subjects from the DFW dataset.
(Source: https://ieeexplore )
The authors of the DFW dataset partitioned the collected data into four types of
images: 1,000 Normal Face Images, 903 Validation Face Images, 4,814 Disguised Face
Images, and 4,440 Impersonator Face Images. Each of these four types has been
distributed between the training set and the testing set, as shown in Table 1.
Table 1: Images in the training and testing partition.

Number of              Training Set   Testing Set
Subjects                        400           600
Images                        3,386         7,771
Normal Images                   400           600
Validation Images               308           595
Disguised Images              1,756         3,058
Impersonator Images             922         3,518
Kushwaha V. et al. (2018, 3) state that the evaluation protocols for the face
recognition model in this approach are divided into three types:
• Protocol-1 (Impersonation) - evaluates the ability to distinguish identity
impersonators.
• Protocol-2 (Obfuscation) - evaluates the performance of the model on faces
concealed deliberately or inadvertently.
• Protocol-3 (Overall Performance) - evaluates the performance of the face
recognition algorithm on the entire dataset.
5.6 EURECOM Visible and Thermal paired Face database
The EURECOM benchmark was introduced in 2015 in the paper "A benchmark
database of visible and thermal paired face images across multiple variations",
released by Khawla Mallat and Jean-Luc Dugelay. The EURECOM dataset is composed
of 2,100 images of 50 individuals of different ages, ethnicities, and sexes. Each
participant took part in two photography sessions, 3 to 4 months apart, and each
session includes 21 face images per individual with different facial variations. The
variations in the photography environment include head pose, facial expression,
illumination, and occlusion. To capture these images, the authors used a camera (FLIR
Duo R by FLIR Systems) designed to photograph faces simultaneously in the thermal
and visible spectra, as illustrated in K. Mallat and J-L. Dugelay (2015, 1). The purpose
of this approach is to recognise faces through the thermal range and compare the
results with data collected for face images obtained in the visible spectrum.
Figure 21. Visible and thermal face images.
(Source: http://www.eurecom.fr )
The authors used the FisherFace approach to evaluate the database, followed by
1-Nearest Neighbour classification (Vittorio Castelli, 1). Unlike feature-based
algorithms, the FisherFace algorithm does not rely on facial feature detection, which
can be especially difficult for thermal images; instead, it is based on the PCA and LDA
techniques described in chapter two. The FisherFace method achieved high accuracy
in recognising visible and thermal face images compared to holistic face recognition
algorithms.
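Since FisherFace builds on PCA before applying LDA, the PCA projection step can be sketched in a few lines of NumPy. This is an illustrative sketch under the usual assumption that each face image is flattened into a row vector, not the EURECOM authors' implementation:

```python
import numpy as np

def pca_project(faces, k):
    """Project flattened face vectors onto the top-k principal components."""
    mean = faces.mean(axis=0)
    centered = faces - mean
    # SVD of the centered data matrix: rows of vt are the principal axes.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:k]
    # Coordinates of every face in the k-dimensional eigenspace.
    return centered @ components.T, components, mean
```

LDA would then be trained on these k-dimensional coordinates rather than on raw pixels, which is what makes the holistic approach tractable.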
6 Face Recognition pipeline
In this chapter, the author will cover the important steps to recognise an object in an
image: detection, localization, and alignment. These steps are compulsory in object
recognition algorithms to extract the relevant feature information from an input
image.
6.1 Detection and Localization
Face detection is the first stage of a face recognition system, since the system should
first locate the face and then recognise it. Faces are instances of objects: the system
seeks to locate them in an image and categorise them under a specific denomination
(people, buildings, numbers, etc.). This is performed by discriminating the patterns
formed by the objects from the other patterns and determining the dimensions of
each object.
Ali Sharifara, Mohd Shafry Mohd Rahim, and Yasaman Anisi (2014, 73) state that
"Face detection is one of the demanding issues in the image processing and it aims to
apply for all feasible appearance variations occurred by changing in illumination,
occlusions, facial feature".
According to Dr. P. Shanmugavadivu and Ashish Kumar (2016, 594), there are three
methods to resolve the problem of partially occluded faces, distributed between
part-based, feature-based, and fractal-based methods, which divide the detected face
into overlapping and non-overlapping parts and compute self-similarities between
the images, or consider facial features (nose, left eye, right eye, mouth, left ear, right
ear, and chin). The skin detection factor has been significantly influential in face
detection algorithms in decreasing the search area for feature detection and the
computational load: the variety of human skin colours is spanned by a pre-specified
scale, and each pixel within that domain is treated as a skin pixel. Moreover, locating
that domain itself is a difficult task, as the range of skin tones varies by ethnicity and
race. In this section, three algorithms used for object detection are addressed:
6.1.1 Viola-Jones
Paul Viola and Michael Jones (2001, 1), in their paper "Rapid Object Detection using a
Boosted Cascade of Simple Features", provided a new competitive object detection
framework whose primary purpose is to detect faces in real time. This framework has
a very low false-positive rate and a high true-positive rate, which makes the algorithm
rapid and robust. The author will briefly explain the three main features of this
framework. The first is called the integral image; it relies on the observation that all
human faces share the same features, for example that the nose bridge region is
brighter than the eye regions. Computing an integral image requires only a few
operations per pixel. The second describes the method of constructing a classifier
using the AdaBoost training algorithm, which helps find a small number of critical
visual features from a wide range of possible features. The third is the process of
combining cascading classifiers, which neglects background regions so that more
computation can be spent on face-like regions.
The object detection process in this algorithm classifies images according to the
values of simple features. The reason for using features instead of pixel values is that
features can encode domain knowledge that is difficult to learn from a small dataset.
Moreover, a feature-based system runs more rapidly and robustly than a pixel-based
system. Figure 22 shows four examples of rectangle features relative to the detection
window. Figures 22A and 22B have the same size and shape: to detect a specific
feature, the algorithm subtracts the sum of the pixels in the white rectangle from the
sum of the pixels in the grey rectangle; both rectangles may be oriented horizontally
or vertically. Figure 22C shows a three-rectangle feature, which subtracts the sum of
the pixels in the centre rectangle from the sum of the pixels in the side rectangles.
Figure 22D calculates the difference between diagonal pairs of rectangles. (ibid., 2.).
Figure 22. Rectangle features for object detection.
(Source: https://en.wikipedia.org )
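The integral image and the two-rectangle feature described above can be sketched with NumPy. This is a minimal illustrative implementation of the idea, not Viola and Jones' original code; the horizontal white/grey split is one arbitrary choice of feature orientation:

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of all pixels in img[:y+1, :x+1] (cumulative in both axes)."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, bottom, right):
    """Sum of pixels in an inclusive rectangle using at most four table lookups."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

def two_rect_feature(ii, top, left, h, w):
    """Haar-like feature: sum of the left (white) half minus the right (grey) half."""
    mid = left + w // 2
    white = rect_sum(ii, top, left, top + h - 1, mid - 1)
    grey = rect_sum(ii, top, mid, top + h - 1, left + w - 1)
    return white - grey
```

Because every rectangle sum costs a constant number of lookups, thousands of such features can be evaluated per window, which is what makes the cascade fast.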
6.1.2 You Only Look Once (YOLO)
According to Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi (2016, 1),
YOLO (You Only Look Once) takes one feedforward propagation across the network
to make predictions. Unlike region-based and parts-based methods, YOLO detects
the objects in an arbitrary image remarkably well, since YOLO sees the full image
during the training and testing phases and therefore obtains complete information
about the entire image and its objects. The algorithm has good detection accuracy
under complicated conditions, such as varied illumination and noise, while satisfying
real-time performance requirements.
Redmon J. et al. (2016, 2) state: "YOLO sees the entire image during training and test
time, so it implicitly encodes contextual information about classes as well as their
appearance". It is formulated as a regression problem: it predicts classes and
bounding boxes for an image by applying the network once to the image. The
algorithm splits the given image into an S × S grid of cells, as shown in Figure 23. Each
grid cell predicts bounding boxes, a confidence score for each prediction, and class
probabilities; most of the boxes have a low predicted score, so unnecessary bounding
boxes or detected objects can be discarded by setting a threshold. A predicted
bounding box can be described by (width, height, centre, class); in addition, the
model calculates the confidence using this formula:
confidence = Pr(object) × IOU(truth, pred)
Figure 23. YOLO bounding boxes, confidence, and class probability map.
(Source: https://arxiv.org )
The formula computes the confidence score for the grid cell. If the confidence value is
zero, no object was detected in the bounding box; otherwise, the confidence score
equals the IOU (intersection over union), which measures the agreement between
the ground-truth bounding box and the predicted bounding box, as shown below.
Figure 24. Intersection over Union (IOU).
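The IOU term in the confidence formula can be computed directly from two box coordinates. A minimal sketch, assuming boxes are given as (x1, y1, x2, y2) corner pairs:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Corners of the overlapping region (empty if the boxes are disjoint).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Identical boxes score 1.0 and disjoint boxes score 0.0, which is why thresholding the confidence removes low-quality predictions.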
6.1.3 Faster R-CNN
Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik (2014, 1) propose "a
simple and scalable detection algorithm that improves mean average precision
(mAP)". The R-CNN algorithm applies a high-capacity convolutional neural network to
more than 2,000 bottom-up Regions of Interest (ROI) extracted from an image to
localise and segment objects, then classifies each region using class-specific linear
SVMs (Support Vector Machines).
R-CNN is slow: each extracted region requires a complete forward pass through the
CNN, and the network coefficients are not updated during the regression. Fast R-CNN
instead normalises the input image and maps any boxes extracted from the first
layers directly onto the final convolutional layer. This step increases computing
speed, since there is no longer a need to store feature information from the first
layers for every extracted region, which makes training faster.
Ross Girshick (2016, 1441) proposed the Faster R-CNN algorithm to improve R-CNN's
speed and accuracy. Faster R-CNN has many advantages: the GPU is not filled with
cached feature data, and training is a single step in which a multi-task loss updates
the network layers. The overall performance has improved, particularly in terms of
detection speed, since the method creates a convolutional network to generate the
proposal boxes and shares it with the object detection network, which reduces the
number of proposed frames to roughly half or less compared to R-CNN.
The architecture of Faster R-CNN is composed of the Region Proposal Network (RPN)
and Fast R-CNN. The RPN reduces the computational load by rapidly and effectively
scanning locations in an image to decide which spots need further processing, using a
convolutional neural network. Fast R-CNN has a deeper architecture than the RPN; it
consists of a convolutional neural network, a Region of Interest (ROI) pooling layer,
fully connected layers, and finally two output heads for classification and regression
(Figure 25).
Figure 25. Object detection by Faster R-CNN.
(Source: https://towardsdatascience.com)
According to Bin Liu, Wencang Zhao, and Qiaoqiao Sun (2017, 6234), Faster R-CNN
uses an RPN (Region Proposal Network) instead of the Selective Search method used
in Fast R-CNN. The Faster R-CNN framework can be divided into four steps:
• Convolution layers.
In the first step, the algorithm extracts the image feature maps through the ConvNet
layers, ReLU activation functions, and pooling layers.
• Region Proposal Network (RPN).
The RPN is a fully convolutional network that predicts object boundaries and
objectness scores at each position. The RPN shares the image's convolutional feature
maps with the detection network, thus enabling nearly cost-free region proposals.
• ROI (Region of Interest) Pooling.
ROI pooling takes both the input image features and the proposals for those features
and produces fixed-size feature maps by applying max-pooling to the inputs. In the
pooling layer, the output and input channels are identical.
• Classification.
The classification layers calculate the class of each proposal and refine the proposal
feature maps to obtain the final exact position of the bounding box.
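The ROI pooling step above can be sketched for a single 2-D feature map. This is a simplified illustration, assuming the ROI is at least as large as the output grid; real implementations work on batched multi-channel tensors:

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool an ROI (x1, y1, x2, y2) of a 2-D feature map to a fixed grid."""
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2]
    out_h, out_w = output_size
    pooled = np.empty(output_size)
    # Split the region into out_h x out_w bins and take the max of each bin,
    # so proposals of any size produce the same fixed-size output.
    h_edges = np.linspace(0, region.shape[0], out_h + 1).astype(int)
    w_edges = np.linspace(0, region.shape[1], out_w + 1).astype(int)
    for i in range(out_h):
        for j in range(out_w):
            pooled[i, j] = region[h_edges[i]:h_edges[i + 1],
                                  w_edges[j]:w_edges[j + 1]].max()
    return pooled
```

The fixed output shape is what allows arbitrarily sized proposals to feed the fully connected classification and regression heads.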
According to Shahpour Alirezaee, Hassan Aghaeinia, Karim Faez, and Farid Askari
(2006, 30), "The ultimate goal of the face localization is finding an object in an image
whose shape resembles the shape of a face". They stated four methods used to
localise and detect a face, whose efficiency depends on motion and colour
information. They classified the approaches into four categories: knowledge-based
methods, feature-invariant approaches, template-matching methods, and
appearance-based methods.
Face localization is a simplified detection problem that intends to ascertain the size
and position of faces in still and video images. Localization is a crucial stage in the face
recognition process. In real-time face recognition and face tracking, we can use the
location found by the face detector; this area is used to align the face, but if the
detection step is not robust and accurate enough, additional face landmarks, such as
the nose, the point between the eyebrows, and the mouth, are required. (ibid., 30-31).
Summary
The detectors reviewed in this chapter range from rigid templates to CNNs. The
author summarizes the pros and cons of each algorithm alongside references to
different works: when published, Viola-Jones achieved detection at 2 fps with a
detection accuracy rate of 95%, and it was a very robust face detection classifier at
that time because of its very low false-positive rate and very high detection rate.
YOLO and Faster R-CNN, on the other hand, are based on CNNs; their results are
compared in Table 2.
Table 2: Average accuracy of face and head detection on the FDDB dataset and
Casablanca dataset.
CNN-based neural networks are significantly more reliable than Viola-Jones in terms
of accuracy, but they need more computational power and time for training and for
calculating the results. The mean average accuracy error of CNN-based networks is
five times lower than that of Viola-Jones on the FDDB data, as stated in the paper by
Le Thanh Nguyen-Meidine, Eric Granger, Madhu Kiran, and Louis-Antoine Blais-Morin
(2018, 5).
The only reason Viola-Jones still has a presence among modern algorithms is that it
allows real-time recognition at 60 FPS, as we can see in Table 3, with very low GPU
consumption compared to Faster R-CNN and YOLO. We conclude that Viola-Jones is
the fastest face detector, with a range between 40 and 60 FPS, while the CNN
algorithms surpass it in the ability to detect faces from different angles. Moreover,
Faster R-CNN is acceptable in terms of accuracy and speed, but its main disadvantage
is that it consumes too much memory. YOLO, on the other hand, is fast and easy to
implement in a real-time system, with the major drawback that it is not good at
detecting distant faces. (ibid., 6)
Table 3: Average time and memory complexity for face detection on FDDB.
6.2 Alignment
Face alignment is a computer vision technique used to determine the geometric
structure of the human face in images (Figure 26), and it is considered one of the
important stages of face recognition. Besides face recognition, face alignment is
applied in other face-related applications such as deep fakes, face synthesis, and face
modelling. Given the position and size of a face, the positions of facial landmarks such
as the eyes, nose, and mouth are calculated automatically. Due to factors such as
varying posture, lighting, and partial occlusion in face pictures, face alignment is a
complicated problem.
Figure 26. Face alignment and landmark.
(Source: https://ldl.herokuapp.com )
According to Timothy F. Cootes, et al. (2001, 682), the Active Appearance Model
discussed in chapter two uses the density of the face's pixels to obtain better
accuracy; the main challenge with AAM, however, is the labelling effort (positioning
the landmarks on face features for the training set).
Face alignment with 3D object detection algorithms aligns not only the appearance of
the face but also the head pose; 2D alignment algorithms cannot reach the depth of
an occluded face. Lie Gu and Takeo Kanade (2006, 1) describe their 3D patch-based
approach: "A face is modelled by a set of sparse 3D points (shape) and the view-based
patches (appearance) associated with every point". It has two advantages: first, it is
easier to compensate for local illumination, and second, the texture variance within a
patch is considerably smaller than that of the entire face. In this section, we examine
four methods that have a large impact on face recognition and explain the process
and principles of each one.
6.2.1 Supervised Descent Method and its Applications to Face Alignment
According to Xuehan Xiong and Fernando De la Torre (2013, 3), their approach
formulates face alignment as a minimisation problem, calculated using the formula:

f(x0, Δx) = ‖h(d(x0 + Δx)) − θ∗‖₂²

where d represents an image, d(x) are the landmarks in the image, and h is a feature
extraction function such as SIFT (Scale-Invariant Feature Transform), a feature
detection algorithm that computes feature values for images. θ∗ indicates the SIFT
values at the manually labelled landmarks. The classical approach to such
minimisation problems is Newton's method, which finds the minimum of a scalar
function by approximating the loss function with a quadratic surface and stepping to
its optimal point; Figure 27 compares gradient descent with Newton's method. This
approach demands computing the inverse of the Hessian matrix, which has two
disadvantages: the computational cost for massive data, and infeasibility for
non-differentiable functions.
Figure 27. A comparison of gradient descent (green) and Newton's method (red) for
minimizing a function.
( Source: https://en.wikipedia.org )
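The contrast in Figure 27 can be reproduced numerically. The following sketch, using an illustrative quadratic f(x) = (x − 3)², compares a fixed-step gradient descent update with the Newton update, which divides by the curvature:

```python
def descend(grad, hess, x0, steps=20, lr=0.1):
    """Minimise a 1-D function two ways: gradient descent vs. Newton's method."""
    x_gd = x_newton = x0
    for _ in range(steps):
        x_gd -= lr * grad(x_gd)                       # fixed-step gradient descent
        x_newton -= grad(x_newton) / hess(x_newton)   # Newton step uses curvature
    return x_gd, x_newton

# Example: f(x) = (x - 3)^2, minimised at x = 3.
grad_f = lambda x: 2.0 * (x - 3.0)
hess_f = lambda x: 2.0
```

On a quadratic, Newton's method lands on the minimum in a single step, while gradient descent only approaches it geometrically; this is the efficiency SDM tries to retain while avoiding the Hessian inverse.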
The supervised descent algorithm is first validated on simple analytic functions and
compared to Newton's method. The algorithm was then tested on facial feature
detection on two datasets, achieving 96.7% accuracy on the LFPW dataset and 98.7%
on the LFW-A&C dataset, and evaluated by comparison with modern detection
methods, showing the cumulative error distribution against linear regression and
Belhumeur's method. Finally, the algorithm was tested on facial feature tracking on a
video dataset, where it attempts to detect facial landmarks in each frame.
6.2.2 Face Alignment at 3000 FPS via Regressing Local Binary Features
Shaoqing Ren, Xudong Cao, Yichen Wei, and Jian Sun (2014, 3) presented the Local
Binary Features (LBF) algorithm, attempting to achieve a significant error reduction
and improvement in speed. Shaoqing R. et al. (2014, 1) state that LBF runs at 300 FPS
on mobile devices and 3,000 FPS on desktop, which opens new possibilities for online
face applications on portable devices. Instead of SIFT features, binary features are
learned from the training data; each feature is learned independently in its local
region. Utilising LBF lowers the face alignment error rate, increases the
discriminativeness of the features, and diminishes the computational load. To find the
local features in an image, the LBF approach takes regions near each landmark h and
solves the learning step for the local features in that area using the regression target,
with the formula:
min_{w_h^t, φ_h^t} Σ_{i=1}^{N} ‖ π_h ∘ ΔŜ_i^t − w_h^t φ_h^t(I_i, S_i^{t−1}) ‖₂²
LBF updates both the local mapping φ_h^t and the local regression weights w_h^t
simultaneously; the operator π_h extracts the two elements (2h − 1, 2h) from the
vector ΔŜ_i^t in each iteration. The index i denotes the training samples, so
π_h ∘ ΔŜ_i^t is the ground truth of the h-th landmark in the i-th training sample.
The LBF experiments and evaluation were performed on three datasets. The first is
the LFPW dataset (29 landmarks), collected from the web. The second is the Helen
dataset (194 landmarks), containing 2,300 high-resolution web images. The third
dataset is 300-W (68 landmarks), collected from existing datasets (AFW, LFPW, Helen,
XM2VTS). These datasets were manually split into two portions for the training set
and the test set.
6.2.3 Robust Facial Landmark Detection under Significant Head Poses and Occlusion
According to Yue Wu and Qiang Ji (2015, 3660-3661), the purpose of modelling head
poses and occlusion is to take landmark visibility into consideration and train models
able to predict the visibility of the face's features. This contrasts with LBF, which
handles all facial landmarks symmetrically without considering occlusions and head
pose (Figure 28, which illustrates head poses and occlusion for two subjects). The
authors display the visible landmark points as the final output for large head poses.
The pre-trained model assists face alignment by extracting local face features,
combining them with the visibility information, and forming a configuration around
the face landmarks.
Figure 28. Facial landmark detection and occlusion prediction in different iterations.
(Source: https://openaccess.thecvf.com )
min_{Δp_t} ‖ Δp_t − T_t Ψ(I, x_{t−1}) ‖₂² + λ E_{p_t}[Loss(c)]
p_t = p_{t−1} + Δp_t, where 0 ≤ p_t ≤ 1
The mathematical formula used in HPO calculates the loss function Loss(c), where c
denotes each possible occlusion and head pose pattern that can occur over the m
landmark points; c is a vector of length 2^m. The last equation (p_t) solves the
problem iteratively by optimising the function with respect to Δp and T. To update T
(the parameters), the authors use a least-squares formulation, and to solve the Δp
problem they use a gradient descent method to update the prediction task. They
evaluated their algorithm on three databases: the first dataset was collected from the
internet, the second is Labelled Face Parts in the Wild (LFPW), and the last is the
Helen dataset. The authors took into consideration the degrees of inclination and
declination for near-frontal head poses and limited occlusion in each dataset.
6.2.4 Joint Head Pose Estimation and Face Alignment Framework
Xiang Xu and Ioannis A. Kakadiaris (2017, 2) state that JFA was the first approach to
compute global and local CNN features to improve both head pose estimation and
face alignment for the landmark detection task. Since head pose and face alignment
are deeply correlated, and to reduce errors in both, the authors trained CNNs to
detect facial features using different head pose combinations from multiple datasets.
The JFA algorithm has two parts to detect and localise faces in images, from global to
local features, analysing the faces in a cascade manner: first, global CNN features
produce an appropriate initialisation to diminish the variance of the bounding boxes
around faces; second, local CNN features provide discriminative features for the
cascade regression. It was the first time global and local features were used together
with CNN techniques in a cascade; by strengthening the relationship between the
head pose and the landmarks, the algorithm establishes proper shape initialisations
using the following formula at iteration L.
S_L = S_{L−1} + W_L θ_L(I, S_{L−1})
where I is a face image and S denotes the landmarks. θ_L(I, S_{L−1}) is the most
crucial part of the formula: it is the function that extracts the facial features from the
image at the previously estimated landmarks.
Figure 29. Landmark detection using the Dlib implementation (top) and the JFA
algorithm (bottom).
( Source: https://www.researchgate.net )
JFA used multiple datasets gathered from the 300-W competition for training and
evaluation, including the LFPW, AFW, HELEN, and IBUG datasets. The images were
distributed into two sections. The first was a training set consisting of 3,146 images
from LFPW, AFW, and HELEN. The second is called the full testing set, since it contains
two parts: a common testing set of 689 images collected from the LFPW and HELEN
testing sets, and a challenging testing set of 135 images (with significant head pose
variations and lower resolution) assembled from the IBUG dataset. These datasets are
annotated with 68 landmarks but without head pose information (Figure 29).
Summary
To sum up, this chapter presented four facial landmark algorithms. SDM and LBF use
cascaded regressors to predict the coordinates of landmarks directly from
shape-indexed features; HPO maintains a binary landmark occlusion vector and
updates the visibility likelihoods and the landmark positions over iterations to achieve
convergence between the face features and the suggested landmarks; finally, JFA
takes a different approach: according to Xiang Xu, et al. (2017, 1), it "use[s] the global
and local CNN features to solve head pose estimation and landmark detection tasks
jointly". Essentially, all the algorithms aim to estimate the head pose angle and
introduce a constrained supervised regression to achieve accurate convergence.
Table 4: Facial landmark detection error.

Algorithms   Helen 194 L   Helen 68 L   LFPW 68 L   LFPW 29 L   300W 68 L
SDM          5.82          -            -           3.47        5.57
LBF          5.41          6.58         5.58        3.35        4.95
HPO          5.49          -            -           3.93        -
JFA          -             5.48         5.08        -           5.32
The error ratios shown in Table 4, obtained from Hongwen Zhang, Qi Li, and Zhenan
Sun (2018, 7) and from Yue Wu and Qiang Ji (2015, 3665), on the Helen dataset with
194 and 68 landmarks, the LFPW dataset with 68 and 29 landmarks, and the 300W
dataset with 68 landmarks, show that all the algorithms have approximately the same
error ratio on the same dataset; the differences lie in the detection speed on
real-time systems and the handling of head pose variations. These algorithms achieve
different results according to the head pose, and the databases used in these
experiments contain different head poses, as shown in Table 5. The experimental
results demonstrate the algorithms' effectiveness on face images with extreme
appearance variations, heavy occlusions, and large head poses.
Table 5: Head pose variations.
(Source: https://ibug.doc.ic.ac.uk )
7 Experiment and Result
Developing an optimal masked face recognition algorithm is a significant challenge in
computer vision, with limited sample cases and only a few reference datasets. The
experiment for the masked face recognition system was implemented by connecting
the three steps from chapter three (detection, alignment, and recognition); by
utilising multiple solutions for each stage, the author was able to evaluate the
efficiency and the speed of the experiment.
The author used the fastai library, which builds on PyTorch, as Jeremy Howard and
Sylvain Gugger state in their article "fastai: A Layered API for Deep Learning" (2020, 3),
to develop the model and test the effectiveness of the methods used in this chapter.
The author used two datasets for the training and testing steps: the first was
collected by the author, given that no masked face dataset existed, and the second
was loaded from (https://makeml.app/datasets/mask). These datasets and methods
were run in Jupyter notebooks using the Python programming language, and the
processes were executed on various cloud platforms, such as Google Cloud,
Paperspace, and Google Colab. The computational cost of the different methods was
measured via time and accuracy.
7.1 Detection and Extraction
Face detection and localisation are a long-standing challenge in computer vision. In
the previous chapter, we investigated face detection through multiple techniques,
from machine learning algorithms to deep learning methods. The human face has a
unique structure, owing to local facial parts such as the eyes, nose, and mouth, and
these features assist in localising and detecting faces under unconstrained conditions.
7.1.1 Data
All face detection systems necessitate face datasets for training and testing purposes.
In the deep learning approach, the accuracy of any CNN relies enormously on the
scale of the training dataset. Despite the many face detection and localisation
datasets available, their usage is often restricted to research purposes and prohibited
for commercial use. Therefore, in the experiment, the author started with the
Labelled Faces in the Wild (LFW) dataset, which contains 13,000 images, and the
Large-scale CelebFaces Attributes (CelebA) dataset, which includes 200,000 images.
To make the datasets compatible with our models, a mask must be added to each
face to craft a new masked face dataset. The author therefore embedded masks on
the datasets programmatically, using the Dlib facial landmark detector to estimate
the locations of the 68 coordinates that map the face points; the idea behind using
the landmark detector was also to align faces that have a large head pose (Figure 30).
The results were good but not perfect compared to a real medical mask, which might
lead the model to make false predictions if the data provided is not accurate;
moreover, these new images lacked any annotations. The author, on the other hand,
was looking for annotations that include the face coordinates in images together with
information about each face; such annotations help to develop a face detection
model alongside a mask classifier.
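The mask-embedding step can be sketched as follows. This is a hypothetical helper, not the author's actual code: it assumes the standard 68-point Dlib layout, in which indices 2–14 trace the lower jaw and index 29 sits on the nose bridge (the top edge of a mask); the warping of a mask image onto the resulting polygon is omitted.

```python
def mask_polygon(landmarks):
    """Build the polygon covering the lower face where a medical mask would sit.

    `landmarks` is a list of 68 (x, y) tuples in the standard Dlib ordering
    (an assumption for this sketch).
    """
    if len(landmarks) != 68:
        raise ValueError("expected the 68-point Dlib landmark layout")
    jaw = landmarks[2:15]   # lower jaw contour, left cheek to right cheek
    top = landmarks[29]     # nose-bridge point closes the polygon at the top
    return jaw + [top]
```

In practice a mask texture would then be warped onto this polygon (e.g. with an affine or perspective transform), which is why the landmark positions, and hence the alignment for large head poses, matter so much.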
Figure 30. LFW dataset with medical masks.
To solve this dilemma, the author used the Face Mask Detection dataset obtained
from Kaggle (a subsidiary of Google). This dataset includes 853 images (Figure 31)
covering three classes (Mask, Without-Mask, Mask-Worn-Incorrectly). The dataset is
divided into two sections (images and annotations).
Figure 31. Face Mask Detection dataset.
The images section contains 4,072 faces across the images. The dataset was split for training with a ratio of 0.8 for learning (682 images) and 0.2 for testing (171 images). The annotations section provides details on each image, such as the category name for each face, the image dimensions, and the face coordinates, as shown in Table 6. The second dataset was collected using the Google and Firefox browsers by specifically selecting images of people wearing a medical mask; we obtained 494 images, each containing one face, and this dataset was used for the recognition phase.
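The 0.8/0.2 split above can be reproduced with a short stdlib helper. The filename pattern and fixed seed are illustrative assumptions, and note that the reported split (682 training, 171 test) sums to 853 images:

```python
import random

def train_test_split(items, train_ratio=0.8, seed=42):
    """Shuffle a list of items and split it into train/test portions."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# 853 annotated images, matching the 682 + 171 split reported in the text.
images = [f"maksssksksss{i}.png" for i in range(853)]
train, test = train_test_split(images)
print(len(train), len(test))  # 682 171
```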
Table 6: A CSV file representing the annotation details for the Face Mask Detection dataset: image names, dimensions, face categories, and coordinates extracted from the XML files in the dataset.
Image Dimensions Face 1 Face 2
0 maksssksksss579.png ['400', '226'] ['with_mask', '150', '26', '193', '86']
['with_mask', '204', '127', '245', '175']
1 maksssksksss136.png ['267', '400'] ['with_mask', '100', '105', '152', '158']
0
2 maksssksksss384.png ['267', '400'] ['with_mask', '37', '244', '50', '258']
['with_mask', '146', '238', '161', '255']
3 maksssksksss245.png ['400', '210'] ['with_mask', '32', '25', '57', '54']
['with_mask', '23', '162', '38', '179']
4 maksssksksss451.png ['400', '273'] ['with_mask', '11', '1', '33', '13']
['with_mask', '120', '7', '141', '28']
5 maksssksksss699.png ['400', '279'] ['without_mask', '18', '82', '64', '131']
['without_mask', '18', '198', '66', '245']
6 maksssksksss249.png ['400', '267'] ['with_mask', '168', '65', '234', '139']
['without_mask', '309', '119', '368', '191']
7 maksssksksss770.png ['400', '210'] ['with_mask', '45', '43', '51', '53']
['with_mask', '80', '39', '97', '64']
8 maksssksksss597.png ['400', '267'] ['with_mask', '295', '125', '317', '149']
['with_mask', '344', '115', '366', '142']
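Rows like those in Table 6 can be extracted from the dataset's XML annotation files with a short stdlib parser. The tag names below (size, object, name, bndbox, xmin, ...) follow the common Pascal VOC layout that this Kaggle dataset uses; treat them as an assumption if your copy differs:

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<annotation>
  <filename>maksssksksss579.png</filename>
  <size><width>400</width><height>226</height></size>
  <object>
    <name>with_mask</name>
    <bndbox><xmin>150</xmin><ymin>26</ymin><xmax>193</xmax><ymax>86</ymax></bndbox>
  </object>
</annotation>
"""

def parse_annotation(xml_text):
    """Turn one VOC-style annotation into a row like those in Table 6."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    row = {
        "image": root.findtext("filename"),
        "dims": [size.findtext("width"), size.findtext("height")],
        "faces": [],
    }
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        row["faces"].append(
            [obj.findtext("name")]
            + [box.findtext(k) for k in ("xmin", "ymin", "xmax", "ymax")])
    return row

row = parse_annotation(SAMPLE)
print(row["image"], row["dims"], row["faces"][0])
```

Collecting such rows over every XML file yields the CSV shown in Table 6.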
7.1.2 Implementation
For the experiments, we used three pre-trained models (InceptionV3, MobileNetV2, and VGG16), training them on Google Colaboratory, a free Jupyter notebook environment running on a Tesla K80 GPU (12 GB). The author used these three pre-trained models for two reasons. First, the models have well-designed architectures, with 94 convolutional layers (ConvLs) for InceptionV3, 35 ConvLs for MobileNetV2, and 13 ConvLs for VGG16, which is a very simple architecture that nevertheless utilises high GPU power. Second, using these pre-trained models achieves a good accuracy rate in the detection process (Figure 32). To extract faces from video or a live stream, the OpenCV library offers OpenCV-dnn, a deep learning pre-trained model trained to detect faces under various head poses and divergent illumination. For training the face detection models, the TensorFlow framework and Fastai were used: three pre-trained models were utilised for transfer learning in TensorFlow and two in Fastai. Since the author uses small masked-face datasets, pre-trained models built on these frameworks are necessary to detect masked faces in the video stream.
Figure 32. Models complexity (parameters size) comparing to total memory
utilization.
(Source: https://arxiv.org)
First, using the TensorFlow Object Detection API, which contains pre-trained models, facilitates the task by reducing the time and computational effort needed to train the masked face (MF) detector. The InceptionV3, MobileNetV2, and ResNet101 pre-trained models were used in the detection and localization experiments, since their inference speed was expected to be fast enough. These models were configured so that the shape of the input images to the detector is (224*224*3); by diminishing the shape of the images, the detector achieves more reliable results and a shorter detection time.
In the detection experiments, after 20 epochs on the training and validation datasets, the author randomly chose some epochs to demonstrate the validation accuracy and validation loss and compare the efficiency of these models, as shown in Table 7.
Table 7: Four epochs of masked face detection and localization on a small dataset, presenting the validation accuracy and loss for each model.
Methods Validation Accuracy Validation Loss
Epoch 1
InceptionV3 0.8626 0.3353
MobileNetV2 0.8380 0.3794
VGG16 0.7411 0.5950
Epoch 7
InceptionV3 0.8908 0.2700
MobileNetV2 0.8699 0.3206
VGG16 0.8110 0.5201
Epoch 16
InceptionV3 0.8834 0.3050
MobileNetV2 0.8908 0.2846
VGG16 0.8172 0.5068
Epoch 20
InceptionV3 0.9067 0.2240
MobileNetV2 0.8969 0.2692
VGG16 0.8172 0.4976
By using the OpenCV deep neural network (dnn) module, officially released in 2017, together with our models, we achieved quite high accuracy in detecting faces with and without a mask. The detection task is completed using the pre-trained models, which are object detection models used in our approach to enhance facial features within a face region, together with data augmentation to deal with occlusions and small faces.
As can be seen from Table 7, the validation accuracy and loss for both InceptionV3 and MobileNetV2 are substantially better on a small dataset than for VGG16. This is because VGG16 was trained on an extensive image dataset (15 million labelled high-resolution images) from ImageNet to classify more than 22,000 categories. It also has a simple CNN architecture, not built to detect and classify faces, with more than 14 million parameters (Table 8). To solve the underfitting problem with VGG16, we would need more data, different augmentations, and a different architecture to achieve good accuracy. Fitting the model further to our dataset is not an essential step in the experiment, since we already have two models with proper accuracy and no other source of unrestricted masked-face data.
Table 8: InceptionV3, MobileNetV2, and VGG16 parameters with number of ConvLs.
Parameters InceptionV3 MobileNetV2 VGG16
Total params 22,328,099 2,586,691 14,846,787
Trainable Params 525,315 328,707 132,099
Convolutional layers 94 - Conv2D 35 - Conv2D 13 - Conv2D
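The parameter counts in Table 8 follow from standard layer formulas. A minimal sketch contrasting an ordinary Conv2D layer with the depth-wise separable form used by MobileNetV2 (biases omitted in the separable variant for brevity; the layer sizes are VGG16's actual first convolution, 64 filters of 3x3 over RGB):

```python
def conv2d_params(kernel_h, kernel_w, in_channels, out_channels, bias=True):
    """Learnable parameters in a standard Conv2D layer."""
    per_filter = kernel_h * kernel_w * in_channels + (1 if bias else 0)
    return per_filter * out_channels

# VGG16's first convolution: 64 filters of 3x3 over a 3-channel RGB input.
print(conv2d_params(3, 3, 3, 64))   # 1792

def separable_params(kernel_h, kernel_w, in_channels, out_channels):
    """Depth-wise separable convolution: a per-channel spatial filter
    plus a 1x1 pointwise mix (biases omitted for brevity)."""
    depthwise = kernel_h * kernel_w * in_channels   # one filter per channel
    pointwise = in_channels * out_channels          # 1x1 convolution
    return depthwise + pointwise

print(separable_params(3, 3, 3, 64))  # 219 vs 1792 for the standard layer
```

This factorisation is why MobileNetV2's total in Table 8 is an order of magnitude below the other two models.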
The facial detection process is responsible for detecting and extracting the faces in an image or in any frame of a video; it is considered the first step of the face recognition pipeline. After the detection step, we use the extracted face coordinates to draw bounding boxes around the faces (Figure 33), which shows the outputs of the models (a) InceptionV3, (b) MobileNetV2, and (c) VGG16. The detection process draws red bounding boxes for unmasked faces and green bounding boxes for masked faces, accompanied by the prediction accuracy. The prediction accuracy differs depending on the model used (Figure 33), and the prediction outcomes all look approximately close to each other. From Table 7, however, we found that the InceptionV3 and MobileNetV2 models had better accuracy and lower validation loss and were faster to train than VGG16. This is because these models (MobileNetV2 and InceptionV3) utilise depth-wise separable convolutions, which reduce the number of parameters. On the contrary, according to Song Han, Huizi Mao, and William J. Dally (2016, 1-2), the VGG16 model has a large number of parameters that consume storage capacity and GPU memory, and its simple convolutional architecture with three fully connected layers makes it a very heavy structure.
(a) InceptionV3 (b) MobileNetV2 (c) VGG16
Figure 33. Red and Green bounding boxes with different models.
Finally, the last step in the detection and alignment phase is to resize the extracted faces from an image without losing the face's purity and clarity. The output face image is square, with the face covering the whole image at a size of (224, 224) pixels, to increase the speed of the recognition process. In Figure 34 (a) we provide the original image containing four subjects to the detection phase, and the outcome, seen in Figure 34 (b), is four extracted faces with the same dimensions.
(a) (b)
Figure 34. (a) Original image (b) faces cropped to 224x224 pixel dimensions.
(source: https://www.pexels.com )
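One way to produce the square crops described above without distorting the face is to expand each detected bounding box to a square before resizing. This helper is an illustrative sketch (the thesis does not give its exact cropping code); a real pipeline would finish with cv2.resize to (224, 224):

```python
def square_crop_box(x1, y1, x2, y2, img_w, img_h):
    """Expand a face bounding box to a square, clamped inside the image,
    so the later resize to 224x224 does not stretch the face."""
    w, h = x2 - x1, y2 - y1
    side = max(w, h)
    cx, cy = x1 + w // 2, y1 + h // 2
    # Centre the square on the face, then clamp it inside the image bounds.
    nx1 = min(max(cx - side // 2, 0), img_w - side)
    ny1 = min(max(cy - side // 2, 0), img_h - side)
    return nx1, ny1, nx1 + side, ny1 + side

# A 41x49 detection inside a 400x273 frame becomes a 49x49 square crop.
box = square_crop_box(150, 26, 191, 75, 400, 273)
print(box)  # (146, 26, 195, 75)
```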
We also demonstrate the validation accuracy and loss percentages on the masked face dataset after running 20 epochs for each model (InceptionV3, MobileNetV2, and VGG16). We set the learning rate parameter to 1e-4, and the batch size for each iteration was 32 for all models. The training time differed between the models: InceptionV3 consumed 14.8 minutes to finish 20 epochs in the second iteration, MobileNetV2 was faster, accomplishing the same epochs in 13.6 minutes, while VGG16, with its large number of parameters, took longer than the others at 16.8 minutes. As can be seen in Figures 35 and 36, comparing the models' performance on the same dataset with the same configuration, we choose InceptionV3 from among the three models. The InceptionV3 model outperforms MobileNetV2 with a validation accuracy around (0.9070) and a validation loss about (0.2240), as shown by the low loss thresholds across different epochs (Figures 35, 36).
Figure 35. Validation Accuracy implemented by three models (Inception V3,
MobileNetV2, VGG16) on Masked faces Dataset for 20 epochs.
Figure 36. Validation Loss implemented by three models (Inception V3, MobileNetV2,
VGG16) on Masked faces Dataset for 20 epochs.
7.2 Recognition
In order to distinguish between two faces, several approaches have been proposed to solve the discrimination task. The author described some of the machine learning algorithms that calculate the variance between embedded faces in chapter two. Over the years, different methods have been presented in the classification field, depending on how the features are calculated. Face recognition methods can, however, be classified into three categories (feature extraction, dimensionality reduction, and hybrid approaches).
7.2.1 Landmarks
In our experiment, to develop an identification model from single-shot image classification, we first set up a neural network masked face detector to find and extract the faces; we then embed the face features into 128 dimensions to discriminate between the classes by passing the features through a deep neural network. Local feature methods are used to identify and describe the facial features of a face with specific geometrical properties (Figure 37).
Figure 37. Cropping the local visible features from an extracted face.
We set the local features to those covering the upper side of the face, since half of the face is hidden, using the Dlib-ml open-source library to detect the face feature map. The dlib library provides a shape detector (68 face landmarks) that covers the entire face (Figure 38 a). We used the 68-point shape detector to implement the alignment task and correct the face rotation, so that we can remove the masked region efficiently. We also use 24 landmarks to localise and represent the salient features of the face, such as the eyes, eyebrows, and the centre of the face (Figure 38 b).
(a) (b) (c)
Figure 38. 2D face landmarks (a) 68 landmarks for the entire face (b) 24 landmarks
for visible parts of the face (c) cropped the visible parts with the landmarks.
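The 24-landmark subset can be selected by index from dlib's 68-point output. The thesis does not spell out which 24 indices are kept, so the selection below (eyebrows 17-26, eyes 36-47, and the top of the nose bridge 27-28 as the face centre) is an illustrative assumption that happens to total 24 points:

```python
# Assumed upper-face subset of dlib's 68-point scheme:
# eyebrows 17-26 (10 pts), eyes 36-47 (12 pts), nose bridge top 27-28 (2 pts).
UPPER_FACE_IDX = list(range(17, 27)) + list(range(36, 48)) + [27, 28]

def upper_face_landmarks(landmarks68):
    """Select the visible upper-face subset from dlib's 68 (x, y) points."""
    if len(landmarks68) != 68:
        raise ValueError("expected 68 landmarks")
    return [landmarks68[i] for i in UPPER_FACE_IDX]

pts = [(i, i) for i in range(68)]   # stand-in for a detected shape
subset = upper_face_landmarks(pts)
print(len(subset))  # 24
```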
7.2.2 Visible features embedding
To embed the visible facial features into 128 dimensions, we need to crop the detected faces; for this we chose InceptionV3 for its lightweight architecture and accuracy. It is implemented with Keras, an open-source neural network library running on TensorFlow that is used to implement computer vision tasks. To create a DCNN that embeds face features, Google researchers introduced a state-of-the-art network that takes a face image as input and produces a 128-D embedding as output. This technique eases the classification problem, since classifying many people using a deep learning pre-trained model takes a long time to compare, and we need to accomplish this task in milliseconds. According to Florian Schroff, Dmitry Kalenichenko, and James Philbin (2015, 815-816), FaceNet uses a DCNN (deep convolutional neural network) trained so that distances between embeddings correspond to face similarity.
The authors of FaceNet used a triplet loss function to calculate the similarity between two embedded faces. Since the network has three inputs (anchor, positive, and negative objects), the idea behind this loss function is that the anchor object should be relatively closer to the positive object than to the negative object, and the formula for this comparison is:

ℒ(x, y, z) = max(‖f(x) − f(y)‖² − ‖f(x) − f(z)‖² + θ, 0)

where x denotes the anchor object, y the positive object, and z the negative object. f() indicates the function that embeds the image into 128 dimensions, and θ represents the margin between the positive and negative parts, that is, the discriminative value between image pairs.
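The triplet loss above can be rendered directly in a few lines of dependency-free Python; the margin value 0.2 is an assumption for illustration, not a value taken from the thesis:

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(||f(x)-f(y)||^2 - ||f(x)-f(z)||^2 + margin, 0): the anchor must
    be closer to the positive than to the negative by at least the margin."""
    return max(sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin, 0.0)

a = [0.0, 0.0]   # anchor embedding (toy 2-D stand-in for 128-D)
p = [0.1, 0.0]   # positive: close to the anchor
n = [1.0, 0.0]   # negative: far from the anchor
print(triplet_loss(a, p, n))  # 0.0 -> this triplet already satisfies the margin
print(triplet_loss(a, n, p))  # 1.19 -> positive loss when the roles are swapped
```

Training minimises this loss over many triplets, which pulls same-identity embeddings together and pushes different identities apart.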
Embedding means taking a few necessary measurements from the input image and encoding them into 128 dimensions (Figure 39). There are useful pre-trained models for encoding a face image into vectors, such as FaceNet from Google, DeepFace from Facebook, and Rekognition from Amazon; these models are trained to encode facial features and make comparisons. For our purpose, we need a model that encodes the upper part of the face into a 128-dimensional vector. Different configurations were tested to fulfil this requirement, but FaceNet showed the best performance. The authors of FaceNet prove that 128 dimensions are enough to achieve excellent accuracy compared to modern methods such as DeepFace. The FaceNet model was implemented with TensorFlow using 3x96x96 input images, based on the output size of the OpenCV-dnn face detection model. During the encoding step, we normalised the embedded face, which means scaling values measured on different ranges to a standard scale; to normalise our vectors we used the scikit-learn library, which provides various classification, regression, and clustering algorithms.
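The normalisation step is scikit-learn's per-vector L2 scaling (Normalizer with norm='l2'); the same operation in plain Python, shown on a toy 2-D vector standing in for a 128-D embedding:

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit L2 norm, as sklearn's Normalizer (norm='l2')
    does per embedding, so distances are comparable across faces."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)        # leave the zero vector untouched
    return [x / norm for x in vec]

emb = [3.0, 4.0]                # toy stand-in for a 128-D FaceNet embedding
unit = l2_normalize(emb)
print(unit)                     # [0.6, 0.8]
print(sum(x * x for x in unit)) # ~1.0: unit length after normalisation
```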
Figure 39. Embedding a face image to 128-Dimensional vector.
7.2.3 Classification
In our experiments on the classification task, we used different models to compare the accuracy and the standard deviation of the accuracy on both datasets (masked faces, unmasked faces). The standard model for face classification in this thesis is the Support Vector Machine. The support vector machine was introduced by Corinna Cortes and Vladimir Vapnik in the paper "Support-Vector Networks" (1995) as a new method for solving classification problems. The authors proved the efficiency of the Support Vector Machine by conducting experiments on different datasets. The small dataset includes 7,300 training patterns and 2,000 testing patterns taken from the US Postal Service database. The large dataset was obtained from the NIST dataset of handwritten character digits, collected by the National Institute of Standards and Technology, and contains 60,000 training samples and 10,000 testing samples. The idea of the SVM algorithm is to define the optimal hyperplane and generalise to non-linearly separable problems. Compared to an ANN, the SVM does not suffer from the curse of dimensionality or from overfitting, which begins in the training session when the algorithm attempts to achieve zero error on all training data.
Yichuan Tang (2013, 1) states that support vector machines were formulated for binary classification, learning from the given training data and its corresponding labels. Two loss functions are used to calculate the validation loss: the first, called L1-SVM, uses a linear sum of slack variables, and the second, called L2-SVM, a squared sum of slack variables. The L2-SVM is considered better than the L1-SVM because "The L2-SVM is differentiable and imposes a bigger (quadratic vs linear) loss for points which violate the margin" (ibid., 2). Mathematically, the L2-SVM objective is simply a squared version of the L1-SVM hinge term, minimising the squared hinge loss:
min_W ½ WᵀW + C ∑ₙ₌₁ᴺ max(1 − WᵀXₙtₙ, 0)²
where Xₙ denotes the input data, tₙ ∈ {−1, +1} is the target label, and the max term gives the squared slack variable. To predict the class label of a test datum x:

arg maxₜ (Wᵀx)t
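The L2-SVM objective and the prediction rule above can be written out directly for a linear classifier w; the data points below are toy values for illustration:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_svm_objective(w, data, C=1.0):
    """Regulariser plus C times the sum of squared hinge terms.
    data: list of (x, t) pairs with label t in {-1, +1}."""
    reg = 0.5 * dot(w, w)
    hinge_sq = sum(max(1.0 - dot(w, x) * t, 0.0) ** 2 for x, t in data)
    return reg + C * hinge_sq

def predict(w, x):
    """arg max over t in {-1, +1} of (w^T x) t."""
    return 1 if dot(w, x) >= 0 else -1

w = [1.0, -1.0]
data = [([2.0, 0.0], 1), ([0.0, 2.0], -1), ([0.5, 0.0], 1)]
print(l2_svm_objective(w, data))  # 1.25: only the third point violates the margin
print(predict(w, [3.0, 1.0]))     # 1
```

Minimising this objective over w is what an L2-SVM trainer does; only points inside the margin contribute to the (quadratic) loss.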
As can be seen from Table 9, the accuracy rates of the neural network model and the support vector machine model are approximately similar; the main difference is the training speed. The accuracy deviation for the SVM is a low standard deviation, which indicates high precision. However, linear discriminant analysis (FisherFace) is the best algorithm for appearance-based classification, reducing the input dimensions and achieving robust performance on masked face recognition.
Table 9: Accuracy and standard deviation for Five different models over masked
faces dataset.
Model Accuracy Deviation
Logistic Regression 0.868 0.0946
Linear Discriminant Analysis 0.973 0.0280
Kneighbors Classifier 0.885 0.1027
Support Vector Machine 0.948 0.0446
Neural Network 0.939 0.0416
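The Accuracy and Deviation columns in Table 9 are the mean and standard deviation of per-fold scores from cross-validation (as produced by, e.g., scikit-learn's cross_val_score). The aggregation step in stdlib Python, with illustrative fold scores rather than the thesis's actual fold results:

```python
import statistics

def summarize_folds(fold_scores):
    """Mean and population standard deviation of per-fold accuracies:
    the two numbers reported per model in Table 9."""
    return (round(statistics.mean(fold_scores), 3),
            round(statistics.pstdev(fold_scores), 4))

# Illustrative 5-fold accuracies for one classifier (made-up values).
folds = [0.95, 0.90, 0.98, 0.94, 0.97]
print(summarize_folds(folds))
```

A small deviation means the classifier's accuracy is stable across folds, which is why it is reported alongside the mean.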
8 - Discussion
“After all, the ultimate goal of all research is not objectivity, but truth.” (Helene
Deutsch 1996, 1)
This chapter explains what is needed to avoid the possible obstructions that the researcher might encounter in future work. In addition, it addresses the answers to the research questions set out in chapter 2.3, provides guidance on selecting the right variables for the study to make the practical part of the research more credible, and presents the obstacles faced during the research journey. At the end of the chapter, the author gives recommendations for further research on different methodologies.
8.1 Answers to research questions
The first research question was: What are the best approaches used to develop an
OFR system?
The goal of this research was to develop a system capable of detecting occluded faces, and the author initially aimed to develop one from scratch. During the first months of examining the literature on face recognition theories and testing various applications, the author concluded that it would be futile to start from scratch, particularly when there are so many previous studies, applications, and tools that can make the task easier. The author instead began to investigate the hypotheses and explored three main AI libraries used to build the OFR system.
At that time, the process seemed very complex and required a deep understanding of the fundamentals of machine learning in order to address the challenges that could arise during development. It was necessary to study the problem mathematically, because the FR technique is in fact a mathematical representation of a face, meaning that everyone's face has a different mathematical representation; moreover, the comparison process is carried out by applying various mathematical equations to these representations. For that reason, the author studied the book by Aaron Courville, Ian Goodfellow, and Yoshua Bengio (Deep Learning, 2016), one of the most valuable references for understanding the mathematical operations that occur within the convolutional layers. The author also obtained a certification from deeplearning.ai presented by Dr Andrew Ng, one of the brightest researchers in the field of artificial intelligence, from Stanford University. With the knowledge gained through this journey, the author was able to set out the approach by which the OFR system could be developed.
The second research question was: How does the quantity and quality of data affect
the OFR system?
Good data versus poor data is one of the biggest topics concerning researchers in the AI field, and it affects the behaviour of any AI system. Unfortunately, machine learning algorithms are unable to conclude that the data being analysed is unreliable; in some situations, this can lead to deceptive results and false predictions. Even data of relatively high quality can lead to incorrect results, potentially causing face recognition systems to falsely identify individuals, which may place the wrong people in bad circumstances if the system is used by the authorities.
In general, more data of good quality leads to a more favourable outcome. Researchers need to take time to determine whether it is worthwhile to collect the data and whether the quantity collected is enough to fulfil the purpose.
The author used two methods of data collection, as mentioned in chapter 2.4. The collected data was not of the quality or quantity required to develop an OFR system from scratch; to overcome this obstacle, the author used several pre-trained models, and by applying these models and their weights to the collected data, the author was able to achieve a good rate of detecting and localizing masked faces.
Before feeding data to a machine learning system, it is necessary to take some time to examine the data and determine whether it is possible to increase or boost its overall quality. A small improvement in data quality goes a long way towards improving the system.
The third research question was: How does the OFR system detect and extract visual
facial features?
The implementation of the OFR system has been presented in detail in chapter 7; it is almost like any FR system. The author trained several models with different measurements by manipulating the variables respectively. Multiple errors appeared during the development process, sometimes breaking the system or producing false predictions. By tuning all the model's measurements and examining the data repeatedly, the OFR system began to predict correctly and the accuracy ratio started improving. The author took into consideration three main obstacles that affect the operation of any such system:
• Data
• Head pose and illumination
• Model design
Failure to correctly prepare any of these variables can lead to severe errors. The data variable has already been discussed in the previous subsection; the head pose and illumination variables are taken into account as part of the variety of data provided to the system; finally, the model architecture involves multiple sub-variables (input shape, CNN layers, stride, padding, pooling layers, number of parameters, activation functions, flatten, fully connected layers, etc.). All of these variables are essential to developing an accurate model.
The fourth research question was: What is the evaluation of the OFR system’s
performance?
In this subsection, the author presents the evaluation method and the obstacles the study can encounter. Usually, models' results are evaluated by examining and comparing them with the results of previous work on the same database, but the author could not find many comparable studies against which to evaluate the results. This is not the only way to make sure a system works properly. In order to evaluate the OFR system, the author evaluated detection and recognition separately, using different pre-trained models and comparing the outcome for each model as presented in chapter 7. This evaluation approach is not bad, but it is also not fully reliable, as it is possible that the variables were incorrectly set for all the models the author used, and the evaluation measurements can be misleading.
Therefore, the author used two methods to evaluate the models' performance: the first was the statistical results presented in Table 7, comparing the accuracy and the loss, and the second was a supervised evaluation, giving sample images to the models and recording the accuracy ratio for each sample on each model respectively.
8.2 Conclusion
This study presented a critical overview of five face recognition algorithms and deep learning functionality, and offered a comparative analysis of six databases and related benchmarks. It also highlighted the weaknesses of the state-of-the-art approaches for detecting occluded faces and designed a new approach to overcome these deficiencies.
The study aims to present a strategy for developing a face recognition system for masked faces. The problem was separated into phases, and each phase was researched, developed, and evaluated independently. The first phase was to develop deep neural network models that can detect and align faces, including the occluded part, so that the system can crop the visible part of the face. Next, the system crops the extracted visible part of the face and embeds it into a 128-dimensional vector using the FaceNet pre-trained model. The last phase was to develop a recognizer model for the prediction. The system was designed to run in real time with low computing power consumption.
The pipeline built to solve this problem is introduced in Appendix A; the diagram illustrates how the system works and how new input data should be embedded and trained to be compatible with the system. The system was tested and demonstrated in a real-life environment. All the detection models were trained with the same dataset and the same number of iterations, as mentioned in section 6.1. We have seen that InceptionV3 and MobileNetV2 achieve 90% accuracy when the faces are clear and the lighting is good. The thesis life cycle starts by defining the deep learning approach and how the machine learns, and explains some face recognition algorithms and their performance over the past twenty years. For the training and evaluation phase, the author also went through some of the datasets used as accuracy benchmarks for testing model performance.
From the results obtained in the experiments, it can be inferred that binary comparison techniques are not suitable at large scale and are not reliable approaches for multi-class classification. The recognition rate decreases rapidly to less than 60% depending on the lighting, the head pose, and whether the person is more than 2 m from the camera. It should be noted that the OFR system is geared towards controlled lighting conditions and a fixed head position; under varying circumstances of these variables, the efficiency of the recognition system wobbles between 45% and 64%, depending on the measured distance from the camera to the target in real video streaming.
Finally, the recognition phase was partially achieved using the SVM and ResNet101. The system was trained to recognize in real time on the small dataset, which was not production-ready. The detection model was competent to detect faces with and without a mask with 99% accuracy. The recognition phase needs more data for training to achieve good accuracy at large scale, and more time to handle data augmentation. Moreover, the design used to develop the model could be changed to extract more features as measurements of the distances between the visible features (distance between the eyes, face width, 3D geometry of the face, etc.).
8.3 Recommendation for future work
As a complement to this study, there is a range of research lines that remain open and on which it is possible to expand further. Some of these potential lines appeared during the study phase and have been left open, to be discussed in the future. Many of them are directly related to this thesis work and to problems that occurred during the research; the rest are only general lines suggested by the author for future work by other researchers. The following is a list of recommendations that can be examined in the future:
• Instead of using a Support Vector Machine binary classification, it is suggested to use a different approach for a large database. Multiclass SVMs and SoftMax are used in deep learning as a standard for multi-class classification, and from the author's perspective they are ideal ways to complete the recognition task with high accuracy.
• Enhance the OFR by using a high-quality camera with a Raspberry Pi 4 (1 GB) as a controller with facial detection and recognition built in. The Raspberry Pi 4 can be made compatible with our model by including the OFR system and enhancing the visual functionalities.
• The identification process can be significantly enhanced by using various datasets containing many high-quality images with scaled diversity. Datasets such as the Tufts Face Database and the CelebA dataset will boost system performance if the data is properly organised and defined.
• The author tried using the 3D model techniques derived from the work of Feng Liu, Qijun Zhao, Xiaoming Liu, and Dan Zeng (2018). The idea is good and reliable, but it requires a long time to research and examine; it also demands particular data with a special detector that sets landmarks on the visible parts of 3D masked faces.
8.4 Summary
This thesis had two objectives: first, understanding the methods and algorithms used to perform the detection, localization, alignment, and classification steps on faces in an image, and what kind of evaluation has been used to check whether these algorithms perform well on faces in different poses; and second, developing a model that can classify individuals wearing a medical mask.
To address these questions, the author used quantitative experimental research to collect the data and test the experiments' efficiency. Since our research method is quantitative, the author has control over the data obtained in order to achieve the desired results. The data used in the study were collected from March to July 2020 and were classified according to priority to accomplish the task and answer the questions.
After studying the methods and hypotheses used to identify or recognize an object, it became easier for the author to understand identification and how it is implemented in real life. A crucial point arose in the middle of the research process, namely "data": to use the deep learning approach, you need an immense amount of data to achieve excellent accuracy so that the model can learn properly. For this reason, the various datasets provided in this study have previously been used as a pillar for testing several face recognition theories and algorithms; these datasets have become widespread in the computer vision community because they are structured to include thousands of images accompanied by annotations that contain face coordinates. As a result of this research, the author was able to implement his models in a real-life application and obtain an appropriate outcome for identification of a frontal face with a mask. Finally, the approach was based on deep learning and machine learning techniques, and a small dataset was used to evaluate the efficiency of the system.
References
Ali Sharifara, Mohd Shafry Mohd Rahim and Yasaman Anisi. 2014. “A General Review
of Human Face Detection Including a Study of Neural Networks and Haar Feature-
based Cascade Classifier in Face Detection”. Available from:
https://www.researchgate.net/publication/282680769_A_
Alok Sharma, Kuldip K. Paliwal. 2013. “Linear discriminant analysis for the small sample
size problem: an overview”. Available from:
https://link.springer.com/article/10.1007/s13042-013-0226-9
Anne Marie Monchamp .2008. “ALÈTHEIA TRUTH OF THE PAST”. Available from:
https://novaojs.newcastle.edu.au
Ayush Singhal, Pradeep Sinha, Rakesh Pant. 2017. “Use of Deep Learning in Modern
Recommendation System: A Summary of Recent Works”, Available from:
https://arxiv.org/ftp/arxiv/papers/1712/1712.07525.pdf
C. Li, Y. Diao, H. Ma and Y. Li., 2008. “A Statistical PCA Method for Face Recognition”,
Available from:
https://ieeexplore.ieee.org/abstract/document/4740022/citations#citations
Claudia Iancu and Peter M. Corcoran. 2011.“A Review of Hidden Markov Models in
Face Recognition”, Available from:
https://www.intechopen.com/predownload/17168
Corinna Cortes and Vladimir Vapnik. 1995. “Support-Vector Networks”. Available
from: http://image.diku.dk/imagecanon/material/cortes_vapnik95.pdf
Melissa P. Johnston. 2014. “Secondary Data Analysis: A Method of Which the Time
Has Come”. Available from:
http://www.qqml-journal.net/index.php/qqml/article/view/169/170
Dr. P. Shanmugavadivu and Ashish Kumar. 2016. “Rapid Face Detection and Annotation
with Loosely Face Geometry”. Available from:
https://www.researchgate.net/publication/316733019_Rapid_face_detection_and_
David E. Rumelhart, Yves Chauvin. 1995. “Backpropagation: Theory, Architectures, and
Applications”, Available from:
https://books.google.fi/books?hl=en&lr=&id=oWRv7BR4BqMC&oi
Dilip Singh Sisodia, Ram Bilas Pachori, Lalit Garg. 2020. “Handbook of research on
advancements of artificial intelligence in healthcare engineering”. Available from:
https://books.google.fi/books?id=SQfYDwAAQBAJ&pg=PA123&lpg
F. H. Alhadi, W. Fakhr and A. Farag. 2005. “Hidden Markov Models for Face
Recognition”. Available from: https://www.researchgate.net/publication/220939899
Feng Liu, Qijun Zhao, Xiaoming Liu, Dan Zeng. 2017. “Joint Face Alignment and 3D Face
Reconstruction with Application to Face Recognition”. Available from:
https://arxiv.org/pdf/1708.02734.pdf
Ferdinando Samaria, Frank Fallside. 2007. “Face Identification and Feature Extraction
Using Hidden Markov Models”. Available from:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.18.851
Ian Goodfellow, Yoshua Bengio, Aaron Courville. 2016. “Deep Learning”. Available
from: http://www.deeplearningbook.org/
Jeremy Howard and Sylvain Gugger. 2020. “fastai: A Layered API for Deep Learning”.
Available from: https://arxiv.org/pdf/2002.04688.pdf
Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, Christoph Bregler.
2015. “Efficient Object Localization Using Convolutional Networks”, Available from:
https://arxiv.org/pdf/1411.4280.pdf
Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi. 2016. “You Only Look
Once: Unified, Real-Time Object Detection”. Available from:
https://arxiv.org/abs/1506.02640
K. Mallat and J-L. Dugelay. 2018. “A Benchmark Database of Visible and Thermal Paired
Face Images across Multiple Variations”. International Conference of the Biometrics
Special Interest Group (BIOSIG), pages 199-206. Available from:
http://www.eurecom.fr/fr/publication/5700/download/sec-publi-5700.pdf
Kim Esbensen, Paul Geladi. 1987. “Principal Component Analysis”. Available from:
https://www.sciencedirect.com/science/article/abs/pii/0169743987800849
L. R. Rabiner and B. H. Juang. 1986. “An Introduction to Hidden Markov Models”.
Available from:
http://ai.stanford.edu/~pabbeel/depth_qual/Rabiner_Juang_hmms.pdf
Laurenz Wiskott, Jean-Marc Fellous, Norbert Krüger, and Christoph von der
Malsburg. 1999. “Face Recognition by Elastic Bunch Graph Matching”. Available
from: https://www.researchgate.net
Lawrence R. Rabiner. 1989. “A Tutorial on Hidden Markov Models and selected
application in speech recognition”, Available from:
https://www.ece.ucsb.edu/Faculty/Rabiner/ece259/Reprints/
Le Thanh Nguyen-Meidine, Eric Granger, Madhu Kiran, Louis-Antoine Blais-Morin.
2017. “A comparison of CNN-based face and head detectors for real-time video
surveillance applications”. Available from: https://ieeexplore.ieee.org/
Lie Gu and Takeo Kanade. 2006. “3D Alignment of Face in a Single Image”. Available
from: https://ieeexplore.ieee.org/document/1640900
Mahmoud Afifi and Abdelrahman Abdelhamed. 2019. “AFIF4: Deep Gender
Classification Based on an AdaBoost-based Fusion of Isolated Facial Features and
Foggy Faces”. Journal of Visual Communication and Image Representation. Available
from: https://arxiv.org/pdf/1706.04277.pdf
Mislav Grgic, Kresimir Delac, and Sonja Grgic. 2009. “SCface – surveillance cameras
face database”. Available from:
https://link.springer.com/article/10.1007/s11042-009-0417-2
N. Mohanty, A. Lee-St. John, R. Manmatha, T.M. Rath. 2013. “Shape-Based Image
Classification and Retrieval”. Available from:
https://www.sciencedirect.com/science/article/pii/B9780444538598000102
P. Jonathon Phillips, Hyeonjoon Moon, Patrick Rauss, and Jeffery Huang. 1998. “The
FERET Evaluation Methodology for Face-Recognition Algorithms”. Available from:
https://www.nist.gov/system/files/documents/2016/12/15/feret_database
Paul Viola and Michael Jones. 2001. “Rapid Object Detection using a Boosted Cascade
of Simple Features”. Available from: http://web.iitd.ac.in/~sumeet/viola-cvpr-01.pdf
Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu,
Pavel Kuksa. 2011. “Natural Language Processing (Almost) from Scratch”, Available
from: http://www.jmlr.org/papers/volume12/collobert11a/collobert11a.pdf
Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik. 2014. “Rich feature
hierarchies for accurate object detection and semantic segmentation”. Available from:
https://arxiv.org/abs/1311.2524
Ross Girshick. 2015. “Fast R-CNN”. Available from:
https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Girshick_
Sasan Karamizadeh, Shahidan M. Abdullah, Azizah A. Manaf, Mazdak Zamani, Alireza
Hooman. 2013. “An Overview of Principal Component Analysis”. Available from:
https://www.researchgate.net/publication/262527828_An_Overview_of_
Shahpour Alirezaee, Hassan Aghaeinia, Karim Faez, and Farid Askari. 2006. “An
Efficient Algorithm for Face Localization”. Available from:
https://www.researchgate.net/publication/240706465_An_Efficient_Algorithm
Shaoqing Ren, Xudong Cao, Yichen Wei, and Jian Sun. 2014. “Face Alignment at
3000 FPS via Regressing Local Binary Features”. Available from:
https://www.cv-foundation.org/openaccess/content_cvpr_2014/papers/Ren_Face_Alignment_at_2014_CVPR_paper.pdf
Simone Bianco. 2016. “Large age-gap face verification by feature injection in deep
networks”. Available from: https://arxiv.org/pdf/1602.06149.pdf
Song Han, Huizi Mao, and William J. Dally. 2016. “Deep Compression: Compressing
Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding”.
Available from: https://arxiv.org/pdf/1510.00149.pdf
Tara N. Sainath, Abdel-rahman Mohamed, Brian Kingsbury, Bhuvana Ramabhadran.
2013. “Deep Convolutional Neural Networks for LVCSR”. Available from:
http://www.cs.toronto.edu/~asamir/papers/icassp13_cnn.pdf
Timothy F. Cootes, Gareth J. Edwards, and Christopher J. Taylor. 2001. “Active
Appearance Models”, Available from:
https://people.eecs.berkeley.edu/~efros/courses/AP06/Papers/cootes-pami-01.pdf
Vineet Kushwaha, Maneet Singh, Richa Singh, Mayank Vatsa. 2018. “Disguised Faces
in the Wild”. Available from:
http://iab-rubric.org/papers/2018_CVPRW_disguised-faces-wild.pdf
Vittorio Castelli. “Nearest Neighbor Classifiers”. Available from:
http://www.ee.columbia.edu/~vittorio/lecture8.pdf
W. Zhao, R. Chellappa, P. J. Phillips, A. Rosenfeld. 2003. “Face Recognition: A Literature
Survey”, Available from: https://inc.ucsd.edu/~marni/Igert/Zhao_2003.pdf
Xinbo Gao, Ya Su, Xuelong Li, and Dacheng Tao. 2010. “A Review of Active Appearance
Models”. Available from:
https://www.researchgate.net/publication/220509234_A_Review
Xuehan Xiong and Fernando De la Torre. 2013. “Supervised Descent Method and Its
Applications to Face Alignment”. Available from:
https://www.ri.cmu.edu/pub_files/2013/5/main.pdf
Yue Wu and Qiang Ji. 2015. “Robust Facial Landmark Detection under Significant
Head Poses and Occlusion”. Available from:
https://openaccess.thecvf.com/content_iccv_2015/papers/Wu_Robust_Facial
Appendices
Appendix A: The pipeline to build a face recognition system for masked faces.
Appendix B: Embedding a face without a mask and the same face with a mask into a 128-dimensional vector.
Appendix C: System Configuration.
Practical work
In the present system, an application for masked facial recognition was implemented on
the author's personal computer with Python 3.7, which allowed connecting all the models
and initializing the system. The OpenCV library, which supports image processing, was
used together with several Python packages (NumPy, Pandas, Matplotlib, Scikit-Learn, os)
to facilitate reading the input images and encoding them as multi-dimensional tensors.
The author trained three different models (InceptionV3, MobileNetV2, VGG16) on a
masked-face detection dataset. Because the data was limited (854 images with 4072
faces), a data augmentation technique was applied to enlarge it. These three models
detect faces with and without a mask and return a prediction probability.
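A Keras ImageDataGenerator is one common way to implement such augmentation; the specific transform ranges below are assumptions, since the thesis does not list its exact settings:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Illustrative augmentation settings (assumed values).
aug = ImageDataGenerator(
    rotation_range=20,       # random rotations up to 20 degrees
    zoom_range=0.15,         # random zoom in/out
    width_shift_range=0.2,   # horizontal translation
    height_shift_range=0.2,  # vertical translation
    shear_range=0.15,
    horizontal_flip=True,
    fill_mode="nearest",     # fill pixels exposed by the transforms
)
```

Each training batch drawn from `aug.flow(...)` is a freshly transformed copy of the originals, so the model effectively sees a larger dataset.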
Then, the data was divided into training and testing datasets:
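With scikit-learn, the split can be sketched as below; the dummy arrays and the 20 % test fraction are assumptions standing in for the real dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Dummy stand-ins for the face crops and their mask / no-mask labels;
# the real arrays come from the 854-image dataset described above.
data = rng.random((100, 224, 224, 3), dtype=np.float32)
labels = np.array([0, 1] * 50)

# An 80/20 stratified split so both classes appear in both subsets.
trainX, testX, trainY, testY = train_test_split(
    data, labels, test_size=0.20, stratify=labels, random_state=42
)
```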
Also, some hyper-parameters were initialized, such as the learning rate, the number of
epochs, and the batch size:
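A possible initialization is shown below; only the epoch count (20) is confirmed by the reported results, while the learning rate and batch size are typical assumed values:

```python
# Hyper-parameters for the fine-tuning run.
INIT_LR = 1e-4     # initial learning rate (assumed value)
EPOCHS = 20        # number of training epochs (matches the reported runs)
BATCH_SIZE = 32    # batch size (assumed value)
```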
Then, the pre-trained model was downloaded, the architecture was set up with its
weights, and the input shape was defined as 224 × 224 × 3. In addition, an Average
Pooling layer of size (5 × 5) was used to reduce any overfitting that may occur and to
increase the training speed. Moreover, two activation functions were used (ReLU and
SoftMax):
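Using MobileNetV2 as one of the three backbones, this architecture can be sketched as follows; `weights=None` keeps the sketch offline (a real run would load the "imagenet" weights), and the 128-unit dense layer and dropout are assumed details not stated in the text:

```python
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.layers import (AveragePooling2D, Dense, Dropout,
                                     Flatten, Input)
from tensorflow.keras.models import Model

# Pre-trained backbone without its original classification head.
base = MobileNetV2(weights=None, include_top=False,
                   input_tensor=Input(shape=(224, 224, 3)))

# New head: 5x5 average pooling, a ReLU layer, and a two-way SoftMax
# output (mask / no mask).
head = AveragePooling2D(pool_size=(5, 5))(base.output)
head = Flatten()(head)
head = Dense(128, activation="relu")(head)
head = Dropout(0.5)(head)
head = Dense(2, activation="softmax")(head)

model = Model(inputs=base.input, outputs=head)

# Freeze the backbone so only the new head is trained.
for layer in base.layers:
    layer.trainable = False
```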
Furthermore, the Adam optimization algorithm was applied to update the network
weights iteratively based on the training data:
Then the training process was run:
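The compile-and-fit step might look like the sketch below; a tiny stand-in model and random data keep it self-contained and runnable, and only the general pattern (Adam optimizer, iterative weight updates over epochs) reflects the thesis, where the model is the fine-tuned pre-trained network and the arrays are the training split:

```python
import numpy as np
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam

# Minimal two-class stand-in model (assumed shapes, for illustration only).
model = Sequential([Input(shape=(64,)), Dense(2, activation="softmax")])
model.compile(optimizer=Adam(learning_rate=1e-4),
              loss="categorical_crossentropy",
              metrics=["accuracy"])

trainX = np.random.rand(32, 64).astype("float32")
trainY = np.eye(2)[np.random.randint(0, 2, 32)]   # one-hot labels

# Two epochs here just to show the call; the reported runs use 20 epochs.
history = model.fit(trainX, trainY, batch_size=8, epochs=2, verbose=0)
```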
The results of the first training run (20 epochs):
The results of the second training run (20 epochs):