Real-Time Masked Face Recognition Using
Machine Learning Support Vector Machine (SVM)
BASSAM AL-ASADI
Master’s thesis
November 2020
Information Technology
Full Stack Software Development
Description
Author(s) Al-Asadi, Bassam
Type of publication Master’s thesis
Date November 2020
Language of publication: English
Number of pages 88
Permission for web publication: x
Title of publication Real-Time Masked Face Recognition Using Machine Learning
Support Vector Machine (SVM)
Degree programme Full Stack Software Development
Supervisor(s) Huotari, Jouni. Kotikoski, Sampo.
Assigned by
Abstract
An enormous number of robust face recognition systems have been deployed to help authorities and commercial companies recognize people. While these systems work very well, a new problem emerged that forced people everywhere to wear medical face masks to prevent the spread of the COVID-19 pandemic. These systems began to fail in their predictions, which led to many problems. The efficiency of facial recognition systems can deteriorate significantly due to occlusions such as medical masks, hats, facial hair, and sunglasses. Major companies started gathering photographs of people wearing medical masks posted on their accounts to develop their facial recognition technologies, and they are struggling to keep the recognition technology up to date and appropriate.
Essential data were collected from theories, observations, and reviews to develop methods and solve the recognition problem using quantitative experimental research. These experiments were applied in a restricted real-time streaming system. In this thesis, the author aims to develop a model that can detect faces with and without a mask and classify each face by the identity it belongs to. Detection was the first step: extracting faces from a frame image to feed them into the recognition phase. The recognition model was trained on different pre-trained object recognition models with the same data and evaluated in multiple environments to achieve good accuracy for a limited set of identities.
Keywords/tags (subjects) Face Recognition, Deep Learning, Machine Learning, Datasets, Support Vector Machine (SVM)
Miscellaneous (Confidential information)
Abbreviations and Acronyms
AAM Active Appearance Models
AI Artificial Intelligence
ANN Artificial Neural Network
AFW Annotated Face in-the-Wild dataset
CNN Convolutional Neural Network
ConvL Convolutional layer
ConvNet Convolutional Network
CPU Central Processing Unit
DFW Disguised Faces in the Wild dataset
DL Deep Learning
DNN Deep Neural Network
EBGM Elastic Bunch Graph Matching
FERET Facial Recognition Technology dataset
FR Facial Recognition
GPU Graphics Processing Unit
HMM Hidden Markov Models
HPO Head Poses and Occlusion
IBUG Tight face bounding box dataset
IOU Intersection over Union
JFA Joint Head Pose Estimation and Face Alignment Framework
L1-SVM Linear norm Support Vector Machine
L2-SVM Square norm Support Vector Machine
LAG Large Age-Gap Database
LBF Local Binary Features
LDA Linear Discriminant Analysis
LFPW Labelled Face Parts in the Wild
ML Machine Learning
OFR Occluded Face Recognition
PCA Principal Component Analysis
R-CNN Region-Based Convolutional Neural Networks
ReLU Rectified Linear Unit
ROI Regions of Interest
RPN Region Proposal Network
SCfaceDB Surveillance Cameras Face Database
SIFT Scale-Invariant Feature Transform
SOF Specs on Faces dataset
SVM Support Vector Machine
XM2VTS Extended Multi-modal face database
YOLO You Only Look Once
Contents
1 INTRODUCTION ........................................................................................................................ 5
2 RESEARCH ................................................................................................................................. 7
2.1 PURPOSE ....................................................................................................................................... 7
2.2 OBJECTIVES .................................................................................................................................... 7
2.3 RESEARCH QUESTIONS ...................................................................................................................... 8
2.4 RESEARCH METHODS ........................................................................................................................ 9
3 BACKGROUND ........................................................................................................................ 11
3.1 NEURAL NETWORKS ....................................................................................................................... 11
3.1.1 Deep FeedForward Networks .......................................................................................... 12
3.1.2 Activation Function ......................................................................................................... 13
3.1.4 Optimization (Gradient Descent) .................................................................................... 16
3.1.5 Backward Propagation .................................................................................................... 18
3.2 CONVOLUTIONAL NEURAL NETWORKS (CNN) .................................................................................... 19
4 FACE RECOGNITION ALGORITHMS ......................................................................................... 24
4.1 AAM - ACTIVE APPEARANCE MODELS ............................................................................................... 25
4.2 HMM - HIDDEN MARKOV MODELS.................................................................................................. 26
4.3 PCA - PRINCIPAL COMPONENT ANALYSIS ............................................................................................ 28
4.4 LDA - LINEAR DISCRIMINANT ANALYSIS ............................................................................................. 30
4.5 EBGM - ELASTIC BUNCH GRAPH MATCHING ...................................................................................... 31
5 STANDARD BENCHMARKS...................................................................................................... 33
5.1 FERET DATABASE ......................................................................................................................... 33
5.2 SCFACEDB LANDMARKS ................................................................................................................. 34
5.3 SPECS ON FACES (SOF) DATASET ...................................................................................................... 35
5.4 LARGE AGE-GAP DATABASE (LAG) ................................................................................................... 36
5.5 DISGUISED FACES IN THE WILD (DFW) .............................................................................................. 37
5.6 EURECOM VISIBLE AND THERMAL PAIRED FACE DATABASE .................................................................. 39
6 FACE RECOGNITION PIPELINE ................................................................................................. 40
6.1 DETECTION AND LOCALIZATION ........................................................................................................ 40
6.1.1 Viola-Jones ...................................................................................................................... 41
6.1.2 You Only Look Once (YOLO) ............................................................................................. 42
6.1.3 Faster R-CNN ................................................................................................................... 44
Summary ....................................................................................................................................... 47
6.2 ALIGNMENT .................................................................................................................................. 48
6.2.1 Supervised Descent Method and its Applications to Face Alignment ............................. 50
6.2.2 Face Alignment at 3000 FPS via Regressing Local Binary Features ................................. 51
6.2.3 Robust Facial Landmark Detection under Significant Head Poses and Occlusion ........... 52
6.2.4 Joint Head Pose Estimation and Face Alignment Framework ......................................... 53
Summary ....................................................................................................................................... 55
7 EXPERIMENT AND RESULT ..................................................................................................... 56
7.1 DETECTION AND EXTRACTION........................................................................................................... 57
7.1.1 Data ................................................................................................................................. 57
7.1.2 Implementation ............................................................................................................... 59
7.2 RECOGNITION ............................................................................................................................... 65
7.2.1 Landmarks ....................................................................................................................... 66
7.2.2 Visible features embedding ............................................................................................. 67
7.2.3 Classification ................................................................................................................... 69
8 DISCUSSION .................................................................................................................... 71
8.1 ANSWERS TO RESEARCH QUESTIONS ...................................................................................................... 71
8.2 CONCLUSION .................................................................................................................................... 74
8.3 RECOMMENDATION FOR FUTURE WORK ................................................................................................. 76
8.4 SUMMARY ....................................................................................................................................... 77
REFERENCES ..................................................................................................................................... 78
APPENDICES ..................................................................................................................................... 83
Figures
FIGURE 1. SIMPLE NEURAL NETWORK. ........................................................................................................ 12
FIGURE 2. NEURAL NETWORK STRUCTURE ................................................................................................... 13
FIGURE 3. LINEAR ACTIVATION FUNCTION. ................................................................................................... 14
FIGURE 4. GRAPH OF SIGMOID, TANH, AND RELU FUNCTIONS (NON-LINEAR ACTIVATION FUNCTION). ..................... 15
FIGURE 5. FIVE ITERATIONS OF GRADIENT DESCENT ....................................................................................... 17
FIGURE 6. BACKWARD PROPAGATION ......................................................................................................... 18
FIGURE 7. THE STRUCTURE OF A CNN, CONSISTING OF CONVOLUTIONAL, POOLING, AND FULLY-CONNECTED LAYERS. . 20
FIGURE 8. MAX POOLING AND AVERAGE POOLING. ....................................................................................... 21
FIGURE 9. FULLY-CONNECTED LAYER ........................................................................................................... 21
FIGURE 10. CONVOLUTION LAYER OPERATION. ............................................................................................. 22
FIGURE 11. SHAPE AND LABELLED IMAGE. .................................................................................................... 25
FIGURE 12. SAMPLE OF TRAINING DATA FOR ERGODIC HMM 2- LEFT-TO-RIGHT MODELS. ................................... 27
FIGURE 13. SAMPLE OF TRAINING DATA FOR TOP-TO-BOTTOM HMM. .............................................................. 28
FIGURE 14. LDA INFLUENCE ON THE DATA TO SEPARATE THE CLASSES, CONSIDERING EACH COLOUR IS A VARIABLE. .... 30
FIGURE 16. EXAMPLE OF DIFFERENT CATEGORIES OF PHOTOS FOR ONE INDIVIDUAL. ............................................. 34
FIGURE 17. EXAMPLE OF DIFFERENT POSE IMAGES. ........................................................................................ 35
FIGURE 18. SAMPLES OF THE SPECS ON FACES (SOF) DATASET. ....................................................................... 36
FIGURE 19. EXAMPLES OF FACE CROPS FOR MATCHING PAIRS. .......................................................................... 36
FIGURE 20. SAMPLE IMAGES OF THREE SUBJECTS FROM THE DFW DATASET. ...................................................... 37
FIGURE 21. VISIBLE AND THERMAL FACE IMAGES. .......................................................................................... 39
FIGURE 22. RECTANGLE FEATURES FOR OBJECT DETECTION. ............................................................................. 42
FIGURE 23. YOLO BOUNDING BOXES, CONFIDENCE, AND CLASS PROBABILITY MAP............................................... 43
FIGURE 24. INTERSECTION OVER UNION (IOU). ............................................................................................ 44
FIGURE 25. OBJECT DETECTION BY FASTER R-CNN. ...................................................................................... 45
FIGURE 26. FACE ALIGNMENT AND LANDMARK. ............................................................................................ 49
FIGURE 27. A COMPARISON OF GRADIENT DESCENT (GREEN) AND NEWTON'S METHOD (RED) FOR MINIMIZING A
FUNCTION. ............................................................................................................................................. 50
FIGURE 28. FACIAL LANDMARK DETECTION AND OCCLUSION PREDICTION IN DIFFERENT ITERATIONS. ........................ 52
FIGURE 29. THE TOP IMAGES USE (DLIB IMPLEMENTATION) AND BOTTOM IMAGES USE JFA ALGORITHM FOR LANDMARK
DETECTION. ............................................................................................................................................ 54
FIGURE 30. LFW DATASET WITH MEDICAL MASKS. ........................................................................................ 58
FIGURE 31. FACE MASK DETECTION DATASET. .............................................................................................. 58
FIGURE 32. MODELS COMPLEXITY (PARAMETERS SIZE) COMPARING TO TOTAL MEMORY UTILIZATION. ..................... 60
FIGURE 33. RED AND GREEN BOUNDING BOXES WITH DIFFERENT MODELS. ......................................................... 63
FIGURE 34. (A) ORIGINAL IMAGE (B) FACES' CROPPED WITH 224X224 PIXELS DIMENSION. .................................... 64
FIGURE 35. VALIDATION ACCURACY IMPLEMENTED BY THREE MODELS (INCEPTION V3, MOBILENETV2, VGG16) ON
MASKED FACES DATASET FOR 20 EPOCHS. ................................................................................................... 65
FIGURE 36. VALIDATION LOSS IMPLEMENTED BY THREE MODELS (INCEPTION V3, MOBILENETV2, VGG16) ON
MASKED FACES DATASET FOR 20 EPOCHS. ................................................................................................... 65
FIGURE 37. CROPPED THE LOCAL VISIBLE FEATURES FROM EXTRACTED FACE. ....................................................... 66
FIGURE 38. 2D FACE LANDMARKS (A) 68 LANDMARKS FOR THE ENTIRE FACE (B) 24 LANDMARKS FOR VISIBLE PARTS OF
THE FACE (C) CROPPED THE VISIBLE PARTS WITH THE LANDMARKS. .................................................................... 67
FIGURE 39. EMBEDDING A FACE IMAGE TO 128-DIMENSIONAL VECTOR. ........................................................... 69
Tables
TABLE 1: IMAGES IN THE TRAINING AND TESTING PARTITION. ........................................................................... 38
TABLE 2: AVERAGE ACCURACY OF FACE AND HEAD DETECTION ON THE FDDB DATASET AND CASABLANCA DATASET. .. 47
TABLE 3: AVERAGE TIME AND MEMORY COMPLEXITY FOR FACE DETECTION ON FDDB. ......................................... 48
TABLE 4: FACIAL LANDMARKS DETECTION ERROR. .......................................................................................... 55
TABLE 5: HEAD POSE VARIATIONS. .............................................................................................................. 56
TABLE 6: A CSV FILE REPRESENTS ALL THE ANNOTATIONS DETAILS FOR (FACE MASK DETECTION DATASET) IMAGES
NAME, DIMENSIONS, FACES' CATEGORY AND COORDINATES EXTRACTED FROM XML FILES IN THE DATASET. ............... 59
TABLE 7: FOUR EPOCHS FOR MASKED FACE DETECTION AND LOCALIZATION ON A SMALL DATASET, PRESENTING THE
VALIDATION ACCURACY AND LOSS FOR EACH MODEL. ...................................................................................... 61
TABLE 8: INCEPTIONV3, MOBILENETV2, AND VGG16 PARAMETERS WITH NUMBER OF CONVLS. .......................... 62
TABLE 9: ACCURACY AND STANDARD DEVIATION FOR FIVE DIFFERENT MODELS OVER MASKED FACES DATASET. .......... 70
1 Introduction
Artificial Intelligence and neural networks have already transformed internet technologies across the globe from something useful into something significant in our lives. Artificial intelligence innovations have intervened in all fields, ranging from improved health care, where machines can be better suited to cancer detection than any doctor, to self-driving cars that are considered safer than human drivers, not to mention the practical assistance of AI in simulation, measuring, monitoring, and resource management for climate change, conservation, and the environment.
Artificial neural networks are among the most important technologies ever developed that can operate without human intervention. In conventional programming techniques, we tell the machine the steps for breaking big problems down into small ones in different scenarios, and the device then uses its computational capabilities to help users manage data more rapidly and effectively. Machine learning, by contrast, uses a massive amount of data to develop a model that can classify and predict, solving complicated problems without human interference. We therefore train the machine and create various models that predict the outcome with high accuracy, which is a significant aid in accomplishing complicated missions. The computer can find a solution to many problems when supplied with an appropriate amount of labelled data and trained with supervised learning techniques.
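As a minimal illustration of supervised learning on labelled data, the following sketch uses scikit-learn's SVC, since the Support Vector Machine is the classifier family central to this thesis. The toy feature vectors and identity labels are invented purely for illustration; they are not the thesis data.

```python
from sklearn.svm import SVC

# Toy labelled data: each row is a feature vector, each label an identity.
# (Hypothetical values chosen only to demonstrate the fit/predict workflow.)
X = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
y = ["alice", "alice", "bob", "bob"]

clf = SVC(kernel="linear")  # a linear Support Vector Machine
clf.fit(X, y)               # learn a separating pattern from the labelled data

print(clf.predict([[0.85, 0.95]]))  # prints ['bob']
```

Given labelled examples, the classifier learns a decision boundary and can then assign an identity to a previously unseen feature vector.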
Nonetheless, the modern world has had its share of problems not in space but on the ground. Over the last twenty years, we have faced a spike in global terrorism. This problem affects law-enforcement agencies at airports, border crossings, and ports across the globe that handle millions of people travelling daily.
According to the United States Department of Transportation, around 900 million passengers travelled through airports in 2019; hence, managing the traffic flow by humans was a complicated task to accomplish. To avoid security breaches and human mistakes, governments began funding agencies and companies to develop these technologies, to enhance human work, and to fulfil the growing need for improved artificial intelligence. Since iris and fingerprint recognition are slow and not safe enough for the task, authorities started to use a new biometric recognition method called "Face Recognition", and these techniques have become popular and ubiquitous in the last five years.
At the end of 2019, the world was confronted with a problem unprecedented in the history of humanity: the coronavirus pandemic (COVID-19). The COVID-19 virus affects the respiratory system and spreads through close contact. As a result, most people began to use face masks for protection; thus, facial recognition systems struggle to identify faces wearing medical masks.
This thesis aims to explore and test a new detection model and evaluation algorithms and their efficacy on masked faces. The author creates a masked faces dataset by collecting images from the internet for use with the recognition models, and a masked face detection dataset is used to train different models to detect masked faces and recognize them. It was a complicated problem, since there were no labelled datasets of masked faces on which to train the recognition model, nor was there enough earlier research on this subject for comparison and evaluation.
The methodology of developing artificial intelligence systems is correlated with full-stack development approaches in several respects, such as providing front-end applications to the end users and hosting the AI models on a back end built to run on a server. In general, FR systems use many software development methodologies to improve system performance, such as the agile approach, which lets facial recognition systems be deployed in easily manageable environments and delivers results quickly and under constant control. Agile software development is appropriate for computer vision systems because end users need to be involved early for further examination, adjustments, improvements, and, finally, evaluation of the facial recognition models. Many developers nowadays use facial recognition technologies to grant access to various applications.
2 Research
2.1 Purpose
Face recognition is one of the computer vision problems that has gained a lot of attention in the previous decade. Many researchers have contributed to developing and innovating new theories to improve computer vision and have applied these theories in real applications. Various studies have been published on face recognition, focusing on developing methods to ascertain the presence of a face and recognize it. This study aims to clarify the obstacles of detection and identification and to introduce an approach capable of developing a new recognition system method for both circumstances (full-visible features and half-visible features).
2.2 Objectives
The thesis aims to develop a new facial recognition system capable of recognizing an individual wearing a medical mask. To understand how the facial recognition system functions, the author divides the issue into four stages. The purpose of the division is to make the problem understandable and open to experiment by taking each stage on its own and clarifying its characteristics and the experiments that need to be carried out. The thesis therefore presents these steps, beginning with some popular algorithms used previously to implement facial recognition systems, alongside the datasets used to build and evaluate them. The next step is to research the detection and alignment processes, to understand the mechanisms behind them and how they can be applied in the system.
2.3 Research Questions
According to Dawson R. Hancock and Bob Algozzine (2017, 3), various types of questions (What? How? Why?) have driven scholars to explore why things have happened and to create more specific approaches. Usually, when specialists analyse a topic, they are seeking feedback for a better understanding of the subject, alternative scenarios for analysis, and possible explanations for review. This feedback and these questions have driven the study to draw conclusions that are reliable, practical, and interpretable.
According to Hancock et al. (2017, 4), a research effort should not be carried out without an organizational paradigm. This paradigm lays out for the researcher the distinguishing characteristics of the study and the possibilities for obtaining answers to the questions. Therefore, the author determines a paradigm to conduct systematic research that investigates the study's topic by identifying three critical pillars to design the research process:
• Study methods
• Gathering information
• Confirmation of results
These three essential pillars drove the study to comprehend research methods, data, and forms of analysis, to illustrate the OFR stages, and to understand the data required to build a highly accurate model. Following the previous steps, the research questions based on this triple paradigm are:
1. What are the best approaches used to develop an OFR system?
2. How do the quantity and quality of data affect the OFR system?
3. How does the OFR system detect and extract visual facial features?
4. How is the OFR system's performance evaluated?
The questions lead the study to dive deeper into theories and algorithms to determine
the usefulness and viability of using them in the research.
2.4 Research methods
By addressing a question such as "What is this study trying to do?", the author came up with specific research methods to find the answer. The answer to that question, and the purpose of this study, is to devise a new approach to solving the OFR problem. Therefore, the study conducts quantitative research by setting up controlled experiments and collecting data through different resources.
According to Claes Wohlin, Martin Höst, and Kennet Henningsson (2003, 2), "quantitative data promotes comparisons and statistical analysis". By utilizing the literature available from the different contributors to the computer vision community, such as Stanford University, Google AI research labs, the Massachusetts Institute of Technology, and numerous researchers' works, the author was able to collect essential data for the research and enhance the system's performance by comparing and analysing the findings. The study considered the collected materials noteworthy, particularly the work of Gregory Koch (2015) in his thesis "Siamese Neural Networks for One-Shot Image Recognition", where Koch explored a method to classify the similarity between two figures, and the dissertation of Ali Sharif Razavian (2017), "Convolutional Network Representation for Visual Recognition", which described the representation of the convolutional network in visual recognition from an empirical perspective.
The study proceeds with testable background information to use as a basis for implementing the OFR system in real time and addressing the challenges. The research focuses on several different insights that combine to respond to the questions, clarify the working methods and mechanisms used to develop the algorithms, and present the study's recommendations. The study uses experiments, which require several variables to conduct and to review the effects. These variables, such as data, hypotheses, reviews, device performance, and environmental circumstances, are more critical in the research than the experiment itself.
Two quantitative research methods have been applied to conduct the experiments and analyse the findings. According to Claes Wohlin, Martin Höst, and Kennet Henningsson (2003, 9), the empirical research method can be mapped to the following steps: Definition, Planning, Operation, Analysis, and Conclusion. The objective of empirical research is to manipulate one or more variables while controlling all other variables at a fixed level, and the effect of the manipulation can be measured based on statistical results. Since we compare different types of methods in each stage and analyse the outcomes to calculate the ratios, the empirical research method was the proper pattern for this quantitative research. The data used for this study were collected from different resources (books, articles, reviews, and the accumulated experience of the author) and were obtained by using specific keywords such as artificial intelligence, neural network, machine learning, deep learning, computer vision, object detection and localisation, object recognition and comparison, algorithm reviews, and, finally, datasets.
The empirical research method is used to obtain the evidence and observe the scientific data from the experiments; these findings are then reviewed using the secondary data analysis research method. The empirical process begins by responding to the research concerns that need to be investigated in order to determine the direction of the research, highlighting the fundamental goals of the systematic investigation.
The second step is to reanalyse the previously collected data and compare them with the findings. Therefore, the author uses secondary data analysis, a widely used data collection technique in science research. According to Melissa P. Johnston (2014, 8), "The major advantages associated with secondary analysis are the cost-effectiveness and convenience it provides". Secondary analysis is the best technique for gathering data for several purposes, such as offering validation opportunities for replications, which is substantial in this study, as study findings are credible if they occur in a variety of other studies. The author needs to consider how the data are categorized and organized, how this might affect the results, and where necessary to adjust the data in order to conduct the study in the right way. The chosen methods bring some considerations for findings and evaluations, to implement new experiments and to determine the research process.
3 Background
This chapter discusses the theoretical framework used in the research, starting with the Deep Learning approach, which mimics certain aspects of human brain function in the recognition of objects, images, voices, or patterns, whether supervised or unsupervised. Let us start with neural networks.
3.1 Neural networks
In order to understand the functionality of the Neural network, it is essential to start
describing the single neuron, which is a mathematical process modelled as a
representation of biological neurons. The mathematical mechanism that occurs within
the artificial neuron is a necessary process for all machine learning algorithms.
According to Dilip Singh Sisodia, Ram Bilas Pachori, and Lalit Garg, (2020, 123) “A
neural network is a series of algorithms that endeavors to recognize underlying
relationships in a set of data through a process that mimics the way the human brain
operates”. Neural networks are a simplified representation of the architecture of
machine learning, representing a small network containing two layers (the hidden
layer and the output layer). Data flow from the input or previous layers, presented
as a one-dimensional column vector (tensor), which is processed inside the artificial
neuron using a mathematical equation, such as the logistic regression equation. Once
the parameters (weights as a two-dimensional tensor, bias as a one-dimensional
tensor) are added to the equation, the output of the
equation should be within a specified range, so an activation function is applied to the
equation to bound the output's value. Figure 1 shows how a single neuron operates
on data received from the one-dimensional tensor and how the forward propagation
phase is implemented. The architecture of the neural network has four primary phases,
which will be described in detail as the critical pillars of the development of deep
learning models, from the forward propagation step to backward propagation. These
phases are explained in detail by the author:
Figure 1. Simple Neural network.
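The single-neuron computation described above (a weighted sum of the inputs plus a bias, followed by an activation) can be sketched in NumPy as follows. This is an illustrative sketch, not code from the thesis; the sigmoid activation, input values, weights, and bias are all chosen only for demonstration:

```python
import numpy as np

def sigmoid(z):
    # squash any real value into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))

def neuron_forward(x, w, b):
    """Forward pass of a single artificial neuron:
    weighted sum of the inputs plus bias, then an activation."""
    z = np.dot(w, x) + b   # linear combination (logistic regression form)
    return sigmoid(z)      # bounded output

# Illustrative 3-input neuron
x = np.array([0.5, -1.0, 2.0])   # input as a one-dimensional tensor
w = np.array([0.2, 0.4, -0.1])   # weights
b = 0.1                          # bias
a = neuron_forward(x, w, b)      # output lies strictly between 0 and 1
```

The same computation repeated across many neurons, layer after layer, is exactly the forward propagation phase described in the next section.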
3.1.1 Deep FeedForward Networks
The term feedforward refers to the mathematical calculation of the intermediate
variables (weights and biases) applied to the input data, including the outputs of
the previous layer. As a result of this process, the coefficients' data are stored in
the neuron to be used again when the network processes the backward propagation.
Describing the architecture of feedforward networks (Figure 2) requires defining the
depth, width, and activation functions of each hidden layer. The depth is the number of
hidden layers, the width is the number of neurons in each hidden layer, and the
activation function is the function within each neuron. According to Ian Goodfellow,
Yoshua Bengio, and Aaron Courville (2016, 165), the name "deep learning" emerged
from the chain of functions used to build the feedforward network.
Figure 2. Neural network Structure
3.1.2 Activation Function
The activation function used in feedforward networks bounds the outcome's value
and determines the output behaviour of each node. Activation functions are
essential components of an artificial neural network, allowing it to learn complex,
non-linear mappings between inputs and response coefficients. The primary objective
of the activation function within each node is to process the output of the equation
and rescale the result into a specific range with respect to the input data.
Activation functions can be divided into two groups: linear and non-linear operations
(Figures 3 and 4).
In this section, the author lists some of the common activation functions used in deep
learning:
• Linear Activation Function
Linear activation functions multiply the input data by the intermediate coefficients
within each neuron and generate an output proportional to the input.
These functions have two significant disadvantages: they cannot be used for backward
propagation (the derivative is a constant), and all layers will collapse into one layer.
𝑓(𝑥) = 𝑥
Figure 3. Linear activation function.
(Source: https://towardsdatascience.com )
• Non-linear Activation Functions
Unlike linear activation functions, the purpose of these activation functions is to
generate an output within a limited range. The non-linear activation function ensures
that the neural network layers will not behave as a single layer; on the other hand,
backward propagation will run normally. In this section, the author describes three
non-linear activation functions:
1. Sigmoid or logistic activation function - the sigmoid function maps the
input value to a new value between 0 and 1.
𝑓(𝑥) = 1 / (1 + 𝑒⁻ˣ)
2. Tanh or hyperbolic tangent activation function - the hyperbolic tangent
function rescales the sigmoid function's output range to between -1 and 1.
𝑓(𝑥) = (𝑒ˣ − 𝑒⁻ˣ) / (𝑒ˣ + 𝑒⁻ˣ)
3. ReLU (Rectified Linear Unit) activation function - ReLU is not a linear function;
it often achieves results comparable to the sigmoid with superior computational
performance.
𝑓(𝑥) = max (0, 𝑥)
Figure 4. Graph of Sigmoid, Tanh, and Relu functions (non-linear activation function).
(Source: https://www.researchgate.net )
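The three non-linear functions above translate directly into NumPy. A minimal sketch, with an illustrative input vector:

```python
import numpy as np

def sigmoid(z):
    # maps any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # rescales the curve into (-1, 1)
    return np.tanh(z)

def relu(z):
    # passes positive values through, clamps negatives to 0
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])   # illustrative pre-activation values
s, t, r = sigmoid(z), tanh(z), relu(z)
```

Evaluating the three functions on the same inputs makes the difference in output ranges visible: the sigmoid stays in (0, 1), tanh is centred on 0, and ReLU simply zeroes the negative values.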
3.1.3 Cost Functions
The purpose of the cost function is to measure the neural network's error ratio. The
general formula for the cost function represents the average of the loss, which
measures the difference between the real value and the predicted value.
Loss (error) function, for a single training example:
𝐿𝑜𝑠𝑠(ŷ, 𝑦) = −(𝑦 log ŷ + (1 − 𝑦) log(1 − ŷ))
Cost function, for the entire training set:
𝐽(𝑤, 𝑏) = (1/𝑚) ∑ᵢ₌₁ᵐ 𝐿𝑜𝑠𝑠(ŷ⁽ⁱ⁾, 𝑦⁽ⁱ⁾)
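The two formulas above (cross-entropy loss per example, then the average over the training set) can be sketched in NumPy. The predictions and labels below are illustrative:

```python
import numpy as np

def loss(y_hat, y):
    """Cross-entropy loss for each training example."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(y_hat, y):
    """Average of the per-example losses over the m training examples."""
    m = y.shape[0]
    return np.sum(loss(y_hat, y)) / m

# Illustrative labels and predictions for m = 3 examples
y     = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.1, 0.8])
J = cost(y_hat, y)   # a small positive number, since predictions are close
```

Note that the loss is zero only when the prediction matches the label exactly, and grows without bound as the prediction approaches the wrong extreme.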
3.1.4 Optimization (Gradient Descent)
Optimisation is the first step in backward propagation, which tries to find the best
value from some set of obtainable values. The gradient descent algorithm is one of
the most widely used optimisation algorithms for updating the model's parameters
(weights and biases). The algorithm starts at an initial point and takes a step in the
direction of the negative slope (Figure 5); after many iterations of gradient descent,
it may end up converging to an optimum. Backpropagation in neural networks uses a
gradient descent algorithm, and gradient descent is also applied massively in linear
regression and classification algorithms. It defines how the parameters should be
improved and updates them so that the loss is reduced towards a minimum. The cons
of gradient descent are that it requires a large amount of memory to calculate the
gradient for the entire dataset, and the weights are updated only after each full
gradient measurement, which can take a very long time if applied to a large dataset.
On the other hand, the pros of this optimisation function are that it is easy to
implement, understand, and compute. It is a basic component used in different models
but is not sufficient on its own for massive projects.
Figure 5. Five iterations of Gradient Descent
To understand the optimiser function, the author expresses gradient descent
mathematically by setting out these equations.
Update parameters (w, b):
𝑤 = 𝑤 − 𝛼 𝜕𝐽(𝑤, 𝑏)/𝜕𝑤
𝑏 = 𝑏 − 𝛼 𝜕𝐽(𝑤, 𝑏)/𝜕𝑏
where 𝛼 is the learning rate and 𝜕 denotes the partial derivative.
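The update rule above can be sketched as a short NumPy loop. The toy objective J(w) = w², whose gradient is 2w, is chosen only for illustration; it lets the iteration be watched converging towards the minimum at w = 0:

```python
def gradient_descent_step(w, b, dw, db, alpha):
    """One parameter update: step against the gradient,
    scaled by the learning rate alpha."""
    w = w - alpha * dw
    b = b - alpha * db
    return w, b

# Toy example: minimise J(w) = w**2, for which dJ/dw = 2*w
w, b, alpha = 4.0, 0.0, 0.1
for _ in range(100):
    w, b = gradient_descent_step(w, b, dw=2 * w, db=0.0, alpha=alpha)
```

Each step shrinks w by the factor (1 − 2·alpha), so after 100 iterations w is vanishingly close to the optimum.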
3.1.5 Backward Propagation
According to David E. Rumelhart and Yves Chauvin (1995, 1), "backpropagation
terminology derives from the Rosenblatt attempt (1962) to generalize the perceptron
learning algorithm to the multilayer case". The researchers' objective was to replace
hand-engineered features with trainable multilayer networks; according to David E. R.
et al. (1995, 2), the solution was not widely understood until 1986, when Rumelhart
published a paper explaining in more detail how this algorithm works in real-world
applications. The idea behind backward propagation is to minimise the cost function's
error ratio and modify the coefficients iteratively by implementing the optimisation
algorithm of the cost function (Figure 6). As it turns out, multilayer architectures can
be trained by straightforward stochastic gradient descent or, if preferred, a
hill-climbing optimiser. Backward propagation includes two steps:
1- Compute the partial derivatives ( 𝜕𝑍ˡ, 𝜕𝑊ˡ, 𝜕𝑏ˡ ).
2- Update the weight matrix and bias (W, b).
Figure 6. Backward Propagation
Backward propagation calculates the derivatives needed to update the parameters
(weights and biases) by using these equations.
First, calculate the derivative (𝜕Z) for the output layer (Z):
𝜕𝑍ˡ = 𝑎ˡ − 𝑦
Second, calculate the previous layer's derivatives and update the weight parameters
by multiplying the derivative of the final layer with the activations of the previous
layer; the biases take the derivative value of the final layer directly:
𝜕𝑊ˡ = 𝜕𝑍ˡ · 𝑎[ˡ⁻¹]𝑇
𝜕𝑏ˡ = 𝜕𝑍ˡ
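Assuming a sigmoid output unit with the cross-entropy cost (for which 𝜕Z = a − y holds), a one-layer sketch of these update equations in NumPy could look like this; the input, weights, label, and learning rate are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass for one training example through one layer
x = np.array([[0.5], [1.5]])   # previous-layer activations a^[l-1], shape (2, 1)
W = np.array([[0.1, -0.2]])    # weights, shape (1, 2)
b = np.array([[0.05]])         # bias
y = np.array([[1.0]])          # target label

z = W @ x + b                  # linear step
a = sigmoid(z)                 # output-layer activation

# Backward pass, following the equations above
dZ = a - y                     # derivative at the output layer
dW = dZ @ x.T                  # gradient w.r.t. the weights
db = dZ                        # gradient w.r.t. the bias

# One gradient-descent update of the parameters
alpha = 0.5
W = W - alpha * dW
b = b - alpha * db
```

Since the label is 1 and the activation lies below it, dZ is negative and the update pushes the weights towards a larger output.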
3.2 Convolutional Neural Networks (CNN)
According to Ian Goodfellow, Yoshua Bengio, and Aaron Courville (2016, 326),
“Convolutional networks are simply neural networks that use convolution in place of
general matrix multiplication in at least one of their layers.” Convolutional neural
networks are designed to process incoming data in the shape of multiple arrays; this
approach has produced innovative results over the past few years in image processing,
voice recognition, and other pattern recognition tasks. The main feature of a CNN is
the reduction of the number of parameters compared with a plain ANN (artificial
neural network).
CNN is one of the categories of deep learning networks, mostly used to analyse images
and video frames. CNN builds on the assumption that the input will always be images,
which leads the architecture to be set up in a way that best suits this specific type of
data. The input data for a CNN layer form a four-dimensional tensor (x, h, w, d). The x
dimension denotes the number of samples, h and w are the image's height and width,
and the last dimension is the image's depth, which represents the pixel colour
channels of the input image (red, green, blue). For instance, say we use an image with
dimensions (x, 32, 32, 3). The first hidden layer will be (x, 28, 28, 6), where the depth
dimension in the first hidden layer represents the
filters (feature maps), and the output layer will be (x, 1, 1, n), where n represents the
possible number of classes. Convolutional neural networks are shaped from three
types of layers (convolutional layers, pooling layers, and fully connected layers); these
three stacked together form the CNN architecture. Figure 7 simplifies the CNN
architecture.
Figure 7. The structure of a CNN, consisting of convolutional, pooling, and fully-
connected layers.
(Source: https://www.mdpi.com )
A basic CNN implementation will be:
1- The input layer does not make any change to the pixel values of the image.
2- The convolutional layer's parameters consist of a group of learnable filters. Every
filter has a width and height, and the depth of the input volume. For example, say the
first filter in a ConvNet has size 5x5x3; this filter slides over the input volume (input
layer), and the output of the computation is a two-dimensional activation map that
presents the filter's response at every position. The network enhances the filters that
activate when determining a type of visual attribute, such as an edge orientation, a
significant colour, or some other pattern in the layers of the network. In each ConvNet
layer, a set of filters is stacked together to produce the output volume.
3- The pooling layer simplifies the ConvNet filters to progressively reduce the spatial
size of the representation and reduce the number of parameters in the network, and
hence to control overfitting. Common approaches used in pooling are max pooling
and average pooling.
Figure 8. Max Pooling and Average Pooling.
( Source: https://www.researchgate.net )
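The two pooling operations in Figure 8 can be sketched minimally in NumPy, assuming non-overlapping 2x2 windows (stride equal to the window size); the function name and example feature map are illustrative:

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Downsample a 2-D feature map with non-overlapping
    size x size windows (stride == size)."""
    h, w = x.shape
    out = x[:h - h % size, :w - w % size]                # trim ragged edges
    out = out.reshape(h // size, size, w // size, size)  # split into windows
    if mode == "max":
        return out.max(axis=(1, 3))    # max pooling
    return out.mean(axis=(1, 3))       # average ("mean") pooling

# Illustrative 4x4 feature map
fmap = np.array([[1., 3., 2., 0.],
                 [5., 6., 1., 2.],
                 [7., 2., 4., 8.],
                 [0., 1., 3., 5.]])
```

Both modes halve each spatial dimension here: max pooling keeps the strongest response in each window, while average pooling keeps the mean.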
4- The fully connected layer is essential to check the incoming data from the previous
layer and compare it with the stored labelled data inside it; it divides and analyses the
image into classes.
Figure 9. Fully-connected layer
Convolutional layers are the primary constituent elements of convolutional neural
networks. A convolution is the simple application of a filter. Iterative
application of the same filter to an input layer produces a feature map, denoting the
place and strength of detected features in an image. A CNN is capable of updating
many filters in parallel, and the outcome is a set of features detected anywhere in
the image.
The ConvL filters shift to the right with a specific stride value until they parse the
entire width. Then they jump down to the beginning left of the image by the same
stride value and iterate this operation over the entire image.
ConvNet layers reduce the complexity of the model by optimising the
hyperparameters: depth, stride, and zero-padding.
For example, say the input image (source layer) has size 8x8x3 and the filter size is
3x3; then each node in the ConvL will have weights to a 3x3x3 region of the input
volume, so the total number of weights will be 27, plus 1 for the bias factor, and the
destination layer (feature map) will have size 6x6xn.
Figure 10. Convolution layer operation.
( Source: https://www.researchgate.net )
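The worked example above (an 8x8x3 source layer and one 3x3x3 filter giving a 6x6 feature map) can be sketched naively in NumPy. The function names and random inputs are illustrative, not a library API:

```python
import numpy as np

def conv_output_size(n, f, stride=1, pad=0):
    """Spatial output size of a convolution:
    floor((n + 2*pad - f) / stride) + 1."""
    return (n + 2 * pad - f) // stride + 1

def conv2d_single(x, kernel, bias=0.0):
    """Naive valid convolution of one 3-D input volume (h, w, d)
    with one filter of shape (f, f, d), stride 1, no padding."""
    h, w, d = x.shape
    f = kernel.shape[0]
    out = conv_output_size(h, f)
    fmap = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            # element-wise multiply the receptive field with the kernel
            fmap[i, j] = np.sum(x[i:i + f, j:j + f, :] * kernel) + bias
    return fmap

x = np.random.rand(8, 8, 3)   # source layer, as in the example
k = np.random.rand(3, 3, 3)   # one learnable filter: 27 weights (+ 1 bias)
fmap = conv2d_single(x, k)    # 6x6 feature map
```

Stacking n such filters would give the 6x6xn destination layer described in the text; the same size formula also reproduces the earlier 32 → 28 example with a 5x5 filter.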
Over the last two decades, CNNs have been applied with tremendous success to
detect, segment, and recognise objects in images. These implementations were
feasible to attain because labelled data were plentiful and accurate. These data have
been used to detect individuals, faces, and moving objects. Jonathan Tompson, Ross
Goroshin, Arjun Jain, Yann LeCun, and Christoph Bregler (2014, 2) applied a novel
ConvNet architecture to human body pose detection; they showed the precision lost
due to pooling in ConvNet architectures. Convolutional networks have been
successfully applied to a multitude of other problems, such as recommender systems
as indicated in (Ayush Singhal et al., 2017, 20), natural language understanding in
(Ronan Collobert et al., 2011, 2503), and speech recognition as described in (Tara N.
Sainath et al., 2013, 4).
4 Face Recognition Algorithms
For the last decade, face recognition has been one of the most researched topics in
the fields of computer vision and biometrics. Traditional approaches, based on hand-crafted
features and conventional machine learning techniques such as feature-based
and geometry-based methods, were the first steps towards the deep neural network
techniques that are considered the backbone of all computer vision applications. In
this section, the author explains five conventional machine learning algorithms that
have been widely used for face recognition.
W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld (2003, 400) state that the face
recognition problem can be formulated as follows: "given still or video images of a
scene, identify or verify one or more persons in the scene using a stored database of
faces". From this perspective, we understand that the challenges of developing an
accurate model depend on occlusions, poses, illumination, ageing, conditions, and
facial expressions. Over the last decade, researchers have concentrated on methods
that use image processing technology to describe the geometry of the face in order
to match the exact faces from an image.
In this chapter, the author briefly explains some algorithms used to detect and
recognise faces in an image, and also goes through some advantages and
disadvantages of each one. In general, all algorithms are affected by factors that
debase recognition accuracy, such as low resolution, illumination, and expressions.
Face recognition from still and moving faces is a tremendous task; many machine
learning experts have already accomplished very high accuracy on frontal face
images, but with the factors mentioned previously, the accuracy will not be reliable.
4.1 AAM - Active Appearance Models
According to Timothy F. Cootes, Gareth J. Edwards, and Christopher J. Taylor (2001,
681), the Active Appearance Model has an integrated statistical model that combines
shape variations with appearance variations; the AAM also contains a monochrome
representation of the region of interest. Cootes et al. do not attempt to solve a
comprehensive optimisation every time they want to fit a new image to the model;
instead, they exploit the fact that the optimisation problem is always identical, and
hence they perform the optimisation step offline.
The AAM structure depends on a Gaussian image pyramid, which achieves a rapid and
robust multiresolution approach for identification. The difficulty of this approach is to
understand and learn the differences between real variation in the data and changes
due only to systematic and unsystematic distortion. The AAM method is one of the
oldest statistical matching methods; it achieved 88% accuracy when trained on 100
hand-labelled face images, with an equal number of images for the testing set,
including facial expressions.
The Active Appearance Model algorithm has two procedures: modelling and fitting.
In the modelling step, the AAM separates an object into two parts: the first is the
shape, which is a vector established by connecting the facial landmarks; the second
part is the texture, which is the measure of pixels represented by the density of
colours. Once the model is formed, it is necessary to fit the model to different images,
which is vital to finding the most realistic parameters of the face.
Figure 11. Shape and labelled image.
(Source: http://pages.cs.wisc.edu )
Xinbo Gao, Ya Su, Xuelong Li, and Dacheng Tao (2010, 147-151) state that the AAM is
commonly used in the modelling of deformable objects, as it has an effective
representation and reliable fitting capability. Recent improvements address the
difficulties and extend its ability in three aspects:
• Efficiency - to increase the efficiency of the AAM, various enhancements have
been proposed to make the algorithm capable of fitting an image successfully; these
enhancements consider the reduction of the computational cost by optimising the
algorithm, the texture representation, and the model training.
• Discrimination - to improve the model's discrimination, many improvements
have been made to increase the accuracy, such as shape priors, texture
representation, nonlinear modelling, etc.
• Robustness - many significant improvements have been made to the AAM to
improve its robustness to changing circumstances, such as condition changes, pose
variations, missing features, and low resolution.
4.2 HMM - Hidden Markov Models
The hidden Markov model (HMM) is a mathematical approach used to model
sequential data. It is named after the mathematician Andrey Andreyevich Markov,
who developed much of the relevant statistical theory, and it was first applied to
speech recognition systems in the 1980s by L. R. Rabiner and B. H. Juang (1986). An
HMM is one solution for mathematically modelling a sequence of observable signals;
according to Rabiner, L. (1989, 257), "the signals can be discrete in nature (e.g.,
characters from a finite alphabet) or continuous in nature (e.g., temperature
measurements)". HMMs have been used efficiently with one-dimensional data and
have accomplished important outcomes in activity recognition and voice recognition.
They have also been used for face detection and recognition. In Claudia Iancu's view
(2011, 4), "HMM techniques remain mathematically complex
even in the one-dimensional form. The extension of HMM to two-dimensional model
structures is exponentially more complex", but researchers and scientists have used
this model to develop facial recognition models despite the mathematical complexity,
as can be seen in the experiment of F. H. Alhadi, W. Fakhr, and A. Farag (2005, 2).
Implementing an HMM for face recognition has two significant disadvantages:
1- Pixel values are not considered a useful feature in face recognition methodology
due to image conditions (illumination, shift, noise, rotation).
2- The vast vector dimension engenders a high computational cost for the detection
system.
Ferdinando Samaria and Frank Fallside (2007, 2-3) initialised two sets of HMMs,
trained for each identity in the dataset. Two models were used for training and
testing:
1- Ergodic models:
In this model, the authors trained an HMM on 10 training images per identity, of size
256*256 with 8-bit grey levels. Each image was divided into 64*64 windows; a window
slides over the given image by 58 pixels from the left to the right of the image, then
moves down by 48 pixels and starts moving again, but from right to left (Figure 12).
Figure 12. Sample of training data for ergodic HMM.
(Source: https://www.researchgate.net)
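The window extraction described above can be sketched as follows; this is an illustrative reconstruction, not the authors' original code, and the right-to-left zig-zag scan is omitted since it changes only the order in which the windows are visited, not the windows themselves:

```python
import numpy as np

def extract_windows(img, win=64, x_step=58, y_step=48):
    """Extract win x win observation windows from a grey-level image,
    stepping x_step pixels horizontally and y_step pixels vertically."""
    h, w = img.shape
    windows = []
    for top in range(0, h - win + 1, y_step):
        for left in range(0, w - win + 1, x_step):
            windows.append(img[top:top + win, left:left + win])
    return windows

# Stand-in for one 256*256, 8-bit grey-level face image
img = np.zeros((256, 256), dtype=np.uint8)
wins = extract_windows(img)
```

With these step sizes, adjacent windows overlap, which is what allows the HMM to model the transitions between neighbouring face regions.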
2- Top-to-bottom models:
For each identity in the dataset, five variant images of size 184*224 with 8-bit grey
levels were used to train each HMM; these images were analysed into 16 lines of
blocks spatially ordered in a top-to-bottom direction (Figure 13).
Figure 13. Sample of training data for top-to-bottom HMM.
(Source: https://www.researchgate.net )
4.3 PCA - Principal component analysis
Principal component analysis, or PCA, is a statistical approach for decreasing the
number of parameters in face recognition. According to Kim Esbensen and Paul Geladi
(1987, 37-38), PCA was developed to model data that are distinguished by significant
interrelationships between the parameters concerned. According to C. Li, Y. Diao, H.
Ma, and Y. Li (2008, 376), "PCA is a classical feature extraction and data representation
technique widely used in the areas of pattern recognition and computer vision such as
face recognition". In PCA, every image in the training dataset is represented as a linear
combination of weighted eigenvectors called eigenfaces; the author explains
eigenfaces in the next chapter. As a multivariate data analysis technique, PCA is used
for exploratory data analysis, anomaly detection, classification, and dimensionality
reduction for regression.
According to Sasan Karamizadeh, Shahidan M. Abdullah, Azizah A. Manaf, Mazdak
Zamani, and Alireza Hooman (2013, 173-174), in their paper "An Overview of
Principal Component Analysis", the mathematical processes for this
algorithm can be described by presenting a set of M images (B1, B2, ..., BM) of size
N * N; the training set image average (𝜇) will be:
𝜇 = (1/𝑀) ∑ₙ₌₁ᴹ 𝐵ₙ
Subtracting the average image from each training image gives a new vector (Wᵢ) for
each training image, described by:
𝑊𝑖 = 𝐵𝑖 − 𝜇
The authors calculate the covariance matrix by:
𝐶 = ∑ₙ₌₁ᴹ 𝑊ₙ𝑊ₙᵀ = 𝐴𝐴ᵀ
where A = [𝑊1, 𝑊2, 𝑊3, ..., 𝑊𝑀]
Then the eigenvectors 𝑈𝐿 and eigenvalues 𝜆𝐿 of the covariance matrix are
calculated. The last equation measures the vector of weights used for image
classification:
Ω𝑇 = [𝑤1, 𝑤2, ..., 𝑤𝑀]
whereby
𝑤𝑘 = 𝑈𝑘𝑇(𝐵 − 𝜇), k = 1, 2, ..., M
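The steps above can be sketched in NumPy on a toy training set. The M x M decomposition shortcut used here is the standard eigenface trick (decomposing AᵀA instead of the large AAᵀ); all sizes, the seed, and the variable names are illustrative:

```python
import numpy as np

# Toy training set: M flattened images of N*N pixels each
rng = np.random.default_rng(0)
M, N = 6, 4
B = rng.random((M, N * N))            # rows are the images B_1 .. B_M

mu = B.mean(axis=0)                   # training-set average image (mu)
W = B - mu                            # centred images  W_i = B_i - mu

# Covariance C = A A^T with A = [W_1 ... W_M]; decompose the small
# M x M matrix A^T A instead (the classic eigenface shortcut)
A = W.T
vals, vecs = np.linalg.eigh(A.T @ A)
idx = np.argsort(vals)[::-1][:M - 1]  # keep the M-1 informative components
U = A @ vecs[:, idx]                  # eigenvectors of C: the eigenfaces
U /= np.linalg.norm(U, axis=0)        # normalise each eigenface

# Weight vector Omega for one image: projections onto the eigenfaces
omega = U.T @ (B[0] - mu)
```

In a real eigenface system, recognition then compares the weight vector of a probe image against the stored weight vectors of the training identities.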
Like other algorithms, PCA has disadvantages: the mathematical calculations are
complex, and round-off errors tend to accumulate at each step of the algorithm, which
makes it a complex task to evaluate the scatter matrix (covariance matrix) accurately.
Moreover, PCA cannot capture even the simplest invariance unless this information is
provided in the training data.
4.4 LDA - Linear Discriminant Analysis
LDA is a technique used to reduce the dimensions of a dataset while keeping as much
information as possible. It generates an ideal linear discriminant function that maps
the input data into the classification space; the input data are handled via scatter
matrix analysis, and the matching used in this approach can be a simple Euclidean
distance. According to N. Mohanty, A. Lee-St. John, R. Manmatha, and T. M. Rath
(2013, 253), "LDA provides class separability by drawing a decision region between the
different classes. LDA tries to maximize the ratio of the between-class variance and
the within-class variance". In this situation, LDA creates a linear combination of the
features which produces the largest average variation between the classes. The LDA
approach has been successfully used in several applications, such as image
identification, pattern recognition, data classification, and bioinformatics.
According to Alok Sharma and Kuldip K. Paliwal (2013, 1), the orientation Z in the LDA
technique transforms the higher-dimensional feature vectors of the different classes
to a lower-dimensional feature space, in which the lower-dimensional feature vectors
separate the classes. Hence, for the reduction of a d-dimensional space (Rd) to an
h-dimensional space (Rh), where d > h, the size of the Z orientation will be h*d. For
more simplicity, LDA creates a diagonal axis and projects the information from both
features onto it to reduce the variance and increase the distance between the two
classes, as shown below.
Figure 14. LDA influence on the data to separate the classes, considering each colour
is a variable.
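A two-class Fisher discriminant sketch in NumPy illustrates this projection onto a single discriminant axis; the synthetic data, seed, and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
# Two classes of 2-D feature vectors with well-separated means
X0 = rng.normal([0.0, 0.0], 0.5, size=(50, 2))
X1 = rng.normal([2.0, 2.0], 0.5, size=(50, 2))

m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
# Within-class scatter matrix S_w
Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
# Fisher direction: maximises between-class over within-class variance
w = np.linalg.solve(Sw, m1 - m0)
w /= np.linalg.norm(w)

# Project both classes onto the 1-D discriminant axis
p0, p1 = X0 @ w, X1 @ w
```

On the projected axis the two classes form two well-separated clusters of scalars, which is exactly the behaviour Figure 14 depicts.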
4.5 EBGM - Elastic Bunch Graph Matching
The EBGM algorithm is established on the basis that real faces in images have a variety
of nonlinear features, such as pose and illumination, that are not captured by linear
analysis methods like LDA. The algorithm identifies faces by localising a set of
landmark features and then calculating the similarity between these features. The
information extracted for the faces in an image can be represented by a bunch graph;
the bunch graph provides a database for each landmark that can be used to locate the
features of new faces in an image. "A person is recognised correctly if the correct
model yields the highest graph similarity, i.e., if it is of rank one. A confidence criterion
on how reliably a person is recognised can easily be derived from the statistics of the
ranking" (David S. Bolme 2003, 3).
The algorithm proved effective in the detection of faces in the FERET (Face
Recognition Technology) evaluation, because the algorithm perceives faces by
comparing their parts instead of performing extensive whole-image matching. The
nodes in EBGM are labelled with a set of Gabor wavelet parameters called a jet, which
is used for matching and recognition. According to Laurenz Wiskott, Jean-Marc
Fellous, Norbert Kruger, and Christoph von der Malsburg (1999, 2), "The
representation of local features is based on the Gabor wavelet transform. Gabor
wavelets are biologically motivated convolution kernels in the shape of plane waves
restricted by a Gaussian envelope function". Figure 15 clarifies that when the EBGM
algorithm is applied in a face recognition system, a set of processes defines the graph
representation of a face (Gabor wavelet transform, convolution with wavelet kernels).
Figure 15. The two major keys to representing a face in EBGM are the Gabor wavelet
transform and convolution with a set of wavelet kernels.
(Source: https://www.researchgate.net )
5 Standard Benchmarks
In this chapter, the author analyses and explains six dataset benchmarks used to
train and test face recognition algorithms and to evaluate their performance.
Systematic benchmark studies are the most accurate, and they are considered the
means by which algorithms demonstrate their efficiency and robustness. A reliable
benchmark should be authorised as open source, its data should be explicitly
labelled, and it should have been applied to at least one machine learning algorithm.
The datasets used in this chapter were extracted from www.face-rec.org/databases/;
the first dataset to be broached is FERET, which was mentioned in section 4.5.
5.1 FERET Database
The FERET database was established in 1993 under a collaborative effort of Dr.
Wechsler, H., and Dr. Phillips, J., assuming that a database should serve both
development and testing by providing the algorithms with sequestered images.
According to P. Jonathon Phillips, Hyeonjoon Moon, Patrick Rauss, and Syed A. Rizvi
(1998, 137), the dataset includes 14,126 images of 1,199 individuals and serves as a
standard dataset of face images for researchers to develop facial recognition
algorithms and evaluate the outcomes. A colour, high-resolution (512*768 pixels)
FERET dataset was released in 2003. The FERET dataset is split into two parts: the first
is the development set served to researchers, and the second consists of isolated
images for the testing set. To collect the images, the founders of the FERET database
used "a 35-mm Kodak camera and processed them into CD-ROM via Kodak's
multiresolution technique for digitising and storing digital imagery. The colour images
were retrieved from the CD-ROM and converted into 8-bit grayscale images" (ibid.,
138). In order to preserve a level of consistency across the database, the images were
captured in a semi-controlled environment, with the same physical configuration used
for each photographic session. For each session the equipment had to be
reassembled; therefore, there was a small divergence between images captured on
various days.
Figure 16. Example of different categories of photos for one individual.
(Source: https://www.researchgate.net )
5.2 SCfaceDB Landmarks
Released in 2011, the SCface surveillance cameras face database (SCfaceDB) was
designed primarily to evaluate the robustness of face recognition algorithms in real-world
monitoring, but it has also been used to assess other recognition algorithms,
such as face recognition algorithms for head poses and illumination normalisation
algorithms. According to Mislav Grgic, Kresimir Delac, and Sonja Grgic (2009, 863),
"Images from different quality cameras should mimic real-world conditions and
enable robust face recognition algorithms testing, emphasizing different law
enforcement and surveillance use case scenarios.".
The authors used six surveillance cameras, a professional digital video surveillance
recorder, a professional high-quality photo camera, and a computer to capture 4,160
static images of different quality for 130 individuals: 114 males and 16 females. The
participants in this work were students, employees, and professors from the
University of Zagreb, Croatia, with ages ranging from 20 to 75. SCfaceDB was
considered one of the unique databases for face recognition in 2009, since the
authors distributed with the database a text file containing the birthday of each
participant, a feature not available in many face databases. It also holds additional
information about gender, glasses, and facial hair (beard, moustache). (ibid., 870.)
Figure 17. Example of different pose images.
(Source: https://www.researchgate.net)
5.3 Specs on Faces (SoF) Dataset
The SoF dataset was collected from April 2015 to October 2016; the images were
captured in different countries over a long period to test face detection, recognition,
and classification algorithms. The SoF has been assembled from 112 individuals (66
males and 46 females) who wear glasses, in various lighting conditions, and comprises
42,592 images (640 * 480 pixels). The dataset is dedicated to solving gender
classification problems in cases of face occlusion and highly varying illumination,
across various ages. The authors of the dataset used many occlusion techniques to
conceal features of the faces, but the primary occlusion was glasses.
According to Mahmoud Afifi and Abdelrahman Abdelhamed (2017, 15), "The SoF
dataset involves handcrafted metadata that contains subject ID, view (frontal/near-frontal)
label, 17 facial feature points, face and glasses rectangle, gender and age
labels, illumination quality, and facial emotion for each subject". The authors applied
three filters to the original images to generate more challenging artificial images that
may circumvent face detection systems.
Figure 18. Samples of the Specs on Faces (SoF) dataset.
(Source: https://arxiv.org/pdf)
5.4 Large Age-Gap Database (LAG)
The Large Age-Gap (LAG) database was presented in "Large age-gap face verification
by feature injection in deep networks" by Simone Bianco (2016, 1). Bianco introduces
a face verification method that works across significant age gaps. He also assembled
a dataset containing age variations in the wild, collecting face images ranging from
childhood to old age and including pictures of celebrities found via the Google image
search engine and YouTube by adding "adult" and "childhood" keywords to the search
query. Bianco checked his dataset and removed all noisy and duplicate images
manually; subsequently, he obtained 3,828 images of 1,010 celebrities. The LAG
dataset is highly relevant to applications used by law enforcement. It is intractable
even for a human to recognise faces across ageing; it is therefore a challenging task
for computer vision systems, because of the age-related biological transformations in
the presence of the other variations in appearance.
Figure 19. Examples of face crops for matching pairs.
(Source: https://arxiv.org )
The paper presents a novel method for face verification over the age gap by exploiting
a deep convolutional neural network (DCNN) trained in a Siamese architecture with
multiple loss functions. The method has been evaluated by comparison with different
techniques such as high-dimensional local binary features (HDLBP), the One-Shot
Similarity Kernel, Joint Bayesian, and Cross-Age Reference Coding (CARC).
5.5 Disguised Faces in the Wild (DFW)
Having a purpose similar to LAG, the DFW dataset was assembled on a large scale in
unconstrained scenarios to address the recognition of faces under the covariate of
disguise. The dataset contains a wide range of unrestricted disguised faces; the main
body of the dataset was collected from the internet and comprises 11,157 images of
1,000 individuals, primarily of Indian or Caucasian origin. According to Kushwaha,
Maneet Singh, Richa Singh, and Mayank Vatsa (2018, 1), "DFW is a first-of-a-kind
dataset containing images pertaining to both obfuscation and impersonation for
understanding the effect of disguise variations.". The dataset contains concealment
variations concerning hairstyles, facial hair (beard, moustache, goatee), make-up,
hats, veils, glasses, etc. These variations make facial recognition a challenging task,
even more so if we consider other differences that make it arduous to recognise
faces, such as illumination, head pose, ethnicity, age, gender, facial expression, and
camera quality. (ibid., 2)
Figure 20. Sample images of three subjects from the DFW dataset.
(Source: https://ieeexplore )
The authors of the DFW dataset partitioned the collected data into four types of
images: 1,000 Normal Face Images, 903 Validation Face Images, 4,814 Disguised Face
Images, and 4,440 Impersonator Face Images. Each of these four types has been
distributed between the training set and the testing set, as shown in Table 1.
Table 1: Images in the training and testing partition.

Number of              Training Set   Testing Set
Subjects                        400           600
Images                        3,386         7,771
Normal Images                   400           600
Validation Images               308           595
Disguised Images              1,756         3,058
Impersonator Images             922         3,518
Kushwaha V. et al. (2018, 3) state that the evaluation protocols for the face
recognition model in this approach are divided into three types:
• Protocol-1 (Impersonation) - evaluates the ability to distinguish identity
impersonators.
• Protocol-2 (Obfuscation) - evaluates the performance of the model on faces
concealed deliberately or inadvertently.
• Protocol-3 (Overall Performance) - evaluates the performance of the face
recognition algorithm on the entire dataset.
5.6 EURECOM Visible and Thermal paired Face database
The EURECOM benchmark was introduced in 2015 in the paper "A benchmark
database of visible and thermal paired face images across multiple variations",
released by Khawla Mallat and Jean-Luc Dugelay. The EURECOM dataset is composed
of 2,100 images of 50 individuals of different ages, ethnicities, and sexes. Each
participant took part in two photography sessions, 3 to 4 months apart, and each
session includes 21 face images per individual with different facial variations. The
variations in the photography environment include head pose, facial expression,
illumination, and occlusion. To capture these images, the authors used a camera (FLIR
Duo R by FLIR Systems) designed to photograph faces simultaneously in the thermal
and visible spectra, as illustrated in K. Mallat and J-L. Dugelay (2015, 1). The purpose
of this approach is to recognise faces through the thermal range and compare the
results with data collected for face images obtained in the visible spectrum.
Figure 21. Visible and thermal face images.
(Source: http://www.eurecom.fr )
The authors used the FisherFace approach to evaluate the database, followed by
1-Nearest Neighbour classification (Vittorio Castelli, 1). Unlike feature-based
algorithms, the FisherFace algorithm does not rely on facial feature detection, which
can be especially difficult for thermal images; instead, it is based on the PCA and LDA
techniques described in chapter two. The FisherFace method achieved high accuracy
in recognising visible and thermal face images compared to holistic face recognition
algorithms.
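Since FisherFace builds on PCA before applying LDA, the PCA projection step can be sketched in a few lines of NumPy. This is an illustrative sketch under the usual assumption that each face image is flattened into a row vector, not the EURECOM authors' implementation:

```python
import numpy as np

def pca_project(faces, k):
    """Project flattened face vectors onto the top-k principal components."""
    mean = faces.mean(axis=0)
    centered = faces - mean
    # SVD of the centered data matrix: rows of vt are the principal axes.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:k]
    # Coordinates of every face in the k-dimensional eigenspace.
    return centered @ components.T, components, mean
```

LDA would then be trained on these k-dimensional coordinates rather than on raw pixels, which is what makes the holistic approach tractable.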
6 Face Recognition pipeline
In this chapter, the author will cover the important steps to recognise an object in an
image: detection, localization, and alignment. These steps are compulsory in object
recognition algorithms to extract the relevant feature information from an input
image.
6.1 Detection and Localization
Face detection is the first stage of a face recognition system, since the system should
first locate the face and then recognise it. Faces are instances of objects: the system
seeks to locate them in an image and categorise them under a specific denomination
(people, buildings, numbers, etc.). This is performed by discriminating the patterns
formed by the objects from the other patterns and determining the dimensions of
each object.
Ali Sharifara, Mohd Shafry Mohd Rahim, and Yasaman Anisi (2014, 73) state that
"Face detection is one of the demanding issues in the image processing and it aims to
apply for all feasible appearance variations occurred by changing in illumination,
occlusions, facial feature".
According to Dr. P. Shanmugavadivu and Ashish Kumar (2016, 594), there are three
methods to resolve the problem of partially occluded faces, distributed between
part-based, feature-based, and fractal-based methods, which divide the detected face
into overlapping and non-overlapping parts and compute self-similarities between
the images, or consider facial features (nose, left eye, right eye, mouth, left ear, right
ear, and chin). The skin detection factor has been significantly influential in face
detection algorithms in decreasing the search area for feature detection and the
computational load: the variety of human skin colours is spanned by a pre-specified
scale, and each pixel within that domain is treated as a skin pixel. Moreover, locating
that domain itself is a difficult task, as the range of skin tones varies by ethnicity and
race. In this section, three algorithms used for object detection are addressed:
6.1.1 Viola-Jones
Paul Viola and Michael Jones (2001, 1), in their paper "Rapid Object Detection using a
Boosted Cascade of Simple Features", provided a new competitive object detection
framework whose primary purpose is to detect faces in real time. This framework has
a very low false-positive rate and a high true-positive rate, which makes the algorithm
rapid and robust. The author will briefly explain the three main features of this
framework. The first is called the integral image; it relies on the observation that all
human faces share the same features, for example that the nose bridge region is
brighter than the eye regions. Computing an integral image requires only a few
operations per pixel. The second describes the method of constructing a classifier
using the AdaBoost training algorithm, which helps find a small number of critical
visual features from a wide range of possible features. The third is the process of
combining cascading classifiers, which neglects background regions so that more
computation can be spent on face-like regions.
The object detection process in this algorithm classifies images according to the
values of simple features. The reason for using features instead of pixel values is that
features can encode domain knowledge that is difficult to learn from a small dataset.
Moreover, a feature-based system runs more rapidly and robustly than a pixel-based
system. Figure 22 shows four examples of rectangle features relative to the detection
window. Figures 22A and 22B have the same size and shape: to detect a specific
feature, the algorithm subtracts the sum of the pixels in the white rectangle from the
sum of the pixels in the grey rectangle; both rectangles may be oriented horizontally
or vertically. Figure 22C shows a three-rectangle feature, which subtracts the sum of
the pixels in the centre rectangle from the sum of the pixels in the side rectangles.
Figure 22D calculates the difference between diagonal pairs of rectangles. (ibid., 2.).
Figure 22. Rectangle features for object detection.
(Source: https://en.wikipedia.org )
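The integral image and the two-rectangle feature described above can be sketched with NumPy. This is a minimal illustrative implementation of the idea, not Viola and Jones' original code; the horizontal white/grey split is one arbitrary choice of feature orientation:

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of all pixels in img[:y+1, :x+1] (cumulative in both axes)."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, bottom, right):
    """Sum of pixels in an inclusive rectangle using at most four table lookups."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

def two_rect_feature(ii, top, left, h, w):
    """Haar-like feature: sum of the left (white) half minus the right (grey) half."""
    mid = left + w // 2
    white = rect_sum(ii, top, left, top + h - 1, mid - 1)
    grey = rect_sum(ii, top, mid, top + h - 1, left + w - 1)
    return white - grey
```

Because every rectangle sum costs a constant number of lookups, thousands of such features can be evaluated per window, which is what makes the cascade fast.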
6.1.2 You Only Look Once (YOLO)
According to Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi (2016, 1),
YOLO (You Only Look Once) takes one feedforward propagation across the network
to make predictions. Unlike region-based and parts-based methods, YOLO detects
the objects in an arbitrary image remarkably well, since YOLO sees the full image
during the training and testing phases and therefore obtains complete information
about the entire image and its objects. The algorithm has good detection accuracy
under complicated conditions, such as varied illumination and noise, while satisfying
real-time performance requirements.
Redmon J. et al. (2016, 2) state: "YOLO sees the entire image during training and test
time, so it implicitly encodes contextual information about classes as well as their
appearance". It is formulated as a regression problem: it predicts classes and
bounding boxes for an image by applying the network once to the image. The
algorithm splits the given image into an S × S grid of cells, as shown in Figure 23. Each
grid cell predicts bounding boxes, a confidence score for each prediction, and class
probabilities; most of the boxes have a low predicted score, so unnecessary bounding
boxes or detected objects can be discarded by setting a threshold. A predicted
bounding box can be described by (width, height, centre, class); in addition, the
model calculates the confidence using this formula:
confidence = Pr(object) × IOU(truth, pred)
Figure 23. YOLO bounding boxes, confidence, and class probability map.
(Source: https://arxiv.org )
The formula computes the confidence score for the grid cell. If the confidence value is
zero, no object was detected in the bounding box; otherwise, the confidence score
equals the IOU (intersection over union), which measures the agreement between
the ground-truth bounding box and the predicted bounding box, as shown below.
Figure 24. Intersection over Union (IOU).
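The IOU term in the confidence formula can be computed directly from two box coordinates. A minimal sketch, assuming boxes are given as (x1, y1, x2, y2) corner pairs:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Corners of the overlapping region (empty if the boxes are disjoint).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Identical boxes score 1.0 and disjoint boxes score 0.0, which is why thresholding the confidence removes low-quality predictions.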
6.1.3 Faster R-CNN
Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik (2014, 1) propose "a
simple and scalable detection algorithm that improves mean average precision
(mAP)". The R-CNN algorithm applies a high-capacity convolutional neural network to
more than 2,000 bottom-up Regions of Interest (ROI) extracted from an image to
localise and segment objects, then classifies each region using class-specific linear
SVMs (Support Vector Machines).
R-CNN is slow: each extracted region requires a complete forward pass through the
CNN, and the network coefficients are not updated during the regression. Fast R-CNN
instead normalises the input image and maps any boxes extracted from the first
layers directly onto the final convolutional layer. This step increases computing
speed, since there is no longer a need to store feature information from the first
layers for every extracted region, which makes training faster.
Ross Girshick (2016, 1441) proposed the Faster R-CNN algorithm to improve R-CNN's
speed and accuracy. Faster R-CNN has many advantages: the GPU is not filled with
cached feature data, and training is a single step in which a multi-task loss updates
the network layers. The overall performance has improved, particularly in terms of
detection speed, since the method creates a convolutional network to generate the
proposal boxes and shares it with the object detection network, which reduces the
number of proposed frames to roughly half or less compared to R-CNN.
The architecture of Faster R-CNN is composed of the Region Proposal Network (RPN)
and Fast R-CNN. The RPN reduces the computational load by rapidly and effectively
scanning locations in an image to decide which spots need further processing, using a
convolutional neural network. Fast R-CNN has a deeper architecture than the RPN; it
consists of a convolutional neural network, a Region of Interest (ROI) pooling layer,
fully connected layers, and finally two output heads for classification and regression
(Figure 25).
Figure 25. Object detection by Faster R-CNN.
(Source: https://towardsdatascience.com)
According to Bin Liu, Wencang Zhao, and Qiaoqiao Sun (2017, 6234), Faster R-CNN
uses an RPN (Region Proposal Network) instead of the Selective Search method used
in Fast R-CNN. The Faster R-CNN framework can be divided into four steps:
• Convolution layers.
In the first step, the algorithm extracts the image feature maps through the ConvNet
layers, ReLU activation functions, and pooling layers.
• Region Proposal Network (RPN).
The RPN is a fully convolutional network that predicts object boundaries and
objectness scores at each position. The RPN shares the image's convolutional feature
maps with the detection network, thus enabling nearly cost-free region proposals.
• ROI (Region of Interest) Pooling.
ROI pooling takes both the input image features and the proposals for those features
and produces fixed-size feature maps by applying max-pooling to the inputs. In the
pooling layer, the output and input channels are identical.
• Classification.
The classification layers calculate the class of each proposal and refine the proposal
feature maps to obtain the final exact position of the bounding box.
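The ROI pooling step above can be sketched for a single 2-D feature map. This is a simplified illustration, assuming the ROI is at least as large as the output grid; real implementations work on batched multi-channel tensors:

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool an ROI (x1, y1, x2, y2) of a 2-D feature map to a fixed grid."""
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2]
    out_h, out_w = output_size
    pooled = np.empty(output_size)
    # Split the region into out_h x out_w bins and take the max of each bin,
    # so proposals of any size produce the same fixed-size output.
    h_edges = np.linspace(0, region.shape[0], out_h + 1).astype(int)
    w_edges = np.linspace(0, region.shape[1], out_w + 1).astype(int)
    for i in range(out_h):
        for j in range(out_w):
            pooled[i, j] = region[h_edges[i]:h_edges[i + 1],
                                  w_edges[j]:w_edges[j + 1]].max()
    return pooled
```

The fixed output shape is what allows arbitrarily sized proposals to feed the fully connected classification and regression heads.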
According to Shahpour Alirezaee, Hassan Aghaeinia, Karim Faez, and Farid Askari
(2006, 30), "The ultimate goal of the face localization is finding an object in an image
whose shape resembles the shape of a face". They stated four methods used to
localise and detect a face, whose efficiency depends on motion and colour
information. They classified the approaches into four categories: knowledge-based
methods, feature-invariant approaches, template-matching methods, and
appearance-based methods.
Face localization is a simplified detection problem that intends to ascertain the size
and position of faces in still and video images. Localization is a crucial stage in the face
recognition process. In real-time face recognition and face tracking, we can use the
location found by the face detector; this area is used to align the face, but if the
detection step is not robust and accurate enough, additional face landmarks, such as
the nose, the point between the eyebrows, and the mouth, are required. (ibid., 30-31).
Summary
The detectors reviewed in this chapter range from rigid templates to CNNs. The
author summarizes the pros and cons of each algorithm alongside references to
different works: when published, Viola-Jones achieved detection at 2 fps with a
detection accuracy rate of 95%, and it was a very robust face detection classifier at
that time because of its very low false-positive rate and very high detection rate.
YOLO and Faster R-CNN, on the other hand, are based on CNNs; their results are
compared in Table 2.
Table 2: Average accuracy of face and head detection on the FDDB dataset and
Casablanca dataset.
CNN-based neural networks are significantly more reliable than Viola-Jones in terms
of accuracy, but they need more computational power and time for training and for
calculating the results. The mean average accuracy error of CNN-based networks is
five times lower than that of Viola-Jones on the FDDB data, as stated in the paper by
Le Thanh Nguyen-Meidine, Eric Granger, Madhu Kiran, and Louis-Antoine Blais-Morin
(2018, 5).
The only reason Viola-Jones still has a presence among modern algorithms is that it
allows real-time recognition at 60 FPS, as we can see in Table 3, with very low GPU
consumption compared to Faster R-CNN and YOLO. We conclude that Viola-Jones is
the fastest face detector, with a range between 40 and 60 FPS, while the CNN
algorithms surpass it in the ability to detect faces from different angles. Moreover,
Faster R-CNN is acceptable in terms of accuracy and speed, but its main disadvantage
is that it consumes too much memory. YOLO, on the other hand, is fast and easy to
implement in a real-time system, with the major drawback that it is not good at
detecting distant faces. (ibid., 6)
Table 3: Average time and memory complexity for face detection on FDDB.
6.2 Alignment
Face alignment is a computer vision technique used to determine the geometric
structure of the human face in images (Figure 26), and it is considered one of the
important stages of face recognition. Besides face recognition, face alignment is
applied in other face-related applications such as deep fakes, face synthesis, and face
modelling. Given the position and size of a face, the positions of facial landmarks such
as the eyes, nose, and mouth are calculated automatically. Due to factors such as
varying posture, lighting, and partial occlusion in face pictures, face alignment is a
complicated problem.
Figure 26. Face alignment and landmark.
(Source: https://ldl.herokuapp.com )
According to Timothy F. Cootes, et al. (2001, 682), the Active Appearance Model
discussed in chapter two uses the density of the face's pixels to obtain better
accuracy; the main challenge with AAM, however, is the labelling effort (positioning
the landmarks on face features for the training set).
Face alignment with 3D object detection algorithms aligns not only the appearance of
the face but also the head pose; 2D alignment algorithms cannot reach the depth of
an occluded face. Lie Gu and Takeo Kanade (2006, 1) describe their 3D patch-based
approach: "A face is modelled by a set of sparse 3D points (shape) and the view-based
patches (appearance) associated with every point". It has two advantages: first, it is
easier to compensate for local illumination, and second, the texture variance within a
patch is considerably smaller than that of the entire face. In this section, we examine
four methods that have a large impact on face recognition and explain the process
and principles of each one.
6.2.1 Supervised Descent Method and its Applications to Face Alignment
According to Xuehan Xiong and Fernando De la Torre (2013, 3), their approach
formulates face alignment as a minimisation problem, calculated using the formula:

f(x0, Δx) = ‖h(d(x0 + Δx)) − θ∗‖₂²

where d represents an image, d(x) are the landmarks in the image, and h is a feature
extraction function such as SIFT (Scale-Invariant Feature Transform), a feature
detection algorithm that computes feature values for images. θ∗ indicates the SIFT
values at the manually labelled landmarks. The classical approach to such
minimisation problems is Newton's method, which finds the minimum of a scalar
function by approximating the loss function with a quadratic surface and stepping to
its optimal point; Figure 27 compares gradient descent with Newton's method. This
approach demands computing the inverse of the Hessian matrix, which has two
disadvantages: the computational cost for massive data, and infeasibility for
non-differentiable functions.
Figure 27. A comparison of gradient descent (green) and Newton's method (red) for
minimizing a function.
( Source: https://en.wikipedia.org )
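The contrast in Figure 27 can be reproduced numerically. The following sketch, using an illustrative quadratic f(x) = (x − 3)², compares a fixed-step gradient descent update with the Newton update, which divides by the curvature:

```python
def descend(grad, hess, x0, steps=20, lr=0.1):
    """Minimise a 1-D function two ways: gradient descent vs. Newton's method."""
    x_gd = x_newton = x0
    for _ in range(steps):
        x_gd -= lr * grad(x_gd)                       # fixed-step gradient descent
        x_newton -= grad(x_newton) / hess(x_newton)   # Newton step uses curvature
    return x_gd, x_newton

# Example: f(x) = (x - 3)^2, minimised at x = 3.
grad_f = lambda x: 2.0 * (x - 3.0)
hess_f = lambda x: 2.0
```

On a quadratic, Newton's method lands on the minimum in a single step, while gradient descent only approaches it geometrically; this is the efficiency SDM tries to retain while avoiding the Hessian inverse.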
The supervised descent algorithm is first validated on simple analytic functions and
compared to Newton's method. The algorithm was then tested on facial feature
detection on two datasets, achieving 96.7% accuracy on the LFPW dataset and 98.7%
on the LFW-A&C dataset, and evaluated by comparison with modern detection
methods, showing the cumulative error distribution against linear regression and
Belhumeur's method. Finally, the algorithm was tested on facial feature tracking on a
video dataset, where it attempts to detect facial landmarks in each frame.
6.2.2 Face Alignment at 3000 FPS via Regressing Local Binary Features
Shaoqing Ren, Xudong Cao, Yichen Wei, and Jian Sun (2014, 3) presented the Local
Binary Features (LBF) algorithm, attempting to achieve a significant error reduction
and improvement in speed. Shaoqing R. et al. (2014, 1) state that LBF runs at 300 FPS
on mobile devices and 3,000 FPS on desktop, which opens new possibilities for online
face applications on portable devices. Instead of SIFT features, binary features are
learned from the training data; each feature is learned independently in its local
region. Utilising LBF lowers the face alignment error rate, increases the
discriminativeness of the features, and diminishes the computational load. To find the
local features in an image, the LBF approach takes regions near each landmark h and
solves the learning step for the local features in that area using the regression target,
with the formula:
min_{w_h^t, φ_h^t} Σ_{i=1}^{N} ‖ π_h ∘ ΔŜ_i^t − w_h^t φ_h^t(I_i, S_i^{t−1}) ‖₂²
LBF updates both the local mapping φ_h^t and the local regression weights w_h^t
simultaneously; the operator π_h extracts the two elements (2h − 1, 2h) from the
vector ΔŜ_i^t in each iteration. The index i denotes the training samples, so
π_h ∘ ΔŜ_i^t is the ground truth of the h-th landmark in the i-th training sample.
The LBF experiments and evaluation were performed on three datasets. The first is
the LFPW dataset (29 landmarks), collected from the web. The second is the Helen
dataset (194 landmarks), containing 2,300 high-resolution web images. The third
dataset is 300-W (68 landmarks), collected from existing datasets (AFW, LFPW, Helen,
XM2VTS). These datasets were manually split into two portions for the training set
and the test set.
6.2.3 Robust Facial Landmark Detection under Significant Head Poses and Occlusion
According to Yue Wu and Qiang Ji (2015, 3660-3661), the purpose of modelling head
poses and occlusion is to take landmark visibility into consideration and train models
able to predict the visibility of the face's features. This contrasts with LBF, which
handles all facial landmarks symmetrically without considering occlusions and head
pose (Figure 28, which illustrates head poses and occlusion for two subjects). The
authors display the visible landmark points as the final output for large head poses.
The pre-trained model assists face alignment by extracting local face features,
combining them with the visibility information, and forming a configuration around
the face landmarks.
Figure 28. Facial landmark detection and occlusion prediction in different iterations.
(Source: https://openaccess.thecvf.com )
min_{Δp_t} ‖ Δp_t − T_t Ψ(I, x_{t−1}) ‖₂² + λ E_{p_t}[Loss(c)]
p_t = p_{t−1} + Δp_t, where 0 ≤ p_t ≤ 1
The mathematical formula used in HPO calculates the loss function Loss(c), where c
denotes each possible occlusion and head pose pattern that can occur over the m
landmark points; c is a vector of length 2^m. The last equation (p_t) solves the
problem iteratively by optimising the function with respect to Δp and T. To update T
(the parameters), the authors use a least-squares formulation, and to solve the Δp
problem they use a gradient descent method to update the prediction task. They
evaluated their algorithm on three databases: the first dataset was collected from the
internet, the second is Labelled Face Parts in the Wild (LFPW), and the last is the
Helen dataset. The authors took into consideration the degrees of inclination and
declination for near-frontal head poses and limited occlusion in each dataset.
6.2.4 Joint Head Pose Estimation and Face Alignment Framework
Xiang Xu and Ioannis A. Kakadiaris (2017, 2) state that JFA was the first approach to
compute global and local CNN features to improve both head pose estimation and
face alignment for the landmark detection task. Since head pose and face alignment
are deeply correlated, and to reduce errors in both, the authors trained CNNs to
detect facial features using different head pose combinations from multiple datasets.
The JFA algorithm has two parts to detect and localise faces in images, from global to
local features, analysing the faces in a cascade manner: first, global CNN features
produce an appropriate initialisation to diminish the variance of the bounding boxes
around faces; second, local CNN features provide discriminative features for the
cascade regression. It was the first time global and local features were used together
with CNN techniques in a cascade; by strengthening the relationship between the
head pose and the landmarks, the algorithm establishes proper shape initialisations
using the following formula at iteration L.
S_L = S_{L−1} + W_L θ_L(I, S_{L−1})
where I is a face image and S denotes the landmarks. θ_L(I, S_{L−1}) is the most
crucial part of the formula: it is the function that extracts the facial features from the
image at the previously estimated landmarks.
Figure 29. Landmark detection using the Dlib implementation (top) and the JFA
algorithm (bottom).
( Source: https://www.researchgate.net )
JFA used multiple datasets gathered from the 300-W competition for training and
evaluation, including the LFPW, AFW, HELEN, and IBUG datasets. The images were
distributed into two sections. The first was a training set consisting of 3,146 images
from LFPW, AFW, and HELEN. The second is called the full testing set, since it contains
two parts: a common testing set of 689 images collected from the LFPW and HELEN
testing sets, and a challenging testing set of 135 images (with significant head pose
variations and lower resolution) assembled from the IBUG dataset. These datasets are
annotated with 68 landmarks but without head pose information (Figure 29).
Summary
To sum up, this chapter presented four facial landmark algorithms. SDM and LBF use
cascaded regressors to predict the coordinates of landmarks directly from
shape-indexed features; HPO maintains a binary landmark occlusion vector and
updates the visibility likelihoods and the landmark positions over iterations to achieve
convergence between the face features and the suggested landmarks; finally, JFA
takes a different approach: according to Xiang Xu, et al. (2017, 1), it "use[s] the global
and local CNN features to solve head pose estimation and landmark detection tasks
jointly". Essentially, all the algorithms aim to estimate the head pose angle and
introduce a constrained supervised regression to achieve accurate convergence.
Table 4: Facial landmark detection error.

Algorithms   Helen 194 L   Helen 68 L   LFPW 68 L   LFPW 29 L   300W 68 L
SDM          5.82          -            -           3.47        5.57
LBF          5.41          6.58         5.58        3.35        4.95
HPO          5.49          -            -           3.93        -
JFA          -             5.48         5.08        -           5.32
The error ratios shown in Table 4, obtained from Hongwen Zhang, Qi Li, and Zhenan
Sun (2018, 7) and from Yue Wu and Qiang Ji (2015, 3665), on the Helen dataset with
194 and 68 landmarks, the LFPW dataset with 68 and 29 landmarks, and the 300W
dataset with 68 landmarks, show that all the algorithms have approximately the same
error ratio on the same dataset; the differences lie in the detection speed on
real-time systems and the handling of head pose variations. These algorithms achieve
different results according to the head pose, and the databases used in these
experiments contain different head poses, as shown in Table 5. The experimental
results demonstrate the algorithms' effectiveness on face images with extreme
appearance variations, heavy occlusions, and large head poses.
Table 5: Head pose variations.
(Source: https://ibug.doc.ic.ac.uk )
7 Experiment and Result
Developing an optimal masked face recognition algorithm is a significant challenge in
computer vision, with limited sample cases and only a few reference datasets. The
experiment for the masked face recognition system was implemented by connecting
the three steps from chapter three (detection, alignment, and recognition); by
utilising multiple solutions for each stage, the author was able to evaluate the
efficiency and the speed of the experiment.
The author used the fastai library, which builds on PyTorch, as Jeremy Howard and
Sylvain Gugger state in their article "fastai: A Layered API for Deep Learning" (2020, 3),
to develop the model and test the effectiveness of the methods used in this chapter.
The author used two datasets for the training and testing steps: the first was
collected by the author, given that no masked face dataset existed, and the second
was loaded from (https://makeml.app/datasets/mask). These datasets and methods
were run in Jupyter notebooks using the Python programming language, and the
processes were executed on various cloud platforms, such as Google Cloud,
Paperspace, and Google Colab. The computational cost of the different methods was
measured via time and accuracy.
7.1 Detection and Extraction
Face detection and localisation are a long-standing challenge in computer vision. In
the previous chapter, we investigated face detection through multiple techniques,
from machine learning algorithms to deep learning methods. The human face has a
unique structure, owing to local facial parts such as the eyes, nose, and mouth, and
these features assist in localising and detecting faces under unconstrained conditions.
7.1.1 Data
All face detection systems necessitate face datasets for training and testing purposes.
In the deep learning approach, the accuracy of any CNN relies enormously on the
scale of the training dataset. Despite the many face detection and localisation
datasets available, their usage is often restricted to research purposes and prohibited
for commercial use. Therefore, in the experiment, the author started with the
Labelled Faces in the Wild (LFW) dataset, which contains 13,000 images, and the
Large-scale CelebFaces Attributes (CelebA) dataset, which includes 200,000 images.
To make the datasets compatible with our models, a mask must be added to each
face to craft a new masked face dataset. The author therefore embedded masks on
the datasets programmatically, using the Dlib facial landmark detector to estimate
the locations of the 68 coordinates that map the face points; the idea behind using
the landmark detector was also to align faces that have a large head pose (Figure 30).
The results were good but not perfect compared to a real medical mask, which might
lead the model to make false predictions if the data provided is not accurate;
moreover, these new images lacked any annotations. The author, on the other hand,
was looking for annotations that include the face coordinates in images together with
information about each face; such annotations help to develop a face detection
model alongside a mask classifier.
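The mask-embedding step can be sketched as follows. This is a hypothetical helper, not the author's actual code: it assumes the standard 68-point Dlib layout, in which indices 2–14 trace the lower jaw and index 29 sits on the nose bridge (the top edge of a mask); the warping of a mask image onto the resulting polygon is omitted.

```python
def mask_polygon(landmarks):
    """Build the polygon covering the lower face where a medical mask would sit.

    `landmarks` is a list of 68 (x, y) tuples in the standard Dlib ordering
    (an assumption for this sketch).
    """
    if len(landmarks) != 68:
        raise ValueError("expected the 68-point Dlib landmark layout")
    jaw = landmarks[2:15]   # lower jaw contour, left cheek to right cheek
    top = landmarks[29]     # nose-bridge point closes the polygon at the top
    return jaw + [top]
```

In practice a mask texture would then be warped onto this polygon (e.g. with an affine or perspective transform), which is why the landmark positions, and hence the alignment for large head poses, matter so much.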
Figure 30. LFW dataset with medical masks.
To solve this dilemma, the author used the Face Mask Detection dataset obtained
from Kaggle (a subsidiary of Google). This dataset includes 853 images (Figure 31)
covering three classes (Mask, Without-Mask, Mask-Worn-Incorrectly). The dataset is
divided into two sections (images and annotations).
Figure 31. Face Mask Detection dataset.
The images section contains 4,072 faces across the images. The dataset was split for training with a ratio of 0.8 for learning (682 images) and 0.2 for testing (171 images). The annotations section provides details on each image, such as the category name for each face, the image dimensions, and the face coordinates, as shown in Table 6. The second dataset was collected using the Google and Firefox browsers by specifically selecting images of people wearing a medical mask; we obtained 494 images, each containing one face, and this dataset was used for the recognition phase.
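The 0.8/0.2 split above can be reproduced with a short stdlib helper. The filename pattern and fixed seed are illustrative assumptions, and note that the reported split (682 training, 171 test) sums to 853 images:

```python
import random

def train_test_split(items, train_ratio=0.8, seed=42):
    """Shuffle a list of items and split it into train/test portions."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# 853 annotated images, matching the 682 + 171 split reported in the text.
images = [f"maksssksksss{i}.png" for i in range(853)]
train, test = train_test_split(images)
print(len(train), len(test))  # 682 171
```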
Table 6: A CSV file representing the annotation details for the Face Mask Detection dataset: image names, dimensions, face categories, and coordinates extracted from the XML files in the dataset.
Image Dimensions Face 1 Face 2
0 maksssksksss579.png ['400', '226'] ['with_mask', '150', '26', '193', '86']
['with_mask', '204', '127', '245', '175']
1 maksssksksss136.png ['267', '400'] ['with_mask', '100', '105', '152', '158']
0
2 maksssksksss384.png ['267', '400'] ['with_mask', '37', '244', '50', '258']
['with_mask', '146', '238', '161', '255']
3 maksssksksss245.png ['400', '210'] ['with_mask', '32', '25', '57', '54']
['with_mask', '23', '162', '38', '179']
4 maksssksksss451.png ['400', '273'] ['with_mask', '11', '1', '33', '13']
['with_mask', '120', '7', '141', '28']
5 maksssksksss699.png ['400', '279'] ['without_mask', '18', '82', '64', '131']
['without_mask', '18', '198', '66', '245']
6 maksssksksss249.png ['400', '267'] ['with_mask', '168', '65', '234', '139']
['without_mask', '309', '119', '368', '191']
7 maksssksksss770.png ['400', '210'] ['with_mask', '45', '43', '51', '53']
['with_mask', '80', '39', '97', '64']
8 maksssksksss597.png ['400', '267'] ['with_mask', '295', '125', '317', '149']
['with_mask', '344', '115', '366', '142']
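Rows like those in Table 6 can be extracted from the dataset's XML annotation files with a short stdlib parser. The tag names below (size, object, name, bndbox, xmin, ...) follow the common Pascal VOC layout that this Kaggle dataset uses; treat them as an assumption if your copy differs:

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<annotation>
  <filename>maksssksksss579.png</filename>
  <size><width>400</width><height>226</height></size>
  <object>
    <name>with_mask</name>
    <bndbox><xmin>150</xmin><ymin>26</ymin><xmax>193</xmax><ymax>86</ymax></bndbox>
  </object>
</annotation>
"""

def parse_annotation(xml_text):
    """Turn one VOC-style annotation into a row like those in Table 6."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    row = {
        "image": root.findtext("filename"),
        "dims": [size.findtext("width"), size.findtext("height")],
        "faces": [],
    }
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        row["faces"].append(
            [obj.findtext("name")]
            + [box.findtext(k) for k in ("xmin", "ymin", "xmax", "ymax")])
    return row

row = parse_annotation(SAMPLE)
print(row["image"], row["dims"], row["faces"][0])
```

Collecting such rows over every XML file yields the CSV shown in Table 6.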
7.1.2 Implementation
For the experiments, we used three pre-trained models (InceptionV3, MobileNetV2, and VGG16), training them on Google Colaboratory, a free Jupyter notebook environment running on a Tesla K80 GPU (12 GB). The author used these three pre-trained models for two reasons. First, the models have well-designed architectures, with 94 convolutional layers (ConvLs) for InceptionV3, 35 ConvLs for MobileNetV2, and 13 ConvLs for VGG16, which is a very simple architecture that nevertheless utilises high GPU power. Second, using these pre-trained models achieves a good accuracy rate in the detection process (Figure 32). To extract faces from video or a live stream, the OpenCV library offers OpenCV-dnn, a deep learning pre-trained model trained to detect faces under various head poses and divergent illumination. For training the face detection models, the TensorFlow framework and Fastai were used: three pre-trained models were utilised for transfer learning in TensorFlow and two in Fastai. Since the author uses small masked-face datasets, pre-trained models built on these frameworks are necessary to detect masked faces in the video stream.
Figure 32. Models complexity (parameters size) comparing to total memory
utilization.
(Source: https://arxiv.org)
First, using the TensorFlow Object Detection API, which contains pre-trained models, facilitates the task by reducing the time and computational effort needed to train the masked face (MF) detector. The InceptionV3, MobileNetV2, and ResNet101 pre-trained models were used in the detection and localization experiments, since their inference speed was expected to be fast enough. These models were configured so that the shape of the input images to the detector is (224*224*3); by diminishing the shape of the images, the detector achieves more reliable results and a shorter detection time.
In the detection experiments, after 20 epochs on the training and validation datasets, the author randomly chose some epochs to demonstrate the validation accuracy and validation loss and compare the efficiency of these models, as shown in Table 7.
Table 7: Four epochs of masked face detection and localization on a small dataset, presenting the validation accuracy and loss for each model.
Methods Validation Accuracy Validation Loss
Epoch 1
InceptionV3 0.8626 0.3353
MobileNetV2 0.8380 0.3794
VGG16 0.7411 0.5950
Epoch 7
InceptionV3 0.8908 0.2700
MobileNetV2 0.8699 0.3206
VGG16 0.8110 0.5201
Epoch 16
InceptionV3 0.8834 0.3050
MobileNetV2 0.8908 0.2846
VGG16 0.8172 0.5068
Epoch 20
InceptionV3 0.9067 0.2240
MobileNetV2 0.8969 0.2692
VGG16 0.8172 0.4976
By using the OpenCV deep neural network (dnn) module, officially released in 2017, together with our models, we achieved quite high accuracy in detecting faces with and without a mask. The detection task is completed using the pre-trained models, which are object detection models used in our approach to enhance facial features within a face region, together with data augmentation to deal with occlusions and small faces.
As can be seen from Table 7, the validation accuracy and loss for both InceptionV3 and MobileNetV2 are substantially better on a small dataset than for VGG16. This is because VGG16 was trained on an extensive image dataset (15 million labelled high-resolution images) from ImageNet to classify more than 22,000 categories. It also has a simple CNN architecture, not built to detect and classify faces, with more than 14 million parameters (Table 8). To solve the underfitting problem with VGG16, we would need more data, different augmentations, and a different architecture to achieve good accuracy. Fitting the model further to our dataset is not an essential step in the experiment, since we already have two models with proper accuracy and no other source of unrestricted masked-face data.
Table 8: InceptionV3, MobileNetV2, and VGG16 parameters with number of ConvLs.
Parameters InceptionV3 MobileNetV2 VGG16
Total params 22,328,099 2,586,691 14,846,787
Trainable Params 525,315 328,707 132,099
Convolutional layers 94 - Conv2D 35 - Conv2D 13 - Conv2D
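The parameter counts in Table 8 follow from standard layer formulas. A minimal sketch contrasting an ordinary Conv2D layer with the depth-wise separable form used by MobileNetV2 (biases omitted in the separable variant for brevity; the layer sizes are VGG16's actual first convolution, 64 filters of 3x3 over RGB):

```python
def conv2d_params(kernel_h, kernel_w, in_channels, out_channels, bias=True):
    """Learnable parameters in a standard Conv2D layer."""
    per_filter = kernel_h * kernel_w * in_channels + (1 if bias else 0)
    return per_filter * out_channels

# VGG16's first convolution: 64 filters of 3x3 over a 3-channel RGB input.
print(conv2d_params(3, 3, 3, 64))   # 1792

def separable_params(kernel_h, kernel_w, in_channels, out_channels):
    """Depth-wise separable convolution: a per-channel spatial filter
    plus a 1x1 pointwise mix (biases omitted for brevity)."""
    depthwise = kernel_h * kernel_w * in_channels   # one filter per channel
    pointwise = in_channels * out_channels          # 1x1 convolution
    return depthwise + pointwise

print(separable_params(3, 3, 3, 64))  # 219 vs 1792 for the standard layer
```

This factorisation is why MobileNetV2's total in Table 8 is an order of magnitude below the other two models.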
The facial detection process is responsible for detecting and extracting the faces in an image or in any frame of a video; it is considered the first step of the face recognition pipeline. After the detection step, we use the extracted face coordinates to draw bounding boxes around the faces (Figure 33), which shows the outputs of the models (a) InceptionV3, (b) MobileNetV2, and (c) VGG16. The detection process draws red bounding boxes for unmasked faces and green bounding boxes for masked faces, accompanied by the prediction accuracy. The prediction accuracy differs depending on the model used (Figure 33), and the prediction outcomes all look approximately close to each other. From Table 7, however, we found that the InceptionV3 and MobileNetV2 models had better accuracy and lower validation loss and were faster to train than VGG16. This is because these models (MobileNetV2 and InceptionV3) utilise depth-wise separable convolutions, which reduce the number of parameters. On the contrary, according to Song Han, Huizi Mao, and William J. Dally (2016, 1-2), the VGG16 model has a large number of parameters that consume storage capacity and GPU memory, and its simple convolutional architecture with three fully connected layers makes it a very heavy structure.
(a) InceptionV3 (b) MobileNetV2 (c) VGG16
Figure 33. Red and Green bounding boxes with different models.
Finally, the last step in the detection and alignment phase is to resize the extracted faces from an image without losing the face's purity and clarity. The output face image is square, with the face covering the whole image at a size of (224, 224) pixels, to increase the speed of the recognition process. In Figure 34 (a) we provide the original image containing four subjects to the detection phase, and the outcome, seen in Figure 34 (b), is four extracted faces with the same dimensions.
(a) (b)
Figure 34. (a) Original image (b) faces cropped to 224x224 pixel dimensions.
(source: https://www.pexels.com )
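One way to produce the square crops described above without distorting the face is to expand each detected bounding box to a square before resizing. This helper is an illustrative sketch (the thesis does not give its exact cropping code); a real pipeline would finish with cv2.resize to (224, 224):

```python
def square_crop_box(x1, y1, x2, y2, img_w, img_h):
    """Expand a face bounding box to a square, clamped inside the image,
    so the later resize to 224x224 does not stretch the face."""
    w, h = x2 - x1, y2 - y1
    side = max(w, h)
    cx, cy = x1 + w // 2, y1 + h // 2
    # Centre the square on the face, then clamp it inside the image bounds.
    nx1 = min(max(cx - side // 2, 0), img_w - side)
    ny1 = min(max(cy - side // 2, 0), img_h - side)
    return nx1, ny1, nx1 + side, ny1 + side

# A 41x49 detection inside a 400x273 frame becomes a 49x49 square crop.
box = square_crop_box(150, 26, 191, 75, 400, 273)
print(box)  # (146, 26, 195, 75)
```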
We also demonstrate the validation accuracy and loss percentages on the masked face dataset after running 20 epochs for each model (InceptionV3, MobileNetV2, and VGG16). We set the learning rate parameter to 1e-4, and the batch size for each iteration was 32 for all models. The training time differed between the models: InceptionV3 consumed 14.8 minutes to finish 20 epochs in the second iteration, MobileNetV2 was faster, accomplishing the same epochs in 13.6 minutes, while VGG16, with its large number of parameters, took longer than the others at 16.8 minutes. As can be seen in Figures 35 and 36, comparing the models' performance on the same dataset with the same configuration, we choose InceptionV3 from among the three models. The InceptionV3 model outperforms MobileNetV2 with a validation accuracy around (0.9070) and a validation loss about (0.2240), as shown by the low loss thresholds across different epochs (Figures 35, 36).
Figure 35. Validation Accuracy implemented by three models (Inception V3,
MobileNetV2, VGG16) on Masked faces Dataset for 20 epochs.
Figure 36. Validation Loss implemented by three models (Inception V3, MobileNetV2,
VGG16) on Masked faces Dataset for 20 epochs.
7.2 Recognition
In order to distinguish between two faces, several approaches have been proposed to solve the discrimination task. The author described some of the machine learning algorithms that calculate the variance between embedded faces in chapter two. Over the years, different methods have been presented in the classification field, depending on how the features are calculated. Face recognition methods can, however, be classified into three categories (feature extraction, dimensionality reduction, and hybrid approaches).
7.2.1 Landmarks
In our experiment, to develop an identification model from single-shot image classification, we first set up a neural network masked face detector to find and extract the faces; we then embed the face features into 128 dimensions to discriminate between the classes by passing the features through a deep neural network. Local feature methods are used to identify and describe the facial features of a face with specific geometrical properties (Figure 37).
Figure 37. Cropping the local visible features from an extracted face.
We set the local features to those covering the upper side of the face, since half of the face is hidden, using the Dlib-ml open-source library to detect the face feature map. The dlib library provides a shape detector (68 face landmarks) that covers the entire face (Figure 38 a). We used the 68-point shape detector to implement the alignment task and correct the face rotation, so that we can remove the masked region efficiently. We also use 24 landmarks to localise and represent the salient features of the face, such as the eyes, eyebrows, and the centre of the face (Figure 38 b).
(a) (b) (c)
Figure 38. 2D face landmarks (a) 68 landmarks for the entire face (b) 24 landmarks
for visible parts of the face (c) cropped the visible parts with the landmarks.
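The 24-landmark subset can be selected by index from dlib's 68-point output. The thesis does not spell out which 24 indices are kept, so the selection below (eyebrows 17-26, eyes 36-47, and the top of the nose bridge 27-28 as the face centre) is an illustrative assumption that happens to total 24 points:

```python
# Assumed upper-face subset of dlib's 68-point scheme:
# eyebrows 17-26 (10 pts), eyes 36-47 (12 pts), nose bridge top 27-28 (2 pts).
UPPER_FACE_IDX = list(range(17, 27)) + list(range(36, 48)) + [27, 28]

def upper_face_landmarks(landmarks68):
    """Select the visible upper-face subset from dlib's 68 (x, y) points."""
    if len(landmarks68) != 68:
        raise ValueError("expected 68 landmarks")
    return [landmarks68[i] for i in UPPER_FACE_IDX]

pts = [(i, i) for i in range(68)]   # stand-in for a detected shape
subset = upper_face_landmarks(pts)
print(len(subset))  # 24
```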
7.2.2 Visible features embedding
To embed the visible facial features into 128 dimensions, we need to crop the detected faces; for this we chose InceptionV3 for its lightweight architecture and accuracy. It is implemented with Keras, an open-source neural network library running on TensorFlow that is used to implement computer vision tasks. To create a DCNN that embeds face features, Google researchers introduced a state-of-the-art network that takes a face image as input and produces a 128-D embedding as output. This technique eases the classification problem, since classifying many people using a deep learning pre-trained model takes a long time to compare, and we need to accomplish this task in milliseconds. According to Florian Schroff, Dmitry Kalenichenko, and James Philbin (2015, 815-816), FaceNet uses a DCNN (deep convolutional neural network) trained so that distances between embeddings correspond to face similarity.
The authors of FaceNet used a triplet loss function to calculate the similarity between two embedded faces. Since the network has three inputs (anchor, positive, and negative objects), the idea behind this loss function is that the anchor object should be relatively closer to the positive object than to the negative object, and the formula for this comparison is:

ℒ(x, y, z) = max(‖f(x) − f(y)‖² − ‖f(x) − f(z)‖² + θ, 0)

where x denotes the anchor object, y the positive object, and z the negative object. f() indicates the function that embeds the image into 128 dimensions, and θ represents the margin between the positive and negative parts, that is, the discriminative value between image pairs.
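The triplet loss above can be rendered directly in a few lines of dependency-free Python; the margin value 0.2 is an assumption for illustration, not a value taken from the thesis:

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(||f(x)-f(y)||^2 - ||f(x)-f(z)||^2 + margin, 0): the anchor must
    be closer to the positive than to the negative by at least the margin."""
    return max(sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin, 0.0)

a = [0.0, 0.0]   # anchor embedding (toy 2-D stand-in for 128-D)
p = [0.1, 0.0]   # positive: close to the anchor
n = [1.0, 0.0]   # negative: far from the anchor
print(triplet_loss(a, p, n))  # 0.0 -> this triplet already satisfies the margin
print(triplet_loss(a, n, p))  # 1.19 -> positive loss when the roles are swapped
```

Training minimises this loss over many triplets, which pulls same-identity embeddings together and pushes different identities apart.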
Embedding means taking a few necessary measurements from the input image and encoding them into 128 dimensions (Figure 39). There are useful pre-trained models for encoding a face image into vectors, such as FaceNet from Google, DeepFace from Facebook, and Rekognition from Amazon; these models are trained to encode facial features and make comparisons. For our purpose, we need a model that encodes the upper part of the face into a 128-dimensional vector. Different configurations were tested to fulfil this requirement, but FaceNet showed the best performance. The authors of FaceNet prove that 128 dimensions are enough to achieve excellent accuracy compared to modern methods such as DeepFace. The FaceNet model was implemented with TensorFlow using 3x96x96 input images, based on the output size of the OpenCV-dnn face detection model. During the encoding step, we normalised the embedded face, which means scaling values measured on different ranges to a standard scale; to normalise our vectors we used the scikit-learn library, which provides various classification, regression, and clustering algorithms.
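The normalisation step is scikit-learn's per-vector L2 scaling (Normalizer with norm='l2'); the same operation in plain Python, shown on a toy 2-D vector standing in for a 128-D embedding:

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit L2 norm, as sklearn's Normalizer (norm='l2')
    does per embedding, so distances are comparable across faces."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)        # leave the zero vector untouched
    return [x / norm for x in vec]

emb = [3.0, 4.0]                # toy stand-in for a 128-D FaceNet embedding
unit = l2_normalize(emb)
print(unit)                     # [0.6, 0.8]
print(sum(x * x for x in unit)) # ~1.0: unit length after normalisation
```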
Figure 39. Embedding a face image to 128-Dimensional vector.
7.2.3 Classification
In our experiments on the classification task, we used different models to compare the accuracy and the standard deviation of the accuracy on both datasets (masked faces, unmasked faces). The standard model for face classification in this thesis is the Support Vector Machine. The support vector machine was introduced by Corinna Cortes and Vladimir Vapnik in the paper "Support-Vector Networks" (1995) as a new method for solving classification problems. The authors proved the efficiency of the Support Vector Machine by conducting experiments on different datasets. The small dataset includes 7,300 training patterns and 2,000 testing patterns taken from the US Postal Service database. The large dataset was obtained from the NIST dataset of handwritten character digits, collected by the National Institute of Standards and Technology, and contains 60,000 training samples and 10,000 testing samples. The idea of the SVM algorithm is to define the optimal hyperplane and generalise to non-linearly separable problems. Compared to an ANN, the SVM does not suffer from the curse of dimensionality or from overfitting, which begins in the training session when the algorithm attempts to achieve zero error on all training data.
Yichuan Tang (2013, 1) states that support vector machines were formulated for binary classification, learning from the given training data and its corresponding labels. Two loss functions are used to calculate the validation loss: the first, called L1-SVM, uses a linear sum of slack variables, and the second, called L2-SVM, a squared sum of slack variables. The L2-SVM is considered better than the L1-SVM because "The L2-SVM is differentiable and imposes a bigger (quadratic vs linear) loss for points which violate the margin" (ibid., 2). Mathematically, the L2-SVM objective is simply a squared version of the L1-SVM hinge term, minimising the squared hinge loss:
min_W ½ WᵀW + C ∑ₙ₌₁ᴺ max(1 − WᵀXₙtₙ, 0)²
where Xₙ denotes the input data, tₙ ∈ {−1, +1} is the target label, and the max term gives the squared slack variable. To predict the class label of a test datum x:

arg maxₜ (Wᵀx)t
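The L2-SVM objective and the prediction rule above can be written out directly for a linear classifier w; the data points below are toy values for illustration:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_svm_objective(w, data, C=1.0):
    """Regulariser plus C times the sum of squared hinge terms.
    data: list of (x, t) pairs with label t in {-1, +1}."""
    reg = 0.5 * dot(w, w)
    hinge_sq = sum(max(1.0 - dot(w, x) * t, 0.0) ** 2 for x, t in data)
    return reg + C * hinge_sq

def predict(w, x):
    """arg max over t in {-1, +1} of (w^T x) t."""
    return 1 if dot(w, x) >= 0 else -1

w = [1.0, -1.0]
data = [([2.0, 0.0], 1), ([0.0, 2.0], -1), ([0.5, 0.0], 1)]
print(l2_svm_objective(w, data))  # 1.25: only the third point violates the margin
print(predict(w, [3.0, 1.0]))     # 1
```

Minimising this objective over w is what an L2-SVM trainer does; only points inside the margin contribute to the (quadratic) loss.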
As can be seen from Table 9, the accuracy rates of the neural network model and the support vector machine model are approximately similar; the main difference is the training speed. The accuracy deviation for the SVM is a low standard deviation, which indicates high precision. However, linear discriminant analysis (FisherFace) is the best algorithm for appearance-based classification, reducing the input dimensions and achieving robust performance on masked face recognition.
Table 9: Accuracy and standard deviation for Five different models over masked
faces dataset.
Model Accuracy Deviation
Logistic Regression 0.868 0.0946
Linear Discriminant Analysis 0.973 0.0280
Kneighbors Classifier 0.885 0.1027
Support Vector Machine 0.948 0.0446
Neural Network 0.939 0.0416
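The Accuracy and Deviation columns in Table 9 are the mean and standard deviation of per-fold scores from cross-validation (as produced by, e.g., scikit-learn's cross_val_score). The aggregation step in stdlib Python, with illustrative fold scores rather than the thesis's actual fold results:

```python
import statistics

def summarize_folds(fold_scores):
    """Mean and population standard deviation of per-fold accuracies:
    the two numbers reported per model in Table 9."""
    return (round(statistics.mean(fold_scores), 3),
            round(statistics.pstdev(fold_scores), 4))

# Illustrative 5-fold accuracies for one classifier (made-up values).
folds = [0.95, 0.90, 0.98, 0.94, 0.97]
print(summarize_folds(folds))
```

A small deviation means the classifier's accuracy is stable across folds, which is why it is reported alongside the mean.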
8 - Discussion
“After all, the ultimate goal of all research is not objectivity, but truth.” (Helene
Deutsch 1996, 1)
This chapter explains what is needed to avoid the possible obstructions that the researcher might encounter in future work. In addition, it addresses the answers to the research questions set out in chapter 2.3, provides guidance on selecting the right variables for the study to make the practical part of the research more credible, and presents the obstacles faced during the research journey. At the end of the chapter, the author gives recommendations for further research on different methodologies.
8.1 Answers to research questions
The first research question was: What are the best approaches used to develop an
OFR system?
The goal of this research was to develop a system capable of detecting occluded faces, and the author initially aimed to develop one from scratch. During the first months of examining the literature on face recognition theories and testing various applications, the author concluded that it would be futile to start from scratch, particularly when there are so many previous studies, applications, and tools that can make the task easier. The author instead began to investigate the hypotheses and explored three main AI libraries used to build the OFR system.
At that time, the process seemed very complex and required a deep understanding of the fundamentals of machine learning in order to address the challenges that could arise during development. It was necessary to study the problem mathematically, because the FR technique is in fact a mathematical representation of a face, meaning that everyone's face has a different mathematical representation; moreover, the comparison process is carried out by applying various mathematical equations to these representations. For that reason, the author studied the book by Aaron Courville, Ian Goodfellow, and Yoshua Bengio (Deep Learning, 2016), one of the most valuable references for understanding the mathematical operations that occur within the convolutional layers. The author also obtained a certification from deeplearning.ai presented by Dr Andrew Ng, one of the brightest researchers in the field of artificial intelligence, from Stanford University. With the knowledge gained through this journey, the author was able to set out the approach by which the OFR system could be developed.
The second research question was: How does the quantity and quality of data affect
the OFR system?
Good data versus poor data is one of the biggest topics concerning researchers in the AI field, and it affects the behaviour of any AI system. Unfortunately, machine learning algorithms are unable to conclude that the data being analysed is unreliable; in some situations, this can lead to deceptive results and false predictions. Even data of relatively high quality can lead to incorrect results, potentially causing face recognition systems to falsely identify individuals, which may place the wrong people in bad circumstances if the system is used by the authorities.
In general, more data of good quality leads to a more favourable outcome. Researchers need to take time to determine whether it is worthwhile to collect the data and whether the quantity collected is enough to fulfil the purpose.
The author used two methods of data collection, as mentioned in chapter 2.4. The collected data was not of the quality or quantity required to develop an OFR system from scratch; to overcome this obstacle, the author used several pre-trained models, and by applying these models and their weights to the collected data, the author was able to achieve a good rate of detecting and localizing masked faces.
Before feeding data to a machine learning system, it is necessary to take some time to examine the data and determine whether it is possible to increase or boost its overall quality. A small improvement in data quality goes a long way towards improving the system.
The third research question was: How does the OFR system detect and extract visual
facial features?
The implementation of the OFR system has been presented in detail in chapter 7; it is almost like any FR system. The author trained several models with different measurements by manipulating the variables respectively. Multiple errors appeared during the development process, sometimes breaking the system or producing false predictions. By tuning all the model's measurements and examining the data repeatedly, the OFR system began to predict correctly and the accuracy ratio started improving. The author took into consideration three main obstacles that affect the operation of any such system:
• Data
• Head pose and illumination
• Model design
Failure to correctly prepare any of these variables can lead to severe errors. The data variable has already been discussed in the previous subsection; the head pose and illumination variables are taken into account as part of the variety of data provided to the system; finally, the model architecture involves multiple sub-variables (input shape, CNN layers, stride, padding, pooling layers, number of parameters, activation functions, flatten, fully connected layers, etc.). All of these variables are essential to developing an accurate model.
The fourth research question was: What is the evaluation of the OFR system’s
performance?
In this subsection, the author presents the evaluation method and the obstacles the study can encounter. Usually, models' results are evaluated by examining and comparing them with the results of previous work on the same database, but the author could not find many comparable studies against which to evaluate the results. This is not the only way to make sure a system works properly. In order to evaluate the OFR system, the author evaluated detection and recognition separately, using different pre-trained models and comparing the outcome for each model as presented in chapter 7. This evaluation approach is not bad, but it is also not fully reliable, as it is possible that the variables were incorrectly set for all the models the author used, and the evaluation measurements can be misleading.
Therefore, the author used two methods to evaluate the models' performance: the first was the statistical results presented in Table 7, comparing the accuracy and the loss, and the second was a supervised evaluation, giving sample images to the models and recording the accuracy ratio for each sample on each model respectively.
8.2 Conclusion
This study presented a critical overview of five face recognition algorithms and deep learning functionality, and offered a comparative analysis of six databases and related benchmarks. It also highlighted the weaknesses of the state-of-the-art approaches for detecting occluded faces and designed a new approach to overcome these deficiencies.
The study aims to present a strategy for developing a face recognition system for masked faces. The problem was separated into phases, and each phase was researched, developed, and evaluated independently. The first phase was to develop deep neural network models that can detect and align faces, including the occluded part, so that the system can crop the visible part of the face. Next, the system crops the extracted visible part of the face and embeds it into a 128-dimensional vector using the FaceNet pre-trained model. The last phase was to develop a recognizer model for the prediction. The system was designed to run in real time with low computing power consumption.
The pipeline built to solve this problem is introduced in Appendix A; the diagram illustrates how the system works and how new input data should be embedded and trained to be compatible with the system. The system was tested and demonstrated in a real-life environment. All the detection models were trained with the same dataset and the same number of iterations, as mentioned in section 6.1. We have seen that InceptionV3 and MobileNetV2 achieve 90% accuracy when the faces are clear and the lighting is good. The thesis life cycle starts by defining the deep learning approach and how the machine learns, and explains some face recognition algorithms and their performance over the past twenty years. For the training and evaluation phase, the author also went through some of the datasets used as accuracy benchmarks for testing model performance.
From the results obtained in the experiments, it can be inferred that binary comparison techniques are not suitable at large scale and are not reliable approaches for multi-class classification. The recognition rate decreases rapidly to less than 60% depending on the lighting, the head pose, and whether the person is more than 2 m from the camera. It should be noted that the OFR system is geared towards controlled lighting conditions and a fixed head position; under varying circumstances of these variables, the efficiency of the recognition system wobbles between 45% and 64%, depending on the measured distance from the camera to the target in real video streaming.
Finally, the recognition phase was partially achieved using the SVM and ResNet101. The system was trained to recognize in real time on the small dataset, which was not production-ready. The detection model was competent to detect faces with and without a mask with 99% accuracy. The recognition phase needs more data for training to achieve good accuracy at large scale, and more time to handle data augmentation. Moreover, the design used to develop the model could be changed to extract more features as measurements of the distances between the visible features (distance between the eyes, face width, 3D geometry of the face, etc.).
8.3 Recommendation for future work
As a complement to this study, there is a range of research lines that remain open and on which it is possible to expand further. Some of these potential lines appeared during the study phase and have been left open, to be discussed in the future. Many of them are directly related to this thesis work and to problems that occurred during the research; the rest are only general lines suggested by the author for future work by other researchers. The following is a list of recommendations that can be examined in the future:
• Instead of using a Support Vector Machine binary classification, it is suggested to use a different approach for a large database. Multiclass SVMs and SoftMax are used in deep learning as a standard for multi-class classification, and from the author's perspective they are ideal ways to complete the recognition task with high accuracy.
• Enhance the OFR by using a high-quality camera with a Raspberry Pi 4 (1 GB) as a controller with facial detection and recognition built in. The Raspberry Pi 4 can be made compatible with our model by including the OFR system and enhancing the visual functionalities.
• The identification process can be significantly enhanced by using various datasets containing many high-quality images with scaled diversity. Datasets such as the Tufts Face Database and the CelebA dataset will boost system performance if the data is properly organised and defined.
• The author tried using the 3D model techniques derived from the work of Feng Liu, Qijun Zhao, Xiaoming Liu, and Dan Zeng (2018). The idea is good and reliable, but it requires a long time to research and examine; it also demands particular data with a special detector that sets landmarks on the visible parts of 3D masked faces.
8.4 Summary
This thesis had two objectives: first, understanding the methods and algorithms used to perform the detection, localization, alignment, and classification steps on faces in an image, and what kind of evaluation has been used to check whether these algorithms perform well on faces in different poses; and second, developing a model that can classify individuals wearing a medical mask.
To address these questions, the author used quantitative experimental research to collect the data and test the experiments' efficiency. Since our research method is quantitative, the author has control over the data obtained in order to achieve the desired results. The data used in the study were collected from March to July 2020 and were classified according to priority to accomplish the task and answer the questions.
After studying the methods and hypotheses used to identify or recognize an object, it became easier for the author to understand identification and how it is implemented in real life. A crucial point arose in the middle of the research process, namely "data": to use the deep learning approach, you need an immense amount of data to achieve excellent accuracy so that the model can learn properly. For this reason, the various datasets provided in this study have previously been used as a pillar for testing several face recognition theories and algorithms; these datasets have become widespread in the computer vision community because they are structured to include thousands of images accompanied by annotations that contain face coordinates. As a result of this research, the author was able to implement his models in a real-life application and obtain an appropriate outcome for identification of a frontal face with a mask. Finally, the approach was based on deep learning and machine learning techniques, and a small dataset was used to evaluate the efficiency of the system.
References
Ali Sharifara, Mohd Shafry Mohd Rahim and Yasaman Anisi. 2014. “A General Review
of Human Face Detection Including a Study of Neural Networks and Haar Feature-
based Cascade Classifier in Face Detection”. Available from:
https://www.researchgate.net/publication/282680769_A_
Alok Sharma, Kuldip K. Paliwal. 2013. “Linear discriminant analysis for the small sample
size problem: an overview”. Available from:
https://link.springer.com/article/10.1007/s13042-013-0226-9
Anne Marie Monchamp .2008. “ALÈTHEIA TRUTH OF THE PAST”. Available from:
https://novaojs.newcastle.edu.au
Ayush Singhal, Pradeep Sinha, Rakesh Pant. 2017. “Use of Deep Learning in Modern
Recommendation System: A Summary of Recent Works”, Available from:
https://arxiv.org/ftp/arxiv/papers/1712/1712.07525.pdf
C. Li, Y. Diao, H. Ma and Y. Li., 2008. “A Statistical PCA Method for Face Recognition”,
Available from:
https://ieeexplore.ieee.org/abstract/document/4740022/citations#citations
Claudia Iancu and Peter M. Corcoran. 2011.“A Review of Hidden Markov Models in
Face Recognition”, Available from:
https://www.intechopen.com/predownload/17168
Corinna Cortes and Vladimir Vapnik. 1995. “Support-Vector Networks”. Available
from: http://image.diku.dk/imagecanon/material/cortes_vapnik95.pdf
Melissa P. Johnston. 2014. “Secondary Data Analysis: A Method of Which the Time
Has Come”. Available from:
http://www.qqml-journal.net/index.php/qqml/article/view/169/170
Dr. P. Shanmugavadivu and Ashish Kumar. 2016. “Rapid Face Detection and Annotation
with Loosely Face Geometry”. Available from:
https://www.researchgate.net/publication/316733019_Rapid_face_detection_and_
David E. Rumelhart, Yves Chauvin. 1995. “Backpropagation: Theory, Architectures, and
Applications”, Available from:
https://books.google.fi/books?hl=en&lr=&id=oWRv7BR4BqMC&oi
Dilip Singh Sisodia, Ram Bilas Pachori, Lalit Garg. 2020. “Handbook of research on
advancements of artificial intelligence in healthcare engineering”. Available from:
https://books.google.fi/books?id=SQfYDwAAQBAJ&pg=PA123&lpg
F. H. Alhadi, W. Fakhr and A. Farag. 2005. “Hidden Markov Models for Face
Recognition”. Available from: https://www.researchgate.net/publication/220939899
Feng Liu, Qijun Zhao, Xiaoming Liu, Dan Zeng. 2017. “Joint Face Alignment and 3D Face
Reconstruction with Application to Face Recognition”. Available from:
https://arxiv.org/pdf/1708.02734.pdf
Ferdinando Samaria, Frank Fallside. 2007. “Face Identification and Feature Extraction
Using Hidden Markov Models”. Available from:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.18.851
Ian Goodfellow, Yoshua Bengio, Aaron Courville. 2016. “Deep Learning”. Available
from: http://www.deeplearningbook.org/
Jeremy Howard and Sylvain Gugger. 2020. “fastai: A Layered API for Deep Learning”.
Available from: https://arxiv.org/pdf/2002.04688.pdf
Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, Christoph Bregler.
2015. “Efficient Object Localization Using Convolutional Networks”, Available from:
https://arxiv.org/pdf/1411.4280.pdf
Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi. 2016. “You Only Look
Once: Unified, Real-Time Object Detection”. Available from:
https://arxiv.org/abs/1506.02640
K. Mallat and J-L. Dugelay. 2018. “A Benchmark Database of Visible and Thermal Paired
Face Images across Multiple Variations”. International Conference of the Biometrics
Special Interest Group (BIOSIG), pages 199-206. Available from:
http://www.eurecom.fr/fr/publication/5700/download/sec-publi-5700.pdf
Kim Esbensen, Paul Geladi. 1987. “Principal Component Analysis”. Available from:
https://www.sciencedirect.com/science/article/abs/pii/0169743987800849
L. R. Rabiner and B. H. Juang. 1986. “An Introduction to Hidden Markov Models”.
Available from:
http://ai.stanford.edu/~pabbeel/depth_qual/Rabiner_Juang_hmms.pdf
Laurenz Wiskott, Jean-Marc Fellous, Norbert Krüger, and Christoph von der
Malsburg. 1999. “Face Recognition by Elastic Bunch Graph Matching”. Available
from: https://www.researchgate.net
Lawrence R. Rabiner. 1989. “A Tutorial on Hidden Markov Models and selected
application in speech recognition”, Available from:
https://www.ece.ucsb.edu/Faculty/Rabiner/ece259/Reprints/
Le Thanh Nguyen-Meidine, Eric Granger, Madhu Kiran, Louis-Antoine Blais-Morin.
2017. “A comparison of CNN-based face and head detectors for real-time video
surveillance applications”. Available from: https://ieeexplore.ieee.org/
Lie Gu and Takeo Kanade. 2006. “3D Alignment of Face in a Single Image”. Available
from: https://ieeexplore.ieee.org/document/1640900
Mahmoud Afifi and Abdelrahman Abdelhamed. 2019. “AFIF4: Deep Gender
Classification Based on an AdaBoost-based Fusion of Isolated Facial Features and
Foggy Faces”. Journal of Visual Communication and Image Representation. Available
from: https://arxiv.org/pdf/1706.04277.pdf
Mislav Grgic, Kresimir Delac, and Sonja Grgic. 2009. “SCface – surveillance cameras
face database”. Available from:
https://link.springer.com/article/10.1007/s11042-009-0417-2
N. Mohanty, A. Lee-St. John, R. Manmatha, T.M. Rath. 2013. “Shape-Based Image
Classification and Retrieval”. Available from:
https://www.sciencedirect.com/science/article/pii/B9780444538598000102
P. Jonathon Phillips, Hyeonjoon Moon, Patrick Rauss, and Jeffery Huang. 1998. “The
FERET Evaluation Methodology for Face-Recognition Algorithms”. Available from:
https://www.nist.gov/system/files/documents/2016/12/15/feret_database
Paul Viola and Michael Jones. 2001. “Rapid Object Detection using a Boosted Cascade
of Simple Features”. Available from: http://web.iitd.ac.in/~sumeet/viola-cvpr-01.pdf
Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu,
Pavel Kuksa. 2011. “Natural Language Processing (Almost) from Scratch”, Available
from: http://www.jmlr.org/papers/volume12/collobert11a/collobert11a.pdf
Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik. 2014. “Rich feature
hierarchies for accurate object detection and semantic segmentation”. Available from:
https://arxiv.org/abs/1311.2524
Ross Girshick. 2015. “Fast R-CNN”. Available from:
https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Girshick_
Sasan Karamizadeh, Shahidan M. Abdullah, Azizah A. Manaf, Mazdak Zamani, Alireza
Hooman. 2013. “An Overview of Principal Component Analysis”. Available from:
https://www.researchgate.net/publication/262527828_An_Overview_of_
Shahpour Alirezaee, Hassan Aghaeinia, Karim Faez, and Farid Askari. 2006. “An
Efficient Algorithm for Face Localization”. Available from:
https://www.researchgate.net/publication/240706465_An_Efficient_Algorithm
Shaoqing Ren, Xudong Cao, Yichen Wei, and Jian Sun. 2014. “Face Alignment at
3000 FPS via Regressing Local Binary Features”. Available from:
https://www.cv-foundation.org/openaccess/content_cvpr_2014/papers/Ren_Face_Alignment_at_2014_CVPR_paper.pdf
Simone Bianco. 2016. “Large age-gap face verification by feature injection in deep
networks”. Available from: https://arxiv.org/pdf/1602.06149.pdf
Song Han, Huizi Mao, and William J. Dally. 2016. “Deep Compression: Compressing
Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding”.
Available from: https://arxiv.org/pdf/1510.00149.pdf
Tara N. Sainath, Abdel-rahman Mohamed, Brian Kingsbury, Bhuvana Ramabhadran.
2013. “Deep Convolutional Neural Networks for LVCSR”. Available from:
http://www.cs.toronto.edu/~asamir/papers/icassp13_cnn.pdf
Timothy F. Cootes, Gareth J. Edwards, and Christopher J. Taylor. 2001. “Active
Appearance Models”, Available from:
https://people.eecs.berkeley.edu/~efros/courses/AP06/Papers/cootes-pami-01.pdf
Vineet Kushwaha, Maneet Singh, Richa Singh, Mayank Vatsa. 2018. “Disguised Faces
in the Wild”. Available from:
http://iab-rubric.org/papers/2018_CVPRW_disguised-faces-wild.pdf
Vittorio Castelli. “Nearest Neighbor Classifiers”. Available from:
http://www.ee.columbia.edu/~vittorio/lecture8.pdf
W. Zhao, R. Chellappa, P. J. Phillips, A. Rosenfeld. 2003. “Face Recognition: A Literature
Survey”, Available from: https://inc.ucsd.edu/~marni/Igert/Zhao_2003.pdf
Xinbo Gao, Ya Su, Xuelong Li, and Dacheng Tao. 2010. “A Review of Active Appearance
Models”. Available from:
https://www.researchgate.net/publication/220509234_A_Review
Xuehan Xiong and Fernando De la Torre. 2013. “Supervised Descent Method and Its
Applications to Face Alignment”. Available from:
https://www.ri.cmu.edu/pub_files/2013/5/main.pdf
Yue Wu and Qiang Ji. 2015. “Robust Facial Landmark Detection under Significant
Head Poses and Occlusion”. Available from:
https://openaccess.thecvf.com/content_iccv_2015/papers/Wu_Robust_Facial
Appendices
Appendix A: The pipeline to build a face recognition system for masked faces.
Appendix B: Embedding a face without a mask and the same face with a mask into a 128-dimensional vector.
Appendix C: System Configuration.
Practical work
In the present system, an application for masked facial recognition was implemented on
the author's personal computer with Python 3.7, which allowed connecting all the models
and initializing the system. The OpenCV library, which supports image processing, was
used together with several Python packages (NumPy, Pandas, Matplotlib, Scikit-Learn, os)
to facilitate reading the input images and encoding them as multi-dimensional tensors.
The author trained three different models (InceptionV3, MobileNetV2, VGG16) on a
masked-face detection dataset. Because the data was limited (854 images with 4072
faces), a data augmentation technique was applied to enlarge it. These three models
detect faces with and without a mask and return a prediction probability.
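A Keras ImageDataGenerator is one common way to implement such augmentation; the specific transform ranges below are assumptions, since the thesis does not list its exact settings:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Illustrative augmentation settings (assumed values).
aug = ImageDataGenerator(
    rotation_range=20,       # random rotations up to 20 degrees
    zoom_range=0.15,         # random zoom in/out
    width_shift_range=0.2,   # horizontal translation
    height_shift_range=0.2,  # vertical translation
    shear_range=0.15,
    horizontal_flip=True,
    fill_mode="nearest",     # fill pixels exposed by the transforms
)
```

Each training batch drawn from `aug.flow(...)` is a freshly transformed copy of the originals, so the model effectively sees a larger dataset.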
Then, the data was divided into training and testing datasets:
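With scikit-learn, the split can be sketched as below; the dummy arrays and the 20 % test fraction are assumptions standing in for the real dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Dummy stand-ins for the face crops and their mask / no-mask labels;
# the real arrays come from the 854-image dataset described above.
data = rng.random((100, 224, 224, 3), dtype=np.float32)
labels = np.array([0, 1] * 50)

# An 80/20 stratified split so both classes appear in both subsets.
trainX, testX, trainY, testY = train_test_split(
    data, labels, test_size=0.20, stratify=labels, random_state=42
)
```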
Also, some hyper-parameters were initialized, such as the learning rate, the number of
epochs, and the batch size:
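A possible initialization is shown below; only the epoch count (20) is confirmed by the reported results, while the learning rate and batch size are typical assumed values:

```python
# Hyper-parameters for the fine-tuning run.
INIT_LR = 1e-4     # initial learning rate (assumed value)
EPOCHS = 20        # number of training epochs (matches the reported runs)
BATCH_SIZE = 32    # batch size (assumed value)
```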
Then, the pre-trained model was downloaded, the architecture was set up with its
weights, and the input shape was defined as 224 × 224 × 3. In addition, an Average
Pooling layer of size (5 × 5) was used to reduce any overfitting that may occur and to
increase the training speed. Moreover, two activation functions were used (ReLU and
SoftMax):
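Using MobileNetV2 as one of the three backbones, this architecture can be sketched as follows; `weights=None` keeps the sketch offline (a real run would load the "imagenet" weights), and the 128-unit dense layer and dropout are assumed details not stated in the text:

```python
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.layers import (AveragePooling2D, Dense, Dropout,
                                     Flatten, Input)
from tensorflow.keras.models import Model

# Pre-trained backbone without its original classification head.
base = MobileNetV2(weights=None, include_top=False,
                   input_tensor=Input(shape=(224, 224, 3)))

# New head: 5x5 average pooling, a ReLU layer, and a two-way SoftMax
# output (mask / no mask).
head = AveragePooling2D(pool_size=(5, 5))(base.output)
head = Flatten()(head)
head = Dense(128, activation="relu")(head)
head = Dropout(0.5)(head)
head = Dense(2, activation="softmax")(head)

model = Model(inputs=base.input, outputs=head)

# Freeze the backbone so only the new head is trained.
for layer in base.layers:
    layer.trainable = False
```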
Furthermore, the Adam optimization algorithm was applied to update the network
weights iteratively based on the training data:
Then the training process was run:
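The compile-and-fit step might look like the sketch below; a tiny stand-in model and random data keep it self-contained and runnable, and only the general pattern (Adam optimizer, iterative weight updates over epochs) reflects the thesis, where the model is the fine-tuned pre-trained network and the arrays are the training split:

```python
import numpy as np
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam

# Minimal two-class stand-in model (assumed shapes, for illustration only).
model = Sequential([Input(shape=(64,)), Dense(2, activation="softmax")])
model.compile(optimizer=Adam(learning_rate=1e-4),
              loss="categorical_crossentropy",
              metrics=["accuracy"])

trainX = np.random.rand(32, 64).astype("float32")
trainY = np.eye(2)[np.random.randint(0, 2, 32)]   # one-hot labels

# Two epochs here just to show the call; the reported runs use 20 epochs.
history = model.fit(trainX, trainY, batch_size=8, epochs=2, verbose=0)
```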
The results of the first training run (20 epochs):
The results of the second training run (20 epochs):