COPYRIGHT AND CITATION CONSIDERATIONS FOR THIS THESIS/ DISSERTATION
o Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
o NonCommercial — You may not use the material for commercial purposes.
o ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
How to cite this thesis
Surname, Initial(s). (2012) Title of the thesis or dissertation. PhD. (Chemistry)/ M.Sc. (Physics)/ M.A. (Philosophy)/M.Com. (Finance) etc. [Unpublished]: University of Johannesburg. Retrieved from: https://ujcontent.uj.ac.za/vital/access/manager/Index?site_name=Research%20Output (Accessed: Date).
Computational Intelligence Techniques for
High-Dimensional Missing Data Estimation
by
Collins Achepsah Leke
A dissertation submitted to the Faculty of Engineering and the Built Environment in the fulfillment of the requirements for the degree of
Doctor of Engineering
in
Electrical and Electronic Engineering Science
at the
University of Johannesburg
Supervisor: Prof. Tshilidzi Marwala
Co-Supervisor: Prof. Bhekisipho Twala
2017
Declaration of Authorship
• This work was done mainly while in candidature for a research degree at this University.
• Where I have consulted the published work of others, this is always clearly attributed.
• Where I have quoted from the work of others, the source is always given. With the
exception of such quotations, this thesis is entirely my own work.
• I have acknowledged all main sources of help.
Signed:
Date:
Abstract
Missing data is a recurrent issue that leads to a variety of problems in the analysis and
processing of datasets. For this reason, missing data and ways of handling this problem
have been an active area of research across a variety of disciplines in recent times.
Most real-world datasets possess the properties of big data, namely volume, velocity and
variety. With an increase in volume, which includes both sample size and dimensionality,
existing imputation methods have become less effective and accurate. Much attention has
been given to narrow artificial intelligence frameworks owing to their efficiency in low-dimensional
settings. However, as dimensionality increases, these methods yield
unrepresentative imputations that affect decision-making processes. The goal
of this thesis is to present a new direction in the missing data estimation literature by
proposing novel methods aimed at finding approximations to missing values in
high-dimensional datasets, with emphasis placed on image recognition datasets, the
objective being to reconstruct corrupted images that can subsequently be used
in classification tasks. The features in these datasets represent the pixel values of the
images. To the best of our knowledge, high-dimensional missing data estimation using
deep learning approaches and swarm intelligence techniques has not yet been reported or
investigated.
The first contribution of this thesis is the presentation of novel ant-based optimization
deep learning missing data estimation approaches. The ant-based optimization algorithms
used are the Ant-Lion Optimizer (ALO) and Ant Colony Optimization (ACO). These optimization
algorithms are used in combination with a deep learning regression model. The
methods are compared against three existing approaches of a similar nature, namely a
hybrid multi-layer perceptron (MLP) auto-associative neural network (AANN) with a genetic
algorithm (GA), a hybrid AANN with simulated annealing (SA), and a hybrid AANN with
particle swarm optimization (PSO). The proposed methods show better overall performance,
whilst the existing methods require less computational time to obtain the
estimates. Statistical tests are performed to validate these findings.
The second contribution presented in this thesis is the proposition of novel flight-based
optimization deep learning missing data estimation techniques. The flight-based optimiza-
tion algorithms used are the Firefly Algorithm (FA), Bat Algorithm (BAT) and Cuckoo
Search (CS) algorithm. The algorithms are also hybridized with a deep learning regres-
sion model. The third contribution is the proposition of a novel plant-based optimization
missing data estimation technique. The plant-based optimization algorithm used is the
Invasive Weed Optimization (IWO) algorithm. This algorithm is hybridized with a deep
learning regression model. These approaches are compared against the existing methods.
Statistical tests are also done to validate the findings observed.
This thesis further provides a comparative analysis of the proposed methods. These
methods use the optimization algorithms to reduce, to an acceptable level, an
objective function obtained by training the regression model, a process during which the
correlations between inputs and outputs are preserved in the weights assigned to the edges
that link the different network layers. The objective function represents the square of the
disparity between the real output values and the estimated output values from a deep auto-encoder
network. When missing data is observed in the dataset, the objective function is
decomposed to incorporate both known and unknown feature variable values. Each layer
of the deep auto-encoder network is a restricted Boltzmann machine (RBM); these are
stacked together and trained in a supervised back-propagation learning approach using
the stochastic gradient descent (SGD) algorithm. All the experiments conducted in this
thesis are done from a high-dimensional perspective.
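In symbols, the estimation problem described above can be sketched as follows (the notation here is ours, not the thesis's: x_k denotes the known feature values, x_u the unknown ones, and f the trained deep auto-encoder with fixed weight vector w):

```latex
\min_{\vec{x}_u} \; \varepsilon(\vec{x}_u)
  = \left\lVert
      \begin{pmatrix} \vec{x}_k \\ \vec{x}_u \end{pmatrix}
      - f\!\left( \vec{w}, \begin{pmatrix} \vec{x}_k \\ \vec{x}_u \end{pmatrix} \right)
    \right\rVert^{2}
```

The swarm intelligence algorithms search over candidate values of \(\vec{x}_u\) while the network weights \(\vec{w}\) remain fixed after training.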
The results obtained from the experiments conducted in this thesis reveal that the most
effective method proposed is the one comprising the deep learning model and the ant
colony optimization algorithm, which yields the best evaluation metric values. The method
which leads to the worst performance metric values is consistently that comprising
the deep learning model and the firefly algorithm. The statistical t-tests performed
further reveal that this worst-performing approach yields estimates which are significantly
different from those of the other five methods at a 95% confidence level, resulting in very
low p-values, well below the 0.05 significance threshold, when these are compared in pairs. It
is observed that only when the objective function values per sample are considered does
the deep learning model-ant colony optimization algorithm approach not yield the
best values. Rather, it is the deep learning model with the bat algorithm approach that
results in the lowest values in this scenario.
Dedication
To my dad and sister,
Leke Betechuoh Casimir & Leke Sydonie Kesangha
and their everlasting memory and love.
Acknowledgements
First, I would like to address my gratitude to my supervisors Prof. Tshilidzi Marwala
and Prof. Bhekisipho Twala for giving me the opportunity to work with them and for
believing in my potential.
Special gratitude goes to Dr. Richard Ndjiongue and Mr. Kesi Fanoro for their valuable
assistance and support.
My heartfelt gratitude extends to my Mum (Leke Nee Nchangeh Agatha Fonkeng), brothers
(Leke Betechuoh Brian and Leke Fonkeng Clarence) and sister (Gwendoline Tasong).
They have all made me who I am today. We have been through a lot, but through it all,
they kept me strong and going. There are no words to say how much I am grateful and
thankful to them and for them.
My gratitude also goes to all my friends who supported me morally and through their
prayers throughout this journey, especially Nqobile Dudu who provided me with a voice
of reason and sanity through the difficult times I went through while completing this
research. Everyone deserves to have such a person in their life.
Above all, I praise the Almighty GOD, for always strengthening and leading me.
Dissertation Related
Publications
Conferences
[1] Collins Leke, A. R. Ndjiongue, Bhekisipho Twala, and Tshilidzi Marwala
Deep Learning-Bat High-Dimensional Missing Data Estimator
(accepted) in 2017 IEEE International Conference on Systems, Man and Cybernetics
(SMC) - October 5-8, 2017, Banff, Canada.
[2] Collins Leke, A. R. Ndjiongue, Bhekisipho Twala, and Tshilidzi Marwala
A Deep Learning-Cuckoo Search Method for Missing Data Estimation in High-Dimensional Datasets
(accepted) in 2017 (Springer) International Conference in Swarm Intelligence (ICSI) -
July 27 - August 1, 2017, Fukuoka, Japan.
[3] Collins Leke and Tshilidzi Marwala
Missing Data Estimation in High-Dimensional Datasets: A Swarm Intelligence-
Deep Neural Network Approach
(Springer) International Conference in Swarm Intelligence - June 25-30, 2016, Bali, In-
donesia.
[4] Collins Leke, Bhekisipho Twala and Tshilidzi Marwala
Modeling of missing data prediction: Computational intelligence and opti-
mization algorithms
IEEE International Conference on Systems, Man and Cybernetics (SMC) - October 5-8,
2014, San Diego, CA, USA.
Other Publications
[5] Collins Leke, Bhekisipho Twala and Tshilidzi Marwala
Missing Data Prediction and Classification: The Use of Auto-Associative Neural Networks and Optimization Algorithms
(CoRR), arXiv, http://arxiv.org/abs/1403.5488, abs/1403.5488, 2014.
[6] Collins Leke, Satyakama Paul and Tshilidzi Marwala
Proposition of a Theoretical Model for Missing Data Imputation using Deep
Learning and Evolutionary Algorithms
(CoRR), arXiv, http://arxiv.org/abs/1512.01362, abs/1512.01362, 2015.
Contents
Declaration of Authorship . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Dissertation Related Publications . . . . . . . . . . . . . . . . . . . . . . . viii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xviii
List of Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xx
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-1
1.1 Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-1
1.2 Rationale . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-3
1.3 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-4
1.4 Contribution of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-5
1.5 Overview of Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-7
1.6 Structure of the Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-8
2 Literature Review and Background on Approaches for Dealing with Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-1
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-1
2.2 Missing Data Proportions . . . . . . . . . . . . . . . . . . . . . . . . . . 2-1
2.3 Missing Data Mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . . 2-2
2.3.1 Missing Completely at Random (MCAR) . . . . . . . . . . . . . . 2-2
2.3.2 Missing at Random (MAR) . . . . . . . . . . . . . . . . . . . . . 2-3
2.3.3 Non-Ignorable Case or Missing Not at Random (MNAR) . . . . . 2-4
2.3.4 Missing by Natural Design (MBND) . . . . . . . . . . . . . . . . 2-4
2.4 Missing Data Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-5
2.5 Classical Missing Data Techniques . . . . . . . . . . . . . . . . . . . . . . 2-6
2.5.1 List-Wise or Case-Wise Deletion . . . . . . . . . . . . . . . . . . . 2-6
2.5.2 Pair-Wise Deletion . . . . . . . . . . . . . . . . . . . . . . . . . . 2-6
2.5.3 Mean-mode Substitution . . . . . . . . . . . . . . . . . . . . . . . 2-7
2.5.4 Imputation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-7
2.5.5 Regression Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 2-9
2.6 Machine Learning Approaches to Missing Data . . . . . . . . . . . . . . . 2-10
2.6.1 Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-10
2.6.2 Artificial Neural Networks (ANNs) . . . . . . . . . . . . . . . . . 2-11
2.6.3 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . 2-14
2.7 Machine Learning Optimization . . . . . . . . . . . . . . . . . . . . . . . 2-14
2.7.1 Genetic Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 2-14
2.7.2 Particle Swarm Optimization . . . . . . . . . . . . . . . . . . . . 2-15
2.7.3 Simulated Annealing . . . . . . . . . . . . . . . . . . . . . . . . . 2-16
2.8 Deep Learning (DL) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-16
2.8.1 Restricted Boltzmann Machine (RBM) . . . . . . . . . . . . . . . 2-17
2.8.2 Contrastive Divergence (CD) . . . . . . . . . . . . . . . . . . . . 2-19
2.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-21
3 Novel Ant-based Missing Data Estimators . . . . . . . . . . . . . . . . 3-1
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-1
3.2 Experimental Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-2
3.2.1 Statement of Hypothesis and Research Question . . . . . . . . . . 3-2
3.2.2 Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . 3-2
3.3 Optimization Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-7
3.3.1 Ant Colony Optimization (ACO) . . . . . . . . . . . . . . . . . . 3-7
3.3.2 Ant-Lion Optimizer (ALO) . . . . . . . . . . . . . . . . . . . . . 3-8
3.4 Performance Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . 3-11
3.5 Deep-Learning-Ant Colony Optimization (DL-ACO) Estimator . . . . . . 3-14
3.6 Deep-Learning-Ant Lion Optimizer (DL-ALO) Estimator . . . . . . . . . 3-19
3.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-24
4 Novel Flight-based Missing Data Estimators . . . . . . . . . . . . . . . 4-1
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-1
4.2 Experimental Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-2
4.2.1 Statement of Hypothesis and Research Question . . . . . . . . . . 4-2
4.2.2 Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . 4-2
4.3 Optimization Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-4
4.3.1 Cuckoo Search (CS) . . . . . . . . . . . . . . . . . . . . . . . . . 4-4
4.3.2 Bat Algorithm (BAT) . . . . . . . . . . . . . . . . . . . . . . . . 4-6
4.3.3 Firefly Algorithm (FA) . . . . . . . . . . . . . . . . . . . . . . . . 4-8
4.4 Deep Learning-Cuckoo Search (DL-CS) Estimator . . . . . . . . . . . . . 4-10
4.5 Deep Learning-Bat Algorithm (DL-BAT) Estimator . . . . . . . . . . . . 4-15
4.6 Deep Learning-Firefly Algorithm (DL-FA) Estimator . . . . . . . . . . . 4-21
4.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-26
5 Novel Plant-based Missing Data Estimator and Comparative Analysis 5-1
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-1
5.2 Experimental Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-2
5.2.1 Statement of Hypothesis and Research Question . . . . . . . . . . 5-2
5.2.2 Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . 5-2
5.3 Optimization Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-3
5.3.1 Invasive Weed Optimization (IWO) . . . . . . . . . . . . . . . . . 5-3
5.4 Deep Learning-Invasive Weed Optimization (DL-IWO) Estimator . . . . 5-5
5.5 Comparative Analysis of Proposed Approaches . . . . . . . . . . . . . . . 5-10
5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-17
6 Concluding Remarks and Future Research . . . . . . . . . . . . . . . . 6-1
6.1 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-1
6.1.1 Research Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 6-1
6.1.2 Results Summary and Discussions . . . . . . . . . . . . . . . . . . 6-2
6.2 Avenues for Future Research . . . . . . . . . . . . . . . . . . . . . . . . . 6-3
6.2.1 Apply Alternative Machine Learning Techniques . . . . . . . . . . 6-3
6.2.2 Apply Different Optimization Techniques . . . . . . . . . . . . . . 6-4
6.2.3 Compare to Other Models using Similar Datasets . . . . . . . . . 6-4
6.3 Alternative Areas of Application . . . . . . . . . . . . . . . . . . . . . . . 6-5
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rf-1
List of Figures
1.1 MNIST Dataset Sample Images. Top Row: Real Data; Bottom Row: Data with Missing Pixel Values . . . . . . . . . . . . . . . . . . . . . . . . . 1-5
3.1 Data Imputation Configuration. . . . . . . . . . . . . . . . . . . . . . . . 3-3
3.2 Stacked Auto-encoder Network Structure. . . . . . . . . . . . . . . . . . 3-4
3.3 Missing Data Estimator Structure. . . . . . . . . . . . . . . . . . . . . . 3-5
3.4 Mean Squared Error vs Estimation Approach. . . . . . . . . . . . . . . . 3-14
3.5 Root Mean Squared Logarithmic Error vs Estimation Approach. . . . . . 3-15
3.6 Correlation Coefficient vs Estimation Approach. . . . . . . . . . . . . . . 3-15
3.7 Global Deviation vs Estimation Approach. . . . . . . . . . . . . . . . . . 3-16
3.8 Top Row: Corrupted Images - Bottom Row: DL-ACO Reconstructed Images. 3-18
3.9 Top Row: MLP-PSO Reconstructed Images - Bottom Row: MLP-GA Reconstructed Images. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-18
3.10 Mean Squared Error vs Estimation Approach. . . . . . . . . . . . . . . . 3-19
3.11 Root Mean Squared Logarithmic Error vs Estimation Approach. . . . . . 3-20
3.12 Correlation Coefficient vs Estimation Approach. . . . . . . . . . . . . . . 3-20
3.13 Relative Prediction Accuracy vs Estimation Approach. . . . . . . . . . . 3-21
3.14 Top Row: Corrupted Images - Bottom Row: DL-ALO Reconstructed Images. 3-23
3.15 Top Row: MLP-PSO Reconstructed Images - Bottom Row: MLP-GA Reconstructed Images. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-23
4.1 Mean Squared Error vs Estimation Approach. . . . . . . . . . . . . . . . 4-10
4.2 Root Mean Squared Logarithmic Error vs Estimation Approach. . . . . . 4-11
4.3 Correlation Coefficient vs Estimation Approach. . . . . . . . . . . . . . . 4-11
4.4 Relative Prediction Accuracy vs Estimation Approach. . . . . . . . . . . 4-12
4.5 Top Row: Corrupted Images - Bottom Row: DL-CS Reconstructed Images. 4-12
4.6 Top Row: MLP-PSO Reconstructed Images - Bottom Row: MLP-GA Reconstructed Images. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-13
4.7 Squared Error vs Estimation Approach. . . . . . . . . . . . . . . . . . . . 4-15
4.8 Correlation Coefficient vs Estimation Approach. . . . . . . . . . . . . . . 4-16
4.9 Relative Prediction Accuracy vs Estimation Approach. . . . . . . . . . . 4-17
4.10 Root Mean Squared Logarithmic Error vs Estimation Approach. . . . . . 4-18
4.11 Top Row: Corrupted Images - Bottom Row: DL-BAT Reconstructed Images. 4-20
4.12 Top Row: MLP-PSO Reconstructed Images - Bottom Row: MLP-GA Reconstructed Images. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-20
4.13 Global Deviation vs Estimation Approach. . . . . . . . . . . . . . . . . . 4-21
4.14 Mean Squared Error vs Estimation Approach. . . . . . . . . . . . . . . . 4-22
4.15 Root Mean Squared Logarithmic Error vs Estimation Approach. . . . . . 4-22
4.16 Correlation Coefficient vs Estimation Approach. . . . . . . . . . . . . . . 4-23
4.17 Top Row: Corrupted Images - Bottom Row: DL-FA Reconstructed Images. 4-25
4.18 Top Row: MLP-PSO Reconstructed Images - Bottom Row: MLP-GA Reconstructed Images. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-25
5.1 Mean Squared Error vs Estimation Approach. . . . . . . . . . . . . . . . 5-6
5.2 Root Mean Squared Logarithmic Error vs Estimation Approach. . . . . . 5-6
5.3 Relative Prediction Accuracy vs Estimation Approach. . . . . . . . . . . 5-7
5.4 Correlation Coefficient vs Estimation Approach. . . . . . . . . . . . . . . 5-7
5.5 Top Row: Corrupted Images - Bottom Row: DL-IWO Reconstructed Images. 5-9
5.6 Top Row: MLP-PSO Reconstructed Images - Bottom Row: MLP-GA Reconstructed Images. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-10
5.7 Squared Error vs Estimation Approach. . . . . . . . . . . . . . . . . . . . 5-10
5.8 Mean Absolute Error vs Estimation Approach. . . . . . . . . . . . . . . . 5-11
5.9 Root Mean Squared Logarithmic Error vs Estimation Approach. . . . . . 5-12
5.10 Relative Prediction Accuracy vs Estimation Approach. . . . . . . . . . . 5-13
List of Tables
2.1 Univariate Missing Data Pattern . . . . . . . . . . . . . . . . . . . . . . 2-5
2.2 Arbitrary Missing Data Pattern . . . . . . . . . . . . . . . . . . . . . . . 2-5
2.3 Monotone Missing Data Pattern . . . . . . . . . . . . . . . . . . . . . . . 2-5
3.1 ACO Parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-8
3.2 DL-ACO Mean Squared Error Objective Value Per Sample. . . . . . . . . 3-16
3.3 DL-ACO Additional Metrics. . . . . . . . . . . . . . . . . . . . . . . . . 3-17
3.4 Statistical Analysis of DL-ACO Results. . . . . . . . . . . . . . . . . . . 3-17
3.5 DL-ALO Mean Squared Error Objective Value Per Sample. . . . . . . . . 3-21
3.6 DL-ALO Additional Metrics. . . . . . . . . . . . . . . . . . . . . . . . . . 3-22
3.7 Statistical Analysis of DL-ALO Results. . . . . . . . . . . . . . . . . . . 3-22
4.1 CS Parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-6
4.2 BAT Parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-8
4.3 FA Parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-9
4.4 DL-CS Mean Squared Error Objective Value Per Sample. . . . . . . . . . 4-13
4.5 DL-CS Additional Metrics. . . . . . . . . . . . . . . . . . . . . . . . . . . 4-14
4.6 Statistical Analysis of DL-CS Results . . . . . . . . . . . . . . . . . . . . 4-14
4.7 DL-BAT Additional Metrics. . . . . . . . . . . . . . . . . . . . . . . . . . 4-18
4.8 DL-BAT Mean Squared Error Objective Value Per Instance. . . . . . . . 4-19
4.9 Statistical Analysis of DL-BAT Results. . . . . . . . . . . . . . . . . . . 4-19
4.10 DL-FA Additional Metrics. . . . . . . . . . . . . . . . . . . . . . . . . . . 4-23
4.11 DL-FA Mean Squared Error Objective Value Per Sample. . . . . . . . . . 4-24
4.12 Statistical Analysis of DL-FA Results. . . . . . . . . . . . . . . . . . . . . 4-24
5.1 IWO Parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-5
5.2 DL-IWO Mean Squared Error Objective Value Per Sample. . . . . . . . . 5-8
5.3 DL-IWO Additional Metrics. . . . . . . . . . . . . . . . . . . . . . . . . . 5-8
5.4 Statistical Analysis of DL-IWO Model Results. . . . . . . . . . . . . . . . 5-9
5.5 Model Additional Metrics. . . . . . . . . . . . . . . . . . . . . . . . . . . 5-14
5.6 Model Mean Squared Error Objective Values Per Sample. . . . . . . . . . 5-14
5.7 Statistical Analysis of Model Results. . . . . . . . . . . . . . . . . . . . . 5-15
List of Abbreviations
AANN Auto-Associative Neural Networks
ACO Ant Colony Optimization
ALO Ant Lion Optimizer
ANN Artificial Neural Networks
BAT Bat Algorithm
CD Contrastive Divergence
COD Coefficient of Determination
CS Cuckoo Search
DAE Deep Auto-Encoder
DL Deep Learning
FA Firefly Algorithm
GA Genetic Algorithm
GD Global Deviation
IWO Invasive Weed Optimization
MAE Mean Absolute Error
MAR Missing at Random
MBND Missing by Natural Design
MCAR Missing Completely at Random
MI Multiple Imputation
MLP Multi-Layer Perceptron
MNAR Missing not at Random
MSE Mean Square Error
PCA Principal Component Analysis
PSO Particle Swarm Optimization
r Correlation Coefficient
RBM Restricted Boltzmann Machine
RMSLE Root Mean Square Logarithmic Error
SA Simulated Annealing
SAE Stacked Auto-Encoder
SCG Scaled Conjugate Gradient
SE Squared Error
SGD Stochastic Gradient Descent
SNR Signal-to-Noise Ratio
1 Introduction
1.1 Missing Data
The presence of missing data in datasets, as reported in previous research across a variety
of academic domains, renders data analysis tasks and decision-making processes non-trivial.
From this observation, one can assume that reliable and accurate decisions are more likely
to be made when complete records are used rather than incomplete datasets. This
presumption has driven a great deal of research in the data mining domain, with the
introduction of novel methods that accurately perform the task of filling in missing data.
Research indicates that operations in a variety of professional sectors, for example in
medicine, manufacturing and energy, rely on instrument sensors to report very important
information that is subsequently used to make decisions. These sensors may fail, leading
to missing entries in the dataset and thereby influencing the nature of the decisions made.
In such scenarios, it is of great importance to have a system that can impute, with high
accuracy, the missing data from these faulty sensors. Such an imputation framework
needs to take into consideration the existing correlations between the information obtained
from the sensors in the system in order to accurately estimate the missing data. Another
scenario in which the missing data problem hinders decision making is image recognition,
where missing pixel values render the task of predicting or classifying an image difficult.
It is therefore paramount to have a system capable of estimating these missing pixel
values with high accuracy to make these tasks easier and more feasible.
Datasets nowadays, such as those that record production, manufacturing and medical
data, may suffer from missing data at different phases of the data collection and storage
processes. Faults in measuring instruments or data transmission lines are predominant
causes of missing data. The occurrence of missing data creates difficulties for decision-making
and analysis tasks that rely on access to complete and accurate data, creating a need for
data estimation techniques that are not only accurate but also efficient. Several methods
exist to alleviate the problems presented by missing data, ranging from deleting records
with missing attributes (list-wise and pair-wise data deletion) to approaches that employ
statistical and artificial intelligence methods, such as hybrid neural network and
evolutionary algorithm approaches. The problem, however, is that some of the statistical
and naive approaches produce biased approximations, or make false assumptions about the
data and the correlations within it. These have adverse effects on decision-making
processes that are data dependent.
Furthermore, missing data has long been a challenge both in the real world and within
the research community. Decision-making processes that rely on accurate knowledge depend
on the availability of data from which information can be extracted. Such processes often
require predictive models or other computational intelligence techniques that use the
observed data as inputs. However, data can be lost, corrupted or recorded incompletely
for various reasons, which negatively affects its quality. The majority of decision-making
and machine learning frameworks, such as Artificial Neural Networks (ANNs), Support
Vector Machines (SVMs), Principal Component Analysis (PCA) and others, cannot be
used for decision making and data analysis if the data is incomplete. Missing values can
critically influence pattern recognition and classification tasks. Since the decision output
should still be maintained despite the missing data, it is important to deal with the
problem. Therefore, when data is incomplete or missing, the initial step in processing it
is estimating the missing values.
In addition, the way in which the missing data problem is handled depends on the
reason for the presence of the missing data. According to [1], there exist three mechanisms
by which this can happen: Missing at Random (MAR), Missing Completely at Random
(MCAR), and Missing Not at Random (MNAR), also called the non-ignorable case. A
fourth missing data mechanism is Missing by Natural Design (MBND).
1.2 Rationale
The applications of missing data estimation techniques are vast; however, the existing
methods depend on the nature of the data and the pattern of missingness, and are
predominantly implemented on low-dimensional datasets. Application areas include, but are
not limited to, modeling (structural equation modeling) [2], environmental observations
(air quality data) [3] and time series (reconstruction of time series data) [4]. The authors
in [5] used MLP autoencoder networks, principal component analysis and support vector
machines in combination with the genetic algorithm to impute missing data, and in [6],
they investigated the performance of robust regression imputation in datasets with
outliers. Missing data imputation via a multi-objective genetic algorithm technique is
presented in [7]. The results obtained indicate that the proposed approach outperforms
several popular missing data imputation techniques, yielding accuracy values in the
nineties (above 90%).
The authors in [8] implemented a hybrid system comprising a genetic algorithm
and a neural network to impute missing values within a single feature variable at a time,
in scenarios where the number of missing values within this variable varied. In [9], the
authors proposed a novel system that hybridizes the k-Nearest Neighbour with a Neural
Network to impute missing values within a single feature variable. In [10], hybrid systems
made up of an Auto-Associative Neural Network (AANN) and the Particle Swarm
Optimization (PSO), Simulated Annealing (SA) and Genetic Algorithm (GA) optimization
techniques were created and applied to estimate missing values, yielding high accuracies in
scenarios where a single feature variable was affected by the problem of missing data. Other
researchers, such as [11] and [12], used neural networks with Principal Component Analysis
(PCA) and GA to solve the missing data problem. In [13], it was suggested that information
within records with missing values be used in the missing data estimation task. This
resulted in the introduction of a Non-Parametric Iterative Imputation Algorithm (NIIA),
which yielded a classification accuracy of at most 87.3% on the imputation of discrete
values, and a root mean squared error of at least 0.5 on the imputation of continuous
values, as the missing data ratios were varied. In [14], a Shell-Neighbour Imputation (SNI)
approach making use of the shell-neighbour method is proposed and applied to the missing
data imputation problem. The results obtained indicate that the proposed method
performs better than k-Nearest Neighbour Imputation when imputation and classification
accuracy are considered. This is because the method considers the right and left nearest
neighbours of the missing data and uses varying numbers of nearest neighbours, as opposed
to the k-Nearest Neighbour method, which uses a fixed number of k nearest neighbours.
New techniques aimed at solving the missing data problem, and comparisons between
these and existing methods, can be found in [15]-[17].
These techniques mainly cater to low-dimensional datasets with missing values, but are
less effective on high-dimensional datasets in which missingness occurs in an uncontrolled
manner. The main motivation behind this thesis is therefore to introduce high-dimensional
missing data estimation approaches, with emphasis on image recognition datasets. These
approaches are used to reconstruct corrupted images by estimating missing image pixel
values. The reconstructed images can then be used to test classification models.
1.3 Problem Statement
As previously mentioned, most existing missing data imputation techniques cater
to low-dimensional datasets. With the introduction and design of more sophisticated
computational and swarm intelligence methods, it is therefore worth making an analyst
aware of which method(s) is (are) best suited to a certain kind of dataset, in this case,
image recognition datasets. In this research, six optimization algorithms (Ant Colony
Optimization (ACO), Ant Lion Optimizer (ALO), Cuckoo Search (CS), Bat Algorithm
(BAT), Firefly Algorithm (FA) and Invasive Weed Optimization (IWO)) are used in
combination with a deep learning regression model on a high-dimensional dataset to
compare their missing data imputation capabilities.
In most sectors, decisions that originate from data rely upon the availability of complete
and accurate data. Inferences drawn from complete datasets, with all the information
available, are therefore inclined towards reliability and usefulness, as opposed to inferences
drawn from incomplete datasets. The problem of missing data in datasets may arise in a
variety of ways, some more compelling than others; examples include failures of the sensors
meant to record the data and data entry errors. In this thesis, the objective is to
reconstruct images by imputing missing pixel values.
Let us take for example a high-dimensional dataset like the Modified National Institute
of Standards and Technology (MNIST) dataset [18], which contains 784 feature variables.
These feature variables are the pixel values of an image. Some of the images from the
dataset are shown in Figure 1.1. Let us then assume that some images are corrupted,
leading to missing pixel values (bottom row of Figure 1.1), and that statistical analysis
is needed to classify the records in the dataset. The questions that need answering
are: (i) Is it possible to impute the missing data in high-dimensional datasets with some
degree of certainty and with high accuracy? (ii) Is it possible to design new methods
that outperform existing approaches to the problem of missing data by approximating
the missing data while considering the correlations and interrelationships between the
variables?
Figure 1.1: MNIST Dataset Sample Images. Top Row - Real Data; Bottom Row - Data with Missing Pixel Values
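To make the setting concrete, the corruption of an image can be simulated by removing a random subset of its 784 pixel values. The sketch below is purely illustrative and is not the thesis code (which was written in MATLAB); it uses a random vector in place of an actual MNIST record and NaN as the missing-value marker:

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(image, missing_fraction=0.2, rng=rng):
    """Return a copy of `image` with a random subset of pixels set to NaN,
    NaN being our marker for a missing pixel value."""
    corrupted = image.astype(float).copy()
    n_missing = int(missing_fraction * corrupted.size)
    idx = rng.choice(corrupted.size, size=n_missing, replace=False)
    corrupted.flat[idx] = np.nan
    return corrupted

image = rng.random(784)        # stand-in for one 28x28 MNIST record
corrupted = corrupt(image, missing_fraction=0.2)
```

An imputation method is then judged by how closely its estimates for the NaN positions recover the original pixel values.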
Therefore, there are two main objectives in this research. The first is approximating the
missing values in the data using novel approaches that combine six optimization algorithms
with a deep learning regression model. The second is carrying out a comparative analysis
of the proposed approaches to observe which approach performs best under the
circumstances, and why.
1.4 Contribution of Thesis
The research done in this thesis presents a new direction regarding what needs to
be done when presented with the problem of missing data in high-dimensional datasets,
from analysing and preparing the data to estimating the missing values. It makes use of an
image recognition dataset to analyse and evaluate the performances of six models. The
evaluation metric values of the proposed individual models on the dataset are compared
against existing approaches, and subsequently, the results for the different models are
compared against each other to provide some form of generalization based on these
results. The objective of these comparisons and analyses is to identify the best performing
method on the dataset and, in this way, eradicate the trial-and-error approach often used
to identify the method best suited to a problem before implementing it; instead, whichever
method has been identified as the best can simply be used. Done correctly, this procedure
cuts down on the time it takes to reconstruct images, which can subsequently be used for
classification tasks. This research also aims to identify the key evaluation metrics.
Statistical tests were performed to confirm the outcomes obtained from the experiments,
with the aim of establishing the statistical significance of the results.
As mentioned before, missing data in a dataset leads to a variety of problems; one
use of the work in this thesis is therefore that it presents novel and adequate approaches
for addressing the problem by estimating the missing data in the dataset. Furthermore,
besides the contributions already presented, another important contribution of the thesis
is that it suggests a research direction in the literature on missing data estimation in
high-dimensional datasets, making use of deep learning and swarm/meta-heuristic
optimization techniques.
A more succinct outline of the contributions of the thesis is:
• Novel high-dimensional missing data estimation approaches are proposed which
combine ant-based optimization algorithms with a deep learning regression model;
• Novel missing data estimation approaches are proposed combining flight-based
optimization algorithms with a deep learning regression model;
• A novel missing data estimation approach is proposed combining a plant-based
optimization algorithm with a deep learning regression model; and
• A comparative analysis of the proposed methods is carried out, which includes
statistical tests to further back the findings and, essentially, suggests a technique to
be applied on datasets with similar properties.
1.5 Overview of Approach
The methodology implemented in this thesis begins with pre-processing the data from
the dataset. This procedure constitutes the normalization of the data, which reduces the
variation in values between feature variables and ensures that the network generates
representative outputs. Six optimization algorithms are applied to minimize an error
function derived from training a deep learning regression model with the stochastic
gradient descent method. A portion of the normalized data from the training set is
presented to the deep learning network architecture for training. An error function is
then derived, defined mathematically as the square of the disparity between the real
outputs and the estimated model outputs. In this research, data entries from any of the
feature variables in the test set may be missing simultaneously, and the error function is
therefore reformulated to incorporate both the unknown and the known input values. The
usual routine is to create missing values in one or more specific features and then estimate
these; to the best of our knowledge, the uncontrolled nature of the missing data within
the test set is an aspect which has not yet been investigated or reported. Restricted
Boltzmann Machines (RBMs) are used to train the individual layers of the network in an
unsupervised manner; these layers are subsequently joined to form the encoding part of
the network and then transposed to make up the decoding part. The stochastic gradient
descent (SGD) algorithm is applied to train the network on the training set of data in a
supervised manner. The optimal network structure consists of an input layer, seven hidden
layers and an output layer. The numbers of nodes in the hidden layers are obtained from
an initial suggestion made in [19] and by performing cross-validation on a held-out subset
of the training data, known as the validation data. With the optimal network structure
obtained via training, the swarm algorithms are used to identify the optimal combinations
of network and algorithm parameters. The missing data estimation procedure is then
performed with the parameters identified in the previous step. The expected outputs are
compared against the estimated outputs to yield insight into how the methods perform.
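The shape of the estimation step can be sketched as follows. This is an illustrative Python sketch, not the thesis implementation (which was written in MATLAB): the "autoencoder" `f` is a toy stand-in, and a naive random search stands in for the ACO/ALO/CS/BAT/FA/IWO optimizers, purely to show how the reformulated error function treats the known and unknown parts of a record.

```python
import numpy as np

def reconstruction_error(z, x, known_mask, f):
    """Squared error ||candidate - f(candidate)||^2, with trial values z
    substituted into the missing positions of record x."""
    candidate = np.where(known_mask, x, 0.0)
    candidate[~known_mask] = z            # plug trial values into the gaps
    return float(np.sum((candidate - f(candidate)) ** 2))

# Toy stand-ins (illustrative only): a crude "autoencoder" mapping and a
# record with two missing entries.
f = lambda v: 0.5 * (v + v.mean())
rng = np.random.default_rng(1)
x = rng.random(8)
known = np.ones(8, dtype=bool)
known[[2, 5]] = False

# Naive random search in place of the swarm optimizers, to show the loop.
best_z, best_err = None, np.inf
for _ in range(200):
    z = rng.random(int((~known).sum()))
    err = reconstruction_error(z, x, known, f)
    if err < best_err:
        best_z, best_err = z, err
```

The swarm algorithms play exactly the role of the search loop here: they propose candidate values for the unknown entries and keep those that minimize the reconstruction error of the trained network.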
To assess the performances of the methods as high-dimensional estimators of missing
values, an image recognition dataset is used. Entries from the test set were removed
and approximated using the models, all of which were coded in MATLAB. To measure
the accuracies of the methods as estimators, eight error metrics are used: Squared Error
(SE), Mean Square Error (MSE), Mean Absolute Error (MAE), Root Mean Square
Logarithmic Error (RMSLE), Global Deviation (GD), Relative Prediction Accuracy (RPA),
Signal-to-Noise Ratio (SNR) and Coefficient of Determination (COD). These error
metrics were selected because they have been applied as performance measures for missing
data estimation problems in a variety of research reports ( [5], [8], [10] and [20]), and
because they are convenient. The correlation coefficient (r) between the estimated and
expected output values is also used to provide further insight into the relationship between
the estimated values and the real values. Statistical t-tests are performed to back the
findings from the metrics and to establish the statistical significance of the results obtained.
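As an illustration, a few of these metrics can be computed from their standard definitions. This is a sketch only: the formulas below are the textbook definitions of MSE, MAE, RMSLE, COD and r, not code extracted from the thesis.

```python
import numpy as np

def mse(y, yhat):
    return float(np.mean((y - yhat) ** 2))

def mae(y, yhat):
    return float(np.mean(np.abs(y - yhat)))

def rmsle(y, yhat):
    # assumes non-negative values, as for pixel intensities in [0, 1]
    return float(np.sqrt(np.mean((np.log1p(yhat) - np.log1p(y)) ** 2)))

def cod(y, yhat):
    # coefficient of determination (R^2)
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return float(1.0 - ss_res / ss_tot)

y = np.array([0.2, 0.4, 0.6, 0.8])         # expected (real) values
yhat = np.array([0.25, 0.35, 0.65, 0.75])  # estimated values
r = float(np.corrcoef(y, yhat)[0, 1])      # correlation coefficient r
```

Lower MSE, MAE and RMSLE indicate a better estimator, while COD and r approach 1 as the estimated values track the real values more closely.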
1.6 Structure of the Report
Chapter 2: Literature Review and Background on Approaches for Dealing
with Missing Data presents background information on the deep learning model
components, namely RBMs, contrastive divergence and auto-encoders, as well as some
of their areas of application. Background on missing data is also presented in this chapter,
along with some existing methods aimed at addressing the problem.
Chapter 3: Ant-based Missing Data Estimators introduces the two ant-based
optimization missing data estimation models proposed and tested in this research.
The optimization algorithms used are the ACO and ALO algorithms. Details about the
algorithms are presented, as is some of the related work that applied these techniques.
The methodology used in implementing the estimators is presented along with the results
from the experiments conducted.
Chapter 4: Flight-based Missing Data Estimators presents the three flight-based
optimization missing data estimation models proposed and tested in this research.
The optimization algorithms used are the CS, BAT and FA methods. Details about
these algorithms and their implementation are presented, as is some of the related work
that applied these techniques. The methodology used in implementing the estimators is
presented along with the results from the experiments conducted.
Chapter 5: Plant-based Missing Data Estimator and Comparative Analysis
presents the plant-based optimization missing data estimation model proposed and tested.
The optimization algorithm used is the IWO algorithm. The methodology used in
implementing the estimator is presented along with the results from the experiments. Also
presented in this chapter is the fourth contribution of the thesis, which entails comparing
the proposed methods against each other to identify which performs best, with statistical
tests performed to back the results obtained.
Chapter 6: Concluding Remarks and Future Research presents the concluding
remarks on this research. This chapter also presents areas for possible research in the
future.
2 Literature Review and Background on Approaches for Dealing with Missing Data
2.1 Introduction
The presence of missing data affects the quality of a dataset, which in turn impacts
the analysis and interpretation of the data. Several reasons could lead to data being
missing in a dataset, some more predominant than others. The first well-known reason is
participants declining to reveal personal and sensitive information, for example their
monthly income. The second main reason is the failure of the systems meant to capture
and store the data in databases. Another main reason is interoperability, whereby
information exchanged between systems may be subject to missing data.
This chapter gives the literature review of this research. Section 2.2 discusses missing
data proportions. A background on missing data mechanisms is given in Section 2.3,
followed by an introduction to missing data patterns in Section 2.4. A discussion of
classical missing data techniques is presented in Section 2.5, followed by machine learning
approaches to missing data in Section 2.6. Section 2.7 presents a discussion of machine
learning optimization techniques for missing data imputation, while Section 2.8 discusses
the machine learning framework used in this thesis and the building blocks of this
framework.
2.2 Missing Data Proportions
Missing data in datasets influences the analysis, inferences and conclusions reached based
on the information [21]. The impact on the performance of machine learning algorithms
becomes more significant as the proportion of missing data in the dataset increases.
Researchers have shown that the impact on machine learning algorithms is not as
significant when the proportion of missing data is small in large-scale datasets ( [22]- [24]).
This could be attributed to the fact that certain machine learning algorithms inherently
possess frameworks that cater to certain proportions of missing data. As the missing data
proportion increases, for example beyond 25%, the tolerance and performance levels of
machine learning algorithms decrease significantly [25]. It is because of these reduced
tolerance and performance levels that more complex and reliable approaches to the
missing data problem are required.
2.3 Missing Data Mechanisms
Any scenario in which some or all feature variables within a dataset have missing data
entries, or contain data entries that are not exactly characterized within the bounds of
the problem domain, is termed Missing Data [26]. The presence of missing data leads to
several issues in a variety of sectors that depend on the availability of complete, high-quality
data. This has resulted in different methods being introduced to address the missing data
problem in various disciplines ( [26] and [27]). Handling missing data in an acceptable
way depends on the nature of the missingness. There are currently four missing data
mechanisms in the literature: MCAR, MAR, MNAR (the non-ignorable case) and MBND.
2.3.1 Missing Completely at Random (MCAR)
The MCAR case is observed when the probability of a feature variable having missing
data entries is independent both of the feature variable itself and of any of the other
feature variables within the dataset. Essentially, this means that the missing data entry
does not depend on the feature variable being considered or on any other feature variable
in the dataset. This relationship is expressed mathematically as [1]:
P(M | Yo, Ym) = P(M)    (2.1)
where M ∈ {0, 1} is the missing data indicator: M = 1 if Y is known and M = 0 if Y is
unknown (missing). Yo represents the observed values in Y, while Ym represents the
missing values of Y. From equation (2.1), the probability of a missing entry in a variable
is related to neither Yo nor Ym. For instance, when modeling software defects in relation
to development time, if the missingness is in no way linked to the missing values of the
defect rate itself, and at the same time not linked to the values of the development time,
the data is said to be MCAR. Researchers have successfully addressed cases where the
data is MCAR: the authors of [28] applied multilayer perceptrons (MLPs) for missing
data imputation in datasets with missing values. Other research work on this mechanism
can be found in [29] and [30].
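As an illustration, MCAR missingness can be simulated by deleting entries with a constant probability, independent of all values in the dataset, matching equation (2.1). The sketch below uses invented toy data; the variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(2)
Y = rng.normal(size=(1000, 3))            # a fully observed toy dataset
p = 0.1                                   # constant missingness probability

# Each entry is dropped with probability p, regardless of its own value or
# of any other value in the dataset: P(M | Yo, Ym) = P(M).
M = rng.random(Y.shape) >= p              # True where the entry stays observed
Y_mcar = np.where(M, Y, np.nan)
```

Because the deletion probability is a constant, roughly 10% of the entries end up missing, in positions that carry no information about the values themselves.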
2.3.2 Missing at Random (MAR)
The MAR case is observed when the probability of a specific feature variable having
missing data entries is related to the other feature variables in the dataset, but not to the
feature variable itself. MAR means the missing data in a feature variable is conditional
on any other feature variable in the dataset, but not on the one being considered [31].
For example, consider a dataset with two related variables, monthly expenditure and
monthly income. Assume that all high-income earners decline to reveal their monthly
expenditures while low-income earners do provide this information. This implies that in
the dataset there is no monthly expenditure entry for high-income earners, while for
low-income earners the information is available. The missing monthly expenditure entry
is thus linked to the income level of the individual. This relationship can be expressed
mathematically as [1]:
P(M | Yo, Ym) = P(M | Yo)    (2.2)
where M ∈ {0, 1} is the missing data indicator: M = 1 if Y is known and M = 0 if
Y is unknown (missing). Yo represents the observed values in Y, while Ym represents the
missing values of Y. Equation (2.2) indicates that the probability of a missing entry given
both the observable and the missing entries is equal to the probability of the missing
entry given the observable entries only. Considering the example described in Section
2.3.1, the software defects might not be revealed because of a certain development time;
such a scenario points to the data being MAR. Several studies have been conducted in the
literature where the missing data mechanism is MAR. For example, [12] performed a study
comparing the performance of expectation maximization and a GA-optimized AANN,
and revealed that the AANN is the better method. Further research on this mechanism
was performed in [32]- [34].
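The income/expenditure example above can be simulated as follows. This sketch is illustrative only (the data and the threshold are invented): expenditure goes missing exactly when income is high, so the missingness depends only on a fully observed variable, as in equation (2.2).

```python
import numpy as np

rng = np.random.default_rng(3)
income = rng.random(1000)                     # fully observed variable
expenditure = 0.6 * income + 0.1 * rng.random(1000)

threshold = 0.8                               # "high earners" withhold the value
observed = income <= threshold                # missingness depends on income only
expenditure_mar = np.where(observed, expenditure, np.nan)
```

Since income is always observed, the probability that expenditure is missing can be computed from the observed data alone, which is what makes MAR data tractable for imputation methods.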
2.3.3 Non-Ignorable Case or Missing Not at Random (MNAR)
The third missing data mechanism is the missing not at random or non-ignorable case.
The MNAR case is observed when the probability of a feature variable having a missing
data entry depends on the value of the feature variable itself, irrespective of any alteration
or modification to the values of the other feature variables in the dataset [27]. In scenarios
such as these, it is impossible to estimate the missing data by making use of the other
feature variables in the dataset, since the nature of the missing data is not random. MNAR
is the most challenging missing data mechanism to model, and these values are quite tough
to estimate [26]. Consider again the scenario described in the previous subsection, and
assume that some high-income earners do reveal their monthly expenditures while others
refuse, and likewise for low-income earners. Unlike in the MAR mechanism, the missing
entries in the monthly expenditure variable cannot in this instance be ignored, because
they are not directly linked to the income variable or any other variable. Models developed
to estimate this kind of missing data are very often biased. A probabilistic formulation of
this mechanism is not easy because the data in this mechanism is neither MAR nor MCAR.
2.3.4 Missing by Natural Design (MBND)
This is a mechanism whereby data is missing because it cannot be measured physically [35].
Although these entries cannot be measured, they are quite relevant in the data analysis
procedure, and overcoming the problem requires that mathematical equations be
formulated. This missing data mechanism mainly applies to mechanical engineering and
natural science problems, and it is therefore not used in this thesis for the problem under
consideration.
2.4 Missing Data Patterns
The way in which missing data occurs can be grouped into three patterns, illustrated by
Tables 2.1-2.3. Table 2.1 depicts a univariate pattern, a scenario in which missing data
is present in only one feature variable, as seen in column I7. Table 2.2 depicts an
arbitrary missing data pattern, in which the missing data occurs in a distributed and
random manner. The last pattern is the monotone missing data pattern, shown in Table
2.3. This pattern, also referred to as a uniform pattern, occurs in cases where the missing
data can be present in more than one feature variable, and it is easy to understand and
recognize [1].
Table 2.1: Univariate Missing Data Pattern
Sample I1 I2 I3 I4 I5 I6 I7
1 0.38 0.18 0.20 0.19 0.75 0.67 0.96
2 0.69 0.11 0.08 0.41 0.65 0.63 ?
3 0.17 0.79 0.66 0.53 0.95 0.43 ?
4 0.19 0.24 0.15 0.91 0.46 0.82 ?
Table 2.2: Arbitrary Missing Data Pattern
Sample I1 I2 I3 I4 I5 I6 I7
1 0.38 ? 0.20 0.19 0.75 0.67 0.96
2 0.69 0.11 0.08 0.41 ? 0.63 0.04
3 0.17 0.79 ? 0.53 0.95 0.43 0.54
4 ? 0.24 0.15 0.91 0.46 0.82 ?
Table 2.3: Monotone Missing Data Pattern
Sample I1 I2 I3 I4 I5 I6 I7
1 0.38 0.18 0.20 0.19 0.75 0.67 ?
2 0.69 0.11 0.08 0.41 0.65 ? ?
3 0.17 0.79 0.66 0.53 ? ? ?
4 0.19 0.24 0.15 ? ? ? ?
The missing data pattern considered in this thesis is the arbitrary pattern and the mech-
anisms are the Missing at Random and Missing Completely at Random mechanisms.
2.5 Classical Missing Data Techniques
Depending on how data goes missing in a dataset, several imputation techniques are currently
in use in statistical packages [36]. These range from basic approaches such as case-wise
deletion to approaches that apply more refined artificial intelligence and statistical methods.
The subsections that follow present some of the most commonly applied missing data
imputation methods, beginning with basic and naive approaches and moving on to more
complex and competent mechanisms. Classical missing data imputation techniques remain
widespread owing to their simplicity and ease of implementation. The techniques presented
in this section are list-wise or case-wise deletion, pair-wise deletion, mean substitution,
stochastic imputation with expectation maximization, hot and cold deck imputation, multiple
imputation and regression methods.
2.5.1 List-Wise or Case-Wise Deletion
Many statistical approaches discard an entire record if any of its columns has a missing
data entry. This approach is termed case-wise or list-wise deletion: whenever any feature
variable in a record has a missing value, the entire record is deleted from the dataset.
List-wise deletion is the easiest and most basic way to handle the missing data problem,
but it is also the least recommended option, as it tends to significantly reduce the number
of records available for the data analysis task and thereby reduces the accuracy of the
findings from the analysis. Applying this technique is a possibility if the ratio of records
with missing data to records with complete data is very small; if this is not the case,
making use of this approach may result in biased estimates.
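For illustration, list-wise deletion can be sketched in a few lines of Python. The pandas library and the small dataset below are illustrative choices and not part of this thesis; `NaN` marks a missing entry.

```python
import pandas as pd
import numpy as np

# Hypothetical dataset with missing entries (NaN marks a missing value).
df = pd.DataFrame({
    "income": [5.2, np.nan, 3.1, 4.8],
    "expenditure": [2.1, 1.9, np.nan, 2.5],
    "age": [34, 41, 29, 52],
})

# List-wise (case-wise) deletion: drop every record with any missing entry.
complete_cases = df.dropna()

print(len(df), "records before deletion,", len(complete_cases), "after")
```

Here two of the four records are lost, which illustrates how quickly the approach shrinks the dataset as missingness spreads across columns.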
2.5.2 Pair-Wise Deletion
The pair-wise data deletion approach performs the required analysis using pair-wise data.
The implication is that a record with missing data will still be used in an analysis task
if and only if the feature variable with the missing entry in that record is not needed
for the task. The benefit is that the number of records used for analysis will often be
larger than with list-wise deletion. However, this approach yields biased missing data
estimates when the missing data mechanism is MAR or MNAR. On the contrary, it is quite
competent if the data is MCAR.
2.5.3 Mean-mode Substitution
This approach substitutes each missing data entry with the mean or mode of the available
data in the corresponding feature variable. Like pair-wise deletion, it has a high possibility
of yielding biased estimates of the missing data [37], and it is not a highly recommended
approach. For feature variables with continuous or numerical values, missing entries are
substituted by the mean of the respective variable; for feature variables with categorical
or nominal values, they are substituted by the most common (modal) value of the respective
variable [1]. These techniques are most effective when the data is assumed to be MCAR.
Mean-mode substitution has been used with success in previous research (see [38] and [39]).
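The mean-for-numerical, mode-for-categorical rule can be sketched as follows; the mixed-type dataset is hypothetical and the pandas calls are an illustrative implementation choice.

```python
import pandas as pd
import numpy as np

# Hypothetical mixed-type dataset; NaN/None mark missing entries.
df = pd.DataFrame({
    "income": [5.0, np.nan, 3.0, 4.0],   # numerical feature
    "gender": ["M", "F", None, "F"],     # categorical feature
})

# Numerical feature: substitute the mean of the observed values.
df["income"] = df["income"].fillna(df["income"].mean())

# Categorical feature: substitute the mode (most common value).
df["gender"] = df["gender"].fillna(df["gender"].mode()[0])
```

The missing income becomes the observed mean (4.0) and the missing gender becomes the modal category ("F"); every record receives the same substitute, which is exactly why the approach understates variability.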
2.5.4 Imputation
Imputation in statistics is the process of replacing missing data with substituted values,
thus addressing the pitfalls caused by the presence of missing data. Imputation techniques
can be categorized into single and multiple imputation. Single imputation replaces a missing
value with only one estimated value, while multiple imputation replaces each missing entry
with a set of M estimated values.
2.5.4.1 Single-based Imputation
Expectation Maximization
Expectation maximization (EM) is a model-based imputation technique designed for parameter
estimation in probabilistic models with missing data [40]. EM is a two-step iterative
process. The first step, known as the E-step, estimates a probability distribution over
completions of the missing data given the current model. The second step, the M-step,
identifies parameter estimates that maximize the complete-data log-likelihood obtained from
the E-step. The iteration stops either when convergence is attained or when a maximum number
of iterations is reached [40]. Details of the algorithm can be found in [40]. It is applicable
in both single and multiple imputation procedures and has been shown to perform better than
the techniques described above ( [12], [25] and [35]). This technique works best on the
assumption that the data is MAR.
Hot Deck and Cold Deck Imputation
These methods fall under the category of donor-based imputation techniques, which entail
substituting missing entries with data from other records.
Hot deck imputation fills in missing data entries with values from other records. This is
achieved by [1]:
• splitting instances into clusters of similar data, for example using methods such as
k-Nearest Neighbour, and
• replacing missing entries with values from instances that fall in the same cluster.
Cold deck imputation, on the other hand, substitutes the missing data with a constant value
obtained from other sources [1]. Hot and cold deck imputation are popular owing to their
simplicity and to there being no need to make strong assumptions about the model used to
fit the data. It is worth mentioning, though, that this imputation strategy does not
necessarily reduce bias relative to the incomplete dataset.
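The two hot-deck steps above can be sketched as a nearest-donor rule: similarity is measured on the jointly observed features, and the missing entry is copied from the closest complete record. The function name, the Euclidean distance, and the toy matrix are illustrative choices, not the thesis's implementation.

```python
import numpy as np

def hot_deck_impute(data):
    """Hot-deck sketch: fill each missing entry with the value from the
    nearest complete 'donor' record, compared on the observed features."""
    data = data.copy()
    complete = data[~np.isnan(data).any(axis=1)]          # donor pool
    for row in np.where(np.isnan(data).any(axis=1))[0]:
        observed = ~np.isnan(data[row])
        # Euclidean distance to every donor, on the observed features only.
        d = np.linalg.norm(complete[:, observed] - data[row, observed], axis=1)
        donor = complete[np.argmin(d)]
        data[row, ~observed] = donor[~observed]
    return data

X = np.array([[1.0, 2.0, 3.0],
              [1.1, 2.1, np.nan],   # nearest donor is the first record
              [9.0, 8.0, 7.0]])
X_filled = hot_deck_impute(X)
```

The incomplete second record is closest to the first record on its observed features, so the missing third value is copied from that donor (3.0).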
2.5.4.2 Multiple-based Imputation
Multiple imputation (MI) is an approach whereby each missing data entry is substituted with
a set of M approximated values. In [26], MI is described in three consecutive steps. The
first step substitutes the missing data entries in the dataset with M different values,
yielding M different datasets with complete records. The second step analyses each of the
M complete datasets by applying complete-data analysis techniques. Finally, in the third
step, the results from the M datasets are combined based on the analyses done in step two,
indicating which of the M datasets best recovers the missing data entries or yields the
better conclusions and inferences. This approach improves on the single imputation
approaches. It also combines the advantages of the EM and likelihood estimation approaches
with the popular traits of the hot-deck imputation method to obtain new data matrices for
processing ( [31] and [37]). The three steps mentioned above can be expanded in the points
below:
• make use of a reliable model that incorporates randomness to estimate the missing
values;
• generate M complete datasets by repeating the process M times;
• apply complete-data analysis algorithms to each of the datasets obtained;
• from the M complete datasets obtained, calculate the overall value of the estimates
by averaging the values from the M datasets.
This method depends on the assumption that the data is MAR and originates from a
multivariate normal distribution.
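The impute-analyse-pool cycle can be sketched as follows. The stochastic imputation model used here (a normal draw around the observed mean) is a deliberately simple stand-in, and the "analysis" is just the sample mean; both are illustrative assumptions, not a full MI procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical variable with missing entries.
x = np.array([2.3, np.nan, 1.9, 2.8, np.nan, 2.1])
miss = np.isnan(x)
mu, sigma = np.nanmean(x), np.nanstd(x)

M = 5
estimates = []
for _ in range(M):
    # Step 1: impute with a model that incorporates randomness
    # (here: a normal draw around the observed mean -- a simple stand-in).
    xm = x.copy()
    xm[miss] = rng.normal(mu, sigma, size=miss.sum())
    # Step 2: complete-data analysis on each of the M completed datasets.
    estimates.append(xm.mean())

# Step 3: pool the M analyses by averaging.
pooled = np.mean(estimates)
```

Because each of the M completions differs, the spread of the M analysis results also carries information about the uncertainty introduced by the missing data, which single imputation discards.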
2.5.5 Regression Methods
This approach involves generating a regression equation from the records with all data
available for a given feature variable. To achieve this, the feature variable with the
missing data is treated as the dependent variable in the equation, with all the other
feature variables treated as the independent variables (predictors). For records with
missing values, estimates are obtained from the regression equation, with the feature
variable of interest as the output and all the others as the model inputs [1].
Regression equations are generated repeatedly and in order for the feature variables with
missing data entries until all such missing entries are estimated and substituted. This
means that a feature variable vj having missing data entries will have a model created for
it using records with known values for the other variables. Applying this method to estimate
the missing data entry in sample 2 of Table 2.2, the regression equation to be fitted
considers the variables I1, I2, I3, I4, I6 and I7, resulting in the equation below:

$I_5 = i_1 I_1 + i_2 I_2 + i_3 I_3 + i_4 I_4 + i_6 I_6 + i_7 I_7 + \varepsilon$ (2.3)

The regression equation comprises the coefficient terms $i_i$ as well as the error term
$\varepsilon$. It can subsequently be applied to approximate missing data entries by
replacing I1, I2, I3, I4, I6 and I7 with their known values.
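A fit of this kind can be sketched with ordinary least squares. The data below is synthetic (generated from a known linear relation, standing in for Table 2.2's variables) and the held-out rows play the role of records whose I5 entry is missing; both are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: I5 depends linearly on the six other features.
X = rng.random((100, 6))                       # I1..I4, I6, I7 as predictors
coef = np.array([0.5, -0.2, 0.3, 0.1, 0.4, -0.1])
I5 = X @ coef + 0.01 * rng.standard_normal(100)

# Fit the regression equation on records where I5 is observed ...
train = slice(0, 90)
A = np.column_stack([X[train], np.ones(90)])   # add an intercept term
beta, *_ = np.linalg.lstsq(A, I5[train], rcond=None)

# ... and substitute the fitted values for the 'missing' I5 entries.
test = slice(90, 100)
I5_hat = np.column_stack([X[test], np.ones(10)]) @ beta
```

The fitted coefficients play the role of the $i_i$ terms in equation (2.3), and substituting the known predictor values yields the estimate for each missing entry.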
2.6 Machine Learning Approaches to Missing Data
Several approaches in computational intelligence have been developed to address the prob-
lems of missing data and the drawbacks of statistical techniques covered in Section 2.5.
Some of these techniques are tree-based or based on biological concepts.
2.6.1 Decision Trees
Decision trees are supervised learning models aimed at separating data into consistent
clusters for classification or regression analysis. A decision tree is acyclic by default and
consists of a root node, leaf nodes, internal nodes and edges. The root node indicates the
onset of the tree with the leaf nodes representing the end of the tree which either presents
the final outcome or the class label. The internal node stores details about the attribute
used for splitting data at each node. The edges are links between the nodes and contain
details about splits. The outcome of a record is obtained by processing the information
across the tree from the root node to the leaf node [41].
Using decision trees to perform the missing data estimation task entails building a tree
for each feature variable with missing data entries. This feature variable is considered the
class-label with the actual class label forming part of the input feature set. The building
of the tree is done using records with known class labels, and the missing data entries are
then estimated using the corresponding tree [41]. For example, suppose a dataset has
attributes I1, I2, I3 and a class-label L, which is obtained as such: L(I1, I2, I3). Assume
I1 has missing values; then I1 will be considered the class-label while L will be regarded as one
of the input feature variables. The new class label will be obtained by: I1(L, I2, I3). If I2
has the missing data, the new output is obtained using: I2(I1, L, I3) and I2 is considered
the new class-label. This procedure is executed until all feature variables with missing
data are complete.
The strategy described above operates in the way a single imputation method does and has
been applied successfully ( [41]- [43]). It is unknown whether the sequence in which the
missing values are substituted influences the estimates, and the method works best when
the data is assumed to be MCAR.
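The treat-the-incomplete-feature-as-the-label idea can be sketched with a one-level regression tree (a stump) in place of a full decision tree; the stump, the SSE split criterion, and the toy matrix are illustrative simplifications, not the referenced implementations.

```python
import numpy as np

def stump_impute(X, target_col):
    """Treat the column with missing data as the class label, grow a
    one-level regression tree (a stump) on the complete records, and
    predict the missing entries -- a minimal stand-in for a full tree."""
    miss = np.isnan(X[:, target_col])
    train, y = X[~miss], X[~miss, target_col]
    predictors = [j for j in range(X.shape[1]) if j != target_col]
    best = None
    for j in predictors:                 # search split feature and threshold
        for t in train[:, j]:
            left = train[:, j] <= t
            if 0 < left.sum() < len(y):
                sse = ((y[left] - y[left].mean()) ** 2).sum() + \
                      ((y[~left] - y[~left].mean()) ** 2).sum()
                if best is None or sse < best[0]:
                    best = (sse, j, t, y[left].mean(), y[~left].mean())
    _, j, t, lo, hi = best
    X = X.copy()
    X[miss, target_col] = np.where(X[miss, j] <= t, lo, hi)
    return X

X = np.array([[0.00, 1.0], [0.10, 1.1], [1.00, 5.0], [1.10, 5.2],
              [0.05, np.nan], [1.05, np.nan]])
X_filled = stump_impute(X, target_col=1)
```

The stump splits on the remaining feature and fills each missing entry with the mean of its leaf, mirroring how a full tree would route an incomplete record to a leaf and substitute that leaf's value.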
2.6.2 Artificial Neural Networks (ANNs)
An ANN is a model that processes data in a manner inspired by the biological nervous
system [8], of which the human brain is a prime example. It can also be defined as a
collection and combination of elements, namely neurons, whose performance depends both on
the elements themselves and on the nature of the connections between them. The fundamental
processing unit of a neural network is the neuron [44]. A neural network comprises four
important components [45]: (1) at any stage in the life cycle of the network, each neuron
has an activation value; (2) neurons are connected to one another, and these connections
determine how the activation level of one neuron becomes the input of another, with each
connection allocated a weight value; (3) at each neuron, an activation function is applied
to all incoming inputs to generate a new input for neurons in the output layer or subsequent
hidden layers; and (4) a learning algorithm is used to adjust the weights between neurons
when given an input-output pairing.
A predominant feature of a neural network is its capability to adapt to its environment as
new data and information are introduced. It is with this in mind that learning algorithms
were created; they are very important in determining how competent a neural network can be.
Neural networks are applicable in several domains, such as the modeling of highly complicated
problems, because of the relative ease with which they derive meaning from complex data and
identify patterns and trends that are too convoluted for other computational models [8].
Trained neural networks are applicable in prediction tasks where the aim is to determine
the outcome of a new input record after similar information has been presented during the
training process [8]. Their inherent ability to adapt with ease to new non-linear information
makes them favorable for solving non-linear models.
Neural networks have been observed to be highly efficient and capable in obtaining solutions
to a variety of tasks, most notably forecasting and modeling, expert systems and signal
processing tasks [45]. The organization of the neurons in a neural network affects the
processing capability of the network as well as the way in which information moves between
the layers and neurons.
2.6.2.1 Auto-Associative Neural Network
Auto-encoder networks are defined as networks that try to regenerate their inputs as the
outputs of the output layer [46], which encourages the network to reproduce new input
values as outputs when presented with new inputs. These auto-encoders are made up of one
input layer and one output layer with the same number of neurons, resulting in the term
auto-associative [46]. Besides these two layers, a narrow hidden layer exists; it is
important that this layer contain fewer neurons than the input and output layers, with
the aim being to apply encoding and decoding procedures when solving a given task [47].
These networks have been used in a variety of applications ( [48]- [53]). The main concept
defining the operation of auto-encoder networks is the notion that the mapping from the
input to the output, $x^{(i)} \mapsto y^{(i)}$, captures important information and the key
structure inherent in the input $x^{(i)}$, which is otherwise latent ( [48] and [54]). An
auto-encoder takes $x$ as the input and transcribes it into $y$, a hidden representation of
the input, by making use of a deterministic mapping function $f_\theta$. This function is
expressed as ( [54] and [55]):

$f_\theta(x) = s(Wx + b)$. (2.4)
The parameter $\theta$ is made up of the weights $W$ and biases $b$, and $s$ represents
the sigmoid activation function, which is given by:

$s(x) = \dfrac{1}{1 + e^{-x}}$. (2.5)
$y$ is then mapped to a vector $z$ representing the reconstruction of the inputs from this
hidden representation. This reconstructed vector is obtained by using one of the following
equations [54]:

$z = g_{\theta'}(y) = s(W'y + b')$, (2.6)

or

$z = g_{\theta'}(y) = W'y + b'$. (2.7)

In the above equations, $\theta'$ is made up of the transpose of the weights matrix and the
vector of biases from equation (2.4). Equation (2.6) is the auto-encoder output function
with a sigmoid transfer function (equation (2.5)), while equation (2.7) is the linear output
equation. After these operations, the network can then be fine-tuned by applying a supervised
learning approach [55]. The network is said to have tied weights when the weights matrix is
transposed. In probabilistic terms, the vector $z$ is not regarded as an exact transformation
of $x$, but rather as the parameters of a distribution $p(X \mid Z = z)$, in the hope that
these parameters assign high probability to $x$ [55]. The resulting equation is as
follows [54]:

$p(X \mid Y = y) = p(X \mid Z = g_{\theta'}(y))$. (2.8)

This leads to an associated reconstruction error that forms the basis of the objective
function used by the optimization algorithm. This error is usually represented by [54]:

$L(x, z) \propto -\log p(x \mid z)$. (2.9)
Here, $\propto$ indicates proportionality. The equation above can be written as [56]:

$\delta_{AE}(\theta) = \sum_t L\!\left(x^{(t)}, g_{\theta'}\!\left(f_\theta(x^{(t)})\right)\right)$. (2.10)

Auto-encoder networks have been used in a variety of application areas by several
researchers, with the focus being on the problem of missing data ( [8], [57]- [59]).
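Equations (2.4)-(2.7) can be sketched as a tied-weight auto-encoder trained by gradient descent on the squared reconstruction error. The data, layer sizes, learning rate and iteration count below are illustrative assumptions, and the squared-error objective is a common stand-in for the likelihood-based criterion in (2.9)-(2.10).

```python
import numpy as np

rng = np.random.default_rng(3)
s = lambda a: 1.0 / (1.0 + np.exp(-a))       # sigmoid, equation (2.5)

# Hypothetical data: 200 samples, 6 features in [0, 1].
X = rng.random((200, 6))

n_in, n_hid = 6, 3                           # narrow hidden layer
W = 0.1 * rng.standard_normal((n_hid, n_in))
b, c = np.zeros(n_hid), np.zeros(n_in)

lr = 0.3
for _ in range(800):
    y = s(X @ W.T + b)                       # encoder, f_theta (2.4)
    z = y @ W + c                            # tied-weight linear decoder (2.7)
    err = z - X                              # reconstruction error
    # Gradient of the mean squared reconstruction error w.r.t. W has two
    # paths: through the decoder (z = yW + c) and through the encoder.
    dW = (err.T @ y).T / len(X) + (y * (1 - y) * (err @ W.T)).T @ X / len(X)
    W -= lr * dW
    c -= lr * err.mean(axis=0)
    b -= lr * (y * (1 - y) * (err @ W.T)).mean(axis=0)

loss = np.mean((s(X @ W.T + b) @ W + c - X) ** 2)
```

Because the decoder reuses the transposed encoder weights, this is the tied-weight configuration described above; the bottleneck of 3 hidden units forces an encoding/decoding of the 6-dimensional input.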
2.6.3 Support Vector Machine
Support Vector Machine (SVM) is a classification model capable of solving both linear and
non-linear complex problems ( [60] and [61]). In linear problems, the model tries to
identify the maximal-margin hyper-plane, i.e. the separating hyper-plane with the greatest
margin. This hyper-plane must obey the following expression ( [60] and [61]):

$f(x) = \begin{cases} 1, & w \cdot x + b \geq 1 \\ -1, & w \cdot x + b \leq -1 \end{cases}$ (2.11)

where $w$ and $x$ represent the weight and input vectors, respectively, and $b$ indicates
the bias. Larger margins are preferable as they increase the accuracy of classifications.
scenarios where the data dimensions are linearly inseparable, the data requires transfor-
mation into higher dimensions. The model identifies an optimal hyper-plane capable of
separating the variables of the classes in the new high dimensional space. Kernel func-
tions which are used to map the original data into higher dimensions could be expressed
mathematically as ( [62]- [64]):
K(xi, xj) = ϕ(xi).ϕ(xj), (2.12)
where ϕ(xi) and ϕ(xj) are the non-linear mapping functions. Some frequently used kernel
functions are ( [62]- [64]): the polynomial, sigmoid and Gaussian radial functions.
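Equation (2.12) can be verified numerically for a degree-2 polynomial kernel: evaluating the kernel directly on the original inputs gives the same number as taking the inner product of the explicit quadratic feature maps. The 2-D vectors and the particular map `phi` are illustrative choices.

```python
import numpy as np

# Degree-2 polynomial kernel: K(xi, xj) = (xi . xj)^2 equals the inner
# product of the explicit quadratic feature maps phi(xi) . phi(xj).
def phi(x):
    # Explicit map for a 2-D input: (x1^2, x2^2, sqrt(2) * x1 * x2).
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

xi, xj = np.array([1.0, 2.0]), np.array([3.0, 0.5])
k_implicit = (xi @ xj) ** 2          # kernel evaluation, no explicit mapping
k_explicit = phi(xi) @ phi(xj)       # equation (2.12), explicit mapping
```

The agreement of the two values is the kernel trick: the separating hyper-plane can be found in the higher-dimensional space without ever computing $\phi$ explicitly.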
2.7 Machine Learning Optimization
As previously mentioned, constructing models for handling missing data can be complex
and computationally expensive. Successful models employ an optimization technique to
construct a model that best fits the training set. In this section, we highlight various
strategies that have been employed as optimization techniques in missing data problems.
2.7.1 Genetic Algorithm
Genetic algorithm (GA) is an evolutionary computational technique designed to search
for global optimum solutions to complex problems. It was inspired by Darwin’s theory
of natural evolution. Genetic algorithms use the notion of survival of the fittest: the
strongest individuals are selected for reproduction until the best solution is found or the
number of cycles is completed. The processes involved in a genetic algorithm are selection,
crossover, mutation and recombination. Selection involves choosing the strongest parent
individuals, using a probabilistic technique, for the crossover process. During crossover,
a crossover point is chosen between the parent individuals, and genes are exchanged from
the start of each individual to the crossover point, yielding two children. At this point,
if the children are stronger than their parents, they can replace one or both parents.
Mutation is performed by randomly selecting a gene and inverting it; it is given a low
probability value, so it occurs less often than crossover. Finally, the recombination
process evaluates the fitness of the newly generated individuals to determine whether they
can be merged into the current population.
As previously mentioned, genetic algorithms have been applied to optimize neural net-
works ( [12], [35] and [59]) by searching for individuals that maximize the objective func-
tion, prior to imputation. This algorithm is classified within the domain of computational
intelligence as per [65] and has been further used to address the missing data problem
in [66].
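The selection, crossover and mutation steps above can be sketched on a toy maximization problem; the bit-string "OneMax" fitness, population sizes and rates below are illustrative assumptions, not parameters from the cited studies.

```python
import numpy as np

rng = np.random.default_rng(4)

def fitness(pop):
    # Hypothetical objective ("OneMax"): number of ones in each bit string.
    return pop.sum(axis=1)

n_pop, n_genes = 30, 20
pop = rng.integers(0, 2, (n_pop, n_genes))
best = 0

for _ in range(100):
    f = fitness(pop)
    best = max(best, int(f.max()))
    # Selection: fitness-proportional (probabilistic) choice of parents.
    parents = pop[rng.choice(n_pop, size=n_pop, p=f / f.sum())]
    # Crossover: exchange genes from the start of each individual up to a
    # random crossover point, yielding two children per parent pair.
    children = parents.copy()
    for i in range(0, n_pop - 1, 2):
        cut = rng.integers(1, n_genes)
        children[i, :cut], children[i + 1, :cut] = \
            parents[i + 1, :cut].copy(), parents[i, :cut].copy()
    # Mutation: invert a randomly selected gene with a low probability.
    flip = rng.random(children.shape) < 0.01
    children[flip] = 1 - children[flip]
    pop = children
```

The best fitness found rises toward the optimum of 20 as selection concentrates the population on strong individuals while mutation preserves diversity.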
2.7.2 Particle Swarm Optimization
Particle swarm optimization (PSO) is a search technique based on a collective behavior
of birds within a flock. The goal of the technique is to simulate the random and unpre-
dictable movement of a flock of birds, with the intent of finding patterns that govern the
birds’ ability to move at the same time, and change direction whilst regrouping in an
optimal manner ( [67] and [68]).
PSO particles move through a search space. A change in the position of a particle within
the space is based on the socio-psychological tendency of each particle to emulate the
success of neighboring particles as well as its own success. These changes are influenced
by the knowledge or experience of the surrounding particles and of the particle itself;
the search behavior of one particle is therefore affected by the behavior of other particles
as well as its own. The collective behavior of the particles within a swarm permits the
discovery of globally optimal solutions in high-dimensional search spaces ( [67] and [68]).
This algorithm
is also classified within the domain of computational intelligence as per [65] and has also
been used to address the missing data problem in [66].
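The position-update rule described above, combining a particle's own best experience with the swarm's best, can be sketched on a simple minimization problem; the sphere objective, swarm size and coefficient values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

def cost(x):
    # Hypothetical objective: the sphere function, minimum 0 at the origin.
    return (x ** 2).sum(axis=1)

n, dim = 20, 5
pos = rng.uniform(-5, 5, (n, dim))
vel = np.zeros((n, dim))
pbest = pos.copy()                       # each particle's own best position
gbest = pos[cost(pos).argmin()].copy()   # best position seen by the swarm

w, c1, c2 = 0.7, 1.5, 1.5                # inertia and acceleration weights
for _ in range(200):
    r1, r2 = rng.random((n, dim)), rng.random((n, dim))
    # Velocity update: inertia + own success + the neighbours' success.
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = pos + vel
    better = cost(pos) < cost(pbest)
    pbest[better] = pos[better]
    gbest = pbest[cost(pbest).argmin()].copy()

best_cost = cost(gbest[None])[0]
```

Each particle is pulled both toward its personal best and toward the global best, which is the socio-psychological emulation the description above refers to.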
2.7.3 Simulated Annealing
Simulated annealing (SA) is an optimization technique that uses the concept of cooling
metal substances with the goal of condensing matter into a crystalline solid. It can be
considered a procedure for finding an optimal solution. The main characteristics of
simulated annealing are that (i) it can find a global optimum solution; (ii) it is easy to
implement for complex problems; and (iii) it handles complex problems and cost functions
with various numbers of variables. The drawbacks of simulated annealing are [69]:
• It takes many iterations to find an optimal solution;
• The cost function is computationally expensive to estimate;
• It is inefficient if there are many local minimum points;
• It depends on the nature of the problem, and;
• It is difficult to determine the temperature cooling technique.
Research in SA has shown that it performs marginally better than GA and PSO techniques
( [35] and [69]). However, for problems involving datasets with high dimensions, GA and
PSO are recommended over SA because of these drawbacks. In addition, this algorithm is
classified within the domain of computational intelligence as per [65] and has been further
used to address the missing data problem in [66].
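The cooling idea can be sketched as follows: worse moves are accepted with a probability that shrinks as the temperature drops, allowing early escapes from local minima before the search settles. The multimodal objective, proposal step, initial temperature and geometric cooling rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)

def cost(x):
    # Hypothetical multimodal objective with its global minimum 0 at x = 0.
    return x ** 2 + 2.0 * (1 - np.cos(3 * x))

x = 4.0                                  # start far from the optimum
T, alpha = 5.0, 0.99                     # initial temperature, cooling rate
for _ in range(2000):
    x_new = x + rng.normal(0, 0.5)       # propose a neighbouring solution
    d = cost(x_new) - cost(x)
    # Accept better moves always; accept worse moves with probability
    # e^(-d/T), which vanishes as the temperature cools.
    if d < 0 or rng.random() < np.exp(-d / T):
        x = x_new
    T *= alpha                           # geometric temperature cooling

final = cost(x)
```

The many iterations needed and the sensitivity to the cooling schedule (`alpha` here) illustrate the first and last drawbacks in the list above.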
2.8 Deep Learning (DL)
Deep learning comprises a variety of machine learning techniques that make use of a cascade
of non-linear nodes arranged into multiple layers, which extract and transform feature
values from the input vector ( [70] and [71]). Each layer of such a system receives as
input the outputs of the preceding layer, except for the input layer, which receives the
signals or input vectors from the outside environment. During training of these systems,
unsupervised or supervised techniques can be applied, enabling their use in supervised
learning tasks like classification and unsupervised tasks like pattern analysis. Deep
learning models are also based on the extraction of higher-level features from lower-level
features, obtaining a hierarchical representation of the input data via an unsupervised
learning approach on the different levels of features [71]. A hierarchy of notions and
concepts is obtained by learning different layers of representations of the data that
correspond to varying levels of abstraction.
tion of the data. Some of the deep learning frameworks in the literature are Deep Belief
Networks (DBNs) ( [19] and [72]), Deep/Stacked Auto-encoder Networks (DAEs/SAEs)
( [73] and [74]) and Convolutional Neural Networks (CNNs) ( [75] and [76]). The Deep
Learning technique used in this thesis is the Stacked Auto-encoder (SAE). It is built from
restricted Boltzmann machines that are trained in an unsupervised manner using the
contrastive divergence method and subsequently joined to form the encoder and decoder parts
of the network, which is then fine-tuned in a supervised manner using the stochastic
gradient descent algorithm. The motivation behind using an SAE is that it is trained in
such a way that the hidden layer retains the information needed to reconstruct the input.
2.8.1 Restricted Boltzmann Machine (RBM)
Prior to defining an RBM, we begin by explaining what a Boltzmann machine (BM) is. It is a
bidirectionally connected network of stochastic processing units, which can be interpreted
as a neural network [77]. It can be used to learn important aspects of an unknown probability
distribution based on samples from the distribution, which is typically a challenging
procedure. The learning procedure can be simplified by imposing constraints on the
architecture of the network, which leads to restricted Boltzmann machines [78]. RBMs can be
defined as probabilistic, undirected, parametrized graphical models, also referred to as
Markov random fields (MRFs). RBMs have received a lot of attention in the aftermath of being
proposed as building blocks of multi-layered architectures called deep networks ( [54]
and [78]). The concept behind deep networks is that the hidden neurons extract relevant
features from the input data, and these features then serve as input to another RBM [54].
The goal in stacking the RBMs is to obtain higher-level representations of the data by
learning features from features [54]. RBMs, which are MRFs associated with bipartite
undirected graphs, are made up of m visible units, $V = (V_1, \ldots, V_m)$, representing
observable data, and n hidden units, $H = (H_1, \ldots, H_n)$, that capture the relationship
between variables
in the input layer ( [19] and [79]). The joint variables $(V, H)$ take on values
$(v, h) \in \{0, 1\}^{m+n}$. The joint probability, which is obtained from the Gibbs
distribution, requires an energy function given by ( [80] and [81]):

$E(v, h) = -h^T W v - b^T v - c^T h$. (2.13)
In scalar form, (2.13) is expressed as ([80] and [81]):

E(v, h) = -\sum_{i=1}^{n} \sum_{j=1}^{m} w_{ij} h_i v_j - \sum_{j=1}^{m} b_j v_j - \sum_{i=1}^{n} c_i h_i. (2.14)
In (2.14), w_{ij} represents a real-valued weight between the input unit V_j and the hidden unit H_i. These weights are the most essential part of an RBM. The parameters b_j and c_i represent real-valued bias terms associated with the jth visible variable and the ith hidden variable [54]. In a scenario where w_{ij} is less than zero and v_j = h_i = 1, a high energy is obtained, which corresponds to a decrease in probability. However, if w_{ij} is greater than zero and v_j = h_i = 0, a lower energy value is obtained, which corresponds to an increase in probability. If b_j is less than zero and v_j = 1, a low probability is achieved due to an increase in energy ([80] and [81]). This points to an inclination for v_j to be equal to zero rather than one. On the other hand, if b_j is greater than zero and v_j = 0, a high probability is achieved due to a decrease in energy. This points to an inclination for v_j to be equal to one rather than zero. The second term in equation (2.14) is influenced by the value of b_j, with a value less than zero decreasing the term and a value greater than zero increasing it. The third term of equation (2.14) is influenced by the value of c_i in the same way that b_j affects the second term. The Gibbs distribution or probability from (2.13) or (2.14) is given by ([80] and [81]):
p(v, h) = \frac{e^{-E(v,h)}}{Z} = \frac{e^{h^T W v + b^T v + c^T h}}{Z} = \frac{e^{h^T W v} e^{b^T v} e^{c^T h}}{Z}. (2.15)
In this equation, Z represents an intractable partition function while all the exponential
terms represent factors of a Markov network with vector nodes [54]. The intractable
nature of Z is due to the exponential number of values it can assume. In RBMs, the
intractable partition function is obtained by ( [80] and [81]):
Z = \sum_{v,h} e^{-E(v,h)}. (2.16)
The hidden variables are mutually conditionally independent given the visible variables, and vice versa; this is yet another important trait of an RBM. It follows from the fact that no nodes in the same layer are connected. Mathematically, this can be expressed as ([80] and [81]):
p(h|v) = \prod_{i=1}^{n} p(h_i|v), (2.17)

and

p(v|h) = \prod_{i=1}^{m} p(v_i|h). (2.18)
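To make equations (2.13), (2.17) and (2.18) concrete, the following is a minimal NumPy sketch of a binary RBM's energy and conditional distributions. The thesis does not supply code; the function names and the tiny 3-visible/2-hidden example are purely illustrative, and the sigmoid form of the conditionals assumes binary {0, 1} units:

```python
import numpy as np

def energy(v, h, W, b, c):
    """Energy of a joint configuration (v, h), equation (2.13):
    E(v, h) = -h^T W v - b^T v - c^T h."""
    return -(h @ W @ v + b @ v + c @ h)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, c):
    """P(H_i = 1 | v) for all hidden units; factorizes as in equation (2.17)."""
    return sigmoid(W @ v + c)

def p_v_given_h(h, W, b):
    """P(V_j = 1 | h) for all visible units; factorizes as in equation (2.18)."""
    return sigmoid(W.T @ h + b)

# Tiny example: m = 3 visible units, n = 2 hidden units.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(2, 3))  # n x m weight matrix w_ij
b = np.zeros(3)                         # visible biases b_j
c = np.zeros(2)                         # hidden biases c_i
v = np.array([1.0, 0.0, 1.0])
h = np.array([1.0, 1.0])
print(energy(v, h, W, b, c))
print(p_h_given_v(v, W, c))
```

Because the layers are not laterally connected, each conditional factorizes over its units, which is exactly what makes block Gibbs sampling in an RBM cheap.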
2.8.2 Contrastive Divergence (CD)
In training an RBM, the goal is to minimize the mean negative log-likelihood (the loss) without any form of regularization [54]. This is done using the stochastic gradient descent algorithm, because it handles high-dimensional datasets better than alternative optimizers. The loss is expressed as ([80] and [82]):
loss = \frac{1}{T} \sum_{t} -\log p(v^{(t)}). (2.19)
This can be achieved by calculating the partial derivative of the loss function with respect
to a parameter, θ, as follows ( [80] and [82]):
\frac{\partial (-\log p(v^{(t)}))}{\partial \theta} = E_h \left[ \frac{\partial E(v^{(t)}, h)}{\partial \theta} \,\middle|\, v^{(t)} \right] - E_{v,h} \left[ \frac{\partial E(v, h)}{\partial \theta} \right]. (2.20)
The first term in (2.20) defines the expectation over the distribution of the data. This is
coined the positive phase. v and h are the same variables used in equations (2.13)-(2.17).
The second term, referred to as the negative phase, represents the expectation over the distribution of the model. Because an exponential sum over the v and h variables is required, the calculation of these partial derivatives is intractable [83]. In addition, obtaining unbiased estimates of the log-likelihood gradient normally requires several steps of sampling. It has, however, been shown that estimates obtained by running the Markov chain for only a few iterations can suffice during the model training process. From this emerged the contrastive divergence (CD) method ([80] and [82]). CD can be defined as a technique for training undirected graphical models
of a probabilistic nature. The aim is to eliminate the double expectation in the negative phase of equation (2.20) and focus instead on estimation. It essentially uses a Monte-Carlo estimate of the expectation over one input data point [54]. An extension of the CD algorithm is the k-step CD learning technique (CD-k), which states that instead of approximating the second term in equation (2.20) by a sample from the distribution of the model, a Gibbs chain can be executed for only k steps, with k often set to 1. The Gibbs chain is initialized with a training sample v^{(0)} from the training set, and produces the sample v^{(k)} after k steps. Each step t comprises sampling h^{(t)} from p(h|v^{(t)}) and subsequently sampling v^{(t+1)} from p(v|h^{(t)}).
The partial derivative of the log-likelihood with respect to θ for a single training sample,
v(0), is approximated by ( [80] and [82]):
CD_k(\theta, v^{(0)}) = -\sum_{h} p(h|v^{(0)}) \frac{\partial E(v^{(0)}, h)}{\partial \theta} + \sum_{h} p(h|v^{(k)}) \frac{\partial E(v^{(k)}, h)}{\partial \theta}. (2.21)
Because v^{(k)} is not drawn from the stationary distribution of the model, the estimates from equation (2.21) are biased [54]. As k → ∞, the bias vanishes. An additional indication of the biased nature of the CD algorithm is the fact that it effectively follows the gradient of the difference between two Kullback-Leibler (KL) divergences ([80] and [82]):

KL(q|p) - KL(p_k|p). (2.22)
Here, pk defines the distribution of the visible variables after k steps of the Markov chain
while q represents the empirical distribution. If the chain is observed to have already
attained stationarity, then pk = p, therefore, KL(pk|p) = 0, and with this, the error from
the CD estimates vanishes. More information on the CD algorithm can be found in [84].
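The CD-k procedure described above can be sketched as follows. This is an illustrative NumPy implementation, not the author's code: `cd_k_step` is a hypothetical helper that runs the k-step Gibbs chain of equation (2.21) for a binary RBM and applies the resulting (biased) gradient estimate with an assumed fixed learning rate:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_step(v0, W, b, c, k=1, lr=0.05, rng=None):
    """One CD-k update for a binary RBM, following equation (2.21).

    A Gibbs chain is started from the training sample v0 and run for k
    steps; the parameters then move along the difference between the
    positive-phase statistics (data) and the negative-phase statistics
    (the chain sample v^(k))."""
    if rng is None:
        rng = np.random.default_rng()
    ph0 = sigmoid(W @ v0 + c)          # positive phase: p(h | v^(0))
    v = v0
    for _ in range(k):                 # k Gibbs sampling steps
        h = (rng.random(c.shape) < sigmoid(W @ v + c)).astype(float)
        v = (rng.random(b.shape) < sigmoid(W.T @ h + b)).astype(float)
    phk = sigmoid(W @ v + c)           # negative phase: p(h | v^(k))
    # Biased estimates of the gradient of -log p(v^(0)), applied in place.
    W += lr * (np.outer(ph0, v0) - np.outer(phk, v))
    b += lr * (v0 - v)
    c += lr * (ph0 - phk)
    return W, b, c
```

Setting k = 1 recovers the common CD-1 variant; larger k reduces the bias at the cost of more sampling per update, consistent with the bias vanishing as k → ∞.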
2.9 Conclusion
This chapter gave a background summary of the missing data mechanisms and patterns, as well as the classical techniques used for handling missing data. We also discussed modern approaches for handling missing data and their applications. The chapter also presented the techniques employed to optimize the missing data imputation process. It is important because it illustrates the value of understanding missing data and how various methods have evolved over the years to handle the problem. It also motivates the research conducted in this thesis.
In this thesis, the methods described in Section 2.7 are used as a frame of reference to compare the proposed methods against existing techniques, while the techniques described in Section 2.8 are used to construct the deep learning regression framework.
3 Novel Ant-based Missing Data Estimators
3.1 Introduction
In this chapter, we present the results obtained from analysing and implementing the novel
ant-based missing data estimators. We begin in Section 3.2 by describing the experimental
design that will be implemented throughout the chapter, followed by Section 3.3 in which
we present information on the optimization algorithms that will be used in the chapter. In
Section 3.4, we present the performance evaluation metrics that will be used in Chapters 3,
4 and 5. Section 3.5 presents the results obtained from analysing the DL-ACO estimator,
while Section 3.6 reports on the findings from the analysis of the DL-ALO estimator.
Section 3.7 presents the key findings from the chapter.
3.2 Experimental Design
3.2.1 Statement of Hypothesis and Research Question
It should be clear at this point that the research done in this work focused on whether it is possible to effectively estimate missing data entries in a high-dimensional dataset. We tried to answer two key questions:
• Is it possible to estimate missing data entries in a high-dimensional dataset efficiently using models comprising a deep auto-encoder network framework with the ant colony optimization and ant-lion optimizer algorithms?
• Is there a relationship between the accuracy of the estimated values and the real values in the feature variables with missing data?
The answers to these questions, which we expect to obtain in relation to prior research, are detailed in the hypotheses of the research.
3.2.2 Hypothesis Testing
3.2.2.1 Hypothesis One
• It is possible to estimate missing data entries in a high-dimensional dataset efficiently using models comprising a deep auto-encoder network framework with the ant colony optimization and ant-lion optimizer algorithms.
3.2.2.2 Hypothesis Two
• It is expected that the level of correlation between the estimated values and the real values will be high or low depending on the nature of the dataset.
Figure 3.1: Data Imputation Configuration.
Figure 3.1 illustrates how the regression model and optimization methods will be used.
The dataset used is the Modified National Institute of Standards and Technology (MNIST) handwritten digit recognition dataset [18]. This dataset comprises 60,000 training images and 10,000 test images. Each image is a 28 × 28 pixel image, resulting in 784 pixel values that represent the image and serve as the input to the model. The data is preprocessed by normalizing each pixel value to the range [0, 1].
Two predominant features of an auto-encoder, namely (i) its auto-associative nature, and (ii) the butterfly-like structure of the network resulting from the bottleneck in the hidden layers, were the reasons behind the network being used. Auto-encoders are also ideal courtesy of their ability to replicate the input data by learning linear and non-linear correlations and covariances present in the input space, projecting the input data into lower dimensions. The only condition required is that the hidden layer(s) have fewer nodes than the input layer, though this is application-dependent. Prior to optimizing the regression model parameters, it is necessary to identify the network structure, which depends on the number of layers, the number of hidden units per hidden layer, the activation functions used, and the number of input and output units. After this, the parameters can be approximated using the training set of data. The parameter approximation procedure was run for a given number of training cycles, with the optimal number of cycles obtained by analysing the validation error. The aim of this was to avoid over-fitting the network and to use the fastest training approach without
compromising on accuracy. The optimal number of training cycles was found to be 500.
The training procedure estimated weight parameters such that the network output was
as close as possible to the target output.
Figure 3.2: Stacked Auto-encoder Network Structure.
Figure 3.3: Missing Data Estimator Structure.
The optimization algorithms were used to estimate the missing values by optimizing an objective function which incorporates the trained network. They used values from the population as part of the input to the network, and the network recalled these values, which subsequently formed part of the output. The complete data matrix containing the estimated values and observed values was fed into the auto-encoder as input. Some inputs were considered known, with others unknown and to be estimated using the regression method and the optimization algorithms as described at the beginning of the paragraph. The symbols I_k and I_u as used in Figures 3.1 and 3.3 represent the known and unknown/missing values, respectively.
Considering that the approach made use of a deep auto-encoder, it was imperative that
the auto-encoder architecture match the output to the input. This trait is expected when
a dataset with familiar correlations recorded in the network is used. The error, δ, used is the disparity between the target output and the network output, expressed as [54]:

\delta = \vec{I} - f(\vec{W}, \vec{I}), (3.1)

where \vec{I} and \vec{W} represent the inputs and the weights, respectively.
The square of equation (3.1) was used to guarantee that the error is always positive. This results in the following equation [54]:

\delta = \left( \vec{I} - f(\vec{W}, \vec{I}) \right)^2. (3.2)
Because the input and output vectors contain both I_k and I_u, the error function is rewritten as [54]:

\delta = \left( \begin{bmatrix} I_k \\ I_u \end{bmatrix} - f\left( \begin{Bmatrix} I_k \\ I_u \end{Bmatrix}, w \right) \right)^2. (3.3)
Equation (3.3) is the objective function used and minimized by the optimization algorithms to estimate I_u, with f being the regression model function. The stopping criteria of the optimization algorithms, and therefore of the estimation procedure, were either a maximum of 40,000 function evaluations being attained, or no change observed in the objective/error value during the estimation procedure. From the above descriptions of how the deep auto-encoder and optimization algorithms were used, the equation below summarizes the function of the proposed approach, with f_{OA} being the optimization algorithm estimation operation and f_{DAE} being the function of the deep auto-encoder:

y = f_{DAE}(W, f_{OA}(\vec{I})), (3.4)

where \vec{I} = \begin{bmatrix} \vec{I}_k \\ \vec{I}_u \end{bmatrix} represents the input space of known and unknown features. This equation represents the model design whereby the complete input vector, with known and estimated missing data entries obtained by executing the missing data estimation procedure (f_{OA}), is presented to the deep regression model (f_{DAE}) to observe whether the network error has been minimized. If the error has been minimized, the output, y, will contain the known input vector values and the optimal missing data estimates. This model design is also used in Chapter 5.
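A minimal sketch of the objective in equation (3.3) may help. The thesis gives no implementation; here `autoencoder` stands in for the trained deep regression model f_DAE (an identity function is used purely for illustration), and `estimation_error` is the squared reconstruction error that the optimization algorithms would minimize over the candidate values for I_u:

```python
import numpy as np

def estimation_error(candidate, known_values, known_mask, autoencoder):
    """Objective of equation (3.3): the squared difference between the
    auto-encoder's input and its reconstruction, with the unknown entries
    I_u filled in by the optimizer's candidate values."""
    full_input = np.where(known_mask, known_values, candidate)  # [I_k; I_u]
    reconstruction = autoencoder(full_input)                    # f({I_k; I_u}, w)
    return np.sum((full_input - reconstruction) ** 2)

# Hypothetical usage on a 4-pixel record with two missing entries.
known_mask = np.array([True, True, False, False])
known_values = np.array([0.9, 0.1, 0.0, 0.0])
candidate = np.array([0.0, 0.0, 0.5, 0.7])  # optimizer's guess for I_u
err = estimation_error(candidate, known_values, known_mask, lambda x: x)
print(err)  # the identity stand-in reconstructs perfectly, so the error is 0.0
```

In the actual approach, an optimizer such as ACO or ALO would repeatedly propose `candidate` vectors and keep the one minimizing this error, subject to the stopping criteria described above.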
3.3 Optimization Algorithms
3.3.1 Ant Colony Optimization (ACO)
ACO is an algorithm that mimics the innate behavior of ants. In their everyday lives, ants must explore the neighborhood of their nests in search of food [85]. When ants move, they deposit on their trail a substance termed pheromone. This deposit serves two main purposes: it allows ants to navigate their way back to the nest, and it allows other ants to trace and follow the paths taken by the ants before them [85]. ACO has a collection of characterizing traits that can be regarded as building blocks. These traits are essential and must be specified in every implementation. They include [86]: (i) the method selected to build the solutions, (ii) the heuristic information, (iii) the rule to update the pheromones, (iv) the probability function and transition rules, (v) the values of the parameters, and (vi) the stopping criteria [85]. The algorithm considers a colony of m artificial ants collaborating with one another. Prior to the start of the execution of the algorithm, each of the links between the solutions is assigned an initial amount of pheromone, τ_0. This value is usually very small, but non-zero, so that the probability of the path to each solution being chosen is not zero. At each iteration, each of the m ants constructs a solution and updates the pheromone values. The pheromone, τ_{ij}, associated with the link between solutions i and j is revised using the following equation ([86] and [87]):
\tau_{ij} \leftarrow (1 - \rho) \, \tau_{ij} + \sum_{k=1}^{m} \Delta \tau_{ij}^{k}, (3.5)
where ρ represents the evaporation rate of the pheromone, m is the number of ants, and \Delta \tau_{ij}^{k} is the amount of pheromone deposited on the link between solutions i and j by ant k, such that \Delta \tau_{ij}^{k} = Q/L_k if ant k used the link between solutions i and j, and \Delta \tau_{ij}^{k} = 0 otherwise [87]. Q is a constant, with L_k representing the length of the tour created by ant k. In the construction of a new solution, ants choose the next solution via a stochastic approach. When ant k is at solution i and has constructed a partial solution, s^p, the probability of then moving to solution j is given by:

P_{ij}^{k} = \frac{\tau_{ij}^{\alpha} \, \eta_{ij}^{\beta}}{\sum_{c_{il} \in N(s^p)} \tau_{il}^{\alpha} \, \eta_{il}^{\beta}} if c_{ij} \in N(s^p), and P_{ij}^{k} = 0 if c_{ij} \notin N(s^p) [88].

N(s^p) represents the set of feasible components, that is, links between solutions i and l whereby l is a solution that has not yet been tested for its fitness towards the task at hand by ant k.
The α and β parameters govern the relative importance of the pheromone versus the heuristic information, η_{ij}, which is obtained by ([86] and [89]):

\eta_{ij} = \frac{1}{d_{ij}}, (3.6)
where d_{ij} is the distance between solutions i and j. This algorithm has been applied in several papers, such as [90], in which it was used to solve problems in water distribution systems, while in [91], the ACO algorithm was employed to solve a mathematical model constructed to represent a process planning problem. In [92], the ACO algorithm was used in spatial clustering problems where no a priori information was considered, and compared against a novel algorithm which was proposed.
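The pheromone update of equation (3.5) and the transition probabilities above can be sketched as follows. This is an illustrative NumPy version assuming a graph-based combinatorial setting (tours over links between solutions); the function names and the default α and β values are assumptions, not taken from the thesis:

```python
import numpy as np

def update_pheromone(tau, rho, ant_paths, tour_lengths, Q=1.0):
    """Pheromone update of equation (3.5):
    tau_ij <- (1 - rho) * tau_ij + sum_k dtau_ij^k,
    where dtau_ij^k = Q / L_k on every link used by ant k, else 0."""
    tau = (1.0 - rho) * tau            # evaporation on every link
    for path, L in zip(ant_paths, tour_lengths):
        for i, j in zip(path[:-1], path[1:]):
            tau[i, j] += Q / L         # deposit by each ant on its links
    return tau

def transition_probabilities(tau_row, eta_row, feasible, alpha=1.0, beta=2.0):
    """Probability of moving from solution i to each candidate j: the
    pheromone/heuristic product for feasible links, normalized to sum to 1."""
    weights = np.where(feasible, (tau_row ** alpha) * (eta_row ** beta), 0.0)
    return weights / weights.sum()
```

An ant at solution i would sample its next solution from `transition_probabilities`, and after all m ants finish, `update_pheromone` applies evaporation and the deposits of equation (3.5).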
The ACO parameters used in the model implementation are given in Table 3.1, apart from the number of decision variables, which depends on the number of missing values in a record. The data was normalized to the range [0, 1], meaning the lower and upper bounds of the decision variables are 0 and 1, respectively. These parameters were chosen because they yielded the best outcomes among the different combinations and permutations of values tested.
Table 3.1: ACO Parameters.
Parameter Value
Maximum Number of Iterations 1000
Population Size 10
Intensification Factor 0.5
Deviation-Distance Ratio 1
Sample Size 40
3.3.2 Ant-Lion Optimizer (ALO)
ALO is a meta-heuristic algorithm that mimics the hunting interaction between ant-lions and their prey, ants [93]. ALO implements five main steps of hunting, these being: the random motion of ants, the construction of traps by the ant-lions, the capturing of ants in the traps, the catching of prey, and the rebuilding of traps. It is also a gradient-free algorithm which has the property of providing good exploration and exploitation of the
solution space. Exploration is ensured by the random selection of ant-lions, as well as the random motion of ants around them. Exploitation, on the other hand, is ensured by the adaptive shrinking of the boundaries of the ant-lion traps. The algorithm is based on a three-tuple of operators, ALO(A_1, A_2, A_3), which estimate the global optimum of an optimization problem. These three operators are defined respectively as ([93] and [94]):

\Phi \rightarrow_{A_1} \{G_{Ant}, G_{OA}, G_{Antlion}, G_{OAL}\}, (3.7)

\{G_{Ant}, G_{Antlion}\} \rightarrow_{A_2} \{G_{Ant}, G_{Antlion}\}, (3.8)

and

\{G_{Ant}, G_{Antlion}\} \rightarrow_{A_3} \{true, false\}, (3.9)
where G_{Ant} represents the ants' position matrix, G_{Antlion} comprises the ant-lions' positions, G_{OA} depicts the fitness of the ants, and, finally, G_{OAL} contains the fitness values of the ant-lions. The algorithm operates in such a way that the ant-lion and ant matrices are initialized in a random manner by applying equation (3.7). The roulette wheel operator is used to select the location of each ant relative to an ant-lion. Equation (3.8) is used to update the elite in each iteration. The update of the perimeter location is primarily described in relation to the iteration number at that instance. The location is subsequently refined by using two random walks near the selected ant-lion and the elite. The fitness function is used to evaluate the points visited by every randomly walking ant. If any ant becomes fitter than any of the ant-lions, its location is used in the next iteration as the new location for that ant-lion. The best ant-lion is then compared with the best ant-lion obtained so far during the optimization procedure (the elite), and the fitter of the two is kept as the elite; this substitution is one of the key operations in the implementation of the algorithm. These steps are executed until the function in equation (3.9) returns false.
In the implementation of the algorithm, ants walk randomly according to ([93] and [94]):

X_a(t) = [0, cumsum(2l(t_1) - 1), cumsum(2l(t_2) - 1), \ldots, cumsum(2l(t_n) - 1)], (3.10)

where n is the maximum number of iterations, cumsum represents the cumulative sum, and t indicates the step of the random walk. l(t) is a stochastic function defined by the relations: l(t) = 1 if rand > 0.5, and l(t) = 0 if rand ≤ 0.5, where rand is a random number drawn from a uniform distribution on [0, 1]. In order to restrict the random movement of the ants to within the boundaries of the solution space, the walks are normalized according to ([93] and [94]):

X_i^t = \frac{(X_i^t - a_i)(d_i - c_i^t)}{(d_i^t - a_i)} + c_i, (3.11)

where a_i represents the minimum of the random walk of the ith variable, b_i indicates the maximum of the random walk of the ith variable, c_i^t represents the minimum of the ith variable at the tth iteration and, finally, d_i^t indicates the maximum of the ith variable at the tth iteration [93].
Modeling of the trapping of ants in ant-lion pits is obtained by ([93] and [94]):

c_i^t = Antlion_j^t + c^t, (3.12)

and

d_i^t = Antlion_j^t + d^t, (3.13)

where c^t represents the lower bound of all features at the tth step, d^t represents the maximum of all features at the tth step, and Antlion_j^t represents the location of the selected jth ant-lion at the tth step.
The hunting capability of an ant-lion is modeled by the fitness-proportional roulette wheel selection. The manner in which trapped ants slide down the trap towards the ant-lion is described mathematically by shrinking the boundaries ([93] and [94]):

c^t = \frac{c^t}{Z}, (3.14)

and

d^t = \frac{d^t}{Z}, (3.15)

where Z is a ratio calculated by:

Z = 10^w \, \frac{t}{T}. (3.16)

In the equation above, t is the current step, T represents the upper bound on the number of steps to be taken, and w is a constant which depends on the current step according to the following relations: w = 2 if t > 0.1T, w = 3 if t > 0.5T, w = 4 if t > 0.75T, w = 5 if t > 0.9T and w = 6 if t > 0.95T [93].
The last part of the algorithm is elitism, which is implemented such that the fittest ant-lion at each step is designated the elite. This implies that every ant randomly walks around a selected ant-lion with a location that respects the following equation ([93] and [94]):

Ant_i^t = \frac{R_A^t + R_E^t}{2}. (3.17)

In equation (3.17), R_A^t represents the random walk around the ant-lion selected by the roulette wheel method at the tth step, while R_E^t represents the random walk around the elite ant-lion at the tth step.
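The random walk of equation (3.10), its rescaling into the trap boundaries, and the shrinking ratio of equation (3.16) can be sketched as follows. This is an illustrative NumPy fragment: the helper names are hypothetical, the walk is one-dimensional for clarity, and the normalization is written as a plain min-max rescaling into [c_t, d_t], which is the intent behind equation (3.11):

```python
import numpy as np

def random_walk(T, rng=None):
    """Cumulative-sum random walk of equation (3.10), one dimension, T steps."""
    if rng is None:
        rng = np.random.default_rng()
    l = np.where(rng.random(T) > 0.5, 1.0, 0.0)  # stochastic l(t)
    return np.concatenate(([0.0], np.cumsum(2.0 * l - 1.0)))

def normalise_walk(X, c_t, d_t):
    """Min-max rescaling in the spirit of equation (3.11): confine the walk
    to the current trap boundaries [c_t, d_t] around the selected ant-lion."""
    a, b = X.min(), X.max()
    return (X - a) * (d_t - c_t) / (b - a) + c_t

def boundary_ratio(t, T, w=2):
    """Shrinking ratio Z = 10^w * t / T of equation (3.16); larger Z means
    tighter trap boundaries as the step t approaches the maximum T."""
    return (10 ** w) * t / T
```

Dividing the current bounds c^t and d^t by `boundary_ratio(t, T, w)` reproduces the sliding behavior of equations (3.14) and (3.15), with the piecewise schedule for w accelerating the shrinkage late in the run.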
In [94], the ALO algorithm was used to find the parameters of a primary governor loop of thermal generators for successful Automatic Generation Control (AGC) of a two-area interconnected power system. It has been used in [95] to investigate a three-area interconnected power system, while in [96], it was used to train a multilayer perceptron neural network. The authors in [97] used a chaotic ALO algorithm for feature selection purposes on large datasets, meanwhile in [98], the ALO algorithm was used to solve the NP-hard combinatorial optimization problem of obtaining an optimal process plan according to all alternative manufacturing resources.
The ALO parameters used in the model implementation are the number of search agents, set to 40, and the maximum number of iterations, set to 1000. The number of decision variables depends on the number of missing values in a record. The data was normalized to the range [0, 1], meaning the lower and upper bounds of the decision variables are 0 and 1, respectively. These parameters were chosen because they yielded the best outcomes among the different combinations and permutations of values tested.
3.4 Performance Evaluation Metrics
The effectiveness of the proposed approaches was determined using the SE, MSE, RMSLE, MAE, r and RPA metrics. Also used were the SNR, GD and COD performance measures. The mean squared and root mean squared logarithmic errors, as well as the global deviation, yield measures of the difference between the actual and predicted values, and provide an indication of the capability of the estimation approach.
MSE = \frac{1}{n} \sum_{i=1}^{n} (I_i - \hat{I}_i)^2, (3.18)

RMSLE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \log(\hat{I}_i + 1) - \log(I_i + 1) \right)^2}, (3.19)

and

GD = \left( \frac{1}{n} \sum_{i=1}^{n} (\hat{I}_i - I_i) \right)^2. (3.20)
The correlation coefficient provides a measure of the similarity between the predicted and actual data. The output value of this measure lies in the range [-1, 1], where the absolute value indicates the strength of the link, while the sign indicates the direction of said link. Therefore, a value close to 1 (100%) signifies a strong predictive capability, while a value close to -1 (-100%) signifies otherwise. In the equation below, a bar over a symbol represents the mean of the data.

r = \frac{\sum_{i=1}^{n} (I_i - \bar{I})(\hat{I}_i - \bar{\hat{I}})}{\left[ \sum_{i=1}^{n} (I_i - \bar{I})^2 \sum_{i=1}^{n} (\hat{I}_i - \bar{\hat{I}})^2 \right]^{1/2}}. (3.21)
The relative prediction accuracy, on the other hand, measures the number of estimates made within a specific tolerance, with the tolerance dependent on the sensitivity required by the application. The tolerance was set to 10% as it seemed favorable for the application domain. This measure is given by:

A = \frac{n_\tau}{n} \times 100. (3.22)
The Squared Error (SE) is a quadratic scoring rule that records the average magnitude of the error. It is obtained by taking the square root of the mean of the squared errors, analogous to a standard deviation. It reduces the variance in the error relative to the MSE, hence its application in this work. SE can be obtained using the formula:

SE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (I_i - \hat{I}_i)^2}. (3.23)
The Mean Absolute Error (MAE) measures the average magnitude of the errors in a dataset without considering their direction. Under ideal scenarios, SE values are always greater than or equal to the MAE values; in the case of equality, all the errors have the same magnitude. This error can be calculated using the following equation:

MAE = \frac{1}{n} \sum_{i=1}^{n} |I_i - \hat{I}_i|. (3.24)
The coefficient of determination is a metric regularly applied in statistical analysis tasks aimed at assessing the performance of a model in the explanation and prediction of future outputs. It is also referred to as the R-squared statistic, obtained by the following:

COD = \left( \frac{\sum_{i=1}^{n} (I_i - \bar{I})(\hat{I}_i - \bar{\hat{I}})}{\left[ \sum_{i=1}^{n} (I_i - \bar{I})^2 \sum_{i=1}^{n} (\hat{I}_i - \bar{\hat{I}})^2 \right]^{1/2}} \right)^2. (3.25)
The Signal-to-Noise Ratio compares the estimated value against the real value to indicate the level of noise in the estimate. The signal-to-noise ratio used in this work is obtained by:

SNR = \frac{var(I - \hat{I})}{var(\hat{I})}. (3.26)

In equations (3.18)-(3.25), n represents the number of samples, while in equations (3.18)-(3.21) and (3.23)-(3.26), I and \hat{I} represent the real test set values and the estimated missing output values from the modified test set, respectively. In equation (3.22), n_\tau represents the number of correctly estimated outputs within the set tolerance of 10%.
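The metrics in equations (3.18)-(3.24) can be computed directly. The sketch below is illustrative (the thesis provides no code); it interprets the 10% tolerance of equation (3.22) as an absolute deviation of 0.1 on the [0, 1]-normalized pixel values, which is an assumption:

```python
import numpy as np

def imputation_metrics(actual, estimated, tolerance=0.10):
    """A subset of equations (3.18)-(3.24): MSE, RMSLE, global deviation,
    correlation coefficient r, relative prediction accuracy (RPA), and MAE."""
    actual = np.asarray(actual, float)
    estimated = np.asarray(estimated, float)
    n = actual.size
    err = estimated - actual
    mse = np.mean(err ** 2)                                                  # (3.18)
    rmsle = np.sqrt(np.mean((np.log1p(estimated) - np.log1p(actual)) ** 2))  # (3.19)
    gd = np.mean(err) ** 2                                                   # (3.20)
    r = np.corrcoef(actual, estimated)[0, 1]                                 # (3.21)
    rpa = 100.0 * np.sum(np.abs(err) <= tolerance) / n                       # (3.22)
    mae = np.mean(np.abs(err))                                               # (3.24)
    return {"MSE": mse, "RMSLE": rmsle, "GD": gd, "r": r, "RPA": rpa, "MAE": mae}
```

SE of equation (3.23) is the square root of the MSE entry, and COD of equation (3.25) is the square of the r entry, so both follow immediately from the returned dictionary.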
3.5 Deep-Learning-Ant Colony Optimization
(DL-ACO) Estimator
Taking into consideration the evaluation metrics from Section 3.4, the performance of the
DL-ACO method is evaluated and compared against existing methods (refer to [8] (MLP-
GA), [10] (MLP-GA, MLP-SA and MLP-PSO)) by estimating the missing attributes
concurrently, wherever missing data may be ascertained. The scenarios investigated were
such that any sample/record could have at least 62, and at most 97 missing attributes
(dimensions) to be approximated. The MLP network used has a structure of 784-400-784,
with 784 input and output nodes in the input and output layers, respectively, and, 400
nodes in the one hidden layer. This number is obtained by testing the network with
different numbers of nodes in the hidden layer and observing the network structure that
leads to the lowest possible network error.
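This model-selection step can be sketched as follows; `train_fn` and the candidate sizes are hypothetical stand-ins for training a 784-h-784 network and returning its error.

```python
# Hypothetical sketch of the hidden-layer size search described above;
# `train_fn` stands in for training a 784-h-784 MLP and returning its error.
def select_hidden_size(train_fn, candidates=(100, 200, 300, 400, 500)):
    """Return the hidden-layer size with the lowest network error."""
    errors = {h: train_fn(h) for h in candidates}
    return min(errors, key=errors.get)
```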
Figures 3.4-3.7 show the performance and comparison of DL-ACO with MLP-PSO, MLP-
SA and MLP-GA. Figures 3.4 and 3.5 are bar charts that show the MSE and RMSLE
values for DL-ACO when compared to MLP-PSO, MLP-SA and MLP-GA.
Figure 3.4: Mean Squared Error vs Estimation Approach.
Figure 3.5: Root Mean Squared Logarithmic Error vs Estimation Approach.
We observe 0.66%, 5.19%, 31.02% and 30.98% of MSE and 6.06%, 17.98%, 41.21% and
41.25% of RMSLE for DL-ACO, MLP-PSO, MLP-SA and MLP-GA, respectively. DL-
ACO yielded the lowest MSE value when compared to the others. These results are
validated by the correlation coefficient whose bar chart is given in Figure 3.6.
Figure 3.6: Correlation Coefficient vs Estimation Approach.
DL-ACO and MLP-PSO yielded 96.29% and 74.52% correlation values, respectively, while
MLP-SA and MLP-GA showed correlations of -1.88% and -0.56%, respectively.
Figure 3.7: Global Deviation vs Estimation Approach.
MLP-SA and MLP-GA yielded 13.25% and 13.67% of global deviation, respectively, while
DL-ACO and MLP-PSO respectively yielded 0.0078% and 0.98%, as shown in Figure 3.7.
As observed, the DL-ACO approach obtained the best figures for all four metrics presented
diagrammatically.
Table 3.2: DL-ACO Mean Squared Error Objective Value Per Sample.
Sample Dimensions DL-ACO MLP-PSO MLP-SA MLP-GA
1 79 2.52 13.62 15.59 15.59
2 88 2.38 7.34 8.78 8.78
3 75 1.27 5.59 6.76 6.76
4 81 0.26 3.69 5.91 5.91
5 83 0.46 6.57 8.02 8.02
6 82 1.27 4.81 9.76 9.76
7 90 1.91 5.66 15.05 15.05
8 79 1.18 7.58 9.54 9.54
9 76 2.59 7.96 9.48 9.48
10 76 2.86 6.52 12.60 12.60
In Table 3.2, the Dimensions column refers to the number of missing values in a
sample/record. Tables 3.2 and 3.3 further back the findings from Figures 3.4-3.7 showing
that the proposed DL-ACO approach yielded the lowest objective function value in the
estimation of missing values in each sample, as well as the best COD, MAE, SE and SNR
values. Considering the RPA metric, the MLP-PSO approach yielded a better value than
the proposed approach.
Table 3.3: DL-ACO Additional Metrics.
Method DL-ACO MLP-PSO MLP-SA MLP-GA
COD 92.71 55.53 0.0353 0.0032
MAE 3.37 14.82 47.5 47.7
RPA 53.25 53.58 10.75 10.08
SE 8.12 22.78 55.7 55.66
SNR 7.7 57.16 209.04 208.83
In Table 3.4, we present results obtained from statistically analysing the estimates ob-
tained by the DL-ACO approach when compared against the MLP-PSO, MLP-SA and
MLP-GA approaches using the t-test. The t-test null hypothesis (H0) assumes that there
is no significant difference in the means of the missing data estimates obtained by the
DL-ACO, MLP-PSO, MLP-SA and MLP-GA methods. The alternative hypothesis (HA)
however indicates that there is a significant difference in the means of the missing data
estimates obtained by the four methods.
Table 3.4: Statistical Analysis of DL-ACO Results.
Pairs Compared P-Values (95% Confidence Level)
DL-ACO:MLP-PSO 5.31×10^{-15}
DL-ACO:MLP-SA 2×10^{-167}
DL-ACO:MLP-GA 3×10^{-173}
Table 3.4 reveals that there is a significant difference at a 95% confidence level in the
means of the estimates obtained by DL-ACO when compared to MLP-PSO, MLP-SA
and MLP-GA, yielding p-values of 5.31×10^{-15}, 2×10^{-167} and 3×10^{-173}, respectively, when
all three pairs are compared. This therefore indicates that the null hypothesis (H0), which
assumes that there is no significant difference in the means between DL-ACO estimates
and those of the other three methods, can be rejected in favor of the alternative
hypothesis (HA) at a 95% confidence level.
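A minimal sketch of how such a pairwise t-test can be run is given below; the arrays are synthetic placeholders, not the thesis estimates, and the use of SciPy's independent two-sample t-test is an assumption about the exact test variant.

```python
# Sketch of the pairwise two-sample t-test; the arrays below are synthetic
# placeholders, not the thesis estimates.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
est_a = rng.normal(0.5, 0.1, 1000)   # stand-in for one method's estimates
est_b = rng.normal(0.6, 0.1, 1000)   # stand-in for another method's estimates

t_stat, p_value = stats.ttest_ind(est_a, est_b)
# H0 (equal means) is rejected at the 95% confidence level when p < 0.05.
reject_h0 = bool(p_value < 0.05)
```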
Figure 3.8: Top Row: Corrupted Images - Bottom Row: DL-ACO Reconstructed Images.
In the top row of Figure 3.8, we depict 10 images with missing pixel values which are
to be estimated prior to classification tasks being performed by statistical methods. In
the bottom row of the same figure, we show the reconstructed images from using the
DL-ACO approach, while in the top and bottom rows of Figure 3.9, we observe the
reconstructed images when the MLP-PSO and MLP-GA approaches are used, respectively.
The reconstructed images using MLP-PSO and MLP-GA introduce a lot of noise, more
so in the bottom row than in the top row, as opposed to when the DL-ACO approach is
applied. Furthermore, closer inspection reveals that the images are not fully reconstructed
as not all pixel values within the images are estimated correctly.
Figure 3.9: Top Row: MLP-PSO Reconstructed Images - Bottom Row: MLP-GA ReconstructedImages.
3.6 Deep-Learning-Ant Lion Optimizer (DL-ALO)
Estimator
Figures 3.10-3.13 show the performance and comparison of DL-ALO with MLP-PSO,
MLP-SA and MLP-GA. Figures 3.10 and 3.11 are bar charts that show the MSE and
RMSLE values for DL-ALO when compared to MLP-PSO, MLP-SA and MLP-GA. The
MLP network used has a structure of 784-400-784, with 784 input and output nodes in
the input and output layers, respectively, and, 400 nodes in the one hidden layer. This
number is obtained by testing the network with different numbers of nodes in the hidden
layer and observing the network structure that leads to the lowest possible network error.
Figure 3.10: Mean Squared Error vs Estimation Approach.
We observed 1.85%, 4.85%, 32.46% and 29.4% of MSE and 9.03%, 17.43%, 41.98% and
40.15% of RMSLE for DL-ALO, MLP-PSO, MLP-SA and MLP-GA, respectively. DL-
ALO yielded the lowest MSE value when compared to the others. These results are
validated by the correlation coefficient whose bar chart is given in Figure 3.12.
DL-ALO and MLP-PSO yielded 92.49% and 78.62% correlation values, respectively, while
MLP-SA and MLP-GA showed correlations of 4.03% and 8.29%, respectively.
Figure 3.11: Root Mean Squared Logarithmic Error vs Estimation Approach.
Figure 3.12: Correlation Coefficient vs Estimation Approach.
MLP-SA and MLP-GA yielded 10.92% and 11.42% of RPA respectively, while DL-ALO
and MLP-PSO respectively yielded 81.33% and 54.83%, as shown in Figure 3.13. As
observed, the DL-ALO approach obtained the best figures for all four metrics presented
graphically.
Figure 3.13: Relative Prediction Accuracy vs Estimation Approach.
Table 3.5: DL-ALO Mean Squared Error Objective Value Per Sample.
Sample Dimensions DL-ALO MLP-PSO MLP-SA MLP-GA
1 81 1.29 9.46 6.57 6.57
2 72 1.63 7.39 7.65 7.65
3 85 3.65 7.66 10.57 10.57
4 88 1.13 7.75 6.03 6.03
5 77 2.21 6.28 9.09 9.09
6 89 1.45 6.49 13.55 13.55
7 84 2.70 5.79 6.77 6.77
8 75 1.14 5.30 9.11 9.11
9 71 1.22 5.67 6.31 6.31
10 85 1.67 5.63 16.55 16.55
In Table 3.5, the Dimensions column refers to the number of missing values in a sam-
ple/record. Tables 3.5 and 3.6 further back the findings from Figures 3.10-3.13 showing
that the proposed DL-ALO approach yielded the lowest objective function value in the
estimation of missing values in each sample, as well as the best COD, GD, MAE, SE and
SNR values.
Table 3.6: DL-ALO Additional Metrics.
Method DL-ALO MLP-PSO MLP-SA MLP-GA
COD 85.55 61.81 0.16 0.69
GD 0.07 0.87 14.52 12.31
MAE 6.00 14.42 48.71 45.81
SE 13.6 22.01 56.97 54.22
SNR 30.94 51.24 211.47 202.94
In Table 3.7, we present results obtained from statistically analysing the estimates ob-
tained by the DL-ALO approach when compared against the MLP-PSO, MLP-SA and
MLP-GA approaches using the t-test. The t-test null hypothesis (H0) assumes that there
is no significant difference in the means of the missing data estimates obtained by the
DL-ALO, MLP-PSO, MLP-SA and MLP-GA methods. The alternative hypothesis (HA)
however indicates that there is a significant difference in the means of the missing data
estimates obtained by the four methods.
Table 3.7: Statistical Analysis of DL-ALO Results.
Pairs Compared P-Values (95% Confidence Level)
DL-ALO:MLP-PSO 0.46
DL-ALO:MLP-SA 0.34
DL-ALO:MLP-GA 0.16
Table 3.7 indicates that there is no significant difference in the means of the estimates
obtained by DL-ALO when compared to MLP-PSO, MLP-SA and MLP-GA, yielding
p-values of 0.46, 0.34 and 0.16 when DL-ALO is compared to MLP-PSO, MLP-SA and
MLP-GA, respectively, at a 95% confidence level. Therefore, the null hypothesis cannot
be rejected.
Figure 3.14: Top Row: Corrupted Images - Bottom Row: DL-ALO Reconstructed Images.
In the top row of Figure 3.14, we depict 10 images with missing pixel values which are
to be estimated prior to classification tasks being performed by statistical methods. In
the bottom row of the same figure, we show the reconstructed images from using the
DL-ALO approach, while in the top and bottom rows of Figure 3.15, we observe the
reconstructed images when the MLP-PSO and MLP-GA approaches are used, respectively.
The reconstructed images using MLP-PSO and MLP-GA introduce a lot of noise, more
so in the bottom row than in the top row, as opposed to when the DL-ALO approach is
applied. Furthermore, closer inspection reveals that the images are not fully reconstructed
as not all pixel values within the images are estimated correctly.
Figure 3.15: Top Row: MLP-PSO Reconstructed Images - Bottom Row: MLP-GA Recon-structed Images.
3.7 Conclusion
In this chapter, novel and effective ant-based high-dimensional missing data estimator
models were presented and tested on an image recognition dataset. These were then
compared against existing approaches of a similar nature. The results of the experiments
conducted in this chapter indicate that the proposed models can approximate missing
values in the high-dimensional dataset more accurately than existing approaches of the
same nature. It can also be observed that the images reconstructed using the proposed
models are better suited to subsequent statistical analysis and classification tasks than
those obtained using the existing approaches. This is because the existing approaches
introduce a lot of noise into the images, which could skew the findings of any subsequent
analysis.
4. Novel Flight-based Missing Data Estimators
4.1 Introduction
In this chapter, we present the results obtained from analysing the novel flight-based miss-
ing data estimators. We begin in Section 4.2 by presenting the experimental design that
will be implemented throughout the chapter, followed by Section 4.3 in which we present
information on the optimization algorithms that will be used in the chapter. Section
4.4 presents the results obtained from analysing the DL-CS estimator, while Section 4.5
reports on the findings from the analysis of the DL-BAT estimator. Sections 4.6 and 4.7
show the results from analysing the DL-FA estimator and present the key findings from
the chapter, respectively.
4.2 Experimental Design
4.2.1 Statement of Hypothesis and Research Question
It should be discernible at this point that the research done in this work focused on
whether it is possible to effectively estimate missing data entries in a high-dimensional
dataset. We therefore sought to answer two key questions:
• Is it possible to estimate missing data entries in a high-dimensional dataset efficiently
using models comprising a deep auto-encoder network framework with the bat,
cuckoo search and firefly optimization algorithms?
• Is there a relationship between the accuracy of the estimated values and the real
values in the feature variables with missing data?
The responses to these questions which we expect to get in relation to prior research are
detailed in the hypotheses of the research.
4.2.2 Hypothesis Testing
4.2.2.1 Hypothesis One
• It is possible to estimate missing data entries in a high-dimensional dataset efficiently
using models comprising a deep auto-encoder network framework with the bat,
cuckoo search and firefly optimization algorithms.
4.2.2.2 Hypothesis Two
• It is expected that the level of correlation between the estimated values and the real
values be high or low depending on the nature of the dataset.
The dataset used is the Modified National Institute of Standards and Technology (MNIST)
handwritten digit recognition dataset [18]. This dataset comprises 60,000 training images
and 10,000 test images. With each image being a 28 × 28 pixel image, this results in
784 pixel values representing the image, and serving as the input to the model. The
data is preprocessed by normalizing each pixel value to being in the range [0, 1]. Each
of the network layers were put through a pretraining process using restricted Boltzmann
Machines and the contrastive divergence algorithm with the aim being to set the weight
and bias values in a good search space [54]. This resulted in a network structure of size:
784−1000−500−250−30−250−500−1000−784. There are 784 nodes in the input and
output layers, and seven hidden layers with varying number of nodes [54]. This network is
subsequently trained in a supervised learning manner using the stochastic gradient descent
(SGD) algorithm with the objective being to minimize the network error. The network
training procedure is performed using the entire training set of data which is divided into
600 balanced mini-batches each containing 10 examples of each digit class. The weights
and biases are updated after each mini-batch. Missing data is created in the test set in
accordance with the arbitrary pattern as well as the MCAR and MAR mechanisms. The
optimization algorithms are then used to estimate this missing data, and they have as
objective to minimize the cost function of the trained deep network. The error tolerance
is set to 10%. A matrix of the same size as the test set of data is created with values
obtained from a binomial distribution with the required percentage of missingness (10%),
which is then superimposed on the test set to incorporate the intended missing data.
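The mask-based corruption step above can be sketched as follows. This is a minimal sketch under the stated 10% MCAR setting; the array sizes, seed, and the use of NaN to mark missing entries are illustrative assumptions.

```python
# Sketch of the MCAR corruption step: a Bernoulli (binomial, n=1) mask with
# the required 10% missingness is superimposed on the normalized test set.
import numpy as np

rng = np.random.default_rng(42)
test_set = rng.random((10000, 784))       # normalized pixel values in [0, 1]
mask = rng.binomial(1, 0.10, size=test_set.shape).astype(bool)  # 1 -> missing

corrupted = test_set.copy()
corrupted[mask] = np.nan                  # mark the intended missing entries
```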
From this modified test set of data, 300 samples are selected randomly, with 100 samples
used to test the DL-CS method, another 100 samples used to test the DL-BAT approach,
and the last 100 used to test the DL-FA approach. The procedure described above can
be summarized into five consecutive steps:
1) Use the training set of data with complete records to train the individual restricted
Boltzmann Machines by making use of the algorithm described in [99]. The training
procedure starts from the bottom layer. These individual layers are trained for 50
epochs.
2) Create the encoder and decoder parts of the network with tied weights by combining
these RBMs together.
3) Train the deep auto-encoder network obtained in a supervised learning manner by
applying a back-propagation algorithm being the stochastic gradient descent (SGD)
algorithm.
4) Use the trained network as part of the objective function for the optimization algo-
rithms during the missing data estimation procedure. Initially, the known feature
variable values are presented to the objective function, and then the estimated unknown
feature variable values are passed into the objective function.
5) The stopping criteria for the missing data estimation procedure using the optimiza-
tion algorithms are either 40,000 function evaluations having been executed, or there
being no change in the objective function value.
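Step 4 can be sketched as the following objective function. This is a minimal sketch: `autoencoder` is a hypothetical stand-in for the trained deep auto-encoder's forward pass, and the squared-error form of the cost is an assumption.

```python
# Minimal sketch of the estimation objective of step 4; `autoencoder` is a
# hypothetical stand-in for the trained deep auto-encoder's forward pass.
import numpy as np

def estimation_objective(candidate, record, missing_mask, autoencoder):
    """Reconstruction error of one record with candidate values in its gaps.

    candidate: the optimizer's decision variables, bounded in [0, 1]
    record: the observed record (values at missing positions are ignored)
    missing_mask: boolean array, True where a value is missing
    """
    x = record.copy()
    x[missing_mask] = candidate          # splice the estimates into the record
    reconstruction = autoencoder(x)
    return float(np.sum((x - reconstruction) ** 2))
```

The swarm algorithm proposes candidate vectors and keeps the one that minimizes this error, stopping after 40,000 function evaluations or when the objective value no longer changes.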
The MLP network used for comparison against existing approaches has a structure of 784-
400-784, with 784 input and output nodes in the input and output layers, respectively,
and, 400 nodes in the one hidden layer. This number is obtained by testing the network
with different number of nodes in the hidden layer and observing the network structure
that leads to the lowest possible network error.
4.3 Optimization Algorithms
4.3.1 Cuckoo Search (CS)
The CS algorithm is a population-based meta-heuristic technique based on the brood
parasitism trait of certain species of Cuckoo birds [100]. Cuckoos are very interesting
birds, not only courtesy of the serene sounds they make, but also due to their aggressive
reproduction strategies. The design of the CS algorithm is simplified by the assumption
of three main rules being: (i) Each cuckoo lays just one egg at a time and it dumps this
egg in a nest randomly, (ii) The best nests with a high quality of eggs will carry over to
the next generation, and, (iii) The number of nests in which cuckoos can dump their eggs
is fixed, and the eggs laid by cuckoos in these nests can be discovered by the host bird
with a probability, pa ∈ [0, 1] [101]. When the cuckoo egg is discovered, the egg is either
thrown out of the nest, or the host bird abandons the nest and builds a new one. The
last rule is approximated by replacing a fraction pa of the existing nests with new
nests which have new random solutions [101]. In solving maximization
problems, the fitness of a solution can simply be proportional to the value of the objective
function. To further simplify the implementation of the algorithm, it is assumed that
each egg in a nest represents a solution, and a cuckoo egg represents a new solution. The
objective is to use the new and potentially better solutions/cuckoos to replace the not so
good solutions in the nests. These new solutions are given by [101]:
x_i^{(t+1)} = x_i^{(t)} + \alpha \oplus \mathrm{Levy}(\lambda), (4.1)
where α > 0 is the step size which should be related to the scales of the problem of
interest. Often, α = 1. Equation (4.1) essentially represents the stochastic equation for
a random walk process. In a general sense, a random walk is a Markov chain whose next
location is solely reliant upon the current location (the first term in equation (4.1)) and
the transition probability (the second term in equation (4.1)). The ⊕ symbol represents
entry-wise multiplication. The Levy flight basically provides a random walk with the
random step drawn from a Levy distribution like so:
\mathrm{Levy} \sim u = t^{-\lambda}, (4.2)
where t is the step-length and \lambda is the exponent of the heavy-tailed Levy
distribution, which satisfies 1 < \lambda \leq 3. The Levy distribution has an infinite
variance and an infinite mean.
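The cuckoo update of equation (4.1) can be sketched as follows. The Levy-distributed steps are drawn with Mantegna's algorithm, which is an assumption, since the sampling scheme is not specified here.

```python
# Sketch of the cuckoo update of equation (4.1). Levy-distributed steps are
# drawn with Mantegna's algorithm (an assumption about the sampling scheme).
import numpy as np
from math import gamma, sin, pi

def levy_step(size, lam=1.5, rng=None):
    """Draw heavy-tailed steps approximating a Levy(lambda) distribution."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = (gamma(1 + lam) * sin(pi * lam / 2)
             / (gamma((1 + lam) / 2) * lam * 2 ** ((lam - 1) / 2))) ** (1 / lam)
    u = rng.normal(0.0, sigma, size)
    v = rng.normal(0.0, 1.0, size)
    return u / np.abs(v) ** (1 / lam)

def cuckoo_update(x, alpha=1.0, rng=None):
    """x^(t+1) = x^(t) + alpha (entry-wise) Levy(lambda), as in eq. (4.1)."""
    return x + alpha * levy_step(x.shape, rng=rng)
```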
The reason for applying the CS algorithm in this dissertation, although it has been used
in several domains, is courtesy of the fact that it has not been investigated in missing
data estimation tasks. Also, the randomization of the motion equation is more efficient
as the step-length is heavy-tailed in addition to there being a low number of parameters
which need tuning, thereby making the algorithm more generic to adapt to a wider range
of optimization problems.
The CS algorithm has been utilized to minimize the generation and emission costs of
a microgrid while satisfying system hourly demands and constraints [102], while in [103],
it was used to establish the parameters of chaotic systems via an improved cuckoo search
algorithm. The authors in [104] presented a new hybrid algorithm comprising the cuckoo
search algorithm and the Nelder-Mead method, with the aim of solving integer and
minimax optimization problems.
The CS parameters used in the model implementation are given in Table 4.1 except
for the number of decision variables which depends on the number of missing values in
a record. The data was normalized to being in the range [0, 1], meaning the lower and
upper bounds of the decision variables are 0 and 1, respectively. These parameters were
chosen because they resulted in the best outcomes among the different combinations
and permutations of values tested.
Table 4.1: CS Parameters.
Parameter Value
Number of Nests 40
Discovery Rate of Eggs (pa) 0.25
Maximum Number of Iterations 1000
4.3.2 Bat Algorithm (BAT)
The bat algorithm is a meta-heuristic swarm intelligence technique based on the echolo-
cation trait of bats. Bats possess incredible capabilities with their ability to hunt for prey
in the dark using sonar and the Doppler effect. This trait was expressed mathematically
in [105] in the form of an optimization algorithm which was tested against existing op-
timization methods on a set of benchmark functions. There are three main rules based
upon which the algorithm is designed, these being: (i) Each bat uses echolocation to
sense distance and also tell the difference between food/prey and obstacles, (ii) Bats fly
randomly with a velocity and position, and a given frequency, varying wavelength and
also varying loudness to search for their prey, and, (iii) Although there are several ways
in which the loudness could vary, the assumption made is that the loudness varies from a
large positive value, to a low constant value. Each bat moves around the solution space
with a specific velocity and at a given position [106]. There always exists a bat towards
which all other bats move, and this constitutes the current best solution. In addition
to these, the bats in using sonar emit sounds with a frequency, wavelength and loud-
ness. These can be adjusted depending on the proximity of a prey or an obstacle. These
properties are expressed mathematically in equations (4.3), (4.4) and (4.5):
x_i^{t+1} = x_i^{t} + v_i^{t+1}, (4.3)
v_i^{t+1} = v_i^{t} + (x_i^{t} - x_{*}) f_i, (4.4)
and
f_i = f_{min} + (f_{max} - f_{min})\nu, (4.5)
where x_i^{t+1} and v_i^{t+1} are the new position and velocity of the bat at time step
t + 1, x_i^{t} and v_i^{t} are the current position and velocity of the bat at time step
t, x_{*} is the current best solution and \nu \in [0, 1] is a random vector drawn from
U(a, b), with -\infty < a < b < \infty and probability density function
f(x) = \frac{1}{b-a} for a < x < b. f_{max} and f_{min} are the maximum and minimum
frequencies, respectively. The bat algorithm implemented in this dissertation adds an
element of randomness to the motion of the bats in the form of a Levy flight. This
results in the position equation becoming:
x_i^{t+1} = x_i^{t} + v_i^{t+1} + \mathrm{Levy}(\lambda). (4.6)
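The bat motion equations can be sketched as follows. This is a minimal sketch: the frequency bounds and the scaled Cauchy draw standing in for the Levy term are assumptions, not values taken from this work.

```python
# Sketch of the bat motion equations (4.3)-(4.6); the frequency bounds and the
# scaled Cauchy draw standing in for the Levy term are assumptions.
import numpy as np

def bat_update(x, v, x_best, f_min=0.0, f_max=2.0, rng=None):
    """One move of a bat at position x with velocity v toward x_best."""
    rng = np.random.default_rng() if rng is None else rng
    nu = rng.random(x.shape)                    # nu ~ U(0, 1)
    f = f_min + (f_max - f_min) * nu            # eq. (4.5): frequency
    v_new = v + (x - x_best) * f                # eq. (4.4): velocity
    levy = 0.01 * rng.standard_cauchy(x.shape)  # heavy-tailed random step
    x_new = x + v_new + levy                    # eq. (4.6): position
    return x_new, v_new
```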
In [107], the bat algorithm was implemented to solve multi-objective optimization tasks,
which comprise most engineering problems, while in [108], it was applied as an approach
to solve topology optimization problems. The authors in [109] used the bat algorithm
to optimize mono/multi objective tasks linked to brushless DC wheel motors. Some of
the inherent advantages of the algorithm are that it strikes a good balance between
exploitation (by using the loudness and pulse emission rate of bats) and exploration (as
in the standard Particle Swarm Optimization (PSO)), and that [105] reveals a guarantee
of attaining a global optimum solution in the search for an optimal point; these are the
reasons for it being selected. The main disadvantage, however, is that the convergence
rate of the optimization process depends on the fine adjustment of the algorithm
parameters.
The BA parameters used in the model implementation are given in Table 4.2 except
for the number of decision variables which depends on the number of missing values in
a record. The data was normalized to being in the range [0, 1], meaning the lower and
upper bounds of the decision variables are 0 and 1, respectively. These parameters were
chosen because they resulted in the best outcomes among the different combinations
and permutations of values tested.
Table 4.2: BAT Parameters.
Parameter Value
Population Size 40
Loudness 0.25
Pulse Rate 0.5
Number of Generations 1000
4.3.3 Firefly Algorithm (FA)
FA can be defined as a meta-heuristic algorithm that mimics the flashing behavior of
fireflies [110]. It relies on three main assumptions: (i) All fireflies are unisex, so
any firefly can be attracted to any other firefly, (ii) Attractiveness is proportional
to brightness, and both decrease as the distance between fireflies increases, and,
(iii) The brightness of a firefly is determined by the landscape of the objective
function [110]. Given that the attractiveness of a firefly is proportional to its
brightness, the equation below defines the manner in which the attractiveness varies
with distance:
\beta = \beta_0 e^{-\gamma r^2}. (4.7)
In (4.7), β represents the attractiveness trait, β0 defines the original attractiveness value,
γ represents the absorption coefficient and r defines the distance between fireflies. The
motion equation of a firefly in the direction of a brighter one is defined by:
x_i^{t+1} = x_i^{t} + \beta_0 e^{-\gamma r_{ij}^2}(x_j^{t} - x_i^{t}) + \alpha_t \epsilon_i^{t}. (4.8)
Here, x_i and x_j denote the positions of two fireflies, with the second term arising
from the attraction between them. Different time steps are represented by t and t + 1;
in the third term, \alpha is the randomization parameter that controls the step size,
and \epsilon is a vector of random numbers drawn from a uniform distribution [54]. If \beta_0
is equal to zero, the motion of fireflies becomes a basic random walk [110]. If γ is equal to
zero, the motion is reduced to an alternative version of the particle swarm optimization
algorithm [110].
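The position update in (4.8) can be sketched as follows. This is an illustrative implementation, not the thesis's code: β0, γ and α default to the values in Table 4.3, and ε is drawn from a uniform distribution centred on zero (an assumption, since the exact range is not stated here).

```python
import numpy as np

def firefly_move(x_i, x_j, beta0=0.2, gamma=1.0, alpha=0.5, rng=None):
    """Move firefly i toward a brighter firefly j, following (4.7)-(4.8)."""
    rng = np.random.default_rng() if rng is None else rng
    r2 = np.sum((x_i - x_j) ** 2)                 # squared distance r_ij^2
    beta = beta0 * np.exp(-gamma * r2)            # attractiveness, equation (4.7)
    eps = rng.uniform(-0.5, 0.5, size=x_i.shape)  # random vector (assumed range)
    return x_i + beta * (x_j - x_i) + alpha * eps

# With beta0 = 0 the attraction term vanishes and the move is a random walk.
x_new = firefly_move(np.array([0.4, 0.6]), np.array([0.5, 0.5]))
```

In this setting the result would also be clipped back to [0, 1], since the decision variables are bounded by the data normalization.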
The FA parameters used in the model implementation are given in Table 4.3 except
for the number of decision variables which depends on the number of missing values in
a record. The data were normalized to the range [0, 1], meaning the lower and
upper bounds of the decision variables are 0 and 1, respectively. These parameters were
chosen because, out of the different combinations and permutations of values tested,
they produced the best outcomes.
Table 4.3: FA Parameters.
Parameter Value
Number of Fireflies 40
Randomness (α) 0.5
Attractiveness (β) 0.2
Absorption Coefficient (γ) 1
Number of Iterations 1000
4.4 Deep Learning-Cuckoo Search (DL-CS)
Estimator
Taking into consideration the evaluation metrics from Section 3.4, the performance of the
DL-CS method was evaluated and compared against existing methods (refer to [8] (MLP-GA)
and [10] (MLP-GA, MLP-SA and MLP-PSO)) by estimating all of a record's missing attributes
concurrently, wherever the missing data occurred. The scenarios investigated were
such that any sample/record could have at least 62, and at most 97, missing attributes
(dimensions) to be approximated.
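As a sketch of the estimation setup (the objective is defined formally in Chapter 3, so this is illustrative only): candidate values proposed by the optimizer are substituted into the missing positions of a record, the filled record is passed through the trained deep autoencoder, and the reconstruction error serves as the objective the swarm minimises. The `autoencoder` callable below is a stand-in for the trained network.

```python
import numpy as np

def estimation_objective(candidate, record, missing_mask, autoencoder):
    """Per-record objective minimised by the swarm optimiser (a sketch).

    candidate    -- trial values for the missing entries
    record       -- the record, with known entries in place
    missing_mask -- boolean array, True where a value is missing
    autoencoder  -- callable mapping a full record to its reconstruction
                    (stand-in for the trained deep autoencoder)
    """
    x = record.copy()
    x[missing_mask] = candidate              # substitute trial values
    x_hat = autoencoder(x)                   # reconstruct the filled record
    return float(np.mean((x - x_hat) ** 2))  # reconstruction error (MSE)

# Toy check with an identity "autoencoder": the error is zero by construction.
rec = np.array([0.2, 0.0, 0.9])
mask = np.array([False, True, False])
err = estimation_objective(np.array([0.5]), rec, mask, lambda v: v)  # 0.0
```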
Figures 4.1-4.4 show the performance and comparison of DL-CS with MLP-PSO, MLP-
SA and MLP-GA. Figures 4.1 and 4.2 are bar charts that show the MSE and RMSLE
values for DL-CS when compared to MLP-PSO, MLP-SA and MLP-GA.
Figure 4.1: Mean Squared Error vs Estimation Approach.
We observed MSE values of 0.62%, 5.58%, 30.95% and 31.85%, and RMSLE values of 5.89%,
18.55%, 41.23% and 41.71%, for DL-CS, MLP-PSO, MLP-SA and MLP-GA, respectively. DL-CS
yielded the lowest MSE value of the four methods. These results are corroborated
by the correlation coefficients shown in Figure 4.3.
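For reference, the error metrics compared here can be computed as follows. These are standard definitions; the thesis's exact formulations are given in Section 3.4.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error."""
    return float(np.mean((y_true - y_pred) ** 2))

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error; assumes non-negative values,
    which holds for data normalised to [0, 1]."""
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

def correlation(y_true, y_pred):
    """Pearson correlation coefficient, expressed as a percentage."""
    return float(np.corrcoef(y_true, y_pred)[0, 1] * 100)
```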
Figure 4.2: Root Mean Squared Logarithmic Error vs Estimation Approach.
Figure 4.3: Correlation Coefficient vs Estimation Approach.
DL-CS and MLP-PSO yielded 96.19% and 71.57% correlation values, respectively, while
MLP-SA and MLP-GA showed correlations of 1.44% and 1.03%, respectively.
Figure 4.4: Relative Prediction Accuracy vs Estimation Approach.
MLP-SA and MLP-GA yielded RPA values of 9.25% and 11.67%, respectively, while DL-CS and
MLP-PSO yielded 87.92% and 54.58%, respectively, as shown in Figure 4.4. As observed,
the DL-CS approach obtained the best figures for all four metrics presented graphically.
Figure 4.5: Top Row: Corrupted Images - Bottom Row: DL-CS Reconstructed Images.
In the top row of Figure 4.5, we depict 10 images with missing pixel values which are to
be estimated prior to classification tasks being performed by statistical methods. In the
bottom row of the same figure, we show the images reconstructed using the DL-CS
approach, while in the top and bottom rows of Figure 4.6, we observe the reconstructed
images when the MLP-PSO and MLP-GA approaches are used, respectively. The recon-
structed images using MLP-PSO and MLP-GA introduce a lot of noise, more so in the
bottom row than in the top row, as opposed to when the DL-CS approach is applied.
Furthermore, closer inspection reveals that the images are not fully reconstructed as not
all pixel values within the images are estimated correctly.
Figure 4.6: Top Row: MLP-PSO Reconstructed Images - Bottom Row: MLP-GA ReconstructedImages.
Table 4.4: DL-CS Mean Squared Error Objective Value Per Sample.
Sample Dimensions DL-CS MLP-PSO MLP-SA MLP-GA
1 83 2.89 5.72 9.26 9.26
2 75 2.84 8.94 14.22 14.22
3 85 1.29 5.73 6.77 6.77
4 74 3.45 7.72 16.06 16.06
5 66 1.78 6.79 10.33 10.33
6 74 1.10 5.37 9.12 9.12
7 82 3.19 9.31 11.79 11.79
8 77 2.97 10.38 14.64 14.64
9 74 3.51 8.35 8.49 8.49
10 81 1.25 5.67 15.36 15.36
In Table 4.4, the Dimensions column refers to the number of missing values in a
sample/record. Tables 4.4 and 4.5 further support the findings from Figures 4.1-4.4,
showing that the proposed DL-CS approach yielded the lowest objective function value
in the estimation of missing values in each sample, as well as the best COD, GD, MAE,
SE and SNR values.
Table 4.5: DL-CS Additional Metrics.
DL-CS MLP-PSO MLP-SA MLP-GA
COD 92.52 51.22 0.02 0.01
GD 0.01 1.23 14.93 15.22
MAE 3.57 15.17 47.63 48.07
SE 7.85 23.61 55.63 56.44
SNR 8.36 60.35 194.62 189.37
In Table 4.6, we present the results of statistically analysing the estimates obtained
by the DL-CS approach when compared against the MLP-PSO, MLP-SA and MLP-GA
approaches using the t-test. The t-test null hypothesis (H0) assumes that there is no
significant difference in the means of the missing data estimates obtained by the DL-CS,
MLP-PSO, MLP-SA and MLP-GA methods. The alternative hypothesis (HA) however
indicates that there is a significant difference in the means of the missing data estimates
obtained by the four methods.
Table 4.6: Statistical Analysis of DL-CS Results.
Pairs Compared P-Values (95% Confidence Level)
DL-CS:MLP-PSO 3.7*10−19
DL-CS:MLP-SA 4.6*10−50
DL-CS:MLP-GA 4.6*10−50
Table 4.6 reveals that there is a significant difference at a 95% confidence level in the
means of the estimates obtained by DL-CS when compared to MLP-PSO, MLP-SA and
MLP-GA, yielding p-values of 3.7*10−19, 4.6*10−50 and 4.6*10−50, respectively, when all
three pairs are compared. This therefore indicates that the null hypothesis (H0), which
assumes that there is no significant difference in the means between DL-CS and the
other three methods, can be rejected in favor of the alternative hypothesis (HA) at a 95%
confidence level.
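A comparison of this kind can be reproduced with a two-sample t-test. The sketch below uses illustrative data, not the thesis's estimates, and assumes a pooled-variance test with equal sample sizes: the magnitude of the t statistic is compared against the large-sample critical value of roughly 1.96 at the 95% confidence level. In practice a library routine such as `scipy.stats.ttest_ind` would return the exact p-value.

```python
import numpy as np

def t_statistic(a, b):
    """Two-sample t statistic with pooled variance and equal sample sizes."""
    n = len(a)
    pooled_var = (np.var(a, ddof=1) + np.var(b, ddof=1)) / 2.0
    return float((np.mean(a) - np.mean(b)) / np.sqrt(2.0 * pooled_var / n))

rng = np.random.default_rng(0)
est_a = rng.normal(0.50, 0.05, size=400)  # hypothetical estimates, method A
est_b = est_a + 0.05                      # method B, shifted by a constant
reject_h0 = abs(t_statistic(est_a, est_b)) > 1.96  # significant at 95% level
```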
4.5 Deep Learning-Bat Algorithm (DL-BAT)
Estimator
In this analysis, the novel DL-BAT approach is compared against existing approaches in
the literature (MLP-PSO [10], MLP-SA [10] and MLP-GA ( [8] and [10])). The results
are grouped in Figures 4.7-4.10 and Tables 4.7 and 4.8.
The results reveal that the DL-BAT approach outperforms the other approaches. The
squared error is given in Figure 4.7. It shows an error of 9.45% for DL-BAT, while
MLP-PSO, MLP-SA and MLP-GA obtain error values of 22.61%, 55.61% and 56.04%, respectively.
Figure 4.7: Squared Error vs Estimation Approach.
Figure 4.8: Correlation Coefficient vs Estimation Approach.
Figures 4.8 and 4.9 show the correlation coefficient and relative prediction accuracy of
the four approaches analysed, including the novel DL-BAT approach. They both confirm
better performance of the DL-BAT approach when compared to the others. DL-BAT
exhibits a correlation of 95.86% with an RPA of 85%, while MLP-PSO obtains a correlation
coefficient of 76.93%, and MLP-SA and MLP-GA show correlations of -2.06% and -3.06%,
respectively. MLP-PSO achieves an RPA of 56.33%, while MLP-SA and MLP-GA produce
values of 9.92% and 10.17%, respectively.
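Relative prediction accuracy can be read as the share of estimates falling within a tolerance of the true value. The sketch below uses a 10% relative tolerance as an illustrative choice; Section 3.4 gives the thesis's exact definition.

```python
import numpy as np

def rpa(y_true, y_pred, tol=0.10):
    """Percentage of estimates within a relative tolerance of the true value
    (one common reading of relative prediction accuracy)."""
    close = np.abs(y_pred - y_true) <= tol * np.abs(y_true)
    return float(100.0 * np.mean(close))
```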
Figure 4.9: Relative Prediction Accuracy vs Estimation Approach.
In Figure 4.10, we depict the root mean squared logarithmic error values obtained from
analysing the methods. It can be observed that the DL-BAT approach yields the lowest
RMSLE value of 7.11%, while the second-best performer, MLP-PSO, shows a value of
17.65%. The MLP-SA and MLP-GA approaches produce RMSLE values of 41% and 41.32%,
respectively.
Figure 4.10: Root Mean Squared Logarithmic Error vs Estimation Approach.
These findings are further supported by the values in Table 4.7, with the DL-BAT system
yielding the best COD, GD, MAE, MSE and SNR values.
Table 4.7: DL-BAT Additional Metrics.
Method DL-BAT MLP-PSO MLP-SA MLP-GA
COD 91.89 59.19 0.04 0.09
GD 0.06 0.8 12.1 12.36
MAE 4.73 14.3 47.74 48.15
MSE 0.89 5.11 30.92 31.4
SNR 8.44 53.47 228.91 230.52
Considering Table 4.8, it is observed that the proposed DL-BAT approach results in the
best objective function value per record during the estimation of all missing values within
that record.
Table 4.8: DL-BAT Mean Squared Error Objective Value Per Instance.
Sample Dimensions DL-BAT MLP-PSO MLP-SA MLP-GA
1 66 5.09 9.05 9.02 9.02
2 69 1.62 5.26 8.60 8.60
3 73 0.32 3.69 4.74 4.74
4 85 3.90 9.22 18.54 18.54
5 83 2.20 7.28 15.83 15.83
6 92 2.74 8.59 8.79 8.79
7 77 3.08 7.56 13.44 13.44
8 82 0.53 6.07 4.46 4.46
9 84 2.29 6.51 17.09 17.09
10 63 0.52 3.21 1.82 1.82
In Table 4.9, we present the results of statistically analysing the estimates obtained
by the DL-BAT approach when compared against the MLP-PSO, MLP-SA and
MLP-GA approaches using the t-test. The t-test null hypothesis (H0) assumes that there
is no significant difference in the means of the missing data estimates obtained by the
DL-BAT, MLP-PSO, MLP-SA and MLP-GA methods. The alternative hypothesis (HA)
however indicates that there is a significant difference in the means of the missing data
estimates obtained by the four methods.
Table 4.9: Statistical Analysis of DL-BAT Results.
Pairs Compared P-Values (95% Confidence Level)
DL-BAT:MLP-PSO 1.2*10−07
DL-BAT:MLP-SA 2.0*10−134
DL-BAT:MLP-GA 9.0*10−137
Table 4.9 reveals that there is a significant difference at a 95% confidence level in the
means of the estimates obtained by DL-BAT when compared to MLP-PSO, MLP-SA and
MLP-GA, yielding p-values of 1.2*10−07, 2.0*10−134 and 9.0*10−137, respectively, when all
three pairs are compared. This therefore indicates that the null hypothesis (H0), which
assumes that there is no significant difference in the means between DL-BAT and the
other three methods, can be rejected in favour of the alternative hypothesis (HA) at a 95%
confidence level.
Figure 4.11: Top Row: Corrupted Images - Bottom Row: DL-BAT Reconstructed Images.
In the top row of Figure 4.11, we depict 10 images with missing pixel values which are
to be estimated prior to classification tasks being performed by statistical methods. In
the bottom row of the same figure, we show the images reconstructed using the
DL-BAT approach, while in the top and bottom rows of Figure 4.12, we observe the
reconstructed images when the MLP-PSO and MLP-GA approaches are used, respectively.
The reconstructed images using MLP-PSO and MLP-GA introduce a lot of noise, more
so in the bottom row than in the top row, as opposed to when the DL-BAT approach is
applied. Furthermore, closer inspection reveals that the images are not fully reconstructed
as not all pixel values within the images are estimated correctly.
Figure 4.12: Top Row: MLP-PSO Reconstructed Images - Bottom Row: MLP-GA Recon-structed Images.
4.6 Deep Learning-Firefly Algorithm (DL-FA)
Estimator
In this analysis, the novel DL-FA approach is compared against existing approaches in the
literature (MLP-PSO [10], MLP-SA [10] and MLP-GA [8] [10]). The results are grouped
in Figures 4.13-4.16 and Tables 4.10 and 4.11.
The results reveal that the DL-FA approach outperforms the other approaches. The
global deviation is given in Figure 4.13. It shows a GD value of 0.27% for DL-FA,
while MLP-PSO, MLP-SA and MLP-GA obtain GD values of 0.97%, 12.5% and 13.36%,
respectively.
Figure 4.13: Global Deviation vs Estimation Approach.
Figures 4.14 and 4.15 show the mean squared error and root mean squared logarithmic
error values of the four approaches analysed, including the novel DL-FA approach. They
both confirm better performance of the DL-FA approach when compared to the others.
DL-FA exhibits an MSE of 2.24% with an RMSLE of 11.79%, while MLP-PSO obtains an MSE
of 5.83%, and MLP-SA and MLP-GA obtain MSE values of 30.81% and 33.27%, respectively.
MLP-PSO shows an RMSLE of 18.78%, while MLP-SA and MLP-GA produce values of
40.94% and 42.42%, respectively.
Figure 4.14: Mean Squared Error vs Estimation Approach.
Figure 4.15: Root Mean Squared Logarithmic Error vs Estimation Approach.
Considering Figure 4.16, we observe that the DL-FA approach produces the highest cor-
relation coefficient value of 90.22%, with the second best value of 74.44% obtained by the
MLP-PSO method. MLP-SA and MLP-GA yield correlation coefficient values of 2.67%
and -5.18%, respectively.
Figure 4.16: Correlation Coefficient vs Estimation Approach.
These findings are further supported by the values in Table 4.10, with the DL-FA system
yielding the best COD, MAE, RPA, SE and SNR values.
Table 4.10: DL-FA Additional Metrics.
Method DL-FA MLP-PSO MLP-SA MLP-GA
COD 81.41 55.41 0.07 0.27
MAE 10.05 15.83 47.57 49.8
RPA 56.75 51.75 10.75 8.25
SE 14.98 24.15 55.51 57.68
SNR 22.42 61.1 221.36 236.98
With regard to Table 4.11, it is observed that the proposed DL-FA approach results in
the best objective function value per record during the estimation of all missing values
within that record.
Table 4.11: DL-FA Mean Squared Error Objective Value Per Sample.
Sample Dimensions DL-FA MLP-PSO MLP-SA MLP-GA
1 74 1.27 2.44 4.11 4.11
2 74 2.93 7.42 8.56 8.56
3 72 12.34 17.90 20.69 20.69
4 72 1.36 5.58 4.28 4.28
5 73 2.97 4.99 12.23 12.23
6 75 2.86 5.90 13.84 13.84
7 78 5.46 8.76 13.59 13.59
8 78 3.43 11.34 11.34 6.92
9 84 1.74 6.00 6.72 6.72
10 97 5.87 17.65 19.58 19.58
In Table 4.12, we present the results of statistically analysing the estimates obtained
by the DL-FA approach when compared against the MLP-PSO, MLP-SA and
MLP-GA approaches using the t-test. The t-test null hypothesis (H0) assumes that there
is no significant difference in the means of the missing data estimates obtained by the
DL-FA, MLP-PSO, MLP-SA and MLP-GA methods. The alternative hypothesis (HA)
however indicates that there is a significant difference in the means of the missing data
estimates obtained by the four methods.
Table 4.12: Statistical Analysis of DL-FA Results.
Pairs Compared P-Values (95% Confidence Level)
DL-FA:MLP-PSO 4.09*10−05
DL-FA:MLP-SA 2.0*10−132
DL-FA:MLP-GA 1.0*10−140
Table 4.12 reveals that there is a significant difference at a 95% confidence level in the
means of the estimates obtained by DL-FA when compared to MLP-PSO, MLP-SA and
MLP-GA, yielding p-values of 4.09*10−05, 2.0*10−132 and 1.0*10−140, respectively, when
all three pairs are compared. This therefore indicates that the null hypothesis (H0), which
assumes that there is no significant difference in the means between DL-FA and the other
three methods, can be rejected in favour of the alternative hypothesis (HA) at a 95%
confidence level.
Figure 4.17: Top Row: Corrupted Images - Bottom Row: DL-FA Reconstructed Images.
In the top row of Figure 4.17, we depict 10 images with missing pixel values which
are to be estimated prior to classification tasks being performed by statistical methods.
In the bottom row of the same figure, we show the images reconstructed using
the DL-FA approach, while in the top and bottom rows of Figure 4.18, we observe the
reconstructed images when the MLP-PSO and MLP-GA approaches are used, respectively.
The reconstructed images using MLP-PSO and MLP-GA introduce a lot of noise, more
so in the bottom row than in the top row, as opposed to when the DL-FA approach is
applied. Furthermore, closer inspection reveals that the images are not fully reconstructed
as not all pixel values within the images are estimated correctly.
Figure 4.18: Top Row: MLP-PSO Reconstructed Images - Bottom Row: MLP-GA Recon-structed Images.
4.7 Conclusion
In this chapter, novel flight-based high-dimensional missing data estimator models were
presented and tested on an image recognition dataset. These were then compared against
existing approaches of a similar nature. The results obtained from the experiments
conducted in this chapter indicate that the proposed models can approximate missing
values in the high-dimensional dataset more accurately than existing approaches of
the same nature. It can also be observed that the images reconstructed using the
proposed models are better suited to subsequent statistical analysis and classification
tasks than those obtained using the existing approaches. This is because the existing
approaches introduce a lot of noise into the images, which could skew the findings of
any subsequent analysis.
5. Novel Plant-based Missing Data Estimator and Comparative Analysis
5.1 Introduction
In this chapter, we present the results obtained from analysing the novel plant-based
missing data estimator, as well as a comparison of all six methods proposed. We begin
in Section 5.2 by presenting the hypotheses that will be investigated in Section 5.4. In
Section 5.3 we present information on the optimization algorithm that will be used in the
chapter, followed by Section 5.4 in which we present the results obtained from analysing
the DL-IWO estimator. Section 5.5 reports on the findings from the analysis of all six
approaches proposed. Finally, Section 5.6 presents the key findings from the chapter.
5.2 Experimental Design
The missing data estimation framework used in this chapter is similar to the one designed
in Section 3.2 and implemented throughout Chapter 3.
5.2.1 Statement of Hypothesis and Research Question
It should be clear at this point that the research in this work focused on whether
missing data entries in a high-dimensional dataset can be estimated effectively. We
therefore aimed to answer two key questions:
• Is it possible to estimate missing data entries in a high-dimensional dataset efficiently
using models comprising a deep auto-encoder network framework with the invasive
weed optimization algorithm?
• Is there a relationship between the accuracy of the estimated values and the real
values in the feature variables with missing data?
The answers we expect to these questions, in relation to prior research, are detailed in
the research hypotheses.
5.2.2 Hypothesis Testing
5.2.2.1 Hypothesis One
• It is possible to estimate missing data entries in a high-dimensional dataset efficiently
using models comprising a deep auto-encoder network framework with the invasive
weed optimization algorithm.
5.2.2.2 Hypothesis Two
• It is expected that the level of correlation between the estimated values and the real
values will be high or low depending on the nature of the dataset.
5.3 Optimization Algorithm
5.3.1 Invasive Weed Optimization (IWO)
In [111], the authors proposed the Invasive Weed Optimization (IWO) algorithm, which
mimics the invasive behavior of weeds. In general, weeds are plants that grow in an
area where they are not wanted. In horticulture specifically, the term weed refers to a
plant whose growth habits pose a threat to cultivated plants. Weeds display fascinating
traits such as adaptivity and robustness. In the invasive weed optimization algorithm,
weeds are represented by points in the solution space, and a colony of points grows
toward an optimal value [112].
Suppose D represents the dimension of the problem, implying that the dimension of
the search space is R^D. Further assume that P_init is the initial weed population size
and P_max is the upper bound on the population size, such that 1 ≤ P_init ≤ P_max [112].
Also, let W = {W_1, . . . , W_{|W|}} define the set of weeds [112]. Each weed
W_i ∈ R^D represents a location in the solution space. Computing the fitness of a weed
requires a fitness function of the form F : R^D → R [112].
There are two main parts to the IWO algorithm: initialization and iteration. In the
initialization step, the generation counter is set to zero, G = 0. Subsequently, the initial
population, W, is generated randomly by creating P_init weeds with uniformly distributed
values [112]:
W_i ∼ U(X_min, X_max)^D. (5.1)
X_min and X_max represent the lower and upper bounds of the solution space, respectively,
and are problem specific.
In the iteration step, each weed in the current population produces a number of seeds.
S_num defines the number of seeds and is proportional to the fitness value of the weed
being considered. This implies that it is linearly mapped between the population's worst
and best fitness values, F_worse and
F_best [112]:
S_num = S_min + [(F(W_i) − F_worse) / (F_best − F_worse)] (S_max − S_min). (5.2)
In equation (5.2), S_min and S_max represent the lower and upper bounds on the seeds
allowed for each weed [112]. All the S_num seeds, S_j, are generated near the current weed
by making use of a Gaussian distribution with varying standard deviation and zero mean [112]:
S_j = W_i + N(0, σ_G)^D. (5.3)
In equation (5.3), 1 ≤ j ≤ S_num. The standard deviation, σ_G, begins at σ_init and is
reduced in a non-linear manner over the entire run to σ_final. The standard deviation for
the current generation is calculated by using [112]:
σ_G = σ_final + [(N_iter − G)^{σ_mod} / (N_iter)^{σ_mod}] (σ_init − σ_final), (5.4)
where N_iter represents the upper bound on the number of generations and σ_mod is the
non-linear modulation index. Following these procedures, the next population is generated
by uniting the current population with all the seeds generated by all the weeds. If the
size of the new population exceeds P_max, the population is sorted by fitness value and
the P_max best weeds are preserved; the least favorable weeds are removed.
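The iteration step above can be sketched as follows, using the parameter values of Table 5.1 as defaults. Fitness is treated here as higher-is-better, so for the mean-squared-error objective of the estimator the negated error would be used (an assumption on our part).

```python
import numpy as np

def seed_count(fitness, f_worse, f_best, s_min=0, s_max=5):
    """Seeds granted to a weed, linearly mapped by fitness as in (5.2)."""
    frac = (fitness - f_worse) / (f_best - f_worse)
    return int(round(s_min + frac * (s_max - s_min)))

def sigma_at(g, n_iter, sigma_init, sigma_final, sigma_mod=2):
    """Dispersion standard deviation for generation g, decreasing
    non-linearly from sigma_init to sigma_final as in (5.4)."""
    return sigma_final + ((n_iter - g) ** sigma_mod / n_iter ** sigma_mod) * (
        sigma_init - sigma_final)

def spawn_seeds(weed, n_seeds, sigma, rng):
    """Gaussian seeds scattered around the parent weed as in (5.3)."""
    return weed + rng.normal(0.0, sigma, size=(n_seeds, weed.size))
```

After spawning, the seeds are united with the current population; once the population exceeds P_max (40 in Table 5.1), it is sorted by fitness and truncated to the P_max best weeds.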
This algorithm has been used to investigate the time-cost-quality trade-off in projects
by the authors in [113]. Also, the authors in [114] developed a novel receiver that merges
constant modulus approach (CMA) blind adaptive multiuser detection with IWO for
multi-carrier code division multiple access (MC-CDMA). In [115], the services selection
problem is modeled as a non-linear optimization problem with constraints and solved us-
ing a discrete version of the IWO algorithm. Unconstrained and constrained optimization
problems are solved using a hybrid IWO and firefly algorithm in [116]. Finally, in [117],
the problem of minimizing the total weighted tardiness and earliness criteria on a single
machine is considered.
The IWO parameters used in the model implementation are given in Table 5.1 except
for the number of dimensions which depends on the number of missing values in a record.
The data were normalized to the range [0, 1], meaning the lower and upper bounds
of the decision variables are 0 and 1, respectively. These parameters were chosen because
they produced the best outcomes out of the different combinations and permutations
of values tested.
Table 5.1: IWO Parameters.
Parameter Value
Initial Population Size 10
Maximum Population Size 40
Minimum Number of Seeds 0
Maximum Number of Seeds 5
Variance Reduction Exponent 2
Maximum Number of Iterations 1000
5.4 Deep Learning-Invasive Weed Optimization
(DL-IWO) Estimator
In this analysis, the novel DL-IWO approach is compared against existing approaches in
the literature (MLP-PSO [10], MLP-SA [10] and MLP-GA ( [8] and [10])). The results are
grouped in Figures 5.1-5.4 and Tables 5.2 and 5.3. The results reveal that the DL-IWO
approach outperforms all the other approaches. The mean squared error is given in Figure
5.1. It shows an error of 0.45% for DL-IWO, while MLP-PSO, MLP-SA and MLP-GA yield
errors of 5.53%, 31.86% and 32.61%, respectively.
Figure 5.2 depicts the root mean squared logarithmic error for the approaches analysed,
including the novel DL-IWO approach. It confirms better performance of the DL-IWO
approach when compared to the other approaches. DL-IWO exhibits an RMSLE of 5.11%,
while MLP-PSO obtains 18.58%. MLP-SA and MLP-GA show RMSLE values of 41.77% and
42.16%, respectively.
Figure 5.1: Mean Squared Error vs Estimation Approach.
Figure 5.2: Root Mean Squared Logarithmic Error vs Estimation Approach.
In Figure 5.3, the RPA values of the approaches are revealed, while in Figure 5.4, we depict the correlation coefficient values of these approaches. In Figure 5.3, DL-IWO yields the best RPA value of 88.25%, compared to MLP-PSO, MLP-SA and MLP-GA, which yield values of 54.42%, 8.75% and 10.25%, respectively.
Figure 5.3: Relative Prediction Accuracy vs Estimation Approach.
Furthermore, in Figure 5.4, the DL-IWO approach shows the best correlation coefficient value of 97.67%, while MLP-SA and MLP-GA show correlations of -3.7% and 0.28% between the estimates and the real values. The MLP-PSO approach is the second-best performer, with a value of 73.65%.
Figure 5.4: Correlation Coefficient vs Estimation Approach.
Table 5.2: DL-IWO Mean Squared Error Objective Value Per Sample.

| Sample | Dimensions | DL-IWO | MLP-PSO | MLP-SA | MLP-GA |
| --- | --- | --- | --- | --- | --- |
| 1 | 76 | 2.67 | 11.48 | 9.04 | 9.04 |
| 2 | 89 | 0.69 | 6.01 | 5.81 | 5.81 |
| 3 | 82 | 4.74 | 11.26 | 14.80 | 14.80 |
| 4 | 86 | 1.21 | 8.42 | 6.19 | 6.19 |
| 5 | 82 | 2.59 | 7.03 | 7.70 | 7.70 |
| 6 | 62 | 1.02 | 4.65 | 8.71 | 8.71 |
| 7 | 88 | 1.69 | 7.63 | 12.61 | 12.61 |
| 8 | 71 | 4.89 | 12.21 | 12.09 | 12.09 |
| 9 | 79 | 1.01 | 4.24 | 8.14 | 8.14 |
| 10 | 84 | 1.34 | 7.06 | 13.36 | 13.36 |
In Table 5.2, the Dimensions column refers to the number of missing values in a sample/record. Tables 5.2 and 5.3 further support the findings from Figures 5.1-5.4, showing that the proposed DL-IWO approach yielded the lowest objective function values in the estimation of missing values in each sample, as well as the best COD, GD, MAE, SE and SNR values.
Table 5.3: DL-IWO Additional Metrics.

| Metric | DL-IWO | MLP-PSO | MLP-SA | MLP-GA |
| --- | --- | --- | --- | --- |
| COD | 95.4 | 54.24 | 0.14 | 0.08 |
| GD | 0.02 | 1.1 | 13.89 | 14.83 |
| MAE | 3.31 | 15.2 | 48.72 | 48.91 |
| SE | 6.67 | 23.52 | 56.44 | 57.1 |
| SNR | 5.16 | 59.76 | 219.89 | 205.01 |
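Two of the additional metrics in Table 5.3 can be sketched under their conventional definitions; the GD, SE and SNR metrics admit several formulations and are not reproduced here:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between the real values and the estimates."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

def cod(y_true, y_pred):
    """Coefficient of determination, taken here as the squared Pearson
    correlation between the estimates and the real values."""
    r = np.corrcoef(np.asarray(y_true, float), np.asarray(y_pred, float))[0, 1]
    return float(r ** 2)
```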
In Table 5.4, we present results obtained from statistically analysing, using the t-test, the estimates obtained by the DL-IWO approach against those of the MLP-PSO, MLP-SA and MLP-GA approaches. The t-test null hypothesis (H0) assumes that there is no significant difference in the means of the missing data estimates obtained by the DL-IWO, MLP-PSO, MLP-SA and MLP-GA methods. The alternative hypothesis (HA), in contrast, states that there is a significant difference in the means of the missing data estimates obtained by the four methods.
Table 5.4: Statistical Analysis of DL-IWO Model Results.

| Pairs Compared | P-Value (95% Confidence Level) |
| --- | --- |
| DL-IWO:MLP-PSO | 2.51 × 10^-15 |
| DL-IWO:MLP-SA | 2.0 × 10^-174 |
| DL-IWO:MLP-GA | 4.0 × 10^-180 |
Table 5.4 reveals that, at a 95% confidence level, there is a significant difference in the means of the estimates obtained by DL-IWO when compared to MLP-PSO, MLP-SA and MLP-GA, with p-values of 2.51 × 10^-15, 2.0 × 10^-174 and 4.0 × 10^-180, respectively. The null hypothesis (H0), which assumes that there is no significant difference in the means between DL-IWO and the other three methods, can therefore be rejected in favour of the alternative hypothesis (HA) at a 95% confidence level.
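The hypothesis test applied above can be sketched as follows. This is a stdlib-only sketch of a two-sample Welch t statistic whose two-sided p-value uses the large-sample normal approximation to the t distribution; a statistics library's exact t-distribution CDF would normally be used instead.

```python
import math
from statistics import NormalDist

def welch_t_pvalue(a, b):
    """Two-sided p-value for the difference in means of two samples
    (Welch's t statistic, normal approximation to its distribution)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    t = (ma - mb) / math.sqrt(va / na + vb / nb)
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))

# decision rule used throughout this chapter:
# reject H0 at the 95% confidence level whenever the p-value < 0.05
```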
Figure 5.5: Top Row: Corrupted Images - Bottom Row: DL-IWO Reconstructed Images.
In the top row of Figure 5.5, we depict 10 images with missing pixel values, which are to be estimated prior to any classification tasks being performed by statistical methods. In the bottom row of the same figure, we show the images reconstructed using the DL-IWO approach, while the top and bottom rows of Figure 5.6 show the images reconstructed using the MLP-PSO and MLP-GA approaches, respectively. The reconstructions obtained with MLP-PSO and MLP-GA introduce a lot of noise, more so in the bottom row than in the top row, in contrast to those obtained with the DL-IWO approach. Furthermore, closer inspection reveals that these images are not fully reconstructed, as not all pixel values within them are estimated correctly.
Figure 5.6: Top Row: MLP-PSO Reconstructed Images - Bottom Row: MLP-GA Reconstructed Images.
5.5 Comparative Analysis of Proposed Approaches
In this section, we present the findings from comparing all six proposed methods against each other to identify which performs best on the dataset. Statistical t-tests are performed to support the findings from this experiment. The results obtained are grouped in Figures 5.7-5.10 and Tables 5.5 and 5.6.
Figure 5.7: Squared Error vs Estimation Approach.
Figures 5.7 and 5.8 depict the squared errors and mean absolute errors for all six novel
approaches proposed and analysed. They both confirm the better performance of the DL-ACO approach when compared to the other approaches. DL-ACO exhibits a squared error of 7.94% with an MAE of 3.26%, while DL-ALO, DL-BAT, DL-CS, DL-FA and DL-IWO yield squared errors of 11.97%, 8.24%, 8.17%, 13.93% and 8.26%, respectively. Based on these squared error values, we note that the order of performance is: DL-ACO→DL-CS→DL-BAT→DL-IWO→DL-ALO→DL-FA.
Figure 5.8: Mean Absolute Error vs Estimation Approach.
The DL-ALO, DL-BAT, DL-CS, DL-FA and DL-IWO approaches yield MAE values of 5.07%, 4.37%, 3.83%, 9.3% and 3.72%, respectively. Based on these MAE values, we note that the order of performance is: DL-ACO→DL-IWO→DL-CS→DL-BAT→DL-ALO→DL-FA.
Figure 5.9: Root Mean Squared Logarithmic Error vs Estimation Approach.
In Figure 5.9, the RMSLE values of the approaches are revealed, while in Figure 5.10, we observe the RPA values of these approaches. Considering Figure 5.9, the approach which yields the lowest RMSLE value is the DL-ACO approach, with a value of 5.85%. The second best performer is the DL-CS algorithm, with a value of 6.06%. The other values obtained are 8.2%, 6.24%, 11.21% and 6.14% for the DL-ALO, DL-BAT, DL-FA and DL-IWO approaches, respectively. This reveals a performance order of: DL-ACO→DL-CS→DL-IWO→DL-BAT→DL-ALO→DL-FA.
Figure 5.10: Relative Prediction Accuracy vs Estimation Approach.
With regard to Figure 5.10, the order of performance of the approaches is: DL-ACO→DL-CS→DL-IWO→DL-BAT→DL-ALO→DL-FA. This ordering is based on the approaches yielding values of 87.21%, 86.9%, 86.73%, 85.75%, 83.03% and 59.15%, respectively.
These findings are further supported by the results in Table 5.5. The DL-ACO approach yields the best values for COD, GD, MSE, r and SNR. Considering the COD metric, the order of performance observed is: DL-ACO→DL-BAT→DL-CS→DL-IWO→DL-ALO→DL-FA. The order changes when we consider the GD metric: DL-ACO→DL-IWO→DL-CS→DL-ALO→DL-BAT→DL-FA. In terms of the MSE metric, we observe a performance ordering of: DL-ACO→DL-CS→DL-IWO/DL-BAT→DL-ALO→DL-FA, noting that DL-IWO and DL-BAT perform on par. The correlation coefficient and SNR values reveal an ordering of: DL-ACO→DL-BAT→DL-CS→DL-IWO→DL-ALO→DL-FA. Based on these orderings, it can be stated that the DL-ACO approach performs best, while the approach at the other end of the scale is consistently the DL-FA approach.
Table 5.5: Model Additional Metrics.

| Metric | DL-ACO | DL-ALO | DL-BAT | DL-CS | DL-FA | DL-IWO |
| --- | --- | --- | --- | --- | --- | --- |
| COD | 93.47 | 87.01 | 93.33 | 93.2 | 82.74 | 93.02 |
| GD | 0.0079 | 0.035 | 0.044 | 0.019 | 0.28 | 0.017 |
| MSE | 0.63 | 1.43 | 0.68 | 0.67 | 1.94 | 0.68 |
| r | 96.68 | 93.28 | 96.61 | 96.54 | 90.96 | 96.44 |
| SNR | 6.95 | 22.72 | 7.21 | 7.4 | 22.84 | 7.55 |

Considering Table 5.6, the DL-BAT approach yielded the lowest objective function values in the estimation of missing data within a single sample, across all samples shown in the table. The Dimensions column refers to the number of missing values within that sample. The DL-ACO approach is observed to yield the second best objective function values across all samples, while DL-FA yields the highest, which further supports the findings from Table 5.5 and Figures 5.7-5.10.
Table 5.6: Model Mean Squared Error Objective Values Per Sample.

| Sample | Dimensions | DL-ACO | DL-ALO | DL-BAT | DL-CS | DL-FA | DL-IWO |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 65 | 0.83 | 1.37 | 0.80 | 0.94 | 2.48 | 0.88 |
| 2 | 94 | 2.25 | 2.29 | 2.15 | 2.27 | 3.17 | 2.26 |
| 3 | 77 | 2.68 | 2.85 | 2.49 | 2.69 | 3.45 | 2.69 |
| 4 | 64 | 1.80 | 2.26 | 1.71 | 1.87 | 2.98 | 1.83 |
| 5 | 77 | 2.62 | 3.13 | 2.47 | 2.64 | 3.87 | 2.64 |
| 6 | 77 | 1.02 | 2.05 | 0.91 | 1.09 | 2.73 | 1.11 |
| 7 | 79 | 2.18 | 2.48 | 2.06 | 2.20 | 3.31 | 2.20 |
| 8 | 75 | 1.59 | 1.91 | 1.37 | 1.64 | 2.85 | 1.73 |
| 9 | 74 | 1.24 | 1.29 | 1.17 | 1.24 | 1.93 | 1.25 |
| 10 | 62 | 3.17 | 3.56 | 2.86 | 3.24 | 4.49 | 3.22 |
In Table 5.7, we present results obtained from statistically analysing, using the t-test, the estimates obtained by the DL-ACO, DL-ALO, DL-BAT, DL-CS, DL-FA and DL-IWO approaches. The t-test null hypothesis (H0) assumes that there is no significant difference in the means of the missing data estimates obtained by any pair of these six methods. The alternative hypothesis (HA), in contrast, states that there is a significant difference in the means of the missing data estimates obtained by the six methods.
Table 5.7: Statistical Analysis of Model Results.

| Pairs Compared | P-Value (95% Confidence Level) |
| --- | --- |
| DL-ACO:DL-ALO | 1.37 × 10^-10 |
| DL-ACO:DL-BAT | 0.01 |
| DL-ACO:DL-CS | 0.29 |
| DL-ACO:DL-FA | 2.0 × 10^-22 |
| DL-ACO:DL-IWO | 0.37 |
| DL-ALO:DL-BAT | 9.88 × 10^-20 |
| DL-ALO:DL-CS | 6.43 × 10^-14 |
| DL-ALO:DL-FA | 2.0 × 10^-67 |
| DL-ALO:DL-IWO | 2.64 × 10^-13 |
| DL-BAT:DL-CS | 0.14 |
| DL-BAT:DL-FA | 1.0 × 10^-12 |
| DL-BAT:DL-IWO | 0.10 |
| DL-CS:DL-FA | 3.44 × 10^-18 |
| DL-CS:DL-IWO | 0.87 |
| DL-FA:DL-IWO | 9.06 × 10^-19 |
Table 5.7 reveals that, at a 95% confidence level, there is a significant difference in the means of the estimates obtained by DL-ACO when compared to DL-ALO, DL-BAT and DL-FA, with p-values of 1.37 × 10^-10, 0.01 and 2.0 × 10^-22, respectively. The null hypothesis (H0), which assumes that there is no significant difference in the means of the estimates between DL-ACO and the DL-ALO, DL-BAT and DL-FA methods, can therefore be rejected in favour of the alternative hypothesis (HA) at a 95% confidence level. The table also indicates that there is no significant difference in the means of the estimates obtained by the DL-ACO approach when compared against the DL-CS and DL-IWO approaches, as evidenced by the p-values of 0.29 and 0.37, respectively.
Table 5.7 further reveals that, at a 95% confidence level, there is a significant difference in the means of the estimates obtained by DL-ALO when compared to DL-BAT, DL-CS, DL-FA and DL-IWO, with p-values of 9.88 × 10^-20, 6.43 × 10^-14, 2.0 × 10^-67 and 2.64 × 10^-13, respectively. The null hypothesis (H0), which assumes that there is no significant difference in the means of the estimates between DL-ALO and the DL-BAT, DL-CS, DL-FA and DL-IWO methods, can therefore be rejected in favour of the alternative hypothesis (HA) at a 95% confidence level.
In addition, it is observed that, at a 95% confidence level, there is a significant difference in the means of the estimates obtained by the DL-BAT approach when compared to the DL-FA approach, as indicated by the p-value of 1.0 × 10^-12. When the DL-BAT approach is compared to the DL-CS and DL-IWO approaches, the p-values obtained favour accepting the null hypothesis, i.e. there is no significant difference in the means of the estimates at a 95% confidence level. These p-values are 0.14 and 0.10 for the comparisons against DL-CS and DL-IWO, respectively.
Moreover, Table 5.7 reveals that, at a 95% confidence level, there is a significant difference in the means of the estimates obtained by DL-CS when compared to DL-FA, with a p-value of 3.44 × 10^-18. The null hypothesis (H0), which assumes that there is no significant difference in the means of the estimates between the DL-CS and DL-FA methods, can therefore be rejected in favour of the alternative hypothesis (HA) at a 95% confidence level. The table also indicates that there is no significant difference in the means of the estimates obtained by the DL-CS approach when compared against the DL-IWO approach, as evidenced by the p-value of 0.87.
Finally, Table 5.7 reveals that, at a 95% confidence level, there is a significant difference in the means of the estimates obtained by DL-FA when compared to DL-IWO, with a p-value of 9.06 × 10^-19. The null hypothesis (H0), which assumes that there is no significant difference in the means of the estimates between the DL-FA and DL-IWO methods, can therefore be rejected in favour of the alternative hypothesis (HA) at a 95% confidence level.
5.6 Conclusion
Firstly, in this chapter, a novel plant-based high-dimensional missing data estimator model was presented and tested on an image recognition dataset. This model was then compared against existing approaches of a similar nature. The results obtained from the experiment indicate that the proposed model approximates missing values in the high-dimensional dataset more accurately than the existing approaches of the same nature. It can also be observed that an image reconstructed using the proposed model is more suitable for subsequent statistical analysis and classification tasks than those obtained using the existing approaches. This is because the existing approaches introduce a lot of noise into the images, which could skew the findings of any subsequent analysis.
Secondly, we presented in this chapter a comparative analysis of the proposed models to identify which performs best and which sits at the other end of the scale. It is observed that the method which consistently performs best is the DL-ACO approach, while the method which consistently yields the worst performance metric values is the DL-FA method. The statistical t-tests performed further reveal that the DL-FA approach yields estimates which are significantly different from those of the other five methods at a 95% confidence level, resulting in p-values that are effectively zero when the methods are compared in pairs, as can be seen in Table 5.7. Only when the objective function values per sample are considered, as given in Table 5.6, does the DL-ACO approach not yield the best values; rather, it is the DL-BAT approach that results in the lowest values in this scenario.
6. Concluding Remarks and Future Research

In this chapter, we begin by presenting a summary of the research and discuss the findings reached from conducting the experiments. Subsequently, we provide avenues for further investigation, examine the contributions of this dissertation, and present final conclusions.
6.1 Concluding Remarks
6.1.1 Research Summary
The research performed in this dissertation assesses the efficiency of using a deep auto-encoder neural network in combination with the Ant Colony Optimization, Ant-Lion Optimizer, Bat, Firefly, Cuckoo Search and Invasive Weed Optimization methods to perform missing data estimation tasks on a high-dimensional dataset. In doing so, we aimed to address the following objectives:

• To prove the ineffectiveness of existing approaches in the estimation of missing data entries in a high-dimensional dataset.

• To prove the effectiveness of novel methods in estimating missing data entries in a high-dimensional dataset by proposing models consisting of a deep auto-encoder neural network and the Ant Colony Optimization, Ant-Lion Optimizer, Bat, Firefly, Cuckoo Search and Invasive Weed Optimization algorithms.
• To assess, evaluate and compare the accuracy of the results produced by the individually proposed models.
To address these objectives, an image recognition dataset was used. Six models, each comprising a deep auto-associative neural network paired with one of six optimization algorithms (Ant Colony Optimization, Ant-Lion Optimizer, Bat, Firefly, Cuckoo Search and Invasive Weed Optimization), were proposed to perform the task. An auto-encoder is a neural network capable of reproducing its inputs as outputs. An RBM, trained in an unsupervised manner using the contrastive divergence approach, was used to initialize the weights of the deep auto-associative neural network in a good region of the solution space. The trained RBMs were concatenated to form the encoder part of the network, and then transposed to form the decoder part. This encoder-decoder network was subsequently trained in a supervised manner using the stochastic gradient descent algorithm. During the training of the network, an error function was derived, expressed as the square of the disparity between the network's estimates and the real values; this error function was further decomposed to incorporate both the known components and the missing components of the input vector. The optimization algorithms were then used to estimate the missing values in the input vector, the objective being to minimize a loss function that embeds the trained network. The models created in this manner were implemented and compared against existing approaches to demonstrate the ineffectiveness of the latter and the effectiveness of the former. Subsequently, the proposed models were compared against each other to establish a performance ordering.
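The estimation objective described above can be sketched as follows. The names and the toy stand-in network are illustrative only; in the experiments, the network is the trained deep auto-encoder and the minimization over the candidate values is carried out by one of the six optimization algorithms.

```python
import numpy as np

def make_objective(net, x_obs, missing_idx):
    """Build the decomposed error function: candidate values z are placed
    into the missing slots of the input vector, the completed vector is
    passed through the trained network, and the squared disparity between
    the vector and its reconstruction is returned."""
    x_obs = np.asarray(x_obs, float)
    def objective(z):
        x = x_obs.copy()
        x[missing_idx] = z            # insert candidate missing entries
        x_hat = net(x)                # reconstruction by the trained network
        return float(np.sum((x - x_hat) ** 2))
    return objective

# toy stand-in for the trained auto-encoder (NOT the thesis model):
toy_net = lambda x: 0.9 * x
obj = make_objective(toy_net, x_obs=np.array([0.2, 0.0, 0.8]), missing_idx=[1])
# an optimizer (e.g. IWO) would now search over z to minimize obj(z)
```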
6.1.2 Results Summary and Discussions
An in-depth analysis of the results obtained from the experiments indicates that the Deep Learning-Ant Colony Optimization (DL-ACO) method performed better than the Deep Learning-Ant Lion Optimizer (DL-ALO), Deep Learning-Bat Algorithm (DL-BAT), Deep Learning-Cuckoo Search (DL-CS), Deep Learning-Firefly Algorithm (DL-FA) and Deep Learning-Invasive Weed Optimization (DL-IWO) methods.
To be more precise, the analysis of the results using the Squared Error, Root Mean Squared
Logarithmic Error, Mean Absolute Error, Mean Squared Error, Global Deviation, Signal-
to-Noise Ratio, Correlation Coefficient, Relative Prediction Accuracy and Coefficient of
Determination values from the dataset revealed that on average, the DL-ACO method
performed better than the DL-ALO, DL-BAT, DL-CS, DL-FA and DL-IWO methods.
The differences in performance between the DL-BAT, DL-CS and DL-IWO methods on the dataset were minimal and insignificant across the performance metrics considered. Considering the statistical analysis results, these findings are further validated by the p-values obtained. However, when the performances of these three methods were compared against those of the DL-ALO and DL-FA methods, the differences were significant, as shown in Chapter 5 and supported by the statistical analysis results. A general analysis of the results points to the following performance ordering of the methods: (i) DL-ACO, (ii) DL-BAT, (iii) DL-CS, (iv) DL-IWO, (v) DL-ALO, and (vi) DL-FA. The predominant advantage of the proposed models lies in their use of a deep learning framework that is better at extracting the correlations and interrelationships that exist between feature variables in the high-dimensional dataset. This ensures that the reconstruction of the input vector values at the output layer is accurate, a property that is imperative in the estimation of missing data values.
6.2 Avenues for Future Research
The findings from the work done in this dissertation are encouraging and dependable; however, there is room for improvement, either by improving the performance of the proposed models or by applying them to different application domains or to varying types of datasets.
6.2.1 Apply Alternative Machine Learning Techniques
The research used a deep auto-associative neural network to learn the correlations and interrelationships that exist between feature variables within the dataset. Although an auto-encoder was chosen as the learning model due to its advantages over other methods in missing data estimation tasks, in addition to having yielded trustworthy outcomes, it is worth looking into the possibility of building new models using other deep learning algorithms. Instead of a deep auto-encoder network, one could implement, for instance, a Convolutional Neural Network or a Deep Belief Network as the
learning model in the hybrid system(s).
6.2.2 Apply Different Optimization Techniques
The optimization algorithms used in the research were the Ant Colony Optimization, Ant-Lion Optimizer, Bat, Firefly, Cuckoo Search and Invasive Weed Optimization algorithms. The parameter combinations used in these algorithms were tuned entirely to the dataset used. It would be worthwhile to seek an optimal set of parameter combinations for these algorithms that could be used across a range of application domains and datasets while preserving or improving accuracy. The optimization algorithms implemented in the models were selected because they have not been used, or at least not extensively applied, in the domain of missing data estimation, particularly where high-dimensional datasets are concerned. It would also be worth looking into the possibility of using alternative optimization algorithms in the models. From the newer collection of optimization algorithms, candidates for creating hybrid models include the grey wolf optimizer, differential evolution and the lion optimization algorithm, to name a few. From the older category, the particle swarm optimization, simulated annealing, genetic algorithm, hill climbing and pattern search algorithms could be applied to observe whether or not these algorithms, in combination with deep learning frameworks, are efficient at performing the task.
6.2.3 Compare to Other Models using Similar Datasets
Chapter 2 presents several existing missing data imputation algorithms applied in different scenarios and application areas, each with its own advantages and disadvantages. Despite the encouraging outcomes produced by the models proposed in this dissertation, one very important investigation that remains is to compare the results produced by the proposed models against those of other missing data imputation techniques, not only on the same dataset and on ones of a similar nature, but also on datasets with different characteristics, in order to provide some form of generalization. In addition, it is worthwhile to investigate the performance of the proposed models on other datasets of a similar nature (image recognition datasets).
6.3 Alternative Areas of Application
The work done in this dissertation used the proposed models to estimate missing data entries in a high-dimensional dataset, with the focal point being image recognition datasets. These models could be applied and extended to other application domains beyond the one mentioned; for instance, they could be used in risk analysis and forecasting. Furthermore, they could be applied to environmental and health datasets to generalize the performance of the models and the results obtained in this thesis.
References
[1] R. Little and D. Rubin, Statistical analysis with missing data. John Wiley & Sons,2014.
[2] R. L. Carter, “Solutions for missing data in structural equation modeling,” Research& Practice in Assessment, vol. 1, no. 1, pp. 1–6, Winter 2006.
[3] N. A. Zainuri, A. A. Jemain, and N. Muda, “A comparison of various imputationmethods for missing values in air quality data,” Sains Malaysiana, vol. 44, no. 3,pp. 449–456, August 2015.
[4] T. Sidekerskiene and R. Damasevicius, “Reconstruction of missing data in synthetictime series using emd,” CEUR Workshop Proceedings, vol. 1712, pp. 7–12, 2016.
[5] M. N. Vukosi, F. V. Nelwamondo, and T. Marwala, “Autoencoder, principal com-ponent analysis and support vector regression for data imputation,” arXiv preprintarXiv:0709.2506, 2007.
[6] S. Rana, A. H. John, H. Midi, and A. Imon, “Robust regression imputation formissing data in the presence of outliers,” Far East Journal of Mathematical Sciences,vol. 97, no. 2, pp. 183–195, October 2015.
[7] F. Lobato, C. Sales, I. Araujo, V. Tadaiesky, L. Dias, L. Ramos, and A. Santana,“Multi-objective genetic algorithm for missing data imputation,” Pattern Recogni-tion Letters, vol. 68, no. 1, pp. 126–131, December 2015, (last accessed: 18-March-2016).
[8] M. Abdella and T. Marwala, “The use of genetic algorithms and neural networksto approximate missing data in database,” vol. 24, October 2005, pp. 577–589.
[9] I. Aydilek and A. Arslan, “A novel hybrid approach to estimating missing values indatabases using k-nearest neighbors and neural networks,” International Journal ofInnovative Computing, Information and Control, vol. 7, no. 8, pp. 4705–4717, 2012.
Rf-1
![Page 126: COPYRIGHT AND CITATION CONSIDERATIONS FOR THIS THESIS ... · deep learning approaches and swarm intelligence techniques has not yet been reported or investigated. The rst contribution](https://reader031.vdocuments.net/reader031/viewer/2022011820/5ea22931e01ff931ba6addb4/html5/thumbnails/126.jpg)
References
[10] C. Leke, B. Twala, and T. Marwala, “Modeling of missing data prediction: Com-putational intelligence and optimization algorithms,” in IEEE International Con-ference on Systems, Man and Cybernetics (SMC). San Diego, CA, USA, October2014, pp. 1400–1404.
[11] F. J. Mistry, F. V. Nelwamondo, and T. Marwala, “Missing data estimation us-ing principle component analysis and autoassociative neural networks,” Journal ofSystemics, Cybernatics and Informatics, vol. 7, no. 3, pp. 72–79, 2009.
[12] F. V. Nelwamondo, S. Mohamed, and T. Marwala, “Missing data: A comparison ofneural network and expectation maximisation techniques,” Current Science, vol. 93,no. 12, pp. 1514–1521.
[13] S. Zhang, Z. Jin, and X. Zhu, “Missing data imputation by utilizing informationwithin incomplete instances,” Journal of Systems and Software, vol. 84, no. 3, pp.452–459, 2011.
[14] S. Zhang, “Shell-neighbor method and its application in missing data imputation,”Applied Intelligence, vol. 35, no. 1, pp. 123–133, 2011.
[15] A. Baraldi and C. Enders, “An introduction to modern missing data analyses,” Journal of School Psychology, vol. 48, no. 1, pp. 5–37, 2010.
[16] S. Van Buuren, Flexible Imputation of Missing Data. CRC Press, 2012.
[17] J. M. Jerez, I. Molina, P. J. García-Laencina, E. Alba, N. Ribelles, M. Martín, and L. Franco, “Missing data imputation using statistical and machine learning methods in a real breast cancer problem,” Artificial Intelligence in Medicine, vol. 50, no. 2, pp. 105–115, 2010.
[18] Y. LeCun. The MNIST database of handwritten digits. (last accessed: 15-Jan-2016). [Online]. Available: http://yann.lecun.com/exdb/mnist/
[19] G. E. Hinton, S. Osindero, and Y. W. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[20] A. K. Mohamed, F. V. Nelwamondo, and T. Marwala, “Estimating missing data using neural network techniques, principal component analysis and genetic algorithms,” in Proceedings of the Eighteenth Annual Symposium of the Pattern Recognition Association of South Africa, 2007.
[21] L. Francis, Dancing with Dirty Data: Methods for Exploring and Cleaning Data, pp. 198–254, (last accessed: November 2016). [Online]. Available: http://dx.doi.org/10.1007/978-3-319-19884-2_11
[22] M. Ramoni and P. Sebastiani, “Robust learning with missing data,” Machine Learning, vol. 45, no. 2, pp. 147–170, 2001.
[23] M. C. Tremblay, K. Dutta, and D. Vandermeer, “Using data mining techniques to discover bias patterns in missing data,” Journal of Data and Information Quality, vol. 2, no. 1, 2010.
[24] R. Polikar, J. De Pasquale, H. S. Mohammed, G. Brown, and L. I. Kuncheva, “Learn++.MF: A random subspace approach for the missing feature problem,” Pattern Recognition, vol. 43, no. 11, pp. 3817–3832, 2010.
[25] B. Twala, “An empirical comparison of techniques for handling incomplete data using decision trees,” Applied Artificial Intelligence, vol. 23, no. 5, pp. 373–405, 2009.
[26] D. Rubin, “Multiple imputations in sample surveys – a phenomenological Bayesian approach to nonresponse,” in Proceedings of the Survey Research Methods Section of the American Statistical Association, vol. 1, pp. 20–34, 1978.
[27] P. D. Allison, “Multiple imputation for missing data,” Sociological Methods & Research, vol. 28, no. 3, pp. 301–309, 2000.
[28] E.-L. Silva-Ramirez, R. Pino-Mejias, M. Lopez-Coello, and M.-D. Cubiles-de-la-Vega, “Missing value imputation on missing completely at random data using multilayer perceptrons,” Neural Networks, vol. 24, no. 1, pp. 121–129, January 2011.
[29] T. D. Pigott, “A review of methods for missing data,” Educational Research and Evaluation, vol. 7, no. 4, pp. 353–383, 2001.
[30] K. J. Nishanth and V. Ravi, “A computational intelligence based online data imputation method: An application for banking,” Journal of Information Processing Systems, vol. 9, no. 4, pp. 633–650, 2013.
[31] J. Scheffer, “Dealing with missing data,” Research Letters in the Information and Mathematical Sciences, vol. 3, pp. 153–160, 2000, (last accessed: 18-March-2016). [Online]. Available: http://www.massey.ac.nz/wwiims/research/letters
[32] P. García-Laencina, J. Sancho-Gómez, A. Figueiras-Vidal, and M. Verleysen, “K nearest neighbours with mutual information for simultaneous classification and missing data imputation,” Neurocomputing, vol. 72, no. 7-9, pp. 1483–1493, 2009.
[33] F. Z. Poleto, J. M. Singer, and C. D. Paulino, “Missing data mechanisms and their implications on the analysis of categorical data,” Statistics and Computing, vol. 21, no. 1, pp. 31–43, 2011.
[34] Y. Liu and S. D. Brown, “Comparison of five iterative imputation methods for multivariate classification,” Chemometrics and Intelligent Laboratory Systems, vol. 120, pp. 106–115, 2013.
[35] T. Marwala, Computational Intelligence for Missing Data Imputation, Estimation and Management: Knowledge Optimization Techniques. Information Science Reference, Hershey, New York, 2009.
[36] I. S. Yansaneh, L. S. Wallace, and D. A. Marker, “Imputation methods for large complex datasets: An application to the NEHIS,” in Proceedings of the Survey Research Methods Section, pp. 314–319, 1998.
[37] P. D. Allison, Missing data. Thousand Oaks, CA: Sage, 2002.
[38] A. Kalousis and M. Hilario, “Supervised knowledge discovery from incomplete data,” in Proceedings of the 2nd International Conference on Data Mining, 2000, (last accessed: October 2016). [Online]. Available: http://cui.unige.ch/AI-group/research/metal/Papers/missing_values.ps
[39] A. Perez, R. J. Dennis, J. F. A. Gil, M. A. Rondon, and A. Lopez, “Use of the mean, hot deck and multiple imputation techniques to predict outcome in intensive care unit patients in Colombia,” Statistics in Medicine, vol. 21, no. 24, pp. 3885–3896, 2002.
[40] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1–38, 1977.
[41] B. Twala and M. Cartwright, “Ensemble missing data techniques for software effort prediction,” Intelligent Data Analysis, vol. 14, no. 3, pp. 299–331, 2010.
[42] B. E. T. H. Twala, M. C. Jones, and D. J. Hand, “Good methods for coping with missing data in decision trees,” Pattern Recognition Letters, vol. 29, no. 7, pp. 950–956, 2008.
[43] B. Twala and M. Phorah, “Predicting incomplete gene microarray data with the use of supervised learning algorithms,” Pattern Recognition Letters, vol. 31, pp. 2061–2069, 2010.
[44] C. Ming-Hau, “Pattern recognition of business failure by autoassociative neural networks in considering the missing values,” in International Computer Symposium (ICS), Taipei, Taiwan, Dec 2010, pp. 711–715.
[45] S. Haykin, Neural Networks. Prentice-Hall, New Jersey, second edition, 1999.
[46] P. J. Lu and T. C. Hsu, “Application of autoassociative neural network on gas-path sensor data validation,” Journal of Propulsion and Power, vol. 18, no. 4, pp. 879–888, July 2002.
[47] J. Mistry, F. Nelwamondo, and T. Marwala, “Estimating missing data and determining the confidence of the estimate data,” in Seventh International Conference on Machine Learning and Applications, San Diego, CA, USA, December 2008, pp. 752–755.
[48] J. W. Hines, E. U. Robert, and D. J. Wrest, “Use of autoassociative neural networks for signal validation,” Journal of Intelligent and Robotic Systems, vol. 21, no. 2, pp. 143–154, February 1998.
[49] M. J. Atalla and D. J. Inman, “On model updating using neural networks,” Mechanical Systems and Signal Processing, vol. 12, pp. 135–161, 1998.
[50] N. Smaoui and S. Al-Yakoob, “Analyzing the dynamics of cellular flames using Karhunen–Loève decomposition and autoassociative neural networks,” SIAM Journal on Scientific Computing, vol. 24, pp. 1790–1808, 2003.
[51] T. Marwala, “Probabilistic fault identification using a committee of neural networks and vibration data,” Journal of Aircraft, vol. 38, no. 1, pp. 138–146, January-February 2001.
[52] T. Marwala and S. Chakraverty, “Fault classification in structures with incomplete measured data using autoassociative neural networks and genetic algorithm,” Current Science, vol. 90, no. 4, 2006.
[53] T. Marwala, Economic Modelling Using Artificial Intelligence Methods. Springer-Verlag, London, UK, 2013.
[54] C. Leke and T. Marwala, “Missing data estimation in high-dimensional datasets: A swarm intelligence-deep neural network approach,” in International Conference on Swarm Intelligence. Springer International Publishing, 2016, pp. 259–270.
[55] J. C. Isaacs, “Representational learning for sonar ATR,” in SPIE Defense + Security. International Society for Optics and Photonics, June 2014.
[56] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
[57] K. Baek and S. Cho, “Bankruptcy prediction for credit risk using an auto-associative neural network in Korean firms,” in IEEE Conference on Computational Intelligence for Financial Engineering, Hong Kong, China, March 2003, pp. 25–29.
[58] T. Tim, M. Mutajogire, and T. Marwala, “Stock market prediction using evolutionary neural networks,” in Fifteenth Annual Symposium of the Pattern Recognition Association of South Africa (PRASA), Nov 2004, pp. 123–133.
[59] B. L. Betechuoh, T. Marwala, and T. Tettey, “Autoencoder networks for HIV classification,” Current Science, vol. 91, no. 11, pp. 1467–1473, 2006.
[60] W.-H. Steeb, The Nonlinear Workbook. World Scientific, Singapore, 2008.
[61] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York, 2008.
[62] J. A. K. Suykens and J. Vandewalle, “Least squares support vector machine classifiers,” Neural Processing Letters, vol. 9, no. 3, pp. 293–300, June 1999.
[63] M. A. Hearst, S. T. Dumais, E. Osuna, J. Platt, and B. Scholkopf, “Support vector machines,” IEEE Intelligent Systems and their Applications, vol. 13, no. 4, pp. 18–28, July-Aug. 1998.
[64] C. J. C. Burges, “A tutorial on support vector machines for pattern recognition,” Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121–167, June 1998.
[65] T. Marwala, Finite Element Model Updating Using Computational Intelligence Techniques: Applications to Structural Dynamics. Heidelberg: Springer, 2010.
[66] ——, Causality, Correlation, and Artificial Intelligence for Rational Decision Making. Singapore: World Scientific, 2015.
[67] J. Kennedy and R. Eberhart, “Particle swarm optimization,” in Proc. IEEE International Conference on Neural Networks (ICNN), Perth, Australia, vol. 4, Nov 1995, pp. 1942–1948.
[68] A. P. Engelbrecht, “Particle swarm optimization: Where does it belong?,” in Proceedings of the IEEE Swarm Intelligence Symposium, May 2006.
[69] T. Marwala and M. Lagazio, Militarized Conflict Modeling Using Computational Intelligence Techniques. Springer-Verlag, London, UK, 2011.
[70] L. Deng, J. Li, J.-T. Huang, K. Yao, D. Yu, F. Seide, M. Seltzer, G. Zweig, X. He, and J. Williams, “Recent advances in deep learning for speech research at Microsoft,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2013, pp. 8604–8608.
[71] L. Deng and D. Yu, “Deep learning: Methods and applications,” Foundations and Trends in Signal Processing, vol. 7, no. 3-4, pp. 197–387, 2014.
[72] G. E. Hinton, “Deep belief networks,” Scholarpedia, vol. 4, no. 5, p. 5947, 2009.
[73] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” Journal of Machine Learning Research, vol. 11, pp. 3371–3408, 2010.
[74] H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin, “Exploring strategies for training deep neural networks,” Journal of Machine Learning Research, vol. 10, pp. 1–40, 2009.
[75] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105, (last accessed: May 2016). [Online]. Available: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
[76] Y. LeCun and Y. Bengio, “Convolutional networks for images, speech, and time series,” The Handbook of Brain Theory and Neural Networks, vol. 3361, no. 10, pp. 1–14, 1995.
[77] R. Salakhutdinov, A. Mnih, and G. Hinton, “Restricted Boltzmann machines for collaborative filtering,” in Proceedings of the 24th International Conference on Machine Learning, ser. ICML ’07. New York, NY, USA: ACM, 2007, pp. 791–798, (last accessed: May 2016). [Online]. Available: http://doi.acm.org/10.1145/1273496.1273596
[78] R. Salakhutdinov and G. Hinton, “Deep Boltzmann machines,” Artificial Intelligence and Statistics, vol. 1, no. 2, pp. 448–455, 2009.
[79] G. Hinton, “A practical guide to training restricted Boltzmann machines,” Momentum, vol. 9, no. 1, p. 926, 2010.
[80] T. Tieleman, “Training restricted Boltzmann machines using approximations to the likelihood gradient,” in Proceedings of the 25th International Conference on Machine Learning, ser. ICML ’08. New York, NY, USA: ACM, 2008, pp. 1064–1071, (last accessed: May 2016). [Online]. Available: http://doi.acm.org/10.1145/1390156.1390290
[81] Y. LeCun, Y. Bengio, and G. E. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, 2015.
[82] T. Tieleman and G. E. Hinton, “Using fast weights to improve persistent contrastive divergence,” in Proceedings of the 26th International Conference on Machine Learning, 2009, pp. 1033–1040.
[83] M. Carreira-Perpiñán and G. E. Hinton, “On contrastive divergence learning,” Artificial Intelligence and Statistics, pp. 1–7, 2005, (last accessed: 15-March-2015). [Online]. Available: http://learning.cs.toronto.edu/~hinton/absps/cdmiguel.pdf
[84] G. E. Hinton, “Training products of experts by minimizing contrastive divergence,” Neural Computation, vol. 14, no. 8, pp. 1771–1800, 2002.
[85] M. S. R. Monteiro, D. B. M. M. Fontes, and F. A. C. C. Fontes, “Ant colony optimization: A literature survey,” FEP Working Papers, Universidade do Porto, Faculdade de Economia do Porto, 2012, (last accessed: January 2016). [Online]. Available: http://EconPapers.repec.org/RePEc:por:fepwps:474
[86] M. Dorigo, V. Maniezzo, and A. Colorni, “Positive feedback as a search strategy,”Tech. Rep., 1991.
[87] M. Dorigo, V. Maniezzo, and A. Colorni, “The ant system: Optimization by a colony of cooperating agents,” IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 26, no. 1, pp. 29–41, 1996.
[88] M. Dorigo and G. Di Caro, “The ant colony optimization meta-heuristic,” in New Ideas in Optimization. Maidenhead, UK: McGraw-Hill Ltd., 1999, pp. 11–32, (last accessed: 20-May-2016). [Online]. Available: http://dl.acm.org/citation.cfm?id=329055.329062
[89] M. Dorigo, M. Birattari, and T. Stützle, “Ant colony optimization – artificial ants as a computational intelligence technique,” IEEE Computational Intelligence Magazine, vol. 1, pp. 28–39, 2006.
[90] A. C. Zecchin, A. R. Simpson, H. R. Maier, M. Leonard, A. J. Roberts, and M. J. Berrisford, “Application of two ant colony optimisation algorithms to water distribution system optimisation,” Mathematical and Computer Modelling, vol. 44, no. 5-6, pp. 451–468, 2006.
[91] X. J. Liu, H. Yi, and Z.-H. Ni, “Application of ant colony optimization algorithm in process planning optimization,” Journal of Intelligent Manufacturing, vol. 24, no. 1, pp. 1–13, 2013.
[92] T. İnkaya, S. Kayalıgil, and N. E. Özdemirel, “Ant colony optimization based clustering methodology,” Applied Soft Computing, vol. 28, pp. 301–311, 2015.
[93] S. Mirjalili, “The ant lion optimizer,” Advances in Engineering Software, vol. 83, pp. 80–98, 2015.
[94] E. Gupta and A. Saxena, “Performance evaluation of antlion optimizer based regulator in automatic generation control of interconnected power system,” Journal of Engineering, vol. 2016, pp. 1–14, 2016.
[95] R. Satheeshkumar and R. Shivakumar, “Ant lion optimization approach for load frequency control of multi-area interconnected power systems,” Circuits and Systems, vol. 7, pp. 2357–2383, 2016.
[96] W. Yamany, A. Tharwat, M. Fawzy, T. Gaber, and A. E. Hassanien, “A new multi-layer perceptrons trainer based on ant lion optimization algorithm,” in Fourth International Conference on Information Science and Industrial Applications (ISI), Sept 2015, pp. 40–45.
[97] H. M. Zawbaa, E. Emary, and C. Grosan, “Feature selection via chaotic antlion optimization,” PLOS ONE, vol. 11, no. 3, pp. 1–21, March 2016, (last accessed: June 2016). [Online]. Available: https://doi.org/10.1371/journal.pone.0150652
[98] M. Petrović, J. Petronijević, M. Mitić, N. Vuković, A. Plemić, Z. Miljković, and B. Babić, “The ant lion optimization algorithm for flexible process planning,” Journal of Production Engineering, vol. 18, no. 2, pp. 65–68, 2015.
[99] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, July 2006.
[100] X. S. Yang and S. Deb, “Cuckoo search: Recent advances and applications,” Neural Computing and Applications, vol. 24, no. 1, pp. 169–174, 2014.
[101] X. S. Yang and S. Deb, “Cuckoo search via Lévy flights,” in World Congress on Nature and Biologically Inspired Computing (NaBIC), 2009, pp. 210–214.
[102] S. Vasanthakumar, N. Kumarappan, R. Arulraj, and T. Vigneysh, “Cuckoo search algorithm based environmental economic dispatch of microgrid system with distributed generation,” in IEEE International Conference on Smart Technologies and Management for Computing, Communication, Controls, Energy and Materials (ICSTM), August 2015, pp. 575–580.
[103] J. Wang, B. Zhou, and S. Zhou, “An improved cuckoo search optimization algorithm for the problem of chaotic systems parameter estimation,” Computational Intelligence and Neuroscience, vol. 2016, p. 8, 2016.
[104] F. A. Ali and A. T. Mohamed, “A hybrid cuckoo search algorithm with Nelder–Mead method for solving global optimization problems,” SpringerPlus, vol. 5, no. 1, p. 473, 2016.
[105] X. S. Yang, “A new metaheuristic bat-inspired algorithm,” in Nature Inspired Cooperative Strategies for Optimization (NICSO), Studies in Computational Intelligence, 2010, pp. 65–74.
[106] ——, “Bat algorithm: Literature review and applications,” International Journal of Bio-Inspired Computation, vol. 5, no. 3, pp. 141–149, 2013.
[107] ——, “Bat algorithm for multiobjective optimization,” International Journal of Bio-Inspired Computation, vol. 3, no. 5, pp. 267–274, 2011.
[108] X. S. Yang, M. Karamanoglu, and S. Fong, “Bat algorithm for topology optimization in microelectronic applications,” in First International Conference on Future Generation Communication Technologies (FGCT), Dec 2012, pp. 12–14.
[109] T. C. Bora, L. dos Santos Coelho, and L. Lebensztajn, “Bat-inspired optimization approach for the brushless DC wheel motor problem,” IEEE Transactions on Magnetics, Feb 2012.
[110] X.-S. Yang, “Firefly algorithm, Lévy flights and global optimization,” in Research and Development in Intelligent Systems XXVI (Eds M. Bramer, R. Ellis, M. Petridis), pp. 209–218.
[111] A. R. Mehrabian and C. Lucas, “A novel numerical optimization algorithm inspired from weed colonization,” Ecological Informatics, vol. 1, pp. 355–366, 2006.
[112] C. Veenhuis, “Binary invasive weed optimization,” in Second World Congress on Nature and Biologically Inspired Computing, Dec 2010, pp. 449–454.
[113] B. Paryzad and N. S. Pour, “Time-cost-quality trade-off in project with using invasive weed optimization algorithm,” Journal of Basic and Applied Scientific Research, vol. 3, no. 11, pp. 134–142, 2013.
[114] H. L. Hung, C. C. Chao, C. H. Cheng, and Y. F. Huang, “Invasive weed optimization method based blind multiuser detection for MC-CDMA interference suppression over multipath fading channel,” in International Conference on Systems, Man and Cybernetics (SMC), 2010, pp. 2145–2150.
[115] K. Su, L. Ma, X. Guo, and Y. Sun, “An efficient discrete invasive weed optimization algorithm for web services selection,” Journal of Software, vol. 9, no. 3, pp. 709–715, March 2014.
[116] H. A. Kasdirin, N. M. Yahya, M. S. M. Aras, and M. O. Tokhi, “Hybridizing invasive weed optimization with firefly algorithm for unconstrained and constrained optimization problems,” Journal of Theoretical and Applied Information Technology, vol. 95, no. 4, pp. 912–927, Feb 2017.
[117] M. Yazdani and R. Ghodsi, “Invasive weed optimization algorithm for minimizing total weighted earliness and tardiness penalties on a single machine under aging effect,” International Robotics and Automation Journal, vol. 2, no. 1, Jan 2017.