

COPYRIGHT AND CITATION CONSIDERATIONS FOR THIS THESIS/ DISSERTATION

o Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

o NonCommercial — You may not use the material for commercial purposes.

o ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

How to cite this thesis

Surname, Initial(s). (2012) Title of the thesis or dissertation. PhD. (Chemistry)/ M.Sc. (Physics)/ M.A. (Philosophy)/M.Com. (Finance) etc. [Unpublished]: University of Johannesburg. Retrieved from: https://ujcontent.uj.ac.za/vital/access/manager/Index?site_name=Research%20Output (Accessed: Date).


Computational Intelligence Techniques for

High-Dimensional Missing Data Estimation

by

Collins Achepsah Leke

A dissertation submitted to the Faculty of Engineering and the Built Environment in fulfillment of the requirements for the degree of

Doctor of Engineering

in

Electrical and Electronic Engineering Science

at the

University of Johannesburg

Supervisor: Prof. Tshilidzi Marwala

Co-Supervisor: Prof. Bhekisipho Twala

2017


Declaration of Authorship

• This work was done mainly while in candidature for a research degree at this University.

• Where I have consulted the published work of others, this is always clearly attributed.

• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.

• I have acknowledged all main sources of help.

Signed:

Date:


Abstract

Missing data is a recurrent issue that leads to a variety of problems in the analysis and processing of datasets. For this reason, missing data and ways of handling it have been an active area of research in a variety of disciplines in recent times. Most real-world datasets possess the properties of big data: volume, velocity and variety. With an increase in volume, which includes both sample size and dimensionality, existing imputation methods have become less effective and accurate. Much attention has been given to narrow artificial intelligence frameworks owing to their efficiency in low-dimensional settings. However, as dimensionality increases, these methods yield unrepresentative imputations, with a consequent impact on decision-making processes. The goal of this thesis is to present a new direction in the missing data estimation literature by proposing novel methods aimed at finding approximations to missing values in high-dimensional datasets, with emphasis placed on image recognition datasets and the objective of reconstructing corrupted images that can subsequently be used in classification tasks. The features in these datasets represent the pixel values of the images. To the best of our knowledge, high-dimensional missing data estimation using deep learning approaches and swarm intelligence techniques has not yet been reported or investigated.

The first contribution of this thesis is the presentation of novel ant-based optimization deep learning missing data estimation approaches. The ant-based optimization algorithms used are the Ant-Lion Optimizer (ALO) and Ant Colony Optimization (ACO). These optimization algorithms are used in combination with a deep learning regression model. The methods are compared against three existing approaches of a similar nature: a hybrid multi-layer perceptron (MLP) auto-associative neural network (AANN) with a genetic algorithm (GA), a hybrid AANN with simulated annealing (SA), and a hybrid AANN with particle swarm optimization (PSO). The proposed methods show better performance overall, whilst the existing methods require less computational time to obtain the estimates. Statistical tests are done to validate these findings.

The second contribution presented in this thesis is the proposition of novel flight-based optimization deep learning missing data estimation techniques. The flight-based optimization algorithms used are the Firefly Algorithm (FA), the Bat Algorithm (BAT) and the Cuckoo Search (CS) algorithm. These algorithms are also hybridized with a deep learning regression model. The third contribution is the proposition of a novel plant-based optimization missing data estimation technique. The plant-based optimization algorithm used is the Invasive Weed Optimization (IWO) algorithm, which is likewise hybridized with a deep learning regression model. These approaches are compared against the existing methods, and statistical tests are again done to validate the findings observed.

This thesis further provides a comparative analysis of the proposed methods. These methods use the optimization algorithms to reduce, to an acceptable level, an objective function obtained by training the regression model, a process during which the correlations between inputs and outputs are preserved in the weights assigned to the edges that link the different network layers. The objective function represents the square of the disparity between the real output values and the estimated output values from a deep auto-encoder network. When missing data is observed in the dataset, the objective function is decomposed to incorporate both known and unknown feature variable values. Each layer of the deep auto-encoder network is a restricted Boltzmann machine (RBM); these are stacked together and trained with back-propagation in a supervised learning setting using the stochastic gradient descent (SGD) algorithm. All the experiments conducted in this thesis are done from a high-dimensional perspective.
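As a minimal sketch of this decomposed objective (the function below and its interface are illustrative assumptions, not the thesis implementation), the quantity each swarm algorithm minimizes for a single incomplete sample can be written as:

```python
def estimation_objective(candidate, x_observed, missing_mask, autoencoder):
    """Squared disparity between a completed sample and its reconstruction.

    candidate    -- proposed values for the missing features, in order
    x_observed   -- the sample's feature vector (entries at masked positions ignored)
    missing_mask -- True where the feature value is unknown
    autoencoder  -- any callable mapping a complete feature vector to its reconstruction
    """
    # Merge the known feature values with the candidate estimates.
    it = iter(candidate)
    x_full = [next(it) if missing else known
              for known, missing in zip(x_observed, missing_mask)]
    # Square of the disparity between real and estimated output values.
    reconstruction = autoencoder(x_full)
    return sum((xf - r) ** 2 for xf, r in zip(x_full, reconstruction))
```

An optimization algorithm such as ACO or CS would repeatedly propose `candidate` vectors for the missing entries and keep the one that minimizes this value, with the trained auto-encoder supplying the reconstruction.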

The results obtained from the experiments conducted in this thesis reveal that the most effective method proposed is the one comprising the deep learning model and the ant colony optimization algorithm, which yields the best evaluation metric values. The method that consistently leads to the worst performance metric values is the one comprising the deep learning model and the firefly algorithm. The statistical t-tests performed further reveal that this least-performing approach yields estimates which are significantly different from those of the other five methods at a 95% confidence level, the pairwise comparisons resulting in very low p-values, well below the 0.05 significance threshold. It is observed that only when the objective function values per sample are considered does the deep learning model-ant colony optimization approach not yield the best values; rather, it is the deep learning model with the bat algorithm that results in the lowest values in this scenario.
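The pairwise comparisons referred to above follow the standard paired t-test; as an illustrative sketch (the textbook formula, not code from the thesis), the statistic computed from two methods' per-sample errors is:

```python
import math

def paired_t_statistic(errors_a, errors_b):
    """Paired t-statistic over per-sample errors of two estimation methods.

    A |t| large enough to push the p-value below 0.05 indicates that the two
    methods differ significantly at the 95% confidence level.
    """
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Sample variance of the paired differences (n - 1 in the denominator).
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)
```

The resulting statistic is compared against the t-distribution with n - 1 degrees of freedom to obtain the p-value for each pair of methods.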


Dedication

To my dad and sister,

Leke Betechuoh Casimir & Leke Sydonie Kesangha

and their everlasting memory and love.


Acknowledgements

First, I would like to express my gratitude to my supervisors, Prof. Tshilidzi Marwala and Prof. Bhekisipho Twala, for giving me the opportunity to work with them and for believing in my potential.

Special gratitude goes to Dr. Richard Ndjiongue and Mr. Kesi Fanoro for their valuable assistance and support.

My heartfelt gratitude extends to my Mum (Leke Nee Nchangeh Agatha Fonkeng), brothers (Leke Betechuoh Brian and Leke Fonkeng Clarence) and sister (Gwendoline Tasong). They have all made me who I am today. We have been through a lot, but through it all, they kept me strong and going. There are no words to say how grateful and thankful I am to them and for them.

My gratitude also goes to all my friends who supported me morally and through their prayers throughout this journey, especially Nqobile Dudu, who provided me with a voice of reason and sanity through the difficult times I went through while completing this research. Everyone deserves to have such a person in their life.

Above all, I praise the Almighty GOD, for always strengthening and leading me.


Dissertation Related Publications

Conferences

[1] Collins Leke, A. R. Ndjiongue, Bhekisipho Twala, and Tshilidzi Marwala. Deep Learning-Bat High-Dimensional Missing Data Estimator. (Accepted) 2017 IEEE International Conference on Systems, Man and Cybernetics (SMC), October 5-8, 2017, Banff, Canada.

[2] Collins Leke, A. R. Ndjiongue, Bhekisipho Twala, and Tshilidzi Marwala. A Deep Learning-Cuckoo Search Method for Missing Data Estimation in High-Dimensional Datasets. (Accepted) 2017 (Springer) International Conference on Swarm Intelligence (ICSI), July 27 - August 1, 2017, Fukuoka, Japan.

[3] Collins Leke and Tshilidzi Marwala. Missing Data Estimation in High-Dimensional Datasets: A Swarm Intelligence-Deep Neural Network Approach. (Springer) International Conference on Swarm Intelligence, June 25-30, 2016, Bali, Indonesia.

[4] Collins Leke, Bhekisipho Twala and Tshilidzi Marwala. Modeling of missing data prediction: Computational intelligence and optimization algorithms. IEEE International Conference on Systems, Man and Cybernetics (SMC), October 5-8, 2014, San Diego, CA, USA.


Other Publications

[5] Collins Leke, Bhekisipho Twala and Tshilidzi Marwala. Missing Data Prediction and Classification: The Use of Auto-Associative Neural Networks and Optimization Algorithms. CoRR, arXiv, http://arxiv.org/abs/1403.5488, abs/1403.5488, 2014.

[6] Collins Leke, Satyakama Paul and Tshilidzi Marwala. Proposition of a Theoretical Model for Missing Data Imputation using Deep Learning and Evolutionary Algorithms. CoRR, arXiv, http://arxiv.org/abs/1512.01362, abs/1512.01362, 2015.


Contents

Declaration of Authorship . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii

Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii

Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi

Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

Dissertation Related Publications . . . . . . . . . . . . . . . . . . . . . . . viii

List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv

List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xviii

List of Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xx

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-1

1.1 Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-1

1.2 Rationale . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-3

1.3 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-4

1.4 Contribution of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-5


1.5 Overview of Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-7

1.6 Structure of the Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-8

2 Literature Review and Background on Approaches for Dealing with Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-1

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-1

2.2 Missing Data Proportions . . . . . . . . . . . . . . . . . . . . . . . . . . 2-1

2.3 Missing Data Mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . . 2-2

2.3.1 Missing Completely at Random (MCAR) . . . . . . . . . . . . . . 2-2

2.3.2 Missing at Random (MAR) . . . . . . . . . . . . . . . . . . . . . 2-3

2.3.3 Non-Ignorable Case or Missing Not at Random (MNAR) . . . . . 2-4

2.3.4 Missing by Natural Design (MBND) . . . . . . . . . . . . . . . . 2-4

2.4 Missing Data Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-5

2.5 Classical Missing Data Techniques . . . . . . . . . . . . . . . . . . . . . . 2-6

2.5.1 List-Wise or Case-Wise Deletion . . . . . . . . . . . . . . . . . . . 2-6

2.5.2 Pair-Wise Deletion . . . . . . . . . . . . . . . . . . . . . . . . . . 2-6

2.5.3 Mean-mode Substitution . . . . . . . . . . . . . . . . . . . . . . . 2-7

2.5.4 Imputation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-7

2.5.5 Regression Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 2-9

2.6 Machine Learning Approaches to Missing Data . . . . . . . . . . . . . . . 2-10

2.6.1 Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-10

2.6.2 Artificial Neural Networks (ANNs) . . . . . . . . . . . . . . . . . 2-11

2.6.3 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . 2-14

2.7 Machine Learning Optimization . . . . . . . . . . . . . . . . . . . . . . . 2-14

2.7.1 Genetic Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 2-14


2.7.2 Particle Swarm Optimization . . . . . . . . . . . . . . . . . . . . 2-15

2.7.3 Simulated Annealing . . . . . . . . . . . . . . . . . . . . . . . . . 2-16

2.8 Deep Learning (DL) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-16

2.8.1 Restricted Boltzmann Machine (RBM) . . . . . . . . . . . . . . . 2-17

2.8.2 Contrastive Divergence (CD) . . . . . . . . . . . . . . . . . . . . 2-19

2.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-21

3 Novel Ant-based Missing Data Estimators . . . . . . . . . . . . . . . . 3-1

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-1

3.2 Experimental Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-2

3.2.1 Statement of Hypothesis and Research Question . . . . . . . . . . 3-2

3.2.2 Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . 3-2

3.3 Optimization Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-7

3.3.1 Ant Colony Optimization (ACO) . . . . . . . . . . . . . . . . . . 3-7

3.3.2 Ant-Lion Optimizer (ALO) . . . . . . . . . . . . . . . . . . . . . 3-8

3.4 Performance Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . 3-11

3.5 Deep-Learning-Ant Colony Optimization (DL-ACO) Estimator . . . . . . 3-14

3.6 Deep-Learning-Ant Lion Optimizer (DL-ALO) Estimator . . . . . . . . . 3-19

3.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-24

4 Novel Flight-based Missing Data Estimators . . . . . . . . . . . . . . . 4-1

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-1

4.2 Experimental Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-2

4.2.1 Statement of Hypothesis and Research Question . . . . . . . . . . 4-2

4.2.2 Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . 4-2


4.3 Optimization Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-4

4.3.1 Cuckoo Search (CS) . . . . . . . . . . . . . . . . . . . . . . . . . 4-4

4.3.2 Bat Algorithm (BAT) . . . . . . . . . . . . . . . . . . . . . . . . 4-6

4.3.3 Firefly Algorithm (FA) . . . . . . . . . . . . . . . . . . . . . . . . 4-8

4.4 Deep Learning-Cuckoo Search (DL-CS) Estimator . . . . . . . . . . . . . 4-10

4.5 Deep Learning-Bat Algorithm (DL-BAT) Estimator . . . . . . . . . . . . 4-15

4.6 Deep Learning-Firefly Algorithm (DL-FA) Estimator . . . . . . . . . . . 4-21

4.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-26

5 Novel Plant-based Missing Data Estimator and Comparative Analysis 5-1

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-1

5.2 Experimental Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-2

5.2.1 Statement of Hypothesis and Research Question . . . . . . . . . . 5-2

5.2.2 Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . 5-2

5.3 Optimization Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-3

5.3.1 Invasive Weed Optimization (IWO) . . . . . . . . . . . . . . . . . 5-3

5.4 Deep Learning-Invasive Weed Optimization (DL-IWO) Estimator . . . . 5-5

5.5 Comparative Analysis of Proposed Approaches . . . . . . . . . . . . . . . 5-10

5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-17

6 Concluding Remarks and Future Research . . . . . . . . . . . . . . . . 6-1

6.1 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-1

6.1.1 Research Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 6-1

6.1.2 Results Summary and Discussions . . . . . . . . . . . . . . . . . . 6-2

6.2 Avenues for Future Research . . . . . . . . . . . . . . . . . . . . . . . . . 6-3


6.2.1 Apply Alternative Machine Learning Techniques . . . . . . . . . . 6-3

6.2.2 Apply Different Optimization Techniques . . . . . . . . . . . . . . 6-4

6.2.3 Compare to Other Models using Similar Datasets . . . . . . . . . 6-4

6.3 Alternative Areas of Application . . . . . . . . . . . . . . . . . . . . . . . 6-5

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rf-1


List of Figures

1.1 MNIST Dataset Sample Images. Top Row: Real Data; Bottom Row: Data with Missing Pixel Values . . . . . . . . . . . . . . . . . . . . . . 1-5

3.1 Data Imputation Configuration. . . . . . . . . . . . . . . . . . . . . . . . 3-3

3.2 Stacked Auto-encoder Network Structure. . . . . . . . . . . . . . . . . . 3-4

3.3 Missing Data Estimator Structure. . . . . . . . . . . . . . . . . . . . . . 3-5

3.4 Mean Squared Error vs Estimation Approach. . . . . . . . . . . . . . . . 3-14

3.5 Root Mean Squared Logarithmic Error vs Estimation Approach. . . . . . 3-15

3.6 Correlation Coefficient vs Estimation Approach. . . . . . . . . . . . . . . 3-15

3.7 Global Deviation vs Estimation Approach. . . . . . . . . . . . . . . . . . 3-16

3.8 Top Row: Corrupted Images - Bottom Row: DL-ACO Reconstructed Images . . . 3-18

3.9 Top Row: MLP-PSO Reconstructed Images - Bottom Row: MLP-GA Reconstructed Images . . . 3-18

3.10 Mean Squared Error vs Estimation Approach. . . . . . . . . . . . . . . . 3-19

3.11 Root Mean Squared Logarithmic Error vs Estimation Approach. . . . . . 3-20

3.12 Correlation Coefficient vs Estimation Approach. . . . . . . . . . . . . . . 3-20

3.13 Relative Prediction Accuracy vs Estimation Approach. . . . . . . . . . . 3-21

3.14 Top Row: Corrupted Images - Bottom Row: DL-ALO Reconstructed Images . . . 3-23


3.15 Top Row: MLP-PSO Reconstructed Images - Bottom Row: MLP-GA Reconstructed Images . . . 3-23

4.1 Mean Squared Error vs Estimation Approach. . . . . . . . . . . . . . . . 4-10

4.2 Root Mean Squared Logarithmic Error vs Estimation Approach. . . . . . 4-11

4.3 Correlation Coefficient vs Estimation Approach. . . . . . . . . . . . . . . 4-11

4.4 Relative Prediction Accuracy vs Estimation Approach. . . . . . . . . . . 4-12

4.5 Top Row: Corrupted Images - Bottom Row: DL-CS Reconstructed Images . . . 4-12

4.6 Top Row: MLP-PSO Reconstructed Images - Bottom Row: MLP-GA Reconstructed Images . . . 4-13

4.7 Squared Error vs Estimation Approach. . . . . . . . . . . . . . . . . . . . 4-15

4.8 Correlation Coefficient vs Estimation Approach. . . . . . . . . . . . . . . 4-16

4.9 Relative Prediction Accuracy vs Estimation Approach. . . . . . . . . . . 4-17

4.10 Root Mean Squared Logarithmic Error vs Estimation Approach. . . . . . 4-18

4.11 Top Row: Corrupted Images - Bottom Row: DL-BAT Reconstructed Images . . . 4-20

4.12 Top Row: MLP-PSO Reconstructed Images - Bottom Row: MLP-GA Reconstructed Images . . . 4-20

4.13 Global Deviation vs Estimation Approach. . . . . . . . . . . . . . . . . . 4-21

4.14 Mean Squared Error vs Estimation Approach. . . . . . . . . . . . . . . . 4-22

4.15 Root Mean Squared Logarithmic Error vs Estimation Approach. . . . . . 4-22

4.16 Correlation Coefficient vs Estimation Approach. . . . . . . . . . . . . . . 4-23

4.17 Top Row: Corrupted Images - Bottom Row: DL-FA Reconstructed Images . . . 4-25

4.18 Top Row: MLP-PSO Reconstructed Images - Bottom Row: MLP-GA Reconstructed Images . . . 4-25

5.1 Mean Squared Error vs Estimation Approach. . . . . . . . . . . . . . . . 5-6

5.2 Root Mean Squared Logarithmic Error vs Estimation Approach. . . . . . 5-6


5.3 Relative Prediction Accuracy vs Estimation Approach. . . . . . . . . . . 5-7

5.4 Correlation Coefficient vs Estimation Approach. . . . . . . . . . . . . . . 5-7

5.5 Top Row: Corrupted Images - Bottom Row: DL-IWO Reconstructed Images . . . 5-9

5.6 Top Row: MLP-PSO Reconstructed Images - Bottom Row: MLP-GA Reconstructed Images . . . 5-10

5.7 Squared Error vs Estimation Approach. . . . . . . . . . . . . . . . . . . . 5-10

5.8 Mean Absolute Error vs Estimation Approach. . . . . . . . . . . . . . . . 5-11

5.9 Root Mean Squared Logarithmic Error vs Estimation Approach. . . . . . 5-12

5.10 Relative Prediction Accuracy vs Estimation Approach. . . . . . . . . . . 5-13


List of Tables

2.1 Univariate Missing Data Pattern . . . . . . . . . . . . . . . . . . . . . . 2-5

2.2 Arbitrary Missing Data Pattern . . . . . . . . . . . . . . . . . . . . . . . 2-5

2.3 Monotone Missing Data Pattern . . . . . . . . . . . . . . . . . . . . . . . 2-5

3.1 ACO Parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-8

3.2 DL-ACO Mean Squared Error Objective Value Per Sample. . . . . . . . . 3-16

3.3 DL-ACO Additional Metrics. . . . . . . . . . . . . . . . . . . . . . . . . 3-17

3.4 Statistical Analysis of DL-ACO Results. . . . . . . . . . . . . . . . . . . 3-17

3.5 DL-ALO Mean Squared Error Objective Value Per Sample. . . . . . . . . 3-21

3.6 DL-ALO Additional Metrics. . . . . . . . . . . . . . . . . . . . . . . . . . 3-22

3.7 Statistical Analysis of DL-ALO Results. . . . . . . . . . . . . . . . . . . 3-22

4.1 CS Parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-6

4.2 BAT Parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-8

4.3 FA Parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-9

4.4 DL-CS Mean Squared Error Objective Value Per Sample. . . . . . . . . . 4-13

4.5 DL-CS Additional Metrics. . . . . . . . . . . . . . . . . . . . . . . . . . . 4-14

4.6 Statistical Analysis of DL-CS Results . . . . . . . . . . . . . . . . . . . . 4-14


4.7 DL-BAT Additional Metrics. . . . . . . . . . . . . . . . . . . . . . . . . . 4-18

4.8 DL-BAT Mean Squared Error Objective Value Per Instance. . . . . . . . 4-19

4.9 Statistical Analysis of DL-BAT Results. . . . . . . . . . . . . . . . . . . 4-19

4.10 DL-FA Additional Metrics. . . . . . . . . . . . . . . . . . . . . . . . . . . 4-23

4.11 DL-FA Mean Squared Error Objective Value Per Sample. . . . . . . . . . 4-24

4.12 Statistical Analysis of DL-FA Results. . . . . . . . . . . . . . . . . . . . . 4-24

5.1 IWO Parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-5

5.2 DL-IWO Mean Squared Error Objective Value Per Sample. . . . . . . . . 5-8

5.3 DL-IWO Additional Metrics. . . . . . . . . . . . . . . . . . . . . . . . . . 5-8

5.4 Statistical Analysis of DL-IWO Model Results. . . . . . . . . . . . . . . . 5-9

5.5 Model Additional Metrics. . . . . . . . . . . . . . . . . . . . . . . . . . . 5-14

5.6 Model Mean Squared Error Objective Values Per Sample. . . . . . . . . . 5-14

5.7 Statistical Analysis of Model Results. . . . . . . . . . . . . . . . . . . . . 5-15


List of Abbreviations


AANN Auto-Associative Neural Networks

ACO Ant Colony Optimization

ALO Ant Lion Optimizer

ANN Artificial Neural Networks

BAT Bat Algorithm

CD Contrastive Divergence

COD Coefficient of Determination

CS Cuckoo Search

DAE Deep Auto-Encoder

DL Deep Learning

FA Firefly Algorithm

GA Genetic Algorithm

GD Global Deviation

IWO Invasive Weed Optimization

MAE Mean Absolute Error

MAR Missing at Random

MBND Missing by Natural Design

MCAR Missing Completely at Random

MI Multiple Imputation

MLP Multi-Layer Perceptron

MNAR Missing not at Random

MSE Mean Square Error

PCA Principal Component Analysis

PSO Particle Swarm Optimization

r Correlation Coefficient

RBM Restricted Boltzmann Machine

RMSLE Root Mean Square Logarithmic Error

SA Simulated Annealing

SAE Stacked Auto-Encoder

SCG Scaled Conjugate Gradient

SE Squared Error

SGD Stochastic Gradient Descent

SNR Signal-to-Noise Ratio


1 Introduction

1.1 Missing Data

Previous research in a variety of academic domains shows that the presence of missing data in datasets renders data analysis tasks and decision-making processes non-trivial. From this observation, one can assume that reliable and accurate decisions are more likely to be made from complete records than from incomplete datasets. This presumption has driven a great deal of research in the data mining domain, with the introduction of novel methods that accurately perform the task of filling in missing data. Research indicates that operations in a variety of professional sectors, for example medicine, manufacturing and energy, that use sensors in instruments to report important information on which decisions are subsequently based may encounter instances in which these sensors fail, leading to missing entries in the dataset and thereby influencing the nature of the decisions made. In such scenarios, it is of great importance that there be a system that can impute, with high accuracy, the missing data from these faulty sensors. Such an imputation framework needs to take into consideration the existing correlations between the information obtained from the sensors in the system in order to accurately estimate the missing data. Another scenario in which the missing data problem hinders decision making is in image recognition tasks, where missing pixel values render the task of predicting or classifying an image difficult; it is therefore paramount that there be a system capable of estimating these missing pixel values with high accuracy to make these tasks easier and more feasible.

Modern datasets, such as those that record production, manufacturing and medical data, may suffer from missing data at different phases of the data collection and storage processes. Faults in measuring instruments or data transmission lines are predominant causes. The occurrence of missing data creates difficulties for decision-making and analysis tasks that rely on access to complete and accurate data, and hence the need for data estimation techniques that are not only accurate but also efficient. Several methods exist to alleviate the problems presented by missing data, ranging from deleting records with missing attributes (list-wise and pair-wise data deletion) to approaches that employ statistical and artificial intelligence methods, such as hybrid neural network and evolutionary algorithm approaches. The problem is that some of the statistical and naive approaches produce biased approximations, or make false assumptions about the data and the correlations within it, with adverse effects on data-dependent decision-making processes.

Furthermore, missing data has long been a challenge in the real world as well as within the research community. Decision-making processes that rely on accurate knowledge depend on the availability of data from which information can be extracted, and often require predictive models or other computational intelligence techniques that use the observed data as inputs. In some cases, however, data is lost, corrupted or recorded incompletely, which degrades its quality. The majority of decision-making and machine learning frameworks, such as Artificial Neural Networks (ANNs), Support Vector Machines (SVMs) and Principal Component Analysis (PCA), cannot be used for decision making and data analysis if the data is incomplete, and missing values can critically influence pattern recognition and classification tasks. Since the decision output should be maintained despite the missing data, the problem must be dealt with: in the case of incomplete data, the initial processing step is to estimate the missing values.

In addition, the way in which the missing data problem is handled depends on the reason for the missingness. According to [1], there exist three mechanisms by which data can go missing: Missing at Random (MAR), Missing Completely at Random (MCAR), and Missing Not at Random (MNAR), also called the non-ignorable case. A fourth mechanism is Missing by Natural Design (MBND).


1.2 Rationale

The applications of missing data estimation techniques are vast. Existing methods depend on the nature of the data and the pattern of missingness, and are predominantly implemented on low-dimensional datasets. Application areas include, but are not limited to, structural equation modeling [2], environmental (air quality) data [3] and the reconstruction of time series data [4]. The authors of [5] used MLP autoencoder networks, principal component analysis and support vector machines in combination with the genetic algorithm to impute missing data, while [6] investigated the performance of robust regression imputation in datasets with outliers. Missing data imputation via a multi-objective genetic algorithm technique is presented in [7]; the results indicate that the proposed approach outperforms certain popular missing data imputation techniques, yielding accuracy values above 90 percent.

The authors in [8] implemented a hybrid system comprising a genetic algorithm and a neural network, aimed at imputing missing values within a single feature variable at a time in scenarios where the number of missing values within this variable varied. In [9], the authors proposed a novel system that hybridizes the k-Nearest Neighbour method with a neural network, again imputing missing values within a single feature variable. In [10], hybrid systems made up of an Auto-Associative Neural Network (AANN) and the Particle Swarm Optimization (PSO), Simulated Annealing (SA) and Genetic Algorithm (GA) optimization techniques were created and applied to estimate missing values, yielding high accuracies in scenarios where a single feature variable was affected by missing data. Other researchers, such as [11] and [12], used neural networks with Principal Component Analysis (PCA) and a GA to solve the missing data problem. In [13], it was suggested that the information within records with missing values be used in the estimation task, resulting in a Non-Parametric Iterative Imputation Algorithm (NIIA) that yielded a classification accuracy of at most 87.3% on the imputation of discrete values, and a root mean squared error of at least 0.5 on the imputation of continuous values, as the missing data ratios were varied. In [14], a Shell-Neighbour Imputation (SNI) approach, which makes use of the shell-neighbour method, is proposed and applied to the missing data imputation problem. The results indicated that the proposed method outperformed k-Nearest Neighbour Imputation in both imputation and classification accuracy, because it considers the left and right nearest neighbours of the missing data and uses varying numbers of nearest neighbours, as opposed to the fixed k of the k-Nearest Neighbour method. Further new techniques aimed at solving the missing data problem, and comparisons against existing methods, can be found in [15]- [17].

These techniques mainly cater to low-dimensional datasets with missing values, and are less effective on high-dimensional datasets in which missingness occurs in an uncontrolled manner. The main motivation behind this thesis is therefore to introduce high-dimensional missing data estimation approaches, with emphasis on image recognition datasets. These approaches are used to reconstruct corrupted images by estimating missing pixel values; the reconstructed images can then be used to test classification models.

1.3 Problem Statement

As previously mentioned, most existing missing data imputation techniques cater to low-dimensional datasets. With the introduction and design of more sophisticated computational and swarm intelligence methods, it is therefore worthwhile to establish which method(s) are best suited to a certain kind of dataset, in this case image recognition datasets. In this research, six optimization algorithms (Ant Colony Optimization (ACO), Ant Lion Optimizer (ALO), Cuckoo Search (CS), Bat Algorithm (BA), Firefly Algorithm (FA) and Invasive Weed Optimization (IWO)) are used in combination with a deep learning regression model on a high-dimensional dataset to compare their missing data imputation capabilities.

In most sectors, decisions that originate from data rely upon the availability of complete and accurate data. Inferences drawn from complete datasets with all the information available are therefore more reliable and useful than inferences drawn from incomplete datasets. Missing data may arise in a variety of ways, some more compelling than others: for example, failures of the sensors meant to record the data, or data entry errors. In this thesis, the objective is to reconstruct images by imputing missing pixel values.


Consider a high-dimensional dataset such as the Modified National Institute of Standards and Technology (MNIST) dataset [18], which contains 784 feature variables, these being the pixel values of an image. Some of the images from the dataset are shown in Figure 1.1. Assume that some images are corrupted, leading to missing pixel values (bottom row of Figure 1.1), and that statistical analysis is needed to classify the records in the dataset. The questions which need answering are: (i) Is it possible to impute the missing data in high-dimensional datasets with some degree of certainty and with high accuracy? (ii) Is it possible to design new methods which outperform existing approaches by approximating the missing data while considering the correlations and interrelationships between the variables?

Figure 1.1: MNIST Dataset Sample Images. Top Row - Real Data; Bottom Row - Data with Missing Pixel Values

There are therefore two main objectives in this research: firstly, approximating the missing values using novel approaches that combine six optimization algorithms with a deep learning regression model; and secondly, carrying out a comparative analysis of the proposed approaches to observe which performs best under the circumstances, and why.
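To make the setting concrete, corrupting an image can be simulated by removing a random subset of its 784 pixel values. The sketch below is illustrative only (the models in this thesis were implemented in MATLAB), and the function and variable names are ours, not the thesis's code:

```python
import numpy as np

def corrupt_image(pixels, missing_fraction, rng):
    """Return a copy of a flattened image with a random subset of pixel
    values replaced by NaN, together with the boolean missing mask."""
    pixels = np.asarray(pixels, dtype=float).copy()
    n_missing = int(round(missing_fraction * pixels.size))
    idx = rng.choice(pixels.size, size=n_missing, replace=False)
    mask = np.zeros(pixels.size, dtype=bool)
    mask[idx] = True
    pixels[mask] = np.nan
    return pixels, mask

rng = np.random.default_rng(0)
image = rng.random(784)              # stand-in for one 28x28 MNIST image
corrupted, mask = corrupt_image(image, 0.10, rng)
```

An estimator's task is then to recover the NaN entries from the surviving pixels, exploiting the correlations between neighbouring pixel values.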

1.4 Contribution of Thesis

The research done in this thesis presents a new direction for dealing with the problem of missing data in high-dimensional datasets, from analysing and preparing the data to estimating the missing values. It makes use of an image recognition dataset to analyse and evaluate the performances of six models. The evaluation metric values of the proposed individual models on the dataset are compared against existing approaches, and subsequently all results for the different models are compared against each other to provide some form of generalization. The objective of these comparisons and analyses is to identify the best performing method on the dataset and thereby eliminate the trial-and-error approach often used to select a method before implementing it: instead, whichever method has been identified as the best can simply be used. Done correctly, this procedure cuts down the time it takes to reconstruct images, which can subsequently be used for classification tasks. This research also aims to identify the key evaluation metrics. Statistical tests were performed to confirm the experimental outcomes and establish the statistical significance of the results.

As mentioned before, missing data in a dataset leads to a variety of problems; one use of the work in this thesis is therefore that it presents novel and adequate approaches for addressing the problem by estimating the missing data. Beyond the contributions already presented, another important contribution of the thesis is that it opens a research direction in the missing data estimation literature for high-dimensional datasets, making use of deep learning and swarm/meta-heuristic optimization techniques.

A more succinct outline of the contributions of the thesis is:

• Novel high-dimensional missing data estimation approaches are proposed which combine ant-based optimization algorithms with a deep learning regression model;

• Novel missing data estimation approaches are proposed which combine flight-based optimization algorithms with a deep learning regression model;

• A novel missing data estimation approach is proposed which combines a plant-based optimization algorithm with a deep learning regression model; and

• A comparative analysis of the proposed methods is carried out, including statistical tests to further back the findings, and a technique is suggested for application to datasets with similar properties.


1.5 Overview of Approach

The methodology implemented in this thesis begins with pre-processing the data. This constitutes normalization, which reduces the variation in values between feature variables and ensures that the network generates representative outputs. Six optimization algorithms were applied to minimize an error function derived from training a deep learning regression model with the stochastic gradient descent method. A portion of the normalized training data is presented to the deep learning network architecture for training, and an error function is derived, defined mathematically as the square of the disparity between the real outputs and the estimated model outputs. In this research, entries from any of the feature variables in the test set could be missing simultaneously; the error function is therefore reformulated to incorporate both the unknown and the known input values. The normal routine in the literature is to create missing values in one or more specific features and then estimate these; to the best of our knowledge, the uncontrolled nature of missing data within the test set is an aspect that has not yet been investigated or reported. Restricted Boltzmann Machines (RBMs) are used to train the individual layers of the network in an unsupervised manner; these layers are then joined to form the encoding part of the network and transposed to make up the decoding part. The stochastic gradient descent (SGD) algorithm is applied to train the network on the training set in a supervised manner. The optimal network structure consists of an input layer, seven hidden layers and an output layer; the number of nodes in the hidden layers is obtained from an initial suggestion made in [19] and by performing cross-validation on a held-out portion of the training data, known as the validation data. With the optimal network structure obtained via training, the swarm algorithms are used to identify the optimal combinations of network and algorithm parameters, and the missing data estimation procedure is then performed with these parameters. The expected outputs are compared against the estimated outputs to yield insight into how the methods perform.
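A minimal sketch of the reformulated error function follows. A trained deep autoencoder would play the role of `model`; here a stand-in that reconstructs a record as its mean value is used so that the sketch is self-contained, and all names are illustrative rather than the thesis's own code. A swarm algorithm searches over the candidate values for the missing positions so as to minimize this error:

```python
import numpy as np

def estimation_error(candidate_missing, observed, missing_mask, model):
    """Squared disparity between a completed record and its reconstruction.

    The candidate values fill the missing positions of the record, the
    model reconstructs the full record, and the squared differences are
    summed over all features, known and unknown alike."""
    x = observed.copy()
    x[missing_mask] = candidate_missing
    return float(np.sum((x - model(x)) ** 2))

# Stand-in "trained" model: reconstructs every feature as the record mean
# (a real model would be the RBM-pretrained deep autoencoder).
model = lambda x: np.full_like(x, x.mean())

record = np.array([0.2, 0.4, 0.6, 0.8])        # normalized feature values
mask = np.array([False, True, False, False])   # second feature is missing
err = estimation_error(np.array([0.4]), record, mask, model)
```

The candidate vector that yields the lowest error is taken as the estimate of the missing entries; this is the objective the six optimization algorithms minimize.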

To assess the performances of the methods as high-dimensional estimators of missing values, an image recognition dataset is used. Entries from the test set were removed and approximated using the models, which were all coded in MATLAB. To measure the accuracies of the methods as estimators, eight error metrics were used: Squared Error (SE), Mean Squared Error (MSE), Mean Absolute Error (MAE), Root Mean Squared Logarithmic Error (RMSLE), Global Deviation (GD), Relative Prediction Accuracy (RPA), Signal-to-Noise Ratio (SNR) and Coefficient of Determination (COD). These metrics were selected because they have been applied as performance measures for missing data estimation problems in a variety of research reports ( [5], [8], [10] and [20]), in addition to being convenient. The correlation coefficient (r) between the estimated and expected output values is also used to provide further insight into the relationship between the estimated and real values. Statistical t-tests are performed to back the findings from the metrics and to establish the statistical significance of the results obtained.
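Under their usual textbook definitions (the thesis's exact formulations may differ in detail), a few of these metrics can be sketched as follows; the function names and sample values are ours:

```python
import numpy as np

def mse(actual, estimated):
    """Mean squared error between expected and estimated values."""
    return float(np.mean((actual - estimated) ** 2))

def mae(actual, estimated):
    """Mean absolute error."""
    return float(np.mean(np.abs(actual - estimated)))

def rmsle(actual, estimated):
    """Root mean squared logarithmic error (assumes non-negative values)."""
    return float(np.sqrt(np.mean((np.log1p(actual) - np.log1p(estimated)) ** 2)))

def snr_db(actual, estimated):
    """Signal-to-noise ratio of the estimates, in decibels."""
    noise = np.sum((actual - estimated) ** 2)
    return float(10 * np.log10(np.sum(actual ** 2) / noise))

actual = np.array([0.2, 0.4, 0.6, 0.8])
estimated = np.array([0.2, 0.5, 0.6, 0.7])
r = float(np.corrcoef(actual, estimated)[0, 1])  # correlation coefficient
```

Lower values are better for MSE, MAE and RMSLE, while higher values are better for SNR and r.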

1.6 Structure of the Report

Chapter 2: Literature Review and Background on Approaches for Dealing with Missing Data presents background information on the components of the deep learning model, namely RBMs, contrastive divergence and auto-encoders, as well as some of their areas of application. Background on missing data is also presented, along with some existing methods aimed at addressing the problem.

Chapter 3: Ant-based Missing Data Estimators introduces the two ant-based optimization missing data estimation models proposed and tested in this research, based on the ACO and ALO algorithms. Details of the algorithms and related work applying these techniques are presented, along with the methodology used in implementing the estimators and the results of the experiments conducted.

Chapter 4: Flight-based Missing Data Estimators presents the three flight-based optimization missing data estimation models proposed and tested in this research, based on the CS, BA and FA methods. Details of the algorithms and their implementation are presented, together with related work, the methodology used in implementing the estimators, and the results of the experiments conducted.


Chapter 5: Plant-based Missing Data Estimator and Comparative Analysis presents the plant-based optimization missing data estimation model proposed and tested, based on the IWO algorithm, along with the methodology used in implementing the estimator and the experimental results. Also presented is the fourth contribution of the thesis, which compares the proposed methods against each other to identify the best performer, with statistical tests backing the results obtained.

Chapter 6: Concluding Remarks and Future Research presents the concluding remarks on this research, as well as areas for possible future research.


2 Literature Review and Background on Approaches for Dealing with Missing Data

2.1 Introduction

The presence of missing data affects the quality of a dataset, which in turn impacts the analysis and interpretation of the data. Several reasons can lead to data being missing, some more predominant than others. The first well-known reason is participants refusing to reveal personal and sensitive information, for example monthly income. The second main reason is the failure of the systems meant to capture and store the data in databases. Another main reason is interoperability, whereby information exchanged between systems may be subject to missing data.

This chapter gives the literature review of this research. Section 2.2 discusses missing data proportions. A background on missing data mechanisms is given in Section 2.3, followed by an introduction to missing data patterns in Section 2.4. A discussion of classical missing data techniques is presented in Section 2.5, followed by machine learning approaches to missing data in Section 2.6. Section 2.7 presents machine learning optimization techniques for missing data imputation, while Section 2.8 discusses the machine learning framework used in this thesis and its building blocks.

2.2 Missing Data Proportions

Missing data in datasets influences the analysis, inferences and conclusions reached based on the information [21]. The impact on the performance of machine learning algorithms becomes more significant as the proportion of missing data in the dataset increases. Researchers have shown that the impact is not as significant when the proportion of missing data is small in large-scale datasets ( [22]- [24]); this could be attributed to the fact that certain machine learning algorithms inherently possess frameworks that cater to certain proportions of missing data. As the proportion of missing data increases, for example beyond 25%, the tolerance and performance levels of machine learning algorithms decrease significantly [25]. It is because of these reduced tolerance and performance levels that more complex and reliable approaches to the missing data problem are required.
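The proportion of missing entries is straightforward to compute. In the sketch below (illustrative names, with NaN marking a missing entry), the 25% threshold simply mirrors the tolerance figure cited above rather than any hard rule:

```python
import numpy as np

def missing_proportion(data):
    """Fraction of entries in a 2-D dataset that are missing (NaN)."""
    data = np.asarray(data, dtype=float)
    return float(np.isnan(data).mean())

dataset = np.array([[0.38, np.nan, 0.20],
                    [0.69, 0.11, np.nan],
                    [np.nan, 0.79, 0.66]])
p = missing_proportion(dataset)
needs_robust_imputation = p > 0.25   # per the tolerance figure above
```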

2.3 Missing Data Mechanisms

Missing data refers to any scenario whereby some or all feature variables within a dataset have missing entries, or contain entries which are not exactly characterized within the bounds of the problem domain [26]. The presence of missing data leads to several issues in sectors that depend on the availability of complete, quality data, and has resulted in different methods being introduced to address the problem in varying disciplines ( [26] and [27]). Handling missing data in an acceptable way depends on the nature of the missingness. There are currently four missing data mechanisms in the literature: MCAR, MAR, MNAR (the non-ignorable case) and MBND.
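The four mechanisms are defined precisely in the subsections that follow; the first three can already be made concrete with a small simulation. The masking rules below are illustrative choices of ours, using the income/expenditure example from later in this section, not definitions from the thesis:

```python
import numpy as np

rng = np.random.default_rng(1)
income = rng.uniform(0, 100, size=1000)              # fully observed
expenditure = 0.6 * income + rng.normal(0, 5, 1000)  # variable to be masked

# MCAR: whether expenditure is missing ignores both variables entirely.
mcar_mask = rng.random(1000) < 0.2

# MAR: expenditure goes missing depending on the *other* variable (income).
mar_mask = income > 75

# MNAR: expenditure goes missing depending on its own (unobserved) value.
mnar_mask = expenditure > 45
```

Only under MCAR and MAR can the missing entries be recovered from the observed variables; under MNAR, the missingness itself carries information that the observed data cannot reveal.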

2.3.1 Missing Completely at Random (MCAR)

The MCAR case is observed when the probability of a feature variable having missing entries depends neither on the feature variable itself nor on any of the other feature variables within the dataset. This relationship is expressed mathematically as [1]:

P (M |Yo, Ym) = P (M) (2.1)

where M ∈ {0, 1} is the missing data indicator: M = 1 if Y is known and M = 0 if Y is unknown (missing). Yo represents the observed values in Y while Ym represents the missing values of Y. From equation (2.1), the probability of an entry being missing is related neither to Yo nor to Ym. For instance, when modeling software defects in relation to development time, if the missingness is in no way linked to the missing values of the defect rate itself, and at the same time not linked to the values of the development time, the data is said to be MCAR. Researchers have successfully addressed cases where the data is MCAR: the authors of [28] applied multilayer perceptrons (MLPs) for missing data imputation in datasets with missing values, and other work on this mechanism can be found in [29] and [30].

2.3.2 Missing at Random (MAR)

The MAR case is observed when the probability of a specific feature variable having missing entries is related to the other feature variables in the dataset, but not to the feature variable itself [31]. For example, consider a dataset with two related variables, monthly expenditure and monthly income. Assume that all high-income earners decline to reveal their monthly expenditures while low-income earners provide this information. The dataset then has no monthly expenditure entries for high-income earners, while for low-income earners the information is available: the missing monthly expenditure entry is linked to the income level of the individual. This relationship can be expressed mathematically as [1]:

P (M |Yo, Ym) = P (M |Yo) (2.2)

where M ∈ {0, 1} is the missing data indicator, with M = 1 if Y is known and M = 0 if Y is unknown (missing). Yo represents the observed values in Y while Ym represents the missing values of Y. Equation (2.2) states that the probability of an entry being missing, given both the observed and the missing values, is equivalent to the probability given the observed values only. Considering the example described in Section 2.3.1, the software defects might not be revealed because of a certain development time; such a scenario points to the data being MAR. Several studies have been conducted where the missing data mechanism is MAR: for example, [12] compared the performance of expectation maximization against a GA-optimized AANN and found the AANN to be the better method. Further research on this mechanism was performed in [32]- [34].

2.3.3 Non-Ignorable Case or Missing Not at Random (MNAR)

The third missing data mechanism is the missing not at random or non-ignorable case. The MNAR case is observed when the probability of a feature variable having a missing entry depends on the value of the feature variable itself, irrespective of any alteration to the values of the other feature variables in the dataset [27]. In such scenarios it is impossible to estimate the missing data from the other feature variables, since the nature of the missingness is not random. MNAR is the most challenging missing data mechanism to model, and these values are very difficult to estimate [26]. Consider again the scenario described in the previous subsection, and assume that some high-income earners reveal their monthly expenditures while others refuse, and likewise for low-income earners. Unlike the MAR mechanism, the missing entries in the monthly expenditure variable cannot now be ignored, because they are not directly linked to the income variable or any other variable. Models developed to estimate this kind of missing data are very often biased, and a probabilistic formulation of this mechanism is not easy because the data is neither MAR nor MCAR.

2.3.4 Missing by Natural Design (MBND)

MBND is a mechanism whereby data is missing because it cannot be measured physically [35]. Although these entries cannot be measured, they are quite relevant to the data analysis procedure, and overcoming the problem requires that mathematical equations be formulated. This mechanism mainly applies to mechanical engineering and natural science problems, and it is therefore not used for the problem considered in this thesis.


2.4 Missing Data Patterns

The ways in which missing data occurs can be grouped into three patterns, illustrated in Tables 2.1-2.3. Table 2.1 depicts the univariate pattern, in which missing data is present in only one feature variable, as seen in column I7. Table 2.2 depicts the arbitrary pattern, in which missing data occurs in a distributed and random manner. The last pattern is the monotone pattern, shown in Table 2.3; it is also referred to as a uniform pattern, occurs in cases where missing data can be present in more than one feature variable, and is easy to understand and recognize [1].

Table 2.1: Univariate Missing Data Pattern

Sample I1 I2 I3 I4 I5 I6 I7

1 0.38 0.18 0.20 0.19 0.75 0.67 0.96

2 0.69 0.11 0.08 0.41 0.65 0.63 ?

3 0.17 0.79 0.66 0.53 0.95 0.43 ?

4 0.19 0.24 0.15 0.91 0.46 0.82 ?

Table 2.2: Arbitrary Missing Data Pattern

Sample I1 I2 I3 I4 I5 I6 I7

1 0.38 ? 0.20 0.19 0.75 0.67 0.96

2 0.69 0.11 0.08 0.41 ? 0.63 0.04

3 0.17 0.79 ? 0.53 0.95 0.43 0.54

4 ? 0.24 0.15 0.91 0.46 0.82 ?

Table 2.3: Monotone Missing Data Pattern

Sample I1 I2 I3 I4 I5 I6 I7

1 0.38 0.18 0.20 0.19 0.75 0.67 ?

2 0.69 0.11 0.08 0.41 0.65 ? ?

3 0.17 0.79 0.66 0.53 ? ? ?

4 0.19 0.24 0.15 ? ? ? ?

The missing data pattern considered in this thesis is the arbitrary pattern, and the mechanisms considered are Missing at Random and Missing Completely at Random.
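The three patterns of Tables 2.1-2.3 can be generated programmatically as boolean masks (True marking a missing entry); the sketch below uses illustrative names of ours:

```python
import numpy as np

def univariate_mask(n_rows, n_cols, col, missing_rows):
    """Missing values confined to a single feature column (Table 2.1)."""
    mask = np.zeros((n_rows, n_cols), dtype=bool)
    mask[missing_rows, col] = True
    return mask

def arbitrary_mask(n_rows, n_cols, fraction, rng):
    """Missing values scattered at random over the table (Table 2.2)."""
    return rng.random((n_rows, n_cols)) < fraction

def monotone_mask(n_rows, n_cols):
    """Each successive record is missing one more trailing feature (Table 2.3)."""
    mask = np.zeros((n_rows, n_cols), dtype=bool)
    for i in range(n_rows):
        mask[i, max(0, n_cols - i - 1):] = True
    return mask
```

For instance, `univariate_mask(4, 7, 6, [1, 2, 3])` reproduces the mask of Table 2.1, and `monotone_mask(4, 7)` that of Table 2.3.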


2.5 Classical Missing Data Techniques

Depending on how data goes missing in a dataset, several imputation techniques are currently in use in statistical packages [36]. These range from basic approaches, such as case-wise data deletion, to approaches characterized by the application of more refined artificial intelligence and statistical methods. The subsections that follow present some of the most commonly applied missing data imputation methods, beginning with basic and naive approaches and moving on to more complex and competent mechanisms. Classical missing data imputation techniques remain widespread courtesy of their simplicity and ease of implementation. The techniques presented in this section are list-wise or case-wise deletion, pair-wise deletion, mean substitution, stochastic imputation with expectation maximization, hot and cold deck imputation, multiple imputation and regression methods.

2.5.1 List-Wise or Case-Wise Deletion

Many statistical approaches discard an entire record if any of the columns in the record has a missing data entry. This approach is termed case-wise or list-wise data deletion: whenever any feature variable in a record has a missing value, the entire record is deleted from the dataset. List-wise deletion is the easiest and most basic way to handle the problem of missing data, and also the least recommended, because it tends to significantly reduce the number of records available for the data analysis task and thereby reduces the accuracy of the findings from the analysis. Applying this technique is a possibility only if the ratio of records with missing data to records with complete data is very small. If this is not the case, making use of this approach may result in the estimates of the missing data being biased.
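As an illustration, list-wise deletion is a one-line operation in most data analysis tools; the sketch below uses pandas on a toy table in the spirit of Table 2.2 (the variable names are our own):

```python
import numpy as np
import pandas as pd

# Toy records in the spirit of Table 2.2; NaN marks a missing entry.
df = pd.DataFrame({
    "I1": [0.38, 0.69, 0.17, np.nan],
    "I2": [np.nan, 0.11, 0.79, 0.24],
    "I3": [0.20, 0.08, np.nan, 0.15],
})

# List-wise (case-wise) deletion: drop every record containing any
# missing value -- simple, but it shrinks the dataset drastically.
complete = df.dropna()
```

Here three of the four records are discarded, which illustrates why the approach is only viable when the proportion of incomplete records is very small.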

2.5.2 Pair-Wise Deletion

The pair-wise data deletion approach performs the required analysis using pair-wise data. The implication is that a record with missing data is still used in an analysis task, provided the feature variable with the missing entry in that record is not needed for the task. The benefit is that the number of records used for analysis will often be greater than with the list-wise data deletion approach. However, this approach yields biased missing data estimates when the missing data mechanism is MAR or MNAR, which is a bad outcome for the dataset. On the contrary, the approach is quite competent if the data is MCAR.
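Pair-wise deletion can be sketched with pandas, whose correlation routines discard incomplete observations per pair of columns rather than per record (toy data; variable names are our own):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "I1": [0.38, 0.69, 0.17, np.nan],
    "I2": [0.10, 0.11, np.nan, 0.24],
})

# Pair-wise deletion: a statistic on (I1, I2) uses every record where
# *both* of those columns are observed, not only fully complete records.
n_pairs = df[["I1", "I2"]].dropna().shape[0]   # records usable for this pair
corr = df["I1"].corr(df["I2"])                 # pandas drops NaN pairs itself
```

Two of the four records contribute to this pair, whereas list-wise deletion across all columns would have left fewer usable records as the number of variables grows.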

2.5.3 Mean-mode Substitution

This approach substitutes missing data entries with the mean or mode of the available data in the affected feature variable(s). Like the pair-wise data deletion approach, it has a high likelihood of yielding biased estimates of the missing data [37], and it is therefore not a highly recommended solution to this problem. For dataset feature variables with continuous or numerical values, the missing entries are substituted by the mean of the respective variable. For feature variables with categorical or nominal values, the missing entries are substituted by the most common or modal value of the respective feature variable [1]. These techniques are most effective when the data is assumed to be MCAR. Mean-mode substitution has been used with success in previous research (see [38] and [39]).
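The two rules above (mean for numerical, mode for categorical) can be sketched in a few lines of pandas (toy data and column names are our own):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [23.0, np.nan, 31.0, 26.0],    # numerical feature
    "colour": ["red", "blue", None, "red"],  # categorical feature
})

# Numerical column: substitute missing entries with the column mean.
df["age"] = df["age"].fillna(df["age"].mean())
# Categorical column: substitute missing entries with the modal value.
df["colour"] = df["colour"].fillna(df["colour"].mode()[0])
```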

2.5.4 Imputation

Imputation in statistics is the process of replacing missing data with substituted values, thus addressing the pitfalls caused by the presence of missing data. Imputation techniques can be categorized into single and multiple imputation. Single imputation replaces a missing value with only one estimated value, while multiple imputation replaces each missing entry with a set of M estimated values.

2.5.4.1 Single-based Imputation

Expectation Maximization

Expectation maximization (EM) is a model-based imputation technique designed for parameter estimation in probabilistic models with missing data [40]. EM is a two-step process. The first step, known as the E-step, estimates a probability distribution over completions of the missing data given the current model. The second step, the M-step, identifies parameter estimates that maximize the complete-data log-likelihood obtained from the E-step. The algorithm stops either when convergence is attained or when a maximum number of iterations is reached [40]. Details about this algorithm can be found in [40]. It is applicable in both single and multiple imputation procedures and has been shown to perform better than the techniques described above ( [12], [25] and [35]). This technique works best on the assumption that the data is MAR.
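A much simplified sketch of EM-style imputation for multivariate normal data is given below; the function name is our own and, as noted in the comments, the usual correction term on the covariance is omitted for brevity:

```python
import numpy as np

def em_impute(X, n_iter=50):
    """Simplified EM-style imputation assuming multivariate normal data.
    E-step: replace each missing entry by its conditional expectation
    given the observed entries under the current (mu, Sigma).
    M-step: re-estimate mu and Sigma from the completed data.
    (The usual correction term on Sigma is omitted for brevity.)"""
    X = X.copy()
    miss = np.isnan(X)
    # Initialise missing entries with the observed column means.
    X[miss] = np.take(np.nanmean(X, axis=0), np.where(miss)[1])
    for _ in range(n_iter):
        mu = X.mean(axis=0)                        # M-step
        Sigma = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        for i in np.where(miss.any(axis=1))[0]:    # E-step per record
            m, o = miss[i], ~miss[i]
            S_oo = Sigma[np.ix_(o, o)]
            S_mo = Sigma[np.ix_(m, o)]
            X[i, m] = mu[m] + S_mo @ np.linalg.solve(S_oo, X[i, o] - mu[o])
    return X

rng = np.random.default_rng(0)
z = rng.normal(size=(200, 1))
data = np.hstack([z, z + 0.1 * rng.normal(size=(200, 1))])  # correlated pair
truth = data[0, 1]
data[0, 1] = np.nan
filled = em_impute(data)
```

Because the two columns are strongly correlated, the conditional expectation recovers the hidden value closely, which is the behaviour EM exploits under the MAR assumption.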

Hot Deck and Cold Deck Imputation

These methods fall under the category of donor-based imputation techniques. Donor-based imputation entails substituting missing entries with data from other records.

Hot deck imputation is a method in which missing data entries are filled in with values from other records in the same dataset. This is achieved by [1]:

• splitting instances into clusters of similar data, using methods such as k-Nearest Neighbour, and

• replacing missing entries with values from instances that fall in the same cluster.

Cold deck imputation, on the other hand, substitutes the missing data with a constant value obtained from other sources [1]. Hot and cold deck imputation are popular owing to their simplicity and because no strong assumptions need to be made about the model used to fit the data. It is worth noting, though, that this imputation strategy does not necessarily reduce bias relative to the incomplete dataset.
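A minimal hot deck sketch is given below (the function name is our own); a single nearest complete record stands in for the clustering step, i.e. a 1-nearest-neighbour variant:

```python
import numpy as np

def hot_deck_impute(X):
    # Hot deck sketch: each incomplete record borrows its missing values
    # from the nearest complete record (the donor), with similarity
    # measured on the jointly observed features (a 1-NN variant).
    X = X.copy()
    miss = np.isnan(X)
    donors = X[~miss.any(axis=1)]                # complete records only
    for i in np.where(miss.any(axis=1))[0]:
        o = ~miss[i]
        d = np.linalg.norm(donors[:, o] - X[i, o], axis=1)
        X[i, miss[i]] = donors[np.argmin(d)][miss[i]]
    return X

X = np.array([
    [0.38, 0.18, 0.20],
    [0.69, 0.11, 0.08],
    [0.40, 0.20, np.nan],   # nearest complete record is the first one
])
imputed = hot_deck_impute(X)
```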

2.5.4.2 Multiple-based Imputation

Multiple Imputation (MI) is an approach whereby each missing data entry is substituted with a set of M approximated values. In [26], MI is described in three consecutive steps. The first step entails substituting the missing data entries in the dataset with M different values, which results in M different datasets with complete records. The second step entails analysing the M completed datasets by applying complete-data analysis techniques. Finally, in step three, the results from the M datasets are combined based on the analysis done in step two; this combined result indicates which of the M datasets obtains the best estimates of the missing data entries or yields the better conclusions and inferences. This approach is superior to the single imputation approaches. It also combines the advantages of the EM and likelihood estimation approaches with the popular traits of the hot-deck imputation method to obtain new data matrices for processing ( [31] and [37]). The three steps mentioned above can be further explained in the points below:

• Make use of a reliable model that incorporates randomness to estimate the missing values;

• Generate M complete datasets by repeating the process M times;

• Apply complete-data analysis algorithms to each of the datasets obtained;

• From the M complete datasets obtained, calculate the overall estimates by averaging the values across the M datasets.

This method depends on the assumption that the data is MAR and originates from a multivariate normal distribution.
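The bullet points above can be sketched end to end; the imputation model here (observed mean plus Gaussian noise) is a deliberately naive stand-in for a proper model, and the function name is our own:

```python
import numpy as np

def multiple_impute(x, M=5, seed=0):
    # Step 1: create M completed copies; each copy differs because the
    # imputation model incorporates randomness (mean + Gaussian noise).
    rng = np.random.default_rng(seed)
    obs = x[~np.isnan(x)]
    copies = []
    for _ in range(M):
        c = x.copy()
        n_missing = int(np.isnan(c).sum())
        c[np.isnan(c)] = rng.normal(obs.mean(), obs.std(), n_missing)
        copies.append(c)
    return copies

x = np.array([0.38, np.nan, 0.20, 0.19, 0.75])
copies = multiple_impute(x)                 # step 1: M complete datasets
estimates = [c.mean() for c in copies]      # step 2: analyse each copy
pooled = float(np.mean(estimates))          # step 3: combine the M results
```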

2.5.5 Regression Methods

This approach involves generating a regression equation from the records with all data available for a given feature variable. To achieve this, the feature variable with the missing data is treated as the dependent variable in the equation, with all the other feature variables treated as independent variables (predictors). For records with missing values, the estimates are obtained by applying the regression equation, with the feature variable of interest as the output and all the others as the model inputs [1].

The process of generating regression equations is repeated, in order, for the feature variables with missing data entries until all such missing values are estimated and substituted. This means that a feature variable vj with missing data entries will have a model created for it using records with known values for the other variables. Applying this method to estimate the missing data entry in sample 2 of Table 2.2, the regression equation to be fitted will consider the variables I1, I2, I3, I4, I6, and I7. This results in the equation below:

I5 = i1I1 + i2I2 + i3I3 + i4I4 + i6I6 + i7I7 + ε (2.3)

The regression equation comprises the coefficient terms ii as well as the error term, ε. It can subsequently be applied to approximate missing data entries by replacing I1, I2, I3, I4, I6, and I7 with their known values.
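Equation (2.3) can be fitted by ordinary least squares; the sketch below uses toy values (entries that are missing in Table 2.2 are invented here purely for illustration):

```python
import numpy as np

# Records with I5 observed: columns are I1, I2, I3, I4, I6, I7
# (toy values; gaps in Table 2.2 are filled in arbitrarily here).
X = np.array([
    [0.38, 0.18, 0.20, 0.19, 0.67, 0.96],
    [0.17, 0.79, 0.66, 0.53, 0.43, 0.54],
    [0.19, 0.24, 0.15, 0.91, 0.82, 0.33],
])
y = np.array([0.75, 0.95, 0.46])            # their observed I5 values

# Least-squares estimate of the coefficients in Eq. (2.3), intercept last.
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Estimate the missing I5 of sample 2 from its known feature values.
x2 = np.array([0.69, 0.11, 0.08, 0.41, 0.63, 0.04, 1.0])
i5_hat = float(x2 @ coef)
```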

2.6 Machine Learning Approaches to Missing Data

Several approaches in computational intelligence have been developed to address the prob-

lems of missing data and the drawbacks of statistical techniques covered in Section 2.5.

Some of these techniques are tree-based or based on biological concepts.

2.6.1 Decision Trees

Decision trees are supervised learning models aimed at separating data into homogeneous groups for classification or regression analysis. A decision tree is acyclic by definition and consists of a root node, leaf nodes, internal nodes and edges. The root node marks the start of the tree, while the leaf nodes represent the ends of the tree and present either the final outcome or the class label. Each internal node stores details about the attribute used for splitting the data at that node. The edges are links between the nodes and contain details about the splits. The outcome for a record is obtained by processing its information across the tree from the root node to a leaf node [41].

Using decision trees for the missing data estimation task entails building a tree for each feature variable with missing data entries. This feature variable is treated as the class-label, with the actual class label forming part of the input feature set. The tree is built using records with known values, and the missing data entries are then substituted using the corresponding tree [41]. Suppose, for example, that a dataset has attributes I1, I2, I3 and a class-label L, obtained as L(I1, I2, I3). If I1 has missing values, I1 is treated as the class-label while L is regarded as one of the input feature variables; the new output is obtained by I1(L, I2, I3). If I2 has the missing data, the new output is obtained using I2(I1, L, I3), with I2 as the new class-label. This procedure is executed until all feature variables with missing data are complete.

The strategy described above operates like a single imputation method and has been applied successfully ( [41]- [43]). It is unknown whether the sequence in which the missing values are substituted influences the estimates, and the method works best when the data is assumed to be MCAR.
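The scheme above can be sketched with scikit-learn's `DecisionTreeRegressor` (toy values; for brevity this sketch omits adding the actual label L to the inputs):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Treat the feature with missing entries (I3) as the prediction target
# and train on the records where it is observed.
X_obs = np.array([[0.38, 0.18], [0.69, 0.11], [0.19, 0.24]])  # I1, I2
y_obs = np.array([0.20, 0.08, 0.15])                          # known I3

tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(X_obs, y_obs)

# Substitute the missing I3 entry of an incomplete record.
i3_hat = float(tree.predict(np.array([[0.17, 0.79]]))[0])
```

Since a regression tree predicts leaf means of the training targets, the imputed value always lies within the observed range of the target feature.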

2.6.2 Artificial Neural Networks (ANNs)

An ANN is a probabilistic model that processes data in the way a biological nervous system, such as the human brain, does [8]. It can also be defined as a collection and combination of elements, neurons in the case of a neural network, whose collective performance depends on the individual elements. The neurons are connected to one another, and the nature of these connections also influences the performance of the network. The fundamental processing unit of a neural network is the neuron [44]. A neural network comprises four important components [45]: (1) at any stage in the lifetime of the network, each neuron has an activation value; (2) each neuron is connected to other neurons, and these connections determine how the activation level of one neuron becomes the input of another, with each connection allocated a weight value; (3) at each neuron, an activation function is applied to all the incoming inputs to generate a new input for neurons in the output layer or subsequent hidden layers; and (4) a learning algorithm is used to adjust the weights between neurons when given an input-output pairing.

A predominant feature of a neural network is its capability to accommodate and adapt to its environment as new data and information are introduced. It is with this in mind that learning algorithms were created, and they are very important in determining how competent a neural network can be. Neural networks are applicable in several domains, such as the modeling of highly complicated problems, because of the relative ease with which they derive meaning from complex data and identify patterns and trends that are too convoluted for other computational models [8]. Trained neural networks are applicable in prediction tasks where the aim is to determine the outcome of a new input record after having been presented with similar information during the training process [8]. Their inherent ability to adapt with ease to new non-linear information makes them well suited to modeling non-linear problems.

Neural networks have been observed to be highly efficient and capable in solving a variety of tasks, most notably forecasting and modeling, expert systems and signal processing tasks [45]. The organization of the neurons in a neural network affects the processing capability of the network as well as the way in which information moves between the layers and neurons.

2.6.2.1 Auto-Associative Neural Network

Auto-encoder networks are networks that try to regenerate their inputs as the outputs in the output layer [46], which ensures that the network will be capable of predicting new input values as the outputs when presented with new inputs. These auto-encoders are made up of one input layer and one output layer with the same number of neurons, hence the term auto-associative [46]. Between these two layers lies a narrow hidden layer; it is important that this layer contains fewer neurons than the input and output layers, so that the network performs encoding and decoding procedures when solving a given task [47]. These networks have been used in a variety of applications ( [48]- [53]). The main concept defining the operation of auto-encoder networks is the notion that the mapping from the input to the output, x(i) → y(i), captures important information and the key latent structure present in the input, x(i), that is otherwise hidden ( [48] and [54]). An auto-encoder takes x as the input and encodes it into y, a hidden representation of the input, by making use of a deterministic mapping function, fθ. This function is expressed as ( [54] and [55]):

fθ (x) = s (Wx+ b) . (2.4)


The parameter θ comprises the weights W and biases b, and s represents the sigmoid activation function, which is given by:

s(x) = 1/(1 + e−x). (2.5)

y is then mapped to a vector, z, representing the reconstruction of the inputs from this

hidden representation. This reconstructed vector is obtained by using the following equa-

tions [54]:

z = gθ′ (y) = s (W ′y + b′) , (2.6)

or

z = gθ′ (y) = W ′y + b′. (2.7)

In the above equations, θ′ consists of the weight matrix W′ and bias vector b′ of the decoder; when W′ is the transpose of the weight matrix from equation (2.4), the network is said to have tied weights. Equation (2.6) is the auto-encoder output function with a sigmoid transfer function (equation (2.5)), while equation (2.7) is the linear output equation. After these operations, the network can be fine-tuned by applying a supervised learning approach [55]. In probabilistic terms, the vector z is not regarded as an exact transformation of x, but rather as the parameters of a distribution p(X|Z = z); the hope is that these parameters will generate x with high probability [55]. The resulting equation is as follows [54]:

This leads to an associated reconstruction loss that forms the basis of the objective function used by the optimization algorithm. This loss is usually represented by [54]:

L(x, z) ∝ −log p(x|z), (2.9)

where ∝ indicates proportionality. The equation above can also be represented by [56]:

δAE(θ) = ∑_{t} L(x(t), gθ′(fθ(x(t)))). (2.10)

Auto-encoder networks have been used in a variety of application areas by several researchers, with the focus being on the problem of missing data ( [8], [57]- [59]).
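Equations (2.4)-(2.7) amount to a few matrix operations; below is a forward-pass sketch in NumPy with randomly initialised (untrained) parameters and our own variable names:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))         # Eq. (2.5)

rng = np.random.default_rng(0)
n_in, n_hidden = 7, 3                       # narrow hidden layer
W = rng.normal(0.0, 0.1, (n_hidden, n_in))  # encoder weights
b = np.zeros(n_hidden)                      # encoder biases
b_prime = np.zeros(n_in)                    # decoder biases

x = rng.uniform(size=n_in)                  # one input record
y = sigmoid(W @ x + b)                      # encoder f_theta, Eq. (2.4)
z = sigmoid(W.T @ y + b_prime)              # tied-weight decoder, Eq. (2.6)
reconstruction_error = float(np.mean((x - z) ** 2))
```

Training would adjust W, b and b′ to drive the reconstruction error of Eq. (2.10) down over the training set.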


2.6.3 Support Vector Machine

Support Vector Machine (SVM) is a classification model capable of solving both linear and non-linear complex problems ( [60] and [61]). In linear problems, the model tries to identify the maximal marginal hyper-plane, that is, the separating hyper-plane with the greatest margin. This hyper-plane must obey the following expression ( [60] and [61]):

f(x) = 1 if w · x + b ≥ 1, and f(x) = −1 if w · x + b ≤ −1, (2.11)

where w and x represent the weight and input vectors, respectively, and b indicates the

bias. Larger margins are preferable as they increase the accuracy of classification. In scenarios where the data are not linearly separable, the data require transformation into a higher dimensional space. The model then identifies an optimal hyper-plane capable of separating the classes in the new high dimensional space. The kernel functions used to map the original data into higher dimensions can be expressed mathematically as ( [62]- [64]):

K(xi, xj) = ϕ(xi).ϕ(xj), (2.12)

where ϕ(xi) and ϕ(xj) are the non-linear mapping functions. Some frequently used kernel functions are the polynomial, sigmoid and Gaussian radial basis functions ( [62]- [64]).
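Two of the kernels named above are simple to write down; a NumPy sketch follows (the parameter defaults are arbitrary choices for illustration):

```python
import numpy as np

def polynomial_kernel(x, y, degree=3, c=1.0):
    # K(x, y) = (x . y + c)^d
    return (np.dot(x, y) + c) ** degree

def rbf_kernel(x, y, gamma=1.0):
    # Gaussian radial basis function: K(x, y) = exp(-gamma * ||x - y||^2)
    return np.exp(-gamma * np.sum((x - y) ** 2))

a = np.array([0.38, 0.18])
b = np.array([0.69, 0.11])
k_poly = polynomial_kernel(a, b)
k_rbf = rbf_kernel(a, b)
```

Both kernels are symmetric, and the RBF kernel of a point with itself is exactly 1, its maximum.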

2.7 Machine Learning Optimization

As previously mentioned, constructing models for handling missing data can be complex

and computationally expensive. Successful models employ an optimization technique to

construct a model that best fits the training set. In this section, we highlight various

strategies that have been employed as optimization techniques in missing data problems.

2.7.1 Genetic Algorithm

Genetic algorithm (GA) is an evolutionary computational technique designed to search for globally optimal solutions to complex problems. It was inspired by Darwin's theory of natural evolution. Genetic algorithms use the notion of survival of the fittest, where the strongest individuals are selected for reproduction, until the best solution is found or the allotted number of cycles is completed. The processes involved in a genetic algorithm are selection, crossover, mutation and recombination. The selection process selects the strongest parent individuals for the crossover process using a probabilistic technique. During crossover, a crossover point is chosen between the parent individuals, and data is exchanged between the start of each individual and the crossover point, producing two children. If the children are stronger than their parents, they can be used to replace one or both parents. Mutation is performed by randomly selecting a gene and inverting it; it is given a low probability value, so it occurs less often than the crossover process. The recombination process evaluates the fitness of the children, or newly generated individuals, to determine whether they can be merged into the current population.

As previously mentioned, genetic algorithms have been applied to optimize neural net-

works ( [12], [35] and [59]) by searching for individuals that maximize the objective func-

tion, prior to imputation. This algorithm is classified within the domain of computational

intelligence as per [65] and has been further used to address the missing data problem

in [66].
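The selection-crossover-mutation-recombination cycle can be sketched on a toy objective (maximising the number of ones in a bit string); all names and parameter values below are illustrative:

```python
import random

random.seed(0)

def fitness(bits):
    return sum(bits)                         # toy objective: count the ones

def select(pop):
    # Probabilistic selection of a strong parent (binary tournament).
    a, b = random.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):
    point = random.randrange(1, len(p1))     # single crossover point
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def mutate(bits, rate=0.05):
    # Mutation gets a low probability, so it occurs less than crossover.
    return [1 - g if random.random() < rate else g for g in bits]

pop = [[random.randint(0, 1) for _ in range(16)] for _ in range(20)]
for _ in range(40):                          # generations (cycles)
    children = []
    while len(children) < len(pop):
        c1, c2 = crossover(select(pop), select(pop))
        children += [mutate(c1), mutate(c2)]
    # Recombination: keep the fittest of the parents and children.
    pop = sorted(pop + children, key=fitness, reverse=True)[:len(pop)]

best = max(pop, key=fitness)
```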

2.7.2 Particle Swarm Optimization

Particle swarm optimization (PSO) is a search technique based on a collective behavior

of birds within a flock. The goal of the technique is to simulate the random and unpre-

dictable movement of a flock of birds, with the intent of finding patterns that govern the

birds’ ability to move at the same time, and change direction whilst regrouping in an

optimal manner ( [67] and [68]).

PSO particles move through a search space. A change in the position of a particle within the space is based on the socio-psychological tendency of each particle to emulate the success of neighbouring particles as well as its own success. These changes are influenced by the knowledge or experience of the surrounding particles and of the particle itself, so the search behavior of one particle is affected by the behavior of other particles and by its own. The collective behavior of particles within a swarm permits the discovery of globally optimal solutions in high dimensional search spaces ( [67] and [68]). This algorithm is also classified within the domain of computational intelligence as per [65] and has also been used to address the missing data problem in [66].
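The velocity update driven by personal and neighbourhood success can be sketched on a toy objective (minimising the sphere function); the coefficient values are conventional but arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def objective(x):
    return float(np.sum(x ** 2))     # toy function, minimum at the origin

n_particles, dim = 30, 2
pos = rng.uniform(-5, 5, (n_particles, dim))
vel = np.zeros((n_particles, dim))
pbest = pos.copy()                                   # personal bests
gbest = pos[np.argmin([objective(p) for p in pos])]  # global best

w, c1, c2 = 0.7, 1.5, 1.5            # inertia and acceleration weights
for _ in range(100):
    r1, r2 = rng.uniform(size=(2, n_particles, dim))
    # Velocity update: own experience plus the neighbourhood's success.
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = pos + vel
    for i in range(n_particles):
        if objective(pos[i]) < objective(pbest[i]):
            pbest[i] = pos[i]
        if objective(pos[i]) < objective(gbest):
            gbest = pos[i].copy()
```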

2.7.3 Simulated Annealing

Simulated annealing (SA) is an optimization technique based on the process of cooling metal substances with the goal of condensing matter into a crystalline solid, and it can be viewed as a procedure for finding an optimal solution. The main characteristics of simulated annealing are that (i) it can find a global optimum solution; (ii) it is easy to implement for complex problems; and (iii) it can handle complex problems and cost functions with various numbers of variables. The drawbacks of simulated annealing are [69]:

• It takes many iterations to find an optimal solution;

• The cost function is computationally expensive to estimate;

• It is inefficient if there are many local minimum points;

• It depends on the nature of the problem, and;

• It is difficult to determine the temperature cooling technique.

Research in SA has shown that it performs marginally better than the GA and PSO techniques ( [35] and [69]). However, for problems involving datasets of high dimensionality, GA and PSO are recommended over SA because of these drawbacks. In addition, this algorithm is classified within the domain of computational intelligence as per [65] and has been further used to address the missing data problem in [66].
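The accept-or-reject rule and the temperature cooling schedule can be sketched on a one-dimensional toy cost (all parameter values are illustrative):

```python
import math
import random

random.seed(0)

def cost(x):
    return x * x                     # toy cost, minimum at x = 0

x = 8.0                              # initial solution
best = x
T = 1.0                              # initial temperature
for step in range(2000):
    candidate = x + random.gauss(0, 0.5)          # propose a neighbour
    delta = cost(candidate) - cost(x)
    # Accept improvements always; accept worse moves with a probability
    # that shrinks as the temperature cools (escapes local minima early).
    if delta < 0 or random.random() < math.exp(-delta / T):
        x = candidate
    best = min(best, x, key=cost)
    T *= 0.995                       # geometric cooling schedule
```

Choosing the cooling schedule (here, the 0.995 factor) is exactly the difficulty listed among the drawbacks above.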

2.8 Deep Learning (DL)

Deep learning comprises a variety of machine learning techniques that make use of a cascade of non-linear processing nodes arranged into multiple layers, which extract and transform feature values from the input vector ( [70] and [71]). Each layer of such a system takes as input the outputs of the preceding layer, except for the input layer, which receives the signals or input vectors directly from the outside environment. During training of these systems, either unsupervised or supervised techniques can be applied, which makes the models applicable to supervised learning tasks such as classification and to unsupervised tasks such as pattern analysis. Deep learning models are also based on extracting higher level features from lower level features to obtain a hierarchical representation of the input data via unsupervised learning on the different levels of features [71]. A hierarchy of concepts is obtained by learning several layers of representations of the data that correspond to different levels of abstraction. Some of the deep learning frameworks in the literature are Deep Belief Networks (DBNs) ( [19] and [72]), Deep/Stacked Auto-encoder Networks (DAEs/SAEs) ( [73] and [74]) and Convolutional Neural Networks (CNNs) ( [75] and [76]). The deep learning technique used in this thesis is the Stacked Auto-encoder (SAE). It is built from restricted Boltzmann machines that are trained in an unsupervised manner using the contrastive divergence method and subsequently joined to form the encoder and decoder parts of the network, which is then trained in a supervised manner using the stochastic gradient descent algorithm. The motivation for using an SAE is that it is trained in such a way that the hidden layer retains all the information about the input.

2.8.1 Restricted Boltzmann Machine (RBM)

Prior to defining an RBM, we begin by explaining what a Boltzmann machine (BM) is. It is a bidirectionally connected network of stochastic processing units, which can be interpreted as a neural network [77]. It can be used to learn important aspects of an unknown probability distribution based on samples from the distribution, which is typically a challenging procedure. The learning procedure can be simplified by imposing constraints on the architecture of the network, which leads to restricted Boltzmann machines [78].

RBMs can be defined as probabilistic, undirected, parametrized graphical models, also referred to as Markov random fields (MRFs). RBMs have received a lot of attention after being proposed as building blocks of multi-layered architectures called deep networks ( [54] and [78]). The concept behind deep networks is that the hidden neurons extract relevant features from the input data, and these features then serve as input to another RBM [54]. The goal in stacking the RBMs is to obtain higher level representations of the data by learning features from features [54]. RBMs, which are MRFs associated with bipartite undirected graphs, are made up of m visible units, V = (V1, . . . , Vm), representing the observable data, and n hidden units, H = (H1, . . . , Hn), that capture the relationships between the variables in the input layer ( [19] and [79]). The joint variables (V, H) take on values (v, h) ∈ {0, 1}m+n. The joint probability, which is given by the Gibbs distribution, is defined via an energy function ( [80] and [81]):

E (v, h) = −hTWv − bTv − cTh. (2.13)

In scalar form, (2.13) is expressed as ( [80] and [81]):

E(v, h) = −∑_{i=1}^{n} ∑_{j=1}^{m} wij hi vj − ∑_{j=1}^{m} bj vj − ∑_{i=1}^{n} ci hi. (2.14)

In (2.14), wij represents a real-valued weight between the visible unit Vj and the hidden unit Hi; these weights are the most essential parameters of an RBM. The parameters bj and ci represent real-valued bias terms associated with the jth visible variable and the ith hidden variable [54]. When wij is less than zero and vj = hi = 1, a high energy is obtained, which corresponds to a decrease in probability. However, if wij is greater than zero and vj = hi = 1, a lower energy value is obtained, corresponding to an increase in probability. If bj is less than zero and vj = 1, a low probability is achieved due to an increase in energy ( [80] and [81]); this points to an inclination for vj to be equal to zero rather than one. On the other hand, if bj is greater than zero and vj = 1, a high probability is achieved due to a decrease in energy, which points to an inclination for vj to be equal to one rather than zero. The second term in equation (2.14) is thus influenced by the value of bj, with a value less than zero increasing the energy and a value greater than zero decreasing it. The third term of equation (2.14) is influenced by the value of ci in the same way that bj affects the second term. The Gibbs distribution or probability from (2.13) or (2.14) is obtained by ( [80] and [81]):

p(v, h) = e−E(v,h)/Z = e(hTWv+bTv+cTh)/Z = e(hTWv) e(bTv) e(cTh)/Z. (2.15)

In this equation, Z represents an intractable partition function while all the exponential

terms represent factors of a Markov network with vector nodes [54]. The intractable

nature of Z is due to the exponential number of values it can assume. In RBMs, the

intractable partition function is obtained by ( [80] and [81]):

Z = ∑_{v,h} e−E(v,h). (2.16)


Another important trait of an RBM is that h is conditionally independent of v, and vice versa, because no nodes within the same layer are connected. Mathematically, this can be expressed as ( [80] and [81]):

p(h|v) = ∏_{i=1}^{n} p(hi|v), (2.17)

and

p(v|h) = ∏_{j=1}^{m} p(vj|h). (2.18)
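The factorised conditionals (2.17)-(2.18) make sampling one layer given the other cheap; below is a NumPy sketch with random (untrained) parameters and our own variable names:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
m, n = 6, 3                          # visible and hidden units
W = rng.normal(0.0, 0.1, (n, m))     # weights w_ij
b = np.zeros(m)                      # visible biases b_j
c = np.zeros(n)                      # hidden biases c_i

v = rng.integers(0, 2, m).astype(float)   # a binary visible vector

# Because no two units in a layer are connected, each conditional
# factorises (Eqs. 2.17-2.18) and every unit is sampled independently.
p_h = sigmoid(W @ v + c)                  # p(h_i = 1 | v)
h = (rng.uniform(size=n) < p_h).astype(float)
p_v = sigmoid(W.T @ h + b)                # p(v_j = 1 | h)
```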

2.8.2 Contrastive Divergence (CD)

In training an RBM, the goal is to reduce the mean negative log-likelihood or loss by as

much as possible without any form of regularization [54]. This is done by making use of

the stochastic gradient descent algorithm because it can handle high-dimensional datasets

better than others. The loss is expressed as ( [80] and [82]):

loss = (1/T) ∑_{t} −log p(v(t)). (2.19)

This can be achieved by calculating the partial derivative of the loss function with respect

to a parameter, θ, as follows ( [80] and [82]):

∂(−log p(v(t)))/∂θ = Eh[ ∂E(v(t), h)/∂θ | v(t) ] − Ev,h[ ∂E(v, h)/∂θ ]. (2.20)

The first term in (2.20) is the expectation over the distribution of the data and is coined the positive phase; v and h are the same variables used in equations (2.13)-(2.17). The second term, referred to as the negative phase, is the expectation over the distribution of the model. Because an exponential sum over the v and h variables is required, calculating these partial derivatives is intractable [83]. In addition, obtaining unbiased estimates of the log-likelihood gradient normally requires several steps of sampling. It has since been shown, however, that estimates obtained from executing the Markov chain for only a few iterations can suffice during model training. From this emerged the contrastive divergence (CD) method ([80] and [82]). CD can be defined as a technique for training undirected graphical models


of a probabilistic nature. The aim is to avoid the double expectation in the negative phase of equation (2.20) and to focus instead on estimation. CD essentially uses a Monte-Carlo estimate of the expectation over a single input data point [54].

An extension of the CD algorithm is the k-step CD learning technique (CD-k), in which, instead of approximating the second term in equation (2.20) by a sample from the model distribution, a Gibbs chain is executed for only k steps, with k often set to 1. The Gibbs chain is initialized with a training sample v^{(0)} from the training set and yields the sample v^{(k)} after k steps. Each step t consists of sampling h^{(t)} from p(h|v^{(t)}) and subsequently sampling v^{(t+1)} from p(v|h^{(t)}).

The partial derivative of the log-likelihood with respect to θ for a single training sample,

v(0), is approximated by ( [80] and [82]):

CD_k(\theta, v^{(0)}) = -\sum_h p(h|v^{(0)}) \frac{\partial E(v^{(0)}, h)}{\partial \theta} + \sum_h p(h|v^{(k)}) \frac{\partial E(v^{(k)}, h)}{\partial \theta}. (2.21)

Because v^{(k)} is not drawn from the stationary distribution of the model, the estimates from equation (2.21) are biased [54]; as k \rightarrow \infty, the bias vanishes. A further indication of the biased nature of the CD algorithm is that it approximately follows the gradient of the difference between two Kullback-Leibler (KL) divergences ([80] and [82]):

KL(q \| p) - KL(p_k \| p). (2.22)

Here, p_k is the distribution of the visible variables after k steps of the Markov chain, while q is the empirical distribution. If the chain has already reached stationarity, then p_k = p, so KL(p_k \| p) = 0, and with this the error in the CD estimates vanishes. More information on the CD algorithm can be found in [84].
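The CD-k procedure described above can be sketched in a few lines of numpy. This is an illustrative implementation for binary units, not the thesis code; the gradient signs follow equation (2.20) (descent on the negative log-likelihood), with the negative phase approximated by the k-step Gibbs sample:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k(v0, W, b, c, k=1, rng=None):
    """One CD-k gradient estimate for a binary RBM (a sketch of the
    procedure around equation (2.21), not the thesis implementation).

    Runs a Gibbs chain for k steps from the training sample v0 and
    returns approximate gradients of -log p(v0) w.r.t. W, b and c."""
    if rng is None:
        rng = np.random.default_rng(0)
    ph0 = sigmoid(c + W @ v0)          # positive phase: p(h=1 | v0)
    vk = v0
    for _ in range(k):                 # k steps of block Gibbs sampling
        h = (rng.random(c.shape) < sigmoid(c + W @ vk)).astype(float)
        vk = (rng.random(b.shape) < sigmoid(b + W.T @ h)).astype(float)
    phk = sigmoid(c + W @ vk)          # negative phase: p(h=1 | v_k)
    dW = np.outer(phk, vk) - np.outer(ph0, v0)
    db = vk - v0
    dc = phk - ph0
    return dW, db, dc
```

The parameters would then be updated as \theta \leftarrow \theta - \eta \, d\theta for some learning rate \eta.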


2.9 Conclusion

This chapter gives a background summary of the missing data mechanisms and patterns, as well as the classical techniques used for handling missing data. We also discussed modern approaches for handling missing data and their applications, along with the techniques employed to optimize the missing data imputation process. The chapter is important because it shows why understanding missing data matters, how the various methods for handling it have evolved over the years, and why the research in this thesis was conducted.

In this thesis, the methods described in Section 2.7 are used as a frame of reference

to compare the proposed methods against existing techniques, while the techniques de-

scribed in Section 2.8 are used to construct the deep learning regression framework.


3. Novel Ant-based Missing Data Estimators

3.1 Introduction

In this chapter, we present the results obtained from analysing and implementing the novel

ant-based missing data estimators. We begin in Section 3.2 by describing the experimental

design that will be implemented throughout the chapter, followed by Section 3.3 in which

we present information on the optimization algorithms that will be used in the chapter. In

Section 3.4, we present the performance evaluation metrics that will be used in Chapters 3,

4 and 5. Section 3.5 presents the results obtained from analysing the DL-ACO estimator,

while Section 3.6 reports on the findings from the analysis of the DL-ALO estimator.

Section 3.7 presents the key findings from the chapter.


3.2 Experimental Design

3.2.1 Statement of Hypothesis and Research Question

It should be clear at this point that the research done in this work focused on whether missing data entries in a high-dimensional dataset can be estimated effectively. We tried to answer two key questions:

• Is it possible to estimate missing data entries in a high-dimensional dataset efficiently using models comprising a deep auto-encoder network framework with the ant colony optimization and ant-lion optimizer algorithms?

• Is there a relationship between the accuracy of the estimated values and the real

values in the feature variables with missing data?

The answers we expect to these questions, in relation to prior research, are detailed in the research hypotheses.

3.2.2 Hypothesis Testing

3.2.2.1 Hypothesis One

• It is possible to estimate missing data entries in a high-dimensional dataset efficiently using models comprising a deep auto-encoder network framework with the ant colony optimization and ant-lion optimizer algorithms.

3.2.2.2 Hypothesis Two

• It is expected that the level of correlation between the estimated values and the real values will be high or low depending on the nature of the dataset.


Figure 3.1: Data Imputation Configuration.

Figure 3.1 illustrates how the regression model and optimization methods will be used. The dataset used is the Modified National Institute of Standards and Technology (MNIST) handwritten digit recognition dataset [18], which comprises 60,000 training images and 10,000 test images. Each image is a 28 × 28 pixel image, yielding 784 pixel values that represent the image and serve as the input to the model. The data is preprocessed by normalizing each pixel value to the range [0, 1].

Two predominant features of an auto-encoder motivated the choice of this network: (i) its auto-associative nature, and (ii) the butterfly-like structure of the network resulting from the bottleneck in the hidden layers. Auto-encoders are also ideal thanks to their ability to replicate the input data by learning linear and non-linear correlations and covariances present in the input space, projecting the input data into lower dimensions. The only condition required is that the hidden layer(s) have fewer nodes than the input layer, although this depends on the application. Prior to

optimizing the regression model parameters, it is necessary to identify the network structure, which depends on the number of layers, the number of hidden units per hidden layer, the activation functions used, and the number of input and output units. After this, the parameters can be approximated using the training set of data. The parameter approximation procedure was run for a given number of training cycles, with the optimal number obtained by analysing the validation error. The aim of this was to avoid over-fitting the network and to use the fastest training approach without


compromising on accuracy. The optimal number of training cycles was found to be 500.

The training procedure estimated weight parameters such that the network output was

as close as possible to the target output.

Figure 3.2: Stacked Auto-encoder Network Structure.


Figure 3.3: Missing Data Estimator Structure.

The optimization algorithms were used to estimate the missing values by optimizing an objective function that incorporates the trained network. They used values from the population as part of the input to the network, and the network recalled these values, which subsequently formed part of the output. The complete data matrix containing the estimated and observed values was fed into the auto-encoder as input. Some inputs were considered known, with the others unknown and to be estimated using the regression method and the optimization algorithms as described at the beginning of the paragraph. The symbols Ik and Iu in Figures 3.1 and 3.3 represent the known and unknown/missing values, respectively.

Considering that the approach made use of a deep auto-encoder, it was imperative that

the auto-encoder architecture match the output to the input. This trait is expected when

a dataset with familiar correlations recorded in the network is used. The error, δ, is the disparity between the target output and the network output, expressed as [54]:

\delta = \vec{I} - f(\vec{W}, \vec{I}), (3.1)

where \vec{I} and \vec{W} represent the inputs and the weights, respectively.


Equation (3.1) was squared to guarantee that the error is always positive, resulting in the following equation [54]:

\delta = \left(\vec{I} - f(\vec{W}, \vec{I})\right)^2. (3.2)

Because the input and output vectors contain both I_k and I_u, the error function is rewritten as [54]:

\delta = \left( \begin{bmatrix} I_k \\ I_u \end{bmatrix} - f\left( \begin{Bmatrix} I_k \\ I_u \end{Bmatrix}, w \right) \right)^2. (3.3)

Equation (3.3) is the objective function minimized by the optimization algorithms to estimate I_u, with f being the regression model function. The stopping criteria of the optimization algorithms, and therefore of the estimation procedure, were either a maximum of 40,000 function evaluations being attained, or no change in the objective/error value during the estimation procedure. From the above descriptions of how the deep auto-encoder and optimization algorithms were used, the equation below summarizes the function of the proposed approach, with fOA being the optimization algorithm estimation operation and fDAE the function of the deep auto-encoder:

y = f_{DAE}(W, f_{OA}(\vec{I})), (3.4)

where \vec{I} = \begin{bmatrix} \vec{I}_k \\ \vec{I}_u \end{bmatrix} represents the input space of known and unknown features. This

equation represents the model design whereby the complete input vector, with known values and estimated missing data entries obtained by executing the missing data estimation procedure (fOA), is presented to the deep regression model (fDAE) to check whether the network error has been minimized. If the error has been minimized, the output, y, will contain the known input vector values and the optimal missing data estimates. This model design is also used in Chapter 5.
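The interplay between fOA and fDAE in equations (3.3) and (3.4) can be sketched as an objective function handed to the optimizer. The fragment below is a hypothetical illustration, not the thesis implementation; the names and the `autoencoder` callable are assumptions:

```python
import numpy as np

def estimation_objective(candidate_iu, i_known, missing_mask, autoencoder):
    """Objective (3.3): rebuild the full input from the known values Ik
    and the optimizer's candidate missing values Iu, pass it through the
    trained auto-encoder, and return the squared reconstruction error.

    `autoencoder` is any callable mapping an input vector to its
    reconstruction (a stand-in for the trained network f)."""
    x = i_known.copy()
    x[missing_mask] = candidate_iu        # splice Iu into the Ik vector
    return float(np.sum((x - autoencoder(x)) ** 2))

# Hypothetical usage: the swarm optimizer (fOA) minimizes this objective
# over candidate_iu in [0, 1]^d, where d is the number of missing pixels.
```

Whichever candidate vector attains the lowest objective is accepted as the set of missing-value estimates in the output y of (3.4).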


3.3 Optimization Algorithms

3.3.1 Ant Colony Optimization (ACO)

ACO is an algorithm that mimics the innate foraging behavior of ants. In their everyday lives, ants must explore the neighborhood of their nests in search of food [85]. When ants move, they deposit on their trail a substance termed pheromone. This deposit serves two main purposes: it allows ants to navigate their way back to the nest, and it allows other ants to trace and follow the path already taken [85]. ACO has a collection of characterizing traits that can be regarded

as building blocks. These traits are essential and must be specified in every implementation. They include [86]: (i) the method selected to build the solutions, (ii) the heuristic information, (iii) the pheromone update rule, (iv) the probability function and transition rules, (v) the parameter values, and (vi) the stopping criteria [85]. The algorithm considers a colony of m artificial ants collaborating with one another. Before the algorithm executes, each of the links between the solutions is given an initial amount of pheromone, τ0; this value is small, but large enough that the probability of the path to each solution being chosen is not zero. At each iteration, each of the m ants that has constructed a solution updates the pheromone values. The pheromone, τij, associated with

the link between solutions i and j is revised using the following equation ( [86] and [87]):

\tau_{ij} \leftarrow (1-\rho)\,\tau_{ij} + \sum_{k=1}^{m} \Delta\tau_{ij}^{k}, (3.5)

where \rho represents the evaporation rate of the pheromone, m depicts the number of ants, and \Delta\tau_{ij}^{k} is the amount of pheromone deposited on the link between solutions i and j by ant k, such that \Delta\tau_{ij}^{k} = Q/L_k if ant k used the link between solutions i and j, and \Delta\tau_{ij}^{k} = 0 otherwise [87]. Q is a constant, with L_k representing the length of the path created by ant k. In constructing a new solution, ants choose the next solution via a stochastic approach. When ant k is at solution i and has constructed a partial solution, s^p, the probability of then moving to solution j is

P_{ij}^{k} = \frac{\tau_{ij}^{\alpha}\, \eta_{ij}^{\beta}}{\sum_{c_{il} \in N(s^p)} \tau_{il}^{\alpha}\, \eta_{il}^{\beta}} \text{ if } c_{ij} \in N(s^p), \text{ and } P_{ij}^{k} = 0 \text{ otherwise [88].}

N(s^p) represents the collection of appropriate items, that is, links between solution i and solutions l that ant k has not yet tested for fitness towards the task at hand.


The \alpha and \beta parameters govern the relative importance of the pheromone versus the heuristic information, \eta_{ij}, obtained by ([86] and [89]):

\eta_{ij} = \frac{1}{d_{ij}}, (3.6)

where d_{ij} is the distance between solutions i and j. The algorithm has been applied in several papers: in [90] it was used to solve problems in water distribution systems, while in [91] the ACO algorithm was employed to solve a mathematical model constructed to represent a process planning problem. In [92], the ACO algorithm was applied to spatial clustering problems in which no a priori information was considered, and compared against a novel algorithm which was proposed.
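The pheromone update of (3.5) and the stochastic transition rule above can be sketched in a few lines of numpy. This is an illustrative fragment, not the thesis implementation; the boolean `feasible` array plays the role of N(s^p):

```python
import numpy as np

def update_pheromone(tau, rho, deposits):
    """Equation (3.5): evaporate by factor (1 - rho), then add each
    ant's deposit matrix (one entry of `deposits` per ant k)."""
    return (1.0 - rho) * tau + sum(deposits)

def transition_probabilities(tau_i, eta_i, feasible, alpha=1.0, beta=2.0):
    """Probability of an ant at solution i moving to each candidate j,
    following the stochastic rule quoted above: infeasible candidates
    (those outside N(s^p)) get probability zero."""
    weights = np.where(feasible, tau_i ** alpha * eta_i ** beta, 0.0)
    return weights / weights.sum()
```

An ant would then draw its next solution from these probabilities, e.g. with `np.random.default_rng().choice(len(p), p=p)`.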

The ACO parameters used in the model implementation are given in Table 3.1, apart from the number of decision variables, which depends on the number of missing values in a record. The data was normalized to the range [0, 1], meaning the lower and upper bounds of the decision variables are 0 and 1, respectively. These parameters were chosen because, of the different combinations and permutations of values tested, they produced the best outcomes.

Table 3.1: ACO Parameters.

Parameter Value

Maximum Number of Iterations 1000

Population Size 10

Intensification Factor 0.5

Deviation-Distance Ratio 1

Sample Size 40

3.3.2 Ant-Lion Optimizer (ALO)

ALO is a meta-heuristic algorithm that mimics the hunting interaction between ant-lions and their prey, ants [93]. ALO implements five main steps of hunting: the random motion of ants, the construction of traps by the ant-lions, the capturing of ants in the traps, the catching of prey, and the rebuilding of traps. It is also a gradient-free algorithm which has the property of providing strong exploration and exploitation of the


solution space. Exploration is assured by the random selection of ant-lions, as well as by the random motion of ants around them. Exploitation, on the other hand, is assured by the adaptive shrinking of the boundaries of the ant-lion traps. The algorithm is based on three tuples, ALO(A1, A2, A3), which estimate the global optimum of an optimization problem. These three tuples are defined respectively as ([93] and [94]):

\Phi \rightarrow_{A_1} \{G_{Ant}, G_{OA}, G_{Antlions}, G_{OAL}\}, (3.7)

\{G_{Ant}, G_{Antlion}\} \rightarrow_{A_2} \{G_{Ant}, G_{Antlion}\}, (3.8)

and

\{G_{Ant}, G_{Antlion}\} \rightarrow_{A_3} \{true, false\}, (3.9)

where G_{Ant} represents the ants' position matrix, G_{Antlion} comprises the ant-lions' positions, G_{OA} depicts the fitness of the ants, and G_{OAL} contains the fitness values of the ant-lions. The algorithm operates such that the ant-lion and ant matrices are initialized randomly by applying equation (3.7). The roulette wheel operator is used to select the location of each ant relative to the ant-lions. Equation (3.8) is used to update the elite in each iteration. The update of the trap boundaries is primarily described in relation to the current iteration number, and the location is subsequently refined using two random walks around the selected ant-lion and the elite. The fitness function is used to evaluate the points to which each ant randomly walks. If any ant becomes fitter than any of the ant-lions, its location is used in the next iteration as the new location of that ant-lion. The best ant-lion of the iteration is then compared against the best ant-lion obtained so far during the optimization procedure (the elite), and they are substituted in one of the key operations in the implementation of the algorithm. These steps are executed until the function in equation (3.9) returns false.

In the implementation of the algorithm, ants walk randomly according to ( [93] and [94]):

X_a(t) = [0, \text{cumsum}(2l(t_1)-1), \text{cumsum}(2l(t_2)-1), \ldots, \text{cumsum}(2l(t_n)-1)], (3.10)

where n is the maximum number of iterations, cumsum represents the cumulative sum, and t indicates the step of the random walk. l(t) is a stochastic function defined by the relations: l(t) = 1 if rand > 0.5, and l(t) = 0 if rand \leq 0.5, where rand is a random number drawn from a uniform distribution in the range [0, 1]. To restrict the random movement of ants to within the boundaries of the solution space, the walks are normalized according to ( [93]


and [94]):

X_i^t = \frac{(X_i^t - a_i) \times (d_i - c_i^t)}{(d_i^t - a_i)} + c_i, (3.11)

where a_i represents the minimum of the random walk of the ith variable, b_i the maximum of the random walk of the ith variable, c_i^t the minimum of the ith variable at the tth iteration and, finally, d_i^t the maximum of the ith variable at the tth iteration [93].

Modeling of the trapping of ants by ant-lion pits can be obtained by ( [93] and [94]):

c_i^t = Antlion_j^t + c^t, (3.12)

and

d_i^t = Antlion_j^t + d^t, (3.13)

where c^t represents the lower bound of all features at the tth step, d^t depicts the upper bound of all features at the tth step, and Antlion_j^t represents the location of the chosen jth ant-lion at the tth step.

The hunting capability of an ant-lion is described by fitness-proportional roulette wheel selection. The manner in which trapped ants slide down the trap towards the ant-lion is modeled by ( [93] and [94]):

u^t = \frac{u^t}{Z}, (3.14)

and

v^t = \frac{v^t}{Z}, (3.15)

where Z is a ratio calculated by:

Z = 10^w \frac{t}{T}. (3.16)

In the equation above, t is the current step, T represents the upper bound on the number of steps to be taken, and w depicts a constant which depends on the current step according to the following relations: w = 2 if t > 0.1T, w = 3 if t > 0.5T, w = 4 if t > 0.75T, w = 5 if t > 0.9T, and w = 6 if t > 0.95T [93].
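The shrinking ratio of (3.16) with the piecewise schedule for w can be written out directly. The behaviour for t \leq 0.1T is left unspecified in the text, so this sketch assumes no shrinking (Z = 1) in that case:

```python
def shrink_ratio(t, T):
    """Ratio Z = 10^w * t / T from equation (3.16), with w chosen by
    the piecewise schedule quoted above (later thresholds checked
    first, since they dominate)."""
    if t > 0.95 * T:
        w = 6
    elif t > 0.9 * T:
        w = 5
    elif t > 0.75 * T:
        w = 4
    elif t > 0.5 * T:
        w = 3
    elif t > 0.1 * T:
        w = 2
    else:
        # Assumed: no boundary shrinking before 10% of the iterations
        # (this case is unspecified in the text).
        return 1.0
    return 10 ** w * t / T
```

Dividing the bounds u^t and v^t by this growing ratio, as in (3.14) and (3.15), tightens the trap around the ant-lion as the iterations progress.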

The last part of the algorithm is elitism which is done such that the fittest ant-lion


at each step is said to be the elite. This implies that every ant randomly walks around a

selected ant-lion with a location that respects the following equation ( [93] and [94]):

Ant_i^t = \frac{R_A^t + R_E^t}{2}. (3.17)

In equation (3.17), RtA represents the random motion around an ant-lion selected using

the roulette wheel method at the tth step while RtE depicts the random motion around

the elite ant-lion at the tth step.

In [94], the ALO algorithm was used to find the parameters of the primary governor loop of thermal generators for successful Automatic Generation Control (AGC) of a two-area interconnected power system. It was used in [95] to investigate a three-area interconnected power system, while in [96] it was used to train a multilayer perceptron neural network. The authors in [97] used a chaotic ALO algorithm for feature selection from large datasets, while in [98] the ALO algorithm was used to solve the NP-hard combinatorial optimization problem of obtaining an optimal process plan given all alternative manufacturing resources.

The ALO parameters used in the model implementation are the number of search agents, set to 40, and the maximum number of iterations, set to 1000. The number of decision variables depends on the number of missing values in a record. The data was normalized to the range [0, 1], meaning the lower and upper bounds of the decision variables are 0 and 1, respectively. These parameters were chosen because, of the different combinations and permutations of values tested, they produced the best outcomes.

3.4 Performance Evaluation Metrics

The effectiveness of the proposed approaches were determined using the SE, MSE, RM-

SLE, MAE, r and the RPA metrics. Also used were the SNR, GD and COD performance

measures. The mean squared, and root mean squared logarithmic errors as well as the

global deviation yield measures of the difference between the actual and predicted values,

and provide an indication of the capability of the estimation approach.


MSE = \frac{\sum_{i=1}^{n}(I_i - \hat{I}_i)^2}{n}, (3.18)

RMSLE = \sqrt{\frac{\sum_{i=1}^{n}\left(\log(\hat{I}_i + 1) - \log(I_i + 1)\right)^2}{n}}, (3.19)

and

GD = \left(\frac{\sum_{i=1}^{n}(\hat{I}_i - I_i)}{n}\right)^2. (3.20)

The correlation coefficient provides a measure of the similarity between the predicted and actual data. Its value lies in the range [-1, 1], where the absolute value indicates the strength of the relationship and the sign indicates its direction. A value close to 1 (100%) therefore signifies a strong predictive capability, while a value close to -1 (-100%) signifies otherwise. In the equation below, a bar over a variable represents the mean of the data.

r = \frac{\sum_{i=1}^{n}(I_i - \bar{I})(\hat{I}_i - \bar{\hat{I}})}{\left[\sum_{i=1}^{n}(I_i - \bar{I})^2 \, \sum_{i=1}^{n}(\hat{I}_i - \bar{\hat{I}})^2\right]^{1/2}}. (3.21)

The relative prediction accuracy, on the other hand, measures the proportion of estimates made within a specific tolerance, with the tolerance dependent on the sensitivity required by the application. The tolerance was set to 10% as this suited the application domain. This measure is given by:

A = \frac{n_\tau}{n} \times 100. (3.22)

The Squared Error (SE) is a quadratic scoring rule that records the average magnitude of the error, obtained by taking the square root of the mean of the squared errors. It reduces the variance in the error relative to the MSE, hence its application in this work. SE can be obtained using the formula:

SE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(I_i - \hat{I}_i)^2}. (3.23)

The Mean Absolute Error (MAE) measures the average magnitude of the errors in a

dataset without considering direction. Under ideal scenarios, SE values are always greater

than the MAE values, and in case of equality, the error values are said to have the same


magnitude. This error can be calculated using the following equation:

MAE = \frac{1}{n}\sum_{i=1}^{n}|I_i - \hat{I}_i|. (3.24)

The coefficient of determination is a metric regularly applied in statistical analysis tasks

aimed at assessing the performance of a model in the explanation and prediction of future

outputs. It is also referred to as the R-squared statistic, obtained by the following:

COD = \left(\frac{\sum_{i=1}^{n}(I_i - \bar{I})(\hat{I}_i - \bar{\hat{I}})}{\left[\sum_{i=1}^{n}(I_i - \bar{I})^2 \, \sum_{i=1}^{n}(\hat{I}_i - \bar{\hat{I}})^2\right]^{1/2}}\right)^2. (3.25)

The Signal-to-Noise Ratio compares the estimated values against the real values to indicate the level of noise in the estimates. The signal-to-noise ratio used in this work is obtained by:

SNR = \frac{var(I - \hat{I})}{var(\hat{I})}. (3.26)

In equations (3.18)-(3.25), n represents the number of samples, while in equations (3.18)-(3.21) and (3.23)-(3.26), I and \hat{I} represent the real test set values and the estimated missing output values from the modified test set, respectively. In equation (3.22), n_\tau represents the number of correctly estimated outputs within the set tolerance of 10%.
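As a compact reference, the metrics of this section can be sketched together in numpy. This is an illustrative implementation, not the thesis code; in particular, the 10% RPA tolerance is read as relative to the real value, which is an assumption:

```python
import numpy as np

def evaluate(I, I_hat, tol=0.10):
    """Error metrics of Section 3.4 for real values I and estimates
    I_hat (equation numbers noted per line; tol is the RPA tolerance,
    assumed relative to the real value)."""
    n = len(I)
    d = I - I_hat
    r = np.corrcoef(I, I_hat)[0, 1]                          # (3.21)
    return {
        "MSE": np.mean(d ** 2),                              # (3.18)
        "RMSLE": np.sqrt(np.mean(
            (np.log(I_hat + 1) - np.log(I + 1)) ** 2)),      # (3.19)
        "GD": (np.sum(d) / n) ** 2,                          # (3.20)
        "r": r,                                              # (3.21)
        "RPA": np.mean(np.abs(d) <= tol * np.abs(I)) * 100,  # (3.22)
        "SE": np.sqrt(np.mean(d ** 2)),                      # (3.23)
        "MAE": np.mean(np.abs(d)),                           # (3.24)
        "COD": r ** 2,                                       # (3.25)
        "SNR": np.var(d) / np.var(I_hat),                    # (3.26)
    }
```

With perfect estimates, every error metric vanishes while r, COD and RPA attain their maxima, which provides a quick sanity check of the implementation.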


3.5 Deep-Learning-Ant Colony Optimization (DL-ACO) Estimator

Taking into consideration the evaluation metrics from Section 3.4, the performance of the DL-ACO method is evaluated and compared against existing methods (refer to [8] (MLP-GA) and [10] (MLP-GA, MLP-SA and MLP-PSO)) by estimating the missing attributes concurrently, wherever missing data occurs. The scenarios investigated were such that any sample/record could have at least 62 and at most 97 missing attributes (dimensions) to be approximated. The MLP network used has a structure of 784-400-784: 784 nodes in each of the input and output layers, and 400 nodes in the single hidden layer. This number was obtained by testing the network with different numbers of nodes in the hidden layer and observing the structure that led to the lowest possible network error.

Figures 3.4-3.7 show the performance and comparison of DL-ACO with MLP-PSO, MLP-SA and MLP-GA. Figures 3.4 and 3.5 are bar charts showing the MSE and RMSLE values for DL-ACO compared to MLP-PSO, MLP-SA and MLP-GA.

Figure 3.4: Mean Squared Error vs Estimation Approach.


Figure 3.5: Root Mean Squared Logarithmic Error vs Estimation Approach.

We observe MSE values of 0.66%, 5.19%, 31.02% and 30.98%, and RMSLE values of 6.06%, 17.98%, 41.21% and 41.25%, for DL-ACO, MLP-PSO, MLP-SA and MLP-GA, respectively. DL-ACO yielded the lowest MSE value of the four. These results are validated by the correlation coefficient, whose bar chart is given in Figure 3.6.

Figure 3.6: Correlation Coefficient vs Estimation Approach.


DL-ACO and MLP-PSO yielded 96.29% and 74.52% correlation values, respectively, while

MLP-SA and MLP-GA showed correlations of -1.88% and -0.56%, respectively.

Figure 3.7: Global Deviation vs Estimation Approach.

MLP-SA and MLP-GA yielded global deviations of 13.25% and 13.67%, respectively, while DL-ACO and MLP-PSO yielded 0.0078% and 0.98%, respectively, as shown in Figure 3.7. As observed, the DL-ACO approach obtained the best figures for all four metrics presented diagrammatically.

Table 3.2: DL-ACO Mean Squared Error Objective Value Per Sample.

Sample Dimensions DL-ACO MLP-PSO MLP-SA MLP-GA

1 79 2.52 13.62 15.59 15.59

2 88 2.38 7.34 8.78 8.78

3 75 1.27 5.59 6.76 6.76

4 81 0.26 3.69 5.91 5.91

5 83 0.46 6.57 8.02 8.02

6 82 1.27 4.81 9.76 9.76

7 90 1.91 5.66 15.05 15.05

8 79 1.18 7.58 9.54 9.54

9 76 2.59 7.96 9.48 9.48

10 76 2.86 6.52 12.60 12.60

In Table 3.2, the Dimensions column refers to the number of missing values in a sample/record. Tables 3.2 and 3.3 further support the findings from Figures 3.4-3.7, showing that the proposed DL-ACO approach yielded the lowest objective function value in the estimation of the missing values in each sample, as well as the best COD, MAE, SE and SNR values. Considering the RPA metric, the MLP-PSO approach yielded a better value than the proposed approach.

Table 3.3: DL-ACO Additional Metrics.

Method DL-ACO MLP-PSO MLP-SA MLP-GA

COD 92.71 55.53 0.0353 0.0032

MAE 3.37 14.82 47.5 47.7

RPA 53.25 53.58 10.75 10.08

SE 8.12 22.78 55.7 55.66

SNR 7.7 57.16 209.04 208.83

In Table 3.4, we present the results of statistically analysing the estimates obtained by the DL-ACO approach compared against the MLP-PSO, MLP-SA and MLP-GA approaches using the t-test. The t-test null hypothesis (H0) assumes that there is no significant difference in the means of the missing data estimates obtained by the DL-ACO, MLP-PSO, MLP-SA and MLP-GA methods. The alternative hypothesis (HA) indicates that there is a significant difference in these means.

Table 3.4: Statistical Analysis of DL-ACO Results.

Pairs Compared P-Values (95% Confidence Level)

DL-ACO:MLP-PSO 5.31×10⁻¹⁵

DL-ACO:MLP-SA 2×10⁻¹⁶⁷

DL-ACO:MLP-GA 3×10⁻¹⁷³

Table 3.4 reveals that there is a significant difference at a 95% confidence level in the

means of the estimates obtained by DL-ACO when compared to MLP-PSO, MLP-SA

and MLP-GA, yielding p-values of 5.31×10⁻¹⁵, 2×10⁻¹⁶⁷ and 3×10⁻¹⁷³, respectively, when

all three pairs are compared. This therefore indicates that the null hypothesis (H0), which

assumes that there is no significant difference in the means between DL-ACO estimates


and those of the other three methods can be rejected in favor of the alternative hypothesis

(HA) at a 95% confidence level.
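As an illustrative sketch (not the thesis code), a two-sample t-test of this kind can be run with SciPy's `ttest_ind`; the two arrays below are hypothetical stand-ins for the estimate vectors produced by two competing methods:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the missing-value estimates of two methods.
estimates_a = rng.normal(loc=0.50, scale=0.10, size=1000)
estimates_b = rng.normal(loc=0.55, scale=0.12, size=1000)

# Two-sample t-test: H0 says the two means do not differ significantly.
t_stat, p_value = stats.ttest_ind(estimates_a, estimates_b)

# Reject H0 at the 95% confidence level when p < 0.05.
print(f"t = {t_stat:.3f}, p = {p_value:.3e}")
```

With real estimate vectors in place of the synthetic arrays, a p-value below 0.05 supports rejecting H0 in favour of HA, as in Table 3.4.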

Figure 3.8: Top Row: Corrupted Images - Bottom Row: DL-ACO Reconstructed Images.

In the top row of Figure 3.8, we depict 10 images with missing pixel values which are

to be estimated prior to classification tasks being performed by statistical methods. In

the bottom row of the same figure, we show the reconstructed images from using the

DL-ACO approach, while in the top and bottom rows of Figure 3.9, we observe the

reconstructed images when the MLP-PSO and MLP-GA approaches are used, respectively.

The reconstructed images using MLP-PSO and MLP-GA introduce a lot of noise, more

so in the bottom row than in the top row, as opposed to when the DL-ACO approach is

applied. Furthermore, closer inspection reveals that the images are not fully reconstructed

as not all pixel values within the images are estimated correctly.

Figure 3.9: Top Row: MLP-PSO Reconstructed Images - Bottom Row: MLP-GA Reconstructed Images.


3.6 Deep Learning-Ant Lion Optimizer (DL-ALO) Estimator

Figures 3.10-3.13 show the performance and comparison of DL-ALO with MLP-PSO,

MLP-SA and MLP-GA. Figures 3.10 and 3.11 are bar charts that show the MSE and

RMSLE values for DL-ALO when compared to MLP-PSO, MLP-SA and MLP-GA. The

MLP network used has a structure of 784-400-784, with 784 input and output nodes in

the input and output layers, respectively, and, 400 nodes in the one hidden layer. This

number is obtained by testing the network with different numbers of nodes in the hidden

layer and observing the network structure that leads to the lowest possible network error.

Figure 3.10: Mean Squared Error vs Estimation Approach.

We observed MSE values of 1.85%, 4.85%, 32.46% and 29.4%, and RMSLE values of 9.03%, 17.43%, 41.98% and 40.15% for DL-ALO, MLP-PSO, MLP-SA and MLP-GA, respectively. DL-ALO yielded the lowest MSE value when compared to the others. These results are

validated by the correlation coefficient whose bar chart is given in Figure 3.12.

DL-ALO and MLP-PSO yielded 92.49% and 78.62% correlation values, respectively, while

MLP-SA and MLP-GA showed correlations of 4.03% and 8.29%, respectively.
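For reference, the MSE, RMSLE and correlation figures reported in this chapter can be computed as in the following sketch (NumPy only; the two short arrays are hypothetical examples, not thesis data):

```python
import numpy as np

def mse(actual, predicted):
    # Mean squared error between ground truth and estimates.
    return np.mean((actual - predicted) ** 2)

def rmsle(actual, predicted):
    # Root mean squared logarithmic error; inputs assumed non-negative.
    return np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2))

def correlation(actual, predicted):
    # Pearson correlation coefficient between estimates and ground truth.
    return np.corrcoef(actual, predicted)[0, 1]

actual = np.array([0.2, 0.4, 0.6, 0.8])
predicted = np.array([0.25, 0.38, 0.55, 0.82])
print(mse(actual, predicted), rmsle(actual, predicted), correlation(actual, predicted))
```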


Figure 3.11: Root Mean Squared Logarithmic Error vs Estimation Approach.

Figure 3.12: Correlation Coefficient vs Estimation Approach.

MLP-SA and MLP-GA yielded RPA values of 10.92% and 11.42%, respectively, while DL-ALO

and MLP-PSO respectively yielded 81.33% and 54.83%, as shown in Figure 3.13. As

observed, the DL-ALO approach obtained the best figures for all four metrics presented

graphically.


Figure 3.13: Relative Prediction Accuracy vs Estimation Approach.

Table 3.5: DL-ALO Mean Squared Error Objective Value Per Sample.

Sample Dimensions DL-ALO MLP-PSO MLP-SA MLP-GA

1 81 1.29 9.46 6.57 6.57

2 72 1.63 7.39 7.65 7.65

3 85 3.65 7.66 10.57 10.57

4 88 1.13 7.75 6.03 6.03

5 77 2.21 6.28 9.09 9.09

6 89 1.45 6.49 13.55 13.55

7 84 2.70 5.79 6.77 6.77

8 75 1.14 5.30 9.11 9.11

9 71 1.22 5.67 6.31 6.31

10 85 1.67 5.63 16.55 16.55

In Table 3.5, the Dimensions column refers to the number of missing values in a sample/record. Tables 3.5 and 3.6 further support the findings from Figures 3.10-3.13, showing

that the proposed DL-ALO approach yielded the lowest objective function value in the

estimation of missing values in each sample, as well as the best COD, GD, MAE, SE and

SNR values.


Table 3.6: DL-ALO Additional Metrics.

Method DL-ALO MLP-PSO MLP-SA MLP-GA

COD 85.55 61.81 0.16 0.69

GD 0.07 0.87 14.52 12.31

MAE 6.00 14.42 48.71 45.81

SE 13.6 22.01 56.97 54.22

SNR 30.94 51.24 211.47 202.94

In Table 3.7, we present results obtained from statistically analysing the estimates ob-

tained by the DL-ALO approach when compared against the MLP-PSO, MLP-SA and

MLP-GA approaches using the t-test. The t-test null hypothesis (H0) assumes that there

is no significant difference in the means of the missing data estimates obtained by the

DL-ALO, MLP-PSO, MLP-SA and MLP-GA methods. The alternative hypothesis (HA)

however indicates that there is a significant difference in the means of the missing data

estimates obtained by the four methods.

Table 3.7: Statistical Analysis of DL-ALO Results.

Pairs Compared P-Values (95% Confidence Level)

DL-ALO:MLP-PSO 0.46

DL-ALO:MLP-SA 0.34

DL-ALO:MLP-GA 0.16

Table 3.7 indicates that there is no significant difference in the means of the estimates

obtained by DL-ALO when compared to MLP-PSO, MLP-SA and MLP-GA, yielding

p-values of 0.46, 0.34 and 0.16 when DL-ALO is compared to MLP-PSO, MLP-SA and

MLP-GA, respectively, at a 95% confidence level. Therefore, the null hypothesis cannot be rejected.


Figure 3.14: Top Row: Corrupted Images - Bottom Row: DL-ALO Reconstructed Images.

In the top row of Figure 3.14, we depict 10 images with missing pixel values which are

to be estimated prior to classification tasks being performed by statistical methods. In

the bottom row of the same figure, we show the reconstructed images from using the

DL-ALO approach, while in the top and bottom rows of Figure 3.15, we observe the

reconstructed images when the MLP-PSO and MLP-GA approaches are used, respectively.

The reconstructed images using MLP-PSO and MLP-GA introduce a lot of noise, more

so in the bottom row than in the top row, as opposed to when the DL-ALO approach is

applied. Furthermore, closer inspection reveals that the images are not fully reconstructed

as not all pixel values within the images are estimated correctly.

Figure 3.15: Top Row: MLP-PSO Reconstructed Images - Bottom Row: MLP-GA Reconstructed Images.


3.7 Conclusion

In this chapter, novel and effective ant-based high dimensional missing data estimator

models were presented and tested on an image recognition dataset. These were then

compared against existing approaches of a similar nature. The results obtained from the

experiments conducted in this chapter indicate that the proposed models can

be used to approximate missing values in the high-dimensional dataset more accurately

than when existing approaches of the same nature are used. It can also be observed

that the images reconstructed using the proposed models are more likely to be used in

subsequent statistical analysis and classification tasks than those obtained by using the

existing approaches. This is because the existing approaches introduce considerable noise into the images, which could skew the findings from any subsequent

analysis.


4. Novel Flight-based Missing Data Estimators

4.1 Introduction

In this chapter, we present the results obtained from analysing the novel flight-based miss-

ing data estimators. We begin in Section 4.2 by presenting the experimental design that

will be implemented throughout the chapter, followed by Section 4.3 in which we present

information on the optimization algorithms that will be used in the chapter. Section

4.4 presents the results obtained from analysing the DL-CS estimator, while Section 4.5

reports on the findings from the analysis of the DL-BAT estimator. Sections 4.6 and 4.7

show the results from analysing the DL-FA estimator and present the key findings from

the chapter, respectively.


4.2 Experimental Design

4.2.1 Statement of Hypothesis and Research Question

It should be discernible at this point that the research done in this work focussed on whether it is possible to effectively estimate missing data entries in a high-dimensional dataset. We therefore tried to answer two key questions:

• Is it possible to estimate missing data entries in a high-dimensional dataset efficiently

using models comprising a deep auto-encoder network framework with the bat,

cuckoo search and firefly optimization algorithms?

• Is there a relationship between the accuracy of the estimated values and the real

values in the feature variables with missing data?

The responses to these questions which we expect to get in relation to prior research are

detailed in the hypotheses of the research.

4.2.2 Hypothesis Testing

4.2.2.1 Hypothesis One

• It is possible to estimate missing data entries in a high-dimensional dataset efficiently

using models comprising a deep auto-encoder network framework with the bat,

cuckoo search and firefly optimization algorithms.

4.2.2.2 Hypothesis Two

• It is expected that the level of correlation between the estimated values and the real

values be high or low depending on the nature of the dataset.

The dataset used is the Modified National Institute of Standards and Technology (MNIST)

handwritten digit recognition dataset [18]. This dataset comprises 60,000 training images

and 10,000 test images. With each image being a 28 × 28 pixel image, this results in


784 pixel values representing the image, and serving as the input to the model. The

data is preprocessed by normalizing each pixel value to the range [0, 1]. Each of the network layers was put through a pretraining process using restricted Boltzmann

Machines and the contrastive divergence algorithm with the aim being to set the weight

and bias values in a good search space [54]. This resulted in a network structure of size:

784−1000−500−250−30−250−500−1000−784. There are 784 nodes in the input and

output layers, and seven hidden layers with varying number of nodes [54]. This network is

subsequently trained in a supervised learning manner using the stochastic gradient descent

(SGD) algorithm with the objective being to minimize the network error. The network

training procedure is performed using the entire training set of data which is divided into

600 balanced mini-batches each containing 10 examples of each digit class. The weights

and biases are updated after each mini-batch. Missing data is created in the test set in

accordance with the arbitrary pattern as well as the MCAR and MAR mechanisms. The

optimization algorithms are then used to estimate this missing data, and they have as

objective to minimize the cost function of the trained deep network. The error tolerance

is set to 10%. A matrix of the same size as the test set of data is created with values

obtained from a binomial distribution with the required percentage of missingness (10%),

which is then superimposed on the test set to incorporate the intended missing data.
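This masking step can be sketched as follows (a minimal illustration, assuming a NumPy array stands in for the normalized test set; all variable names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for the normalized MNIST test set (rows are images).
test_set = rng.random((10000, 784))

# Binomial mask: each entry is flagged missing (1) with probability 0.10.
missing_rate = 0.10
mask = rng.binomial(n=1, p=missing_rate, size=test_set.shape)

# Superimpose the mask: mark missing entries with NaN for later estimation.
corrupted = np.where(mask == 1, np.nan, test_set)

observed_fraction_missing = np.isnan(corrupted).mean()
print(f"missing fraction = {observed_fraction_missing:.3f}")
```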

From this modified test set of data, 300 samples are selected randomly, with 100 samples

used to test the DL-CS method, another 100 samples used to test the DL-BAT approach,

and the last 100 used to test the DL-FA approach. The procedure described above can be summarized into five consecutive steps:

1) Use the training set of data with complete records to train the individual restricted

Boltzmann Machines by making use of the algorithm described in [99]. The training

procedure starts from the bottom layer. These individual layers are trained for 50

epochs.

2) Create the encoder and decoder parts of the network with tied weights by combining

these RBMs together.

3) Train the deep auto-encoder network obtained in a supervised learning manner by

applying a back-propagation algorithm being the stochastic gradient descent (SGD)

algorithm.

4) Use the trained network as part of the objective function for the optimization algo-

rithms during the missing data estimation procedure. Initially, the known feature


variable values are presented to the objective function, and then the estimated unknown feature variable values are passed into the objective function.

5) The stopping criteria for the missing data estimation procedure using the optimiza-

tion algorithms are either 40,000 function evaluations having been executed, or there

being no change in the objective function value.
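Step 4 can be sketched as below: the trained network defines the objective, the known pixels are clamped, and the optimizer searches only over the missing positions. The toy `autoencoder` here is a hypothetical stand-in for the trained deep auto-encoder, not the actual model:

```python
import numpy as np

def autoencoder(x):
    # Hypothetical stand-in for the trained deep auto-encoder forward pass.
    return np.clip(x, 0.0, 1.0)

def make_objective(record, missing_idx):
    # Build the estimation objective for one record: `record` holds the
    # known values, entries at `missing_idx` are to be estimated, and the
    # optimizer minimizes the reconstruction error of the completed record.
    def objective(candidate):
        full = record.copy()
        full[missing_idx] = candidate          # insert candidate estimates
        reconstructed = autoencoder(full)      # network forward pass
        return np.mean((full - reconstructed) ** 2)
    return objective

record = np.full(784, 0.5)                     # toy record of known pixels
missing_idx = np.arange(78)                    # e.g. 78 missing dimensions
f = make_objective(record, missing_idx)
print(f(np.zeros(78)))                         # objective at one candidate
```

An optimizer such as CS, BAT or FA would then minimize `f` over the 78-dimensional candidate vector, within the [0, 1] decision bounds.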

The MLP network used for comparison against existing approaches has a structure of 784-

400-784, with 784 input and output nodes in the input and output layers, respectively,

and, 400 nodes in the one hidden layer. This number is obtained by testing the network

with different numbers of nodes in the hidden layer and observing the network structure

that leads to the lowest possible network error.

4.3 Optimization Algorithms

4.3.1 Cuckoo Search (CS)

The CS algorithm is a population-based meta-heuristic technique based on the brood

parasitism trait of certain species of Cuckoo birds [100]. Cuckoos are very interesting

birds, not only courtesy of the serene sounds they make, but also due to their aggressive

reproduction strategies. The design of the CS algorithm is simplified by the assumption

of three main rules being: (i) Each cuckoo lays just one egg at a time and it dumps this

egg in a nest randomly, (ii) The best nests with a high quality of eggs will carry over to

the next generation, and, (iii) The number of nests in which cuckoos can dump their eggs

is fixed, and the eggs laid by cuckoos in these nests can be discovered by the host bird

with a probability, pa ∈ [0, 1] [101]. When the cuckoo egg is discovered, the egg is either

thrown out of the nest, or the host bird abandons the nest and builds a new one. The

last rule is further simplified by approximating pa and replacing a fraction of the existing

nests with new nests which have new random solutions [101]. In solving maximization

problems, the fitness of a solution can simply be proportional to the value of the objective

function. To further simplify the implementation of the algorithm, it is assumed that

each egg in a nest represents a solution, and a cuckoo egg represents a new solution. The

objective is to use the new and potentially better solutions/cuckoos to replace the not so


good solutions in the nests. These new solutions are given by [101]:

$x_i^{(t+1)} = x_i^{(t)} + \alpha \oplus \text{Levy}(\lambda)$, (4.1)

where α > 0 is the step size which should be related to the scales of the problem of

interest. Often, α = 1. Equation (4.1) essentially represents the stochastic equation for

a random walk process. In a general sense, a random walk is a Markov chain whose next

location is solely reliant upon the current location (the first term in equation (4.1)) and

the transition probability (the second term in equation (4.1)). The ⊕ symbol represents

entry-wise multiplication. The Levy flight basically provides a random walk with the

random step drawn from a Levy distribution like so:

$\text{Levy} \sim u = t^{-\lambda}$, (4.2)

where $t$ is the step length and $\lambda$ is the exponent of the heavy-tailed continuous probability distribution, whose probability density function satisfies the condition $1 < \lambda \leq 3$.

The Levy distribution has an infinite variance with an infinite mean.
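One common way to draw the Levy-distributed step used in equation (4.1) is Mantegna's algorithm; the sketch below (an illustration, not the thesis implementation) applies one such random-walk update to a toy solution vector:

```python
import math
import numpy as np

rng = np.random.default_rng(7)

def levy_step(size, beta=1.5):
    # Draw a Levy-flight step via Mantegna's algorithm with exponent beta.
    sigma_u = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
               / (math.gamma((1 + beta) / 2) * beta
                  * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0.0, sigma_u, size)
    v = rng.normal(0.0, 1.0, size)
    return u / np.abs(v) ** (1 / beta)

# One cuckoo update x_{t+1} = x_t + alpha (+) Levy(lambda) on a toy solution.
alpha = 0.01
x = rng.random(5)
x_new = np.clip(x + alpha * levy_step(5), 0.0, 1.0)  # keep within [0, 1]
print(x_new)
```

Occasional very large steps from the heavy tail let the search escape local optima, which is the property the CS motion equation exploits.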

The reason for applying the CS algorithm in this dissertation, although it has been used

in several domains, is courtesy of the fact that it has not been investigated in missing

data estimation tasks. Also, the randomization of the motion equation is more efficient

as the step-length is heavy-tailed in addition to there being a low number of parameters

which need tuning, thereby making the algorithm more generic to adapt to a wider range

of optimization problems.

The CS algorithm has been utilized to minimize the generation and emission costs of

a microgrid while satisfying system hourly demands and constraints [102], while in [103],

it was used to establish the parameters of chaotic systems via an improved cuckoo search

algorithm. The authors in [104] presented a new hybrid algorithm comprised of the cuckoo

search algorithm and the Nelder-Mead method with the aim being to solve the integer and

minimax optimization problems.

The CS parameters used in the model implementation are given in Table 4.1 except

for the number of decision variables which depends on the number of missing values in

a record. The data was normalized to being in the range [0, 1], meaning the lower and

upper bounds of the decision variables are 0 and 1, respectively. These parameters were


chosen because they resulted in the more optimal outcomes with different combinations

and permutations of values having been tested.

Table 4.1: CS Parameters.

Parameter Value

Number of Nests 40

Discovery Rate of Eggs (pa) 0.25

Maximum Number of Iterations 1000

4.3.2 Bat Algorithm (BAT)

The bat algorithm is a meta-heuristic swarm intelligence technique based on the echolo-

cation trait of bats. Bats possess incredible capabilities with their ability to hunt for prey

in the dark using sonar and the Doppler effect. This trait was expressed mathematically

in [105] in the form of an optimization algorithm which was tested against existing op-

timization methods on a set of benchmark functions. There are three main rules based

upon which the algorithm is designed, these being: (i) Each bat uses echolocation to

sense distance and also tell the difference between food/prey and obstacles, (ii) Bats fly

randomly with a velocity and position, and a given frequency, varying wavelength and

also varying loudness to search for their prey, and, (iii) Although there are several ways

in which the loudness could vary, the assumption made is that the loudness varies from a

large positive value, to a low constant value. Each bat moves around the solution space

with a specific velocity and at a given position [106]. There always exists a bat towards

which all other bats move, and this constitutes the current best solution. In addition

to these, the bats in using sonar emit sounds with a frequency, wavelength and loud-

ness. These can be adjusted depending on the proximity of a prey or an obstacle. These

properties are expressed mathematically in equations (4.3), (4.4) and (4.5):

$x_i^{t+1} = x_i^t + v_i^{t+1}$, (4.3)

$v_i^{t+1} = v_i^t + (x_i^t - x^*)f_i$, (4.4)

and

$f_i = f_{min} + (f_{max} - f_{min})\nu$, (4.5)

where $x_i^{t+1}$ and $v_i^{t+1}$ are the new position and velocity of the bat at time step $t+1$, $x_i^t$ and $v_i^t$ are the current position and velocity of the bat at time step $t$, $x^*$ is the current best solution, and $\nu \in [0, 1]$ is a random vector drawn from a uniform distribution $U(a, b)$ with $-\infty < a < b < \infty$ and probability density function $f(x) = \frac{1}{b-a}$, such that $a < x < b$. $f_{max}$ and $f_{min}$ are

the maximum and minimum frequencies, respectively. The bat algorithm implemented in

this dissertation adds an element of randomness in the motion of the bats in the form of

a Levy flight. This results in the position equation being:

$x_i^{t+1} = x_i^t + v_i^{t+1} + \text{Levy}(\lambda)$. (4.6)
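A single bat move following equations (4.4)-(4.6) can be sketched as follows (an illustration under stated assumptions, not the thesis implementation; a scaled Cauchy draw stands in for the heavy-tailed Levy term):

```python
import numpy as np

rng = np.random.default_rng(3)

def bat_update(x, v, x_best, f_min=0.0, f_max=2.0):
    # One bat move: frequency (4.5), velocity (4.4), position (4.6).
    nu = rng.random(x.shape)                     # random vector in [0, 1]
    f = f_min + (f_max - f_min) * nu             # frequency, eq. (4.5)
    v_new = v + (x - x_best) * f                 # velocity, eq. (4.4)
    levy = 0.01 * rng.standard_cauchy(x.shape)   # heavy-tailed random step
    x_new = x + v_new + levy                     # position, eq. (4.6)
    return np.clip(x_new, 0.0, 1.0), v_new      # decision bounds are [0, 1]

x = rng.random(4)
v = np.zeros(4)
x_best = rng.random(4)
x_new, v_new = bat_update(x, v, x_best)
print(x_new, v_new)
```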

In [107], the bat algorithm was implemented to solve multi-objective optimization tasks, which comprise most engineering problems, while in [108] it was applied as an approach to solve topology optimization problems. Authors in [109] used the bat algorithm

to optimize mono/multi objective tasks linked to brushless DC wheel motors. Some of

the inherent advantages of the algorithm are that it implements a great balance between

exploitation (by using the loudness and pulse emission rate of bats) and exploration (like

in the standard Particle Swarm Optimization (PSO)), and, [105] reveals that there is a

guarantee to attain a global optimum solution in the search of an optimal point, which

are the reasons for it being selected. The main disadvantage however is that the conver-

gence rate of the optimization process depends on the fine adjustment of the algorithm

parameters.

The BA parameters used in the model implementation are given in Table 4.2 except

for the number of decision variables which depends on the number of missing values in

a record. The data was normalized to being in the range [0, 1], meaning the lower and

upper bounds of the decision variables are 0 and 1, respectively. These parameters were

chosen because they resulted in the more optimal outcomes with different combinations

and permutations of values having been tested.


Table 4.2: BAT Parameters.

Parameter Value

Population Size 40

Loudness 0.25

Pulse Rate 0.5

Number of Generations 1000

4.3.3 Firefly Algorithm (FA)

FA can be defined as a meta-heuristic algorithm that mimics the flashing behavior of

fireflies [110]. It relies on three main assumptions which are: (i) All fireflies are attracted

to all other fireflies by being unisex, (ii) An increase in distance is seen to decrease both

attractiveness and brightness of the fireflies, with attractiveness being proportional to

brightness, and, (iii) The objective function landscape determines the brightness of a

firefly [110]. Considering the fact that the attractiveness of a firefly is proportional to the

brightness, the equation below defines the manner in which the attractiveness varies in

relation to the distance:

$\beta = \beta_0 e^{-\gamma r^2}$. (4.7)

In (4.7), β represents the attractiveness trait, β0 defines the original attractiveness value,

γ represents the absorption coefficient and r defines the distance between fireflies. The

motion equation of a firefly in the direction of a brighter one is defined by:

$x_i^{t+1} = x_i^t + \beta_0 e^{-\gamma r_{ij}^2}\left(x_j^t - x_i^t\right) + \alpha_t \epsilon_i^t$. (4.8)

Here, xi and xj depict positional references of two fireflies, with the second term being

because of the attraction between the fireflies. Different time steps are represented by t

and t + 1. In the third term, α is the randomization parameter that controls the step size,

with ε being a vector of arbitrary numbers drawn from a uniform distribution [54]. If β0

is equal to zero, the motion of fireflies becomes a basic random walk [110]. If γ is equal to

zero, the motion is reduced to an alternative version of the particle swarm optimization

algorithm [110].
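The move defined by equations (4.7) and (4.8) can be sketched as follows (illustrative only; the default parameter values mirror Table 4.3, and the two toy positions are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(11)

def firefly_move(x_i, x_j, beta0=0.2, gamma=1.0, alpha=0.5):
    # Move firefly i toward the brighter firefly j per equation (4.8).
    r2 = np.sum((x_i - x_j) ** 2)                 # squared distance r_ij^2
    beta = beta0 * np.exp(-gamma * r2)            # attractiveness, eq. (4.7)
    eps = rng.uniform(-0.5, 0.5, x_i.shape)       # random perturbation vector
    return x_i + beta * (x_j - x_i) + alpha * eps

x_i = np.array([0.2, 0.8])
x_j = np.array([0.6, 0.4])   # assumed brighter (better objective value)
print(firefly_move(x_i, x_j))
```

With alpha set to 0 the move is a pure attraction step, landing between the two fireflies; the alpha term adds the random-walk component discussed above.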

The FA parameters used in the model implementation are given in Table 4.3 except

for the number of decision variables which depends on the number of missing values in


a record. The data was normalized to being in the range [0, 1], meaning the lower and

upper bounds of the decision variables are 0 and 1, respectively. These parameters were

chosen because they resulted in the more optimal outcomes with different combinations

and permutations of values having been tested.

Table 4.3: FA Parameters.

Parameter Value

Number of Fireflies 40

Randomness (α) 0.5

Attractiveness (β) 0.2

Absorption Coefficient (γ) 1

Number of Iterations 1000


4.4 Deep Learning-Cuckoo Search (DL-CS) Estimator

Taking into consideration the evaluation metrics from Section 3.4, the performance of the

DL-CS method was evaluated and compared against existing methods (refer to [8] (MLP-

GA), [10] (MLP-GA, MLP-SA and MLP-PSO)) by estimating the missing attributes con-

currently, wherever missing data may be ascertained. The scenarios investigated were

such that any sample/record could have at least 62, and at most 97 missing attributes

(dimensions) to be approximated.

Figures 4.1-4.4 show the performance and comparison of DL-CS with MLP-PSO, MLP-

SA and MLP-GA. Figures 4.1 and 4.2 are bar charts that show the MSE and RMSLE

values for DL-CS when compared to MLP-PSO, MLP-SA and MLP-GA.

Figure 4.1: Mean Squared Error vs Estimation Approach.

We observed MSE values of 0.62%, 5.58%, 30.95% and 31.85%, and RMSLE values of 5.89%, 18.55%, 41.23% and 41.71% for DL-CS, MLP-PSO, MLP-SA and MLP-GA, respectively. DL-CS

yielded the lowest MSE value when compared to the others. These results are validated

by the correlation coefficient whose bar chart is given in Figure 4.3.


Figure 4.2: Root Mean Squared Logarithmic Error vs Estimation Approach.

Figure 4.3: Correlation Coefficient vs Estimation Approach.

DL-CS and MLP-PSO yielded 96.19% and 71.57% correlation values, respectively, while

MLP-SA and MLP-GA showed correlations of 1.44% and 1.03%, respectively.


Figure 4.4: Relative Prediction Accuracy vs Estimation Approach.

MLP-SA and MLP-GA yielded RPA values of 9.25% and 11.67%, respectively, while DL-CS and

MLP-PSO respectively yielded 87.92% and 54.58%, as shown in Figure 4.4. As observed,

the DL-CS approach obtained the best figures for all four metrics presented graphically.

Figure 4.5: Top Row: Corrupted Images - Bottom Row: DL-CS Reconstructed Images.


In the top row of Figure 4.5, we depict 10 images with missing pixel values which are to

be estimated prior to classification tasks being performed by statistical methods. In the

bottom row of the same figure, we show the reconstructed images from using the DL-CS

approach, while in the top and bottom rows of Figure 4.6, we observe the reconstructed

images when the MLP-PSO and MLP-GA approaches are used, respectively. The recon-

structed images using MLP-PSO and MLP-GA introduce a lot of noise, more so in the

bottom row than in the top row, as opposed to when the DL-CS approach is applied.

Furthermore, closer inspection reveals that the images are not fully reconstructed as not

all pixel values within the images are estimated correctly.

Figure 4.6: Top Row: MLP-PSO Reconstructed Images - Bottom Row: MLP-GA Reconstructed Images.

Table 4.4: DL-CS Mean Squared Error Objective Value Per Sample.

Sample Dimensions DL-CS MLP-PSO MLP-SA MLP-GA

1 83 2.89 5.72 9.26 9.26

2 75 2.84 8.94 14.22 14.22

3 85 1.29 5.73 6.77 6.77

4 74 3.45 7.72 16.06 16.06

5 66 1.78 6.79 10.33 10.33

6 74 1.10 5.37 9.12 9.12

7 82 3.19 9.31 11.79 11.79

8 77 2.97 10.38 14.64 14.64

9 74 3.51 8.35 8.49 8.49

10 81 1.25 5.67 15.36 15.36


In Table 4.4, the Dimensions column refers to the number of missing values in a sample/record. Tables 4.4 and 4.5 further support the findings from Figures 4.1-4.4, showing

that the proposed DL-CS approach yielded the lowest objective function value in the es-

timation of missing values in each sample, as well as the best COD, GD, MAE, SE and

SNR values.

Table 4.5: DL-CS Additional Metrics.

Method DL-CS MLP-PSO MLP-SA MLP-GA

COD 92.52 51.22 0.02 0.01

GD 0.01 1.23 14.93 15.22

MAE 3.57 15.17 47.63 48.07

SE 7.85 23.61 55.63 56.44

SNR 8.36 60.35 194.62 189.37

In Table 4.6, we present the results from statistically analysing the estimates obtained by the DL-CS approach when compared against the MLP-PSO, MLP-SA and

approaches using the t-test. The t-test null hypothesis (H0) assumes that there is no

significant difference in the means of the missing data estimates obtained by the DL-CS,

MLP-PSO, MLP-SA and MLP-GA methods. The alternative hypothesis (HA) however

indicates that there is a significant difference in the means of the missing data estimates

obtained by the four methods.

Table 4.6: Statistical Analysis of DL-CS Results.

Pairs Compared P-Values (95% Confidence Level)

DL-CS:MLP-PSO 3.7*10−19

DL-CS:MLP-SA 4.6*10−50

DL-CS:MLP-GA 4.6*10−50


Table 4.6 reveals that there is a significant difference, at a 95% confidence level, in the means of the estimates obtained by DL-CS when compared to MLP-PSO, MLP-SA and MLP-GA, with p-values of 3.7 × 10^-19, 4.6 × 10^-50 and 4.6 × 10^-50, respectively. The null hypothesis (H0), which assumes that there is no significant difference in the means between DL-CS and the other three methods, can therefore be rejected in favour of the alternative hypothesis (HA) at a 95% confidence level.

4.5 Deep Learning-Bat Algorithm (DL-BAT) Estimator

In this analysis, the novel DL-BAT approach is compared against existing approaches in the literature (MLP-PSO [10], MLP-SA [10] and MLP-GA [8, 10]). The results are grouped in Figures 4.7-4.10 and Tables 4.7 and 4.8.

The results reveal that the DL-BAT approach outperforms the other approaches. The squared error is given in Figure 4.7: DL-BAT yields a squared error of 9.45%, while MLP-PSO, MLP-SA and MLP-GA obtain error values of 22.61%, 55.61% and 56.04%, respectively.

Figure 4.7: Squared Error vs Estimation Approach.


Figure 4.8: Correlation Coefficient vs Estimation Approach.

Figures 4.8 and 4.9 show the correlation coefficient and relative prediction accuracy (RPA) of the four approaches analysed, including the novel DL-BAT approach. Both confirm the better performance of the DL-BAT approach compared to the others. DL-BAT exhibits a correlation of 95.86% with an RPA of 85%, while MLP-PSO obtains a correlation coefficient of 76.93%, and MLP-SA and MLP-GA obtain correlations of -2.06% and -3.06%, respectively. MLP-PSO yields an RPA of 56.33%, while MLP-SA and MLP-GA produce values of 9.92% and 10.17%, respectively.
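For reference, both metrics can be computed in a few lines. The sketch below assumes a plausible definition of RPA (the share of estimates falling within a relative tolerance of the true values); the thesis' exact definition and tolerance may differ, so the names and the 10% tolerance are illustrative:

```python
import numpy as np

def pearson_corr(y_true, y_pred):
    """Pearson correlation coefficient between true and estimated values (%)."""
    return 100.0 * np.corrcoef(y_true, y_pred)[0, 1]

def relative_prediction_accuracy(y_true, y_pred, tol=0.10):
    """Share of estimates within a relative tolerance of the true value (%).

    The tolerance-based definition is an assumption made for illustration.
    """
    close = np.abs(y_pred - y_true) <= tol * np.abs(y_true)
    return 100.0 * np.mean(close)

# Illustrative normalised values, not the thesis data.
y_true = np.array([0.2, 0.4, 0.6, 0.8, 1.0])
y_pred = np.array([0.21, 0.39, 0.66, 0.81, 0.99])
r = pearson_corr(y_true, y_pred)
rpa = relative_prediction_accuracy(y_true, y_pred)
```

A high correlation paired with a high RPA, as DL-BAT shows, indicates estimates that track the true values both in trend and in absolute closeness.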


Figure 4.9: Relative Prediction Accuracy vs Estimation Approach.

In Figure 4.10, we depict the root mean squared logarithmic error (RMSLE) values obtained from analysing the methods. The DL-BAT approach yields the lowest RMSLE value of 7.11%, while the second-best performer, MLP-PSO, shows a value of 17.65%. The MLP-SA and MLP-GA approaches produce RMSLE values of 41% and 41.32%, respectively.
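The RMSLE metric reported here can be sketched as follows, using the textbook log1p form; how the thesis scales the result into a percentage is not shown in this section, so the values below are purely illustrative:

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error.

    log1p keeps zero-valued entries (e.g. normalised pixels) valid; this is
    the standard definition, which the thesis may scale before reporting.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Illustrative values in [0, 1], as for normalised missing pixels.
actual = np.array([0.10, 0.25, 0.50, 0.75, 0.90])
estimated = np.array([0.12, 0.20, 0.55, 0.70, 0.95])
error = rmsle(actual, estimated)
```

Because the error is taken on log-transformed values, RMSLE penalises relative deviations rather than absolute ones, which suits pixel intensities normalised to [0, 1].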


Figure 4.10: Root Mean Squared Logarithmic Error vs Estimation Approach.

These findings are further supported by the values in Table 4.7, with the DL-BAT system yielding the best COD, GD, MAE, MSE and SNR values.

Table 4.7: DL-BAT Additional Metrics.

Method  DL-BAT  MLP-PSO  MLP-SA  MLP-GA
COD     91.89   59.19    0.04    0.09
GD      0.06    0.8      12.1    12.36
MAE     4.73    14.3     47.74   48.15
MSE     0.89    5.11     30.92   31.4
SNR     8.44    53.47    228.91  230.52

Considering Table 4.8, it is observed that the proposed DL-BAT approach yields the best objective function value per record in the estimation of all missing values within that record.


Table 4.8: DL-BAT Mean Squared Error Objective Value Per Instance.

Sample  Dimensions  DL-BAT  MLP-PSO  MLP-SA  MLP-GA
1       66          5.09    9.05     9.02    9.02
2       69          1.62    5.26     8.60    8.60
3       73          0.32    3.69     4.74    4.74
4       85          3.90    9.22     18.54   18.54
5       83          2.20    7.28     15.83   15.83
6       92          2.74    8.59     8.79    8.79
7       77          3.08    7.56     13.44   13.44
8       82          0.53    6.07     4.46    4.46
9       84          2.29    6.51     17.09   17.09
10      63          0.52    3.21     1.82    1.82

In Table 4.9, we present the results obtained from statistically analysing the estimates obtained by the DL-BAT approach when compared against the MLP-PSO, MLP-SA and MLP-GA approaches using the t-test. The t-test null hypothesis (H0) assumes that there is no significant difference in the means of the missing data estimates obtained by the DL-BAT, MLP-PSO, MLP-SA and MLP-GA methods. The alternative hypothesis (HA) indicates that there is a significant difference in the means of the missing data estimates obtained by the four methods.

Table 4.9: Statistical Analysis of DL-BAT Results.

Pairs Compared    P-Values (95% Confidence Level)
DL-BAT:MLP-PSO    1.2 × 10^-7
DL-BAT:MLP-SA     2.0 × 10^-134
DL-BAT:MLP-GA     9.0 × 10^-137

Table 4.9 reveals that there is a significant difference, at a 95% confidence level, in the means of the estimates obtained by DL-BAT when compared to MLP-PSO, MLP-SA and MLP-GA, with p-values of 1.2 × 10^-7, 2.0 × 10^-134 and 9.0 × 10^-137, respectively. The null hypothesis (H0), which assumes that there is no significant difference in the means between DL-BAT and the other three methods, can therefore be rejected in favour of the alternative hypothesis (HA) at a 95% confidence level.


Figure 4.11: Top Row: Corrupted Images - Bottom Row: DL-BAT Reconstructed Images.

In the top row of Figure 4.11, we depict 10 images with missing pixel values which are to be estimated prior to classification tasks being performed by statistical methods. In the bottom row of the same figure, we show the images reconstructed using the DL-BAT approach, while the top and bottom rows of Figure 4.12 show the images reconstructed using the MLP-PSO and MLP-GA approaches, respectively. The images reconstructed using MLP-PSO and MLP-GA contain considerably more noise, more so in the bottom row than in the top row, than those obtained with the DL-BAT approach. Furthermore, closer inspection reveals that the images are not fully reconstructed, as not all pixel values within the images are estimated correctly.

Figure 4.12: Top Row: MLP-PSO Reconstructed Images - Bottom Row: MLP-GA Reconstructed Images.


4.6 Deep Learning-Firefly Algorithm (DL-FA) Estimator

In this analysis, the novel DL-FA approach is compared against existing approaches in the literature (MLP-PSO [10], MLP-SA [10] and MLP-GA [8, 10]). The results are grouped in Figures 4.13-4.16 and Tables 4.10 and 4.11.

The results reveal that the DL-FA approach outperforms the other approaches. The global deviation (GD) is given in Figure 4.13: DL-FA yields a GD value of 0.27%, while MLP-PSO, MLP-SA and MLP-GA obtain GD values of 0.97%, 12.5% and 13.36%, respectively.

Figure 4.13: Global Deviation vs Estimation Approach.

Figures 4.14 and 4.15 show the mean squared error and root mean squared logarithmic error values of the four approaches analysed, including the novel DL-FA approach. Both confirm the better performance of the DL-FA approach compared to the others. DL-FA exhibits an MSE of 2.24% with an RMSLE of 11.79%, while MLP-PSO obtains an MSE of 5.83%, and MLP-SA and MLP-GA obtain MSE values of 30.81% and 33.27%, respectively. MLP-PSO yields an RMSLE of 18.78%, while MLP-SA and MLP-GA produce values of 40.94% and 42.42%, respectively.


Figure 4.14: Mean Squared Error vs Estimation Approach.

Figure 4.15: Root Mean Squared Logarithmic Error vs Estimation Approach.

Considering Figure 4.16, we observe that the DL-FA approach produces the highest correlation coefficient value of 90.22%, with the second-best value of 74.44% obtained by the MLP-PSO method. MLP-SA and MLP-GA yield correlation coefficient values of 2.67% and -5.18%, respectively.


Figure 4.16: Correlation Coefficient vs Estimation Approach.

These findings are further supported by the values in Table 4.10, with the DL-FA system yielding the best COD, MAE, RPA, SE and SNR values.

Table 4.10: DL-FA Additional Metrics.

Method  DL-FA  MLP-PSO  MLP-SA  MLP-GA
COD     81.41  55.41    0.07    0.27
MAE     10.05  15.83    47.57   49.8
RPA     56.75  51.75    10.75   8.25
SE      14.98  24.15    55.51   57.68
SNR     22.42  61.1     221.36  236.98

With regard to Table 4.11, it is observed that the proposed DL-FA approach yields the best objective function value per record in the estimation of all missing values within that record.


Table 4.11: DL-FA Mean Squared Error Objective Value Per Sample.

Sample  Dimensions  DL-FA  MLP-PSO  MLP-SA  MLP-GA
1       74          1.27   2.44     4.11    4.11
2       74          2.93   7.42     8.56    8.56
3       72          12.34  17.90    20.69   20.69
4       72          1.36   5.58     4.28    4.28
5       73          2.97   4.99     12.23   12.23
6       75          2.86   5.90     13.84   13.84
7       78          5.46   8.76     13.59   13.59
8       78          3.43   11.34    11.34   6.92
9       84          1.74   6.00     6.72    6.72
10      97          5.87   17.65    19.58   19.58

In Table 4.12, we present the results obtained from statistically analysing the estimates obtained by the DL-FA approach when compared against the MLP-PSO, MLP-SA and MLP-GA approaches using the t-test. The t-test null hypothesis (H0) assumes that there is no significant difference in the means of the missing data estimates obtained by the DL-FA, MLP-PSO, MLP-SA and MLP-GA methods. The alternative hypothesis (HA) indicates that there is a significant difference in the means of the missing data estimates obtained by the four methods.

Table 4.12: Statistical Analysis of DL-FA Results.

Pairs Compared   P-Values (95% Confidence Level)
DL-FA:MLP-PSO    4.09 × 10^-5
DL-FA:MLP-SA     2.0 × 10^-132
DL-FA:MLP-GA     1.0 × 10^-140

Table 4.12 reveals that there is a significant difference, at a 95% confidence level, in the means of the estimates obtained by DL-FA when compared to MLP-PSO, MLP-SA and MLP-GA, with p-values of 4.09 × 10^-5, 2.0 × 10^-132 and 1.0 × 10^-140, respectively. The null hypothesis (H0), which assumes that there is no significant difference in the means between DL-FA and the other three methods, can therefore be rejected in favour of the alternative hypothesis (HA) at a 95% confidence level.


Figure 4.17: Top Row: Corrupted Images - Bottom Row: DL-FA Reconstructed Images.

In the top row of Figure 4.17, we depict 10 images with missing pixel values which are to be estimated prior to classification tasks being performed by statistical methods. In the bottom row of the same figure, we show the images reconstructed using the DL-FA approach, while the top and bottom rows of Figure 4.18 show the images reconstructed using the MLP-PSO and MLP-GA approaches, respectively. The images reconstructed using MLP-PSO and MLP-GA contain considerably more noise, more so in the bottom row than in the top row, than those obtained with the DL-FA approach. Furthermore, closer inspection reveals that the images are not fully reconstructed, as not all pixel values within the images are estimated correctly.

Figure 4.18: Top Row: MLP-PSO Reconstructed Images - Bottom Row: MLP-GA Reconstructed Images.


4.7 Conclusion

In this chapter, novel flight-based high-dimensional missing data estimator models were presented, tested on an image recognition dataset, and compared against existing approaches of a similar nature. The results obtained from the experiments conducted in this chapter indicate that the proposed models approximate missing values in the high-dimensional dataset more accurately than existing approaches of the same nature. It can also be observed that the images reconstructed using the proposed models are better suited to subsequent statistical analysis and classification tasks than those obtained using the existing approaches, because the existing approaches introduce a considerable amount of noise into the images, which could skew the findings of any subsequent analysis.


5. Novel Plant-based Missing Data Estimator and Comparative Analysis

5.1 Introduction

In this chapter, we present the results obtained from analysing the novel plant-based

missing data estimator, as well as a comparison of all six methods proposed. We begin

in Section 5.2 by presenting the hypotheses that will be investigated in Section 5.4. In

Section 5.3 we present information on the optimization algorithm that will be used in the

chapter, followed by Section 5.4 in which we present the results obtained from analysing

the DL-IWO estimator. Section 5.5 reports on the findings from the analysis of all six

approaches proposed. Finally, Section 5.6 presents the key findings from the chapter.


5.2 Experimental Design

The missing data estimation framework used in this chapter is similar to that designed in Section 3.2 and implemented throughout Chapter 3.
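Although the framework is defined in Section 3.2, its core idea, treating the missing entries of a record as decision variables and minimising the reconstruction error of a trained deep auto-encoder, can be sketched as follows. The network here is a random stand-in for the trained model, and all names and dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a trained deep auto-encoder: a random encode/decode pair.
# The thesis uses a trained deep network in place of these weights.
W_enc = rng.normal(size=(16, 64)) * 0.1
W_dec = rng.normal(size=(64, 16)) * 0.1

def autoencoder(x):
    """Toy reconstruction function standing in for the trained network."""
    return np.tanh(W_dec @ np.tanh(W_enc @ x))

def estimation_objective(candidate, record, missing_idx):
    """Mean squared reconstruction error for one candidate fill-in.

    `candidate` holds trial values for the missing entries; the optimiser
    (CS, BAT, FA, IWO, ...) searches over this vector.
    """
    x = record.copy()
    x[missing_idx] = candidate
    return float(np.mean((x - autoencoder(x)) ** 2))

record = rng.uniform(size=64)           # normalised sample in [0, 1]
missing_idx = np.array([3, 17, 42])     # positions of the missing values
trial = rng.uniform(size=missing_idx.size)
score = estimation_objective(trial, record, missing_idx)
```

An optimiser then minimises `estimation_objective` over the candidate vector; a well-trained auto-encoder reconstructs plausible records with low error, so the minimiser's argument serves as the missing data estimate.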

5.2.1 Statement of Hypothesis and Research Question

It should be clear at this point that the research in this work focused on whether it is possible to effectively estimate missing data entries in a high-dimensional dataset. We therefore aimed to answer two key questions:

• Is it possible to estimate missing data entries in a high-dimensional dataset efficiently using models comprising a deep auto-encoder network framework and the invasive weed optimization algorithm?

• Is there a relationship between the accuracy of the estimated values and the real values in the feature variables with missing data?

The answers we expect to these questions, in relation to prior research, are detailed in the hypotheses of the research.

5.2.2 Hypothesis Testing

5.2.2.1 Hypothesis One

• It is possible to estimate missing data entries in a high-dimensional dataset efficiently using models comprising a deep auto-encoder network framework and the invasive weed optimization algorithm.

5.2.2.2 Hypothesis Two

• The level of correlation between the estimated values and the real values is expected to be high or low depending on the nature of the dataset.


5.3 Optimization Algorithm

5.3.1 Invasive Weed Optimization (IWO)

In [111], the authors proposed the Invasive Weed Optimization (IWO) algorithm, which mimics the invasive trait of weeds. In general, weeds are defined as plants that grow in an area where they are not wanted. Specific to horticulture, the term weed refers to a plant whose growth properties pose a threat to cultivated plants. Weeds display fascinating traits such as adaptivity and robustness. In the invasive weed optimization algorithm, weeds are replaced by points in the solution space, in which a colony of points grows towards an optimal value [112].

Suppose, for instance, that D represents the dimension of a problem, implying that the dimension of the search space is R^D. Let us further assume that P_init is the initial weed population size and P_max is the upper bound on the population size, such that 1 ≤ P_init ≤ P_max [112]. Also, let W define the set of weeds, with W = {W_1, . . . , W_|W|} [112]. Each weed, W_i ∈ R^D, represents a location in the solution space. Computing the fitness of a weed requires a fitness function of the form F : R^D → R [112].

There are two main parts to the IWO algorithm: initialization and iteration. In the initialization step, the generation counter is set to zero, G = 0. Subsequently, the initial population, W, is generated randomly by creating P_init weeds with uniformly distributed values [112]:

W_i ∼ U(X_min, X_max)^D. (5.1)

X_min and X_max represent the lower and upper bounds of the solution space, respectively, and are problem specific.

In the iteration step, each weed in the current population is regenerated through a number of seeds. S_num defines the number of seeds and is chosen proportionally to the fitness value of the weed under consideration, i.e., it is linearly mapped between the population's worst and best fitness, F_worse and F_best [112]:

S_num = S_min + ((F(W_i) − F_worse) / (F_best − F_worse)) (S_max − S_min). (5.2)

In equation (5.2), S_min and S_max represent the lower and upper bounds on the number of seeds allowed for each weed [112]. All S_num seeds, S_j, are generated near the current weed using a Gaussian distribution with zero mean and varying standard deviation [112]:

S_j = W_i + N(0, σ_G)^D. (5.3)

In equation (5.3), 1 ≤ j ≤ S_num. The standard deviation, σ_G, begins at σ_init and is reduced non-linearly over the entire run to σ_final. The standard deviation for the current generation is calculated using [112]:

σ_G = σ_final + ((N_iter − G)^σ_mod / (N_iter)^σ_mod) (σ_init − σ_final), (5.4)

where N_iter represents the upper bound on the number of generations and σ_mod is the non-linear modulation index. Following these steps, the subsequent population is generated by uniting the current population with all the seeds generated by all the weeds. If the size of the new population exceeds P_max, the population is sorted by fitness value and the P_max best weeds are preserved; the least fit weeds are removed.
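The initialization and iteration steps described above can be condensed into a short sketch. This is an illustrative minimisation variant built around equations (5.1)-(5.4); the population parameters echo Table 5.1, but the iteration count is shortened and the implementation details are assumptions rather than the thesis code:

```python
import numpy as np

def iwo_minimise(fitness, dim, bounds=(0.0, 1.0), p_init=10, p_max=40,
                 s_min=0, s_max=5, sigma_init=0.5, sigma_final=0.001,
                 sigma_mod=2, n_iter=200, seed=0):
    """Minimal Invasive Weed Optimization sketch (minimisation)."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    pop = rng.uniform(lo, hi, size=(p_init, dim))           # eq. (5.1)
    for g in range(n_iter):
        fit = np.array([fitness(w) for w in pop])
        f_best, f_worse = fit.min(), fit.max()
        # Non-linear decay of the seed spread, eq. (5.4).
        sigma = sigma_final + ((n_iter - g) ** sigma_mod / n_iter ** sigma_mod) \
            * (sigma_init - sigma_final)
        seeds = []
        for w, f in zip(pop, fit):
            # Better (lower) fitness yields more seeds, eq. (5.2).
            frac = (f_worse - f) / (f_worse - f_best + 1e-12)
            s_num = int(round(s_min + frac * (s_max - s_min)))
            for _ in range(s_num):                          # eq. (5.3)
                seeds.append(np.clip(w + rng.normal(0, sigma, dim), lo, hi))
        if seeds:
            pop = np.vstack([pop] + seeds)
        # Competitive exclusion: keep only the p_max fittest weeds.
        fit = np.array([fitness(w) for w in pop])
        pop = pop[np.argsort(fit)[:p_max]]
    return pop[0], fitness(pop[0])

# Toy objective: squared distance from 0.5 in every dimension.
best, best_fit = iwo_minimise(lambda w: float(np.sum((w - 0.5) ** 2)), dim=5)
```

The wide initial spread encourages exploration while the decaying σ_G concentrates later seeds around the fittest weeds, which is the exploration-to-exploitation transition the algorithm relies on.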

This algorithm has been used to investigate the time-cost-quality trade-off in projects by the authors in [113]. Also, the authors in [114] developed a novel receiver that merges constant modulus approach (CMA) blind adaptive multiuser detection with IWO for multi-carrier code division multiple access (MC-CDMA). In [115], the service selection problem is modeled as a non-linear constrained optimization problem and solved using a discrete version of the IWO algorithm. Unconstrained and constrained optimization problems are solved using a hybrid IWO and firefly algorithm in [116]. Finally, in [117], the problem of minimizing the total weighted tardiness and earliness criteria on a single machine is considered.

The IWO parameters used in the model implementation are given in Table 5.1, except for the number of dimensions, which depends on the number of missing values in a record. The data was normalized to the range [0, 1], meaning the lower and upper bounds of the decision variables are 0 and 1, respectively. These parameters were chosen because they produced the best outcomes among the different combinations and permutations of values tested.

Table 5.1: IWO Parameters.

Parameter                     Value
Initial Population Size       10
Maximum Population Size       40
Minimum Number of Seeds       0
Maximum Number of Seeds       5
Variance Reduction Exponent   2
Maximum Number of Iterations  1000

5.4 Deep Learning-Invasive Weed Optimization (DL-IWO) Estimator

In this analysis, the novel DL-IWO approach is compared against existing approaches in the literature (MLP-PSO [10], MLP-SA [10] and MLP-GA [8, 10]). The results are grouped in Figures 5.1-5.4 and Tables 5.2 and 5.3. The results reveal that the DL-IWO approach outperforms all the other approaches. The mean squared error is given in Figure 5.1: DL-IWO yields an error of 0.45%, while MLP-PSO, MLP-SA and MLP-GA yield 5.53%, 31.86% and 32.61% error, respectively.

Figure 5.2 depicts the root mean squared logarithmic error for the approaches analysed, including the novel DL-IWO approach. It confirms the better performance of the DL-IWO approach compared to the other approaches. DL-IWO exhibits an RMSLE of 5.11%, while MLP-PSO obtains 18.58%. MLP-SA and MLP-GA show RMSLE values of 41.77% and 42.16%, respectively.


Figure 5.1: Mean Squared Error vs Estimation Approach.

Figure 5.2: Root Mean Squared Logarithmic Error vs Estimation Approach.

In Figure 5.3, the RPA values of the approaches are shown, while Figure 5.4 depicts their correlation coefficient values. In Figure 5.3, DL-IWO yields the best RPA value of 88.25%, compared to MLP-PSO, MLP-SA and MLP-GA with values of 54.42%, 8.75% and 10.25%, respectively.


Figure 5.3: Relative Prediction Accuracy vs Estimation Approach.

Furthermore, in Figure 5.4, the DL-IWO approach shows the best correlation coefficient value of 97.67%, while MLP-SA and MLP-GA show -3.7% and 0.28% correlation between the estimates and the real values. The MLP-PSO approach is the second-best performer, with a value of 73.65%.

Figure 5.4: Correlation Coefficient vs Estimation Approach.


Table 5.2: DL-IWO Mean Squared Error Objective Value Per Sample.

Sample  Dimensions  DL-IWO  MLP-PSO  MLP-SA  MLP-GA
1       76          2.67    11.48    9.04    9.04
2       89          0.69    6.01     5.81    5.81
3       82          4.74    11.26    14.80   14.80
4       86          1.21    8.42     6.19    6.19
5       82          2.59    7.03     7.70    7.70
6       62          1.02    4.65     8.71    8.71
7       88          1.69    7.63     12.61   12.61
8       71          4.89    12.21    12.09   12.09
9       79          1.01    4.24     8.14    8.14
10      84          1.34    7.06     13.36   13.36

In Table 5.2, the Dimensions column refers to the number of missing values in a sample/record. Tables 5.2 and 5.3 further support the findings from Figures 5.1-5.4, showing that the proposed DL-IWO approach yielded the lowest objective function values in the estimation of missing values in each sample, as well as the best COD, GD, MAE, SE and SNR values.

Table 5.3: DL-IWO Additional Metrics.

Method  DL-IWO  MLP-PSO  MLP-SA  MLP-GA
COD     95.4    54.24    0.14    0.08
GD      0.02    1.1      13.89   14.83
MAE     3.31    15.2     48.72   48.91
SE      6.67    23.52    56.44   57.1
SNR     5.16    59.76    219.89  205.01

In Table 5.4, we present the results obtained from statistically analysing the estimates obtained by the DL-IWO approach when compared against the MLP-PSO, MLP-SA and MLP-GA approaches using the t-test. The t-test null hypothesis (H0) assumes that there is no significant difference in the means of the missing data estimates obtained by the DL-IWO, MLP-PSO, MLP-SA and MLP-GA methods. The alternative hypothesis (HA) indicates that there is a significant difference in the means of the missing data estimates obtained by the four methods.


Table 5.4: Statistical Analysis of DL-IWO Model Results.

Pairs Compared    P-Values (95% Confidence Level)
DL-IWO:MLP-PSO    2.51 × 10^-15
DL-IWO:MLP-SA     2.0 × 10^-174
DL-IWO:MLP-GA     4.0 × 10^-180

Table 5.4 reveals that there is a significant difference, at a 95% confidence level, in the means of the estimates obtained by DL-IWO when compared to MLP-PSO, MLP-SA and MLP-GA, with p-values of 2.51 × 10^-15, 2.0 × 10^-174 and 4.0 × 10^-180, respectively. The null hypothesis (H0), which assumes that there is no significant difference in the means between DL-IWO and the other three methods, can therefore be rejected in favour of the alternative hypothesis (HA) at a 95% confidence level.

Figure 5.5: Top Row: Corrupted Images - Bottom Row: DL-IWO Reconstructed Images.

In the top row of Figure 5.5, we depict 10 images with missing pixel values which are to be estimated prior to classification tasks being performed by statistical methods. In the bottom row of the same figure, we show the images reconstructed using the DL-IWO approach, while the top and bottom rows of Figure 5.6 show the images reconstructed using the MLP-PSO and MLP-GA approaches, respectively. The images reconstructed using MLP-PSO and MLP-GA contain considerably more noise, more so in the bottom row than in the top row, than those obtained with the DL-IWO approach. Furthermore, closer inspection reveals that the images are not fully reconstructed, as not all pixel values within the images are estimated correctly.


Figure 5.6: Top Row: MLP-PSO Reconstructed Images - Bottom Row: MLP-GA Reconstructed Images.

5.5 Comparative Analysis of Proposed Approaches

In this section, we present the findings from comparing all six proposed methods against each other to identify which performs best on the dataset. Statistical t-tests are performed to support the findings from this experiment. The results obtained are grouped in Figures 5.7-5.10 and Tables 5.5 and 5.6.

Figure 5.7: Squared Error vs Estimation Approach.

Figures 5.7 and 5.8 depict the squared errors and mean absolute errors for all six novel approaches proposed and analysed. Both confirm the better performance of the DL-ACO approach compared to the other approaches. DL-ACO exhibits a squared error of 7.94% with an MAE of 3.26%, while DL-ALO, DL-BAT, DL-CS, DL-FA and DL-IWO obtain squared errors of 11.97%, 8.24%, 8.17%, 13.93% and 8.26%, respectively. Based on these squared error values, we note that the order of performance is: DL-ACO→DL-CS→DL-BAT→DL-IWO→DL-ALO→DL-FA.

Figure 5.8: Mean Absolute Error vs Estimation Approach.

The DL-ALO, DL-BAT, DL-CS, DL-FA and DL-IWO approaches yield MAE values of 5.07%, 4.37%, 3.83%, 9.3% and 3.72%, respectively. Based on these MAE values, the order of performance is: DL-ACO→DL-IWO→DL-CS→DL-BAT→DL-ALO→DL-FA.


Figure 5.9: Root Mean Squared Logarithmic Error vs Estimation Approach.

In Figure 5.9, the RMSLE values of the approaches are shown, while in Figure 5.10, we observe their RPA values. Considering Figure 5.9, the approach yielding the lowest RMSLE value is DL-ACO, with 5.85%. The second-best performer is DL-CS, with 6.06%. The other values obtained are 8.2%, 6.24%, 11.21% and 6.14% for the DL-ALO, DL-BAT, DL-FA and DL-IWO approaches, respectively. This reveals a performance order of: DL-ACO→DL-CS→DL-IWO→DL-BAT→DL-ALO→DL-FA.


Figure 5.10: Relative Prediction Accuracy vs Estimation Approach.

With regards to Figure 5.10, the order of performance of the approaches is: DL-ACO→DL-

CS→DL-IWO→DL-BAT→DL-ALO→DL-FA. This ordering is based on the approaches

yielding values of 87.21%, 86.9%, 86.73%, 85.75%, 83.03% and 59.15%, respectively.

These findings are further supported by the results in Table 5.5. The DL-ACO approach yields the best values for COD, GD, MSE, r and SNR. Considering the COD metric, the order of performance is: DL-ACO→DL-BAT→DL-CS→DL-IWO→DL-ALO→DL-FA. The order changes for the GD metric: DL-ACO→DL-IWO→DL-CS→DL-ALO→DL-BAT→DL-FA. In terms of the MSE metric, the ordering is: DL-ACO→DL-CS→DL-IWO/DL-BAT→DL-ALO→DL-FA, noting that DL-IWO and DL-BAT perform on par. The correlation coefficient and SNR values both reveal an ordering of: DL-ACO→DL-BAT→DL-CS→DL-IWO→DL-ALO→DL-FA. Based on these orderings, the DL-ACO approach performs best, while the approach at the other end of the scale is consistently DL-FA.
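The correlation coefficient and COD figures in Table 5.5 can be computed as in the sketch below. Note that the exact COD definition used in the thesis is not restated here; taking COD as the square of r is an assumption, although it is consistent with Table 5.5, where each COD value is (to within rounding) the square of the corresponding r value (e.g. 0.9668² ≈ 0.9347 for DL-ACO). The sample vectors are illustrative.

```python
def pearson_r(x, y):
    """Sample Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def cod(x, y):
    """Coefficient of determination, here assumed to be the square of r."""
    return pearson_r(x, y) ** 2

# Illustrative values only -- not data from the experiments.
targets = [0.2, 0.4, 0.6, 0.8]
estimates = [0.25, 0.35, 0.65, 0.75]
print(pearson_r(targets, estimates), cod(targets, estimates))
```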

Table 5.5: Model Additional Metrics.

Method  DL-ACO  DL-ALO  DL-BAT  DL-CS  DL-FA  DL-IWO
COD     93.47   87.01   93.33   93.2   82.74  93.02
GD      0.0079  0.035   0.044   0.019  0.28   0.017
MSE     0.63    1.43    0.68    0.67   1.94   0.68
r       96.68   93.28   96.61   96.54  90.96  96.44
SNR     6.95    22.72   7.21    7.4    22.84  7.55

Considering Table 5.6, the DL-BAT approach yielded the lowest objective function values in the estimation of missing data within a single sample, across all samples shown in the table. The Dimensions column refers to the number of missing values within that sample. The DL-ACO approach yields the second best objective function values across all samples, while DL-FA yields the highest, further supporting the findings from Table 5.5 and Figures 5.7-5.10.

Table 5.6: Model Mean Squared Error Objective Values Per Sample.

Sample  Dimensions  DL-ACO  DL-ALO  DL-BAT  DL-CS  DL-FA  DL-IWO
1       65          0.83    1.37    0.80    0.94   2.48   0.88
2       94          2.25    2.29    2.15    2.27   3.17   2.26
3       77          2.68    2.85    2.49    2.69   3.45   2.69
4       64          1.80    2.26    1.71    1.87   2.98   1.83
5       77          2.62    3.13    2.47    2.64   3.87   2.64
6       77          1.02    2.05    0.91    1.09   2.73   1.11
7       79          2.18    2.48    2.06    2.20   3.31   2.20
8       75          1.59    1.91    1.37    1.64   2.85   1.73
9       74          1.24    1.29    1.17    1.24   1.93   1.25
10      62          3.17    3.56    2.86    3.24   4.49   3.22


In Table 5.7, we present the results obtained from statistically analysing the estimates produced by the DL-ACO, DL-ALO, DL-BAT, DL-CS, DL-FA and DL-IWO approaches using the t-test. The t-test null hypothesis (H0) assumes that there is no significant difference in the means of the missing data estimates obtained by any pair of the six methods. The alternative hypothesis (HA), however, indicates that there is a significant difference in the means of the missing data estimates obtained by the six methods.

Table 5.7: Statistical Analysis of Model Results.

Pairs Compared   P-Values (95% Confidence Level)
DL-ACO:DL-ALO    1.37×10^-10
DL-ACO:DL-BAT    0.01
DL-ACO:DL-CS     0.29
DL-ACO:DL-FA     2.0×10^-22
DL-ACO:DL-IWO    0.37
DL-ALO:DL-BAT    9.88×10^-20
DL-ALO:DL-CS     6.43×10^-14
DL-ALO:DL-FA     2.0×10^-67
DL-ALO:DL-IWO    2.64×10^-13
DL-BAT:DL-CS     0.14
DL-BAT:DL-FA     1.0×10^-12
DL-BAT:DL-IWO    0.10
DL-CS:DL-FA      3.44×10^-18
DL-CS:DL-IWO     0.87
DL-FA:DL-IWO     9.06×10^-19

Table 5.7 reveals that there is a significant difference at a 95% confidence level in the means of the estimates obtained by DL-ACO when compared to DL-ALO, DL-BAT and DL-FA, with p-values of 1.37×10^-10, 0.01 and 2.0×10^-22, respectively. This indicates that the null hypothesis (H0), which assumes that there is no significant difference in the means of the estimates between DL-ACO and the DL-ALO, DL-BAT and DL-FA methods, can be rejected in favour of the alternative hypothesis (HA) at a 95% confidence level. It also indicates that there is no significant difference in the means of the estimates obtained by the DL-ACO approach when compared against the DL-CS and DL-IWO approaches, as evidenced by the p-values of 0.29 and 0.37, respectively.
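A pairwise comparison of this kind can be sketched as follows. The specific t-test variant used in the thesis is not restated here, so the Welch unequal-variance form and the normal-tail approximation to the two-sided p-value below (adequate for large samples) are assumptions, and the estimate vectors are illustrative.

```python
import math

def welch_t_test(a, b):
    """Welch's two-sample t statistic, with a normal approximation to the
    two-sided p-value (reasonable when both samples are large)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # sample variance of a
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)  # sample variance of b
    t = (ma - mb) / math.sqrt(va / na + vb / nb)
    p = math.erfc(abs(t) / math.sqrt(2))  # two-sided normal tail probability
    return t, p

# Illustrative estimate vectors, not the thesis data.
est_a = [0.50, 0.52, 0.48, 0.51, 0.49] * 20
est_b = [0.60, 0.62, 0.58, 0.61, 0.59] * 20
t, p = welch_t_test(est_a, est_b)
print(p < 0.05)  # True means H0 is rejected at the 95% confidence level
```

Comparing the p-value against 0.05 reproduces the accept/reject decisions read off Table 5.7.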


Table 5.7 further reveals that there is a significant difference at a 95% confidence level in the means of the estimates obtained by DL-ALO when compared to DL-BAT, DL-CS, DL-FA and DL-IWO, with p-values of 9.88×10^-20, 6.43×10^-14, 2.0×10^-67 and 2.64×10^-13, respectively. The null hypothesis (H0), which assumes that there is no significant difference in the means of the estimates between DL-ALO and the DL-BAT, DL-CS, DL-FA and DL-IWO methods, can therefore be rejected in favour of the alternative hypothesis (HA) at a 95% confidence level.

In addition, at a 95% confidence level, there is a significant difference in the means of the estimates obtained by the DL-BAT approach when compared to the DL-FA approach, as indicated by the p-value of 1.0×10^-12. When the DL-BAT approach is compared to the DL-CS and DL-IWO approaches, however, the p-values obtained point in favour of accepting the null hypothesis, which states that there is no significant difference in the means of the estimates at a 95% confidence level. These p-values are 0.14 and 0.10 for the comparisons with DL-CS and DL-IWO, respectively.

Moreover, Table 5.7 reveals that there is a significant difference at a 95% confidence level in the means of the estimates obtained by DL-CS when compared to DL-FA, with a p-value of 3.44×10^-18. The null hypothesis (H0), which assumes that there is no significant difference in the means of the estimates between the DL-CS and DL-FA methods, can therefore be rejected in favour of the alternative hypothesis (HA) at a 95% confidence level. The table also indicates that there is no significant difference in the means of the estimates obtained by the DL-CS approach when compared against the DL-IWO approach, as evidenced by the p-value of 0.87.

Finally, Table 5.7 reveals that there is a significant difference at a 95% confidence level in the means of the estimates obtained by DL-FA when compared to DL-IWO, with a p-value of 9.06×10^-19. The null hypothesis (H0), which assumes that there is no significant difference in the means of the estimates between the DL-FA and DL-IWO methods, can therefore be rejected in favour of the alternative hypothesis (HA) at a 95% confidence level.


5.6 Conclusion

Firstly, in this chapter, a novel plant-based high-dimensional missing data estimator model was presented and tested on an image recognition dataset. This model was then compared against existing approaches of a similar nature. The results obtained from the experiment indicate that the proposed model can approximate missing values in the high-dimensional dataset more accurately than the existing approaches of the same nature. It is also observed that the image reconstructed using the proposed model is better suited to subsequent statistical analysis and classification tasks than those obtained using the existing approaches. This is because the existing approaches introduce considerable noise into the images, which could skew the findings of any subsequent analysis.

Secondly, we presented in this chapter a comparative analysis of the proposed models to identify the best performer, as well as the model at the other end of the performance scale. The method which consistently performs best is the DL-ACO approach, while the method which consistently yields the worst performance metric values is the DL-FA method. The statistical t-test further reveals that the DL-FA approach yields estimates which are significantly different from those of the other five methods at a 95% confidence level, resulting in p-values close to zero when these are compared in pairs, as seen in Table 5.7. Only when the per-sample objective function values in Table 5.6 are considered does the DL-ACO approach not yield the best values; rather, it is the DL-BAT approach that results in the lowest values in this scenario.


6. Concluding Remarks and Future Research

In this chapter, we begin by presenting a summary of the research and discussing the findings from the experiments conducted. We then suggest avenues for further investigation, examine the contributions of this dissertation, and present final conclusions.

6.1 Concluding Remarks

6.1.1 Research Summary

The research performed in this dissertation assesses the efficiency of using a deep auto-encoder neural network in combination with the Ant Colony Optimization, Ant-Lion Optimizer, Bat, Firefly, Cuckoo Search and Invasive Weed Optimization methods to perform missing data estimation tasks on a high-dimensional dataset. In doing so, we aimed to address the following objectives:

• To demonstrate the ineffectiveness of existing approaches in estimating missing data entries in a high-dimensional dataset.

• To demonstrate the effectiveness of novel methods in estimating missing data entries in a high-dimensional dataset, by proposing models consisting of a deep auto-encoder neural network combined with the Ant Colony Optimization, Ant-Lion Optimizer, Bat, Firefly, Cuckoo Search and Invasive Weed Optimization algorithms.


• To assess, evaluate, and compare the accuracy of the results obtained using the individually proposed models.

To address these objectives, an image recognition dataset was used. Six models, each comprising a deep auto-associative neural network paired with one of six optimization algorithms, namely the Ant Colony Optimization, Ant-Lion Optimizer, Bat, Firefly, Cuckoo Search and Invasive Weed Optimization algorithms, were proposed to perform the task. An auto-encoder is a neural network capable of reproducing its inputs as outputs. Restricted Boltzmann Machines (RBMs) trained in an unsupervised manner using the contrastive divergence approach were used to initialize the weights of the deep auto-associative neural network in a good solution space. The trained RBMs were concatenated to form the encoder part of the network, and then transposed to form the decoder part. This encoder-decoder network was trained in a supervised manner using the stochastic gradient descent algorithm. During training, an error function was derived, expressed as the square of the disparity between the estimated missing data entries and the real values. This error function was further decomposed to incorporate both the known input vector values and the missing components of the input vector. Subsequently, the optimization algorithms were used to estimate the missing values in the input vector, with the objective being to minimize this loss, which incorporates the trained network. The models created in this manner were implemented and compared against existing approaches to demonstrate the ineffectiveness of the latter and the effectiveness of the former. The models were then compared against each other to establish an order of performance.
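The estimation setup described above can be sketched abstractly as follows. The "trained network", the mask of known entries, and the random-search loop below are all simplified stand-ins (the thesis uses a trained deep auto-encoder and swarm-intelligence optimizers such as ACO and CS); only the shape of the objective, the squared reconstruction error of an input vector whose missing components are supplied by the optimizer, follows the text.

```python
import random

def make_objective(network, x_known, missing_idx):
    """Build the missing-data objective: candidate missing values are placed
    into the input vector, the vector is passed through the trained network,
    and the squared reconstruction error is returned."""
    def objective(candidate):
        x = list(x_known)
        for i, v in zip(missing_idx, candidate):
            x[i] = v
        y = network(x)  # reconstruction of the full input vector
        return sum((a - b) ** 2 for a, b in zip(x, y))
    return objective

# Stand-in "trained network": a smoothing map, not a real auto-encoder.
def toy_network(x):
    m = sum(x) / len(x)
    return [0.5 * v + 0.5 * m for v in x]

x_known = [0.2, 0.8, 0.0, 0.6]  # index 2 is missing (0.0 is a placeholder)
obj = make_objective(toy_network, x_known, missing_idx=[2])

# Simplified random search standing in for the swarm-intelligence optimizers.
random.seed(0)
best, best_val = None, float("inf")
for _ in range(2000):
    cand = [random.uniform(0.0, 1.0)]
    val = obj(cand)
    if val < best_val:
        best, best_val = cand, val
print(best, best_val)
```

The swarm algorithms differ only in how candidate vectors are proposed; the objective they minimize has this known/missing decomposition in all six models.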

6.1.2 Results Summary and Discussions

An in-depth analysis of the results obtained from the experiments indicates that the Deep Learning-Ant Colony Optimization (DL-ACO) method performed better than the Deep Learning-Ant Lion Optimizer (DL-ALO), Deep Learning-Bat Algorithm (DL-BAT), Deep Learning-Cuckoo Search (DL-CS), Deep Learning-Firefly Algorithm (DL-FA) and Deep Learning-Invasive Weed Optimization (DL-IWO) methods.

More precisely, analysis of the Squared Error, Root Mean Squared Logarithmic Error, Mean Absolute Error, Mean Squared Error, Global Deviation, Signal-to-Noise Ratio, Correlation Coefficient, Relative Prediction Accuracy and Coefficient of Determination values from the dataset revealed that, on average, the DL-ACO method performed better than the DL-ALO, DL-BAT, DL-CS, DL-FA and DL-IWO methods. The differences in performance between the DL-BAT, DL-CS and DL-IWO methods on this dataset were minimal and insignificant when the performance metrics were considered.

The statistical analysis results further validate these findings through the p-values obtained. However, when the performances of these three methods were compared against those of the DL-ALO and DL-FA methods, the differences were significant, as shown in Chapter 5 and backed by the statistical analysis results. Overall, the results point to an ordering of the methods in terms of performance as follows: (i) DL-ACO, (ii) DL-BAT, (iii) DL-CS, (iv) DL-IWO, (v) DL-ALO, and (vi) DL-FA. The predominant advantage of the proposed models lies in their use of a deep learning framework that is better at extracting the correlations and interrelationships that exist between feature variables in the high-dimensional dataset. This ensures that the reconstruction of input vector values at the output layer is quite accurate, a property that is imperative in the estimation of missing data values.

6.2 Avenues for Future Research

The findings from the work done in this dissertation are encouraging and dependable; however, there is room for improvement, either by improving the performance of the proposed models or by applying them to different application domains or to varying types of datasets.

6.2.1 Apply Alternative Machine Learning Techniques

The research used a deep auto-associative neural network to learn the correlations and interrelationships that exist between feature variables within the dataset. Although an auto-encoder was chosen as the learning model due to its advantages over other methods in missing data estimation tasks, in addition to having yielded trustworthy outcomes, it is worth investigating the possibility of building new models using other deep learning algorithms. Instead of a deep auto-encoder network, one could implement, for instance, a Convolutional Neural Network or a Deep Belief Network as the learning model in the hybrid system(s).

6.2.2 Apply Different Optimization Techniques

The optimization algorithms used in the research were the Ant Colony Optimization, Ant-Lion Optimizer, Bat, Firefly, Cuckoo Search and Invasive Weed Optimization algorithms. The parameter combinations used in these algorithms were based entirely on the dataset used. It could be worthwhile obtaining an optimal set of parameter combinations for these algorithms that could be used across a range of application domains and datasets while accuracy is preserved or improved. The optimization algorithms implemented in the models were selected because they have not been used, or not extensively applied, in the domain of missing data estimation, particularly where high-dimensional datasets are concerned. It is also worth looking into the possibility of using alternative optimization algorithms in the models. From the newer collection of optimization algorithms, candidates for creating the hybrid models include the grey wolf algorithm, differential evolution and the lion optimization algorithm, to name a few. From the older category, the particle swarm optimization, simulated annealing, genetic algorithm, hill climbing, and pattern search algorithms could be applied to observe whether or not these algorithms, in combination with deep learning frameworks, are efficient at performing the task.

6.2.3 Compare to Other Models using Similar Datasets

Chapter 2 presents several existing missing data imputation algorithms applied in different scenarios and application areas, each with its own advantages and disadvantages. Despite the encouraging outcomes produced by the models proposed in this dissertation, one very important investigation that remains is comparing the results produced by the proposed models against those of other missing data imputation techniques, not only on the same dataset and ones of the same nature, but also on datasets with different characteristics, in order to provide some form of generalization. In addition, it is worthwhile investigating the performance of the proposed models on different datasets of a similar nature (image recognition datasets).


6.3 Alternative Areas of Application

The work done in this dissertation used the proposed models to estimate missing data entries in a high-dimensional dataset, with the focus on image recognition data. These models could be applied and extended to other application domains beyond the one considered here. For instance, they could be used in risk analysis and forecasting. Furthermore, they could be applied to environmental and health datasets to generalize the performance of the models and the results obtained in this dissertation.

6-5

Page 125: COPYRIGHT AND CITATION CONSIDERATIONS FOR THIS THESIS ... · deep learning approaches and swarm intelligence techniques has not yet been reported or investigated. The rst contribution

References

[1] R. Little and D. Rubin, Statistical analysis with missing data. John Wiley & Sons,2014.

[2] R. L. Carter, “Solutions for missing data in structural equation modeling,” Research& Practice in Assessment, vol. 1, no. 1, pp. 1–6, Winter 2006.

[3] N. A. Zainuri, A. A. Jemain, and N. Muda, “A comparison of various imputationmethods for missing values in air quality data,” Sains Malaysiana, vol. 44, no. 3,pp. 449–456, August 2015.

[4] T. Sidekerskiene and R. Damasevicius, “Reconstruction of missing data in synthetictime series using emd,” CEUR Workshop Proceedings, vol. 1712, pp. 7–12, 2016.

[5] M. N. Vukosi, F. V. Nelwamondo, and T. Marwala, “Autoencoder, principal com-ponent analysis and support vector regression for data imputation,” arXiv preprintarXiv:0709.2506, 2007.

[6] S. Rana, A. H. John, H. Midi, and A. Imon, “Robust regression imputation formissing data in the presence of outliers,” Far East Journal of Mathematical Sciences,vol. 97, no. 2, pp. 183–195, October 2015.

[7] F. Lobato, C. Sales, I. Araujo, V. Tadaiesky, L. Dias, L. Ramos, and A. Santana,“Multi-objective genetic algorithm for missing data imputation,” Pattern Recogni-tion Letters, vol. 68, no. 1, pp. 126–131, December 2015, (last accessed: 18-March-2016).

[8] M. Abdella and T. Marwala, “The use of genetic algorithms and neural networksto approximate missing data in database,” vol. 24, October 2005, pp. 577–589.

[9] I. Aydilek and A. Arslan, “A novel hybrid approach to estimating missing values indatabases using k-nearest neighbors and neural networks,” International Journal ofInnovative Computing, Information and Control, vol. 7, no. 8, pp. 4705–4717, 2012.

Rf-1

Page 126: COPYRIGHT AND CITATION CONSIDERATIONS FOR THIS THESIS ... · deep learning approaches and swarm intelligence techniques has not yet been reported or investigated. The rst contribution

References

[10] C. Leke, B. Twala, and T. Marwala, “Modeling of missing data prediction: Com-putational intelligence and optimization algorithms,” in IEEE International Con-ference on Systems, Man and Cybernetics (SMC). San Diego, CA, USA, October2014, pp. 1400–1404.

[11] F. J. Mistry, F. V. Nelwamondo, and T. Marwala, “Missing data estimation us-ing principle component analysis and autoassociative neural networks,” Journal ofSystemics, Cybernatics and Informatics, vol. 7, no. 3, pp. 72–79, 2009.

[12] F. V. Nelwamondo, S. Mohamed, and T. Marwala, “Missing data: A comparison ofneural network and expectation maximisation techniques,” Current Science, vol. 93,no. 12, pp. 1514–1521.

[13] S. Zhang, Z. Jin, and X. Zhu, “Missing data imputation by utilizing informationwithin incomplete instances,” Journal of Systems and Software, vol. 84, no. 3, pp.452–459, 2011.

[14] S. Zhang, “Shell-neighbor method and its application in missing data imputation,”Applied Intelligence, vol. 35, no. 1, pp. 123–133, 2011.

[15] A. Baraldi and C. Enders, “An introduction to modern missing data analyses,”Journal of School Psychology, vol. 48, no. 1, pp. 5–37, 2010.

[16] S. Van Buuren, Flexible imputation of missing data. CRC press, 2012.

[17] J. M. Jerez, I. Molina, P. J. Garcıa-Laencina, E. Alba, N. Ribelles, M. Martın, andL. Franco, “Missing data imputation using statistical and machine learning methodsin a real breast cancer problem,” Artificial intelligence in medicine, vol. 50, no. 2,pp. 105–115, 2010.

[18] Y. LeCun. The mnist database of handwritten digits. (last accessed: 15-Jan-2016).[Online]. Available: http://yann.lecun.com/exdb/mnist/

[19] G. E. Hinton, S. Osindero, and Y. W. Teh, “A fast learning algorithm for deepbelief nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.

[20] A. K. Mohamed, F. V. Nelwamondo, and T. Marwala, “Estimating missing datausing neural network techniques, principal component analysis and genetic algo-rithms,” Proceedings of the Eighteenth Annual Symposium of the Pattern Recogni-tion Association of South Africa, 2007.

[21] L. Francis, Dancing With Dirty Data: Methods for Exploring and CleaningData, pp. 198–254, (last accessed: November 2016). [Online]. Available:http://dx.doi.org/10.1007/978-3-319-19884-2 11

[22] M. Ramoni and P. Sebastiani, “Robust learning with missing data,” Journal ofMachine Learning, vol. 45, no. 2, pp. 147–170, 2001.

Rf-2

Page 127: COPYRIGHT AND CITATION CONSIDERATIONS FOR THIS THESIS ... · deep learning approaches and swarm intelligence techniques has not yet been reported or investigated. The rst contribution

References

[23] M. C. Tremblay, K. Dutta, and D. Vandermeer, “Using data mining techniques todiscover bias patterns in missing data,” Journal of Data and Information Quality,vol. 2, no. 1, 2010.

[24] R. Polikar, J. De Pasquale, H. S. Mohammed, G. Brown, and L. I. Kuncheva,“Learn++mf: A random subspace approach for the missing feature problem,” Pat-tern Recognition, vol. 43, no. 11, pp. 3817–3832, 2010.

[25] B. Twala, “An empirical comparison of techniques for handling incomplete datausing decision trees,” Applied Artificial Intelligence, vol. 23, no. 5, pp. 373–405,2009.

[26] D. Rubin, “Multiple imputations in sample surveys-a phenomenological bayesianapproach to nonresponse,” Proceedings of the survey research methods section ofthe American Statistical Association, vol. 1, pp. 20–34, 1978.

[27] P. D. Allison, “Multiple imputation for missing data,” Sociological Methods & Re-search, vol. 28, no. 3, pp. 301–309, 2000.

[28] E.-L. Silva-Ramirez, R. Pino-Mejias, M. Lopez-Coello, and M.-D. Cubiles-de-laVega, “Missing value imputation on missing completely at random data using mul-tilayer perceptrons,” Neural Networks, vol. 24, no. 1, pp. 121–129, January 2011.

[29] T. D. Pigott, “A review of methods for missing data,” Educational Research andEvaluation, vol. 7, no. 4, pp. 353–383, 2001.

[30] K. J. Nishanth and V. Ravi, “A computational intelligence based online data im-putation method: An application for banking,” Journal of Information ProcessingSystems, vol. 9, no. 4, pp. 633–650, 2013.

[31] J. Scheffer, “Dealing with missing data,” Research Letters in the Information andMathematical Sciences, vol. 3, pp. 153–160, 2000, (last accessed: 18-March-2016).[Online]. Available: http://www.massey.ac.nz/wwiims/research/letters

[32] P. Garca-Laencina, J. Sancho-Gmez, A. Figueiras-Vidal, and M. Verleysen, “K near-est neighbours with mutual information for simultaneous classification and missingdata imputation,” Neurocomputing, vol. 72, no. 7-9, pp. 1483–1493, 2009.

[33] F. Z. Poleto, J. M. Singer, and C. D. Paulino, “Missing data mechanisms and theirimplications on the analysis of categorical data,” Statistics and Computing, vol. 21,no. 1, pp. 31–43, 2011.

[34] Y. Liu and S. D. Brown, “Comparison of five iterative imputation methods formultivariate classification,” Chemometrics and Intelligent Laboratory Systems, vol.120, pp. 106–115, 2013.

[35] T. Marwala, Computational Intelligence for Missing Data Imputation: Estimationand Management Knowledge Optimization Techniques. Information Science Ref-erence, Hershey, New York, 2009.

Rf-3

Page 128: COPYRIGHT AND CITATION CONSIDERATIONS FOR THIS THESIS ... · deep learning approaches and swarm intelligence techniques has not yet been reported or investigated. The rst contribution

References

[36] I. S. Yansaneh, L. S. Wallace, and D. A. Marker, “Imputation methods for largecomplex datasets: An application to the nehis,” In Proceedings of the Survey Re-search Methods Section, pp. 314–319, 1998.

[37] P. D. Allison, Missing data. Thousand Oaks, CA: Sage, 2002.

[38] A. Kalousis and M. Hilario, “Supervised knowledge discovery from incompletedata,” In Proceedings of the 2nd International Conference on Data Mining, 2000,(last accessed: October 2016). [Online]. Available: http://cui.unige.ch/AI-group/research/metal/Papers/missing values.ps

[39] A. Perez, R. J. Dennis, J. F. A. Gil, M. A. Rondon, and A. Lopez, “Use of themean, hot deck and multiple imputation techniques to predict outcome in intensivecare unit patients in colombia,” Journal of Statistics in Medicine, vol. 21, no. 24,pp. 3885–3896, 2002.

[40] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incom-plete data via the em algorithm,” Journal of the Royal Statistics Society, vol. 39,no. 1, pp. 1–38, 1997.

[41] B. Twala and M. Cartwright, “Ensemble missing data techniques for software effortprediction,” Intelligent Data Analysis, vol. 14, no. 3, pp. 299–331, 2010.

[42] B. E. T. H. Twala, M. C. Jones, and D. J. Hand, “Good methods for coping withmissing data in decision trees,” Pattern Recognition Letters, vol. 29, no. 7, pp.950–956, 2008.

[43] B. Twala and M. Phorah, “Predicting incomplete gene microarray data with theuse of supervised learning algorithms,” Pattern Recognition Letters, vol. 31, pp.2061–2069, 2010.

[44] C. Ming-Hau, “Pattern recognition of business failure by autoassociative neuralnetworks in considering the missing values,” in International Computer Symposium(ICS). Taipei, Taiwan, Dec 2010, pp. 711–715.

[45] S. Haykin, Neural Networks. Prentice-Hall, New Jersey, second edition, 1999.

[46] P. J. Lu and T. C. Hsu, “Application of autoassociative neural network on gas-path sensor data validation,” Journal of Propulsion and Power, vol. 18, no. 4, pp.879–888, July 2002.

[47] J. Mistry, F. Nelwamondo, and T. Marwala, “Estimating missing data and deter-mining the confidence of the estimate data,” Seventh International Conference onMachine Learning and Applications, pp. 752–755, December 2008, san Diego, CA,USA.

[48] J. W. Hines, E. U. Robert, and D. J. Wrest, “Use of autoassociative neural networksfor signal validation,” Journal of Intelligent and Robotic Systems, vol. 21, no. 2, pp.143–154, February 1998.

Rf-4

Page 129: COPYRIGHT AND CITATION CONSIDERATIONS FOR THIS THESIS ... · deep learning approaches and swarm intelligence techniques has not yet been reported or investigated. The rst contribution

References

[49] M. J. Atalla and D. J. Inman, “On model updating using neural networks,” Me-chanical Systems and Signal Processing, vol. 12, pp. 135–161, 1998.

[50] N. Smauoi and S. Al-Yakoob, “Analyzing the dynamics of cellular flames usingkarhunenloeve decomposition and autoassociative neural networks,” Society for In-dustrial and Applied Mathematics, vol. 24, pp. 1790–1808, 2003.

[51] T. Marwala, “Probabilistic fault identification using a committee of neural networksand vibration data,” Journal of Aircraft, vol. 38, no. 1, pp. 138–146, January-February 2001.

[52] T. Marwala and S. Chakraverty, Fault classification in structures with incompletemeasured data using autoassociative neural networks and genetic algorithm, 2006,vol. 90, no. 4.

[53] T. Marwala, Economic Modelling Using Artificial Intelligence Methods. Springer-Verlag, London, UK., 2013.

[54] C. Leke and T. Marwala, “Missing data estimation in high-dimensional datasets: Aswarm intelligence-deep neural network approach,” in International Conference inSwarm Intelligence. Springer International Publishing, 2016, pp. 259–270.

[55] J. C. Isaacs, “Representational learning for sonar atr,” in SPIE Defense+ Security.International Society for Optics and Photonics, June 2014.

[56] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review andnew perspectives,” Transactions on Pattern Analysis and Machine Intelligence,vol. 35, no. 8, pp. 1798–1828, 2013.

[57] K. Baek and S. Cho, “Bankruptcy prediction for credit risk using an auto-associativeneural network in korean firms,” IEEE Conference on Computational Intelligencefor Financial Engineering, pp. 25–29, March 2003, hong Kong, China.

[58] T. Tim, M. Mutajogire, and T. Marwala, “Stock market prediction using evolu-tionary neural networks,” Fifteenth Annual Symposium of the Pattern Recognition,PRASA, pp. 123–133, Nov 2004.

[59] L. B. Brain, T. Marwala, and T. Tettet, “Autoencoder networks for hiv classifica-tion,” Current Science, vol. 91, no. 11, pp. 1467–1473, 2006.

[60] W.-H. Steeb, The Nonlinear Workbook. World Scientific, Singapore, 2008.

[61] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning:Data Mining, Inference, and Prediction. Springer-Verlag, New York, 2008.

[62] J. A. K. Suykens and J. Vandewalle, “Least squares support vector machine classi-fiers,” Neural Processing Letters, vol. 9, no. 3, pp. 293–300, June 1999.

Rf-5

Page 130: COPYRIGHT AND CITATION CONSIDERATIONS FOR THIS THESIS ... · deep learning approaches and swarm intelligence techniques has not yet been reported or investigated. The rst contribution

References

[63] M. A. Hearst, S. T. Dumais, E. Osuna, J. Platt, and B. Scholkopf, “Support vectormachines,” IEEE Intelligent Systems and their Applications, vol. 13, no. 4, pp.18–28, July-Aug. 1998.

[64] C. J. C. Burges, “A tutorial on support vector machines for pattern recognition,”Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121–167, June 1998.

[65] T. Marwala, Finite Element Model Updating Using Computational Intelligence Tech-niques: Applications to Structural Dynamics. Heidelberg: Springer, 2010.

[66] ——, Causality, Correlation, and Artificial Intelligence for Rational Decision Making. Singapore: World Scientific, 2015.

[67] J. Kennedy and R. Eberhart, “Particle swarm optimization,” in Proc. IEEE International Conference on Neural Networks (ICNN), Perth, Australia, vol. 4, Nov 1995, pp. 1942–1948.

[68] A. P. Engelbrecht, “Particle swarm optimization: Where does it belong?” in Proceedings of the IEEE Swarm Intelligence Symposium, May 2006.

[69] T. Marwala and M. Lagazio, Militarized Conflict Modeling Using Computational Intelligence Techniques. Springer-Verlag, London, UK, 2011.

[70] L. Deng, J. Li, J.-T. Huang, K. Yao, D. Yu, F. Seide, M. Seltzer, G. Zweig, X. He, and J. Williams, “Recent advances in deep learning for speech research at Microsoft,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8604–8608, May 2013.

[71] L. Deng and D. Yu, “Deep learning: methods and applications,” Foundations and Trends in Signal Processing, vol. 7, no. 3-4, pp. 197–387, 2014.

[72] G. E. Hinton, “Deep belief networks,” Scholarpedia, vol. 4, no. 5, p. 5947, 2009.

[73] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” Journal of Machine Learning Research, vol. 11, pp. 3371–3408, 2010.

[74] H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin, “Exploring strategies for training deep neural networks,” Journal of Machine Learning Research, vol. 10, pp. 1–40, 2009.

[75] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105, (last accessed: May 2016). [Online]. Available: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf

[76] Y. LeCun and Y. Bengio, “Convolutional networks for images, speech, and time series,” The Handbook of Brain Theory and Neural Networks, vol. 3361, no. 10, pp. 1–14, 1995.

[77] R. Salakhutdinov, A. Mnih, and G. Hinton, “Restricted Boltzmann machines for collaborative filtering,” in Proceedings of the 24th International Conference on Machine Learning, ser. ICML ’07. New York, NY, USA: ACM, 2007, pp. 791–798, (last accessed: May 2016). [Online]. Available: http://doi.acm.org/10.1145/1273496.1273596

[78] R. Salakhutdinov and G. Hinton, “Deep Boltzmann machines,” Artificial Intelligence and Statistics, vol. 1, no. 2, pp. 448–455, 2009.

[79] G. Hinton, “A practical guide to training restricted Boltzmann machines,” Momentum, vol. 9, no. 1, p. 926, 2010.

[80] T. Tieleman, “Training restricted Boltzmann machines using approximations to the likelihood gradient,” in Proceedings of the 25th International Conference on Machine Learning, ser. ICML ’08. New York, NY, USA: ACM, 2008, pp. 1064–1071, (last accessed: May 2016). [Online]. Available: http://doi.acm.org/10.1145/1390156.1390290

[81] Y. LeCun, Y. Bengio, and G. E. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, 2015.

[82] T. Tieleman and G. E. Hinton, “Using fast weights to improve persistent contrastive divergence,” Proceedings of 26th International Conference on Machine Learning, pp. 1033–1040, 2009.

[83] M. Á. Carreira-Perpiñán and G. E. Hinton, “On contrastive divergence learning,” Artificial Intelligence and Statistics, pp. 1–7, 2005, (last accessed: 15-March-2015). [Online]. Available: http://learning.cs.toronto.edu/~hinton/absps/cdmiguel.pdf

[84] G. E. Hinton, “Training products of experts by minimizing contrastive divergence,” Neural Computation, vol. 14, no. 8, pp. 1771–1800, 2002.

[85] M. S. R. Monteiro, D. B. M. M. Fontes, and F. A. C. C. Fontes, “Ant colony optimization: a literature survey,” FEP Working Papers, Universidade do Porto, Faculdade de Economia do Porto, 2012, (last accessed: January 2016). [Online]. Available: http://EconPapers.repec.org/RePEc:por:fepwps:474

[86] M. Dorigo, V. Maniezzo, and A. Colorni, “Positive feedback as a search strategy,” Dipartimento di Elettronica, Politecnico di Milano, Milan, Italy, Tech. Rep. 91-016, 1991.

[87] M. Dorigo, V. Maniezzo, and A. Colorni, “The ant system: Optimization by a colony of cooperating agents,” IEEE Transactions on Systems, Man, and Cybernetics–Part B, vol. 26, no. 1, pp. 29–41, 1996.

[88] M. Dorigo and G. Di Caro, “The ant colony optimization meta-heuristic,” in New Ideas in Optimization. Maidenhead, UK: McGraw-Hill, 1999, pp. 11–32, (last accessed: 20-May-2016). [Online]. Available: http://dl.acm.org/citation.cfm?id=329055.329062

[89] M. Dorigo, M. Birattari, and T. Stützle, “Ant colony optimization – artificial ants as a computational intelligence technique,” IEEE Computational Intelligence Magazine, vol. 1, pp. 28–39, 2006.

[90] A. C. Zecchin, A. R. Simpson, H. R. Maier, M. Leonard, A. J. Roberts, and M. J. Berrisford, “Application of two ant colony optimisation algorithms to water distribution system optimisation,” Mathematical and Computer Modelling, vol. 44, no. 5-6, pp. 451–468, 2006.

[91] X. J. Liu, H. Yi, and Z.-H. Ni, “Application of ant colony optimization algorithm in process planning optimization,” Journal of Intelligent Manufacturing, vol. 24, no. 1, pp. 1–13, 2013.

[92] T. İnkaya, S. Kayalıgil, and N. E. Özdemirel, “Ant colony optimization based clustering methodology,” Applied Soft Computing, vol. 28, pp. 301–311, 2015.

[93] S. Mirjalili, “The ant lion optimizer,” Advances in Engineering Software, vol. 83, pp. 80–98, 2015.

[94] E. Gupta and A. Saxena, “Performance evaluation of antlion optimizer based regulator in automatic generation control of interconnected power system,” Journal of Engineering, vol. 2016, pp. 1–14, 2016.

[95] R. Satheeshkumar and R. Shivakumar, “Ant lion optimization approach for load frequency control of multi-area interconnected power systems,” Circuits and Systems, vol. 7, pp. 2357–2383, 2016.

[96] W. Yamany, A. Tharwat, M. Fawzy, T. Gaber, and A. E. Hassanien, “A new multi-layer perceptrons trainer based on ant lion optimization algorithm,” in Fourth International Conference on Information Science and Industrial Applications (ISI), Sept 2015, pp. 40–45.

[97] H. M. Zawbaa, E. Emary, and C. Grosan, “Feature selection via chaotic antlion optimization,” PLOS ONE, vol. 11, no. 3, pp. 1–21, March 2016, (last accessed: June 2016). [Online]. Available: https://doi.org/10.1371/journal.pone.0150652

[98] M. Petrović, J. Petronijević, M. Mitić, N. Vuković, A. Plemić, Z. Miljković, and B. Babić, “The ant lion optimization algorithm for flexible process planning,” Journal of Production Engineering, vol. 18, no. 2, pp. 65–68, 2015.

[99] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, July 2006.

[100] X. S. Yang and S. Deb, “Cuckoo search: recent advances and applications,” Neural Computing and Applications, vol. 24, no. 1, pp. 169–174, 2014.

[101] X. S. Yang and S. Deb, “Cuckoo search via Lévy flights,” World Congress on Nature and Biologically Inspired Computing (NaBIC), vol. 48, no. 2, pp. 210–214, Feb 2009.

[102] S. Vasanthakumar, N. Kumarappan, R. Arulraj, and T. Vigneysh, “Cuckoo search algorithm based environmental economic dispatch of microgrid system with distributed generation,” in IEEE International Conference on Smart Technologies and Management for Computing, Communication, Controls, Energy and Materials (ICSTM), August 2015, pp. 575–580.

[103] J. Wang, B. Zhou, and S. Zhou, “An improved cuckoo search optimization algorithm for the problem of chaotic systems parameter estimation,” Computational Intelligence and Neuroscience, vol. 2016, p. 8, 2016.

[104] F. A. Ali and A. T. Mohamed, “A hybrid cuckoo search algorithm with Nelder–Mead method for solving global optimization problems,” SpringerPlus, vol. 5, no. 1, p. 473, 2016.

[105] X. S. Yang, “A new metaheuristic bat-inspired algorithm,” in Nature Inspired Cooperative Strategies for Optimization (NICSO), Studies in Computational Intelligence, pp. 65–74, 2010.

[106] ——, “Bat algorithm: Literature review and applications,” International Journal of Bio-Inspired Computation, vol. 5, no. 3, pp. 141–149, 2013.

[107] ——, “Bat algorithm for multiobjective optimization,” International Journal of Bio-Inspired Computation, vol. 3, no. 5, pp. 267–274, 2011.

[108] X. S. Yang, M. Karamanoglu, and S. Fong, “Bat algorithm for topology optimization in microelectronic applications,” First International Conference on Future Generation Communication Technologies (FGCT), pp. 12–14, Dec 2012.

[109] T. C. Bora, L. d. S. Coelho, and L. Lebensztajn, “Bat-inspired optimization approach for the brushless DC wheel motor problem,” IEEE Transactions on Magnetics, Feb 2012.

[110] X.-S. Yang, “Firefly algorithm, Lévy flights and global optimization,” in Research and Development in Intelligent Systems XXVI (Eds M. Bramer, R. Ellis, M. Petridis), pp. 209–218, 2010.

[111] A. R. Mehrabian and C. Lucas, “A novel numerical optimization algorithm inspired from weed colonization,” Ecological Informatics, vol. 1, pp. 355–366, 2006.

[112] C. Veenhuis, “Binary invasive weed optimization,” Second World Congress on Nature and Biologically Inspired Computing, pp. 449–454, Dec 2010.

[113] B. Paryzad and N. S. Pour, “Time-cost-quality trade-off in project with using invasive weed optimization algorithm,” Journal of Basic and Applied Scientific Research, vol. 3, no. 11, pp. 134–142, 2013.

[114] H. L. Hung, C. C. Chao, C. H. Cheng, and Y. F. Huang, “Invasive weed optimization method based blind multiuser detection for MC-CDMA interference suppression over multipath fading channel,” International Conference on Systems, Man and Cybernetics (SMC), pp. 2145–2150, 2010.

[115] K. Su, L. Ma, X. Guo, and Y. Sun, “An efficient discrete invasive weed optimization algorithm for web services selection,” Journal of Software, vol. 9, no. 3, pp. 709–715, March 2014.

[116] H. A. Kasdirin, N. M. Yahya, M. S. M. Aras, and M. O. Tokhi, “Hybridizing invasive weed optimization with firefly algorithm for unconstrained and constrained optimization problems,” Journal of Theoretical and Applied Information Technology, vol. 95, no. 4, pp. 912–927, Feb 2017.

[117] M. Yazdani and R. Ghodsi, “Invasive weed optimization algorithm for minimizing total weighted earliness and tardiness penalties on a single machine under aging effect,” International Robotics and Automation Journal, vol. 2, no. 1, Jan 2017.
