Imputing Missing Values Using the Expectation-Maximization Algorithm
Thulare Evans : 1799336
Supervisor : Dr. Ritesh Ajoodha
A research project submitted to the
DEPARTMENT OF COMPUTER SCIENCE AND APPLIED MATHEMATICS
UNIVERSITY OF THE WITWATERSRAND, JOHANNESBURG
SOUTH AFRICA
In partial fulfilment of the requirements for the degree of Master of Science (MSc)
e-Science
25 SEPTEMBER 2018
Declaration
I, Thulare Evans Molahlegi, declare that this research is my own original work and has not
been submitted before for any other degree, part of a degree, or examination at the University
of the Witwatersrand or any other university.
Dedication
I dedicate this research project to Almighty God my creator, my source of inspiration. He
has been the source of my strength throughout this research project.
I also dedicate this work to my family and friends. A special feeling of gratitude to my
loving and caring single parent, Mrs Nelly Thulare, who has encouraged me all the way
and taught me that even the largest task can be accomplished if it is done one step at a time.
To my brother Mr Mahlatse Thulare, I will always appreciate all you have done.
Acknowledgements
I would like to thank the Almighty for His showers of blessings throughout my work to
complete the research project successfully.
I would also like to express my deepest and sincere gratitude to my research project supervisor,
Dr. Ritesh Ajoodha (Ritso), for giving me the opportunity and providing me with
priceless guidance throughout. His energetic personality, vision, sincerity and motivation
have deeply inspired me. It was a great privilege and honour to work and study under his
supervision.
I am extremely grateful to my parents, Mrs Thulare SN and Mrs Thulare MM for their
love, understanding, prayers and sacrifices for educating and preparing me for my future.
The support of the DST-CSIR National e-Science Postgraduate Teaching and Training Plat-
form (NEPTTP) towards this research is hereby acknowledged. Opinions expressed and
conclusions arrived at, are those of the author and are not necessarily to be attributed to the
NEPTTP.
Abstract
The study aims to evaluate the performance of the Expectation-Maximization (EM) algorithm
when estimating missing values, and to observe how far the estimated distributions diverge
from the true distributions using the Kullback-Leibler (KL) divergence, on a generated data set
of 40 000 observations simulated from a Bayesian Network (BN). A BN was used to generate the
data because it can precisely manage the correlation of the variables in the data set. Missing at
Random (MAR) was assumed in this case. Different percentages of the data set were hidden and
then estimated, to see how well the EM algorithm performs as the percentage of
missing values increases.

It was found that the EM algorithm does not perform well when a massive percentage of
values is missing from the data set. The KL divergence showed that the more missing values
we estimate in the data set, say more than 50 per cent, the more we lose the structure of the
data, and the EM algorithm produces estimates that are less reliable. The KL divergence
plot showed that the EM algorithm performs well when estimating missing values for
less than 50% of the data set. If one estimates a massive share of missing values, say 80
or 90 per cent of the data, one will get misleading estimates which lead to inaccurate results.

Keywords: Expectation-Maximization algorithm, Kullback-Leibler divergence, Bayesian
networks, Missing at Random
Contents

Declaration
Dedication
Acknowledgements
Abstract
List of Figures
List of Tables
List of Abbreviations

1 Introduction
  1.1 Background
    1.1.1 Bayesian Networks
    1.1.2 Expectation-Maximization (EM) Algorithm
    1.1.3 Kullback-Leibler (KL) Divergence
  1.2 Aims and Objectives
  1.3 Problem Definition
  1.4 Research Question
  1.5 Structure of the Report

2 Literature Review
  2.1 Introduction
  2.2 Missing Data
    2.2.1 Missing Data Mechanisms
  2.3 Methods for Handling Missing Data
    2.3.1 Listwise and Pairwise Deletion
    2.3.2 Expectation Maximization
    2.3.3 Single and Multiple Imputation
    2.3.4 Sampling Importance Resampling
  2.4 Conclusion

3 Methodology
  3.1 Introduction
  3.2 Ground Truth Bayesian Network
  3.3 Sampling the Data Set
  3.4 Separating the complete dataset into copies with missing components
  3.5 Learning Bayesian networks from the missing datasets
    3.5.1 Parameter learning
    3.5.2 Model learning
  3.6 Evaluating the learned Bayesian network using KL divergence
  3.7 Motivation

4 Results and Discussion
  4.1 Introduction
  4.2 Data analysis
  4.3 Discussion

5 Conclusion and Recommendation
  5.1 Conclusion
  5.2 Recommendations for Future Work
List of Figures

3.1 The diagram shows the BN structures
4.1 The diagram shows the BN structure for generating the data set
4.2 The diagram constitutes 10% authentic data and 90% estimated data
4.3 The figure shows the KL divergence with missing data percentage
4.4 The figure shows the KL divergence with missing data percentage and 50 observations
List of Tables

4.1 Sample of a data set
4.2 Sample with 10% of the data observed
List of Abbreviations

EM : Expectation Maximization
ML : Maximum Likelihood
KNN : K-Nearest Neighbour
KNNimpute : K-Nearest Neighbour impute
KL divergence : Kullback-Leibler divergence
MAR : Missing at Random
MCAR : Missing Completely at Random
MNAR : Missing Not at Random
BNs : Bayesian Networks
MI : Multiple Imputation
SI : Single Imputation
MVNI : Multivariate Normal Imputation
MICE : Multiple Imputation by Chained Equations
RF : Random Forest
SVD : Singular Value Decomposition
SVDimpute : Singular Value Decomposition impute
MSE : Mean Squared Error
cGDI : Column-wise Guided Data Imputation
SIR : Sampling Importance Resampling
SRMI : Sequential Regression Multivariate Imputation
GA : Genetic Algorithm
PGMs : Probabilistic Graphical Models
DAG : Directed Acyclic Graph
MAP : Maximum a Posteriori
IID : Independent and Identically Distributed
Chapter 1

Introduction

Missing data are a part of almost all research, and we all have to decide from time to time
how to deal with them, whether by imputing them or ignoring them. The most popular and
simplest method of handling missing data is to ignore the attributes with missing observations
[Azadeh et al. 2013]. Missing data are a common issue, and determining the right approach
to mitigate them often becomes a major challenge for machine learning practitioners working
with real-world data, since many statistical models and machine learning algorithms rely on
a complete dataset.
Data problems with missing values and latent variables are common in practice. Missing
data are variables without observations. Most statistical procedures eliminate entire cases
whenever they encounter missing data in any variable included in the analysis
[Ramezani et al. 2017]. In surveys, missing data can be caused by many things: the person
taking the survey may not understand a question and leave it unanswered, may refuse to
answer some questions due to privacy concerns, may give answers that are not relevant to
the questions, or may lose interest along the way while completing the questionnaire. Every
question without an answer is regarded as a missing data point. In research, missing data
may also occur due to human error (for example, forgetting to record a certain measurement).
It is well known that naive methods which ignore the missing data, such as complete
case analysis, can lead to seriously biased parameter estimates and can also affect the data
quality, and thereby the final knowledge discovered from it. A good way to deal with missing
data is to impute it before processing, that is, to estimate and fill in the missing data using
approaches such as Expectation-Maximization (EM), regression imputation, K-Nearest
Neighbour (KNN) imputation, mean imputation, hot-deck imputation, etc. To apply any of
these methods, one must first understand the nature of the missing data.
1.1 Background
Recently, Balakrishnan et al. [2017] highlighted that the Expectation-Maximization (EM)
algorithm is a maximum-likelihood tool that is widely used to estimate the missing values in
a dataset, and there is now a very rich literature on its behaviour (e.g., [Wu 1983], [Xu and
Jordan 1996], [Neal and Hinton 1998], [Hastie et al. 2009]).
1.1.1 Bayesian Networks
Griffiths and Yuille [2008] have shown that Bayesian Networks (BNs), also known as 'belief
networks', 'causal networks', or just Bayes nets, belong to the family of Probabilistic Graphical
Models (PGMs) that can be used to build models from data or to represent multivariate
probability distributions. They are used in many areas, such as machine learning, text mining,
natural language processing, speech recognition, signal processing, bioinformatics,
error-control codes, medical diagnosis, weather forecasting, and cellular networks. BNs
combine the principles of graph theory, probability theory, computer science, and statistics.
A BN can be described as a graph made up of nodes and directed links between them, where
nodes represent variables and a link is added between two nodes to indicate that one node
directly influences the other. Every BN is a Directed Acyclic Graph (DAG). A network is
defined as $B = \langle G, \Theta \rangle$, where $G$ is the DAG whose nodes
$x_1, x_2, \ldots, x_n$ represent random variables and $\Theta$ represents the parameters of
the network. $\Theta$ contains a parameter $\theta_{x_i \mid \pi_i} = P_B(x_i \mid \pi_i)$
for each node, where $\pi_i$ is the set of parents of $X_i$ in $G$. The joint distribution
factorizes as

$$P_B(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P_B(X_i \mid \pi_i) = \prod_{i=1}^{n} \theta_{X_i \mid \pi_i} \qquad (1.1)$$
The goal of this paper is to generate data using a BN and to use this BN to capture the
relationships between the variables, so that when we estimate the missing values using the
EM algorithm we can use the conditional (dependent) probabilities. For example, in a network
over $X_1, \ldots, X_5$, if $X_1$ is missing or hidden we can estimate it from
$P(X_1 \mid X_2, X_3, X_4, X_5)$.
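To make the factorization in Equation 1.1 concrete, the minimal Python sketch below forward-samples a small discrete BN. The structure (X1 → X2, X1 → X3, X2 → X4) and all CPT values are illustrative assumptions made for this example, not the exact network used in this study.

```python
# Forward sampling from a hypothetical 4-variable binary BN with states {1, 2}.
import numpy as np

rng = np.random.default_rng(0)

# Conditional probability tables: probability that each node takes the value 2.
p_x1 = 0.5                       # P(X1 = 2)
p_x2 = {1: 0.8, 2: 0.3}          # P(X2 = 2 | X1)
p_x3 = {1: 0.6, 2: 0.1}          # P(X3 = 2 | X1)
p_x4 = {1: 0.7, 2: 0.2}          # P(X4 = 2 | X2)

def sample_one():
    """Draw one joint sample in topological order (parents before children)."""
    x1 = 2 if rng.random() < p_x1 else 1
    x2 = 2 if rng.random() < p_x2[x1] else 1
    x3 = 2 if rng.random() < p_x3[x1] else 1
    x4 = 2 if rng.random() < p_x4[x2] else 1
    return (x1, x2, x3, x4)

# 10 000 samples of 4 variables = 40 000 data points, as in this study.
data = np.array([sample_one() for _ in range(10_000)]).T   # shape (4, 10000)
print(data[:, :5])
```

Because each node is drawn from its CPT given its already-sampled parents, the samples follow exactly the product in Equation 1.1.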
1.1.2 Expectation-Maximization (EM) Algorithm
The EM algorithm is a maximum likelihood method for estimating missing data. Unlike
methods such as complete case analysis or hot-deck imputation, the EM algorithm does not
directly 'fill in' the missing values; instead it estimates the parameters by maximizing the
complete-data log likelihood function, and then uses those estimated parameters to estimate
the missing values in the data set. It estimates the parameters by iterating between the
E-step and the M-step [Dempster et al. 1977].
In the E-step, the log-likelihood function of the parameters given the data is calculated. It
is assumed that the data is partitioned into observed data and missing data. Let $X$ be our
complete data, so that $X = (X_{obs}, X_{mis})$. The distribution of $X$ depends on an
unknown parameter $\theta$, i.e.

$$P(X \mid \theta) = P(X_{obs}, X_{mis} \mid \theta) = P(X_{obs} \mid \theta)\, P(X_{mis} \mid X_{obs}, \theta) \qquad (1.2)$$

Equation 1.2 can be written as the likelihood function below,

$$L(\theta \mid X) = L(\theta \mid X_{obs}, X_{mis}) = c\, L(\theta \mid X_{obs})\, P(X_{mis} \mid X_{obs}, \theta) \qquad (1.3)$$

where $c$ is a constant arising from the missing data mechanism; it can be ignored when
working under the MAR assumption. In our case, because we assumed MAR, $c$ is ignored.
Taking the log of both sides of equation 1.3, we get the following equation:

$$l(\theta \mid X) = l(\theta \mid X_{obs}) + \log P(X_{mis} \mid X_{obs}, \theta) + \log c \qquad (1.4)$$

where $l(\theta \mid X)$ is the log likelihood of the complete data, $l(\theta \mid X_{obs})$
is the log likelihood of the observed data, $\log c$ is a constant, and
$P(X_{mis} \mid X_{obs}, \theta)$ is the predictive distribution of the missing data given
$\theta$ [Schafer 1997].
Since $X_{mis}$ is unknown and cannot be used directly, we take a current guess of $\theta$,
denoted $\theta^{(t)}$, and compute the expectation of $l(\theta \mid X)$ with respect to
$P(X_{mis} \mid X_{obs}, \theta^{(t)})$, i.e.

$$Q(\theta \mid \theta^{(t)}) = E\big[\, l(\theta \mid X) \mid X_{obs}, \theta^{(t)} \big] = \int l(\theta \mid X)\, P(X_{mis} \mid X_{obs}, \theta^{(t)})\, dX_{mis} = l(\theta \mid X_{obs}) + \int \log P(X_{mis} \mid X_{obs}, \theta)\, P(X_{mis} \mid X_{obs}, \theta^{(t)})\, dX_{mis} \qquad (1.5)$$
At the M-step of the EM algorithm, $\theta$ is obtained by maximizing the expected
complete-data log likelihood from the E-step. Mathematically,

$$\theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)}) \qquad (1.6)$$

The EM algorithm starts with an initial guess $\theta^{(0)}$ and iterates between the E-step
and M-step until it converges, that is, until successive estimates of $\theta$ are nearly
identical [Dong and Peng 2013].
In summary, the EM algorithm is an iterative method for finding the maximum likelihood
(ML) or maximum a posteriori (MAP) estimate for models with missing values. The
following four steps describe how EM works; a small numerical sketch follows the list.

1. Initialization step: obtain an initial estimate $\theta^{(0)}$. This can be a random
initialization.

2. Expectation step: assuming the parameters $\theta^{(t-1)}$ from the previous step are
fixed, compute the expected values of the missing values (or, in most cases, the expected
values of functions of the missing values).

3. Maximization step: given the values obtained in the E-step, estimate new parameter
values $\theta^{(t)}$ that maximize the expected log likelihood.

4. Exit condition: if the likelihood of the observations is almost identical between
iterations, exit; otherwise return to step 2.
1.1.3 Kullback-Leibler (KL) Divergence

In this section we take a look at a way of comparing two probability distributions, called
the Kullback-Leibler (KL) divergence. In statistics, we use the KL divergence to measure
how much information we lose when we approximate one distribution with another. The KL
divergence is a natural distance measure from a probability distribution $p(x)$ to an
estimated probability distribution $q(x)$. It is commonly used in pattern recognition and in
the fields of speech and image recognition [Hershey and Olsen 2007].
The KL divergence, also called the relative entropy in machine learning, between two
probability distribution functions $p(x)$ and $q(x)$ is

$$D(p \| q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx \qquad (1.7)$$
Below are the properties of the KL divergence.

1. Self-similarity: $D(p \| p) = 0$.

2. Self-identification: $D(p \| q) = 0$ if and only if $p = q$.

3. Positivity: $D(p \| q) \geq 0$ for all $p$, $q$.
It must be noted that the larger the divergence between $p(x)$ and $q(x)$, the higher the
value of $D(p \| q)$; if there is not much difference between $p(x)$ and $q(x)$, the value of
$D(p \| q)$ will be small; and finally, the KL divergence is not a metric, since
$D(p \| q) \neq D(q \| p)$ in general. The importance of the KL divergence lies in its ability
to quantify how far off an estimated distribution may be from the true distribution. The
objective of this project is to check how $q(x)$ diverges from $p(x)$ as we increase the
missing values in the data.
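For discrete distributions, which is the setting of this project, the integral in Equation 1.7 becomes a sum over states. A minimal sketch, with made-up example distributions:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Discrete KL divergence D(p || q) = sum_x p(x) * log(p(x) / q(x))."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # eps guards against log(0) when q assigns (near-)zero probability.
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

p = [0.4, 0.4, 0.2]             # true distribution
q = [0.5, 0.3, 0.2]             # estimated distribution
print(kl_divergence(p, p))      # self-similarity: 0.0
print(kl_divergence(p, q))      # small positive value
print(kl_divergence(q, p))      # a different value: KL is not symmetric
```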
1.2 Aims and Objectives
This research aims to present empirical results which demonstrate the ability of the EM
algorithm to recover different percentages of missing values.

The objectives of this paper are to:

∗ evaluate the performance of the EM algorithm during the estimation of missing values;

∗ determine how the algorithm performs when estimating missing data at different
percentages of hidden values;

∗ use the KL divergence to quantify how close the original distribution and the estimated
distribution are as we increase the percentage of missing data;

∗ recognise the types and patterns of missing data, and when the EM algorithm can be
used and when it can be unbiased.
For the above objectives to be met, discrete data is generated randomly and Missing at
Random (MAR) is assumed, since values are randomly removed. A total of 40 000
observations is generated. We first hide points in the data set and then estimate the hidden
values, comparing the true values with the estimated ones.

The KL divergence will be applied to compare the distribution which generated the
complete dataset with the distribution obtained from learning on 10 percent of the data; the
value we get tells us how far we are from the true distribution. We continue learning from
different percentages of the data and calculate the KL divergence for each percentage hidden.
1.3 Problem Definition
Handling missing values properly is important for analysing data successfully. If
researchers fail to handle missing values, they might end up drawing inaccurate conclusions.
The problem of missing data is relatively common in almost all research and can have a
significant effect on the conclusions that can be drawn from the data if not taken into
consideration. Papers such as Papageorgiou et al. [2018], Agarwal and Tangirala [2017],
Zhang [2006] and Zarate et al. [2006] have shown the importance of recovering missing data.
To address this problem, one should understand the nature of the missing data before trying
to impute or delete the missing values. This research project might help researchers when
imputing missing values using the EM algorithm, by giving them an understanding of how
the EM algorithm performs on different percentages of missing data.
1.4 Research Question
The research questions of this project are:

∗ How does the EM algorithm perform when estimating missing values?

∗ How does the EM algorithm perform on a compact data set?

∗ What happens if we use the EM algorithm to estimate a massive share of missing values,
about 90% of the data?
1.5 Structure of the Report
In this chapter (Chapter 1), the introduction, background of the study, aims and objectives,
problem definition and research questions are outlined in order. A literature review of
missing data and of methods for handling it is given in Chapter 2. Chapter 3 presents the
methodology followed by this research. The analysis and discussion of the study are given in
Chapter 4, and finally the conclusions and recommendations are in Chapter 5.
Chapter 2
Literature Review
2.1 Introduction
Missing data occur when no data value is stored for a variable in an observation or dataset.
Missing data occur commonly and affect the conclusions that are drawn from a dataset
[Ghahramani and Jordan 1995]. The occurrence of missing data can be caused by the
non-response of a respondent, a respondent not understanding the question, incorrect
measurement, human error, etc. Every survey question that has no answer is a missing data
point. There is no perfect way to deal with missing data.
2.2 Missing Data
Abdella and Marwala [2005] show that missing values in a dataset refer to the case where
some components of the dataset are not available for all data items in the database, or may
not even be defined. Missing values create problems in many applications that depend on
accurate data.
Several research studies have concentrated on the impact of missing values in a dataset and
its management. Treating missing values is considered an important step in the analysis,
since it improves the effectiveness of the knowledge discovery process [Nancy et al. 2017]. In
fields that are highly dependent on data for decision making, missing data is still a problem
that needs to be solved [Zha et al. 2013].

According to Matta et al. [2017], missing values are a common feature of longitudinal
studies; if not taken into consideration they can reduce statistical power and also lead to
biased parameter estimates. In the context of longitudinal studies, the statistical literature
uses the terms incomplete data and missing data interchangeably.
2.2.1 Missing Data Mechanisms
Abdella and Marwala [2005] illustrate methods that have been used to handle missing
values in areas such as statistics, mathematics and other disciplines. The right way to handle
missing data depends on how the data points have gone missing. The three types of missing
data mechanisms are: Missing Completely at Random (MCAR), Missing at Random (MAR),
and non-ignorable missingness. MCAR occurs if the probability of a missing value for
variable X is not related to the value of X or to any other variable in the dataset; this
happens when the missingness does not depend on the variable of interest. MAR arises if the
probability of missing data on a variable X depends on other variables but not on X itself.
Finally, non-ignorable missingness happens if the probability of missing data on X is related
to the value of X itself. This is the most difficult of the three mechanisms to approximate
and model.
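The three mechanisms can be made concrete with a small simulation. In the hypothetical sketch below, Y depends on X, and the missingness mask for Y is drawn under each mechanism; only under MCAR does the complete-case mean of Y stay (approximately) unbiased. All probabilities are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
x = rng.normal(size=n)
y = x + rng.normal(size=n)                           # Y is correlated with X

mcar = rng.random(n) < 0.3                           # MCAR: unrelated to X and Y
mar = rng.random(n) < np.where(x > 0, 0.5, 0.1)      # MAR: depends on observed X only
mnar = rng.random(n) < np.where(y > 0, 0.5, 0.1)     # MNAR: depends on Y itself

print("full-data mean of Y:", round(y.mean(), 3))    # about 0
for name, missing in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(name, "complete-case mean:", round(y[~missing].mean(), 3))
```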
Recently, Vazifehdan et al. [2018] showed that real-world datasets often include missing
values for various reasons. This is a major challenge when using machine learning
approaches, as most learning algorithms cannot work with missing data. Imputation of
missing values is very useful for unbiased predictions using machine learning tools.
Vazifehdan et al. [2018] and Schafer and Graham [2002] indicate the three types of missing
values that should be considered when using an imputation method:

1. Missing at Random (MAR)

2. Missing Completely at Random (MCAR)

3. Missing Not at Random (MNAR)
Checking the missing mechanism is equivalent to testing the randomness of the data. When
investigating the missing mechanisms, one may test for correlation between missingness and
other variables in the dataset. Low correlation coefficients indicate MCAR, while high
correlation coefficients reflect MAR [Musil et al. 2002]. When the MCAR mechanism is
rejected, MAR or MNAR is assumed. Karangwa et al. [2016] describe methods that have
been developed to model MAR and MCAR data: single imputation methods such as mean
imputation, regression imputation and interpolation, and multiple imputation based methods
such as Multivariate Normal Imputation (MVNI) and Multiple Imputation by Chained
Equations (MICE).
2.3 Methods for Handling Missing Data
According to Petrozziello and Jordanov [2017], Random Forest (RF) is known as an
efficient algorithm for classification; however, it depends on the completeness of the dataset.
In many fields, common methods for dealing with missing values make use of estimation and
imputation approaches whose efficiency is tied to assumptions about the data. The strategy
of data imputation before classification is preferred, where one estimates and fills in the
missing values according to information from the existing dataset. The types of imputation
mentioned in the paper are: mean imputation, hot-deck imputation, K-Nearest Neighbours
imputation (KNNimpute), regression imputation, Bayesian estimation, and Expectation
Maximization (EM). The RF algorithm is designed for a complete dataset; however,
incomplete datasets are also common in classification problems. The experimental results on
different datasets showed that the RF algorithm is an outstanding method for solving the
classification problem on an incomplete dataset. Royston and others [2004] show that
hot-deck imputation may perform poorly when many rows of data have at least one missing
value. Troyanskaya et al. [2001] found that KNNimpute provides a more robust and sensitive
method for missing value estimation than Singular Value Decomposition impute
(SVDimpute); both methods outperform the commonly used row average method.
Medical data are likely to contain missing values due to reasons such as human error,
differing interpretations, and administrators' faults. In clinical diagnosis, machine learning
and data mining are common technologies used for analysis. However, machine learning
methods applied to data with a high volume of missing values will lead to a high error rate,
because they cannot estimate a high volume of missing data properly due to their univariate
nature. The issue to be taken into consideration is that such datasets often include missing
values, which reduces diagnostic accuracy [Nekouie and Moattar 2018]. The fear with
deleting data is that critical information might be lost, which would significantly affect
modelling and analytical results [Scheffer 2002].
Hlalele [2009] highlighted that imputing missing values has been an area of interest,
especially in the statistics community, because of the bias that missing values introduce into
results. Missing values have led to the development of models and methods that impute
missing data. Once the missing values are imputed, they are substituted by estimated values
such that the dataset can be analysed using techniques that require a complete dataset. It
should be expected that missing data will have an impact on data analysis and decision
making. The most common way of dealing with the problem of missing data is the
imputation of the missing information. Many machine learning techniques have also been
employed to handle missing data points. In that work, a hybrid missing data imputation
model was developed to impute missing values in datasets and improved to increase its
accuracy.
2.3.1 Listwise and Pairwise Deletion
Matta et al. [2017] show that many studies facing the challenge of missing values choose to
omit subjects with missing data completely. This method is usually called list-wise deletion
or complete case analysis, and it may result in an unacceptable level of bias. The temptation
to simply remove the missing values arises because there seems to be no way of knowing
what the missing values could have been. Missing data is theoretically challenging,
particularly for data analysis. Matta et al. [2017] present three general methods for
analysing incomplete datasets:

1. Likelihood-based (including Bayesian) - a semi-parametric method that allows the
analyst to specify the parameter-based model through estimating equations.

2. Multiple Imputation (MI) - a method that handles missing data in multivariate
analysis. Rubin (1977) was the first to propose multiple imputation for missing data.

3. Weighting - the process of adjusting the contribution of each observation in a survey
sample based on independent knowledge about appropriate distributions; after
weighting, no observation should have a weight of zero.
Vazifehdan et al. [2018] mention that removing missing data points using the list-wise
deletion method is acceptable, but only for a small proportion of missing values.
Karangwa et al. [2016] and Bailey et al. [1994] show that the traditional way to handle
missing values in a dataset is to eliminate them from the analysis through listwise deletion.
This strategy is the default in most statistical packages such as STATA, SPSS, and SAS.
Problems appear in the analysis stage when there is a lot of missing data. When the
proportion of missing data is high, listwise deletion reduces the sample size; as a result, a
sample that is not representative of the population is obtained, leading to reduced power of
statistical tests, biased parameter estimates, and large standard errors. The degree of missing
data has a negative impact on the data analysis when the missing values are excluded from
the analysis. Generally, no matter what the degree of missingness is, problems associated
with missing data will always arise.

As highlighted by Enders [2001], the analysis of missing data used to revolve around
listwise and pairwise deletion methods. Software packages have more recently introduced
methods other than deletion for treating missing values. A tiny illustration of the two
deletion styles follows.
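In the made-up example below, listwise deletion discards any row with a missing entry, while a pairwise-style computation uses every value that is available for each variable:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0],
                   "b": [2.0, 5.0, np.nan, 8.0]})

print(df.dropna())       # listwise: only the two fully observed rows survive
print(df["a"].mean())    # pairwise-style: uses all 3 observed values of a
print(df["b"].mean())    # ...and all 3 observed values of b
```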
2.3.2 Expectation Maximization
Missing values in discrete datasets can be imputed using Bayesian networks, and the
Expectation Maximization (EM) algorithm is one of the best methods for this. The
advantage of EM is that it can be trained with missing data, meaning it deals with missing
data quite easily. In recent years, machine learning and deep learning have improved the
imputation of missing values in many fields. Dempster et al. [1977] show that if each
iteration of an algorithm consists of an expectation step followed by a maximization step,
then the algorithm is an Expectation Maximization (EM) algorithm. The main purpose of
the EM algorithm is to provide an iterative computation of the maximum likelihood estimate
for data in which some variables are unobserved [Wu 1983; McLachlan and Krishnan 2007].
Enders [2001] indicated that the EM algorithm is an iterative two-step method in which the
missing values are assigned and the unknown parameters estimated. In the first step, usually
called the E-step, the missing values are substituted by the conditional expectation of the
missing data points given the observed data; in practice this means the missing data are
replaced by predictions calculated from regression equations. In the second step, called the
M-step, the maximum likelihood estimates of the mean and covariance matrix are calculated
using the values obtained in the E-step, as if the data were complete. The covariance and
regression coefficients calculated in the M-step are used to calculate the missing data
estimates in the next E-step, and the process iterates until the difference between covariance
matrices in successive M-steps falls below some specified convergence criterion [Enders 2001;
Borman 2004]. It should be noted that the EM algorithm is not used here to obtain direct
estimates of linear model parameters such as regression coefficients; rather, it calculates
maximum likelihood estimates of a mean vector and covariance matrix. Missing values in the
original dataset are then estimated and imputed using the regression equations generated
from the new covariance matrix [Dellaert 2002].
Understanding the difference between MCAR and MAR is critical. Listwise and pairwise
deletion can only yield unbiased estimates when missing data is MCAR, a condition that is
difficult to meet in practice. Maximum likelihood, on the other hand, yields unbiased
estimates under both MCAR and MAR. Even when the data is assumed to be MCAR,
maximum likelihood methods give more efficient parameter estimates than listwise and
pairwise deletion [Allison 2001]. Maximum likelihood is regarded as the method that gives
the least biased estimates under MAR. There are currently three maximum likelihood
estimation algorithms that can handle missing data: the multiple-group approach, full
information maximum likelihood estimation, and the EM algorithm [Allison 2001;
Enders 2001].
EM alternates between an E-step, which calculates an expectation of the likelihood by
including the missing variables as if they were observed, and a maximization step (M-step).
The challenges of the EM algorithm are the computational complexity of the E-step and the
M-step, and slow convergence: the EM algorithm can take more than the necessary number
of iterations to converge [Sundararajan 2016].
2.3.3 Single and Multiple Imputation
The study of Tai et al. [2016] aimed to compare two data imputation methods and to
provide a framework to evaluate the performance of imputed data. The two methods are the
Single Imputation (SI) technique and the Multiple Imputation (MI) technique. A complete
dataset was identified and randomly split into training and testing datasets. Using the
training set, regression-based single imputation was used to estimate the length of stay in
intensive care management, and multiple imputation was applied as well. The length-of-stay
distributions and the cross-validation metric, root Mean Squared Error (MSE), were
compared to determine which imputation performs better. MI performed better than single
imputation.
MI is one of the most applicable methods for dealing with missing values in multivariate
analysis [Abdella and Marwala 2005]. Little and Rubin [2014] have outlined the idea of MI
as follows (a small numerical sketch is given after the list):

1. Impute the missing values using a proper model that includes random variation.

2. Repeat this step n times, normally 3 to 5 times, creating n complete datasets.

3. Execute the desired analysis on each dataset using standard complete-data methods.

4. Average the parameter estimates over the n samples to obtain a single point estimate.

5. Calculate the standard errors by averaging the squared standard errors of the n
estimates, calculating the variance of the estimates across samples, and finally
combining the two quantities.
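A compact numerical sketch of these five steps, applied to estimating a simple mean, is given below. The normal imputation model and all numbers are assumptions made for the example; they are not taken from the cited texts.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(10.0, 2.0, size=500)
x[rng.random(500) < 0.2] = np.nan                  # 20% of the values go missing

obs = x[~np.isnan(x)]
n_mis = int(np.isnan(x).sum())
m, estimates, variances = 5, [], []
for _ in range(m):                                 # steps 1-2: impute m times
    filled = x.copy()
    # A proper imputation draw includes random variation, not just the mean.
    filled[np.isnan(filled)] = rng.normal(obs.mean(), obs.std(ddof=1), n_mis)
    estimates.append(filled.mean())                # step 3: analyse each complete copy
    variances.append(filled.var(ddof=1) / filled.size)

q_bar = np.mean(estimates)                         # step 4: pooled point estimate
w = np.mean(variances)                             # step 5: within-imputation variance,
b = np.var(estimates, ddof=1)                      # between-imputation variance,
se = np.sqrt(w + (1 + 1 / m) * b)                  # and their combination
print(round(q_bar, 3), round(se, 4))
```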
Multiple Imputation (MI) and Maximum Likelihood (ML) perform better than other
methods even when the missing values in the dataset are non-ignorable; the two methods
appear to be the best choices for handling missing data in most cases. The researchers found
that their proposed model was more than 95% accurate [Abdella and Marwala 2005].
Ramezani et al. [2017] show that most previous papers worked around the problem of
missing values by ignoring them, removing all records with missing values from the training
set. The researchers introduced a novel intelligent system for handling diabetic data with
missing values, using multiple imputation to increase the accuracy of the results. The
problem of missing values in high-dimensional data causes inaccurate results in classification
techniques, and missing values are a critical issue in the mathematical modelling of data.
Walczak and Massart [2001] mentioned that multiple imputation is another statistical
technique for analysing incomplete datasets to obtain missing values; orthogonal
transformation techniques were used to reduce the dimensionality of the input data.
Petrozziello and Jordanov [2017] investigated data imputation techniques for pre-processing
datasets with missing values. Real-world datasets contain missing values due to either sensor
failures or human errors. Petrozziello and Jordanov [2017] and Nelwamondo [2008]
mentioned that dealing with missing data is a very important step in data cleaning, since
machine learning, statistical analysis, and many other processes require complete datasets.
The overall accuracy is estimated by evaluating the estimates of missing values on a dataset.
To tackle the problem of missing values, Column-wise Guided Data Imputation (cGDI) was
proposed; cGDI selects the best model from a multitude of imputation techniques through a
learning process on the known data. In experiments on 13 publicly available datasets, cGDI
performed better.
2.3.4 Sampling Importance Resampling
Missing data are frequently encountered in science and engineering. Wang et al. [2017]
focus on the estimation of parameters in an estimating equation with non-ignorable missing
data. Sampling Importance Resampling (SIR) was proposed to calculate the conditional
expectation for non-respondents. It is well known that ignoring missing values can lead to
seriously biased parameter estimates if the distribution of respondents differs from that of
non-respondents. When dealing with MAR data, methods such as the parametric likelihood
method, imputation methods, and the inverse probability weighted method are suggested.
The SIR approach was found to be computationally demanding, especially in
high-dimensional cases.
2.4 Conclusion
Missing data is something that cannot easily be avoided; the best one can do is to reduce its
occurrence in trial design and conduct. Sensitivity analyses should be among the primary
steps taken in the analysis of the data, and considering the sensitivity of assumptions about
the missing data mechanism should be a compulsory component of the analysis. Listwise
and pairwise deletion should not be applied unless the proportion of missing data is small.
Single imputation can produce biased results if the proportion of missing data is larger than
5% [Stuart et al. 2009]. Hot-deck imputation is suggested as a better practical solution to
missing data problems [Myers 2011]. MI, which imputes each missing value multiple times,
is a powerful and flexible technique for dealing with missing data; its advantages are that it
is easy to implement and appropriate for large datasets.
The EM algorithm is a favoured method for estimating the parameters of Bayesian
networks in the presence of incomplete data. EM has become a useful tool in statistical and
mathematical analysis; it is very useful for probabilistic models and is simple to implement.
Many articles have highlighted that the EM algorithm produced unbiased results.
Chapter 3
Methodology
3.1 Introduction
This chapter consists of the methodology followed by this research. The sections below
explain how the research was conducted.

The figure below shows the steps that were followed to obtain the results. First, the BN
structure named 'bnet' (shown in the figure) was created to generate a random dataset; the
dataset has 4 rows and 10 000 columns, for a total of 40 000 generated data points. With the
40 000 data points, we start by hiding everything, showing 0% of the complete data set, and
name the result 'bnet0'. This implies that in the structure (distribution) bnet0 there is 100%
estimated data. Secondly, we show only 10% of the data and hide 90%, store it, create
another BN with that remaining 10% of the data, and name it bnet10. We continue in this
order until we hide nothing, which means we show 100% of the data, and name the structure
bnet100. For each bnet represented in the diagram, we estimate the missing values using the
EM algorithm. We want to see how well or badly the model performs as we continue losing
data; we will see this by plotting the KL divergence.
Figure 3.1: The diagram shows the BN structures
The last step is to calculate the KL divergence for each Bayesian network from bnet0 to
bnet100: using their estimated imputed values, we will see how different the resulting
distributions are compared with the original distribution that created the complete data set.
3.2 Ground Truth Bayesian Network

Ground truth is a basic term used in different fields that refers to information provided by
direct observation. In other words, ground truth is the process of collecting valid or provable
data.
3.3 Sampling the Data Set

Sampling a data set is the process of generating a data set from a Bayesian network. In this
study we create a BN structure with four correlated variables.
3.4 Separating the complete dataset into copies with missing components

After generating the data set, we split the data, keeping some of the data visible and hiding
the rest. At the first point we hide 100% of the data and remain with 0% of the data (an
empty data set); we estimate 100% of the values in the data set, store them in a Bayesian
structure, and create a distribution for that data set. The next step is to hide 90% of the
data and remain with only 10% of the original data set; we estimate 90% of the values, store
them in a Bayesian structure, and again create its own distribution. We continue with the
same procedure from hiding 100% of the data down to hiding 0% of the data. When we hide
0% of the data, we remain with 100% of the original data set, meaning we hide nothing, take
the sample as it is, and create its own Bayesian structure and distribution. A sketch of this
masking step is given below.
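A minimal sketch of this splitting step, assuming entries are hidden uniformly at random (the hide helper below is hypothetical, written only for this example):

```python
import numpy as np

def hide(data, pct_missing, seed=0):
    """Return a float copy of data with pct_missing% of its entries set to NaN."""
    rng = np.random.default_rng(seed)
    out = data.astype(float).copy()
    out[rng.random(out.shape) < pct_missing / 100.0] = np.nan
    return out

# A stand-in 4 x 10 000 data set of 1s and 2s, as in this study.
data = np.random.default_rng(0).integers(1, 3, size=(4, 10_000))
copies = {pct: hide(data, pct) for pct in range(0, 101, 10)}   # 0%, 10%, ..., 100%
print(np.isnan(copies[90]).mean())    # roughly 0.9
```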
3.5 Learning Bayesian networks from the missing datasets
This step, also called structure learning, is the process of using the observed data to learn
the links or paths of the BN. The structure of the network is determined by marginal and
conditional independence tests. In this section we discuss the methods used when learning
BNs. Learning a BN means learning the conditional probability distributions together with
the graphical model of dependencies. The objective of this research project is to find a
posterior distribution over models given the observed data set.
3.5.1 Parameter learning

Once the structure of the network has been determined, we determine the parameters.
Parameter learning is the process of using the data to learn the distributions of the BN. In
this research project we used a Bayesian network with the EM algorithm to perform
Maximum Likelihood (ML) estimation of the BN parameters; the EM algorithm learns the
parameters that are then used to estimate the missing values in the data set. To learn the
parameters from an incomplete data set, we had to turn to numerical optimization
techniques. We use the Bayesian approach to learn parameters by forming a posterior
distribution on the parameter space through Bayes's rule, and then use the expectation of the
parameters with respect to the posterior distribution. A sketch of this posterior-expectation
estimate is given below.
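For a binary variable with states {1, 2}, the posterior expectation just described has a simple closed form. A minimal sketch, assuming a Beta(a, b) prior on P(X = 2) (the prior values are illustrative):

```python
import numpy as np

def posterior_mean(observed, a=1.0, b=1.0):
    """Posterior expectation of P(X = 2) under a Beta(a, b) prior,
    ignoring NaN (missing) entries: (n2 + a) / (n + a + b)."""
    x = np.asarray(observed, dtype=float)
    x = x[~np.isnan(x)]
    n2 = np.sum(x == 2)
    return (n2 + a) / (x.size + a + b)

print(posterior_mean([1, 2, 2, np.nan, 2, 1]))   # (3 + 1) / (5 + 1 + 1) = 4/7
```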
3.5.2 Model learning
In the Bayesian approach, the entire posterior distribution over models is found by
integration and application of Bayes's rule. For model selection we used the Maximum a
Posteriori (MAP) model, which allowed us to avoid the normalizing constant.
3.6 Evaluating the learned Bayesian network using KL divergence
After learning the Bayesian network, which consists of parameter learning and model
learning, we evaluate the learned structure using the KL divergence. We stored the learned
BN for each percentage of missing data, from 100% to 0%, and each learned BN is used to
create its own distribution. After creating the different distributions, the final step is to
compare the estimated distributions with the original distribution that created the data set.
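For a small discrete network this comparison can be done exactly by enumerating the joint distribution implied by each set of CPTs. The sketch below assumes the hypothetical 4-variable structure from Section 1.1.1 together with made-up 'learned' parameters:

```python
import itertools
import numpy as np

def joint(p_x1, p_x2, p_x3, p_x4):
    """P(x1, x2, x3, x4) over all 16 binary configurations, states {1, 2}."""
    probs = []
    for x1, x2, x3, x4 in itertools.product((1, 2), repeat=4):
        p = p_x1 if x1 == 2 else 1 - p_x1
        p *= p_x2[x1] if x2 == 2 else 1 - p_x2[x1]
        p *= p_x3[x1] if x3 == 2 else 1 - p_x3[x1]
        p *= p_x4[x2] if x4 == 2 else 1 - p_x4[x2]
        probs.append(p)
    return np.array(probs)

true = joint(0.5, {1: 0.8, 2: 0.3}, {1: 0.6, 2: 0.1}, {1: 0.7, 2: 0.2})
learned = joint(0.48, {1: 0.75, 2: 0.33}, {1: 0.62, 2: 0.12}, {1: 0.68, 2: 0.22})
kl = float(np.sum(true * np.log(true / learned)))
print(kl)    # near zero: the learned distribution is close to the truth
```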
3.7 Motivation
Bayesian networks are flexible probabilistic models whose use is growing rapidly in many
fields, including genetics and genomics [Bae et al. 2016]. BNs were chosen in this research
project to generate the data because they can precisely manage the correlation of the
variables in the data set. The EM algorithm was chosen because of its strong performance
when estimating missing values in a data set [Stephens and Scheet 2005], and lastly, Solanki
et al. [2006] show that the KL divergence is the best method to use when comparing two
distributions.
Chapter 4
Results and Discussion
4.1 Introduction
This chapter describes the analysis of the imputation of missing values and the comparison
of distributions, using randomly generated data with 40 000 observations. The complete data
set was used to intentionally hide values at different percentages from 0% to 100%, after
which the missing values were estimated using the EM algorithm at each percentage step.
The different distributions were drawn and compared with the distribution that generated
the complete (original) data set.
4.2 Data analysis
A Bayesian Network (BN) structure was first created to randomly generate the data to be
analysed; it generated 40 000 observations. We used a BN structure to generate the data
because we wanted to precisely control the relationships or correlations in the simulated
data. The generated data set is discrete, consisting of 4 variables and 10 000 observations
(4 rows by 10 000 columns). Each variable follows a Bernoulli distribution, since the data is
composed of the values 1 and 2. Figure 4.1 below shows how the data was created.
Figure 4.1: The diagram shows the BN structure for generating the data set
Table 4.1 below shows a sample of our data set generated by the Bayesian network. It
consists only of 2s and 1s, which in our case represent True and False: 2 is True and 1 is
False.

Table 4.1: Sample of a data set

     Col1  Col2  Col3  ...  Col10000
1     1     1     2    ...     1
2     1     1     2    ...     2
3     2     2     2    ...     1
4     1     2     2    ...     2
From the generated data, the next step was to hide some of the data points according to the
percentage missing. The next table (Table 4.2) shows a sample in which only 10% of the
complete data set is observed, where NaN simply means the value is not observed.

Table 4.2: Sample with 10% of the data observed

     Col1  Col2  Col3  ...  Col10000
1    NaN    1    NaN   ...     1
2     1    NaN    2    ...    NaN
3     2    NaN   NaN   ...     1
4    NaN    2     2    ...    NaN
Data was shown from 0% to 100%: we start with an empty data set, showing 0% of the
data and estimating the whole data set, and continue until we show 100% of the data set.
From Table 4.2, we estimate the 90% missing values using the EM algorithm, fill in the
missing values with the estimated ones, and then generate a new BN structure. Figure 4.2
represents the new BN structure with its 90% estimated values.

Figure 4.2: The diagram constitutes 10% authentic data and 90% estimated data
The last step was to compare the distribution for each missing percentage with the original
distribution using the KL divergence. Distributions were drawn for each BN structure
created, including the first BN structure that generated the data. The distribution of the
second BN structure, which constitutes 90% estimated data, was compared with the
distribution that generated the complete data set, and the KL divergence was found to be
extremely high, meaning the distributions diverge strongly (they differ a lot). We continued
comparing distributions until we compared the distribution that generated the complete
data set with the one that has 0% estimated missing values, obtaining a very small (close to
zero) KL divergence, implying the distributions are almost the same.
Figure 4.3: The figure shows the KL divergence with missing data percentage
The plot shows the decrease in KL divergence as we learn from more observations. The KL
divergence shows that if one has about 40% latent values in a data set, the EM algorithm can
estimate them well: we observe from the plot that if we keep only 60% of the data set, the
estimated distribution still converges towards the original distribution. But it is shown that
the more values we estimate in the data, the more we lose the structure of the data. Again
from the plot we observe that if about 50% or more of the data is missing, the EM algorithm
may fail to estimate the missing values accurately. In particular, if only 10% of the data is
kept (meaning 90% of the data is missing), the structure of the data will be completely
different. A toy reproduction of this trend is sketched below.
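A self-contained toy version of the experiment behind this plot: on a small two-variable model, a growing share of one variable is hidden, the parameters are re-learned with EM as in Section 1.1.2, and the KL divergence to the true joint distribution is computed. The model, sample size, and masking scheme are assumptions; the sketch reproduces only the qualitative trend.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p_x = 400, 0.6
true_theta = {0: 0.2, 1: 0.9}                        # true P(Y = 1 | X = x)
x = (rng.random(n) < p_x).astype(int)
y_full = (rng.random(n) < np.where(x == 1, 0.9, 0.2)).astype(float)

def joint(px, th):
    """P(x, y) over the four binary configurations."""
    return np.array([(px if xv else 1 - px) * (th[xv] if yv else 1 - th[xv])
                     for xv in (0, 1) for yv in (0, 1)])

for pct in range(0, 100, 10):
    y = y_full.copy()
    y[rng.random(n) < pct / 100.0] = np.nan          # hide pct% of Y at random
    theta = {0: 0.5, 1: 0.5}
    for _ in range(100):                             # EM as in Section 1.1.2
        filled = np.where(np.isnan(y), np.where(x == 1, theta[1], theta[0]), y)
        theta = {v: filled[x == v].mean() for v in (0, 1)}
    p = joint(p_x, true_theta)
    q = np.clip(joint(x.mean(), theta), 1e-12, None)
    print(pct, round(float(np.sum(p * np.log(p / q))), 4))   # KL tends to grow with pct
```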
4.3 Discussion
It is observed that even if we use a sophisticated method of estimating missing data like the
EM algorithm, having a lot of missing or latent values, say 50% or more of the data set, will
not produce reliable results. We should have at most about 40, and certainly less than 50,
per cent missing values, because with more than 50% missing the estimated values will be
completely different from the true values. The figure below shows that with very few
observations, say around 250, the EM algorithm performed very badly, to the point that the
results cannot even be interpreted; it produced good results when we had about 40 000
observations.
Figure 4.4: The figure shows the KL divergence with missing data percentage and 50 observations
Chapter 5
Conclusion and Recommendation
5.1 Conclusion
The results of this study have shown that the EM algorithm can produce reliable estimates
of missing values when the data set is large and the percentage of missing data is not huge,
say less than 50 per cent. It performs well when we have around 30 000 or more observations
and less than 50% missing values. The KL divergence has shown that the more missing
values we estimate, say more than 50 per cent of the data set, the more we lose the structure
of the data. We observed on the KL divergence plot that the EM algorithm performs well
when estimating missing values for less than 50% of the data set. If one estimates a massive
share of missing values, say 80 or 90 per cent of the data, one will get misleading estimates
which lead to inaccurate results. Therefore it is advisable not to use the EM algorithm if
one has a massive amount of missing values in a data set.
5.2 Recommendations for Future Work
Firstly, for future work, instead of simply imputing or estimating missing values in the data
set using the EM algorithm, one should check how much data is missing. This is necessary
for methods like the EM algorithm, because it does not perform well for all kinds of missing
data; it has its limitations. One should also check, or take into consideration, the missing
data mechanisms such as MAR, MNAR and MCAR. Further investigation is also needed
into why the EM algorithm performs badly as the percentage of missing data increases.

The effect of the variables in the data set should be investigated, as well as the use of
another distribution, such as the multinomial distribution, instead of the Bernoulli
distribution. Another consideration for future work may be using more complex
independence assumptions and many more samples.
Bibliography

[Abdella and Marwala 2005] Mussa Abdella and Tshilidzi Marwala. The use of genetic
algorithms and neural networks to approximate missing data in database. In Com-
putational Cybernetics, 2005. ICCC 2005. IEEE 3rd International Conference on,
pages 207–212. IEEE, 2005.
[Agarwal and Tangirala 2017] Piyush Agarwal and Arun K Tangirala. Reconstruction of
missing data in multivariate processes with applications to causality analysis. In-
ternational Journal of Advances in Engineering Sciences and Applied Mathematics,
9(4):196–213, 2017.
[Allison 2001] Paul D Allison. Missing data, volume 136. Sage publications, 2001.
[Azadeh et al. 2013] Ali Azadeh, SM Asadzadeh, R Jafari-Marandi, S Nazari-Shirkouhi,
G Baharian Khoshkhou, Sahar Talebi, and Arash Naghavi. Optimum estima-
tion of missing values in randomized complete block design by genetic algorithm.
Knowledge-Based Systems, 37:37–47, 2013.
[Bae et al. 2016] Harold Bae, Stefano Monti, Monty Montano, Martin H Steinberg,
Thomas T Perls, and Paola Sebastiani. Learning bayesian networks from correlated
data. Scientific reports, 6:25156, 2016.
[Bailey et al. 1994] Timothy L Bailey, Charles Elkan, et al. Fitting a mixture model by
expectation maximization to discover motifs in biopolymers. 1994.
[Balakrishnan et al. 2017] Sivaraman Balakrishnan, Martin J Wainwright, Bin Yu, et al.
Statistical guarantees for the em algorithm: From population to sample-based analy-
sis. The Annals of Statistics, 45(1):77–120, 2017.
[Borman 2004] Sean Borman. The expectation maximization algorithm-a short tutorial.
Submitted for publication, pages 1–9, 2004.
[Burge and Lane ] John Burge and Terran Lane. Selecting bayesian network parameteri-
zations for generating simulated data.
[Dellaert 2002] Frank Dellaert. The expectation maximization algorithm. Technical report,
2002.
[Dempster et al. 1977] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum
likelihood from incomplete data via the em algorithm. Journal of the royal statistical
society. Series B (methodological), pages 1–38, 1977.
[Dong and Peng 2013] Yiran Dong and Chao-Ying Joanne Peng. Principled missing data
methods for researchers. SpringerPlus, 2(1):222, 2013.
[Enders 2001] Craig K Enders. A primer on maximum likelihood algorithms available for
use with missing data. Structural Equation Modeling, 8(1):128–141, 2001.
[Ghahramani and Jordan 1995] Zoubin Ghahramani and Michael I Jordan. Learning from
incomplete data. 1995.
[Griffiths and Yuille 2008] T Griffiths and Alan Yuille. A primer on probabilistic infer-
ence. The probabilistic mind: Prospects for Bayesian cognitive science, pages 33–
57, 2008.
[Hastie et al. 2009] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. Unsupervised
learning. In The elements of statistical learning, pages 485–585. Springer, 2009.
[Hershey and Olsen 2007] John R Hershey and Peder A Olsen. Approximating the kull-
back leibler divergence between gaussian mixture models. In Acoustics, Speech
and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, vol-
ume 4, pages IV–317. IEEE, 2007.
[Hlalele 2009] Nthabiseng Unathi Hlalele. The impact of missing data imputation on HIV
classification. PhD thesis, 2009.
[Karangwa et al. 2016] Innocent Karangwa, Danelle Kotze, Renette Blignaut, et al. Mul-
tiple imputation of unordered categorical missing data: A comparison of the multi-
variate normal imputation and multiple imputation by chained equations. Brazilian
Journal of Probability and Statistics, 30(4):521–539, 2016.
[Little and Rubin 2014] Roderick JA Little and Donald B Rubin. Statistical analysis with
missing data, volume 333. John Wiley & Sons, 2014.
[Matta et al. 2017] Tyler H Matta, John C Flournoy, and Michelle L Byrne. Making an
unknown unknown a known unknown: Missing data in longitudinal neuroimaging
studies. Developmental cognitive neuroscience, 2017.
[McLachlan and Krishnan 2007] Geoffrey McLachlan and Thriyambakam Krishnan. The
EM algorithm and extensions, volume 382. John Wiley & Sons, 2007.
[Musil et al. 2002] Carol M Musil, Camille B Warner, Piyanee Klainin Yobas, and Su-
san L Jones. A comparison of imputation techniques for handling missing data.
Western Journal of Nursing Research, 24(7):815–829, 2002.
[Myers 2011] Teresa A Myers. Goodbye, listwise deletion: Presenting hot deck imputa-
tion as an easy and effective tool for handling missing data. Communication Methods
and Measures, 5(4):297–310, 2011.
[Nancy et al. 2017] Jane Y Nancy, Nehemiah H Khanna, and Kannan Arputharaj. Im-
puting missing values in unevenly spaced clinical time series data to build an effec-
tive temporal classification framework. Computational Statistics & Data Analysis,
112:63–79, 2017.
[Neal and Hinton 1998] Radford M Neal and Geoffrey E Hinton. A view of the em algo-
rithm that justifies incremental, sparse, and other variants. In Learning in graphical
models, pages 355–368. Springer, 1998.
[Nekouie and Moattar 2018] Atefeh Nekouie and Mohammad Hossein Moattar. Missing
value imputation for breast cancer diagnosis data using tensor factorization improved
by enhanced reduced adaptive particle swarm optimization. Journal of King Saud
University-Computer and Information Sciences, 2018.
[Nelwamondo et al. 2007] Fulufhelo V Nelwamondo, Shakir Mohamed, and Tshilidzi
Marwala. Missing data: A comparison of neural network and expectation maxi-
mization techniques. Current Science, pages 1514–1521, 2007.
[Nelwamondo 2008] Fulufhelo Vincent Nelwamondo. Computational intelligence tech-
niques for missing data imputation. PhD thesis, 2008.
[Papageorgiou et al. 2018] Grigorios Papageorgiou, Stuart W Grant, Johanna JM Takken-
berg, and Mostafa M Mokhles. Statistical primer: how to deal with missing data in
scientific research? Interactive cardiovascular and thoracic surgery, 2018.
[Pearl 1998] Judea Pearl. Graphical models for probabilistic and causal reasoning. In
Quantified representation of uncertainty and imprecision, pages 367–389. Springer,
1998.
[Petrozziello and Jordanov 2017] Alessio Petrozziello and Ivan Jordanov. Column-wise
guided data imputation. Procedia Computer Science, 108:2282–2286, 2017.
[Raghunathan et al. 2001] Trivellore E Raghunathan, James M Lepkowski, John
Van Hoewyk, and Peter Solenberger. A multivariate technique for multiply imput-
ing missing values using a sequence of regression models. Survey methodology,
27(1):85–96, 2001.
[Ramezani et al. 2017] Rohollah Ramezani, Mansoureh Maadi, and Seyedeh Malihe
Khatami. A novel hybrid intelligent system with missing value imputation for di-
abetes diagnosis. Alexandria Engineering Journal, 2017.
[Royston and others 2004] Patrick Royston et al. Multiple imputation of missing values.
Stata journal, 4(3):227–41, 2004.
[Schafer and Graham 2002] Joseph L Schafer and John W Graham. Missing data: our
view of the state of the art. Psychological methods, 7(2):147, 2002.
[Schafer 1997] Joseph L Schafer. Analysis of incomplete multivariate data. Chapman and
Hall/CRC, 1997.
[Scheffer 2002] Judi Scheffer. Dealing with missing data. 2002.
[Schneider 2001] Tapio Schneider. Analysis of incomplete climate data: Estimation of
mean values and covariance matrices and imputation of missing values. Journal of
climate, 14(5):853–871, 2001.
[Solanki et al. 2006] Kaushal Solanki, Kenneth Sullivan, Upamanyu Madhow, BS Man-
junath, and Shivkumar Chandrasekaran. Provably secure steganography: Achieving
zero kl divergence using statistical restoration. In Image Processing, 2006 IEEE
International Conference on, pages 125–128. IEEE, 2006.
[Stephens and Scheet 2005] Matthew Stephens and Paul Scheet. Accounting for decay
of linkage disequilibrium in haplotype inference and missing-data imputation. The
American Journal of Human Genetics, 76(3):449–462, 2005.
[Stuart et al. 2009] Elizabeth A Stuart, Melissa Azur, Constantine Frangakis, and Philip
Leaf. Multiple imputation with large data sets: a case study of the children’s mental
health initiative. American journal of epidemiology, 169(9):1133–1139, 2009.
[Sundararajan 2016] Priya Krishnan Sundararajan. Improving the performance and under-
standing of the expectation maximization algorithm: Evolutionary and visualization
methods. 2016.
[Tai et al. 2016] M Tai, E Onukwugha, et al. Data imputation for missing values in a
claim-based administrative database: Comparison of imputation approaches. Value
in Health, 19(7):A855, 2016.
[Troyanskaya et al. 2001] Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat
Brown, Trevor Hastie, Robert Tibshirani, David Botstein, and Russ B Altman. Miss-
ing value estimation methods for dna microarrays. Bioinformatics, 17(6):520–525,
2001.
[Vazifehdan et al. 2018] Mahin Vazifehdan, Mohammad Hossein Moattar, and Mehrdad
Jalali. A hybrid bayesian network and tensor factorization approach for missing
value imputation to improve breast cancer recurrence prediction. Journal of King
Saud University-Computer and Information Sciences, 2018.
[Walczak and Massart 2001] Beata Walczak and Desire L Massart. Dealing with missing
data: Part ii. Chemometrics and Intelligent Laboratory Systems, 58(1):29–42, 2001.
[Wang et al. 2017] Xiuli Wang, Yunquan Song, and Lu Lin. Handling estimating equation
with nonignorably missing data based on sir algorithm. Journal of Computational
and Applied Mathematics, 326:62–70, 2017.
[Wu 1983] CF Jeff Wu. On the convergence properties of the em algorithm. The Annals
of statistics, pages 95–103, 1983.
[Xu and Jordan 1996] Lei Xu and Michael I Jordan. On convergence properties of the em
algorithm for gaussian mixtures. Neural computation, 8(1):129–151, 1996.
[Zarate et al. 2006] Luis E Zarate, Bruno M Nogueira, Tadeu RA Santos, and Mark AJ
Song. Techniques for missing value recovering in imbalanced databases: Appli-
cation in a marketing database with massive missing data. In Systems, Man and
Cybernetics, 2006. SMC’06. IEEE International Conference on, volume 3, pages
2658–2664. IEEE, 2006.
[Zha et al. 2013] Yong Zha, Ali Song, Chuanyong Xu, and Honglin Yang. Dealing with
missing data based on data envelopment analysis and halo effect. Applied Mathe-
matical Modelling, 37(9):6135–6145, 2013.
[Zhang 2006] Yin Zhang. When is missing data recoverable? Technical report, 2006.