optimization based robust methods in data...
TRANSCRIPT
OPTIMIZATION BASEDROBUST METHODS IN DATA ANALYSIS
WITH APPLICATIONS TO BIOMEDICINE AND ENGINEERING
By
NAQEEBUDDIN MUJAHID SYED
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOLOF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OFDOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2013
Dedicated tomy beloved mother,
memories of my father,and all of my dearest siblings,
who taught me to believe in myself...
3
ACKNOWLEDGMENTS
All praise is due to Allah (S.W.T) for His kindest blessings on me and all the
members of my family. I feel privileged to glorify His name in sincerest way through
this small accomplishment. I ask for His blessings, mercy and forgiveness all the time. I
sincerely ask Him to accept this meager effort as an act of worship. May the peace and
blessings of Allah (S.W.T) be upon His dearest prophet, Muhammad (S.A.W).
I would like to express my profound gratitude and appreciation to my advisor Prof.
Panos M. Pardalos, for his consistent help, guidance and attention that he devoted
throughout the course of this work. He is always kind, understanding and sympathetic
to me. His valuable suggestions and useful discussions made this work interesting to
me. I am also very grateful to Prof. Jose C. Principe for his immense help and insightful
discussions on the topics presented in the thesis. Sincere thanks go to my thesis
committee members Dr. Joseph Geunes, Dr. Jean-Philippe P. Richard for their interest,
cooperation and constructive advice. I would also like to thank Dr. Pando Georgiev for
hours of friendly discussion and constructive advice. Special thanks to Dr. Ilias Kotsireas
and Dr. James C. Sackellares for their valuable discussions.
I would like to thank the University of Florida and the Industrial and Systems
Engineering Dept. for providing me an opportunity to pursue PhD under the esteemed
program. I would like to thank all the staff members at the ISE Dept., my Weil 401
friends, and the staff at the international center for there friendly guidance, and warm
support throughout my study at UFL. Special thanks to Br. Ammar for making my stay in
Gainesville memorable.
Last but not least, I humbly offer my sincere thanks to my mother for her incessant
inspiration, blessings and prayers, and to my father for his indelible memories filled
with love and care. I owe a lot to my brothers S.N. Jaweed and S.N. Majeed, and my
sisters Nasreen, Shaheen, Tahseen, Yasmeen and Afreen for their unrequited support,
encouragement, blessings and prayers.
4
TABLE OF CONTENTS
page
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
CHAPTER
1 ROBUST METHODS IN DATA ANALYSIS . . . . . . . . . . . . . . . . . . . . . 12
1.1 Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141.2 Motivation and Significance . . . . . . . . . . . . . . . . . . . . . . . . . . 181.3 Robust Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191.4 Scope and Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2 ROBUST MEASURES AND ALGORITHMS . . . . . . . . . . . . . . . . . . . . 22
2.1 Traditional Robust Measures . . . . . . . . . . . . . . . . . . . . . . . . . 222.2 Proposed Entropy Based Measures . . . . . . . . . . . . . . . . . . . . . 242.3 Minimization of Correntropy Cost . . . . . . . . . . . . . . . . . . . . . . . 252.4 Minimization of Error Entropy . . . . . . . . . . . . . . . . . . . . . . . . . 382.5 Minimization of Error Entropy with Fiducial Points . . . . . . . . . . . . . . 412.6 Traditional Robust Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 442.7 Proposed Robust Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 462.8 Discussion on the Robust Methods . . . . . . . . . . . . . . . . . . . . . . 47
3 ROBUST DATA CLASSIFICATION . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 493.2 Traditional Classification Methods . . . . . . . . . . . . . . . . . . . . . . . 663.3 Proposed Classification Methods . . . . . . . . . . . . . . . . . . . . . . . 69
4 ROBUST SIGNAL SEPARATION . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.1 Signal Separation Problem . . . . . . . . . . . . . . . . . . . . . . . . . . 774.2 Traditional Sparsity Based Methods . . . . . . . . . . . . . . . . . . . . . 804.3 Proposed Sparsity Based Methods . . . . . . . . . . . . . . . . . . . . . . 86
5 SIMULATIONS AND RESULTS . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.1 Cauchy and Skew Normal Data . . . . . . . . . . . . . . . . . . . . . . . . 1065.2 Real World Binary Classification Data . . . . . . . . . . . . . . . . . . . . 1075.3 Comparison Among ANN Based Methods . . . . . . . . . . . . . . . . . . 1085.4 ANN and SVM Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5
5.5 Linear Mixing EEG-ECoG Data . . . . . . . . . . . . . . . . . . . . . . . . 1105.6 fMRI Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1135.7 MRI Scans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1155.8 Finger Prints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1165.9 Zip Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1175.10 Ghost Effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1185.11 Hyperplane Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1195.12 Robust Source Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6 SUMMARY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
6.1 Criticism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1516.2 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
APPENDIX: GENERALIZED CONVEXITY . . . . . . . . . . . . . . . . . . . . . . . 155
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
6
LIST OF TABLES
Table page
3-1 Binary classification proposed methods . . . . . . . . . . . . . . . . . . . . . . 72
5-1 Binary classification case study 1 . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5-2 Cauchy data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5-3 Skew data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5-4 Binary classification case study 2 . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5-5 Sample based performance of ANN on PID data . . . . . . . . . . . . . . . . . 123
5-6 Block based performance of ANN on PID data . . . . . . . . . . . . . . . . . . 124
5-7 Sample based performance of ANN on BLD data . . . . . . . . . . . . . . . . . 125
5-8 Block based performance of ANN on BLD data . . . . . . . . . . . . . . . . . . 126
5-9 Sample based performance of ANN on WBC data . . . . . . . . . . . . . . . . 127
5-10 Block based performance of ANN on WBC data . . . . . . . . . . . . . . . . . . 128
5-11 Performance of ACS for different values of σ and number of PEs in hiddenlayer on PID data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5-12 Performance of ACS for different values of σ and number of PEs in hiddenlayer on BLD data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5-13 Performance of ACS for different values of σ and number of PEs in hiddenlayer on WBC data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5-14 Linear mixing assumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5-15 Average unmixing error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5-16 Standard deviation unmixing error . . . . . . . . . . . . . . . . . . . . . . . . . 130
5-17 Simulation-1 results for case study 2 . . . . . . . . . . . . . . . . . . . . . . . . 130
5-18 Simulation-2 results for case study 2 . . . . . . . . . . . . . . . . . . . . . . . . 130
5-19 Performance of correntropy minimization algorithm . . . . . . . . . . . . . . . . 130
A-1 Generalized convexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
7
LIST OF FIGURES
Figure page
3-1 Correntropic, quadratic and 0-1 loss functions . . . . . . . . . . . . . . . . . . . 73
3-2 Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4-1 Cocktail party problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4-2 BSS setup for human brain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4-3 Overview of different approaches to solve the BSS problem . . . . . . . . . . . 99
4-4 Original example source S ∈ R3×80 . . . . . . . . . . . . . . . . . . . . . . . . 103
4-5 Mixed data X ∈ R2×80 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4-6 Processed data X ∈ R2×80 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4-7 Algorithm 4.2 description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5-1 Global view of Cauchy data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
5-2 Local view Cauchy data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
5-3 Skew normal data with noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
5-4 Performance of SVM on PID data . . . . . . . . . . . . . . . . . . . . . . . . . . 134
5-5 Performance of SVM on BLD data . . . . . . . . . . . . . . . . . . . . . . . . . 135
5-6 Performance of SVM on WBC data . . . . . . . . . . . . . . . . . . . . . . . . . 136
5-7 EEG recordings from monkey . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
5-8 ECoG recordings from monkey . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
5-9 fMRI data visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
5-10 Convex hull PPC1 assumption . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
5-11 Mixing and unmxing of MRI scans . . . . . . . . . . . . . . . . . . . . . . . . . 140
5-12 Mixing and unmxing of finger prints . . . . . . . . . . . . . . . . . . . . . . . . . 141
5-13 Mixing and unmxing of zip codes . . . . . . . . . . . . . . . . . . . . . . . . . . 142
5-14 Mixing and unmxing of ghost effect . . . . . . . . . . . . . . . . . . . . . . . . . 143
5-15 Original sparse source (normalized) for case study 1 . . . . . . . . . . . . . . . 144
5-16 Given mixtures of sources for case study 1 . . . . . . . . . . . . . . . . . . . . 145
8
5-17 Original mixing matrix for case study 1 . . . . . . . . . . . . . . . . . . . . . . . 146
5-18 Mixing matrices for case study 1 . . . . . . . . . . . . . . . . . . . . . . . . . . 146
5-19 Recovered mixing matrix for case study 1 . . . . . . . . . . . . . . . . . . . . . 146
5-20 Recovered source (normalized) for case study 1 . . . . . . . . . . . . . . . . . 147
5-21 Data for source extraction method . . . . . . . . . . . . . . . . . . . . . . . . . 148
5-22 Recovery of sources by quadratic and correntropy loss . . . . . . . . . . . . . . 149
9
Abstract of Dissertation Presented to the Graduate Schoolof the University of Florida in Partial Fulfillment of theRequirements for the Degree of Doctor of Philosophy
OPTIMIZATION BASEDROBUST METHODS IN DATA ANALYSIS
WITH APPLICATIONS TO BIOMEDICINE AND ENGINEERING
By
Naqeebuddin Mujahid Syed
August 2013
Chair: Panos M. PardalosMajor: Industrial and Systems Engineering
Analysis of a complex system as a whole, and the limitations of traditional statistical
analysis led towards the search of robust methods in data analysis. In the current
information age, data driven modeling and analysis forms a core research element of
many scientific research disciplines. One of the primary concerns in the data analysis is
the treatment of data points which do not show the true behavior of the system (outliers).
The aim of this dissertation is to develop optimization based methods for data analysis
that are insensitive and/or resistant towards the outliers. Generally, such methods
are termed as robust methods. In this dissertation, our approach will be different
from the conventional uncertainty based robust optimization approaches. The goal
is to develop robust methods that include robust algorithms and/or robust measures.
Specifically, applicability of an information theoretic learning measure based on entropy
called correntropy is highlighted. Some crucial theoretical results on the optimization
properties of correntropy and related measures are proved. Optimization algorithms
for correntropy are developed for both parametric and non-parametric frameworks. A
second order triggered algorithm is developed, which minimizes the correntropic cost
on a parametric framework. For the case of non-parametric framework, the usage
of convolution smoothing and simulated annealing based algorithms is proposed.
Furthermore, a modified Randomized Sampling Consensus (RANSAC) based robust
10
algorithm is also proposed. The performance of the proposed approaches is illustrated
by case studies on the data related to biomedical and engineering areas, with the
objective of binary classification and signal separation.
11
CHAPTER 1ROBUST METHODS IN DATA ANALYSIS
Understanding the underlying mechanism of a real world system is the basic goal
of many scientific research disciplines. Typical questions related to the system like “How
does it work?” or “What will happen if this or that is changed in the system?” are to be
answered for successful progress of the scientific research. This particular research
element has been revolutionized by the methods of experimentation and statistical
analysis. In fact, prior to statistical analysis and experimentation, deductive logic was
typically used in understanding of the system, which had tremendous limitations.
The concept of hypothesis testing can be solely attributed to statistical analysis and
experimentation. Nowadays, obtaining information from data is one of the prevalent
research areas of science and engineering. However, as the curiosity to study the real
world complex systems as a whole increased over the time, traditional statistical analysis
methods had proven to be inefficient.
Statistical methods dictated the theory of analyzing the data, until Tukey [98]
revolutionized the ideology of analyzing the experimental data. He differentiated
the term “data analysis” from “statistical analysis” by stating that the former can be
considered as science, but the later is subjective upon the statistician’s approach (i.e.,
either mathematics or science, but not both). Supporting Tukey’s ideology, Huber [49]
encouraged the usage of term data analysis, as the other term is often interpreted in an
overly narrow sense (restricted to mathematics and probability). Thus, the seminal work
of Tukey [97, 98] enlarged the scope of data analysis from mere statistical inference to
something more.
In simple terms, the key idea of data analysis approach is to propose some
analytical or mathematical model that represents the underlying mechanism of the
system under consideration. The proposed model can be specific (parametric) to the
system or can be a general (nonparametric) model. Both parametric and nonparametric
12
models have some parameters to tune. The parameters are tuned based on the
observed data collected from the system (experimentation). The process of identifying
the model parameter is called as parameter estimation.
The basic idea involved in model parameter estimation is to estimate the model
parameters by minimizing the error between the estimated output from the model
and the desired response. The error by definition is merely a difference between the
output and the response. The error measure (or the worth of an error value) plays a
very crucial role in the estimation of the model parameters. Typically, when the error is
assumed to be Gaussian, the Mean Square Error (MSE) criterion is equivalent to the
minimization of error energy (or variance). It is well know that, under Gaussianity
assumption, MSE leads to the best possible parameter estimation (a maximum
likelihood solution). However, parameter estimation issues related to nonlinear and
non-gaussian error call for costs other than MSE [15].
Higher order statistics can be used to deal with non-gaussian errors. Statistically,
MSE minimizes the second order moment of the error. In order to transfer as much
information as possible from the system’s responses to the model parameters, second
and higher order moments (kurtosis, or cumulants) of error should be minimized.
However, the most important drawback of using higher order statistics is that they are
very sensitive to outliers.
The objective of Chapter 1 is to introduce the topic of the dissertation. Section 1.1
presents an simplistic introduction to the notion of data analysis. Section 1.2 highlights
the significance of the robust data analysis by presenting a motivating anecdote from
the literature, and highlights its relevance to the practical engineering and biomedical
scenarios. In Section 1.3, the overview of the robust data analysis ideology is presented,
and the specific approaches that will be implemented and developed in this work will be
clearly stated. The scope and objective of this work is presented in Section 1.4.
13
1.1. Data Analysis
Data analysis is an interdisciplinary field, including statistics, database design,
machine learning and optimization. It can be defined in simple terms as “the process of
extracting knowledge from a raw data set by any means”. Approaches in data analysis
vary depending upon the type of data, the objective of the analysis, the availability of
computational time and resources, and the familiarity (or inclination) of the researcher
towards a specific approach. Thus, there are plethora of data analysis methods,
including parametric and non-parametric framework with exact and heuristic algorithms.
However, a data analysis approach can be schematically specified based on some
prominent elements of data analysis. In general, the elements of data analysis can be
structured into the following six sequential steps:
Objective. The first and most important step in data analysis is the objective of the
analysis. It should be well defined and clear in nature. Based on the objective, the later
steps are customized. Typically, the objectives may involve one or more than one (or a
combination) of the following major criteria:
• Regression: Literally the term ‘Regression’ means a return to formal or primitivestate. Statistical regression involves the idea of finding an underlying primitiverelationship between the causal variables and the effect variables. Moreover, theunstated basic assumption in statistical regression is that all the data belongs to asingle class.
• Classification: Literally ‘Classification’ means a process of classifying somethingbased on shared characteristics. Statistical classification is a supervised learningmethod that involves classifying uncategorized data based on the knowledge ofcategorized data. The class label for the categorized data is known. Whereas, theclass label for the uncategorized data is unknown. The unstated basic assumptionin statistical classification is that an uncategorized data point should be assigned toexactly one of the class labels.
• Clustering: Literally ‘Clustering’ means congregating things together based ontheir particular characteristics. Statistical clustering is an unsupervised learningmethod which aims to cluster data based on defined nearness measure. It involvesmultiple classes, and for each class an underlying relationship is to be found.Ideally, there is no prior knowledge available about the data classes. However,
14
some of the clustering methods assume that the information of the total number ofdata classes are known a priori.
Data Representation. Data is nothing but stored and/or known facts. Data comes
in different forms and representations. It can represent a qualitative or quantitative
fact (in the form of numbers, text, patterns or categories). Based on the objective of
data analysis, a suitable data representation should be selected. A generalized way
to represent data is in the form of an n × p matrix, also known as ‘flat representation’.
Typically, the rows (records, individual, entities, cases or objects) represent data points,
and for each data point a column (attribute, feature, variable or field) represents a
measurement value. However, depending upon the context, the interpretation of rows
and columns may interchange.
Knowledge Representation. The extracted knowledge can be represented in
the form of relationships (between inputs and outputs) and/or summaries (novel ways
to represent same data). The way of representing the relationships (or summaries)
depends upon the field of research, and the final audience (i.e., it should be novel, but
more importantly understandable to the reader). The relationships or summaries (often
referred to as models or patterns) can be represented but not limited to following forms:
Linear equations, Graphs, Trees, Rules, Clusters, Recurrent patterns, Neural networks,
etcetera. Typically, the type of representation for relationships/summaries should be
selected before analyzing the data.
Loss Function. The loss function is a measure function that accounts for the error
between the predicted output and actual output. It is also known as penalty function or
cost function. The selection or design of loss function depends upon two main criteria.
Firstly, it should appropriately reflect the error between the predicted output and actual
output. Secondly, the loss function should be easily incorporable inside an optimization
algorithm. In addition to that, given an instance of predicted output and actual output,
the loss function should give the error value in polynomial time. The longer it takes to
15
calculate the error, the lesser is the efficiency of the optimization algorithm. There are
two main classical loss functions, namely: absolute error, mean square error. Typically,
the mean square error (commonly known as quadratic loss function) is used often as a
loss function.
Optimization Algorithm. The knowledge representation, selected a priori, is
trained (using an optimization algorithm) on the data set to minimize the loss function.
Thus, this assures that the represented knowledge aptly imitates the real system
(the source or generator of the data set). Such training algorithms, also known as
learning algorithms, are based on some optimization methods. Classically, a parametric
representation is encouraged, and is accompanied by an exact optimization method.
Although, a parametric representation requires in depth knowledge of the given data
set, parametric methods were given superiority over non-parametric methods due to
the existence of efficient exact optimization methods. Moreover, exact solution methods
are suitable for a limited class of parametric representations, thus they limit the scope of
knowledge representation. Recent developments in the use of non-parametric methods
like artificial neural networks have widened the scope of knowledge representation.
However, due to the use of exact methods, they have not been utilized to their full
potential. Lately, due to the development in heuristic optimization methods, the use of
non-parametric methods have become desirable and enlarged the scope of knowledge
representation.
Validation. This is typically the last step in the data analysis. The key purpose
of this step is to justify the output (estimated parameters) obtained from the earlier
steps. Experts on the problem specific domain are consulted to verify and validate the
results. However, expert opinion may not always be available. Hence, cross validations
methods are developed. There are several cross validation methods that are based
on the concept of training and testing. The idea is to divide the given data set into two
subgroups called training and testing sets. Data analysis is conducted on the training
16
data set, and the model’s performance is calibrated using the testing data set. Generally,
the size of training set is greater than testing set. Next, three most common methods of
cross validation are described:
• k-fold Cross Validation (kCV): In this method, the dataset is partitioned in k equallysized groups of samples (folds). In every cross validation iteration, k-1 folds areused for the training and 1 fold is used for the testing. In the literature, usually ktakes a value from 1, ... , 10.
• Leave One Out Cross Validation (LOOCV): In this method, each sample representsone fold. Particularly, this method is used when the number of samples are small,or when the goal of classification is to detect outliers (samples with particularproperties that do not resemble the other samples of their class).
• Repeated Random Sub-sampling Cross Validation (RRSCV): In this method, thedataset is partitioned into two random sets, namely training set and validation (ortesting) set. In every cross validation, the training set is used to train the model,and the testing (or validation) set to test the accuracy of the model. This method ispreferred if there are large number of samples in the data. The advantage of thismethod (over k-fold cross validation) is that the proportion of the training set andnumber of iterations are independent. However, the main drawback of this methodis if few cross validations are performed, then some observations may never beselected in the training phase (or the testing phase). Whereas others may beselected more than once in the training phase (or the testing phase respectively).To overcome this difficulty, the model is cross validated sufficiently large number oftimes, so that each sample is selected at least once for training as well as testingthe model. These multiple cross validations also exhibit Monte-Carlo variation(since the training and testing sets are chosen randomly).
Among the above stated steps, knowledge representation, loss function and
the optimization algorithm form the crux of the data analysis. Traditional approaches
of data analysis were based on statistical principles, and were termed as statistical
analysis. A typical assumption in the traditional approaches includes the availability
of the knowledge of the data distribution or ability to perfectly learn the distribution
from the infinite length data. Thus, either the data is assumed to be perfect, or the
filter methods are developed to remove the noise from the data before conducting the
statistical analysis. However, filter methods are based on assumptions, and require data
17
specific knowledge. Therefore, the statistical analysis performs well theoretically but has
limitations for most of the practical scenarios.
1.2. Motivation and Significance
From traditional statistical analysis to the contemporary data analysis, one of the
key analysis elements that has remained unchanged is the optimization based approach
in extracting knowledge from the data. The efficiency of optimization methods are in turn
dependent upon the type of the objective function and the feasible space. Furthermore,
the solution quality (local or global best) of data analysis methods also depends upon
the objective function and the feasible space. Existence of outliers (or noise) often taint
the solution space. Hence, practical data analysis calls for methods in data analysis that
are insensitive or resistant to the outliers.
Determining similarity between data samples using an appropriate measure has
been the key issue in the analysis of experimental data. The importance of robust
methods in data analysis can be traced back to the old famous dispute between Fisher
and Eddington. Based on practical observations, Eddington [25] proposed the suitability
of the absolute error as an appropriate measure. Fisher [30] countered the idea of
Eddington by theoretically showing that under “ideal circumstances” (errors are normally
distributed, and outliers free data) the mean square error is better than the absolute
error. The dispute between Eddington and Fisher actually played a prominent role in
shaping the theory of statistical analysis. After Fisher’s illustration, many researchers
incorporated mean square error as a default similarity measure in their analysis. Tukey
[97] reasoned that occurrence of the ideal circumstances for practical scenarios is very
rare. Huber [48] further showed that noise as less as 0.2%, which is ideal for many
practical data, will favor the usage of absolute error instead of mean square error.
Although Tukey’s paper highlighted the importance of robust measures like the absolute
error, the prevalence of mean square error in data analysis can be solely attributed
to its convex, continuous and differentiable nature. There have been explicit studies
18
[40, 47, 48] on the research and development of robust measures, under the preamble
of robust statistics.
The traditional statistical analysis methods were strictly dependent upon theoretical
assumptions like,
• ideal circumstances: Errors are normally distributed.• distributional assumptions: Distribution of data can be learned (or available).• sensitivity assumptions: Small deviations in distribution result in minor changes.• smoothing assumptions: Effect of few outliers gets faded out w.r.t bulk data.
Tukey [97] suggested that in the practical scenarios, the assumptions are hardly true
and barely verifiable. In fact, the assumptions are more or less assumed to be true for
mathematical convenience. The assumptions were justified by vague stability principles
that minor changes should result in small error in the ultimate conclusion. On the
contrary, Huber [47] states that the assumptions do not always hold, and traditional
methods based on the distributional assumptions are very sensitive to minor changes. In
fact, Geary [31] (cited by Tukey [98] and Hampel [39]) stated that “Normality is a myth;
there never was, and never will be, a normal distribution”. Thus, robust procedures are a
crucial requirement of the contemporary data analysis methods. These ideologies led to
the development of “robust methods” in data analysis.
1.3. Robust Methods
A robust method in data analysis can be defined as “the method of extracting
knowledge from the bulk of the given data, simultaneously neglecting the knowledge
from the outliers present in the given data”. The major approaches of robust methods in
data analysis can be divided into following categories:
Relaxing Distributional Assumptions. The approach here is to develop data
analysis methods based on geometric (or structural) assumptions rather than the
distributional assumptions. This approach is followed in the hope of reducing the
sensitivity of methods with respect to the practical scenarios. Furthermore, the
19
geometrical assumptions on data can be easily verified, unlike the distributional
assumptions.
Incorporating Distributional Assumptions. Obviously relaxing all the distributional
assumptions in a data analysis method is the most appropriate case for practical
data. However, the distributional assumptions cannot be discarded in most of the
scenarios, mainly due to the loss of mathematical convenience in the analysis approach.
Thus, most of the research in robust methods is based on incorporating ideas into
the traditional methods that will result in insensitivity to the conventional theoretical
assumptions. The approaches can be categorized as usage of:
• Robust Measure: A measure which is insensitive to outliers is used as a loss function.
• Robust Algorithm: Subsamples from the given data sample are analyzed separately, and the information from the subsample analyses is utilized to construct the model.
• Robust Optimization: An uncertainty domain is considered around each data sample, and stochastic optimization based algorithms are used to conduct the analysis.
It is to be noted that incorporating robustness is a practical approach, and it is a
critical current requirement of data analysis methods. However, robustness often results
in a loss of convexity and/or smoothness in the optimization problem related to the
data analysis. Furthermore, the computational efficiency of robust methods is generally
lower than that of non-robust methods. It is out of the scope of this dissertation to
discuss all aspects of robust methods. Therefore, before proceeding further to develop
the theme of robust methods, the scope and objective of this dissertation are presented
in Section 1.4.
1.4. Scope and Objective
The objective of this dissertation is to develop novel optimization based robust
methods in data analysis problems. As described in Section 1.3, the term “robust
methods” has been used in different connotations based on the intention and area of
the application. In this work, robust methods mean incorporation of robust algorithms,
and/or usage of robust measures in data analysis problems. In the case of robust
measures, the focus is on the applicability of entropy based robust measures, like
correntropy, in data analysis. In this work, generalized convexity based results are
presented for the entropy based measures. In addition to that, the performance of the
robust measure in binary classification using a non-parametric framework is illustrated.
On the other hand, a robust algorithm for signal separation problem is also proposed.
Specifically, a linear mixing model for the signal separation problem is considered.
Robust algorithms are developed to extract the dictionary information from the given
mixture data. Furthermore, an entropy based method is proposed to extract the sources
from the mixture data.
Robust methods are applicable to practical data analysis scenarios, which typically
involve noisy data. From the literature [39], it can be assumed as a rule of thumb
(not an exception) that data from biomedical and engineering systems contain 5%
to 10% outliers. Moreover, even if no outliers are present in the data, the solution
quality obtained from robust methods is typically competitive with that of non-robust
methods. However, the main drawback of robust methods is that they are computationally
expensive. Nevertheless, our aim is to analyze the optimization properties of robust
measures and propose selection strategies for robust algorithms that may be used
to improve the computational and optimization efficiency. In Chapter 2, the issues
related to robust methods that are relevant to this dissertation are addressed.
Interested readers are directed to references [40, 47], which present a general
discussion of robust methods.
CHAPTER 2
ROBUST MEASURES AND ALGORITHMS
In Chapter 2, the theory of robust methods is presented. The proposed approaches
include the concepts of robust measures and robust algorithms. The ideas related to
robust optimization are relatively new when compared to traditional robust measures
and algorithms. However, robust optimization based methods, which are essentially
uncertainty based optimization methods, have been rigorously applied in the area of
data analysis due to the efficient methods developed by the stochastic optimization
community. On the other hand, the notion of robust measures and algorithms can
be traced back to the times of Eddington and Fisher. However, elegant methods to
incorporate the concepts of robust measures and algorithms in a practical framework
have always been an open research area. The crux of this work is to show the
applicability of a new robust measure, developed from the theory of Renyi’s entropy,
in problems related to data analysis.
Chapter 2 is structured as follows. Section 2.1 presents a brief summary of
the traditional robust measures. The concept of entropy based robust measures is
presented in Section 2.2. Sections 2.3, 2.4 and 2.5 prove the generalized convexity
based optimization properties of the three entropy based robust measures. Furthermore,
Section 2.6 presents a brief introduction to the traditional robust algorithms. Section 2.7
presents the proposed robust algorithm. Finally, Section 2.8 concludes Chapter 2 by
presenting a brief discussion on the proposed methods.
2.1. Traditional Robust Measures
Consider a univariate data set containing N samples. One of the traditional ways
to collect information from the samples is to calculate its mean and variance. Now,
assume that one outlier has been appended to the existing data set. Obviously, the
mean and variance will change significantly. However, the median of the data will not
change much. In fact, the median gives the true information about the data until there
are about 50% outliers in the data. Thus, the median is considered a more robust
measure than the mean, and, in some sense, the median is the most robust measure of
location. An improvement of the traditional mean calculation is the α-trimmed mean, where
0 < α < 1/2. The key idea in the α-trimmed mean is to remove up to αN points from the
sample before calculating the mean.

(Some sections of Chapter 2 have been published in Optimization Letters.)
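The contrast between the mean, the median, and the α-trimmed mean can be illustrated with a short sketch. The data values here are hypothetical, and this variant trims ⌊αN⌋ points from each tail, which is one common convention:

```python
import statistics

def alpha_trimmed_mean(data, alpha):
    """Mean after discarding the floor(alpha*N) smallest and largest
    points (0 < alpha < 1/2); one common variant of the alpha-trimmed mean."""
    n = len(data)
    k = int(alpha * n)                  # points trimmed from each tail
    trimmed = sorted(data)[k:n - k] if k > 0 else list(data)
    return sum(trimmed) / len(trimmed)

data = [9.8, 10.1, 10.0, 9.9, 10.2]
with_outlier = data + [1000.0]          # one gross outlier appended

# The mean shifts drastically, while the median and the trimmed mean barely move.
print(statistics.mean(with_outlier))                 # 175.0
print(statistics.median(with_outlier))               # 10.05
print(alpha_trimmed_mean(with_outlier, alpha=0.2))   # 10.05
```

A single extreme point thus moves the mean by an arbitrary amount, while the median and trimmed mean stay near the bulk of the data.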
Using the above ideology, many robust estimates of data have been proposed.
Generally, these estimators can be classified into three main categories: L-estimators,
M-estimators, and S-estimators. Among the three, this work considers M-estimators.
In simple terms, M-estimators are the minima of a measure constructed as a
summation of functions of the data points. Huber was a pioneer in proposing a
class of robust M-estimators. The Huber class of functions can be defined by a family
of functions ψ_θ(x), where θ is a parameter and x is an error value. For estimates of
location, ψ_θ(x) = ψ(x − θ), and the base model of the Huber measure can be
represented as:

ψ(x) = x,             if |x| ≤ k_1,
ψ(x) = k_2 sign(x),   otherwise,                                         (2–1)
where 0 < k_1, k_2 < ∞. When k_1 = c > 0 and k_2 = 0, the function ψ(x) corresponds
to metric trimming. When k_1 = k_2 = c > 0, the function ψ(x) corresponds to metric
winsorizing. Tukey proposed another class of robust measures, called the biweight
measure:

ψ(x) = x [1 − (x/k_1)²]₊²,                                               (2–2)
where [a]₊ represents the positive part of a, and k_1 is a parameter. Furthermore, Hampel
proposed a robust measure based on piecewise linear functions, which is defined as:

ψ(x) = |x| sign(x),                          0 < |x| ≤ k_1,
ψ(x) = k_1 sign(x),                          k_1 < |x| ≤ k_2,
ψ(x) = k_1 (k_3 − |x|)/(k_3 − k_2) sign(x),  k_2 < |x| ≤ k_3,
ψ(x) = 0,                                    k_3 < |x| < ∞.               (2–3)
Based on the ψ function, the M-estimator can be defined through ρ(x) = ∫ ψ(x) dF(x).
Huber’s, Tukey’s, and Hampel’s measures are the traditional robust measures in data
analysis. Although robust measures have several advantages, the above measures
have a few critical drawbacks:
• The measures are scale variant.
• There are no standard rules of parameter selection, i.e., how to select values for k_1, k_2, ....
• The measures are nonsmooth, i.e., they are discontinuous.
• The measures are nonconvex.
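For concreteness, the three traditional ψ functions in Equations 2–1 through 2–3 can be sketched as follows. This is a minimal illustration; the function names and the example parameter values in the tests are ours:

```python
import math

def psi_huber(x, k1, k2):
    """Base Huber-type psi (Eq. 2-1): identity inside [-k1, k1],
    clipped to k2 * sign(x) outside."""
    return x if abs(x) <= k1 else k2 * math.copysign(1.0, x)

def psi_biweight(x, k1):
    """Tukey's biweight psi (Eq. 2-2): redescends smoothly to 0 for |x| > k1."""
    t = 1.0 - (x / k1) ** 2
    return x * max(t, 0.0) ** 2

def psi_hampel(x, k1, k2, k3):
    """Hampel's three-part redescending psi (Eq. 2-3)."""
    a, s = abs(x), math.copysign(1.0, x)
    if a <= k1:
        return x
    if a <= k2:
        return k1 * s
    if a <= k3:
        return k1 * (k3 - a) / (k3 - k2) * s
    return 0.0
```

Note how the biweight and Hampel ψ functions return exactly 0 for large |x|: gross outliers exert no influence at all, which is the redescending behaviour that makes these measures robust but nonconvex.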
Typically, it is desirable to use a scale invariant measure. Suitable preprocessing
methods, like normalization, can be used to manage the scale variance of the robust
measures. However, parameter selection is a critical issue, and there are no specific
rules for selecting the parameters. Moreover, the nonsmooth nature of the functions
creates difficulties in developing solution algorithms. Finally, the presence of nonconvex
functions increases the complexity of optimization algorithms. Section 2.2 proposes
another class of robust estimators that are smooth, and invex in general, which may
overcome some of the above listed drawbacks.
2.2. Proposed Entropy Based Measures
Entropy is another criterion that can be used in parameter estimation; it bypasses
explicit higher order moment expansions. Shannon [85] defined entropy as the
average unpredictability (equivalently, the information content) of a probability distribution.
Shannon’s entropy, a measure of the uncertainty of a probability distribution, quantifies
the expected value of the information contained in a system. Later, Renyi [76] generalized
the notion of entropy in a way that includes Shannon’s definition. When combined
with a non-parametric estimator like Parzen’s estimator [71], Renyi’s entropy provides
a mechanism to estimate entropy directly from the responses. Using the concept of
non-parametric Renyi’s entropy, the notion of Minimization of Error Entropy (MEE) [26]
was founded, which is a central concept in the field of information theoretic learning [72, 73].
Another important property of entropy based measures is that they encompass
higher order moments. Therefore, minimizing an entropy based error measure indirectly
takes higher order statistics into account. Traditional higher order statistics are typically
very sensitive measures; entropy based measures, by contrast, are insensitive to
outliers. Thus, entropy based measures are useful for nonlinear, nongaussian systems.
Sections 2.3, 2.4 and 2.5 present novel properties of three entropy based robust
measures.
2.3. Minimization of Correntropy Cost
Correntropy (strictly speaking, cross-correntropy) is a generalized similarity
measure between any two arbitrary random variables (y, a), defined as [83]:

ν(y, a) = E_{y,a}[k(y − a, σ)],                                           (2–4)

where k is a kernel function with parameter σ (in this work it is taken to be the
Gaussian kernel). For the sake of simplicity, consider a binary classification scenario.
Let x = a − y represent the error, where a, y, and x ∈ R are the actual label, the
predicted label, and the error, respectively. The correntropic loss function is defined as:

F_C(x, σ) = β(1 − ν(x))  or  F_C(x, σ) = β(1 − E_x[k(x, σ)]),             (2–5)

where β = [1 − e^{−1/(2σ²)}]^{−1}. Typically, the probability distribution function of x is
unknown, and only n observations {x_i}_{i=1}^n are available. Using the information from
the n observations,
the empirical correntropic loss function can be defined as:

F_C(x, σ) = β(1 − (1/n) Σ_{i=1}^n k(x_i, σ)),                             (2–6)

where x = [x_1, ..., x_n]^T is the array of sample errors and k(x, σ) = e^{−x²/(2σ²)}. A
practical approach to minimizing the function in Equation 2–6 is to treat σ as a parameter:
multiple iterations for different values of the parameter are executed to obtain the optimal
solution.
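The empirical loss in Equation 2–6 is straightforward to evaluate. The sketch below (the function name and the toy error vectors are ours) also shows the saturation that gives the measure its robustness: a gross outlier’s kernel value is already near zero, so making the outlier far more extreme barely changes the loss.

```python
import math

def correntropic_loss(errors, sigma):
    """Empirical correntropic loss (Eq. 2-6):
    beta * (1 - (1/n) * sum_i exp(-x_i^2 / (2*sigma^2)))."""
    beta = 1.0 / (1.0 - math.exp(-1.0 / (2.0 * sigma ** 2)))
    n = len(errors)
    avg_kernel = sum(math.exp(-x ** 2 / (2.0 * sigma ** 2)) for x in errors) / n
    return beta * (1.0 - avg_kernel)

mild    = [0.1, -0.2, 0.05, 5.0]    # last entry is an outlier
extreme = [0.1, -0.2, 0.05, 500.0]  # a far more extreme outlier

# Both losses are essentially identical: the outlier's influence saturates.
print(abs(correntropic_loss(mild, 1.0) - correntropic_loss(extreme, 1.0)) < 1e-3)   # True
```

Under a quadratic loss, by contrast, the second case would be four orders of magnitude larger than the first.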
Parameter Based Correntropic Function
The parameterized correntropic loss function is defined as:

F_{Cσ}(x) = β_σ (1 − (1/n) Σ_{i=1}^n k_σ(x_i)),                           (2–7)

where β_σ = [1 − e^{−1/(2σ²)}]^{−1} and k_σ(x) = e^{−x²/(2σ²)}. Let H_C^σ(x) denote the
Hessian of the function defined in Equation 2–7, given as the diagonal matrix:

H_C^σ(x) = diag( σ̄(x_1)(σ² − x_1²)/σ², σ̄(x_2)(σ² − x_2²)/σ², ..., σ̄(x_n)(σ² − x_n²)/σ² ),   (2–8)

where σ̄(x) = (β_σ/σ²) e^{−x²/(2σ²)}. From Equation 2–8, it can be seen that if |σ| > |x_i| for
i = 1, ..., n, then the correntropic function is convex. Under the ideal circumstances
assumed by Fisher, choosing |σ| > |x_i| for i = 1, ..., n is appropriate. In the practical
case, however, σ should be selected such that |σ| < |x_i| when the i-th sample is an outlier,
and |σ| > |x_i| when it is not. This winnowing of outliers by the kernel width σ is the source
of the robustness of the correntropic loss function. However, the robustness is achieved in
the correntropic loss function at the cost of losing convexity.
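This convexity condition can be checked numerically from the diagonal entries of the Hessian in Equation 2–8. A small sketch, in which the function name and the sample values are ours:

```python
import math

def hessian_diagonal(errors, sigma):
    """Diagonal entries of the Hessian in Eq. 2-8:
    h_i = sbar(x_i) * (sigma^2 - x_i^2) / sigma^2,
    where sbar(x) = (beta_sigma / sigma^2) * exp(-x^2 / (2*sigma^2))."""
    beta = 1.0 / (1.0 - math.exp(-1.0 / (2.0 * sigma ** 2)))
    def sbar(x):
        return beta / sigma ** 2 * math.exp(-x ** 2 / (2.0 * sigma ** 2))
    return [sbar(x) * (sigma ** 2 - x ** 2) / sigma ** 2 for x in errors]

# |sigma| > |x_i| for every sample: all entries positive -> locally convex.
print(all(h > 0 for h in hessian_diagonal([0.3, -0.5, 0.2], sigma=1.0)))   # True
# An outlier with |x_i| > sigma flips its entry negative -> convexity is lost.
print(any(h < 0 for h in hessian_diagonal([0.3, -0.5, 4.0], sigma=1.0)))   # True
```

The sign flip at |x_i| = σ is exactly the winnowing mechanism described above: samples outside the kernel width contribute concave curvature and are effectively discounted.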
The above analysis highlights a subtle yet crucial issue, i.e., the trade-off among
three desired properties: convexity, robustness and smoothness. Conventionally,
the best strategy is to select any two of the three properties in a similarity measure.
For instance, most traditional practitioners select convexity and robustness (like
the absolute loss function), or convexity and smoothness (like the quadratic loss
function). Correntropy opens a door in the direction where robustness and smoothness
are guaranteed. But without convexity, optimizing a general nonlinear function is
a challenging task. Fortunately, for the correntropic loss function, we show that the
function is pseudoconvex in one dimension, and invex in multiple dimensions. When the
data cannot be normalized (i.e., when different features require different kernel widths),
the generalized correntropic loss function is defined as:

F_{Cσ}(x) = Σ_{i=1}^n (β_{σ_i}/n)(1 − k_{σ_i}(x_i)),                      (2–9)

where β_{σ_i} = [1 − e^{−1/(2σ_i²)}]^{−1} and k_{σ_i}(x) = e^{−x²/(2σ_i²)}. In the remainder of
Chapter 2, the total correntropic loss, instead of the average loss, is considered, i.e., the
correntropic loss function is defined as:

F_{Cσ}(x) = Σ_{i=1}^n β_{σ_i}(1 − k_{σ_i}(x_i)).                          (2–10)
Generalized Convexity of Correntropic Function
Although for large values of the parameter σ (relative to the error magnitudes)
the correntropy based measure is convex, it is of practical importance to study
the properties of the correntropy function for any value of σ > 0. Specifically, in this work
it is claimed that the correntropy function is pseudoconvex or invex, depending upon
the sample dimension. Let us first consider the simplest case, where the error from a single
sample is considered one at a time. This case is called the single sample case.
Single Sample Case: Let x be the sample error. The correntropy loss function, with
respect to one sample, can be defined as:

F_C^σ(x) = [1 − e^{−1/(2σ²)}]^{−1} [1 − e^{−x²/(2σ²)}]  ∀ x ∈ R.          (2–11)

The pseudoconvexity of this loss function is claimed under the following conditions.
Theorem 2.1. Let β_σ = [1 − e^{−1/(2σ²)}]^{−1} and S = {x ∈ R : x² < M, 0 < M ≪ ∞}. If
x ∈ S and F_C^σ : S → R, then the function F_C^σ, defined as:

F_C^σ(x) = β_σ [1 − e^{−x²/(2σ²)}]  ∀ x ∈ R,                              (2–12)

is pseudoconvex for any finite σ > 0.
Proof: Let x_1, x_2 ∈ R. Consider the following:

∇F_C^σ(x_1)(x_2 − x_1) = (β_σ/σ²) e^{−x_1²/(2σ²)} x_1 (x_2 − x_1)
                       = σ̄(x_1) x_1 (x_2 − x_1),

where σ̄(x) = (β_σ/σ²) e^{−x²/(2σ²)} and σ̄(x) > 0 ∀ σ, x, since β_σ > 0, σ is nonzero and
finite, and x² < M.

Now, if ∇F_C^σ(x_1)(x_2 − x_1) ≥ 0, then

σ̄(x_1) x_1 (x_2 − x_1) ≥ 0
⇒ x_1 (x_2 − x_1) ≥ 0.                                                    (2–14a)

Next, consider the following cases:

• Case 1: if x_1 ≥ 0, then Equation 2–14a reduces to x_2 ≥ x_1, i.e., x_2 ≥ x_1 ≥ 0,
  which implies F_C^σ(x_2) ≥ F_C^σ(x_1).                                  (2–15a)

• Case 2: if x_1 < 0, then Equation 2–14a reduces to x_2 ≤ x_1, i.e., x_2 ≤ x_1 < 0,
  which implies F_C^σ(x_2) ≥ F_C^σ(x_1).                                  (2–16a)

From Equations 2–15a & 2–16a, the following statement holds:

if ∇F_C^σ(x_1)(x_2 − x_1) ≥ 0, then F_C^σ(x_2) ≥ F_C^σ(x_1) for a given σ.   (2–17)

From Equation 2–17, it follows that F_C^σ is pseudoconvex for a given parameter σ. □

Remark 2.1. If there exists x⋆ ∈ R such that ∇F_C^σ(x⋆) = 0, then x⋆ is the global optimal
solution of F_C^σ.
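The implication in Equation 2–17 can be spot-checked numerically. A minimal sketch, in which the kernel width and the sampling range are arbitrary choices of ours:

```python
import math
import random

def F(x, sigma):
    """Single-sample correntropic loss (Eq. 2-12)."""
    beta = 1.0 / (1.0 - math.exp(-1.0 / (2.0 * sigma ** 2)))
    return beta * (1.0 - math.exp(-x ** 2 / (2.0 * sigma ** 2)))

def dF(x, sigma):
    """Derivative of F: (beta/sigma^2) * exp(-x^2/(2*sigma^2)) * x."""
    beta = 1.0 / (1.0 - math.exp(-1.0 / (2.0 * sigma ** 2)))
    return beta / sigma ** 2 * math.exp(-x ** 2 / (2.0 * sigma ** 2)) * x

# Pseudoconvexity (Eq. 2-17): whenever dF(x1)*(x2 - x1) >= 0, F(x2) >= F(x1).
random.seed(0)
sigma = 0.7
violations = 0
for _ in range(10_000):
    x1, x2 = random.uniform(-5, 5), random.uniform(-5, 5)
    if dF(x1, sigma) * (x2 - x1) >= 0 and F(x2, sigma) < F(x1, sigma) - 1e-12:
        violations += 1
print(violations)   # 0
```

No violations are found even for σ well below the error magnitudes, i.e., in the regime where the function is not convex.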
n-sample Case: Let x_i be the i-th sample error. The correntropic loss function for
n samples (the cumulative error of n samples) is given as:

F_C^σ(x) = Σ_{i=1}^n β_{σ_i} [1 − e^{−x_i²/(2σ_i²)}]  ∀ x ∈ S,            (2–18)

where β_{σ_i} = [1 − e^{−1/(2σ_i²)}]^{−1} and S = {x ∈ R^n : x_i² < M_i, 0 < M_i ≪ ∞ ∀ i = 1, ..., n}.
If σ_i = σ ∀ i, then the function is defined as:

F_C^σ(x) = β_σ [n − Σ_{i=1}^n e^{−x_i²/(2σ²)}]  ∀ x ∈ R^n,                (2–19)

which can be rewritten as:

F_C^σ(x) = Σ_{i=1}^n [β_σ − β_σ e^{−x_i²/(2σ²)}]  ∀ x ∈ S.                (2–20)

Let f^σ(x) = β_σ − β_σ e^{−x²/(2σ²)}. From Theorem 2.1, it can easily be shown
that f^σ(x) is pseudoconvex. Furthermore, it can be seen that F_C^σ(x) is a sum of n
pseudoconvex functions. But, unlike the corresponding property of convex functions,
the sum of pseudoconvex functions is not, in general, pseudoconvex. Thus, the
pseudoconvexity of the correntropic function for n samples does not follow directly from
Theorem 2.1.
Theorem 2.2. Let β_{σ_i} = [1 − e^{−1/(2σ_i²)}]^{−1} and S = {x ∈ R^n : x_i² < M_i,
0 < M_i ≪ ∞ ∀ i = 1, ..., n}. If x ∈ S and F_C^σ : S → R, then the function F_C^σ, defined as:

F_C^σ(x) = Σ_{i=1}^n β_{σ_i} [1 − e^{−x_i²/(2σ_i²)}]  ∀ x ∈ S,            (2–21)

is locally pseudoconvex for any finite σ_i > 0.
Proof: Let N_ε(x̄) = {y : ‖y − x̄‖ < δ, 0 < δ < ε ∧ ε → 0} represent the epsilon
neighborhood of x̄. Let x̄ ∈ S and x ∈ N_ε(x̄) ∩ S be any two points such that:

∇F_C^σ(x̄)^T (x − x̄) ≥ 0                                                  (2–22a)
Σ_{i=1}^n σ̄_i(x̄_i) x̄_i (x_i − x̄_i) ≥ 0
Σ_{i=1}^n σ̄_i(x̄_i) x̄_i d_i ≥ 0,                                          (2–22b)

where σ̄_i(x) = (β_{σ_i}/σ_i²) e^{−x²/(2σ_i²)} and d ∈ R^n is the direction such that
x = x̄ + λd, λ > 0.

The following relation is claimed to be true:

F_C^σ(x̄) ≤ F_C^σ(x).                                                     (2–23)

By contradiction, suppose F_C^σ(x̄) > F_C^σ(x); then:

Σ_{i=1}^n [f^{σ_i}(x_i) − f^{σ_i}(x̄_i)] < 0.                              (2–24)

Now

Σ_{i=1}^n [f^{σ_i}(x_i) − f^{σ_i}(x̄_i)]
  = Σ_{i=1}^n [−β_{σ_i} e^{−(x̄_i² + 2λx̄_i d_i + λ²d_i²)/(2σ_i²)} + β_{σ_i} e^{−x̄_i²/(2σ_i²)}]
  = Σ_{i=1}^n β_{σ_i} e^{−x̄_i²/(2σ_i²)} [1 − e^{−(2λx̄_i d_i + λ²d_i²)/(2σ_i²)}].   (2–25)

Equations 2–24 & 2–25 imply the following:

Σ_{i=1}^n β_{σ_i} e^{−x̄_i²/(2σ_i²)} [1 − e^{−(2λx̄_i d_i + λ²d_i²)/(2σ_i²)}] < 0.   (2–26)

Dividing both sides of Equation 2–26 by λ > 0, and taking the limit λ → 0, results in:

0 > lim_{λ→0} (1/λ) Σ_{i=1}^n β_{σ_i} e^{−x̄_i²/(2σ_i²)} [1 − e^{−(2λx̄_i d_i + λ²d_i²)/(2σ_i²)}]
  = Σ_{i=1}^n β_{σ_i} e^{−x̄_i²/(2σ_i²)} lim_{λ→0} [1 − e^{−(2λx̄_i d_i + λ²d_i²)/(2σ_i²)}] / λ
  = Σ_{i=1}^n β_{σ_i} e^{−x̄_i²/(2σ_i²)} x̄_i d_i / σ_i²
  = Σ_{i=1}^n σ̄_i(x̄_i) x̄_i d_i.                                           (2–27)

Equation 2–27 is a contradiction to the assumption made in Equation 2–22. This
proves that the claim stated in Equation 2–23 is true. Therefore, from Equations 2–22, 2–23
& 2–27 it is concluded that:

if ∇F_C^σ(x̄)^T (x − x̄) ≥ 0, then F_C^σ(x) ≥ F_C^σ(x̄)  ∀ x ∈ N_ε(x̄) ∩ S.   (2–28)

That is, from Equation 2–28, it can be stated that F_C^σ is locally pseudoconvex for
given parameters σ_i. □
Unfortunately, local pseudoconvexity does not guarantee global pseudoconvexity.
However, a gradient descent algorithm with a sufficiently small step size can be
designed so as to guarantee global convergence. Nevertheless, the following theorem
proves the invexity of the correntropic loss function.
Theorem 2.3. Let β_{σ_i} = [1 − e^{−1/(2σ_i²)}]^{−1} and S = {x ∈ R^n : x_i² < M_i,
0 < M_i ≪ ∞ ∀ i = 1, ..., n}. If x ∈ S and F_C^σ : S → R, then the function F_C^σ, defined as:

F_C^σ(x) = Σ_{i=1}^n β_{σ_i} [1 − e^{−x_i²/(2σ_i²)}]  ∀ x ∈ S,            (2–29)

is invex for any finite σ_i > 0.

Proof: Let x, x̄ ∈ S be any two points. Since x_i² < M_i and σ_i ≠ 0 ∀ i = 1, ..., n, there
exists M̄_i > 0 ∈ R such that x_i²/σ_i² ≤ M̄_i ∀ i = 1, ..., n. The gradient ∇F_C^σ(x) ∈ R^n
is defined as:

∇F_C^σ(x) = [(β_{σ_1}/σ_1²) e^{−x_1²/(2σ_1²)} x_1, ..., (β_{σ_n}/σ_n²) e^{−x_n²/(2σ_n²)} x_n]^T,   (2–30)

which implies

∇F_C^σ(x)^T ∇F_C^σ(x) = Σ_{i=1}^n (β_{σ_i}²/σ_i⁴) e^{−x_i²/σ_i²} x_i².    (2–31)

Since x ∈ S, it follows that ∇F_C^σ(x)^T ∇F_C^σ(x) = 0 only when ∇F_C^σ(x) = 0. Let us
define η(x, x̄) ∈ R^n as:

η(x, x̄) = 0,                                                         if ∇F_C^σ(x̄) = 0,
η(x, x̄) = [F_C^σ(x) − F_C^σ(x̄)] ∇F_C^σ(x̄) / (∇F_C^σ(x̄)^T ∇F_C^σ(x̄)),   otherwise.   (2–32)

From Equation 2–32, it follows that:

F_C^σ(x) − F_C^σ(x̄) ≥ η(x, x̄)^T ∇F_C^σ(x̄).                               (2–33)

From Equation 2–33 it follows that F_C^σ(x) is invex when x_i² < M_i ∀ i = 1, ..., n. □

Remark 2.2. If there exists x⋆ ∈ R^n such that ∇F_C^σ(x⋆) = 0, then x⋆ is the global optimal
solution of F_C^σ.
The kernel width plays a critical role in setting the level of convexity of the correntropic
loss function. Theorem 2.4 presents the condition under which the correntropy loss
function is pseudoconvex.
Theorem 2.4. Let β_{σ_i} = [1 − e^{−1/(2σ_i²)}]^{−1}, S = {x ∈ R^n : x_i² < M_i,
0 < M_i ≪ ∞ ∀ i = 1, ..., n}, and

c_k(σ) = ∏_{i=1}^k [σ̄_i(x_i)(σ_i² − x_i²)/σ_i²] · Σ_{i=1}^k [σ̄_i(x_i) x_i² σ_i² / (σ_i² − x_i²)],

where σ̄_i(x_i) = (β_{σ_i}/σ_i²) e^{−x_i²/(2σ_i²)}. If x ∈ S and F_C^σ : S → R, then the
function F_C^σ, defined as:

F_C^σ(x) = Σ_{i=1}^n β_{σ_i} [1 − e^{−x_i²/(2σ_i²)}]  ∀ x ∈ S,            (2–34)

is pseudoconvex when c_k(σ) > 0 ∀ k = 1, ..., n and each σ_i is nonzero and finite.
Proof: In Theorem 2.2, the local pseudoconvexity of F_C^σ was proved. Here it is
shown that, under certain conditions, F_C^σ is globally pseudoconvex. Let H_C^σ(x)
denote the Hessian of the function, given as the diagonal matrix:

H_C^σ(x) = diag( σ̄_1(x_1)(σ_1² − x_1²)/σ_1², ..., σ̄_n(x_n)(σ_n² − x_n²)/σ_n² ).   (2–35)

Consider the bordered Hessian matrix B(x) of F_C^σ, defined as:

B(x) = [ H_C^σ(x)      ∇F_C^σ(x) ]
       [ ∇F_C^σ(x)^T   0         ].                                       (2–36)

The determinant of B(x) is the (n+1)×(n+1) determinant whose leading diagonal entries
are σ̄_i(x_i)(σ_i² − x_i²)/σ_i², whose last row and column contain the gradient entries
σ̄_i(x_i) x_i, and whose corner entry is 0.                                (2–37)

Using typical row operations, the determinant can be rewritten as:

det B(x) = − ∏_{i=1}^n [σ̄_i(x_i)(σ_i² − x_i²)/σ_i²] · Σ_{i=1}^n [σ̄_i(x_i) x_i² σ_i² / (σ_i² − x_i²)].   (2–38)

Let det B_k(x) = − ∏_{i=1}^k [σ̄_i(x_i)(σ_i² − x_i²)/σ_i²] · Σ_{i=1}^k [σ̄_i(x_i) x_i² σ_i² / (σ_i² − x_i²)]
denote the corresponding k-th leading bordered minor. Since c_k(σ) > 0 ∀ k = 1, ..., n
and ∀ x ∈ S, det B_k(x) < 0 ∀ k = 1, ..., n, which implies the function is quasiconvex.
Furthermore, from Theorem 2.3 the function is invex. Thus, it can be concluded that the
function is pseudoconvex when c_k(σ) > 0 ∀ k = 1, ..., n. □
Additional Properties
In addition to the properties proved above, the following properties hold for the
correntropic function:

• Let σ_i = σ ∀ i = 1, ..., n, β_σ = [1 − e^{−1/(2σ²)}]^{−1} and S = {x ∈ R^n : x_i² < M_i,
  0 < M_i ≪ ∞ ∀ i = 1, ..., n}. If x ∈ S and F_C^σ : S → R, then the function F_C^σ
  defined as:

  F_C^σ(x) = Σ_{i=1}^n β_σ [1 − e^{−x_i²/(2σ²)}]  ∀ x ∈ S,                (2–39)

  is invex for any given nonzero finite value of the parameter σ.

• If there exists x⋆ ∈ R^n such that ∇F_C^σ(x⋆) = 0, then x⋆ is the global optimal solution
  of F_C^σ.

• Every local minimum of F_C^σ is a global minimum.

• F_C^σ(x) is symmetric, i.e., F_C^σ(−x) = F_C^σ(x).

• Let φ : R^r → R^n (r ≥ n) be a differentiable function. If ∇φ is of rank n, then F_C^σ ∘ φ
  is invex.
  Proof: Let us define η̄(y, x)^T = η(y, x)^T ∇φ(y)^{−1}. Since F_C^σ is invex, we have:

  F_C^σ(φ(x)) − F_C^σ(φ(y)) ≥ η(y, x)^T ∇F_C^σ(φ(y))                      (2–40)
  F_C^σ(φ(x)) − F_C^σ(φ(y)) ≥ η̄(y, x)^T ∇φ(y) ∇F_C^σ(φ(y))                (2–41)
  F_C^σ(φ(x)) − F_C^σ(φ(y)) ≥ η̄(y, x)^T ∇(F_C^σ ∘ φ)(y).                  (2–42)
  □

• If ψ : R → R is a monotone increasing differentiable convex function, then ψ ∘ F_C^σ is
  invex.
  Proof: Since ψ is convex, we have:

  ψ(F_C^σ(x)) ≥ ψ(F_C^σ(y)) + [F_C^σ(x) − F_C^σ(y)] ψ′(F_C^σ(y)).         (2–43)

  Furthermore, due to the invexity of F_C^σ, we have:

  F_C^σ(x) − F_C^σ(y) ≥ η(y, x)^T ∇F_C^σ(y).                              (2–44)

  Since ψ is monotone increasing:

  ψ′(x) > 0  ∀ x ∈ R.                                                     (2–45)

  Multiplying both sides of Equation 2–44 by ψ′(F_C^σ(y)) and substituting into
  Equation 2–43, the result follows. □

• If σ_i² > M_i ∀ i, then F_C^σ(x) is convex.
Some data analysis problems, like multi-class classification, are based on an error
vector, i.e., the error for a single sample is a vector in m dimensions. In order to avoid
confusion between the sample dimension and the error dimension, the error dimensions
are simply called dimensions.
m-dimensions, Single-sample Case: The correntropy loss function for m dimensions,
with respect to one sample, can be defined as:

G_C^σ(x) = [1 − e^{−1/(2σ²)}]^{−1} [1 − e^{−‖x‖²/(2σ²)}]  ∀ x ∈ R^m,      (2–46)

where ‖x‖ is the Euclidean norm. We claim that this loss function is pseudoconvex.
Theorem 2.5. If x ∈ R^m then the function G_C^σ : R^m → R, defined as:

G_C^σ(x) = [1 − e^{−1/(2σ²)}]^{−1} [1 − e^{−‖x‖²/(2σ²)}]  ∀ x ∈ R^m,      (2–47)

is pseudoconvex for finite σ > 0.

Proof: Let β_σ = [1 − e^{−1/(2σ²)}]^{−1}, with β_σ > 0 ∀ σ. The function can be rewritten as:

G_C^σ(x) = β_σ − β_σ e^{−‖x‖²/(2σ²)}  ∀ x ∈ R^m.                          (2–48)

Let x_1 and x_2 be two vectors such that:

G_C^σ(x_2) < G_C^σ(x_1).                                                  (2–49)

Then,

β_σ − β_σ e^{−‖x_2‖²/(2σ²)} < β_σ − β_σ e^{−‖x_1‖²/(2σ²)}                 (2–50a)
e^{−‖x_2‖²/(2σ²)} > e^{−‖x_1‖²/(2σ²)}                                     (2–50b)
−‖x_2‖²/(2σ²) > −‖x_1‖²/(2σ²)                                             (2–50c)
‖x_1‖ > ‖x_2‖.                                                            (2–50d)

Now, ∇G_C^σ(x) = σ̄(x) x, where σ̄(x) = (β_σ/σ²) e^{−‖x‖²/(2σ²)} and σ̄(x) > 0 ∀ σ, x.
Consider:

∇G_C^σ(x_1)^T (x_2 − x_1) = σ̄(x_1) x_1^T (x_2 − x_1)                      (2–51a)
                          = σ̄(x_1) (x_1^T x_2 − x_1^T x_1).                (2–51b)

Using the Cauchy–Bunyakovsky–Schwarz inequality, we have:

if ‖x_1‖ > ‖x_2‖, then x_1^T x_1 > x_1^T x_2.

Therefore, using the above inequality and Equations 2–50d & 2–51b, we have:

if G_C^σ(x_2) < G_C^σ(x_1), then ∇G_C^σ(x_1)^T (x_2 − x_1) < 0 for a given σ.   (2–52)

From Equation 2–52, it follows that G_C^σ is pseudoconvex for a given parameter σ. □
m-dimension, n-sample Case: The correntropy loss function for m dimensions,
with respect to n samples, can be defined as:

G_C^σ(X) = [1 − e^{−1/(2σ²)}]^{−1} [1 − Σ_{i=1}^n e^{−‖x_i‖²/(2σ²)}]  ∀ x_i ∈ R^m, ∀ i = 1, ..., n.   (2–53)

Let β_σ = [1 − e^{−1/(2σ²)}]^{−1} and σ̄(x) = (β_σ/σ²) e^{−‖x‖²/(2σ²)}, with σ̄(x), β_σ > 0
∀ σ, x. Let g_i^σ(X) = β_σ/n − β_σ e^{−(Σ_j x_{i,j}²)/(2σ²)}. The loss function can be
rewritten as:

G_C^σ(X) = Σ_{i=1}^n g_i^σ(X).                                            (2–54)

The gradient of G_C^σ(X) can be written as:

∇G_C^σ(X) = [σ̄(x_1)x_{1,1}, ..., σ̄(x_1)x_{1,m}, ..., σ̄(x_n)x_{n,1}, ..., σ̄(x_n)x_{n,m}]^T   (2–55)

and

∇G_C^σ(X̄)^T (X − X̄) = Σ_{i=1}^n Σ_{j=1}^m σ̄(x̄_i) · x̄_{i,j} · (x_{i,j} − x̄_{i,j}).   (2–56)

Theorem 2.6. Let β_{σ_i} = [1 − e^{−1/(2σ_i²)}]^{−1} and S = {X ∈ R^{n·m} : ‖x_i‖² < M_i,
0 < M_i ≪ ∞ ∀ i = 1, ..., n}. If X ∈ S and G_C^σ : S → R, then the function G_C^σ,
defined as:

G_C^σ(X) = Σ_{i=1}^n β_{σ_i} [1 − e^{−‖x_i‖²/(2σ_i²)}]  ∀ x_i ∈ R^m,      (2–57)

is invex for finite σ_i > 0.

Proof: Let X, X̄ ∈ R^{n·m} be any two points. Since ‖x_i‖² < M_i and σ_i ≠ 0 ∀ i = 1, ..., n,
there exists M̄_i > 0 ∈ R such that ‖x_i‖²/σ_i² ≤ M̄_i ∀ i. The gradient ∇G_C^σ(X) ∈ R^{n·m}
is defined as:

∇G_C^σ(X) = [σ̄_1(x_1)x_{1,1}, ..., σ̄_1(x_1)x_{1,m}, ..., σ̄_n(x_n)x_{n,1}, ..., σ̄_n(x_n)x_{n,m}]^T,   (2–58)

where σ̄_i(x) = (β_{σ_i}/σ_i²) e^{−‖x‖²/(2σ_i²)}, which implies

∇G_C^σ(X)^T ∇G_C^σ(X) = Σ_{i=1}^n σ̄_i(x_i)² ‖x_i‖².                       (2–59)

Since X ∈ S, it follows that ∇G_C^σ(X)^T ∇G_C^σ(X) = 0 only when ∇G_C^σ(X) = 0. Let us
define η(X, X̄) ∈ R^{n·m} as:

η(X, X̄) = 0,                                                          if ∇G_C^σ(X̄) = 0,
η(X, X̄) = [G_C^σ(X) − G_C^σ(X̄)] ∇G_C^σ(X̄) / (∇G_C^σ(X̄)^T ∇G_C^σ(X̄)),    otherwise.   (2–60)

From Equation 2–60, it follows that:

G_C^σ(X) − G_C^σ(X̄) ≥ η(X, X̄)^T ∇G_C^σ(X̄).                               (2–61)

From Equation 2–61 it follows that G_C^σ(X) is invex when X ∈ S. □
2.4. Minimization of Error Entropy
Let z_i be the error between the i-th measurement and the i-th desired value, defined as
z_i = x_i − y_i ∀ i = 1, ..., N. The Minimization of Error Entropy (MEE) problem can be
stated as the maximization of the Information Potential (IP) and can be defined as:

minimize:  −IP(z) = −(1/N²) Σ_{i=1}^N Σ_{j=1}^N k_σ(z_i − z_j),           (2–62)

where k_σ(ν) = e^{−ν²/(2σ²)} is the well known Gaussian kernel and σ is the kernel
parameter (for the sake of simplicity, the constant factor in the Gaussian kernel is
ignored).
Let e ∈ R^{N−1} be the vector of all ones, and let e_i ∈ R^{N−1} be the vector of all
zeros except a 1 at the i-th position. Construct the matrix B_k ∈ R^{(N−1)×N} whose
columns are:

B_k = [−e_1, ..., −e_{k−1}, +e, −e_k, ..., −e_{N−1}]  ∀ k = 1, ..., N.    (2–63)

Let A ∈ R^{N(N−1)×N} be defined as:

A = [B_1^T, ..., B_N^T]^T.                                                (2–64)

Now, the MEE problem can be restated as:

minimize:  −(1/N²) Σ_{k=1}^{N(N−1)} k_σ(a_{k•}^T z),                      (2–65)

where a_{k•} ∈ R^N represents the k-th row of the matrix A. Let S1 = {u ∈ R^{N(N−1)} :
u = Az, z ∈ S}, and define an affine function L : S ⊆ R^N → S1 ⊆ R^{N(N−1)} as:

L(z) = Az.                                                                (2–66)
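The construction in Equations 2–63 and 2–64 can be sketched directly in code. In this illustrative, dependency-free version (the function names and the exact row ordering convention are ours), the rows a_k• are enumerated block by block, so that Az lists every ordered pairwise difference z_k − z_j:

```python
def build_A(N):
    """Rows of A (Eqs. 2-63/2-64): for each k, the block B_k whose rows have
    +1 in column k and -1 in one of the remaining columns, so that the rows
    of A*z enumerate all ordered pairwise differences z_k - z_j, j != k."""
    A = []
    for k in range(N):
        for j in range(N):
            if j == k:
                continue
            row = [0] * N
            row[k] = 1
            row[j] = -1
            A.append(row)
    return A

def matvec(A, z):
    """Plain matrix-vector product for the list-of-lists matrix above."""
    return [sum(a * zi for a, zi in zip(row, z)) for row in A]

z = [3.0, 1.0, 4.0]
u = matvec(build_A(3), z)
# u holds every ordered difference z_k - z_j:
expected = [zk - zj for k, zk in enumerate(z)
            for j, zj in enumerate(z) if j != k]
print(u == expected)   # True
```

A has N(N−1) rows, so for N = 3 the vector u = Az has 6 entries, one per ordered pair of samples.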
Let G_C^σ(z) = −(1/N²) Σ_{k=1}^{N(N−1)} k_σ(a_{k•}^T z) ∀ z ∈ S. G_C^σ can be
represented as a composite function of F_L^σ and L, i.e.,

G_C^σ = F_L^σ ∘ L,                                                        (2–67)

where F_L^σ : S1 ⊆ R^{N(N−1)} → R is defined as:

F_L^σ(u) = −(1/N²) Σ_{i=1}^{N(N−1)} e^{−u_i²/(2σ²)}.                      (2–68)
Now, Equation 2–68 represents the correntropy loss function defined over the
projected space S1. Furthermore, Equation 2–67 shows that MEE is a composition of
the correntropy loss function with an affine function. This representation paves the
way to establish the generalized convexity results for MEE since, in general,
composition with an affine function preserves generalized convexity. Next, the
properties of the MEE function are presented.
Theorem 2.7. Let S = {z ∈ R^N : z_i² < M, 0 < M ≪ ∞ ∀ i = 1, ..., N}. If z ∈ S
and G_C^σ : S → R, then the function G_C^σ, defined as:

G_C^σ(z) = −(1/N²) Σ_{k=1}^{N(N−1)} k_σ(a_{k•}^T z)  ∀ z ∈ S,             (2–69)

is convex when σ ≥ 2√M.

Proof: Since u = Az, from the definitions of S and S1 it can be established that
u_i² < 4M ∀ i. Thus, when σ ≥ 2√M we have σ² ≥ 4M > u_i², and from Equation 2–8 it
follows that F_L^σ is convex. Therefore, from Equation 2–67 it follows that G_C^σ is
convex, since the composition of a convex function with an affine function is convex. □
Theorem 2.8. Let S = {z ∈ R^N : z_i² < M, 0 < M ≪ ∞ ∀ i = 1, ..., N}. If z ∈ S
and G_C^σ : S → R, then the function G_C^σ, defined as:

G_C^σ(z) = −(1/N²) Σ_{k=1}^{N(N−1)} k_σ(a_{k•}^T z)  ∀ z ∈ S,             (2–70)

is pseudoconvex when c_k(σ) > 0 ∀ k = 1, ..., N and σ is nonzero and finite, where

c_k(σ) = ∏_{i=1}^k [σ̄(u_i)(σ² − u_i²)/σ²] · Σ_{i=1}^k [σ̄(u_i) u_i² σ² / (σ² − u_i²)],

σ̄(u_i) = (1/σ²) e^{−u_i²/(2σ²)}, and u = Az.

Proof: From [60] and [7], it is concluded that pseudoconvexity is invariant under
composition with an affine function. Thus, using the results from [92] in Equation 2–67,
it follows that G_C^σ is pseudoconvex under the stated conditions on c_k(σ) and σ. □
Theorem 2.9. Let S = {z ∈ R^N : z_i² < M, 0 < M ≪ ∞ ∀ i = 1, ..., N}. If z ∈ S
and G_C^σ : S → R, then the function G_C^σ, defined as:

G_C^σ(z) = −(1/N²) Σ_{k=1}^{N(N−1)} k_σ(a_{k•}^T z)  ∀ z ∈ S,             (2–71)

is invex for finite σ > 0.

Proof: To the best of our knowledge, there is no general proof in the literature that
affirms the preservation of invexity under affine compositions. Therefore, an elementary
proof, which serves not only as a proof of the above theorem but for invex functions in
general, is presented. To prove: if F_L^σ is an invex function, then G_C^σ = F_L^σ ∘ L is
invex, where L is any affine transformation.

By contradiction, assume G_C^σ is not invex, i.e., the following is true for any arbitrary
η(z, w) : S × S → S:

G_C^σ(z) − G_C^σ(w) < η(z, w)^T ∇G_C^σ(w).                                (2–72)

Rewriting Equation 2–72:

F_L^σ(Az) − F_L^σ(Aw) < [Aη(z, w)]^T ∇F_L^σ(Aw).                          (2–73)

Let η̄(Az, Aw) = Aη(z, w), u = Az, and v = Aw. Equation 2–73 can be written as:

F_L^σ(u) − F_L^σ(v) < η̄(u, v)^T ∇F_L^σ(v).                                (2–74)

Since z, w, and η(z, w) are chosen arbitrarily, Equation 2–74 implies that F_L^σ is not
invex. This contradiction is a result of the assumption made in Equation 2–72. Thus, the
assumption that G_C^σ is not invex is false. □
2.5. Minimization of Error Entropy with Fiducial Points
Another important M-estimator, using the concept of a fiducial point (reference point),
is proposed in [55]. The goal of such a measure is to provide an anchor at zero error, i.e.,
to make most of the errors zero. This M-estimator is obtained by the Minimization of Error
Entropy with Fiducial points (MEEF). The MEEF problem can be defined as:

minimize:  −1/(N+1)² Σ_{i=0}^N Σ_{j=0}^N k_σ(z_i − z_j).                  (2–75)

The only modification in MEEF, compared to MEE, is the addition of a reference
point, z_0 = 0. Simplifying the above function by using the symmetry of the
Gaussian kernel, the MEEF problem can be written as:

minimize:  −1/(N+1)² Σ_{i=1}^N Σ_{j=1}^N k_σ(z_i − z_j) − 2/(N+1)² Σ_{j=0}^N k_σ(z_0 − z_j)   (2–76)

or

minimize:  −1/(N+1)² Σ_{i=1}^N Σ_{j=1}^N k_σ(z_i − z_j) − 2/(N+1)² Σ_{j=1}^N k_σ(z_j) − 2/(N+1)².   (2–77)

In general, by adding m fiducial points, the following MEEF function is obtained:

minimize:  −1/(N+1)² Σ_{i=1}^N Σ_{j=1}^N k_σ(z_i − z_j) − 2m/(N+1)² Σ_{j=1}^N k_σ(z_j) − 2m/(N+1)².   (2–78)

Removing the constant term, and normalizing the coefficients, we get the following:

minimize:  −λ Σ_{i=1}^N Σ_{j=1}^N k_σ(z_i − z_j) − (1 − λ) Σ_{j=1}^N k_σ(z_j),   (2–79)

where λ ∈ (0, 1]. It can be seen that, as λ → 0, the MEEF formulation converges
to the Minimization of Correntropy Cost (MCC) function. On the other hand, when λ = 1,
the MEEF objective function reduces to the MEE objective function. Intuitively, the second
term, Σ_{j=1}^N k_σ(z_j), can be seen as a regularization function. In fact, correntropy
induces a similarity norm and can be used for sparsification of the solution. This
sparsification is the underlying reasoning for the usage of fiducial points.
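The interpolation behaviour of Equation 2–79 can be sketched as follows. The function names and the toy error vector are ours, and λ = 0 is evaluated only as the limiting MCC case, since the text restricts λ to (0, 1]:

```python
import math

def gaussian(v, sigma):
    """Gaussian kernel k_sigma(v) = exp(-v^2 / (2*sigma^2))."""
    return math.exp(-v ** 2 / (2.0 * sigma ** 2))

def mee_term(z, sigma):
    """-sum_{i,j} k_sigma(z_i - z_j): the MEE part of Eq. 2-79."""
    return -sum(gaussian(zi - zj, sigma) for zi in z for zj in z)

def mcc_term(z, sigma):
    """-sum_j k_sigma(z_j): the fiducial (correntropy) part of Eq. 2-79."""
    return -sum(gaussian(zj, sigma) for zj in z)

def meef(z, sigma, lam):
    """Normalized MEEF loss (Eq. 2-79): a convex combination of MEE and MCC."""
    return lam * mee_term(z, sigma) + (1.0 - lam) * mcc_term(z, sigma)

z = [0.2, -0.1, 0.4]
# lam = 1 recovers the MEE objective; lam -> 0 recovers the correntropy (MCC) term.
print(meef(z, 1.0, 1.0) == mee_term(z, 1.0))    # True
print(meef(z, 1.0, 0.0) == mcc_term(z, 1.0))    # True
```

Intermediate values of λ trade off the pairwise-difference entropy term against the anchor at zero error.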
Consider the normalized loss function of the MEEF problem, H_C^σ(z), defined as:

H_C^σ(z) = −λ Σ_{i=1}^N Σ_{j=1}^N k_σ(z_i − z_j) − (1 − λ) Σ_{j=1}^N k_σ(z_j)   (2–80)
H_C^σ(z) = λ G_C^σ(z) + (1 − λ) F_C^σ(z),                                 (2–81)

where G_C^σ(z) = −Σ_{i=1}^N Σ_{j=1}^N k_σ(z_i − z_j) and F_C^σ(z) = −Σ_{j=1}^N k_σ(z_j).
Equation 2–81 states that the function H_C^σ(z) is a convex combination of two real
functions. Unlike convexity, as a reminder, pseudoconvexity may not be preserved
under positive weighted summation. However, invexity is preserved under positive
weighted summation when all the functions are invex with respect to the same η
function. Next, the conditions under which H_C^σ(z) is convex in particular, and invex in
general, are developed.
Theorem 2.10. Let $S = \{z \in \mathbb{R}^N : z_i^2 < M,\ 0 < M \ll \infty\ \forall\, i = 1, \dots, N\}$. If $z \in S$ and $H_C^\sigma : S \mapsto \mathbb{R}$, then the function $H_C^\sigma$, defined as:

$H_C^\sigma(z) = \lambda\, G_C^\sigma(z) + (1-\lambda)\, F_C^\sigma(z) \quad \forall\, z \in S$,  (2–82)

is convex when $\sigma \geq 2\sqrt{M}$.

Proof: Both $G_C^\sigma(z)$ and $F_C^\sigma(z)$ are convex when $\sigma \geq 2\sqrt{M}$. Therefore, it follows immediately that $H_C^\sigma$, being a convex combination of convex functions, is convex when $\sigma \geq 2\sqrt{M}$.
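The bound of Theorem 2.10 can be probed numerically: at a point of $S$ with $\sigma = 2\sqrt{M}$, a finite-difference Hessian of $H_C^\sigma$ should be positive semidefinite. A sketch, where the kernel normalization and the choices $M = 1$, $\lambda = 0.5$ are illustrative assumptions:

```python
import numpy as np

def k(x, sigma):
    return np.exp(-x**2 / (2 * sigma**2))

def H(z, sigma, lam):
    G = -k(z[:, None] - z[None, :], sigma).sum()   # G_C^sigma, the MEE term
    F = -k(z, sigma).sum()                         # F_C^sigma, the fiducial term
    return lam * G + (1 - lam) * F

def num_hessian(f, z, eps=1e-3):
    # Central second differences in every pair of coordinates.
    n = len(z)
    Hm = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            zs = [z.copy() for _ in range(4)]
            zs[0][i] += eps; zs[0][j] += eps
            zs[1][i] += eps; zs[1][j] -= eps
            zs[2][i] -= eps; zs[2][j] += eps
            zs[3][i] -= eps; zs[3][j] -= eps
            Hm[i, j] = (f(zs[0]) - f(zs[1]) - f(zs[2]) + f(zs[3])) / (4 * eps**2)
    return Hm

M = 1.0
sigma = 2 * np.sqrt(M)                             # the bound of Theorem 2.10
rng = np.random.default_rng(1)
z = rng.uniform(-np.sqrt(M), np.sqrt(M), size=5)   # a random point of the set S
eig = np.linalg.eigvalsh(num_hessian(lambda v: H(v, sigma, 0.5), z))
```

All eigenvalues of the numerical Hessian come out nonnegative (up to finite-difference noise), consistent with the theorem.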
Theorem 2.11. Let $S = \{z \in \mathbb{R}^N : z_i^2 < M,\ 0 < M \ll \infty\ \forall\, i = 1, \dots, N\}$. If $z \in S$ and $H_C^\sigma : S \mapsto \mathbb{R}$, then the function $H_C^\sigma$, defined as:

$H_C^\sigma(z) = \lambda\, G_C^\sigma(z) + (1-\lambda)\, F_C^\sigma(z) \quad \forall\, z \in S$,  (2–83)

is invex for finite $\sigma > 0$.
Proof: In order to prove the invexity of $H_C^\sigma(z)$, it is sufficient to show that both $G_C^\sigma(z)$ and $F_C^\sigma(z)$ are invex with respect to a common $\eta$ function.

By contradiction, assume that the following system, say (system-1), is infeasible for any $z$ and $w \in S$:

$\nabla F_C^\sigma(w)^T \eta(z, w) \leq F_C^\sigma(z) - F_C^\sigma(w)$  (2–84a)

$\nabla G_C^\sigma(w)^T \eta(z, w) \leq G_C^\sigma(z) - G_C^\sigma(w)$.  (2–84b)
Since Equation 2–84 is linear with respect to $\eta(z, w)$, from Gale's theorem [61] it can be stated that if the above linear system (system-1) is infeasible, then the following system (system-2) must be feasible:

$[\nabla F_C^\sigma(w)\ \ \nabla G_C^\sigma(w)]\, p = 0$  (2–85a)

$[F_C^\sigma(z) - F_C^\sigma(w)\ \ \ G_C^\sigma(z) - G_C^\sigma(w)]\, p = -1$  (2–85b)

$p \geq 0$.  (2–85c)
Case 1: either $p_1 = 0$ or $p_2 = 0$. Clearly, if $p_1 = 0$, then $p_2 = 0$ from Equation 2–85a. But when $p_1 = 0$ and $p_2 = 0$, Equation 2–85b is infeasible. Thus, $p_1 \neq 0$. A similar argument shows that $p_2 \neq 0$. To sum up, neither $p_1 = 0$ nor $p_2 = 0$ gives a feasible solution for (system-2).
Case 2: $p_1 > 0$ and $p_2 > 0$. Let us rearrange the elements of $w$ such that the following relation holds: $w_1 \leq w_2 \leq \dots \leq w_N$. Now Equation 2–85a can be written as:

$\nabla G_C^\sigma(w) = -\lambda\, \nabla F_C^\sigma(w)$,  (2–86)

where $\lambda = p_1 / p_2 > 0$. Now consider the following two sub-cases:
Sub-case 1: $w_N \geq 0$. Consider the last element on both sides of Equation 2–86, i.e., consider

$\frac{2}{\sigma^2} \sum_{i=1}^{N} e^{-\frac{(w_N - w_i)^2}{2\sigma^2}} (w_N - w_i) = -\lambda\, \frac{1}{\sigma^2}\, e^{-\frac{w_N^2}{2\sigma^2}}\, w_N$.  (2–87)

Clearly, Equation 2–87 has no feasible value of $\lambda$.
Sub-case 2: $w_N < 0$. Consider the first element on both sides of Equation 2–86, i.e., we have

$[\nabla G_C^\sigma(w)]_1 = -\lambda\, [\nabla F_C^\sigma(w)]_1$  (2–88)

$\frac{2}{\sigma^2} \sum_{i=1}^{N} e^{-\frac{(w_1 - w_i)^2}{2\sigma^2}} (w_1 - w_i) = -\lambda\, \frac{1}{\sigma^2}\, e^{-\frac{w_1^2}{2\sigma^2}}\, w_1$.  (2–89)

Clearly, Equation 2–89 has no feasible value of $\lambda$. Thus, (system-2) is infeasible, implying that the assumption is false, and (system-1) is feasible. In other words, there exists a common $\eta$ such that both $G_C^\sigma(z)$ and $F_C^\sigma(z)$ are invex. Therefore, $H_C^\sigma(z)$ is invex for any nonzero finite value of $\sigma$. □
To sum up, it can be stated that the MCC, MEE and MEEF functions are invex in nature. Furthermore, invexity, robustness and smoothness are the three main desirable properties of a robust measure. The presence of these three properties, along with suitable optimization algorithms, will improve the current computational complexity of robust methods. Next, the traditional and proposed robust algorithms are presented.
2.6. Traditional Robust Algorithm
Consider the classical data analysis methods, like the least squares method, in order to understand the concept of robust algorithms. The idea in the classical methods is to estimate the model parameters with respect to all of the presented data. These methods give equal weight to all the data points, and have no internal mechanism to detect and/or filter outliers. The classical methods are based on a smoothing assumption, which states that the effects of outliers are smoothed out by the presence of a large amount of good data points. However, in many practical problems, the smoothing assumption is not justifiable. Thus, earlier robust algorithms were based on removal of outliers. The simple idea in such a robust algorithm is to estimate the parameters with respect to all of the data points, and then identify those points which are farthest from (non-conforming to) the model. The identified points are assumed to be outliers, and are removed from the data. The remaining points are used to construct a new model. This iterative process continues until a better model is constructed, or until there are no longer sufficient remaining points to proceed. However, these heuristic iterative methods easily fail even when there is only one outlier [28].
Fischler [29] pioneered the constructive approach to robust algorithms using the notion of random sampling, called Random Sample Consensus (RANSAC). The basic idea in RANSAC is to simultaneously estimate the model and eliminate the outliers. The novelty that RANSAC proposed, compared to the earlier heuristics, can be summarized as:

• Initially, a small number of data points (the initial set) are selected to estimate the model parameters.

• While estimating the instantaneous model parameters, the initial set is enlarged in size by adding the consensus points.
The philosophy of selecting a small number of points for estimating the instantaneous model parameters is the source of the robustness of the RANSAC algorithm. Typically, the outliers in a practical data set are assumed to be much fewer in number than the good data points. Thus, selecting a small sample from the given data increases the probability of selecting only good data points. Formally, RANSAC is described in Algorithm 2.1. RANSAC is the basis of many robust algorithms due to its ability to tolerate a large fraction of outliers. RANSAC can often perform well with a high amount of outliers; however, the number of samples required to do so increases exponentially with respect to the percentage of outliers in the data. Thus, similar to robust measures, robust algorithms are computationally expensive.
If the percentage of outliers in a sample is known a priori (say $p_o$), then the number of samples required (say $k$) for an $\eta$ level of confidence can be calculated as:

$k \geq \frac{\log(1 - \eta)}{\log(1 - (1 - p_o)^m)}$,  (2–90)

where $m$ is the minimum number of samples required to compute a solution. RANSAC is a simple, successful robust algorithm in the data analysis literature. Nevertheless, many efforts have been made toward improving the performance of RANSAC. For example, the optimization of the model verification process of RANSAC is addressed in [12, 16, 18]. Improvements directed towards the sampling process, in order to generate usable hypotheses, are addressed in [17, 19, 96]. Furthermore, real-time execution issues are addressed in [68, 74].
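Equation 2–90 is straightforward to evaluate; a small sketch, where the example values for $p_o$, $m$ and $\eta$ are illustrative:

```python
import math

def ransac_samples(p_outlier, m, confidence):
    """Number of RANSAC samples k from Equation 2-90."""
    return math.ceil(math.log(1 - confidence) /
                     math.log(1 - (1 - p_outlier) ** m))

# e.g. 50% outliers, minimal sets of m = 4 points, 99% confidence
k = ransac_samples(0.5, 4, 0.99)  # 72 samples
```

The exponential growth in the outlier fraction mentioned above is visible here: the denominator shrinks geometrically in $m$ as $p_o$ grows.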
2.7. Proposed Robust Algorithm
In this work, a RANSAC based robust algorithm is developed, which proposes an improved sampling strategy and the usage of mathematical modeling for hypothesis testing. Specifically, the algorithm is proposed for the hyperplane clustering problem. The key idea in the sample selection of the robust algorithm is twofold. First, an initial data sample S1 is selected based on a closeness criterion. By restricting the closeness criterion, a subgroup S2 of the initial data sample is selected. Let the rest of the data points be denoted as S3. The model parameters are estimated as follows. The data points belonging to S2 are considered definitely good points, and the data points belonging to S1 are considered tentatively good points. Using the information of points from S1 and S2, a hyperplane containing S2 and other points is searched for. If S1 contains points that do not belong to the hyperplane, then the algorithm has a mechanism to discard those points. After the execution of one instance of the algorithm, we end up with two possibilities. The desired possibility is that the number of consensus points is above the threshold for that hyperplane, and we search for the next hyperplane. The other possibility is that the number of consensus points is below the threshold. In this case, we re-sample the two sets, but archive the information of the previous unsuccessful sample S2. This archive acts as a cut for selecting the next sample, and avoids repetition.
2.8. Discussion on the Robust Methods
The practical consideration while applying the results presented in Section 2.3 is the asymptotic behavior of the negative exponential function. Theoretically, $k_\sigma(x) \to 0$ only as $x \to \infty$. In practice, however, finite large values of $x$ already result in $k_\sigma(x) = 0$ in floating-point arithmetic. This behavior may result in local minima, which can be avoided by using the following methods: constraining the absolute value of the error, replacing the Gaussian kernel with a suitable kernel function, or using the solution of a quadratic loss function as the starting point of the correntropy minimization.
In Sections 2.3, 2.4 and 2.5, the convexity, pseudoconvexity and invexity of the entropy based loss functions (MCC, MEE and MEEF) are established. Invexity is the sole property that can be exploited in designing optimization algorithms, which can be used for efficiently minimizing the loss function. The generalized convexity results for the single and multiple dimensional cases are presented separately. The purpose of discussing the one-dimensional case separately was to address the traditional sample-by-sample artificial neural network approach in data analysis. Generally, the cumulative error approach is useful in both the parametric and non-parametric approaches to data analysis. Future directions for utilizing the correntropic loss function will involve designing fast algorithms that can speed up the grid search over the kernel parameter. Furthermore, designing a kernel that improves the asymptotic behavior of the current kernel function will enhance the efficiency of the algorithms.
Entropic learning in the form of MCC, MEE and MEEF has been successfully applied in robust data analysis, including robust regression [56], robust classification [93], robust pattern recognition [43], robust image analysis [42], etc. In Chapter 2, it is shown that the unconstrained MEE and MEEF problems are invex. In general, the invexity property remains intact over a convex feasible space for constrained optimization problems. Therefore, a linear learning mapper (or, in general, a convex mapper) designed to minimize MEE will yield an invex problem. By suitably exploiting the invexity, efficient optimization algorithms can be proposed for the MCC, MEE and MEEF problems. Furthermore, stochastic gradient methods like convolution smoothing can be intelligently applied to solve these problems. In fact, by varying the kernel parameter we move from the convex to the invex domain, which is inherently the notion behind not only convolution smoothing, but also many global optimization algorithms.
Sections 2.6 and 2.7 present the robust algorithmic approach in data analysis.
Typically, the RANSAC philosophy is applied in computer vision and related areas of
data analysis. However, the method is useful in those data analysis scenarios where
a sample can be used to estimate model parameters and validate the other points. In
Chapter 4, the blind signal separation problem is presented and a solution methodology
that involves the RANSAC philosophy is proposed.
Algorithm 2.1: RANSAC Algorithm
input : P data points.
output: Estimated model M⋆.
1  begin
2      Set ε, the tolerance limit;
3      Set θ, a predefined threshold;
4      Set termination = false;
5      while termination == false do
6          Select Si, a set containing n points, from the given P points;
7          Estimate model Mi using the knowledge of set Si;
8          Identify Si^c, the set of points (consensus set) from the original P data points that fall within the ε tolerance limit of Mi;
9          if |Si^c| ≥ θ then
10             Estimate model M⋆ using Si^c;
11             termination = true;
12     return (M⋆);
CHAPTER 3
ROBUST DATA CLASSIFICATION
The goal of Chapter 3 is to present the binary classification problem, and to illustrate the robust non-parametric methods for data classification. In Section 3.1, all the major preliminary topics required to understand the proposed robust methods in data classification are discussed. The purpose of reviewing these topics is to provide sufficient background information to a novice reader. However, they by no means serve as a comprehensive discussion, and interested readers are directed to the appropriate references for detailed discussions. Furthermore, Section 3.2 reviews some of the traditional approaches to binary classification, whereas Section 3.3 presents the proposed robust approaches.
3.1. Preliminaries
The following topics are reviewed in Section 3.1:
• Classification
• Correntropic Function
• Convolution Smoothing (CS)
• Simulated Annealing (SA)
• Artificial Neural Network (ANN)
• Support Vector Machine (SVM)
Classification
Classification (strictly speaking, statistical classification) is a supervised learning
methodology of identifying (or assigning) class labels to an unlabeled data set (a
sub-population of data, whose class is unknown) from the knowledge of a pre-labeled
data set (another sub-population of the same data, whose class is known). The
knowledge of the pre-labeled data set can be used to generate an optimal rule, based on the theory of learning [99, 100]. More specifically, the optimal rule (the discriminant function f) is generated in such a way that it minimizes the risk of assigning incorrect class labels [3, 65]. The classification problem is defined in the following paragraph.

(Some sections of Chapter 3 have been published in Dynamics of Information Systems: Mathematical Foundations.)
Let $D_n$ represent the data set containing the observations, defined as $D_n = \{(x_i, y_i),\ i = 1, \dots, n : x_i \in \mathbb{R}^m \wedge y_i \in \{-1, 1\}\}$, where $x_i$ is an input vector and $y_i$ is the class label for that input vector. Under the assumption that each $(x_i, y_i)$ is an independent and identical realization of the random pair $(X, Y)$, the classification problem can be defined as finding a function $f$ from a class of functions $\mathcal{F}$, such that $f$ minimizes the risk $R(f)$.
Thus, the classification problem can be written as:

minimize: $R(f)$  (3–1a)
subject to:
$(x_i, y_i) \in D_n \quad \forall\, i = 1, \dots, n$,  (3–1b)
$x_i \in \mathbb{R}^m \quad \forall\, i = 1, \dots, n$,  (3–1c)
$y_i \in \{-1, 1\} \quad \forall\, i = 1, \dots, n$,  (3–1d)
$f \in \mathcal{F}$,  (3–1e)
where $R(f)$ is defined as:

$R(f) = P(Y \neq \mathrm{sign}(f(X))) = E[l_{0\text{-}1}(f(X), Y)]$,  (3–2)

where sign is the signum function and $l_{0\text{-}1}$ is the 0-1 loss function; they are defined as:

$\mathrm{sign}(f(X)) = \begin{cases} +1 & \text{if } f(X) > 0 \\ -1 & \text{if } f(X) < 0 \\ 0 & \text{otherwise} \end{cases}$  (3–3)

$l_{0\text{-}1}(f(x), y) = \|(-y f(x))_+\|_0$,  (3–4)
where $(\cdot)_+$ denotes the positive part and $\|\cdot\|_0$ denotes the $L_0$ norm. When $f(x) = 0$, the above definition does not reflect an error; however, this is a rare case and can be easily avoided or adjusted (e.g., by considering $\|(f(x) - y)_+\|_0$). Moreover, it is clear from the definition of $R(f)$ that it requires the knowledge of $P(X, Y)$, the joint probability distribution of the random pair $(X, Y)$. Usually, the joint distribution is unknown. This leads to the calculation of the empirical risk function $R(f)$, which is given as:

$R(f) = \frac{1}{n} \sum_{i=1}^{n} l_{0\text{-}1}(f(x_i), y_i)$.  (3–5)
At this juncture, only Empirical Risk Minimization (ERM) is considered, and any
discussion pertaining to Structural Risk Minimization (SRM) is avoided. However,
SRM will be discussed when the notion of support vector machine is presented.
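The empirical risk of Equation 3–5 reduces to counting sign disagreements; a minimal sketch, where the linear discriminant $f$ is a made-up example:

```python
import numpy as np

def empirical_01_risk(f, X, y):
    """Empirical risk of Equation 3-5: fraction of sign disagreements.
    A zero margin is counted as an error, per the adjusted convention."""
    margins = y * np.array([f(x) for x in X])
    return float(np.mean(margins <= 0))

X = np.array([[1.0], [2.0], [-1.0], [-3.0]])
y = np.array([1, 1, -1, -1])
f = lambda x: x[0] - 0.5          # a hypothetical linear discriminant
risk = empirical_01_risk(f, X, y)
```

Here `risk` is simply the misclassification rate of `f` on the labeled sample.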
Generally, it is not easy to find the optimal solution $f^\star$ of the problem stated in Formulation 3–1, since the space of the function class $\mathcal{F}$ is huge, and there is no efficient way to search over such a space. In order to find the solution, a usual approach is to select the class of functions a priori, and then try to find the best function from the selected function class $\mathcal{F}$. Generally, the selected class of functions can be categorized as parametric or non-parametric. Based on the category of the function class, different learning algorithms can be used to minimize the loss function. Thus, with the above stated restrictions, the classification problem can be represented as:
minimize: $R(f)$  (3–6a)
subject to:
$(x_i, y_i) \in D_n \quad \forall\, i = 1, \dots, n$,  (3–6b)
$x_i \in \mathbb{R}^m \quad \forall\, i = 1, \dots, n$,  (3–6c)
$y_i \in \{-1, 1\} \quad \forall\, i = 1, \dots, n$,  (3–6d)
$f \in \mathcal{F}$.  (3–6e)
In summary, both $R(f)$ and $\mathcal{F}$ will usually be selected before finding $f^\star$. Moreover, the type of risk function and the function class selected significantly determine the accuracy of the classification method. Next, the usage of the correntropy loss function as a risk function in data classification is presented.
Correntropic Function
Although the classification problem stated in Formulation 3–6 looks simple, it has an inherent difficulty, due to the non-convex and non-continuous loss function defined in Equation 3–4. Furthermore, the search over the function space $\mathcal{F}$ is another difficulty in solving Formulation 3–6. The key idea is to propose a loss function that can efficiently replace the loss function given in Equation 3–4. Conventionally, the 0-1 loss function is replaced by a quadratic loss function, i.e., the quadratic risk is given as:

$R(f) = E[(Y - f(X))^2] = E[\varepsilon^2]$.  (3–7)
In general, the knowledge of the Probability Distribution Function (PDF) of $\varepsilon$ is required to calculate the above risk function. However, the quadratic risk can be approximated by the following empirical quadratic risk function:

$R(f) = \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2$,  (3–8)
where $n$ is the number of samples. The replacement of the 0-1 loss function with the quadratic loss function makes Formulation 3–6 computationally easy to solve (due to its convex nature). Moreover, if the function class $\mathcal{F}$ is smooth, then the problem can be solved by gradient descent methods. However, the quadratic loss function performs poorly on noisy data, i.e., the computational simplicity has its price in classification performance. Hence, the usual gradient descent based optimization methods with a quadratic loss function may not provide the globally optimal solution over the class of functions selected ($\mathcal{F}$). In order to overcome this difficulty, the use of the correntropic loss function is proposed.
In order to define the correntropic risk function, consider a function $\varphi_{\beta,\sigma}(f(x), y)$ defined as:

$\varphi_{\beta,\sigma}(f(x), y) = \beta[1 - k_\sigma(1 - y f(x))] = \beta[1 - k_\sigma(1 - \alpha)]$,  (3–9)

where $\alpha = y f(x)$ is called the margin, $\beta = [1 - e^{-1/(2\sigma^2)}]^{-1}$ is a positive scaling factor, and $k_\sigma$ is the Gaussian kernel with width parameter $\sigma$. This function has its roots in the correntropy function (see [72] for more details). Using this information, the correntropic risk function can be rewritten as:
$R(f) = E[\varphi_{\beta,\sigma}(f(X), Y)]$
$\quad = E[\beta(1 - k_\sigma(1 - Y f(X)))]$
$\quad = \beta(1 - E[k_\sigma(1 - Y f(X))])$
$\quad = \beta(1 - \nu(1 - Y f(X)))$
$\quad = \beta(1 - \nu(Y - f(X)))$
$\quad = \beta(1 - \nu(\varepsilon))$.  (3–10)
Due to the unavailability of the PDF, similarly to the quadratic loss function, the empirical correntropic risk function can be defined as:

$R(f) = \beta(1 - \nu(\varepsilon))$,  (3–11)

where $\nu(\varepsilon) = \frac{1}{n} \sum_{i=1}^{n} k_\sigma(y_i - f(x_i))$ and $n$ is the number of samples.
The characteristics of this function for different values of the width parameter are shown in Figure 3-1. Clearly, from Figure 3-1, it can be seen that the function $\varphi_{\beta,\sigma}(f(x), y)$ is convex for higher values of the kernel width parameter ($\sigma > 1$), and as the parameter value decreases, it becomes non-convex. For $\sigma = 1$ it approximates the hinge loss function (the hinge loss is a typical function often used in SVMs). However, for smaller values of the kernel width the function closely approximates the 0-1 loss function, which is mostly unexplored territory for typical classification problems. In fact, kernel widths other than $\sigma = 2$ or $1$ have not been studied for other loss functions. This peculiar property of the correntropic function can be used harmoniously with the concept of convolution smoothing for finding globally optimal solutions. Moreover, with a fixed lower value of the kernel width, suitable global optimization algorithms (heuristics like simulated annealing) can be used to find the globally optimal solution. Next, the elementary ideas behind the different optimization algorithms that can be used with the correntropic loss function are discussed.
Convolution Smoothing (CS)
A Convolution Smoothing (CS) approach [87] forms the basis for one of the proposed methods of correntropic risk minimization. The main idea of the CS approach is sequential learning, where the algorithm starts from a correntropic loss function with a high kernel width and smoothly moves towards one with a low kernel width, approximating the original loss function. The suitability of this approach can be seen in [86], where the authors used a two-step approach for finding the globally optimal solution. The currently proposed method is a generalization of the two-step approach. Before discussing the proposed method, consider the following basic framework of CS. A general unconstrained optimization problem is defined as:

minimize: $g(u)$  (3–12a)
subject to: $u \in \mathbb{R}^n$,  (3–12b)
where $g : \mathbb{R}^n \mapsto \mathbb{R}$. The complexity of solving such problems depends upon the nature of the function $g$. If $g$ is convex, then a simple gradient descent method will lead to the globally optimal solution. Whereas, if $g$ is non-convex, then the gradient descent algorithm will behave poorly and converge to a locally optimal solution (or, in the worst case, to a stationary point).
CS is a heuristic based global optimization method for solving problems of the form of Formulation 3–12 when $g$ is non-convex. It is a specialized form of the stochastic approximation method introduced in 1951 [77]. The usage of convolution in solving convex optimization problems was first proposed in 1972 [4]. Later, as an extension, a generalized method for solving non-convex unconstrained problems was proposed in 1983 [82]. The main motivation behind CS is that the globally optimal solution of a multi-extremal function $g$ can be obtained from the information of a locally optimal solution of its smoothed function. It is assumed that the function $g$ is a convolution of a convex function $g_0$ and other non-convex functions $g_i\ \forall\, i = 1, \dots, n$. The other non-convex functions can be seen as noise added to the convex function $g_0$. In practice, $g_0$ is intangible, i.e., it is impractical to obtain a deconvolution of $g$ into the $g_i$'s such that $\arg\min_u\{g(u)\} = \arg\min_u\{g_0(u)\}$.
In order to overcome this difficulty, a smoothed approximation function $g(u, \lambda)$ is used. This smoothed function has the following main property:

$g(u, \lambda) \to g(u)$ as $\lambda \to 0$,  (3–13)

where $\lambda$ is the smoothing parameter. For higher values of $\lambda$, the function is highly smooth (nearly convex), and as the value of $\lambda$ tends towards zero, the function takes the shape of the original non-convex function $g$. Such smoothed functions can be defined as:

$g(u, \lambda) = \int_{-\infty}^{\infty} h(u - v, \lambda)\, g(v)\, dv$,  (3–14)
where $h(v, \lambda)$ is a kernel function with the following properties:

• $h(v, \lambda) \to \delta(v)$ as $\lambda \to 0$, where $\delta(v)$ is Dirac's delta function.
• $h(v, \lambda)$ is a probability distribution function.
• $h(v, \lambda)$ is a piecewise differentiable function with respect to $u$.
Moreover, the smoothed gradient of $g(u, \lambda)$ can be expressed as:

$\nabla g(u, \lambda) = \int_{-\infty}^{\infty} \nabla h(v, \lambda)\, g(u - v)\, dv$.  (3–15)
Equation 3–15 highlights a very important aspect of CS: it states that information about $\nabla g(v)$ is not required for obtaining the smoothed gradient. This is one of the crucial aspects of the smoothed gradient, which can be easily extended to non-smooth optimization problems where $\nabla g(v)$ does not usually exist.
Furthermore, the objective of CS is to find the globally optimal solution of the function $g$. However, based on the level of smoothness, a locally optimal solution of the smoothed function may not coincide with the globally optimal solution of the original function. Therefore, a series of sequential optimizations with different levels of smoothness is required. Usually, at first, a high value of $\lambda$ is set, and an optimal solution $u^\star_\lambda$ is obtained. Taking $u^\star_\lambda$ as the starting point, the value of $\lambda$ is reduced, and a new optimal value in the neighborhood of $u^\star_\lambda$ is obtained. This procedure is repeated until the value of $\lambda$ is reduced to zero. The idea behind these sequential optimizations is to end up in a neighborhood of $u^\star$ as $\lambda \to 0$, i.e.,

$u^\star_\lambda \to u^\star$ as $\lambda \to 0$,  (3–16)

where $u^\star = \arg\min\{g(u)\}$. The crucial part of the CS approach is the decrement of the smoothing parameter. Different algorithms can be devised to decrement the smoothing parameter. In [87], a heuristic method (similar to simulated annealing) is proposed to decrease the smoothing parameter.
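The sequential procedure can be illustrated on a one-dimensional multi-extremal function whose Gaussian smoothing has a closed form; the test function $g(u) = 0.1u^2 - \cos(3u)$, its smoothed gradient, and the schedule for $\lambda$ are illustrative assumptions, not from the text:

```python
import numpy as np

def grad_smoothed(u, lam):
    """Gradient of the Gaussian-smoothed g(u) = 0.1*u**2 - cos(3*u):
    convolving g with N(0, lam**2) damps the oscillation by exp(-9*lam**2/2)."""
    return 0.2 * u + 3.0 * np.sin(3.0 * u) * np.exp(-4.5 * lam**2)

def descend(u, lam, steps=300, mu=0.05):
    # Plain gradient descent on the smoothed objective at a fixed lambda.
    for _ in range(steps):
        u -= mu * grad_smoothed(u, lam)
    return u

u0 = 2.5                               # starts in the basin of a local minimum
u_plain = descend(u0, lam=0.0)         # no smoothing: gets trapped near u = 2.06
u_cs = u0
for lam in [1.5, 1.0, 0.5, 0.25, 0.0]: # sequentially reduce the smoothing parameter
    u_cs = descend(u_cs, lam)          # warm start from the previous stage
# u_cs tracks the global minimizer at u = 0
```

Each stage plays the role of finding $u^\star_\lambda$; the warm starts are what let the iterates follow the global minimizer as $\lambda \to 0$.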
Apparently, the main difficulty of applying the CS method to an optimization problem is defining a smoothed function with the property given by Equation 3–13. However, CS can be used efficiently with the proposed correntropic loss function, as the correntropic loss function can be seen as a generalized smoothed function for the true loss function (see Figure 3-1). The kernel width of the correntropic loss function can be visualized as the smoothing parameter.

Therefore, the CS method is applicable to solving the classification problem when a suitable kernel width is unknown a priori (a practical situation). On the other hand, if an appropriate value of the kernel width is known a priori (maybe an impractical assumption, but quite possible), then other efficient methods may be developed, like simulated annealing based methods. The crux of Chapter 3 is to present a correntropy minimization method over a non-parametric framework. Generally, the correntropy loss function is invex (and convex in certain cases). However, due to the presence of a non-convex framework, global optimization methods like CS or simulated annealing based methods are proposed.
Simulated Annealing (SA)
Simulated Annealing (SA) is a meta-heuristic method which is employed to find a good solution to an optimization problem. The method stems from thermal annealing, which aims to obtain a perfect crystalline structure (the lowest energy state possible) by slow temperature reduction. Metropolis et al. in 1953 simulated this process of material cooling [13], and Kirkpatrick et al. applied the simulation method to solving optimization problems [53, 70].
SA can be viewed as an upgraded version of greedy neighborhood search. In a neighborhood search method, a neighborhood structure is defined in the solution space, and the neighborhood of the current solution is searched for a better solution. The main disadvantage of this type of search is its tendency to converge to a locally optimal solution. SA tackles this drawback by using concepts from hill-climbing methods [64]. In SA, any neighborhood solution of the current solution is evaluated and accepted with a probability. If the new solution is better than the current solution, then it replaces the current solution with probability 1. Whereas, if the new solution is worse than the current solution, then the acceptance probability depends upon the control parameters (temperature and change in energy). During the early iterations of the algorithm, the temperature is kept high, and this results in a high probability of accepting worse new solutions. After a predetermined number of iterations, the temperature is reduced strategically, and thus the probability of accepting a worse new solution is reduced. These iterations continue until one of the termination criteria is met. The use of high temperature in the earlier iterations (and low temperature in the later iterations) can be viewed as exploration (respectively, exploitation) of the feasible solution space. As each new solution is accepted with a probability, SA is also known as a stochastic method. A complete treatment of SA and its applications is given in [75]. Neighborhood selection strategies are discussed in [2]. Convergence criteria of SA are presented in [57].
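The acceptance rule described above can be sketched as follows; the test function, the Gaussian proposal distribution, and the geometric cooling schedule are illustrative choices:

```python
import math, random

def g(u):
    # An illustrative multi-extremal objective with global minimum at u = 0.
    return 0.1 * u**2 - math.cos(3.0 * u)

def simulated_annealing(u0, T0=2.0, cooling=0.95, sweeps=100, moves=50, seed=0):
    rng = random.Random(seed)
    u, T = u0, T0
    best_u, best_g = u0, g(u0)
    for _ in range(sweeps):
        for _ in range(moves):                 # fixed number of moves per temperature
            v = u + rng.gauss(0.0, 0.5)        # neighborhood proposal
            delta = g(v) - g(u)
            # Accept improvements with probability 1, worse moves with exp(-delta/T).
            if delta <= 0 or rng.random() < math.exp(-delta / T):
                u = v
                if g(u) < best_g:
                    best_u, best_g = u, g(u)
        T *= cooling                           # strategic temperature reduction
    return best_u, best_g

best_u, best_g = simulated_annealing(2.5)      # start inside a local basin
```

Early high-temperature sweeps let the chain escape the local basin (exploration); the cooled later sweeps refine the solution near the global minimizer (exploitation).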
In this work, SA will be used to train the correntropic loss function when the kernel width is known a priori. Although the assumption of a known kernel width seems implausible, any known information about an unknown variable will increase the efficiency of solving an optimization problem. Moreover, a comprehensive knowledge of the data may provide the appropriate kernel width to be used in the loss function. Nevertheless, when the kernel width is unknown, a grid search can be performed over the kernel width space to obtain the kernel width that maximizes the classification accuracy. This is a typical approach when using kernel based soft margin SVMs, which generally involves a grid search over a two-dimensional parameter space.

So far, no discussion of the function class ($\mathcal{F}$) has been presented. In the current work, a non-parametric function class, namely artificial neural networks, and a parametric function class, namely support vector machines, are considered. Next, an introductory review of artificial neural networks is presented.
Artificial Neural Networks (ANNs)
Curiosity about the human brain led to the development of Artificial Neural Networks (ANNs). ANNs are mathematical models that share some of the properties of brain function, such as nonlinearity, adaptability and distributed computation. The first mathematical model that depicted a working ANN used the perceptron, proposed by McCulloch and Pitts [62]. The actual adaptable perceptron model is credited to Rosenblatt [80]. The perceptron is a simple single-layer neuron model, which uses a learning rule similar to gradient descent. However, the simplicity of this model (a single layer) limits its applicability to modeling complex practical problems. Thereby, it became an object of censure in [66]. However, a question which instigated the use of multilayer neural networks was also kindled in [66]. After a couple of decades of research, neural network research exploded with impressive success. Furthermore, multilayered feedforward neural networks have been rigorously established as a function class of universal approximators [46]. In addition, different models of ANNs were proposed to solve combinatorial optimization problems, and the convergence conditions for the ANN optimization models have been extensively analyzed [91].

Processing Elements (PEs) are the primary elements of any ANN. The state of a PE can take any real value in the interval [0, 1] (some authors prefer to use values in [-1, 1]; both definitions are interchangeable and have the same convergence behavior). The main characteristic of a PE is function embedding. In order to understand this phenomenon, consider a single-PE ANN model (the perceptron model) with n inputs and one output, shown in Figure 3-2.
The total information incident on the PE is $\sum_{i=1}^{n} w_i x_i$. The PE embeds this information into a transfer function $\Phi$, and sends the output to the following layer. Since there is a single layer in this example, the output from the PE is considered the final output. Moreover, if we define $\Phi$ as:

$\Phi\left(\sum_{i=1}^{n} w_i x_i + b\right) = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} w_i x_i + b \geq 0 \\ 0 & \text{otherwise,} \end{cases}$  (3–17)
where $b$ is the threshold level of the PE, then the single-PE perceptron can be used for binary classification, given that the data is linearly separable. The difference between this simple perceptron method of classification and support vector based classification is that the perceptron finds a plane that linearly separates the data, whereas the support vector machine finds the plane with maximum margin. This does not indicate the superiority of one method over the other, since a single PE is considered. In fact, this shows the capability of a single PE; but a single PE is incapable of processing the complex information that is required for most practical problems. Therefore, multiple PEs in multiple layers are used as universal classifiers. The PEs interact with each other via links to share the available information. The intensity and sense of the interaction between any two connected PEs is represented by a weight, or synaptic weight, on the link. The term synaptic is related to the nervous system, and is used in ANNs to indicate the weight between any two PEs.
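The thresholded single PE of Equation 3–17 can be sketched directly; the hand-set weights realizing a logical AND are an illustrative choice, not a trained model:

```python
import numpy as np

def perceptron_output(x, w, b):
    """Single-PE forward pass of Equation 3-17: threshold the incident information."""
    return 1 if np.dot(w, x) + b >= 0 else 0

# A hand-set PE realizing logical AND on binary inputs
w, b = np.array([1.0, 1.0]), -1.5
outputs = [perceptron_output(np.array(x), w, b)
           for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]  # -> [0, 0, 0, 1]
```

AND is linearly separable, so a single PE suffices; a function like XOR would require the multilayer networks discussed next.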
Usually, PEs in the (r − 1)th layer sends information to the r th layer using the
following feedforward rule:
yi = i
∑j∈(r−1)
wjiyj − Ui
, (3–18)
where PE i belongs to the r th layer, and any PE j belongs to the (r − 1)th layer. y_i
represents the state of the i th PE, w_ji represents the weight between the j th PE and the i th PE,
and U_i represents the threshold level of the i th PE. Function φ_i(·) is the transfer function for
the i th PE. Once the PEs in the final layer are updated, the error from the actual output
is calculated using a loss function (this is where the correntropic loss function will
be injected). The error or loss calculation marks the end of the feedforward phase of ANNs.
Based on the error information, the backpropagation phase of ANNs starts. In this phase,
the error information is utilized to update the weights, using the following rules:
    w_jk = w_jk + µ δ_k y_j,                                               (3–19)

where

    δ_k = (∂F(ε)/∂ε) φ′(net_k),                                            (3–20)

where µ is the learning step size, net_k = Σ_{j∈(r−1)} w_jk y_j − U_k, and F(ε) is the error
function (or loss function). For the output layer, the deltas are computed as:

    δ_k = δ_0 = (∂F(ε)/∂ε) φ′(net_k) = (y − y_0) φ′(net_k),                (3–21)
and the deltas of the previous layers are updated as:

    δ_k = δ_h = φ′(net_k) Σ_{o=1}^{N_0} w_ho δ_o.                          (3–22)
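The feedforward rule of Equation 3–18 together with the delta-rule updates of Equations 3–19 through 3–22 can be sketched for a single hidden layer as follows (a minimal illustration under a quadratic loss; the network sizes, step size, and random initialization are arbitrary choices, and thresholds are folded into the weights):

```python
import numpy as np

def phi(z):
    """Sigmoidal squashing function phi(z) = (1 - e^-z) / (1 + e^-z)."""
    return (1 - np.exp(-z)) / (1 + np.exp(-z))

def phi_prime(z):
    """Derivative of phi; since phi(z) = tanh(z/2), phi'(z) = (1 - phi(z)^2) / 2."""
    return 0.5 * (1 - phi(z) ** 2)

def forward(x, W1, W2):
    """Feedforward pass (Equation 3-18) with thresholds folded into the weights."""
    net_h = W1 @ x
    y_h = phi(net_h)
    net_o = W2 @ y_h
    return net_h, y_h, net_o, phi(net_o)

def backprop_step(x, y_true, W1, W2, mu=0.1):
    """One delta-rule update (Equations 3-19 to 3-22) under a quadratic loss."""
    net_h, y_h, net_o, y_o = forward(x, W1, W2)
    delta_o = (y_true - y_o) * phi_prime(net_o)    # output-layer delta (3-21)
    delta_h = phi_prime(net_h) * (W2.T @ delta_o)  # hidden-layer deltas (3-22)
    W2 = W2 + mu * np.outer(delta_o, y_h)          # w_jk <- w_jk + mu delta_k y_j (3-19)
    W1 = W1 + mu * np.outer(delta_h, x)
    return W1, W2

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(3, 2))   # 2 inputs, 3 hidden PEs
W2 = rng.normal(scale=0.5, size=(1, 3))   # 1 output PE
x, y = np.array([0.5, -0.2]), np.array([1.0])
err_before = abs(forward(x, W1, W2)[3][0] - y[0])
for _ in range(200):
    W1, W2 = backprop_step(x, y, W1, W2)
err_after = abs(forward(x, W1, W2)[3][0] - y[0])
```

Repeated application of the delta rule on the single sample drives the output toward the target, so err_after falls below err_before.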
In the proposed approaches, the ANN is trained to minimize the correntropic
loss function. In total, two different approaches to train the ANN are proposed: in one
approach, the ANN is trained using the CS algorithm, whereas in the other, the ANN
is trained using the SA algorithm. In order to validate the results, we not only compare
the proposed approaches with conventional ANN training methods, but also compare
them with the support vector machine based classification method. Next, a review of
support vector machines is presented.
Support Vector Machines (SVMs)
Support Vector Machine (SVM) is a popular supervised learning method [9, 22].
It was developed for binary classification problems, but can be extended to
multiclass classification problems [38, 101, 102], and it has been applied in many
areas of engineering and biomedicine [44, 52, 69, 95, 104]. In general, supervised
classification algorithms provide a classification rule able to decide the class of an
unknown sample. In particular, the goal of the SVM training phase is to find a hyperplane
that ‘optimally’ separates the data samples belonging to different classes. More precisely, SVM
is a particular case of hyperplane separation. The basic idea of SVM is to separate two
classes (say A and B) by a hyperplane defined as:
    f(x) = w^t x + b,                                                      (3–23)
such that f(a) < 0 when a ∈ A, and f(b) > 0 when b ∈ B. However, there
could be infinitely many possible ways to select w. The goal of SVM is to choose the best
w according to a criterion (usually the one that maximizes the margin), so that the risk of
misclassifying a new unlabeled data point is minimized. The best separating hyperplane for
unknown data will be the one that is sufficiently far from both classes (this is the basic
notion of SRM), i.e., a hyperplane which is in the middle of the following two parallel
hyperplanes (support hyperplanes) can be used as a separating hyperplane:
    w^t x + b = c                                                          (3–24)
    w^t x + b = −c.                                                        (3–25)

Since w, b and c are all parameters, a suitable normalization leads to:

    w^t x + b = 1                                                          (3–26)
    w^t x + b = −1.                                                        (3–27)
Moreover, the distance between the supporting hyperplanes defined in Equations 3–26
& 3–27 is given by:

    Δ = 2 / ||w||.                                                         (3–28)
In order to obtain the best separating hyperplane, the following optimization problem is
solved:

    maximize:   2 / ||w||                                                  (3–29a)
    subject to: y_i (w^t x_i + b) − 1 ≥ 0   ∀i.                            (3–29b)
The objective given in Equation 3–29a is equivalently replaced by minimizing ||w||²/2. Usually,
the solution to Formulation 3–29 is obtained by solving its dual. In order to obtain the
dual, consider the Lagrangian of Equation 3–29, given as:
    L(w, b, u) = (1/2)||w||² − Σ_{i=1}^{N} u_i ( y_i (w^t x_i + b) − 1 ),  (3–30)
where u_i ≥ 0 ∀i. Now, observe that Formulation 3–29 is convex. Therefore,
strong duality holds, and Equation 3–31 is valid:

    min_{(w,b)} max_{u} L(w, b, u) = max_{u} min_{(w,b)} L(w, b, u).       (3–31)
Moreover, from saddle point theory [5], the following equations hold:

    w = Σ_{i=1}^{N} u_i y_i x_i                                            (3–32)

    Σ_{i=1}^{N} u_i y_i = 0.                                               (3–33)
Therefore, using Equations 3–32 & 3–33, the dual of Formulation 3–29 is given as:

    maximize:   Σ_{i=1}^{N} u_i − (1/2) Σ_{i,j=1}^{N} u_i u_j y_i y_j x_i^t x_j    (3–34a)
    subject to: Σ_{i=1}^{N} u_i y_i = 0,                                   (3–34b)
                u_i ≥ 0   ∀i.                                              (3–34c)
Thus, solving Formulation 3–34 results in obtaining the support vectors, which in turn
lead to the optimal hyperplane. This phase of SVM is called the training phase. The
testing phase is simple and can be stated as:

    y_test = { −1, test ∈ A   if f*(x_test) < 0
             { +1, test ∈ B   if f*(x_test) > 0.                           (3–35)
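Once the dual of Formulation 3–34 is solved, the primal solution is recovered via Equation 3–32 and new points are labeled via Equation 3–35. A small numerical sketch (the dual values u, the labels, and the support vectors below are assumed toy values, not the output of an actual solver):

```python
import numpy as np

# Hypothetical dual solution for a 2-D toy problem: two support vectors,
# one from each class.  These values satisfy sum u_i y_i = 0 (Eq. 3-34b).
u = np.array([0.5, 0.5])                   # dual variables, u_i >= 0 (Eq. 3-34c)
y = np.array([-1.0, 1.0])                  # class labels
X = np.array([[0.0, 0.0], [2.0, 0.0]])     # support vectors (rows)

w = (u * y) @ X                            # w = sum u_i y_i x_i (Equation 3-32)
b = y[1] - w @ X[1]                        # from y_i (w^t x_i + b) = 1 on a support vector

def classify(x_test):
    """Testing phase, Equation 3-35."""
    f = w @ x_test + b
    return -1 if f < 0 else +1
```

Here w comes out as [1, 0] and b as −1, so the separating hyperplane sits midway between the two support vectors.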
The above method works very well when the data are linearly separable. However,
most practical problems are not linearly separable. In order to extend the usability
of SVMs, soft margins and kernel transformations are incorporated into the basic linear
formulation. When considering a soft margin, the constraint in Equation 3–29b is modified as:

    y_i (w^t x_i + b) − 1 + s_i ≥ 0   ∀i,                                  (3–36)
where s_i ≥ 0 are slack variables. The primal formulation is then updated as:

    minimize:   (1/2)||w||² + c Σ_{i=1}^{N} s_i                            (3–37a)
    subject to: y_i (w^t x_i + b) − 1 + s_i ≥ 0   ∀i,                      (3–37b)
                s_i ≥ 0   ∀i.                                              (3–37c)
Similar to the linear SVM, the Lagrangian of Formulation 3–37 is given by:

    L(w, b, u, v) = (1/2)||w||² + c Σ_{i=1}^{N} s_i − Σ_{i=1}^{N} u_i ( y_i (w^t x_i + b) − 1 + s_i ) − v^t s,    (3–38)
where u_i, v_i ≥ 0 ∀i. Correspondingly, using saddle point theory and strong
duality, the soft margin SVM dual is defined as:

    maximize:   Σ_{i=1}^{N} u_i − (1/2) Σ_{i,j=1}^{N} u_i u_j y_i y_j x_i^t x_j    (3–39a)
    subject to: Σ_{i=1}^{N} u_i y_i = 0,                                   (3–39b)
                u_i ≤ c   ∀i,                                              (3–39c)
                u_i ≥ 0   ∀i.                                              (3–39d)
Furthermore, the dot product x_i^t x_j in Equation 3–39a is exploited to overcome
nonlinearity, i.e., by using kernel transformations into a higher dimensional space. Thus,
the soft margin kernel SVM has the following dual formulation:

    maximize:   Σ_{i=1}^{N} u_i − (1/2) Σ_{i,j=1}^{N} u_i u_j y_i y_j K(x_i, x_j)    (3–40a)
    subject to: Σ_{i=1}^{N} u_i y_i = 0,                                   (3–40b)
                u_i ≤ c   ∀i,                                              (3–40c)
                u_i ≥ 0   ∀i,                                              (3–40d)
where K(x, y) is any symmetric kernel. In this dissertation a Gaussian kernel is used,
which is defined as:

    K(x_i, x_j) = e^{−γ ||x_i − x_j||²},                                   (3–41)
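A minimal numpy sketch of the Gaussian kernel matrix of Equation 3–41 (the toy data and the value of γ are illustrative assumptions):

```python
import numpy as np

def gaussian_kernel(X, gamma):
    """Kernel matrix K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2), Equation 3-41.
    Rows of X are samples."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # pairwise squared distances
    return np.exp(-gamma * np.maximum(d2, 0.0))    # clamp tiny negative round-off

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = gaussian_kernel(X, gamma=0.5)
```

The resulting matrix is symmetric with unit diagonal, as required of a valid kernel.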
where γ > 0. Therefore, in order to classify the data, two parameters (c, γ) must be
given a priori. Information about the parameters is obtained from knowledge of the
structure of the input data. However, this information is intangible for practical problems.
Thus, an exhaustive logarithmic grid search is conducted over the parameter space to
find suitable values. It is worthwhile to mention that treating c & γ as variables
of the kernel SVM, and letting the kernel SVM obtain their optimal values,
makes the classification problem in Formulation 3–40 intractable.
Once the parameter values are obtained from the grid search, the kernel SVM is
trained to obtain the support vectors. Usually the training phase of the kernel SVM is
performed in combination with a re-sampling method called cross validation. During
cross validation the existing dataset is partitioned into two parts (training and testing). The
model is built based on the training data, and its performance is evaluated using the
testing data. In [84], a general method to select data for training SVMs is discussed.
Different combinations of training and testing sets are used to calculate average
accuracy. This process is mainly followed in order to avoid manipulation of the classification
accuracy results due to a particular choice of the training and testing datasets. Finally,
the reported classification accuracy is the average classification accuracy over all the
cross validation iterations. There are several cross validation methods available to
build the training and testing sets. In this work, the RRSCV method is used to train the
kernel SVM. The performance accuracy of the SVM is compared with the proposed
approaches.
3.2. Traditional Classification Methods
The goal of any learning algorithm is to obtain the optimal rule f ⋆ by solving
the classification problem illustrated in Formulation 3–6. Based on the type of loss
function used in risk estimation, the type of information representation, and the type of
optimization algorithm, different classification algorithms can be designed. A summary
of the classification methods that are used in this work is listed in Table 3-1. Next, the
conventional non-parametric and parametric approaches are presented.
Conventional Non-parametric Approaches
A classical method of classification using ANN involves training a Multi-Layer
Perceptron (MLP) using a back-propagation algorithm. Usually, a signmodal function
is used as an activation function, and a quadratic loss function is used for error
measurement. The ANN is trained using a back-propagation algorithm involving gradient
descent method [63]. Before proceeding further to present the training algorithms, let us
define the notations:
w^n_jk : The weight between the k th and j th PEs at the nth iteration.

y^n_j : Output of the j th PE at the nth iteration.

net^n_k = Σ_j w^n_jk y^n_j : Weighted sum of all outputs y^n_j of the previous layer at the nth iteration.

φ(·): Sigmoidal squashing function in each PE, defined as:

    φ(z) = (1 − e^{−z}) / (1 + e^{−z}).

y^n_k = φ(net^n_k): Output of the k th PE of the current layer, at the nth iteration.

y^n ∈ {±1}: the true label (actual label) for the nth sample.
Next, the training algorithms are described. These algorithms mainly differ in the
type of loss function used to train ANNs.
Training ANN with Quadratic loss function using Gradient descent (AQG).
This is the simplest and most widely known method of training ANN. A three layered
ANN (input, hidden, and output layers) is trained using a back-propagation algorithm.
Specifically, the generalized delta rule is used to update the weights of ANN, and the
training equations are:
    w^{n+1}_jk = w^n_jk + µ δ^n_k y^n_j,                                   (3–42)

where

    δ^n_k = (∂MSE(ε)/∂ε^n) φ′(net^n_k),                                    (3–43)

where µ is the learning step size, ε = (y^n − y^n_0) is the error (or loss), and MSE(ε) is the
mean square error. For the output layer, the deltas are computed as:

    δ^n_k = δ^n_0 = (∂MSE(ε)/∂ε^n) φ′(net^n_k) = (y^n − y^n_0) φ′(net^n_k).    (3–44)

The deltas of the previous layers are updated as:

    δ^n_k = δ^n_h = φ′(net^n_k) Σ_{o=1}^{N_0} w^n_ho δ^n_o.                (3–45)
Training ANN with Correntropic loss function using Gradient descent (ACG).
This method is similar to the AQG method; the only difference is the use of the correntropic loss
function instead of the quadratic loss function. Furthermore, the kernel width of the correntropic
loss is fixed to a smaller value (in [86], a value of 0.5 is shown to perform well).
Moreover, since the correntropic function is non-convex at that kernel width, the ANN is
trained with a quadratic loss function for some initial epochs. After a sufficient number of
epochs (ACG1), the loss function is changed to the correntropic loss function. Thus ACG1
is a parameter of the algorithm. The reason for using the quadratic loss function in the initial
epochs is to prevent convergence to a local minimum at early learning stages. Similar to
AQG, the delta rule is used to update the weights of the ANN, and the training equations
are:
    w^{n+1}_jk = w^n_jk + µ δ^n_k y^n_j,                                   (3–46)

where

    δ^n_k = (∂F(ε)/∂ε^n) φ′(net^n_k),                                      (3–47)

where µ is the step length, and F(ε) is a general loss function, which can be either the
quadratic or the correntropic function based on the current number of training epochs. For
the output layer, the deltas are computed as:

    δ^n_k = δ^n_0 = (∂F(ε)/∂ε^n) φ′(net^n_k)
          = { (β/σ²) e^{−(y^n − y^n_0)² / (2σ²)} (y^n − y^n_0) φ′(net^n_k)   if F ≡ C-loss function
            { (y^n − y^n_0) φ′(net^n_k)                                      if F ≡ MSE function,    (3–48)
where C-loss is the correntropic loss. The deltas of the previous layers are updated as:

    δ^n_k = δ^n_h = φ′(net^n_k) Σ_{o=1}^{N_0} w^n_ho δ^n_o.                (3–49)
Based on the results of [86], the value of ACG1 is taken as 5 epochs. The purpose of
comparing the proposed approaches with the ACG method is to assess the improvement in
classification accuracy when the kernel width changes smoothly.
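The two branches of Equation 3–48 differ only in the factor multiplying φ′(net); a small sketch makes the robustness of the C-loss visible (β and the kernel width σ below are assumed values):

```python
import numpy as np

beta, sigma = 1.0, 0.5   # beta and the kernel width are assumed values

def delta_factor_mse(err):
    """Error factor of the MSE branch of Equation 3-48 (the part before phi'(net))."""
    return err

def delta_factor_closs(err):
    """Error factor of the C-loss branch of Equation 3-48 (the part before phi'(net))."""
    return (beta / sigma ** 2) * np.exp(-err ** 2 / (2 * sigma ** 2)) * err

# Under the quadratic loss the update grows linearly with the error,
# while the correntropic factor attenuates large (outlier) errors.
small_err, large_err = 0.1, 3.0
```

For a small error the two factors are comparable, but for a large error the Gaussian factor of the C-loss drives the update toward zero, which is exactly the robustness to outliers motivating the correntropic loss.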
Conventional Parametric SVM Approach
Training soft margin SVM with Gaussian kernel (SGK). SVM is one of the most
widely known parametric methods in classification. In the present work, a Gaussian
kernel based soft margin SVM is used. The SVM is implemented in two steps. In the
first step, optimal parameters (kernel width and cost penalty) are obtained via an exhaustive
search over the parameter space. In the second step, the kernel SVM is trained with these
optimal parameters.
From the grid search, appropriate values of the parameters are selected. Based
on the selected parameter values, the SVM is trained with 100 Monte-Carlo
simulations. In each simulation, the data is divided into two random subsets for training
and testing (the RRSCV method). The kernel SVM is used in Chapter 3 to compare against the
results of the proposed algorithms. Next, the proposed algorithms are presented.
3.3. Proposed Classification Methods
In Section 3.3, two optimization methods that utilize the correntropic loss function
are proposed. In one of the methods, the kernel width acts as a variable, whereas in the
other method, the kernel width is set as a parameter.
Training ANN with Correntropic Loss Function Using Convolution Smoothing (ACC)
Similar to the previous ANN based methods, a back-propagation algorithm is used
to train the ANN, i.e., in this method the weights are updated using the delta rule.
However, the cost function F is always the correntropic function, and the kernel width
σ is changed over the training period. The kernel width acts as the smoothing parameter
of the CS algorithm, and initially the kernel width is set to a value of 2. As the algorithm
proceeds, the kernel width is smoothly reduced until it reaches 0.5. Furthermore, as the
algorithm progresses, if the delta rule leads to a high error value, then the kernel width is
increased to a value of 2 with probability P_accept, to escape from local minima. This
probability is reduced exponentially with the number of epochs. The ACC method
can be seen as a stochastic CS method which minimizes the correntropic loss function.
The training equations for the underlying ANN framework are as follows:

    w^{n+1}_jk = w^n_jk + µ δ^n_k y^n_j,                                   (3–50)

where for the output layer, the deltas and weights are computed as:

    δ^n_k = (∂F^σ_C(ε)/∂ε^n) φ′(net^n_k)                                   (3–51)

    δ^n_k = δ^n_0 = (∂F^σ_C(ε)/∂ε^n) φ′(net^n_k)                           (3–52)
          = (β/σ²) e^{−(y^n − y^n_0)² / (2σ²)} (y^n − y^n_0) φ′(net^n_k),  (3–53)
where F^σ_C ≡ the correntropic loss function with kernel width σ, and F^σ_C(ε) is the error at the
output layer. The deltas of the previous layers are updated as:

    δ^n_k = δ^n_h = φ′(net^n_k) Σ_{o=1}^{N_0} w^n_ho δ^n_o.                (3–54)
The ACC method is illustrated in Algorithm 3.1 for a given n × p data matrix with r
elements in the middle layer. Algorithm 3.1 represents the ACC learning method for the
block update scenario. For the sample by sample update scenario, Algorithm 3.1 is
adjusted appropriately to incorporate the CS mechanism.

In Algorithm 3.1, σ_0 and α_1 are the parameters that control the flow of the ACC method,
and their values are taken as 2 and 0.5e respectively (where e is a vector of ones). f_1 and f_2 are
the functions that update σ, and P_accept is the probability of accepting noisy solutions. For the
sake of simplicity, f_1 and P_accept are taken as exponentially decreasing functions, and f_2
resets σ to a value of 2.
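A minimal sketch of the kernel-width schedule described above (the decay rates below are illustrative assumptions; only the endpoints 2 and 0.5 come from the text):

```python
sigma0, sigma_min = 2.0, 0.5   # initial and final kernel widths (values from the text)
alpha = 0.99                   # assumed exponential decay rate

def f1(sigma):
    """Smoothly shrink the kernel width toward sigma_min (exponentially decreasing)."""
    return max(sigma_min, alpha * sigma)

def f2(sigma):
    """Reset the kernel width to 2 to help escape a local minimum."""
    return sigma0

def p_accept(epoch, p0=0.5, decay=0.9):
    """Exponentially decreasing probability of accepting a noisy reset (assumed form)."""
    return p0 * decay ** epoch

sigma = sigma0
for _ in range(1000):          # the schedule flattens out at sigma_min
    sigma = f1(sigma)
```

After enough epochs the width settles at 0.5, matching the smooth reduction from 2 to 0.5 described in the text.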
Training ANN with Correntropic Loss Function Using Simulated Annealing (ACS)
Unlike the previous gradient descent based learning methods, in this method an SA
algorithm is used to train the ANN, i.e., no gradient search is involved. This method
assumes that the correntropic loss function has a fixed kernel width. Since the kernel
width determines the convexity of the loss function, a gradient descent method cannot
be used as a learning method in a generalized framework. Hence, the SA algorithm is
used as a learning method to avoid convergence to a local minimum. The ACS method
is illustrated in Algorithm 3.2 for a given n × p data matrix with r elements in the middle
layer. Furthermore, σ = σ̄ is a given parameter of the algorithm. Moreover, the ACS
algorithm is used in block update mode only, unlike the ACC algorithm (i.e., the ACC
algorithm can be used in a sample or block based update mode).
In Algorithm 3.2, T_0 is the initial temperature, and its value is taken as 1. f_1(T)
and P_accept(T) are two different functions of temperature. f_1(T) is a simple exponential
cooling function, whereas P_accept(T) is an exponential acceptance probability, which depends
upon the values of T, ℓ_a and ℓ_{a−1}. There are two termination criteria for the ACC and ACS
methods: either the total error falls below minErr (taken as 0.001), or the number
of epochs exceeds MaxEpochs (MaxEpochs is a parameter for the experimental runs,
and is varied from 1, ..., 10).
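The accept/reject logic of the ACS method can be sketched on a generic one-dimensional loss (the neighbor move, cooling rate, and toy loss below are illustrative assumptions, not the ANN setting of Algorithm 3.2):

```python
import math
import random

def simulated_annealing(loss, w0, T0=1.0, cooling=0.95, max_epochs=500,
                        min_err=1e-3, seed=0):
    """Generic SA loop: cool the temperature, propose a neighbor, accept
    better moves always and worse moves with a temperature-dependent probability."""
    rng = random.Random(seed)
    cur_w, cur_l = w0, loss(w0)
    best_w, best_l = cur_w, cur_l
    T = T0
    for _ in range(max_epochs):
        T *= cooling                           # f1(T): exponential cooling
        cand = cur_w + rng.uniform(-0.5, 0.5)  # neighbor(W_{a-1})
        l = loss(cand)
        # accept if better, or with a probability that shrinks as T cools
        if l < cur_l or rng.random() < math.exp(-(l - cur_l) / max(T, 1e-12)):
            cur_w, cur_l = cand, l
            if cur_l < best_l:
                best_w, best_l = cur_w, cur_l
        if best_l < min_err:
            break
    return best_w, best_l

# Toy quadratic loss with minimum at w = 3; SA should move toward it.
w_best, err = simulated_annealing(lambda w: (w - 3.0) ** 2, w0=0.0)
```

Because worse moves are occasionally accepted at high temperature, the search can escape local minima, which is the reason SA is used when the correntropic loss is non-convex.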
The implementation of the proposed algorithms on simulated and real data is
presented in Chapter 5. In Chapter 4, another well known problem of data analysis is
introduced, and robust methods to solve it are proposed.
Table 3-1. Notation and description of proposed and existing methods

    Notation | Status   | Information Representation | Loss Function                                                      | Optimization Algorithm
    AQG      | Existing | Non-parametric (ANN)       | Quadratic                                                          | Exact method - Gradient Descent
    ACG      | Existing | Non-parametric (ANN)       | Initially quadratic, shifts to correntropy with fixed kernel width | Exact method - Gradient Descent
    ACC      | Proposed | Non-parametric (ANN)       | Correntropy with varying kernel width                              | Heuristic method - Convolution Smoothing
    ACS      | Proposed | Non-parametric (ANN)       | Correntropy with fixed kernel width                                | Heuristic method - Simulated Annealing
    SGQ      | Existing | Parametric (SVM)           | Quadratic with Gaussian kernel                                     | Exact method - Quadratic Optimization
Algorithm 3.1: ACC Method
input : Classification data, structure and transfer functions of ANN
output: Optimal weights

begin
    Randomly initialize W(0);
    Set σ = σ_0, µ = µ_0;
    Set termination = false;
    while termination == false do
        Execute BLOCK FEEDFORWARD PHASE - ANN;
        if random() < P_accept then
            σ = f_1(σ);
        else
            σ = f_2(σ);
        if F^σ_C(ε) < minErr then
            termination = true;
        Execute BLOCK BACKPROPAGATION PHASE - ANN;
    return (W);
Figure 3-1. Correntropic, quadratic and 0-1 loss functions. A) Margin on x-axis. B) Error on x-axis.
Algorithm 3.2: ACS Method
input : Classification data, structure and transfer functions of ANN
output: Optimal weights

begin
    Randomly initialize W(0);
    Set σ = σ̄, µ = µ_0;
    Initialize a = 0 and T = T_0;
    ℓ_0 = F^σ_C(ε_0); Set termination = false;
    while termination == false do
        T = f_1(T);
        a = a + 1;
        W_a = neighbor(W_{a−1});
        Execute BLOCK FEEDFORWARD PHASE - ANN;
        ℓ_a = F^σ_C(ε_a);
        if ℓ_a < minErr then
            termination = true;
        if ℓ_a ≥ ℓ_{a−1} then
            if random() ≥ P_accept(T) then
                W_a = W_{a−1};
                ℓ_a = ℓ_{a−1};
    return (W);
Figure 3-2. Perceptron
CHAPTER 4
ROBUST SIGNAL SEPARATION
Signal separation is a specific case of signal processing, which aims at identifying
unknown source signals s_i(t) (i = 1, ..., n) from their observable mixtures x_j(t) (j =
1, ..., m). In this problem, a mixture is assumed to be a linear transformation of the sources,
i.e., x(t) = A s(t), where A ∈ R^{m×n} is the mixing matrix (sometimes called the
dictionary). Typically, t is any acquisition variable, over which a sample of the mixture
(a column, for a discrete acquisition variable) is collected. The most common types of
acquisition variables are time and frequency. However, position, wave number, and
other indices can be used depending on the nature of the physical process under
investigation. In addition to the sources being unknown, knowledge about the mixing is
also assumed to be unavailable. The generative model of the problem in its standard form can
be written as:
X = A S + N, (4–1)
where X ∈ Rm×N denotes the mixture matrix, A ∈ Rm×n is the mixing matrix, S ∈ Rn×N
denotes the source matrix, and N ∈ R^{m×N} denotes uncorrelated noise. Since both A
and S are unknown, the signal separation problem is called the “Blind” Signal Separation
(BSS) problem. The BSS problem first appeared in [45], where the authors proposed
the seminal idea of BSS via an example of two source signals (n = 2) and
two mixture signals (m = 2). Their objective was to recover the source signals from the
mixture signals, without any further information.
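The generative model of Equation 4–1 is easy to instantiate numerically; a toy two-source, two-mixture example (all signals, the mixing matrix, and the noise level below are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
n, m, N = 2, 2, 500                 # sources, mixtures, samples (toy sizes)

t = np.linspace(0, 1, N)            # acquisition variable (time here)
S = np.vstack([np.sin(2 * np.pi * 5 * t),             # source 1: sinusoid
               np.sign(np.sin(2 * np.pi * 3 * t))])   # source 2: square wave
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])          # unknown mixing matrix (assumed here)
Noise = 0.01 * rng.normal(size=(m, N))
X = A @ S + Noise                   # generative model, Equation 4-1
```

A BSS algorithm would be given only X and asked to recover both A and S.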
A classical illustrative example for the BSS model is the cocktail party problem
where a mixture of sound signals from simultaneously speaking individuals is available
(see Figure 4-1 for a simple illustration). In a nutshell, the goal in BSS is to identify
Some sections of Chapter 4 have been published in Computers & Operations Research and Neuromethods.
and extract the sources (Figure 4-1B) from the available mixture signals (Figure 4-1A).
This problem has caught the attention of many researchers, due to its wide applicability in
different scientific research areas. A general setup of the BSS problem in computational
neuroscience is depicted in Figure 4-2. Any surface (or scalp) noninvasive cognitive
activity recording can be used as a specific example. Depending upon the scenario, the
mixture can be EEG, MEG or fMRI data. Typically, physical substances like the skull, brain
matter, muscles, and the electrode-skull interface act as mixers. The goal is to identify the
internal source signals, which hopefully reduces the mixing effect during further analysis.
Currently, most of the approaches of BSS in computational neuroscience are based
on the statistical independence assumptions. There are very few approaches that
exploit the sparsity in the signals. Sparsity assumptions can be considered more flexible
for BSS than the independence assumption, since independence
requires the sources to be at least uncorrelated. In addition to that, if the number
of sources is larger than the number of mixtures (underdetermined case), then the
statistical independence assumption cannot reveal the sources, but it can reveal the
mixing matrix. For sparsity based approaches, there are very few papers in the literature
(compared to independence based approaches) that have been devoted to develop
identifiability conditions, and to develop the methods of uniquely identifying (or learning)
the mixing matrix [1, 34, 37, 54].
In Section 4.1, an overview of the BSS problem is presented. Sufficient identifiability
conditions are reviewed, and their implications on the solution methodology are discussed
in Section 4.2. Different well known approaches that are used to find the solution of the BSS
problem are also briefly presented. Finally, the proposed algorithms are presented in
Section 4.3.
Other Look-alike Problems. BSS is a special type of Linear Matrix Factorization
(LMF) problem. There are many other methods that can be described in the form of
LMF. For instance, Nonnegative Matrix Factorization (NMF), Morphological Component
Analysis (MCA), Sparse Dictionary Identification (SDI), etc. The three properties that
differentiate BSS from other LMF problems are:
• The model is assumed to be generative: In BSS, the data matrix X is assumed to be a linear mixture of S.

• Completely unknown source and mixing matrices: Some of the LMF methods (like MCA) assume partial knowledge about the mixing.

• Identifiable source and mixing matrices: Some of the LMF methods (like NMF, SDI) focus on estimating A and S without any condition for identifiability. NMF can be considered a dimensionality reduction method like Principal Component Analysis (PCA). Similarly, SDI estimates A such that X = A S, and S is as sparse as possible. Although the NMF and SDI problems look similar to BSS, they have no precise notion of the source signals or their identifiability.
4.1. Signal Separation Problem
From this point on, a flat representation of the mixture data is assumed, i.e., the mixture signals
can be represented by a matrix containing a finite number of columns. Before presenting
the formal definition of the BSS problem, consider the following notation that will be
used throughout Chapter 4: A scalar is denoted by a lowercase letter, such as y. A
column vector is denoted by a bold lowercase letter, such as y, and a matrix is denoted
by a bold uppercase letter, such as Y. For example, in Chapter 4, the mixtures are
represented by the matrix X. The i th column of matrix X is represented as x_i. The i th row of
matrix X is represented as x_{i•}. The i th row, j th column element of matrix X is represented
as x_{i,j}.
Now, the BSS problem can be mathematically stated as: Let X ∈ R^{m×N} be
generated by a linear mixing of sources S ∈ R^{n×N}. Given X, the objective of the BSS
problem is to find two matrices A ∈ R^{m×n} and S, such that the three matrices are related
as X = A S. In the theoretical development of the problem and the solution methods,
the noise factor is ignored. Without noise the problem may appear easy; however, from the
very definition of the problem, it can be seen that the solution of the BSS problem suffers
from uniqueness and identifiability issues. Thus the notion of a “good”
solution to the BSS problem must be precisely defined. Next, the uniqueness and
identifiability issues are explained.
Uniqueness: Let Λ ∈ R^{n×n} be a diagonal matrix and Π ∈ R^{n×n} a permutation matrix.
Let A and S be such that X = A S. Consider the following:

    X = A S = (A Λ Π) (Π^{−1} Λ^{−1} S) = A_a S_a.
Thus, even if A and S are known, there can be infinitely many equivalent solutions of the
form A_a and S_a. The goal of a good BSS solution algorithm should be to find at least one
of the equivalent solutions. Due to the inability of finding a unique solution, not only
is the information regarding the order of the sources lost, but the information about the energy
contained in the sources is lost as well. Generally, normalization of the rows of S may be used to
tackle the scaling ambiguity. Also, a relative or normalized form of the energy can be used in
further analysis. Theoretically, any information pertaining to the order of the sources is impossible to
recover. However, problem specific knowledge can be helpful in identifying the correct order
for further analysis.
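The scaling and permutation ambiguity can be verified numerically in a few lines (A, S, Λ, and Π below are arbitrary illustrative choices):

```python
import numpy as np

A = np.array([[1.0, 2.0], [0.5, 1.5]])
S = np.array([[1.0, 0.0, 2.0], [0.0, 3.0, 1.0]])
X = A @ S

Lam = np.diag([2.0, 0.5])                # arbitrary diagonal (scaling) matrix
Pi = np.array([[0.0, 1.0], [1.0, 0.0]])  # permutation matrix (swap the sources)

# An equivalent factorization: X = (A Lam Pi)(Pi^-1 Lam^-1 S)
A_a = A @ Lam @ Pi
S_a = np.linalg.inv(Pi) @ np.linalg.inv(Lam) @ S
```

The product A_a S_a reproduces X exactly, even though the recovered sources are scaled and reordered versions of the originals.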
Identifiability: Let Γ ∈ R^{n×n} be any nonsingular matrix. Let A and S be such that
X = A S. Consider the following:

    X = A S = (A Γ) (Γ^{−1} S) = A_Γ S_Γ.

Thus, even if A and S are known, there can be infinitely many non-identifiable solutions of
the form A_Γ and S_Γ. The goal of a BSS solution algorithm is to avoid the non-identifiable
solutions. Typically, the issue of identifiability arises from the dimension and structure of
A and S. The key idea to correctly identify both the matrices (of course with unavoidable
scaling and permutation ambiguity) is to impose structural properties on S while solving
the BSS problem (see Figure 4-3). Some widely known BSS solution approaches [90]
from the literature are summarized below.
Statistical Independence Assumptions: One of the earliest approaches to
solve the BSS problem is to assume statistical independence among the source
signals. These approaches are termed Independent Component Analysis (ICA)
approaches. The fundamental assumption in ICA is that the rows of matrix S are
statistically independent and non-gaussian [50, 94].
Sparse Assumptions: Apart from ICA, the other type of approaches, which provide
sufficient identifiability conditions, are based on the notion of sparsity in the S matrix.
These approaches can be named Sparse Component Analysis (SCA) approaches.
There are two distinct categories in the sparse assumptions:

• Partially Sparse Nonnegative Sources (PSNS): In this category, along with a certain level of sparsity, the elements of S are assumed to be nonnegative. Ideas of this type of approach can be traced back to the Nonnegative Matrix Factorization (NMF) method. The basic assumption in NMF is that the elements of S and A are nonnegative [21]. However, in the case of the BSS problem the nonnegativity assumptions on the elements of matrix A can be relaxed [67] without damaging the identifiability of A and S.

• Completely Sparse Components (CSC): In this category, no sign restrictions are placed on the elements of S, i.e., s_{i,j} ∈ R. The only assumption used to define the identifiability conditions is the existence of a certain level of sparsity in every column of S [32].
At present, these are the only known BSS approaches that can provide sufficient
identifiability conditions (uniqueness up to permutation and scaling). In fact, the
sparsity based approaches (see [34, 67]) are relatively new in the area of BSS when
compared to the traditional statistical independence approaches (see [50]). One of the
novelties that sparsity based methods brought to the BSS problem is the verifiability of
the sparse assumptions on finite length data. Furthermore, not only overdetermined
but also underdetermined scenarios of the BSS problem can be handled by the sparsity
based methods. However, the underdetermined scenario requires a higher level of sparsity
than the simple m = n scenario. In Section 4.2, a brief discussion of the important
issues of the sparsity based methods is presented [90].
4.2. Traditional Sparsity Based Methods
The earliest methods that proposed the notion of sparsity and the identifiability
conditions for BSS problems can be found in [33, 34, 67]. From the literature, the different
approaches to solve the Sparse Component Analysis (SCA) problem can be grouped into
two distinct classes. The main difference between the two classes is the
nonnegativity assumption on the elements of the S matrix. The reason for this division
is the structure of the resulting SCA problem. Typically, when the sources are
non-negative, the SCA problem boils down to a convex programming problem.
Thus, the algorithms for the class with nonnegativity assumptions are computationally
inexpensive. For the other class, the SCA problem generally results in a
nonconvex optimization problem. Therefore, finding a global optimal solution, when
the source elements are real, is a computationally expensive task.
SCA can be considered a more flexible method for BSS than ICA. ICA requires the
sources to be statistically independent, whereas SCA requires sparsity of the sources (a
weaker assumption). In addition, ICA is not suitable if the number of sources is
larger than the number of mixtures (the underdetermined case). Typical ideas of SCA can
be found in [34, 37, 54]. Furthermore, the identifiability conditions on X that improve the
separability of the sources have been studied by a few researchers [1, 34].
Partially Sparse Nonnegative Sources (PSNS)
In many physiological data scenarios, the notion that the source signal is nonnegative
seems to be valid; for example, medical imaging, NMR, ICP, HR, etc. Using this ideology,
and the fact that ICA at least requires completely uncorrelated source signals, a partially
correlated BSS method can be developed. A source matrix S is defined to be partially
correlated when the rows of a certain set of columns of S are uncorrelated, while
the rows of the full S matrix are correlated. For sources on which the nonnegativity
assumption holds, the partially correlated assumption is less restrictive than ICA. The
primary idea on which this class of SCA methods works can be summarized as: any
vector x_i ∀i = 1, ..., N is nothing but a nonnegative linear combination of the vectors
a_j ∀j = 1, ..., n. Thus, sparse assumptions on S may lead to proper identification
of A, which can be exploited in order to identify A and S. One of the earliest approaches
of this type, presented by Naanaa and Nuzillard [67], is called the Positive and
Partially Correlated (PPC) method. Next, the sufficient identifiability conditions for PPC
will be discussed.
Sufficient Identifiability Conditions on A and S for PPC [67]
Following are the two sufficient conditions, which are required for unique identification
of A and S (up to scaling and permutation ambiguity):
• PPC1: There exists a diagonal submatrix in S: for each row i of S there exists a j ∈ {1, ..., N} such that s_{i,j} = 0 and s_{k,j} > 0 for k = 1, ..., i − 1, i + 1, ..., n.

• PPC2: The columns of A are linearly independent.
Implication of the Identifiability Conditions for PPC
Due to the restriction given in PPC1, the PPC BSS problem boils down to the
following: all the columns of matrix X span a cone in R^m, where the edges of the cone
are nothing but the columns of matrix A. Using this simplification, suitable linear or
convex programming problems can be solved to identify the edges of the cone spanned
by the columns of X. Finding these edges results in the identification of A. The matrix S can then
be obtained by using the Moore-Penrose pseudoinverse of A.
PPC Approaches
In [67], a least squares minimization problem is proposed to solve the PPC problem.
The formulation is given as:

minimize : ‖ ∑_{i=1, i≠j}^{N} α_i x_i − x_j ‖²  (4–2a)

subject to :

α_i ≥ 0 ∀ i.  (4–2b)
In addition to the above formulation, based on the same edge extraction idea, many
recent works are directed towards efficient edge extraction from X [14, 103].
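The edge-extraction idea behind Formulation 4–2 can be sketched with nonnegative least squares: a column that cannot be represented as a nonnegative combination of the other columns leaves a large residual and is therefore an edge of the cone. This is an illustrative sketch (function name and toy data are the author's of this sketch, not from [67]):

```python
import numpy as np
from scipy.optimize import nnls

def ppc_edge_residuals(X):
    """For each column x_j, solve the nonnegative least-squares
    problem of Formulation 4-2: min || sum_{i != j} a_i x_i - x_j ||,
    a_i >= 0.  A large residual means x_j cannot be represented by
    the remaining columns, i.e. it lies on an edge of the cone."""
    m, N = X.shape
    residuals = np.empty(N)
    for j in range(N):
        others = np.delete(X, j, axis=1)
        _, res = nnls(others, X[:, j])   # res is the residual 2-norm
        residuals[j] = res
    return residuals

# Toy mixture: columns 0 and 1 are the cone edges, column 2 is interior.
X = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.5]])
r = ppc_edge_residuals(X)
```

Thresholding the residuals then separates edge columns (candidates for columns of A) from interior points.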
Another recent modification of the PPC approach is called Positive everywhere Partially
orthogonal Dominant intervals (PePoDi) [88]. In PePoDi, the PPC1 condition is modified
by stating that the last row of S is positive dominant and does not satisfy PPC1.
However, this modification comes at the price of restricting A to be nonnegative.
Thus, the PePoDi method can be seen as a special case of the NMF problem.
Complete Sparse Component Sources [36]
When the sources are not nonnegative, the BSS problem transforms into a
nonconvex optimization problem. In fact, the only identifiability condition known for
real-valued sources is sparsity in each column of S. Before defining the complete sparse
component (CSC) criteria, consider the following definitions:
CSC-conditioned: A matrix M is said to be CSC-conditioned if every square
submatrix of M is nonsingular.

CSC-sparse: A matrix M is said to be CSC-sparse if every column of M has at most
m − 1 nonzero elements.

CSC-representable: A matrix M is said to be CSC-representable if for any n − m + 1
selected rows of M, there exist m columns such that:
• All the m columns contain zeros in the selected rows, and
• Any m − 1 subset of the m columns is linearly independent.
Sufficient Identifiability Conditions on A and S for CSC
Following are the three sufficient conditions, which are required for unique
identification of A and S (up to scaling and permutation ambiguity):
• CSC1: A is CSC-conditioned,
• CSC2: S is CSC-sparse,
• CSC3: S is CSC-representable
Implication of the Identifiability Conditions for CSC
Due to the restrictions given in CSC2 and CSC3, the CSC BSS problem boils down
to the following: all the columns of matrix X lie on n hyperplanes passing through the
origin, where the normal vectors of the hyperplanes are nothing but the orthonormal
complement of the matrix A. Using this transformation, suitable hyperplane clustering
methods can be used to identify the hyperplanes defined by X. Since hyperplane
clustering is nonconvex, the CSC BSS problem is relatively difficult to solve
compared to the PPC BSS problem.
CSC Approaches
Given data matrix X ∈ Rm×N, the goal of CSC is to find two matrices, namely,
the mixing matrix (A ∈ Rm×n) and the source matrix (S ∈ Rn×N), such that X = A · S.
Under the CSC1, CSC2 and CSC3 assumptions, uniqueness up to permutation and
scaling can be achieved. Next, the basic formulation of the CSC BSS problem is
described, and different improvements over the basic formulation are proposed. Before
proceeding further, let us describe the notation that will be used in the following
formulations:
Given Data:
p : index for a point, p ∈ {1, ... , N}
X : (x1, ... , xN) = data matrix of N points, xp ∈ Rm
n : the column size of the dictionary matrix

Variables:
h : index for a hyperplane, h ∈ {1, ... , n}
wh : normal vector of the hth hyperplane, wh ∈ Rm
uhp : distance between the pth point and the hth hyperplane, uhp ∈ R+
thp : 1 if the pth point belongs to the hth hyperplane, 0 otherwise
vhp : ancillary variable, which represents the product thp·uhp in a linearised form
Mathematically, the set of hyperplanes containing the data points is a solution to
Formulation 4–3:

minimize : ∑_{p=1}^{N} min_{1≤h≤n} (w_h^t x_p − b_h)²  (4–3a)

subject to :

‖w_h‖₂ = 1,  (4–3b)
w_h ∈ R^m,  (4–3c)
b_h ∈ R.  (4–3d)

Therefore, any solution of Formulation 4–3 will represent a w(2)-skeleton of X [10]. It
consists of n hyperplanes defined as:

H_h = {x ∈ R^m : w_h^t x = b_h} ∀ h = 1, ... , n.  (4–4)
Another approach for hyperplane clustering is presented in [81], which can be described
via Formulation 4–5:

minimize : ∑_{p=1}^{N} min_{1≤h≤n} |w_h^t x_p − b_h|  (4–5a)

subject to :

(4–3b)–(4–3d).  (4–5b)

The solution to Formulation 4–5 defines the w(1)-skeleton of X. Formulation 4–5 is
analogous to Formulation 4–3 in defining the hyperplanes. However, the main difference
is that Formulation 4–5 minimizes the absolute distances, whereas Formulation 4–3
minimizes the squared distances. This does not seem to be a huge difference;
however, absolute distance minimization is considered to be a robust approach.
The equivalence of both formulations and the uniqueness of their solutions under
sparsity assumptions are discussed in [20, 32]. Moreover, Georgiev et al. [32] have
reduced the hyperplane clustering problem to a bilinear formulation in the case
when every data point belongs to only one skeleton hyperplane (and therefore, the
minimum value in Formulation 4–5 is zero). Then Formulation 4–5 is equivalent to
Formulation 4–6.

In order to obtain the bilinear formulation, the nonlinear constraint given in
Equation 4–6e is replaced with w_h^t e = 1 (where e is the vector of all ones). This
replacement does not change the hyperplanes: those defined by solutions of the modified
problem coincide with those defined by solutions of Formulation 4–6. Different
optimization methods can be applied to solve the bilinear problem. In [32], an n-plane
clustering algorithm via linear programming is proposed to solve the bilinear problem.
Algorithm 4.1 briefly describes this n-plane clustering algorithm. The initial approaches
to the CSC BSS problem are based on this bilinear hyperplane clustering approach [32].
However, the main drawback of the algorithm is its
convergence to local minima. In fact, most of the hyperplane clustering methods in the
literature are confined to 7 to 8 dimensions.
minimize : ∑_{p=1}^{N} ∑_{h=1}^{n} t_hp u_hp  (4–6a)

subject to :

w_h^t x_p ≤ u_hp ∀ h, p,  (4–6b)
w_h^t x_p ≥ −u_hp ∀ h, p,  (4–6c)
∑_h t_hp = 1 ∀ p,  (4–6d)
‖w_h‖₂ = 1 ∀ h,  (4–6e)
t_hp ≤ 1 ∀ h, p,  (4–6f)
t_hp ≥ 0 ∀ h, p,  (4–6g)
w_h ∈ R^m ∀ h,  (4–6h)
u_hp ≥ 0 ∀ h, p.  (4–6i)
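The alternating assignment-and-refit idea behind the n-plane clustering of Algorithm 4.1 can be sketched as follows. This is a simplified stand-in: the refit step here uses the singular vector of the smallest singular value instead of the LP subproblem of [32], and random restarts guard against the local minima noted in the text; all names and the toy data are illustrative:

```python
import numpy as np

def n_plane_cluster(X, n, iters=30, restarts=5, seed=0):
    """Alternating hyperplane clustering: assign each point to its
    nearest hyperplane through the origin, then refit each normal
    vector from its assigned points (SVD refit, not the LP of [32])."""
    m, N = X.shape
    rng = np.random.default_rng(seed)
    best_W, best_labels, best_err = None, None, np.inf
    for _ in range(restarts):
        W = rng.standard_normal((m, n))
        W /= np.linalg.norm(W, axis=0)
        for _ in range(iters):
            labels = np.argmin(np.abs(W.T @ X), axis=0)  # nearest plane
            for h in range(n):
                pts = X[:, labels == h]
                if pts.shape[1] >= m:
                    U, _, _ = np.linalg.svd(pts)
                    W[:, h] = U[:, -1]   # normal = least singular direction
        err = np.abs(W.T @ X).min(axis=0).sum()
        if err < best_err:
            best_W, best_err = W.copy(), err
            best_labels = np.argmin(np.abs(W.T @ X), axis=0)
    return best_W, best_labels

# Two lines through the origin in R^2: a tiny 2-plane toy instance.
t = np.linspace(-1, 1, 40)
X = np.hstack([np.outer([1.0, 0.0], t), np.outer([0.0, 1.0], t)])
W, labels = n_plane_cluster(X, n=2)
```

On clean data the total point-to-plane distance of the recovered normals drops to zero; on noisy or higher-dimensional data the local-minima issue discussed above reappears.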
4.3. Proposed Sparsity Based Methods
The goal of Section 4.3 is to present the proposed approaches for the SCA
problem. Specifically, the standard preprocessing method for the BSS problem is
illustrated. In addition, novel methods for both the PPC and CSC cases of the SCA
problem are developed. Furthermore, a robust correntropy minimization method for
source extraction is also proposed.
Data Preprocessing and Recovery
Before using the proposed methods, the given data X is preprocessed using the
prewhitening method. This is done in order to reduce the ill-conditioning effect on X
arising from the dictionary matrix. For example, consider the source matrix S ∈ R3×80
shown in Figure 4-4. The dictionary matrix is A (see Equation 4–7). In this
example m < n (m = 2, n = 3). The data X ∈ R2×80 is shown in Figure 4-5. The
processed data is shown in Figure 4-6. From Figures 4-4, 4-5 & 4-6 the ill-conditioning
effect and the prewhitening enhancement can be easily observed.
A = [ 1.0000 0.9000 1.1000 ; 1.0000 0.8500 1.1500 ].  (4–7)
Consider the following eigenvalue decomposition:

Σ = XX^T = QΛQ^T,  (4–8)

where Λ is a square diagonal matrix whose elements are the eigenvalues of Σ, and Q is a
square orthonormal matrix of eigenvectors of Σ. Since Σ is positive semi-definite,
all the elements of Λ are nonnegative. Thus, a transformation matrix Ψ can be defined
as:

Ψ = Λ^{−1/2} Q^T.  (4–9)

Now X can be transformed as:

X̃ = ΨX.  (4–10)

We redefine the dictionary matrix as Ã = ΨA and have the following model:

X̃ = ΨAS = ÃS.  (4–11)

The reason for such a transformation is that the ill-conditioning effect due to mixing of
the original sources can be reduced. If the original sources were uncorrelated, then
SS^T = I. Therefore, ÃÃ^T = I, as shown below:

ÃÃ^T = ΨAA^TΨ^T
     = ΨASS^TA^TΨ^T
     = ΨXX^TΨ^T
     = Λ^{−1/2}Q^T QΛQ^T QΛ^{−1/2}  (4–12)
     = I.  (4–13)
However, we do not assume that SS^T = I; the above transformation still helps
in finding the hyperplanes. For the case when m = n, once the optimal solutions of all n
optimization problems are obtained, the source matrix from non-noisy mixtures is
obtained as:

Sπ = W^T X̃ = W^T ΨX,  (4–14)

where Sπ = Pπ S and Pπ is a monomial matrix (i.e., each row and each column contains
only one non-zero element). The source extraction method for noisy mixtures is
considered at the end of Chapter 4. Unless other information about A is known, the
correspondence between rows of S and Sπ is hard to determine. Similarly, matrix A is
obtained by solving Equation 4–15:
obtained by solving Equation 4–15:
WT A = Pπ. (4–15)
However Pπ, is unknown. Therefore, Equation 4–15 can be solved by a simple
assumption on Pπ matrix (i.e., Pπ = I). Moreover, the resulting dictionary matrix
will be unique up to permutation and scalability of columns. For example, solving system
of equations given by Equation 4–16 will be enough:
WT Aπ = I. (4–16)
88
Finally, A can be obtained as:
Aπ = Q�12 Aπ. (4–17)
To sum, although the actual A and S matrices cannot be identified, they can be obtained
in permuted and scaled forms when m = n.
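The prewhitening transform of Equations 4–8 to 4–10 can be sketched in a few lines of NumPy. The function name, the small regularizer eps, and the random source data are this sketch's assumptions; the mixing matrix mirrors the ill-conditioned A of Equation 4–7:

```python
import numpy as np

def prewhiten(X, eps=1e-12):
    """Whitening of Eqs. 4-8 to 4-10: eigen-decompose
    Sigma = X X^T = Q Lambda Q^T and apply Psi = Lambda^{-1/2} Q^T,
    so that (Psi X)(Psi X)^T = I."""
    Sigma = X @ X.T
    evals, Q = np.linalg.eigh(Sigma)           # Sigma is symmetric PSD
    Psi = np.diag(1.0 / np.sqrt(evals + eps)) @ Q.T
    return Psi @ X, Psi

# Ill-conditioned mixture similar to the matrix A of Eq. 4-7.
rng = np.random.default_rng(1)
A = np.array([[1.00, 0.90, 1.10],
              [1.00, 0.85, 1.15]])
S = np.abs(rng.standard_normal((3, 80)))       # nonnegative toy sources
X = A @ S
Xw, Psi = prewhiten(X)
```

After the transform the whitened data satisfies Xw Xw^T ≈ I, which is exactly the property used in the derivation of Equations 4–12 and 4–13.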
PPC Robust Method for Dictionary Identification
Given data matrix X ∈ Rm×N , the goal of PPC is to find two matrices, namely
mixing (A ∈ Rm×n) and source (S ∈ Rn×N+ ), such that X = A · S. While developing
the algorithm, it is assumed that the source signals are non-negative. The proposed
algorithm is as follows:
• Step 1: Normalize all the columns of X.

• Step 2: Solve the following LP to get the projection direction:

minimize : β  (4–18)
subject to :
β ≥ −d^T x_i ∀ i,  (4–19)
−d^T x_i ≤ 0 ∀ i,  (4–20)
−2 ≤ d_j ≤ 2.  (4–21)

The above formulation generates a projection vector d which is inside the cone
formed by the columns of X.

• Step 3: Normalize the vector d.

• Step 4: Project the points on an n-dimensional simplex plane orthogonal to d, i.e.,
update each point x_i as x_i = x_i / (d^T x_i).

• Step 5: Translate the points such that the plane containing the n-dimensional
simplex passes through the origin. This can be done by centering the data, i.e., for
each data point use the following transformation:

x_i = (x_i − x̄) / std,  (4–22)

where x̄ and std are respectively the mean and standard deviation of all the
columns of X.

• Step 6: An affine transformation, like Principal Component Analysis (PCA), can be
used to transform the n-simplex from n + 1 dimensions to n dimensions. The PCA
method: identify the eigenvalues and eigenvectors of XX^T:

U D_n U^T = XX^T.

Rearrange the eigenvalues in the diagonal of D_n in decreasing order of their values.
Let D_{n−1} be the submatrix of D_n constructed by eliminating the last row and last
column. Let Y be created such that y_i = U^T x_i ∀ i. Let Z represent the submatrix
of Y obtained by eliminating the last row of matrix Y. The matrix Z is an affine
transformation and dimensionality reduction of matrix X.

• Step 7: If the PPC conditions are satisfied, then find the n vertices. If not, then
approximately find the best n extreme points.
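The LP of Step 2 (Equations 4–18 to 4–21) maps directly onto a standard LP solver by stacking the decision vector as v = [d; β]. This sketch uses `scipy.optimize.linprog`; the function name and the toy cone data are illustrative:

```python
import numpy as np
from scipy.optimize import linprog

def projection_direction(X):
    """Step 2 LP (Eqs. 4-18 to 4-21): minimize beta subject to
    beta >= -d^T x_i, -d^T x_i <= 0 and -2 <= d_j <= 2, which yields
    a direction d lying inside the cone spanned by the columns of X.
    Decision vector: v = [d_1 .. d_m, beta]."""
    m, N = X.shape
    c = np.zeros(m + 1)
    c[-1] = 1.0                                       # minimize beta
    A1 = np.hstack([-X.T, -np.ones((N, 1))])          # -d^T x_i - beta <= 0
    A2 = np.hstack([-X.T, np.zeros((N, 1))])          # -d^T x_i <= 0
    res = linprog(c, A_ub=np.vstack([A1, A2]), b_ub=np.zeros(2 * N),
                  bounds=[(-2, 2)] * m + [(None, None)])
    return res.x[:m]

# Columns lie in the positive quadrant; d should point into the cone.
X = np.array([[1.0, 0.2, 0.6],
              [0.1, 1.0, 0.7]])
d = projection_direction(X)
```

Every column then has a nonnegative projection on d, so the scaling of Step 4 is well defined.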
The projection-based idea is an extension of the method proposed in [14]. However,
that approach did not address the scenario of negative elements in the mixing matrix.
The proposed method can incorporate negative elements in the mixing matrix. A
recent approach also addresses the issue of negative elements in the mixing matrix [103].
Furthermore, the major advantage of the proposed approach over the earlier methods
[14, 103] is that it avoids solving a large number of LPs. The only LP that we solve is in
Step 2. For Step 7, instead of solving many LPs, the following projection approach is
proposed:
Projection Approach: Initially, the data points are projected on the normal vectors
to the edges of the standard n-dimensional simplex projected on the n-dimensional
space. The maximum and minimum projections for the initial n normal vector projections
are archived. Next, the standard simplex is randomly rotated, and a new set of normal
vectors is used for the projection. Again, the maximum and minimum projections
are archived. If the total number of minimum and maximum projection points is equal
to n + 1 points, then the PPC assumptions are satisfied. Furthermore, the n vertices
can now be obtained from the archive. However, if there are more than n + 1 points,
then this indicates that the PPC assumptions are not satisfied. In this case, from the set
of archived points (potential candidates for vertices), the one with the maximum norm is
picked. The maximum norm point is taken as a best extreme point. Now, the rest of
the archived points are projected on a hyperplane passing through the origin with a
normal vector passing through the identified extreme point. The projected archived
points can now be used to reduce the problem size by one dimension. This process of
projection and dimension reduction is continued n times to identify all the best extreme
points. It is to be noted that the projection and dimension reduction phase of the
proposed approach utilizes the archived points only.
CSC Robust Method for Dictionary Identification
Given data matrix X ∈ Rm×N, the goal of CSC is to find two matrices, namely,
the mixing matrix (A ∈ Rm×n) and the source matrix (S ∈ Rn×N), such that X = A · S.
An alternative approach, which is developed in this dissertation, is to solve the bilinear
problem given in Formulation 4–6 via a 0–1 linear reformulation [89]. Next, the 0–1
formulation for CSC is presented:
minimize : ∑_{p=1}^{N} ∑_{h=1}^{n} v_hp  (4–23a)

subject to :

(4–6b)–(4–6d), (4–6h), (4–6i),  (4–23b)
w_h^t e = 1 ∀ h,  (4–23c)
v_hp ≤ M1 t_hp ∀ h, p,  (4–23d)
v_hp ≤ u_hp ∀ h, p,  (4–23e)
v_hp ≥ u_hp − M2(1 − t_hp) ∀ h, p,  (4–23f)
t_hp ∈ {0, 1} ∀ h, p,  (4–23g)
v_hp ≥ 0 ∀ h, p,  (4–23h)
where M1 and M2 are sufficiently large positive scalars. Formulations 4–6 and 4–23 are
equivalent. Clearly, the MIP can be solved sequentially for each hyperplane. Before
defining the hierarchy based MIP formulation, let us introduce the following notation:
Notations:

w⋆_r : optimal solution of the r-th optimization problem given by Formulation 4–24.

H⋆_r : hyperplane passing through the origin whose normal vector is w⋆_r.

P_r : index set of points, defined as P_r = P_{r−1} \ Rϵ_{r−1} for r = 2, ... , n, where
P_1 = {1, ... , N}.

Rϵ_r : index set of points which are within ϵ distance from the hyperplane H⋆_r,
defined as Rϵ_r = {p : |w⋆_r^t x_p| ≤ ϵ}, where ϵ > 0 is a given threshold
such that Rϵ_r has at least m + 1 elements.
minimize : ∑_{p∈P_r} α_p v_p − ∑_{p∈P_r} β_p t_p  (4–24a)

subject to :

−u_p ≤ w_r^t x_p ≤ u_p, p ∈ P_r,  (4–24b)
u_p − M1(1 − t_p) ≤ v_p, p ∈ P_r,  (4–24c)
v_p ≤ u_p, p ∈ P_r,  (4–24d)
v_p ≤ M2 t_p, p ∈ P_r,  (4–24e)
w_r^t e = 1,  (4–24f)
∑_{p∈P_r} t_p ≥ m + 1,  (4–24g)
t_p ∈ {0, 1}, p ∈ P_r,  (4–24h)
u_p ≥ 0, p ∈ P_r,  (4–24i)
v_p ≥ 0, p ∈ P_r,  (4–24j)
w_r ∈ R^m.  (4–24k)
Since the formulation considers one hyperplane at a time, the second index of the
double indexed variables can be dropped. For example, v_p is nothing but v_pr; a similar
argument follows for u_p and t_p. α_p and β_p are scaling factors, and are arbitrarily
selected. Clearly, the non-hierarchical Formulation 4–23 has N · n binary variables,
whereas the r-th iteration of the hierarchical Formulation 4–24 has |P_r| binary variables
(where |P_r| < N ∀ r > 1). Moreover, for any two iterations r1, r2 with r2 > r1, we
have |P_{r1}| > |P_{r2}|. Probabilistically, the complexity at each iteration is reduced. This is
due to the fact that in the r-th iteration, the probability that x_p, p ∈ P_r, will lie in the
remaining n − r + 1 planes is 1/(n − r + 1) (since X is BSS-skeletable). Ideally, if there is
no noise in the data and all the earlier iterations converged to the global optimal solution,
then the n-th iteration is redundant. The proposed hierarchical approach for solving
Formulation 4–24 is presented in Algorithm 4.2. The steps of the proposed hierarchical
approach are illustrated by the flowchart shown in Figure 4-7.
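The peeling step that shrinks P_r between iterations can be sketched directly: given the optimal normal w⋆_r, the points within ϵ of the hyperplane form Rϵ_r and are removed from the active set. The function name and toy data are this sketch's assumptions:

```python
import numpy as np

def peel_hyperplane(X, active, w_star, eps=1e-3):
    """One peeling step of the hierarchical scheme: compute
    R^eps_r = {p : |w*_r^t x_p| <= eps} over the active index set
    and return it together with P_{r+1} = P_r \\ R^eps_r."""
    dist = np.abs(w_star @ X[:, active])
    hit = active[dist <= eps]          # points on the r-th hyperplane
    remaining = active[dist > eps]     # active set for the next MIP
    return hit, remaining

# Three points on the plane with normal (0, 0, 1), two off-plane points.
X = np.array([[1.0, 2.0, -1.0, 0.5, 0.3],
              [0.0, 1.0,  2.0, 0.1, 0.9],
              [0.0, 0.0,  0.0, 1.0, -2.0]])
active = np.arange(5)
hit, remaining = peel_hyperplane(X, active, np.array([0.0, 0.0, 1.0]))
```

Each iteration therefore hands the next MIP a strictly smaller point set, which is the source of the complexity reduction noted above.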
Robust Method for Source Extraction
Once the dictionary is known, the source extraction problem can be simplified as:

S = pinv(A) X,  (4–25)

where pinv(·) is the pseudoinverse function. This method works only when X is free from
outliers. However, when the mixture matrix contains outliers, the above solution approach
will not work. For such scenarios, the following algorithm is proposed. Consider the
following optimization problem:
following optimization problem:
minimize :
∥AS− X∥ (4–26a)
subject to :
S ∈ Rn×N , (4–26b)
where A ∈ R^{m×n} and X ∈ R^{m×N}. Typically, the above problem is solved as a
quadratic error minimization problem. Such methods are not robust when the elements of
the data (A and/or X) are contaminated with outliers. The goal is to present a robust
method for source extraction, which is insensitive to outliers. Specifically, the following
problem is considered:

minimize :

F^σ_C(Y) + α F^σ_C(S)  (4–27a)

subject to :

Y = AS − X,  (4–27b)
S ∈ R^{n×N},  (4–27c)
Y ∈ R^{m×N},  (4–27d)

where F^σ_C is the correntropic loss function, and α is a known weight (or a parameter)
for regularization, which controls the sparsity in S. Let the vector z ∈ R^{N(m+n)} be
defined as:
z_i = y_{⌈i/N⌉, i−(⌈i/N⌉−1)N}  if i ≤ mN,
z_i = s_{⌈(i−mN)/N⌉, (i−mN)−(⌈(i−mN)/N⌉−1)N}  otherwise.  (4–28)

Let C ∈ R^{mN×(m+n)N} be defined as:

C = [−I_{mN}, A ⊗ I_N].  (4–29)
94
The above problem can be transformed as:

minimize :

−∑_{i=1}^{(m+n)N} α_i exp(−z_i² / 2σ²)  (4–30a)

subject to :

Cz = d,  (4–30b)
z ∈ R^{N(m+n)},  (4–30c)

where d ∈ R^{mN} is defined as d_i = x_{⌈i/N⌉, i−(⌈i/N⌉−1)N}, and

α_i = 1 if i ≤ mN, α_i = α otherwise.  (4–31)

Based on the value of σ, Formulation 4–30 can move from the convex domain to the
invex domain. Specifically, the problem will be a convex programming problem when
σ² ≥ z_i² ∀ i = 1, ... , (m + n)N.
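The convexity condition σ² ≥ z_i² can be checked numerically through the diagonal of the Hessian (Equation 4–38): its sign flips exactly where |z_i| exceeds σ. A minimal sketch, with illustrative function names and sample values:

```python
import numpy as np

def correntropic_loss(z, sigma, alpha=1.0):
    """Correntropic loss of Eq. 4-30a: -sum_i alpha_i exp(-z_i^2 / 2 sigma^2)."""
    return -np.sum(alpha * np.exp(-z**2 / (2.0 * sigma**2)))

def loss_curvature(z, sigma, alpha=1.0):
    """Diagonal of the Hessian (Eq. 4-38): positive, i.e. locally
    convex, exactly when sigma^2 >= z_i^2 for that component."""
    return (alpha / sigma**2) * np.exp(-z**2 / (2 * sigma**2)) \
           * (sigma**2 - z**2) / sigma**2

z = np.array([0.5, -0.8, 3.0])   # the entry 3.0 acts like an outlier
c_small = loss_curvature(z, sigma=1.0)   # outlier entry: negative curvature
c_large = loss_curvature(z, sigma=5.0)   # all entries: positive curvature
```

A small kernel width thus places outliers in the invex (flat-tail) region of the loss, which is precisely the robustness mechanism exploited later in this section.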
Consider the Lagrangian of Formulation 4–30:

L(z, v) = −∑_{i=1}^{(m+n)N} α_i exp(−z_i²/2σ²) + v^T (Cz − d),  (4–32)

where v ∈ R^{mN} are the dual variables. The KKT system of Formulation 4–30 will be:

∇F^σ_C(z) + C^T v = 0,  (4–33)
Cz = d,  (4–34)

where [∇F^σ_C(z)]_i = (α_i/σ²) exp(−z_i²/2σ²) z_i ∀ i = 1, ... , (m+n)N. Solving
Equations 4–33 and 4–34 gives the solution for minimum correntropy error with α
regularity.
Let z^(r) be the current feasible solution, and let d^(r+1) be an improving and
feasible direction. Consider the linear approximation of the gradient of a twice
differentiable function:

∇f(w + u) ≈ ∇f(w) + ∇²f(w) u.  (4–35)

Using the above approximation, Equations 4–33 and 4–34 can be rewritten as:

∇F^σ_C(z^(r)) + ∇²F^σ_C(z^(r)) d^(r+1) + C^T v^(r+1) = 0,  (4–36)
C d^(r+1) = 0,  (4–37)
where ∇²F^σ_C(z^(r)) is the Hessian of the correntropic function, defined as:

[∇²F^σ_C(z^(r))]_{i,j} = (α_i/σ²) exp(−(z_i^(r))²/2σ²) · (σ² − (z_i^(r))²)/σ²  if i = j,
[∇²F^σ_C(z^(r))]_{i,j} = 0  otherwise.  (4–38)
Equation 4–36 can be rewritten as:

d^(r+1) = −[∇²F^σ_C(z^(r))]^{−1} [∇F^σ_C(z^(r)) + C^T v^(r+1)],  (4–39)

where

[∇²F^σ_C(z^(r))]^{−1}_{i,j} = (σ²/α_i) exp((z_i^(r))²/2σ²) · σ²/(σ² − (z_i^(r))²)  if i = j,
[∇²F^σ_C(z^(r))]^{−1}_{i,j} = 0  otherwise.  (4–40)
Let Θ = [∇²F^σ_C(z^(r))]^{−1}. Using Equation 4–39 in Equation 4–37, we get:

C Θ [∇F^σ_C(z^(r)) + C^T v^(r+1)] = 0,  (4–41)
C Θ C^T v^(r+1) = −C Θ ∇F^σ_C(z^(r)).  (4–42)

Equation 4–42 can be written as:

v^(r+1) = −(C Θ C^T)^{−1} C Θ ∇F^σ_C(z^(r)).  (4–43)

Substituting Equation 4–43 in Equation 4–39, we get:

d^(r+1) = −Θ [∇F^σ_C(z^(r)) − C^T (C Θ C^T)^{−1} C Θ ∇F^σ_C(z^(r))],  (4–44)
d^(r+1) = −Θ [I_{(m+n)N} − C^T (C Θ C^T)^{−1} C Θ] ∇F^σ_C(z^(r)),  (4–45)
d^(r+1) = [Θ C^T (C Θ C^T)^{−1} C Θ − Θ] ∇F^σ_C(z^(r)),  (4–46)

d^(r+1) = ( [−Θ_Y; Θ_S (A ⊗ I_N)^T] (Θ_Y + (A ⊗ I_N) Θ_S (A ⊗ I_N)^T)^{−1}
            [−Θ_Y, (A ⊗ I_N) Θ_S] − Θ ) ∇F^σ_C(z^(r)),  (4–47)

where Θ = [ Θ_Y 0 ; 0 Θ_S ].
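The equality-constrained Newton step of Equations 4–39 and 4–43 can be sketched on a tiny instance. This is an illustrative sketch (function name, sizes, and data are assumptions); it is valid in the convex regime σ² ≥ z_i², where the Hessian diagonal is positive:

```python
import numpy as np

def newton_direction(z, C, sigma, alpha):
    """Eqs. 4-36 to 4-39: compute the dual v from C Theta C^T v =
    -C Theta grad, then the primal direction d = -Theta (grad + C^T v),
    which satisfies C d = 0 by construction."""
    w = np.exp(-z**2 / (2 * sigma**2))
    g = (alpha / sigma**2) * w * z                        # gradient, Eq. 4-33
    h = (alpha / sigma**2) * w * (sigma**2 - z**2) / sigma**2  # Hessian diag, Eq. 4-38
    Theta = np.diag(1.0 / h)                              # Eq. 4-40
    v = -np.linalg.solve(C @ Theta @ C.T, C @ Theta @ g)  # Eq. 4-43
    return -Theta @ (g + C.T @ v)                         # Eq. 4-39

# Tiny instance: m = n = 1, N = 2, A = [2], so C = [-I_2, 2 I_2] (Eq. 4-29).
C = np.array([[-1.0, 0.0, 2.0, 0.0],
              [0.0, -1.0, 0.0, 2.0]])
z = np.array([0.3, -0.2, 0.4, 0.1])     # current iterate (d = C z is its rhs)
d_step = newton_direction(z, C, sigma=5.0, alpha=np.ones(4))
```

The returned direction lies in the null space of C, so the iterate z + t·d_step stays feasible for any step size t.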
The second order method is suitable when the objective function of Formulation 4–30
is convex. When there are outliers, the goal in Formulation 4–30 is to minimize the
total correntropic loss while ignoring the effect of the outliers. In such a scenario, the
kernel width is selected such that it separates the true samples from the outliers.
Typically, this separation mechanism transforms the correntropy problem into the invex
domain. Thus, the second order Newton's method will not be able to find the optimal
solution. Therefore, in the following paragraphs an iterative method to solve
Formulation 4–30 is developed for the case when the correntropy is invex.
Let z^(r) be the current feasible solution. Let f1(S) = F^σ_C(AS − X) and f2(S) = F^σ_C(S).
The aim of finding the optimal kernel width is to identify a border that separates good
data points and outliers. Generally, such a mechanism of separating data points requires
problem specific knowledge. However, in this work, a correntropy based method that
identifies the optimal kernel width is proposed, which in turn provides a margin between
good data points and outliers.

The philosophy of the proposed method is based on the simple notion that if σ_i
is the optimal kernel width and the p-th given point x_p contains noise, then setting the
corresponding solution s_p to the zero vector should give the maximum improvement in
the objective function f(S) = f1(S) + f2(S). It is easy to see why f2 should decrease.
However, the decrease in f1 is only possible when the given point x_p is indeed an
outlier w.r.t. σ_i. Now, among all possible values of σ_i, the one that provides the
maximum decrease w.r.t. the original objective function value is the optimal value of the
kernel width. Let f(S \ p) be the correntropy cost when s_p is set equal to the zero
vector. Algorithm 4.3 presents the proposed algorithm.
One of the drawbacks of this approach is the computational expense of the
second order method, which increases with the problem dimensions n, m and N. On the
other hand, the step involving the second order method can be avoided when the
proposed method is used for initial filtering, i.e., when solving the following problem:

X = I X_f.  (4–48)

When solving for X_f in Equation 4–48, X_f can be initialized as X_f = X, and the
second order method can be skipped. After executing Algorithm 4.3, the optimal kernel
width and the filtered mixture matrix are obtained. This filtered mixture matrix can then
be used for dictionary identification and source extraction.
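The core leave-one-out test of Algorithm 4.3, zeroing a column s_p and checking whether the total correntropic cost improves, can be sketched as follows. Function names, the regularization weight, and the toy data (one corrupted mixture column) are this sketch's assumptions:

```python
import numpy as np

def corr_cost(A, S, X, sigma, alpha):
    """f(S) = F_sigma(AS - X) + alpha * F_sigma(S), with
    F_sigma(M) = -sum_ij exp(-M_ij^2 / 2 sigma^2), as in Eq. 4-27."""
    F = lambda M: -np.sum(np.exp(-M**2 / (2 * sigma**2)))
    return F(A @ S - X) + alpha * F(S)

def outlier_columns(A, S, X, sigma, alpha=0.1):
    """Flag column p when zeroing s_p improves the total cost: the
    f2 term always drops, so a net improvement only happens when
    x_p is an outlier w.r.t. the current kernel width."""
    base = corr_cost(A, S, X, sigma, alpha)
    flagged = []
    for p in range(S.shape[1]):
        Sp = S.copy()
        Sp[:, p] = 0.0
        if corr_cost(A, Sp, X, sigma, alpha) < base:
            flagged.append(p)
    return flagged

A = np.eye(2)
S = np.ones((2, 4))          # current source estimate
X = A @ S
X[:, 2] = 50.0               # corrupt one mixture column (an outlier)
flagged = outlier_columns(A, S, X, sigma=1.0)
```

Shrinking σ by the factor ν and repeating this test, as in Algorithm 4.3, then selects the kernel width with the largest relative improvement.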
Figure 4-1. Cocktail party problem: A) Setup. B) Problem.
Figure 4-2. BSS setup for human brain: sources S1, S2, S3 are mixed through A into mixtures X1, X2, X3.
Figure 4-3. Overview of different approaches to solve the BSS problem: Blind Signal Separation splits into Partial Blind and Full Blind; Full Blind covers Independent Component Analysis and Sparse Component Analysis; Sparse Component Analysis covers Partially Sparse Nonnegative Sources and Complete Sparse Components.
Algorithm 4.1: Bilinear Algorithm
input : X ∈ Rm×N
output: W ∈ Rm×n and T ∈ Rn×N
1  begin
2    Randomly initialize T;
3    Set termination = false;
4    Set ϵ = epsilon;
5    while not termination do
6      for p = 1 to N do
7        Calculate the distances Dhp between xp and all the hyperplanes wh;
8        Assign xp to cluster Ch⋆ iff Dh⋆p = min_h {Dhp};
9      error = ∑_p |Dh⋆p|;
10     if error < ϵ then
11       termination = true;
12     Thp = 1 if xp ∈ Ch, 0 otherwise;
13     for k = 1 to n do
14       Replace Equation 4–6e by wh^t e = 1;
15       Solve Formulation 4–6, given thp = Thp if h = k, 0 otherwise;
16   Arrange W = [w1, ... , wn];
17   return (W, T);
Algorithm 4.2: CSC Hierarchical Optimization Algorithm
input : X ∈ Rm×N
output: Aπ ∈ Rm×n and Sπ ∈ Rn×N
1  begin
2    X = Preprocessing(X, Ψ);
3    Set ϵ = epsilon;
4    Set P1 = {1, ... , N};
5    for counter = 1 to n do
6      Set r = counter;  // r is the current index of the hyperplane
7      Set termination = false;
8      repeat
9        Choose initial points;
10       Solve Formulation 4–24 for the r-th hyperplane, given xp ∀ p ∈ Pr;
11       if an optimal solution is obtained then
12         termination = true;
13         Archive the vector w⋆r;  // the optimal solution of Formulation 4–24
14         Obtain the index set Rϵr = {p : |w⋆r^t xp| ≤ ϵ};
15         Set Pr+1 = Pr \ Rϵr;
16     until termination;
17   Arrange W = [w⋆1, ... , w⋆n];
18   Get Sπ as Sπ = W^T X;
19   Get Aπ by solving the model W^T Aπ = In×n;
20   return (Sπ, Aπ);
Algorithm 4.3: Correntropy Minimization for X = AS Type Scenarios
input : X ∈ Rm×N and A ∈ Rm×n
output: S ∈ Rn×N and σ⋆
1  begin
2    Let z(r) be the solution obtained from second order minimization, where r can be chosen arbitrarily based on the required accuracy;
3    Let S(r) be the solution constructed from z(r);
4    Let σ(r) be the minimum value of the kernel width obtained from z(r) such that the correntropy function is convex;
5    Select any value for ν such that 0 < ν < 1;
6    Δr = −∞;
7    termination = false;
8    while termination == false do
9      Calculate f(S)(r);
10     for i = 1 to N do
11       if f(S \ i)(r) < f(S)(r) then
12         I = I ∪ {i};
13     Let fnew(S)(r) be the correntropy cost when si = 0 ∀ i ∈ I;
14     Δr+1 = |(fnew(S)(r) − f(S)(r)) / f(S)(r)|;
15     if Δr+1 > Δr then
16       σ(r) = σ(r) · ν;
17       r = r + 1;
18     else
19       σ⋆ = σ(r);
20       termination = true;
21   return (σ⋆, X);
Figure 4-7. Flowchart of Algorithm 4.2: choose initial points, solve the hierarchical formulation, and, once an optimal solution is found, remove the points corresponding to the identified plane; repeat until all the planes are obtained.
CHAPTER 5
SIMULATIONS AND RESULTS
In Chapter 5, the applicability of the proposed robust methods is illustrated through
experiments on simulated and real world data. Generally, it is impractical to draw
conclusions about the presence of outliers in real data. Therefore, the significance of the
proposed methods is highlighted using simulated data. We show that the proposed
methods work very well on non-noisy simulated data, as well as on noisy simulated data.
After the two simulated data scenarios, the performance of the proposed methods on
the real data is also tested.
In Sections 5.1 - 5.4, case studies related to binary classification are illustrated.
Section 5.5 presents the case study related to the linear mixing assumption and shows
an application of nonnegative source separation. The suitability of the proposed PPC
method for image unmixing problems is shown in Sections 5.6 - 5.10. Section 5.11
presents the case studies related to complete sparse source separation via the hyperplane
clustering method. Finally, Section 5.12 illustrates the proposed robust source extraction
procedure.
5.1. Cauchy and Skew Normal Data
The objective of Section 5.1 is to evaluate the performance of the correntropy
loss function in simulated noisy data classification. Two-dimensional noisy data for
binary classification are simulated for this study. Altogether, two different types of data
sets were generated. The first data set is generated using the Cauchy distribution. The
reason for selecting this distribution is to evaluate the performance of the proposed and
existing methods in a non-Gaussian environment. In this data set, the fat-tail behavior of
the Cauchy distribution mimics the noise. The second data set is generated by a skew
normal distribution. In this data set, 10% of the data points from one class are randomly
assigned to the other class and vice versa. Brief information regarding the two data sets
is given in Table 5-1. The details of the data sets are shown in Figures 5-1, 5-2 & 5-3.
(Some sections of Chapter 5 have been published in Dynamics of Information Systems:
Mathematical Foundations, Computers & Operations Research, and Neuromethods.)
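A data set in the spirit of the Cauchy scenario can be generated as follows. The centers, scales, and function name are illustrative, not the dissertation's exact parameters; the label-flip step mirrors the 10% class-swap noise described for the skew normal set:

```python
import numpy as np

def make_noisy_binary(N=200, flip=0.10, seed=7):
    """Two heavy-tailed (Cauchy) 2-D point clouds around distinct
    centers, with a fraction `flip` of labels swapped as label noise.
    Centers/scales are illustrative, not the thesis's parameters."""
    rng = np.random.default_rng(seed)
    X0 = rng.standard_cauchy((N, 2)) + np.array([-2.0, 0.0])
    X1 = rng.standard_cauchy((N, 2)) + np.array([2.0, 0.0])
    X = np.vstack([X0, X1])
    y = np.concatenate([np.zeros(N), np.ones(N)])
    idx = rng.choice(2 * N, size=int(flip * 2 * N), replace=False)
    y[idx] = 1.0 - y[idx]          # swap 10% of the labels
    return X, y

X, y = make_noisy_binary()
```

The fat Cauchy tails produce the occasional extreme samples that play the role of noise when comparing the correntropy and quadratic loss functions.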
For these data sets, a fixed number of records was selected for training the classifier.
The remaining records were used for testing the trained classifier. In order to obtain
accurate results, each data set is randomly divided into testing data and training data.
For each data set, the training data is preprocessed by normalizing the data to zero mean
and unit variance along the features (to avoid scaling effects). Based on the mean and
variance of the training data, the testing data is scaled. In addition, for the results
to be consistent, 100 Monte-Carlo simulations were conducted (both for ANN and SVM),
and the average testing accuracy of the classifier over the 100 simulations is reported in
the results.
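The preprocessing step above, standardizing with training-split statistics only and reusing them on the test split, can be sketched as follows (function name and sample data are illustrative):

```python
import numpy as np

def standardize_train_test(train, test):
    """Scale features to zero mean / unit variance using statistics
    of the training split only; the test split reuses those
    statistics, mimicking deployment where test data is unseen."""
    mu = train.mean(axis=0)
    sd = train.std(axis=0)
    sd[sd == 0] = 1.0                  # guard against constant features
    return (train - mu) / sd, (test - mu) / sd

rng = np.random.default_rng(42)
train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
test = rng.normal(loc=5.0, scale=2.0, size=(40, 3))
tr, te = standardize_train_test(train, test)
```

The training split comes out exactly zero-mean and unit-variance, while the test split is only approximately so, which is the intended behavior.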
The results are shown in Tables 5-2 & 5-3. From the results, it can be seen that the
correntropy loss function performs better in the case of the Cauchy data. However,
when the data is closer to normally distributed, as with the skew data, its performance is
similar to that of the quadratic loss function.
5.2. Real World Binary Classification Data
In Section 5.2, simulations are carried out on three real world data sets related to
the biomedical field (Wisconsin Breast Cancer Data, Pima Indians Diabetes Data and
BUPA Liver Disorder Data). These data sets are taken from the UCI machine learning
repository (http://archive.ics.uci.edu/ml/). Brief information regarding each of the data
sets is given in Table 5-4. The objective of Section 5.2 is to evaluate the performance of
the correntropy loss function in real world data classification.
Originally, some of the selected data sets have missing values. All the records
containing any missing values are discarded before using the data for classification.
In addition, for each data set, a fixed number of records was selected for training
the classifier. The remaining records were used for testing the trained classifier. In order
to obtain accurate results, each data set is randomly divided into testing data and
training data (keeping the total number of training records fixed, as given in Table 5-4).
For each data set, the training data is preprocessed by normalizing the data to zero mean
and unit variance along the features (to avoid scaling effects). Based on the mean and
variance of the training data, the testing data is scaled. The purpose of normalizing the
training data alone and scaling the testing data later is to mimic the real life scenario:
usually, the testing data is not available beforehand, and its information is unknown
while normalizing the training data. In addition, for the results to be as consistent
as possible, 100 Monte-Carlo simulations were conducted (both for ANN and SVM), and
the average testing accuracy of the classifier over the 100 simulations is reported in the
results.
5.3. Comparison Among ANN Based Methods
The aim of Section 5.3 is to compare the proposed ANN based methods with
existing ANN based binary classification methods. Since the number of PEs in the
hidden layer has an effect on the performance of ANN based classifiers, simulations
have been conducted with 5, 10 and 20 PEs in the hidden layer for each of the data sets.
Although the exact number of PEs that will give maximum classification accuracy is
unknown, it can be estimated by an experimental search over the number of PEs in
the hidden layer. However, such a search is out of the scope of the current work due to
its high computational requirements. Therefore, the computations have been confined
to 5, 10 and 20 PEs in order to efficiently compare all the ANN based classifiers.
Moreover, the performance of ANN based classifiers with sample and block based
learning frameworks was also considered in the comparison.

The results of the sample and block based learning methods of the ANN simulations
are given in Tables 5-5, 5-6, 5-7, 5-8, 5-9, and 5-10. In these six tables, each column
represents a number of learning epochs for sample based learning, whereas each
column represents a number of epochs × training sample size for block based
learning. For a given algorithm, a row represents the average result of 100 Monte-Carlo
simulations. The first row presents the results with 5 PEs in the hidden layer, the second
row with 10 PEs, and the third row with 20 PEs.
For the AQG and ACG methods, the results from [86] are used as a reference for
further comparisons (see Tables 5-5, 5-7 and 5-9). Since ACS requires knowledge
of the change in loss function value over any two consecutive iterations, it cannot be
implemented in sample based learning. However, all the algorithms have been
implemented in block based learning, and the performance results of ACS at σ = 0.5
have been presented. The results show that ACC almost always (both for sample
and block based learning methods) performs better than any of the other ANN
based classification algorithms. Therefore, this method can be used as a generalized
robust ANN based classifier for practical data classification problems. Moreover,
the poor performance of the ACS method is attributed to the σ = 0.5 criterion: it is not
necessary that this assumed setting shows ACS's best performance. Therefore,
this instigated the study of the performance behavior of the ACS method over different
levels of the parameter σ (see Tables 5-11, 5-12 and 5-13).
5.4. ANN and SVM Comparison
The aim of Section 5.4 is to compare the proposed ANN based binary classification
methods with SVM based binary classification methods. Since SVM has no concept
of PEs, the best average accuracy of SVM (the average of 100 Monte-Carlo simulations
for a given pair of c and γ) over an exponential grid of c and γ values is used for
comparison with the accuracy of the proposed algorithms. Figure 5-4A shows the
topology of the performance accuracy over the grid, and Figure 5-4B shows the topology
of the number of support vectors for the PID data. Correspondingly, Figures 5-5 and 5-6
show the same for the BLD and WBC data respectively. The maximum testing accuracy
obtained for the PID data from the grid search is 77.2%; similarly, for BLD and WBC it is
71.4% and 97.07% respectively.
It would be unfair to directly compare the best accuracy of SVM with the accuracy
of the proposed ANN based algorithms, for the following reason. While calculating the
best accuracy of the SVM based method, a fine (exhaustive) grid search over the
parameters c and γ is conducted. The possibility of conducting such exhaustive
searches over the parameter space is credited to the existence of fast quadratic
optimization algorithms like sequential minimal optimization [27]. However, such a fine
exhaustive grid search remains computationally expensive in the case of the proposed
ANN methods (for example, an exhaustive grid search for ACS requires searching over
three parameters: the number of epochs, σ, and the number of PEs in the hidden layer).
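To make the nature of this exhaustive search concrete, the sketch below runs a small exponential grid search over c and γ for a soft margin RBF SVM. The dataset, grid bounds, and cross-validation protocol here are illustrative stand-ins, not the 100 Monte-Carlo splits on the PID/BLD/WBC data used in the thesis.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Exponential grid over the soft-margin parameter c and the RBF width gamma.
X, y = load_breast_cancer(return_X_y=True)
best = (None, None, 0.0)
for c in 2.0 ** np.arange(-3, 6, 2):
    for gamma in 2.0 ** np.arange(-7, 2, 2):
        clf = make_pipeline(StandardScaler(), SVC(C=c, gamma=gamma))
        acc = cross_val_score(clf, X, y, cv=5).mean()  # mean CV accuracy
        if acc > best[2]:
            best = (c, gamma, acc)
print("best c=%g gamma=%g accuracy=%.3f" % best)
```

The key point is that each grid cell costs only one SMO-style quadratic optimization per fold, whereas for an ANN each cell costs a full training run per epoch setting, PE count, and σ level.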
Nevertheless, in order to see the behavior of the ACS algorithm at various levels of
σ, a coarse grid search with a few grid points has been conducted. The results of this
grid search are shown in Tables 5-11, 5-12 and 5-13. Although the grid is confined to
very few points, it can be seen that the performance accuracy of the ACS algorithm
varies with the parameters (σ and the number of PEs in the hidden layer). The results
also show that the performance accuracy of ACS (even with limited PEs and confined
levels of σ) is very close to the best performance accuracy of the soft margin kernel
based SVM. Furthermore, even with its limitations (number of PEs in the hidden layer
and number of epochs), ACC beats the best performance accuracy of SVM on the WBC
data, and its performance is very close to the best SVM performance on the other two
data sets.
5.5. Linear Mixing EEG-ECoG Data
The aim of Section 5.5 is to understand the nature of mixing across the skull. In
particular, the objective is to assess the validity of the linear mixing assumption in the
BSS problem. Since linear mixing is assumed in almost all BSS methods, it is of
primary interest to examine the validity of this assumption on neural data. The
idea of this experiment is to consider a neural data set that contains information
regarding both the source signals and the mixture signals from the brain, and to extract
the mixing matrix from the available information. However, the mixing matrix itself may
not provide significant information when compared to the total error under the linear
mixing assumption. Therefore, in the following experiment, a suitable publicly available
data set (which contains both source and mixture data) is considered, and the linear
mixing assumption across the skull is examined by minimizing the total error.
Data containing simultaneous electrical activity over the scalp (EEG) and over the
exposed surface of the cortex (ECoG) from a monkey is considered in Section 5.5. The
information regarding the experimental setup and the positions of the electrodes is
available at the following web address, (http://wiki.neurotycho.org/EEG-ECoG recording).
Since the data from this experiment are simultaneously collected from above and below
the scalp, they open the door to understanding the mixing mechanism across the skull.
Typically, the mixing over the skull is assumed to be linear: the linear mixing assumption
yields mathematical advantages in formulating the problem, developing algorithms, and
identifying the unknown source and mixing matrices. In fact, the only known successful
results for the BSS problem have been obtained under the linear mixing assumption. By
analyzing the data from this experiment, the goal is to experimentally verify the validity of
this assumption.
The data consist of ECoG and EEG signals that were simultaneously recorded
from the same monkey. A 128-channel ECoG array covering the entire lateral cortical
surface of the left hemisphere with 5 millimeter spacing was implanted in the monkey,
and the EEG signal was recorded from 19 channels. The locations of the EEG
electrodes were determined by the 10-20 system without the Cz electrode (because the
location of the Cz electrode interfered with a connector of the ECoG array). In the
present simulation, results on a particular data set are presented, in which the monkey is
blindfolded, seated in a primate chair, with its hands tied to the chair. Figure 5-7 shows
the 8 EEG channels of the left hemisphere, and Figure 5-8 shows the 128 ECoG
channels from the left hemisphere.
During the recording, the monkey is in a resting condition. In such a scenario, it is
assumed that the theta and alpha bands should be dominant in a normal healthy
primate. Thus, the goal is to see how particular frequency bands mix over the skull.
Basically, the formulation is of the following form:

minimize : |Xeeg − A × Xecog|,   (5–1)

where Xeeg ∈ R^{18×N} represents the EEG data from 18 channels (each row
represents a channel), Xecog ∈ R^{128×N} represents the ECoG data from 128
channels (each row represents a channel), A ∈ R^{18×128} is the unknown mixing
matrix, and |·| denotes the total (entrywise) absolute error.
Before solving Formulation 5–1, the data have been filtered to remove high (≥ 45 Hz)
and low (≤ 0.5 Hz) frequencies. In addition, 50 Hz and 60 Hz notch filters have
been used to remove the power line noise. Furthermore, all the channels have been
referenced to the average signal before conducting the analysis, i.e., the EEG value
from a particular channel at a given time instance is referenced to the average over all
EEG channels at the same time instance, and similarly the ECoG data are referenced to
their average signal.
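The preprocessing just described (band-pass to 0.5-45 Hz, 50/60 Hz notches, then average referencing) can be sketched as below; the filter order and notch quality factor are illustrative choices, not parameters stated in the thesis.

```python
import numpy as np
from scipy import signal

def preprocess(x, fs):
    """Band-pass 0.5-45 Hz, notch out 50/60 Hz power-line noise, then
    common-average reference.  x: (channels, samples), fs: rate in Hz."""
    b, a = signal.butter(4, [0.5, 45.0], btype="bandpass", fs=fs)
    x = signal.filtfilt(b, a, x, axis=1)
    for f0 in (50.0, 60.0):                       # power-line notches
        bn, an = signal.iirnotch(f0, Q=30, fs=fs)
        x = signal.filtfilt(bn, an, x, axis=1)
    # subtract the across-channel average at every time instance
    return x - x.mean(axis=0, keepdims=True)
```

After this step the across-channel mean is exactly zero at every sample, which is what "referenced to the average signal" requires.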
Instead of solving Formulation 5–1 with respect to the whole data set, the formulation
has been solved multiple times on reduced data sets. The reduced data sets are simply
smaller chunks of the original data, with a window size of N = 2000 points, for a
particular frequency band. The objective of Formulation 5–1 is to calculate the total
absolute error due to the linear mixing assumption in different frequency bands; thus,
this experiment provides a mechanism to understand mixing across the skull. A low
error indicates that the linear mixing assumption is valid, whereas a high error indicates
that it is invalid. Moreover, the ultimate goal is to show whether the mixing is constant
over time; however, developing such results requires complete knowledge of the total
number of sources. At this point, a simple experiment is presented in which it is
assumed that all the ECoG electrodes are sources and all the EEG electrodes are
mixtures. The model is thus highly under-determined, but due to the availability of both
source and mixture information, Formulation 5–1 becomes a convex programming
problem.
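For one data window, fitting A and measuring the residual can be sketched as follows. For tractability this sketch fits A by least squares and only reports the total absolute error afterwards, rather than minimizing the absolute error directly as Formulation 5–1 does.

```python
import numpy as np

def window_mixing_error(x_eeg, x_ecog):
    """Fit x_eeg ~= A @ x_ecog on one window and return (error, A).
    x_eeg: (n_eeg, N), x_ecog: (n_ecog, N); A has shape (n_eeg, n_ecog).
    A is obtained by least squares as a surrogate for the absolute-error
    minimization; the returned error is the total absolute residual."""
    # solve x_ecog.T @ A.T ~= x_eeg.T, one right-hand side per EEG channel
    At, *_ = np.linalg.lstsq(x_ecog.T, x_eeg.T, rcond=None)
    A = At.T
    error = np.abs(x_eeg - A @ x_ecog).sum()
    return error, A
```

Running this over sliding N = 2000 windows of a band-filtered recording and collecting the mean and variance of the errors reproduces the kind of per-band statistics reported in Table 5-14.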
The results of the analysis are shown in Table 5-14. While calculating the error,
only those channels placed on the left hemisphere are considered, i.e., 8 EEG channels
and 128 ECoG channels. Since ECoG data are available only for the left hemisphere,
the right hemispheric EEG channels have been neglected in the calculation of the error.
In Table 5-14, the third row presents the mean of the total absolute error over all the
runs on the reduced data sets, and the fourth row gives the corresponding variance. The
low average error and negligible variance in the alpha and theta bands suggest the
existence of linear mixing across the skull. At this stage of the experiment, the linear
mixing assumption is validated on this neural data set; however, this is far from a
theoretical validation, and generalization to other neural data sets remains open.
Furthermore, the other critical question, concerning the constancy of the mixing over
time, is also open for further investigation.
5.6. fMRI Data Analysis
In Sections 5.6, 5.7, 5.8, 5.9 and 5.10, the focus is on non-negative sources;
images generally fall under the non-negative sources category. The aim of Section 5.6
is to examine the validity of the PPC sparsity assumption on fMRI data. Generally,
sparsity in fMRI images is a more plausible assumption than independence [24];
however, PPC sparsity may not be applicable to fMRI data. Through this experiment, the
applicability of the PPC method to fMRI data is explored.
An fMRI data set examined previously in the literature is considered in Section 5.6.
The description of the experimental setup and data collection is available in [35], where
the authors compare ICA and SCA methods. Here, the same data are used to analyze
the convex hull of the fMRI data. The basic idea is that if the PPC assumptions are
valid, then the convex hull should be a simplex. Furthermore, if the convex hull is a
simplex in n dimensions, then an affine transformation to lower dimensions, such as
PCA, should result in a simplex in the lower dimensions. Moreover, the extreme points
(or vertices) of the simplex (or convex hull) are nothing but the columns of the mixing
matrix; thus, finding the convex hull leads to the identification of the mixing matrix.
The fMRI data from a single subject consist of 98 images taken every 50 milliseconds.
The images are vectorized by scanning each image vertically from top left to bottom
right. Next, the dimensionality of the data is reduced to 3 principal components using
PCA for ease of identifying the convex hull. Since the images are vectorized, the relation
between the fMRI data and the PCA components is not directly interpretable; however,
PCA has a crucial advantage for visualization, which in turn leads to easy identification
of the convex hull in the lower dimensional space. Figure 5-9A shows the scatter plot of
the three principal components. Next, taking the three principal components, the data
are projected onto a two dimensional plane; this projection is shown in Figure 5-10.
The first thing to notice is that this projection differs from Figure 5-9B, which shows the
scatter plot of only two principal components. A simplex that fits all the points in
Figure 5-10 then gives the information pertaining to the columns of the mixing matrix.
For the unique identification of the mixing matrix, the existence of a unique simplex is
necessary.
For the fMRI data, the PPC1 conditions are not completely satisfied, since the
vertices of the triangle (simplex) are not present in the data. However, approximate
methods can be developed to identify the extreme points of the triangle; for example,
Figures 5-10A and 5-10B show different ways of extrapolating the data to obtain the
vertices. This idea can be extended to higher dimensions by defining objectives such as
finding a simplex of minimum volume containing all the data, or finding a simplex of
minimum volume containing a high percentage of the data. From this analysis, it can be
concluded that, in general, the PPC method may not be directly applicable to the
analysis of fMRI data. Thus, alternative methods that can overcome the restrictions of
the PPC method are needed to analyze fMRI data.
5.7. MRI Scans
In Section 5.7, three MRI scan images are considered. From the original MRI
images, the minimum pixel value is subtracted, and the validity of the PPC1 assumption
is tested; these processed images do satisfy the PPC1 assumption. Let us call these
images the initial images. The initial images are linearly mixed to obtain three mixture
images, and the goal is to extract the pure source images from the mixture images.
Figure 5-11A displays the initial sources, and Figure 5-11B presents the mixture images.
The three mixture images are vectorized into a matrix X ∈ R^{3×N}, where N
depends upon the size of the images. The columns of X are projected onto a two
dimensional space using the PCA transformation, and the three unique vertices of the
simplex (triangle) are identified from the projected data using the proposed projection
approach. Since the initial images satisfy the PPC1 assumption, the unique vertices are
identified exactly, i.e., no approximation is needed. From the vertices of the simplex, the
mixing matrix is constructed, and the source images are recovered using this mixing
matrix. Figure 5-11C shows the recovered source images. Since the PPC1
assumptions were satisfied initially, all information except the ordering and intensity of
the images (the ambiguities of permutation and scaling) is recovered from the mixture
images.
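The recovery pipeline above can be sketched as follows. The convex hull computation here is an illustrative stand-in for the proposed projection approach, and it assumes exact, noise-free PPC1 data so that the projected cloud has exactly n extreme points.

```python
import numpy as np
from scipy.spatial import ConvexHull

def unmix_ppc(X):
    """Simplex-vertex unmixing sketch for n noise-free mixtures X (n, N)
    of non-negative sources satisfying a PPC1-type condition (each source
    has at least one pixel where it alone is active)."""
    n, _ = X.shape
    keep = np.flatnonzero(X.sum(axis=0) > 1e-12)   # drop all-zero pixels
    Y = X[:, keep] / X[:, keep].sum(axis=0)        # columns now sum to 1
    Yc = Y - Y.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(Yc, full_matrices=False)
    P = (U[:, : n - 1].T @ Yc).T                   # PCA projection to n-1 dims
    verts = ConvexHull(P).vertices                 # simplex corners
    A_est = Y[:, verts[:n]]                        # estimated mixing columns
    S_est = np.maximum(np.linalg.lstsq(A_est, X, rcond=None)[0], 0)
    return A_est, S_est
```

For PPC1 data the extreme points of the projected cloud are exactly the pixels where a single source is active, so the corresponding columns of the normalized data equal the (scale-normalized) columns of the mixing matrix up to permutation.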
Furthermore, in order to assess the performance of the proposed approach, the
experiment is repeated 50 times with a random mixing matrix in every repetition, for
every value of n = 3, ... , 7. To quantify the accuracy of the proposed approach, the
error between the recovered and original sources is calculated as:

e(S, Ŝ) = min_{π ∈ Π_n} Σ_{i=1}^{n} ∥s_{i•} − ŝ_{π_i •}∥₂,   (5–2)

where s_{i•} is the i-th row of the original source matrix S, and ŝ_{i•} is the i-th row of
the recovered source matrix Ŝ. All the rows of the original and recovered source
matrices are normalized; the normalization removes the scaling effect, while the effect of
permutation is handled by the vector π. Let π = [π₁, ... , π_n]ᵀ and
Π_n = {π ∈ Rⁿ | π_i ∈ {1, 2, ... , n}, π_i ≠ π_j ∀ i ≠ j} be the set of all permutations of
{1, 2, ... , n}. The optimization problem in Equation 5–2 matches the rows of the
recovered source matrix to those of the original source matrix; this minimization is
nothing but the standard assignment problem, and can be easily solved using the
Hungarian method. The average error and standard deviation for the MRI scan images
are presented in the first column of Tables 5-15 and 5-16 respectively.
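The matching in Equation 5–2 can be computed in a few lines with the Hungarian method; this is an illustrative implementation, not the thesis code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def unmix_error(S, S_hat):
    """Permutation-invariant recovery error of Equation 5-2: rows are
    l2-normalized (removing scale), then rows of S_hat are matched to
    rows of S by solving the assignment problem."""
    S = S / np.linalg.norm(S, axis=1, keepdims=True)
    S_hat = S_hat / np.linalg.norm(S_hat, axis=1, keepdims=True)
    # cost[i, j] = distance between row i of S and row j of S_hat
    cost = np.linalg.norm(S[:, None, :] - S_hat[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)     # Hungarian method
    return cost[rows, cols].sum()
```

A recovered matrix that differs from the original only by row permutation and positive row scaling yields an error of (numerically) zero, as the equation requires.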
5.8. Finger Prints
In Section 5.8, three finger print images are considered. Similar to the MRI scans
in Section 5.7, the minimum pixel value in each image is first subtracted, and the PPC1
assumption is then checked. These processed images do not satisfy the PPC1
assumption. Let us call these images the initial images. The linear mixing operation of
the MRI scan experiment (see Section 5.7) is repeated to obtain three mixture images.
Since the PPC1 assumption is not satisfied, the goal is to approximately extract the pure
sources from the mixture images. Figure 5-12A displays the initial sources, and
Figure 5-12B presents the mixture images.
The three mixture images are vectorized into a matrix X ∈ R^{3×N}, where N
depends upon the size of the images. The columns of X are projected onto a two
dimensional space using the PCA transformation, and the three best extreme points are
identified using the proposed projection approach. Taking these three points as the
vertices of the simplex, the mixing matrix is constructed, and the sources are recovered
using this mixing matrix. Figure 5-12C shows the extracted source images. It can be
seen that, apart from intensity and ordering, the recovery is not perfect.
Furthermore, in order to assess the performance of the proposed approach, the
experiment is repeated 50 times with a random mixing matrix in every repetition, for
every value of n = 3, ... , 7. The error between the recovered and original sources is
calculated by the formula given in Equation 5–2. The average error and standard
deviation for the finger print images are presented in the second column of Tables 5-15
and 5-16 respectively.
5.9. Zip Codes
In Section 5.9, four zip code images are considered; let us call these images the
initial images. The linear mixing operation of the MRI scan experiment (see Section 5.7)
is performed to obtain four mixture images. The PPC1 assumption is not satisfied for the
four images, so the goal is to approximately extract the pure sources from the mixture
images. Figure 5-13A displays the initial source images, and Figure 5-13B presents the
mixture images.
The four mixture images are vectorized into a matrix X ∈ R^{4×N}, where N
depends upon the size of the images. The columns of X are projected onto a three
dimensional space using the PCA transformation, and the four best extreme points are
identified using the proposed projection approach. Taking these four points as the
vertices of the simplex, the mixing matrix is constructed, and the sources are recovered
using this mixing matrix. Figure 5-13C shows the extracted sources. It can be seen that,
apart from intensity and ordering, the recovery is not perfect.
Furthermore, in order to assess the performance of the proposed approach, the
experiment is repeated 50 times with a random mixing matrix in every repetition, for
every value of n = 3, ... , 7. The error between the recovered and original sources is
calculated by the formula given in Equation 5–2. The average error and standard
deviation for the zip code images are presented in the third column of Tables 5-15 and
5-16 respectively.
5.10. Ghost Effect
Five translated images of the same individual are considered in Section 5.10; let us
call these images the initial images. The linear mixing operation of the MRI scan
experiment (see Section 5.7) is performed to obtain five mixture images. The PPC1
assumption is not satisfied for the five images, so the goal is to approximately extract
the pure sources from the mixture images. Figure 5-14A displays the initial sources, and
Figure 5-14B presents the mixture images.
The five mixture images are vectorized into a matrix X ∈ R^{5×N}, where N
depends upon the size of the images. The columns of X are projected onto a four
dimensional space using the PCA transformation, and the five best extreme points are
identified using the proposed projection approach. Taking these five points as the
vertices of the simplex, the mixing matrix is constructed, and the sources are recovered
using this mixing matrix. Figure 5-14C shows the extracted sources. It can be seen that,
apart from intensity and ordering, the recovery is not perfect.
Furthermore, in order to assess the performance of the proposed approach, the
experiment is repeated 50 times with a random mixing matrix in every repetition, for
every value of n = 3, ... , 7. The error between the recovered and original sources is
calculated by the formula given in Equation 5–2. The average error and standard
deviation for the ghost effect images are presented in the fourth column of Tables 5-15
and 5-16 respectively.
5.11. Hyperplane Clustering
In order to show the performance of Algorithm 4.2, random test instances have
been generated in Section 5.11. For simplicity, the case m = n is considered, and to
show the performance of the proposed approach, noise free correlated sources are
used; all the data in this case study are artificially generated. Noise free data points
X ∈ R^{16×1600} have been generated from a randomly generated dictionary
A ∈ R^{16×16} and source matrix S ∈ R^{16×1600} (the source is sparse, i.e., each
column contains at least one zero). Figure 5-15 shows the original source matrix, and
Figure 5-16 shows the given data. The matrix shown in Figure 5-17 is the randomly
generated dictionary A, normalized separately with respect to each column. The
correlation of the sources is given by the matrix shown in Figure 5-18.
The correlation matrix (see Figure 5-18) is far from diagonal; therefore, the sources
are highly correlated. Nevertheless, the proposed method separates their mixtures
successfully, unlike ICA, which at the very least requires the sources to be uncorrelated.
After preprocessing the data, the hierarchical sequence of MIPs is solved, as
described in Algorithm 4.2. For fast execution, the algorithm is jump started by
generating initial points. Specifically, the following two ϵ neighborhoods of a point x_r
are defined:

x_p ∈ N_{ϵ1}(x_r)  iff  (x_pᵀ x_r) / (∥x_p∥₂ ∥x_r∥₂) ≥ ϵ1   (5–3)

and

x_p ∈ N_{ϵ2}(x_r)  iff  (x_pᵀ x_r) / (∥x_p∥₂ ∥x_r∥₂) ≥ ϵ2.   (5–4)

For iteration r, a random point x_p is selected as a candidate point whose N_{ϵ1}
neighborhood contains the maximum number of points. Next, all the points that belong
to N_{ϵ1}(x_p) are considered as the points belonging to the r-th hyperplane, and all the
points that belong to N_{ϵ2}(x_p) are taken as a starting solution for the r-th hierarchical
problem. Here ϵ1 and ϵ2 are selected arbitrarily such that ϵ1 > ϵ2, so that
N_{ϵ1}(x_r) ⊆ N_{ϵ2}(x_r). In the case study, we have set ϵ1 = cos(θ) with
θ ∈ [15°, 20°], and ϵ2 = cos(25°). The sets N_{ϵ1} and N_{ϵ2} are the two samples of
the proposed RANSAC based algorithm.
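A minimal sketch of this seeding step follows; it assumes a direction-invariant (absolute) cosine similarity, which is an implementation choice not spelled out in the text.

```python
import numpy as np

def seed_cluster(X, eps1, eps2):
    """Pick the candidate point whose eps1 cosine neighborhood is the
    largest, and return the two neighborhoods used to jump-start one
    hierarchical problem.  eps = cos(angle threshold), eps1 > eps2, so
    the eps1 set is the tighter one.  X: (d, N) points in columns."""
    Xn = X / np.linalg.norm(X, axis=0, keepdims=True)
    C = np.abs(Xn.T @ Xn)                  # pairwise |cosine| similarities
    counts = (C >= eps1).sum(axis=1)       # neighborhood sizes
    p = int(np.argmax(counts))             # candidate point index
    tight = np.flatnonzero(C[p] >= eps1)   # assigned to the r-th hyperplane
    loose = np.flatnonzero(C[p] >= eps2)   # starting solution
    return p, tight, loose
```

Because eps1 > eps2, the tight set is always contained in the loose set, matching the nesting of the two neighborhoods above.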
After running the proposed algorithm, Aπ is recovered from Equation 4–17.
Figure 5-19 shows the column normalized dictionary matrix Aπ, and Figure 5-20 shows
the recovered source Sπ. Both A and Aπ are column normalized and truncated to two
decimal places for the sake of easy comparison. It can be seen that Aπ and A differ
only by a permutation of the columns, which shows the excellent performance of the
proposed algorithm.
In addition, data points X ∈ R^{m×N} have been generated, without any noise,
from randomly generated dictionary A ∈ R^{n×n} and source S ∈ R^{n×N} matrices
for different values of m, n and N. The objective is to study the performance of the
proposed algorithm with respect to the solution time. For consistency, all the simulations
were carried out on the same machine (using 8 processors on a 64 processor Linux
server). To accommodate infeasibility issues for highly ill-conditioned A matrices, the
best time out of 5 runs is reported. Table 5-17 presents the solution times for the cases
m = n = 6 and N = 600, ... , 3800, and Table 5-18 presents the solution times for the
cases m = n = 6, 8, ... , 16 with N = 100 × n. Based on the simulation results in
Tables 5-17 and 5-18, it can be seen that the complexity of the problem is driven more
by n than by N.
5.12. Robust Source Extraction
In Section 5.12, the application and performance of Algorithm 4.3 are presented.
In all the simulations, only one iteration of the second order method is executed, and the
data are randomly generated. Figure 5-21A shows the original 7 signals. The signals
are linearly mixed using a random A matrix to obtain the X matrix, and 2% noise is then
added to X. Figure 5-21B shows the non-contaminated and Figure 5-21C the
contaminated X matrices. Figure 5-22A shows the results obtained by simple quadratic
minimization, and Figure 5-22B shows the solution obtained by the proposed approach.
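To illustrate why a correntropy objective resists contamination where quadratic minimization does not, the sketch below recovers s from x = As plus a few gross outliers by gradient ascent on the Gaussian-kernel correntropy of the residual. It is an illustrative stand-in for Algorithm 4.3; the kernel width σ, step size, and update rule here are assumptions, not the thesis's second order method.

```python
import numpy as np

def correntropy_recover(A, x, sigma=1.0, iters=500, lr=0.1):
    """Recover s from x = A @ s + gross outliers by gradient ascent on
    sum_i exp(-r_i^2 / (2 sigma^2)) with r = x - A @ s (Welsch loss).
    Outlier residuals get near-zero kernel weight, so they barely
    influence the fit.  A: (m, n), x: (m,)."""
    s = np.linalg.lstsq(A, x, rcond=None)[0]    # quadratic warm start
    for _ in range(iters):
        r = x - A @ s
        w = np.exp(-r**2 / (2.0 * sigma**2))    # per-sample kernel weights
        s = s + lr * (A.T @ (w * r)) / len(x)   # weighted gradient step
    return s
```

The contrast mirrors Figures 5-22A and 5-22B: plain least squares is pulled off by the contaminated samples, while the correntropy fit effectively down-weights them to zero.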
Furthermore, in order to assess the performance of the proposed algorithm, the
experiment is repeated 50 times with random mixing and source matrices in every
repetition, for every value of n = 3, ... , 6. The error between the recovered and original
sources is calculated by the formula given in Equation 5–2. The average error, standard
deviation, and average time to solve one instance for the simulated signals are
presented in Table 5-19.
Table 5-1. Binary classification case study 1
Name    Inherent Distribution     Noise Criteria
Cauchy  Cauchy Distribution       random global flights of the distribution are considered as noise
Skew    Skew Normal Distribution  random 10% noise is added to the data
Table 5-2. Cauchy data
      5 PEs   10 PEs  20 PEs
AQG   0.7157  0.8065  0.8208
ACG   0.6995  0.7702  0.814
ACC   0.6645  0.729   0.801
ACS   0.834   0.8405  0.8403
Table 5-3. Skew data
      5 PEs   10 PEs  20 PEs
AQG   0.902   0.901   0.909
ACG   0.9008  0.905   0.9005
ACC   0.8998  0.8993  0.9
ACS   0.9005  0.8998  0.9025
Table 5-4. Binary classification case study 2
Data set                       Attributes (or Features)  Total records  Classes  Training size
Pima Indians Diabetes (PID)    8                         768            2        400
Wisconsin Breast Cancer (WBC)  9                         683            2        300
BUPA Liver Disorders (BLD)     6                         345            2        150
Table 5-5. Sample based performance of ANN on PID data
              1      2      3      4      5      6      7      8      9      10     Best
AQG (5 PEs)   0.74   0.755  0.757  0.757  0.757  0.756  0.756  0.755  0.754  0.754  0.758
AQG (10 PEs)  0.75   0.757  0.758  0.757  0.757  0.756  0.756  0.755  0.755  0.754
AQG (20 PEs)  0.738  0.744  0.744  0.748  0.748  0.747  0.745  0.743  0.744  0.744
ACG (5 PEs)   0.762  0.763  0.763  0.763  0.763                                     0.766
ACG (10 PEs)  0.765  0.766  0.766  0.765  0.765
ACG (20 PEs)  0.759  0.761  0.76   0.76   0.76
ACC (5 PEs)   0.687  0.746  0.754  0.763  0.761  0.761  0.763  0.762  0.762  0.761  0.768
ACC (10 PEs)  0.731  0.758  0.76   0.765  0.764  0.765  0.768  0.766  0.765  0.764
ACC (20 PEs)  0.747  0.759  0.764  0.762  0.764  0.763  0.76   0.765  0.766  0.763
Table 5-6. Block based performance of ANN on PID data
              1      2      3      4      5      6      7      8      9      10     Best
AQG (5 PEs)   0.71   0.744  0.758  0.763  0.764  0.762  0.767  0.764  0.763  0.766  0.769
AQG (10 PEs)  0.736  0.756  0.762  0.763  0.766  0.767  0.769  0.767  0.766  0.765
AQG (20 PEs)  0.746  0.761  0.766  0.768  0.765  0.768  0.767  0.767  0.762  0.764
ACG (5 PEs)   0.766  0.767  0.765  0.765  0.766                                     0.769
ACG (10 PEs)  0.767  0.765  0.769  0.765  0.765
ACG (20 PEs)  0.767  0.765  0.765  0.765  0.766
ACC (5 PEs)   0.67   0.701  0.724  0.741  0.752  0.759  0.759  0.765  0.765  0.762  0.77
ACC (10 PEs)  0.698  0.725  0.746  0.754  0.762  0.763  0.765  0.767  0.77   0.768
ACC (20 PEs)  0.72   0.75   0.762  0.763  0.764  0.767  0.764  0.765  0.766  0.769
ACS (5 PEs)   0.755  0.752  0.752  0.756  0.754  0.751  0.754  0.754  0.753  0.75   0.756
ACS (10 PEs)  0.752  0.752  0.755  0.749  0.753  0.752  0.75   0.751  0.75   0.755
ACS (20 PEs)  0.747  0.754  0.751  0.752  0.748  0.75   0.749  0.748  0.748  0.75
Table 5-7. Sample based performance of ANN on BLD data
              1      2      3      4      5      6      7      8      9      10     Best
AQG (5 PEs)   0.569  0.578  0.578  0.577  0.588  0.584  0.591  0.59   0.591  0.592  0.62
AQG (10 PEs)  0.568  0.57   0.584  0.584  0.596  0.599  0.599  0.606  0.611  0.61
AQG (20 PEs)  0.572  0.574  0.591  0.597  0.603  0.611  0.603  0.614  0.62   0.622
ACG (5 PEs)   0.578  0.579  0.579  0.58   0.583                                     0.596
ACG (10 PEs)  0.579  0.581  0.585  0.585  0.587
ACG (20 PEs)  0.584  0.596  0.594  0.595  0.592
ACC (5 PEs)   0.575  0.577  0.581  0.579  0.583  0.585  0.592  0.59   0.592  0.597  0.627
ACC (10 PEs)  0.57   0.576  0.582  0.584  0.591  0.591  0.597  0.603  0.61   0.613
ACC (20 PEs)  0.571  0.581  0.582  0.592  0.601  0.6    0.608  0.612  0.622  0.627
Table 5-8. Block based performance of ANN on BLD data
              1      2      3      4      5      6      7      8      9      10     Best
AQG (5 PEs)   0.561  0.57   0.59   0.597  0.595  0.602  0.613  0.615  0.631  0.638  0.685
AQG (10 PEs)  0.57   0.595  0.596  0.61   0.625  0.638  0.644  0.653  0.657  0.658
AQG (20 PEs)  0.58   0.61   0.637  0.644  0.652  0.663  0.672  0.668  0.675  0.685
ACG (5 PEs)   0.612  0.614  0.615  0.626  0.633                                     0.685
ACG (10 PEs)  0.631  0.639  0.643  0.655  0.659
ACG (20 PEs)  0.66   0.667  0.671  0.675  0.685
ACC (5 PEs)   0.57   0.578  0.581  0.591  0.604  0.604  0.628  0.63   0.632  0.641  0.686
ACC (10 PEs)  0.565  0.585  0.598  0.617  0.622  0.631  0.65   0.647  0.659  0.668
ACC (20 PEs)  0.581  0.608  0.634  0.639  0.662  0.663  0.667  0.675  0.677  0.686
ACS (5 PEs)   0.612  0.634  0.643  0.637  0.636  0.641  0.633  0.646  0.643  0.642  0.675
ACS (10 PEs)  0.637  0.655  0.655  0.657  0.656  0.66   0.658  0.657  0.656  0.654
ACS (20 PEs)  0.653  0.675  0.668  0.669  0.668  0.669  0.664  0.674  0.67   0.67
Table 5-9. Sample based performance of ANN on WBC data
              1      2      3      4      5      6      7      8      9      10     Best
AQG (5 PEs)   0.966  0.968  0.969  0.969  0.969  0.969  0.969  0.968  0.968  0.968  0.97
AQG (10 PEs)  0.965  0.969  0.97   0.97   0.969  0.97   0.969  0.969  0.969  0.969
AQG (20 PEs)  0.965  0.97   0.97   0.97   0.97   0.969  0.969  0.969  0.968  0.968
ACG (5 PEs)   0.97   0.97   0.97   0.97   0.97                                      0.971
ACG (10 PEs)  0.97   0.97   0.97   0.97   0.971
ACG (20 PEs)  0.97   0.971  0.971  0.971  0.971
ACC (5 PEs)   0.969  0.97   0.969  0.969  0.969  0.97   0.971  0.97   0.97   0.97   0.972
ACC (10 PEs)  0.97   0.97   0.971  0.97   0.972  0.97   0.97   0.971  0.971  0.971
ACC (20 PEs)  0.971  0.97   0.97   0.97   0.97   0.97   0.97   0.97   0.969  0.97
Table 5-10. Block based performance of ANN on WBC data
              1      2      3      4      5      6      7      8      9      10     Best
AQG (5 PEs)   0.961  0.968  0.968  0.968  0.97   0.97   0.97   0.97   0.969  0.97   0.97
AQG (10 PEs)  0.964  0.968  0.968  0.968  0.969  0.97   0.969  0.97   0.969  0.968
AQG (20 PEs)  0.966  0.97   0.969  0.969  0.969  0.969  0.97   0.97   0.97   0.97
ACG (5 PEs)   0.97   0.967  0.969  0.967  0.97                                      0.973
ACG (10 PEs)  0.971  0.97   0.971  0.973  0.965
ACG (20 PEs)  0.971  0.965  0.97   0.969  0.971
ACC (5 PEs)   0.961  0.966  0.968  0.968  0.97   0.971  0.97   0.97   0.972  0.969  0.972
ACC (10 PEs)  0.965  0.968  0.969  0.97   0.968  0.969  0.968  0.97   0.97   0.969
ACC (20 PEs)  0.966  0.968  0.969  0.971  0.969  0.97   0.97   0.97   0.97   0.97
ACS (5 PEs)   0.965  0.965  0.966  0.965  0.964  0.965  0.964  0.965  0.967  0.966  0.967
ACS (10 PEs)  0.964  0.966  0.967  0.965  0.965  0.965  0.964  0.965  0.965  0.966
ACS (20 PEs)  0.965  0.964  0.965  0.964  0.964  0.962  0.963  0.963  0.965  0.964
Table 5-11. Performance of ACS for different values of σ and number of PEs in hidden layer on PID data
        σ=0.5   σ=0.8   σ=1     σ=1.2   σ=1.4   σ=1.6
5 PE    0.7556  0.749   0.7593  0.7568  0.7633  0.7616
10 PE   0.7549  0.7461  0.7585  0.7604  0.7608  0.7603
20 PE   0.7543  0.7423  0.7614  0.758   0.7585  0.7593
Table 5-12. Performance of ACS for different values of σ and number of PEs in hidden layer on BLD data
        σ=0.5   σ=0.8   σ=1     σ=1.2   σ=1.4   σ=1.6
5 PE    0.646   0.6806  0.681   0.6861  0.6853  0.684
10 PE   0.6596  0.6884  0.6928  0.6931  0.6941  0.6928
20 PE   0.6753  0.6992  0.6996  0.6997  0.7013  0.7007
Table 5-13. Performance of ACS for different values of σ and number of PEs in hidden layer on WBC data
        σ=0.5   σ=0.8   σ=1     σ=1.2   σ=1.4   σ=1.6
5 PE    0.9672  0.9646  0.9633  0.9648  0.9648  0.9672
10 PE   0.9665  0.9639  0.9631  0.9647  0.9634  0.9635
20 PE   0.9654  0.9621  0.9613  0.9625  0.963   0.9634
Table 5-14. Linear mixing assumption
                  THETA           ALPHA        BETA
Frequency         3.5 - 7.5 Hz    8 - 13 Hz    14 - 30 Hz
Activity          falling asleep  closed eyes  concentration
Error (mean)      7.32E-04        0.001        3.3539
Error (variance)  1.04E-06        2.06E-06     97.8081
Table 5-15. Average unmixing error
n   MRI Scans       Finger Prints  Zip Codes  Ghost Effect
3   1.41 × 10^-16   0.0031         0.006      1.77 × 10^-4
4   5.42 × 10^-4    0.0046         0.0111     7.19 × 10^-4
5   0.0022          0.0064         0.0152     0.0016
6   0.0069          0.0084         0.0186     0.0033
7   0.0158          0.0104         0.0263     0.0055
Table 5-16. Standard deviation unmixing error
n   MRI Scans    Finger Prints  Zip Codes   Ghost Effect
3   2 × 10^-16   1 × 10^-4      4 × 10^-4   1 × 10^-15
4   2 × 10^-4    4 × 10^-4      8 × 10^-4   4 × 10^-5
5   2 × 10^-5    1 × 10^-3      0.002       5 × 10^-5
6   7 × 10^-4    0.001          0.003       9 × 10^-5
7   0.001        0.002          0.004       7 × 10^-4
Table 5-17. Simulation-1 results for case study 2
m, n  N     time (sec)
6     600   9.407561
6     1000  15.34375
6     1400  30.17927
6     1800  46.28162
6     2200  62.92134
6     2600  97.71409
6     3000  112.9896
6     3400  149.6679
Table 5-18. Simulation-2 results for case study 2
m, n  N     time (sec)
6     600   9.407561
8     800   46.69568
9     900   29.97633
10    1000  58.38711
11    1100  73.56582
12    1200  132.9046
14    1400  634.7418
16    1600  4320.502
Table 5-19. Performance of correntropy minimization algorithm
                n × N:  3×300  4×400  5×500   6×600   7×700
mean                    0.035  0.036  0.039   0.027   0.025
std (×10^-2)            4.24   5.83   9.27    1.29    0.96
time (s)                0.39   59.13  139.88  284.36  527.79
[EEG traces, 10-20 system, left hemisphere: Fp1, F7, F3, T3, C3, T5, P3, O1]
Figure 5-7. EEG recordings from monkey.
[ECoG traces, 128 channels, left hemisphere]
Figure 5-8. ECoG recordings from monkey.
[Scatter plots of principal components]
Figure 5-9. fMRI data visualization. A) PCA reduction to 3 dimensions. B) PCA reduction to 2 dimensions.

[Convex hull plots of the projected fMRI data]
Figure 5-10. Convex hull PPC1 assumption. A) Convex hull representation 1. B) Convex hull representation 2.
[16 normalized source signals, 1600 samples each]
Figure 5-15. Original sparse source (normalized) for case study 1
[Figure: 16 normalized time series, samples 0-1600, amplitude -1 to 1; panel title "Given Data"]
Figure 5-16. Given mixtures of sources for case study 1
A =
0.1 0.28 -0.33 0.13 0.03 -0.17 -0.07 0.04 0.53 0.41 -0.2 0.45 0.1 -0.02 0.1 -0.03
-0.07 0.3 0.43 0.32 0.15 -0.25 0.18 0.4 0 0.16 0.23 -0.18 0.05 -0.2 0.19 -0.3
-0.27 -0.12 -0.5 0.45 0.13 -0.27 0.43 -0.29 -0.1 -0.08 0.19 -0.08 -0.2 -0.08 0.01 0.06
0.29 0.2 -0.19 -0.02 -0.37 -0.24 0.01 0.4 -0.4 -0.05 -0.01 0.16 -0.37 0.02 0.12 0.35
0.24 0.05 -0.33 0.09 -0.06 0.5 0.02 0.16 0.05 0.31 -0.07 -0.5 -0.31 0.03 0.02 -0.25
0.28 0.12 0.14 0.49 -0.13 0.08 0.27 -0.03 0.06 -0.27 -0.22 0.01 0.21 0.59 -0.13 0
0.28 -0.27 -0.24 -0.24 0.06 -0.27 0.22 0.31 -0.04 0.01 0.16 0.04 0.23 0 -0.53 -0.33
0.5 -0.28 0.15 -0.05 0.07 -0.35 -0.08 -0.23 0.39 0 0.24 -0.26 -0.28 0.11 0.21 0.16
0 -0.37 0.24 0.23 -0.52 0.01 0.07 -0.19 -0.14 0.6 0.04 0.12 0.06 -0.09 -0.14 0.04
-0.37 -0.04 -0.04 -0.16 -0.47 0.06 0.05 0.15 0.3 -0.16 0.42 0.12 -0.1 0.32 0.2 -0.28
-0.28 -0.32 -0.08 0.26 0.09 -0.28 -0.58 0.28 -0.07 0.07 -0.19 -0.16 -0.1 0.31 -0.08 -0.07
0.2 0.08 -0.14 0.27 -0.39 -0.08 -0.37 -0.17 0.03 -0.39 -0.02 -0.03 0.11 -0.47 0 -0.32
0.24 -0.2 0 0.27 0.32 0.4 -0.19 0.12 -0.16 -0.05 0.45 0.49 -0.1 0 0.1 -0.08
0.09 0.4 -0.13 -0.11 0.01 -0.17 -0.26 -0.39 -0.38 0.24 0.3 -0.07 0.19 0.36 0.03 -0.22
-0.08 0.21 -0.06 0.17 -0.07 0.12 -0.21 0.13 0.24 0 0.44 -0.21 0.2 -0.07 -0.4 0.53
-0.07 0.26 0.26 0.01 0.04 -0.04 -0.05 -0.18 0.11 0 -0.03 0.19 -0.61 0.01 -0.58 -0.18
Figure 5-17. Original mixing matrix for case study 1
SS^T/T, T = 1600:
0.051 0.04 0.037 0.05 0.037 0.055 0.049 0.037 0.053 0.047 0.034 0.038 0.049 0.037 0.037 0.052
0.04 0.051 0.041 0.051 0.039 0.06 0.054 0.04 0.057 0.052 0.034 0.041 0.05 0.044 0.044 0.057
0.037 0.041 0.053 0.049 0.038 0.056 0.05 0.047 0.055 0.05 0.032 0.039 0.047 0.039 0.039 0.05
0.05 0.051 0.049 0.067 0.047 0.069 0.063 0.048 0.068 0.06 0.039 0.048 0.058 0.047 0.046 0.066
0.037 0.039 0.038 0.047 0.05 0.055 0.05 0.037 0.053 0.047 0.031 0.039 0.045 0.037 0.036 0.052
0.055 0.06 0.056 0.069 0.055 0.094 0.074 0.055 0.079 0.071 0.047 0.057 0.068 0.055 0.056 0.079
0.049 0.054 0.05 0.063 0.05 0.074 0.078 0.049 0.071 0.064 0.042 0.053 0.062 0.05 0.05 0.071
0.037 0.04 0.047 0.048 0.037 0.055 0.049 0.051 0.054 0.049 0.032 0.038 0.046 0.039 0.038 0.05
0.053 0.057 0.055 0.068 0.053 0.079 0.071 0.054 0.084 0.068 0.043 0.055 0.063 0.053 0.052 0.074
0.047 0.052 0.05 0.06 0.047 0.071 0.064 0.049 0.068 0.07 0.04 0.049 0.059 0.048 0.047 0.067
0.034 0.034 0.032 0.039 0.031 0.047 0.042 0.032 0.043 0.04 0.035 0.032 0.043 0.031 0.031 0.044
0.038 0.041 0.039 0.048 0.039 0.057 0.053 0.038 0.055 0.049 0.032 0.054 0.047 0.038 0.038 0.055
0.049 0.05 0.047 0.058 0.045 0.068 0.062 0.046 0.063 0.059 0.043 0.047 0.068 0.046 0.046 0.064
0.037 0.044 0.039 0.047 0.037 0.055 0.05 0.039 0.053 0.048 0.031 0.038 0.046 0.051 0.041 0.052
0.037 0.044 0.039 0.046 0.036 0.056 0.05 0.038 0.052 0.047 0.031 0.038 0.046 0.041 0.05 0.052
0.052 0.057 0.05 0.066 0.052 0.079 0.071 0.05 0.074 0.067 0.044 0.055 0.064 0.052 0.052 0.085
Figure 5-18. Mixing matrices for case study 1
A =
0.1 0.45 -0.2 0.28 -0.02 0.13 -0.1 0.33 -0.1 0.17 0.04 0.53 -0.03 0.07 0.03 0.41
-0.07 -0.18 0.23 0.3 -0.2 0.32 -0.05 -0.43 -0.19 0.25 0.4 0 -0.15 -0.18 0.3 0.16
-0.27 -0.08 0.19 -0.12 -0.08 0.45 0.2 0.5 -0.01 0.27 -0.29 -0.1 -0.13 -0.43 -0.06 -0.08
0.29 0.16 -0.01 0.2 0.02 -0.02 0.37 0.19 -0.12 0.24 0.4 -0.4 0.37 -0.01 -0.35 -0.05
0.24 -0.5 -0.07 0.05 0.03 0.09 0.31 0.33 -0.02 -0.5 0.16 0.05 0.06 -0.02 0.25 0.31
0.28 0.01 -0.22 0.12 0.59 0.49 -0.21 -0.14 0.13 -0.08 -0.03 0.06 0.13 -0.27 0 -0.27
0.28 0.04 0.16 -0.27 0 -0.24 -0.23 0.24 0.53 0.27 0.31 -0.04 -0.06 -0.22 0.33 0.01
0.5 -0.26 0.24 -0.28 0.11 -0.05 0.28 -0.15 -0.21 0.35 -0.23 0.39 -0.07 0.08 -0.16 0
0 0.12 0.04 -0.37 -0.09 0.23 -0.06 -0.24 0.14 -0.01 -0.19 -0.14 0.52 -0.07 -0.04 0.6
-0.37 0.12 0.42 -0.04 0.32 -0.16 0.1 0.04 -0.2 -0.06 0.15 0.3 0.47 -0.05 0.28 -0.16
-0.28 -0.16 -0.19 -0.32 0.31 0.26 0.1 0.08 0.08 0.28 0.28 -0.07 -0.09 0.58 0.07 0.07
0.2 -0.03 -0.02 0.08 -0.47 0.27 -0.11 0.14 0 0.08 -0.17 0.03 0.39 0.37 0.32 -0.39
0.24 0.49 0.45 -0.2 0 0.27 0.1 0 -0.1 -0.4 0.12 -0.16 -0.32 0.19 0.08 -0.05
0.09 -0.07 0.3 0.4 0.36 -0.11 -0.19 0.13 -0.03 0.17 -0.39 -0.38 -0.01 0.26 0.22 0.24
-0.08 -0.21 0.44 0.21 -0.07 0.17 -0.2 0.06 0.4 -0.12 0.13 0.24 0.07 0.21 -0.53 0
-0.07 0.19 -0.03 0.26 0.01 0.01 0.61 -0.26 0.58 0.04 -0.18 0.11 -0.04 0.05 0.18 0
Figure 5-19. Recovered mixing matrix for case study 1
[Figure: 16 normalized time series, samples 0-1600, amplitude -1 to 1; panel title "Recovered Source"]
Figure 5-20. Recovered source (normalized) for case study 1
[Figure: panels A, B, C of time series, samples 0-700, amplitude -1 to 1]
Figure 5-21. Data for source extraction method. A) Original source signal. B) Mixture before adding noise. C) Mixture after adding noise.
[Figure: panels A and B of time series, samples 0-700, amplitude -1 to 1]
Figure 5-22. Recovery of sources by quadratic and correntropy loss. A) Recovered source by quadratic error minimization. B) Recovered source by proposed method.
CHAPTER 6
SUMMARY
In Chapter 3, two novel approaches integrating the concepts of correntropy into data
classification are proposed. The rationale for using the correntropic loss function
in data classification is its ability to deemphasize outliers during the learning phase.
Thus, outliers have no influence on the classification rule that is obtained. This is
an important property of the correntropy function that can be exploited in real world data
classification problems. In addition, the use of the correntropic loss function in
two different forms has been illustrated. In the first form, the kernel width is allowed to vary
during the learning phase. In order to incorporate a varying kernel width, a CS based ANN
learning method is proposed (the ACC method). The ACC method uses the simple,
well known delta rule to update the weights. However, the purpose of using this
back-propagation mechanism is to illustrate the use of CS based ANN learning. More
sophisticated methods can replace back-propagation to enhance the basic ACC
algorithm.
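The deemphasis mechanism can be sketched as follows. This is a minimal illustration under assumed names and a toy learning rate, not the exact ACC formulation: a correntropic loss and one delta-rule update of a single linear neuron, where the Gaussian factor shrinks the update for outlier-sized errors.

```python
import numpy as np

def c_loss(e, sigma):
    """Correntropic loss: bounded in [0, 1), so large (outlier) errors
    saturate instead of dominating the objective."""
    return 1.0 - np.exp(-np.square(e) / (2.0 * sigma**2))

def delta_rule_step(w, x, y, sigma, lr=0.1):
    """One delta-rule update of a linear neuron y_hat = w.x under the
    correntropic loss; the exp(...) factor vanishes for outliers."""
    e = y - w @ x
    grad = -(e / sigma**2) * np.exp(-e**2 / (2.0 * sigma**2)) * x
    return w - lr * grad

w = delta_rule_step(np.zeros(2), x=np.array([1.0, 2.0]), y=1.0, sigma=1.0)
```

Note that for a quadratic loss the gradient grows linearly with the error, whereas here it decays to zero once the error exceeds a few kernel widths.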
The second form of the correntropic loss function has a fixed kernel width.
Depending on the kernel width, the loss function may be convex or invex. However,
the ANN mapper is inherently nonconvex. Therefore, any classical gradient
descent algorithm in the ANN framework may converge to a local minimum. To avoid such
local convergence, the gradient descent method has been replaced by an SA algorithm.
Although a simple SA is used within the ANN framework, the method can
suitably incorporate other specialized forms of SA.
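A minimal sketch of the simple SA idea over a weight vector follows; this is a generic geometric-cooling SA with an assumed toy loss and schedule parameters, not the dissertation's specific variant.

```python
import numpy as np

def anneal(loss, w0, T0=1.0, cooling=0.95, steps=200, step_size=0.5, seed=0):
    """Simulated annealing: accept downhill moves always, uphill moves
    with Boltzmann probability exp(-delta/T), cooling T geometrically."""
    rng = np.random.default_rng(seed)
    w, f = w0.copy(), loss(w0)
    best_w, best_f = w.copy(), f
    T = T0
    for _ in range(steps):
        cand = w + step_size * rng.standard_normal(w.shape)
        fc = loss(cand)
        if fc < f or rng.random() < np.exp(-(fc - f) / T):
            w, f = cand, fc
            if f < best_f:
                best_w, best_f = w.copy(), f
        T *= cooling  # geometric cooling schedule
    return best_w, best_f

# Nonconvex toy loss with a poor local minimum at w = -1:
loss = lambda w: (w[0]**2 - 1)**2 + 0.3 * (w[0] - 2)**2
w, f = anneal(loss, np.array([-1.0]))
```

Starting in the local basin at w = -1, the occasional uphill acceptances let the search escape toward the better basin near w = 1, which plain gradient descent cannot do.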
Chapter 4 proposes solution methods for two major sparsity based classes of the
BSS problem. The proposed solution methods are broken down into two major steps.
The first step involves identification of the mixing matrix. Two different approaches to
identify the mixing matrix, based on the non-negativity and sparsity level of the sources,
are proposed. The second step involves extraction of the source matrix. For this step, a
correntropy based method is proposed. The proposed method can be used not only
to identify the source matrix, but also to identify the outliers in the mixture matrix. By
applying the two steps, the BSS problem in the presence of outliers can be solved
efficiently.
Experiments on binary classification show that the proposed correntropic loss
function improves the classification accuracy of ANN based classifiers. Furthermore,
the experiments show that the proposed approaches compete strongly with the
state-of-the-art SVM based classifier, which suggests that the correntropic loss
function is a substantial contender as a robust measure in risk minimization.
Moreover, the development of efficient algorithms for parameter searches in ANNs
will further enhance the importance of the correntropic loss function. Experiments on
signal separation show that the proposed method for hyperplane clustering can solve
problems up to size 16, which was unattainable with earlier methods. Furthermore, the
correntropy based source extraction method shows that a suitable kernel width can be
obtained from the contaminated data, which separates the outliers from the good
data points.
6.1. Criticism
Robust methods have always been criticized for their loss of efficiency and increased
computational complexity. Theoretical results like those shown by Fisher [30] always
support the usage of the quadratic loss function. Moreover, the quadratic loss function is
easy to optimize and is efficient in model parameter estimation. Thus, the notion
of a smoothing effect (i.e., the effect of a few outliers can be subdued by the presence of
a large number of good data points) has always been used to counter the idea of robust
methods.
There are two basic types of criticism of the usage of BSS approaches in data
analysis. The primary comment concerns the loss of order in the sources. As discussed
in Section 4.1, the scaling issue of BSS methods can be overcome by using suitable
normalization approaches. However, identifying appropriate sources in general is not
possible. Makeig et al. [58] discussed this issue, and stated the importance
of knowing "what the sources are" instead of "where the sources are" in
understanding cortical activity. Furthermore, the underdetermined case is usually
resolved by experimental design, where artifacts are introduced into the data while
recording to reduce the underdeterminacy.
The other type of criticism of BSS approaches concerns the validity of the
assumptions imposed on the mixing and source matrices. The smearing of a signal by
volume conduction is instantaneous, so the no-delay assumption is not much of a
concern. The linear mixing assumption is the critical one and is hard to validate
experimentally. However, superposition of signals (a typical natural phenomenon) can be
used to support the notion of linear mixing. In addition, the assumptions imposed
on the source signals are often objected to. Statistical independence among neuronal
signals is hard to justify. Therefore, researchers working with ICA directed their research
toward justifying statistical independence among artifacts and neuronal signals. On the
other hand, the assumptions of the novel SCA approaches are yet to be experimentally
validated on neurological data. Furthermore, sparsification methods transforming a
given problem into a sparse source problem are yet to be explored.
6.2. Conclusion
Conventionally, a quadratic loss function is used as a measure of similarity.
Rockafellar et al. [79] proposed four axioms for an error measure: the error measure is
strictly positive for a non-zero error, positive homogeneity, subadditivity, and lower
semicontinuity. Homogeneity and robustness are contradictory, and cannot coexist in a
single function. Thus in this work, the following properties favorable for a robust error
measure are proposed: (1) the error measure is strictly positive for a non-zero error, (2)
generalized convexity, (3) differentiability, (4) symmetry, and (5) lower semicontinuity.
One of the goals of this work is to propose a specific robust measure, called the
correntropic loss function, that calculates the similarity between two random variables y
and a, and satisfies the above five properties. Furthermore, similar to the generalization
of SVMs from the basic formulation to the kernel based soft margin formulation,
correntropy based ANNs can be viewed as a generalized form of ANNs (both in
regression [72] and classification). Rigorous experimental results in Chapter 5
demonstrate the usability of correntropy based ANNs in real world data classification
problems.
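The practical difference between the quadratic and correntropic measures can be illustrated on a simple location-estimation task. The function name, kernel width, and fixed-point scheme below are illustrative assumptions, not the dissertation's exact algorithm: the quadratic-loss optimum is the mean, which a single gross outlier drags away, while the correntropy weights discount the outlier almost entirely.

```python
import numpy as np

def correntropy_location(x, sigma=1.0, iters=50):
    """Fixed-point iteration for a correntropy-optimal location estimate;
    Gaussian weights decay with distance, so gross outliers get almost
    no vote (illustrative sketch)."""
    m = np.median(x)                        # robust starting point
    for _ in range(iters):
        w = np.exp(-(x - m) ** 2 / (2.0 * sigma**2))
        m = np.sum(w * x) / np.sum(w)       # weighted-mean update
    return m

data = np.concatenate([np.full(9, 1.0), [100.0]])   # one gross outlier
print(np.mean(data))               # quadratic-loss estimate: 10.9
print(correntropy_location(data))  # stays at 1.0
```

The outlier shifts the mean by almost an order of magnitude, while the correntropy estimate is unchanged, which is the deemphasis behavior exploited throughout this work.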
BSS approaches based on ICA are well known in the signal separation literature.
However, sparsity based BSS methods are relatively new, and their potential is yet to be
explored in the area of signal processing. Through the systematic overview presented in
this dissertation, awareness of the novel sparsity based BSS methods is increased,
and the differences between ICA and SCA methods are highlighted. The primary
difference is that ICA based methods are mostly suitable for artifact filtering, whereas
SCA based methods may be suitable for separating pure sources, which are not
necessarily statistically independent. Similar to EEG/MEG analysis with ICA, where
artifacts are induced into the signal via strategic experiments, efficient experiments
for SCA can be designed, where sparsity is induced into the source signals.
Furthermore, sparsification methods (like wavelet transforms) that can efficiently
sparsify source signals can also be used to analyze non-sparse source signals. To sum
up, SCA based methods may open a new door to understanding the mysteries of the
brain.
To conclude, the computational complexity of robust methods will always be an
issue when compared to traditional methods. However, properties like invexity for
robust measures, and sample selection strategies for robust algorithms, will overcome
the issues related to computational complexity to a certain extent. For practical
scenarios, robust methods are preferable to traditional data analysis methods in terms
of solution quality. Furthermore, even in theoretical scenarios, the performance of
robust methods in terms of solution quality is competitive with the traditional methods.
APPENDIX
GENERALIZED CONVEXITY
In the following discussion, the functions are assumed to be twice differentiable.
Obviously, convex analysis is not confined to the differentiable functions, and the
interested readers may refer to [5, 6, 11, 61, 78] for comprehensive details. An important
building block of convex analysis is the notion of a convex set. A set is said to be convex,
if the line segment joining any two points of the set completely lie within the set.
Definition 1. Let f : S 7→ R be a twice differentiable function, where S is a nonempty
convex subset of Rn. The function f is said to be convex, if and only if, the Hessian
matrix of f is positive semidefinite at each point in S.
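Definition 1 can be checked numerically at sample points; a small sketch follows, in which the example functions and tolerance are assumptions for illustration.

```python
import numpy as np

def is_convex_at(hessian, x, tol=1e-9):
    """Test Definition 1 at one sample point: the Hessian must be
    positive semidefinite, i.e. all eigenvalues >= 0 (up to tol)."""
    return bool(np.all(np.linalg.eigvalsh(hessian(x)) >= -tol))

# f(x, y) = x^2 + y^2: Hessian is 2I everywhere, so f is convex;
# g(x, y) = x^2 - y^2: indefinite Hessian, so g is not convex.
hess_f = lambda x: np.array([[2.0, 0.0], [0.0, 2.0]])
hess_g = lambda x: np.array([[2.0, 0.0], [0.0, -2.0]])
print(is_convex_at(hess_f, np.zeros(2)))  # True
print(is_convex_at(hess_g, np.zeros(2)))  # False
```

Verifying convexity on all of S requires the check at every point, of course; a pointwise test can only certify non-convexity or support a conjecture.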
Duality and optimality conditions are two important theories in the field of
optimization that are nurtured by convexity [8]. Convexity added the crucial brick of no
duality gap to duality theory. Furthermore, it is convexity that provided a ladder for a
local optimal solution to reach the status of a global optimal solution. These two theories
are the backbone of almost all optimization algorithms. However, there has always
been curiosity among researchers to relax the strict requirements of convexity, since
most practical problems tend to be non-convex. As a first successful attempt,
Mangasarian [59] generalized the notion of convexity by proposing another class of
functions called pseudoconvex functions.
Definition 2. Let f : S 7→ R be a differentiable function, where S is a nonempty subset
of Rn. The function f is said to be pseudoconvex:
if ∇f (x1)T (x2 − x1) ≥ 0 then f (x2) ≥ f (x1) ∀ x1, x2 ∈ S
Pseudoconvex functions do not require the positive semidefinite criterion imposed
on convex functions. Furthermore, pseudoconvex functions preserve tractability,
i.e., a local minimum of a pseudoconvex function on a convex domain is a global
minimum. Thus, pseudoconvexity extended the optimality conditions to a larger class
of functions. Pseudoconvexity in the objective function, along with quasiconvexity
in the constraints were assumed to be the weakest conditions that can be imposed so
that the Karush-Kuhn-Tucker (KKT) conditions are sufficient (under certain constraint
qualifications) [5, 61]. In general, however, pseudoconvex functions fail with respect
to extendability: the non-negative weighted sum of pseudoconvex functions may not
be pseudoconvex. Therefore, the pseudoconvex
theory had its own limitations. There has been continuous effort to relax the convexity
criterion, yet preserve the tractability and the extendability characteristics. Many other
ideas to extend the concept of tractability can be seen in the literature [11, 51]. One
practically successful extension of convexity is invexity [6]. Hanson [41] proposed the
characteristics of functions whose every local minimum is a global minimum, and
Craven [23] subsequently named such functions invex functions.
Definition 3. Let f : S 7→ R be a differentiable function, where S is a nonempty subset
of Rn. The function f is said to be invex, if and only if:
f (x2) ≥ f (x1) + η(x1, x2)T∇f (x1) ∀ x1, x2 ∈ S (A–1)
where η : S × S 7→ Rn is some arbitrary vector function.
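The correntropic loss itself illustrates the distinction: for a fixed kernel width it is not convex, yet a well-known characterization (for differentiable functions, invexity is equivalent to every stationary point being a global minimizer) shows it is invex, since its only stationary point is the global minimum at zero error. A numerical sketch, where the grid resolution and tolerances are assumptions:

```python
import numpy as np

def c_loss(e, sigma=1.0):
    # correntropic loss with fixed kernel width sigma
    return 1.0 - np.exp(-e**2 / (2.0 * sigma**2))

def c_grad(e, sigma=1.0):
    # derivative of c_loss with respect to e
    return (e / sigma**2) * np.exp(-e**2 / (2.0 * sigma**2))

e = np.linspace(-3.0, 3.0, 601)                 # grid containing e = 0
second = np.gradient(np.gradient(c_loss(e), e), e)

not_convex = bool(np.any(second < -1e-4))       # curvature < 0 for |e| > sigma
stationary = e[np.abs(c_grad(e)) < 1e-8]        # on this grid: only e = 0
```

The negative curvature away from the origin rules out convexity, while the single stationary point (the global minimum) is exactly what invexity permits.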
Invex functions not only provide a criterion of tractability, but also provide a criterion
of extendability. That is, a local minimum of an invex function over a convex domain will
be a global minimum, and there exists a criterion under which the non-negative sum of
invex functions will be invex. It may be argued that invexity comes at a price: unlike
pseudoconvex functions, a sub-level set of an invex function may not be convex.
However, invex functions preserve both tractability and extendability, and it
is due to invexity that a huge class of functions can now be analyzed with respect to
the optimality conditions. Therefore, invexity is one of the weakest properties in convex
analysis that extends the theory of optimization in concluding the global optimality of a
feasible solution.
Differentiability based definitions are used here because the correntropic loss
function is differentiable. There are other definitions and properties of the
above stated functions, and readers are directed to [5, 11] for a comprehensive list of
definitions and properties.
Table A-1. Generalized convexity (⋆ under constraint qualification)
Function Type   Tractability   Optimality Conditions   Strong Duality   Extendability
Convex          True           Sufficient⋆             Exists           Always
Pseudoconvex    True           Sufficient⋆             Exists           No known criteria
Invex           True           Sufficient⋆             Exists           Criterion exists
REFERENCES
[1] Aharon, M., Elad, M., & Bruckstein, A. (2006). On the uniqueness of overcomplete dictionaries, and a practical way to retrieve them. Linear Algebra and its Applications, 416(1), 48–67.
[2] Alizamir, S., Rebennack, S., & Pardalos, P. (2008). Improving the neighborhood selection strategy in simulated annealing using the optimal stopping problem. Simulated Annealing, C. M. Tan (Ed.), (pp. 363–382).
[3] Anthony, M., & Bartlett, P. (2009). Neural Network Learning: Theoretical Foundations. Cambridge University Press.
[4] Antonov, G., & Katkovnik, V. (1972). Generalization of the concept of statistical gradient. Avtomat. i Vycisl. Tehn. (Riga), 4, 25–30.
[5] Bazaraa, M., Sherali, H., & Shetty, C. (2006). Nonlinear Programming: Theory and Algorithms. Wiley-Interscience.
[6] Ben-Israel, A., & Mond, B. (1986). What is invexity? J. Austral. Math. Soc. Ser. B, 28(1), 1–9.
[7] Bereanu, B. (1972). Quasi-convexity, strictly quasi-convexity and pseudo-convexity of composite objective functions. ESAIM: Mathematical Modelling and Numerical Analysis - Modelisation Mathematique et Analyse Numerique, 6(R1), 15–26.
[8] Bertsekas, D. (2003). Convex Analysis and Optimization. Athena Scientific, Belmont.
[9] Boser, B., Guyon, I., & Vapnik, V. (1992). A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, (pp. 144–152). ACM.
[10] Bradley, P., & Mangasarian, O. (2000). k-plane clustering. Journal of Global Optimization, 16(1), 23–32.
[11] Cambini, A., & Martein, L. (2008). Generalized Convexity and Optimization: Theory and Applications, vol. 616. Springer.
[12] Capel, D. (2005). An effective bail-out test for RANSAC consensus scoring. In Proc. BMVC, (pp. 629–638).
[13] Catoni, O. (1996). Metropolis, simulated annealing, and iterated energy transformation algorithms: theory and experiments. Journal of Complexity, 12(4), 595–623.
[14] Chan, T.-H., Ma, W.-K., Chi, C.-Y., & Wang, Y. (2008). A convex analysis framework for blind separation of non-negative sources. Signal Processing, IEEE Transactions on, 56(10), 5120–5134.
[15] Chen, B., & Principe, J. (2012). Maximum correntropy estimation is a smoothed MAP estimation. Signal Processing Letters, IEEE, 19(8), 491–494.
[16] Chum, O., & Matas, J. (2002). Randomized RANSAC with Td,d test. In Proc. British Machine Vision Conference, vol. 2, (pp. 448–457).
[17] Chum, O., & Matas, J. (2005). Matching with PROSAC - progressive sample consensus. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1, (pp. 220–226). IEEE.
[18] Chum, O., & Matas, J. (2008). Optimal randomized RANSAC. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 30(8), 1472–1482.
[19] Chum, O., Matas, J., & Kittler, J. (2003). Locally optimized RANSAC. In Pattern Recognition, (pp. 236–243). Springer.
[20] Cichocki, A., & Amari, S. (2002). Blind Signal and Image Processing. Wiley Online Library.
[21] Cichocki, A., Zdunek, R., & Amari, S. (2006). New algorithms for non-negative matrix factorization in applications to blind source separation. In Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on, vol. 5, (pp. V–V). IEEE.
[22] Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
[23] Craven, B. (1981). Duality for generalized convex fractional programs. Generalized Concavity in Optimization and Economics, (pp. 437–489).
[24] Daubechies, I., Roussos, E., Takerkart, S., Benharrosh, M., Golden, C., D'Ardenne, K., Richter, W., Cohen, J., & Haxby, J. (2009). Independent component analysis for brain fMRI does not select for independence. Proceedings of the National Academy of Sciences, 106(26), 10415–10422.
[25] Eddington, S. (1914). Stellar Movements and the Structure of the Universe. Macmillan and Company, Limited.
[26] Erdogmus, D., Principe, J., & Hild II, K. E. (2002). Beyond second-order statistics for learning: A pairwise interaction model for entropy estimation. Natural Computing, 1(1), 85–108.
[27] Fan, R., Chen, P., & Lin, C. (2005). Working set selection using second order information for training support vector machines. The Journal of Machine Learning Research, 6, 1889–1918.
[28] Fischler, M., & Bolles, R. (1980). Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Tech. rep., DTIC Document.
[29] Fischler, M., & Bolles, R. (1981). Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6), 381–395.
[30] Fisher, R., et al. (1920). A mathematical examination of the methods of determining the accuracy of an observation by the mean error, and by the mean square error. Monthly Notices of the Royal Astronomical Society, 80, 758–770.
[31] Geary, R. (1947). Testing for normality. Biometrika, 34(3/4), 209–242.
[32] Georgiev, P., Pardalos, P., & Theis, F. (2007). A bilinear algorithm for sparse representations. Computational Optimization and Applications, 38(2), 249–259.
[33] Georgiev, P., & Theis, F. (2004). Blind source separation of linear mixtures with singular matrices. Independent Component Analysis and Blind Signal Separation, (pp. 121–128).
[34] Georgiev, P., Theis, F., & Cichocki, A. (2005). Sparse component analysis and blind source separation of underdetermined mixtures. Neural Networks, IEEE Transactions on, 16(4), 992–996.
[35] Georgiev, P., Theis, F., Cichocki, A., & Bakardjian, H. (2007). Sparse component analysis: a new tool for data mining. Data Mining in Biomedicine, (pp. 91–116).
[36] Georgiev, P., Theis, F., & Ralescu, A. (2007). Identifiability conditions and subspace clustering in sparse BSS. Independent Component Analysis and Signal Separation, (pp. 357–364).
[37] Gribonval, R., & Schnass, K. (2010). Dictionary identification - sparse matrix-factorization via l1-minimization. Information Theory, IEEE Transactions on, 56(7), 3523–3539.
[38] Gunn, S. (1998). Support vector machines for classification and regression. ISIS Technical Report, 14.
[39] Hampel, F. (1973). Robust estimation: A condensed partial survey. Probability Theory and Related Fields, 27(2), 87–104.
[40] Hampel, F., Ronchetti, E., Rousseeuw, P., & Stahel, W. (2011). Robust Statistics: The Approach Based on Influence Functions, vol. 114. Wiley.
[41] Hanson, M. (1981). On sufficiency of the Kuhn-Tucker conditions. Journal of Mathematical Analysis and Applications, 80(2), 545–550.
[42] He, R., Zheng, W.-S., & Hu, B.-G. (2011). Maximum correntropy criterion for robust face recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(8), 1561–1576.
[43] He, R., Zheng, W.-S., Hu, B.-G., & Kong, X.-W. (2011). A regularized correntropy framework for robust pattern recognition. Neural Computation, 23(8), 2074–2100.
[44] Heisele, B., Ho, P., & Poggio, T. (2001). Face recognition with support vector machines: Global versus component-based approach. In Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, vol. 2, (pp. 688–694). IEEE.
[45] Herault, J., Jutten, C., & Ans, B. (1985). Detection de grandeurs primitives dans un message composite par une architecture de calcul neuromimetique en apprentissage non supervise. In 10 Colloque sur le traitement du signal et des images, FRA, 1985. GRETSI, Groupe d'Etudes du Traitement du Signal et des Images.
[46] Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366.
[47] Huber, P. (1981). Robust statistics.
[48] Huber, P. (1997). Robust Statistical Procedures, vol. 27. SIAM.
[49] Huber, P. (2012). Data Analysis: What Can Be Learned from the Past 50 Years, vol. 874. Wiley.
[50] Hyvarinen, A., & Oja, E. (2000). Independent component analysis: algorithms and applications. Neural Networks, 13(4), 411–430.
[51] Khanh, P. (1995). Invex-convexlike functions and duality. Journal of Optimization Theory and Applications, 87(1), 141–165.
[52] Kim, K., Jung, K., Park, S., & Kim, H. (2002). Support vector machines for texture classification. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 24(11), 1542–1550.
[53] Kirkpatrick, S., Gelatt, C., & Vecchi, M. (1983). Optimization by simulated annealing. Science, 220(4598), 671.
[54] Kreutz-Delgado, K., Murray, J., Rao, B., Engan, K., Lee, T., & Sejnowski, T. (2003). Dictionary learning algorithms for sparse representation. Neural Computation, 15(2), 349–396.
[55] Liu, W., Pokharel, P., & Principe, J. (2006). Error entropy, correntropy and M-estimation. In Machine Learning for Signal Processing, 2006. Proceedings of the 2006 16th IEEE Signal Processing Society Workshop on, (pp. 179–184). IEEE.
[56] Liu, W., Pokharel, P., & Principe, J. (2007). Correntropy: properties and applications in non-Gaussian signal processing. Signal Processing, IEEE Transactions on, 55(11), 5286–5298.
[57] Lundy, M., & Mees, A. (1986). Convergence of an annealing algorithm. Mathematical Programming, 34(1), 111–124.
[58] Makeig, S., Jung, T.-P., Ghahremani, D., Bell, A., & Sejnowski, T. (1996). What (not where) are the sources of the EEG? In The 18th Annual Meeting of The Cognitive Science Society.
[59] Mangasarian, O. (1965). Pseudo-convex functions. Journal of the Society for Industrial & Applied Mathematics, Series A: Control, 3(2), 281–290.
[60] Mangasarian, O. (1968). Convexity, pseudo-convexity and quasi-convexity of composite functions.
[61] Mangasarian, O. (1994). Nonlinear programming. Society for Industrial and Applied Mathematics, Philadelphia, PA.
[62] McCulloch, W., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biology, 5(4), 115–133.
[63] Mehrotra, K., Mohan, C., & Ranka, S. (1997). Elements of artificial neural networks. The MIT Press.
[64] Michalewicz, Z., & Fogel, D. (2004). How to solve it: modern heuristics. Springer-Verlag New York Inc.
[65] Michie, D., Spiegelhalter, D., & Taylor, C. (Eds.) (1994). Machine learning, neural and statistical classification. Ellis Horwood Series in Artificial Intelligence. New York, NY: Ellis Horwood.
[66] Minsky, M., & Seymour, P. (1988). Perceptrons. In Neurocomputing: foundations of research, (pp. 157–169). MIT Press.
[67] Naanaa, W., & Nuzillard, J. (2005). Blind source separation of positive and partially correlated data. Signal Processing, 85(9), 1711–1722.
[68] Nister, D. (2005). Preemptive RANSAC for live structure and motion estimation. Machine Vision and Applications, 16(5), 321–329.
[69] Pardalos, P., Boginski, V., & Vazacopoulos, A. (2007). Data mining in biomedicine. Springer Verlag.
[70] Pardalos, P., Pitsoulis, L., Mavridou, T., & Resende, M. (1995). Parallel search for combinatorial optimization: genetic algorithms, simulated annealing, tabu search and GRASP. Parallel Algorithms for Irregularly Structured Problems, (pp. 317–331).
[71] Parzen, E. (1962). On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3), 1065–1076.
[72] Principe, J. (2010). Information theoretic learning: Renyi's entropy and kernel perspectives. Springer Verlag.
[73] Principe, J., Xu, D., & Fisher, J. (2000). Information theoretic learning. Unsupervised Adaptive Filtering, 1, 265–319.
[74] Raguram, R., Frahm, J.-M., & Pollefeys, M. (2008). A comparative analysis of RANSAC techniques leading to adaptive real-time random sample consensus. In Computer Vision–ECCV 2008, (pp. 500–513). Springer.
[75] Reeves, C. (1993). Modern heuristic techniques for combinatorial problems. John Wiley & Sons, Inc.
[76] Renyi, A. (1965). On the foundations of information theory. Revue de l'Institut International de Statistique, (pp. 1–14).
[77] Robbins, H., & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, (pp. 400–407).
[78] Rockafellar, R. (1997). Convex analysis, vol. 28. Princeton University Press.
[79] Rockafellar, R., Uryasev, S., & Zabarankin, M. (2008). Risk tuning with generalized linear regression. Mathematics of Operations Research, 33(3), 712–729.
[80] Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386.
[81] Rubinov, A., & Ugon, J. (2003). Skeletons of finite sets of points. Submitted paper.
[82] Rubinstein, R. (1983). Smoothed functionals in stochastic optimization. Mathematics of Operations Research, (pp. 26–33).
[83] Santamaria, I., Pokharel, P., & Principe, J. (2006). Generalized correlation function: Definition, properties, and application to blind equalization. Signal Processing, IEEE Transactions on, 54(6), 2187–2197.
[84] Scholkopf, B., Burges, C., & Vapnik, V. (1995). Extracting support data for a given task. In Proceedings, First International Conference on Knowledge Discovery & Data Mining, (pp. 252–257). AAAI Press, Menlo Park, CA.
[85] Shannon, C. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423, 623–656.
[86] Singh, A., & Principe, J. (2010). A loss function for classification based on a robust similarity metric. In Neural Networks (IJCNN), The 2010 International Joint Conference on, (pp. 1–6). IEEE.
[87] Styblinski, M., & Tang, T. (1990). Experiments in nonconvex optimization: stochastic approximation with function smoothing and simulated annealing. Neural Networks, 3(4), 467–483.
[88] Sun, Y., & Xin, J. (2012). Nonnegative sparse blind source separation for NMR spectroscopy by data clustering, model reduction, and l1 minimization. SIAM Journal on Imaging Sciences, 5(3), 886–911.
[89] Syed, M., Georgiev, P., & Pardalos, P. (2012). A hierarchical approach for sparse source blind signal separation problem. Computers & Operations Research, available online.
[90] Syed, M., Georgiev, P., & Pardalos, P. (2013). Blind signal separation methods in computational neuroscience. In Neuromethods. Springer, to appear.
[91] Syed, M., & Pardalos, P. (2013). Neural network models in combinatorial optimization. In Handbook of Combinatorial Optimization. Springer, to appear.
[92] Syed, M., Pardalos, P., & Principe, J. (2013). On the optimization of the correntropic loss function in data analysis. Optimization Letters, available online.
[93] Syed, M., Principe, J., & Pardalos, P. (2012). Correntropy in data classification. In Dynamics of Information Systems: Mathematical Foundations, (pp. 81–117). Springer.
[94] Te-Won, L. (1998). Independent component analysis, theory and applications. Boston: Kluwer Academic Publishers.
[95] Tong, S., & Koller, D. (2002). Support vector machine active learning with applications to text classification. The Journal of Machine Learning Research, 2, 45–66.
[96] Tordoff, B., & Murray, D. (2002). Guided sampling and consensus for motion estimation. In Computer Vision–ECCV 2002, (pp. 82–96). Springer.
[97] Tukey, J. (1960). A survey of sampling from contaminated distributions. Contributions to Probability and Statistics, 2, 448–485.
[98] Tukey, J. (1962). The future of data analysis. The Annals of Mathematical Statistics, 33(1), 1–67.
[99] Vapnik, V. (1999). An overview of statistical learning theory. Neural Networks, IEEE Transactions on, 10(5), 988–999.
[100] Vapnik, V. (2000). The nature of statistical learning theory. Springer Verlag.
[101] Vapnik, V., Golowich, S., & Smola, A. (1996). Support vector method for function approximation, regression estimation, and signal processing. In Advances in Neural Information Processing Systems 9.
[102] Weston, J., & Watkins, C. (1998). Multi-class support vector machines. Technical Report CSD-TR-98-04, Department of Computer Science, Royal Holloway, University of London.
[103] Yang, Z., Xiang, Y., Rong, Y., & Xie, S. (2013). Projection-pursuit-based method for blind separation of nonnegative sources. Neural Networks and Learning Systems, IEEE Transactions on, 24(1), 47–57.
[104] Zhang, J., Xanthopoulos, P., Chien, J., Tomaino, V., & Pardalos, P. (2011). Minimum prediction error models and causal relations between multiple time series. Wiley Encyclopedia of Operations Research and Management Science, J. J. Cochran (ed.), 3, 1843–1850.
BIOGRAPHICAL SKETCH
Naqeebuddin Mujahid Syed received a Bachelor of Engineering (BE) in
Mechanical Engineering from Muffakham Jah College of Engineering and Technology
(MJCET), Osmania University (OU), Hyderabad, India, in 2005. He was awarded
two Gold Medals in BE (Mechanical Engineering), from MJCET as well as from OU.
He received a Master of Science (MS) in Systems Engineering (SE) from King Fahd
University of Petroleum and Minerals (KFUPM), Dhahran, Saudi Arabia, in 2007. He
received the Outstanding Academic Performance award for the academic year
2006–07 from the College of Computer Science & Engineering (CCSE) at KFUPM.
From 2007 to 2009, he served as a Lecturer-B in the SE Dept. at KFUPM. He received
a Doctor of Philosophy (PhD) in Operations Research from the Industrial and Systems
Engineering (ISE) Department at the University of Florida (UFL). During his PhD, he
was awarded the Outstanding International Student award at UFL for the years
2009, 2011, and 2012. In addition, he received the Graduate Student
Teaching award from the ISE Dept. at UFL.