optimization based robust methods in data...
TRANSCRIPT
OPTIMIZATION BASEDROBUST METHODS IN DATA ANALYSIS
WITH APPLICATIONS TO BIOMEDICINE AND ENGINEERING
By
NAQEEBUDDIN MUJAHID SYED
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOLOF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OFDOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2013
Dedicated tomy beloved mother,
memories of my father,and all of my dearest siblings,
who taught me to believe in myself...
3
ACKNOWLEDGMENTS
All praise is due to Allah (S.W.T) for His kindest blessings on me and all the
members of my family. I feel privileged to glorify His name in sincerest way through
this small accomplishment. I ask for His blessings, mercy and forgiveness all the time. I
sincerely ask Him to accept this meager effort as an act of worship. May the peace and
blessings of Allah (S.W.T) be upon His dearest prophet, Muhammad (S.A.W).
I would like to express my profound gratitude and appreciation to my advisor Prof.
Panos M. Pardalos, for his consistent help, guidance and attention that he devoted
throughout the course of this work. He is always kind, understanding and sympathetic
to me. His valuable suggestions and useful discussions made this work interesting to
me. I am also very grateful to Prof. Jose C. Principe for his immense help and insightful
discussions on the topics presented in the thesis. Sincere thanks go to my thesis
committee members Dr. Joseph Geunes, Dr. Jean-Philippe P. Richard for their interest,
cooperation and constructive advice. I would also like to thank Dr. Pando Georgiev for
hours of friendly discussion and constructive advice. Special thanks to Dr. Ilias Kotsireas
and Dr. James C. Sackellares for their valuable discussions.
I would like to thank the University of Florida and the Industrial and Systems
Engineering Dept. for providing me an opportunity to pursue PhD under the esteemed
program. I would like to thank all the staff members at the ISE Dept., my Weil 401
friends, and the staff at the international center for there friendly guidance, and warm
support throughout my study at UFL. Special thanks to Br. Ammar for making my stay in
Gainesville memorable.
Last but not least, I humbly offer my sincere thanks to my mother for her incessant
inspiration, blessings and prayers, and to my father for his indelible memories filled
with love and care. I owe a lot to my brothers S.N. Jaweed and S.N. Majeed, and my
sisters Nasreen, Shaheen, Tahseen, Yasmeen and Afreen for their unrequited support,
encouragement, blessings and prayers.
4
TABLE OF CONTENTS
page
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
CHAPTER
1 ROBUST METHODS IN DATA ANALYSIS . . . . . . . . . . . . . . . . . . . . . 12
1.1 Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141.2 Motivation and Significance . . . . . . . . . . . . . . . . . . . . . . . . . . 181.3 Robust Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191.4 Scope and Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2 ROBUST MEASURES AND ALGORITHMS . . . . . . . . . . . . . . . . . . . . 22
2.1 Traditional Robust Measures . . . . . . . . . . . . . . . . . . . . . . . . . 222.2 Proposed Entropy Based Measures . . . . . . . . . . . . . . . . . . . . . 242.3 Minimization of Correntropy Cost . . . . . . . . . . . . . . . . . . . . . . . 252.4 Minimization of Error Entropy . . . . . . . . . . . . . . . . . . . . . . . . . 382.5 Minimization of Error Entropy with Fiducial Points . . . . . . . . . . . . . . 412.6 Traditional Robust Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 442.7 Proposed Robust Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 462.8 Discussion on the Robust Methods . . . . . . . . . . . . . . . . . . . . . . 47
3 ROBUST DATA CLASSIFICATION . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 493.2 Traditional Classification Methods . . . . . . . . . . . . . . . . . . . . . . . 663.3 Proposed Classification Methods . . . . . . . . . . . . . . . . . . . . . . . 69
4 ROBUST SIGNAL SEPARATION . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.1 Signal Separation Problem . . . . . . . . . . . . . . . . . . . . . . . . . . 774.2 Traditional Sparsity Based Methods . . . . . . . . . . . . . . . . . . . . . 804.3 Proposed Sparsity Based Methods . . . . . . . . . . . . . . . . . . . . . . 86
5 SIMULATIONS AND RESULTS . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.1 Cauchy and Skew Normal Data . . . . . . . . . . . . . . . . . . . . . . . . 1065.2 Real World Binary Classification Data . . . . . . . . . . . . . . . . . . . . 1075.3 Comparison Among ANN Based Methods . . . . . . . . . . . . . . . . . . 1085.4 ANN and SVM Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5
5.5 Linear Mixing EEG-ECoG Data . . . . . . . . . . . . . . . . . . . . . . . . 1105.6 fMRI Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1135.7 MRI Scans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1155.8 Finger Prints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1165.9 Zip Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1175.10 Ghost Effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1185.11 Hyperplane Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1195.12 Robust Source Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6 SUMMARY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
6.1 Criticism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1516.2 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
APPENDIX: GENERALIZED CONVEXITY . . . . . . . . . . . . . . . . . . . . . . . 155
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
6
LIST OF TABLES
Table page
3-1 Binary classification proposed methods . . . . . . . . . . . . . . . . . . . . . . 72
5-1 Binary classification case study 1 . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5-2 Cauchy data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5-3 Skew data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5-4 Binary classification case study 2 . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5-5 Sample based performance of ANN on PID data . . . . . . . . . . . . . . . . . 123
5-6 Block based performance of ANN on PID data . . . . . . . . . . . . . . . . . . 124
5-7 Sample based performance of ANN on BLD data . . . . . . . . . . . . . . . . . 125
5-8 Block based performance of ANN on BLD data . . . . . . . . . . . . . . . . . . 126
5-9 Sample based performance of ANN on WBC data . . . . . . . . . . . . . . . . 127
5-10 Block based performance of ANN on WBC data . . . . . . . . . . . . . . . . . . 128
5-11 Performance of ACS for different values of σ and number of PEs in hiddenlayer on PID data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5-12 Performance of ACS for different values of σ and number of PEs in hiddenlayer on BLD data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5-13 Performance of ACS for different values of σ and number of PEs in hiddenlayer on WBC data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5-14 Linear mixing assumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5-15 Average unmixing error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5-16 Standard deviation unmixing error . . . . . . . . . . . . . . . . . . . . . . . . . 130
5-17 Simulation-1 results for case study 2 . . . . . . . . . . . . . . . . . . . . . . . . 130
5-18 Simulation-2 results for case study 2 . . . . . . . . . . . . . . . . . . . . . . . . 130
5-19 Performance of correntropy minimization algorithm . . . . . . . . . . . . . . . . 130
A-1 Generalized convexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
7
LIST OF FIGURES
Figure page
3-1 Correntropic, quadratic and 0-1 loss functions . . . . . . . . . . . . . . . . . . . 73
3-2 Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4-1 Cocktail party problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4-2 BSS setup for human brain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4-3 Overview of different approaches to solve the BSS problem . . . . . . . . . . . 99
4-4 Original example source S ∈ R3×80 . . . . . . . . . . . . . . . . . . . . . . . . 103
4-5 Mixed data X ∈ R2×80 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4-6 Processed data X ∈ R2×80 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4-7 Algorithm 4.2 description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5-1 Global view of Cauchy data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
5-2 Local view Cauchy data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
5-3 Skew normal data with noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
5-4 Performance of SVM on PID data . . . . . . . . . . . . . . . . . . . . . . . . . . 134
5-5 Performance of SVM on BLD data . . . . . . . . . . . . . . . . . . . . . . . . . 135
5-6 Performance of SVM on WBC data . . . . . . . . . . . . . . . . . . . . . . . . . 136
5-7 EEG recordings from monkey . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
5-8 ECoG recordings from monkey . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
5-9 fMRI data visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
5-10 Convex hull PPC1 assumption . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
5-11 Mixing and unmxing of MRI scans . . . . . . . . . . . . . . . . . . . . . . . . . 140
5-12 Mixing and unmxing of finger prints . . . . . . . . . . . . . . . . . . . . . . . . . 141
5-13 Mixing and unmxing of zip codes . . . . . . . . . . . . . . . . . . . . . . . . . . 142
5-14 Mixing and unmxing of ghost effect . . . . . . . . . . . . . . . . . . . . . . . . . 143
5-15 Original sparse source (normalized) for case study 1 . . . . . . . . . . . . . . . 144
5-16 Given mixtures of sources for case study 1 . . . . . . . . . . . . . . . . . . . . 145
8
5-17 Original mixing matrix for case study 1 . . . . . . . . . . . . . . . . . . . . . . . 146
5-18 Mixing matrices for case study 1 . . . . . . . . . . . . . . . . . . . . . . . . . . 146
5-19 Recovered mixing matrix for case study 1 . . . . . . . . . . . . . . . . . . . . . 146
5-20 Recovered source (normalized) for case study 1 . . . . . . . . . . . . . . . . . 147
5-21 Data for source extraction method . . . . . . . . . . . . . . . . . . . . . . . . . 148
5-22 Recovery of sources by quadratic and correntropy loss . . . . . . . . . . . . . . 149
9
Abstract of Dissertation Presented to the Graduate Schoolof the University of Florida in Partial Fulfillment of theRequirements for the Degree of Doctor of Philosophy
OPTIMIZATION BASEDROBUST METHODS IN DATA ANALYSIS
WITH APPLICATIONS TO BIOMEDICINE AND ENGINEERING
By
Naqeebuddin Mujahid Syed
August 2013
Chair: Panos M. PardalosMajor: Industrial and Systems Engineering
Analysis of a complex system as a whole, and the limitations of traditional statistical
analysis led towards the search of robust methods in data analysis. In the current
information age, data driven modeling and analysis forms a core research element of
many scientific research disciplines. One of the primary concerns in the data analysis is
the treatment of data points which do not show the true behavior of the system (outliers).
The aim of this dissertation is to develop optimization based methods for data analysis
that are insensitive and/or resistant towards the outliers. Generally, such methods
are termed as robust methods. In this dissertation, our approach will be different
from the conventional uncertainty based robust optimization approaches. The goal
is to develop robust methods that include robust algorithms and/or robust measures.
Specifically, applicability of an information theoretic learning measure based on entropy
called correntropy is highlighted. Some crucial theoretical results on the optimization
properties of correntropy and related measures are proved. Optimization algorithms
for correntropy are developed for both parametric and non-parametric frameworks. A
second order triggered algorithm is developed, which minimizes the correntropic cost
on a parametric framework. For the case of non-parametric framework, the usage
of convolution smoothing and simulated annealing based algorithms is proposed.
Furthermore, a modified Randomized Sampling Consensus (RANSAC) based robust
10
algorithm is also proposed. The performance of the proposed approaches is illustrated
by case studies on the data related to biomedical and engineering areas, with the
objective of binary classification and signal separation.
11
CHAPTER 1ROBUST METHODS IN DATA ANALYSIS
Understanding the underlying mechanism of a real world system is the basic goal
of many scientific research disciplines. Typical questions related to the system like “How
does it work?” or “What will happen if this or that is changed in the system?” are to be
answered for successful progress of the scientific research. This particular research
element has been revolutionized by the methods of experimentation and statistical
analysis. In fact, prior to statistical analysis and experimentation, deductive logic was
typically used in understanding of the system, which had tremendous limitations.
The concept of hypothesis testing can be solely attributed to statistical analysis and
experimentation. Nowadays, obtaining information from data is one of the prevalent
research areas of science and engineering. However, as the curiosity to study the real
world complex systems as a whole increased over the time, traditional statistical analysis
methods had proven to be inefficient.
Statistical methods dictated the theory of analyzing the data, until Tukey [98]
revolutionized the ideology of analyzing the experimental data. He differentiated
the term “data analysis” from “statistical analysis” by stating that the former can be
considered as science, but the later is subjective upon the statistician’s approach (i.e.,
either mathematics or science, but not both). Supporting Tukey’s ideology, Huber [49]
encouraged the usage of term data analysis, as the other term is often interpreted in an
overly narrow sense (restricted to mathematics and probability). Thus, the seminal work
of Tukey [97, 98] enlarged the scope of data analysis from mere statistical inference to
something more.
In simple terms, the key idea of data analysis approach is to propose some
analytical or mathematical model that represents the underlying mechanism of the
system under consideration. The proposed model can be specific (parametric) to the
system or can be a general (nonparametric) model. Both parametric and nonparametric
12
models have some parameters to tune. The parameters are tuned based on the
observed data collected from the system (experimentation). The process of identifying
the model parameter is called as parameter estimation.
The basic idea involved in model parameter estimation is to estimate the model
parameters by minimizing the error between the estimated output from the model
and the desired response. The error by definition is merely a difference between the
output and the response. The error measure (or the worth of an error value) plays a
very crucial role in the estimation of the model parameters. Typically, when the error is
assumed to be Gaussian, the Mean Square Error (MSE) criterion is equivalent to the
minimization of error energy (or variance). It is well know that, under Gaussianity
assumption, MSE leads to the best possible parameter estimation (a maximum
likelihood solution). However, parameter estimation issues related to nonlinear and
non-gaussian error call for costs other than MSE [15].
Higher order statistics can be used to deal with non-gaussian errors. Statistically,
MSE minimizes the second order moment of the error. In order to transfer as much
information as possible from the system’s responses to the model parameters, second
and higher order moments (kurtosis, or cumulants) of error should be minimized.
However, the most important drawback of using higher order statistics is that they are
very sensitive to outliers.
The objective of Chapter 1 is to introduce the topic of the dissertation. Section 1.1
presents an simplistic introduction to the notion of data analysis. Section 1.2 highlights
the significance of the robust data analysis by presenting a motivating anecdote from
the literature, and highlights its relevance to the practical engineering and biomedical
scenarios. In Section 1.3, the overview of the robust data analysis ideology is presented,
and the specific approaches that will be implemented and developed in this work will be
clearly stated. The scope and objective of this work is presented in Section 1.4.
13
1.1. Data Analysis
Data analysis is an interdisciplinary field, including statistics, database design,
machine learning and optimization. It can be defined in simple terms as “the process of
extracting knowledge from a raw data set by any means”. Approaches in data analysis
vary depending upon the type of data, the objective of the analysis, the availability of
computational time and resources, and the familiarity (or inclination) of the researcher
towards a specific approach. Thus, there are plethora of data analysis methods,
including parametric and non-parametric framework with exact and heuristic algorithms.
However, a data analysis approach can be schematically specified based on some
prominent elements of data analysis. In general, the elements of data analysis can be
structured into the following six sequential steps:
Objective. The first and most important step in data analysis is the objective of the
analysis. It should be well defined and clear in nature. Based on the objective, the later
steps are customized. Typically, the objectives may involve one or more than one (or a
combination) of the following major criteria:
• Regression: Literally the term ‘Regression’ means a return to formal or primitivestate. Statistical regression involves the idea of finding an underlying primitiverelationship between the causal variables and the effect variables. Moreover, theunstated basic assumption in statistical regression is that all the data belongs to asingle class.
• Classification: Literally ‘Classification’ means a process of classifying somethingbased on shared characteristics. Statistical classification is a supervised learningmethod that involves classifying uncategorized data based on the knowledge ofcategorized data. The class label for the categorized data is known. Whereas, theclass label for the uncategorized data is unknown. The unstated basic assumptionin statistical classification is that an uncategorized data point should be assigned toexactly one of the class labels.
• Clustering: Literally ‘Clustering’ means congregating things together based ontheir particular characteristics. Statistical clustering is an unsupervised learningmethod which aims to cluster data based on defined nearness measure. It involvesmultiple classes, and for each class an underlying relationship is to be found.Ideally, there is no prior knowledge available about the data classes. However,
14
some of the clustering methods assume that the information of the total number ofdata classes are known a priori.
Data Representation. Data is nothing but stored and/or known facts. Data comes
in different forms and representations. It can represent a qualitative or quantitative
fact (in the form of numbers, text, patterns or categories). Based on the objective of
data analysis, a suitable data representation should be selected. A generalized way
to represent data is in the form of an n × p matrix, also known as ‘flat representation’.
Typically, the rows (records, individual, entities, cases or objects) represent data points,
and for each data point a column (attribute, feature, variable or field) represents a
measurement value. However, depending upon the context, the interpretation of rows
and columns may interchange.
Knowledge Representation. The extracted knowledge can be represented in
the form of relationships (between inputs and outputs) and/or summaries (novel ways
to represent same data). The way of representing the relationships (or summaries)
depends upon the field of research, and the final audience (i.e., it should be novel, but
more importantly understandable to the reader). The relationships or summaries (often
referred to as models or patterns) can be represented but not limited to following forms:
Linear equations, Graphs, Trees, Rules, Clusters, Recurrent patterns, Neural networks,
etcetera. Typically, the type of representation for relationships/summaries should be
selected before analyzing the data.
Loss Function. The loss function is a measure function that accounts for the error
between the predicted output and actual output. It is also known as penalty function or
cost function. The selection or design of loss function depends upon two main criteria.
Firstly, it should appropriately reflect the error between the predicted output and actual
output. Secondly, the loss function should be easily incorporable inside an optimization
algorithm. In addition to that, given an instance of predicted output and actual output,
the loss function should give the error value in polynomial time. The longer it takes to
15
calculate the error, the lesser is the efficiency of the optimization algorithm. There are
two main classical loss functions, namely: absolute error, mean square error. Typically,
the mean square error (commonly known as quadratic loss function) is used often as a
loss function.
Optimization Algorithm. The knowledge representation, selected a priori, is
trained (using an optimization algorithm) on the data set to minimize the loss function.
Thus, this assures that the represented knowledge aptly imitates the real system
(the source or generator of the data set). Such training algorithms, also known as
learning algorithms, are based on some optimization methods. Classically, a parametric
representation is encouraged, and is accompanied by an exact optimization method.
Although, a parametric representation requires in depth knowledge of the given data
set, parametric methods were given superiority over non-parametric methods due to
the existence of efficient exact optimization methods. Moreover, exact solution methods
are suitable for a limited class of parametric representations, thus they limit the scope of
knowledge representation. Recent developments in the use of non-parametric methods
like artificial neural networks have widened the scope of knowledge representation.
However, due to the use of exact methods, they have not been utilized to their full
potential. Lately, due to the development in heuristic optimization methods, the use of
non-parametric methods have become desirable and enlarged the scope of knowledge
representation.
Validation. This is typically the last step in the data analysis. The key purpose
of this step is to justify the output (estimated parameters) obtained from the earlier
steps. Experts on the problem specific domain are consulted to verify and validate the
results. However, expert opinion may not always be available. Hence, cross validations
methods are developed. There are several cross validation methods that are based
on the concept of training and testing. The idea is to divide the given data set into two
subgroups called training and testing sets. Data analysis is conducted on the training
16
data set, and the model’s performance is calibrated using the testing data set. Generally,
the size of training set is greater than testing set. Next, three most common methods of
cross validation are described:
• k-fold Cross Validation (kCV): In this method, the dataset is partitioned in k equallysized groups of samples (folds). In every cross validation iteration, k-1 folds areused for the training and 1 fold is used for the testing. In the literature, usually ktakes a value from 1, ... , 10.
• Leave One Out Cross Validation (LOOCV): In this method, each sample representsone fold. Particularly, this method is used when the number of samples are small,or when the goal of classification is to detect outliers (samples with particularproperties that do not resemble the other samples of their class).
• Repeated Random Sub-sampling Cross Validation (RRSCV): In this method, thedataset is partitioned into two random sets, namely training set and validation (ortesting) set. In every cross validation, the training set is used to train the model,and the testing (or validation) set to test the accuracy of the model. This method ispreferred if there are large number of samples in the data. The advantage of thismethod (over k-fold cross validation) is that the proportion of the training set andnumber of iterations are independent. However, the main drawback of this methodis if few cross validations are performed, then some observations may never beselected in the training phase (or the testing phase). Whereas others may beselected more than once in the training phase (or the testing phase respectively).To overcome this difficulty, the model is cross validated sufficiently large number oftimes, so that each sample is selected at least once for training as well as testingthe model. These multiple cross validations also exhibit Monte-Carlo variation(since the training and testing sets are chosen randomly).
Among the above stated steps, knowledge representation, loss function and
the optimization algorithm form the crux of the data analysis. Traditional approaches
of data analysis were based on statistical principles, and were termed as statistical
analysis. A typical assumption in the traditional approaches includes the availability
of the knowledge of the data distribution or ability to perfectly learn the distribution
from the infinite length data. Thus, either the data is assumed to be perfect, or the
filter methods are developed to remove the noise from the data before conducting the
statistical analysis. However, filter methods are based on assumptions, and require data
17
specific knowledge. Therefore, the statistical analysis performs well theoretically but has
limitations for most of the practical scenarios.
1.2. Motivation and Significance
From traditional statistical analysis to the contemporary data analysis, one of the
key analysis elements that has remained unchanged is the optimization based approach
in extracting knowledge from the data. The efficiency of optimization methods are in turn
dependent upon the type of the objective function and the feasible space. Furthermore,
the solution quality (local or global best) of data analysis methods also depends upon
the objective function and the feasible space. Existence of outliers (or noise) often taint
the solution space. Hence, practical data analysis calls for methods in data analysis that
are insensitive or resistant to the outliers.
Determining similarity between data samples using an appropriate measure has
been the key issue in the analysis of experimental data. The importance of robust
methods in data analysis can be traced back to the old famous dispute between Fisher
and Eddington. Based on practical observations, Eddington [25] proposed the suitability
of the absolute error as an appropriate measure. Fisher [30] countered the idea of
Eddington by theoretically showing that under “ideal circumstances” (errors are normally
distributed, and outliers free data) the mean square error is better than the absolute
error. The dispute between Eddington and Fisher actually played a prominent role in
shaping the theory of statistical analysis. After Fisher’s illustration, many researchers
incorporated mean square error as a default similarity measure in their analysis. Tukey
[97] reasoned that occurrence of the ideal circumstances for practical scenarios is very
rare. Huber [48] further showed that noise as less as 0.2%, which is ideal for many
practical data, will favor the usage of absolute error instead of mean square error.
Although Tukey’s paper highlighted the importance of robust measures like the absolute
error, the prevalence of mean square error in data analysis can be solely attributed
to its convex, continuous and differentiable nature. There have been explicit studies
18
[40, 47, 48] on the research and development of robust measures, under the preamble
of robust statistics.
The traditional statistical analysis methods were strictly dependent upon theoretical
assumptions like,
• ideal circumstances: Errors are normally distributed.• distributional assumptions: Distribution of data can be learned (or available).• sensitivity assumptions: Small deviations in distribution result in minor changes.• smoothing assumptions: Effect of few outliers gets faded out w.r.t bulk data.
Tukey [97] suggested that in the practical scenarios, the assumptions are hardly true
and barely verifiable. In fact, the assumptions are more or less assumed to be true for
mathematical convenience. The assumptions were justified by vague stability principles
that minor changes should result in small error in the ultimate conclusion. On the
contrary, Huber [47] states that the assumptions do not always hold, and traditional
methods based on the distributional assumptions are very sensitive to minor changes. In
fact, Geary [31] (cited by Tukey [98] and Hampel [39]) stated that “Normality is a myth;
there never was, and never will be, a normal distribution”. Thus, robust procedures are a
crucial requirement of the contemporary data analysis methods. These ideologies led to
the development of “robust methods” in data analysis.
1.3. Robust Methods
A robust method in data analysis can be defined as “the method of extracting
knowledge from the bulk of the given data, simultaneously neglecting the knowledge
from the outliers present in the given data”. The major approaches of robust methods in
data analysis can be divided into following categories:
Relaxing Distributional Assumptions. The approach here is to develop data
analysis methods based on geometric (or structural) assumptions rather than the
distributional assumptions. This approach is followed in the hope of reducing the
sensitivity of methods with respect to the practical scenarios. Furthermore, the
19
geometrical assumptions on data can be easily verified, unlike the distributional
assumptions.
Incorporating Distributional Assumptions. Obviously relaxing all the distributional
assumptions in a data analysis method is the most appropriate case for practical
data. However, the distributional assumptions cannot be discarded in most of the
scenarios, mainly due to the loss of mathematical convenience in the analysis approach.
Thus, most of the research in robust methods is based on incorporating ideas into
the traditional methods that will result in insensitivity to the conventional theoretical
assumptions. The approaches can be categorized as usage of:
• Robust Measure: A measure which is insensitive to outliers is used as a loss function.
• Robust Algorithm: Subsamples from the given data sample are analyzed separately, and the information from the subsample analyses is utilized to construct the model.
• Robust Optimization: An uncertainty domain is considered around each data sample, and stochastic optimization based algorithms are used to conduct the analysis.
It is to be noted that incorporating robustness is a practical approach, and it is a
critical current requirement of data analysis methods. However, robustness often results
in a loss of convexity and/or smoothness in the optimization problem related to the
data analysis. Furthermore, the computational efficiency of robust methods is generally
lower than that of non-robust methods. It is out of the scope of this dissertation to
discuss all aspects of robust methods. Therefore, before proceeding further to develop
the theme of robust methods, the scope and objective of this dissertation are presented
in Section 1.4.
1.4. Scope and Objective
The objective of this dissertation is to develop novel optimization based robust
methods in data analysis problems. As described in Section 1.3, the term “robust
methods” has been used in different connotations based on the intention and area of
the application. In this work, robust methods mean incorporation of robust algorithms,
and/or usage of robust measures in data analysis problems. In the case of robust
measures, the focus is on the applicability of entropy based robust measures, like
correntropy, in data analysis. In this work, generalized convexity based results are
presented for the entropy based measures. In addition to that, the performance of the
robust measure in binary classification using a non-parametric framework is illustrated.
On the other hand, a robust algorithm for signal separation problem is also proposed.
Specifically, a linear mixing model for the signal separation problem is considered.
Robust algorithms are developed to extract the dictionary information from the given
mixture data. Furthermore, an entropy based method is proposed to extract the sources
from the mixture data.
Robust methods are applicable to practical data analysis scenarios, which typically
involve noisy data. From the literature [39], it can be assumed as a rule of thumb
(not an exception) that data from biomedical and engineering systems contain 5%
to 10% outliers. Moreover, even if no outliers are present in the data, the solution
quality obtained from robust methods is typically competitive with that of non-robust
methods. However, the main drawback of robust methods is that they are computationally
expensive. Nevertheless, our aim is to analyze the optimization properties of robust
measures and propose selection strategies for robust algorithms that may be used
to improve the computational and optimization efficiency. In Chapter 2, the issues
related to robust methods that are relevant to this dissertation are addressed.
Interested readers are directed to references [40, 47], which present a general
discussion of robust methods.
CHAPTER 2
ROBUST MEASURES AND ALGORITHMS
In Chapter 2, the theory of robust methods is presented. The proposed approaches
include the concepts of robust measures and robust algorithms. The ideas related to
robust optimization are relatively new when compared to traditional robust measures
and algorithms. However, robust optimization based methods, which are essentially
uncertainty based optimization methods, have been rigorously applied in the area of
data analysis due to the efficient methods developed by the stochastic optimization
community. On the other hand, the notion of robust measures and algorithms can
be traced back to the times of Eddington and Fisher. However, elegant methods to
incorporate the concepts of robust measures and algorithms in a practical framework
have always been an open research area. The crux of this work is to show the
applicability of a new robust measure, developed from the theory of Renyi’s entropy,
in problems related to data analysis.
Chapter 2 is structured as follows. Section 2.1 presents a brief summary of
the traditional robust measures. The concept of entropy based robust measures is
presented in Section 2.2. Sections 2.3, 2.4 and 2.5 prove the generalized convexity
based optimization properties of the three entropy based robust measures. Furthermore,
Section 2.6 presents a brief introduction to the traditional robust algorithms. Section 2.7
presents the proposed robust algorithm. Finally, Section 2.8 concludes Chapter 2 by
presenting a brief discussion on the proposed methods.
2.1. Traditional Robust Measures
Consider a univariate data set containing N samples. One of the traditional ways
to collect information from the samples is to calculate its mean and variance. Now,
assume that one outlier has been appended to the existing data set. Obviously, the
mean and variance will change significantly. However, the median of the data will not
change much. In fact, the median gives the true information about the data until there
are about 50% outliers in the data. Thus, the median is considered a more robust
measure than the mean, and, in some sense, the median is the most robust measure of
location. An improvement of the traditional mean calculation is the α-trimmed mean, where
0 < α < 1/2. The key idea in the α-trimmed mean is to remove up to αN points from the
sample before calculating the mean.

(Some sections of Chapter 2 have been published in Optimization Letters.)
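The contrast between the mean, the median, and the α-trimmed mean can be illustrated with a short sketch. The data values here are hypothetical, and this variant trims ⌊αN⌋ points from each tail, which is one common convention:

```python
import statistics

def alpha_trimmed_mean(data, alpha):
    """Mean after discarding the floor(alpha*N) smallest and largest
    points (0 < alpha < 1/2); one common variant of the alpha-trimmed mean."""
    n = len(data)
    k = int(alpha * n)                  # points trimmed from each tail
    trimmed = sorted(data)[k:n - k] if k > 0 else list(data)
    return sum(trimmed) / len(trimmed)

data = [9.8, 10.1, 10.0, 9.9, 10.2]
with_outlier = data + [1000.0]          # one gross outlier appended

# The mean shifts drastically, while the median and the trimmed mean barely move.
print(statistics.mean(with_outlier))                 # 175.0
print(statistics.median(with_outlier))               # 10.05
print(alpha_trimmed_mean(with_outlier, alpha=0.2))   # 10.05
```

A single extreme point thus moves the mean by an arbitrary amount, while the median and trimmed mean stay near the bulk of the data.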
Using the above ideology, many robust estimates of data have been proposed.
Generally, these estimators can be classified into three main categories: L-estimators,
M-estimators, and S-estimators. Among the three, this work considers M-estimators.
In simple terms, M-estimators are the minima of a measure constructed as a
summation of functions of the data points. Huber was a pioneer in proposing a
class of robust M-estimators. The Huber class of functions can be defined by a family
of functions ψ_θ(x), where θ is a parameter and x is an error value. For estimates of
location, ψ_θ(x) = ψ(x − θ), and the base model of the Huber measure can be
represented as:

ψ(x) = x,             if |x| ≤ k_1,
ψ(x) = k_2 sign(x),   otherwise,                                         (2–1)
where 0 < k_1, k_2 < ∞. When k_1 = c > 0 and k_2 = 0, the function ψ(x) corresponds
to metric trimming. When k_1 = k_2 = c > 0, the function ψ(x) corresponds to metric
winsorizing. Tukey proposed another class of robust measures, called the biweight
measure:

ψ(x) = x [1 − (x/k_1)²]₊²,                                               (2–2)
where [a]₊ represents the positive part of a, and k_1 is a parameter. Furthermore, Hampel
proposed a robust measure based on piecewise linear functions, which is defined as:

ψ(x) = |x| sign(x),                          0 < |x| ≤ k_1,
ψ(x) = k_1 sign(x),                          k_1 < |x| ≤ k_2,
ψ(x) = k_1 (k_3 − |x|)/(k_3 − k_2) sign(x),  k_2 < |x| ≤ k_3,
ψ(x) = 0,                                    k_3 < |x| < ∞.               (2–3)
Based on the ψ function, the M-estimator can be defined through ρ(x) = ∫ ψ(x) dF(x).
Huber’s, Tukey’s, and Hampel’s measures are the traditional robust measures in data
analysis. Although robust measures have several advantages, the above measures
have a few critical drawbacks:
• The measures are scale variant.
• There are no standard rules of parameter selection, i.e., how to select values for k_1, k_2, ....
• The measures are nonsmooth, i.e., they are discontinuous.
• The measures are nonconvex.
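For concreteness, the three traditional ψ functions in Equations 2–1 through 2–3 can be sketched as follows. This is a minimal illustration; the function names and the example parameter values in the tests are ours:

```python
import math

def psi_huber(x, k1, k2):
    """Base Huber-type psi (Eq. 2-1): identity inside [-k1, k1],
    clipped to k2 * sign(x) outside."""
    return x if abs(x) <= k1 else k2 * math.copysign(1.0, x)

def psi_biweight(x, k1):
    """Tukey's biweight psi (Eq. 2-2): redescends smoothly to 0 for |x| > k1."""
    t = 1.0 - (x / k1) ** 2
    return x * max(t, 0.0) ** 2

def psi_hampel(x, k1, k2, k3):
    """Hampel's three-part redescending psi (Eq. 2-3)."""
    a, s = abs(x), math.copysign(1.0, x)
    if a <= k1:
        return x
    if a <= k2:
        return k1 * s
    if a <= k3:
        return k1 * (k3 - a) / (k3 - k2) * s
    return 0.0
```

Note how the biweight and Hampel ψ functions return exactly 0 for large |x|: gross outliers exert no influence at all, which is the redescending behaviour that makes these measures robust but nonconvex.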
Typically, it is desirable to use a scale invariant measure. Suitable preprocessing
methods, like normalization, can be used to manage the scale variance of the robust
measures. However, parameter selection is a critical issue, and there are no specific
rules for selecting the parameters. Moreover, the nonsmooth nature of the functions
creates difficulties in developing solution algorithms. Finally, the presence of nonconvex
functions increases the complexity of optimization algorithms. Section 2.2 proposes
another class of robust estimators that are smooth, and invex in general, which may
overcome some of the above listed drawbacks.
2.2. Proposed Entropy Based Measures
Entropy is another criterion that can be used in parameter estimation; it bypasses
explicit higher order moment expansions. Shannon [85] defined entropy as the
average unpredictability (equivalently, the information content) of a probability distribution.
Shannon’s entropy, a measure of the uncertainty of a probability distribution, quantifies
the expected value of the information contained in a system. Later, Renyi [76] generalized
the notion of entropy in a way that includes Shannon’s definition. When combined
with a non-parametric estimator like Parzen’s estimator [71], Renyi’s entropy provides
a mechanism to estimate entropy directly from the responses. Using the concept of
non-parametric Renyi’s entropy, the notion of Minimization of Error Entropy (MEE) [26]
was founded, which is a central concept in the field of information theoretic learning [72, 73].
Another important property of entropy based measures is that they encompass
higher order moments. Therefore, minimizing an entropy based error measure indirectly
takes higher order statistics into account. Traditional higher order statistics are typically
very sensitive measures; entropy based measures, by contrast, are insensitive to
outliers. Thus, entropy based measures are useful for nonlinear, nongaussian systems.
Sections 2.3, 2.4 and 2.5 present novel properties of three entropy based robust
measures.
2.3. Minimization of Correntropy Cost
Correntropy (strictly speaking, cross-correntropy) is a generalized similarity
measure between any two arbitrary random variables (y, a), defined as [83]:

ν(y, a) = E_{y,a}[k(y − a, σ)],                                           (2–4)

where k is a kernel function with parameter σ (in this work it is taken to be the
Gaussian kernel). For the sake of simplicity, consider a binary classification scenario.
Let x = a − y represent the error, where a, y, and x ∈ R are the actual label, the
predicted label, and the error, respectively. The correntropic loss function is defined as:

F_C(x, σ) = β(1 − ν(x))  or  F_C(x, σ) = β(1 − E_x[k(x, σ)]),             (2–5)

where β = [1 − e^{−1/(2σ²)}]^{−1}. Typically, the probability distribution function of x is
unknown, and only n observations {x_i}_{i=1}^n are available. Using the information from
the n observations,
the empirical correntropic loss function can be defined as:

F_C(x, σ) = β(1 − (1/n) Σ_{i=1}^n k(x_i, σ)),                             (2–6)

where x = [x_1, ..., x_n]^T is the array of sample errors and k(x, σ) = e^{−x²/(2σ²)}. A
practical approach to minimizing the function in Equation 2–6 is to treat σ as a parameter:
multiple iterations for different values of the parameter are executed to obtain the optimal
solution.
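The empirical loss in Equation 2–6 is straightforward to evaluate. The sketch below (the function name and the toy error vectors are ours) also shows the saturation that gives the measure its robustness: a gross outlier’s kernel value is already near zero, so making the outlier far more extreme barely changes the loss.

```python
import math

def correntropic_loss(errors, sigma):
    """Empirical correntropic loss (Eq. 2-6):
    beta * (1 - (1/n) * sum_i exp(-x_i^2 / (2*sigma^2)))."""
    beta = 1.0 / (1.0 - math.exp(-1.0 / (2.0 * sigma ** 2)))
    n = len(errors)
    avg_kernel = sum(math.exp(-x ** 2 / (2.0 * sigma ** 2)) for x in errors) / n
    return beta * (1.0 - avg_kernel)

mild    = [0.1, -0.2, 0.05, 5.0]    # last entry is an outlier
extreme = [0.1, -0.2, 0.05, 500.0]  # a far more extreme outlier

# Both losses are essentially identical: the outlier's influence saturates.
print(abs(correntropic_loss(mild, 1.0) - correntropic_loss(extreme, 1.0)) < 1e-3)   # True
```

Under a quadratic loss, by contrast, the second case would be four orders of magnitude larger than the first.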
Parameter Based Correntropic Function
The parameterized correntropic loss function is defined as:

F_{Cσ}(x) = β_σ (1 − (1/n) Σ_{i=1}^n k_σ(x_i)),                           (2–7)

where β_σ = [1 − e^{−1/(2σ²)}]^{−1} and k_σ(x) = e^{−x²/(2σ²)}. Let H_C^σ(x) denote the
Hessian of the function defined in Equation 2–7, given as the diagonal matrix:

H_C^σ(x) = diag( σ̄(x_1)(σ² − x_1²)/σ², σ̄(x_2)(σ² − x_2²)/σ², ..., σ̄(x_n)(σ² − x_n²)/σ² ),   (2–8)

where σ̄(x) = (β_σ/σ²) e^{−x²/(2σ²)}. From Equation 2–8, it can be seen that if |σ| > |x_i| for
i = 1, ..., n, then the correntropic function is convex. Under the ideal circumstances
assumed by Fisher, choosing |σ| > |x_i| for i = 1, ..., n is appropriate. In the practical
case, however, σ should be selected such that |σ| < |x_i| when the i-th sample is an outlier,
and |σ| > |x_i| when it is not. This winnowing of outliers by the kernel width σ is the source
of the robustness of the correntropic loss function. However, the robustness is achieved in
the correntropic loss function at the cost of losing convexity.
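This convexity condition can be checked numerically from the diagonal entries of the Hessian in Equation 2–8. A small sketch, in which the function name and the sample values are ours:

```python
import math

def hessian_diagonal(errors, sigma):
    """Diagonal entries of the Hessian in Eq. 2-8:
    h_i = sbar(x_i) * (sigma^2 - x_i^2) / sigma^2,
    where sbar(x) = (beta_sigma / sigma^2) * exp(-x^2 / (2*sigma^2))."""
    beta = 1.0 / (1.0 - math.exp(-1.0 / (2.0 * sigma ** 2)))
    def sbar(x):
        return beta / sigma ** 2 * math.exp(-x ** 2 / (2.0 * sigma ** 2))
    return [sbar(x) * (sigma ** 2 - x ** 2) / sigma ** 2 for x in errors]

# |sigma| > |x_i| for every sample: all entries positive -> locally convex.
print(all(h > 0 for h in hessian_diagonal([0.3, -0.5, 0.2], sigma=1.0)))   # True
# An outlier with |x_i| > sigma flips its entry negative -> convexity is lost.
print(any(h < 0 for h in hessian_diagonal([0.3, -0.5, 4.0], sigma=1.0)))   # True
```

The sign flip at |x_i| = σ is exactly the winnowing mechanism described above: samples outside the kernel width contribute concave curvature and are effectively discounted.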
The above analysis highlights a subtle yet crucial issue, i.e., the trade-off among
three desired properties: convexity, robustness and smoothness. Conventionally,
the best strategy is to select any two of the three properties in a similarity measure.
For instance, most traditional practitioners select convexity and robustness (like
the absolute loss function), or convexity and smoothness (like the quadratic loss
function). Correntropy opens a door in the direction where robustness and smoothness
are guaranteed. But without convexity, optimizing a general nonlinear function is
a challenging task. Fortunately, for the correntropic loss function, we show that the
function is pseudoconvex in one dimension, and invex in multiple dimensions. When the
data cannot be normalized (i.e., when different features require different kernel widths),
the generalized correntropic loss function is defined as:

F_{Cσ}(x) = Σ_{i=1}^n (β_{σ_i}/n)(1 − k_{σ_i}(x_i)),                      (2–9)

where β_{σ_i} = [1 − e^{−1/(2σ_i²)}]^{−1} and k_{σ_i}(x) = e^{−x²/(2σ_i²)}. In the remainder of
Chapter 2, the total correntropic loss, instead of the average loss, is considered, i.e., the
correntropic loss function is defined as:

F_{Cσ}(x) = Σ_{i=1}^n β_{σ_i}(1 − k_{σ_i}(x_i)).                          (2–10)
Generalized Convexity of Correntropic Function
Although for large values of the parameter σ (relative to the error magnitudes)
the correntropy based measure is convex, it is of practical importance to study
the properties of the correntropy function for any value of σ > 0. Specifically, in this work
it is claimed that the correntropy function is pseudoconvex or invex, depending upon
the sample dimension. Let us first consider the simplest case, where the error from a single
sample is considered one at a time. This case is called the single sample case.
Single Sample Case: Let x be the sample error. The correntropy loss function, with
respect to one sample, can be defined as:

F_C^σ(x) = [1 − e^{−1/(2σ²)}]^{−1} [1 − e^{−x²/(2σ²)}]  ∀ x ∈ R.          (2–11)

The pseudoconvexity of this loss function is claimed under the following conditions.
Theorem 2.1. Let β_σ = [1 − e^{−1/(2σ²)}]^{−1} and S = {x ∈ R : x² < M, 0 < M ≪ ∞}. If
x ∈ S and F_C^σ : S → R, then the function F_C^σ, defined as:

F_C^σ(x) = β_σ [1 − e^{−x²/(2σ²)}]  ∀ x ∈ R,                              (2–12)

is pseudoconvex for any finite σ > 0.
Proof: Let x_1, x_2 ∈ R. Consider the following:

∇F_C^σ(x_1)(x_2 − x_1) = (β_σ/σ²) e^{−x_1²/(2σ²)} x_1 (x_2 − x_1)
                       = σ̄(x_1) x_1 (x_2 − x_1),

where σ̄(x) = (β_σ/σ²) e^{−x²/(2σ²)} and σ̄(x) > 0 ∀ σ, x, since β_σ > 0, σ is nonzero and
finite, and x² < M.

Now, if ∇F_C^σ(x_1)(x_2 − x_1) ≥ 0, then

σ̄(x_1) x_1 (x_2 − x_1) ≥ 0
⇒ x_1 (x_2 − x_1) ≥ 0.                                                    (2–14a)

Next, consider the following cases:

• Case 1: if x_1 ≥ 0, then Equation 2–14a reduces to x_2 ≥ x_1, i.e., x_2 ≥ x_1 ≥ 0,
  which implies F_C^σ(x_2) ≥ F_C^σ(x_1).                                  (2–15a)

• Case 2: if x_1 < 0, then Equation 2–14a reduces to x_2 ≤ x_1, i.e., x_2 ≤ x_1 < 0,
  which implies F_C^σ(x_2) ≥ F_C^σ(x_1).                                  (2–16a)

From Equations 2–15a & 2–16a, the following statement holds:

if ∇F_C^σ(x_1)(x_2 − x_1) ≥ 0, then F_C^σ(x_2) ≥ F_C^σ(x_1) for a given σ.   (2–17)

From Equation 2–17, it follows that F_C^σ is pseudoconvex for a given parameter σ. □

Remark 2.1. If there exists x⋆ ∈ R such that ∇F_C^σ(x⋆) = 0, then x⋆ is the global optimal
solution of F_C^σ.
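The implication in Equation 2–17 can be spot-checked numerically. A minimal sketch, in which the kernel width and the sampling range are arbitrary choices of ours:

```python
import math
import random

def F(x, sigma):
    """Single-sample correntropic loss (Eq. 2-12)."""
    beta = 1.0 / (1.0 - math.exp(-1.0 / (2.0 * sigma ** 2)))
    return beta * (1.0 - math.exp(-x ** 2 / (2.0 * sigma ** 2)))

def dF(x, sigma):
    """Derivative of F: (beta/sigma^2) * exp(-x^2/(2*sigma^2)) * x."""
    beta = 1.0 / (1.0 - math.exp(-1.0 / (2.0 * sigma ** 2)))
    return beta / sigma ** 2 * math.exp(-x ** 2 / (2.0 * sigma ** 2)) * x

# Pseudoconvexity (Eq. 2-17): whenever dF(x1)*(x2 - x1) >= 0, F(x2) >= F(x1).
random.seed(0)
sigma = 0.7
violations = 0
for _ in range(10_000):
    x1, x2 = random.uniform(-5, 5), random.uniform(-5, 5)
    if dF(x1, sigma) * (x2 - x1) >= 0 and F(x2, sigma) < F(x1, sigma) - 1e-12:
        violations += 1
print(violations)   # 0
```

No violations are found even for σ well below the error magnitudes, i.e., in the regime where the function is not convex.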
n-sample Case: Let x_i be the i-th sample error. The correntropic loss function for
n samples (the cumulative error of n samples) is given as:

F_C^σ(x) = Σ_{i=1}^n β_{σ_i} [1 − e^{−x_i²/(2σ_i²)}]  ∀ x ∈ S,            (2–18)

where β_{σ_i} = [1 − e^{−1/(2σ_i²)}]^{−1} and S = {x ∈ R^n : x_i² < M_i, 0 < M_i ≪ ∞ ∀ i = 1, ..., n}.
If σ_i = σ ∀ i, then the function is defined as:

F_C^σ(x) = β_σ [n − Σ_{i=1}^n e^{−x_i²/(2σ²)}]  ∀ x ∈ R^n,                (2–19)

which can be rewritten as:

F_C^σ(x) = Σ_{i=1}^n [β_σ − β_σ e^{−x_i²/(2σ²)}]  ∀ x ∈ S.                (2–20)

Let f^σ(x) = β_σ − β_σ e^{−x²/(2σ²)}. From Theorem 2.1, it can easily be shown
that f^σ(x) is pseudoconvex. Furthermore, it can be seen that F_C^σ(x) is a sum of n
pseudoconvex functions. But, unlike the corresponding property of convex functions,
the sum of pseudoconvex functions is not, in general, pseudoconvex. Thus, the
pseudoconvexity of the correntropic function for n samples does not follow directly from
Theorem 2.1.
Theorem 2.2. Let β_{σ_i} = [1 − e^{−1/(2σ_i²)}]^{−1} and S = {x ∈ R^n : x_i² < M_i,
0 < M_i ≪ ∞ ∀ i = 1, ..., n}. If x ∈ S and F_C^σ : S → R, then the function F_C^σ, defined as:

F_C^σ(x) = Σ_{i=1}^n β_{σ_i} [1 − e^{−x_i²/(2σ_i²)}]  ∀ x ∈ S,            (2–21)

is locally pseudoconvex for any finite σ_i > 0.
Proof: Let N_ε(x̄) = {y : ‖y − x̄‖ < δ, 0 < δ < ε ∧ ε → 0} represent the epsilon
neighborhood of x̄. Let x̄ ∈ S and x ∈ N_ε(x̄) ∩ S be any two points such that:

∇F_C^σ(x̄)^T (x − x̄) ≥ 0                                                  (2–22a)
Σ_{i=1}^n σ̄_i(x̄_i) x̄_i (x_i − x̄_i) ≥ 0
Σ_{i=1}^n σ̄_i(x̄_i) x̄_i d_i ≥ 0,                                          (2–22b)

where σ̄_i(x) = (β_{σ_i}/σ_i²) e^{−x²/(2σ_i²)} and d ∈ R^n is the direction such that
x = x̄ + λd, λ > 0.

The following relation is claimed to be true:

F_C^σ(x̄) ≤ F_C^σ(x).                                                     (2–23)

By contradiction, suppose F_C^σ(x̄) > F_C^σ(x); then:

Σ_{i=1}^n [f^{σ_i}(x_i) − f^{σ_i}(x̄_i)] < 0.                              (2–24)

Now

Σ_{i=1}^n [f^{σ_i}(x_i) − f^{σ_i}(x̄_i)]
  = Σ_{i=1}^n [−β_{σ_i} e^{−(x̄_i² + 2λx̄_i d_i + λ²d_i²)/(2σ_i²)} + β_{σ_i} e^{−x̄_i²/(2σ_i²)}]
  = Σ_{i=1}^n β_{σ_i} e^{−x̄_i²/(2σ_i²)} [1 − e^{−(2λx̄_i d_i + λ²d_i²)/(2σ_i²)}].   (2–25)

Equations 2–24 & 2–25 imply the following:

Σ_{i=1}^n β_{σ_i} e^{−x̄_i²/(2σ_i²)} [1 − e^{−(2λx̄_i d_i + λ²d_i²)/(2σ_i²)}] < 0.   (2–26)

Dividing both sides of Equation 2–26 by λ > 0, and taking the limit λ → 0, results in:

0 > lim_{λ→0} (1/λ) Σ_{i=1}^n β_{σ_i} e^{−x̄_i²/(2σ_i²)} [1 − e^{−(2λx̄_i d_i + λ²d_i²)/(2σ_i²)}]
  = Σ_{i=1}^n β_{σ_i} e^{−x̄_i²/(2σ_i²)} lim_{λ→0} [1 − e^{−(2λx̄_i d_i + λ²d_i²)/(2σ_i²)}] / λ
  = Σ_{i=1}^n β_{σ_i} e^{−x̄_i²/(2σ_i²)} x̄_i d_i / σ_i²
  = Σ_{i=1}^n σ̄_i(x̄_i) x̄_i d_i.                                           (2–27)

Equation 2–27 is a contradiction to the assumption made in Equation 2–22. This
proves that the claim stated in Equation 2–23 is true. Therefore, from Equations 2–22, 2–23
& 2–27 it is concluded that:

if ∇F_C^σ(x̄)^T (x − x̄) ≥ 0, then F_C^σ(x) ≥ F_C^σ(x̄)  ∀ x ∈ N_ε(x̄) ∩ S.   (2–28)

That is, from Equation 2–28, it can be stated that F_C^σ is locally pseudoconvex for
given parameters σ_i. □
Unfortunately, local pseudoconvexity does not guarantee global pseudoconvexity.
However, a gradient descent algorithm with a sufficiently small step size can be
designed so as to guarantee global convergence. Nevertheless, the following theorem
proves the invexity of the correntropic loss function.
Theorem 2.3. Let β_{σ_i} = [1 − e^{−1/(2σ_i²)}]^{−1} and S = {x ∈ R^n : x_i² < M_i,
0 < M_i ≪ ∞ ∀ i = 1, ..., n}. If x ∈ S and F_C^σ : S → R, then the function F_C^σ, defined as:

F_C^σ(x) = Σ_{i=1}^n β_{σ_i} [1 − e^{−x_i²/(2σ_i²)}]  ∀ x ∈ S,            (2–29)

is invex for any finite σ_i > 0.

Proof: Let x, x̄ ∈ S be any two points. Since x_i² < M_i and σ_i ≠ 0 ∀ i = 1, ..., n, there
exists M̄_i > 0 ∈ R such that x_i²/σ_i² ≤ M̄_i ∀ i = 1, ..., n. The gradient ∇F_C^σ(x) ∈ R^n
is defined as:

∇F_C^σ(x) = [(β_{σ_1}/σ_1²) e^{−x_1²/(2σ_1²)} x_1, ..., (β_{σ_n}/σ_n²) e^{−x_n²/(2σ_n²)} x_n]^T,   (2–30)

which implies

∇F_C^σ(x)^T ∇F_C^σ(x) = Σ_{i=1}^n (β_{σ_i}²/σ_i⁴) e^{−x_i²/σ_i²} x_i².    (2–31)

Since x ∈ S, it follows that ∇F_C^σ(x)^T ∇F_C^σ(x) = 0 only when ∇F_C^σ(x) = 0. Let us
define η(x, x̄) ∈ R^n as:

η(x, x̄) = 0,                                                         if ∇F_C^σ(x̄) = 0,
η(x, x̄) = [F_C^σ(x) − F_C^σ(x̄)] ∇F_C^σ(x̄) / (∇F_C^σ(x̄)^T ∇F_C^σ(x̄)),   otherwise.   (2–32)

From Equation 2–32, it follows that:

F_C^σ(x) − F_C^σ(x̄) ≥ η(x, x̄)^T ∇F_C^σ(x̄).                               (2–33)

From Equation 2–33 it follows that F_C^σ(x) is invex when x_i² < M_i ∀ i = 1, ..., n. □

Remark 2.2. If there exists x⋆ ∈ R^n such that ∇F_C^σ(x⋆) = 0, then x⋆ is the global optimal
solution of F_C^σ.
The kernel width plays a critical role in setting the level of convexity of the correntropic
loss function. Theorem 2.4 presents the condition under which the correntropy loss
function is pseudoconvex.
Theorem 2.4. Let β_{σ_i} = [1 − e^{−1/(2σ_i²)}]^{−1}, S = {x ∈ R^n : x_i² < M_i,
0 < M_i ≪ ∞ ∀ i = 1, ..., n}, and

c_k(σ) = ∏_{i=1}^k [σ̄_i(x_i)(σ_i² − x_i²)/σ_i²] · Σ_{i=1}^k [σ̄_i(x_i) x_i² σ_i² / (σ_i² − x_i²)],

where σ̄_i(x_i) = (β_{σ_i}/σ_i²) e^{−x_i²/(2σ_i²)}. If x ∈ S and F_C^σ : S → R, then the
function F_C^σ, defined as:

F_C^σ(x) = Σ_{i=1}^n β_{σ_i} [1 − e^{−x_i²/(2σ_i²)}]  ∀ x ∈ S,            (2–34)

is pseudoconvex when c_k(σ) > 0 ∀ k = 1, ..., n and each σ_i is nonzero and finite.
Proof: In Theorem 2.2, the local pseudoconvexity of F_C^σ was proved. Here it is
shown that, under certain conditions, F_C^σ is globally pseudoconvex. Let H_C^σ(x)
denote the Hessian of the function, given as the diagonal matrix:

H_C^σ(x) = diag( σ̄_1(x_1)(σ_1² − x_1²)/σ_1², ..., σ̄_n(x_n)(σ_n² − x_n²)/σ_n² ).   (2–35)

Consider the bordered Hessian matrix B(x) of F_C^σ, defined as:

B(x) = [ H_C^σ(x)      ∇F_C^σ(x) ]
       [ ∇F_C^σ(x)^T   0         ].                                       (2–36)

The determinant of B(x) is the (n+1)×(n+1) determinant whose leading diagonal entries
are σ̄_i(x_i)(σ_i² − x_i²)/σ_i², whose last row and column contain the gradient entries
σ̄_i(x_i) x_i, and whose corner entry is 0.                                (2–37)

Using typical row operations, the determinant can be rewritten as:

det B(x) = − ∏_{i=1}^n [σ̄_i(x_i)(σ_i² − x_i²)/σ_i²] · Σ_{i=1}^n [σ̄_i(x_i) x_i² σ_i² / (σ_i² − x_i²)].   (2–38)

Let det B_k(x) = − ∏_{i=1}^k [σ̄_i(x_i)(σ_i² − x_i²)/σ_i²] · Σ_{i=1}^k [σ̄_i(x_i) x_i² σ_i² / (σ_i² − x_i²)]
denote the corresponding k-th leading bordered minor. Since c_k(σ) > 0 ∀ k = 1, ..., n
and ∀ x ∈ S, det B_k(x) < 0 ∀ k = 1, ..., n, which implies the function is quasiconvex.
Furthermore, from Theorem 2.3 the function is invex. Thus, it can be concluded that the
function is pseudoconvex when c_k(σ) > 0 ∀ k = 1, ..., n. □
Additional Properties
In addition to the properties proved above, the following properties hold for the
correntropic function:

• Let σ_i = σ ∀ i = 1, ..., n, β_σ = [1 − e^{−1/(2σ²)}]^{−1} and S = {x ∈ R^n : x_i² < M_i,
  0 < M_i ≪ ∞ ∀ i = 1, ..., n}. If x ∈ S and F_C^σ : S → R, then the function F_C^σ
  defined as:

  F_C^σ(x) = Σ_{i=1}^n β_σ [1 − e^{−x_i²/(2σ²)}]  ∀ x ∈ S,                (2–39)

  is invex for any given nonzero finite value of the parameter σ.

• If there exists x⋆ ∈ R^n such that ∇F_C^σ(x⋆) = 0, then x⋆ is the global optimal solution
  of F_C^σ.

• Every local minimum of F_C^σ is a global minimum.

• F_C^σ(x) is symmetric, i.e., F_C^σ(−x) = F_C^σ(x).

• Let φ : R^r → R^n (r ≥ n) be a differentiable function. If ∇φ is of rank n, then F_C^σ ∘ φ
  is invex.
  Proof: Let us define η̄(y, x)^T = η(y, x)^T ∇φ(y)^{−1}. Since F_C^σ is invex, we have:

  F_C^σ(φ(x)) − F_C^σ(φ(y)) ≥ η(y, x)^T ∇F_C^σ(φ(y))                      (2–40)
  F_C^σ(φ(x)) − F_C^σ(φ(y)) ≥ η̄(y, x)^T ∇φ(y) ∇F_C^σ(φ(y))                (2–41)
  F_C^σ(φ(x)) − F_C^σ(φ(y)) ≥ η̄(y, x)^T ∇(F_C^σ ∘ φ)(y).                  (2–42)
  □

• If ψ : R → R is a monotone increasing differentiable convex function, then ψ ∘ F_C^σ is
  invex.
  Proof: Since ψ is convex, we have:

  ψ(F_C^σ(x)) ≥ ψ(F_C^σ(y)) + [F_C^σ(x) − F_C^σ(y)] ψ′(F_C^σ(y)).         (2–43)

  Furthermore, due to the invexity of F_C^σ, we have:

  F_C^σ(x) − F_C^σ(y) ≥ η(y, x)^T ∇F_C^σ(y).                              (2–44)

  Since ψ is monotone increasing:

  ψ′(x) > 0  ∀ x ∈ R.                                                     (2–45)

  Multiplying both sides of Equation 2–44 by ψ′(F_C^σ(y)) and substituting into
  Equation 2–43, the result follows. □

• If σ_i² > M_i ∀ i, then F_C^σ(x) is convex.
Some data analysis problems, like multi-class classification, are based on an error
vector, i.e., the error for a single sample is a vector in m dimensions. In order to avoid
confusion between the sample dimension and the error dimension, the error dimensions
are simply called dimensions.
m-dimensions, Single-sample Case: The correntropy loss function for m dimensions,
with respect to one sample, can be defined as:

G_C^σ(x) = [1 − e^{−1/(2σ²)}]^{−1} [1 − e^{−‖x‖²/(2σ²)}]  ∀ x ∈ R^m,      (2–46)

where ‖x‖ is the Euclidean norm. We claim that this loss function is pseudoconvex.
Theorem 2.5. If x ∈ R^m then the function G_C^σ : R^m → R, defined as:

G_C^σ(x) = [1 − e^{−1/(2σ²)}]^{−1} [1 − e^{−‖x‖²/(2σ²)}]  ∀ x ∈ R^m,      (2–47)

is pseudoconvex for finite σ > 0.

Proof: Let β_σ = [1 − e^{−1/(2σ²)}]^{−1}, with β_σ > 0 ∀ σ. The function can be rewritten as:

G_C^σ(x) = β_σ − β_σ e^{−‖x‖²/(2σ²)}  ∀ x ∈ R^m.                          (2–48)

Let x_1 and x_2 be two vectors such that:

G_C^σ(x_2) < G_C^σ(x_1).                                                  (2–49)

Then,

β_σ − β_σ e^{−‖x_2‖²/(2σ²)} < β_σ − β_σ e^{−‖x_1‖²/(2σ²)}                 (2–50a)
e^{−‖x_2‖²/(2σ²)} > e^{−‖x_1‖²/(2σ²)}                                     (2–50b)
−‖x_2‖²/(2σ²) > −‖x_1‖²/(2σ²)                                             (2–50c)
‖x_1‖ > ‖x_2‖.                                                            (2–50d)

Now, ∇G_C^σ(x) = σ̄(x) x, where σ̄(x) = (β_σ/σ²) e^{−‖x‖²/(2σ²)} and σ̄(x) > 0 ∀ σ, x.
Consider:

∇G_C^σ(x_1)^T (x_2 − x_1) = σ̄(x_1) x_1^T (x_2 − x_1)                      (2–51a)
                          = σ̄(x_1) (x_1^T x_2 − x_1^T x_1).                (2–51b)

Using the Cauchy–Bunyakovsky–Schwarz inequality, we have:

if ‖x_1‖ > ‖x_2‖, then x_1^T x_1 > x_1^T x_2.

Therefore, using the above inequality and Equations 2–50d & 2–51b, we have:

if G_C^σ(x_2) < G_C^σ(x_1), then ∇G_C^σ(x_1)^T (x_2 − x_1) < 0 for a given σ.   (2–52)

From Equation 2–52, it follows that G_C^σ is pseudoconvex for a given parameter σ. □
m-dimension, n-sample Case: The correntropy loss function for m dimensions,
with respect to n samples, can be defined as:

G_C^σ(X) = [1 − e^{−1/(2σ²)}]^{−1} [1 − Σ_{i=1}^n e^{−‖x_i‖²/(2σ²)}]  ∀ x_i ∈ R^m, ∀ i = 1, ..., n.   (2–53)

Let β_σ = [1 − e^{−1/(2σ²)}]^{−1} and σ̄(x) = (β_σ/σ²) e^{−‖x‖²/(2σ²)}, with σ̄(x), β_σ > 0
∀ σ, x. Let g_i^σ(X) = β_σ/n − β_σ e^{−(Σ_j x_{i,j}²)/(2σ²)}. The loss function can be
rewritten as:

G_C^σ(X) = Σ_{i=1}^n g_i^σ(X).                                            (2–54)

The gradient of G_C^σ(X) can be written as:

∇G_C^σ(X) = [σ̄(x_1)x_{1,1}, ..., σ̄(x_1)x_{1,m}, ..., σ̄(x_n)x_{n,1}, ..., σ̄(x_n)x_{n,m}]^T   (2–55)

and

∇G_C^σ(X̄)^T (X − X̄) = Σ_{i=1}^n Σ_{j=1}^m σ̄(x̄_i) · x̄_{i,j} · (x_{i,j} − x̄_{i,j}).   (2–56)

Theorem 2.6. Let β_{σ_i} = [1 − e^{−1/(2σ_i²)}]^{−1} and S = {X ∈ R^{n·m} : ‖x_i‖² < M_i,
0 < M_i ≪ ∞ ∀ i = 1, ..., n}. If X ∈ S and G_C^σ : S → R, then the function G_C^σ,
defined as:

G_C^σ(X) = Σ_{i=1}^n β_{σ_i} [1 − e^{−‖x_i‖²/(2σ_i²)}]  ∀ x_i ∈ R^m,      (2–57)

is invex for finite σ_i > 0.

Proof: Let X, X̄ ∈ R^{n·m} be any two points. Since ‖x_i‖² < M_i and σ_i ≠ 0 ∀ i = 1, ..., n,
there exists M̄_i > 0 ∈ R such that ‖x_i‖²/σ_i² ≤ M̄_i ∀ i. The gradient ∇G_C^σ(X) ∈ R^{n·m}
is defined as:

∇G_C^σ(X) = [σ̄_1(x_1)x_{1,1}, ..., σ̄_1(x_1)x_{1,m}, ..., σ̄_n(x_n)x_{n,1}, ..., σ̄_n(x_n)x_{n,m}]^T,   (2–58)

where σ̄_i(x) = (β_{σ_i}/σ_i²) e^{−‖x‖²/(2σ_i²)}, which implies

∇G_C^σ(X)^T ∇G_C^σ(X) = Σ_{i=1}^n σ̄_i(x_i)² ‖x_i‖².                       (2–59)

Since X ∈ S, it follows that ∇G_C^σ(X)^T ∇G_C^σ(X) = 0 only when ∇G_C^σ(X) = 0. Let us
define η(X, X̄) ∈ R^{n·m} as:

η(X, X̄) = 0,                                                          if ∇G_C^σ(X̄) = 0,
η(X, X̄) = [G_C^σ(X) − G_C^σ(X̄)] ∇G_C^σ(X̄) / (∇G_C^σ(X̄)^T ∇G_C^σ(X̄)),    otherwise.   (2–60)

From Equation 2–60, it follows that:

G_C^σ(X) − G_C^σ(X̄) ≥ η(X, X̄)^T ∇G_C^σ(X̄).                               (2–61)

From Equation 2–61 it follows that G_C^σ(X) is invex when X ∈ S. □
2.4. Minimization of Error Entropy
Let z_i be the error between the i-th measurement and the i-th desired value, defined as
z_i = x_i − y_i ∀ i = 1, ..., N. The Minimization of Error Entropy (MEE) problem can be
stated as the maximization of the Information Potential (IP) and can be defined as:

minimize:  −IP(z) = −(1/N²) Σ_{i=1}^N Σ_{j=1}^N k_σ(z_i − z_j),           (2–62)

where k_σ(ν) = e^{−ν²/(2σ²)} is the well known Gaussian kernel and σ is the kernel
parameter (for the sake of simplicity, the constant factor in the Gaussian kernel is
ignored).
Let e ∈ R^{N−1} be the vector of all ones, and let e_i ∈ R^{N−1} be the vector of all
zeros except a 1 at the i-th position. Construct the matrix B_k ∈ R^{(N−1)×N} whose
columns are:

B_k = [−e_1, ..., −e_{k−1}, +e, −e_k, ..., −e_{N−1}]  ∀ k = 1, ..., N.    (2–63)

Let A ∈ R^{N(N−1)×N} be defined as:

A = [B_1^T, ..., B_N^T]^T.                                                (2–64)

Now, the MEE problem can be restated as:

minimize:  −(1/N²) Σ_{k=1}^{N(N−1)} k_σ(a_{k•}^T z),                      (2–65)

where a_{k•} ∈ R^N represents the k-th row of the matrix A. Let S1 = {u ∈ R^{N(N−1)} :
u = Az, z ∈ S}, and define an affine function L : S ⊆ R^N → S1 ⊆ R^{N(N−1)} as:

L(z) = Az.                                                                (2–66)
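The construction in Equations 2–63 and 2–64 can be sketched directly in code. In this illustrative, dependency-free version (the function names and the exact row ordering convention are ours), the rows a_k• are enumerated block by block, so that Az lists every ordered pairwise difference z_k − z_j:

```python
def build_A(N):
    """Rows of A (Eqs. 2-63/2-64): for each k, the block B_k whose rows have
    +1 in column k and -1 in one of the remaining columns, so that the rows
    of A*z enumerate all ordered pairwise differences z_k - z_j, j != k."""
    A = []
    for k in range(N):
        for j in range(N):
            if j == k:
                continue
            row = [0] * N
            row[k] = 1
            row[j] = -1
            A.append(row)
    return A

def matvec(A, z):
    """Plain matrix-vector product for the list-of-lists matrix above."""
    return [sum(a * zi for a, zi in zip(row, z)) for row in A]

z = [3.0, 1.0, 4.0]
u = matvec(build_A(3), z)
# u holds every ordered difference z_k - z_j:
expected = [zk - zj for k, zk in enumerate(z)
            for j, zj in enumerate(z) if j != k]
print(u == expected)   # True
```

A has N(N−1) rows, so for N = 3 the vector u = Az has 6 entries, one per ordered pair of samples.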
Let G_C^σ(z) = −(1/N²) Σ_{k=1}^{N(N−1)} k_σ(a_{k•}^T z) ∀ z ∈ S. G_C^σ can be
represented as a composite function of F_L^σ and L, i.e.,

G_C^σ = F_L^σ ∘ L,                                                        (2–67)

where F_L^σ : S1 ⊆ R^{N(N−1)} → R is defined as:

F_L^σ(u) = −(1/N²) Σ_{i=1}^{N(N−1)} e^{−u_i²/(2σ²)}.                      (2–68)
Now, Equation 2–68 represents the correntropy loss function defined over the
projected space S1. Furthermore, Equation 2–67 shows that MEE is a composition of
the correntropy loss function with an affine function. This representation paves the
way to establish the generalized convexity results for MEE since, in general,
composition with an affine function preserves generalized convexity. Next, the
properties of the MEE function are presented.
Theorem 2.7. Let S = {z ∈ R^N : z_i² < M, 0 < M ≪ ∞ ∀ i = 1, ..., N}. If z ∈ S
and G_C^σ : S → R, then the function G_C^σ, defined as:

G_C^σ(z) = −(1/N²) Σ_{k=1}^{N(N−1)} k_σ(a_{k•}^T z)  ∀ z ∈ S,             (2–69)

is convex when σ ≥ 2√M.

Proof: Since u = Az, from the definitions of S and S1 it can be established that
u_i² < 4M ∀ i. Thus, when σ ≥ 2√M we have σ² ≥ 4M > u_i², and from Equation 2–8 it
follows that F_L^σ is convex. Therefore, from Equation 2–67 it follows that G_C^σ is
convex, since the composition of a convex function with an affine function is convex. □
Theorem 2.8. Let S = {z ∈ R^N : z_i² < M, 0 < M ≪ ∞ ∀ i = 1, ..., N}. If z ∈ S
and G_C^σ : S → R, then the function G_C^σ, defined as:

G_C^σ(z) = −(1/N²) Σ_{k=1}^{N(N−1)} k_σ(a_{k•}^T z)  ∀ z ∈ S,             (2–70)

is pseudoconvex when c_k(σ) > 0 ∀ k = 1, ..., N and σ is nonzero and finite, where

c_k(σ) = ∏_{i=1}^k [σ̄(u_i)(σ² − u_i²)/σ²] · Σ_{i=1}^k [σ̄(u_i) u_i² σ² / (σ² − u_i²)],

σ̄(u_i) = (1/σ²) e^{−u_i²/(2σ²)}, and u = Az.

Proof: From [60] and [7], it is concluded that pseudoconvexity is invariant under
composition with an affine function. Thus, using the results from [92] in Equation 2–67,
it follows that G_C^σ is pseudoconvex under the stated conditions on c_k(σ) and σ. □
Theorem 2.9. Let S = {z ∈ R^N : z_i² < M, 0 < M ≪ ∞ ∀ i = 1, ..., N}. If z ∈ S
and G_C^σ : S → R, then the function G_C^σ, defined as:

G_C^σ(z) = −(1/N²) Σ_{k=1}^{N(N−1)} k_σ(a_{k•}^T z)  ∀ z ∈ S,             (2–71)

is invex for finite σ > 0.

Proof: To the best of our knowledge, there is no general proof in the literature that
affirms the preservation of invexity under affine compositions. Therefore, an elementary
proof, which serves not only as a proof of the above theorem but for invex functions in
general, is presented. To prove: if F_L^σ is an invex function, then G_C^σ = F_L^σ ∘ L is
invex, where L is any affine transformation.

By contradiction, assume G_C^σ is not invex, i.e., the following is true for any arbitrary
η(z, w) : S × S → S:

G_C^σ(z) − G_C^σ(w) < η(z, w)^T ∇G_C^σ(w).                                (2–72)

Rewriting Equation 2–72:

F_L^σ(Az) − F_L^σ(Aw) < [Aη(z, w)]^T ∇F_L^σ(Aw).                          (2–73)

Let η̄(Az, Aw) = Aη(z, w), u = Az, and v = Aw. Equation 2–73 can be written as:

F_L^σ(u) − F_L^σ(v) < η̄(u, v)^T ∇F_L^σ(v).                                (2–74)

Since z, w, and η(z, w) are chosen arbitrarily, Equation 2–74 implies that F_L^σ is not
invex. This contradiction is a result of the assumption made in Equation 2–72. Thus, the
assumption that G_C^σ is not invex is false. □
2.5. Minimization of Error Entropy with Fiducial Points
Another important M-estimator, using the concept of a fiducial point (reference point),
is proposed in [55]. The goal of such a measure is to provide an anchor at zero error, i.e.,
to make most of the errors zero. This M-estimator is obtained by the Minimization of Error
Entropy with Fiducial points (MEEF). The MEEF problem can be defined as:

minimize:  −1/(N+1)² Σ_{i=0}^N Σ_{j=0}^N k_σ(z_i − z_j).                  (2–75)

The only modification in MEEF, compared to MEE, is the addition of a reference
point, z_0 = 0. Simplifying the above function by using the symmetry of the
Gaussian kernel, the MEEF problem can be written as:

minimize:  −1/(N+1)² Σ_{i=1}^N Σ_{j=1}^N k_σ(z_i − z_j) − 2/(N+1)² Σ_{j=0}^N k_σ(z_0 − z_j)   (2–76)

or

minimize:  −1/(N+1)² Σ_{i=1}^N Σ_{j=1}^N k_σ(z_i − z_j) − 2/(N+1)² Σ_{j=1}^N k_σ(z_j) − 2/(N+1)².   (2–77)

In general, by adding m fiducial points, the following MEEF function is obtained:

minimize:  −1/(N+1)² Σ_{i=1}^N Σ_{j=1}^N k_σ(z_i − z_j) − 2m/(N+1)² Σ_{j=1}^N k_σ(z_j) − 2m/(N+1)².   (2–78)

Removing the constant term, and normalizing the coefficients, we get the following:

minimize:  −λ Σ_{i=1}^N Σ_{j=1}^N k_σ(z_i − z_j) − (1 − λ) Σ_{j=1}^N k_σ(z_j),   (2–79)

where λ ∈ (0, 1]. It can be seen that, as λ → 0, the MEEF formulation converges
to the Minimization of Correntropy Cost (MCC) function. On the other hand, when λ = 1,
the MEEF objective function reduces to the MEE objective function. Intuitively, the second
term, Σ_{j=1}^N k_σ(z_j), can be seen as a regularization function. In fact, correntropy
induces a similarity norm and can be used for sparsification of the solution. This
sparsification is the underlying reasoning for the usage of fiducial points.
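The interpolation behaviour of Equation 2–79 can be sketched as follows. The function names and the toy error vector are ours, and λ = 0 is evaluated only as the limiting MCC case, since the text restricts λ to (0, 1]:

```python
import math

def gaussian(v, sigma):
    """Gaussian kernel k_sigma(v) = exp(-v^2 / (2*sigma^2))."""
    return math.exp(-v ** 2 / (2.0 * sigma ** 2))

def mee_term(z, sigma):
    """-sum_{i,j} k_sigma(z_i - z_j): the MEE part of Eq. 2-79."""
    return -sum(gaussian(zi - zj, sigma) for zi in z for zj in z)

def mcc_term(z, sigma):
    """-sum_j k_sigma(z_j): the fiducial (correntropy) part of Eq. 2-79."""
    return -sum(gaussian(zj, sigma) for zj in z)

def meef(z, sigma, lam):
    """Normalized MEEF loss (Eq. 2-79): a convex combination of MEE and MCC."""
    return lam * mee_term(z, sigma) + (1.0 - lam) * mcc_term(z, sigma)

z = [0.2, -0.1, 0.4]
# lam = 1 recovers the MEE objective; lam -> 0 recovers the correntropy (MCC) term.
print(meef(z, 1.0, 1.0) == mee_term(z, 1.0))    # True
print(meef(z, 1.0, 0.0) == mcc_term(z, 1.0))    # True
```

Intermediate values of λ trade off the pairwise-difference entropy term against the anchor at zero error.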
Consider the normalized loss function of the MEEF problem, H_C^σ(z), defined as:

H_C^σ(z) = −λ Σ_{i=1}^N Σ_{j=1}^N k_σ(z_i − z_j) − (1 − λ) Σ_{j=1}^N k_σ(z_j)   (2–80)
H_C^σ(z) = λ G_C^σ(z) + (1 − λ) F_C^σ(z),                                 (2–81)

where G_C^σ(z) = −Σ_{i=1}^N Σ_{j=1}^N k_σ(z_i − z_j) and F_C^σ(z) = −Σ_{j=1}^N k_σ(z_j).
Equation 2–81 states that the function H_C^σ(z) is a convex combination of two real
functions. Unlike convexity, as a reminder, pseudoconvexity may not be preserved
under positive weighted summation. However, invexity is preserved under positive
weighted summation when all the functions are invex with respect to the same η
function. Next, the conditions under which H_C^σ(z) is convex in particular, and invex in
general, are developed.
Theorem 2.10. Let $S = \{z \in \mathbb{R}^N : z_i^2 < M,\ 0 < M \ll \infty\ \forall\, i = 1, \dots, N\}$. If $z \in S$ and $H_C^\sigma : S \mapsto \mathbb{R}$, then the function $H_C^\sigma$, defined as:

$H_C^\sigma(z) = \lambda\, G_C^\sigma(z) + (1-\lambda)\, F_C^\sigma(z) \quad \forall\, z \in S$,  (2–82)

is convex when $\sigma \geq 2\sqrt{M}$.

Proof: Both $G_C^\sigma(z)$ and $F_C^\sigma(z)$ are convex when $\sigma \geq 2\sqrt{M}$. Therefore, it follows immediately that $H_C^\sigma$, being a convex combination of convex functions, is convex when $\sigma \geq 2\sqrt{M}$.
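The bound of Theorem 2.10 can be probed numerically: at a point of $S$ with $\sigma = 2\sqrt{M}$, a finite-difference Hessian of $H_C^\sigma$ should be positive semidefinite. A sketch, where the kernel normalization and the choices $M = 1$, $\lambda = 0.5$ are illustrative assumptions:

```python
import numpy as np

def k(x, sigma):
    return np.exp(-x**2 / (2 * sigma**2))

def H(z, sigma, lam):
    G = -k(z[:, None] - z[None, :], sigma).sum()   # G_C^sigma, the MEE term
    F = -k(z, sigma).sum()                         # F_C^sigma, the fiducial term
    return lam * G + (1 - lam) * F

def num_hessian(f, z, eps=1e-3):
    # Central second differences in every pair of coordinates.
    n = len(z)
    Hm = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            zs = [z.copy() for _ in range(4)]
            zs[0][i] += eps; zs[0][j] += eps
            zs[1][i] += eps; zs[1][j] -= eps
            zs[2][i] -= eps; zs[2][j] += eps
            zs[3][i] -= eps; zs[3][j] -= eps
            Hm[i, j] = (f(zs[0]) - f(zs[1]) - f(zs[2]) + f(zs[3])) / (4 * eps**2)
    return Hm

M = 1.0
sigma = 2 * np.sqrt(M)                             # the bound of Theorem 2.10
rng = np.random.default_rng(1)
z = rng.uniform(-np.sqrt(M), np.sqrt(M), size=5)   # a random point of the set S
eig = np.linalg.eigvalsh(num_hessian(lambda v: H(v, sigma, 0.5), z))
```

All eigenvalues of the numerical Hessian come out nonnegative (up to finite-difference noise), consistent with the theorem.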
Theorem 2.11. Let $S = \{z \in \mathbb{R}^N : z_i^2 < M,\ 0 < M \ll \infty\ \forall\, i = 1, \dots, N\}$. If $z \in S$ and $H_C^\sigma : S \mapsto \mathbb{R}$, then the function $H_C^\sigma$, defined as:

$H_C^\sigma(z) = \lambda\, G_C^\sigma(z) + (1-\lambda)\, F_C^\sigma(z) \quad \forall\, z \in S$,  (2–83)

is invex for finite $\sigma > 0$.
Proof: In order to prove the invexity of $H_C^\sigma(z)$, it is sufficient to show that both $G_C^\sigma(z)$ and $F_C^\sigma(z)$ are invex with respect to a common $\eta$ function.

By contradiction, assume that the following system, say (system-1), is infeasible for any $z$ and $w \in S$:

$\nabla F_C^\sigma(w)^T \eta(z, w) \leq F_C^\sigma(z) - F_C^\sigma(w)$  (2–84a)

$\nabla G_C^\sigma(w)^T \eta(z, w) \leq G_C^\sigma(z) - G_C^\sigma(w)$.  (2–84b)
Since Equation 2–84 is linear with respect to $\eta(z, w)$, from Gale's theorem [61] it can be stated that if the above linear system (system-1) is infeasible, then the following system (system-2) must be feasible:

$[\nabla F_C^\sigma(w)\ \ \nabla G_C^\sigma(w)]\, p = 0$  (2–85a)

$[F_C^\sigma(z) - F_C^\sigma(w)\ \ \ G_C^\sigma(z) - G_C^\sigma(w)]\, p = -1$  (2–85b)

$p \geq 0$.  (2–85c)
Case 1: either $p_1 = 0$ or $p_2 = 0$. Clearly, if $p_1 = 0$, then $p_2 = 0$ from Equation 2–85a. But when $p_1 = 0$ and $p_2 = 0$, Equation 2–85b is infeasible. Thus, $p_1 \neq 0$. A similar argument shows that $p_2 \neq 0$. To sum up, neither $p_1 = 0$ nor $p_2 = 0$ gives a feasible solution for (system-2).
Case 2: $p_1 > 0$ and $p_2 > 0$. Let us rearrange the elements of $w$ such that the following relation holds: $w_1 \leq w_2 \leq \dots \leq w_N$. Now Equation 2–85a can be written as:

$\nabla G_C^\sigma(w) = -\lambda\, \nabla F_C^\sigma(w)$,  (2–86)

where $\lambda = p_1 / p_2 > 0$. Now consider the following two sub-cases:
Sub-case 1: $w_N \geq 0$. Consider the last element on both sides of Equation 2–86, i.e., consider

$\frac{2}{\sigma^2} \sum_{i=1}^{N} e^{-\frac{(w_N - w_i)^2}{2\sigma^2}} (w_N - w_i) = -\lambda\, \frac{1}{\sigma^2}\, e^{-\frac{w_N^2}{2\sigma^2}}\, w_N$.  (2–87)

Clearly, Equation 2–87 has no feasible value of $\lambda$.
Sub-case 2: $w_N < 0$. Consider the first element on both sides of Equation 2–86, i.e., we have

$[\nabla G_C^\sigma(w)]_1 = -\lambda\, [\nabla F_C^\sigma(w)]_1$  (2–88)

$\frac{2}{\sigma^2} \sum_{i=1}^{N} e^{-\frac{(w_1 - w_i)^2}{2\sigma^2}} (w_1 - w_i) = -\lambda\, \frac{1}{\sigma^2}\, e^{-\frac{w_1^2}{2\sigma^2}}\, w_1$.  (2–89)

Clearly, Equation 2–89 has no feasible value of $\lambda$. Thus, (system-2) is infeasible, implying that the assumption is false, and (system-1) is feasible. In other words, there exists a common $\eta$ such that both $G_C^\sigma(z)$ and $F_C^\sigma(z)$ are invex. Therefore, $H_C^\sigma(z)$ is invex for any nonzero finite value of $\sigma$. □
To sum up, it can be stated that the MCC, MEE and MEEF functions are invex in nature. Furthermore, invexity, robustness and smoothness are the three main desirable properties of a robust measure. The presence of these three properties, along with suitable optimization algorithms, will improve the current computational complexity of robust methods. Next, the traditional and proposed robust algorithms are presented.
2.6. Traditional Robust Algorithm
Consider the classical data analysis methods, like the least squares method, in order to understand the concept of robust algorithms. The idea in the classical methods is to estimate the model parameters with respect to all of the presented data. These methods give equal weight to all the data points, and have no internal mechanism to detect and/or filter outliers. The classical methods are based on a smoothing assumption, which states that the effects of outliers are smoothed out by the presence of a large amount of good data points. However, in many practical problems, the smoothing assumption is not justifiable. Thus, earlier robust algorithms were based on removal of outliers. The simple idea in such a robust algorithm is to estimate the parameters with respect to all of the data points, and then identify those points which are farthest from (non-conforming to) the model. The identified points are assumed to be outliers, and are removed from the data. The remaining points are used to construct a new model. This iterative process continues until a better model is constructed, or until there are no longer sufficient remaining points to proceed. However, these heuristic iterative methods easily fail even when there is only one outlier [28].
Fischler [29] pioneered the constructive approach to robust algorithms using the notion of random sampling, called Random Sample Consensus (RANSAC). The basic idea in RANSAC is to simultaneously estimate the model and eliminate the outliers. The novelty that RANSAC proposed, compared to the earlier heuristics, can be summarized as:

• Initially, a small number of data points (the initial set) are selected to estimate the model parameters.

• While estimating the instantaneous model parameters, the initial set is enlarged in size by adding the consensus points.
The philosophy of selecting a small number of points for estimating the instantaneous model parameters is the source of the robustness of the RANSAC algorithm. Typically, the outliers in a practical data set are assumed to be much fewer in number than the good data points. Thus, selecting a small sample from the given data increases the probability of selecting only good data points. Formally, RANSAC is described in Algorithm 2.1. RANSAC is the basis of many robust algorithms due to its ability to tolerate a large fraction of outliers. RANSAC can often perform well with a high amount of outliers; however, the number of samples required to do so increases exponentially with respect to the percentage of outliers in the data. Thus, similar to robust measures, robust algorithms are computationally expensive.
If the percentage of outliers in a sample is known a priori (say $p_o$), then the number of samples required (say $k$) for an $\eta$ level of confidence can be calculated as:

$k \geq \frac{\log(1 - \eta)}{\log(1 - (1 - p_o)^m)}$,  (2–90)

where $m$ is the minimum number of samples required to compute a solution. RANSAC is a simple, successful robust algorithm in the data analysis literature. Nevertheless, many efforts have been made toward improving the performance of RANSAC. For example, the optimization of the model verification process of RANSAC is addressed in [12, 16, 18]. Improvements directed towards the sampling process, in order to generate usable hypotheses, are addressed in [17, 19, 96]. Furthermore, real-time execution issues are addressed in [68, 74].
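Equation 2–90 is straightforward to evaluate; a small sketch, where the example values for $p_o$, $m$ and $\eta$ are illustrative:

```python
import math

def ransac_samples(p_outlier, m, confidence):
    """Number of RANSAC samples k from Equation 2-90."""
    return math.ceil(math.log(1 - confidence) /
                     math.log(1 - (1 - p_outlier) ** m))

# e.g. 50% outliers, minimal sets of m = 4 points, 99% confidence
k = ransac_samples(0.5, 4, 0.99)  # 72 samples
```

The exponential growth in the outlier fraction mentioned above is visible here: the denominator shrinks geometrically in $m$ as $p_o$ grows.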
2.7. Proposed Robust Algorithm
In this work, a RANSAC based robust algorithm is developed, which proposes an improved sampling strategy and the usage of mathematical modeling for hypothesis testing. Specifically, the algorithm is proposed for the hyperplane clustering problem. The key idea in the sample selection of the robust algorithm is twofold. First, an initial data sample S1 is selected based on a closeness criterion. By restricting the closeness criterion, a subgroup S2 of the initial data sample is selected. Let the rest of the data points be denoted as S3. The model parameters are estimated as follows. The data points belonging to S2 are considered definitely good points, and the data points belonging to S1 are considered tentatively good points. Using the information of points from S1 and S2, a hyperplane containing S2 and other points is searched for. If S1 contains points that do not belong to the hyperplane, then the algorithm has a mechanism to discard those points. After the execution of one instance of the algorithm, we end up with two possibilities. The desired possibility is that the number of consensus points is above the threshold for that hyperplane, and we search for the next hyperplane. The other possibility is that the number of consensus points is below the threshold. In this case, we re-sample the two sets, but archive the information of the previous unsuccessful sample S2. This archive acts as a cut for selecting the next sample, and avoids repetition.
2.8. Discussion on the Robust Methods
The practical consideration while applying the results presented in Section 2.3 is the asymptotic behavior of the negative exponential function. Theoretically, $k_\sigma(x) \to 0$ only as $x \to \infty$. In practice, however, finite large values of $x$ already result in $k_\sigma(x) = 0$ in floating-point arithmetic. This behavior may result in local minima, which can be avoided by using the following methods: constraining the absolute value of the error, replacing the Gaussian kernel with a suitable kernel function, or using the solution of a quadratic loss function as the starting point of the correntropy minimization.
In Sections 2.3, 2.4 and 2.5, the convexity, pseudoconvexity and invexity of the entropy based loss functions (MCC, MEE and MEEF) are established. Invexity is the sole property that can be exploited in designing optimization algorithms, which can be used for efficiently minimizing the loss function. The generalized convexity results for the single and multiple dimensional cases are presented separately. The purpose of discussing the one-dimensional case separately was to address the traditional sample-by-sample artificial neural network approach in data analysis. Generally, the cumulative error approach is useful in both the parametric and non-parametric approaches to data analysis. Future directions for utilizing the correntropic loss function will involve designing fast algorithms that can speed up the grid search over the kernel parameter. Furthermore, designing a kernel that improves the asymptotic behavior of the current kernel function will enhance the efficiency of the algorithms.
Entropic learning in the form of MCC, MEE and MEEF has been successfully applied in robust data analysis, including robust regression [56], robust classification [93], robust pattern recognition [43], robust image analysis [42], etc. In Chapter 2, it is shown that the unconstrained MEE and MEEF problems are invex. In general, the invexity property remains intact over a convex feasible space for constrained optimization problems. Therefore, a linear learning mapper (or, in general, a convex mapper) designed to minimize MEE will yield an invex problem. By suitably exploiting the invexity, efficient optimization algorithms can be proposed for the MCC, MEE and MEEF problems. Furthermore, stochastic gradient methods like convolution smoothing can be intelligently applied to solve these problems. In fact, by varying the kernel parameter we move from the convex to the invex domain, which is inherently the notion behind not only convolution smoothing, but also many global optimization algorithms.
Sections 2.6 and 2.7 present the robust algorithmic approach in data analysis.
Typically, the RANSAC philosophy is applied in computer vision and related areas of
data analysis. However, the method is useful in those data analysis scenarios where
a sample can be used to estimate model parameters and validate the other points. In
Chapter 4, the blind signal separation problem is presented and a solution methodology
that involves the RANSAC philosophy is proposed.
Algorithm 2.1: RANSAC Algorithm
input : P data points.
output: Estimated model M⋆.
1  begin
2      Set ε, the tolerance limit;
3      Set θ, a predefined threshold;
4      Set termination = false;
5      while termination == false do
6          Select Si, a set containing n points, from the given P points;
7          Estimate model Mi using the knowledge of set Si;
8          Identify Si^c, the set of points (consensus set) from the original P data points that fall within the ε tolerance limit of Mi;
9          if |Si^c| ≥ θ then
10             Estimate model M⋆ using Si^c;
11             termination = true;
12     return (M⋆);
CHAPTER 3
ROBUST DATA CLASSIFICATION
The goal of Chapter 3 is to present the binary classification problem, and to illustrate the robust non-parametric methods for data classification. In Section 3.1, all the major preliminary topics required to understand the proposed robust methods in data classification are discussed. The purpose of reviewing these topics is to provide sufficient background information to a novice reader. However, they by no means serve as a comprehensive discussion, and interested readers are directed to the appropriate references for detailed discussions. Furthermore, Section 3.2 reviews some of the traditional approaches to binary classification, whereas Section 3.3 presents the proposed robust approaches.
3.1. Preliminaries
The following topics are reviewed in Section 3.1:
• Classification
• Correntropic Function
• Convolution Smoothing (CS)
• Simulated Annealing (SA)
• Artificial Neural Network (ANN)
• Support Vector Machine (SVM)
Classification
Classification (strictly speaking, statistical classification) is a supervised learning
methodology of identifying (or assigning) class labels to an unlabeled data set (a
sub-population of data, whose class is unknown) from the knowledge of a pre-labeled
data set (another sub-population of the same data, whose class is known). The
knowledge of the pre-labeled data set can be used to generate an optimal rule, based on the theory of learning [99, 100]. More specifically, the optimal rule (the discriminant function f) is generated in such a way that it minimizes the risk of assigning incorrect class labels [3, 65]. The classification problem is defined in the following paragraph.

(Some sections of Chapter 3 have been published in Dynamics of Information Systems: Mathematical Foundations.)
Let $D_n$ represent the data set containing the observations, defined as $D_n = \{(x_i, y_i),\ i = 1, \dots, n : x_i \in \mathbb{R}^m \wedge y_i \in \{-1, 1\}\}$, where $x_i$ is an input vector and $y_i$ is the class label for that input vector. Under the assumption that each $(x_i, y_i)$ is an independent and identical realization of the random pair $(X, Y)$, the classification problem can be defined as finding a function $f$ from a class of functions $\mathcal{F}$, such that $f$ minimizes the risk $R(f)$.
Thus, the classification problem can be written as:

minimize: $R(f)$  (3–1a)
subject to:
$(x_i, y_i) \in D_n \quad \forall\, i = 1, \dots, n$,  (3–1b)
$x_i \in \mathbb{R}^m \quad \forall\, i = 1, \dots, n$,  (3–1c)
$y_i \in \{-1, 1\} \quad \forall\, i = 1, \dots, n$,  (3–1d)
$f \in \mathcal{F}$,  (3–1e)
where $R(f)$ is defined as:

$R(f) = P(Y \neq \mathrm{sign}(f(X))) = E[l_{0\text{-}1}(f(X), Y)]$,  (3–2)

where sign is the signum function and $l_{0\text{-}1}$ is the 0-1 loss function; they are defined as:

$\mathrm{sign}(f(X)) = \begin{cases} +1 & \text{if } f(X) > 0 \\ -1 & \text{if } f(X) < 0 \\ 0 & \text{otherwise} \end{cases}$  (3–3)

$l_{0\text{-}1}(f(x), y) = \|(-y f(x))_+\|_0$,  (3–4)
where $(\cdot)_+$ denotes the positive part and $\|\cdot\|_0$ denotes the $L_0$ norm. When $f(x) = 0$, the above definition does not reflect an error; however, this is a rare case and can be easily avoided or adjusted (e.g., by considering $\|(f(x) - y)_+\|_0$). Moreover, it is clear from the definition of $R(f)$ that it requires the knowledge of $P(X, Y)$, the joint probability distribution of the random pair $(X, Y)$. Usually, the joint distribution is unknown. This leads to the calculation of the empirical risk function $R(f)$, which is given as:

$R(f) = \frac{1}{n} \sum_{i=1}^{n} l_{0\text{-}1}(f(x_i), y_i)$.  (3–5)
At this juncture, only Empirical Risk Minimization (ERM) is considered, and any
discussion pertaining to Structural Risk Minimization (SRM) is avoided. However,
SRM will be discussed when the notion of support vector machine is presented.
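The empirical risk of Equation 3–5 reduces to counting sign disagreements; a minimal sketch, where the linear discriminant $f$ is a made-up example:

```python
import numpy as np

def empirical_01_risk(f, X, y):
    """Empirical risk of Equation 3-5: fraction of sign disagreements.
    A zero margin is counted as an error, per the adjusted convention."""
    margins = y * np.array([f(x) for x in X])
    return float(np.mean(margins <= 0))

X = np.array([[1.0], [2.0], [-1.0], [-3.0]])
y = np.array([1, 1, -1, -1])
f = lambda x: x[0] - 0.5          # a hypothetical linear discriminant
risk = empirical_01_risk(f, X, y)
```

Here `risk` is simply the misclassification rate of `f` on the labeled sample.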
Generally, it is not easy to find the optimal solution $f^\star$ of the problem stated in Formulation 3–1, since the space of the function class $\mathcal{F}$ is huge, and there is no efficient way to search over such a space. In order to find the solution, a usual approach is to select the class of functions a priori, and then try to find the best function from the selected function class $\mathcal{F}$. Generally, the selected class of functions can be categorized as parametric or non-parametric. Based on the category of the function class, different learning algorithms can be used to minimize the loss function. Thus, with the above stated restrictions, the classification problem can be represented as:
minimize: $R(f)$  (3–6a)
subject to:
$(x_i, y_i) \in D_n \quad \forall\, i = 1, \dots, n$,  (3–6b)
$x_i \in \mathbb{R}^m \quad \forall\, i = 1, \dots, n$,  (3–6c)
$y_i \in \{-1, 1\} \quad \forall\, i = 1, \dots, n$,  (3–6d)
$f \in \mathcal{F}$.  (3–6e)
In summary, both $R(f)$ and $\mathcal{F}$ will usually be selected before finding $f^\star$. Moreover, the type of risk function and the function class selected significantly determine the accuracy of the classification method. Next, the usage of the correntropy loss function as a risk function in data classification is presented.
Correntropic Function
Although the classification problem stated in Formulation 3–6 looks simple, it has an inherent difficulty, due to the non-convex and non-continuous loss function defined in Equation 3–4. Furthermore, the search over the function space $\mathcal{F}$ is another difficulty in solving Formulation 3–6. The key idea is to propose a loss function that can efficiently replace the loss function given in Equation 3–4. Conventionally, the 0-1 loss function is replaced by a quadratic loss function, i.e., the quadratic risk is given as:

$R(f) = E[(Y - f(X))^2] = E[\varepsilon^2]$.  (3–7)
In general, the knowledge of the Probability Distribution Function (PDF) of $\varepsilon$ is required to calculate the above risk function. However, the quadratic risk can be approximated by the following empirical quadratic risk function:

$R(f) = \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2$,  (3–8)
where $n$ is the number of samples. The replacement of the 0-1 loss function with the quadratic loss function makes Formulation 3–6 computationally easy to solve (due to its convex nature). Moreover, if the function class $\mathcal{F}$ is smooth, then the problem can be solved by gradient descent methods. However, the quadratic loss function performs poorly on noisy data, i.e., the computational simplicity has its price in classification performance. Hence, the usual gradient descent based optimization methods with a quadratic loss function may not provide the globally optimal solution over the class of functions selected ($\mathcal{F}$). In order to overcome this difficulty, the use of the correntropic loss function is proposed.
In order to define the correntropic risk function, consider a function $\varphi_{\beta,\sigma}(f(x), y)$ defined as:

$\varphi_{\beta,\sigma}(f(x), y) = \beta[1 - k_\sigma(1 - y f(x))] = \beta[1 - k_\sigma(1 - \alpha)]$,  (3–9)

where $\alpha = y f(x)$ is called the margin, $\beta = [1 - e^{-1/(2\sigma^2)}]^{-1}$ is a positive scaling factor, and $k_\sigma$ is the Gaussian kernel with width parameter $\sigma$. This function has its roots in the correntropy function (see [72] for more details). Using this information, the correntropic risk function can be rewritten as:
$R(f) = E[\varphi_{\beta,\sigma}(f(X), Y)]$
$\quad = E[\beta(1 - k_\sigma(1 - Y f(X)))]$
$\quad = \beta(1 - E[k_\sigma(1 - Y f(X))])$
$\quad = \beta(1 - \nu(1 - Y f(X)))$
$\quad = \beta(1 - \nu(Y - f(X)))$
$\quad = \beta(1 - \nu(\varepsilon))$.  (3–10)
Due to the unavailability of the PDF, similarly to the quadratic loss function, the empirical correntropic risk function can be defined as:

$R(f) = \beta(1 - \nu(\varepsilon))$,  (3–11)

where $\nu(\varepsilon) = \frac{1}{n} \sum_{i=1}^{n} k_\sigma(y_i - f(x_i))$ and $n$ is the number of samples.
The characteristics of this function for different values of the width parameter are shown in Figure 3-1. Clearly, from Figure 3-1, it can be seen that the function $\varphi_{\beta,\sigma}(f(x), y)$ is convex for higher values of the kernel width parameter ($\sigma > 1$), and as the parameter value decreases, it becomes non-convex. For $\sigma = 1$ it approximates the hinge loss function (the hinge loss is a typical function often used in SVMs). However, for smaller values of the kernel width the function closely approximates the 0-1 loss function, which is mostly unexplored territory for typical classification problems. In fact, kernel widths other than $\sigma = 2$ or $1$ have not been studied for other loss functions. This peculiar property of the correntropic function can be used harmoniously with the concept of convolution smoothing for finding globally optimal solutions. Moreover, with a fixed lower value of the kernel width, suitable global optimization algorithms (heuristics like simulated annealing) can be used to find the globally optimal solution. Next, the elementary ideas behind the different optimization algorithms that can be used with the correntropic loss function are discussed.
Convolution Smoothing (CS)
A Convolution Smoothing (CS) approach [87] forms the basis for one of the proposed methods of correntropic risk minimization. The main idea of the CS approach is sequential learning, where the algorithm starts from a correntropic loss function with a high kernel width and smoothly moves towards one with a low kernel width, approximating the original loss function. The suitability of this approach can be seen in [86], where the authors used a two-step approach for finding the globally optimal solution. The currently proposed method is a generalization of the two-step approach. Before discussing the proposed method, consider the following basic framework of CS. A general unconstrained optimization problem is defined as:

minimize: $g(u)$  (3–12a)
subject to: $u \in \mathbb{R}^n$,  (3–12b)
where $g : \mathbb{R}^n \mapsto \mathbb{R}$. The complexity of solving such problems depends upon the nature of the function $g$. If $g$ is convex, then a simple gradient descent method will lead to the globally optimal solution. Whereas, if $g$ is non-convex, then the gradient descent algorithm will behave poorly and converge to a locally optimal solution (or, in the worst case, to a stationary point).
CS is a heuristic based global optimization method for solving problems of the form of Formulation 3–12 when $g$ is non-convex. It is a specialized form of the stochastic approximation method introduced in 1951 [77]. The usage of convolution in solving convex optimization problems was first proposed in 1972 [4]. Later, as an extension, a generalized method for solving non-convex unconstrained problems was proposed in 1983 [82]. The main motivation behind CS is that the globally optimal solution of a multi-extremal function $g$ can be obtained from the information of a locally optimal solution of its smoothed function. It is assumed that the function $g$ is a convolution of a convex function $g_0$ and other non-convex functions $g_i\ \forall\, i = 1, \dots, n$. The other non-convex functions can be seen as noise added to the convex function $g_0$. In practice, $g_0$ is intangible, i.e., it is impractical to obtain a deconvolution of $g$ into the $g_i$'s such that $\arg\min_u\{g(u)\} = \arg\min_u\{g_0(u)\}$.
In order to overcome this difficulty, a smoothed approximation function $g(u, \lambda)$ is used. This smoothed function has the following main property:

$g(u, \lambda) \to g(u)$ as $\lambda \to 0$,  (3–13)

where $\lambda$ is the smoothing parameter. For higher values of $\lambda$, the function is highly smooth (nearly convex), and as the value of $\lambda$ tends towards zero, the function takes the shape of the original non-convex function $g$. Such smoothed functions can be defined as:

$g(u, \lambda) = \int_{-\infty}^{\infty} h(u - v, \lambda)\, g(v)\, dv$,  (3–14)
where $h(v, \lambda)$ is a kernel function with the following properties:

• $h(v, \lambda) \to \delta(v)$ as $\lambda \to 0$, where $\delta(v)$ is Dirac's delta function.
• $h(v, \lambda)$ is a probability distribution function.
• $h(v, \lambda)$ is a piecewise differentiable function with respect to $u$.
Moreover, the smoothed gradient of $g(u, \lambda)$ can be expressed as:

$\nabla g(u, \lambda) = \int_{-\infty}^{\infty} \nabla h(v, \lambda)\, g(u - v)\, dv$.  (3–15)
Equation 3–15 highlights a very important aspect of CS: it states that information about $\nabla g(v)$ is not required for obtaining the smoothed gradient. This is one of the crucial aspects of the smoothed gradient, which can be easily extended to non-smooth optimization problems where $\nabla g(v)$ does not usually exist.
Furthermore, the objective of CS is to find the globally optimal solution of the function $g$. However, based on the level of smoothness, a locally optimal solution of the smoothed function may not coincide with the globally optimal solution of the original function. Therefore, a series of sequential optimizations with different levels of smoothness is required. Usually, at first, a high value of $\lambda$ is set, and an optimal solution $u^\star_\lambda$ is obtained. Taking $u^\star_\lambda$ as the starting point, the value of $\lambda$ is reduced, and a new optimal value in the neighborhood of $u^\star_\lambda$ is obtained. This procedure is repeated until the value of $\lambda$ is reduced to zero. The idea behind these sequential optimizations is to end up in a neighborhood of $u^\star$ as $\lambda \to 0$, i.e.,

$u^\star_\lambda \to u^\star$ as $\lambda \to 0$,  (3–16)

where $u^\star = \arg\min\{g(u)\}$. The crucial part of the CS approach is the decrement of the smoothing parameter. Different algorithms can be devised to decrement the smoothing parameter. In [87], a heuristic method (similar to simulated annealing) is proposed to decrease the smoothing parameter.
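The sequential procedure can be illustrated on a one-dimensional multi-extremal function whose Gaussian smoothing has a closed form; the test function $g(u) = 0.1u^2 - \cos(3u)$, its smoothed gradient, and the schedule for $\lambda$ are illustrative assumptions, not from the text:

```python
import numpy as np

def grad_smoothed(u, lam):
    """Gradient of the Gaussian-smoothed g(u) = 0.1*u**2 - cos(3*u):
    convolving g with N(0, lam**2) damps the oscillation by exp(-9*lam**2/2)."""
    return 0.2 * u + 3.0 * np.sin(3.0 * u) * np.exp(-4.5 * lam**2)

def descend(u, lam, steps=300, mu=0.05):
    # Plain gradient descent on the smoothed objective at a fixed lambda.
    for _ in range(steps):
        u -= mu * grad_smoothed(u, lam)
    return u

u0 = 2.5                               # starts in the basin of a local minimum
u_plain = descend(u0, lam=0.0)         # no smoothing: gets trapped near u = 2.06
u_cs = u0
for lam in [1.5, 1.0, 0.5, 0.25, 0.0]: # sequentially reduce the smoothing parameter
    u_cs = descend(u_cs, lam)          # warm start from the previous stage
# u_cs tracks the global minimizer at u = 0
```

Each stage plays the role of finding $u^\star_\lambda$; the warm starts are what let the iterates follow the global minimizer as $\lambda \to 0$.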
Apparently, the main difficulty of applying the CS method to an optimization problem is defining a smoothed function with the property given by Equation 3–13. However, CS can be used efficiently with the proposed correntropic loss function, as the correntropic loss function can be seen as a generalized smoothed function for the true loss function (see Figure 3-1). The kernel width of the correntropic loss function can be visualized as the smoothing parameter.

Therefore, the CS method is applicable to solving the classification problem when a suitable kernel width is unknown a priori (a practical situation). On the other hand, if an appropriate value of the kernel width is known a priori (maybe an impractical assumption, but quite possible), then other efficient methods may be developed, like simulated annealing based methods. The crux of Chapter 3 is to present a correntropy minimization method over a non-parametric framework. Generally, the correntropy loss function is invex (and convex in certain cases). However, due to the presence of a non-convex framework, global optimization methods like CS or simulated annealing based methods are proposed.
Simulated Annealing (SA)
Simulated Annealing (SA) is a meta-heuristic method which is employed to find a good solution to an optimization problem. The method stems from thermal annealing, which aims to obtain a perfect crystalline structure (the lowest energy state possible) by slow temperature reduction. Metropolis et al. in 1953 simulated this process of material cooling [13], and Kirkpatrick et al. applied the simulation method to solving optimization problems [53, 70].
SA can be viewed as an upgraded version of greedy neighborhood search. In a neighborhood search method, a neighborhood structure is defined in the solution space, and the neighborhood of the current solution is searched for a better solution. The main disadvantage of this type of search is its tendency to converge to a locally optimal solution. SA tackles this drawback by using concepts from hill-climbing methods [64]. In SA, any neighborhood solution of the current solution is evaluated and accepted with a probability. If the new solution is better than the current solution, then it replaces the current solution with probability 1. Whereas, if the new solution is worse than the current solution, then the acceptance probability depends upon the control parameters (temperature and change in energy). During the early iterations of the algorithm, the temperature is kept high, and this results in a high probability of accepting worse new solutions. After a predetermined number of iterations, the temperature is reduced strategically, and thus the probability of accepting a worse new solution is reduced. These iterations continue until one of the termination criteria is met. The use of high temperature in the earlier iterations (and low temperature in the later iterations) can be viewed as exploration (respectively, exploitation) of the feasible solution space. As each new solution is accepted with a probability, SA is also known as a stochastic method. A complete treatment of SA and its applications is given in [75]. Neighborhood selection strategies are discussed in [2]. Convergence criteria of SA are presented in [57].
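The acceptance rule described above can be sketched as follows; the test function, the Gaussian proposal distribution, and the geometric cooling schedule are illustrative choices:

```python
import math, random

def g(u):
    # An illustrative multi-extremal objective with global minimum at u = 0.
    return 0.1 * u**2 - math.cos(3.0 * u)

def simulated_annealing(u0, T0=2.0, cooling=0.95, sweeps=100, moves=50, seed=0):
    rng = random.Random(seed)
    u, T = u0, T0
    best_u, best_g = u0, g(u0)
    for _ in range(sweeps):
        for _ in range(moves):                 # fixed number of moves per temperature
            v = u + rng.gauss(0.0, 0.5)        # neighborhood proposal
            delta = g(v) - g(u)
            # Accept improvements with probability 1, worse moves with exp(-delta/T).
            if delta <= 0 or rng.random() < math.exp(-delta / T):
                u = v
                if g(u) < best_g:
                    best_u, best_g = u, g(u)
        T *= cooling                           # strategic temperature reduction
    return best_u, best_g

best_u, best_g = simulated_annealing(2.5)      # start inside a local basin
```

Early high-temperature sweeps let the chain escape the local basin (exploration); the cooled later sweeps refine the solution near the global minimizer (exploitation).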
In this work, SA will be used to train the correntropic loss function when the kernel width is known a priori. Although the assumption of a known kernel width seems implausible, any known information about an unknown variable will increase the efficiency of solving an optimization problem. Moreover, a comprehensive knowledge of the data may provide the appropriate kernel width to be used in the loss function. Nevertheless, when the kernel width is unknown, a grid search can be performed over the kernel width space to obtain the kernel width that maximizes the classification accuracy. This is a typical approach when using kernel based soft margin SVMs, which generally involves a grid search over a two-dimensional parameter space.

So far, no discussion of the function class ($\mathcal{F}$) has been presented. In the current work, a non-parametric function class, namely artificial neural networks, and a parametric function class, namely support vector machines, are considered. Next, an introductory review of artificial neural networks is presented.
Artificial Neural Networks (ANNs)
Curiosity about the human brain led to the development of Artificial Neural Networks (ANNs). ANNs are mathematical models that share some of the properties of brain function, such as nonlinearity, adaptability and distributed computation. The first mathematical model that depicted a working ANN used the perceptron, proposed by McCulloch and Pitts [62]. The actual adaptable perceptron model is credited to Rosenblatt [80]. The perceptron is a simple single-layer neuron model, which uses a learning rule similar to gradient descent. However, the simplicity of this model (a single layer) limits its applicability to modeling complex practical problems. Thereby, it became an object of censure in [66]. However, a question which instigated the use of multilayer neural networks was also kindled in [66]. After a couple of decades of research, neural network research exploded with impressive success. Furthermore, multilayered feedforward neural networks have been rigorously established as a function class of universal approximators [46]. In addition, different models of ANNs were proposed to solve combinatorial optimization problems, and the convergence conditions for the ANN optimization models have been extensively analyzed [91].

Processing Elements (PEs) are the primary elements of any ANN. The state of a PE can take any real value in the interval [0, 1] (some authors prefer to use values in [-1, 1]; both definitions are interchangeable and have the same convergence behavior). The main characteristic of a PE is function embedding. In order to understand this phenomenon, consider a single-PE ANN model (the perceptron model) with n inputs and one output, shown in Figure 3-2.
The total information incident on the PE is $\sum_{i=1}^{n} w_i x_i$. The PE embeds this information into a transfer function $\Phi$, and sends the output to the following layer. Since there is a single layer in this example, the output from the PE is considered the final output. Moreover, if we define $\Phi$ as:

$\Phi\left(\sum_{i=1}^{n} w_i x_i + b\right) = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} w_i x_i + b \geq 0 \\ 0 & \text{otherwise,} \end{cases}$  (3–17)
where $b$ is the threshold level of the PE, then the single-PE perceptron can be used for binary classification, given that the data is linearly separable. The difference between this simple perceptron method of classification and support vector based classification is that the perceptron finds a plane that linearly separates the data, whereas the support vector machine finds the plane with maximum margin. This does not indicate the superiority of one method over the other, since a single PE is considered. In fact, this shows the capability of a single PE; but a single PE is incapable of processing the complex information that is required for most practical problems. Therefore, multiple PEs in multiple layers are used as universal classifiers. The PEs interact with each other via links to share the available information. The intensity and sense of the interaction between any two connected PEs is represented by a weight, or synaptic weight, on the link. The term synaptic is related to the nervous system, and is used in ANNs to indicate the weight between any two PEs.
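The thresholded single PE of Equation 3–17 can be sketched directly; the hand-set weights realizing a logical AND are an illustrative choice, not a trained model:

```python
import numpy as np

def perceptron_output(x, w, b):
    """Single-PE forward pass of Equation 3-17: threshold the incident information."""
    return 1 if np.dot(w, x) + b >= 0 else 0

# A hand-set PE realizing logical AND on binary inputs
w, b = np.array([1.0, 1.0]), -1.5
outputs = [perceptron_output(np.array(x), w, b)
           for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]  # -> [0, 0, 0, 1]
```

AND is linearly separable, so a single PE suffices; a function like XOR would require the multilayer networks discussed next.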
Usually, PEs in the (r − 1)th layer sends information to the r th layer using the
following feedforward rule:
yi = i
∑j∈(r−1)
wjiyj − Ui
, (3–18)
where PE i belongs to the r th layer, and any PE j belongs to the (r − 1)th layer. y_i
represents the state of the i th PE, w_ji represents the weight between the j th PE and the i th PE,
and U_i represents the threshold level of the i th PE. Function φ_i(·) is the transfer function for
the i th PE. Once the PEs in the final layer are updated, the error from the actual output
is calculated using a loss function (this is where the correntropic loss function will
be injected). The error or loss calculation marks the end of the feedforward phase of ANNs.
Based on the error information, the backpropagation phase of ANNs starts. In this phase,
the error information is utilized to update the weights, using the following rules:
    w_jk = w_jk + µ δ_k y_j,                                               (3–19)

where

    δ_k = (∂F(ε)/∂ε) φ′(net_k),                                            (3–20)

where µ is the learning step size, net_k = Σ_{j∈(r−1)} w_jk y_j − U_k, and F(ε) is the error
function (or loss function). For the output layer, the deltas are computed as:

    δ_k = δ_0 = (∂F(ε)/∂ε) φ′(net_k) = (y − y_0) φ′(net_k),                (3–21)
and the deltas of the previous layers are updated as:

    δ_k = δ_h = φ′(net_k) Σ_{o=1}^{N_0} w_ho δ_o.                          (3–22)
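The feedforward rule of Equation 3–18 together with the delta-rule updates of Equations 3–19 through 3–22 can be sketched for a single hidden layer as follows (a minimal illustration under a quadratic loss; the network sizes, step size, and random initialization are arbitrary choices, and thresholds are folded into the weights):

```python
import numpy as np

def phi(z):
    """Sigmoidal squashing function phi(z) = (1 - e^-z) / (1 + e^-z)."""
    return (1 - np.exp(-z)) / (1 + np.exp(-z))

def phi_prime(z):
    """Derivative of phi; since phi(z) = tanh(z/2), phi'(z) = (1 - phi(z)^2) / 2."""
    return 0.5 * (1 - phi(z) ** 2)

def forward(x, W1, W2):
    """Feedforward pass (Equation 3-18) with thresholds folded into the weights."""
    net_h = W1 @ x
    y_h = phi(net_h)
    net_o = W2 @ y_h
    return net_h, y_h, net_o, phi(net_o)

def backprop_step(x, y_true, W1, W2, mu=0.1):
    """One delta-rule update (Equations 3-19 to 3-22) under a quadratic loss."""
    net_h, y_h, net_o, y_o = forward(x, W1, W2)
    delta_o = (y_true - y_o) * phi_prime(net_o)    # output-layer delta (3-21)
    delta_h = phi_prime(net_h) * (W2.T @ delta_o)  # hidden-layer deltas (3-22)
    W2 = W2 + mu * np.outer(delta_o, y_h)          # w_jk <- w_jk + mu delta_k y_j (3-19)
    W1 = W1 + mu * np.outer(delta_h, x)
    return W1, W2

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(3, 2))   # 2 inputs, 3 hidden PEs
W2 = rng.normal(scale=0.5, size=(1, 3))   # 1 output PE
x, y = np.array([0.5, -0.2]), np.array([1.0])
err_before = abs(forward(x, W1, W2)[3][0] - y[0])
for _ in range(200):
    W1, W2 = backprop_step(x, y, W1, W2)
err_after = abs(forward(x, W1, W2)[3][0] - y[0])
```

Repeated application of the delta rule on the single sample drives the output toward the target, so err_after falls below err_before.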
In the proposed approaches, the ANN is trained to minimize the correntropic
loss function. In total, two different approaches to train the ANN are proposed: in one
approach, the ANN is trained using the CS algorithm, whereas in the other, the ANN
is trained using the SA algorithm. In order to validate the results, we not only compare
the proposed approaches with conventional ANN training methods, but also compare
them with the support vector machine based classification method. Next, a review of
support vector machines is presented.
Support Vector Machines (SVMs)
Support Vector Machine (SVM) is a popular supervised learning method [9, 22].
It was developed for binary classification problems, but can be extended to
multiclass classification problems [38, 101, 102], and it has been applied in many
areas of engineering and biomedicine [44, 52, 69, 95, 104]. In general, supervised
classification algorithms provide a classification rule able to decide the class of an
unknown sample. In particular, the goal of the SVM training phase is to find a hyperplane
that ‘optimally’ separates the data samples belonging to different classes. More precisely, SVM
is a particular case of hyperplane separation. The basic idea of SVM is to separate two
classes (say A and B) by a hyperplane defined as:
    f(x) = w^t x + b,                                                      (3–23)
such that f(a) < 0 when a ∈ A, and f(b) > 0 when b ∈ B. However, there
could be infinitely many possible ways to select w. The goal of SVM is to choose the best
w according to a criterion (usually the one that maximizes the margin), so that the risk of
misclassifying a new unlabeled data point is minimized. The best separating hyperplane for
unknown data will be the one that is sufficiently far from both classes (this is the basic
notion of SRM), i.e., a hyperplane which is in the middle of the following two parallel
hyperplanes (support hyperplanes) can be used as a separating hyperplane:
    w^t x + b = c                                                          (3–24)
    w^t x + b = −c.                                                        (3–25)

Since w, b and c are all parameters, a suitable normalization leads to:

    w^t x + b = 1                                                          (3–26)
    w^t x + b = −1.                                                        (3–27)
Moreover, the distance between the supporting hyperplanes defined in Equations 3–26
& 3–27 is given by:

    Δ = 2 / ||w||.                                                         (3–28)
In order to obtain the best separating hyperplane, the following optimization problem is
solved:

    maximize:   2 / ||w||                                                  (3–29a)
    subject to: y_i (w^t x_i + b) − 1 ≥ 0   ∀i.                            (3–29b)
The objective given in Equation 3–29a is equivalently replaced by minimizing ||w||²/2. Usually,
the solution to Formulation 3–29 is obtained by solving its dual. In order to obtain the
dual, consider the Lagrangian of Equation 3–29, given as:
    L(w, b, u) = (1/2)||w||² − Σ_{i=1}^{N} u_i ( y_i (w^t x_i + b) − 1 ),  (3–30)
where u_i ≥ 0 ∀i. Now, observe that Formulation 3–29 is convex. Therefore,
strong duality holds, and Equation 3–31 is valid:

    min_{(w,b)} max_{u} L(w, b, u) = max_{u} min_{(w,b)} L(w, b, u).       (3–31)
Moreover, from saddle point theory [5], the following equations hold:

    w = Σ_{i=1}^{N} u_i y_i x_i                                            (3–32)

    Σ_{i=1}^{N} u_i y_i = 0.                                               (3–33)
Therefore, using Equations 3–32 & 3–33, the dual of Formulation 3–29 is given as:

    maximize:   Σ_{i=1}^{N} u_i − (1/2) Σ_{i,j=1}^{N} u_i u_j y_i y_j x_i^t x_j    (3–34a)
    subject to: Σ_{i=1}^{N} u_i y_i = 0,                                   (3–34b)
                u_i ≥ 0   ∀i.                                              (3–34c)
Thus, solving Formulation 3–34 results in obtaining the support vectors, which in turn
lead to the optimal hyperplane. This phase of SVM is called the training phase. The
testing phase is simple and can be stated as:

    y_test = { −1, test ∈ A   if f*(x_test) < 0
             { +1, test ∈ B   if f*(x_test) > 0.                           (3–35)
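Once the dual of Formulation 3–34 is solved, the primal solution is recovered via Equation 3–32 and new points are labeled via Equation 3–35. A small numerical sketch (the dual values u, the labels, and the support vectors below are assumed toy values, not the output of an actual solver):

```python
import numpy as np

# Hypothetical dual solution for a 2-D toy problem: two support vectors,
# one from each class.  These values satisfy sum u_i y_i = 0 (Eq. 3-34b).
u = np.array([0.5, 0.5])                   # dual variables, u_i >= 0 (Eq. 3-34c)
y = np.array([-1.0, 1.0])                  # class labels
X = np.array([[0.0, 0.0], [2.0, 0.0]])     # support vectors (rows)

w = (u * y) @ X                            # w = sum u_i y_i x_i (Equation 3-32)
b = y[1] - w @ X[1]                        # from y_i (w^t x_i + b) = 1 on a support vector

def classify(x_test):
    """Testing phase, Equation 3-35."""
    f = w @ x_test + b
    return -1 if f < 0 else +1
```

Here w comes out as [1, 0] and b as −1, so the separating hyperplane sits midway between the two support vectors.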
The above method works very well when the data are linearly separable. However,
most practical problems are not linearly separable. In order to extend the usability
of SVMs, soft margins and kernel transformations are incorporated into the basic linear
formulation. When considering a soft margin, the constraint in Equation 3–29b is modified as:

    y_i (w^t x_i + b) − 1 + s_i ≥ 0   ∀i,                                  (3–36)
where s_i ≥ 0 are slack variables. The primal formulation is then updated as:

    minimize:   (1/2)||w||² + c Σ_{i=1}^{N} s_i                            (3–37a)
    subject to: y_i (w^t x_i + b) − 1 + s_i ≥ 0   ∀i,                      (3–37b)
                s_i ≥ 0   ∀i.                                              (3–37c)
Similar to the linear SVM, the Lagrangian of Formulation 3–37 is given by:

    L(w, b, u, v) = (1/2)||w||² + c Σ_{i=1}^{N} s_i − Σ_{i=1}^{N} u_i ( y_i (w^t x_i + b) − 1 + s_i ) − v^t s,    (3–38)
where u_i, v_i ≥ 0 ∀i. Correspondingly, using saddle point theory and strong
duality, the soft margin SVM dual is defined as:

    maximize:   Σ_{i=1}^{N} u_i − (1/2) Σ_{i,j=1}^{N} u_i u_j y_i y_j x_i^t x_j    (3–39a)
    subject to: Σ_{i=1}^{N} u_i y_i = 0,                                   (3–39b)
                u_i ≤ c   ∀i,                                              (3–39c)
                u_i ≥ 0   ∀i.                                              (3–39d)
Furthermore, the dot product x_i^t x_j in Equation 3–39a is exploited to overcome
nonlinearity, i.e., by using kernel transformations into a higher dimensional space. Thus,
the soft margin kernel SVM has the following dual formulation:

    maximize:   Σ_{i=1}^{N} u_i − (1/2) Σ_{i,j=1}^{N} u_i u_j y_i y_j K(x_i, x_j)    (3–40a)
    subject to: Σ_{i=1}^{N} u_i y_i = 0,                                   (3–40b)
                u_i ≤ c   ∀i,                                              (3–40c)
                u_i ≥ 0   ∀i,                                              (3–40d)
where K(x, y) is any symmetric kernel. In this dissertation a Gaussian kernel is used,
which is defined as:

    K(x_i, x_j) = e^{−γ ||x_i − x_j||²},                                   (3–41)
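A minimal numpy sketch of the Gaussian kernel matrix of Equation 3–41 (the toy data and the value of γ are illustrative assumptions):

```python
import numpy as np

def gaussian_kernel(X, gamma):
    """Kernel matrix K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2), Equation 3-41.
    Rows of X are samples."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # pairwise squared distances
    return np.exp(-gamma * np.maximum(d2, 0.0))    # clamp tiny negative round-off

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = gaussian_kernel(X, gamma=0.5)
```

The resulting matrix is symmetric with unit diagonal, as required of a valid kernel.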
where γ > 0. Therefore, in order to classify the data, two parameters (c, γ) must be
given a priori. Information about the parameters is obtained from knowledge of the
structure of the input data. However, this information is intangible for practical problems.
Thus, an exhaustive logarithmic grid search is conducted over the parameter space to
find suitable values. It is worthwhile to mention that treating c & γ as variables
of the kernel SVM, and letting the kernel SVM obtain their optimal values,
makes the classification problem in Formulation 3–40 intractable.
Once the parameter values are obtained from the grid search, the kernel SVM is
trained to obtain the support vectors. Usually the training phase of the kernel SVM is
performed in combination with a re-sampling method called cross validation. During
cross validation the existing dataset is partitioned into two parts (training and testing). The
model is built based on the training data, and its performance is evaluated using the
testing data. In [84], a general method to select data for training SVMs is discussed.
Different combinations of training and testing sets are used to calculate average
accuracy. This process is mainly followed in order to avoid manipulation of the classification
accuracy results due to a particular choice of the training and testing datasets. Finally,
the reported classification accuracy is the average classification accuracy over all the
cross validation iterations. There are several cross validation methods available to
build the training and testing sets. In this work, the RRSCV method is used to train the
kernel SVM. The performance accuracy of the SVM is compared with the proposed
approaches.
3.2. Traditional Classification Methods
The goal of any learning algorithm is to obtain the optimal rule f ⋆ by solving
the classification problem illustrated in Formulation 3–6. Based on the type of loss
function used in risk estimation, the type of information representation, and the type of
optimization algorithm, different classification algorithms can be designed. A summary
of the classification methods that are used in this work is listed in Table 3-1. Next, the
conventional non-parametric and parametric approaches are presented.
Conventional Non-parametric Approaches
A classical method of classification using ANN involves training a Multi-Layer
Perceptron (MLP) using a back-propagation algorithm. Usually, a signmodal function
is used as an activation function, and a quadratic loss function is used for error
measurement. The ANN is trained using a back-propagation algorithm involving gradient
descent method [63]. Before proceeding further to present the training algorithms, let us
define the notations:
w^n_jk : The weight between the k th and j th PEs at the nth iteration.

y^n_j : Output of the j th PE at the nth iteration.

net^n_k = Σ_j w^n_jk y^n_j : Weighted sum of all outputs y^n_j of the previous layer at the nth iteration.

φ(·): Sigmoidal squashing function in each PE, defined as:

    φ(z) = (1 − e^{−z}) / (1 + e^{−z}).

y^n_k = φ(net^n_k): Output of the k th PE of the current layer, at the nth iteration.

y^n ∈ {±1}: the true label (actual label) for the nth sample.
Next, the training algorithms are described. These algorithms mainly differ in the
type of loss function used to train ANNs.
Training ANN with Quadratic loss function using Gradient descent (AQG).
This is the simplest and most widely known method of training ANN. A three layered
ANN (input, hidden, and output layers) is trained using a back-propagation algorithm.
Specifically, the generalized delta rule is used to update the weights of ANN, and the
training equations are:
    w^{n+1}_jk = w^n_jk + µ δ^n_k y^n_j,                                   (3–42)

where

    δ^n_k = (∂MSE(ε)/∂ε^n) φ′(net^n_k),                                    (3–43)

where µ is the learning step size, ε = (y^n − y^n_0) is the error (or loss), and MSE(ε) is the
mean square error. For the output layer, the deltas are computed as:

    δ^n_k = δ^n_0 = (∂MSE(ε)/∂ε^n) φ′(net^n_k) = (y^n − y^n_0) φ′(net^n_k).    (3–44)

The deltas of the previous layers are updated as:

    δ^n_k = δ^n_h = φ′(net^n_k) Σ_{o=1}^{N_0} w^n_ho δ^n_o.                (3–45)
Training ANN with Correntropic loss function using Gradient descent (ACG).
This method is similar to the AQG method; the only difference is the use of the correntropic loss
function instead of the quadratic loss function. Furthermore, the kernel width of the correntropic
loss is fixed to a smaller value (in [86], a value of 0.5 is shown to perform well).
Moreover, since the correntropic function is non-convex at that kernel width, the ANN is
trained with a quadratic loss function for some initial epochs. After a sufficient number of
epochs (ACG1), the loss function is changed to the correntropic loss function. Thus ACG1
is a parameter of the algorithm. The reason for using the quadratic loss function in the initial
epochs is to prevent convergence to a local minimum at early learning stages. Similar to
AQG, the delta rule is used to update the weights of the ANN, and the training equations
are:
    w^{n+1}_jk = w^n_jk + µ δ^n_k y^n_j,                                   (3–46)

where

    δ^n_k = (∂F(ε)/∂ε^n) φ′(net^n_k),                                      (3–47)

where µ is the step length, and F(ε) is a general loss function, which can be either the
quadratic or the correntropic function based on the current number of training epochs. For
the output layer, the deltas are computed as:

    δ^n_k = δ^n_0 = (∂F(ε)/∂ε^n) φ′(net^n_k)
          = { (β/σ²) e^{−(y^n − y^n_0)² / (2σ²)} (y^n − y^n_0) φ′(net^n_k)   if F ≡ C-loss function
            { (y^n − y^n_0) φ′(net^n_k)                                      if F ≡ MSE function,    (3–48)
where C-loss is the correntropic loss. The deltas of the previous layers are updated as:

    δ^n_k = δ^n_h = φ′(net^n_k) Σ_{o=1}^{N_0} w^n_ho δ^n_o.                (3–49)
Based on the results of [86], the value of ACG1 is taken as 5 epochs. The purpose of
comparing the proposed approaches with the ACG method is to assess the improvement in
classification accuracy when the kernel width changes smoothly.
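The two branches of Equation 3–48 differ only in the factor multiplying φ′(net); a small sketch makes the robustness of the C-loss visible (β and the kernel width σ below are assumed values):

```python
import numpy as np

beta, sigma = 1.0, 0.5   # beta and the kernel width are assumed values

def delta_factor_mse(err):
    """Error factor of the MSE branch of Equation 3-48 (the part before phi'(net))."""
    return err

def delta_factor_closs(err):
    """Error factor of the C-loss branch of Equation 3-48 (the part before phi'(net))."""
    return (beta / sigma ** 2) * np.exp(-err ** 2 / (2 * sigma ** 2)) * err

# Under the quadratic loss the update grows linearly with the error,
# while the correntropic factor attenuates large (outlier) errors.
small_err, large_err = 0.1, 3.0
```

For a small error the two factors are comparable, but for a large error the Gaussian factor of the C-loss drives the update toward zero, which is exactly the robustness to outliers motivating the correntropic loss.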
Conventional Parametric SVM Approach
Training soft margin SVM with Gaussian kernel (SGK). SVM is one of the most
widely known parametric methods in classification. In the present work, a Gaussian
kernel based soft margin SVM is used. The SVM is implemented in two steps. In the
first step, optimal parameters (kernel width and cost penalty) are obtained via an exhaustive
search over the parameter space. In the second step, the kernel SVM is trained with these
optimal parameters.
From the grid search, appropriate values of the parameters are selected. Based
on the selected parameter values, the SVM is trained with 100 Monte-Carlo
simulations. In each simulation, the data is divided into two random subsets for training
and testing (the RRSCV method). The kernel SVM is used in Chapter 3 to compare against the
results of the proposed algorithms. Next, the proposed algorithms are presented.
3.3. Proposed Classification Methods
In Section 3.3, two optimization methods that utilize the correntropic loss function
are proposed. In one of the methods, the kernel width acts as a variable, whereas in the
other method, the kernel width is set as a parameter.
Training ANN with Correntropic Loss Function Using Convolution Smoothing (ACC)
Similar to the previous ANN based methods, a back-propagation algorithm is used
to train the ANN, i.e., in this method the weights are updated using the delta rule.
However, the cost function F is always the correntropic function, and the kernel width
σ is changed over the training period. The kernel width acts as the smoothing parameter
of the CS algorithm, and initially the kernel width is set to a value of 2. As the algorithm
proceeds, the kernel width is smoothly reduced until it reaches 0.5. Furthermore, as the
algorithm progresses, if the delta rule leads to a high error value, then the kernel width is
increased to a value of 2 with probability P_accept, to escape from local minima. This
probability is reduced exponentially with the number of epochs. The ACC method
can be seen as a stochastic CS method which minimizes the correntropic loss function.
The training equations for the underlying ANN framework are as follows:

    w^{n+1}_jk = w^n_jk + µ δ^n_k y^n_j,                                   (3–50)

where for the output layer, the deltas and weights are computed as:

    δ^n_k = (∂F^σ_C(ε)/∂ε^n) φ′(net^n_k)                                   (3–51)

    δ^n_k = δ^n_0 = (∂F^σ_C(ε)/∂ε^n) φ′(net^n_k)                           (3–52)
          = (β/σ²) e^{−(y^n − y^n_0)² / (2σ²)} (y^n − y^n_0) φ′(net^n_k),  (3–53)
where F^σ_C ≡ the correntropic loss function with kernel width σ, and F^σ_C(ε) is the error at the
output layer. The deltas of the previous layers are updated as:

    δ^n_k = δ^n_h = φ′(net^n_k) Σ_{o=1}^{N_0} w^n_ho δ^n_o.                (3–54)
The ACC method is illustrated in Algorithm 3.1 for a given n × p data matrix with r
elements in the middle layer. Algorithm 3.1 represents the ACC learning method for the
block update scenario. For the sample by sample update scenario, Algorithm 3.1 is
adjusted appropriately to incorporate the CS mechanism.

In Algorithm 3.1, σ_0 and α_1 are the parameters that control the flow of the ACC method,
and their values are taken as 2 and 0.5e respectively (where e is a vector of ones). f_1 and f_2 are
the functions that update σ, and P_accept is the probability of accepting noisy solutions. For the
sake of simplicity, f_1 and P_accept are taken as exponentially decreasing functions, and f_2
resets σ to a value of 2.
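A minimal sketch of the kernel-width schedule described above (the decay rates below are illustrative assumptions; only the endpoints 2 and 0.5 come from the text):

```python
sigma0, sigma_min = 2.0, 0.5   # initial and final kernel widths (values from the text)
alpha = 0.99                   # assumed exponential decay rate

def f1(sigma):
    """Smoothly shrink the kernel width toward sigma_min (exponentially decreasing)."""
    return max(sigma_min, alpha * sigma)

def f2(sigma):
    """Reset the kernel width to 2 to help escape a local minimum."""
    return sigma0

def p_accept(epoch, p0=0.5, decay=0.9):
    """Exponentially decreasing probability of accepting a noisy reset (assumed form)."""
    return p0 * decay ** epoch

sigma = sigma0
for _ in range(1000):          # the schedule flattens out at sigma_min
    sigma = f1(sigma)
```

After enough epochs the width settles at 0.5, matching the smooth reduction from 2 to 0.5 described in the text.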
Training ANN with Correntropic Loss Function Using Simulated Annealing (ACS)
Unlike the previous gradient descent based learning methods, in this method an SA
algorithm is used to train the ANN, i.e., no gradient search is involved. This method
assumes that the correntropic loss function has a fixed kernel width. Since the kernel
width determines the convexity of the loss function, a gradient descent method cannot
be used as a learning method in a generalized framework. Hence, the SA algorithm is
used as a learning method to avoid convergence to a local minimum. The ACS method
is illustrated in Algorithm 3.2 for a given n × p data matrix with r elements in the middle
layer. Furthermore, σ = σ̄ is a given parameter of the algorithm. Moreover, the ACS
algorithm is used in block update mode only, unlike the ACC algorithm (i.e., the ACC
algorithm can be used in a sample or block based update mode).
In Algorithm 3.2, T_0 is the initial temperature, and its value is taken as 1. f_1(T)
and P_accept(T) are two different functions of temperature. f_1(T) is a simple exponential
cooling function, whereas P_accept(T) is an exponential acceptance probability, which depends
upon the values of T, ℓ_a and ℓ_{a−1}. There are two termination criteria for the ACC and ACS
methods: either the total error falls below minErr (taken as 0.001), or the number
of epochs exceeds MaxEpochs (MaxEpochs is a parameter for the experimental runs,
and is varied from 1, ..., 10).
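The accept/reject logic of the ACS method can be sketched on a generic one-dimensional loss (the neighbor move, cooling rate, and toy loss below are illustrative assumptions, not the ANN setting of Algorithm 3.2):

```python
import math
import random

def simulated_annealing(loss, w0, T0=1.0, cooling=0.95, max_epochs=500,
                        min_err=1e-3, seed=0):
    """Generic SA loop: cool the temperature, propose a neighbor, accept
    better moves always and worse moves with a temperature-dependent probability."""
    rng = random.Random(seed)
    cur_w, cur_l = w0, loss(w0)
    best_w, best_l = cur_w, cur_l
    T = T0
    for _ in range(max_epochs):
        T *= cooling                           # f1(T): exponential cooling
        cand = cur_w + rng.uniform(-0.5, 0.5)  # neighbor(W_{a-1})
        l = loss(cand)
        # accept if better, or with a probability that shrinks as T cools
        if l < cur_l or rng.random() < math.exp(-(l - cur_l) / max(T, 1e-12)):
            cur_w, cur_l = cand, l
            if cur_l < best_l:
                best_w, best_l = cur_w, cur_l
        if best_l < min_err:
            break
    return best_w, best_l

# Toy quadratic loss with minimum at w = 3; SA should move toward it.
w_best, err = simulated_annealing(lambda w: (w - 3.0) ** 2, w0=0.0)
```

Because worse moves are occasionally accepted at high temperature, the search can escape local minima, which is the reason SA is used when the correntropic loss is non-convex.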
The implementation of the proposed algorithms on simulated and real data is
presented in Chapter 5. In Chapter 4, another well known problem of data analysis is
introduced, and robust methods to solve it are proposed.
Table 3-1. Notation and description of proposed and existing methods

    Notation | Status   | Information Representation | Loss Function                                                      | Optimization Algorithm
    AQG      | Existing | Non-parametric (ANN)       | Quadratic                                                          | Exact method - Gradient Descent
    ACG      | Existing | Non-parametric (ANN)       | Initially quadratic, shifts to correntropy with fixed kernel width | Exact method - Gradient Descent
    ACC      | Proposed | Non-parametric (ANN)       | Correntropy with varying kernel width                              | Heuristic method - Convolution Smoothing
    ACS      | Proposed | Non-parametric (ANN)       | Correntropy with fixed kernel width                                | Heuristic method - Simulated Annealing
    SGQ      | Existing | Parametric (SVM)           | Quadratic with Gaussian kernel                                     | Exact method - Quadratic Optimization
Algorithm 3.1: ACC Method
input : Classification data, structure and transfer functions of ANN
output: Optimal weights

begin
    Randomly initialize W(0);
    Set σ = σ_0, µ = µ_0;
    Set termination = false;
    while termination == false do
        Execute BLOCK FEEDFORWARD PHASE - ANN;
        if random() < P_accept then
            σ = f_1(σ);
        else
            σ = f_2(σ);
        if F^σ_C(ε) < minErr then
            termination = true;
        Execute BLOCK BACKPROPAGATION PHASE - ANN;
    return (W);
Figure 3-1. Correntropic, quadratic and 0-1 loss functions. A) Margin on x-axis. B) Error on x-axis.
Algorithm 3.2: ACS Method
input : Classification data, structure and transfer functions of ANN
output: Optimal weights

begin
    Randomly initialize W(0);
    Set σ = σ̄, µ = µ_0;
    Initialize a = 0 and T = T_0;
    ℓ_0 = F^σ_C(ε_0); Set termination = false;
    while termination == false do
        T = f_1(T);
        a = a + 1;
        W_a = neighbor(W_{a−1});
        Execute BLOCK FEEDFORWARD PHASE - ANN;
        ℓ_a = F^σ_C(ε_a);
        if ℓ_a < minErr then
            termination = true;
        if ℓ_a ≥ ℓ_{a−1} then
            if random() ≥ P_accept(T) then
                W_a = W_{a−1};
                ℓ_a = ℓ_{a−1};
    return (W);
Figure 3-2. Perceptron
CHAPTER 4
ROBUST SIGNAL SEPARATION
Signal separation is a specific case of signal processing, which aims at identifying
unknown source signals s_i(t) (i = 1, ..., n) from their observable mixtures x_j(t) (j =
1, ..., m). In this problem, a mixture is assumed to be a linear transformation of the sources,
i.e., x(t) = A s(t), where A ∈ R^{m×n} is the mixing matrix (sometimes called the
dictionary). Typically, t is any acquisition variable, over which a sample of the mixture
(a column, for a discrete acquisition variable) is collected. The most common types of
acquisition variables are time and frequency. However, position, wave number, and
other indices can be used depending on the nature of the physical process under
investigation. In addition to the sources being unknown, knowledge about the mixing is
also assumed to be unavailable. The generative model of the problem in its standard form can
be written as:
X = A S + N, (4–1)
where X ∈ Rm×N denotes the mixture matrix, A ∈ Rm×n is the mixing matrix, S ∈ Rn×N
denotes the source matrix, and N ∈ R^{m×N} denotes uncorrelated noise. Since both A
and S are unknown, the signal separation problem is called the “Blind” Signal Separation
(BSS) problem. The BSS problem first appeared in [45], where the authors proposed
the seminal idea of BSS via an example of two source signals (n = 2) and
two mixture signals (m = 2). Their objective was to recover the source signals from the
mixture signals, without any further information.
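The generative model of Equation 4–1 is easy to instantiate numerically; a toy two-source, two-mixture example (all signals, the mixing matrix, and the noise level below are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
n, m, N = 2, 2, 500                 # sources, mixtures, samples (toy sizes)

t = np.linspace(0, 1, N)            # acquisition variable (time here)
S = np.vstack([np.sin(2 * np.pi * 5 * t),             # source 1: sinusoid
               np.sign(np.sin(2 * np.pi * 3 * t))])   # source 2: square wave
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])          # unknown mixing matrix (assumed here)
Noise = 0.01 * rng.normal(size=(m, N))
X = A @ S + Noise                   # generative model, Equation 4-1
```

A BSS algorithm would be given only X and asked to recover both A and S.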
A classical illustrative example for the BSS model is the cocktail party problem
where a mixture of sound signals from simultaneously speaking individuals is available
(see Figure 4-1 for a simple illustration). In a nutshell, the goal in BSS is to identify
Some sections of Chapter 4 have been published in Computers & Operations Research and Neuromethods.
and extract the sources (Figure 4-1B) from the available mixture signals (Figure 4-1A).
This problem has caught the attention of many researchers, due to its wide applicability in
different scientific research areas. A general setup of the BSS problem in computational
neuroscience is depicted in Figure 4-2. Any surface (or scalp) noninvasive cognitive
activity recording can be used as a specific example. Depending upon the scenario, the
mixture can be EEG, MEG or fMRI data. Typically, physical substances like the skull, brain
matter, muscles, and the electrode-skull interface act as mixers. The goal is to identify the
internal source signals, which hopefully reduces the mixing effect during further analysis.
Currently, most of the approaches of BSS in computational neuroscience are based
on the statistical independence assumptions. There are very few approaches that
exploit the sparsity in the signals. Sparsity assumptions can be considered more flexible
for BSS than the independence assumption, since independence
requires the sources to be at least uncorrelated. In addition to that, if the number
of sources is larger than the number of mixtures (underdetermined case), then the
statistical independence assumption cannot reveal the sources, but it can reveal the
mixing matrix. For sparsity based approaches, there are very few papers in the literature
(compared to independence based approaches) that have been devoted to develop
identifiability conditions, and to develop the methods of uniquely identifying (or learning)
the mixing matrix [1, 34, 37, 54].
In Section 4.1, an overview of the BSS problem is presented. Sufficient identifiability
conditions are reviewed, and their implications on the solution methodology are discussed
in Section 4.2. Different well known approaches that are used to find the solution of the BSS
problem are also briefly presented. Finally, the proposed algorithms are presented in
Section 4.3.
Other Look-alike Problems. BSS is a special type of Linear Matrix Factorization
(LMF) problem. There are many other methods that can be described in the form of
LMF. For instance, Nonnegative Matrix Factorization (NMF), Morphological Component
Analysis (MCA), Sparse Dictionary Identification (SDI), etc. The three properties that
differentiate BSS from other LMF problems are:
• The model is assumed to be generative: In BSS, the data matrix X is assumed to be a linear mixture of S.

• Completely unknown source and mixing matrices: Some of the LMF methods (like MCA) assume partial knowledge about the mixing.

• Identifiable source and mixing matrices: Some of the LMF methods (like NMF, SDI) focus on estimating A and S without any condition for identifiability. NMF can be considered a dimensionality reduction method like Principal Component Analysis (PCA). Similarly, SDI estimates A such that X = A S, and S is as sparse as possible. Although the NMF and SDI problems look similar to BSS, they have no precise notion of the source signals or their identifiability.
4.1. Signal Separation Problem
From this point on, a flat representation of the mixture data is assumed, i.e., the mixture signals
can be represented by a matrix containing a finite number of columns. Before presenting
the formal definition of the BSS problem, consider the following notation that will be
used throughout Chapter 4: A scalar is denoted by a lowercase letter, such as y. A
column vector is denoted by a bold lowercase letter, such as y, and a matrix is denoted
by a bold uppercase letter, such as Y. For example, in Chapter 4, the mixtures are
represented by the matrix X. The i th column of matrix X is represented as x_i. The i th row of
matrix X is represented as x_{i•}. The i th row, j th column element of matrix X is represented
as x_{i,j}.
Now, the BSS problem can be mathematically stated as: Let X ∈ R^{m×N} be
generated by a linear mixing of sources S ∈ R^{n×N}. Given X, the objective of the BSS
problem is to find two matrices A ∈ R^{m×n} and S, such that the three matrices are related
as X = A S. In the theoretical development of the problem and the solution methods,
the noise factor is ignored. Without noise the problem may appear easy; however, from the
very definition of the problem, it can be seen that the solution of the BSS problem suffers
from uniqueness and identifiability issues. Thus the notion of a “good”
solution to the BSS problem must be precisely defined. Next, the uniqueness and
identifiability issues are explained.
Uniqueness: Let Λ ∈ R^{n×n} be a diagonal matrix and Π ∈ R^{n×n} a permutation matrix.
Let A and S be such that X = A S. Consider the following:

    X = A S = (A Λ Π) (Π^{−1} Λ^{−1} S) = A_a S_a.
Thus, even if A and S are known, there can be infinitely many equivalent solutions of the
form A_a and S_a. The goal of a good BSS solution algorithm should be to find at least one
of the equivalent solutions. Due to the inability of finding a unique solution, not only
is the information regarding the order of the sources lost, but the information about the energy
contained in the sources is lost as well. Generally, normalization of the rows of S may be used to
tackle the scaling ambiguity. Also, a relative or normalized form of the energy can be used in
further analysis. Theoretically, any information pertaining to the order of the sources is impossible to
recover. However, problem specific knowledge can be helpful in identifying the correct order
for further analysis.
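The scaling and permutation ambiguity can be verified numerically in a few lines (A, S, Λ, and Π below are arbitrary illustrative choices):

```python
import numpy as np

A = np.array([[1.0, 2.0], [0.5, 1.5]])
S = np.array([[1.0, 0.0, 2.0], [0.0, 3.0, 1.0]])
X = A @ S

Lam = np.diag([2.0, 0.5])                # arbitrary diagonal (scaling) matrix
Pi = np.array([[0.0, 1.0], [1.0, 0.0]])  # permutation matrix (swap the sources)

# An equivalent factorization: X = (A Lam Pi)(Pi^-1 Lam^-1 S)
A_a = A @ Lam @ Pi
S_a = np.linalg.inv(Pi) @ np.linalg.inv(Lam) @ S
```

The product A_a S_a reproduces X exactly, even though the recovered sources are scaled and reordered versions of the originals.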
Identifiability: Let Γ ∈ R^{n×n} be any nonsingular matrix. Let A and S be such that
X = A S. Consider the following:

    X = A S = (A Γ) (Γ^{−1} S) = A_Γ S_Γ.

Thus, even if A and S are known, there can be infinitely many non-identifiable solutions of
the form A_Γ and S_Γ. The goal of a BSS solution algorithm is to avoid the non-identifiable
solutions. Typically, the issue of identifiability arises from the dimension and structure of
A and S. The key idea to correctly identify both the matrices (of course with unavoidable
scaling and permutation ambiguity) is to impose structural properties on S while solving
the BSS problem (see Figure 4-3). Some widely known BSS solution approaches [90]
from the literature are summarized below.
Statistical Independence Assumptions: One of the earliest approaches to
solve the BSS problem is to assume statistical independence among the source
signals. These approaches are termed Independent Component Analysis (ICA)
approaches. The fundamental assumption in ICA is that the rows of matrix S are
statistically independent and non-gaussian [50, 94].
Sparse Assumptions: Apart from ICA, the other type of approaches, which provide
sufficient identifiability conditions, are based on the notion of sparsity in the S matrix.
These approaches can be named Sparse Component Analysis (SCA) approaches.
There are two distinct categories in the sparse assumptions:

• Partially Sparse Nonnegative Sources (PSNS): In this category, along with a certain level of sparsity, the elements of S are assumed to be nonnegative. Ideas of this type of approach can be traced back to the Nonnegative Matrix Factorization (NMF) method. The basic assumption in NMF is that the elements of S and A are nonnegative [21]. However, in the case of the BSS problem the nonnegativity assumptions on the elements of matrix A can be relaxed [67] without damaging the identifiability of A and S.

• Completely Sparse Components (CSC): In this category, no sign restrictions are placed on the elements of S, i.e., s_{i,j} ∈ R. The only assumption used to define the identifiability conditions is the existence of a certain level of sparsity in every column of S [32].
At present, these are the only known BSS approaches that can provide sufficient
identifiability conditions (uniqueness up to permutation and scaling). In fact, the
sparsity based approaches (see [34, 67]) are relatively new in the area of BSS when
compared to the traditional statistical independence approaches (see [50]). One of the
novelties that sparsity based methods brought to the BSS problem is the verifiability of
the sparse assumptions on finite length data. Furthermore, not only overdetermined
but also underdetermined scenarios of the BSS problem can be handled by the sparsity
based methods. However, the underdetermined scenario requires a higher level of sparsity
than the simple m = n scenario. In Section 4.2, a brief discussion of the important
issues of the sparsity based methods is presented [90].
4.2. Traditional Sparsity Based Methods
The earliest methods that proposed the notion of sparsity and the identifiability
conditions for BSS problems can be found in [33, 34, 67]. From the literature, the different
approaches to solve the Sparse Component Analysis (SCA) problem can be grouped into
two distinct classes. The main difference between the two classes is the
nonnegativity assumption on the elements of the S matrix. The reason for this division
is the structure of the resulting SCA problem. Typically, when the sources are
non-negative, the SCA problem boils down to a convex programming problem.
Thus, the algorithms for the class with nonnegativity assumptions are computationally
inexpensive. For the other class, the SCA problem generally results in a
nonconvex optimization problem. Therefore, finding a global optimal solution, when
the source elements are real, is a computationally expensive task.
SCA can be considered a more flexible method for BSS than ICA. ICA requires the
sources to be statistically independent, whereas SCA requires sparsity of the sources (a
weaker assumption). In addition, ICA is not suitable if the number of sources is
larger than the number of mixtures (the underdetermined case). Typical ideas of SCA can
be found in [34, 37, 54]. Furthermore, the identifiability conditions on X that improve the
separability of the sources have been studied by a few researchers [1, 34].
Partially Sparse Nonnegative Sources (PSNS)
In many physiological data scenarios, the notion that the source signal is nonnegative
seems to be valid; for example, medical imaging, NMR, ICP, HR, etc. Using this ideology,
and the fact that ICA at least requires completely uncorrelated source signals, a partially
correlated BSS method can be developed. A source matrix S is defined to be partially
correlated when the rows of a certain set of columns of S are uncorrelated, while
the rows of the full S matrix are correlated. For sources on which the nonnegativity
assumption holds, the partially correlated assumption is less restrictive than ICA. The
primary idea on which this class of SCA methods works can be summarized as: any
vector x_i ∀i = 1, ..., N is nothing but a nonnegative linear combination of the vectors
a_j ∀j = 1, ..., n. Thus, sparse assumptions on S may lead to proper identification
of A, which can be exploited in order to identify A and S. One of the earliest approaches
of this type, presented by Naanaa and Nuzillard [67], is called the Positive and
Partially Correlated (PPC) method. Next, the sufficient identifiability conditions for PPC
will be discussed.
Sufficient Identifiability Conditions on A and S for PPC [67]
Following are the two sufficient conditions, which are required for unique identification
of A and S (up to scaling and permutation ambiguity):
• PPC1: There exists a diagonal submatrix in S: for each row i of S there exists a j ∈ {1, ..., N} such that s_{i,j} = 0 and s_{k,j} > 0 for k = 1, ..., i − 1, i + 1, ..., n.

• PPC2: The columns of A are linearly independent.
Implication of the Identifiability Conditions for PPC
Due to the restriction given in PPC1, the PPC BSS problem boils down to the
following: all the columns of matrix X span a cone in R^m, where the edges of the cone
are nothing but the columns of matrix A. Using this simplification, suitable linear or
convex programming problems can be solved to identify the edges of the cone spanned
by the columns of X. Finding these edges results in the identification of A. The matrix S can then
be obtained by using the Moore-Penrose pseudoinverse of A.
PPC Approaches
In [67], a least squares minimization problem is proposed to solve the PPC problem.
The formulation is given as:

minimize : ‖ ∑_{i=1, i≠j}^{N} α_i x_i − x_j ‖²  (4–2a)

subject to :

α_i ≥ 0 ∀ i.  (4–2b)
In addition to the above formulation, based on the same edge extraction idea, many
recent works are directed towards efficient edge extraction from X [14, 103].
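The edge-extraction idea behind Formulation 4–2 can be sketched with nonnegative least squares: a column that cannot be represented as a nonnegative combination of the other columns leaves a large residual and is therefore an edge of the cone. This is an illustrative sketch (function name and toy data are the author's of this sketch, not from [67]):

```python
import numpy as np
from scipy.optimize import nnls

def ppc_edge_residuals(X):
    """For each column x_j, solve the nonnegative least-squares
    problem of Formulation 4-2: min || sum_{i != j} a_i x_i - x_j ||,
    a_i >= 0.  A large residual means x_j cannot be represented by
    the remaining columns, i.e. it lies on an edge of the cone."""
    m, N = X.shape
    residuals = np.empty(N)
    for j in range(N):
        others = np.delete(X, j, axis=1)
        _, res = nnls(others, X[:, j])   # res is the residual 2-norm
        residuals[j] = res
    return residuals

# Toy mixture: columns 0 and 1 are the cone edges, column 2 is interior.
X = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.5]])
r = ppc_edge_residuals(X)
```

Thresholding the residuals then separates edge columns (candidates for columns of A) from interior points.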
Another recent modification of the PPC approach is called Positive everywhere Partially
orthogonal Dominant intervals (PePoDi) [88]. In PePoDi, the PPC1 condition is modified
by stating that the last row of S is positive dominant and does not satisfy PPC1.
However, this modification comes at the price of restricting A to be nonnegative.
Thus, the PePoDi method can be seen as a special case of the NMF problem.
Complete Sparse Component Sources [36]
When the sources are not nonnegative, the BSS problem transforms into a
nonconvex optimization problem. In fact, the only identifiability condition known for
real-valued sources is sparsity in each column of S. Before defining the complete sparse
component (CSC) criteria, consider the following definitions:
CSC-conditioned: A matrix M is said to be CSC-conditioned if every square
submatrix of M is nonsingular.

CSC-sparse: A matrix M is said to be CSC-sparse if every column of M has at most
m − 1 nonzero elements.

CSC-representable: A matrix M is said to be CSC-representable if for any n − m + 1
selected rows of M, there exist m columns such that:
• All the m columns contain zeros in the selected rows, and
• Any m − 1 subset of the m columns is linearly independent.
Sufficient Identifiability Conditions on A and S for CSC
Following are the three sufficient conditions, which are required for unique
identification of A and S (up to scaling and permutation ambiguity):
• CSC1: A is CSC-conditioned,
• CSC2: S is CSC-sparse,
• CSC3: S is CSC-representable
Implication of the Identifiability Conditions for CSC
Due to the restrictions given in CSC2 and CSC3, the CSC BSS problem boils down
to the following: all the columns of matrix X lie on n hyperplanes passing through the
origin, where the normal vectors of the hyperplanes are nothing but the orthonormal
complement of the matrix A. Using this transformation, suitable hyperplane clustering
methods can be used to identify the hyperplanes defined by X. Since hyperplane
clustering is nonconvex, the CSC BSS problem is relatively difficult to solve
compared to the PPC BSS problem.
CSC Approaches
Given data matrix X ∈ Rm×N, the goal of CSC is to find two matrices, namely,
the mixing matrix (A ∈ Rm×n) and the source matrix (S ∈ Rn×N), such that X = A · S.
Under the CSC1, CSC2 and CSC3 assumptions, uniqueness up to permutation and
scaling can be achieved. Next, the basic formulation of the CSC BSS problem is
described, and different improvements over the basic formulation are proposed. Before
proceeding further, let us describe the notation that will be used in the following
formulations:
Given Data:
p : index for a point, p ∈ {1, ... , N}
X : (x1, ... , xN) = data matrix of N points, xp ∈ Rm
n : the column size of the dictionary matrix

Variables:
h : index for a hyperplane, h ∈ {1, ... , n}
wh : normal vector of the hth hyperplane, wh ∈ Rm
uhp : distance between the pth point and the hth hyperplane, uhp ∈ R+
thp : 1 if the pth point belongs to the hth hyperplane, 0 otherwise
vhp : ancillary variable, which represents the product thp·uhp in a linearised form
Mathematically, the set of hyperplanes containing the data points is a solution to
Formulation 4–3:

minimize : ∑_{p=1}^{N} min_{1≤h≤n} (w_h^t x_p − b_h)²  (4–3a)

subject to :

‖w_h‖₂ = 1,  (4–3b)
w_h ∈ R^m,  (4–3c)
b_h ∈ R.  (4–3d)

Therefore, any solution of Formulation 4–3 will represent a w(2)-skeleton of X [10]. It
consists of n hyperplanes defined as:

H_h = {x ∈ R^m : w_h^t x = b_h} ∀ h = 1, ... , n.  (4–4)
Another approach for hyperplane clustering is presented in [81], which can be described
via Formulation 4–5:

minimize : ∑_{p=1}^{N} min_{1≤h≤n} |w_h^t x_p − b_h|  (4–5a)

subject to :

(4–3b)–(4–3d).  (4–5b)

The solution to Formulation 4–5 defines the w(1)-skeleton of X. Formulation 4–5 is
analogous to Formulation 4–3 in defining the hyperplanes. However, the main difference
is that Formulation 4–5 minimizes the absolute distances, whereas Formulation 4–3
minimizes the squared distances. This does not seem to be a huge difference;
however, absolute distance minimization is considered to be a robust approach.
The equivalence of both formulations and the uniqueness of their solutions under
sparsity assumptions are discussed in [20, 32]. Moreover, Georgiev et al. [32] have
reduced the hyperplane clustering problem to a bilinear formulation in the case
when every data point belongs to only one skeleton hyperplane (and therefore, the
minimum value in Formulation 4–5 is zero). Then Formulation 4–5 is equivalent to
Formulation 4–6.

In order to obtain the bilinear formulation, the nonlinear constraint given in
Equation 4–6e is replaced with w_h^t e = 1 (where e is the vector of all ones). This
replacement does not change the hyperplanes: those defined by solutions of the modified
problem coincide with those defined by solutions of Formulation 4–6. Different
optimization methods can be applied to solve the bilinear problem. In [32], an n-plane
clustering algorithm via linear programming is proposed to solve the bilinear problem.
Algorithm 4.1 briefly describes this n-plane clustering algorithm. The initial approaches
to the CSC BSS problem are based on this bilinear hyperplane clustering approach [32].
However, the main drawback of the algorithm is its
convergence to local minima. In fact, most of the hyperplane clustering methods in the
literature are confined to 7 to 8 dimensions.
minimize : ∑_{p=1}^{N} ∑_{h=1}^{n} t_hp u_hp  (4–6a)

subject to :

w_h^t x_p ≤ u_hp ∀ h, p,  (4–6b)
w_h^t x_p ≥ −u_hp ∀ h, p,  (4–6c)
∑_h t_hp = 1 ∀ p,  (4–6d)
‖w_h‖₂ = 1 ∀ h,  (4–6e)
t_hp ≤ 1 ∀ h, p,  (4–6f)
t_hp ≥ 0 ∀ h, p,  (4–6g)
w_h ∈ R^m ∀ h,  (4–6h)
u_hp ≥ 0 ∀ h, p.  (4–6i)
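The alternating assignment-and-refit idea behind the n-plane clustering of Algorithm 4.1 can be sketched as follows. This is a simplified stand-in: the refit step here uses the singular vector of the smallest singular value instead of the LP subproblem of [32], and random restarts guard against the local minima noted in the text; all names and the toy data are illustrative:

```python
import numpy as np

def n_plane_cluster(X, n, iters=30, restarts=5, seed=0):
    """Alternating hyperplane clustering: assign each point to its
    nearest hyperplane through the origin, then refit each normal
    vector from its assigned points (SVD refit, not the LP of [32])."""
    m, N = X.shape
    rng = np.random.default_rng(seed)
    best_W, best_labels, best_err = None, None, np.inf
    for _ in range(restarts):
        W = rng.standard_normal((m, n))
        W /= np.linalg.norm(W, axis=0)
        for _ in range(iters):
            labels = np.argmin(np.abs(W.T @ X), axis=0)  # nearest plane
            for h in range(n):
                pts = X[:, labels == h]
                if pts.shape[1] >= m:
                    U, _, _ = np.linalg.svd(pts)
                    W[:, h] = U[:, -1]   # normal = least singular direction
        err = np.abs(W.T @ X).min(axis=0).sum()
        if err < best_err:
            best_W, best_err = W.copy(), err
            best_labels = np.argmin(np.abs(W.T @ X), axis=0)
    return best_W, best_labels

# Two lines through the origin in R^2: a tiny 2-plane toy instance.
t = np.linspace(-1, 1, 40)
X = np.hstack([np.outer([1.0, 0.0], t), np.outer([0.0, 1.0], t)])
W, labels = n_plane_cluster(X, n=2)
```

On clean data the total point-to-plane distance of the recovered normals drops to zero; on noisy or higher-dimensional data the local-minima issue discussed above reappears.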
4.3. Proposed Sparsity Based Methods
The goal of Section 4.3 is to present the proposed approaches for the SCA
problem. Specifically, the standard preprocessing method for the BSS problem is
illustrated. In addition, novel methods for both the PPC and CSC cases of the SCA
problem are developed. Furthermore, a robust correntropy minimization method for
source extraction is also proposed.
Data Preprocessing and Recovery
Before using the proposed methods, the given data X is preprocessed using the
prewhitening method. This is done in order to reduce the ill-conditioning effect on X
arising from the dictionary matrix. For example, consider the source matrix S ∈ R3×80
shown in Figure 4-4. The dictionary matrix is A (see Equation 4–7). In this
example m < n (m = 2, n = 3). The data X ∈ R2×80 is shown in Figure 4-5. The
processed data is shown in Figure 4-6. From Figures 4-4, 4-5 & 4-6 the ill-conditioning
effect and the prewhitening enhancement can be easily observed.
A = [ 1.0000 0.9000 1.1000 ; 1.0000 0.8500 1.1500 ].  (4–7)
Consider the following eigenvalue decomposition:

Σ = XX^T = QΛQ^T,  (4–8)

where Λ is a square diagonal matrix whose elements are the eigenvalues of Σ, and Q is a
square orthonormal matrix of eigenvectors of Σ. Since Σ is positive semi-definite,
all the elements of Λ are nonnegative. Thus, a transformation matrix Ψ can be defined
as:

Ψ = Λ^{−1/2} Q^T.  (4–9)

Now X can be transformed as:

X̃ = ΨX.  (4–10)

We redefine the dictionary matrix as Ã = ΨA and have the following model:

X̃ = ΨAS = ÃS.  (4–11)

The reason for such a transformation is that the ill-conditioning effect due to mixing of
the original sources can be reduced. If the original sources were uncorrelated, then
SS^T = I. Therefore, ÃÃ^T = I, as shown below:

ÃÃ^T = ΨAA^TΨ^T
     = ΨASS^TA^TΨ^T
     = ΨXX^TΨ^T
     = Λ^{−1/2}Q^T QΛQ^T QΛ^{−1/2}  (4–12)
     = I.  (4–13)
However, we do not assume that SS^T = I; the above transformation still helps
in finding the hyperplanes. For the case when m = n, once the optimal solutions of all n
optimization problems are obtained, the source matrix from non-noisy mixtures is
obtained as:

Sπ = W^T X̃ = W^T ΨX,  (4–14)

where Sπ = Pπ S and Pπ is a monomial matrix (i.e., each row and each column contains
only one non-zero element). The source extraction method for noisy mixtures is
considered at the end of Chapter 4. Unless other information about A is known, the
correspondence between rows of S and Sπ is hard to determine. Similarly, matrix A is
obtained by solving Equation 4–15:
obtained by solving Equation 4–15:
WT A = Pπ. (4–15)
However Pπ, is unknown. Therefore, Equation 4–15 can be solved by a simple
assumption on Pπ matrix (i.e., Pπ = I). Moreover, the resulting dictionary matrix
will be unique up to permutation and scalability of columns. For example, solving system
of equations given by Equation 4–16 will be enough:
WT Aπ = I. (4–16)
88
Finally, A can be obtained as:
Aπ = Q�12 Aπ. (4–17)
To sum, although the actual A and S matrices cannot be identified, they can be obtained
in permuted and scaled forms when m = n.
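The prewhitening transform of Equations 4–8 to 4–10 can be sketched in a few lines of NumPy. The function name, the small regularizer eps, and the random source data are this sketch's assumptions; the mixing matrix mirrors the ill-conditioned A of Equation 4–7:

```python
import numpy as np

def prewhiten(X, eps=1e-12):
    """Whitening of Eqs. 4-8 to 4-10: eigen-decompose
    Sigma = X X^T = Q Lambda Q^T and apply Psi = Lambda^{-1/2} Q^T,
    so that (Psi X)(Psi X)^T = I."""
    Sigma = X @ X.T
    evals, Q = np.linalg.eigh(Sigma)           # Sigma is symmetric PSD
    Psi = np.diag(1.0 / np.sqrt(evals + eps)) @ Q.T
    return Psi @ X, Psi

# Ill-conditioned mixture similar to the matrix A of Eq. 4-7.
rng = np.random.default_rng(1)
A = np.array([[1.00, 0.90, 1.10],
              [1.00, 0.85, 1.15]])
S = np.abs(rng.standard_normal((3, 80)))       # nonnegative toy sources
X = A @ S
Xw, Psi = prewhiten(X)
```

After the transform the whitened data satisfies Xw Xw^T ≈ I, which is exactly the property used in the derivation of Equations 4–12 and 4–13.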
PPC Robust Method for Dictionary Identification
Given data matrix X ∈ Rm×N , the goal of PPC is to find two matrices, namely
mixing (A ∈ Rm×n) and source (S ∈ Rn×N+ ), such that X = A · S. While developing
the algorithm, it is assumed that the source signals are non-negative. The proposed
algorithm is as follows:
• Step 1: Normalize all the columns of X.

• Step 2: Solve the following LP to get the projection direction:

minimize : β  (4–18)
subject to :
β ≥ −d^T x_i ∀ i,  (4–19)
−d^T x_i ≤ 0 ∀ i,  (4–20)
−2 ≤ d_j ≤ 2.  (4–21)

The above formulation generates a projection vector d which is inside the cone
formed by the columns of X.

• Step 3: Normalize the vector d.

• Step 4: Project the points on an n-dimensional simplex plane orthogonal to d, i.e.,
update each point x_i as x_i = x_i / (d^T x_i).

• Step 5: Translate the points such that the plane containing the n-dimensional
simplex passes through the origin. This can be done by centering the data, i.e., for
each data point use the following transformation:

x_i = (x_i − x̄) / std,  (4–22)

where x̄ and std are respectively the mean and standard deviation of all the
columns of X.

• Step 6: An affine transformation, like Principal Component Analysis (PCA), can be
used to transform the n-simplex from n + 1 dimensions to n dimensions. The PCA
method: identify the eigenvalues and eigenvectors of XX^T:

U D_n U^T = XX^T.

Rearrange the eigenvalues in the diagonal of D_n in decreasing order of their values.
Let D_{n−1} be the submatrix of D_n constructed by eliminating the last row and last
column. Let Y be created such that y_i = U^T x_i ∀ i. Let Z represent the submatrix
of Y obtained by eliminating the last row of matrix Y. The matrix Z is an affine
transformation and dimensionality reduction of matrix X.

• Step 7: If the PPC conditions are satisfied, then find the n vertices. If not, then
approximately find the best n extreme points.
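The LP of Step 2 (Equations 4–18 to 4–21) maps directly onto a standard LP solver by stacking the decision vector as v = [d; β]. This sketch uses `scipy.optimize.linprog`; the function name and the toy cone data are illustrative:

```python
import numpy as np
from scipy.optimize import linprog

def projection_direction(X):
    """Step 2 LP (Eqs. 4-18 to 4-21): minimize beta subject to
    beta >= -d^T x_i, -d^T x_i <= 0 and -2 <= d_j <= 2, which yields
    a direction d lying inside the cone spanned by the columns of X.
    Decision vector: v = [d_1 .. d_m, beta]."""
    m, N = X.shape
    c = np.zeros(m + 1)
    c[-1] = 1.0                                       # minimize beta
    A1 = np.hstack([-X.T, -np.ones((N, 1))])          # -d^T x_i - beta <= 0
    A2 = np.hstack([-X.T, np.zeros((N, 1))])          # -d^T x_i <= 0
    res = linprog(c, A_ub=np.vstack([A1, A2]), b_ub=np.zeros(2 * N),
                  bounds=[(-2, 2)] * m + [(None, None)])
    return res.x[:m]

# Columns lie in the positive quadrant; d should point into the cone.
X = np.array([[1.0, 0.2, 0.6],
              [0.1, 1.0, 0.7]])
d = projection_direction(X)
```

Every column then has a nonnegative projection on d, so the scaling of Step 4 is well defined.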
The projection-based idea is an extension of the method proposed in [14]. However,
that approach did not address the scenario of negative elements in the mixing matrix.
The proposed method can incorporate negative elements in the mixing matrix. A
recent approach also addresses the issue of negative elements in the mixing matrix [103].
Furthermore, the major advantage of the proposed approach over the earlier methods
[14, 103] is that it avoids solving a large number of LPs. The only LP that we solve is in
Step 2. For Step 7, instead of solving many LPs, the following projection approach is
proposed:
Projection Approach: Initially, the data points are projected on the normal vectors
to the edges of the standard n-dimensional simplex projected on the n-dimensional
space. The maximum and minimum projections for the initial n normal vector projections
are archived. Next, the standard simplex is randomly rotated, and a new set of normal
vectors is used for the projection. Again, the maximum and minimum projections
are archived. If the total number of minimum and maximum projection points is equal
to n + 1 points, then the PPC assumptions are satisfied. Furthermore, the n vertices
can now be obtained from the archive. However, if there are more than n + 1 points,
then this indicates that the PPC assumptions are not satisfied. In this case, from the set
of archived points (potential candidates for vertices), the one with the maximum norm is
picked. The maximum norm point is taken as a best extreme point. Now, the rest of
the archived points are projected on a hyperplane passing through the origin with a
normal vector passing through the identified extreme point. The projected archived
points can now be used to reduce the problem size by one dimension. This process of
projection and dimension reduction is continued n times to identify all the best extreme
points. It is to be noted that the projection and dimension reduction phase of the
proposed approach utilizes the archived points only.
CSC Robust Method for Dictionary Identification
Given data matrix X ∈ Rm×N, the goal of CSC is to find two matrices, namely,
the mixing matrix (A ∈ Rm×n) and the source matrix (S ∈ Rn×N), such that X = A · S.
An alternative approach, which is developed in this dissertation, is to solve the bilinear
problem given in Formulation 4–6 via a 0–1 linear reformulation [89]. Next, the 0–1
formulation for CSC is presented:
minimize : ∑_{p=1}^{N} ∑_{h=1}^{n} v_hp  (4–23a)

subject to :

(4–6b)–(4–6d), (4–6h), (4–6i),  (4–23b)
w_h^t e = 1 ∀ h,  (4–23c)
v_hp ≤ M1 t_hp ∀ h, p,  (4–23d)
v_hp ≤ u_hp ∀ h, p,  (4–23e)
v_hp ≥ u_hp − M2(1 − t_hp) ∀ h, p,  (4–23f)
t_hp ∈ {0, 1} ∀ h, p,  (4–23g)
v_hp ≥ 0 ∀ h, p,  (4–23h)
where M1 and M2 are sufficiently large positive scalars. Formulations 4–6 and 4–23 are
equivalent. Clearly, the MIP can be solved sequentially for each hyperplane. Before
defining the hierarchy based MIP formulation, let us introduce the following notation:
Notations:

w⋆_r : optimal solution of the r-th optimization problem given by Formulation 4–24.

H⋆_r : hyperplane passing through the origin whose normal vector is w⋆_r.

P_r : index set of points, defined as P_r = P_{r−1} \ Rϵ_{r−1} for r = 2, ... , n, where
P_1 = {1, ... , N}.

Rϵ_r : index set of points which are within ϵ distance from the hyperplane H⋆_r,
defined as Rϵ_r = {p : |w⋆_r^t x_p| ≤ ϵ}, where ϵ > 0 is a given threshold
such that Rϵ_r has at least m + 1 elements.
minimize : ∑_{p∈P_r} α_p v_p − ∑_{p∈P_r} β_p t_p  (4–24a)

subject to :

−u_p ≤ w_r^t x_p ≤ u_p, p ∈ P_r,  (4–24b)
u_p − M1(1 − t_p) ≤ v_p, p ∈ P_r,  (4–24c)
v_p ≤ u_p, p ∈ P_r,  (4–24d)
v_p ≤ M2 t_p, p ∈ P_r,  (4–24e)
w_r^t e = 1,  (4–24f)
∑_{p∈P_r} t_p ≥ m + 1,  (4–24g)
t_p ∈ {0, 1}, p ∈ P_r,  (4–24h)
u_p ≥ 0, p ∈ P_r,  (4–24i)
v_p ≥ 0, p ∈ P_r,  (4–24j)
w_r ∈ R^m.  (4–24k)
Since the formulation considers one hyperplane at a time, the second index of the
double indexed variables can be dropped. For example, v_p is nothing but v_pr; a similar
argument follows for u_p and t_p. α_p and β_p are scaling factors, and are arbitrarily
selected. Clearly, the non-hierarchical Formulation 4–23 has N · n binary variables,
whereas the r-th iteration of the hierarchical Formulation 4–24 has |P_r| binary variables
(where |P_r| < N ∀ r > 1). Moreover, for any two iterations r1, r2 with r2 > r1, we
have |P_{r1}| > |P_{r2}|. Probabilistically, the complexity at each iteration is reduced. This is
due to the fact that in the r-th iteration, the probability that x_p, p ∈ P_r, will lie in the
remaining n − r + 1 planes is 1/(n − r + 1) (since X is BSS-skeletable). Ideally, if there is
no noise in the data and all the earlier iterations converged to the global optimal solution,
then the n-th iteration is redundant. The proposed hierarchical approach for solving
Formulation 4–24 is presented in Algorithm 4.2. The steps of the proposed hierarchical
approach are illustrated by the flowchart shown in Figure 4-7.
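The peeling step that shrinks P_r between iterations can be sketched directly: given the optimal normal w⋆_r, the points within ϵ of the hyperplane form Rϵ_r and are removed from the active set. The function name and toy data are this sketch's assumptions:

```python
import numpy as np

def peel_hyperplane(X, active, w_star, eps=1e-3):
    """One peeling step of the hierarchical scheme: compute
    R^eps_r = {p : |w*_r^t x_p| <= eps} over the active index set
    and return it together with P_{r+1} = P_r \\ R^eps_r."""
    dist = np.abs(w_star @ X[:, active])
    hit = active[dist <= eps]          # points on the r-th hyperplane
    remaining = active[dist > eps]     # active set for the next MIP
    return hit, remaining

# Three points on the plane with normal (0, 0, 1), two off-plane points.
X = np.array([[1.0, 2.0, -1.0, 0.5, 0.3],
              [0.0, 1.0,  2.0, 0.1, 0.9],
              [0.0, 0.0,  0.0, 1.0, -2.0]])
active = np.arange(5)
hit, remaining = peel_hyperplane(X, active, np.array([0.0, 0.0, 1.0]))
```

Each iteration therefore hands the next MIP a strictly smaller point set, which is the source of the complexity reduction noted above.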
Robust Method for Source Extraction
Once the dictionary is known, the source extraction problem can be simplified as:

S = pinv(A) X,  (4–25)

where pinv(·) is the pseudoinverse function. This method works only when X is free from
outliers. However, when the mixture matrix contains outliers, the above solution approach
will not work. For such scenarios, the following algorithm is proposed. Consider the
following optimization problem:
following optimization problem:
minimize :
∥AS− X∥ (4–26a)
subject to :
S ∈ Rn×N , (4–26b)
where A ∈ R^{m×n} and X ∈ R^{m×N}. Typically, the above problem is solved as a
quadratic error minimization problem. Such methods are not robust when the elements of
the data (A and/or X) are contaminated with outliers. The goal is to present a robust
method for source extraction, which is insensitive to outliers. Specifically, the following
problem is considered:

minimize :

F^σ_C(Y) + α F^σ_C(S)  (4–27a)

subject to :

Y = AS − X,  (4–27b)
S ∈ R^{n×N},  (4–27c)
Y ∈ R^{m×N},  (4–27d)

where F^σ_C is the correntropic loss function, and α is a known weight (or a parameter)
for regularization, which controls the sparsity in S. Let the vector z ∈ R^{N(m+n)} be
defined as:
z_i = y_{⌈i/N⌉, i−(⌈i/N⌉−1)N}  if i ≤ mN,
z_i = s_{⌈(i−mN)/N⌉, (i−mN)−(⌈(i−mN)/N⌉−1)N}  otherwise.  (4–28)

Let C ∈ R^{mN×(m+n)N} be defined as:

C = [−I_{mN}, A ⊗ I_N].  (4–29)
94
The above problem can be transformed as:

minimize :

−∑_{i=1}^{(m+n)N} α_i exp(−z_i² / 2σ²)  (4–30a)

subject to :

Cz = d,  (4–30b)
z ∈ R^{N(m+n)},  (4–30c)

where d ∈ R^{mN} is defined as d_i = x_{⌈i/N⌉, i−(⌈i/N⌉−1)N}, and

α_i = 1 if i ≤ mN, α_i = α otherwise.  (4–31)

Based on the value of σ, Formulation 4–30 can move from the convex domain to the
invex domain. Specifically, the problem will be a convex programming problem when
σ² ≥ z_i² ∀ i = 1, ... , (m + n)N.
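The convexity condition σ² ≥ z_i² can be checked numerically through the diagonal of the Hessian (Equation 4–38): its sign flips exactly where |z_i| exceeds σ. A minimal sketch, with illustrative function names and sample values:

```python
import numpy as np

def correntropic_loss(z, sigma, alpha=1.0):
    """Correntropic loss of Eq. 4-30a: -sum_i alpha_i exp(-z_i^2 / 2 sigma^2)."""
    return -np.sum(alpha * np.exp(-z**2 / (2.0 * sigma**2)))

def loss_curvature(z, sigma, alpha=1.0):
    """Diagonal of the Hessian (Eq. 4-38): positive, i.e. locally
    convex, exactly when sigma^2 >= z_i^2 for that component."""
    return (alpha / sigma**2) * np.exp(-z**2 / (2 * sigma**2)) \
           * (sigma**2 - z**2) / sigma**2

z = np.array([0.5, -0.8, 3.0])   # the entry 3.0 acts like an outlier
c_small = loss_curvature(z, sigma=1.0)   # outlier entry: negative curvature
c_large = loss_curvature(z, sigma=5.0)   # all entries: positive curvature
```

A small kernel width thus places outliers in the invex (flat-tail) region of the loss, which is precisely the robustness mechanism exploited later in this section.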
Consider the Lagrangian of Formulation 4–30:

L(z, v) = −∑_{i=1}^{(m+n)N} α_i exp(−z_i²/2σ²) + v^T (Cz − d),  (4–32)

where v ∈ R^{mN} are the dual variables. The KKT system of Formulation 4–30 will be:

∇F^σ_C(z) + C^T v = 0,  (4–33)
Cz = d,  (4–34)

where [∇F^σ_C(z)]_i = (α_i/σ²) exp(−z_i²/2σ²) z_i ∀ i = 1, ... , (m+n)N. Solving
Equations 4–33 and 4–34 gives the solution for minimum correntropy error with α
regularity.
Let z^(r) be the current feasible solution, and let d^(r+1) be an improving and
feasible direction. Consider the linear approximation of the gradient of a twice
differentiable function:

∇f(w + u) ≈ ∇f(w) + ∇²f(w) u.  (4–35)

Using the above approximation, Equations 4–33 and 4–34 can be rewritten as:

∇F^σ_C(z^(r)) + ∇²F^σ_C(z^(r)) d^(r+1) + C^T v^(r+1) = 0,  (4–36)
C d^(r+1) = 0,  (4–37)
where ∇²F^σ_C(z^(r)) is the Hessian of the correntropic function, defined as:

[∇²F^σ_C(z^(r))]_{i,j} = (α_i/σ²) exp(−(z_i^(r))²/2σ²) · (σ² − (z_i^(r))²)/σ²  if i = j,
[∇²F^σ_C(z^(r))]_{i,j} = 0  otherwise.  (4–38)
Equation 4–36 can be rewritten as:

d^(r+1) = −[∇²F^σ_C(z^(r))]^{−1} [∇F^σ_C(z^(r)) + C^T v^(r+1)],  (4–39)

where

[∇²F^σ_C(z^(r))]^{−1}_{i,j} = (σ²/α_i) exp((z_i^(r))²/2σ²) · σ²/(σ² − (z_i^(r))²)  if i = j,
[∇²F^σ_C(z^(r))]^{−1}_{i,j} = 0  otherwise.  (4–40)
Let Θ = [∇²F^σ_C(z^(r))]^{−1}. Using Equation 4–39 in Equation 4–37, we get:

C Θ [∇F^σ_C(z^(r)) + C^T v^(r+1)] = 0,  (4–41)
C Θ C^T v^(r+1) = −C Θ ∇F^σ_C(z^(r)).  (4–42)

Equation 4–42 can be written as:

v^(r+1) = −(C Θ C^T)^{−1} C Θ ∇F^σ_C(z^(r)).  (4–43)

Substituting Equation 4–43 in Equation 4–39, we get:

d^(r+1) = −Θ [∇F^σ_C(z^(r)) − C^T (C Θ C^T)^{−1} C Θ ∇F^σ_C(z^(r))],  (4–44)
d^(r+1) = −Θ [I_{(m+n)N} − C^T (C Θ C^T)^{−1} C Θ] ∇F^σ_C(z^(r)),  (4–45)
d^(r+1) = [Θ C^T (C Θ C^T)^{−1} C Θ − Θ] ∇F^σ_C(z^(r)),  (4–46)

d^(r+1) = ( [−Θ_Y; Θ_S (A ⊗ I_N)^T] (Θ_Y + (A ⊗ I_N) Θ_S (A ⊗ I_N)^T)^{−1}
            [−Θ_Y, (A ⊗ I_N) Θ_S] − Θ ) ∇F^σ_C(z^(r)),  (4–47)

where Θ = [ Θ_Y 0 ; 0 Θ_S ].
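The equality-constrained Newton step of Equations 4–39 and 4–43 can be sketched on a tiny instance. This is an illustrative sketch (function name, sizes, and data are assumptions); it is valid in the convex regime σ² ≥ z_i², where the Hessian diagonal is positive:

```python
import numpy as np

def newton_direction(z, C, sigma, alpha):
    """Eqs. 4-36 to 4-39: compute the dual v from C Theta C^T v =
    -C Theta grad, then the primal direction d = -Theta (grad + C^T v),
    which satisfies C d = 0 by construction."""
    w = np.exp(-z**2 / (2 * sigma**2))
    g = (alpha / sigma**2) * w * z                        # gradient, Eq. 4-33
    h = (alpha / sigma**2) * w * (sigma**2 - z**2) / sigma**2  # Hessian diag, Eq. 4-38
    Theta = np.diag(1.0 / h)                              # Eq. 4-40
    v = -np.linalg.solve(C @ Theta @ C.T, C @ Theta @ g)  # Eq. 4-43
    return -Theta @ (g + C.T @ v)                         # Eq. 4-39

# Tiny instance: m = n = 1, N = 2, A = [2], so C = [-I_2, 2 I_2] (Eq. 4-29).
C = np.array([[-1.0, 0.0, 2.0, 0.0],
              [0.0, -1.0, 0.0, 2.0]])
z = np.array([0.3, -0.2, 0.4, 0.1])     # current iterate (d = C z is its rhs)
d_step = newton_direction(z, C, sigma=5.0, alpha=np.ones(4))
```

The returned direction lies in the null space of C, so the iterate z + t·d_step stays feasible for any step size t.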
The second order method is suitable when the objective function of Formulation 4–30
is convex. When there are outliers, the goal in Formulation 4–30 is to minimize the
total correntropic loss while ignoring the effect of the outliers. In such a scenario, the
kernel width is selected such that it separates the true samples from the outliers.
Typically, this separation mechanism transforms the correntropy problem into the invex
domain. Thus, the second order Newton's method will not be able to find the optimal
solution. Therefore, in the following paragraphs an iterative method to solve
Formulation 4–30 is developed for the case when the correntropy is invex.
Let z^(r) be the current feasible solution. Let f1(S) = F^σ_C(AS − X) and f2(S) = F^σ_C(S).
The aim of finding the optimal kernel width is to identify a border that separates good
data points and outliers. Generally, such a mechanism of separating data points requires
problem specific knowledge. However, in this work, a correntropy based method that
identifies the optimal kernel width is proposed, which in turn provides a margin between
good data points and outliers.

The philosophy of the proposed method is based on the simple notion that if σ_i
is the optimal kernel width and the p-th given point x_p contains noise, then setting the
corresponding solution s_p to the zero vector should give the maximum improvement in
the objective function f(S) = f1(S) + f2(S). It is easy to see why f2 should decrease.
However, the decrease in f1 is only possible when the given point x_p is indeed an
outlier w.r.t. σ_i. Now, among all possible values of σ_i, the one that provides the
maximum decrease w.r.t. the original objective function value is the optimal value of the
kernel width. Let f(S \ p) be the correntropy cost when s_p is set equal to the zero
vector. Algorithm 4.3 presents the proposed algorithm.
One of the drawbacks of this approach is the computational expense of the
second order method, which increases with the problem dimensions n, m and N. On the
other hand, the step involving the second order method can be avoided when the
proposed method is used for initial filtering, i.e., when solving the following problem:

X = I X_f.  (4–48)

When solving for X_f in Equation 4–48, X_f can be initialized as X_f = X, and the
second order method can be skipped. After executing Algorithm 4.3, the optimal kernel
width and the filtered mixture matrix are obtained. This filtered mixture matrix can then
be used for dictionary identification and source extraction.
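The core leave-one-out test of Algorithm 4.3, zeroing a column s_p and checking whether the total correntropic cost improves, can be sketched as follows. Function names, the regularization weight, and the toy data (one corrupted mixture column) are this sketch's assumptions:

```python
import numpy as np

def corr_cost(A, S, X, sigma, alpha):
    """f(S) = F_sigma(AS - X) + alpha * F_sigma(S), with
    F_sigma(M) = -sum_ij exp(-M_ij^2 / 2 sigma^2), as in Eq. 4-27."""
    F = lambda M: -np.sum(np.exp(-M**2 / (2 * sigma**2)))
    return F(A @ S - X) + alpha * F(S)

def outlier_columns(A, S, X, sigma, alpha=0.1):
    """Flag column p when zeroing s_p improves the total cost: the
    f2 term always drops, so a net improvement only happens when
    x_p is an outlier w.r.t. the current kernel width."""
    base = corr_cost(A, S, X, sigma, alpha)
    flagged = []
    for p in range(S.shape[1]):
        Sp = S.copy()
        Sp[:, p] = 0.0
        if corr_cost(A, Sp, X, sigma, alpha) < base:
            flagged.append(p)
    return flagged

A = np.eye(2)
S = np.ones((2, 4))          # current source estimate
X = A @ S
X[:, 2] = 50.0               # corrupt one mixture column (an outlier)
flagged = outlier_columns(A, S, X, sigma=1.0)
```

Shrinking σ by the factor ν and repeating this test, as in Algorithm 4.3, then selects the kernel width with the largest relative improvement.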
Figure 4-1. Cocktail party problem: A) Setup. B) Problem.
Figure 4-2. BSS setup for human brain: sources S1, S2, S3 are mixed through A into mixtures X1, X2, X3.
Figure 4-3. Overview of different approaches to solve the BSS problem: Blind Signal Separation splits into Partial Blind and Full Blind; Full Blind covers Independent Component Analysis and Sparse Component Analysis; Sparse Component Analysis covers Partially Sparse Nonnegative Sources and Complete Sparse Components.
Algorithm 4.1: Bilinear Algorithm
input : X ∈ Rm×N
output: W ∈ Rm×n and T ∈ Rn×N
1  begin
2    Randomly initialize T;
3    Set termination = false;
4    Set ϵ = epsilon;
5    while not termination do
6      for p = 1 to N do
7        Calculate the distances Dhp between xp and all the hyperplanes wh;
8        Assign xp to cluster Ch⋆ iff Dh⋆p = min_h {Dhp};
9      error = ∑_p |Dh⋆p|;
10     if error < ϵ then
11       termination = true;
12     Thp = 1 if xp ∈ Ch, 0 otherwise;
13     for k = 1 to n do
14       Replace Equation 4–6e by wh^t e = 1;
15       Solve Formulation 4–6, given thp = Thp if h = k, 0 otherwise;
16   Arrange W = [w1, ... , wn];
17   return (W, T);
Algorithm 4.2: CSC Hierarchical Optimization Algorithm
input : X ∈ Rm×N
output: Aπ ∈ Rm×n and Sπ ∈ Rn×N
1  begin
2    X = Preprocessing(X, Ψ);
3    Set ϵ = epsilon;
4    Set P1 = {1, ... , N};
5    for counter = 1 to n do
6      Set r = counter;  // r is the current index of the hyperplane
7      Set termination = false;
8      repeat
9        Choose initial points;
10       Solve Formulation 4–24 for the r-th hyperplane, given xp ∀ p ∈ Pr;
11       if an optimal solution is obtained then
12         termination = true;
13         Archive the vector w⋆r;  // the optimal solution of Formulation 4–24
14         Obtain the index set Rϵr = {p : |w⋆r^t xp| ≤ ϵ};
15         Set Pr+1 = Pr \ Rϵr;
16     until termination;
17   Arrange W = [w⋆1, ... , w⋆n];
18   Get Sπ as Sπ = W^T X;
19   Get Aπ by solving the model W^T Aπ = In×n;
20   return (Sπ, Aπ);
Algorithm 4.3: Correntropy Minimization for X = AS Type Scenarios
input : X ∈ Rm×N and A ∈ Rm×n
output: S ∈ Rn×N and σ⋆
1  begin
2    Let z(r) be the solution obtained from second order minimization, where r can be chosen arbitrarily based on the required accuracy;
3    Let S(r) be the solution constructed from z(r);
4    Let σ(r) be the minimum value of the kernel width obtained from z(r) such that the correntropy function is convex;
5    Select any value for ν such that 0 < ν < 1;
6    Δr = −∞;
7    termination = false;
8    while termination == false do
9      Calculate f(S)(r);
10     for i = 1 to N do
11       if f(S \ i)(r) < f(S)(r) then
12         I = I ∪ {i};
13     Let fnew(S)(r) be the correntropy cost when si = 0 ∀ i ∈ I;
14     Δr+1 = |(fnew(S)(r) − f(S)(r)) / f(S)(r)|;
15     if Δr+1 > Δr then
16       σ(r) = σ(r) · ν;
17       r = r + 1;
18     else
19       σ⋆ = σ(r);
20       termination = true;
21   return (σ⋆, X);
Figure 4-7. Flowchart of Algorithm 4.2: choose initial points, solve the hierarchical formulation, and, once an optimal solution is found, remove the points corresponding to the identified plane; repeat until all the planes are obtained.
CHAPTER 5
SIMULATIONS AND RESULTS
In Chapter 5, the applicability of the proposed robust methods is illustrated through
experiments on simulated and real world data. Generally, it is impractical to draw
conclusions about the presence of outliers in real data. Therefore, the significance of the
proposed methods is highlighted using simulated data. We show that the proposed
methods work very well on non-noisy simulated data, as well as on noisy simulated data.
After the two simulated data scenarios, the performance of the proposed methods on
the real data is also tested.
In Sections 5.1 - 5.4, case studies related to binary classification are illustrated.
Section 5.5 presents the case study related to the linear mixing assumption and shows
an application of nonnegative source separation. The suitability of the proposed PPC
method for image unmixing problems is shown in Sections 5.6 - 5.10. Section 5.11
presents the case studies related to complete sparse source separation via the hyperplane
clustering method. Finally, Section 5.12 illustrates the proposed robust source extraction
procedure.
5.1. Cauchy and Skew Normal Data
The objective of Section 5.1 is to evaluate the performance of the correntropy
loss function in simulated noisy data classification. Two-dimensional noisy data for
binary classification are simulated for this study. Altogether, two different types of data
sets were generated. The first data set is generated using the Cauchy distribution. The
reason for selecting this distribution is to evaluate the performance of the proposed and
existing methods in a non-Gaussian environment. In this data set, the fat-tail behavior of
the Cauchy distribution mimics the noise. The second data set is generated by a skew
normal distribution. In this data set, 10% of the data points from one class are randomly
assigned to the other class and vice versa. Brief information regarding the two data sets
is given in Table 5-1. The details of the data sets are shown in Figures 5-1, 5-2 & 5-3.
(Some sections of Chapter 5 have been published in Dynamics of Information Systems:
Mathematical Foundations, Computers & Operations Research, and Neuromethods.)
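A data set in the spirit of the Cauchy scenario can be generated as follows. The centers, scales, and function name are illustrative, not the dissertation's exact parameters; the label-flip step mirrors the 10% class-swap noise described for the skew normal set:

```python
import numpy as np

def make_noisy_binary(N=200, flip=0.10, seed=7):
    """Two heavy-tailed (Cauchy) 2-D point clouds around distinct
    centers, with a fraction `flip` of labels swapped as label noise.
    Centers/scales are illustrative, not the thesis's parameters."""
    rng = np.random.default_rng(seed)
    X0 = rng.standard_cauchy((N, 2)) + np.array([-2.0, 0.0])
    X1 = rng.standard_cauchy((N, 2)) + np.array([2.0, 0.0])
    X = np.vstack([X0, X1])
    y = np.concatenate([np.zeros(N), np.ones(N)])
    idx = rng.choice(2 * N, size=int(flip * 2 * N), replace=False)
    y[idx] = 1.0 - y[idx]          # swap 10% of the labels
    return X, y

X, y = make_noisy_binary()
```

The fat Cauchy tails produce the occasional extreme samples that play the role of noise when comparing the correntropy and quadratic loss functions.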
For these data sets, a fixed number of records was selected for training the classifier.
The remaining records were used for testing the trained classifier. In order to obtain
accurate results, each data set is randomly divided into testing data and training data.
For each data set, the training data is preprocessed by normalizing the data to zero mean
and unit variance along the features (to avoid scaling effects). Based on the mean and
variance of the training data, the testing data is scaled. In addition, for the results
to be consistent, 100 Monte-Carlo simulations were conducted (both for ANN and SVM),
and the average testing accuracy of the classifier over the 100 simulations is reported in
the results.
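The preprocessing step above, standardizing with training-split statistics only and reusing them on the test split, can be sketched as follows (function name and sample data are illustrative):

```python
import numpy as np

def standardize_train_test(train, test):
    """Scale features to zero mean / unit variance using statistics
    of the training split only; the test split reuses those
    statistics, mimicking deployment where test data is unseen."""
    mu = train.mean(axis=0)
    sd = train.std(axis=0)
    sd[sd == 0] = 1.0                  # guard against constant features
    return (train - mu) / sd, (test - mu) / sd

rng = np.random.default_rng(42)
train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
test = rng.normal(loc=5.0, scale=2.0, size=(40, 3))
tr, te = standardize_train_test(train, test)
```

The training split comes out exactly zero-mean and unit-variance, while the test split is only approximately so, which is the intended behavior.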
The results are shown in Tables 5-2 & 5-3. From the results, it can be seen that the
correntropy loss function performs better in the case of the Cauchy data. However,
when the data is closer to normally distributed, as with the skew data, its performance is
similar to that of the quadratic loss function.
5.2. Real World Binary Classification Data
In Section 5.2, simulations are carried out on three real world data sets related to
the biomedical field (Wisconsin Breast Cancer Data, Pima Indians Diabetes Data and
BUPA Liver Disorder Data). These data sets are taken from the UCI machine learning
repository (http://archive.ics.uci.edu/ml/). Brief information regarding each of the data
sets is given in Table 5-4. The objective of Section 5.2 is to evaluate the performance of
the correntropy loss function in real world data classification.
Originally, some of the selected data sets have missing values. All the records
containing any missing values are discarded before using the data for classification.
In addition, for each data set, a fixed number of records was selected for training
the classifier. The remaining records were used for testing the trained classifier. In order
to obtain accurate results, each data set is randomly divided into testing data and
training data (keeping the total number of training records fixed, as given in Table 5-4).
For each data set, the training data is preprocessed by normalizing the data to zero mean
and unit variance along the features (to avoid scaling effects). Based on the mean and
variance of the training data, the testing data is scaled. The purpose of normalizing the
training data alone and scaling the testing data later is to mimic the real life scenario:
usually, the testing data is not available beforehand, and its information is unknown
while normalizing the training data. In addition, for the results to be as consistent
as possible, 100 Monte-Carlo simulations were conducted (both for ANN and SVM), and
the average testing accuracy of the classifier over the 100 simulations is reported in the
results.
5.3. Comparison Among ANN Based Methods
The aim of Section 5.3 is to compare the proposed ANN based methods with
existing ANN based binary classification methods. Since the number of PEs in the
hidden layer has an effect on the performance of ANN based classifiers, simulations
have been conducted with 5, 10 and 20 PEs in the hidden layer for each of the data sets.
Although the exact number of PEs that will give maximum classification accuracy is
unknown, it can be estimated by an experimental search over the number of PEs in
the hidden layer. However, such a search is out of the scope of the current work due to
its high computational requirements. Therefore, the computations have been confined
to 5, 10 and 20 PEs in order to efficiently compare all the ANN based classifiers.
Moreover, the performance of ANN based classifiers with sample and block based
learning frameworks was also considered in the comparison.

The results of the sample and block based learning methods of the ANN simulations
are given in Tables 5-5, 5-6, 5-7, 5-8, 5-9, and 5-10. In these six tables, each column
represents a number of learning epochs for sample based learning, whereas each
column represents a number of epochs × training sample size for block based
learning. For a given algorithm, a row represents the average result of 100 Monte-Carlo
simulations. The first row presents the results with 5 PEs in the hidden layer, the second
row with 10 PEs, and the third row with 20 PEs.
For the AQG and ACG methods, the results from [86] are used as a reference for
further comparisons (see Tables 5-5, 5-7 and 5-9). Since ACS requires knowledge
of the change in loss function value over any two consecutive iterations, it cannot be
implemented in sample based learning. However, all the algorithms have been
implemented in block based learning, and the performance results of ACS at σ = 0.5
have been presented. The results show that ACC almost always (both for sample
and block based learning methods) performs better than any of the other ANN
based classification algorithms. Therefore, this method can be used as a generalized
robust ANN based classifier for practical data classification problems. Moreover,
the poor performance of the ACS method is attributed to the σ = 0.5 criterion: it is not
necessary that this assumed setting shows ACS's best performance. Therefore,
this instigated the study of the performance behavior of the ACS method over different
levels of the parameter σ (see Tables 5-11, 5-12 and 5-13).
5.4. ANN and SVM Comparison
The aim of Section 5.4 is to compare the proposed ANN based binary classification
methods with SVM based binary classification methods. Since SVM has no concept
of PEs, the best average accuracy of SVM (the average of 100 Monte-Carlo simulations
for a given pair of c and γ) over an exponential grid of c and γ values is used for
comparison with the accuracy of the proposed algorithms. Figure 5-4A shows the
topology of the performance accuracy over the grid, and Figure 5-4B shows the topology
of the number of support vectors for the PID data. Correspondingly, Figures 5-5 and 5-6
show the same for the BLD and WBC data respectively. The maximum testing accuracy
obtained for the PID data from the grid search is 77.2%; similarly, for BLD and WBC it is
71.4% and 97.07% respectively.
It would be unfair to directly compare the best accuracy of SVM with the accuracy
of the proposed ANN based algorithms, for the following reason. While calculating the
best accuracy of the SVM based method, a fine (exhaustive) grid search over the
parameters c and γ is conducted. The possibility of conducting such exhaustive
searches over the parameter space is credited to the existence of fast quadratic
optimization algorithms like sequential minimal optimization [27]. However, such a fine
exhaustive grid search remains computationally expensive in the case of the proposed
ANN methods (for example, an exhaustive grid search for ACS requires searching over
three parameters: the number of epochs, σ, and the number of PEs in the hidden layer).
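To make the nature of this exhaustive search concrete, the sketch below runs a small exponential grid search over c and γ for a soft margin RBF SVM. The dataset, grid bounds, and cross-validation protocol here are illustrative stand-ins, not the 100 Monte-Carlo splits on the PID/BLD/WBC data used in the thesis.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Exponential grid over the soft-margin parameter c and the RBF width gamma.
X, y = load_breast_cancer(return_X_y=True)
best = (None, None, 0.0)
for c in 2.0 ** np.arange(-3, 6, 2):
    for gamma in 2.0 ** np.arange(-7, 2, 2):
        clf = make_pipeline(StandardScaler(), SVC(C=c, gamma=gamma))
        acc = cross_val_score(clf, X, y, cv=5).mean()  # mean CV accuracy
        if acc > best[2]:
            best = (c, gamma, acc)
print("best c=%g gamma=%g accuracy=%.3f" % best)
```

The key point is that each grid cell costs only one SMO-style quadratic optimization per fold, whereas for an ANN each cell costs a full training run per epoch setting, PE count, and σ level.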
Nevertheless, in order to see the behavior of the ACS algorithm at various levels of
σ, a coarse grid search with a few grid points has been conducted. The results of this
grid search are shown in Tables 5-11, 5-12 and 5-13. Although the grid is confined to
very few points, it can be seen that the performance accuracy of the ACS algorithm
varies with the parameters (σ and the number of PEs in the hidden layer). The results
also show that the performance accuracy of ACS (even with limited PEs and confined
levels of σ) is very close to the best performance accuracy of the soft margin kernel
based SVM. Furthermore, even with its limitations (number of PEs in the hidden layer
and number of epochs), ACC beats the best performance accuracy of SVM on the WBC
data, and its performance is very close to the best SVM performance on the other two
data sets.
5.5. Linear Mixing EEG-ECoG Data
The aim of Section 5.5 is to understand the nature of mixing across the skull. In
particular, the objective is to assess the validity of the linear mixing assumption in the
BSS problem. Since linear mixing is assumed in almost all BSS methods, it is of
primary interest to examine the validity of this assumption on neural data. The
idea of this experiment is to consider a neural data set that contains information
regarding both the source signals and the mixture signals from the brain, and to extract
the mixing matrix from the available information. However, the mixing matrix itself may
not provide significant information when compared to the total error under the linear
mixing assumption. Therefore, in the following experiment, a suitable publicly available
data set (which contains both source and mixture data) is considered, and the linear
mixing assumption across the skull is examined by minimizing the total error.
Data containing simultaneous electrical activity over the scalp (EEG) and over the
exposed surface of the cortex (ECoG) from a monkey is considered in Section 5.5. The
information regarding the experimental setup and the positions of the electrodes is
available at the following web address, (http://wiki.neurotycho.org/EEG-ECoG recording).
Since the data from this experiment are simultaneously collected from above and below
the scalp, they open the door to understanding the mixing mechanism across the skull.
Typically, the mixing over the skull is assumed to be linear: the linear mixing assumption
yields mathematical advantages in formulating the problem, developing algorithms, and
identifying the unknown source and mixing matrices. In fact, the only known successful
results for the BSS problem have been obtained under the linear mixing assumption. By
analyzing the data from this experiment, the goal is to experimentally verify the validity of
this assumption.
The data consist of ECoG and EEG signals that were simultaneously recorded
from the same monkey. A 128-channel ECoG array covering the entire lateral cortical
surface of the left hemisphere with 5 millimeter spacing was implanted in the monkey,
and the EEG signal was recorded from 19 channels. The locations of the EEG
electrodes were determined by the 10-20 system without the Cz electrode (because the
location of the Cz electrode interfered with a connector of the ECoG array). In the
present simulation, results on a particular data set are presented, in which the monkey is
blindfolded, seated in a primate chair, with its hands tied to the chair. Figure 5-7 shows
the 8 EEG channels of the left hemisphere, and Figure 5-8 shows the 128 ECoG
channels from the left hemisphere.
During the recording, the monkey is in a resting condition. In such a scenario, it is
assumed that the theta and alpha bands should be dominant in a normal healthy
primate. Thus, the goal is to see how particular frequency bands mix over the skull.
Basically, the formulation is of the following form:

minimize : |Xeeg − A × Xecog|,   (5–1)

where Xeeg ∈ R^{18×N} represents the EEG data from 18 channels (each row
represents a channel), Xecog ∈ R^{128×N} represents the ECoG data from 128
channels (each row represents a channel), A ∈ R^{18×128} is the unknown mixing
matrix, and |·| denotes the total (entrywise) absolute error.
Before solving Formulation 5–1, the data have been filtered to remove high (≥ 45 Hz)
and low (≤ 0.5 Hz) frequencies. In addition, 50 Hz and 60 Hz notch filters have
been used to remove the power line noise. Furthermore, all the channels have been
referenced to the average signal before conducting the analysis, i.e., the EEG value
from a particular channel at a given time instance is referenced to the average over all
EEG channels at the same time instance, and similarly the ECoG data are referenced to
their average signal.
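The preprocessing just described (band-pass to 0.5-45 Hz, 50/60 Hz notches, then average referencing) can be sketched as below; the filter order and notch quality factor are illustrative choices, not parameters stated in the thesis.

```python
import numpy as np
from scipy import signal

def preprocess(x, fs):
    """Band-pass 0.5-45 Hz, notch out 50/60 Hz power-line noise, then
    common-average reference.  x: (channels, samples), fs: rate in Hz."""
    b, a = signal.butter(4, [0.5, 45.0], btype="bandpass", fs=fs)
    x = signal.filtfilt(b, a, x, axis=1)
    for f0 in (50.0, 60.0):                       # power-line notches
        bn, an = signal.iirnotch(f0, Q=30, fs=fs)
        x = signal.filtfilt(bn, an, x, axis=1)
    # subtract the across-channel average at every time instance
    return x - x.mean(axis=0, keepdims=True)
```

After this step the across-channel mean is exactly zero at every sample, which is what "referenced to the average signal" requires.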
Instead of solving Formulation 5–1 with respect to the whole data set, the formulation
has been solved multiple times on reduced data sets. The reduced data sets are simply
smaller chunks of the original data, with a window size of N = 2000 points, for a
particular frequency band. The objective of Formulation 5–1 is to calculate the total
absolute error due to the linear mixing assumption in different frequency bands; thus,
this experiment provides a mechanism to understand mixing across the skull. A low
error indicates that the linear mixing assumption is valid, whereas a high error indicates
that it is invalid. Moreover, the ultimate goal is to show whether the mixing is constant
over time; however, developing such results requires complete knowledge of the total
number of sources. At this point, a simple experiment is presented in which it is
assumed that all the ECoG electrodes are sources and all the EEG electrodes are
mixtures. The model is thus highly under-determined, but due to the availability of both
source and mixture information, Formulation 5–1 becomes a convex programming
problem.
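For one data window, fitting A and measuring the residual can be sketched as follows. For tractability this sketch fits A by least squares and only reports the total absolute error afterwards, rather than minimizing the absolute error directly as Formulation 5–1 does.

```python
import numpy as np

def window_mixing_error(x_eeg, x_ecog):
    """Fit x_eeg ~= A @ x_ecog on one window and return (error, A).
    x_eeg: (n_eeg, N), x_ecog: (n_ecog, N); A has shape (n_eeg, n_ecog).
    A is obtained by least squares as a surrogate for the absolute-error
    minimization; the returned error is the total absolute residual."""
    # solve x_ecog.T @ A.T ~= x_eeg.T, one right-hand side per EEG channel
    At, *_ = np.linalg.lstsq(x_ecog.T, x_eeg.T, rcond=None)
    A = At.T
    error = np.abs(x_eeg - A @ x_ecog).sum()
    return error, A
```

Running this over sliding N = 2000 windows of a band-filtered recording and collecting the mean and variance of the errors reproduces the kind of per-band statistics reported in Table 5-14.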
The results of the analysis are shown in Table 5-14. While calculating the error,
only those channels placed on the left hemisphere are considered, i.e., 8 EEG channels
and 128 ECoG channels. Since ECoG data are available only for the left hemisphere,
the right hemispheric EEG channels have been neglected in the calculation of the error.
In Table 5-14, the third row presents the mean of the total absolute error over all the
runs on the reduced data sets, and the fourth row gives the corresponding variance. The
low average error and negligible variance in the alpha and theta bands suggest the
existence of linear mixing across the skull. At this stage of the experiment, the linear
mixing assumption is validated on this neural data set; however, this is far from a
theoretical validation, and generalization to other neural data sets remains open.
Furthermore, the other critical question, concerning the constancy of the mixing over
time, is also open for further investigation.
5.6. fMRI Data Analysis
In Sections 5.6, 5.7, 5.8, 5.9 and 5.10, the focus is on non-negative sources;
images generally fall under the non-negative sources category. The aim of Section 5.6
is to examine the validity of the PPC sparsity assumption on fMRI data. Generally,
sparsity in fMRI images is a more plausible assumption than independence [24];
however, PPC sparsity may not be applicable to fMRI data. Through this experiment, the
applicability of the PPC method to fMRI data is explored.
An fMRI data set examined previously in the literature is considered in Section 5.6.
The description of the experimental setup and data collection is available in [35], where
the authors compare ICA and SCA methods. Here, the same data are used to analyze
the convex hull of the fMRI data. The basic idea is that if the PPC assumptions are
valid, then the convex hull should be a simplex. Furthermore, if the convex hull is a
simplex in n dimensions, then an affine transformation to lower dimensions, such as
PCA, should result in a simplex in the lower dimensions. Moreover, the extreme points
(or vertices) of the simplex (or convex hull) are nothing but the columns of the mixing
matrix; thus, finding the convex hull leads to the identification of the mixing matrix.
The fMRI data from a single subject consist of 98 images taken every 50 milliseconds.
The images are vectorized by scanning each image vertically from top left to bottom
right. Next, the dimensionality of the data is reduced to 3 principal components using
PCA for ease of identifying the convex hull. Since the images are vectorized, the relation
between the fMRI data and the PCA components is not directly interpretable; however,
PCA has a crucial advantage for visualization, which in turn leads to easy identification
of the convex hull in the lower dimensional space. Figure 5-9A shows the scatter plot of
the three principal components. Next, taking the three principal components, the data
are projected onto a two dimensional plane; this projection is shown in Figure 5-10.
The first thing to notice is that this projection differs from Figure 5-9B, which shows the
scatter plot of only two principal components. A simplex that fits all the points in
Figure 5-10 then gives the information pertaining to the columns of the mixing matrix.
For the unique identification of the mixing matrix, the existence of a unique simplex is
necessary.
For the fMRI data, the PPC1 conditions are not completely satisfied, since the
vertices of the triangle (simplex) are not present in the data. However, approximate
methods can be developed to identify the extreme points of the triangle; for example,
Figures 5-10A and 5-10B show different ways of extrapolating the data to obtain the
vertices. This idea can be extended to higher dimensions by defining objectives such as
finding a simplex of minimum volume containing all the data, or finding a simplex of
minimum volume containing a high percentage of the data. From this analysis, it can be
concluded that, in general, the PPC method may not be directly applicable to the
analysis of fMRI data. Thus, alternative methods that can overcome the restrictions of
the PPC method are needed to analyze fMRI data.
5.7. MRI Scans
In Section 5.7, three MRI scan images are considered. From the original MRI
images, the minimum pixel value is subtracted, and the validity of the PPC1 assumption
is tested; these processed images do satisfy the PPC1 assumption. Let us call these
images the initial images. The initial images are linearly mixed to obtain three mixture
images, and the goal is to extract the pure source images from the mixture images.
Figure 5-11A displays the initial sources, and Figure 5-11B presents the mixture images.
The three mixture images are vectorized into a matrix X ∈ R^{3×N}, where N
depends upon the size of the images. The columns of X are projected onto a two
dimensional space using the PCA transformation, and the three unique vertices of the
simplex (triangle) are identified from the projected data using the proposed projection
approach. Since the initial images satisfy the PPC1 assumption, the unique vertices are
identified exactly, i.e., no approximation is needed. From the vertices of the simplex, the
mixing matrix is constructed, and the source images are recovered using this mixing
matrix. Figure 5-11C shows the recovered source images. Since the PPC1
assumptions were satisfied initially, all information except the ordering and intensity of
the images (the ambiguities of permutation and scaling) is recovered from the mixture
images.
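The recovery pipeline above can be sketched as follows. The convex hull computation here is an illustrative stand-in for the proposed projection approach, and it assumes exact, noise-free PPC1 data so that the projected cloud has exactly n extreme points.

```python
import numpy as np
from scipy.spatial import ConvexHull

def unmix_ppc(X):
    """Simplex-vertex unmixing sketch for n noise-free mixtures X (n, N)
    of non-negative sources satisfying a PPC1-type condition (each source
    has at least one pixel where it alone is active)."""
    n, _ = X.shape
    keep = np.flatnonzero(X.sum(axis=0) > 1e-12)   # drop all-zero pixels
    Y = X[:, keep] / X[:, keep].sum(axis=0)        # columns now sum to 1
    Yc = Y - Y.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(Yc, full_matrices=False)
    P = (U[:, : n - 1].T @ Yc).T                   # PCA projection to n-1 dims
    verts = ConvexHull(P).vertices                 # simplex corners
    A_est = Y[:, verts[:n]]                        # estimated mixing columns
    S_est = np.maximum(np.linalg.lstsq(A_est, X, rcond=None)[0], 0)
    return A_est, S_est
```

For PPC1 data the extreme points of the projected cloud are exactly the pixels where a single source is active, so the corresponding columns of the normalized data equal the (scale-normalized) columns of the mixing matrix up to permutation.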
Furthermore, in order to assess the performance of the proposed approach, the
experiment is repeated 50 times with a random mixing matrix in every repetition, for
every value of n = 3, ... , 7. To quantify the accuracy of the proposed approach, the
error between the recovered and original sources is calculated as:

e(S, Ŝ) = min_{π ∈ Π_n} Σ_{i=1}^{n} ∥s_{i•} − ŝ_{π_i •}∥₂,   (5–2)

where s_{i•} is the i-th row of the original source matrix S, and ŝ_{i•} is the i-th row of
the recovered source matrix Ŝ. All the rows of the original and recovered source
matrices are normalized; the normalization removes the scaling effect, while the effect of
permutation is handled by the vector π. Let π = [π₁, ... , π_n]ᵀ and
Π_n = {π ∈ Rⁿ | π_i ∈ {1, 2, ... , n}, π_i ≠ π_j ∀ i ≠ j} be the set of all permutations of
{1, 2, ... , n}. The optimization problem in Equation 5–2 matches the rows of the
recovered source matrix to those of the original source matrix; this minimization is
nothing but the standard assignment problem, and can be easily solved using the
Hungarian method. The average error and standard deviation for the MRI scan images
are presented in the first column of Tables 5-15 and 5-16 respectively.
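The matching in Equation 5–2 can be computed in a few lines with the Hungarian method; this is an illustrative implementation, not the thesis code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def unmix_error(S, S_hat):
    """Permutation-invariant recovery error of Equation 5-2: rows are
    l2-normalized (removing scale), then rows of S_hat are matched to
    rows of S by solving the assignment problem."""
    S = S / np.linalg.norm(S, axis=1, keepdims=True)
    S_hat = S_hat / np.linalg.norm(S_hat, axis=1, keepdims=True)
    # cost[i, j] = distance between row i of S and row j of S_hat
    cost = np.linalg.norm(S[:, None, :] - S_hat[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)     # Hungarian method
    return cost[rows, cols].sum()
```

A recovered matrix that differs from the original only by row permutation and positive row scaling yields an error of (numerically) zero, as the equation requires.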
5.8. Finger Prints
In Section 5.8, three finger print images are considered. Similar to the MRI scans
in Section 5.7, the minimum pixel value in each image is first subtracted, and the PPC1
assumption is then checked. These processed images do not satisfy the PPC1
assumption. Let us call these images the initial images. The linear mixing operation of
the MRI scan experiment (see Section 5.7) is repeated to obtain three mixture images.
Since the PPC1 assumption is not satisfied, the goal is to approximately extract the pure
sources from the mixture images. Figure 5-12A displays the initial sources, and
Figure 5-12B presents the mixture images.
The three mixture images are vectorized into a matrix X ∈ R^{3×N}, where N
depends upon the size of the images. The columns of X are projected onto a two
dimensional space using the PCA transformation, and the three best extreme points are
identified using the proposed projection approach. Taking these three points as the
vertices of the simplex, the mixing matrix is constructed, and the sources are recovered
using this mixing matrix. Figure 5-12C shows the extracted source images. It can be
seen that, apart from intensity and ordering, the recovery is not perfect.
Furthermore, in order to assess the performance of the proposed approach, the
experiment is repeated 50 times with a random mixing matrix in every repetition, for
every value of n = 3, ... , 7. The error between the recovered and original sources is
calculated by the formula given in Equation 5–2. The average error and standard
deviation for the finger print images are presented in the second column of Tables 5-15
and 5-16 respectively.
5.9. Zip Codes
In Section 5.9, four zip code images are considered; let us call these images the
initial images. The linear mixing operation of the MRI scan experiment (see Section 5.7)
is performed to obtain four mixture images. The PPC1 assumption is not satisfied for the
four images, so the goal is to approximately extract the pure sources from the mixture
images. Figure 5-13A displays the initial source images, and Figure 5-13B presents the
mixture images.
The four mixture images are vectorized into a matrix X ∈ R^{4×N}, where N
depends upon the size of the images. The columns of X are projected onto a three
dimensional space using the PCA transformation, and the four best extreme points are
identified using the proposed projection approach. Taking these four points as the
vertices of the simplex, the mixing matrix is constructed, and the sources are recovered
using this mixing matrix. Figure 5-13C shows the extracted sources. It can be seen that,
apart from intensity and ordering, the recovery is not perfect.
Furthermore, in order to assess the performance of the proposed approach, the
experiment is repeated 50 times with a random mixing matrix in every repetition, for
every value of n = 3, ... , 7. The error between the recovered and original sources is
calculated by the formula given in Equation 5–2. The average error and standard
deviation for the zip code images are presented in the third column of Tables 5-15 and
5-16 respectively.
5.10. Ghost Effect
Five translated images of the same individual are considered in Section 5.10; let us
call these images the initial images. The linear mixing operation of the MRI scan
experiment (see Section 5.7) is performed to obtain five mixture images. The PPC1
assumption is not satisfied for the five images, so the goal is to approximately extract
the pure sources from the mixture images. Figure 5-14A displays the initial sources, and
Figure 5-14B presents the mixture images.
The five mixture images are vectorized into a matrix X ∈ R^{5×N}, where N
depends upon the size of the images. The columns of X are projected onto a four
dimensional space using the PCA transformation, and the five best extreme points are
identified using the proposed projection approach. Taking these five points as the
vertices of the simplex, the mixing matrix is constructed, and the sources are recovered
using this mixing matrix. Figure 5-14C shows the extracted sources. It can be seen that,
apart from intensity and ordering, the recovery is not perfect.
Furthermore, in order to assess the performance of the proposed approach, the
experiment is repeated 50 times with a random mixing matrix in every repetition, for
every value of n = 3, ... , 7. The error between the recovered and original sources is
calculated by the formula given in Equation 5–2. The average error and standard
deviation for the ghost effect images are presented in the fourth column of Tables 5-15
and 5-16 respectively.
5.11. Hyperplane Clustering
In order to show the performance of Algorithm 4.2, random test instances have
been generated in Section 5.11. For simplicity, the case m = n is considered, and to
show the performance of the proposed approach, noise free correlated sources are
used; all the data in this case study are artificially generated. Noise free data points
X ∈ R^{16×1600} have been generated from a randomly generated dictionary
A ∈ R^{16×16} and source matrix S ∈ R^{16×1600} (the source is sparse, i.e., each
column contains at least one zero). Figure 5-15 shows the original source matrix, and
Figure 5-16 shows the given data. The matrix shown in Figure 5-17 is the randomly
generated dictionary A, normalized separately with respect to each column. The
correlation of the sources is given by the matrix shown in Figure 5-18.
The correlation matrix (see Figure 5-18) is far from diagonal; therefore, the sources
are highly correlated. Nevertheless, the proposed method separates their mixtures
successfully, unlike ICA, which at the very least requires the sources to be uncorrelated.
After preprocessing the data, the hierarchical sequence of MIPs is solved, as
described in Algorithm 4.2. For fast execution, the algorithm is jump started by
generating initial points. Specifically, the following two ϵ neighborhoods of a point x_r
are defined:

x_p ∈ N_{ϵ1}(x_r)  iff  (x_pᵀ x_r) / (∥x_p∥₂ ∥x_r∥₂) ≥ ϵ1   (5–3)

and

x_p ∈ N_{ϵ2}(x_r)  iff  (x_pᵀ x_r) / (∥x_p∥₂ ∥x_r∥₂) ≥ ϵ2.   (5–4)

For iteration r, a random point x_p is selected as a candidate point whose N_{ϵ1}
neighborhood contains the maximum number of points. Next, all the points that belong
to N_{ϵ1}(x_p) are considered as the points belonging to the r-th hyperplane, and all the
points that belong to N_{ϵ2}(x_p) are taken as a starting solution for the r-th hierarchical
problem. Here ϵ1 and ϵ2 are selected arbitrarily such that ϵ1 > ϵ2, so that
N_{ϵ1}(x_r) ⊆ N_{ϵ2}(x_r). In the case study, we have set ϵ1 = cos(θ) with
θ ∈ [15°, 20°], and ϵ2 = cos(25°). The sets N_{ϵ1} and N_{ϵ2} are the two samples of
the proposed RANSAC based algorithm.
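A minimal sketch of this seeding step follows; it assumes a direction-invariant (absolute) cosine similarity, which is an implementation choice not spelled out in the text.

```python
import numpy as np

def seed_cluster(X, eps1, eps2):
    """Pick the candidate point whose eps1 cosine neighborhood is the
    largest, and return the two neighborhoods used to jump-start one
    hierarchical problem.  eps = cos(angle threshold), eps1 > eps2, so
    the eps1 set is the tighter one.  X: (d, N) points in columns."""
    Xn = X / np.linalg.norm(X, axis=0, keepdims=True)
    C = np.abs(Xn.T @ Xn)                  # pairwise |cosine| similarities
    counts = (C >= eps1).sum(axis=1)       # neighborhood sizes
    p = int(np.argmax(counts))             # candidate point index
    tight = np.flatnonzero(C[p] >= eps1)   # assigned to the r-th hyperplane
    loose = np.flatnonzero(C[p] >= eps2)   # starting solution
    return p, tight, loose
```

Because eps1 > eps2, the tight set is always contained in the loose set, matching the nesting of the two neighborhoods above.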
After running the proposed algorithm, Aπ is recovered from Equation 4–17.
Figure 5-19 shows the column normalized dictionary matrix Aπ, and Figure 5-20 shows
the recovered source Sπ. Both A and Aπ are column normalized and truncated to two
decimal places for the sake of easy comparison. It can be seen that Aπ and A differ
only by a permutation of the columns, which shows the excellent performance of the
proposed algorithm.
In addition, data points X ∈ R^{m×N} have been generated, without any noise,
from randomly generated dictionary A ∈ R^{n×n} and source S ∈ R^{n×N} matrices
for different values of m, n and N. The objective is to study the performance of the
proposed algorithm with respect to the solution time. For consistency, all the simulations
were carried out on the same machine (using 8 processors on a 64 processor Linux
server). To accommodate infeasibility issues for highly ill-conditioned A matrices, the
best time out of 5 runs is reported. Table 5-17 presents the solution times for the cases
m = n = 6 and N = 600, ... , 3800, and Table 5-18 presents the solution times for the
cases m = n = 6, 8, ... , 16 with N = 100 × n. Based on the simulation results in
Tables 5-17 and 5-18, it can be seen that the complexity of the problem is driven more
by n than by N.
5.12. Robust Source Extraction
In Section 5.12, the application and performance of Algorithm 4.3 are presented.
In all the simulations, only one iteration of the second order method is executed, and the
data are randomly generated. Figure 5-21A shows the original 7 signals. The signals
are linearly mixed using a random A matrix to obtain the X matrix, and 2% noise is then
added to X. Figure 5-21B shows the non-contaminated and Figure 5-21C the
contaminated X matrices. Figure 5-22A shows the results obtained by simple quadratic
minimization, and Figure 5-22B shows the solution obtained by the proposed approach.
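To illustrate why a correntropy objective resists contamination where quadratic minimization does not, the sketch below recovers s from x = As plus a few gross outliers by gradient ascent on the Gaussian-kernel correntropy of the residual. It is an illustrative stand-in for Algorithm 4.3; the kernel width σ, step size, and update rule here are assumptions, not the thesis's second order method.

```python
import numpy as np

def correntropy_recover(A, x, sigma=1.0, iters=500, lr=0.1):
    """Recover s from x = A @ s + gross outliers by gradient ascent on
    sum_i exp(-r_i^2 / (2 sigma^2)) with r = x - A @ s (Welsch loss).
    Outlier residuals get near-zero kernel weight, so they barely
    influence the fit.  A: (m, n), x: (m,)."""
    s = np.linalg.lstsq(A, x, rcond=None)[0]    # quadratic warm start
    for _ in range(iters):
        r = x - A @ s
        w = np.exp(-r**2 / (2.0 * sigma**2))    # per-sample kernel weights
        s = s + lr * (A.T @ (w * r)) / len(x)   # weighted gradient step
    return s
```

The contrast mirrors Figures 5-22A and 5-22B: plain least squares is pulled off by the contaminated samples, while the correntropy fit effectively down-weights them to zero.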
Furthermore, in order to assess the performance of the proposed algorithm, the
experiment is repeated 50 times with random mixing and source matrices in every
repetition, for every value of n = 3, ... , 6. The error between the recovered and original
sources is calculated by the formula given in Equation 5–2. The average error, standard
deviation, and average time to solve one instance for the simulated signals are
presented in Table 5-19.
Table 5-1. Binary classification case study 1
Name    Inherent Distribution     Noise Criteria
Cauchy  Cauchy Distribution       random global flights of the distribution are considered as noise
Skew    Skew Normal Distribution  random 10% noise is added to the data
Table 5-2. Cauchy data
      5 PEs   10 PEs  20 PEs
AQG   0.7157  0.8065  0.8208
ACG   0.6995  0.7702  0.814
ACC   0.6645  0.729   0.801
ACS   0.834   0.8405  0.8403
Table 5-3. Skew data
      5 PEs   10 PEs  20 PEs
AQG   0.902   0.901   0.909
ACG   0.9008  0.905   0.9005
ACC   0.8998  0.8993  0.9
ACS   0.9005  0.8998  0.9025
Table 5-4. Binary classification case study 2
Data set                       Attributes (or Features)  Total records  Classes  Training size
Pima Indians Diabetes (PID)    8                         768            2        400
Wisconsin Breast Cancer (WBC)  9                         683            2        300
BUPA Liver Disorders (BLD)     6                         345            2        150
Table 5-5. Sample based performance of ANN on PID data
              1      2      3      4      5      6      7      8      9      10     Best
AQG (5 PEs)   0.74   0.755  0.757  0.757  0.757  0.756  0.756  0.755  0.754  0.754  0.758
AQG (10 PEs)  0.75   0.757  0.758  0.757  0.757  0.756  0.756  0.755  0.755  0.754
AQG (20 PEs)  0.738  0.744  0.744  0.748  0.748  0.747  0.745  0.743  0.744  0.744
ACG (5 PEs)   0.762  0.763  0.763  0.763  0.763                                     0.766
ACG (10 PEs)  0.765  0.766  0.766  0.765  0.765
ACG (20 PEs)  0.759  0.761  0.76   0.76   0.76
ACC (5 PEs)   0.687  0.746  0.754  0.763  0.761  0.761  0.763  0.762  0.762  0.761  0.768
ACC (10 PEs)  0.731  0.758  0.76   0.765  0.764  0.765  0.768  0.766  0.765  0.764
ACC (20 PEs)  0.747  0.759  0.764  0.762  0.764  0.763  0.76   0.765  0.766  0.763
Table 5-6. Block based performance of ANN on PID data
              1      2      3      4      5      6      7      8      9      10     Best
AQG (5 PEs)   0.71   0.744  0.758  0.763  0.764  0.762  0.767  0.764  0.763  0.766  0.769
AQG (10 PEs)  0.736  0.756  0.762  0.763  0.766  0.767  0.769  0.767  0.766  0.765
AQG (20 PEs)  0.746  0.761  0.766  0.768  0.765  0.768  0.767  0.767  0.762  0.764
ACG (5 PEs)   0.766  0.767  0.765  0.765  0.766                                     0.769
ACG (10 PEs)  0.767  0.765  0.769  0.765  0.765
ACG (20 PEs)  0.767  0.765  0.765  0.765  0.766
ACC (5 PEs)   0.67   0.701  0.724  0.741  0.752  0.759  0.759  0.765  0.765  0.762  0.77
ACC (10 PEs)  0.698  0.725  0.746  0.754  0.762  0.763  0.765  0.767  0.77   0.768
ACC (20 PEs)  0.72   0.75   0.762  0.763  0.764  0.767  0.764  0.765  0.766  0.769
ACS (5 PEs)   0.755  0.752  0.752  0.756  0.754  0.751  0.754  0.754  0.753  0.75   0.756
ACS (10 PEs)  0.752  0.752  0.755  0.749  0.753  0.752  0.75   0.751  0.75   0.755
ACS (20 PEs)  0.747  0.754  0.751  0.752  0.748  0.75   0.749  0.748  0.748  0.75
Table 5-7. Sample based performance of ANN on BLD data
              1      2      3      4      5      6      7      8      9      10     Best
AQG (5 PEs)   0.569  0.578  0.578  0.577  0.588  0.584  0.591  0.59   0.591  0.592  0.62
AQG (10 PEs)  0.568  0.57   0.584  0.584  0.596  0.599  0.599  0.606  0.611  0.61
AQG (20 PEs)  0.572  0.574  0.591  0.597  0.603  0.611  0.603  0.614  0.62   0.622
ACG (5 PEs)   0.578  0.579  0.579  0.58   0.583                                     0.596
ACG (10 PEs)  0.579  0.581  0.585  0.585  0.587
ACG (20 PEs)  0.584  0.596  0.594  0.595  0.592
ACC (5 PEs)   0.575  0.577  0.581  0.579  0.583  0.585  0.592  0.59   0.592  0.597  0.627
ACC (10 PEs)  0.57   0.576  0.582  0.584  0.591  0.591  0.597  0.603  0.61   0.613
ACC (20 PEs)  0.571  0.581  0.582  0.592  0.601  0.6    0.608  0.612  0.622  0.627
Table 5-8. Block based performance of ANN on BLD data
              1      2      3      4      5      6      7      8      9      10     Best
AQG (5 PEs)   0.561  0.57   0.59   0.597  0.595  0.602  0.613  0.615  0.631  0.638  0.685
AQG (10 PEs)  0.57   0.595  0.596  0.61   0.625  0.638  0.644  0.653  0.657  0.658
AQG (20 PEs)  0.58   0.61   0.637  0.644  0.652  0.663  0.672  0.668  0.675  0.685
ACG (5 PEs)   0.612  0.614  0.615  0.626  0.633                                     0.685
ACG (10 PEs)  0.631  0.639  0.643  0.655  0.659
ACG (20 PEs)  0.66   0.667  0.671  0.675  0.685
ACC (5 PEs)   0.57   0.578  0.581  0.591  0.604  0.604  0.628  0.63   0.632  0.641  0.686
ACC (10 PEs)  0.565  0.585  0.598  0.617  0.622  0.631  0.65   0.647  0.659  0.668
ACC (20 PEs)  0.581  0.608  0.634  0.639  0.662  0.663  0.667  0.675  0.677  0.686
ACS (5 PEs)   0.612  0.634  0.643  0.637  0.636  0.641  0.633  0.646  0.643  0.642  0.675
ACS (10 PEs)  0.637  0.655  0.655  0.657  0.656  0.66   0.658  0.657  0.656  0.654
ACS (20 PEs)  0.653  0.675  0.668  0.669  0.668  0.669  0.664  0.674  0.67   0.67
Table 5-9. Sample based performance of ANN on WBC data
              1      2      3      4      5      6      7      8      9      10     Best
AQG (5 PEs)   0.966  0.968  0.969  0.969  0.969  0.969  0.969  0.968  0.968  0.968  0.97
AQG (10 PEs)  0.965  0.969  0.97   0.97   0.969  0.97   0.969  0.969  0.969  0.969
AQG (20 PEs)  0.965  0.97   0.97   0.97   0.97   0.969  0.969  0.969  0.968  0.968
ACG (5 PEs)   0.97   0.97   0.97   0.97   0.97                                      0.971
ACG (10 PEs)  0.97   0.97   0.97   0.97   0.971
ACG (20 PEs)  0.97   0.971  0.971  0.971  0.971
ACC (5 PEs)   0.969  0.97   0.969  0.969  0.969  0.97   0.971  0.97   0.97   0.97   0.972
ACC (10 PEs)  0.97   0.97   0.971  0.97   0.972  0.97   0.97   0.971  0.971  0.971
ACC (20 PEs)  0.971  0.97   0.97   0.97   0.97   0.97   0.97   0.97   0.969  0.97
Table 5-10. Block based performance of ANN on WBC data
              1      2      3      4      5      6      7      8      9      10     Best
AQG (5 PEs)   0.961  0.968  0.968  0.968  0.97   0.97   0.97   0.97   0.969  0.97   0.97
AQG (10 PEs)  0.964  0.968  0.968  0.968  0.969  0.97   0.969  0.97   0.969  0.968
AQG (20 PEs)  0.966  0.97   0.969  0.969  0.969  0.969  0.97   0.97   0.97   0.97
ACG (5 PEs)   0.97   0.967  0.969  0.967  0.97                                      0.973
ACG (10 PEs)  0.971  0.97   0.971  0.973  0.965
ACG (20 PEs)  0.971  0.965  0.97   0.969  0.971
ACC (5 PEs)   0.961  0.966  0.968  0.968  0.97   0.971  0.97   0.97   0.972  0.969  0.972
ACC (10 PEs)  0.965  0.968  0.969  0.97   0.968  0.969  0.968  0.97   0.97   0.969
ACC (20 PEs)  0.966  0.968  0.969  0.971  0.969  0.97   0.97   0.97   0.97   0.97
ACS (5 PEs)   0.965  0.965  0.966  0.965  0.964  0.965  0.964  0.965  0.967  0.966  0.967
ACS (10 PEs)  0.964  0.966  0.967  0.965  0.965  0.965  0.964  0.965  0.965  0.966
ACS (20 PEs)  0.965  0.964  0.965  0.964  0.964  0.962  0.963  0.963  0.965  0.964
Table 5-11. Performance of ACS for different values of σ and number of PEs in hidden layer on PID data
        σ=0.5   σ=0.8   σ=1     σ=1.2   σ=1.4   σ=1.6
5 PE    0.7556  0.749   0.7593  0.7568  0.7633  0.7616
10 PE   0.7549  0.7461  0.7585  0.7604  0.7608  0.7603
20 PE   0.7543  0.7423  0.7614  0.758   0.7585  0.7593
Table 5-12. Performance of ACS for different values of σ and number of PEs in hidden layer on BLD data
        σ=0.5   σ=0.8   σ=1     σ=1.2   σ=1.4   σ=1.6
5 PE    0.646   0.6806  0.681   0.6861  0.6853  0.684
10 PE   0.6596  0.6884  0.6928  0.6931  0.6941  0.6928
20 PE   0.6753  0.6992  0.6996  0.6997  0.7013  0.7007
Table 5-13. Performance of ACS for different values of σ and number of PEs in hidden layer on WBC data
        σ=0.5   σ=0.8   σ=1     σ=1.2   σ=1.4   σ=1.6
5 PE    0.9672  0.9646  0.9633  0.9648  0.9648  0.9672
10 PE   0.9665  0.9639  0.9631  0.9647  0.9634  0.9635
20 PE   0.9654  0.9621  0.9613  0.9625  0.963   0.9634
Table 5-14. Linear mixing assumption
                  THETA           ALPHA        BETA
Frequency         3.5 - 7.5 Hz    8 - 13 Hz    14 - 30 Hz
Activity          falling asleep  closed eyes  concentration
Error (mean)      7.32E-04        0.001        3.3539
Error (variance)  1.04E-06        2.06E-06     97.8081
Table 5-15. Average unmixing error
n   MRI Scans       Finger Prints  Zip Codes  Ghost Effect
3   1.41 × 10^-16   0.0031         0.006      1.77 × 10^-4
4   5.42 × 10^-4    0.0046         0.0111     7.19 × 10^-4
5   0.0022          0.0064         0.0152     0.0016
6   0.0069          0.0084         0.0186     0.0033
7   0.0158          0.0104         0.0263     0.0055
Table 5-16. Standard deviation unmixing error
n   MRI Scans    Finger Prints  Zip Codes   Ghost Effect
3   2 × 10^-16   1 × 10^-4      4 × 10^-4   1 × 10^-15
4   2 × 10^-4    4 × 10^-4      8 × 10^-4   4 × 10^-5
5   2 × 10^-5    1 × 10^-3      0.002       5 × 10^-5
6   7 × 10^-4    0.001          0.003       9 × 10^-5
7   0.001        0.002          0.004       7 × 10^-4
Table 5-17. Simulation-1 results for case study 2
m, n  N     time (sec)
6     600   9.407561
6     1000  15.34375
6     1400  30.17927
6     1800  46.28162
6     2200  62.92134
6     2600  97.71409
6     3000  112.9896
6     3400  149.6679
Table 5-18. Simulation-2 results for case study 2
m, n  N     time (sec)
6     600   9.407561
8     800   46.69568
9     900   29.97633
10    1000  58.38711
11    1100  73.56582
12    1200  132.9046
14    1400  634.7418
16    1600  4320.502
Table 5-19. Performance of correntropy minimization algorithm
                n × N:  3×300  4×400  5×500   6×600   7×700
mean                    0.035  0.036  0.039   0.027   0.025
std (×10^-2)            4.24   5.83   9.27    1.29    0.96
time (s)                0.39   59.13  139.88  284.36  527.79
[EEG traces, 10-20 system, left hemisphere: Fp1, F7, F3, T3, C3, T5, P3, O1]
Figure 5-7. EEG recordings from monkey.
[ECoG traces, 128 channels, left hemisphere]
Figure 5-8. ECoG recordings from monkey.
[Scatter plots of principal components]
Figure 5-9. fMRI data visualization. A) PCA reduction to 3 dimensions. B) PCA reduction to 2 dimensions.

[Convex hull plots of the projected fMRI data]
Figure 5-10. Convex hull PPC1 assumption. A) Convex hull representation 1. B) Convex hull representation 2.
[16 normalized source signals, 1600 samples each]
Figure 5-15. Original sparse source (normalized) for case study 1
[Figure: 16 normalized time series, samples 0-1600, amplitude -1 to 1; panel title "Given Data"]
Figure 5-16. Given mixtures of sources for case study 1
A =
0.1 0.28 -0.33 0.13 0.03 -0.17 -0.07 0.04 0.53 0.41 -0.2 0.45 0.1 -0.02 0.1 -0.03
-0.07 0.3 0.43 0.32 0.15 -0.25 0.18 0.4 0 0.16 0.23 -0.18 0.05 -0.2 0.19 -0.3
-0.27 -0.12 -0.5 0.45 0.13 -0.27 0.43 -0.29 -0.1 -0.08 0.19 -0.08 -0.2 -0.08 0.01 0.06
0.29 0.2 -0.19 -0.02 -0.37 -0.24 0.01 0.4 -0.4 -0.05 -0.01 0.16 -0.37 0.02 0.12 0.35
0.24 0.05 -0.33 0.09 -0.06 0.5 0.02 0.16 0.05 0.31 -0.07 -0.5 -0.31 0.03 0.02 -0.25
0.28 0.12 0.14 0.49 -0.13 0.08 0.27 -0.03 0.06 -0.27 -0.22 0.01 0.21 0.59 -0.13 0
0.28 -0.27 -0.24 -0.24 0.06 -0.27 0.22 0.31 -0.04 0.01 0.16 0.04 0.23 0 -0.53 -0.33
0.5 -0.28 0.15 -0.05 0.07 -0.35 -0.08 -0.23 0.39 0 0.24 -0.26 -0.28 0.11 0.21 0.16
0 -0.37 0.24 0.23 -0.52 0.01 0.07 -0.19 -0.14 0.6 0.04 0.12 0.06 -0.09 -0.14 0.04
-0.37 -0.04 -0.04 -0.16 -0.47 0.06 0.05 0.15 0.3 -0.16 0.42 0.12 -0.1 0.32 0.2 -0.28
-0.28 -0.32 -0.08 0.26 0.09 -0.28 -0.58 0.28 -0.07 0.07 -0.19 -0.16 -0.1 0.31 -0.08 -0.07
0.2 0.08 -0.14 0.27 -0.39 -0.08 -0.37 -0.17 0.03 -0.39 -0.02 -0.03 0.11 -0.47 0 -0.32
0.24 -0.2 0 0.27 0.32 0.4 -0.19 0.12 -0.16 -0.05 0.45 0.49 -0.1 0 0.1 -0.08
0.09 0.4 -0.13 -0.11 0.01 -0.17 -0.26 -0.39 -0.38 0.24 0.3 -0.07 0.19 0.36 0.03 -0.22
-0.08 0.21 -0.06 0.17 -0.07 0.12 -0.21 0.13 0.24 0 0.44 -0.21 0.2 -0.07 -0.4 0.53
-0.07 0.26 0.26 0.01 0.04 -0.04 -0.05 -0.18 0.11 0 -0.03 0.19 -0.61 0.01 -0.58 -0.18
Figure 5-17. Original mixing matrix for case study 1
SS^T/T, T = 1600:
0.051 0.04 0.037 0.05 0.037 0.055 0.049 0.037 0.053 0.047 0.034 0.038 0.049 0.037 0.037 0.052
0.04 0.051 0.041 0.051 0.039 0.06 0.054 0.04 0.057 0.052 0.034 0.041 0.05 0.044 0.044 0.057
0.037 0.041 0.053 0.049 0.038 0.056 0.05 0.047 0.055 0.05 0.032 0.039 0.047 0.039 0.039 0.05
0.05 0.051 0.049 0.067 0.047 0.069 0.063 0.048 0.068 0.06 0.039 0.048 0.058 0.047 0.046 0.066
0.037 0.039 0.038 0.047 0.05 0.055 0.05 0.037 0.053 0.047 0.031 0.039 0.045 0.037 0.036 0.052
0.055 0.06 0.056 0.069 0.055 0.094 0.074 0.055 0.079 0.071 0.047 0.057 0.068 0.055 0.056 0.079
0.049 0.054 0.05 0.063 0.05 0.074 0.078 0.049 0.071 0.064 0.042 0.053 0.062 0.05 0.05 0.071
0.037 0.04 0.047 0.048 0.037 0.055 0.049 0.051 0.054 0.049 0.032 0.038 0.046 0.039 0.038 0.05
0.053 0.057 0.055 0.068 0.053 0.079 0.071 0.054 0.084 0.068 0.043 0.055 0.063 0.053 0.052 0.074
0.047 0.052 0.05 0.06 0.047 0.071 0.064 0.049 0.068 0.07 0.04 0.049 0.059 0.048 0.047 0.067
0.034 0.034 0.032 0.039 0.031 0.047 0.042 0.032 0.043 0.04 0.035 0.032 0.043 0.031 0.031 0.044
0.038 0.041 0.039 0.048 0.039 0.057 0.053 0.038 0.055 0.049 0.032 0.054 0.047 0.038 0.038 0.055
0.049 0.05 0.047 0.058 0.045 0.068 0.062 0.046 0.063 0.059 0.043 0.047 0.068 0.046 0.046 0.064
0.037 0.044 0.039 0.047 0.037 0.055 0.05 0.039 0.053 0.048 0.031 0.038 0.046 0.051 0.041 0.052
0.037 0.044 0.039 0.046 0.036 0.056 0.05 0.038 0.052 0.047 0.031 0.038 0.046 0.041 0.05 0.052
0.052 0.057 0.05 0.066 0.052 0.079 0.071 0.05 0.074 0.067 0.044 0.055 0.064 0.052 0.052 0.085
Figure 5-18. Mixing matrices for case study 1
A =
0.1 0.45 -0.2 0.28 -0.02 0.13 -0.1 0.33 -0.1 0.17 0.04 0.53 -0.03 0.07 0.03 0.41
-0.07 -0.18 0.23 0.3 -0.2 0.32 -0.05 -0.43 -0.19 0.25 0.4 0 -0.15 -0.18 0.3 0.16
-0.27 -0.08 0.19 -0.12 -0.08 0.45 0.2 0.5 -0.01 0.27 -0.29 -0.1 -0.13 -0.43 -0.06 -0.08
0.29 0.16 -0.01 0.2 0.02 -0.02 0.37 0.19 -0.12 0.24 0.4 -0.4 0.37 -0.01 -0.35 -0.05
0.24 -0.5 -0.07 0.05 0.03 0.09 0.31 0.33 -0.02 -0.5 0.16 0.05 0.06 -0.02 0.25 0.31
0.28 0.01 -0.22 0.12 0.59 0.49 -0.21 -0.14 0.13 -0.08 -0.03 0.06 0.13 -0.27 0 -0.27
0.28 0.04 0.16 -0.27 0 -0.24 -0.23 0.24 0.53 0.27 0.31 -0.04 -0.06 -0.22 0.33 0.01
0.5 -0.26 0.24 -0.28 0.11 -0.05 0.28 -0.15 -0.21 0.35 -0.23 0.39 -0.07 0.08 -0.16 0
0 0.12 0.04 -0.37 -0.09 0.23 -0.06 -0.24 0.14 -0.01 -0.19 -0.14 0.52 -0.07 -0.04 0.6
-0.37 0.12 0.42 -0.04 0.32 -0.16 0.1 0.04 -0.2 -0.06 0.15 0.3 0.47 -0.05 0.28 -0.16
-0.28 -0.16 -0.19 -0.32 0.31 0.26 0.1 0.08 0.08 0.28 0.28 -0.07 -0.09 0.58 0.07 0.07
0.2 -0.03 -0.02 0.08 -0.47 0.27 -0.11 0.14 0 0.08 -0.17 0.03 0.39 0.37 0.32 -0.39
0.24 0.49 0.45 -0.2 0 0.27 0.1 0 -0.1 -0.4 0.12 -0.16 -0.32 0.19 0.08 -0.05
0.09 -0.07 0.3 0.4 0.36 -0.11 -0.19 0.13 -0.03 0.17 -0.39 -0.38 -0.01 0.26 0.22 0.24
-0.08 -0.21 0.44 0.21 -0.07 0.17 -0.2 0.06 0.4 -0.12 0.13 0.24 0.07 0.21 -0.53 0
-0.07 0.19 -0.03 0.26 0.01 0.01 0.61 -0.26 0.58 0.04 -0.18 0.11 -0.04 0.05 0.18 0
Figure 5-19. Recovered mixing matrix for case study 1
[Figure: 16 normalized time series, samples 0-1600, amplitude -1 to 1; panel title "Recovered Source"]
Figure 5-20. Recovered source (normalized) for case study 1
[Figure: panels A, B, C of time series, samples 0-700, amplitude -1 to 1]
Figure 5-21. Data for source extraction method. A) Original source signal. B) Mixture before adding noise. C) Mixture after adding noise.
[Figure: panels A and B of time series, samples 0-700, amplitude -1 to 1]
Figure 5-22. Recovery of sources by quadratic and correntropy loss. A) Recovered source by quadratic error minimization. B) Recovered source by proposed method.
CHAPTER 6
SUMMARY
In Chapter 3, two novel approaches integrating the concepts of correntropy into data
classification are proposed. The rationale for using the correntropic loss function
in data classification is its ability to deemphasize outliers during the learning phase.
Thus, outliers have no influence on the classification rule that is obtained. This is
an important property of the correntropy function that can be exploited in real world data
classification problems. In addition, the use of the correntropic loss function in
two different forms has been illustrated. In the first form, the kernel width is allowed to vary
during the learning phase. In order to incorporate a varying kernel width, a CS based ANN
learning method is proposed (the ACC method). The ACC method uses the simple,
well known delta rule to update the weights. However, the purpose of using this
back-propagation mechanism is to illustrate the use of CS based ANN learning. More
sophisticated methods can replace back-propagation to enhance the basic ACC
algorithm.
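The deemphasis mechanism can be sketched as follows. This is a minimal illustration under assumed names and a toy learning rate, not the exact ACC formulation: a correntropic loss and one delta-rule update of a single linear neuron, where the Gaussian factor shrinks the update for outlier-sized errors.

```python
import numpy as np

def c_loss(e, sigma):
    """Correntropic loss: bounded in [0, 1), so large (outlier) errors
    saturate instead of dominating the objective."""
    return 1.0 - np.exp(-np.square(e) / (2.0 * sigma**2))

def delta_rule_step(w, x, y, sigma, lr=0.1):
    """One delta-rule update of a linear neuron y_hat = w.x under the
    correntropic loss; the exp(...) factor vanishes for outliers."""
    e = y - w @ x
    grad = -(e / sigma**2) * np.exp(-e**2 / (2.0 * sigma**2)) * x
    return w - lr * grad

w = delta_rule_step(np.zeros(2), x=np.array([1.0, 2.0]), y=1.0, sigma=1.0)
```

Note that for a quadratic loss the gradient grows linearly with the error, whereas here it decays to zero once the error exceeds a few kernel widths.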
The second form of the correntropic loss function has a fixed kernel width.
Depending on the kernel width, the loss function may be convex or invex. However,
the ANN mapper is inherently nonconvex. Therefore, any classical gradient
descent algorithm in the ANN framework may converge to a local minimum. To avoid such
local convergence, the gradient descent method has been replaced by an SA algorithm.
Although a simple SA is used within the ANN framework, the method can
suitably incorporate other specialized forms of SA.
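A minimal sketch of the simple SA idea over a weight vector follows; this is a generic geometric-cooling SA with an assumed toy loss and schedule parameters, not the dissertation's specific variant.

```python
import numpy as np

def anneal(loss, w0, T0=1.0, cooling=0.95, steps=200, step_size=0.5, seed=0):
    """Simulated annealing: accept downhill moves always, uphill moves
    with Boltzmann probability exp(-delta/T), cooling T geometrically."""
    rng = np.random.default_rng(seed)
    w, f = w0.copy(), loss(w0)
    best_w, best_f = w.copy(), f
    T = T0
    for _ in range(steps):
        cand = w + step_size * rng.standard_normal(w.shape)
        fc = loss(cand)
        if fc < f or rng.random() < np.exp(-(fc - f) / T):
            w, f = cand, fc
            if f < best_f:
                best_w, best_f = w.copy(), f
        T *= cooling  # geometric cooling schedule
    return best_w, best_f

# Nonconvex toy loss with a poor local minimum at w = -1:
loss = lambda w: (w[0]**2 - 1)**2 + 0.3 * (w[0] - 2)**2
w, f = anneal(loss, np.array([-1.0]))
```

Starting in the local basin at w = -1, the occasional uphill acceptances let the search escape toward the better basin near w = 1, which plain gradient descent cannot do.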
Chapter 4 proposes solution methods for two major sparsity based classes of the
BSS problem. The proposed solution methods are broken down into two major steps.
The first step involves identification of the mixing matrix. Two different approaches to
identify the mixing matrix, based on the non-negativity and sparsity level of the sources,
are proposed. The second step involves extraction of the source matrix. For this step, a
correntropy based method is proposed. The proposed method can be used not only
to identify the source matrix, but also to identify the outliers in the mixture matrix. By
applying the two steps, the BSS problem in the presence of outliers can be solved
efficiently.
Experiments on binary classification show that the proposed correntropic loss
function improves the classification accuracy of ANN based classifiers. Furthermore,
the experiments show that the proposed approaches compete strongly with the
state-of-the-art SVM based classifier, which suggests that the correntropic loss
function is a substantial contender as a robust measure in risk minimization.
Moreover, the development of efficient algorithms for parameter searches in ANNs
will further enhance the importance of the correntropic loss function. Experiments on
signal separation show that the proposed method for hyperplane clustering can solve
problems up to size 16, which was unattainable with earlier methods. Furthermore, the
correntropy based source extraction method shows that a suitable kernel width can be
obtained from the contaminated data, which separates the outliers from the good
data points.
6.1. Criticism
Robust methods have always been criticized for their loss of efficiency and increased
computational complexity. Theoretical results like those shown by Fisher [30] always
support the usage of the quadratic loss function. Moreover, the quadratic loss function is
easy to optimize and is efficient in model parameter estimation. Thus, the notion
of a smoothing effect (i.e., the effect of a few outliers can be subdued by the presence of
a large number of good data points) has always been used to counter the idea of robust
methods.
There are two basic types of criticism of the usage of BSS approaches in data
analysis. The primary comment concerns the loss of order in the sources. As discussed
in Section 4.1, the scaling issue of BSS methods can be overcome by using suitable
normalization approaches. However, identifying appropriate sources in general is not
possible. Makeig et al. [58] discussed this issue, and stated the importance
of knowing "what the sources are" instead of "where the sources are" in
understanding cortical activity. Furthermore, the underdetermined case is usually
resolved by experimental design, where artifacts are introduced into the data while
recording to reduce the underdeterminacy.
The other type of criticism of BSS approaches concerns the validity of the
assumptions imposed on the mixing and source matrices. The smearing of a signal by
volume conduction is instantaneous, so the no-delay assumption is not much of a
concern. The linear mixing assumption is the critical one and is hard to validate
experimentally. However, superposition of signals (a typical natural phenomenon) can be
used to support the notion of linear mixing. In addition, the assumptions imposed
on the source signals are often objected to. Statistical independence among neuronal
signals is hard to justify. Therefore, researchers working with ICA directed their research
toward justifying statistical independence among artifacts and neuronal signals. On the
other hand, the assumptions of the novel SCA approaches are yet to be experimentally
validated on neurological data. Furthermore, sparsification methods transforming a
given problem into a sparse source problem are yet to be explored.
6.2. Conclusion
Conventionally, a quadratic loss function is used as a measure of similarity.
Rockafellar et al. [79] proposed four axioms for an error measure: the error measure is
strictly positive for a non-zero error, positive homogeneity, subadditivity, and lower
semicontinuity. Homogeneity and robustness are contradictory, and cannot coexist in a
single function. Thus in this work, the following properties favorable for a robust error
measure are proposed: (1) the error measure is strictly positive for a non-zero error, (2)
generalized convexity, (3) differentiability, (4) symmetry, and (5) lower semicontinuity.
One of the goals of this work is to propose a specific robust measure, called the
correntropic loss function, that calculates the similarity between two random variables y
and a, and satisfies the above five properties. Furthermore, similar to the generalization
of SVMs from the basic formulation to the kernel based soft margin formulation,
correntropy based ANNs can be viewed as a generalized form of ANNs (both in
regression [72] and classification). Rigorous experimental results in Chapter 5
demonstrate the usability of correntropy based ANNs in real world data classification
problems.
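The practical difference between the quadratic and correntropic measures can be illustrated on a simple location-estimation task. The function name, kernel width, and fixed-point scheme below are illustrative assumptions, not the dissertation's exact algorithm: the quadratic-loss optimum is the mean, which a single gross outlier drags away, while the correntropy weights discount the outlier almost entirely.

```python
import numpy as np

def correntropy_location(x, sigma=1.0, iters=50):
    """Fixed-point iteration for a correntropy-optimal location estimate;
    Gaussian weights decay with distance, so gross outliers get almost
    no vote (illustrative sketch)."""
    m = np.median(x)                        # robust starting point
    for _ in range(iters):
        w = np.exp(-(x - m) ** 2 / (2.0 * sigma**2))
        m = np.sum(w * x) / np.sum(w)       # weighted-mean update
    return m

data = np.concatenate([np.full(9, 1.0), [100.0]])   # one gross outlier
print(np.mean(data))               # quadratic-loss estimate: 10.9
print(correntropy_location(data))  # stays at 1.0
```

The outlier shifts the mean by almost an order of magnitude, while the correntropy estimate is unchanged, which is the deemphasis behavior exploited throughout this work.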
BSS approaches based on ICA are well known in the signal separation literature.
However, sparsity based BSS methods are relatively new, and their potential is yet to be
explored in the area of signal processing. Through the systematic overview presented in
this dissertation, awareness of the novel sparsity based BSS methods is increased,
and the differences between ICA and SCA methods are highlighted. The primary
difference is that ICA based methods are mostly suitable for artifact filtering, whereas
SCA based methods may be suitable for separating pure sources, which are not
necessarily statistically independent. Similar to EEG/MEG analysis with ICA, where
artifacts are induced into the signal via strategic experiments, efficient experiments
for SCA can be designed, where sparsity is induced into the source signals.
Furthermore, sparsification methods (like wavelet transforms) that can efficiently
sparsify source signals can also be used to analyze non-sparse source signals. To sum
up, SCA based methods may open a new door to understanding the mysteries of the
brain.
To conclude, the computational complexity of robust methods will always be an
issue when compared to traditional methods. However, properties like invexity for
robust measures, and sample selection strategies for robust algorithms, will overcome
the issues related to computational complexity to a certain extent. For practical
scenarios, robust methods are preferable to traditional data analysis methods in terms
of solution quality. Furthermore, even in theoretical scenarios, the performance of
robust methods in terms of solution quality is competitive with the traditional methods.
APPENDIX
GENERALIZED CONVEXITY
In the following discussion, the functions are assumed to be twice differentiable.
Obviously, convex analysis is not confined to the differentiable functions, and the
interested readers may refer to [5, 6, 11, 61, 78] for comprehensive details. An important
building block of convex analysis is the notion of a convex set. A set is said to be convex,
if the line segment joining any two points of the set completely lie within the set.
Definition 1. Let f : S 7→ R be a twice differentiable function, where S is a nonempty
convex subset of Rn. The function f is said to be convex, if and only if, the Hessian
matrix of f is positive semidefinite at each point in S.
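Definition 1 can be checked numerically at sample points; a small sketch follows, in which the example functions and tolerance are assumptions for illustration.

```python
import numpy as np

def is_convex_at(hessian, x, tol=1e-9):
    """Test Definition 1 at one sample point: the Hessian must be
    positive semidefinite, i.e. all eigenvalues >= 0 (up to tol)."""
    return bool(np.all(np.linalg.eigvalsh(hessian(x)) >= -tol))

# f(x, y) = x^2 + y^2: Hessian is 2I everywhere, so f is convex;
# g(x, y) = x^2 - y^2: indefinite Hessian, so g is not convex.
hess_f = lambda x: np.array([[2.0, 0.0], [0.0, 2.0]])
hess_g = lambda x: np.array([[2.0, 0.0], [0.0, -2.0]])
print(is_convex_at(hess_f, np.zeros(2)))  # True
print(is_convex_at(hess_g, np.zeros(2)))  # False
```

Verifying convexity on all of S requires the check at every point, of course; a pointwise test can only certify non-convexity or support a conjecture.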
Duality and optimality conditions are two important theories in the field of
optimization that are nurtured by convexity [8]. Convexity added the crucial brick of no
duality gap to duality theory. Furthermore, it is convexity that provided a ladder for a
local optimal solution to reach the status of a global optimal solution. These two theories
are the backbone of almost all optimization algorithms. However, there has always
been curiosity among researchers to relax the strict requirements of convexity, since
most practical problems tend to be non-convex. As a first successful attempt,
Mangasarian [59] generalized the notion of convexity by proposing another class of
functions called pseudoconvex functions.
Definition 2. Let f : S 7→ R be a differentiable function, where S is a nonempty subset
of Rn. The function f is said to be pseudoconvex:
if ∇f (x1)T (x2 − x1) ≥ 0 then f (x2) ≥ f (x1) ∀ x1, x2 ∈ S
Pseudoconvex functions do not require the positive semidefinite criterion imposed
on convex functions. Furthermore, pseudoconvex functions preserve tractability,
i.e., a local minimum of a pseudoconvex function on a convex domain is a global
minimum. Thus, pseudoconvexity extended the optimality conditions to a larger class
of functions. Pseudoconvexity in the objective function, along with quasiconvexity
in the constraints were assumed to be the weakest conditions that can be imposed so
that the Karush-Kuhn-Tucker (KKT) conditions are sufficient (under certain constraint
qualifications) [5, 61]. In general, however, pseudoconvex functions fail with respect
to extendability: the non-negative weighted sum of pseudoconvex functions may not
be pseudoconvex. Therefore, the pseudoconvex
theory had its own limitations. There has been continuous effort to relax the convexity
criterion, yet preserve the tractability and the extendability characteristics. Many other
ideas to extend the concept of tractability can be seen in the literature [11, 51]. One
practically successful extension of convexity is invexity [6]. Hanson [41] proposed the
characteristics of functions whose every local minimum is a global minimum, and
Craven [23] subsequently named such functions invex functions.
Definition 3. Let f : S 7→ R be a differentiable function, where S is a nonempty subset
of Rn. The function f is said to be invex, if and only if:
f (x2) ≥ f (x1) + η(x1, x2)T∇f (x1) ∀ x1, x2 ∈ S (A–1)
where η : S × S 7→ Rn is some arbitrary vector function.
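The correntropic loss itself illustrates the distinction: for a fixed kernel width it is not convex, yet a well-known characterization (for differentiable functions, invexity is equivalent to every stationary point being a global minimizer) shows it is invex, since its only stationary point is the global minimum at zero error. A numerical sketch, where the grid resolution and tolerances are assumptions:

```python
import numpy as np

def c_loss(e, sigma=1.0):
    # correntropic loss with fixed kernel width sigma
    return 1.0 - np.exp(-e**2 / (2.0 * sigma**2))

def c_grad(e, sigma=1.0):
    # derivative of c_loss with respect to e
    return (e / sigma**2) * np.exp(-e**2 / (2.0 * sigma**2))

e = np.linspace(-3.0, 3.0, 601)                 # grid containing e = 0
second = np.gradient(np.gradient(c_loss(e), e), e)

not_convex = bool(np.any(second < -1e-4))       # curvature < 0 for |e| > sigma
stationary = e[np.abs(c_grad(e)) < 1e-8]        # on this grid: only e = 0
```

The negative curvature away from the origin rules out convexity, while the single stationary point (the global minimum) is exactly what invexity permits.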
Invex functions not only provide a criterion of tractability, but also provide a criterion
of extendability. That is, a local minimum of an invex function over a convex domain will
be a global minimum, and there exists a criterion under which the non-negative sum of
invex functions will be invex. It may be argued that invexity comes at a price: unlike
pseudoconvex functions, a sub-level set of an invex function may not be convex.
However, invex functions preserve both tractability and extendability, and it
is due to invexity that a huge class of functions can now be analyzed with respect to
the optimality conditions. Therefore, invexity is one of the weakest properties in convex
analysis that extends the theory of optimization in concluding the global optimality of a
feasible solution.
Differentiability based definitions are used here because the correntropic loss
function is differentiable. There are other definitions and properties of the
above stated functions, and readers are directed to [5, 11] for a comprehensive list of
definitions and properties.
Table A-1. Generalized convexity (⋆ under constraint qualification)
Function Type   Tractability   Optimality Conditions   Strong Duality   Extendability
Convex          True           Sufficient⋆             Exists           Always
Pseudoconvex    True           Sufficient⋆             Exists           No known criteria
Invex           True           Sufficient⋆             Exists           Criterion exists
REFERENCES
[1] Aharon, M., Elad, M., & Bruckstein, A. (2006). On the uniqueness of overcomplete dictionaries, and a practical way to retrieve them. Linear Algebra and its Applications, 416(1), 48–67.
[2] Alizamir, S., Rebennack, S., & Pardalos, P. (2008). Improving the neighborhood selection strategy in simulated annealing using the optimal stopping problem. Simulated Annealing, C. M. Tan (Ed.), (pp. 363–382).
[3] Anthony, M., & Bartlett, P. (2009). Neural Network Learning: Theoretical Foundations. Cambridge University Press.
[4] Antonov, G., & Katkovnik, V. (1972). Generalization of the concept of statistical gradient. Avtomat. i Vycisl. Tehn. (Riga), 4, 25–30.
[5] Bazaraa, M., Sherali, H., & Shetty, C. (2006). Nonlinear Programming: Theory and Algorithms. Wiley-Interscience.
[6] Ben-Israel, A., & Mond, B. (1986). What is invexity? J. Austral. Math. Soc. Ser. B, 28(1), 1–9.
[7] Bereanu, B. (1972). Quasi-convexity, strictly quasi-convexity and pseudo-convexity of composite objective functions. ESAIM: Mathematical Modelling and Numerical Analysis - Modelisation Mathematique et Analyse Numerique, 6(R1), 15–26.
[8] Bertsekas, D. (2003). Convex Analysis and Optimization. Athena Scientific, Belmont.
[9] Boser, B., Guyon, I., & Vapnik, V. (1992). A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, (pp. 144–152). ACM.
[10] Bradley, P., & Mangasarian, O. (2000). k-plane clustering. Journal of Global Optimization, 16(1), 23–32.
[11] Cambini, A., & Martein, L. (2008). Generalized Convexity and Optimization: Theory and Applications, vol. 616. Springer.
[12] Capel, D. (2005). An effective bail-out test for RANSAC consensus scoring. In Proc. BMVC, (pp. 629–638).
[13] Catoni, O. (1996). Metropolis, simulated annealing, and iterated energy transformation algorithms: theory and experiments. Journal of Complexity, 12(4), 595–623.
[14] Chan, T.-H., Ma, W.-K., Chi, C.-Y., & Wang, Y. (2008). A convex analysis framework for blind separation of non-negative sources. Signal Processing, IEEE Transactions on, 56(10), 5120–5134.
[15] Chen, B., & Principe, J. (2012). Maximum correntropy estimation is a smoothed MAP estimation. Signal Processing Letters, IEEE, 19(8), 491–494.
[16] Chum, O., & Matas, J. (2002). Randomized RANSAC with Td,d test. In Proc. British Machine Vision Conference, vol. 2, (pp. 448–457).
[17] Chum, O., & Matas, J. (2005). Matching with PROSAC - progressive sample consensus. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1, (pp. 220–226). IEEE.
[18] Chum, O., & Matas, J. (2008). Optimal randomized RANSAC. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 30(8), 1472–1482.
[19] Chum, O., Matas, J., & Kittler, J. (2003). Locally optimized RANSAC. In Pattern Recognition, (pp. 236–243). Springer.
[20] Cichocki, A., & Amari, S. (2002). Blind Signal and Image Processing. Wiley Online Library.
[21] Cichocki, A., Zdunek, R., & Amari, S. (2006). New algorithms for non-negative matrix factorization in applications to blind source separation. In Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on, vol. 5, (pp. V–V). IEEE.
[22] Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
[23] Craven, B. (1981). Duality for generalized convex fractional programs. Generalized Concavity in Optimization and Economics, (pp. 437–489).
[24] Daubechies, I., Roussos, E., Takerkart, S., Benharrosh, M., Golden, C., D'Ardenne, K., Richter, W., Cohen, J., & Haxby, J. (2009). Independent component analysis for brain fMRI does not select for independence. Proceedings of the National Academy of Sciences, 106(26), 10415–10422.
[25] Eddington, S. (1914). Stellar Movements and the Structure of the Universe. Macmillan and Company, Limited.
[26] Erdogmus, D., Principe, J., & Hild II, K. E. (2002). Beyond second-order statistics for learning: A pairwise interaction model for entropy estimation. Natural Computing, 1(1), 85–108.
[27] Fan, R., Chen, P., & Lin, C. (2005). Working set selection using second order information for training support vector machines. The Journal of Machine Learning Research, 6, 1889–1918.
[28] Fischler, M., & Bolles, R. (1980). Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Tech. rep., DTIC Document.
[29] Fischler, M., & Bolles, R. (1981). Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6), 381–395.
[30] Fisher, R., et al. (1920). A mathematical examination of the methods of determining the accuracy of an observation by the mean error, and by the mean square error. Monthly Notices of the Royal Astronomical Society, 80, 758–770.
[31] Geary, R. (1947). Testing for normality. Biometrika, 34(3/4), 209–242.
[32] Georgiev, P., Pardalos, P., & Theis, F. (2007). A bilinear algorithm for sparse representations. Computational Optimization and Applications, 38(2), 249–259.
[33] Georgiev, P., & Theis, F. (2004). Blind source separation of linear mixtures with singular matrices. Independent Component Analysis and Blind Signal Separation, (pp. 121–128).
[34] Georgiev, P., Theis, F., & Cichocki, A. (2005). Sparse component analysis and blind source separation of underdetermined mixtures. Neural Networks, IEEE Transactions on, 16(4), 992–996.
[35] Georgiev, P., Theis, F., Cichocki, A., & Bakardjian, H. (2007). Sparse component analysis: a new tool for data mining. Data Mining in Biomedicine, (pp. 91–116).
[36] Georgiev, P., Theis, F., & Ralescu, A. (2007). Identifiability conditions and subspace clustering in sparse BSS. Independent Component Analysis and Signal Separation, (pp. 357–364).
[37] Gribonval, R., & Schnass, K. (2010). Dictionary identification - sparse matrix-factorization via l1-minimization. Information Theory, IEEE Transactions on, 56(7), 3523–3539.
[38] Gunn, S. (1998). Support vector machines for classification and regression. ISIS Technical Report, 14.
[39] Hampel, F. (1973). Robust estimation: A condensed partial survey. Probability Theory and Related Fields, 27(2), 87–104.
[40] Hampel, F., Ronchetti, E., Rousseeuw, P., & Stahel, W. (2011). Robust Statistics: The Approach Based on Influence Functions, vol. 114. Wiley.
[41] Hanson, M. (1981). On sufficiency of the Kuhn-Tucker conditions. Journal of Mathematical Analysis and Applications, 80(2), 545–550.
[42] He, R., Zheng, W.-S., & Hu, B.-G. (2011). Maximum correntropy criterion for robust face recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(8), 1561–1576.
[43] He, R., Zheng, W.-S., Hu, B.-G., & Kong, X.-W. (2011). A regularized correntropy framework for robust pattern recognition. Neural Computation, 23(8), 2074–2100.
[44] Heisele, B., Ho, P., & Poggio, T. (2001). Face recognition with support vector machines: Global versus component-based approach. In Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, vol. 2, (pp. 688–694). IEEE.
[45] Herault, J., Jutten, C., & Ans, B. (1985). Detection de grandeurs primitives dans un message composite par une architecture de calcul neuromimetique en apprentissage non supervise. In 10 Colloque sur le traitement du signal et des images, FRA, 1985. GRETSI, Groupe d'Etudes du Traitement du Signal et des Images.
[46] Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366.
[47] Huber, P. (1981). Robust statistics.
[48] Huber, P. (1997). Robust Statistical Procedures, vol. 27. SIAM.
[49] Huber, P. (2012). Data Analysis: What Can Be Learned from the Past 50 Years, vol. 874. Wiley.
[50] Hyvarinen, A., & Oja, E. (2000). Independent component analysis: algorithms and applications. Neural Networks, 13(4), 411–430.
[51] Khanh, P. (1995). Invex-convexlike functions and duality. Journal of Optimization Theory and Applications, 87(1), 141–165.
[52] Kim, K., Jung, K., Park, S., & Kim, H. (2002). Support vector machines for texture classification. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 24(11), 1542–1550.
[53] Kirkpatrick, S., Gelatt, C., & Vecchi, M. (1983). Optimization by simulated annealing. Science, 220(4598), 671.
[54] Kreutz-Delgado, K., Murray, J., Rao, B., Engan, K., Lee, T., & Sejnowski, T. (2003). Dictionary learning algorithms for sparse representation. Neural Computation, 15(2), 349–396.
[55] Liu, W., Pokharel, P., & Principe, J. (2006). Error entropy, correntropy and M-estimation. In Machine Learning for Signal Processing, 2006. Proceedings of the 2006 16th IEEE Signal Processing Society Workshop on, (pp. 179–184). IEEE.
[56] Liu, W., Pokharel, P., & Principe, J. (2007). Correntropy: properties and applications in non-Gaussian signal processing. Signal Processing, IEEE Transactions on, 55(11), 5286–5298.
[57] Lundy, M., & Mees, A. (1986). Convergence of an annealing algorithm. Mathematical Programming, 34(1), 111–124.
[58] Makeig, S., Jung, T.-P., Ghahremani, D., Bell, A., & Sejnowski, T. (1996). What (not where) are the sources of the EEG? In The 18th Annual Meeting of The Cognitive Science Society.
[59] Mangasarian, O. (1965). Pseudo-convex functions. Journal of the Society for Industrial & Applied Mathematics, Series A: Control, 3(2), 281–290.
[60] Mangasarian, O. (1968). Convexity, pseudo-convexity and quasi-convexity of composite functions.
[61] Mangasarian, O. (1994). Nonlinear programming. Society for Industrial and Applied Mathematics, Philadelphia, PA.
[62] McCulloch, W., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biology, 5(4), 115–133.
[63] Mehrotra, K., Mohan, C., & Ranka, S. (1997). Elements of artificial neural networks. The MIT Press.
[64] Michalewicz, Z., & Fogel, D. (2004). How to solve it: modern heuristics. Springer-Verlag New York Inc.
[65] Michie, D., Spiegelhalter, D., & Taylor, C. (Eds.) (1994). Machine learning, neural and statistical classification. Ellis Horwood Series in Artificial Intelligence. New York, NY: Ellis Horwood.
[66] Minsky, M., & Seymour, P. (1988). Perceptrons. In Neurocomputing: foundations of research, (pp. 157–169). MIT Press.
[67] Naanaa, W., & Nuzillard, J. (2005). Blind source separation of positive and partially correlated data. Signal Processing, 85(9), 1711–1722.
[68] Nister, D. (2005). Preemptive RANSAC for live structure and motion estimation. Machine Vision and Applications, 16(5), 321–329.
[69] Pardalos, P., Boginski, V., & Vazacopoulos, A. (2007). Data mining in biomedicine. Springer Verlag.
[70] Pardalos, P., Pitsoulis, L., Mavridou, T., & Resende, M. (1995). Parallel search for combinatorial optimization: genetic algorithms, simulated annealing, tabu search and GRASP. Parallel Algorithms for Irregularly Structured Problems, (pp. 317–331).
[71] Parzen, E. (1962). On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3), 1065–1076.
[72] Principe, J. (2010). Information theoretic learning: Renyi's entropy and kernel perspectives. Springer Verlag.
[73] Principe, J., Xu, D., & Fisher, J. (2000). Information theoretic learning. Unsupervised Adaptive Filtering, 1, 265–319.
[74] Raguram, R., Frahm, J.-M., & Pollefeys, M. (2008). A comparative analysis of RANSAC techniques leading to adaptive real-time random sample consensus. In Computer Vision–ECCV 2008, (pp. 500–513). Springer.
[75] Reeves, C. (1993). Modern heuristic techniques for combinatorial problems. John Wiley & Sons, Inc.
[76] Renyi, A. (1965). On the foundations of information theory. Revue de l'Institut International de Statistique, (pp. 1–14).
[77] Robbins, H., & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, (pp. 400–407).
[78] Rockafellar, R. (1997). Convex analysis, vol. 28. Princeton University Press.
[79] Rockafellar, R., Uryasev, S., & Zabarankin, M. (2008). Risk tuning with generalized linear regression. Mathematics of Operations Research, 33(3), 712–729.
[80] Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386.
[81] Rubinov, A., & Ugon, J. (2003). Skeletons of finite sets of points. Submitted paper.
[82] Rubinstein, R. (1983). Smoothed functionals in stochastic optimization. Mathematics of Operations Research, (pp. 26–33).
[83] Santamaria, I., Pokharel, P., & Principe, J. (2006). Generalized correlation function: Definition, properties, and application to blind equalization. Signal Processing, IEEE Transactions on, 54(6), 2187–2197.
[84] Scholkopf, B., Burges, C., & Vapnik, V. (1995). Extracting support data for a given task. In Proceedings, First International Conference on Knowledge Discovery & Data Mining, (pp. 252–257). AAAI Press, Menlo Park, CA.
[85] Shannon, C. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423, 623–656.
[86] Singh, A., & Principe, J. (2010). A loss function for classification based on a robust similarity metric. In Neural Networks (IJCNN), The 2010 International Joint Conference on, (pp. 1–6). IEEE.
[87] Styblinski, M., & Tang, T. (1990). Experiments in nonconvex optimization: stochastic approximation with function smoothing and simulated annealing. Neural Networks, 3(4), 467–483.
[88] Sun, Y., & Xin, J. (2012). Nonnegative sparse blind source separation for NMR spectroscopy by data clustering, model reduction, and l1 minimization. SIAM Journal on Imaging Sciences, 5(3), 886–911.
[89] Syed, M., Georgiev, P., & Pardalos, P. (2012). A hierarchical approach for sparse source blind signal separation problem. Computers & Operations Research, available online.
[90] Syed, M., Georgiev, P., & Pardalos, P. (2013). Blind signal separation methods in computational neuroscience. In Neuromethods. Springer, to appear.
[91] Syed, M., & Pardalos, P. (2013). Neural network models in combinatorial optimization. In Handbook of Combinatorial Optimization. Springer, to appear.
[92] Syed, M., Pardalos, P., & Principe, J. (2013). On the optimization of the correntropic loss function in data analysis. Optimization Letters, available online.
[93] Syed, M., Principe, J., & Pardalos, P. (2012). Correntropy in data classification. In Dynamics of Information Systems: Mathematical Foundations, (pp. 81–117). Springer.
[94] Te-Won, L. (1998). Independent component analysis, theory and applications. Boston: Kluwer Academic Publishers.
[95] Tong, S., & Koller, D. (2002). Support vector machine active learning with applications to text classification. The Journal of Machine Learning Research, 2, 45–66.
[96] Tordoff, B., & Murray, D. (2002). Guided sampling and consensus for motion estimation. In Computer Vision–ECCV 2002, (pp. 82–96). Springer.
[97] Tukey, J. (1960). A survey of sampling from contaminated distributions. Contributions to Probability and Statistics, 2, 448–485.
[98] Tukey, J. (1962). The future of data analysis. The Annals of Mathematical Statistics, 33(1), 1–67.
[99] Vapnik, V. (1999). An overview of statistical learning theory. Neural Networks, IEEE Transactions on, 10(5), 988–999.
[100] Vapnik, V. (2000). The nature of statistical learning theory. Springer Verlag.
[101] Vapnik, V., Golowich, S., & Smola, A. (1996). Support vector method for function approximation, regression estimation, and signal processing. In Advances in Neural Information Processing Systems 9.
[102] Weston, J., & Watkins, C. (1998). Multi-class support vector machines. Technical Report CSD-TR-98-04, Department of Computer Science, Royal Holloway, University of London.
[103] Yang, Z., Xiang, Y., Rong, Y., & Xie, S. (2013). Projection-pursuit-based method for blind separation of nonnegative sources. Neural Networks and Learning Systems, IEEE Transactions on, 24(1), 47–57.
[104] Zhang, J., Xanthopoulos, P., Chien, J., Tomaino, V., & Pardalos, P. (2011). Minimum prediction error models and causal relations between multiple time series. Wiley Encyclopedia of Operations Research and Management Science, J. J. Cochran (ed.), 3, 1843–1850.
BIOGRAPHICAL SKETCH
Naqeebuddin Mujahid Syed received a Bachelor of Engineering (BE) in
Mechanical Engineering from Muffakham Jah College of Engineering and Technology
(MJCET), Osmania University (OU), Hyderabad, India, in 2005. He was awarded
two Gold Medals in BE (Mechanical Engineering), from MJCET as well as from OU.
He received a Master of Science (MS) in Systems Engineering (SE) from King Fahd
University of Petroleum and Minerals (KFUPM), Dhahran, Saudi Arabia, in 2007. He
received the Outstanding Academic Performance award for the academic year
2006–07 from the College of Computer Science & Engineering (CCSE) at KFUPM.
From 2007 to 2009, he served as a Lecturer-B in the SE Dept. at KFUPM. He received
a Doctor of Philosophy (PhD) in Operations Research from the Industrial and Systems
Engineering (ISE) Department at the University of Florida (UFL). During his PhD, he
was awarded the Outstanding International Student award at UFL for the years
2009, 2011, and 2012. In addition, he received the Graduate Student
Teaching award from the ISE Dept. at UFL.