
ON FEATURE SELECTION IN DATA MINING

By

PAUL FRANCIS THOTTAKKARA

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT

OF THE REQUIREMENTS FOR THE DEGREE OF ENGINEER

UNIVERSITY OF FLORIDA

2013


© 2013 Paul Francis Thottakkara


To my parents Alice Francis and Francis T. Paul


ACKNOWLEDGMENTS

This thesis would not have been possible without the guidance and the help of

several individuals who in one way or another contributed and extended their valuable

assistance in the preparation and completion of this study. Foremost, I would like to

express my gratitude to my adviser Distinguished Prof. Panos M. Pardalos for his

contribution and guidance throughout my research. Besides my advisor, I would like to

thank the rest of my thesis committee, Prof. William Hager and Dr. Petar Momcilovic, for

their help and encouragement.

My sincere thanks go to my lab members Vijay Pappu, Dr. Pando G. Georgiev, Mohsen Rahmani, Michael Fenn and Syed Mujahid for supporting my research work. I would

like to extend a big thank you to my friends Jorge Sefair, Zehra Melis Teksan, Rachna

Manek, Radhika Medury, Amrutha Pattamatta, Mini Manchanda, Vishnu Narayanan,

Rahul Subramany, Gokul Bhat, Vijaykumar Ramaswamy for the stimulating thoughts and

encouragement.

Finally, I thank my parents Alice Francis and Francis T. Paul and my sister Neetha

Francis for all the motivation and support throughout my life.


TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 INTRODUCTION TO DATA MINING
  1.1 What is Data Mining
  1.2 Role of Feature Selection in Data Mining

2 LEAST SQUARES FORMULATION FOR PROXIMAL SUPPORT VECTOR MACHINES
  2.1 Data Classification and Separating Hyperplanes
    2.1.1 Support Vector Machines
    2.1.2 Proximal Support Vector Machines
    2.1.3 Twin Support Vector Machines
  2.2 Importance of Least Square Formulations
  2.3 Least Square Formulation for Generating Proximal Planes
    2.3.1 Using Spectral Decomposition
    2.3.2 Special Case Eigenvalue Problem
  2.4 Results and Observations

3 JOINT SPARSE FEATURE SELECTION
  3.1 Dimensionality Reduction
  3.2 L21 Norm and Feature Selection
  3.3 Results and Observations

4 FEATURE SELECTION IN UNLABELLED DATASETS
  4.1 Introduction to Raman Spectra Signals
  4.2 Dataset
  4.3 Data Preprocessing
    4.3.1 Remove Unnecessary Features
    4.3.2 Noise and Background Subtraction
    4.3.3 Peak Selection
  4.4 Clustering
  4.5 K-means Clustering
  4.6 Spectral Clustering
  4.7 Sparse Clustering for Feature Selection
  4.8 Observations

5 DISCUSSION AND CONCLUSION

REFERENCES

BIOGRAPHICAL SKETCH


LIST OF TABLES

2-1 GEV and PSVM-LS formulation classification accuracy
3-1 PCA and JS method accuracy
3-2 JS method: top features selected by threshold
3-3 JS method: top T features
4-1 Weights ω_j and corresponding feature (wavenumber)


LIST OF FIGURES

4-1 K-means clustering, TRETScan6 scan of dimension 120 × 21
4-2 K-means clustering, C3AScan2 scan of dimension 125 × 50
4-3 K-means, C3AScan2 scan with nucleus marked
4-4 Spectral clustering, TRETScan6 scan of dimension 120 × 21
4-5 Spectral clustering, C3AScan2 scan of dimension 125 × 50
4-6 Cluster using top 15 features, C3AScan2 scan


Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the

Requirements for the Degree of Engineer

ON FEATURE SELECTION IN DATA MINING

By

Paul Francis Thottakkara

August 2013

Chair: Panos M. Pardalos
Major: Industrial and Systems Engineering

Analysing and extracting useful information from high dimensional datasets challenges the frontiers of statistical tools and methods, and traditional methods tend to fail when dealing with such datasets. Small sample sizes have always been a problem in statistical testing, and the problem is aggravated in high dimensional data, where the number of features is comparable to or larger than the number of samples. The power of a statistical test is its ability to reject a false null hypothesis, and the sample size is a major factor determining the probability of a type II error and hence the validity of the conclusions drawn. One efficient way of handling high dimensional datasets is therefore to reduce their dimension through feature selection, so that valid statistical conclusions can be drawn more easily.

This work focuses on different aspects of feature selection in data mining, one of the active research areas in the field. The main idea behind feature selection methods is to identify a subset of the original input features that is pivotal for data classification or understanding; feature selection thereby helps in eliminating features with little or no predictive information. The discussion in this thesis falls into three major parts: the first introduces a least squares formulation for proximal support vector machines, the second introduces the l2,1 norm as a method to induce sparsity, and the last discusses the applicability of sparse clustering to Raman spectroscopy data.


CHAPTER 1
INTRODUCTION TO DATA MINING

1.1 What is Data Mining

Generally, data mining can be described as the science of analysing data to extract useful information using statistical tools and methods. The prospect of understanding a data-driven system from its underlying data has been a major motivating factor in promoting research in the field. Immense growth in technology has made data collection cheaper and more efficient, while an exponential rise in computational power has made data processing much faster and more economical. These advances have further accelerated research in data mining.

Data classification, feature selection and outlier detection are the major research areas in the field of data mining. Data classification problems focus on learning from a set of training data and then using that information to predict the nature of any new data point. Data classification can be a supervised or an unsupervised learning method, depending on the availability of label information for the training data. A supervised classification algorithm studies a labelled training set (the class labels of the training data are available) and generates a classification model to classify any new data points observed. Unsupervised algorithms, on the other hand, try to find patterns in unlabelled training data.

In any statistical analysis, the three major factors of concern are statistical accuracy, model interpretability and computational complexity, and for any classification model it is necessary to ensure that none of them is unduly compromised. A set of data points is normally expressed as a matrix, where each column represents a data point and the rows represent features. Standard statistical methods require a number of data points that is large compared to the feature space dimension in order to make valid statistical inferences, so standard datasets have a large column dimension compared to the row dimension. Research in the past two decades has paved the way for efficient data mining algorithms that perform well on such standard datasets. However, most traditional classification models behave poorly when handling high dimensional datasets, i.e. datasets where the number of features is comparable to or larger than the number of data points. One of the prime reasons for this poor performance is the compromise in statistical accuracy and computational complexity forced by the higher dimensional feature space. Another concern with data classification in high dimensional space is the large amount of collinearity between features, which can result in wrong model selection [7]. Overfitting and higher noise levels are also associated with high dimensional datasets. These drawbacks of higher dimensional data in data mining are collectively referred to as the curse of dimensionality.

1.2 Role of Feature Selection in Data Mining

Feature selection is one of the prime focuses of data mining. Given a dataset, feature selection can be described as the process of selecting a subset of features for use in further data analysis. This selected subset is expected to capture the maximum information present in the dataset, i.e. it should contain the most prominent features for model construction. Feature selection is particularly important for high dimensional datasets since it reduces dimensionality and thereby mitigates the effects of the curse of dimensionality. Further, in many real life systems, feature selection is very important for understanding the behaviour and performance of the system. Especially in biomedical applications, feature selection can play a pivotal role in identifying biomarkers. In a disease classification problem in a genomic study, for example, feature selection techniques can identify the genes that differentiate diseased and healthy cells. This not only helps the data analyst in reducing the data dimension, but is also a huge breakthrough for biologists seeking to understand the biological system and identify the disease triggering genes.


CHAPTER 2
LEAST SQUARES FORMULATION FOR PROXIMAL SUPPORT VECTOR MACHINES

2.1 Data Classification and Separating Hyperplanes

A hyperplane in an n-dimensional vector space can be defined as a flat subset of dimension n−1; it separates the vector space into two disjoint half spaces. Many data classification algorithms focus on finding hyperplanes that separate the data into different classes or that assist in approximating the data. One of the earliest algorithms in machine learning was the perceptron, which generates a separating hyperplane by minimizing the distance of misclassified points to the decision boundary. Perceptron methods gained huge momentum and remained the major method for more than a decade. However, the algorithm had a number of issues, such as the existence of multiple separating hyperplanes, slow convergence and failure on inseparable data, which limited its applicability to complex and large datasets. This motivated the development of more advanced and robust algorithms that could efficiently handle complex and large data. One of these was the support vector machine, which produces efficient and robust classification models, is effective in handling complex datasets and attains lower generalization error.

2.1.1 Support Vector Machines

The support vector machine (SVM) is a supervised data classification method introduced by Vladimir Vapnik and coworkers [4, 32]. The basic idea of SVM is to generate a hyperplane that separates the data points of the two classes. If the two classes are linearly separable, the standard SVM generates a hyperplane that divides the input space into two disjoint half spaces, with each class belonging to one of the half spaces. As there can exist more than one separating hyperplane, the SVM algorithm selects the hyperplane that is farthest from its closest data points. Any new data point is then assigned to a class based on the half space in which it lies.


2.1.2 Proximal Support Vector Machines

Proximal Support Vector Machine (PSVM) introduced by Fung and Mangasarian

can be considered closely related to the SVM Classifier [10]. Standard SVM classifies

points based on their location in the disjoint subspaces generated by the hyperplane

while PSVM classifies points based on their proximity to two parallel hyperplanes. The

objective of PSVM is to generate two parallel hyperplanes with each plane closest

to one class while being farthest from the other class. Later Mangasarian and Wild

introduced an extension to PSVM called the Multi-surface Proximal Support Vector

Machine (MPSVM) by relaxing the requirement of proximal planes being parallel [23].

MPSVM generates two hyperplanes such that each plane is closest to one class and

farthest from the other class. As MPSVM closely resembles PSVM, the two are used interchangeably in the literature. In this study, we use PSVM to mean the multi-surface proximal support vector machine.

Consider a binary classification problem with two classes represented as A ∈ ℜ^{n_1×m} and B ∈ ℜ^{n_2×m}, where n_1 + n_2 = n is the number of samples (data points) and m is the dimension of the input space. The proximal hyperplane closest to class A and farthest from class B is given by

P_A = \{x ∈ ℜ^m \mid \langle ω, x \rangle − γ = 0\}    (2–1)

The optimization model to obtain P_A in PSVM can then be formulated as

\min_{(ω,γ) \neq 0} \frac{\|Aω − eγ\|^2}{\|Bω − eγ\|^2}    (2–2)

where the numerator is the sum of squared distances from the hyperplane P_A to the points of class A and the denominator is the sum of squared distances from P_A to the points of class B. The basic idea behind the optimization model is to generate P_A so that it is closest to class A and farthest from class B. A Tikhonov regularization term is introduced into the optimization model to avoid degenerate solutions


[30]. The regularized optimization model [6, 24, 25] is given by:

\min_{(ω,γ) \neq 0} \frac{\|Aω − eγ\|^2 + δ\,\|[ω;\, γ]\|^2}{\|Bω − eγ\|^2}    (2–3)

where δ > 0 is a regularization constant.

Define,

G_A = [A\;\; −e]^T [A\;\; −e] + δI, \quad H_B = [B\;\; −e]^T [B\;\; −e], \quad z^T = [ω^T\;\; γ]

Substituting these variables into the optimization model 2–3, it can be reformulated as

\min_{z \neq 0} f(z) := \frac{z^T G_A z}{z^T H_B z} \;\Longleftrightarrow\; \max_{z \neq 0} f(z) := \frac{z^T H_B z}{z^T G_A z}    (2–4)

The stationary points of 2–4 are given by the eigenvectors of the generalized eigenvalue problem GEV(H_B, G_A):

H_B z = λ G_A z    (2–5)

and the hyperplane PA is given by the eigenvector corresponding to the largest

eigenvalue.

Similarly, the proximal hyperplane P_B (farthest from class A and closest to class B), given by

P_B = \{x ∈ ℜ^m \mid \langle \tilde{ω}, x \rangle − \tilde{γ} = 0\}

can be found by solving for the eigenvector corresponding to the maximum eigenvalue of the generalized eigenvalue problem GEV(H_A, G_B), where

G_B = [B\;\; −e]^T [B\;\; −e] + νI, \quad H_A = [A\;\; −e]^T [A\;\; −e], \quad \tilde{z}^T = [\tilde{ω}^T\;\; \tilde{γ}]
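As a concrete sketch of this construction (assuming NumPy/SciPy, small dense matrices and a positive definite G_A; the helper name is illustrative and not from the thesis), the plane P_A can be obtained by solving GEV(H_B, G_A) directly:

    import numpy as np
    from scipy.linalg import eigh

    def proximal_plane(A, B, delta=1e-3):
        """Return (omega, gamma) for the plane closest to class A and farthest from B,
        i.e. the eigenvector of H_B z = lambda G_A z with the largest eigenvalue."""
        e_A = np.ones((A.shape[0], 1))
        e_B = np.ones((B.shape[0], 1))
        GA = np.hstack([A, -e_A]).T @ np.hstack([A, -e_A]) + delta * np.eye(A.shape[1] + 1)
        HB = np.hstack([B, -e_B]).T @ np.hstack([B, -e_B])
        vals, vecs = eigh(HB, GA)          # generalized symmetric eigenproblem
        z = vecs[:, np.argmax(vals)]       # eigenvector of the largest eigenvalue
        return z[:-1], z[-1]               # omega, gamma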

2.1.3 Twin Support Vector Machines

Twin Support Vector Machine (TWSVM), introduced by Jayadeva et al. [16], is very similar to the generalized eigenvalue PSVM in the sense that it also produces two non-parallel planes such that each plane is close to one class and away from the other class. However, TWSVM is not an exact reformulation of the PSVM model 2–3; it is instead very close to the standard SVM formulation.

The Twin Support Vector Machine model solves a pair of quadratic programming (QP) problems, where each QP finds a hyperplane closest to the points of one class and at least at a unit distance from the points of the other class. The TWSVM classifier is obtained by solving the following pair of QP problems, where TWSVM1 generates the hyperplane P_A, i.e. the hyperplane closest to class A and farthest from class B [1, 16], and TWSVM2 similarly generates the hyperplane P_B.

TWSVM1: \quad \min_{ω_1, γ_1, q} \; \tfrac{1}{2}(Aω_1 + e_1 γ_1)^T (Aω_1 + e_1 γ_1) + c_1 e_2^T q
\quad\quad \text{s.t.} \; −(Bω_1 + e_2 γ_1) + q \geq e_2, \;\; q \geq 0    (2–6)

TWSVM2: \quad \min_{ω_2, γ_2, q} \; \tfrac{1}{2}(Bω_2 + e_2 γ_2)^T (Bω_2 + e_2 γ_2) + c_2 e_1^T q
\quad\quad \text{s.t.} \; −(Aω_2 + e_1 γ_2) + q \geq e_1, \;\; q \geq 0    (2–7)
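For illustration only, TWSVM1 in 2–6 can be written almost verbatim with a generic convex optimization package such as CVXPY (a hedged sketch under assumed variable names; c_1 is the trade-off constant and e_1, e_2 are vectors of ones):

    import numpy as np
    import cvxpy as cp

    def twsvm1(A, B, c1=1.0):
        """Solve the QP 2-6: plane close to class A, at least unit distance from class B."""
        m = A.shape[1]
        e1, e2 = np.ones(A.shape[0]), np.ones(B.shape[0])
        w, g, q = cp.Variable(m), cp.Variable(), cp.Variable(B.shape[0])
        objective = 0.5 * cp.sum_squares(A @ w + g * e1) + c1 * cp.sum(q)
        constraints = [-(B @ w + g * e2) + q >= e2, q >= 0]
        cp.Problem(cp.Minimize(objective), constraints).solve()
        return w.value, g.value            # hyperplane <w, x> + g = 0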

2.2 Importance of Least Square Formulations

Technological advances in the last decade have introduced new and efficient tools for data collection, especially in the field of biomedicine. This has paved the way for a large number of cases with high dimensional datasets, i.e. with a large number of input features compared to the number of samples. As discussed in the introductory chapter, traditional data mining techniques have produced appreciable results for standard datasets, but when the data are represented as high dimensional feature vectors with a limited sample size, they pose a great challenge for standard algorithms. Hence reducing the number of features is very important for efficient and effective data analysis.

Feature selection methods can be classified into three categories: filter, wrapper and embedded methods. The simplest idea is to select a subset of features from the original set based on a feature ranking procedure; this is the filter method. Wrapper methods compare a set of candidate feature subsets by their performance in predicting the data and select the best subset. Embedded methods perform feature selection within the classification model construction process itself. One very common embedded approach is to add an l1 penalty to a least squares classification model, which induces sparsity in the model and assists feature selection by removing irrelevant features.
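A minimal example of the embedded idea, using scikit-learn's Lasso as a stand-in for an l1-regularized least squares classifier (the data and the penalty value are placeholders, not from this study):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 200))                 # high dimensional: 200 features, 40 samples
    y = np.sign(X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=40))   # labels in {-1, +1}

    model = Lasso(alpha=0.1).fit(X, y)             # the l1 penalty drives most weights to zero
    selected = np.flatnonzero(model.coef_)         # indices of the surviving (relevant) features
    print(len(selected), "features selected out of", X.shape[1])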

Feature extraction for high dimensional datasets is very important because most features in high dimensional vectors are non-informative or noisy and can hurt generalization performance. Hence, in many data mining applications there is great interest in inducing sparsity with respect to the input features of high dimensional datasets in order to remove insignificant features. Such sparse representations provide information on the relevant features and thereby assist in feature selection. Further, classification models with a sparse data representation have simpler decision rules, allowing faster prediction in large-scale problems. Finally, in many data analysis applications, a small set of features is desirable for interpreting the results. As sparsity can be introduced very easily into a least squares formulation through a regularization term, the focus of this study is to derive a least squares formulation for proximal support vector machines. The Least Absolute Shrinkage and Selection Operator (LASSO) method introduced by Tibshirani [2, 29] can then be applied to the least squares formulation to induce sparsity. Further, the robust and efficient classification performance of PSVM makes it an attractive model to study. These factors motivated the investigation of a least squares formulation for generating proximal planes.

2.3 Least Square Formulation for Generating Proximal Planes

2.3.1 Using Spectral Decomposition

Zou et al. [15] proved the following theorem that establishes the relation between

eigenvalue problems and least-squares problems.


Theorem 2.1 (Zou et al.). Consider a real matrix X ∈ ℜ^{n×p} with rank r ≤ min(n, p). Let matrices V ∈ ℜ^{p×p} and D ∈ ℜ^{p×p} satisfy the relation

V^T (X^T X) V = D    (2–8)

where D = diag(σ_1^2, σ_2^2, \ldots, σ_r^2, 0, 0, \ldots, 0)_{p×p} and σ_1^2 ≥ σ_2^2 ≥ \cdots ≥ σ_r^2. For the optimization problem

\min_{α, β} \; \sum_{i=1}^{n} \|X_i − α β^T X_i\|^2 + λ β^T β
\text{subject to} \; α^T α = 1    (2–9)

β_{opt} ∝ V_1, where X_i is the i-th row of the matrix X and V_1 is the eigenvector corresponding to the largest eigenvalue σ_1^2.

Using Theorem 2.1, an equivalent least squares formulation for the proximal

hyperplanes can be developed from the eigenvalue formulation 2–5. Let the Cholesky

decomposition of matrices HB and GA be given by:

H_B = L_B L_B^T = U_B^T U_B
G_A = L_A L_A^T = U_A^T U_A    (2–10)

where L_A, L_B are lower triangular matrices and U_A, U_B are upper triangular matrices.

Substituting these in GEV (HB ,GA),

H_B z = λ G_A z
L_B L_B^T z = λ U_A^T U_A z
U_A^{−T} L_B L_B^T z = λ U_A z
U_A^{−T} L_B L_B^T U_A^{−1} U_A z = λ U_A z
(L_B^T U_A^{−1})^T (L_B^T U_A^{−1}) U_A z = λ U_A z
(L_B^T U_A^{−1})^T (L_B^T U_A^{−1}) y = λ y    (2–11)


where U_A z = y. The optimal eigenvector related to the proximal hyperplane P_A in PSVMs can then be recovered through the relation

z_{opt} = U_A^{−1} y    (2–12)

where y is the eigenvector corresponding to the maximum eigenvalue of the symmetric eigenvalue problem 2–11.

By substituting X = L_B^T U_A^{−1}, \hat{β} = U_A β and (L_B^T U_A^{−1})_i = U_A^{−T} U_{B,i} in the least squares problem, Equation 2–9 of Theorem 2.1, and re-arranging the terms, the following least-squares optimization problem is obtained:

\min_{α, \hat{β}} \; \|U_B U_A^{−1} − U_B \hat{β} α^T\|^2 + λ \hat{β}^T G_A \hat{β}
\text{s.t.} \; α^T α = 1    (2–13)

where \hat{β}_{opt} is proportional to z_1, the eigenvector corresponding to the largest eigenvalue of GEV(H_B, G_A).

The optimization problem 2–13 can be solved by alternating over α and \hat{β}.

Fixed \hat{β}: For a fixed \hat{β}, the following optimization problem is solved to obtain α:

\min_{α} \; \|U_B U_A^{−1} − U_B \hat{β} α^T\|^2
\text{s.t.} \; α^T α = 1    (2–14)

Expanding the objective function \|U_B U_A^{−1} − U_B \hat{β} α^T\|^2,

(U_B U_A^{−1} − U_B \hat{β} α^T)^T (U_B U_A^{−1} − U_B \hat{β} α^T) ≈ −2 α^T U_A^{−T} H_B \hat{β} + α^T α \, \hat{β}^T H_B \hat{β}

Substituting α^T α = 1, the optimization problem 2–14 can be re-written as:


\max_{α} \; α^T U_A^{−T} H_B \hat{β}
\text{s.t.} \; α^T α = 1    (2–15)

An analytical solution for this problem exists, and α_{opt} is given by

α_{opt} = \frac{U_A^{−T} H_B \hat{β}}{\|U_A^{−T} H_B \hat{β}\|}    (2–16)

Fixed α: For a given α, the optimization problem 2–13 reduces to a ridge regression-type problem. To see this, let A_{⊥} be an orthogonal matrix such that [α; A_{⊥}] is p × p orthogonal. Then

\|U_B U_A^{−1} − U_B \hat{β} α^T\|^2
= \|U_B U_A^{−1} [α; A_{⊥}] − U_B \hat{β} α^T [α; A_{⊥}]\|^2
= \|U_B U_A^{−1} α − U_B \hat{β}\|^2 + \|U_B U_A^{−1} A_{⊥}\|^2    (2–17)

So, for a fixed α, \hat{β} optimizes the following regression problem:

\min_{\hat{β}} \; \|U_B U_A^{−1} α − U_B \hat{β}\|^2 + λ \hat{β}^T G_A \hat{β}    (2–18)

In this case as well, an analytical solution can be found, given by

\hat{β}_{opt} = (H_B + λ G_A)^{−1} H_B U_A^{−1} α    (2–19)

The following algorithm summarizes the steps needed to solve for each optimal

hyperplane in PSVM using the least-squares (LS) approach:


Algorithm 1 PSVMs-via-LS (H_B, G_A)
1. Initialize \hat{β}.
2. Find the upper triangular matrix U_A from the Cholesky decomposition of G_A.
3. Find α from the relation
   α = \frac{U_A^{−T} H_B \hat{β}}{\|U_A^{−T} H_B \hat{β}\|}    (2–20)
4. Find \hat{β} as
   \hat{β} = (H_B + λ G_A)^{−1} H_B U_A^{−1} α    (2–21)
5. Alternate between steps 3 and 4 until convergence.
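A minimal NumPy sketch of these alternating updates (assuming H_B and G_A have already been formed from the class matrices and G_A is positive definite; the function name and tolerances are illustrative, not the thesis code):

    import numpy as np

    def psvm_via_ls(H_B, G_A, lam=0.1, tol=1e-6, max_iter=100):
        """Algorithm 1 sketch: alternate the closed-form updates 2-20 and 2-21."""
        d = G_A.shape[0]
        U_A = np.linalg.cholesky(G_A).T                   # upper triangular, G_A = U_A^T U_A
        beta = np.random.randn(d)                         # step 1: initialize beta
        for _ in range(max_iter):
            alpha = np.linalg.solve(U_A.T, H_B @ beta)    # step 3: U_A^{-T} H_B beta ...
            alpha /= np.linalg.norm(alpha)                # ... normalized to unit length
            beta_new = np.linalg.solve(H_B + lam * G_A,   # step 4: (H_B + lam G_A)^{-1} H_B U_A^{-1} alpha
                                       H_B @ np.linalg.solve(U_A, alpha))
            if np.linalg.norm(beta_new - beta) < tol:     # step 5: stop when the update stalls
                beta = beta_new
                break
            beta = beta_new
        return beta / np.linalg.norm(beta)                # proportional to z_1 = [omega; gamma]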

2.3.2 Special Case Eigenvalue Problem

Liang Sun et al. [22] introduced a theorem that establishes the relation between a specially structured eigenvalue problem and a least squares problem.

Theorem 2.2. Consider a generalized eigenvalue problem of the form

X S X^T ω = λ X X^T ω, \quad \text{or} \quad (X X^T)^{†} X S X^T ω = λ ω    (2–22)

where X ∈ ℜ^{m×n} is the data matrix (n is the number of training samples and m the dimension of the points), S ∈ ℜ^{n×n} is a symmetric positive semi-definite matrix and (X X^T)^{†} is the pseudo-inverse of X X^T.

Further assume that the following conditions are satisfied:

• the columns of X are centered, i.e. Xe = 0
• rank(X) = n − 1
• the column vectors of X are linearly independent before centering

As S is symmetric and positive semi-definite, it can be decomposed as

S = H H^T    (2–23)

using the Cholesky decomposition, where H ∈ ℜ^{n×s} and s ≤ n.

The matrices H and X undergo further decomposition to obtain U_1, Σ_1, V_1^T, Q and U_R. Consider the following sequence of decompositions:

QR decomposition: \; H P = Q R
Singular value decomposition: \; R = U_R Σ_R V_R^T
Compact singular value decomposition: \; X = U Σ V^T = U_1 Σ_1 V_1^T    (2–24)

With the above conditions satisfied, the nonzero eigenvalues of the generalized eigenvalue problem 2–22 are diag(Σ_R^2), and the corresponding eigenvectors are

W_{eig} = U_1 Σ_1^{−1} V_1^T Q U_R    (2–25)

Now consider a regression problem with n training pairs \{(x_i, t_i), i = 1, \ldots, n\}, where x_i ∈ ℜ^m is an observation and t_i ∈ ℜ^k the corresponding target. Least squares is a classical approach for solving regression problems; the least squares formulation for the regression problem is given by

\min_{W} \; \sum_{i=1}^{n} \|W^T x_i − t_i\|_2^2 = \|W^T X − T\|_F^2    (2–26)

where W ∈ ℜ^{m×k} is the weight matrix. The optimal solution W_{ls} minimizes the sum of squares error function, and the closed form solution to the least squares problem is given as

W_{opt} = (X X^T)^{†} X T^T.    (2–27)

Define the target matrix for the least squares formulation, Equation 2–26, as

T = U_R^T Q^T    (2–28)


With T ∈ ℜ^{r×n}, the solution to the least squares problem is given as

W_{ls} = (X X^T)^{†} X T^T = U_1 Σ_1^{−1} V_1^T Q U_R    (2–29)

Based on the results 2–25 and 2–29, Liang Sun et al. [22] prove the equivalence of the eigenvalue problem and the least squares problem under the stated conditions.

Using Theorem 2.2, we can generate the proximal hyperplanes P_A and P_B via a least-squares formulation. Consider the generalized eigenvalue problem H_B z = λ G_A z for the proximal plane P_A; the following optimization model can be derived from the original model 2–4:

\max_{z \neq 0} f(z) := \frac{z^T C H_B C z}{z^T C G_A C z} \;\Longleftrightarrow\; \max_{y \neq 0} f(y) := \frac{y^T H_B y}{y^T G_A y}    (2–30)

where y = C z and C = I − \frac{1}{n} e e^T is a centering matrix. The centering matrix is used to center the data matrix to its mean (Xe = 0 if X is centered). The optimal solutions are related by y_{opt} = C z_{opt}, i.e. the optimal solution of the original problem is recovered from the optimal solution of the derived model through the centering matrix C.

Properties of the centering matrix C:

• C is symmetric, C = C^T
• C is idempotent, C^2 = C

The Cholesky decompositions of H_B and G_A give

H_B = U_B^T U_B
G_A = L_A L_A^T    (2–31)

and the singular value decomposition of L_A is

L_A = U Σ V^T    (2–32)


Applying the Cholesky and singular value decompositions to the centered problem, we have

C H_B C z = λ C G_A C z
C U_B^T U_B C z = λ C L_A L_A^T C z    (2–33)

Define

H = V Σ^{†} U^T U_B^T,    (2–34)

where Σ^{†} is the pseudo-inverse of Σ. Introducing U Σ V^T V Σ^{†} U^T = I into Equation 2–33 and substituting for H and L_A,

C U Σ V^T V Σ^{†} U^T \, U_B^T U_B \, U Σ^{†} V^T V Σ U^T C z = λ C L_A L_A^T C z
C L_A H H^T L_A^T C z = λ C L_A L_A^T C z    (2–35)

The GEV problem H_B z = λ G_A z is thus reformulated as

C L_A H H^T L_A^T C z = λ C L_A L_A^T C z
(C L_A) H H^T (C L_A)^T z = λ (C L_A)(C L_A)^T z    (2–36)

where \bar{L}_A = C L_A is the centred matrix and S = H H^T is a symmetric positive semi-definite matrix. Applying Theorem 2.2, solving \bar{L}_A H H^T \bar{L}_A^T z = λ \bar{L}_A \bar{L}_A^T z is equivalent to

\min_{W} \; \|W^T \bar{L}_A − T\|_F^2    (2–37)

where T is generated from H and \bar{L}_A using Equation 2–28 in Theorem 2.2. Further, at optimality the column vectors of W represent the eigenvectors of the generalized eigenvalue problem 2–36. The closed form solution is W_{opt} = (\bar{L}_A \bar{L}_A^T)^{†} \bar{L}_A T^T, referring to Equation 2–27.

The following algorithm summarizes the steps needed to solve for each optimal

hyperplane in PSVM using the least-squares (LS) approach derived from Theorem 2.2:


Algorithm 2 PSVMs-via-LS (H_B, G_A)
1. Using the Cholesky decomposition, find the upper triangular matrix U_B from H_B and the lower triangular matrix L_A from G_A.
2. Find U, Σ, V using the singular value decomposition of L_A.
3. Centre the lower triangular matrix L_A to create \bar{L}_A = C L_A.
4. Generate H using Equation 2–34.
5. Apply Theorem 2.2 to generate T from H and \bar{L}_A using 2–28.
6. The closed form solution is obtained using the equation
   W_{opt} = (\bar{L}_A \bar{L}_A^T)^{†} \bar{L}_A T^T    (2–38)

2.4 Results and Observations

In this chapter we introduced two least squares formulations for generating the proximal planes of PSVM. The correctness of the least squares formulations is established mathematically by Theorems 2.1 and 2.2. It can be further validated by comparing the classification accuracies of the least squares formulations with the accuracies obtained from the standard PSVM formulation (the generalized eigenvalue formulation); in this study, 10-fold cross validation accuracies are reported. As the least squares models are simply reformulations of the standard generalized eigenvalue model, their classification accuracies are expected to be the same as, or very close to, the accuracies of the standard formulation.

Numerical tests were done on publicly available binary class datasets. In the results Table 2-1, the Dimensions column gives the number of data points by the number of features in the dataset. Colon, DBWorld and DLBCL are high dimensional datasets, while the others are standard datasets. The PSVM-Eig column shows the accuracies obtained using the standard generalized eigenvalue formulation, the PSVM-LS-F1 column shows the accuracies associated with the least squares formulation using Theorem 2.1 of Zou et al., and the PSVM-LS-F2 column shows the accuracies associated with the least squares formulation using Theorem 2.2 of Liang Sun et al.


Table 2-1. Classification accuracy: PSVM-Eig is the standard generalized eigenvalue formulation, PSVM-LS-F1 the least squares formulation using Theorem 2.1 (Zou et al. [15]), and PSVM-LS-F2 the least squares formulation using Theorem 2.2 (Liang Sun et al. [22]).

Dataset      Dimensions   Class Ratio    PSVM-Eig   PSVM-LS-F1   PSVM-LS-F2
WDBC         569*30       212 : 357      93.3%      93.30%       92.80%
Spambase     4601*57      1813 : 2788    68.0%      67.96%       76.40%
Ionosphere   351*34       126 : 225      76.9%      76.91%       75.48%
WPBC         198*33       47 : 151       74.9%      74.70%       73.79%
Mushroom     8124*126     3916 : 4208    99.8%      99.80%       99.80%
Colon        62*2000      40 : 22        87.1%      87.14%       87.14%
DBWorld      64*4702      35 : 29        90.7%      90.71%       90.71%
DLBCL        77*5469      58 : 19        81.8%      81.79%       75.36%

The results indicate that both least squares approaches are valid representations of the proximal planes, as they produce classification accuracies similar to those of the standard PSVM formulation. The new formulation paves the way for an easy introduction of embedded feature selection into proximal support vector machines (PSVMs): an l1 norm can be introduced into the new least squares formulations developed in this study to obtain sparse classification and in turn attain feature selection for PSVMs.

The quadratic model for PSVM using the Twin Support Vector Machine can also use an l1 norm to induce sparsity. This is a direct method of inducing sparsity; however, when the l1 norm is added to the TWSVM model it becomes a non-differentiable constrained optimization model that is computationally very challenging, whereas the least squares models developed in this chapter can be solved more efficiently. After adding an l1 norm to the least squares formulation 2–13 developed from Theorem 2.1, the optimization problem can still be solved iteratively by alternating between α and β, and efficient algorithms [13] exist to solve the least squares formulation 2–37 obtained using Theorem 2.2. These merits further signify the applicability of the new least squares approaches introduced in this study.


CHAPTER 3
JOINT SPARSE FEATURE SELECTION

3.1 Dimensionality Reduction

Dimensionality reduction is the technique of projecting a set of input data points onto a smaller dimensional space; that is, the data is represented in a lower dimensional subspace. It is normally achieved through feature selection or feature extraction methods. A feature selection method identifies a set of prominent features, and the reduced subspace is determined by this selected set of features, while a feature extraction method creates derived features that are combinations of the existing features, and these new derived features are used to generate the reduced subspace. Dimensionality reduction has many advantages in data mining, especially when handling high dimensional datasets.

Dimensionality reduction techniques play a vital role for high dimensional datasets, as they reduce the dimensionality of the input space with minimal loss of information. Principal component analysis (PCA) is a very common dimensionality reduction technique based on feature extraction: it creates a set of derived features, or linear subspaces, that are linear combinations of the existing features such that the maximum data variance is retained in the new subspace. This can also be viewed as creating a new subspace in which each basis vector is a linear combination of the existing features.

Consider a data matrix X ∈ ℜ^{d×n}, where d is the number of features and n the number of data points, and let U ∈ ℜ^{d×r} be the transformation matrix used to generate the reduced subspace S ∈ ℜ^{n×r}, where r ≪ d. When a set of data points is projected onto a new subspace, the optimal subspace preserves the maximal relationship between the data points, i.e. the loss of information is minimized by accommodating the maximum variance present in the dataset. The optimization model for generating the optimal subspace can therefore be formulated as a variance maximization problem. The subspace is represented by r orthogonal vectors u_i ∈ ℜ^d, and the u_i's form a basis for the subspace:

\max_{U} \; (U^T X X^T U)
\text{subject to} \; U^T U = I    (3–1)
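Under the usual reading of 3–1 (maximizing the variance retained by an orthonormal U), the optimal U consists of the top r eigenvectors of X X^T. A short NumPy sketch of this baseline (illustrative names, assuming X is already mean-centered):

    import numpy as np

    def pca_subspace(X, r):
        """Top-r variance-maximizing directions for a d x n data matrix X (columns are points)."""
        evals, evecs = np.linalg.eigh(X @ X.T)      # symmetric eigendecomposition (ascending)
        order = np.argsort(evals)[::-1][:r]
        U = evecs[:, order]                         # d x r transformation matrix, U^T U = I
        S = X.T @ U                                 # n x r reduced representation of the data
        return U, S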

As feature extraction methods create a set of derived features that cannot be directly correlated with the actual features, these methods are not suitable for identifying the prominent features. A very common observation with high dimensional data is that most of the features are either irrelevant or collinear with the prominent features. High levels of correlation and noise in high dimensional datasets have reduced the applicability of traditional methods, and higher dimensional datasets also affect computational efficiency. These setbacks accentuate the need for dimensionality reduction and feature selection on high dimensional datasets. This chapter focuses on a method that performs feature selection along with dimensionality reduction using the l2,1 norm.

3.2 L21 Norm and Feature Selection

For any matrix A ∈ ℜ^{d×r}, its l2,1 norm is defined as

\|A\|_{2,1} = \sum_{i=1}^{d} \sqrt{\sum_{j=1}^{r} A_{i,j}^2}    (3–2)

The l2,1 norm thus computes the l2 norm of each row of the matrix and sums these row norms. Consider a projection matrix A that is used to project an input space X ∈ ℜ^{d×n} onto a reduced dimensional subspace S. Since each row of the projection matrix corresponds to a feature in the original input space, it is desirable to have some rows of the projection matrix go to zero; this can be viewed as nullifying the significance of the irrelevant features in the reduced subspace. This is the major motivation behind studying the l2,1 norm: it is introduced into a dimensionality reduction problem with the expectation of inducing row sparsity in the transformation matrix and thereby assisting in feature selection.
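The l2,1 norm of 3–2 is straightforward to compute, and doing so makes the row-sparsity intuition concrete (illustrative code, not from the thesis):

    import numpy as np

    def l21_norm(A):
        """Sum of the Euclidean norms of the rows of A, as in Equation 3-2."""
        return np.sum(np.linalg.norm(A, axis=1))

    U = np.array([[0.0, 0.0], [3.0, 4.0], [0.1, 0.0]])
    print(l21_norm(U))   # 0 + 5 + 0.1 = 5.1; zero rows (irrelevant features) contribute nothing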

Joint sparsity is induced in the orthonormal vectors spanning S by introducing the l2,1 norm into the optimization model 3–1, which is modified as below:

\max_{U ∈ ℜ^{d×r}} \; (U^T X X^T U) − C \|U\|_{2,1}
\text{subject to} \; U^T U = I    (3–3)

where C > 0 controls the intensity of the induced sparsity.

The solution to the optimization model 3–1 can be obtained by solving the following symmetric eigenvalue problem, where the optimal u_i^* are the eigenvectors corresponding to the r largest eigenvalues:

X X^T U = U D    (3–4)

where D = diag(λ_1, λ_2, \ldots, λ_r) holds the eigenvalues and U ∈ ℜ^{d×r} holds the corresponding eigenvectors of X X^T.

For high dimensional datasets, computing the eigenvectors of X X^T ∈ ℜ^{d×d} directly is challenging because of the large dimension, but they can be calculated from the eigenvectors of X^T X. Let the set of eigenvectors of X^T X be represented by W ∈ ℜ^{n×r}; then U can be estimated from

X^T U = W D^{1/2}    (3–5)

So, the optimization problem 3–3 can be reformulated as

\min_{U ∈ ℜ^{d×r}} \; \|U\|_{2,1}
\text{subject to} \; X^T U = W D^{1/2}    (3–6)


The model 3–6 is further relaxed to the following optimization problem:

\min_{U ∈ ℜ^{d×r}} \; \|U\|_{2,1}
\text{subject to} \; \|X^T U − Y\|_F ≤ δ    (3–7)

where Y = W D^{1/2} and δ is a tuning parameter for the constraint relaxation.

To solve the model 3–7 the iterative method introduced by Gu et al. [28] is used.

The algorithm can be summarized as below

Algorithm 3 Solving Optimization Model 3–7
1. Initialize G_0 = I, t = 0 and µ. (The analytical relation between µ and δ is not needed by the algorithm; µ is used to fine tune the convergence criterion and thereby affects the constraint relaxation.)
2. Compute Y = W D^{1/2}, where W ∈ ℜ^{n×r} are the eigenvectors of X^T X.
3. U_{t+1} = G_t^{−1} X (X^T G_t^{−1} X + \frac{1}{2µ} I)^{−1} Y
4. Update G_{t+1} based on U_{t+1}: G is a diagonal matrix with g_{i,i} = 0 if the i-th row u^i of U_{t+1} is zero, and g_{i,i} = 1 / \|u^i\|_2 otherwise.
5. Set t = t + 1 and repeat steps 3 to 5 until convergence.
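A NumPy sketch of this iteration is given below. It is a minimal reading of Algorithm 3 under the stated conventions (G_{ii} = 1/\|u^i\|_2 with small rows zeroed, the 1/(2µ) term as written, and the ε_u stopping rule described in the next section); the function and parameter names are illustrative, not the thesis code:

    import numpy as np

    def joint_sparse_projection(X, r, mu=1.0, max_iter=100, tol=5e-4):
        """Sketch of Algorithm 3: row-sparse transformation matrix U for a d x n data matrix X."""
        d, n = X.shape
        # Step 2: Y = W D^{1/2} from the top-r eigenpairs of X^T X
        evals, evecs = np.linalg.eigh(X.T @ X)                   # ascending eigenvalues
        top = np.argsort(evals)[::-1][:r]
        Y = evecs[:, top] * np.sqrt(np.maximum(evals[top], 0.0))
        ginv = np.ones(d)                                        # diagonal of G_0^{-1} = I
        U = np.zeros((d, r))
        for _ in range(max_iter):
            # Step 3: U_{t+1} = G^{-1} X (X^T G^{-1} X + (1/(2 mu)) I)^{-1} Y
            GX = X * ginv[:, None]
            U_new = GX @ np.linalg.solve(X.T @ GX + (1.0 / (2.0 * mu)) * np.eye(n), Y)
            # Step 4: reweight; G_ii = 1/||u^i||, hence G^{-1}_ii = ||u^i||
            # (rows falling below the threshold are treated as zero, as described in the text)
            row_norms = np.linalg.norm(U_new, axis=1)
            ginv = np.where(row_norms > 1e-8, row_norms, 0.0)
            if np.linalg.norm(U_new - U) / np.sqrt(r * d) < tol: # epsilon_u criterion
                U = U_new
                break
            U = U_new
        return U    # rows of U with large norm indicate the prominent features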

3.3 Results and Observations

The joint sparse (JS) feature selection method was tested on four high dimensional datasets. The transformation matrix U obtained with the above method was used not only to transform the data to a lower dimensional subspace but also to extract significant features, since each row of U can be directly associated with a feature in the original data space. In this experiment three different reduced subspaces were considered, with 5, 10 and 15 dimensions, i.e. three different matrices U ∈ ℜ^{d×r} with r equal to 5, 10 and 15. The amount of variance captured by the new subspace is a good measure of the applicability of the method, so principal component analysis (PCA), a very common dimensionality reduction method, was used as the baseline for comparison.


Table 3-1. Variance captured by PCA and the JS method (joint sparse feature selection); r is the dimension of the reduced space. Classification accuracy and standard deviation are also compared.

                                   Variance         Accuracy %       Std Deviation
Dataset    Class Ratio   r      PCA    JS         PCA     JS        PCA     JS
Colon      40 : 22       5      0.71   0.70       64.2    71.3      13.55   13.38
                         10     0.84   0.84       72.9    73.3      13.21   14.20
                         15     0.90   0.89       68.3    68.3      13.41   13.94
DBWorld    35 : 29       5      0.23   0.16       89.6    88.5      7.60    8.46
                         10     0.37   0.27       87.7    86.9      7.23    7.93
                         15     0.48   0.36       88.8    86.9      6.35    7.53
Leukimia   27 : 11       5      0.45   0.40       100.0   90.7      0.00    6.99
                         10     0.62   0.58       98.6    98.6      4.40    4.40
                         15     0.73   0.69       99.3    99.3      3.19    3.19
Breast     44 : 33       5      0.37   0.31       66.6    68.4      10.39   11.38
                         10     0.50   0.44       63.8    62.2      11.03   13.97
                         15     0.59   0.52       60.3    60.0      13.34   13.36

The percentage of variance captured by the subspace is a good measure for analysing the performance of any dimensionality reduction method: the data variance in the reduced subspace is compared with the variance in the original input space, i.e. the percentage of variance is the ratio of the variance in the subspace to the variance in the original input space. Table 3-1 compares the percentage of variance captured by the principal components in PCA with the variance captured using the joint sparsity method. Classification accuracy using SVM on the reduced subspace (generated by both PCA and the JS method) is also compared for different values of the subspace dimension.

The results show that the joint sparsity method captures variance similar to that of PCA and also performs well in classification accuracy. An iterative algorithm is used to obtain the transformation matrix U, where in each iteration the l2,1 norm decreases, forcing the rows corresponding to irrelevant features to smaller magnitudes but never reducing them exactly to zero. Hence, during the iterations, when the norm of any row falls below a particular value it is forced to zero; in this study a threshold of 10^{−8} was used. The algorithm terminates if ε_u ≤ 5×10^{−4} and ε_f ≤ 10^{−4}, or if the number of iterations exceeds 100. For the k-th iteration, ε_u is defined as \|U_k − U_{k−1}\|_F / \sqrt{r × d}, where r is the dimension of the reduced subspace and d is the number of features in the dataset, and ε_f is defined as |Obj_k − Obj_{k−1}| / d, where Obj_k is the objective value (\|U\|_{2,1}) at the k-th iteration. The maximum number of iterations was fixed at 100 because for most of the datasets the algorithm converged to acceptable levels in fewer than 100 iterations.

As the algorithm does not expose an efficient parameter for controlling the sparsity level directly, its feature selection performance was tested using the top prominent features. Two criteria were used to select the prominent feature subset. In the first case, the features whose weights are larger than a given percentage of the largest feature weight form the active feature set; Table 3-2 shows the classification accuracies associated with the prominent feature subsets generated by this criterion, together with the number of features in each subset. In the second approach, the top T features (the features with the largest weights) form the prominent subset, and the corresponding classification accuracies are reported in Table 3-3. Both results tables compare the accuracy of the JS method with the widely accepted PCA method.

The results show that the classification accuracies of the JS method are not higher than those of the PCA method, but they are very much comparable. Among dimensionality reduction methods PCA is considered a benchmark, and the PCA-SVM classification model is known to provide high classification accuracies. The JS method provides classification accuracies comparable to PCA and, in addition, provides a list of prominent features. Hence, along with dimensionality reduction, feature selection is also performed by the JS method.


Table 3-2. Accuracies of PCA and the JS method (joint sparse feature selection). The threshold selects features whose weights are greater than the given percentage of the largest weight (e.g. the 30% column uses features with weights greater than 30% of the largest weight). The number of relevant (non-zero) features is also given for each threshold.

                             Accuracy % (JS Method)        Non-zero Features
Dataset    r    PCA Acc %    30%      20%      10%         30%    20%    10%
Colon      5    64.16        76.25    69.58    75.41       70     83     100
           10   72.91        65.00    66.25    75.41       80     105    122
           15   68.33        64.16    66.25    59.16       85     105    133
DBWorld    5    89.61        88.07    90.76    88.84       29     52     112
           10   87.69        81.53    85.76    86.53       28     72     133
           15   88.84        86.92    85.76    85.38       95     150    211
Leukimia   5    100.00       90.00    91.42    91.42       42     62     86
           10   98.57        93.57    92.85    97.85       13     36     74
           15   99.28        87.85    99.28    99.28       19     50     88
Breast     5    66.56        69.68    67.18    65.31       26     51     93
           10   63.75        63.43    64.68    63.12       59     96     160
           15   60.31        59.37    60.93    59.68       68     122    196

Table 3-3. Accuracies of PCA and the JS method (joint sparse feature selection); T is the number of top selected features (T = 10 selects the 10 features with the largest weights).

                             Accuracy % (JS Method, Top T Features)
Dataset    r    PCA Acc %    T=10     T=15     T=20     T=25     T=30
Colon      5    64.17        78.75    77.50    81.25    77.50    80.00
           10   72.92        65.00    69.58    71.25    72.08    76.25
           15   68.33        62.50    69.58    67.50    63.33    69.17
DBWorld    5    89.62        82.69    87.31    83.08    86.92    86.92
           10   87.69        77.31    88.08    84.23    82.69    85.38
           15   88.85        82.69    79.62    81.92    74.62    76.54
Leukimia   5    100.00       86.43    90.00    89.29    88.57    85.71
           10   98.57        91.43    86.43    87.86    95.71    93.57
           15   99.29        82.86    80.00    87.86    88.57    90.00
Breast     5    66.56        62.81    65.31    65.00    69.69    69.69
           10   63.75        60.63    68.44    69.06    67.50    72.50
           15   60.31        72.19    73.13    68.75    66.88    62.19


CHAPTER 4
FEATURE SELECTION IN UNLABELLED DATASETS

4.1 Introduction to Raman Spectra Signals

This chapter focuses on feature selection in unlabelled Raman spectroscopy data. A Raman spectrum consists of Raman intensities measured at various wavenumbers, and the peaks in the spectrum can be associated with various biological constituents. This non-invasive method is vital in the study of cells and cellular processes: the amount of morphologic and chemical information in Raman spectra and the ease of measurement make Raman spectroscopy an attractive method for studying cells. However, efficiently extracting information from Raman spectra is a challenge. The dataset used here is a cross sectional Raman spectroscopy scan of a cell embedded in a layer of trehalose, and one of the motivations behind collecting the scan is to create an image of the cell from it. The target study includes the task of creating a cell image based on the Raman intensity spectra and also identifying the important peaks that help distinguish the various regions in the generated image. Clustering methods are used to generate the image, while sparse clustering is used to identify the relevant peaks in the Raman spectra.

4.2 Dataset

The dataset represents a Raman spectroscopy scan of a cell embedded in a trehalose layer. The scan is performed in the X-Z plane, i.e. it provides a cross sectional view of the cell; the expected cross sectional image is a layer of cell at the centre of the scan with the trehalose layer above and below it. The two datasets considered in this study are 1) TRETScan6 (scan area of pixel dimension 120 × 21) and 2) C3AScan2 (125 × 50). Each dataset represents a cross sectional scan of a cell embedded in a trehalose medium, but with a different scan area and cell sample. Each pixel of the scan area holds a Raman spectrum consisting of 1024 features: Raman intensities measured at 1024 different wave numbers varying between 0 and 3800. The number of data points in TRETScan6 is 120 × 21 = 2520 and in C3AScan2 it is 125 × 50 = 6250. Hence the TRETScan6 dataset consists of 2520 data points with 1024 features and the C3AScan2 dataset consists of 6250 data points with 1024 features.

4.3 Data Preprocessing

4.3.1 Remove Unnecessary Features

Research on Raman spectra suggests that the intensity measurements at very low wave numbers do not give any information due to the presence of high noise levels. Hence the first 10 features of each data point are removed, reducing the dataset to 1014 features.

4.3.2 Noise and Background Subtraction

In order to extract the maximum information on the Raman scattering, both noise and background fluorescence must be removed. Noise removal is performed using a Savitzky-Golay smoothing filter, which is found to be very effective for Raman spectra. The most promising background subtraction algorithms use polynomial fits, because they can approximate the fluorescence profile while excluding the Raman peaks; however, there is no consensus on the best polynomial order for fluorescence background subtraction [3]. In this study we applied the subback function in MATLAB, which subtracts the background of a spectrum by iteratively fitting a polynomial through the data points.
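A rough Python sketch of these two preprocessing steps (Savitzky-Golay smoothing via SciPy, followed by a single polynomial background fit; the latter is only a simplified stand-in for MATLAB's iterative subback routine, and the window, order and degree values are assumptions):

    import numpy as np
    from scipy.signal import savgol_filter

    def preprocess_spectrum(intensity, window=11, polyorder=3, bg_degree=5):
        """Smooth a Raman spectrum and subtract an estimated fluorescence background."""
        smoothed = savgol_filter(intensity, window_length=window, polyorder=polyorder)
        x = np.arange(len(smoothed))
        background = np.polyval(np.polyfit(x, smoothed, bg_degree), x)   # crude background fit
        return np.maximum(smoothed - background, 0.0)                    # corrected spectrum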

4.3.3 Peak Selection

Biologically relevant wave numbers in Raman spectra are associated with peaks of Raman intensity, so only the wave numbers corresponding to peaks are relevant for the analysis. Hence the peaks in each Raman spectrum are detected and their wave numbers are shortlisted as potential features. Due to the resolution and noise involved in recording Raman spectra, the peaks of different spectra can be shifted by a few wavenumbers; this is handled by merging corresponding peaks of different spectra onto one prominent wave number. After peak selection and peak merging, the number of relevant features is further reduced.
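An illustrative peak-detection step (using SciPy's find_peaks; the prominence setting is an assumed tuning value, not taken from the thesis):

    import numpy as np
    from scipy.signal import find_peaks

    def peak_features(spectrum, wavenumbers, rel_prominence=0.05):
        """Return the wavenumbers at which the spectrum has prominent peaks."""
        peaks, _ = find_peaks(spectrum, prominence=rel_prominence * np.max(spectrum))
        return wavenumbers[peaks]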

Once the raw dataset has undergone this preprocessing, it is taken forward for further analysis. The preprocessed dataset is expected to contain only relevant features, with noise and background subtracted.

4.4 Clustering

Clustering consists of partitioning the data points based on the dissimilarities between them. The most common dissimilarity measure is the Euclidean distance between the data points, so clustering tends to group points lying close to each other, i.e. it is the process of grouping similar elements together. Clustering is an unsupervised classification model in which the classes of the training dataset are unknown; in many cases even the number of classes present is unknown. In this study two clustering methods are analysed and applied to the dataset.

4.5 K-means Clustering

The K-means clustering algorithm is one of the most popular clustering algorithms. To perform K-means clustering the user needs to specify the number of clusters present in the training dataset. The algorithm can be described as the repetition of two steps. The main idea is to define k centroids, one for each cluster. The first step is to take each point of the dataset and associate it with the nearest centroid; when all data points have been assigned, an initial set of clusters is formed. The second step is to re-calculate the k centroids of the clusters resulting from the previous step. With these k new centroids, the data points are reassigned to their nearest centroid, creating a new set of clusters. The two steps are repeated until no data point is reassigned from its current cluster, or in other words until the centroids no longer move [20]. The performance of clustering depends heavily on the number of clusters, so assigning the correct number of clusters is critical to obtaining a good clustering and thereby an efficient classification.
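By way of illustration, a minimal NumPy sketch of these two alternating steps is given below (a didactic version, not the implementation used in this study; in practice a library routine such as MATLAB's kmeans or scikit-learn's KMeans would normally be used):

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        """Plain K-means: alternate point assignment and centroid update."""
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]

        for _ in range(max_iter):
            # Step 1: assign each point to its nearest centroid (Euclidean distance)
            dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dist.argmin(axis=1)

            # Step 2: recompute each centroid as the mean of the points assigned to it
            new_centroids = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ])

            if np.allclose(new_centroids, centroids):  # centroids no longer move
                break
            centroids = new_centroids

        return labels, centroids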

The first task in this study was to identify the extracellular region (the region of Trehalose) and the cellular region. Initially, clustering was therefore performed with two means, i.e. K = 2. K-means clustering was applied to both datasets, and Figures 4-1 and 4-2 show the resulting clusters. The cell is suspended in a medium of Trehalose, so in the cross-sectional view a cellular region is expected at the centre while the upper and bottom layers contain Trehalose. The clustering algorithm should therefore produce a distinct layer at the centre, with the upper and bottom layers belonging to the same cluster. The K-means output shown in Figures 4-1 and 4-2 matches this expectation and hence validates the applicability of clustering methods for distinguishing the cellular and extracellular regions in a Raman spectral scan.

Figure 4-1. K-Means Clustering, TRETScan6 scan of dimension 120 X 21

In the TRETScan6 scan (Figure 4-1), the central red strip represents the cluster associated with the cellular region and the blue region the extracellular region. A similar pattern is expected for the C3AScan2 scan, and the cluster obtained (Figure 4-2) also validates this proposition.

Figure 4-2. K-Means Clustering, C3AScan2 scan of dimension 125 X 50

Once the image of the extracellular and cellular regions is generated, the region associated with the nucleus is located within the cellular region. This is a first step towards distinguishing the various elements inside the cell. In this study only the cellular region of the C3AScan2 scan is further clustered to locate the nucleus, and the light green region shown in Figure 4-3 is expected to represent the nucleus.

4.6 Spectral Clustering

Spectral clustering is also a popular clustering algorithm because of its ease of implementation and the availability of efficient methods to solve it. In many cases it outperforms traditional methods such as the K-means algorithm. To perform spectral clustering, the dataset is represented as a graph of n nodes, where n is the number of data points, and each arc weight represents a dissimilarity measure between the nodes it connects. In this study the Euclidean distance between two data points is taken as the dissimilarity measure for the arc connecting them.
Figure 4-3. K-means C3AScan2 scan with nucleus marked

Let W ∈ ℜ^{n×n} be the weight matrix holding the dissimilarity measures for the arcs connecting all the data points, where w_{ij} is the distance between points i and j, and let D ∈ ℜ^{n×n} be the diagonal matrix with diagonal entries d_i = Σ_{j=1}^{n} w_{ij}. The Laplacian matrix L is given by L = D − W.

The network generated by the n data points is a connected graph, since arcs exist between every pair of data points, and hence the smallest eigenvalue of the Laplacian matrix is 0. The multiplicity of the zero eigenvalue of a Laplacian equals the number of connected components of the graph. The eigenvector corresponding to the first non-zero eigenvalue represents the best partition of the graph into two subgraphs with limited interaction between them, i.e. it assists in clustering the nodes into two clusters [31]. Because the graph constructed here is connected, there is only one zero eigenvalue, and hence the eigenvector corresponding to the second smallest eigenvalue represents the partitioning of the graph into two subgraphs. This eigenvector has n entries, each of which can be associated with a data point, and the clusters are generated by arranging the data points according to the entries of the eigenvector.
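A minimal Python sketch of this procedure, following the construction above with pairwise Euclidean distances as arc weights (illustrative only; splitting the eigenvector entries at their median is an assumption about how the two clusters are formed):

    import numpy as np
    from scipy.linalg import eigh
    from scipy.spatial.distance import cdist

    def spectral_bipartition(X):
        """Split the data points into two clusters using the graph Laplacian."""
        W = cdist(X, X)                  # arc weights: pairwise Euclidean distances
        D = np.diag(W.sum(axis=1))       # diagonal matrix with d_i = sum_j w_ij
        L = D - W                        # Laplacian L = D - W

        # Eigenvectors of L sorted by eigenvalue; for a connected graph the smallest
        # eigenvalue is 0, so the second column is the partitioning vector.
        _, vecs = eigh(L)
        partition_vector = vecs[:, 1]

        # Form two clusters from the entries of the eigenvector
        return (partition_vector > np.median(partition_vector)).astype(int)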

Observations from Spectral Clustering

Spectral clustering was performed on both datasets, and Figures 4-4 and 4-5 show the clustered images. Similar to the K-means clustered images (Figures 4-1 and 4-2), the spectral clustering algorithm generates the expected scan image. This further validates the applicability of clustering methods for distinguishing the cellular and extracellular regions in a Raman spectral scan.

Figure 4-4. Spectral Clustering, TRETScan6 scan of dimension 120 X 21

In the TRETScan6 scan (Figure 4-4), the red strip represents the cluster associated with the cellular region and the blue region the extracellular region. A similar pattern is expected for the C3AScan2 data (Figure 4-5), and the cluster obtained also validates the proposition about the expected scan image. Both clustering methods produce similar clusters, so this classification output can be used as a reference for further analysis.

4.7 Sparse Clustering for Feature Selection

Technological advances in the last decade have introduced new and efficient tools for data collection, especially in the field of biomedicine. This has paved the way for a new class of large datasets of very high dimension, i.e. with a large number of input features compared to the number of observations.
Figure 4-5. Spectral Clustering, C3AScan2 scan of dimension 125 X 50

Traditional data mining techniques have produced appreciable results on standard datasets, but data represented as very high dimensional vectors poses great challenges for standard algorithms. Feature extraction is very important for high dimensional datasets, since most features in high dimensional vectors are non-informative or noisy and can degrade generalization performance. There is therefore great interest in many machine learning applications in inducing sparsity with respect to the input features of high dimensional datasets. A sparse representation can provide significant information on the relevant features and thereby assist in feature selection. Further, classification models with a sparse data matrix yield simpler decision rules and faster prediction in large-scale problems. Finally, in many data analysis applications a small set of features is desirable for interpreting the results.

In this study sparse clustering is performed to extract the relevant features from the dataset. Feature selection helps identify the biologically significant wavenumbers that are critical in distinguishing between the extracellular and cellular regions. The sparse K-means method suggested by Witten et al. [5] is used for sparse clustering. The sparse K-means clustering optimization problem can be formulated as follows:

Maximize over C_1, C_2, ..., C_K and ω:

    Σ_{j=1}^{p} ω_j [ (1/n) Σ_{i=1}^{n} Σ_{i'=1}^{n} d_{i,i',j} − Σ_{k=1}^{K} (1/n_k) Σ_{i,i' ∈ C_k} d_{i,i',j} ]

subject to ∥ω∥_2 ≤ 1, ∥ω∥_1 ≤ s, ω_j ≥ 0 for all j,    (4-1)

where
    C_1, C_2, ..., C_K represent the K classes or clusters in the data space,
    ω is the vector of weights associated with the features,
    p is the number of features,
    s is the tuning parameter,
    d_{i,i',j} is the dissimilarity measure between points i and i' along feature j, and
    K is the number of classes.

The above optimization problem is solved using the iterative process proposed by Witten et al. [5].
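The following Python sketch outlines one plausible version of that alternating scheme: hold ω fixed and cluster the data with each feature scaled by the square root of its weight, then hold the clusters fixed and update ω by soft-thresholding the per-feature between-cluster sums of squares. It is a simplified illustration of the approach in [5], not the authors' code; the bisection search for the threshold and all helper names are assumptions.

    import numpy as np
    from sklearn.cluster import KMeans

    def sparse_kmeans(X, k, s, n_iter=10):
        """Alternating scheme for sparse K-means, in the spirit of [5]."""
        n, p = X.shape
        w = np.full(p, 1.0 / np.sqrt(p))          # start with equal feature weights

        for _ in range(n_iter):
            # (a) with w fixed, cluster the data with features scaled by sqrt(w)
            labels = KMeans(n_clusters=k, n_init=10).fit_predict(X * np.sqrt(w))

            # (b) with clusters fixed, per-feature between-cluster sum of squares
            tss = ((X - X.mean(axis=0)) ** 2).sum(axis=0)
            wss = np.zeros(p)
            for c in range(k):
                Xc = X[labels == c]
                if len(Xc):
                    wss += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
            bcss = tss - wss

            # update w by soft-thresholding bcss; bisect delta so that ||w||_1 <= s
            lo, hi = 0.0, bcss.max()
            for _ in range(50):
                delta = 0.5 * (lo + hi)
                w_new = np.maximum(bcss - delta, 0.0)
                w_new = w_new / (np.linalg.norm(w_new) + 1e-12)
                if w_new.sum() > s:
                    lo = delta
                else:
                    hi = delta
            w = w_new

        return labels, w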

The weight ω_j associated with each feature represents the significance of that feature in the clustering, i.e. in differentiating between the extracellular and cellular regions. Features with larger weights are therefore crucial in distinguishing the cellular from the extracellular region, and they correspond to the biologically significant wavenumbers associated with cellular material. Table 4-1 lists the relevant features (wavenumbers) and their corresponding weights; wavenumbers are arranged in decreasing order of weight, and weights less than 0.01 are discarded.

4.8 Observations

The study has shown that clustering methods can be used effectively to generate an image of the Raman spectroscopy scan. For both datasets, a well separated image could be generated distinguishing the cellular and extracellular regions. Both the K-means and spectral clustering methods provided similar, well separated images. These images can be used as a reference to evaluate the clustering performed by the sparse clustering method.
Table 4-1. Weights ω_j and corresponding features (wavenumbers)

Weight    Wavenumber        Weight    Wavenumber
0.437     1128              0.073     868
0.416     1359              0.058     1318
0.328     1343              0.057     1273
0.326     1145              0.052     1444
0.312     1380              0.049     1035
0.268     1111              0.048     933
0.248     538               0.041     1253
0.208     1086              0.040     1428
0.197     555               0.035     402
0.152     1400              0.033     510
0.126     1465              0.030     1302
0.114     1061              0.028     907
0.096     429               0.023     1232
0.090     1161              0.019     724
0.079     842               0.015     1215
0.079     456               0.013     587

Figure 4-6. Image generated from the top 15 features from Sparse Clustering, C3AScan2 scan of dimension 125 X 50

The relevant wavenumbers short-listed by the sparse clustering method (Table 4-1) are used for further learning.

These short-listed wavenumbers can be associated with biologically significant Raman peaks, thereby validating the feature selection process. Further, the cell scan image generated from the top 15 short-listed features is very similar to the reference images generated by the K-means and spectral methods, which further validates the selected features and the feature selection process. Figure 4-6 shows the image generated from the top 15 features selected by the sparse clustering method; it can be compared with the C3AScan2 images produced by K-means clustering (Figure 4-2) and spectral clustering (Figure 4-5). In this study only a preliminary analysis distinguishing the cellular and extracellular regions is performed, and it shows that sparse clustering is effective for both clustering and feature extraction. There is great potential to extend this study further, even to differentiating the various regions inside a cell.

CHAPTER 5
DISCUSSION AND CONCLUSION

Traditional statistical methods fail when handling high dimensional datasets; the introductory chapter discusses the probable reasons behind this behavior. Hence, high dimensional datasets are preferably studied in a lower dimensional space, which can be achieved with the help of a feature selection process. The dimensionality of a dataset can be reduced by picking only a few of the best features. Through feature selection a new dataset is created with only a subset of the original features, yet it captures the maximum information from the original dataset. This study focused on various aspects of extracting this subset of features.

The first section focused on introducing a least squares formulation for proximal support vector machines. The motivation behind this formulation is rooted in the standard method of inducing sparsity in a classification model using the l1 norm within a least squares formulation. It is very common to introduce an l1 norm into a least squares classification model; this induces sparsity in the decision variables and hence helps identify the relevant variables, which normally correspond to the features of a dataset. Proximal support vector machines are very efficient classification algorithms and handle complex datasets well, which was a major motivation for studying proximal planes and investigating ways of reformulating them as a least squares problem. The classification accuracy of the least squares proximal support vector machine is similar to that of the original eigenvalue formulation. Through this study we could thus develop a new least squares formulation for generating the proximal planes. This model could further be used to introduce an l1 norm and thereby induce sparsity for feature selection. In addition, the algorithms developed to solve the least squares formulation have closed form solutions for the proximal planes, which further improves computational efficiency.
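As a generic illustration of how an l1 penalty in a least squares model drives the weights of irrelevant features to zero (this is the standard lasso as implemented in scikit-learn, shown only to illustrate the idea; it is not the proximal support vector machine formulation developed in this thesis, and the data and penalty value are arbitrary):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 20))            # 20 features, only 3 carry signal
    y = X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + 0.1 * rng.normal(size=100)

    model = Lasso(alpha=0.1).fit(X, y)        # least squares with an l1 penalty
    selected = np.flatnonzero(model.coef_)    # features with nonzero weight survive
    print(selected)                           # typically the informative features 0, 1, 2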

The second section discussed a norm that can induce joint sparsity in a dimensionality reduction problem. An l2,1 norm is introduced into a dimensionality reduction problem and the optimization model is solved iteratively. The direct relation between the transformation matrix and the input features is used to discard irrelevant features, which helps create a subset of prominent features. The approach not only reduces the dimensionality of the dataset but also assists in feature extraction. The classification accuracies obtained with the reduced feature set are comparable to those of the well known PCA-SVM classification method. This not only validates the feature selection process but also reaffirms the benefits of eliminating irrelevant features. The reduced dimensional space provided better classification accuracies, better feature interpretability, and reduced computational complexity.
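For reference, the l2,1 norm of a transformation matrix is the sum of the Euclidean norms of its rows, so penalizing it drives entire rows to zero and the corresponding input features can be discarded. A small NumPy sketch of the computation (the matrix values are illustrative only):

    import numpy as np

    def l21_norm(W):
        """l2,1 norm: sum of the Euclidean norms of the rows of W."""
        return np.linalg.norm(W, axis=1).sum()

    W = np.array([[0.0, 0.0],   # an all-zero row: the corresponding feature is discarded
                  [1.0, 2.0],
                  [3.0, 0.5]])
    print(l21_norm(W))          # 0 + sqrt(5) + sqrt(9.25), approximately 5.28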

In the last section, the sparse K-means method was applied to Raman spectroscopy data. This part of the study concentrated on the applicability of sparse clustering methods for generating the image of a Raman spectroscopy scan of a cell. Initially, clustering was performed using the standard methods, viz. the K-means and spectral clustering algorithms. Both algorithms generated similar clusters and produced the expected images given the scan set-up. These images were used as a reference against which to compare the clusters generated by the sparse K-means method. Testing showed that sparse K-means produced similar clusters and also short-listed a set of relevant features, which helped remove irrelevant features and create a subset of prominent features. Further, the wave numbers corresponding to the prominent features could be related to biologically significant wave numbers, which justified the feature selection process and the applicability of the sparse K-means method to Raman spectroscopy data.

To summarize, the study targeted understanding the importance of the feature selection process and the various ways of performing feature selection. The majority of the research focused on labeled datasets, where sparsity was induced in supervised classification models to assist feature selection. Lastly, feature selection in unlabeled datasets was also studied using sparse clustering methods.

REFERENCES

[1] Balasundaram S, Kapil N. Application of Lagrangian twin support vector machines for classification. Second International Conference on Machine Learning and Computing (2010), pp 193397.

[2] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani, Least angleregression, The Annals of Statistics (2004), Volume 32, Number 2, 407-499.

[3] Cao, Alex; Pandya, Abhilash K.; Serhatkulu, Gulay K.; Weber, Rachel E.; Dai,Houbei; Thakur, Jagdish S.; Naik, Vaman M.; Naik, Ratna; Auner, GregoryW.; Rabah, Raja; Freeman, D. Carl (2007). A robust method for automatedbackground subtraction of tissue fluorescence. In Journal of Raman Spectroscopy38(9): 1199-1205.

[4] Cortes C, Vapnik V. Support-vector networks. Machine Learning (1995) 20:273-297.

[5] Daniela M. Witten and Robert Tibshirani (2010). A framework for feature selection in clustering. In J Am Stat Assoc. 105(490):713-726.

[6] T. Evgeniou, M. Pontil, and T. Poggio, Regularization Networks and SupportVector Machines, Advances in Computational Math. (2000),vol. 13, pp. 1-50.

[7] Fan J, Fan Y. (2008) High-dimensional classification using features annealed independence rules. Ann. Statist. 36:2605-2637.

[8] J Fan, Y Feng, X Tong (2012) A road to classification in high dimensional space:the regularized optimal affine discriminant Journal of the Royal Statistical Society.

[9] J Fan, J Lv (2010) A Selective Overview of Variable Selection in High DimensionalFeature Space, Statistica Sinica

[10] G. Fung and O.L. Mangasarian, Proximal Support Vector Machine Classifiers. Proc. Knowledge Discovery and Data Mining, F. Provost and R. Srikant, eds. (2001), pp. 77-86.

[11] M Gallagher, T Downs (1997) Visualization of learning in neural networks using principal component analysis, International Conference on Computational

[12] Ghorai S, Mukherjee A, Dutta PK. Nonparallel plane proximal classifier. Signal Process (2009) 89:510-522.

[13] Guo-Xun Yuan, Kai-Wei Chang, Cho-Jui Hsieh, Chih-Jen Lin, A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification, Journal of Machine Learning Research (2010) 11:3183-3234.

[14] J Hamm, DD Lee (2008) Grassmann Discriminant Analysis: a Unifying View onSubspace-Based Learning, 25th international conference on Machine learning.

[15] Hui Zou, Trevor Hastie, and Robert Tibshirani, Sparse Principal Component Analysis. Journal of Computational and Graphical Statistics (2006), Volume 15, Number 2, Pages 265-286.

[16] Jayadeva R, Khemchandani R, Chandra S. Twin support vector machine for pattern classification. (2007) IEEE Transactions on Pattern Analysis and Machine Intelligence 29(5):905-910.

[17] B Jiang, YH Dai (2013) A Framework of Constraint Preserving Update Schemesfor Optimization on Stiefel Manifold, arXiv:1301.0172

[18] Jun Liu, Shuiwang Ji, Jieping Ye (2012) Multi-Task Feature Learning Via Efficient l2,1-Norm Minimization, Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence.

[19] M Kolar, H Liu (2013) Feature Selection in High-Dimensional Classification. Proceedings of the 30th International Conference on Machine Learning.

[20] Larrañaga P, Calvo B, Santana R, Bielza C, Galdiano J, Inza I, Lozano JA, Armañanzas R, Santafé G, Pérez A, Robles V. (2006). Machine learning in bioinformatics. In Brief Bioinform Vol 7 No. 1:86-112.

[21] K Lee, Y Bresler, M Junge (2012) Subspace Methods for Joint Sparse Recovery, IEEE Transactions on Information Theory.

[22] Liang Sun , Shuiwang Ji , Jieping Ye, A Least Squares Formulation for a Class ofGeneralized Eigenvalue Problems in Machine Learning, Proceedings of the 26thInternational Conference on Machine Learning, Montreal, Canada, 2009.

[23] Mangasarian OL, Wild EW. Multisurface proximal support vector classification via generalized eigenvalues. IEEE Transactions on Pattern Analysis and Machine Intelligence (2006) 28(1):69-74.

[24] O.L. Mangasarian, Least Norm Solution of Non-Monotone ComplementarityProblems, Functional Analysis, Optimization and Mathematical Economics (1990),pp. 217-221, New York: Oxford Univ. Press.

[25] O.L. Mangasarian and R.R. Meyer, Nonlinear Perturbation of Linear Programs, SIAM J. Control and Optimization (1979), vol. 17, no. 6, pp. 745-752.

[26] D Niu, JG Dy, MI Jordan (2011) Dimensionality Reduction for Spectral Clustering, 14th International Conference on Artificial Intelligence and Statistics.

[27] Osborne, M. R., Presnell, B., and Turlach, B. A., A New Approach to Variable Selection in Least Squares Problems, IMA Journal of Numerical Analysis (2000), 20, 389-403.

[28] Quanquan Gu, Zhenhui Li and Jiawei Han (2011) Joint Feature Selection andSubspace Learning, Proceedings of the Twenty-Second International JointConference on Artificial Intelligence

[29] Tibshirani, R., Regression Shrinkage and Selection via the Lasso, Journal of the Royal Statistical Society (1996), Series B, 58, 267-288.

[30] A.N. Tikhonov and V.Y. Arsenin, Solutions of Ill-Posed Problems. New York: John Wiley & Sons (1977).

[31] Ulrike Von Luxburg (2007). A Tutorial on Spectral Clustering. In Statistics andComputing 17(4).

[32] Vapnik V. The nature of statistical learning, (1998) 2nd edn. Springer, New York.

[33] Yunhai Xiao, Soon Yi Wu, Bing Sheng He (2012) A proximal alternating direction method for L2,1 norm least squares problem in multi-task feature learning, Journal of Industrial and Management Optimization.

BIOGRAPHICAL SKETCH

Paul Francis Thottakkara was born in 1985 in Kerala, India. He graduated with a bachelor's degree in Mechanical Engineering from Mahatma Gandhi University in Kerala, India. After his bachelor's degree he worked at Sanmar Engineering Corporation in Chennai, India, for two years and then went to the University of Florida to pursue a master's degree in Industrial Engineering. During his master's program at the University of Florida, Paul was an active member of the UF INFORMS Chapter. It was during the master's program that he developed an interest in data mining techniques and optimization methods. He continued his studies at the University of Florida toward the Engineer degree with a specialization in data mining and optimization. He will graduate with the Engineer degree from the University of Florida in August 2013. After graduation he plans to join industry as a data analyst to utilize his skills and research experience.
