ON FEATURE SELECTION IN DATA MINING
By
PAUL FRANCIS THOTTAKKARA
A THESIS PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
ENGINEER
UNIVERSITY OF FLORIDA
2013
c⃝ 2013 Paul Francis Thottakkara
To my parents Alice Francis and Francis T. Paul
ACKNOWLEDGMENTS
This thesis would not have been possible without the guidance and the help of
several individuals who in one way or another contributed and extended their valuable
assistance in the preparation and completion of this study. Foremost, I would like to
express my gratitude to my adviser Distinguished Prof. Panos M. Pardalos for his
contribution and guidance throughout my research. Besides my advisor, I would like to
thank the rest of my thesis committee: Prof. William Hager and Dr.Petar Momcilovic, for
their help and encouragement.
My sincere thanks to my lab members Vijay Pappu, Dr.Pando G. Georgiev, Mohsen
Rahmani, Michael Fenn, Syed Mujahid for supporting in my research work. I would
like to extend a big thank you to my friends Jorge Sefair, Zehra Melis Teksan, Rachna
Manek, Radhika Medury, Amrutha Pattamatta, Mini Manchanda, Vishnu Narayanan,
Rahul Subramany, Gokul Bhat, Vijaykumar Ramaswamy for the stimulating thoughts and
encouragement.
Finally, I thank my parents Alice Francis and Francis T. Paul and my sister Neetha
Francis for all the motivation and supporting me throughout my life.
TABLE OF CONTENTS
ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 INTRODUCTION TO DATA MINING
  1.1 What is Data Mining
  1.2 Role of Feature Selection in Data Mining

2 LEAST SQUARES FORMULATION FOR PROXIMAL SUPPORT VECTOR MACHINES
  2.1 Data Classification and Separating Hyperplanes
    2.1.1 Support Vector Machines
    2.1.2 Proximal Support Vector Machines
    2.1.3 Twin Support Vector Machines
  2.2 Importance of Least Square Formulations
  2.3 Least Square Formulation for Generating Proximal Planes
    2.3.1 Using Spectral Decomposition
    2.3.2 Special Case Eigenvalue Problem
  2.4 Results and Observations

3 JOINT SPARSE FEATURE SELECTION
  3.1 Dimensionality Reduction
  3.2 L21 Norm and Feature Selection
  3.3 Results and Observations

4 FEATURE SELECTION IN UNLABELLED DATASETS
  4.1 Introduction to Raman Spectra Signals
  4.2 Dataset
  4.3 Data Preprocessing
    4.3.1 Remove Unnecessary Features
    4.3.2 Noise and Background Subtraction
    4.3.3 Peak Selection
  4.4 Clustering
  4.5 K-means Clustering
  4.6 Spectral Clustering
  4.7 Sparse Clustering for Feature Selection
  4.8 Observations

5 DISCUSSION AND CONCLUSION

REFERENCES

BIOGRAPHICAL SKETCH
LIST OF TABLES

2-1 GEV and PSVM-LS Formulation Classification Accuracy
3-1 JS Method: PCA and JS Method Accuracy
3-2 JS Method: Top features, � threshold selection
3-3 JS Method: Top T features
4-1 Weights ω_j and corresponding feature (wavenumber)
LIST OF FIGURES

4-1 K-means Clustering, TRETScan6 scan of dimension 120 × 21
4-2 K-means Clustering, C3AScan2 scan of dimension 125 × 50
4-3 K-means, C3AScan2 scan with nucleus marked
4-4 Spectral Clustering, TRETScan6 scan of dimension 120 × 21
4-5 Spectral Clustering, C3AScan2 scan of dimension 125 × 50
4-6 Cluster using top 15 features, C3AScan2 scan
Abstract of Thesis Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Engineer
ON FEATURE SELECTION IN DATA MINING
By
Paul Francis Thottakkara
August 2013
Chair: Panos M. Pardalos
Major: Industrial and Systems Engineering
Analysing and extracting useful information from high dimensional datasets
challenges the frontiers of statistical tools and methods. Traditional methods tend to
fail when dealing with high dimensional datasets. Small sample size has always been
a problem in statistical tests, and the problem is aggravated in high dimensional data,
where the number of features is comparable to or larger than the number of samples.
The power of a statistical test is its ability to reject a false null hypothesis, and
sample size is a major factor in controlling the probability of a type II error and in
making valid conclusions. Hence one efficient way of handling high dimensional
datasets is to reduce their dimension through feature selection, so that valid statistical
conclusions can be drawn more easily.
This work focuses on different aspects associated with feature selection in data
mining. Feature selection is one of the active research areas in data mining. The main
idea behind feature selection methods is to identify a subset of original input features
that are pivotal in data classification or understanding. Feature selection helps in
eliminating features with little or no predictive information. The discussions
in this thesis fall into three major sections: the first introduces a least
squares formulation for proximal support vector machines, the second introduces the l2,1 norm
as a method for inducing sparsity, and the last discusses the applicability of
sparse clustering to Raman spectroscopy data.
CHAPTER 1
INTRODUCTION TO DATA MINING
1.1 What is Data Mining
Generally, data mining can be described as the science of analysing data to extract
useful information using statistical tools and methods. The prospect of understanding
a data-driven system through its underlying data has been a major motivating factor
in promoting research in the field of data mining. Immense growth in technology has
made data collection cheaper and more efficient. Meanwhile, an exponential rise in
computational power has made data processing much faster and more economical.
These advances have accelerated research in data mining.
Data classification, feature selection and outlier detection are the major research
areas in the field of data mining. Data classification problems focus on learning from a set
of training data, and then using that information to predict the nature of
any new data point. Data classification can be a supervised or an unsupervised learning
method, depending on the availability of label information for the training data. A supervised
classification algorithm studies a labelled training set (class labels of the training data are
available) and generates a classification model to classify any new data points
observed. Unsupervised algorithms, on the other hand, try to find patterns in
unlabelled training data.
In any statistical analysis, the three major factors of concern are statistical accuracy,
model interpretability and computational complexity. For any classification model, it is
necessary to ensure that none of these three factors is compromised.
A set of data points is normally expressed as a matrix, where each column represents a
data point and the rows represent features. Standard statistical methods require a larger
number of data points than the dimension of the feature space to make valid statistical
inferences. Hence standard datasets have a large column dimension compared to the
row dimension. Research in the past two decades has paved the way for efficient data
mining algorithms that perform well on standard datasets. But most traditional
classification models behave poorly when handling high dimensional datasets, i.e.
datasets where the number of features is comparable to or larger than the number of data
points. One of the prime reasons for this poor performance is the compromise
in statistical accuracy and computational complexity caused by
the higher dimensional feature space. Another concern with data classification in high
dimensional space is the large amount of collinearity between features, which can result in
wrong model selection [7]. Overfitting and higher noise levels are also associated with
high dimensional datasets. These drawbacks of higher dimensional data in
data mining are referred to as the curse of dimensionality.
1.2 Role of Feature Selection in Data Mining
Feature selection is one of the prime focuses in the field of data mining. Given a
dataset, feature selection can be generalized as the process of selecting a subset of
features for use in further data analysis. This selected subset of features is expected to
capture maximum information present in the dataset, i.e. the selected feature subset
should contain the most prominent features for model construction. Feature selection
is particularly important in high dimensional datasets since it reduces dimensionality
and thereby nullifies the effects of the curse of dimensionality. Further, in many real
life systems, feature selection is very important in identifying the behaviour and
performance of the system. Especially in biomedical applications, feature selection
can play a pivotal role in identifying biomarkers. In a disease classification problem
in a genomic study, for example, feature selection techniques can identify the genes
that differentiate the diseased and healthy cells. This not only helps the data analyst in
reducing data dimension, but is also a huge breakthrough for biologists to understand
the biological system and identify the disease triggering genes.
CHAPTER 2
LEAST SQUARES FORMULATION FOR PROXIMAL SUPPORT VECTOR MACHINES
2.1 Data Classification and Separating Hyperplanes
A hyperplane in an n-dimensional vector space is a flat subset
of dimension n−1, and it separates the vector space into two disjoint half spaces.
Many data classification algorithms focus on finding hyperplanes that separate data
into different classes or that assist in approximating data. One of the earliest algorithms in
machine learning was the perceptron, which generates a separating hyperplane
by minimizing the distance of misclassified points to the decision boundary. Perceptron
methods gained huge momentum and remained the major method for more than
a decade. However, the algorithm had a number of issues, such as the existence of
multiple separating hyperplanes, slow convergence and failure on inseparable data.
This limited the applicability of perceptron algorithms to complex and large datasets,
and motivated the development of more advanced and robust algorithms that could
efficiently handle such data. One of these was the Support Vector Machine,
which produced efficient and robust classification models, was effective in handling
complex datasets and achieved lower generalization error.
2.1.1 Support Vector Machines
Support vector machine (SVM) is a supervised data classification method
introduced by Vladimir Vapnik and coworkers [4, 32]. The basic idea of SVM is to
generate a hyperplane that separates the data points of the two classes. If the two
classes are linearly separable, the standard SVM generates a hyperplane
that divides the input space into two disjoint half spaces, with each class lying in
one of them. As more than one hyperplane may separate the
classes, the SVM algorithm selects the hyperplane that is farthest from its closest data
points, i.e. the maximum-margin hyperplane. Any new data point is assigned to a class
based on the half space in which it lies.
2.1.2 Proximal Support Vector Machines
Proximal Support Vector Machine (PSVM) introduced by Fung and Mangasarian
can be considered closely related to the SVM Classifier [10]. Standard SVM classifies
points based on their location in the disjoint subspaces generated by the hyperplane
while PSVM classifies points based on their proximity to two parallel hyperplanes. The
objective of PSVM is to generate two parallel hyperplanes with each plane closest
to one class while being farthest from the other class. Later Mangasarian and Wild
introduced an extension to PSVM called the Multi-surface Proximal Support Vector
Machine (MPSVM) by relaxing the requirement of proximal planes being parallel [23].
MPSVM generates two hyperplanes such that each plane is closest to one class and
farthest from the other class. As MPSVM closely resembles PSVM, the two are used
interchangeably in the literature. In this study, PSVM refers to the Multi-surface Proximal
Support Vector Machine.
Consider a binary classification problem with two classes represented as A ∈ ℜ^{n1×m}
and B ∈ ℜ^{n2×m}, where n1 + n2 = n is the number of samples or data points and m is the
dimension of the input space. The proximal hyperplane closest to class A and farthest
from class B is given by

P_A = {x ∈ ℜ^m | ⟨ω, x⟩ − γ = 0}    (2–1)
Then the optimization model to obtain P_A in PSVM can be formulated as:

min_{(ω,γ) ≠ 0}  ∥Aω − eγ∥² / ∥Bω − eγ∥²    (2–2)
where the numerator is the sum of squared distances from the hyperplane P_A to
the points in class A and the denominator is the sum of squared distances from the
hyperplane P_A to the points in class B. The basic idea behind the optimization model
is to generate P_A such that it is closest to class A and farthest from class B. A Tikhonov
regularization term is introduced to the optimization model to avoid degenerate solutions
[30]. The regularized optimization model [6, 24, 25] is given by:

min_{(ω,γ) ≠ 0}  ( ∥Aω − eγ∥² + δ ∥[ω; γ]∥² ) / ∥Bω − eγ∥²    (2–3)

where δ > 0 is a regularization constant.
Define

G_A = [A  −e]^T [A  −e] + δI ,   H_B = [B  −e]^T [B  −e],   z^T = [ω^T  γ]
Substituting the above variables into the optimization model 2–3, it can be reformulated
as

min_{z ≠ 0}  f(z) := z^T G_A z / z^T H_B z   ⇔   max_{z ≠ 0}  f(z) := z^T H_B z / z^T G_A z    (2–4)

The stationary points of 2–4 are given by the eigenvectors of the generalized
eigenvalue problem GEV(H_B, G_A):

H_B z = λ G_A z    (2–5)

and the hyperplane P_A is given by the eigenvector corresponding to the largest
eigenvalue.
Similarly, the proximal hyperplane P_B (farthest from class A and closest to class B),
given by

P_B = {x ∈ ℜ^m | ⟨ω̃, x⟩ − γ̃ = 0}

can be found by solving for the eigenvector corresponding to the maximum eigenvalue of the
generalized eigenvalue problem GEV(H_A, G_B), where

G_B = [B  −e]^T [B  −e] + νI ,   H_A = [A  −e]^T [A  −e],   z^T = [ω̃^T  γ̃]
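As a concrete illustration, the construction of G_A and H_B and the solution of the GEV problem 2–5 can be sketched in a few lines; the two-feature toy classes are invented for the example, and scipy's general eigensolver stands in for whatever routine an implementation might use.

```python
import numpy as np
from scipy.linalg import eig

rng = np.random.default_rng(1)
# Invented toy classes: rows are points, columns are the m = 2 input features.
A = rng.normal([1.0, 0.0], 0.2, size=(30, 2))
B = rng.normal([0.0, 1.0], 0.2, size=(25, 2))
delta = 1e-3  # Tikhonov regularization constant

def augment(M):
    # [M  -e], so that z = [omega; gamma] encodes <omega, x> - gamma
    return np.hstack([M, -np.ones((M.shape[0], 1))])

G_A = augment(A).T @ augment(A) + delta * np.eye(3)
H_B = augment(B).T @ augment(B)

# Stationary points of 2-4: the GEV problem H_B z = lambda G_A z;
# the plane P_A comes from the eigenvector of the largest eigenvalue.
vals, vecs = eig(H_B, G_A)
z = np.real(vecs[:, np.argmax(np.real(vals))])
omega, gamma = z[:2], z[2]

def msd(M):
    # mean squared distance of the points of M to the plane <omega, x> = gamma
    return np.mean((M @ omega - gamma) ** 2) / (omega @ omega)
```

P_A should hug class A, so `msd(A)` comes out far smaller than `msd(B)` on this data.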
2.1.3 Twin Support Vector Machines
Twin Support Vector Machine (TWSVM), introduced by Jayadeva et al. [16], is very
similar to the generalized eigenvalue PSVM, in the sense that it also produces two
nonparallel planes such that each plane is close to one class and away from the other class.
However, TWSVM is not an exact reformulation of the PSVM model 2–3, but is very
close to the standard SVM formulation.
Twin Support Vector Machine model solves a pair of quadratic programming (QP)
problems, where each QP finds a hyperplane closest to points of one class and at
least a unit distance from the points of the other class. The TWSVM classifier is obtained
by solving the following pair of QP problems, where TWSVM1 generates the hyperplane P_A,
i.e. the hyperplane closest to class A and farthest from class B [1, 16]. Similarly, TWSVM2
generates the hyperplane P_B.
TWSVM1 ⇒  min_{ω1, γ1, q}  (1/2) (Aω1 + e1 γ1)^T (Aω1 + e1 γ1) + c1 e2^T q
          s.t.  −(Bω1 + e2 γ1) + q ≥ e2 ,  q ≥ 0    (2–6)

TWSVM2 ⇒  min_{ω2, γ2, q}  (1/2) (Bω2 + e2 γ2)^T (Bω2 + e2 γ2) + c2 e1^T q
          s.t.  −(Aω2 + e1 γ2) + q ≥ e1 ,  q ≥ 0    (2–7)
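To make the pair of QPs concrete, TWSVM1 in 2–6 can be solved for invented toy data; the sketch below uses scipy's general SLSQP solver as a stand-in for the dedicated QP solvers used in the TWSVM literature, and the data and penalty c1 are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
A = rng.normal([1.0, 0.0], 0.2, size=(15, 2))   # class A (invented)
B = rng.normal([0.0, 1.0], 0.2, size=(12, 2))   # class B (invented)
n2, c1 = len(B), 1.0

# Decision variables v = [omega_1 (2), gamma_1 (1), q (n2 slacks)].
def objective(v):
    r = A @ v[:2] + v[2]                  # A omega_1 + e_1 gamma_1
    return 0.5 * r @ r + c1 * v[3:].sum()

cons = [
    # -(B omega_1 + e_2 gamma_1) + q >= e_2
    {"type": "ineq", "fun": lambda v: -(B @ v[:2] + v[2]) + v[3:] - 1.0},
    {"type": "ineq", "fun": lambda v: v[3:]},   # q >= 0
]
res = minimize(objective, x0=np.zeros(3 + n2), constraints=cons, method="SLSQP")
omega1, gamma1, q = res.x[:2], res.x[2], res.x[3:]
```

At the solution the plane sits close to class A (small residuals Aω1 + e1γ1) while class B is pushed at least a unit distance away, up to the penalized slacks q.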
2.2 Importance of Least Square Formulations
Technological advances in the last decade have introduced new and efficient tools
for data collection, especially in the field of biomedicine. This has paved the way for a large
number of cases with high dimensional datasets, i.e. with a large number of input features
compared to the number of samples. As discussed in the introductory chapter, traditional
data mining techniques have produced appreciable results on standard datasets. But
when data are represented as high dimensional feature vectors with limited sample size,
they pose a great challenge for standard algorithms. Hence reducing the number of features
is very important for efficient and effective data analysis.
Feature selection methods can be classified into three categories: filter,
wrapper and embedded methods. The simplest idea is to select a subset of features
from the original set based on a feature ranking procedure; this is the filter
method. Wrapper methods compare a set of candidate feature subsets by their
performance in predicting the data, and select the best subset. Embedded methods
perform feature selection as part of the classification model construction process. One
very common approach is to add an l1 penalty to a least squares classification model,
which induces sparsity in the model. This assists in feature selection by removing
irrelevant features.
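A minimal sketch of this embedded idea: an l1-penalized least squares problem solved by a hand-rolled proximal-gradient (ISTA) loop rather than any particular package. The synthetic data and penalty weight are invented for illustration; the point is that the soft-threshold step drives the coefficients of irrelevant features exactly to zero.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 60, 20
X = rng.normal(size=(n, m))
w_true = np.zeros(m)
w_true[:3] = [2.0, -1.5, 1.0]             # only the first 3 features matter
y = X @ w_true + 0.01 * rng.normal(size=n)

lam = 0.5                                  # l1 penalty weight (illustrative)
step = 1.0 / np.linalg.norm(X, 2) ** 2     # safe gradient step size

w = np.zeros(m)
for _ in range(2000):
    w -= step * (X.T @ (X @ w - y))        # gradient step on 0.5 * ||Xw - y||^2
    w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)  # soft threshold

selected = np.flatnonzero(w)               # features surviving the l1 penalty
```

On this data the surviving set contains the three informative features and discards (almost) all of the noise features.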
Feature extraction for high dimensional datasets is very important as most features
in high dimensional vectors are usually non-informative or noisy and could affect
the generalization performance. Hence there is great interest in many data mining
applications for inducing sparsity in high dimensional datasets with respect to input
features to remove insignificant features. Such sparse representations can provide
information on relevant features and thereby assist in feature selection. Further,
classification models with a sparse data matrix can simplify the decision rule for faster
prediction in large-scale problems. Finally, in many data analysis applications, a small
set of features is desirable to interpret the results. As sparsity can be very easily
introduced to a least squares mathematical formulation using a regularization term,
the focus of this study is to generate a least squares formulation for proximal support
vector machines. The Least Absolute Shrinkage and Selection Operator (LASSO)
method introduced by Tibshirani [2, 29] can be effectively applied to the least squares
formulation for inducing sparsity. Further, the robust and efficient classification model of
PSVM makes it an attractive model to study. These factors motivated the investigation
of a least squares formulation for generating proximal planes.
2.3 Least Square Formulation for Generating Proximal Planes
2.3.1 Using Spectral Decomposition
Zou et al. [15] proved the following theorem that establishes the relation between
eigenvalue problems and least-squares problems.
Theorem 2.1 (Zou et al.). Consider a real matrix X ∈ ℜ^{n×p} with rank r ≤ min(n, p). Let
matrices V ∈ ℜ^{p×p} and D ∈ ℜ^{p×p} satisfy the following relation:

V^T (X^T X) V = D    (2–8)

where D = diag(σ1², σ2², ..., σr², 0, 0, ..., 0)_{p×p}. Assume σ1² ≥ σ2² ≥ ··· ≥ σr². For the
following optimization problem,

min_{α,β}  Σ_{i=1}^{n} ∥X_i − α β^T X_i∥² + λ β^T β
s.t.  α^T α = 1    (2–9)

β_opt ∝ V1, where X_i is the i-th row of matrix X and V1 is the eigenvector corresponding
to the largest eigenvalue σ1².
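A quick numerical check of a special case of the theorem: with λ = 0 and α = β constrained to a common unit vector v, the reconstruction objective in 2–9 is minimized at v = V1, the top eigenvector of X^T X. The random matrix below is invented for the check.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 8))

def objective(v):
    # sum over rows X_i of ||X_i - v v^T X_i||^2, i.e. alpha = beta = v, lambda = 0
    return ((X - np.outer(X @ v, v)) ** 2).sum()

# V1: eigenvector of X^T X belonging to the largest eigenvalue sigma_1^2
evals, evecs = np.linalg.eigh(X.T @ X)
v1 = evecs[:, -1]
```

Comparing `objective(v1)` against the objective at random unit directions confirms V1 attains the minimum.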
Using Theorem 2.1, an equivalent least squares formulation for the proximal
hyperplanes can be developed from the eigenvalue formulation 2–5. Let the Cholesky
decomposition of the matrices H_B and G_A be given by:

H_B = L_B L_B^T = U_B^T U_B
G_A = L_A L_A^T = U_A^T U_A    (2–10)

where L_A, L_B are lower triangular matrices, and U_A, U_B are upper triangular matrices.
Substituting these in GEV(H_B, G_A),

H_B z = λ G_A z
L_B L_B^T z = λ U_A^T U_A z
U_A^{−T} L_B L_B^T z = λ U_A z
U_A^{−T} L_B L_B^T U_A^{−1} U_A z = λ U_A z
(L_B^T U_A^{−1})^T (L_B^T U_A^{−1}) U_A z = λ U_A z
(L_B^T U_A^{−1})^T (L_B^T U_A^{−1}) y = λ y    (2–11)
where U_A z = y. The optimal eigenvector related to the proximal hyperplane P_A in PSVM
can be found from the relation

z_opt = U_A^{−1} y    (2–12)

where y is the eigenvector corresponding to the maximum eigenvalue of the
symmetric eigenvalue problem 2–11.
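The reduction 2–10 through 2–12 is easy to verify numerically: the top eigenvector of the symmetric problem 2–11, mapped back through U_A^{−1}, should reproduce the top eigenvector of the direct GEV(H_B, G_A). The toy matrices below are invented, and the small jitter added to H_B is an assumption so that its Cholesky factor exists.

```python
import numpy as np
from scipy.linalg import cholesky, eig, eigh

rng = np.random.default_rng(5)
A = rng.normal(size=(30, 3)); B = rng.normal(size=(25, 3))
Aa = np.hstack([A, -np.ones((30, 1))]); Ba = np.hstack([B, -np.ones((25, 1))])
G_A = Aa.T @ Aa + 1e-3 * np.eye(4)    # positive definite (delta = 1e-3)
H_B = Ba.T @ Ba + 1e-9 * np.eye(4)    # jitter so the Cholesky factor exists

L_B = cholesky(H_B, lower=True)       # H_B = L_B L_B^T
U_A = cholesky(G_A)                   # G_A = U_A^T U_A (upper triangular)

M = L_B.T @ np.linalg.inv(U_A)        # M = L_B^T U_A^{-1}
svals, svecs = eigh(M.T @ M)          # symmetric problem (2-11)
y = svecs[:, -1]                      # eigenvector of the largest eigenvalue
z = np.linalg.solve(U_A, y)           # z_opt = U_A^{-1} y   (2-12)

# Direct generalized eigenvalue problem H_B z = lambda G_A z for comparison.
gvals, gvecs = eig(H_B, G_A)
z_direct = np.real(gvecs[:, np.argmax(np.real(gvals))])
```

Up to scaling, `z` and `z_direct` point in the same direction, and the top eigenvalues of the two problems agree.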
By substituting X = L_B^T U_A^{−1}, β̃ = U_A β and (L_B^T U_A^{−1})_i = U_A^{−T} U_{B,i}
in the least squares problem 2–9 of Theorem 2.1, and re-arranging the terms, the following
least squares optimization problem is obtained:

min_{α,β}  ∥U_B U_A^{−1} − U_B β α^T∥² + λ β^T G_A β
s.t.  α^T α = 1    (2–13)

where β_opt is proportional to z1, the optimal eigenvector corresponding to the largest
eigenvalue of GEV(H_B, G_A).
The optimization problem 2–13 can be solved by alternating over α and β.
Fixed β: For a fixed β, the following optimization problem is solved to obtain α:

min_α  ∥U_B U_A^{−1} − U_B β α^T∥²
s.t.  α^T α = 1    (2–14)

Expanding the objective function ∥U_B U_A^{−1} − U_B β α^T∥²,

(U_B U_A^{−1} − U_B β α^T)^T (U_B U_A^{−1} − U_B β α^T) ≈ −2 α^T U_A^{−T} H_B β + α^T α β^T H_B β

Substituting α^T α = 1, the optimization problem 2–14 can be re-written as:
max_α  α^T U_A^{−T} H_B β
s.t.  α^T α = 1    (2–15)

An analytical solution to this problem exists, and α_opt is given by

α_opt = U_A^{−T} H_B β / ∥U_A^{−T} H_B β∥    (2–16)
Fixed α: For a given α, the optimization problem 2–13 reduces to a ridge
regression-type problem. To see this, let A⊥ be an orthogonal matrix such that [α; A⊥]
is p × p orthogonal. Then,

∥U_B U_A^{−1} − U_B β α^T∥²
= ∥U_B U_A^{−1} [α; A⊥] − U_B β α^T [α; A⊥]∥²
= ∥U_B U_A^{−1} α − U_B β∥² + ∥U_B U_A^{−1} A⊥∥²    (2–17)

So, for a fixed α, β optimizes the following regression problem:

min_β  ∥U_B U_A^{−1} α − U_B β∥² + λ β^T G_A β    (2–18)

In this case as well, an analytical solution can be found, given by:

β_opt = (H_B + λ G_A)^{−1} H_B U_A^{−1} α    (2–19)
The following algorithm summarizes the steps needed to solve for each optimal
hyperplane in PSVM using the least-squares (LS) approach:
Algorithm 1 PSVMs-via-LS (H_B, G_A)

1. Initialize β.
2. Find the upper triangular matrix U_A from the Cholesky decomposition of G_A.
3. Find α from the following relation:
   α = U_A^{−T} H_B β / ∥U_A^{−T} H_B β∥    (2–20)
4. Find β as follows:
   β = (H_B + λ G_A)^{−1} H_B U_A^{−1} α    (2–21)
5. Alternate between steps 3 and 4 until convergence.
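A sketch of Algorithm 1 in Python, on invented toy data and with λ chosen very small (an assumption, so the regularized update stays near the unregularized fixed point). Eliminating α from the two updates shows each pass is essentially one step of power iteration on G_A^{−1} H_B, so β should converge to the direction of the top eigenvector of GEV(H_B, G_A).

```python
import numpy as np
from scipy.linalg import cholesky, eig

rng = np.random.default_rng(6)
A = rng.normal([1.0, 0.0], 0.3, size=(40, 2))
B = rng.normal([0.0, 1.0], 0.3, size=(35, 2))
Aa = np.hstack([A, -np.ones((40, 1))]); Ba = np.hstack([B, -np.ones((35, 1))])
delta, lam = 1e-3, 1e-8
G_A = Aa.T @ Aa + delta * np.eye(3)
H_B = Ba.T @ Ba

U_A = cholesky(G_A)                    # step 2: G_A = U_A^T U_A, upper triangular
beta = rng.normal(size=3)              # step 1: initialize beta

for _ in range(200):                   # step 5: alternate until convergence
    v = np.linalg.solve(U_A.T, H_B @ beta)
    alpha = v / np.linalg.norm(v)      # step 3, eq. (2-20)
    beta = np.linalg.solve(H_B + lam * G_A, H_B @ np.linalg.solve(U_A, alpha))
    beta /= np.linalg.norm(beta)       # step 4, eq. (2-21), rescaled for stability

# Reference: top eigenvector of GEV(H_B, G_A), which beta should align with.
vals, vecs = eig(H_B, G_A)
z1 = np.real(vecs[:, np.argmax(np.real(vals))])
```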
2.3.2 Special Case Eigenvalue Problem
Liang Sun et al. [22] introduced a theorem that establishes a relation between a specially
structured eigenvalue problem and a least squares problem.
Theorem 2.2. Consider a generalized eigenvalue problem of the form

X S X^T ω = λ X X^T ω   or   (X X^T)† X S X^T ω = λ ω    (2–22)

where X ∈ ℜ^{m×n} is the data matrix (n is the number of training samples and m the dimension
of the points), S ∈ ℜ^{n×n} is a symmetric positive semi-definite matrix and (X X^T)†
is the pseudo-inverse of X X^T.

Further assume that the following conditions are satisfied:

• the columns of X are centered, i.e. Xe = 0
• rank(X) = n − 1
• the column vectors of X are linearly independent before centering
As S is symmetric and positive semi-definite, it can be decomposed as

S = H H^T    (2–23)

using the Cholesky decomposition, where H ∈ ℜ^{n×s} and s ≤ n.

The matrices H and X undergo further decomposition to obtain U1, Σ1, V1^T, Q and U_R.
Consider the following sequence of decompositions:

QR decomposition ⇒ HP = QR
Singular value decomposition ⇒ R = U_R Σ_R V_R^T
Compact singular value decomposition ⇒ X = U Σ V^T = U1 Σ1 V1^T    (2–24)

With the above conditions satisfied, the nonzero eigenvalues of the generalized
eigenvalue problem 2–22 are diag(Σ_R²) and the corresponding eigenvectors are

W_eig = U1 Σ1^{−1} V1^T Q U_R    (2–25)
Now, consider a regression problem with n training samples {(x_i, t_i), i = 1, ..., n}, where
x_i ∈ ℜ^m is the observation and t_i ∈ ℜ^k the corresponding target. Least squares is a
classical approach for solving regression problems. The least squares formulation for the
regression problem is given by

min_W  Σ_{i=1}^{n} ∥W^T x_i − t_i∥₂² = ∥W^T X − T∥_F²    (2–26)

where W ∈ ℜ^{m×k} is the weight matrix. The optimal solution W_ls minimizes the sum of squared
errors, and the closed form solution to the least squares problem is given as

W_opt = (X X^T)† X T^T .    (2–27)

Define the target matrix for the least squares formulation 2–26 as:

T = U_R^T Q^T    (2–28)
T ∈ ℜ^{r×n}; then the solution to the least squares problem is given as:

W_{ls} = (XX^T)^† X T^T = U_1 Σ_1^{−1} V_1^T Q U_R    (2–29)
Based on the results 2–25 and 2–29, Liang Sun et al. [22] prove the equivalence of the eigenvalue problem and the least squares problem under these conditions.
Using Theorem 2.2, we can generate the proximal hyperplanes PA and PB via the least-squares formulation. Consider the generalized eigenvalue problem H_B z = λG_A z for the proximal plane PA; the following optimization model can be derived from the original model 2–4.
max_{z≠0} f(z) := (z^T C H_B C z) / (z^T C G_A C z)  ⇔  max_{y≠0} f(y) := (y^T H_B y) / (y^T G_A y)    (2–30)
where y = Cz and C = I − (1/n) e e^T is a centering matrix. The centering matrix is used to center the data matrix to its mean (Xe = 0 if X is centered). The optimal solutions of the two problems are related by y_{opt} = C z_{opt}, i.e. the optimal solution to the derived model is multiplied by the centering matrix C to obtain the optimal solution to the original problem.
Properties of the Centering Matrix C
• C is symmetric: C = C^T
• C is idempotent: C^2 = C
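Both properties, and the centering effect itself, are easy to verify numerically; a small sketch:

```python
import numpy as np

n = 6
e = np.ones((n, 1))
C = np.eye(n) - (e @ e.T) / n        # C = I - (1/n) e e^T

rng = np.random.default_rng(0)
X = rng.normal(size=(3, n))          # d x n data matrix, points as columns
Xc = X @ C                           # subtracts the mean point from every column
```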
Cholesky decomposition of H_B and G_A gives

H_B = U_B^T U_B
G_A = L_A L_A^T    (2–31)
Singular value decomposition of L_A:

L_A = UΣV^T    (2–32)
Applying the Cholesky decomposition to the centered problem, we have

C H_B C z = λ C G_A C z
C U_B^T U_B C z = λ C L_A L_A^T C z    (2–33)
Define

H = V Σ^† U^T U_B^T,    (2–34)

where Σ^† is the pseudo-inverse of Σ.
Introducing UΣV^T VΣ^†U^T = I into Equation 2–33 and substituting for H and L_A:

C UΣV^T VΣ^†U^T U_B^T U_B UΣ^†V^T VΣU^T C z = λ C L_A L_A^T C z
C L_A H H^T L_A^T C z = λ C L_A L_A^T C z    (2–35)
The GEV problem H_B z = λG_A z is thus reformulated as

C L_A H H^T L_A^T C z = λ C L_A L_A^T C z
(C L_A) H H^T (C L_A)^T z = λ (C L_A)(C L_A)^T z    (2–36)

where L̄_A = C L_A is centered and S = HH^T is a symmetric and positive semi-definite matrix.
Applying Theorem 2.2, solving L̄_A H H^T L̄_A^T z = λ L̄_A L̄_A^T z is equivalent to

min_W ∥W^T L̄_A − T∥_F^2    (2–37)
where T is generated from H and L̄_A using Equation 2–28 in Theorem 2.2. Further, at optimality the column vectors of W represent the eigenvectors of the generalized eigenvalue problem 2–36. The closed form solution, referring to Equation 2–27, is given by W_{opt} = (L̄_A L̄_A^T)^† L̄_A T^T.
The following algorithm summarizes the steps needed to solve for each optimal
hyperplane in PSVM using the least-squares (LS) approach derived from Theorem 2.2:
Algorithm 2 PSVMs-via-LS (H_B, G_A)
1. Using Cholesky decomposition, find the upper triangular matrix U_B from H_B and the lower triangular matrix L_A from G_A.
2. Find U, Σ, V using the singular value decomposition of L_A.
3. Center the lower triangular matrix L_A to create L̄_A.
4. Generate H using Equation 2–34.
5. Apply Theorem 2.2 to generate T from H and L̄_A using 2–28.
6. The closed form solution is obtained using the equation

W_{opt} = (L̄_A L̄_A^T)^† L̄_A T^T    (2–38)
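The steps above can be sketched in NumPy. Theorem 2.2 involves a pivoted QR factorization (HP = QR); for simplicity this sketch takes P = I and uses a plain QR, so it illustrates the data flow rather than being a faithful implementation:

```python
import numpy as np

def psvm_via_ls(HB, GA):
    # 1. Cholesky factors: HB = UB^T UB (upper), GA = LA LA^T (lower)
    UB = np.linalg.cholesky(HB).T
    LA = np.linalg.cholesky(GA)
    # 2. SVD of LA
    U, s, Vt = np.linalg.svd(LA)
    # 3. centre LA: LA_bar = C LA with C = I - (1/n) e e^T
    n = LA.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n
    LA_bar = C @ LA
    # 4. H = V Sigma^+ U^T UB^T (Equation 2-34)
    H = Vt.T @ np.diag(np.where(s > 1e-12, 1.0 / s, 0.0)) @ U.T @ UB.T
    # 5. T = UR^T Q^T (Equation 2-28), with a plain QR of H and SVD of R
    Q, R = np.linalg.qr(H)
    UR, _, _ = np.linalg.svd(R)
    T = UR.T @ Q.T
    # 6. closed-form solution (Equation 2-38)
    return np.linalg.pinv(LA_bar @ LA_bar.T) @ LA_bar @ T.T
```

The test below only checks that the routine runs on random symmetric positive definite inputs and returns a finite matrix of the expected size.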
2.4 Results and Observations
In this chapter we introduced two least squares formulations for generating the proximal planes of PSVM. The correctness of the least squares formulations is confirmed mathematically using Theorems 2.1 and 2.2. It can be further validated by comparing the classification accuracies of the least squares formulations with the accuracies obtained from the standard PSVM formulation (the generalized eigenvalue formulation). In this study, 10-fold cross validation accuracies are reported. As the least squares models are just reformulations of the standard generalized eigenvalue model, their classification accuracies are expected to be the same as, or very close to, the accuracies from the standard formulation.
Numerical tests were done on publicly available binary class datasets. In the results Table 2-1, the column Dimensions gives the number of data points by the number of features in the dataset. Colon, DBWorld and DLBCL are high dimensional datasets while the others are standard datasets. The PSVM-Eig column shows the accuracies obtained using the standard generalized eigenvalue formulation, the PSVM-LS-F1 column shows accuracies associated with the least squares formulation using Theorem 2.1 of Zou et al., and the PSVM-LS-F2 column shows accuracies associated with the least squares
Table 2-1. Results table: PSVM-Eig represents the standard generalized eigenvalue formulation, PSVM-LS-F1 is the least squares formulation using Theorem 2.1, Zou et al. [15], and PSVM-LS-F2 is the least squares formulation using Theorem 2.2, Liang Sun et al. [22]

Dataset     Dimensions  Class Ratio   PSVM-Eig  PSVM-LS-F1  PSVM-LS-F2
WDBC        569*30      212 : 357     93.3%     93.30%      92.80%
Spambase    4601*57     1813 : 2788   68.0%     67.96%      76.40%
Ionosphere  351*34      126 : 225     76.9%     76.91%      75.48%
WPBC        198*33      47 : 151      74.9%     74.70%      73.79%
Mushroom    8124*126    3916 : 4208   99.8%     99.80%      99.80%
Colon       62*2000     40 : 22       87.1%     87.14%      87.14%
DBWorld     64*4702     35 : 29       90.7%     90.71%      90.71%
DLBCL       77*5469     58 : 19      81.8%     81.79%      75.36%
formulation using Theorem 2.2 from Liang Sun et al. The results indicate that both least squares approaches are valid representations of the proximal planes, as they generate classification accuracies similar to those of the standard PSVM formulation. This new formulation paves the way for an easy introduction of embedded feature selection techniques into proximal support vector machines (PSVMs). The l1 norm can be introduced into the new least squares formulations developed in this study to obtain sparse classification and in turn attain feature selection for PSVMs.
The quadratic model for PSVM using the Twin Support Vector Machine can also use the l1 norm to induce sparsity. This is a direct method of inducing sparsity; however, when the l1 norm is introduced into the TWSVM model, it becomes a non-differentiable constrained optimization model which is computationally very challenging, while the least squares models developed in this chapter can be solved more efficiently. After introducing the l1 norm into the least squares formulation 2–13 developed from Theorem 2.1, the optimization problem can be solved iteratively by alternating between α and β. Efficient algorithms [13] exist to solve the least squares formulation 2–37 obtained using Theorem 2.2. These merits further signify the applicability of the new least squares approaches introduced in this study.
CHAPTER 3
JOINT SPARSE FEATURE SELECTION
3.1 Dimensionality Reduction
Dimensionality reduction is the technique of projecting a set of input data points onto a smaller dimensional space, that is, representing the data on a lower dimensional subspace. It is normally achieved through feature selection or feature extraction methods. Feature selection identifies a set of prominent features, and the reduced subspace is determined by this selected set of features, while feature extraction creates derived features which are combinations of existing features, and these new derived features are used for generating the reduced subspace. Dimensionality reduction has many advantages in data mining, especially while handling high dimensional datasets.
Dimensionality reduction techniques play a vital role for high dimensional datasets as they help reduce the dimensionality of the input space with minimum loss of information. Principal component analysis (PCA) is a very common dimensionality reduction technique based on feature extraction. It creates a set of derived features, or linear subspaces, that are linear combinations of the existing features such that maximum data variance is accounted for in the new subspace. This can also be looked at from the perspective of creating a new subspace where each basis vector of the subspace is a linear combination of the existing features.
Consider a data matrix X ∈ ℜ^{d×n}, where d is the number of features and n the number of data points, and let U ∈ ℜ^{d×r} be the transformation matrix used to generate the reduced subspace S ∈ ℜ^{n×r}, where r ≪ d. When a set of data points is projected onto a new subspace, the optimal subspace preserves the maximal relationship between the data points, i.e. the loss of information is minimized by accommodating the maximum variance present in the dataset. The optimization model for generating the optimal
subspace can be formulated as a variance maximization problem. The subspace is represented by r orthogonal vectors u_i ∈ ℜ^d, and the u_i's form a basis for the subspace.

Maximize_U trace(U^T XX^T U)
subject to U^T U = I    (3–1)
As feature extraction methods create a set of derived features which cannot be directly correlated to the actual features, these methods are not suitable for identifying the prominent features. A very common observation with high dimensional data is that most of the features are either irrelevant or collinear with the prominent features. High levels of correlation and noise in high dimensional datasets have reduced the applicability of traditional methods. Higher dimensionality also affects computational efficiency. These setbacks accentuate the application of dimensionality reduction techniques and feature selection on high dimensional datasets. This chapter focuses on a method that can perform feature selection along with dimensionality reduction using the l2,1 norm.
3.2 L2,1 Norm and Feature Selection
For any matrix A ∈ ℜ^{d×r}, its l2,1 norm is defined as

∥A∥_{2,1} = ∑_{i=1}^{d} √( ∑_{j=1}^{r} A_{i,j}^2 )    (3–2)
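The definition translates directly into code; a one-line NumPy version:

```python
import numpy as np

def l21_norm(A):
    # sum over rows of the euclidean norm of each row:
    # ||A||_{2,1} = sum_i sqrt(sum_j A_ij^2)
    return np.sqrt((A ** 2).sum(axis=1)).sum()

A = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [5.0, 12.0]])
# the three rows contribute 5, 0 and 13 respectively
```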
The l2,1 norm sums the euclidean norms of the rows of a matrix. Consider a projection matrix A that is used to project an input space X ∈ ℜ^{d×n} onto a reduced dimensional subspace S. Since each row of the projection matrix corresponds to a feature in the original input space, it is desirable to have some rows of the projection matrix go to zero. This can be seen from the perspective that the significance of irrelevant features is being nullified in the reduced subspace. This is the major motivation behind studying the
l2,1 norm. It is introduced into a dimensionality reduction problem expecting to induce row sparsity in the transformation matrix and thereby assist in feature selection.
Joint sparsity is induced in the orthonormal vectors spanning S by introducing the l2,1 norm into the optimization model 3–1, which is modified as below:

Maximize_{U ∈ ℜ^{d×r}} trace(U^T XX^T U) − C∥U∥_{2,1}
subject to U^T U = I    (3–3)

where C > 0 controls the intensity of the induced sparsity.
The solution to the optimization model 3–1 can be obtained by solving the following symmetric eigenvalue problem, where the optimal u_i^* are the eigenvectors corresponding to the r largest eigenvalues:

XX^T U = UD    (3–4)

where D = diag(λ_1, λ_2, ..., λ_r) holds the r largest eigenvalues and U ∈ ℜ^{d×r} holds the corresponding eigenvectors of XX^T.
In high dimensional datasets, computing the eigenvectors of XX^T ∈ ℜ^{d×d} is challenging due to the large dimension. The eigenvectors of XX^T can instead be calculated from the eigenvectors of X^T X. Let the set of eigenvectors of X^T X be represented by W ∈ ℜ^{n×r}; then U can be estimated from:

X^T U = W D^{1/2}    (3–5)
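The trick behind Equation 3–5 (recovering eigenvectors of the d×d matrix XX^T from the small n×n matrix X^T X) can be sketched as follows: if X^T X W = WD, then U = X W D^{−1/2} satisfies XX^T U = UD and X^T U = W D^{1/2}.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, r = 500, 30, 5
X = rng.normal(size=(d, n))

# eigenvectors of the small n x n Gram matrix X^T X
vals, vecs = np.linalg.eigh(X.T @ X)
idx = np.argsort(vals)[::-1][:r]       # keep the r largest eigenvalues
D = vals[idx]
W = vecs[:, idx]

# top-r eigenvectors of the d x d matrix X X^T, without ever forming it
U = (X @ W) / np.sqrt(D)
```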
So, the optimization problem 3–3 can be reformulated as

Minimize_{U ∈ ℜ^{d×r}} ∥U∥_{2,1}
subject to X^T U = W D^{1/2}    (3–6)
The model 3–6 is further relaxed to the following optimization problem:

Minimize_{U ∈ ℜ^{d×r}} ∥U∥_{2,1}
subject to ∥X^T U − Y∥_F ≤ δ    (3–7)

where Y = W D^{1/2} and δ can be used as a tuning parameter for constraint relaxation.
To solve the model 3–7, the iterative method introduced by Gu et al. [28] is used. The algorithm can be summarized as below.

Algorithm 3 Solving Optimization Model 3–7
1. Initialize G_0 = I, t = 0 and μ. (The analytical relation between μ and δ is not relevant in this algorithm; μ is used to fine tune the convergence criterion and thereby affects the constraint relaxation.)
2. Compute Y = W D^{1/2}, where W ∈ ℜ^{n×r} are the eigenvectors of X^T X.
3. U_{t+1} = G_t^{−1} X (X^T G_t^{−1} X + (1/(2μ)) I)^{−1} Y
4. Update G_{t+1} based on U_{t+1}: G is a diagonal matrix with g_{i,i} = 0 if u^i = 0, and g_{i,i} = 1/(2∥u^i∥_2) otherwise, where u^i is the i-th row of U_{t+1}.
5. t = t + 1; repeat steps 3 to 5 until convergence.
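A compact NumPy rendering of the iteration (function and parameter names are illustrative; G is stored through its inverse diagonal so that rows already at zero simply stay at zero):

```python
import numpy as np

def joint_sparse_U(X, Y, mu=0.1, n_iter=100, tol=1e-8):
    d, n = X.shape
    r = Y.shape[1]
    g_inv = np.ones(d)            # G_0 = I, kept as the diagonal of G^{-1}
    U = np.zeros((d, r))
    for _ in range(n_iter):
        GX = g_inv[:, None] * X   # G^{-1} X
        M = X.T @ GX + np.eye(n) / (2.0 * mu)
        U_new = GX @ np.linalg.solve(M, Y)      # step 3 of Algorithm 3
        # step 4: g_ii = 1/(2 ||u^i||_2), i.e. G^{-1} has entries 2 ||u^i||_2
        g_inv = 2.0 * np.linalg.norm(U_new, axis=1)
        if np.linalg.norm(U_new - U) < tol:     # simple convergence check
            U = U_new
            break
        U = U_new
    return U
```

A feature whose row of X is identically zero produces a zero row of U in the first iteration and then stays at zero, which is the row-sparsity mechanism the text describes.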
3.3 Results and Observations
The joint sparse (JS) feature selection method was tested on 4 high dimensional datasets. The transformation matrix U obtained using the above method was not only used to transform the data to a lower dimensional subspace but also to extract significant features. Each row in U can be directly correlated to a feature in the original data space. In this experiment 3 different reduced subspaces were considered, having 5, 10 and 15 dimensions, i.e. 3 different U ∈ ℜ^{d×r} with r varied over 5, 10 and 15. The amount of variance captured by the new subspace is a good measure for understanding the applicability of this method. A very common dimensionality reduction method, principal component analysis (PCA), was used to compare the results. The percentage of variance captured by the subspace is a good measure to analyse the
Table 3-1. Summarizes and compares the variance captured by PCA and the JS method (Joint Sparse feature selection method); r represents the dimension of the reduced space. Accuracy and standard deviations are also compared.

                              Variance         Accuracy %       Std Deviation
Dataset    Class Ratio  r    PCA   JS Method  PCA    JS Method  PCA    JS Method
Colon      40 : 22      5    0.71  0.70       64.2   71.3       13.55  13.38
                        10   0.84  0.84       72.9   73.3       13.21  14.20
                        15   0.90  0.89       68.3   68.3       13.41  13.94
DBWorld    35 : 29      5    0.23  0.16       89.6   88.5       7.60   8.46
                        10   0.37  0.27       87.7   86.9       7.23   7.93
                        15   0.48  0.36       88.8   86.9       6.35   7.53
Leukimia   27 : 11      5    0.45  0.40       100.0  90.7       0.00   6.99
                        10   0.62  0.58       98.6   98.6       4.40   4.40
                        15   0.73  0.69       99.3   99.3       3.19   3.19
Breast     44 : 33      5    0.37  0.31       66.6   68.4       10.39  11.38
                        10   0.50  0.44       63.8   62.2       11.03  13.97
                        15   0.59  0.52       60.3   60.0       13.34  13.36
performance of any dimensionality reduction method. The data variance in the reduced subspace is compared with the variance in the original input space to calculate the percentage of variance captured by the dimensionality reduction technique: the percentage of variance is the ratio of the variance in the subspace to the variance in the original input space. The percentage of variance captured by the principal components in PCA is compared with the variance captured using the joint sparsity method in Table 3-1. Classification accuracy using SVM on the reduced subspace (generated from both PCA and the JS method) is also compared for different values of the subspace dimension.
The results show that the joint sparsity method captures variance similar to that of PCA and also performs well in classification accuracy. An iterative algorithm is used to obtain the transformation matrix U, where in each iteration the l2,1 norm decreases, forcing rows corresponding to irrelevant features to smaller magnitudes but never reducing them exactly to zero. Hence, during the iterations, when the norm of a row goes below a particular value it is forced to zero; in this study we used a threshold of 10^{-8}. The algorithm terminates if ϵ_u ≤ 5 × 10^{-4} and ϵ_f ≤ 10^{-4}, or if the
number of iterations exceeds 100. ϵ_u for the k-th iteration is defined as ∥U_k − U_{k−1}∥_F / √(r × d), where r is the dimension of the reduced subspace and d is the number of features in the dataset. ϵ_f is defined as ∥Obj_k − Obj_{k−1}∥ / d, where Obj_k is the objective value (∥U∥_{2,1}) at the k-th iteration. The maximum number of iterations was fixed at 100 as, for most of the datasets, the algorithm converged to acceptable levels in fewer than 100 iterations.
As the algorithm does not expose efficient parameters to control the sparsity level, the performance of the feature selection process was tested using the top prominent features. In this study two ideas were used to select the prominent feature subset. In the first case, features that have magnitude larger than θ% of the largest magnitude feature form the active feature set. Table 3-2 shows the classification accuracies associated with the prominent feature subsets generated by this method and also the number of features in each subset. In the second approach, the top T features (the features with the largest magnitudes) form the prominent subset, and the corresponding classification accuracies are in Table 3-3. Both results tables compare the accuracy of the JS method with the widely accepted PCA method.
The results show that the classification accuracies from the JS method are not higher than those of the PCA method, but they are very much comparable. Among dimensionality reduction methods, PCA is considered to be one of the best and most efficient algorithms, and the PCA-SVM classification model is known to provide high classification accuracies. The JS method provides classification accuracies comparable with PCA, and it additionally provides a list of prominent features. Hence, along with dimensionality reduction, feature selection is also performed using the JS method.
Table 3-2. Summarizes accuracies of PCA and the JS method (Joint Sparse feature selection method); θ is the threshold for selecting features (θ = 30% selects features that have weights greater than 30% of the largest weight). The number of relevant features is also given for the corresponding θ values.

                Acc %   Accuracy (JS Method) %     NonZero Features
Dataset    r    PCA     θ 30%   θ 20%   θ 10%      θ 30%  θ 20%  θ 10%
Colon      5    64.16   76.25   69.58   75.41      70     83     100
           10   72.91   65.00   66.25   75.41      80     105    122
           15   68.33   64.16   66.25   59.16      85     105    133
DBWorld    5    89.61   88.07   90.76   88.84      29     52     112
           10   87.69   81.53   85.76   86.53      28     72     133
           15   88.84   86.92   85.76   85.38      95     150    211
Leukimia   5    100.00  90.00   91.42   91.42      42     62     86
           10   98.57   93.57   92.85   97.85      13     36     74
           15   99.28   87.85   99.28   99.28      19     50     88
Breast     5    66.56   69.68   67.18   65.31      26     51     93
           10   63.75   63.43   64.68   63.12      59     96     160
           15   60.31   59.37   60.93   59.68      68     122    196
Table 3-3. Summarizes accuracies of PCA and the JS method (Joint Sparse feature selection method); T is the number of top selected features (T = 10 selects the top 10 features that have the largest weights).

                Accuracy %   Accuracy % (JS Method, Top T Features)
Dataset    r    PCA          T 10    T 15    T 20    T 25    T 30
Colon      5    64.17        78.75   77.50   81.25   77.50   80.00
           10   72.92        65.00   69.58   71.25   72.08   76.25
           15   68.33        62.50   69.58   67.50   63.33   69.17
DBWorld    5    89.62        82.69   87.31   83.08   86.92   86.92
           10   87.69        77.31   88.08   84.23   82.69   85.38
           15   88.85        82.69   79.62   81.92   74.62   76.54
Leukimia   5    100.00       86.43   90.00   89.29   88.57   85.71
           10   98.57        91.43   86.43   87.86   95.71   93.57
           15   99.29        82.86   80.00   87.86   88.57   90.00
Breast     5    66.56        62.81   65.31   65.00   69.69   69.69
           10   63.75        60.63   68.44   69.06   67.50   72.50
           15   60.31        72.19   73.13   68.75   66.88   62.19
CHAPTER 4
FEATURE SELECTION IN UNLABELLED DATASETS
4.1 Introduction to Raman Spectra Signals
This chapter focuses on feature selection in unlabelled Raman spectroscopy data. A Raman spectrum consists of Raman intensities measured at various wavenumbers. The peaks in a Raman spectrum can be associated with various biological elements. This non-invasive method is vital in the study of cells and cellular processes. The amount of morphologic and chemical feature information in Raman spectra and its ease of measurement make Raman spectroscopy an attractive method to study cells. However, efficiently extracting information from Raman spectra is a challenge. The dataset used here is a cross sectional Raman spectroscopy scan of a cell embedded in a layer of trehalose. One of the motivations behind collecting a Raman spectra scan of the cell is to create an image of the cell from the scan. The target study includes the task of creating a cell image based on the Raman intensity spectra and also identifying the important peaks that help in distinguishing the various regions in the generated image. Clustering methods are used to generate the image, while sparse clustering is used to identify the relevant peaks in the Raman spectra.
4.2 Dataset
The dataset represents a Raman spectroscopy scan of a cell embedded in a trehalose layer. The scan is performed on the X-Z plane, i.e. it provides a cross sectional view of the cell. The expected cross sectional image is a layer of cell at the centre of the scan, with the trehalose layer above and below the cell. The two datasets considered in this study are 1) TRETScan6 (scan area: pixel dimension 120 × 21) and 2) C3AScan2 (125 × 50). Each dataset represents a cross sectional scan of a cell embedded in a trehalose medium, but with a different scan area and cell sample. Each pixel of the scan area represents a Raman spectrum consisting of 1024 features. The number of data points in TRETScan6 is 120 × 21 = 2520, and C3AScan2 has 125 × 50 = 6250.
Both datasets have the same 1024 features, i.e. Raman intensities are measured at 1024 different wavenumbers varying between 0 and 3800. Hence our TRETScan6 dataset consists of 2520 data points with 1024 features, and the C3AScan2 dataset consists of 6250 data points with 1024 features.
4.3 Data Preprocessing
4.3.1 Remove Unnecessary Features
Research on Raman spectra suggests that the intensity measures at very low wavenumbers do not give any information due to the presence of high noise levels. Hence the first 10 features of the data points are removed and the dataset is reduced to 1014 features.
4.3.2 Noise and Background Subtraction
In order to extract maximum information on the Raman scattering, both noise and background fluorescence must be removed. Subtraction of noise is performed using the Savitzky-Golay smoothing filter, which is found to be very effective for Raman spectra. The most promising type of background subtraction algorithms use polynomial fits, because they can approximate the fluorescence profile while excluding the Raman peaks. However, there is no consensus on the best polynomial fit order for fluorescence background subtraction [3]. In this study we applied the subback function of MATLAB; this function subtracts the background of a spectrum by fitting a polynomial through the data points in an iterative way.
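The MATLAB subback routine itself is not reproduced here, but both preprocessing steps can be sketched in NumPy: a Savitzky-Golay filter built from a local polynomial least squares fit, and an iterative polynomial clipping scheme in the same spirit as subback (the window, order and iteration counts are illustrative choices, not the values used in the study):

```python
import numpy as np

def savgol_smooth(y, window=11, polyorder=3):
    # Savitzky-Golay: fit a polynomial over a sliding window by least
    # squares; row 0 of the pseudo-inverse evaluates the fit at the centre.
    half = window // 2
    A = np.vander(np.arange(-half, half + 1), polyorder + 1, increasing=True)
    c = np.linalg.pinv(A)[0]
    ypad = np.pad(y, half, mode="edge")
    return np.convolve(ypad, c[::-1], mode="valid")

def subtract_background(y, order=5, n_iter=100):
    # iterative polynomial fit: clip the spectrum to the fit each round, so
    # the polynomial tracks the fluorescence baseline rather than the peaks
    x = np.linspace(-1.0, 1.0, len(y))
    baseline = y.copy()
    for _ in range(n_iter):
        fit = np.polyval(np.polyfit(x, baseline, order), x)
        baseline = np.minimum(baseline, fit)
    return y - fit
```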
4.3.3 Peak Selection
Biologically relevant wavenumbers in Raman spectra are associated with peaks of Raman intensity; hence only those wavenumbers corresponding to peaks are relevant for any analysis. Therefore the peaks associated with each Raman spectrum are selected and their associated wavenumbers are shortlisted as potential features. Due to the resolution and noise in the process of recording Raman spectra, the peaks of different Raman spectra can be shifted by a few wavenumbers. This is taken care of by cohesion of
peaks in different spectra to one prominent wavenumber. After peak selection and peak cohesion, the number of relevant features is further reduced.
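A minimal sketch of both steps, peak picking on a single spectrum and cohesion of nearby peak positions pooled across spectra (the 3-bin tolerance is an illustrative choice, not the value used in the study):

```python
import numpy as np

def find_peaks_simple(y, min_height=0.0):
    # a point is a peak if it exceeds both neighbours and a height threshold
    inner = y[1:-1]
    is_peak = (inner > y[:-2]) & (inner > y[2:]) & (inner > min_height)
    return np.flatnonzero(is_peak) + 1

def cohere_peaks(positions, tol=3):
    # merge peak positions (pooled over many spectra) that lie within
    # `tol` wavenumber bins of each other, keeping one prominent position
    groups = []
    for p in sorted(positions):
        if groups and p - groups[-1][-1] <= tol:
            groups[-1].append(p)
        else:
            groups.append([p])
    return [int(np.mean(g)) for g in groups]
```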
Once the raw dataset undergoes preprocessing it is taken for further analysis. The preprocessed dataset is expected to contain only relevant features, with the noise and background subtracted.
4.4 Clustering
Clustering consists of partitioning the data points based on the differences between them. The most common dissimilarity measure is the euclidean distance between the data points. Hence clustering tends to group points lying close to each other; it is the process of grouping similar elements together. Clustering is an unsupervised classification model where the class of the training dataset is unknown; in many cases even the number of classes present is unknown. In this study two clustering methods are analysed and applied to the dataset.
4.5 K-means Clustering
The K-means clustering algorithm is one of the most popular clustering algorithms. To perform K-means clustering the user needs to specify the number of clusters present in the training dataset. The K-means algorithm can be explained as a repetition of two steps. The main idea is to define k centroids, one for each cluster. The next step is to take each point belonging to the dataset and associate it with the nearest centroid. When all data points are assigned to a centroid or mean, the first step is completed and an initial clustering is formed. At this point we re-calculate k new centroids based on the clusters resulting from the previous step. After we have these k new centroids, we reassign the data points to the nearest of the new centroids, creating a new set of clusters. The process is repeated until no more data points are reassigned from their current clusters, or in other words the centroids do not move any more [20]. The performance of clustering heavily depends on the number of clusters present. So assigning the
correct number of clusters is critical in obtaining a good clustering and thereby efficient classification.
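The two-step loop described above fits in a few lines of NumPy (a generic sketch, not the exact implementation used for the scans):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 0: pick k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # step 1: assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 2: move each centroid to the mean of its cluster
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):   # centroids stopped moving
            break
        centroids = new
    return labels, centroids
```

For the scans, X would hold one preprocessed spectrum per pixel, and k = 2 separates the trehalose and cellular regions.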
The first task in this study was identifying the extra cellular (trehalose) region and the cellular region, so initially clustering was performed with two means, i.e. K = 2. K-means clustering was performed on both datasets and Figures 4-1 and 4-2 show the clusterings. The cell is suspended in a medium of trehalose, and hence in the cross sectional view a cellular region is expected at the centre while the upper and lower layers contain trehalose. So the clustering algorithm should generate a layer at the centre while the upper and lower layers consist of the same cluster. The K-means cluster output shown in Figures 4-1 and 4-2 provides exactly this picture and hence validates the applicability of clustering methods for distinguishing the cellular and extra cellular regions in a Raman spectra scan.
Figure 4-1. K-Means Clustering, TRETScan6 scan of dimension 120 X 21
In the TRETScan6 Figure 4-1, the centre red coloured strip represents the cluster associated with the cellular region and the blue coloured region the extra cellular region.
A similar pattern is also expected from the C3AScan2 scan, and the cluster obtained in Figure 4-2 also validates our proposition.
Figure 4-2. K-Means Clustering,C3AScan2 scan of dimension 125 X 50
Once the image of the extra cellular and cellular regions is generated, the region associated with the nucleus is located within the cellular region. This is a first step towards distinguishing the various elements in the cellular region. In this study only the cellular region of the C3AScan2 scan is further clustered to locate the nucleus, and the light green coloured region shown in Figure 4-3 is expected to represent the nucleus.
4.6 Spectral Clustering
Spectral clustering is also one of the popular clustering algorithms due to its ease of implementation and the availability of efficient methods to solve it. In many cases it outperforms traditional methods like the K-means algorithm. To perform spectral clustering, the dataset is represented as a graph of n nodes, where n is the number of data points, and each arc weight represents some dissimilarity measure between the nodes connected by the arc. In this study we take the euclidean distance between two data points as the dissimilarity measure for the arc connecting
Figure 4-3. K-means C3AScan2 scan with nucleus marked
them. Let W ∈ ℜ^{n×n} represent the weight matrix holding the dissimilarity measures for the arcs connecting all the data points, where w_{i,j} is the distance between points i and j. D ∈ ℜ^{n×n} is a diagonal matrix with diagonal entries d_i = ∑_{j=1}^{n} w_{i,j}. The Laplacian matrix L is given by L = D − W.
The network generated by the n data points is a connected graph, as there exist arcs between all the data points, and hence the smallest eigenvalue of the Laplacian matrix is 0. The multiplicity of the 0 eigenvalue of a Laplacian represents the connectivity of the graph. The eigenvector corresponding to the first non-zero eigenvalue represents the best partition of the graph into two subgraphs with limited interaction between them, i.e. it assists in clustering the nodes into two clusters [31]. As the graph is connected there is only one zero eigenvalue, and hence the eigenvector corresponding to the second smallest eigenvalue represents the partitioning of the graph into two subgraphs. This eigenvector is an n dimensional vector and each entry in the vector can be associated with a data point. The clustering is generated by arranging the data points based on the entries of this eigenvector.
Observations from Spectral Clustering
Spectral clustering was performed on both datasets, and Figures 4-4 and 4-5
show the clustered images. As with the K-means clustered images, the
spectral clustering algorithm also reproduces the scan image. This further validates the
applicability of clustering methods for distinguishing the cellular and extracellular
regions in a Raman spectral scan.
Figure 4-4. Spectral clustering, TRETScan6 scan of dimension 120 × 21
In the TRETScan6 scan (Figure 4-4), the red strip represents the cluster
associated with the cellular region, and the blue region the extracellular region. A similar
pattern is expected from the C3AScan2 data (Figure 4-5), and the cluster obtained likewise
validates our proposition about the expected scan image. Both clustering methods produce
similar clusters, so this classification output can be used as a reference for
further analysis.
4.7 Sparse Clustering for Feature Selection
Technological advances in the last decade have introduced new and efficient
tools for data collection, especially in biomedicine. This has paved the way for
a new class of large datasets of very high dimension, i.e., with a large number of input
Figure 4-5. Spectral clustering, C3AScan2 scan of dimension 125 × 50
features compared to the number of observations. Traditional data mining techniques
have produced appreciable results on standard datasets, but data represented
as very high dimensional vectors poses great challenges for standard algorithms.
Feature extraction for high dimensional datasets is very important, as most features
in high dimensional vectors are usually non-informative or noisy and can hurt
generalization performance. There is great interest in many machine learning
applications in inducing sparsity with respect to the input features of high dimensional
datasets. A sparse representation can provide significant information on the relevant features and
thereby assist in feature selection. Further, classification models with a sparse data matrix
can simplify the decision rule for faster prediction in large-scale problems. Finally, in many
data analysis applications, a small set of features is desirable for interpreting the results.
In this study sparse clustering is performed to extract the relevant features
from the dataset. Feature selection helps identify the biologically significant
wavenumbers that are critical in distinguishing between the extracellular and cellular
regions. The sparse K-means method suggested by Witten and Tibshirani [5] is used for sparse
clustering. The sparse K-means clustering optimization problem can be formulated as
follows:

maximize over C_1, ..., C_K and ω:
  Σ_{j=1}^{p} ω_j [ (1/n) Σ_{i=1}^{n} Σ_{i'=1}^{n} d_{i,i',j} − Σ_{k=1}^{K} (1/n_k) Σ_{i,i' ∈ C_k} d_{i,i',j} ]
subject to ∥ω∥₂² ≤ 1, ∥ω∥₁ ≤ s, ω_j ≥ 0 ∀j
(4–1)
where C_1, C_2, ..., C_K represent the K clusters in the data space,
ω_j is the weight associated with feature j,
p is the number of features,
s is the tuning parameter controlling sparsity,
d_{i,i',j} is the dissimilarity between points i and i' along feature j,
n_k is the number of points in cluster C_k, and
K is the number of clusters.
The above optimization problem is solved using the iterative procedure proposed by
Witten and Tibshirani [5].
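That alternating procedure can be sketched as follows for squared Euclidean dissimilarities, where the bracketed term in (4–1) reduces to the between-cluster sum of squares of each feature. This is our own condensed sketch, not the authors' code: it assumes scikit-learn is available for the inner K-means step, and the names sparse_kmeans and update_weights are ours.

```python
import numpy as np
from sklearn.cluster import KMeans

def soft_threshold(a, delta):
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)

def update_weights(a, s):
    """w = S(a, delta) / ||S(a, delta)||_2, with delta >= 0 found by
    binary search so that ||w||_1 <= s (the weight-update step in [5])."""
    lo, hi = 0.0, np.abs(a).max()
    for _ in range(50):
        delta = (lo + hi) / 2.0
        w = soft_threshold(a, delta)
        w = w / np.linalg.norm(w)
        if np.abs(w).sum() < s:
            hi = delta        # thresholded too hard; relax
        else:
            lo = delta
    return w

def sparse_kmeans(X, K, s, n_iter=10):
    n, p = X.shape
    w = np.ones(p) / np.sqrt(p)               # start from equal weights
    for _ in range(n_iter):
        # Step 1: with w fixed, cluster on features scaled by sqrt(w_j).
        labels = KMeans(n_clusters=K, n_init=10,
                        random_state=0).fit(X * np.sqrt(w)).labels_
        # Step 2: with clusters fixed, a_j = between-cluster sum of
        # squares of feature j (total minus within-cluster).
        tss = ((X - X.mean(axis=0)) ** 2).sum(axis=0)
        wcss = sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum(axis=0)
                   for k in range(K))
        w = update_weights(tss - wcss, s)
    return labels, w
```

On data where only a few features separate the clusters, the returned weights concentrate on those features, which is exactly the feature-selection behavior exploited in this chapter.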
The weight ω_j associated with each feature represents the significance of that
feature in clustering, i.e., in differentiating between the extracellular and cellular regions.
Hence features with greater weights are crucial in distinguishing cellular from
extracellular regions, and they also correspond to the biologically significant wavenumbers
associated with cellular material. Table 4-1 lists the relevant features
(wavenumbers) and their corresponding weights. Wavenumbers are arranged in
decreasing order of their weights, and weights less than 0.01 are discarded.
4.8 Observations
This study has shown that clustering methods can be used effectively to generate
an image of a Raman spectroscopy scan. For both datasets, a well-separated
image could be generated distinguishing the cellular and extracellular regions. Both
K-means and spectral clustering provided similar well-separated images. These
images can be used to evaluate the clustering performed by the sparse
Table 4-1. Weights ω_j and corresponding features (wavenumbers)

Weight   Wavenumber     Weight   Wavenumber
0.437    1128           0.073    868
0.416    1359           0.058    1318
0.328    1343           0.057    1273
0.326    1145           0.052    1444
0.312    1380           0.049    1035
0.268    1111           0.048    933
0.248    538            0.041    1253
0.208    1086           0.040    1428
0.197    555            0.035    402
0.152    1400           0.033    510
0.126    1465           0.030    1302
0.114    1061           0.028    907
0.096    429            0.023    1232
0.090    1161           0.019    724
0.079    842            0.015    1215
0.079    456            0.013    587
Figure 4-6. Image generated from the top 15 features from sparse clustering, C3AScan2 scan of dimension 125 × 50
clustering method. The relevant wavenumbers short-listed by the sparse clustering method
(Table 4-1) are used for further learning.

These short-listed wavenumbers can be associated with biologically significant
Raman peaks, thereby validating the feature selection process. Further, the cell
scan image generated from the top 15 short-listed features is very similar to
the reference images generated by the K-means and spectral methods.
This further validates the selected features and the feature selection process. Figure
4-6 shows the image generated from the top 15 features selected by sparse clustering;
it can be compared with Figures 4-3 and 4-5 generated by K-means
and spectral clustering. In this study only a preliminary analysis
distinguishing the cellular and extracellular regions is performed, and it shows that
sparse clustering is effective at both clustering and feature extraction. There is great
potential for extending this study further, even to differentiating the various regions inside
a cell.
CHAPTER 5
DISCUSSION AND CONCLUSION
Traditional statistical methods fail when handling high dimensional datasets;
the introductory section discusses the probable reasons for this behavior. Hence, high
dimensional datasets are preferably studied in a lower dimensional space, which
can be achieved through a feature selection process. The dimensionality
of a dataset can be reduced by picking only a few of the best features. Feature
selection creates a new dataset with only a subset of the original features while capturing
the maximum information from the original dataset. This study focused on various aspects
of extracting this subset of features.
The first section focused on introducing a least squares formulation for proximal
support vector machines. The motivation behind this formulation is rooted in the
standard method of inducing sparsity into a classification model using an l1 norm in a
least squares formulation. It is very common to introduce an l1 norm into a least squares
classification model; this induces sparsity in the decision variables and
thus helps identify the relevant variables, which normally correspond to the
features of a dataset. Proximal support vector machines are very efficient classification
algorithms and handle complex datasets well. This was a major motivation for
studying proximal planes and investigating ways to reformulate them as a least squares
problem. The classification accuracy of the least squares proximal support vector machine
is similar to that of the original eigenvalue formulation, so this study yields
a new least squares formulation for generating the proximal planes. This model could
be further extended with an l1 norm to induce sparsity for feature selection.
In addition, the algorithms developed to solve the least squares formulation have closed-form
solutions for the proximal planes, which further improves computational efficiency.
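To make the l1-sparsity mechanism concrete, here is a generic l1-regularized least squares solver via proximal gradient descent (ISTA). This is our own illustration of the general idea the paragraph refers to, not the thesis's proximal SVM formulation; the function name lasso_ista is ours.

```python
import numpy as np

def lasso_ista(A, b, lam, n_iter=500):
    """Minimize 0.5*||Ax - b||^2 + lam*||x||_1 by proximal gradient
    descent (ISTA). The l1 term drives coefficients of irrelevant
    features exactly to zero, which is what enables feature selection."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1/L for the smooth quadratic part
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        g = x - step * A.T @ (A @ x - b)     # gradient step on the quadratic
        x = np.sign(g) * np.maximum(np.abs(g) - step * lam, 0.0)  # soft-threshold
    return x
```

On a problem where only a few columns of A actually generate b, the recovered x is sparse, with nonzeros only on the relevant features.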
The second section discussed a new norm that can induce joint sparsity in a
dimensionality reduction problem. An l2,1 norm is introduced into a dimensionality
reduction problem, and the optimization model is solved iteratively. The direct relation
between the transformation matrix and the input features is used to discard irrelevant features.
This helps create a subset of prominent features. The approach not only reduces
the dimensionality of the dataset but also extracts features. The classification
accuracies obtained with the reduced feature set were comparable
to those of the well-known PCA-SVM classification method. This not only validates the
feature selection process but also reaffirms the benefits of eliminating irrelevant features.
The reduced dimensional space provided better classification accuracy, better feature
interpretability, and reduced computational complexity.
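To make the penalty concrete: the l2,1 norm of a matrix sums the Euclidean norms of its rows, so penalizing it pushes entire rows to zero at once. A minimal sketch (the helper name l21_norm is ours):

```python
import numpy as np

def l21_norm(A):
    """||A||_{2,1}: sum of the Euclidean norms of the rows of A.
    Penalizing it zeroes out whole rows jointly, so every weight a
    feature contributes to the transformation vanishes together."""
    return np.sqrt((np.asarray(A) ** 2).sum(axis=1)).sum()

# A zero row contributes nothing; the [3, 4] row contributes 5.
print(l21_norm([[3.0, 4.0], [0.0, 0.0]]))  # 5.0
```

When the rows of the transformation matrix correspond to input features, a zero row means the feature can be discarded, which is the mechanism described above.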
In the last section, the sparse K-means method is applied to Raman spectroscopy data.
This study concentrated on the applicability of sparse clustering methods
for generating the image of a Raman spectroscopy scan of a cell. Initially, clustering was
performed using standard methods, viz. K-means and spectral clustering.
Both algorithms generated similar clusters and produced the images expected
from the scan set-up. These images were used as a reference to compare the
clusters generated by the sparse K-means method. Testing showed that sparse K-means
also produced similar clusters while short-listing a set of relevant features. This helped
remove irrelevant features and create a subset of prominent features. Further,
the wavenumbers corresponding to the prominent features could be related to biologically
significant wavenumbers, which justified the feature selection process and the applicability
of the sparse K-means method to Raman spectroscopy data.
To summarize, this study targeted understanding the importance of the feature selection
process and various ways of performing feature selection. The majority of the research
focused on labeled datasets, where sparsity was induced in supervised classification
models to assist feature selection. Lastly, feature selection in unlabeled datasets was also
studied using sparse clustering methods.
REFERENCES
[1] Balasundaram S, Kapil N. Application of Lagrangian twin support vector machines for classification. Second International Conference on Machine Learning and Computing (2010), pp. 193397.
[2] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. The Annals of Statistics (2004), Volume 32, Number 2, 407-499.
[3] Cao, Alex; Pandya, Abhilash K.; Serhatkulu, Gulay K.; Weber, Rachel E.; Dai, Houbei; Thakur, Jagdish S.; Naik, Vaman M.; Naik, Ratna; Auner, Gregory W.; Rabah, Raja; Freeman, D. Carl (2007). A robust method for automated background subtraction of tissue fluorescence. Journal of Raman Spectroscopy 38(9): 1199-1205.
[4] Cortes C, Vapnik V. Support-vector networks. Machine Learning (1995) 20:273-297.
[5] Daniela M. Witten and Robert Tibshirani (2010). A framework for feature selection in clustering. J Am Stat Assoc 105(490): 713-726.
[6] T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics (2000), vol. 13, pp. 1-50.
[7] Fan J, Fan Y (2008). High-dimensional classification using features annealed independence rules. Ann. Statist. 36:2605-2637.
[8] J Fan, Y Feng, X Tong (2012). A road to classification in high dimensional space: the regularized optimal affine discriminant. Journal of the Royal Statistical Society.
[9] J Fan, J Lv (2010). A selective overview of variable selection in high dimensional feature space. Statistica Sinica.
[10] G. Fung and O.L. Mangasarian. Proximal support vector machine classifiers. Proc. Knowledge Discovery and Data Mining, F. Provost and R. Srikant, eds. (2001), pp. 77-86.
[11] M Gallagher, T Downs (1997). Visualization of learning in neural networks using principal component analysis. International Conference on Computational
[12] Ghorai S, Mukherjee A, Dutta PK. Nonparallel plane proximal classifier. Signal Processing (2009) 89:510-522.
[13] Guo-Xun Yuan, Kai-Wei Chang, Cho-Jui Hsieh, Chih-Jen Lin. A comparison of optimization methods and software for large-scale L1-regularized linear classification. Journal of Machine Learning Research (2010) 11:3183-3234.
[14] J Hamm, DD Lee (2008). Grassmann discriminant analysis: a unifying view on subspace-based learning. 25th International Conference on Machine Learning.
[15] Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics (2006), Volume 15, Number 2, pages 265-286.
[16] Jayadeva, Khemchandani R, Chandra S. Twin support vector machine for pattern classification (2007). IEEE Transactions on Pattern Analysis and Machine Intelligence 29(5):905-910.
[17] B Jiang, YH Dai (2013). A framework of constraint preserving update schemes for optimization on Stiefel manifold. arXiv:1301.0172.
[18] Jun Liu, Shuiwang Ji, Jieping Ye (2012). Multi-task feature learning via efficient l2,1-norm minimization. Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence.
[19] M Kolar, H Liu (2013). Feature selection in high-dimensional classification. Proceedings of the 30th International Conference on Machine Learning.
[20] Larrañaga P, Calvo B, Santana R, Bielza C, Galdiano J, Inza I, Lozano JA, Armañanzas R, Santafé G, Pérez A, Robles V (2006). Machine learning in bioinformatics. Brief Bioinform Vol 7 No. 1:86-112.
[21] K Lee, Y Bresler, M Junge (2012). Subspace methods for joint sparse recovery. IEEE Transactions on Information Theory.
[22] Liang Sun, Shuiwang Ji, Jieping Ye. A least squares formulation for a class of generalized eigenvalue problems in machine learning. Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009.
[23] Mangasarian OL, Wild EW. Multisurface proximal support vector classification via generalized eigenvalues. IEEE Transactions on Pattern Analysis and Machine Intelligence (2006) 28(1):69-74.
[24] O.L. Mangasarian. Least norm solution of non-monotone complementarity problems. Functional Analysis, Optimization and Mathematical Economics (1990), pp. 217-221. New York: Oxford Univ. Press.
[25] O.L. Mangasarian and R.R. Meyer. Nonlinear perturbation of linear programs. SIAM J. Control and Optimization (1979), vol. 17, no. 6, pp. 745-752.
[26] D Niu, JG Dy, MI Jordan (2011). Dimensionality reduction for spectral clustering. 14th International Conference on Artificial
[27] Osborne, M. R., Presnell, B., and Turlach, B. A. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis (2000), 20, 389-403.
[28] Quanquan Gu, Zhenhui Li and Jiawei Han (2011). Joint feature selection and subspace learning. Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence.
[29] Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society (1996), Series B, 58, 267-288.
[30] A.N. Tikhonov and V.Y. Arsenin. Solutions of Ill-Posed Problems. New York: John Wiley & Sons (1977).
[31] Ulrike von Luxburg (2007). A tutorial on spectral clustering. Statistics and Computing 17(4).
[32] Vapnik V. The Nature of Statistical Learning Theory (1998), 2nd edn. Springer, New York.
[33] Yunhai Xiao, Soon-Yi Wu, Bing-Sheng He (2012). A proximal alternating direction method for the L2,1-norm least squares problem in multi-task feature learning. Journal of Industrial and Management Optimization.
BIOGRAPHICAL SKETCH
Paul Francis Thottakkara was born in 1985 in Kerala, India. He graduated with
a bachelor's degree in mechanical engineering from Mahatma Gandhi University
in Kerala, India. After his bachelor's he worked at Sanmar Engineering Corporation in
Chennai, India for two years and then went to the University of Florida to pursue a master's
degree in industrial engineering. During his master's program at the University of Florida,
Paul was an active member of the UF INFORMS Chapter. It was during the master's
program that he developed an interest in data mining techniques and optimization
methods. He continued his studies at the University of Florida toward the Engineer degree
with a specialization in data mining and optimization. He will graduate with the Engineer
degree from the University of Florida in August 2013. After graduation he plans to join
industry as a data analyst to utilize his skills and research experience.
http://plaza.ufl.edu/paulthottakkara/Paul.html