Research Article
Sparsity Optimization Method for Multivariate Feature
Screening for Gene Expression Analysis
QIANG CHENG1 and JIE CHENG2
ABSTRACT
Constructing features from high-dimensional gene expression data is a critically important task for monitoring and predicting patients' diseases, or for knowledge discovery in computational molecular biology. The features need to capture the essential characteristics of the data to be maximally distinguishable. Moreover, the essential features usually lie in small or extremely low-dimensional subspaces, and it is crucial to find them for knowledge discovery and pattern classification. We present a computational method for extracting small or even extremely low-dimensional subspaces for multivariate feature screening and gene expression analysis using sparse optimization techniques. After we transform the feature screening problem into a convex optimization problem, we develop an efficient primal-dual interior-point method expressly for solving large-scale problems. The effectiveness of our method is confirmed by our experimental results. The computer programs will be publicly available.
Key words: feature screening, gene expression, high-dimensional classification, large-scale optimization, low-dimensional subspaces, sparsity optimization.
1. INTRODUCTION
Diseased and normal cells have differential expressions across diverse genes. Microarray analysis
aims to effectively identify gene expression patterns across various types of tissues, for example, at
different disease development stages, with different patient outcomes, or under different environmental
conditions. It is particularly useful for cancer diagnosis and prognosis in cancer genomics. Typically, gene
expression data have high-dimensional features but a small sample size, where the number of genes is frequently in the thousands or more while the number of samples is typically less than one hundred. These high-
dimensional data pose many intrinsic challenges for pattern recognition problems such as the curse of
dimensionality. Moreover, the data exhibit the following prominent phenomena: Many variables are noisy or
irrelevant, and only a few of the variables really contribute to the class distinctiveness. These phenomena significantly challenge traditional methods for data mining, pattern recognition, and knowledge discovery,
which have limited successes when directly applied to gene expression data. The irrelevant variables do not
contribute to the classification accuracy, but instead have adverse effects on the performance. Extracting
(small) subspaces of essential features becomes critically important for reducing the dimension, preventing
noise from accumulating, and exploiting existing classification techniques.
1Computer Science Department and 2Electrical and Computer Engineering Department, Southern Illinois University, Carbondale, Illinois.
JOURNAL OF COMPUTATIONAL BIOLOGY
Volume 16, Number 9, 2009
© Mary Ann Liebert, Inc.
Pp. 1241–1252
DOI: 10.1089/cmb.2008.0034
In the literature, there have been many proposed methods which emphasize the importance of dimension
reduction and feature selection (Speed, 2003; Zhang, 2006). These include projection methods such as
principal component analysis (PCA), partial least square (PLS), and sliced inverse regression (SIL). PCA
has been applied to gene expression data classification by Ghosh (2002) and Bair et al. (2006);
PLS by Nguyen and Rocke (2002), Huang and Pan (2003); and SIL by Chiaromonte and Martinelli (2002),
Antoniadis et al. (2003), and Bura and Pfeiffer (2003). PCA decomposes a sample variance matrix to find
the largest eigenvalues and their corresponding major eigenvectors. It projects the data to the directions of
major eigenvectors (Jolliffe, 1986). The advantages of PCA are that its reconstruction capability is good and that noise, missing components, or outliers have less adverse effect on classification. The disadvantages
of PCA, however, are that it does not necessarily capture the most discriminative power of the gene data
(Martinez and Kak, 2001) and the projected directions often lack meaningful interpretability. PLS finds
fundamental relations between two matrices and uses a linear model to describe predicted variables by
using other observable variables. It has been used for discrimination in chemometrics (Barker and Rayens,
2003) and dimensionality reduction in bioinformatics (Nguyen and Rocke, 2002). SIL finds a smooth
regression function that operates on a set of projections. The advantages of PLS and SIL are that they
attempt to find projection directions resulting in small classification errors. Usually, they put more weight
on those features that have large classification power. However, the disadvantage is that their projection
directions may produce large sets of features for prediction. The most informative genes are sparse; for
example, for cancer versus non-cancer diagnosis, usually 50 genes are sufficient (Golub et al., 1999). Due
to the existence of noise components, using redundant variables increases the misclassification rate when
there is only a fraction of variables accounting for most of the information of the data.
In addition to the above methods, variable selection techniques have also been proposed. Tibshirani et al.
(2002) propose a nearest shrunken centroid (NSC) method which basically uses a simple component-wise
two-sample t-test to identify genes for tumor classification. Bickel and Levina (2004) use an independence rule (IR) to overcome the singularity problem of high-dimensional data that arises in Fisher linear discriminant
analysis (LDA). LDA maximizes between-class variance while minimizing within-class variance, leading
to the highest discriminative power when the data are well conditioned. IR only makes use of the diagonal elements of the sample covariance matrix, and thus it is also referred to as diagonal LDA (DLDA). Fan and Fan
(2007) propose a feature annealed independence rule (FAIR). Based on Bickel and Levina’s IR, they
propose to further single out the most important genes to reduce noise effect. The advantage of these
methods is that they directly consider discriminative power of the features, and thus they offer potentially
better classification performance than the projection methods. The disadvantage is that they take no
consideration of interactions or correlations between features during feature selection. Expressive power of
the data is limited as the essential inter-variable relationship or structure of the data is ignored. When the
data contain many noise components or a number of outliers, the performance of these methods suffers.
With only discrimination and no expressive power, the features drawn are sensitive to noise or outliers,
which leads to a performance degradation. This phenomenon has been observed by Fidler et al. (2006).
Random forest (Breiman, 2001), which uses an ensemble of learning trees to obtain boosted performance, has also been applied to tumor classification (Diaz-Uriarte and Alvarez de Andres, 2006). It is a committee machine, that is, a combination of classifiers, whereas we aim at building a feature selection method for a single classifier, which can be combined with others for higher performance when needed.
We propose in this article a different approach to feature selection for gene expression data than the
abovementioned methods. We construct a small subset of the most salient features based on which the
classification is as accurate as possible. The selected features must be discriminative so that different
classes can be maximally distinguished. The features must also be expressive. That is, they must represent
the intrinsic inter-feature structure of the data so that the data can be essentially reconstructed with even
noise components or outliers. Moreover, the sparseness of features offers us intrinsic structure of the high-
dimensional data. Noise components can be screened out and thus the method enables robustness to noise
and efficient deployment of algorithms. To these ends, we identify proper measures for desired properties
and incorporate them into objective functions or constraints. Especially, the maximum number of zero
elements or sparsity leads to our main objective function. We formulate a sparsity optimization approach to
integrating the identified measures for maximized classification performance. We solve the optimization
problem efficiently using modern convex optimization techniques.
In this paper, scalars and vectors are denoted by lowercase letters, and they should be clearly distinguished from the context. Matrices are denoted by capital letters. We use $\mathrm{diag}(x)$ to denote a $k \times k$ diagonal matrix whose diagonal vector is $x \in \mathbb{R}^k$. Script letters denote the spaces. The $\ell_q$-norm of a vector $x$ is denoted by $\|x\|_q$, $0 \le q \le \infty$. We use the notation $(\cdot)_i$ to denote the $i$th element of a vector.
The rest of the paper is organized as follows. Section 2 discusses our way to formulate the feature
selection problem. The formulated problem is solved using a convex optimization technique in Section 3.
The algorithm is presented in Section 4. The effectiveness of the proposed technique is experimentally
demonstrated in Section 5, and the conclusion is drawn in Section 6.
2. FORMULATION OF GENE SELECTION AND SPARSE OPTIMIZATION APPROACH
For gene expression data based classification, we have input observations $x_k \in \mathbb{R}^p$ and response variables $y_k \in \mathbb{R}$, $k = 1, \ldots, n$, where $n$ is the number of observations, $p$ is the dimension of the features, and typically $p \gg n$ for gene expression data. Here $x_k$ is a vector of genes for the $k$th observation. Denote the data matrix of size $n \times p$ by
$$X = [x_1 \mid x_2 \mid \cdots \mid x_n]^T, \quad (1)$$
and the responses by $y = [y_1, \ldots, y_n]^T$. Usually, $p$ is in the range of 5000 to 30,000 while $n$ is on the order of
tens. The number of genes must be reduced to apply existing classification algorithms like LDA or SVM
(Vapnik, 1998). As noted in the previous section, the projection methods for dimensionality reduction of
gene expression data are susceptible to noise accumulation. A small subset of features must be selected to
overcome noise accumulation. The existing methods for gene selection, such as IR, NSC, and FAIR, are mainly based on component-wise comparisons. In other words, they are essentially univariate selection methods. We must take into account interactions of the features. That is, we aim at multivariate selection, with which a combination of genes is able to offer the maximized classification performance. To this end, a brute-force method would need to examine the discriminative capability of each subset of the genes; however, this requires considering a combinatorial number of subsets and has impractical complexity.
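To make the combinatorial burden explicit, an exhaustive search over all subsets of at most $k$ genes would need on the order of
$$\sum_{j=1}^{k} \binom{p}{j} \;\ge\; \binom{p}{k} \;\approx\; \frac{(p e / k)^k}{\sqrt{2\pi k}}$$
evaluations of the discriminative capability; as an illustrative figure (our own, not from the paper), with $p = 5000$ genes and subsets of size $k = 50$, $\binom{5000}{50}$ already exceeds $10^{100}$.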
For multivariate feature selection from gene expression data, we propose to use a linear combination of observation variables to approximate the responses. A parameter $\beta \in \mathbb{R}^p$ with a large number of zero components is used to represent selecting and combining the features. That is, we use a linear model
$$y = X\beta + z, \quad (2)$$
where $z \in \mathbb{R}^n$ represents approximation errors. The feature selector $\beta$ needs to have a small number of nonzeros, which pick out the most important genes. A large number of zero components of $\beta$ screen out the noise or redundant components so that our method can provide robustness to noise accumulation. The interaction between genes is represented by the linear combination of genes whose combining coefficients are the nonzero elements of $\beta$. The feature selection process now boils down to estimating $\beta$ from the linear model.
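To make the role of the sparse selector concrete, the following minimal sketch with synthetic data (all sizes, indices, and values are illustrative, not from the paper's experiments) shows how a $\beta$ with only a few nonzero entries both selects and combines genes:

import numpy as np

rng = np.random.default_rng(0)
n, p = 60, 5000                      # few samples, many genes (p >> n)
X = rng.standard_normal((n, p))      # synthetic expression matrix, rows = samples

beta = np.zeros(p)                   # sparse feature selector
beta[[10, 250, 4000]] = [0.8, -0.5, 0.3]       # only three genes form the response

y = X @ beta + 0.05 * rng.standard_normal(n)   # linear model y = X beta + z

selected = np.flatnonzero(beta)      # nonzero entries of beta pick out the genes
print(selected)                      # -> [  10  250 4000]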
The feature selector $\beta$ must have high expressive power, meaning that the essential structure of the data can be captured using the linear model. With higher expressive power, the predictions for new samples are more accurate as well as more robust to noise components in the observations. We measure the expressive power using the approximation errors to the responses during the training stage: the smaller the approximation errors, the better the expressive power. However, because gene expression data usually contain significant noise components, missing components, or outliers, merely minimizing the approximation errors tends to overfit the training data. It is well known that overfitting the data during the training stage can only lead to a degraded discriminative capability for new samples. To preserve sufficient expressive power while preventing overfitting, we must take into account the complexity of $\beta$; at the same time, we impose an upper bound on the approximation errors. The complexity of $\beta$ is measured with an $\ell_q$-norm of $\beta$, and we shall choose a proper $q$ that is not only computationally tractable but also allows for meaningful interpretability. The upper bound on the approximation errors is
$$\max_{i = 1, \ldots, n} |(z)_i| = \max_{i = 1, \ldots, n} |(y - X\beta)_i| = \|y - X\beta\|_\infty \le \delta, \quad (3)$$
where $\delta$ is a constant representing a tolerance level to noise when approximating $y$ with the training observations. We need to choose $\delta$ sufficiently small for good class separability and also large enough for high noise tolerance. Usually, we choose $\delta$ to be around 0.1 when $y$ uses integers as class labels.
Under the approximation upper bound in Eq. (3), we minimize the complexity of the estimated $\beta$ to prevent the model from overfitting. The complexity is represented by $\|\beta\|_q$ for some $0 \le q \le \infty$. The feature selector $\beta$
needs to have a large number of zeros. The number of nonzero elements of $\beta$ is mathematically measured by the $\ell_0$-norm of $\beta$. Foster and George (1994) used the $\ell_0$-norm in the well-known canonical procedure
$$\arg\min_\beta \|y - X\beta\|_2^2 + \Lambda \|\beta\|_0, \quad (4)$$
where $\Lambda$ is a constant balancing the goodness of fit and the model complexity. The minimization with $\|\beta\|_0$ is combinatorial in nature and has an impractical computational complexity. To alleviate the difficulty, under uniform uncertainty principles, $\ell_1$-minimization has been used to replace $\|\beta\|_0$ (Donoho and Huo, 2001; Elad and Bruckstein, 2002). In particular, Donoho observed that for most underdetermined systems of linear equations, the minimal $\ell_1$-norm solution is also the sparsest solution (Donoho, 2004). Because $n \ll p$ for gene expression data, we have an underdetermined system. To obtain a sparse $\beta$ for high classification power and robustness to noise, we make use of $\ell_1$-minimization techniques.
In summary, taking into account the approximation accuracy and model overfitting while avoiding noise accumulation, we arrive at the following constrained optimization for gene feature selection:
$$\min_\beta \|\beta\|_1 \quad \text{subject to} \quad \|y - X\beta\|_\infty \le \delta. \quad (5)$$
This is a convex optimization problem (Boyd and Vandenberghe, 2004), which can be easily transformed into a linear program (LP). Introducing an auxiliary optimization variable $r \in \mathbb{R}^p$ into Eq. (5) yields
$$\begin{aligned}
\min \;& r^T 1_p \\
\text{subject to} \;& \beta - r \le 0, \\
& -\beta - r \le 0, \\
& -X\beta \le -y + \delta 1_n, \\
& X\beta \le y + \delta 1_n,
\end{aligned} \quad (6)$$
where $1_k \in \mathbb{R}^k$ is a vector of $k$ ones, and "$\le$" holds for each component of the vectors in the vectorial inequalities. The optimization variables in Eq. (6) are $\beta$ and $r$.
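As a sanity check of this formulation, problem (6) can be handed directly to an off-the-shelf LP solver. The following is a minimal sketch using scipy.optimize.linprog; the solver choice, tolerances, and toy data are our own illustrations, not part of the paper:

import numpy as np
from scipy.optimize import linprog

def sparse_selector_lp(X, y, delta):
    """Solve min ||beta||_1 s.t. ||y - X beta||_inf <= delta via the LP in Eq. (6)."""
    n, p = X.shape
    I = np.eye(p)
    # variables are stacked as [beta; r]; inequality blocks follow Eq. (6)
    A_ub = np.block([
        [ I, -I],                      #  beta - r <= 0
        [-I, -I],                      # -beta - r <= 0
        [-X, np.zeros((n, p))],        # -X beta <= -y + delta 1
        [ X, np.zeros((n, p))],        #  X beta <=  y + delta 1
    ])
    b_ub = np.concatenate([np.zeros(2 * p), -y + delta, y + delta])
    c = np.concatenate([np.zeros(p), np.ones(p)])   # minimize 1^T r
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (2 * p), method="highs")
    return res.x[:p]                   # the estimated feature selector beta

# toy usage with a 2-sparse ground truth; typically the true support dominates
rng = np.random.default_rng(1)
X = rng.standard_normal((20, 100))
beta_true = np.zeros(100); beta_true[[3, 7]] = [1.0, -1.0]
y = X @ beta_true
beta_hat = sparse_selector_lp(X, y, delta=0.1)
print(np.argsort(-np.abs(beta_hat))[:5])   # indices of the largest coefficients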
Now the selection of the most salient features from gene expression data has been boiled down to solving an LP problem. Typically the number of genes, $p$, is in the range of 5000 to 30,000. For the LP (6), the optimization variables $\beta$ and $r$ have a total dimension of $2p$, and the number of inequality constraints is
$$m = 2p + 2n. \quad (7)$$
The dimension of the optimization variables and the number of inequality constraints are both in an approximate range of 10,000 to 60,000. The large-scale LP in Eq. (6) needs to be solved efficiently. To this end, we design a primal-dual interior-point method (Boyd and Vandenberghe, 2004) that is adapted to the LP with a large $p$. The method is specified in Section 3.

After applying the above sparsity optimization procedure, the estimated $\hat\beta$ is almost the sparsest solution
to the problem of approximating $y$ using $X\beta$. The significant components of $\hat\beta$ represent the most important features, which are retained whenever $|\hat\beta_k| \ge \gamma_0$, with $\gamma_0$ being a small positive threshold. The estimated $\hat\beta$, however, exhibits the so-called "shrinkage effect" (Fan and Li, 2006). That is, important features are underestimated in magnitude, leading to some important features being shrunken to zero. To
counteract this undesirable effect, we propose a trilogy method. Excluding the significant genes from the first-round optimization, this method performs a second-round sparsity optimization on the remaining genes. The purpose of the re-estimation is to pick up the significant genes that have been improperly shrunken previously. Subsequently, the chosen significant genes from both rounds are re-estimated using the least squares (LS) method or the support vector regression (SVR) method (Vapnik, 1998) to remove the shrinkage. LS and SVR are chosen for the re-estimation because they are unbiased. In summary, our trilogy method is as follows (a schematic code sketch is given after the list):
- First, apply the above sparsity optimization procedure and retain the resultant significant features. Denote this set of features by $G_0$.
- Second, remove the columns of $X$ that correspond to the genes in $G_0$. The new data matrix is denoted by $\bar{X}$. Apply the sparsity optimization procedure to $\bar{X}$ and choose the significant genes with another threshold $0 < \gamma_1 < \gamma_0$. The resulting significant feature set is denoted by $G_1$.
- Third, restrict the dataset $X$ to only the significant genes in $G = G_0 \cup G_1$. Apply the LS (or SVR) estimation to $X$ restricted to $G$.
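The following schematic sketch of the trilogy procedure assumes a solver solve_lp for problem (5), for instance the sparse_selector_lp sketch above; least squares is used for the final re-estimation, with SVR as a drop-in alternative:

import numpy as np

def trilogy_select(X, y, delta, gamma0, gamma1, solve_lp):
    """Schematic of the three-step (trilogy) selection procedure."""
    # Step 1: first-round sparsity optimization, keep strongly nonzero genes
    beta1 = solve_lp(X, y, delta)
    G0 = np.flatnonzero(np.abs(beta1) > gamma0)

    # Step 2: drop the G0 columns and re-run on the rest with a smaller threshold
    rest = np.setdiff1d(np.arange(X.shape[1]), G0)
    beta2 = solve_lp(X[:, rest], y, delta)
    G1 = rest[np.abs(beta2) > gamma1]

    # Step 3: least-squares re-estimation on G = G0 U G1 to remove the shrinkage
    G = np.union1d(G0, G1)
    coef, *_ = np.linalg.lstsq(X[:, G], y, rcond=None)
    return G, coef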
In this trilogy method, the core part is the sparsity optimization for large p. We specify an efficient way
for doing this in the next section.
3. PRIMAL-DUAL INTERIOR POINT FOR SPARSITY OPTIMIZATION
To extract the most salient features, we have formulated essentially an optimization-based approach to
estimating the feature selector b in Eq. (2). The optimization turns out to be a large-scale LP problem which
is the most critical part for our method. To efficiently solve the LP in Eq. (6), we exploit a primal-dual
interior-point technique. By defining the matrix $A \in \mathbb{R}^{m \times 2p}$, the vectors $b \in \mathbb{R}^m$ and $c \in \mathbb{R}^{2p}$, and $\tilde\beta \in \mathbb{R}^{2p}$ as follows, we put the LP parameters in block structure:
$$A = \begin{pmatrix} I & -I \\ -I & -I \\ -X & 0 \\ X & 0 \end{pmatrix}, \qquad
b = \begin{pmatrix} 0 \\ 0 \\ -y + \delta 1_n \\ y + \delta 1_n \end{pmatrix}, \qquad
c = \begin{pmatrix} 0 \\ 1_p \end{pmatrix}, \qquad
\tilde\beta = \begin{pmatrix} \beta \\ r \end{pmatrix}.$$
Following the notation of Boyd and Vandenberghe and using a logarithmic barrier function (Boyd and Vandenberghe, 2004), we define
$$f_0(\tilde\beta) = c^T \tilde\beta, \quad (8)$$
$$f(\tilde\beta) = A\tilde\beta - b, \quad (9)$$
$$r_{\mathrm{dual}} = c + A^T \lambda, \quad (10)$$
$$r_{\mathrm{cent}} = -\mathrm{diag}(\lambda)\, f(\tilde\beta) - (1/t)\, 1_m, \quad (11)$$
where $\lambda \in \mathbb{R}^m$ is the dual variable, the barrier parameter $t$ typically increases geometrically at each iteration, $r_{\mathrm{cent}}$ is the centrality residual, and $r_{\mathrm{dual}}$ is the dual residual (Boyd and Vandenberghe, 2004).
The LP in Eq. (6) becomes
$$\min_{\tilde\beta \in \mathbb{R}^{2p}} f_0(\tilde\beta) \quad \text{subject to} \quad f(\tilde\beta) \le 0. \quad (12)$$
The dual residual $r_{\mathrm{dual}}$ in Eq. (10) is obtained from
$$r_{\mathrm{dual}} = \nabla f_0(\tilde\beta) + Df(\tilde\beta)^T \lambda. \quad (13)$$
Here $\nabla$ is the gradient operator, and $Df$ is the derivative matrix of size $m \times 2p$ given by
$$Df(\tilde\beta) = D \begin{pmatrix} f_1(\tilde\beta) \\ \vdots \\ f_m(\tilde\beta) \end{pmatrix}
= \begin{pmatrix} \nabla f_1(\tilde\beta)^T \\ \vdots \\ \nabla f_m(\tilde\beta)^T \end{pmatrix}, \quad (14)$$
where $f_k(\tilde\beta)$ is the $k$th scalar inequality of $f(\tilde\beta)$ in Eq. (9).
The primal-dual interior-point method iteratively updates the primal-dual pair $(\tilde\beta, \lambda)$. Using the Newton step (Boyd and Vandenberghe, 2004), one solves the following linear system of equations to obtain an update of the primal-dual pair:
$$\begin{pmatrix} 0 & -A^T \\ \mathrm{diag}(\lambda)\, A & \mathrm{diag}(f(\tilde\beta)) \end{pmatrix}
\begin{pmatrix} \Delta\tilde\beta \\ \Delta\lambda \end{pmatrix}
= \begin{pmatrix} r_{\mathrm{dual}} \\ r_{\mathrm{cent}} \end{pmatrix}. \quad (15)$$
The primal and dual search directions are coupled in that the primal search direction $\Delta\tilde\beta$ depends on the current values of both the dual and primal variables. After obtaining the primal-dual search direction, the current values of the primal-dual pair are updated by assigning to them the new values
$$(\tilde\beta^+, \lambda^+) = (\tilde\beta, \lambda) + s\,(\Delta\tilde\beta, \Delta\lambda), \quad (16)$$
where $(\tilde\beta^+, \lambda^+)$ is the next iterate pair, and $s$ is the step size determined by a standard backtracking line search (Boyd and Vandenberghe, 2004). The step size is based on the norm of the residual, and one needs to ensure that $\lambda$ is positive and $f(\tilde\beta)$ is negative, both elementwise. The primal-dual interior-point algorithm
typically terminates after the size of the residual vectors and the surrogate gap are smaller than specified
tolerances (Boyd and Vandenberghe, 2004).
From Eq. (15) above, we obtain the Newton search direction for $\Delta\tilde\beta$ from
$$A^T \mathrm{diag}(\lambda / f(\tilde\beta))\, A\, \Delta\tilde\beta = r_{\mathrm{dual}} + A^T \mathrm{diag}(1 / f(\tilde\beta))\, r_{\mathrm{cent}}. \quad (17)$$
We decompose $\mathrm{diag}(\lambda / f(\tilde\beta))$ and $f(\tilde\beta)$ into the following block structure:
$$\mathrm{diag}(\lambda / f(\tilde\beta)) = \mathrm{diag}(G_1, G_2, G_3, G_4), \qquad
f(\tilde\beta) = \begin{pmatrix} g_1 \\ g_2 \\ g_3 \\ g_4 \end{pmatrix}
= \begin{pmatrix} \beta - r \\ -\beta - r \\ -X\beta - \delta 1_n + y \\ X\beta - \delta 1_n - y \end{pmatrix}, \quad (18)$$
where $\lambda = (\lambda_1^T, \lambda_2^T, \lambda_3^T, \lambda_4^T)^T$, with $\lambda_k, g_k \in \mathbb{R}^p$ for $k = 1, 2$, $\lambda_3, \lambda_4, g_3, g_4 \in \mathbb{R}^n$, and $G_i = \mathrm{diag}(\lambda_i / g_i)$, $i = 1, \ldots, 4$.
Therefore, we have
$$(A^T \mathrm{diag}(\lambda / f(\tilde\beta))\, A)\, \Delta\tilde\beta
= \begin{pmatrix} (G_1 + G_2) + X^T (G_3 + G_4) X & -G_1 + G_2 \\ -G_1 + G_2 & G_1 + G_2 \end{pmatrix}
\begin{pmatrix} \Delta\beta \\ \Delta r \end{pmatrix}. \quad (19)$$
Decomposing $r_{\mathrm{dual}} + A^T \mathrm{diag}(1 / f(\tilde\beta))\, r_{\mathrm{cent}}$ into $\begin{pmatrix} r_1 \\ r_2 \end{pmatrix}$, where $r_i \in \mathbb{R}^p$, $i = 1, 2$, we have
$$\bigl(4 G_1 G_2 + (G_1 + G_2)\, X^T (G_3 + G_4) X\bigr)\, \Delta\beta = (G_1 + G_2)\, r_1 + (G_1 - G_2)\, r_2, \quad (20)$$
and
$$\Delta r = (G_1 + G_2)^{-1} r_2 - (G_1 + G_2)^{-1} (-G_1 + G_2)\, \Delta\beta. \quad (21)$$
From the above equations, we obtain the update for $\Delta\tilde\beta = \begin{pmatrix} \Delta\beta \\ \Delta r \end{pmatrix}$. This mainly involves solving a $p$-by-$p$ linear system of equations by block elimination. Afterward, we can get the update for the dual variable from Eq. (15) using
$$\Delta\lambda = -\mathrm{diag}(1 / f(\tilde\beta)) \bigl( \mathrm{diag}(\lambda)\, A\, \Delta\tilde\beta - r_{\mathrm{cent}} \bigr). \quad (22)$$
It can be seen that essentially each Newton step needs to solve a $p$-by-$p$ system of linear equations. The surrogate duality gap is
$$\hat\eta(\tilde\beta, \lambda) = -f(\tilde\beta)^T \lambda. \quad (23)$$
When both the primal-dual gap and the size of the residual vector fall below specified tolerances, the Newton iteration is terminated. Alternatively, we may terminate the algorithm after a certain maximum number of iterations, or when the algorithm converges, that is, when consecutive values of the optimization variables are sufficiently close.
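As a concrete illustration of the block elimination, a minimal NumPy sketch of one Newton step following Eqs. (17)-(22) could look as follows; this is our own illustrative code, the barrier parameter is written as $t$, and only the diagonal vectors of the $G_i$ are formed:

import numpy as np

def newton_step(X, y, delta, beta, r, lam, t):
    """One block-eliminated primal-dual Newton step for the LP (6), Eqs. (17)-(22)."""
    n, p = X.shape
    l1, l2 = lam[:p], lam[p:2 * p]
    l3, l4 = lam[2 * p:2 * p + n], lam[2 * p + n:]
    # inequality values f(beta~) = A beta~ - b, partitioned as in Eq. (18)
    g1, g2 = beta - r, -beta - r
    g3, g4 = -X @ beta - delta + y, X @ beta - delta - y
    f = np.concatenate([g1, g2, g3, g4])
    # diagonal blocks G_i = diag(lambda_i / g_i), stored as vectors
    G1, G2, G3, G4 = l1 / g1, l2 / g2, l3 / g3, l4 / g4
    # residuals of Eqs. (10)-(11); c = [0_p; 1_p]
    c = np.concatenate([np.zeros(p), np.ones(p)])
    At_lam = np.concatenate([l1 - l2 - X.T @ l3 + X.T @ l4, -l1 - l2])
    r_dual = c + At_lam
    r_cent = -lam * f - 1.0 / t
    # right-hand side r_dual + A^T diag(1/f) r_cent, split into (r1, r2)
    w = r_cent / f
    rhs = r_dual + np.concatenate(
        [w[:p] - w[p:2 * p] - X.T @ w[2 * p:2 * p + n] + X.T @ w[2 * p + n:],
         -w[:p] - w[p:2 * p]])
    r1, r2 = rhs[:p], rhs[p:]
    # reduced p-by-p system of Eq. (20)
    M = np.diag(4.0 * G1 * G2) + (G1 + G2)[:, None] * (X.T @ ((G3 + G4)[:, None] * X))
    d_beta = np.linalg.solve(M, (G1 + G2) * r1 + (G1 - G2) * r2)
    # back-substitution, Eqs. (21)-(22)
    d_r = r2 / (G1 + G2) - ((G2 - G1) / (G1 + G2)) * d_beta
    A_dstep = np.concatenate([d_beta - d_r, -d_beta - d_r, -X @ d_beta, X @ d_beta])
    d_lam = -(1.0 / f) * (lam * A_dstep - r_cent)
    return d_beta, d_r, d_lam

The dominant cost per step is the $p$-by-$p$ solve for $\Delta\beta$, in agreement with the discussion above.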
4. ALGORITHM FOR FEATURE EXTRACTION USING SPARSITY OPTIMIZATION
We now summarize the algorithm for feature selection from gene expression data using sparsity opti-
mization in Algorithm 1.
A standard backtracking line search (Boyd and Vandenberghe, 2004) is used in Step 3 of Algorithm 1.
First we compute
$$s_{\max} = \sup\{\, s \in [0, 1] \mid \lambda + s\,\Delta\lambda \ge 0 \,\}. \quad (24)$$
Algorithm 1  Feature selection using sparsity optimization with the primal-dual interior-point method

Given the notation in Eqs. (8)-(11).
Given $\tilde\beta$ that satisfies $f_k(\tilde\beta) < 0$, $k = 1, \ldots, m$; $\lambda > 0$; $\mu > 1$; $\epsilon_{\mathrm{feas}} > 0$; $\epsilon > 0$; Max_Iter $\ge 1$; $\gamma_0 \ge \gamma_1 > 0$.
Set Num_Iter ← 0.
Repeat
  1) Determine $t$: set $t \leftarrow \mu m / \hat\eta$.
  2) Compute the primal-dual search directions $\Delta\beta$, $\Delta r$, and $\Delta\lambda$ using Eqs. (20)-(22).
  3) Line search and update: determine the step length $s > 0$ and set the new primal-dual pair by $(\tilde\beta, \lambda) \leftarrow (\tilde\beta, \lambda) + s(\Delta\tilde\beta, \Delta\lambda)$. Set Num_Iter ← Num_Iter + 1.
Until { $r_{\mathrm{feas}} < \epsilon_{\mathrm{feas}}$ and $\hat\eta(\tilde\beta, \lambda) < \epsilon$, or Num_Iter > Max_Iter }
Choose the $N$ nonzero $\beta_k$ such that $|\beta_k| > \gamma_0$ to get the set of significant genes $G_0$ (the first step in the trilogy method). Apply the second step in the trilogy method to get $G_1$ with $\gamma_1 \le \gamma_0$. Finally, apply LS (or SVR) to re-estimate and rank the significant genes in $G = G_0 \cup G_1$.
Then we start backtracking with $s = 0.99\, s_{\max}$ and multiply $s$ by a factor in $(0, 1)$ until we have $f(\tilde\beta^+) < 0$. We define the norm of the primal and dual residuals in the termination condition of Algorithm 1 as
$$r_{\mathrm{feas}} = \bigl( \|r_{\mathrm{prim}}\|_2^2 + \|r_{\mathrm{dual}}\|_2^2 \bigr)^{1/2}. \quad (25)$$
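A sketch of this backtracking scheme is given below; the helper functions f_of and res_of, which are assumed to return the inequality values $f(\tilde\beta)$ and the stacked residual vector used in Eq. (25), are illustrative placeholders rather than part of the paper:

import numpy as np

def backtrack_step(f_of, res_of, z, lam, dz, dlam, alpha=0.01, rho=0.5):
    """Backtracking line search: keep lambda > 0 and f < 0 elementwise,
    then shrink s until the residual norm decreases sufficiently."""
    neg = dlam < 0
    s_max = 1.0 if not np.any(neg) else min(1.0, float(np.min(-lam[neg] / dlam[neg])))
    s = 0.99 * s_max
    while np.any(f_of(z + s * dz) >= 0):        # stay strictly feasible, Eq. (24)
        s *= rho
    r0 = np.linalg.norm(res_of(z, lam))
    while np.linalg.norm(res_of(z + s * dz, lam + s * dlam)) > (1.0 - alpha * s) * r0:
        s *= rho
    return z + s * dz, lam + s * dlam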
After the sparsity optimization converges, typically only a tiny fraction of the $\beta_k$ are nonzero while the rest vanish. Among these nonzero $\beta_k$, only a small fraction have relatively large magnitudes, while the others have very small magnitudes that have an insignificant effect on the final classification performance. Therefore, we choose $N$ nonzero $\beta_k$ that represent a high percentage of the total magnitude of the nonzero $\beta_k$. Usually the percentage is 95% or 85% in our algorithm; these values are also used in our experimental evaluations.
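A simple way to implement this magnitude-coverage rule (an illustrative sketch; the 95% or 85% level is passed as the coverage argument) is:

import numpy as np

def select_top_genes(beta_hat, coverage=0.95):
    """Keep the smallest set of coefficients whose magnitudes account for a
    given fraction of the total magnitude of beta_hat."""
    mags = np.abs(beta_hat)
    order = np.argsort(mags)[::-1]          # indices sorted by decreasing magnitude
    cum = np.cumsum(mags[order])
    N = int(np.searchsorted(cum, coverage * cum[-1])) + 1
    return order[:N]                        # indices of the selected genes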
5. EXPERIMENTAL RESULTS
5.1. Leukemia dataset
We refer to our sparsity optimization approach to constructing the most salient features for gene expression data as SOGE. We implement SOGE and obtain promising results. We classify the leukemia gene microarray data originally obtained by Golub et al. (1999). There are 7129 genes, 38 training samples, and 34 testing samples coming from two classes: ALL (acute lymphocytic leukemia) and AML (acute myelogenous leukemia). We follow exactly the same preprocessing steps as in Dudoit et al. (2002). After preprocessing, the number of genes is reduced to 3571 out of 7129. The data after preprocessing are the same as in Tibshirani et al. (2002) and Fan and Fan (2007). Then we use our SOGE for feature selection. From Figure 1, we can see that after about 30 iterations, the first-round optimization converges. The resultant feature
selector $\beta$ is plotted in Figure 2. We then sort the magnitudes of the $\beta_k$ and plot the largest 35 in Figure 2.

FIG. 1.  Convergence behaviors of the surrogate duality gap (a), and of the norm of the primal and dual residuals (b), for the leukemia data.

It can be seen that only a very small fraction of the coefficients of $\beta$ is nonzero, from which we choose those having
the largest magnitudes to form $G$. Then the rest of the procedures in Algorithm 1 are followed to select the most important genes.
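For reference, the preprocessing referenced above (Dudoit et al., 2002) consists of intensity thresholding, a fold-change and absolute-difference filter, and a base-10 log transformation; the following minimal sketch uses the commonly reported cutoff values, which should be treated here as assumptions rather than as taken from this paper:

import numpy as np

def dudoit_preprocess(X_raw, floor=100.0, ceil=16000.0, fold=5.0, diff=500.0):
    """Dudoit-style preprocessing sketch: (i) threshold intensities to [floor, ceil],
    (ii) drop genes with max/min <= fold or max - min <= diff, (iii) log10 transform."""
    X = np.clip(X_raw, floor, ceil)                      # (i) thresholding
    gmax, gmin = X.max(axis=0), X.min(axis=0)
    keep = (gmax / gmin > fold) & (gmax - gmin > diff)   # (ii) filtering
    return np.log10(X[:, keep]), keep                    # (iii) log transform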
Afterwards, we simply use the IR classification rule (Bickel and Levina, 2004) for classification for fair comparisons, as FAIR uses IR for classification after its feature selection. Because SOGE is a general feature selection method, the selected features can also be used with other classifiers, such as support vector machines or decision trees, with even better classification results. We compare our results to the state-of-the-art methods: the nearest shrunken centroid (NSC) method proposed by Tibshirani et al. (2002), and the feature annealed independence rule (FAIR) by Fan and Fan (2007). From Table 1, it can be seen that our results for both training and testing are better than those of NSC and FAIR. With six selected genes, SOGE has no training error out of 38 samples and only 1 testing error out of 34 samples. Especially, SOGE discovers the two most important genes, with which we have no training error out of 38 samples and 2 testing errors out of 34 samples.
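For completeness, the IR rule used here (diagonal LDA) assigns a sample to the class whose centroid is nearest under per-gene variance scaling, ignoring between-gene covariances and class priors; the following minimal sketch is our own illustration, applied to the selected genes only:

import numpy as np

def ir_fit(X, y):
    """Fit the independence rule (diagonal LDA) on the selected genes."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # per-gene variances averaged over classes (a simple pooled estimate)
    var = np.mean([X[y == c].var(axis=0, ddof=1) for c in classes], axis=0) + 1e-12
    return classes, means, var

def ir_predict(X, classes, means, var):
    # classify to the class minimizing the variance-weighted distance to its centroid
    d = ((X[:, None, :] - means[None, :, :]) ** 2 / var).sum(axis=2)
    return classes[np.argmin(d, axis=1)]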
5.2. Prostate cancer dataset
We have also applied our feature selection method to the Prostate Cancer dataset, used to distinguish normal from tumor prostate samples. The dataset is available from Singh et al. (2002). There are 12600 genes and 102 patient samples for training, 52 of which are tumor samples and 50 of which are normal samples. We use an independent dataset for testing (Welsh et al., 2001), which has 25 tumor and 9 normal samples from a different experiment. The testing dataset has a nearly 10-fold difference in overall microarray intensity from the training data. We follow exactly the same preprocessing steps as in Dudoit et al. (2002). After preprocessing, the number of genes is reduced to 2803 out of 12600. The data after preprocessing are the same as in Tibshirani et al. (2002) and Fan and Fan (2007). Then we use the SOGE algorithm for feature selection. From Figure 3, we can see that after about 30 iterations, the first-round optimization converges. The resultant feature selector $\beta$ is plotted in Figure 4. A similar conclusion to that for the leukemia dataset can be drawn.
Our testing results are better than those of both NSC and FAIR, though our training errors are slightly larger (Table 2). Specifically, with six genes, SOGE makes nine training errors out of 102 samples and only four testing errors out of 34 samples. This can be compared with NSC, which with six genes makes eight training errors and nine testing errors. We also compare with FAIR: with two selected genes, FAIR makes nine testing errors out of 34 samples and 10 training errors out of 102 samples.
Table 1. Classification Errors of Leukemia Data
Method Training error Test error No. of selected genes
NSC 1/38 3/34 21
FAIR 1/38 1/34 11
SOGE 0/38 1/34 6
SOGE 0/38 2/34 2
FIG. 2.  Resultant feature selector $\beta$ for the leukemia data. (a) Final result of $\beta$. (b) Sorted magnitudes of the $\beta_k$; only the largest 35 are shown.
Table 2. Classification Errors of Prostate Cancer Data
Method Training error Test error No. of selected genes
NSC 8/102 9/34 6
FAIR 10/102 9/34 2
SOGE 9/102 4/34 6
SOGE 14/102 4/34 3
FIG. 4.  Resultant feature selector $\beta$ for the prostate cancer data. (a) Final result of $\beta$. (b) Sorted magnitudes of the $\beta_k$; only the largest 50 are shown.
FIG. 3.  Convergence behaviors of the surrogate duality gap (a), and of the norm of the primal and dual residuals (b), for the prostate cancer data.
FIG. 5.  Convergence behaviors of the surrogate duality gap (a), and of the norm of the primal and dual residuals (b), for the lung cancer data.
As can be seen from Table 2, our method makes fewer testing errors with approximately the same number of genes as NSC or FAIR.
5.3. Lung cancer dataset
We have also applied our feature selection method to the lung cancer data used to classify between malignant pleural mesothelioma (MPM) and adenocarcinoma (ADCA) of the lung. The dataset is available from Gordon et al. (2002). There are 12533 genes, 32 training samples, and 149 testing samples. We follow exactly the same preprocessing steps as in Dudoit et al. (2002). After preprocessing, the number of genes is reduced to 2532 out of 12533. The data after preprocessing are the same as in Tibshirani et al. (2002) and Fan and Fan (2007). Then we use our SOGE for feature selection. From Figure 5, we can see that after about 30 iterations, the first-round optimization converges. The resultant feature selector $\beta$ is plotted in Figure 6. A similar conclusion to that for the leukemia dataset can be drawn.
Our results are much better than those of both NSC and FAIR (Table 3). Specifically, with 19 genes, SOGE has no training errors out of 32 samples and only 1 testing error out of 149 samples. As can be seen from Table 3, our method uses far fewer genes and makes far fewer errors compared with NSC and FAIR.
6. CONCLUSION
We have proposed a multivariate feature selection method for gene expression data. Sparse optimization
techniques have been naturally applied to capture intrinsic structures of the high-dimensional data. We
exploit a primal-dual interior-point technique to efficiently solve the large-scale optimization program.
Experimental results show the effectiveness of the proposed scheme. A future line of research will include
further reducing the computational complexity.
ACKNOWLEDGMENTS
We would like to thank Drs. M. Zargham and M. Sayeh for their helpful discussions. This work is
partially supported by funding from Southern Illinois University Carbondale.
Table 3. Classification Errors of Lung Cancer Data
Method Training error Test error No. of selected genes
NSC 0/32 11/149 26
FAIR 0/32 7/149 31
SOGE 0/32 3/149 11
SOGE 0/32 1/149 19
FIG. 6.  Resultant feature selector $\beta$ for the lung cancer data. (a) Final result of $\beta$. (b) Sorted magnitudes of the $\beta_k$; only the largest 35 are shown.
DISCLOSURE STATEMENT
No competing financial interests exist.
REFERENCES
Antoniadis, A., Lambert-Lacroix, S., and Leblanc, F. 2003. Effective dimension reduction methods for tumor classi-
fication using gene expression data. Bioinformatics 19, 563–570.
Bair, E., Hastie, T., Paul, D., and Tibshirani, R. 2006. Prediction by supervised principal components. J. Am. Statist. Assoc. 101, 119–137.
Barker, M., and Rayens, W. 2003. Partial least squares for discrimination. J Chemometrics. 17, 166–173.
Bickel, P., and Levina, E. 2004. Some theory of Fisher’s linear discriminant function, ‘‘naive Bayes,’’ and some
alternatives where there are many more variables than observations. Bernoulli 10, 989–1010.
Boyd, S., and Vandenberghe, L. 2004. Convex Optimization. Cambridge University Press, Cambridge, UK.
Breiman, L. 2001. Random forests. Machine Learn. 45, 5–32.
Bura, E., and Pfeiffer, R.M. 2003. Graphical methods for class prediction using dimension reduction techniques on
DNA microarray data. Bioinformatics 19, 1252–1258.
Chiaromonte, F., and Martinelli, J. 2002. Dimension reduction strategies for analyzing global gene expression data with
a response. Math. Biosci. 176, 123–144.
Diaz-Uriarte, R., and Alvarez de Andres, S. 2006. Gene selection and classification of microarray data using random
forest. BMC Bioinform. 7, 3.
Domingos, P., and Pazzani, M. 1997. On the optimality of the simple Bayesian classifier under zero-one loss. Machine
Learn. 29, 103–130.
Donoho, D.L. 2004. For most large underdetermined systems of linear equations the minimal l1-norm solution is also
the sparsest solution. Manuscript. Department of Statistics Stanford University, Palo Alto, CA.
Donoho, D.L., and Huo, X. 2001. Uncertainty principles and ideal atomic decomposition. IEEE Trans. Inform. Theory 47,
2845–2862.
Dudoit, S., Fridlyand, J., and Speed, T. 2002. Comparison of discrimination methods for the classification of tumors
using gene expression data. J. Am. Statist. Assoc. 97, 77–87.
Elad, M., and Bruckstein, A.M. 2002. A generalized uncertainty principle and sparse representation in pairs of bases.
IEEE Trans. Inform. Theory 48, 2558–2567.
Fan, J., and Fan, Y. 2008. High-dimensional classification using features annealed independence rules. The Annals of
Statistics, 36, 2232–2260.
Fan, J., and Li, R. 2006. Statistical challenges with high dimensionality: Feature selection in knowledge discovery.
Proc. of the Int. Congress of Mathematicians (M. Sanz-sole, J. Soria, J.L. Varona, J. Verdera, eds.).
Fidler, S., Slocaj, D., and Leonardis, A. 2006. Combining reconstructive and discriminative subspace methods for
robust classification and regression by subsampling. IEEE Trans. Pattern Analy. Mach. Intellig. 28, 337–350.
Foster, D.P., and George, E.I. 1994. The risk inflation criterion for multiple regression. Ann. Statist. 22, 1947–1975.
Ghosh, D. 2002. Singular value decomposition regression modeling for classification of tumors from microarray
experiments. Proc. Pacif. Symp. Biocomput. 98, 11462–11467.
Golub, T.R., et al. 1999. Molecular classification of cancer: class discovery and class prediction by gene expression
monitoring. Science 286, 531–537. Available at: www.broad.mit.edu/cgi-bin/cancer/datasets.cgi. Accessed May 1,
2009.
Gordon, G.J., et al. 2002. Translation of microarray data into clinically relevant cancer diagnostic tests using gene
expression ratios in lung cancer and mesothelioma. Cancer Res. 62, 4963–4967. Available at: www.chestsurg.org.
Accessed May 1, 2009.
Huang, X., and Pan, W. 2003. Linear regression and two-class classification with gene expression data. Bioinformatics
19, 2072–2078.
Jolliffe, I.T. 1986. Principal Component Analysis. Springer-Verlag, New York.
Martinez, A., and Kak, A. 2001. PCA versus LDA. IEEE Trans. Pattern Analy. Mach. Intellig. 23, 228–233.
Nguyen, D., and Rocke, D.M. 2002. Tumor classification by partial least squares using microarray gene expression
data. Bioinformatics 18, 39–50.
Singh, D., et al. 2002. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1, 203–209.
Available at: www.broad.mit.edu/cgi-bin/cancer/datasets.cgi. Accessed May 1, 2009.
Speed, T. 2003. Statistical Analysis of Gene Expression Microarray Data. Chapman & Hall/CRC, Boca Raton, FL.
Tibshirani, R., Hastie, T., Narasimhan, B., et al. 2002. Diagnosis of multiple cancer types by shrunken centroids of gene
expression. Proc. Natl. Acad. Sci. USA 99, 6567–6572.
Vapnik, V.N. 1998. Statistical Learning Theory. Wiley, New York.
Welsh, J.B., Sapinoso, L.M., Su, A.I., et al. 2001. Analysis of gene expression identifies candidate markers and
pharmacological targets in prostate cancer. Cancer Res. 61, 5974–5978.
Zhang, A. 2006. Advanced Analysis of Gene Expression Microarray Data (Science, Engineering, and Biology Infor-
matics). World Scientific Publishing Company, New York.
Address correspondence to:
Dr. Qiang Cheng
Faner Hall, Room 2140
Mailcode 4511, 1000 Faner Drive
Carbondale, IL 62901
E-mail: [email protected]