Research Article
Sparsity Optimization Method for Multivariate Feature
Screening for Gene Expression Analysis
QIANG CHENG1 and JIE CHENG2
ABSTRACT
Constructing features from high-dimensional gene expression data is a critically important task for monitoring and predicting patients' diseases, or for knowledge discovery in computational molecular biology. The features need to capture the essential characteristics of the data to be maximally distinguishable. Moreover, the essential features usually lie in small or extremely low-dimensional subspaces, and it is crucial to find them for knowledge discovery and pattern classification. We present a computational method for extracting small or even extremely low-dimensional subspaces for multivariate feature screening and gene expression analysis using sparse optimization techniques. After we transform the feature screening problem into a convex optimization problem, we develop an efficient primal-dual interior-point method expressly for solving large-scale problems. The effectiveness of our method is confirmed by our experimental results. The computer programs will be publicly available.
Key words: feature screening, gene expression, high-dimensional classification, large-scale optimization, low-dimensional subspaces, sparsity optimization.
1. INTRODUCTION
Diseased and normal cells have differential expressions across diverse genes. Microarray analysis
aims to effectively identify gene expression patterns across various types of tissues, for example, at
different disease development stages, with different patient outcomes, or under different environmental
conditions. It is particularly useful for cancer diagnosis and prognosis in cancer genomics. Typically, gene
expression data have high-dimensional features but a small sample size, where the number of genes is frequently in the thousands or more while the number of samples is typically less than one hundred. These high-
dimensional data pose many intrinsic challenges for pattern recognition problems such as the curse of
dimensionality. Moreover, the data exhibit the following prominent phenomena: Many variables are noisy or
irrelevant, and only a few of the variables really contribute to the class distinctiveness. These phenomena significantly challenge traditional methods for data mining, pattern recognition, and knowledge discovery,
which have limited successes when directly applied to gene expression data. The irrelevant variables do not
contribute to the classification accuracy, but instead have adverse effects on the performance. Extracting
(small) subspaces of essential features becomes critically important for reducing the dimension, preventing
noise from accumulating, and exploiting existing classification techniques.
1Computer Science Department and 2Electrical and Computer Engineering Department, Southern Illinois University, Carbondale, Illinois.
JOURNAL OF COMPUTATIONAL BIOLOGY
Volume 16, Number 9, 2009
© Mary Ann Liebert, Inc.
Pp. 1241–1252
DOI: 10.1089/cmb.2008.0034
In the literature, there have been many proposed methods which emphasize the importance of dimension
reduction and feature selection (Speed, 2003; Zhang, 2006). These include projection methods such as
principal component analysis (PCA), partial least square (PLS), and sliced inverse regression (SIL). PCA
has been applied to gene expression data classification by Ghosh (2002) and Bair et al. (2006);
PLS by Nguyen and Rocke (2002), Huang and Pan (2003); and SIL by Chiaromonte and Martinelli (2002),
Antoniadis et al. (2003), and Bura and Pfeiffer (2003). PCA decomposes a sample variance matrix to find
the largest eigenvalues and their corresponding major eigenvectors. It projects the data to the directions of
major eigenvectors (Jolliffe, 1986). The advantages of PCA are that its reconstruction capability is good and that noise, missing components, or outliers have less adverse effect on classification. The disadvantages
of PCA, however, are that it does not necessarily capture the most discriminative power of the gene data
(Martinez and Kak, 2001) and the projected directions often lack meaningful interpretability. PLS finds
fundamental relations between two matrices and uses a linear model to describe predicted variables by
using other observable variables. It has been used for discrimination in chemometrics (Barker and Rayens,
2003) and dimensionality reduction in bioinformatics (Nguyen and Rocke, 2002). SIL finds a smooth
regression function that operates on a set of projections. The advantages of PLS and SIL are that they
attempt to find projection directions resulting in small classification errors. Usually, they put more weight
on those features that have large classification power. However, the disadvantage is that their projection
directions may produce large sets of features for prediction. The most informative genes are sparse; for
example, for cancer versus non-cancer diagnosis, usually 50 genes are sufficient (Golub et al., 1999). Due
to the existence of noise components, using redundant variables increases the misclassification rate when
there is only a fraction of variables accounting for most of the information of the data.
In addition to the above methods, variable selection techniques have also been proposed. Tibshirani et al.
(2002) propose a nearest shrunken centroid (NSC) method which basically uses a simple component-wise
two-sample t-test to identify genes for tumor classification. Bickel and Levina (2004) use an independence rule (IR) to overcome the singularity problem of high-dimensional data that arises in Fisher linear discriminant
analysis (LDA). LDA maximizes between-class variance while minimizing within-class variance, leading
to the highest discriminative power when the data are well conditioned. IR only makes use of the diagonal elements of the sample covariance matrix, and thus it is also referred to as diagonal LDA (DLDA). Fan and Fan
(2007) propose a feature annealed independence rule (FAIR). Based on Bickel and Levina’s IR, they
propose to further single out the most important genes to reduce noise effect. The advantage of these
methods is that they directly consider discriminative power of the features, and thus they offer potentially
better classification performance than the projection methods. The disadvantage is that they take no
consideration of interactions or correlations between features during feature selection. Expressive power of
the data is limited as the essential inter-variable relationship or structure of the data is ignored. When the
data contain many noise components or a number of outliers, the performance of these methods suffers.
With only discrimination and no expressive power, the features drawn are sensitive to noise or outliers,
which leads to a performance degradation. This phenomenon has been observed by Fidler et al. (2006).
Random forest (Breiman, 2001), which uses an ensemble of learning trees to obtain boosted performance, has also been applied to tumor classification (Diaz-Uriarte and Alvarez de Andres, 2006). It is a committee machine, that is, a combination of classifiers, whereas we aim at building a feature selection method for a single classifier, which can be combined with others for higher performance when needed.
We propose in this article a different approach to feature selection for gene expression data than the
abovementioned methods. We construct a small subset of the most salient features based on which the
classification is as accurate as possible. The selected features must be discriminative so that different
classes can be maximally distinguished. The features must also be expressive. That is, they must represent
the intrinsic inter-feature structure of the data so that the data can be essentially reconstructed with even
noise components or outliers. Moreover, the sparseness of features offers us intrinsic structure of the high-
dimensional data. Noise components can be screened out and thus the method enables robustness to noise
and efficient deployment of algorithms. To these ends, we identify proper measures for desired properties
and incorporate them into objective functions or constraints. Especially, the maximum number of zero
elements or sparsity leads to our main objective function. We formulate a sparsity optimization approach to
integrating the identified measures for maximized classification performance. We solve the optimization
problem efficiently using modern convex optimization techniques.
In this paper, scalars and vectors are denoted by lowercase letters, and they should be clearly distinguished from the context. Matrices are denoted by capital letters. We use $\mathrm{diag}(x)$ to denote a $k \times k$ diagonal matrix whose diagonal vector is $x \in \mathbb{R}^k$. Script letters denote the spaces. The $\ell_q$-norm of a vector $x$ is denoted by $\|x\|_q$, $0 \le q \le \infty$. We use the notation $(\cdot)_i$ to denote the $i$th element of a vector.
The rest of the paper is organized as follows. Section 2 discusses our way to formulate the feature
selection problem. The formulated problem is solved using a convex optimization technique in Section 3.
The algorithm is presented in Section 4. The effectiveness of the proposed technique is experimentally
demonstrated in Section 5, and the conclusion is drawn in Section 6.
2. FORMULATION OF GENE SELECTION AND SPARSE OPTIMIZATION APPROACH
For gene expression data based classification, we have input observations $x_k \in \mathbb{R}^p$ and response variables $y_k \in \mathbb{R}$, $k = 1, \ldots, n$, where $n$ is the number of observations, $p$ is the dimension of the features, and typically $p \gg n$ for gene expression data. Here $x_k$ is a vector of genes for the $k$th observation. Denote the data matrix of size $n \times p$ by
$$X = [x_1 \mid x_2 \mid \cdots \mid x_n]^T, \quad (1)$$
and the responses by $y = [y_1, \ldots, y_n]^T$. Usually, $p$ is in the range of 5000 to 30,000 while $n$ is on the order of
tens. The number of genes must be reduced to apply existing classification algorithms like LDA or SVM
(Vapnik, 1998). As noted in the previous section, the projection methods for dimensionality reduction of
gene expression data are susceptible to noise accumulation. A small subset of features must be selected to
overcome noise accumulation. The existing methods for gene selection, such as IR, NSC, and FAIR, are mainly based on component-wise comparisons. In other words, they are essentially univariate selection methods. We must take into account interactions of the features. That is, we aim at multivariate selection, with which a combination of genes is able to offer the maximized classification performance. To this end, a brute-force method would need to examine the discriminative capability of each subset of the genes; however, this requires considering a combinatorial number of subsets and has impractical complexity.
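To make the combinatorial burden explicit, an exhaustive search over all subsets of at most $k$ genes would need on the order of
$$\sum_{j=1}^{k} \binom{p}{j} \;\ge\; \binom{p}{k} \;\approx\; \frac{(p e / k)^k}{\sqrt{2\pi k}}$$
evaluations of the discriminative capability; as an illustrative figure (our own, not from the paper), with $p = 5000$ genes and subsets of size $k = 50$, $\binom{5000}{50}$ already exceeds $10^{100}$.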
For multivariate feature selection from gene expression data, we propose to use a linear combination of observation variables to approximate the responses. A parameter $\beta \in \mathbb{R}^p$ with a large number of zero components is used to represent selecting and combining the features. That is, we use a linear model
$$y = X\beta + z, \quad (2)$$
where $z \in \mathbb{R}^n$ represents approximation errors. The feature selector $\beta$ needs to have a small number of nonzeros, which pick out the most important genes. A large number of zero components of $\beta$ screen out the noise or redundant components so that our method can provide robustness to noise accumulation. The interaction between genes is represented by the linear combination of genes whose combining coefficients are the nonzero elements of $\beta$. The feature selection process now boils down to estimating $\beta$ from the linear model.
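To make the role of the sparse selector concrete, the following minimal sketch with synthetic data (all sizes, indices, and values are illustrative, not from the paper's experiments) shows how a $\beta$ with only a few nonzero entries both selects and combines genes:

import numpy as np

rng = np.random.default_rng(0)
n, p = 60, 5000                      # few samples, many genes (p >> n)
X = rng.standard_normal((n, p))      # synthetic expression matrix, rows = samples

beta = np.zeros(p)                   # sparse feature selector
beta[[10, 250, 4000]] = [0.8, -0.5, 0.3]       # only three genes form the response

y = X @ beta + 0.05 * rng.standard_normal(n)   # linear model y = X beta + z

selected = np.flatnonzero(beta)      # nonzero entries of beta pick out the genes
print(selected)                      # -> [  10  250 4000]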
The feature selector $\beta$ must have high expressive power, meaning that the essential structure of the data can be captured using the linear model. With higher expressive power, the predictions for new samples are more accurate as well as more robust to noise components in the observations. We measure the expressive power using the approximation errors to the responses during the training stage: the smaller the approximation errors, the better the expressive power. However, because gene expression data usually contain significant noise components, missing components, or outliers, merely minimizing the approximation errors tends to overfit the training data. It is well known that overfitting the data during the training stage can only lead to a degraded discriminative capability for new samples. To preserve sufficient expressive power while preventing overfitting, we must take into account the complexity of $\beta$; at the same time, we impose an upper bound on the approximation errors. The complexity of $\beta$ is measured with an $\ell_q$-norm of $\beta$, and we shall choose a proper $q$ that is not only computationally tractable but also allows for meaningful interpretability. The upper bound on the approximation errors is
$$\max_{i = 1, \ldots, n} |(z)_i| = \max_{i = 1, \ldots, n} |(y - X\beta)_i| = \|y - X\beta\|_\infty \le \delta, \quad (3)$$
where $\delta$ is a constant representing a tolerance level to noise when approximating $y$ with the training observations. We need to choose $\delta$ sufficiently small for good class separability and also large enough for high noise tolerance. Usually, we choose $\delta$ to be around 0.1 when $y$ uses integers as class labels.
Under the approximation upper bound in Eq. (3), we minimize the complexity of the estimated $\beta$ to prevent the model from overfitting. The complexity is represented by $\|\beta\|_q$ for some $0 \le q \le \infty$. The feature selector $\beta$
needs to have a large number of zeros. The number of nonzero elements of $\beta$ is mathematically measured by the $\ell_0$-norm of $\beta$. Foster and George (1994) used the $\ell_0$-norm in the well-known canonical procedure
$$\arg\min_\beta \|y - X\beta\|_2^2 + \Lambda \|\beta\|_0, \quad (4)$$
where $\Lambda$ is a constant balancing the goodness of fit and the model complexity. The minimization with $\|\beta\|_0$ is combinatorial in nature and has an impractical computational complexity. To alleviate the difficulty, under uniform uncertainty principles, $\ell_1$-minimization has been used to replace $\|\beta\|_0$ (Donoho and Huo, 2001; Elad and Bruckstein, 2002). In particular, Donoho observed that for most underdetermined systems of linear equations, the minimal $\ell_1$-norm solution is also the sparsest solution (Donoho, 2004). Because $n \ll p$ for gene expression data, we have an underdetermined system. To obtain a sparse $\beta$ for high classification power and robustness to noise, we make use of $\ell_1$-minimization techniques.
In summary, taking into account the approximation accuracy and model overfitting while avoiding noise accumulation, we arrive at the following constrained optimization for gene feature selection:
$$\min_\beta \|\beta\|_1 \quad \text{subject to} \quad \|y - X\beta\|_\infty \le \delta. \quad (5)$$
This is a convex optimization problem (Boyd and Vandenberghe, 2004), which can be easily transformed into a linear program (LP). Introducing an auxiliary optimization variable $r \in \mathbb{R}^p$ into Eq. (5) yields
$$\begin{aligned}
\min \;& r^T 1_p \\
\text{subject to} \;& \beta - r \le 0, \\
& -\beta - r \le 0, \\
& -X\beta \le -y + \delta 1_n, \\
& X\beta \le y + \delta 1_n,
\end{aligned} \quad (6)$$
where $1_k \in \mathbb{R}^k$ is a vector of $k$ ones, and "$\le$" holds for each component of the vectors in the vectorial inequalities. The optimization variables in Eq. (6) are $\beta$ and $r$.
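As a sanity check of this formulation, problem (6) can be handed directly to an off-the-shelf LP solver. The following is a minimal sketch using scipy.optimize.linprog; the solver choice, tolerances, and toy data are our own illustrations, not part of the paper:

import numpy as np
from scipy.optimize import linprog

def sparse_selector_lp(X, y, delta):
    """Solve min ||beta||_1 s.t. ||y - X beta||_inf <= delta via the LP in Eq. (6)."""
    n, p = X.shape
    I = np.eye(p)
    # variables are stacked as [beta; r]; inequality blocks follow Eq. (6)
    A_ub = np.block([
        [ I, -I],                      #  beta - r <= 0
        [-I, -I],                      # -beta - r <= 0
        [-X, np.zeros((n, p))],        # -X beta <= -y + delta 1
        [ X, np.zeros((n, p))],        #  X beta <=  y + delta 1
    ])
    b_ub = np.concatenate([np.zeros(2 * p), -y + delta, y + delta])
    c = np.concatenate([np.zeros(p), np.ones(p)])   # minimize 1^T r
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (2 * p), method="highs")
    return res.x[:p]                   # the estimated feature selector beta

# toy usage with a 2-sparse ground truth; typically the true support dominates
rng = np.random.default_rng(1)
X = rng.standard_normal((20, 100))
beta_true = np.zeros(100); beta_true[[3, 7]] = [1.0, -1.0]
y = X @ beta_true
beta_hat = sparse_selector_lp(X, y, delta=0.1)
print(np.argsort(-np.abs(beta_hat))[:5])   # indices of the largest coefficients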
Now the selection of the most salient features from gene expression data has been boiled down to solving an LP problem. Typically the number of genes, $p$, is in the range of 5000 to 30,000. For the LP (6), the optimization variables $\beta$ and $r$ have a total dimension of $2p$, and the number of inequality constraints is
$$m = 2p + 2n. \quad (7)$$
The dimension of the optimization variables and the number of inequality constraints are both in an approximate range of 10,000 to 60,000. The large-scale LP in Eq. (6) needs to be solved efficiently. To this end, we design a primal-dual interior-point method (Boyd and Vandenberghe, 2004) that is adapted to the LP with a large $p$. The method is specified in Section 3.

After applying the above sparsity optimization procedure, the estimated $\hat\beta$ is almost the sparsest solution
to the problem of approximating $y$ using $X\beta$. The significant components of $\hat\beta$ represent the most important features, which are retained whenever $|\hat\beta_k| \ge \gamma_0$, with $\gamma_0$ being a small positive threshold. The estimated $\hat\beta$, however, exhibits the so-called "shrinkage effect" (Fan and Li, 2006). That is, important features are underestimated in magnitude, leading to some important features being shrunken to zero. To
counteract this undesirable effect, we propose a trilogy method. Excluding the significant genes from the first-round optimization, this method performs a second-round sparsity optimization on the remaining genes. The purpose of the re-estimation is to pick up the significant genes that have been improperly shrunken previously. Subsequently, the chosen significant genes from both rounds are re-estimated using the least squares (LS) method or the support vector regression (SVR) method (Vapnik, 1998) to remove the shrinkage. LS and SVR are chosen for the re-estimation because they are unbiased. In summary, our trilogy method is as follows (a schematic code sketch is given after the list):
- First, apply the above sparsity optimization procedure and retain the resultant significant features. Denote this set of features by $G_0$.
- Second, remove the columns of $X$ that correspond to the genes in $G_0$. The new data matrix is denoted by $\bar{X}$. Apply the sparsity optimization procedure to $\bar{X}$ and choose the significant genes with another threshold $0 < \gamma_1 < \gamma_0$. The resulting significant feature set is denoted by $G_1$.
- Third, restrict the dataset $X$ to only the significant genes in $G = G_0 \cup G_1$. Apply the LS (or SVR) estimation to $X$ restricted to $G$.
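The following schematic sketch of the trilogy procedure assumes a solver solve_lp for problem (5), for instance the sparse_selector_lp sketch above; least squares is used for the final re-estimation, with SVR as a drop-in alternative:

import numpy as np

def trilogy_select(X, y, delta, gamma0, gamma1, solve_lp):
    """Schematic of the three-step (trilogy) selection procedure."""
    # Step 1: first-round sparsity optimization, keep strongly nonzero genes
    beta1 = solve_lp(X, y, delta)
    G0 = np.flatnonzero(np.abs(beta1) > gamma0)

    # Step 2: drop the G0 columns and re-run on the rest with a smaller threshold
    rest = np.setdiff1d(np.arange(X.shape[1]), G0)
    beta2 = solve_lp(X[:, rest], y, delta)
    G1 = rest[np.abs(beta2) > gamma1]

    # Step 3: least-squares re-estimation on G = G0 U G1 to remove the shrinkage
    G = np.union1d(G0, G1)
    coef, *_ = np.linalg.lstsq(X[:, G], y, rcond=None)
    return G, coef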
In this trilogy method, the core part is the sparsity optimization for large p. We specify an efficient way
for doing this in the next section.
3. PRIMAL-DUAL INTERIOR POINT FOR SPARSITY OPTIMIZATION
To extract the most salient features, we have formulated essentially an optimization-based approach to
estimating the feature selector b in Eq. (2). The optimization turns out to be a large-scale LP problem which
is the most critical part for our method. To efficiently solve the LP in Eq. (6), we exploit a primal-dual
interior-point technique. By defining the matrix $A \in \mathbb{R}^{m \times 2p}$, the vectors $b \in \mathbb{R}^m$ and $c \in \mathbb{R}^{2p}$, and $\tilde\beta \in \mathbb{R}^{2p}$ as follows, we put the LP parameters in block structure:
$$A = \begin{pmatrix} I & -I \\ -I & -I \\ -X & 0 \\ X & 0 \end{pmatrix}, \qquad
b = \begin{pmatrix} 0 \\ 0 \\ -y + \delta 1_n \\ y + \delta 1_n \end{pmatrix}, \qquad
c = \begin{pmatrix} 0 \\ 1_p \end{pmatrix}, \qquad
\tilde\beta = \begin{pmatrix} \beta \\ r \end{pmatrix}.$$
Following the notation of Boyd and Vandenberghe and using a logarithmic barrier function (Boyd and Vandenberghe, 2004), we define
$$f_0(\tilde\beta) = c^T \tilde\beta, \quad (8)$$
$$f(\tilde\beta) = A\tilde\beta - b, \quad (9)$$
$$r_{\mathrm{dual}} = c + A^T \lambda, \quad (10)$$
$$r_{\mathrm{cent}} = -\mathrm{diag}(\lambda)\, f(\tilde\beta) - (1/t)\, 1_m, \quad (11)$$
where $\lambda \in \mathbb{R}^m$ is the dual variable, the barrier parameter $t$ typically increases geometrically at each iteration, $r_{\mathrm{cent}}$ is the centrality residual, and $r_{\mathrm{dual}}$ is the dual residual (Boyd and Vandenberghe, 2004).
The LP in Eq. (6) becomes
$$\min_{\tilde\beta \in \mathbb{R}^{2p}} f_0(\tilde\beta) \quad \text{subject to} \quad f(\tilde\beta) \le 0. \quad (12)$$
The dual residual $r_{\mathrm{dual}}$ in Eq. (10) is obtained from
$$r_{\mathrm{dual}} = \nabla f_0(\tilde\beta) + Df(\tilde\beta)^T \lambda. \quad (13)$$
Here $\nabla$ is the gradient operator, and $Df$ is the derivative matrix of size $m \times 2p$ given by
$$Df(\tilde\beta) = D \begin{pmatrix} f_1(\tilde\beta) \\ \vdots \\ f_m(\tilde\beta) \end{pmatrix}
= \begin{pmatrix} \nabla f_1(\tilde\beta)^T \\ \vdots \\ \nabla f_m(\tilde\beta)^T \end{pmatrix}, \quad (14)$$
where $f_k(\tilde\beta)$ is the $k$th scalar inequality of $f(\tilde\beta)$ in Eq. (9).
The primal-dual interior-point method iteratively updates the primal-dual pair $(\tilde\beta, \lambda)$. Using the Newton step (Boyd and Vandenberghe, 2004), one solves the following linear system of equations to obtain an update of the primal-dual pair:
$$\begin{pmatrix} 0 & -A^T \\ \mathrm{diag}(\lambda)\, A & \mathrm{diag}(f(\tilde\beta)) \end{pmatrix}
\begin{pmatrix} \Delta\tilde\beta \\ \Delta\lambda \end{pmatrix}
= \begin{pmatrix} r_{\mathrm{dual}} \\ r_{\mathrm{cent}} \end{pmatrix}. \quad (15)$$
The primal and dual search directions are coupled in that the primal search direction $\Delta\tilde\beta$ depends on the current values of both the dual and primal variables. After obtaining the primal-dual search direction, the current values of the primal-dual pair are updated by assigning to them the new values
$$(\tilde\beta^+, \lambda^+) = (\tilde\beta, \lambda) + s\,(\Delta\tilde\beta, \Delta\lambda), \quad (16)$$
where $(\tilde\beta^+, \lambda^+)$ is the next iterate pair, and $s$ is the step size determined by a standard backtracking line search (Boyd and Vandenberghe, 2004). The step size is based on the norm of the residual, and one needs to ensure that $\lambda$ is positive and $f(\tilde\beta)$ is negative, both elementwise. The primal-dual interior-point algorithm
typically terminates after the size of the residual vectors and the surrogate gap are smaller than specified
tolerances (Boyd and Vandenberghe, 2004).
From Eq. (15) above, we obtain the Newton search direction for $\Delta\tilde\beta$ from
$$A^T \mathrm{diag}(\lambda / f(\tilde\beta))\, A\, \Delta\tilde\beta = r_{\mathrm{dual}} + A^T \mathrm{diag}(1 / f(\tilde\beta))\, r_{\mathrm{cent}}. \quad (17)$$
We decompose $\mathrm{diag}(\lambda / f(\tilde\beta))$ and $f(\tilde\beta)$ into the following block structure:
$$\mathrm{diag}(\lambda / f(\tilde\beta)) = \mathrm{diag}(G_1, G_2, G_3, G_4), \qquad
f(\tilde\beta) = \begin{pmatrix} g_1 \\ g_2 \\ g_3 \\ g_4 \end{pmatrix}
= \begin{pmatrix} \beta - r \\ -\beta - r \\ -X\beta - \delta 1_n + y \\ X\beta - \delta 1_n - y \end{pmatrix}, \quad (18)$$
where $\lambda = (\lambda_1^T, \lambda_2^T, \lambda_3^T, \lambda_4^T)^T$, with $\lambda_k, g_k \in \mathbb{R}^p$ for $k = 1, 2$, $\lambda_3, \lambda_4, g_3, g_4 \in \mathbb{R}^n$, and $G_i = \mathrm{diag}(\lambda_i / g_i)$, $i = 1, \ldots, 4$.
Therefore, we have
$$(A^T \mathrm{diag}(\lambda / f(\tilde\beta))\, A)\, \Delta\tilde\beta
= \begin{pmatrix} (G_1 + G_2) + X^T (G_3 + G_4) X & -G_1 + G_2 \\ -G_1 + G_2 & G_1 + G_2 \end{pmatrix}
\begin{pmatrix} \Delta\beta \\ \Delta r \end{pmatrix}. \quad (19)$$
Decomposing $r_{\mathrm{dual}} + A^T \mathrm{diag}(1 / f(\tilde\beta))\, r_{\mathrm{cent}}$ into $\begin{pmatrix} r_1 \\ r_2 \end{pmatrix}$, where $r_i \in \mathbb{R}^p$, $i = 1, 2$, we have
$$\bigl(4 G_1 G_2 + (G_1 + G_2)\, X^T (G_3 + G_4) X\bigr)\, \Delta\beta = (G_1 + G_2)\, r_1 + (G_1 - G_2)\, r_2, \quad (20)$$
and
$$\Delta r = (G_1 + G_2)^{-1} r_2 - (G_1 + G_2)^{-1} (-G_1 + G_2)\, \Delta\beta. \quad (21)$$
From the above equations, we obtain the update for $\Delta\tilde\beta = \begin{pmatrix} \Delta\beta \\ \Delta r \end{pmatrix}$. This mainly involves solving a $p$-by-$p$ linear system of equations by block elimination. Afterward, we can get the update for the dual variable from Eq. (15) using
$$\Delta\lambda = -\mathrm{diag}(1 / f(\tilde\beta)) \bigl( \mathrm{diag}(\lambda)\, A\, \Delta\tilde\beta - r_{\mathrm{cent}} \bigr). \quad (22)$$
It can be seen that essentially each Newton step needs to solve a $p$-by-$p$ system of linear equations. The surrogate duality gap is
$$\hat\eta(\tilde\beta, \lambda) = -f(\tilde\beta)^T \lambda. \quad (23)$$
When both the primal-dual gap and the size of the residual vector fall below specified tolerances, the Newton iteration is terminated. Alternatively, we may terminate the algorithm after a certain maximum number of iterations, or when the algorithm converges, that is, when consecutive values of the optimization variables are sufficiently close.
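As a concrete illustration of the block elimination, a minimal NumPy sketch of one Newton step following Eqs. (17)-(22) could look as follows; this is our own illustrative code, the barrier parameter is written as $t$, and only the diagonal vectors of the $G_i$ are formed:

import numpy as np

def newton_step(X, y, delta, beta, r, lam, t):
    """One block-eliminated primal-dual Newton step for the LP (6), Eqs. (17)-(22)."""
    n, p = X.shape
    l1, l2 = lam[:p], lam[p:2 * p]
    l3, l4 = lam[2 * p:2 * p + n], lam[2 * p + n:]
    # inequality values f(beta~) = A beta~ - b, partitioned as in Eq. (18)
    g1, g2 = beta - r, -beta - r
    g3, g4 = -X @ beta - delta + y, X @ beta - delta - y
    f = np.concatenate([g1, g2, g3, g4])
    # diagonal blocks G_i = diag(lambda_i / g_i), stored as vectors
    G1, G2, G3, G4 = l1 / g1, l2 / g2, l3 / g3, l4 / g4
    # residuals of Eqs. (10)-(11); c = [0_p; 1_p]
    c = np.concatenate([np.zeros(p), np.ones(p)])
    At_lam = np.concatenate([l1 - l2 - X.T @ l3 + X.T @ l4, -l1 - l2])
    r_dual = c + At_lam
    r_cent = -lam * f - 1.0 / t
    # right-hand side r_dual + A^T diag(1/f) r_cent, split into (r1, r2)
    w = r_cent / f
    rhs = r_dual + np.concatenate(
        [w[:p] - w[p:2 * p] - X.T @ w[2 * p:2 * p + n] + X.T @ w[2 * p + n:],
         -w[:p] - w[p:2 * p]])
    r1, r2 = rhs[:p], rhs[p:]
    # reduced p-by-p system of Eq. (20)
    M = np.diag(4.0 * G1 * G2) + (G1 + G2)[:, None] * (X.T @ ((G3 + G4)[:, None] * X))
    d_beta = np.linalg.solve(M, (G1 + G2) * r1 + (G1 - G2) * r2)
    # back-substitution, Eqs. (21)-(22)
    d_r = r2 / (G1 + G2) - ((G2 - G1) / (G1 + G2)) * d_beta
    A_dstep = np.concatenate([d_beta - d_r, -d_beta - d_r, -X @ d_beta, X @ d_beta])
    d_lam = -(1.0 / f) * (lam * A_dstep - r_cent)
    return d_beta, d_r, d_lam

The dominant cost per step is the $p$-by-$p$ solve for $\Delta\beta$, in agreement with the discussion above.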
4. ALGORITHM FOR FEATURE EXTRACTION USING SPARSITY OPTIMIZATION
We now summarize the algorithm for feature selection from gene expression data using sparsity opti-
mization in Algorithm 1.
A standard backtracking line search (Boyd and Vandenberghe, 2004) is used in Step 3 of Algorithm 1.
First we compute
$$s_{\max} = \sup\{\, s \in [0, 1] \mid \lambda + s\,\Delta\lambda \ge 0 \,\}. \quad (24)$$
Algorithm 1  Feature selection using sparsity optimization with the primal-dual interior-point method

Given the notation in Eqs. (8)-(11).
Given $\tilde\beta$ that satisfies $f_k(\tilde\beta) < 0$, $k = 1, \ldots, m$; $\lambda > 0$; $\mu > 1$; $\epsilon_{\mathrm{feas}} > 0$; $\epsilon > 0$; Max_Iter $\ge 1$; $\gamma_0 \ge \gamma_1 > 0$.
Set Num_Iter ← 0.
Repeat
  1) Determine $t$: set $t \leftarrow \mu m / \hat\eta$.
  2) Compute the primal-dual search directions $\Delta\beta$, $\Delta r$, and $\Delta\lambda$ using Eqs. (20)-(22).
  3) Line search and update: determine the step length $s > 0$ and set the new primal-dual pair by $(\tilde\beta, \lambda) \leftarrow (\tilde\beta, \lambda) + s(\Delta\tilde\beta, \Delta\lambda)$. Set Num_Iter ← Num_Iter + 1.
Until { $r_{\mathrm{feas}} < \epsilon_{\mathrm{feas}}$ and $\hat\eta(\tilde\beta, \lambda) < \epsilon$, or Num_Iter > Max_Iter }
Choose the $N$ nonzero $\beta_k$ such that $|\beta_k| > \gamma_0$ to get the set of significant genes $G_0$ (the first step in the trilogy method). Apply the second step in the trilogy method to get $G_1$ with $\gamma_1 \le \gamma_0$. Finally, apply LS (or SVR) to re-estimate and rank the significant genes in $G = G_0 \cup G_1$.
Then we start backtracking with $s = 0.99\, s_{\max}$ and multiply $s$ by a factor in $(0, 1)$ until we have $f(\tilde\beta^+) < 0$. We define the norm of the primal and dual residuals in the termination condition of Algorithm 1 as
$$r_{\mathrm{feas}} = \bigl( \|r_{\mathrm{prim}}\|_2^2 + \|r_{\mathrm{dual}}\|_2^2 \bigr)^{1/2}. \quad (25)$$
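A sketch of this backtracking scheme is given below; the helper functions f_of and res_of, which are assumed to return the inequality values $f(\tilde\beta)$ and the stacked residual vector used in Eq. (25), are illustrative placeholders rather than part of the paper:

import numpy as np

def backtrack_step(f_of, res_of, z, lam, dz, dlam, alpha=0.01, rho=0.5):
    """Backtracking line search: keep lambda > 0 and f < 0 elementwise,
    then shrink s until the residual norm decreases sufficiently."""
    neg = dlam < 0
    s_max = 1.0 if not np.any(neg) else min(1.0, float(np.min(-lam[neg] / dlam[neg])))
    s = 0.99 * s_max
    while np.any(f_of(z + s * dz) >= 0):        # stay strictly feasible, Eq. (24)
        s *= rho
    r0 = np.linalg.norm(res_of(z, lam))
    while np.linalg.norm(res_of(z + s * dz, lam + s * dlam)) > (1.0 - alpha * s) * r0:
        s *= rho
    return z + s * dz, lam + s * dlam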
After the sparsity optimization converges, typically only a tiny fraction of the $\beta_k$ are nonzero while the rest vanish. Among these nonzero $\beta_k$, only a small fraction have relatively large magnitudes, while the others have very small magnitudes that have an insignificant effect on the final classification performance. Therefore, we choose $N$ nonzero $\beta_k$ that represent a high percentage of the total magnitude of the nonzero $\beta_k$. Usually the percentage is 95% or 85% in our algorithm; these values are also used in our experimental evaluations.
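A simple way to implement this magnitude-coverage rule (an illustrative sketch; the 95% or 85% level is passed as the coverage argument) is:

import numpy as np

def select_top_genes(beta_hat, coverage=0.95):
    """Keep the smallest set of coefficients whose magnitudes account for a
    given fraction of the total magnitude of beta_hat."""
    mags = np.abs(beta_hat)
    order = np.argsort(mags)[::-1]          # indices sorted by decreasing magnitude
    cum = np.cumsum(mags[order])
    N = int(np.searchsorted(cum, coverage * cum[-1])) + 1
    return order[:N]                        # indices of the selected genes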
5. EXPERIMENTAL RESULTS
5.1. Leukemia dataset
We refer to our sparsity optimization approach to constructing the most salient features for gene expression data as SOGE. We implement SOGE and obtain promising results. We classify the leukemia gene microarray data originally obtained by Golub et al. (1999). There are 7129 genes, 38 training samples, and 34 testing samples coming from two classes: ALL (acute lymphocytic leukemia) and AML (acute myelogenous leukemia). We follow exactly the same preprocessing steps as in Dudoit et al. (2002). After preprocessing, the number of genes is reduced to 3571 out of 7129. The data after preprocessing are the same as in Tibshirani et al. (2002) and Fan and Fan (2007). Then we use our SOGE for feature selection. From Figure 1, we can see that after about 30 iterations, the first-round optimization converges. The resultant feature
selector $\beta$ is plotted in Figure 2. We then sort the magnitudes of the $\beta_k$ and plot the largest 35 in Figure 2.

FIG. 1.  Convergence behaviors of the surrogate duality gap (a), and of the norm of the primal and dual residuals (b), for the leukemia data.

It can be seen that only a very small fraction of the coefficients of $\beta$ is nonzero, from which we choose those having
the largest magnitudes to form $G$. Then the rest of the procedures in Algorithm 1 are followed to select the most important genes.
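For reference, the preprocessing referenced above (Dudoit et al., 2002) consists of intensity thresholding, a fold-change and absolute-difference filter, and a base-10 log transformation; the following minimal sketch uses the commonly reported cutoff values, which should be treated here as assumptions rather than as taken from this paper:

import numpy as np

def dudoit_preprocess(X_raw, floor=100.0, ceil=16000.0, fold=5.0, diff=500.0):
    """Dudoit-style preprocessing sketch: (i) threshold intensities to [floor, ceil],
    (ii) drop genes with max/min <= fold or max - min <= diff, (iii) log10 transform."""
    X = np.clip(X_raw, floor, ceil)                      # (i) thresholding
    gmax, gmin = X.max(axis=0), X.min(axis=0)
    keep = (gmax / gmin > fold) & (gmax - gmin > diff)   # (ii) filtering
    return np.log10(X[:, keep]), keep                    # (iii) log transform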
Afterwards, we simply use the IR classification rule (Bickel and Levina, 2004) for classification for fair comparisons, as FAIR uses IR for classification after its feature selection. Because SOGE is a general feature selection method, the selected features can also be used with other classifiers, such as support vector machines or decision trees, with even better classification results. We compare our results to the state-of-the-art methods: the nearest shrunken centroid (NSC) method proposed by Tibshirani et al. (2002), and the feature annealed independence rule (FAIR) by Fan and Fan (2007). From Table 1, it can be seen that our results for both training and testing are better than those of NSC and FAIR. With six selected genes, SOGE has no training error out of 38 samples and only 1 testing error out of 34 samples. Especially, SOGE discovers the two most important genes, with which we have no training error out of 38 samples and 2 testing errors out of 34 samples.
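For completeness, the IR rule used here (diagonal LDA) assigns a sample to the class whose centroid is nearest under per-gene variance scaling, ignoring between-gene covariances and class priors; the following minimal sketch is our own illustration, applied to the selected genes only:

import numpy as np

def ir_fit(X, y):
    """Fit the independence rule (diagonal LDA) on the selected genes."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # per-gene variances averaged over classes (a simple pooled estimate)
    var = np.mean([X[y == c].var(axis=0, ddof=1) for c in classes], axis=0) + 1e-12
    return classes, means, var

def ir_predict(X, classes, means, var):
    # classify to the class minimizing the variance-weighted distance to its centroid
    d = ((X[:, None, :] - means[None, :, :]) ** 2 / var).sum(axis=2)
    return classes[np.argmin(d, axis=1)]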
5.2. Prostate cancer dataset
We have also applied our feature selection method to the Prostate Cancer dataset, used to distinguish normal from tumor prostate samples. The dataset is available from Singh et al. (2002). There are 12600 genes and 102 patient samples for training, 52 of which are tumor samples and 50 of which are normal samples. We use an independent dataset for testing (Welsh et al., 2001), which has 25 tumor and 9 normal samples from a different experiment. The testing dataset has a nearly 10-fold difference in overall microarray intensity from the training data. We follow exactly the same preprocessing steps as in Dudoit et al. (2002). After preprocessing, the number of genes is reduced to 2803 out of 12600. The data after preprocessing are the same as in Tibshirani et al. (2002) and Fan and Fan (2007). Then we use the SOGE algorithm for feature selection. From Figure 3, we can see that after about 30 iterations, the first-round optimization converges. The resultant feature selector $\beta$ is plotted in Figure 4. A similar conclusion to that for the leukemia dataset can be drawn.
Our testing results are better than those of both NSC and FAIR, though our training errors are slightly larger (Table 2). Specifically, with six genes, SOGE makes nine training errors out of 102 samples and only four testing errors out of 34 samples. This can be compared with NSC, which with six genes makes eight training errors and nine testing errors. We also compare with FAIR: with two selected genes, FAIR makes nine testing errors out of 34 samples and 10 training errors out of 102 samples.
Table 1. Classification Errors of Leukemia Data
Method Training error Test error No. of selected genes
NSC 1/38 3/34 21
FAIR 1/38 1/34 11
SOGE 0/38 1/34 6
SOGE 0/38 2/34 2
FIG. 2.  Resultant feature selector $\beta$ for the leukemia data. (a) Final result of $\beta$. (b) Sorted magnitudes of the $\beta_k$; only the largest 35 are shown.
Table 2. Classification Errors of Prostate Cancer Data
Method Training error Test error No. of selected genes
NSC 8/102 9/34 6
FAIR 10/102 9/34 2
SOGE 9/102 4/34 6
SOGE 14/102 4/34 3
FIG. 4.  Resultant feature selector $\beta$ for the prostate cancer data. (a) Final result of $\beta$. (b) Sorted magnitudes of the $\beta_k$; only the largest 50 are shown.
FIG. 3.  Convergence behaviors of the surrogate duality gap (a), and of the norm of the primal and dual residuals (b), for the prostate cancer data.
FIG. 5.  Convergence behaviors of the surrogate duality gap (a), and of the norm of the primal and dual residuals (b), for the lung cancer data.
As can be seen from Table 2, our method makes fewer testing errors with approximately the same number of genes as NSC or FAIR.
5.3. Lung cancer dataset
We have also applied our feature selection method to the lung cancer data used to classify between malignant pleural mesothelioma (MPM) and adenocarcinoma (ADCA) of the lung. The dataset is available from Gordon et al. (2002). There are 12533 genes, 32 training samples, and 149 testing samples. We follow exactly the same preprocessing steps as in Dudoit et al. (2002). After preprocessing, the number of genes is reduced to 2532 out of 12533. The data after preprocessing are the same as in Tibshirani et al. (2002) and Fan and Fan (2007). Then we use our SOGE for feature selection. From Figure 5, we can see that after about 30 iterations, the first-round optimization converges. The resultant feature selector $\beta$ is plotted in Figure 6. A similar conclusion to that for the leukemia dataset can be drawn.
Our results are much better than those of both NSC and FAIR (Table 3). Specifically, with 19 genes, SOGE has no training errors out of 32 samples and only 1 testing error out of 149 samples. As can be seen from Table 3, our method uses far fewer genes and makes far fewer errors compared with NSC and FAIR.
6. CONCLUSION
We have proposed a multivariate feature selection method for gene expression data. Sparse optimization
techniques have been naturally applied to capture intrinsic structures of the high-dimensional data. We
exploit a primal-dual interior-point technique to efficiently solve the large-scale optimization program.
Experimental results show the effectiveness of the proposed scheme. A future line of research will include
further reducing the computational complexity.
ACKNOWLEDGMENTS
We would like to thank Drs. M. Zargham and M. Sayeh for their helpful discussions. This work is
partially supported by funding from Southern Illinois University Carbondale.
Table 3. Classification Errors of Lung Cancer Data
Method Training error Test error No. of selected genes
NSC 0/32 11/149 26
FAIR 0/32 7/149 31
SOGE 0/32 3/149 11
SOGE 0/32 1/149 19
FIG. 6.  Resultant feature selector $\beta$ for the lung cancer data. (a) Final result of $\beta$. (b) Sorted magnitudes of the $\beta_k$; only the largest 35 are shown.
DISCLOSURE STATEMENT
No competing financial interests exist.
REFERENCES
Antoniadis, A., Lambert-Lacroix, S., and Leblanc, F. 2003. Effective dimension reduction methods for tumor classi-
fication using gene expression data. Bioinformatics 19, 563–570.
Bair, E., Hastie, T., Paul, D., and Tibshirani, R. 2006. Prediction by supervised principal components. J. Am. Statist. Assoc. 101, 119–137.
Barker, M., and Rayens, W. 2003. Partial least squares for discrimination. J Chemometrics. 17, 166–173.
Bickel, P., and Levina, E. 2004. Some theory of Fisher’s linear discriminant function, ‘‘naive Bayes,’’ and some
alternatives where there are many more variables than observations. Bernoulli 10, 989–1010.
Boyd, S., and Vandenberghe, L. 2004. Convex Optimization. Cambridge University Press, Cambridge, UK.
Breiman, L. 2001. Random forests. Machine Learn. 45, 5–32.
Bura, E., and Pfeiffer, R.M. 2003. Graphical methods for class prediction using dimension reduction techniques on
DNA microarray data. Bioinformatics 19, 1252–1258.
Chiaromonte, F., and Martinelli, J. 2002. Dimension reduction strategies for analyzing global gene expression data with
a response. Math. Biosci. 176, 123–144.
Diaz-Uriarte, R., and Alvarez de Andres, S. 2006. Gene selection and classification of microarray data using random
forest. BMC Bioinform. 7, 3.
Domingos, P., and Pazzani, M. 1997. On the optimality of the simple Bayesian classifier under zero-one loss. Machine
Learn. 29, 103–130.
Donoho, D.L. 2004. For most large underdetermined systems of linear equations the minimal l1-norm solution is also
the sparsest solution. Manuscript. Department of Statistics Stanford University, Palo Alto, CA.
Donoho, D.L., and Huo, X. 2001. Uncertainty principles and ideal atomic decomposition. IEEE Trans. Inform. Theory 47,
2845–2862.
Dudoit, S., Fridlyand, J., and Speed, T. 2002. Comparison of discrimination methods for the classification of tumors
using gene expression data. J. Am. Statist. Assoc. 97, 77–87.
Elad, M., and Bruckstein, A.M. 2002. A generalized uncertainty principle and sparse representation in pairs of bases.
IEEE Trans. Inform. Theory 48, 2558–2567.
Fan, J., and Fan, Y. 2008. High-dimensional classification using features annealed independence rules. The Annals of
Statistics, 36, 2232–2260.
Fan, J., and Li, R. 2006. Statistical challenges with high dimensionality: Feature selection in knowledge discovery.
Proc. of the Int. Congress of Mathematicians (M. Sanz-sole, J. Soria, J.L. Varona, J. Verdera, eds.).
Fidler, S., Slocaj, D., and Leonardis, A. 2006. Combining reconstructive and discriminative subspace methods for
robust classification and regression by subsampling. IEEE Trans. Pattern Analy. Mach. Intellig. 28, 337–350.
Foster, D.P., and George, E.I. 1994. The risk inflation criterion for multiple regression. Ann. Statist. 22, 1947–1975.
Ghosh, D. 2002. Singular value decomposition regression modeling for classification of tumors from microarray
experiments. Proc. Pacif. Symp. Biocomput. 98, 11462–11467.
Golub, T.R., et al. 1999. Molecular classification of cancer: class discovery and class prediction by gene expression
monitoring. Science 286, 531–537. Available at: www.broad.mit.edu/cgi-bin/cancer/datasets.cgi. Accessed May 1,
2009.
Gordon, G.J., et al. 2002. Translation of microarray data into clinically relevant cancer diagnostic tests using gene
expression ratios in lung cancer and mesothelioma. Cancer Res. 62, 4963–4967. Available at: www.chestsurg.org.
Accessed May 1, 2009.
Huang, X., and Pan, W. 2003. Linear regression and two-class classification with gene expression data. Bioinformatics
19, 2072–2078.
Jolliffe, I.T. 1986. Principal Component Analysis. Springer-Verlag, New York.
Martinez, A., and Kak, A. 2001. PCA versus LDA. IEEE Trans. Pattern Analy. Mach. Intellig. 23, 228–233.
Nguyen, D., and Rocke, D.M. 2002. Tumor classification by partial least squares using microarray gene expression
data. Bioinformatics 18, 39–50.
Singh, D., et al. 2002. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1, 203–209.
Available at: www.broad.mit.edu/cgi-bin/cancer/datasets.cgi. Accessed May 1, 2009.
Speed, T. 2003. Statistical Analysis of Gene Expression Microarray Data. Chapman & Hall/CRC, Boca Raton, FL.
Tibshirani, R., Hastie, T., Narasimhan, B., et al. 2002. Diagnosis of multiple cancer types by shrunken centroids of gene
expression. Proc. Natl. Acad. Sci. USA 99, 6567–6572.
Vapnik, V.N. 1998. Statistical Learning Theory. Wiley, New York.
Welsh, J.B., Sapinoso, L.M., Su, A.I., et al. 2001. Analysis of gene expression identifies candidate markers and
pharmacological targets in prostate cancer. Cancer Res. 61, 5974–5978.
Zhang, A. 2006. Advanced Analysis of Gene Expression Microarray Data (Science, Engineering, and Biology Infor-
matics). World Scientific Publishing Company, New York.
Address correspondence to:
Dr. Qiang Cheng
Faner Hall, Room 2140
Mailcode 4511, 1000 Faner Drive
Carbondale, IL 62901
E-mail: [email protected]