
Sparse Principal Component Analysis

Hui ZOU, Trevor HASTIE, and Robert TIBSHIRANI

Principal component analysis (PCA) is widely used in data processing and dimensionality reduction. However, PCA suffers from the fact that each principal component is a linear combination of all the original variables, thus it is often difficult to interpret the results. We introduce a new method called sparse principal component analysis (SPCA) using the lasso (elastic net) to produce modified principal components with sparse loadings. We first show that PCA can be formulated as a regression-type optimization problem; sparse loadings are then obtained by imposing the lasso (elastic net) constraint on the regression coefficients. Efficient algorithms are proposed to fit our SPCA models for both regular multivariate data and gene expression arrays. We also give a new formula to compute the total variance of modified principal components. As illustrations, SPCA is applied to real and simulated data with encouraging results.

Key Words: Arrays; Gene expression; Lasso/elastic net; Multivariate analysis; Singular value decomposition; Thresholding.

1. INTRODUCTION

Principal component analysis (PCA) (Jolliffe 1986) is a popular data-processing and dimension-reduction technique, with numerous applications in engineering, biology, and social science. Some interesting examples include handwritten zip code classification (Hastie, Tibshirani, and Friedman 2001) and human face recognition (Hancock, Burton, and Bruce 1996). Recently PCA has been used in gene expression data analysis (Alter, Brown, and Botstein 2000). Hastie et al. (2000) proposed the so-called gene shaving techniques using PCA to cluster highly variable and coherent genes in microarray datasets.

PCA seeks the linear combinations of the original variables such that the derived variables capture maximal variance. PCA can be computed via the singular value decomposition (SVD) of the data matrix. In detail, let the data X be an n × p matrix, where n and p are the

Hui Zou is Assistant Professor, School of Statistics, University of Minnesota, Minneapolis, MN 55455 (E-mail: hzou@stat.umn.edu). Trevor Hastie is Professor, Department of Statistics, Stanford University, Stanford, CA 94305 (E-mail: hastie@stat.stanford.edu). Robert Tibshirani is Professor, Department of Health Research Policy, Stanford University, Stanford, CA 94305 (E-mail: tibs@stat.stanford.edu).

© 2006 American Statistical Association, Institute of Mathematical Statistics, and Interface Foundation of North America

Journal of Computational and Graphical Statistics, Volume 15, Number 2, Pages 265–286. DOI: 10.1198/106186006X113430



number of observations and the number of variables, respectively. Without loss of generality, assume the column means of X are all 0. Let the SVD of X be

$$\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T. \qquad (1.1)$$

Z = UD are the principal components (PCs), and the columns of V are the corresponding loadings of the principal components. The sample variance of the ith PC is $D_{ii}^2/n$. In gene expression data the standardized PCs U are called the eigen-arrays and V are the eigen-genes (Alter, Brown, and Botstein 2000). Usually the first q ($q \ll \min(n, p)$) PCs are chosen to represent the data, thus a great dimensionality reduction is achieved.
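To make this SVD view concrete, here is a minimal Python/NumPy sketch (our illustration, not part of the original article; the data are simulated placeholders) computing the PCs, loadings, and per-component sample variances exactly as described above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)                 # center columns, as assumed in the text

U, d, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T                            # columns of V are the loadings
Z = U * d                           # principal components Z = UD
sample_var = d**2 / n               # variance of the ith PC is D_ii^2 / n

# the PCs coincide with XV, and their variances match d^2 / n
assert np.allclose(Z, X @ V)
assert np.allclose(Z.var(axis=0, ddof=0), sample_var)
```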

The success of PCA is due to the following two important optimal properties:
1. principal components sequentially capture the maximum variability among the columns of X, thus guaranteeing minimal information loss;
2. principal components are uncorrelated, so we can talk about one principal component without referring to others.
However, PCA also has an obvious drawback, that is, each PC is a linear combination of all p variables and the loadings are typically nonzero. This makes it often difficult to interpret the derived PCs. Rotation techniques are commonly used to help practitioners to interpret principal components (Jolliffe 1995). Vines (2000) considered simple principal components by restricting the loadings to take values from a small set of allowable integers such as 0, 1, and -1.

We feel it is desirable not only to achieve the dimensionality reduction but also to reduce the number of explicitly used variables. An ad hoc way to achieve this is to artificially set the loadings with absolute values smaller than a threshold to zero. This informal thresholding approach is frequently used in practice, but can be potentially misleading in various respects (Cadima and Jolliffe 1995). McCabe (1984) presented an alternative to PCA which found a subset of principal variables. Jolliffe, Trendafilov, and Uddin (2003) introduced SCoTLASS to get modified principal components with possible zero loadings.

The same interpretation issues arise in multiple linear regression, where the response is predicted by a linear combination of the predictors. Interpretable models are obtained via variable selection. The lasso (Tibshirani 1996) is a promising variable selection technique, simultaneously producing accurate and sparse models. Zou and Hastie (2005) proposed the elastic net, a generalization of the lasso, which has some advantages. In this article we introduce a new approach for estimating PCs with sparse loadings, which we call sparse principal component analysis (SPCA). SPCA is built on the fact that PCA can be written as a regression-type optimization problem with a quadratic penalty; the lasso penalty (via the elastic net) can then be directly integrated into the regression criterion, leading to a modified PCA with sparse loadings.

In the next section we briefly review the lasso and the elastic net. The methodological details of SPCA are presented in Section 3. We present an efficient algorithm for fitting the SPCA model. We also derive an appropriate expression for representing the variance explained by modified principal components. In Section 4 we consider a special case of the SPCA algorithm for handling gene expression arrays efficiently. The proposed methodology


is illustrated by using real data and simulation examples in Section 5. Discussions are in Section 6. The article ends with an Appendix summarizing technical details.

2. THE LASSO AND THE ELASTIC NET

Consider the linear regression model with n observations and p predictors. Let $\mathbf{Y} = (y_1, \ldots, y_n)^T$ be the response vector and $\mathbf{X} = [\mathbf{X}_1, \ldots, \mathbf{X}_p]$, $j = 1, \ldots, p$ the predictors, where $\mathbf{X}_j = (x_{1j}, \ldots, x_{nj})^T$. After a location transformation we can assume all the $\mathbf{X}_j$ and Y are centered.

The lasso is a penalized least squares method, imposing a constraint on the L1 norm of the regression coefficients. Thus the lasso estimates $\hat{\beta}_{\mathrm{lasso}}$ are obtained by minimizing the lasso criterion

$$\hat{\beta}_{\mathrm{lasso}} = \arg\min_{\beta} \left\| \mathbf{Y} - \sum_{j=1}^{p} \mathbf{X}_j \beta_j \right\|^2 + \lambda \sum_{j=1}^{p} |\beta_j|, \qquad (2.1)$$

where λ is non-negative. The lasso was originally solved by quadratic programming (Tibshirani 1996). Efron, Hastie, Johnstone, and Tibshirani (2004) showed that the lasso estimates $\hat{\beta}$ are piecewise linear as a function of λ, and proposed an algorithm called LARS to efficiently solve the entire lasso solution path in the same order of computations as a single least squares fit. The piecewise linearity of the lasso solution path was first proved by Osborne, Presnell, and Turlach (2000), where a different algorithm was proposed to solve the entire lasso solution path.

The lasso continuously shrinks the coefficients toward zero, and achieves its prediction accuracy via the bias variance trade-off. Due to the nature of the L1 penalty, some coefficients will be shrunk to exact zero if λ1 is large enough. Therefore the lasso simultaneously produces both an accurate and sparse model, which makes it a favorable variable selection method. However, the lasso has several limitations, as pointed out by Zou and Hastie (2005). The most relevant one to this work is that the number of variables selected by the lasso is limited by the number of observations. For example, if applied to microarray data where there are thousands of predictors (genes) (p > 1000) with less than 100 samples (n < 100), the lasso can only select at most n genes, which is clearly unsatisfactory.

The elastic net (Zou and Hastie 2005) generalizes the lasso to overcome these drawbacks, while enjoying its other favorable properties. For any non-negative λ1 and λ2, the elastic net estimates $\hat{\beta}_{\mathrm{en}}$ are given as follows:

$$\hat{\beta}_{\mathrm{en}} = (1 + \lambda_2)\left\{ \arg\min_{\beta} \left\| \mathbf{Y} - \sum_{j=1}^{p} \mathbf{X}_j \beta_j \right\|^2 + \lambda_2 \sum_{j=1}^{p} |\beta_j|^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| \right\}. \qquad (2.2)$$

The elastic net penalty is a convex combination of the ridge and lasso penalties. Obviously, the lasso is a special case of the elastic net when λ2 = 0. Given a fixed λ2, the LARS-EN algorithm (Zou and Hastie 2005) efficiently solves the elastic net problem for all λ1, with the computational cost of a single least squares fit. When p > n, we choose some


λ2 > 0. Then the elastic net can potentially include all variables in the fitted model, so this particular limitation of the lasso is removed. Zou and Hastie (2005) compared the elastic net with the lasso and discussed the application of the elastic net as a gene selection method in microarray analysis.
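For readers who want a concrete reference point for criterion (2.2), the following is a small coordinate-descent sketch in Python/NumPy, assuming centered X and Y; it is not the LARS-EN algorithm used in the paper, and the function names and iteration counts are illustrative choices of ours.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding operator: sign(z) * (|z| - gamma)_+."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def naive_elastic_net(X, y, lam1, lam2, n_iter=500):
    """Coordinate descent for ||y - X b||^2 + lam2 ||b||^2 + lam1 ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]     # partial residual excluding X_j
            z = X[:, j] @ r_j
            beta[j] = soft_threshold(z, lam1 / 2.0) / (col_sq[j] + lam2)
    return beta

def elastic_net(X, y, lam1, lam2, n_iter=500):
    """Elastic net estimate (2.2): rescale the naive solution by (1 + lam2)."""
    return (1.0 + lam2) * naive_elastic_net(X, y, lam1, lam2, n_iter)
```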

3. MOTIVATION AND DETAILS OF SPCA

In both lasso and elastic net, the sparse coefficients are a direct consequence of the L1 penalty, and do not depend on the squared error loss function. Jolliffe, Trendafilov, and Uddin (2003) proposed SCoTLASS, an interesting procedure that obtains sparse loadings by directly imposing an L1 constraint on PCA. SCoTLASS successively maximizes the variance

$$a_k^T (\mathbf{X}^T\mathbf{X}) a_k \qquad (3.1)$$

subject to

$$a_k^T a_k = 1 \quad \text{and (for } k \ge 2\text{)} \quad a_h^T a_k = 0, \quad h < k; \qquad (3.2)$$

and the extra constraints
$$\sum_{j=1}^{p} |a_{kj}| \le t \qquad (3.3)$$

for some tuning parameter t. Although sufficiently small t yields some exact zero loadings, there is not much guidance with SCoTLASS in choosing an appropriate value for t. One could try several t values, but the high computational cost of SCoTLASS makes this an impractical solution. This high computational cost is probably due to the fact that SCoTLASS is not a convex optimization problem. Moreover, the examples in Jolliffe, Trendafilov, and Uddin (2003) showed that the loadings obtained by SCoTLASS are not sparse enough when one requires a high percentage of explained variance.

We consider a different approach to modifying PCA. We first show how PCA can be recast exactly in terms of a (ridge) regression problem. We then introduce the lasso penalty by changing this ridge regression to an elastic-net regression.

3.1 DIRECT SPARSE APPROXIMATIONS

We first discuss a simple regression approach to PCA. Observe that each PC is a linear combination of the p variables; thus its loadings can be recovered by regressing the PC on the p variables.

Theorem 1. For each i, denote by $Z_i = U_i D_{ii}$ the ith principal component. Consider a positive λ and the ridge estimates $\hat{\beta}_{\mathrm{ridge}}$ given by

$$\hat{\beta}_{\mathrm{ridge}} = \arg\min_{\beta} \| Z_i - \mathbf{X}\beta \|^2 + \lambda \|\beta\|^2. \qquad (3.4)$$


Let $\hat{v} = \hat{\beta}_{\mathrm{ridge}} / \|\hat{\beta}_{\mathrm{ridge}}\|$; then $\hat{v} = V_i$.

The theme of this simple theorem is to show the connection between PCA and a regression method. Regressing PCs on variables was discussed in Cadima and Jolliffe (1995), where they focused on approximating PCs by a subset of k variables. We extend it to a more general case of ridge regression in order to handle all kinds of data, especially gene expression data. Obviously, when n > p and X is a full rank matrix, the theorem does not require a positive λ. Note that if p > n and λ = 0, ordinary multiple regression has no unique solution that is exactly $V_i$. The same happens when n > p and X is not a full rank matrix. However, PCA always gives a unique solution in all situations. As shown in Theorem 1, this indeterminacy is eliminated by the positive ridge penalty ($\lambda\|\beta\|^2$). Note that after normalization the coefficients are independent of λ, therefore the ridge penalty is not used to penalize the regression coefficients but to ensure the reconstruction of principal components.
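A quick numerical check of Theorem 1 (an illustrative Python/NumPy sketch with simulated data, not from the original article): for any positive λ, the normalized ridge coefficients from regressing $Z_i$ on X recover the loading vector $V_i$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 20, 50, 10.0            # p > n, so a positive ridge penalty is needed
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)

U, d, Vt = np.linalg.svd(X, full_matrices=False)
i = 0
Z_i = U[:, i] * d[i]                # ith principal component

# ridge regression of Z_i on X: (X^T X + lam I)^{-1} X^T Z_i
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Z_i)
v_hat = beta_ridge / np.linalg.norm(beta_ridge)

# v_hat equals the loading V_i (up to an arbitrary sign flip)
assert np.allclose(np.abs(v_hat @ Vt[i]), 1.0)
```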

Now let us add the L1 penalty to (3.4) and consider the following optimization problem:

$$\hat{\beta} = \arg\min_{\beta} \| Z_i - \mathbf{X}\beta \|^2 + \lambda \|\beta\|^2 + \lambda_1 \|\beta\|_1, \qquad (3.5)$$

where $\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|$ is the 1-norm of β. We call $\hat{V}_i = \hat{\beta} / \|\hat{\beta}\|$ an approximation to $V_i$, and $\mathbf{X}\hat{V}_i$ the ith approximated principal component. Zou and Hastie (2005) called (3.5) naive elastic net, which differs from the elastic net by a scaling factor (1 + λ). Since we are using the normalized fitted coefficients, the scaling factor does not affect $\hat{V}_i$. Clearly, large enough λ1 gives a sparse $\hat{\beta}$, hence a sparse $\hat{V}_i$. Given a fixed λ, (3.5) is efficiently solved for all λ1 by using the LARS-EN algorithm (Zou and Hastie 2005). Thus we can flexibly choose a sparse approximation to the ith principal component.

3.2 SPARSE PRINCIPAL COMPONENTS BASED ON THE SPCA CRITERION

Theorem 1 depends on the results of PCA, so it is not a genuine alternative. However, it can be used in a two-stage exploratory analysis: first perform PCA, then use (3.5) to find suitable sparse approximations.

We now present a "self-contained" regression-type criterion to derive PCs. Let $x_i$ denote the ith row vector of the matrix X. We first consider the leading principal component.

Theorem 2. For any λ > 0, let

$$(\hat{\alpha}, \hat{\beta}) = \arg\min_{\alpha, \beta} \sum_{i=1}^{n} \| x_i - \alpha\beta^T x_i \|^2 + \lambda \|\beta\|^2 \qquad (3.6)$$
$$\text{subject to } \|\alpha\|^2 = 1.$$

Then $\hat{\beta} \propto V_1$.

The next theorem extends Theorem 2 to derive the whole sequence of PCs.

Theorem 3. Suppose we are considering the first k principal components. Let $\mathbf{A}_{p\times k} = [\alpha_1, \ldots, \alpha_k]$ and $\mathbf{B}_{p\times k} = [\beta_1, \ldots, \beta_k]$. For any λ > 0, let

$$(\hat{\mathbf{A}}, \hat{\mathbf{B}}) = \arg\min_{\mathbf{A}, \mathbf{B}} \sum_{i=1}^{n} \| x_i - \mathbf{A}\mathbf{B}^T x_i \|^2 + \lambda \sum_{j=1}^{k} \|\beta_j\|^2 \qquad (3.7)$$
$$\text{subject to } \mathbf{A}^T\mathbf{A} = \mathbf{I}_{k\times k}.$$

Then $\hat{\beta}_j \propto V_j$ for $j = 1, 2, \ldots, k$.

Theorems 2 and 3 effectively transform the PCA problem to a regression-type problem. The critical element is the objective function $\sum_{i=1}^{n} \| x_i - \mathbf{A}\mathbf{B}^T x_i \|^2$. If we restrict B = A, then

$$\sum_{i=1}^{n} \| x_i - \mathbf{A}\mathbf{B}^T x_i \|^2 = \sum_{i=1}^{n} \| x_i - \mathbf{A}\mathbf{A}^T x_i \|^2,$$

whose minimizer under the orthonormal constraint on A is exactly the first k loading vectors of ordinary PCA. This formulation arises in the "closest approximating linear manifold" derivation of PCA (e.g., Hastie, Tibshirani, and Friedman 2001). Theorem 3 shows that we can still have exact PCA while relaxing the restriction B = A and adding the ridge penalty term. As can be seen later, these generalizations enable us to flexibly modify PCA.

The proofs of Theorems 2 and 3 are given in the Appendix; here we give an intuitive explanation. Note that

$$\sum_{i=1}^{n} \| x_i - \mathbf{A}\mathbf{B}^T x_i \|^2 = \| \mathbf{X} - \mathbf{X}\mathbf{B}\mathbf{A}^T \|^2. \qquad (3.8)$$

Since A is orthonormal, let $\mathbf{A}_\perp$ be any orthonormal matrix such that $[\mathbf{A}; \mathbf{A}_\perp]$ is p × p orthonormal. Then we have

$$\| \mathbf{X} - \mathbf{X}\mathbf{B}\mathbf{A}^T \|^2 = \| \mathbf{X}\mathbf{A}_\perp \|^2 + \| \mathbf{X}\mathbf{A} - \mathbf{X}\mathbf{B} \|^2 \qquad (3.9)$$
$$= \| \mathbf{X}\mathbf{A}_\perp \|^2 + \sum_{j=1}^{k} \| \mathbf{X}\alpha_j - \mathbf{X}\beta_j \|^2. \qquad (3.10)$$

Suppose A is given; then the optimal B minimizing (3.7) should minimize

$$\arg\min_{\mathbf{B}} \sum_{j=1}^{k} \| \mathbf{X}\alpha_j - \mathbf{X}\beta_j \|^2 + \lambda \|\beta_j\|^2, \qquad (3.11)$$

which is equivalent to k independent ridge regression problems. In particular, if A corresponds to the ordinary PCs, that is, A = V, then by Theorem 1 we know that B should be proportional to V. Actually, the above view points out an effective algorithm for solving (3.7), which is revisited in the next section.
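As a small illustration of this point (a Python/NumPy sketch of ours, using the closed-form ridge solution rather than any particular solver), the B-step with A fixed is $\hat{\mathbf{B}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{X}\mathbf{A}$, and with A = V each column of B is proportional to the corresponding loading vector.

```python
import numpy as np

def ridge_B_step(X, A, lam):
    """Given A (p x k), solve the k ridge problems in (3.11) in closed form."""
    p = X.shape[1]
    gram = X.T @ X
    return np.linalg.solve(gram + lam * np.eye(p), gram @ A)

# with A = V (the ordinary loadings), each column of B is proportional to V_j
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 6)); X -= X.mean(axis=0)
V = np.linalg.svd(X, full_matrices=False)[2].T
B = ridge_B_step(X, V[:, :3], lam=1.0)
cols = B / np.linalg.norm(B, axis=0)            # normalize columns
assert np.allclose(np.abs(np.sum(cols * V[:, :3], axis=0)), 1.0)
```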

We carry on the connection between PCA and regression, and use the lasso approach to produce sparse loadings ("regression coefficients"). For that purpose, we add the lasso penalty into the criterion (3.7) and consider the following optimization problem:

$$(\hat{\mathbf{A}}, \hat{\mathbf{B}}) = \arg\min_{\mathbf{A}, \mathbf{B}} \sum_{i=1}^{n} \| x_i - \mathbf{A}\mathbf{B}^T x_i \|^2 + \lambda \sum_{j=1}^{k} \|\beta_j\|^2 + \sum_{j=1}^{k} \lambda_{1,j} \|\beta_j\|_1 \qquad (3.12)$$
$$\text{subject to } \mathbf{A}^T\mathbf{A} = \mathbf{I}_{k\times k}.$$


Whereas the same λ is used for all k components, different λ1,j's are allowed for penalizing the loadings of different principal components. Again, if p > n, a positive λ is required in order to get exact PCA when the sparsity constraint (the lasso penalty) vanishes (λ1,j = 0). We call (3.12) the SPCA criterion hereafter.

3.3 NUMERICAL SOLUTION

We propose an alternating algorithm to minimize the SPCA criterion (3.12).

B given A: For each j, let $Y_j^* = \mathbf{X}\alpha_j$. By the same analysis used in (3.9)–(3.11), we know that $\hat{\mathbf{B}} = [\hat{\beta}_1, \ldots, \hat{\beta}_k]$, where each $\hat{\beta}_j$ is an elastic net estimate
$$\hat{\beta}_j = \arg\min_{\beta_j} \| Y_j^* - \mathbf{X}\beta_j \|^2 + \lambda \|\beta_j\|^2 + \lambda_{1,j} \|\beta_j\|_1. \qquad (3.13)$$

A given B: On the other hand, if B is fixed, then we can ignore the penalty part in (3.12) and only try to minimize $\sum_{i=1}^{n} \| x_i - \mathbf{A}\mathbf{B}^T x_i \|^2 = \| \mathbf{X} - \mathbf{X}\mathbf{B}\mathbf{A}^T \|^2$ subject to $\mathbf{A}^T\mathbf{A} = \mathbf{I}_{k\times k}$. The solution is obtained by a reduced rank form of the Procrustes rotation, given in Theorem 4 below. We compute the SVD
$$(\mathbf{X}^T\mathbf{X})\mathbf{B} = \mathbf{U}\mathbf{D}\mathbf{V}^T \qquad (3.14)$$
and set $\hat{\mathbf{A}} = \mathbf{U}\mathbf{V}^T$.

Theorem 4 (Reduced Rank Procrustes Rotation). Let $\mathbf{M}_{n\times p}$ and $\mathbf{N}_{n\times k}$ be two matrices. Consider the constrained minimization problem
$$\hat{\mathbf{A}} = \arg\min_{\mathbf{A}} \| \mathbf{M} - \mathbf{N}\mathbf{A}^T \|^2 \quad \text{subject to } \mathbf{A}^T\mathbf{A} = \mathbf{I}_{k\times k}. \qquad (3.15)$$
Suppose the SVD of $\mathbf{M}^T\mathbf{N}$ is $\mathbf{U}\mathbf{D}\mathbf{V}^T$; then $\hat{\mathbf{A}} = \mathbf{U}\mathbf{V}^T$.

The usual Procrustes rotation (e.g., Mardia, Kent, and Bibby 1979) has N the same size as M.
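Theorem 4 is easy to verify numerically; the short Python/NumPy sketch below (illustrative only) computes the reduced rank Procrustes rotation from the SVD of $\mathbf{M}^T\mathbf{N}$.

```python
import numpy as np

def reduced_rank_procrustes(M, N):
    """Return A (p x k) with A^T A = I minimizing ||M - N A^T||^2 (Theorem 4)."""
    U, _, Vt = np.linalg.svd(M.T @ N, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(3)
M, N = rng.normal(size=(40, 8)), rng.normal(size=(40, 3))
A = reduced_rank_procrustes(M, N)
assert np.allclose(A.T @ A, np.eye(3))           # orthonormality constraint holds
```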

It is worth pointing out that to solve (3.13), we only need to know the Gram matrix $\mathbf{X}^T\mathbf{X}$, because
$$\| Y_j^* - \mathbf{X}\beta_j \|^2 + \lambda \|\beta_j\|^2 + \lambda_{1,j} \|\beta_j\|_1 = (\alpha_j - \beta_j)^T \mathbf{X}^T\mathbf{X} (\alpha_j - \beta_j) + \lambda \|\beta_j\|^2 + \lambda_{1,j} \|\beta_j\|_1. \qquad (3.16)$$
The same is true of (3.14).

Now $\frac{1}{n}\mathbf{X}^T\mathbf{X}$ is the sample covariance matrix of X. Therefore, if Σ, the covariance matrix of X, is known, we can replace $\mathbf{X}^T\mathbf{X}$ with Σ in (3.16) and have a population version of SPCA. If X is standardized beforehand, then we use the (sample) correlation matrix, which is preferred when the scales of the variables are different.


Although (3.16) (with Σ instead of $\mathbf{X}^T\mathbf{X}$) is not quite an elastic net problem, we can easily turn it into one. Create the artificial response $\mathbf{Y}^{**}$ and $\mathbf{X}^{**}$ as follows:
$$\mathbf{Y}^{**} = \Sigma^{1/2}\alpha_j, \qquad \mathbf{X}^{**} = \Sigma^{1/2}; \qquad (3.17)$$
then it is easy to check that
$$\hat{\beta}_j = \arg\min_{\beta} \| \mathbf{Y}^{**} - \mathbf{X}^{**}\beta \|^2 + \lambda \|\beta\|^2 + \lambda_{1,j} \|\beta\|_1. \qquad (3.18)$$
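Numerically, the transformation (3.17) only requires a matrix square root of Σ; the sketch below (Python/NumPy, assuming a symmetric positive semidefinite Σ; function names are ours) forms $\Sigma^{1/2}$ by eigendecomposition, after which any elastic net solver can be applied to the artificial data.

```python
import numpy as np

def sqrt_psd(Sigma):
    """Symmetric square root of a positive semidefinite matrix."""
    w, Q = np.linalg.eigh(Sigma)
    w = np.clip(w, 0.0, None)            # guard against tiny negative eigenvalues
    return (Q * np.sqrt(w)) @ Q.T

def artificial_data(Sigma, alpha_j):
    """Artificial regression data of (3.17): Y** = Sigma^{1/2} alpha_j, X** = Sigma^{1/2}."""
    S_half = sqrt_psd(Sigma)
    return S_half @ alpha_j, S_half
```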

Algorithm 1 summarizes the steps of our SPCA procedure outlined above.

Algorithm 1. General SPCA Algorithm
1. Let A start at V[, 1:k], the loadings of the first k ordinary principal components.
2. Given a fixed $\mathbf{A} = [\alpha_1, \ldots, \alpha_k]$, solve the following elastic net problem for $j = 1, 2, \ldots, k$:
$$\beta_j = \arg\min_{\beta} (\alpha_j - \beta)^T \mathbf{X}^T\mathbf{X} (\alpha_j - \beta) + \lambda \|\beta\|^2 + \lambda_{1,j} \|\beta\|_1.$$
3. For a fixed $\mathbf{B} = [\beta_1, \cdots, \beta_k]$, compute the SVD of $\mathbf{X}^T\mathbf{X}\mathbf{B} = \mathbf{U}\mathbf{D}\mathbf{V}^T$, then update $\mathbf{A} = \mathbf{U}\mathbf{V}^T$.
4. Repeat Steps 2–3 until convergence.
5. Normalization: $\hat{V}_j = \beta_j / \|\beta_j\|$, $j = 1, \ldots, k$.

Some remarks:
1. Empirical evidence suggests that the output of the above algorithm does not change

much as λ is varied. For n > p data, the default choice of λ can be zero. Practically, λ is chosen to be a small positive number to overcome potential collinearity problems in X. Section 4 discusses the default choice of λ for data with thousands of variables, such as gene expression arrays.

2. In principle, we can try several combinations of λ1,j to figure out a good choice of the tuning parameters, since the above algorithm converges quite fast. There is a shortcut provided by the direct sparse approximation (3.5). The LARS-EN algorithm efficiently delivers a whole sequence of sparse approximations for each PC and the corresponding values of λ1,j. Hence we can pick a λ1,j that gives a good compromise between variance and sparsity. When facing the variance-sparsity trade-off, we let variance have a higher priority.
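To make Algorithm 1 and the remarks above concrete, here is a self-contained Python/NumPy sketch (our illustration, not the LARS-EN-based implementation in the elasticnet R package): the elastic net step is solved by coordinate descent written directly in terms of the Gram matrix, as in (3.16), and Step 3 uses the Procrustes update; iteration counts and function names are arbitrary choices of ours.

```python
import numpy as np

def soft(z, gamma):
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def enet_gram(gram, alpha, lam, lam1, n_iter=200):
    """Minimize (alpha-b)^T gram (alpha-b) + lam ||b||^2 + lam1 ||b||_1 over b."""
    p = gram.shape[0]
    beta = alpha.copy()
    for _ in range(n_iter):
        for j in range(p):
            # inner product of row j of the Gram matrix with (alpha - beta_{-j})
            z = gram[j] @ (alpha - beta) + gram[j, j] * beta[j]
            beta[j] = soft(z, lam1 / 2.0) / (gram[j, j] + lam)
    return beta

def spca(X, k, lam, lam1s, n_outer=100):
    """General SPCA (Algorithm 1): returns p x k sparse, normalized loadings."""
    X = X - X.mean(axis=0)
    gram = X.T @ X
    A = np.linalg.svd(X, full_matrices=False)[2].T[:, :k]    # Step 1: start at V[, 1:k]
    for _ in range(n_outer):
        B = np.column_stack([enet_gram(gram, A[:, j], lam, lam1s[j])
                             for j in range(k)])             # Step 2: elastic net
        U, _, Vt = np.linalg.svd(gram @ B, full_matrices=False)
        A = U @ Vt                                            # Step 3: Procrustes update
    norms = np.linalg.norm(B, axis=0)
    return B / np.where(norms > 0, norms, 1.0)                # Step 5: normalize
```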

3.4 ADJUSTED TOTAL VARIANCE

The ordinary principal components are uncorrelated and their loadings are orthogonal. Let $\hat{\Sigma} = \mathbf{X}^T\mathbf{X}$; then $\mathbf{V}^T\mathbf{V} = \mathbf{I}_k$ and $\mathbf{V}^T\hat{\Sigma}\mathbf{V}$ is diagonal. It is easy to check that it is only for ordinary principal components that the loadings can satisfy both conditions. In Jolliffe, Trendafilov, and Uddin (2003), the loadings were forced to be orthogonal, so the uncorrelated property was sacrificed. SPCA does not explicitly impose the uncorrelated components condition either.


Let $\hat{\mathbf{Z}}$ be the modified PCs. Usually the total variance explained by $\hat{\mathbf{Z}}$ is calculated by $\mathrm{tr}(\hat{\mathbf{Z}}^T\hat{\mathbf{Z}})$. This is reasonable when $\hat{\mathbf{Z}}$ are uncorrelated. However, if they are correlated, $\mathrm{tr}(\hat{\mathbf{Z}}^T\hat{\mathbf{Z}})$ is too optimistic for representing the total variance. Suppose $(\hat{Z}_i, i = 1, 2, \ldots, k)$ are the first k modified PCs by any method, and the (k + 1)th modified PC $\hat{Z}_{k+1}$ is obtained. We want to compute the total variance explained by the first k + 1 modified PCs, which should be the sum of the explained variance by the first k modified PCs and the additional variance from $\hat{Z}_{k+1}$. If $\hat{Z}_{k+1}$ is correlated with $(\hat{Z}_i, i = 1, 2, \ldots, k)$, then its variance contains contributions from $(\hat{Z}_i, i = 1, 2, \ldots, k)$, which should not be included into the total variance given the presence of $(\hat{Z}_i, i = 1, 2, \ldots, k)$.

Here we propose a new formula to compute the total variance explained by $\hat{\mathbf{Z}}$, which takes into account the correlations among $\hat{\mathbf{Z}}$. We use regression projection to remove the linear dependence between correlated components. Denote $\hat{Z}_{j\cdot 1,\ldots,j-1}$ the residual after adjusting $\hat{Z}_j$ for $\hat{Z}_1, \ldots, \hat{Z}_{j-1}$, that is,
$$\hat{Z}_{j\cdot 1,\ldots,j-1} = \hat{Z}_j - \mathbf{H}_{1,\ldots,j-1}\hat{Z}_j, \qquad (3.19)$$

where $\mathbf{H}_{1,\ldots,j-1}$ is the projection matrix on $\hat{Z}_1, \ldots, \hat{Z}_{j-1}$. Then the adjusted variance of $\hat{Z}_j$ is $\|\hat{Z}_{j\cdot 1,\ldots,j-1}\|^2$, and the total explained variance is defined as $\sum_{j=1}^{k} \|\hat{Z}_{j\cdot 1,\ldots,j-1}\|^2$. When the modified PCs $\hat{\mathbf{Z}}$ are uncorrelated, the new formula agrees with $\mathrm{tr}(\hat{\mathbf{Z}}^T\hat{\mathbf{Z}})$.

Note that the above computations depend on the order of $\hat{Z}_i$. However, since we have

a natural order in PCA, ordering is not an issue here. Using the QR decomposition, we can easily compute the adjusted variance. Suppose $\hat{\mathbf{Z}} = \mathbf{Q}\mathbf{R}$, where Q is orthonormal and R is upper triangular. Then it is straightforward to see that
$$\|\hat{Z}_{j\cdot 1,\ldots,j-1}\|^2 = \mathbf{R}_{jj}^2. \qquad (3.20)$$

Hence the explained total variance is equal to $\sum_{j=1}^{k} \mathbf{R}_{jj}^2$.
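The computation in (3.19)–(3.20) amounts to one QR decomposition; a minimal Python/NumPy sketch (ours) is given below, with a check that it agrees with $\mathrm{tr}(\hat{\mathbf{Z}}^T\hat{\mathbf{Z}})$ when the modified PCs are uncorrelated.

```python
import numpy as np

def adjusted_total_variance(Z):
    """Adjusted variances of modified PCs Z (n x k) via (3.20): R_jj^2 from Z = QR."""
    R = np.linalg.qr(Z, mode='r')
    adj = np.diag(R) ** 2
    return adj, adj.sum()

# for uncorrelated (orthogonal) columns this agrees with tr(Z^T Z)
rng = np.random.default_rng(4)
Q = np.linalg.qr(rng.normal(size=(50, 3)))[0]
Z = Q * np.array([3.0, 2.0, 1.0])          # orthogonal columns with known norms
adj, total = adjusted_total_variance(Z)
assert np.allclose(total, np.trace(Z.T @ Z))
```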

3.5 COMPUTATION COMPLEXITY

PCA is computationally efficient for both n > p or p ≫ n data. We separately discuss the computational cost of the general SPCA algorithm for n > p and p ≫ n.

1. n > p. Traditional multivariate data fit in this category. Note that although the SPCA criterion is defined using X, it only depends on X via $\mathbf{X}^T\mathbf{X}$. A trick is to first compute the p × p matrix $\hat{\Sigma} = \mathbf{X}^T\mathbf{X}$ once for all, which requires $np^2$ operations. Then the same $\hat{\Sigma}$ is used at each step within the loop. Computing $\mathbf{X}^T\mathbf{X}\beta$ costs $p^2 k$, and the SVD of $\mathbf{X}^T\mathbf{X}\beta$ is of order $O(pk^2)$. Each elastic net solution requires at most $O(p^3)$ operations. Since $k \le p$, the total computation cost is at most $np^2 + mO(p^3)$, where m is the number of iterations before convergence. Therefore the SPCA algorithm is able to efficiently handle data with huge n, as long as p is small (say p < 100).

2. p ≫ n. Gene expression arrays are typical examples in this p ≫ n category. The trick of using $\hat{\Sigma}$ is no longer applicable, because $\hat{\Sigma}$ is a huge matrix (p × p) in this case. The most consuming step is solving each elastic net, whose cost is of order $O(pnJ + J^3)$ for a positive finite λ, where J is the number of nonzero coefficients. Generally speaking, the total cost is of order $mk\,O(pnJ + J^3)$, which can be expensive for large J and p. Fortunately, as shown in the next section, there exists a special SPCA algorithm for efficiently dealing with p ≫ n data.

4. SPCA FOR p ≫ n AND GENE EXPRESSION ARRAYS

For gene expression arrays, the number of variables (genes) is typically much bigger than the number of samples (e.g., p = 10,000 and n = 100). Our general SPCA algorithm still fits this situation using a positive λ. However, the computational cost is expensive when requiring a large number of nonzero loadings. It is desirable to simplify the general SPCA algorithm to boost the computation.

Observe that Theorem 3 is valid for all λ > 0, so in principle we can use any positive λ. It turns out that a thrifty solution emerges if λ → ∞. Precisely, we have the following theorem.

Theorem 5. Let $\hat{V}_j(\lambda) = \hat{\beta}_j / \|\hat{\beta}_j\|$ ($j = 1, \ldots, k$) be the loadings derived from criterion (3.12). Let $(\hat{\mathbf{A}}, \hat{\mathbf{B}})$ be the solution of the optimization problem

$$(\hat{\mathbf{A}}, \hat{\mathbf{B}}) = \arg\min_{\mathbf{A}, \mathbf{B}} -2\,\mathrm{tr}\left( \mathbf{A}^T\mathbf{X}^T\mathbf{X}\mathbf{B} \right) + \sum_{j=1}^{k} \|\beta_j\|^2 + \sum_{j=1}^{k} \lambda_{1,j} \|\beta_j\|_1 \qquad (4.1)$$
$$\text{subject to } \mathbf{A}^T\mathbf{A} = \mathbf{I}_{k\times k}.$$

When $\lambda \to \infty$, $\hat{V}_j(\lambda) \to \hat{\beta}_j / \|\hat{\beta}_j\|$.

We can use the same alternating algorithm in Section 3.3 to solve (4.1), where we only need to replace the general elastic net problem with its special case (λ = ∞). Note that given A,

$$\hat{\beta}_j = \arg\min_{\beta_j} -2\alpha_j^T (\mathbf{X}^T\mathbf{X})\beta_j + \|\beta_j\|^2 + \lambda_{1,j} \|\beta_j\|_1, \qquad (4.2)$$
which has an explicit form solution given in (4.3).

Gene Expression Arrays SPCA Algorithm. Replace Step 2 in the general SPCA algorithm with:
Step 2*: for $j = 1, 2, \ldots, k$,
$$\hat{\beta}_j = \left( \left| \alpha_j^T \mathbf{X}^T\mathbf{X} \right| - \frac{\lambda_{1,j}}{2} \right)_{+} \mathrm{Sign}\!\left(\alpha_j^T \mathbf{X}^T\mathbf{X}\right). \qquad (4.3)$$

The operation in (4.3) is called soft-thresholding. Figure 1 gives an illustration of how the soft-thresholding rule operates. Recently soft-thresholding has become increasingly popular in the literature. For example, nearest shrunken centroids (Tibshirani, Hastie, Narasimhan, and Chu 2002) adopts the soft-thresholding rule to simultaneously classify samples and select important genes in microarrays.
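A compact Python/NumPy sketch of this p ≫ n special case (our own illustrative code, with arbitrary iteration counts): Step 2* is the soft-thresholding update (4.3) and Step 3 is the Procrustes rotation, with the matrix products arranged so that $\mathbf{X}^T\mathbf{X}$ is never formed explicitly.

```python
import numpy as np

def spca_arrays(X, k, lam1s, n_iter=200):
    """SPCA for p >> n (lambda = infinity): soft-thresholding + Procrustes updates."""
    X = X - X.mean(axis=0)
    A = np.linalg.svd(X, full_matrices=False)[2].T[:, :k]   # start at ordinary loadings
    for _ in range(n_iter):
        # Step 2*: beta_j = (|X^T X alpha_j| - lam1j/2)_+ * Sign(X^T X alpha_j)
        XtXA = X.T @ (X @ A)                                 # p x k, avoids forming X^T X
        B = np.sign(XtXA) * np.maximum(np.abs(XtXA) - np.asarray(lam1s) / 2.0, 0.0)
        # Step 3: reduced rank Procrustes, A = U V^T from the SVD of X^T X B
        U, _, Vt = np.linalg.svd(X.T @ (X @ B), full_matrices=False)
        A = U @ Vt
    norms = np.linalg.norm(B, axis=0)
    return B / np.where(norms > 0, norms, 1.0)
```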


Figure 1. An illustration of the soft-thresholding rule $y = (|x| - \Delta)_+ \mathrm{Sign}(x)$ with Δ = 1.

5. EXAMPLES

5.1 PITPROPS DATA

The pitprops data, first introduced by Jeffers (1967), has 180 observations and 13 measured variables. It is a classic example showing the difficulty of interpreting principal components. Jeffers (1967) tried to interpret the first six PCs. Jolliffe, Trendafilov, and Uddin (2003) used their SCoTLASS to find the modified PCs. Table 1 presents the results of PCA, while Table 2 presents the modified PC loadings as computed by SCoTLASS and the adjusted variance computed using (3.20).

As a demonstration, we also considered the first six principal components. Since this is a usual n ≫ p dataset, we set λ = 0. λ1 = (0.06, 0.16, 0.1, 0.5, 0.5, 0.5) were chosen according to Figure 2 such that each sparse approximation explained almost the same amount of variance as the ordinary PC did. Table 3 shows the obtained sparse loadings and the corresponding adjusted variance. Compared with the modified PCs of SCoTLASS, PCs by SPCA account for a larger amount of variance (75.8% vs. 69.3%) with a much sparser loading structure. The important variables associated with the six PCs do not overlap, which further makes the interpretations easier and clearer. It is interesting to note that in Table 3, even though the variance does not strictly monotonously decrease, the adjusted variance follows the right order. However, Table 2 shows that this is not true in SCoTLASS. It is also worthy of mention that the entire SPCA computation was done in seconds in R, while the implementation of SCoTLASS for each t was expensive (Jolliffe, Trendafilov, and Uddin 2003). Optimizing SCoTLASS over several values of t is an even more difficult computational challenge.

Although the informal thresholding method, which we henceforth refer to as simple


Table 1. Pitprops Data: Loadings of the First Six Principal Components

Variable     PC1      PC2      PC3      PC4      PC5      PC6
topdiam   -0.404    0.218   -0.207    0.091   -0.083    0.120
length    -0.406    0.186   -0.235    0.103   -0.113    0.163
moist     -0.124    0.541    0.141   -0.078    0.350   -0.276
testsg    -0.173    0.456    0.352   -0.055    0.356   -0.054
ovensg    -0.057   -0.170    0.481   -0.049    0.176    0.626
ringtop   -0.284   -0.014    0.475    0.063   -0.316    0.052
ringbut   -0.400   -0.190    0.253    0.065   -0.215    0.003
bowmax    -0.294   -0.189   -0.243   -0.286    0.185   -0.055
bowdist   -0.357    0.017   -0.208   -0.097   -0.106    0.034
whorls    -0.379   -0.248   -0.119    0.205    0.156   -0.173
clear      0.011    0.205   -0.070   -0.804   -0.343    0.175
knots      0.115    0.343    0.092    0.301   -0.600   -0.170
diaknot    0.113    0.309   -0.326    0.303    0.080    0.626

Variance (%)                32.4     18.3     14.4      8.5      7.0      6.3
Cumulative variance (%)     32.4     50.7     65.1     73.6     80.6     86.9

Table 2. Pitprops Data: Loadings of the First Six Modified PCs by SCoTLASS (t = 1.75). Empty cells have zero loadings; each variable's nonzero loadings are listed here in PC order.

topdiam    0.664, -0.025, 0.002, -0.035
length     0.683, -0.001, -0.040, 0.001, -0.018
moist      0.641, 0.195, 0.180, -0.030
testsg     0.701, 0.001, -0.001
ovensg    -0.887, -0.056
ringtop    0.293, -0.186, -0.373, 0.044
ringbut    0.001, 0.107, -0.658, -0.051, 0.064
bowmax     0.001, 0.735, 0.021, -0.168
bowdist    0.283, -0.001
whorls     0.113, -0.001, 0.388, -0.017, 0.320
clear     -0.923
knots      0.001, -0.554, 0.016, 0.004
diaknot    0.703, 0.001, -0.197, 0.080

                                     PC1     PC2     PC3     PC4     PC5     PC6
Number of nonzero loadings             6       6       6       6      10      13
Variance (%)                        19.6    16.0    13.1    13.1     9.2     9.0
Adjusted variance (%)               19.6    13.8    12.4     8.0     7.1     8.4
Cumulative adjusted variance (%)    19.6    33.4    45.8    53.8    60.9    69.3


Figure 2. Pitprops data. The sequences of sparse approximations to the first six principal components. The curves show the percentage of explained variance (PEV) as a function of λ1. The vertical broken lines indicate the choice of λ1 used in our SPCA analysis.


Table 3. Pitprops Data: Loadings of the First Six Sparse PCs by SPCA. Empty cells have zero loadings.

Variable     PC1      PC2      PC3     PC4    PC5    PC6
topdiam   -0.477
length    -0.476
moist               0.785
testsg              0.620
ovensg              0.177    0.640
ringtop                      0.589
ringbut   -0.250             0.492
bowmax    -0.344            -0.021
bowdist   -0.416
whorls    -0.400
clear                                  -1
knots               0.013                     -1
diaknot   -0.015                                      1

Number of nonzero loadings             7       4       4       1      1      1
Variance (%)                        28.0    14.4    15.0     7.7    7.7    7.7
Adjusted variance (%)               28.0    14.0    13.3     7.4    6.8    6.2
Cumulative adjusted variance (%)    28.0    42.0    55.3    62.7   69.5   75.8

thresholding, has various drawbacks, it may serve as the benchmark for testing sparse PC methods. A variant of simple thresholding is soft-thresholding. We found that when used in PCA, soft-thresholding performs very similarly to simple thresholding. Thus we omitted the results of soft-thresholding in this article. Both SCoTLASS and SPCA were compared with simple thresholding. Table 4 presents the loadings and the corresponding variance explained by simple thresholding. To make the comparisons fair, we let the numbers of nonzero loadings obtained by simple thresholding match the results of SCoTLASS and SPCA, as shown in the top and bottom parts of Table 4, respectively. In terms of variance, it seems that simple thresholding is better than SCoTLASS and worse than SPCA. Moreover, the variables with nonzero loadings by SPCA are different from those chosen by simple thresholding for the first three PCs, while SCoTLASS seems to create a similar sparseness pattern as simple thresholding does, especially in the leading PC.

5.2 A SYNTHETIC EXAMPLE

Our synthetic example has three hidden factors:
$$V_1 \sim N(0, 290), \quad V_2 \sim N(0, 300),$$
$$V_3 = -0.3 V_1 + 0.925 V_2 + \epsilon, \quad \epsilon \sim N(0, 1),$$
and $V_1$, $V_2$, and $\epsilon$ are independent.

Then 10 observable variables are constructed as follows:
$$X_i = V_1 + \epsilon_i^1, \quad \epsilon_i^1 \sim N(0, 1), \quad i = 1, 2, 3, 4,$$
$$X_i = V_2 + \epsilon_i^2, \quad \epsilon_i^2 \sim N(0, 1), \quad i = 5, 6, 7, 8,$$
$$X_i = V_3 + \epsilon_i^3, \quad \epsilon_i^3 \sim N(0, 1), \quad i = 9, 10,$$
$$\{\epsilon_i^j\} \text{ are independent}, \quad j = 1, 2, 3, \quad i = 1, \ldots, 10.$$

We used the exact covariance matrix of $(X_1, \ldots, X_{10})$ to perform PCA, SPCA, and simple thresholding (in the population setting).
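For reference, the following Python/NumPy sketch (ours) builds the exact covariance matrix of $(X_1, \ldots, X_{10})$ implied by the factor model above and extracts the ordinary population loadings; SPCA and simple thresholding as reported in Table 5 can then be run on this matrix.

```python
import numpy as np

# covariance of the hidden factors (V1, V2, V3)
var_v1, var_v2 = 290.0, 300.0
a, b = -0.3, 0.925                       # V3 = a*V1 + b*V2 + eps, Var(eps) = 1
cov_v = np.array([
    [var_v1,        0.0,           a * var_v1],
    [0.0,           var_v2,        b * var_v2],
    [a * var_v1,    b * var_v2,    a**2 * var_v1 + b**2 * var_v2 + 1.0],
])

# X_1..X_4 load on V1, X_5..X_8 on V2, X_9..X_10 on V3, each with unit noise
factor = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2])
Sigma = cov_v[np.ix_(factor, factor)] + np.eye(10)

# ordinary PCA in the population setting: eigenvectors of Sigma
eigval, eigvec = np.linalg.eigh(Sigma)
order = np.argsort(eigval)[::-1]
loadings = eigvec[:, order[:2]]          # compare with the PCA columns of Table 5
```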

Table 4. Pitprops Data: Loadings of the First Six Modified PCs by Simple Thresholding. Empty cells have zero loadings.

Simple thresholding vs. SCoTLASS

Variable     PC1      PC2      PC3      PC4      PC5      PC6
topdiam   -0.439    0.240                                0.120
length    -0.441                      0.105   -0.114    0.163
moist                0.596                     0.354   -0.276
testsg               0.503    0.391            0.360   -0.054
ovensg                        0.534            0.178    0.626
ringtop                       0.528           -0.320    0.052
ringbut   -0.435              0.281           -0.218    0.003
bowmax    -0.319             -0.270   -0.291   0.188   -0.055
bowdist   -0.388                                         0.034
whorls    -0.412   -0.274             0.209    0.158   -0.173
clear                                -0.819   -0.347    0.175
knots                0.378            0.307   -0.608   -0.170
diaknot              0.340   -0.362   0.309             0.626

Number of nonzero loadings             6       6       6       6      10      13
Variance (%)                        28.9    16.1    15.4     8.4     7.1     6.3
Adjusted variance (%)               28.9    16.1    13.9     8.2     6.9     6.2
Cumulative adjusted variance (%)    28.9    45.0    58.9    67.1    74.0    80.2

Simple thresholding vs. SPCA

Variable     PC1      PC2      PC3     PC4    PC5    PC6
topdiam   -0.420
length    -0.422
moist                0.640
testsg               0.540    0.425
ovensg                        0.580
ringtop   -0.296              0.573
ringbut   -0.416
bowmax    -0.305
bowdist   -0.370
whorls    -0.394
clear                                  -1
knots                0.406                    -1
diaknot              0.365   -0.393                   1

Number of nonzero loadings             7       4       4       1      1      1
Variance (%)                        30.7    14.8    13.6     7.7    7.7    7.7
Adjusted variance (%)               30.7    14.7    11.1     7.6    5.2    3.6
Cumulative adjusted variance (%)    30.7    45.4    56.5    64.1   68.3   71.9


Table 5. Results of the Simulation Example: Loadings and Variance

                        PCA                      SPCA (λ = 0)       Simple Thresholding
             PC1      PC2      PC3             PC1      PC2           PC1      PC2
X1          0.116   -0.478   -0.087              0      0.5             0     -0.5
X2          0.116   -0.478   -0.087              0      0.5             0     -0.5
X3          0.116   -0.478   -0.087              0      0.5             0     -0.5
X4          0.116   -0.478   -0.087              0      0.5             0     -0.5
X5         -0.395   -0.145    0.270            0.5        0             0        0
X6         -0.395   -0.145    0.270            0.5        0             0        0
X7         -0.395   -0.145    0.270            0.5        0        -0.497        0
X8         -0.395   -0.145    0.270            0.5        0        -0.497        0
X9         -0.401    0.010   -0.582              0        0        -0.503        0
X10        -0.401    0.010   -0.582              0        0        -0.503        0

Adjusted
variance (%)  60.0    39.6     0.08           40.9     39.5          38.8     38.6

The variance of the three underlying factors is 290, 300, and 283.8, respectively. The numbers of variables associated with the three factors are 4, 4, and 2. Therefore V2 and V1 are almost equally important, and they are much more important than V3. The first two PCs together explain 99.6% of the total variance. These facts suggest that we only need to consider two derived variables with "correct" sparse representations. Ideally, the first derived variable should recover the factor V2 only using (X5, X6, X7, X8), and the second derived variable should recover the factor V1 only using (X1, X2, X3, X4). In fact, if we sequentially maximize the variance of the first two derived variables under the orthonormal constraint, while restricting the numbers of nonzero loadings to four, then the first derived variable uniformly assigns nonzero loadings on (X5, X6, X7, X8), and the second derived variable uniformly assigns nonzero loadings on (X1, X2, X3, X4).

Both SPCA (λ = 0) and simple thresholding were carried out by using the oracle information that the ideal sparse representations use only four variables. Table 5 summarizes the comparison results. Clearly, SPCA correctly identifies the sets of important variables. In fact, SPCA delivers the ideal sparse representations of the first two principal components. Mathematically, it is easy to show that if t = 2 is used, SCoTLASS is also able to find the same sparse solution. In this example, both SPCA and SCoTLASS produce the ideal sparse PCs, which may be explained by the fact that both methods explicitly use the lasso penalty.

In contrast, simple thresholding incorrectly includes X9, X10 in the most important variables. The variance explained by simple thresholding is also lower than that by SPCA, although the relative difference is small (less than 5%). Due to the high correlation between V2 and V3, variables X9, X10 achieve loadings which are even higher than those of the true important variables (X5, X6, X7, X8). Thus the truth is disguised by the high correlation. On the other hand, simple thresholding correctly discovers the second factor, because V1 has a low correlation with V3.

5.3 RAMASWAMY DATA

An important task in microarray analysis is to find a set of genes which are biologically


Figure 3. The sparse leading principal component: percentage of explained variance versus sparsity. Simple thresholding and SPCA have similar performances. However, there still exists consistent difference in the selected genes (the ones with nonzero loadings).

relevant to the outcome (e.g., tumor type or survival time). PCA (or SVD) has been a popular tool for this purpose. Many gene-clustering methods in the literature use PCA (or SVD) as a building block. For example, gene shaving (Hastie et al. 2000) uses an iterative principal component shaving algorithm to identify subsets of coherent genes. Here we consider another approach to gene selection through SPCA. The idea is intuitive: if the (sparse) principal component can explain a large part of the total variance of gene expression levels, then the subset of genes representing the principal component are considered important.

We illustrate the sparse PC selection method on Ramaswamy's data (Ramaswamy et al. 2001), which has 16,063 (p = 16,063) genes and 144 (n = 144) samples. Its first principal component explains 46% of the total variance. For microarray data like this, it appears that SCoTLASS cannot be practically useful for finding sparse PCs. We applied SPCA (λ = ∞) to find the leading sparse PC. A sequence of values for λ1 were used such that the number of nonzero loadings varied over a wide range. As displayed in Figure 3, the percentage of explained variance decreases at a slow rate as the sparsity increases. As few as 2.5% of these 16,063 genes can sufficiently construct the leading principal component with an affordable loss of explained variance (from 46% to 40%). Simple thresholding was also applied to this data. It seems that when using the same number of genes, simple thresholding always explains slightly higher variance than SPCA does. Among the same number of selected genes by SPCA and simple thresholding, there are about 2% different genes, and this difference rate is quite consistent.


6. DISCUSSION

It has been an interesting research topic for years to derive principal components with sparse loadings. From a practical point of view, a good method to achieve the sparseness goal should (at least) possess the following properties:
• Without any sparsity constraint, the method should reduce to PCA.
• It should be computationally efficient for both small p and big p data.
• It should avoid misidentifying the important variables.

The often-used simple thresholding approach is not criterion based. However, this informal method seems to possess the first two of the desirable properties listed above. If the explained variance and sparsity are the only concerns, simple thresholding is a reasonable approach, and it is extremely convenient. We have shown that simple thresholding can work well with gene expression arrays. The serious problem with simple thresholding is that it can misidentify the real important variables. Nevertheless, simple thresholding is regarded as a benchmark for any potentially better method.

Using the lasso constraint in PCA, SCoTLASS successfully derives sparse loadings. However, SCoTLASS is not computationally efficient, and it lacks a good rule to pick its tuning parameter. In addition, it is not feasible to apply SCoTLASS to gene expression arrays, where PCA is a quite popular tool.

In this work we have developed SPCA using our SPCA criterion (3.12). This new criterion gives exact PCA results when its sparsity (lasso) penalty term vanishes. SPCA allows flexible control on the sparse structure of the resulting loadings. Unified efficient algorithms have been proposed to compute SPCA solutions for both regular multivariate data and gene expression arrays. As a principled procedure, SPCA enjoys advantages in several aspects, including computational efficiency, high explained variance, and an ability in identifying important variables.

Software in R for fitting the SPCA model (and elastic net models) is available in the CRAN contributed package elasticnet.

APPENDIX: PROOFS

Proof of Theorem 1. Using $\mathbf{X}^T\mathbf{X} = \mathbf{V}\mathbf{D}^2\mathbf{V}^T$ and $\mathbf{V}^T\mathbf{V} = \mathbf{I}$, we have
$$\hat{\beta}_{\mathrm{ridge}} = \left(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}\right)^{-1}\mathbf{X}^T(\mathbf{X}V_i) = V_i \frac{D_{ii}^2}{D_{ii}^2 + \lambda}. \qquad (A.1)$$
Hence $\hat{v} = V_i$.

Note that since Theorem 2 is a special case of Theorem 3, we will not prove it separately. We first provide a lemma.

Lemma 1. Consider the ridge regression criterion
$$C_\lambda(\beta) = \| \mathbf{y} - \mathbf{X}\beta \|^2 + \lambda \|\beta\|^2.$$


Then if $\hat{\beta} = \arg\min_\beta C_\lambda(\beta)$,
$$C_\lambda(\hat{\beta}) = \mathbf{y}^T(\mathbf{I} - \mathbf{S}_\lambda)\mathbf{y},$$
where $\mathbf{S}_\lambda$ is the ridge operator
$$\mathbf{S}_\lambda = \mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T.$$

Proof of Lemma 1. Differentiating $C_\lambda$ with respect to β, we get that
$$-\mathbf{X}^T(\mathbf{y} - \mathbf{X}\hat{\beta}) + \lambda\hat{\beta} = 0.$$
Premultiplication by $\hat{\beta}^T$ and re-arrangement gives $\lambda\|\hat{\beta}\|^2 = (\mathbf{y} - \mathbf{X}\hat{\beta})^T\mathbf{X}\hat{\beta}$. Since
$$\|\mathbf{y} - \mathbf{X}\hat{\beta}\|^2 = (\mathbf{y} - \mathbf{X}\hat{\beta})^T\mathbf{y} - (\mathbf{y} - \mathbf{X}\hat{\beta})^T\mathbf{X}\hat{\beta},$$
$C_\lambda(\hat{\beta}) = (\mathbf{y} - \mathbf{X}\hat{\beta})^T\mathbf{y}$. The result follows since the "fitted values" $\mathbf{X}\hat{\beta} = \mathbf{S}_\lambda\mathbf{y}$.

Proof of Theorem 3. We use the notation introduced in Section 3: $\mathbf{A} = [\alpha_1, \ldots, \alpha_k]$ and $\mathbf{B} = [\beta_1, \ldots, \beta_k]$. Let
$$C_\lambda(\mathbf{A}, \mathbf{B}) = \sum_{i=1}^{n} \| x_i - \mathbf{A}\mathbf{B}^T x_i \|^2 + \lambda \sum_{j=1}^{k} \|\beta_j\|^2.$$

As in (3.9), we have
$$\sum_{i=1}^{n} \| x_i - \mathbf{A}\mathbf{B}^T x_i \|^2 = \| \mathbf{X} - \mathbf{X}\mathbf{B}\mathbf{A}^T \|^2 \qquad (A.2)$$
$$= \| \mathbf{X}\mathbf{A}_\perp \|^2 + \| \mathbf{X}\mathbf{A} - \mathbf{X}\mathbf{B} \|^2. \qquad (A.3)$$

Hence, with A fixed, solving
$$\arg\min_{\mathbf{B}} C_\lambda(\mathbf{A}, \mathbf{B})$$
is equivalent to solving the series of ridge regressions
$$\arg\min_{\{\beta_j\}_1^k} \sum_{j=1}^{k} \| \mathbf{X}\alpha_j - \mathbf{X}\beta_j \|^2 + \lambda \|\beta_j\|^2.$$

It is easy to show that
$$\hat{\mathbf{B}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{X}\mathbf{A}, \qquad (A.4)$$
and using Lemma 1 and (A.2) we have that the partially optimized penalized criterion is given by
$$C_\lambda(\mathbf{A}, \hat{\mathbf{B}}) = \| \mathbf{X}\mathbf{A}_\perp \|^2 + \mathrm{tr}\left((\mathbf{X}\mathbf{A})^T(\mathbf{I} - \mathbf{S}_\lambda)(\mathbf{X}\mathbf{A})\right). \qquad (A.5)$$


Rearranging the terms, we get
$$C_\lambda(\mathbf{A}, \hat{\mathbf{B}}) = \mathrm{tr}(\mathbf{X}^T\mathbf{X}) - \mathrm{tr}(\mathbf{A}^T\mathbf{X}^T\mathbf{S}_\lambda\mathbf{X}\mathbf{A}), \qquad (A.6)$$
which must be minimized with respect to A with $\mathbf{A}^T\mathbf{A} = \mathbf{I}$. Hence A should be taken to be the largest k eigenvectors of $\mathbf{X}^T\mathbf{S}_\lambda\mathbf{X}$. If the SVD of $\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T$, it is easy to show that $\mathbf{X}^T\mathbf{S}_\lambda\mathbf{X} = \mathbf{V}\mathbf{D}^2(\mathbf{D}^2 + \lambda\mathbf{I})^{-1}\mathbf{D}^2\mathbf{V}^T$; hence $\hat{\mathbf{A}} = \mathbf{V}[, 1:k]$. Likewise, plugging the SVD of X into (A.4), we see that each of the $\hat{\beta}_j$ are scaled elements of the corresponding $V_j$.

Proof of Theorem 4. We expand the matrix norm
$$\| \mathbf{M} - \mathbf{N}\mathbf{A}^T \|^2 = \mathrm{tr}(\mathbf{M}^T\mathbf{M}) - 2\,\mathrm{tr}(\mathbf{M}^T\mathbf{N}\mathbf{A}^T) + \mathrm{tr}(\mathbf{A}\mathbf{N}^T\mathbf{N}\mathbf{A}^T). \qquad (A.7)$$
Since $\mathbf{A}^T\mathbf{A} = \mathbf{I}$, the last term is equal to $\mathrm{tr}(\mathbf{N}^T\mathbf{N})$, and hence we need to maximize (minus half) the middle term. With the SVD $\mathbf{M}^T\mathbf{N} = \mathbf{U}\mathbf{D}\mathbf{V}^T$, this middle term becomes
$$\mathrm{tr}(\mathbf{M}^T\mathbf{N}\mathbf{A}^T) = \mathrm{tr}(\mathbf{U}\mathbf{D}\mathbf{V}^T\mathbf{A}^T) \qquad (A.8)$$
$$= \mathrm{tr}(\mathbf{U}\mathbf{D}\mathbf{A}^{*T}) \qquad (A.9)$$
$$= \mathrm{tr}(\mathbf{A}^{*T}\mathbf{U}\mathbf{D}), \qquad (A.10)$$
where $\mathbf{A}^* = \mathbf{A}\mathbf{V}$, and since V is k × k orthonormal, $\mathbf{A}^{*T}\mathbf{A}^* = \mathbf{I}$. Now since D is diagonal, (A.10) is maximized when the diagonal of $\mathbf{A}^{*T}\mathbf{U}$ is positive and maximum. By the Cauchy-Schwartz inequality, this is achieved when $\mathbf{A}^* = \mathbf{U}$, in which case the diagonal elements are all 1. Hence $\hat{\mathbf{A}} = \mathbf{U}\mathbf{V}^T$.

Proof of Theorem 5. Let $\hat{\mathbf{B}}^* = [\hat{\beta}_1^*, \ldots, \hat{\beta}_k^*]$ with $\hat{\beta}^* = (1 + \lambda)\hat{\beta}$; then $\hat{V}_i(\lambda) = \hat{\beta}_i^* / \|\hat{\beta}_i^*\|$. On the other hand, $\hat{\beta} = \hat{\beta}^*/(1 + \lambda)$ means that
$$(\hat{\mathbf{A}}, \hat{\mathbf{B}}^*) = \arg\min_{\mathbf{A}, \mathbf{B}} C_{\lambda, \lambda_1}(\mathbf{A}, \mathbf{B}) \quad \text{subject to } \mathbf{A}^T\mathbf{A} = \mathbf{I}_{k\times k}, \qquad (A.11)$$

where
$$C_{\lambda, \lambda_1}(\mathbf{A}, \mathbf{B}) = \sum_{i=1}^{n} \left\| x_i - \mathbf{A}\frac{\mathbf{B}^T}{1 + \lambda} x_i \right\|^2 + \lambda \sum_{j=1}^{k} \left\| \frac{\beta_j}{1 + \lambda} \right\|^2 + \sum_{j=1}^{k} \lambda_{1,j} \left\| \frac{\beta_j}{1 + \lambda} \right\|_1. \qquad (A.12)$$

Since
$$\sum_{j=1}^{k} \left\| \frac{\beta_j}{1 + \lambda} \right\|^2 = \frac{1}{(1 + \lambda)^2}\,\mathrm{tr}(\mathbf{B}^T\mathbf{B})$$

and
$$\sum_{i=1}^{n} \left\| x_i - \mathbf{A}\frac{\mathbf{B}^T}{1 + \lambda} x_i \right\|^2 = \mathrm{tr}\left(\left(\mathbf{X} - \mathbf{X}\frac{\mathbf{B}}{1 + \lambda}\mathbf{A}^T\right)^T \left(\mathbf{X} - \mathbf{X}\frac{\mathbf{B}}{1 + \lambda}\mathbf{A}^T\right)\right)$$
$$= \mathrm{tr}\left(\mathbf{X}^T\mathbf{X}\right) + \frac{1}{(1 + \lambda)^2}\,\mathrm{tr}\left(\mathbf{B}^T\mathbf{X}^T\mathbf{X}\mathbf{B}\right) - \frac{2}{1 + \lambda}\,\mathrm{tr}\left(\mathbf{A}^T\mathbf{X}^T\mathbf{X}\mathbf{B}\right),$$


Thus we have
$$C_{\lambda, \lambda_1}(\mathbf{A}, \mathbf{B}) = \mathrm{tr}\left(\mathbf{X}^T\mathbf{X}\right) + \frac{1}{1 + \lambda}\left[\mathrm{tr}\left(\mathbf{B}^T\frac{\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}}{1 + \lambda}\mathbf{B}\right) - 2\,\mathrm{tr}\left(\mathbf{A}^T\mathbf{X}^T\mathbf{X}\mathbf{B}\right) + \sum_{j=1}^{k} \lambda_{1,j}\|\beta_j\|_1\right],$$

which implies that
$$(\hat{\mathbf{A}}, \hat{\mathbf{B}}^*) = \arg\min_{\mathbf{A}, \mathbf{B}} \mathrm{tr}\left(\mathbf{B}^T\frac{\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}}{1 + \lambda}\mathbf{B}\right) - 2\,\mathrm{tr}\left(\mathbf{A}^T\mathbf{X}^T\mathbf{X}\mathbf{B}\right) + \sum_{j=1}^{k} \lambda_{1,j}\|\beta_j\|_1 \qquad (A.13)$$

subject to $\mathbf{A}^T\mathbf{A} = \mathbf{I}_{k\times k}$. As $\lambda \to \infty$, $\mathrm{tr}\left(\mathbf{B}^T\frac{\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}}{1 + \lambda}\mathbf{B}\right) \to \mathrm{tr}(\mathbf{B}^T\mathbf{B}) = \sum_{j=1}^{k} \|\beta_j\|^2$. Thus (A.13) approaches (4.1), and the conclusion of Theorem 5 follows.

ACKNOWLEDGMENTS

We thank the editor, an associate editor, and referees for helpful comments and suggestions, which greatly improved the manuscript.

[Received April 2004. Revised June 2005.]

REFERENCES

Alter, O., Brown, P., and Botstein, D. (2000), "Singular Value Decomposition for Genome-Wide Expression Data Processing and Modeling," in Proceedings of the National Academy of Sciences, 97, pp. 10101–10106.

Cadima, J., and Jolliffe, I. (1995), "Loadings and Correlations in the Interpretation of Principal Components," Journal of Applied Statistics, 22, 203–214.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004), "Least Angle Regression," The Annals of Statistics, 32, 407–499.

Hancock, P., Burton, A., and Bruce, V. (1996), "Face Processing: Human Perception and Principal Components Analysis," Memory and Cognition, 24, 26–40.

Hastie, T., Tibshirani, R., and Friedman, J. (2001), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, New York: Springer Verlag.

Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Ross, D., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000), "'Gene Shaving' as a Method for Identifying Distinct Sets of Genes With Similar Expression Patterns," Genome Biology, 1, 1–21.

Jeffers, J. (1967), "Two Case Studies in the Application of Principal Component Analysis," Applied Statistics, 16, 225–236.

Jolliffe, I. (1986), Principal Component Analysis, New York: Springer Verlag.

Jolliffe, I. (1995), "Rotation of Principal Components: Choice of Normalization Constraints," Journal of Applied Statistics, 22, 29–35.

Jolliffe, I. T., Trendafilov, N. T., and Uddin, M. (2003), "A Modified Principal Component Technique Based on the Lasso," Journal of Computational and Graphical Statistics, 12, 531–547.

Mardia, K., Kent, J., and Bibby, J. (1979), Multivariate Analysis, New York: Academic Press.

McCabe, G. (1984), "Principal Variables," Technometrics, 26, 137–144.


Osborne, M. R., Presnell, B., and Turlach, B. A. (2000), "A New Approach to Variable Selection in Least Squares Problems," IMA Journal of Numerical Analysis, 20, 389–403.

Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J., Poggio, T., Gerald, W., Loda, M., Lander, E., and Golub, T. (2001), "Multiclass Cancer Diagnosis Using Tumor Gene Expression Signatures," in Proceedings of the National Academy of Sciences, 98, pp. 15149–15154.

Tibshirani, R. (1996), "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Series B, 58, 267–288.

Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G. (2002), "Diagnosis of Multiple Cancer Types by Shrunken Centroids of Gene Expression," in Proceedings of the National Academy of Sciences, 99, 6567–6572.

Vines, S. (2000), "Simple Principal Components," Applied Statistics, 49, 441–451.

Zou, H., and Hastie, T. (2005), "Regularization and Variable Selection via the Elastic Net," Journal of the Royal Statistical Society, Series B, 67, 301–320.

Page 2: Sparse Principal Component Analysis - Stanford …web.stanford.edu/~hastie/Papers/spc_jcgs.pdf · Sparse Principal Component Analysis HuiZ OU,Trevor H ASTIE and RobertT IBSHIRANI

266 H ZOU T HASTIE AND R TIBSHIRANI

number of observations and the number of variables respectively Without loss of generalityassume the column means of X are all 0 Let the SVD of X be

X = UDVT (11)

Z = UD are the principal components (PCs) and the columns of V are the correspondingloadings of the principal components The sample variance of the ith PC is D2

iin In geneexpression data the standardized PCs U are called the eigen-arrays and V are the eigen-genes (Alter Brown and Botstein 2000) Usually the first q (q min(n p)) PCs are chosento represent the data thus a great dimensionality reduction is achieved

The success of PCA is due to the following two important optimal properties1 principal components sequentially capture the maximum variability among the

columns of X thus guaranteeing minimal information loss2 principal components are uncorrelated so we can talk about one principal compo-

nent without referring to othersHowever PCA also has an obvious drawback that is each PC is a linear combination of allp variables and the loadings are typically nonzero This makes it often difficult to interpretthe derived PCs Rotation techniques are commonly used to help practitioners to interpretprincipal components (Jolliffe 1995) Vines (2000) considered simple principal componentsby restricting the loadings to take values from a small set of allowable integers such as 01 and minus1

We feel it is desirable not only to achieve the dimensionality reduction but also to reducethe number of explicitly used variables An ad hoc way to achieve this is to artificially set theloadings with absolute values smaller than a threshold to zero This informal thresholdingapproach is frequently used in practice but can be potentially misleading in various respects(Cadima and Jolliffe 1995) McCabe (1984) presented an alternative to PCA which found asubset of principal variables Jolliffe Trendafilov and Uddin (2003) introduced SCoTLASSto get modified principal components with possible zero loadings

The same interpretation issues arise in multiple linear regression where the responseis predicted by a linear combination of the predictors Interpretable models are obtained viavariable selection The lasso (Tibshirani 1996) is a promising variable selection techniquesimultaneously producing accurate and sparse models Zou and Hastie (2005) proposedthe elastic net a generalization of the lasso which has some advantages In this article weintroduce a new approach for estimating PCs with sparse loadings which we call sparseprincipal component analysis (SPCA) SPCA is built on the fact that PCA can be written asa regression-type optimization problem with a quadratic penalty the lasso penalty (via theelastic net) can then be directly integrated into the regression criterion leading to a modifiedPCA with sparse loadings

In the next section we briefly review the lasso and the elastic net The methodologicaldetails of SPCA are presented in Section 3 We present an efficient algorithm for fittingthe SPCA model We also derive an appropriate expression for representing the varianceexplained by modified principal components In Section 4 we consider a special case of theSPCA algorithm for handling gene expression arrays efficiently The proposed methodology

SPARSE PRINCIPAL COMPONENT ANALYSIS 267

is illustrated by using real data and simulation examples in Section 5 Discussions are inSection 6 The article ends with an Appendix summarizing technical details

2 THE LASSO AND THE ELASTIC NET

Consider the linear regression model with n observations and p predictors Let Y =(y1 yn)T be the response vector and X = [X1 Xp] j = 1 p the predictorswhere Xj = (x1j xnj)T After a location transformation we can assume all the Xj

and Y are centeredThe lasso is a penalized least squares method imposing a constraint on the L1 norm of

the regression coefficients Thus the lasso estimates βlasso are obtained by minimizing thelasso criterion

βlasso = arg minβ

Y minuspsum

j=1

Xjβj2 + λ

psumj=1

|βj | (21)

whereλ is non-negative The lasso was originally solved by quadratic programming (Tibshi-rani 1996) Efron Hastie Johnstone and Tibshirani (2004) showed that the lasso estimatesβ are piecewise linear as a function of λ and proposed an algorithm called LARS to effi-ciently solve the entire lasso solution path in the same order of computations as a single leastsquares fit The piecewise linearity of the lasso solution path was first proved by OsbornePresnell Turlach (2000) where a different algorithm was proposed to solve the entire lassosolution path

The lasso continuously shrinks the coefficients toward zero and achieves its predictionaccuracy via the bias variance trade-off Due to the nature of theL1 penalty some coefficientswill be shrunk to exact zero if λ1 is large enough Therefore the lasso simultaneouslyproduces both an accurate and sparse model which makes it a favorable variable selectionmethod However the lasso has several limitations as pointed out by Zou and Hastie (2005)The most relevant one to this work is that the number of variables selected by the lasso islimited by the number of observations For example if applied to microarray data wherethere are thousands of predictors (genes) (p gt 1000) with less than 100 samples (n lt 100)the lasso can only select at most n genes which is clearly unsatisfactory

The elastic net (Zou and Hastie 2005) generalizes the lasso to overcome these draw-backs while enjoying its other favorable properties For any non-negative λ1 and λ2 theelastic net estimates βen are given as follows

βen = (1 + λ2)

arg minβ

Y minuspsum

j=1

Xjβj2 + λ2

psumj=1

|βj |2 + λ1

psumj=1

|βj | (22)

The elastic net penalty is a convex combination of the ridge and lasso penalties Obviouslythe lasso is a special case of the elastic net when λ2 = 0 Given a fixed λ2 the LARS-EN algorithm (Zou and Hastie 2005) efficiently solves the elastic net problem for all λ1

with the computational cost of a single least squares fit When p gt n we choose some

268 H ZOU T HASTIE AND R TIBSHIRANI

λ2 gt 0 Then the elastic net can potentially include all variables in the fitted model so thisparticular limitation of the lasso is removed Zou and Hastie (2005) compared the elasticnet with the lasso and discussed the application of the elastic net as a gene selection methodin microarray analysis

3 MOTIVATION AND DETAILS OF SPCA

In both lasso and elastic net the sparse coefficients are a direct consequence of theL1 penalty and do not depend on the squared error loss function Jolliffe Trendafilov andUddin (2003) proposed SCoTLASS an interesting procedure that obtains sparse loadingsby directly imposing an L1 constraint on PCA SCoTLASS successively maximizes thevariance

aTk (XT X)ak (31)

subject to

aTk ak = 1 and (for k ge 2) aT

h ak = 0 h lt k (32)

and the extra constraintspsum

j=1

|akj | le t (33)

for some tuning parameter t Although sufficiently small t yields some exact zero loadingsthere is not much guidance with SCoTLASS in choosing an appropriate value for t Onecould try several t values but the high computational cost of SCoTLASS makes this an im-practical solution This high computational cost is probably due to the fact that SCoTLASSis not a convex optimization problem Moreover the examples in Jolliffe Trendafilov andUddin (2003) showed that the loadings obtained by SCoTLASS are not sparse enough whenone requires a high percentage of explained variance

We consider a different approach to modifying PCA We first show how PCA can berecast exactly in terms of a (ridge) regression problem We then introduce the lasso penaltyby changing this ridge regression to an elastic-net regression

31 DIRECT SPARSE APPROXIMATIONS

We first discuss a simple regression approach to PCA Observe that each PC is a linearcombination of the p variables thus its loadings can be recovered by regressing the PC onthe p variables

Theorem 1 For each i denote byZi = UiDii the ith principal component Considera positive λ and the ridge estimates βridge given by

βridge = arg minβ

Zi minus Xβ2 + λβ2 (34)

SPARSE PRINCIPAL COMPONENT ANALYSIS 269

Let v = βridge

βridge then v = Vi

The theme of this simple theorem is to show the connection between PCA and aregression method Regressing PCs on variables was discussed in Cadima and Jolliffe(1995) where they focused on approximating PCs by a subset of k variables We extend itto a more general case of ridge regression in order to handle all kinds of data especiallygene expression data Obviously when n gt p and X is a full rank matrix the theorem doesnot require a positive λ Note that if p gt n and λ = 0 ordinary multiple regression hasno unique solution that is exactly Vi The same happens when n gt p and X is not a fullrank matrix However PCA always gives a unique solution in all situations As shown inTheorem 1 this indeterminancy is eliminated by the positive ridge penalty (λβ2) Notethat after normalization the coefficients are independent of λ therefore the ridge penalty isnot used to penalize the regression coefficients but to ensure the reconstruction of principalcomponents

Now let us add the $L_1$ penalty to (3.4) and consider the following optimization problem:
$$ \hat{\beta} = \arg\min_{\beta} \|Z_i - X\beta\|^2 + \lambda \|\beta\|^2 + \lambda_1 \|\beta\|_1, \qquad (3.5) $$
where $\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|$ is the 1-norm of $\beta$. We call $\hat{V}_i = \hat{\beta}/\|\hat{\beta}\|$ an approximation to $V_i$, and $X\hat{V}_i$ the $i$th approximated principal component. Zou and Hastie (2005) called (3.5) naive elastic net, which differs from the elastic net by a scaling factor $(1+\lambda)$. Since we are using the normalized fitted coefficients, the scaling factor does not affect $\hat{V}_i$. Clearly, large enough $\lambda_1$ gives a sparse $\hat{\beta}$, hence a sparse $\hat{V}_i$. Given a fixed $\lambda$, (3.5) is efficiently solved for all $\lambda_1$ by using the LARS-EN algorithm (Zou and Hastie 2005). Thus we can flexibly choose a sparse approximation to the $i$th principal component.

3.2 SPARSE PRINCIPAL COMPONENTS BASED ON THE SPCA CRITERION

Theorem 1 depends on the results of PCA, so it is not a genuine alternative. However, it can be used in a two-stage exploratory analysis: first perform PCA, then use (3.5) to find suitable sparse approximations.

We now present a "self-contained" regression-type criterion to derive PCs. Let $x_i$ denote the $i$th row vector of the matrix $X$. We first consider the leading principal component.

Theorem 2. For any $\lambda > 0$, let
$$ (\hat{\alpha}, \hat{\beta}) = \arg\min_{\alpha, \beta} \sum_{i=1}^{n} \|x_i - \alpha\beta^T x_i\|^2 + \lambda \|\beta\|^2 \qquad (3.6) $$
$$ \text{subject to } \|\alpha\|^2 = 1. $$
Then $\hat{\beta} \propto V_1$.

The next theorem extends Theorem 2 to derive the whole sequence of PCs.

Theorem 3. Suppose we are considering the first $k$ principal components. Let $A_{p\times k} = [\alpha_1, \ldots, \alpha_k]$ and $B_{p\times k} = [\beta_1, \ldots, \beta_k]$. For any $\lambda > 0$, let
$$ (\hat{A}, \hat{B}) = \arg\min_{A,B} \sum_{i=1}^{n} \|x_i - AB^T x_i\|^2 + \lambda \sum_{j=1}^{k} \|\beta_j\|^2 \qquad (3.7) $$
$$ \text{subject to } A^T A = I_{k\times k}. $$
Then $\hat{\beta}_j \propto V_j$ for $j = 1, 2, \ldots, k$.

Theorems 2 and 3 effectively transform the PCA problem to a regression-type problem.

The critical element is the objective function $\sum_{i=1}^{n} \|x_i - AB^T x_i\|^2$. If we restrict $B = A$, then
$$ \sum_{i=1}^{n} \|x_i - AB^T x_i\|^2 = \sum_{i=1}^{n} \|x_i - AA^T x_i\|^2, $$
whose minimizer under the orthonormal constraint on $A$ is exactly the first $k$ loading vectors of ordinary PCA. This formulation arises in the "closest approximating linear manifold" derivation of PCA (e.g., Hastie, Tibshirani, and Friedman 2001). Theorem 3 shows that we can still have exact PCA while relaxing the restriction $B = A$ and adding the ridge penalty term. As can be seen later, these generalizations enable us to flexibly modify PCA.

The proofs of Theorems 2 and 3 are given in the Appendix; here we give an intuitive explanation. Note that
$$ \sum_{i=1}^{n} \|x_i - AB^T x_i\|^2 = \|X - XBA^T\|^2. \qquad (3.8) $$
Since $A$ is orthonormal, let $A_\perp$ be any orthonormal matrix such that $[A; A_\perp]$ is $p \times p$ orthonormal. Then we have
$$ \|X - XBA^T\|^2 = \|XA_\perp\|^2 + \|XA - XB\|^2 \qquad (3.9) $$
$$ = \|XA_\perp\|^2 + \sum_{j=1}^{k} \|X\alpha_j - X\beta_j\|^2. \qquad (3.10) $$
Suppose $A$ is given; then the optimal $B$ minimizing (3.7) should minimize
$$ \arg\min_{B} \sum_{j=1}^{k} \left\{ \|X\alpha_j - X\beta_j\|^2 + \lambda \|\beta_j\|^2 \right\}, \qquad (3.11) $$
which is equivalent to $k$ independent ridge regression problems. In particular, if $A$ corresponds to the ordinary PCs, that is, $A = V$, then by Theorem 1 we know that $B$ should be proportional to $V$. Actually, the above view points out an effective algorithm for solving (3.7), which is revisited in the next section.

We carry on the connection between PCA and regression, and use the lasso approach to produce sparse loadings ("regression coefficients"). For that purpose, we add the lasso penalty into the criterion (3.7) and consider the following optimization problem:
$$ (\hat{A}, \hat{B}) = \arg\min_{A,B} \sum_{i=1}^{n} \|x_i - AB^T x_i\|^2 + \lambda \sum_{j=1}^{k} \|\beta_j\|^2 + \sum_{j=1}^{k} \lambda_{1,j} \|\beta_j\|_1 \qquad (3.12) $$
$$ \text{subject to } A^T A = I_{k\times k}. $$


Whereas the same $\lambda$ is used for all $k$ components, different $\lambda_{1,j}$'s are allowed for penalizing the loadings of different principal components. Again, if $p > n$, a positive $\lambda$ is required in order to get exact PCA when the sparsity constraint (the lasso penalty) vanishes ($\lambda_{1,j} = 0$). We call (3.12) the SPCA criterion hereafter.

3.3 NUMERICAL SOLUTION

We propose an alternating algorithm to minimize the SPCA criterion (3.12).

B given A: For each $j$, let $Y^*_j = X\alpha_j$. By the same analysis used in (3.9)–(3.11), we know that $\hat{B} = [\hat{\beta}_1, \ldots, \hat{\beta}_k]$, where each $\hat{\beta}_j$ is an elastic net estimate
$$ \hat{\beta}_j = \arg\min_{\beta_j} \|Y^*_j - X\beta_j\|^2 + \lambda \|\beta_j\|^2 + \lambda_{1,j} \|\beta_j\|_1. \qquad (3.13) $$

A given B: On the other hand, if $B$ is fixed, then we can ignore the penalty part in (3.12) and only try to minimize $\sum_{i=1}^{n} \|x_i - AB^T x_i\|^2 = \|X - XBA^T\|^2$ subject to $A^T A = I_{k\times k}$. The solution is obtained by a reduced rank form of the Procrustes rotation, given in Theorem 4 below. We compute the SVD
$$ (X^T X)B = UDV^T \qquad (3.14) $$
and set $\hat{A} = UV^T$.

Theorem 4 (Reduced Rank Procrustes Rotation). Let $M_{n\times p}$ and $N_{n\times k}$ be two matrices. Consider the constrained minimization problem
$$ \hat{A} = \arg\min_{A} \|M - NA^T\|^2 \quad \text{subject to } A^T A = I_{k\times k}. \qquad (3.15) $$
Suppose the SVD of $M^T N$ is $UDV^T$; then $\hat{A} = UV^T$.

The usual Procrustes rotation (e.g., Mardia, Kent, and Bibby 1979) has $N$ the same size as $M$.
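Theorem 4 translates directly into a few lines of code. The sketch below (Python/NumPy, function name ours) computes the reduced rank Procrustes rotation from the SVD of $M^T N$.

```python
import numpy as np

def procrustes_reduced_rank(M, N):
    """Theorem 4: argmin_A ||M - N A^T||^2 subject to A^T A = I_k.

    M is n x p, N is n x k; returns the p x k matrix A_hat = U V^T,
    where U D V^T is the SVD of M^T N.
    """
    U, _, Vt = np.linalg.svd(M.T @ N, full_matrices=False)
    return U @ Vt

# quick check on random matrices
rng = np.random.default_rng(1)
M, N = rng.normal(size=(30, 6)), rng.normal(size=(30, 3))
A = procrustes_reduced_rank(M, N)
print(np.allclose(A.T @ A, np.eye(3)))      # orthonormality holds
```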

It is worth pointing out that to solve (3.13), we only need to know the Gram matrix $X^T X$, because
$$ \|Y^*_j - X\beta_j\|^2 + \lambda \|\beta_j\|^2 + \lambda_{1,j} \|\beta_j\|_1 = (\alpha_j - \beta_j)^T X^T X (\alpha_j - \beta_j) + \lambda \|\beta_j\|^2 + \lambda_{1,j} \|\beta_j\|_1. \qquad (3.16) $$
The same is true of (3.14).

Now $\frac{1}{n} X^T X$ is the sample covariance matrix of $X$. Therefore if $\Sigma$, the covariance matrix of $X$, is known, we can replace $X^T X$ with $\Sigma$ in (3.16) and have a population version of SPCA. If $X$ is standardized beforehand, then we use the (sample) correlation matrix, which is preferred when the scales of the variables are different.


Although (3.16) (with $\Sigma$ instead of $X^T X$) is not quite an elastic net problem, we can easily turn it into one. Create the artificial response $Y^{**}$ and $X^{**}$ as follows:
$$ Y^{**} = \Sigma^{\frac{1}{2}} \alpha_j, \qquad X^{**} = \Sigma^{\frac{1}{2}}; \qquad (3.17) $$
then it is easy to check that
$$ \hat{\beta}_j = \arg\min_{\beta} \|Y^{**} - X^{**}\beta\|^2 + \lambda \|\beta\|^2 + \lambda_{1,j} \|\beta\|_1. \qquad (3.18) $$
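As an illustration of (3.17)–(3.18), the following sketch (Python/NumPy, names ours) forms a symmetric square root $\Sigma^{1/2}$ by eigendecomposition and checks that the quadratic form in (3.16) equals the squared error of the artificial regression.

```python
import numpy as np

def sym_sqrt(Sigma):
    """Symmetric square root of a positive semidefinite matrix."""
    w, Q = np.linalg.eigh(Sigma)
    return Q @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ Q.T

rng = np.random.default_rng(2)
B = rng.normal(size=(20, 5)); Sigma = B.T @ B          # an arbitrary covariance-like matrix
alpha, beta = rng.normal(size=5), rng.normal(size=5)

S12 = sym_sqrt(Sigma)                                  # X** = Sigma^{1/2}, Y** = Sigma^{1/2} alpha
lhs = np.sum((S12 @ alpha - S12 @ beta) ** 2)          # ||Y** - X** beta||^2
rhs = (alpha - beta) @ Sigma @ (alpha - beta)          # quadratic form of (3.16)
print(np.isclose(lhs, rhs))                            # True
```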

Algorithm 1 summarizes the steps of our SPCA procedure outlined above.

Algorithm 1. General SPCA Algorithm
1. Let A start at $V[, 1:k]$, the loadings of the first $k$ ordinary principal components.
2. Given a fixed $A = [\alpha_1, \ldots, \alpha_k]$, solve the following elastic net problem for $j = 1, 2, \ldots, k$:
$$ \hat{\beta}_j = \arg\min_{\beta} (\alpha_j - \beta)^T X^T X (\alpha_j - \beta) + \lambda \|\beta\|^2 + \lambda_{1,j} \|\beta\|_1. $$
3. For a fixed $B = [\hat{\beta}_1, \ldots, \hat{\beta}_k]$, compute the SVD of $X^T X B = UDV^T$, then update $A = UV^T$.
4. Repeat Steps 2–3 until convergence.
5. Normalization: $\hat{V}_j = \hat{\beta}_j / \|\hat{\beta}_j\|$, $j = 1, \ldots, k$.

Some remarks:
1. Empirical evidence suggests that the output of the above algorithm does not change much as $\lambda$ is varied. For $n > p$ data, the default choice of $\lambda$ can be zero. Practically, $\lambda$ is chosen to be a small positive number to overcome potential collinearity problems in $X$. Section 4 discusses the default choice of $\lambda$ for data with thousands of variables, such as gene expression arrays.
2. In principle, we can try several combinations of $\{\lambda_{1,j}\}$ to figure out a good choice of the tuning parameters, since the above algorithm converges quite fast. There is a shortcut provided by the direct sparse approximation (3.5). The LARS-EN algorithm efficiently delivers a whole sequence of sparse approximations for each PC and the corresponding values of $\lambda_{1,j}$. Hence we can pick a $\lambda_{1,j}$ that gives a good compromise between variance and sparsity. When facing the variance-sparsity trade-off, we let variance have a higher priority.
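As a rough numerical sketch of Algorithm 1, the code below alternates the elastic net step (solved here by a simple coordinate descent on the Gram matrix, as in (3.16), rather than by LARS-EN) with the Procrustes update of A. A fixed number of iterations stands in for a convergence check, and all function names are ours; working with the Gram matrix reflects the $n > p$ strategy discussed in Section 3.5.

```python
import numpy as np

def soft(z, gamma):
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def enet_gram(G, Galpha, lam, lam1, n_iter=100):
    """Coordinate descent for (alpha-b)' G (alpha-b) + lam ||b||^2 + lam1 ||b||_1,
    using only the Gram matrix G = X'X (Galpha = G @ alpha)."""
    p = G.shape[0]
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            rj = Galpha[j] - G[j] @ b + G[j, j] * b[j]   # exclude coordinate j
            b[j] = soft(rj, lam1 / 2.0) / (G[j, j] + lam)
    return b

def spca(X, k, lam, lam1, n_outer=50):
    """A sketch of Algorithm 1 (general SPCA); lam1 is a length-k sequence."""
    G = X.T @ X
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    A = Vt[:k].T                                        # step 1: ordinary loadings
    B = np.zeros_like(A)
    for _ in range(n_outer):
        for j in range(k):                              # step 2: k elastic net fits
            B[:, j] = enet_gram(G, G @ A[:, j], lam, lam1[j])
        U, _, Wt = np.linalg.svd(G @ B, full_matrices=False)
        A = U @ Wt                                      # step 3: Procrustes update
    norms = np.linalg.norm(B, axis=0)
    return B / np.where(norms > 0, norms, 1.0)          # step 5: normalize loadings
```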

3.4 ADJUSTED TOTAL VARIANCE

The ordinary principal components are uncorrelated and their loadings are orthogonal. Let $\hat{\Sigma} = X^T X$; then $V^T V = I_k$ and $V^T \hat{\Sigma} V$ is diagonal. It is easy to check that it is only for ordinary principal components that the loadings can satisfy both conditions. In Jolliffe, Trendafilov, and Uddin (2003) the loadings were forced to be orthogonal, so the uncorrelated property was sacrificed. SPCA does not explicitly impose the uncorrelated components condition either.


Let $\hat{Z}$ be the modified PCs. Usually the total variance explained by $\hat{Z}$ is calculated by $\mathrm{tr}(\hat{Z}^T \hat{Z})$. This is reasonable when $\hat{Z}$ are uncorrelated. However, if they are correlated, $\mathrm{tr}(\hat{Z}^T \hat{Z})$ is too optimistic for representing the total variance. Suppose $(\hat{Z}_i, i = 1, 2, \ldots, k)$ are the first $k$ modified PCs by any method, and the $(k+1)$th modified PC $\hat{Z}_{k+1}$ is obtained. We want to compute the total variance explained by the first $k+1$ modified PCs, which should be the sum of the explained variance by the first $k$ modified PCs and the additional variance from $\hat{Z}_{k+1}$. If $\hat{Z}_{k+1}$ is correlated with $(\hat{Z}_i, i = 1, 2, \ldots, k)$, then its variance contains contributions from $(\hat{Z}_i, i = 1, 2, \ldots, k)$, which should not be included into the total variance given the presence of $(\hat{Z}_i, i = 1, 2, \ldots, k)$.

Here we propose a new formula to compute the total variance explained by $\hat{Z}$, which takes into account the correlations among $\hat{Z}$. We use regression projection to remove the linear dependence between correlated components. Denote by $\hat{Z}_{j\cdot 1,\ldots,j-1}$ the residual after adjusting $\hat{Z}_j$ for $\hat{Z}_1, \ldots, \hat{Z}_{j-1}$, that is,
$$ \hat{Z}_{j\cdot 1,\ldots,j-1} = \hat{Z}_j - H_{1,\ldots,j-1}\hat{Z}_j, \qquad (3.19) $$
where $H_{1,\ldots,j-1}$ is the projection matrix on $\{\hat{Z}_i\}_{1}^{j-1}$. Then the adjusted variance of $\hat{Z}_j$ is $\|\hat{Z}_{j\cdot 1,\ldots,j-1}\|^2$, and the total explained variance is defined as $\sum_{j=1}^{k} \|\hat{Z}_{j\cdot 1,\ldots,j-1}\|^2$. When the modified PCs $\hat{Z}$ are uncorrelated, the new formula agrees with $\mathrm{tr}(\hat{Z}^T \hat{Z})$.

Note that the above computations depend on the order of $\hat{Z}_i$. However, since we have a natural order in PCA, ordering is not an issue here. Using the QR decomposition, we can easily compute the adjusted variance. Suppose $\hat{Z} = QR$, where $Q$ is orthonormal and $R$ is upper triangular. Then it is straightforward to see that
$$ \|\hat{Z}_{j\cdot 1,\ldots,j-1}\|^2 = R_{jj}^2. \qquad (3.20) $$
Hence the explained total variance is equal to $\sum_{j=1}^{k} R_{jj}^2$.
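The QR computation in (3.20) is essentially a one-liner; the sketch below (Python/NumPy, function name ours) returns the adjusted variances of the columns of $\hat{Z}$.

```python
import numpy as np

def adjusted_variance(Z):
    """Adjusted variances of modified PCs (columns of Z), via (3.19)-(3.20).

    Returns R[j,j]^2 from the QR decomposition Z = QR, i.e. the variance of
    each component after projecting out the components that precede it.
    """
    _, R = np.linalg.qr(Z)
    return np.diag(R) ** 2

# the adjusted total never exceeds the naive total tr(Z'Z)
rng = np.random.default_rng(3)
Z = rng.normal(size=(100, 3))
print(adjusted_variance(Z).sum() <= np.trace(Z.T @ Z) + 1e-9)   # True
```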

3.5 COMPUTATION COMPLEXITY

PCA is computationally efficient for both $n > p$ and $p \gg n$ data. We separately discuss the computational cost of the general SPCA algorithm for $n > p$ and $p \gg n$.

1. $n > p$. Traditional multivariate data fit in this category. Note that although the SPCA criterion is defined using $X$, it only depends on $X$ via $X^T X$. A trick is to first compute the $p \times p$ matrix $\hat{\Sigma} = X^T X$ once for all, which requires $np^2$ operations. Then the same $\hat{\Sigma}$ is used at each step within the loop. Computing $X^T X B$ costs $p^2 k$, and the SVD of $X^T X B$ is of order $O(pk^2)$. Each elastic net solution requires at most $O(p^3)$ operations. Since $k \le p$, the total computation cost is at most $np^2 + mO(p^3)$, where $m$ is the number of iterations before convergence. Therefore the SPCA algorithm is able to efficiently handle data with huge $n$, as long as $p$ is small (say $p < 100$).

2. $p \gg n$. Gene expression arrays are typical examples in this $p \gg n$ category. The trick of using $\hat{\Sigma}$ is no longer applicable, because $\hat{\Sigma}$ is a huge matrix ($p \times p$) in this case. The most consuming step is solving each elastic net, whose cost is of order $O(pnJ + J^3)$ for a positive finite $\lambda$, where $J$ is the number of nonzero coefficients. Generally speaking, the total cost is of order $mkO(pJn + J^3)$, which can be expensive for large $J$ and $p$. Fortunately, as shown in the next section, there exists a special SPCA algorithm for efficiently dealing with $p \gg n$ data.

4. SPCA FOR $p \gg n$ AND GENE EXPRESSION ARRAYS

For gene expression arrays the number of variables (genes) is typically much bigger than the number of samples (e.g., $p = 10{,}000$ and $n = 100$). Our general SPCA algorithm still fits this situation using a positive $\lambda$. However, the computational cost is expensive when requiring a large number of nonzero loadings. It is desirable to simplify the general SPCA algorithm to boost the computation.

Observe that Theorem 3 is valid for all $\lambda > 0$, so in principle we can use any positive $\lambda$. It turns out that a thrifty solution emerges if $\lambda \to \infty$. Precisely, we have the following theorem.

Theorem 5. Let $\hat{V}_j(\lambda) = \hat{\beta}_j / \|\hat{\beta}_j\|$ ($j = 1, \ldots, k$) be the loadings derived from criterion (3.12). Let $(\hat{A}, \hat{B})$ be the solution of the optimization problem
$$ (\hat{A}, \hat{B}) = \arg\min_{A,B} \; -2\,\mathrm{tr}\!\left(A^T X^T X B\right) + \sum_{j=1}^{k} \|\beta_j\|^2 + \sum_{j=1}^{k} \lambda_{1,j} \|\beta_j\|_1 \qquad (4.1) $$
$$ \text{subject to } A^T A = I_{k\times k}. $$
When $\lambda \to \infty$, $\hat{V}_j(\lambda) \to \hat{\beta}_j / \|\hat{\beta}_j\|$.

We can use the same alternating algorithm in Section 3.3 to solve (4.1), where we only need to replace the general elastic net problem with its special case ($\lambda = \infty$). Note that given $A$,
$$ \hat{\beta}_j = \arg\min_{\beta_j} \; -2\alpha_j^T (X^T X)\beta_j + \|\beta_j\|^2 + \lambda_{1,j} \|\beta_j\|_1, \qquad (4.2) $$
which has an explicit form solution given in (4.3).

Gene Expression Arrays SPCA Algorithm. Replace Step 2 in the general SPCA algorithm with
Step 2*: for $j = 1, 2, \ldots, k$,
$$ \hat{\beta}_j = \left( \left| \alpha_j^T X^T X \right| - \frac{\lambda_{1,j}}{2} \right)_{+} \mathrm{Sign}(\alpha_j^T X^T X). \qquad (4.3) $$

The operation in (4.3) is called soft-thresholding. Figure 1 gives an illustration of how the soft-thresholding rule operates. Recently soft-thresholding has become increasingly popular in the literature. For example, nearest shrunken centroids (Tibshirani, Hastie, Narasimhan, and Chu 2002) adopts the soft-thresholding rule to simultaneously classify samples and select important genes in microarrays.
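Step 2* amounts to one matrix multiplication followed by componentwise soft-thresholding. A possible vectorized form is sketched below (Python/NumPy, function name ours); note that $X^T X \alpha_j$ is computed as $X^T(X\alpha_j)$, so the $p \times p$ Gram matrix is never formed, which is what makes the update cheap when $p \gg n$.

```python
import numpy as np

def spca_step2_soft_threshold(X, A, lam1):
    """Step 2* of the gene-expression SPCA algorithm, i.e. update (4.3):
    beta_j = (|alpha_j' X'X| - lam1_j / 2)_+ * sign(alpha_j' X'X)."""
    XtXA = X.T @ (X @ A)                       # p x k, without forming X'X explicitly
    lam1 = np.asarray(lam1, dtype=float).reshape(1, -1)   # one threshold per component
    return np.sign(XtXA) * np.maximum(np.abs(XtXA) - lam1 / 2.0, 0.0)
```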


Figure 1. An illustration of the soft-thresholding rule $y = (|x| - \Delta)_{+}\,\mathrm{Sign}(x)$ with $\Delta = 1$.

5. EXAMPLES

5.1 PITPROPS DATA

The pitprops data, first introduced by Jeffers (1967), has 180 observations and 13 measured variables. It is a classic example showing the difficulty of interpreting principal components. Jeffers (1967) tried to interpret the first six PCs. Jolliffe, Trendafilov, and Uddin (2003) used their SCoTLASS to find the modified PCs. Table 1 presents the results of PCA, while Table 2 presents the modified PC loadings as computed by SCoTLASS and the adjusted variance computed using (3.20).

As a demonstration, we also considered the first six principal components. Since this is a usual $n \gg p$ dataset, we set $\lambda = 0$. $\lambda_1 = (0.06, 0.16, 0.1, 0.5, 0.5, 0.5)$ were chosen according to Figure 2 such that each sparse approximation explained almost the same amount of variance as the ordinary PC did. Table 3 shows the obtained sparse loadings and the corresponding adjusted variance. Compared with the modified PCs of SCoTLASS, PCs by SPCA account for a larger amount of variance (75.8% vs. 69.3%) with a much sparser loading structure. The important variables associated with the six PCs do not overlap, which further makes the interpretations easier and clearer. It is interesting to note that in Table 3, even though the variance does not strictly monotonically decrease, the adjusted variance follows the right order. However, Table 2 shows that this is not true in SCoTLASS. It is also worthy of mention that the entire SPCA computation was done in seconds in R, while the implementation of SCoTLASS for each $t$ was expensive (Jolliffe, Trendafilov, and Uddin 2003). Optimizing SCoTLASS over several values of $t$ is an even more difficult computational challenge.



Table 1. Pitprops Data: Loadings of the First Six Principal Components

Variable     PC1      PC2      PC3      PC4      PC5      PC6
topdiam   -0.404    0.218   -0.207    0.091   -0.083    0.120
length    -0.406    0.186   -0.235    0.103   -0.113    0.163
moist     -0.124    0.541    0.141   -0.078    0.350   -0.276
testsg    -0.173    0.456    0.352   -0.055    0.356   -0.054
ovensg    -0.057   -0.170    0.481   -0.049    0.176    0.626
ringtop   -0.284   -0.014    0.475    0.063   -0.316    0.052
ringbut   -0.400   -0.190    0.253    0.065   -0.215    0.003
bowmax    -0.294   -0.189   -0.243   -0.286    0.185   -0.055
bowdist   -0.357    0.017   -0.208   -0.097   -0.106    0.034
whorls    -0.379   -0.248   -0.119    0.205    0.156   -0.173
clear      0.011    0.205   -0.070   -0.804   -0.343    0.175
knots      0.115    0.343    0.092    0.301   -0.600   -0.170
diaknot    0.113    0.309   -0.326    0.303    0.080    0.626

Variance (%)               32.4   18.3   14.4    8.5    7.0    6.3
Cumulative variance (%)    32.4   50.7   65.1   73.6   80.6   86.9

Table 2. Pitprops Data: Loadings of the First Six Modified PCs by SCoTLASS ($t = 1.75$). Empty cells have zero loadings.

Variable     PC1      PC2      PC3      PC4      PC5      PC6
topdiam    0.664   -0.025    0.002   -0.035
length     0.683   -0.001   -0.040    0.001   -0.018
moist      0.641    0.195    0.180   -0.030
testsg     0.701    0.001   -0.001
ovensg    -0.887   -0.056
ringtop    0.293   -0.186   -0.373    0.044
ringbut    0.001    0.107   -0.658   -0.051    0.064
bowmax     0.001    0.735    0.021   -0.168
bowdist    0.283   -0.001
whorls     0.113   -0.001    0.388   -0.017    0.320
clear     -0.923
knots      0.001   -0.554    0.016    0.004
diaknot    0.703    0.001   -0.197    0.080

Number of nonzero loadings              6      6      6      6     10     13
Variance (%)                         19.6   16.0   13.1   13.1    9.2    9.0
Adjusted variance (%)                19.6   13.8   12.4    8.0    7.1    8.4
Cumulative adjusted variance (%)     19.6   33.4   45.8   53.8   60.9   69.3


Figure 2. Pitprops data. The sequences of sparse approximations to the first six principal components. The curves show the percentage of explained variance (PEV) as a function of λ1. The vertical broken lines indicate the choice of λ1 used in our SPCA analysis.


Table 3. Pitprops Data: Loadings of the First Six Sparse PCs by SPCA. Empty cells have zero loadings.

Variable     PC1      PC2      PC3      PC4      PC5      PC6
topdiam   -0.477
length    -0.476
moist               0.785
testsg              0.620
ovensg              0.177    0.640
ringtop                      0.589
ringbut   -0.250             0.492
bowmax    -0.344            -0.021
bowdist   -0.416
whorls    -0.400
clear                                   -1
knots               0.013                        -1
diaknot   -0.015                                           1

Number of nonzero loadings              7      4      4      1      1      1
Variance (%)                         28.0   14.4   15.0    7.7    7.7    7.7
Adjusted variance (%)                28.0   14.0   13.3    7.4    6.8    6.2
Cumulative adjusted variance (%)     28.0   42.0   55.3   62.7   69.5   75.8

Although the informal thresholding method, which we henceforth refer to as simple thresholding, has various drawbacks, it may serve as the benchmark for testing sparse PC methods. A variant of simple thresholding is soft-thresholding. We found that, when used in PCA, soft-thresholding performs very similarly to simple thresholding. Thus we omitted the results of soft-thresholding in this article. Both SCoTLASS and SPCA were compared with simple thresholding. Table 4 presents the loadings and the corresponding variance explained by simple thresholding. To make the comparisons fair, we let the numbers of nonzero loadings obtained by simple thresholding match the results of SCoTLASS and SPCA, as shown in the top and bottom parts of Table 4, respectively. In terms of variance, it seems that simple thresholding is better than SCoTLASS and worse than SPCA. Moreover, the variables with nonzero loadings by SPCA are different from those chosen by simple thresholding for the first three PCs, while SCoTLASS seems to create a similar sparseness pattern to simple thresholding, especially in the leading PC.

5.2 A SYNTHETIC EXAMPLE

Our synthetic example has three hidden factors:
$$ V_1 \sim N(0, 290), \qquad V_2 \sim N(0, 300), $$
$$ V_3 = -0.3\,V_1 + 0.925\,V_2 + \epsilon, \qquad \epsilon \sim N(0, 1), $$
where $V_1$, $V_2$, and $\epsilon$ are independent.

Then 10 observable variables are constructed as follows:
$$ X_i = V_1 + \epsilon_i^1, \quad \epsilon_i^1 \sim N(0, 1), \quad i = 1, 2, 3, 4, $$
$$ X_i = V_2 + \epsilon_i^2, \quad \epsilon_i^2 \sim N(0, 1), \quad i = 5, 6, 7, 8, $$
$$ X_i = V_3 + \epsilon_i^3, \quad \epsilon_i^3 \sim N(0, 1), \quad i = 9, 10, $$
with the $\epsilon_i^j$ independent for $j = 1, 2, 3$, $i = 1, \ldots, 10$.

We used the exact covariance matrix of $(X_1, \ldots, X_{10})$ to perform PCA, SPCA, and simple thresholding (in the population setting).
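The population covariance matrix used here is fully determined by the factor model above. A small sketch (Python/NumPy, variable names ours) that builds this matrix and extracts the ordinary PCs reported in Table 5 is given below; the population version of SPCA would then replace $X^T X$ by this $\Sigma$, as described in Section 3.3.

```python
import numpy as np

# Covariance of the three factors: Var(V1)=290, Var(V2)=300, V3 = -0.3*V1 + 0.925*V2 + eps.
C = np.zeros((3, 3))
C[0, 0], C[1, 1] = 290.0, 300.0
C[2, 2] = 0.3**2 * 290 + 0.925**2 * 300 + 1.0      # about 283.8
C[0, 2] = C[2, 0] = -0.3 * 290.0
C[1, 2] = C[2, 1] = 0.925 * 300.0

# Loading matrix maps the factors to the 10 X's (4 on V1, 4 on V2, 2 on V3).
L = np.zeros((10, 3))
L[0:4, 0] = 1.0
L[4:8, 1] = 1.0
L[8:10, 2] = 1.0
Sigma = L @ C @ L.T + np.eye(10)                    # add the unit noise variances

# Population PCA: eigenvectors of Sigma give the PCA loadings of Table 5.
eigval, eigvec = np.linalg.eigh(Sigma)
order = np.argsort(eigval)[::-1]
print(np.round(eigval[order][:3], 1))               # leading variances
print(np.round(eigvec[:, order[:2]], 3))            # first two loading vectors
```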

Table 4. Pitprops Data: Loadings of the First Six Modified PCs by Simple Thresholding. Empty cells have zero loadings.

Simple thresholding vs. SCoTLASS

Variable     PC1      PC2      PC3      PC4      PC5      PC6
topdiam   -0.439    0.240                                0.120
length    -0.441                      0.105   -0.114    0.163
moist               0.596                      0.354   -0.276
testsg              0.503    0.391             0.360   -0.054
ovensg                       0.534             0.178    0.626
ringtop                      0.528            -0.320    0.052
ringbut   -0.435             0.281            -0.218    0.003
bowmax    -0.319            -0.270   -0.291    0.188   -0.055
bowdist   -0.388                                         0.034
whorls    -0.412   -0.274             0.209    0.158   -0.173
clear                                -0.819   -0.347    0.175
knots               0.378             0.307   -0.608   -0.170
diaknot             0.340   -0.362    0.309              0.626

Number of nonzero loadings              6      6      6      6     10     13
Variance (%)                         28.9   16.1   15.4    8.4    7.1    6.3
Adjusted variance (%)                28.9   16.1   13.9    8.2    6.9    6.2
Cumulative adjusted variance (%)     28.9   45.0   58.9   67.1   74.0   80.2

Simple thresholding vs. SPCA

Variable     PC1      PC2      PC3      PC4      PC5      PC6
topdiam   -0.420
length    -0.422
moist               0.640
testsg              0.540    0.425
ovensg                       0.580
ringtop   -0.296             0.573
ringbut   -0.416
bowmax    -0.305
bowdist   -0.370
whorls    -0.394
clear                                   -1
knots               0.406                        -1
diaknot             0.365   -0.393                         1

Number of nonzero loadings              7      4      4      1      1      1
Variance (%)                         30.7   14.8   13.6    7.7    7.7    7.7
Adjusted variance (%)                30.7   14.7   11.1    7.6    5.2    3.6
Cumulative adjusted variance (%)     30.7   45.4   56.5   64.1   68.3   71.9


Table 5. Results of the Simulation Example: Loadings and Variance

                        PCA                     SPCA (λ = 0)      Simple Thresholding
              PC1      PC2      PC3            PC1      PC2          PC1      PC2
X1          0.116   -0.478   -0.087              0      0.5            0     -0.5
X2          0.116   -0.478   -0.087              0      0.5            0     -0.5
X3          0.116   -0.478   -0.087              0      0.5            0     -0.5
X4          0.116   -0.478   -0.087              0      0.5            0     -0.5
X5         -0.395   -0.145    0.270            0.5      0              0      0
X6         -0.395   -0.145    0.270            0.5      0              0      0
X7         -0.395   -0.145    0.270            0.5      0           -0.497    0
X8         -0.395   -0.145    0.270            0.5      0           -0.497    0
X9         -0.401    0.010   -0.582              0      0           -0.503    0
X10        -0.401    0.010   -0.582              0      0           -0.503    0

Adjusted
variance (%)  60.0    39.6     0.08           40.9    39.5           38.8   38.6

The variance of the three underlying factors is 290, 300, and 283.8, respectively. The numbers of variables associated with the three factors are 4, 4, and 2. Therefore $V_2$ and $V_1$ are almost equally important, and they are much more important than $V_3$. The first two PCs together explain 99.6% of the total variance. These facts suggest that we only need to consider two derived variables with "correct" sparse representations. Ideally, the first derived variable should recover the factor $V_2$ only using $(X_5, X_6, X_7, X_8)$, and the second derived variable should recover the factor $V_1$ only using $(X_1, X_2, X_3, X_4)$. In fact, if we sequentially maximize the variance of the first two derived variables under the orthonormal constraint, while restricting the numbers of nonzero loadings to four, then the first derived variable uniformly assigns nonzero loadings on $(X_5, X_6, X_7, X_8)$ and the second derived variable uniformly assigns nonzero loadings on $(X_1, X_2, X_3, X_4)$.

Both SPCA ($\lambda = 0$) and simple thresholding were carried out by using the oracle information that the ideal sparse representations use only four variables. Table 5 summarizes the comparison results. Clearly, SPCA correctly identifies the sets of important variables. In fact, SPCA delivers the ideal sparse representations of the first two principal components. Mathematically, it is easy to show that if $t = 2$ is used, SCoTLASS is also able to find the same sparse solution. In this example, both SPCA and SCoTLASS produce the ideal sparse PCs, which may be explained by the fact that both methods explicitly use the lasso penalty.

In contrast, simple thresholding incorrectly includes $X_9, X_{10}$ in the most important variables. The variance explained by simple thresholding is also lower than that by SPCA, although the relative difference is small (less than 5%). Due to the high correlation between $V_2$ and $V_3$, variables $X_9, X_{10}$ achieve loadings which are even higher than those of the true important variables $(X_5, X_6, X_7, X_8)$. Thus the truth is disguised by the high correlation. On the other hand, simple thresholding correctly discovers the second factor, because $V_1$ has a low correlation with $V_3$.

5.3 RAMASWAMY DATA

An important task in microarray analysis is to find a set of genes which are biologically relevant to the outcome (e.g., tumor type or survival time). PCA (or SVD) has been a popular tool for this purpose. Many gene-clustering methods in the literature use PCA (or SVD) as a building block. For example, gene shaving (Hastie et al. 2000) uses an iterative principal component shaving algorithm to identify subsets of coherent genes. Here we consider another approach to gene selection through SPCA. The idea is intuitive: if the (sparse) principal component can explain a large part of the total variance of gene expression levels, then the subset of genes representing the principal component are considered important.

Figure 3. The sparse leading principal component: percentage of explained variance versus sparsity. Simple thresholding and SPCA have similar performances. However, there still exists consistent difference in the selected genes (the ones with nonzero loadings).

We illustrate the sparse PC selection method on Ramaswamy's data (Ramaswamy et al. 2001), which has 16,063 ($p = 16{,}063$) genes and 144 ($n = 144$) samples. Its first principal component explains 46% of the total variance. For microarray data like this, it appears that SCoTLASS cannot be practically useful for finding sparse PCs. We applied SPCA ($\lambda = \infty$) to find the leading sparse PC. A sequence of values for $\lambda_1$ were used such that the number of nonzero loadings varied over a wide range. As displayed in Figure 3, the percentage of explained variance decreases at a slow rate as the sparsity increases. As few as 2.5% of these 16,063 genes can sufficiently construct the leading principal component, with an affordable loss of explained variance (from 46% to 40%). Simple thresholding was also applied to this data. It seems that when using the same number of genes, simple thresholding always explains slightly higher variance than SPCA does. Among the same number of selected genes by SPCA and simple thresholding, there are about 2% different genes, and this difference rate is quite consistent.


6. DISCUSSION

It has been an interesting research topic for years to derive principal components with sparse loadings. From a practical point of view, a good method to achieve the sparseness goal should (at least) possess the following properties:

• Without any sparsity constraint, the method should reduce to PCA.
• It should be computationally efficient for both small p and big p data.
• It should avoid misidentifying the important variables.

The often-used simple thresholding approach is not criterion based. However, this informal method seems to possess the first two of the desirable properties listed above. If the explained variance and sparsity are the only concerns, simple thresholding is a reasonable approach, and it is extremely convenient. We have shown that simple thresholding can work well with gene expression arrays. The serious problem with simple thresholding is that it can misidentify the real important variables. Nevertheless, simple thresholding is regarded as a benchmark for any potentially better method.

Using the lasso constraint in PCA, SCoTLASS successfully derives sparse loadings. However, SCoTLASS is not computationally efficient, and it lacks a good rule to pick its tuning parameter. In addition, it is not feasible to apply SCoTLASS to gene expression arrays, where PCA is a quite popular tool.

In this work we have developed SPCA using our SPCA criterion (3.12). This new criterion gives exact PCA results when its sparsity (lasso) penalty term vanishes. SPCA allows flexible control on the sparse structure of the resulting loadings. Unified efficient algorithms have been proposed to compute SPCA solutions for both regular multivariate data and gene expression arrays. As a principled procedure, SPCA enjoys advantages in several aspects, including computational efficiency, high explained variance, and an ability to identify important variables.

Software in R for fitting the SPCA model (and elastic net models) is available in the CRAN contributed package elasticnet.

APPENDIX: PROOFS

Proof of Theorem 1: Using $X^T X = VD^2V^T$ and $V^T V = I$, we have
$$ \hat{\beta}_{ridge} = \left( X^T X + \lambda I \right)^{-1} X^T (X V_i) = V_i \frac{D_{ii}^2}{D_{ii}^2 + \lambda}. \qquad (A.1) $$
Hence $\hat{v} = V_i$.

Note that since Theorem 2 is a special case of Theorem 3, we will not prove it separately. We first provide a lemma.

Lemma 1. Consider the ridge regression criterion
$$ C_\lambda(\beta) = \|y - X\beta\|^2 + \lambda\|\beta\|^2. $$
Then if $\hat{\beta} = \arg\min_\beta C_\lambda(\beta)$,
$$ C_\lambda(\hat{\beta}) = y^T (I - S_\lambda) y, $$
where $S_\lambda$ is the ridge operator
$$ S_\lambda = X(X^T X + \lambda I)^{-1} X^T. $$

Proof of Lemma 1: Differentiating $C_\lambda$ with respect to $\beta$, we get that
$$ -X^T (y - X\hat{\beta}) + \lambda\hat{\beta} = 0. $$
Premultiplication by $\hat{\beta}^T$ and re-arrangement gives $\lambda\|\hat{\beta}\|^2 = (y - X\hat{\beta})^T X\hat{\beta}$. Since
$$ \|y - X\hat{\beta}\|^2 = (y - X\hat{\beta})^T y - (y - X\hat{\beta})^T X\hat{\beta}, $$
$C_\lambda(\hat{\beta}) = (y - X\hat{\beta})^T y$. The result follows since the "fitted values" $X\hat{\beta} = S_\lambda y$.

Proof of Theorem 3: We use the notation introduced in Section 3: $A = [\alpha_1, \ldots, \alpha_k]$ and $B = [\beta_1, \ldots, \beta_k]$. Let
$$ C_\lambda(A, B) = \sum_{i=1}^{n} \|x_i - AB^T x_i\|^2 + \lambda \sum_{j=1}^{k} \|\beta_j\|^2. $$
As in (3.9), we have
$$ \sum_{i=1}^{n} \|x_i - AB^T x_i\|^2 = \|X - XBA^T\|^2 \qquad (A.2) $$
$$ = \|XA_\perp\|^2 + \|XA - XB\|^2. \qquad (A.3) $$
Hence, with $A$ fixed, solving
$$ \arg\min_{B} C_\lambda(A, B) $$
is equivalent to solving the series of ridge regressions
$$ \arg\min_{\{\beta_j\}_1^k} \sum_{j=1}^{k} \left\{ \|X\alpha_j - X\beta_j\|^2 + \lambda \|\beta_j\|^2 \right\}. $$
It is easy to show that
$$ \hat{B} = (X^T X + \lambda I)^{-1} X^T X A, \qquad (A.4) $$
and using Lemma 1 and (A.2), we have that the partially optimized penalized criterion is given by
$$ C_\lambda(A, \hat{B}) = \|XA_\perp\|^2 + \mathrm{tr}\!\left((XA)^T (I - S_\lambda)(XA)\right). \qquad (A.5) $$
Rearranging the terms, we get
$$ C_\lambda(A, \hat{B}) = \mathrm{tr}(X^T X) - \mathrm{tr}(A^T X^T S_\lambda X A), \qquad (A.6) $$
which must be minimized with respect to $A$ with $A^T A = I$. Hence $A$ should be taken to be the largest $k$ eigenvectors of $X^T S_\lambda X$. If the SVD of $X$ is $X = UDV^T$, it is easy to show that $X^T S_\lambda X = VD^2(D^2 + \lambda I)^{-1}D^2 V^T$; hence $\hat{A} = V[, 1:k]$. Likewise, plugging the SVD of $X$ into (A.4), we see that each of the $\hat{\beta}_j$ is a scaled element of the corresponding $V_j$.

Proof of Theorem 4: We expand the matrix norm
$$ \|M - NA^T\|^2 = \mathrm{tr}(M^T M) - 2\,\mathrm{tr}(M^T N A^T) + \mathrm{tr}(A N^T N A^T). \qquad (A.7) $$
Since $A^T A = I$, the last term is equal to $\mathrm{tr}(N^T N)$, and hence we need to maximize (minus half) the middle term. With the SVD $M^T N = UDV^T$, this middle term becomes
$$ \mathrm{tr}(M^T N A^T) = \mathrm{tr}(UDV^T A^T) \qquad (A.8) $$
$$ = \mathrm{tr}(UDA^{*T}) \qquad (A.9) $$
$$ = \mathrm{tr}(A^{*T} UD), \qquad (A.10) $$
where $A^* = AV$, and since $V$ is $k \times k$ orthonormal, $A^{*T} A^* = I$. Now since $D$ is diagonal, (A.10) is maximized when the diagonal of $A^{*T} U$ is positive and maximum. By the Cauchy–Schwarz inequality, this is achieved when $A^* = U$, in which case the diagonal elements are all 1. Hence $\hat{A} = UV^T$.

Proof of Theorem 5: Let $\hat{B}^* = [\hat{\beta}^*_1, \ldots, \hat{\beta}^*_k]$ with $\hat{\beta}^* = (1+\lambda)\hat{\beta}$; then $\hat{V}_i(\lambda) = \hat{\beta}^*_i / \|\hat{\beta}^*_i\|$. On the other hand, $\hat{\beta} = \hat{\beta}^*/(1+\lambda)$ means that
$$ (\hat{A}, \hat{B}^*) = \arg\min_{A,B} C_{\lambda,\lambda_1}(A, B) \quad \text{subject to } A^T A = I_{k\times k}, \qquad (A.11) $$
where
$$ C_{\lambda,\lambda_1}(A, B) = \sum_{i=1}^{n} \Big\|x_i - A\frac{B^T}{1+\lambda}x_i\Big\|^2 + \lambda \sum_{j=1}^{k} \Big\|\frac{\beta_j}{1+\lambda}\Big\|^2 + \sum_{j=1}^{k} \lambda_{1,j} \Big\|\frac{\beta_j}{1+\lambda}\Big\|_1. \qquad (A.12) $$
Since
$$ \sum_{j=1}^{k} \Big\|\frac{\beta_j}{1+\lambda}\Big\|^2 = \frac{1}{(1+\lambda)^2}\,\mathrm{tr}(B^T B) $$
and
$$ \sum_{i=1}^{n} \Big\|x_i - A\frac{B^T}{1+\lambda}x_i\Big\|^2 = \mathrm{tr}\!\left(\Big(X - X\frac{B}{1+\lambda}A^T\Big)^T \Big(X - X\frac{B}{1+\lambda}A^T\Big)\right) = \mathrm{tr}\!\left(X^T X\right) + \frac{1}{(1+\lambda)^2}\,\mathrm{tr}\!\left(B^T X^T X B\right) - \frac{2}{1+\lambda}\,\mathrm{tr}\!\left(A^T X^T X B\right), $$
we have
$$ C_{\lambda,\lambda_1}(A, B) = \mathrm{tr}\!\left(X^T X\right) + \frac{1}{1+\lambda}\left[ \mathrm{tr}\!\left(B^T \frac{X^T X + \lambda I}{1+\lambda} B\right) - 2\,\mathrm{tr}\!\left(A^T X^T X B\right) + \sum_{j=1}^{k} \lambda_{1,j}\|\beta_j\|_1 \right], $$
which implies that
$$ (\hat{A}, \hat{B}^*) = \arg\min_{A,B} \; \mathrm{tr}\!\left(B^T \frac{X^T X + \lambda I}{1+\lambda} B\right) - 2\,\mathrm{tr}\!\left(A^T X^T X B\right) + \sum_{j=1}^{k} \lambda_{1,j}\|\beta_j\|_1 \qquad (A.13) $$
subject to $A^T A = I_{k\times k}$. As $\lambda \to \infty$, $\mathrm{tr}\!\left(B^T \frac{X^T X + \lambda I}{1+\lambda} B\right) \to \mathrm{tr}(B^T B) = \sum_{j=1}^{k}\|\beta_j\|^2$. Thus (A.13) approaches (4.1), and the conclusion of Theorem 5 follows.

ACKNOWLEDGMENTS
We thank the editor, an associate editor, and referees for helpful comments and suggestions, which greatly improved the manuscript.

[Received April 2004. Revised June 2005.]

REFERENCES

Alter, O., Brown, P., and Botstein, D. (2000), "Singular Value Decomposition for Genome-Wide Expression Data Processing and Modeling," in Proceedings of the National Academy of Sciences, 97, pp. 10101–10106.

Cadima, J., and Jolliffe, I. (1995), "Loadings and Correlations in the Interpretation of Principal Components," Journal of Applied Statistics, 22, 203–214.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004), "Least Angle Regression," The Annals of Statistics, 32, 407–499.

Hancock, P., Burton, A., and Bruce, V. (1996), "Face Processing: Human Perception and Principal Components Analysis," Memory and Cognition, 24, 26–40.

Hastie, T., Tibshirani, R., and Friedman, J. (2001), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, New York: Springer Verlag.

Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Ross, D., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000), "'Gene Shaving' as a Method for Identifying Distinct Sets of Genes With Similar Expression Patterns," Genome Biology, 1, 1–21.

Jeffers, J. (1967), "Two Case Studies in the Application of Principal Component," Applied Statistics, 16, 225–236.

Jolliffe, I. (1986), Principal Component Analysis, New York: Springer Verlag.

——— (1995), "Rotation of Principal Components: Choice of Normalization Constraints," Journal of Applied Statistics, 22, 29–35.

Jolliffe, I. T., Trendafilov, N. T., and Uddin, M. (2003), "A Modified Principal Component Technique Based on the Lasso," Journal of Computational and Graphical Statistics, 12, 531–547.

Mardia, K., Kent, J., and Bibby, J. (1979), Multivariate Analysis, New York: Academic Press.

McCabe, G. (1984), "Principal Variables," Technometrics, 26, 137–144.

Osborne, M. R., Presnell, B., and Turlach, B. A. (2000), "A New Approach to Variable Selection in Least Squares Problems," IMA Journal of Numerical Analysis, 20, 389–403.

Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J., Poggio, T., Gerald, W., Loda, M., Lander, E., and Golub, T. (2001), "Multiclass Cancer Diagnosis Using Tumor Gene Expression Signature," in Proceedings of the National Academy of Sciences, 98, pp. 15149–15154.

Tibshirani, R. (1996), "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Series B, 58, 267–288.

Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G. (2002), "Diagnosis of Multiple Cancer Types by Shrunken Centroids of Gene Expression," in Proceedings of the National Academy of Sciences, 99, 6567–6572.

Vines, S. (2000), "Simple Principal Components," Applied Statistics, 49, 441–451.

Zou, H., and Hastie, T. (2005), "Regularization and Variable Selection via the Elastic Net," Journal of the Royal Statistical Society, Series B, 67, 301–320.


The elastic net (Zou and Hastie 2005) generalizes the lasso to overcome these draw-backs while enjoying its other favorable properties For any non-negative λ1 and λ2 theelastic net estimates βen are given as follows

βen = (1 + λ2)

arg minβ

Y minuspsum

j=1

Xjβj2 + λ2

psumj=1

|βj |2 + λ1

psumj=1

|βj | (22)

The elastic net penalty is a convex combination of the ridge and lasso penalties Obviouslythe lasso is a special case of the elastic net when λ2 = 0 Given a fixed λ2 the LARS-EN algorithm (Zou and Hastie 2005) efficiently solves the elastic net problem for all λ1

with the computational cost of a single least squares fit When p gt n we choose some

268 H ZOU T HASTIE AND R TIBSHIRANI

λ2 gt 0 Then the elastic net can potentially include all variables in the fitted model so thisparticular limitation of the lasso is removed Zou and Hastie (2005) compared the elasticnet with the lasso and discussed the application of the elastic net as a gene selection methodin microarray analysis

3 MOTIVATION AND DETAILS OF SPCA

In both lasso and elastic net the sparse coefficients are a direct consequence of theL1 penalty and do not depend on the squared error loss function Jolliffe Trendafilov andUddin (2003) proposed SCoTLASS an interesting procedure that obtains sparse loadingsby directly imposing an L1 constraint on PCA SCoTLASS successively maximizes thevariance

aTk (XT X)ak (31)

subject to

aTk ak = 1 and (for k ge 2) aT

h ak = 0 h lt k (32)

and the extra constraintspsum

j=1

|akj | le t (33)

for some tuning parameter t Although sufficiently small t yields some exact zero loadingsthere is not much guidance with SCoTLASS in choosing an appropriate value for t Onecould try several t values but the high computational cost of SCoTLASS makes this an im-practical solution This high computational cost is probably due to the fact that SCoTLASSis not a convex optimization problem Moreover the examples in Jolliffe Trendafilov andUddin (2003) showed that the loadings obtained by SCoTLASS are not sparse enough whenone requires a high percentage of explained variance

We consider a different approach to modifying PCA We first show how PCA can berecast exactly in terms of a (ridge) regression problem We then introduce the lasso penaltyby changing this ridge regression to an elastic-net regression

31 DIRECT SPARSE APPROXIMATIONS

We first discuss a simple regression approach to PCA Observe that each PC is a linearcombination of the p variables thus its loadings can be recovered by regressing the PC onthe p variables

Theorem 1 For each i denote byZi = UiDii the ith principal component Considera positive λ and the ridge estimates βridge given by

βridge = arg minβ

Zi minus Xβ2 + λβ2 (34)

SPARSE PRINCIPAL COMPONENT ANALYSIS 269

Let v = βridge

βridge then v = Vi

The theme of this simple theorem is to show the connection between PCA and aregression method Regressing PCs on variables was discussed in Cadima and Jolliffe(1995) where they focused on approximating PCs by a subset of k variables We extend itto a more general case of ridge regression in order to handle all kinds of data especiallygene expression data Obviously when n gt p and X is a full rank matrix the theorem doesnot require a positive λ Note that if p gt n and λ = 0 ordinary multiple regression hasno unique solution that is exactly Vi The same happens when n gt p and X is not a fullrank matrix However PCA always gives a unique solution in all situations As shown inTheorem 1 this indeterminancy is eliminated by the positive ridge penalty (λβ2) Notethat after normalization the coefficients are independent of λ therefore the ridge penalty isnot used to penalize the regression coefficients but to ensure the reconstruction of principalcomponents

Now let us add theL1 penalty to (34) and consider the following optimization problem

β = arg minβ

Zi minus Xβ2 + λβ2 + λ1β1 (35)

where β1 =sump

j=1 |βj | is the 1-norm of β We call Vi = β

β an approximation to Vi and

XVi the ith approximated principal component Zou and Hastie (2005) called (35) naiveelastic net which differs from the elastic net by a scaling factor (1 + λ) Since we are usingthe normalized fitted coefficients the scaling factor does not affect Vi Clearly large enoughλ1 gives a sparse β hence a sparse Vi Given a fixed λ (35) is efficiently solved for all λ1

by using the LARS-EN algorithm (Zou and Hastie 2005) Thus we can flexibly choose asparse approximation to the ith principal component

32 SPARSE PRINCIPAL COMPONENTS BASED ON THE SPCA CRITERION

Theorem 1 depends on the results of PCA so it is not a genuine alternative Howeverit can be used in a two-stage exploratory analysis first perform PCA then use (35) to findsuitable sparse approximations

We now present a ldquoself-containedrdquo regression-type criterion to derive PCs Let xi denotethe ith row vector of the matrix X We first consider the leading principal component

Theorem 2 For any λ gt 0 let

(α β) = arg minαβ

nsumi=1

xi minus αβT xi2 + λβ2 (36)

subject to α2 = 1

Then β prop V1

The next theorem extends Theorem 2 to derive the whole sequence of PCs

Theorem 3 Suppose we are considering the firstk principal components Let Aptimesk =

270 H ZOU T HASTIE AND R TIBSHIRANI

[α1 αk] and Bptimesk = [β1 βk] For any λ gt 0 let

(A B) = arg minA B

nsumi=1

xi minus ABT xi2 + λ

ksumj=1

βj2 (37)

subject to AT A = Iktimesk

Then βj prop Vj for j = 1 2 kTheorems 2 and 3 effectively transform the PCA problem to a regression-type problem

The critical element is the objective functionsumn

i=1 xi minus ABT xi2 If we restrict B = Athen

nsumi=1

xi minus ABT xi2 =nsum

i=1

xi minus AAT xi2

whose minimizer under the orthonormal constraint on A is exactly the first k loading vectorsof ordinary PCA This formulation arises in the ldquoclosest approximating linear manifoldrdquoderivation of PCA (eg Hastie Tibshirani and Friedman 2001) Theorem 3 shows that wecan still have exact PCA while relaxing the restriction B = A and adding the ridge penaltyterm As can be seen later these generalizations enable us to flexibly modify PCA

The proofs of Theorems 2 and 3 are given in the Appendix here we give an intuitiveexplanation Note that

nsumi=1

xi minus ABT xi2 = X minus XBAT 2 (38)

Since A is orthonomal let Aperp be any orthonormal matrix such that [A Aperp] is p times p

orthonormal Then we have

X minus XBAT 2 = XAperp2 + XA minus XB2 (39)

= XAperp2 +ksum

j=1

Xαj minus Xβj2 (310)

Suppose A is given then the optimal B minimizing (37) should minimize

arg minB

ksumj=1

Xαj minus Xβj2 + λβj2 (311)

which is equivalent to k independent ridge regression problems In particular if A corre-sponds to the ordinary PCs that is A = V then by Theorem 1 we know that B should beproportional to V Actually the above view points out an effective algorithm for solving(37) which is revisited in the next section

We carry on the connection between PCA and regression and use the lasso approachto produce sparse loadings (ldquoregression coefficientsrdquo) For that purpose we add the lassopenalty into the criterion (37) and consider the following optimization problem

(A B) = arg minA B

nsumi=1

xi minus ABT xi2 + λ

ksumj=1

βj2 +ksum

j=1

λ1jβj1

(312)

subject to AT A = Iktimesk

SPARSE PRINCIPAL COMPONENT ANALYSIS 271

Whereas the same λ is used for all k components different λ1jrsquos are allowed for penalizingthe loadings of different principal components Again if p gt n a positive λ is required inorder to get exact PCA when the sparsity constraint (the lasso penalty) vanishes (λ1j = 0)We call (312) the SPCA criterion hereafter

33 NUMERICAL SOLUTION

We propose an alternating algorithm to minimize the SPCA criterion (312)

B given A For each j let Y lowastj = Xαj By the same analysis used in (39)ndash(311) we know

that B = [β1 βk] where each βj is an elastic net estimate

βj = arg minβj

Y lowastj minus Xβj2 + λβj2 + λ1jβj1 (313)

A given B On the other hand if B is fixed then we can ignore the penalty part in (312)

and only try to minimizesumn

i=1 ximinusABT xi2 = XminusXBAT 2 subject to AT A =Iktimesk The solution is obtained by a reduced rank form of the Procrustes rotation

given in Theorem 4 below We compute the SVD

(XT X)B = UDVT (314)

and set A = UVT

Theorem 4 Reduced Rank Procrustes Rotation Let Mntimesp and Nntimesk be two ma-

trices Consider the constrained minimization problem

A = arg minA

M minus NAT 2 subject to AT A = Iktimesk (315)

Suppose the SVD of MT N is UDVT then A = UVT

The usual Procrustes rotation (eg Mardia Kent and Bibby 1979) has N the same size

as M

It is worth pointing out that to solve (313) we only need to know the Gram matrix

XT X because

Y lowastj minus Xβj2 + λβj2 + λ1jβj1

= (αj minus βj)T XT X(αj minus βj) + λβj2 + λ1jβj1 (316)

The same is true of (314)

Now 1nXT X is the sample covariance matrix of X Therefore if Σ the covariance matrix

of X is known we can replace XT X with Σ in (316) and have a population version of

SPCA If X is standardized beforehand then we use the (sample) correlation matrix which

is preferred when the scales of the variables are different

272 H ZOU T HASTIE AND R TIBSHIRANI

Although (316) (with Σ instead of XT X) is not quite an elastic net problem we caneasily turn it into one Create the artificial response Y lowastlowast and Xlowastlowast as follows

Y lowastlowast = Σ12αj Xlowastlowast = Σ

12 (317)

then it is easy to check that

βj = arg minβ

Y lowastlowast minus Xlowastlowastβ2 + λβ2 + λ1jβ1 (318)

Algorithm 1 summarizes the steps of our SPCA procedure outlined above

Algorithm 1 General SPCA Algorithm1 Let A start at V[ 1 k] the loadings of the first k ordinary principal components2 Given a fixed A = [α1 αk] solve the following elastic net problem for j =

1 2 k

βj = arg minβ

(αj minus β)T XT X(αj minus β) + λβ2 + λ1jβ1

3 For a fixed B = [β1 middot middot middot βk] compute the SVD of XT XB = UDVT then updateA = UVT

4 Repeat Steps 2ndash3 until convergence5 Normalization Vj = βj

βj j = 1 kSome remarks1 Empirical evidence suggests that the output of the above algorithm does not change

much as λ is varied For n gt p data the default choice of λ can be zero Practically λis chosen to be a small positive number to overcome potential collinearity problemsin X Section 4 discusses the default choice of λ for data with thousands of variablessuch as gene expression arrays

2 In principle we can try several combinations of λ1j to figure out a good choiceof the tuning parameters since the above algorithm converges quite fast There is ashortcut provided by the direct sparse approximation (35) The LARS-EN algorithmefficiently delivers a whole sequence of sparse approximations for each PC and thecorresponding values ofλ1j Hence we can pick aλ1j that gives a good compromisebetween variance and sparsity When facing the variance-sparsity trade-off we letvariance have a higher priority

34 ADJUSTED TOTAL VARIANCE

The ordinary principal components are uncorrelated and their loadings are orthogonalLet ΣΣΣ = XT X then VT V = Ik and VT ΣΣΣV is diagonal It is easy to check that it isonly for ordinary principal components the the loadings can satisfy both conditions InJolliffe Trendafilov and Uddin (2003) the loadings were forced to be orthogonal so theuncorrelated property was sacrificed SPCA does not explicitly impose the uncorrelatedcomponents condition either

SPARSE PRINCIPAL COMPONENT ANALYSIS 273

Let Z be the modified PCs Usually the total variance explained by Z is calculated

by tr(ZT

Z) This is reasonable when Z are uncorrelated However if they are correlated

tr(ZT

Z) is too optimistic for representing the total variance Suppose (Zi i = 1 2 k)are the first kmodified PCs by any method and the (k+1)th modified PC Zk+1 is obtainedWe want to compute the total variance explained by the first k + 1 modified PCs whichshould be the sum of the explained variance by the first k modified PCs and the additionalvariance from Zk+1 If Zk+1 is correlated with (Zi i = 1 2 k) then its variancecontains contributions from (Zi i = 1 2 k) which should not be included into thetotal variance given the presence of (Zi i = 1 2 k)

Here we propose a new formula to compute the total variance explained by Z whichtakes into account the correlations among Z We use regression projection to remove thelinear dependence between correlated components Denote Zjmiddot1jminus1 the residual afteradjusting Zj for Z1 Zjminus1 that is

Zjmiddot1jminus1 = Zj minus H1jminus1Zj (319)

where H1jminus1 is the projection matrix on Zijminus11 Then the adjusted variance of Zj is

Zjmiddot1jminus12 and the total explained variance is defined assumk

j=1 Zjmiddot1jminus12 When

the modified PCs Z are uncorrelated the new formula agrees with tr(ZT

Z)Note that the above computations depend on the order of Zi However since we have

a natural order in PCA ordering is not an issue here Using the QR decomposition we caneasily compute the adjusted variance Suppose Z = QR where Q is orthonormal and R isupper triangular Then it is straightforward to see that

Zjmiddot1jminus12 = R2jj (320)

Hence the explained total variance is equal tosumk

j=1 R2jj

35 COMPUTATION COMPLEXITY

PCA is computationally efficient for both n gt p or p n data We separately discussthe computational cost of the general SPCA algorithm for n gt p and p n

1 n gt p Traditional multivariate data fit in this category Note that although the SPCAcriterion is defined using X it only depends on X via XT X A trick is to first computethe p times p matrix ΣΣΣ = XT X once for all which requires np2 operations Then thesame ΣΣΣ is used at each step within the loop Computing XT Xβ costs p2k and theSVD of XT Xβ is of order O(pk2) Each elastic net solution requires at most O(p3)operations Since k le p the total computation cost is at most np2 +mO(p3) wherem is the number of iterations before convergence Therefore the SPCA algorithm isable to efficiently handle data with huge n as long as p is small (say p lt 100)

2 p n Gene expression arrays are typical examples in this p n category Thetrick of using ΣΣΣ is no longer applicable because ΣΣΣ is a huge matrix (ptimes p) in thiscase The most consuming step is solving each elastic net whose cost is of order

274 H ZOU T HASTIE AND R TIBSHIRANI

O(pnJ + J3) for a positive finite λ where J is the number of nonzero coefficientsGenerally speaking the total cost is of ordermkO(pJn+J3) which can be expensivefor large J and p Fortunately as shown in the next section there exists a specialSPCA algorithm for efficiently dealing with p n data

4 SPCA FOR p n AND GENE EXPRESSION ARRAYS

For gene expression arrays the number of variables (genes) is typically much biggerthan the number of samples (eg n = 10 000 p = 100) Our general SPCA algorithmstill fits this situation using a positive λ However the computational cost is expensive whenrequiring a large number of nonzero loadings It is desirable to simplify the general SPCAalgorithm to boost the computation

Observe that Theorem 3 is valid for all λ gt 0 so in principle we can use any positiveλ It turns out that a thrifty solution emerges if λ rarr infin Precisely we have the followingtheorem

Theorem 5 Let Vj(λ) = βj

βj (j = 1 k) be the loadings derived from criterion

(312) Let (A B) be the solution of the optimization problem

(A B) = arg minA B

minus2tr(

AT XT XB)

+ksum

j=1

βj2 +ksum

j=1

λ1jβj1 (41)

subject to AT A = Iktimesk

When λ rarr infin Vj(λ) rarr βj

βj We can use the same alternating algorithm in Section 33 to solve (41) where we only

need to replace the general elastic net problem with its special case (λ = infin) Note thatgiven A

βj = arg minβj

minus2αTj (XT X)βj + βj2 + λ1jβj1 (42)

which has an explicit form solution given in (43)

Gene Expression Arrays SPCA Algorithm Replacing Step 2 in the general SPCAalgorithm withStep 2lowast for j = 1 2 k

βj =(∣∣αT

j XT X∣∣minus λ1j

2

)+

Sign(αTj XT X) (43)

The operation in (43) is called soft-thresholding Figure 1 gives an illustration of how thesoft-thresholding rule operates Recently soft-thresholding has become increasingly popularin the literature For example nearest shrunken centroids (Tibshirani Hastie Narasimhanand Chu 2002) adopts the soft-thresholding rule to simultaneously classify samples andselect important genes in microarrays

SPARSE PRINCIPAL COMPONENT ANALYSIS 275

Figure 1 An illustration of soft-thresholding rule y = (|x| minus ∆)+Sign(x) with ∆ = 1

5 EXAMPLES

51 PITPROPS DATA

The pitprops data first introduced by Jeffers (1967) has 180 observations and 13 mea-sured variables It is a classic example showing the difficulty of interpreting principal com-ponents Jeffers (1967) tried to interpret the first six PCs Jolliffe Trendafilov and Uddin(2003) used their SCoTLASS to find the modified PCs Table 1 presents the results ofPCA while Table 2 presents the modified PC loadings as computed by SCoTLASS and theadjusted variance computed using (320)

As a demonstration we also considered the first six principal components Since thisis a usual n p dataset we set λ = 0 λ1 = (006 016 01 05 05 05) were chosenaccording to Figure 2 such that each sparse approximation explained almost the sameamount of variance as the ordinary PC did Table 3 shows the obtained sparse loadings andthe corresponding adjusted variance Compared with the modified PCs of SCoTLASS PCsby SPCA account for a larger amount of variance (758 vs 693) with a much sparserloading structure The important variables associated with the six PCs do not overlap whichfurther makes the interpretations easier and clearer It is interesting to note that in Table 3even though the variance does not strictly monotonously decrease the adjusted variancefollows the right order However Table 2 shows that this is not true in SCoTLASS Itis also worthy of mention that the entire SPCA computation was done in seconds in Rwhile the implementation of SCoTLASS for each t was expensive (Jolliffe Trendafilovand Uddin 2003) Optimizing SCoTLASS over several values of t is an even more difficultcomputational challenge

Although the informal thresholding method which we henceforth refer to as simple

276 H ZOU T HASTIE AND R TIBSHIRANI

Table 1 Pitprops Data Loadings of the First Six Principal Components

Variable PC1 PC2 PC3 PC4 PC5 PC6

topdiam minus0404 0218 minus0207 0091 minus0083 0120length minus0406 0186 minus0235 0103 minus0113 0163moist minus0124 0541 0141 minus0078 0350 minus0276testsg minus0173 0456 0352 minus0055 0356 minus0054ovensg minus0057 minus0170 0481 minus0049 0176 0626ringtop minus0284 minus0014 0475 0063 minus0316 0052ringbut minus0400 minus0190 0253 0065 minus0215 0003bowmax minus0294 minus0189 minus0243 minus0286 0185 minus0055bowdist minus0357 0017 minus0208 minus0097 minus0106 0034whorls minus0379 minus0248 minus0119 0205 0156 minus0173clear 0011 0205 minus0070 minus0804 minus0343 0175knots 0115 0343 0092 0301 minus0600 minus0170diaknot 0113 0309 minus0326 0303 0080 0626

Variance () 324 183 144 85 70 63Cumulative variance () 324 507 651 736 806 869

Table 2 Pitprops Data Loadings of the First Six Modified PCs by SCoTLASS Empty cells have zeroloadings

t = 175Variable PC1 PC2 PC3 PC4 PC5 PC6

topdiam 0664 minus0025 0002 minus0035length 0683 minus0001 minus0040 0001 minus0018moist 0641 0195 0180 minus0030testsg 0701 0001 minus0001ovensg minus0887 minus0056ringtop 0293 minus0186 minus0373 0044ringbut 0001 0107 minus0658 minus0051 0064bowmax 0001 0735 0021 minus0168bowdist 0283 minus0001whorls 0113 minus0001 0388 minus0017 0320clear minus0923knots 0001 minus0554 0016 0004diaknot 0703 0001 minus0197 0080

Number of nonzero loadings 6 6 6 6 10 13Variance () 196 160 131 131 92 90Adjusted variance () 196 138 124 80 71 84Cumulative adjusted variance () 196 334 458 538 609 693

SPARSE PRINCIPAL COMPONENT ANALYSIS 277

Figure 2 Pitprops data The sequences of sparse approximations to the first six principal components The curvesshow the percentage of explained variance (PEV) as a function of λ1 The vertical broken lines indicate the choiceof λ1 used in our SPCA analysis

278 H ZOU T HASTIE AND R TIBSHIRANI

Table 3 Pitprops Data Loadings of the First Six Sparse PCs by SPCA Empty cells have zero loadings

Variable PC1 PC2 PC3 PC4 PC5 PC6

topdiam minus0477length minus0476moist 0785testsg 0620ovensg 0177 0640ringtop 0589ringbut minus0250 0492bowmax minus0344 minus0021bowdist minus0416whorls minus0400clear minus1knots 0013 minus1diaknot minus0015 1

Number of nonzero loadings 7 4 4 1 1 1Variance () 280 144 150 77 77 77Adjusted variance () 280 140 133 74 68 62Cumulative adjusted variance () 280 420 553 627 695 758

thresholding has various drawbacks it may serve as the benchmark for testing sparse PCsmethods A variant of simple thresholding is soft-thresholding We found that when used inPCA soft-thresholding performs very similarly to simple thresholding Thus we omitted theresults of soft-thresholding in this article Both SCoTLASS and SPCA were compared withsimple thresholding Table 4 presents the loadings and the corresponding variance explainedby simple thresholding To make the comparisons fair we let the numbers of nonzeroloadings obtained by simple thresholding match the results of SCoTLASS and SPCA asshown in the top and bottom parts of Table 4 respectively In terms of variance it seemsthat simple thresholding is better than SCoTLASS and worse than SPCA Moreover thevariables with nonzero loadings by SPCA are different to that chosen by simple thresholdingfor the first three PCs while SCoTLASS seems to create a similar sparseness pattern assimple thresholding does especially in the leading PC

5.2 A SYNTHETIC EXAMPLE

Our synthetic example has three hidden factors:

$V_1 \sim N(0, 290), \qquad V_2 \sim N(0, 300),$
$V_3 = -0.3\,V_1 + 0.925\,V_2 + \epsilon, \qquad \epsilon \sim N(0, 1),$

and $V_1$, $V_2$, and $\epsilon$ are independent. Then 10 observable variables are constructed as follows:

$X_i = V_1 + \epsilon_i^1, \qquad \epsilon_i^1 \sim N(0, 1), \qquad i = 1, 2, 3, 4,$
$X_i = V_2 + \epsilon_i^2, \qquad \epsilon_i^2 \sim N(0, 1), \qquad i = 5, 6, 7, 8,$
$X_i = V_3 + \epsilon_i^3, \qquad \epsilon_i^3 \sim N(0, 1), \qquad i = 9, 10,$

where the $\epsilon_i^j$ are independent, $j = 1, 2, 3$, $i = 1, \ldots, 10$. We used the exact covariance matrix of $(X_1, \ldots, X_{10})$ to perform PCA, SPCA, and simple thresholding (in the population setting).
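For reference, the exact covariance matrix is easy to write down: Cov(Xi, Xj) equals the covariance of the corresponding hidden factors, plus 1 on the diagonal. The R sketch below is our own illustration (not the authors' code); it builds that covariance matrix and extracts the ordinary principal components for comparison with the PCA columns of Table 5.

```r
# Exact covariance of (X1, ..., X10) in the synthetic example (a sketch).
cov_V <- matrix(c(290,         0,           -0.3 * 290,
                  0,           300,          0.925 * 300,
                  -0.3 * 290,  0.925 * 300,  0.3^2 * 290 + 0.925^2 * 300 + 1),
                nrow = 3, byrow = TRUE)        # covariance of (V1, V2, V3)
g     <- rep(1:3, times = c(4, 4, 2))          # the factor generating each X_i
Sigma <- cov_V[g, g] + diag(10)                # Cov(X_i, X_j) = Cov(V_g(i), V_g(j)) + I
eig   <- eigen(Sigma, symmetric = TRUE)
round(100 * cumsum(eig$values) / sum(eig$values), 1)  # first two PCs: ~99.6%
round(eig$vectors[, 1:3], 3)                   # compare (up to sign) with Table 5
```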

Table 4. Pitprops Data: Loadings of the First Six Modified PCs by Simple Thresholding. Empty cells have zero loadings.

Simple thresholding vs. SCoTLASS

Variable     PC1      PC2      PC3      PC4      PC5      PC6
topdiam   -0.439    0.240                                0.120
length    -0.441                       0.105   -0.114    0.163
moist               0.596                       0.354   -0.276
testsg              0.503    0.391              0.360   -0.054
ovensg                       0.534              0.178    0.626
ringtop                      0.528             -0.320    0.052
ringbut   -0.435             0.281             -0.218    0.003
bowmax    -0.319            -0.270   -0.291     0.188   -0.055
bowdist   -0.388                                         0.034
whorls    -0.412   -0.274             0.209     0.158   -0.173
clear                                -0.819    -0.347    0.175
knots               0.378             0.307    -0.608   -0.170
diaknot             0.340   -0.362    0.309              0.626

Number of nonzero loadings             6       6       6       6      10      13
Variance (%)                        28.9    16.1    15.4     8.4     7.1     6.3
Adjusted variance (%)               28.9    16.1    13.9     8.2     6.9     6.2
Cumulative adjusted variance (%)    28.9    45.0    58.9    67.1    74.0    80.2

Simple thresholding vs. SPCA

Variable     PC1      PC2      PC3      PC4      PC5      PC6
topdiam   -0.420
length    -0.422
moist               0.640
testsg              0.540    0.425
ovensg                       0.580
ringtop   -0.296             0.573
ringbut   -0.416
bowmax    -0.305
bowdist   -0.370
whorls    -0.394
clear                                  -1
knots               0.406                       -1
diaknot             0.365   -0.393                        1

Number of nonzero loadings             7       4       4       1       1       1
Variance (%)                        30.7    14.8    13.6     7.7     7.7     7.7
Adjusted variance (%)               30.7    14.7    11.1     7.6     5.2     3.6
Cumulative adjusted variance (%)    30.7    45.4    56.5    64.1    68.3    71.9


Table 5. Results of the Simulation Example: Loadings and Variance

                       PCA                     SPCA (λ = 0)     Simple Thresholding
              PC1      PC2      PC3           PC1      PC2          PC1      PC2
X1          0.116   -0.478   -0.087             0      0.5            0     -0.5
X2          0.116   -0.478   -0.087             0      0.5            0     -0.5
X3          0.116   -0.478   -0.087             0      0.5            0     -0.5
X4          0.116   -0.478   -0.087             0      0.5            0     -0.5
X5         -0.395   -0.145    0.270           0.5      0              0      0
X6         -0.395   -0.145    0.270           0.5      0              0      0
X7         -0.395   -0.145    0.270           0.5      0         -0.497      0
X8         -0.395   -0.145    0.270           0.5      0         -0.497      0
X9         -0.401    0.010   -0.582             0      0         -0.503      0
X10        -0.401    0.010   -0.582             0      0         -0.503      0

Adjusted
variance (%)  60.0     39.6     0.08          40.9     39.5         38.8     38.6

The variance of the three underlying factors is 290, 300, and 283.8, respectively. The numbers of variables associated with the three factors are 4, 4, and 2. Therefore V2 and V1 are almost equally important, and they are much more important than V3. The first two PCs together explain 99.6% of the total variance. These facts suggest that we only need to consider two derived variables with "correct" sparse representations. Ideally, the first derived variable should recover the factor V2 only using (X5, X6, X7, X8), and the second derived variable should recover the factor V1 only using (X1, X2, X3, X4). In fact, if we sequentially maximize the variance of the first two derived variables under the orthonormal constraint, while restricting the numbers of nonzero loadings to four, then the first derived variable uniformly assigns nonzero loadings on (X5, X6, X7, X8), and the second derived variable uniformly assigns nonzero loadings on (X1, X2, X3, X4).

Both SPCA (λ = 0) and simple thresholding were carried out by using the oracle information that the ideal sparse representations use only four variables. Table 5 summarizes the comparison results. Clearly, SPCA correctly identifies the sets of important variables. In fact, SPCA delivers the ideal sparse representations of the first two principal components. Mathematically, it is easy to show that if t = 2 is used, SCoTLASS is also able to find the same sparse solution. In this example, both SPCA and SCoTLASS produce the ideal sparse PCs, which may be explained by the fact that both methods explicitly use the lasso penalty.
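Continuing the population sketch above (and reusing its Sigma), the adjusted variance of the two ideal sparse components can be checked directly; this is our own verification, and the numbers agree with the SPCA columns of Table 5.

```r
# Variance captured by the "ideal" sparse components, each uniform on its own
# block of four variables (Sigma is taken from the previous sketch).
v1 <- rep(c(0, 0.5, 0), times = c(4, 4, 2))    # uniform loadings on X5-X8
v2 <- rep(c(0.5, 0, 0), times = c(4, 4, 2))    # uniform loadings on X1-X4
round(100 * c(t(v1) %*% Sigma %*% v1,
              t(v2) %*% Sigma %*% v2) / sum(diag(Sigma)), 1)
# about 40.9 and 39.5, the adjusted variances reported for SPCA in Table 5
```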

In contrast, simple thresholding incorrectly includes X9, X10 among the most important variables. The variance explained by simple thresholding is also lower than that by SPCA, although the relative difference is small (less than 5%). Due to the high correlation between V2 and V3, the variables X9, X10 achieve loadings which are even higher than those of the truly important variables (X5, X6, X7, X8). Thus the truth is disguised by the high correlation. On the other hand, simple thresholding correctly discovers the second factor, because V1 has a low correlation with V3.

5.3 RAMASWAMY DATA



Figure 3. The sparse leading principal component: percentage of explained variance versus sparsity. Simple thresholding and SPCA have similar performances. However, there still exists a consistent difference in the selected genes (the ones with nonzero loadings).

An important task in microarray analysis is to find a set of genes which are biologically relevant to the outcome (e.g., tumor type or survival time). PCA (or SVD) has been a popular tool for this purpose. Many gene-clustering methods in the literature use PCA (or SVD) as a building block. For example, gene shaving (Hastie et al. 2000) uses an iterative principal component shaving algorithm to identify subsets of coherent genes. Here we consider another approach to gene selection through SPCA. The idea is intuitive: if the (sparse) principal component can explain a large part of the total variance of gene expression levels, then the subset of genes representing the principal component is considered important.

We illustrate the sparse PC selection method on Ramaswamy's data (Ramaswamy et al. 2001), which has 16,063 (p = 16,063) genes and 144 (n = 144) samples. Its first principal component explains 46% of the total variance. For microarray data like this, it appears that SCoTLASS cannot be practically useful for finding sparse PCs. We applied SPCA (λ = ∞) to find the leading sparse PC. A sequence of values for λ1 was used such that the number of nonzero loadings varied over a wide range. As displayed in Figure 3, the percentage of explained variance decreases at a slow rate as the sparsity increases. As few as 2.5% of these 16,063 genes can sufficiently construct the leading principal component with an affordable loss of explained variance (from 46% to 40%). Simple thresholding was also applied to this data. It seems that, when using the same number of genes, simple thresholding always explains slightly higher variance than SPCA does. Among the same number of selected genes by SPCA and simple thresholding, there are about 2% different genes, and this difference rate is quite consistent.
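For such p much larger than n data, the λ = ∞ version of SPCA replaces each elastic net fit by the closed-form soft-thresholding update (4.3), so each iteration needs only matrix products with X. The R sketch below is our own rendering of that update (a single scalar penalty λ1 is assumed for simplicity); it is not the published implementation.

```r
# One B-update of the gene-array SPCA algorithm (a sketch): soft-threshold
# X^T X alpha_j for each column alpha_j of A, as in (4.3).
soft_threshold_step <- function(X, A, lambda1) {
  XtXA <- crossprod(X, X %*% A)                  # X^T (X A); avoids forming the p x p matrix
  sign(XtXA) * pmax(abs(XtXA) - lambda1 / 2, 0)  # (|.| - lambda1/2)_+ * sign(.)
}
# A full fit alternates this step with the Procrustes update A = U V^T from
# the SVD of X^T X B (Algorithm 1), and finally normalizes the columns of B.
```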


6. DISCUSSION

It has been an interesting research topic for years to derive principal components with sparse loadings. From a practical point of view, a good method to achieve the sparseness goal should (at least) possess the following properties:

• Without any sparsity constraint, the method should reduce to PCA.
• It should be computationally efficient for both small p and big p data.
• It should avoid misidentifying the important variables.

The often-used simple thresholding approach is not criterion based. However, this informal method seems to possess the first two of the desirable properties listed above. If the explained variance and sparsity are the only concerns, simple thresholding is a reasonable approach, and it is extremely convenient. We have shown that simple thresholding can work well with gene expression arrays. The serious problem with simple thresholding is that it can misidentify the truly important variables. Nevertheless, simple thresholding is regarded as a benchmark for any potentially better method.

Using the lasso constraint in PCA, SCoTLASS successfully derives sparse loadings. However, SCoTLASS is not computationally efficient, and it lacks a good rule to pick its tuning parameter. In addition, it is not feasible to apply SCoTLASS to gene expression arrays, where PCA is quite a popular tool.

In this work we have developed SPCA using our SPCA criterion (3.12). This new criterion gives exact PCA results when its sparsity (lasso) penalty term vanishes. SPCA allows flexible control on the sparse structure of the resulting loadings. Unified efficient algorithms have been proposed to compute SPCA solutions for both regular multivariate data and gene expression arrays. As a principled procedure, SPCA enjoys advantages in several aspects, including computational efficiency, high explained variance, and the ability to identify important variables.

Software in R for fitting the SPCA model (and elastic net models) is available in the CRAN contributed package elasticnet.
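A minimal usage sketch is given below; it follows the package's documented interface (argument and component names may vary across package versions, so consult ?spca), with the λ1 values used for the pitprops analysis of Section 5.1. To our knowledge the pitprops correlation matrix is shipped with the package as a dataset.

```r
# Sketch: sparse PCA of the pitprops correlation matrix with the CRAN
# package elasticnet (check ?spca for the exact interface of your version).
library(elasticnet)
data(pitprops)                                    # 13 x 13 correlation matrix
fit <- spca(pitprops, K = 6, type = "Gram", sparse = "penalty",
            para = c(0.06, 0.16, 0.1, 0.5, 0.5, 0.5))  # lambda_1 values of Section 5.1
fit$loadings                                      # sparse loadings (cf. Table 3)
fit$pev                                           # explained variance, adjusted as in Section 3.4
```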

APPENDIX: PROOFS

Proof of Theorem 1: Using $X^TX = VD^2V^T$ and $V^TV = I$, we have

$\hat\beta_{\mathrm{ridge}} = \left(X^TX + \lambda I\right)^{-1} X^T (XV_i) = V_i\,\dfrac{D_{ii}^2}{D_{ii}^2 + \lambda}.$   (A.1)

Hence $\hat v = V_i$.
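Theorem 1 is easy to confirm numerically; the following R sketch (ours, with simulated data and p > n) regresses the first principal component on X with a positive ridge penalty and checks that the normalized coefficients equal the first loading vector.

```r
# Numeric check of Theorem 1 (a sketch with simulated data, p > n).
set.seed(1)
X  <- scale(matrix(rnorm(20 * 50), 20, 50), center = TRUE, scale = FALSE)
sv <- svd(X)
i  <- 1; lambda <- 10
Z  <- sv$u[, i] * sv$d[i]                                   # Z_i = U_i D_ii
beta <- solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, Z))
max(abs(beta / sqrt(sum(beta^2)) - sv$v[, i]))              # ~ 1e-15: v_hat = V_i
```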

Note that since Theorem 2 is a special case of Theorem 3, we will not prove it separately. We first provide a lemma.

Lemma 1. Consider the ridge regression criterion

$C_\lambda(\beta) = \|y - X\beta\|^2 + \lambda\|\beta\|^2.$

Then if $\hat\beta = \arg\min_\beta C_\lambda(\beta)$,

$C_\lambda(\hat\beta) = y^T(I - S_\lambda)y,$

where $S_\lambda$ is the ridge operator

$S_\lambda = X(X^TX + \lambda I)^{-1}X^T.$

Proof of Lemma 1: Differentiating $C_\lambda$ with respect to $\beta$, we get that

$-X^T(y - X\hat\beta) + \lambda\hat\beta = 0.$

Premultiplication by $\hat\beta^T$ and re-arrangement gives $\lambda\|\hat\beta\|^2 = (y - X\hat\beta)^TX\hat\beta$. Since

$\|y - X\hat\beta\|^2 = (y - X\hat\beta)^Ty - (y - X\hat\beta)^TX\hat\beta,$

$C_\lambda(\hat\beta) = (y - X\hat\beta)^Ty$. The result follows since the "fitted values" $X\hat\beta = S_\lambda y$.

Proof of Theorem 3: We use the notation introduced in Section 3: $A = [\alpha_1, \ldots, \alpha_k]$ and $B = [\beta_1, \ldots, \beta_k]$. Let

$C_\lambda(A, B) = \sum_{i=1}^n \|x_i - AB^Tx_i\|^2 + \lambda\sum_{j=1}^k \|\beta_j\|^2.$

As in (3.9), we have

$\sum_{i=1}^n \|x_i - AB^Tx_i\|^2 = \|X - XBA^T\|^2$   (A.2)
$\qquad\qquad\qquad\quad\;\; = \|XA_\perp\|^2 + \|XA - XB\|^2.$   (A.3)

Hence, with A fixed, solving

$\arg\min_B C_\lambda(A, B)$

is equivalent to solving the series of ridge regressions

$\arg\min_{\{\beta_j\}_1^k} \sum_{j=1}^k \left\{\|X\alpha_j - X\beta_j\|^2 + \lambda\|\beta_j\|^2\right\}.$

It is easy to show that

$\hat B = (X^TX + \lambda I)^{-1}X^TXA,$   (A.4)

and using Lemma 1 and (A.2) we have that the partially optimized penalized criterion is given by

$C_\lambda(A, \hat B) = \|XA_\perp\|^2 + \mathrm{tr}\left((XA)^T(I - S_\lambda)(XA)\right).$   (A.5)

Rearranging the terms, we get

$C_\lambda(A, \hat B) = \mathrm{tr}(X^TX) - \mathrm{tr}(A^TX^TS_\lambda XA),$   (A.6)

which must be minimized with respect to A with $A^TA = I$. Hence A should be taken to be the largest k eigenvectors of $X^TS_\lambda X$. If the SVD of X is $X = UDV^T$, it is easy to show that $X^TS_\lambda X = VD^2(D^2 + \lambda I)^{-1}D^2V^T$; hence $\hat A = V[\,, 1:k]$. Likewise, plugging the SVD of X into (A.4), we see that each of the $\hat\beta_j$ are scaled elements of the corresponding $V_j$.

Proof of Theorem 4: We expand the matrix norm

$\|M - NA^T\|^2 = \mathrm{tr}(M^TM) - 2\,\mathrm{tr}(M^TNA^T) + \mathrm{tr}(AN^TNA^T).$   (A.7)

Since $A^TA = I$, the last term is equal to $\mathrm{tr}(N^TN)$, and hence we need to maximize (minus half) the middle term. With the SVD $M^TN = UDV^T$, this middle term becomes

$\mathrm{tr}(M^TNA^T) = \mathrm{tr}(UDV^TA^T)$   (A.8)
$\qquad\qquad\quad\;\, = \mathrm{tr}(UDA^{*T})$   (A.9)
$\qquad\qquad\quad\;\, = \mathrm{tr}(A^{*T}UD),$   (A.10)

where $A^* = AV$, and since V is $k \times k$ orthonormal, $A^{*T}A^* = I$. Now since D is diagonal, (A.10) is maximized when the diagonal of $A^{*T}U$ is positive and maximal. By the Cauchy–Schwarz inequality, this is achieved when $A^* = U$, in which case the diagonal elements are all 1. Hence $\hat A = UV^T$.
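Theorem 4 can also be checked numerically; the R sketch below (ours) computes the reduced rank Procrustes rotation and compares its objective value with that of an arbitrary orthonormal competitor.

```r
# Numeric check of Theorem 4 (a sketch): A_hat = U V^T from the SVD of M^T N.
set.seed(2)
M  <- matrix(rnorm(30 * 8), 30, 8)
N  <- matrix(rnorm(30 * 3), 30, 3)
sv <- svd(crossprod(M, N))                 # SVD of M^T N (8 x 3)
A  <- sv$u %*% t(sv$v)                     # the reduced rank Procrustes rotation
round(crossprod(A), 10)                    # A^T A = I, so A is feasible
A0 <- qr.Q(qr(matrix(rnorm(8 * 3), 8, 3))) # an arbitrary feasible competitor
c(sum((M - N %*% t(A))^2), sum((M - N %*% t(A0))^2))  # A attains the smaller value
```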

Proof of Theorem 5: Let $\hat B^* = [\hat\beta_1^*, \ldots, \hat\beta_k^*]$ with $\hat\beta^* = (1 + \lambda)\hat\beta$; then $\hat V_i(\lambda) = \dfrac{\hat\beta_i^*}{\|\hat\beta_i^*\|}$. On the other hand, $\hat\beta = \dfrac{\hat\beta^*}{1 + \lambda}$ means that

$(\hat A, \hat B^*) = \arg\min_{A,B}\, C_{\lambda,\lambda_1}(A, B)$ subject to $A^TA = I_{k\times k}$,   (A.11)

where

$C_{\lambda,\lambda_1}(A, B) = \sum_{i=1}^n \left\|x_i - \dfrac{AB^T}{1+\lambda}x_i\right\|^2 + \lambda\sum_{j=1}^k \left\|\dfrac{\beta_j}{1+\lambda}\right\|^2 + \sum_{j=1}^k \lambda_{1,j}\left\|\dfrac{\beta_j}{1+\lambda}\right\|_1.$   (A.12)

Since

$\sum_{j=1}^k \left\|\dfrac{\beta_j}{1+\lambda}\right\|^2 = \dfrac{1}{(1+\lambda)^2}\,\mathrm{tr}(B^TB)$

and

$\sum_{i=1}^n \left\|x_i - \dfrac{AB^T}{1+\lambda}x_i\right\|^2 = \mathrm{tr}\left(\left(X - X\dfrac{B}{1+\lambda}A^T\right)^T\left(X - X\dfrac{B}{1+\lambda}A^T\right)\right) = \mathrm{tr}\left(X^TX\right) + \dfrac{1}{(1+\lambda)^2}\,\mathrm{tr}\left(B^TX^TXB\right) - \dfrac{2}{1+\lambda}\,\mathrm{tr}\left(A^TX^TXB\right),$

we have

$C_{\lambda,\lambda_1}(A, B) = \mathrm{tr}\left(X^TX\right) + \dfrac{1}{1+\lambda}\left[\mathrm{tr}\left(B^T\dfrac{X^TX + \lambda I}{1+\lambda}B\right) - 2\,\mathrm{tr}\left(A^TX^TXB\right) + \sum_{j=1}^k \lambda_{1,j}\|\beta_j\|_1\right],$

which implies that

$(\hat A, \hat B^*) = \arg\min_{A,B}\, \mathrm{tr}\left(B^T\dfrac{X^TX + \lambda I}{1+\lambda}B\right) - 2\,\mathrm{tr}\left(A^TX^TXB\right) + \sum_{j=1}^k \lambda_{1,j}\|\beta_j\|_1$   (A.13)

subject to $A^TA = I_{k\times k}$. As $\lambda \to \infty$, $\mathrm{tr}\left(B^T\dfrac{X^TX + \lambda I}{1+\lambda}B\right) \to \mathrm{tr}(B^TB) = \sum_{j=1}^k \|\beta_j\|^2$. Thus (A.13) approaches (4.1), and the conclusion of Theorem 5 follows.

ACKNOWLEDGMENTS

We thank the editor, an associate editor, and referees for helpful comments and suggestions which greatly improved the manuscript.

[Received April 2004. Revised June 2005.]

REFERENCES

Alter, O., Brown, P., and Botstein, D. (2000), "Singular Value Decomposition for Genome-Wide Expression Data Processing and Modeling," Proceedings of the National Academy of Sciences, 97, 10101–10106.

Cadima, J., and Jolliffe, I. (1995), "Loadings and Correlations in the Interpretation of Principal Components," Journal of Applied Statistics, 22, 203–214.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004), "Least Angle Regression," The Annals of Statistics, 32, 407–499.

Hancock, P., Burton, A., and Bruce, V. (1996), "Face Processing: Human Perception and Principal Components Analysis," Memory and Cognition, 24, 26–40.

Hastie, T., Tibshirani, R., and Friedman, J. (2001), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, New York: Springer Verlag.

Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Ross, D., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000), "'Gene Shaving' as a Method for Identifying Distinct Sets of Genes With Similar Expression Patterns," Genome Biology, 1, 1–21.

Jeffers, J. (1967), "Two Case Studies in the Application of Principal Component Analysis," Applied Statistics, 16, 225–236.

Jolliffe, I. (1986), Principal Component Analysis, New York: Springer Verlag.

——— (1995), "Rotation of Principal Components: Choice of Normalization Constraints," Journal of Applied Statistics, 22, 29–35.

Jolliffe, I. T., Trendafilov, N. T., and Uddin, M. (2003), "A Modified Principal Component Technique Based on the Lasso," Journal of Computational and Graphical Statistics, 12, 531–547.

Mardia, K., Kent, J., and Bibby, J. (1979), Multivariate Analysis, New York: Academic Press.

McCabe, G. (1984), "Principal Variables," Technometrics, 26, 137–144.

Osborne, M. R., Presnell, B., and Turlach, B. A. (2000), "A New Approach to Variable Selection in Least Squares Problems," IMA Journal of Numerical Analysis, 20, 389–403.

Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J., Poggio, T., Gerald, W., Loda, M., Lander, E., and Golub, T. (2001), "Multiclass Cancer Diagnosis Using Tumor Gene Expression Signature," Proceedings of the National Academy of Sciences, 98, 15149–15154.

Tibshirani, R. (1996), "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Series B, 58, 267–288.

Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G. (2002), "Diagnosis of Multiple Cancer Types by Shrunken Centroids of Gene Expression," Proceedings of the National Academy of Sciences, 99, 6567–6572.

Vines, S. (2000), "Simple Principal Components," Applied Statistics, 49, 441–451.

Zou, H., and Hastie, T. (2005), "Regularization and Variable Selection via the Elastic Net," Journal of the Royal Statistical Society, Series B, 67, 301–320.



Page 5: Sparse Principal Component Analysis - Stanford …web.stanford.edu/~hastie/Papers/spc_jcgs.pdf · Sparse Principal Component Analysis HuiZ OU,Trevor H ASTIE and RobertT IBSHIRANI

SPARSE PRINCIPAL COMPONENT ANALYSIS 269

Let v = βridge

βridge then v = Vi

The theme of this simple theorem is to show the connection between PCA and aregression method Regressing PCs on variables was discussed in Cadima and Jolliffe(1995) where they focused on approximating PCs by a subset of k variables We extend itto a more general case of ridge regression in order to handle all kinds of data especiallygene expression data Obviously when n gt p and X is a full rank matrix the theorem doesnot require a positive λ Note that if p gt n and λ = 0 ordinary multiple regression hasno unique solution that is exactly Vi The same happens when n gt p and X is not a fullrank matrix However PCA always gives a unique solution in all situations As shown inTheorem 1 this indeterminancy is eliminated by the positive ridge penalty (λβ2) Notethat after normalization the coefficients are independent of λ therefore the ridge penalty isnot used to penalize the regression coefficients but to ensure the reconstruction of principalcomponents

Now let us add theL1 penalty to (34) and consider the following optimization problem

β = arg minβ

Zi minus Xβ2 + λβ2 + λ1β1 (35)

where β1 =sump

j=1 |βj | is the 1-norm of β We call Vi = β

β an approximation to Vi and

XVi the ith approximated principal component Zou and Hastie (2005) called (35) naiveelastic net which differs from the elastic net by a scaling factor (1 + λ) Since we are usingthe normalized fitted coefficients the scaling factor does not affect Vi Clearly large enoughλ1 gives a sparse β hence a sparse Vi Given a fixed λ (35) is efficiently solved for all λ1

by using the LARS-EN algorithm (Zou and Hastie 2005) Thus we can flexibly choose asparse approximation to the ith principal component

32 SPARSE PRINCIPAL COMPONENTS BASED ON THE SPCA CRITERION

Theorem 1 depends on the results of PCA so it is not a genuine alternative Howeverit can be used in a two-stage exploratory analysis first perform PCA then use (35) to findsuitable sparse approximations

We now present a ldquoself-containedrdquo regression-type criterion to derive PCs Let xi denotethe ith row vector of the matrix X We first consider the leading principal component

Theorem 2 For any λ gt 0 let

(α β) = arg minαβ

nsumi=1

xi minus αβT xi2 + λβ2 (36)

subject to α2 = 1

Then β prop V1

The next theorem extends Theorem 2 to derive the whole sequence of PCs

Theorem 3 Suppose we are considering the firstk principal components Let Aptimesk =

270 H ZOU T HASTIE AND R TIBSHIRANI

[α1 αk] and Bptimesk = [β1 βk] For any λ gt 0 let

(A B) = arg minA B

nsumi=1

xi minus ABT xi2 + λ

ksumj=1

βj2 (37)

subject to AT A = Iktimesk

Then βj prop Vj for j = 1 2 kTheorems 2 and 3 effectively transform the PCA problem to a regression-type problem

The critical element is the objective functionsumn

i=1 xi minus ABT xi2 If we restrict B = Athen

nsumi=1

xi minus ABT xi2 =nsum

i=1

xi minus AAT xi2

whose minimizer under the orthonormal constraint on A is exactly the first k loading vectorsof ordinary PCA This formulation arises in the ldquoclosest approximating linear manifoldrdquoderivation of PCA (eg Hastie Tibshirani and Friedman 2001) Theorem 3 shows that wecan still have exact PCA while relaxing the restriction B = A and adding the ridge penaltyterm As can be seen later these generalizations enable us to flexibly modify PCA

The proofs of Theorems 2 and 3 are given in the Appendix here we give an intuitiveexplanation Note that

nsumi=1

xi minus ABT xi2 = X minus XBAT 2 (38)

Since A is orthonomal let Aperp be any orthonormal matrix such that [A Aperp] is p times p

orthonormal Then we have

X minus XBAT 2 = XAperp2 + XA minus XB2 (39)

= XAperp2 +ksum

j=1

Xαj minus Xβj2 (310)

Suppose A is given then the optimal B minimizing (37) should minimize

arg minB

ksumj=1

Xαj minus Xβj2 + λβj2 (311)

which is equivalent to k independent ridge regression problems In particular if A corre-sponds to the ordinary PCs that is A = V then by Theorem 1 we know that B should beproportional to V Actually the above view points out an effective algorithm for solving(37) which is revisited in the next section

We carry on the connection between PCA and regression and use the lasso approachto produce sparse loadings (ldquoregression coefficientsrdquo) For that purpose we add the lassopenalty into the criterion (37) and consider the following optimization problem

(A B) = arg minA B

nsumi=1

xi minus ABT xi2 + λ

ksumj=1

βj2 +ksum

j=1

λ1jβj1

(312)

subject to AT A = Iktimesk

SPARSE PRINCIPAL COMPONENT ANALYSIS 271

Whereas the same λ is used for all k components different λ1jrsquos are allowed for penalizingthe loadings of different principal components Again if p gt n a positive λ is required inorder to get exact PCA when the sparsity constraint (the lasso penalty) vanishes (λ1j = 0)We call (312) the SPCA criterion hereafter

3.3 NUMERICAL SOLUTION

We propose an alternating algorithm to minimize the SPCA criterion (3.12).

B given A: For each j, let Y^*_j = X α_j. By the same analysis used in (3.9)–(3.11), we know that \hat{B} = [\hat{β}_1, ..., \hat{β}_k], where each \hat{β}_j is an elastic net estimate,

\hat{β}_j = \arg\min_{β_j} \| Y^*_j - X β_j \|^2 + \lambda \|β_j\|^2 + \lambda_{1,j} \|β_j\|_1.   (3.13)

A given B: On the other hand, if B is fixed, then we can ignore the penalty part in (3.12) and only try to minimize \sum_{i=1}^{n} \| x_i - A B^T x_i \|^2 = \| X - X B A^T \|^2 subject to A^T A = I_{k×k}. The solution is obtained by a reduced rank form of the Procrustes rotation, given in Theorem 4 below. We compute the SVD

(X^T X) B = U D V^T   (3.14)

and set \hat{A} = U V^T.

Theorem 4 (Reduced Rank Procrustes Rotation). Let M_{n×p} and N_{n×k} be two matrices. Consider the constrained minimization problem

\hat{A} = \arg\min_{A} \| M - N A^T \|^2   subject to A^T A = I_{k×k}.   (3.15)

Suppose the SVD of M^T N is U D V^T; then \hat{A} = U V^T.

The usual Procrustes rotation (e.g., Mardia, Kent, and Bibby 1979) has N the same size as M.
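A small R check of Theorem 4 (our illustrative sketch, not part of the paper): compute A-hat = U V^T from the SVD of M^T N and compare its loss with that of randomly drawn feasible (orthonormal-column) matrices.

set.seed(2)
n <- 50; p <- 8; k <- 3
M <- matrix(rnorm(n * p), n, p)
N <- matrix(rnorm(n * k), n, k)
s <- svd(t(M) %*% N)                        # SVD of M'N = U D V'
A_hat <- s$u %*% t(s$v)                     # reduced rank Procrustes solution
loss <- function(A) sum((M - N %*% t(A))^2)
rand_A <- function() qr.Q(qr(matrix(rnorm(p * k), p, k)))  # random feasible A with A'A = I
loss(A_hat) <= min(replicate(50, loss(rand_A())))          # TRUE: A_hat attains the smaller loss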

It is worth pointing out that to solve (3.13) we only need to know the Gram matrix X^T X, because

\| Y^*_j - X β_j \|^2 + \lambda \|β_j\|^2 + \lambda_{1,j} \|β_j\|_1
  = (α_j - β_j)^T X^T X (α_j - β_j) + \lambda \|β_j\|^2 + \lambda_{1,j} \|β_j\|_1.   (3.16)

The same is true of (3.14).

Now (1/n) X^T X is the sample covariance matrix of X. Therefore, if Σ, the covariance matrix of X, is known, we can replace X^T X with Σ in (3.16) and have a population version of SPCA. If X is standardized beforehand, then we use the (sample) correlation matrix, which is preferred when the scales of the variables are different.


Although (3.16) (with Σ instead of X^T X) is not quite an elastic net problem, we can easily turn it into one. Create the artificial response Y^{**} and X^{**} as follows:

Y^{**} = Σ^{1/2} α_j,   X^{**} = Σ^{1/2};   (3.17)

then it is easy to check that

\hat{β}_j = \arg\min_{β} \| Y^{**} - X^{**} β \|^2 + \lambda \|β\|^2 + \lambda_{1,j} \|β\|_1.   (3.18)

Algorithm 1 summarizes the steps of our SPCA procedure outlined above.

Algorithm 1. General SPCA Algorithm
1. Let A start at V[, 1:k], the loadings of the first k ordinary principal components.
2. Given a fixed A = [α_1, ..., α_k], solve the following elastic net problem for j = 1, 2, ..., k:
   \hat{β}_j = \arg\min_{β} (α_j - β)^T X^T X (α_j - β) + \lambda \|β\|^2 + \lambda_{1,j} \|β\|_1.
3. For a fixed B = [β_1, ..., β_k], compute the SVD of X^T X B = U D V^T, then update A = U V^T.
4. Repeat Steps 2–3 until convergence.
5. Normalization: \hat{V}_j = β_j / \|β_j\|, j = 1, ..., k.

Some remarks:
1. Empirical evidence suggests that the output of the above algorithm does not change much as λ is varied. For n > p data, the default choice of λ can be zero. Practically, λ is chosen to be a small positive number to overcome potential collinearity problems in X. Section 4 discusses the default choice of λ for data with thousands of variables, such as gene expression arrays.
2. In principle, we can try several combinations of {λ_{1,j}} to figure out a good choice of the tuning parameters, since the above algorithm converges quite fast. There is a shortcut provided by the direct sparse approximation (3.5). The LARS-EN algorithm efficiently delivers a whole sequence of sparse approximations for each PC and the corresponding values of λ_{1,j}. Hence we can pick a λ_{1,j} that gives a good compromise between variance and sparsity. When facing the variance-sparsity trade-off, we let variance have a higher priority.
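The following R sketch of Algorithm 1 is our own illustration: the inner elastic net step (3.13) is solved here by a plain coordinate descent on the Gram matrix rather than by the LARS-EN solver used in the paper, and the function names (soft, enet_gram, spca_general), the fixed iteration counts, and the default tuning values are assumptions made only for this example.

soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

# minimize (alpha - b)' Gram (alpha - b) + lambda ||b||^2 + lambda1 ||b||_1
# by coordinate descent; only the Gram matrix X'X is needed, as in (3.16)
enet_gram <- function(Gram, alpha, lambda, lambda1, n_sweep = 100) {
  b <- alpha
  Ga <- as.vector(Gram %*% alpha)
  for (sweep in 1:n_sweep) {
    for (j in seq_along(b)) {
      zj <- Ga[j] - sum(Gram[j, ] * b) + Gram[j, j] * b[j]   # X_j'(Y*_j - X b_{-j})
      b[j] <- soft(zj, lambda1 / 2) / (Gram[j, j] + lambda)
    }
  }
  b
}

spca_general <- function(X, k, lambda = 1e-4, lambda1 = rep(0.1, k), n_outer = 50) {
  X <- scale(X, center = TRUE, scale = FALSE)
  Gram <- crossprod(X)
  A <- svd(X)$v[, 1:k, drop = FALSE]      # Step 1: start at the ordinary PC loadings
  B <- A
  for (it in 1:n_outer) {
    for (j in 1:k)                        # Step 2: elastic net for each column of B
      B[, j] <- enet_gram(Gram, A[, j], lambda, lambda1[j])
    s <- svd(Gram %*% B)                  # Step 3: reduced rank Procrustes update of A
    A <- s$u %*% t(s$v)
  }                                       # Step 4: fixed number of sweeps instead of a convergence test
  apply(B, 2, function(b) if (any(b != 0)) b / sqrt(sum(b^2)) else b)   # Step 5: normalize
}

For instance, spca_general(X, k = 2, lambda1 = c(0.5, 0.5)) would return two sparse loading vectors for a data matrix X; in a serious implementation the crude inner loop should be replaced by LARS-EN (or another exact elastic net solver) and a proper convergence check.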

3.4 ADJUSTED TOTAL VARIANCE

The ordinary principal components are uncorrelated and their loadings are orthogonal. Let Σ = X^T X; then V^T V = I_k and V^T Σ V is diagonal. It is easy to check that it is only for ordinary principal components that the loadings can satisfy both conditions. In Jolliffe, Trendafilov, and Uddin (2003) the loadings were forced to be orthogonal, so the uncorrelated property was sacrificed. SPCA does not explicitly impose the uncorrelated components condition either.


Let Ẑ be the modified PCs. Usually the total variance explained by Ẑ is calculated by tr(Ẑ^T Ẑ). This is reasonable when Ẑ are uncorrelated. However, if they are correlated, tr(Ẑ^T Ẑ) is too optimistic for representing the total variance. Suppose (Ẑ_i, i = 1, 2, ..., k) are the first k modified PCs by any method, and the (k+1)th modified PC Ẑ_{k+1} is obtained. We want to compute the total variance explained by the first k+1 modified PCs, which should be the sum of the explained variance by the first k modified PCs and the additional variance from Ẑ_{k+1}. If Ẑ_{k+1} is correlated with (Ẑ_i, i = 1, 2, ..., k), then its variance contains contributions from (Ẑ_i, i = 1, 2, ..., k), which should not be included into the total variance given the presence of (Ẑ_i, i = 1, 2, ..., k).

Here we propose a new formula to compute the total variance explained by Ẑ, which takes into account the correlations among Ẑ. We use regression projection to remove the linear dependence between correlated components. Denote by Ẑ_{j·1,...,j-1} the residual after adjusting Ẑ_j for Ẑ_1, ..., Ẑ_{j-1}; that is,

Ẑ_{j·1,...,j-1} = Ẑ_j - H_{1,...,j-1} Ẑ_j,   (3.19)

where H_{1,...,j-1} is the projection matrix on Ẑ_1, ..., Ẑ_{j-1}. Then the adjusted variance of Ẑ_j is \|Ẑ_{j·1,...,j-1}\|^2, and the total explained variance is defined as \sum_{j=1}^{k} \|Ẑ_{j·1,...,j-1}\|^2. When the modified PCs Ẑ are uncorrelated, the new formula agrees with tr(Ẑ^T Ẑ).

Note that the above computations depend on the order of the Ẑ_i. However, since we have a natural order in PCA, ordering is not an issue here. Using the QR decomposition, we can easily compute the adjusted variance. Suppose Ẑ = QR, where Q is orthonormal and R is upper triangular. Then it is straightforward to see that

\|Ẑ_{j·1,...,j-1}\|^2 = R_{jj}^2.   (3.20)

Hence the explained total variance is equal to \sum_{j=1}^{k} R_{jj}^2.
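An R sketch of this adjusted-variance computation (ours, for illustration; the function name adjusted_pev is an assumption):

# percentage of explained variance, adjusted for correlation via the QR trick (3.20)
adjusted_pev <- function(X, V) {
  Z <- X %*% V                        # modified PCs for a p x k loading matrix V
  R <- qr.R(qr(Z))                    # Z = QR with R upper triangular
  diag(R)^2 / sum(X^2)                # R_jj^2 as a fraction of the total variance tr(X'X)
}
# sanity check: for ordinary PCA loadings this reproduces the usual proportions
set.seed(3)
X <- scale(matrix(rnorm(200 * 5), 200, 5), center = TRUE, scale = FALSE)
d <- svd(X)$d
cbind(adjusted = adjusted_pev(X, svd(X)$v), usual = d^2 / sum(d^2))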

3.5 COMPUTATIONAL COMPLEXITY

PCA is computationally efficient for both n > p and p ≫ n data. We separately discuss the computational cost of the general SPCA algorithm for n > p and p ≫ n.

1. n > p. Traditional multivariate data fit in this category. Note that although the SPCA criterion is defined using X, it only depends on X via X^T X. A trick is to first compute the p × p matrix Σ = X^T X once for all, which requires np^2 operations. Then the same Σ is used at each step within the loop. Computing X^T X B costs p^2 k, and the SVD of X^T X B is of order O(pk^2). Each elastic net solution requires at most O(p^3) operations. Since k ≤ p, the total computation cost is at most np^2 + mO(p^3), where m is the number of iterations before convergence. Therefore the SPCA algorithm is able to efficiently handle data with huge n, as long as p is small (say p < 100).

2. p ≫ n. Gene expression arrays are typical examples in this p ≫ n category. The trick of using Σ is no longer applicable, because Σ is a huge matrix (p × p) in this case. The most consuming step is solving each elastic net, whose cost is of order O(pnJ + J^3) for a positive finite λ, where J is the number of nonzero coefficients. Generally speaking, the total cost is of order mkO(pJn + J^3), which can be expensive for large J and p. Fortunately, as shown in the next section, there exists a special SPCA algorithm for efficiently dealing with p ≫ n data.

4. SPCA FOR p ≫ n AND GENE EXPRESSION ARRAYS

For gene expression arrays, the number of variables (genes) is typically much bigger than the number of samples (e.g., p = 10,000, n = 100). Our general SPCA algorithm still fits this situation using a positive λ. However, the computational cost is expensive when a large number of nonzero loadings is required. It is desirable to simplify the general SPCA algorithm to boost the computation.

Observe that Theorem 3 is valid for all λ > 0, so in principle we can use any positive λ. It turns out that a thrifty solution emerges if λ → ∞. Precisely, we have the following theorem.

Theorem 5. Let \hat{V}_j(λ) = \hat{β}_j / \|\hat{β}_j\| (j = 1, ..., k) be the loadings derived from criterion (3.12). Let (\hat{A}, \hat{B}) be the solution of the optimization problem

(\hat{A}, \hat{B}) = \arg\min_{A, B} -2\, tr(A^T X^T X B) + \sum_{j=1}^{k} \|β_j\|^2 + \sum_{j=1}^{k} \lambda_{1,j} \|β_j\|_1   (4.1)

subject to A^T A = I_{k×k}.

When λ → ∞, \hat{V}_j(λ) → \hat{β}_j / \|\hat{β}_j\|.

We can use the same alternating algorithm of Section 3.3 to solve (4.1), where we only need to replace the general elastic net problem with its special case (λ = ∞). Note that, given A,

\hat{β}_j = \arg\min_{β_j} -2 α_j^T (X^T X) β_j + \|β_j\|^2 + \lambda_{1,j} \|β_j\|_1,   (4.2)

which has an explicit-form solution given in (4.3).

Gene Expression Arrays SPCA Algorithm. Replace Step 2 in the general SPCA algorithm with

Step 2*: for j = 1, 2, ..., k,

\hat{β}_j = \left( \left| α_j^T X^T X \right| - \frac{\lambda_{1,j}}{2} \right)_{+} \mathrm{Sign}(α_j^T X^T X).   (4.3)

The operation in (4.3) is called soft-thresholding. Figure 1 gives an illustration of how the soft-thresholding rule operates. Recently soft-thresholding has become increasingly popular in the literature. For example, nearest shrunken centroids (Tibshirani, Hastie, Narasimhan, and Chu 2002) adopts the soft-thresholding rule to simultaneously classify samples and select important genes in microarrays.
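A minimal R version of Step 2* (our sketch; the function and argument names are ours). For p ≫ n it helps to form (X^T X)α_j as X^T(Xα_j), so the p × p Gram matrix is never built:

soft_threshold_step <- function(X, A, lambda1) {
  # Z has columns (X'X) alpha_j, computed as X'(X alpha_j): cost O(npk), no p x p matrix
  Z <- crossprod(X, X %*% A)
  sapply(seq_len(ncol(A)), function(j)
    sign(Z[, j]) * pmax(abs(Z[, j]) - lambda1[j] / 2, 0))   # componentwise rule (4.3)
}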


Figure 1. An illustration of the soft-thresholding rule y = (|x| - Δ)_+ Sign(x) with Δ = 1.

5. EXAMPLES

5.1 PITPROPS DATA

The pitprops data, first introduced by Jeffers (1967), has 180 observations and 13 measured variables. It is a classic example showing the difficulty of interpreting principal components. Jeffers (1967) tried to interpret the first six PCs. Jolliffe, Trendafilov, and Uddin (2003) used their SCoTLASS to find the modified PCs. Table 1 presents the results of PCA, while Table 2 presents the modified PC loadings as computed by SCoTLASS and the adjusted variance computed using (3.20).

As a demonstration, we also considered the first six principal components. Since this is a usual n ≫ p dataset, we set λ = 0; λ_1 = (0.06, 0.16, 0.1, 0.5, 0.5, 0.5) were chosen according to Figure 2 such that each sparse approximation explained almost the same amount of variance as the ordinary PC did. Table 3 shows the obtained sparse loadings and the corresponding adjusted variance. Compared with the modified PCs of SCoTLASS, the PCs by SPCA account for a larger amount of variance (75.8% vs. 69.3%) with a much sparser loading structure. The important variables associated with the six PCs do not overlap, which further makes the interpretations easier and clearer. It is interesting to note that in Table 3, even though the variance does not strictly monotonically decrease, the adjusted variance follows the right order. However, Table 2 shows that this is not true for SCoTLASS. It is also worth mentioning that the entire SPCA computation was done in seconds in R, while the implementation of SCoTLASS for each t was expensive (Jolliffe, Trendafilov, and Uddin 2003). Optimizing SCoTLASS over several values of t is an even more difficult computational challenge.



Table 1. Pitprops Data: Loadings of the First Six Principal Components

Variable    PC1      PC2      PC3      PC4      PC5      PC6
topdiam   -0.404    0.218   -0.207    0.091   -0.083    0.120
length    -0.406    0.186   -0.235    0.103   -0.113    0.163
moist     -0.124    0.541    0.141   -0.078    0.350   -0.276
testsg    -0.173    0.456    0.352   -0.055    0.356   -0.054
ovensg    -0.057   -0.170    0.481   -0.049    0.176    0.626
ringtop   -0.284   -0.014    0.475    0.063   -0.316    0.052
ringbut   -0.400   -0.190    0.253    0.065   -0.215    0.003
bowmax    -0.294   -0.189   -0.243   -0.286    0.185   -0.055
bowdist   -0.357    0.017   -0.208   -0.097   -0.106    0.034
whorls    -0.379   -0.248   -0.119    0.205    0.156   -0.173
clear      0.011    0.205   -0.070   -0.804   -0.343    0.175
knots      0.115    0.343    0.092    0.301   -0.600   -0.170
diaknot    0.113    0.309   -0.326    0.303    0.080    0.626

Variance (%)              32.4   18.3   14.4    8.5    7.0    6.3
Cumulative variance (%)   32.4   50.7   65.1   73.6   80.6   86.9

Table 2. Pitprops Data: Loadings of the First Six Modified PCs by SCoTLASS (t = 1.75). Empty cells have zero loadings; each row below lists only its nonzero loadings, in PC order.

topdiam   0.664  -0.025   0.002  -0.035
length    0.683  -0.001  -0.040   0.001  -0.018
moist     0.641   0.195   0.180  -0.030
testsg    0.701   0.001  -0.001
ovensg   -0.887  -0.056
ringtop   0.293  -0.186  -0.373   0.044
ringbut   0.001   0.107  -0.658  -0.051   0.064
bowmax    0.001   0.735   0.021  -0.168
bowdist   0.283  -0.001
whorls    0.113  -0.001   0.388  -0.017   0.320
clear    -0.923
knots     0.001  -0.554   0.016   0.004
diaknot   0.703   0.001  -0.197   0.080

Number of nonzero loadings            6      6      6      6     10     13
Variance (%)                       19.6   16.0   13.1   13.1    9.2    9.0
Adjusted variance (%)              19.6   13.8   12.4    8.0    7.1    8.4
Cumulative adjusted variance (%)   19.6   33.4   45.8   53.8   60.9   69.3


Figure 2. Pitprops data: the sequences of sparse approximations to the first six principal components. The curves show the percentage of explained variance (PEV) as a function of λ_1. The vertical broken lines indicate the choice of λ_1 used in our SPCA analysis.


Table 3. Pitprops Data: Loadings of the First Six Sparse PCs by SPCA. Empty cells have zero loadings; each row below lists only its nonzero loadings, in PC order.

topdiam  -0.477
length   -0.476
moist     0.785
testsg    0.620
ovensg    0.177   0.640
ringtop   0.589
ringbut  -0.250   0.492
bowmax   -0.344  -0.021
bowdist  -0.416
whorls   -0.400
clear    -1
knots     0.013  -1
diaknot  -0.015   1

Number of nonzero loadings            7      4      4      1      1      1
Variance (%)                       28.0   14.4   15.0    7.7    7.7    7.7
Adjusted variance (%)              28.0   14.0   13.3    7.4    6.8    6.2
Cumulative adjusted variance (%)   28.0   42.0   55.3   62.7   69.5   75.8

Although the informal thresholding method, which we henceforth refer to as simple thresholding, has various drawbacks, it may serve as the benchmark for testing sparse PC methods. A variant of simple thresholding is soft-thresholding. We found that, when used in PCA, soft-thresholding performs very similarly to simple thresholding; thus we omitted the results of soft-thresholding in this article. Both SCoTLASS and SPCA were compared with simple thresholding. Table 4 presents the loadings and the corresponding variance explained by simple thresholding. To make the comparisons fair, we let the numbers of nonzero loadings obtained by simple thresholding match the results of SCoTLASS and SPCA, as shown in the top and bottom parts of Table 4, respectively. In terms of variance, it seems that simple thresholding is better than SCoTLASS and worse than SPCA. Moreover, the variables with nonzero loadings by SPCA are different from those chosen by simple thresholding for the first three PCs, while SCoTLASS seems to create a sparseness pattern similar to that of simple thresholding, especially in the leading PC.
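For reference, simple thresholding amounts to keeping the q largest absolute loadings of an ordinary PC, zeroing the rest, and renormalizing; a short R helper (our sketch, with hypothetical names) is:

simple_threshold <- function(v, q) {
  keep <- order(abs(v), decreasing = TRUE)[seq_len(q)]   # indices of the q largest |loadings|
  v[-keep] <- 0
  v / sqrt(sum(v^2))                                     # renormalize to unit length
}
# e.g. simple_threshold(pc_loading, 7) would give a 7-variable competitor to the
# SPCA leading component, in the spirit of the bottom half of Table 4
# (pc_loading is assumed to be the first column of the PCA loading matrix).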

5.2 A SYNTHETIC EXAMPLE

Our synthetic example has three hidden factors:

V_1 ~ N(0, 290),   V_2 ~ N(0, 300),
V_3 = -0.3 V_1 + 0.925 V_2 + ε,   ε ~ N(0, 1),

where V_1, V_2, and ε are independent.

Then 10 observable variables are constructed as follows:

X_i = V_1 + ε_i^1,   ε_i^1 ~ N(0, 1),   i = 1, 2, 3, 4,
X_i = V_2 + ε_i^2,   ε_i^2 ~ N(0, 1),   i = 5, 6, 7, 8,
X_i = V_3 + ε_i^3,   ε_i^3 ~ N(0, 1),   i = 9, 10,

where the ε_i^j are independent, j = 1, 2, 3, i = 1, ..., 10.

We used the exact covariance matrix of (X_1, ..., X_10) to perform PCA, SPCA, and simple thresholding (in the population setting).
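The population covariance matrix implied by this model is easy to write down; the R sketch below (ours, not from the paper) builds it and confirms that the first two eigenvalues carry almost all of the variance, which can be compared with the PCA columns of Table 5.

# factor loadings of X1-X10 on (V1, V2); X9, X10 load through V3 = -0.3 V1 + 0.925 V2 + eps
W <- rbind(matrix(c(1, 0), 4, 2, byrow = TRUE),
           matrix(c(0, 1), 4, 2, byrow = TRUE),
           matrix(c(-0.3, 0.925), 2, 2, byrow = TRUE))
Sigma <- W %*% diag(c(290, 300)) %*% t(W)
Sigma[9:10, 9:10] <- Sigma[9:10, 9:10] + 1     # shared eps inside V3 (variance 1)
Sigma <- Sigma + diag(10)                      # independent measurement noise on each X_i
e <- eigen(Sigma)
round(cumsum(e$values) / sum(e$values), 3)[2]  # ~0.996: the first two PCs explain 99.6%
round(e$vectors[, 1:3], 3)                     # compare (up to sign) with the PCA loadings in Table 5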

Table 4. Pitprops Data: Loadings of the First Six Modified PCs by Simple Thresholding. Empty cells have zero loadings; each row below lists only its nonzero loadings, in PC order.

Simple thresholding vs. SCoTLASS

topdiam  -0.439   0.240   0.120
length   -0.441   0.105  -0.114   0.163
moist     0.596   0.354  -0.276
testsg    0.503   0.391   0.360  -0.054
ovensg    0.534   0.178   0.626
ringtop   0.528  -0.320   0.052
ringbut  -0.435   0.281  -0.218   0.003
bowmax   -0.319  -0.270  -0.291   0.188  -0.055
bowdist  -0.388   0.034
whorls   -0.412  -0.274   0.209   0.158  -0.173
clear    -0.819  -0.347   0.175
knots     0.378   0.307  -0.608  -0.170
diaknot   0.340  -0.362   0.309   0.626

Number of nonzero loadings            6      6      6      6     10     13
Variance (%)                       28.9   16.1   15.4    8.4    7.1    6.3
Adjusted variance (%)              28.9   16.1   13.9    8.2    6.9    6.2
Cumulative adjusted variance (%)   28.9   45.0   58.9   67.1   74.0   80.2

Simple thresholding vs. SPCA

topdiam  -0.420
length   -0.422
moist     0.640
testsg    0.540   0.425
ovensg    0.580
ringtop  -0.296   0.573
ringbut  -0.416
bowmax   -0.305
bowdist  -0.370
whorls   -0.394
clear    -1
knots     0.406  -1
diaknot   0.365  -0.393   1

Number of nonzero loadings            7      4      4      1      1      1
Variance (%)                       30.7   14.8   13.6    7.7    7.7    7.7
Adjusted variance (%)              30.7   14.7   11.1    7.6    5.2    3.6
Cumulative adjusted variance (%)   30.7   45.4   56.5   64.1   68.3   71.9


Table 5. Results of the Simulation Example: Loadings and Variance

                       PCA                     SPCA (λ = 0)     Simple Thresholding
           PC1      PC2      PC3            PC1      PC2          PC1      PC2
X1        0.116   -0.478   -0.087             0      0.5            0     -0.5
X2        0.116   -0.478   -0.087             0      0.5            0     -0.5
X3        0.116   -0.478   -0.087             0      0.5            0     -0.5
X4        0.116   -0.478   -0.087             0      0.5            0     -0.5
X5       -0.395   -0.145    0.270           0.5      0              0      0
X6       -0.395   -0.145    0.270           0.5      0              0      0
X7       -0.395   -0.145    0.270           0.5      0           -0.497    0
X8       -0.395   -0.145    0.270           0.5      0           -0.497    0
X9       -0.401    0.010   -0.582             0      0           -0.503    0
X10      -0.401    0.010   -0.582             0      0           -0.503    0

Adjusted variance (%)   60.0   39.6   0.08   40.9   39.5          38.8   38.6

The variance of the three underlying factors is 290, 300, and 283.8, respectively. The numbers of variables associated with the three factors are 4, 4, and 2. Therefore V_2 and V_1 are almost equally important, and they are much more important than V_3. The first two PCs together explain 99.6% of the total variance. These facts suggest that we only need to consider two derived variables with "correct" sparse representations. Ideally, the first derived variable should recover the factor V_2 using only (X_5, X_6, X_7, X_8), and the second derived variable should recover the factor V_1 using only (X_1, X_2, X_3, X_4). In fact, if we sequentially maximize the variance of the first two derived variables under the orthonormal constraint, while restricting the numbers of nonzero loadings to four, then the first derived variable uniformly assigns nonzero loadings on (X_5, X_6, X_7, X_8) and the second derived variable uniformly assigns nonzero loadings on (X_1, X_2, X_3, X_4).

Both SPCA (λ = 0) and simple thresholding were carried out using the oracle information that the ideal sparse representations use only four variables. Table 5 summarizes the comparison results. Clearly, SPCA correctly identifies the sets of important variables. In fact, SPCA delivers the ideal sparse representations of the first two principal components. Mathematically, it is easy to show that if t = 2 is used, SCoTLASS is also able to find the same sparse solution. In this example, both SPCA and SCoTLASS produce the ideal sparse PCs, which may be explained by the fact that both methods explicitly use the lasso penalty.

In contrast, simple thresholding incorrectly includes X_9, X_10 among the most important variables. The variance explained by simple thresholding is also lower than that by SPCA, although the relative difference is small (less than 5%). Due to the high correlation between V_2 and V_3, the variables X_9, X_10 achieve loadings which are even higher than those of the truly important variables (X_5, X_6, X_7, X_8). Thus the truth is disguised by the high correlation. On the other hand, simple thresholding correctly discovers the second factor, because V_1 has a low correlation with V_3.

5.3 RAMASWAMY DATA

An important task in microarray analysis is to find a set of genes which are biologically relevant to the outcome (e.g., tumor type or survival time). PCA (or SVD) has been a popular tool for this purpose. Many gene-clustering methods in the literature use PCA (or SVD) as a building block. For example, gene shaving (Hastie et al. 2000) uses an iterative principal component shaving algorithm to identify subsets of coherent genes. Here we consider another approach to gene selection through SPCA. The idea is intuitive: if the (sparse) principal component can explain a large part of the total variance of gene expression levels, then the subset of genes representing the principal component is considered important.

Figure 3. The sparse leading principal component: percentage of explained variance versus sparsity. Simple thresholding and SPCA have similar performances. However, there still exists a consistent difference in the selected genes (the ones with nonzero loadings).

We illustrate the sparse PC selection method on Ramaswamy's data (Ramaswamy et al. 2001), which has 16,063 genes (p = 16,063) and 144 samples (n = 144). Its first principal component explains 46% of the total variance. For microarray data like this, it appears that SCoTLASS cannot be practically useful for finding sparse PCs. We applied SPCA (λ = ∞) to find the leading sparse PC. A sequence of values for λ_1 were used such that the number of nonzero loadings varied over a wide range. As displayed in Figure 3, the percentage of explained variance decreases at a slow rate as the sparsity increases. As few as 2.5% of these 16,063 genes can sufficiently construct the leading principal component, with an affordable loss of explained variance (from 46% to 40%). Simple thresholding was also applied to these data. It seems that, when using the same number of genes, simple thresholding always explains slightly higher variance than SPCA does. Among the same number of selected genes by SPCA and simple thresholding, there are about 2% different genes, and this difference rate is quite consistent.


6. DISCUSSION

It has been an interesting research topic for years to derive principal components with sparse loadings. From a practical point of view, a good method to achieve the sparseness goal should (at least) possess the following properties:

• Without any sparsity constraint, the method should reduce to PCA.
• It should be computationally efficient for both small p and big p data.
• It should avoid misidentifying the important variables.

The often-used simple thresholding approach is not criterion based. However, this informal method seems to possess the first two of the desirable properties listed above. If the explained variance and sparsity are the only concerns, simple thresholding is a reasonable approach, and it is extremely convenient. We have shown that simple thresholding can work well with gene expression arrays. The serious problem with simple thresholding is that it can misidentify the truly important variables. Nevertheless, simple thresholding is regarded as a benchmark for any potentially better method.

Using the lasso constraint in PCA, SCoTLASS successfully derives sparse loadings. However, SCoTLASS is not computationally efficient, and it lacks a good rule for picking its tuning parameter. In addition, it is not feasible to apply SCoTLASS to gene expression arrays, for which PCA is a quite popular tool.

In this work we have developed SPCA using our SPCA criterion (3.12). This new criterion gives exact PCA results when its sparsity (lasso) penalty term vanishes. SPCA allows flexible control on the sparse structure of the resulting loadings. Unified, efficient algorithms have been proposed to compute SPCA solutions for both regular multivariate data and gene expression arrays. As a principled procedure, SPCA enjoys advantages in several aspects, including computational efficiency, high explained variance, and an ability to identify important variables.

Software in R for fitting the SPCA model (and elastic net models) is available in the CRAN contributed package elasticnet.

APPENDIX: PROOFS

Proof of Theorem 1: Using X^T X = V D^2 V^T and V^T V = I, we have

\hat{β}_{ridge} = (X^T X + λ I)^{-1} X^T (X V_i) = V_i \frac{D_{ii}^2}{D_{ii}^2 + λ}.   (A.1)

Hence \hat{v} = V_i.

Note that since Theorem 2 is a special case of Theorem 3, we will not prove it separately. We first provide a lemma.

Lemma 1. Consider the ridge regression criterion

C_λ(β) = \| y - X β \|^2 + λ \|β\|^2.

Then, if \hat{β} = \arg\min_{β} C_λ(β),

C_λ(\hat{β}) = y^T (I - S_λ) y,

where S_λ is the ridge operator

S_λ = X (X^T X + λ I)^{-1} X^T.

Proof of Lemma 1: Differentiating C_λ with respect to β, we get

-X^T (y - X \hat{β}) + λ \hat{β} = 0.

Premultiplication by \hat{β}^T and rearrangement gives λ \|\hat{β}\|^2 = (y - X \hat{β})^T X \hat{β}. Since

\| y - X \hat{β} \|^2 = (y - X \hat{β})^T y - (y - X \hat{β})^T X \hat{β},

C_λ(\hat{β}) = (y - X \hat{β})^T y. The result follows since the "fitted values" X \hat{β} = S_λ y.

Proof of Theorem 3: We use the notation introduced in Section 3: A = [α_1, ..., α_k] and B = [β_1, ..., β_k]. Let

C_λ(A, B) = \sum_{i=1}^{n} \| x_i - A B^T x_i \|^2 + λ \sum_{j=1}^{k} \|β_j\|^2.

As in (3.9), we have

\sum_{i=1}^{n} \| x_i - A B^T x_i \|^2 = \| X - X B A^T \|^2   (A.2)
  = \| X A_⊥ \|^2 + \| X A - X B \|^2.   (A.3)

Hence, with A fixed, solving

\arg\min_{B} C_λ(A, B)

is equivalent to solving the series of ridge regressions

\arg\min_{\{β_j\}_{1}^{k}} \sum_{j=1}^{k} \| X α_j - X β_j \|^2 + λ \|β_j\|^2.

It is easy to show that

\hat{B} = (X^T X + λ I)^{-1} X^T X A,   (A.4)

and, using Lemma 1 and (A.2), the partially optimized penalized criterion is given by

C_λ(A, \hat{B}) = \| X A_⊥ \|^2 + tr((X A)^T (I - S_λ)(X A)).   (A.5)

Rearranging the terms, we get

C_λ(A, \hat{B}) = tr(X^T X) - tr(A^T X^T S_λ X A),   (A.6)

which must be minimized with respect to A with A^T A = I. Hence \hat{A} should be taken to be the largest k eigenvectors of X^T S_λ X. If the SVD of X is X = U D V^T, it is easy to show that X^T S_λ X = V D^2 (D^2 + λ I)^{-1} D^2 V^T; hence \hat{A} = V[, 1:k]. Likewise, plugging the SVD of X into (A.4), we see that each \hat{β}_j is a scaled version of the corresponding V_j.

Proof of Theorem 4: We expand the matrix norm

\| M - N A^T \|^2 = tr(M^T M) - 2 tr(M^T N A^T) + tr(A N^T N A^T).   (A.7)

Since A^T A = I, the last term is equal to tr(N^T N), and hence we need to maximize (minus half) the middle term. With the SVD M^T N = U D V^T, this middle term becomes

tr(M^T N A^T) = tr(U D V^T A^T)   (A.8)
  = tr(U D A^{*T})   (A.9)
  = tr(A^{*T} U D),   (A.10)

where A^* = A V, and since V is k × k orthonormal, A^{*T} A^* = I. Now since D is diagonal, (A.10) is maximized when the diagonal of A^{*T} U is positive and maximal. By the Cauchy–Schwarz inequality, this is achieved when A^* = U, in which case the diagonal elements are all 1. Hence \hat{A} = U V^T.

Proof of Theorem 5: Let B^* = [β^*_1, ..., β^*_k] with β^*_j = (1 + λ) β_j; then \hat{V}_i(λ) = \hat{β}^*_i / \|\hat{β}^*_i\|. On the other hand, β_j = β^*_j / (1 + λ) means that

(\hat{A}, \hat{B}^*) = \arg\min_{A, B} C_{λ, λ_1}(A, B)   subject to A^T A = I_{k×k},   (A.11)

where

C_{λ, λ_1}(A, B) = \sum_{i=1}^{n} \left\| x_i - A \frac{B^T}{1 + λ} x_i \right\|^2 + λ \sum_{j=1}^{k} \left\| \frac{β_j}{1 + λ} \right\|^2 + \sum_{j=1}^{k} λ_{1,j} \left\| \frac{β_j}{1 + λ} \right\|_1.   (A.12)

Since

\sum_{j=1}^{k} \left\| \frac{β_j}{1 + λ} \right\|^2 = \frac{1}{(1 + λ)^2} tr(B^T B)

and

\sum_{i=1}^{n} \left\| x_i - A \frac{B^T}{1 + λ} x_i \right\|^2
  = tr\left( \left( X - X \frac{B}{1 + λ} A^T \right)^T \left( X - X \frac{B}{1 + λ} A^T \right) \right)
  = tr(X^T X) + \frac{1}{(1 + λ)^2} tr(B^T X^T X B) - \frac{2}{1 + λ} tr(A^T X^T X B),

we have

C_{λ, λ_1}(A, B) = tr(X^T X) + \frac{1}{1 + λ} \left[ tr\left( B^T \frac{X^T X + λ I}{1 + λ} B \right) - 2\, tr(A^T X^T X B) + \sum_{j=1}^{k} λ_{1,j} \|β_j\|_1 \right],

which implies that

(\hat{A}, \hat{B}^*) = \arg\min_{A, B} tr\left( B^T \frac{X^T X + λ I}{1 + λ} B \right) - 2\, tr(A^T X^T X B) + \sum_{j=1}^{k} λ_{1,j} \|β_j\|_1   (A.13)

subject to A^T A = I_{k×k}. As λ → ∞, tr\left( B^T \frac{X^T X + λ I}{1 + λ} B \right) → tr(B^T B) = \sum_{j=1}^{k} \|β_j\|^2. Thus (A.13) approaches (4.1), and the conclusion of Theorem 5 follows.

ACKNOWLEDGMENTS

We thank the editor, an associate editor, and the referees for helpful comments and suggestions which greatly improved the manuscript.

[Received April 2004. Revised June 2005.]

REFERENCES

Alter, O., Brown, P., and Botstein, D. (2000), "Singular Value Decomposition for Genome-Wide Expression Data Processing and Modeling," Proceedings of the National Academy of Sciences, 97, 10101–10106.

Cadima, J., and Jolliffe, I. (1995), "Loadings and Correlations in the Interpretation of Principal Components," Journal of Applied Statistics, 22, 203–214.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004), "Least Angle Regression," The Annals of Statistics, 32, 407–499.

Hancock, P., Burton, A., and Bruce, V. (1996), "Face Processing: Human Perception and Principal Components Analysis," Memory and Cognition, 24, 26–40.

Hastie, T., Tibshirani, R., and Friedman, J. (2001), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, New York: Springer-Verlag.

Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Ross, D., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000), "'Gene Shaving' as a Method for Identifying Distinct Sets of Genes With Similar Expression Patterns," Genome Biology, 1, 1–21.

Jeffers, J. (1967), "Two Case Studies in the Application of Principal Component," Applied Statistics, 16, 225–236.

Jolliffe, I. (1986), Principal Component Analysis, New York: Springer-Verlag.

——— (1995), "Rotation of Principal Components: Choice of Normalization Constraints," Journal of Applied Statistics, 22, 29–35.

Jolliffe, I. T., Trendafilov, N. T., and Uddin, M. (2003), "A Modified Principal Component Technique Based on the Lasso," Journal of Computational and Graphical Statistics, 12, 531–547.

Mardia, K., Kent, J., and Bibby, J. (1979), Multivariate Analysis, New York: Academic Press.

McCabe, G. (1984), "Principal Variables," Technometrics, 26, 137–144.

Osborne, M. R., Presnell, B., and Turlach, B. A. (2000), "A New Approach to Variable Selection in Least Squares Problems," IMA Journal of Numerical Analysis, 20, 389–403.

Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J., Poggio, T., Gerald, W., Loda, M., Lander, E., and Golub, T. (2001), "Multiclass Cancer Diagnosis Using Tumor Gene Expression Signature," Proceedings of the National Academy of Sciences, 98, 15149–15154.

Tibshirani, R. (1996), "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Series B, 58, 267–288.

Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G. (2002), "Diagnosis of Multiple Cancer Types by Shrunken Centroids of Gene Expression," Proceedings of the National Academy of Sciences, 99, 6567–6572.

Vines, S. (2000), "Simple Principal Components," Applied Statistics, 49, 441–451.

Zou, H., and Hastie, T. (2005), "Regularization and Variable Selection via the Elastic Net," Journal of the Royal Statistical Society, Series B, 67, 301–320.

Page 6: Sparse Principal Component Analysis - Stanford …web.stanford.edu/~hastie/Papers/spc_jcgs.pdf · Sparse Principal Component Analysis HuiZ OU,Trevor H ASTIE and RobertT IBSHIRANI

270 H ZOU T HASTIE AND R TIBSHIRANI

[α1 αk] and Bptimesk = [β1 βk] For any λ gt 0 let

(A B) = arg minA B

nsumi=1

xi minus ABT xi2 + λ

ksumj=1

βj2 (37)

subject to AT A = Iktimesk

Then βj prop Vj for j = 1 2 kTheorems 2 and 3 effectively transform the PCA problem to a regression-type problem

The critical element is the objective functionsumn

i=1 xi minus ABT xi2 If we restrict B = Athen

nsumi=1

xi minus ABT xi2 =nsum

i=1

xi minus AAT xi2

whose minimizer under the orthonormal constraint on A is exactly the first k loading vectorsof ordinary PCA This formulation arises in the ldquoclosest approximating linear manifoldrdquoderivation of PCA (eg Hastie Tibshirani and Friedman 2001) Theorem 3 shows that wecan still have exact PCA while relaxing the restriction B = A and adding the ridge penaltyterm As can be seen later these generalizations enable us to flexibly modify PCA

The proofs of Theorems 2 and 3 are given in the Appendix here we give an intuitiveexplanation Note that

nsumi=1

xi minus ABT xi2 = X minus XBAT 2 (38)

Since A is orthonomal let Aperp be any orthonormal matrix such that [A Aperp] is p times p

orthonormal Then we have

X minus XBAT 2 = XAperp2 + XA minus XB2 (39)

= XAperp2 +ksum

j=1

Xαj minus Xβj2 (310)

Suppose A is given then the optimal B minimizing (37) should minimize

arg minB

ksumj=1

Xαj minus Xβj2 + λβj2 (311)

which is equivalent to k independent ridge regression problems In particular if A corre-sponds to the ordinary PCs that is A = V then by Theorem 1 we know that B should beproportional to V Actually the above view points out an effective algorithm for solving(37) which is revisited in the next section

We carry on the connection between PCA and regression and use the lasso approachto produce sparse loadings (ldquoregression coefficientsrdquo) For that purpose we add the lassopenalty into the criterion (37) and consider the following optimization problem

(A B) = arg minA B

nsumi=1

xi minus ABT xi2 + λ

ksumj=1

βj2 +ksum

j=1

λ1jβj1

(312)

subject to AT A = Iktimesk

SPARSE PRINCIPAL COMPONENT ANALYSIS 271

Whereas the same λ is used for all k components different λ1jrsquos are allowed for penalizingthe loadings of different principal components Again if p gt n a positive λ is required inorder to get exact PCA when the sparsity constraint (the lasso penalty) vanishes (λ1j = 0)We call (312) the SPCA criterion hereafter

33 NUMERICAL SOLUTION

We propose an alternating algorithm to minimize the SPCA criterion (312)

B given A For each j let Y lowastj = Xαj By the same analysis used in (39)ndash(311) we know

that B = [β1 βk] where each βj is an elastic net estimate

βj = arg minβj

Y lowastj minus Xβj2 + λβj2 + λ1jβj1 (313)

A given B On the other hand if B is fixed then we can ignore the penalty part in (312)

and only try to minimizesumn

i=1 ximinusABT xi2 = XminusXBAT 2 subject to AT A =Iktimesk The solution is obtained by a reduced rank form of the Procrustes rotation

given in Theorem 4 below We compute the SVD

(XT X)B = UDVT (314)

and set A = UVT

Theorem 4 Reduced Rank Procrustes Rotation Let Mntimesp and Nntimesk be two ma-

trices Consider the constrained minimization problem

A = arg minA

M minus NAT 2 subject to AT A = Iktimesk (315)

Suppose the SVD of MT N is UDVT then A = UVT

The usual Procrustes rotation (eg Mardia Kent and Bibby 1979) has N the same size

as M

It is worth pointing out that to solve (313) we only need to know the Gram matrix

XT X because

Y lowastj minus Xβj2 + λβj2 + λ1jβj1

= (αj minus βj)T XT X(αj minus βj) + λβj2 + λ1jβj1 (316)

The same is true of (314)

Now 1nXT X is the sample covariance matrix of X Therefore if Σ the covariance matrix

of X is known we can replace XT X with Σ in (316) and have a population version of

SPCA If X is standardized beforehand then we use the (sample) correlation matrix which

is preferred when the scales of the variables are different

272 H ZOU T HASTIE AND R TIBSHIRANI

Although (316) (with Σ instead of XT X) is not quite an elastic net problem we caneasily turn it into one Create the artificial response Y lowastlowast and Xlowastlowast as follows

Y lowastlowast = Σ12αj Xlowastlowast = Σ

12 (317)

then it is easy to check that

βj = arg minβ

Y lowastlowast minus Xlowastlowastβ2 + λβ2 + λ1jβ1 (318)

Algorithm 1 summarizes the steps of our SPCA procedure outlined above

Algorithm 1 General SPCA Algorithm1 Let A start at V[ 1 k] the loadings of the first k ordinary principal components2 Given a fixed A = [α1 αk] solve the following elastic net problem for j =

1 2 k

βj = arg minβ

(αj minus β)T XT X(αj minus β) + λβ2 + λ1jβ1

3 For a fixed B = [β1 middot middot middot βk] compute the SVD of XT XB = UDVT then updateA = UVT

4 Repeat Steps 2ndash3 until convergence5 Normalization Vj = βj

βj j = 1 kSome remarks1 Empirical evidence suggests that the output of the above algorithm does not change

much as λ is varied For n gt p data the default choice of λ can be zero Practically λis chosen to be a small positive number to overcome potential collinearity problemsin X Section 4 discusses the default choice of λ for data with thousands of variablessuch as gene expression arrays

2 In principle we can try several combinations of λ1j to figure out a good choiceof the tuning parameters since the above algorithm converges quite fast There is ashortcut provided by the direct sparse approximation (35) The LARS-EN algorithmefficiently delivers a whole sequence of sparse approximations for each PC and thecorresponding values ofλ1j Hence we can pick aλ1j that gives a good compromisebetween variance and sparsity When facing the variance-sparsity trade-off we letvariance have a higher priority

34 ADJUSTED TOTAL VARIANCE

The ordinary principal components are uncorrelated and their loadings are orthogonalLet ΣΣΣ = XT X then VT V = Ik and VT ΣΣΣV is diagonal It is easy to check that it isonly for ordinary principal components the the loadings can satisfy both conditions InJolliffe Trendafilov and Uddin (2003) the loadings were forced to be orthogonal so theuncorrelated property was sacrificed SPCA does not explicitly impose the uncorrelatedcomponents condition either

SPARSE PRINCIPAL COMPONENT ANALYSIS 273

Let Z be the modified PCs Usually the total variance explained by Z is calculated

by tr(ZT

Z) This is reasonable when Z are uncorrelated However if they are correlated

tr(ZT

Z) is too optimistic for representing the total variance Suppose (Zi i = 1 2 k)are the first kmodified PCs by any method and the (k+1)th modified PC Zk+1 is obtainedWe want to compute the total variance explained by the first k + 1 modified PCs whichshould be the sum of the explained variance by the first k modified PCs and the additionalvariance from Zk+1 If Zk+1 is correlated with (Zi i = 1 2 k) then its variancecontains contributions from (Zi i = 1 2 k) which should not be included into thetotal variance given the presence of (Zi i = 1 2 k)

Here we propose a new formula to compute the total variance explained by Z whichtakes into account the correlations among Z We use regression projection to remove thelinear dependence between correlated components Denote Zjmiddot1jminus1 the residual afteradjusting Zj for Z1 Zjminus1 that is

Zjmiddot1jminus1 = Zj minus H1jminus1Zj (319)

where H1jminus1 is the projection matrix on Zijminus11 Then the adjusted variance of Zj is

Zjmiddot1jminus12 and the total explained variance is defined assumk

j=1 Zjmiddot1jminus12 When

the modified PCs Z are uncorrelated the new formula agrees with tr(ZT

Z)Note that the above computations depend on the order of Zi However since we have

a natural order in PCA ordering is not an issue here Using the QR decomposition we caneasily compute the adjusted variance Suppose Z = QR where Q is orthonormal and R isupper triangular Then it is straightforward to see that

Zjmiddot1jminus12 = R2jj (320)

Hence the explained total variance is equal tosumk

j=1 R2jj

35 COMPUTATION COMPLEXITY

PCA is computationally efficient for both n gt p or p n data We separately discussthe computational cost of the general SPCA algorithm for n gt p and p n

1 n gt p Traditional multivariate data fit in this category Note that although the SPCAcriterion is defined using X it only depends on X via XT X A trick is to first computethe p times p matrix ΣΣΣ = XT X once for all which requires np2 operations Then thesame ΣΣΣ is used at each step within the loop Computing XT Xβ costs p2k and theSVD of XT Xβ is of order O(pk2) Each elastic net solution requires at most O(p3)operations Since k le p the total computation cost is at most np2 +mO(p3) wherem is the number of iterations before convergence Therefore the SPCA algorithm isable to efficiently handle data with huge n as long as p is small (say p lt 100)

2 p n Gene expression arrays are typical examples in this p n category Thetrick of using ΣΣΣ is no longer applicable because ΣΣΣ is a huge matrix (ptimes p) in thiscase The most consuming step is solving each elastic net whose cost is of order

274 H ZOU T HASTIE AND R TIBSHIRANI

O(pnJ + J3) for a positive finite λ where J is the number of nonzero coefficientsGenerally speaking the total cost is of ordermkO(pJn+J3) which can be expensivefor large J and p Fortunately as shown in the next section there exists a specialSPCA algorithm for efficiently dealing with p n data

4 SPCA FOR p n AND GENE EXPRESSION ARRAYS

For gene expression arrays the number of variables (genes) is typically much biggerthan the number of samples (eg n = 10 000 p = 100) Our general SPCA algorithmstill fits this situation using a positive λ However the computational cost is expensive whenrequiring a large number of nonzero loadings It is desirable to simplify the general SPCAalgorithm to boost the computation

Observe that Theorem 3 is valid for all λ gt 0 so in principle we can use any positiveλ It turns out that a thrifty solution emerges if λ rarr infin Precisely we have the followingtheorem

Theorem 5 Let Vj(λ) = βj

βj (j = 1 k) be the loadings derived from criterion

(312) Let (A B) be the solution of the optimization problem

(A B) = arg minA B

minus2tr(

AT XT XB)

+ksum

j=1

βj2 +ksum

j=1

λ1jβj1 (41)

subject to AT A = Iktimesk

When λ rarr infin Vj(λ) rarr βj

βj We can use the same alternating algorithm in Section 33 to solve (41) where we only

need to replace the general elastic net problem with its special case (λ = infin) Note thatgiven A

βj = arg minβj

minus2αTj (XT X)βj + βj2 + λ1jβj1 (42)

which has an explicit form solution given in (43)

Gene Expression Arrays SPCA Algorithm Replacing Step 2 in the general SPCAalgorithm withStep 2lowast for j = 1 2 k

βj =(∣∣αT

j XT X∣∣minus λ1j

2

)+

Sign(αTj XT X) (43)

The operation in (43) is called soft-thresholding Figure 1 gives an illustration of how thesoft-thresholding rule operates Recently soft-thresholding has become increasingly popularin the literature For example nearest shrunken centroids (Tibshirani Hastie Narasimhanand Chu 2002) adopts the soft-thresholding rule to simultaneously classify samples andselect important genes in microarrays

SPARSE PRINCIPAL COMPONENT ANALYSIS 275

Figure 1 An illustration of soft-thresholding rule y = (|x| minus ∆)+Sign(x) with ∆ = 1

5 EXAMPLES

51 PITPROPS DATA

The pitprops data first introduced by Jeffers (1967) has 180 observations and 13 mea-sured variables It is a classic example showing the difficulty of interpreting principal com-ponents Jeffers (1967) tried to interpret the first six PCs Jolliffe Trendafilov and Uddin(2003) used their SCoTLASS to find the modified PCs Table 1 presents the results ofPCA while Table 2 presents the modified PC loadings as computed by SCoTLASS and theadjusted variance computed using (320)

As a demonstration we also considered the first six principal components Since thisis a usual n p dataset we set λ = 0 λ1 = (006 016 01 05 05 05) were chosenaccording to Figure 2 such that each sparse approximation explained almost the sameamount of variance as the ordinary PC did Table 3 shows the obtained sparse loadings andthe corresponding adjusted variance Compared with the modified PCs of SCoTLASS PCsby SPCA account for a larger amount of variance (758 vs 693) with a much sparserloading structure The important variables associated with the six PCs do not overlap whichfurther makes the interpretations easier and clearer It is interesting to note that in Table 3even though the variance does not strictly monotonously decrease the adjusted variancefollows the right order However Table 2 shows that this is not true in SCoTLASS Itis also worthy of mention that the entire SPCA computation was done in seconds in Rwhile the implementation of SCoTLASS for each t was expensive (Jolliffe Trendafilovand Uddin 2003) Optimizing SCoTLASS over several values of t is an even more difficultcomputational challenge

Although the informal thresholding method which we henceforth refer to as simple

276 H ZOU T HASTIE AND R TIBSHIRANI

Table 1 Pitprops Data Loadings of the First Six Principal Components

Variable PC1 PC2 PC3 PC4 PC5 PC6

topdiam minus0404 0218 minus0207 0091 minus0083 0120length minus0406 0186 minus0235 0103 minus0113 0163moist minus0124 0541 0141 minus0078 0350 minus0276testsg minus0173 0456 0352 minus0055 0356 minus0054ovensg minus0057 minus0170 0481 minus0049 0176 0626ringtop minus0284 minus0014 0475 0063 minus0316 0052ringbut minus0400 minus0190 0253 0065 minus0215 0003bowmax minus0294 minus0189 minus0243 minus0286 0185 minus0055bowdist minus0357 0017 minus0208 minus0097 minus0106 0034whorls minus0379 minus0248 minus0119 0205 0156 minus0173clear 0011 0205 minus0070 minus0804 minus0343 0175knots 0115 0343 0092 0301 minus0600 minus0170diaknot 0113 0309 minus0326 0303 0080 0626

Variance () 324 183 144 85 70 63Cumulative variance () 324 507 651 736 806 869

Table 2 Pitprops Data Loadings of the First Six Modified PCs by SCoTLASS Empty cells have zeroloadings

t = 175Variable PC1 PC2 PC3 PC4 PC5 PC6

topdiam 0664 minus0025 0002 minus0035length 0683 minus0001 minus0040 0001 minus0018moist 0641 0195 0180 minus0030testsg 0701 0001 minus0001ovensg minus0887 minus0056ringtop 0293 minus0186 minus0373 0044ringbut 0001 0107 minus0658 minus0051 0064bowmax 0001 0735 0021 minus0168bowdist 0283 minus0001whorls 0113 minus0001 0388 minus0017 0320clear minus0923knots 0001 minus0554 0016 0004diaknot 0703 0001 minus0197 0080

Number of nonzero loadings 6 6 6 6 10 13Variance () 196 160 131 131 92 90Adjusted variance () 196 138 124 80 71 84Cumulative adjusted variance () 196 334 458 538 609 693

SPARSE PRINCIPAL COMPONENT ANALYSIS 277

Figure 2 Pitprops data The sequences of sparse approximations to the first six principal components The curvesshow the percentage of explained variance (PEV) as a function of λ1 The vertical broken lines indicate the choiceof λ1 used in our SPCA analysis

278 H ZOU T HASTIE AND R TIBSHIRANI

Table 3 Pitprops Data Loadings of the First Six Sparse PCs by SPCA Empty cells have zero loadings

Variable PC1 PC2 PC3 PC4 PC5 PC6

topdiam minus0477length minus0476moist 0785testsg 0620ovensg 0177 0640ringtop 0589ringbut minus0250 0492bowmax minus0344 minus0021bowdist minus0416whorls minus0400clear minus1knots 0013 minus1diaknot minus0015 1

Number of nonzero loadings 7 4 4 1 1 1Variance () 280 144 150 77 77 77Adjusted variance () 280 140 133 74 68 62Cumulative adjusted variance () 280 420 553 627 695 758

thresholding has various drawbacks it may serve as the benchmark for testing sparse PCsmethods A variant of simple thresholding is soft-thresholding We found that when used inPCA soft-thresholding performs very similarly to simple thresholding Thus we omitted theresults of soft-thresholding in this article Both SCoTLASS and SPCA were compared withsimple thresholding Table 4 presents the loadings and the corresponding variance explainedby simple thresholding To make the comparisons fair we let the numbers of nonzeroloadings obtained by simple thresholding match the results of SCoTLASS and SPCA asshown in the top and bottom parts of Table 4 respectively In terms of variance it seemsthat simple thresholding is better than SCoTLASS and worse than SPCA Moreover thevariables with nonzero loadings by SPCA are different to that chosen by simple thresholdingfor the first three PCs while SCoTLASS seems to create a similar sparseness pattern assimple thresholding does especially in the leading PC

52 A SYNTHETIC EXAMPLE

Our synthetic example has three hidden factors

V1 sim N(0 290) V2 sim N(0 300)

V3 = minus03V1 + 0925V2 + ε ε sim N(0 1)

and

V1 V2 and ε are independent

Then 10 observable variables are constructed as follows

Xi = V1 + ε1i ε1

i sim N(0 1) i = 1 2 3 4

SPARSE PRINCIPAL COMPONENT ANALYSIS 279

Xi = V2 + ε2i ε2

i sim N(0 1) i = 5 6 7 8

Xi = V3 + ε3i ε3

i sim N(0 1) i = 9 10

εji are independent j = 1 2 3 i = 1 10

We used the exact covariance matrix of (X1 X10) to perform PCA SPCA and simplethresholding (in the population setting)

Table 4 Pitprops Data Loadings of the First Six Modified PCs by Simple Thresholding Empty cellshave zero loadings

Variable PC1 PC2 PC3 PC4 PC5 PC6

Simple thresholding vs SCoTLASS

topdiam minus0439 0240 0120length minus0441 0105 minus0114 0163moist 0596 0354 minus0276testsg 0503 0391 0360 minus0054ovensg 0534 0178 0626ringtop 0528 minus0320 0052ringbut minus0435 0281 minus0218 0003bowmax minus0319 minus0270 minus0291 0188 minus0055bowdist minus0388 0034whorls minus0412 minus0274 0209 0158 minus0173clear minus0819 minus0347 0175knots 0378 0307 minus0608 minus0170diaknot 0340 minus0362 0309 0626

Number of nonzero loadings 6 6 6 6 10 13Variance () 289 161 154 84 71 63Adjusted Variance () 289 161 139 82 69 62Cumulative adjusted variance () 289 450 589 671 740 802

Simple thresholding vs SPCA

topdiam minus0420length minus0422moist 0640testsg 0540 0425ovensg 0580ringtop minus0296 0573ringbut minus0416bowmax minus0305bowdist minus0370whorls minus0394clear minus1knots 0406 minus1diaknot 0365 minus0393 1

Number of nonzero loadings 7 4 4 1 1 1Variance () 307 148 136 77 77 77Adjusted variance () 307 147 111 76 52 36Cumulative adjusted variance () 307 454 565 641 683 719

280 H ZOU T HASTIE AND R TIBSHIRANI

Table 5 Results of the Simulation Example Loadings and Variance

PCA SPCA (λ = 0) Simple ThresholdingPC1 PC2 PC3 PC1 PC2 PC1 PC2

X1 0116 minus0478 minus0087 0 05 0 minus05X2 0116 minus0478 minus0087 0 05 0 minus05X3 0116 minus0478 minus0087 0 05 0 minus05X4 0116 minus0478 minus0087 0 05 0 minus05X5 minus0395 minus0145 0270 05 0 0 0X6 minus0395 minus0145 0270 05 0 0 0X7 minus0395 minus0145 0270 05 0 minus0497 0X8 minus0395 minus0145 0270 05 0 minus0497 0X9 minus0401 0010 minus0582 0 0 minus0503 0X10 minus0401 0010 minus0582 0 0 minus0503 0

AdjustedVariance () 600 396 008 409 395 388 386

The variance of the three underlying factors is 290 300 and 2838 respectively Thenumbers of variables associated with the three factors are 4 4 and 2 Therefore V2 andV1 are almost equally important and they are much more important than V3 The first twoPCs together explain 996 of the total variance These facts suggest that we only needto consider two derived variables with ldquocorrectrdquo sparse representations Ideally the firstderived variable should recover the factor V2 only using (X5 X6 X7 X8) and the secondderived variable should recover the factor V1 only using (X1 X2 X3 X4) In fact if wesequentially maximize the variance of the first two derived variables under the orthonormalconstraint while restricting the numbers of nonzero loadings to four then the first derivedvariable uniformly assigns nonzero loadings on (X5 X6 X7 X8) and the second derivedvariable uniformly assigns nonzero loadings on (X1 X2 X3 X4)

Both SPCA (λ = 0) and simple thresholding were carried out by using the oracleinformation that the ideal sparse representations use only four variables Table 5 summarizesthe comparison results Clearly SPCA correctly identifies the sets of important variables Infact SPCA delivers the ideal sparse representations of the first two principal componentsMathematically it is easy to show that if t = 2 is used SCoTLASS is also able to find thesame sparse solution In this example both SPCA and SCoTLASS produce the ideal sparsePCs which may be explained by the fact that both methods explicitly use the lasso penalty

In contrast simple thresholding incorrectly includes X9 X10 in the most importantvariables The variance explained by simple thresholding is also lower than that by SPCAalthough the relative difference is small (less than 5) Due to the high correlation betweenV2 and V3 variablesX9 X10 achieve loadings which are even higher than those of the trueimportant variables (X5 X6 X7 X8) Thus the truth is disguised by the high correlationOn the other hand simple thresholding correctly discovers the second factor because V1

has a low correlation with V3

53 RAMASWAMY DATA

An important task in microarray analysis is to find a set of genes which are biologically

SPARSE PRINCIPAL COMPONENT ANALYSIS 281

Figure 3 The sparse leading principal component percentage of explained variance versus sparsity Simplethresholding and SPCA have similar performances However there still exists consistent difference in the selectedgenes (the ones with nonzero loadings)

relevant to the outcome (eg tumor type or survival time) PCA (or SVD) has been a populartool for this purpose Many gene-clustering methods in the literature use PCA (or SVD) asa building block For example gene shaving (Hastie et al 2000) uses an iterative principalcomponent shaving algorithm to identify subsets of coherent genes Here we consideranother approach to gene selection through SPCA The idea is intuitive if the (sparse)principal component can explain a large part of the total variance of gene expression levelsthen the subset of genes representing the principal component are considered important

We illustrate the sparse PC selection method on Ramaswamyrsquos data (Ramaswamy et al2001) which has 16063 (p = 16063) genes and 144 (n = 144) samples Its first principalcomponent explains 46 of the total variance For microarray data like this it appearsthat SCoTLASS cannot be practically useful for finding sparse PCs We applied SPCA(λ = infin) to find the leading sparse PC A sequence of values for λ1 were used such thatthe number of nonzero loadings varied over a wide range As displayed in Figure 3 thepercentage of explained variance decreases at a slow rate as the sparsity increases As fewas 25 of these 16063 genes can sufficiently construct the leading principal componentwith an affordable loss of explained variance (from 46 to 40) Simple thresholdingwas also applied to this data It seems that when using the same number of genes simplethresholding always explains slightly higher variance than SPCA does Among the samenumber of selected genes by SPCA and simple thresholding there are about 2 differentgenes and this difference rate is quite consistent

282 H ZOU T HASTIE AND R TIBSHIRANI

6 DISCUSSION

It has been an interesting research topic for years to derive principal components withsparse loadings From a practical point of view a good method to achieve the sparsenessgoal should (at least) possess the following properties

bull Without any sparsity constraint the method should reduce to PCAbull It should be computationally efficient for both small p and big p databull It should avoid misidentifying the important variables

The often-used simple thresholding approach is not criterion based However thisinformal method seems to possess the first two of the desirable properties listed above Ifthe explained variance and sparsity are the only concerns simple thresholding is a reasonableapproach and it is extremely convenient We have shown that simple thresholding can workwell with gene expression arrays The serious problem with simple thresholding is that itcan misidentify the real important variables Nevertheless simple thresholding is regardedas a benchmark for any potentially better method

Using the lasso constraint in PCA SCoTLASS successfully derives sparse loadingsHowever SCoTLASS is not computationally efficient and it lacks a good rule to pick itstuning parameter In addition it is not feasible to apply SCoTLASS to gene expressionarrays where PCA is a quite popular tool

In this work we have developed SPCA using our SPCA criterion (312) This newcriterion gives exact PCA results when its sparsity (lasso) penalty term vanishes SPCAallows flexible control on the sparse structure of the resulting loadings Unified efficientalgorithms have been proposed to compute SPCA solutions for both regular multivariatedata and gene expression arrays As a principled procedure SPCA enjoys advantages inseveral aspects including computational efficiency high explained variance and an abilityin identifying important variables

Software in R for fitting the SPCA model (and elastic net models) is available in theCRAN contributed package elasticnet

APPENDIX PROOFS

Proof of Theorem 1 Using XT X = VD2VT and VT V = I we have

βridge =(XT X + λI

)minus1XT (XVi) = Vi

D2ii

D2ii + λ

(A1)

Hence v = Vi

Note that since Theorem 2 is a special case of Theorem 3 we will not prove it separatelyWe first provide a lemma

Lemma 1 Consider the ridge regression criterion

Cλ(β) = y minus Xβ2 + λβ2

SPARSE PRINCIPAL COMPONENT ANALYSIS 283

Then if β = arg minβ Cλ(β)

Cλ(β) = yT (I minus Sλ)y

where Sλ is the ridge operator

Sλ = X(XT X + λI)minus1XT

Proof of Lemma 1 Differentiating Cλ with respect to β we get that

minusXT (y minus Xβ) + λβ = 0

Premultiplication by βT and re-arrangement gives λβ2 = (y minus Xβ)T Xβ Since

y minus Xβ2 = (y minus Xβ)T y minus (y minus Xβ)T Xβ

Cλ(β) = (y minus Xβ)T y The result follows since the ldquofitted valuesrdquo Xβ = Sλy

Proof of Theorem 3 We use the notation introduced in Section 3 A = [α1 αk]and B = [β1 βk] Let

Cλ(AB) =nsum

i=1

xi minus ABT xi2 + λ

ksumj=1

βj2

As in (39) we have

nsumi=1

xi minus ABT xi2 = X minus XBAT 2 (A2)

= XAperp2 + XA minus XB2 (A3)

Hence with A fixed solving

arg minB

Cλ(AB)

is equivalent to solving the series of ridge regressions

arg minβjk

1

ksumj=1

Xαj minus Xβj2 + λβj2

It is easy to show that

B = (XT X + λI)minus1XT XA (A4)

and using Lemma 1 and (A2) we have that the partially optimized penalized criterion isgiven by

Cλ(A B) = XAperp2 + tr((XA)T (I minus Sλ)(XA)

) (A5)

284 H ZOU T HASTIE AND R TIBSHIRANI

Rearranging the terms we get

Cλ(A B) = tr(XT X) minus tr(AT XT SλXA) (A6)

which must be minimized with respect to A with AT A = I Hence A should be taken tobe the largest k eigenvectors of XT SλX If the SVD of X = UDVT it is easy to show thatXT SλX = VD2(D2 + λI)minus1D2VT hence A = V[ 1 k] Likewise plugging the SVD ofX into (A4) we see that each of the βj are scaled elements of the corresponding Vj

Proof of Theorem 4 We expand the matrix norm

M minus NAT 2 = tr(MT M) minus 2tr(MT NAT ) + tr(ANT NAT ) (A7)

Since AT A = I the last term is equal to tr(NT N) and hence we need to maximize (minushalf) the middle term With the SVD MT N = UDVT this middle term becomes

tr(MT NAT ) = tr(UDVT AT ) (A8)

= tr(UDAlowastT ) (A9)

= tr(AlowastT UD) (A10)

where Alowast = AV and since V is ktimes k orthonormal AlowastT Alowast = I Now since D is diagonal(A10) is maximized when the diagonal of AlowastT U is positive and maximum By Cauchy-Schwartz inequality this is achieved when Alowast = U in which case the diagonal elementsare all 1 Hence A = UVT

Proof of Theorem 5: Let $\hat{B}^* = [\hat{\beta}^*_1, \ldots, \hat{\beta}^*_k]$ with $\hat{\beta}^*_j = (1 + \lambda)\hat{\beta}_j$; then $\hat{V}_i(\lambda) = \frac{\hat{\beta}^*_i}{\|\hat{\beta}^*_i\|}$. On the other hand, $\hat{\beta}_j = \frac{\hat{\beta}^*_j}{1 + \lambda}$ means that

$$(\hat{A}, \hat{B}^*) = \arg\min_{A, B}\ C_{\lambda, \lambda_1}(A, B) \quad \text{subject to } A^TA = I_{k \times k}, \qquad (A.11)$$

where

$$C_{\lambda, \lambda_1}(A, B) = \sum_{i=1}^n \left\|x_i - A\frac{B^T}{1 + \lambda}x_i\right\|^2 + \lambda\sum_{j=1}^k \left\|\frac{\beta_j}{1 + \lambda}\right\|^2 + \sum_{j=1}^k \lambda_{1,j}\left\|\frac{\beta_j}{1 + \lambda}\right\|_1. \qquad (A.12)$$

Since

$$\sum_{j=1}^k \left\|\frac{\beta_j}{1 + \lambda}\right\|^2 = \frac{1}{(1 + \lambda)^2}\,\mathrm{tr}(B^TB)$$

and

$$\sum_{i=1}^n \left\|x_i - A\frac{B^T}{1 + \lambda}x_i\right\|^2 = \mathrm{tr}\left(\Big(X - X\frac{B}{1 + \lambda}A^T\Big)^T\Big(X - X\frac{B}{1 + \lambda}A^T\Big)\right) = \mathrm{tr}(X^TX) + \frac{1}{(1 + \lambda)^2}\,\mathrm{tr}(B^TX^TXB) - \frac{2}{1 + \lambda}\,\mathrm{tr}(A^TX^TXB).$$

Thus we have

$$C_{\lambda, \lambda_1}(A, B) = \mathrm{tr}(X^TX) + \frac{1}{1 + \lambda}\left[\mathrm{tr}\Big(B^T\frac{X^TX + \lambda I}{1 + \lambda}B\Big) - 2\,\mathrm{tr}(A^TX^TXB) + \sum_{j=1}^k \lambda_{1,j}\|\beta_j\|_1\right],$$

which implies that

$$(\hat{A}, \hat{B}^*) = \arg\min_{A, B}\ \mathrm{tr}\Big(B^T\frac{X^TX + \lambda I}{1 + \lambda}B\Big) - 2\,\mathrm{tr}(A^TX^TXB) + \sum_{j=1}^k \lambda_{1,j}\|\beta_j\|_1 \qquad (A.13)$$

subject to $A^TA = I_{k \times k}$. As $\lambda \to \infty$, $\mathrm{tr}\big(B^T\frac{X^TX + \lambda I}{1 + \lambda}B\big) \to \mathrm{tr}(B^TB) = \sum_{j=1}^k \|\beta_j\|^2$. Thus (A.13) approaches (4.1), and the conclusion of Theorem 5 follows.

ACKNOWLEDGMENTS

We thank the editor, an associate editor, and referees for helpful comments and suggestions which greatly improved the manuscript.

[Received April 2004. Revised June 2005.]

REFERENCES

Alter, O., Brown, P., and Botstein, D. (2000), "Singular Value Decomposition for Genome-Wide Expression Data Processing and Modeling," Proceedings of the National Academy of Sciences, 97, 10101–10106.
Cadima, J., and Jolliffe, I. (1995), "Loadings and Correlations in the Interpretation of Principal Components," Journal of Applied Statistics, 22, 203–214.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004), "Least Angle Regression," The Annals of Statistics, 32, 407–499.
Hancock, P., Burton, A., and Bruce, V. (1996), "Face Processing: Human Perception and Principal Components Analysis," Memory and Cognition, 24, 26–40.
Hastie, T., Tibshirani, R., and Friedman, J. (2001), The Elements of Statistical Learning: Data Mining, Inference and Prediction, New York: Springer Verlag.
Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Ross, D., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000), "'Gene Shaving' as a Method for Identifying Distinct Sets of Genes With Similar Expression Patterns," Genome Biology, 1, 1–21.
Jeffers, J. (1967), "Two Case Studies in the Application of Principal Component," Applied Statistics, 16, 225–236.
Jolliffe, I. (1986), Principal Component Analysis, New York: Springer Verlag.
Jolliffe, I. (1995), "Rotation of Principal Components: Choice of Normalization Constraints," Journal of Applied Statistics, 22, 29–35.
Jolliffe, I. T., Trendafilov, N. T., and Uddin, M. (2003), "A Modified Principal Component Technique Based on the Lasso," Journal of Computational and Graphical Statistics, 12, 531–547.
Mardia, K., Kent, J., and Bibby, J. (1979), Multivariate Analysis, New York: Academic Press.
McCabe, G. (1984), "Principal Variables," Technometrics, 26, 137–144.
Osborne, M. R., Presnell, B., and Turlach, B. A. (2000), "A New Approach to Variable Selection in Least Squares Problems," IMA Journal of Numerical Analysis, 20, 389–403.
Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J., Poggio, T., Gerald, W., Loda, M., Lander, E., and Golub, T. (2001), "Multiclass Cancer Diagnosis Using Tumor Gene Expression Signature," Proceedings of the National Academy of Sciences, 98, 15149–15154.
Tibshirani, R. (1996), "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Series B, 58, 267–288.
Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G. (2002), "Diagnosis of Multiple Cancer Types by Shrunken Centroids of Gene Expression," Proceedings of the National Academy of Sciences, 99, 6567–6572.
Vines, S. (2000), "Simple Principal Components," Applied Statistics, 49, 441–451.
Zou, H., and Hastie, T. (2005), "Regularization and Variable Selection via the Elastic Net," Journal of the Royal Statistical Society, Series B, 67, 301–320.


6 DISCUSSION

It has been an interesting research topic for years to derive principal components withsparse loadings From a practical point of view a good method to achieve the sparsenessgoal should (at least) possess the following properties

bull Without any sparsity constraint the method should reduce to PCAbull It should be computationally efficient for both small p and big p databull It should avoid misidentifying the important variables

The often-used simple thresholding approach is not criterion based However thisinformal method seems to possess the first two of the desirable properties listed above Ifthe explained variance and sparsity are the only concerns simple thresholding is a reasonableapproach and it is extremely convenient We have shown that simple thresholding can workwell with gene expression arrays The serious problem with simple thresholding is that itcan misidentify the real important variables Nevertheless simple thresholding is regardedas a benchmark for any potentially better method

Using the lasso constraint in PCA SCoTLASS successfully derives sparse loadingsHowever SCoTLASS is not computationally efficient and it lacks a good rule to pick itstuning parameter In addition it is not feasible to apply SCoTLASS to gene expressionarrays where PCA is a quite popular tool

In this work we have developed SPCA using our SPCA criterion (312) This newcriterion gives exact PCA results when its sparsity (lasso) penalty term vanishes SPCAallows flexible control on the sparse structure of the resulting loadings Unified efficientalgorithms have been proposed to compute SPCA solutions for both regular multivariatedata and gene expression arrays As a principled procedure SPCA enjoys advantages inseveral aspects including computational efficiency high explained variance and an abilityin identifying important variables

Software in R for fitting the SPCA model (and elastic net models) is available in theCRAN contributed package elasticnet

APPENDIX PROOFS

Proof of Theorem 1 Using XT X = VD2VT and VT V = I we have

βridge =(XT X + λI

)minus1XT (XVi) = Vi

D2ii

D2ii + λ

(A1)

Hence v = Vi

Note that since Theorem 2 is a special case of Theorem 3 we will not prove it separatelyWe first provide a lemma

Lemma 1 Consider the ridge regression criterion

Cλ(β) = y minus Xβ2 + λβ2

SPARSE PRINCIPAL COMPONENT ANALYSIS 283

Then if β = arg minβ Cλ(β)

Cλ(β) = yT (I minus Sλ)y

where Sλ is the ridge operator

Sλ = X(XT X + λI)minus1XT

Proof of Lemma 1 Differentiating Cλ with respect to β we get that

minusXT (y minus Xβ) + λβ = 0

Premultiplication by βT and re-arrangement gives λβ2 = (y minus Xβ)T Xβ Since

y minus Xβ2 = (y minus Xβ)T y minus (y minus Xβ)T Xβ

Cλ(β) = (y minus Xβ)T y The result follows since the ldquofitted valuesrdquo Xβ = Sλy

Proof of Theorem 3 We use the notation introduced in Section 3 A = [α1 αk]and B = [β1 βk] Let

Cλ(AB) =nsum

i=1

xi minus ABT xi2 + λ

ksumj=1

βj2

As in (39) we have

nsumi=1

xi minus ABT xi2 = X minus XBAT 2 (A2)

= XAperp2 + XA minus XB2 (A3)

Hence with A fixed solving

arg minB

Cλ(AB)

is equivalent to solving the series of ridge regressions

arg minβjk

1

ksumj=1

Xαj minus Xβj2 + λβj2

It is easy to show that

B = (XT X + λI)minus1XT XA (A4)

and using Lemma 1 and (A2) we have that the partially optimized penalized criterion isgiven by

Cλ(A B) = XAperp2 + tr((XA)T (I minus Sλ)(XA)

) (A5)

284 H ZOU T HASTIE AND R TIBSHIRANI

Rearranging the terms we get

Cλ(A B) = tr(XT X) minus tr(AT XT SλXA) (A6)

which must be minimized with respect to A with AT A = I Hence A should be taken tobe the largest k eigenvectors of XT SλX If the SVD of X = UDVT it is easy to show thatXT SλX = VD2(D2 + λI)minus1D2VT hence A = V[ 1 k] Likewise plugging the SVD ofX into (A4) we see that each of the βj are scaled elements of the corresponding Vj

Proof of Theorem 4 We expand the matrix norm

M minus NAT 2 = tr(MT M) minus 2tr(MT NAT ) + tr(ANT NAT ) (A7)

Since AT A = I the last term is equal to tr(NT N) and hence we need to maximize (minushalf) the middle term With the SVD MT N = UDVT this middle term becomes

tr(MT NAT ) = tr(UDVT AT ) (A8)

= tr(UDAlowastT ) (A9)

= tr(AlowastT UD) (A10)

where Alowast = AV and since V is ktimes k orthonormal AlowastT Alowast = I Now since D is diagonal(A10) is maximized when the diagonal of AlowastT U is positive and maximum By Cauchy-Schwartz inequality this is achieved when Alowast = U in which case the diagonal elementsare all 1 Hence A = UVT

Proof of Theorem 5 Let Blowast

= [βlowast1 β

lowastk ] with βlowast = (1+λ)β then Vi(λ) = βlowast

i

||βlowasti||

On the other hand β = βlowast

1+λ means that

(A Blowast) = arg min

A BCλλ1(AB) subject to AT A = Iktimesk (A11)

where

Cλλ1(AB) =nsum

i=1

xi minus ABT

1 + λxi2 + λ

ksumj=1

βj

1 + λ2 +

ksumj=1

λ1j βj

1 + λ1 (A12)

Since

ksumj=1

βj

1 + λ2 =

1(1 + λ)2

tr(BT B)

and

nsumi=1

xi minus ABT

1 + λxi2 = tr

((X minus X

B1 + λ

AT )T (X minus XB

1 + λAT )

)= tr

(XT X

)+

1(1 + λ)2

tr(BT XT XB

)minus 21 + λ

tr(

AT XT XB)

SPARSE PRINCIPAL COMPONENT ANALYSIS 285

Thus we have

Cλλ1(AB) = tr(XT X

)+

11 + λ

tr

(BT XT X + λI

1 + λB

)minus 2Tr

(AT XT XB

)+

ksumj=1

λ1jβj1

which implies that

(A Blowast) = arg min

A Btr

(BT XT X + λI

1 + λB

)minus 2tr

(AT XT XB

)+

ksumj=1

λ1jβj1 (A13)

subject to AT A = Iktimesk As λ rarr infin tr(

BT XT X+λI1+λ B

)rarr tr(BT B) =

sumkj=1 βj2

Thus (A13) approaches (41) and the conclusion of Theorem 5 follows

ACKNOWLEDGMENTSWe thank the editor an associate editor and referees for helpful comments and suggestions which greatly

improved the manuscript

[Received April 2004 Revised June 2005]

REFERENCES

Alter O Brown P and Botstein D (2000) ldquoSingular Value Decomposition for Genome-Wide Expression DataProcessing and Modelingrdquo in Proceedings of the National Academy of Sciences 97 pp 10101ndash10106

Cadima J and Jolliffe I (1995) ldquoLoadings and Correlations in the Interpretation of Principal ComponentsrdquoJournal of Applied Statistics 22 203ndash214

Efron B Hastie T Johnstone I and Tibshirani R (2004) ldquoLeast Angle Regressionrdquo The Annals of Statistics32 407ndash499

Hancock P Burton A and Bruce V (1996) ldquoFace Processing Human Perception and Principal ComponentsAnalysisrdquo Memory and Cognition 24 26ndash40

Hastie T Tibshirani R and Friedman J (2001) The Elements of Statistical Learning Data mining Inferenceand Prediction New York Springer Verlag

Hastie T Tibshirani R Eisen M Brown P Ross D Scherf U Weinstein J Alizadeh A Staudt Land Botstein D (2000) ldquo lsquogene Shavingrsquo as a Method for Identifying Distinct Sets of Genes With SimilarExpression Patternsrdquo Genome Biology 1 1ndash21

Jeffers J (1967) ldquoTwo Case Studies in the Application of Principal Componentrdquo Applied Statistics 16 225ndash236

Jolliffe I (1986) Principal Component Analysis New York Springer Verlag

(1995) ldquoRotation of Principal Components Choice of Normalization Constraintsrdquo Journal of AppliedStatistics 22 29ndash35

Jolliffe I T Trendafilov N T and Uddin M (2003) ldquoA Modified Principal Component Technique Based onthe Lassordquo Journal of Computational and Graphical Statistics 12 531ndash547

Mardia K Kent J and Bibby J (1979) Multivariate Analysis New York Academic Press

McCabe G (1984) ldquoPrincipal Variablesrdquo Technometrics 26 137ndash144

286 H ZOU T HASTIE AND R TIBSHIRANI

Osborne M R Presnell B and Turlach B A (2000) ldquoA New Approach to Variable Selection in Least SquaresProblemsrdquo IMA Journal of Numerical Analysis 20 389ndash403

Ramaswamy S Tamayo P Rifkin R Mukheriee S Yeang C Angelo M Ladd C Reich M LatulippeE Mesirov J Poggio T Gerald W Loda M Lander E and Golub T (2001) ldquoMulticlass CancerDiagnosis using Tumor Gene Expression Signaturerdquo in Proceedings of the National Academy of Sciences98 pp 15149ndash15154

Tibshirani R (1996) ldquoRegression Shrinkage and Selection via the Lassordquo Journal of the Royal Statistical SocietySeries B 58 267ndash288

Tibshirani R Hastie T Narasimhan B and Chu G (2002) ldquoDiagnosis of Multiple Cancer Types by ShrunkenCentroids of Generdquo in Proceedings of the National Academy of Sciences 99 6567ndash6572

Vines S (2000) ldquoSimple Principal Componentsrdquo Applied Statistics 49 441ndash451

Zou H and Hastie T (2005) ldquoRegularization and Variable Selection via the Elastic Netrdquo Journal of the RoyalStatistical Society Series B 67 301ndash320

Page 8: Sparse Principal Component Analysis - Stanford …web.stanford.edu/~hastie/Papers/spc_jcgs.pdf · Sparse Principal Component Analysis HuiZ OU,Trevor H ASTIE and RobertT IBSHIRANI

272 H ZOU T HASTIE AND R TIBSHIRANI

Although (316) (with Σ instead of XT X) is not quite an elastic net problem we caneasily turn it into one Create the artificial response Y lowastlowast and Xlowastlowast as follows

Y lowastlowast = Σ12αj Xlowastlowast = Σ

12 (317)

then it is easy to check that

βj = arg minβ

Y lowastlowast minus Xlowastlowastβ2 + λβ2 + λ1jβ1 (318)

Algorithm 1 summarizes the steps of our SPCA procedure outlined above

Algorithm 1 General SPCA Algorithm1 Let A start at V[ 1 k] the loadings of the first k ordinary principal components2 Given a fixed A = [α1 αk] solve the following elastic net problem for j =

1 2 k

βj = arg minβ

(αj minus β)T XT X(αj minus β) + λβ2 + λ1jβ1

3 For a fixed B = [β1 middot middot middot βk] compute the SVD of XT XB = UDVT then updateA = UVT

4 Repeat Steps 2ndash3 until convergence5 Normalization Vj = βj

βj j = 1 kSome remarks1 Empirical evidence suggests that the output of the above algorithm does not change

much as λ is varied For n gt p data the default choice of λ can be zero Practically λis chosen to be a small positive number to overcome potential collinearity problemsin X Section 4 discusses the default choice of λ for data with thousands of variablessuch as gene expression arrays

2 In principle we can try several combinations of λ1j to figure out a good choiceof the tuning parameters since the above algorithm converges quite fast There is ashortcut provided by the direct sparse approximation (35) The LARS-EN algorithmefficiently delivers a whole sequence of sparse approximations for each PC and thecorresponding values ofλ1j Hence we can pick aλ1j that gives a good compromisebetween variance and sparsity When facing the variance-sparsity trade-off we letvariance have a higher priority

34 ADJUSTED TOTAL VARIANCE

The ordinary principal components are uncorrelated and their loadings are orthogonalLet ΣΣΣ = XT X then VT V = Ik and VT ΣΣΣV is diagonal It is easy to check that it isonly for ordinary principal components the the loadings can satisfy both conditions InJolliffe Trendafilov and Uddin (2003) the loadings were forced to be orthogonal so theuncorrelated property was sacrificed SPCA does not explicitly impose the uncorrelatedcomponents condition either

SPARSE PRINCIPAL COMPONENT ANALYSIS 273

Let Z be the modified PCs Usually the total variance explained by Z is calculated

by tr(ZT

Z) This is reasonable when Z are uncorrelated However if they are correlated

tr(ZT

Z) is too optimistic for representing the total variance Suppose (Zi i = 1 2 k)are the first kmodified PCs by any method and the (k+1)th modified PC Zk+1 is obtainedWe want to compute the total variance explained by the first k + 1 modified PCs whichshould be the sum of the explained variance by the first k modified PCs and the additionalvariance from Zk+1 If Zk+1 is correlated with (Zi i = 1 2 k) then its variancecontains contributions from (Zi i = 1 2 k) which should not be included into thetotal variance given the presence of (Zi i = 1 2 k)

Here we propose a new formula to compute the total variance explained by Z whichtakes into account the correlations among Z We use regression projection to remove thelinear dependence between correlated components Denote Zjmiddot1jminus1 the residual afteradjusting Zj for Z1 Zjminus1 that is

Zjmiddot1jminus1 = Zj minus H1jminus1Zj (319)

where H1jminus1 is the projection matrix on Zijminus11 Then the adjusted variance of Zj is

Zjmiddot1jminus12 and the total explained variance is defined assumk

j=1 Zjmiddot1jminus12 When

the modified PCs Z are uncorrelated the new formula agrees with tr(ZT

Z)Note that the above computations depend on the order of Zi However since we have

a natural order in PCA ordering is not an issue here Using the QR decomposition we caneasily compute the adjusted variance Suppose Z = QR where Q is orthonormal and R isupper triangular Then it is straightforward to see that

Zjmiddot1jminus12 = R2jj (320)

Hence the explained total variance is equal tosumk

j=1 R2jj

35 COMPUTATION COMPLEXITY

PCA is computationally efficient for both n gt p or p n data We separately discussthe computational cost of the general SPCA algorithm for n gt p and p n

1 n gt p Traditional multivariate data fit in this category Note that although the SPCAcriterion is defined using X it only depends on X via XT X A trick is to first computethe p times p matrix ΣΣΣ = XT X once for all which requires np2 operations Then thesame ΣΣΣ is used at each step within the loop Computing XT Xβ costs p2k and theSVD of XT Xβ is of order O(pk2) Each elastic net solution requires at most O(p3)operations Since k le p the total computation cost is at most np2 +mO(p3) wherem is the number of iterations before convergence Therefore the SPCA algorithm isable to efficiently handle data with huge n as long as p is small (say p lt 100)

2 p n Gene expression arrays are typical examples in this p n category Thetrick of using ΣΣΣ is no longer applicable because ΣΣΣ is a huge matrix (ptimes p) in thiscase The most consuming step is solving each elastic net whose cost is of order

274 H ZOU T HASTIE AND R TIBSHIRANI

O(pnJ + J3) for a positive finite λ where J is the number of nonzero coefficientsGenerally speaking the total cost is of ordermkO(pJn+J3) which can be expensivefor large J and p Fortunately as shown in the next section there exists a specialSPCA algorithm for efficiently dealing with p n data

4 SPCA FOR p n AND GENE EXPRESSION ARRAYS

For gene expression arrays the number of variables (genes) is typically much biggerthan the number of samples (eg n = 10 000 p = 100) Our general SPCA algorithmstill fits this situation using a positive λ However the computational cost is expensive whenrequiring a large number of nonzero loadings It is desirable to simplify the general SPCAalgorithm to boost the computation

Observe that Theorem 3 is valid for all λ gt 0 so in principle we can use any positiveλ It turns out that a thrifty solution emerges if λ rarr infin Precisely we have the followingtheorem

Theorem 5 Let Vj(λ) = βj

βj (j = 1 k) be the loadings derived from criterion

(312) Let (A B) be the solution of the optimization problem

(A B) = arg minA B

minus2tr(

AT XT XB)

+ksum

j=1

βj2 +ksum

j=1

λ1jβj1 (41)

subject to AT A = Iktimesk

When λ rarr infin Vj(λ) rarr βj

βj We can use the same alternating algorithm in Section 33 to solve (41) where we only

need to replace the general elastic net problem with its special case (λ = infin) Note thatgiven A

βj = arg minβj

minus2αTj (XT X)βj + βj2 + λ1jβj1 (42)

which has an explicit form solution given in (43)

Gene Expression Arrays SPCA Algorithm Replacing Step 2 in the general SPCAalgorithm withStep 2lowast for j = 1 2 k

βj =(∣∣αT

j XT X∣∣minus λ1j

2

)+

Sign(αTj XT X) (43)

The operation in (43) is called soft-thresholding Figure 1 gives an illustration of how thesoft-thresholding rule operates Recently soft-thresholding has become increasingly popularin the literature For example nearest shrunken centroids (Tibshirani Hastie Narasimhanand Chu 2002) adopts the soft-thresholding rule to simultaneously classify samples andselect important genes in microarrays

SPARSE PRINCIPAL COMPONENT ANALYSIS 275

Figure 1 An illustration of soft-thresholding rule y = (|x| minus ∆)+Sign(x) with ∆ = 1

5 EXAMPLES

51 PITPROPS DATA

The pitprops data first introduced by Jeffers (1967) has 180 observations and 13 mea-sured variables It is a classic example showing the difficulty of interpreting principal com-ponents Jeffers (1967) tried to interpret the first six PCs Jolliffe Trendafilov and Uddin(2003) used their SCoTLASS to find the modified PCs Table 1 presents the results ofPCA while Table 2 presents the modified PC loadings as computed by SCoTLASS and theadjusted variance computed using (320)

As a demonstration we also considered the first six principal components Since thisis a usual n p dataset we set λ = 0 λ1 = (006 016 01 05 05 05) were chosenaccording to Figure 2 such that each sparse approximation explained almost the sameamount of variance as the ordinary PC did Table 3 shows the obtained sparse loadings andthe corresponding adjusted variance Compared with the modified PCs of SCoTLASS PCsby SPCA account for a larger amount of variance (758 vs 693) with a much sparserloading structure The important variables associated with the six PCs do not overlap whichfurther makes the interpretations easier and clearer It is interesting to note that in Table 3even though the variance does not strictly monotonously decrease the adjusted variancefollows the right order However Table 2 shows that this is not true in SCoTLASS Itis also worthy of mention that the entire SPCA computation was done in seconds in Rwhile the implementation of SCoTLASS for each t was expensive (Jolliffe Trendafilovand Uddin 2003) Optimizing SCoTLASS over several values of t is an even more difficultcomputational challenge

Although the informal thresholding method which we henceforth refer to as simple

276 H ZOU T HASTIE AND R TIBSHIRANI

Table 1 Pitprops Data Loadings of the First Six Principal Components

Variable PC1 PC2 PC3 PC4 PC5 PC6

topdiam minus0404 0218 minus0207 0091 minus0083 0120length minus0406 0186 minus0235 0103 minus0113 0163moist minus0124 0541 0141 minus0078 0350 minus0276testsg minus0173 0456 0352 minus0055 0356 minus0054ovensg minus0057 minus0170 0481 minus0049 0176 0626ringtop minus0284 minus0014 0475 0063 minus0316 0052ringbut minus0400 minus0190 0253 0065 minus0215 0003bowmax minus0294 minus0189 minus0243 minus0286 0185 minus0055bowdist minus0357 0017 minus0208 minus0097 minus0106 0034whorls minus0379 minus0248 minus0119 0205 0156 minus0173clear 0011 0205 minus0070 minus0804 minus0343 0175knots 0115 0343 0092 0301 minus0600 minus0170diaknot 0113 0309 minus0326 0303 0080 0626

Variance () 324 183 144 85 70 63Cumulative variance () 324 507 651 736 806 869

Table 2 Pitprops Data Loadings of the First Six Modified PCs by SCoTLASS Empty cells have zeroloadings

t = 175Variable PC1 PC2 PC3 PC4 PC5 PC6

topdiam 0664 minus0025 0002 minus0035length 0683 minus0001 minus0040 0001 minus0018moist 0641 0195 0180 minus0030testsg 0701 0001 minus0001ovensg minus0887 minus0056ringtop 0293 minus0186 minus0373 0044ringbut 0001 0107 minus0658 minus0051 0064bowmax 0001 0735 0021 minus0168bowdist 0283 minus0001whorls 0113 minus0001 0388 minus0017 0320clear minus0923knots 0001 minus0554 0016 0004diaknot 0703 0001 minus0197 0080

Number of nonzero loadings 6 6 6 6 10 13Variance () 196 160 131 131 92 90Adjusted variance () 196 138 124 80 71 84Cumulative adjusted variance () 196 334 458 538 609 693

SPARSE PRINCIPAL COMPONENT ANALYSIS 277

Figure 2 Pitprops data The sequences of sparse approximations to the first six principal components The curvesshow the percentage of explained variance (PEV) as a function of λ1 The vertical broken lines indicate the choiceof λ1 used in our SPCA analysis

278 H ZOU T HASTIE AND R TIBSHIRANI

Table 3 Pitprops Data Loadings of the First Six Sparse PCs by SPCA Empty cells have zero loadings

Variable PC1 PC2 PC3 PC4 PC5 PC6

topdiam minus0477length minus0476moist 0785testsg 0620ovensg 0177 0640ringtop 0589ringbut minus0250 0492bowmax minus0344 minus0021bowdist minus0416whorls minus0400clear minus1knots 0013 minus1diaknot minus0015 1

Number of nonzero loadings 7 4 4 1 1 1Variance () 280 144 150 77 77 77Adjusted variance () 280 140 133 74 68 62Cumulative adjusted variance () 280 420 553 627 695 758

thresholding has various drawbacks it may serve as the benchmark for testing sparse PCsmethods A variant of simple thresholding is soft-thresholding We found that when used inPCA soft-thresholding performs very similarly to simple thresholding Thus we omitted theresults of soft-thresholding in this article Both SCoTLASS and SPCA were compared withsimple thresholding Table 4 presents the loadings and the corresponding variance explainedby simple thresholding To make the comparisons fair we let the numbers of nonzeroloadings obtained by simple thresholding match the results of SCoTLASS and SPCA asshown in the top and bottom parts of Table 4 respectively In terms of variance it seemsthat simple thresholding is better than SCoTLASS and worse than SPCA Moreover thevariables with nonzero loadings by SPCA are different to that chosen by simple thresholdingfor the first three PCs while SCoTLASS seems to create a similar sparseness pattern assimple thresholding does especially in the leading PC

52 A SYNTHETIC EXAMPLE

Our synthetic example has three hidden factors

V1 sim N(0 290) V2 sim N(0 300)

V3 = minus03V1 + 0925V2 + ε ε sim N(0 1)

and

V1 V2 and ε are independent

Then 10 observable variables are constructed as follows

Xi = V1 + ε1i ε1

i sim N(0 1) i = 1 2 3 4

SPARSE PRINCIPAL COMPONENT ANALYSIS 279

Xi = V2 + ε2i ε2

i sim N(0 1) i = 5 6 7 8

Xi = V3 + ε3i ε3

i sim N(0 1) i = 9 10

εji are independent j = 1 2 3 i = 1 10

We used the exact covariance matrix of (X1 X10) to perform PCA SPCA and simplethresholding (in the population setting)

Table 4 Pitprops Data Loadings of the First Six Modified PCs by Simple Thresholding Empty cellshave zero loadings

Variable PC1 PC2 PC3 PC4 PC5 PC6

Simple thresholding vs SCoTLASS

topdiam minus0439 0240 0120length minus0441 0105 minus0114 0163moist 0596 0354 minus0276testsg 0503 0391 0360 minus0054ovensg 0534 0178 0626ringtop 0528 minus0320 0052ringbut minus0435 0281 minus0218 0003bowmax minus0319 minus0270 minus0291 0188 minus0055bowdist minus0388 0034whorls minus0412 minus0274 0209 0158 minus0173clear minus0819 minus0347 0175knots 0378 0307 minus0608 minus0170diaknot 0340 minus0362 0309 0626

Number of nonzero loadings 6 6 6 6 10 13Variance () 289 161 154 84 71 63Adjusted Variance () 289 161 139 82 69 62Cumulative adjusted variance () 289 450 589 671 740 802

Simple thresholding vs SPCA

topdiam minus0420length minus0422moist 0640testsg 0540 0425ovensg 0580ringtop minus0296 0573ringbut minus0416bowmax minus0305bowdist minus0370whorls minus0394clear minus1knots 0406 minus1diaknot 0365 minus0393 1

Number of nonzero loadings 7 4 4 1 1 1Variance () 307 148 136 77 77 77Adjusted variance () 307 147 111 76 52 36Cumulative adjusted variance () 307 454 565 641 683 719

280 H ZOU T HASTIE AND R TIBSHIRANI

Table 5 Results of the Simulation Example Loadings and Variance

PCA SPCA (λ = 0) Simple ThresholdingPC1 PC2 PC3 PC1 PC2 PC1 PC2

X1 0116 minus0478 minus0087 0 05 0 minus05X2 0116 minus0478 minus0087 0 05 0 minus05X3 0116 minus0478 minus0087 0 05 0 minus05X4 0116 minus0478 minus0087 0 05 0 minus05X5 minus0395 minus0145 0270 05 0 0 0X6 minus0395 minus0145 0270 05 0 0 0X7 minus0395 minus0145 0270 05 0 minus0497 0X8 minus0395 minus0145 0270 05 0 minus0497 0X9 minus0401 0010 minus0582 0 0 minus0503 0X10 minus0401 0010 minus0582 0 0 minus0503 0

AdjustedVariance () 600 396 008 409 395 388 386

The variance of the three underlying factors is 290 300 and 2838 respectively Thenumbers of variables associated with the three factors are 4 4 and 2 Therefore V2 andV1 are almost equally important and they are much more important than V3 The first twoPCs together explain 996 of the total variance These facts suggest that we only needto consider two derived variables with ldquocorrectrdquo sparse representations Ideally the firstderived variable should recover the factor V2 only using (X5 X6 X7 X8) and the secondderived variable should recover the factor V1 only using (X1 X2 X3 X4) In fact if wesequentially maximize the variance of the first two derived variables under the orthonormalconstraint while restricting the numbers of nonzero loadings to four then the first derivedvariable uniformly assigns nonzero loadings on (X5 X6 X7 X8) and the second derivedvariable uniformly assigns nonzero loadings on (X1 X2 X3 X4)

Both SPCA (λ = 0) and simple thresholding were carried out by using the oracleinformation that the ideal sparse representations use only four variables Table 5 summarizesthe comparison results Clearly SPCA correctly identifies the sets of important variables Infact SPCA delivers the ideal sparse representations of the first two principal componentsMathematically it is easy to show that if t = 2 is used SCoTLASS is also able to find thesame sparse solution In this example both SPCA and SCoTLASS produce the ideal sparsePCs which may be explained by the fact that both methods explicitly use the lasso penalty

In contrast simple thresholding incorrectly includes X9 X10 in the most importantvariables The variance explained by simple thresholding is also lower than that by SPCAalthough the relative difference is small (less than 5) Due to the high correlation betweenV2 and V3 variablesX9 X10 achieve loadings which are even higher than those of the trueimportant variables (X5 X6 X7 X8) Thus the truth is disguised by the high correlationOn the other hand simple thresholding correctly discovers the second factor because V1

has a low correlation with V3

53 RAMASWAMY DATA

An important task in microarray analysis is to find a set of genes which are biologically

SPARSE PRINCIPAL COMPONENT ANALYSIS 281

Figure 3 The sparse leading principal component percentage of explained variance versus sparsity Simplethresholding and SPCA have similar performances However there still exists consistent difference in the selectedgenes (the ones with nonzero loadings)

relevant to the outcome (eg tumor type or survival time) PCA (or SVD) has been a populartool for this purpose Many gene-clustering methods in the literature use PCA (or SVD) asa building block For example gene shaving (Hastie et al 2000) uses an iterative principalcomponent shaving algorithm to identify subsets of coherent genes Here we consideranother approach to gene selection through SPCA The idea is intuitive if the (sparse)principal component can explain a large part of the total variance of gene expression levelsthen the subset of genes representing the principal component are considered important

We illustrate the sparse PC selection method on Ramaswamyrsquos data (Ramaswamy et al2001) which has 16063 (p = 16063) genes and 144 (n = 144) samples Its first principalcomponent explains 46 of the total variance For microarray data like this it appearsthat SCoTLASS cannot be practically useful for finding sparse PCs We applied SPCA(λ = infin) to find the leading sparse PC A sequence of values for λ1 were used such thatthe number of nonzero loadings varied over a wide range As displayed in Figure 3 thepercentage of explained variance decreases at a slow rate as the sparsity increases As fewas 25 of these 16063 genes can sufficiently construct the leading principal componentwith an affordable loss of explained variance (from 46 to 40) Simple thresholdingwas also applied to this data It seems that when using the same number of genes simplethresholding always explains slightly higher variance than SPCA does Among the samenumber of selected genes by SPCA and simple thresholding there are about 2 differentgenes and this difference rate is quite consistent

282 H ZOU T HASTIE AND R TIBSHIRANI

6 DISCUSSION

It has been an interesting research topic for years to derive principal components withsparse loadings From a practical point of view a good method to achieve the sparsenessgoal should (at least) possess the following properties

bull Without any sparsity constraint the method should reduce to PCAbull It should be computationally efficient for both small p and big p databull It should avoid misidentifying the important variables

The often-used simple thresholding approach is not criterion based However thisinformal method seems to possess the first two of the desirable properties listed above Ifthe explained variance and sparsity are the only concerns simple thresholding is a reasonableapproach and it is extremely convenient We have shown that simple thresholding can workwell with gene expression arrays The serious problem with simple thresholding is that itcan misidentify the real important variables Nevertheless simple thresholding is regardedas a benchmark for any potentially better method

Using the lasso constraint in PCA SCoTLASS successfully derives sparse loadingsHowever SCoTLASS is not computationally efficient and it lacks a good rule to pick itstuning parameter In addition it is not feasible to apply SCoTLASS to gene expressionarrays where PCA is a quite popular tool

In this work we have developed SPCA using our SPCA criterion (312) This newcriterion gives exact PCA results when its sparsity (lasso) penalty term vanishes SPCAallows flexible control on the sparse structure of the resulting loadings Unified efficientalgorithms have been proposed to compute SPCA solutions for both regular multivariatedata and gene expression arrays As a principled procedure SPCA enjoys advantages inseveral aspects including computational efficiency high explained variance and an abilityin identifying important variables

Software in R for fitting the SPCA model (and elastic net models) is available in theCRAN contributed package elasticnet

APPENDIX PROOFS

Proof of Theorem 1 Using XT X = VD2VT and VT V = I we have

βridge =(XT X + λI

)minus1XT (XVi) = Vi

D2ii

D2ii + λ

(A1)

Hence v = Vi

Note that since Theorem 2 is a special case of Theorem 3 we will not prove it separatelyWe first provide a lemma

Lemma 1 Consider the ridge regression criterion

Cλ(β) = y minus Xβ2 + λβ2

SPARSE PRINCIPAL COMPONENT ANALYSIS 283

Then if β = arg minβ Cλ(β)

Cλ(β) = yT (I minus Sλ)y

where Sλ is the ridge operator

Sλ = X(XT X + λI)minus1XT

Proof of Lemma 1 Differentiating Cλ with respect to β we get that

minusXT (y minus Xβ) + λβ = 0

Premultiplication by βT and re-arrangement gives λβ2 = (y minus Xβ)T Xβ Since

y minus Xβ2 = (y minus Xβ)T y minus (y minus Xβ)T Xβ

Cλ(β) = (y minus Xβ)T y The result follows since the ldquofitted valuesrdquo Xβ = Sλy

Proof of Theorem 3 We use the notation introduced in Section 3 A = [α1 αk]and B = [β1 βk] Let

Cλ(AB) =nsum

i=1

xi minus ABT xi2 + λ

ksumj=1

βj2

As in (39) we have

nsumi=1

xi minus ABT xi2 = X minus XBAT 2 (A2)

= XAperp2 + XA minus XB2 (A3)

Hence with A fixed solving

arg minB

Cλ(AB)

is equivalent to solving the series of ridge regressions

arg minβjk

1

ksumj=1

Xαj minus Xβj2 + λβj2

It is easy to show that

B = (XT X + λI)minus1XT XA (A4)

and using Lemma 1 and (A2) we have that the partially optimized penalized criterion isgiven by

Cλ(A B) = XAperp2 + tr((XA)T (I minus Sλ)(XA)

) (A5)

284 H ZOU T HASTIE AND R TIBSHIRANI

Rearranging the terms we get

Cλ(A B) = tr(XT X) minus tr(AT XT SλXA) (A6)

which must be minimized with respect to A with AT A = I Hence A should be taken tobe the largest k eigenvectors of XT SλX If the SVD of X = UDVT it is easy to show thatXT SλX = VD2(D2 + λI)minus1D2VT hence A = V[ 1 k] Likewise plugging the SVD ofX into (A4) we see that each of the βj are scaled elements of the corresponding Vj

Proof of Theorem 4 We expand the matrix norm

M minus NAT 2 = tr(MT M) minus 2tr(MT NAT ) + tr(ANT NAT ) (A7)

Since AT A = I the last term is equal to tr(NT N) and hence we need to maximize (minushalf) the middle term With the SVD MT N = UDVT this middle term becomes

tr(MT NAT ) = tr(UDVT AT ) (A8)

= tr(UDAlowastT ) (A9)

= tr(AlowastT UD) (A10)

where Alowast = AV and since V is ktimes k orthonormal AlowastT Alowast = I Now since D is diagonal(A10) is maximized when the diagonal of AlowastT U is positive and maximum By Cauchy-Schwartz inequality this is achieved when Alowast = U in which case the diagonal elementsare all 1 Hence A = UVT

Proof of Theorem 5: Let \hat{B}^* = [\hat{\beta}^*_1, \ldots, \hat{\beta}^*_k] with \hat{\beta}^*_j = (1 + \lambda)\hat{\beta}_j; then \hat{V}_i(\lambda) = \hat{\beta}^*_i / \|\hat{\beta}^*_i\|. On the other hand, \hat{\beta} = \hat{\beta}^* / (1 + \lambda) means that

    (\hat{A}, \hat{B}^*) = \arg\min_{A, B} C_{\lambda, \lambda_1}(A, B)  subject to A^T A = I_{k \times k},    (A.11)

where

    C_{\lambda, \lambda_1}(A, B) = \sum_{i=1}^{n} \left\| x_i - A \frac{B^T}{1+\lambda} x_i \right\|^2 + \lambda \sum_{j=1}^{k} \left\| \frac{\beta_j}{1+\lambda} \right\|^2 + \sum_{j=1}^{k} \lambda_{1,j} \left\| \frac{\beta_j}{1+\lambda} \right\|_1.    (A.12)

Since

    \sum_{j=1}^{k} \left\| \frac{\beta_j}{1+\lambda} \right\|^2 = \frac{1}{(1+\lambda)^2} tr(B^T B)

and

    \sum_{i=1}^{n} \left\| x_i - A \frac{B^T}{1+\lambda} x_i \right\|^2 = tr\left( \left( X - X \frac{B}{1+\lambda} A^T \right)^T \left( X - X \frac{B}{1+\lambda} A^T \right) \right)
        = tr(X^T X) + \frac{1}{(1+\lambda)^2} tr(B^T X^T X B) - \frac{2}{1+\lambda} tr(A^T X^T X B),

we have

    C_{\lambda, \lambda_1}(A, B) = tr(X^T X) + \frac{1}{1+\lambda} \left[ tr\left( B^T \frac{X^T X + \lambda I}{1+\lambda} B \right) - 2\,tr(A^T X^T X B) + \sum_{j=1}^{k} \lambda_{1,j} \|\beta_j\|_1 \right],

which implies that

    (\hat{A}, \hat{B}^*) = \arg\min_{A, B}\; tr\left( B^T \frac{X^T X + \lambda I}{1+\lambda} B \right) - 2\,tr(A^T X^T X B) + \sum_{j=1}^{k} \lambda_{1,j} \|\beta_j\|_1    (A.13)

subject to A^T A = I_{k \times k}. As \lambda \to \infty, tr\left( B^T \frac{X^T X + \lambda I}{1+\lambda} B \right) \to tr(B^T B) = \sum_{j=1}^{k} \|\beta_j\|^2. Thus (A.13) approaches (4.1), and the conclusion of Theorem 5 follows.

ACKNOWLEDGMENTS

We thank the editor, an associate editor, and referees for helpful comments and suggestions which greatly improved the manuscript.

[Received April 2004. Revised June 2005.]

REFERENCES

Alter, O., Brown, P., and Botstein, D. (2000), "Singular Value Decomposition for Genome-Wide Expression Data Processing and Modeling," Proceedings of the National Academy of Sciences, 97, 10101–10106.

Cadima, J., and Jolliffe, I. (1995), "Loadings and Correlations in the Interpretation of Principal Components," Journal of Applied Statistics, 22, 203–214.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004), "Least Angle Regression," The Annals of Statistics, 32, 407–499.

Hancock, P., Burton, A., and Bruce, V. (1996), "Face Processing: Human Perception and Principal Components Analysis," Memory and Cognition, 24, 26–40.

Hastie, T., Tibshirani, R., and Friedman, J. (2001), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, New York: Springer-Verlag.

Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Ross, D., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000), "'Gene Shaving' as a Method for Identifying Distinct Sets of Genes With Similar Expression Patterns," Genome Biology, 1, 1–21.

Jeffers, J. (1967), "Two Case Studies in the Application of Principal Component Analysis," Applied Statistics, 16, 225–236.

Jolliffe, I. (1986), Principal Component Analysis, New York: Springer-Verlag.

Jolliffe, I. (1995), "Rotation of Principal Components: Choice of Normalization Constraints," Journal of Applied Statistics, 22, 29–35.

Jolliffe, I. T., Trendafilov, N. T., and Uddin, M. (2003), "A Modified Principal Component Technique Based on the Lasso," Journal of Computational and Graphical Statistics, 12, 531–547.

Mardia, K., Kent, J., and Bibby, J. (1979), Multivariate Analysis, New York: Academic Press.

McCabe, G. (1984), "Principal Variables," Technometrics, 26, 137–144.

Osborne, M. R., Presnell, B., and Turlach, B. A. (2000), "A New Approach to Variable Selection in Least Squares Problems," IMA Journal of Numerical Analysis, 20, 389–403.

Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J., Poggio, T., Gerald, W., Loda, M., Lander, E., and Golub, T. (2001), "Multiclass Cancer Diagnosis Using Tumor Gene Expression Signature," Proceedings of the National Academy of Sciences, 98, 15149–15154.

Tibshirani, R. (1996), "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Series B, 58, 267–288.

Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G. (2002), "Diagnosis of Multiple Cancer Types by Shrunken Centroids of Gene Expression," Proceedings of the National Academy of Sciences, 99, 6567–6572.

Vines, S. (2000), "Simple Principal Components," Applied Statistics, 49, 441–451.

Zou, H., and Hastie, T. (2005), "Regularization and Variable Selection via the Elastic Net," Journal of the Royal Statistical Society, Series B, 67, 301–320.

Page 10: Sparse Principal Component Analysis - Stanford …web.stanford.edu/~hastie/Papers/spc_jcgs.pdf · Sparse Principal Component Analysis HuiZ OU,Trevor H ASTIE and RobertT IBSHIRANI

274 H ZOU T HASTIE AND R TIBSHIRANI

O(pnJ + J3) for a positive finite λ where J is the number of nonzero coefficientsGenerally speaking the total cost is of ordermkO(pJn+J3) which can be expensivefor large J and p Fortunately as shown in the next section there exists a specialSPCA algorithm for efficiently dealing with p n data

4 SPCA FOR p n AND GENE EXPRESSION ARRAYS

For gene expression arrays the number of variables (genes) is typically much biggerthan the number of samples (eg n = 10 000 p = 100) Our general SPCA algorithmstill fits this situation using a positive λ However the computational cost is expensive whenrequiring a large number of nonzero loadings It is desirable to simplify the general SPCAalgorithm to boost the computation

Observe that Theorem 3 is valid for all λ gt 0 so in principle we can use any positiveλ It turns out that a thrifty solution emerges if λ rarr infin Precisely we have the followingtheorem

Theorem 5 Let Vj(λ) = βj

βj (j = 1 k) be the loadings derived from criterion

(312) Let (A B) be the solution of the optimization problem

(A B) = arg minA B

minus2tr(

AT XT XB)

+ksum

j=1

βj2 +ksum

j=1

λ1jβj1 (41)

subject to AT A = Iktimesk

When λ rarr infin Vj(λ) rarr βj

βj We can use the same alternating algorithm in Section 33 to solve (41) where we only

need to replace the general elastic net problem with its special case (λ = infin) Note thatgiven A

βj = arg minβj

minus2αTj (XT X)βj + βj2 + λ1jβj1 (42)

which has an explicit form solution given in (43)

Gene Expression Arrays SPCA Algorithm Replacing Step 2 in the general SPCAalgorithm withStep 2lowast for j = 1 2 k

βj =(∣∣αT

j XT X∣∣minus λ1j

2

)+

Sign(αTj XT X) (43)

The operation in (43) is called soft-thresholding Figure 1 gives an illustration of how thesoft-thresholding rule operates Recently soft-thresholding has become increasingly popularin the literature For example nearest shrunken centroids (Tibshirani Hastie Narasimhanand Chu 2002) adopts the soft-thresholding rule to simultaneously classify samples andselect important genes in microarrays

SPARSE PRINCIPAL COMPONENT ANALYSIS 275

Figure 1 An illustration of soft-thresholding rule y = (|x| minus ∆)+Sign(x) with ∆ = 1

5 EXAMPLES

51 PITPROPS DATA

The pitprops data first introduced by Jeffers (1967) has 180 observations and 13 mea-sured variables It is a classic example showing the difficulty of interpreting principal com-ponents Jeffers (1967) tried to interpret the first six PCs Jolliffe Trendafilov and Uddin(2003) used their SCoTLASS to find the modified PCs Table 1 presents the results ofPCA while Table 2 presents the modified PC loadings as computed by SCoTLASS and theadjusted variance computed using (320)

As a demonstration we also considered the first six principal components Since thisis a usual n p dataset we set λ = 0 λ1 = (006 016 01 05 05 05) were chosenaccording to Figure 2 such that each sparse approximation explained almost the sameamount of variance as the ordinary PC did Table 3 shows the obtained sparse loadings andthe corresponding adjusted variance Compared with the modified PCs of SCoTLASS PCsby SPCA account for a larger amount of variance (758 vs 693) with a much sparserloading structure The important variables associated with the six PCs do not overlap whichfurther makes the interpretations easier and clearer It is interesting to note that in Table 3even though the variance does not strictly monotonously decrease the adjusted variancefollows the right order However Table 2 shows that this is not true in SCoTLASS Itis also worthy of mention that the entire SPCA computation was done in seconds in Rwhile the implementation of SCoTLASS for each t was expensive (Jolliffe Trendafilovand Uddin 2003) Optimizing SCoTLASS over several values of t is an even more difficultcomputational challenge

Although the informal thresholding method which we henceforth refer to as simple

276 H ZOU T HASTIE AND R TIBSHIRANI

Table 1 Pitprops Data Loadings of the First Six Principal Components

Variable PC1 PC2 PC3 PC4 PC5 PC6

topdiam minus0404 0218 minus0207 0091 minus0083 0120length minus0406 0186 minus0235 0103 minus0113 0163moist minus0124 0541 0141 minus0078 0350 minus0276testsg minus0173 0456 0352 minus0055 0356 minus0054ovensg minus0057 minus0170 0481 minus0049 0176 0626ringtop minus0284 minus0014 0475 0063 minus0316 0052ringbut minus0400 minus0190 0253 0065 minus0215 0003bowmax minus0294 minus0189 minus0243 minus0286 0185 minus0055bowdist minus0357 0017 minus0208 minus0097 minus0106 0034whorls minus0379 minus0248 minus0119 0205 0156 minus0173clear 0011 0205 minus0070 minus0804 minus0343 0175knots 0115 0343 0092 0301 minus0600 minus0170diaknot 0113 0309 minus0326 0303 0080 0626

Variance () 324 183 144 85 70 63Cumulative variance () 324 507 651 736 806 869

Table 2 Pitprops Data Loadings of the First Six Modified PCs by SCoTLASS Empty cells have zeroloadings

t = 175Variable PC1 PC2 PC3 PC4 PC5 PC6

topdiam 0664 minus0025 0002 minus0035length 0683 minus0001 minus0040 0001 minus0018moist 0641 0195 0180 minus0030testsg 0701 0001 minus0001ovensg minus0887 minus0056ringtop 0293 minus0186 minus0373 0044ringbut 0001 0107 minus0658 minus0051 0064bowmax 0001 0735 0021 minus0168bowdist 0283 minus0001whorls 0113 minus0001 0388 minus0017 0320clear minus0923knots 0001 minus0554 0016 0004diaknot 0703 0001 minus0197 0080

Number of nonzero loadings 6 6 6 6 10 13Variance () 196 160 131 131 92 90Adjusted variance () 196 138 124 80 71 84Cumulative adjusted variance () 196 334 458 538 609 693

SPARSE PRINCIPAL COMPONENT ANALYSIS 277

Figure 2 Pitprops data The sequences of sparse approximations to the first six principal components The curvesshow the percentage of explained variance (PEV) as a function of λ1 The vertical broken lines indicate the choiceof λ1 used in our SPCA analysis

278 H ZOU T HASTIE AND R TIBSHIRANI

Table 3 Pitprops Data Loadings of the First Six Sparse PCs by SPCA Empty cells have zero loadings

Variable PC1 PC2 PC3 PC4 PC5 PC6

topdiam minus0477length minus0476moist 0785testsg 0620ovensg 0177 0640ringtop 0589ringbut minus0250 0492bowmax minus0344 minus0021bowdist minus0416whorls minus0400clear minus1knots 0013 minus1diaknot minus0015 1

Number of nonzero loadings 7 4 4 1 1 1Variance () 280 144 150 77 77 77Adjusted variance () 280 140 133 74 68 62Cumulative adjusted variance () 280 420 553 627 695 758

thresholding has various drawbacks it may serve as the benchmark for testing sparse PCsmethods A variant of simple thresholding is soft-thresholding We found that when used inPCA soft-thresholding performs very similarly to simple thresholding Thus we omitted theresults of soft-thresholding in this article Both SCoTLASS and SPCA were compared withsimple thresholding Table 4 presents the loadings and the corresponding variance explainedby simple thresholding To make the comparisons fair we let the numbers of nonzeroloadings obtained by simple thresholding match the results of SCoTLASS and SPCA asshown in the top and bottom parts of Table 4 respectively In terms of variance it seemsthat simple thresholding is better than SCoTLASS and worse than SPCA Moreover thevariables with nonzero loadings by SPCA are different to that chosen by simple thresholdingfor the first three PCs while SCoTLASS seems to create a similar sparseness pattern assimple thresholding does especially in the leading PC

52 A SYNTHETIC EXAMPLE

Our synthetic example has three hidden factors

V1 sim N(0 290) V2 sim N(0 300)

V3 = minus03V1 + 0925V2 + ε ε sim N(0 1)

and

V1 V2 and ε are independent

Then 10 observable variables are constructed as follows

Xi = V1 + ε1i ε1

i sim N(0 1) i = 1 2 3 4

SPARSE PRINCIPAL COMPONENT ANALYSIS 279

Xi = V2 + ε2i ε2

i sim N(0 1) i = 5 6 7 8

Xi = V3 + ε3i ε3

i sim N(0 1) i = 9 10

εji are independent j = 1 2 3 i = 1 10

We used the exact covariance matrix of (X1 X10) to perform PCA SPCA and simplethresholding (in the population setting)

Table 4 Pitprops Data Loadings of the First Six Modified PCs by Simple Thresholding Empty cellshave zero loadings

Variable PC1 PC2 PC3 PC4 PC5 PC6

Simple thresholding vs SCoTLASS

topdiam minus0439 0240 0120length minus0441 0105 minus0114 0163moist 0596 0354 minus0276testsg 0503 0391 0360 minus0054ovensg 0534 0178 0626ringtop 0528 minus0320 0052ringbut minus0435 0281 minus0218 0003bowmax minus0319 minus0270 minus0291 0188 minus0055bowdist minus0388 0034whorls minus0412 minus0274 0209 0158 minus0173clear minus0819 minus0347 0175knots 0378 0307 minus0608 minus0170diaknot 0340 minus0362 0309 0626

Number of nonzero loadings 6 6 6 6 10 13Variance () 289 161 154 84 71 63Adjusted Variance () 289 161 139 82 69 62Cumulative adjusted variance () 289 450 589 671 740 802

Simple thresholding vs SPCA

topdiam minus0420length minus0422moist 0640testsg 0540 0425ovensg 0580ringtop minus0296 0573ringbut minus0416bowmax minus0305bowdist minus0370whorls minus0394clear minus1knots 0406 minus1diaknot 0365 minus0393 1

Number of nonzero loadings 7 4 4 1 1 1Variance () 307 148 136 77 77 77Adjusted variance () 307 147 111 76 52 36Cumulative adjusted variance () 307 454 565 641 683 719

280 H ZOU T HASTIE AND R TIBSHIRANI

Table 5 Results of the Simulation Example Loadings and Variance

PCA SPCA (λ = 0) Simple ThresholdingPC1 PC2 PC3 PC1 PC2 PC1 PC2

X1 0116 minus0478 minus0087 0 05 0 minus05X2 0116 minus0478 minus0087 0 05 0 minus05X3 0116 minus0478 minus0087 0 05 0 minus05X4 0116 minus0478 minus0087 0 05 0 minus05X5 minus0395 minus0145 0270 05 0 0 0X6 minus0395 minus0145 0270 05 0 0 0X7 minus0395 minus0145 0270 05 0 minus0497 0X8 minus0395 minus0145 0270 05 0 minus0497 0X9 minus0401 0010 minus0582 0 0 minus0503 0X10 minus0401 0010 minus0582 0 0 minus0503 0

AdjustedVariance () 600 396 008 409 395 388 386

The variance of the three underlying factors is 290 300 and 2838 respectively Thenumbers of variables associated with the three factors are 4 4 and 2 Therefore V2 andV1 are almost equally important and they are much more important than V3 The first twoPCs together explain 996 of the total variance These facts suggest that we only needto consider two derived variables with ldquocorrectrdquo sparse representations Ideally the firstderived variable should recover the factor V2 only using (X5 X6 X7 X8) and the secondderived variable should recover the factor V1 only using (X1 X2 X3 X4) In fact if wesequentially maximize the variance of the first two derived variables under the orthonormalconstraint while restricting the numbers of nonzero loadings to four then the first derivedvariable uniformly assigns nonzero loadings on (X5 X6 X7 X8) and the second derivedvariable uniformly assigns nonzero loadings on (X1 X2 X3 X4)

Both SPCA (λ = 0) and simple thresholding were carried out by using the oracleinformation that the ideal sparse representations use only four variables Table 5 summarizesthe comparison results Clearly SPCA correctly identifies the sets of important variables Infact SPCA delivers the ideal sparse representations of the first two principal componentsMathematically it is easy to show that if t = 2 is used SCoTLASS is also able to find thesame sparse solution In this example both SPCA and SCoTLASS produce the ideal sparsePCs which may be explained by the fact that both methods explicitly use the lasso penalty

In contrast simple thresholding incorrectly includes X9 X10 in the most importantvariables The variance explained by simple thresholding is also lower than that by SPCAalthough the relative difference is small (less than 5) Due to the high correlation betweenV2 and V3 variablesX9 X10 achieve loadings which are even higher than those of the trueimportant variables (X5 X6 X7 X8) Thus the truth is disguised by the high correlationOn the other hand simple thresholding correctly discovers the second factor because V1

has a low correlation with V3

53 RAMASWAMY DATA

An important task in microarray analysis is to find a set of genes which are biologically

SPARSE PRINCIPAL COMPONENT ANALYSIS 281

Figure 3 The sparse leading principal component percentage of explained variance versus sparsity Simplethresholding and SPCA have similar performances However there still exists consistent difference in the selectedgenes (the ones with nonzero loadings)

relevant to the outcome (eg tumor type or survival time) PCA (or SVD) has been a populartool for this purpose Many gene-clustering methods in the literature use PCA (or SVD) asa building block For example gene shaving (Hastie et al 2000) uses an iterative principalcomponent shaving algorithm to identify subsets of coherent genes Here we consideranother approach to gene selection through SPCA The idea is intuitive if the (sparse)principal component can explain a large part of the total variance of gene expression levelsthen the subset of genes representing the principal component are considered important

We illustrate the sparse PC selection method on Ramaswamyrsquos data (Ramaswamy et al2001) which has 16063 (p = 16063) genes and 144 (n = 144) samples Its first principalcomponent explains 46 of the total variance For microarray data like this it appearsthat SCoTLASS cannot be practically useful for finding sparse PCs We applied SPCA(λ = infin) to find the leading sparse PC A sequence of values for λ1 were used such thatthe number of nonzero loadings varied over a wide range As displayed in Figure 3 thepercentage of explained variance decreases at a slow rate as the sparsity increases As fewas 25 of these 16063 genes can sufficiently construct the leading principal componentwith an affordable loss of explained variance (from 46 to 40) Simple thresholdingwas also applied to this data It seems that when using the same number of genes simplethresholding always explains slightly higher variance than SPCA does Among the samenumber of selected genes by SPCA and simple thresholding there are about 2 differentgenes and this difference rate is quite consistent

282 H ZOU T HASTIE AND R TIBSHIRANI

6 DISCUSSION

It has been an interesting research topic for years to derive principal components withsparse loadings From a practical point of view a good method to achieve the sparsenessgoal should (at least) possess the following properties

bull Without any sparsity constraint the method should reduce to PCAbull It should be computationally efficient for both small p and big p databull It should avoid misidentifying the important variables

The often-used simple thresholding approach is not criterion based However thisinformal method seems to possess the first two of the desirable properties listed above Ifthe explained variance and sparsity are the only concerns simple thresholding is a reasonableapproach and it is extremely convenient We have shown that simple thresholding can workwell with gene expression arrays The serious problem with simple thresholding is that itcan misidentify the real important variables Nevertheless simple thresholding is regardedas a benchmark for any potentially better method

Using the lasso constraint in PCA SCoTLASS successfully derives sparse loadingsHowever SCoTLASS is not computationally efficient and it lacks a good rule to pick itstuning parameter In addition it is not feasible to apply SCoTLASS to gene expressionarrays where PCA is a quite popular tool

In this work we have developed SPCA using our SPCA criterion (312) This newcriterion gives exact PCA results when its sparsity (lasso) penalty term vanishes SPCAallows flexible control on the sparse structure of the resulting loadings Unified efficientalgorithms have been proposed to compute SPCA solutions for both regular multivariatedata and gene expression arrays As a principled procedure SPCA enjoys advantages inseveral aspects including computational efficiency high explained variance and an abilityin identifying important variables

Software in R for fitting the SPCA model (and elastic net models) is available in theCRAN contributed package elasticnet

APPENDIX PROOFS

Proof of Theorem 1 Using XT X = VD2VT and VT V = I we have

βridge =(XT X + λI

)minus1XT (XVi) = Vi

D2ii

D2ii + λ

(A1)

Hence v = Vi

Note that since Theorem 2 is a special case of Theorem 3 we will not prove it separatelyWe first provide a lemma

Lemma 1 Consider the ridge regression criterion

Cλ(β) = y minus Xβ2 + λβ2

SPARSE PRINCIPAL COMPONENT ANALYSIS 283

Then if β = arg minβ Cλ(β)

Cλ(β) = yT (I minus Sλ)y

where Sλ is the ridge operator

Sλ = X(XT X + λI)minus1XT

Proof of Lemma 1 Differentiating Cλ with respect to β we get that

minusXT (y minus Xβ) + λβ = 0

Premultiplication by βT and re-arrangement gives λβ2 = (y minus Xβ)T Xβ Since

y minus Xβ2 = (y minus Xβ)T y minus (y minus Xβ)T Xβ

Cλ(β) = (y minus Xβ)T y The result follows since the ldquofitted valuesrdquo Xβ = Sλy

Proof of Theorem 3 We use the notation introduced in Section 3 A = [α1 αk]and B = [β1 βk] Let

Cλ(AB) =nsum

i=1

xi minus ABT xi2 + λ

ksumj=1

βj2

As in (39) we have

nsumi=1

xi minus ABT xi2 = X minus XBAT 2 (A2)

= XAperp2 + XA minus XB2 (A3)

Hence with A fixed solving

arg minB

Cλ(AB)

is equivalent to solving the series of ridge regressions

arg minβjk

1

ksumj=1

Xαj minus Xβj2 + λβj2

It is easy to show that

B = (XT X + λI)minus1XT XA (A4)

and using Lemma 1 and (A2) we have that the partially optimized penalized criterion isgiven by

Cλ(A B) = XAperp2 + tr((XA)T (I minus Sλ)(XA)

) (A5)

284 H ZOU T HASTIE AND R TIBSHIRANI

Rearranging the terms we get

Cλ(A B) = tr(XT X) minus tr(AT XT SλXA) (A6)

which must be minimized with respect to A with AT A = I Hence A should be taken tobe the largest k eigenvectors of XT SλX If the SVD of X = UDVT it is easy to show thatXT SλX = VD2(D2 + λI)minus1D2VT hence A = V[ 1 k] Likewise plugging the SVD ofX into (A4) we see that each of the βj are scaled elements of the corresponding Vj

Proof of Theorem 4 We expand the matrix norm

M minus NAT 2 = tr(MT M) minus 2tr(MT NAT ) + tr(ANT NAT ) (A7)

Since AT A = I the last term is equal to tr(NT N) and hence we need to maximize (minushalf) the middle term With the SVD MT N = UDVT this middle term becomes

tr(MT NAT ) = tr(UDVT AT ) (A8)

= tr(UDAlowastT ) (A9)

= tr(AlowastT UD) (A10)

where Alowast = AV and since V is ktimes k orthonormal AlowastT Alowast = I Now since D is diagonal(A10) is maximized when the diagonal of AlowastT U is positive and maximum By Cauchy-Schwartz inequality this is achieved when Alowast = U in which case the diagonal elementsare all 1 Hence A = UVT

Proof of Theorem 5 Let Blowast

= [βlowast1 β

lowastk ] with βlowast = (1+λ)β then Vi(λ) = βlowast

i

||βlowasti||

On the other hand β = βlowast

1+λ means that

(A Blowast) = arg min

A BCλλ1(AB) subject to AT A = Iktimesk (A11)

where

Cλλ1(AB) =nsum

i=1

xi minus ABT

1 + λxi2 + λ

ksumj=1

βj

1 + λ2 +

ksumj=1

λ1j βj

1 + λ1 (A12)

Since

ksumj=1

βj

1 + λ2 =

1(1 + λ)2

tr(BT B)

and

nsumi=1

xi minus ABT

1 + λxi2 = tr

((X minus X

B1 + λ

AT )T (X minus XB

1 + λAT )

)= tr

(XT X

)+

1(1 + λ)2

tr(BT XT XB

)minus 21 + λ

tr(

AT XT XB)

SPARSE PRINCIPAL COMPONENT ANALYSIS 285

Thus we have

Cλλ1(AB) = tr(XT X

)+

11 + λ

tr

(BT XT X + λI

1 + λB

)minus 2Tr

(AT XT XB

)+

ksumj=1

λ1jβj1

which implies that

(A Blowast) = arg min

A Btr

(BT XT X + λI

1 + λB

)minus 2tr

(AT XT XB

)+

ksumj=1

λ1jβj1 (A13)

subject to AT A = Iktimesk As λ rarr infin tr(

BT XT X+λI1+λ B

)rarr tr(BT B) =

sumkj=1 βj2

Thus (A13) approaches (41) and the conclusion of Theorem 5 follows

ACKNOWLEDGMENTSWe thank the editor an associate editor and referees for helpful comments and suggestions which greatly

improved the manuscript

[Received April 2004 Revised June 2005]

REFERENCES

Alter O Brown P and Botstein D (2000) ldquoSingular Value Decomposition for Genome-Wide Expression DataProcessing and Modelingrdquo in Proceedings of the National Academy of Sciences 97 pp 10101ndash10106

Cadima J and Jolliffe I (1995) ldquoLoadings and Correlations in the Interpretation of Principal ComponentsrdquoJournal of Applied Statistics 22 203ndash214

Efron B Hastie T Johnstone I and Tibshirani R (2004) ldquoLeast Angle Regressionrdquo The Annals of Statistics32 407ndash499

Hancock P Burton A and Bruce V (1996) ldquoFace Processing Human Perception and Principal ComponentsAnalysisrdquo Memory and Cognition 24 26ndash40

Hastie T Tibshirani R and Friedman J (2001) The Elements of Statistical Learning Data mining Inferenceand Prediction New York Springer Verlag

Hastie T Tibshirani R Eisen M Brown P Ross D Scherf U Weinstein J Alizadeh A Staudt Land Botstein D (2000) ldquo lsquogene Shavingrsquo as a Method for Identifying Distinct Sets of Genes With SimilarExpression Patternsrdquo Genome Biology 1 1ndash21

Jeffers J (1967) ldquoTwo Case Studies in the Application of Principal Componentrdquo Applied Statistics 16 225ndash236

Jolliffe I (1986) Principal Component Analysis New York Springer Verlag

(1995) ldquoRotation of Principal Components Choice of Normalization Constraintsrdquo Journal of AppliedStatistics 22 29ndash35

Jolliffe I T Trendafilov N T and Uddin M (2003) ldquoA Modified Principal Component Technique Based onthe Lassordquo Journal of Computational and Graphical Statistics 12 531ndash547

Mardia K Kent J and Bibby J (1979) Multivariate Analysis New York Academic Press

McCabe G (1984) ldquoPrincipal Variablesrdquo Technometrics 26 137ndash144

286 H ZOU T HASTIE AND R TIBSHIRANI

Osborne M R Presnell B and Turlach B A (2000) ldquoA New Approach to Variable Selection in Least SquaresProblemsrdquo IMA Journal of Numerical Analysis 20 389ndash403

Ramaswamy S Tamayo P Rifkin R Mukheriee S Yeang C Angelo M Ladd C Reich M LatulippeE Mesirov J Poggio T Gerald W Loda M Lander E and Golub T (2001) ldquoMulticlass CancerDiagnosis using Tumor Gene Expression Signaturerdquo in Proceedings of the National Academy of Sciences98 pp 15149ndash15154

Tibshirani R (1996) ldquoRegression Shrinkage and Selection via the Lassordquo Journal of the Royal Statistical SocietySeries B 58 267ndash288

Tibshirani R Hastie T Narasimhan B and Chu G (2002) ldquoDiagnosis of Multiple Cancer Types by ShrunkenCentroids of Generdquo in Proceedings of the National Academy of Sciences 99 6567ndash6572

Vines S (2000) ldquoSimple Principal Componentsrdquo Applied Statistics 49 441ndash451

Zou H and Hastie T (2005) ldquoRegularization and Variable Selection via the Elastic Netrdquo Journal of the RoyalStatistical Society Series B 67 301ndash320

Page 11: Sparse Principal Component Analysis - Stanford …web.stanford.edu/~hastie/Papers/spc_jcgs.pdf · Sparse Principal Component Analysis HuiZ OU,Trevor H ASTIE and RobertT IBSHIRANI

SPARSE PRINCIPAL COMPONENT ANALYSIS 275

Figure 1 An illustration of soft-thresholding rule y = (|x| minus ∆)+Sign(x) with ∆ = 1

5 EXAMPLES

51 PITPROPS DATA

The pitprops data first introduced by Jeffers (1967) has 180 observations and 13 mea-sured variables It is a classic example showing the difficulty of interpreting principal com-ponents Jeffers (1967) tried to interpret the first six PCs Jolliffe Trendafilov and Uddin(2003) used their SCoTLASS to find the modified PCs Table 1 presents the results ofPCA while Table 2 presents the modified PC loadings as computed by SCoTLASS and theadjusted variance computed using (320)

As a demonstration we also considered the first six principal components Since thisis a usual n p dataset we set λ = 0 λ1 = (006 016 01 05 05 05) were chosenaccording to Figure 2 such that each sparse approximation explained almost the sameamount of variance as the ordinary PC did Table 3 shows the obtained sparse loadings andthe corresponding adjusted variance Compared with the modified PCs of SCoTLASS PCsby SPCA account for a larger amount of variance (758 vs 693) with a much sparserloading structure The important variables associated with the six PCs do not overlap whichfurther makes the interpretations easier and clearer It is interesting to note that in Table 3even though the variance does not strictly monotonously decrease the adjusted variancefollows the right order However Table 2 shows that this is not true in SCoTLASS Itis also worthy of mention that the entire SPCA computation was done in seconds in Rwhile the implementation of SCoTLASS for each t was expensive (Jolliffe Trendafilovand Uddin 2003) Optimizing SCoTLASS over several values of t is an even more difficultcomputational challenge

Although the informal thresholding method which we henceforth refer to as simple

276 H ZOU T HASTIE AND R TIBSHIRANI

Table 1 Pitprops Data Loadings of the First Six Principal Components

Variable PC1 PC2 PC3 PC4 PC5 PC6

topdiam minus0404 0218 minus0207 0091 minus0083 0120length minus0406 0186 minus0235 0103 minus0113 0163moist minus0124 0541 0141 minus0078 0350 minus0276testsg minus0173 0456 0352 minus0055 0356 minus0054ovensg minus0057 minus0170 0481 minus0049 0176 0626ringtop minus0284 minus0014 0475 0063 minus0316 0052ringbut minus0400 minus0190 0253 0065 minus0215 0003bowmax minus0294 minus0189 minus0243 minus0286 0185 minus0055bowdist minus0357 0017 minus0208 minus0097 minus0106 0034whorls minus0379 minus0248 minus0119 0205 0156 minus0173clear 0011 0205 minus0070 minus0804 minus0343 0175knots 0115 0343 0092 0301 minus0600 minus0170diaknot 0113 0309 minus0326 0303 0080 0626

Variance () 324 183 144 85 70 63Cumulative variance () 324 507 651 736 806 869

Table 2 Pitprops Data Loadings of the First Six Modified PCs by SCoTLASS Empty cells have zeroloadings

t = 175Variable PC1 PC2 PC3 PC4 PC5 PC6

topdiam 0664 minus0025 0002 minus0035length 0683 minus0001 minus0040 0001 minus0018moist 0641 0195 0180 minus0030testsg 0701 0001 minus0001ovensg minus0887 minus0056ringtop 0293 minus0186 minus0373 0044ringbut 0001 0107 minus0658 minus0051 0064bowmax 0001 0735 0021 minus0168bowdist 0283 minus0001whorls 0113 minus0001 0388 minus0017 0320clear minus0923knots 0001 minus0554 0016 0004diaknot 0703 0001 minus0197 0080

Number of nonzero loadings 6 6 6 6 10 13Variance () 196 160 131 131 92 90Adjusted variance () 196 138 124 80 71 84Cumulative adjusted variance () 196 334 458 538 609 693

SPARSE PRINCIPAL COMPONENT ANALYSIS 277

Figure 2 Pitprops data The sequences of sparse approximations to the first six principal components The curvesshow the percentage of explained variance (PEV) as a function of λ1 The vertical broken lines indicate the choiceof λ1 used in our SPCA analysis

278 H ZOU T HASTIE AND R TIBSHIRANI

Table 3 Pitprops Data Loadings of the First Six Sparse PCs by SPCA Empty cells have zero loadings

Variable PC1 PC2 PC3 PC4 PC5 PC6

topdiam minus0477length minus0476moist 0785testsg 0620ovensg 0177 0640ringtop 0589ringbut minus0250 0492bowmax minus0344 minus0021bowdist minus0416whorls minus0400clear minus1knots 0013 minus1diaknot minus0015 1

Number of nonzero loadings 7 4 4 1 1 1Variance () 280 144 150 77 77 77Adjusted variance () 280 140 133 74 68 62Cumulative adjusted variance () 280 420 553 627 695 758

thresholding has various drawbacks it may serve as the benchmark for testing sparse PCsmethods A variant of simple thresholding is soft-thresholding We found that when used inPCA soft-thresholding performs very similarly to simple thresholding Thus we omitted theresults of soft-thresholding in this article Both SCoTLASS and SPCA were compared withsimple thresholding Table 4 presents the loadings and the corresponding variance explainedby simple thresholding To make the comparisons fair we let the numbers of nonzeroloadings obtained by simple thresholding match the results of SCoTLASS and SPCA asshown in the top and bottom parts of Table 4 respectively In terms of variance it seemsthat simple thresholding is better than SCoTLASS and worse than SPCA Moreover thevariables with nonzero loadings by SPCA are different to that chosen by simple thresholdingfor the first three PCs while SCoTLASS seems to create a similar sparseness pattern assimple thresholding does especially in the leading PC

5.2 A SYNTHETIC EXAMPLE

Our synthetic example has three hidden factors:
$$V_1 \sim N(0, 290), \qquad V_2 \sim N(0, 300),$$
$$V_3 = -0.3 V_1 + 0.925 V_2 + \epsilon, \qquad \epsilon \sim N(0, 1),$$
where $V_1$, $V_2$, and $\epsilon$ are independent.

Then 10 observable variables are constructed as follows:

$$X_i = V_1 + \epsilon_i^1, \qquad \epsilon_i^1 \sim N(0, 1), \qquad i = 1, 2, 3, 4,$$
$$X_i = V_2 + \epsilon_i^2, \qquad \epsilon_i^2 \sim N(0, 1), \qquad i = 5, 6, 7, 8,$$
$$X_i = V_3 + \epsilon_i^3, \qquad \epsilon_i^3 \sim N(0, 1), \qquad i = 9, 10,$$
where the $\epsilon_i^j$ are independent, $j = 1, 2, 3$, $i = 1, \ldots, 10$.

We used the exact covariance matrix of $(X_1, \ldots, X_{10})$ to perform PCA, SPCA, and simple thresholding (in the population setting).
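The exact covariance matrix follows directly from the three-factor model, so the population-level comparison can be reproduced without simulation. A minimal sketch (all object names are ours); the leading eigenvectors should match the PCA loadings of Table 5 up to sign and rounding.

```r
# Exact covariance matrix of (X1, ..., X10) implied by the three-factor model.
v1 <- 290; v2 <- 300
v3  <- 0.3^2 * v1 + 0.925^2 * v2 + 1    # Var(V3) = 283.7875
c13 <- -0.3 * v1                        # Cov(V1, V3)
c23 <- 0.925 * v2                       # Cov(V2, V3)

f  <- c(rep(1, 4), rep(2, 4), rep(3, 2))   # which factor drives X1, ..., X10
Sf <- matrix(c(v1,   0, c13,
                0,  v2, c23,
              c13, c23,  v3), 3, 3)
Sigma <- Sf[f, f] + diag(1, 10)            # add the independent N(0, 1) noise

# Population PCA: eigenvectors of Sigma give the PCA loadings of Table 5.
eig <- eigen(Sigma)
round(eig$vectors[, 1:3], 3)

# Simple thresholding of the first two PCs to four nonzero loadings each,
# reusing simple_threshold() from the earlier sketch.
# round(cbind(simple_threshold(eig$vectors[, 1], 4),
#             simple_threshold(eig$vectors[, 2], 4)), 3)
```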

Table 4. Pitprops Data: Loadings of the First Six Modified PCs by Simple Thresholding. Empty cells have zero loadings.

Variable   PC1     PC2     PC3     PC4     PC5     PC6

Simple thresholding vs. SCoTLASS

topdiam   -0.439   0.240                                  0.120
length    -0.441                   0.105  -0.114          0.163
moist              0.596                   0.354         -0.276
testsg             0.503   0.391           0.360         -0.054
ovensg                     0.534           0.178          0.626
ringtop                    0.528          -0.320          0.052
ringbut   -0.435           0.281          -0.218          0.003
bowmax    -0.319          -0.270  -0.291   0.188         -0.055
bowdist   -0.388                                          0.034
whorls    -0.412  -0.274           0.209   0.158         -0.173
clear                             -0.819  -0.347          0.175
knots              0.378           0.307  -0.608         -0.170
diaknot            0.340  -0.362   0.309                  0.626

Number of nonzero loadings            6     6     6     6    10    13
Variance (%)                       28.9  16.1  15.4   8.4   7.1   6.3
Adjusted variance (%)              28.9  16.1  13.9   8.2   6.9   6.2
Cumulative adjusted variance (%)   28.9  45.0  58.9  67.1  74.0  80.2

Simple thresholding vs. SPCA

topdiam   -0.420
length    -0.422
moist              0.640
testsg             0.540   0.425
ovensg                     0.580
ringtop   -0.296           0.573
ringbut   -0.416
bowmax    -0.305
bowdist   -0.370
whorls    -0.394
clear                             -1
knots              0.406                 -1
diaknot            0.365  -0.393                  1

Number of nonzero loadings            7     4     4     1     1     1
Variance (%)                       30.7  14.8  13.6   7.7   7.7   7.7
Adjusted variance (%)              30.7  14.7  11.1   7.6   5.2   3.6
Cumulative adjusted variance (%)   30.7  45.4  56.5  64.1  68.3  71.9


Table 5. Results of the Simulation Example: Loadings and Variance.

                 PCA                     SPCA (λ = 0)     Simple Thresholding
Variable   PC1     PC2     PC3          PC1     PC2          PC1     PC2
X1        0.116  -0.478  -0.087           0     0.5            0    -0.5
X2        0.116  -0.478  -0.087           0     0.5            0    -0.5
X3        0.116  -0.478  -0.087           0     0.5            0    -0.5
X4        0.116  -0.478  -0.087           0     0.5            0    -0.5
X5       -0.395  -0.145   0.270         0.5       0            0       0
X6       -0.395  -0.145   0.270         0.5       0            0       0
X7       -0.395  -0.145   0.270         0.5       0       -0.497       0
X8       -0.395  -0.145   0.270         0.5       0       -0.497       0
X9       -0.401   0.010  -0.582           0       0       -0.503       0
X10      -0.401   0.010  -0.582           0       0       -0.503       0

Adjusted variance (%)   60.0   39.6   0.08     40.9   39.5      38.8   38.6

The variance of the three underlying factors is 290, 300, and 283.8, respectively. The numbers of variables associated with the three factors are 4, 4, and 2. Therefore V2 and V1 are almost equally important, and they are much more important than V3. The first two PCs together explain 99.6% of the total variance. These facts suggest that we only need to consider two derived variables with "correct" sparse representations. Ideally, the first derived variable should recover the factor V2 only using (X5, X6, X7, X8), and the second derived variable should recover the factor V1 only using (X1, X2, X3, X4). In fact, if we sequentially maximize the variance of the first two derived variables under the orthonormal constraint, while restricting the numbers of nonzero loadings to four, then the first derived variable uniformly assigns nonzero loadings on (X5, X6, X7, X8), and the second derived variable uniformly assigns nonzero loadings on (X1, X2, X3, X4).

Both SPCA (λ = 0) and simple thresholding were carried out using the oracle information that the ideal sparse representations use only four variables. Table 5 summarizes the comparison results. Clearly, SPCA correctly identifies the sets of important variables; in fact, SPCA delivers the ideal sparse representations of the first two principal components. Mathematically, it is easy to show that if t = 2 is used, SCoTLASS is also able to find the same sparse solution. In this example, both SPCA and SCoTLASS produce the ideal sparse PCs, which may be explained by the fact that both methods explicitly use the lasso penalty.
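A sketch of how this oracle comparison might be run with the CRAN package elasticnet, reusing Sigma from the previous sketch; the spca() arguments shown (type, sparse, para, lambda) follow our recollection of the package interface and may differ across versions.

```r
library(elasticnet)

# type = "Gram": the input is a covariance/correlation matrix rather than data;
# sparse = "varnum" with para = c(4, 4): ask for four nonzero loadings per PC.
fit <- spca(Sigma, K = 2, type = "Gram", sparse = "varnum",
            para = c(4, 4), lambda = 0)
round(fit$loadings, 3)   # should recover the ideal sparse PCs of Table 5
```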

In contrast, simple thresholding incorrectly includes X9, X10 among the most important variables. The variance explained by simple thresholding is also lower than that by SPCA, although the relative difference is small (less than 5%). Due to the high correlation between V2 and V3, variables X9, X10 achieve loadings which are even higher than those of the true important variables (X5, X6, X7, X8). Thus the truth is disguised by the high correlation. On the other hand, simple thresholding correctly discovers the second factor, because V1 has a low correlation with V3.

5.3 RAMASWAMY DATA

An important task in microarray analysis is to find a set of genes which are biologically relevant to the outcome (e.g., tumor type or survival time). PCA (or SVD) has been a popular tool for this purpose. Many gene-clustering methods in the literature use PCA (or SVD) as a building block. For example, gene shaving (Hastie et al. 2000) uses an iterative principal component shaving algorithm to identify subsets of coherent genes. Here we consider another approach to gene selection through SPCA. The idea is intuitive: if the (sparse) principal component can explain a large part of the total variance of gene expression levels, then the subset of genes representing the principal component is considered important.

Figure 3. The sparse leading principal component: percentage of explained variance versus sparsity. Simple thresholding and SPCA have similar performances. However, there still exists a consistent difference in the selected genes (the ones with nonzero loadings).

We illustrate the sparse PC selection method on Ramaswamy's data (Ramaswamy et al. 2001), which has 16,063 (p = 16,063) genes and 144 (n = 144) samples. Its first principal component explains 46% of the total variance. For microarray data like this, it appears that SCoTLASS cannot be practically useful for finding sparse PCs. We applied SPCA (λ = ∞) to find the leading sparse PC. A sequence of values for λ1 was used such that the number of nonzero loadings varied over a wide range. As displayed in Figure 3, the percentage of explained variance decreases at a slow rate as the sparsity increases. As few as 2.5% of these 16,063 genes can sufficiently construct the leading principal component with an affordable loss of explained variance (from 46% to 40%). Simple thresholding was also applied to these data. It seems that, when using the same number of genes, simple thresholding always explains slightly higher variance than SPCA does. Among the same number of genes selected by SPCA and simple thresholding, about 2% of the genes differ, and this difference rate is quite consistent.
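As an illustration of this workflow (not the authors' original scripts), a hypothetical call sequence using arrayspc(), the elasticnet function that, to our recollection, implements the λ = ∞ special case for very large p; the expression matrix x, the penalty grid, and the returned component names are assumptions.

```r
library(elasticnet)

# x: the 144 x 16063 expression matrix (not included here).
# A decreasing sequence of lasso penalties traces out the sparsity path:
# lam1 <- c(500, 200, 100, 50, 20, 10)
# fits <- lapply(lam1, function(l) arrayspc(x, K = 1, para = l))
# sapply(fits, function(f) sum(abs(f$loadings) > 0))  # genes retained
# sapply(fits, function(f) f$pev)                     # explained variance
```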


6 DISCUSSION

It has been an interesting research topic for years to derive principal components with sparse loadings. From a practical point of view, a good method to achieve the sparseness goal should (at least) possess the following properties:

• Without any sparsity constraint, the method should reduce to PCA.
• It should be computationally efficient for both small p and big p data.
• It should avoid misidentifying the important variables.

The often-used simple thresholding approach is not criterion based. However, this informal method seems to possess the first two of the desirable properties listed above. If the explained variance and sparsity are the only concerns, simple thresholding is a reasonable approach, and it is extremely convenient. We have shown that simple thresholding can work well with gene expression arrays. The serious problem with simple thresholding is that it can misidentify the truly important variables. Nevertheless, simple thresholding is regarded as a benchmark for any potentially better method.

Using the lasso constraint in PCA, SCoTLASS successfully derives sparse loadings. However, SCoTLASS is not computationally efficient, and it lacks a good rule to pick its tuning parameter. In addition, it is not feasible to apply SCoTLASS to gene expression arrays, where PCA is a quite popular tool.

In this work we have developed SPCA using our SPCA criterion (3.12). This new criterion gives exact PCA results when its sparsity (lasso) penalty term vanishes. SPCA allows flexible control on the sparse structure of the resulting loadings. Unified efficient algorithms have been proposed to compute SPCA solutions for both regular multivariate data and gene expression arrays. As a principled procedure, SPCA enjoys advantages in several aspects, including computational efficiency, high explained variance, and an ability to identify important variables.

Software in R for fitting the SPCA model (and elastic net models) is available in the CRAN contributed package elasticnet.
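A usage sketch, assuming the package ships the pitprops correlation matrix as data(pitprops) and that spca() accepts the arguments shown (names may vary by version); sparse = "varnum" requests the nonzero-loading counts of Table 3 directly instead of specifying the lasso penalties.

```r
library(elasticnet)
data(pitprops)

fit <- spca(pitprops, K = 6, type = "Gram", sparse = "varnum",
            para = c(7, 4, 4, 1, 1, 1))   # nonzero loadings per component
round(fit$loadings, 3)                    # compare with Table 3
fit$pev                                   # percentage of explained variance
```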

APPENDIX: PROOFS

Proof of Theorem 1: Using $X^T X = VD^2V^T$ and $V^T V = I$, we have
$$\hat\beta_{\text{ridge}} = \left(X^T X + \lambda I\right)^{-1} X^T (X V_i) = V_i \frac{D_{ii}^2}{D_{ii}^2 + \lambda}. \qquad (A.1)$$
Hence $\hat{v} = V_i$.

Note that since Theorem 2 is a special case of Theorem 3, we will not prove it separately. We first provide a lemma.

Lemma 1. Consider the ridge regression criterion
$$C_\lambda(\beta) = \|y - X\beta\|^2 + \lambda\|\beta\|^2.$$


Then, if $\hat\beta = \arg\min_\beta C_\lambda(\beta)$,
$$C_\lambda(\hat\beta) = y^T (I - S_\lambda) y,$$
where $S_\lambda$ is the ridge operator
$$S_\lambda = X (X^T X + \lambda I)^{-1} X^T.$$

Proof of Lemma 1: Differentiating $C_\lambda$ with respect to $\beta$, we get that
$$-X^T (y - X\hat\beta) + \lambda\hat\beta = 0.$$
Premultiplication by $\hat\beta^T$ and re-arrangement gives $\lambda\|\hat\beta\|^2 = (y - X\hat\beta)^T X\hat\beta$. Since
$$\|y - X\hat\beta\|^2 = (y - X\hat\beta)^T y - (y - X\hat\beta)^T X\hat\beta,$$
$C_\lambda(\hat\beta) = (y - X\hat\beta)^T y$. The result follows since the "fitted values" $X\hat\beta = S_\lambda y$.
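Lemma 1 is easy to verify numerically; the following sketch uses arbitrary simulated data and our own object names.

```r
# Numerical check of Lemma 1: the minimized ridge criterion equals
# t(y) %*% (I - S_lambda) %*% y.
set.seed(1)
n <- 50; p <- 5; lambda <- 2
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)

beta.hat <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
C.min    <- sum((y - X %*% beta.hat)^2) + lambda * sum(beta.hat^2)
S.lambda <- X %*% solve(crossprod(X) + lambda * diag(p)) %*% t(X)
all.equal(C.min, drop(t(y) %*% (diag(n) - S.lambda) %*% y))   # TRUE
```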

Proof of Theorem 3: We use the notation introduced in Section 3: $A = [\alpha_1, \ldots, \alpha_k]$ and $B = [\beta_1, \ldots, \beta_k]$. Let
$$C_\lambda(A, B) = \sum_{i=1}^n \|x_i - AB^T x_i\|^2 + \lambda \sum_{j=1}^k \|\beta_j\|^2.$$
As in (3.9), we have
$$\sum_{i=1}^n \|x_i - AB^T x_i\|^2 = \|X - XBA^T\|^2 \qquad (A.2)$$
$$= \|XA_\perp\|^2 + \|XA - XB\|^2. \qquad (A.3)$$
Hence, with $A$ fixed, solving
$$\arg\min_B C_\lambda(A, B)$$
is equivalent to solving the series of ridge regressions
$$\arg\min_{\{\beta_j\}_1^k} \sum_{j=1}^k \left\{ \|X\alpha_j - X\beta_j\|^2 + \lambda \|\beta_j\|^2 \right\}.$$
It is easy to show that
$$\hat{B} = (X^T X + \lambda I)^{-1} X^T X A, \qquad (A.4)$$
and using Lemma 1 and (A.2), we have that the partially optimized penalized criterion is given by
$$C_\lambda(A, \hat{B}) = \|XA_\perp\|^2 + \operatorname{tr}\left( (XA)^T (I - S_\lambda)(XA) \right). \qquad (A.5)$$


Rearranging the terms, we get
$$C_\lambda(A, \hat{B}) = \operatorname{tr}(X^T X) - \operatorname{tr}(A^T X^T S_\lambda X A), \qquad (A.6)$$
which must be minimized with respect to $A$ with $A^T A = I$. Hence $A$ should be taken to be the largest $k$ eigenvectors of $X^T S_\lambda X$. If the SVD of $X$ is $X = UDV^T$, it is easy to show that $X^T S_\lambda X = VD^2(D^2 + \lambda I)^{-1} D^2 V^T$; hence $\hat{A} = V[\,, 1{:}k]$. Likewise, plugging the SVD of $X$ into (A.4), we see that each of the $\hat\beta_j$ is a scaled element of the corresponding $V_j$.

Proof of Theorem 4: We expand the matrix norm
$$\|M - NA^T\|^2 = \operatorname{tr}(M^T M) - 2\operatorname{tr}(M^T N A^T) + \operatorname{tr}(A N^T N A^T). \qquad (A.7)$$
Since $A^T A = I$, the last term is equal to $\operatorname{tr}(N^T N)$, and hence we need to maximize (minus half) the middle term. With the SVD $M^T N = UDV^T$, this middle term becomes
$$\operatorname{tr}(M^T N A^T) = \operatorname{tr}(UDV^T A^T) \qquad (A.8)$$
$$= \operatorname{tr}(UDA^{*T}) \qquad (A.9)$$
$$= \operatorname{tr}(A^{*T} U D), \qquad (A.10)$$
where $A^* = AV$, and since $V$ is $k \times k$ orthonormal, $A^{*T} A^* = I$. Now, since $D$ is diagonal, (A.10) is maximized when the diagonal of $A^{*T} U$ is positive and maximum. By the Cauchy-Schwartz inequality, this is achieved when $A^* = U$, in which case the diagonal elements are all 1. Hence $\hat{A} = UV^T$.
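Theorem 4 can likewise be checked numerically: the closed-form A = UV^T from the SVD of M^T N should never be beaten by random orthonormal candidates. The sketch below uses arbitrary simulated matrices and our own names.

```r
set.seed(2)
n <- 20; p <- 6; k <- 3
M <- matrix(rnorm(n * p), n, p)
N <- matrix(rnorm(n * k), n, k)

sv   <- svd(t(M) %*% N)            # t(M) %*% N = U D V^T
Ahat <- sv$u %*% t(sv$v)           # claimed minimizer; t(Ahat) %*% Ahat = I
obj  <- function(A) sum((M - N %*% t(A))^2)

rand.orth <- replicate(200, {
  A <- qr.Q(qr(matrix(rnorm(p * k), p, k)))   # random p x k orthonormal matrix
  obj(A)
})
obj(Ahat) <= min(rand.orth)        # TRUE
```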

Proof of Theorem 5: Let $\hat{B}^* = [\hat\beta_1^*, \ldots, \hat\beta_k^*]$ with $\hat\beta^* = (1+\lambda)\hat\beta$; then $\hat{V}_i(\lambda) = \frac{\hat\beta_i^*}{\|\hat\beta_i^*\|}$. On the other hand, $\hat\beta = \frac{\hat\beta^*}{1+\lambda}$ means that
$$(\hat{A}, \hat{B}^*) = \arg\min_{A, B} C_{\lambda,\lambda_1}(A, B) \quad \text{subject to } A^T A = I_{k\times k}, \qquad (A.11)$$
where
$$C_{\lambda,\lambda_1}(A, B) = \sum_{i=1}^n \left\| x_i - A\frac{B^T}{1+\lambda} x_i \right\|^2 + \lambda \sum_{j=1}^k \left\| \frac{\beta_j}{1+\lambda} \right\|^2 + \sum_{j=1}^k \lambda_{1,j} \left\| \frac{\beta_j}{1+\lambda} \right\|_1. \qquad (A.12)$$
Since
$$\sum_{j=1}^k \left\| \frac{\beta_j}{1+\lambda} \right\|^2 = \frac{1}{(1+\lambda)^2}\operatorname{tr}(B^T B)$$
and
$$\sum_{i=1}^n \left\| x_i - A\frac{B^T}{1+\lambda} x_i \right\|^2 = \operatorname{tr}\left( \left(X - X\frac{B}{1+\lambda}A^T\right)^T \left(X - X\frac{B}{1+\lambda}A^T\right) \right) = \operatorname{tr}(X^T X) + \frac{1}{(1+\lambda)^2}\operatorname{tr}(B^T X^T X B) - \frac{2}{1+\lambda}\operatorname{tr}(A^T X^T X B),$$
we have
$$C_{\lambda,\lambda_1}(A, B) = \operatorname{tr}(X^T X) + \frac{1}{1+\lambda}\left\{ \operatorname{tr}\left(B^T \frac{X^T X + \lambda I}{1+\lambda} B\right) - 2\operatorname{tr}(A^T X^T X B) + \sum_{j=1}^k \lambda_{1,j}\|\beta_j\|_1 \right\},$$
which implies that
$$(\hat{A}, \hat{B}^*) = \arg\min_{A, B} \operatorname{tr}\left(B^T \frac{X^T X + \lambda I}{1+\lambda} B\right) - 2\operatorname{tr}(A^T X^T X B) + \sum_{j=1}^k \lambda_{1,j}\|\beta_j\|_1 \qquad (A.13)$$
subject to $A^T A = I_{k\times k}$. As $\lambda \to \infty$,
$$\operatorname{tr}\left(B^T \frac{X^T X + \lambda I}{1+\lambda} B\right) \to \operatorname{tr}(B^T B) = \sum_{j=1}^k \|\beta_j\|^2.$$
Thus (A.13) approaches (4.1), and the conclusion of Theorem 5 follows.

ACKNOWLEDGMENTS

We thank the editor, an associate editor, and the referees for helpful comments and suggestions which greatly improved the manuscript.

[Received April 2004. Revised June 2005.]

REFERENCES

Alter, O., Brown, P., and Botstein, D. (2000), "Singular Value Decomposition for Genome-Wide Expression Data Processing and Modeling," in Proceedings of the National Academy of Sciences, 97, pp. 10101–10106.

Cadima, J., and Jolliffe, I. (1995), "Loadings and Correlations in the Interpretation of Principal Components," Journal of Applied Statistics, 22, 203–214.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004), "Least Angle Regression," The Annals of Statistics, 32, 407–499.

Hancock, P., Burton, A., and Bruce, V. (1996), "Face Processing: Human Perception and Principal Components Analysis," Memory and Cognition, 24, 26–40.

Hastie, T., Tibshirani, R., and Friedman, J. (2001), The Elements of Statistical Learning: Data Mining, Inference and Prediction, New York: Springer Verlag.

Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Ross, D., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000), "'Gene Shaving' as a Method for Identifying Distinct Sets of Genes With Similar Expression Patterns," Genome Biology, 1, 1–21.

Jeffers, J. (1967), "Two Case Studies in the Application of Principal Component Analysis," Applied Statistics, 16, 225–236.

Jolliffe, I. (1986), Principal Component Analysis, New York: Springer Verlag.

Jolliffe, I. (1995), "Rotation of Principal Components: Choice of Normalization Constraints," Journal of Applied Statistics, 22, 29–35.

Jolliffe, I. T., Trendafilov, N. T., and Uddin, M. (2003), "A Modified Principal Component Technique Based on the Lasso," Journal of Computational and Graphical Statistics, 12, 531–547.

Mardia, K., Kent, J., and Bibby, J. (1979), Multivariate Analysis, New York: Academic Press.

McCabe, G. (1984), "Principal Variables," Technometrics, 26, 137–144.

Osborne, M. R., Presnell, B., and Turlach, B. A. (2000), "A New Approach to Variable Selection in Least Squares Problems," IMA Journal of Numerical Analysis, 20, 389–403.

Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J., Poggio, T., Gerald, W., Loda, M., Lander, E., and Golub, T. (2001), "Multiclass Cancer Diagnosis Using Tumor Gene Expression Signature," in Proceedings of the National Academy of Sciences, 98, pp. 15149–15154.

Tibshirani, R. (1996), "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Series B, 58, 267–288.

Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G. (2002), "Diagnosis of Multiple Cancer Types by Shrunken Centroids of Gene Expression," in Proceedings of the National Academy of Sciences, 99, 6567–6572.

Vines, S. (2000), "Simple Principal Components," Applied Statistics, 49, 441–451.

Zou, H., and Hastie, T. (2005), "Regularization and Variable Selection via the Elastic Net," Journal of the Royal Statistical Society, Series B, 67, 301–320.

Page 12: Sparse Principal Component Analysis - Stanford …web.stanford.edu/~hastie/Papers/spc_jcgs.pdf · Sparse Principal Component Analysis HuiZ OU,Trevor H ASTIE and RobertT IBSHIRANI

276 H ZOU T HASTIE AND R TIBSHIRANI

Table 1 Pitprops Data Loadings of the First Six Principal Components

Variable PC1 PC2 PC3 PC4 PC5 PC6

topdiam minus0404 0218 minus0207 0091 minus0083 0120length minus0406 0186 minus0235 0103 minus0113 0163moist minus0124 0541 0141 minus0078 0350 minus0276testsg minus0173 0456 0352 minus0055 0356 minus0054ovensg minus0057 minus0170 0481 minus0049 0176 0626ringtop minus0284 minus0014 0475 0063 minus0316 0052ringbut minus0400 minus0190 0253 0065 minus0215 0003bowmax minus0294 minus0189 minus0243 minus0286 0185 minus0055bowdist minus0357 0017 minus0208 minus0097 minus0106 0034whorls minus0379 minus0248 minus0119 0205 0156 minus0173clear 0011 0205 minus0070 minus0804 minus0343 0175knots 0115 0343 0092 0301 minus0600 minus0170diaknot 0113 0309 minus0326 0303 0080 0626

Variance () 324 183 144 85 70 63Cumulative variance () 324 507 651 736 806 869

Table 2 Pitprops Data Loadings of the First Six Modified PCs by SCoTLASS Empty cells have zeroloadings

t = 175Variable PC1 PC2 PC3 PC4 PC5 PC6

topdiam 0664 minus0025 0002 minus0035length 0683 minus0001 minus0040 0001 minus0018moist 0641 0195 0180 minus0030testsg 0701 0001 minus0001ovensg minus0887 minus0056ringtop 0293 minus0186 minus0373 0044ringbut 0001 0107 minus0658 minus0051 0064bowmax 0001 0735 0021 minus0168bowdist 0283 minus0001whorls 0113 minus0001 0388 minus0017 0320clear minus0923knots 0001 minus0554 0016 0004diaknot 0703 0001 minus0197 0080

Number of nonzero loadings 6 6 6 6 10 13Variance () 196 160 131 131 92 90Adjusted variance () 196 138 124 80 71 84Cumulative adjusted variance () 196 334 458 538 609 693

SPARSE PRINCIPAL COMPONENT ANALYSIS 277

Figure 2 Pitprops data The sequences of sparse approximations to the first six principal components The curvesshow the percentage of explained variance (PEV) as a function of λ1 The vertical broken lines indicate the choiceof λ1 used in our SPCA analysis

278 H ZOU T HASTIE AND R TIBSHIRANI

Table 3 Pitprops Data Loadings of the First Six Sparse PCs by SPCA Empty cells have zero loadings

Variable PC1 PC2 PC3 PC4 PC5 PC6

topdiam minus0477length minus0476moist 0785testsg 0620ovensg 0177 0640ringtop 0589ringbut minus0250 0492bowmax minus0344 minus0021bowdist minus0416whorls minus0400clear minus1knots 0013 minus1diaknot minus0015 1

Number of nonzero loadings 7 4 4 1 1 1Variance () 280 144 150 77 77 77Adjusted variance () 280 140 133 74 68 62Cumulative adjusted variance () 280 420 553 627 695 758

thresholding has various drawbacks it may serve as the benchmark for testing sparse PCsmethods A variant of simple thresholding is soft-thresholding We found that when used inPCA soft-thresholding performs very similarly to simple thresholding Thus we omitted theresults of soft-thresholding in this article Both SCoTLASS and SPCA were compared withsimple thresholding Table 4 presents the loadings and the corresponding variance explainedby simple thresholding To make the comparisons fair we let the numbers of nonzeroloadings obtained by simple thresholding match the results of SCoTLASS and SPCA asshown in the top and bottom parts of Table 4 respectively In terms of variance it seemsthat simple thresholding is better than SCoTLASS and worse than SPCA Moreover thevariables with nonzero loadings by SPCA are different to that chosen by simple thresholdingfor the first three PCs while SCoTLASS seems to create a similar sparseness pattern assimple thresholding does especially in the leading PC

52 A SYNTHETIC EXAMPLE

Our synthetic example has three hidden factors

V1 sim N(0 290) V2 sim N(0 300)

V3 = minus03V1 + 0925V2 + ε ε sim N(0 1)

and

V1 V2 and ε are independent

Then 10 observable variables are constructed as follows

Xi = V1 + ε1i ε1

i sim N(0 1) i = 1 2 3 4

SPARSE PRINCIPAL COMPONENT ANALYSIS 279

Xi = V2 + ε2i ε2

i sim N(0 1) i = 5 6 7 8

Xi = V3 + ε3i ε3

i sim N(0 1) i = 9 10

εji are independent j = 1 2 3 i = 1 10

We used the exact covariance matrix of (X1 X10) to perform PCA SPCA and simplethresholding (in the population setting)

Table 4 Pitprops Data Loadings of the First Six Modified PCs by Simple Thresholding Empty cellshave zero loadings

Variable PC1 PC2 PC3 PC4 PC5 PC6

Simple thresholding vs SCoTLASS

topdiam minus0439 0240 0120length minus0441 0105 minus0114 0163moist 0596 0354 minus0276testsg 0503 0391 0360 minus0054ovensg 0534 0178 0626ringtop 0528 minus0320 0052ringbut minus0435 0281 minus0218 0003bowmax minus0319 minus0270 minus0291 0188 minus0055bowdist minus0388 0034whorls minus0412 minus0274 0209 0158 minus0173clear minus0819 minus0347 0175knots 0378 0307 minus0608 minus0170diaknot 0340 minus0362 0309 0626

Number of nonzero loadings 6 6 6 6 10 13Variance () 289 161 154 84 71 63Adjusted Variance () 289 161 139 82 69 62Cumulative adjusted variance () 289 450 589 671 740 802

Simple thresholding vs SPCA

topdiam minus0420length minus0422moist 0640testsg 0540 0425ovensg 0580ringtop minus0296 0573ringbut minus0416bowmax minus0305bowdist minus0370whorls minus0394clear minus1knots 0406 minus1diaknot 0365 minus0393 1

Number of nonzero loadings 7 4 4 1 1 1Variance () 307 148 136 77 77 77Adjusted variance () 307 147 111 76 52 36Cumulative adjusted variance () 307 454 565 641 683 719

280 H ZOU T HASTIE AND R TIBSHIRANI

Table 5 Results of the Simulation Example Loadings and Variance

PCA SPCA (λ = 0) Simple ThresholdingPC1 PC2 PC3 PC1 PC2 PC1 PC2

X1 0116 minus0478 minus0087 0 05 0 minus05X2 0116 minus0478 minus0087 0 05 0 minus05X3 0116 minus0478 minus0087 0 05 0 minus05X4 0116 minus0478 minus0087 0 05 0 minus05X5 minus0395 minus0145 0270 05 0 0 0X6 minus0395 minus0145 0270 05 0 0 0X7 minus0395 minus0145 0270 05 0 minus0497 0X8 minus0395 minus0145 0270 05 0 minus0497 0X9 minus0401 0010 minus0582 0 0 minus0503 0X10 minus0401 0010 minus0582 0 0 minus0503 0

AdjustedVariance () 600 396 008 409 395 388 386

The variance of the three underlying factors is 290 300 and 2838 respectively Thenumbers of variables associated with the three factors are 4 4 and 2 Therefore V2 andV1 are almost equally important and they are much more important than V3 The first twoPCs together explain 996 of the total variance These facts suggest that we only needto consider two derived variables with ldquocorrectrdquo sparse representations Ideally the firstderived variable should recover the factor V2 only using (X5 X6 X7 X8) and the secondderived variable should recover the factor V1 only using (X1 X2 X3 X4) In fact if wesequentially maximize the variance of the first two derived variables under the orthonormalconstraint while restricting the numbers of nonzero loadings to four then the first derivedvariable uniformly assigns nonzero loadings on (X5 X6 X7 X8) and the second derivedvariable uniformly assigns nonzero loadings on (X1 X2 X3 X4)

Both SPCA (λ = 0) and simple thresholding were carried out by using the oracleinformation that the ideal sparse representations use only four variables Table 5 summarizesthe comparison results Clearly SPCA correctly identifies the sets of important variables Infact SPCA delivers the ideal sparse representations of the first two principal componentsMathematically it is easy to show that if t = 2 is used SCoTLASS is also able to find thesame sparse solution In this example both SPCA and SCoTLASS produce the ideal sparsePCs which may be explained by the fact that both methods explicitly use the lasso penalty

In contrast simple thresholding incorrectly includes X9 X10 in the most importantvariables The variance explained by simple thresholding is also lower than that by SPCAalthough the relative difference is small (less than 5) Due to the high correlation betweenV2 and V3 variablesX9 X10 achieve loadings which are even higher than those of the trueimportant variables (X5 X6 X7 X8) Thus the truth is disguised by the high correlationOn the other hand simple thresholding correctly discovers the second factor because V1

has a low correlation with V3

53 RAMASWAMY DATA

An important task in microarray analysis is to find a set of genes which are biologically

SPARSE PRINCIPAL COMPONENT ANALYSIS 281

Figure 3 The sparse leading principal component percentage of explained variance versus sparsity Simplethresholding and SPCA have similar performances However there still exists consistent difference in the selectedgenes (the ones with nonzero loadings)

relevant to the outcome (eg tumor type or survival time) PCA (or SVD) has been a populartool for this purpose Many gene-clustering methods in the literature use PCA (or SVD) asa building block For example gene shaving (Hastie et al 2000) uses an iterative principalcomponent shaving algorithm to identify subsets of coherent genes Here we consideranother approach to gene selection through SPCA The idea is intuitive if the (sparse)principal component can explain a large part of the total variance of gene expression levelsthen the subset of genes representing the principal component are considered important

We illustrate the sparse PC selection method on Ramaswamyrsquos data (Ramaswamy et al2001) which has 16063 (p = 16063) genes and 144 (n = 144) samples Its first principalcomponent explains 46 of the total variance For microarray data like this it appearsthat SCoTLASS cannot be practically useful for finding sparse PCs We applied SPCA(λ = infin) to find the leading sparse PC A sequence of values for λ1 were used such thatthe number of nonzero loadings varied over a wide range As displayed in Figure 3 thepercentage of explained variance decreases at a slow rate as the sparsity increases As fewas 25 of these 16063 genes can sufficiently construct the leading principal componentwith an affordable loss of explained variance (from 46 to 40) Simple thresholdingwas also applied to this data It seems that when using the same number of genes simplethresholding always explains slightly higher variance than SPCA does Among the samenumber of selected genes by SPCA and simple thresholding there are about 2 differentgenes and this difference rate is quite consistent

282 H ZOU T HASTIE AND R TIBSHIRANI

6 DISCUSSION

It has been an interesting research topic for years to derive principal components withsparse loadings From a practical point of view a good method to achieve the sparsenessgoal should (at least) possess the following properties

bull Without any sparsity constraint the method should reduce to PCAbull It should be computationally efficient for both small p and big p databull It should avoid misidentifying the important variables

The often-used simple thresholding approach is not criterion based However thisinformal method seems to possess the first two of the desirable properties listed above Ifthe explained variance and sparsity are the only concerns simple thresholding is a reasonableapproach and it is extremely convenient We have shown that simple thresholding can workwell with gene expression arrays The serious problem with simple thresholding is that itcan misidentify the real important variables Nevertheless simple thresholding is regardedas a benchmark for any potentially better method

Using the lasso constraint in PCA SCoTLASS successfully derives sparse loadingsHowever SCoTLASS is not computationally efficient and it lacks a good rule to pick itstuning parameter In addition it is not feasible to apply SCoTLASS to gene expressionarrays where PCA is a quite popular tool

In this work we have developed SPCA using our SPCA criterion (312) This newcriterion gives exact PCA results when its sparsity (lasso) penalty term vanishes SPCAallows flexible control on the sparse structure of the resulting loadings Unified efficientalgorithms have been proposed to compute SPCA solutions for both regular multivariatedata and gene expression arrays As a principled procedure SPCA enjoys advantages inseveral aspects including computational efficiency high explained variance and an abilityin identifying important variables

Software in R for fitting the SPCA model (and elastic net models) is available in theCRAN contributed package elasticnet

APPENDIX PROOFS

Proof of Theorem 1 Using XT X = VD2VT and VT V = I we have

βridge =(XT X + λI

)minus1XT (XVi) = Vi

D2ii

D2ii + λ

(A1)

Hence v = Vi

Note that since Theorem 2 is a special case of Theorem 3 we will not prove it separatelyWe first provide a lemma

Lemma 1 Consider the ridge regression criterion

Cλ(β) = y minus Xβ2 + λβ2

SPARSE PRINCIPAL COMPONENT ANALYSIS 283

Then if β = arg minβ Cλ(β)

Cλ(β) = yT (I minus Sλ)y

where Sλ is the ridge operator

Sλ = X(XT X + λI)minus1XT

Proof of Lemma 1 Differentiating Cλ with respect to β we get that

minusXT (y minus Xβ) + λβ = 0

Premultiplication by βT and re-arrangement gives λβ2 = (y minus Xβ)T Xβ Since

y minus Xβ2 = (y minus Xβ)T y minus (y minus Xβ)T Xβ

Cλ(β) = (y minus Xβ)T y The result follows since the ldquofitted valuesrdquo Xβ = Sλy

Proof of Theorem 3 We use the notation introduced in Section 3 A = [α1 αk]and B = [β1 βk] Let

Cλ(AB) =nsum

i=1

xi minus ABT xi2 + λ

ksumj=1

βj2

As in (39) we have

nsumi=1

xi minus ABT xi2 = X minus XBAT 2 (A2)

= XAperp2 + XA minus XB2 (A3)

Hence with A fixed solving

arg minB

Cλ(AB)

is equivalent to solving the series of ridge regressions

arg minβjk

1

ksumj=1

Xαj minus Xβj2 + λβj2

It is easy to show that

B = (XT X + λI)minus1XT XA (A4)

and using Lemma 1 and (A2) we have that the partially optimized penalized criterion isgiven by

Cλ(A B) = XAperp2 + tr((XA)T (I minus Sλ)(XA)

) (A5)

284 H ZOU T HASTIE AND R TIBSHIRANI

Rearranging the terms we get

Cλ(A B) = tr(XT X) minus tr(AT XT SλXA) (A6)

which must be minimized with respect to A with AT A = I Hence A should be taken tobe the largest k eigenvectors of XT SλX If the SVD of X = UDVT it is easy to show thatXT SλX = VD2(D2 + λI)minus1D2VT hence A = V[ 1 k] Likewise plugging the SVD ofX into (A4) we see that each of the βj are scaled elements of the corresponding Vj

Proof of Theorem 4 We expand the matrix norm

M minus NAT 2 = tr(MT M) minus 2tr(MT NAT ) + tr(ANT NAT ) (A7)

Since AT A = I the last term is equal to tr(NT N) and hence we need to maximize (minushalf) the middle term With the SVD MT N = UDVT this middle term becomes

tr(MT NAT ) = tr(UDVT AT ) (A8)

= tr(UDAlowastT ) (A9)

= tr(AlowastT UD) (A10)

where Alowast = AV and since V is ktimes k orthonormal AlowastT Alowast = I Now since D is diagonal(A10) is maximized when the diagonal of AlowastT U is positive and maximum By Cauchy-Schwartz inequality this is achieved when Alowast = U in which case the diagonal elementsare all 1 Hence A = UVT

Proof of Theorem 5 Let Blowast

= [βlowast1 β

lowastk ] with βlowast = (1+λ)β then Vi(λ) = βlowast

i

||βlowasti||

On the other hand β = βlowast

1+λ means that

(A Blowast) = arg min

A BCλλ1(AB) subject to AT A = Iktimesk (A11)

where

Cλλ1(AB) =nsum

i=1

xi minus ABT

1 + λxi2 + λ

ksumj=1

βj

1 + λ2 +

ksumj=1

λ1j βj

1 + λ1 (A12)

Since

ksumj=1

βj

1 + λ2 =

1(1 + λ)2

tr(BT B)

and

nsumi=1

xi minus ABT

1 + λxi2 = tr

((X minus X

B1 + λ

AT )T (X minus XB

1 + λAT )

)= tr

(XT X

)+

1(1 + λ)2

tr(BT XT XB

)minus 21 + λ

tr(

AT XT XB)

SPARSE PRINCIPAL COMPONENT ANALYSIS 285

Thus we have

Cλλ1(AB) = tr(XT X

)+

11 + λ

tr

(BT XT X + λI

1 + λB

)minus 2Tr

(AT XT XB

)+

ksumj=1

λ1jβj1

which implies that

(A Blowast) = arg min

A Btr

(BT XT X + λI

1 + λB

)minus 2tr

(AT XT XB

)+

ksumj=1

λ1jβj1 (A13)

subject to AT A = Iktimesk As λ rarr infin tr(

BT XT X+λI1+λ B

)rarr tr(BT B) =

sumkj=1 βj2

Thus (A13) approaches (41) and the conclusion of Theorem 5 follows

ACKNOWLEDGMENTSWe thank the editor an associate editor and referees for helpful comments and suggestions which greatly

improved the manuscript

[Received April 2004 Revised June 2005]

REFERENCES

Alter O Brown P and Botstein D (2000) ldquoSingular Value Decomposition for Genome-Wide Expression DataProcessing and Modelingrdquo in Proceedings of the National Academy of Sciences 97 pp 10101ndash10106

Cadima J and Jolliffe I (1995) ldquoLoadings and Correlations in the Interpretation of Principal ComponentsrdquoJournal of Applied Statistics 22 203ndash214

Efron B Hastie T Johnstone I and Tibshirani R (2004) ldquoLeast Angle Regressionrdquo The Annals of Statistics32 407ndash499

Hancock P Burton A and Bruce V (1996) ldquoFace Processing Human Perception and Principal ComponentsAnalysisrdquo Memory and Cognition 24 26ndash40

Hastie T Tibshirani R and Friedman J (2001) The Elements of Statistical Learning Data mining Inferenceand Prediction New York Springer Verlag

Hastie T Tibshirani R Eisen M Brown P Ross D Scherf U Weinstein J Alizadeh A Staudt Land Botstein D (2000) ldquo lsquogene Shavingrsquo as a Method for Identifying Distinct Sets of Genes With SimilarExpression Patternsrdquo Genome Biology 1 1ndash21

Jeffers J (1967) ldquoTwo Case Studies in the Application of Principal Componentrdquo Applied Statistics 16 225ndash236

Jolliffe I (1986) Principal Component Analysis New York Springer Verlag

(1995) ldquoRotation of Principal Components Choice of Normalization Constraintsrdquo Journal of AppliedStatistics 22 29ndash35

Jolliffe I T Trendafilov N T and Uddin M (2003) ldquoA Modified Principal Component Technique Based onthe Lassordquo Journal of Computational and Graphical Statistics 12 531ndash547

Mardia K Kent J and Bibby J (1979) Multivariate Analysis New York Academic Press

McCabe G (1984) ldquoPrincipal Variablesrdquo Technometrics 26 137ndash144

286 H ZOU T HASTIE AND R TIBSHIRANI

Osborne M R Presnell B and Turlach B A (2000) ldquoA New Approach to Variable Selection in Least SquaresProblemsrdquo IMA Journal of Numerical Analysis 20 389ndash403

Ramaswamy S Tamayo P Rifkin R Mukheriee S Yeang C Angelo M Ladd C Reich M LatulippeE Mesirov J Poggio T Gerald W Loda M Lander E and Golub T (2001) ldquoMulticlass CancerDiagnosis using Tumor Gene Expression Signaturerdquo in Proceedings of the National Academy of Sciences98 pp 15149ndash15154

Tibshirani R (1996) ldquoRegression Shrinkage and Selection via the Lassordquo Journal of the Royal Statistical SocietySeries B 58 267ndash288

Tibshirani R Hastie T Narasimhan B and Chu G (2002) ldquoDiagnosis of Multiple Cancer Types by ShrunkenCentroids of Generdquo in Proceedings of the National Academy of Sciences 99 6567ndash6572

Vines S (2000) ldquoSimple Principal Componentsrdquo Applied Statistics 49 441ndash451

Zou H and Hastie T (2005) ldquoRegularization and Variable Selection via the Elastic Netrdquo Journal of the RoyalStatistical Society Series B 67 301ndash320

Page 13: Sparse Principal Component Analysis - Stanford …web.stanford.edu/~hastie/Papers/spc_jcgs.pdf · Sparse Principal Component Analysis HuiZ OU,Trevor H ASTIE and RobertT IBSHIRANI

SPARSE PRINCIPAL COMPONENT ANALYSIS 277

Figure 2 Pitprops data The sequences of sparse approximations to the first six principal components The curvesshow the percentage of explained variance (PEV) as a function of λ1 The vertical broken lines indicate the choiceof λ1 used in our SPCA analysis

278 H ZOU T HASTIE AND R TIBSHIRANI

Table 3 Pitprops Data Loadings of the First Six Sparse PCs by SPCA Empty cells have zero loadings

Variable PC1 PC2 PC3 PC4 PC5 PC6

topdiam minus0477length minus0476moist 0785testsg 0620ovensg 0177 0640ringtop 0589ringbut minus0250 0492bowmax minus0344 minus0021bowdist minus0416whorls minus0400clear minus1knots 0013 minus1diaknot minus0015 1

Number of nonzero loadings 7 4 4 1 1 1Variance () 280 144 150 77 77 77Adjusted variance () 280 140 133 74 68 62Cumulative adjusted variance () 280 420 553 627 695 758

thresholding has various drawbacks it may serve as the benchmark for testing sparse PCsmethods A variant of simple thresholding is soft-thresholding We found that when used inPCA soft-thresholding performs very similarly to simple thresholding Thus we omitted theresults of soft-thresholding in this article Both SCoTLASS and SPCA were compared withsimple thresholding Table 4 presents the loadings and the corresponding variance explainedby simple thresholding To make the comparisons fair we let the numbers of nonzeroloadings obtained by simple thresholding match the results of SCoTLASS and SPCA asshown in the top and bottom parts of Table 4 respectively In terms of variance it seemsthat simple thresholding is better than SCoTLASS and worse than SPCA Moreover thevariables with nonzero loadings by SPCA are different to that chosen by simple thresholdingfor the first three PCs while SCoTLASS seems to create a similar sparseness pattern assimple thresholding does especially in the leading PC

52 A SYNTHETIC EXAMPLE

Our synthetic example has three hidden factors

V1 sim N(0 290) V2 sim N(0 300)

V3 = minus03V1 + 0925V2 + ε ε sim N(0 1)

and

V1 V2 and ε are independent

Then 10 observable variables are constructed as follows

Xi = V1 + ε1i ε1

i sim N(0 1) i = 1 2 3 4

SPARSE PRINCIPAL COMPONENT ANALYSIS 279

Xi = V2 + ε2i ε2

i sim N(0 1) i = 5 6 7 8

Xi = V3 + ε3i ε3

i sim N(0 1) i = 9 10

εji are independent j = 1 2 3 i = 1 10

We used the exact covariance matrix of (X1 X10) to perform PCA SPCA and simplethresholding (in the population setting)

Table 4 Pitprops Data Loadings of the First Six Modified PCs by Simple Thresholding Empty cellshave zero loadings

Variable PC1 PC2 PC3 PC4 PC5 PC6

Simple thresholding vs SCoTLASS

topdiam minus0439 0240 0120length minus0441 0105 minus0114 0163moist 0596 0354 minus0276testsg 0503 0391 0360 minus0054ovensg 0534 0178 0626ringtop 0528 minus0320 0052ringbut minus0435 0281 minus0218 0003bowmax minus0319 minus0270 minus0291 0188 minus0055bowdist minus0388 0034whorls minus0412 minus0274 0209 0158 minus0173clear minus0819 minus0347 0175knots 0378 0307 minus0608 minus0170diaknot 0340 minus0362 0309 0626

Number of nonzero loadings 6 6 6 6 10 13Variance () 289 161 154 84 71 63Adjusted Variance () 289 161 139 82 69 62Cumulative adjusted variance () 289 450 589 671 740 802

Simple thresholding vs SPCA

topdiam minus0420length minus0422moist 0640testsg 0540 0425ovensg 0580ringtop minus0296 0573ringbut minus0416bowmax minus0305bowdist minus0370whorls minus0394clear minus1knots 0406 minus1diaknot 0365 minus0393 1

Number of nonzero loadings 7 4 4 1 1 1Variance () 307 148 136 77 77 77Adjusted variance () 307 147 111 76 52 36Cumulative adjusted variance () 307 454 565 641 683 719

280 H ZOU T HASTIE AND R TIBSHIRANI

Table 5 Results of the Simulation Example Loadings and Variance

PCA SPCA (λ = 0) Simple ThresholdingPC1 PC2 PC3 PC1 PC2 PC1 PC2

X1 0116 minus0478 minus0087 0 05 0 minus05X2 0116 minus0478 minus0087 0 05 0 minus05X3 0116 minus0478 minus0087 0 05 0 minus05X4 0116 minus0478 minus0087 0 05 0 minus05X5 minus0395 minus0145 0270 05 0 0 0X6 minus0395 minus0145 0270 05 0 0 0X7 minus0395 minus0145 0270 05 0 minus0497 0X8 minus0395 minus0145 0270 05 0 minus0497 0X9 minus0401 0010 minus0582 0 0 minus0503 0X10 minus0401 0010 minus0582 0 0 minus0503 0

AdjustedVariance () 600 396 008 409 395 388 386

The variance of the three underlying factors is 290 300 and 2838 respectively Thenumbers of variables associated with the three factors are 4 4 and 2 Therefore V2 andV1 are almost equally important and they are much more important than V3 The first twoPCs together explain 996 of the total variance These facts suggest that we only needto consider two derived variables with ldquocorrectrdquo sparse representations Ideally the firstderived variable should recover the factor V2 only using (X5 X6 X7 X8) and the secondderived variable should recover the factor V1 only using (X1 X2 X3 X4) In fact if wesequentially maximize the variance of the first two derived variables under the orthonormalconstraint while restricting the numbers of nonzero loadings to four then the first derivedvariable uniformly assigns nonzero loadings on (X5 X6 X7 X8) and the second derivedvariable uniformly assigns nonzero loadings on (X1 X2 X3 X4)

Both SPCA (λ = 0) and simple thresholding were carried out by using the oracleinformation that the ideal sparse representations use only four variables Table 5 summarizesthe comparison results Clearly SPCA correctly identifies the sets of important variables Infact SPCA delivers the ideal sparse representations of the first two principal componentsMathematically it is easy to show that if t = 2 is used SCoTLASS is also able to find thesame sparse solution In this example both SPCA and SCoTLASS produce the ideal sparsePCs which may be explained by the fact that both methods explicitly use the lasso penalty

In contrast simple thresholding incorrectly includes X9 X10 in the most importantvariables The variance explained by simple thresholding is also lower than that by SPCAalthough the relative difference is small (less than 5) Due to the high correlation betweenV2 and V3 variablesX9 X10 achieve loadings which are even higher than those of the trueimportant variables (X5 X6 X7 X8) Thus the truth is disguised by the high correlationOn the other hand simple thresholding correctly discovers the second factor because V1

has a low correlation with V3

53 RAMASWAMY DATA

An important task in microarray analysis is to find a set of genes which are biologically

SPARSE PRINCIPAL COMPONENT ANALYSIS 281

Figure 3 The sparse leading principal component percentage of explained variance versus sparsity Simplethresholding and SPCA have similar performances However there still exists consistent difference in the selectedgenes (the ones with nonzero loadings)

relevant to the outcome (eg tumor type or survival time) PCA (or SVD) has been a populartool for this purpose Many gene-clustering methods in the literature use PCA (or SVD) asa building block For example gene shaving (Hastie et al 2000) uses an iterative principalcomponent shaving algorithm to identify subsets of coherent genes Here we consideranother approach to gene selection through SPCA The idea is intuitive if the (sparse)principal component can explain a large part of the total variance of gene expression levelsthen the subset of genes representing the principal component are considered important

We illustrate the sparse PC selection method on Ramaswamyrsquos data (Ramaswamy et al2001) which has 16063 (p = 16063) genes and 144 (n = 144) samples Its first principalcomponent explains 46 of the total variance For microarray data like this it appearsthat SCoTLASS cannot be practically useful for finding sparse PCs We applied SPCA(λ = infin) to find the leading sparse PC A sequence of values for λ1 were used such thatthe number of nonzero loadings varied over a wide range As displayed in Figure 3 thepercentage of explained variance decreases at a slow rate as the sparsity increases As fewas 25 of these 16063 genes can sufficiently construct the leading principal componentwith an affordable loss of explained variance (from 46 to 40) Simple thresholdingwas also applied to this data It seems that when using the same number of genes simplethresholding always explains slightly higher variance than SPCA does Among the samenumber of selected genes by SPCA and simple thresholding there are about 2 differentgenes and this difference rate is quite consistent

282 H ZOU T HASTIE AND R TIBSHIRANI

6 DISCUSSION

It has been an interesting research topic for years to derive principal components withsparse loadings From a practical point of view a good method to achieve the sparsenessgoal should (at least) possess the following properties

bull Without any sparsity constraint the method should reduce to PCAbull It should be computationally efficient for both small p and big p databull It should avoid misidentifying the important variables

The often-used simple thresholding approach is not criterion based However thisinformal method seems to possess the first two of the desirable properties listed above Ifthe explained variance and sparsity are the only concerns simple thresholding is a reasonableapproach and it is extremely convenient We have shown that simple thresholding can workwell with gene expression arrays The serious problem with simple thresholding is that itcan misidentify the real important variables Nevertheless simple thresholding is regardedas a benchmark for any potentially better method

Using the lasso constraint in PCA SCoTLASS successfully derives sparse loadingsHowever SCoTLASS is not computationally efficient and it lacks a good rule to pick itstuning parameter In addition it is not feasible to apply SCoTLASS to gene expressionarrays where PCA is a quite popular tool

In this work we have developed SPCA using our SPCA criterion (312) This newcriterion gives exact PCA results when its sparsity (lasso) penalty term vanishes SPCAallows flexible control on the sparse structure of the resulting loadings Unified efficientalgorithms have been proposed to compute SPCA solutions for both regular multivariatedata and gene expression arrays As a principled procedure SPCA enjoys advantages inseveral aspects including computational efficiency high explained variance and an abilityin identifying important variables

Software in R for fitting the SPCA model (and elastic net models) is available in theCRAN contributed package elasticnet

APPENDIX PROOFS

Proof of Theorem 1 Using XT X = VD2VT and VT V = I we have

βridge =(XT X + λI

)minus1XT (XVi) = Vi

D2ii

D2ii + λ

(A1)

Hence v = Vi

Note that since Theorem 2 is a special case of Theorem 3 we will not prove it separatelyWe first provide a lemma

Lemma 1 Consider the ridge regression criterion

Cλ(β) = y minus Xβ2 + λβ2

SPARSE PRINCIPAL COMPONENT ANALYSIS 283

Then if β = arg minβ Cλ(β)

Cλ(β) = yT (I minus Sλ)y

where Sλ is the ridge operator

Sλ = X(XT X + λI)minus1XT

Proof of Lemma 1 Differentiating Cλ with respect to β we get that

minusXT (y minus Xβ) + λβ = 0

Premultiplication by βT and re-arrangement gives λβ2 = (y minus Xβ)T Xβ Since

y minus Xβ2 = (y minus Xβ)T y minus (y minus Xβ)T Xβ

Cλ(β) = (y minus Xβ)T y The result follows since the ldquofitted valuesrdquo Xβ = Sλy

Proof of Theorem 3 We use the notation introduced in Section 3 A = [α1 αk]and B = [β1 βk] Let

Cλ(AB) =nsum

i=1

xi minus ABT xi2 + λ

ksumj=1

βj2

As in (39) we have

nsumi=1

xi minus ABT xi2 = X minus XBAT 2 (A2)

= XAperp2 + XA minus XB2 (A3)

Hence with A fixed solving

arg minB

Cλ(AB)

is equivalent to solving the series of ridge regressions

arg minβjk

1

ksumj=1

Xαj minus Xβj2 + λβj2

It is easy to show that

B = (XT X + λI)minus1XT XA (A4)

and using Lemma 1 and (A2) we have that the partially optimized penalized criterion isgiven by

Cλ(A B) = XAperp2 + tr((XA)T (I minus Sλ)(XA)

) (A5)

284 H ZOU T HASTIE AND R TIBSHIRANI

Rearranging the terms we get

Cλ(A B) = tr(XT X) minus tr(AT XT SλXA) (A6)

which must be minimized with respect to A with AT A = I Hence A should be taken tobe the largest k eigenvectors of XT SλX If the SVD of X = UDVT it is easy to show thatXT SλX = VD2(D2 + λI)minus1D2VT hence A = V[ 1 k] Likewise plugging the SVD ofX into (A4) we see that each of the βj are scaled elements of the corresponding Vj

Proof of Theorem 4 We expand the matrix norm

M minus NAT 2 = tr(MT M) minus 2tr(MT NAT ) + tr(ANT NAT ) (A7)

Since AT A = I the last term is equal to tr(NT N) and hence we need to maximize (minushalf) the middle term With the SVD MT N = UDVT this middle term becomes

tr(MT NAT ) = tr(UDVT AT ) (A8)

= tr(UDAlowastT ) (A9)

= tr(AlowastT UD) (A10)

where Alowast = AV and since V is ktimes k orthonormal AlowastT Alowast = I Now since D is diagonal(A10) is maximized when the diagonal of AlowastT U is positive and maximum By Cauchy-Schwartz inequality this is achieved when Alowast = U in which case the diagonal elementsare all 1 Hence A = UVT

Proof of Theorem 5 Let Blowast

= [βlowast1 β

lowastk ] with βlowast = (1+λ)β then Vi(λ) = βlowast

i

||βlowasti||

On the other hand β = βlowast

1+λ means that

(A Blowast) = arg min

A BCλλ1(AB) subject to AT A = Iktimesk (A11)

where

Cλλ1(AB) =nsum

i=1

xi minus ABT

1 + λxi2 + λ

ksumj=1

βj

1 + λ2 +

ksumj=1

λ1j βj

1 + λ1 (A12)

Since

ksumj=1

βj

1 + λ2 =

1(1 + λ)2

tr(BT B)

and

nsumi=1

xi minus ABT

1 + λxi2 = tr

((X minus X

B1 + λ

AT )T (X minus XB

1 + λAT )

)= tr

(XT X

)+

1(1 + λ)2

tr(BT XT XB

)minus 21 + λ

tr(

AT XT XB)

SPARSE PRINCIPAL COMPONENT ANALYSIS 285

Thus we have

Cλλ1(AB) = tr(XT X

)+

11 + λ

tr

(BT XT X + λI

1 + λB

)minus 2Tr

(AT XT XB

)+

ksumj=1

λ1jβj1

which implies that

(A Blowast) = arg min

A Btr

(BT XT X + λI

1 + λB

)minus 2tr

(AT XT XB

)+

ksumj=1

λ1jβj1 (A13)

subject to AT A = Iktimesk As λ rarr infin tr(

BT XT X+λI1+λ B

)rarr tr(BT B) =

sumkj=1 βj2

Thus (A13) approaches (41) and the conclusion of Theorem 5 follows

ACKNOWLEDGMENTSWe thank the editor an associate editor and referees for helpful comments and suggestions which greatly

improved the manuscript

[Received April 2004 Revised June 2005]

REFERENCES

Alter O Brown P and Botstein D (2000) ldquoSingular Value Decomposition for Genome-Wide Expression DataProcessing and Modelingrdquo in Proceedings of the National Academy of Sciences 97 pp 10101ndash10106

Cadima J and Jolliffe I (1995) ldquoLoadings and Correlations in the Interpretation of Principal ComponentsrdquoJournal of Applied Statistics 22 203ndash214

Efron B Hastie T Johnstone I and Tibshirani R (2004) ldquoLeast Angle Regressionrdquo The Annals of Statistics32 407ndash499

Hancock P Burton A and Bruce V (1996) ldquoFace Processing Human Perception and Principal ComponentsAnalysisrdquo Memory and Cognition 24 26ndash40

Hastie T Tibshirani R and Friedman J (2001) The Elements of Statistical Learning Data mining Inferenceand Prediction New York Springer Verlag

Hastie T Tibshirani R Eisen M Brown P Ross D Scherf U Weinstein J Alizadeh A Staudt Land Botstein D (2000) ldquo lsquogene Shavingrsquo as a Method for Identifying Distinct Sets of Genes With SimilarExpression Patternsrdquo Genome Biology 1 1ndash21

Jeffers J (1967) ldquoTwo Case Studies in the Application of Principal Componentrdquo Applied Statistics 16 225ndash236

Jolliffe I (1986) Principal Component Analysis New York Springer Verlag

(1995) ldquoRotation of Principal Components Choice of Normalization Constraintsrdquo Journal of AppliedStatistics 22 29ndash35

Jolliffe I T Trendafilov N T and Uddin M (2003) ldquoA Modified Principal Component Technique Based onthe Lassordquo Journal of Computational and Graphical Statistics 12 531ndash547

Mardia K Kent J and Bibby J (1979) Multivariate Analysis New York Academic Press

McCabe G (1984) ldquoPrincipal Variablesrdquo Technometrics 26 137ndash144

286 H ZOU T HASTIE AND R TIBSHIRANI

Osborne M R Presnell B and Turlach B A (2000) ldquoA New Approach to Variable Selection in Least SquaresProblemsrdquo IMA Journal of Numerical Analysis 20 389ndash403

Ramaswamy S Tamayo P Rifkin R Mukheriee S Yeang C Angelo M Ladd C Reich M LatulippeE Mesirov J Poggio T Gerald W Loda M Lander E and Golub T (2001) ldquoMulticlass CancerDiagnosis using Tumor Gene Expression Signaturerdquo in Proceedings of the National Academy of Sciences98 pp 15149ndash15154

Tibshirani R (1996) ldquoRegression Shrinkage and Selection via the Lassordquo Journal of the Royal Statistical SocietySeries B 58 267ndash288

Tibshirani R Hastie T Narasimhan B and Chu G (2002) ldquoDiagnosis of Multiple Cancer Types by ShrunkenCentroids of Generdquo in Proceedings of the National Academy of Sciences 99 6567ndash6572

Vines S (2000) ldquoSimple Principal Componentsrdquo Applied Statistics 49 441ndash451

Zou H and Hastie T (2005) ldquoRegularization and Variable Selection via the Elastic Netrdquo Journal of the RoyalStatistical Society Series B 67 301ndash320

Page 14: Sparse Principal Component Analysis - Stanford …web.stanford.edu/~hastie/Papers/spc_jcgs.pdf · Sparse Principal Component Analysis HuiZ OU,Trevor H ASTIE and RobertT IBSHIRANI

278 H ZOU T HASTIE AND R TIBSHIRANI

Table 3 Pitprops Data Loadings of the First Six Sparse PCs by SPCA Empty cells have zero loadings

Variable PC1 PC2 PC3 PC4 PC5 PC6

topdiam minus0477length minus0476moist 0785testsg 0620ovensg 0177 0640ringtop 0589ringbut minus0250 0492bowmax minus0344 minus0021bowdist minus0416whorls minus0400clear minus1knots 0013 minus1diaknot minus0015 1

Number of nonzero loadings 7 4 4 1 1 1Variance () 280 144 150 77 77 77Adjusted variance () 280 140 133 74 68 62Cumulative adjusted variance () 280 420 553 627 695 758

thresholding has various drawbacks it may serve as the benchmark for testing sparse PCsmethods A variant of simple thresholding is soft-thresholding We found that when used inPCA soft-thresholding performs very similarly to simple thresholding Thus we omitted theresults of soft-thresholding in this article Both SCoTLASS and SPCA were compared withsimple thresholding Table 4 presents the loadings and the corresponding variance explainedby simple thresholding To make the comparisons fair we let the numbers of nonzeroloadings obtained by simple thresholding match the results of SCoTLASS and SPCA asshown in the top and bottom parts of Table 4 respectively In terms of variance it seemsthat simple thresholding is better than SCoTLASS and worse than SPCA Moreover thevariables with nonzero loadings by SPCA are different to that chosen by simple thresholdingfor the first three PCs while SCoTLASS seems to create a similar sparseness pattern assimple thresholding does especially in the leading PC

52 A SYNTHETIC EXAMPLE

Our synthetic example has three hidden factors

V1 sim N(0 290) V2 sim N(0 300)

V3 = minus03V1 + 0925V2 + ε ε sim N(0 1)

and

V1 V2 and ε are independent

Then 10 observable variables are constructed as follows

Xi = V1 + ε1i ε1

i sim N(0 1) i = 1 2 3 4

SPARSE PRINCIPAL COMPONENT ANALYSIS 279

Xi = V2 + ε2i ε2

i sim N(0 1) i = 5 6 7 8

Xi = V3 + ε3i ε3

i sim N(0 1) i = 9 10

εji are independent j = 1 2 3 i = 1 10

We used the exact covariance matrix of (X1 X10) to perform PCA SPCA and simplethresholding (in the population setting)

Table 4 Pitprops Data Loadings of the First Six Modified PCs by Simple Thresholding Empty cellshave zero loadings

Variable PC1 PC2 PC3 PC4 PC5 PC6

Simple thresholding vs SCoTLASS

topdiam minus0439 0240 0120length minus0441 0105 minus0114 0163moist 0596 0354 minus0276testsg 0503 0391 0360 minus0054ovensg 0534 0178 0626ringtop 0528 minus0320 0052ringbut minus0435 0281 minus0218 0003bowmax minus0319 minus0270 minus0291 0188 minus0055bowdist minus0388 0034whorls minus0412 minus0274 0209 0158 minus0173clear minus0819 minus0347 0175knots 0378 0307 minus0608 minus0170diaknot 0340 minus0362 0309 0626

Number of nonzero loadings 6 6 6 6 10 13Variance () 289 161 154 84 71 63Adjusted Variance () 289 161 139 82 69 62Cumulative adjusted variance () 289 450 589 671 740 802

Simple thresholding vs SPCA

topdiam minus0420length minus0422moist 0640testsg 0540 0425ovensg 0580ringtop minus0296 0573ringbut minus0416bowmax minus0305bowdist minus0370whorls minus0394clear minus1knots 0406 minus1diaknot 0365 minus0393 1

Number of nonzero loadings 7 4 4 1 1 1Variance () 307 148 136 77 77 77Adjusted variance () 307 147 111 76 52 36Cumulative adjusted variance () 307 454 565 641 683 719

280 H ZOU T HASTIE AND R TIBSHIRANI

Table 5 Results of the Simulation Example Loadings and Variance

PCA SPCA (λ = 0) Simple ThresholdingPC1 PC2 PC3 PC1 PC2 PC1 PC2

X1 0116 minus0478 minus0087 0 05 0 minus05X2 0116 minus0478 minus0087 0 05 0 minus05X3 0116 minus0478 minus0087 0 05 0 minus05X4 0116 minus0478 minus0087 0 05 0 minus05X5 minus0395 minus0145 0270 05 0 0 0X6 minus0395 minus0145 0270 05 0 0 0X7 minus0395 minus0145 0270 05 0 minus0497 0X8 minus0395 minus0145 0270 05 0 minus0497 0X9 minus0401 0010 minus0582 0 0 minus0503 0X10 minus0401 0010 minus0582 0 0 minus0503 0

AdjustedVariance () 600 396 008 409 395 388 386

The variance of the three underlying factors is 290 300 and 2838 respectively Thenumbers of variables associated with the three factors are 4 4 and 2 Therefore V2 andV1 are almost equally important and they are much more important than V3 The first twoPCs together explain 996 of the total variance These facts suggest that we only needto consider two derived variables with ldquocorrectrdquo sparse representations Ideally the firstderived variable should recover the factor V2 only using (X5 X6 X7 X8) and the secondderived variable should recover the factor V1 only using (X1 X2 X3 X4) In fact if wesequentially maximize the variance of the first two derived variables under the orthonormalconstraint while restricting the numbers of nonzero loadings to four then the first derivedvariable uniformly assigns nonzero loadings on (X5 X6 X7 X8) and the second derivedvariable uniformly assigns nonzero loadings on (X1 X2 X3 X4)

Both SPCA (λ = 0) and simple thresholding were carried out by using the oracleinformation that the ideal sparse representations use only four variables Table 5 summarizesthe comparison results Clearly SPCA correctly identifies the sets of important variables Infact SPCA delivers the ideal sparse representations of the first two principal componentsMathematically it is easy to show that if t = 2 is used SCoTLASS is also able to find thesame sparse solution In this example both SPCA and SCoTLASS produce the ideal sparsePCs which may be explained by the fact that both methods explicitly use the lasso penalty

In contrast, simple thresholding incorrectly includes X9, X10 in the most important variables. The variance explained by simple thresholding is also lower than that by SPCA, although the relative difference is small (less than 5%). Due to the high correlation between V2 and V3, variables X9, X10 achieve loadings which are even higher than those of the true important variables (X5, X6, X7, X8). Thus the truth is disguised by the high correlation. On the other hand, simple thresholding correctly discovers the second factor, because V1 has a low correlation with V3.

5.3 RAMASWAMY DATA

An important task in microarray analysis is to find a set of genes which are biologically relevant to the outcome (e.g., tumor type or survival time). PCA (or SVD) has been a popular tool for this purpose. Many gene-clustering methods in the literature use PCA (or SVD) as a building block. For example, gene shaving (Hastie et al. 2000) uses an iterative principal component shaving algorithm to identify subsets of coherent genes. Here we consider another approach to gene selection through SPCA. The idea is intuitive: if the (sparse) principal component can explain a large part of the total variance of gene expression levels, then the subset of genes representing the principal component are considered important.

Figure 3. The sparse leading principal component: percentage of explained variance versus sparsity. Simple thresholding and SPCA have similar performance; however, there still exists a consistent difference in the selected genes (the ones with nonzero loadings).

We illustrate the sparse PC selection method on Ramaswamy's data (Ramaswamy et al. 2001), which has 16,063 (p = 16,063) genes and 144 (n = 144) samples. Its first principal component explains 46% of the total variance. For microarray data like this, it appears that SCoTLASS cannot be practically useful for finding sparse PCs. We applied SPCA (λ = ∞) to find the leading sparse PC. A sequence of values for λ1 were used such that the number of nonzero loadings varied over a wide range. As displayed in Figure 3, the percentage of explained variance decreases at a slow rate as the sparsity increases. As few as 2.5% of these 16,063 genes can sufficiently construct the leading principal component with an affordable loss of explained variance (from 46% to 40%). Simple thresholding was also applied to this data. It seems that, when using the same number of genes, simple thresholding always explains slightly higher variance than SPCA does. Among the same number of selected genes by SPCA and simple thresholding, there are about 2% different genes, and this difference rate is quite consistent.
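For a single sparse leading component v (normalized to unit length), the quantity plotted in Figure 3 is simply the variance of Xv as a fraction of the total variance. The simple-thresholding curve, for instance, can be traced as in the following sketch, where `X` is assumed to be the centered 144 × 16,063 expression matrix, `v1` the ordinary leading loading vector, and the grid of sparsity levels is illustrative:

```r
## Explained variance versus sparsity for the thresholded leading component.
total_var <- sum(X^2)
nnz_grid  <- c(50, 100, 250, 500, 1000, 2500, 5000, 10000, 16063)

explained <- sapply(nnz_grid, function(m) {
  v <- v1
  v[rank(-abs(v)) > m] <- 0          # keep only the m largest |loadings|
  v <- v / sqrt(sum(v^2))            # renormalize to unit length
  sum((X %*% v)^2) / total_var       # fraction of total variance explained
})
round(100 * explained, 1)
```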


6. DISCUSSION

It has been an interesting research topic for years to derive principal components with sparse loadings. From a practical point of view, a good method to achieve the sparseness goal should (at least) possess the following properties:

• Without any sparsity constraint, the method should reduce to PCA.
• It should be computationally efficient for both small p and big p data.
• It should avoid misidentifying the important variables.

The often-used simple thresholding approach is not criterion based. However, this informal method seems to possess the first two of the desirable properties listed above. If the explained variance and sparsity are the only concerns, simple thresholding is a reasonable approach, and it is extremely convenient. We have shown that simple thresholding can work well with gene expression arrays. The serious problem with simple thresholding is that it can misidentify the real important variables. Nevertheless, simple thresholding is regarded as a benchmark for any potentially better method.

Using the lasso constraint in PCA, SCoTLASS successfully derives sparse loadings. However, SCoTLASS is not computationally efficient, and it lacks a good rule to pick its tuning parameter. In addition, it is not feasible to apply SCoTLASS to gene expression arrays, where PCA is a quite popular tool.

In this work we have developed SPCA using our SPCA criterion (3.12). This new criterion gives exact PCA results when its sparsity (lasso) penalty term vanishes. SPCA allows flexible control on the sparse structure of the resulting loadings. Unified efficient algorithms have been proposed to compute SPCA solutions for both regular multivariate data and gene expression arrays. As a principled procedure, SPCA enjoys advantages in several aspects, including computational efficiency, high explained variance, and an ability to identify important variables.

Software in R for fitting the SPCA model (and elastic net models) is available in the CRAN contributed package elasticnet.
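For reference, a call might look like the sketch below; the function and argument names are recalled from the package documentation and should be checked against the installed version (here sparse = "varnum" is assumed to request a fixed number of nonzero loadings per component, as in Table 4):

```r
## Hypothetical usage of the elasticnet package (check ?spca for the exact interface).
library(elasticnet)
data(pitprops)                                   # correlation matrix of the pitprops data
fit <- spca(pitprops, K = 6, type = "Gram",      # Gram (correlation-matrix) input
            sparse = "varnum",                   # fix the number of nonzero loadings
            para = c(7, 4, 4, 1, 1, 1))
fit                                              # sparse loadings and explained variance
```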

APPENDIX: PROOFS

Proof of Theorem 1: Using $X^TX = VD^2V^T$ and $V^TV = I$, we have

$$\hat{\beta}_{\mathrm{ridge}} = \left(X^TX + \lambda I\right)^{-1} X^T (X V_i) = V_i \, \frac{D_{ii}^2}{D_{ii}^2 + \lambda}. \qquad (A.1)$$

Hence $\hat{v} = V_i$.
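A quick numerical illustration of (A.1), as a sketch with simulated data (X column-centered so that its SVD gives $X^TX = VD^2V^T$):

```r
## Ridge regression of X V[, i] on X returns V[, i] shrunk by D[i]^2 / (D[i]^2 + lambda).
set.seed(1)
X  <- scale(matrix(rnorm(50 * 5), 50, 5), center = TRUE, scale = FALSE)
sv <- svd(X); V <- sv$v; D <- sv$d
i  <- 1; lambda <- 10

beta_ridge <- solve(crossprod(X) + lambda * diag(5), crossprod(X, X %*% V[, i]))
max(abs(beta_ridge - V[, i] * D[i]^2 / (D[i]^2 + lambda)))   # ~ 0, confirming (A.1)
beta_ridge / sqrt(sum(beta_ridge^2))                         # after normalization: V[, i]
```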

Note that since Theorem 2 is a special case of Theorem 3, we will not prove it separately. We first provide a lemma.

Lemma 1. Consider the ridge regression criterion

$$C_\lambda(\beta) = \|y - X\beta\|^2 + \lambda\|\beta\|^2.$$

Then, if $\hat{\beta} = \arg\min_\beta C_\lambda(\beta)$,

$$C_\lambda(\hat{\beta}) = y^T(I - S_\lambda)y,$$

where $S_\lambda$ is the ridge operator

$$S_\lambda = X(X^TX + \lambda I)^{-1}X^T.$$

Proof of Lemma 1: Differentiating $C_\lambda$ with respect to $\beta$, we get that

$$-X^T(y - X\hat{\beta}) + \lambda\hat{\beta} = 0.$$

Premultiplication by $\hat{\beta}^T$ and re-arrangement gives $\lambda\|\hat{\beta}\|^2 = (y - X\hat{\beta})^T X\hat{\beta}$. Since

$$\|y - X\hat{\beta}\|^2 = (y - X\hat{\beta})^T y - (y - X\hat{\beta})^T X\hat{\beta},$$

$C_\lambda(\hat{\beta}) = (y - X\hat{\beta})^T y$. The result follows since the "fitted values" $X\hat{\beta} = S_\lambda y$.

Proof of Theorem 3: We use the notation introduced in Section 3: $A = [\alpha_1, \ldots, \alpha_k]$ and $B = [\beta_1, \ldots, \beta_k]$. Let

$$C_\lambda(A, B) = \sum_{i=1}^n \|x_i - AB^Tx_i\|^2 + \lambda\sum_{j=1}^k \|\beta_j\|^2.$$

As in (3.9), we have

$$\sum_{i=1}^n \|x_i - AB^Tx_i\|^2 = \|X - XBA^T\|^2 \qquad (A.2)$$
$$= \|XA_\perp\|^2 + \|XA - XB\|^2. \qquad (A.3)$$

Hence, with $A$ fixed, solving

$$\arg\min_B C_\lambda(A, B)$$

is equivalent to solving the series of ridge regressions

$$\arg\min_{\{\beta_j\}_1^k} \sum_{j=1}^k \left\{\|X\alpha_j - X\beta_j\|^2 + \lambda\|\beta_j\|^2\right\}.$$

It is easy to show that

$$\hat{B} = (X^TX + \lambda I)^{-1}X^TXA, \qquad (A.4)$$

and using Lemma 1 and (A.2) we have that the partially optimized penalized criterion is given by

$$C_\lambda(A, \hat{B}) = \|XA_\perp\|^2 + \mathrm{tr}\left((XA)^T(I - S_\lambda)(XA)\right). \qquad (A.5)$$

Rearranging the terms, we get

$$C_\lambda(A, \hat{B}) = \mathrm{tr}(X^TX) - \mathrm{tr}(A^TX^TS_\lambda XA), \qquad (A.6)$$

which must be minimized with respect to $A$ with $A^TA = I$. Hence $A$ should be taken to be the largest $k$ eigenvectors of $X^TS_\lambda X$. If the SVD of $X$ is $X = UDV^T$, it is easy to show that $X^TS_\lambda X = VD^2(D^2 + \lambda I)^{-1}D^2V^T$; hence $\hat{A} = V[\,, 1{:}k]$. Likewise, plugging the SVD of $X$ into (A.4), we see that each of the $\hat{\beta}_j$ is a scaled element of the corresponding $V_j$.
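Both facts used at the end of this proof are easy to verify numerically; a small sketch with simulated data and k = 2:

```r
## With A = V[, 1:k]: the ridge step (A.4) returns columns proportional to the
## corresponding columns of V, and the top k eigenvectors of X^T S_lambda X
## coincide with V[, 1:k] up to sign.
set.seed(2)
X <- scale(matrix(rnorm(40 * 6), 40, 6), center = TRUE, scale = FALSE)
V <- svd(X)$v; k <- 2; lambda <- 1

A <- V[, 1:k]
B <- solve(crossprod(X) + lambda * diag(6), crossprod(X) %*% A)   # (A.4)
c(cor(B[, 1], V[, 1]), cor(B[, 2], V[, 2]))                       # +1 or -1

S <- X %*% solve(crossprod(X) + lambda * diag(6), t(X))           # ridge operator S_lambda
M <- t(X) %*% S %*% X                                             # X^T S_lambda X
E <- eigen((M + t(M)) / 2, symmetric = TRUE)$vectors[, 1:k]
round(abs(crossprod(E, A)), 3)                                    # ~ identity matrix
```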

Proof of Theorem 4: We expand the matrix norm

$$\|M - NA^T\|^2 = \mathrm{tr}(M^TM) - 2\,\mathrm{tr}(M^TNA^T) + \mathrm{tr}(AN^TNA^T). \qquad (A.7)$$

Since $A^TA = I$, the last term is equal to $\mathrm{tr}(N^TN)$, and hence we need to maximize (minus half) the middle term. With the SVD $M^TN = UDV^T$, this middle term becomes

$$\mathrm{tr}(M^TNA^T) = \mathrm{tr}(UDV^TA^T) \qquad (A.8)$$
$$= \mathrm{tr}(UDA^{*T}) \qquad (A.9)$$
$$= \mathrm{tr}(A^{*T}UD), \qquad (A.10)$$

where $A^* = AV$, and since $V$ is $k \times k$ orthonormal, $A^{*T}A^* = I$. Now since $D$ is diagonal, (A.10) is maximized when the diagonal of $A^{*T}U$ is positive and maximum. By the Cauchy-Schwarz inequality, this is achieved when $A^* = U$, in which case the diagonal elements are all 1. Hence $\hat{A} = UV^T$.
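In the SPCA algorithm this theorem supplies the update of A for fixed B (with M = X and N = XB, so that M^T N = X^T XB). A minimal sketch of the step, using a hypothetical helper name:

```r
## Reduced-rank Procrustes rotation: A_hat = U V^T from the SVD M^T N = U D V^T.
procrustes_A <- function(M, N) {
  s <- svd(t(M) %*% N)
  s$u %*% t(s$v)
}

set.seed(3)
M <- matrix(rnorm(20 * 4), 20, 4)
N <- matrix(rnorm(20 * 4), 20, 4)
A_hat <- procrustes_A(M, N)
round(crossprod(A_hat), 6)                 # A_hat^T A_hat = I
sum(diag(t(M) %*% N %*% t(A_hat)))         # the maximized trace term in (A.7)
```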

Proof of Theorem 5: Let $B^* = [\beta_1^*, \ldots, \beta_k^*]$ with $\beta_j^* = (1 + \lambda)\beta_j$; then $\hat{V}_i(\lambda) = \frac{\beta_i^*}{\|\beta_i^*\|}$. On the other hand, $\beta_j = \frac{\beta_j^*}{1 + \lambda}$ means that

$$(\hat{A}, \hat{B}^*) = \arg\min_{A,\,B} C_{\lambda, \lambda_1}(A, B) \quad \text{subject to } A^TA = I_{k \times k}, \qquad (A.11)$$

where

$$C_{\lambda, \lambda_1}(A, B) = \sum_{i=1}^n \Big\|x_i - \frac{AB^T}{1 + \lambda}x_i\Big\|^2 + \lambda\sum_{j=1}^k \Big\|\frac{\beta_j}{1 + \lambda}\Big\|^2 + \sum_{j=1}^k \lambda_{1,j}\Big\|\frac{\beta_j}{1 + \lambda}\Big\|_1. \qquad (A.12)$$

Since

$$\sum_{j=1}^k \Big\|\frac{\beta_j}{1 + \lambda}\Big\|^2 = \frac{1}{(1 + \lambda)^2}\,\mathrm{tr}(B^TB)$$

and

$$\sum_{i=1}^n \Big\|x_i - \frac{AB^T}{1 + \lambda}x_i\Big\|^2 = \mathrm{tr}\Big(\big(X - X\tfrac{B}{1 + \lambda}A^T\big)^T\big(X - X\tfrac{B}{1 + \lambda}A^T\big)\Big) = \mathrm{tr}(X^TX) + \frac{1}{(1 + \lambda)^2}\,\mathrm{tr}(B^TX^TXB) - \frac{2}{1 + \lambda}\,\mathrm{tr}(A^TX^TXB),$$

we have

$$C_{\lambda, \lambda_1}(A, B) = \mathrm{tr}(X^TX) + \frac{1}{1 + \lambda}\bigg[\mathrm{tr}\Big(B^T\frac{X^TX + \lambda I}{1 + \lambda}B\Big) - 2\,\mathrm{tr}(A^TX^TXB) + \sum_{j=1}^k \lambda_{1,j}\|\beta_j\|_1\bigg],$$

which implies that

$$(\hat{A}, \hat{B}^*) = \arg\min_{A,\,B}\;\mathrm{tr}\Big(B^T\frac{X^TX + \lambda I}{1 + \lambda}B\Big) - 2\,\mathrm{tr}(A^TX^TXB) + \sum_{j=1}^k \lambda_{1,j}\|\beta_j\|_1 \qquad (A.13)$$

subject to $A^TA = I_{k \times k}$. As $\lambda \to \infty$, $\mathrm{tr}\big(B^T\frac{X^TX + \lambda I}{1 + \lambda}B\big) \to \mathrm{tr}(B^TB) = \sum_{j=1}^k \|\beta_j\|^2$. Thus (A.13) approaches (4.1), and the conclusion of Theorem 5 follows.

ACKNOWLEDGMENTS

We thank the editor, an associate editor, and referees for helpful comments and suggestions, which greatly improved the manuscript.

[Received April 2004. Revised June 2005.]

REFERENCES

Alter, O., Brown, P., and Botstein, D. (2000), "Singular Value Decomposition for Genome-Wide Expression Data Processing and Modeling," Proceedings of the National Academy of Sciences, 97, 10101-10106.

Cadima, J., and Jolliffe, I. (1995), "Loadings and Correlations in the Interpretation of Principal Components," Journal of Applied Statistics, 22, 203-214.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004), "Least Angle Regression," The Annals of Statistics, 32, 407-499.

Hancock, P., Burton, A., and Bruce, V. (1996), "Face Processing: Human Perception and Principal Components Analysis," Memory and Cognition, 24, 26-40.

Hastie, T., Tibshirani, R., and Friedman, J. (2001), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, New York: Springer-Verlag.

Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Ross, D., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000), "'Gene Shaving' as a Method for Identifying Distinct Sets of Genes With Similar Expression Patterns," Genome Biology, 1, 1-21.

Jeffers, J. (1967), "Two Case Studies in the Application of Principal Component Analysis," Applied Statistics, 16, 225-236.

Jolliffe, I. (1986), Principal Component Analysis, New York: Springer-Verlag.

Jolliffe, I. (1995), "Rotation of Principal Components: Choice of Normalization Constraints," Journal of Applied Statistics, 22, 29-35.

Jolliffe, I. T., Trendafilov, N. T., and Uddin, M. (2003), "A Modified Principal Component Technique Based on the Lasso," Journal of Computational and Graphical Statistics, 12, 531-547.

Mardia, K., Kent, J., and Bibby, J. (1979), Multivariate Analysis, New York: Academic Press.

McCabe, G. (1984), "Principal Variables," Technometrics, 26, 137-144.

Osborne, M. R., Presnell, B., and Turlach, B. A. (2000), "A New Approach to Variable Selection in Least Squares Problems," IMA Journal of Numerical Analysis, 20, 389-403.

Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J., Poggio, T., Gerald, W., Loda, M., Lander, E., and Golub, T. (2001), "Multiclass Cancer Diagnosis Using Tumor Gene Expression Signature," Proceedings of the National Academy of Sciences, 98, 15149-15154.

Tibshirani, R. (1996), "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Series B, 58, 267-288.

Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G. (2002), "Diagnosis of Multiple Cancer Types by Shrunken Centroids of Gene Expression," Proceedings of the National Academy of Sciences, 99, 6567-6572.

Vines, S. (2000), "Simple Principal Components," Applied Statistics, 49, 441-451.

Zou, H., and Hastie, T. (2005), "Regularization and Variable Selection via the Elastic Net," Journal of the Royal Statistical Society, Series B, 67, 301-320.

Page 15: Sparse Principal Component Analysis - Stanford …web.stanford.edu/~hastie/Papers/spc_jcgs.pdf · Sparse Principal Component Analysis HuiZ OU,Trevor H ASTIE and RobertT IBSHIRANI

SPARSE PRINCIPAL COMPONENT ANALYSIS 279

Xi = V2 + ε2i ε2

i sim N(0 1) i = 5 6 7 8

Xi = V3 + ε3i ε3

i sim N(0 1) i = 9 10

εji are independent j = 1 2 3 i = 1 10

We used the exact covariance matrix of (X1 X10) to perform PCA SPCA and simplethresholding (in the population setting)

Table 4 Pitprops Data Loadings of the First Six Modified PCs by Simple Thresholding Empty cellshave zero loadings

Variable PC1 PC2 PC3 PC4 PC5 PC6

Simple thresholding vs SCoTLASS

topdiam minus0439 0240 0120length minus0441 0105 minus0114 0163moist 0596 0354 minus0276testsg 0503 0391 0360 minus0054ovensg 0534 0178 0626ringtop 0528 minus0320 0052ringbut minus0435 0281 minus0218 0003bowmax minus0319 minus0270 minus0291 0188 minus0055bowdist minus0388 0034whorls minus0412 minus0274 0209 0158 minus0173clear minus0819 minus0347 0175knots 0378 0307 minus0608 minus0170diaknot 0340 minus0362 0309 0626

Number of nonzero loadings 6 6 6 6 10 13Variance () 289 161 154 84 71 63Adjusted Variance () 289 161 139 82 69 62Cumulative adjusted variance () 289 450 589 671 740 802

Simple thresholding vs SPCA

topdiam minus0420length minus0422moist 0640testsg 0540 0425ovensg 0580ringtop minus0296 0573ringbut minus0416bowmax minus0305bowdist minus0370whorls minus0394clear minus1knots 0406 minus1diaknot 0365 minus0393 1

Number of nonzero loadings 7 4 4 1 1 1Variance () 307 148 136 77 77 77Adjusted variance () 307 147 111 76 52 36Cumulative adjusted variance () 307 454 565 641 683 719

280 H ZOU T HASTIE AND R TIBSHIRANI

Table 5 Results of the Simulation Example Loadings and Variance

PCA SPCA (λ = 0) Simple ThresholdingPC1 PC2 PC3 PC1 PC2 PC1 PC2

X1 0116 minus0478 minus0087 0 05 0 minus05X2 0116 minus0478 minus0087 0 05 0 minus05X3 0116 minus0478 minus0087 0 05 0 minus05X4 0116 minus0478 minus0087 0 05 0 minus05X5 minus0395 minus0145 0270 05 0 0 0X6 minus0395 minus0145 0270 05 0 0 0X7 minus0395 minus0145 0270 05 0 minus0497 0X8 minus0395 minus0145 0270 05 0 minus0497 0X9 minus0401 0010 minus0582 0 0 minus0503 0X10 minus0401 0010 minus0582 0 0 minus0503 0

AdjustedVariance () 600 396 008 409 395 388 386

The variance of the three underlying factors is 290 300 and 2838 respectively Thenumbers of variables associated with the three factors are 4 4 and 2 Therefore V2 andV1 are almost equally important and they are much more important than V3 The first twoPCs together explain 996 of the total variance These facts suggest that we only needto consider two derived variables with ldquocorrectrdquo sparse representations Ideally the firstderived variable should recover the factor V2 only using (X5 X6 X7 X8) and the secondderived variable should recover the factor V1 only using (X1 X2 X3 X4) In fact if wesequentially maximize the variance of the first two derived variables under the orthonormalconstraint while restricting the numbers of nonzero loadings to four then the first derivedvariable uniformly assigns nonzero loadings on (X5 X6 X7 X8) and the second derivedvariable uniformly assigns nonzero loadings on (X1 X2 X3 X4)

Both SPCA (λ = 0) and simple thresholding were carried out by using the oracleinformation that the ideal sparse representations use only four variables Table 5 summarizesthe comparison results Clearly SPCA correctly identifies the sets of important variables Infact SPCA delivers the ideal sparse representations of the first two principal componentsMathematically it is easy to show that if t = 2 is used SCoTLASS is also able to find thesame sparse solution In this example both SPCA and SCoTLASS produce the ideal sparsePCs which may be explained by the fact that both methods explicitly use the lasso penalty

In contrast simple thresholding incorrectly includes X9 X10 in the most importantvariables The variance explained by simple thresholding is also lower than that by SPCAalthough the relative difference is small (less than 5) Due to the high correlation betweenV2 and V3 variablesX9 X10 achieve loadings which are even higher than those of the trueimportant variables (X5 X6 X7 X8) Thus the truth is disguised by the high correlationOn the other hand simple thresholding correctly discovers the second factor because V1

has a low correlation with V3

53 RAMASWAMY DATA

An important task in microarray analysis is to find a set of genes which are biologically

SPARSE PRINCIPAL COMPONENT ANALYSIS 281

Figure 3 The sparse leading principal component percentage of explained variance versus sparsity Simplethresholding and SPCA have similar performances However there still exists consistent difference in the selectedgenes (the ones with nonzero loadings)

relevant to the outcome (eg tumor type or survival time) PCA (or SVD) has been a populartool for this purpose Many gene-clustering methods in the literature use PCA (or SVD) asa building block For example gene shaving (Hastie et al 2000) uses an iterative principalcomponent shaving algorithm to identify subsets of coherent genes Here we consideranother approach to gene selection through SPCA The idea is intuitive if the (sparse)principal component can explain a large part of the total variance of gene expression levelsthen the subset of genes representing the principal component are considered important

We illustrate the sparse PC selection method on Ramaswamyrsquos data (Ramaswamy et al2001) which has 16063 (p = 16063) genes and 144 (n = 144) samples Its first principalcomponent explains 46 of the total variance For microarray data like this it appearsthat SCoTLASS cannot be practically useful for finding sparse PCs We applied SPCA(λ = infin) to find the leading sparse PC A sequence of values for λ1 were used such thatthe number of nonzero loadings varied over a wide range As displayed in Figure 3 thepercentage of explained variance decreases at a slow rate as the sparsity increases As fewas 25 of these 16063 genes can sufficiently construct the leading principal componentwith an affordable loss of explained variance (from 46 to 40) Simple thresholdingwas also applied to this data It seems that when using the same number of genes simplethresholding always explains slightly higher variance than SPCA does Among the samenumber of selected genes by SPCA and simple thresholding there are about 2 differentgenes and this difference rate is quite consistent

282 H ZOU T HASTIE AND R TIBSHIRANI

6 DISCUSSION

It has been an interesting research topic for years to derive principal components withsparse loadings From a practical point of view a good method to achieve the sparsenessgoal should (at least) possess the following properties

bull Without any sparsity constraint the method should reduce to PCAbull It should be computationally efficient for both small p and big p databull It should avoid misidentifying the important variables

The often-used simple thresholding approach is not criterion based However thisinformal method seems to possess the first two of the desirable properties listed above Ifthe explained variance and sparsity are the only concerns simple thresholding is a reasonableapproach and it is extremely convenient We have shown that simple thresholding can workwell with gene expression arrays The serious problem with simple thresholding is that itcan misidentify the real important variables Nevertheless simple thresholding is regardedas a benchmark for any potentially better method

Using the lasso constraint in PCA SCoTLASS successfully derives sparse loadingsHowever SCoTLASS is not computationally efficient and it lacks a good rule to pick itstuning parameter In addition it is not feasible to apply SCoTLASS to gene expressionarrays where PCA is a quite popular tool

In this work we have developed SPCA using our SPCA criterion (312) This newcriterion gives exact PCA results when its sparsity (lasso) penalty term vanishes SPCAallows flexible control on the sparse structure of the resulting loadings Unified efficientalgorithms have been proposed to compute SPCA solutions for both regular multivariatedata and gene expression arrays As a principled procedure SPCA enjoys advantages inseveral aspects including computational efficiency high explained variance and an abilityin identifying important variables

Software in R for fitting the SPCA model (and elastic net models) is available in theCRAN contributed package elasticnet

APPENDIX PROOFS

Proof of Theorem 1 Using XT X = VD2VT and VT V = I we have

βridge =(XT X + λI

)minus1XT (XVi) = Vi

D2ii

D2ii + λ

(A1)

Hence v = Vi

Note that since Theorem 2 is a special case of Theorem 3 we will not prove it separatelyWe first provide a lemma

Lemma 1 Consider the ridge regression criterion

Cλ(β) = y minus Xβ2 + λβ2

SPARSE PRINCIPAL COMPONENT ANALYSIS 283

Then if β = arg minβ Cλ(β)

Cλ(β) = yT (I minus Sλ)y

where Sλ is the ridge operator

Sλ = X(XT X + λI)minus1XT

Proof of Lemma 1 Differentiating Cλ with respect to β we get that

minusXT (y minus Xβ) + λβ = 0

Premultiplication by βT and re-arrangement gives λβ2 = (y minus Xβ)T Xβ Since

y minus Xβ2 = (y minus Xβ)T y minus (y minus Xβ)T Xβ

Cλ(β) = (y minus Xβ)T y The result follows since the ldquofitted valuesrdquo Xβ = Sλy

Proof of Theorem 3 We use the notation introduced in Section 3 A = [α1 αk]and B = [β1 βk] Let

Cλ(AB) =nsum

i=1

xi minus ABT xi2 + λ

ksumj=1

βj2

As in (39) we have

nsumi=1

xi minus ABT xi2 = X minus XBAT 2 (A2)

= XAperp2 + XA minus XB2 (A3)

Hence with A fixed solving

arg minB

Cλ(AB)

is equivalent to solving the series of ridge regressions

arg minβjk

1

ksumj=1

Xαj minus Xβj2 + λβj2

It is easy to show that

B = (XT X + λI)minus1XT XA (A4)

and using Lemma 1 and (A2) we have that the partially optimized penalized criterion isgiven by

Cλ(A B) = XAperp2 + tr((XA)T (I minus Sλ)(XA)

) (A5)

284 H ZOU T HASTIE AND R TIBSHIRANI

Rearranging the terms we get

Cλ(A B) = tr(XT X) minus tr(AT XT SλXA) (A6)

which must be minimized with respect to A with AT A = I Hence A should be taken tobe the largest k eigenvectors of XT SλX If the SVD of X = UDVT it is easy to show thatXT SλX = VD2(D2 + λI)minus1D2VT hence A = V[ 1 k] Likewise plugging the SVD ofX into (A4) we see that each of the βj are scaled elements of the corresponding Vj

Proof of Theorem 4 We expand the matrix norm

M minus NAT 2 = tr(MT M) minus 2tr(MT NAT ) + tr(ANT NAT ) (A7)

Since AT A = I the last term is equal to tr(NT N) and hence we need to maximize (minushalf) the middle term With the SVD MT N = UDVT this middle term becomes

tr(MT NAT ) = tr(UDVT AT ) (A8)

= tr(UDAlowastT ) (A9)

= tr(AlowastT UD) (A10)

where Alowast = AV and since V is ktimes k orthonormal AlowastT Alowast = I Now since D is diagonal(A10) is maximized when the diagonal of AlowastT U is positive and maximum By Cauchy-Schwartz inequality this is achieved when Alowast = U in which case the diagonal elementsare all 1 Hence A = UVT

Proof of Theorem 5 Let Blowast

= [βlowast1 β

lowastk ] with βlowast = (1+λ)β then Vi(λ) = βlowast

i

||βlowasti||

On the other hand β = βlowast

1+λ means that

(A Blowast) = arg min

A BCλλ1(AB) subject to AT A = Iktimesk (A11)

where

Cλλ1(AB) =nsum

i=1

xi minus ABT

1 + λxi2 + λ

ksumj=1

βj

1 + λ2 +

ksumj=1

λ1j βj

1 + λ1 (A12)

Since

ksumj=1

βj

1 + λ2 =

1(1 + λ)2

tr(BT B)

and

nsumi=1

xi minus ABT

1 + λxi2 = tr

((X minus X

B1 + λ

AT )T (X minus XB

1 + λAT )

)= tr

(XT X

)+

1(1 + λ)2

tr(BT XT XB

)minus 21 + λ

tr(

AT XT XB)

SPARSE PRINCIPAL COMPONENT ANALYSIS 285

Thus we have

Cλλ1(AB) = tr(XT X

)+

11 + λ

tr

(BT XT X + λI

1 + λB

)minus 2Tr

(AT XT XB

)+

ksumj=1

λ1jβj1

which implies that

(A Blowast) = arg min

A Btr

(BT XT X + λI

1 + λB

)minus 2tr

(AT XT XB

)+

ksumj=1

λ1jβj1 (A13)

subject to AT A = Iktimesk As λ rarr infin tr(

BT XT X+λI1+λ B

)rarr tr(BT B) =

sumkj=1 βj2

Thus (A13) approaches (41) and the conclusion of Theorem 5 follows

ACKNOWLEDGMENTSWe thank the editor an associate editor and referees for helpful comments and suggestions which greatly

improved the manuscript

[Received April 2004 Revised June 2005]

REFERENCES

Alter O Brown P and Botstein D (2000) ldquoSingular Value Decomposition for Genome-Wide Expression DataProcessing and Modelingrdquo in Proceedings of the National Academy of Sciences 97 pp 10101ndash10106

Cadima J and Jolliffe I (1995) ldquoLoadings and Correlations in the Interpretation of Principal ComponentsrdquoJournal of Applied Statistics 22 203ndash214

Efron B Hastie T Johnstone I and Tibshirani R (2004) ldquoLeast Angle Regressionrdquo The Annals of Statistics32 407ndash499

Hancock P Burton A and Bruce V (1996) ldquoFace Processing Human Perception and Principal ComponentsAnalysisrdquo Memory and Cognition 24 26ndash40

Hastie T Tibshirani R and Friedman J (2001) The Elements of Statistical Learning Data mining Inferenceand Prediction New York Springer Verlag

Hastie T Tibshirani R Eisen M Brown P Ross D Scherf U Weinstein J Alizadeh A Staudt Land Botstein D (2000) ldquo lsquogene Shavingrsquo as a Method for Identifying Distinct Sets of Genes With SimilarExpression Patternsrdquo Genome Biology 1 1ndash21

Jeffers J (1967) ldquoTwo Case Studies in the Application of Principal Componentrdquo Applied Statistics 16 225ndash236

Jolliffe I (1986) Principal Component Analysis New York Springer Verlag

(1995) ldquoRotation of Principal Components Choice of Normalization Constraintsrdquo Journal of AppliedStatistics 22 29ndash35

Jolliffe I T Trendafilov N T and Uddin M (2003) ldquoA Modified Principal Component Technique Based onthe Lassordquo Journal of Computational and Graphical Statistics 12 531ndash547

Mardia K Kent J and Bibby J (1979) Multivariate Analysis New York Academic Press

McCabe G (1984) ldquoPrincipal Variablesrdquo Technometrics 26 137ndash144

286 H ZOU T HASTIE AND R TIBSHIRANI

Osborne M R Presnell B and Turlach B A (2000) ldquoA New Approach to Variable Selection in Least SquaresProblemsrdquo IMA Journal of Numerical Analysis 20 389ndash403

Ramaswamy S Tamayo P Rifkin R Mukheriee S Yeang C Angelo M Ladd C Reich M LatulippeE Mesirov J Poggio T Gerald W Loda M Lander E and Golub T (2001) ldquoMulticlass CancerDiagnosis using Tumor Gene Expression Signaturerdquo in Proceedings of the National Academy of Sciences98 pp 15149ndash15154

Tibshirani R (1996) ldquoRegression Shrinkage and Selection via the Lassordquo Journal of the Royal Statistical SocietySeries B 58 267ndash288

Tibshirani R Hastie T Narasimhan B and Chu G (2002) ldquoDiagnosis of Multiple Cancer Types by ShrunkenCentroids of Generdquo in Proceedings of the National Academy of Sciences 99 6567ndash6572

Vines S (2000) ldquoSimple Principal Componentsrdquo Applied Statistics 49 441ndash451

Zou H and Hastie T (2005) ldquoRegularization and Variable Selection via the Elastic Netrdquo Journal of the RoyalStatistical Society Series B 67 301ndash320

Page 16: Sparse Principal Component Analysis - Stanford …web.stanford.edu/~hastie/Papers/spc_jcgs.pdf · Sparse Principal Component Analysis HuiZ OU,Trevor H ASTIE and RobertT IBSHIRANI

280 H ZOU T HASTIE AND R TIBSHIRANI

Table 5 Results of the Simulation Example Loadings and Variance

PCA SPCA (λ = 0) Simple ThresholdingPC1 PC2 PC3 PC1 PC2 PC1 PC2

X1 0116 minus0478 minus0087 0 05 0 minus05X2 0116 minus0478 minus0087 0 05 0 minus05X3 0116 minus0478 minus0087 0 05 0 minus05X4 0116 minus0478 minus0087 0 05 0 minus05X5 minus0395 minus0145 0270 05 0 0 0X6 minus0395 minus0145 0270 05 0 0 0X7 minus0395 minus0145 0270 05 0 minus0497 0X8 minus0395 minus0145 0270 05 0 minus0497 0X9 minus0401 0010 minus0582 0 0 minus0503 0X10 minus0401 0010 minus0582 0 0 minus0503 0

AdjustedVariance () 600 396 008 409 395 388 386

The variance of the three underlying factors is 290 300 and 2838 respectively Thenumbers of variables associated with the three factors are 4 4 and 2 Therefore V2 andV1 are almost equally important and they are much more important than V3 The first twoPCs together explain 996 of the total variance These facts suggest that we only needto consider two derived variables with ldquocorrectrdquo sparse representations Ideally the firstderived variable should recover the factor V2 only using (X5 X6 X7 X8) and the secondderived variable should recover the factor V1 only using (X1 X2 X3 X4) In fact if wesequentially maximize the variance of the first two derived variables under the orthonormalconstraint while restricting the numbers of nonzero loadings to four then the first derivedvariable uniformly assigns nonzero loadings on (X5 X6 X7 X8) and the second derivedvariable uniformly assigns nonzero loadings on (X1 X2 X3 X4)

Both SPCA (λ = 0) and simple thresholding were carried out by using the oracleinformation that the ideal sparse representations use only four variables Table 5 summarizesthe comparison results Clearly SPCA correctly identifies the sets of important variables Infact SPCA delivers the ideal sparse representations of the first two principal componentsMathematically it is easy to show that if t = 2 is used SCoTLASS is also able to find thesame sparse solution In this example both SPCA and SCoTLASS produce the ideal sparsePCs which may be explained by the fact that both methods explicitly use the lasso penalty

In contrast simple thresholding incorrectly includes X9 X10 in the most importantvariables The variance explained by simple thresholding is also lower than that by SPCAalthough the relative difference is small (less than 5) Due to the high correlation betweenV2 and V3 variablesX9 X10 achieve loadings which are even higher than those of the trueimportant variables (X5 X6 X7 X8) Thus the truth is disguised by the high correlationOn the other hand simple thresholding correctly discovers the second factor because V1

has a low correlation with V3

53 RAMASWAMY DATA

An important task in microarray analysis is to find a set of genes which are biologically

SPARSE PRINCIPAL COMPONENT ANALYSIS 281

Figure 3 The sparse leading principal component percentage of explained variance versus sparsity Simplethresholding and SPCA have similar performances However there still exists consistent difference in the selectedgenes (the ones with nonzero loadings)

relevant to the outcome (eg tumor type or survival time) PCA (or SVD) has been a populartool for this purpose Many gene-clustering methods in the literature use PCA (or SVD) asa building block For example gene shaving (Hastie et al 2000) uses an iterative principalcomponent shaving algorithm to identify subsets of coherent genes Here we consideranother approach to gene selection through SPCA The idea is intuitive if the (sparse)principal component can explain a large part of the total variance of gene expression levelsthen the subset of genes representing the principal component are considered important

We illustrate the sparse PC selection method on Ramaswamyrsquos data (Ramaswamy et al2001) which has 16063 (p = 16063) genes and 144 (n = 144) samples Its first principalcomponent explains 46 of the total variance For microarray data like this it appearsthat SCoTLASS cannot be practically useful for finding sparse PCs We applied SPCA(λ = infin) to find the leading sparse PC A sequence of values for λ1 were used such thatthe number of nonzero loadings varied over a wide range As displayed in Figure 3 thepercentage of explained variance decreases at a slow rate as the sparsity increases As fewas 25 of these 16063 genes can sufficiently construct the leading principal componentwith an affordable loss of explained variance (from 46 to 40) Simple thresholdingwas also applied to this data It seems that when using the same number of genes simplethresholding always explains slightly higher variance than SPCA does Among the samenumber of selected genes by SPCA and simple thresholding there are about 2 differentgenes and this difference rate is quite consistent

282 H ZOU T HASTIE AND R TIBSHIRANI

6 DISCUSSION

It has been an interesting research topic for years to derive principal components withsparse loadings From a practical point of view a good method to achieve the sparsenessgoal should (at least) possess the following properties

bull Without any sparsity constraint the method should reduce to PCAbull It should be computationally efficient for both small p and big p databull It should avoid misidentifying the important variables

The often-used simple thresholding approach is not criterion based However thisinformal method seems to possess the first two of the desirable properties listed above Ifthe explained variance and sparsity are the only concerns simple thresholding is a reasonableapproach and it is extremely convenient We have shown that simple thresholding can workwell with gene expression arrays The serious problem with simple thresholding is that itcan misidentify the real important variables Nevertheless simple thresholding is regardedas a benchmark for any potentially better method

Using the lasso constraint in PCA SCoTLASS successfully derives sparse loadingsHowever SCoTLASS is not computationally efficient and it lacks a good rule to pick itstuning parameter In addition it is not feasible to apply SCoTLASS to gene expressionarrays where PCA is a quite popular tool

In this work we have developed SPCA using our SPCA criterion (312) This newcriterion gives exact PCA results when its sparsity (lasso) penalty term vanishes SPCAallows flexible control on the sparse structure of the resulting loadings Unified efficientalgorithms have been proposed to compute SPCA solutions for both regular multivariatedata and gene expression arrays As a principled procedure SPCA enjoys advantages inseveral aspects including computational efficiency high explained variance and an abilityin identifying important variables

Software in R for fitting the SPCA model (and elastic net models) is available in theCRAN contributed package elasticnet

APPENDIX PROOFS

Proof of Theorem 1 Using XT X = VD2VT and VT V = I we have

βridge =(XT X + λI

)minus1XT (XVi) = Vi

D2ii

D2ii + λ

(A1)

Hence v = Vi

Note that since Theorem 2 is a special case of Theorem 3 we will not prove it separatelyWe first provide a lemma

Lemma 1 Consider the ridge regression criterion

Cλ(β) = y minus Xβ2 + λβ2

SPARSE PRINCIPAL COMPONENT ANALYSIS 283

Then if β = arg minβ Cλ(β)

Cλ(β) = yT (I minus Sλ)y

where Sλ is the ridge operator

Sλ = X(XT X + λI)minus1XT

Proof of Lemma 1 Differentiating Cλ with respect to β we get that

minusXT (y minus Xβ) + λβ = 0

Premultiplication by βT and re-arrangement gives λβ2 = (y minus Xβ)T Xβ Since

y minus Xβ2 = (y minus Xβ)T y minus (y minus Xβ)T Xβ

Cλ(β) = (y minus Xβ)T y The result follows since the ldquofitted valuesrdquo Xβ = Sλy

Proof of Theorem 3 We use the notation introduced in Section 3 A = [α1 αk]and B = [β1 βk] Let

Cλ(AB) =nsum

i=1

xi minus ABT xi2 + λ

ksumj=1

βj2

As in (39) we have

nsumi=1

xi minus ABT xi2 = X minus XBAT 2 (A2)

= XAperp2 + XA minus XB2 (A3)

Hence with A fixed solving

arg minB

Cλ(AB)

is equivalent to solving the series of ridge regressions

arg minβjk

1

ksumj=1

Xαj minus Xβj2 + λβj2

It is easy to show that

B = (XT X + λI)minus1XT XA (A4)

and using Lemma 1 and (A2) we have that the partially optimized penalized criterion isgiven by

Cλ(A B) = XAperp2 + tr((XA)T (I minus Sλ)(XA)

) (A5)

284 H ZOU T HASTIE AND R TIBSHIRANI

Rearranging the terms we get

Cλ(A B) = tr(XT X) minus tr(AT XT SλXA) (A6)

which must be minimized with respect to A with AT A = I Hence A should be taken tobe the largest k eigenvectors of XT SλX If the SVD of X = UDVT it is easy to show thatXT SλX = VD2(D2 + λI)minus1D2VT hence A = V[ 1 k] Likewise plugging the SVD ofX into (A4) we see that each of the βj are scaled elements of the corresponding Vj

Proof of Theorem 4 We expand the matrix norm

M minus NAT 2 = tr(MT M) minus 2tr(MT NAT ) + tr(ANT NAT ) (A7)

Since AT A = I the last term is equal to tr(NT N) and hence we need to maximize (minushalf) the middle term With the SVD MT N = UDVT this middle term becomes

tr(MT NAT ) = tr(UDVT AT ) (A8)

= tr(UDAlowastT ) (A9)

= tr(AlowastT UD) (A10)

where Alowast = AV and since V is ktimes k orthonormal AlowastT Alowast = I Now since D is diagonal(A10) is maximized when the diagonal of AlowastT U is positive and maximum By Cauchy-Schwartz inequality this is achieved when Alowast = U in which case the diagonal elementsare all 1 Hence A = UVT

Proof of Theorem 5 Let Blowast

= [βlowast1 β

lowastk ] with βlowast = (1+λ)β then Vi(λ) = βlowast

i

||βlowasti||

On the other hand β = βlowast

1+λ means that

(A Blowast) = arg min

A BCλλ1(AB) subject to AT A = Iktimesk (A11)

where

Cλλ1(AB) =nsum

i=1

xi minus ABT

1 + λxi2 + λ

ksumj=1

βj

1 + λ2 +

ksumj=1

λ1j βj

1 + λ1 (A12)

Since

ksumj=1

βj

1 + λ2 =

1(1 + λ)2

tr(BT B)

and

nsumi=1

xi minus ABT

1 + λxi2 = tr

((X minus X

B1 + λ

AT )T (X minus XB

1 + λAT )

)= tr

(XT X

)+

1(1 + λ)2

tr(BT XT XB

)minus 21 + λ

tr(

AT XT XB)

SPARSE PRINCIPAL COMPONENT ANALYSIS 285

Thus we have

Cλλ1(AB) = tr(XT X

)+

11 + λ

tr

(BT XT X + λI

1 + λB

)minus 2Tr

(AT XT XB

)+

ksumj=1

λ1jβj1

which implies that

(A Blowast) = arg min

A Btr

(BT XT X + λI

1 + λB

)minus 2tr

(AT XT XB

)+

ksumj=1

λ1jβj1 (A13)

subject to AT A = Iktimesk As λ rarr infin tr(

BT XT X+λI1+λ B

)rarr tr(BT B) =

sumkj=1 βj2

Thus (A13) approaches (41) and the conclusion of Theorem 5 follows

ACKNOWLEDGMENTSWe thank the editor an associate editor and referees for helpful comments and suggestions which greatly

improved the manuscript

[Received April 2004 Revised June 2005]

REFERENCES

Alter O Brown P and Botstein D (2000) ldquoSingular Value Decomposition for Genome-Wide Expression DataProcessing and Modelingrdquo in Proceedings of the National Academy of Sciences 97 pp 10101ndash10106

Cadima J and Jolliffe I (1995) ldquoLoadings and Correlations in the Interpretation of Principal ComponentsrdquoJournal of Applied Statistics 22 203ndash214

Efron B Hastie T Johnstone I and Tibshirani R (2004) ldquoLeast Angle Regressionrdquo The Annals of Statistics32 407ndash499

Hancock P Burton A and Bruce V (1996) ldquoFace Processing Human Perception and Principal ComponentsAnalysisrdquo Memory and Cognition 24 26ndash40

Hastie T Tibshirani R and Friedman J (2001) The Elements of Statistical Learning Data mining Inferenceand Prediction New York Springer Verlag

Hastie T Tibshirani R Eisen M Brown P Ross D Scherf U Weinstein J Alizadeh A Staudt Land Botstein D (2000) ldquo lsquogene Shavingrsquo as a Method for Identifying Distinct Sets of Genes With SimilarExpression Patternsrdquo Genome Biology 1 1ndash21

Jeffers J (1967) ldquoTwo Case Studies in the Application of Principal Componentrdquo Applied Statistics 16 225ndash236

Jolliffe I (1986) Principal Component Analysis New York Springer Verlag

(1995) ldquoRotation of Principal Components Choice of Normalization Constraintsrdquo Journal of AppliedStatistics 22 29ndash35

Jolliffe I T Trendafilov N T and Uddin M (2003) ldquoA Modified Principal Component Technique Based onthe Lassordquo Journal of Computational and Graphical Statistics 12 531ndash547

Mardia K Kent J and Bibby J (1979) Multivariate Analysis New York Academic Press

McCabe G (1984) ldquoPrincipal Variablesrdquo Technometrics 26 137ndash144

286 H ZOU T HASTIE AND R TIBSHIRANI

Osborne M R Presnell B and Turlach B A (2000) ldquoA New Approach to Variable Selection in Least SquaresProblemsrdquo IMA Journal of Numerical Analysis 20 389ndash403

Ramaswamy S Tamayo P Rifkin R Mukheriee S Yeang C Angelo M Ladd C Reich M LatulippeE Mesirov J Poggio T Gerald W Loda M Lander E and Golub T (2001) ldquoMulticlass CancerDiagnosis using Tumor Gene Expression Signaturerdquo in Proceedings of the National Academy of Sciences98 pp 15149ndash15154

Tibshirani R (1996) ldquoRegression Shrinkage and Selection via the Lassordquo Journal of the Royal Statistical SocietySeries B 58 267ndash288

Tibshirani R Hastie T Narasimhan B and Chu G (2002) ldquoDiagnosis of Multiple Cancer Types by ShrunkenCentroids of Generdquo in Proceedings of the National Academy of Sciences 99 6567ndash6572

Vines S (2000) ldquoSimple Principal Componentsrdquo Applied Statistics 49 441ndash451

Zou H and Hastie T (2005) ldquoRegularization and Variable Selection via the Elastic Netrdquo Journal of the RoyalStatistical Society Series B 67 301ndash320

Page 17: Sparse Principal Component Analysis - Stanford …web.stanford.edu/~hastie/Papers/spc_jcgs.pdf · Sparse Principal Component Analysis HuiZ OU,Trevor H ASTIE and RobertT IBSHIRANI

SPARSE PRINCIPAL COMPONENT ANALYSIS 281

Figure 3 The sparse leading principal component percentage of explained variance versus sparsity Simplethresholding and SPCA have similar performances However there still exists consistent difference in the selectedgenes (the ones with nonzero loadings)

relevant to the outcome (eg tumor type or survival time) PCA (or SVD) has been a populartool for this purpose Many gene-clustering methods in the literature use PCA (or SVD) asa building block For example gene shaving (Hastie et al 2000) uses an iterative principalcomponent shaving algorithm to identify subsets of coherent genes Here we consideranother approach to gene selection through SPCA The idea is intuitive if the (sparse)principal component can explain a large part of the total variance of gene expression levelsthen the subset of genes representing the principal component are considered important

We illustrate the sparse PC selection method on Ramaswamyrsquos data (Ramaswamy et al2001) which has 16063 (p = 16063) genes and 144 (n = 144) samples Its first principalcomponent explains 46 of the total variance For microarray data like this it appearsthat SCoTLASS cannot be practically useful for finding sparse PCs We applied SPCA(λ = infin) to find the leading sparse PC A sequence of values for λ1 were used such thatthe number of nonzero loadings varied over a wide range As displayed in Figure 3 thepercentage of explained variance decreases at a slow rate as the sparsity increases As fewas 25 of these 16063 genes can sufficiently construct the leading principal componentwith an affordable loss of explained variance (from 46 to 40) Simple thresholdingwas also applied to this data It seems that when using the same number of genes simplethresholding always explains slightly higher variance than SPCA does Among the samenumber of selected genes by SPCA and simple thresholding there are about 2 differentgenes and this difference rate is quite consistent

282 H ZOU T HASTIE AND R TIBSHIRANI

6 DISCUSSION

It has been an interesting research topic for years to derive principal components withsparse loadings From a practical point of view a good method to achieve the sparsenessgoal should (at least) possess the following properties

bull Without any sparsity constraint the method should reduce to PCAbull It should be computationally efficient for both small p and big p databull It should avoid misidentifying the important variables

The often-used simple thresholding approach is not criterion based However thisinformal method seems to possess the first two of the desirable properties listed above Ifthe explained variance and sparsity are the only concerns simple thresholding is a reasonableapproach and it is extremely convenient We have shown that simple thresholding can workwell with gene expression arrays The serious problem with simple thresholding is that itcan misidentify the real important variables Nevertheless simple thresholding is regardedas a benchmark for any potentially better method

Using the lasso constraint in PCA SCoTLASS successfully derives sparse loadingsHowever SCoTLASS is not computationally efficient and it lacks a good rule to pick itstuning parameter In addition it is not feasible to apply SCoTLASS to gene expressionarrays where PCA is a quite popular tool

In this work we have developed SPCA using our SPCA criterion (312) This newcriterion gives exact PCA results when its sparsity (lasso) penalty term vanishes SPCAallows flexible control on the sparse structure of the resulting loadings Unified efficientalgorithms have been proposed to compute SPCA solutions for both regular multivariatedata and gene expression arrays As a principled procedure SPCA enjoys advantages inseveral aspects including computational efficiency high explained variance and an abilityin identifying important variables

Software in R for fitting the SPCA model (and elastic net models) is available in theCRAN contributed package elasticnet

APPENDIX PROOFS

Proof of Theorem 1 Using XT X = VD2VT and VT V = I we have

βridge =(XT X + λI

)minus1XT (XVi) = Vi

D2ii

D2ii + λ

(A1)

Hence v = Vi

Note that since Theorem 2 is a special case of Theorem 3 we will not prove it separatelyWe first provide a lemma

Lemma 1 Consider the ridge regression criterion

Cλ(β) = y minus Xβ2 + λβ2

SPARSE PRINCIPAL COMPONENT ANALYSIS 283

Then if β = arg minβ Cλ(β)

Cλ(β) = yT (I minus Sλ)y

where Sλ is the ridge operator

Sλ = X(XT X + λI)minus1XT

Proof of Lemma 1 Differentiating Cλ with respect to β we get that

minusXT (y minus Xβ) + λβ = 0

Premultiplication by βT and re-arrangement gives λβ2 = (y minus Xβ)T Xβ Since

y minus Xβ2 = (y minus Xβ)T y minus (y minus Xβ)T Xβ

Cλ(β) = (y minus Xβ)T y The result follows since the ldquofitted valuesrdquo Xβ = Sλy

Proof of Theorem 3 We use the notation introduced in Section 3 A = [α1 αk]and B = [β1 βk] Let

Cλ(AB) =nsum

i=1

xi minus ABT xi2 + λ

ksumj=1

βj2

As in (39) we have

nsumi=1

xi minus ABT xi2 = X minus XBAT 2 (A2)

= XAperp2 + XA minus XB2 (A3)

Hence with A fixed solving

arg minB

Cλ(AB)

is equivalent to solving the series of ridge regressions

arg minβjk

1

ksumj=1

Xαj minus Xβj2 + λβj2

It is easy to show that

B = (XT X + λI)minus1XT XA (A4)

and using Lemma 1 and (A2) we have that the partially optimized penalized criterion isgiven by

Cλ(A B) = XAperp2 + tr((XA)T (I minus Sλ)(XA)

) (A5)

284 H ZOU T HASTIE AND R TIBSHIRANI

Rearranging the terms we get

Cλ(A B) = tr(XT X) minus tr(AT XT SλXA) (A6)

which must be minimized with respect to A with AT A = I Hence A should be taken tobe the largest k eigenvectors of XT SλX If the SVD of X = UDVT it is easy to show thatXT SλX = VD2(D2 + λI)minus1D2VT hence A = V[ 1 k] Likewise plugging the SVD ofX into (A4) we see that each of the βj are scaled elements of the corresponding Vj

Proof of Theorem 4 We expand the matrix norm

M minus NAT 2 = tr(MT M) minus 2tr(MT NAT ) + tr(ANT NAT ) (A7)

Since AT A = I the last term is equal to tr(NT N) and hence we need to maximize (minushalf) the middle term With the SVD MT N = UDVT this middle term becomes

tr(MT NAT ) = tr(UDVT AT ) (A8)

= tr(UDAlowastT ) (A9)

= tr(AlowastT UD) (A10)

where Alowast = AV and since V is ktimes k orthonormal AlowastT Alowast = I Now since D is diagonal(A10) is maximized when the diagonal of AlowastT U is positive and maximum By Cauchy-Schwartz inequality this is achieved when Alowast = U in which case the diagonal elementsare all 1 Hence A = UVT

Proof of Theorem 5 Let Blowast

= [βlowast1 β

lowastk ] with βlowast = (1+λ)β then Vi(λ) = βlowast

i

||βlowasti||

On the other hand β = βlowast

1+λ means that

(A Blowast) = arg min

A BCλλ1(AB) subject to AT A = Iktimesk (A11)

where

Cλλ1(AB) =nsum

i=1

xi minus ABT

1 + λxi2 + λ

ksumj=1

βj

1 + λ2 +

ksumj=1

λ1j βj

1 + λ1 (A12)

Since

ksumj=1

βj

1 + λ2 =

1(1 + λ)2

tr(BT B)

and

nsumi=1

xi minus ABT

1 + λxi2 = tr

((X minus X

B1 + λ

AT )T (X minus XB

1 + λAT )

)= tr

(XT X

)+

1(1 + λ)2

tr(BT XT XB

)minus 21 + λ

tr(

AT XT XB)

SPARSE PRINCIPAL COMPONENT ANALYSIS 285

Thus we have

Cλλ1(AB) = tr(XT X

)+

11 + λ

tr

(BT XT X + λI

1 + λB

)minus 2Tr

(AT XT XB

)+

ksumj=1

λ1jβj1

which implies that

(A Blowast) = arg min

A Btr

(BT XT X + λI

1 + λB

)minus 2tr

(AT XT XB

)+

ksumj=1

λ1jβj1 (A13)

subject to AT A = Iktimesk As λ rarr infin tr(

BT XT X+λI1+λ B

)rarr tr(BT B) =

sumkj=1 βj2

Thus (A13) approaches (41) and the conclusion of Theorem 5 follows

ACKNOWLEDGMENTSWe thank the editor an associate editor and referees for helpful comments and suggestions which greatly

improved the manuscript

[Received April 2004 Revised June 2005]

REFERENCES

Alter O Brown P and Botstein D (2000) ldquoSingular Value Decomposition for Genome-Wide Expression DataProcessing and Modelingrdquo in Proceedings of the National Academy of Sciences 97 pp 10101ndash10106

Cadima J and Jolliffe I (1995) ldquoLoadings and Correlations in the Interpretation of Principal ComponentsrdquoJournal of Applied Statistics 22 203ndash214

Efron B Hastie T Johnstone I and Tibshirani R (2004) ldquoLeast Angle Regressionrdquo The Annals of Statistics32 407ndash499

Hancock P Burton A and Bruce V (1996) ldquoFace Processing Human Perception and Principal ComponentsAnalysisrdquo Memory and Cognition 24 26ndash40

Hastie T Tibshirani R and Friedman J (2001) The Elements of Statistical Learning Data mining Inferenceand Prediction New York Springer Verlag

Hastie T Tibshirani R Eisen M Brown P Ross D Scherf U Weinstein J Alizadeh A Staudt Land Botstein D (2000) ldquo lsquogene Shavingrsquo as a Method for Identifying Distinct Sets of Genes With SimilarExpression Patternsrdquo Genome Biology 1 1ndash21

Jeffers J (1967) ldquoTwo Case Studies in the Application of Principal Componentrdquo Applied Statistics 16 225ndash236

Jolliffe I (1986) Principal Component Analysis New York Springer Verlag

(1995) ldquoRotation of Principal Components Choice of Normalization Constraintsrdquo Journal of AppliedStatistics 22 29ndash35

Jolliffe I T Trendafilov N T and Uddin M (2003) ldquoA Modified Principal Component Technique Based onthe Lassordquo Journal of Computational and Graphical Statistics 12 531ndash547

Mardia K Kent J and Bibby J (1979) Multivariate Analysis New York Academic Press

McCabe G (1984) ldquoPrincipal Variablesrdquo Technometrics 26 137ndash144

286 H ZOU T HASTIE AND R TIBSHIRANI

Osborne M R Presnell B and Turlach B A (2000) ldquoA New Approach to Variable Selection in Least SquaresProblemsrdquo IMA Journal of Numerical Analysis 20 389ndash403

Ramaswamy S Tamayo P Rifkin R Mukheriee S Yeang C Angelo M Ladd C Reich M LatulippeE Mesirov J Poggio T Gerald W Loda M Lander E and Golub T (2001) ldquoMulticlass CancerDiagnosis using Tumor Gene Expression Signaturerdquo in Proceedings of the National Academy of Sciences98 pp 15149ndash15154

Tibshirani R (1996) ldquoRegression Shrinkage and Selection via the Lassordquo Journal of the Royal Statistical SocietySeries B 58 267ndash288

Tibshirani R Hastie T Narasimhan B and Chu G (2002) ldquoDiagnosis of Multiple Cancer Types by ShrunkenCentroids of Generdquo in Proceedings of the National Academy of Sciences 99 6567ndash6572

Vines S (2000) ldquoSimple Principal Componentsrdquo Applied Statistics 49 441ndash451

Zou H and Hastie T (2005) ldquoRegularization and Variable Selection via the Elastic Netrdquo Journal of the RoyalStatistical Society Series B 67 301ndash320

Page 18: Sparse Principal Component Analysis - Stanford …web.stanford.edu/~hastie/Papers/spc_jcgs.pdf · Sparse Principal Component Analysis HuiZ OU,Trevor H ASTIE and RobertT IBSHIRANI

282 H ZOU T HASTIE AND R TIBSHIRANI

6 DISCUSSION

It has been an interesting research topic for years to derive principal components withsparse loadings From a practical point of view a good method to achieve the sparsenessgoal should (at least) possess the following properties

bull Without any sparsity constraint the method should reduce to PCAbull It should be computationally efficient for both small p and big p databull It should avoid misidentifying the important variables

The often-used simple thresholding approach is not criterion based However thisinformal method seems to possess the first two of the desirable properties listed above Ifthe explained variance and sparsity are the only concerns simple thresholding is a reasonableapproach and it is extremely convenient We have shown that simple thresholding can workwell with gene expression arrays The serious problem with simple thresholding is that itcan misidentify the real important variables Nevertheless simple thresholding is regardedas a benchmark for any potentially better method

Using the lasso constraint in PCA SCoTLASS successfully derives sparse loadingsHowever SCoTLASS is not computationally efficient and it lacks a good rule to pick itstuning parameter In addition it is not feasible to apply SCoTLASS to gene expressionarrays where PCA is a quite popular tool

In this work we have developed SPCA using our SPCA criterion (312) This newcriterion gives exact PCA results when its sparsity (lasso) penalty term vanishes SPCAallows flexible control on the sparse structure of the resulting loadings Unified efficientalgorithms have been proposed to compute SPCA solutions for both regular multivariatedata and gene expression arrays As a principled procedure SPCA enjoys advantages inseveral aspects including computational efficiency high explained variance and an abilityin identifying important variables

Software in R for fitting the SPCA model (and elastic net models) is available in theCRAN contributed package elasticnet

APPENDIX PROOFS

Proof of Theorem 1 Using XT X = VD2VT and VT V = I we have

βridge =(XT X + λI

)minus1XT (XVi) = Vi

D2ii

D2ii + λ

(A1)

Hence v = Vi

Note that since Theorem 2 is a special case of Theorem 3 we will not prove it separatelyWe first provide a lemma

Lemma 1 Consider the ridge regression criterion

Cλ(β) = y minus Xβ2 + λβ2

SPARSE PRINCIPAL COMPONENT ANALYSIS 283

Then if β = arg minβ Cλ(β)

Cλ(β) = yT (I minus Sλ)y

where Sλ is the ridge operator

Sλ = X(XT X + λI)minus1XT

Proof of Lemma 1 Differentiating Cλ with respect to β we get that

minusXT (y minus Xβ) + λβ = 0

Premultiplication by βT and re-arrangement gives λβ2 = (y minus Xβ)T Xβ Since

y minus Xβ2 = (y minus Xβ)T y minus (y minus Xβ)T Xβ

Cλ(β) = (y minus Xβ)T y The result follows since the ldquofitted valuesrdquo Xβ = Sλy

Proof of Theorem 3 We use the notation introduced in Section 3 A = [α1 αk]and B = [β1 βk] Let

Cλ(AB) =nsum

i=1

xi minus ABT xi2 + λ

ksumj=1

βj2

As in (39) we have

nsumi=1

xi minus ABT xi2 = X minus XBAT 2 (A2)

= XAperp2 + XA minus XB2 (A3)

Hence with A fixed solving

arg minB

Cλ(AB)

is equivalent to solving the series of ridge regressions

arg minβjk

1

ksumj=1

Xαj minus Xβj2 + λβj2

It is easy to show that

B = (XT X + λI)minus1XT XA (A4)

and using Lemma 1 and (A2) we have that the partially optimized penalized criterion isgiven by

Cλ(A B) = XAperp2 + tr((XA)T (I minus Sλ)(XA)

) (A5)

284 H ZOU T HASTIE AND R TIBSHIRANI

Rearranging the terms we get

Cλ(A B) = tr(XT X) minus tr(AT XT SλXA) (A6)

which must be minimized with respect to A with AT A = I Hence A should be taken tobe the largest k eigenvectors of XT SλX If the SVD of X = UDVT it is easy to show thatXT SλX = VD2(D2 + λI)minus1D2VT hence A = V[ 1 k] Likewise plugging the SVD ofX into (A4) we see that each of the βj are scaled elements of the corresponding Vj

Proof of Theorem 4 We expand the matrix norm

M minus NAT 2 = tr(MT M) minus 2tr(MT NAT ) + tr(ANT NAT ) (A7)

Since AT A = I the last term is equal to tr(NT N) and hence we need to maximize (minushalf) the middle term With the SVD MT N = UDVT this middle term becomes

tr(MT NAT ) = tr(UDVT AT ) (A8)

= tr(UDAlowastT ) (A9)

= tr(AlowastT UD) (A10)

where Alowast = AV and since V is ktimes k orthonormal AlowastT Alowast = I Now since D is diagonal(A10) is maximized when the diagonal of AlowastT U is positive and maximum By Cauchy-Schwartz inequality this is achieved when Alowast = U in which case the diagonal elementsare all 1 Hence A = UVT

Proof of Theorem 5 Let Blowast

= [βlowast1 β

lowastk ] with βlowast = (1+λ)β then Vi(λ) = βlowast

i

||βlowasti||

On the other hand β = βlowast

1+λ means that

(A Blowast) = arg min

A BCλλ1(AB) subject to AT A = Iktimesk (A11)

where

Cλλ1(AB) =nsum

i=1

xi minus ABT

1 + λxi2 + λ

ksumj=1

βj

1 + λ2 +

ksumj=1

λ1j βj

1 + λ1 (A12)

Since

ksumj=1

βj

1 + λ2 =

1(1 + λ)2

tr(BT B)

and

nsumi=1

xi minus ABT

1 + λxi2 = tr

((X minus X

B1 + λ

AT )T (X minus XB

1 + λAT )

)= tr

(XT X

)+

1(1 + λ)2

tr(BT XT XB

)minus 21 + λ

tr(

AT XT XB)

SPARSE PRINCIPAL COMPONENT ANALYSIS 285

Thus we have

Cλλ1(AB) = tr(XT X

)+

11 + λ

tr

(BT XT X + λI

1 + λB

)minus 2Tr

(AT XT XB

)+

ksumj=1

λ1jβj1

which implies that

(A Blowast) = arg min

A Btr

(BT XT X + λI

1 + λB

)minus 2tr

(AT XT XB

)+

ksumj=1

λ1jβj1 (A13)

subject to AT A = Iktimesk As λ rarr infin tr(

BT XT X+λI1+λ B

)rarr tr(BT B) =

sumkj=1 βj2

Thus (A13) approaches (41) and the conclusion of Theorem 5 follows

ACKNOWLEDGMENTSWe thank the editor an associate editor and referees for helpful comments and suggestions which greatly

improved the manuscript

[Received April 2004 Revised June 2005]

REFERENCES

Alter, O., Brown, P., and Botstein, D. (2000), "Singular Value Decomposition for Genome-Wide Expression Data Processing and Modeling," Proceedings of the National Academy of Sciences, 97, 10101-10106.

Cadima, J., and Jolliffe, I. (1995), "Loadings and Correlations in the Interpretation of Principal Components," Journal of Applied Statistics, 22, 203-214.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004), "Least Angle Regression," The Annals of Statistics, 32, 407-499.

Hancock, P., Burton, A., and Bruce, V. (1996), "Face Processing: Human Perception and Principal Components Analysis," Memory and Cognition, 24, 26-40.

Hastie, T., Tibshirani, R., and Friedman, J. (2001), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, New York: Springer-Verlag.

Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Ross, D., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L., and Botstein, D. (2000), "'Gene Shaving' as a Method for Identifying Distinct Sets of Genes With Similar Expression Patterns," Genome Biology, 1, 1-21.

Jeffers, J. (1967), "Two Case Studies in the Application of Principal Component Analysis," Applied Statistics, 16, 225-236.

Jolliffe, I. (1986), Principal Component Analysis, New York: Springer-Verlag.

Jolliffe, I. (1995), "Rotation of Principal Components: Choice of Normalization Constraints," Journal of Applied Statistics, 22, 29-35.

Jolliffe, I. T., Trendafilov, N. T., and Uddin, M. (2003), "A Modified Principal Component Technique Based on the Lasso," Journal of Computational and Graphical Statistics, 12, 531-547.

Mardia, K., Kent, J., and Bibby, J. (1979), Multivariate Analysis, New York: Academic Press.

McCabe, G. (1984), "Principal Variables," Technometrics, 26, 137-144.

Osborne, M. R., Presnell, B., and Turlach, B. A. (2000), "A New Approach to Variable Selection in Least Squares Problems," IMA Journal of Numerical Analysis, 20, 389-403.

Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J., Poggio, T., Gerald, W., Loda, M., Lander, E., and Golub, T. (2001), "Multiclass Cancer Diagnosis Using Tumor Gene Expression Signatures," Proceedings of the National Academy of Sciences, 98, 15149-15154.

Tibshirani, R. (1996), "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Series B, 58, 267-288.

Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G. (2002), "Diagnosis of Multiple Cancer Types by Shrunken Centroids of Gene Expression," Proceedings of the National Academy of Sciences, 99, 6567-6572.

Vines, S. (2000), "Simple Principal Components," Applied Statistics, 49, 441-451.

Zou, H., and Hastie, T. (2005), "Regularization and Variable Selection via the Elastic Net," Journal of the Royal Statistical Society, Series B, 67, 301-320.
