Sparse Partial Least Squares Regression with an Application to

Genome Scale Transcription Factor Analysis

Hyonho Chun

Department of Statistics

University of Wisconsin

Madison, WI 53706

email: [email protected]

Sündüz Keleş∗

Department of Statistics

Department of Biostatistics and Medical Informatics

University of Wisconsin

Madison, WI 53706

email: [email protected]

November 8, 2007

∗Corresponding Author: [email protected]

Author’s Footnote:

Hyonho Chun is a Ph.D. student in the Department of Statistics at the University of Wisconsin, Madison. Sündüz Keleş is an assistant professor in the Departments of Statistics and of Biostatistics and Medical Informatics at the University of Wisconsin, Madison. This research has been supported in part by NIH grants 1-R01-HG03747-01 (S.K.) and 4-R37-GM038660-20 (S.K., H.C.).

Abstract

Analysis of modern biological data often involves ill-posed problems due to high dimensionality and multicollinearity. Partial Least Squares (pls) regression has been an alternative to ordinary least squares for handling multicollinearity in several areas of scientific research since the 1960s (Wold 1966). At the core of the pls methodology lies a dimension reduction technique coupled with a regression model. Although pls regression has been shown to achieve good predictive performance, it is not particularly tailored for variable/feature selection and therefore often produces linear combinations of the original predictors that are hard to interpret due to high dimensionality. In this paper, we investigate the known asymptotic properties of the pls estimator under the special case of normality and show that its consistency breaks down with the very large p and small n paradigm. We then propose a sparse partial least squares (spls) formulation which aims to simultaneously achieve good predictive performance and variable selection by producing sparse linear combinations of the original predictors. We provide an efficient implementation of spls based on the lars algorithm (Efron et al. 2004). An additional advantage of the spls algorithm is that it naturally handles multivariate responses. We illustrate the methodology in a joint analysis of gene expression and genome-wide binding data.

Keywords: Partial Least Squares; Dimension Reduction; Variable/Feature Selection; Sparsity; LASSO; Microarrays; Gene expression; ChIP-chip data.

1. INTRODUCTION

With the recent advancements in biotechnology such as the use of genome-wide microarrays and high throughput sequencing, regression-based modeling of high dimensional data in biology has never been more important. Two important statistical problems commonly arise within the regression problems that concern modern biological data. The first of these is the selection of a set of important or relevant variables among a large number of predictors. Utilizing the sparsity principle, e.g., operating under the assumption that a small subset of the variables is driving the underlying process, with an L1 penalty has recently been promoted as an effective solution to this problem (Tibshirani 1996; Efron et al. 2004). The second problem is related to the fact that such a variable selection exercise often arises as an ill-posed problem where (1) the sample size (n) is much smaller than the total number of variables (p); (2) the covariates are highly correlated. Dimension reduction techniques such as principal components analysis (pca) or partial least squares (pls) have recently gained attention for handling these two scenarios within the context of genomic data (reviewed in Boulesteix and Strimmer (2006)).

Although dimension reduction via principal component analysis (pca) or partial least squares (pls) is a

principled way of dealing with ill-posed problems, it does not automatically lead to the selection of relevant

variables. Typically, all or a large portion of the variables contribute to the final direction vectors which

represent linear combinations of the original predictors. If one can impose sparsity in the midst of the

dimension reduction step, it might be possible to achieve both dimension reduction and variable selection

simultaneously. Recently, Huang et al. (2004) proposed a penalized partial least squares method to impose

sparsity on the partial least squares estimates by using a simple soft thresholding rule on the final coefficient

estimates. Although this serves as a way of imposing sparsity on the solution itself, the sparsity principle

is not incorporated during the dimension reduction step. Therefore, this does not necessarily lead to sparse

linear combinations of the original predictors. In this paper, our goal is to impose sparsity on the dimension

reduction step of pls so that sparsity can play a direct principled role in the pls regression.

The rest of the paper is organized as follows. We review the general principles of the pls methodology in Section 2. In this section, we show that pls regression provides consistent estimators of the regression coefficients only under restricted conditions, and that the consistency breaks down with the very large p and small n (p >> n) paradigm. We then formulate sparse partial least squares (spls) in Section 3. We relate this formulation to the sparse principal components analysis formulations (Jolliffe et al. 2003; Zou et al. 2006) and provide an efficient algorithm for solving the spls regression problem in Section 4. Methods for tuning the sparsity parameter, choosing the number of components, and dealing with multivariate responses are also discussed within the course of this section. Several simulation studies investigating the operating characteristics of the spls regression and an application to transcription factor activity analysis by integrating microarray gene expression data and chip-chip data are provided in Sections 5 and 6.

2. PARTIAL LEAST SQUARES REGRESSION

2.1 Description of partial least squares regression

Partial least squares (pls) regression, introduced by Wold (1966), has been used as an alternative approach to ordinary least squares regression (ols) in ill-conditioned linear regression models that arise in several disciplines such as chemistry, economics, medicine, psychology, and pharmaceutical science (de Jong 1993). At the core of pls regression is a dimension reduction technique that operates under the assumption of the following basic latent decomposition of the response variable (Y) and predictors (X), where Y and X are n × q and n × p matrices, respectively:

Y = T Q^T + F,
X = T P^T + E.

Here, T ∈ R^{n×K} is a matrix that produces K linear combinations (scores), P ∈ R^{p×K} and Q ∈ R^{q×K} are matrices of coefficients (loadings), and E ∈ R^{n×p} and F ∈ R^{n×q} are matrices of random errors (reviewed in Boulesteix and Strimmer (2006)).

In order to specify the latent component matrix T such that T = XW, pls requires finding the columns of W = (w_1, w_2, ..., w_K) from successive optimization problems. The criterion to find the k-th direction vector w_k for univariate Y is explicitly formulated as

w_k = argmax_{w ∈ R^p} corr^2(Y, Xw) var(Xw)        (1)
      s.t. w^T Σ_XX w_j = 0, j = 1, ..., k − 1,
           w^T w = 1,

where Σ_XX is the covariance matrix of X (Stone and Brooks 1974). As evident from the previous formulation, pls seeks direction vectors that not only relate X to Y but also capture the most variable directions in the X space (Frank and Friedman 1993).

Although there are several variants of the previous criterion for finding W for a multivariate response matrix Y, we focus here on the statistically inspired modification of pls (simpls), which is a natural extension of the univariate pls. The criterion for the k-th direction vector of pls, where Y is a multivariate response matrix, is formulated as

w_k = argmax_w w^T σ_XY σ_XY^T w
      s.t. w^T w = 1 and w^T Σ_XX w_j = 0 for j = 1, ..., k − 1,

where σ_XY is the covariance matrix of X and Y. We note that the objective function of simpls is the sum of the individual objective functions of the univariate pls regressions:

w^T σ_XY σ_XY^T w = ∑_{j=1}^q cov^2(Xw, Y_j).        (2)

When estimating the matrix W, sample versions of the covariance matrices (S_XX, S_XY) are used instead of their unknown population versions (Σ_XX, σ_XY) in the preceding formulations (1) and (2). Specifically, the k-th direction vector w_k is estimated by ŵ_k:

ŵ_k = argmax_{w ∈ R^p} w^T X^T Y Y^T X w        (3)
      s.t. w^T w = 1 and w^T S_XX ŵ_j = 0 for j = 1, ..., k − 1.

After the estimation of the latent component matrix (T), the loadings (Q) are estimated via ordinary least squares for the model Y = TQ^T + E. Note that we can rewrite this model as

Y = XWQ^T + E
  = Xβ_PLS + E,

where β_PLS represents a reparametrization and is estimated by β̂_PLS = ŴQ̂^T, where Ŵ and Q̂ are estimates of W and Q.

pls can also be described as an optimization algorithm without the notion of the basic latent decomposition. pls for univariate Y is known as the conjugate gradient (cg) algorithm (Gill et al. 1981) for solving the following least squares problem

min_β (1/n) (Y − Xβ)^T (Y − Xβ)

(Friedman and Popescu 2004). The cg algorithm solves this optimization problem by utilizing only the negative gradient of the objective function, g(β) = (1/n) X^T (Y − Xβ). The algorithm starts from β_0 = 0 and updates β_k as β_{k−1} + ρ_k s_k at each step, where

s_k = g_k + (g_k^T g_k)/(g_{k−1}^T g_{k−1}) s_{k−1},
ρ_k = argmin_ρ (1/n) ∑_{i=1}^n (Y_i − X_i(β_{k−1} + ρ s_k))^2, for k = 1, ..., K and K ≤ p,

g_k = g(β_k), and s_0 = 0. The cg algorithm with an early stop solves the previous least squares problem while avoiding the potential singularity of the matrix X^T X. It can be easily verified that s_1, ..., s_K match the pls direction vectors w_1, ..., w_K up to a constant, and that ||s_k||_2 ρ_k, for 1 ≤ k ≤ K, match the entries of the loadings Q^T. Therefore, Wold's pls algorithm, described in the following box, is essentially another description of the cg algorithm.

Wold's pls algorithm

1. Initialize: Y_0 ← Y, X_0 ← X, and Ŷ_0 ← 0.

2. For k = 1 to p:

   2.1 φ_k = X_{k−1}^T Y_{k−1} / ||X_{k−1}^T Y_{k−1}||_2

   2.2 g_k = X_{k−1} φ_k

   2.3 r_k = [(g_k^T Y_{k−1})/(g_k^T g_k)] g_k

   2.4 Ŷ_k = Ŷ_{k−1} + r_k

   2.5 Y_k = Y_{k−1} − r_k

   2.6 X_k = X_{k−1} − (g_k^T X_{k−1})/(g_k^T g_k) ⊗ g_k

   2.7 If 1^T X_k^T X_k 1 = 0, then exit.

3. Denote by K the final k of the algorithm. Since {g_k}_1^K are linear in X, so is Ŷ_K. β̂_PLS, which is the solution of Ŷ_K = Xβ̂_PLS, can be recovered from the sequence of pls transformations.
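For concreteness, the following is a minimal NumPy sketch of the boxed algorithm for univariate Y (our own translation, not the authors' implementation). It assumes centered X and Y, and recovers β̂_PLS through the standard PLS identity β = W(P^T W)^{-1} Q, where W, P, and Q collect the weights and loadings accumulated during the deflation steps.

```python
import numpy as np

def wold_pls(X, Y, K):
    """Sketch of the boxed PLS algorithm for univariate Y.
    X: (n, p) centered predictors; Y: (n,) centered response.
    Returns beta such that the PLS fit is X @ beta."""
    Xk, Yk = X.astype(float).copy(), Y.astype(float).copy()
    Ws, Ps, Qs = [], [], []
    for _ in range(K):
        phi = Xk.T @ Yk                      # step 2.1: X_{k-1}^T Y_{k-1}
        nrm = np.linalg.norm(phi)
        if nrm == 0:                         # step 2.7: nothing left to fit
            break
        phi /= nrm
        g = Xk @ phi                         # step 2.2: latent score g_k
        gg = g @ g
        q = (g @ Yk) / gg                    # loading of Y on g_k (step 2.3)
        p = (Xk.T @ g) / gg                  # loading of X on g_k
        Yk = Yk - q * g                      # step 2.5: deflate the residual
        Xk = Xk - np.outer(g, p)             # step 2.6: deflate the predictors
        Ws.append(phi); Ps.append(p); Qs.append(q)
    W, P, Q = np.column_stack(Ws), np.column_stack(Ps), np.array(Qs)
    # Step 3: express the fit on the original X scale.
    return W @ np.linalg.solve(P.T @ W, Q)
```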

As a summary, pls regression is based on a basic latent decomposition and can be viewed as a cg

algorithm for solving the least squares problem with singular X. pls utilizes principles of dimension reduction

by obtaining a small number of latent components that are linear combinations of the original variables to

avoid multi-collinearity.

2.2 An asymptotic property of PLS

Stoica and Söderström (1998) derived asymptotic formulae for the bias and variance of the pls estimator. These asymptotic formulae are valid if the "signal-to-noise ratio" in the data Y is high or if the sample size (n) is large and the predictors are uncorrelated with the residuals. Naik and Tsai (2000) proved the consistency of the pls estimator under normality assumptions on Y and X with additional conditions, including consistency of S_XY and S_XX and the Helland and Almøy (1994) condition, which is stated as follows:

Condition 1. There exist eigenvectors v_j (j = 1, ..., K) of Σ_XX corresponding to different eigenvalues λ_j, such that σ_XY = ∑_{j=1}^K α_j v_j and α_1, ..., α_K are non-zero.

The Helland and Almøy (1994) condition implies that an integer K exists such that exactly K of the eigenvectors v_j have nonzero components along σ_XY.

We note that the consistency proof of Naik and Tsai (2000) requires the number of variables (p) to be fixed and much smaller than the number of observations (n). In many fields of modern genomic research, datasets contain a large number of variables with a much smaller number of observations (e.g., gene expression datasets where the variables are in the order of thousands and the sample size is in the order of tens). Therefore, we investigate the consistency of the pls regression estimator under the very large p and small n paradigm and extend the result of Naik and Tsai (2000) to the case where p is allowed to grow with n at an appropriate rate. For this, we add assumptions to those of Naik and Tsai (2000), since the consistency of S_XX and S_XY, which is the conventional assumption for fixed p in their consistency proof, requires other assumptions to be satisfied for large p and small n problems. Recently, Johnstone and Lu (2004) proved that the leading principal component of S_XX is consistent if and only if p/n → 0. Hence, we adopt the assumptions of Johnstone and Lu (2004) for X to ensure consistency of S_XX and S_XY. Given the existing connection between pls and pca regressions (Helland 1990; Stoica and Söderström 1998), imposing the assumptions of pca on pls regression is expected to yield similar asymptotic results.

The assumptions for X from Johnstone and Lu (2004) are as follows:

Assumption 1. Assume that each row of X = (x_1^T, ..., x_n^T)^T follows the model x_i = ∑_{j=1}^m v_i^j ρ^j + σ_1 z_i, for some constant σ_1, where

1. ρ^j, j = 1, ..., m ≤ p, are mutually orthogonal principal components with norms ||ρ^1|| ≥ ||ρ^2|| ≥ ... ≥ ||ρ^m||.

2. The multipliers v_i^j ∼ N(0, 1) are all independent over j = 1, ..., m and i = 1, ..., n.

3. The noise vectors z_i ∼ N(0, I_p) are independent among themselves and also of the random effects {v_i^j}.

4. p(n), m(n), and {ρ^j(n), j = 1, ..., m} are functions of n, and the norms of the principal components converge as sequences:

   ϱ(n) = (||ρ^1(n)||, ..., ||ρ^j(n)||, ...) → ϱ = (ϱ_1, ..., ϱ_j, ...).

   We also write ϱ_+ for the limiting l_1 norm: ϱ_+ = ∑_j ϱ_j.

We remark that this assumed factor model for X is similar to the factor model of Helland (1990). It differs from the model of Helland (1990) only in that it has an additional random error term z_i. However, all properties of pls in Helland (1990) will hold, as the eigenvectors of Σ_XX and Σ_XX − σ_1^2 I_p are the same. The assumptions for Y are also contained in Helland (1990), and are required in the consistency proof of Naik and Tsai (2000). As an additional assumption, we impose a finite norm condition on β.

Assumption 2. Assume that Y and X have the following relationship:

Y = Xβ + σ_2 e,

where e ∼ N(0, I_n), ||β||_2 < ∞, and σ_2 is a constant.

In summary, we assume the statistical model of Helland (1990):

(x_i^T, y_i)^T ∼ N( 0, [ Σ_XX, σ_XY ; σ_XY^T, σ_YY ] ),

where x_i is the i-th row of X, with additional assumptions on X and β. Trivially, β = Σ_XX^{-1} σ_XY if Σ_XX^{-1} exists. We next show that, under these assumptions and Condition 1, the pls estimator is consistent if p grows much slower than n, and otherwise the pls estimator is not consistent.

Theorem 1. Under Assumptions 1 and 2, and Condition 1,

1. if p/n → 0, then ||β̂_PLS − β||_2 → 0 in probability;

2. if p/n → c for c > 0, then ||β̂_PLS − β||_2 > 0 in probability.

The main implication of this theorem is that the pls estimator is consistent only under restricted conditions and is not suitable for very large p and small n problems in complete generality. Although pls uses a dimension reduction technique by utilizing a small number of latent factors, it cannot avoid the sample size issue, since a reasonable number of samples is required to consistently estimate the sample covariance matrices, as shown in the proof of Theorem 1 in the Appendix.

Although datasets in many fields of modern genomic research contain a large number of variables, it is often hypothesized that a few variables are important and the remaining majority of them are unimportant (West 2003). We next explicitly illustrate how a large number of noise, i.e., irrelevant, variables affects the pls estimator through a simple example. We will make use of the closed form solution for pls regression,

β̂_PLS = R(R^T S_XX R)^{-1} R^T S_XY,

where R = (S_XY, S_XX S_XY, S_XX^2 S_XY, ..., S_XX^{K−1} S_XY) (Helland 1990).

Assume that X is partitioned into (X_1, X_2), where X_1 and X_2 denote the p_1 relevant variables and the p − p_1 irrelevant variables, respectively, and each column of X_2 follows N(0, I_n). We assume the existence of one latent variable (K = 1) as well as a fixed number of relevant variables (p_1), and let p grow at the rate O(cn), where c is large enough to have

max(σ_{X_1Y}^T σ_{X_1Y}, σ_{X_1Y}^T Σ_{X_1X_1} σ_{X_1Y}) << c σ_1^2 σ_2^2.        (4)

It is not too difficult to obtain a large enough c to satisfy (4) considering that p_1 is fixed. Then, for sufficiently large c, the pls estimator can be approximated as follows from Helland (1990)'s closed form solution:

β̂_PLS = (S_XY^T S_XY) / (S_XY^T S_XX S_XY) · S_XY
       = (S_{X_1Y}^T S_{X_1Y} + S_{X_2Y}^T S_{X_2Y}) / (S_{X_1Y}^T S_{X_1X_1} S_{X_1Y} + 2 S_{X_1Y}^T S_{X_1X_2} S_{X_2Y} + S_{X_2Y}^T S_{X_2X_2} S_{X_2Y}) · S_XY
       ≈ (S_{X_2Y}^T S_{X_2Y}) / (S_{X_2Y}^T S_{X_2X_2} S_{X_2Y}) · S_XY        (5)
       ≈ O(1/c) S_XY.        (6)

Approximation (5) follows from the fact that S_{X_2Y}^T S_{X_2Y} → c σ_1^2 σ_2^2 by Lemma 1 in the Appendix and the assumption given by (4). Approximation (6) is due to the fact that the largest and smallest eigenvalues of the Wishart matrix are O(c) (Geman 1980). In this example, the large number of noise variables forces the coefficient estimate in the direction of S_XY to be attenuated (O(c^{-1})) and therefore causes inconsistency.
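The attenuation can be checked numerically with the K = 1 closed form above. The following is a small sketch; the simulation settings are ours and chosen only for illustration.

```python
import numpy as np

def pls1_beta(X, Y):
    """K = 1 closed form: beta = (S_XY^T S_XY) / (S_XY^T S_XX S_XY) S_XY."""
    n = len(Y)
    s = X.T @ Y / n                               # S_XY
    Xs = X @ s                                    # avoids forming S_XX explicitly
    return (s @ s) / (Xs @ Xs / n) * s

rng = np.random.default_rng(0)
n, p1 = 50, 5
for p_noise in (0, 100, 1000, 5000):
    X1 = rng.standard_normal((n, p1))             # relevant variables
    X2 = rng.standard_normal((n, p_noise))        # pure noise variables
    X = np.hstack([X1, X2])
    Y = X1.sum(axis=1) + rng.standard_normal(n)   # response depends on X1 only
    # the coefficients of the relevant variables shrink toward zero as p_noise grows
    print(p_noise, np.round(pls1_beta(X, Y)[:p1], 3))
```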

From a practical point of view, when there are many noise variables, the latent factors almost always have contributions from all the variables. Thus, noise variables get nonzero estimates in the pls regression, and the interpretation of the latent factors becomes difficult. Motivated by the observation that noise variables enter the pls regression via the direction vectors and attenuate the estimates of the regression parameters, we consider imposing sparsity on the direction vectors in order to obtain easily interpretable direction vectors as well as consistent estimators.

3. SPARSE PARTIAL LEAST SQUARES (SPLS) REGRESSION

3.1 Finding SPLS direction vectors

We first focus on the formulation of the first spls direction vector, and discuss the derivation of the remaining direction vectors and coefficients (loadings) in Section 4.

Imposing an L1 penalty is a popular and well studied choice for obtaining a sparse solution (Tibshirani 1996; Efron et al. 2004; Donoho 2006), and we formulate the objective function for the spls direction vector by imposing an additional L1 constraint on the maximization problem (3) of Section 2.1:

max_w w^T M w    s.t. w^T w = 1, |w|_1 ≤ λ,        (7)

where M = X^T Y Y^T X and λ determines the amount of sparsity. The same approach, that is, imposing an additional L1 constraint, has been used in the context of deriving sparse principal components in the simplified component technique lasso (scotlass) by Jolliffe et al. (2003). As evident from the formulation (7), the problems of deriving spls and spca correspond to the same maximum eigenvalue problem for a symmetric matrix with an additional sparsity constraint. By specifying the symmetric matrix M to be X^T X in (7), this objective function coincides with the objective function of scotlass.

Although imposing an L1 penalty on the direction vector ought to yield a sparse vector, Jolliffe et al. (2003) pointed out that the solution tends not to be sparse enough (Figure 1). Furthermore, the objective function is not convex. Later, this convexity issue was revisited by Alexandre d'Aspremont and Lanckriet (2007) in direct sparse principal component analysis (dspca). They reformulated the objective function in terms of W = ww^T rather than w itself, thus producing a semidefinite programming problem that is known to be convex. However, the issue regarding sparsity remained.

[Figure 1 about here.]

To get a sparse enough solution, we reformulate the spls objective function in (7) by generalizing the regression formulation of spca (Zou et al. 2006). This new regression formulation of spls promotes the exact zero property for the coefficients by imposing an L1 penalty on a surrogate of the direction vector (c) instead of the original direction vector (α), while keeping α and c close to each other:

min_{α,c} −κ α^T M α + (1 − κ)(c − α)^T M(c − α) + λ_1 |c|_1 + λ_2 |c|^2    s.t. α^T α = 1.        (8)

The first L1 penalty encourages sparsity in the surrogate direction vector c, and the second L2 penalty takes care of potential singularity in M when solving for c. We rescale c to have norm 1 and use this scaled version as the estimated direction vector w. We note that when α = c and M = X^T X, this objective function becomes that of scotlass; when κ = 1/2 and M = X^T X, the formulation becomes that of spca (Zou et al. 2006); and when κ = 1, the formulation becomes the original maximum eigenvalue problem. By using a small κ, we aim to reduce the effect of the concave part of the objective function and, therefore, reduce the local solution issue. We thus utilize this generalized regression formulation as our method of choice. Next, we describe the solution of (8) in detail.

3.2 Solution for generalized regression formulation of spls

We solve the generalized regression formulation of spls by alternately iterating between solving for α with c fixed and solving for c with α fixed.

We begin with the problem of solving for α for a given c. The objective function of (8) then becomes

argmin_α −κ α^T M α + (1 − κ)(c − α)^T M(c − α)    s.t. α^T α = 1.        (9)

The order of the problem varies depending on κ; it is quadratic in α for 0 < κ < 1/2, whereas it becomes linear when κ = 1/2. We hence present the solution of (9) separately depending on κ. For 0 < κ < 1/2, the objective function is

argmin_α (α − (1−κ)/(1−2κ) c)^T M (α − (1−κ)/(1−2κ) c)    s.t. α^T α = 1
= argmin_α (Z^T α − (1−κ)/(1−2κ) Z^T c)^T (Z^T α − (1−κ)/(1−2κ) Z^T c)    s.t. α^T α = 1,

where Z = X^T Y. This has the same form as the constrained least squares problem, which can be solved via the method of Lagrange multipliers. The solution is obtained as α = (1−κ)/(1−2κ) (M + λ*I)^{-1} M c, where the multiplier λ* is the solution of the equation c^T M(M + λI)^{-2} M c = ((1−2κ)/(1−κ))^2. For κ = 1/2, (9) becomes

argmin_α −α^T M c    s.t. α^T α = 1,

and the solution is given by α = UV^T, where U and V are obtained from the singular value decomposition of Mc (Zou et al. 2006).

When 0 < κ < 1/2, we suggest setting a large value as the starting guess for c, because the constraint α^T α = 1 becomes equivalent to α^T α ≤ 1 when c is located outside of the unit circle, i.e., (1−κ)/(1−2κ) ||c||_2 ≥ 1. It is more convenient to have an inequality constraint in the optimization problem. By setting the initial c to be large, the solution can be obtained through an easier optimization problem.

Next, we examine the problem of solving for c for a given α. In this case, the objective function of (8) becomes

argmin_c (c − α)^T M(c − α) + λ_1 |c|_1 + λ_2 |c|^2        (10)
= argmin_c (Z^T c − Z^T α)^T (Z^T c − Z^T α) + λ_1 |c|_1 + λ_2 |c|^2.

This has the same structure as the elastic net problem of Zou and Hastie (2005), with Y in the elastic net problem replaced by Z^T α. As discussed by Zou and Hastie (2005), this problem can be solved by using the lars algorithm (Efron et al. 2004).

We note that spls often requires a large λ_2 value when solving for c, due to the fact that the design matrix Z^T in (10) is a q × p matrix, where q is usually small. As a remedy, we use a large λ_2 when solving (10), which yields a solution in the form of a soft thresholded estimator (Zou and Hastie 2005).
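To make the iteration concrete, here is a small sketch of the alternating α/c updates for multivariate Y with κ = 1/2 and a large λ_2, in which case the c-step reduces to a soft threshold; we write the threshold in the fraction-of-maximum form introduced later in Section 4.2. This is our reading of the procedure, not the authors' code, and the function and variable names are ours.

```python
import numpy as np

def soft(v, thr):
    """Componentwise soft thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - thr, 0.0)

def spls_direction_multi(X, Y, eta, n_iter=100, tol=1e-6):
    """Sketch of the alternating alpha/c updates of (8) for multivariate Y
    with kappa = 1/2 and a large lambda_2 (c-step becomes a soft threshold)."""
    Z = X.T @ Y                                          # p x q
    M = Z @ Z.T                                          # the matrix M in (7)/(8)
    c = np.linalg.svd(Z, full_matrices=False)[0][:, 0]   # leading eigenvector of M
    for _ in range(n_iter):
        Mc = M @ c
        alpha = Mc / np.linalg.norm(Mc)                  # alpha-step: UV^T of the SVD of Mc
        Ma = M @ alpha
        c_new = soft(Ma, eta * np.abs(Ma).max())         # c-step: soft-thresholded surrogate
        c_new /= np.linalg.norm(c_new)
        if min(np.linalg.norm(c_new - c), np.linalg.norm(c_new + c)) < tol:
            c = c_new
            break
        c = c_new
    return c                                             # estimated sparse direction vector
```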

As a special case, for univariate Y, i.e., q = 1, the first direction vector of spls can be computed in one step without any iterations. As we show in Theorem 2, the first direction vector is easily obtained by simple thresholding of the original pls direction vector. This is different from the empirical thresholding described in Huang et al. (2004), because in our context thresholding arises as the solution of a well defined optimization problem and operates on the direction vector rather than on the final estimate of β.

Theorem 2. For univariate Y, the solution of the regression formulation (8) is

c = (|Z|/||Z|| − λ_1/2)_+ sign(Z),

where Z = X^T Y and is proportional to the first direction vector of pls.

Proof. We first consider the case κ = 0.5. For a given c, α is proportional to UV^T, where U and V come from the singular value decomposition of ZZ^T c, by Theorem 4 of Zou et al. (2006). Because the singular value decomposition of ZZ^T c yields U = Z/||Z||, V = 1, and D = ||Z|| Z^T c, the solution α becomes Z/||Z||, which does not depend on c. Therefore, one can obtain the solution for c as (|Z|/||Z|| − λ_1/2)_+ sign(Z) for sufficiently large λ_2.

When κ < 0.5, for a given c, the formulation becomes a ridge regression problem, and the solution is α = (ZZ^T + λ*I)^{-1} ZZ^T c. Since Z is proportional to the eigenvector of the matrix ZZ^T, we can calculate the inverse of the matrix by using the Woodbury formula (Golub and Van Loan 1987):

α = (ZZ^T + λ*I)^{-1} ZZ^T c
  = ( (1/λ*) I − (1/(λ*(||Z||^2 + λ*))) ZZ^T ) ZZ^T c
  = ( 1/λ* − ||Z||^2/(λ*(||Z||^2 + λ*)) ) ZZ^T c
  = ( (1/(||Z||^2 + λ*)) Z^T c ) Z.

Noting that (1/(||Z||^2 + λ*)) Z^T c is a scalar and α should satisfy the norm constraint, it follows that α = Z/||Z||. We again observe that α is proportional to Z and does not depend on c.
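In code, the univariate shortcut of Theorem 2 amounts to a single soft-thresholding pass over Z = X^T Y (interpreting the (·)_+ operation componentwise). A minimal sketch with our own names:

```python
import numpy as np

def spls_first_direction(X, Y, lam1):
    """First SPLS direction vector for univariate Y (sketch of Theorem 2)."""
    Z = X.T @ Y
    Zn = Z / np.linalg.norm(Z)                                 # first PLS direction, up to scale
    c = np.sign(Z) * np.maximum(np.abs(Zn) - lam1 / 2.0, 0.0)  # componentwise soft threshold
    nrm = np.linalg.norm(c)
    return c / nrm if nrm > 0 else c                           # rescale to unit norm
```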

4. IMPLEMENTATION AND ALGORITHMIC DETAILS

4.1 spls algorithm

In this section, we present the complete spls algorithm which encompasses the previous formulation for the

first spls direction vector as well as an efficient algorithm for all the other direction vectors and coefficients.

In principle, the objective function for the first spls direction vector could be utilized at each step of Wold's pls algorithm to obtain the rest of the spls direction vectors. However, this naive application loses the conjugacy property of the direction vectors. This is certainly not desirable, because conjugacy plays a crucial role in many appealing properties of the pls regression, including computational efficiency and theoretical convergence of the algorithm (van der Sluis and van der Vorst 1986). A similar issue also appears in spca, where none of the proposed methods (Jolliffe et al. 2003; Zou et al. 2006; Alexandre d'Aspremont and Lanckriet 2007) produce orthogonal sparse principal components. Although conjugacy among sparse direction vectors can be obtained by Gram-Schmidt conjugation of the derived sparse direction vectors, these post-conjugated direction vectors do not inherit the Krylov subsequence property, which is known to be crucial for the convergence of the algorithm (Krämer 2007). In other words, such a post orthogonalization does not guarantee the existence of the solution among the iterations.

To address this concern, we propose an spls algorithm which yields sparse solutions by keeping the Krylov subsequence structure of the direction vectors in a restricted X space composed of the selected variables. The algorithm searches for relevant variables, so-called active variables, by optimizing (8) from Section 3, and updates all direction vectors to form a Krylov subsequence on the subspace of the active variables. Define A to be the index set of active variables, K the number of components, and X_A the matrix of covariates contained in A.

spls algorithm

1. Set β̂_PLS = 0, A = { }, k = 1, and Y_1 = Y.

2. While (k ≤ K),

   2.1 Find ŵ by solving the objective (8) in Section 3 with M = X^T Y_1 Y_1^T X.

   2.2 Update A as {i : ŵ_i ≠ 0} ∪ {i : β̂_{PLS,i} ≠ 0}.

   2.3 Fit pls with X_A by using k latent components.

   2.4 Update β̂_PLS by using the new pls estimates of the direction vectors, and update Y_1 and k through Y_1 ← Y − Xβ̂_PLS and k ← k + 1.
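The following is a minimal sketch of the boxed algorithm for univariate Y, using scikit-learn's PLSRegression for the restricted pls fit in step 2.3 and the fraction-of-maximum soft threshold of Section 4.2 for step 2.1. This is our own illustrative translation, not the authors' implementation; for multivariate Y, the direction-finding step would instead use the iteration of Section 3.2.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def spls_univariate(X, Y, K, eta):
    """Sketch of the boxed SPLS algorithm for univariate Y.
    eta in [0, 1) is the soft-thresholding level of step 2.1;
    K is the number of latent components."""
    n, p = X.shape
    active = np.zeros(p, dtype=bool)
    Y = np.asarray(Y, dtype=float).ravel()
    Y1 = Y.copy()                                   # current residual Y_1
    fit = None
    for k in range(1, K + 1):
        # Step 2.1: sparse direction from M = X^T Y_1 Y_1^T X (univariate shortcut)
        Z = X.T @ Y1
        thr = eta * np.abs(Z).max()
        w = np.sign(Z) * np.maximum(np.abs(Z) - thr, 0.0)
        # Step 2.2: grow the active set A
        active |= (w != 0)
        if not active.any():
            break
        # Step 2.3: refit PLS on X_A with (up to) k latent components
        ncomp = min(k, int(active.sum()), n - 1)
        fit = PLSRegression(n_components=ncomp, scale=False).fit(X[:, active], Y)
        # Step 2.4: update the residual Y_1
        Y1 = Y - fit.predict(X[:, active]).ravel()
    return active, fit
```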

We notice that the leading eigenvector of M is proportional to the current correlation in the lars algorithm for univariate Y. Hence, the lars and spls algorithms use the same criterion to select active variables in the univariate case. However, the spls algorithm differs from lars in that spls selects more than one variable at a time and utilizes the cg method to compute coefficients at each step. The computational cost of computing coefficients at each step of the spls algorithm is less than or equal to the computational cost of computing the step size in lars, since the cg method avoids matrix inversion. Also note that the spls algorithm can naturally handle multivariate Y, which is not an automatic property of other variable selection algorithms.

4.2 Choosing the thresholding parameter and the number of hidden components

spls has two tuning parameters, namely, the thresholding parameter that imposes sparsity and the number of hidden components. We discuss criteria for setting these tuning parameters in this section. We first consider criteria for univariate Y, and then discuss them further for multivariate Y, which is a simple extension of the univariate case.

Since spls direction vectors for univariate Y are obtained by simple thresholding, we first discuss methods for choosing the thresholding parameter under both soft and hard thresholding. It is known that soft thresholding exhibits continuous shrinkage but suffers from substantial bias for large coefficients, whereas hard thresholding exhibits noncontinuous shrinkage but does not suffer from such a bias, since the original coefficients surviving the thresholding are not altered (Jansen 2001). Taking these properties of the thresholding methods into consideration, we propose to use cross validation (cv) only for soft thresholding, because the resulting cv function of the prediction error would be smooth and, hence, appropriate for finding the optimal thresholding parameter (Jansen 2001). For hard thresholding, we propose to use a false discovery rate (fdr) controlling procedure, which turns out to be computationally efficient and minimax optimal (Abramovich et al. 2006).

We start by describing a form of soft thresholding which promotes the use of a single tuning parameter for multiple direction vectors. Define the thresholded estimator w̃ by

w̃ = (|w| − η max_{1≤i≤p} |w_i|) I(|w| ≥ η max_{1≤i≤p} |w_i|) sign(w),

where 0 ≤ η ≤ 1. Here, we retain components which are greater than some fraction of the maximum component of the direction vector. A similar thresholding method is also utilized by Friedman and Popescu (2004), within the hard thresholding context as opposed to the soft thresholding scheme considered here. We promote the use of a single parameter (η), determined by cv, to control the sparsity of all the direction vectors. Although one could tune separate sparsity parameters for individual direction vectors, tuning multiple parameters would be computationally prohibitive. Moreover, such an approach may not produce a unique minimum for the cv criterion, as different combinations of sparsity of the direction vectors may yield the same prediction for Y. Therefore, we propose to compromise between the need for multiple tuning parameters and the practicality of a single tuning parameter by using the proposed soft thresholding method.

Next, we describe the hard thresholding by fdr control approach. spls selects variables which exhibit high correlations with Y in the first step and adds variables with high partial correlations in the subsequent steps. Selecting related variables based on correlations has been utilized in supervised pca (Bair et al. 2006), and in a way we further extend this approach by utilizing partial correlations in the later steps. Due to the uniform consistency of correlations (or partial correlations), fdr control is expected to work well even in the large p and small n scenario (Kosorok and Ma 2007).

To control the fdr, we compute p-values for the individual correlation coefficients (or partial correlation coefficients in the second and subsequent steps) by using Fisher's z-transformation (Anderson 1984). Let ρ̂_{yx_i·Z} denote the sample partial correlation of the i-th variable with Y given Z, where Z is the set of direction vectors which are included in the model. Under the normality assumption on X and Y, and the null hypothesis H_{0i}: ρ_{yx_i·Z} = 0, the z-transformed partial correlation (correlation) coefficients have the following distribution (Bendel and Afifi 1976):

√(n − |Z| − 3) · (1/2) ln( (1 + ρ̂_{yx_i·Z}) / (1 − ρ̂_{yx_i·Z}) ) ∼ N(0, 1).

The thresholding parameter for the hard thresholding approach is then determined by controlling the fdr based on the computed p-values. We utilize the Benjamini and Hochberg (1995) method for this purpose. Let the individual p-values for the correlations be arranged in ascending order: p_[1] ≤ ... ≤ p_[p]. Compare the ordered p-values to the linear boundary (i/p)α, and note the last crossing time, k̂ = max{k : p_[k] ≤ (k/p)α}. We estimate the thresholding parameter as |w|_[p−k̂+1], and the sparse direction vector as

w̃ = w I(|w| > |w|_[p−k̂+1]).
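A sketch of this hard-thresholding rule follows; pcorr holds the sample (partial) correlations of each variable with Y given the q_given direction vectors already in the model, alpha is the fdr level, and the helper name and defaults are ours, not from the paper.

```python
import numpy as np
from scipy.stats import norm

def fdr_hard_threshold(w, pcorr, n, q_given, alpha=0.1):
    """Hard-threshold a direction vector w by FDR control on (partial) correlations."""
    p = len(w)
    # Fisher z-transform and two-sided p-values under H0: rho = 0
    z = np.sqrt(n - q_given - 3) * 0.5 * np.log((1 + pcorr) / (1 - pcorr))
    pvals = 2 * norm.sf(np.abs(z))
    # Benjamini-Hochberg step-up rule with boundary (i/p) * alpha
    order = np.argsort(pvals)
    below = pvals[order] <= alpha * np.arange(1, p + 1) / p
    k = below.nonzero()[0].max() + 1 if below.any() else 0
    if k == 0:
        return np.zeros_like(w)
    thr = np.sort(np.abs(w))[p - k]        # k-th largest |w|
    return w * (np.abs(w) >= thr)          # keep the k largest components
```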

We remark that the solution will be minimax optimal if α ∈ [0, 1/2] and α > γ/log p (γ > 0), under the assumption of independence among tests (Abramovich et al. 2006). As long as α decreases at an appropriate rate as p increases, hard thresholding by fdr control is optimal without knowing the level of sparsity. This adaptive property reduces computation considerably, since we do not need to estimate the appropriate sparsity of the problem. Although we do not have independence among tests, this adaptivity property may still work, since the argument for minimax optimality mainly depends on marginal properties (Abramovich et al. 2006).

As discussed in Section 3.2, for multivariate Y , the solution for spls regression is obtained through

iterations and the resulting solution corresponds to a form of soft thresholding. Although hard thresholding

with fdr control is no longer applicable, we can still employ soft thresholding based on cv.

The number of hidden components, K, is tuned by cv as in the original pls regression. We note that the cv criterion is a function of two arguments for soft thresholding and of one argument for hard thresholding, thereby making hard thresholding computationally much cheaper than soft thresholding.
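A sketch of the resulting tuning loop for soft thresholding with univariate Y, reusing the spls_univariate sketch given after the algorithm box in Section 4.1; the grid, fold count, and names are our own choices.

```python
import numpy as np
from sklearn.model_selection import KFold

def tune_spls(X, Y, etas, Ks, n_folds=10, seed=0):
    """Joint CV over the soft-thresholding level eta and the number of components K."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    best, best_err = None, np.inf
    for eta in etas:
        for K in Ks:
            errs = []
            for tr, te in kf.split(X):
                # spls_univariate: the sketch from Section 4.1
                active, fit = spls_univariate(X[tr], Y[tr], K, eta)
                pred = fit.predict(X[te][:, active]).ravel()
                errs.append(np.mean((Y[te] - pred) ** 2))
            if np.mean(errs) < best_err:
                best, best_err = (eta, K), np.mean(errs)
    return best, best_err
```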

5. SIMULATION STUDIES

5.1 Setting the weight factor in the general regression formulation of (8)

We first ran a small simulation study to examine how the generalization of the regression formulation given in (8) helps to avoid the local solution issue. The data generating mechanism for X is set as follows:

X_i = H_1 + e_i                    for i = 1, ..., 4,
X_i = H_2 + e_i                    for i = 5, ..., 8,
X_i = −0.3 H_1 + 0.925 H_2 + e_i   for i = 9, 10,
X_i = e_i                          for i = 11, ..., 100.

Here, H_1 is a random vector from N(0, 290 I), H_2 is a random vector from N(0, 300 I), and H_3 = −0.3 H_1 + 0.925 H_2. The e_i are i.i.d. random vectors from N(0, I), and I is the identity matrix of dimensions 1000 × 1000.

For illustration purposes, we use M = X^T X. When κ = 0.5, the algorithm gets stuck at a local solution in 27 out of 100 simulation runs. When κ = 0.1, 0.3, or 0.4, the correct solution is obtained in all the runs. This indicates that a slight imbalance, giving less weight to the concave part of the objective function in (8), leads to a numerically easier optimization problem.
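For reference, one way to generate the X used in this study (a sketch based on our reading of the description above; the random seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000                                  # dimension of the identity matrix I above
H1 = rng.normal(0.0, np.sqrt(290), n)     # H1 ~ N(0, 290 I)
H2 = rng.normal(0.0, np.sqrt(300), n)     # H2 ~ N(0, 300 I)
H3 = -0.3 * H1 + 0.925 * H2
E = rng.standard_normal((n, 100))         # i.i.d. N(0, 1) errors e_i
X = np.empty((n, 100))
X[:, 0:4] = H1[:, None] + E[:, 0:4]       # X_1, ..., X_4
X[:, 4:8] = H2[:, None] + E[:, 4:8]       # X_5, ..., X_8
X[:, 8:10] = H3[:, None] + E[:, 8:10]     # X_9, X_10
X[:, 10:] = E[:, 10:]                     # X_11, ..., X_100: pure noise
M = X.T @ X                               # the M used in (8) for this illustration
```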

5.2 Comparisons with recent variable selection methods in terms of prediction power and variable selection

One major advantage of spls regression is its ability to handle correlated covariates or ill-posed problems. In this section, we compare spls regression to other popular methods in terms of prediction and variable selection performance in such a correlated covariates setting. In the comparisons, we include popular methods such as ols, Forward Variable Selection (fvs), and lasso, which are not particularly tailored for correlated variables. We also include dimension reduction methods such as pls, pcr, and super pc, which ought to be appropriate for highly correlated variables.

We first consider the case where there is a reasonable number of observations (i.e., n > p), and set n = 400 and p = 40. We vary the number of spurious variables as q = 10 and 30, and the noise to signal ratios as 0.1 and 0.2. Hidden variables H_1, ..., H_3 are from N(0, 25 I_n), and the columns of the covariate matrix X are generated by

X_i = H_1 + e_i  for i = 1, ..., (p − q)/2,
    = H_2 + e_i  for i = (p − q)/2 + 1, ..., p − q,
    = H_3 + e_i  for i = p − q + 1, ..., p,

where e_1, ..., e_p are independently generated from N(0, I_n). Y is generated by 3H_1 − 4H_2 + ε, where ε is normally distributed. This data generation mechanism creates covariates, subsets of which are highly correlated.
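One possible implementation of this generator is sketched below (our reading of the description; how the noise-to-signal ratio enters ε is not fully specified above, so the scaling used here is an assumption):

```python
import numpy as np

def sim_correlated(n, p, q, noise_to_signal, rng):
    """Sketch of the Section 5.2 generator: q spurious variables load on H3,
    while Y depends on H1 and H2 only."""
    H1, H2, H3 = (rng.normal(0, 5, size=(n, 1)) for _ in range(3))  # N(0, 25 I_n)
    E = rng.standard_normal((n, p))
    X = np.empty((n, p))
    half = (p - q) // 2
    X[:, :half] = H1 + E[:, :half]
    X[:, half:p - q] = H2 + E[:, half:p - q]
    X[:, p - q:] = H3 + E[:, p - q:]
    signal = 3 * H1.ravel() - 4 * H2.ravel()
    # assumption: epsilon scaled so that var(eps)/var(signal) = noise_to_signal
    eps = rng.normal(0, np.sqrt(noise_to_signal) * signal.std(), n)
    return X, signal + eps

rng = np.random.default_rng(2)
X, Y = sim_correlated(n=400, p=40, q=10, noise_to_signal=0.1, rng=rng)
```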

We then consider the case where the sample size is smaller than the number of variables. We set p = 80 and n = 40. The number of spurious variables is set to q = 20 and 40, and the noise to signal ratios to 0.1 and 0.2, respectively. X and Y are generated similarly to the above n > p case.

We select the optimal tuning parameters for each method using 10-fold cv. Then we use the same proce-

dure to generate an independent test data set and predict Y on the test data set based on the fitted model.

For each parameter setting, we perform 30 runs of simulations and compute the mean and standard deviation

of the mean squared prediction error. The averages of the sensitivities and specificities are computed across

the simulations to compare the accuracy of variable selection. The results are presented in Tables 1 and 2

for n > p and p >> n scenarios, respectively.

[Table 1 about here.]

[Table 2 about here.]

Although not so surprising, the methods which have intrinsic variable selection show much smaller prediction errors than the methods lacking this property. For n > p, fvs, lasso, spls, and super pc show similar prediction performances in all four scenarios. However, for p >> n, spls exhibits the best prediction performance and is substantially superior to the other methods. In terms of model selection accuracy, spls and super pc show excellent performance, whereas fvs and lasso exhibit poor performance by missing relevant variables. spls performs better than the other methods in the p >> n and high noise to signal ratio scenarios. We notice that super pc shows worse prediction performance than lasso in the p >> n case, although it has better model selection performance. This is because super pc sometimes misses all the variables related to the latent components, whereas lasso includes at least some of them.

In general, spls-cv and spls-fdr perform similarly, and they perform at least as well as the other methods (Table 2). In particular, when p >> n, lasso fails to identify important variables, whereas spls can select them. This is because, although the number of spls latent components is limited by n, the actual number of variables that make up the latent components can exceed n. This simulation study illustrates that spls regression has not only good predictive power but also the ability to perform variable selection.

5.3 Comparisons of predictive power among methods tailored to handle multicollinearity

In this section, we compare spls regression to some of the popular methods that aim to handle multicollinearity, such as pls, pcr, ridge regression, the mixed variance-covariance approach (Bair et al. 2006), gene-shaving (Hastie et al. 2000), and super pc. For this simulation, we only compare prediction performances, since none of these methods, except for gene-shaving and super pc, is equipped with variable selection. For the dimension reduction methods, namely pls, pcr, the mixed variance-covariance approach, gene-shaving, super pc, and spls, we allow the use of only one latent component for a fair comparison.

Throughout these simulations, we set p = 5000 and n = 100. All the scenarios follow the general model Y = Xβ + e, but the underlying data generation for X varies. We devise simulation scenarios where the multicollinearity is due to: the presence of one main latent variable (simulations 1 and 2); the presence of multiple latent variables (simulation 3); and the presence of a correlation structure that is not induced by latent variables but by some other mechanism (simulation 4).

We select the optimal tuning parameters for each method using 10-fold cv. Then, as in Section 5.1,

we use the same procedure to generate an independent test data set and utilize the fitted model to predict

Y on the test data set. For each scenario, 25 runs of simulations are performed and means and standard

deviations of the mean squared prediction errors are computed.

[Table 3 about here.]

We next describe the data generation process of each simulation scenario in detail and discuss the performance of the methods. The first simulation scenario is the same as the "simple simulation" utilized by Bair et al. (2006), where the hidden components U_1, U_2 are

U_1j = 3 if j ≤ 50,  4 if j > 50,
U_2j = 3.5,  1 ≤ j ≤ n.

The columns of X are generated by

X_i = U_1 + ε_i  for 1 ≤ i ≤ 50,
    = U_2 + ε_i  for 51 ≤ i ≤ p,

where the ε_i are random n-vectors from i.i.d. N(0, I_n). β is a p × 1 vector whose j-th element is

β_j = 1/25 for 1 ≤ j ≤ 50,
    = 0    for 51 ≤ j ≤ p.

e is a random n × 1 vector from N(0, 1.5^2 I_n). Although this scenario is ideal for super pc in that Y is related to only one main hidden component, spls regression shows performance comparable to super pc and gene-shaving.

The second simulation is referred to as the "hard simulation" by Bair et al. (2006), where more complicated hidden components are generated and the rest of the data generation remains the same as in the "simple simulation". U_1, ..., U_5 are generated by

U_1j = 3 if j ≤ 50,  4 if j > 50,
U_2j = 3.5 + 1.5 I(u_1j ≤ 0.4),  1 ≤ j ≤ n,
U_3j = 3.5 + 0.5 I(u_1j ≤ 0.7),  1 ≤ j ≤ n,
U_4j = 3.5 − 1.5 I(u_1j ≤ 0.3),  1 ≤ j ≤ n,
U_5j = 3.5,  1 ≤ j ≤ n,

where u_1j, u_2j, u_3j for 1 ≤ j ≤ n are i.i.d. random variables from Unif(0, 1). The columns of X are

X_i = U_1 + ε_i  for 1 ≤ i ≤ 50,
    = U_2 + ε_i  for 51 ≤ i ≤ 100,
    = U_3 + ε_i  for 101 ≤ i ≤ 200,
    = U_4 + ε_i  for 201 ≤ i ≤ 300,
    = U_5 + ε_i  for 301 ≤ i ≤ p.

As seen in Table 3, when there are complicated latent components, spls and super pc show the best performance. Taken together, these two simulation studies illustrate that both spls and super pc have good prediction performance under a latent component model with few relevant variables.

The third simulation is designed to compare the prediction performance of the methods when all methods are allowed to use only one latent component, even though there is more than one hidden component related to Y. This scenario aims to illustrate the differences among the derived latent components depending on whether or not they are guided by Y.

Six hidden components U_1, ..., U_6 are generated as follows:

U_1j = 2.5 if j ≤ 50,  4 if j > 50,
U_2j = 2.5 if 1 ≤ j ≤ 25 or 51 ≤ j ≤ 75,  4 if 26 ≤ j ≤ 50 or 76 ≤ j ≤ 100,
U_3j = 3.5 + 1.5 I(u_1j ≤ 0.4),  1 ≤ j ≤ n,
U_4j = 3.5 + 0.5 I(u_1j ≤ 0.7),  1 ≤ j ≤ n,
U_5j = 3.5 − 1.5 I(u_1j ≤ 0.3),  1 ≤ j ≤ n,
U_6j = 3.5,  1 ≤ j ≤ n,

where u_1j, u_2j, u_3j for 1 ≤ j ≤ n are i.i.d. random variables from Unif(0, 1). The columns of X are

X_i = U_1 + ε_i  for 1 ≤ i ≤ 25,
    = U_2 + ε_i  for 26 ≤ i ≤ 50,
    = U_3 + ε_i  for 51 ≤ i ≤ 100,
    = U_4 + ε_i  for 101 ≤ i ≤ 200,
    = U_5 + ε_i  for 201 ≤ i ≤ 300,
    = U_6 + ε_i  for 301 ≤ i ≤ p.

e is a random n × 1 vector from N(0, I_n).

Gene-shaving and spls both exhibit good predictive performance in this scenario. This may be explained by the fact that both of these methods derive their latent components by directly utilizing the response Y. In a way, the methods which utilize Y when deriving latent components can achieve much better predictive performance than methods whose latent components are built only on X, when the number of components allowed in the model is fixed. This agrees with the prior observation that pls typically requires a smaller number of latent components than pca (Frank and Friedman 1993).

The fourth simulation is designed to compare the prediction performance of the methods when the relevant variables are not governed by a latent variable model. We generate the first 50 columns of X from a multivariate normal distribution with autoregressive covariance, and the remaining 4950 columns of X are generated from hidden components as before. Five hidden components are generated by

U_1j = 1 if j ≤ 50,  6 if j > 50,
U_2j = 3.5 + 1.5 I(u_1j ≤ 0.4),  1 ≤ j ≤ n,
U_3j = 3.5 + 0.5 I(u_1j ≤ 0.7),  1 ≤ j ≤ n,
U_4j = 3.5 − 1.5 I(u_1j ≤ 0.3),  1 ≤ j ≤ n,
U_5j = 3.5,  1 ≤ j ≤ n,

where u_1j, u_2j, u_3j for 1 ≤ j ≤ n are i.i.d. random variables from Unif(0, 1). Denoting X by (X^(1), X^(2)) in partitioned matrix form, we generate the rows of X^(1) from N(0, Σ_{50×50}), where Σ_{50×50} is from an AR(1) process with autocorrelation ρ = 0.9. The columns of X^(2) are generated from

X^(2)_i = U_1 + ε_i  for 1 ≤ i ≤ 50,
        = U_2 + ε_i  for 51 ≤ i ≤ 100,
        = U_3 + ε_i  for 101 ≤ i ≤ 200,
        = U_4 + ε_i  for 201 ≤ i ≤ 300,
        = U_5 + ε_i  for 301 ≤ i ≤ p − 50.

β is a p × 1 vector and its j-th element is given by

β_j = 8/25 for 1 ≤ j ≤ 10,
    = 6/25 for 11 ≤ j ≤ 20,
    = 4/25 for 21 ≤ j ≤ 30,
    = 2/25 for 31 ≤ j ≤ 40,
    = 1/25 for 41 ≤ j ≤ 50,
    = 0    for 51 ≤ j ≤ p.

This fourth simulation aims to illustrate how the different methods perform when the correlations among the covariates have a different structure than that of a hidden component model. spls regression and gene-shaving perform well, indicating that they have the ability to handle such a correlation structure. As in the third simulation, these two methods may gain an advantage in handling more general correlation structures by utilizing the response Y when they derive direction vectors.

In all four simulations, the methods incorporating variable selection procedures, such as super pc, spls, and gene-shaving, show remarkably smaller prediction errors than those without any built-in variable selection scheme. In simulations 1 and 2, spls regression shows performance comparable to super pc, although both of these scenarios are designed to be favorable for super pc. In simulation 3, by using only one component, spls still exhibits good predictive performance and compares well to gene-shaving, which demonstrates that spls and gene-shaving require fewer components since they use Y when deriving latent components. In simulation 4, spls and gene-shaving exhibit excellent performance when the correlation structure is different from that of a hidden component model.

6. CASE STUDY: APPLICATION TO YEAST CELL CYCLE DATA SET

Transcription factors (tfs) play an important role in interpreting a genome's regulatory code by binding to specific sequences to induce or repress gene expression. It is of general interest to identify tfs which are related to cell cycle regulation, one of the fundamental processes in a eukaryotic cell. This scientific question has been tackled by an integrative analysis of gene expression data and chromatin immunoprecipitation (chip-chip) data, measuring the amount of transcription (mrna) and the physical binding of tfs, respectively. Boulesteix and Strimmer (2005) analyzed these data via pls, where the data contained 3638 genes and 113 tfs. Their interest was in estimating transcription factor activity, not in variable selection. Wang et al. (2007) also analyzed the same data by using group scad regression, where the data were processed to contain 297 genes and 96 tfs. In this section, we take a similar approach and identify cell cycle regulating tfs via variable selection methods including multivariate spls, univariate spls, super pc, and lasso.

The cell cycle gene expression data of approximately 800 genes (Spellman et al. 1998) comprise data sets from three different experiments depending on the synchronization method, and we use the data from the α factor based synchronization experiment, which measures mrna levels every 7 minutes for 119 minutes, with a total of 18 measurements covering two cell cycle periods. The chip-chip data of Lee et al. (2002) contain the binding information of 106 tfs, which elucidates how these yeast transcriptional regulators bind to promoter sequences of the genes across the genome. After excluding genes which have a missing value at any time point of the expression data or for any tf of the chip-chip data, 542 cell cycle related genes are retained. In short, our analysis consists of modeling the expression levels of 542 cell cycle related genes at 18 time points by using the chip-chip data of 106 tfs, with the aim of identifying cell cycle related tfs as well as inferring transcription factor activities (tfa).

We analyze this data set with our proposed multivariate spls and univariate spls regression methods, and also with super pc and lasso for comparison, and summarize the results in Table 4. Multivariate spls selects the smallest number of tfs (48 tfs), and univariate spls and super pc select exactly the same tfs (65 tfs). lasso selects the largest number of tfs, 102 out of 106. We also report the number of known tfs among the selected ones to use as a guideline for the performance of the methods. A total of 21 tfs are experimentally confirmed to be cell cycle related (Wang et al. 2007).

Obviously, there is a high chance of including a large number of known tfs if a large number of tfs are selected in the first place. A score which evaluates the methods not only by the number of selected known tfs but also by the actual size of the selection would provide a better idea of the performance of the methods. As such, we compute the probability that a random selection of size s (the size of the selection) contains at least k known tfs out of 21 known and 85 unknown tfs. Specifically, we compute P(K ≥ k), where K is the random variable representing the number of selected known tfs and is assumed to follow Hypergeometric(21, 85, s), and k and s represent the observed number of selected known tfs and the size of the selection, respectively. By comparing these probabilities, we observe that multivariate spls, univariate spls, and super pc show strong evidence that the known tfs they contain are not selected by random chance, whereas lasso does not.
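This tail probability can be computed directly with scipy; the values of s and k below are placeholders for illustration, not entries of Table 4.

```python
from scipy.stats import hypergeom

def enrichment_pvalue(s, k, n_known=21, n_unknown=85):
    """P(K >= k) when s TFs are drawn at random from 21 known + 85 unknown TFs."""
    return hypergeom.sf(k - 1, n_known + n_unknown, n_known, s)

print(enrichment_pvalue(s=50, k=12))   # example call with placeholder numbers
```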

[Table 4 about here.]

[Figure 2 about here.]

Because univariate spls and super pc select exactly the same tfs and lasso does not really provide any variable selection by choosing 102 out of 106 tfs, we restrict our attention to comparing the multivariate and univariate spls regressions. There are a total of 36 tfs that are selected by both methods, and 14 of these are experimentally verified tfs. The estimates for the selected tfs, referred to as tfas, show periodicity in general. Interestingly, multivariate spls regression obtains smoother estimates of transcription factor activity than univariate spls does (Figure 2). A total of 12 tfs are selected only by multivariate spls regression, and one of them is a known tf. These tfs have small estimated coefficients but show periodicity (Figure 3), which can be attributed to the time course covering two cell cycles. A total of 29 tfs are selected only by univariate spls regression, and four of these are among the known tfs. These tfs do not show periodicity, and they have nonzero coefficients only at one or two time points (Figure 4). In general, multivariate spls regression captures the overall change of the coefficients, changes consistent across the two time points, which cannot be summarized by univariate spls, whereas univariate spls regression captures detailed changes (such as a single time change) which can easily be missed by multivariate spls regression.

[Figure 3 about here.]

[Figure 4 about here.]

7. DISCUSSION

pls regression has been promoted for ill-conditioned linear regression problems that arise in several disciplines such as chemistry, economics, medicine, psychology, and pharmaceutical science (de Jong 1993). It has been shown that pls yields shrinkage estimators (Goutis 1996) and that it may provide peculiar shrinkage in the sense that some of the components of the regression coefficient vector may expand. However, as argued by Rosipal and Krämer (2006), this does not necessarily lead to worse shrinkage since partial least squares estimators are highly nonlinear. We showed that pls is consistent under the latent model assumption with strong restrictions on the number of variables and the sample size. This makes the suitability of pls for the very large p and small n paradigm questionable. We argued and illustrated that imposing sparsity on the direction vector helps to avoid sample size problems in the presence of a large number of noise variables, and we developed a sparse partial least squares regression technique called spls.
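To give a concrete, if simplified, sense of why sparsity in the direction vector helps when many predictors are pure noise, the toy Python sketch below soft-thresholds the first pls direction vector (which, for a univariate response, is proportional to X^T y). This is only a caricature of the idea; it is not the spls criterion or algorithm developed in this paper, and the dimensions, coefficients, and threshold level are arbitrary choices made for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, p_rel = 50, 500, 10                    # few relevant predictors, many noise ones
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[:p_rel] = 2.0
    y = X @ beta + rng.standard_normal(n)

    w = X.T @ y                                  # first PLS direction vector (up to scaling)
    lam = 0.6 * np.max(np.abs(w))                # threshold level (a tuning parameter)
    w_sparse = np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)   # soft-thresholding

    print("predictors retained:", np.count_nonzero(w_sparse),
          "of which truly relevant:", np.count_nonzero(w_sparse[:p_rel]))

The soft-thresholding step simply zeroes out coordinates with small covariance with the response, so that the latent component is formed from a reduced set of predictors.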

spls regression is also likely to yield shrinkage estimators since the methodology can be considered a form of pls regression on a restricted set of predictors. Analysis of its shrinkage properties due to the embedded automatic variable selection step is among our current investigations. spls regression is computationally efficient since it solves a linear equation using a cg algorithm rather than employing matrix inversion at each step. Furthermore, spls regression works well even when the number of relevant predictors is greater than the sample size and is capable of selecting more relevant variables than the actual sample size. This becomes important in the following scenario. If there are p1 variables related to the response variable and p − p1 noise variables, where p ≥ p1 > n, then it would be impossible to identify all the relevant variables with, for example, lasso, since it can select at most n predictors.


However, spls regression would allow n sparse linear combinations of the original predictors, which could in principle involve more than n variables.
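As a computational illustration of the point about conjugate gradients (a minimal sketch, not the authors' implementation), the least squares system restricted to a selected set of predictors can be solved with an iterative cg routine instead of an explicit matrix inverse. The data and the active set below are hypothetical.

    import numpy as np
    from scipy.sparse.linalg import cg

    rng = np.random.default_rng(1)
    n, p = 40, 80                                # more predictors than samples
    X = rng.standard_normal((n, p))
    y = X[:, :5] @ np.array([3.0, -2.0, 1.5, 2.0, -1.0]) + 0.5 * rng.standard_normal(n)

    active = np.arange(10)                       # hypothetical selected (active) predictors
    XA = X[:, active]
    A = XA.T @ XA                                # Gram matrix on the active set
    b = XA.T @ y

    beta_active, info = cg(A, b, atol=1e-10)     # conjugate gradient solve; info == 0 means converged
    print(info, np.round(beta_active, 2))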

We showed that the direction vector satisfying the spls criterion can be found by iteratively solving a regression formulation, and we proposed the accompanying spls regression algorithm. Our spls regression algorithm has connections to other variable selection algorithms, including the elastic net method (Zou and Hastie 2005) and the threshold gradient (tg) method (Friedman and Popescu 2004). The elastic net algorithm deals with the collinearity issue in the variable selection problem by incorporating ridge regression into the lars algorithm. In a similar manner, spls handles this issue by fusing the pls technique into the lars algorithm. spls can also be related to the tg method in that both algorithms use only the thresholded gradient and not the Hessian. However, spls achieves fast convergence by using conjugate gradients.

We presented proof-of-principle simulation studies with a small number of predictors and with both small and large numbers of samples. We also provided several simulation studies involving thousands of predictors and illustrated that spls regression performs much better, in terms of both predictive power and identification of the relevant variables, than other comparable methods that might be suitable for these problems. As we had anticipated, spls regression achieves both high predictive power and high accuracy in finding the relevant variables (Table 2). spls regression also retains high predictive power even when n is smaller than p.

Our application of spls regression involved two recent genomic data types, namely, gene expression data and genome-wide binding data of transcription factors. Here, the response variable was continuous and a linear modeling framework followed naturally. Extension of spls to other modeling frameworks such as generalized linear models and survival models is an exciting direction for future work. Our application highlighted the use of spls within the context of a multivariate response. We anticipate that several genomic problems with multivariate responses, e.g., linking expression of a cluster of genes to genetic marker data, might lend themselves to the multivariate spls framework.

APPENDIX: PROOFS OF THE THEOREMS

We first introduce Lemmas 1 and 2 and then utilize them in the proof of Theorem 1. In the following lemmas and theorems, $\|A\|_2$ for a matrix $A \in \mathbb{R}^{n\times k}$ is defined as the largest singular value of $A$, and $\|A\|_\infty$ is defined as the largest $L_1$ norm of the row vectors of $A$. These two norms are equivalent for fixed $n$ and $k$:
$$\frac{1}{\sqrt{k}}\,\|A\|_\infty \le \|A\|_2 \le \sqrt{n}\,\|A\|_\infty.$$
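A quick numerical check of this norm relation, included only as an illustration on a randomly generated matrix, is given below.

    import numpy as np

    rng = np.random.default_rng(2)
    n, k = 7, 4
    A = rng.standard_normal((n, k))

    spectral = np.linalg.norm(A, 2)              # largest singular value, ||A||_2
    row_l1 = np.abs(A).sum(axis=1).max()         # largest L1 norm of the rows, ||A||_inf
    print(row_l1 / np.sqrt(k) <= spectral <= np.sqrt(n) * row_l1)   # expected: True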

Lemma 1. Under Assumptions 1 and 2,
$$\|S_{XX} - \Sigma_{XX}\|_2 \le O_p\!\left(\sqrt{p/n}\right), \qquad \|S_{XY} - \sigma_{XY}\|_2 \le O_p\!\left(\sqrt{p/n}\right),$$
as $p/n \to 0$.

Proof. Define $E_n = S_{XX} - \Sigma_{XX}$ and $F_n = S_{XY} - \sigma_{XY}$; then $E_n$ and $F_n$ have the following decomposition based on Johnstone and Lu (2004):
$$E_n = A_n + B_n + C_n, \qquad F_n = (A_n + B_n + C_n)\beta + D_n,$$
where
$$\begin{aligned}
A_n &= \sum_{j,k}^{m}\Big(n^{-1}\sum_{i=1}^{n} v_i^{j} v_i^{k} - \delta_{jk}\Big)\rho^{j}\rho^{kT}, \\
B_n &= \sum_{j=1}^{m}\sigma_1 n^{-1}\big(\rho^{j}v^{jT}Z^{T} + Zv^{j}\rho^{jT}\big), \\
C_n &= \sigma_1^{2}\big(n^{-1}ZZ^{T} - I_p\big), \\
D_n &= \sigma_1\sigma_2 n^{-1}\Big(\sum_{j=1}^{m}\rho^{j}v^{jT}e\Big) + \sigma_1\sigma_2 n^{-1}Z^{T}e.
\end{aligned}$$

It is proved in Johnstone and Lu (2004) that if $p/n \to c \in [0,\infty)$,
$$\|A_n\|_2 \to 0 \ \text{a.s.}, \qquad \|B_n\|_2 \le \sigma_1\sqrt{c}\sum_j \varrho_j \ \text{a.s.}, \qquad \|C_n\|_2 \to \sigma_1^{2}\big(c + 2\sqrt{c}\big) \ \text{a.s.}$$
Now we will show that if $p/n \to c \in [0,\infty)$,
$$\|D_n\|_2 \to \sqrt{c}\,\sigma_1\sigma_2 \ \text{a.s.}$$

First, $v^{jT}e =_d \chi_n\chi_1 U_j$, $1 \le j \le m$, and $Z^{T}e =_d \chi_n\chi_p U_{m+1}$, where $\chi_n^{2}$, $\chi_1^{2}$, and $\chi_p^{2}$ are chi-square random variables and the $U_j$ are random vectors uniform on the surface of the unit sphere $S^{p-1}$ in $\mathbb{R}^{p}$. Denote $u_j = v^{jT}e$, $1 \le j \le m$, and $u_{m+1} = Z^{T}e$; then
$$\sigma_1^{2}n^{-2}\|u_j\|_2^{2} \to 0 \ \text{a.s. for } 1 \le j \le m, \qquad \sigma_2^{2}\sigma_1^{2}n^{-2}\|u_{m+1}\|_2^{2} \to c\,\sigma_1^{2}\sigma_2^{2}.$$


By using a version of the dominated convergence theorem (Pratt 1960), we have
$$\sigma_1\sigma_2 n^{-1}\Big(\sum_{j=1}^{m}\rho^{j}v^{jT}e\Big) \to 0 \ \text{a.s.}$$
Then $\|D_n\|_2 \to \sqrt{c}\,\sigma_1\sigma_2$ a.s. Therefore,
$$\|E_n\|_2 \le \sigma_1\sqrt{c}\sum_j \varrho_j + \sigma_1^{2}\big(c + 2\sqrt{c}\big) \ \text{a.s.},$$
$$\|F_n\|_2 \le \Big(\sigma_1\sqrt{c}\sum_j \varrho_j + \sigma_1^{2}\big(c + 2\sqrt{c}\big)\Big)\|\beta\|_2 + \sqrt{c}\,\sigma_1\sigma_2 \ \text{a.s.},$$
which proves the lemma.
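The $\sqrt{p/n}$ rate in Lemma 1 can be visualized with a small Monte Carlo experiment. The sketch below uses a deliberately simplified setting (rows of $X$ drawn i.i.d. from $N(0, I_p)$, so that $\Sigma_{XX} = I_p$), not the latent model of Assumptions 1 and 2; it only illustrates that the spectral-norm error of the sample covariance matrix stays of the order $\sqrt{p/n}$.

    import numpy as np

    rng = np.random.default_rng(3)

    def cov_spectral_error(n, p, reps=20):
        """Average ||S_XX - I_p||_2 over `reps` replicates, with rows of X ~ N(0, I_p)."""
        errs = []
        for _ in range(reps):
            X = rng.standard_normal((n, p))
            S = X.T @ X / n                      # sample covariance (mean known to be zero)
            errs.append(np.linalg.norm(S - np.eye(p), 2))
        return float(np.mean(errs))

    for n, p in [(400, 10), (400, 40), (400, 160)]:
        err, rate = cov_spectral_error(n, p), np.sqrt(p / n)
        # the ratio err / rate stays bounded as p/n grows, consistent with the O_p(sqrt(p/n)) bound
        print(f"n={n}, p={p:3d}:  error={err:.3f}  sqrt(p/n)={rate:.3f}  ratio={err/rate:.2f}")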

Lemma 2. Under Assumptions 1 and 2,
$$\|S_{XX}^{k}S_{XY} - \Sigma_{XX}^{k}\sigma_{XY}\|_2 \le O_p\!\left(\sqrt{p/n}\right), \tag{A.1}$$
$$\|S_{XY}^{T}S_{XX}^{k}S_{XY} - \sigma_{XY}^{T}\Sigma_{XX}^{k}\sigma_{XY}\|_2 \le O_p\!\left(\sqrt{p/n}\right), \tag{A.2}$$
as $p/n \to 0$.

Proof. Both bounds, (A.1) and (A.2), are direct consequences of Lemma 1. By using the triangle inequality, Hölder's inequality (Hölder 1889), and Lemma 1, we have
$$\begin{aligned}
\|S_{XX}^{k}S_{XY} - \Sigma_{XX}^{k}\sigma_{XY}\|_2
&\le \|S_{XX}^{k} - \Sigma_{XX}^{k}\|_2\,\|\sigma_{XY}\|_2 + \|\Sigma_{XX}^{k}\|_2\,\|S_{XY} - \sigma_{XY}\|_2 \\
&\le O_p\!\left(\sqrt{p/n}\right)C + c_1\,O_p\!\left(\sqrt{p/n}\right), \quad \text{for some constants } C \text{ and } c_1,
\end{aligned}$$
and
$$\begin{aligned}
\|S_{XY}^{T}S_{XX}^{k}S_{XY} - \sigma_{XY}^{T}\Sigma_{XX}^{k}\sigma_{XY}\|_2
&\le \|S_{XY}^{T} - \sigma_{XY}^{T}\|_2\,\|S_{XX}^{k}S_{XY}\|_2 + \|\sigma_{XY}^{T}\|_2\,\|S_{XX}^{k}S_{XY} - \Sigma_{XX}^{k}\sigma_{XY}\|_2 \\
&\le O_p\!\left(\sqrt{p/n}\right).
\end{aligned}$$


Proof of Theorem 1

We start by proving part 1 of the theorem. Helland (1990) provides a closed form solution for the pls estimator as
$$\hat{\beta}_{PLS} = \hat{R}\,(\hat{R}^{T}S_{XX}\hat{R})^{-1}\hat{R}^{T}S_{XY},$$
where $\hat{R} = (S_{XY},\, S_{XX}S_{XY},\, S_{XX}^{2}S_{XY},\, \ldots,\, S_{XX}^{K-1}S_{XY})$. First, we will establish that
$$\hat{\beta}_{PLS} \to R\,(R^{T}\Sigma_{XX}R)^{-1}R^{T}\sigma_{XY} \quad \text{in probability as } n \to \infty,$$
where $R = (\sigma_{XY},\, \Sigma_{XX}\sigma_{XY},\, \Sigma_{XX}^{2}\sigma_{XY},\, \ldots,\, \Sigma_{XX}^{K-1}\sigma_{XY})$. By using the triangle inequality and Hölder's inequality,
$$\begin{aligned}
\|\hat{R}(\hat{R}^{T}S_{XX}\hat{R})^{-1}\hat{R}^{T}S_{XY} - R(R^{T}\Sigma_{XX}R)^{-1}R^{T}\sigma_{XY}\|_2
&\le \|\hat{R}-R\|_2\,\|(\hat{R}^{T}S_{XX}\hat{R})^{-1}\hat{R}^{T}S_{XY}\|_2 \\
&\quad + \|R\|_2\,\|(\hat{R}^{T}S_{XX}\hat{R})^{-1} - (R^{T}\Sigma_{XX}R)^{-1}\|_2\,\|\hat{R}^{T}S_{XY}\|_2 \\
&\quad + \|R\|_2\,\|(R^{T}\Sigma_{XX}R)^{-1}\|_2\,\|\hat{R}^{T}S_{XY} - R^{T}\sigma_{XY}\|_2.
\end{aligned}$$

It is sufficient to show that
$$\|\hat{R} - R\|_2 \to 0, \qquad \|(\hat{R}^{T}S_{XX}\hat{R})^{-1} - (R^{T}\Sigma_{XX}R)^{-1}\|_2 \to 0, \qquad \|\hat{R}^{T}S_{XY} - R^{T}\sigma_{XY}\|_2 \to 0,$$
each in probability.

By using the definition of the matrix norm together with Lemmas 1 and 2, we have
$$\|\hat{R} - R\|_2 \le \sqrt{K}\max_{1\le k\le K}\|S_{XX}^{k-1}S_{XY} - \Sigma_{XX}^{k-1}\sigma_{XY}\|_2 \le O_p\!\left(\sqrt{p/n}\right) \to 0 \ \text{in probability.}$$

By using $\|(A+E)^{-1} - A^{-1}\|_2 \le \|E\|_2\,\|A^{-1}\|_2\,\|(A+E)^{-1}\|_2$ from Golub and Van Loan (1987),
$$\|(\hat{R}^{T}S_{XX}\hat{R})^{-1} - (R^{T}\Sigma_{XX}R)^{-1}\|_2 \le \|\hat{R}^{T}S_{XX}\hat{R} - R^{T}\Sigma_{XX}R\|_2\,\|(R^{T}\Sigma_{XX}R)^{-1}\|_2\,\|(\hat{R}^{T}S_{XX}\hat{R})^{-1}\|_2.$$
Since $K$ is given, $(R^{T}\Sigma_{XX}R)^{-1}$ and $(\hat{R}^{T}S_{XX}\hat{R})^{-1}$ are nonsingular. Hence, $\|(R^{T}\Sigma_{XX}R)^{-1}\|_2$ and $\|(\hat{R}^{T}S_{XX}\hat{R})^{-1}\|_2$ are finite. By using the triangle inequality and Hölder's inequality repeatedly, we get $\|\hat{R}^{T}S_{XX}\hat{R} - R^{T}\Sigma_{XX}R\|_2 \to 0$ in probability.

Using the fact that $\|\hat{R} - R\|_2 \to 0$ in probability, and utilizing Lemma 1 together with the triangle and Hölder inequalities, it is clear that $\|\hat{R}^{T}S_{XY} - R^{T}\sigma_{XY}\|_2 \to 0$ in probability.

Next, we can establish that $\beta = \Sigma_{XX}^{-1}\sigma_{XY} = R(R^{T}\Sigma_{XX}R)^{-1}R^{T}\sigma_{XY}$ by using the same argument as in Proposition 1 of Naik and Tsai (2000).

Next, we prove part 2 of the theorem. By using $S_{XX}\hat{\beta}_{PLS} - S_{XY} = 0$ and $\Sigma_{XX}\beta - \sigma_{XY} = 0$,
$$\begin{aligned}
0 &= S_{XX}\hat{\beta}_{PLS} - \Sigma_{XX}\beta - S_{XY} + \sigma_{XY} \\
  &= S_{XX}(\hat{\beta}_{PLS} - \beta) + (S_{XX} - \Sigma_{XX})\beta - S_{XY} + \sigma_{XY} \\
  &= S_{XX}(\hat{\beta}_{PLS} - \beta) + E_n\beta - F_n.
\end{aligned}$$
If we assume that $\|\hat{\beta}_{PLS} - \beta\|_2 \to 0$ in probability, then we have $\|\tfrac{1}{n}Z^{T}e\|_2 \to 0$ in probability, since $F_n = E_n\beta + \tfrac{1}{n}Z^{T}e$. This contradicts the fact that $\|\tfrac{1}{n}Z^{T}e\|_2 \to \sqrt{c}\,\sigma_2$ a.s., established in the proof of Lemma 1. Therefore, $\|\hat{\beta}_{PLS} - \beta\|_2 > 0$ in probability.
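For completeness, the closed form of Helland (1990) used at the start of the proof is easy to evaluate numerically. The sketch below (illustration only, with arbitrary simulated data and K = 3 components) constructs the Krylov matrix $\hat{R}$ and computes $\hat{\beta}_{PLS} = \hat{R}(\hat{R}^{T}S_{XX}\hat{R})^{-1}\hat{R}^{T}S_{XY}$ directly.

    import numpy as np

    rng = np.random.default_rng(4)
    n, p, K = 200, 15, 3                          # sample size, predictors, PLS components
    X = rng.standard_normal((n, p))
    y = X @ rng.standard_normal(p) + rng.standard_normal(n)
    Xc, yc = X - X.mean(axis=0), y - y.mean()     # center the data

    S_XX = Xc.T @ Xc / n
    S_XY = Xc.T @ yc / n

    # Krylov matrix R_hat = [S_XY, S_XX S_XY, ..., S_XX^{K-1} S_XY]
    R_hat = np.column_stack([np.linalg.matrix_power(S_XX, k) @ S_XY for k in range(K)])
    beta_pls = R_hat @ np.linalg.solve(R_hat.T @ S_XX @ R_hat, R_hat.T @ S_XY)
    print(np.round(beta_pls, 3))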

REFERENCES

Abramovich, F., Benjamini, Y., Donoho, D. L., and Johnstone, I. M. (2006), "Adapting to unknown sparsity by controlling the false discovery rate," The Annals of Statistics, 34, 584–653.

d'Aspremont, A., El Ghaoui, L., Jordan, M. I., and Lanckriet, G. R. G. (2007), "A Direct Formulation for Sparse PCA Using Semidefinite Programming," SIAM Review, 49, 434–448.

Anderson, T. (1984), An Introduction to Multivariate Statistical Analysis, New York: Wiley.

Bair, E., Hastie, T., Paul, D., and Tibshirani, R. (2006), "Prediction by supervised principal components," Journal of the American Statistical Association, 101, 119–137.

Bendel, R. B., and Afifi, A. A. (1976), "A criterion for stepwise regression," The American Statistician, pp. 85–87.

Benjamini, Y., and Hochberg, Y. (1995), "Controlling the false discovery rate: a practical and powerful approach to multiple testing," Journal of the Royal Statistical Society: Series B (Statistical Methodology), 57, 289–300.

Boulesteix, A.-L., and Strimmer, K. (2005), "Predicting transcription factor activities from combined analysis of microarray and ChIP data: a partial least squares approach," Theoretical Biology and Medical Modelling, 2.

Boulesteix, A.-L., and Strimmer, K. (2006), "Partial Least Squares: A Versatile Tool for the Analysis of High-Dimensional Genomic Data," Briefings in Bioinformatics.

de Jong, S. (1993), "SIMPLS: an alternative approach to partial least squares regression," Chemometrics and Intelligent Laboratory Systems, 18, 251–263.

van der Sluis, A., and van der Vorst, H. A. (1986), "The rate of convergence of conjugate gradients," Numerische Mathematik, 48(5), 543–560.

Donoho, D. L. (2006), "For most large underdetermined systems of equations the minimal L1-norm near-solution approximates the sparsest near-solution," Communications on Pure and Applied Mathematics, 59, 907–934.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004), "Least Angle Regression," The Annals of Statistics, 32, 407–499.

Frank, I. E., and Friedman, J. H. (1993), "A Statistical View of Some Chemometrics Regression Tools," Technometrics, 35, 109–135.

Friedman, J. H., and Popescu, B. E. (2004), Gradient Directed Regularization for Linear Regression and Classification, Technical Report, Department of Statistics, Stanford University, http://www-stat.stanford.edu/~jhf/ftp/path.pdf.

Geman, S. (1980), "A limit theorem for the norm of random matrices," Annals of Probability, 8, 252–261.

Gill, P., Murray, W., and Wright, M. (1981), Practical Optimization, New York: Academic Press.

Golub, G. H., and Van Loan, C. F. (1987), Matrix Computations, Baltimore: The Johns Hopkins University Press.

Goutis, C. (1996), "Partial Least Squares Algorithm Yields Shrinkage Estimators," The Annals of Statistics, 24, 816–824.

Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Botstein, D., and Brown, P. (2000), "Identifying distinct sets of genes with similar expression patterns via 'gene shaving'," Genome Biology, 1, 1–21.

Helland, I. S. (1990), "Partial least squares regression and statistical models," Scandinavian Journal of Statistics, 17, 97–114.

Helland, I. S., and Almoy, T. (1994), "Comparison of prediction methods when only a few components are relevant," Journal of the American Statistical Association, 89(426), 583–591.

Hölder, O. (1889), "Ueber einen Mittelwerthsatz," Nachr. Ges. Wiss. Göttingen, pp. 38–47.

Huang, X., Pan, W., Park, S., Han, X., Miller, L. W., and Hall, J. (2004), "Modeling the relationship between LVAD support time and gene expression changes in the human heart by penalized partial least squares," Bioinformatics, 20(6), 888–894.

Jansen, M. (2001), Noise Reduction by Wavelet Thresholding, New York: Springer-Verlag.

Johnstone, I. M., and Lu, A. Y. (2004), Sparse Principal Component Analysis, Department of Statistics, Stanford University, http://www-stat.stanford.edu/~imj/WEBLIST/AsYetUnpub/sparse.pdf.

Jolliffe, I. T., Trendafilov, N. T., and Uddin, M. (2003), "A modified principal component technique based on the lasso," Journal of Computational and Graphical Statistics, 12, 531–547.

Kosorok, M. R., and Ma, S. (2007), "Marginal Asymptotics for the 'large p, small n' Paradigm: With Applications to Microarray Data," The Annals of Statistics, 35, 1456–1486.

Krämer, N. (2007), "An overview on the shrinkage properties of partial least squares regression," Computational Statistics, 22, 249–273.

Lee, T. I., Rinaldi, N. J., Robert, F., Odom, D. T., Bar-Joseph, Z., Gerber, G. K., Hannett, N. M., Harbison, C. T., Thomson, C. M., Simon, I., Zeitlinger, J., Jennings, E. G., Murray, H. L., Gordon, D. B., Ren, B., Wyrick, J. J., Tagne, J.-B., Volkert, T. L., Fraenkel, E., Gifford, D. K., and Young, R. A. (2002), "Transcriptional Regulatory Networks in Saccharomyces cerevisiae," Science, 298, 799–804.

Naik, P., and Tsai, C.-L. (2000), "Partial Least Squares Estimator for Single-Index Models," Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62, 763–771.

Pratt, J. W. (1960), "On interchanging limits and integrals," Annals of Mathematical Statistics, 31, 74–77.

Rosipal, R., and Krämer, N. (2006), "Overview and Recent Advances in Partial Least Squares."

Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D., and Futcher, B. (1998), "Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization," Molecular Biology of the Cell, 9, 3273–3297.

Stoica, P., and Söderström, T. (1998), "Partial Least Squares: A First-order Analysis," Scandinavian Journal of Statistics, 25, 17–24.

Stone, M., and Brooks, R. (1974), "Continuum Regression: Cross-validated Sequentially Constructed Prediction Embracing Ordinary Least Squares, Partial Least Squares and Principal Component Regression," Journal of the Royal Statistical Society: Series B (Statistical Methodology), 52, 237–269.

Tibshirani, R. (1996), "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society: Series B (Statistical Methodology), 58, 267–288.

Wang, L., Chen, G., and Li, H. (2007), "Group SCAD Regression Analysis for Microarray Time Course Gene Expression Data," Bioinformatics, 23, 1486–1494.

West, M. (2003), "Bayesian Factor Regression Models in the 'Large p, Small n' Paradigm," Bayesian Statistics, 7, 723–732.

Wold, H. (1966), Estimation of Principal Components and Related Models by Iterative Least Squares, New York: Academic Press.

Zou, H., and Hastie, T. (2005), "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67, 301–320.

Zou, H., Hastie, T., and Tibshirani, R. (2006), "Sparse Principal Component Analysis," Journal of Computational and Graphical Statistics, 15, 265–286.

Figure 1: Feasible area for d = 2. The blue square represents the feasible area for the L1 penalty, and the green circle represents the feasible area for the norm constraint. The intersection of the two is colored red, and the corners of the square are excluded from the feasible area. This picture depicts the feasible area of the sparsity and the norm constraints. The sparsity constraint yields a sparse solution by encouraging the solution to lie at a corner of the square. By adding the norm constraint, the corners are excluded from the feasible area, and the feasible area is mainly determined by the norm constraint. As a result, the sparsity constraint plays the role of making the coefficients small but not exactly zero.

Figure 2: Estimated transcription factor activities (TFAs) for the 21 known TFs (panels: ACE2, SWI4, SWI5, SWI6, MBP1, STB1, FKH1, FKH2, NDD1, MCM1, ABF1, BAS1, CBF1, GCN4, GCR1, GCR2, LEU3, MET31, REB1, SKN7, STE12). The y-axis is the estimated coefficient, and the x-axis is time. The red line represents the TFAs estimated by the multivariate SPLS regression, and the blue line represents the TFAs estimated by the univariate SPLS regression. Multivariate SPLS regression yields smoother estimates and exhibits periodicity.

Figure 3: Estimated transcription factor activities (TFAs) for the TFs selected only by the multivariate SPLS regression (panels: ARG8, AZF1, BAS1, HAL9, HAP2, MSN2, PHD1, PHO4, RGT1, SIP4, YAP6, YAP7). The magnitude of the estimated TFAs is small, but the estimated TFAs show periodicity. BAS1 is an experimentally confirmed TF.

Figure 4: Estimated transcription factor activities (TFAs) for the TFs selected only by univariate SPLS regression (panels: ADR1, ARO8, CAD1, CBF1, DIG1, FKH1, FZF1, GAL4, GCR1, GCR2, GLN3, HAP4, HAP5, HSF1, INO2, IXR1, MOT3, MSN1, MSN4, MSS1, PDR1, RGM1, RLM1, RPH1, RTG1, SKO1, STP1, UGA3, ZAP1). Estimated TFAs are nonzero for only a few time points and do not show periodicity. CBF1, FKH1, GCR1, and GCR2 are experimentally confirmed TFs.

Table 1: Mean Squared Prediction Error for Simulations I and II.

  Parameter settings                          Mean squared prediction error
  p    n    q   ns      PLS       PCR       OLS       FVS     LASSO     SPLS1     SPLS2  Super PC
 40  400   10  0.1   31417.9   15717.1   31444.4     207.1     203.1     199.8     201.4     198.6
                      (552.5)   (224.2)   (554.0)     (9.6)     (9.3)     (9.0)    (11.2)     (9.5)
               0.2   31872.0   16186.5   31956.9     678.6     667.1     661.4     658.7     668.2
                      (544.4)   (231.4)   (548.9)    (15.4)    (13.7)    (13.9)    (15.7)    (17.5)
           30  0.1   31409.1   20914.2   31431.7     208.6     206.2     203.3     205.5     202.7
                      (552.5)  (1324.4)   (554.2)     (9.2)     (9.1)    (10.1)    (11.1)     (9.4)
               0.2   31863.7   21336.0   31939.3     677.5     670.7     661.2     663.5     673.0
                      (544.1)  (1307.6)   (549.1)    (13.9)    (12.9)    (14.4)    (15.6)    (17.3)
 80   40   20  0.1   29121.4   15678.0        -     1635.7     697.9     538.4     493.6    1079.6
                     (1583.2)   (652.9)             (406.4)    (65.6)    (70.5)    (71.5)   (294.9)
               0.2   30766.9   16386.5        -     6380.6    1838.2    1019.5     959.7    2100.8
                     (1386.0)   (636.8)            (2986.0)   (135.9)    (74.6)    (74.0)   (399.8)
           40  0.1   29116.2   17416.1        -     2829.0     677.6     506.9     437.8    1193.8
                     (1591.7)   (924.2)            (1357.0)    (58.0)    (66.9)    (43.0)   (492.3)
               0.2   29732.4   17940.8        -     6045.8    1904.3    1013.3     932.5    3172.4
                     (1605.8)   (932.2)            (1344.2)   (137.0)    (78.7)   (53.85)  (631.79)

NOTE: p: number of covariates; n: sample size; q: number of spurious variables; ns: noise-to-signal ratio; SPLS1: SPLS tuned by FDR (FDR = 0.1) control; SPLS2: SPLS tuned by CV; SE: standard error. Standard errors are given in parentheses. Dashes indicate settings where no OLS value is reported (p > n).

Table 2: Model Accuracy for Simulations I and II.

                         FVS         LASSO        SPLS1        SPLS2      Super PC
  p    n    q   ns   Sens  Spec   Sens  Spec   Sens  Spec   Sens  Spec   Sens  Spec
 40  400   10  0.1   0.26  0.96   0.49  0.98   1.00  0.83   1.00  1.00   1.00  1.00
               0.2   0.18  0.98   0.33  0.96   1.00  0.80   1.00  1.00   1.00  1.00
           30  0.1   0.58  0.98   0.88  0.95   1.00  0.83   1.00  1.00   1.00  1.00
               0.2   0.37  0.98   0.69  0.93   1.00  0.80   1.00  1.00   1.00  1.00
 80   40   20  0.1   0.29  1.00   0.49  0.52   1.00  0.80   1.00  0.93   0.95  0.88
               0.2   0.32  1.00   0.48  0.50   1.00  0.67   1.00  0.90   0.90  0.87
           40  0.1   0.46  1.00   0.51  0.53   1.00  0.80   1.00  1.00   0.97  0.90
               0.2   0.53  1.00   0.50  0.52   1.00  0.80   1.00  1.00   0.84  0.97

NOTE: p: number of covariates; n: sample size; q: number of spurious variables; ns: noise-to-signal ratio; FVS: forward variable selection; SPLS1: SPLS tuned by FDR (FDR = 0.1) control; SPLS2: SPLS tuned by CV; Sens: sensitivity; Spec: specificity.

Table 3: Mean Squared Prediction Error for Methods That Handle Multicollinearity.

                   Simulation 1      Simulation 2      Simulation 3      Simulation 4
PCR-1             320.67  (8.07)    308.93  (7.13)    241.75  (5.62)   2730.53 (75.82)
PLS-1             301.25  (7.32)    292.70  (7.69)    209.19  (4.58)   1748.53 (47.47)
Ridge             304.80  (7.47)    296.36  (7.81)    211.59  (4.70)   1723.58 (46.41)
Super PC          252.01  (9.71)    248.26  (7.68)    134.90  (3.34)    263.46 (14.98)
SPLS-1 (FDR)      256.22 (13.82)    246.28  (7.87)    139.01  (3.74)    290.78 (13.29)
SPLS-1 (CV)       257.40  (9.66)    261.14  (8.11)    120.27  (3.42)    195.63  (7.59)
Mixed var-cov     301.05  (7.31)    292.46  (7.67)    209.45  (4.58)   1748.65 (47.58)
Gene shav         255.60  (9.28)    292.46  (7.67)    119.39  (3.31)    203.46  (7.95)
True              224.13  (5.12)    218.04  (6.80)     96.90  (3.02)     99.12  (2.50)

NOTE: PCR-1: PCR with one component; PLS-1: PLS with one component; Super PC: supervised PC; SPLS-1 (FDR): SPLS with one component tuned by FDR (FDR = 0.4) control; SPLS-1 (CV): SPLS with one component tuned by CV; Mixed var-cov: mixed variance-covariance model; Gene shav: gene shaving; True: true model. Standard errors are given in parentheses.

Table 4: Comparison of the Number of Selected TFs.

Method         # of TFs selected (s)   # of known TFs (k)   Prob(K ≥ k)
Multi SPLS              48                     15               0.007
Uni SPLS                65                     18               0.008
Super PC                65                     18               0.008
LASSO                  102                     21               0.407
Total                  106                     21               1

NOTE: Prob(K ≥ k) is the probability that a randomly chosen set of size s includes at least k known TFs.