


Chapter 1

A GRADIENT-BASED FORWARD GREEDY ALGORITHM FOR SPARSE GAUSSIAN PROCESS REGRESSION

Ping Sun, Xin Yao
CERCIA, School of Computer Science
University of Birmingham, Edgbaston Park Road
Birmingham, B15 2TT, UK
[email protected], [email protected]

Abstract    In this chapter, we present a gradient-based forward greedy method for sparse approximation of the Bayesian Gaussian Process Regression (GPR) model. Different from previous work, which is mostly based on various basis vector selection strategies, we propose to construct, instead of select, a new basis vector at each iterative step. This idea was motivated by the well-known gradient boosting approach. The resulting algorithm, built on gradient-based optimisation packages, incurs similar computational cost and memory requirements to other leading sparse GPR algorithms. Moreover, the proposed work is a general framework which can be extended to deal with other popular kernel machines, including Kernel Logistic Regression (KLR) and Support Vector Machines (SVMs). Numerical experiments on a wide range of datasets are presented to demonstrate the superiority of our algorithm in terms of generalisation performance.

Keywords:   Gaussian process regression, sparse approximation, sequential forward greedy algorithm, basis vector selection, basis vector construction, gradient-based optimisation, gradient boosting

1. Introduction

Recently, Gaussian Processes (GP) [16] have become one of the most popular kernel machines in the machine learning community. Besides their simplicity in training and model selection, GP models also yield probabilistic predictions for test examples with excellent generalisation capability. However, original GP models cannot be applied to large datasets because of their high computational demands. Firstly, GP models require the computation and storage of the full kernel matrix K (also known as the covariance matrix) of size n × n, where n is the number of training examples. Secondly, the computational cost of training GP models is about O(n^3). Thirdly, predicting a test case requires O(n) for evaluating the mean and O(n^2) for computing the variance. In order to overcome these limitations, a number of approximation schemes have been proposed recently (see [21], chapter 8) to accelerate the computation of GP. Most of these approaches can be broadly classified into two main types: (i) greedy forward selection methods, which can also be viewed as iteratively approximating the full kernel matrix by a low-rank representation [1, 29, 28, 34, 9, 19, 36, 26, 15, 30]; (ii) methods that approximate the matrix-vector multiplication (MVM) operations by the Fast Gauss Transform (FGT) [35] and, more generally, the N-body approach [14]. All of these algorithms can achieve linear scalability in the number of training examples for both computational cost and memory requirement. In contrast to the MVM approximation, the method of approximating the kernel matrix is simpler to implement, since it does not require determining additional critical parameters [35]. In this chapter we follow the path of approximating the full kernel matrix and propose a forward greedy algorithm, different from previous work, for achieving a low-rank kernel representation. The main idea is to construct instead of select basis vectors, which was inspired by the well-known gradient boosting framework [10]. Here we focus only on regression problems; the work can be extended to classification tasks [37].

We now outline the contents of this chapter. In Section 2, we introduce GP regression (GPR) and briefly show how approximate GPR models are obtained in the current literature. In Section 3, we review some forward greedy algorithms for approximating the full GPR model and present our motivation. In Section 4, we detail our approach. Some experimental results are reported in Section 5. Finally, Section 6 concludes this chapter by presenting possible directions of future research.

2. Gaussian Process Regression

In regression problems, we are given training data composed of n examples, D = {(x_1, y_1), ..., (x_n, y_n)}, where x_i ∈ R^m is the m-dimensional input and y_i ∈ R is the corresponding target. It is common to assume that the outputs y_i are generated by

y_i = f(x_i) + ε_i,    (1.1)

where ε_i is a normal random variable with density P(ε_i) = N(ε_i | 0, σ^2) and f(x) is an unobservable latent function. The goal of the regression task is to estimate the function f(x), which is then used to predict the target y_* on an unseen test case x_*.


Nomenclature

n: total number of training examples
m: dimension of the input
x_i, X: input example i and X = [x_1 ... x_n]^⊤ ∈ R^{n×m}
x_i(l): the l-th entry of the input x_i
y_i, y: target of x_i and y = [y_1, ..., y_n]^⊤ ∈ R^n
Id_q, 1_q: the identity matrix of size q × q and the all-one vector in R^q
K(x_i, x_j): kernel function, also known as covariance function
θ_0, θ_l, θ_b: hyperparameters of the kernel K(x_i, x_j)
K: training kernel matrix, (K)_{ij} = K(x_i, x_j), i, j = 1, ..., n
σ^2: variance of the noise
f(x_i): an unobservable latent function
f: vector of latent function values, i.e., f = [f(x_1), ..., f(x_n)]^⊤
N(·|μ, Σ): density of a Gaussian with mean μ and covariance Σ
P(·): the probability density function
x_*, y_*: test input and target
f_*: latent function value at x_*
k_*, k_{**}: (k_*)_i = K(x_i, x_*), i = 1, ..., n, and k_{**} = K(x_*, x_*)
μ_*, σ_*: the predictive mean and variance
α: weight parameter, α ∈ R^n
E(·): the objective (error) function
p: iteration index, i.e., the number of selected (or constructed) basis vectors
i_p: index of the p-th basis vector to be added
I_p: index set, I_p = {i_1, ..., i_p}
x̃_j, X̃_p: selected or constructed basis vector j and X̃_p = [x̃_1 ... x̃_p]^⊤
x̃_j(l): the l-th entry of the basis vector x̃_j
K_p: kernel columns, (K_p)_{ij} = K(x_i, x̃_j), i = 1, ..., n; j = 1, ..., p
k_p: the p-th column of K_p
Q_p: matrix induced by {x̃_j}_{j=1}^p, (Q_p)_{ij} = K(x̃_i, x̃_j)
q*_p, q_p: q*_p is the p-th diagonal entry of Q_p and q_p is the p-th column of Q_p excluding q*_p
K̃: approximate kernel matrix of K, K̃ = K_p Q_p^{−1} K_p^⊤
Q_p(·): probability density function conditioned on K̃ = K_p Q_p^{−1} K_p^⊤
μ̃_*, σ̃^2_*: approximate predictive mean and variance
k̃_*: (k̃_*)_j = K(x̃_j, x_*), j = 1, ..., p
α_p: a sparse estimate of α, α_p = (K_p^⊤ K_p + σ^2 Q_p)^{−1} K_p^⊤ y
μ_p, r_p: training mean μ_p = K_p α_p and residual error r_p = y − μ_p
H_p: the matrix Id_n − K_p Σ_p K_p^⊤
L_p: factor of the Cholesky decomposition Q_p = L_p L_p^⊤
G_p: the product K_p L_p^{−⊤}
M_p: factor of the Cholesky decomposition (G_p^⊤ G_p + σ^2 Id_p) = M_p M_p^⊤


In the GPR framework, the underlying f(x) is assumed to be a zero-mean Gaussian process, which is a collection of random variables, any finite number of which have a joint Gaussian distribution [21]. Let f = [f(x_1), ..., f(x_n)]^⊤ be the vector of latent function values. GPR assumes a GP prior over the functions, i.e. P(f) = N(f | 0, K), where K is the covariance matrix generated by evaluating paired inputs {(x_i, x_j) | i, j = 1, ..., n} with a covariance function K(x_i, x_j).

A common example of K(x_i, x_j) is the squared-exponential function

K(x_i, x_j; θ) = θ_0 exp( −(1/2) Σ_{l=1}^m θ_l (x_i(l) − x_j(l))^2 ) + θ_b,    (1.2)

where θ_0, θ_l, θ_b > 0 are hyperparameters, θ = [θ_0, θ_1, ..., θ_m, θ_b]^⊤ ∈ R^{m+2} and x_i(l) denotes the l-th entry of x_i.

In order to make a prediction for a new input x_*, we need to compute the predictive distribution P(f_* | x_*, y). First, the probability P(y | f), known as the likelihood, can be evaluated by

P(y | f) = Π_{i=1}^n N(y_i | f(x_i), σ^2) = N(y | f, σ^2 Id_n),    (1.3)

where Id_n is an identity matrix of size n × n. Second, the posterior probability of f can be written as

P(f | y) ∝ P(f) P(y | f) ∝ N(f | K(K + σ^2 Id_n)^{−1} y, σ^2 K(K + σ^2 Id_n)^{−1}).    (1.4)

Third, the joint GP prior P(f, f_*) is multivariate Gaussian as well, denoted as

P([f; f_*]) = N([f; f_*] | 0, [K, k_*; k_*^⊤, k_{**}]),    (1.5)

where

k_* = (K(x_i, x_*))_{i=1}^n,  k_{**} = K(x_*, x_*).    (1.6)

Furthermore, the conditional distribution of f_* given f is a Gaussian,

P(f_* | f, x_*) = N(k_*^⊤ K^{−1} f, k_{**} − k_*^⊤ K^{−1} k_*),    (1.7)

and finally the predictive distribution P(f_* | x_*, y) can be found by

P(f_* | x_*, y) = ∫ P(f_* | f, x_*) P(f | y) df = N(f_* | μ_*, σ^2_*),    (1.8)

where

μ_* = k_*^⊤ α,  σ^2_* = k_{**} − k_*^⊤ (K + σ^2 Id_n)^{−1} k_*,    (1.9)

and the weight parameter

α = (K + σ^2 Id_n)^{−1} y.    (1.10)

Clearly, the main task of learning a GPR model is to estimate α. From (1.9) and (1.10), we note that training a full GPR model requires O(n^3) time and O(n^2) memory, and that computing the predictive mean and variance for a new test case costs O(n) and O(n^2), respectively. It is therefore impractical to apply GPR to large-scale training or testing datasets. This has led people to investigate approximate GPR models.
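To make the quantities in (1.9)-(1.10) concrete, the following minimal NumPy sketch (illustrative only, not the authors' Matlab implementation; function and parameter names are our own) trains and queries a full GPR model with the squared-exponential kernel (1.2). The O(n^3) linear solve for α is exactly the bottleneck discussed above.

```python
import numpy as np

def sq_exp_kernel(A, B, theta0=1.0, theta=None, theta_b=0.0):
    """Squared-exponential kernel (1.2) between row-vector sets A (a x m) and B (b x m)."""
    if theta is None:
        theta = np.ones(A.shape[1])
    # weighted squared distances: sum_l theta_l * (A_il - B_jl)^2
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2 * theta).sum(-1)
    return theta0 * np.exp(-0.5 * d2) + theta_b

def full_gpr_fit(X, y, sigma2, **kern):
    """Weight vector alpha = (K + sigma^2 I)^(-1) y of (1.10); costs O(n^3)."""
    K = sq_exp_kernel(X, X, **kern)
    alpha = np.linalg.solve(K + sigma2 * np.eye(len(y)), y)
    return alpha, K

def full_gpr_predict(Xstar, X, y, alpha, K, sigma2, **kern):
    """Predictive mean and variance of (1.9) for a batch of test inputs Xstar."""
    Kstar = sq_exp_kernel(Xstar, X, **kern)              # t x n
    kss = np.diag(sq_exp_kernel(Xstar, Xstar, **kern))   # k_** for each test case
    mean = Kstar @ alpha                                 # O(n) per test case
    V = np.linalg.solve(K + sigma2 * np.eye(len(y)), Kstar.T)
    var = kss - np.sum(Kstar * V.T, axis=1)              # O(n^2) per test case
    return mean, var
```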

In order to understand the main ideas of the approximate GPR models that have appeared in the literature, we view estimating α in (1.10) as the solution of the following optimisation problem [28, 30]:

min_α E(α) = (1/2) α^⊤ (σ^2 K + K^⊤ K) α − (K^⊤ y)^⊤ α + (1/2) y^⊤ y    (1.11)
           = (1/2) ‖y − Kα‖^2 + (σ^2/2) α^⊤ K α.    (1.12)

Based on formulation (1.12), it can be noted that many other popular kernel machines invented later, such as Kernel Ridge Regression (KRR) [24], Least Squares Support Vector Machines (LS-SVM) [31], Kernel Fisher Discriminant [18], Regularised Least Squares Classification (RLSC) [23] and the Proximal Support Vector Machine (PSVM) [11], are in essence equivalent to the GPR model.

Since the matrix (σ^2 K + K^⊤ K) in (1.11) is symmetric and the objective is a quadratic function, it is straightforward to exploit the well-known Conjugate Gradient (CG) method [12]. The CG method solves the problem (1.11) by iteratively performing matrix-vector multiplication (MVM) operations Kc, where c ∈ R^n is a vector. This directly motivated some researchers to apply the improved fast Gauss transform (IFGT) [35], KD-trees [27] and the general N-body approach [13] to accelerate the computation of the full GPR model through a series of efficient approximations of the product Kc.

Another class of approximate GPR models is based on a sparse estimate of α, and can be further explained as approximating the full kernel matrix K by a low-rank kernel representation. A sparse estimate of α is defined as one in which redundant or uninformative entries are set to exactly zero. If we use α_p to denote all the non-zero entries of α, indexed by I_p = {i_1, ..., i_p}, then the objective function (1.12) can be equivalently written as

min_{α_p} E(α_p) = (1/2) ‖y − K_p α_p‖^2 + (σ^2/2) α_p^⊤ Q_p α_p,    (1.13)


where K_p denotes the submatrix of the columns of K centred on {x_{i_j}, j = 1, ..., p}. Let x̃_j = x_{i_j}; we refer to {x̃_j}_{j=1}^p as the set of basis vectors.¹ Q_p denotes the kernel matrix generated by these basis vectors, i.e., (Q_p)_{ij} = K(x̃_i, x̃_j), i, j = 1, ..., p. The sparse estimate α_p can be obtained from (1.13) as

α_p = Σ_p K_p^⊤ y    (1.14)

with

Σ_p = (K_p^⊤ K_p + σ^2 Q_p)^{−1}.    (1.15)

In contrast to (1.10), computing α_p in (1.14) only needs O(np^2) operations instead of the original O(n^3), which greatly alleviates the computational burden involved in the training and testing procedures of the full GPR model when p ≪ n in practice.
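As a concrete illustration of (1.14)-(1.15), here is a minimal sketch of ours (a numerically robust implementation would use the Cholesky-based updates of the Appendix rather than an explicit inverse):

```python
import numpy as np

def sparse_gpr_fit(X, y, Xb, sigma2, kernel):
    """Sparse weights alpha_p of (1.14)-(1.15) for given basis vectors Xb (p x m).

    kernel(A, B) is assumed to return the matrix of pairwise covariances, e.g. (1.2).
    Once the p basis vectors are fixed, the dominant cost is the O(n p^2) product Kp^T Kp.
    """
    Kp = kernel(X, Xb)                                   # n x p columns of K
    Qp = kernel(Xb, Xb)                                  # p x p kernel among basis vectors
    Sigma_p = np.linalg.inv(Kp.T @ Kp + sigma2 * Qp)     # (1.15)
    alpha_p = Sigma_p @ (Kp.T @ y)                       # (1.14)
    return alpha_p, Sigma_p, Kp, Qp
```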

It has been observed that selecting a good index set I_p has a crucial effect on the generalisation performance of the obtained sparse GPR model. Most current algorithms formulate the selection procedure as an iterative forward selection process. At each iteration, a new basis vector is identified by greedy optimisation of some criterion and the corresponding α_p is then updated. We therefore refer to this class of methods as greedy forward selection algorithms.

In fact, the above sparsification procedure can also be understood as approximating the kernel matrix K by a low-rank representation of the form K̃ = K_p Q_p^{−1} K_p^⊤. This can be seen from the optimal objective values of the problem (1.11) and the sparse version (1.13):

E(α) = (σ^2/2) y^⊤ (K + σ^2 Id_n)^{−1} y    (1.16)

and

E(α_p) = (σ^2/2) y^⊤ (K_p Q_p^{−1} K_p^⊤ + σ^2 Id_n)^{−1} y.    (1.17)

Further, this means that the sparse GPR model is obtained by replacing the original GP prior P(f) = N(f | 0, K) with an approximate prior Q_p(f) = N(f | 0, K_p Q_p^{−1} K_p^⊤) [5]. Following the same derivation as for the full GPR model, the approximate predictive distribution Q_p(f_* | x_*, y) of the sparse GPR model becomes

Q_p(f_* | x_*, y) = ∫ Q_p(f_* | f) P(f | y) df = N(f_* | μ̃_*, σ̃^2_*),    (1.18)

where

μ̃_* = k̃_*^⊤ α_p,  k̃_* = (K(x̃_j, x_*))_{j=1}^p,    (1.19)

σ̃^2_* = k_{**} − k̃_*^⊤ Q_p^{−1} k̃_* + σ^2 k̃_*^⊤ Σ_p k̃_*.    (1.20)


It can be noted that, in the sparse approximation of GPR models, computing the predictive mean and variance only needs O(p) and O(p^2) operations, respectively.
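The approximate predictive equations (1.19)-(1.20) can then be evaluated as in the following sketch, reusing the quantities returned by the fitting sketch above (again illustrative; explicit inverses are used only for brevity):

```python
import numpy as np

def sparse_gpr_predict(Xstar, Xb, alpha_p, Sigma_p, Qp, sigma2, kernel):
    """Approximate predictive mean (1.19) and variance (1.20): O(p) and O(p^2) per test case."""
    ks = kernel(Xstar, Xb)                                # t x p, rows are k~_* for each test input
    kss = np.diag(kernel(Xstar, Xstar))                   # k_** values
    mean = ks @ alpha_p
    var = kss - np.sum(ks @ np.linalg.inv(Qp) * ks, axis=1) \
          + sigma2 * np.sum(ks @ Sigma_p * ks, axis=1)
    return mean, var
```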

Compared to the approaches of approximating MVM by IFGT [35] and KD-trees [27], greedy forward selection algorithms only involve standard linear algebra operations and do not require specifying any critical parameters, as is the case for IFGT [35]. Moreover, the approximation quality of MVM degrades when we are confronted with high-dimensional problems, even though more sophisticated algorithms have been proposed [22, 3].

As mentioned above, the crucial step of greedy forward algorithms is to select a good index set I_p based on some criterion. In other words, the problem is how to find representative basis vectors from the original training examples. A number of basis vector selection schemes have been proposed [1, 29, 28, 34, 9, 19, 36, 26, 15, 30]. In the next section, we briefly summarise these algorithms and motivate our new gradient-based algorithm.

3. Basis Vector Selection Algorithms

Clearly, choosing p basis vectors out of n possible choices involves a combinatorial search over a space of C(n, p) candidates and is an NP-hard problem [20]. We therefore have to resort to near-optimal search schemes, such as the greedy forward selection algorithms mentioned above, to ensure computational efficiency. This section reviews some principled basis vector selection schemes and analyses their computational complexity. For any greedy forward selection approach, the associated time complexity is composed of two parts, T_basic and T_selection, as defined in [15]. T_basic denotes the cost of updating the sparse GPR model given the index set I_p; this cost is the same for all forward selection algorithms. The other part, T_selection, refers to the cost incurred by the procedure of selecting basis vectors. In the following, for simplicity, we always neglect the T_basic cost, and all time complexity statements refer to the T_selection cost. For convenience, we categorise the algorithms in the literature into unsupervised (i.e., independent of the target information) and supervised types. Although some algorithms, such as [1, 2, 9, 19], were not proposed to deal directly with sparse GPR models, their ideas can easily be extended to select the set of basis vectors for GPR models.

Unsupervised methods

The simplest unsupervised method is random selection [29, 34], but several experimental studies [26, 15] have shown that this can produce poor results. All other unsupervised methods [2, 9, 7, 8] attempt to directly minimise the trace of the residual matrix tr(ΔK_p) = tr(K − K̃) = tr(K − K_p Q_p^{−1} K_p^⊤). Let Q_{p−1} = L_{p−1} L_{p−1}^⊤ be the Cholesky decomposition and G_{p−1} = K_{p−1} L_{p−1}^{−⊤}. Let i_p be the index of the next added basis vector, k_p = (K(x_i, x_{i_p}))_{i=1}^n, q_p = (K(x̃_j, x_{i_p}))_{j=1}^{p−1}, q*_p = K(x_{i_p}, x_{i_p}) and l_p = L_{p−1}^{−1} q_p. We have

J_p = tr(ΔK_p) = J_{p−1} − ‖g_p‖^2,    (1.21)

where

g_p = (k_p − G_{p−1} l_p) / √(q*_p − l_p^⊤ l_p).    (1.22)

So, computing the exact reduction ‖g_p‖^2 after including the i_p-th column is an O(np) operation [2]. If this were done for all the remaining columns at each iteration, it would lead to a prohibitive total complexity of O(n^2 p^2). Fine and Scheinberg [9] proposed a cheap implementation. Since ‖g_p‖^2 is lower bounded by (g_p(i_p))^2 = q*_p − l_p^⊤ l_p, which can be maintained recursively, they simply evaluate this bound, at negligible cost, to choose the p-th basis vector. Another cheap implementation of this idea is an on-line scheme [7, 8].

Supervised methods

It is quite natural, when approximating K, to also consider the target information, since we are dealing with a supervised learning task. Building on the unsupervised methods, Bach and Jordan [1] recently proposed an algorithm which selects a new basis vector based on a trade-off between the unsupervised term tr(K − K_p Q_p^{−1} K_p^⊤) and the training squared error term ‖y − K_p α_p‖^2. Combined with an efficient 'look-ahead' strategy, their selection scheme only incurs O(δnp) T_selection complexity when p basis vectors are selected, where δ is set to a small value. Removing the unsupervised term, Nair et al. [19] developed a very cheap strategy to decrease the supervised term ‖y − K_p α_p‖^2, which is achieved by examining the current residual (r_p = y − K_p α_p) and searching for the entry with the largest absolute value.

Following the formulation (1.13) of the sparse GPR model, it is preferable to choose the basis vector which leads to the largest reduction in the objective (1.17), as first proposed in [28]. Let H_p = Id_n − K_p Σ_p K_p^⊤; then E(α_p) can be computed recursively as [30]:

E_p = E_{p−1} − ΔE_1(i_p),    (1.23)

where

ΔE_1(i_p) = (1/2) (g_p^⊤ H_{p−1} y)^2 / (σ^2 + g_p^⊤ H_{p−1} g_p).    (1.24)


Similar to the criterion (1.21), computing the reduction ΔE_1(j), j ∉ I_{p−1}, for all n + 1 − p previously unselected vectors until p basis vectors have been accumulated is a prohibitive O(n^2 p^2) operation. Therefore, Smola and Bartlett [28] resorted to a sub-greedy scheme that considers only κ candidates randomly chosen from outside I_{p−1} during the p-th basis vector selection; they used a value of κ = 59. For this sub-greedy method, the complexity is reduced to O(κnp^2). Alternatively, Sun and Yao [30] recently improved the original O(n^2 p^2) complexity to O(n^2 p) by recursively maintaining some quantities for all remaining vectors. Furthermore, they [30] suggest using only the numerator part of ΔE_1(i_p), i.e.,

ΔE_2(i_p) = (1/2) (g_p^⊤ H_{p−1} y)^2 = (1/2) (g_p^⊤ r_{p−1})^2,    (1.25)

where r_{p−1} = H_{p−1} y = y − K_{p−1} α_{p−1}, as the criterion for scoring all remaining vectors, which produces almost the same prediction accuracy as the criterion (1.24). The advantage of this simplified version (1.25) is that the computational cost decreases to O(κnp) when combined with the sub-greedy scheme, compared to the O(κnp^2) cost incurred by the sub-greedy method of [28].

Another scoring criterion, also based on optimising objective (1.13), is the matching pursuit approach [15], which was motivated by [33]. Instead of minimising (1.13) over all the entries of α_p, as in the case of (1.24), they adjust only the last entry of α_p to optimise (1.13). The resulting selection criterion is [15]

ΔE_3(i_p) = (1/2) [k_p^⊤ r_{p−1} − σ^2 q_p^⊤ α_{p−1}]^2 / (σ^2 q*_p + k_p^⊤ k_p).    (1.26)

The computational cost of using (1.26) to score one basis vector is O(n) time, similar to the criterion (1.25). The empirical study conducted in [30] showed that (1.26) is always inferior to (1.25) in generalisation performance, especially on large-scale datasets.

The last supervised method we introduce here is the so-called 'Info-gain' approach [26]. Let Q_p(f | y) denote the posterior probability of f given the approximate GP prior Q_p(f), analogously to (1.4); Info-gain scores the "informativeness" of a basis vector by the Kullback-Leibler distance between Q_p(f | y) and Q_{p−1}(f | y), i.e. KL[Q_p ‖ Q_{p−1}]. Under some assumptions, this criterion can be simplified to a very cheap approach costing only O(1) per evaluated basis vector. However, Info-gain sometimes leads to very poor results, as reported in [15] and also shown in our experiments.

Across the algorithms discussed above, we note that, at the p-th iteration, all of them try to select a new basis vector from the remaining (n − p + 1) columns of K. If the dataset is very large, the computational cost of scoring (n − p + 1) candidates is prohibitive for some of the selection criteria above. The interesting question is: why do we have to select from a huge pool of vectors rather than construct one? This is the starting point of our work.

4. A Gradient-based Forward Greedy Algorithm

The key idea is to construct, rather than select, a basis vector at each iteration. This is motivated by the well-known gradient boosting framework [10]. Before proceeding to our new algorithm, we briefly describe boosting. The basic idea behind boosting is that, rather than using just a single learner for prediction, a linear combination of T base learners

F(x) = Σ_{t=1}^T β_t h_t(x)    (1.27)

is used [17]. Here each h_t(x) is a base learner (e.g. a decision tree) and β_t is its coefficient in the linear combination. Following the pioneering work by Friedman [10], the boosting procedure can be generally viewed as a gradient-based incremental search for a good additive model [10]. This is done by searching, at each iteration, for the base learner which gives the "steepest descent" in the loss, denoted by L(y, f). The essential steps of a boosting procedure can be summarised as follows:

1  F_0(x) = 0;

2  For t = 1 : T do:

   (a) (β_t, h_t(x)) = argmin_{β*, h(x)} Σ_{i=1}^n L(y_i, F_{t−1}(x_i) + β* h(x_i));

   (b) F_t(x) = F_{t−1}(x) + β_t h_t(x);

3  EndFor

4  F(x) = F_T(x) = Σ_{t=1}^T β_t h_t(x).

Replacing the loss L(y, f) by different loss functions produces a family of boosting algorithms. The most prominent example is AdaBoost [25], which employs the exponential loss function

L(y_i, f(x_i)) = exp{−y_i f(x_i)},  with y_i ∈ {−1, +1}.    (1.28)
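As a toy illustration of steps 1-4 (not tied to the GPR objective used later), the following sketch of ours runs gradient boosting with the squared loss and regression stumps, where fitting the current residual is exactly the "steepest descent" step 2(a):

```python
import numpy as np

def fit_stump(X, residual):
    """Least-squares regression stump: best (feature, threshold, left value, right value)."""
    best = None
    for l in range(X.shape[1]):
        for thr in np.unique(X[:, l]):
            left = X[:, l] <= thr
            if left.all() or not left.any():
                continue
            cl, cr = residual[left].mean(), residual[~left].mean()
            err = ((residual - np.where(left, cl, cr)) ** 2).sum()
            if best is None or err < best[0]:
                best = (err, l, thr, cl, cr)
    return best[1:]

def ls_boost(X, y, T=50):
    """Gradient boosting with squared loss: each stump fits the residual (the negative
    gradient of the loss), with the coefficient beta_t absorbed into the stump values."""
    F = np.zeros(len(y))
    learners = []
    for _ in range(T):
        l, thr, cl, cr = fit_stump(X, y - F)            # step 2(a) for L(y, f) = (y - f)^2 / 2
        F = F + np.where(X[:, l] <= thr, cl, cr)        # step 2(b)
        learners.append((l, thr, cl, cr))
    return learners, F
```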

Let us go back to the sparse GPR approach, which aims to find a sparse representation of the regression model of the form

f_p(x) = Σ_{j=1}^p α_p(j) K(x̃_j, x),    (1.29)

where α_p(j) is the j-th entry of α_p. If we conceptually regard each term K(x̃_j, x), j = 1, ..., p, involved in (1.29) as a base learner, then all of the greedy forward selection algorithms summarised in Section 3 are equivalent to the above boosting procedure. The only difference is that greedy forward selection algorithms select a new base learner at each iteration, whereas boosting constructs a base learner by gradient-based search. This ultimately motivates us to propose the following new approach for sparse GPR.

We formulate the problem of building a sparse GPR model as a boosting procedure. First, the loss L(y, f) is replaced by the objective (1.13). Then, at each iteration, we construct the 'base learner' K(x̃_p, x) by optimising (1.13) with respect to the parameters x̃_p; its coefficient α*_p is changed accordingly. In detail, this can be described by the following optimisation problem:

min_{α*_p ∈ R, x̃_p ∈ R^m} E(α*_p, x̃_p) = (1/2) ‖y − K_{p−1} α_{p−1} − α*_p k_p(x̃_p)‖^2
    + (σ^2/2) [α_{p−1}; α*_p]^⊤ [Q_{p−1}, q_p(x̃_p); q_p(x̃_p)^⊤, q*_p(x̃_p)] [α_{p−1}; α*_p].    (1.30)

In order to emphasise that k_p, q_p and q*_p depend on x̃_p, we have written them as functions in (1.30). For simplicity, we sometimes suppress the explicit dependence on x̃_p. It is easy to show that

E(α*_p, x̃_p) = E_{p−1} + (1/2) (α*_p)^2 (σ^2 q*_p + k_p^⊤ k_p) + α*_p (σ^2 q_p^⊤ α_{p−1} − k_p^⊤ r_{p−1}).    (1.31)

Since the condition for optimality of α*_p is

∂E(α*_p, x̃_p) / ∂α*_p = α*_p (σ^2 q*_p + k_p^⊤ k_p) + [σ^2 q_p^⊤ α_{p−1} − k_p^⊤ r_{p−1}] = 0,    (1.32)

we get

α*_p = (k_p^⊤ r_{p−1} − σ^2 q_p^⊤ α_{p−1}) / (σ^2 q*_p + k_p^⊤ k_p).    (1.33)

Substituting α*_p in (1.31) with (1.33), the problem (1.30) can be equivalently written as

min_{x̃_p ∈ R^m} E(x̃_p) = E_{p−1} − { (1/2) [k_p(x̃_p)^⊤ r_{p−1} − σ^2 q_p(x̃_p)^⊤ α_{p−1}]^2 / (σ^2 q*_p(x̃_p) + k_p(x̃_p)^⊤ k_p(x̃_p)) }.    (1.34)

In fact, the objective function (1.34) we derived is the same as the criterion (1.26); the only difference is that we do not restrict the candidate for the next basis vector to be a training example. The derivative of (1.34) with respect to x̃_p(l), l = 1, ..., m, is easily obtained:

for p = 1,   ∂E(x̃_p)/∂x̃_p(l) = −(1/2) α*_p [2 k̇_p^⊤ r_{p−1} − α*_p (σ^2 q̇*_p + 2 k̇_p^⊤ k_p)],

for p > 1,   ∂E(x̃_p)/∂x̃_p(l) = −(1/2) α*_p [2 (k̇_p^⊤ r_{p−1} − σ^2 q̇_p^⊤ α_{p−1}) − α*_p (σ^2 q̇*_p + 2 k̇_p^⊤ k_p)],

where

k̇_p = ∂k_p(x̃_p)/∂x̃_p(l),  q̇_p = ∂q_p(x̃_p)/∂x̃_p(l),  q̇*_p = ∂q*_p(x̃_p)/∂x̃_p(l).    (1.35)

So, any gradient-based optimisation algorithm can be used to construct the base learner K(x̃_p, x) and thus the new basis vector x̃_p. Note that it costs only O(n) time to compute E(x̃_p) and the corresponding gradient information if the dimension m ≪ n and the number of selected basis vectors p ≪ n. Therefore our algorithm is applicable to large-scale datasets, as is (1.26). From a complexity viewpoint, the proposed method is the same as the criteria (1.25) and (1.26), but our approach additionally requires computing the gradient information (1.35), which makes it slightly slower than the other approaches. The updating of related quantities after x̃_p has been constructed is detailed in the Appendix.
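The sketch below shows one way the objective (1.34) and its gradient could be evaluated for the squared-exponential kernel (1.2) with θ_b = 0, so that the gradients of Appendix A apply exactly. The interface (returning a value/gradient pair) is chosen to suit a generic quasi-Newton routine; it is our own illustration, not the authors' code.

```python
import numpy as np

def construction_objective(xb, X, Xb_prev, r_prev, alpha_prev, sigma2, theta0, theta):
    """E(xb) - E_{p-1} from (1.34) and its gradient w.r.t. the candidate basis vector xb.

    Xb_prev holds the p-1 basis vectors already constructed (possibly an empty array);
    theta is the vector of per-dimension hyperparameters theta_l, and theta_b is taken as 0.
    """
    diffX = X - xb                                                   # n x m
    kp = theta0 * np.exp(-0.5 * (diffX ** 2 * theta).sum(1))         # k_p(xb)
    if len(Xb_prev):
        diffB = Xb_prev - xb                                         # (p-1) x m
        qp = theta0 * np.exp(-0.5 * (diffB ** 2 * theta).sum(1))     # q_p(xb)
        u = kp @ r_prev - sigma2 * (qp @ alpha_prev)
    else:
        u = kp @ r_prev
    qstar = theta0                                                   # K(xb, xb) with theta_b = 0
    v = sigma2 * qstar + kp @ kp
    f = -0.5 * u ** 2 / v                                            # value to be minimised
    astar = u / v                                                    # optimal coefficient (1.33)
    # Appendix A gradients: column l of dkp is dk_p/dxb(l); q*_p is constant here
    dkp = theta * kp[:, None] * diffX                                # n x m
    du = r_prev @ dkp
    dv = 2.0 * (kp @ dkp)
    if len(Xb_prev):
        dqp = theta * qp[:, None] * diffB                            # (p-1) x m
        du = du - sigma2 * (alpha_prev @ dqp)
    grad = -(astar * du - 0.5 * astar ** 2 * dv)
    return f, grad
```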

In our implementation, we employ the routine BFGS [4] as the gradient-based optimisation package. In the course of the numerical experiments, we found that even with a small number of BFGS steps at each iteration we obtain better results than other leading algorithms. In order to improve the performance of the proposed gradient-based algorithm further, we use the following multiple-initialisation strategy. At the beginning of each iteration, we randomly take 20 training examples as initial basis vectors and rank them by (1.34). The best one is used to initialise the BFGS routine. Moreover, we set the maximal number of BFGS steps allowed at each iteration to 39. Thus, the objective function (1.34) is evaluated 59 times in total per iteration. The aim of this setting is to allow comparison with other sub-greedy algorithms [28, 15, 30], which evaluate their corresponding selection criteria κ = 59 times at each iteration. The steps of the proposed gradient-based forward greedy algorithm can be summarised as follows:

For p = 1, ..., p_max (the maximal number of basis vectors):

1  Randomly take 20 training examples from {x_i}_{i=1}^n and score them by (1.34); pick the best one, denoted x^0_p;

2  Using x^0_p as the initial value, run the routine BFGS; the output x̃_p is the p-th constructed basis vector;

3  Update I_{p−1}, K_{p−1}, Q_{p−1}, G_{p−1}, L_{p−1}, α_{p−1}, μ_{p−1}, r_{p−1} and other related quantities (see the Appendix for details).

End For

Outputs: {x̃_j}_{j=1}^p, α_p, Q_p and Σ_p.


Finally, it is worth emphasising that the proposed gradient-based approach to sparse GPR with the objective (1.13) can be straightforwardly extended to deal with other types of objective functions, corresponding to different kinds of kernel machines. For example, the following two objectives, E_KLR and E_SVM, correspond to kernel logistic regression (KLR) [37] and support vector machines (SVM) [6], respectively:

E_KLR = (1/n) Σ_{i=1}^n ln(1 + exp{−y_i f_p(x_i)}) + (σ^2/2) α_p^⊤ Q_p α_p    (1.36)

and

E_SVM = (1/n) Σ_{i=1}^n max(0, 1 − y_i f_p(x_i))^2 + (σ^2/2) α_p^⊤ Q_p α_p,    (1.37)

where f_p(x) is defined in (1.29). Similar to sparse GPR, the expected training algorithms for both KLR and SVM scale linearly in the number of training cases and would be much faster and more accurate than existing selection-based approaches.
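For instance, with the sparse parameterisation f_p(x_i) = (K_p α_p)_i and labels y_i ∈ {−1, +1}, the KLR objective (1.36) and its gradient with respect to α_p could be evaluated as in the following sketch of ours (not an implementation from the chapter):

```python
import numpy as np

def klr_objective(alpha_p, Kp, Qp, y, sigma2):
    """Objective (1.36) and its gradient w.r.t. alpha_p for kernel logistic regression."""
    n = len(y)
    fp = Kp @ alpha_p
    margins = -y * fp
    loss = np.mean(np.logaddexp(0.0, margins))           # mean ln(1 + exp(-y_i f_p(x_i)))
    reg = 0.5 * sigma2 * alpha_p @ (Qp @ alpha_p)
    # d/dalpha of the mean log-loss is -Kp^T (y * sigmoid(-y f)) / n
    grad = -(Kp.T @ (y / (1.0 + np.exp(-margins)))) / n + sigma2 * (Qp @ alpha_p)
    return loss + reg, grad
```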

5. Numerical Experiments

In this section, we compare our gradient-based forward greedy algorithm against other leading sparse GPR algorithms induced by different basis selection criteria on four datasets. For simplicity, we refer to each compared algorithm by the name of its first author: Williams [34], Fine [9], Nair [19], Seeger [26], Baudat [2], Bach [1], Smola [28], Keerthi [15] and Sun [30]. The first four employ very cheap basis selection criteria and have negligible T_selection cost. The Baudat method is a special case of Bach² in which the trade-off parameter is set to zero, i.e., only the unsupervised term is considered. To reduce the complexity of the Baudat criterion, we also apply the 'look-ahead' strategy [1] to speed up its computation; thus both methods have the same T_selection complexity, which is O(δnp). We did not run the Smola method in our experiments for two reasons: (1) it has been empirically shown to generate almost the same results as Sun [30]; (2) it leads to O(κnp^2) T_selection complexity, which is much higher than the other approaches. The Keerthi and Sun methods, induced by (1.26) and (1.25) respectively, employ the same sub-greedy strategy and incur O(κnp) T_selection complexity. In our implementation, we set δ = 59 and κ = 59 to ensure the same selection complexity, matching the setting of our gradient-based algorithm described above.

The algorithms presented in this section were coded in Matlab 7.0 and all the numerical experiments were conducted on a machine with a Pentium IV 2 GHz processor and 512 MB memory. For all experiments, the squared-exponential kernel (1.2) was used. The hyperparameters were estimated via a full GPR model on a subset of 1000 examples³ randomly selected from the original dataset; these tasks were accomplished with the GP routines of the well-known NETLAB software⁴. To evaluate generalisation performance, we use the mean squared error (MSE) and the negative logarithm of the predictive distribution (NLPD). Their definitions are

MSE = (1/t) Σ_{i=1}^t (y_i − μ_i)^2,    (1.38)

NLPD = (1/t) Σ_{i=1}^t −log P(y_i | μ_i, σ^2_i),    (1.39)

where t is the number of test examples, y_i is the test target, and μ_i and σ^2_i are the predictive mean and variance, respectively. Sometimes the normalised MSE (NMSE), given by NMSE = MSE/var(y), is used for convenience, where var(y) is the variance of the training targets. Note that NLPD measures the quality of the predictive distributions, as it penalises over-confident predictions as well as under-confident ones. The four datasets employed are Boston Housing, Kin-32nm, LogP and KIN40K⁵. Finally, we select the leading approaches in terms of generalisation performance across the four datasets and compare their scaling behaviour on a set of datasets generated from KIN40K.

A. Boston Housing Dataset

This popular regression dataset comprises 506 examples with 14 variables; the task is to predict the median value of owner-occupied homes from the other 13 variables. The results were averaged over 100 repetitions, in each of which the data set was randomly partitioned into 481/25 training/testing splits, a common setting in the literature [19]. Table 1.1 summarises the test performance of the nine methods, along with the standard deviation, for p = 100 and p = 200.

From Table 1.1, it can be noted that, for both p = 100 and p = 200, our basis-vector-construction method almost always achieves the best results on both MSE and NLPD, although the margin is not significant, especially when more basis vectors are picked; where it is not the best, it still ranks second among all nine methods. In addition, the performance of the three unsupervised basis selection methods marked by the superscript † is systematically worse than that of the six supervised methods when fewer basis vectors are selected. But when nearly half of the training examples are chosen, all of these methods produce very similar MSE results.

B. Kin-32nm Dataset


Table 1.1. Test results of nine sparse GPR algorithms on the Boston Housing dataset for p = 100 and p = 200, respectively. The superscript † denotes an unsupervised basis selection method. All reported results are averages over 100 repetitions, along with the standard deviation. The best method is highlighted in bold and the second best in italic.

Method         | p = 100: MSE | p = 100: NLPD | p = 200: MSE | p = 200: NLPD
Williams† [34] | 9.97±6.58    | 2.73±0.44     | 6.98±4.01    | 2.66±0.57
Fine† [9]      | 8.22±3.97    | 2.53±0.29     | 6.83±2.83    | 2.48±0.38
Nair [19]      | 6.83±2.72    | 2.50±0.28     | 6.28±2.70    | 2.56±0.47
Seeger [26]    | 7.32±3.21    | 2.54±0.20     | 6.35±2.63    | 2.45±0.37
Baudat† [2]    | 8.15±4.27    | 2.48±0.29     | 6.56±2.68    | 2.52±0.43
Bach [1]       | 7.52±3.19    | 2.54±0.24     | 6.56±2.66    | 2.54±0.45
Keerthi [15]   | 7.08±2.92    | 2.44±0.24     | 6.38±2.54    | 2.48±0.40
Sun [30]       | 6.64±2.82    | 2.46±0.30     | 6.28±2.55    | 2.55±0.45
Ours           | 6.43±2.67    | 2.46±0.09     | 6.26±2.58    | 2.36±0.13

The Kin-32nm dataset is one of the eight kin-family datasets, which are synthetically generated from a realistic simulation of the forward kinematics of an 8-link all-revolute robot arm. The data comprise 8192 examples with 32 input dimensions; the aim is to predict the distance of the end-effector from a target given the angular positions of the joints, the link twist angles, link lengths, and link offset distances. We randomly split the data into 4000 training and 4192 testing examples and produce 20 such repetitions. Again, we apply the nine methods to this high-dimensional problem. The results on the Kin-32nm dataset are reported in Table 1.2.

According to Table 1.2, our proposed algorithm always ranks first by a significant margin; we believe that, in a high-dimensional case, our flexible gradient-based approach can discover more representative basis vectors than selection-based algorithms. Moreover, the two algorithms Keerthi and Sun, which are based on directly optimising the objective (1.13), also perform clearly better than the other methods. Again, we observe that the supervised basis selection methods are consistently superior to the unsupervised ones.

C. LogP Dataset

The LogP data is a popular benchmark problem in Quantitative Structure-Activity Relationships (QSAR). The data split we use is the same as that in [32]: of the 6912 examples, 691 (10%) were used for testing and the remaining 6221 for training⁶. Since the Matlab source code of the Bach method (including Baudat) provided by the authors involves the computation and storage of the full kernel matrix, it cannot be used to deal with such a large dataset on our PC. Therefore, we remove these two methods from the list in the following comparative study.


Table 1.2. Test results of nine sparse GPR algorithms on the Kin-32nm dataset for p = 100 and p = 200, respectively. The superscript † denotes an unsupervised basis selection method. All reported results are averages over 20 repetitions, along with the standard deviation. The best method is highlighted in bold and the second best in italic.

Method    | p = 100: NMSE | p = 100: NLPD | p = 200: NMSE | p = 200: NLPD
Williams† | 0.634±0.015   | 0.501±0.017   | 0.594±0.011   | 0.541±0.012
Fine†     | 0.645±0.017   | 0.480±0.016   | 0.602±0.013   | 0.502±0.013
Nair      | 0.609±0.015   | 0.470±0.015   | 0.583±0.013   | 0.523±0.015
Seeger    | 0.610±0.017   | 0.470±0.017   | 0.584±0.013   | 0.524±0.015
Baudat†   | 0.643±0.022   | 0.490±0.020   | 0.599±0.014   | 0.511±0.013
Bach      | 0.606±0.013   | 0.450±0.011   | 0.588±0.011   | 0.512±0.009
Keerthi   | 0.588±0.012   | 0.441±0.008   | 0.575±0.012   | 0.506±0.012
Sun       | 0.587±0.012   | 0.441±0.010   | 0.575±0.011   | 0.513±0.011
Ours      | 0.569±0.011   | 0.384±0.007   | 0.553±0.015   | 0.396±0.015

Table 1.3. Test results of seven sparse GPR algorithms on the LogP dataset as the number of selected basis vectors increases. The superscript † denotes an unsupervised basis selection method. The best method is highlighted in bold and the second best in italic.

Method    | p = 100: MSE | p = 100: NLPD | p = 200: MSE | p = 200: NLPD | p = 300: MSE | p = 300: NLPD
Williams† | 0.615 | 5.50 | 0.571 | 9.04 | 0.571 | 9.04
Fine†     | 0.745 | 1.26 | 0.643 | 1.30 | 0.557 | 1.58
Nair      | 0.650 | 2.20 | 0.527 | 7.99 | 0.497 | 11.63
Seeger    | 0.673 | 1.75 | 0.547 | 2.57 | 0.516 | 3.83
Keerthi   | 0.577 | 1.79 | 0.550 | 2.89 | 0.526 | 4.463
Sun       | 0.544 | 3.91 | 0.523 | 7.75 | 0.518 | 11.43
Ours      | 0.528 | 1.13 | 0.521 | 1.08 | 0.509 | 1.06

Table 1.3 reports the performance of the seven remaining methods on the LogP data as the number of selected or constructed basis vectors is increased from 100 to 300. It can be seen from the results that our method achieves very good performance, especially on NLPD, compared with the other six methods. Although the Nair method obtains a slightly better result on MSE when p = 300, it produces a very poor result on NLPD at the same time. It should be emphasised that our prediction accuracy is much better than the results reported in [32], where the best achievable MSE was 0.601.

D. KIN40K Dataset


Table 1.4. Test results of seven sparse GPR algorithms on the KIN40K dataset as the number of selected basis vectors increases. The superscript † denotes an unsupervised basis selection method. All reported results are averages over 10 repetitions, along with the standard deviation. The best method is highlighted in bold and the second best in italic.

Method    | p = 100: NMSE | p = 100: NLPD | p = 300: NMSE | p = 300: NLPD | p = 500: NMSE | p = 500: NLPD
Williams† | 0.235±0.014 | -0.606±0.018 | 0.093±0.005 | -1.060±0.016 | 0.060±0.001 | -1.304±0.008
Fine†     | 0.227±0.012 | -0.508±0.008 | 0.100±0.006 | -0.910±0.010 | 0.064±0.003 | -1.150±0.011
Nair      | 0.208±0.015 | -0.424±0.027 | 0.080±0.003 | -0.805±0.022 | 0.050±0.001 | -1.042±0.016
Seeger    | 0.302±0.029 | -0.282±0.056 | 0.130±0.020 | -0.575±0.103 | 0.068±0.006 | -0.820±0.099
Keerthi   | 0.139±0.005 | -0.731±0.007 | 0.060±0.002 | -1.143±0.005 | 0.041±0.001 | -1.366±0.006
Sun       | 0.127±0.004 | -0.751±0.005 | 0.057±0.001 | -1.173±0.006 | 0.039±0.001 | -1.400±0.007
Ours      | 0.088±0.003 | -0.767±0.004 | 0.042±0.001 | -1.060±0.004 | 0.029±0.001 | -1.223±0.006

The KIN40K dataset is the largest in our experiments. It is a variant of the kin family of datasets from the DELVE archive, comprising 40,000 examples with 8 inputs. As the author of this dataset states⁷, KIN40K was generated with maximum nonlinearity and little noise, giving a very difficult regression task. We randomly selected 10,000 examples for training and kept the remaining 30,000 examples as test cases. The results on 10 random partitions, reported in Table 1.4, show that the last three methods have a general advantage, under either NMSE or NLPD, over the other four approaches. Our method always achieves the best result on NMSE but is slightly worse than the best on NLPD. Note that the Seeger method is even worse than the random-selection (Williams) method, which has already been observed in other work [15].

According to the results above, the four methods Nair, Keerthi, Sun and Ours often produce better generalisation performance in terms of test MSE (or NMSE). We now further compare these representative approaches in terms of scaling behaviour on a set of datasets generated from the KIN40K data. Figure 1.1 shows the computational time of the four methods for varying training dataset sizes; the maximal number of selected basis vectors is fixed at p = 500. As expected, all of them scale linearly in the number of training examples. Nair is the fastest of the four methods, since it only requires O(1) time to score one basis vector at each selection step, and similarly for the Williams, Fine and Seeger approaches, although we did not plot them in the figure. In contrast to Nair's O(1) cost, the other three leading algorithms, Keerthi, Sun and Ours, need O(n) time to evaluate their corresponding criteria for one candidate. Furthermore, compared with Keerthi and Sun, our gradient-based search needs extra time to evaluate the gradient information, which is responsible for the time gap between Ours and Keerthi shown in Figure 1.1.

E. Discussion


[Figure 1.1: training time (log scale, 10^1 to 10^3) versus training set size, from 1000 to 10,000 examples.]

Figure 1.1. Comparison of the training time required by the four leading approaches as a function of the size of the training dataset. The maximal number of selected basis vectors is fixed at p = 500. From bottom to top, the curves are Nair (squares), Sun (circles), Keerthi (pentagrams) and Ours (diamonds).

To our knowledge, this is the first formal comparison of all the kinds of basis vector selection algorithms that have appeared in the literature. Based on our experimental studies, we can draw the following general empirical summary. The supervised basis selection methods are clearly better than the unsupervised methods on almost all four datasets. Between Nair and Seeger, the two supervised basis selection methods with very small selection cost, Nair appears superior to Seeger on test MSE (or NMSE). The last three approaches, Keerthi, Sun and Ours, which are all based on optimising the original GPR objective (1.13), produce more stable results than the other sparse GPR methods on all datasets considered. On the large dataset, the Keerthi method appears inferior to the Sun method. Finally, the construction-based forward algorithm proposed in this chapter is more attractive than all of the selection-based forward algorithms, in terms of both test NMSE and NLPD, when generalisation performance is the major concern.

6. Conclusions

Basis vector selection is very important in building a sparse GPR model, and a number of selection schemes based on various criteria have been proposed. In this chapter, we did not follow the previous idea of selecting basis vectors from the training examples. Instead, we borrowed an idea from gradient boosting and proposed to construct basis vectors one by one through gradient-based optimisation. The proposed method is quite simple to implement, and excellent results on a range of datasets have been obtained. In the near future, we will analyse why the presented algorithm was not the best in some of the cases reported in this chapter and evaluate it on more, and larger, problems. Another important extension is to apply this idea to classification problems [37, 6].

Appendix

A. Gradients of k_p, q_p and q*_p

Using the squared-exponential kernel (1.2), the gradients of k_p, q_p and q*_p with respect to x̃_p(l) are

k̇_p = ∂k_p(x̃_p)/∂x̃_p(l) = θ_l k_p .* [X(:, l) − x̃_p(l) 1_n],

q̇_p = ∂q_p(x̃_p)/∂x̃_p(l) = θ_l q_p .* [X̃_{p−1}(:, l) − x̃_p(l) 1_{p−1}],

q̇*_p = ∂q*_p(x̃_p)/∂x̃_p(l) = 0,

where X = [x_1 ... x_n]^⊤ ∈ R^{n×m} is the input matrix, X̃_{p−1} = [x̃_1 ... x̃_{p−1}]^⊤ ∈ R^{(p−1)×m} is the basis vector matrix, the notation '.*' denotes entry-by-entry multiplication, X(:, l) denotes the l-th column of X (and similarly for X̃_{p−1}(:, l)), and 1_n denotes the all-one vector in R^n.

B. Inclusion of the constructed basis vector x̃_p

In order to make a prediction for a new test case, we need α_p, Q_p^{−1} and Σ_p, as can be seen from (1.19) and (1.20). Moreover, according to (1.34), our forward procedure for constructing basis vectors also requires μ_p and r_p. Since directly computing Q_p^{−1} and Σ_p may lead to numerical instability [12], we resort to Cholesky decompositions. Let L_p be the Cholesky factor L_p L_p^⊤ = Q_p, let G_p = K_p L_p^{−⊤}, and let M_p be the factor of the further Cholesky decomposition M_p M_p^⊤ = (G_p^⊤ G_p + σ^2 Id_p). We then have

Q_p^{−1} = (L_p L_p^⊤)^{−1},

Σ_p = (K_p^⊤ K_p + σ^2 Q_p)^{−1} = (L_p M_p M_p^⊤ L_p^⊤)^{−1},

and further

α_p = Σ_p K_p^⊤ y = L_p^{−⊤} (M_p M_p^⊤)^{−1} G_p^⊤ y,

μ_p = K_p α_p,  r_p = y − μ_p.

Thus the quantities L_p, M_p, G_p, α_p and μ_p need to be updated recursively. The steps involved can be summarised as follows:

k_p = [K(x_1, x̃_p), ..., K(x_n, x̃_p)]^⊤,

q_p = [K(x̃_1, x̃_p), ..., K(x̃_{p−1}, x̃_p)]^⊤,  q*_p = K(x̃_p, x̃_p),

l_p = L_{p−1}^{−1} q_p,  l*_p = √(q*_p − l_p^⊤ l_p),

g_p = (k_p − G_{p−1} l_p) / l*_p,

m_p = M_{p−1}^{−1} (G_{p−1}^⊤ g_p),  η = M_{p−1}^{−⊤} m_p,

d_p = g_p − G_{p−1} η,

b = d_p^⊤ y,  c = d_p^⊤ g_p,

m*_p = √(σ^2 + c),  a = b / (l*_p (σ^2 + c)),

α_p = [α_{p−1} − a L_{p−1}^{−⊤} (l_p + l*_p η); a],

μ_p = μ_{p−1} + b d_p / (σ^2 + c),

and finally

L_p = [L_{p−1}, 0; l_p^⊤, l*_p],  M_p = [M_{p−1}, 0; m_p^⊤, m*_p],  G_p = [G_{p−1}, g_p].

Since the matrices L_p and M_p are lower triangular, the product of their inverses with a vector can be computed very efficiently.
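A direct transcription of these update steps into NumPy might look as follows (a sketch of ours; SciPy's triangular solver provides the L^{−1} and M^{−1} products, and numerical safeguards are omitted). For p = 1 the state starts from empty matrices: L and M of shape (0, 0), G of shape (n, 0), alpha of length 0 and mu = 0.

```python
import numpy as np
from scipy.linalg import solve_triangular

def include_basis_vector(kp, qp, qstar, y, sigma2, state):
    """One inclusion step of Appendix B. `state` holds L, M, G, alpha, mu for p-1 basis vectors."""
    L, M, G, alpha, mu = state["L"], state["M"], state["G"], state["alpha"], state["mu"]
    p_prev = L.shape[0]
    if p_prev == 0:
        lp = np.zeros(0)
        lstar = np.sqrt(qstar)
        gp = kp / lstar
        mp, eta = np.zeros(0), np.zeros(0)
        dp = gp
    else:
        lp = solve_triangular(L, qp, lower=True)            # l_p = L_{p-1}^{-1} q_p
        lstar = np.sqrt(qstar - lp @ lp)
        gp = (kp - G @ lp) / lstar
        mp = solve_triangular(M, G.T @ gp, lower=True)
        eta = solve_triangular(M.T, mp, lower=False)         # eta = M_{p-1}^{-T} m_p
        dp = gp - G @ eta
    b, c = dp @ y, dp @ gp
    mstar = np.sqrt(sigma2 + c)
    a = b / (lstar * (sigma2 + c))
    if p_prev == 0:
        alpha_new = np.array([a])
    else:
        corr = solve_triangular(L.T, lp + lstar * eta, lower=False)
        alpha_new = np.concatenate([alpha - a * corr, [a]])
    mu_new = mu + b * dp / (sigma2 + c)
    # grow the triangular factors and G by one row/column
    L_new = np.block([[L, np.zeros((p_prev, 1))], [lp[None, :], np.array([[lstar]])]])
    M_new = np.block([[M, np.zeros((p_prev, 1))], [mp[None, :], np.array([[mstar]])]])
    G_new = np.hstack([G, gp[:, None]])
    return {"L": L_new, "M": M_new, "G": G_new, "alpha": alpha_new, "mu": mu_new}
```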

Notes

1. Since each training case is responsible for a column of the full kernel matrix K, we sometimes also refer to the corresponding columns of K as basis vectors.

2. The Matlab source code can be accessed via http://cmm.ensmp.fr/~bach/csi/index.html.

3. Since the first dataset includes only 506 examples, we randomly pick 400 points for model selection.

4. It is available at http://www.ncrg.aston.ac.uk/netlab/index.php.

5. The Boston Housing data can be found in StatLib, available at http://lib.stat.cmu.edu/datasets/boston; Kin-32nm and its full description can be accessed at http://www.cs.toronto.edu/~delve/data/datasets.html; the LogP data can be requested from Dr Peter Tino ([email protected]); the KIN40K dataset is available at http://ida.first.fraunhofer.de/~anton/data.html.

6. Validation data is not necessary in our case since we employ the evidence framework in NETLAB to select hyperparameters.

7. See http://ida.first.fraunhofer.de/~anton/data.html.

References

[1] F. R. Bach and M. I. Jordan. Predictive low-rank decomposition for kernel methods. In Proceedings of the 22nd International Conference on Machine Learning (ICML 2005), pages 33–40, 2005.

[2] G. Baudat and F. Anouar. Kernel-based methods and function approximation. In Proceedings of the 2001 International Joint Conference on Neural Networks (IJCNN 2001), pages 1244–1249, 2001.

[3] A. Beygelzimer, S. M. Kakade, and J. Langford. Cover trees for nearest neighbor. Submitted, 2005.

[4] R. H. Byrd, P. Lu, and J. Nocedal. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific and Statistical Computing, 16(5):1190–1208, 1995.

[5] J. Quinonero Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1935–1959, 2005.

[6] O. Chapelle. Training a support vector machine in the primal. Journal of Machine Learning Research, 2006. Submitted.

[7] L. Csato and M. Opper. Sparse on-line Gaussian processes. Neural Computation, 14(3):641–668, 2002.

[8] Y. Engel, S. Mannor, and R. Meir. The kernel recursive least-squares algorithm. IEEE Transactions on Signal Processing, 52(8):2275–2285, 2004.

[9] S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243–264, 2002.

[10] J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5):1189–1232, 2001.

[11] G. Fung and O. L. Mangasarian. Proximal support vector machine classifiers. In KDD-2001: Knowledge Discovery and Data Mining, pages 77–86, San Francisco, CA, 2001.

[12] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, 1996.

[13] A. G. Gray. Fast kernel matrix-vector multiplication with application to Gaussian process learning. Technical report, School of Computer Science, Carnegie Mellon University, 2004.

[14] A. G. Gray and A. W. Moore. 'N-body' problems in statistical learning. In Advances in Neural Information Processing Systems 13, pages 521–527. MIT Press, 2000.

[15] S. S. Keerthi and W. Chu. A matching pursuit approach to sparse Gaussian process regression. In Advances in Neural Information Processing Systems 18. MIT Press, 2006.

[16] D. J. C. MacKay. Introduction to Gaussian processes. In C. M. Bishop, editor, Neural Networks and Machine Learning, pages 133–165. Springer, Berlin, 1998.

[17] R. Meir and G. Rätsch. An introduction to boosting and leveraging. In Advanced Lectures on Machine Learning (LNAI 2600), pages 118–183, 2003.

[18] S. Mika, A. J. Smola, and B. Schölkopf. An improved training algorithm for kernel Fisher discriminants. In Eighth International Workshop on Artificial Intelligence and Statistics, pages 98–104, Key West, Florida, 2001.

[19] P. B. Nair, A. Choudhury, and A. J. Keane. Some greedy learning algorithms for sparse regression and classification with Mercer kernels. Journal of Machine Learning Research, 3:781–801, 2002.

[20] B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 25(2):227–234, 1995.

[21] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.

[22] V. C. Raykar, C. Yang, R. Duraiswami, and N. Gumerov. Fast computation of sums of Gaussians in high dimensions. Technical report, UM Computer Science Department, 2005.

[23] R. Rifkin. Everything Old Is New Again: A Fresh Look at Historical Approaches in Machine Learning. PhD thesis, MIT, Cambridge, MA, 2002.

[24] C. Saunders, A. Gammerman, and V. Vovk. Ridge regression learning algorithm in dual variables. In Proceedings of the 15th International Conference on Machine Learning (ICML 1998), pages 515–521, 1998.

[25] R. E. Schapire. A brief introduction to boosting. In T. Dean, editor, Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, pages 1401–1406, San Francisco, CA, 1999. Morgan Kaufmann Publishers.

[26] M. Seeger, C. K. I. Williams, and N. D. Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In Ninth International Workshop on Artificial Intelligence and Statistics, Key West, Florida, 2003.

[27] Y. Shen, A. Ng, and M. Seeger. Fast Gaussian process regression using KD-trees. In Advances in Neural Information Processing Systems 18. MIT Press, 2006.

[28] A. J. Smola and P. Bartlett. Sparse greedy Gaussian process regression. In Advances in Neural Information Processing Systems 14, pages 619–625. MIT Press, 2001.

[29] A. J. Smola and B. Schölkopf. Sparse greedy matrix approximation for machine learning. In Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pages 911–918, 2000.

[30] P. Sun and X. Yao. Greedy forward selection algorithms to sparse Gaussian process regression. In Proceedings of the 2006 International Joint Conference on Neural Networks (IJCNN 2006), 2006. To appear.

[31] J. A. K. Suykens and J. Vandewalle. Least squares support vector machine classifiers. Neural Processing Letters, 9(3):293–300, 1999.

[32] P. Tino, I. Nabney, B. S. Williams, J. Losel, and Y. Sun. Nonlinear prediction of quantitative structure-activity relationships. Journal of Chemical Information and Computer Sciences, 44(5):1647–1653, 2004.

[33] P. Vincent and Y. Bengio. Kernel matching pursuit. Machine Learning, 48(1-3):165–187, 2002.

[34] C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 14, pages 682–688. MIT Press, 2001.

[35] C. Yang, R. Duraiswami, and L. Davis. Efficient kernel machines using the improved fast Gauss transform. In Advances in Neural Information Processing Systems 17, pages 1561–1568. MIT Press, 2005.

[36] T. Zhang. Approximation bounds for some sparse kernel regression algorithms. Neural Computation, 14:3013–3042, 2002.

[37] J. Zhu and T. Hastie. Kernel logistic regression and the import vector machine. Journal of Computational & Graphical Statistics, 14(1):185–205, 2005.