
Thoughts on Massively Scalable Gaussian Processes

Andrew Gordon Wilson
Carnegie Mellon University
[email protected]

Christoph Dann
Carnegie Mellon University
[email protected]

Hannes Nickisch
Philips Research Hamburg
[email protected]

Abstract

We introduce a framework and early results for massively scalable Gaussian processes (MSGP), significantly extending the KISS-GP approach of Wilson and Nickisch (2015). The MSGP framework enables the use of Gaussian processes (GPs) on billions of datapoints, without requiring distributed inference, or severe assumptions. In particular, MSGP reduces the standard O(n^3) complexity of GP learning and inference to O(n), and the standard O(n^2) complexity per test point prediction to O(1). MSGP involves 1) decomposing covariance matrices as Kronecker products of Toeplitz matrices approximated by circulant matrices. This multi-level circulant approximation allows one to unify the orthogonal computational benefits of fast Kronecker and Toeplitz approaches, and is significantly faster than either approach in isolation; 2) local kernel interpolation and inducing points to allow for arbitrarily located data inputs, and O(1) test time predictions; 3) exploiting block-Toeplitz Toeplitz-block structure (BTTB), which enables fast inference and learning when multidimensional Kronecker structure is not present; and 4) projections of the input space to flexibly model correlated inputs and high dimensional data. The ability to handle many (m ≈ n) inducing points allows for near-exact accuracy and large scale kernel learning.

1 Introduction

Every minute of the day, users share hundreds of thousands of pictures, videos, tweets, reviews, and blog posts. More than ever before, we have access to massive datasets in almost every area of science and engineering, including genomics, robotics, and climate science. This wealth of information provides an unprecedented opportunity to automatically learn rich representations of data, which allows us to greatly improve performance in predictive tasks, but also provides a mechanism for scientific discovery.

Expressive non-parametric methods, such as Gaussian processes (GPs) (Rasmussen and Williams, 2006), have great potential for large-scale structure discovery; indeed, these methods can be highly flexible, and have an information capacity that grows with the amount of available data. However, large data problems are mostly uncharted territory for GPs, which can only be applied to at most a few thousand training points n, due to the O(n^3) computations and O(n^2) storage required for inference and learning.

Even more scalable approximate GP approaches, such as inducing point methods (Quinonero-Candela and Rasmussen, 2005a), typically require O(m^2 n + m^3) computations and O(m^2 + mn) storage, for m inducing points, and are hard to apply to massive datasets containing n > 10^5 examples. Moreover, for computational tractability, these approaches require m ≪ n, which can severely affect predictive performance, limit representational power, and hinder kernel learning, which is most needed on large datasets (Wilson, 2014). New directions for scalable Gaussian processes have involved mini-batches of data through stochastic variational inference (Hensman et al., 2013) and distributed learning (Deisenroth and Ng, 2015). While these approaches are promising, inference can undergo severe approximations, and a small number of inducing points are still required. Indeed, stochastic variational approaches scale as O(m^3).

In this paper, we introduce a new framework for massively scalable Gaussian processes (MSGP), which provides near-exact O(n) inference and learning and O(1) test time predictions, and does not require distributed learning or severe assumptions. Our approach builds on the recently introduced KISS-GP framework (Wilson and Nickisch, 2015), with several significant advances which enable its use on massive datasets. In particular, we provide:

• Near-exact O(1) mean and variance predictions. By contrast, standard GPs and KISS-GP cost O(n) for the predictive mean and O(n^2) for the predictive variance per test point. Moreover, inducing point and finite basis expansions (e.g., Quinonero-Candela and Rasmussen, 2005a; Lazaro-Gredilla et al., 2010; Yang et al., 2015) cost O(m) and O(m^2) per test point.

• Circulant approximations which (i) integrate Kronecker and Toeplitz structure, (ii) enable extremely fast and accurate log determinant evaluations for kernel learning, and (iii) increase the speed of Toeplitz methods on problems with 1D predictors.

• The ability to exploit more general block-Toeplitz-Toeplitz-block (BTTB) structure, which enables fast and exact inference and learning in cases where multidimensional Kronecker structure is not present.

• Projections which help alleviate the limitation of Kronecker methods to low-dimensional input spaces.

• Code will be available as part of the GPML package (Rasmussen and Nickisch, 2010),with demonstrations at http://www.cs.cmu.edu/~andrewgw/pattern.

We begin by briefly reviewing Gaussian processes, structure exploiting inference, and KISS-GP, in sections 2–4. We then introduce our MSGP approach in section 5. We demonstrate the scalability and accuracy of MSGP in the experiments of section 6. We conclude in section 7.


2 Gaussian Processes

We briefly review Gaussian processes (GPs), and the computational requirements for predictions and kernel learning. Rasmussen and Williams (2006) contains a full treatment of GPs.

We assume a dataset D of n input (predictor) vectors X = [x_1, . . . , x_n], each of dimension D, corresponding to an n × 1 vector of targets y = [y(x_1), . . . , y(x_n)]^⊤. If f(x) ∼ GP(µ, k_θ), then any collection of function values f has a joint Gaussian distribution,

f = f(X) = [f(x_1), . . . , f(x_n)]^⊤ ∼ N(µ_X, K_{X,X}),   (1)

with mean vector and covariance matrix defined by the mean function and covariance function of the Gaussian process: (µ_X)_i = µ(x_i), and (K_{X,X})_{ij} = k_θ(x_i, x_j). The covariance function k_θ is parametrized by θ. Assuming additive Gaussian noise, y(x)|f(x) ∼ N(y(x); f(x), σ^2), the predictive distribution of the GP evaluated at the n_* test points indexed by X_* is given by

f_*|X_*, X, y, θ, σ^2 ∼ N(E[f_*], cov(f_*)),   (2)

E[f_*] = µ_{X_*} + K_{X_*,X}[K_{X,X} + σ^2 I]^{-1} y,

cov(f_*) = K_{X_*,X_*} − K_{X_*,X}[K_{X,X} + σ^2 I]^{-1} K_{X,X_*}.

K_{X_*,X} represents the n_* × n matrix of covariances between the GP evaluated at X_* and X, and all other covariance matrices follow the same notational conventions. µ_{X_*} is the n_* × 1 mean vector, and K_{X,X} is the n × n covariance matrix evaluated at the training inputs X. All covariance matrices implicitly depend on the kernel hyperparameters θ.

The marginal likelihood of the targets y is given by

log p(y|θ, X) ∝ −(1/2)[y^⊤(K_θ + σ^2 I)^{-1} y + log|K_θ + σ^2 I|],   (3)

where we have used K_θ as shorthand for K_{X,X} given θ. Kernel learning is performed by optimizing Eq. (3) with respect to θ.

The computational bottleneck for inference is solving the linear system (K_{X,X} + σ^2 I)^{-1} y, and for kernel learning it is computing the log determinant log|K_{X,X} + σ^2 I|. The standard procedure is to compute the Cholesky decomposition of the n × n matrix K_{X,X}, which requires O(n^3) operations and O(n^2) storage. Afterwards, the predictive mean and variance of the GP cost respectively O(n) and O(n^2) per test point x_*.

3 Structure Exploiting Inference

Structure exploiting approaches make use of existing structure in K_{X,X} to accelerate inference and learning. These approaches benefit from often exact predictive accuracy and impressive scalability, but are inapplicable to most problems due to severe grid restrictions on the data inputs X. We briefly review Kronecker and Toeplitz structure.


3.1 Kronecker Structure

Kronecker (tensor product) structure arises when we have multidimensional inputs, i.e. P > 1, on a rectilinear grid, x ∈ X_1 × · · · × X_P, and a product kernel across dimensions, k(x_i, x_j) = ∏_{p=1}^{P} k(x_i^{(p)}, x_j^{(p)}). In this case, K = K_1 ⊗ · · · ⊗ K_P. One can then compute the eigendecomposition of K = QVQ^⊤ by separately taking the eigendecompositions of the much smaller K_1, . . . , K_P. Inference and learning then proceed via (K + σ^2 I)^{-1} y = (QVQ^⊤ + σ^2 I)^{-1} y = Q(V + σ^2 I)^{-1} Q^⊤ y, and log|K + σ^2 I| = Σ_i log(V_{ii} + σ^2), where Q is an orthogonal matrix of eigenvectors, which also decomposes as a Kronecker product (allowing for fast MVMs), and V is a diagonal matrix of eigenvalues and thus simple to invert. Overall, for m grid data points and P grid dimensions, inference and learning cost O(P m^{1+1/P}) operations (for P > 1) and O(P m^{2/P}) storage (Saatchi, 2011; Wilson et al., 2014). Unfortunately, there is no efficiency gain for 1D inputs (e.g., time series).
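To make the Kronecker algebra above concrete, the following sketch (our own illustrative code, not the GPML implementation; all function names are ours) solves (K + σ^2 I)α = y and computes log|K + σ^2 I| from the eigendecompositions of the small factors K_1, . . . , K_P, without ever forming the full matrix.

```python
import numpy as np

def kron_mvm(factors, x):
    """Multiply (K_1 kron ... kron K_P) by a vector x without forming the full matrix."""
    sizes = [A.shape[0] for A in factors]
    X = x.reshape(sizes)
    for axis, A in enumerate(factors):
        X = np.tensordot(A, X, axes=([1], [axis]))   # apply factor along one grid dimension
        X = np.moveaxis(X, 0, axis)
    return X.reshape(-1)

def kron_solve_and_logdet(factors, y, sigma2):
    """Solve (K + sigma^2 I) alpha = y and compute log|K + sigma^2 I| for
    K = K_1 kron ... kron K_P via per-factor eigendecompositions."""
    eigvals, eigvecs = zip(*(np.linalg.eigh(A) for A in factors))
    V = eigvals[0]
    for e in eigvals[1:]:
        V = np.kron(V, e)                            # eigenvalues of the Kronecker product
    Qty = kron_mvm([Q.T for Q in eigvecs], y)        # Q^T y
    alpha = kron_mvm(list(eigvecs), Qty / (V + sigma2))  # Q (V + sigma^2 I)^{-1} Q^T y
    logdet = np.sum(np.log(V + sigma2))
    return alpha, logdet
```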

3.2 Toeplitz Structure

A covariance matrix constructed from a stationary kernel k(x, z) = k(x − z) on a 1D regularly spaced grid has Toeplitz structure. Toeplitz matrices T have constant diagonals, T_{i,j} = T_{i+1,j+1}. Toeplitz structure has been exploited for GP inference (e.g., Zhang et al., 2005; Cunningham et al., 2008) in O(n log n) computations. Computing log|T|, and the predictive variance for a single test point, requires O(n^2) operations (although finite support can be exploited (Storkey, 1999)), thus limiting Toeplitz methods to about n < 10,000 points when kernel learning is required. Since Toeplitz methods are limited to problems with 1D inputs (e.g., time series), they complement Kronecker methods, which exploit multidimensional grid structure.

4 KISS-GP

Recently, Wilson and Nickisch (2015) introduced a fast Gaussian process method called KISS-GP, which performs local kernel interpolation, in combination with inducing point approximations (Quinonero-Candela and Rasmussen, 2005b) and structure exploiting algebra (e.g., Saatchi, 2011; Wilson, 2014).

Given a set of m inducing points U = [u_i]_{i=1...m}, Wilson and Nickisch (2015) propose to approximate the n × m matrix K_{X,U} of cross-covariances between the training inputs X and inducing inputs U as K_{X,U} ≈ W_X K_{U,U}, where W_X is an n × m matrix of interpolation weights. One can then approximate K_{X,Z} for any points Z as K_{X,Z} ≈ K_{X,U} W_Z^⊤. Given a user-specified kernel k, this structured kernel interpolation (SKI) procedure (Wilson and Nickisch, 2015) gives rise to the fast approximate kernel

k_SKI(x, z) = w_x K_{U,U} w_z^⊤,   (4)

for any single inputs x and z. The n × n training covariance matrix K_{X,X} thus has the approximation

K_{X,X} ≈ W_X K_{U,U} W_X^⊤ = K_SKI =: K̃_{X,X}.   (5)

Wilson and Nickisch (2015) propose to perform local kernel interpolation, in a method called KISS-GP, which yields extremely sparse interpolation matrices. For example, if we are performing local cubic (Keys, 1981) interpolation for d-dimensional input data, W_X and W_Z contain only 4^d non-zero entries per row.

Furthermore, Wilson and Nickisch (2015) show that classical inducing point methods can be re-derived within their SKI framework as global interpolation with a noise free GP and non-sparse interpolation weights. For example, the subset of regression (SoR) inducing point method effectively uses the kernel k_SoR(x, z) = K_{x,U} K_{U,U}^{-1} K_{U,z} (Quinonero-Candela and Rasmussen, 2005b), and thus has interpolation weights w_SoR(x) = K_{x,U} K_{U,U}^{-1} within the SKI framework.

GP inference and learning can be performed in O(n) using KISS-GP, a significant advance over the more standard O(m^2 n) scaling of fast GP methods (Quinonero-Candela and Rasmussen, 2005a; Lazaro-Gredilla et al., 2010). Moreover, Wilson and Nickisch (2015) show how – when performing local kernel interpolation – one can achieve close to linear scaling with the number of inducing points m by placing these points U on a rectilinear grid, and then exploiting Toeplitz or Kronecker structure in K_{U,U} (see, e.g., Wilson, 2014), without requiring that the data inputs X are on a grid. Such scaling with m compares favourably to the O(m^3) operations for stochastic variational approaches (Hensman et al., 2013). Allowing for large m enables near-exact performance, and large scale kernel learning.

In particular, for inference we can solve (K_SKI + σ^2 I)^{-1} y by performing linear conjugate gradients (LCG), an iterative procedure which depends only on matrix vector multiplications (MVMs) with (K_SKI + σ^2 I). Only j ≪ n iterations are required for convergence up to machine precision, and the value of j in practice depends on the conditioning of K_SKI rather than n. MVMs with sparse W (corresponding to local interpolation) cost O(n), and MVMs exploiting structure in K_{U,U} are roughly linear in m. Moreover, we can efficiently approximate the eigenvalues of K_SKI to evaluate log|K_SKI + σ^2 I| for kernel learning, by using fast structure exploiting eigendecompositions of K_{U,U}. Further details are in Wilson and Nickisch (2015).
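As a concrete illustration of SKI-based inference, the following sketch (our own code with illustrative names; it uses simple linear rather than cubic interpolation, and forms K_{U,U} densely instead of exploiting its Toeplitz structure) builds the sparse interpolation matrix W on a 1D inducing grid and solves (K_SKI + σ^2 I)^{-1} y with conjugate gradients using only MVMs.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import LinearOperator, cg

def linear_interp_weights(x, grid):
    """Sparse n x m matrix W of linear interpolation weights onto a regular 1D grid.
    (The paper uses local cubic interpolation; linear keeps the sketch short.)"""
    h = grid[1] - grid[0]
    idx = np.clip(np.floor((x - grid[0]) / h).astype(int), 0, len(grid) - 2)
    frac = (x - grid[idx]) / h
    rows = np.repeat(np.arange(len(x)), 2)
    cols = np.stack([idx, idx + 1], axis=1).ravel()
    vals = np.stack([1 - frac, frac], axis=1).ravel()
    return csr_matrix((vals, (rows, cols)), shape=(len(x), len(grid)))

def rbf(a, b, ell=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

# Toy data (no grid structure) and a regularly spaced inducing grid
rng = np.random.default_rng(0)
x = rng.uniform(-10, 10, size=2000)
y = np.sin(x) + 0.1 * rng.standard_normal(x.shape)
U = np.linspace(-12, 13, 500)

W = linear_interp_weights(x, U)
Kuu = rbf(U, U)          # in MSGP this matrix is Toeplitz, and its MVMs use FFTs
sigma2 = 0.01

# MVM with K_SKI + sigma^2 I = W Kuu W^T + sigma^2 I, used by conjugate gradients
mvm = lambda v: W @ (Kuu @ (W.T @ v)) + sigma2 * v
A = LinearOperator((len(x), len(x)), matvec=mvm)
alpha, info = cg(A, y)   # alpha = (K_SKI + sigma^2 I)^{-1} y
```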

5 Massively Scalable Gaussian Processes

We introduce massively scalable Gaussian processes (MSGP), which significantly extend KISS-GP, inducing point, and structure exploiting approaches, for: (1) O(1) test predictions (section 5.1); (2) circulant log determinant approximations which (i) unify Toeplitz and Kronecker structure, (ii) enable extremely fast marginal likelihood evaluations, and (iii) extend KISS-GP and Toeplitz methods for scalable kernel learning in D = 1 input dimensions, where one cannot exploit multidimensional Kronecker structure for scalability (section 5.2); (3) more general BTTB structure, which enables fast exact multidimensional inference without requiring Kronecker (tensor) decompositions (section 5.3); and (4) projections which enable KISS-GP to be used with structure exploiting approaches for D ≫ 5 input dimensions, and increase the expressive power of covariance functions (section 5.4).

5.1 Fast Test Predictions

While Wilson and Nickisch (2015) propose fast O(n) inference and learning, test time predictions are the same as for a standard GP – namely, O(n) for the predictive mean and O(n^2) for the predictive variance per single test point x_*. Here we show how to obtain O(1) test time predictions by efficiently approximating the latent mean and variance of f_*. For a Gaussian likelihood, the predictive distribution for y_* is given by the relations E[y_*] = E[f_*] and cov(y_*) = cov(f_*) + σ^2 I.

We note that the methodology here for fast test predictions does not rely on having performed inference and learning in any particular way: it can be applied to any trained Gaussian process model, including a full GP, inducing point methods such as FITC (Snelson and Ghahramani, 2006) or the Big Data GP (Hensman et al., 2013), or finite basis models (e.g., Yang et al., 2015; Lazaro-Gredilla et al., 2010; Rahimi and Recht, 2007; Le et al., 2013; Williams and Seeger, 2001).

5.1.1 Predictive Mean

Using structured kernel interpolation on K_{X,X}, we approximate the predictive mean E[f_*] of Eq. (2) for a set of n_* test inputs X_* as E[f_*] ≈ µ_{X_*} + K_{X_*,X} α, where α = [K̃_{X,X} + σ^2 I]^{-1} y is computed as part of training using linear conjugate gradients (LCG). We propose to successively apply structured kernel interpolation on K_{X_*,X}, giving

E[f_*] ≈ Ẽ[f_*] = µ_{X_*} + K̃_{X_*,X} α   (6)
               = µ_{X_*} + W_* K_{U,U} W^⊤ α,   (7)

where W_* and W are respectively n_* × m and n × m sparse interpolation matrices, containing only 4 non-zero entries per row if performing local cubic interpolation (which we henceforth assume). The term K_{U,U} W^⊤ α is pre-computed during training, taking only two MVMs in addition to the LCG computation required to obtain α.¹ Thus the only computation at test time is multiplication with the sparse W_*, which costs O(n_*) operations, leading to O(1) operations per test point x_*.
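Continuing the illustrative sketch from section 4 (reusing its W, Kuu, inducing grid U, interpolation helper and LCG solution alpha, all our own names; the prior mean is taken to be zero), the precomputation and the O(1) per-point mean prediction look as follows.

```python
# Precompute once after training: K_{U,U} W^T alpha (two MVMs with structured matrices)
mean_on_grid = Kuu @ (W.T @ alpha)

# At test time only a sparse interpolation row per point is needed: O(1) per test point
x_star = np.linspace(-9.5, 9.5, 5)
W_star = linear_interp_weights(x_star, U)
mean_star = W_star @ mean_on_grid     # approximates E[f_*] (zero prior mean assumed)
```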

5.1.2 Predictive Variance

Practical applications typically do not require the full predictive covariance matrix cov(f_*) of Eq. (2) for a set of n_* test inputs X_*, but rather focus on the predictive variance,

v_* = diag[cov(f_*)] = diag(K_{X_*,X_*}) − ν_*,   (8)

¹We exploit the structure of K_{U,U} for extremely efficient MVMs, which we will discuss in detail in section 5.2.


where ν_* = diag(K_{X_*,X}[K_{X,X} + σ^2 I]^{-1} K_{X,X_*}), the explained variance, is approximated by local interpolation from the explained variance on the grid U,

ν_* ≈ W_* ν_U,   ν_U = diag(K_{U,X} A^{-1} K_{X,U}),   (9)

using the interpolated covariance A = K̃_{X,X} + σ^2 I. Similar to the predictive mean – once ν_U is precomputed – we only require a multiplication with the sparse interpolation weight matrix W_*, leading to O(1) operations per test point x_*.

Every [ν_U]_i requires the solution of a linear system of size n, which is computationally taxing. To efficiently precompute ν_U, we instead employ a stochastic estimator ν̃_U (Papandreou and Yuille, 2011), based on the observation that ν_U is the variance of the projection K_{U,X} r of the Gaussian random variable r ∼ N(0, A^{-1}). We draw n_s Gaussian samples g_i^m ∼ N(0, I), g_i^n ∼ N(0, I) and solve A r_i = W V √E V^⊤ g_i^m + σ g_i^n with LCG, where K_{U,U} = V E V^⊤ is the eigendecomposition of the covariance evaluated on the grid, which can be computed efficiently by exploiting Kronecker and Toeplitz structure (sections 3 and 5.2).

The overall (unbiased) estimate (Papandreou and Yuille, 2011) is obtained by clipping,

v_* ≈ ṽ_* = max[0, k_* − W_* Σ_{i=1}^{n_s} (K_{U,X} r_i)^2],   (10)

where the square is taken element-wise. Papandreou and Yuille (2011) suggest using n_s = 20 Gaussian samples r_i, which corresponds to a relative error ||ν̃_U − ν_U|| / ||ν_U|| of 0.36.

5.2 Circulant Approximation

Kronecker and Toeplitz methods (section 3) are greatly restricted by requiring that the data inputs X are located on a grid. We lift this restriction by creating structure in K_{U,U}, with the unobserved inducing variables U, as part of the structured kernel interpolation framework described in section 4.

Toeplitz methods apply only to 1D problems, and Kronecker methods require multidimensional structure for efficiency gains. Here we present a circulant approximation to unify the complementary benefits of Kronecker and Toeplitz methods, and to greatly scale marginal likelihood evaluations of Toeplitz based methods, while not requiring any grid structure in the data inputs X.

If U is a regularly spaced multidimensional grid, and we use a stationary product kernel (e.g., the RBF kernel), then K_{U,U} decomposes as a Kronecker product of Toeplitz (section 3.2) matrices:

K_{U,U} = T_1 ⊗ · · · ⊗ T_P.   (11)

Because the algorithms which leverage Kronecker structure in Gaussian processes require eigendecompositions of the constituent matrices, computations involving the structure in Eq. (11) are no faster than if the Kronecker product were over arbitrary positive definite matrices: this nested Toeplitz structure, which is often present in Kronecker decompositions, is wasted. Indeed, while it is possible to efficiently solve linear systems with Toeplitz matrices, there is no particularly efficient way to obtain a full eigendecomposition.

Fast operations with an m × m Toeplitz matrix T can be obtained through its relationship with an a × a circulant matrix. Symmetric circulant matrices C are Toeplitz matrices where the first column c is a circulant vector, c = [c_1, c_2, c_3, . . . , c_3, c_2]^⊤, and each subsequent column is shifted one position relative to the previous. In other words, C_{i,j} = c_{|j−i| mod a}. Circulant matrices are computationally attractive because their eigendecomposition is given by

C = F^{-1} diag(Fc) F,   (12)

where F is the discrete Fourier transform (DFT): F_{jk} = exp(−2jkπi/a). The eigenvalues of C are thus given by the DFT of its first column, and the eigenvectors are proportional to the DFT itself (the a roots of unity). The log determinant of C – the sum of its log eigenvalues – can therefore be computed from a single fast Fourier transform (FFT), which costs O(a log a) operations and O(a) memory. Fast matrix vector products can be computed at the same asymptotic cost through Eq. (12), requiring two FFTs (one FFT if we pre-compute Fc), one inverse FFT, and one element-wise product.
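A small numerical check of these identities (our own sketch): the eigenvalues of a symmetric circulant matrix are the FFT of its first column, which gives both the log determinant and MVMs in O(a log a).

```python
import numpy as np
from scipy.linalg import circulant

a = 8
c = np.array([4.0, 2.0, 1.0, 0.5, 0.25, 0.5, 1.0, 2.0])   # symmetric: c[j] = c[(a - j) % a]
C = circulant(c)                                           # dense circulant for reference

eig = np.fft.fft(c).real              # eigenvalues of C are the DFT of its first column
logdet_fft = np.sum(np.log(eig))
logdet_dense = np.linalg.slogdet(C)[1]

y = np.random.default_rng(1).standard_normal(a)
Cy_fft = np.fft.ifft(eig * np.fft.fft(y)).real             # C y via FFT, product, inverse FFT
print(np.allclose(logdet_fft, logdet_dense), np.allclose(Cy_fft, C @ y))
```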

In a Gaussian process context, fast MVMs with Toeplitz matrices are typically achieved by embedding an m × m Toeplitz matrix K into a (2m − 1) × (2m − 1) circulant matrix C (e.g., Zhang et al., 2005; Cunningham et al., 2008), with first column c = [k_1, k_2, . . . , k_{m−1}, k_m, k_{m−1}, . . . , k_2]. Therefore K = C_{i=1...m, j=1...m}, and using zero padding and truncation, Ky = [C [y^⊤, 0^⊤]^⊤]_{i=1...m}, where the circulant MVM Cy can be computed efficiently through FFTs. GP inference can then be achieved through LCG, solving K^{-1}y in an iterative procedure which only involves MVMs, and has an asymptotic cost of O(m log m) computations. The log determinant and a single predictive variance, however, require O(m^2) computations.
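The embedding trick can be sketched as follows (our own code; for concreteness we use the minimal even-length embedding c = [k_1, ..., k_m, k_{m−1}, ..., k_2], one of several equivalent choices of embedding size).

```python
import numpy as np
from scipy.linalg import toeplitz

def toeplitz_mvm(k, y):
    """Multiply a symmetric Toeplitz matrix (first column k) by y via circulant
    embedding, zero padding and truncation, using FFTs."""
    m = len(k)
    c = np.concatenate([k, k[-2:0:-1]])          # circulant first column, length 2m - 2
    y_pad = np.concatenate([y, np.zeros(m - 2)])
    out = np.fft.ifft(np.fft.fft(c) * np.fft.fft(y_pad)).real
    return out[:m]                               # truncate back to length m

# quick check against a dense Toeplitz matrix
k = np.exp(-0.5 * np.linspace(0, 3, 6) ** 2)     # e.g. an RBF kernel on a regular 1D grid
y = np.random.default_rng(2).standard_normal(6)
assert np.allclose(toeplitz_mvm(k, y), toeplitz(k) @ y)
```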

To speed up LCG for solving linear Toeplitz systems, one can use circulant pre-conditioners which act as approximate inverses. One wishes to minimise the distance between the pre-conditioner C and the Toeplitz matrix K, arg min_C d(C, K). Three classical pre-conditioners include C_Strang = arg min_C ‖C − K‖_1 (Strang, 1986), C_T.Chan = arg min_C ‖C − K‖_F (Chan, 1988), and C_Tyrtyshnikov = arg min_C ‖I − C^{-1}K‖_F (Tyrtyshnikov, 1992).

A distinct line of research was explored more than 60 years ago in the context of statistical inference over spatial processes (Whittle, 1954). The circulant Whittle approximation circ_Whittle(k) is given by truncating the sum

[circ_Whittle(k)]_i = Σ_{j∈Z} k_{i+jm},   or   c(t) = Σ_{j∈Z} k(t + jmΔu),

i.e. we retain Σ_{j=−w}^{w} k_{i+jm} only.

Positive definiteness of C = toep(c) for the complete sum is guaranteed by construction (Guinness and Fuentes, 2014, Section 2). For large lattices, the approach is often used due to its accuracy and favourable asymptotic properties such as consistency, efficiency and asymptotic normality (Lieberman et al., 2009). In fact, the circulant approximation c_i is asymptotically equivalent to the initial covariance k_i, see Gray (2005, Lemma 4.5), hence the logdet approximation inevitably converges to the exact value.

[Figure omitted: relative logdet error versus the number of inducing inputs m, for log|K + σ^2 I| with x ∈ [0, 1], ℓ = 0.20, σ = 10^-3, comparing the T. Chan, Tyrtyshnikov, Strang, Helgason and Whittle (w = 3) approximations on the covSEiso, covMaterniso (d = 1) and covRQiso (α = 0.5) covariances.]

Figure 1: Benchmark of different circulant approximations illustrating consistent good quality of the Whittle embedding. covSE, covMatern, and covRQ are as defined in Rasmussen and Williams (2006).

We use

log|toep(k) + σ^2 I| ≈ 1^⊤ log(Fc + σ^2 1),

where we threshold c = F^H max(F circ(k), 0). Here 1 denotes the vector 1 = [1, 1, . . . , 1]^⊤, F^H is the conjugate transpose of the Fourier transform matrix, and circ(k) denotes one of the T. Chan, Tyrtyshnikov, Strang, Helgason or Whittle approximations.

The particular circulant approximation pursued can have a dramatic practical effect on performance. In an empirical comparison (partly shown in Figure 1), we verified that the Whittle approximation yields consistently accurate results over several covariance functions k(x − z), lengthscales ℓ and noise variances σ^2, with the error decaying with grid size m and falling below 1% relative error for m > 1000.
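The sketch below (our own code, with illustrative parameter choices) implements the truncated Whittle wrap-around sum and the thresholded log determinant formula above for a stationary kernel on a regular 1D grid, and compares against the exact log determinant.

```python
import numpy as np
from scipy.linalg import toeplitz

def whittle_logdet(kernel, m, du, sigma2, w=3):
    """Approximate log|K + sigma^2 I| for K Toeplitz (stationary kernel on a regular
    1D grid with spacing du) via the truncated Whittle circulant approximation."""
    i = np.arange(m)
    # wrap-around sum: c_i = sum_{j=-w..w} k((i + j*m) * du)
    c = sum(kernel((i + j * m) * du) for j in range(-w, w + 1))
    lam = np.maximum(np.fft.fft(c).real, 0.0)       # threshold negative eigenvalues at zero
    return np.sum(np.log(lam + sigma2))

ell = 0.2
rbf = lambda t: np.exp(-0.5 * (t / ell) ** 2)
m, du, sigma2 = 500, 1.0 / 500, 1e-3                # grid on [0, 1]
K = toeplitz(rbf(np.arange(m) * du))
exact = np.linalg.slogdet(K + sigma2 * np.eye(m))[1]
approx = whittle_logdet(rbf, m, du, sigma2)
print(abs(approx - exact) / abs(exact))             # relative logdet error
```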


5.3 BCCB Approximation for BTTB

There is a natural extension of the circulant approximation of section 5.2 to multivariate D > 1 data. A translation invariant covariance function k(x, z) = k(x − z) evaluated at data points u_i organised on a regular grid of size n_1 × n_2 × ... × n_D results in a symmetric block-Toeplitz matrix with Toeplitz blocks (BTTB), which generalises Toeplitz matrices. Unlike with Kronecker methods, a factorisation of the covariance function across dimensions is not required for this structure. Using a dimension-wise circulant embedding of size (2n_1 − 1) × (2n_2 − 1) × ... × (2n_D − 1), fast MVMs can be accomplished using the multi-dimensional Fourier transformation F = F_1 ⊗ F_2 ⊗ ... ⊗ F_D, applying a Fourier transformation F_d along each dimension, rendering fast inference using LCG feasible.

Similarly, the Whittle approximation to the log-determinant can be generalised, where the truncated sum for circ_Whittle(k) is over (2w + 1)^D terms instead of 2w + 1. As a result, the Whittle approximation C_{U,U} to the covariance matrix K_{U,U} is block-circulant with circulant blocks (BCCB). Fortunately, BCCB matrices have an eigendecomposition C_{U,U} = F^H diag(Fc) F, where c ∈ R^n is the Whittle approximation to k ∈ R^n, n = n_1 · n_2 · ... · n_D, and F = F_1 ⊗ F_2 ⊗ ... ⊗ F_D as defined before. Hence, all computational and approximation benefits from the Toeplitz case carry over to the BTTB case. As a result, exploiting BTTB structure allows us to deal efficiently with multivariate data without requiring a factorizing covariance function.

We note that the blocks in a BTTB matrix need not be symmetric. Moreover, symmetric BCCB matrices – in contrast to symmetric BTTB matrices – are fully characterised by their first column.
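A minimal sketch of a BTTB MVM via BCCB embedding and the 2D FFT (our own code; the example kernel is deliberately non-separable to emphasise that no Kronecker factorisation is required).

```python
import numpy as np

def bttb_mvm(kern2d, Y):
    """Multiply the BTTB covariance of a stationary kernel on an n1 x n2 grid by
    vec(Y), via embedding in a BCCB matrix of size (2*n1-1) x (2*n2-1) and 2D FFTs."""
    n1, n2 = Y.shape
    # stored 2D lags of the BCCB embedding's generating array
    d1 = np.concatenate([np.arange(n1), -np.arange(n1 - 1, 0, -1)])
    d2 = np.concatenate([np.arange(n2), -np.arange(n2 - 1, 0, -1)])
    C = kern2d(d1[:, None], d2[None, :])               # (2*n1-1) x (2*n2-1)
    Ypad = np.zeros_like(C)
    Ypad[:n1, :n2] = Y                                 # zero padding
    out = np.fft.ifft2(np.fft.fft2(C) * np.fft.fft2(Ypad)).real
    return out[:n1, :n2]                               # truncate back to the grid

# example: an anisotropic, non-separable stationary kernel on a 20 x 15 grid
kern2d = lambda dx, dy: np.exp(-0.5 * (dx**2 + dx*dy + 2*dy**2) / 10.0)
Y = np.random.default_rng(3).standard_normal((20, 15))
Kv = bttb_mvm(kern2d, Y)      # equals (BTTB covariance) @ vec(Y), reshaped to the grid
```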

5.4 Projections

Even if we do not exploit structure in K_{U,U}, our framework in section 5 provides efficiency gains over conventional inducing point approaches, particularly for test time predictions. However, if we are to place U onto a multidimensional (Cartesian product) grid so that K_{U,U} has Kronecker structure, then the total number of inducing points m (the cardinality of U) grows exponentially with the number of grid dimensions, limiting one to about five or fewer grid dimensions for practical applications. However, we need not limit the applicability of our approach to data inputs X with D ≤ 5 input dimensions, even if we wish to exploit Kronecker structure in K_{U,U}. Indeed, many inducing point approaches suffer from the curse of dimensionality, and input projections have provided an effective countermeasure (e.g., Snelson, 2007).

We assume the D dimensional data inputs x ∈ R^D, and inducing points which live in a lower d < D dimensional space, u ∈ R^d, related through the mapping u = h(x, ω), where we wish to learn the parameters of the mapping ω in a supervised manner, through the Gaussian process marginal likelihood. Such a representation is highly general: any deep learning architecture h(x, ω), for example, will ordinarily project into a hidden layer which lives in a d < D dimensional space.

We focus on the supervised learning of linear projections, Px = u, where P ∈ R^{d×D}. Our covariance functions effectively become k(x_i, x_j) → k(Px_i, Px_j), k(x_i, u_j) → k(Px_i, u_j), k(u_i, u_j) → k(u_i, u_j). Starting from the RBF kernel, for example,

k_RBF(Px_i, Px_j) = exp[−0.5 (Px_i − Px_j)^⊤ (Px_i − Px_j)] = exp[−0.5 (x_i − x_j)^⊤ P^⊤P (x_i − x_j)].

The resulting kernel generalises the RBF and ARD kernels, which respectively have spherical and diagonal covariance matrices, with a full covariance matrix Σ = P^⊤P, which allows for richer models (Vivarelli and Williams, 1998). But in our context, this added flexibility has special meaning. Kronecker methods typically require a kernel which separates as a product across input dimensions (section 3.1). Here, we can capture sophisticated correlations between the different dimensions of the data inputs x through the projection matrix P, while preserving Kronecker structure in K_{U,U}. Moreover, learning P in a supervised manner, e.g., through the Gaussian process marginal likelihood, has immediate advantages over unsupervised dimensionality reduction of x; for example, if only a subset of the data inputs were used in producing the target values, this structure would not be detected by an unsupervised method such as PCA, but can be learned through P. Thus in addition to the critical benefit of allowing for applications with D > 5 dimensional data inputs, P can also enrich the expressive power of our MSGP model.
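A minimal sketch of the projected RBF kernel (our own code; in practice P would be learned jointly with the kernel hyperparameters by marginal likelihood optimisation, as described below).

```python
import numpy as np

def rbf_projected(X1, X2, P):
    """RBF kernel on linearly projected inputs: k(x, z) = exp(-0.5 ||P x - P z||^2).
    X1: n1 x D, X2: n2 x D, P: d x D with d < D. Equivalent to an RBF kernel with
    full metric matrix P^T P on the original D-dimensional inputs."""
    Z1, Z2 = X1 @ P.T, X2 @ P.T                       # project to the d-dimensional space
    sq = ((Z1[:, None, :] - Z2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq)

rng = np.random.default_rng(4)
D, d = 50, 2
P = rng.standard_normal((d, D)) / np.sqrt(D)          # illustrative initialisation
X = rng.standard_normal((100, D))
K = rbf_projected(X, X, P)
```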

The entries of P become hyperparameters of the Gaussian process marginal likelihood of Eq. (3), and can be treated in exactly the same way as standard kernel hyperparameters such as length-scale. One can learn these parameters through marginal likelihood optimisation.

Computing the derivatives of the log marginal likelihood with respect to the projection matrix requires some care under the structured kernel interpolation approximation to the covariance matrix K_{X,X}. We provide the mathematical details in Appendix A.

For practical reasons, one may wish to restrict P to be orthonormal or to have unit scaling. We discuss this further in section 6.2.

6 Experiments

In these preliminary experiments, we stress test MSGP in terms of training and prediction runtime, as well as accuracy, empirically verifying its scalability and predictive performance. We also demonstrate the consistency of the model in being able to learn supervised projections, for higher dimensional input spaces.

We compare to exact Gaussian processes, FITC (Snelson and Ghahramani, 2006), Sparse Spectrum Gaussian Processes (SSGP) (Lazaro-Gredilla et al., 2010), the Big Data GP (BDGP) (Hensman et al., 2013), MSGP with Toeplitz (rather than circulant) methods, and MSGP not using the new scalable approach for test predictions (scalable test predictions are described in section 5).

The experiments were executed on a workstation with an Intel i7-4770 CPU and 32 GB RAM. We used a step length of 0.01 for BDGP, based on the values reported by Hensman et al. (2013), and a batch size of 300. We also evaluated BDGP with a larger batch size of 5000 but found that the results are qualitatively similar. We stopped the stochastic optimization in BDGP when the log-likelihood did not improve by at least 0.1 within the last 50 steps, or after 5000 iterations.

6.1 Stress tests

One cannot exploit Kronecker structure in one dimensional inputs for scalability, and Toeplitz methods, which apply to 1D inputs, are traditionally limited to about 10,000 points if one requires many marginal likelihood evaluations for kernel learning. Thus to stress test the value of the circulant approximation most transparently, and to give the greatest advantage to alternative approaches in terms of scalability with the number of inducing points m, we initially stress test in a 1D input space, before moving onto higher dimensional problems.

In particular, we sample 1D inputs x uniformly at random in [−10, 10], so that the data inputs have no grid structure. We then generate out-of-class ground truth data

f(x) = sin(x) exp(−x^2 / (2 × 5^2)),

with additive Gaussian noise to form y. We distribute inducing points on a regularly spaced grid in [−12, 13].
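A minimal sketch of this synthetic setup (our own code; the noise standard deviation of 0.1 is an assumption, as the text only specifies additive Gaussian noise).

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, noise = 10_000, 1_000, 0.1
x = rng.uniform(-10.0, 10.0, size=n)                 # inputs with no grid structure
f = np.sin(x) * np.exp(-x**2 / (2 * 5**2))           # out-of-class ground truth
y = f + noise * rng.standard_normal(n)               # noisy targets
U = np.linspace(-12.0, 13.0, m)                      # regularly spaced inducing grid
```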

In Figure 2 we show the runtime for a marginal likelihood evaluation as a function of the number of training points n and the number of inducing points m. MSGP quickly overtakes the alternatives as n increases. Moreover, the runtimes for MSGP with different numbers of inducing points converge quickly with increases in m. By n = 10^7, MSGP requires the same training time for m = 10^3 inducing points as it does for m = 10^6 inducing points! Indeed, MSGP is able to accommodate an unprecedented number of inducing points: with m = 10^6 inducing points it overtakes the alternatives, which are using m = 10^3 inducing points. Such an exceptionally large number of inducing points allows for near-exact accuracy, and the ability to retain the model structure necessary for large scale kernel learning. Note that m = 10^3 inducing points (resp. basis functions) is a practical upper limit in alternative approaches (Hensman et al. (2013), for example, give m ∈ [50, 100] as a practical upper bound for conventional inducing approaches for large n).

We emphasize that although the stochastic optimization of BDGP can take some time to converge, the ability to use mini-batches for jointly optimizing the variational parameters and the GP hyperparameters, as part of a principled framework, makes BDGP highly scalable. Indeed, the methodologies in MSGP and BDGP are complementary and could be combined for a particularly scalable Gaussian process framework.

In Figure 3, we show that the prediction runtime for MSGP is practically independent of both m and n, and for any fixed m, n is much faster than the alternatives, which depend at least quadratically on m and n. We also see that the local interpolation strategy for test predictions, introduced in section 5.1, greatly improves upon MSGP using the exact predictive distributions while exploiting Kronecker and Toeplitz algebra.

[Figure omitted: training time in seconds versus the number of training points n (10^2 to 10^7) for an exact GP, a Toeplitz GP, SSGP, FITC, BDGP, and MSGP with m = 10^3, 10^4, 10^5 and 10^6 inducing points.]

Figure 2: Training Runtime Comparison. We evaluate the runtime, in seconds, to evaluate the marginal likelihood and all relevant derivatives for each given method.

In Figure 4, we evaluate the accuracy of the fast mean and variance test predictions in section 5.1, using the same data as in the runtime stress tests. We compare the relative mean absolute error, SMAE(ŷ_*) = MAE(ŷ_*, y_*)/MAE(ȳ_*, y_*), where y_* are the true test targets, ȳ_* is the mean of the true test targets, and ŷ_* are the test predictions made by each method. We compare the predictive mean and variance of the fast predictions with MSGP using the standard ('slow') predictive equations and Kronecker algebra, and the predictions made using an exact Gaussian process.

As we increase the number of inducing points m, the quality of predictions improves, as we expect. The fast predictive variances, based on local kernel interpolation, are not as sensitive to the number of inducing points as the alternatives, but nonetheless have reasonable performance. We also see that the fast predictive variances are improved by having an increasing number of samples n_s in the stochastic estimator of section 5.1.2, which is most noticeable for larger numbers of inducing points m. Notably, the fast predictive mean, based on local interpolation, is essentially indistinguishable from the predictive mean ('slow') using the standard GP predictive equations without interpolation. Overall, the error of the fast predictions is much less than the average variability in the data.


[Figure omitted. Left panel: prediction time in seconds for 10^5 test points versus the number of training points n (10^2 to 10^5) for an exact GP, SSGP, FITC, BDGP, and the slow/fast MSGP mean and variance with m = 10 up to 10^6 inducing points. Right panel: prediction time for 10^5 test points versus the number of inducing points m (10^1 to 10^5) for SSGP, FITC, BDGP and MSGP with n = 10^3 to 10^7 training points.]

Figure 3: Prediction Runtime Comparison. 'Slow' mean and var refer to MSGP when using standard Kronecker and Toeplitz algebra for the predictive mean and variance, without the fast local interpolation proposed in section 5.1.


[Figure omitted: SMAE versus the number of inducing points m (10^1 to 10^3) for the mean and variance under the 'slow' and 'fast' predictions, with n_s = 20, 100 and 1000 samples for the fast variance.]

Figure 4: Accuracy Comparison. We compare the relative absolute difference of the predictive mean and variance of MSGP, both using 'fast' local kernel interpolation and the 'slow' standard predictions, to exact inference.


[Figure omitted. Panel (a): SMAE versus the number of dimensions for GP Proj. Kernel, GP Full, and GP True Sub. Panel (b): subspace distance versus the number of dimensions.]

Figure 5: Synthetic Projection Experiments.


6.2 Projections

Here we test the consistency of our approach in section 5.4 for recovering ground truth projections, and providing accurate predictions, on D ≫ 5 dimensional input spaces.

To generate data, we begin by sampling the entries of a d × D projection matrix P from a standard Gaussian distribution, and then project n = 3000 inputs x (of dimensionality D × 1), with locations randomly sampled from a Gaussian distribution so that there is no input grid structure, into a d = 2 dimensional space: x′ = Px. We then sample data y from a Gaussian process with an RBF kernel operating on the low dimensional inputs x′. We repeat this process 30 times for each of D = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, . . . , 100.

We now index the data y by the high dimensional inputs x and attempt to reconstruct the true low dimensional subspace described by x′ = Px. We learn the entries of P jointly with covariance parameters through marginal likelihood optimisation (Eq. (3)). Using a (d = 2) 50 × 50 Cartesian product grid for 2500 total inducing points U, we reconstruct the projection matrix P, with the subspace error,

dist(P_1, P_2) = ||G_1 − G_2||_2,   (13)

shown in Figure 5(a). Here G_i is the orthogonal projection onto the d-dimensional subspace spanned by the rows of P_i. This metric is motivated by the one-to-one correspondence between subspaces and orthogonal projections. It is bounded in [0, 1], where the maximum distance 1 indicates that the subspaces are orthogonal to each other, and the minimum 0 is achieved if the subspaces are identical. More information on this metric, including how to compute dist of Eq. (13), is available in chapter 2.5.3 of Golub and Van Loan (2012).
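A minimal sketch of this metric (our own code), using the orthogonal projections onto the row spaces of P_1 and P_2.

```python
import numpy as np

def subspace_dist(P1, P2):
    """Distance between the row spaces of P1 and P2 (both d x D), Eq. (13):
    spectral norm of the difference of the orthogonal projections onto the subspaces."""
    G1 = P1.T @ np.linalg.solve(P1 @ P1.T, P1)   # orthogonal projection onto rowspace(P1)
    G2 = P2.T @ np.linalg.solve(P2 @ P2.T, P2)
    return np.linalg.norm(G1 - G2, ord=2)
```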

We also make predictions on n_* = 1000 withheld test points, and compare i) our method using P (GP Proj. Kernel), ii) an exact GP on the high dimensional space (GP Full), and iii) an exact GP on the true subspace (GP True), with results shown in Figure 5(b). We average our results 30 times for each value of D, and show 2 standard errors. The extremely low subspace and SMAE errors up to D = 40 validate the consistency of our approach for reconstructing a ground truth projection. Moreover, we see that our approach is competitive in SMAE (as defined in the previous section) with the best possible method, GP True, up to about D = 40. Moreover, although the subspace error is moderate for D = 100, we are still able to substantially outperform the standard baseline, an exact GP applied to the observed inputs x.

We implement variants of our approach where P is constrained 1) to be orthonormal, or 2) to have unit scaling, e.g., P_unit = diag(√diag(PP^⊤))^{-1} P. Such constraints prevent degeneracies between P and kernel hyperparameters from causing practical issues, such as length-scales growing to large values to shrink the marginal likelihood log determinant complexity penalty, and then re-scaling P to leave the marginal likelihood model fit term unaffected. In practice we found that unit scaling was sufficient to avoid such issues, and thus preferable to orthonormal P, which is more constrained.


7 Discussion

We introduce massively scalable Gaussian processes (MSGP), which significantly extend KISS-GP, inducing point, and structure exploiting approaches, for: (1) O(1) test predictions (section 5.1); (2) circulant log determinant approximations which (i) unify Toeplitz and Kronecker structure, (ii) enable extremely fast marginal likelihood evaluations, and (iii) extend KISS-GP and Toeplitz methods for scalable kernel learning in D = 1 input dimensions, where one cannot exploit multidimensional Kronecker structure for scalability (section 5.2); (3) more general BTTB structure, which enables fast exact multidimensional inference without requiring Kronecker (tensor) decompositions (section 5.3); and (4) projections which enable KISS-GP to be used with structure exploiting approaches for D ≫ 5 input dimensions, and increase the expressive power of covariance functions (section 5.4).

We demonstrate these advantages, comparing to state of the art alternatives. In particular, we show MSGP is exceptionally scalable in terms of the number of training points n, inducing points m, and testing points. The ability to handle large m will prove important for retaining accuracy in scalable Gaussian process methods, and in enabling large scale kernel learning.

This document serves to report substantial developments regarding the SKI and KISS-GP frameworks introduced in Wilson and Nickisch (2015).

References

Chan, T. F. (1988). An optimal circulant preconditioner for Toeplitz systems. SIAM Journal on Scientific and Statistical Computing, 9(4):766–771.

Cunningham, J. P., Shenoy, K. V., and Sahani, M. (2008). Fast Gaussian process methods for point process intensity estimation. In International Conference on Machine Learning.

Deisenroth, M. P. and Ng, J. W. (2015). Distributed Gaussian processes. In International Conference on Machine Learning.

Golub, G. H. and Van Loan, C. F. (2012). Matrix computations, volume 3. JHU Press.

Gray, R. M. (2005). Toeplitz and circulant matrices: A review. Communications and Information Theory, 2(3):155–239.

Guinness, J. and Fuentes, M. (2014). Circulant embedding of approximate covariances for inference from Gaussian data on large lattices. Under review.

Hensman, J., Fusi, N., and Lawrence, N. (2013). Gaussian processes for big data. In Uncertainty in Artificial Intelligence (UAI). AUAI Press.

Keys, R. G. (1981). Cubic convolution interpolation for digital image processing. IEEE Transactions on Acoustics, Speech and Signal Processing, 29(6):1153–1160.


Lazaro-Gredilla, M., Quinonero-Candela, J., Rasmussen, C. E., and Figueiras-Vidal, A. (2010). Sparse spectrum Gaussian process regression. Journal of Machine Learning Research (JMLR), 11:1865–1881.

Le, Q., Sarlos, T., and Smola, A. (2013). Fastfood – computing Hilbert space expansions in loglinear time. In Proceedings of the 30th International Conference on Machine Learning, pages 244–252.

Lieberman, O., Rosemarin, R., and Rousseau, J. (2009). Asymptotic theory for maximum likelihood estimation in stationary fractional Gaussian processes, under short-, long- and intermediate memory. In Econometric Society European Meeting.

Papandreou, G. and Yuille, A. L. (2011). Efficient variational inference in large-scale Bayesian compressed sensing. In Proc. IEEE Workshop on Information Theory in Computer Vision and Pattern Recognition (in conjunction with ICCV-11), pages 1332–1339.

Quinonero-Candela, J. and Rasmussen, C. E. (2005a). A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research (JMLR), 6:1939–1959.

Quinonero-Candela, J. and Rasmussen, C. E. (2005b). A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research (JMLR), 6:1939–1959.

Rahimi, A. and Recht, B. (2007). Random features for large-scale kernel machines. In Neural Information Processing Systems.

Rasmussen, C. E. and Nickisch, H. (2010). Gaussian processes for machine learning (GPML) toolbox. Journal of Machine Learning Research (JMLR), 11:3011–3015.

Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. The MIT Press.

Saatchi, Y. (2011). Scalable Inference for Structured Gaussian Process Models. PhD thesis, University of Cambridge.

Snelson, E. (2007). Flexible and efficient Gaussian process models for machine learning. PhD thesis, University College London.

Snelson, E. and Ghahramani, Z. (2006). Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems (NIPS), volume 18, page 1257. MIT Press.

Storkey, A. (1999). Truncated covariance matrices and Toeplitz methods in Gaussian processes. In ICANN.

Strang, G. (1986). A proposal for Toeplitz matrix calculations. Studies in Applied Mathematics, 74(2):171–176.

Tyrtyshnikov, E. (1992). Optimal and superoptimal circulant preconditioners. SIAM Journal on Matrix Analysis and Applications, 13(2):459–473.


Vivarelli, F. and Williams, C. K. (1998). Discovering hidden features with Gaussian processes regression. In Advances in Neural Information Processing Systems, pages 613–619.

Whittle, P. (1954). On stationary processes in the plane. Biometrika, 41(3/4):434–449.

Williams, C. and Seeger, M. (2001). Using the Nystrom method to speed up kernel machines. In Advances in Neural Information Processing Systems, pages 682–688. MIT Press.

Wilson, A. G. (2014). Covariance kernels for fast automatic pattern discovery and extrapolation with Gaussian processes. PhD thesis, University of Cambridge.

Wilson, A. G., Gilboa, E., Nehorai, A., and Cunningham, J. P. (2014). Fast kernel learning for multidimensional pattern extrapolation. In Advances in Neural Information Processing Systems.

Wilson, A. G. and Nickisch, H. (2015). Kernel interpolation for scalable structured Gaussian processes (KISS-GP). International Conference on Machine Learning (ICML).

Yang, Z., Smola, A. J., Song, L., and Wilson, A. G. (2015). A la carte – learning fast kernels. Artificial Intelligence and Statistics.

Zhang, Y., Leithead, W. E., and Leith, D. J. (2005). Time-series Gaussian process regression based on Toeplitz computation of O(N^2) operations and O(N)-level storage. In Proceedings of the 44th IEEE Conference on Decision and Control.


A APPENDIX

A.1 Derivatives For Normal Projections

• Apply the chain rule to obtain ∂ψ/∂Q from ∂ψ/∂P.

• The normalised projection matrix Q is defined by Q = diag(p)P ∈ R^{d×D}, with p = √diag(PP^⊤)^{-1}, so that diag(QQ^⊤) = 1.

• ∂ψ/∂Q = diag(p) ∂ψ/∂P − diag(diag((∂ψ/∂P) P^⊤) ⊙ p^3) P

A.2 Derivatives For Orthonormal Projections

• The orthonormal projection matrix Q is defined by Q = (PP^⊤)^{-1/2} P ∈ R^{d×D}, so that QQ^⊤ = I.

• Apply the chain rule to obtain ∂ψ/∂Q from ∂ψ/∂P.

• Use the eigenvalue decomposition PP^⊤ = V F V^⊤, and define S = V diag(s) V^⊤ with s = √diag(F). Then

∂ψ/∂Q = S^{-1} ∂ψ/∂P − V A V^⊤ P,   A = [V^⊤ S^{-1} ((∂ψ/∂P) P^⊤ + P (∂ψ/∂P)^⊤) S^{-1} V] / (s 1^⊤ + 1 s^⊤),

where the division defining A is component-wise.

A.3 Circulant Determinant Approximation Benchmark


[Figure omitted: two panels of relative logdet error versus the number of inducing inputs m for log|K + σ^2 I| with x ∈ [0, 1], for ℓ = 0.10, σ = 10^-1 (left) and ℓ = 0.20, σ = 10^-1 (right), comparing the T. Chan, Tyrtyshnikov, Strang, Helgason and Whittle (w = 3) approximations on covSEiso, covMaterniso (d = 1) and covRQiso (α = 0.5).]

Figure 6: Additional benchmarks of different circulant approximations.


[Figure omitted: as Figure 6, for ℓ = 0.05, σ = 10^-1 (left) and ℓ = 0.10, σ = 10^-2 (right).]

Figure 7: Additional benchmarks of different circulant approximations.


[Figure omitted: as Figure 6, for ℓ = 0.20, σ = 10^-2 (left) and ℓ = 0.05, σ = 10^-2 (right).]

Figure 8: Additional benchmarks of different circulant approximations.


[Figure omitted: as Figure 6, for ℓ = 0.10, σ = 10^-3 (left) and ℓ = 0.05, σ = 10^-3 (right).]

Figure 9: Additional benchmarks of different circulant approximations.
