
Bayesian Inference of Log Determinants

Jack Fitzsimons1 Kurt Cutajar2 Michael Osborne1

1 Information Engineering, University of Oxford, UK
2 Department of Data Science, EURECOM, France

Stephen Roberts1 Maurizio Filippone2

Abstract

The log-determinant of a kernel matrix appears in a variety of machine learning problems, ranging from determinantal point processes and generalized Markov random fields, through to the training of Gaussian processes. Exact calculation of this term is often intractable when the size of the kernel matrix exceeds a few thousands. In the spirit of probabilistic numerics, we reinterpret the problem of computing the log-determinant as a Bayesian inference problem. In particular, we combine prior knowledge in the form of bounds from matrix theory and evidence derived from stochastic trace estimation to obtain probabilistic estimates for the log-determinant and its associated uncertainty within a given computational budget. Beyond its novelty and theoretic appeal, the performance of our proposal is competitive with state-of-the-art approaches to approximating the log-determinant, while also quantifying the uncertainty due to budget-constrained evidence.

1 INTRODUCTION

Developing scalable learning models without compromising performance is at the forefront of machine learning research. The scalability of several learning models is predominantly hindered by linear algebraic operations having large computational complexity, among which is the computation of the log-determinant of a matrix (Golub & Van Loan, 1996). The latter term features heavily in the machine learning literature, with applications including spatial models (Aune et al., 2014; Rue & Held, 2005), kernel-based models (Davis et al., 2007; Rasmussen & Williams, 2006), and Bayesian learning (Mackay, 2003).

The standard approach for evaluating the log-determinant of a positive definite matrix involves the use of Cholesky decomposition (Golub & Van Loan, 1996), which is employed in various applications of statistical models such as kernel machines. However, the use of Cholesky decomposition for general dense matrices requires O(n^3) operations, whilst also entailing memory requirements of O(n^2). In view of this computational bottleneck, various models requiring the log-determinant for inference bypass the need to compute it altogether (Anitescu et al., 2012; Stein et al., 2013; Cutajar et al., 2016; Filippone & Engler, 2015).

Alternatively, several methods exploit sparsity and structure within the matrix itself to accelerate computations. For example, sparsity in Gaussian Markov Random fields (GMRFs) arises from encoding conditional independence assumptions that are readily available when considering low-dimensional problems. For such matrices, the Cholesky decomposition can be computed in fewer than O(n^3) operations (Rue & Held, 2005; Rue et al., 2009). Similarly, Kronecker-based linear algebra techniques may be employed for kernel matrices computed on regularly spaced inputs (Saatçi, 2011). While these ideas have proven successful for a variety of specific applications, they cannot be extended to the case of general dense matrices without assuming special forms or structures for the available data.

To this end, general approximations to the log-determinant frequently build upon stochastic trace estimation techniques using iterative methods (Avron & Toledo, 2011). Two of the most widely-used polynomial approximations for large-scale matrices are the Taylor and Chebyshev expansions (Aune et al., 2014; Han et al., 2015). A more recent approach draws from the possibility of estimating the trace of functions using stochastic Lanczos quadrature (Ubaru et al., 2016), which has been shown to outperform polynomial approximations from both a theoretic and empirical perspective.


Inspired by recent developments in the field of probabilistic numerics (Hennig et al., 2015), in this work we propose an alternative approach for calculating the log-determinant of a matrix by expressing this computation as a Bayesian quadrature problem. In doing so, we reformulate the problem of computing an intractable quantity into an estimation problem, where the goal is to infer the correct result using tractable computations that can be carried out within a given time budget. In particular, we model the eigenvalues of a matrix A from noisy observations of Tr(A^k) obtained through stochastic trace estimation using the Taylor approximation method (Zhang & Leithead, 2007). Such a model can then be used to make predictions on the infinite series of the Taylor expansion, yielding the estimated value of the log-determinant. Aside from permitting a probabilistic approach for predicting the log-determinant, this approach inherently yields uncertainty estimates for the predicted value, which in turn serves as an indicator of the quality of our approximation.

Our contributions are as follows.

1. We propose a probabilistic approach for computing the log-determinant of a matrix which blends different elements from the literature on estimating log-determinants under a Bayesian framework.

2. We demonstrate how bounds on the expected value of the log-determinant improve our estimates by constraining the probability distribution to lie between designated lower and upper bounds.

3. Through rigorous numerical experiments on synthetic and real data, we demonstrate how our method can yield superior approximations to competing approaches, while also having the additional benefit of uncertainty quantification.

4. Finally, in order to demonstrate how this technique may be useful within a practical scenario, we employ our method to carry out parameter selection for a large-scale determinantal point process.

To the best of our knowledge, this is the first time that the approximation of log-determinants is viewed as a Bayesian inference problem, and the resulting quantification of uncertainty is hitherto unexplored.

1.1 RELATED WORK

The most widely-used approaches for estimating log-determinants involve extensions of iterative algorithms, such as the Conjugate-Gradient and Lanczos methods, to obtain estimates of functions of matrices (Chen et al., 2011; Han et al., 2015) or their trace (Ubaru et al., 2016). The idea is to rewrite log-determinants as the trace of the logarithm of the matrix, and employ trace estimation techniques (Hutchinson, 1990) to obtain unbiased estimates of these. Chen et al. (2011) propose an iterative algorithm to efficiently compute the product of the logarithm of a matrix with a vector, which is achieved by computing a spline approximation to the logarithm function. A similar idea using Chebyshev polynomials has been developed by Han et al. (2015). Most recently, the Lanczos method has been extended to handle stochastic estimates of the trace and obtain probabilistic error bounds for the approximation (Ubaru et al., 2016). Blocking techniques, such as in Ipsen & Lee (2011) and Ambikasaran et al. (2016), have also been proposed.

In our work, we similarly strive to use a small number of matrix-vector products for approximating log-determinants. However, we show that by taking a Bayesian approach we can combine priors with the evidence gathered from the intermediate results of the matrix-vector products involved in the aforementioned methods to obtain more accurate results. Most importantly, our proposal has the considerable advantage that it provides a full distribution on the approximated value.

Our proposal allows for the inclusion of explicit bounds on log-determinants to constrain the posterior distribution over the estimated log-determinant (Bai & Golub, 1997). Nyström approximations can also be used to bound the log-determinant, as shown by Bardenet & Titsias (2015). Similarly, Gaussian processes (Rasmussen & Williams, 2006) have been formulated directly using the eigendecomposition of their spectrum, where eigenvectors are approximated using the Nyström method (Peng & Qi, 2015). There has also been work on estimating the distribution of kernel eigenvalues by analyzing the spectrum of linear operators (Braun, 2006; Wathen & Zhu, 2015), as well as bounds on the spectra of matrices, with particular emphasis on deriving the largest eigenvalue (Wolkowicz & Styan, 1980; Braun, 2006). In this work, we directly consider bounds on the log-determinants of matrices (Bai & Golub, 1997).

2 BACKGROUND

As highlighted in the introduction, several approaches for approximating the log-determinant of a matrix rely on stochastic trace estimation for accelerating computations. This comes about as a result of the relationship between the log-determinant of a matrix and the corresponding trace of the log-matrix, whereby

\[ \log\left(\mathrm{Det}(A)\right) = \mathrm{Tr}\left(\log(A)\right). \tag{1} \]

Provided the matrix log(A) can be efficiently sampled, this simple identity enables the use of stochastic trace estimation techniques (Avron & Toledo, 2011; Fitzsimons et al., 2016). We elaborate further on this concept below.


[Figure 1: Expected absolute error of the truncated Taylor series for stationary ν-continuous kernel matrices, plotted against the order of truncation for ν = 1, 10, 20, 30, 40, 50. The dashed grey lines indicate O(n^{-1}).]

2.1 STOCHASTIC TRACE ESTIMATION

The standard approach for computing the trace term of a matrix A ∈ R^{n×n} involves summing the eigenvalues of the matrix. Obtaining the eigenvalues typically involves a computational complexity of O(n^3), which is infeasible for large matrices. However, it is possible to obtain a stochastic estimate of the trace term such that the expectation of the estimate matches the term being approximated (Avron & Toledo, 2011). In this work, we shall consider the Gaussian estimator, whereby we introduce N_r vectors r^{(i)} sampled from an independently and identically distributed zero-mean and unit variance Gaussian distribution. This yields the unbiased estimate

\[ \mathrm{Tr}(A) = \frac{1}{N_r} \sum_{i=1}^{N_r} r^{(i)\top} A\, r^{(i)}. \tag{2} \]

Note that more sophisticated trace estimators (see Fitzsimons et al., 2016) may also be considered; without loss of generality, we opt for a more straightforward approach in order to preserve clarity.
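As an illustration, the following is a minimal sketch of the Gaussian estimator in (2), assuming NumPy; the function and variable names (gaussian_trace_estimate, matvec) are our own and not taken from any library.

```python
import numpy as np

def gaussian_trace_estimate(matvec, n, num_probes=30, rng=None):
    """Estimate Tr(A) as in (2) using Gaussian probe vectors.

    `matvec` computes A @ r for a vector r, so A never needs to be
    formed explicitly; `n` is the dimension of A.
    """
    rng = np.random.default_rng(rng)
    total = 0.0
    for _ in range(num_probes):
        r = rng.standard_normal(n)      # zero-mean, unit-variance probe
        total += r @ matvec(r)          # r^T A r
    return total / num_probes

# Example: compare the stochastic estimate to the exact trace of a PSD test matrix.
if __name__ == "__main__":
    n = 500
    X = np.random.randn(n, n)
    A = X @ X.T / n
    print(gaussian_trace_estimate(lambda v: A @ v, n), np.trace(A))
```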

2.2 TAYLOR APPROXIMATION

Against the backdrop of machine learning applications, in this work we predominantly consider covariance matrices taking the form of a Gram matrix K = {κ(x_i, x_j)}_{i,j=1,...,n}, where the kernel function κ implicitly induces a feature space representation of data points x_i. Assume K has been normalized such that the maximum eigenvalue is less than or equal to one, λ_0 ≤ 1, where the largest eigenvalue can be efficiently found using Gershgorin intervals (Gershgorin, 1931). Given that covariance matrices are positive semidefinite, we also know that the smallest eigenvalue is bounded by zero, λ_n ≥ 0. Motivated by the identity presented in (1), the Taylor series expansion (Barry & Pace, 1999; Zhang & Leithead, 2007) may be employed for evaluating the log-determinant of matrices having eigenvalues bounded between zero and one. In particular, this approach relies on the following logarithm identity,

\[ \log(I - A) = -\sum_{k=1}^{\infty} \frac{A^k}{k}. \tag{3} \]

While the infinite summation is not explicitly computable in finite time, this may be approximated by computing a truncated series instead. Furthermore, given that the trace of matrices is additive, we find

\[ \mathrm{Tr}\left(\log(I - A)\right) \approx -\sum_{k=1}^{m} \frac{\mathrm{Tr}(A^k)}{k}. \tag{4} \]

The Tr(A^k) term can be computed efficiently and recursively by propagating O(n^2) vector-matrix multiplications in a stochastic trace estimation scheme. To compute Tr(log(K)) we simply set A = I − K.
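To make the recursion concrete, here is a minimal sketch combining (2) and (4), assuming NumPy and a matrix K already normalized so that its eigenvalues lie in [0, 1]; the helper name taylor_logdet is ours, not from any library.

```python
import numpy as np

def taylor_logdet(K, order=30, num_probes=30, rng=None):
    """Truncated-Taylor estimate of log(Det(K)) as in (4).

    Sets A = I - K and reuses each probe vector across all powers,
    so the estimate of Tr(A^k) costs one extra matrix-vector
    product per additional order k.
    """
    rng = np.random.default_rng(rng)
    n = K.shape[0]
    trace_Ak = np.zeros(order)              # stochastic estimates of Tr(A^k)
    for _ in range(num_probes):
        r = rng.standard_normal(n)
        v = r.copy()
        for k in range(order):
            v = v - K @ v                   # v <- A v, with A = I - K
            trace_Ak[k] += r @ v            # accumulate r^T A^{k+1} r
    trace_Ak /= num_probes
    # Tr(log(K)) = Tr(log(I - A)) ~= -sum_k Tr(A^k) / k
    return -np.sum(trace_Ak / np.arange(1, order + 1))
```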

There are two sources of error associated with this approach; the first due to stochastic trace estimation, and the second due to truncation of the Taylor series. In the case of covariance matrices, the smallest eigenvalue tends to be very small, which can be verified by Weyl (1912) and Silverstein (1986)'s observations on the eigenspectra of covariance matrices. This leads to A^k decaying slowly as k → ∞.

In light of the above, standard Taylor approximations to the log-determinant of covariance matrices are typically unreliable, even when the exact traces of matrix powers are available. This can be verified analytically based on results from kernel theory, which state that the approximate rate of decay for the eigenvalues of positive definite kernels which are ν-continuous is O(n^{-ν-0.5}) (Weyl, 1912; Wathen & Zhu, 2015). Combining this result with the absolute error, E(λ), of the truncated Taylor approximation we find

\[
\mathbb{E}\left[E(\lambda)\right]
= O\!\left(\int_0^1 \lambda^{\nu+0.5}\left(\log(\lambda) - \sum_{j=1}^{m}\frac{\lambda^j}{j}\right) d\lambda\right)
= O\!\left(\int_0^1 \lambda^{\nu+0.5}\sum_{j=m}^{\infty}\frac{\lambda^j}{j}\, d\lambda\right)
= O\!\left(\frac{\Psi^{(0)}(m + \nu + 1.5) - \Psi^{(0)}(m)}{\nu + 1.5}\right),
\]

where Ψ^{(0)}(·) is the Digamma function. In Figure 1, we plot the relationship between the order of the Taylor approximation and the expected absolute error. It can be observed that irrespective of the continuity of the kernel, the error converges at a rate of O(n^{-1}).
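The rate above is straightforward to evaluate numerically; the short sketch below (assuming SciPy's digamma; the function name expected_taylor_error is ours) reproduces the roughly inverse dependence on the truncation order seen in Figure 1.

```python
import numpy as np
from scipy.special import digamma

def expected_taylor_error(m, nu):
    """Evaluate the O(.) term (Psi(m + nu + 1.5) - Psi(m)) / (nu + 1.5)."""
    return (digamma(m + nu + 1.5) - digamma(m)) / (nu + 1.5)

# The decay is close to 1/m regardless of the smoothness parameter nu.
for nu in (1, 10, 50):
    print(nu, [expected_taylor_error(m, nu) for m in (10, 100, 1000)])
```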

3 THE PROBABILISTIC NUMERICS APPROACH

We now propose a probabilistic numerics (Hennig et al., 2015) approach: we'll re-frame a numerical computation (in this case, trace estimation) as probabilistic inference. Probabilistic numerics usually requires distinguishing: an appropriate latent function; data; and the ultimate object of interest. Given the data, a posterior distribution is calculated for the object of interest. For instance, in numerical integration, the latent function is the integrand, f, the data are evaluations of the integrand, f(x), and the object of interest is the value of the integral, ∫ f(x) p(x) dx (see § 3.1.1 for more details). In this work, our latent function is the distribution of eigenvalues of A, the data are noisy observations of Tr(A^k), and the object of interest is log(Det(K)). For this object of interest, we are able to provide both an expected value and a variance. That is, although the Taylor approximation to the log-determinant may be considered unsatisfactory, the intermediate trace terms obtained when raising the matrix to higher powers may prove to be informative if considered as observations within a probabilistic model.

3.1 RAW MOMENT OBSERVATIONS

We wish to model the eigenvalues of A from noisy observations of Tr(A^k) obtained through stochastic trace estimation, with the ultimate goal of making predictions on the infinite series of the Taylor expansion. Let us assume that the eigenvalues are i.i.d. random variables drawn from P(λ_i = x), a probability distribution over x ∈ [0, 1]. In this setting Tr(A) = n E_x[P(λ_i = x)], and more generally Tr(A^k) = n R_x^{(k)}[P(λ_i = x)], where R_x^{(k)} is the k-th raw moment over the x domain. The raw moments can thus be computed as,

\[ R_x^{(k)}\left[P(\lambda_i = x)\right] = \int_0^1 x^k P(\lambda_i = x)\, dx. \tag{5} \]

Such a formulation is appealing because if P(λ_i = x) is modeled as a Gaussian process, the required integrals may be solved analytically using Bayesian Quadrature.

3.1.1 Bayesian Quadrature

Gaussian processes (GPs; Rasmussen & Williams, 2006) are a powerful Bayesian inference method defined over functions X → R, such that the distribution of functions over any finite subset of the input points X = {x_1, ..., x_n} is a multivariate Gaussian distribution. Under this framework, the moments of the conditional Gaussian distribution for a set of predictive points, given a set of labels y = (y_1, ..., y_n)^⊤, may be computed as

\[ \mu = \mu_0 + K_*^\top K^{-1}(y - \mu_0), \tag{6} \]
\[ \Sigma = K_{*,*} - K_*^\top K^{-1} K_*, \tag{7} \]

with µ and Σ denoting the posterior mean and variance, and K being the n × n covariance matrix for the observed variables {x_i, y_i; i ∈ (1, 2, ..., n)}. The latter is computed as κ(x, x') for any pair of points x, x' ∈ X. Meanwhile, K_* and K_{*,*} respectively denote the covariance between the observable and the predictive points, and the prior over the predicted points. Note that µ_0, the prior mean, may be set to zero without loss of generality.
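For reference, (6) and (7) can be implemented in a few lines; the sketch below assumes NumPy and uses a Cholesky factorization with a small jitter term for numerical stability (the jitter and the function name gp_posterior are our additions, not part of the method description above).

```python
import numpy as np

def gp_posterior(K, K_star, K_star_star, y, mu0=0.0, jitter=1e-10):
    """Posterior mean and covariance of a GP, following (6) and (7).

    K           : covariance between the observed inputs
    K_star      : covariance between observed and predictive inputs
    K_star_star : prior covariance of the predictive inputs
    """
    n = K.shape[0]
    L = np.linalg.cholesky(K + jitter * np.eye(n))          # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - mu0))
    mean = mu0 + K_star.T @ alpha                            # eq. (6)
    V = np.linalg.solve(L, K_star)
    cov = K_star_star - V.T @ V                              # eq. (7)
    return mean, cov
```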

Bayesian Quadrature (BQ; O'Hagan, 1991) is primarily concerned with performing integration of potentially intractable functions. In this work, we limit our discussion to the setting where the integrand is modeled as a GP,

\[ \int p(x)\, f(x)\, dx, \qquad f \sim \mathcal{GP}(\mu, \Sigma), \]

where p(x) is some measure with respect to which we are integrating. A full discussion of BQ may be found in O'Hagan (1991) and Rasmussen & Ghahramani (2002); for the sake of conciseness, we only state the result that the integrals may be computed by integrating the covariance function with respect to p(x) for both K_*,

\[ \kappa\!\left(\int x\, dx,\; x'\right) = \int p(x)\, \kappa(x, x')\, dx, \]

and K_{*,*},

\[ \kappa\!\left(\int x\, dx,\; \int x'\, dx'\right) = \iint p(x)\, \kappa(x, x')\, p(x')\, dx\, dx'. \]

3.2 KERNELS FOR RAW MOMENTS AND INFERENCE ON THE LOG-DETERMINANT

Recalling (5), if P(λ_i = x) is modeled using a GP, in order to include observations of R_x^{(k)}[P(λ_i = x)], denoted as R_x^{(k)}, we must be able to integrate the kernel with respect to the polynomial in x,

\[ \kappa\!\left(R_x^{(k)}, x'\right) = \int_0^1 x^k \kappa(x, x')\, dx, \tag{8} \]

\[ \kappa\!\left(R_x^{(k)}, R_{x'}^{(k')}\right) = \int_0^1\!\!\int_0^1 x^k \kappa(x, x')\, x'^{k'}\, dx\, dx'. \tag{9} \]


Although the integrals described above are typically analytically intractable, certain kernels have an elegant analytic form which allows for efficient computation. In this section, we derive the raw moment observations for a histogram kernel, and demonstrate how estimates of the log-determinant can be obtained. An alternate polynomial kernel is described in Appendix A.

3.2.1 Histogram Kernel

The entries of the histogram kernel, also known as the piecewise constant kernel, are given by κ(x, x') = Σ_{j=0}^{m−1} H(j/m, (j+1)/m, x, x'), where

\[ H\!\left(\frac{j}{m}, \frac{j+1}{m}, x, x'\right) = \begin{cases} 1 & x, x' \in \left[\frac{j}{m}, \frac{j+1}{m}\right] \\ 0 & \text{otherwise.} \end{cases} \]

Covariances between raw moments may be computed as follows:

\[ \kappa\!\left(R_x^{(k)}, x'\right) = \int_0^1 x^k \kappa(x, x')\, dx = \frac{1}{k+1}\left(\left(\frac{j+1}{m}\right)^{k+1} - \left(\frac{j}{m}\right)^{k+1}\right), \tag{10} \]

where in the above x' lies in the interval [j/m, (j+1)/m]. Extending this to the covariance function between raw moments we have,

\[ \kappa\!\left(R_x^{(k)}, R_{x'}^{(k')}\right) = \int_0^1\!\!\int_0^1 x^k \kappa(x, x')\, x'^{k'}\, dx\, dx' = \sum_{j=0}^{m-1} \prod_{\hat{k} \in \{k, k'\}} \frac{1}{\hat{k}+1}\left(\left(\frac{j+1}{m}\right)^{\hat{k}+1} - \left(\frac{j}{m}\right)^{\hat{k}+1}\right). \tag{11} \]

This simple kernel formulation between observations of the raw moments compactly allows us to perform inference over P(λ_i = x). However, the ultimate goal is to predict log(Det(K)), and hence Σ_{k=1}^{∞} Tr(A^k)/k. This requires a seemingly more complex set of kernel expressions; nevertheless, by propagating the implied infinite summations into the kernel function, we can also obtain the closed form solutions for these terms,

\[ \kappa\!\left(\sum_{k=1}^{\infty}\frac{R_x^{(k)}}{k},\, R_{x'}^{(k')}\right) = \sum_{j=0}^{m-1} \frac{1}{k'+1}\left(\left(\frac{j+1}{m}\right)^{k'+1} - \left(\frac{j}{m}\right)^{k'+1}\right)\left(S\!\left(\frac{j+1}{m}\right) - S\!\left(\frac{j}{m}\right)\right), \tag{12} \]

\[ \kappa\!\left(\sum_{k=1}^{\infty}\frac{R_x^{(k)}}{k},\, \sum_{k'=1}^{\infty}\frac{R_{x'}^{(k')}}{k'}\right) = \sum_{j=0}^{m-1}\left(S\!\left(\frac{j+1}{m}\right) - S\!\left(\frac{j}{m}\right)\right)^2, \tag{13} \]

where S(α) = Σ_{k=1}^{∞} α^{k+1} / (k(k+1)), which has the convenient identity for 0 < α < 1,

\[ S(\alpha) = \alpha + (1 - \alpha)\log(1 - \alpha). \]

Following the derivations presented above, we can finally go about computing the prediction for the log-determinant, and its corresponding variance, using the GP posterior equations given in (6) and (7). This can be achieved by replacing the terms K_* and K_{*,*} with the constructions presented in (12) and (13), respectively. The entries of K are filled in using (11), whereas y denotes the noisy observations of Tr(A^k).
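The closed forms (11)-(13) translate directly into code; the sketch below (NumPy assumed, helper names such as hist_kernel_K are ours) builds the kernel pieces for observation orders 1, ..., M, and deliberately leaves out the scaling of the trace observations by n and any observation-noise term.

```python
import numpy as np

def S(alpha):
    """S(alpha) = alpha + (1 - alpha) log(1 - alpha); clipped near 1 for stability."""
    a = np.clip(np.asarray(alpha, dtype=float), 0.0, 1.0 - 1e-12)
    return a + (1.0 - a) * np.log1p(-a)

def hist_kernel_K(orders, m):
    """Gram matrix between raw-moment observations, following (11)."""
    j = np.arange(m)
    def bins(k):
        return ((j + 1) / m) ** (k + 1) - (j / m) ** (k + 1)
    K = np.zeros((len(orders), len(orders)))
    for a, k in enumerate(orders):
        for b, kp in enumerate(orders):
            K[a, b] = np.sum(bins(k) * bins(kp)) / ((k + 1) * (kp + 1))
    return K

def hist_kernel_Kstar(orders, m):
    """Covariance of each observation with the infinite Taylor sum, following (12)."""
    j = np.arange(m)
    dS = S((j + 1) / m) - S(j / m)
    return np.array([np.sum((((j + 1) / m) ** (kp + 1) - (j / m) ** (kp + 1)) * dS) / (kp + 1)
                     for kp in orders])

def hist_kernel_kss(m):
    """Prior variance of the infinite Taylor sum, following (13)."""
    j = np.arange(m)
    return np.sum((S((j + 1) / m) - S(j / m)) ** 2)

# Example: kernel pieces for Taylor orders 1..10 with m = 100 histogram bins.
orders = np.arange(1, 11)
K, K_star, k_ss = hist_kernel_K(orders, 100), hist_kernel_Kstar(orders, 100), hist_kernel_kss(100)
```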

3.2.2 Prior Mean Function

While GPs, and in this case BQ, can be applied with a zero mean prior without loss of generality, it is often beneficial to have a mean function as an initial starting point. If P(λ_i = x) is composed of a constant mean function g(λ_i = x), and a GP is used to model the residual, we have that

\[ P(\lambda_i = x) = g(\lambda_i = x) + f(\lambda_i = x). \]

The previously derived moment observations may then be decomposed into,

\[ \int x^k P(\lambda_i = x)\, dx = \int x^k g(\lambda_i = x)\, dx + \int x^k f(\lambda_i = x)\, dx. \tag{14} \]

Due to the domain of P(λ_i = x) lying between zero and one, we set a Beta distribution as the prior mean, which has some convenient properties. First, it is fully specified by the mean and variance of the distribution, which can be computed using the trace and Frobenius norm of the matrix. Secondly, the r-th raw moment of a Beta distribution parameterized by α and β is

\[ R_x^{(r)}\left[g(\lambda_i = x)\right] = \prod_{j=0}^{r-1} \frac{\alpha + j}{\alpha + \beta + j}, \]

which is straightforward to compute.

In consequence, the expectation of the logarithm of random variables and, hence, the 'prior' log-determinant yielded by g(λ_i = x) can be computed as

\[ \mathbb{E}\left[\log(X);\, X \sim g(\lambda_i = x)\right] = \Psi^{(0)}(\alpha) - \Psi^{(0)}(\alpha + \beta). \tag{15} \]

Page 6: arXiv:1704.01445v1 [stat.ML] 5 Apr 2017mosb/public/pdf/5403...approximations for large-scale matrices are the Taylor and Chebyshev expansions (Aune et al., 2014; Han et al., 2015)

This can then simply be added to the previously derived GP expectation of the log-determinant.
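As an illustration of this prior, here is a moment-matching sketch (our own construction, assuming NumPy/SciPy and a symmetric PSD matrix with eigenvalues in [0, 1]): the Beta parameters are fitted to the eigenvalue mean and variance implied by the trace and Frobenius norm, and the prior log-determinant follows from (15) scaled by n.

```python
import numpy as np
from scipy.special import digamma

def beta_prior_logdet(A):
    """Prior estimate of log Det(A) from a Beta mean function.

    The mean and second moment of the eigenvalue distribution are
    estimated as Tr(A)/n and ||A||_F^2 / n, and a Beta(alpha, beta)
    density is matched to them (this assumes var < m1 * (1 - m1)).
    The prior log-determinant is n * E[log X], cf. (15).
    """
    n = A.shape[0]
    m1 = np.trace(A) / n                     # E[lambda]
    m2 = np.sum(A * A) / n                   # E[lambda^2] = ||A||_F^2 / n
    var = m2 - m1 ** 2
    common = m1 * (1.0 - m1) / var - 1.0     # standard Beta moment matching
    alpha, beta = m1 * common, (1.0 - m1) * common
    return n * (digamma(alpha) - digamma(alpha + beta))
```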

3.2.3 Using Bounds on the Log-Determinant

As with most GP specifications, there are hyperparameters associated with the prior and the kernel. The optimal settings for these parameters may be obtained via optimization of the standard GP log marginal likelihood, defined as

\[ \mathcal{LML}_{\mathrm{GP}} = -\frac{1}{2} y^\top K^{-1} y - \frac{1}{2}\log\left(\mathrm{Det}(K)\right) + \text{const}. \]

Borrowing from the literature on bounds for the log-determinant of a matrix, as described in Appendix B, we can also exploit such upper and lower bounds to truncate the resulting GP distribution to the relevant domain, which is expected to greatly improve the predicted log-determinant. These additional constraints can then be propagated to the hyperparameter optimization procedure by incorporating them into the likelihood function via the product rule, as follows:

\[ \mathcal{LML} = \mathcal{LML}_{\mathrm{GP}} + \log\left(\Phi\!\left(\frac{a - \mu}{\sigma}\right) - \Phi\!\left(\frac{b - \mu}{\sigma}\right)\right), \]

with a and b representing the upper and lower log-determinant bounds respectively, µ and σ representing the posterior mean and standard deviation, and Φ(·) representing the Gaussian cumulative density function. Priors on the hyperparameters may be accounted for in a similar way.

3.2.4 Algorithm Complexity and Recap

Due to its cubic complexity, GP inference is typically considered detrimental to the scalability of a model. However, in our formulation, the GP is only being applied to the noisy observations of Tr(A^k), which rarely exceed the order of tens of points. As a result, given that we assume this to be orders of magnitude smaller than the dimensionality n of the matrix K, the computational complexity is dominated by the matrix-vector operations involved in stochastic trace estimation, i.e. O(n^2) for dense matrices and O(ns) for s-sparse matrices.

The steps involved in the procedure described within this section are summarized as pseudo-code in Algorithm 1. The input matrix A is first normalized by using Gershgorin intervals to find the largest eigenvalue (line 1), and the expected bounds on the log-determinant (line 2) are calculated using matrix theory (Appendix B). The noisy Taylor observations up to an expansion order M (lines 3-4), denoted here as y, are then obtained through stochastic trace estimation, as described in § 2.2. These can be modeled using a GP, where the entries of the kernel matrix K (lines 5-7) are computed using (11). The kernel parameters are then tuned as per § 3.2.3 (line 8). Recall that we seek to make a prediction for the infinite Taylor expansion, and hence the exact log-determinant. To this end, we must compute K_* (lines 9-10) and k_{*,*} (line 11) using (12) and (13), respectively. The posterior mean and variance (line 12) may then be evaluated by filling in (6) and (7). As outlined in the previous section, the resulting posterior distribution can be truncated using the derived bounds to obtain the final estimates for the log-determinant and its uncertainty (line 13).

Algorithm 1 Computing log-determinant and uncertainty using probabilistic numerics

Input: PSD matrix A ∈ R^{n×n}, raw moments kernel κ, expansion order M, and random vectors Z
Output: Posterior mean MTRN, and uncertainty VTRN

 1: A ← NORMALIZE(A)
 2: BOUNDS ← GETBOUNDS(A)
 3: for i ← 1 to M do
 4:     y_i ← STOCHASTICTAYLOROBS(A, i, Z)
 5: for i ← 1 to M do
 6:     for j ← 1 to M do
 7:         K_{ij} ← κ(i, j)
 8: κ, K ← TUNEKERNEL(K, y, BOUNDS)
 9: for i ← 1 to M do
10:     K_{*,i} ← κ(*, i)
11: k_{*,*} ← κ(*, *)
12: MEXP, VEXP ← GPPRED(y, K, K_*, k_{*,*})
13: MTRN, VTRN ← TRUNC(MEXP, VEXP, BOUNDS)
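Line 13 of Algorithm 1 restricts the Gaussian posterior to the interval given by the bounds; a minimal sketch of this truncation step, assuming SciPy (the function name truncate_posterior is ours), is:

```python
from scipy.stats import truncnorm

def truncate_posterior(mean, var, lower, upper):
    """Moments of the GP posterior after truncation to [lower, upper].

    The Gaussian N(mean, var) is restricted to the interval given by the
    log-determinant bounds, and the mean and variance of the resulting
    truncated normal are returned.
    """
    sd = var ** 0.5
    a, b = (lower - mean) / sd, (upper - mean) / sd   # standardised bounds
    dist = truncnorm(a, b, loc=mean, scale=sd)
    return dist.mean(), dist.var()
```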

4 EXPERIMENTS

In this section, we show how the appeal of this formulation extends beyond its intrinsic novelty, whereby we also consistently obtain performance improvements over competing techniques. We set up a variety of experiments for assessing the model performance, including both synthetically constructed and real matrices. Given the model's probabilistic formulation, we also assess the quality of the uncertainty estimates yielded by the model. We conclude by demonstrating how this approach may be fitted within a practical learning scenario.

We compare our approach against several other approaches for estimating the log-determinant, namely approximations based on Taylor expansions, Chebyshev expansions and stochastic Lanczos quadrature. The Taylor approximation has already been introduced in § 2.2, and we briefly describe the others below.


[Figure 2: Empirical performance on the 6 covariances described in § 4.1. The right panel displays the log eigenspectrum of the matrices (Spect-1 through Spect-6) against the eigenvalue index; the left panel displays the absolute relative error of Taylor, Chebyshev, SLQ, PN Mean and PN Trunc. Mean for the stochastic trace estimation order set to 5, 25 and 50 (from left to right respectively).]

Chebyshev Expansions: This approach utilizes the m-degree Chebyshev polynomial approximation to the function log(I − A) (Han et al., 2015; Boutsidis et al., 2015; Peng & Wang, 2015),

\[ \mathrm{Tr}\left(\log(I - A)\right) \approx \sum_{k=0}^{m} c_k \mathrm{Tr}\left(T_k(A)\right), \tag{16} \]

where T_k(A) = 2A T_{k−1}(A) − T_{k−2}(A), starting with T_0(A) = 1 and T_1(A) = 2A − 1, and c_k is defined as

\[ c_k = \frac{2}{n+1}\sum_{i=0}^{n} \log(1 - x_i)\, T_k(x_i), \qquad x_i = \cos\!\left(\frac{\pi\left(i + \frac{1}{2}\right)}{n+1}\right). \tag{17} \]

The Chebyshev approximation is appealing as it gives the best m-degree polynomial approximation of log(1 − x) under the L∞-norm. The error induced by general Chebyshev polynomial approximations has also been thoroughly investigated (Han et al., 2015).
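For comparison, a minimal sketch of a Chebyshev-based estimator in the spirit of (16)-(17), combined with the Gaussian trace estimator of (2), might look as follows. NumPy is assumed; the interval mapping, the use of numpy.polynomial.chebyshev.chebinterpolate, and the name chebyshev_logdet are our own choices rather than the exact construction of Han et al. (2015).

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def chebyshev_logdet(B, deg=30, num_probes=30, lam_min=1e-6, rng=None):
    """Chebyshev-polynomial estimate of log(Det(B)) via stochastic trace estimation.

    Assumes the eigenvalues of the PSD matrix B lie in [lam_min, 1];
    lam_min must be supplied, e.g. from an eigenvalue bound.
    """
    rng = np.random.default_rng(rng)
    n = B.shape[0]

    def to_x(t):
        # map t in [-1, 1] to x in [lam_min, 1]
        return 0.5 * ((1 - lam_min) * t + (1 + lam_min))

    # Chebyshev coefficients of log(x) on the mapped interval.
    coeffs = C.chebinterpolate(lambda t: np.log(to_x(t)), deg)

    def matvec(v):
        # v -> M v, where M = (2B - (1 + lam_min) I) / (1 - lam_min) has spectrum in [-1, 1]
        return (2.0 * (B @ v) - (1 + lam_min) * v) / (1 - lam_min)

    total = 0.0
    for _ in range(num_probes):
        r = rng.standard_normal(n)
        w_prev, w_curr = r, matvec(r)                 # T_0(M) r and T_1(M) r
        acc = coeffs[0] * (r @ w_prev) + coeffs[1] * (r @ w_curr)
        for k in range(2, deg + 1):
            w_prev, w_curr = w_curr, 2.0 * matvec(w_curr) - w_prev
            acc += coeffs[k] * (r @ w_curr)           # three-term recurrence
        total += acc
    return total / num_probes
```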

Stochastic Lanczos Quadrature: This approach (Ubaru et al., 2016) relies on stochastic trace estimation to approximate the trace using the identity presented in (1). If we consider the eigendecomposition of matrix A into QΛQ^⊤, the quadratic form in the equation becomes

\[ r^{(i)\top} \log(A)\, r^{(i)} = r^{(i)\top} Q \log(\Lambda)\, Q^\top r^{(i)} = \sum_{k=1}^{n} \log(\lambda_k)\, \mu_k^2, \]

where µ_k denotes the individual components of Q^⊤ r^{(i)}. By transforming this term into a Riemann-Stieltjes integral ∫_a^b log(t) dµ(t), where µ(t) is a piecewise constant function (Ubaru et al., 2016), we can approximate it as

\[ \int_a^b \log(t)\, d\mu(t) \approx \sum_{j=0}^{m} \omega_j \log(\theta_j), \]

where m is the degree of the approximation, while the sets of ω and θ are the parameters to be inferred using Gauss quadrature. It turns out that these parameters may be computed analytically using the eigendecomposition of the low-rank tridiagonal transformation of A obtained using the Lanczos algorithm (Paige, 1972). Denoting the resulting eigenvalues and eigenvectors by θ and y respectively, the quadratic form may finally be evaluated as,

\[ r^{(i)\top} \log(A)\, r^{(i)} \approx \sum_{j=0}^{m} \tau_j^2 \log(\theta_j), \tag{18} \]

with τ_j = [e_1^\top y_j].
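A compact sketch of this estimator is given below, assuming NumPy, a positive definite A (so the quadrature nodes are positive), and a plain Lanczos iteration without reorthogonalisation or breakdown handling; the helpers lanczos_tridiag and slq_logdet are our own names.

```python
import numpy as np

def lanczos_tridiag(matvec, v0, m):
    """m steps of the Lanczos iteration; returns the tridiagonal (alpha, beta)."""
    n = v0.shape[0]
    alpha, beta = np.zeros(m), np.zeros(m - 1)
    q_prev, q = np.zeros(n), v0 / np.linalg.norm(v0)
    b = 0.0
    for j in range(m):
        w = matvec(q) - b * q_prev
        alpha[j] = q @ w
        w -= alpha[j] * q
        if j < m - 1:
            b = np.linalg.norm(w)
            beta[j] = b
            q_prev, q = q, w / b
    return alpha, beta

def slq_logdet(A, m=30, num_probes=30, rng=None):
    """Stochastic Lanczos quadrature estimate of log(Det(A)), cf. (18)."""
    rng = np.random.default_rng(rng)
    n = A.shape[0]
    total = 0.0
    for _ in range(num_probes):
        r = rng.standard_normal(n)
        alpha, beta = lanczos_tridiag(lambda v: A @ v, r, m)
        T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
        theta, Y = np.linalg.eigh(T)           # Ritz values and vectors
        tau2 = Y[0, :] ** 2                    # tau_j = e_1^T y_j
        total += (r @ r) * np.sum(tau2 * np.log(theta))
    return total / num_probes
```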

4.1 SYNTHETICALLY CONSTRUCTED MATRICES

Previous work on estimating log-determinants has implied that the performance of any given method is closely tied to the shape of the eigenspectrum of the matrix under review. As such, we set up an experiment for assessing the performance of each technique when applied to synthetically constructed matrices whose eigenvalues decay at different rates. Given that the computational complexity of each method is dominated by the number of matrix-vector products (MVPs) incurred, we also illustrate the progression of each technique for an increasing allowance of MVPs. All matrices are constructed using a Gaussian kernel evaluated over 1000 input points.

As illustrated in Figure 2, the estimates returned by our approach are consistently on par with (and frequently superior to) those obtained using other methods. For matrices having slowly-decaying eigenvalues, standard Chebyshev and Taylor approximations fare quite poorly, whereas SLQ and our approach both yield comparable results. The results become more homogeneous across methods for faster-decaying eigenspectra, but our method is frequently among the top two performers. For our approach, it is also worth noting that truncating the GP using known bounds on the log-determinant indeed results in superior posterior estimates. This is particularly evident when the eigenvalues decay very rapidly. Somewhat surprisingly, the performance does not seem to be greatly affected by the number of budgeted MVPs.

[Figure 3: Methods compared on a variety of UFL sparse datasets (thermomech_TC, d=102,158; bonesS01, d=127,224; ecology2, d=999,999; thermal2, d=1,228,045). For each dataset, the matrix was approximately raised to the power of 5, 10, 15, 20, 25 and 30 (left to right) using stochastic trace estimation, and the absolute relative error of Taylor, Chebyshev, SLQ, PN Mean and PN Trunc. Mean is reported.]

4.2 UFL SPARSE DATASETS

Although we have so far limited our discussion to covariance matrices, our proposed method is amenable to any positive semi-definite matrix. To this end, we extend the previous experimental set-up to a selection of real, sparse matrices obtained from the SuiteSparse Matrix Collection (Davis & Hu, 2011). Following Ubaru et al. (2016), we list the true values of the log-determinant reported in Boutsidis et al. (2015), and compare all other approaches to this baseline.

The results for this experiment are shown in Figure 3. Once again, the estimates obtained using our probabilistic approach achieve comparable accuracy to the competing techniques, and several improvements are noted for larger allowances of MVPs. As expected, the SLQ approach generally performs better than Taylor and Chebyshev approximations, especially for smaller computational budgets. Even so, our proposed technique consistently appears to have an edge across all datasets.

4.3 UNCERTAINTY QUANTIFICATION

[Figure 4: Quality of uncertainty estimates on the UFL datasets (thermomech_TC, bonesS01, ecology2, thermal2), measured as the ratio of the absolute error to the predicted standard deviation. As before, results are shown for increasing computational budgets (MVPs). The true value lay outside 2 standard deviations in only one of 24 trials.]

One of the notable features of our proposal is the ability to quantify the uncertainty of the predicted log-determinant, which can be interpreted as an indicator of the quality of the approximation. Given that none of the other techniques offer such insights to compare against, we assess the quality of the model's uncertainty estimates by measuring the ratio of the absolute error to the predicted standard deviation (uncertainty). For the latter to be meaningful, the error should ideally lie within only a few multiples of the standard deviation.

In Figure 4, we report this metric for our approach when using the histogram kernel. We carry out this evaluation over the matrices introduced in the previous experiment, once again showing how the performance varies for different MVP allowances. In all cases, the absolute error of the predicted log-determinant is consistently bounded by at most twice the predicted standard deviation, which is very sensible for such a probabilistic model.

4.4 MOTIVATING EXAMPLE

Determinantal point processes (DPPs; Macchi, 1975) are stochastic point processes defined over subsets of data such that an established degree of repulsion is maintained. A DPP, P, over a discrete space y ∈ {1, ..., n} is a probability measure over all subsets of y such that

\[ \mathcal{P}(A \subseteq y) = \mathrm{Det}(K_A), \]

where K is a positive definite matrix having all eigenvalues less than or equal to 1. A popular method for modeling data via K is the L-ensemble approach (Borodin, 2009), which transforms kernel matrices, L, into an appropriate K,

\[ K = (L + I)^{-1} L. \]


[Figure 5: The rescaled negative log likelihood (NLL) of the DPP with varying lengthscale (blue) and the probability of maximum likelihood (red), plotted over lengthscales from 10^{-3} to 10^0. Cubic interpolation was used between inferred likelihood observations. Ten samples, z, were taken to polynomial order 30.]

The goal of inference is to correctly parameterize L given observed subsets of y, such that the probability of unseen subsets can be accurately inferred in the future.

Given that the log-likelihood term of a DPP requires the log-determinant of L, naïve computations of this term are intractable for large sample sizes. In this experiment, we demonstrate how our proposed approach can be employed for the purpose of parameter optimization in large-scale DPPs. In particular, we sample points from a DPP defined on a lattice over [−1, 1]^5, with one million points at uniform intervals. A Gaussian kernel with lengthscale parameter l is placed over these points, creating the true L. Subsets of the lattice points can be drawn by taking advantage of the tensor structure of L, and we draw five sets of 12,500 samples each. For a given selection of lengthscale options, the goal of this experiment is to confirm that the DPP likelihood of the obtained samples is indeed maximized when L is parameterized by the true lengthscale, l. As shown in Figure 5, the computed uncertainty allows us to derive a distribution over the true lengthscale which, despite using few matrix-vector multiplications, is very close to the optimal.
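To connect this to the estimators above: under the L-ensemble construction, the log-likelihood of an observed subset A takes the form log Det(L_A) − log Det(L + I), where the second term is the expensive normaliser. The sketch below (NumPy assumed; approx_logdet is a hypothetical stand-in for any of the log-determinant estimators discussed in this paper) shows where such an approximation would slot in.

```python
import numpy as np

def dpp_log_likelihood(L, subsets, approx_logdet=None):
    """Log-likelihood of observed subsets under an L-ensemble DPP.

    log P(Y = A) = log Det(L_A) - log Det(L + I); by default the
    normaliser is computed exactly (intractable for very large L),
    otherwise `approx_logdet` supplies an approximate log-determinant.
    """
    n = L.shape[0]
    if approx_logdet is None:
        log_norm = np.linalg.slogdet(L + np.eye(n))[1]
    else:
        log_norm = approx_logdet(L + np.eye(n))
    ll = 0.0
    for A in subsets:                       # each A is an array of indices
        ll += np.linalg.slogdet(L[np.ix_(A, A)])[1] - log_norm
    return ll
```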

5 CONCLUSION

In a departure from conventional approaches for estimating the log-determinant of a matrix, we propose a novel probabilistic framework which provides a Bayesian perspective on the literature of matrix theory and stochastic trace estimation. In particular, our approach enables the log-determinant to be inferred from noisy observations of Tr(A^k) obtained from stochastic trace estimation. By modeling these observations using a GP, a posterior estimate for the log-determinant may then be computed using Bayesian Quadrature. Our experiments confirm that the results obtained using this model are highly comparable to competing methods, with the additional benefit of measuring uncertainty.

We forecast that the foundations laid out in this work can be extended in various directions, such as exploring more kernels on the raw moments which permit tractable Bayesian Quadrature. The uncertainty quantified in this work is also a step closer towards fully characterizing the uncertainty associated with approximating large-scale kernel-based models.

Acknowledgements

Part of this work was supported by the Royal Academy of Engineering and the Oxford-Man Institute. MF gratefully acknowledges support from the AXA Research Fund. The authors would like to thank Jonathan Downing for his supportive and insightful conversation on this work.

References

Ambikasaran, S., Foreman-Mackey, D., Greengard, L., Hogg, D. W., and O'Neil, M. Fast Direct Methods for Gaussian Processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):252–265, 2016.

Anitescu, M., Chen, J., and Wang, L. A Matrix-free Approach for Solving the Parametric Gaussian Process Maximum Likelihood Problem. SIAM J. Scientific Computing, 34(1), 2012.

Aune, E., Simpson, D. P., and Eidsvik, J. Parameter Estimation in High Dimensional Gaussian Distributions. Statistics and Computing, 24(2):247–263, 2014.

Avron, H. and Toledo, S. Randomized Algorithms for Estimating the Trace of an Implicit Symmetric Positive Semi-definite Matrix. J. ACM, 58(2):8:1–8:34, 2011.

Bai, Z. and Golub, G. H. Bounds for the Trace of the Inverse and the Determinant of Symmetric Positive Definite Matrices. Annals of Numerical Mathematics, 4:29–38, 1997.

Bardenet, R. and Titsias, M. K. Inference for Determinantal Point Processes Without Spectral Knowledge. In Proceedings of the 28th International Conference on Neural Information Processing Systems, pp. 3393–3401, 2015.

Barry, R. P. and Pace, R. K. Monte Carlo Estimates of the Log-Determinant of Large Sparse Matrices. Linear Algebra and its Applications, 289(1):41–54, 1999.

Borodin, A. Determinantal Point Processes. arXiv preprint arXiv:0911.1153, 2009.

Boutsidis, C., Drineas, P., Kambadur, P., and Zouzias, A. A Randomized Algorithm for Approximating the Log Determinant of a Symmetric Positive Definite Matrix. CoRR, abs/1503.00374, 2015.


Braun, M. L. Accurate Error Bounds for the Eigenvalues of the Kernel Matrix. Journal of Machine Learning Research, 7:2303–2328, December 2006.

Chen, J., Anitescu, M., and Saad, Y. Computing f(A)b via Least Squares Polynomial Approximations. SIAM Journal on Scientific Computing, 33(1):195–222, 2011.

Cutajar, K., Osborne, M., Cunningham, J., and Filippone, M. Preconditioning Kernel Matrices. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016.

Davis, J. V., Kulis, B., Jain, P., Sra, S., and Dhillon, I. S. Information-theoretic Metric Learning. In Proceedings of the Twenty-Fourth International Conference on Machine Learning (ICML 2007), Corvallis, Oregon, USA, June 20-24, 2007, pp. 209–216, 2007.

Davis, T. A. and Hu, Y. The University of Florida Sparse Matrix Collection. ACM Transactions on Mathematical Software (TOMS), 38(1):1, 2011.

Filippone, M. and Engler, R. Enabling Scalable Stochastic Gradient-based Inference for Gaussian Processes by Employing the Unbiased LInear System SolvEr (ULISSE). In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, July 6-11, 2015.

Fitzsimons, J. K., Osborne, M. A., Roberts, S. J., and Fitzsimons, J. F. Improved Stochastic Trace Estimation using Mutually Unbiased Bases. CoRR, abs/1608.00117, 2016.

Gershgorin, S. Über die Abgrenzung der Eigenwerte einer Matrix. Izvestija Akademii Nauk SSSR, Serija Matematika, 7(3):749–754, 1931.

Golub, G. H. and Van Loan, C. F. Matrix Computations. The Johns Hopkins University Press, 3rd edition, October 1996. ISBN 080185413.

Han, I., Malioutov, D., and Shin, J. Large-scale Log-Determinant Computation through Stochastic Chebyshev Expansions. In Bach, F. R. and Blei, D. M. (eds.), Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, 2015.

Hennig, P., Osborne, M. A., and Girolami, M. Probabilistic Numerics and Uncertainty in Computations. Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 471(2179), 2015.

Hutchinson, M. A Stochastic Estimator of the Trace of the Influence Matrix for Laplacian Smoothing Splines. Communications in Statistics - Simulation and Computation, 19(2):433–450, 1990.

Ipsen, I. C. F. and Lee, D. J. Determinant Approximations, May 2011.

Macchi, O. The Coincidence Approach to Stochastic Point Processes. Advances in Applied Probability, 7:83–122, 1975.

Mackay, D. J. C. Information Theory, Inference and Learning Algorithms. Cambridge University Press, first edition, June 2003. ISBN 0521642981.

O'Hagan, A. Bayes-Hermite Quadrature. Journal of Statistical Planning and Inference, 29:245–260, 1991.

Paige, C. C. Computational Variants of the Lanczos Method for the Eigenproblem. IMA Journal of Applied Mathematics, 10(3):373–381, 1972.

Peng, H. and Qi, Y. EigenGP: Gaussian Process Models with Adaptive Eigenfunctions. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI'15, pp. 3763–3769. AAAI Press, 2015.

Peng, W. and Wang, H. Large-scale Log-Determinant Computation via Weighted L2 Polynomial Approximation with Prior Distribution of Eigenvalues. In International Conference on High Performance Computing and Applications, pp. 120–125. Springer, 2015.

Rasmussen, C. E. and Williams, C. Gaussian Processes for Machine Learning. MIT Press, 2006.

Rasmussen, C. E. and Ghahramani, Z. Bayesian Monte Carlo. In Advances in Neural Information Processing Systems 15, NIPS 2002, December 9-14, 2002, Vancouver, British Columbia, Canada, pp. 489–496, 2002.

Rue, H. and Held, L. Gaussian Markov Random Fields: Theory and Applications, volume 104 of Monographs on Statistics and Applied Probability. Chapman & Hall, London, 2005.

Rue, H., Martino, S., and Chopin, N. Approximate Bayesian Inference for Latent Gaussian Models by Using Integrated Nested Laplace Approximations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2):319–392, 2009.

Saatçi, Y. Scalable Inference for Structured Gaussian Process Models. PhD thesis, University of Cambridge, 2011.

Silverstein, J. W. Eigenvalues and Eigenvectors of Large Dimensional Sample Covariance Matrices. Contemporary Mathematics, 50:153–159, 1986.

Stein, M. L., Chen, J., and Anitescu, M. Stochastic Approximation of Score Functions for Gaussian Processes. The Annals of Applied Statistics, 7(2):1162–1191, 2013. doi: 10.1214/13-AOAS627.

Ubaru, S., Chen, J., and Saad, Y. Fast Estimation of tr(f(A)) via Stochastic Lanczos Quadrature. 2016.


Wathen, A. J. and Zhu, S. On Spectral Distribution of Kernel Matrices Related to Radial Basis Functions. Numerical Algorithms, 70(4):709–726, 2015.

Weyl, H. Das asymptotische Verteilungsgesetz der Eigenwerte linearer partieller Differentialgleichungen (mit einer Anwendung auf die Theorie der Hohlraumstrahlung). Mathematische Annalen, 71(4):441–479, 1912.

Wolkowicz, H. and Styan, G. P. Bounds for Eigenvalues Using Traces. Linear Algebra and its Applications, 29:471–506, 1980.

Zhang, Y. and Leithead, W. E. Approximate Implementation of the Logarithm of the Matrix Determinant in Gaussian Process Regression. Journal of Statistical Computation and Simulation, 77(4):329–348, 2007.


A POLYNOMIAL KERNEL

Similar to the derivation of the histogram kernel, we can also derive the polynomial kernel for moment observations. The entries of the polynomial kernel, given by κ(x, x') = (xx' + c)^d, can be integrated over as,

\[ \kappa\!\left(R_x^{(k)}, x'\right) = \int_0^1 \sum_{i=1}^{d} \binom{d}{i} x^{k+i} x'^{i} c^{d-i}\, dx = \sum_{i=1}^{d} \binom{d}{i} \frac{x'^{i} c^{d-i}}{k + i + 1}, \tag{19} \]

\[ \kappa\!\left(R_x^{(k)}, R_{x'}^{(k')}\right) = \int_0^1\!\!\int_0^1 \sum_{i=1}^{d} \binom{d}{i} x^{k+i} x'^{k'+i} c^{d-i}\, dx\, dx' = \sum_{i=1}^{d} \binom{d}{i} \frac{c^{d-i}}{(k + i + 1)(k' + i + 1)}. \tag{20} \]

As with the histogram kernel, the infinite sum of the Taylor expansion can also be combined into the Gaussian process,

\[ \kappa\!\left(\sum_{k=1}^{\infty} \frac{R_x^{(k)}}{k},\, R_{x'}^{(k')}\right) = \sum_{k=1}^{\infty} \frac{1}{k} \sum_{i=1}^{d} \binom{d}{i} \frac{c^{d-i}}{(k + i + 1)(k' + i + 1)} = \sum_{i=1}^{d} \binom{d}{i} \frac{c^{d-i}\left(\Psi^{(0)}(i + 2) + \gamma\right)}{(i + 1)(k' + i + 1)}, \tag{21} \]

\[ \kappa\!\left(\sum_{k=1}^{\infty} \frac{R_x^{(k)}}{k},\, \sum_{k'=1}^{\infty} \frac{R_{x'}^{(k')}}{k'}\right) = \sum_{k=1}^{\infty} \sum_{k'=1}^{\infty} \frac{1}{k k'} \sum_{i=1}^{d} \binom{d}{i} \frac{c^{d-i}}{(k + i + 1)(k' + i + 1)} = \sum_{i=1}^{d} \binom{d}{i} \frac{c^{d-i}\left(\Psi^{(0)}(i + 2) + \gamma\right)^2}{(i + 1)^2}. \tag{22} \]

In the above, Ψ^{(0)}(·) is the Digamma function and γ is the Euler-Mascheroni constant. We strongly believe that the polynomial and histogram kernels are not the only kernels which can be analytically derived to include moment observations, but they act as a reasonable initial choice for practitioners.

B BOUNDS ON LOG DETERMINANTS

For the sake of completeness, we restate the bounds on log-determinants used throughout this paper (Bai & Golub, 1997).

Theorem 1. Let A be an n-by-n symmetric positive definite matrix, µ_1 = Tr(A), µ_2 = ‖A‖_F^2 and λ_i(A) ∈ [α, β] with α > 0. Then

\[
\begin{bmatrix} \log\alpha \\ \log \underline{t} \end{bmatrix}^{\!\top}
\begin{bmatrix} \alpha & \underline{t} \\ \alpha^2 & \underline{t}^{\,2} \end{bmatrix}^{-1}
\begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}
\;\leq\; \mathrm{Tr}(\log(A)) \;\leq\;
\begin{bmatrix} \log\beta \\ \log \bar{t} \end{bmatrix}^{\!\top}
\begin{bmatrix} \beta & \bar{t} \\ \beta^2 & \bar{t}^{\,2} \end{bmatrix}^{-1}
\begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix},
\]

where

\[ \underline{t} = \frac{\alpha\mu_1 - \mu_2}{\alpha n - \mu_2}, \qquad \bar{t} = \frac{\beta\mu_1 - \mu_2}{\beta n - \mu_2}. \]

This bound can be easily computed during the loading of the matrix, as both the trace and Frobenius norm can be readily calculated using summary statistics. However, bounds on the maximum and minimum eigenvalues must also be derived. We chose to use Gershgorin intervals to bound the eigenvalues (Gershgorin, 1931).
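A minimal sketch of computing these bounds is given below, assuming NumPy and the form of Theorem 1 with the 2×2 system solved explicitly; the helper names gershgorin_interval and logdet_bounds are our own.

```python
import numpy as np

def gershgorin_interval(A):
    """Gershgorin bounds on the eigenvalues of a symmetric matrix A."""
    radii = np.sum(np.abs(A), axis=1) - np.abs(np.diag(A))
    d = np.diag(A)
    return np.min(d - radii), np.max(d + radii)

def logdet_bounds(A, eps=1e-10):
    """Bai & Golub (1997) bounds on Tr(log(A)) for a PSD matrix A.

    mu1 = Tr(A), mu2 = ||A||_F^2, and [alpha, beta] is an interval
    containing the spectrum (here from Gershgorin discs, with the
    lower end clipped away from zero by `eps`).
    """
    n = A.shape[0]
    mu1, mu2 = np.trace(A), np.sum(A * A)
    alpha, beta = gershgorin_interval(A)
    alpha = max(alpha, eps)

    def one_sided(gamma):
        t = (gamma * mu1 - mu2) / (gamma * n - mu2)
        lhs = np.array([np.log(gamma), np.log(t)])
        M = np.array([[gamma, t], [gamma ** 2, t ** 2]])
        return lhs @ np.linalg.solve(M, np.array([mu1, mu2]))

    return one_sided(alpha), one_sided(beta)   # (lower, upper) bounds
```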