
Page 1

Gaussian Process Optimization with Mutual Information

Emile Contal$^1$, Vianney Perchet$^2$, Nicolas Vayatis$^1$

$^1$ CMLA – École Normale Supérieure de Cachan & CNRS, France
$^2$ LPMA – Université Paris Diderot & CNRS, France

June 22, 2014

Page 2

Problem Statement

Sequential Optimization

Let $f : \mathcal{X} \to \mathbb{R}$, where $\mathcal{X} \subset \mathbb{R}^d$ is compact and convex. We consider the problem of finding the maximum of $f$, denoted by

$$f(x^\star) = \max_{x \in \mathcal{X}} f(x)\,,$$

via sequential queries $f(x_1), f(x_2), \dots$

Noisy Observations

At iteration $T$ we choose $x_{T+1}$ using the previous noisy observations $Y_T = \{y_1, \dots, y_T\}$, where for all $t \le T$:

$$y_t = f(x_t) + \epsilon_t \quad \text{and} \quad \epsilon_t \overset{\text{iid}}{\sim} \mathcal{N}(0, \eta^2)\,.$$

Page 3

Gaussian Processes

Definition

$f \sim \mathcal{GP}(m, k)$ with mean function $m : \mathcal{X} \to \mathbb{R}$ and kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}^+$, when for all $x_1, \dots, x_n \in \mathcal{X}$ we have:

$$\big(f(x_1), \dots, f(x_n)\big) \sim \mathcal{N}\big([m(x_i)]_i,\, [k(x_i, x_j)]_{i,j}\big)\,.$$

Bayesian Inference

Given $Y_T$, the posterior distribution $\Pr[f \mid Y_T]$ is a GP with mean $\mu_{T+1}$ (prediction) and variance $\sigma^2_{T+1}$ (uncertainty) computed by Bayesian inference.
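These posterior quantities have the standard Gaussian-likelihood closed form (a textbook identity, not spelled out on the slide); assuming a zero prior mean and writing $K_T = [k(x_i, x_j)]_{i,j \le T}$ and $k_T(x) = [k(x_i, x)]_{i \le T}$:

$$\mu_{T+1}(x) = k_T(x)^\top \big(K_T + \eta^2 I\big)^{-1} Y_T\,,$$
$$\sigma^2_{T+1}(x) = k(x, x) - k_T(x)^\top \big(K_T + \eta^2 I\big)^{-1} k_T(x)\,.$$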

Page 4

Gaussian Processes

[Figure: one-dimensional Gaussian process illustration]

Page 5

Objective

Cumulative Regret

The efficiency of a policy is measured via the cumulative regret:

$$R_T = \sum_{t=1}^{T} \big(f(x^\star) - f(x_t)\big)\,.$$

Goal

The cumulative regret is unknown in practice. Our aim is to obtain upper bounds on $R_T$ with high probability.

Page 6

Related Work

Dani et al. 2008

In the general linear optimization problem, there is a lower bound on the cumulative regret of $\Omega(d\sqrt{T})$.

Srinivas et al. 2012

For the GP-UCB algorithm with a linear GP, we have an upper bound on the cumulative regret of $O(d\sqrt{T})$ with high probability.

Page 7

Mutual Information – An Important Ingredient

Information Gain

The information gain on $f$ at $X_T$ is the mutual information between $f$ and $Y_T$. For a GP distribution, with $K_T$ the kernel matrix of $X_T$:

$$I_T(X_T) = \frac{1}{2} \log\det\big(I + \eta^{-2} K_T\big)\,.$$

We define $\gamma_T = \max_{|X|=T} I_T(X)$, the maximum information gain obtainable by a sequence of $T$ query points.

Empirical Lower Bound

For GPs with bounded variance, we have [Srinivas et al. 2012]:

$$\widehat{\gamma}_T = \sum_{t=1}^{T} \sigma_t^2(x_t) \le C\,\gamma_T \quad \text{where} \quad C = \frac{2}{\log(1 + \eta^{-2})}\,.$$
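Both quantities are direct to compute. A minimal Python sketch (the $T \times T$ kernel matrix `K` and the noise level `eta` are assumed given):

```python
import numpy as np

def information_gain(K, eta):
    """I_T(X_T) = 1/2 * log det(I + eta^-2 * K_T) for a T x T kernel matrix K."""
    T = K.shape[0]
    # slogdet is numerically safer than log(det(...)) for large T
    _, logdet = np.linalg.slogdet(np.eye(T) + K / eta**2)
    return 0.5 * logdet

def srinivas_constant(eta):
    """C = 2 / log(1 + eta^-2), relating hat-gamma_T to gamma_T."""
    return 2.0 / np.log1p(eta**-2)
```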

Page 8

GP-MI – A Novel Algorithm for Sequential Optimization

$\widehat{\gamma}_0 \leftarrow 0$
for $t = 1, 2, \dots$ do
    Compute $\mu_t$ and $\sigma_t^2$ using Bayesian inference
    $\phi_t(x) \leftarrow \sqrt{\alpha}\,\big(\sqrt{\sigma_t^2(x) + \widehat{\gamma}_{t-1}} - \sqrt{\widehat{\gamma}_{t-1}}\big)$
    $x_t \leftarrow \operatorname{argmax}_{x \in \mathcal{X}}\ \mu_t(x) + \phi_t(x)$
    $\widehat{\gamma}_t \leftarrow \widehat{\gamma}_{t-1} + \sigma_t^2(x_t)$
    Query at $x_t$ and observe $y_t$
end for
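The loop above translates almost line for line into code. A minimal sketch, assuming a finite candidate set `X`, a zero-mean GP with a squared-exponential kernel of known length scale, and naive $O(t^3)$ posterior inference at each step (the sequential Cholesky updates mentioned later bring this down); all names here are illustrative, not from the authors' code:

```python
import numpy as np

def rbf(A, B, ls=0.2):
    """Squared-exponential kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def gp_mi(f, X, T, eta=0.1, delta=0.05):
    """GP-MI on a finite candidate set X (an n x d array); f is the noisy oracle."""
    alpha = np.log(2.0 / delta)              # alpha = log(2/delta), as in the theorem
    prior_var = np.diag(rbf(X, X)).copy()
    gamma_hat = 0.0                          # running hat-gamma_t
    queried, ys = [], []
    for t in range(T):
        if not queried:
            mu, var = np.zeros(len(X)), prior_var.copy()
        else:
            Xq = np.array(queried)
            Kq = rbf(Xq, Xq) + eta**2 * np.eye(len(queried))
            ks = rbf(X, Xq)                  # n x t cross-covariances
            mu = ks @ np.linalg.solve(Kq, np.array(ys))
            var = prior_var - np.einsum("ij,ji->i", ks, np.linalg.solve(Kq, ks.T))
            var = np.maximum(var, 0.0)       # guard against round-off
        phi = np.sqrt(alpha) * (np.sqrt(var + gamma_hat) - np.sqrt(gamma_hat))
        i = int(np.argmax(mu + phi))         # GP-MI acquisition rule
        gamma_hat += var[i]
        queried.append(X[i])
        ys.append(f(X[i]))
    return np.array(queried), np.array(ys)
```

For instance, `gp_mi(lambda x: -((x - 0.5)**2).sum() + 0.05 * np.random.randn(), np.random.rand(200, 2), T=50)` runs fifty queries of a noisy toy quadratic over 200 random candidates.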

Page 9

Main Result – Regret bounds for GP-MI

For all $\delta > 0$ and $T > 1$, set $\alpha = \log\frac{2}{\delta}$. The cumulative regret $R_T$ incurred by the GP-MI algorithm on $f$, distributed as a GP perturbed by independent Gaussian noise with variance $\eta^2$, satisfies the following bound:

$$\Pr\left[R_T \le 5\sqrt{\alpha C \gamma_T} + 4\sqrt{\alpha}\right] \ge 1 - \delta\,,$$

where $C = \frac{2}{\log(1 + \eta^{-2})}$.
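As a concrete instantiation (an illustrative computation, not from the slides): choosing $\delta = 0.05$ gives $\alpha = \log 40 \approx 3.69$, so with probability at least $0.95$ the bound reads $R_T \le 5\sqrt{3.69\, C \gamma_T} + 7.7$, which is sublinear in $T$ whenever $\gamma_T$ is.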

Page 10

Application to Specific Kernels

- For the linear kernel: $R_T = O\big(\sqrt{d \log T}\big)$
- For the RBF kernel: $R_T = O\big(\sqrt{(\log T)^{d+1}}\big)$
- For the Matérn kernel: $R_T = O\big(\sqrt{T^a \log T}\big)$, where $a = \frac{d(d+1)}{2\nu + d(d+1)} < 1$ and $\nu$ is the Matérn parameter.

Page 11

High-Probability Bounds for $R_T$

Gaussian Martingale

The sequence $M_T = R_T - \sum_{t=1}^{T} \big(\mu_t(x^\star) - \mu_t(x_t)\big)$ is a Gaussian martingale with respect to $Y_{T-1}$.

Concentration Inequalities [Bercu and Touati 2008]

For all $\delta > 0$ and $T > 1$, with $y = 8(C\gamma_T + 1)$ we have:

$$\Pr\left[M_T \le \sqrt{2\alpha y} + \sqrt{\tfrac{2\alpha}{y}} \sum_{t=1}^{T} \sigma_t^2(x^\star)\right] \ge 1 - \delta$$

$$\Pr\left[R_T \le \sqrt{2\alpha y} + \sqrt{\tfrac{2\alpha}{y}} \sum_{t=1}^{T} \sigma_t^2(x^\star) + \sum_{t=1}^{T} \big(\mu_t(x^\star) - \mu_t(x_t)\big)\right] \ge 1 - \delta\,.$$

Page 12

Regret Bounds for the GP-MI Algorithm

Inequality for the GP-MI Algorithm

Using the function $\phi_t$ as defined in GP-MI, we have:

$$\sum_{t=1}^{T} \big(\mu_t(x^\star) - \mu_t(x_t)\big) \le \sum_{t=1}^{T} \big(\phi_t(x_t) - \phi_t(x^\star)\big) \le \sqrt{\alpha C \gamma_T} - \sqrt{\tfrac{2\alpha}{y}} \sum_{t=1}^{T} \sigma_t^2(x^\star)\,.$$

Regret bounds

Plugging this inequality into the previous concentration bound for $R_T$:

$$\Pr\left[R_T \le 5\sqrt{\alpha C \gamma_T} + 4\sqrt{\alpha}\right] \ge 1 - \delta\,.$$

Page 13

Empirical Average Regret

[Figure: average regret $R_T/T$ versus iteration for the EI, UCB, and MI strategies on (a) Generated GP ($d = 2$), (b) Generated GP ($d = 4$), (c) Gaussian mixture, and (d) Himmelblau]

Page 14

Empirical Average Regret

[Figure: average regret $R_T/T$ versus iteration for the EI, UCB, and MI strategies on (e) Branin, (f) Goldstein, (g) Tsunamis, and (h) Mackey-Glass]

Page 15

Implementation

Exact Inference for Gaussian likelihood

Numerical complexity in $O(T^2)$ using sequential Cholesky updates of the covariance matrix.
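A minimal sketch of such a sequential update, assuming a Gaussian likelihood and maintaining the lower Cholesky factor $L$ of $K_t + \eta^2 I$: appending one point costs $O(t^2)$ (one triangular solve) instead of the $O(t^3)$ of a full refactorization. Names and signature are illustrative:

```python
import numpy as np
from scipy.linalg import solve_triangular

def chol_append(L, k_new, k_self, eta):
    """Extend the lower Cholesky factor L of (K_t + eta^2 I) by one point.

    k_new  : length-t vector of covariances k(x_i, x_{t+1}) with the old points
    k_self : prior variance k(x_{t+1}, x_{t+1}) of the new point
    """
    t = L.shape[0]
    l = solve_triangular(L, k_new, lower=True)   # solves L l = k_new in O(t^2)
    l_star = np.sqrt(k_self + eta**2 - l @ l)    # new diagonal entry
    L_new = np.zeros((t + 1, t + 1))
    L_new[:t, :t] = L
    L_new[t, :t] = l
    L_new[t, t] = l_star
    return L_new
```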

Algorithms for non-Gaussian likelihood

For other likelihood functions (e.g. Laplace or Student's $t$), one can use the Expectation Propagation (EP) algorithm or Monte Carlo sampling.

Page 16

Open Questions and Discussion

- Theoretical guarantees for simple regret
- Empirical performance for simple regret
- Kernel learning procedure
- Calibration of $\delta$
