
Page 1

Gaussian Process Optimization with Mutual Information

Emile Contal$^1$, Vianney Perchet$^2$, Nicolas Vayatis$^1$

$^1$ CMLA – École Normale Supérieure de Cachan & CNRS, France
$^2$ LPMA – Université Paris Diderot & CNRS, France

June 22, 2014

Page 2

Problem Statement

Sequential Optimization

Let $f : \mathcal{X} \to \mathbb{R}$, where $\mathcal{X} \subset \mathbb{R}^d$ is compact and convex. We consider the problem of finding the maximum of $f$, denoted by

$$f(x^\star) = \max_{x \in \mathcal{X}} f(x)\,,$$

via sequential queries $f(x_1), f(x_2), \dots$

Noisy Observations

At iteration $T$ we choose $x_{T+1}$ using the previous noisy observations $Y_T = \{y_1, \dots, y_T\}$, where for all $t \le T$:

$$y_t = f(x_t) + \epsilon_t \quad \text{and} \quad \epsilon_t \overset{\text{iid}}{\sim} \mathcal{N}(0, \eta^2)\,.$$

Page 3

Gaussian Processes

Definition

$f \sim \mathcal{GP}(m, k)$ with mean function $m : \mathcal{X} \to \mathbb{R}$ and kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}^+$, when for all $x_1, \dots, x_n \in \mathcal{X}$ we have:

$$\big(f(x_1), \dots, f(x_n)\big) \sim \mathcal{N}\big([m(x_i)]_i,\, [k(x_i, x_j)]_{i,j}\big)\,.$$

Bayesian Inference

Given $Y_T$, the posterior distribution $\Pr[f \mid Y_T]$ is a GP with mean $\mu_{T+1}$ (prediction) and variance $\sigma^2_{T+1}$ (uncertainty) computed by Bayesian inference.
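These posterior quantities have the standard Gaussian-likelihood closed form (a textbook identity, not spelled out on the slide); assuming a zero prior mean and writing $K_T = [k(x_i, x_j)]_{i,j \le T}$ and $k_T(x) = [k(x_i, x)]_{i \le T}$:

$$\mu_{T+1}(x) = k_T(x)^\top \big(K_T + \eta^2 I\big)^{-1} Y_T\,,$$
$$\sigma^2_{T+1}(x) = k(x, x) - k_T(x)^\top \big(K_T + \eta^2 I\big)^{-1} k_T(x)\,.$$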

Page 4

Gaussian Processes

[Figure: one-dimensional Gaussian process illustration]

Page 5

Objective

Cumulative Regret

The efficiency of a policy is measured via the cumulative regret:

$$R_T = \sum_{t=1}^{T} \big(f(x^\star) - f(x_t)\big)\,.$$

Goal

The cumulative regret is unknown in practice. Our aim is to obtain upper bounds on $R_T$ with high probability.

Page 6

Related Work

Dani et al. 2008

In the general linear optimization problem, there is a lower bound on the cumulative regret of $\Omega(d\sqrt{T})$.

Srinivas et al. 2012

For the GP-UCB algorithm with a linear GP, we have an upper bound on the cumulative regret of $O(d\sqrt{T})$ with high probability.

Page 7

Mutual Information – An Important Ingredient

Information Gain

The information gain on $f$ at $X_T$ is the mutual information between $f$ and $Y_T$. For a GP distribution, with $K_T$ the kernel matrix of $X_T$:

$$I_T(X_T) = \frac{1}{2} \log\det\big(I + \eta^{-2} K_T\big)\,.$$

We define $\gamma_T = \max_{|X|=T} I_T(X)$, the maximum information gain obtainable by a sequence of $T$ query points.

Empirical Lower Bound

For GPs with bounded variance, we have [Srinivas et al. 2012]:

$$\widehat{\gamma}_T = \sum_{t=1}^{T} \sigma_t^2(x_t) \le C\,\gamma_T \quad \text{where} \quad C = \frac{2}{\log(1 + \eta^{-2})}\,.$$
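Both quantities are direct to compute. A minimal Python sketch (the $T \times T$ kernel matrix `K` and the noise level `eta` are assumed given):

```python
import numpy as np

def information_gain(K, eta):
    """I_T(X_T) = 1/2 * log det(I + eta^-2 * K_T) for a T x T kernel matrix K."""
    T = K.shape[0]
    # slogdet is numerically safer than log(det(...)) for large T
    _, logdet = np.linalg.slogdet(np.eye(T) + K / eta**2)
    return 0.5 * logdet

def srinivas_constant(eta):
    """C = 2 / log(1 + eta^-2), relating hat-gamma_T to gamma_T."""
    return 2.0 / np.log1p(eta**-2)
```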

Page 8

GP-MI – A Novel Algorithm for Sequential Optimization

$\widehat{\gamma}_0 \leftarrow 0$
for $t = 1, 2, \dots$ do
    Compute $\mu_t$ and $\sigma_t^2$ using Bayesian inference
    $\phi_t(x) \leftarrow \sqrt{\alpha}\,\big(\sqrt{\sigma_t^2(x) + \widehat{\gamma}_{t-1}} - \sqrt{\widehat{\gamma}_{t-1}}\big)$
    $x_t \leftarrow \operatorname{argmax}_{x \in \mathcal{X}}\ \mu_t(x) + \phi_t(x)$
    $\widehat{\gamma}_t \leftarrow \widehat{\gamma}_{t-1} + \sigma_t^2(x_t)$
    Query at $x_t$ and observe $y_t$
end for
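The loop above translates almost line for line into code. A minimal sketch, assuming a finite candidate set `X`, a zero-mean GP with a squared-exponential kernel of known length scale, and naive $O(t^3)$ posterior inference at each step (the sequential Cholesky updates mentioned later bring this down); all names here are illustrative, not from the authors' code:

```python
import numpy as np

def rbf(A, B, ls=0.2):
    """Squared-exponential kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def gp_mi(f, X, T, eta=0.1, delta=0.05):
    """GP-MI on a finite candidate set X (an n x d array); f is the noisy oracle."""
    alpha = np.log(2.0 / delta)              # alpha = log(2/delta), as in the theorem
    prior_var = np.diag(rbf(X, X)).copy()
    gamma_hat = 0.0                          # running hat-gamma_t
    queried, ys = [], []
    for t in range(T):
        if not queried:
            mu, var = np.zeros(len(X)), prior_var.copy()
        else:
            Xq = np.array(queried)
            Kq = rbf(Xq, Xq) + eta**2 * np.eye(len(queried))
            ks = rbf(X, Xq)                  # n x t cross-covariances
            mu = ks @ np.linalg.solve(Kq, np.array(ys))
            var = prior_var - np.einsum("ij,ji->i", ks, np.linalg.solve(Kq, ks.T))
            var = np.maximum(var, 0.0)       # guard against round-off
        phi = np.sqrt(alpha) * (np.sqrt(var + gamma_hat) - np.sqrt(gamma_hat))
        i = int(np.argmax(mu + phi))         # GP-MI acquisition rule
        gamma_hat += var[i]
        queried.append(X[i])
        ys.append(f(X[i]))
    return np.array(queried), np.array(ys)
```

For instance, `gp_mi(lambda x: -((x - 0.5)**2).sum() + 0.05 * np.random.randn(), np.random.rand(200, 2), T=50)` runs fifty queries of a noisy toy quadratic over 200 random candidates.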

Page 9

Main Result – Regret bounds for GP-MI

For all $\delta > 0$ and $T > 1$, set $\alpha = \log\frac{2}{\delta}$. The cumulative regret $R_T$ incurred by the GP-MI algorithm on $f$, distributed as a GP perturbed by independent Gaussian noise with variance $\eta^2$, satisfies the following bound:

$$\Pr\left[R_T \le 5\sqrt{\alpha C \gamma_T} + 4\sqrt{\alpha}\right] \ge 1 - \delta\,,$$

where $C = \frac{2}{\log(1 + \eta^{-2})}$.
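As a concrete instantiation (an illustrative computation, not from the slides): choosing $\delta = 0.05$ gives $\alpha = \log 40 \approx 3.69$, so with probability at least $0.95$ the bound reads $R_T \le 5\sqrt{3.69\, C \gamma_T} + 7.7$, which is sublinear in $T$ whenever $\gamma_T$ is.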

Page 10

Application to Specific Kernels

- For the linear kernel: $R_T = O\big(\sqrt{d \log T}\big)$
- For the RBF kernel: $R_T = O\big(\sqrt{(\log T)^{d+1}}\big)$
- For the Matérn kernel: $R_T = O\big(\sqrt{T^a \log T}\big)$, where $a = \frac{d(d+1)}{2\nu + d(d+1)} < 1$ and $\nu$ is the Matérn parameter.

Page 11

High-Probability Bounds for $R_T$

Gaussian Martingale

The sequence $M_T = R_T - \sum_{t=1}^{T} \big(\mu_t(x^\star) - \mu_t(x_t)\big)$ is a Gaussian martingale with respect to $Y_{T-1}$.

Concentration Inequalities [Bercu and Touati 2008]

For all $\delta > 0$ and $T > 1$, with $y = 8(C\gamma_T + 1)$ we have:

$$\Pr\left[M_T \le \sqrt{2\alpha y} + \sqrt{\tfrac{2\alpha}{y}} \sum_{t=1}^{T} \sigma_t^2(x^\star)\right] \ge 1 - \delta$$

$$\Pr\left[R_T \le \sqrt{2\alpha y} + \sqrt{\tfrac{2\alpha}{y}} \sum_{t=1}^{T} \sigma_t^2(x^\star) + \sum_{t=1}^{T} \big(\mu_t(x^\star) - \mu_t(x_t)\big)\right] \ge 1 - \delta\,.$$

Page 12

Regret Bounds for the GP-MI Algorithm

Inequality for the GP-MI Algorithm

Using the function $\phi_t$ as defined in GP-MI, we have:

$$\sum_{t=1}^{T} \big(\mu_t(x^\star) - \mu_t(x_t)\big) \le \sum_{t=1}^{T} \big(\phi_t(x_t) - \phi_t(x^\star)\big) \le \sqrt{\alpha C \gamma_T} - \sqrt{\tfrac{2\alpha}{y}} \sum_{t=1}^{T} \sigma_t^2(x^\star)\,.$$

Regret bounds

Plugging this inequality into the previous concentration bound for $R_T$:

$$\Pr\left[R_T \le 5\sqrt{\alpha C \gamma_T} + 4\sqrt{\alpha}\right] \ge 1 - \delta\,.$$

Page 13

Empirical Average Regret

[Figure: average regret $R_T/T$ versus iteration for the EI, UCB, and MI strategies on (a) Generated GP ($d = 2$), (b) Generated GP ($d = 4$), (c) Gaussian mixture, and (d) Himmelblau]

Page 14

Empirical Average Regret

[Figure: average regret $R_T/T$ versus iteration for the EI, UCB, and MI strategies on (e) Branin, (f) Goldstein, (g) Tsunamis, and (h) Mackey-Glass]

Page 15

Implementation

Exact Inference for Gaussian likelihood

Numerical complexity in $O(T^2)$ using sequential Cholesky updates of the covariance matrix.
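A minimal sketch of such a sequential update, assuming a Gaussian likelihood and maintaining the lower Cholesky factor $L$ of $K_t + \eta^2 I$: appending one point costs $O(t^2)$ (one triangular solve) instead of the $O(t^3)$ of a full refactorization. Names and signature are illustrative:

```python
import numpy as np
from scipy.linalg import solve_triangular

def chol_append(L, k_new, k_self, eta):
    """Extend the lower Cholesky factor L of (K_t + eta^2 I) by one point.

    k_new  : length-t vector of covariances k(x_i, x_{t+1}) with the old points
    k_self : prior variance k(x_{t+1}, x_{t+1}) of the new point
    """
    t = L.shape[0]
    l = solve_triangular(L, k_new, lower=True)   # solves L l = k_new in O(t^2)
    l_star = np.sqrt(k_self + eta**2 - l @ l)    # new diagonal entry
    L_new = np.zeros((t + 1, t + 1))
    L_new[:t, :t] = L
    L_new[t, :t] = l
    L_new[t, t] = l_star
    return L_new
```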

Algorithms for non-Gaussian likelihood

For other likelihood functions (e.g. Laplace or Student's $t$), one can use the Expectation Propagation (EP) algorithm or Monte Carlo sampling.

Page 16

Open Questions and Discussion

- Theoretical guarantees for simple regret
- Empirical performance for simple regret
- Kernel learning procedure
- Calibration of $\delta$
