Gaussian processes: surrogate models for continuous black-box optimization
Transcript of slides (source: ai.ms.mff.cuni.cz/~sui/bajer18.pdf, April 2018)
Lukáš Bajer
MFF UK 04/2018
Lukáš Bajer – Gaussian processes: surrogates for cont. optimization
Contents
1 Optimization – Continuous optimization; Metaheuristics, black-box functions
2 Gaussian processes – Gaussian process prediction; Gaussian process covariance functions
3 Doubly trained Surrogate CMA-ES – CMA-ES; Doubly trained Surrogate CMA-ES; Experimental results
Optimization
optimization (minimization) is finding such x∗ ∈ Rn that

f(x∗) = min_{x ∈ Rn} f(x)
“near-optimal” solution is usually sufficient
Continuous white-box optimization
also known as numerical optimization methods
requirements:
  gradients ∇f(x) – can be approximated by finite differences
  and sometimes also Hessians ∇²f(x)
1 gradient descent (1st order)
2 Newton's method (2nd order)
3 quasi-Newton methods (2nd order, approximated)
4 trust-region, conjugate gradients
1st order: gradient descent
iterative steps in the direction of negative gradient
x(k+1) = x(k) − σ∇f (x(k))
σ – step size; usually changes every iteration, adapted, for example, using a line search along the gradient direction
[figure: gradient descent iterates x0–x4 approaching the optimum; source: (CC) Wikipedia]
pros:
  theoretically guaranteed to converge
  suitable even for large problems (deep NNs, . . . )
limitations:
  can be very slow, especially without momentum
  often ends up much sooner due to round-off errors
[figure source: (GNU) Wikipedia, author: P.A. Simionescu]
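The update rule above is easy to sketch in code; the quadratic test function, step size, and iteration count below are illustrative choices, not taken from the slides.

```python
import numpy as np

def gradient_descent(grad_f, x0, sigma=0.05, iters=200):
    """Iterate x <- x - sigma * grad_f(x) with a fixed step size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - sigma * grad_f(x)
    return x

# Example: f(x) = x1^2 + 10*x2^2, gradient (2*x1, 20*x2); minimum at the origin.
grad = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
x_star = gradient_descent(grad, [3.0, -2.0])
```

Note how the ill-conditioned second coordinate forces a small step size, which is exactly why plain gradient descent can be slow without momentum or a line search.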
2nd order: Newton’s method
take into account the second-order term of a Taylor expansion of f(x) around x(k):

f(x(k) + h) ≈ q(k)(h) = f(x(k)) + hᵀ∇f(k) + ½ hᵀ[∇²f(k)] h

the next iterate is then x(k+1) = x(k) + h(k), where h(k) minimizes q(k)(h):

x(k+1) = x(k) − γ [∇²f(k)]^(−1) ∇f(k)
pros:
  very fast convergence on quadratic-like functions
limitations:
  needs the Hessian matrix to be computed and inverted, thus rarely usable in practice
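A single Newton step can be written directly from the update rule; solving the linear system avoids forming the inverse explicitly. The quadratic example is mine, not from the slides.

```python
import numpy as np

def newton_step(grad, hess, x, gamma=1.0):
    """x_{k+1} = x_k - gamma * [Hessian]^{-1} grad(x_k); solve instead of inverting."""
    return x - gamma * np.linalg.solve(hess(x), grad(x))

# On the quadratic f(x) = 1/2 x^T A x - b^T x, one full Newton step
# lands exactly on the minimizer A^{-1} b from any starting point.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda x: A @ x - b
hess = lambda x: A
x1 = newton_step(grad, hess, np.array([5.0, 5.0]))
```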
Quasi-Newton methods
the Hessian matrix ∇²f(k) is not computed, only iteratively approximated by B(k), B(k+1), . . .
the Hessians' inverses are often calculated without explicit inversion
BFGS
  the most successful method of the last three decades
  independently discovered by 4 (!) people in 1970:
  C. G. Broyden, R. Fletcher, D. Goldfarb and D. Shanno
  the Hessian approximation is updated via rank-two updates
  works even without derivatives (with finite differences)
  shown to behave well on a variety of (even multimodal) functions
  L-BFGS – a popular memory-limited version (Nocedal, 1980)
  available in every optimization package (Matlab, Python, . . . )
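The rank-two update at the heart of BFGS can be sketched in its inverse form, so no matrix inversion is ever needed; this is the textbook formula, not code from the talk. On a quadratic, the updated matrix satisfies the secant condition H y = s exactly.

```python
import numpy as np

def bfgs_inverse_update(H, s, y):
    """BFGS rank-two update of the inverse-Hessian approximation H.

    s = x_{k+1} - x_k,  y = grad_{k+1} - grad_k,  rho = 1 / (y^T s):
    H_{k+1} = (I - rho s y^T) H (I - rho y s^T) + rho s s^T
    """
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

# On a quadratic with Hessian A the secant pair satisfies y = A s,
# and one update already maps y back to s (the secant condition).
A = np.array([[4.0, 1.0], [1.0, 3.0]])
s = np.array([1.0, -0.5])
y = A @ s
H = bfgs_inverse_update(np.eye(2), s, y)
```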
Other numerical optimization techniques
quadratic approximations: by far the most popular optimization technique

trust-region methods
  quadratic approximation around the current point x(k)
  minimize the model within the region of trust
  NEWUOA, BOBYQA (M. J. D. Powell, 2004, 2009)
  construct the quadratic model using far fewer points than (n + 1)(n + 2)/2 by additionally minimizing a norm
  that saves time and enhances performance

conjugate gradients
  do not approximate Hessians
  conjugate vectors – a momentum guiding the search
  a cheaper alternative to quasi-Newton methods
Optimization of black-box functions
black-box functions: x → f → f(x)
only evaluation of the function value is available, no derivatives or gradients → no gradient-based methods applicable
we consider continuous domain: x ∈ Rn
Optimization of empirical black-box functions
empirical function:
  the function value is assessed via an experiment (measuring, intensive calculation, evaluating a prototype)
  evaluating such functions is expensive (time and/or money)
  search cost ∼ the number of function evaluations
Metaheuristics
optimization techniques finding a sufficiently good solution
  treat the objective function as a black box
  sample a set of candidate solutions (the search space is often too large to be sampled completely)
  often nature-inspired:
    particle swarm optimization
    simulated annealing
    . . .
    evolutionary computation (EA, GA, ES, . . . )
EAs for empirical black-box optimization
what can help decrease the number of function evaluations:
  utilize already measured values (at least prevent measuring the same thing twice)
  learn the shape of the function landscape
  or learn the (global) gradient or step direction & size
[figure: schema of an evolutionary algorithm (selection, crossover, mutation, re-evaluation); source: (GNU) Wikipedia, author: Johann "nojhan" Dréo]
Model-based methods accelerating the convergence
several methods are used in order to decrease the number of objective function evaluations needed by EAs
1 Bayesian optimization (EGO)
2 Surrogate modelling
Bayesian optimization
Bayesian optimizer
Input: objective function f, the size of the initial sample d
x1, . . . , xd ← generate an initial sample
A ← {(xi, yi)}  /* initialize the archive */
for generation g = 1, 2, . . . until stopping conditions met do
    M ← generate the probabilistic model based on A
    x1, . . . ← choose next points x ∈ X according to CM(x)
    y1, . . . ← f(x1), . . .  /* evaluate the new point(s) */
    A ← A ∪ {(x1, y1), . . .}  /* update the archive */
suitable for very low budgets of f-evaluations (∼ 10 · D)
Gaussian processes are used in the criterion CM most often
existing algorithms: EGO (D. R. Jones, 1998), SPOT (T. Bartz-Beielstein, 2005), SMAC (F. Hutter, 2011), etc.
Surrogate modelling
a technique which builds an approximating model of the fitness-function landscape
the model provides a cheap and fast, but also inaccurate, replacement of the fitness function for part of the population
an inaccurate approximating model can deceive the optimizer
Gaussian Process
a GP is a stochastic approximation method based on Gaussian distributions

a GP can express the uncertainty of its prediction at a new point x: it gives a probability distribution of the output value
Gaussian Process
A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.
A Gaussian process is completely specified by its
  mean function m(x) = E[fGP(x)]
  covariance function cov(xi, xj) = cov(fGP(xi), fGP(xj))

and we write the Gaussian process as

f(x) ∼ GP(m(x), cov(x, x′)).
(Rasmussen, Williams, 2006)
Gaussian Process
given a set of N training points XN = (x1, . . . , xN)ᵀ, xi ∈ Rd, and measured values yN = (y1, . . . , yN)ᵀ of a function f being approximated,

yi = f(xi), i = 1, . . . , N
the GP considers the vector of these function values to be a sample from an N-variate Gaussian distribution
yN ∼ N(0,CN)
Gaussian Process prior distribution
Draws from the Gaussian process prior for three different covariance functions: KSE, KMatérn(ν=3/2), KMatérn(ν=5/2) (in that order), all of them with the parameters ℓ = 1 and σf² = 1, without noise.
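Prior draws like those in the figure can be generated by sampling from N(0, K) via a Cholesky factor; the grid, seed, and diagonal jitter below are illustrative choices.

```python
import numpy as np

def k_se(xi, xj, ell=1.0, sigma2_f=1.0):
    """Squared-exponential covariance with length scale ell and signal variance sigma2_f."""
    return sigma2_f * np.exp(-0.5 * (xi - xj) ** 2 / ell ** 2)

rng = np.random.default_rng(0)
x = np.linspace(-5.0, 5.0, 101)
K = k_se(x[:, None], x[None, :])                    # 101 x 101 prior covariance
# small jitter on the diagonal: SE Gram matrices are nearly singular
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(x)))
draws = L @ rng.standard_normal((len(x), 3))        # three prior sample paths
```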
Gaussian Process prediction (posterior)
Making predictions
Let CN+1 be the extended covariance matrix – extended by the entries belonging to an unseen point (x, y∗). Because yN is known, and the inverse C^(−1)_(N+1) can be expressed using the inverse of the training covariance C^(−1)_N, the density at the new point marginalizes to a 1D Gaussian density

p(y∗ | XN+1, yN) ∝ exp( −(y∗ − yN+1)² / (2 s²_(yN+1)) )

where the mean yN+1 and the variance s²_(yN+1) are easily expressible from C^(−1)_N and yN.
Gaussian Process prediction (posterior)
Graphs of Gaussian process predictions for N = 2, 3, 4 training data points. (+) – training set, thick line – mean prediction, thin lines – three draws from the GP posterior (without noise). Predictions y∗ and ±2s∗ are generated for 101 points; this is computationally stable, as the matrix inversion is needed only for the training covariance CN.
Gaussian Process covariance
The covariance matrix CN is determined by the covariance function cov(xi, xj), which is defined on pairs from the input space

(C)ij = cov(xi, xj),  xi, xj ∈ Rd

expressing the degree of correlation between the two points' values; typically a decreasing function of the two points' distance
[sketch: cov(xi, xj) decreasing from 1 as the distance d(xi, xj) grows]
Gaussian Process covariance
The most frequent covariance function is the squared exponential

(K)ij = covSE(xi, xj) = θ exp( −(1/(2ℓ²)) (xi − xj)ᵀ(xi − xj) )

with the parameters (usually fitted by MLE)
  θ – signal variance (scales the correlation)
  ℓ – characteristic length scale
Gaussian Process covariance
Another usual option in data-mining applications is the Matérn covariance, which is, for r = ‖xi − xj‖,

(K)ij = covMatérn,ν=5/2(r) = θ ( 1 + √5 r/ℓ + 5r²/(3ℓ²) ) exp( −√5 r/ℓ ),

with the parameters (same as for the squared exponential)
  θ – signal variance
  ℓ – characteristic length scale
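The formula translates directly into code; the parameter defaults below mirror the unit values used in the prior-draw figure and are otherwise arbitrary.

```python
import numpy as np

def k_matern52(xi, xj, ell=1.0, theta=1.0):
    """Matern nu=5/2 covariance for r = |xi - xj|
    (theta: signal variance, ell: characteristic length scale)."""
    r = np.abs(xi - xj)
    a = np.sqrt(5.0) * r / ell
    return theta * (1.0 + a + 5.0 * r ** 2 / (3.0 * ell ** 2)) * np.exp(-a)

# equals theta at r = 0 and decays monotonically with distance
k0 = k_matern52(0.0, 0.0)
```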
Gaussian Process covariance
source: (Rasmussen and Williams, 2006)
Stochastic search of Evolutionary algorithms
Stochastic black-box search
initialize distribution parameters θ
set population size λ ∈ N
while not terminate
  1 sample distribution P(x | θ) → x1, . . . , xλ ∈ Rn
  2 evaluate x1, . . . , xλ on f
  3 update parameters θ
(A. Auger, Tutorial CMA-ES, GECCO 2013)
the schema of most evolution strategies (and EDA algorithms), as well as of CMA-ES (Covariance Matrix Adaptation ES) – the current state of the art in continuous black-box optimization
The CMA-ES
Input: m ∈ Rn, σ ∈ R+, λ ∈ N
Initialize: C = I (and several other parameters)
Set the weights w1, . . . , wλ appropriately
while not terminate
  1 xi = m + σyi, yi ∼ N(0, C), for i = 1, . . . , λ    (sampling)
  2 m ← ∑_{i=1}^{µ} wi xi:λ = m + σyw, where yw = ∑_{i=1}^{µ} wi yi:λ    (update mean)
  3 update C
  4 update the step size σ
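The sampling and mean-update steps (1 and 2) can be sketched as follows; the covariance and step-size updates (3 and 4) are omitted, and the log-decreasing weights are a common choice rather than the exact CMA-ES defaults, so this is only a partial sketch.

```python
import numpy as np

def cma_sample_and_update_mean(f, m, sigma, C, lam=12, mu=3, rng=None):
    """Steps 1-2: sample lambda offspring from N(m, sigma^2 C) and
    recombine the mu best into the new mean (C, sigma updates omitted)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    B = np.linalg.cholesky(C)                       # C = B B^T
    ys = rng.standard_normal((lam, len(m))) @ B.T   # y_i ~ N(0, C)
    xs = m + sigma * ys
    order = np.argsort([f(x) for x in xs])          # rank by fitness
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    w /= w.sum()                                    # positive weights, sum to 1
    return w @ xs[order[:mu]]                       # m <- sum_i w_i x_{i:lambda}

# Even with fixed sigma and C = I, the mean walks toward the sphere optimum.
sphere = lambda x: float(x @ x)
rng = np.random.default_rng(1)
m = np.array([5.0, 5.0])
for _ in range(80):
    m = cma_sample_and_update_mean(sphere, m, sigma=0.5, C=np.eye(2), rng=rng)
```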
[figure: three generations of CMA-ES – sampling λ = 12 offspring xi from N(m, σ²C), selecting the µ = 3 best, and updating m, σ and C]
Covariance matrix adaptation
eigenvectors of the covariance matrix C are the principal components – the principal axes of the mutation ellipsoid
CMA-ES learns and updates a new Mahalanobis metric
it successively approximates the inverse Hessian on quadratic functions
– this transforms an ellipsoid function into the sphere function
– and it holds to some degree for other functions, too
source: (S. Finck, N. Hansen, R. Ros, and A. Auger, 2009)
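The relation between C's eigendecomposition, sampling, and the Mahalanobis "sphering" can be checked numerically; the matrix and sample size below are illustrative.

```python
import numpy as np

# C's eigenvectors b1, b2 are the principal axes of the mutation ellipsoid:
# with C = B diag(d^2) B^T, sampling N(0, C) is x = B (d * z), z ~ N(0, I),
# and the inverse map (the learned Mahalanobis metric) turns the
# ellipsoid back into a sphere.
C = np.array([[4.0, 1.5],
              [1.5, 1.0]])
d2, B = np.linalg.eigh(C)         # eigenvalues d2, orthonormal eigenvectors B
d = np.sqrt(d2)

rng = np.random.default_rng(0)
z = rng.standard_normal((2, 5000))
x = B @ (d[:, None] * z)          # samples from N(0, C)
sphered = (B.T @ x) / d[:, None]  # Mahalanobis transform: ~ N(0, I) again
```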
Is the CMA-ES the best for everything?
CMA-ES is a state-of-the-art optimization algorithm, especially for rugged and ill-conditioned objective functions
however, it is not the fastest if we can afford only very few objective function evaluations

what we have already seen: use a surrogate model!
however, original-evaluated solutions are available only along the search path
solution: construct local surrogate models
Doubly trained Surrogate CMA-ES
[diagram: one generation of DTS-CMA-ES – (1) 1st model training, (2) sampling from N(m, σ), (3) criterion ranking according to the 1st model, (4) fitness evaluation of a few chosen points, (5) 2nd model training, (6) mean prediction of the 2nd model for the rest of the population]
Doubly trained Surrogate CMA-ES
1 sample a new population of size λ (standard CMA-ES offspring),
2 train the first surrogate model on the original-evaluated points from the archive A,
3 select ⌈αλ⌉ point(s) w.r.t. a criterion C, which is based on the first model's prediction,
4 evaluate these point(s) with the original fitness,
5 re-train the surrogate model, also using these new point(s), and
6 predict the fitness for the non-original-evaluated points with this second model.
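The six steps can be sketched as one generation of code. To keep the sketch self-contained, a k-nearest-neighbor mean stands in for the Gaussian process surrogate, and the criterion is simply the predicted mean; both substitutions are mine, not the algorithm's actual components.

```python
import numpy as np

def knn_mean(X, y, x, k=3):
    """Stand-in surrogate: mean y over the k nearest archive points.
    (DTS-CMA-ES uses a Gaussian process here; k-NN keeps the sketch short.)"""
    idx = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    return float(np.mean(y[idx]))

def dts_generation(f, archive_X, archive_y, pop, alpha=0.25):
    """Steps 2-6 for an already-sampled population `pop` (step 1)."""
    # 2: "train" the first model on the archive (k-NN is lazy, nothing to fit)
    pred1 = np.array([knn_mean(archive_X, archive_y, x) for x in pop])
    # 3: select ceil(alpha * lambda) points by the criterion
    #    (here: lowest predicted mean)
    n_orig = int(np.ceil(alpha * len(pop)))
    chosen = np.argsort(pred1)[:n_orig]
    # 4: evaluate the chosen points with the original fitness
    y_new = np.array([f(pop[i]) for i in chosen])
    # 5: re-train the model with the new points added to the archive
    X2 = np.vstack([archive_X, pop[chosen]])
    y2 = np.concatenate([archive_y, y_new])
    # 6: predict the fitness of the remaining points with the second model
    fitness = np.array([knn_mean(X2, y2, x) for x in pop])
    fitness[chosen] = y_new  # chosen points keep their original fitness
    return fitness, X2, y2

sphere = lambda x: float(x @ x)
rng = np.random.default_rng(0)
arch_X = rng.standard_normal((20, 2))
arch_y = np.array([sphere(x) for x in arch_X])
pop = rng.standard_normal((12, 2))
fit, arch_X, arch_y = dts_generation(sphere, arch_X, arch_y, pop)
```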
Criteria for the selection of original-evaluated points
GP predictive mean
CM(x) = − y(x)
GP predictive standard deviation
CSTD(x) = s(x)
Criteria for the selection of original-evaluated points
Expected improvement (EI). ymin – the minimum fitness found so far

CEI(x) = E( (ymin − f(x)) · I(f(x) < ymin) | y1, . . . , yN ),  where

I(f(x) < ymin) = 1 for f(x) < ymin, and 0 for f(x) ≥ ymin

Probability of improvement (PoI). The probability of finding a lower fitness than some threshold T

CPoI(x, T) = P(f(x) ≤ T | y1, . . . , yN) = Φ( (T − y(x)) / s(x) )

where Φ is the CDF of N(0, 1), and T = ymin or a slightly higher value
Criteria for the selection of original-evaluated points
selected unimodal COCO functions f1,2, f8–14
[plots: median log10 distance to the optimum vs. number of evaluations / D, in 5-D and 20-D; compared criteria: GP predictive mean (M), GP predictive standard deviation (STD), Expected improvement (EI), Probability of improvement (PoI), Expected RDE (ERDE)]

multimodal COCO functions f3,4, f15–24

[plots: the same comparison in 5-D and 20-D]

The log10 of the median best f-value distances to the benchmarks' optima were scaled linearly to [−8, 0] for each COCO function.
GP model training
trainModel(A, Nmax, TSS, rAmax, K, σ(g), C(g), m(g))
    (XN, yN) ← select at most Nmax points from the archive A using TSS and rAmax
    XN ← transform the selected points into the (σ(g))²C(g) basis with the origin at m(g)
    yN ← standardize the f-values in yN to zero mean and unit variance
    (mµ, σf², ℓ, σn) ← fit the hyperparameters of µ(x) and K using ML estimation
Training set selection
1 TSS1 taking up to Nmax most recently evaluated points
2 TSS2 selecting the union of the k nearest neighbors ofevery point for which the fitness should be predicted,where k is maximal such that the total number of selectedpoints does not exceed Nmax,
3 TSS3 clustering the points in the input space into Nmaxclusters and taking the points nearest to clusters’ centroids
4 TSS4 selecting Nmax points which are closest to any pointin the current population.
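As an illustration, TSS2 can be sketched directly from its description; the data, dimensions, and Nmax below are made up, and ties and efficiency are ignored.

```python
import numpy as np

def tss2(archive_X, query_X, n_max):
    """TSS2: union of the k nearest archive neighbors of every query point,
    with k maximal such that the union contains at most n_max points."""
    # distance matrix (queries x archive) and per-query neighbor orderings
    dists = np.linalg.norm(archive_X[None, :, :] - query_X[:, None, :], axis=2)
    order = np.argsort(dists, axis=1)
    selected = np.array([], dtype=int)
    for k in range(1, len(archive_X) + 1):
        cand = np.unique(order[:, :k])   # union of k nearest of each query
        if len(cand) > n_max:
            break
        selected = cand
    return selected

rng = np.random.default_rng(0)
archive = rng.standard_normal((50, 2))
queries = rng.standard_normal((5, 2))
idx = tss2(archive, queries, n_max=15)
```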
GP model parameters
parameter                            considered values
training set selection method TSS    TSS1, TSS2, TSS3, TSS4
maximum distance rAmax               2√Qχ²(0.99, D), 4√Qχ²(0.99, D)
Nmax                                 10 · D, 15 · D, 20 · D
covariance function K                KSE, KMatérn(ν=3/2), KMatérn(ν=5/2)
Parameters of the GP surrogate models. The maximum distance rAmax is derived using the Mahalanobis distance given by the covariance matrix σ²C. Qχ²(0.99, D) is the 0.99-quantile of the χ²D distribution, and therefore √Qχ²(0.99, D) is the 0.99-quantile of the norm of a D-dimensional normally distributed random vector.
Gaussian process parameter settings – heatmap

[Figure: four heatmaps of ranking prediction error, one per dimension (2-D, 5-D, 10-D, 20-D); x-axis: COCO/BBOB functions, y-axis: parameter sets.]

Ranking prediction error. Solid horizontal lines separate the TSS methods TSS1–TSS4 (in that order); dashed lines separate sectors with the smaller and the larger maximum distance r^A_max. The three triples of settings within each sector represent increasing values of Nmax, with the three covariance functions K_SE, K_Matérn^(ν=3/2), and K_Matérn^(ν=5/2) within each triple.
Testing framework
Black-Box Optimization Benchmarking (BBOB), COmparing Continuous Optimisers (COCO)
24 artificial functions
different degrees of separability, conditioning, and modality, with or without a global structure
testing sets defined for dimensions 2, 3, 5, 10, 20 (and 40)
[Figure: surface plots of selected BBOB functions f1, f3, f4, f8, f12, f13, f14, f16, f17, f20, f22, f23, f24.]
Aggregated experimental results on BBOB

[Figure: four panels (2-D, 5-D, 10-D, 20-D); x-axis: number of evaluations / D (0–250), y-axis: Δf^log. Compared algorithms: S-CMA-ES, 0.05/2pop DTS-CMA-ES, adaptive DTS-CMA-ES, MA-ES, GPOP, SAPEO, CMA-ES, CMA-ES 2pop, BIPOP-s*ACMES-k, lmm-CMA, BOBYQA, SMAC, fmincon.]
Experimental results on BBOB (5-D)

[Figure: convergence plots of Δf^log vs. number of evaluations / D for f1 Sphere, f2 Ellipsoidal, f3 Rastrigin, and f4 Bueche-Rastrigin in 5-D; same compared algorithms as in the aggregated results.]
Experimental results on BBOB (5-D)

[Figure: convergence plots of Δf^log vs. number of evaluations / D for f5 Linear Slope, f6 Attractive Sector, f8 Rosenbrock (original), and f9 Rosenbrock (rotated) in 5-D.]
Experimental results on BBOB (5-D)

[Figure: convergence plots of Δf^log vs. number of evaluations / D for f13 Sharp Ridge, f15 Rastrigin (multi-modal), f17 Schaffers F7, and f18 Schaffers F7 (ill-conditioned) in 5-D.]
Experimental results on BBOB (5-D)

[Figure: convergence plots of Δf^log vs. number of evaluations / D for f19 Composite Griewank-Rosenbrock F8F2, f22 Gallagher's Gaussian 21-hi Peaks, f23 Katsuura, and f24 Lunacek bi-Rastrigin in 5-D.]
Experimental results on BBOB (20-D)

[Figure: convergence plots of Δf^log vs. number of evaluations / D for f1 Sphere, f2 Ellipsoidal, f3 Rastrigin, and f4 Bueche-Rastrigin in 20-D; same compared algorithms as in the aggregated results.]
Experimental results on BBOB (20-D)

[Figure: convergence plots of Δf^log vs. number of evaluations / D for f5 Linear Slope, f6 Attractive Sector, f8 Rosenbrock (original), and f9 Rosenbrock (rotated) in 20-D.]
Experimental results on BBOB (20-D)

[Figure: convergence plots of Δf^log vs. number of evaluations / D for f13 Sharp Ridge, f15 Rastrigin (multi-modal), f17 Schaffers F7, and f18 Schaffers F7 (ill-conditioned) in 20-D.]
Experimental results on BBOB (20-D)

[Figure: convergence plots of Δf^log vs. number of evaluations / D for f19 Composite Griewank-Rosenbrock F8F2, f22 Gallagher's Gaussian 21-hi Peaks, f23 Katsuura, and f24 Lunacek bi-Rastrigin in 20-D.]