Information-Geometric Optimization Algorithms: A Unifying Picture via Invariance Principles
Ludovic Arnold, Anne Auger, Nikolaus Hansen, Yann Ollivier
Abstract
We present a canonical way to turn any smooth parametric family of probability distributions on an arbitrary search space X into a continuous-time black-box optimization method on X, the information-geometric optimization (IGO) method. Invariance as a major design principle keeps the number of arbitrary choices to a minimum. The resulting method conducts a natural gradient ascent using an adaptive, time-dependent transformation of the objective function, and makes no particular assumptions on the objective function to be optimized.
The IGO method produces explicit IGO algorithms through time discretization. The cross-entropy method is recovered in a particular case with a large time step, and can be extended into a smoothed, parametrization-independent maximum likelihood update.
When applied to specific families of distributions on discrete or continuous spaces, the IGO framework allows us to naturally recover versions of known algorithms. From the family of Gaussian distributions on ℝ^d, we arrive at a version of the well-known CMA-ES algorithm. From the family of Bernoulli distributions on {0, 1}^d, we recover the seminal PBIL algorithm. From the distributions of restricted Boltzmann machines, we naturally obtain a novel algorithm for discrete optimization on {0, 1}^d.
The IGO method achieves, thanks to its intrinsic formulation, maximal invariance properties: invariance under reparametrization of the search space X, under a change of parameters of the probability distribution, and under increasing transformation of the function to be optimized. The latter is achieved thanks to an adaptive formulation of the objective.
Theoretical considerations strongly suggest that IGO algorithms are characterized by a minimal change of the distribution. Therefore they have minimal loss in diversity through the course of optimization, provided the initial diversity is high. First experiments using restricted Boltzmann machines confirm this insight. As a simple consequence, IGO seems to provide, from information theory, an elegant way to spontaneously explore several valleys of a fitness landscape in a single run.
Contents

Introduction
1 Algorithm description
   1.1 The natural gradient on parameter space
   1.2 IGO: Information-geometric optimization
2 First properties of IGO
   2.1 Consistency of sampling
   2.2 Monotonicity: quantile improvement
   2.3 The IGO flow for exponential families
   2.4 Invariance properties
   2.5 Speed of the IGO flow
   2.6 Noisy objective function
   2.7 Implementation remarks
3 IGO, maximum likelihood and the cross-entropy method
4 CMA-ES, NES, EDAs and PBIL from the IGO framework
   4.1 IGO algorithm for Bernoulli measures and PBIL
   4.2 Multivariate normal distributions (Gaussians)
   4.3 Computing the IGO flow for some simple examples
5 Multimodal optimization using restricted Boltzmann machines
   5.1 IGO for restricted Boltzmann machines
   5.2 Experimental results
      5.2.1 Convergence
      5.2.2 Diversity
   5.3 Convergence to the continuous-time limit
6 Further discussion
Summary and conclusion
Introduction

In this article, we consider an objective function f : X → ℝ to be minimized over a search space X. No particular assumption on f is needed: it may be discrete or continuous, finite or infinite. We adopt a standard scenario where we consider f as a black box that delivers values f(x) for any desired input x ∈ X. The objective of black-box optimization is to find solutions x ∈ X with small value f(x), using the least number of calls to the black box. In this context, we design a stochastic optimization method from sound theoretical principles.
We assume that we are given a family of probability distributions P_θ on X depending on a continuous multicomponent parameter θ ∈ Θ. A basic example is to take X = ℝ^d and to consider the family of all Gaussian distributions P_θ on ℝ^d, with θ = (m, C) the mean and covariance matrix. Another simple example is X = {0, 1}^d equipped with the family of Bernoulli measures, i.e. θ = (θ_i)_{1⩽i⩽d} and

\[ P_\theta(x) = \prod_i \theta_i^{x_i} (1-\theta_i)^{1-x_i} \]

for x = (x_i) ∈ X. The parameters θ of the family P_θ belong to a space Θ assumed to be a smooth manifold.
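For concreteness, here is a minimal sketch of the Bernoulli family above (our own illustration, using numpy; the helper names are ours): evaluating P_θ(x) and sampling from the product measure.

```python
import numpy as np
from itertools import product

def bernoulli_pdf(theta, x):
    """P_theta(x) = prod_i theta_i^{x_i} (1 - theta_i)^{1 - x_i}."""
    theta, x = np.asarray(theta, float), np.asarray(x, float)
    return float(np.prod(theta ** x * (1.0 - theta) ** (1.0 - x)))

def bernoulli_sample(theta, n, rng):
    """Draw n points of {0,1}^d under the product Bernoulli measure."""
    return (rng.random((n, len(theta))) < np.asarray(theta)).astype(int)

theta = [0.2, 0.5, 0.9]
# The probabilities over all of {0,1}^3 sum to 1.
total = sum(bernoulli_pdf(theta, x) for x in product([0, 1], repeat=3))
assert abs(total - 1.0) < 1e-12
```

The Gaussian case θ = (m, C) is handled analogously, e.g. with numpy's multivariate normal sampler.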
From this setting, we build in a natural way an optimization method, the information-geometric optimization (IGO). At each time t, we maintain a value θ^t such that P_{θ^t} represents, loosely speaking, the current belief about where the smallest values of the function f may lie. Over time, P_{θ^t} evolves and is expected to concentrate around the minima of f. This general approach resembles the wide family of estimation of distribution algorithms (EDA) [LL02, BC95, PGL02]. However, we deviate somewhat from the common EDA reasoning, as explained in the following.

The IGO method takes the form of a gradient ascent on θ^t in the parameter space Θ. We follow the gradient of a suitable transformation of f, based on the P_{θ^t}-quantiles of f. The gradient used for θ is the natural gradient defined from the Fisher information metric [Rao45, Jef46, AN00], as is the case in various other optimization strategies, for instance so-called natural evolution strategies [WSPS08, SWSS09, GSS+10]. Thus, we extend the scope of optimization strategies based on this gradient to arbitrary search spaces.

The IGO method also has an equivalent description as an infinitesimal maximum likelihood update; this reveals a new property of the natural gradient. This also relates IGO to the cross-entropy method [dBKMR05] in some situations.

When we instantiate IGO using the family of Gaussian distributions on ℝ^d, we naturally obtain variants of the well-known covariance matrix adaptation evolution strategy (CMA-ES) [HO01, HK04, JA06] and of natural evolution strategies. With Bernoulli measures on the discrete cube {0, 1}^d, we recover the well-known population-based incremental learning (PBIL) [BC95, Bal94]; this derivation of PBIL as a natural gradient ascent appears to be new, and sheds some light on the common ground between continuous and discrete optimization.

From the IGO framework, it is immediate to build new optimization algorithms using more complex families of distributions than Gaussian or Bernoulli. As an illustration, distributions associated with restricted Boltzmann machines (RBMs) provide a new but natural algorithm for discrete optimization on {0, 1}^d, able to handle dependencies between the bits (see also [Ber02]). The probability distributions associated with RBMs are multimodal; combined with specific information-theoretic properties of IGO that guarantee minimal loss of diversity over time, this allows IGO to reach multiple optima at once very naturally, at least in a simple experimental setup.

Our method is built to achieve maximal invariance properties. First, it will be invariant under reparametrization of the family of distributions P_θ, that is, at least for infinitesimally small steps, it only depends on P_θ and not on the way we write the parameter θ. (For instance, for Gaussian measures it should not matter whether we use the covariance matrix or its inverse or a Cholesky factor as the parameter.) This limits the influence of encoding choices on the behavior of the algorithm. Second, it will be invariant under a change of coordinates in the search space X, provided that this change of coordinates globally preserves our family of distributions P_θ. (For Gaussian distributions on ℝ^d, this includes all affine changes of coordinates.) This means that the algorithm, apart from initialization, does not depend on the precise way the data is presented. Last, the algorithm will be invariant under applying an increasing function to f, so that it is indifferent whether we minimize e.g. f, f³ or f × |f|^{−2/3}. This way some non-convex or non-smooth functions can be as "easily" optimized as convex ones. Contrary to previous formulations using natural gradients [WSPS08, GSS+10, ANOK10],
this invariance under increasing transformation of the objective function is achieved from the start.
Invariance under X-reparametrization has been, we believe, one of the keys to the success of the CMA-ES algorithm, which derives from a particular case of ours. Invariance under θ-reparametrization is the main idea behind information geometry [AN00]. Invariance under f-transformation is not uncommon, e.g., for evolution strategies [Sch95] or pattern search methods [HJ61, Tor97, NM65]; however it is not always recognized as an attractive feature. Such invariance properties mean that we deal with intrinsic properties of the objects themselves, and not with the way we encode them as collections of numbers in ℝ^d. It also means, most importantly, that we make a minimal number of arbitrary choices.
In Section 1, we define the IGO flow and the IGO algorithm. We begin with standard facts about the definition and basic properties of the natural gradient, and its connection with Kullback–Leibler divergence and diversity. We then proceed to the detailed description of our algorithm.

In Section 2, we state some first mathematical properties of IGO. These include monotone improvement of the objective function, invariance properties, the form of IGO for exponential families of probability distributions, and the case of noisy objective functions.

In Section 3 we explain the theoretical relationships between IGO, maximum likelihood estimates and the cross-entropy method. In particular, IGO is uniquely characterized by a weighted log-likelihood maximization property.

In Section 4, we derive several well-known optimization algorithms from the IGO framework. These include PBIL, versions of CMA-ES and other Gaussian evolutionary algorithms such as EMNA. This also illustrates how larger step sizes yield algorithms that differ more and more from the continuous-time IGO flow.

In Section 5, we illustrate how IGO can be used to design new optimization algorithms. As a proof of concept, we derive the IGO algorithm associated with restricted Boltzmann machines for discrete optimization, allowing for multimodal optimization. We perform a preliminary experimental study of the specific influence of the Fisher information matrix on the performance of the algorithm and on the diversity of the optima obtained.

In Section 6, we discuss related work, and in particular, IGO's relationship with and differences from various other optimization algorithms such as natural evolution strategies or the cross-entropy method. We also sum up the main contributions of the paper and the design philosophy of IGO.
1 Algorithm description

We now present the outline of our algorithms. Each step is described in more detail in the sections below.
Our method can be seen as an estimation of distribution algorithm: at each time t, we maintain a probability distribution P_{θ^t} on the search space X, where θ^t ∈ Θ. The value of θ^t will evolve so that, over time, P_{θ^t} gives more weight to points x with better values of the function f(x) to optimize.

A straightforward way to proceed is to transfer f from x-space to θ-space: define a function F(θ) as the P_θ-average of f and then do a gradient descent for F(θ) in the space Θ [Ber00]. This way, θ will converge to a point such that P_θ yields a good average value of f. We depart from this approach in two ways:
• At each time, we replace f with an adaptive transformation of f representing how good or bad observed values of f are relative to other observations. This provides invariance under all monotone transformations of f.

• Instead of the vanilla gradient for θ, we use the so-called natural gradient given by the Fisher information matrix. This reflects the intrinsic geometry of the space of probability distributions, as introduced by Rao and Jeffreys [Rao45, Jef46] and later elaborated upon by Amari and others [AN00]. This provides invariance under reparametrization of θ and, importantly, minimizes the change of diversity of P_θ.

The algorithm is constructed in two steps: we first give an "ideal" version, namely, a version in which time t is continuous so that the evolution of θ^t is given by an ordinary differential equation in Θ. Second, the actual algorithm is a time discretization using a finite time step and Monte Carlo sampling instead of exact P_θ-averages.
1.1 The natural gradient on parameter space
About gradients and the shortest path uphill. Let g be a smooth function from ℝ^d to ℝ, to be maximized. We first present the interpretation of gradient ascent as "the shortest path uphill".

Let y ∈ ℝ^d. Define the vector z by

\[ z = \lim_{\varepsilon\to 0}\ \operatorname*{arg\,max}_{z,\ \|z\|\leqslant 1} g(y + \varepsilon z). \tag{1} \]

Then one can check that z is the normalized gradient of g at y: z_i = (∂g/∂y_i)/‖∂g/∂y‖. (This holds only at points y where the gradient of g does not vanish.)
This shows that, for small δt, the well-known gradient ascent of g given by

\[ y_i^{t+\delta t} = y_i^t + \delta t\, \frac{\partial g}{\partial y_i} \]

realizes the largest increase in the value of g, for a given step size ‖y^{t+δt} − y^t‖.

The relation (1) depends on the choice of a norm ‖·‖ (the gradient of g is given by ∂g/∂y_i only in an orthonormal basis). If we use, instead of the standard metric ‖y − y′‖ = √(Σ_i (y_i − y′_i)²) on ℝ^d, a metric ‖y − y′‖_A = √(Σ_{ij} A_{ij}(y_i − y′_i)(y_j − y′_j)) defined by a positive definite matrix A_{ij}, then the gradient of g with respect to this metric is given by Σ_j A⁻¹_{ij} ∂g/∂y_j. (This follows from the textbook definition of gradients by g(y + εz) = g(y) + ε⟨∇g, z⟩_A + O(ε²) with ⟨·,·⟩_A the scalar product associated with the matrix A_{ij} [Sch92].)
We can write the analogue of (1) using the A-norm. We get that the gradient ascent associated with metric A, given by

\[ y^{t+\delta t} = y^t + \delta t\, A^{-1}\, \frac{\partial g}{\partial y}, \]

for small δt, maximizes the increment of g for a given A-distance ‖y^{t+δt} − y^t‖_A: it realizes the steepest A-ascent. Maybe this viewpoint clarifies the relationship between gradient and metric: this steepest ascent property can actually be used as a definition of gradients.
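The steepest-A-ascent property can be checked numerically. The following sketch (our illustration, not from the paper) compares the gain in g from a vanilla-gradient step and from an A⁻¹∇g step when both are scaled to the same small A-norm:

```python
import numpy as np

def a_norm(v, A):
    """A-norm sqrt(v^T A v) for a positive definite A."""
    return float(np.sqrt(v @ A @ v))

def gain(direction, g, y, A, radius=1e-3):
    """Scale `direction` to A-norm `radius` and report the increase in g."""
    step = direction / a_norm(direction, A) * radius
    return g(y + step) - g(y)

A = np.array([[3.0, 1.0], [1.0, 2.0]])              # the metric
g = lambda y: float(y[0] + 2.0 * y[1] - 0.5 * y @ y)
y = np.zeros(2)
grad = np.array([1.0, 2.0])                          # vanilla gradient of g at y = 0

gain_vanilla = gain(grad, g, y, A)
gain_natural = gain(np.linalg.solve(A, grad), g, y, A)
# A^{-1} grad is the steepest-ascent direction in the A-metric.
assert gain_natural >= gain_vanilla > 0.0
```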
In our setting we want to use a gradient ascent in the parameter space Θ of our distributions P_θ. The metric ‖θ − θ′‖ = √(Σ_i (θ_i − θ′_i)²) clearly depends on the choice of parametrization θ, and thus is not intrinsic. Therefore, we use a metric depending on θ only through the distributions P_θ, as follows.
Fisher information and the natural gradient on parameter space. Let θ, θ′ ∈ Θ be two values of the distribution parameter. The Kullback–Leibler divergence between P_{θ′} and P_θ is defined [Kul97] as

\[ \mathrm{KL}(P_{\theta'}\,\|\,P_\theta) = \int_x \ln\frac{P_{\theta'}(x)}{P_\theta(x)}\; P_{\theta'}(\mathrm{d}x). \]
When θ′ = θ + δθ is close to θ, under mild smoothness assumptions we can expand the Kullback–Leibler divergence at second order in δθ. This expansion defines the Fisher information matrix I at θ [Kul97]:

\[ \mathrm{KL}(P_{\theta+\delta\theta}\,\|\,P_\theta) = \frac{1}{2}\sum_{ij} I_{ij}(\theta)\,\delta\theta_i\,\delta\theta_j + O(\delta\theta^3). \]
An equivalent definition of the Fisher information matrix is by the usual formulas [CT06]

\[ I_{ij}(\theta) = \int_x \frac{\partial \ln P_\theta(x)}{\partial \theta_i}\,\frac{\partial \ln P_\theta(x)}{\partial \theta_j}\; P_\theta(\mathrm{d}x) = -\int_x \frac{\partial^2 \ln P_\theta(x)}{\partial \theta_i\,\partial \theta_j}\; P_\theta(\mathrm{d}x). \]
The Fisher information matrix defines a (Riemannian) metric on Θ: the distance, in this metric, between two very close values of θ is given by the square root of twice the Kullback–Leibler divergence. Since the Kullback–Leibler divergence depends only on P_θ and not on the parametrization θ, this metric is intrinsic.
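For the Bernoulli family of the introduction, the outer-product formula above can be evaluated exactly by summing over the cube; the following sketch (ours, with numpy) recovers the classical diagonal Fisher matrix I_ii(θ) = 1/(θ_i(1 − θ_i)):

```python
import numpy as np
from itertools import product

def fisher_bernoulli(theta):
    """Fisher matrix of the product Bernoulli family, by exact summation
    of the score outer products over {0,1}^d."""
    theta = np.asarray(theta, float)
    d = len(theta)
    I = np.zeros((d, d))
    for x in product([0, 1], repeat=d):
        x = np.asarray(x, float)
        p = np.prod(theta ** x * (1 - theta) ** (1 - x))       # P_theta(x)
        score = x / theta - (1 - x) / (1 - theta)              # d ln P / d theta
        I += p * np.outer(score, score)
    return I

theta = np.array([0.3, 0.6])
I = fisher_bernoulli(theta)
# Independence of the components makes the off-diagonal terms vanish.
assert np.allclose(I, np.diag(1.0 / (theta * (1.0 - theta))))
```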
If f : Θ → ℝ is a smooth function on the parameter space, its natural gradient at θ is defined in accordance with the Fisher metric as

\[ (\widetilde\nabla_\theta f)_i = \sum_j I^{-1}_{ij}(\theta)\, \frac{\partial f(\theta)}{\partial \theta_j}, \]

or more synthetically

\[ \widetilde\nabla_\theta f = I^{-1}\, \frac{\partial f}{\partial \theta}. \]
From now on, we will use ∇̃_θ to denote the natural gradient and ∂/∂θ to denote the vanilla gradient.

By construction, the natural gradient descent is intrinsic: it does not depend on the chosen parametrization θ of P_θ, so that it makes sense to speak of the natural gradient ascent of a function f(P_θ).
Given that the Fisher metric comes from the Kullback–Leibler divergence, the "shortest path uphill" property of gradients mentioned above translates as follows (see also [Ama98, Theorem 1]):
Proposition 1. The natural gradient ascent points in the direction δθ achieving the largest change of the objective function, for a given distance between P_θ and P_{θ+δθ} in Kullback–Leibler divergence. More precisely, let f be a smooth function on the parameter space Θ. Let θ ∈ Θ be a point where ∇̃f(θ) does not vanish. Then

\[ \frac{\widetilde\nabla f(\theta)}{\|\widetilde\nabla f(\theta)\|} = \lim_{\varepsilon\to 0}\frac{1}{\varepsilon}\ \operatorname*{arg\,max}_{\substack{\delta\theta \text{ such that}\\ \mathrm{KL}(P_{\theta+\delta\theta}\,\|\,P_\theta)\leqslant \varepsilon^2/2}} f(\theta+\delta\theta). \]
Here we have implicitly assumed that the parameter space Θ is non-degenerate and proper (that is, no two points θ ∈ Θ define the same probability distribution, and the mapping P_θ ↦ θ is continuous).
Why use the Fisher metric gradient for optimization? Relationship to diversity. The first reason for using the natural gradient is its reparametrization invariance, which makes it the only gradient available in a general abstract setting [AN00]. Practically, this invariance also limits the influence of encoding choices on the behavior of the algorithm. More prosaically, the Fisher matrix can also be seen as an adaptive learning rate for the different components of the parameter vector θ: components θ_i with a high impact on P_θ will be updated more cautiously.
Another advantage comes from the relationship with Kullback–Leibler distance in view of the "shortest path uphill" (see also [Ama98]). To minimize the value of some function g(θ) defined on the parameter space Θ, the naive approach follows a gradient descent for g using the "vanilla" gradient

\[ \theta_i^{t+\delta t} = \theta_i^t + \delta t\, \frac{\partial g}{\partial \theta_i} \]

and, as explained above, this maximizes the increment of g for a given increment ‖θ^{t+δt} − θ^t‖. On the other hand, the Fisher gradient

\[ \theta_i^{t+\delta t} = \theta_i^t + \delta t\, I^{-1}\frac{\partial g}{\partial \theta_i} \]

maximizes the increment of g for a given Kullback–Leibler distance KL(P_{θ^{t+δt}} ‖ P_{θ^t}).

In particular, if we choose an initial value θ⁰ such that P_{θ⁰} covers a wide portion of the space X uniformly, the Kullback–Leibler divergence between P_{θ^t} and P_{θ⁰} measures the loss of diversity of P_{θ^t}. So the natural gradient descent is a way to optimize the function g with minimal loss of diversity, provided the initial diversity is large. On the other hand the vanilla gradient descent optimizes g with minimal change in the numerical values of the parameter θ, which is of little interest.
So arguably this method realizes the best trade-off between optimization and loss of diversity. (Though, as can be seen from the detailed algorithm description below, maximization of diversity occurs only greedily at each step, and so there is no guarantee that after a given time, IGO will provide the highest possible diversity for a given objective function value.)

An experimental confirmation of the positive influence of the Fisher matrix on diversity is given in Section 5 below. This may also provide a theoretical explanation of the good performance of CMA-ES.
1.2 IGO: Information-geometric optimization
Quantile rewriting of f. Our original problem is to minimize a function f : X → ℝ. A simple way to turn f into a function on Θ is to use the expected value −E_{P_θ} f [Ber00, WSPS08], but expected values can be unduly influenced by extreme values and using them is rather unstable [Whi89]; moreover −E_{P_θ} f is not invariant under increasing transformation of f (this invariance implies we can only compare f-values, not add them).

Instead, we take an adaptive, quantile-based approach by first replacing the function f with a monotone rewriting W_θ^f and then following the gradient of E_{P_θ} W_θ^f. A suitable choice of W_θ^f allows us to control the range of the resulting values and achieves the desired invariance. Because the rewriting W_θ^f depends on θ, it might be viewed as an adaptive f-transformation.

The goal is that if f(x) is "small" then W_θ^f(x) ∈ ℝ is "large" and vice versa, and that W_θ^f remains invariant under monotone transformations of f. The meaning of "small" or "large" depends on θ ∈ Θ and is taken with respect to typical values of f under the current distribution P_θ. This is
measured by the P_θ-quantile in which the value of f(x) lies. We write the lower and upper P_θ-f-quantiles of x ∈ X as

\[ q^-_\theta(x) = \Pr_{x'\sim P_\theta}\bigl(f(x') < f(x)\bigr), \qquad q^+_\theta(x) = \Pr_{x'\sim P_\theta}\bigl(f(x') \leqslant f(x)\bigr). \tag{2} \]

These quantile functions reflect the probability to sample a better value than f(x). They are monotone in f (if f(x₁) ⩽ f(x₂) then q^±_θ(x₁) ⩽ q^±_θ(x₂)) and invariant under increasing transformations of f.

Given q ∈ [0; 1], we now choose a non-increasing function w : [0; 1] → ℝ (fixed once and for all). A typical choice for w is w(q) = 1_{q⩽q₀} for some fixed value q₀, the selection quantile. The transform W_θ^f(x) is defined as a function of the P_θ-f-quantile of x as

\[ W_\theta^f(x) = \begin{cases} w\bigl(q^+_\theta(x)\bigr) & \text{if } q^+_\theta(x) = q^-_\theta(x),\\[6pt] \dfrac{1}{q^+_\theta(x) - q^-_\theta(x)} \displaystyle\int_{q=q^-_\theta(x)}^{q=q^+_\theta(x)} w(q)\,\mathrm{d}q & \text{otherwise.} \end{cases} \tag{3} \]
As desired, the definition of W_θ^f is invariant under an increasing transformation of f. For instance, the P_θ-median of f gets remapped to w(1/2).
Note that E_{P_θ} W_θ^f = ∫₀¹ w is independent of θ and f: indeed, by definition, the quantile of a random point under P_θ is uniformly distributed in [0; 1]. In the following, our objective will be to maximize the expected value of W_{θ^t}^f over θ, that is, to maximize

\[ \mathbb{E}_{P_\theta} W_{\theta^t}^f = \int W_{\theta^t}^f(x)\; P_\theta(\mathrm{d}x) \tag{4} \]

over θ, where θ^t is fixed at a given step but will adapt over time.

Importantly, W_θ^f(x) can be estimated in practice: indeed, the quantiles Pr_{x'∼P_θ}(f(x') < f(x)) can be estimated by taking samples of P_θ and ordering the samples according to the value of f (see below). The estimate remains invariant under increasing f-transformations.
The IGO gradient flow. At the most abstract level, IGO is a continuous-time gradient flow in the parameter space Θ, which we define now. In practice, discrete time steps (a.k.a. iterations) are used, and P_θ-integrals are approximated through sampling, as described in the next section.
Let θ^t be the current value of the parameter at time t, and let δt ≪ 1. We define θ^{t+δt} in such a way as to increase the P_θ-weight of points where f is small, while not going too far from P_{θ^t} in Kullback–Leibler divergence. We use the adaptive weights W_{θ^t}^f as a way to measure which points have large or small values. In accordance with (4), this suggests taking the gradient ascent

\[ \theta^{t+\delta t} = \theta^t + \delta t\, \widetilde\nabla_\theta \int W_{\theta^t}^f(x)\; P_\theta(\mathrm{d}x) \tag{5} \]
where the natural gradient is suggested by Proposition 1.

Note again that we use W_{θ^t}^f and not W_θ^f in the integral. So the gradient ∇̃_θ does not act on the adaptive objective W_{θ^t}^f. If we used W_θ^f instead, we would face a paradox: right after a move, previously good points do not seem so good any more since the distribution has improved. More precisely, ∫ W_θ^f(x) P_θ(dx) is constant and always equal to the average weight ∫₀¹ w, and so the gradient would always vanish.

Using the log-likelihood trick ∇̃P_θ = P_θ ∇̃ ln P_θ (assuming P_θ is smooth), we get an equivalent expression of the update above as an integral under the current distribution P_{θ^t}; this is important for practical implementation. This leads to the following definition.
Definition 2 (IGO flow). The IGO flow is the set of continuous-time trajectories in the space Θ, defined by the differential equation

\begin{align}
\frac{\mathrm{d}\theta^t}{\mathrm{d}t} &= \widetilde\nabla_\theta \int W_{\theta^t}^f(x)\; P_\theta(\mathrm{d}x) \tag{6}\\
&= \int W_{\theta^t}^f(x)\, \widetilde\nabla_\theta \ln P_\theta(x)\; P_{\theta^t}(\mathrm{d}x) \tag{7}\\
&= I^{-1}(\theta^t) \int W_{\theta^t}^f(x)\, \frac{\partial \ln P_\theta(x)}{\partial \theta}\; P_{\theta^t}(\mathrm{d}x), \tag{8}
\end{align}

where the gradients are taken at the point θ = θ^t, and I is the Fisher information matrix.
Natural evolution strategies (NES, [WSPS08, GSS+10, SWSS09]) feature a related gradient descent with f(x) instead of W_{θ^t}^f(x). The associated flow would read

\[ \frac{\mathrm{d}\theta^t}{\mathrm{d}t} = -\widetilde\nabla_\theta \int f(x)\; P_\theta(\mathrm{d}x), \tag{9} \]

where the gradient is taken at θ^t (in the sequel, when not explicitly stated, gradients in θ are taken at θ = θ^t). However, in the end NESs always implement algorithms using sample quantiles, as if derived from the gradient ascent of W_{θ^t}^f(x).

The update (7) is a weighted average of "intrinsic moves" increasing the
log-likelihood of some points. We can slightly rearrange the update as
\[ \frac{\mathrm{d}\theta^t}{\mathrm{d}t} = \int \underbrace{W_{\theta^t}^f(x)}_{\text{preference weight}}\; \underbrace{\widetilde\nabla_\theta \ln P_\theta(x)}_{\text{intrinsic move to reinforce } x}\; \underbrace{P_{\theta^t}(\mathrm{d}x)}_{\text{current sample distribution}} \tag{10} \]

\[ \phantom{\frac{\mathrm{d}\theta^t}{\mathrm{d}t}} = \widetilde\nabla_\theta \underbrace{\int W_{\theta^t}^f(x)\, \ln P_\theta(x)\; P_{\theta^t}(\mathrm{d}x)}_{\text{weighted log-likelihood}}, \tag{11} \]
which provides an interpretation for the IGO gradient flow as a gradient ascent optimization of the weighted log-likelihood of the "good points" of the current distribution. In a precise sense, IGO is in fact the "best" way to increase this log-likelihood (Theorem 13).
For exponential families of probability distributions, we will see later that the IGO flow rewrites as a nice derivative-free expression (18).
The IGO algorithm. Time discretization and sampling. The above is a mathematically well-defined continuous-time flow in the parameter space. Its practical implementation involves three approximations depending on two parameters N and δt:
• the integral under P_{θ^t} is approximated using N samples taken from P_{θ^t};

• the value W_{θ^t}^f is approximated for each sample taken from P_{θ^t};

• the time derivative dθ^t/dt is approximated by a δt time increment.
We also assume that the Fisher information matrix I(θ) and ∂ ln P_θ(x)/∂θ can be computed (see the discussion below if I(θ) is unknown).

At each step, we pick N samples x₁, …, x_N under P_{θ^t}. To approximate the quantiles, we rank the samples according to the value of f. Define rk(x_i) = #{j, f(x_j) < f(x_i)} and let the estimated weight of sample x_i be

\[ \widehat{w}_i = \frac{1}{N}\, w\!\left(\frac{\mathrm{rk}(x_i) + 1/2}{N}\right), \tag{12} \]
using the weighting scheme function w introduced above. (This is assuming there are no ties in our sample; in case several sample points have the same value of f, we define ŵ_i by averaging the above over all possible rankings of the ties¹.)
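For illustration, the weights ŵ_i, with ties averaged as in footnote 1, can be computed as follows (a sketch we add, for the truncation scheme w(u) = 1_{u⩽q₀}; the helper name is ours):

```python
import numpy as np

def igo_weights(fvals, q0=0.5):
    """Estimated weights (12) for w(u) = 1_{u <= q0}, with ties averaged:
    w_hat_i = (1/(rk+ - rk-)) * integral of w(u) over [rk-/N, rk+/N]."""
    f = np.asarray(fvals, float)
    N = len(f)
    w_hat = np.empty(N)
    for i in range(N):
        rk_minus = np.sum(f < f[i])           # #{j : f(x_j) < f(x_i)}
        rk_plus = np.sum(f <= f[i])           # #{j : f(x_j) <= f(x_i)}
        a, b = rk_minus / N, rk_plus / N
        integral = max(0.0, min(b, q0) - a)   # integral of 1_{u<=q0} on [a, b]
        w_hat[i] = integral / (rk_plus - rk_minus)
    return w_hat

f = [3.0, 1.0, 2.0, 5.0]              # the best point is the one with f = 1.0
w = igo_weights(f, q0=0.5)
assert abs(w.sum() - 0.5) < 1e-12     # the weights always sum to the integral of w
assert w[1] == max(w)                 # the best sample gets the largest weight
```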
Then we can approximate the IGO flow as follows.
Definition 3 (IGO algorithm). The IGO algorithm with sample size N and step size δt is the following update rule for the parameter θ^t. At each step, N sample points x₁, …, x_N are picked according to the distribution P_{θ^t}. The parameter is updated according to

\begin{align}
\theta^{t+\delta t} &= \theta^t + \delta t \sum_{i=1}^{N} \widehat{w}_i\, \widetilde\nabla_\theta \ln P_\theta(x_i)\Bigr|_{\theta=\theta^t} \tag{13}\\
&= \theta^t + \delta t\, I^{-1}(\theta^t) \sum_{i=1}^{N} \widehat{w}_i\, \frac{\partial \ln P_\theta(x_i)}{\partial \theta}\Bigr|_{\theta=\theta^t} \tag{14}
\end{align}
where ŵ_i is the weight (12) obtained from the ranked values of the objective function f.
Equivalently one can fix the weights w_i = (1/N) w((i − 1/2)/N) once and for all and rewrite the update as

\[ \theta^{t+\delta t} = \theta^t + \delta t\, I^{-1}(\theta^t) \sum_{i=1}^{N} w_i\, \frac{\partial \ln P_\theta(x_{i:N})}{\partial \theta}\Bigr|_{\theta=\theta^t} \tag{15} \]

where x_{i:N} denotes the i-th sampled point ranked according to f, i.e. f(x_{1:N}) < … < f(x_{N:N}) (assuming again there are no ties). Note that {x_{i:N}} = {x_i} and {w_i} = {ŵ_i}.
As will be discussed in Section 4, this update applied to multivariate normal distributions or Bernoulli measures allows us to neatly recover versions of some well-established algorithms, in particular CMA-ES and PBIL. Actually, in the Gaussian context updates of the form (14) have already been introduced [GSS+10, ANOK10], though not formally derived from a continuous-time flow with quantiles.
When N → ∞, the IGO algorithm using samples approximates the continuous-time IGO gradient flow, see Theorem 4 below. Indeed, the IGO algorithm, with N = ∞, is simply the Euler approximation scheme for the ordinary differential equation defining the IGO flow (6). The latter result thus provides a sound mathematical basis for currently used rank-based updates.
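To make the discretized update concrete, here is a sketch of one IGO step (14) for Bernoulli measures (our own illustration, anticipating Section 4.1: for this family the natural gradient of ln P_θ at a sample x reduces to x − θ, giving a PBIL-like rule). Ties are handled crudely by giving tied points the weight of their lowest rank, and the clipping of θ is our own safeguard, not part of the IGO definition.

```python
import numpy as np

def igo_bernoulli_step(theta, f, N=100, dt=0.1, q0=0.2, rng=None):
    """One IGO update (14) for product Bernoulli measures:
    theta <- theta + dt * sum_i w_hat_i * (x_i - theta)."""
    if rng is None:
        rng = np.random.default_rng()
    X = (rng.random((N, len(theta))) < theta).astype(float)
    fvals = np.array([f(x) for x in X])
    ranks = np.array([np.sum(fvals < fv) for fv in fvals])  # lowest rank of ties
    w_hat = ((ranks + 0.5) / N <= q0) / N                   # weights (12)
    return theta + dt * (w_hat[:, None] * (X - theta)).sum(axis=0)

rng = np.random.default_rng(1)
f = lambda x: float(np.sum(1.0 - x))        # minimized at x = (1, ..., 1)
theta = np.full(5, 0.5)
for _ in range(300):
    theta = np.clip(igo_bernoulli_step(theta, f, rng=rng), 0.05, 0.95)
assert theta.mean() > 0.8                   # mass concentrates on good points
```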
IGO flow versus IGO algorithms. The IGO flow (6) is a well-defined continuous-time set of trajectories in the space of probability distributions P_θ, depending only on the objective function f and the chosen family of distributions. It does not depend on the chosen parametrization for θ (Proposition 8).
On the other hand, there are several IGO algorithms associated with this flow. Each IGO algorithm approximates the IGO flow in a slightly different
¹A mathematically neater but less intuitive version would be

\[ \widehat{w}_i = \frac{1}{\mathrm{rk}^+(x_i) - \mathrm{rk}^-(x_i)} \int_{u=\mathrm{rk}^-(x_i)/N}^{u=\mathrm{rk}^+(x_i)/N} w(u)\,\mathrm{d}u \]

with rk⁻(x_i) = #{j, f(x_j) < f(x_i)} and rk⁺(x_i) = #{j, f(x_j) ⩽ f(x_i)}.
way. An IGO algorithm depends on three further choices: a sample size N, a time discretization step size δt, and a choice of parametrization for θ in which to implement (14).

If δt is small enough, and N large enough, the influence of the parametrization θ disappears and all IGO algorithms are approximations of the "ideal" IGO flow trajectory. However, the larger δt, the poorer the approximation gets.
So for large δt, different IGO algorithms for the same IGO flow may exhibit different behaviors. We will see an instance of this phenomenon for Gaussian distributions: both CMA-ES and the maximum likelihood update (EMNA) can be seen as IGO algorithms, but the latter with δt = 1 is known to exhibit premature loss of diversity (Section 4.2).
Still, two IGO algorithms for the same IGO flow will differ less from each other than from a non-IGO algorithm: at each step the difference is only O(δt²) (Section 2.4). On the other hand, for instance, the difference between an IGO algorithm and the vanilla gradient ascent is, generally, not smaller than O(δt) at each step, i.e. roughly as big as the steps themselves.
Unknown Fisher matrix. The algorithm presented so far assumes that the Fisher matrix I(θ) is known as a function of θ. This is the case for Gaussian distributions in CMA-ES and for Bernoulli distributions. However, for restricted Boltzmann machines as considered below, no analytical form is known. Yet, provided the quantity ∂ ln P_θ(x)/∂θ can be computed or approximated, it is possible to approximate the integral

\[ I_{ij}(\theta) = \int_x \frac{\partial \ln P_\theta(x)}{\partial \theta_i}\,\frac{\partial \ln P_\theta(x)}{\partial \theta_j}\; P_\theta(\mathrm{d}x) \]
using P_θ-Monte Carlo samples for x. These samples may or may not be the same as those used in the IGO update (14): in particular, it is possible to use as many Monte Carlo samples as necessary to approximate I_{ij}, at no additional cost in terms of the number of calls to the black-box function f to optimize.
Note that each Monte Carlo sample x will contribute the term (∂ ln P_θ(x)/∂θ_i)(∂ ln P_θ(x)/∂θ_j) to the Fisher matrix approximation. This is a rank-1 matrix. So, for the approximated Fisher matrix to be invertible, the number of (distinct) samples x needs to be at least equal to the number of components of the parameter θ, i.e. N_Fisher ⩾ dim Θ.
For exponential families of distributions, the IGO update has a particular form (18) which simplifies this matter somewhat. More details are given below (see Section 5) for the concrete situation of restricted Boltzmann machines.
2 First properties of IGO
2.1 Consistency of sampling
The first property to check is that when N → ∞, the update rule using N samples converges to the IGO update rule. This is not a straightforward application of the law of large numbers, because the estimated weights ŵ_i depend (non-continuously) on the whole sample x₁, …, x_N, and not only on x_i.
Theorem 4 (Consistency). When N → ∞, the N-sample IGO update rule (14):

\[ \theta^{t+\delta t} = \theta^t + \delta t\, I^{-1}(\theta^t) \sum_{i=1}^{N} \widehat{w}_i\, \frac{\partial \ln P_\theta(x_i)}{\partial \theta}\Bigr|_{\theta=\theta^t} \]

converges with probability 1 to the update rule (5):

\[ \theta^{t+\delta t} = \theta^t + \delta t\, \widetilde\nabla_\theta \int W_{\theta^t}^f(x)\; P_\theta(\mathrm{d}x). \]
The proof is given in the Appendix, under mild regularity assumptions. In particular we do not require that w be continuous.
This theorem may clarify previous claims [WSPS08, ANOK10] where rank-based updates similar to (5), such as in NES or CMA-ES, were derived from optimizing the expected value −E_{P_θ} f. The rank-based weights ŵ_i were then introduced somewhat arbitrarily. Theorem 4 shows that, for large N, CMA-ES and NES actually follow the gradient flow of the quantity E_{P_θ} W_{θ^t}^f: the update can be rigorously derived from optimizing the expected value of the quantile-rewriting W_{θ^t}^f.
2.2 Monotonicity: quantile improvement
Gradient descents come with a guarantee that the fitness value decreases over time. Here, since we work with probability distributions on $X$, we need to define the fitness of the distribution $P_{\theta^t}$. An obvious choice is the expectation $\mathbb{E}_{P_{\theta^t}} f$, but it is not invariant under $f$-transformation and moreover may be sensitive to extreme values.
It turns out that the monotonicity properties of the IGO gradient flow depend on the choice of the weighting scheme $w$. For instance, if $w(u) = \mathbb{1}_{u \leqslant 1/2}$, then the median of $f$ improves over time.
Proposition 5 (Quantile improvement). Consider the IGO flow given by (6), with the weight $w(u) = \mathbb{1}_{u \leqslant q}$ where $0 < q < 1$ is fixed. Then the value of the $q$-quantile of $f$ improves over time: if $t_1 \leqslant t_2$ then either $\theta^{t_1} = \theta^{t_2}$ or
$$Q^q_{P_{\theta^{t_2}}}(f) < Q^q_{P_{\theta^{t_1}}}(f).$$
Here the $q$-quantile $Q^q_P(f)$ of $f$ under a probability distribution $P$ is defined as any number $m$ such that $\Pr_{x \sim P}(f(x) \leqslant m) \geqslant q$ and $\Pr_{x \sim P}(f(x) \geqslant m) \geqslant 1 - q$.
The proof is given in the Appendix, together with the necessary regularity assumptions.
Of course this holds only for the IGO gradient flow (6) with $N = \infty$ and $\delta t \to 0$. For an IGO algorithm with finite $N$, the dynamics is random and one cannot expect monotonicity. Still, Theorem 4 ensures that, with high probability, trajectories of a large enough finite population dynamics stay close to the infinite-population limit trajectory.
2.3 The IGO flow for exponential families
The expressions for the IGO update simplify somewhat if the family $P_\theta$ happens to be an exponential family of probability distributions (see also [MMS08]). Suppose that $P_\theta$ can be written as
$$P_\theta(x) = \frac{1}{Z(\theta)} \exp\Bigl(\sum_i \theta_i T_i(x)\Bigr) H(\mathrm{d}x)$$
where $T_1, \ldots, T_k$ is a finite family of functions on $X$, $H(\mathrm{d}x)$ is an arbitrary reference measure on $X$, and $Z(\theta)$ is the normalization constant. It is well-known [AN00, (2.33)] that
$$\frac{\partial \ln P_\theta(x)}{\partial \theta_i} = T_i(x) - \mathbb{E}_{P_\theta} T_i \qquad (16)$$
so that [AN00, (3.59)]
$$I_{ij}(\theta) = \mathrm{Cov}_{P_\theta}(T_i, T_j). \qquad (17)$$
In the end we find:
Proposition 6. Let $P_\theta$ be an exponential family parametrized by the natural parameters $\theta$ as above. Then the IGO flow is given by
$$\frac{\mathrm{d}\theta}{\mathrm{d}t} = \mathrm{Cov}_{P_\theta}(T, T)^{-1}\, \mathrm{Cov}_{P_\theta}(T, W^f_\theta) \qquad (18)$$
where $\mathrm{Cov}_{P_\theta}(T, W^f_\theta)$ denotes the vector $(\mathrm{Cov}_{P_\theta}(T_i, W^f_\theta))_i$, and $\mathrm{Cov}_{P_\theta}(T, T)$ the matrix $(\mathrm{Cov}_{P_\theta}(T_i, T_j))_{ij}$.
Note that the right-hand side does not involve derivatives w.r.t. $\theta$ any more. This result makes it easy to simulate the IGO flow using e.g. a Gibbs sampler for $P_\theta$: both covariances in (18) may be approximated by sampling, so that neither the Fisher matrix nor the gradient term need to be known in advance, and no derivatives are involved.
The CMA-ES uses the family of all Gaussian distributions on $\mathbb{R}^d$. Then the family $T_i$ is the family of all linear and quadratic functions of the coordinates on $\mathbb{R}^d$. The expression above is then a particularly concise rewriting of a CMA-ES update, see also Section 4.2.
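The remark that both covariances in (18) can be estimated purely by sampling can be sketched on a toy exponential family, independent Bernoulli variables in their natural (logit) parametrization with $T_i(x) = x_i$. Everything below (the fitness, the rank-based weights, the step size) is an illustrative assumption, not part of the text's algorithms; one Euler step of the flow is taken.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 4, 2000
theta = np.zeros(d)                      # natural (logit) parameters

def sample(theta, n, rng):
    p = 1.0 / (1.0 + np.exp(-theta))     # mean parameters of the Bernoullis
    return (rng.random((n, d)) < p).astype(float)

f = lambda x: x.sum(axis=1)              # illustrative fitness (to be maximized)

x = sample(theta, n, rng)
T = x                                    # sufficient statistics T_i(x) = x_i
# crude rank-based preference: weight 2 on the better half, 0 on the rest
w = np.where(f(x) >= np.median(f(x)), 2.0, 0.0)

cov_TT = np.cov(T, rowvar=False)                                # Cov(T_i, T_j)
cov_TW = ((T - T.mean(0)) * (w - w.mean())[:, None]).mean(0)    # Cov(T_i, W)
dtheta = np.linalg.solve(cov_TT, cov_TW)        # right-hand side of (18)

theta_new = theta + 0.1 * dtheta                # one Euler step of the flow
```

No Fisher matrix and no derivative of $\ln P_\theta$ ever appear: only samples and the statistics $T_i$.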
Moreover, the expected values $\bar{T}_i = \mathbb{E}\, T_i$ of $T_i$ satisfy the simple evolution equation under the IGO flow
$$\frac{\mathrm{d}\bar{T}_i}{\mathrm{d}t} = \mathrm{Cov}(T_i, W^f_\theta) = \mathbb{E}(T_i\, W^f_\theta) - \bar{T}_i\, \mathbb{E}\, W^f_\theta. \qquad (19)$$
The proof is given in the Appendix, in the proof of Theorem 15.
The variables $\bar{T}_i$ can sometimes be used as an alternative parametrization for an exponential family (e.g. for a one-dimensional Gaussian, these are the mean $\mu$ and the second moment $\mu^2 + \sigma^2$). Then the IGO flow (7) may be rewritten using the relation $\widetilde{\nabla}_{\theta_i} = \frac{\partial}{\partial \bar{T}_i}$ for the natural gradient of exponential families (Appendix, Proposition 22), which sometimes results in simpler expressions. We shall further exploit this fact in Section 3.
Exponential families with latent variables. Similar formulas hold when the distribution $P_\theta(x)$ is the marginal of an exponential distribution $P_\theta(x, h)$ over a "hidden" or "latent" variable $h$, such as the restricted Boltzmann machines of Section 5.
Namely, with $P_\theta(x) = \frac{1}{Z(\theta)} \int_h \exp\bigl(\sum_i \theta_i T_i(x, h)\bigr)\, H(\mathrm{d}x, \mathrm{d}h)$ the Fisher matrix is
$$I_{ij}(\theta) = \mathrm{Cov}_{P_\theta}(U_i, U_j) \qquad (20)$$
where $U_i(x) = \mathbb{E}_{P_\theta}(T_i(x, h) \mid x)$. Consequently, the IGO flow takes the form
$$\frac{\mathrm{d}\theta}{\mathrm{d}t} = \mathrm{Cov}_{P_\theta}(U, U)^{-1}\, \mathrm{Cov}_{P_\theta}(U, W^f_\theta). \qquad (21)$$
2.4 Invariance properties
Here we formally state the invariance properties of the IGO flow under various reparametrizations. Since these results follow from the very construction of the algorithm, the proofs are omitted.
Proposition 7 ($f$-invariance). Let $\varphi : \mathbb{R} \to \mathbb{R}$ be an increasing function. Then the trajectories of the IGO flow when optimizing the functions $f$ and $\varphi(f)$ are the same.
The same is true for the discretized algorithm with population size $N$ and step size $\delta t > 0$.
Proposition 8 ($\theta$-invariance). Let $\theta' = \varphi(\theta)$ be a one-to-one function of $\theta$ and let $P'_{\theta'} = P_{\varphi^{-1}(\theta')}$. Let $\theta^t$ be the trajectory of the IGO flow when optimizing a function $f$ using the distributions $P_\theta$, initialized at $\theta^0$. Then the IGO flow trajectory $(\theta')^t$ obtained from the optimization of the function $f$ using the distributions $P'_{\theta'}$, initialized at $(\theta')^0 = \varphi(\theta^0)$, is the same, namely $(\theta')^t = \varphi(\theta^t)$.
For the algorithm with finite $N$ and $\delta t > 0$, invariance under $\theta$-reparametrization is only true approximately, in the limit when $\delta t \to 0$. As mentioned above, the IGO update (14), with $N = \infty$, is simply the Euler approximation scheme for the ordinary differential equation (6) defining the IGO flow. At each step, the Euler scheme is known to make an error $O(\delta t^2)$ with respect to the true flow. This error actually depends on the parametrization of $\theta$.
So the IGO updates for different parametrizations coincide at first order in $\delta t$, and may, in general, differ by $O(\delta t^2)$. For instance the difference between the CMA-ES and xNES updates is indeed $O(\delta t^2)$, see Section 4.2.
For comparison, using the vanilla gradient results in a divergence of $O(\delta t)$ at each step between different parametrizations. So the divergence could be of the same magnitude as the steps themselves.
In that sense, one can say that IGO algorithms are "more parametrization-invariant" than other algorithms. This stems from their origin as a discretization of the IGO flow.
The next proposition states that, for example, if one uses a family of distributions on $\mathbb{R}^d$ which is invariant under affine transformations, then our algorithm optimizes equally well a function and its image under any affine transformation (up to an obvious change in the initialization). This proposition generalizes the well-known corresponding property of CMA-ES [HO01].
Here, as usual, the image of a probability distribution $P$ by a transformation $\varphi$ is defined as the probability distribution $P'$ such that $P'(Y) = P(\varphi^{-1}(Y))$ for any subset $Y \subset X$. In the continuous domain, the density of the new distribution $P'$ is obtained by the usual change of variable formula involving the Jacobian of $\varphi$.
Proposition 9 ($X$-invariance). Let $\varphi : X \to X$ be a one-to-one transformation of the search space, and assume that $\varphi$ globally preserves the family of measures $P_\theta$. Let $\theta^t$ be the IGO flow trajectory for the optimization of function $f$, initialized at $P_{\theta^0}$. Let $(\theta')^t$ be the IGO flow trajectory for optimization of $f \circ \varphi^{-1}$, initialized at the image of $P_{\theta^0}$ by $\varphi$. Then $P_{(\theta')^t}$ is the image of $P_{\theta^t}$ by $\varphi$.
For the discretized algorithm with population size $N$ and step size $\delta t > 0$, the same is true up to an error of $O(\delta t^2)$ per iteration. This error disappears if the map $\varphi$ acts on $\Theta$ in an affine way.
The latter case of affine transforms is well exemplified by CMA-ES: here, using the variance and mean as the parametrization of Gaussians, the new mean and variance after an affine transform of the search space are an affine function of the old mean and variance; specifically, for the affine transformation $A : x \mapsto Ax + b$ we have $(m, C) \mapsto (Am + b,\, ACA^{\mathrm{T}})$.
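This transformation rule can be checked numerically on samples; the matrices and vectors below are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(2)
m = np.array([1.0, -2.0])
C = np.array([[2.0, 0.5], [0.5, 1.0]])
A = np.array([[0.0, 1.0], [-1.0, 3.0]])
b = np.array([0.5, 0.5])

x = rng.multivariate_normal(m, C, size=200_000)
y = x @ A.T + b                          # image of each sample under phi(x) = A x + b

m_img, C_img = y.mean(0), np.cov(y, rowvar=False)
# The Gaussian parameters transform affinely: (m, C) -> (A m + b, A C A^T)
m_pred, C_pred = A @ m + b, A @ C @ A.T
```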
2.5 Speed of the IGO flow
Proposition 10. The speed of the IGO flow, i.e. the norm of $\frac{\mathrm{d}\theta^t}{\mathrm{d}t}$ in the Fisher metric, is at most $\sqrt{\int_0^1 w^2 - \bigl(\int_0^1 w\bigr)^2}$ where $w$ is the weighting scheme.
This speed can be tested in practice in at least two ways. The first is just to compute the Fisher norm of the increment $\theta^{t+\delta t} - \theta^t$ using the Fisher matrix; for small $\delta t$ this is close to $\delta t\, \bigl\|\frac{\mathrm{d}\theta}{\mathrm{d}t}\bigr\|$ with $\|\cdot\|$ the Fisher metric. The second is as follows: since the Fisher metric coincides with the Kullback–Leibler divergence up to a factor $1/2$, we have $\mathrm{KL}(P_{\theta^{t+\delta t}}\,\|\, P_{\theta^t}) \approx \frac{1}{2}\, \delta t^2\, \bigl\|\frac{\mathrm{d}\theta}{\mathrm{d}t}\bigr\|^2$ at least for small $\delta t$. Since it is relatively easy to estimate $\mathrm{KL}(P_{\theta^{t+\delta t}}\,\|\, P_{\theta^t})$ by comparing the new and old log-likelihoods of points in a Monte Carlo sample, one can obtain an estimate of $\bigl\|\frac{\mathrm{d}\theta}{\mathrm{d}t}\bigr\|$.
Corollary 11. Consider an IGO algorithm with weighting scheme $w$, step size $\delta t$ and sample size $N$. Then, for small $\delta t$ and large $N$ we have
$$\mathrm{KL}(P_{\theta^{t+\delta t}}\,\|\,P_{\theta^t}) \leqslant \tfrac{1}{2}\, \delta t^2\, \mathrm{Var}_{[0,1]} w + O(\delta t^3) + O(1/\sqrt{N}).$$
For instance, with $w(q) = \mathbb{1}_{q \leqslant q_0}$ and neglecting the error terms, an IGO algorithm introduces at most $\tfrac{1}{2}\, \delta t^2\, q_0(1 - q_0)$ bits of information (in base $e$) per iteration into the probability distribution $P_\theta$.
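For the truncation weight $w(u) = \mathbb{1}_{u \leqslant q_0}$, the variance term $\mathrm{Var}_{[0,1]} w = q_0(1-q_0)$ can be checked by simple quadrature; the grid size, $q_0$, and $\delta t$ below are arbitrary illustrative choices.

```python
import numpy as np

q0 = 0.3
u = (np.arange(100_000) + 0.5) / 100_000     # midpoint grid on [0, 1]
w = (u <= q0).astype(float)                  # truncation weight w(u) = 1_{u <= q0}

int_w  = w.mean()            # integral of w over [0, 1]
int_w2 = (w ** 2).mean()     # integral of w^2
var_w  = int_w2 - int_w ** 2                 # Var_{[0,1]} w = q0 (1 - q0)
speed_bound = np.sqrt(var_w)                 # bound of Proposition 10

dt = 0.5
kl_bound = 0.5 * dt ** 2 * var_w             # per-iteration KL bound, error terms neglected
```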
Thus, the time discretization parameter $\delta t$ is not just an arbitrary variable: it has an intrinsic interpretation related to a number of bits introduced at each step of the algorithm. This kind of relationship suggests, more generally, to use the Kullback–Leibler divergence as an external and objective way to measure learning rates in those optimization algorithms which use probability distributions.
The result above is only an upper bound. Maximal speed can be achieved only if all "good" points point in the same direction. If the various good points in the sample suggest moves in inconsistent directions, then the IGO update will be much smaller. The latter may be a sign that the signal is noisy, or that the family of distributions $P_\theta$ is not well suited to the problem at hand and should be enriched.
As an example, using a family of Gaussian distributions with unknown mean and fixed identity variance on $\mathbb{R}^d$, one checks that for the optimization of a linear function on $\mathbb{R}^d$, with the weight $w(u) = -\mathbb{1}_{u > 1/2} + \mathbb{1}_{u < 1/2}$, the IGO flow moves at constant speed $1/\sqrt{2\pi} \approx 0.4$, whatever the dimension $d$. On a rapidly varying sinusoidal function, the moving speed will be much slower because there are "good" and "bad" points in all directions.
This may suggest ways to design the weighting scheme $w$ to achieve maximal speed in some instances. Indeed, looking at the proof of the proposition, which involves a Cauchy–Schwarz inequality, one can see that the maximal speed is achieved only if there is a linear relationship between the weights $W^f_\theta(x)$ and the gradient $\widetilde{\nabla}_\theta \ln P_\theta(x)$. For instance, for the optimization of a linear function on $\mathbb{R}^d$ using Gaussian measures of known variance, the maximal speed will be achieved when the weighting scheme $w(u)$ is the inverse of the Gaussian cumulative distribution function. (In particular, $w(u)$ tends to $+\infty$ when $u \to 0$ and to $-\infty$ when $u \to 1$.) This is in accordance with previously known results: the expected value of the $i$-th order statistic of $N$ standard Gaussian variates is the optimal $\widehat{w}_i$ value in evolution strategies [Bey01, Arn06]. For $N \to \infty$, this order statistic converges to the inverse Gaussian cumulative distribution function.
2.6 Noisy objective function
Suppose that the objective function $f$ is non-deterministic: each time we ask for the value of $f$ at a point $x \in X$, we get a random result. In this setting we may write the random value $f(x)$ as $f(x) = \tilde{f}(x, \omega)$ where $\omega$ is an unseen random parameter, and $\tilde{f}$ is a deterministic function of $x$ and $\omega$. Without loss of generality, up to a change of variables we can assume that $\omega$ is uniformly distributed in $[0, 1]$.
We can still use the IGO algorithm without modification in this context. One might wonder which properties (consistency of sampling, etc.) still apply when $f$ is not deterministic. Actually, IGO algorithms for noisy functions fit very nicely into the IGO framework: the following proposition allows one to transfer any property of IGO to the case of noisy functions.
Proposition 12 (Noisy IGO). Let $f$ be a random function of $x \in X$, namely, $f(x) = \tilde{f}(x, \omega)$ where $\omega$ is a random variable uniformly distributed in $[0, 1]$, and $\tilde{f}$ is a deterministic function of $x$ and $\omega$. Then the two following algorithms coincide:
• The IGO algorithm (13), using a family of distributions $P_\theta$ on space $X$, applied to the noisy function $f$, and where the samples are ranked according to the random observed value of $f$ (here we assume that, for each sample, the noise $\omega$ is independent from everything else);
• The IGO algorithm on space $X \times [0, 1]$, using the family of distributions $\tilde{P}_\theta = P_\theta \otimes U_{[0,1]}$, applied to the deterministic function $\tilde{f}$. Here $U_{[0,1]}$ denotes the uniform law on $[0, 1]$.
The (easy) proof is given in the Appendix.
This proposition states that noisy optimization is the same as ordinary optimization using a family of distributions which cannot operate any selection or convergence over the parameter $\omega$. More generally, any component of the search space in which a distribution-based evolutionary strategy cannot perform selection or specialization will effectively act as a random noise on the objective function.
As a consequence of this result, all properties of IGO can be transferred to the noisy case. Consider, for instance, consistency of sampling (Theorem 4). The $N$-sample IGO update rule for the noisy case is identical to the non-noisy case (14):
$$\theta^{t+\delta t} = \theta^t + \delta t\, I^{-1}(\theta^t) \sum_{i=1}^{N} \widehat{w}_i \left.\frac{\partial \ln P_\theta(x_i)}{\partial \theta}\right|_{\theta = \theta^t}$$
where each weight $\widehat{w}_i$ computed from (12) now incorporates noise from the objective function because the rank of $x_i$ is computed on the random function, or equivalently on the deterministic function $\tilde{f}$: $\mathrm{rk}(x_i) = \#\{j,\; \tilde{f}(x_j, \omega_j) < \tilde{f}(x_i, \omega_i)\}$.
Consistency of sampling (Theorem 4) thus takes the following form: when $N \to \infty$, the $N$-sample IGO update rule on the noisy function $f$ converges with probability $1$ to the update rule
$$\theta^{t+\delta t} = \theta^t + \delta t\, \widetilde{\nabla}_\theta \int_0^1 \!\! \int W^{\tilde{f}}_{\theta^t}(x, \omega)\, P_\theta(\mathrm{d}x)\, \mathrm{d}\omega
= \theta^t + \delta t\, \widetilde{\nabla}_\theta \int \overline{W}^f_{\theta^t}(x)\, P_\theta(\mathrm{d}x) \qquad (22)$$
where $\overline{W}^f_\theta(x) = \mathbb{E}_\omega\, W^{\tilde{f}}_\theta(x, \omega)$. This entails, in particular, that when $N \to \infty$, the noise disappears asymptotically, as could be expected.
Consequently, the IGO flow in the noisy case should be defined by the $\delta t \to 0$ limit of the update (22) using $\overline{W}$. Note that the quantiles $q^\pm(x)$ defined by (2) still make sense in the noisy case, and are deterministic functions of $x$; thus $W^f_\theta(x)$ can also be defined by (3) and is deterministic. However, unless the weighting scheme $w(q)$ is affine, $\overline{W}^f_\theta(x)$ is different from $W^f_\theta(x)$ in general. Thus, unless $w$ is affine the flows defined by $W$ and $\overline{W}$ do not coincide in the noisy case. The flow using $W$ would be the $N \to \infty$ limit of a slightly more complex algorithm using several evaluations of $f$ for each sample $x_i$ in order to compute noise-free ranks.
2.7 Implementation remarks
Influence of the weighting scheme $w$. The weighting scheme $w$ directly affects the update rule (15).
A natural choice is $w(u) = \mathbb{1}_{u \leqslant q}$. This, as we have proved, results in an improvement of the $q$-quantile over the course of optimization. Taking $q = 1/2$ springs to mind; however, this is not selective enough, and both theory and experiments confirm that for the Gaussian case (CMA-ES), most efficient optimization requires $q < 1/2$ (see Section 4.2). The optimal $q$ is about $0.27$ if $N$ is not larger than the search space dimension $d$ [Bey01] and even smaller otherwise.
Second, replacing $w$ with $w + c$ for some constant $c$ clearly has no influence on the IGO continuous-time flow (5), since the gradient will cancel out the constant. However, this is not the case for the update rule (15) with a finite sample of size $N$.
Indeed, adding a constant $c$ to $w$ adds a quantity $c\, \frac{1}{N} \sum \widetilde{\nabla}_\theta \ln P_\theta(x_i)$ to the update. Since we know that the $P_\theta$-expected value of $\widetilde{\nabla}_\theta \ln P_\theta$ is $0$ (because $\int (\widetilde{\nabla}_\theta \ln P_\theta)\, P_\theta = \widetilde{\nabla}_\theta \int P_\theta = \widetilde{\nabla}_\theta\, 1 = 0$), we have $\mathbb{E}\, \frac{1}{N} \sum \widetilde{\nabla}_\theta \ln P_\theta(x_i) = 0$. So adding a constant to $w$ does not change the expected value of the update, but it may change e.g. its variance. The empirical average of $\widetilde{\nabla}_\theta \ln P_\theta(x_i)$ in the sample will be $O(1/\sqrt{N})$. So translating the weights results in a $O(1/\sqrt{N})$ change in the update. See also Section 4 in [SWSS09].
Determining an optimal value for $c$ to reduce the variance of the update is difficult, though: the optimal value actually depends on possible correlations between $\widetilde{\nabla}_\theta \ln P_\theta$ and the function $f$. The only general result is that one should shift $w$ so that $0$ lies within its range. Assuming independence, or dependence with enough symmetry, the optimal shift is when the weights average to $0$.
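The effect of translating the weights can be seen in a small simulation. In the sketch below the weights are applied independently of the samples (the pure-noise, "enough symmetry" situation mentioned above); the family, weights and constants are illustrative assumptions. Both weight vectors give an update term with expectation $0$, but the version shifted to sum to $0$ has lower variance.

```python
import numpy as np

rng = np.random.default_rng(3)
p, n, trials = 0.5, 10, 40_000
w = np.array([0.2] * 5 + [0.0] * 5)    # hypothetical selection weights, sum = 1
w0 = w - w.mean()                      # same weights shifted to sum to 0

def update_sum(weights, rng):
    """sum_i w_i * d ln P_p(x_i) / dp for one set of Bernoulli(p) samples."""
    x = (rng.random(n) < p).astype(float)
    score = x / p - (1 - x) / (1 - p)  # has expectation 0 under P_p
    return np.dot(weights, score)

u  = np.array([update_sum(w,  rng) for _ in range(trials)])
u0 = np.array([update_sum(w0, rng) for _ in range(trials)])
# Same expectation (both close to 0), but the variances differ.
```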
Adaptive learning rate. Comparing consecutive updates to evaluate a learning rate or step size is an effective measure. For example, in back-propagation, the update sign has been used to adapt the learning rate of each single weight in an artificial neural network [SA90]. In CMA-ES, a step size is adapted depending on whether recent steps tended to move in a consistent direction or to backtrack. This is measured by considering the changes of the mean $m$ of the Gaussian distribution. For a probability distribution $P_\theta$ on an arbitrary search space, in general no notion of mean may be defined. However, it is still possible to define "backtracking" in the evolution of $\theta$ as follows.
Consider two successive updates $\delta\theta^t = \theta^t - \theta^{t-\delta t}$ and $\delta\theta^{t+\delta t} = \theta^{t+\delta t} - \theta^t$. Their scalar product in the Fisher metric $I(\theta^t)$ is
$$\langle \delta\theta^t, \delta\theta^{t+\delta t} \rangle = \sum_{ij} I_{ij}(\theta^t)\, \delta\theta^t_i\, \delta\theta^{t+\delta t}_j.$$
Dividing by the associated norms will yield the cosine $\cos\alpha$ of the angle between $\delta\theta^t$ and $\delta\theta^{t+\delta t}$.
If this cosine is positive, the learning rate $\delta t$ may be increased. If the cosine is negative, the learning rate probably needs to be decreased. Various schemes for the change of $\delta t$ can be devised; for instance, inspired by CMA-ES, one can multiply $\delta t$ by $\exp(\beta (\cos\alpha)/2)$ or $\exp\bigl(\beta (\mathbb{1}_{\cos\alpha > 0} - \mathbb{1}_{\cos\alpha < 0})/2\bigr)$, where $\beta \approx \min(N/\dim\Theta,\, 1/2)$.
As before, this scheme is constructed to be robust w.r.t. reparametrization of $\theta$, thanks to the use of the Fisher metric. However, for large learning rates $\delta t$, in practice the parametrization might well become relevant.
A consistent direction of the updates does not necessarily mean that the algorithm is performing well: for instance, when CEM/EMNA exhibits premature convergence (see below), the parameters consistently move towards a zero covariance matrix and the cosines above are positive.
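The backtracking test above fits in a few lines; `fisher_cosine` and `adapt_step` are hypothetical helper names, and the Fisher matrix and $\beta$ value are illustrative.

```python
import numpy as np

def fisher_cosine(dtheta_prev, dtheta_cur, I):
    """Cosine of the angle between two successive parameter increments,
    measured in the Fisher metric I (symmetric positive-definite matrix)."""
    dot    = dtheta_prev @ I @ dtheta_cur
    n_prev = np.sqrt(dtheta_prev @ I @ dtheta_prev)
    n_cur  = np.sqrt(dtheta_cur @ I @ dtheta_cur)
    return dot / (n_prev * n_cur)

def adapt_step(dt, cos_alpha, beta=0.2):
    """Multiplicative rule sketched in the text: grow the step on consistent
    moves (cos > 0), shrink it on backtracking (cos < 0)."""
    return dt * np.exp(beta * cos_alpha / 2.0)

I = np.array([[4.0, 0.0], [0.0, 1.0]])   # example Fisher matrix
c = fisher_cosine(np.array([1.0, 0.0]), np.array([1.0, 1.0]), I)
```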
Complexity. The complexity of the IGO algorithm depends much on the computational cost model. In optimization, it is fairly common to assume that the objective function $f$ is very costly compared to any other calculations performed by the algorithm. Then the cost of IGO in terms of number of $f$-calls is $N$ per iteration, and the cost of using quantiles and computing the natural gradient is negligible.
Setting the cost of $f$ aside, the complexity of the IGO algorithm depends mainly on the computation of the (inverse) Fisher matrix. Assume an analytical expression for this matrix is known. Then, with $p = \dim\Theta$ the number of parameters, the cost of storage of the Fisher matrix is $O(p^2)$ per iteration, and its inversion typically costs $O(p^3)$ per iteration. However, depending on the situation and on possible algebraic simplifications, strategies exist to reduce this cost (e.g. [LRMB07] in a learning context). For instance, for CMA-ES the cost is $O(Np)$ [SHI09]. More generally, parametrization by expectation parameters (see above), when available, may reduce the cost to $O(Np)$ as well.
If no analytical form of the Fisher matrix is known and Monte Carlo estimation is required, then complexity depends on the particular situation at hand and is related to the best sampling strategies available for a particular family of distributions. For Boltzmann machines, for instance, a host of such strategies are available. Still, in such a situation, IGO can be competitive if the objective function $f$ is costly.
Recycling samples. We might use samples not only from the last iteration to compute the ranks in (12), see e.g. [SWSS09]. For $N = 1$ this is indispensable. In order to preserve sampling consistency (Theorem 4) the old samples need to be reweighted (using the ratio of their new vs old likelihood, as in importance sampling).
Initialization. As with other optimization algorithms, it is probably a good idea to initialize in such a way as to cover a wide portion of the search space, i.e. $\theta^0$ should be chosen so that $P_{\theta^0}$ has maximal diversity. For IGO algorithms this is particularly relevant, since, as explained above, the natural gradient provides minimal change of diversity (greedily at each step) for a given change in the objective function.
3 IGO, maximum likelihood and the cross-entropy method
IGO as a smooth-time maximum likelihood estimate. The IGO flow turns out to be the only way to maximize a weighted log-likelihood, where points of the current distribution are slightly reweighted according to $f$-preferences.
This relies on the following interpretation of the natural gradient as a weighted maximum likelihood update with infinitesimal learning rate. This result singles out, in yet another way, the natural gradient among all possible gradients. The proof is given in the Appendix.
Theorem 13 (Natural gradient as ML with infinitesimal weights). Let $\varepsilon > 0$ and $\theta_0 \in \Theta$. Let $W(x)$ be a function of $x$ and let $\theta$ be the solution of
$$\theta = \arg\max_\theta \Bigl\{ (1 - \varepsilon) \underbrace{\int \log P_\theta(x)\, P_{\theta_0}(\mathrm{d}x)}_{\text{maximal for } \theta = \theta_0} + \;\varepsilon \int \log P_\theta(x)\, W(x)\, P_{\theta_0}(\mathrm{d}x) \Bigr\}.$$
Then, when $\varepsilon \to 0$, up to $O(\varepsilon^2)$ we have
$$\theta = \theta_0 + \varepsilon \int \widetilde{\nabla}_\theta \ln P_\theta(x)\, W(x)\, P_{\theta_0}(\mathrm{d}x).$$
Likewise for discrete samples: with $x_1, \ldots, x_N \in X$, let $\theta$ be the solution of
$$\theta = \arg\max_\theta \Bigl\{ (1 - \varepsilon) \int \log P_\theta(x)\, P_{\theta_0}(\mathrm{d}x) + \varepsilon \sum_i W(x_i) \log P_\theta(x_i) \Bigr\}.$$
Then when $\varepsilon \to 0$, up to $O(\varepsilon^2)$ we have
$$\theta = \theta_0 + \varepsilon \sum_i W(x_i)\, \widetilde{\nabla}_\theta \ln P_\theta(x_i).$$
So if $W(x) = W^f_{\theta_0}(x)$ is the weight of the points according to quantilized $f$-preferences, the weighted maximum log-likelihood necessarily is the IGO flow (7) using the natural gradient, or the IGO update (14) when using samples.
Thus the IGO flow is the unique flow that, continuously in time, slightly changes the distribution to maximize the log-likelihood of points with good values of $f$. Moreover IGO continuously updates the weight $W^f_{\theta_0}(x)$ depending on $f$ and on the current distribution, so that we keep optimizing.
This theorem suggests a way to approximate the IGO flow by enforcing this interpretation for a given non-infinitesimal step size $\delta t$, as follows.
Definition 14 (IGO-ML algorithm). The IGO-ML algorithm with step size $\delta t$ updates the value of the parameter $\theta^t$ according to
$$\theta^{t+\delta t} = \arg\max_\theta \Bigl\{ \bigl(1 - \delta t \sum_i \widehat{w}_i\bigr) \int \log P_\theta(x)\, P_{\theta^t}(\mathrm{d}x) + \delta t \sum_i \widehat{w}_i \log P_\theta(x_i) \Bigr\} \qquad (23)$$
where $x_1, \ldots, x_N$ are sample points picked according to the distribution $P_{\theta^t}$, and $\widehat{w}_i$ is the weight (12) obtained from the ranked values of the objective function $f$.
As for the cross-entropy method below, this only makes algorithmic sense if the argmax is tractable.
It turns out that IGO-ML is just the IGO algorithm in a particular parametrization (see Theorem 15).
The cross-entropy method. Taking $\delta t = 1$ in (23) above corresponds to a full maximum likelihood update, which is also related to the cross-entropy method (CEM). The cross-entropy method can be defined as follows [dBKMR05] in an optimization setting. Like IGO, it depends on a family of probability distributions $P_\theta$ parametrized by $\theta \in \Theta$, and a number of samples $N$ at each iteration. Let also $N_e = \lceil qN \rceil$ ($0 < q < 1$) be a number of elite samples.
At each step, the cross-entropy method for optimization samples $N$ points $x_1, \ldots, x_N$ from the current distribution $P_{\theta^t}$. Let $\widehat{w}_i$ be $1/N_e$ if $x_i$ belongs to the $N_e$ samples with the best value of the objective function $f$, and $\widehat{w}_i = 0$ otherwise. Then the cross-entropy method or maximum likelihood update (CEM/ML) for optimization is
$$\theta^{t+1} = \arg\max_\theta \sum_i \widehat{w}_i \log P_\theta(x_i) \qquad (24)$$
(assuming the argmax is tractable).
A version with a smoother update depends on a step size parameter $0 < \alpha \leqslant 1$ and is given [dBKMR05] by
$$\theta^{t+1} = (1 - \alpha)\, \theta^t + \alpha\, \arg\max_\theta \sum_i \widehat{w}_i \log P_\theta(x_i). \qquad (25)$$
The standard CEM/ML update corresponds to $\alpha = 1$. For $\alpha = 1$ the standard cross-entropy method is independent of the parametrization $\theta$, whereas for $\alpha < 1$ this is not the case.
Note the difference between the IGO-ML algorithm (23) and the smoothed CEM update (25) with step size $\alpha = \delta t$: the smoothed CEM update performs a weighted average of the parameter value after taking the maximum likelihood estimate, whereas IGO-ML uses a weighted average of current and previous likelihoods, then takes a maximum likelihood estimate. In general, these two rules can greatly differ, as they do for Gaussian distributions (Section 4.2).
This reversal of the order of averaging makes IGO-ML parametrization-independent whereas the smoothed CEM update is not.
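For concreteness, here is a minimal sketch of the smoothed CEM update (25) for one-dimensional Gaussians, where the argmax is explicit (the elite sample's mean and variance); the objective, elite fraction and all constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: (x - 3.0) ** 2            # toy objective on R, minimized at x = 3

mu, var = 0.0, 25.0                     # current Gaussian N(mu, var)
N, q, alpha = 200, 0.25, 0.6            # sample size, elite fraction, step size
for _ in range(30):
    x = mu + np.sqrt(var) * rng.standard_normal(N)
    elite = x[np.argsort(f(x))[: int(q * N)]]
    # ML estimate over the elites, then smoothing with step alpha, as in (25):
    mu  = (1 - alpha) * mu  + alpha * elite.mean()
    var = (1 - alpha) * var + alpha * elite.var()
```

Note how the smoothing is applied to the parameter values after the maximization; IGO-ML instead averages the likelihood terms before maximizing.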
Yet, for exponential families of probability distributions, there exists one particular parametrization $\theta$ in which the IGO algorithm and the smoothed CEM update coincide. We now proceed to this construction.
IGO for expectation parameters and maximum likelihood. The particular form of IGO for exponential families has an interesting consequence if the parametrization chosen for the exponential family is the set of expectation parameters. Let $P_\theta(x) = \frac{1}{Z(\theta)} \exp\bigl(\sum_i \theta_i T_i(x)\bigr)\, H(\mathrm{d}x)$ be an exponential family as above. The expectation parameters are $\bar{T}_i = \bar{T}_i(\theta) = \mathbb{E}_{P_\theta} T_i$ (denoted $\eta_i$ in [AN00, (3.56)]). The notation $\bar{T}$ will denote the collection $(\bar{T}_i)$.
It is well-known that, in this parametrization, the maximum likelihood estimate for a sample of points $x_1, \ldots, x_N$ is just the empirical average of the expectation parameters over that sample:
$$\arg\max_{\bar{T}} \frac{1}{N} \sum_{k=1}^{N} \log P_{\bar{T}}(x_k) = \frac{1}{N} \sum_{k=1}^{N} T(x_k). \qquad (26)$$
In the discussion above, one main difference between IGO and smoothed CEM was whether we took averages before or after taking the maximum log-likelihood estimate. For the expectation parameters $\bar{T}_i$, we see that these operations commute. (One can say that these expectation parameters "linearize maximum likelihood estimates".) A little work brings us to the
Theorem 15 (IGO, CEM and maximum likelihood). Let
$$P_\theta(x) = \frac{1}{Z(\theta)} \exp\Bigl(\sum_i \theta_i T_i(x)\Bigr) H(\mathrm{d}x)$$
be an exponential family of probability distributions, where the $T_i$ are functions of $x$ and $H$ is some reference measure. Let us parametrize this family by the expected values $\bar{T}_i = \mathbb{E}\, T_i$.
Let us assume the chosen weights $\widehat{w}_k$ sum to $1$. For a sample $x_1, \ldots, x_N$, let
$$T^*_i = \sum_k \widehat{w}_k\, T_i(x_k).$$
Then the IGO update (14) in this parametrization reads
$$\bar{T}^{\,t+\delta t}_i = (1 - \delta t)\, \bar{T}^{\,t}_i + \delta t\, T^*_i. \qquad (27)$$
Moreover these three algorithms coincide:
• The IGO-ML algorithm (23).
• The IGO algorithm written in the parametrization $\bar{T}_i$.
• The smoothed CEM algorithm (25) written in the parametrization $\bar{T}_i$, with $\alpha = \delta t$.
Corollary 16. The standard CEM/ML update (24) is the IGO algorithm in parametrization $\bar{T}_i$ with $\delta t = 1$.
Beware that the expectation parameters $\bar{T}_i$ are not always the most obvious parameters [AN00, Section 3.5]. For example, for 1-dimensional Gaussian distributions, these expectation parameters are the mean $\mu$ and the second moment $\mu^2 + \sigma^2$. When expressed back in terms of mean and variance, with the update (27) the new mean is $(1 - \delta t)\mu + \delta t\, \mu^*$, but the new variance is $(1 - \delta t)\sigma^2 + \delta t\, (\sigma^*)^2 + \delta t (1 - \delta t)(\mu^* - \mu)^2$.
On the other hand, when using smoothed CEM with mean and variance as parameters, the new variance is $(1 - \delta t)\sigma^2 + \delta t\, (\sigma^*)^2$, which can be significantly smaller for $\delta t \in (0, 1)$. This proves, in passing, that the smoothed CEM update in other parametrizations is generally not an IGO algorithm (because it can differ at first order in $\delta t$).
The case of Gaussian distributions is further exemplified in Section 4.2 below: in particular, smoothed CEM in the $(\mu, \sigma)$ parametrization exhibits premature reduction of variance, preventing good convergence.
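The two variance formulas can be compared directly. In the sketch below (arbitrary illustrative numbers), the IGO-ML variance is re-derived from the interpolation (27) of the expectation parameters $(\mu,\, \mu^2 + \sigma^2)$ as a consistency check, then compared with smoothed CEM in the $(\mu, \sigma)$ parametrization.

```python
mu, var = 0.0, 1.0        # current mean and variance sigma^2
mu_s, var_s = 2.0, 1.0    # maximum likelihood estimates mu*, (sigma*)^2
dt = 0.5

# IGO-ML via (27): interpolate the expectation parameters (mu, mu^2 + sigma^2)
m_new = (1 - dt) * mu + dt * mu_s
second_new = (1 - dt) * (mu ** 2 + var) + dt * (mu_s ** 2 + var_s)
var_igo = second_new - m_new ** 2

# closed form quoted in the text, including the cross term dt (1 - dt)(mu* - mu)^2
var_igo_closed = (1 - dt) * var + dt * var_s + dt * (1 - dt) * (mu_s - mu) ** 2

# smoothed CEM in the (mu, sigma) parametrization has no cross term
var_cem = (1 - dt) * var + dt * var_s
```

Here `var_cem` is half of `var_igo`: when the mean moves, the missing cross term is exactly the premature variance reduction of smoothed CEM.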
For these reasons we think that the IGO-ML algorithm is the sensible way to perform an interpolated ML estimate for $\delta t < 1$, in a parametrization-independent way. In Section 6 we further discuss IGO and CEM and sum up the differences and relative advantages.
Taking $\delta t = 1$ is a bold approximation choice: the "ideal" continuous-time IGO flow itself, after time $1$, does not coincide with the maximum likelihood update of the best points in the sample. Since the maximum likelihood algorithm is known to converge prematurely in some instances (Section 4.2), using the parametrization by expectation parameters with large $\delta t$ may not be desirable.
The considerable simplification of the IGO update in these coordinates reflects the duality of coordinates $\theta_i$ and $\bar{T}_i$. More precisely, the natural gradient ascent w.r.t. the parameters $\theta_i$ is given by the vanilla gradient w.r.t. the parameters $\bar{T}_i$: $\widetilde{\nabla}_{\theta_i} = \frac{\partial}{\partial \bar{T}_i}$ (Proposition 22 in the Appendix).
4 CMA-ES, NES, EDAs and PBIL from the IGO framework
In this section we investigate the IGO algorithm for Bernoulli measures and for multivariate normal distributions and show the correspondence to well-known algorithms. In addition, we discuss the influence of the parametrization of the distributions.
4.1 IGO algorithm for Bernoulli measures and PBIL
We consider on $X = \{0, 1\}^d$ a family of Bernoulli measures $P_\theta(x) = p_{\theta_1}(x_1) \times \cdots \times p_{\theta_d}(x_d)$ with $p_{\theta_i}(x_i) = \theta_i^{x_i} (1 - \theta_i)^{1 - x_i}$. As this family is a product of probability measures $p_{\theta_i}(x_i)$, the different components of a random vector $y$ following $P_\theta$ are independent and all off-diagonal terms of the Fisher information matrix (FIM) are zero. Diagonal terms are given by $\frac{1}{\theta_i(1 - \theta_i)}$. Therefore the inverse of the FIM is a diagonal matrix with diagonal entries equal to $\theta_i(1 - \theta_i)$. In addition, the partial derivative of $\ln P_\theta(x)$ w.r.t. $\theta_i$ can be computed in a straightforward manner resulting in
$$\frac{\partial \ln P_\theta(x)}{\partial \theta_i} = \frac{x_i}{\theta_i} - \frac{1 - x_i}{1 - \theta_i}.$$
Let $x_1, \ldots, x_N$ be $N$ samples at step $t$ with distribution $P_{\theta^t}$ and let $x_{1:N}, \ldots, x_{N:N}$ be the ranked samples according to $f$. The natural gradient update (15) with Bernoulli measures is then given by
$$\theta^{t+\delta t}_i = \theta^t_i + \delta t\, \theta^t_i (1 - \theta^t_i) \sum_{j=1}^{N} \widehat{w}_j \left( \frac{[x_{j:N}]_i}{\theta^t_i} - \frac{1 - [x_{j:N}]_i}{1 - \theta^t_i} \right) \qquad (28)$$
where $\widehat{w}_j = w\bigl((j - 1/2)/N\bigr)/N$ and $[y]_i$ denotes the $i$-th coordinate of $y \in X$. The previous equation simplifies to
$$\theta^{t+\delta t}_i = \theta^t_i + \delta t \sum_{j=1}^{N} \widehat{w}_j \bigl( [x_{j:N}]_i - \theta^t_i \bigr),$$
or, denoting $\bar{w}$ the sum of the weights $\sum_{j=1}^{N} \widehat{w}_j$,
$$\theta^{t+\delta t}_i = (1 - \bar{w}\, \delta t)\, \theta^t_i + \delta t \sum_{j=1}^{N} \widehat{w}_j\, [x_{j:N}]_i.$$
If we set the IGO weights as $\widehat{w}_1 = 1$, $\widehat{w}_j = 0$ for $j = 2, \ldots, N$, we recover the PBIL/EGA algorithm with update rule towards the best solution only (disregarding the mutation of the probability vector), with $\delta t = \mathrm{LR}$ where $\mathrm{LR}$ is the so-called learning rate of the algorithm [Bal94, Figure 4]. The PBIL update rule towards the $\mu$ best solutions, proposed in [BC95, Figure 4]², can be recovered as well using
$$\delta t = \mathrm{LR}, \qquad \widehat{w}_j = (1 - \mathrm{LR})^{j-1} \text{ for } j = 1, \ldots, \mu, \qquad \widehat{w}_j = 0 \text{ for } j = \mu + 1, \ldots, N.$$
Interestingly, the parameters $\theta_i$ are the expectation parameters described in Section 3: indeed, the expectation of $x_i$ is $\theta_i$. So the formulas above are particular cases of (27). Thus, by Theorem 15, PBIL is both a smoothed CEM in these parameters and an IGO-ML algorithm.
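A toy run of the Bernoulli IGO update in its simplified form above might look as follows; the OneMax-style fitness, population size, weight choice and clipping are all illustrative assumptions (the clipping in particular is a practical guard, not part of the update itself).

```python
import numpy as np

rng = np.random.default_rng(5)
d, N, dt, mu = 10, 100, 0.1, 25          # dimension, samples, step size, elites
w_hat = np.zeros(N)
w_hat[:mu] = 1.0 / mu                    # uniform weights on the best mu, sum = 1

f = lambda x: x.sum(axis=1)              # OneMax-style fitness, to be maximized
theta = np.full(d, 0.5)                  # Bernoulli parameters theta_i

for _ in range(300):
    x = (rng.random((N, d)) < theta).astype(float)
    x_ranked = x[np.argsort(-f(x))]      # x_{1:N}, ..., x_{N:N}, best first
    # simplified update: theta <- (1 - w_bar dt) theta + dt sum_j w_j [x_{j:N}]_i
    theta = (1 - dt) * theta + dt * (w_hat @ x_ranked)
    theta = np.clip(theta, 0.01, 0.99)   # guard against premature fixation
```

Since the $\theta_i$ are expectation parameters, each iteration is exactly the interpolation (27) between the current parameters and the weighted sample average.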
Let us now consider another, so-called "logit" representation, given by the logistic function $P(x_i = 1) = \frac{1}{1 + \exp(-\theta_i)}$. This $\theta$ is the exponential parametrization of Section 2.3. We find that
$$\frac{\partial \ln P_\theta(x)}{\partial \theta_i} = (x_i - 1) + \frac{\exp(-\theta_i)}{1 + \exp(-\theta_i)} = x_i - \mathbb{E}\, x_i$$
(cf. (16)) and that the diagonal elements of the Fisher information matrix are given by $\exp(-\theta_i)/(1 + \exp(-\theta_i))^2 = \mathrm{Var}\, x_i$ (compare with (17)). So the natural gradient update (15) with Bernoulli measures now reads
$$\theta^{t+\delta t}_i = \theta^t_i + \delta t\, \bigl(1 + \exp(\theta^t_i)\bigr) \Bigl( -\bar{w} + \bigl(1 + \exp(-\theta^t_i)\bigr) \sum_{j=1}^{N} \widehat{w}_j\, [x_{j:N}]_i \Bigr).$$
To better compare the update with the previous representation, note that $p_i = \frac{1}{1 + \exp(-\theta_i)}$ and thus we can rewrite
$$\theta^{t+\delta t}_i = \theta^t_i + \frac{\delta t}{p^t_i (1 - p^t_i)} \sum_{j=1}^{N} \widehat{w}_j \bigl( [x_{j:N}]_i - p^t_i \bigr).$$
So the direction of the update is the same as before and is given by the proportion of bits set to $0$ or $1$ in the sample, compared to its expected value under the current distribution. The magnitude of the update is different since the parameter $\theta$ ranges from $-\infty$ to $+\infty$ instead of from $0$ to $1$. We did not find this algorithm in the literature.
These updates illustrate the influence of setting the sum of weights to $0$ or not (Section 2.7). If, at some time, the first bit is set to $1$ both for a majority of good points and for a majority of bad points, then the original PBIL will increase the probability of setting the first bit to $1$, which is counterintuitive. If the weights $\widehat{w}_j$ are chosen to sum to $0$ this noise effect disappears; otherwise, it disappears only on average.
²Note that the pseudocode for the algorithm in [BC95, Figure 4] is slightly erroneous since it gives smaller weights to better individuals. The error can be fixed by updating the probability in reversed order, looping from NUMBER_OF_VECTORS_TO_UPDATE_FROM to 1. This was confirmed by S. Baluja in personal communication. We consider here the corrected version of the algorithm.
4.2 Multivariate normal distributions (Gaussians)
Evolution strategies [Rec73, Sch95, BS02] are black-box optimization algorithms for the continuous search domain, $X \subseteq \mathbb{R}^d$ (for simplicity we assume $X = \mathbb{R}^d$ in the following). They sample new solutions from a multivariate normal distribution. In the context of continuous black-box optimization, Natural Evolution Strategies (NES) introduced the idea of using a natural gradient update of the distribution parameters [WSPS08, SWSS09, GSS+10]. Surprisingly, the well-known Covariance Matrix Adaptation Evolution Strategy, CMA-ES [HO96, HO01, HMK03, HK04, JA06], also turns out to conduct a natural gradient update of the distribution parameters [ANOK10, GSS+10].
Let $x \in \mathbb{R}^d$. As the most prominent example, we use the mean vector $m = \mathbb{E} x$ and the covariance matrix $C = \mathbb{E}(x-m)(x-m)^{\mathrm{T}} = \mathbb{E}(x x^{\mathrm{T}}) - m m^{\mathrm{T}}$ to parametrize the distribution via $\theta = (m, C)$. The IGO update in (14) or (15) depends on the chosen parametrization, but can now be entirely reformulated without the (inverse) Fisher matrix, compare (18). The complexity of the update is linear in the number of parameters (size of $\theta = (m, C)$, where $(d^2 - d)/2$ parameters are redundant). We discuss known algorithms that implement variants of this update.
CMA-ES. The CMA-ES implements the equations³
$$m^{t+1} = m^t + \eta_{\mathrm{m}} \sum_{i=1}^{N} \hat{w}_i\, (x_i - m^t) \qquad (29)$$
$$C^{t+1} = C^t + \eta_{\mathrm{c}} \sum_{i=1}^{N} \hat{w}_i \left( (x_i - m^t)(x_i - m^t)^{\mathrm{T}} - C^t \right) \qquad (30)$$
where $\hat{w}_i$ are the weights based on ranked $f$-values, see (12) and (14).

When $\eta_{\mathrm{c}} = \eta_{\mathrm{m}}$, Equations (30) and (29) coincide with the IGO update (14) expressed in the parametrization $(m, C)$ [ANOK10, GSS+10]⁴. Note, however, that the learning rates $\eta_{\mathrm{m}}$ and $\eta_{\mathrm{c}}$ take essentially different values in CMA-ES if $N \ll \dim \Theta$⁵: this is in deviation from an IGO algorithm. (Remark that the Fisher information matrix is block-diagonal in $m$ and $C$ [ANOK10], so that the application of the different learning rates and of the inverse Fisher matrix commute.)
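The core update (29)-(30) can be sketched as follows (toy samples and arbitrary learning rates; evolution paths are disengaged as in footnote 3, so this is an illustration, not a full CMA-ES implementation):

```python
import numpy as np

def cma_core_update(m, C, xs, w, eta_m, eta_c):
    """Rank-mu CMA-ES update of mean and covariance, eqs. (29)-(30).
    Deviations are taken from the *old* mean m^t."""
    y = xs - m
    m_new = m + eta_m * (w[:, None] * y).sum(axis=0)
    C_new = C + eta_c * sum(wi * (np.outer(yi, yi) - C) for wi, yi in zip(w, y))
    return m_new, C_new

rng = np.random.default_rng(0)
m, C = np.zeros(2), np.eye(2)
xs = rng.multivariate_normal(m, C, size=6)       # toy sample, assumed ranked
w = np.array([0.5, 0.3, 0.2, 0.0, 0.0, 0.0])     # nonnegative weights, sum to 1
m1, C1 = cma_core_update(m, C, xs, w, eta_m=1.0, eta_c=0.2)
assert np.allclose(C1, C1.T)                     # covariance stays symmetric
```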
Natural evolution strategies. Natural evolution strategies (NES) [WSPS08, SWSS09] implement (29) as well, but use a Cholesky decomposition of $C$ as the parametrization for the update of the variance parameters. The resulting update that replaces (30) is neither particularly elegant nor numerically efficient. However, the most recent xNES [GSS+10] chooses an "exponential" parametrization depending on the current parameters. This leads to an elegant formulation where the additive update in the exponential parametrization becomes a multiplicative update for $C$ in (30). With $C = AA^{\mathrm{T}}$, the matrix
³The CMA-ES implements these equations given the parameter setting $c_1 = 0$ and $c_\sigma = 0$ (or $d_\sigma = \infty$, see e.g. [Han09]) that disengages the effect of both so-called evolution paths.

⁴In these articles the result has been derived for $\theta \leftarrow \theta + \eta\, \widetilde{\nabla}_{\!\theta}\, \mathbb{E}_{P_\theta} f$, see (9), leading to $f(x_i)$ in place of $\hat{w}_i$. No assumptions on $f$ have been used besides that it does not depend on $\theta$. By replacing $f$ with $W_f^{\theta^t}$, where $\theta^t$ is fixed, the derivation holds equally well for $\theta \leftarrow \theta + \eta\, \widetilde{\nabla}_{\!\theta}\, \mathbb{E}_{P_\theta} W_f^{\theta^t}$.

⁵Specifically, let $\sum |\hat{w}_i| = 1$; then $\eta_{\mathrm{m}} = 1$ and $\eta_{\mathrm{c}} \approx 1 \wedge 1/(d^2 \sum \hat{w}_i^2)$.
update reads
$$A \leftarrow A \times \exp\!\left( \frac{\eta_{\mathrm{c}}}{2} \sum_{i=1}^{N} \hat{w}_i \left( A^{-1}(x_i - m)\, \bigl(A^{-1}(x_i - m)\bigr)^{\mathrm{T}} - \mathrm{I}_d \right) \right) \qquad (31)$$
where $\mathrm{I}_d$ is the identity matrix. (From (31) we get $C \leftarrow A \times \exp^2(\ldots) \times A^{\mathrm{T}}$.) Remark that in the representation $\theta = (A^{-1}m,\, A^{-1} C A^{-\mathrm{T}}) = (A^{-1}m,\, \mathrm{I}_d)$, the Fisher information matrix becomes diagonal.
The update has the advantage over (30) that even negative weights always lead to feasible values. By default, $\eta_{\mathrm{m}} = \eta_{\mathrm{c}}$ in xNES in the same circumstances as in CMA-ES (most parameter settings are borrowed from CMA-ES), but contrary to CMA-ES the past evolution path is not taken into account [GSS+10].
When $\eta_{\mathrm{c}} = \eta_{\mathrm{m}}$, xNES is consistent with the IGO flow (6), and implements a slightly generalized IGO algorithm (14) using a $\theta$-dependent parametrization.
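The multiplicative update (31) can be sketched as follows (toy values; the matrix exponential of the symmetric argument is computed by eigendecomposition). It illustrates the feasibility advantage: the resulting $C = AA^{\mathrm{T}}$ stays positive definite even with negative weights:

```python
import numpy as np

def expm_sym(M):
    """Matrix exponential of a symmetric matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(M)
    return (vecs * np.exp(vals)) @ vecs.T

def xnes_cov_update(A, m, xs, w, eta_c):
    """xNES update (31): A <- A expm((eta_c/2) sum_i w_i (z_i z_i^T - I)),
    with z_i = A^{-1}(x_i - m)."""
    d = len(m)
    Z = np.linalg.solve(A, (xs - m).T).T          # rows z_i = A^{-1}(x_i - m)
    G = sum(wi * (np.outer(zi, zi) - np.eye(d)) for wi, zi in zip(w, Z))
    return A @ expm_sym(0.5 * eta_c * G)

rng = np.random.default_rng(1)
A, m = np.eye(2), np.zeros(2)
xs = rng.standard_normal((4, 2))                  # toy ranked samples
w = np.array([0.5, 0.5, -0.5, -0.5])              # weights summing to 0, some negative
A1 = xnes_cov_update(A, m, xs, w, eta_c=0.5)
C1 = A1 @ A1.T
assert np.all(np.linalg.eigvalsh(C1) > 0)         # feasible despite negative weights
```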
Cross-entropy method and EMNA. Estimation of distribution algorithms (EDA) and the cross-entropy method (CEM) [Rub99, RK04] estimate a new distribution from a censored sample. Generally, the new parameter value can be written as
$$\theta_{\mathrm{maxLL}} = \arg\max_\theta \sum_{i=1}^{N} \hat{w}_i \ln P_\theta(x_i) \qquad (32)$$
$$\xrightarrow{\;N\to\infty\;} \arg\max_\theta\, \mathbb{E}_{P_{\theta^t}}\, W_f^{\theta^t} \ln P_\theta .$$
For positive weights, $\theta_{\mathrm{maxLL}}$ maximizes the weighted log-likelihood of $x_1, \ldots, x_N$. The argument under the $\arg\max$ in the RHS of (32) is the negative cross-entropy between $P_\theta$ and the distribution of censored (elitist) samples $P_{\theta^t} W_f^{\theta^t}$, or of $N$ realizations of such samples. The distribution $P_{\theta_{\mathrm{maxLL}}}$ therefore has minimal cross-entropy and minimal KL-divergence to the distribution of the $\mu$ best samples.⁶ As said above, (32) recovers the cross-entropy method (CEM) [Rub99, RK04].
For Gaussian distributions, equation (32) can be explicitly written in the form
$$m^{t+1} = m^* = \sum_{i=1}^{N} \hat{w}_i\, x_i \qquad (33)$$
$$C^{t+1} = C^* = \sum_{i=1}^{N} \hat{w}_i\, (x_i - m^*)(x_i - m^*)^{\mathrm{T}} \qquad (34)$$
the empirical mean and variance of the elite sample. The weights $\hat{w}_i$ are equal to $1/\mu$ for the $\mu$ best points and 0 otherwise.
Equations (33) and (34) also define the most fundamental continuous-domain EDA, the estimation of multivariate normal algorithm (EMNA_global, [LL02]). It might be interesting to notice that (33) and (34) only differ from (29) and (30) in that the new mean $m^{t+1}$ is used in the covariance matrix update.
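A minimal sketch of (33)-(34), assuming the sample points are already sorted by fitness; the helper name `cem_gaussian` is ours:

```python
import numpy as np

def cem_gaussian(xs, mu):
    """CEM / EMNA_global update, eqs. (33)-(34): empirical mean and covariance
    of the mu best points (weights 1/mu, deviations taken from the *new* mean)."""
    elite = xs[:mu]                               # assume xs is sorted by fitness
    m_star = elite.mean(axis=0)
    Y = elite - m_star
    C_star = (Y[:, :, None] * Y[:, None, :]).mean(axis=0)
    return m_star, C_star

rng = np.random.default_rng(4)
xs = rng.standard_normal((10, 3))                 # stand-in for ranked samples
m_star, C_star = cem_gaussian(xs, mu=5)
# matches the biased empirical covariance of the elite sample
assert np.allclose(C_star, np.cov(xs[:5].T, bias=True))
```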
⁶Let $p_w$ denote the distribution of the weighted samples: $\Pr(x = x_i) = \hat{w}_i$, with $\sum \hat{w}_i = 1$. Then the cross-entropy between $p_w$ and $P_\theta$ reads $\sum_i p_w(x_i) \ln 1/P_\theta(x_i)$, and the KL-divergence reads $\mathrm{KL}(p_w \,\|\, P_\theta) = \sum_i p_w(x_i) \ln 1/P_\theta(x_i) - \sum_i p_w(x_i) \ln 1/p_w(x_i)$. Minimization of both terms in $\theta$ results in $\theta_{\mathrm{maxLL}}$.
Let us compare IGO-ML (23), CMA (29)-(30), and smoothed CEM (25) in the parametrization with mean and covariance matrix. For learning rate $\delta t = 1$, IGO-ML and smoothed CEM/EMNA realize $\theta_{\mathrm{maxLL}}$ from (32)-(34). For $\delta t < 1$ these algorithms all update the distribution mean in the same way; the update of mean and covariance matrix is computed to be
$$m^{t+1} = (1-\delta t)\, m^t + \delta t\, m^*$$
$$C^{t+1} = (1-\delta t)\, C^t + \delta t\, C^* + \delta t\,(1-\delta t)^j\, (m^* - m^t)(m^* - m^t)^{\mathrm{T}} \qquad (35)$$
for different values of $j$, where $m^*$ and $C^*$ are the mean and covariance matrix computed over the elite sample (with weights $\hat{w}_i$) as above. For CMA we have $j = 0$, for IGO-ML we have $j = 1$, and for smoothed CEM we have $j = \infty$ (the rightmost term is absent). The case $j = 2$ corresponds to an update that uses $m^{t+1}$ instead of $m^t$ in (30) (compare [Han06b]). For $0 < \delta t < 1$, the larger $j$, the smaller $C^{t+1}$.
The rightmost term of (35) resembles the so-called rank-one update in CMA.
For $\delta t \to 0$, the update is independent of $j$ at first order in $\delta t$ if $j < \infty$: this reflects the compatibility with the IGO flow of CMA and of IGO-ML, but not of smoothed CEM.
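The one-parameter family (35) can be sketched directly; the toy values below merely illustrate the stated monotonicity of $C^{t+1}$ in $j$:

```python
import numpy as np

def update_family(m, C, m_star, C_star, dt, j):
    """Interpolated mean/covariance update (35); j = 0 gives CMA,
    j = 1 IGO-ML, j = infinity (rank-one term absent) smoothed CEM."""
    m_new = (1 - dt) * m + dt * m_star
    r1 = 0.0 if np.isinf(j) else dt * (1 - dt) ** j
    C_new = (1 - dt) * C + dt * C_star + r1 * np.outer(m_star - m, m_star - m)
    return m_new, C_new

# arbitrary toy values for one step
m, C = np.zeros(2), np.eye(2)
m_star, C_star = np.array([1.0, 0.0]), 0.5 * np.eye(2)
dt = 0.5
traces = [np.trace(update_family(m, C, m_star, C_star, dt, j)[1])
          for j in (0, 1, 2, np.inf)]
# for 0 < dt < 1, the larger j, the smaller C^{t+1}
assert all(a > b for a, b in zip(traces, traces[1:]))
```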
Critical $\delta t$. Let us assume that the weights $\hat{w}_i$ are non-negative, sum to one, and that $\mu < N$ weights take the value $1/\mu$, so that the selection quantile is $q = \mu/N$.

There is a critical value of $\delta t$ depending on this quantile $q$, such that above the critical $\delta t$ the algorithms given by IGO-ML and smoothed CEM are prone to premature convergence. Indeed, let $f$ be a linear function on $\mathbb{R}^d$, and consider the variance in the direction of the gradient of $f$. Assuming further $N \to \infty$ and $q \leq 1/2$, the variance $C^*$ of the elite sample is smaller than the current variance $C^t$, by a constant factor. Depending on the precise update for $C^{t+1}$, if $\delta t$ is too large, the variance $C^{t+1}$ is going to be smaller than $C^t$ by a constant factor as well. This implies that the algorithm is going to stall. (On the other hand, the continuous-time IGO flow corresponding to $\delta t \to 0$ does not stall, see Section 4.3.)
We now study the critical $\delta t$ under which the algorithm does not stall. For IGO-ML ($j = 1$ in (35), or equivalently for the smoothed CEM in the expectation parameters $(m, C + mm^{\mathrm{T}})$, see Section 3), the variance increases if and only if $\delta t$ is smaller than the critical value $\delta t_{\mathrm{crit}} = q\, b\, \sqrt{2\pi}\, e^{b^2/2}$, where $b$ is the percentile function of $q$, i.e. $b$ is such that $q = \int_b^\infty e^{-x^2/2}/\sqrt{2\pi}\; \mathrm{d}x$. This value $\delta t_{\mathrm{crit}}$ is plotted as a solid line in Fig. 1. For $j = 2$, $\delta t_{\mathrm{crit}}$ is smaller, related to the above value by $\delta t_{\mathrm{crit}} \leftarrow \sqrt{1 + \delta t_{\mathrm{crit}}} - 1$, and plotted as a dashed line in Fig. 1. For CEM ($j = \infty$), the critical $\delta t$ is zero. For CMA-ES ($j = 0$), the critical $\delta t$ is infinite for $q < 1/2$. When the selection quantile $q$ is above $1/2$, the critical $\delta t$ becomes zero for all algorithms.
We conclude that, despite the principled approach of ascending the natural gradient, the choice of the weighting function $w$, the choice of $\delta t$, and the possible choices in the update for $\delta t > 0$ need to be made with great care in relation to the choice of parametrization.
4.3 Computing the IGO flow for some simple examples
In this section we take a closer look at the IGO flow solutions of (6) for some simple examples of fitness functions, for which it is possible to obtain exact information about these IGO trajectories.
Figure 1: Critical $\delta t$ versus truncation quantile $q$, for $j = 1$ (solid line), $j = 2$ (dashed line) and $j = \infty$. Above the critical $\delta t$, the variance decreases systematically when optimizing a linear function. For CMA-ES/NES, the critical $\delta t$ for $q < 0.5$ is infinite.
We start with the discrete search space $X = \{0,1\}^d$ and linear functions (to be minimized) defined as $f(x) = c - \sum_{i=1}^d \alpha_i x_i$ with $\alpha_i > 0$. Note that the onemax function to be maximized, $f_{\mathrm{onemax}}(x) = \sum_{i=1}^d x_i$, is covered by setting $\alpha_i = 1$. The differential equation (6) for the Bernoulli measures $P_\theta(x) = p_{\theta_1}(x_1) \cdots p_{\theta_d}(x_d)$ defined on $X$ can be derived by taking the limit of the IGO-PBIL update given in (28):
$$\frac{\mathrm{d}\theta_i^t}{\mathrm{d}t} = \int_X W_f^{\theta^t}(x)\, (x_i - \theta_i^t)\, P_{\theta^t}(\mathrm{d}x) =: g_i(\theta^t) . \qquad (36)$$
Though finding the analytical solution of the differential equation (36) for any initial condition seems a bit intricate, we show that the equation admits one stable critical point, $(1, \ldots, 1)$, and one unstable critical point, $(0, \ldots, 0)$. In addition we prove that the trajectory decreases along $f$, in the sense $\frac{\mathrm{d}f(\theta^t)}{\mathrm{d}t} \leq 0$. To do so we establish the following result:

Lemma 17. On $f(x) = c - \sum_{i=1}^d \alpha_i x_i$ the solution of (36) satisfies $\sum_{i=1}^d \alpha_i \frac{\mathrm{d}\theta_i}{\mathrm{d}t} \geq 0$; moreover $\sum_i \alpha_i \frac{\mathrm{d}\theta_i}{\mathrm{d}t} = 0$ if and only if $\theta = (0, \ldots, 0)$ or $\theta = (1, \ldots, 1)$.
Proof. We compute $\sum_{i=1}^d \alpha_i\, g_i(\theta^t)$ and find that
$$\sum_{i=1}^{d} \alpha_i \frac{\mathrm{d}\theta_i^t}{\mathrm{d}t} = \int_X W_f^{\theta^t}(x) \left( \sum_{i=1}^{d} \alpha_i x_i - \sum_{i=1}^{d} \alpha_i \theta_i^t \right) P_{\theta^t}(\mathrm{d}x)$$
$$= \int_X W_f^{\theta^t}(x)\, \bigl( f(\theta^t) - f(x) \bigr)\, P_{\theta^t}(\mathrm{d}x)$$
$$= \mathbb{E}[W_f^{\theta^t}(x)]\, \mathbb{E}[f(x)] - \mathbb{E}[W_f^{\theta^t}(x)\, f(x)]$$
(the last equality uses that $f(\theta^t) = \mathbb{E}[f(x)]$, since $f$ is affine and the $\theta_i$ are the expectation parameters). In addition, $-W_f^{\theta^t}(x) = -w(\Pr(f(x') < f(x)))$ is a nondecreasing bounded function in the variable $f(x)$, so that $-W_f^{\theta^t}(x)$ and $f(x)$ are positively correlated (see [Tho00, Chapter 1] for a proof of this result), i.e.
$$\mathbb{E}\bigl[-W_f^{\theta^t}(x)\, f(x)\bigr] \geq \mathbb{E}\bigl[-W_f^{\theta^t}(x)\bigr]\, \mathbb{E}[f(x)]$$
with equality if and only if $\theta^t = (0, \ldots, 0)$ or $\theta^t = (1, \ldots, 1)$. Thus $\sum_{i=1}^d \alpha_i \frac{\mathrm{d}\theta_i^t}{\mathrm{d}t} \geq 0$.
The previous result implies that the positive definite function $V(\theta) = \sum_{i=1}^d \alpha_i - \sum_{i=1}^d \alpha_i \theta_i$ in $(1, \ldots, 1)$ satisfies $V^*(\theta) = \nabla V(\theta) \cdot g(\theta) \leq 0$ (such a function is called a Lyapunov function). Consequently $(1, \ldots, 1)$ is stable. Similarly, $V(\theta) = \sum_{i=1}^d \alpha_i \theta_i$ is a Lyapunov function for $(0, \ldots, 0)$ such that $\nabla V(\theta) \cdot g(\theta) \geq 0$. Consequently $(0, \ldots, 0)$ is unstable [AO08].
We now consider on $\mathbb{R}^d$ the family of multivariate normal distributions $P_\theta = \mathcal{N}(m, \sigma^2 \mathrm{I}_d)$ with covariance matrix equal to $\sigma^2 \mathrm{I}_d$. The parameter $\theta$ thus has $d+1$ components, $\theta = (m, \sigma) \in \mathbb{R}^d \times \mathbb{R}$. The natural gradient update using this family was derived in [GSS+10]; from it we can derive the IGO differential equation, which reads:
$$\frac{\mathrm{d}m^t}{\mathrm{d}t} = \int_{\mathbb{R}^d} W_f^{\theta^t}(x)\, (x - m^t)\, p_{\mathcal{N}(m^t, (\sigma^t)^2 \mathrm{I}_d)}(x)\, \mathrm{d}x \qquad (37)$$
$$\frac{\mathrm{d}\tilde{\sigma}^t}{\mathrm{d}t} = \int_{\mathbb{R}^d} \frac{1}{2d} \left\{ \sum_{i=1}^{d} \left( \frac{x_i - m_i^t}{\sigma^t} \right)^{\!2} - d \right\} W_f^{\theta^t}(x)\, p_{\mathcal{N}(m^t, (\sigma^t)^2 \mathrm{I}_d)}(x)\, \mathrm{d}x \qquad (38)$$
where $\sigma^t$ and $\tilde{\sigma}^t$ are linked via $\sigma^t = \exp(\tilde{\sigma}^t)$ or $\tilde{\sigma}^t = \ln(\sigma^t)$. Denoting by $\mathcal{N}$ a random vector following a centered multivariate normal distribution with identity covariance matrix, we write the gradient flow equivalently as
$$\frac{\mathrm{d}m^t}{\mathrm{d}t} = \sigma^t\; \mathbb{E}\!\left[ W_f^{\theta^t}(m^t + \sigma^t \mathcal{N})\, \mathcal{N} \right] \qquad (39)$$
$$\frac{\mathrm{d}\tilde{\sigma}^t}{\mathrm{d}t} = \mathbb{E}\!\left[ \frac{1}{2} \left( \frac{\|\mathcal{N}\|^2}{d} - 1 \right) W_f^{\theta^t}(m^t + \sigma^t \mathcal{N}) \right] . \qquad (40)$$
Let us analyze the solution of the previous system on linear functions. Without loss of generality (because of invariance) we can consider the linear function $f(x) = x_1$. We have
$$W_f^{\theta^t}(x) = w\!\left( \Pr(m_1^t + \sigma^t Z_1 < x_1) \right)$$
where $Z_1$ follows a standard normal distribution, and thus
$$W_f^{\theta^t}(m^t + \sigma^t \mathcal{N}) = w\!\left( \Pr_{Z_1 \sim \mathcal{N}(0,1)}(Z_1 < \mathcal{N}_1) \right) \qquad (41)$$
$$= w(\mathcal{F}(\mathcal{N}_1)) \qquad (42)$$
with $\mathcal{F}$ the cumulative distribution function of a standard normal distribution, and $\mathcal{N}_1$ the first component of $\mathcal{N}$. The differential equations thus simplify to
$$\frac{\mathrm{d}m^t}{\mathrm{d}t} = \sigma^t \begin{pmatrix} \mathbb{E}\bigl[ w(\mathcal{F}(\mathcal{N}_1))\, \mathcal{N}_1 \bigr] \\ 0 \\ \vdots \\ 0 \end{pmatrix} \qquad (43)$$
$$\frac{\mathrm{d}\tilde{\sigma}^t}{\mathrm{d}t} = \frac{1}{2d}\; \mathbb{E}\!\left[ \bigl( |\mathcal{N}_1|^2 - 1 \bigr)\, w(\mathcal{F}(\mathcal{N}_1)) \right] . \qquad (44)$$
Consider, for example, the weight function associated with truncation selection, i.e. $w(q) = \mathbb{1}_{q \leq q_0}$ where $q_0 \in (0, 1]$, also called intermediate recombination. We find that
$$\frac{\mathrm{d}m_1^t}{\mathrm{d}t} = \sigma^t\; \mathbb{E}\!\left[ \mathcal{N}_1\, \mathbb{1}_{\{\mathcal{N}_1 \leq \mathcal{F}^{-1}(q_0)\}} \right] =: \sigma^t \beta \qquad (45)$$
$$\frac{\mathrm{d}\tilde{\sigma}^t}{\mathrm{d}t} = \frac{1}{2d} \left( \int_0^{q_0} \mathcal{F}^{-1}(u)^2\, \mathrm{d}u - q_0 \right) =: \alpha . \qquad (46)$$
The solution of the IGO flow for the linear function $f(x) = x_1$ is thus given by
$$m_1^t = m_1^0 + \frac{\sigma^0 \beta}{\alpha} \left( \exp(\alpha t) - 1 \right) \qquad (47)$$
$$\sigma^t = \sigma^0 \exp(\alpha t) . \qquad (48)$$
The coefficient $\beta$ is strictly negative for any $q_0 < 1$. The coefficient $\alpha$ is strictly positive if and only if $q_0 < 1/2$, which corresponds to selecting less than half of the sampled points in an ES. In this case the step size $\sigma^t$ grows exponentially fast to infinity and the mean vector moves along the gradient direction towards $-\infty$ at the same rate. If more than half of the points are selected, $q_0 > 1/2$, the step size will decrease to zero exponentially fast and the mean vector will get stuck (compare also [Han06a]).
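The signs of $\alpha$ and $\beta$ in (45)-(46) can be checked numerically; note that $\beta = \mathbb{E}[\mathcal{N}_1 \mathbb{1}_{\{\mathcal{N}_1 \le \mathcal{F}^{-1}(q_0)\}}]$ has the closed form $-\varphi(\mathcal{F}^{-1}(q_0))$, with $\varphi$ the standard normal density (a sketch using the standard library only; the crude midpoint quadrature for $\alpha$ is ours):

```python
from statistics import NormalDist

F_inv = NormalDist().inv_cdf
phi = NormalDist().pdf

def beta(q0):
    """beta = E[N_1 1_{N_1 <= F^{-1}(q0)}] = -phi(F^{-1}(q0)) in closed form."""
    return -phi(F_inv(q0))

def alpha(q0, d, steps=20000):
    """alpha = (1/(2d)) (int_0^{q0} F^{-1}(u)^2 du - q0), midpoint rule."""
    h = q0 / steps
    integral = sum(F_inv((k + 0.5) * h) ** 2 for k in range(steps)) * h
    return (integral - q0) / (2 * d)

# beta < 0 for q0 < 1: the mean moves towards -infinity on f(x) = x_1
assert beta(0.3) < 0
# alpha > 0 iff q0 < 1/2: the step size grows iff fewer than half the points are kept
assert alpha(0.3, d=10) > 0 > alpha(0.7, d=10)
```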
5 Multimodal optimization using restricted Boltzmann machines
We now illustrate experimentally the influence of the natural gradient, versus the vanilla gradient, on diversity over the course of optimization. We consider a very simple situation: a fitness function with two distant optima, and we test whether different algorithms are able to reach both optima simultaneously or only converge to one of them. This provides a practical test of Proposition 1, which states that the natural gradient minimizes the loss of diversity.
The IGO method allows one to build a natural search algorithm from an arbitrary probability distribution on an arbitrary search space. In particular, by choosing families of probability distributions that are richer than Gaussian or Bernoulli, one may hope to be able to optimize functions with complex shapes. Here we study how this might help optimize multimodal functions.
Both Gaussian distributions on $\mathbb{R}^d$ and Bernoulli distributions on $\{0,1\}^d$ are unimodal. So at any given time, a search algorithm using such distributions concentrates around a given point in the search space, looking around that point (with some variance). It is an often-sought-after feature for an optimization algorithm to handle multiple hypotheses simultaneously.
In this section we apply our method to a multimodal distribution on $\{0,1\}^d$: restricted Boltzmann machines (RBMs). Depending on the activation state of the latent variables, values for various blocks of bits can be switched on or off, hence multimodality. So the optimization algorithm derived from these distributions will, hopefully, explore several distant zones of the search space at any given time. A related model (Boltzmann machines) was used in [Ber02] and was found to perform better than PBIL on some functions.
Our study of a bimodal RBM distribution for the optimization of a bimodal function confirms that the natural gradient does indeed behave in a more natural way than the vanilla gradient: when initialized properly, the natural gradient is able to maintain diversity by fully using the RBM distribution to learn the two modes, while the vanilla gradient only converges to one of the two modes.
Although these experiments support using a natural gradient approach, they also establish that complications can arise when estimating the inverse Fisher matrix in the case of complex distributions such as RBMs: estimation errors may lead to a singular or unreliable estimation of the Fisher matrix, especially when the distribution becomes singular. Further research is needed for a better understanding of this issue.
Figure 2: The RBM architecture with the observed (x) and latent (h) variables. In our experiments, a single hidden unit was used.
The experiments reported here, and the fitness function used, are extremely simple from the optimization viewpoint: both algorithms, using the natural and the vanilla gradient, find an optimum in only a few steps. The emphasis here is on the specific influence of replacing the vanilla gradient with the natural gradient, and the resulting influence on diversity and multimodality, in a simple situation.
5.1 IGO for restricted Boltzmann machines
Restricted Boltzmann machines. A restricted Boltzmann machine (RBM) [Smo86, AHS85] is a kind of probability distribution belonging to the family of undirected graphical models (also known as Markov random fields). A set of observed variables $\mathbf{x} \in \{0,1\}^{n_x}$ are given a probability using their joint distribution with unobserved latent variables $\mathbf{h} \in \{0,1\}^{n_h}$ [Gha04]. The latent variables are then marginalized over. See Figure 2 for the graph structure of an RBM.
The probability associated with an observation $\mathbf{x}$ and latent variable $\mathbf{h}$ is given by
$$P_\theta(\mathbf{x}, \mathbf{h}) = \frac{e^{-E(\mathbf{x},\mathbf{h})}}{\sum_{\mathbf{x}',\mathbf{h}'} e^{-E(\mathbf{x}',\mathbf{h}')}} , \qquad P_\theta(\mathbf{x}) = \sum_{\mathbf{h}} P_\theta(\mathbf{x}, \mathbf{h}) , \qquad (49)$$
where $E(\mathbf{x}, \mathbf{h})$ is the so-called energy function and $\sum_{\mathbf{x}',\mathbf{h}'} e^{-E(\mathbf{x}',\mathbf{h}')}$ is the partition function, denoted $Z$ in Section 2.3. The energy $E$ is defined by
$$E(\mathbf{x}, \mathbf{h}) = -\sum_i a_i x_i - \sum_j b_j h_j - \sum_{i,j} w_{ij}\, x_i h_j . \qquad (50)$$
The distribution is fully parametrized by the bias on the observed variables $\mathbf{a}$, the bias on the latent variables $\mathbf{b}$, and the weights $\mathbf{W}$ which account for pairwise interactions between observed and latent variables: $\theta = (\mathbf{a}, \mathbf{b}, \mathbf{W})$.
RBM distributions are a special case of exponential family distributions with latent variables (see (21) in Section 2.3). The RBM IGO equations stem from Equations (16), (17) and (18) by identifying the statistics $T(x)$ with $x_i$, $h_j$ or $x_i h_j$.
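For RBMs small enough to enumerate all configurations, (49)-(50) can be written down directly (a sketch with arbitrary toy parameters; only practical for a handful of units):

```python
import itertools
import math

def rbm_probs(a, b, W):
    """Exact RBM distribution (49)-(50) by enumeration over all (x, h) pairs."""
    nx, nh = len(a), len(b)
    def energy(x, h):
        return -(sum(a[i] * x[i] for i in range(nx))
                 + sum(b[j] * h[j] for j in range(nh))
                 + sum(W[i][j] * x[i] * h[j] for i in range(nx) for j in range(nh)))
    joint = {(x, h): math.exp(-energy(x, h))
             for x in itertools.product((0, 1), repeat=nx)
             for h in itertools.product((0, 1), repeat=nh)}
    Z = sum(joint.values())                       # partition function
    joint = {k: v / Z for k, v in joint.items()}
    marg = {}                                     # P_theta(x) = sum_h P_theta(x, h)
    for (x, h), p in joint.items():
        marg[x] = marg.get(x, 0.0) + p
    return joint, marg

# toy parameters: n_x = 3 observed units, n_h = 1 latent unit
a, b = [0.1, -0.2, 0.3], [0.5]
W = [[1.0], [-1.0], [0.5]]
joint, marg = rbm_probs(a, b, W)
assert abs(sum(marg.values()) - 1.0) < 1e-12
```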
For these distributions, the gradient of the log-likelihood is well-known [Hin02]. Although it is often considered intractable in the context of machine learning, where many variables are required, it becomes tractable for smaller RBMs. The gradient of the log-likelihood consists of the difference of two expectations (compare (16)):
$$\frac{\partial \ln P_\theta(x)}{\partial w_{ij}} = \mathbb{E}_{P_\theta}[x_i h_j \mid x] - \mathbb{E}_{P_\theta}[x_i h_j] . \qquad (51)$$
The Fisher information matrix is given by (20):
$$I_{w_{ij} w_{kl}}(\theta) = \mathbb{E}[x_i h_j\, x_k h_l] - \mathbb{E}[x_i h_j]\, \mathbb{E}[x_k h_l] \qquad (52)$$
where $I_{w_{ij} w_{kl}}$ denotes the entry of the Fisher matrix corresponding to the components $w_{ij}$ and $w_{kl}$ of the parameter $\theta$. These equations are understood to encompass the biases $\mathbf{a}$ and $\mathbf{b}$ by noticing that the biases can be replaced in the model by adding an observed variable and a latent variable that are always equal to one.
Finally, the IGO update rule is taken from (14):
$$\theta^{t+\delta t} = \theta^t + \delta t\; I^{-1}(\theta^t) \sum_{i=1}^{N} \hat{w}_i \left. \frac{\partial \ln P_\theta(\mathbf{x}_i)}{\partial \theta} \right|_{\theta = \theta^t} . \qquad (53)$$
Implementation. In this experimental study, the IGO algorithm for RBMs is directly implemented from Equation (53). At each optimization step, the algorithm consists in (1) finding a reliable estimate of the Fisher matrix (see Eq. 52), which is then inverted using the QR algorithm if it is invertible; (2) computing an estimate of the vanilla gradient, which is then weighted according to $W_f^\theta$; and (3) updating the parameters, taking into account the gradient step size $\delta t$. In order to estimate both the Fisher matrix and the gradient, samples must be drawn from the model $P_\theta$. This is done using Gibbs sampling (see for instance [Hin02]).
Fisher matrix imprecision. The imprecision incurred by the limited sampling size may sometimes lead to a singular estimation of the Fisher matrix (see p. 11 for a lower bound on the number of samples needed). Although a singular Fisher estimation happens rarely in normal conditions, it occurs with certainty when the probabilities become too concentrated on a few points. This situation arises naturally when the algorithm is allowed to continue the optimization after the optimum has been reached. For this reason, in our experiments, we stop the optimization as soon as both optima have been sampled, thus preventing $P_\theta$ from becoming too concentrated.
In a variant of the same problem, the Fisher estimation can become close to singular while still being numerically invertible. This situation leads to unreliable estimates and should therefore be avoided. To evaluate the reliability of the inverse Fisher matrix estimate, we use a cross-validation method: (1) making two estimates $F_1$ and $F_2$ of the Fisher matrix on two distinct sets of points generated from $P_\theta$, and (2) making sure that the eigenvalues of $F_1 F_2^{-1}$ are close to 1. In practice, at all steps we check that the average of the eigenvalues of $F_1 F_2^{-1}$ is between 1/2 and 2. If at some point during the optimization the Fisher estimation becomes singular or unreliable according to this criterion, the corresponding run is stopped and considered to have failed.
When optimizing on a larger space helps. A restricted Boltzmann machine naturally defines a distribution $P_\theta$ on both visible and hidden units $(\mathbf{x}, \mathbf{h})$, whereas the function to optimize depends only on the visible units $\mathbf{x}$. Thus we are faced with a choice. A first possibility is to decide that the objective function $f(\mathbf{x})$ is really a function of $(\mathbf{x}, \mathbf{h})$ where $\mathbf{h}$ is a dummy variable; then we can use the IGO algorithm to optimize over $(\mathbf{x}, \mathbf{h})$ using the distributions $P_\theta(\mathbf{x}, \mathbf{h})$. A second possibility is to marginalize $P_\theta(\mathbf{x}, \mathbf{h})$ over the hidden units $\mathbf{h}$ as in (49), to define the distribution $P_\theta(\mathbf{x})$; then we can use the IGO algorithm to optimize over $\mathbf{x}$ using $P_\theta(\mathbf{x})$.
These two approaches yield slightly different algorithms. Both were tested and found to be viable. However, the first approach is numerically more stable and requires fewer samples to estimate the Fisher matrix. Indeed, if $I_1(\theta)$ is the Fisher matrix at $\theta$ in the first approach and $I_2(\theta)$ in the second approach, we always have $I_1(\theta) \succeq I_2(\theta)$ (in the sense of positive-definite matrices). This is because probability distributions on the pair $(\mathbf{x}, \mathbf{h})$ carry more information than their projections on $\mathbf{x}$ only, and so the computed Kullback–Leibler distances will be larger.

In particular, there are (isolated) values of $\theta$ for which the Fisher matrix $I_2$ is not invertible whereas $I_1$ is. For this reason, we selected the first approach.
5.2 Experimental results
In our experiments, we look at the optimization of the two-min function defined below with a bimodal RBM: an RBM with only one latent variable. Such an RBM is bimodal because it has two possible configurations of the latent variable, $\mathbf{h} = 0$ or $\mathbf{h} = 1$, and given $\mathbf{h}$, the observed variables are independent and distributed according to two Bernoulli distributions.
Set a parameter $\mathbf{y} \in \{0,1\}^d$. The two-min function is defined as follows:
$$f_{\mathbf{y}}(\mathbf{x}) = \min\!\left( \sum_i |x_i - y_i| \, , \; \sum_i |(1 - x_i) - y_i| \right) \qquad (54)$$
This function of $\mathbf{x}$ has two optima: one at $\mathbf{y}$, the other at its binary complement $\bar{\mathbf{y}}$.
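A direct implementation of (54) makes both properties easy to verify (a minimal sketch):

```python
def two_min(x, y):
    """Two-min fitness (54): distance of x to the closer of y and its binary complement."""
    d1 = sum(abs(xi - yi) for xi, yi in zip(x, y))
    d2 = sum(abs((1 - xi) - yi) for xi, yi in zip(x, y))
    return min(d1, d2)

y = [1, 0, 1, 1, 0]
y_bar = [1 - yi for yi in y]
assert two_min(y, y) == 0 and two_min(y_bar, y) == 0      # the two global optima
x = [1, 1, 1, 0, 0]
x_bar = [1 - xi for xi in x]
assert two_min(x, y) == two_min(x_bar, y)                 # symmetric under complement
```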
For the quantile rewriting of $f$ (Section 1.2), we chose the function $w$ to be $w(q) = \mathbb{1}_{q \leq 1/2}$, so that points below the median are given weight 1, whereas other points are given weight 0. Also, in accordance with (3), if several points have the same fitness value, their weight $W_f^\theta$ is set to the average of $w$ over all those points.
For the initialization of the RBMs, the weights $\mathbf{W}$ are sampled from a normal distribution centered around zero with standard deviation $1/\sqrt{n_x \times n_h}$, where $n_x$ is the number of observed variables (dimension $d$ of the problem) and $n_h$ is the number of latent variables ($n_h = 1$ in our case), so that initially the energies $E$ are not too large. Then the bias parameters are initialized as $a_i \leftarrow -\sum_j \frac{w_{ij}}{2}$ and $b_j \leftarrow -\sum_i \frac{w_{ij}}{2} + \mathcal{N}(0, 0.01^2)$, so that each variable (observed or latent) has a probability of activation close to 1/2.

In the following experiments, we show the results of IGO optimization and vanilla gradient optimization for the two-min function in dimension 40, for various values of the step size $\delta t$. For each $\delta t$, we present the median of the quantity of interest over 100 runs. Error bars indicate the 16th and 84th percentiles (this is the same as mean ± stddev for a Gaussian variable, but is invariant under $f$-reparametrization). For each run, the parameter $\mathbf{y}$ of the two-min function is sampled randomly, in order to ensure that the presented results do not depend on a particular choice of optima.
The number of sample points used for estimating the Fisher matrix is set to 10,000: large enough (by far) to ensure the stability of the estimates. The same points are used for estimating the integral in (14); therefore there are 10,000 calls to the fitness function at each gradient step. These rather comfortable settings allow for a good illustration of the theoretical properties of the $N = \infty$ IGO flow limit.
The number of runs that fail after the occurrence of a singular matrix or an unreliable estimate amounts to less than 10% for $\delta t \leq 2$ (as little as 3% for the smallest learning rate), but can increase up to 30% for higher learning rates.
5.2.1 Convergence
We first check that both the vanilla and the natural gradient are able to converge to an optimum.
Figures 3 and 4 show the fitness of the best sampled point for the IGO algorithm and for the vanilla gradient at each step. Predictably, both algorithms are able to optimize the two-min function in a few steps.
The two-min function is extremely simple from an optimization viewpoint; thus, convergence speed is not the main focus here, all the more since we use a large number of $f$-calls at each step.
Note that the values of the parameter $\delta t$ used for the two gradients are not directly comparable from a theoretical viewpoint (they correspond to parametrizations of different trajectories in $\Theta$-space, and identifying the vanilla $\delta t$ with the natural $\delta t$ is meaningless). We selected larger values of $\delta t$ for the vanilla gradient, in order to obtain roughly comparable convergence speeds in practice.
5.2.2 Diversity
As we have seen, the two-min function is equally well optimized by IGO and by vanilla gradient optimization. However, the methods fare very differently when we look at the distance from the sample points to both optima. From (54), the fitness gives the distance of a sample point to the closest optimum. We now study how close sample points come to the other, opposite optimum. The distance of sample points to the second optimum is shown in Figure 5 for IGO, and in Figure 6 for the vanilla gradient.
Figure 5 shows that IGO also reaches the second optimum most of the time, and is often able to find it only a few steps after the first. This property of IGO is of course dependent on the initialization of the RBM with enough diversity. When initialized properly, so that each variable (observed and latent) has a probability 1/2 of being equal to 1, the initial RBM distribution has maximal diversity over the search space and is at equal distance from the two optima of the function. From this starting position, the natural gradient is then able to increase the likelihood of the two optima at the same time.
By stark contrast, the vanilla gradient is not able to move towards both optima at the same time, as shown in Fig. 6. In fact, the vanilla gradient only converges to one optimum at the expense of the other. For all values of $\delta t$, the distance to the second optimum increases gradually and approaches the maximum possible distance.
As mentioned earlier, each state of the latent variable $\mathbf{h}$ corresponds to a mode of the distribution. In Figures 7 and 8, we look at the average value of $\mathbf{h}$ at each gradient step. An average value close to 1/2 means that the distribution samples from both modes, $\mathbf{h} = 0$ or $\mathbf{h} = 1$, with comparable probabilities. Conversely, average values close to 0 or 1 indicate that the distribution gives most probability to one mode at the expense of the other. In Figure 7, we can see that with IGO, the average value of $\mathbf{h}$ stays close to 1/2 during the whole optimization procedure for most runs: the distribution is initialized with two modes and stays bimodal⁷. As for the vanilla gradient,
⁷In Fig. 7, we use the following adverse setting: runs are interrupted once they reach both optima, therefore the statistics are taken only over those runs which have not yet converged and reached both optima, which results in higher variation around 1/2. The plot has been stopped when less than half the runs remain. The error bars are relative only to the remaining runs.
Figure 3: Fitness of sampled points during IGO optimization, versus the number of gradient steps, for $\delta t \in \{0.25, 0.5, 1, 2, 4, 8\}$.
Figure 4: Fitness of sampled points during vanilla gradient optimization, versus the number of gradient steps, for $\delta t \in \{1, 2, 4, 8, 16, 32\}$.
Figure 5: Distance to the second optimum during IGO optimization, versus the number of gradient steps, for $\delta t \in \{0.25, 0.5, 1, 2, 4, 8\}$.
Figure 6: Distance to the second optimum during vanilla gradient optimization, versus the number of gradient steps, for $\delta t \in \{1, 2, 4, 8, 16, 32\}$.
statistics for $\mathbf{h}$ are depicted in Figure 8, and we can see that they converge to 1: one of the two modes of the distribution has been lost during optimization.
Hidden breach of symmetry by the vanilla gradient. The experiments reveal a curious phenomenon: the vanilla gradient loses multimodality by always setting the hidden variable $h$ to 1, not to 0. (We detected no obvious asymmetry on the visible units $x$, though.)
Of course, exchanging the values 0 and 1 for the hidden variables in a restricted Boltzmann machine still gives the distribution of another Boltzmann machine. More precisely, changing $h_j$ into $1 - h_j$ is equivalent to resetting $a_i \leftarrow a_i + w_{ij}$, $b_j \leftarrow -b_j$, and $w_{ij} \leftarrow -w_{ij}$. IGO and the natural gradient are impervious to such a change by Proposition 9.
The vanilla gradient implicitly relies on the Euclidean norm on parameter space, as explained in Section 1.1. For this norm, the distance between the RBM distributions $(a_i, b_j, w_{ij})$ and $(a'_i, b'_j, w'_{ij})$ is simply $\sum_i |a_i - a'_i|^2 + \sum_j |b_j - b'_j|^2 + \sum_{i,j} |w_{ij} - w'_{ij}|^2$. However, the change of variables $a_i \leftarrow a_i + w_{ij}$, $b_j \leftarrow -b_j$, $w_{ij} \leftarrow -w_{ij}$ does not preserve this Euclidean metric. Thus, exchanging 0 and 1 for the hidden variables will result in two different vanilla gradient ascents. The observed asymmetry on $h$ is a consequence of this implicit asymmetry.
The same asymmetry exists for the visible variables $x_i$; but this does not prevent convergence to an optimum in our situation, since any gradient descent eventually reaches some local optimum.
Of course it is possible to cook up parametrizations for which the vanilla gradient will be more symmetric: for instance, using $-1/1$ instead of $0/1$ for the variables, or defining the energy by
$$E(\mathbf{x}, \mathbf{h}) = -\sum_i A_i \left( x_i - \tfrac{1}{2} \right) - \sum_j B_j \left( h_j - \tfrac{1}{2} \right) - \sum_{i,j} W_{ij} \left( x_i - \tfrac{1}{2} \right)\left( h_j - \tfrac{1}{2} \right) \qquad (55)$$
with "bias-free" parameters $A_i$, $B_j$, $W_{ij}$ related to the usual parametrization by $w_{ij} = W_{ij}$, $a_i = A_i - \tfrac{1}{2}\sum_j w_{ij}$, $b_j = B_j - \tfrac{1}{2}\sum_i w_{ij}$. The vanilla gradient might perform better in these parametrizations.
However, we chose a naive approach: we used a family of probability distributions found in the literature, with the parametrization found in the literature. We then used the vanilla gradient and the natural gradient on these distributions. This directly illustrates the specific influence of the chosen gradient (the two implementations only differ by the inclusion of the Fisher matrix). It is remarkable, we think, that the natural gradient is able to recover symmetry where there was none.
5.3 Convergence to the continuous-time limit
In the previous figures, it looks like changing the parameter $\delta t$ only results in a time speedup of the plots.
Update rules of the type π β π + πΏπ‘βππ (for either gradient) are Eulerapproximations of the continuous-time ordinary differential equation dπ
dπ‘ =βππ, with each iteration corresponding to an increment πΏπ‘ of the time π‘.Thus, it is expected that for small enough πΏπ‘, the algorithm after π stepsapproximates the IGO flow or vanilla gradient flow at time π‘ = π.πΏπ‘.
Figures 9 and 10 illustrate this convergence: we show the fitness w.r.t. $\delta t$ times the number of gradient steps. An asymptotic trajectory seems to
(The plot has been stopped when less than half the runs remain. The error bars are relative only to the remaining runs.)

Figure 7: Average value of $h$ during IGO optimization, for $\delta t$ ranging from $0.25$ to $8$ (horizontal axis: gradient steps, $0$ to $50$; vertical axis: average value of $h$, $0$ to $1$).
Figure 8: Average value of $h$ during vanilla gradient optimization, for $\delta t$ ranging from $1$ to $32$ (horizontal axis: gradient steps, $0$ to $50$; vertical axis: average value of $h$, $0$ to $1$).
emerge when $\delta t$ decreases. For the natural gradient, this asymptotic trajectory can be interpreted as the fitness of samples of the continuous-time IGO flow.

Thus, for this kind of optimization algorithm, it makes theoretical sense to plot the results according to the "intrinsic time" of the underlying continuous-time object, to illustrate properties that do not depend on the setting of the parameter $\delta t$. (Still, the raw number of steps is more directly related to algorithmic cost.)
6 Further discussion

A single framework for optimization on arbitrary spaces. A strength of the IGO viewpoint is to automatically provide optimization algorithms using any family of probability distributions on any given space, discrete or continuous. This has been illustrated with restricted Boltzmann machines. IGO algorithms also feature good invariance properties and involve a minimal number of arbitrary choices.
In particular, IGO unifies several well-known optimization algorithms into a single framework. For instance, to the best of our knowledge, PBIL has never been described as a natural gradient ascent in the literature.8
For Gaussian measures, algorithms of the same form (14) had been developed previously [HO01, WSPS08] and their close relationship with a natural gradient ascent had been recognized [ANOK10, GSS+10].
The wide applicability of natural gradient approaches seems not to be widely known in the optimization community, though see [MMS08].
About quantiles. The IGO flow, to the best of our knowledge, has not been defined before. The introduction of the quantile-rewriting (3) of the objective function provides the first rigorous derivation of quantile- or rank-based natural optimization from a gradient ascent in $\theta$-space.
Indeed, NES and CMA-ES have been claimed to maximize $-\mathbb{E}_{P_\theta} f$ via natural gradient ascent [WSPS08, ANOK10]. However, we have proved that when the number of samples is large and the step size is small, the NES and CMA-ES updates converge to the IGO flow, not to the similar flow with the gradient of $\mathbb{E}_{P_\theta} f$ (Theorem 4). So we find that in reality these algorithms maximize $\mathbb{E}_{P_\theta} W^f_{\theta^t}$, where $W^f_{\theta^t}$ is a decreasing transformation of the $f$-quantiles under the current sample distribution.

Also, in practice, maximizing $-\mathbb{E}_{P_\theta} f$ is a rather unstable procedure and has been discouraged, see for example [Whi89].
About choice of $P_\theta$: learning a model of good points. The choice of the family of probability distributions $P_\theta$ plays a double role.

First, it is analogous to a mutation operator as seen in evolutionary algorithms: indeed, $P_\theta$ encodes possible moves according to which new sample points are explored.

Second, optimization algorithms using distributions can be interpreted as learning a probabilistic model of where the points with good values lie in the search space. With this point of view, $P_\theta$ describes the richness of this model: for instance, restricted Boltzmann machines with $h$ hidden units can describe distributions with up to $2^h$ modes, whereas the Bernoulli distribution used in PBIL is unimodal. This influences, for instance, the ability to explore several valleys and optimize multimodal functions in a single run.
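The contrast between a multimodal RBM and the unimodal Bernoulli model can be made concrete on a toy example. The sketch below counts modes (strict local maxima under single-bit flips) of the marginal $P(x) \propto \sum_h e^{-E(x,h)}$ of a hypothetical RBM with 2 visible units and 1 hidden unit, with hand-picked (not learned) parameters, against a product-of-Bernoulli model as used in PBIL:

```python
from itertools import product
from math import exp

def local_maxima(prob):
    """Count modes of a distribution on {0,1}^n: strict local maxima
    with respect to flipping a single bit."""
    n = len(next(iter(prob)))
    modes = 0
    for x in prob:
        neighbors = [x[:i] + (1 - x[i],) + x[i+1:] for i in range(n)]
        if all(prob[x] > prob[y] for y in neighbors):
            modes += 1
    return modes

# Tiny RBM: energy E(x,h) = -a.x - b h - sum_i w_i x_i h, so the
# (unnormalized) marginal is sum_h exp(a.x + b h + w.x h).
a, b, w = (-3.0, -3.0), -3.0, (4.0, 4.0)
rbm = {x: sum(exp(sum(a[i]*x[i] for i in range(2)) + b*h
                  + sum(w[i]*x[i]*h for i in range(2)))
              for h in (0, 1))
       for x in product((0, 1), repeat=2)}

# Product of Bernoulli distributions: always a single mode.
p = (0.7, 0.3)
bern = {x: (p[0] if x[0] else 1 - p[0]) * (p[1] if x[1] else 1 - p[1])
        for x in product((0, 1), repeat=2)}

assert local_maxima(rbm) == 2    # two modes, at (0,0) and (1,1)
assert local_maxima(bern) == 1
```

With one hidden unit the RBM marginal already exhibits two modes, consistent with the $2^h$ bound quoted above.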
8 Thanks to Jonathan Shapiro for an early argument confirming this property (personal communication).
Figure 9: Fitness of sampled points w.r.t. "intrinsic time" during IGO optimization, for $\delta t$ ranging from $0.25$ to $8$ (horizontal axis: gradient steps $\times\, \delta t$, $0$ to $30$; vertical axis: fitness, $0$ to $10$).
Figure 10: Fitness of sampled points w.r.t. "intrinsic time" during vanilla gradient optimization, for $\delta t$ ranging from $1$ to $32$ (horizontal axis: gradient steps $\times\, \delta t/4$, $0$ to $30$; vertical axis: fitness, $0$ to $10$).
Natural gradient and parametrization invariance. Central to IGO is the use of the natural gradient, which follows from $\theta$-invariance and makes sense on any search space, discrete or continuous.

While the IGO flow is exactly $\theta$-invariant, for any practical implementation of an IGO algorithm, a parametrization choice has to be made. Still, since all IGO algorithms approximate the IGO flow, two parametrizations of IGO will differ less than two parametrizations of another algorithm (such as the vanilla gradient or the smoothed CEM method), at least if the learning rate $\delta t$ is not too large.

On the other hand, natural evolution strategies have never strived for $\theta$-invariance: the chosen parametrization (Cholesky, exponential) has been deemed a relevant feature. In the IGO framework, the chosen parametrization becomes more relevant as the step size $\delta t$ increases.
IGO, maximum likelihood and cross-entropy. The cross-entropy method (CEM) [dBKMR05] can be used to produce optimization algorithms given a family of probability distributions on an arbitrary space, by performing a jump to a maximum likelihood estimate of the parameters.

We have seen (Corollary 16) that the standard CEM is an IGO algorithm in a particular parametrization, with a learning rate $\delta t$ equal to $1$. However, it is well known, both theoretically and experimentally [BLS07, Han06b, WAS04], that standard CEM loses diversity too fast in many situations. The usual solution [dBKMR05] is to reduce the learning rate (smoothed CEM, (25)), but this breaks the reparametrization invariance.

On the other hand, the IGO flow can be seen as a maximum likelihood update with infinitesimal learning rate (Theorem 13). This interpretation allows us to define a particular IGO algorithm, the IGO-ML (Definition 14): it performs a maximum likelihood update with an arbitrary learning rate, and keeps the reparametrization invariance. It coincides with CEM when the learning rate is set to $1$, but it differs from smoothed CEM by the exchange of the order of argmax and averaging (compare (23) and (25)). We argue that this new algorithm may be a better way to reduce the learning rate and achieve smoothing in CEM.

Standard CEM can lose diversity, yet is a particular case of an IGO algorithm: this illustrates the fact that reasonable values of the learning rate $\delta t$ depend on the parametrization. We have studied this phenomenon in detail for various Gaussian IGO algorithms (Section 4.2).
Why would a smaller learning rate perform better than a large one in an optimization setting? It might seem more efficient to jump directly to the maximum likelihood estimate of currently known good points, instead of performing a slow gradient ascent towards this maximum.
However, optimization faces a "moving target", contrary to a learning setting in which the example distribution is often stationary. Currently known good points are likely not to indicate the position at which the optimum lies, but, rather, the direction in which the optimum is to be found. After an update, the next elite sample points are going to be located somewhere new. So the goal is certainly not to settle down around these currently known points, as a maximum likelihood update does: by design, CEM only tries to reflect the status quo (even for $N = \infty$), whereas IGO tries to move somewhere. When the target moves over time, a progressive gradient ascent is more reasonable than an immediate jump to a temporary optimum, and realizes a kind of time smoothing.
This phenomenon is most clear when the number of sample points is small.
Then, a full maximum likelihood update risks losing a lot of diversity; it may even produce a degenerate distribution if the number of sample points is smaller than the number of parameters of the distribution. On the other hand, for smaller $\delta t$, the IGO algorithms do, by design, try to maintain diversity by moving as little as possible from the current distribution $P_\theta$ in Kullback–Leibler divergence. A full ML update disregards the current distribution and tries to move as close as possible to the elite sample in Kullback–Leibler divergence [dBKMR05], thus realizing maximal diversity loss. This makes sense in a non-iterated scenario but is unsuited for optimization.
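This diversity loss can be illustrated on the simplest possible case, a single Bernoulli parameter (a sketch, not the full parametrization-independent IGO-ML update): a full ML jump onto a small all-ones elite sample collapses the variance to zero, while a small step toward the same estimate, in the spirit of smoothed CEM (25) in the $p$ parametrization, retains most of it.

```python
elite = [1, 1, 1]          # few elite samples -> the ML estimate is extreme
p0 = 0.5                   # current Bernoulli parameter, maximal diversity

p_ml = sum(elite) / len(elite)           # full ML jump: p = 1, degenerate
delta_t = 0.1
p_smooth = p0 + delta_t * (p_ml - p0)    # small step toward the ML estimate

def variance(p):
    # Bernoulli variance p(1-p), used here as a diversity proxy
    return p * (1 - p)

assert variance(p_ml) == 0.0        # all diversity lost in one update
assert variance(p_smooth) > 0.2     # most diversity retained
```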
Diversity and multiple optima. The IGO framework emphasizes the relation of natural gradient and diversity: we argued that IGO provides minimal diversity change for a given objective function increment. In particular, provided the initial diversity is large, diversity is kept at a maximum. This theoretical relationship has been confirmed experimentally for restricted Boltzmann machines.
On the other hand, using the vanilla gradient does not lead to a balanced distribution between the two optima in our experiments. The vanilla gradient introduces hidden arbitrary choices between those points (more exactly, between moves in $\Theta$-space). This results in a loss of diversity, and might also be detrimental at later stages of the optimization. This may reflect the fact that the Euclidean metric on the space of parameters, implicitly used in the vanilla gradient, becomes less and less meaningful for gradient descent on complex distributions.
IGO and the natural gradient are certainly relevant to the well-known problem of exploration-exploitation balance: as we have seen, arguably the natural gradient realizes the best increment of the objective function with the least possible change of diversity in the population.
More generally, IGO tries to learn a model of where the good points are, representing all good points seen so far rather than focusing only on some good points; this is typical of machine learning, one of the contexts for which the natural gradient was studied. The conceptual relationship of IGO and IGO-like optimization algorithms with machine learning is still to be explored and exploited.
Summary and conclusion

We sum up:
• The information-geometric optimization (IGO) framework derives from invariance principles and allows one to build optimization algorithms from any family of distributions on any search space. In some instances (Gaussian distributions on $\mathbb{R}^d$ or Bernoulli distributions on $\{0,1\}^d$) it recovers versions of known algorithms (CMA-ES or PBIL); in other instances (restricted Boltzmann machine distributions) it produces new, hopefully efficient optimization algorithms.
• The use of a quantile-based, time-dependent transform of the objective function (Equation (3)) provides a rigorous derivation of rank-based update rules currently used in optimization algorithms. Theorem 4 uniquely identifies the infinite-population limit of these update rules.
• The IGO flow is singled out by its equivalent description as an infinitesimal weighted maximal log-likelihood update (Theorem 13). In a particular parametrization and with a step size of $1$, it recovers the cross-entropy method (Corollary 16).
• Theoretical arguments suggest that the IGO flow minimizes the change of diversity in the course of optimization. In particular, starting with high diversity and using multimodal distributions may allow simultaneous exploration of multiple optima of the objective function. Preliminary experiments with restricted Boltzmann machines confirm this effect in a simple situation.
Thus, the IGO framework is an attempt to provide sound theoretical foundations for optimization algorithms based on probability distributions. In particular, this viewpoint helps to bridge the gap between continuous and discrete optimization.
The invariance properties, which reduce the number of arbitrary choices, together with the relationship between natural gradient and diversity, may contribute to a theoretical explanation of the good practical performance of those currently used algorithms, such as CMA-ES, which can be interpreted as instantiations of IGO.
We hope that invariance properties will acquire in computer science the importance they have in mathematics, where intrinsic thinking is the first step for abstract linear algebra or differential geometry, and in modern physics, where the notions of invariance w.r.t. the coordinate system and so-called gauge invariance play central roles.
Acknowledgements

The authors would like to thank Michèle Sebag for the acronym and for helpful comments. Y. O. would like to thank Cédric Villani and Bruno Sévennec for helpful discussions on the Fisher metric. A. A. and N. H. would like to acknowledge the Dagstuhl Seminar No 10361 on the Theory of Evolutionary Computation (http://www.dagstuhl.de/10361) for inspiring their work on natural gradients and beyond. This work was partially supported by the ANR-2010-COSI-002 grant (SIMINOLE) of the French National Research Agency.
Appendix: Proofs
Proof of Theorem 4 (Convergence of empirical means and quantiles)
Let us give a more precise statement including the necessary regularity conditions.
Proposition 18. Let $\theta \in \Theta$. Assume that the derivative $\frac{\partial \ln P_\theta(x)}{\partial \theta}$ exists for $P_\theta$-almost all $x \in X$ and that $\mathbb{E}_{P_\theta} \left\| \frac{\partial \ln P_\theta(x)}{\partial \theta} \right\|^2 < +\infty$. Assume that the function $w$ is non-increasing and bounded.

Let $(x_i)_{i \in \mathbb{N}}$ be a sequence of independent samples of $P_\theta$. Then with probability $1$, as $N \to \infty$ we have
$$\frac{1}{N} \sum_{i=1}^N \widehat{W}^f(x_i)\, \frac{\partial \ln P_\theta(x_i)}{\partial \theta} \;\longrightarrow\; \int W^f_\theta(x)\, \frac{\partial \ln P_\theta(x)}{\partial \theta}\, P_\theta(\mathrm{d}x)$$
where $\widehat{W}^f(x_i) = w\!\left(\frac{\mathrm{rk}_N(x_i) + 1/2}{N}\right)$ with $\mathrm{rk}_N(x_i) = \#\{1 \leq j \leq N,\ f(x_j) < f(x_i)\}$. (When there are $f$-ties in the sample, $\widehat{W}^f(x_i)$ is defined as the average of $w((r + 1/2)/N)$ over the possible rankings $r$ of $x_i$.)
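The rank-based weights $\widehat{W}^f(x_i) = w((\mathrm{rk}_N(x_i) + 1/2)/N)$, with the tie-averaging convention of Proposition 18, can be sketched as follows (the step selection scheme $w(q) = \mathbb{1}_{q \leq 1/4}$ below is a hypothetical choice for illustration):

```python
def rank_weights(fvals, w):
    """Compute hat-W^f(x_i) = w((rk(x_i)+1/2)/N) for each sample, averaging
    over the possible rankings when there are f-ties."""
    N = len(fvals)
    weights = []
    for fi in fvals:
        below = sum(1 for fj in fvals if fj < fi)   # rk(x_i)
        ties = sum(1 for fj in fvals if fj == fi)   # includes x_i itself
        # average w((r + 1/2)/N) over the ranks the tied points could occupy
        weights.append(sum(w((below + k + 0.5) / N) for k in range(ties)) / ties)
    return weights

w = lambda q: 1.0 if q <= 0.25 else 0.0   # hypothetical selection scheme
print(rank_weights([3.0, 1.0, 2.0, 2.0], w))   # → [0.0, 1.0, 0.0, 0.0]
```

When a group of ties straddles the $1/4$ quantile, its members share the weight: `rank_weights([1.0, 1.0, 2.0, 2.0], w)` gives `[0.5, 0.5, 0.0, 0.0]`.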
Proof. Let $g : X \to \mathbb{R}$ be any function with $\mathbb{E}_{P_\theta} g^2 < \infty$. We will show that
$$\frac{1}{N} \sum_i \widehat{W}^f(x_i)\, g(x_i) \longrightarrow \int W^f_\theta(x)\, g(x)\, P_\theta(\mathrm{d}x).$$
Applying this with $g$ equal to the components of $\frac{\partial \ln P_\theta(x)}{\partial \theta}$ will yield the result.

Let us decompose
$$\frac{1}{N} \sum_i \widehat{W}^f(x_i)\, g(x_i) = \frac{1}{N} \sum_i W^f_\theta(x_i)\, g(x_i) + \frac{1}{N} \sum_i \big( \widehat{W}^f(x_i) - W^f_\theta(x_i) \big)\, g(x_i).$$

Each summand in the first term involves only one sample $x_i$ (contrary to $\widehat{W}^f(x_i)$, which depends on the whole sample). So by the strong law of large numbers, almost surely $\frac{1}{N} \sum_i W^f_\theta(x_i)\, g(x_i)$ converges to $\int W^f_\theta(x)\, g(x)\, P_\theta(\mathrm{d}x)$. So we have to show that the second term converges to $0$ almost surely.

By the Cauchy–Schwarz inequality, we have
$$\Big( \frac{1}{N} \sum_i \big( \widehat{W}^f(x_i) - W^f_\theta(x_i) \big)\, g(x_i) \Big)^2 \leq \Big( \frac{1}{N} \sum_i \big( \widehat{W}^f(x_i) - W^f_\theta(x_i) \big)^2 \Big) \Big( \frac{1}{N} \sum_i g(x_i)^2 \Big).$$

By the strong law of large numbers, the second factor $\frac{1}{N} \sum_i g(x_i)^2$ converges to $\mathbb{E}_{P_\theta} g^2$ almost surely. So we have to prove that the first factor $\frac{1}{N} \sum_i \big( \widehat{W}^f(x_i) - W^f_\theta(x_i) \big)^2$ converges to $0$ almost surely.

Since $w$ is bounded by assumption, we can write
$$\big( \widehat{W}^f(x_i) - W^f_\theta(x_i) \big)^2 \leq 2B\, \big| \widehat{W}^f(x_i) - W^f_\theta(x_i) \big| = 2B\, \big( \widehat{W}^f(x_i) - W^f_\theta(x_i) \big)_+ + 2B\, \big( \widehat{W}^f(x_i) - W^f_\theta(x_i) \big)_-$$
where $B$ is the bound on $|w|$. We will bound each of these terms.

Let us abbreviate $q^-_i = \Pr_{x' \sim P_\theta}(f(x') < f(x_i))$, $q^+_i = \Pr_{x' \sim P_\theta}(f(x') \leq f(x_i))$, $r^-_i = \#\{j \leq N,\ f(x_j) < f(x_i)\}$, $r^+_i = \#\{j \leq N,\ f(x_j) \leq f(x_i)\}$.

By definition of $\widehat{W}^f$ we have
$$\widehat{W}^f(x_i) = \frac{1}{r^+_i - r^-_i} \sum_{r = r^-_i}^{r^+_i - 1} w((r + 1/2)/N)$$
and moreover $W^f_\theta(x_i) = w(q^-_i)$ if $q^-_i = q^+_i$, or $W^f_\theta(x_i) = \frac{1}{q^+_i - q^-_i} \int_{q^-_i}^{q^+_i} w$ otherwise.

The Glivenko–Cantelli theorem states that $\sup_i \big| r^+_i/N - q^+_i \big|$ tends to $0$ almost surely, and likewise for $\sup_i \big| r^-_i/N - q^-_i \big|$. So let $N$ be large enough so that these errors are bounded by $\varepsilon$.

Since $w$ is non-increasing, we have $w(q^-_i) \leq w(r^-_i/N - \varepsilon)$. In the case $q^-_i \neq q^+_i$, we decompose the interval $[q^-_i;\, q^+_i]$ into $(r^+_i - r^-_i)$ subintervals. The average of $w$ over each such subinterval is compared to a term in the sum defining $\widehat{W}^f(x_i)$: since $w$ is non-increasing, the average of $w$ over the $r$th subinterval is at most $w((r^-_i + r)/N - \varepsilon)$. So we get
$$W^f_\theta(x_i) \leq \frac{1}{r^+_i - r^-_i} \sum_{r = r^-_i}^{r^+_i - 1} w(r/N - \varepsilon)$$
so that
$$W^f_\theta(x_i) - \widehat{W}^f(x_i) \leq \frac{1}{r^+_i - r^-_i} \sum_{r = r^-_i}^{r^+_i - 1} \big( w(r/N - \varepsilon) - w((r + 1/2)/N) \big).$$

Let us sum over $i$, remembering that there are $(r^+_i - r^-_i)$ values of $j$ for which $f(x_j) = f(x_i)$. Taking the positive part, we get
$$\frac{1}{N} \sum_{i=1}^N \big( W^f_\theta(x_i) - \widehat{W}^f(x_i) \big)_+ \leq \frac{1}{N} \sum_{r=0}^{N-1} \big( w(r/N - \varepsilon) - w((r + 1/2)/N) \big).$$

Since $w$ is non-increasing we have
$$\frac{1}{N} \sum_{r=0}^{N-1} w(r/N - \varepsilon) \leq \int_{-\varepsilon - 1/N}^{1 - \varepsilon - 1/N} w$$
and
$$\frac{1}{N} \sum_{r=0}^{N-1} w((r + 1/2)/N) \geq \int_{1/2N}^{1 + 1/2N} w$$
(we implicitly extend the range of $w$ so that $w(q) = w(0)$ for $q < 0$). So we have
$$\frac{1}{N} \sum_{i=1}^N \big( W^f_\theta(x_i) - \widehat{W}^f(x_i) \big)_+ \leq \int_{-\varepsilon - 1/N}^{1/2N} w - \int_{1 - \varepsilon - 1/N}^{1 + 1/2N} w \leq (2\varepsilon + 3/N)\, B$$
where $B$ is the bound on $|w|$.

Reasoning symmetrically with $w(r/N + \varepsilon)$ and the inequalities reversed, we get a similar bound for $\frac{1}{N} \sum_i \big( W^f_\theta(x_i) - \widehat{W}^f(x_i) \big)_-$. This ends the proof. □
Proof of Proposition 5 (Quantile improvement)
Let us use the weight $w(u) = \mathbb{1}_{u \leq q}$. Let $m$ be the value of the $q$-quantile of $f$ under $P_{\theta^t}$. We want to show that the value of the $q$-quantile of $f$ under $P_{\theta^{t + \delta t}}$ is less than $m$, unless the gradient vanishes and the IGO flow is stationary.

Let $p_- = \Pr_{x \sim P_{\theta^t}}(f(x) < m)$, $p_m = \Pr_{x \sim P_{\theta^t}}(f(x) = m)$ and $p_+ = \Pr_{x \sim P_{\theta^t}}(f(x) > m)$. By definition of the quantile value we have $p_- + p_m \geq q$ and $p_+ + p_m \geq 1 - q$. Let us assume that we are in the more complicated case $p_m \neq 0$ (for the case $p_m = 0$, simply remove the corresponding terms).

We have $W^f_{\theta^t}(x) = 1$ if $f(x) < m$, $W^f_{\theta^t}(x) = 0$ if $f(x) > m$, and
$$W^f_{\theta^t}(x) = \frac{1}{p_m} \int_{p_-}^{p_- + p_m} w(u)\, \mathrm{d}u = \frac{q - p_-}{p_m}$$
if $f(x) = m$.

Using the same notation as above, let $g^t(\theta) = \int W^f_{\theta^t}(x)\, P_\theta(\mathrm{d}x)$. Decomposing this integral on the three sets $f(x) < m$, $f(x) = m$ and $f(x) > m$, we get that $g^t(\theta) = \Pr_{x \sim P_\theta}(f(x) < m) + \Pr_{x \sim P_\theta}(f(x) = m)\, \frac{q - p_-}{p_m}$. In particular, $g^t(\theta^t) = q$.

Since we follow a gradient ascent of $g^t$, for $\delta t$ small enough we have $g^t(\theta^{t + \delta t}) \geq g^t(\theta^t)$ unless the gradient vanishes. If the gradient vanishes we have $\theta^{t + \delta t} = \theta^t$ and the quantiles are the same. Otherwise we get $g^t(\theta^{t + \delta t}) > g^t(\theta^t) = q$.

Since $\frac{q - p_-}{p_m} \leq \frac{(p_- + p_m) - p_-}{p_m} = 1$, we have $g^t(\theta) \leq \Pr_{x \sim P_\theta}(f(x) < m) + \Pr_{x \sim P_\theta}(f(x) = m) = \Pr_{x \sim P_\theta}(f(x) \leq m)$.

So $\Pr_{x \sim P_{\theta^{t + \delta t}}}(f(x) \leq m) \geq g^t(\theta^{t + \delta t}) > q$. This implies, by definition, that the $q$-quantile value of $P_{\theta^{t + \delta t}}$ is smaller than $m$. □
Proof of Proposition 10 (Speed of the IGO flow)
Lemma 19. Let $X$ be a centered $L^2$ random variable with values in $\mathbb{R}^d$ and let $A$ be a real-valued $L^2$ random variable. Then
$$\| \mathbb{E}(A X) \| \leq \sqrt{\lambda\, \operatorname{Var} A}$$
where $\lambda$ is the largest eigenvalue of the covariance matrix of $X$ expressed in an orthonormal basis.

Proof of the lemma. Let $v$ be any vector in $\mathbb{R}^d$; its norm satisfies
$$\| v \| = \sup_{w,\, \|w\| \leq 1} (v \cdot w)$$
and in particular
$$\begin{aligned}
\| \mathbb{E}(A X) \| &= \sup_{w,\, \|w\| \leq 1} (w \cdot \mathbb{E}(A X)) \\
&= \sup_{w,\, \|w\| \leq 1} \mathbb{E}(A\, (w \cdot X)) \\
&= \sup_{w,\, \|w\| \leq 1} \mathbb{E}((A - \mathbb{E}A)\, (w \cdot X)) \qquad \text{since } (w \cdot X) \text{ is centered} \\
&\leq \sup_{w,\, \|w\| \leq 1} \sqrt{\operatorname{Var} A}\, \sqrt{\mathbb{E}((w \cdot X)^2)}
\end{aligned}$$
by the Cauchy–Schwarz inequality, since $\mathbb{E}((A - \mathbb{E}A)^2) = \operatorname{Var} A$.

Now, in an orthonormal basis we have $(w \cdot X) = \sum_i w_i X_i$, so that
$$\mathbb{E}((w \cdot X)^2) = \mathbb{E}\Big( \Big(\sum_i w_i X_i\Big) \Big(\sum_j w_j X_j\Big) \Big) = \sum_i \sum_j \mathbb{E}(w_i X_i\, w_j X_j) = \sum_i \sum_j w_i w_j\, \mathbb{E}(X_i X_j) = \sum_i \sum_j w_i w_j\, C_{ij}$$
with $C_{ij}$ the covariance matrix of $X$. The latter expression is the scalar product $(w \cdot C w)$. Since $C$ is a symmetric positive-semidefinite matrix, $(w \cdot C w)$ is at most $\lambda \|w\|^2$ with $\lambda$ the largest eigenvalue of $C$. □

For the IGO flow we have $\frac{\mathrm{d}\theta^t}{\mathrm{d}t} = \mathbb{E}_{x \sim P_\theta}\, W^f_\theta(x)\, \widetilde{\nabla}_\theta \ln P_\theta(x)$. So applying the lemma, we get that the norm of $\frac{\mathrm{d}\theta}{\mathrm{d}t}$ is at most $\sqrt{\lambda\, \operatorname{Var}_{x \sim P_\theta} W^f_\theta(x)}$ where $\lambda$ is the largest eigenvalue of the covariance matrix of $\widetilde{\nabla}_\theta \ln P_\theta(x)$ (expressed in a coordinate system where the Fisher matrix at the current point $\theta$ is the identity).

By construction of the quantiles, we have $\operatorname{Var}_{x \sim P_\theta} W^f_\theta(x) \leq \operatorname{Var}_{[0,1]} w$ (with equality unless there are ties). Indeed, for a given $x$, let $\mathcal{U}$ be a uniform random variable in $[0,1]$ independent from $x$ and define the random variable $Q = q^-(x) + (q^+(x) - q^-(x))\, \mathcal{U}$. Then $Q$ is uniformly distributed between the upper and lower quantiles $q^+(x)$ and $q^-(x)$ and thus we can rewrite $W^f_\theta(x)$ as $\mathbb{E}(w(Q) \mid x)$. By the Jensen inequality we have $\operatorname{Var} W^f_\theta(x) = \operatorname{Var} \mathbb{E}(w(Q) \mid x) \leq \operatorname{Var} w(Q)$. In addition, when $x$ is taken under $P_\theta$, $Q$ is uniformly distributed in $[0,1]$ and thus $\operatorname{Var} w(Q) = \operatorname{Var}_{[0,1]} w$, i.e. $\operatorname{Var}_{x \sim P_\theta} W^f_\theta(x) \leq \operatorname{Var}_{[0,1]} w$.

Besides, consider the tangent space in $\Theta$-space at point $\theta^t$, and let us choose an orthonormal basis in this tangent space for the Fisher metric. Then, in this basis, we have $\widetilde{\nabla}_\theta \ln P_\theta(x) = \partial_\theta \ln P_\theta(x)$. So the covariance matrix of $\partial_\theta \ln P_\theta(x)$ is $\mathbb{E}_{x \sim P_\theta}\big(\partial_\theta \ln P_\theta(x)\, \partial_\theta \ln P_\theta(x)^\top\big)$, which is equal to the Fisher matrix by definition. So this covariance matrix is the identity, whose largest eigenvalue is $1$. □
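Lemma 19 can be checked numerically on a small finite probability space (the values below are chosen by hand; the largest eigenvalue of the $2 \times 2$ covariance matrix is computed in closed form from its trace and determinant):

```python
from math import sqrt

# Three equally likely outcomes; X in R^2 is centered by construction.
probs = [1/3, 1/3, 1/3]
X = [(1.0, 0.0), (-1.0, 1.0), (0.0, -1.0)]
A = [2.0, -1.0, 0.5]

# Sanity check: X is centered.
assert all(abs(sum(p*x[k] for p, x in zip(probs, X))) < 1e-12 for k in range(2))

EA = sum(p*a for p, a in zip(probs, A))
varA = sum(p*(a - EA)**2 for p, a in zip(probs, A))
EAX = [sum(p*a*x[k] for p, a, x in zip(probs, A, X)) for k in range(2)]
lhs = sqrt(sum(c*c for c in EAX))          # ||E(A X)||

# Covariance matrix of X and its largest eigenvalue (2x2 closed form).
C = [[sum(p*x[i]*x[j] for p, x in zip(probs, X)) for j in range(2)]
     for i in range(2)]
tr = C[0][0] + C[1][1]
det = C[0][0]*C[1][1] - C[0][1]*C[1][0]
lam = tr/2 + sqrt((tr/2)**2 - det)

assert lhs <= sqrt(lam * varA) + 1e-12     # the bound of Lemma 19
```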
Proof of Proposition 12 (Noisy IGO)
On the one hand, let $P_\theta$ be a family of distributions on $X$. The IGO algorithm (13) applied to a random function $f(x) = \tilde{f}(x, \omega)$, where $\omega$ is a random variable uniformly distributed in $[0,1]$, reads
$$\theta^{t + \delta t} = \theta^t + \delta t \sum_{i=1}^N \widehat{w}_i\, \widetilde{\nabla}_\theta \ln P_\theta(x_i) \qquad (56)$$
where $x_i \sim P_\theta$ and $\widehat{w}_i$ is computed according to (12), where ranking is applied to the values $\tilde{f}(x_i, \omega_i)$, with $\omega_i$ uniform variables in $[0,1]$ independent from $x_i$ and from each other.

On the other hand, for the IGO algorithm using $P_\theta \otimes U_{[0,1]}$ and applied to the deterministic function $\tilde{f}$, $\widehat{w}_i$ is computed using the ranking according to the $\tilde{f}$ values of the sampled points $\tilde{x}_i = (x_i, \omega_i)$, and thus coincides with the one in (56).

Besides,
$$\widetilde{\nabla}_\theta \ln \big(P_\theta \otimes U_{[0,1]}\big)(\tilde{x}_i) = \widetilde{\nabla}_\theta \ln P_\theta(x_i) + \underbrace{\widetilde{\nabla}_\theta \ln U_{[0,1]}(\omega_i)}_{=\,0}$$
and thus the IGO algorithm update on space $X \times [0,1]$, using the family of distributions $\tilde{P}_\theta = P_\theta \otimes U_{[0,1]}$, applied to the deterministic function $\tilde{f}$, coincides with (56). □
Proof of Theorem 13 (Natural gradient as ML with infinitesimal weights)
We begin with a calculus lemma (proof omitted).

Lemma 20. Let $f$ be a real-valued function on a finite-dimensional vector space $E$ equipped with a positive definite quadratic form $\|\cdot\|^2$. Assume $f$ is smooth and has at most quadratic growth at infinity. Then, for any $x \in E$, we have
$$\nabla f(x) = \lim_{\varepsilon \to 0} \frac{1}{\varepsilon} \operatorname{arg\,max}_h \left\{ f(x + h) - \frac{1}{2\varepsilon} \|h\|^2 \right\}$$
where $\nabla$ is the gradient associated with the norm $\|\cdot\|$. Equivalently,
$$\operatorname{arg\,max}_y \left\{ f(y) - \frac{1}{2\varepsilon} \|y - x\|^2 \right\} = x + \varepsilon\, \nabla f(x) + O(\varepsilon^2)$$
when $\varepsilon \to 0$.
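Lemma 20 can be checked in one dimension with the Euclidean norm, where the regularized argmax has a closed form: for $f(y) = -y^2$ (smooth, quadratic growth), the maximizer of $f(y) - \frac{1}{2\varepsilon}(y - x)^2$ is $y^* = x/(1 + 2\varepsilon)$, which matches $x + \varepsilon f'(x)$ up to $O(\varepsilon^2)$.

```python
# For f(y) = -y^2, setting the derivative of f(y) - (y-x)^2/(2 eps) to zero
# gives -2y - (y - x)/eps = 0, i.e. y* = x / (1 + 2 eps).
x = 1.0
for eps in (0.1, 0.01, 0.001):
    y_star = x / (1 + 2*eps)            # exact argmax
    first_order = x + eps * (-2*x)      # x + eps * f'(x)
    # y* - first_order = 4 eps^2 x + O(eps^3), so the gap is O(eps^2).
    assert abs(y_star - first_order) <= 10 * eps**2
```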
We are now ready to prove Theorem 13. Let $W$ be a function of $x$, and fix some $\theta_0$ in $\Theta$.

We need some regularity assumptions: we assume that the parameter space $\Theta$ is non-degenerate (no two points $\theta \in \Theta$ define the same probability distribution) and proper (the map $P_\theta \mapsto \theta$ is continuous). We also assume that the map $\theta \mapsto P_\theta$ is smooth enough, so that $\int \log P_\theta(x)\, W(x)\, P_{\theta_0}(\mathrm{d}x)$ is a smooth function of $\theta$. (These are restrictions on $\theta$-regularity: this does not mean that $W$ has to be continuous as a function of $x$.)

The two statements of Theorem 13 using a sum and an integral have similar proofs, so we only include the first. For $\varepsilon > 0$, let $\theta$ be the solution of
$$\theta = \operatorname{arg\,max} \Big\{ (1 - \varepsilon) \int \log P_\theta(x)\, P_{\theta_0}(\mathrm{d}x) + \varepsilon \int \log P_\theta(x)\, W(x)\, P_{\theta_0}(\mathrm{d}x) \Big\}.$$
Then we have
$$\begin{aligned}
\theta &= \operatorname{arg\,max} \Big\{ \int \log P_\theta(x)\, P_{\theta_0}(\mathrm{d}x) + \varepsilon \int \log P_\theta(x)\, (W(x) - 1)\, P_{\theta_0}(\mathrm{d}x) \Big\} \\
&= \operatorname{arg\,max} \Big\{ \int \log P_\theta(x)\, P_{\theta_0}(\mathrm{d}x) - \int \log P_{\theta_0}(x)\, P_{\theta_0}(\mathrm{d}x) + \varepsilon \int \log P_\theta(x)\, (W(x) - 1)\, P_{\theta_0}(\mathrm{d}x) \Big\} \\
&\qquad \text{(because the added term does not depend on } \theta \text{)} \\
&= \operatorname{arg\,max} \Big\{ -\mathrm{KL}(P_{\theta_0}\, \|\, P_\theta) + \varepsilon \int \log P_\theta(x)\, (W(x) - 1)\, P_{\theta_0}(\mathrm{d}x) \Big\} \\
&= \operatorname{arg\,max} \Big\{ -\tfrac{1}{\varepsilon}\, \mathrm{KL}(P_{\theta_0}\, \|\, P_\theta) + \int \log P_\theta(x)\, (W(x) - 1)\, P_{\theta_0}(\mathrm{d}x) \Big\}.
\end{aligned}$$

When $\varepsilon \to 0$, the first term exceeds the second one if $\mathrm{KL}(P_{\theta_0} \| P_\theta)$ is too large (because $W$ is bounded), and so $\mathrm{KL}(P_{\theta_0} \| P_\theta)$ tends to $0$. So we can assume that $\theta$ is close to $\theta_0$.

When $\theta = \theta_0 + \delta\theta$ is close to $\theta_0$, we have
$$\mathrm{KL}(P_{\theta_0}\, \|\, P_\theta) = \frac{1}{2} \sum_{i,j} I_{ij}(\theta_0)\, \delta\theta_i\, \delta\theta_j + O(\delta\theta^3)$$
with $I_{ij}(\theta_0)$ the Fisher matrix at $\theta_0$. (This actually holds both for $\mathrm{KL}(P_{\theta_0} \| P_\theta)$ and $\mathrm{KL}(P_\theta \| P_{\theta_0})$.)

Thus, we can apply the lemma above using the Fisher metric $\sum_{i,j} I_{ij}(\theta_0)\, \delta\theta_i\, \delta\theta_j$, and working on a small neighborhood of $\theta_0$ in $\theta$-space (which can be identified with $\mathbb{R}^{\dim \Theta}$). The lemma states that the argmax above is attained at
$$\theta = \theta_0 + \varepsilon\, \widetilde{\nabla}_\theta \int \log P_\theta(x)\, (W(x) - 1)\, P_{\theta_0}(\mathrm{d}x)$$
up to $O(\varepsilon^2)$, with $\widetilde{\nabla}$ the natural gradient.

Finally, the gradient cancels the constant $-1$ because $\int (\widetilde{\nabla}_\theta \log P_\theta)\, P_{\theta_0} = 0$ at $\theta = \theta_0$. This proves Theorem 13. □
Proof of Theorem 15 (IGO, CEM and IGO-ML)
Let $P_\theta$ be a family of probability distributions of the form
$$P_\theta(x) = \frac{1}{Z(\theta)} \exp\Big( \sum_i \theta_i T_i(x) \Big)\, H(\mathrm{d}x)$$
where $T_1, \ldots, T_k$ is a finite family of functions on $X$ and $H(\mathrm{d}x)$ is some reference measure on $X$. We assume that the family of functions $(T_i)_i$, together with the constant function $T_0(x) = 1$, are linearly independent. This prevents redundant parametrizations where two values of $\theta$ describe the same distribution; this also ensures that the Fisher matrix $\mathrm{Cov}(T_i, T_j)$ is invertible.

The IGO update (14) in the parametrization $\bar{T}_i$ is a sum of terms of the form $\widetilde{\nabla}_{\bar{T}_i} \ln P(x)$. So we will compute the natural gradient $\widetilde{\nabla}_{\bar{T}_i}$ in those coordinates. We first need some general results about the Fisher metric for exponential families.

The next proposition gives the expression of the Fisher scalar product between two tangent vectors $\delta P$ and $\delta' P$ of a statistical manifold of exponential distributions. It is one way to express the duality between the coordinates $\theta_i$ and $\bar{T}_i$ (compare [AN00, (3.30) and Section 3.5]).

Proposition 21. Let $\delta \theta_i$ and $\delta' \theta_i$ be two small variations of the parameters $\theta_i$. Let $\delta P(x)$ and $\delta' P(x)$ be the resulting variations of the probability distribution $P$, and $\delta \bar{T}_i$ and $\delta' \bar{T}_i$ the resulting variations of $\bar{T}_i$. Then the scalar product, in Fisher information metric, between the tangent vectors $\delta P$ and $\delta' P$, is
$$\langle \delta P, \delta' P \rangle = \sum_i \delta \theta_i\, \delta' \bar{T}_i = \sum_i \delta' \theta_i\, \delta \bar{T}_i.$$
Proof. By definition of the Fisher metric:
$$\begin{aligned}
\langle \delta P, \delta' P \rangle &= \sum_{i,j} I_{ij}\, \delta\theta_i\, \delta'\theta_j \\
&= \sum_{i,j} \delta\theta_i\, \delta'\theta_j \int_x \frac{\partial \ln P(x)}{\partial \theta_i} \frac{\partial \ln P(x)}{\partial \theta_j}\, P(x) \\
&= \int_x \Big( \sum_i \frac{\partial \ln P(x)}{\partial \theta_i}\, \delta\theta_i \Big) \Big( \sum_j \frac{\partial \ln P(x)}{\partial \theta_j}\, \delta'\theta_j \Big)\, P(x) \\
&= \int_x \Big( \sum_i \frac{\partial \ln P(x)}{\partial \theta_i}\, \delta\theta_i \Big)\, \delta'(\ln P(x))\, P(x) \\
&= \int_x \Big( \sum_i \frac{\partial \ln P(x)}{\partial \theta_i}\, \delta\theta_i \Big)\, \delta' P(x) \\
&= \int_x \sum_i (T_i(x) - \bar{T}_i)\, \delta\theta_i\, \delta' P(x) \qquad \text{by (16)} \\
&= \sum_i \delta\theta_i \Big( \int_x T_i(x)\, \delta' P(x) \Big) - \sum_i \delta\theta_i\, \bar{T}_i \int_x \delta' P(x) \\
&= \sum_i \delta\theta_i\, \delta' \bar{T}_i
\end{aligned}$$
because $\int_x \delta' P(x) = 0$ since the total mass of $P$ is $1$, and $\int_x T_i(x)\, \delta' P(x) = \delta' \bar{T}_i$ by definition of $\bar{T}_i$. □
Proposition 22. Let $g$ be a function on the statistical manifold of an exponential family as above. Then the components of the natural gradient w.r.t. the expectation parameters are given by the vanilla gradient w.r.t. the natural parameters:
$$\widetilde{\nabla}_{\bar{T}_i}\, g = \frac{\partial g}{\partial \theta_i}$$
and conversely $\widetilde{\nabla}_{\theta_i}\, g = \frac{\partial g}{\partial \bar{T}_i}$.

(Beware: this does not mean that the gradient ascent in any of those parametrizations is the vanilla gradient ascent.)

We could not find a reference for this result, though we think it is known.
Proof. By definition, the natural gradient $\widetilde{\nabla} g$ of a function $g$ is the unique tangent vector $\delta P$ such that, for any other tangent vector $\delta' P$, we have
$$\delta' g = \langle \delta P, \delta' P \rangle$$
with $\langle \cdot, \cdot \rangle$ the scalar product associated with the Fisher metric. We want to compute this natural gradient in coordinates $\bar{T}_i$, so we are interested in the variations $\delta \bar{T}_i$ associated with $\delta P$.

By Proposition 21, the scalar product above is
$$\langle \delta P, \delta' P \rangle = \sum_i \delta \bar{T}_i\, \delta' \theta_i$$
where $\delta \bar{T}_i$ is the variation of $\bar{T}_i$ associated with $\delta P$, and $\delta' \theta_i$ the variation of $\theta_i$ associated with $\delta' P$.

On the other hand we have $\delta' g = \sum_i \frac{\partial g}{\partial \theta_i}\, \delta' \theta_i$. So we must have
$$\sum_i \delta \bar{T}_i\, \delta' \theta_i = \sum_i \frac{\partial g}{\partial \theta_i}\, \delta' \theta_i$$
for any $\delta' P$, which leads to
$$\delta \bar{T}_i = \frac{\partial g}{\partial \theta_i}$$
as needed. The converse relation is proved mutatis mutandis. □
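Proposition 22 can be checked numerically on the Bernoulli family written as an exponential family, $P_\theta(x) = e^{\theta x}/(1 + e^\theta)$ on $\{0,1\}$, with sufficient statistic $T(x) = x$ and expectation parameter $\bar{T}(\theta) = \mathbb{E}_\theta T$. The objective $g(\bar{T}) = \bar{T}^2$ is a hypothetical choice for illustration; the natural-gradient component in the $\bar{T}$ coordinate should equal $\partial g / \partial \theta$:

```python
from math import exp

def p_of(theta):
    # Expectation parameter Tbar(theta) = E_theta[T] for the Bernoulli family
    return exp(theta) / (1 + exp(theta))

theta = 0.3
p = p_of(theta)

# Fisher information in the Tbar coordinate is 1/Var(T); the natural gradient
# there is Var(T) * dg/dTbar, with Var(T) = p(1-p) for a Bernoulli variable.
nat_grad_Tbar = p * (1 - p) * (2 * p)     # dg/dTbar = 2*Tbar for g = Tbar^2

# Vanilla derivative dg/dtheta, by central finite differences on
# theta -> g(Tbar(theta)).
h = 1e-6
dg_dtheta = (p_of(theta + h)**2 - p_of(theta - h)**2) / (2 * h)

assert abs(nat_grad_Tbar - dg_dtheta) < 1e-6
```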
Back to the proof of Theorem 15. We can now compute the desired terms:
$$\widetilde{\nabla}_{\bar{T}_i} \ln P(x) = \frac{\partial \ln P(x)}{\partial \theta_i} = T_i(x) - \bar{T}_i$$
by (16). This proves the first statement (27) in Theorem 15 about the form of the IGO update in these parameters.

The other statements follow easily from this together with the additional fact (26) that, for any set of weights $a_i$ with $\sum a_i = 1$, the value $T^* = \sum_i a_i\, T(x_i)$ is the maximum likelihood estimate associated with the weighted log-likelihood $\sum_i a_i \log P(x_i)$. □
References[AHS85] D.H. Ackley, G.E. Hinton, and T.J. Sejnowski, A learning
algorithm for Boltzmann machines, Cognitive Science 9 (1985),no. 1, 147β169.
[Ama98] Shun-Ichi Amari, Natural gradient works efficiently in learning,Neural Comput. 10 (1998), 251β276.
[AN00] Shun-ichi Amari and Hiroshi Nagaoka, Methods of informa-tion geometry, Translations of Mathematical Monographs, vol.191, American Mathematical Society, Providence, RI, 2000,Translated from the 1993 Japanese original by Daishi Harada.
[ANOK10] Youhei Akimoto, Yuichi Nagata, Isao Ono, and ShigenobuKobayashi, Bidirectional relation between CMA evolution strate-gies and natural evolution strategies, Proceedings of ParallelProblem Solving from Nature - PPSN XI, Lecture Notes inComputer Science, vol. 6238, Springer, 2010, pp. 154β163.
[AO08] Ravi P. Agarwal and Donal OβRegan, An introduction to ordi-nary differential equations, Springer, 2008.
49
[Arn06] D.V. Arnold, Weighted multirecombination evolution strategies,Theoretical computer science 361 (2006), no. 1, 18β37.
[Bal94] Shumeet Baluja, Population based incremental learning: Amethod for integrating genetic search based function optimiza-tion and competitve learning, Tech. Report CMU-CS-94-163,Carnegie Mellon Report, 1994.
[BC95] Shumeet Baluja and Rich Caruana, Removing the genetics fromthe standard genetic algorithm, Proceedings of ICMLβ95, 1995,pp. 38β46.
[Ber00] A. Berny, Selection and reinforcement learning for combinatorialoptimization, Parallel Problem Solving from Nature PPSN VI(Marc Schoenauer, Kalyanmoy Deb, GΓΌnther Rudolph, XinYao, Evelyne Lutton, Juan Merelo, and Hans-Paul Schwefel,eds.), Lecture Notes in Computer Science, vol. 1917, SpringerBerlin Heidelberg, 2000, pp. 601β610.
[Ber02] Arnaud Berny, Boltzmann machine for population-based incre-mental learning, ECAI, 2002, pp. 198β202.
[Bey01] H.-G. Beyer, The theory of evolution strategies, Natural Com-puting Series, Springer-Verlag, 2001.
[BLS07] J. Branke, C. Lode, and J.L. Shapiro, Addressing samplingerrors and diversity loss in umda, Proceedings of the 9th annualconference on Genetic and evolutionary computation, ACM,2007, pp. 508β515.
[BS02] H.G. Beyer and H.P. Schwefel, Evolution strategiesβa com-prehensive introduction, Natural computing 1 (2002), no. 1,3β52.
[CT06] Thomas M. Cover and Joy A. Thomas, Elements of informationtheory, second ed., Wiley-Interscience [John Wiley & Sons],Hoboken, NJ, 2006.
[dBKMR05] Pieter-Tjerk de Boer, Dirk P. Kroese, Shie Mannor, andReuven Y. Rubinstein, A tutorial on the cross-entropy method,Annals OR 134 (2005), no. 1, 19β67.
[Gha04] Zoubin Ghahramani, Unsupervised learning, Advanced Lectureson Machine Learning (Olivier Bousquet, Ulrike von Luxburg,and Gunnar RΓ€tsch, eds.), Lecture Notes in Computer Science,vol. 3176, Springer Berlin / Heidelberg, 2004, pp. 72β112.
[GSS+10] Tobias Glasmachers, Tom Schaul, Yi Sun, Daan Wierstra, andJΓΌrgen Schmidhuber, Exponential natural evolution strategies,GECCO, 2010, pp. 393β400.
[Han06a] N. Hansen, An analysis of mutative π-self-adaptation on linearfitness functions, Evolutionary Computation 14 (2006), no. 3,255β275.
[Han06b] , The CMA evolution strategy: a comparing review, To-wards a new evolutionary computation. Advances on estimationof distribution algorithms (J.A. Lozano, P. Larranaga, I. Inza,and E. Bengoetxea, eds.), Springer, 2006, pp. 75β102.
50
[Han09] ———, Benchmarking a BI-population CMA-ES on the BBOB-2009 function testbed, Proceedings of the 11th Annual Conference Companion on Genetic and Evolutionary Computation Conference: Late Breaking Papers (New York, NY, USA), GECCO '09, ACM, 2009, pp. 2389–2396.
[Hin02] G.E. Hinton, Training products of experts by minimizing contrastive divergence, Neural Computation 14 (2002), 1771–1800.
[HJ61] R. Hooke and T.A. Jeeves, "Direct search" solution of numerical and statistical problems, Journal of the ACM 8 (1961), 212–229.
[HK04] N. Hansen and S. Kern, Evaluating the CMA evolution strategy on multimodal test functions, Parallel Problem Solving from Nature PPSN VIII (X. Yao et al., eds.), LNCS, vol. 3242, Springer, 2004, pp. 282–291.
[HMK03] N. Hansen, S.D. Müller, and P. Koumoutsakos, Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES), Evolutionary Computation 11 (2003), no. 1, 1–18.
[HO96] N. Hansen and A. Ostermeier, Adapting arbitrary normal mutation distributions in evolution strategies: The covariance matrix adaptation, ICEC96, IEEE Press, 1996, pp. 312–317.
[HO01] Nikolaus Hansen and Andreas Ostermeier, Completely derandomized self-adaptation in evolution strategies, Evolutionary Computation 9 (2001), no. 2, 159–195.
[JA06] G.A. Jastrebski and D.V. Arnold, Improving evolution strategies through active covariance matrix adaptation, IEEE Congress on Evolutionary Computation (CEC 2006), IEEE, 2006, pp. 2814–2821.
[Jef46] Harold Jeffreys, An invariant form for the prior probability in estimation problems, Proc. Roy. Soc. London Ser. A 186 (1946), 453–461.
[Kul97] Solomon Kullback, Information theory and statistics, Dover Publications Inc., Mineola, NY, 1997. Reprint of the second (1968) edition.
[LL02] P. Larranaga and J.A. Lozano, Estimation of distribution algorithms: A new tool for evolutionary computation, Springer Netherlands, 2002.
[LRMB07] Nicolas Le Roux, Pierre-Antoine Manzagol, and Yoshua Bengio, Topmoumoute online natural gradient algorithm, NIPS, 2007.
[MMS08] Luigi Malagò, Matteo Matteucci, and Bernardo Dal Seno, An information geometry perspective on estimation of distribution algorithms: boundary analysis, GECCO (Companion), 2008, pp. 2081–2088.
[NM65] John Ashworth Nelder and R. Mead, A simplex method for function minimization, The Computer Journal (1965), 308–313.
[PGL02] M. Pelikan, D.E. Goldberg, and F.G. Lobo, A survey of optimization by building and using probabilistic models, Computational Optimization and Applications 21 (2002), no. 1, 5–20.
[Rao45] Calyampudi Radhakrishna Rao, Information and the accuracy attainable in the estimation of statistical parameters, Bull. Calcutta Math. Soc. 37 (1945), 81–91.
[Rec73] I. Rechenberg, Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution, Frommann-Holzboog Verlag, Stuttgart, 1973.
[RK04] R.Y. Rubinstein and D.P. Kroese, The cross-entropy method: a unified approach to combinatorial optimization, Monte-Carlo simulation, and machine learning, Springer-Verlag New York Inc., 2004.
[Rub99] Reuven Rubinstein, The cross-entropy method for combinatorial and continuous optimization, Methodology and Computing in Applied Probability 1 (1999), 127–190, doi:10.1023/A:1010091220143.
[SA90] F. Silva and L. Almeida, Acceleration techniques for the backpropagation algorithm, Neural Networks (1990), 110–119.
[Sch92] Laurent Schwartz, Analyse. II, Collection Enseignement des Sciences [Collection: The Teaching of Science], vol. 43, Hermann, Paris, 1992. Calcul différentiel et équations différentielles, with the collaboration of K. Zizi.
[Sch95] H.-P. Schwefel, Evolution and optimum seeking, Sixth-Generation Computer Technology Series, John Wiley & Sons, Inc., New York, NY, USA, 1995.
[SHI09] Thorsten Suttorp, Nikolaus Hansen, and Christian Igel, Efficient covariance matrix update for variable metric evolution strategies, Machine Learning 75 (2009), no. 2, 167–197.
[Smo86] P. Smolensky, Information processing in dynamical systems: foundations of harmony theory, Parallel Distributed Processing (D. Rumelhart and J. McClelland, eds.), vol. 1, MIT Press, Cambridge, MA, USA, 1986, pp. 194–281.
[SWSS09] Yi Sun, Daan Wierstra, Tom Schaul, and Jürgen Schmidhuber, Efficient natural evolution strategies, Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation (New York, NY, USA), GECCO '09, ACM, 2009, pp. 539–546.
[Tho00] H. Thorisson, Coupling, stationarity, and regeneration, Springer, 2000.
[Tor97] Virginia Torczon, On the convergence of pattern search algorithms, SIAM Journal on Optimization 7 (1997), no. 1, 1–25.
[WAS04] Michael Wagner, Anne Auger, and Marc Schoenauer, EEDA: A new robust estimation of distribution algorithms, Research Report RR-5190, INRIA, 2004.
[Whi89] D. Whitley, The GENITOR algorithm and selection pressure: Why rank-based allocation of reproductive trials is best, Proceedings of the Third International Conference on Genetic Algorithms, 1989, pp. 116–121.
[WSPS08] Daan Wierstra, Tom Schaul, Jan Peters, and Jürgen Schmidhuber, Natural evolution strategies, IEEE Congress on Evolutionary Computation, 2008, pp. 3381–3387.