Information-Geometric Optimization Algorithms: A Unifying Picture via Invariance Principles
Ludovic Arnold, Anne Auger, Nikolaus Hansen, Yann Ollivier
Abstract
We present a canonical way to turn any smooth parametric family of probability distributions on an arbitrary search space X into a continuous-time black-box optimization method on X, the information-geometric optimization (IGO) method. Invariance as a major design principle keeps the number of arbitrary choices to a minimum. The resulting method conducts a natural gradient ascent using an adaptive, time-dependent transformation of the objective function, and makes no particular assumptions on the objective function to be optimized.
The IGO method produces explicit IGO algorithms through time discretization. The cross-entropy method is recovered in a particular case with a large time step, and can be extended into a smoothed, parametrization-independent maximum likelihood update.
When applied to specific families of distributions on discrete or continuous spaces, the IGO framework makes it possible to naturally recover versions of known algorithms. From the family of Gaussian distributions on ℝ^d, we arrive at a version of the well-known CMA-ES algorithm. From the family of Bernoulli distributions on {0,1}^d, we recover the seminal PBIL algorithm. From the distributions of restricted Boltzmann machines, we naturally obtain a novel algorithm for discrete optimization on {0,1}^d.
The IGO method achieves, thanks to its intrinsic formulation, maximal invariance properties: invariance under reparametrization of the search space X, under a change of parameters of the probability distribution, and under increasing transformation of the function to be optimized. The latter is achieved thanks to an adaptive formulation of the objective.
Theoretical considerations strongly suggest that IGO algorithms are characterized by a minimal change of the distribution. Therefore they have minimal loss in diversity through the course of optimization, provided the initial diversity is high. First experiments using restricted Boltzmann machines confirm this insight. As a simple consequence, IGO seems to provide, from information theory, an elegant way to spontaneously explore several valleys of a fitness landscape in a single run.
Contents

Introduction
1 Algorithm description
  1.1 The natural gradient on parameter space
  1.2 IGO: Information-geometric optimization
2 First properties of IGO
  2.1 Consistency of sampling
  2.2 Monotonicity: quantile improvement
  2.3 The IGO flow for exponential families
  2.4 Invariance properties
  2.5 Speed of the IGO flow
  2.6 Noisy objective function
  2.7 Implementation remarks
3 IGO, maximum likelihood and the cross-entropy method
4 CMA-ES, NES, EDAs and PBIL from the IGO framework
  4.1 IGO algorithm for Bernoulli measures and PBIL
  4.2 Multivariate normal distributions (Gaussians)
  4.3 Computing the IGO flow for some simple examples
5 Multimodal optimization using restricted Boltzmann machines
  5.1 IGO for restricted Boltzmann machines
  5.2 Experimental results
    5.2.1 Convergence
    5.2.2 Diversity
  5.3 Convergence to the continuous-time limit
6 Further discussion
Summary and conclusion
Introduction

In this article, we consider an objective function f : X → ℝ to be minimized over a search space X. No particular assumption on X is needed: it may be discrete or continuous, finite or infinite. We adopt a standard scenario where we consider f as a black box that delivers values f(x) for any desired input x ∈ X. The objective of black-box optimization is to find solutions x ∈ X with small value f(x), using the least number of calls to the black box. In this context, we design a stochastic optimization method from sound theoretical principles.
We assume that we are given a family of probability distributions P_θ on X depending on a continuous multicomponent parameter θ ∈ Θ. A basic example is to take X = ℝ^d and to consider the family of all Gaussian distributions P_θ on ℝ^d, with θ = (m, C) the mean and covariance matrix. Another simple example is X = {0,1}^d equipped with the family of Bernoulli measures, i.e. θ = (θ_i)_{1≤i≤d} and P_θ(x) = ∏_i θ_i^{x_i} (1 − θ_i)^{1−x_i} for x = (x_i) ∈ X. The parameters θ of the family P_θ belong to a space Θ, assumed to be a smooth manifold.
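As a concrete illustration of the Bernoulli family above, here is a minimal Python sketch (ours, not from the paper; the helper names are hypothetical) of sampling from P_θ and evaluating its log-density on {0,1}^d:

```python
import math
import random

def bernoulli_sample(theta, rng):
    # Draw x in {0,1}^d with independent bits, Pr(x_i = 1) = theta_i.
    return [1 if rng.random() < p else 0 for p in theta]

def bernoulli_log_density(theta, x):
    # ln P_theta(x) = sum_i [ x_i ln theta_i + (1 - x_i) ln(1 - theta_i) ]
    return sum(xi * math.log(p) + (1 - xi) * math.log(1 - p)
               for p, xi in zip(theta, x))

rng = random.Random(0)
x = bernoulli_sample([0.5] * 4, rng)
```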
From this setting, we build in a natural way an optimization method, the information-geometric optimization (IGO). At each time t, we maintain a value θ^t such that P_{θ^t} represents, loosely speaking, the current belief about where the smallest values of the function f may lie. Over time, P_{θ^t} evolves and is expected to concentrate around the minima of f. This general approach resembles the wide family of estimation of distribution algorithms (EDA) [LL02, BC95, PGL02]. However, we deviate somewhat from the common EDA reasoning, as explained in the following.
The IGO method takes the form of a gradient ascent on θ^t in the parameter space Θ. We follow the gradient of a suitable transformation of f, based on the P_{θ^t}-quantiles of f. The gradient used for θ is the natural gradient defined from the Fisher information metric [Rao45, Jef46, AN00], as is the case in various other optimization strategies, for instance so-called natural evolution strategies [WSPS08, SWSS09, GSS+10]. Thus, we extend the scope of optimization strategies based on this gradient to arbitrary search spaces.
The IGO method also has an equivalent description as an infinitesimal maximum likelihood update; this reveals a new property of the natural gradient. This also relates IGO to the cross-entropy method [dBKMR05] in some situations.
When we instantiate IGO using the family of Gaussian distributions on ℝ^d, we naturally obtain variants of the well-known covariance matrix adaptation evolution strategy (CMA-ES) [HO01, HK04, JA06] and of natural evolution strategies. With Bernoulli measures on the discrete cube {0,1}^d, we recover the well-known population based incremental learning (PBIL) [BC95, Bal94]; this derivation of PBIL as a natural gradient ascent appears to be new, and sheds some light on the common ground between continuous and discrete optimization.
From the IGO framework, it is immediate to build new optimization algorithms using more complex families of distributions than Gaussian or Bernoulli. As an illustration, distributions associated with restricted Boltzmann machines (RBMs) provide a new but natural algorithm for discrete optimization on {0,1}^d, able to handle dependencies between the bits (see also [Ber02]). The probability distributions associated with RBMs are multimodal; combined with specific information-theoretic properties of IGO that guarantee minimal loss of diversity over time, this allows IGO to reach multiple optima at once very naturally, at least in a simple experimental setup.
Our method is built to achieve maximal invariance properties. First, it will be invariant under reparametrization of the family of distributions P_θ, that is, at least for infinitesimally small steps, it only depends on P_θ and not on the way we write the parameter θ. (For instance, for Gaussian measures it should not matter whether we use the covariance matrix or its inverse or a Cholesky factor as the parameter.) This limits the influence of encoding choices on the behavior of the algorithm. Second, it will be invariant under a change of coordinates in the search space X, provided that this change of coordinates globally preserves our family of distributions P_θ. (For Gaussian distributions on ℝ^d, this includes all affine changes of coordinates.) This means that the algorithm, apart from initialization, does not depend on the precise way the data is presented. Last, the algorithm will be invariant under applying an increasing function to f, so that it is indifferent whether we minimize e.g. f, f³ or f × |f|^(−2/3). This way some non-convex or non-smooth functions can be as "easily" optimized as convex ones. Contrary to previous formulations using natural gradients [WSPS08, GSS+10, ANOK10],
this invariance under increasing transformation of the objective function is achieved from the start.
Invariance under X-reparametrization has been, we believe, one of the keys to the success of the CMA-ES algorithm, which derives from a particular case of ours. Invariance under θ-reparametrization is the main idea behind information geometry [AN00]. Invariance under f-transformation is not uncommon, e.g., for evolution strategies [Sch95] or pattern search methods [HJ61, Tor97, NM65]; however it is not always recognized as an attractive feature. Such invariance properties mean that we deal with intrinsic properties of the objects themselves, and not with the way we encode them as collections of numbers in ℝ^d. It also means, most importantly, that we make a minimal number of arbitrary choices.
In Section 1, we define the IGO flow and the IGO algorithm. We begin with standard facts about the definition and basic properties of the natural gradient, and its connection with Kullback–Leibler divergence and diversity. We then proceed to the detailed description of our algorithm.
In Section 2, we state some first mathematical properties of IGO. These include monotone improvement of the objective function, invariance properties, the form of IGO for exponential families of probability distributions, and the case of noisy objective functions.
In Section 3 we explain the theoretical relationships between IGO, maximum likelihood estimates and the cross-entropy method. In particular, IGO is uniquely characterized by a weighted log-likelihood maximization property.
In Section 4, we derive several well-known optimization algorithms from the IGO framework. These include PBIL, versions of CMA-ES and other Gaussian evolutionary algorithms such as EMNA. This also illustrates how a larger step size yields algorithms that differ more and more from the continuous-time IGO flow.
In Section 5, we illustrate how IGO can be used to design new optimization algorithms. As a proof of concept, we derive the IGO algorithm associated with restricted Boltzmann machines for discrete optimization, allowing for multimodal optimization. We perform a preliminary experimental study of the specific influence of the Fisher information matrix on the performance of the algorithm and on diversity of the optima obtained.
In Section 6, we discuss related work, and in particular, IGO's relationship with and differences from various other optimization algorithms such as natural evolution strategies or the cross-entropy method. We also sum up the main contributions of the paper and the design philosophy of IGO.
1 Algorithm description

We now present the outline of our algorithms. Each step is described in more detail in the sections below.
Our method can be seen as an estimation of distribution algorithm: at each time t, we maintain a probability distribution P_{θ^t} on the search space X, where θ^t ∈ Θ. The value of θ^t will evolve so that, over time, P_{θ^t} gives more weight to points x with better values of the function f(x) to optimize.
A straightforward way to proceed is to transfer f from x-space to θ-space: define a function F(θ) as the P_θ-average of f and then do a gradient descent for F(θ) in the space Θ [Ber00]. This way, θ will converge to a point such that P_θ yields a good average value of f. We depart from this approach in two ways:
• At each time, we replace f with an adaptive transformation of f representing how good or bad observed values of f are relative to other observations. This provides invariance under all monotone transformations of f.
• Instead of the vanilla gradient for θ, we use the so-called natural gradient given by the Fisher information matrix. This reflects the intrinsic geometry of the space of probability distributions, as introduced by Rao and Jeffreys [Rao45, Jef46] and later elaborated upon by Amari and others [AN00]. This provides invariance under reparametrization of θ and, importantly, minimizes the change of diversity of P_θ.
The algorithm is constructed in two steps: we first give an "ideal" version, namely, a version in which time t is continuous so that the evolution of θ^t is given by an ordinary differential equation in Θ. Second, the actual algorithm is a time discretization using a finite time step and Monte Carlo sampling instead of exact P_θ-averages.
1.1 The natural gradient on parameter space
About gradients and the shortest path uphill. Let g be a smooth function from ℝ^d to ℝ, to be maximized. We first present the interpretation of gradient ascent as "the shortest path uphill".
Let y ∈ ℝ^d. Define the vector z by

    z = lim_{ε→0} argmax_{z, ‖z‖≤1} g(y + εz).    (1)

Then one can check that z is the normalized gradient of g at y: z_i = (∂g/∂y_i) / ‖∂g/∂y‖. (This holds only at points y where the gradient of g does not vanish.)
This shows that, for small δt, the well-known gradient ascent of g given by

    y_i^{t+δt} = y_i^t + δt ∂g/∂y_i

realizes the largest increase in the value of g, for a given step size ‖y^{t+δt} − y^t‖.

The relation (1) depends on the choice of a norm ‖·‖ (the gradient of g is given by ∂g/∂y_i only in an orthonormal basis). If we use, instead of the standard metric ‖y − y′‖ = √(Σ_i (y_i − y′_i)²) on ℝ^d, a metric ‖y − y′‖_A = √(Σ_{ij} A_{ij}(y_i − y′_i)(y_j − y′_j)) defined by a positive definite matrix A_{ij}, then the gradient of g with respect to this metric is given by Σ_j A⁻¹_{ij} ∂g/∂y_j. (This follows from the textbook definition of gradients by g(y + εz) = g(y) + ε⟨∇g, z⟩_A + O(ε²) with ⟨·,·⟩_A the scalar product associated with the matrix A_{ij} [Sch92].)
We can write the analogue of (1) using the A-norm. We get that the gradient ascent associated with metric A, given by

    y^{t+δt} = y^t + δt A⁻¹ ∂g/∂y,

for small δt, maximizes the increment of g for a given A-distance ‖y^{t+δt} − y^t‖_A: it realizes the steepest A-ascent. Maybe this viewpoint clarifies the relationship between gradient and metric: this steepest ascent property can actually be used as a definition of gradients.
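The steepest A-ascent step can be made concrete with a small numerical sketch (ours, not from the text), where one step y ← y + δt A⁻¹ ∂g/∂y is applied with an explicitly given inverse metric matrix:

```python
def steepest_a_ascent_step(y, grad_g, A_inv, dt):
    # One step of gradient ascent under the metric defined by A:
    # y <- y + dt * A^{-1} grad_g(y)
    g = grad_g(y)
    d = len(y)
    return [y[i] + dt * sum(A_inv[i][j] * g[j] for j in range(d))
            for i in range(d)]

# With A = identity, this reduces to the vanilla gradient ascent step.
identity = [[1.0, 0.0], [0.0, 1.0]]
step = steepest_a_ascent_step([0.0, 0.0], lambda y: [1.0, 2.0], identity, 0.1)
```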
In our setting we want to use a gradient ascent in the parameter space Θ of our distributions P_θ. The metric ‖θ − θ′‖ = √(Σ_i (θ_i − θ′_i)²) clearly depends on the choice of parametrization θ, and thus is not intrinsic. Therefore, we use a metric depending on θ only through the distributions P_θ, as follows.
Fisher information and the natural gradient on parameter space. Let θ, θ′ ∈ Θ be two values of the distribution parameter. The Kullback–Leibler divergence between P_{θ′} and P_θ is defined [Kul97] as

    KL(P_{θ′} ‖ P_θ) = ∫_x ln( P_{θ′}(x) / P_θ(x) ) P_{θ′}(dx).

When θ′ = θ + δθ is close to θ, under mild smoothness assumptions we can expand the Kullback–Leibler divergence at second order in δθ. This expansion defines the Fisher information matrix I at θ [Kul97]:

    KL(P_{θ+δθ} ‖ P_θ) = (1/2) Σ_{ij} I_{ij}(θ) δθ_i δθ_j + O(δθ³).
An equivalent definition of the Fisher information matrix is by the usual formulas [CT06]

    I_{ij}(θ) = ∫_x (∂ ln P_θ(x)/∂θ_i)(∂ ln P_θ(x)/∂θ_j) P_θ(dx) = −∫_x (∂² ln P_θ(x)/∂θ_i ∂θ_j) P_θ(dx).
The Fisher information matrix defines a (Riemannian) metric on Θ: the distance, in this metric, between two very close values of θ is given by the square root of twice the Kullback–Leibler divergence. Since the Kullback–Leibler divergence depends only on P_θ and not on the parametrization of θ, this metric is intrinsic.
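The second-order relation between KL divergence and the Fisher matrix can be checked numerically. For a single Bernoulli parameter p, the Fisher information is the standard value I(p) = 1/(p(1−p)); the sketch below (ours) verifies that KL(P_{p+δ} ‖ P_p) ≈ (1/2) I(p) δ² for small δ:

```python
import math

def kl_bernoulli(p, q):
    # KL(Bernoulli(p) || Bernoulli(q))
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

p, delta = 0.3, 1e-3
fisher = 1.0 / (p * (1 - p))          # I(p) for a single Bernoulli parameter
kl = kl_bernoulli(p + delta, p)       # KL(P_{p+delta} || P_p)
approx = 0.5 * fisher * delta ** 2    # second-order expansion of the KL divergence
```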
If f : Θ → ℝ is a smooth function on the parameter space, its natural gradient at θ is defined in accordance with the Fisher metric as

    (∇̃_θ f)_i = Σ_j I⁻¹_{ij}(θ) ∂f(θ)/∂θ_j,

or more synthetically ∇̃_θ f = I⁻¹ ∂f/∂θ.
From now on, we will use ∇̃_θ to denote the natural gradient and ∂/∂θ to denote the vanilla gradient.

By construction, the natural gradient descent is intrinsic: it does not depend on the chosen parametrization θ of P_θ, so that it makes sense to speak of the natural gradient ascent of a function f(P_θ).
Given that the Fisher metric comes from the Kullback–Leibler divergence, the "shortest path uphill" property of gradients mentioned above translates as follows (see also [Ama98, Theorem 1]):

Proposition 1. The natural gradient ascent points in the direction δθ achieving the largest change of the objective function, for a given distance between P_θ and P_{θ+δθ} in Kullback–Leibler divergence. More precisely, let f be a smooth function on the parameter space Θ. Let θ ∈ Θ be a point where ∇̃f(θ) does not vanish. Then

    ∇̃f(θ) / ‖∇̃f(θ)‖ = lim_{ε→0} (1/ε) argmax_{δθ : KL(P_{θ+δθ} ‖ P_θ) ≤ ε²/2} f(θ + δθ).
Here we have implicitly assumed that the parameter space Θ is non-degenerate and proper (that is, no two points θ ∈ Θ define the same probability distribution, and the mapping P_θ ↦ θ is continuous).
Why use the Fisher metric gradient for optimization? Relationship to diversity. The first reason for using the natural gradient is its reparametrization invariance, which makes it the only gradient available in a general abstract setting [AN00]. Practically, this invariance also limits the influence of encoding choices on the behavior of the algorithm. More prosaically, the Fisher matrix can also be seen as an adaptive learning rate for different components of the parameter vector θ_i: components θ_i with a high impact on P_θ will be updated more cautiously.
Another advantage comes from the relationship with Kullback–Leibler distance in view of the "shortest path uphill" (see also [Ama98]). To minimize the value of some function g(θ) defined on the parameter space Θ, the naive approach follows a gradient descent for g using the "vanilla" gradient

    θ_i^{t+δt} = θ_i^t + δt ∂g/∂θ_i

and, as explained above, this maximizes the increment of g for a given increment ‖θ^{t+δt} − θ^t‖. On the other hand, the Fisher gradient

    θ_i^{t+δt} = θ_i^t + δt (I⁻¹ ∂g/∂θ)_i

maximizes the increment of g for a given Kullback–Leibler distance KL(P_{θ^{t+δt}} ‖ P_{θ^t}).

In particular, if we choose an initial value θ⁰ such that P_{θ⁰} covers a wide portion of the space X uniformly, the Kullback–Leibler divergence between P_{θ^t} and P_{θ⁰} measures the loss of diversity of P_{θ^t}. So the natural gradient descent is a way to optimize the function g with minimal loss of diversity, provided the initial diversity is large. On the other hand the vanilla gradient descent optimizes g with minimal change in the numerical values of the parameter θ, which is of little interest.
So arguably this method realizes the best trade-off between optimization and loss of diversity. (Though, as can be seen from the detailed algorithm description below, maximization of diversity occurs only greedily at each step, and so there is no guarantee that after a given time, IGO will provide the highest possible diversity for a given objective function value.)

An experimental confirmation of the positive influence of the Fisher matrix on diversity is given in Section 5 below. This may also provide a theoretical explanation for the good performance of CMA-ES.
1.2 IGO: Information-geometric optimization
Quantile rewriting of f. Our original problem is to minimize a function f : X → ℝ. A simple way to turn f into a function on Θ is to use the expected value −E_{P_θ} f [Ber00, WSPS08], but expected values can be unduly influenced by extreme values and using them is rather unstable [Whi89]; moreover −E_{P_θ} f is not invariant under increasing transformation of f (this invariance implies we can only compare f-values, not add them).

Instead, we take an adaptive, quantile-based approach by first replacing the function f with a monotone rewriting W_θ^f and then following the gradient of E_{P_θ} W_θ^f. A due choice of W_θ^f allows us to control the range of the resulting values and achieves the desired invariance. Because the rewriting W_θ^f depends on θ, it might be viewed as an adaptive f-transformation.

The goal is that if f(x) is "small" then W_θ^f(x) ∈ ℝ is "large" and vice versa, and that W_θ^f remains invariant under monotone transformations of f. The meaning of "small" or "large" depends on θ ∈ Θ and is taken with respect to typical values of f under the current distribution P_θ. This is
measured by the P_θ-quantile in which the value of f(x) lies. We write the lower and upper P_θ-f-quantiles of x ∈ X as

    q_θ^<(x) = Pr_{x′∼P_θ}( f(x′) < f(x) ),
    q_θ^≤(x) = Pr_{x′∼P_θ}( f(x′) ≤ f(x) ).    (2)
These quantile functions reflect the probability to sample a better value than f(x). They are monotone in f (if f(x₁) ≤ f(x₂) then q_θ^±(x₁) ≤ q_θ^±(x₂)) and invariant under increasing transformations of f.

Given q ∈ [0; 1], we now choose a non-increasing function w : [0; 1] → ℝ (fixed once and for all). A typical choice for w is w(q) = 1_{q≤q₀} for some fixed value q₀, the selection quantile. The transform W_θ^f(x) is defined as a function of the P_θ-f-quantile of x as

    W_θ^f(x) = w(q_θ^≤(x))   if q_θ^≤(x) = q_θ^<(x),
    W_θ^f(x) = (1/(q_θ^≤(x) − q_θ^<(x))) ∫_{q=q_θ^<(x)}^{q=q_θ^≤(x)} w(q) dq   otherwise.    (3)
As desired, the definition of W_θ^f is invariant under an increasing transformation of f. For instance, the P_θ-median of f gets remapped to w(1/2).
Note that E_{P_θ} W_θ^f = ∫₀¹ w is independent of f and θ: indeed, by definition, the quantile of a random point under P_θ is uniformly distributed in [0; 1]. In the following, our objective will be to maximize the expected value of W_{θ^t}^f in θ, that is, to maximize

    E_{P_θ} W_{θ^t}^f = ∫ W_{θ^t}^f(x) P_θ(dx)    (4)

over θ, where θ^t is fixed at a given step but will adapt over time.

Importantly, W_θ^f(x) can be estimated in practice: indeed, the quantiles Pr_{x′∼P_θ}(f(x′) < f(x)) can be estimated by taking samples of P_θ and ordering the samples according to the value of f (see below). The estimate remains invariant under increasing f-transformations.
The IGO gradient flow. At the most abstract level, IGO is a continuous-time gradient flow in the parameter space Θ, which we define now. In practice, discrete time steps (a.k.a. iterations) are used, and P_θ-integrals are approximated through sampling, as described in the next section.

Let θ^t be the current value of the parameter at time t, and let δt ≪ 1. We define θ^{t+δt} in such a way as to increase the P_θ-weight of points where f is small, while not going too far from P_{θ^t} in Kullback–Leibler divergence. We use the adaptive weights W_{θ^t}^f as a way to measure which points have large or small values. In accordance with (4), this suggests taking the gradient ascent

    θ^{t+δt} = θ^t + δt ∇̃_θ ∫ W_{θ^t}^f(x) P_θ(dx)    (5)

where the natural gradient is suggested by Proposition 1.

Note again that we use W_{θ^t}^f and not W_θ^f in the integral. So the gradient ∇̃_θ does not act on the adaptive objective W_{θ^t}^f. If we used W_θ^f instead, we would face a paradox: right after a move, previously good points do not seem so good any more since the distribution has improved. More precisely, ∫ W_θ^f(x) P_θ(dx) is constant and always equal to the average weight ∫₀¹ w, and so the gradient would always vanish.

Using the log-likelihood trick ∇̃P_θ = P_θ ∇̃ ln P_θ (assuming P_θ is smooth), we get an equivalent expression of the update above as an integral under the current distribution P_{θ^t}; this is important for practical implementation. This leads to the following definition.
Definition 2 (IGO flow). The IGO flow is the set of continuous-time trajectories in space Θ, defined by the differential equation

    dθ^t/dt = ∇̃_θ ∫ W_{θ^t}^f(x) P_θ(dx)    (6)

            = ∫ W_{θ^t}^f(x) ∇̃_θ ln P_θ(x) P_{θ^t}(dx)    (7)

            = I⁻¹(θ^t) ∫ W_{θ^t}^f(x) (∂ ln P_θ(x)/∂θ) P_{θ^t}(dx),    (8)

where the gradients are taken at point θ = θ^t, and I is the Fisher information matrix.
Natural evolution strategies (NES, [WSPS08, GSS+10, SWSS09]) feature a related gradient descent with f(x) instead of W_{θ^t}^f(x). The associated flow would read

    dθ^t/dt = −∇̃_θ ∫ f(x) P_θ(dx),    (9)

where the gradient is taken at θ^t (in the sequel, when not explicitly stated, gradients in θ are taken at θ = θ^t). However, in the end NESs always implement algorithms using sample quantiles, as if derived from the gradient ascent of W_{θ^t}^f(x).

The update (7) is a weighted average of "intrinsic moves" increasing the log-likelihood of some points. We can slightly rearrange the update as

    dθ^t/dt = ∫ W_{θ^t}^f(x) ∇̃_θ ln P_θ(x) P_{θ^t}(dx)    (10)

            = ∇̃_θ ∫ W_{θ^t}^f(x) ln P_θ(x) P_{θ^t}(dx),    (11)

where, in (10), W_{θ^t}^f(x) is a preference weight, ∇̃_θ ln P_θ(x) is the intrinsic move to reinforce x, and P_{θ^t}(dx) is the current sample distribution, while the integral in (11) is a weighted log-likelihood. This provides an interpretation for the IGO gradient flow as a gradient ascent optimization of the weighted log-likelihood of the "good points" of the current distribution. In a precise sense, IGO is in fact the "best" way to increase this log-likelihood (Theorem 13).
For exponential families of probability distributions, we will see later that the IGO flow rewrites as a nice derivative-free expression (18).
The IGO algorithm. Time discretization and sampling. The above is a mathematically well-defined continuous-time flow in the parameter space. Its practical implementation involves three approximations depending on two parameters N and δt:

• the integral under P_{θ^t} is approximated using N samples taken from P_{θ^t};

• the value W_{θ^t}^f is approximated for each sample taken from P_{θ^t};

• the time derivative dθ^t/dt is approximated by a δt time increment.
We also assume that the Fisher information matrix I(θ) and ∂ ln P_θ(x)/∂θ can be computed (see discussion below if I(θ) is unknown).

At each step, we pick N samples x₁, …, x_N under P_{θ^t}. To approximate the quantiles, we rank the samples according to the value of f. Define rk(x_i) = #{j, f(x_j) < f(x_i)} and let the estimated weight of sample x_i be

    ŵ_i = (1/N) w( (rk(x_i) + 1/2) / N ),    (12)
using the weighting scheme function w introduced above. (This is assuming there are no ties in our sample; in case several sample points have the same value of f, we define ŵ_i by averaging the above over all possible rankings of the ties¹.)
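A minimal sketch (ours, not from the paper) of the rank-based weight computation (12), using the selection-quantile scheme w(q) = 1_{q ≤ q₀} and assuming no ties:

```python
def igo_weights(fvalues, w):
    # \hat w_i = (1/N) * w((rk(x_i) + 1/2) / N), where
    # rk(x_i) = #{j : f(x_j) < f(x_i)} (no ties assumed).
    N = len(fvalues)
    return [w((sum(fj < fi for fj in fvalues) + 0.5) / N) / N
            for fi in fvalues]

w = lambda q: 1.0 if q <= 0.5 else 0.0  # selection quantile q0 = 1/2
ws = igo_weights([3.0, 1.0, 2.0], w)    # best half of the samples get weight 1/N
```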
Then we can approximate the IGO flow as follows.
Definition 3 (IGO algorithm). The IGO algorithm with sample size N and step size δt is the following update rule for the parameter θ^t. At each step, N sample points x₁, …, x_N are picked according to the distribution P_{θ^t}. The parameter is updated according to

    θ^{t+δt} = θ^t + δt Σ_{i=1}^N ŵ_i ∇̃_θ ln P_θ(x_i) |_{θ=θ^t}    (13)

             = θ^t + δt I⁻¹(θ^t) Σ_{i=1}^N ŵ_i ∂ ln P_θ(x_i)/∂θ |_{θ=θ^t}    (14)

where ŵ_i is the weight (12) obtained from the ranked values of the objective function f.

Equivalently one can fix the weights w̄_i = (1/N) w((i − 1/2)/N) once and for all and rewrite the update as

    θ^{t+δt} = θ^t + δt I⁻¹(θ^t) Σ_{i=1}^N w̄_i ∂ ln P_θ(x_{i:N})/∂θ |_{θ=θ^t}    (15)

where x_{i:N} denotes the i-th sampled point ranked according to f, i.e. f(x_{1:N}) < … < f(x_{N:N}) (assuming again there are no ties). Note that {x_{i:N}} = {x_i} and {w̄_i} = {ŵ_i}.
As will be discussed in Section 4, this update applied to multivariate normal distributions or Bernoulli measures makes it possible to neatly recover versions of some well-established algorithms, in particular CMA-ES and PBIL. Actually, in the Gaussian context updates of the form (14) have already been introduced [GSS+10, ANOK10], though not formally derived from a continuous-time flow with quantiles.

When N → ∞, the IGO algorithm using samples approximates the continuous-time IGO gradient flow, see Theorem 4 below. Indeed, the IGO algorithm, with N = ∞, is simply the Euler approximation scheme for the ordinary differential equation defining the IGO flow (6). The latter result thus provides a sound mathematical basis for currently used rank-based updates.
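As a proof-of-concept sketch (ours, anticipating the Bernoulli case of Section 4.1), the update (14) can be instantiated for Bernoulli measures. There, ∂ ln P_θ(x)/∂θ_i = (x_i − θ_i)/(θ_i(1 − θ_i)) and the Fisher matrix is diagonal with I_ii = 1/(θ_i(1 − θ_i)), so the natural-gradient term reduces to x_i − θ_i:

```python
import random

def igo_bernoulli_step(theta, f, N, dt, q0, rng):
    # One step of the IGO algorithm (14) for Bernoulli measures on {0,1}^d.
    xs = [[1 if rng.random() < p else 0 for p in theta] for _ in range(N)]
    fv = [f(x) for x in xs]
    # rank-based weights (12) with the selection scheme w(q) = 1_{q <= q0}
    ws = [(1.0 if (sum(g < fi for g in fv) + 0.5) / N <= q0 else 0.0) / N
          for fi in fv]
    # natural gradient of ln P_theta at x reduces componentwise to x_i - theta_i
    return [p + dt * sum(w * (x[i] - p) for w, x in zip(ws, xs))
            for i, p in enumerate(theta)]

# Minimizing f(x) = -sum(x) (i.e. maximizing the number of ones)
# should drive theta toward 1 componentwise.
rng = random.Random(1)
theta = [0.5] * 5
for _ in range(40):
    theta = igo_bernoulli_step(theta, lambda x: -sum(x),
                               N=20, dt=1.0, q0=0.25, rng=rng)
```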
IGO flow versus IGO algorithms. The IGO flow (6) is a well-defined continuous-time set of trajectories in the space of probability distributions P_θ, depending only on the objective function f and the chosen family of distributions. It does not depend on the chosen parametrization for θ (Proposition 8).

On the other hand, there are several IGO algorithms associated with this flow. Each IGO algorithm approximates the IGO flow in a slightly different
¹A mathematically neater but less intuitive version would be

    ŵ_i = (1/(rk^≤(x_i) − rk^<(x_i))) ∫_{u=rk^<(x_i)/N}^{u=rk^≤(x_i)/N} w(u) du

with rk^<(x_i) = #{j, f(x_j) < f(x_i)} and rk^≤(x_i) = #{j, f(x_j) ≤ f(x_i)}.
way. An IGO algorithm depends on three further choices: a sample size N, a time discretization step size δt, and a choice of parametrization for θ in which to implement (14).

If δt is small enough, and N large enough, the influence of the parametrization θ disappears and all IGO algorithms are approximations of the "ideal" IGO flow trajectory. However, the larger δt, the poorer the approximation gets.

So for large δt, different IGO algorithms for the same IGO flow may exhibit different behaviors. We will see an instance of this phenomenon for Gaussian distributions: both CMA-ES and the maximum likelihood update (EMNA) can be seen as IGO algorithms, but the latter with δt = 1 is known to exhibit premature loss of diversity (Section 4.2).

Still, two IGO algorithms for the same IGO flow will differ less from each other than from a non-IGO algorithm: at each step the difference is only O(δt²) (Section 2.4). On the other hand, for instance, the difference between an IGO algorithm and the vanilla gradient ascent is, generally, not smaller than O(δt) at each step, i.e. roughly as big as the steps themselves.
Unknown Fisher matrix. The algorithm presented so far assumes that the Fisher matrix I(θ) is known as a function of θ. This is the case for Gaussian distributions in CMA-ES and for Bernoulli distributions. However, for restricted Boltzmann machines as considered below, no analytical form is known. Yet, provided the quantity ∂ ln P_θ(x)/∂θ can be computed or approximated, it is possible to approximate the integral

    I_{ij}(θ) = ∫_x (∂ ln P_θ(x)/∂θ_i)(∂ ln P_θ(x)/∂θ_j) P_θ(dx)

using P_θ-Monte Carlo samples for x. These samples may or may not be the same as those used in the IGO update (14): in particular, it is possible to use as many Monte Carlo samples as necessary to approximate I_{ij}, at no additional cost in terms of the number of calls to the black-box function f to optimize.

Note that each Monte Carlo sample x will contribute the term (∂ ln P_θ(x)/∂θ_i)(∂ ln P_θ(x)/∂θ_j) to the Fisher matrix approximation. This is a rank-1 matrix. So, for the approximated Fisher matrix to be invertible, the number of (distinct) samples x needs to be at least equal to the number of components of the parameter θ, i.e. N_Fisher ≥ dim Θ.

For exponential families of distributions, the IGO update has a particular form (18) which simplifies this matter somewhat. More details are given below (see Section 5) for the concrete situation of restricted Boltzmann machines.
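The Monte Carlo approximation of I_{ij}(θ) from score vectors can be sketched as follows (our illustration); each sample contributes a rank-1 outer product, and for a single Bernoulli parameter at p = 0.5 every contribution equals the exact value 1/(p(1−p)) = 4:

```python
def estimate_fisher(scores):
    # I_hat_{ij} = (1/M) sum_x s_i(x) s_j(x), with s(x) = d ln P_theta(x)/d theta
    M, d = len(scores), len(scores[0])
    return [[sum(s[i] * s[j] for s in scores) / M for j in range(d)]
            for i in range(d)]

# Bernoulli score at parameter p: s(x) = (x - p) / (p(1-p)); at p = 0.5 it is
# +/-2, so each rank-1 contribution is exactly 4 = 1/(p(1-p)).
p = 0.5
scores = [[(x - p) / (p * (1 - p))] for x in (0, 1, 1, 0)]
F = estimate_fisher(scores)
```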
2 First properties of IGO
2.1 Consistency of sampling
The first property to check is that when N → ∞, the update rule using N samples converges to the IGO update rule. This is not a straightforward application of the law of large numbers, because the estimated weights ŵ_i depend (non-continuously) on the whole sample x₁, …, x_N, and not only on x_i.
Theorem 4 (Consistency). When N → ∞, the N-sample IGO update rule (14):

    θ^{t+δt} = θ^t + δt I⁻¹(θ^t) Σ_{i=1}^N ŵ_i ∂ ln P_θ(x_i)/∂θ |_{θ=θ^t}

converges with probability 1 to the update rule (5):

    θ^{t+δt} = θ^t + δt ∇̃_θ ∫ W_{θ^t}^f(x) P_θ(dx).
The proof is given in the Appendix, under mild regularity assumptions. In particular we do not require that w be continuous.

This theorem may clarify previous claims [WSPS08, ANOK10] where rank-based updates similar to (5), such as in NES or CMA-ES, were derived from optimizing the expected value −E_{P_θ} f. The rank-based weights ŵ_i were then introduced somewhat arbitrarily. Theorem 4 shows that, for large N, CMA-ES and NES actually follow the gradient flow of the quantity E_{P_θ} W_{θ^t}^f: the update can be rigorously derived from optimizing the expected value of the quantile-rewriting W_{θ^t}^f.
2.2 Monotonicity: quantile improvement
Gradient descents come with a guarantee that the fitness value decreases over time. Here, since we work with probability distributions on $X$, we need to define the fitness of the distribution $P_{\theta^t}$. An obvious choice is the expectation $\mathbb{E}_{P_{\theta^t}} f$, but it is not invariant under $f$-transformation and moreover may be sensitive to extreme values.
It turns out that the monotonicity properties of the IGO gradient flow depend on the choice of the weighting scheme $w$. For instance, if $w(u) = \mathbb{1}_{u \leqslant 1/2}$, then the median of $f$ improves over time.
Proposition 5 (Quantile improvement). Consider the IGO flow given by (6), with the weight $w(u) = \mathbb{1}_{u \leqslant q}$ where $0 < q < 1$ is fixed. Then the value of the $q$-quantile of $f$ improves over time: if $t_1 \leqslant t_2$ then either $\theta^{t_1} = \theta^{t_2}$ or $Q^q_{P_{\theta^{t_2}}}(f) < Q^q_{P_{\theta^{t_1}}}(f)$.

Here the $q$-quantile $Q^q_P(f)$ of $f$ under a probability distribution $P$ is defined as any number $m$ such that $\Pr_{x \sim P}(f(x) \leqslant m) \geqslant q$ and $\Pr_{x \sim P}(f(x) \geqslant m) \geqslant 1 - q$.
The proof is given in the Appendix, together with the necessary regularity assumptions.
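The defining inequalities of the $q$-quantile can be checked by plain Monte Carlo sampling; a minimal sketch (function and parameter names are ours):

```python
import random

def f_quantile(f, sampler, q, n=10000):
    """Monte Carlo q-quantile of f under P: sort n sampled f-values and
    return an order statistic at rank about q*n; it satisfies the two
    probability inequalities of the definition up to sampling error."""
    values = sorted(f(sampler()) for _ in range(n))
    k = min(int(q * n), n - 1)
    return values[k]
```

For instance, with $f$ the identity and $P$ the uniform law on $[0,1]$, the estimated median should be close to $1/2$.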
Of course this holds only for the IGO gradient flow (6) with $N = \infty$ and $\delta t \to 0$. For an IGO algorithm with finite $N$, the dynamics is random and one cannot expect monotonicity. Still, Theorem 4 ensures that, with high probability, trajectories of a large enough finite population dynamics stay close to the infinite-population limit trajectory.
2.3 The IGO flow for exponential families
The expressions for the IGO update simplify somewhat if the family $P_\theta$ happens to be an exponential family of probability distributions (see also [MMS08]). Suppose that $P_\theta$ can be written as
$$P_\theta(x) = \frac{1}{Z(\theta)} \exp\Big(\sum_i \theta_i T_i(x)\Big)\, H(\mathrm{d}x)$$
where $T_1, \ldots, T_k$ is a finite family of functions on $X$, $H(\mathrm{d}x)$ is an arbitrary reference measure on $X$, and $Z(\theta)$ is the normalization constant. It is well-known [AN00, (2.33)] that
$$\frac{\partial \ln P_\theta(x)}{\partial\theta_i} = T_i(x) - \mathbb{E}_{P_\theta} T_i \qquad (16)$$
so that [AN00, (3.59)]
$$I_{ij}(\theta) = \operatorname{Cov}_{P_\theta}(T_i, T_j). \qquad (17)$$
In the end we find:

Proposition 6. Let $P_\theta$ be an exponential family parametrized by the natural parameters $\theta$ as above. Then the IGO flow is given by
$$\frac{\mathrm{d}\theta}{\mathrm{d}t} = \operatorname{Cov}_{P_\theta}(T, T)^{-1} \operatorname{Cov}_{P_\theta}(T, W^f_\theta) \qquad (18)$$
where $\operatorname{Cov}_{P_\theta}(T, W^f_\theta)$ denotes the vector $(\operatorname{Cov}_{P_\theta}(T_i, W^f_\theta))_i$, and $\operatorname{Cov}_{P_\theta}(T, T)$ the matrix $(\operatorname{Cov}_{P_\theta}(T_i, T_j))_{ij}$.
Note that the right-hand side does not involve derivatives w.r.t. $\theta$ any more. This result makes it easy to simulate the IGO flow using e.g. a Gibbs sampler for $P_\theta$: both covariances in (18) may be approximated by sampling, so that neither the Fisher matrix nor the gradient term need to be known in advance, and no derivatives are involved.
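To illustrate, the covariances in (18) can be estimated by sampling. A minimal sketch for independent Bernoulli bits in their natural ("logit") parametrization, where the sufficient statistics are $T_i(x) = x_i$ and $\operatorname{Cov}(T, T)$ is diagonal; the weighting $w(u) = \mathbb{1}_{u \leqslant q}$ and all names are our own choices:

```python
import math
import random

def igo_flow_direction(theta, f, n_samples, q=0.5):
    """Monte Carlo estimate of (18), dtheta/dt = Cov(T,T)^{-1} Cov(T, W),
    for independent Bernoulli bits in the logit parametrization."""
    d = len(theta)
    p = [1.0 / (1.0 + math.exp(-t)) for t in theta]
    xs = [[1 if random.random() < pi else 0 for pi in p] for _ in range(n_samples)]
    # Quantile-rewriting with w(u) = 1_{u <= q}: the best q-fraction of the
    # sample (f is minimized) gets weight 1, the rest weight 0.
    order = sorted(range(n_samples), key=lambda j: f(xs[j]))
    W = [0.0] * n_samples
    for rank, j in enumerate(order):
        W[j] = 1.0 if (rank + 0.5) / n_samples <= q else 0.0
    W_mean = sum(W) / n_samples
    direction = []
    for i in range(d):
        T_mean = sum(x[i] for x in xs) / n_samples
        cov_TW = sum((x[i] - T_mean) * (w - W_mean)
                     for x, w in zip(xs, W)) / n_samples
        var_T = max(T_mean * (1.0 - T_mean), 1e-12)  # empirical Var T_i
        direction.append(cov_TW / var_T)
    return direction
```

No derivative of $\ln P_\theta$ appears anywhere: only samples and covariances, as the proposition promises.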
The CMA-ES uses the family of all Gaussian distributions on $\mathbb{R}^d$. Then, the family $T_i$ is the family of all linear and quadratic functions of the coordinates on $\mathbb{R}^d$. The expression above is then a particularly concise rewriting of a CMA-ES update, see also Section 4.2.
Moreover, the expected values $\bar{T}_i = \mathbb{E}\, T_i$ of $T_i$ satisfy the simple evolution equation under the IGO flow
$$\frac{\mathrm{d}\bar{T}_i}{\mathrm{d}t} = \operatorname{Cov}(T_i, W^f_\theta) = \mathbb{E}(T_i\, W^f_\theta) - \bar{T}_i\, \mathbb{E}\, W^f_\theta. \qquad (19)$$
The proof is given in the Appendix, in the proof of Theorem 15.

The variables $\bar{T}_i$ can sometimes be used as an alternative parametrization for an exponential family (e.g. for a one-dimensional Gaussian, these are the mean $\mu$ and the second moment $\mu^2 + \sigma^2$). Then the IGO flow (7) may be rewritten using the relation $\widetilde{\nabla}_\theta = \frac{\partial}{\partial \bar{T}}$ for the natural gradient of exponential families (Appendix, Proposition 22), which sometimes results in simpler expressions. We shall further exploit this fact in Section 3.
Exponential families with latent variables. Similar formulas hold when the distribution $P_\theta(x)$ is the marginal of an exponential distribution $P_\theta(x, h)$ over a "hidden" or "latent" variable $h$, such as the restricted Boltzmann machines of Section 5.

Namely, with $P_\theta(x) = \frac{1}{Z(\theta)} \sum_h \exp\big(\sum_i \theta_i T_i(x, h)\big)\, H(\mathrm{d}x, \mathrm{d}h)$ the Fisher matrix is
$$I_{ij}(\theta) = \operatorname{Cov}_{P_\theta}(U_i, U_j) \qquad (20)$$
where $U_i(x) = \mathbb{E}_{P_\theta}(T_i(x, h) \mid x)$. Consequently, the IGO flow takes the form
$$\frac{\mathrm{d}\theta}{\mathrm{d}t} = \operatorname{Cov}_{P_\theta}(U, U)^{-1} \operatorname{Cov}_{P_\theta}(U, W^f_\theta). \qquad (21)$$
2.4 Invariance properties
Here we formally state the invariance properties of the IGO flow under variousreparametrizations. Since these results follow from the very construction ofthe algorithm, the proofs are omitted.
Proposition 7 ($f$-invariance). Let $\varphi : \mathbb{R} \to \mathbb{R}$ be an increasing function. Then the trajectories of the IGO flow when optimizing the functions $f$ and $\varphi(f)$ are the same.

The same is true for the discretized algorithm with population size $N$ and step size $\delta t > 0$.
Proposition 8 ($\theta$-invariance). Let $\theta' = \varphi(\theta)$ be a one-to-one function of $\theta$ and let $P'_{\theta'} = P_{\varphi^{-1}(\theta')}$. Let $\theta^t$ be the trajectory of the IGO flow when optimizing a function $f$ using the distributions $P_\theta$, initialized at $\theta^0$. Then the IGO flow trajectory $(\theta')^t$ obtained from the optimization of the function $f$ using the distributions $P'_{\theta'}$, initialized at $(\theta')^0 = \varphi(\theta^0)$, is the same, namely $(\theta')^t = \varphi(\theta^t)$.
For the algorithm with finite $N$ and $\delta t > 0$, invariance under $\theta$-reparametrization is only true approximately, in the limit when $\delta t \to 0$. As mentioned above, the IGO update (14), with $N = \infty$, is simply the Euler approximation scheme for the ordinary differential equation (6) defining the IGO flow. At each step, the Euler scheme is known to make an error $O(\delta t^2)$ with respect to the true flow. This error actually depends on the parametrization of $\theta$.

So the IGO updates for different parametrizations coincide at first order in $\delta t$, and may, in general, differ by $O(\delta t^2)$. For instance the difference between the CMA-ES and xNES updates is indeed $O(\delta t^2)$, see Section 4.2.

For comparison, using the vanilla gradient results in a divergence of $O(\delta t)$ at each step between different parametrizations. So the divergence could be of the same magnitude as the steps themselves.

In that sense, one can say that IGO algorithms are "more parametrization-invariant" than other algorithms. This stems from their origin as a discretization of the IGO flow.
The next proposition states that, for example, if one uses a family of distributions on $\mathbb{R}^d$ which is invariant under affine transformations, then our algorithm optimizes equally well a function and its image under any affine transformation (up to an obvious change in the initialization). This proposition generalizes the well-known corresponding property of CMA-ES [HO01].

Here, as usual, the image of a probability distribution $P$ by a transformation $\varphi$ is defined as the probability distribution $P'$ such that $P'(A) = P(\varphi^{-1}(A))$ for any subset $A \subset X$. In the continuous domain, the density of the new distribution $P'$ is obtained by the usual change of variable formula involving the Jacobian of $\varphi$.
Proposition 9 ($X$-invariance). Let $\varphi : X \to X$ be a one-to-one transformation of the search space, and assume that $\varphi$ globally preserves the family of measures $P_\theta$. Let $\theta^t$ be the IGO flow trajectory for the optimization of function $f$, initialized at $P_{\theta^0}$. Let $(\theta')^t$ be the IGO flow trajectory for optimization of $f \circ \varphi^{-1}$, initialized at the image of $P_{\theta^0}$ by $\varphi$. Then $P_{(\theta')^t}$ is the image of $P_{\theta^t}$ by $\varphi$.

For the discretized algorithm with population size $N$ and step size $\delta t > 0$, the same is true up to an error of $O(\delta t^2)$ per iteration. This error disappears if the map $\varphi$ acts on $\Theta$ in an affine way.
The latter case of affine transforms is well exemplified by CMA-ES: here, using the variance and mean as the parametrization of Gaussians, the new mean and variance after an affine transform of the search space are an affine function of the old mean and variance; specifically, for the affine transformation $A : x \mapsto Ax + b$ we have $(m, C) \mapsto (Am + b, ACA^{\mathrm{T}})$.
2.5 Speed of the IGO flow
Proposition 10. The speed of the IGO flow, i.e. the norm of $\frac{\mathrm{d}\theta^t}{\mathrm{d}t}$ in the Fisher metric, is at most $\sqrt{\int_0^1 w^2 - \big(\int_0^1 w\big)^2}$ where $w$ is the weighting scheme.
This speed can be tested in practice in at least two ways. The first is just to compute the Fisher norm of the increment $\theta^{t+\delta t} - \theta^t$ using the Fisher matrix; for small $\delta t$ this is close to $\delta t\, \big\|\frac{\mathrm{d}\theta}{\mathrm{d}t}\big\|$ with $\|\cdot\|$ the Fisher metric. The second is as follows: since the Fisher metric coincides with the Kullback–Leibler divergence up to a factor $1/2$, we have $\mathrm{KL}(P_{\theta^{t+\delta t}}\,\|\,P_{\theta^t}) \approx \frac{1}{2}\,\delta t^2\,\big\|\frac{\mathrm{d}\theta}{\mathrm{d}t}\big\|^2$ at least for small $\delta t$. Since it is relatively easy to estimate $\mathrm{KL}(P_{\theta^{t+\delta t}}\,\|\,P_{\theta^t})$ by comparing the new and old log-likelihoods of points in a Monte Carlo sample, one can obtain an estimate of $\big\|\frac{\mathrm{d}\theta}{\mathrm{d}t}\big\|$.
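The second test can be sketched for a product of Bernoulli distributions: draw a Monte Carlo sample from the new distribution and compare log-likelihoods. This is a sketch under our own naming, not part of the original text.

```python
import math
import random

def log_lik(p, x):
    # Log-likelihood of a bit string x under independent Bernoulli(p_i).
    return sum(math.log(pi if xi else 1 - pi) for pi, xi in zip(p, x))

def kl_estimate(p_new, p_old, n_samples):
    """Estimate KL(P_new || P_old) by averaging the log-likelihood
    difference over a Monte Carlo sample drawn from P_new."""
    total = 0.0
    for _ in range(n_samples):
        x = [1 if random.random() < pi else 0 for pi in p_new]
        total += log_lik(p_new, x) - log_lik(p_old, x)
    return total / n_samples
```

For small steps, $\mathrm{KL}(P_{\theta^{t+\delta t}}\,\|\,P_{\theta^t}) \approx \frac{1}{2}\delta t^2 \|\mathrm{d}\theta/\mathrm{d}t\|^2$, so this estimate yields the flow speed in the Fisher metric.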
Corollary 11. Consider an IGO algorithm with weighting scheme $w$, step size $\delta t$ and sample size $N$. Then, for small $\delta t$ and large $N$ we have
$$\mathrm{KL}(P_{\theta^{t+\delta t}}\,\|\,P_{\theta^t}) \leqslant \tfrac{1}{2}\,\delta t^2\, \operatorname{Var}_{[0,1]} w + O(\delta t^3) + O(1/\sqrt{N}).$$

For instance, with $w(q) = \mathbb{1}_{q \leqslant q_0}$ and neglecting the error terms, an IGO algorithm introduces at most $\frac{1}{2}\,\delta t^2\, q_0(1 - q_0)$ bits of information (in base $e$) per iteration into the probability distribution $P_\theta$.
Thus, the time discretization parameter ð¿ð¡ is not just an arbitrary variable:it has an intrinsic interpretation related to a number of bits introduced ateach step of the algorithm. This kind of relationship suggests, more generally,to use the KullbackâLeibler divergence as an external and objective way tomeasure learning rates in those optimization algorithms which use probabilitydistributions.
The result above is only an upper bound. Maximal speed can be achieved only if all "good" points point in the same direction. If the various good points in the sample suggest moves in inconsistent directions, then the IGO update will be much smaller. The latter may be a sign that the signal is noisy, or that the family of distributions $P_\theta$ is not well suited to the problem at hand and should be enriched.
As an example, using a family of Gaussian distributions with unknown mean and fixed identity variance on $\mathbb{R}^d$, one checks that for the optimization of a linear function on $\mathbb{R}^d$, with the weight $w(u) = -\mathbb{1}_{u > 1/2} + \mathbb{1}_{u < 1/2}$, the IGO flow moves at constant speed $1/\sqrt{2\pi} \approx 0.4$, whatever the dimension $d$. On a rapidly varying sinusoidal function, the moving speed will be much slower because there are "good" and "bad" points in all directions.
This may suggest ways to design the weighting scheme $w$ to achieve maximal speed in some instances. Indeed, looking at the proof of the proposition, which involves a Cauchy–Schwarz inequality, one can see that the maximal speed is achieved only if there is a linear relationship between the weights $W^f_\theta(x)$ and the gradient $\widetilde{\nabla}_\theta \ln P_\theta(x)$. For instance, for the optimization of a linear function on $\mathbb{R}^d$ using Gaussian measures of known variance, the maximal speed will be achieved when the weighting scheme $w(u)$ is the inverse of the Gaussian cumulative distribution function. (In particular, $w(u)$ tends to $+\infty$ when $u \to 0$ and to $-\infty$ when $u \to 1$.) This is in accordance with previously known results: the expected value of the $i$-th order statistic of $N$ standard Gaussian variates is the optimal $\widehat{w}_i$ value in evolution strategies [Bey01, Arn06]. For $N \to \infty$, this order statistic converges to the inverse Gaussian cumulative distribution function.
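A sketch of this weighting scheme using Python's standard-library normal inverse CDF; the midpoint ranks $u = (i - 1/2)/N$ and the sign convention (the best rank gets the largest weight, since $f$ is minimized) are our own choices:

```python
from statistics import NormalDist

def inverse_cdf_weights(N):
    """Weights from w(u) = inverse Gaussian CDF, evaluated at the midpoint
    ranks u = (i - 1/2)/N, i = 1 (best) ... N (worst). Negated so that
    better points (small u) receive large positive weights."""
    inv_cdf = NormalDist().inv_cdf
    return [-inv_cdf((i - 0.5) / N) for i in range(1, N + 1)]
```

By symmetry of the Gaussian these weights sum to (numerically) zero, which, as discussed below, is also the preferred normalization of the weighting scheme.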
2.6 Noisy objective function
Suppose that the objective function $f$ is non-deterministic: each time we ask for the value of $f$ at a point $x \in X$, we get a random result. In this setting we may write the random value $f(x)$ as $f(x) = \tilde{f}(x, \omega)$ where $\omega$ is an unseen random parameter, and $\tilde{f}$ is a deterministic function of $x$ and $\omega$. Without loss of generality, up to a change of variables we can assume that $\omega$ is uniformly distributed in $[0, 1]$.
We can still use the IGO algorithm without modification in this context. One might wonder which properties (consistency of sampling, etc.) still apply when $f$ is not deterministic. Actually, IGO algorithms for noisy functions fit very nicely into the IGO framework: the following proposition allows us to transfer any property of IGO to the case of noisy functions.
Proposition 12 (Noisy IGO). Let $f$ be a random function of $x \in X$, namely, $f(x) = \tilde{f}(x, \omega)$ where $\omega$ is a random variable uniformly distributed in $[0, 1]$, and $\tilde{f}$ is a deterministic function of $x$ and $\omega$. Then the two following algorithms coincide:

• The IGO algorithm (13), using a family of distributions $P_\theta$ on space $X$, applied to the noisy function $f$, and where the samples are ranked according to the random observed value of $f$ (here we assume that, for each sample, the noise $\omega$ is independent from everything else);

• The IGO algorithm on space $X \times [0, 1]$, using the family of distributions $\tilde{P}_\theta = P_\theta \otimes U_{[0,1]}$, applied to the deterministic function $\tilde{f}$. Here $U_{[0,1]}$ denotes the uniform law on $[0, 1]$.
The (easy) proof is given in the Appendix.

This proposition states that noisy optimization is the same as ordinary optimization using a family of distributions which cannot operate any selection or convergence over the parameter $\omega$. More generally, any component of the search space in which a distribution-based evolutionary strategy cannot perform selection or specialization will effectively act as a random noise on the objective function.
As a consequence of this result, all properties of IGO can be transferred to the noisy case. Consider, for instance, consistency of sampling (Theorem 4). The $N$-sample IGO update rule for the noisy case is identical to the non-noisy case (14):
$$\theta^{t+\delta t} = \theta^t + \delta t\, I^{-1}(\theta^t) \sum_{i=1}^N \widehat{w}_i \left.\frac{\partial \ln P_\theta(x_i)}{\partial\theta}\right|_{\theta=\theta^t}$$
where each weight $\widehat{w}_i$ computed from (12) now incorporates noise from the objective function because the rank of $x_i$ is computed on the random function, or equivalently on the deterministic function $\tilde{f}$: $\operatorname{rk}(x_i) = \#\{j,\ \tilde{f}(x_j, \omega_j) < \tilde{f}(x_i, \omega_i)\}$.
Consistency of sampling (Theorem 4) thus takes the following form: when $N \to \infty$, the $N$-sample IGO update rule on the noisy function $f$ converges with probability 1 to the update rule
$$\theta^{t+\delta t} = \theta^t + \delta t\, \widetilde{\nabla}_\theta \int_0^1 \!\! \int W^{\tilde{f}}_{\theta^t}(x, \omega)\, P_\theta(\mathrm{d}x)\, \mathrm{d}\omega = \theta^t + \delta t\, \widetilde{\nabla}_\theta \int \overline{W}_{\theta^t}(x)\, P_\theta(\mathrm{d}x) \qquad (22)$$
where $\overline{W}_\theta(x) = \mathbb{E}_\omega\, W^{\tilde{f}}_\theta(x, \omega)$. This entails, in particular, that when $N \to \infty$, the noise disappears asymptotically, as could be expected.
Consequently, the IGO flow in the noisy case should be defined by the $\delta t \to 0$ limit of the update (22) using $\overline{W}$. Note that the quantiles $q^{\pm}(x)$ defined by (2) still make sense in the noisy case, and are deterministic functions of $x$; thus $W^f_\theta(x)$ can also be defined by (3) and is deterministic. However, unless the weighting scheme $w(q)$ is affine, $\overline{W}_\theta(x)$ is different from $W^f_\theta(x)$ in general. Thus, unless $w$ is affine the flows defined by $W$ and $\overline{W}$ do not coincide in the noisy case. The flow using $W$ would be the $N \to \infty$ limit of a slightly more complex algorithm using several evaluations of $f$ for each sample $x_i$ in order to compute noise-free ranks.
2.7 Implementation remarks
Influence of the weighting scheme $w$. The weighting scheme $w$ directly affects the update rule (15).

A natural choice is $w(u) = \mathbb{1}_{u \leqslant q}$. This, as we have proved, results in an improvement of the $q$-quantile over the course of optimization. Taking $q = 1/2$ springs to mind; however, this is not selective enough, and both theory and experiments confirm that for the Gaussian case (CMA-ES), most efficient optimization requires $q < 1/2$ (see Section 4.2). The optimal $q$ is about $0.27$ if $N$ is not larger than the search space dimension $d$ [Bey01] and even smaller otherwise.
Second, replacing $w$ with $w + c$ for some constant $c$ clearly has no influence on the IGO continuous-time flow (5), since the gradient will cancel out the constant. However, this is not the case for the update rule (15) with a finite sample of size $N$.

Indeed, adding a constant $c$ to $w$ adds a quantity $c\, \frac{1}{N} \sum \widetilde{\nabla}_\theta \ln P_\theta(x_i)$ to the update. Since we know that the $P_\theta$-expected value of $\widetilde{\nabla}_\theta \ln P_\theta$ is $0$ (because $\int (\widetilde{\nabla}_\theta \ln P_\theta)\, P_\theta = \widetilde{\nabla}_\theta \int P_\theta = \widetilde{\nabla}_\theta\, 1 = 0$), we have $\mathbb{E}\, \frac{1}{N} \sum \widetilde{\nabla}_\theta \ln P_\theta(x_i) = 0$. So adding a constant to $w$ does not change the expected value of the update, but it may change e.g. its variance. The empirical average of $\widetilde{\nabla}_\theta \ln P_\theta(x_i)$ in the sample will be $O(1/\sqrt{N})$. So translating the weights results in a $O(1/\sqrt{N})$ change in the update. See also Section 4 in [SWSS09].

Determining an optimal value for $c$ to reduce the variance of the update is difficult, though: the optimal value actually depends on possible correlations between $\widetilde{\nabla}_\theta \ln P_\theta$ and the function $f$. The only general result is that one should shift $w$ so that $0$ lies within its range. Assuming independence, or dependence with enough symmetry, the optimal shift is when the weights average to $0$.
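The truncation scheme and the zero-average shift just discussed can be sketched as follows (names and the default $q = 0.27$ are ours):

```python
def truncation_weights(N, q=0.27, zero_sum=False):
    """hat_w_i = w((i - 1/2)/N) / N for the scheme w(u) = 1_{u <= q},
    with i ranking the samples from best (i = 1) to worst (i = N).
    With zero_sum=True the weights are shifted to average to 0."""
    w = [(1.0 if (i - 0.5) / N <= q else 0.0) / N for i in range(1, N + 1)]
    if zero_sum:
        shift = sum(w) / N
        w = [wi - shift for wi in w]
    return w
```

With $N = 10$ and $q = 0.27$, the best three samples get weight $1/10$ each and the rest get $0$; the shifted version keeps the same selection gradient while centering the weights at $0$.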
Adaptive learning rate. Comparing consecutive updates to evaluate a learning rate or step size is an effective measure. For example, in back-propagation, the update sign has been used to adapt the learning rate of each single weight in an artificial neural network [SA90]. In CMA-ES, a step size is adapted depending on whether recent steps tended to move in a consistent direction or to backtrack. This is measured by considering the changes of the mean $m$ of the Gaussian distribution. For a probability distribution $P_\theta$ on an arbitrary search space, in general no notion of mean may be defined. However, it is still possible to define "backtracking" in the evolution of $\theta$ as follows.

Consider two successive updates $\delta\theta^t = \theta^t - \theta^{t-\delta t}$ and $\delta\theta^{t+\delta t} = \theta^{t+\delta t} - \theta^t$. Their scalar product in the Fisher metric $I(\theta^t)$ is
$$\langle \delta\theta^t, \delta\theta^{t+\delta t} \rangle = \sum_{ij} I_{ij}(\theta^t)\, \delta\theta^t_i\, \delta\theta^{t+\delta t}_j.$$
Dividing by the associated norms will yield the cosine $\cos\alpha$ of the angle between $\delta\theta^t$ and $\delta\theta^{t+\delta t}$.

If this cosine is positive, the learning rate $\delta t$ may be increased. If the cosine is negative, the learning rate probably needs to be decreased. Various schemes for the change of $\delta t$ can be devised; for instance, inspired by CMA-ES, one can multiply $\delta t$ by $\exp(\beta \cos\alpha / 2)$ or $\exp\big(\beta\, (\mathbb{1}_{\cos\alpha > 0} - \mathbb{1}_{\cos\alpha < 0}) / 2\big)$, where $\beta \approx \min(N / \dim\Theta,\ 1/2)$.
As before, this scheme is constructed to be robust w.r.t. reparametrization of $\theta$, thanks to the use of the Fisher metric. However, for large learning rates $\delta t$, in practice the parametrization might well become relevant.
A consistent direction of the updates does not necessarily mean that the algorithm is performing well: for instance, when CEM/EMNA exhibits premature convergence (see below), the parameters consistently move towards a zero covariance matrix and the cosines above are positive.
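The backtracking test above can be sketched as follows, assuming for simplicity a diagonal Fisher matrix (as for, e.g., independent Bernoulli parameters); all names are ours:

```python
import math

def adapt_step_size(dt, dtheta_prev, dtheta_new, fisher_diag, beta):
    """Multiply dt by exp(beta * cos(alpha) / 2), where alpha is the angle
    between two successive parameter updates, measured in a (diagonal)
    Fisher metric I = diag(fisher_diag)."""
    dot = sum(I * a * b for I, a, b in zip(fisher_diag, dtheta_prev, dtheta_new))
    norm_prev = math.sqrt(sum(I * a * a for I, a in zip(fisher_diag, dtheta_prev)))
    norm_new = math.sqrt(sum(I * b * b for I, b in zip(fisher_diag, dtheta_new)))
    cos_alpha = dot / (norm_prev * norm_new)
    return dt * math.exp(beta * cos_alpha / 2)
```

Consistent successive updates ($\cos\alpha = 1$) grow the step size multiplicatively, while a full reversal ($\cos\alpha = -1$) shrinks it by the inverse factor.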
Complexity. The complexity of the IGO algorithm depends much on the computational cost model. In optimization, it is fairly common to assume that the objective function $f$ is very costly compared to any other calculations performed by the algorithm. Then the cost of IGO in terms of number of $f$-calls is $N$ per iteration, and the cost of using quantiles and computing the natural gradient is negligible.
Setting the cost of $f$ aside, the complexity of the IGO algorithm depends mainly on the computation of the (inverse) Fisher matrix. Assume an analytical expression for this matrix is known. Then, with $p = \dim\Theta$ the number of parameters, the cost of storage of the Fisher matrix is $O(p^2)$ per iteration, and its inversion typically costs $O(p^3)$ per iteration. However, depending on the situation and on possible algebraic simplifications, strategies exist to reduce this cost (e.g. [LRMB07] in a learning context). For instance, for CMA-ES the cost is $O(Np)$ [SHI09]. More generally, parametrization by expectation parameters (see above), when available, may reduce the cost to $O(Np)$ as well.
If no analytical form of the Fisher matrix is known and Monte Carlo estimation is required, then complexity depends on the particular situation at hand and is related to the best sampling strategies available for a particular family of distributions. For Boltzmann machines, for instance, a host of such strategies are available. Still, in such a situation, IGO can be competitive if the objective function $f$ is costly.
Recycling samples. We might reuse samples from previous iterations, not only the last one, to compute the ranks in (12), see e.g. [SWSS09]. For $N = 1$ this is indispensable. In order to preserve sampling consistency (Theorem 4) the old samples need to be reweighted (using the ratio of their new vs old likelihood, as in importance sampling).
Initialization. As with other optimization algorithms, it is probably a good idea to initialize in such a way as to cover a wide portion of the search space, i.e. $\theta^0$ should be chosen so that $P_{\theta^0}$ has maximal diversity. For IGO algorithms this is particularly relevant, since, as explained above, the natural gradient provides minimal change of diversity (greedily at each step) for a given change in the objective function.
3 IGO, maximum likelihood and the cross-entropy method

IGO as a smooth-time maximum likelihood estimate. The IGO flow turns out to be the only way to maximize a weighted log-likelihood, where points of the current distribution are slightly reweighted according to $f$-preferences.
This relies on the following interpretation of the natural gradient as aweighted maximum likelihood update with infinitesimal learning rate. Thisresult singles out, in yet another way, the natural gradient among all possiblegradients. The proof is given in the Appendix.
Theorem 13 (Natural gradient as ML with infinitesimal weights). Let $\varepsilon > 0$ and $\theta_0 \in \Theta$. Let $W(x)$ be a function of $x$ and let $\theta$ be the solution of
$$\theta = \arg\max_\theta \Big\{ (1 - \varepsilon) \underbrace{\int \log P_\theta(x)\, P_{\theta_0}(\mathrm{d}x)}_{\text{maximal for } \theta = \theta_0} + \varepsilon \int \log P_\theta(x)\, W(x)\, P_{\theta_0}(\mathrm{d}x) \Big\}.$$
Then, when $\varepsilon \to 0$, up to $O(\varepsilon^2)$ we have
$$\theta = \theta_0 + \varepsilon \int \widetilde{\nabla}_\theta \ln P_\theta(x)\, W(x)\, P_{\theta_0}(\mathrm{d}x).$$
Likewise for discrete samples: with $x_1, \ldots, x_N \in X$, let $\theta$ be the solution of
$$\theta = \arg\max_\theta \Big\{ (1 - \varepsilon) \int \log P_\theta(x)\, P_{\theta_0}(\mathrm{d}x) + \varepsilon \sum_i W(x_i) \log P_\theta(x_i) \Big\}.$$
Then when $\varepsilon \to 0$, up to $O(\varepsilon^2)$ we have
$$\theta = \theta_0 + \varepsilon \sum_i W(x_i)\, \widetilde{\nabla}_\theta \ln P_\theta(x_i).$$
So if $W(x) = W^f_{\theta_0}(x)$ is the weight of the points according to quantilized $f$-preferences, the weighted maximum log-likelihood necessarily is the IGO flow (7) using the natural gradient, or the IGO update (14) when using samples.

Thus the IGO flow is the unique flow that, continuously in time, slightly changes the distribution to maximize the log-likelihood of points with good values of $f$. Moreover IGO continuously updates the weight $W^f_{\theta_0}(x)$ depending on $f$ and on the current distribution, so that we keep optimizing.

This theorem suggests a way to approximate the IGO flow by enforcing this interpretation for a given non-infinitesimal step size $\delta t$, as follows.
Definition 14 (IGO-ML algorithm). The IGO-ML algorithm with step size $\delta t$ updates the value of the parameter $\theta^t$ according to
$$\theta^{t+\delta t} = \arg\max_\theta \Big\{ \Big(1 - \delta t \sum_i \widehat{w}_i\Big) \int \log P_\theta(x)\, P_{\theta^t}(\mathrm{d}x) + \delta t \sum_i \widehat{w}_i \log P_\theta(x_i) \Big\} \qquad (23)$$
where $x_1, \ldots, x_N$ are sample points picked according to the distribution $P_{\theta^t}$, and $\widehat{w}_i$ is the weight (12) obtained from the ranked values of the objective function $f$.
As for the cross-entropy method below, this only makes algorithmic senseif the argmax is tractable.
It turns out that IGO-ML is just the IGO algorithm in a particularparametrization (see Theorem 15).
The cross-entropy method. Taking $\delta t = 1$ in (23) above corresponds to a full maximum likelihood update, which is also related to the cross-entropy method (CEM). The cross-entropy method can be defined as follows [dBKMR05] in an optimization setting. Like IGO, it depends on a family of probability distributions $P_\theta$ parametrized by $\theta \in \Theta$, and a number of samples $N$ at each iteration. Let also $N_e = \lceil qN \rceil$ ($0 < q < 1$) be a number of elite samples.

At each step, the cross-entropy method for optimization samples $N$ points $x_1, \ldots, x_N$ from the current distribution $P_{\theta^t}$. Let $\widehat{w}_i$ be $1/N_e$ if $x_i$ belongs to the $N_e$ samples with the best value of the objective function $f$, and $\widehat{w}_i = 0$ otherwise. Then the cross-entropy method or maximum likelihood update (CEM/ML) for optimization is
$$\theta^{t+1} = \arg\max_\theta \sum \widehat{w}_i \log P_\theta(x_i) \qquad (24)$$
(assuming the argmax is tractable).

A version with a smoother update depends on a step size parameter $0 < \alpha \leqslant 1$ and is given [dBKMR05] by
$$\theta^{t+1} = (1 - \alpha)\, \theta^t + \alpha \arg\max_\theta \sum \widehat{w}_i \log P_\theta(x_i). \qquad (25)$$
The standard CEM/ML update corresponds to $\alpha = 1$. For $\alpha = 1$ the standard cross-entropy method is independent of the parametrization $\theta$, whereas for $\alpha < 1$ this is not the case.

Note the difference between the IGO-ML algorithm (23) and the smoothed CEM update (25) with step size $\alpha = \delta t$: the smoothed CEM update performs a weighted average of the parameter value after taking the maximum likelihood estimate, whereas IGO-ML uses a weighted average of current and previous likelihoods, then takes a maximum likelihood estimate. In general, these two rules can greatly differ, as they do for Gaussian distributions (Section 4.2).

This reversal of the order of averaging makes IGO-ML parametrization-independent whereas the smoothed CEM update is not.
Yet, for exponential families of probability distributions, there exists one particular parametrization in which the IGO algorithm and the smoothed CEM update coincide. We now proceed to this construction.
IGO for expectation parameters and maximum likelihood. The particular form of IGO for exponential families has an interesting consequence if the parametrization chosen for the exponential family is the set of expectation parameters. Let $P_\theta(x) = \frac{1}{Z(\theta)} \exp\big(\sum \theta_i T_i(x)\big)\, H(\mathrm{d}x)$ be an exponential family as above. The expectation parameters are $\bar{T}_i = \bar{T}_i(\theta) = \mathbb{E}_{P_\theta} T_i$ (denoted $\eta_i$ in [AN00, (3.56)]). The notation $\bar{T}$ will denote the collection $(\bar{T}_i)$.
It is well-known that, in this parametrization, the maximum likelihood estimate for a sample of points $x_1, \ldots, x_k$ is just the empirical average of the expectation parameters over that sample:
$$\arg\max_{\bar{T}} \frac{1}{k} \sum_{i=1}^k \log P_{\bar{T}}(x_i) = \frac{1}{k} \sum_{i=1}^k T(x_i). \qquad (26)$$
In the discussion above, one main difference between IGO and smoothed CEM was whether we took averages before or after taking the maximum log-likelihood estimate. For the expectation parameters $\bar{T}_i$, we see that these operations commute. (One can say that these expectation parameters "linearize maximum likelihood estimates".) A little work brings us to the
Theorem 15 (IGO, CEM and maximum likelihood). Let
$$P_\theta(x) = \frac{1}{Z(\theta)} \exp\Big(\sum_i \theta_i T_i(x)\Big)\, H(\mathrm{d}x)$$
be an exponential family of probability distributions, where the $T_i$ are functions of $x$ and $H$ is some reference measure. Let us parametrize this family by the expected values $\bar{T}_i = \mathbb{E}\, T_i$.

Let us assume the chosen weights $\widehat{w}_i$ sum to 1. For a sample $x_1, \ldots, x_N$, let
$$T^*_i = \sum_j \widehat{w}_j\, T_i(x_j).$$
Then the IGO update (14) in this parametrization reads
$$\bar{T}^{\,t+\delta t}_i = (1 - \delta t)\, \bar{T}^{\,t}_i + \delta t\, T^*_i. \qquad (27)$$

Moreover these three algorithms coincide:

• The IGO-ML algorithm (23).

• The IGO algorithm written in the parametrization $\bar{T}_i$.

• The smoothed CEM algorithm (25) written in the parametrization $\bar{T}_i$, with $\alpha = \delta t$.
Corollary 16. The standard CEM/ML update (24) is the IGO algorithm in parametrization $\bar{T}_i$ with $\delta t = 1$.
Beware that the expectation parameters $\bar{T}_i$ are not always the most obvious parameters [AN00, Section 3.5]. For example, for 1-dimensional Gaussian distributions, these expectation parameters are the mean $\mu$ and the second moment $\mu^2 + \sigma^2$. When expressed back in terms of mean and variance, with the update (27) the new mean is $(1 - \delta t)\mu + \delta t\, \mu^*$, but the new variance is $(1 - \delta t)\sigma^2 + \delta t\, (\sigma^*)^2 + \delta t(1 - \delta t)(\mu^* - \mu)^2$.

On the other hand, when using smoothed CEM with mean and variance as parameters, the new variance is $(1 - \delta t)\sigma^2 + \delta t\, (\sigma^*)^2$, which can be significantly smaller for $\delta t \in (0, 1)$. This proves, in passing, that the smoothed CEM update in other parametrizations is generally not an IGO algorithm (because it can differ at first order in $\delta t$).
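These two variance formulas can be checked numerically; a sketch (our naming) of both updates for a 1-dimensional Gaussian:

```python
def igo_ml_update(mu, var, mu_star, var_star, dt):
    """IGO-ML / expectation-parameter update (27) for a 1-D Gaussian:
    interpolate the expectation parameters (mu, mu^2 + var), then
    convert back to (mean, variance)."""
    new_mu = (1 - dt) * mu + dt * mu_star
    new_second_moment = (1 - dt) * (mu * mu + var) + dt * (mu_star * mu_star + var_star)
    return new_mu, new_second_moment - new_mu * new_mu

def smoothed_cem_update(mu, var, mu_star, var_star, dt):
    """Smoothed CEM (25) in the (mu, sigma^2) parametrization:
    interpolate mean and variance directly."""
    return (1 - dt) * mu + dt * mu_star, (1 - dt) * var + dt * var_star
```

Whenever $\mu^* \neq \mu$, the IGO-ML variance exceeds the smoothed-CEM variance by the extra term $\delta t(1-\delta t)(\mu^* - \mu)^2$, which is the mechanism behind CEM's premature variance reduction discussed below.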
The case of Gaussian distributions is further exemplified in Section 4.2 below: in particular, smoothed CEM in the $(\mu, \sigma)$ parametrization exhibits premature reduction of variance, preventing good convergence.

For these reasons we think that the IGO-ML algorithm is the sensible way to perform an interpolated ML estimate for $\delta t < 1$, in a parametrization-independent way. In Section 6 we further discuss IGO and CEM and sum up the differences and relative advantages.
Taking $\delta t = 1$ is a bold approximation choice: the "ideal" continuous-time IGO flow itself, after time 1, does not coincide with the maximum likelihood update of the best points in the sample. Since the maximum likelihood algorithm is known to converge prematurely in some instances (Section 4.2), using the parametrization by expectation parameters with large $\delta t$ may not be desirable.
The considerable simplification of the IGO update in these coordinates reflects the duality of coordinates $\theta_i$ and $\bar{T}_i$. More precisely, the natural gradient ascent w.r.t. the parameters $\theta_i$ is given by the vanilla gradient w.r.t. the parameters $\bar{T}_i$: $\widetilde{\nabla}_\theta = \frac{\partial}{\partial \bar{T}}$ (Proposition 22 in the Appendix).
4 CMA-ES, NES, EDAs and PBIL from the IGO framework
In this section we investigate the IGO algorithm for Bernoulli measures and for multivariate normal distributions and show the correspondence to well-known algorithms. In addition, we discuss the influence of the parametrization of the distributions.
4.1 IGO algorithm for Bernoulli measures and PBIL
We consider on $X = \{0, 1\}^d$ a family of Bernoulli measures $P_\theta(x) = p_{\theta_1}(x_1) \times \cdots \times p_{\theta_d}(x_d)$ with $p_{\theta_i}(x_i) = \theta_i^{x_i}(1 - \theta_i)^{1 - x_i}$. As this family is a product of probability measures $p_{\theta_i}(x_i)$, the different components of a random vector $y$ following $P_\theta$ are independent and all off-diagonal terms of the Fisher information matrix (FIM) are zero. Diagonal terms are given by $\frac{1}{\theta_i(1 - \theta_i)}$. Therefore the inverse of the FIM is a diagonal matrix with diagonal entries equal to $\theta_i(1 - \theta_i)$. In addition, the partial derivative of $\ln P_\theta(x)$ w.r.t. $\theta_i$ can be computed in a straightforward manner resulting in
$$\frac{\partial \ln P_\theta(x)}{\partial\theta_i} = \frac{x_i}{\theta_i} - \frac{1 - x_i}{1 - \theta_i}.$$
Let $x_1, \ldots, x_N$ be $N$ samples at step $t$ with distribution $P_{\theta^t}$ and let $x_{1:N}, \ldots, x_{N:N}$ be the ranked samples according to $f$. The natural gradient update (15) with Bernoulli measures is then given by
$$\theta^{t+\delta t}_i = \theta^t_i + \delta t\, \theta^t_i(1 - \theta^t_i) \sum_{j=1}^N \widehat{w}_j \left( \frac{[x_{j:N}]_i}{\theta^t_i} - \frac{1 - [x_{j:N}]_i}{1 - \theta^t_i} \right) \qquad (28)$$
where $\widehat{w}_j = w((j - 1/2)/N)/N$ and $[y]_i$ denotes the $i$-th coordinate of $y \in X$. The previous equation simplifies to
$$\theta^{t+\delta t}_i = \theta^t_i + \delta t \sum_{j=1}^N \widehat{w}_j \left( [x_{j:N}]_i - \theta^t_i \right),$$
or, denoting $\bar{w}$ the sum of the weights $\sum_{j=1}^N \widehat{w}_j$,
$$\theta^{t+\delta t}_i = (1 - \bar{w}\, \delta t)\, \theta^t_i + \delta t \sum_{j=1}^N \widehat{w}_j\, [x_{j:N}]_i.$$
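The simplified update can be turned into a runnable sketch ($f$ is minimized; all names and the test problem are ours):

```python
import random

def igo_bernoulli_step(theta, f, N, weights, dt):
    """One IGO step for Bernoulli measures, using the simplified update
    theta_i <- theta_i + dt * sum_j hat_w_j ([x_{j:N}]_i - theta_i),
    where `weights` are the hat_w_j for the ranked samples (best first)."""
    xs = [[1 if random.random() < t else 0 for t in theta] for _ in range(N)]
    xs.sort(key=f)  # x_{1:N}, ..., x_{N:N}
    return [
        t + dt * sum(w * (x[i] - t) for w, x in zip(weights, xs))
        for i, t in enumerate(theta)
    ]
```

With `weights = [1, 0, ..., 0]` and `dt` equal to the learning rate, this is the PBIL/EGA update towards the best sample only, as explained next.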
If we set the IGO weights as $\widehat{w}_1 = 1$, $\widehat{w}_j = 0$ for $j = 2, \ldots, N$, we recover the PBIL/EGA algorithm with update rule towards the best solution only (disregarding the mutation of the probability vector), with $\delta t = \mathrm{LR}$ where $\mathrm{LR}$ is the so-called learning rate of the algorithm [Bal94, Figure 4]. The PBIL update rule towards the $\mu$ best solutions, proposed in [BC95, Figure 4]², can be recovered as well using
$$\delta t = \mathrm{LR}, \qquad \widehat{w}_j = (1 - \mathrm{LR})^{j-1}, \quad \text{for } j = 1, \ldots, \mu,$$
$$\widehat{w}_j = 0, \quad \text{for } j = \mu + 1, \ldots, N.$$
Interestingly, the parameters $\theta_i$ are the expectation parameters described in Section 3: indeed, the expectation of $x_i$ is $\theta_i$. So the formulas above are particular cases of (27). Thus, by Theorem 15, PBIL is both a smoothed CEM in these parameters and an IGO-ML algorithm.
Let us now consider another, so-called "logit" representation, given by the logistic function $P(x_i = 1) = \frac{1}{1 + \exp(-\theta_i)}$. This $\theta$ is the exponential parametrization of Section 2.3. We find that
$$\frac{\partial \ln P_\theta(x)}{\partial\theta_i} = (x_i - 1) + \frac{\exp(-\theta_i)}{1 + \exp(-\theta_i)} = x_i - \mathbb{E}\, x_i$$
(cf. (16)) and that the diagonal elements of the Fisher information matrix are given by $\exp(-\theta_i)/(1 + \exp(-\theta_i))^2 = \operatorname{Var} x_i$ (compare with (17)). So the natural gradient update (15) with Bernoulli measures now reads
$$\theta^{t+\delta t}_i = \theta^t_i + \delta t\, \big(1 + \exp(\theta^t_i)\big) \left( -\bar{w} + \big(1 + \exp(-\theta^t_i)\big) \sum_{j=1}^N \widehat{w}_j\, [x_{j:N}]_i \right).$$
To better compare the update with the previous representation, note that $p_i = \frac{1}{1 + \exp(-\theta_i)}$ and thus we can rewrite
$$\theta^{t+\delta t}_i = \theta^t_i + \frac{\delta t}{p^t_i(1 - p^t_i)} \sum_{j=1}^N \widehat{w}_j \left( [x_{j:N}]_i - p^t_i \right).$$
So the direction of the update is the same as before and is given by the proportion of bits set to 0 or 1 in the sample, compared to its expected value under the current distribution. The magnitude of the update is different since the parameter $\theta$ ranges from $-\infty$ to $+\infty$ instead of from 0 to 1. We did not find this algorithm in the literature.
These updates illustrate the influence of setting the sum of weights to 0 or not (Section 2.7). If, at some time, the first bit is set to 1 both for a majority of good points and for a majority of bad points, then the original PBIL will increase the probability of setting the first bit to 1, which is counterintuitive. If the weights $\widehat{w}_i$ are chosen to sum to 0 this noise effect disappears; otherwise, it disappears only on average.
^2 Note that the pseudocode for the algorithm in [BC95, Figure 4] is slightly erroneous since it gives smaller weights to better individuals. The error can be fixed by updating the probability in reversed order, looping from NUMBER_OF_VECTORS_TO_UPDATE_FROM to 1. This was confirmed by S. Baluja in personal communication. We consider here the corrected version of the algorithm.
4.2 Multivariate normal distributions (Gaussians)
Evolution strategies [Rec73, Sch95, BS02] are black-box optimization algorithms for the continuous search domain, $X \subseteq \mathbb{R}^d$ (for simplicity we assume $X = \mathbb{R}^d$ in the following). They sample new solutions from a multivariate normal distribution. In the context of continuous black-box optimization, Natural Evolution Strategies (NES) introduced the idea of using a natural gradient update of the distribution parameters [WSPS08, SWSS09, GSS+10]. Surprisingly, the well-known Covariance Matrix Adaptation Evolution Strategy, CMA-ES [HO96, HO01, HMK03, HK04, JA06], also turns out to conduct a natural gradient update of distribution parameters [ANOK10, GSS+10].
Let $x \in \mathbb{R}^d$. As the most prominent example, we use the mean vector $m = \mathbb{E}\, x$ and the covariance matrix $C = \mathbb{E}(x-m)(x-m)^{\mathrm T} = \mathbb{E}(xx^{\mathrm T}) - mm^{\mathrm T}$ to parametrize the distribution via $\theta = (m, C)$. The IGO update in (14) or (15) depends on the chosen parametrization, but can now be entirely reformulated without the (inverse) Fisher matrix, compare (18). The complexity of the update is linear in the number of parameters (size of $\theta = (m, C)$, where $(d^2 - d)/2$ parameters are redundant). We discuss known algorithms that implement variants of this update.
CMA-ES. The CMA-ES implements the equations^3
$$m^{t+1} = m^t + \eta_m \sum_{i=1}^N \hat w_i\, (x_i - m^t) \qquad (29)$$
$$C^{t+1} = C^t + \eta_c \sum_{i=1}^N \hat w_i \left( (x_i - m^t)(x_i - m^t)^{\mathrm T} - C^t \right) \qquad (30)$$
where $\hat w_i$ are the weights based on ranked $f$-values, see (12) and (14).

When $\eta_c = \eta_m$, Equations (30) and (29) coincide with the IGO update (14) expressed in the parametrization $(m, C)$ [ANOK10, GSS+10]^4. Note, however, that the learning rates $\eta_m$ and $\eta_c$ take essentially different values in CMA-ES if $N \ll \dim \Theta$^5: this is in deviation from an IGO algorithm. (Remark that the Fisher information matrix is block-diagonal in $m$ and $C$ [ANOK10], so that the application of the different learning rates and of the inverse Fisher matrix commute.)
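The pure rank-$\mu$ update (29)–(30) can be sketched in a few lines of Python. This toy version (our illustration, not production CMA-ES) omits step-size control and evolution paths, i.e. it corresponds to the setting of footnote 3, and uses arbitrary truncation weights and learning rates:

```python
import numpy as np

rng = np.random.default_rng(1)

def cma_rank_mu_step(m, C, f, n=20, eta_m=1.0, eta_c=0.2):
    """One iteration of the pure rank-mu update (29)-(30).

    No step-size control and no evolution paths (c_1 = c_sigma = 0 as in
    footnote 3); truncation weights 1/mu on the mu best samples; f minimized.
    """
    X = rng.multivariate_normal(m, C, size=n)     # sample from N(m, C)
    order = np.argsort([f(x) for x in X])
    mu = n // 2
    w = np.zeros(n)
    w[order[:mu]] = 1.0 / mu
    Y = X - m                                     # centered samples
    m_new = m + eta_m * Y.T @ w                   # update (29)
    C_new = C + eta_c * sum(w[i] * (np.outer(Y[i], Y[i]) - C)
                            for i in range(n))    # update (30)
    return m_new, C_new

m, C = np.array([3.0, -2.0]), np.eye(2)
for _ in range(60):
    m, C = cma_rank_mu_step(m, C, lambda x: (x ** 2).sum())
print(np.linalg.norm(m))  # the mean approaches the optimum at the origin
```

Note that with non-negative weights, each $C^{t+1}$ is a convex combination of positive semi-definite matrices, so the sampling step stays well defined.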
Natural evolution strategies. Natural evolution strategies (NES) [WSPS08, SWSS09] implement (29) as well, but use a Cholesky decomposition of $C$ as parametrization for the update of the variance parameters. The resulting update that replaces (30) is neither particularly elegant nor numerically efficient. However, the most recent xNES [GSS+10] chooses an "exponential" parametrization depending on the current parameters. This leads to an elegant formulation where the additive update in exponential parametrization becomes a multiplicative update for $C$ in (30). With $C = AA^{\mathrm T}$, the matrix
^3 The CMA-ES implements these equations given the parameter setting $c_1 = 0$ and $c_\sigma = 0$ (or $d_\sigma = \infty$, see e.g. [Han09]) that disengages the effect of both so-called evolution paths.
^4 In these articles the result has been derived for $\theta \leftarrow \theta + \delta t\, \widetilde\nabla_\theta\, \mathbb{E}_{P_\theta} f$, see (9), leading to $f(x_i)$ in place of $\hat w_i$. No assumptions on $f$ have been used besides that it does not depend on $\theta$. By replacing $f$ with $W_f^{\theta^t}$, where $\theta^t$ is fixed, the derivation holds equally well for $\theta \leftarrow \theta + \delta t\, \widetilde\nabla_\theta\, \mathbb{E}_{P_\theta} W_f^{\theta^t}$.
^5 Specifically, let $\sum |\hat w_i| = 1$; then $\eta_m = 1$ and $\eta_c \approx 1 \wedge 1/(d^2 \sum \hat w_i^2)$.
update reads
$$A \leftarrow A \times \exp\left( \frac{\eta_c}{2} \sum_{i=1}^N \hat w_i \left( A^{-1}(x_i - m)\,(A^{-1}(x_i - m))^{\mathrm T} - \mathrm{I}_d \right) \right) \qquad (31)$$
where $\mathrm{I}_d$ is the identity matrix. (From (31) we get $C \leftarrow A \times \exp(\ldots)^2 \times A^{\mathrm T}$.) Remark that in the representation $(A^{-1}m,\, A^{-1}C A^{-\mathrm T}) = (A^{-1}m,\, \mathrm{I}_d)$, the Fisher information matrix becomes diagonal.
The update has the advantage over (30) that even negative weights always lead to feasible values. By default, $\eta_m = \eta_c$ in xNES in the same circumstances as in CMA-ES (most parameter settings are borrowed from CMA-ES), but contrary to CMA-ES the past evolution path is not taken into account [GSS+10].
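This feasibility claim can be checked numerically. The following toy Python snippet (our illustration with arbitrary weights summing to 0, not the full xNES algorithm, and no ranking) iterates the multiplicative update (31) and verifies that $C = AA^{\mathrm T}$ stays positive definite:

```python
import numpy as np

rng = np.random.default_rng(2)

def sym_expm(S):
    """Matrix exponential of a symmetric matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs * np.exp(vals)) @ vecs.T

d, n, eta_c = 5, 10, 0.3
A = np.eye(d)
# weights that sum to 0: half positive, half negative (arbitrary, unranked)
w = np.concatenate([np.full(n // 2, 1.0 / n), np.full(n // 2, -1.0 / n)])
for _ in range(100):
    Z = rng.standard_normal((n, d))          # z_i = A^{-1}(x_i - m)
    G = sum(w[i] * (np.outer(Z[i], Z[i]) - np.eye(d)) for i in range(n))
    A = A @ sym_expm(0.5 * eta_c * G)        # multiplicative update (31)
C = A @ A.T
print(np.all(np.linalg.eigvalsh(C) > 0))     # True: C remains positive definite
```

The additive update (30) with the same negative weights can produce an indefinite $C$; the matrix exponential cannot, since $\exp$ of a symmetric matrix is always positive definite.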
When $\eta_c = \eta_m$, xNES is consistent with the IGO flow (6), and implements a slightly generalized IGO algorithm (14) using a $\theta$-dependent parametrization.
Cross-entropy method and EMNA. Estimation of distribution algorithms (EDA) and the cross-entropy method (CEM) [Rub99, RK04] estimate a new distribution from a censored sample. Generally, the new parameter value can be written as
$$\theta_{\mathrm{maxLL}} = \arg\max_\theta \sum_{i=1}^N \hat w_i \ln P_\theta(x_i) \qquad (32)$$
$$\xrightarrow[N \to \infty]{} \arg\max_\theta\, \mathbb{E}_{P_{\theta^t}} W_f^{\theta^t} \ln P_\theta.$$
For positive weights, $\theta_{\mathrm{maxLL}}$ maximizes the weighted log-likelihood of $x_1, \ldots, x_N$. The argument under the $\arg\max$ in the RHS of (32) is the negative cross-entropy between $P_\theta$ and the distribution of censored (elitist) samples $P_{\theta^t} W_f^{\theta^t}$, or of $N$ realizations of such samples. The distribution $P_{\theta_{\mathrm{maxLL}}}$ has therefore minimal cross-entropy and minimal KL-divergence to the distribution of the $\mu$ best samples.^6 As said above, (32) recovers the cross-entropy method (CEM) [Rub99, RK04].
For Gaussian distributions, equation (32) can be explicitly written in the form
$$m^{t+1} = m^* = \sum_{i=1}^N \hat w_i\, x_i \qquad (33)$$
$$C^{t+1} = C^* = \sum_{i=1}^N \hat w_i\, (x_i - m^*)(x_i - m^*)^{\mathrm T}, \qquad (34)$$
the empirical mean and variance of the elite sample. The weights $\hat w_i$ are equal to $1/\mu$ for the $\mu$ best points and 0 otherwise.
Equations (33) and (34) also define the most fundamental continuous-domain EDA, the estimation of multivariate normal algorithm (EMNA_global, [LL02]). It might be interesting to notice that (33) and (34) only differ from (29) and (30) in that the new mean $m^{t+1}$ is used in the covariance matrix update.
^6 Let $p_w$ denote the distribution of the weighted samples: $\Pr(x = x_i) = \hat w_i$ with $\sum_i \hat w_i = 1$. Then the cross-entropy between $p_w$ and $P_\theta$ reads $\sum_i p_w(x_i) \ln 1/P_\theta(x_i)$, and the KL-divergence reads $\mathrm{KL}(p_w \,\|\, P_\theta) = \sum_i p_w(x_i) \ln 1/P_\theta(x_i) - \sum_i p_w(x_i) \ln 1/p_w(x_i)$. Minimization of both terms in $\theta$ results in $\theta_{\mathrm{maxLL}}$.
Let us compare IGO-ML (23), CMA (29)–(30), and smoothed CEM (25) in the parametrization with mean and covariance matrix. For learning rate $\delta t = 1$, IGO-ML and smoothed CEM/EMNA realize $\theta_{\mathrm{maxLL}}$ from (32)–(34). For $\delta t < 1$ these algorithms all update the distribution mean in the same way; the update of mean and covariance matrix is computed to be
$$m^{t+1} = (1 - \delta t)\, m^t + \delta t\, m^*$$
$$C^{t+1} = (1 - \delta t)\, C^t + \delta t\, C^* + \delta t\,(1 - \delta t)^j\, (m^* - m^t)(m^* - m^t)^{\mathrm T}, \qquad (35)$$
for different values of $j$, where $m^*$ and $C^*$ are the mean and covariance matrix computed over the elite sample (with weights $\hat w_i$) as above. For CMA we have $j = 0$, for IGO-ML we have $j = 1$, and for smoothed CEM we have $j = \infty$ (the rightmost term is absent). The case $j = 2$ corresponds to an update that uses $m^{t+1}$ instead of $m^t$ in (30) (compare [Han06b]). For $0 < \delta t < 1$, the larger $j$, the smaller $C^{t+1}$.
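The following small Python sketch (our illustration; the elite statistics are computed from an arbitrary sample) implements the family (35) and checks the last claim, namely that for fixed $0 < \delta t < 1$ the trace of $C^{t+1}$ decreases as $j$ grows:

```python
import numpy as np

rng = np.random.default_rng(3)

def update_35(m, C, m_star, C_star, dt, j):
    """The interpolation family (35): j = 0 is CMA, j = 1 is IGO-ML,
    j = inf is smoothed CEM (the cross term vanishes)."""
    d = m_star - m
    cross = 0.0 if np.isinf(j) else dt * (1 - dt) ** j * np.outer(d, d)
    return (1 - dt) * m + dt * m_star, (1 - dt) * C + dt * C_star + cross

# elite mean and covariance from some sample (stand-ins for m*, C*)
m, C = np.zeros(2), np.eye(2)
X = rng.multivariate_normal(m, C, size=5)
m_star = X.mean(axis=0)
C_star = (X - m_star).T @ (X - m_star) / len(X)

traces = [np.trace(update_35(m, C, m_star, C_star, 0.5, j)[1])
          for j in (0, 1, 2, np.inf)]
print(traces)  # decreasing in j: the larger j, the smaller C^{t+1}
```

The mean update is the same for every $j$; only the rank-one cross term, weighted by $(1 - \delta t)^j$, changes.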
The rightmost term of (35) resembles the so-called rank-one update in CMA.
For $\delta t \to 0$, the update is independent of $j$ at first order in $\delta t$ if $j < \infty$: this reflects compatibility with the IGO flow of CMA and of IGO-ML, but not of smoothed CEM.
Critical $\delta t$. Let us assume that the weights $\hat w_i$ are non-negative, sum to one, and $\mu < N$ weights take a value of $1/\mu$, so that the selection quantile is $q = \mu/N$.
There is a critical value of $\delta t$ depending on this quantile $q$, such that above the critical $\delta t$ the algorithms given by IGO-ML and smoothed CEM are prone to premature convergence. Indeed, let $f$ be a linear function on $\mathbb{R}^d$, and consider the variance in the direction of the gradient of $f$. Assuming further $N \to \infty$ and $q \le 1/2$, the variance $C^*$ of the elite sample is smaller than the current variance $C^t$ by a constant factor. Depending on the precise update for $C^{t+1}$, if $\delta t$ is too large, the variance $C^{t+1}$ is going to be smaller than $C^t$ by a constant factor as well. This implies that the algorithm is going to stall. (On the other hand, the continuous-time IGO flow corresponding to $\delta t \to 0$ does not stall, see Section 4.3.)
We now study the critical $\delta t$ under which the algorithm does not stall. For IGO-ML ($j = 1$ in (35), or equivalently for the smoothed CEM in the expectation parameters $(m, C + mm^{\mathrm T})$, see Section 3), the variance increases if and only if $\delta t$ is smaller than the critical value $\delta t_{\mathrm{crit}} = q\, b\, \sqrt{2\pi}\, e^{b^2/2}$, where $b$ is the percentile function of $q$, i.e. $b$ is such that $q = \int_b^\infty e^{-x^2/2}\,\mathrm{d}x / \sqrt{2\pi}$. This value $\delta t_{\mathrm{crit}}$ is plotted as a solid line in Fig. 1. For $j = 2$, $\delta t_{\mathrm{crit}}$ is smaller, related to the above by $\delta t_{\mathrm{crit}}^{j=2} = \sqrt{1 + \delta t_{\mathrm{crit}}^{j=1}} - 1$, and plotted as a dashed line in Fig. 1. For CEM ($j = \infty$), the critical $\delta t$ is zero. For CMA-ES ($j = 0$), the critical $\delta t$ is infinite for $q < 1/2$. When the selection quantile $q$ is above $1/2$, the critical $\delta t$ becomes zero for all algorithms.
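The critical value can be evaluated numerically. The sketch below (our illustration) computes $b$ by bisection on the Gaussian tail and confirms that $\delta t_{\mathrm{crit}}$ for $j = 1$ grows towards 1 as $q \to 0$ and vanishes at $q = 1/2$, in line with the solid curve of Figure 1:

```python
import math

def b_of_q(q):
    """Solve q = (integral from b to infinity of exp(-x^2/2) dx) / sqrt(2 pi)
    for b, by bisection on the Gaussian tail."""
    tail = lambda b: 0.5 * (1.0 - math.erf(b / math.sqrt(2.0)))
    lo, hi = -10.0, 10.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if tail(mid) > q:
            lo = mid          # tail too heavy: b must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)

def dt_crit(q):
    """Critical learning rate for IGO-ML (j = 1): q * b * sqrt(2 pi) * exp(b^2/2)."""
    b = b_of_q(q)
    return q * b * math.sqrt(2.0 * math.pi) * math.exp(b * b / 2.0)

for q in (0.05, 0.1, 0.25, 0.4, 0.5):
    print(round(q, 2), round(dt_crit(q), 3))
```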
We conclude that, despite the principled approach of ascending the natural gradient, the choice of the weighting function $w$, the choice of $\delta t$, and the possible choices in the update for $\delta t > 0$ need to be made with great care in relation to the choice of parametrization.
4.3 Computing the IGO flow for some simple examples
In this section we take a closer look at the IGO flow solutions of (6) for some simple examples of fitness functions, for which it is possible to obtain exact information about these IGO trajectories.
[Figure 1 here: critical learning rate $\delta t$ (0 to 1) versus truncation quantile $q$ (0 to 0.5), with curves for $j = 1$ (solid), $j = 2$ (dashed), and $j = \infty$.]

Figure 1: Critical $\delta t$ versus truncation quantile $q$. Above the critical $\delta t$, the variance decreases systematically when optimizing a linear function. For CMA-ES/NES, the critical $\delta t$ for $q < 0.5$ is infinite.
We start with the discrete search space $X = \{0,1\}^d$ and linear functions (to be minimized) defined as $f(x) = c - \sum_{i=1}^d \alpha_i x_i$ with $\alpha_i > 0$. Note that the onemax function to be maximized, $f_{\mathrm{onemax}}(x) = \sum_{i=1}^d x_i$, is covered by setting $\alpha_i = 1$. The differential equation (6) for the Bernoulli measures $P_\theta(x) = p_{\theta_1}(x_1) \cdots p_{\theta_d}(x_d)$ defined on $X$ can be derived by taking the limit of the IGO-PBIL update given in (28):
$$\frac{\mathrm{d} p_i^t}{\mathrm{d} t} = \int_X W_f^{\theta^t}(x)\, (x_i - p_i^t)\, P_{\theta^t}(\mathrm{d}x) =: g_i(\theta^t). \qquad (36)$$
Though finding the analytical solution of the differential equation (36) for any initial condition seems a bit intricate, we show that the equation admits one critical stable point, $(1, \ldots, 1)$, and one critical unstable point, $(0, \ldots, 0)$. In addition we prove that the trajectory decreases along $f$ in the sense $\frac{\mathrm{d} f(p^t)}{\mathrm{d} t} \le 0$. To do so we establish the following result:
Lemma 17. On $f(x) = c - \sum_{i=1}^d \alpha_i x_i$ the solution of (36) satisfies $\sum_{i=1}^d \alpha_i \frac{\mathrm{d} p_i}{\mathrm{d} t} \ge 0$; moreover $\sum_i \alpha_i \frac{\mathrm{d} p_i}{\mathrm{d} t} = 0$ if and only if $p = (0, \ldots, 0)$ or $p = (1, \ldots, 1)$.
Proof. We compute $\sum_{i=1}^d \alpha_i g_i(\theta^t)$ and find that
$$\sum_{i=1}^d \alpha_i \frac{\mathrm{d} p_i^t}{\mathrm{d} t} = \int_X W_f^{\theta^t}(x) \left( \sum_{i=1}^d \alpha_i x_i - \sum_{i=1}^d \alpha_i p_i^t \right) P_{\theta^t}(\mathrm{d}x)$$
$$= \int_X W_f^{\theta^t}(x)\, \left( f(p^t) - f(x) \right) P_{\theta^t}(\mathrm{d}x)$$
$$= \mathbb{E}[W_f^{\theta^t}(x)]\, \mathbb{E}[f(x)] - \mathbb{E}[W_f^{\theta^t}(x)\, f(x)].$$
In addition, $-W_f^{\theta^t}(x) = -w(\Pr(f(x') < f(x)))$ is a nondecreasing bounded function in the variable $f(x)$, so that $-W_f^{\theta^t}(x)$ and $f(x)$ are positively correlated (see [Tho00, Chapter 1] for a proof of this result), i.e.
$$\mathbb{E}[-W_f^{\theta^t}(x)\, f(x)] \ge \mathbb{E}[-W_f^{\theta^t}(x)]\, \mathbb{E}[f(x)]$$
with equality if and only if $p^t = (0, \ldots, 0)$ or $p^t = (1, \ldots, 1)$. Thus $\sum_{i=1}^d \alpha_i \frac{\mathrm{d} p_i^t}{\mathrm{d} t} \ge 0$.
The previous result implies that the positive definite function $V(p) = \sum_{i=1}^d \alpha_i - \sum_{i=1}^d \alpha_i p_i$ in $(1, \ldots, 1)$ satisfies $\dot V(p) = \nabla V(p) \cdot g(p) \le 0$ (such a function is called a Lyapunov function). Consequently $(1, \ldots, 1)$ is stable. Similarly $V(p) = \sum_{i=1}^d \alpha_i p_i$ is a Lyapunov function for $(0, \ldots, 0)$ such that $\nabla V(p) \cdot g(p) \ge 0$. Consequently $(0, \ldots, 0)$ is unstable [AO08].
We now consider on $\mathbb{R}^d$ the family of multivariate normal distributions $P_\theta = \mathcal{N}(m, \sigma^2 \mathrm{I}_d)$ with covariance matrix equal to $\sigma^2 \mathrm{I}_d$. The parameter $\theta$ thus has $d+1$ components, $\theta = (m, \sigma) \in \mathbb{R}^d \times \mathbb{R}$. The natural gradient update using this family was derived in [GSS+10]; from it we can derive the IGO differential equation, which reads:
$$\frac{\mathrm{d} m^t}{\mathrm{d} t} = \int_{\mathbb{R}^d} W_f^{\theta^t}(x)\, (x - m^t)\, p_{\mathcal{N}(m^t, (\sigma^t)^2 \mathrm{I}_d)}(x)\, \mathrm{d}x \qquad (37)$$
$$\frac{\mathrm{d} \tilde\sigma^t}{\mathrm{d} t} = \int_{\mathbb{R}^d} \frac{1}{2d} \sum_{i=1}^d \left( \left( \frac{x_i - m_i^t}{\sigma^t} \right)^2 - 1 \right) W_f^{\theta^t}(x)\, p_{\mathcal{N}(m^t, (\sigma^t)^2 \mathrm{I}_d)}(x)\, \mathrm{d}x \qquad (38)$$
where $\sigma^t$ and $\tilde\sigma^t$ are linked via $\sigma^t = \exp(\tilde\sigma^t)$, i.e. $\tilde\sigma^t = \ln(\sigma^t)$. Denoting by $\mathcal{N}$ a random vector following a centered multivariate normal distribution with identity covariance matrix, we write the gradient flow equivalently as
$$\frac{\mathrm{d} m^t}{\mathrm{d} t} = \sigma^t\, \mathbb{E}\!\left[ W_f^{\theta^t}(m^t + \sigma^t \mathcal{N})\, \mathcal{N} \right] \qquad (39)$$
$$\frac{\mathrm{d} \tilde\sigma^t}{\mathrm{d} t} = \mathbb{E}\!\left[ \frac12 \left( \frac{\|\mathcal{N}\|^2}{d} - 1 \right) W_f^{\theta^t}(m^t + \sigma^t \mathcal{N}) \right]. \qquad (40)$$
Let us analyze the solution of the previous system on linear functions. Without loss of generality (because of invariance) we can consider the linear function $f(x) = x_1$. We have
$$W_f^{\theta^t}(x) = w\!\left( \Pr(m_1^t + \sigma^t Z_1 < x_1) \right)$$
where $Z_1$ follows a standard normal distribution, and thus
$$W_f^{\theta^t}(m^t + \sigma^t \mathcal{N}) = w\!\left( \Pr_{Z_1 \sim \mathcal{N}(0,1)}(Z_1 < \mathcal{N}_1) \right) \qquad (41)$$
$$= w(\mathcal{F}(\mathcal{N}_1)) \qquad (42)$$
with $\mathcal{F}$ the cumulative distribution function of a standard normal distribution, and $\mathcal{N}_1$ the first component of $\mathcal{N}$. The differential equation thus simplifies into
$$\frac{\mathrm{d} m^t}{\mathrm{d} t} = \sigma^t \begin{pmatrix} \mathbb{E}[w(\mathcal{F}(\mathcal{N}_1))\, \mathcal{N}_1] \\ 0 \\ \vdots \\ 0 \end{pmatrix} \qquad (43)$$
$$\frac{\mathrm{d} \tilde\sigma^t}{\mathrm{d} t} = \frac{1}{2d}\, \mathbb{E}\!\left[ \left( |\mathcal{N}_1|^2 - 1 \right) w(\mathcal{F}(\mathcal{N}_1)) \right]. \qquad (44)$$
Consider, for example, the weight function associated with truncation selection, i.e. $w(q) = \mathbb{1}_{q \le q_0}$ where $q_0 \in (0, 1]$ (also called intermediate recombination). We find that
$$\frac{\mathrm{d} m_1^t}{\mathrm{d} t} = \sigma^t\, \mathbb{E}\!\left[ \mathcal{N}_1\, \mathbb{1}_{\{\mathcal{N}_1 \le \mathcal{F}^{-1}(q_0)\}} \right] =: \sigma^t \beta \qquad (45)$$
$$\frac{\mathrm{d} \tilde\sigma^t}{\mathrm{d} t} = \frac{1}{2d} \left( \int_0^{q_0} \mathcal{F}^{-1}(u)^2\, \mathrm{d}u - q_0 \right) =: \alpha. \qquad (46)$$
The solution of the IGO flow for the linear function $f(x) = x_1$ is thus given by
$$m_1^t = m_1^0 + \frac{\sigma^0 \beta}{\alpha} \left( \exp(\alpha t) - 1 \right) \qquad (47)$$
$$\sigma^t = \sigma^0 \exp(\alpha t). \qquad (48)$$
The coefficient $\beta$ is strictly negative for any $q_0 < 1$. The coefficient $\alpha$ is strictly positive if and only if $q_0 < 1/2$, which corresponds to selecting less than half of the sampled points in an ES. In this case the step size $\sigma^t$ grows exponentially fast to infinity and the mean vector moves along the gradient direction towards $-\infty$ at the same rate. If more than half of the points are selected, $q_0 > 1/2$, the step size will decrease to zero exponentially fast and the mean vector will get stuck (compare also [Han06a]).
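Using the standard Gaussian identities $\int_{-\infty}^{b} x\,\varphi(x)\,\mathrm{d}x = -\varphi(b)$ and $\int_{-\infty}^{b} x^2\,\varphi(x)\,\mathrm{d}x = q_0 - b\,\varphi(b)$ with $b = \mathcal{F}^{-1}(q_0)$, the integrals in (45)–(46) admit the closed forms $\beta = -\varphi(b)$ and $\alpha = -b\,\varphi(b)/(2d)$. The following sketch (our illustration) evaluates them and confirms the sign discussion above:

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def norm_quantile(q):
    """F^{-1}(q) for the standard normal cdf F, by bisection."""
    cdf = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    lo, hi = -10.0, 10.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if cdf(mid) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def alpha_beta(q0, d):
    """Closed forms of (45)-(46) with b = F^{-1}(q0):
    beta = -phi(b), alpha = -b * phi(b) / (2 * d)."""
    b = norm_quantile(q0)
    return -b * phi(b) / (2.0 * d), -phi(b)

for q0 in (0.25, 0.5, 0.75):
    alpha, beta = alpha_beta(q0, d=10)
    print(q0, round(alpha, 4), round(beta, 4))
```

As expected, $\beta < 0$ throughout, while $\alpha$ changes sign exactly at $q_0 = 1/2$.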
5 Multimodal optimization using restricted Boltzmann machines
We now illustrate experimentally the influence of the natural gradient, versus the vanilla gradient, on diversity over the course of optimization. We consider a very simple situation of a fitness function with two distant optima and test whether different algorithms are able to reach both optima simultaneously or only converge to one of them. This provides a practical test of Proposition 1, which states that the natural gradient minimizes loss of diversity.
The IGO method allows one to build a natural search algorithm from an arbitrary probability distribution on an arbitrary search space. In particular, by choosing families of probability distributions that are richer than Gaussian or Bernoulli, one may hope to be able to optimize functions with complex shapes. Here we study how this might help optimize multimodal functions.
Both Gaussian distributions on $\mathbb{R}^d$ and Bernoulli distributions on $\{0,1\}^d$ are unimodal. So at any given time, a search algorithm using such distributions concentrates around a given point in the search space, looking around that point (with some variance). It is an often-sought-after feature for an optimization algorithm to handle multiple hypotheses simultaneously.
In this section we apply our method to a multimodal distribution on $\{0,1\}^d$: restricted Boltzmann machines (RBMs). Depending on the activation state of the latent variables, values for various blocks of bits can be switched on or off, hence multimodality. So the optimization algorithm derived from these distributions will, hopefully, explore several distant zones of the search space at any given time. A related model (Boltzmann machines) was used in [Ber02] and was found to perform better than PBIL on some functions.
Our study of a bimodal RBM distribution for the optimization of a bimodal function confirms that the natural gradient does indeed behave in a more natural way than the vanilla gradient: when initialized properly, the natural gradient is able to maintain diversity by fully using the RBM distribution to learn the two modes, while the vanilla gradient only converges to one of the two modes.
Although these experiments support using a natural gradient approach, they also establish that complications can arise for estimating the inverse Fisher matrix in the case of complex distributions such as RBMs: estimation errors may lead to a singular or unreliable estimation of the Fisher matrix, especially when the distribution becomes singular. Further research is needed for a better understanding of this issue.
[Figure 2 here: bipartite graph between the observed units x and the latent units h.]

Figure 2: The RBM architecture with the observed (x) and latent (h) variables. In our experiments, a single hidden unit was used.
The experiments reported here, and the fitness function used, are extremely simple from the optimization viewpoint: both algorithms, using the natural and vanilla gradient respectively, find an optimum in only a few steps. The emphasis here is on the specific influence of replacing the vanilla gradient with the natural gradient, and the resulting influence on diversity and multimodality, in a simple situation.
5.1 IGO for restricted Boltzmann machines
Restricted Boltzmann machines. A restricted Boltzmann machine (RBM) [Smo86, AHS85] is a kind of probability distribution belonging to the family of undirected graphical models (also known as Markov random fields). A set of observed variables $\mathbf{x} \in \{0,1\}^{n_x}$ are given a probability using their joint distribution with unobserved latent variables $\mathbf{h} \in \{0,1\}^{n_h}$ [Gha04]. The latent variables are then marginalized over. See Figure 2 for the graph structure of an RBM.
The probability associated with an observation $\mathbf{x}$ and latent variable $\mathbf{h}$ is given by
$$P_\theta(\mathbf{x}, \mathbf{h}) = \frac{e^{-E(\mathbf{x}, \mathbf{h})}}{\sum_{\mathbf{x}', \mathbf{h}'} e^{-E(\mathbf{x}', \mathbf{h}')}}, \qquad P_\theta(\mathbf{x}) = \sum_{\mathbf{h}} P_\theta(\mathbf{x}, \mathbf{h}), \qquad (49)$$
where $E(\mathbf{x}, \mathbf{h})$ is the so-called energy function and $\sum_{\mathbf{x}', \mathbf{h}'} e^{-E(\mathbf{x}', \mathbf{h}')}$ is the partition function, denoted $Z$ in Section 2.3. The energy $E$ is defined by
$$E(\mathbf{x}, \mathbf{h}) = -\sum_i a_i x_i - \sum_j b_j h_j - \sum_{i,j} w_{ij} x_i h_j. \qquad (50)$$
The distribution is fully parametrized by the bias on the observed variables $\mathbf{a}$, the bias on the latent variables $\mathbf{b}$, and the weights $\mathbf{W}$, which account for pairwise interactions between observed and latent variables: $\theta = (\mathbf{a}, \mathbf{b}, \mathbf{W})$.
RBM distributions are a special case of exponential family distributions with latent variables (see (21) in Section 2.3). The RBM IGO equations stem from Equations (16), (17) and (18) by identifying the statistics $T(x)$ with $x_i$, $h_j$ or $x_i h_j$.
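For a tiny RBM, the probabilities in (49)–(50) can be computed exactly by enumeration. The following Python sketch (our illustration; the particular weight value 4 is arbitrary) uses the bias choice $a_i = -\sum_j w_{ij}/2$, $b_j = -\sum_i w_{ij}/2$, which reappears in the initialization of Section 5.2 and makes this RBM bimodal, with equally probable modes at $(\mathbf{0}, 0)$ and $(\mathbf{1}, 1)$:

```python
import itertools
import numpy as np

def rbm_probs(a, b, W):
    """Exact P_theta(x, h) for a tiny RBM, enumerating all states of (49)-(50)."""
    nx, nh = len(a), len(b)
    states = [(np.array(x), np.array(h))
              for x in itertools.product([0, 1], repeat=nx)
              for h in itertools.product([0, 1], repeat=nh)]
    energies = np.array([-(a @ x + b @ h + x @ W @ h) for x, h in states])
    p = np.exp(-energies)
    return states, p / p.sum()          # divide by the partition function Z

W = 4.0 * np.ones((3, 1))               # 3 visible units, 1 hidden unit
a, b = -W.sum(axis=1) / 2, -W.sum(axis=0) / 2
states, p = rbm_probs(a, b, W)
best = np.argsort(-p)[:2]               # the two most probable configurations
for k in best:
    x, h = states[k]
    print(x, h, round(p[k], 3))
```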
For these distributions, the gradient of the log-likelihood is well known [Hin02]. Although it is often considered intractable in the context of machine learning, where a large number of variables is required, it becomes tractable for smaller RBMs. The gradient of the log-likelihood consists of the difference of two expectations (compare (16)):
$$\frac{\partial \ln P_\theta(x)}{\partial w_{ij}} = \mathbb{E}_{P_\theta}[x_i h_j \mid x] - \mathbb{E}_{P_\theta}[x_i h_j]. \qquad (51)$$
The Fisher information matrix is given by (20):
$$I_{w_{ij} w_{kl}}(\theta) = \mathbb{E}[x_i h_j x_k h_l] - \mathbb{E}[x_i h_j]\, \mathbb{E}[x_k h_l] \qquad (52)$$
where $I_{w_{ij} w_{kl}}$ denotes the entry of the Fisher matrix corresponding to the components $w_{ij}$ and $w_{kl}$ of the parameter $\theta$. These equations are understood to encompass the biases $\mathbf{a}$ and $\mathbf{b}$ by noticing that the biases can be replaced in the model by adding two variables $x_0$ and $h_0$ always equal to one.
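Since an RBM is an exponential family, (52) says that the Fisher matrix is the covariance matrix of the sufficient statistics $x_i h_j$ (with constant units standing in for the biases). A minimal Monte Carlo estimator is sketched below (our illustration; the samples come from the trivial RBM with all parameters zero, i.e. independent Bernoulli(1/2) variables):

```python
import numpy as np

rng = np.random.default_rng(4)

def fisher_from_samples(X, H):
    """Monte Carlo estimate of (52): the covariance matrix of the sufficient
    statistics x_i * h_j. Biases are handled via the constant units
    x_0 = h_0 = 1; the constant statistic x_0 * h_0 itself is dropped."""
    X1 = np.hstack([np.ones((len(X), 1)), X])
    H1 = np.hstack([np.ones((len(H), 1)), H])
    T = np.einsum('ni,nj->nij', X1, H1).reshape(len(X), -1)[:, 1:]
    return np.cov(T, rowvar=False)

X = (rng.random((5000, 3)) < 0.5).astype(float)
H = (rng.random((5000, 1)) < 0.5).astype(float)
F = fisher_from_samples(X, H)
print(F.shape)  # (7, 7): n_x + n_h + n_x * n_h = 3 + 1 + 3 parameters
```

In the IGO algorithm below, `X` and `H` would instead be drawn jointly from the current model $P_\theta$ by Gibbs sampling.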
Finally, the IGO update rule is taken from (14):
$$\theta^{t+\delta t} = \theta^t + \delta t\, I^{-1}(\theta^t) \sum_{i=1}^N \hat w_i \left. \frac{\partial \ln P_\theta(\mathbf{x}_i)}{\partial \theta} \right|_{\theta = \theta^t}. \qquad (53)$$
Implementation. In this experimental study, the IGO algorithm for RBMs is directly implemented from Equation (53). At each optimization step, the algorithm consists in (1) finding a reliable estimate of the Fisher matrix (see Eq. 52), which is then inverted using the QR algorithm if it is invertible; (2) computing an estimate of the vanilla gradient, which is then weighted according to $W_f^\theta$; and (3) updating the parameters, taking into account the gradient step size $\delta t$. In order to estimate both the Fisher matrix and the gradient, samples must be drawn from the model $P_\theta$. This is done using Gibbs sampling (see for instance [Hin02]).
Fisher matrix imprecision. The imprecision incurred by the limited sampling size may sometimes lead to a singular estimation of the Fisher matrix (see p. 11 for a lower bound on the number of samples needed). Although having a singular Fisher estimation happens rarely in normal conditions, it occurs with certainty when the probabilities become too concentrated on a few points. This situation arises naturally when the algorithm is allowed to continue the optimization after the optimum has been reached. For this reason, in our experiments, we stop the optimization as soon as both optima have been sampled, thus preventing $P_\theta$ from becoming too concentrated.
In a variant of the same problem, the Fisher estimation can become close to singular while still being numerically invertible. This situation leads to unreliable estimates and should therefore be avoided. To evaluate the reliability of the inverse Fisher matrix estimate, we use a cross-validation method: (1) making two estimates $F_1$ and $F_2$ of the Fisher matrix on two distinct sets of points generated from $P_\theta$, and (2) making sure that the eigenvalues of $F_1 F_2^{-1}$ are close to 1. In practice, at all steps we check that the average of the eigenvalues of $F_1 F_2^{-1}$ is between 1/2 and 2. If at some point during the optimization the Fisher estimation becomes singular or unreliable according to this criterion, the corresponding run is stopped and considered to have failed.
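A sketch of this reliability criterion (our illustration, with plain covariance matrices standing in for the two Fisher estimates):

```python
import numpy as np

rng = np.random.default_rng(5)

def fisher_reliable(F1, F2, lo=0.5, hi=2.0):
    """Cross-validation criterion from the text: accept the estimate when the
    average eigenvalue of F1 @ inv(F2) lies in [1/2, 2]."""
    avg = np.mean(np.linalg.eigvals(F1 @ np.linalg.inv(F2)).real)
    return lo <= avg <= hi

# two estimates of the same matrix from disjoint sample sets
samples = rng.standard_normal((4000, 3))
F1 = np.cov(samples[:2000], rowvar=False)
F2 = np.cov(samples[2000:], rowvar=False)
print(fisher_reliable(F1, F2))         # consistent estimates pass the check
print(fisher_reliable(F1, 100 * F2))   # a badly scaled estimate fails
```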
When optimizing on a larger space helps. A restricted Boltzmann machine naturally defines a distribution $P_\theta$ on both visible and hidden units $(\mathbf{x}, \mathbf{h})$, whereas the function to optimize depends only on the visible units $\mathbf{x}$. Thus we are faced with a choice. A first possibility is to decide that the objective function $f(\mathbf{x})$ is really a function of $(\mathbf{x}, \mathbf{h})$ where $\mathbf{h}$ is a dummy variable; then we can use the IGO algorithm to optimize over $(\mathbf{x}, \mathbf{h})$ using the distributions $P_\theta(\mathbf{x}, \mathbf{h})$. A second possibility is to marginalize $P_\theta(\mathbf{x}, \mathbf{h})$ over the hidden units $\mathbf{h}$ as in (49), to define the distribution $P_\theta(\mathbf{x})$; then we can use the IGO algorithm to optimize over $\mathbf{x}$ using $P_\theta(\mathbf{x})$.
These two approaches yield slightly different algorithms. Both were tested and found to be viable. However, the first approach is numerically more stable and requires fewer samples to estimate the Fisher matrix. Indeed, if $I_1(\theta)$ is the Fisher matrix at $\theta$ in the first approach and $I_2(\theta)$ in the second approach, we always have $I_1(\theta) \succeq I_2(\theta)$ (in the sense of positive-definite matrices). This is because probability distributions on the pair $(\mathbf{x}, \mathbf{h})$ carry more information than their projections on $\mathbf{x}$ only, and so the computed Kullback–Leibler distances will be larger.
In particular, there are (isolated) values of $\theta$ for which the Fisher matrix $I_2$ is not invertible whereas $I_1$ is. For this reason, we selected the first approach.
5.2 Experimental results
In our experiments, we look at the optimization of the two-min function defined below with a bimodal RBM: an RBM with only one latent variable. Such an RBM is bimodal because it has two possible configurations of the latent variable, $\mathbf{h} = 0$ or $\mathbf{h} = 1$, and given $\mathbf{h}$, the observed variables are independent and distributed according to two Bernoulli distributions.
Set a parameter $\mathbf{y} \in \{0,1\}^d$. The two-min function is defined as follows:
$$f_{\mathbf{y}}(\mathbf{x}) = \min\left( \sum_i |x_i - y_i|\,,\ \sum_i |(1 - x_i) - y_i| \right). \qquad (54)$$
This function of $\mathbf{x}$ has two optima: one at $\mathbf{y}$, the other at its binary complement $\bar{\mathbf{y}}$.
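For concreteness, the two-min function (54) reads, in Python (our illustration):

```python
import numpy as np

def two_min(x, y):
    """The two-min fitness (54): distance to y or to its binary complement,
    whichever is smaller (to be minimized)."""
    return min(np.abs(x - y).sum(), np.abs((1 - x) - y).sum())

y = np.array([1, 0, 1, 1, 0])
print(two_min(y, y), two_min(1 - y, y))       # both optima have fitness 0
print(two_min(np.array([1, 1, 1, 1, 0]), y))  # one bit away from y
```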
For the quantile rewriting of $f$ (Section 1.2), we chose the function $w$ to be $w(q) = \mathbb{1}_{q \le 1/2}$, so that points which are below the median are given the weight 1, whereas other points are given the weight 0. Also, in accordance with (3), if several points have the same fitness value, their weight $W_f^\theta$ is set to the average of $w$ over all those points.
For initialization of the RBMs, the weights $\mathbf{W}$ are sampled from a normal distribution centered around zero and of standard deviation $1/\sqrt{n_x \times n_h}$, where $n_x$ is the number of observed variables (dimension $d$ of the problem) and $n_h$ is the number of latent variables ($n_h = 1$ in our case), so that initially the energies $E$ are not too large. Then the bias parameters are initialized as $a_i \leftarrow -\sum_j \frac{w_{ij}}{2}$ and $b_j \leftarrow -\sum_i \frac{w_{ij}}{2} + \mathcal{N}(0, 0.01^2)$, so that each variable (observed or latent) has a probability of activation close to 1/2.

In the following experiments, we show the results of IGO optimization and vanilla gradient optimization for the two-min function in dimension 40, for various values of the step size $\delta t$. For each $\delta t$, we present the median of the quantity of interest over 100 runs. Error bars indicate the 16th percentile and the 84th percentile (this is the same as mean ± stddev for a Gaussian variable, but is invariant under $f$-reparametrization). For each run, the parameter $\mathbf{y}$ of the two-min function is sampled randomly, in order to ensure that the presented results do not depend on a particular choice of optima.
The number of sample points used for estimating the Fisher matrix is set to 10,000: large enough (by far) to ensure the stability of the estimates. The same points are used for estimating the integral of (14); therefore there are 10,000 calls to the fitness function at each gradient step. These rather comfortable settings allow for a good illustration of the theoretical properties of the $N = \infty$ IGO flow limit.
The number of runs that fail after the occurrence of a singular matrix or an unreliable estimate amounts to less than 10% for $\delta t \le 2$ (as little as 3% for the smallest learning rate), but can increase up to 30% for higher learning rates.
5.2.1 Convergence
We first check that both vanilla and natural gradient are able to converge to an optimum.
Figures 3 and 4 show the fitness of the best sampled point for the IGO algorithm and for the vanilla gradient at each step. Predictably, both algorithms are able to optimize the two-min function in a few steps.
The two-min function is extremely simple from an optimization viewpoint; thus, convergence speed is not the main focus here, all the more since we use a large number of $f$-calls at each step.
Note that the values of the parameter $\delta t$ for the two gradients used are not directly comparable from a theoretical viewpoint (they correspond to parametrizations of different trajectories in $\Theta$-space, and identifying vanilla $\delta t$ with natural $\delta t$ is meaningless). We selected larger values of $\delta t$ for the vanilla gradient, in order to obtain roughly comparable convergence speeds in practice.
5.2.2 Diversity
As we have seen, the two-min function is equally well optimized by the IGO and vanilla gradient optimization. However, the methods fare very differently when we look at the distance from the sample points to both optima. From (54), the fitness gives the distance of sample points to the closest optimum. We now study how close sample points come to the other, opposite optimum. The distance of sample points to the second optimum is shown in Figure 5 for IGO, and in Figure 6 for the vanilla gradient.
Figure 5 shows that IGO also reaches the second optimum most of the time, and is often able to find it only a few steps after the first. This property of IGO is of course dependent on the initialization of the RBM with enough diversity. When initialized properly, so that each variable (observed and latent) has a probability 1/2 of being equal to 1, the initial RBM distribution has maximal diversity over the search space and is at equal distance from the two optima of the function. From this starting position, the natural gradient is then able to increase the likelihood of the two optima at the same time.
By stark contrast, the vanilla gradient is not able to go towards both optima at the same time, as shown in Fig. 6. In fact, the vanilla gradient only converges to one optimum at the expense of the other. For all values of $\delta t$, the distance to the second optimum increases gradually and approaches the maximum possible distance.
As mentioned earlier, each state of the latent variable $\mathbf{h}$ corresponds to a mode of the distribution. In Figures 7 and 8, we look at the average value of $\mathbf{h}$ for each gradient step. An average value close to 1/2 means that the distribution samples from both modes, $\mathbf{h} = 0$ or $\mathbf{h} = 1$, with comparable probability. Conversely, average values close to 0 or 1 indicate that the distribution gives most probability to one mode at the expense of the other. In Figure 7, we can see that with IGO, the average value of $\mathbf{h}$ is close to 1/2 during the whole optimization procedure for most runs: the distribution is initialized with two modes and stays bimodal^7. As for the vanilla gradient,
^7 In Fig. 7, we use the following adverse setting: runs are interrupted once they reach both optima; therefore the statistics are taken only over those runs which have not yet converged and reached both optima, which results in higher variation around 1/2. The
[Figure 3 here: fitness (0 to 12) versus gradient steps (0 to 50), one curve per $\delta t \in \{8, 4, 2, 1, 0.5, 0.25\}$.]

Figure 3: Fitness of sampled points during IGO optimization.
[Figure 4 here: fitness (0 to 12) versus gradient steps (0 to 50), one curve per $\delta t \in \{32, 16, 8, 4, 2, 1\}$.]

Figure 4: Fitness of sampled points during vanilla gradient optimization.
[Figure 5 here: distance to the second optimum (0 to 40) versus gradient steps (0 to 50), one curve per $\delta t \in \{8, 4, 2, 1, 0.5, 0.25\}$.]

Figure 5: Distance to the second optimum during IGO optimization.
[Figure 6 here: distance to the second optimum (0 to 40) versus gradient steps (0 to 50), one curve per $\delta t \in \{32, 16, 8, 4, 2, 1\}$.]

Figure 6: Distance to the second optimum during vanilla gradient optimization.
the statistics for $\mathbf{h}$ are depicted in Figure 8, and we can see that they converge to 1: one of the two modes of the distribution has been lost during optimization.
Hidden breach of symmetry by the vanilla gradient. The experiments reveal a curious phenomenon: the vanilla gradient loses multimodality by always setting the hidden variable $h$ to 1, not to 0. (We detected no obvious asymmetry on the visible units $x$, though.)
Of course, exchanging the values 0 and 1 for the hidden variables in a restricted Boltzmann machine still gives the distribution of another Boltzmann machine. More precisely, changing $h_j$ into $1 - h_j$ is equivalent to resetting $a_i \leftarrow a_i + w_{ij}$, $b_j \leftarrow -b_j$, and $w_{ij} \leftarrow -w_{ij}$. IGO and the natural gradient are impervious to such a change by Proposition 9.
The vanilla gradient implicitly relies on the Euclidean norm on parameter space, as explained in Section 1.1. For this norm, the distance between the RBM distributions $(a_i, b_j, w_{ij})$ and $(a'_i, b'_j, w'_{ij})$ is simply $\sum_i |a_i - a'_i|^2 + \sum_j |b_j - b'_j|^2 + \sum_{ij} |w_{ij} - w'_{ij}|^2$. However, the change of variables $a_i \leftarrow a_i + w_{ij}$, $b_j \leftarrow -b_j$, $w_{ij} \leftarrow -w_{ij}$ does not preserve this Euclidean metric. Thus, exchanging 0 and 1 for the hidden variables will result in two different vanilla gradient ascents. The observed asymmetry on $h$ is a consequence of this implicit asymmetry.
The same asymmetry exists for the visible variables $x_i$; but this does not prevent convergence to an optimum in our situation, since any gradient descent eventually reaches some local optimum.
Of course it is possible to cook up parametrizations for which the vanilla gradient will be more symmetric: for instance, using $-1/1$ instead of $0/1$ for the variables, or defining the energy by
$$E(\mathbf{x}, \mathbf{h}) = -\sum_i A_i \left( x_i - \tfrac12 \right) - \sum_j B_j \left( h_j - \tfrac12 \right) - \sum_{i,j} W_{ij} \left( x_i - \tfrac12 \right)\left( h_j - \tfrac12 \right) \qquad (55)$$
with "bias-free" parameters $A_i$, $B_j$, $W_{ij}$ related to the usual parametrization by $w_{ij} = W_{ij}$, $a_i = A_i - \frac12 \sum_j w_{ij}$, $b_j = B_j - \frac12 \sum_i w_{ij}$. The vanilla gradient might perform better in these parametrizations.
However, we chose a naive approach: we used a family of probability distributions found in the literature, with the parametrization found in the literature. We then used the vanilla gradient and the natural gradient on these distributions. This directly illustrates the specific influence of the chosen gradient (the two implementations only differ by the inclusion of the Fisher matrix). It is remarkable, we think, that the natural gradient is able to recover symmetry where there was none.
5.3 Convergence to the continuous-time limit
In the previous figures, it looks like changing the parameter $\delta t$ only results in a time speedup of the plots.
Update rules of the type $\theta \leftarrow \theta + \delta t\, \nabla_\theta g$ (for either gradient) are Euler approximations of the continuous-time ordinary differential equation $\frac{\mathrm{d}\theta}{\mathrm{d}t} = \nabla_\theta g$, with each iteration corresponding to an increment $\delta t$ of the time $t$. Thus, it is expected that for small enough $\delta t$, the algorithm after $k$ steps approximates the IGO flow or vanilla gradient flow at time $t = k \cdot \delta t$.
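This Euler picture can be illustrated on a one-dimensional toy flow (our example, unrelated to the RBM experiments): as $\delta t$ decreases with $T = k\,\delta t$ fixed, the iterates approach the exact solution of the ODE:

```python
import math

def euler(grad, theta0, T, dt):
    """Euler scheme theta <- theta + dt * grad(theta), run until time T."""
    theta = theta0
    for _ in range(int(round(T / dt))):
        theta = theta + dt * grad(theta)
    return theta

grad = lambda th: -th     # toy flow d(theta)/dt = -theta, solution theta0 * exp(-t)
exact = math.exp(-2.0)
for dt in (0.5, 0.1, 0.02):
    approx = euler(grad, 1.0, T=2.0, dt=dt)
    print(dt, approx, abs(approx - exact))
```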
Figures 9 and 10 illustrate this convergence: we show the fitness w.r.t. δt times the number of gradient steps. An asymptotic trajectory seems to
(Footnote continued from the previous page: the plot has been stopped when less than half the runs remain. The error bars are relative only to the remaining runs.)
[Plot: average value of h (0.0 to 1.0) vs. number of gradient steps (0 to 50), one curve per δt ∈ {8, 4, 2, 1, 0.5, 0.25}.]

Figure 7: Average value of h during IGO optimization.
[Plot: average value of h (0.0 to 1.0) vs. number of gradient steps (0 to 50), one curve per δt ∈ {32, 16, 8, 4, 2, 1}.]

Figure 8: Average value of h during vanilla gradient optimization.
emerge when δt decreases. For the natural gradient, it can be interpreted as the fitness of samples of the continuous-time IGO flow.
Thus, for this kind of optimization algorithm, it makes theoretical sense to plot the results according to the "intrinsic time" of the underlying continuous-time object, to illustrate properties that do not depend on the setting of the parameter δt. (Still, the raw number of steps is more directly related to algorithmic cost.)
6 Further discussion

A single framework for optimization on arbitrary spaces. A strength of the IGO viewpoint is to automatically provide optimization algorithms using any family of probability distributions on any given space, discrete or continuous. This has been illustrated with restricted Boltzmann machines. IGO algorithms also feature good invariance properties and keep the number of arbitrary choices to a minimum.
In particular, IGO unifies several well-known optimization algorithms into a single framework. For instance, to the best of our knowledge, PBIL has never been described as a natural gradient ascent in the literature⁸.
For Gaussian measures, algorithms of the same form (14) had beendeveloped previously [HO01, WSPS08] and their close relationship with anatural gradient ascent had been recognized [ANOK10, GSS+10].
The wide applicability of natural gradient approaches seems not to bewidely known in the optimization community, though see [MMS08].
About quantiles. The IGO flow, to the best of our knowledge, has not been defined before. The introduction of the quantile-rewriting (3) of the objective function provides the first rigorous derivation of quantile- or rank-based natural optimization from a gradient ascent in θ-space.
Indeed, NES and CMA-ES have been claimed to maximize \(-E_{P_\theta} f\) via natural gradient ascent [WSPS08, ANOK10]. However, we have proved that when the number of samples is large and the step size is small, the NES and CMA-ES updates converge to the IGO flow, not to the similar flow with the gradient of \(E_{P_\theta} f\) (Theorem 4). So we find that in reality these algorithms maximize \(E_{P_\theta} W^f_{\theta^t}\), where \(W^f_{\theta^t}\) is a decreasing transformation of the f-quantiles under the current sample distribution.

Also in practice, maximizing \(-E_{P_\theta} f\) is a rather unstable procedure and has been discouraged, see for example [Whi89].
About the choice of P_θ: learning a model of good points. The choice of the family of probability distributions P_θ plays a double role.

First, it is analogous to a mutation operator as seen in evolutionary algorithms: indeed, P_θ encodes possible moves according to which new sample points are explored.

Second, optimization algorithms using distributions can be interpreted as learning a probabilistic model of where the points with good values lie in the search space. With this point of view, P_θ describes the richness of this model: for instance, restricted Boltzmann machines with h hidden units can describe distributions with up to \(2^h\) modes, whereas the Bernoulli distribution used in PBIL is unimodal. This influences, for instance, the ability to explore several valleys and optimize multimodal functions in a single run.
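As a concrete illustration of this modeling viewpoint, the following Python sketch (ours; the objective, sample size, and truncation weights are illustrative choices, not the paper's exact setup) implements the IGO update for independent Bernoulli distributions in the expectation parametrization, which yields a PBIL-like rule p_j ← p_j + δt ∑_i ŵ_i (x_{i,j} − p_j).

```python
# Minimal sketch (ours) of the IGO update for independent Bernoulli
# distributions in the expectation parametrization, giving the PBIL-like
# rule  p_j <- p_j + dt * sum_i w_i * (x_ij - p_j).
import random

def f(x):                                # minimize f = -(number of ones)
    return -sum(x)

def igo_bernoulli_step(p, n_samples, dt, rng):
    xs = [[1 if rng.random() < pj else 0 for pj in p] for _ in range(n_samples)]
    xs.sort(key=f)                       # best points (smallest f) first
    mu = n_samples // 2                  # truncation selection: w(u) = 1{u <= 1/2}
    ws = [1.0 / mu] * mu + [0.0] * (n_samples - mu)
    return [min(1.0, max(0.0, pj + dt * sum(wi * (x[j] - pj)
                                            for wi, x in zip(ws, xs))))
            for j, pj in enumerate(p)]

rng = random.Random(0)
p = [0.5] * 10
for _ in range(200):
    p = igo_bernoulli_step(p, 20, 0.2, rng)
print(sum(p) / len(p))                   # drifts toward 1 on this objective
```

With a richer family (such as an RBM) the same update structure applies, but the model can then place mass on several separated modes at once.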
⁸Thanks to Jonathan Shapiro for an early argument confirming this property (personal communication).
[Plot: fitness (0 to 10) vs. gradient steps × δt (0 to 30), one curve per δt ∈ {8, 4, 2, 1, 0.5, 0.25}.]

Figure 9: Fitness of sampled points w.r.t. "intrinsic time" during IGO optimization.
[Plot: fitness (0 to 10) vs. gradient steps × δt/4 (0 to 30), one curve per δt ∈ {32, 16, 8, 4, 2, 1}.]

Figure 10: Fitness of sampled points w.r.t. "intrinsic time" during vanilla gradient optimization.
Natural gradient and parametrization invariance. Central to IGO is the use of the natural gradient, which follows from θ-invariance and makes sense on any search space, discrete or continuous.
While the IGO flow is exactly θ-invariant, for any practical implementation of an IGO algorithm, a parametrization choice has to be made. Still, since all IGO algorithms approximate the IGO flow, two parametrizations of IGO will differ less than two parametrizations of another algorithm (such as the vanilla gradient or the smoothed CEM method), at least if the learning rate δt is not too large.
On the other hand, natural evolution strategies have never strived for θ-invariance: the chosen parametrization (Cholesky, exponential) has been deemed a relevant feature. In the IGO framework, the chosen parametrization becomes more relevant as the step size δt increases.
IGO, maximum likelihood and cross-entropy. The cross-entropy method (CEM) [dBKMR05] can be used to produce optimization algorithms given a family of probability distributions on an arbitrary space, by performing a jump to a maximum likelihood estimate of the parameters.
We have seen (Corollary 16) that the standard CEM is an IGO algorithm in a particular parametrization, with a learning rate δt equal to 1. However, it is well-known, both theoretically and experimentally [BLS07, Han06b, WAS04], that standard CEM loses diversity too fast in many situations. The usual solution [dBKMR05] is to reduce the learning rate (smoothed CEM, (25)), but this breaks the reparametrization invariance.
On the other hand, the IGO flow can be seen as a maximum likelihood update with infinitesimal learning rate (Theorem 13). This interpretation makes it possible to define a particular IGO algorithm, the IGO-ML (Definition 14): it performs a maximum likelihood update with an arbitrary learning rate, and keeps the reparametrization invariance. It coincides with CEM when the learning rate is set to 1, but differs from smoothed CEM by the exchange of the order of argmax and averaging (compare (23) and (25)). We argue that this new algorithm may be a better way to reduce the learning rate and achieve smoothing in CEM.
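The difference between the two smoothing schemes can be made concrete on a one-dimensional Gaussian with a given elite sample. In the sketch below (ours; a simplified caricature assuming equal weights on the elite), smoothed CEM interpolates the parameters (mean, variance) after the maximum likelihood jump, while IGO-ML interpolates the expectation parameters (mean and second moment) before converting back; the resulting variance is larger, reflecting the slower diversity loss discussed here.

```python
# Caricature (ours) of the argmax/averaging exchange on a 1-D Gaussian with
# equal weights on the elite sample. Smoothed CEM averages the *parameters*
# (mean, variance) after the maximum-likelihood jump; IGO-ML averages the
# *log-likelihoods*, i.e. it interpolates the expectation parameters
# (mean, second moment) and only then converts back to a variance.
def smoothed_cem(mu, var, elite, dt):
    n = len(elite)
    mu_ml = sum(elite) / n
    var_ml = sum((x - mu_ml) ** 2 for x in elite) / n
    return (1 - dt) * mu + dt * mu_ml, (1 - dt) * var + dt * var_ml

def igo_ml(mu, var, elite, dt):
    n = len(elite)
    m1 = (1 - dt) * mu + dt * sum(elite) / n                              # E[x]
    m2 = (1 - dt) * (var + mu ** 2) + dt * sum(x * x for x in elite) / n  # E[x^2]
    return m1, m2 - m1 ** 2

elite = [2.0, 2.5, 3.0]
cem_mu, cem_var = smoothed_cem(0.0, 1.0, elite, 0.5)
igo_mu, igo_var = igo_ml(0.0, 1.0, elite, 0.5)
print(cem_var, igo_var)   # same means, but IGO-ML keeps a larger variance
```

The extra variance in the IGO-ML update accounts for the shift of the mean, instead of contracting straight onto the elite sample.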
Standard CEM can lose diversity, yet is a particular case of an IGO algorithm: this illustrates the fact that reasonable values of the learning rate δt depend on the parametrization. We have studied this phenomenon in detail for various Gaussian IGO algorithms (Section 4.2).
Why would a smaller learning rate perform better than a large one inan optimization setting? It might seem more efficient to jump directly tothe maximum likelihood estimate of currently known good points, instead ofperforming a slow gradient ascent towards this maximum.
However, optimization faces a "moving target", contrary to a learning setting in which the example distribution is often stationary. Currently known good points are likely not to indicate the position of the optimum, but rather the direction in which the optimum is to be found. After an update, the next elite sample points are going to be located somewhere new. So the goal is certainly not to settle down around these currently known points, as a maximum likelihood update does: by design, CEM only tries to reflect the status quo (even for N = ∞), whereas IGO tries to move somewhere. When the target moves over time, a progressive gradient ascent is more reasonable than an immediate jump to a temporary optimum, and realizes a kind of time smoothing.
This phenomenon is most clear when the number of sample points is small.
Then, a full maximum likelihood update risks losing a lot of diversity; it may even produce a degenerate distribution if the number of sample points is smaller than the number of parameters of the distribution. On the other hand, for smaller δt, the IGO algorithms do, by design, try to maintain diversity by moving as little as possible from the current distribution P_θ in Kullback–Leibler divergence. A full ML update disregards the current distribution and tries to move as close as possible to the elite sample in Kullback–Leibler divergence [dBKMR05], thus realizing maximal diversity loss. This makes sense in a non-iterated scenario but is unsuited for optimization.
Diversity and multiple optima. The IGO framework emphasizes therelation of natural gradient and diversity: we argued that IGO providesminimal diversity change for a given objective function increment. In partic-ular, provided the initial diversity is large, diversity is kept at a maximum.This theoretical relationship has been confirmed experimentally for restrictedBoltzmann machines.
On the other hand, using the vanilla gradient does not lead to a balanced distribution between the two optima in our experiments. Using the vanilla gradient introduces hidden arbitrary choices between those points (more exactly, between moves in Θ-space). This results in a loss of diversity, and might also be detrimental at later stages in the optimization. This may reflect the fact that the Euclidean metric on the space of parameters, implicitly used in the vanilla gradient, becomes less and less meaningful for gradient descent on complex distributions.
IGO and the natural gradient are certainly relevant to the well-knownproblem of exploration-exploitation balance: as we have seen, arguably thenatural gradient realizes the best increment of the objective function withthe least possible change of diversity in the population.
More generally, IGO tries to learn a model of where the good pointsare, representing all good points seen so far rather than focusing only onsome good points; this is typical of machine learning, one of the contexts forwhich the natural gradient was studied. The conceptual relationship of IGOand IGO-like optimization algorithms with machine learning is still to beexplored and exploited.
Summary and conclusion

We sum up:
• The information-geometric optimization (IGO) framework derives from invariance principles and makes it possible to build optimization algorithms from any family of distributions on any search space. In some instances (Gaussian distributions on \(\mathbb{R}^d\) or Bernoulli distributions on \(\{0,1\}^d\)) it recovers versions of known algorithms (CMA-ES or PBIL); in other instances (restricted Boltzmann machine distributions) it produces new, hopefully efficient optimization algorithms.
• The use of a quantile-based, time-dependent transform of the objective function (Equation (3)) provides a rigorous derivation of rank-based update rules currently used in optimization algorithms. Theorem 4 uniquely identifies the infinite-population limit of these update rules.
• The IGO flow is singled out by its equivalent description as an infinitesimal weighted maximal log-likelihood update (Theorem 13). In
a particular parametrization and with a step size of 1, it recovers thecross-entropy method (Corollary 16).
• Theoretical arguments suggest that the IGO flow minimizes the change of diversity in the course of optimization. In particular, starting with high diversity and using multimodal distributions may allow simultaneous exploration of multiple optima of the objective function. Preliminary experiments with restricted Boltzmann machines confirm this effect in a simple situation.
Thus, the IGO framework is an attempt to provide sound theoreticalfoundations to optimization algorithms based on probability distributions.In particular, this viewpoint helps to bridge the gap between continuous anddiscrete optimization.
The invariance properties, which reduce the number of arbitrary choices,together with the relationship between natural gradient and diversity, maycontribute to a theoretical explanation of the good practical performance ofthose currently used algorithms, such as CMA-ES, which can be interpretedas instantiations of IGO.
We hope that invariance properties will acquire in computer sciencethe importance they have in mathematics, where intrinsic thinking is thefirst step for abstract linear algebra or differential geometry, and in modernphysics, where the notions of invariance w.r.t. the coordinate system andso-called gauge invariance play central roles.
Acknowledgements

The authors would like to thank Michèle Sebag for the acronym and for helpful comments. Y. O. would like to thank Cédric Villani and Bruno Sévennec for helpful discussions on the Fisher metric. A. A. and N. H. would like to acknowledge the Dagstuhl Seminar No 10361 on the Theory of Evolutionary Computation (http://www.dagstuhl.de/10361) for inspiring their work on natural gradients and beyond. This work was partially supported by the ANR-2010-COSI-002 grant (SIMINOLE) of the French National Research Agency.
Appendix: Proofs
Proof of Theorem 4 (Convergence of empirical means and quantiles)
Let us give a more precise statement including the necessary regularityconditions.
Proposition 18. Let θ ∈ Θ. Assume that the derivative \(\frac{\partial \ln P_\theta(x)}{\partial \theta}\) exists for P_θ-almost all x ∈ X and that \(E_{P_\theta} \bigl|\frac{\partial \ln P_\theta(x)}{\partial \theta}\bigr|^2 < +\infty\). Assume that the function w is non-increasing and bounded.

Let \((x_i)_{i \in \mathbb{N}}\) be a sequence of independent samples of P_θ. Then with probability 1, as N → ∞ we have

\[ \frac{1}{N} \sum_{i=1}^{N} \widehat{W}^f(x_i)\, \frac{\partial \ln P_\theta(x_i)}{\partial \theta} \;\longrightarrow\; \int W^f_\theta(x)\, \frac{\partial \ln P_\theta(x)}{\partial \theta}\, P_\theta(\mathrm{d}x) \]

where \(\widehat{W}^f(x_i) = w\bigl((\mathrm{rk}(x_i) + 1/2)/N\bigr)\) with \(\mathrm{rk}(x_i) = \#\{1 \le j \le N,\ f(x_j) < f(x_i)\}\). (When there are f-ties in the sample, \(\widehat{W}^f(x_i)\) is defined as the average of \(w((r + 1/2)/N)\) over the possible rankings r of x_i.)
Proof. Let g : X → ℝ be any function with \(E_{P_\theta} g^2 < \infty\). We will show that

\[ \frac{1}{N} \sum_i \widehat{W}^f(x_i)\, g(x_i) \;\longrightarrow\; \int W^f_\theta(x)\, g(x)\, P_\theta(\mathrm{d}x). \]

Applying this with g equal to the components of \(\frac{\partial \ln P_\theta(x)}{\partial \theta}\) will yield the result.
Let us decompose

\[ \frac{1}{N} \sum_i \widehat{W}^f(x_i)\, g(x_i) = \frac{1}{N} \sum_i W^f_\theta(x_i)\, g(x_i) + \frac{1}{N} \sum_i \bigl(\widehat{W}^f(x_i) - W^f_\theta(x_i)\bigr)\, g(x_i). \]

Each summand in the first term involves only one sample x_i (contrary to \(\widehat{W}^f(x_i)\), which depends on the whole sample). So by the strong law of large numbers, almost surely \(\frac{1}{N} \sum_i W^f_\theta(x_i)\, g(x_i)\) converges to \(\int W^f_\theta(x)\, g(x)\, P_\theta(\mathrm{d}x)\).
So we have to show that the second term converges to 0 almost surely. By the Cauchy–Schwarz inequality, we have

\[ \Bigl( \frac{1}{N} \sum_i \bigl(\widehat{W}^f(x_i) - W^f_\theta(x_i)\bigr) g(x_i) \Bigr)^2 \le \Bigl( \frac{1}{N} \sum_i \bigl(\widehat{W}^f(x_i) - W^f_\theta(x_i)\bigr)^2 \Bigr) \Bigl( \frac{1}{N} \sum_i g(x_i)^2 \Bigr). \]

By the strong law of large numbers, the second factor \(\frac{1}{N} \sum_i g(x_i)^2\) converges to \(E_{P_\theta} g^2\) almost surely. So we have to prove that the first factor \(\frac{1}{N} \sum_i (\widehat{W}^f(x_i) - W^f_\theta(x_i))^2\) converges to 0 almost surely.

Since w is bounded by assumption, we can write

\[ \bigl(\widehat{W}^f(x_i) - W^f_\theta(x_i)\bigr)^2 \le 2B\, \bigl|\widehat{W}^f(x_i) - W^f_\theta(x_i)\bigr| = 2B\, \bigl(\widehat{W}^f(x_i) - W^f_\theta(x_i)\bigr)_+ + 2B\, \bigl(\widehat{W}^f(x_i) - W^f_\theta(x_i)\bigr)_- \]

where B is the bound on |w|. We will bound each of these terms.

Let us abbreviate

\[ q^-_i = \Pr_{x' \sim P_\theta}(f(x') < f(x_i)), \qquad q^+_i = \Pr_{x' \sim P_\theta}(f(x') \le f(x_i)), \]
\[ r^-_i = \#\{ j \le N,\ f(x_j) < f(x_i) \}, \qquad r^+_i = \#\{ j \le N,\ f(x_j) \le f(x_i) \}. \]

By definition of \(\widehat{W}^f\) we have

\[ \widehat{W}^f(x_i) = \frac{1}{r^+_i - r^-_i} \sum_{r = r^-_i}^{r^+_i - 1} w\bigl((r + 1/2)/N\bigr) \]

and moreover \(W^f_\theta(x_i) = w(q^-_i)\) if \(q^-_i = q^+_i\), or

\[ W^f_\theta(x_i) = \frac{1}{q^+_i - q^-_i} \int_{q^-_i}^{q^+_i} w \]

otherwise.
The Glivenko–Cantelli theorem states that \(\sup_i \bigl| r^+_i/N - q^+_i \bigr|\) tends to 0 almost surely, and likewise for \(\sup_i \bigl| r^-_i/N - q^-_i \bigr|\). So let N be large enough so that these errors are bounded by ε.

Since w is non-increasing, we have \(w(q^-_i) \le w(r^-_i/N - \varepsilon)\). In the case \(q^-_i \ne q^+_i\), we decompose the interval \([q^-_i;\, q^+_i]\) into \((r^+_i - r^-_i)\) subintervals. The average of w over each such subinterval is compared to a term in the sum defining \(\widehat{W}^f(x_i)\): since w is non-increasing, the average of w over the r-th subinterval is at most \(w((r^-_i + r)/N - \varepsilon)\). So we get

\[ W^f_\theta(x_i) \le \frac{1}{r^+_i - r^-_i} \sum_{r = r^-_i}^{r^+_i - 1} w(r/N - \varepsilon) \]

so that

\[ W^f_\theta(x_i) - \widehat{W}^f(x_i) \le \frac{1}{r^+_i - r^-_i} \sum_{r = r^-_i}^{r^+_i - 1} \bigl( w(r/N - \varepsilon) - w((r + 1/2)/N) \bigr). \]

Let us sum over i, remembering that there are \((r^+_i - r^-_i)\) values of j for which \(f(x_j) = f(x_i)\). Taking the positive part, we get

\[ \frac{1}{N} \sum_{i=1}^{N} \bigl( W^f_\theta(x_i) - \widehat{W}^f(x_i) \bigr)_+ \le \frac{1}{N} \sum_{r=0}^{N-1} \bigl( w(r/N - \varepsilon) - w((r + 1/2)/N) \bigr). \]

Since w is non-increasing we have

\[ \frac{1}{N} \sum_{r=0}^{N-1} w(r/N - \varepsilon) \le \int_{-\varepsilon - 1/N}^{1 - \varepsilon - 1/N} w \]

and

\[ \frac{1}{N} \sum_{r=0}^{N-1} w\bigl((r + 1/2)/N\bigr) \ge \int_{1/2N}^{1 + 1/2N} w \]

(we implicitly extend the range of w so that w(q) = w(0) for q < 0). So we have

\[ \frac{1}{N} \sum_{i=1}^{N} \bigl( W^f_\theta(x_i) - \widehat{W}^f(x_i) \bigr)_+ \le \int_{-\varepsilon - 1/N}^{1/2N} w \; - \int_{1 - \varepsilon - 1/N}^{1 + 1/2N} w \; \le (2\varepsilon + 3/N)\, B \]

where B is the bound on |w|.

Reasoning symmetrically with \(w(r/N + \varepsilon)\) and the inequalities reversed, we get a similar bound for \(\frac{1}{N} \sum_i \bigl( W^f_\theta(x_i) - \widehat{W}^f(x_i) \bigr)_-\). This ends the proof.
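For illustration, the rank-based weights \(\widehat{W}^f(x_i)\) appearing in Proposition 18, including the tie-averaging convention, can be computed as follows (our sketch; the truncation weight w is an illustrative choice).

```python
# Sketch (ours) of the rank-based weights of Proposition 18:
# W_hat(x_i) = w((rk(x_i) + 1/2)/N), averaging over ranks in case of f-ties.
def rank_weights(fvals, w):
    N = len(fvals)
    out = []
    for fi in fvals:
        below = sum(1 for fj in fvals if fj < fi)   # rk(x_i)
        ties = sum(1 for fj in fvals if fj == fi)   # tied points, incl. x_i itself
        # average w((r + 1/2)/N) over the ranks r that the tied points share
        out.append(sum(w((below + k + 0.5) / N) for k in range(ties)) / ties)
    return out

w = lambda u: 1.0 if u <= 0.5 else 0.0              # truncation selection
print(rank_weights([3.0, 1.0, 2.0, 2.0], w))        # [0.0, 1.0, 0.5, 0.5]
```

Note how the two tied middle points each receive the average of the weights of the two ranks they could occupy.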
Proof of Proposition 5 (Quantile improvement)
Let us use the weight \(w(u) = \mathbb{1}_{u \le q}\). Let m be the value of the q-quantile of f under \(P_{\theta^t}\). We want to show that the value of the q-quantile of f under \(P_{\theta^{t+\delta t}}\) is less than m, unless the gradient vanishes and the IGO flow is stationary.

Let \(p^- = \Pr_{x \sim P_{\theta^t}}(f(x) < m)\), \(p_m = \Pr_{x \sim P_{\theta^t}}(f(x) = m)\) and \(p^+ = \Pr_{x \sim P_{\theta^t}}(f(x) > m)\). By definition of the quantile value we have \(p^- + p_m \ge q\) and \(p^+ + p_m \ge 1 - q\). Let us assume that we are in the more complicated case \(p_m \ne 0\) (for the case \(p_m = 0\), simply remove the corresponding terms).

We have \(W^f_{\theta^t}(x) = 1\) if f(x) < m, \(W^f_{\theta^t}(x) = 0\) if f(x) > m, and

\[ W^f_{\theta^t}(x) = \frac{1}{p_m} \int_{p^-}^{p^- + p_m} w(u)\, \mathrm{d}u = \frac{q - p^-}{p_m} \quad \text{if } f(x) = m. \]

Using the same notation as above, let \(g_t(\theta) = \int W^f_{\theta^t}(x)\, P_\theta(\mathrm{d}x)\). Decomposing this integral on the three sets f(x) < m, f(x) = m and f(x) > m, we get that

\[ g_t(\theta) = \Pr_{x \sim P_\theta}(f(x) < m) + \Pr_{x \sim P_\theta}(f(x) = m)\, \frac{q - p^-}{p_m}. \]

In particular, \(g_t(\theta^t) = q\).

Since we follow a gradient ascent of \(g_t\), for δt small enough we have \(g_t(\theta^{t+\delta t}) > g_t(\theta^t)\) unless the gradient vanishes. If the gradient vanishes we have \(\theta^{t+\delta t} = \theta^t\) and the quantiles are the same. Otherwise we get \(g_t(\theta^{t+\delta t}) > g_t(\theta^t) = q\).

Since \(\frac{q - p^-}{p_m} \le \frac{(p^- + p_m) - p^-}{p_m} = 1\), we have

\[ g_t(\theta) \le \Pr_{x \sim P_\theta}(f(x) < m) + \Pr_{x \sim P_\theta}(f(x) = m) = \Pr_{x \sim P_\theta}(f(x) \le m). \]

So \(\Pr_{x \sim P_{\theta^{t+\delta t}}}(f(x) \le m) \ge g_t(\theta^{t+\delta t}) > q\). This implies that, by definition, the q-quantile value of \(P_{\theta^{t+\delta t}}\) is smaller than m.
Proof of Proposition 10 (Speed of the IGO flow)
Lemma 19. Let X be a centered \(L^2\) random variable with values in \(\mathbb{R}^d\) and let A be a real-valued \(L^2\) random variable. Then

\[ \| E(AX) \| \le \sqrt{\lambda\, \operatorname{Var} A} \]

where λ is the largest eigenvalue of the covariance matrix of X expressed in an orthonormal basis.
Proof of the lemma. Let v be any vector in \(\mathbb{R}^d\); its norm satisfies

\[ \| v \| = \sup_{w,\ \|w\| \le 1} (v \cdot w) \]

and in particular

\begin{align*}
\| E(AX) \| &= \sup_{w,\ \|w\| \le 1} \bigl( w \cdot E(AX) \bigr) \\
&= \sup_{w,\ \|w\| \le 1} E\bigl( A\, (w \cdot X) \bigr) \\
&= \sup_{w,\ \|w\| \le 1} E\bigl( (A - EA)\, (w \cdot X) \bigr) \quad \text{since } (w \cdot X) \text{ is centered} \\
&\le \sup_{w,\ \|w\| \le 1} \sqrt{\operatorname{Var} A}\, \sqrt{E\bigl( (w \cdot X)^2 \bigr)}
\end{align*}

by the Cauchy–Schwarz inequality, since \(E\bigl((A - EA)^2\bigr) = \operatorname{Var} A\).

Now, in an orthonormal basis we have \((w \cdot X) = \sum_i w_i X_i\), so that

\begin{align*}
E\bigl( (w \cdot X)^2 \bigr) &= E\Bigl( \bigl( \textstyle\sum_i w_i X_i \bigr) \bigl( \textstyle\sum_j w_j X_j \bigr) \Bigr) \\
&= \sum_i \sum_j E(w_i X_i\, w_j X_j) = \sum_i \sum_j w_i w_j\, E(X_i X_j) = \sum_i \sum_j w_i w_j\, C_{ij}
\end{align*}

with \(C_{ij}\) the covariance matrix of X (recall that X is centered). The latter expression is the scalar product \((w \cdot Cw)\). Since C is a symmetric positive-semidefinite matrix, \((w \cdot Cw)\) is at most \(\lambda \|w\|^2\) with λ the largest eigenvalue of C.
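A quick Monte Carlo sanity check of Lemma 19 (ours; the particular distributions of A and X are arbitrary illustrative choices):

```python
# Monte Carlo sanity check (ours) of Lemma 19: ||E(A X)|| <= sqrt(lambda Var A)
# for a centered X in R^2 with Cov(X) = diag(1, 4), so lambda = 4.
import math
import random

rng = random.Random(1)
n = 100_000
xs = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 2.0)) for _ in range(n)]
avals = [x1 + 0.5 * x2 + rng.gauss(0.0, 1.0) for x1, x2 in xs]  # A correlated with X

ex = [sum(a * x[k] for a, x in zip(avals, xs)) / n for k in (0, 1)]
lhs = math.hypot(ex[0], ex[1])                        # empirical ||E(A X)||
abar = sum(avals) / n
var_a = sum((a - abar) ** 2 for a in avals) / n       # empirical Var A
rhs = math.sqrt(4.0 * var_a)
print(lhs, rhs)                                       # the bound holds with margin
```

Here the exact values are \(\|E(AX)\| = \sqrt{5}\) and Var A = 3, so the bound \(\sqrt{12}\) holds with a clear margin.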
For the IGO flow we have

\[ \frac{\mathrm{d}\theta^t}{\mathrm{d}t} = E_{x \sim P_\theta}\, W^f_\theta(x)\, \widetilde\nabla_\theta \ln P_\theta(x). \]

So applying the lemma, we get that the norm of \(\frac{\mathrm{d}\theta}{\mathrm{d}t}\) is at most \(\sqrt{\lambda\, \operatorname{Var}_{x \sim P_\theta} W^f_\theta(x)}\), where λ is the largest eigenvalue of the covariance matrix of \(\widetilde\nabla_\theta \ln P_\theta(x)\) (expressed in a coordinate system where the Fisher matrix at the current point θ is the identity).

By construction of the quantiles, we have \(\operatorname{Var}_{x \sim P_\theta} W^f_\theta(x) \le \operatorname{Var}_{[0,1]} w\) (with equality unless there are ties). Indeed, for a given x, let \(\mathcal{U}\) be a uniform random variable in [0,1] independent from x and define the random variable \(Q = q^-(x) + (q^+(x) - q^-(x))\, \mathcal{U}\). Then Q is uniformly distributed between the upper and lower quantiles \(q^+(x)\) and \(q^-(x)\) and thus we can rewrite \(W^f_\theta(x)\) as \(E(w(Q) \mid x)\). By the Jensen inequality we have \(\operatorname{Var} W^f_\theta(x) = \operatorname{Var} E(w(Q) \mid x) \le \operatorname{Var} w(Q)\). In addition, when x is taken under \(P_\theta\), Q is uniformly distributed in [0,1] and thus \(\operatorname{Var} w(Q) = \operatorname{Var}_{[0,1]} w\), i.e. \(\operatorname{Var}_{x \sim P_\theta} W^f_\theta(x) \le \operatorname{Var}_{[0,1]} w\).

Besides, consider the tangent space in Θ-space at point \(\theta^t\), and let us choose an orthonormal basis in this tangent space for the Fisher metric. Then, in this basis, we have \(\widetilde\nabla_\theta \ln P_\theta(x) = \partial_\theta \ln P_\theta(x)\). So the covariance matrix of \(\widetilde\nabla \ln P_\theta(x)\) is \(E_{x \sim P_\theta}\bigl( \partial_\theta \ln P_\theta(x)\, \partial_\theta \ln P_\theta(x)^{\top} \bigr)\), which is equal to the Fisher matrix by definition. So this covariance matrix is the identity, whose largest eigenvalue is 1.
Proof of Proposition 12 (Noisy IGO)
On the one hand, let \(\tilde P_\theta\) be a family of distributions on X. The IGO algorithm (13) applied to a random function \(f(x) = \tilde f(x, \omega)\), where ω is a random variable uniformly distributed in [0,1], reads

\[ \theta^{t+\delta t} = \theta^t + \delta t \sum_{i=1}^{N} \widehat w_i\, \widetilde\nabla_\theta \ln \tilde P_\theta(x_i) \tag{56} \]

where \(x_i \sim \tilde P_\theta\) and \(\widehat w_i\) is computed according to (12), where ranking is applied to the values \(\tilde f(x_i, \omega_i)\), with \(\omega_i\) uniform variables in [0,1] independent from \(x_i\) and from each other.

On the other hand, for the IGO algorithm using \(\tilde P_\theta \otimes U_{[0,1]}\) and applied to the deterministic function \(\tilde f\), \(\widehat w_i\) is computed using the ranking according to the \(\tilde f\) values of the sampled points \(\tilde x_i = (x_i, \omega_i)\), and thus coincides with the one in (56).

Besides,

\[ \widetilde\nabla_\theta \ln P_\theta(\tilde x_i) = \widetilde\nabla_\theta \ln \tilde P_\theta(x_i) + \underbrace{\widetilde\nabla_\theta \ln U_{[0,1]}(\omega_i)}_{=\,0} \]

and thus the IGO algorithm update on the space X × [0,1], using the family of distributions \(P_\theta = \tilde P_\theta \otimes U_{[0,1]}\), applied to the deterministic function \(\tilde f\), coincides with (56).
Proof of Theorem 13 (Natural gradient as ML with infinitesimal weights)
We begin with a calculus lemma (proof omitted).
Lemma 20. Let f be a real-valued function on a finite-dimensional vector space E equipped with a positive definite quadratic form \(\|\cdot\|^2\). Assume f is smooth and has at most quadratic growth at infinity. Then, for any x ∈ E, we have

\[ \nabla f(x) = \lim_{\varepsilon \to 0} \frac{1}{\varepsilon} \operatorname*{arg\,max}_{h} \Bigl\{ f(x + h) - \frac{1}{2\varepsilon} \|h\|^2 \Bigr\} \]

where ∇ is the gradient associated with the norm \(\|\cdot\|\). Equivalently,

\[ \operatorname*{arg\,max}_{y} \Bigl\{ f(y) - \frac{1}{2\varepsilon} \|y - x\|^2 \Bigr\} = x + \varepsilon\, \nabla f(x) + O(\varepsilon^2) \]

when ε → 0.
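Lemma 20 can be checked numerically on a one-dimensional quadratic with the Euclidean norm (our sketch; the grid search is a crude stand-in for the argmax):

```python
# Numeric check (ours) of Lemma 20 for f(y) = -(y - 3)^2 with the Euclidean
# norm on R: argmax_y { f(y) - |y - x|^2 / (2 eps) } = x + eps f'(x) + O(eps^2).
def prox_argmax(f, x, eps, lo=-1.0, hi=1.0, steps=200_001):
    """Crude grid search standing in for the argmax."""
    best_y, best_v = lo, float("-inf")
    for i in range(steps):
        y = lo + (hi - lo) * i / (steps - 1)
        v = f(y) - (y - x) ** 2 / (2 * eps)
        if v > best_v:
            best_y, best_v = y, v
    return best_y

f = lambda y: -(y - 3.0) ** 2
x, eps = 0.0, 0.01
y_star = prox_argmax(f, x, eps)
first_order = x + eps * (-2.0 * (x - 3.0))   # x + eps f'(x) = 0.06
print(y_star, first_order)                   # differ by O(eps^2)
```

For this quadratic the argmax is available in closed form, \(y^* = 6\varepsilon/(1 + 2\varepsilon)\), which indeed differs from \(x + \varepsilon f'(x)\) by \(O(\varepsilon^2)\).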
We are now ready to prove Theorem 13. Let W be a function of x, and fix some \(\theta_0\) in Θ.

We need some regularity assumptions: we assume that the parameter space Θ is non-degenerate (no two points θ ∈ Θ define the same probability distribution) and proper (the map \(P_\theta \mapsto \theta\) is continuous). We also assume that the map \(\theta \mapsto P_\theta\) is smooth enough, so that \(\int \log P_\theta(x)\, W(x)\, P_{\theta_0}(\mathrm{d}x)\) is a smooth function of θ. (These are restrictions on θ-regularity: this does not mean that W has to be continuous as a function of x.)
The two statements of Theorem 13 using a sum and an integral have similar proofs, so we only include the first. For ε > 0, let θ be the solution of

\[ \theta = \operatorname*{arg\,max} \Bigl\{ (1 - \varepsilon) \int \log P_\theta(x)\, P_{\theta_0}(\mathrm{d}x) + \varepsilon \int \log P_\theta(x)\, W(x)\, P_{\theta_0}(\mathrm{d}x) \Bigr\}. \]

Then we have

\begin{align*}
\theta &= \operatorname*{arg\,max} \Bigl\{ \int \log P_\theta(x)\, P_{\theta_0}(\mathrm{d}x) + \varepsilon \int \log P_\theta(x)\, (W(x) - 1)\, P_{\theta_0}(\mathrm{d}x) \Bigr\} \\
&= \operatorname*{arg\,max} \Bigl\{ \int \log P_\theta(x)\, P_{\theta_0}(\mathrm{d}x) - \int \log P_{\theta_0}(x)\, P_{\theta_0}(\mathrm{d}x) + \varepsilon \int \log P_\theta(x)\, (W(x) - 1)\, P_{\theta_0}(\mathrm{d}x) \Bigr\} \\
&\qquad \text{(because the added term does not depend on θ)} \\
&= \operatorname*{arg\,max} \Bigl\{ -\mathrm{KL}(P_{\theta_0}\,\|\,P_\theta) + \varepsilon \int \log P_\theta(x)\, (W(x) - 1)\, P_{\theta_0}(\mathrm{d}x) \Bigr\} \\
&= \operatorname*{arg\,max} \Bigl\{ -\tfrac{1}{\varepsilon}\, \mathrm{KL}(P_{\theta_0}\,\|\,P_\theta) + \int \log P_\theta(x)\, (W(x) - 1)\, P_{\theta_0}(\mathrm{d}x) \Bigr\}.
\end{align*}

When ε → 0, the first term exceeds the second one if \(\mathrm{KL}(P_{\theta_0}\|P_\theta)\) is too large (because W is bounded), and so \(\mathrm{KL}(P_{\theta_0}\|P_\theta)\) tends to 0. So we can assume that θ is close to \(\theta_0\).

When \(\theta = \theta_0 + \delta\theta\) is close to \(\theta_0\), we have

\[ \mathrm{KL}(P_{\theta_0}\,\|\,P_\theta) = \frac{1}{2} \sum_{ij} I_{ij}(\theta_0)\, \delta\theta_i\, \delta\theta_j + O(\delta\theta^3) \]

with \(I_{ij}(\theta_0)\) the Fisher matrix at \(\theta_0\). (This actually holds both for \(\mathrm{KL}(P_{\theta_0}\|P_\theta)\) and \(\mathrm{KL}(P_\theta\|P_{\theta_0})\).)

Thus, we can apply the lemma above using the Fisher metric \(\sum I_{ij}(\theta_0)\, \delta\theta_i\, \delta\theta_j\), and working on a small neighborhood of \(\theta_0\) in θ-space (which can be identified with \(\mathbb{R}^{\dim \Theta}\)). The lemma states that the argmax above is attained at

\[ \theta = \theta_0 + \varepsilon\, \widetilde\nabla_\theta \int \log P_\theta(x)\, (W(x) - 1)\, P_{\theta_0}(\mathrm{d}x) \]

up to \(O(\varepsilon^2)\), with \(\widetilde\nabla\) the natural gradient.

Finally, the gradient cancels the constant −1 because \(\int (\widetilde\nabla \log P_\theta)\, P_{\theta_0} = 0\) at \(\theta = \theta_0\). This proves Theorem 13.
Proof of Theorem 15 (IGO, CEM and IGO-ML)
Let \(P_\theta\) be a family of probability distributions of the form

\[ P_\theta(\mathrm{d}x) = \frac{1}{Z(\theta)} \exp\Bigl( \sum_i \theta_i T_i(x) \Bigr)\, H(\mathrm{d}x) \]

where \(T_1, \ldots, T_k\) is a finite family of functions on X and \(H(\mathrm{d}x)\) is some reference measure on X. We assume that the family of functions \((T_i)_i\), together with the constant function \(T_0(x) = 1\), are linearly independent. This prevents redundant parametrizations where two values of θ describe the same distribution; this also ensures that the Fisher matrix \(\operatorname{Cov}(T_i, T_j)\) is invertible.
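For the simplest example, a Bernoulli distribution on {0, 1} with natural parameter θ and sufficient statistic T(x) = x, the identity between the Fisher information and \(\operatorname{Cov}(T_i, T_j)\) can be verified directly (our sketch):

```python
# Sanity check (ours) for the Bernoulli exponential family
# P_theta(x) = exp(theta * x) / (1 + exp(theta)), x in {0, 1}, T(x) = x:
# the Fisher information equals Cov(T, T) = Var(x) = p(1 - p), p = sigmoid(theta).
import math

theta = 0.7
p = 1.0 / (1.0 + math.exp(-theta))
var_T = p * (1.0 - p)                        # Cov(T, T)

# Fisher via E[(d log P / dtheta)^2], with d log P(x)/dtheta = x - p
fisher = (1.0 - p) * (0.0 - p) ** 2 + p * (1.0 - p) ** 2
print(fisher, var_T)                         # identical
```

The algebra collapses exactly: \((1-p)p^2 + p(1-p)^2 = p(1-p)\).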
The IGO update (14) in the parametrization \(\bar T_i\) is a sum of terms of the form \(\widetilde\nabla_{\bar T_i} \ln P(x)\). So we will compute the natural gradient \(\widetilde\nabla_{\bar T_i}\) in those coordinates. We first need some general results about the Fisher metric for exponential families.

The next proposition gives the expression of the Fisher scalar product between two tangent vectors δP and δ′P of a statistical manifold of exponential distributions. It is one way to express the duality between the coordinates \(\theta_i\) and \(\bar T_i\) (compare [AN00, (3.30) and Section 3.5]).

Proposition 21. Let \(\delta\theta_i\) and \(\delta'\theta_i\) be two small variations of the parameters \(\theta_i\). Let δP(x) and δ′P(x) be the resulting variations of the probability distribution P, and \(\delta\bar T_i\) and \(\delta'\bar T_i\) the resulting variations of \(\bar T_i\). Then the scalar product, in Fisher information metric, between the tangent vectors δP and δ′P is

\[ \langle \delta P, \delta' P \rangle = \sum_i \delta\theta_i\, \delta'\bar T_i = \sum_i \delta'\theta_i\, \delta\bar T_i. \]
Proof. By definition of the Fisher metric:

\begin{align*}
\langle \delta P, \delta' P \rangle &= \sum_{i,j} I_{ij}\, \delta\theta_i\, \delta'\theta_j \\
&= \sum_{i,j} \delta\theta_i\, \delta'\theta_j \int_x \frac{\partial \ln P(x)}{\partial \theta_i}\, \frac{\partial \ln P(x)}{\partial \theta_j}\, P(x) \\
&= \int_x \Bigl( \sum_i \frac{\partial \ln P(x)}{\partial \theta_i}\, \delta\theta_i \Bigr) \Bigl( \sum_j \frac{\partial \ln P(x)}{\partial \theta_j}\, \delta'\theta_j \Bigr)\, P(x) \\
&= \int_x \Bigl( \sum_i \frac{\partial \ln P(x)}{\partial \theta_i}\, \delta\theta_i \Bigr)\, \delta'(\ln P(x))\, P(x) \\
&= \int_x \Bigl( \sum_i \frac{\partial \ln P(x)}{\partial \theta_i}\, \delta\theta_i \Bigr)\, \delta' P(x) \\
&= \int_x \sum_i \bigl( T_i(x) - \bar T_i \bigr)\, \delta\theta_i\, \delta' P(x) \qquad \text{by (16)} \\
&= \sum_i \delta\theta_i \Bigl( \int_x T_i(x)\, \delta' P(x) \Bigr) - \sum_i \delta\theta_i\, \bar T_i \int_x \delta' P(x) \\
&= \sum_i \delta\theta_i\, \delta'\bar T_i
\end{align*}

because \(\int_x \delta' P(x) = 0\) since the total mass of P is 1, and \(\int_x T_i(x)\, \delta' P(x) = \delta'\bar T_i\) by definition of \(\bar T_i\).
Proposition 22. Let f be a function on the statistical manifold of an exponential family as above. Then the components of the natural gradient w.r.t. the expectation parameters are given by the vanilla gradient w.r.t. the natural parameters:

\[ \widetilde\nabla_{\bar T_i} f = \frac{\partial f}{\partial \theta_i} \]

and conversely \(\widetilde\nabla_{\theta_i} f = \partial f / \partial \bar T_i\).

(Beware: this does not mean that the gradient ascent in any of those parametrizations is the vanilla gradient ascent.)

We could not find a reference for this result, though we think it is known.
Proof. By definition, the natural gradient \(\widetilde\nabla f\) of a function f is the unique tangent vector δP such that, for any other tangent vector δ′P, we have

\[ \delta' f = \langle \delta P, \delta' P \rangle \]

with ⟨·,·⟩ the scalar product associated with the Fisher metric. We want to compute this natural gradient in the coordinates \(\bar T_i\), so we are interested in the variations \(\delta\bar T_i\) associated with δP.

By Proposition 21, the scalar product above is \(\sum_i \delta\bar T_i\, \delta'\theta_i\), where \(\delta\bar T_i\) is the variation of \(\bar T_i\) associated with δP, and \(\delta'\theta_i\) the variation of \(\theta_i\) associated with δ′P.

On the other hand we have \(\delta' f = \sum_i \frac{\partial f}{\partial \theta_i}\, \delta'\theta_i\). So we must have

\[ \sum_i \delta\bar T_i\, \delta'\theta_i = \sum_i \frac{\partial f}{\partial \theta_i}\, \delta'\theta_i \]

for any δ′P, which leads to \(\delta\bar T_i = \frac{\partial f}{\partial \theta_i}\) as needed. The converse relation is proved mutatis mutandis.
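Proposition 22 can be checked numerically on the Bernoulli family, where θ = logit(p) and the expectation parameter is \(\bar T = p\) (our sketch; f is an arbitrary smooth test function):

```python
# Numeric check (ours) of Proposition 22 on the Bernoulli family:
# the natural gradient of f w.r.t. the expectation parameter Tbar = p
# equals the vanilla derivative of f w.r.t. the natural parameter theta.
import math

p = 0.3
f = lambda q: q ** 2                  # arbitrary smooth test function of p
h = 1e-6

# natural gradient in p-coordinates: I(p)^{-1} df/dp with I(p) = 1/(p(1-p))
df_dp = (f(p + h) - f(p - h)) / (2 * h)
nat_grad_p = p * (1.0 - p) * df_dp

# vanilla derivative w.r.t. theta = logit(p), with p(theta) = sigmoid(theta)
theta = math.log(p / (1.0 - p))
sigmoid = lambda t: 1.0 / (1.0 + math.exp(-t))
df_dtheta = (f(sigmoid(theta + h)) - f(sigmoid(theta - h))) / (2 * h)
print(nat_grad_p, df_dtheta)          # both close to 2 p^2 (1 - p) = 0.126
```

Both quantities equal \(2p \cdot p(1-p)\), as the chain rule through \(\mathrm{d}p/\mathrm{d}\theta = p(1-p)\) predicts.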
Back to the proof of Theorem 15. We can now compute the desired terms:

\[ \widetilde\nabla_{\bar T_i} \ln P(x) = \frac{\partial \ln P(x)}{\partial \theta_i} = T_i(x) - \bar T_i \]

by (16). This proves the first statement (27) in Theorem 15 about the form of the IGO update in these parameters.

The other statements follow easily from this, together with the additional fact (26) that, for any set of weights \(a_i\) with \(\sum a_i = 1\), the value \(T^* = \sum_i a_i\, T(x_i)\) is the maximum likelihood estimate of \(\sum_i a_i \log P(x_i)\).
References

[AHS85] D.H. Ackley, G.E. Hinton, and T.J. Sejnowski, A learning algorithm for Boltzmann machines, Cognitive Science 9 (1985), no. 1, 147–169.

[Ama98] Shun-Ichi Amari, Natural gradient works efficiently in learning, Neural Comput. 10 (1998), 251–276.

[AN00] Shun-ichi Amari and Hiroshi Nagaoka, Methods of information geometry, Translations of Mathematical Monographs, vol. 191, American Mathematical Society, Providence, RI, 2000, Translated from the 1993 Japanese original by Daishi Harada.

[ANOK10] Youhei Akimoto, Yuichi Nagata, Isao Ono, and Shigenobu Kobayashi, Bidirectional relation between CMA evolution strategies and natural evolution strategies, Proceedings of Parallel Problem Solving from Nature - PPSN XI, Lecture Notes in Computer Science, vol. 6238, Springer, 2010, pp. 154–163.

[AO08] Ravi P. Agarwal and Donal O'Regan, An introduction to ordinary differential equations, Springer, 2008.
[Arn06] D.V. Arnold, Weighted multirecombination evolution strategies, Theoretical Computer Science 361 (2006), no. 1, 18–37.

[Bal94] Shumeet Baluja, Population based incremental learning: A method for integrating genetic search based function optimization and competitive learning, Tech. Report CMU-CS-94-163, Carnegie Mellon University, 1994.

[BC95] Shumeet Baluja and Rich Caruana, Removing the genetics from the standard genetic algorithm, Proceedings of ICML'95, 1995, pp. 38–46.

[Ber00] A. Berny, Selection and reinforcement learning for combinatorial optimization, Parallel Problem Solving from Nature PPSN VI (Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Merelo, and Hans-Paul Schwefel, eds.), Lecture Notes in Computer Science, vol. 1917, Springer Berlin Heidelberg, 2000, pp. 601–610.

[Ber02] Arnaud Berny, Boltzmann machine for population-based incremental learning, ECAI, 2002, pp. 198–202.

[Bey01] H.-G. Beyer, The theory of evolution strategies, Natural Computing Series, Springer-Verlag, 2001.

[BLS07] J. Branke, C. Lode, and J.L. Shapiro, Addressing sampling errors and diversity loss in UMDA, Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation, ACM, 2007, pp. 508–515.

[BS02] H.-G. Beyer and H.-P. Schwefel, Evolution strategies – a comprehensive introduction, Natural Computing 1 (2002), no. 1, 3–52.

[CT06] Thomas M. Cover and Joy A. Thomas, Elements of information theory, second ed., Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, 2006.

[dBKMR05] Pieter-Tjerk de Boer, Dirk P. Kroese, Shie Mannor, and Reuven Y. Rubinstein, A tutorial on the cross-entropy method, Annals OR 134 (2005), no. 1, 19–67.

[Gha04] Zoubin Ghahramani, Unsupervised learning, Advanced Lectures on Machine Learning (Olivier Bousquet, Ulrike von Luxburg, and Gunnar Rätsch, eds.), Lecture Notes in Computer Science, vol. 3176, Springer Berlin / Heidelberg, 2004, pp. 72–112.

[GSS+10] Tobias Glasmachers, Tom Schaul, Yi Sun, Daan Wierstra, and Jürgen Schmidhuber, Exponential natural evolution strategies, GECCO, 2010, pp. 393–400.

[Han06a] N. Hansen, An analysis of mutative σ-self-adaptation on linear fitness functions, Evolutionary Computation 14 (2006), no. 3, 255–275.

[Han06b] N. Hansen, The CMA evolution strategy: a comparing review, Towards a New Evolutionary Computation. Advances on Estimation of Distribution Algorithms (J.A. Lozano, P. Larranaga, I. Inza, and E. Bengoetxea, eds.), Springer, 2006, pp. 75–102.
[Han09] N. Hansen, Benchmarking a BI-population CMA-ES on the BBOB-2009 function testbed, Proceedings of the 11th Annual Conference Companion on Genetic and Evolutionary Computation Conference: Late Breaking Papers (New York, NY, USA), GECCO '09, ACM, 2009, pp. 2389–2396.

[Hin02] G.E. Hinton, Training products of experts by minimizing contrastive divergence, Neural Computation 14 (2002), 1771–1800.

[HJ61] R. Hooke and T.A. Jeeves, "Direct search" solution of numerical and statistical problems, Journal of the ACM 8 (1961), 212–229.

[HK04] N. Hansen and S. Kern, Evaluating the CMA evolution strategy on multimodal test functions, Parallel Problem Solving from Nature PPSN VIII (X. Yao et al., eds.), LNCS, vol. 3242, Springer, 2004, pp. 282–291.

[HMK03] N. Hansen, S.D. Müller, and P. Koumoutsakos, Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES), Evolutionary Computation 11 (2003), no. 1, 1–18.

[HO96] N. Hansen and A. Ostermeier, Adapting arbitrary normal mutation distributions in evolution strategies: The covariance matrix adaptation, ICEC96, IEEE Press, 1996, pp. 312–317.

[HO01] Nikolaus Hansen and Andreas Ostermeier, Completely derandomized self-adaptation in evolution strategies, Evolutionary Computation 9 (2001), no. 2, 159–195.

[JA06] G.A. Jastrebski and D.V. Arnold, Improving evolution strategies through active covariance matrix adaptation, Evolutionary Computation, 2006. CEC 2006. IEEE Congress on, IEEE, 2006, pp. 2814–2821.

[Jef46] Harold Jeffreys, An invariant form for the prior probability in estimation problems, Proc. Roy. Soc. London. Ser. A. 186 (1946), 453–461.

[Kul97] Solomon Kullback, Information theory and statistics, Dover Publications Inc., Mineola, NY, 1997, Reprint of the second (1968) edition.

[LL02] P. Larranaga and J.A. Lozano, Estimation of distribution algorithms: A new tool for evolutionary computation, Springer Netherlands, 2002.

[LRMB07] Nicolas Le Roux, Pierre-Antoine Manzagol, and Yoshua Bengio, Topmoumoute online natural gradient algorithm, NIPS, 2007.

[MMS08] Luigi Malagò, Matteo Matteucci, and Bernardo Dal Seno, An information geometry perspective on estimation of distribution algorithms: boundary analysis, GECCO (Companion), 2008, pp. 2081–2088.

[NM65] John Ashworth Nelder and R. Mead, A simplex method for function minimization, The Computer Journal (1965), 308–313.
[PGL02] M. Pelikan, D.E. Goldberg, and F.G. Lobo, A survey of optimization by building and using probabilistic models, Computational Optimization and Applications 21 (2002), no. 1, 5–20.
[Rao45] Calyampudi Radhakrishna Rao, Information and the accuracy attainable in the estimation of statistical parameters, Bull. Calcutta Math. Soc. 37 (1945), 81–91.
[Rec73] I. Rechenberg, Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution, Frommann-Holzboog Verlag, Stuttgart, 1973.
[RK04] R.Y. Rubinstein and D.P. Kroese, The cross-entropy method: a unified approach to combinatorial optimization, Monte-Carlo simulation, and machine learning, Springer-Verlag New York Inc., 2004.
[Rub99] Reuven Rubinstein, The cross-entropy method for combinatorial and continuous optimization, Methodology and Computing in Applied Probability 1 (1999), 127–190, DOI 10.1023/A:1010091220143.
[SA90] F. Silva and L. Almeida, Acceleration techniques for the back-propagation algorithm, Neural Networks (1990), 110–119.
[Sch92] Laurent Schwartz, Analyse. II, Collection Enseignement des Sciences [Collection: The Teaching of Science], vol. 43, Hermann, Paris, 1992, Calcul différentiel et équations différentielles, With the collaboration of K. Zizi.
[Sch95] H.-P. Schwefel, Evolution and optimum seeking, Sixth-Generation Computer Technology Series, John Wiley & Sons, Inc., New York, NY, USA, 1995.
[SHI09] Thorsten Suttorp, Nikolaus Hansen, and Christian Igel, Efficient covariance matrix update for variable metric evolution strategies, Machine Learning 75 (2009), no. 2, 167–197.
[Smo86] P. Smolensky, Information processing in dynamical systems: foundations of harmony theory, Parallel Distributed Processing (D. Rumelhart and J. McClelland, eds.), vol. 1, MIT Press, Cambridge, MA, USA, 1986, pp. 194–281.
[SWSS09] Yi Sun, Daan Wierstra, Tom Schaul, and Juergen Schmidhuber, Efficient natural evolution strategies, Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation (New York, NY, USA), GECCO '09, ACM, 2009, pp. 539–546.
[Tho00] H. Thorisson, Coupling, stationarity, and regeneration, Springer, 2000.
[Tor97] Virginia Torczon, On the convergence of pattern search algorithms, SIAM Journal on Optimization 7 (1997), no. 1, 1–25.
[WAS04] Michael Wagner, Anne Auger, and Marc Schoenauer, EEDA: A new robust estimation of distribution algorithms, Research Report RR-5190, INRIA, 2004.
[Whi89] D. Whitley, The GENITOR algorithm and selection pressure: Why rank-based allocation of reproductive trials is best, Proceedings of the Third International Conference on Genetic Algorithms, 1989, pp. 116–121.
[WSPS08] Daan Wierstra, Tom Schaul, Jan Peters, and Jürgen Schmidhuber, Natural evolution strategies, IEEE Congress on Evolutionary Computation, 2008, pp. 3381–3387.