

Page 1: Information-GeometricOptimizationAlgorithms ...repmus.ircam.fr/_media/brillouin/ressources/information...Information-GeometricOptimizationAlgorithms: AUnifyingPictureviaInvariancePrinciples

Information-Geometric Optimization Algorithms: A Unifying Picture via Invariance Principles

Ludovic Arnold, Anne Auger, Nikolaus Hansen, Yann Ollivier

Abstract

We present a canonical way to turn any smooth parametric family of probability distributions on an arbitrary search space 𝑋 into a continuous-time black-box optimization method on 𝑋, the information-geometric optimization (IGO) method. Invariance as a major design principle keeps the number of arbitrary choices to a minimum. The resulting method conducts a natural gradient ascent using an adaptive, time-dependent transformation of the objective function, and makes no particular assumptions on the objective function to be optimized.

The IGO method produces explicit IGO algorithms through time discretization. The cross-entropy method is recovered in a particular case with a large time step, and can be extended into a smoothed, parametrization-independent maximum likelihood update.

When applied to specific families of distributions on discrete or continuous spaces, the IGO framework allows us to naturally recover versions of known algorithms. From the family of Gaussian distributions on R𝑑, we arrive at a version of the well-known CMA-ES algorithm. From the family of Bernoulli distributions on {0, 1}𝑑, we recover the seminal PBIL algorithm. From the distributions of restricted Boltzmann machines, we naturally obtain a novel algorithm for discrete optimization on {0, 1}𝑑.

The IGO method achieves, thanks to its intrinsic formulation, maximal invariance properties: invariance under reparametrization of the search space 𝑋, under a change of parameters of the probability distribution, and under increasing transformation of the function to be optimized. The latter is achieved thanks to an adaptive formulation of the objective.

Theoretical considerations strongly suggest that IGO algorithms are characterized by a minimal change of the distribution. Therefore they have minimal loss in diversity through the course of optimization, provided the initial diversity is high. First experiments using restricted Boltzmann machines confirm this insight. As a simple consequence, IGO seems to provide, from information theory, an elegant way to spontaneously explore several valleys of a fitness landscape in a single run.


Contents

Introduction
1 Algorithm description
 1.1 The natural gradient on parameter space
 1.2 IGO: Information-geometric optimization
2 First properties of IGO
 2.1 Consistency of sampling
 2.2 Monotonicity: quantile improvement
 2.3 The IGO flow for exponential families
 2.4 Invariance properties
 2.5 Speed of the IGO flow
 2.6 Noisy objective function
 2.7 Implementation remarks
3 IGO, maximum likelihood and the cross-entropy method
4 CMA-ES, NES, EDAs and PBIL from the IGO framework
 4.1 IGO algorithm for Bernoulli measures and PBIL
 4.2 Multivariate normal distributions (Gaussians)
 4.3 Computing the IGO flow for some simple examples
5 Multimodal optimization using restricted Boltzmann machines
 5.1 IGO for restricted Boltzmann machines
 5.2 Experimental results
  5.2.1 Convergence
  5.2.2 Diversity
 5.3 Convergence to the continuous-time limit
6 Further discussion
Summary and conclusion

Introduction

In this article, we consider an objective function 𝑓 : 𝑋 β†’ R to be minimized over a search space 𝑋. No particular assumption on 𝑋 is needed: it may be discrete or continuous, finite or infinite. We adopt a standard scenario where we consider 𝑓 as a black box that delivers values 𝑓(π‘₯) for any desired input π‘₯ ∈ 𝑋. The objective of black-box optimization is to find solutions π‘₯ ∈ 𝑋 with small value 𝑓(π‘₯), using the least number of calls to the black box. In this context, we design a stochastic optimization method from sound theoretical principles.

We assume that we are given a family of probability distributions π‘ƒπœƒ on 𝑋 depending on a continuous multicomponent parameter πœƒ ∈ Θ. A basic example is to take 𝑋 = R𝑑 and to consider the family of all Gaussian distributions π‘ƒπœƒ on R𝑑, with πœƒ = (π‘š, 𝐢) the mean and covariance matrix. Another simple example is 𝑋 = {0, 1}𝑑 equipped with the family of Bernoulli measures, i.e. πœƒ = (πœƒπ‘–)_{1β©½π‘–β©½π‘‘} and

 π‘ƒπœƒ(π‘₯) = βˆπ‘– πœƒπ‘–^{π‘₯𝑖} (1 βˆ’ πœƒπ‘–)^{1βˆ’π‘₯𝑖} for π‘₯ = (π‘₯𝑖) ∈ 𝑋.

The parameters πœƒ of the family π‘ƒπœƒ belong to a space Θ assumed to be a smooth manifold.


From this setting, we build in a natural way an optimization method, the information-geometric optimization (IGO). At each time 𝑑, we maintain a value πœƒπ‘‘ such that π‘ƒπœƒπ‘‘ represents, loosely speaking, the current belief about where the smallest values of the function 𝑓 may lie. Over time, π‘ƒπœƒπ‘‘ evolves and is expected to concentrate around the minima of 𝑓. This general approach resembles the wide family of estimation of distribution algorithms (EDA) [LL02, BC95, PGL02]. However, we deviate somewhat from the common EDA reasoning, as explained in the following.

The IGO method takes the form of a gradient ascent on πœƒπ‘‘ in the parameter space Θ. We follow the gradient of a suitable transformation of 𝑓, based on the π‘ƒπœƒπ‘‘-quantiles of 𝑓. The gradient used for πœƒ is the natural gradient defined from the Fisher information metric [Rao45, Jef46, AN00], as is the case in various other optimization strategies, for instance so-called natural evolution strategies [WSPS08, SWSS09, GSS+10]. Thus, we extend the scope of optimization strategies based on this gradient to arbitrary search spaces.

The IGO method also has an equivalent description as an infinitesimal maximum likelihood update; this reveals a new property of the natural gradient. This also relates IGO to the cross-entropy method [dBKMR05] in some situations.

When we instantiate IGO using the family of Gaussian distributions on R𝑑, we naturally obtain variants of the well-known covariance matrix adaptation evolution strategy (CMA-ES) [HO01, HK04, JA06] and of natural evolution strategies. With Bernoulli measures on the discrete cube {0, 1}𝑑, we recover the well-known population based incremental learning (PBIL) [BC95, Bal94]; this derivation of PBIL as a natural gradient ascent appears to be new, and sheds some light on the common ground between continuous and discrete optimization.

From the IGO framework, it is immediate to build new optimization algorithms using more complex families of distributions than Gaussian or Bernoulli. As an illustration, distributions associated with restricted Boltzmann machines (RBMs) provide a new but natural algorithm for discrete optimization on {0, 1}𝑑, able to handle dependencies between the bits (see also [Ber02]). The probability distributions associated with RBMs are multimodal; combined with specific information-theoretic properties of IGO that guarantee minimal loss of diversity over time, this allows IGO to reach multiple optima at once very naturally, at least in a simple experimental setup.

Our method is built to achieve maximal invariance properties. First, it will be invariant under reparametrization of the family of distributions π‘ƒπœƒ, that is, at least for infinitesimally small steps, it only depends on π‘ƒπœƒ and not on the way we write the parameter πœƒ. (For instance, for Gaussian measures it should not matter whether we use the covariance matrix or its inverse or a Cholesky factor as the parameter.) This limits the influence of encoding choices on the behavior of the algorithm. Second, it will be invariant under a change of coordinates in the search space 𝑋, provided that this change of coordinates globally preserves our family of distributions π‘ƒπœƒ. (For Gaussian distributions on R𝑑, this includes all affine changes of coordinates.) This means that the algorithm, apart from initialization, does not depend on the precise way the data is presented. Last, the algorithm will be invariant under applying an increasing function to 𝑓, so that it is indifferent whether we minimize e.g. 𝑓, 𝑓³ or 𝑓 Γ— |𝑓|^{βˆ’2/3}. This way some non-convex or non-smooth functions can be as β€œeasily” optimized as convex ones. Contrary to previous formulations using natural gradients [WSPS08, GSS+10, ANOK10], this invariance under increasing transformation of the objective function is achieved from the start.

Invariance under 𝑋-reparametrization has been, we believe, one of the keys to the success of the CMA-ES algorithm, which derives from a particular case of ours. Invariance under πœƒ-reparametrization is the main idea behind information geometry [AN00]. Invariance under 𝑓-transformation is not uncommon, e.g., for evolution strategies [Sch95] or pattern search methods [HJ61, Tor97, NM65]; however it is not always recognized as an attractive feature. Such invariance properties mean that we deal with intrinsic properties of the objects themselves, and not with the way we encode them as collections of numbers in R𝑑. It also means, most importantly, that we make a minimal number of arbitrary choices.

In Section 1, we define the IGO flow and the IGO algorithm. We begin with standard facts about the definition and basic properties of the natural gradient, and its connection with Kullback–Leibler divergence and diversity. We then proceed to the detailed description of our algorithm.

In Section 2, we state some first mathematical properties of IGO. These include monotone improvement of the objective function, invariance properties, the form of IGO for exponential families of probability distributions, and the case of noisy objective functions.

In Section 3, we explain the theoretical relationships between IGO, maximum likelihood estimates and the cross-entropy method. In particular, IGO is uniquely characterized by a weighted log-likelihood maximization property.

In Section 4, we derive several well-known optimization algorithms from the IGO framework. These include PBIL, versions of CMA-ES and other Gaussian evolutionary algorithms such as EMNA. This also illustrates how a large step size results in algorithms that differ more and more from the continuous-time IGO flow.

In Section 5, we illustrate how IGO can be used to design new optimization algorithms. As a proof of concept, we derive the IGO algorithm associated with restricted Boltzmann machines for discrete optimization, allowing for multimodal optimization. We perform a preliminary experimental study of the specific influence of the Fisher information matrix on the performance of the algorithm and on the diversity of the optima obtained.

In Section 6, we discuss related work, and in particular, IGO’s relationship with and differences from various other optimization algorithms such as natural evolution strategies or the cross-entropy method. We also sum up the main contributions of the paper and the design philosophy of IGO.

1 Algorithm description

We now present the outline of our algorithms. Each step is described in more detail in the sections below.

Our method can be seen as an estimation of distribution algorithm: at each time 𝑑, we maintain a probability distribution π‘ƒπœƒπ‘‘ on the search space 𝑋, where πœƒπ‘‘ ∈ Θ. The value of πœƒπ‘‘ will evolve so that, over time, π‘ƒπœƒπ‘‘ gives more weight to points π‘₯ with better values of the function 𝑓(π‘₯) to optimize.

A straightforward way to proceed is to transfer 𝑓 from π‘₯-space to πœƒ-space: define a function 𝐹(πœƒ) as the π‘ƒπœƒ-average of 𝑓 and then do a gradient descent for 𝐹(πœƒ) in the space Θ [Ber00]. This way, πœƒ will converge to a point such that π‘ƒπœƒ yields a good average value of 𝑓. We depart from this approach in two ways:


βˆ™ At each time, we replace 𝑓 with an adaptive transformation of 𝑓 representing how good or bad observed values of 𝑓 are relative to other observations. This provides invariance under all monotone transformations of 𝑓.

βˆ™ Instead of the vanilla gradient for πœƒ, we use the so-called natural gradient given by the Fisher information matrix. This reflects the intrinsic geometry of the space of probability distributions, as introduced by Rao and Jeffreys [Rao45, Jef46] and later elaborated upon by Amari and others [AN00]. This provides invariance under reparametrization of πœƒ and, importantly, minimizes the change of diversity of π‘ƒπœƒ.

The algorithm is constructed in two steps: we first give an β€œideal” version, namely, a version in which time 𝑑 is continuous so that the evolution of πœƒπ‘‘ is given by an ordinary differential equation in Θ. Second, the actual algorithm is a time discretization using a finite time step and Monte Carlo sampling instead of exact π‘ƒπœƒ-averages.

1.1 The natural gradient on parameter space

About gradients and the shortest path uphill. Let 𝑔 be a smooth function from R𝑑 to R, to be maximized. We first present the interpretation of gradient ascent as β€œthe shortest path uphill”.

Let 𝑦 ∈ R𝑑. Define the vector 𝑧 by

 𝑧 = lim_{πœ€β†’0} arg max_{𝑧 : ‖𝑧‖⩽1} 𝑔(𝑦 + πœ€π‘§).  (1)

Then one can check that 𝑧 is the normalized gradient of 𝑔 at 𝑦: 𝑧𝑖 = (πœ•π‘”/πœ•π‘¦π‘–)/β€–πœ•π‘”/πœ•π‘¦β€–. (This holds only at points 𝑦 where the gradient of 𝑔 does not vanish.)

This shows that, for small 𝛿𝑑, the well-known gradient ascent of 𝑔 given by

 𝑦𝑖^{𝑑+𝛿𝑑} = 𝑦𝑖^{𝑑} + 𝛿𝑑 πœ•π‘”/πœ•π‘¦π‘–

realizes the largest increase in the value of 𝑔, for a given step size ‖𝑦^{𝑑+𝛿𝑑} βˆ’ 𝑦^{𝑑}β€–.

The relation (1) depends on the choice of a norm β€–Β·β€– (the gradient of 𝑔 is given by πœ•π‘”/πœ•π‘¦π‘– only in an orthonormal basis). If we use, instead of the standard metric ‖𝑦 βˆ’ 𝑦′‖ = √(βˆ‘(𝑦𝑖 βˆ’ 𝑦′𝑖)Β²) on R𝑑, a metric ‖𝑦 βˆ’ 𝑦′‖_𝐴 = √(βˆ‘ 𝐴𝑖𝑗(𝑦𝑖 βˆ’ 𝑦′𝑖)(𝑦𝑗 βˆ’ 𝑦′𝑗)) defined by a positive definite matrix 𝐴𝑖𝑗, then the gradient of 𝑔 with respect to this metric is given by βˆ‘π‘— (𝐴⁻¹)𝑖𝑗 πœ•π‘”/πœ•π‘¦π‘—. (This follows from the textbook definition of gradients by 𝑔(𝑦 + πœ€π‘§) = 𝑔(𝑦) + πœ€βŸ¨βˆ‡π‘”, π‘§βŸ©_𝐴 + 𝑂(πœ€Β²) with ⟨·, ·⟩_𝐴 the scalar product associated with the matrix 𝐴𝑖𝑗 [Sch92].)

We can write the analogue of (1) using the 𝐴-norm. We get that the gradient ascent associated with metric 𝐴, given by

 𝑦^{𝑑+𝛿𝑑} = 𝑦^{𝑑} + 𝛿𝑑 𝐴⁻¹ πœ•π‘”/πœ•π‘¦,

for small 𝛿𝑑, maximizes the increment of 𝑔 for a given 𝐴-distance ‖𝑦^{𝑑+𝛿𝑑} βˆ’ 𝑦^{𝑑}β€–_𝐴; it realizes the steepest 𝐴-ascent. Maybe this viewpoint clarifies the relationship between gradient and metric: this steepest ascent property can actually be used as a definition of gradients.
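As a quick numerical sanity check (a toy example of ours, with a hypothetical linear objective and a diagonal metric 𝐴), the direction 𝐴⁻¹ πœ•π‘”/πœ•π‘¦, normalized to unit 𝐴-norm, beats every other unit-𝐴-norm direction at first order:

```python
import math
import random

grad = [3.0, 1.0]   # assumed vanilla gradient dg/dy at the current point
A = [4.0, 1.0]      # diagonal positive definite metric A = diag(4, 1)

# Gradient with respect to the A-metric: (A^-1 grad)_i = grad_i / A_ii here.
nat = [g / a for g, a in zip(grad, A)]

def a_norm(z):
    return math.sqrt(sum(a * zi * zi for a, zi in zip(A, z)))

def increase(z):
    """First-order increase of g along the step z: <grad, z>."""
    return sum(g * zi for g, zi in zip(grad, z))

best = [zi / a_norm(nat) for zi in nat]  # A-gradient direction, unit A-norm

rng = random.Random(1)
for _ in range(10000):
    z = [rng.gauss(0, 1), rng.gauss(0, 1)]
    z = [zi / a_norm(z) for zi in z]     # random direction of unit A-norm
    assert increase(z) <= increase(best) + 1e-9
print("A^-1 grad is the steepest direction in the A-metric")
```

This is just the Cauchy–Schwarz inequality in the 𝐴-scalar product, checked by sampling.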

In our setting we want to use a gradient ascent in the parameter space Θ of our distributions π‘ƒπœƒ. The metric β€–πœƒ βˆ’ πœƒβ€²β€– = √(βˆ‘(πœƒπ‘– βˆ’ πœƒβ€²π‘–)Β²) clearly depends on the choice of parametrization πœƒ, and thus is not intrinsic. Therefore, we use a metric depending on πœƒ only through the distributions π‘ƒπœƒ, as follows.


Fisher information and the natural gradient on parameter space. Let πœƒ, πœƒβ€² ∈ Θ be two values of the distribution parameter. The Kullback–Leibler divergence between π‘ƒπœƒβ€² and π‘ƒπœƒ is defined [Kul97] as

 KL(π‘ƒπœƒβ€² β€– π‘ƒπœƒ) = ∫_π‘₯ ln(π‘ƒπœƒβ€²(π‘₯)/π‘ƒπœƒ(π‘₯)) π‘ƒπœƒβ€²(dπ‘₯).

When πœƒβ€² = πœƒ + π›Ώπœƒ is close to πœƒ, under mild smoothness assumptions we can expand the Kullback–Leibler divergence at second order in π›Ώπœƒ. This expansion defines the Fisher information matrix 𝐼 at πœƒ [Kul97]:

 KL(π‘ƒπœƒ+π›Ώπœƒ β€– π‘ƒπœƒ) = (1/2) βˆ‘π‘–π‘— 𝐼𝑖𝑗(πœƒ) π›Ώπœƒπ‘– π›Ώπœƒπ‘— + 𝑂(π›ΏπœƒΒ³).

An equivalent definition of the Fisher information matrix is by the usual formulas [CT06]

 𝐼𝑖𝑗(πœƒ) = ∫_π‘₯ (πœ• ln π‘ƒπœƒ(π‘₯)/πœ•πœƒπ‘–) (πœ• ln π‘ƒπœƒ(π‘₯)/πœ•πœƒπ‘—) dπ‘ƒπœƒ(π‘₯) = βˆ’βˆ«_π‘₯ (πœ•Β² ln π‘ƒπœƒ(π‘₯)/πœ•πœƒπ‘– πœ•πœƒπ‘—) dπ‘ƒπœƒ(π‘₯).

The Fisher information matrix defines a (Riemannian) metric on Θ: the distance, in this metric, between two very close values of πœƒ is given by the square root of twice the Kullback–Leibler divergence. Since the Kullback–Leibler divergence depends only on π‘ƒπœƒ and not on the parametrization of πœƒ, this metric is intrinsic.
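The defining second-order expansion of the Kullback–Leibler divergence is easy to check numerically. A sketch for product Bernoulli measures, whose Fisher matrix is diagonal with 𝐼𝑖𝑖 = 1/(πœƒπ‘–(1 βˆ’ πœƒπ‘–)) (a standard fact; the code is our own illustration):

```python
import math

def kl_bernoulli(theta_p, theta):
    """KL(P_theta' || P_theta) for product Bernoulli measures on {0,1}^d."""
    return sum(a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))
               for a, b in zip(theta_p, theta))

def fisher_diag(theta):
    # Fisher matrix of the Bernoulli family is diagonal: I_ii = 1/(theta_i (1 - theta_i)).
    return [1.0 / (t * (1.0 - t)) for t in theta]

theta = [0.3, 0.6]
dtheta = [1e-3, -2e-3]
kl = kl_bernoulli([t + d for t, d in zip(theta, dtheta)], theta)
quad = 0.5 * sum(I * d * d for I, d in zip(fisher_diag(theta), dtheta))
print(kl, quad)  # the two agree up to the O(dtheta^3) remainder
assert abs(kl - quad) < 1e-7
```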

If 𝑔 : Θ β†’ R is a smooth function on the parameter space, its natural gradient at πœƒ is defined in accordance with the Fisher metric as

 (βˆ‡πœƒ 𝑔)𝑖 = βˆ‘π‘— (𝐼⁻¹(πœƒ))𝑖𝑗 πœ•π‘”(πœƒ)/πœ•πœƒπ‘—,  or more synthetically,  βˆ‡πœƒ 𝑔 = 𝐼⁻¹ πœ•π‘”/πœ•πœƒ.

From now on, we will use βˆ‡πœƒ to denote the natural gradient and πœ•/πœ•πœƒ to denote the vanilla gradient.

By construction, the natural gradient descent is intrinsic: it does not depend on the chosen parametrization πœƒ of π‘ƒπœƒ, so that it makes sense to speak of the natural gradient ascent of a function 𝑔(π‘ƒπœƒ).

Given that the Fisher metric comes from the Kullback–Leibler divergence, the β€œshortest path uphill” property of gradients mentioned above translates as follows (see also [Ama98, Theorem 1]):

Proposition 1. The natural gradient ascent points in the direction π›Ώπœƒ achieving the largest change of the objective function, for a given distance between π‘ƒπœƒ and π‘ƒπœƒ+π›Ώπœƒ in Kullback–Leibler divergence. More precisely, let 𝑔 be a smooth function on the parameter space Θ. Let πœƒ ∈ Θ be a point where βˆ‡π‘”(πœƒ) does not vanish. Then

 βˆ‡π‘”(πœƒ)/β€–βˆ‡π‘”(πœƒ)β€– = lim_{πœ€β†’0} (1/πœ€) arg max_{π›Ώπœƒ : KL(π‘ƒπœƒ+π›Ώπœƒ β€– π‘ƒπœƒ) β©½ πœ€Β²/2} 𝑔(πœƒ + π›Ώπœƒ).

Here we have implicitly assumed that the parameter space Θ is non-degenerate and proper (that is, no two points πœƒ ∈ Θ define the same probability distribution, and the mapping π‘ƒπœƒ ↦ πœƒ is continuous).


Why use the Fisher metric gradient for optimization? Relationship to diversity. The first reason for using the natural gradient is its reparametrization invariance, which makes it the only gradient available in a general abstract setting [AN00]. Practically, this invariance also limits the influence of encoding choices on the behavior of the algorithm. More prosaically, the Fisher matrix can also be seen as an adaptive learning rate for the different components of the parameter vector πœƒ: components 𝑖 with a high impact on π‘ƒπœƒ will be updated more cautiously.
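The adaptive-learning-rate reading is transparent for Bernoulli measures, where 𝐼 is diagonal with 𝐼𝑖𝑖 = 1/(πœƒπ‘–(1 βˆ’ πœƒπ‘–)): the natural gradient simply rescales each vanilla partial derivative by πœƒπ‘–(1 βˆ’ πœƒπ‘–), which is small precisely where π‘ƒπœƒ is most sensitive to πœƒπ‘– (a sketch of ours):

```python
def natural_gradient(theta, vanilla_grad):
    # For product Bernoulli measures the Fisher matrix is diagonal with
    # I_ii = 1/(theta_i (1 - theta_i)), so applying I^-1 rescales each component:
    return [t * (1.0 - t) * g for t, g in zip(theta, vanilla_grad)]

theta = [0.5, 0.99]
vanilla = [1.0, 1.0]  # same vanilla derivative in both coordinates
print(natural_gradient(theta, vanilla))
# the theta = 0.99 component (an almost deterministic bit, high impact on P_theta)
# receives a step about 25 times smaller than the theta = 0.5 component
```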

Another advantage comes from the relationship with Kullback–Leibler distance in view of the β€œshortest path uphill” (see also [Ama98]). To minimize the value of some function 𝑔(πœƒ) defined on the parameter space Θ, the naive approach follows a gradient descent for 𝑔 using the β€œvanilla” gradient

 πœƒπ‘–^{𝑑+𝛿𝑑} = πœƒπ‘–^{𝑑} + 𝛿𝑑 πœ•π‘”/πœ•πœƒπ‘–

and, as explained above, this maximizes the increment of 𝑔 for a given increment β€–πœƒ^{𝑑+𝛿𝑑} βˆ’ πœƒ^{𝑑}β€–. On the other hand, the Fisher gradient

 πœƒπ‘–^{𝑑+𝛿𝑑} = πœƒπ‘–^{𝑑} + 𝛿𝑑 (𝐼⁻¹ πœ•π‘”/πœ•πœƒ)𝑖

maximizes the increment of 𝑔 for a given Kullback–Leibler distance KL(π‘ƒπœƒ^{𝑑+𝛿𝑑} β€– π‘ƒπœƒπ‘‘).

In particular, if we choose an initial value πœƒ0 such that π‘ƒπœƒ0 covers a wide portion of the space 𝑋 uniformly, the Kullback–Leibler divergence between π‘ƒπœƒπ‘‘ and π‘ƒπœƒ0 measures the loss of diversity of π‘ƒπœƒπ‘‘. So the natural gradient descent is a way to optimize the function 𝑔 with minimal loss of diversity, provided the initial diversity is large. On the other hand the vanilla gradient descent optimizes 𝑔 with minimal change in the numerical values of the parameter πœƒ, which is of little interest.

So arguably this method realizes the best trade-off between optimization and loss of diversity. (Though, as can be seen from the detailed algorithm description below, maximization of diversity occurs only greedily at each step, and so there is no guarantee that after a given time, IGO will provide the highest possible diversity for a given objective function value.)

An experimental confirmation of the positive influence of the Fisher matrix on diversity is given in Section 5 below. This may also provide a theoretical explanation of the good performance of CMA-ES.

1.2 IGO: Information-geometric optimization

Quantile rewriting of 𝑓. Our original problem is to minimize a function 𝑓 : 𝑋 β†’ R. A simple way to turn 𝑓 into a function on Θ is to use the expected value βˆ’E_{π‘ƒπœƒ} 𝑓 [Ber00, WSPS08], but expected values can be unduly influenced by extreme values and using them is rather unstable [Whi89]; moreover βˆ’E_{π‘ƒπœƒ} 𝑓 is not invariant under increasing transformation of 𝑓 (this invariance implies we can only compare 𝑓-values, not add them).

Instead, we take an adaptive, quantile-based approach by first replacing the function 𝑓 with a monotone rewriting π‘Šπœƒ^𝑓 and then following the gradient of E_{π‘ƒπœƒ} π‘Šπœƒ^𝑓. A suitable choice of π‘Šπœƒ^𝑓 allows us to control the range of the resulting values and achieves the desired invariance. Because the rewriting π‘Šπœƒ^𝑓 depends on πœƒ, it might be viewed as an adaptive 𝑓-transformation.

The goal is that if 𝑓(π‘₯) is β€œsmall” then π‘Š π‘“πœƒ (π‘₯) ∈ R is β€œlarge” and vice

versa, and that π‘Š π‘“πœƒ remains invariant under monotone transformations of

𝑓 . The meaning of β€œsmall” or β€œlarge” depends on πœƒ ∈ Θ and is taken withrespect to typical values of 𝑓 under the current distribution π‘ƒπœƒ. This is

7

Page 8: Information-GeometricOptimizationAlgorithms ...repmus.ircam.fr/_media/brillouin/ressources/information...Information-GeometricOptimizationAlgorithms: AUnifyingPictureviaInvariancePrinciples

measured by the π‘ƒπœƒ-quantile in which the value of 𝑓(π‘₯) lies. We write thelower and upper π‘ƒπœƒ-𝑓 -quantiles of π‘₯ ∈ 𝑋 as

π‘žβˆ’πœƒ (π‘₯) = Prπ‘₯β€²βˆΌπ‘ƒπœƒ

(𝑓(π‘₯β€²) < 𝑓(π‘₯))π‘ž+

πœƒ (π‘₯) = Prπ‘₯β€²βˆΌπ‘ƒπœƒ(𝑓(π‘₯β€²) 6 𝑓(π‘₯)) .

(2)

These quantile functions reflect the probability to sample a better value than𝑓(π‘₯). They are monotone in 𝑓 (if 𝑓(π‘₯1) 6 𝑓(π‘₯2) then π‘žΒ±

πœƒ (π‘₯1) 6 π‘žΒ±πœƒ (π‘₯2)) and

invariant under increasing transformations of 𝑓 .Given π‘ž ∈ [0; 1], we now choose a non-increasing function 𝑀 : [0; 1]β†’ R

(fixed once and for all). A typical choice for 𝑀 is 𝑀(π‘ž) = 1π‘ž6π‘ž0 for somefixed value π‘ž0, the selection quantile. The transform π‘Š 𝑓

πœƒ (π‘₯) is defined as afunction of the π‘ƒπœƒ-𝑓 -quantile of π‘₯ as

π‘Š π‘“πœƒ (π‘₯) =

βŽ§βŽ¨βŽ©π‘€(π‘ž+πœƒ (π‘₯)) if π‘ž+

πœƒ (π‘₯) = π‘žβˆ’πœƒ (π‘₯),

1π‘ž+

πœƒ(π‘₯)βˆ’π‘žβˆ’

πœƒ(π‘₯)∫ π‘ž=π‘ž+

πœƒ(π‘₯)

π‘ž=π‘žβˆ’πœƒ

(π‘₯) 𝑀(π‘ž) dπ‘ž otherwise.(3)

As desired, the definition of π‘Š π‘“πœƒ is invariant under an increasing transforma-

tion of 𝑓 . For instance, the π‘ƒπœƒ-median of 𝑓 gets remapped to 𝑀(12).
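This definition can be checked by exhaustive enumeration on a tiny search space. The following sketch (our own; 𝑀(π‘ž) = 1_{π‘žβ©½1/2} and an arbitrary πœƒ) computes π‘Šπœƒ^𝑓 exactly on {0, 1}Β² and confirms that its π‘ƒπœƒ-average equals the integral of 𝑀 over [0; 1], here 1/2:

```python
from itertools import product

theta = [0.3, 0.7]  # an arbitrary Bernoulli parameter, for illustration

def p(x):
    out = 1.0
    for t, xi in zip(theta, x):
        out *= t if xi else 1.0 - t
    return out

def f(x):
    return sum(x)  # toy objective: number of ones

points = list(product([0, 1], repeat=2))

def W(x):
    """Exact W_theta^f(x) from (3), for the step weight w(q) = 1_{q <= 1/2}."""
    qm = sum(p(y) for y in points if f(y) < f(x))   # lower quantile q_theta^-
    qp = sum(p(y) for y in points if f(y) <= f(x))  # upper quantile q_theta^+
    # integral of w over [qm, qp], divided by its length (qp > qm always holds here)
    integral = max(0.0, min(qp, 0.5) - min(qm, 0.5))
    return integral / (qp - qm)

expected = sum(p(x) * W(x) for x in points)
print(expected)  # 0.5, independently of f and theta
```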

Note that Eπ‘ƒπœƒπ‘Š 𝑓

πœƒ =∫ 1

0 𝑀 is independent of 𝑓 and πœƒ: indeed, by definition,the quantile of a random point under π‘ƒπœƒ is uniformly distributed in [0; 1]. Inthe following, our objective will be to maximize the expected value of π‘Š 𝑓

πœƒπ‘‘ inπœƒ, that is, to maximize

Eπ‘ƒπœƒπ‘Š 𝑓

πœƒπ‘‘ =∫

π‘Š π‘“πœƒπ‘‘(π‘₯) π‘ƒπœƒ(dπ‘₯) (4)

over πœƒ, where πœƒπ‘‘ is fixed at a given step but will adapt over time.Importantly, π‘Š 𝑓

πœƒ (π‘₯) can be estimated in practice: indeed, the quantilesPrπ‘₯β€²βˆΌπ‘ƒπœƒ

(𝑓(π‘₯β€²) < 𝑓(π‘₯)) can be estimated by taking samples of π‘ƒπœƒ and orderingthe samples according to the value of 𝑓 (see below). The estimate remainsinvariant under increasing 𝑓 -transformations.

The IGO gradient flow. At the most abstract level, IGO is a continuous-time gradient flow in the parameter space Θ, which we define now. In practice, discrete time steps (a.k.a. iterations) are used, and π‘ƒπœƒ-integrals are approximated through sampling, as described in the next section.

Let πœƒπ‘‘ be the current value of the parameter at time 𝑑, and let 𝛿𝑑 β‰ͺ 1.We define πœƒπ‘‘+𝛿𝑑 in such a way as to increase the π‘ƒπœƒ-weight of points where 𝑓is small, while not going too far from π‘ƒπœƒπ‘‘ in Kullback–Leibler divergence. Weuse the adaptive weights π‘Š 𝑓

πœƒπ‘‘ as a way to measure which points have large orsmall values. In accordance with (4), this suggests taking the gradient ascent

πœƒπ‘‘+𝛿𝑑 = πœƒπ‘‘ + 𝛿𝑑 βˆ‡πœƒ

βˆ«π‘Š 𝑓

πœƒπ‘‘(π‘₯) π‘ƒπœƒ(dπ‘₯) (5)

where the natural gradient is suggested by Proposition 1.Note again that we use π‘Š 𝑓

πœƒπ‘‘ and not π‘Š π‘“πœƒ in the integral. So the gradientβˆ‡πœƒ does not act on the adaptive objective π‘Š 𝑓

πœƒπ‘‘ . If we used π‘Š π‘“πœƒ instead, we

would face a paradox: right after a move, previously good points do notseem so good any more since the distribution has improved. More precisely,∫

π‘Š π‘“πœƒ (π‘₯) π‘ƒπœƒ(dπ‘₯) is constant and always equal to the average weight

∫ 10 𝑀,

and so the gradient would always vanish.Using the log-likelihood trick βˆ‡π‘ƒπœƒ = π‘ƒπœƒ

βˆ‡ln π‘ƒπœƒ (assuming π‘ƒπœƒ is smooth),we get an equivalent expression of the update above as an integral underthe current distribution π‘ƒπœƒπ‘‘ ; this is important for practical implementation.This leads to the following definition.


Definition 2 (IGO flow). The IGO flow is the set of continuous-time trajectories in space Θ, defined by the differential equation

 dπœƒπ‘‘/d𝑑 = βˆ‡πœƒ ∫ π‘Šπœƒπ‘‘^𝑓(π‘₯) π‘ƒπœƒ(dπ‘₯)  (6)
  = ∫ π‘Šπœƒπ‘‘^𝑓(π‘₯) βˆ‡πœƒ ln π‘ƒπœƒ(π‘₯) π‘ƒπœƒπ‘‘(dπ‘₯)  (7)
  = 𝐼⁻¹(πœƒπ‘‘) ∫ π‘Šπœƒπ‘‘^𝑓(π‘₯) (πœ• ln π‘ƒπœƒ(π‘₯)/πœ•πœƒ) π‘ƒπœƒπ‘‘(dπ‘₯),  (8)

where the gradients are taken at point πœƒ = πœƒπ‘‘, and 𝐼 is the Fisher information matrix.

Natural evolution strategies (NES, [WSPS08, GSS+10, SWSS09]) feature a related gradient descent with 𝑓(π‘₯) instead of π‘Šπœƒπ‘‘^𝑓(π‘₯). The associated flow would read

 dπœƒπ‘‘/d𝑑 = βˆ’βˆ‡πœƒ ∫ 𝑓(π‘₯) π‘ƒπœƒ(dπ‘₯),  (9)

where the gradient is taken at πœƒπ‘‘ (in the sequel, when not explicitly stated, gradients in πœƒ are taken at πœƒ = πœƒπ‘‘). However, in the end NESs always implement algorithms using sample quantiles, as if derived from the gradient ascent of π‘Šπœƒπ‘‘^𝑓(π‘₯).

The update (7) is a weighted average of β€œintrinsic moves” increasing the log-likelihood of some points. We can slightly rearrange the update as

 dπœƒπ‘‘/d𝑑 = ∫ π‘Šπœƒπ‘‘^𝑓(π‘₯) βˆ‡πœƒ ln π‘ƒπœƒ(π‘₯) π‘ƒπœƒπ‘‘(dπ‘₯)  (10)
  = βˆ‡πœƒ ∫ π‘Šπœƒπ‘‘^𝑓(π‘₯) ln π‘ƒπœƒ(π‘₯) π‘ƒπœƒπ‘‘(dπ‘₯),  (11)

where in (10) the factor π‘Šπœƒπ‘‘^𝑓(π‘₯) is a preference weight, βˆ‡πœƒ ln π‘ƒπœƒ(π‘₯) is an intrinsic move to reinforce π‘₯, and π‘ƒπœƒπ‘‘(dπ‘₯) is the current sample distribution; in (11) the integrand is a weighted log-likelihood. This provides an interpretation of the IGO gradient flow as a gradient ascent optimization of the weighted log-likelihood of the β€œgood points” of the current distribution. In a precise sense, IGO is in fact the β€œbest” way to increase this log-likelihood (Theorem 13).

For exponential families of probability distributions, we will see later that the IGO flow rewrites as a nice derivative-free expression (18).

The IGO algorithm. Time discretization and sampling. The above is a mathematically well-defined continuous-time flow in the parameter space. Its practical implementation involves three approximations depending on two parameters 𝑁 and 𝛿𝑑:

βˆ™ the integral under π‘ƒπœƒπ‘‘ is approximated using 𝑁 samples taken from π‘ƒπœƒπ‘‘;

βˆ™ the value π‘Šπœƒπ‘‘^𝑓 is approximated for each sample taken from π‘ƒπœƒπ‘‘;

βˆ™ the time derivative dπœƒπ‘‘/d𝑑 is approximated by a 𝛿𝑑 time increment.

We also assume that the Fisher information matrix 𝐼(πœƒ) and πœ• ln π‘ƒπœƒ(π‘₯)/πœ•πœƒ can be computed (see discussion below if 𝐼(πœƒ) is unknown).

At each step, we pick 𝑁 samples π‘₯1, . . . , π‘₯𝑁 under π‘ƒπœƒπ‘‘. To approximate the quantiles, we rank the samples according to the value of 𝑓. Define rk(π‘₯𝑖) = #{𝑗, 𝑓(π‘₯𝑗) < 𝑓(π‘₯𝑖)} and let the estimated weight of sample π‘₯𝑖 be

 ŵ𝑖 = (1/𝑁) 𝑀((rk(π‘₯𝑖) + 1/2)/𝑁),  (12)


using the weighting scheme function 𝑀 introduced above. (This is assuming there are no ties in our sample; in case several sample points have the same value of 𝑓, we define 𝑀̂𝑖 by averaging the above over all possible rankings of the ties¹.)
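In code terms, the ranking-and-weighting step (12) can be sketched as follows; the truncation scheme 𝑀(𝑒) = 1_{π‘’β‰€π‘ž}/π‘ž is only an illustrative choice, ties are ignored, and the function name is ours:

```python
import numpy as np

def igo_weights(fvals, w):
    """Rank-based IGO weights: w_hat_i = (1/N) * w((rk(x_i) + 1/2) / N),
    with rk(x_i) = #{j : f(x_j) < f(x_i)}.  Ties are ignored here; a full
    implementation would average over the rankings of tied points."""
    fvals = np.asarray(fvals, dtype=float)
    N = len(fvals)
    rk = (fvals[:, None] > fvals[None, :]).sum(axis=1)  # rk(x_i)
    return w((rk + 0.5) / N) / N

# Illustrative scheme: truncation selection, w(u) = (1/q) * 1_{u <= q}.
q = 0.5
weights = igo_weights([3.0, 1.0, 2.0, 4.0], lambda u: (u <= q) / q)
# The two best points (f = 1.0 and f = 2.0) share all the weight.
```

Since ∫ 𝑀 = 1 for this scheme, the weights sum to about 1 over the sample.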

Then we can approximate the IGO flow as follows.

Definition 3 (IGO algorithm). The IGO algorithm with sample size 𝑁 and step size 𝛿𝑑 is the following update rule for the parameter πœƒπ‘‘. At each step, 𝑁 sample points π‘₯1, . . . , π‘₯𝑁 are picked according to the distribution π‘ƒπœƒπ‘‘. The parameter is updated according to

πœƒπ‘‘+𝛿𝑑 = πœƒπ‘‘ + π›Ώπ‘‘π‘βˆ‘

𝑖=1π‘€π‘–βˆ‡πœƒ ln π‘ƒπœƒ(π‘₯𝑖)

πœƒ=πœƒπ‘‘

(13)

= πœƒπ‘‘ + 𝛿𝑑 πΌβˆ’1(πœƒπ‘‘)π‘βˆ‘

𝑖=1𝑀𝑖

πœ• ln π‘ƒπœƒ(π‘₯𝑖)πœ•πœƒ

πœƒ=πœƒπ‘‘

(14)

where 𝑀̂𝑖 is the weight (12) obtained from the ranked values of the objective function 𝑓.

Equivalently one can fix the weights 𝑀𝑖 = (1/𝑁) 𝑀((𝑖 βˆ’ 1/2)/𝑁) once and for all and rewrite the update as

    πœƒπ‘‘+𝛿𝑑 = πœƒπ‘‘ + 𝛿𝑑 𝐼^{βˆ’1}(πœƒπ‘‘) βˆ‘_{𝑖=1}^{𝑁} 𝑀𝑖 πœ• ln π‘ƒπœƒ(π‘₯𝑖:𝑁)/πœ•πœƒ |_{πœƒ=πœƒπ‘‘}   (15)

where π‘₯𝑖:𝑁 denotes the 𝑖th sampled point ranked according to 𝑓, i.e. 𝑓(π‘₯1:𝑁) < . . . < 𝑓(π‘₯𝑁:𝑁) (assuming again there are no ties). Note that {π‘₯𝑖:𝑁} = {π‘₯𝑖} and {𝑀̂𝑖} = {𝑀𝑖}.

As will be discussed in Section 4, this update applied to multivariate normal distributions or Bernoulli measures allows one to neatly recover versions of some well-established algorithms, in particular CMA-ES and PBIL. Actually, in the Gaussian context, updates of the form (14) have already been introduced [GSS+10, ANOK10], though not formally derived from a continuous-time flow with quantiles.

When 𝑁 β†’ ∞, the IGO algorithm using samples approximates the continuous-time IGO gradient flow, see Theorem 4 below. Indeed, the IGO algorithm, with 𝑁 = ∞, is simply the Euler approximation scheme for the ordinary differential equation defining the IGO flow (6). The latter result thus provides a sound mathematical basis for currently used rank-based updates.

IGO flow versus IGO algorithms. The IGO flow (6) is a well-defined continuous-time set of trajectories in the space of probability distributions π‘ƒπœƒ, depending only on the objective function 𝑓 and the chosen family of distributions. It does not depend on the chosen parametrization for πœƒ (Proposition 8).

On the other hand, there are several IGO algorithms associated with this flow. Each IGO algorithm approximates the IGO flow in a slightly different

ΒΉA mathematically neater but less intuitive version would be

    𝑀̂𝑖 = (1/(rk⁺(π‘₯𝑖) βˆ’ rk⁻(π‘₯𝑖))) ∫_{𝑒=rk⁻(π‘₯𝑖)/𝑁}^{𝑒=rk⁺(π‘₯𝑖)/𝑁} 𝑀(𝑒) d𝑒

with rk⁻(π‘₯𝑖) = #{𝑗, 𝑓(π‘₯𝑗) < 𝑓(π‘₯𝑖)} and rk⁺(π‘₯𝑖) = #{𝑗, 𝑓(π‘₯𝑗) ≀ 𝑓(π‘₯𝑖)}.


way. An IGO algorithm depends on three further choices: a sample size 𝑁, a time discretization step size 𝛿𝑑, and a choice of parametrization for πœƒ in which to implement (14).

If 𝛿𝑑 is small enough, and 𝑁 large enough, the influence of the parametrization πœƒ disappears and all IGO algorithms are approximations of the β€œideal” IGO flow trajectory. However, the larger 𝛿𝑑, the poorer the approximation gets.

So for large 𝛿𝑑, different IGO algorithms for the same IGO flow may exhibit different behaviors. We will see an instance of this phenomenon for Gaussian distributions: both CMA-ES and the maximum likelihood update (EMNA) can be seen as IGO algorithms, but the latter with 𝛿𝑑 = 1 is known to exhibit premature loss of diversity (Section 4.2).

Still, two IGO algorithms for the same IGO flow will differ less from each other than from a non-IGO algorithm: at each step the difference is only 𝑂(𝛿𝑑²) (Section 2.4). On the other hand, for instance, the difference between an IGO algorithm and the vanilla gradient ascent is, generally, not smaller than 𝑂(𝛿𝑑) at each step, i.e. roughly as big as the steps themselves.

Unknown Fisher matrix. The algorithm presented so far assumes that the Fisher matrix 𝐼(πœƒ) is known as a function of πœƒ. This is the case for Gaussian distributions in CMA-ES and for Bernoulli distributions. However, for restricted Boltzmann machines as considered below, no analytical form is known. Yet, provided the quantity πœ• ln π‘ƒπœƒ(π‘₯)/πœ•πœƒ can be computed or approximated, it is possible to approximate the integral

    𝐼𝑖𝑗(πœƒ) = ∫ (πœ• ln π‘ƒπœƒ(π‘₯)/πœ•πœƒπ‘–) (πœ• ln π‘ƒπœƒ(π‘₯)/πœ•πœƒπ‘—) π‘ƒπœƒ(dπ‘₯)

using π‘ƒπœƒ-Monte Carlo samples for π‘₯. These samples may or may not be the same as those used in the IGO update (14): in particular, it is possible to use as many Monte Carlo samples as necessary to approximate 𝐼𝑖𝑗, at no additional cost in terms of the number of calls to the black-box function 𝑓 to optimize.

Note that each Monte Carlo sample π‘₯ contributes the rank-one matrix (πœ• ln π‘ƒπœƒ(π‘₯)/πœ•πœƒπ‘–)(πœ• ln π‘ƒπœƒ(π‘₯)/πœ•πœƒπ‘—) to the Fisher matrix approximation. So, for the approximated Fisher matrix to be invertible, the number of (distinct) samples π‘₯ needs to be at least equal to the number of components of the parameter πœƒ, i.e. 𝑁_Fisher β‰₯ dim Θ.
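A minimal sketch of this Monte Carlo estimation follows; the closing check against the exact diagonal Fisher matrix of independent Bernoulli distributions is only an illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def fisher_monte_carlo(theta, grad_log_p, sampler, n_fisher=5000):
    """Monte Carlo estimate of the Fisher matrix
    I_ij = E_{P_theta}[ (d ln P/d theta_i) (d ln P/d theta_j) ],
    accumulating one rank-one term per sample.  n_fisher should be at
    least dim(theta) for the estimate to have a chance of being invertible."""
    d = len(theta)
    I = np.zeros((d, d))
    for _ in range(n_fisher):
        g = grad_log_p(sampler())
        I += np.outer(g, g)
    return I / n_fisher

# Check against the known Fisher matrix of independent Bernoullis,
# I = diag(1 / (theta_i (1 - theta_i))).
theta = np.array([0.3, 0.7])
sampler = lambda: (rng.random(2) < theta).astype(float)
grad = lambda x: (x - theta) / (theta * (1 - theta))
I_hat = fisher_monte_carlo(theta, grad, sampler)
I_true = np.diag(1.0 / (theta * (1 - theta)))
```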

For exponential families of distributions, the IGO update has a particular form (18) which simplifies this matter somewhat. More details are given below (see Section 5) for the concrete situation of restricted Boltzmann machines.

2 First properties of IGO

2.1 Consistency of sampling

The first property to check is that when 𝑁 β†’ ∞, the update rule using 𝑁 samples converges to the IGO update rule. This is not a straightforward application of the law of large numbers, because the estimated weights 𝑀̂𝑖 depend (non-continuously) on the whole sample π‘₯1, . . . , π‘₯𝑁, and not only on π‘₯𝑖.


Theorem 4 (Consistency). When 𝑁 β†’ ∞, the 𝑁-sample IGO update rule (14):

    πœƒπ‘‘+𝛿𝑑 = πœƒπ‘‘ + 𝛿𝑑 𝐼^{βˆ’1}(πœƒπ‘‘) βˆ‘_{𝑖=1}^{𝑁} 𝑀̂𝑖 πœ• ln π‘ƒπœƒ(π‘₯𝑖)/πœ•πœƒ |_{πœƒ=πœƒπ‘‘}

converges with probability 1 to the update rule (5):

    πœƒπ‘‘+𝛿𝑑 = πœƒπ‘‘ + 𝛿𝑑 βˆ‡πœƒ ∫ π‘Š^𝑓_{πœƒπ‘‘}(π‘₯) π‘ƒπœƒ(dπ‘₯).

The proof is given in the Appendix, under mild regularity assumptions. In particular we do not require that 𝑀 be continuous.

This theorem may clarify previous claims [WSPS08, ANOK10] where rank-based updates similar to (5), such as in NES or CMA-ES, were derived from optimizing the expected value βˆ’E_{π‘ƒπœƒ}𝑓. The rank-based weights 𝑀𝑖 were then introduced somewhat arbitrarily. Theorem 4 shows that, for large 𝑁, CMA-ES and NES actually follow the gradient flow of the quantity E_{π‘ƒπœƒ}π‘Š^𝑓_{πœƒπ‘‘}: the update can be rigorously derived from optimizing the expected value of the quantile-rewriting π‘Š^𝑓_{πœƒπ‘‘}.

2.2 Monotonicity: quantile improvement

Gradient descents come with a guarantee that the fitness value decreases over time. Here, since we work with probability distributions on 𝑋, we need to define the fitness of the distribution π‘ƒπœƒπ‘‘. An obvious choice is the expectation E_{π‘ƒπœƒπ‘‘}𝑓, but it is not invariant under 𝑓-transformation and moreover may be sensitive to extreme values.

It turns out that the monotonicity properties of the IGO gradient flow depend on the choice of the weighting scheme 𝑀. For instance, if 𝑀(𝑒) = 1_{𝑒≀1/2}, then the median of 𝑓 improves over time.

Proposition 5 (Quantile improvement). Consider the IGO flow given by (6), with the weight 𝑀(𝑒) = 1_{π‘’β‰€π‘ž} where 0 < π‘ž < 1 is fixed. Then the value of the π‘ž-quantile of 𝑓 improves over time: if 𝑑1 ≀ 𝑑2 then either πœƒπ‘‘1 = πœƒπ‘‘2 or 𝑄^π‘ž_{π‘ƒπœƒπ‘‘2}(𝑓) < 𝑄^π‘ž_{π‘ƒπœƒπ‘‘1}(𝑓).

Here the π‘ž-quantile 𝑄^π‘ž_𝑃(𝑓) of 𝑓 under a probability distribution 𝑃 is defined as any number π‘š such that Pr_{π‘₯βˆΌπ‘ƒ}(𝑓(π‘₯) ≀ π‘š) β‰₯ π‘ž and Pr_{π‘₯βˆΌπ‘ƒ}(𝑓(π‘₯) β‰₯ π‘š) β‰₯ 1 βˆ’ π‘ž.

The proof is given in the Appendix, together with the necessary regularity assumptions.

Of course this holds only for the IGO gradient flow (6) with 𝑁 = ∞ and 𝛿𝑑 β†’ 0. For an IGO algorithm with finite 𝑁, the dynamics is random and one cannot expect monotonicity. Still, Theorem 4 ensures that, with high probability, trajectories of a large enough finite population dynamics stay close to the infinite-population limit trajectory.

2.3 The IGO flow for exponential families

The expressions for the IGO update simplify somewhat if the family π‘ƒπœƒ happens to be an exponential family of probability distributions (see also [MMS08]). Suppose that π‘ƒπœƒ can be written as

    π‘ƒπœƒ(π‘₯) = (1/𝑍(πœƒ)) exp(βˆ‘π‘– πœƒπ‘– 𝑇𝑖(π‘₯)) 𝐻(dπ‘₯)


where 𝑇1, . . . , π‘‡π‘˜ is a finite family of functions on 𝑋, 𝐻(dπ‘₯) is an arbitrary reference measure on 𝑋, and 𝑍(πœƒ) is the normalization constant. It is well-known [AN00, (2.33)] that

    πœ• ln π‘ƒπœƒ(π‘₯)/πœ•πœƒπ‘– = 𝑇𝑖(π‘₯) βˆ’ E_{π‘ƒπœƒ}𝑇𝑖   (16)

so that [AN00, (3.59)]

    𝐼𝑖𝑗(πœƒ) = Cov_{π‘ƒπœƒ}(𝑇𝑖, 𝑇𝑗).   (17)

In the end we find:

Proposition 6. Let π‘ƒπœƒ be an exponential family parametrized by the natural parameters πœƒ as above. Then the IGO flow is given by

    dπœƒ/d𝑑 = Cov_{π‘ƒπœƒ}(𝑇, 𝑇)^{βˆ’1} Cov_{π‘ƒπœƒ}(𝑇, π‘Š^𝑓_πœƒ)   (18)

where Cov_{π‘ƒπœƒ}(𝑇, π‘Š^𝑓_πœƒ) denotes the vector (Cov_{π‘ƒπœƒ}(𝑇𝑖, π‘Š^𝑓_πœƒ))𝑖, and Cov_{π‘ƒπœƒ}(𝑇, 𝑇) the matrix (Cov_{π‘ƒπœƒ}(𝑇𝑖, 𝑇𝑗))𝑖𝑗.

Note that the right-hand side does not involve derivatives w.r.t. πœƒ any more. This result makes it easy to simulate the IGO flow using e.g. a Gibbs sampler for π‘ƒπœƒ: both covariances in (18) may be approximated by sampling, so that neither the Fisher matrix nor the gradient term need to be known in advance, and no derivatives are involved.
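Indeed, replacing both covariances in (18) by empirical covariances over a sample gives a derivative-free estimate of the flow direction. A minimal sketch (the closing check uses an artificial linear dependence of the weights on 𝑇, for which the resulting direction is exactly the coefficient vector):

```python
import numpy as np

def igo_flow_rhs(samples_T, weights_W):
    """Sampling estimate of the derivative-free IGO flow (18):
    dtheta/dt = Cov(T, T)^{-1} Cov(T, W), with both covariances replaced
    by empirical covariances over the same sample.  No derivatives of
    ln P_theta and no analytic Fisher matrix are needed."""
    T = np.asarray(samples_T, dtype=float)   # shape (N, k): statistics T_i per sample
    W = np.asarray(weights_W, dtype=float)   # shape (N,):   weight W(x) per sample
    Tc = T - T.mean(axis=0)
    cov_TT = Tc.T @ Tc / len(T)
    cov_TW = Tc.T @ (W - W.mean()) / len(T)
    return np.linalg.solve(cov_TT, cov_TW)

# Synthetic check: if W depends linearly on T, say W = 2 T_1 - T_2, the
# estimated flow direction is the coefficient vector (2, -1).
rng = np.random.default_rng(2)
T = rng.standard_normal((10000, 2))
rhs = igo_flow_rhs(T, 2 * T[:, 0] - T[:, 1])
```

This mirrors the remark in the text that maximal speed requires a linear relationship between the weights and the statistics.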

The CMA-ES uses the family of all Gaussian distributions on R𝑑. Then the family 𝑇𝑖 is the family of all linear and quadratic functions of the coordinates on R𝑑. The expression above is then a particularly concise rewriting of a CMA-ES update; see also Section 4.2.

Moreover, the expected values 𝑇̄𝑖 = E𝑇𝑖 of the 𝑇𝑖 satisfy the simple evolution equation under the IGO flow

    d𝑇̄𝑖/d𝑑 = Cov(𝑇𝑖, π‘Š^𝑓_πœƒ) = E(𝑇𝑖 π‘Š^𝑓_πœƒ) βˆ’ 𝑇̄𝑖 E π‘Š^𝑓_πœƒ.   (19)

The proof is given in the Appendix, in the proof of Theorem 15.

The variables 𝑇̄𝑖 can sometimes be used as an alternative parametrization for an exponential family (e.g. for a one-dimensional Gaussian, these are the mean πœ‡ and the second moment πœ‡Β² + σ²). Then the IGO flow (7) may be rewritten using the relation βˆ‡πœƒπ‘– = πœ•/πœ•π‘‡Μ„π‘– for the natural gradient of exponential families (Appendix, Proposition 22), which sometimes results in simpler expressions. We shall further exploit this fact in Section 3.

Exponential families with latent variables. Similar formulas hold when the distribution π‘ƒπœƒ(π‘₯) is the marginal of an exponential distribution π‘ƒπœƒ(π‘₯, β„Ž) over a β€œhidden” or β€œlatent” variable β„Ž, such as the restricted Boltzmann machines of Section 5.

Namely, with π‘ƒπœƒ(π‘₯) = (1/𝑍(πœƒ)) βˆ‘β„Ž exp(βˆ‘π‘– πœƒπ‘– 𝑇𝑖(π‘₯, β„Ž)) 𝐻(dπ‘₯, dβ„Ž), the Fisher matrix is

    𝐼𝑖𝑗(πœƒ) = Cov_{π‘ƒπœƒ}(π‘ˆπ‘–, π‘ˆπ‘—)   (20)

where π‘ˆπ‘–(π‘₯) = E_{π‘ƒπœƒ}(𝑇𝑖(π‘₯, β„Ž) | π‘₯). Consequently, the IGO flow takes the form

    dπœƒ/d𝑑 = Cov_{π‘ƒπœƒ}(π‘ˆ, π‘ˆ)^{βˆ’1} Cov_{π‘ƒπœƒ}(π‘ˆ, π‘Š^𝑓_πœƒ).   (21)


2.4 Invariance properties

Here we formally state the invariance properties of the IGO flow under various reparametrizations. Since these results follow from the very construction of the algorithm, the proofs are omitted.

Proposition 7 (𝑓-invariance). Let πœ™ : R β†’ R be an increasing function. Then the trajectories of the IGO flow when optimizing the functions 𝑓 and πœ™(𝑓) are the same.

The same is true for the discretized algorithm with population size 𝑁 and step size 𝛿𝑑 > 0.

Proposition 8 (πœƒ-invariance). Let πœƒβ€² = πœ™(πœƒ) be a one-to-one function of πœƒ and let 𝑃′_{πœƒβ€²} = 𝑃_{πœ™^{βˆ’1}(πœƒβ€²)}. Let πœƒπ‘‘ be the trajectory of the IGO flow when optimizing a function 𝑓 using the distributions π‘ƒπœƒ, initialized at πœƒ0. Then the IGO flow trajectory (πœƒβ€²)𝑑 obtained from the optimization of the function 𝑓 using the distributions 𝑃′_{πœƒβ€²}, initialized at (πœƒβ€²)0 = πœ™(πœƒ0), is the same, namely (πœƒβ€²)𝑑 = πœ™(πœƒπ‘‘).

For the algorithm with finite 𝑁 and 𝛿𝑑 > 0, invariance under πœƒ-reparametrization is only true approximately, in the limit when 𝛿𝑑 β†’ 0. As mentioned above, the IGO update (14), with 𝑁 = ∞, is simply the Euler approximation scheme for the ordinary differential equation (6) defining the IGO flow. At each step, the Euler scheme is known to make an error 𝑂(𝛿𝑑²) with respect to the true flow. This error actually depends on the parametrization of πœƒ.

So the IGO updates for different parametrizations coincide at first order in 𝛿𝑑, and may, in general, differ by 𝑂(𝛿𝑑²). For instance, the difference between the CMA-ES and xNES updates is indeed 𝑂(𝛿𝑑²); see Section 4.2.

For comparison, using the vanilla gradient results in a divergence of 𝑂(𝛿𝑑) at each step between different parametrizations. So the divergence could be of the same magnitude as the steps themselves.

In that sense, one can say that IGO algorithms are β€œmore parametrization-invariant” than other algorithms. This stems from their origin as a discretization of the IGO flow.

The next proposition states that, for example, if one uses a family of distributions on R𝑑 which is invariant under affine transformations, then our algorithm optimizes equally well a function and its image under any affine transformation (up to an obvious change in the initialization). This proposition generalizes the well-known corresponding property of CMA-ES [HO01].

Here, as usual, the image of a probability distribution 𝑃 by a transformation πœ™ is defined as the probability distribution 𝑃′ such that 𝑃′(π‘Œ) = 𝑃(πœ™^{βˆ’1}(π‘Œ)) for any subset π‘Œ βŠ‚ 𝑋. In the continuous domain, the density of the new distribution 𝑃′ is obtained by the usual change-of-variable formula involving the Jacobian of πœ™.

Proposition 9 (𝑋-invariance). Let πœ™ : 𝑋 β†’ 𝑋 be a one-to-one transformation of the search space, and assume that πœ™ globally preserves the family of measures π‘ƒπœƒ. Let πœƒπ‘‘ be the IGO flow trajectory for the optimization of the function 𝑓, initialized at π‘ƒπœƒ0. Let (πœƒβ€²)𝑑 be the IGO flow trajectory for the optimization of 𝑓 ∘ πœ™^{βˆ’1}, initialized at the image of π‘ƒπœƒ0 by πœ™. Then 𝑃_{(πœƒβ€²)𝑑} is the image of π‘ƒπœƒπ‘‘ by πœ™.

For the discretized algorithm with population size 𝑁 and step size 𝛿𝑑 > 0, the same is true up to an error of 𝑂(𝛿𝑑²) per iteration. This error disappears if the map πœ™ acts on Θ in an affine way.


The latter case of affine transforms is well exemplified by CMA-ES: here, using the variance and mean as the parametrization of Gaussians, the new mean and variance after an affine transform of the search space are an affine function of the old mean and variance; specifically, for the affine transformation 𝐴 : π‘₯ ↦→ 𝐴π‘₯ + 𝑏 we have (π‘š, 𝐢) ↦→ (π΄π‘š + 𝑏, 𝐴𝐢𝐴^T).

2.5 Speed of the IGO flow

Proposition 10. The speed of the IGO flow, i.e. the norm of dπœƒπ‘‘/d𝑑 in the Fisher metric, is at most √(βˆ«β‚€ΒΉ 𝑀² βˆ’ (βˆ«β‚€ΒΉ 𝑀)Β²) where 𝑀 is the weighting scheme.

This speed can be tested in practice in at least two ways. The first is just to compute the Fisher norm of the increment πœƒπ‘‘+𝛿𝑑 βˆ’ πœƒπ‘‘ using the Fisher matrix; for small 𝛿𝑑 this is close to 𝛿𝑑 β€–dπœƒ/d𝑑‖ with β€–Β·β€– the Fisher metric. The second is as follows: since the Fisher metric coincides with the Kullback–Leibler divergence up to a factor 1/2, we have KL(π‘ƒπœƒπ‘‘+𝛿𝑑 || π‘ƒπœƒπ‘‘) β‰ˆ (1/2) 𝛿𝑑² β€–dπœƒ/d𝑑‖², at least for small 𝛿𝑑. Since it is relatively easy to estimate KL(π‘ƒπœƒπ‘‘+𝛿𝑑 || π‘ƒπœƒπ‘‘) by comparing the new and old log-likelihoods of points in a Monte Carlo sample, one can obtain an estimate of β€–dπœƒ/d𝑑‖.

Corollary 11. Consider an IGO algorithm with weighting scheme 𝑀, step size 𝛿𝑑 and sample size 𝑁. Then, for small 𝛿𝑑 and large 𝑁 we have

    KL(π‘ƒπœƒπ‘‘+𝛿𝑑 || π‘ƒπœƒπ‘‘) ≀ (1/2) 𝛿𝑑² Var_{[0,1]} 𝑀 + 𝑂(𝛿𝑑³) + 𝑂(1/βˆšπ‘).

For instance, with 𝑀(π‘ž) = 1_{π‘žβ‰€π‘ž0} and neglecting the error terms, an IGO algorithm introduces at most (1/2) 𝛿𝑑² π‘ž0(1 βˆ’ π‘ž0) bits of information (in base 𝑒) per iteration into the probability distribution π‘ƒπœƒ.

Thus, the time discretization parameter 𝛿𝑑 is not just an arbitrary variable: it has an intrinsic interpretation related to a number of bits introduced at each step of the algorithm. This kind of relationship suggests, more generally, to use the Kullback–Leibler divergence as an external and objective way to measure learning rates in those optimization algorithms which use probability distributions.

The result above is only an upper bound. Maximal speed can be achieved only if all β€œgood” points point in the same direction. If the various good points in the sample suggest moves in inconsistent directions, then the IGO update will be much smaller. The latter may be a sign that the signal is noisy, or that the family of distributions π‘ƒπœƒ is not well suited to the problem at hand and should be enriched.

As an example, using a family of Gaussian distributions with unknown mean and fixed identity variance on R𝑑, one checks that for the optimization of a linear function on R𝑑, with the weight 𝑀(𝑒) = βˆ’1_{𝑒>1/2} + 1_{𝑒<1/2}, the IGO flow moves at constant speed 1/√(2πœ‹) β‰ˆ 0.4, whatever the dimension 𝑑. On a rapidly varying sinusoidal function, the moving speed will be much slower because there are β€œgood” and β€œbad” points in all directions.

This may suggest ways to design the weighting scheme 𝑀 to achieve maximal speed in some instances. Indeed, looking at the proof of the proposition, which involves a Cauchy–Schwarz inequality, one can see that the maximal speed is achieved only if there is a linear relationship between the weights π‘Š^𝑓_πœƒ(π‘₯) and the gradient βˆ‡πœƒ ln π‘ƒπœƒ(π‘₯). For instance, for the optimization of a linear function on R𝑑 using Gaussian measures of known variance, the maximal speed will be achieved when the weighting scheme 𝑀(𝑒) is the inverse of the Gaussian cumulative distribution function. (In


particular, 𝑀(𝑒) tends to +∞ when 𝑒 β†’ 0 and to βˆ’βˆž when 𝑒 β†’ 1.) This is in accordance with previously known results: the expected value of the 𝑖-th order statistic of 𝑁 standard Gaussian variates is the optimal 𝑀̂𝑖 value in evolution strategies [Bey01, Arn06]. For 𝑁 β†’ ∞, this order statistic converges to the inverse Gaussian cumulative distribution function.
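Such discretized weights can be computed with the standard-normal quantile function; the sign convention below, 𝑀(𝑒) = Ξ¦^{βˆ’1}(1 βˆ’ 𝑒) so that the best ranks get the large positive weights, is our reading of the parenthetical remark above:

```python
from statistics import NormalDist

# Discretized weights w_i = Phi^{-1}(1 - (i - 1/2)/N), i = 1..N: the
# continuum version of the optimal rank-based weights (expected Gaussian
# order statistics) mentioned in the text.
N = 10
inv_cdf = NormalDist().inv_cdf
w = [inv_cdf(1 - (i - 0.5) / N) for i in range(1, N + 1)]
# Best rank: largest positive weight; worst rank: most negative weight;
# the scheme is antisymmetric, so the weights sum to ~0.
```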

2.6 Noisy objective function

Suppose that the objective function 𝑓 is non-deterministic: each time we ask for the value of 𝑓 at a point π‘₯ ∈ 𝑋, we get a random result. In this setting we may write the random value 𝑓(π‘₯) as 𝑓(π‘₯) = 𝑓̃(π‘₯, πœ”) where πœ” is an unseen random parameter, and 𝑓̃ is a deterministic function of π‘₯ and πœ”. Without loss of generality, up to a change of variables we can assume that πœ” is uniformly distributed in [0, 1].

We can still use the IGO algorithm without modification in this context. One might wonder which properties (consistency of sampling, etc.) still apply when 𝑓 is not deterministic. Actually, IGO algorithms for noisy functions fit very nicely into the IGO framework: the following proposition allows us to transfer any property of IGO to the case of noisy functions.

Proposition 12 (Noisy IGO). Let 𝑓 be a random function of π‘₯ ∈ 𝑋, namely, 𝑓(π‘₯) = 𝑓̃(π‘₯, πœ”) where πœ” is a random variable uniformly distributed in [0, 1] and 𝑓̃ is a deterministic function of π‘₯ and πœ”. Then the two following algorithms coincide:

βˆ™ the IGO algorithm (13), using a family of distributions π‘ƒπœƒ on the space 𝑋, applied to the noisy function 𝑓, and where the samples are ranked according to the random observed value of 𝑓 (here we assume that, for each sample, the noise πœ” is independent from everything else);

βˆ™ the IGO algorithm on the space 𝑋 Γ— [0, 1], using the family of distributions 𝑃̃_πœƒ = π‘ƒπœƒ βŠ— π‘ˆ_{[0,1]}, applied to the deterministic function 𝑓̃. Here π‘ˆ_{[0,1]} denotes the uniform law on [0, 1].

The (easy) proof is given in the Appendix.

This proposition states that noisy optimization is the same as ordinary optimization using a family of distributions which cannot operate any selection or convergence over the parameter πœ”. More generally, any component of the search space in which a distribution-based evolutionary strategy cannot perform selection or specialization will effectively act as a random noise on the objective function.

As a consequence of this result, all properties of IGO can be transferred to the noisy case. Consider, for instance, consistency of sampling (Theorem 4). The 𝑁-sample IGO update rule for the noisy case is identical to the non-noisy case (14):

    πœƒπ‘‘+𝛿𝑑 = πœƒπ‘‘ + 𝛿𝑑 𝐼^{βˆ’1}(πœƒπ‘‘) βˆ‘_{𝑖=1}^{𝑁} 𝑀̂𝑖 πœ• ln π‘ƒπœƒ(π‘₯𝑖)/πœ•πœƒ |_{πœƒ=πœƒπ‘‘}

where each weight 𝑀̂𝑖 computed from (12) now incorporates noise from the objective function, because the rank of π‘₯𝑖 is computed on the random function, or equivalently on the deterministic function 𝑓̃: rk(π‘₯𝑖) = #{𝑗, 𝑓̃(π‘₯𝑗, πœ”π‘—) < 𝑓̃(π‘₯𝑖, πœ”π‘–)}.

Consistency of sampling (Theorem 4) thus takes the following form: when 𝑁 β†’ ∞, the 𝑁-sample IGO update rule on the noisy function 𝑓


converges with probability 1 to the update rule

    πœƒπ‘‘+𝛿𝑑 = πœƒπ‘‘ + 𝛿𝑑 βˆ‡πœƒ βˆ«β‚€ΒΉ ∫ π‘Š^𝑓̃_{πœƒπ‘‘}(π‘₯, πœ”) π‘ƒπœƒ(dπ‘₯) dπœ”
           = πœƒπ‘‘ + 𝛿𝑑 βˆ‡πœƒ ∫ π‘ŠΜ„^𝑓_{πœƒπ‘‘}(π‘₯) π‘ƒπœƒ(dπ‘₯)   (22)

where π‘ŠΜ„^𝑓_πœƒ(π‘₯) = E_πœ” π‘Š^𝑓̃_πœƒ(π‘₯, πœ”). This entails, in particular, that when 𝑁 β†’ ∞, the noise disappears asymptotically, as could be expected.

Consequently, the IGO flow in the noisy case should be defined by the 𝛿𝑑 β†’ 0 limit of the update (22) using π‘ŠΜ„. Note that the quantiles π‘žΒ±(π‘₯) defined by (2) still make sense in the noisy case, and are deterministic functions of π‘₯; thus π‘Š^𝑓_πœƒ(π‘₯) can also be defined by (3) and is deterministic. However, unless the weighting scheme 𝑀(π‘ž) is affine, π‘ŠΜ„^𝑓_πœƒ(π‘₯) is different from π‘Š^𝑓_πœƒ(π‘₯) in general. Thus, unless 𝑀 is affine, the flows defined by π‘Š and π‘ŠΜ„ do not coincide in the noisy case. The flow using π‘Š would be the 𝑁 β†’ ∞ limit of a slightly more complex algorithm using several evaluations of 𝑓 for each sample π‘₯𝑖 in order to compute noise-free ranks.

2.7 Implementation remarks

Influence of the weighting scheme 𝑀. The weighting scheme 𝑀 directly affects the update rule (15).

A natural choice is 𝑀(𝑒) = 1_{π‘’β‰€π‘ž}. This, as we have proved, results in an improvement of the π‘ž-quantile over the course of optimization. Taking π‘ž = 1/2 springs to mind; however, this is not selective enough, and both theory and experiments confirm that for the Gaussian case (CMA-ES), most efficient optimization requires π‘ž < 1/2 (see Section 4.2). The optimal π‘ž is about 0.27 if 𝑁 is not larger than the search space dimension 𝑑 [Bey01], and even smaller otherwise.

Second, replacing 𝑀 with 𝑀 + 𝑐 for some constant 𝑐 clearly has no influence on the IGO continuous-time flow (5), since the gradient will cancel out the constant. However, this is not the case for the update rule (15) with a finite sample of size 𝑁.

Indeed, adding a constant 𝑐 to 𝑀 adds a quantity 𝑐 (1/𝑁) βˆ‘π‘– βˆ‡πœƒ ln π‘ƒπœƒ(π‘₯𝑖) to the update. Since we know that the π‘ƒπœƒ-expected value of βˆ‡πœƒ ln π‘ƒπœƒ is 0 (because ∫ (βˆ‡πœƒ ln π‘ƒπœƒ) π‘ƒπœƒ = ∫ βˆ‡π‘ƒπœƒ = βˆ‡1 = 0), we have E[(1/𝑁) βˆ‘π‘– βˆ‡πœƒ ln π‘ƒπœƒ(π‘₯𝑖)] = 0. So adding a constant to 𝑀 does not change the expected value of the update, but it may change e.g. its variance. The empirical average of βˆ‡πœƒ ln π‘ƒπœƒ(π‘₯𝑖) in the sample will be 𝑂(1/βˆšπ‘). So translating the weights results in an 𝑂(1/βˆšπ‘) change in the update. See also Section 4 in [SWSS09].

Determining an optimal value for 𝑐 to reduce the variance of the update is difficult, though: the optimal value actually depends on possible correlations between βˆ‡πœƒ ln π‘ƒπœƒ and the function 𝑓. The only general result is that one should shift 𝑀 so that 0 lies within its range. Assuming independence, or dependence with enough symmetry, the optimal shift is when the weights average to 0.

Adaptive learning rate. Comparing consecutive updates to evaluate a learning rate or step size is an effective measure. For example, in back-propagation, the update sign has been used to adapt the learning rate of each single weight in an artificial neural network [SA90]. In CMA-ES, a step size is adapted depending on whether recent steps tended to move in a consistent direction or to backtrack. This is measured by considering the changes of


the mean π‘š of the Gaussian distribution. For a probability distribution π‘ƒπœƒ

on an arbitrary search space, in general no notion of mean may be defined.However, it is still possible to define β€œbacktracking” in the evolution of πœƒ asfollows.

Consider two successive updates π›Ώπœƒπ‘‘ = πœƒπ‘‘ βˆ’ πœƒπ‘‘βˆ’π›Ώπ‘‘ and π›Ώπœƒπ‘‘+𝛿𝑑 = πœƒπ‘‘+𝛿𝑑 βˆ’ πœƒπ‘‘. Their scalar product in the Fisher metric 𝐼(πœƒπ‘‘) is

    βŸ¨π›Ώπœƒπ‘‘, π›Ώπœƒπ‘‘+π›Ώπ‘‘βŸ© = βˆ‘_{𝑖𝑗} 𝐼𝑖𝑗(πœƒπ‘‘) (π›Ώπœƒπ‘‘)𝑖 (π›Ώπœƒπ‘‘+𝛿𝑑)𝑗.

Dividing by the associated norms will yield the cosine cos 𝛼 of the angle between π›Ώπœƒπ‘‘ and π›Ώπœƒπ‘‘+𝛿𝑑.

If this cosine is positive, the learning rate 𝛿𝑑 may be increased. If the cosine is negative, the learning rate probably needs to be decreased. Various schemes for the change of 𝛿𝑑 can be devised; for instance, inspired by CMA-ES, one can multiply 𝛿𝑑 by exp(𝛽 cos 𝛼 / 2) or exp(𝛽 (1_{cos 𝛼>0} βˆ’ 1_{cos 𝛼<0}) / 2), where 𝛽 β‰ˆ min(𝑁/dim Θ, 1/2).
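The backtracking test can be sketched in a few lines; the multiplier exp(𝛽 cos 𝛼/2) is one of the schemes suggested above, and the function name is ours:

```python
import numpy as np

def adapt_dt(dt, dtheta_prev, dtheta_next, fisher, beta):
    """Step-size adaptation from the Fisher-metric angle between two
    consecutive parameter updates, <u, v> = u^T I(theta) v: dt grows
    when the updates are aligned and shrinks on backtracking."""
    dot = dtheta_prev @ fisher @ dtheta_next
    norm2 = (dtheta_prev @ fisher @ dtheta_prev) * (dtheta_next @ fisher @ dtheta_next)
    cos_alpha = dot / np.sqrt(norm2)
    return dt * np.exp(beta * cos_alpha / 2), cos_alpha

# With the identity metric: aligned steps increase dt, reversed steps decrease it.
I2 = np.eye(2)
dt_up, _ = adapt_dt(0.1, np.array([1.0, 0.0]), np.array([1.0, 0.0]), I2, 0.5)
dt_down, _ = adapt_dt(0.1, np.array([1.0, 0.0]), np.array([-1.0, 0.0]), I2, 0.5)
```

Using the Fisher metric rather than the plain Euclidean scalar product is what makes the test robust to reparametrization of πœƒ.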

As before, this scheme is constructed to be robust w.r.t. reparametrization of πœƒ, thanks to the use of the Fisher metric. However, for large learning rates 𝛿𝑑, in practice the parametrization might well become relevant.

A consistent direction of the updates does not necessarily mean that the algorithm is performing well: for instance, when CEM/EMNA exhibits premature convergence (see below), the parameters consistently move towards a zero covariance matrix and the cosines above are positive.

Complexity. The complexity of the IGO algorithm depends much on the computational cost model. In optimization, it is fairly common to assume that the objective function 𝑓 is very costly compared to any other calculations performed by the algorithm. Then the cost of IGO in terms of number of 𝑓-calls is 𝑁 per iteration, and the cost of using quantiles and computing the natural gradient is negligible.

Setting the cost of 𝑓 aside, the complexity of the IGO algorithm depends mainly on the computation of the (inverse) Fisher matrix. Assume an analytical expression for this matrix is known. Then, with 𝑝 = dim Θ the number of parameters, the cost of storing the Fisher matrix is 𝑂(𝑝²) per iteration, and its inversion typically costs 𝑂(𝑝³) per iteration. However, depending on the situation and on possible algebraic simplifications, strategies exist to reduce this cost (e.g. [LRMB07] in a learning context). For instance, for CMA-ES the cost is 𝑂(𝑁𝑝) [SHI09]. More generally, parametrization by expectation parameters (see above), when available, may reduce the cost to 𝑂(𝑝) as well.

If no analytical form of the Fisher matrix is known and Monte Carlo estimation is required, then the complexity depends on the particular situation at hand and is related to the best sampling strategies available for a particular family of distributions. For Boltzmann machines, for instance, a host of such strategies are available. Still, in such a situation, IGO can be competitive if the objective function 𝑓 is costly.

Recycling samples. We might use samples not only from the last iteration to compute the ranks in (12); see e.g. [SWSS09]. For 𝑁 = 1 this is indispensable. In order to preserve sampling consistency (Theorem 4), the old samples need to be reweighted (using the ratio of their new vs. old likelihood, as in importance sampling).
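The reweighting step can be sketched as self-normalized importance sampling; normalizing the ratios to total mass 1 is our choice here, not prescribed by the text:

```python
import numpy as np

def recycle_weights(log_p_new, log_p_old):
    """Importance weights for samples drawn under the previous distribution:
    proportional to the likelihood ratio P_new / P_old, computed from
    log-densities for numerical stability and normalized to sum to 1."""
    lw = np.asarray(log_p_new, dtype=float) - np.asarray(log_p_old, dtype=float)
    lw -= lw.max()                 # avoid overflow in the exponentials
    w = np.exp(lw)
    return w / w.sum()

# Samples whose likelihood did not change keep equal weights.
w = recycle_weights([-1.0, -1.0, -1.0], [-1.0, -1.0, -1.0])
```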


Initialization. As with other optimization algorithms, it is probably a good idea to initialize in such a way as to cover a wide portion of the search space, i.e. πœƒ0 should be chosen so that π‘ƒπœƒ0 has maximal diversity. For IGO algorithms this is particularly relevant, since, as explained above, the natural gradient provides minimal change of diversity (greedily at each step) for a given change in the objective function.

3 IGO, maximum likelihood and the cross-entropy method

IGO as a smooth-time maximum likelihood estimate. The IGO flow turns out to be the only way to maximize a weighted log-likelihood, where points of the current distribution are slightly reweighted according to 𝑓-preferences.

This relies on the following interpretation of the natural gradient as a weighted maximum likelihood update with infinitesimal learning rate. This result singles out, in yet another way, the natural gradient among all possible gradients. The proof is given in the Appendix.

Theorem 13 (Natural gradient as ML with infinitesimal weights). Let πœ€ > 0 and πœƒ0 ∈ Θ. Let π‘Š(π‘₯) be a function of π‘₯ and let πœƒ be the solution of

    πœƒ = arg max_πœƒ { (1 βˆ’ πœ€) ∫ log π‘ƒπœƒ(π‘₯) π‘ƒπœƒ0(dπ‘₯) + πœ€ ∫ log π‘ƒπœƒ(π‘₯) π‘Š(π‘₯) π‘ƒπœƒ0(dπ‘₯) }

(the first term alone is maximal for πœƒ = πœƒ0). Then, when πœ€ β†’ 0, up to 𝑂(πœ€Β²) we have

    πœƒ = πœƒ0 + πœ€ ∫ βˆ‡πœƒ ln π‘ƒπœƒ(π‘₯) π‘Š(π‘₯) π‘ƒπœƒ0(dπ‘₯).

Likewise for discrete samples: with π‘₯1, . . . , π‘₯𝑁 ∈ 𝑋, let πœƒ be the solution of

    πœƒ = arg max_πœƒ { (1 βˆ’ πœ€) ∫ log π‘ƒπœƒ(π‘₯) π‘ƒπœƒ0(dπ‘₯) + πœ€ βˆ‘π‘– π‘Š(π‘₯𝑖) log π‘ƒπœƒ(π‘₯𝑖) }.

Then, when πœ€ β†’ 0, up to 𝑂(πœ€Β²) we have

    πœƒ = πœƒ0 + πœ€ βˆ‘π‘– π‘Š(π‘₯𝑖) βˆ‡πœƒ ln π‘ƒπœƒ(π‘₯𝑖).

So if π‘Š (π‘₯) = π‘Š π‘“πœƒ0

(π‘₯) is the weight of the points according to quantilized𝑓 -preferences, the weighted maximum log-likelihood necessarily is the IGOflow (7) using the natural gradientβ€”or the IGO update (14) when usingsamples.

Thus the IGO flow is the unique flow that, continuously in time, slightly changes the distribution to maximize the log-likelihood of points with good values of f. Moreover, IGO continuously updates the weight W^f_{θ0}(x) depending on f and on the current distribution, so that we keep optimizing.

This theorem suggests a way to approximate the IGO flow by enforcing this interpretation for a given non-infinitesimal step size δt, as follows.

Definition 14 (IGO-ML algorithm). The IGO-ML algorithm with step size δt updates the value of the parameter θ^t according to

$$\theta^{t+\delta t} = \arg\max_\theta \Big\{ \Big(1 - \delta t \sum_i w_i \Big) \int \log P_\theta(x)\, P_{\theta^t}(\mathrm{d}x) \;+\; \delta t \sum_i w_i \log P_\theta(x_i) \Big\} \qquad (23)$$


where π‘₯1, . . . , π‘₯𝑁 are sample points picked according to the distribution π‘ƒπœƒπ‘‘,and 𝑀𝑖 is the weight (12) obtained from the ranked values of the objectivefunction 𝑓 .

As for the cross-entropy method below, this only makes algorithmic sense if the argmax is tractable.

It turns out that IGO-ML is just the IGO algorithm in a particular parametrization (see Theorem 15).

The cross-entropy method. Taking δt = 1 in (23) above corresponds to a full maximum likelihood update, which is also related to the cross-entropy method (CEM). The cross-entropy method can be defined as follows [dBKMR05] in an optimization setting. Like IGO, it depends on a family of probability distributions P_θ parametrized by θ ∈ Θ, and a number of samples N at each iteration. Let also N_e = ⌈qN⌉ (0 < q < 1) be a number of elite samples.

At each step, the cross-entropy method for optimization samples N points x1, ..., xN from the current distribution P_{θ^t}. Let w_i be 1/N_e if x_i belongs to the N_e samples with the best value of the objective function f, and w_i = 0 otherwise. Then the cross-entropy method or maximum likelihood update (CEM/ML) for optimization is

πœƒπ‘‘+1 = arg maxπœƒ

βˆ‘ 𝑀𝑖 log π‘ƒπœƒ(π‘₯𝑖) (24)

(assuming the argmax is tractable). A version with a smoother update depends on a step size parameter 0 < α ≤ 1 and is given [dBKMR05] by

$$\theta^{t+1} = (1-\alpha)\,\theta^t + \alpha\, \arg\max_\theta \sum_i w_i \log P_\theta(x_i). \qquad (25)$$

The standard CEM/ML update corresponds to α = 1. For α = 1 the standard cross-entropy method is independent of the parametrization θ, whereas for α < 1 this is not the case.

Note the difference between the IGO-ML algorithm (23) and the smoothed CEM update (25) with step size α = δt: the smoothed CEM update performs a weighted average of the parameter value after taking the maximum likelihood estimate, whereas IGO-ML uses a weighted average of current and previous likelihoods, then takes a maximum likelihood estimate. In general, these two rules can differ greatly, as they do for Gaussian distributions (Section 4.2).

This interchange of averaging makes IGO-ML parametrization-independent, whereas the smoothed CEM update is not.

Yet, for exponential families of probability distributions, there exists one particular parametrization θ in which the IGO algorithm and the smoothed CEM update coincide. We now proceed to this construction.

IGO for expectation parameters and maximum likelihood. The particular form of IGO for exponential families has an interesting consequence if the parametrization chosen for the exponential family is the set of expectation parameters. Let

$$P_\theta(x) = \frac{1}{Z(\theta)} \exp\Big( \sum_j \theta_j T_j(x) \Big)\, H(\mathrm{d}x)$$

be an exponential family as above. The expectation parameters are T̄_j = T̄_j(θ) = E_{P_θ} T_j (denoted η_j in [AN00, (3.56)]). The notation T̄ will denote the collection (T̄_j).


It is well known that, in this parametrization, the maximum likelihood estimate for a sample of points x1, ..., xk is just the empirical average of the expectation parameters over that sample:

$$\arg\max_{\bar T}\, \frac{1}{k} \sum_{i=1}^k \log P_{\bar T}(x_i) \;=\; \frac{1}{k} \sum_{i=1}^k T(x_i). \qquad (26)$$

In the discussion above, one main difference between IGO and smoothed CEM was whether we took averages before or after taking the maximum log-likelihood estimate. For the expectation parameters T̄_i, we see that these operations commute. (One can say that these expectation parameters "linearize" maximum likelihood estimates.) A little work brings us to the following theorem.

Theorem 15 (IGO, CEM and maximum likelihood). Let

$$P_\theta(x) = \frac{1}{Z(\theta)} \exp\Big( \sum_j \theta_j T_j(x) \Big)\, H(\mathrm{d}x)$$

be an exponential family of probability distributions, where the T_j are functions of x and H is some reference measure. Let us parametrize this family by the expected values T̄_j = E T_j.

Let us assume the chosen weights w_i sum to 1. For a sample x1, ..., xN, let

$$T^*_j = \sum_i w_i\, T_j(x_i).$$

Then the IGO update (14) in this parametrization reads

$$\bar T_j^{\,t+\delta t} = (1-\delta t)\, \bar T_j^{\,t} + \delta t\, T^*_j. \qquad (27)$$

Moreover these three algorithms coincide:

• The IGO-ML algorithm (23).

• The IGO algorithm written in the parametrization T̄_j.

• The smoothed CEM algorithm (25) written in the parametrization T̄_j, with α = δt.

Corollary 16. The standard CEM/ML update (24) is the IGO algorithm in parametrization T̄_j with δt = 1.

Beware that the expectation parameters T̄_j are not always the most obvious parameters [AN00, Section 3.5]. For example, for 1-dimensional Gaussian distributions, these expectation parameters are the mean μ and the second moment μ² + σ². When expressed back in terms of mean and variance, with the update (27) the new mean is (1 − δt)μ + δt μ*, but the new variance is (1 − δt)σ² + δt(σ*)² + δt(1 − δt)(μ* − μ)².

On the other hand, when using smoothed CEM with mean and variance as parameters, the new variance is (1 − δt)σ² + δt(σ*)², which can be significantly smaller for δt ∈ (0, 1). This proves, in passing, that the smoothed CEM update in other parametrizations is generally not an IGO algorithm (because it can differ at first order in δt).
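The two variance formulas can be compared numerically. The sketch below, with illustrative numbers, implements IGO-ML for a 1-dimensional Gaussian by interpolating the expectation parameters (μ, μ² + σ²) and converting back, against smoothed CEM acting on (μ, σ²) directly:

```python
# Comparison of the two variance updates discussed above, for a 1-D Gaussian
# (illustrative numbers). IGO-ML interpolates the expectation parameters
# (mu, mu^2 + sigma^2); smoothed CEM interpolates (mu, sigma^2) directly.

def igo_ml_update(mu, var, mu_star, var_star, dt):
    # Interpolate expectation parameters, then convert back to (mu, var).
    T1 = (1 - dt) * mu + dt * mu_star
    T2 = (1 - dt) * (var + mu ** 2) + dt * (var_star + mu_star ** 2)
    return T1, T2 - T1 ** 2

def smoothed_cem_update(mu, var, mu_star, var_star, dt):
    return (1 - dt) * mu + dt * mu_star, (1 - dt) * var + dt * var_star

mu, var = 0.0, 1.0
mu_star, var_star = 2.0, 0.5   # elite statistics, far from the current mean
dt = 0.5

_, var_igo = igo_ml_update(mu, var, mu_star, var_star, dt)
_, var_cem = smoothed_cem_update(mu, var, mu_star, var_star, dt)
print(var_igo, var_cem)  # IGO-ML keeps the extra dt*(1-dt)*(mu*-mu)^2 term
```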

The case of Gaussian distributions is further exemplified in Section 4.2 below: in particular, smoothed CEM in the (μ, σ) parametrization exhibits premature reduction of variance, preventing good convergence.

For these reasons we think that the IGO-ML algorithm is the sensible way to perform an interpolated ML estimate for δt < 1, in a parametrization-independent way. In Section 6 we further discuss IGO and CEM and sum up the differences and relative advantages.


Taking δt = 1 is a bold approximation choice: the "ideal" continuous-time IGO flow itself, after time 1, does not coincide with the maximum likelihood update of the best points in the sample. Since the maximum likelihood algorithm is known to converge prematurely in some instances (Section 4.2), using the parametrization by expectation parameters with large δt may not be desirable.

The considerable simplification of the IGO update in these coordinates reflects the duality of the coordinates T̄_i and θ_i. More precisely, the natural gradient ascent w.r.t. the parameters T̄_i is given by the vanilla gradient w.r.t. the parameters θ_i:

$$\widetilde{\nabla}_{\bar T_i} = \frac{\partial}{\partial \theta_i}$$

(Proposition 22 in the Appendix).

4 CMA-ES, NES, EDAs and PBIL from the IGO framework

In this section we investigate the IGO algorithm for Bernoulli measures and for multivariate normal distributions and show the correspondence to well-known algorithms. In addition, we discuss the influence of the parametrization of the distributions.

4.1 IGO algorithm for Bernoulli measures and PBIL

We consider on X = {0, 1}^d a family of Bernoulli measures P_θ(x) = p_{θ1}(x1) × ... × p_{θd}(xd) with p_{θi}(x_i) = θ_i^{x_i}(1 − θ_i)^{1−x_i}. As this family is a product of probability measures p_{θi}(x_i), the different components of a random vector y following P_θ are independent, and all off-diagonal terms of the Fisher information matrix (FIM) are zero. Diagonal terms are given by 1/(θ_i(1 − θ_i)). Therefore the inverse of the FIM is a diagonal matrix with diagonal entries equal to θ_i(1 − θ_i). In addition, the partial derivative of ln P_θ(x) w.r.t. θ_i can be computed in a straightforward manner, resulting in

$$\frac{\partial \ln P_\theta(x)}{\partial \theta_i} = \frac{x_i}{\theta_i} - \frac{1 - x_i}{1 - \theta_i}.$$

Let π‘₯1, . . . , π‘₯𝑁 be 𝑁 samples at step 𝑑 with distribution π‘ƒπœƒπ‘‘ and letπ‘₯1:𝑁 , . . . , π‘₯𝑁 :𝑁 be the ranked samples according to 𝑓 . The natural gradientupdate (15) with Bernoulli measures is then given by

πœƒπ‘‘+𝛿𝑑𝑖 = πœƒπ‘‘

𝑖 + 𝛿𝑑 πœƒπ‘‘π‘–(1βˆ’ πœƒπ‘‘

𝑖)π‘βˆ‘

𝑗=1𝑀𝑗

([π‘₯𝑗:𝑁 ]𝑖

πœƒπ‘‘π‘–

βˆ’ 1βˆ’ [π‘₯𝑗:𝑁 ]𝑖1βˆ’ πœƒπ‘‘

𝑖

)(28)

where w_j = w((j − 1/2)/N)/N and [y]_i denotes the i-th coordinate of y ∈ X. The previous equation simplifies to

$$\theta_i^{t+\delta t} = \theta_i^t + \delta t \sum_{j=1}^N w_j \left( [x_{j:N}]_i - \theta_i^t \right),$$

or, denoting w̄ the sum of the weights Σ_{j=1}^N w_j,

$$\theta_i^{t+\delta t} = (1 - \bar w\, \delta t)\, \theta_i^t + \delta t \sum_{j=1}^N w_j\, [x_{j:N}]_i\,.$$


If we set the IGO weights as w1 = 1, w_j = 0 for j = 2, ..., N, we recover the PBIL/EGA algorithm with update rule towards the best solution only (disregarding the mutation of the probability vector), with δt = LR, where LR is the so-called learning rate of the algorithm [Bal94, Figure 4]. The PBIL update rule towards the μ best solutions, proposed in [BC95, Figure 4]², can be recovered as well using

δt = LR,
w_j = (1 − LR)^{j−1}, for j = 1, ..., μ,
w_j = 0, for j = μ + 1, ..., N.

Interestingly, the parameters θ_i are the expectation parameters described in Section 3: indeed, the expectation of x_i is θ_i. So the formulas above are particular cases of (27). Thus, by Theorem 15, PBIL is both a smoothed CEM in these parameters and an IGO-ML algorithm.

Let us now consider another, so-called "logit" representation, given by the logistic function P(x_i = 1) = 1/(1 + exp(−θ_i)). This θ is the exponential parametrization of Section 2.3. We find that

$$\frac{\partial \ln P_\theta(x)}{\partial \theta_i} = (x_i - 1) + \frac{\exp(-\theta_i)}{1 + \exp(-\theta_i)} = x_i - \mathbb{E}\, x_i$$

(cf. (16)) and that the diagonal elements of the Fisher information matrix are given by exp(−θ_i)/(1 + exp(−θ_i))² = Var x_i (compare with (17)). So the natural gradient update (15) with Bernoulli measures now reads

$$\theta_i^{t+\delta t} = \theta_i^t + \delta t\, \big(1 + \exp(\theta_i^t)\big) \left( -\bar w + \big(1 + \exp(-\theta_i^t)\big) \sum_{j=1}^N w_j\, [x_{j:N}]_i \right).$$

To better compare the update with the previous representation, note that θ̄_i = 1/(1 + exp(−θ_i)) and thus we can rewrite

$$\theta_i^{t+\delta t} = \theta_i^t + \frac{\delta t}{\bar\theta_i^{\,t}\,(1 - \bar\theta_i^{\,t})} \sum_{j=1}^N w_j \left( [x_{j:N}]_i - \bar\theta_i^{\,t} \right).$$

So the direction of the update is the same as before and is given by the proportion of bits set to 0 or 1 in the sample, compared to its expected value under the current distribution. The magnitude of the update is different since the parameter θ ranges from −∞ to +∞ instead of from 0 to 1. We did not find this algorithm in the literature.
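Because the natural gradient is intrinsic, a step in the logit parametrization, mapped back through the logistic function, must agree with the step in the expectation parametrization up to O(δt²). A quick numerical check, with illustrative values for the current parameter and for the weighted sum over the sample:

```python
import math

# First-order consistency of the two parametrizations above (illustrative numbers):
# a natural-gradient step in logit coordinates, mapped back through the logistic
# function, agrees with the step in expectation coordinates up to O(dt^2).

theta_bar = 0.3   # current P(x_i = 1)
g = 0.25          # illustrative value of sum_j w_j * ([x_{j:N}]_i - theta_bar)

def step_expectation(dt):
    return theta_bar + dt * g

def step_logit(dt):
    logit = math.log(theta_bar / (1 - theta_bar))
    logit += dt * g / (theta_bar * (1 - theta_bar))   # natural gradient in logit space
    return 1 / (1 + math.exp(-logit))                 # map back to P(x_i = 1)

for dt in [0.1, 0.05, 0.025]:
    gap = abs(step_expectation(dt) - step_logit(dt))
    print(dt, gap)  # gap shrinks roughly like dt^2
```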

These updates illustrate the influence of setting the sum of weights to 0 or not (Section 2.7). If, at some time, the first bit is set to 1 both for a majority of good points and for a majority of bad points, then the original PBIL will increase the probability of setting the first bit to 1, which is counterintuitive. If the weights w_i are chosen to sum to 0, this noise effect disappears; otherwise, it disappears only on average.

²Note that the pseudocode for the algorithm in [BC95, Figure 4] is slightly erroneous since it gives smaller weights to better individuals. The error can be fixed by updating the probability in reversed order, looping from NUMBER_OF_VECTORS_TO_UPDATE_FROM to 1. This was confirmed by S. Baluja in personal communication. We consider here the corrected version of the algorithm.


4.2 Multivariate normal distributions (Gaussians)

Evolution strategies [Rec73, Sch95, BS02] are black-box optimization algorithms for the continuous search domain, X ⊆ R^d (for simplicity we assume X = R^d in the following). They sample new solutions from a multivariate normal distribution. In the context of continuous black-box optimization, Natural Evolution Strategies (NES) introduced the idea of using a natural gradient update of the distribution parameters [WSPS08, SWSS09, GSS+10]. Surprisingly, the well-known Covariance Matrix Adaptation Evolution Strategy, CMA-ES [HO96, HO01, HMK03, HK04, JA06], also turns out to conduct a natural gradient update of distribution parameters [ANOK10, GSS+10].

Let π‘₯ ∈ R𝑑. As the most prominent example, we use mean vectorπ‘š = Eπ‘₯ and covariance matrix 𝐢 = E(π‘₯βˆ’π‘š)(π‘₯βˆ’π‘š)T = E(π‘₯π‘₯T)βˆ’π‘šπ‘šT toparametrize the distribution via πœƒ = (π‘š, 𝐢). The IGO update in (14) or (15)depends on the chosen parametrization, but can now be entirely reformulatedwithout the (inverse) Fisher matrix, compare (18). The complexity of theupdate is linear in the number of parameters (size of πœƒ = (π‘š, 𝐢), where(𝑑2 βˆ’ 𝑑)/2 parameters are redundant). We discuss known algorithms thatimplement variants of this update.

CMA-ES. The CMA-ES implements the equations³

$$m^{t+1} = m^t + \eta_{\mathrm m} \sum_{i=1}^N w_i\, (x_i - m^t) \qquad (29)$$

$$C^{t+1} = C^t + \eta_{\mathrm c} \sum_{i=1}^N w_i \Big( (x_i - m^t)(x_i - m^t)^{\mathsf T} - C^t \Big) \qquad (30)$$

where the w_i are the weights based on ranked f-values, see (12) and (14).

When η_c = η_m, Equations (29) and (30) coincide with the IGO update (14) expressed in the parametrization (m, C) [ANOK10, GSS+10]⁴. Note, however, that the learning rates η_m and η_c take essentially different values in CMA-ES if N ≪ dim Θ⁵: this is in deviation from an IGO algorithm. (Remark that the Fisher information matrix is block-diagonal in m and C [ANOK10], so that the application of the different learning rates and of the inverse Fisher matrix commute.)
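A minimal sketch of the rank-μ updates (29)-(30) is given below for d = 2, with η_m = η_c = 1, truncation weights, and no evolution paths; this is an illustrative toy, not the full CMA-ES (which uses different learning rates and evolution paths):

```python
import random

# Illustrative sketch of the updates (29)-(30) in d = 2, eta_m = eta_c = 1,
# truncation weights summing to 1, no evolution paths. Not the full CMA-ES.

random.seed(0)
d, N = 2, 40
m = [5.0, -3.0]
C = [[1.0, 0.0], [0.0, 1.0]]
f = lambda x: x[0] ** 2 + x[1] ** 2          # sphere function, minimum at the origin

def sample(m, C):
    # Sample from N(m, C) via the Cholesky factor of the 2x2 covariance.
    l11 = C[0][0] ** 0.5
    l21 = C[1][0] / l11
    l22 = max(C[1][1] - l21 * l21, 0.0) ** 0.5
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    return [m[0] + l11 * z1, m[1] + l21 * z1 + l22 * z2]

for _ in range(30):
    xs = sorted((sample(m, C) for _ in range(N)), key=f)
    w = [4.0 / N if r < N // 4 else 0.0 for r in range(N)]   # rank weights, sum 1
    dm = [sum(w[r] * (xs[r][i] - m[i]) for r in range(N)) for i in range(2)]
    dC = [[sum(w[r] * ((xs[r][i] - m[i]) * (xs[r][j] - m[j]) - C[i][j])
               for r in range(N)) for j in range(2)] for i in range(2)]
    m = [m[i] + dm[i] for i in range(2)]
    C = [[C[i][j] + dC[i][j] for j in range(2)] for i in range(2)]

print(m)  # the mean approaches the optimum at the origin
```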

³The CMA-ES implements these equations given the parameter setting c₁ = 0 and c_σ = 0 (or d_σ = ∞, see e.g. [Han09]) that disengages the effect of both so-called evolution paths.

⁴In these articles the result has been derived for θ ← θ + η ∇̃_θ E_{P_θ} f, see (9), leading to f(x_i) in place of w_i. No assumptions on f have been used besides that it does not depend on θ. By replacing f with W^f_{θ^t}, where θ^t is fixed, the derivation holds equally well for θ ← θ + η ∇̃_θ E_{P_θ} W^f_{θ^t}.

⁵Specifically, let Σ|w_i| = 1; then η_m = 1 and η_c ≈ 1 ∧ 1/(d² Σ w_i²).

Natural evolution strategies. Natural evolution strategies (NES) [WSPS08, SWSS09] implement (29) as well, but use a Cholesky decomposition of C as parametrization for the update of the variance parameters. The resulting update that replaces (30) is neither particularly elegant nor numerically efficient. However, the most recent xNES [GSS+10] chooses an "exponential" parametrization depending on the current parameters. This leads to an elegant formulation where the additive update in exponential parametrization becomes a multiplicative update for C in (30). With C = AAᵀ, the matrix update reads

$$A \leftarrow A \times \exp\left( \frac{\eta_{\mathrm c}}{2} \sum_{i=1}^N w_i \Big( A^{-1}(x_i - m)\,\big(A^{-1}(x_i - m)\big)^{\mathsf T} - \mathrm{I}_d \Big) \right) \qquad (31)$$

where I_d is the identity matrix. (From (31) we get C ← A × exp(⋯)² × Aᵀ.) Remark that in the representation θ = (A⁻¹m, A⁻¹C(Aᵀ)⁻¹) = (A⁻¹m, I_d), the Fisher information matrix becomes diagonal.

The update has the advantage over (30) that even negative weights always lead to feasible values. By default, η_m = η_c in xNES in the same circumstances as in CMA-ES (most parameter settings are borrowed from CMA-ES), but contrary to CMA-ES the past evolution path is not taken into account [GSS+10].

When η_c = η_m, xNES is consistent with the IGO flow (6), and implements a slightly generalized IGO algorithm (14) using a θ-dependent parametrization.

Cross-entropy method and EMNA. Estimation of distribution algorithms (EDA) and the cross-entropy method (CEM) [Rub99, RK04] estimate a new distribution from a censored sample. Generally, the new parameter value can be written as

$$\theta_{\mathrm{maxLL}} = \arg\max_\theta \sum_{i=1}^N w_i \ln P_\theta(x_i) \qquad (32)$$

$$\xrightarrow[N\to\infty]{} \arg\max_\theta\; \mathbb{E}_{P_{\theta^t}} \big[ W^f_{\theta^t} \ln P_\theta \big]$$

For positive weights, θ_maxLL maximizes the weighted log-likelihood of x1, ..., xN. The argument under the arg max in the RHS of (32) is the negative cross-entropy between P_θ and the distribution of censored (elitist) samples P_{θ^t} W^f_{θ^t}, or of N realizations of such samples. The distribution P_{θ_maxLL} therefore has minimal cross-entropy and minimal KL-divergence to the distribution of the μ best samples.⁶ As said above, (32) recovers the cross-entropy method (CEM) [Rub99, RK04].

For Gaussian distributions, equation (32) can be written explicitly in the form

$$m^{t+1} = m^* = \sum_{i=1}^N w_i\, x_i \qquad (33)$$

$$C^{t+1} = C^* = \sum_{i=1}^N w_i\, (x_i - m^*)(x_i - m^*)^{\mathsf T} \qquad (34)$$

the empirical mean and variance of the elite sample. The weights w_i are equal to 1/μ for the μ best points and 0 otherwise.

Equations (33) and (34) also define the most fundamental continuous-domain EDA, the estimation of multivariate normal algorithm (EMNA_global, [LL02]). It might be interesting to notice that (33) and (34) only differ from (29) and (30) in that the new mean m^{t+1} is used in the covariance matrix update.

⁶Let P_w denote the distribution of the weighted samples: Pr(x = x_i) = w_i and Σ_i w_i = 1. Then the cross-entropy between P_w and P_θ reads Σ_i P_w(x_i) ln 1/P_θ(x_i), and the KL-divergence reads KL(P_w ‖ P_θ) = Σ_i P_w(x_i) ln 1/P_θ(x_i) − Σ_i P_w(x_i) ln 1/P_w(x_i). Minimization of both terms in θ results in θ_maxLL.


Let us compare IGO-ML (23), CMA (29)-(30), and smoothed CEM (25) in the parametrization with mean and covariance matrix. For learning rate δt = 1, IGO-ML and smoothed CEM/EMNA realize θ_maxLL from (32)-(34). For δt < 1 these algorithms all update the distribution mean in the same way; the update of mean and covariance matrix is computed to be

$$m^{t+1} = (1-\delta t)\, m^t + \delta t\, m^*$$
$$C^{t+1} = (1-\delta t)\, C^t + \delta t\, C^* + \delta t\,(1-\delta t)^j\, (m^* - m^t)(m^* - m^t)^{\mathsf T} \qquad (35)$$

for different values of j, where m* and C* are the mean and covariance matrix computed over the elite sample (with weights w_i) as above. For CMA we have j = 0, for IGO-ML we have j = 1, and for smoothed CEM we have j = ∞ (the rightmost term is absent). The case j = 2 corresponds to an update that uses m^{t+1} instead of m^t in (30) (compare [Han06b]). For 0 < δt < 1, the larger j, the smaller C^{t+1}.

The rightmost term of (35) resembles the so-called rank-one update inCMA.

For δt → 0, the update is independent of j at first order in δt if j < ∞: this reflects compatibility with the IGO flow of CMA and of IGO-ML, but not of smoothed CEM.

Critical δt. Let us assume that the weights w_i are non-negative and sum to one, and that μ < N of the weights take the value 1/μ, so that the selection quantile is q = μ/N.

There is a critical value of δt depending on this quantile q, such that above the critical δt the algorithms given by IGO-ML and smoothed CEM are prone to premature convergence. Indeed, let f be a linear function on R^d, and consider the variance in the direction of the gradient of f. Assuming further N → ∞ and q ≤ 1/2, the variance C* of the elite sample is smaller than the current variance C^t by a constant factor. Depending on the precise update for C^{t+1}, if δt is too large, the variance C^{t+1} will be smaller than C^t by a constant factor as well. This implies that the algorithm will stall. (On the other hand, the continuous-time IGO flow corresponding to δt → 0 does not stall; see Section 4.3.)

We now study the critical δt under which the algorithm does not stall. For IGO-ML (j = 1 in (35), or equivalently for the smoothed CEM in the expectation parameters (m, C + mmᵀ), see Section 3), the variance increases if and only if δt is smaller than the critical value

$$\delta t_{\mathrm{crit}} = q\, b\, \sqrt{2\pi}\, e^{b^2/2},$$

where b is the percentile function of q, i.e. b is such that $q = \int_b^\infty e^{-x^2/2}\,\mathrm{d}x / \sqrt{2\pi}$. This value δt_crit is plotted as the solid line in Fig. 1. For j = 2, δt_crit is smaller, related to the above by δt_crit ← √(1 + δt_crit) − 1, and plotted as the dashed line in Fig. 1. For CEM (j = ∞), the critical δt is zero. For CMA-ES (j = 0), the critical δt is infinite for q < 1/2. When the selection quantile q is above 1/2, the critical δt becomes zero for all algorithms.
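The critical value can be evaluated numerically from the formula above (using the standard normal quantile function; this is a sketch of the stated formula, not of its derivation):

```python
import math
from statistics import NormalDist

# The critical step size as a function of the truncation quantile q:
# dt_crit = q * b * sqrt(2*pi) * exp(b^2 / 2), with b such that q = Pr(Z > b)
# for a standard normal Z. The j = 2 case uses the stated transformation.

def dt_crit_j1(q):
    b = NormalDist().inv_cdf(1.0 - q)      # b is the (1-q)-quantile of N(0, 1)
    return q * b * math.sqrt(2 * math.pi) * math.exp(b * b / 2)

def dt_crit_j2(q):
    return math.sqrt(1.0 + dt_crit_j1(q)) - 1.0

for q in [0.05, 0.1, 0.25, 0.4, 0.5]:
    print(q, round(dt_crit_j1(q), 3), round(dt_crit_j2(q), 3))
# dt_crit decreases from about 1 (small q) down to 0 at q = 1/2, as in Fig. 1
```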

We conclude that, despite the principled approach of ascending the natural gradient, the choice of the weighting function w, the choice of δt, and the possible choices in the update for δt > 0 need to be made with great care in relation to the choice of parametrization.

4.3 Computing the IGO flow for some simple examples

In this section we take a closer look at the IGO flow solutions of (6) for some simple examples of fitness functions, for which it is possible to obtain exact information about these IGO trajectories.



Figure 1: Critical δt versus truncation quantile q. Above the critical δt, the variance decreases systematically when optimizing a linear function. For CMA-ES/NES, the critical δt for q < 0.5 is infinite.

We start with the discrete search space X = {0, 1}^d and linear functions (to be minimized) defined as f(x) = c − Σ_{i=1}^d α_i x_i with α_i > 0. Note that the onemax function to be maximized, f_onemax(x) = Σ_{i=1}^d x_i, is covered by setting α_i = 1. The differential equation (6) for the Bernoulli measures P_θ(x) = p_{θ1}(x1) ... p_{θd}(xd) defined on X can be derived by taking the limit of the IGO-PBIL update given in (28):

$$\frac{\mathrm{d}\theta_i^t}{\mathrm{d}t} = \int W^f_{\theta^t}(x)\, (x_i - \theta_i^t)\, P_{\theta^t}(\mathrm{d}x) =: g_i(\theta^t)\,. \qquad (36)$$

Though finding the analytical solution of the differential equation (36) for any initial condition seems a bit intricate, we show that the equation admits one stable critical point, (1, ..., 1), and one unstable critical point, (0, ..., 0). In addition we prove that the trajectory decreases along f in the sense df(θ^t)/dt ≤ 0. To do so we establish the following result:

Lemma 17. On f(x) = c − Σ_{i=1}^d α_i x_i the solution of (36) satisfies Σ_{i=1}^d α_i dθ_i/dt ≥ 0; moreover Σ α_i dθ_i/dt = 0 if and only if θ = (0, ..., 0) or θ = (1, ..., 1).

Proof. We compute Σ_{i=1}^d α_i g_i(θ^t) and find that

$$\sum_{i=1}^d \alpha_i \frac{\mathrm{d}\theta_i^t}{\mathrm{d}t} = \int W^f_{\theta^t}(x) \left( \sum_{i=1}^d \alpha_i x_i - \sum_{i=1}^d \alpha_i \theta_i^t \right) P_{\theta^t}(\mathrm{d}x)$$

$$= \int W^f_{\theta^t}(x)\, \big( f(\theta^t) - f(x) \big)\, P_{\theta^t}(\mathrm{d}x)$$

$$= \mathbb{E}\big[W^f_{\theta^t}(x)\big]\, \mathbb{E}[f(x)] - \mathbb{E}\big[W^f_{\theta^t}(x)\, f(x)\big],$$

where we used that, f being affine, f(θ^t) = E[f(x)] under P_{θ^t}. In addition, −W^f_{θ^t}(x) = −w(Pr(f(x′) < f(x))) is a nondecreasing bounded function of the variable f(x), so that −W^f_{θ^t}(x) and f(x) are positively correlated (see [Tho00, Chapter 1] for a proof of this result), i.e.

$$\mathbb{E}\big[-W^f_{\theta^t}(x)\, f(x)\big] \;\ge\; \mathbb{E}\big[-W^f_{\theta^t}(x)\big]\, \mathbb{E}[f(x)],$$

with equality if and only if θ^t = (0, ..., 0) or θ^t = (1, ..., 1). Thus Σ_{i=1}^d α_i dθ_i^t/dt ≥ 0.


The previous result implies that the function V(θ) = Σ_{i=1}^d α_i − Σ_{i=1}^d α_i θ_i, which is positive definite around (1, ..., 1), satisfies V̇(θ) = ∇V(θ) · g(θ) ≤ 0 (such a function is called a Lyapunov function). Consequently (1, ..., 1) is stable. Similarly, V(θ) = Σ_{i=1}^d α_i θ_i is a Lyapunov function for (0, ..., 0) such that ∇V(θ) · g(θ) ≥ 0. Consequently (0, ..., 0) is unstable [AO08].

We now consider on R^d the family of multivariate normal distributions P_θ = N(m, σ²I_d) with covariance matrix equal to σ²I_d. The parameter θ thus has d + 1 components, θ = (m, σ) ∈ R^d × R. The natural gradient update using this family was derived in [GSS+10]; from it we can derive the IGO differential equation, which reads

$$\frac{\mathrm{d}m^t}{\mathrm{d}t} = \int_{\mathbb{R}^d} W^f_{\theta^t}(x)\, (x - m^t)\, p_{\mathcal N(m^t, (\sigma^t)^2 I_d)}(x)\, \mathrm{d}x \qquad (37)$$

$$\frac{\mathrm{d}\tilde\sigma^t}{\mathrm{d}t} = \int_{\mathbb{R}^d} \frac{1}{2d} \left\{ \sum_{i=1}^d \left( \left( \frac{x_i - m_i^t}{\sigma^t} \right)^2 - 1 \right) \right\} W^f_{\theta^t}(x)\, p_{\mathcal N(m^t, (\sigma^t)^2 I_d)}(x)\, \mathrm{d}x \qquad (38)$$

where σ^t and σ̃^t are linked via σ^t = exp(σ̃^t) or σ̃^t = ln(σ^t). Denoting by N a random vector following a centered multivariate normal distribution with covariance matrix identity, we write the gradient flow equivalently as

dπ‘šπ‘‘

d𝑑= πœŽπ‘‘E

[π‘Š 𝑓

πœƒπ‘‘(π‘šπ‘‘ + πœŽπ‘‘π’© )𝒩]

(39)

d��𝑑

d𝑑= E

[12

(‖𝒩‖2

π‘‘βˆ’ 1

)π‘Š 𝑓

πœƒπ‘‘(π‘šπ‘‘ + πœŽπ‘‘π’© )]

. (40)

Let us analyze the solution of the previous system on linear functions. Without loss of generality (because of invariance) we can consider the linear function f(x) = x₁. We have

π‘Š π‘“πœƒπ‘‘(π‘₯) = 𝑀(Pr(π‘šπ‘‘

1 + πœŽπ‘‘π‘1 < π‘₯1))

where 𝑍1 follows a standard normal distribution and thus

π‘Š π‘“πœƒπ‘‘(π‘šπ‘‘ + πœŽπ‘‘π’© ) = 𝑀( Pr

𝑍1βˆΌπ’© (0,1)(𝑍1 < 𝒩1)) (41)

= 𝑀(β„±(𝒩1)) (42)

with ℱ the cumulative distribution function of a standard normal distribution, and N₁ the first component of N. The differential equation thus simplifies into

dπ‘šπ‘‘

d𝑑= πœŽπ‘‘

βŽ›βŽœβŽœβŽœβŽœβŽE [𝑀(β„±(𝒩1))𝒩1]

0...0

⎞⎟⎟⎟⎟⎠ (43)

d��𝑑

d𝑑= 1

2𝑑E[(|𝒩1|2 βˆ’ 1)𝑀(β„±(𝒩1))

]. (44)

Consider, for example, the weight function associated with truncation selection, i.e. w(q) = 1_{q ≤ q₀} where q₀ ∈ (0, 1], also called intermediate recombination. We find that

dπ‘šπ‘‘1

d𝑑= πœŽπ‘‘E[𝒩11{𝒩16β„±βˆ’1(π‘ž0)}] =: πœŽπ‘‘π›½ (45)

d��𝑑

d𝑑= 1

2𝑑

(∫ π‘ž0

0β„±βˆ’1(𝑒)2π‘‘π‘’βˆ’ π‘ž0

)=: 𝛼 . (46)


The solution of the IGO flow for the linear function f(x) = x₁ is thus given by

$$m^t_1 = m^0_1 + \frac{\sigma^0 \beta}{\alpha}\, \big( \exp(\alpha t) - 1 \big) \qquad (47)$$

$$\sigma^t = \sigma^0 \exp(\alpha t)\,. \qquad (48)$$

The coefficient β is strictly negative for any q₀ < 1. The coefficient α is strictly positive if and only if q₀ < 1/2, which corresponds to selecting less than half of the sampled points in an ES. In this case the step size σ^t grows exponentially fast to infinity and the mean vector moves along the gradient direction towards −∞ at the same rate. If more than half of the points are selected, q₀ > 1/2, the step size will decrease to zero exponentially fast and the mean vector will get stuck (compare also [Han06a]).
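The constants α and β can be evaluated numerically. Below, β uses the standard identity E[Z·1_{Z ≤ b}] = −φ(b) for the normal density φ, and α is computed by a simple midpoint rule; the quantile values are illustrative:

```python
import math
from statistics import NormalDist

# Numerical values of the constants in (45)-(46) (illustrative check).
# beta = E[Z 1{Z <= F^{-1}(q0)}] is negative for q0 < 1; alpha changes sign at q0 = 1/2.

nd = NormalDist()

def beta(q0):
    # E[Z 1{Z <= b}] = -phi(b), the negative standard normal density at b.
    b = nd.inv_cdf(q0)
    return -math.exp(-b * b / 2) / math.sqrt(2 * math.pi)

def alpha(q0, d=1, steps=10_000):
    # (1/(2d)) * (int_0^{q0} F^{-1}(u)^2 du - q0), by a simple midpoint rule.
    h = q0 / steps
    integral = sum(nd.inv_cdf((k + 0.5) * h) ** 2 for k in range(steps)) * h
    return (integral - q0) / (2 * d)

for q0 in [0.25, 0.5, 0.75]:
    print(q0, round(alpha(q0), 4), round(beta(q0), 4))
# alpha > 0 for q0 < 1/2 (the step size grows), alpha < 0 for q0 > 1/2 (it shrinks)
```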

5 Multimodal optimization using restricted Boltzmann machines

We now illustrate experimentally the influence of the natural gradient, versus the vanilla gradient, on diversity over the course of optimization. We consider a very simple situation of a fitness function with two distant optima and test whether different algorithms are able to reach both optima simultaneously or only converge to one of them. This provides a practical test of Proposition 1, which states that the natural gradient minimizes loss of diversity.

The IGO method allows one to build a natural search algorithm from an arbitrary probability distribution on an arbitrary search space. In particular, by choosing families of probability distributions that are richer than Gaussian or Bernoulli, one may hope to be able to optimize functions with complex shapes. Here we study how this might help optimize multimodal functions.

Both Gaussian distributions on R^d and Bernoulli distributions on {0, 1}^d are unimodal. So at any given time, a search algorithm using such distributions concentrates around a given point in the search space, looking around that point (with some variance). It is an often-sought-after feature for an optimization algorithm to handle multiple hypotheses simultaneously.

In this section we apply our method to a multimodal distribution on {0, 1}^d: restricted Boltzmann machines (RBMs). Depending on the activation state of the latent variables, values for various blocks of bits can be switched on or off, hence multimodality. So the optimization algorithm derived from these distributions will, hopefully, explore several distant zones of the search space at any given time. A related model (Boltzmann machines) was used in [Ber02] and was found to perform better than PBIL on some functions.

Our study of a bimodal RBM distribution for the optimization of a bimodal function confirms that the natural gradient does indeed behave in a more natural way than the vanilla gradient: when initialized properly, the natural gradient is able to maintain diversity by fully using the RBM distribution to learn the two modes, while the vanilla gradient only converges to one of the two modes.

Although these experiments support using a natural gradient approach, they also establish that complications can arise when estimating the inverse Fisher matrix for complex distributions such as RBMs: estimation errors may lead to a singular or unreliable estimate of the Fisher matrix, especially when the distribution becomes singular. Further research is needed for a better understanding of this issue.



Figure 2: The RBM architecture with the observed (x) and latent (h) variables. In our experiments, a single hidden unit was used.

The experiments reported here, and the fitness function used, are extremely simple from the optimization viewpoint: both algorithms, using the natural and the vanilla gradient, find an optimum in only a few steps. The emphasis here is on the specific influence of replacing the vanilla gradient with the natural gradient, and the resulting influence on diversity and multimodality, in a simple situation.

5.1 IGO for restricted Boltzmann machines

Restricted Boltzmann machines. A restricted Boltzmann machine (RBM) [Smo86, AHS85] is a kind of probability distribution belonging to the family of undirected graphical models (also known as Markov random fields). A set of observed variables x ∈ {0, 1}^{n_x} is given a probability using their joint distribution with unobserved latent variables h ∈ {0, 1}^{n_h} [Gha04]. The latent variables are then marginalized over. See Figure 2 for the graph structure of an RBM.

The probability associated with an observation x and latent variable h is given by

    P_θ(x, h) = exp(−E(x, h)) / ∑_{x′,h′} exp(−E(x′, h′)),    P_θ(x) = ∑_h P_θ(x, h),    (49)

where E(x, h) is the so-called energy function and ∑_{x′,h′} exp(−E(x′, h′)) is the partition function, denoted Z in Section 2.3. The energy E is defined by

    E(x, h) = −∑_i a_i x_i − ∑_j b_j h_j − ∑_{i,j} w_ij x_i h_j.    (50)

The distribution is fully parametrized by the bias on the observed variables a, the bias on the latent variables b, and the weights W which account for pairwise interactions between observed and latent variables: θ = (a, b, W).
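As an illustration, the joint distribution (49)–(50) can be computed exactly for a small RBM by enumerating all configurations. The following sketch is our own illustration, not code from the paper:

```python
import itertools
import numpy as np

def rbm_probabilities(a, b, W):
    """Exact P_theta(x, h) of equations (49)-(50) for a small RBM,
    by brute-force enumeration (tractable only for small n_x + n_h)."""
    nx, nh = len(a), len(b)
    states, energies = [], []
    for x in itertools.product([0, 1], repeat=nx):
        for h in itertools.product([0, 1], repeat=nh):
            xa, ha = np.array(x), np.array(h)
            E = -a @ xa - b @ ha - xa @ W @ ha   # energy (50)
            states.append((x, h))
            energies.append(E)
    weights = np.exp(-np.array(energies))        # e^{-E(x, h)}
    return states, weights / weights.sum()       # normalize by the partition function
```

With all parameters set to zero, the energy is constant and the distribution is uniform over the 2^{n_x + n_h} configurations, which gives a quick sanity check.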

RBM distributions are a special case of exponential family distributions with latent variables (see (21) in Section 2.3). The RBM IGO equations stem from Equations (16), (17) and (18) by identifying the statistics T(x) with x_i, h_j or x_i h_j.

For these distributions, the gradient of the log-likelihood is well known [Hin02]. Although it is often considered intractable in the context of machine learning, where a lot of variables are required, it becomes tractable for smaller RBMs. The gradient of the log-likelihood consists of the difference of two expectations (compare (16)):

πœ• ln π‘ƒπœƒ(π‘₯)πœ•π‘€π‘–π‘—

= Eπ‘ƒπœƒ[π‘₯π‘–β„Žπ‘— |π‘₯]βˆ’ Eπ‘ƒπœƒ

[π‘₯π‘–β„Žπ‘— ] (51)



The Fisher information matrix is given by (20):

    I_{w_ab, w_cd}(θ) = E[x_a h_b x_c h_d] − E[x_a h_b] E[x_c h_d]    (52)

where I_{w_ab, w_cd} denotes the entry of the Fisher matrix corresponding to the components w_ab and w_cd of the parameter θ. These equations are understood to encompass the biases a and b, by noticing that the biases can be replaced in the model by adding two variables x_k and h_k always equal to one.

Finally, the IGO update rule is taken from (14):

    θ^{t+δt} = θ^t + δt I^{−1}(θ^t) ∑_{k=1}^{N} ŵ_k ∂ ln P_θ(x_k)/∂θ |_{θ=θ^t}.    (53)

Implementation. In this experimental study, the IGO algorithm for RBMs is directly implemented from Equation (53). At each optimization step, the algorithm consists in (1) finding a reliable estimate of the Fisher matrix (see Eq. 52), which is then inverted using the QR algorithm if it is invertible; (2) computing an estimate of the vanilla gradient, which is then weighted according to W^f_θ; and (3) updating the parameters, taking into account the gradient step size δt. In order to estimate both the Fisher matrix and the gradient, samples must be drawn from the model P_θ. This is done using Gibbs sampling (see for instance [Hin02]).

Fisher matrix imprecision. The imprecision incurred by the limited sampling size may sometimes lead to a singular estimation of the Fisher matrix (see p. 11 for a lower bound on the number of samples needed). Although having a singular Fisher estimation happens rarely in normal conditions, it occurs with certainty when the probabilities become too concentrated over a few points. This situation arises naturally when the algorithm is allowed to continue the optimization after the optimum has been reached. For this reason, in our experiments, we stop the optimization as soon as both optima have been sampled, thus preventing P_θ from becoming too concentrated.

In a variant of the same problem, the Fisher estimation can become close to singular while still being numerically invertible. This situation leads to unreliable estimates and should therefore be avoided. To evaluate the reliability of the inverse Fisher matrix estimate, we use a cross-validation method: (1) we make two estimates F₁ and F₂ of the Fisher matrix on two distinct sets of points generated from P_θ, and (2) we check that the eigenvalues of F₁ F₂^{−1} are close to 1. In practice, at all steps we check that the average of the eigenvalues of F₁ F₂^{−1} is between 1/2 and 2. If at some point during the optimization the Fisher estimation becomes singular or unreliable according to this criterion, the corresponding run is stopped and considered to have failed.
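This cross-validation criterion can be sketched as follows. The mean of the eigenvalues of F₁ F₂^{−1} equals trace(F₁ F₂^{−1}) divided by the dimension, which avoids an explicit eigendecomposition; the thresholds 1/2 and 2 are those of the text, the function itself is our illustration:

```python
import numpy as np

def fisher_reliable(F1, F2, lo=0.5, hi=2.0):
    """Cross-validation test of a Fisher matrix estimate.

    F1, F2: two estimates computed on two disjoint sets of samples from
    P_theta.  Returns False if F2 is singular or if the mean eigenvalue of
    F1 F2^{-1} (its trace divided by the dimension) falls outside [lo, hi];
    in that case the run is stopped and counted as failed."""
    try:
        M = F1 @ np.linalg.inv(F2)
    except np.linalg.LinAlgError:
        return False                      # singular Fisher estimate
    mean_eig = np.trace(M) / M.shape[0]   # mean eigenvalue of F1 F2^{-1}
    return lo <= mean_eig <= hi
```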

When optimizing on a larger space helps. A restricted Boltzmann machine naturally defines a distribution P_θ on both visible and hidden units (x, h), whereas the function to optimize depends only on the visible units x. Thus we are faced with a choice. A first possibility is to decide that the objective function f(x) is really a function of (x, h) where h is a dummy variable; then we can use the IGO algorithm to optimize over (x, h) using the distributions P_θ(x, h). A second possibility is to marginalize P_θ(x, h) over the hidden units h as in (49), to define the distribution P_θ(x); then we can use the IGO algorithm to optimize over x using P_θ(x).



These two approaches yield slightly different algorithms. Both were tested and found to be viable. However, the first approach is numerically more stable and requires fewer samples to estimate the Fisher matrix. Indeed, if I₁(θ) is the Fisher matrix at θ in the first approach and I₂(θ) in the second approach, we always have I₁(θ) ≥ I₂(θ) (in the sense of positive-definite matrices). This is because probability distributions on the pair (x, h) carry more information than their projections on x only, and so the computed Kullback–Leibler distances will be larger.

In particular, there are (isolated) values of θ for which the Fisher matrix I₂ is not invertible whereas I₁ is. For this reason, we selected the first approach.

5.2 Experimental results

In our experiments, we look at the optimization of the two-min function defined below with a bimodal RBM: an RBM with only one latent variable. Such an RBM is bimodal because it has two possible configurations of the latent variable, h = 0 or h = 1, and given h, the observed variables are independent and distributed according to two Bernoulli distributions.

Fix a parameter y ∈ {0, 1}^d. The two-min function is defined as follows:

    f_y(x) = min( ∑_i |x_i − y_i| , ∑_i |(1 − x_i) − y_i| )    (54)

This function of x has two optima: one at y, the other at its binary complement ȳ.
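A direct transcription of (54), for illustration:

```python
def two_min(x, y):
    """Two-min fitness (54): Hamming-style distance of x to the nearer of
    y and its binary complement (both are global optima with value 0)."""
    d_to_y = sum(abs(xi - yi) for xi, yi in zip(x, y))
    d_to_complement = sum(abs((1 - xi) - yi) for xi, yi in zip(x, y))
    return min(d_to_y, d_to_complement)
```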

For the quantile rewriting of f (Section 1.2), we chose the function w to be w(q) = 1_{q ≤ 1/2}, so that points which are below the median are given the weight 1, whereas other points are given the weight 0. Also, in accordance with (3), if several points have the same fitness value, their weight W^f_θ is set to the average of w over all those points.

For initialization of the RBMs, the weights W are sampled from a normal distribution centered around zero with standard deviation 1/√(n_x × n_h), where n_x is the number of observed variables (dimension d of the problem) and n_h is the number of latent variables (n_h = 1 in our case), so that initially the energies E are not too large. Then the bias parameters are initialized as b_j ← −(1/2) ∑_i w_ij and a_i ← −(1/2) ∑_j w_ij + 𝒩(0.01/n_x²), so that each variable (observed or latent) has a probability of activation close to 1/2.

In the following experiments, we show the results of IGO optimization and vanilla gradient optimization for the two-min function in dimension 40, for various values of the step size δt. For each δt, we present the median of the quantity of interest over 100 runs. Error bars indicate the 16th percentile and the 84th percentile (this is the same as mean ± stddev for a Gaussian variable, but is invariant under f-reparametrization). For each run, the parameter y of the two-min function is sampled randomly, in order to ensure that the presented results are not dependent on a particular choice of optima.
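The initialization described above can be sketched as follows. Note that the scale of the Gaussian perturbation on the visible biases, 𝒩(0, 0.01/n_x²), is our reading of the (garbled) source and should be treated as an assumption; the rest follows the text directly:

```python
import numpy as np

def init_rbm(nx, nh, seed=0):
    """RBM initialization sketch following the text.

    Weights ~ N(0, 1/sqrt(nx * nh)); biases chosen so that every unit has
    activation probability close to 1/2.  The noise scale 0.01/nx**2 on the
    visible biases is an ASSUMPTION (our reading of the paper)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 1.0 / np.sqrt(nx * nh), size=(nx, nh))
    b = -W.sum(axis=0) / 2      # so the mean input b_j + sum_i w_ij/2 is zero
    a = -W.sum(axis=1) / 2 + rng.normal(0.0, 0.01 / nx**2, size=nx)
    return a, b, W
```

The choice b_j = −(1/2)∑_i w_ij makes the expected input to each hidden unit zero when the visible units are uniform, hence activation probability close to 1/2, as stated in the text.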

The number of sample points used for estimating the Fisher matrix is set to 10,000: large enough (by far) to ensure the stability of the estimates. The same points are used for estimating the integral of (14); therefore there are 10,000 calls to the fitness function at each gradient step. These rather comfortable settings allow for a good illustration of the theoretical properties of the N = ∞ IGO flow limit.

The number of runs that fail after the occurrence of a singular matrix or an unreliable estimate amounts to less than 10% for δt ≤ 2 (as little as 3% for the smallest learning rate), but can increase up to 30% for higher learning rates.

5.2.1 Convergence

We first check that both the vanilla and the natural gradient are able to converge to an optimum.

Figures 3 and 4 show the fitness of the best sampled point for the IGO algorithm and for the vanilla gradient at each step. Predictably, both algorithms are able to optimize the two-min function in a few steps.

The two-min function is extremely simple from an optimization viewpoint; thus, convergence speed is not the main focus here, all the more since we use a large number of f-calls at each step.

Note that the values of the parameter δt for the two gradients used are not directly comparable from a theoretical viewpoint (they correspond to parametrizations of different trajectories in Θ-space, and identifying the vanilla δt with the natural δt is meaningless). We selected larger values of δt for the vanilla gradient, in order to obtain roughly comparable convergence speeds in practice.

5.2.2 Diversity

As we have seen, the two-min function is equally well optimized by the IGO and vanilla gradient optimizations. However, the methods fare very differently when we look at the distance from the sample points to both optima. From (54), the fitness gives the distance of sample points to the closest optimum. We now study how close sample points come to the other, opposite optimum. The distance of sample points to the second optimum is shown in Figure 5 for IGO, and in Figure 6 for the vanilla gradient.

Figure 5 shows that IGO also reaches the second optimum most of the time, and is often able to find it only a few steps after the first. This property of IGO is of course dependent on the initialization of the RBM with enough diversity. When initialized properly, so that each variable (observed and latent) has a probability 1/2 of being equal to 1, the initial RBM distribution has maximal diversity over the search space and is at equal distance from the two optima of the function. From this starting position, the natural gradient is then able to increase the likelihood of the two optima at the same time.

By stark contrast, the vanilla gradient is not able to go towards both optima at the same time, as shown in Fig. 6. In fact, the vanilla gradient only converges to one optimum at the expense of the other. For all values of δt, the distance to the second optimum increases gradually and approaches the maximum possible distance.

As mentioned earlier, each state of the latent variable h corresponds to a mode of the distribution. In Figures 7 and 8, we look at the average value of h at each gradient step. An average value close to 1/2 means that the distribution samples from both modes, h = 0 and h = 1, with comparable probability. Conversely, average values close to 0 or 1 indicate that the distribution gives most probability to one mode at the expense of the other. In Figure 7, we can see that with IGO, the average value of h is close to 1/2 during the whole optimization procedure for most runs: the distribution is initialized with two modes and stays bimodal⁷. As for the vanilla gradient,

⁷In Fig. 7, we use the following adverse setting: runs are interrupted once they reach both optima; therefore the statistics are taken only over those runs which have not yet converged and reached both optima, which results in higher variation around 1/2. The



Figure 3: Fitness of sampled points during IGO optimization (fitness vs. number of gradient steps, for δt = 8, 4, 2, 1, 0.5, 0.25).

Figure 4: Fitness of sampled points during vanilla gradient optimization (fitness vs. number of gradient steps, for δt = 32, 16, 8, 4, 2, 1).



Figure 5: Distance to the second optimum during IGO optimization (distance vs. number of gradient steps, for δt = 8, 4, 2, 1, 0.5, 0.25).

Figure 6: Distance to the second optimum during vanilla gradient optimization (distance vs. number of gradient steps, for δt = 32, 16, 8, 4, 2, 1).



statistics for h are depicted in Figure 8, and we can see that they converge to 1: one of the two modes of the distribution has been lost during optimization.

Hidden breach of symmetry by the vanilla gradient. The experiments reveal a curious phenomenon: the vanilla gradient loses multimodality by always setting the hidden variable h to 1, not to 0. (We detected no obvious asymmetry on the visible units x, though.)

Of course, exchanging the values 0 and 1 for the hidden variables in a restricted Boltzmann machine still gives the distribution of another Boltzmann machine. More precisely, changing h_j into 1 − h_j is equivalent to resetting a_i ← a_i + w_ij, b_j ← −b_j, and w_ij ← −w_ij. IGO and the natural gradient are impervious to such a change by Proposition 9.
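This invariance can be checked numerically on a tiny RBM by brute-force enumeration. The script below (our own illustration) verifies that the transformed parameters define the same marginal distribution on x:

```python
import itertools
import numpy as np

def rbm_marginal(a, b, W):
    """Exact marginal P_theta(x) of a tiny RBM, by enumeration."""
    nx, nh = len(a), len(b)
    p = {}
    for x in itertools.product([0, 1], repeat=nx):
        for h in itertools.product([0, 1], repeat=nh):
            xa, ha = np.array(x), np.array(h)
            p[x] = p.get(x, 0.0) + np.exp(a @ xa + b @ ha + xa @ W @ ha)
    Z = sum(p.values())
    return {x: v / Z for x, v in p.items()}

# Flipping h -> 1 - h (single hidden unit) while resetting
# a_i <- a_i + w_ij, b_j <- -b_j, w_ij <- -w_ij
# leaves the marginal distribution on x unchanged.
rng = np.random.default_rng(1)
a, b, W = rng.normal(size=3), rng.normal(size=1), rng.normal(size=(3, 1))
p1 = rbm_marginal(a, b, W)
p2 = rbm_marginal(a + W[:, 0], -b, -W)
assert all(abs(p1[x] - p2[x]) < 1e-12 for x in p1)
```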

The vanilla gradient implicitly relies on the Euclidean norm on parameter space, as explained in Section 1.1. For this norm, the distance between the RBM distributions (a_i, b_j, w_ij) and (a′_i, b′_j, w′_ij) is simply ∑_i |a_i − a′_i|² + ∑_j |b_j − b′_j|² + ∑_{ij} |w_ij − w′_ij|². However, the change of variables a_i ← a_i + w_ij, b_j ← −b_j, w_ij ← −w_ij does not preserve this Euclidean metric. Thus, exchanging 0 and 1 for the hidden variables will result in two different vanilla gradient ascents. The observed asymmetry on h is a consequence of this implicit asymmetry.

The same asymmetry exists for the visible variables x_i; but this does not prevent convergence to an optimum in our situation, since any gradient descent eventually reaches some local optimum.

Of course it is possible to cook up parametrizations for which the vanilla gradient will be more symmetric: for instance, using −1/1 instead of 0/1 for the variables, or defining the energy by

    E(x, h) = −∑_i A_i (x_i − 1/2) − ∑_j B_j (h_j − 1/2) − ∑_{i,j} W_ij (x_i − 1/2)(h_j − 1/2)    (55)

with "bias-free" parameters A_i, B_j, W_ij related to the usual parametrization by w_ij = W_ij, a_i = A_i − (1/2) ∑_j w_ij, and b_j = B_j − (1/2) ∑_i w_ij. The vanilla gradient might perform better in these parametrizations.
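One can check numerically that, under this change of parameters, the bias-free energy (55) differs from the usual energy (50) only by an additive constant, so both define the same distribution. A small verification script (ours):

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
nx, nh = 3, 2
A, B, Wf = rng.normal(size=nx), rng.normal(size=nh), rng.normal(size=(nx, nh))

# Usual parameters of (50) recovered from the bias-free ones of (55).
w = Wf
a = A - Wf.sum(axis=1) / 2
b = B - Wf.sum(axis=0) / 2

offsets = []
for x in itertools.product([0, 1], repeat=nx):
    for h in itertools.product([0, 1], repeat=nh):
        xa, ha = np.array(x, dtype=float), np.array(h, dtype=float)
        E_usual = -a @ xa - b @ ha - xa @ w @ ha
        E_free = (-A @ (xa - 0.5) - B @ (ha - 0.5)
                  - (xa - 0.5) @ Wf @ (ha - 0.5))
        offsets.append(E_free - E_usual)

# The two energies differ by a constant independent of (x, h),
# so the Boltzmann distributions they define are identical.
assert max(offsets) - min(offsets) < 1e-9
```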

However, we chose a naive approach: we used a family of probability distributions found in the literature, with the parametrization found in the literature. We then used the vanilla gradient and the natural gradient on these distributions. This directly illustrates the specific influence of the chosen gradient (the two implementations only differ by the inclusion of the Fisher matrix). It is remarkable, we think, that the natural gradient is able to recover symmetry where there was none.

5.3 Convergence to the continuous-time limit

In the previous figures, it looks as if changing the parameter δt only results in a time speedup of the plots.

Update rules of the type θ ← θ + δt ∇_θ g (for either gradient) are Euler approximations of the continuous-time ordinary differential equation dθ/dt = ∇_θ g, with each iteration corresponding to an increment δt of the time t. Thus, it is expected that for small enough δt, the algorithm after k steps approximates the IGO flow or the vanilla gradient flow at time t = k·δt.

Figures 9 and 10 illustrate this convergence: we show the fitness with respect to δt times the number of gradient steps. An asymptotic trajectory seems to

plot has been stopped when less than half the runs remain. The error bars are relative only to the remaining runs.



Figure 7: Average value of h during IGO optimization (average vs. number of gradient steps, for δt = 8, 4, 2, 1, 0.5, 0.25).

Figure 8: Average value of h during vanilla gradient optimization (average vs. number of gradient steps, for δt = 32, 16, 8, 4, 2, 1).



emerge when δt decreases. For the natural gradient, it can be interpreted as the fitness of samples of the continuous-time IGO flow.

Thus, for this kind of optimization algorithm, it makes theoretical sense to plot the results according to the "intrinsic time" of the underlying continuous-time object, to illustrate properties that do not depend on the setting of the parameter δt. (Still, the raw number of steps is more directly related to algorithmic cost.)

6 Further discussion

A single framework for optimization on arbitrary spaces. A strength of the IGO viewpoint is to automatically provide optimization algorithms using any family of probability distributions on any given space, discrete or continuous. This has been illustrated with restricted Boltzmann machines. IGO algorithms also feature good invariance properties and make a minimal number of arbitrary choices.

In particular, IGO unifies several well-known optimization algorithms into a single framework. For instance, to the best of our knowledge, PBIL has never been described as a natural gradient ascent in the literature⁸.

For Gaussian measures, algorithms of the same form (14) had been developed previously [HO01, WSPS08], and their close relationship with a natural gradient ascent had been recognized [ANOK10, GSS+10].

The wide applicability of natural gradient approaches seems not to be widely known in the optimization community (though see [MMS08]).

About quantiles. The IGO flow, to the best of our knowledge, has not been defined before. The introduction of the quantile-rewriting (3) of the objective function provides the first rigorous derivation of quantile- or rank-based natural optimization from a gradient ascent in θ-space.

Indeed, NES and CMA-ES have been claimed to maximize −E_{P_θ} f via natural gradient ascent [WSPS08, ANOK10]. However, we have proved that when the number of samples is large and the step size is small, the NES and CMA-ES updates converge to the IGO flow, not to the similar flow with the gradient of E_{P_θ} f (Theorem 4). So we find that in reality these algorithms maximize E_{P_θ} W^f_{θ^t}, where W^f_{θ^t} is a decreasing transformation of the f-quantiles under the current sample distribution.

Also, in practice, maximizing −E_{P_θ} f is a rather unstable procedure and has been discouraged; see for example [Whi89].

About the choice of P_θ: learning a model of good points. The choice of the family of probability distributions P_θ plays a double role.

First, it is analogous to a mutation operator as seen in evolutionary algorithms: indeed, P_θ encodes possible moves according to which new sample points are explored.

Second, optimization algorithms using distributions can be interpreted as learning a probabilistic model of where the points with good values lie in the search space. With this point of view, P_θ describes the richness of this model: for instance, restricted Boltzmann machines with h hidden units can describe distributions with up to 2^h modes, whereas the Bernoulli distribution used in PBIL is unimodal. This influences, for instance, the ability to explore several valleys and optimize multimodal functions in a single run.

⁸Thanks to Jonathan Shapiro for an early argument confirming this property (personal communication).



Figure 9: Fitness of sampled points w.r.t. "intrinsic time" during IGO optimization (fitness vs. gradient steps × δt, for δt = 8, 4, 2, 1, 0.5, 0.25).

Figure 10: Fitness of sampled points w.r.t. "intrinsic time" during vanilla gradient optimization (fitness vs. gradient steps × δt/4, for δt = 32, 16, 8, 4, 2, 1).



Natural gradient and parametrization invariance. Central to IGO is the use of the natural gradient, which follows from θ-invariance and makes sense on any search space, discrete or continuous.

While the IGO flow is exactly θ-invariant, for any practical implementation of an IGO algorithm a parametrization choice has to be made. Still, since all IGO algorithms approximate the IGO flow, two parametrizations of IGO will differ less than two parametrizations of another algorithm (such as the vanilla gradient or the smoothed CEM method), at least if the learning rate δt is not too large.

On the other hand, natural evolution strategies have never strived for θ-invariance: the chosen parametrization (Cholesky, exponential) has been deemed a relevant feature. In the IGO framework, the chosen parametrization becomes more relevant as the step size δt increases.

IGO, maximum likelihood and cross-entropy. The cross-entropy method (CEM) [dBKMR05] can be used to produce optimization algorithms given a family of probability distributions on an arbitrary space, by performing a jump to a maximum likelihood estimate of the parameters.

We have seen (Corollary 16) that the standard CEM is an IGO algorithm in a particular parametrization, with a learning rate δt equal to 1. However, it is well known, both theoretically and experimentally [BLS07, Han06b, WAS04], that standard CEM loses diversity too fast in many situations. The usual solution [dBKMR05] is to reduce the learning rate (smoothed CEM, (25)), but this breaks the reparametrization invariance.

On the other hand, the IGO flow can be seen as a maximum likelihood update with infinitesimal learning rate (Theorem 13). This interpretation makes it possible to define a particular IGO algorithm, IGO-ML (Definition 14): it performs a maximum likelihood update with an arbitrary learning rate, and keeps the reparametrization invariance. It coincides with CEM when the learning rate is set to 1, but it differs from smoothed CEM by the exchange of the order of argmax and averaging (compare (23) and (25)). We argue that this new algorithm may be a better way to reduce the learning rate and achieve smoothing in CEM.

Standard CEM can lose diversity, yet is a particular case of an IGO algorithm: this illustrates the fact that reasonable values of the learning rate δt depend on the parametrization. We have studied this phenomenon in detail for various Gaussian IGO algorithms (Section 4.2).

Why would a smaller learning rate perform better than a large one in an optimization setting? It might seem more efficient to jump directly to the maximum likelihood estimate of the currently known good points, instead of performing a slow gradient ascent towards this maximum.

However, optimization faces a "moving target", contrary to a learning setting in which the example distribution is often stationary. Currently known good points are likely not to indicate the position at which the optimum lies, but rather the direction in which the optimum is to be found. After an update, the next elite sample points are going to be located somewhere new. So the goal is certainly not to settle down around these currently known points, as a maximum likelihood update does: by design, CEM only tries to reflect the status quo (even for N = ∞), whereas IGO tries to move somewhere. When the target moves over time, a progressive gradient ascent is more reasonable than an immediate jump to a temporary optimum, and realizes a kind of time smoothing.

This phenomenon is most clear when the number of sample points is small.



Then, a full maximum likelihood update risks losing a lot of diversity; it may even produce a degenerate distribution if the number of sample points is smaller than the number of parameters of the distribution. On the other hand, for smaller δt, the IGO algorithms do, by design, try to maintain diversity by moving as little as possible from the current distribution P_θ in Kullback–Leibler divergence. A full ML update disregards the current distribution and tries to move as close as possible to the elite sample in Kullback–Leibler divergence [dBKMR05], thus realizing maximal diversity loss. This makes sense in a non-iterated scenario but is unsuited for optimization.

Diversity and multiple optima. The IGO framework emphasizes the relation between the natural gradient and diversity: we argued that IGO provides minimal diversity change for a given objective function increment. In particular, provided the initial diversity is large, diversity is kept at a maximum. This theoretical relationship has been confirmed experimentally for restricted Boltzmann machines.

On the other hand, using the vanilla gradient does not lead to a balanced distribution between the two optima in our experiments. Using the vanilla gradient introduces hidden arbitrary choices between those points (more exactly, between moves in Θ-space). This results in a loss of diversity, and might also be detrimental at later stages of the optimization. This may reflect the fact that the Euclidean metric on the space of parameters, implicitly used in the vanilla gradient, becomes less and less meaningful for gradient descent on complex distributions.

IGO and the natural gradient are certainly relevant to the well-known problem of exploration–exploitation balance: as we have seen, arguably the natural gradient realizes the best increment of the objective function with the least possible change of diversity in the population.

More generally, IGO tries to learn a model of where the good points are, representing all good points seen so far rather than focusing only on some good points; this is typical of machine learning, one of the contexts for which the natural gradient was studied. The conceptual relationship of IGO and IGO-like optimization algorithms with machine learning is still to be explored and exploited.

Summary and conclusion

We sum up:

∙ The information-geometric optimization (IGO) framework derives from invariance principles and allows one to build optimization algorithms from any family of distributions on any search space. In some instances (Gaussian distributions on ℝ^d or Bernoulli distributions on {0, 1}^d) it recovers versions of known algorithms (CMA-ES or PBIL); in other instances (restricted Boltzmann machine distributions) it produces new, hopefully efficient optimization algorithms.

∙ The use of a quantile-based, time-dependent transform of the objective function (Equation (3)) provides a rigorous derivation of the rank-based update rules currently used in optimization algorithms. Theorem 4 uniquely identifies the infinite-population limit of these update rules.

∙ The IGO flow is singled out by its equivalent description as an infinitesimal weighted maximal log-likelihood update (Theorem 13). In a particular parametrization and with a step size of 1, it recovers the cross-entropy method (Corollary 16).

∙ Theoretical arguments suggest that the IGO flow minimizes the change of diversity in the course of optimization. In particular, starting with high diversity and using multimodal distributions may allow simultaneous exploration of multiple optima of the objective function. Preliminary experiments with restricted Boltzmann machines confirm this effect in a simple situation.

Thus, the IGO framework is an attempt to provide sound theoretical foundations to optimization algorithms based on probability distributions. In particular, this viewpoint helps to bridge the gap between continuous and discrete optimization.

The invariance properties, which reduce the number of arbitrary choices, together with the relationship between natural gradient and diversity, may contribute to a theoretical explanation of the good practical performance of those currently used algorithms, such as CMA-ES, which can be interpreted as instantiations of IGO.

We hope that invariance properties will acquire in computer science the importance they have in mathematics, where intrinsic thinking is the first step towards abstract linear algebra or differential geometry, and in modern physics, where the notions of invariance w.r.t. the coordinate system and so-called gauge invariance play central roles.

Acknowledgements

The authors would like to thank Michèle Sebag for the acronym and for helpful comments. Y. O. would like to thank Cédric Villani and Bruno Sévennec for helpful discussions on the Fisher metric. A. A. and N. H. would like to acknowledge the Dagstuhl Seminar No 10361 on the Theory of Evolutionary Computation (http://www.dagstuhl.de/10361) for inspiring their work on natural gradients and beyond. This work was partially supported by the ANR-2010-COSI-002 grant (SIMINOLE) of the French National Research Agency.

Appendix: Proofs

Proof of Theorem 4 (Convergence of empirical means and quantiles)

Let us give a more precise statement including the necessary regularity conditions.

Proposition 18. Let θ ∈ Θ. Assume that the derivative ∂ ln P_θ(x)/∂θ exists for P_θ-almost all x ∈ X and that E_{P_θ} ‖∂ ln P_θ(x)/∂θ‖² < +∞. Assume that the function w is non-decreasing and bounded.

Let (x_i)_{i∈ℕ} be a sequence of independent samples of P_θ. Then with probability 1, as N → ∞ we have

    (1/N) ∑_{i=1}^{N} Ŵ^f(x_i) ∂ ln P_θ(x_i)/∂θ  →  ∫ W^f_θ(x) ∂ ln P_θ(x)/∂θ P_θ(dx)

where Ŵ^f(x_i) = w((rk_N(x_i) + 1/2)/N) with rk_N(x_i) = #{1 ≤ j ≤ N, f(x_j) < f(x_i)}. (When there are f-ties in the sample, Ŵ^f(x_i) is defined as the average of w((r + 1/2)/N) over the possible rankings r of x_i.)

Proof. Let $g : X \to \mathbb{R}$ be any function with $\mathbb{E}_{P_\theta} g^2 < \infty$. We will show that $\frac{1}{N} \sum W^f(x_i)\, g(x_i) \to \int W^f_\theta(x)\, g(x)\, P_\theta(\mathrm{d}x)$. Applying this with $g$ equal to the components of $\frac{\partial \ln P_\theta(x)}{\partial \theta}$ will yield the result.

Let us decompose
\[
\frac{1}{N} \sum W^f(x_i)\, g(x_i) = \frac{1}{N} \sum W^f_\theta(x_i)\, g(x_i) + \frac{1}{N} \sum \bigl(W^f(x_i) - W^f_\theta(x_i)\bigr)\, g(x_i).
\]
Each summand in the first term involves only one sample $x_i$ (contrary to $W^f(x_i)$, which depends on the whole sample). So by the strong law of large numbers, almost surely $\frac{1}{N} \sum W^f_\theta(x_i)\, g(x_i)$ converges to $\int W^f_\theta(x)\, g(x)\, P_\theta(\mathrm{d}x)$. So we have to show that the second term converges to $0$ almost surely.

By the Cauchy–Schwarz inequality, we have
\[
\Bigl( \frac{1}{N} \sum \bigl(W^f(x_i) - W^f_\theta(x_i)\bigr)\, g(x_i) \Bigr)^2 \leq \Bigl( \frac{1}{N} \sum \bigl(W^f(x_i) - W^f_\theta(x_i)\bigr)^2 \Bigr) \Bigl( \frac{1}{N} \sum g(x_i)^2 \Bigr).
\]
By the strong law of large numbers, the second factor $\frac{1}{N} \sum g(x_i)^2$ converges to $\mathbb{E}_{P_\theta} g^2$ almost surely. So we have to prove that the first factor $\frac{1}{N} \sum \bigl(W^f(x_i) - W^f_\theta(x_i)\bigr)^2$ converges to $0$ almost surely.

Since $w$ is bounded by assumption, we can write
\[
\bigl(W^f(x_i) - W^f_\theta(x_i)\bigr)^2 \leq 2B\, \bigl|W^f(x_i) - W^f_\theta(x_i)\bigr| = 2B\, \bigl(W^f(x_i) - W^f_\theta(x_i)\bigr)_+ + 2B\, \bigl(W^f(x_i) - W^f_\theta(x_i)\bigr)_-
\]
where $B$ is the bound on $|w|$. We will bound each of these terms.

Let us abbreviate $q^-_i = \Pr_{x' \sim P_\theta}(f(x') < f(x_i))$, $q^+_i = \Pr_{x' \sim P_\theta}(f(x') \leq f(x_i))$, $r^-_i = \#\{j \leq N,\; f(x_j) < f(x_i)\}$, $r^+_i = \#\{j \leq N,\; f(x_j) \leq f(x_i)\}$.

By definition of $W^f$ we have
\[
W^f(x_i) = \frac{1}{r^+_i - r^-_i} \sum_{k=r^-_i}^{r^+_i - 1} w\bigl((k + 1/2)/N\bigr)
\]
and moreover $W^f_\theta(x_i) = w(q^-_i)$ if $q^-_i = q^+_i$, or $W^f_\theta(x_i) = \frac{1}{q^+_i - q^-_i} \int_{q^-_i}^{q^+_i} w$ otherwise.

The Glivenko–Cantelli theorem states that $\sup_i \bigl| q^+_i - r^+_i/N \bigr|$ tends to $0$ almost surely, and likewise for $\sup_i \bigl| q^-_i - r^-_i/N \bigr|$. So let $N$ be large enough so that these errors are bounded by $\varepsilon$.

Since $w$ is non-increasing, we have $w(q^-_i) \leq w(r^-_i/N - \varepsilon)$. In the case $q^-_i \neq q^+_i$, we decompose the interval $[q^-_i; q^+_i]$ into $(r^+_i - r^-_i)$ subintervals. The average of $w$ over each such subinterval is compared to a term in the sum defining $W^f(x_i)$: since $w$ is non-increasing, the average of $w$ over the $k$th subinterval is at most $w((r^-_i + k)/N - \varepsilon)$. So we get
\[
W^f_\theta(x_i) \leq \frac{1}{r^+_i - r^-_i} \sum_{k=r^-_i}^{r^+_i - 1} w(k/N - \varepsilon)
\]
so that
\[
W^f_\theta(x_i) - W^f(x_i) \leq \frac{1}{r^+_i - r^-_i} \sum_{k=r^-_i}^{r^+_i - 1} \bigl( w(k/N - \varepsilon) - w((k + 1/2)/N) \bigr).
\]
Let us sum over $i$, remembering that there are $(r^+_i - r^-_i)$ values of $j$ for which $f(x_j) = f(x_i)$. Taking the positive part, we get
\[
\frac{1}{N} \sum_{i=1}^N \bigl( W^f_\theta(x_i) - W^f(x_i) \bigr)_+ \leq \frac{1}{N} \sum_{k=0}^{N-1} \bigl( w(k/N - \varepsilon) - w((k + 1/2)/N) \bigr).
\]
Since $w$ is non-increasing we have
\[
\frac{1}{N} \sum_{k=0}^{N-1} w(k/N - \varepsilon) \leq \int_{-\varepsilon - 1/N}^{1 - \varepsilon - 1/N} w
\qquad \text{and} \qquad
\frac{1}{N} \sum_{k=0}^{N-1} w\bigl((k + 1/2)/N\bigr) \geq \int_{1/2N}^{1 + 1/2N} w
\]
(we implicitly extend the range of $w$ so that $w(q) = w(0)$ for $q < 0$). So we have
\[
\frac{1}{N} \sum_{i=1}^N \bigl( W^f_\theta(x_i) - W^f(x_i) \bigr)_+ \leq \int_{-\varepsilon - 1/N}^{1/2N} w - \int_{1 - \varepsilon - 1/N}^{1 + 1/2N} w \leq (2\varepsilon + 3/N)\, B
\]
where $B$ is the bound on $|w|$.

Reasoning symmetrically with $w(k/N + \varepsilon)$ and the inequalities reversed, we get a similar bound for $\frac{1}{N} \sum \bigl( W^f_\theta(x_i) - W^f(x_i) \bigr)_-$. This ends the proof.

Proof of Proposition 5 (Quantile improvement)

Let us use the weight $w(u) = \mathbb{1}_{u \leq q}$. Let $m$ be the value of the $q$-quantile of $f$ under $P_{\theta^t}$. We want to show that the value of the $q$-quantile of $f$ under $P_{\theta^{t+\delta t}}$ is less than $m$, unless the gradient vanishes and the IGO flow is stationary.

Let $p^- = \Pr_{x \sim P_{\theta^t}}(f(x) < m)$, $p_m = \Pr_{x \sim P_{\theta^t}}(f(x) = m)$ and $p^+ = \Pr_{x \sim P_{\theta^t}}(f(x) > m)$. By definition of the quantile value we have $p^- + p_m \geq q$ and $p^+ + p_m \geq 1 - q$. Let us assume that we are in the more complicated case $p_m \neq 0$ (for the case $p_m = 0$, simply remove the corresponding terms).

We have $W^f_{\theta^t}(x) = 1$ if $f(x) < m$, $W^f_{\theta^t}(x) = 0$ if $f(x) > m$, and $W^f_{\theta^t}(x) = \frac{1}{p_m} \int_{p^-}^{p^- + p_m} w(u)\,\mathrm{d}u = \frac{q - p^-}{p_m}$ if $f(x) = m$.

Using the same notation as above, let $g^t(\theta) = \int W^f_{\theta^t}(x)\, P_\theta(\mathrm{d}x)$. Decomposing this integral on the three sets $f(x) < m$, $f(x) = m$ and $f(x) > m$, we get that $g^t(\theta) = \Pr_{x \sim P_\theta}(f(x) < m) + \Pr_{x \sim P_\theta}(f(x) = m)\, \frac{q - p^-}{p_m}$. In particular, $g^t(\theta^t) = q$.

Since we follow a gradient ascent of $g^t$, for $\delta t$ small enough we have $g^t(\theta^{t+\delta t}) \geq g^t(\theta^t)$ unless the gradient vanishes. If the gradient vanishes we have $\theta^{t+\delta t} = \theta^t$ and the quantiles are the same. Otherwise we get $g^t(\theta^{t+\delta t}) > g^t(\theta^t) = q$.

Since $\frac{q - p^-}{p_m} \leq \frac{(p^- + p_m) - p^-}{p_m} = 1$, we have $g^t(\theta) \leq \Pr_{x \sim P_\theta}(f(x) < m) + \Pr_{x \sim P_\theta}(f(x) = m) = \Pr_{x \sim P_\theta}(f(x) \leq m)$.

So $\Pr_{x \sim P_{\theta^{t+\delta t}}}(f(x) \leq m) \geq g^t(\theta^{t+\delta t}) > q$. This implies that, by definition, the $q$-quantile value of $P_{\theta^{t+\delta t}}$ is smaller than $m$.


Proof of Proposition 10 (Speed of the IGO flow)

Lemma 19. Let $X$ be a centered $L^2$ random variable with values in $\mathbb{R}^d$ and let $A$ be a real-valued $L^2$ random variable. Then
\[
\| \mathbb{E}(A X) \| \leq \sqrt{\lambda \operatorname{Var} A}
\]
where $\lambda$ is the largest eigenvalue of the covariance matrix of $X$ expressed in an orthonormal basis.

Proof of the lemma. Let $v$ be any vector in $\mathbb{R}^d$; its norm satisfies
\[
\|v\| = \sup_{w,\, \|w\| \leq 1} (v \cdot w)
\]
and in particular
\[
\begin{aligned}
\| \mathbb{E}(A X) \| &= \sup_{w,\, \|w\| \leq 1} \bigl( w \cdot \mathbb{E}(A X) \bigr) \\
&= \sup_{w,\, \|w\| \leq 1} \mathbb{E}\bigl( A\, (w \cdot X) \bigr) \\
&= \sup_{w,\, \|w\| \leq 1} \mathbb{E}\bigl( (A - \mathbb{E}A)\, (w \cdot X) \bigr) \qquad \text{since } (w \cdot X) \text{ is centered} \\
&\leq \sup_{w,\, \|w\| \leq 1} \sqrt{\operatorname{Var} A}\, \sqrt{\mathbb{E}\bigl( (w \cdot X)^2 \bigr)}
\end{aligned}
\]
by the Cauchy–Schwarz inequality, since $\mathbb{E}\bigl((A - \mathbb{E}A)^2\bigr) = \operatorname{Var} A$.

Now, in an orthonormal basis we have $(w \cdot X) = \sum_i w_i X_i$, so that
\[
\mathbb{E}\bigl( (w \cdot X)^2 \bigr) = \mathbb{E}\Bigl( \bigl(\textstyle\sum_i w_i X_i\bigr)\bigl(\textstyle\sum_j w_j X_j\bigr) \Bigr) = \sum_i \sum_j \mathbb{E}(w_i X_i w_j X_j) = \sum_i \sum_j w_i w_j\, \mathbb{E}(X_i X_j) = \sum_i \sum_j w_i w_j\, C_{ij}
\]
with $C_{ij}$ the covariance matrix of $X$. The latter expression is the scalar product $(w \cdot C w)$. Since $C$ is a symmetric positive-semidefinite matrix, $(w \cdot C w)$ is at most $\lambda \|w\|^2$ with $\lambda$ the largest eigenvalue of $C$.

For the IGO flow we have $\frac{\mathrm{d}\theta^t}{\mathrm{d}t} = \mathbb{E}_{x \sim P_\theta}\, W^f_\theta(x)\, \widetilde{\nabla}_\theta \ln P_\theta(x)$.

So, applying the lemma, we get that the norm of $\frac{\mathrm{d}\theta^t}{\mathrm{d}t}$ is at most $\sqrt{\lambda \operatorname{Var}_{x \sim P_\theta} W^f_\theta(x)}$ where $\lambda$ is the largest eigenvalue of the covariance matrix of $\widetilde{\nabla}_\theta \ln P_\theta(x)$ (expressed in a coordinate system where the Fisher matrix at the current point $\theta$ is the identity).

By construction of the quantiles, we have $\operatorname{Var}_{x \sim P_\theta} W^f_\theta(x) \leq \operatorname{Var}_{[0,1]} w$ (with equality unless there are ties). Indeed, for a given $x$, let $\mathcal{U}$ be a uniform random variable in $[0,1]$ independent from $x$ and define the random variable $Q = q^-(x) + (q^+(x) - q^-(x))\, \mathcal{U}$. Then $Q$ is uniformly distributed between the lower and upper quantiles $q^-(x)$ and $q^+(x)$ and thus we can rewrite $W^f_\theta(x)$ as $\mathbb{E}(w(Q) \mid x)$. By the Jensen inequality we have $\operatorname{Var} W^f_\theta(x) = \operatorname{Var} \mathbb{E}(w(Q) \mid x) \leq \operatorname{Var} w(Q)$. In addition, when $x$ is taken under $P_\theta$, $Q$ is uniformly distributed in $[0,1]$ and thus $\operatorname{Var} w(Q) = \operatorname{Var}_{[0,1]} w$, i.e. $\operatorname{Var}_{x \sim P_\theta} W^f_\theta(x) \leq \operatorname{Var}_{[0,1]} w$.

Besides, consider the tangent space in $\Theta$-space at point $\theta^t$, and let us choose an orthonormal basis in this tangent space for the Fisher metric. Then, in this basis, we have $\widetilde{\nabla}_i \ln P_\theta(x) = \partial_i \ln P_\theta(x)$. So the covariance matrix of $\widetilde{\nabla} \ln P_\theta(x)$ is $\mathbb{E}_{x \sim P_\theta}\bigl( \partial_i \ln P_\theta(x)\, \partial_j \ln P_\theta(x) \bigr)$, which is equal to the Fisher matrix by definition. So this covariance matrix is the identity, whose largest eigenvalue is $1$.

Proof of Proposition 12 (Noisy IGO)

On the one hand, let $P_\theta$ be a family of distributions on $X$. The IGO algorithm (13) applied to a random function $f(x) = \tilde{f}(x, \omega)$, where $\omega$ is a random variable uniformly distributed in $[0,1]$, reads
\[
\theta^{t+\delta t} = \theta^t + \delta t \sum_{i=1}^N \hat{w}_i\, \widetilde{\nabla}_\theta \ln P_\theta(x_i) \tag{56}
\]
where $x_i \sim P_\theta$ and $\hat{w}_i$ is according to (12), where ranking is applied to the values $\tilde{f}(x_i, \omega_i)$, with $\omega_i$ uniform variables in $[0,1]$ independent from $x_i$ and from each other.

On the other hand, for the IGO algorithm using $P_\theta \otimes U_{[0,1]}$ and applied to the deterministic function $\tilde{f}$, $\hat{w}_i$ is computed using the ranking according to the $\tilde{f}$ values of the sampled points $\tilde{x}_i = (x_i, \omega_i)$, and thus coincides with the one in (56).

Besides,
\[
\widetilde{\nabla}_\theta \ln \bigl( P_\theta \otimes U_{[0,1]} \bigr)(\tilde{x}_i) = \widetilde{\nabla}_\theta \ln P_\theta(x_i) + \underbrace{\widetilde{\nabla}_\theta \ln U_{[0,1]}(\omega_i)}_{=0}
\]
and thus the IGO algorithm update on the space $X \times [0,1]$, using the family of distributions $\tilde{P}_\theta = P_\theta \otimes U_{[0,1]}$, applied to the deterministic function $\tilde{f}$, coincides with (56).

Proof of Theorem 13 (Natural gradient as ML with infinitesimal weights)

We begin with a calculus lemma (proof omitted).

Lemma 20. Let $f$ be a real-valued function on a finite-dimensional vector space $E$ equipped with a positive definite quadratic form $\|\cdot\|^2$. Assume $f$ is smooth and has at most quadratic growth at infinity. Then, for any $x \in E$, we have
\[
\nabla f(x) = \lim_{\varepsilon \to 0} \frac{1}{\varepsilon} \operatorname*{arg\,max}_h \Bigl\{ f(x + h) - \frac{1}{2\varepsilon} \|h\|^2 \Bigr\}
\]
where $\nabla$ is the gradient associated with the norm $\|\cdot\|$. Equivalently,
\[
\operatorname*{arg\,max}_y \Bigl\{ f(y) - \frac{1}{2\varepsilon} \|y - x\|^2 \Bigr\} = x + \varepsilon \nabla f(x) + O(\varepsilon^2)
\]
when $\varepsilon \to 0$.

We are now ready to prove Theorem 13. Let $W$ be a function of $x$, and fix some $\theta_0$ in $\Theta$.

We need some regularity assumptions: we assume that the parameter space $\Theta$ is non-degenerate (no two points $\theta \in \Theta$ define the same probability distribution) and proper (the map $P_\theta \mapsto \theta$ is continuous). We also assume that the map $\theta \mapsto P_\theta$ is smooth enough, so that $\int \log P_\theta(x)\, W(x)\, P_{\theta_0}(\mathrm{d}x)$ is a smooth function of $\theta$. (These are restrictions on $\theta$-regularity: this does not mean that $W$ has to be continuous as a function of $x$.)

The two statements of Theorem 13 using a sum and an integral have similar proofs, so we only include the first. For $\varepsilon > 0$, let $\theta$ be the solution of
\[
\theta = \operatorname*{arg\,max} \Bigl\{ (1 - \varepsilon) \int \log P_\theta(x)\, P_{\theta_0}(\mathrm{d}x) + \varepsilon \int \log P_\theta(x)\, W(x)\, P_{\theta_0}(\mathrm{d}x) \Bigr\}.
\]
Then we have
\[
\begin{aligned}
\theta &= \operatorname*{arg\,max} \Bigl\{ \int \log P_\theta(x)\, P_{\theta_0}(\mathrm{d}x) + \varepsilon \int \log P_\theta(x)\, (W(x) - 1)\, P_{\theta_0}(\mathrm{d}x) \Bigr\} \\
&= \operatorname*{arg\,max} \Bigl\{ \int \log P_\theta(x)\, P_{\theta_0}(\mathrm{d}x) - \int \log P_{\theta_0}(x)\, P_{\theta_0}(\mathrm{d}x) + \varepsilon \int \log P_\theta(x)\, (W(x) - 1)\, P_{\theta_0}(\mathrm{d}x) \Bigr\} \\
&\qquad\qquad \text{(because the added term does not depend on } \theta \text{)} \\
&= \operatorname*{arg\,max} \Bigl\{ -\mathrm{KL}(P_{\theta_0} \,\|\, P_\theta) + \varepsilon \int \log P_\theta(x)\, (W(x) - 1)\, P_{\theta_0}(\mathrm{d}x) \Bigr\} \\
&= \operatorname*{arg\,max} \Bigl\{ -\frac{1}{\varepsilon}\, \mathrm{KL}(P_{\theta_0} \,\|\, P_\theta) + \int \log P_\theta(x)\, (W(x) - 1)\, P_{\theta_0}(\mathrm{d}x) \Bigr\}.
\end{aligned}
\]
When $\varepsilon \to 0$, the first term exceeds the second one if $\mathrm{KL}(P_{\theta_0} \,\|\, P_\theta)$ is too large (because $W$ is bounded), and so $\mathrm{KL}(P_{\theta_0} \,\|\, P_\theta)$ tends to $0$. So we can assume that $\theta$ is close to $\theta_0$.

When $\theta = \theta_0 + \delta\theta$ is close to $\theta_0$, we have
\[
\mathrm{KL}(P_{\theta_0} \,\|\, P_\theta) = \frac{1}{2} \sum I_{ij}(\theta_0)\, \delta\theta_i\, \delta\theta_j + O(\delta\theta^3)
\]
with $I_{ij}(\theta_0)$ the Fisher matrix at $\theta_0$. (This actually holds both for $\mathrm{KL}(P_{\theta_0} \,\|\, P_\theta)$ and $\mathrm{KL}(P_\theta \,\|\, P_{\theta_0})$.)

Thus, we can apply the lemma above using the Fisher metric $\sum I_{ij}(\theta_0)\, \delta\theta_i\, \delta\theta_j$, and working on a small neighborhood of $\theta_0$ in $\theta$-space (which can be identified with $\mathbb{R}^{\dim \Theta}$). The lemma states that the argmax above is attained at
\[
\theta = \theta_0 + \varepsilon\, \widetilde{\nabla}_\theta \int \log P_\theta(x)\, (W(x) - 1)\, P_{\theta_0}(\mathrm{d}x)
\]
up to $O(\varepsilon^2)$, with $\widetilde{\nabla}$ the natural gradient.

Finally, the gradient cancels the constant $-1$ because $\int (\widetilde{\nabla} \log P_\theta)\, P_{\theta_0} = 0$ at $\theta = \theta_0$. This proves Theorem 13.

Proof of Theorem 15 (IGO, CEM and IGO-ML)

Let π‘ƒπœƒ be a family of probability distributions of the form

π‘ƒπœƒ(π‘₯) = 1𝑍(πœƒ) exp

(βˆ‘πœƒπ‘–π‘‡π‘–(π‘₯)

)𝐻(dπ‘₯)

where 𝑇1, . . . , π‘‡π‘˜ is a finite family of functions on 𝑋 and 𝐻(dπ‘₯) is somereference measure on 𝑋. We assume that the family of functions (𝑇𝑖)𝑖

together with the constant function 𝑇0(π‘₯) = 1, are linearly independent.This prevents redundant parametrizations where two values of πœƒ describethe same distribution; this also ensures that the Fisher matrix Cov(𝑇𝑖, 𝑇𝑗) isinvertible.


The IGO update (14) in the parametrization $\bar{T}_i$ is a sum of terms of the form $\widetilde{\nabla}_{\bar{T}_i} \ln P(x)$. So we will compute the natural gradient $\widetilde{\nabla}_{\bar{T}_i}$ in those coordinates. We first need some general results about the Fisher metric for exponential families.

The next proposition gives the expression of the Fisher scalar product between two tangent vectors $\delta P$ and $\delta' P$ of a statistical manifold of exponential distributions. It is one way to express the duality between the coordinates $\theta_i$ and $\bar{T}_i$ (compare [AN00, (3.30) and Section 3.5]).

Proposition 21. Let π›Ώπœƒπ‘– and π›Ώβ€²πœƒπ‘– be two small variations of the parametersπœƒπ‘–. Let 𝛿𝑃 (π‘₯) and 𝛿′𝑃 (π‘₯) be the resulting variations of the probability distri-bution 𝑃 , and 𝛿𝑇𝑖 and 𝛿′𝑇𝑖 the resulting variations of 𝑇𝑖. Then the scalarproduct, in Fisher information metric, between the tangent vectors 𝛿𝑃 and𝛿′𝑃 , is

βŸ¨π›Ώπ‘ƒ, 𝛿′𝑃 ⟩ =βˆ‘

𝑖

π›Ώπœƒπ‘– 𝛿′𝑇𝑖 =βˆ‘

𝑖

π›Ώβ€²πœƒπ‘– 𝛿𝑇𝑖.

Proof. By definition of the Fisher metric:
\[
\begin{aligned}
\langle \delta P, \delta' P \rangle &= \sum_{i,j} I_{ij}\, \delta\theta_i\, \delta'\theta_j \\
&= \sum_{i,j} \delta\theta_i\, \delta'\theta_j \int_x \frac{\partial \ln P(x)}{\partial \theta_i}\, \frac{\partial \ln P(x)}{\partial \theta_j}\, P(x) \\
&= \int_x \Bigl( \sum_i \frac{\partial \ln P(x)}{\partial \theta_i}\, \delta\theta_i \Bigr) \Bigl( \sum_j \frac{\partial \ln P(x)}{\partial \theta_j}\, \delta'\theta_j \Bigr) P(x) \\
&= \int_x \sum_i \frac{\partial \ln P(x)}{\partial \theta_i}\, \delta\theta_i\, \delta'(\ln P(x))\, P(x) \\
&= \int_x \sum_i \frac{\partial \ln P(x)}{\partial \theta_i}\, \delta\theta_i\, \delta' P(x) \\
&= \int_x \sum_i \bigl( T_i(x) - \bar{T}_i \bigr)\, \delta\theta_i\, \delta' P(x) \qquad \text{by (16)} \\
&= \sum_i \delta\theta_i \Bigl( \int_x T_i(x)\, \delta' P(x) \Bigr) - \sum_i \delta\theta_i\, \bar{T}_i \int_x \delta' P(x) \\
&= \sum_i \delta\theta_i\, \delta' \bar{T}_i
\end{aligned}
\]
because $\int_x \delta' P(x) = 0$ since the total mass of $P$ is $1$, and $\int_x T_i(x)\, \delta' P(x) = \delta' \bar{T}_i$ by definition of $\bar{T}_i$.

Proposition 22. Let $f$ be a function on the statistical manifold of an exponential family as above. Then the components of the natural gradient w.r.t. the expectation parameters are given by the vanilla gradient w.r.t. the natural parameters:
\[
\widetilde{\nabla}_{\bar{T}_i} f = \frac{\partial f}{\partial \theta_i}
\]
and conversely
\[
\widetilde{\nabla}_{\theta_i} f = \frac{\partial f}{\partial \bar{T}_i}.
\]
(Beware that this does not mean that the gradient ascent in any of those parametrizations is the vanilla gradient ascent.)

We could not find a reference for this result, though we think it known.


Proof. By definition, the natural gradient $\widetilde{\nabla} f$ of a function $f$ is the unique tangent vector $\delta P$ such that, for any other tangent vector $\delta' P$, we have
\[
\delta' f = \langle \delta P, \delta' P \rangle
\]
with $\langle \cdot, \cdot \rangle$ the scalar product associated with the Fisher metric. We want to compute this natural gradient in the coordinates $\bar{T}_i$, so we are interested in the variations $\delta \bar{T}_i$ associated with $\delta P$.

By Proposition 21, the scalar product above is
\[
\langle \delta P, \delta' P \rangle = \sum_i \delta \bar{T}_i\, \delta'\theta_i
\]
where $\delta \bar{T}_i$ is the variation of $\bar{T}_i$ associated with $\delta P$, and $\delta'\theta_i$ the variation of $\theta_i$ associated with $\delta' P$.

On the other hand we have $\delta' f = \sum_i \frac{\partial f}{\partial \theta_i}\, \delta'\theta_i$. So we must have
\[
\sum_i \delta \bar{T}_i\, \delta'\theta_i = \sum_i \frac{\partial f}{\partial \theta_i}\, \delta'\theta_i
\]
for any $\delta' P$, which leads to $\delta \bar{T}_i = \frac{\partial f}{\partial \theta_i}$ as needed. The converse relation is proved mutatis mutandis.
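Proposition 22 can be verified numerically on the simplest exponential family, the Bernoulli distributions $P_\theta(x) \propto \exp(\theta x)$ on $\{0,1\}$, with sufficient statistic $T(x) = x$ and expectation parameter $\bar{T} = \sigma(\theta)$. The test function $f$ below is an arbitrary illustrative choice.

```python
import math

sigma = lambda t: 1.0 / (1.0 + math.exp(-t))

theta = 0.7
Tbar = sigma(theta)                        # expectation parameter: E[T] = sigma(theta)
fisher_Tbar = 1.0 / (Tbar * (1.0 - Tbar))  # Fisher information in the Tbar coordinate

f = lambda tb: (tb - 0.9) ** 2             # a smooth function on the manifold

h = 1e-6
# Natural gradient wrt Tbar: inverse Fisher information times the vanilla derivative
df_dTbar = (f(Tbar + h) - f(Tbar - h)) / (2 * h)
nat_grad_Tbar = df_dTbar / fisher_Tbar
# Vanilla derivative wrt the natural parameter theta (chain rule through sigma)
df_dtheta = (f(sigma(theta + h)) - f(sigma(theta - h))) / (2 * h)

print(abs(nat_grad_Tbar - df_dtheta) < 1e-6)   # Proposition 22: the two coincide
```

The agreement is no accident: the inverse Fisher information in the $\bar{T}$ coordinate is $\bar{T}(1 - \bar{T})$, which is exactly the Jacobian $\mathrm{d}\bar{T}/\mathrm{d}\theta$ appearing in the chain rule.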

Back to the proof of Theorem 15. We can now compute the desired terms:
\[
\widetilde{\nabla}_{\bar{T}_i} \ln P(x) = \frac{\partial \ln P(x)}{\partial \theta_i} = T_i(x) - \bar{T}_i
\]
by (16). This proves the first statement (27) in Theorem 15 about the form of the IGO update in these parameters.

The other statements follow easily from this, together with the additional fact (26) that, for any set of weights $a_i$ with $\sum a_i = 1$, the value $T^* = \sum_i a_i\, T(x_i)$ maximizes the weighted log-likelihood $\sum_i a_i \log P(x_i)$.

References

[AHS85] D.H. Ackley, G.E. Hinton, and T.J. Sejnowski, A learning algorithm for Boltzmann machines, Cognitive Science 9 (1985), no. 1, 147–169.

[Ama98] Shun-Ichi Amari, Natural gradient works efficiently in learning, Neural Comput. 10 (1998), 251–276.

[AN00] Shun-ichi Amari and Hiroshi Nagaoka, Methods of information geometry, Translations of Mathematical Monographs, vol. 191, American Mathematical Society, Providence, RI, 2000, Translated from the 1993 Japanese original by Daishi Harada.

[ANOK10] Youhei Akimoto, Yuichi Nagata, Isao Ono, and Shigenobu Kobayashi, Bidirectional relation between CMA evolution strategies and natural evolution strategies, Proceedings of Parallel Problem Solving from Nature – PPSN XI, Lecture Notes in Computer Science, vol. 6238, Springer, 2010, pp. 154–163.

[AO08] Ravi P. Agarwal and Donal O'Regan, An introduction to ordinary differential equations, Springer, 2008.

[Arn06] D.V. Arnold, Weighted multirecombination evolution strategies, Theoretical Computer Science 361 (2006), no. 1, 18–37.

[Bal94] Shumeet Baluja, Population based incremental learning: A method for integrating genetic search based function optimization and competitive learning, Tech. Report CMU-CS-94-163, Carnegie Mellon University, 1994.

[BC95] Shumeet Baluja and Rich Caruana, Removing the genetics from the standard genetic algorithm, Proceedings of ICML'95, 1995, pp. 38–46.

[Ber00] A. Berny, Selection and reinforcement learning for combinatorial optimization, Parallel Problem Solving from Nature PPSN VI (Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Merelo, and Hans-Paul Schwefel, eds.), Lecture Notes in Computer Science, vol. 1917, Springer Berlin Heidelberg, 2000, pp. 601–610.

[Ber02] Arnaud Berny, Boltzmann machine for population-based incremental learning, ECAI, 2002, pp. 198–202.

[Bey01] H.-G. Beyer, The theory of evolution strategies, Natural Computing Series, Springer-Verlag, 2001.

[BLS07] J. Branke, C. Lode, and J.L. Shapiro, Addressing sampling errors and diversity loss in UMDA, Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation, ACM, 2007, pp. 508–515.

[BS02] H.-G. Beyer and H.-P. Schwefel, Evolution strategies – a comprehensive introduction, Natural Computing 1 (2002), no. 1, 3–52.

[CT06] Thomas M. Cover and Joy A. Thomas, Elements of information theory, second ed., Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, 2006.

[dBKMR05] Pieter-Tjerk de Boer, Dirk P. Kroese, Shie Mannor, and Reuven Y. Rubinstein, A tutorial on the cross-entropy method, Annals OR 134 (2005), no. 1, 19–67.

[Gha04] Zoubin Ghahramani, Unsupervised learning, Advanced Lectures on Machine Learning (Olivier Bousquet, Ulrike von Luxburg, and Gunnar Rätsch, eds.), Lecture Notes in Computer Science, vol. 3176, Springer Berlin / Heidelberg, 2004, pp. 72–112.

[GSS+10] Tobias Glasmachers, Tom Schaul, Yi Sun, Daan Wierstra, and Jürgen Schmidhuber, Exponential natural evolution strategies, GECCO, 2010, pp. 393–400.

[Han06a] N. Hansen, An analysis of mutative σ-self-adaptation on linear fitness functions, Evolutionary Computation 14 (2006), no. 3, 255–275.

[Han06b] , The CMA evolution strategy: a comparing review, Towards a New Evolutionary Computation. Advances on Estimation of Distribution Algorithms (J.A. Lozano, P. Larranaga, I. Inza, and E. Bengoetxea, eds.), Springer, 2006, pp. 75–102.

[Han09] , Benchmarking a BI-population CMA-ES on the BBOB-2009 function testbed, Proceedings of the 11th Annual Conference Companion on Genetic and Evolutionary Computation Conference: Late Breaking Papers (New York, NY, USA), GECCO '09, ACM, 2009, pp. 2389–2396.

[Hin02] G.E. Hinton, Training products of experts by minimizing contrastive divergence, Neural Computation 14 (2002), 1771–1800.

[HJ61] R. Hooke and T.A. Jeeves, "Direct search" solution of numerical and statistical problems, Journal of the ACM 8 (1961), 212–229.

[HK04] N. Hansen and S. Kern, Evaluating the CMA evolution strategy on multimodal test functions, Parallel Problem Solving from Nature PPSN VIII (X. Yao et al., eds.), LNCS, vol. 3242, Springer, 2004, pp. 282–291.

[HMK03] N. Hansen, S.D. Müller, and P. Koumoutsakos, Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES), Evolutionary Computation 11 (2003), no. 1, 1–18.

[HO96] N. Hansen and A. Ostermeier, Adapting arbitrary normal mutation distributions in evolution strategies: The covariance matrix adaptation, ICEC96, IEEE Press, 1996, pp. 312–317.

[HO01] Nikolaus Hansen and Andreas Ostermeier, Completely derandomized self-adaptation in evolution strategies, Evolutionary Computation 9 (2001), no. 2, 159–195.

[JA06] G.A. Jastrebski and D.V. Arnold, Improving evolution strategies through active covariance matrix adaptation, Evolutionary Computation, 2006. CEC 2006. IEEE Congress on, IEEE, 2006, pp. 2814–2821.

[Jef46] Harold Jeffreys, An invariant form for the prior probability in estimation problems, Proc. Roy. Soc. London. Ser. A. 186 (1946), 453–461.

[Kul97] Solomon Kullback, Information theory and statistics, Dover Publications Inc., Mineola, NY, 1997, Reprint of the second (1968) edition.

[LL02] P. Larranaga and J.A. Lozano, Estimation of distribution algorithms: A new tool for evolutionary computation, Springer Netherlands, 2002.

[LRMB07] Nicolas Le Roux, Pierre-Antoine Manzagol, and Yoshua Bengio, Topmoumoute online natural gradient algorithm, NIPS, 2007.

[MMS08] Luigi Malagò, Matteo Matteucci, and Bernardo Dal Seno, An information geometry perspective on estimation of distribution algorithms: boundary analysis, GECCO (Companion), 2008, pp. 2081–2088.

[NM65] John Ashworth Nelder and R. Mead, A simplex method for function minimization, The Computer Journal (1965), 308–313.

[PGL02] M. Pelikan, D.E. Goldberg, and F.G. Lobo, A survey of optimization by building and using probabilistic models, Computational Optimization and Applications 21 (2002), no. 1, 5–20.

[Rao45] Calyampudi Radhakrishna Rao, Information and the accuracy attainable in the estimation of statistical parameters, Bull. Calcutta Math. Soc. 37 (1945), 81–91.

[Rec73] I. Rechenberg, Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution, Frommann-Holzboog Verlag, Stuttgart, 1973.

[RK04] R.Y. Rubinstein and D.P. Kroese, The cross-entropy method: a unified approach to combinatorial optimization, Monte-Carlo simulation, and machine learning, Springer-Verlag New York Inc., 2004.

[Rub99] Reuven Rubinstein, The cross-entropy method for combinatorial and continuous optimization, Methodology and Computing in Applied Probability 1 (1999), 127–190, doi:10.1023/A:1010091220143.

[SA90] F. Silva and L. Almeida, Acceleration techniques for the backpropagation algorithm, Neural Networks (1990), 110–119.

[Sch92] Laurent Schwartz, Analyse. II, Collection Enseignement des Sciences [Collection: The Teaching of Science], vol. 43, Hermann, Paris, 1992, Calcul différentiel et équations différentielles, With the collaboration of K. Zizi.

[Sch95] H.-P. Schwefel, Evolution and optimum seeking, Sixth-Generation Computer Technology Series, John Wiley & Sons, Inc., New York, NY, USA, 1995.

[SHI09] Thorsten Suttorp, Nikolaus Hansen, and Christian Igel, Efficient covariance matrix update for variable metric evolution strategies, Machine Learning 75 (2009), no. 2, 167–197.

[Smo86] P. Smolensky, Information processing in dynamical systems: foundations of harmony theory, Parallel Distributed Processing (D. Rumelhart and J. McClelland, eds.), vol. 1, MIT Press, Cambridge, MA, USA, 1986, pp. 194–281.

[SWSS09] Yi Sun, Daan Wierstra, Tom Schaul, and Juergen Schmidhuber, Efficient natural evolution strategies, Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation (New York, NY, USA), GECCO '09, ACM, 2009, pp. 539–546.

[Tho00] H. Thorisson, Coupling, stationarity, and regeneration, Springer, 2000.

[Tor97] Virginia Torczon, On the convergence of pattern search algorithms, SIAM Journal on Optimization 7 (1997), no. 1, 1–25.

[WAS04] Michael Wagner, Anne Auger, and Marc Schoenauer, EEDA: A new robust estimation of distribution algorithms, Research Report RR-5190, INRIA, 2004.

[Whi89] D. Whitley, The GENITOR algorithm and selection pressure: Why rank-based allocation of reproductive trials is best, Proceedings of the Third International Conference on Genetic Algorithms, 1989, pp. 116–121.

[WSPS08] Daan Wierstra, Tom Schaul, Jan Peters, and Jürgen Schmidhuber, Natural evolution strategies, IEEE Congress on Evolutionary Computation, 2008, pp. 3381–3387.