Learning for Optimization: EDAs, Probabilistic Modelling, or Explicit Modelling in Metaheuristic Optimization


Page 1: Learning for Optimization: EDAs, probabilistic modelling, or

Explicit Modelling in Metaheuristic Optimization

Dr Marcus Gallagher
School of Information Technology and Electrical Engineering
University of Queensland Q. 4072
[email protected]
MASCOS Symposium, 26/11/04

Page 2: Learning for Optimization: EDAs, probabilistic modelling, or


Talk outline:

- Optimization, heuristics and metaheuristics.
- “Estimation of Distribution” (optimization) algorithms (EDAs): a brief overview.
- A framework for describing EDAs.
- Other modelling approaches in metaheuristics.
- Summary

Page 3: Learning for Optimization: EDAs, probabilistic modelling, or


“Hard” Optimization Problems

Goal: Find

$x^* \in S$ such that $f(x^*) \le f(x), \ \forall x \in S$

where S is often multi-dimensional; real-valued or binary:

$S = \mathbb{R}^n$ or $S = \{0,1\}^n$

Many classes of optimization problems (and algorithms) exist.

When might it be worthwhile to consider metaheuristic or machine learning approaches?
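For concreteness, here are two standard toy objectives of this form, one over $S = \mathbb{R}^n$ and one over $S = \{0,1\}^n$; they are illustrative examples only, not problems from the talk.

```python
import numpy as np

# Continuous search space, S = R^n: the sphere function, minimized at x = 0.
def sphere(x):
    return float(np.dot(x, x))

# Binary search space, S = {0,1}^n: minimize the number of zero bits
# (equivalently, maximize the number of ones, i.e. "OneMax").
def zeros_count(x):
    return int(len(x) - np.sum(x))
```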

Page 4: Learning for Optimization: EDAs, probabilistic modelling, or


- Finding an “exact” solution is intractable.
- Limited knowledge of f():
  - No derivative information.
  - May be discontinuous, noisy, …
- Evaluating f() is expensive in terms of time or cost.
- f() is known or suspected to contain nasty features:
  - Many local minima, plateaus, ravines.
- The search space is high-dimensional.

Page 5: Learning for Optimization: EDAs, probabilistic modelling, or


What is the “practical” goal of (global) optimization?

“There exists a goal (e.g. to find as small a value of f() as possible), there exist resources (e.g. some number of trials), and the problem is how to use these resources in an optimal way.”

A. Torn and A. Zilinskas. Global Optimisation. Springer-Verlag, 1989. Lecture Notes in Computer Science, Vol. 350.

Page 6: Learning for Optimization: EDAs, probabilistic modelling, or


Heuristics

Heuristic (or approximate) algorithms aim to find a good solution to a problem in a reasonable amount of computation time, but with no guarantee of “goodness” or “efficiency” (cf. exact or complete algorithms).

Broad classes of heuristics:
- Constructive methods
- Local search methods

Page 7: Learning for Optimization: EDAs, probabilistic modelling, or


Metaheuristics

Metaheuristics are (roughly) high-level strategies that combine lower-level techniques for exploration and exploitation of the search space.

- An overarching term for algorithms including Evolutionary Algorithms, Simulated Annealing, Tabu Search, Ant Colony, Particle Swarm, Cross-Entropy, …

C. Blum and A. Roli. Metaheuristics in Combinatorial Optimization: Overview and Conceptual Comparison. ACM Computing Surveys, 35(3), 2003, pp. 268-308.

Page 8: Learning for Optimization: EDAs, probabilistic modelling, or


Learning/Modelling for Optimization

- Most optimization algorithms make some (explicit or implicit) assumptions about the nature of f().
- Many algorithms vary their behaviour during execution (e.g. simulated annealing).
- In some optimization algorithms the search is adaptive:
  - Future search points evaluated depend on previous points searched (and/or their f() values, derivatives of f(), etc.).
- Learning/modelling can be implicit (e.g. adapting the step-size in gradient descent, or the population in an EA)…
- …or explicit; examples from the optimization literature:
  - Nelder-Mead simplex algorithm.
  - Response surfaces (metamodelling, surrogate functions).
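As a small illustration of the first “explicit” example above, the Nelder-Mead simplex method is available in SciPy; the objective and starting point below are placeholders chosen only for the demo.

```python
import numpy as np
from scipy.optimize import minimize

# Nelder-Mead maintains and deforms a simplex of candidate points: an explicit
# (geometric) model of the local landscape that is adapted as the search runs.
result = minimize(lambda x: np.sum(x ** 2), x0=np.ones(5), method="Nelder-Mead")
print(result.x, result.fun)   # close to the origin, objective close to 0
```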

Page 9: Learning for Optimization: EDAs, probabilistic modelling, or


EDAs: Probabilistic Modelling for Optimization

- Based on the use of (unsupervised) density estimators/generative statistical models.
- The idea is to convert the optimization problem into a search over probability distributions.

P. Larranaga and J. A. Lozano (eds.). Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation. Kluwer Academic Publishers, 2002.

- The probabilistic model is, in some sense, an explicit model of (currently) promising regions of the search space.

Page 10: Learning for Optimization: EDAs, probabilistic modelling, or


EDAs: toy example

Page 11: Learning for Optimization: EDAs, probabilistic modelling, or


EDAs: toy example

Page 12: Learning for Optimization: EDAs, probabilistic modelling, or


GAs and EDAs compared

GA pseudocode:
1. Initialize the population, X(t);
2. Evaluate the objective function for each point;
3. Selection();
4. Crossover();
5. Mutation();
6. Form new population X(t+1);
7. While !(terminate()) Goto 2;

Page 13: Learning for Optimization: EDAs, probabilistic modelling, or


GAs and EDAs compared

EDA pseudocode:
1. Initialize a probability model, Q(x);
2. Create a population of points by sampling from Q(x);
3. Evaluate the objective function for each point;
4. Update Q(x) using the selected population and f() values;
5. While !(terminate()) Goto 2;
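A minimal, runnable sketch of the EDA loop above, assuming a binary search space and a model Q(x) made of independent Bernoulli marginals (a UMDA-style choice); the function and parameter values are illustrative only.

```python
import numpy as np

def umda_binary(f, n, pop_size=100, n_select=50, iters=100, seed=0):
    """Minimal univariate EDA for minimizing f over {0,1}^n.
    The model Q(x) is a vector of independent Bernoulli marginals q[i]."""
    rng = np.random.default_rng(seed)
    q = np.full(n, 0.5)                                    # 1. initialize Q(x)
    best_x, best_f = None, np.inf
    for _ in range(iters):
        pop = (rng.random((pop_size, n)) < q).astype(int)  # 2. sample a population from Q(x)
        fitness = np.array([f(x) for x in pop])            # 3. evaluate each point
        sel = pop[np.argsort(fitness)[:n_select]]          # truncation selection (lowest f)
        q = np.clip(sel.mean(axis=0), 0.05, 0.95)          # 4. refit Q(x) to the selected points
        if fitness.min() < best_f:
            best_f, best_x = fitness.min(), pop[fitness.argmin()]
    return best_x, best_f

# Example: minimize the number of zero bits (optimum: the all-ones string).
best_x, best_f = umda_binary(lambda x: int(len(x) - x.sum()), n=20)
```

On this toy objective the marginals q quickly drift towards 1, and the sampled population concentrates around the optimum.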

Page 14: Learning for Optimization: EDAs, probabilistic modelling, or


EDA Example 1

Population-Based Incremental Learning (PBIL)

S. Baluja, R. Caruana. Removing the Genetics from the Standard Genetic Algorithm. ICML’95.

p1 = Pr(x1 = 1)
p2 = Pr(x2 = 1)
...
pn = Pr(xn = 1)

Update rule, moving each probability towards the corresponding bit of the best sampled solution $x^b$ with learning rate $\alpha$:

$p_i \leftarrow (1 - \alpha)\, p_i + \alpha\, x_i^b$
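A short sketch of the PBIL loop implied by this update, again for a binary minimization problem; names and settings are illustrative.

```python
import numpy as np

def pbil(f, n, pop_size=50, alpha=0.1, iters=200, seed=0):
    """Sketch of PBIL for minimizing f over {0,1}^n: keep a probability vector p,
    sample a population from it, and nudge p towards the best sample of each
    generation with learning rate alpha."""
    rng = np.random.default_rng(seed)
    p = np.full(n, 0.5)
    for _ in range(iters):
        pop = (rng.random((pop_size, n)) < p).astype(int)
        fitness = np.array([f(x) for x in pop])
        x_best = pop[fitness.argmin()]            # best sample of this generation
        p = (1 - alpha) * p + alpha * x_best      # p_i <- (1 - alpha) p_i + alpha x_i^b
    return p

p = pbil(lambda x: int(len(x) - x.sum()), n=20)   # marginals drift towards 1
```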

Page 15: Learning for Optimization: EDAs, probabilistic modelling, or


EDA Example 2

Mutual Information Maximization for Input Clustering (MIMIC)

J. De Bonet, C. Isbell and P. Viola. MIMIC: Finding optima by estimating probability densities. Advances in Neural Information Processing Systems, vol. 9, 1997.

The model is a chain of pairwise conditionals over a permutation $i_1, \ldots, i_n$ of the variables:

$\hat{p}(\mathbf{x}) = p(x_{i_1} \mid x_{i_2})\, p(x_{i_2} \mid x_{i_3}) \cdots p(x_{i_{n-1}} \mid x_{i_n})\, p(x_{i_n})$
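A sketch of how one would sample a bit string from this chain factorization; the permutation and conditional tables here are hypothetical inputs (MIMIC itself builds the chain greedily from empirical entropy estimates on the selected population).

```python
import numpy as np

def sample_chain(perm, p_root, p_cond, rng):
    """Draw one bit string from a MIMIC-style chain model.

    perm   : permutation [i_1, ..., i_n] of the variable indices
    p_root : Pr(x_{i_n} = 1), the unconditioned variable at the end of the chain
    p_cond : p_cond[k][v] = Pr(x_{i_k} = 1 | x_{i_{k+1}} = v), for k = 0..n-2
    """
    n = len(perm)
    x = np.zeros(n, dtype=int)
    x[perm[-1]] = int(rng.random() < p_root)       # sample x_{i_n} first
    for k in range(n - 2, -1, -1):                 # then each variable given its successor
        v = x[perm[k + 1]]
        x[perm[k]] = int(rng.random() < p_cond[k][v])
    return x

# e.g. a 3-variable chain with hand-set, purely illustrative conditionals
rng = np.random.default_rng(0)
x = sample_chain(perm=[2, 0, 1], p_root=0.7,
                 p_cond=[[0.1, 0.9], [0.2, 0.8]], rng=rng)
```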

Page 16: Learning for Optimization: EDAs, probabilistic modelling, or


EDA Example 3

Combining Optimizers with Mutual Information Trees (COMIT)

S. Baluja and S. Davies. Using optimal dependency-trees for combinatorial optimization: learning the structure of the search space. Proc. ICML’97.

- Uses a tree-structured graphical model.
- The model can be constructed in O(n²) time using a variant of the minimum spanning tree algorithm.
- The model is optimal, given the restrictions, in the sense that the Kullback-Leibler divergence between the model and a full joint distribution is minimized.
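Below is a sketch of the dependency-tree fitting step described above (pairwise mutual information plus a maximum-weight spanning tree, i.e. a Chow-Liu tree), assuming binary samples; it covers only the model-building piece, not the full COMIT algorithm, and the function name is illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree

def mutual_information_tree(samples, eps=1e-9):
    """Fit a dependency tree to binary samples: compute pairwise empirical
    mutual information, then take the maximum-weight spanning tree, which
    minimizes the KL divergence between the tree model and the empirical joint."""
    m, n = samples.shape
    mi = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # empirical (smoothed) joint distribution of (x_i, x_j)
            joint = np.zeros((2, 2))
            for a in (0, 1):
                for b in (0, 1):
                    joint[a, b] = np.mean((samples[:, i] == a) & (samples[:, j] == b)) + eps
            joint /= joint.sum()
            pi, pj = joint.sum(axis=1), joint.sum(axis=0)
            mi[i, j] = np.sum(joint * np.log(joint / np.outer(pi, pj)))
    # Maximum-weight spanning tree via a minimum spanning tree on (C - MI):
    # every spanning tree has exactly n-1 edges, so the constant shift C does
    # not change which tree wins, and it keeps all edge weights positive.
    weights = np.triu(mi.max() + 1.0 - mi, k=1)
    return minimum_spanning_tree(csr_matrix(weights))   # nonzero entries = tree edges
```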

Page 17: Learning for Optimization: EDAs, probabilistic modelling, or


EDA Example 4

Bayesian Optimization Algorithm (BOA)

M. Pelikan, D. Goldberg and E. Cantu-Paz. BOA: The Bayesian optimization algorithm. In Proc. GECCO’99.

- Bayesian network model where nodes can have at most k parents.
- Greedy search over the Bayesian Dirichlet equivalence metric to find the network structure.

Page 18: Learning for Optimization: EDAs, probabilistic modelling, or


Further work on EDAs

EDAs have also been developed:
- For problems with continuous and mixed variables.
- That use mixture models and kernel estimators, allowing for the modelling of multi-modal distributions.
- …and more!

Page 19: Learning for Optimization: EDAs, probabilistic modelling, or


A framework to describe building and adapting a probabilistic model for optimization

See: M. Gallagher and M. Frean. Population-Based Continuous Optimization, Probabilistic Modelling and Mean Shift. To appear, Evolutionary Computation, 2005.

Consider a continuous EDA with model

$Q(\mathbf{x}) = \prod_{i=1}^{n} Q_i(x_i)$

Consider a Boltzmann distribution over f(x):

$P(\mathbf{x}) = \frac{1}{Z}\exp\left(\frac{f(\mathbf{x})}{T}\right)$

Page 20: Learning for Optimization: EDAs, probabilistic modelling, or


As T → 0, P(x) tends towards a set of impulse spikes over the global optima.

We now have a probability distribution whose form we know, Q(x), and we would like to modify it to be close to P(x). KL divergence:

$K = \int_{\mathbf{x}} Q(\mathbf{x}) \log\frac{Q(\mathbf{x})}{P(\mathbf{x})}\, d\mathbf{x}$

Let Q(x) be a Gaussian; try to minimize K via gradient descent with respect to the mean parameter of Q(x).

Page 21: Learning for Optimization: EDAs, probabilistic modelling, or


The gradient with respect to the mean v of Q becomes

$\frac{\partial K}{\partial v} = -\frac{1}{T}\int_{\mathbf{x}} Q(\mathbf{x})\,(\mathbf{x} - v)\, f(\mathbf{x})\, d\mathbf{x}$

using $\frac{\partial Q}{\partial v} = Q(\mathbf{x})(\mathbf{x} - v)$ (variance constants absorbed).

An approximation to the integral is to use a sample S of n points drawn from Q(x):

$\frac{\partial K}{\partial v} \approx -\frac{1}{nT}\sum_{\mathbf{x}_i \in S}(\mathbf{x}_i - v)\, f(\mathbf{x}_i)$

Page 22: Learning for Optimization: EDAs, probabilistic modelling, or


The algorithm update rule is then

$v \leftarrow v + \frac{\alpha}{n}\sum_{\mathbf{x}_i \in S}(\mathbf{x}_i - v)\, f(\mathbf{x}_i)$

Similar ideas can be found in:

A. Berny. Statistical Machine Learning and Combinatorial Optimization. In L. Kallel et al. (eds.), Theoretical Aspects of Evolutionary Computation, pp. 287-306. Springer, 2001.

M. Toussaint. On the evolution of phenotypic exploration distributions. In C. Cotta et al. (eds.), Foundations of Genetic Algorithms (FOGA VII), pp. 169-182. Morgan Kaufmann, 2003.
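A minimal sketch of this update rule, assuming an isotropic Gaussian Q(x) with fixed spread and absorbing 1/T into the learning rate α; all names and values are illustrative.

```python
import numpy as np

def kl_mean_update_eda(f, dim, v0, sigma=0.5, alpha=0.1, pop_size=50,
                       iters=200, seed=0):
    """Continuous EDA sketch: Q(x) is an isotropic Gaussian with mean v.
    Each generation, sample from Q(x), then move v along the sample
    approximation of the KL gradient, weighting every point by its raw f()
    value (no selection). Here f is treated as a quantity to maximize."""
    rng = np.random.default_rng(seed)
    v = np.asarray(v0, dtype=float)
    for _ in range(iters):
        X = v + sigma * rng.standard_normal((pop_size, dim))   # sample S from Q(x)
        fvals = np.array([f(x) for x in X])
        # v <- v + (alpha / n) * sum_{x_i in S} (x_i - v) f(x_i)
        v = v + (alpha / pop_size) * ((X - v) * fvals[:, None]).sum(axis=0)
    return v

# Example: f(x) = -||x||^2, so the mean should drift towards the origin.
v = kl_mean_update_eda(lambda x: -float(np.dot(x, x)), dim=5, v0=np.ones(5))
```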

Page 23: Learning for Optimization: EDAs, probabilistic modelling, or


Some insights

- The derived update rule is closely related to those found in Evolution Strategies and a version of PBIL for continuous spaces.
- It is possible to view these existing algorithms as approximately doing KL minimization.
- The objective function appears explicitly in this update rule (no selection).

Page 24: Learning for Optimization: EDAs, probabilistic modelling, or


Other Research in Learning/Modelling for Optimization

J. A. Boyan and A. W. Moore. Learning Evaluation Functions to Improve Optimization by Local Search. Journal of Machine Learning Research, 1:2, 2000.

B. Anderson, A. Moore and D. Cohn. A Nonparametric Approach to Noisy and Costly Optimization. International Conference on Machine Learning, 2000.

D. R. Jones. A Taxonomy of Global Optimization Methods Based on Response Surfaces. Journal of Global Optimization, 21(4):345-383, 2001.

Reinforcement learning:
- R. J. Williams (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229-256.
- V. V. Miagkikh and W. F. Punch III. An Approach to Solving Combinatorial Optimization Problems Using a Population of Reinforcement Learning Agents. Genetic and Evolutionary Computation Conf. (GECCO-99), pp. 1358-1365, 1999.

Page 25: Learning for Optimization: EDAs, probabilistic modelling, or


Summary

The field of metaheuristics (including Evolutionary Computation) has produced:
- A large variety of optimization algorithms.
- Demonstrated good performance on a range of real-world problems.

Metaheuristics are considerably more general:
- They can even be applied when there isn’t a “true” objective function (coevolution).
- They can evolve non-numerical objects.

Page 26: Learning for Optimization: EDAs, probabilistic modelling, or


Summary

EDAs take an explicit modelling approach to optimization:
- Existing statistical models and model-fitting algorithms can be employed.
- Potential for solving challenging problems.
- The model can be more easily visualized/interpreted than a dynamic population in a conventional EA.

Although the field is highly active, it is still relatively immature:
- Improve the quality of experimental results.
- Make sure research goals are well-defined.
- Lots of preliminary ideas, but a lack of comparative/follow-up research.
- Difficult to keep up with the literature and see connections with other fields.

Page 27: Learning for Optimization: EDAs, probabilistic modelling, or


The End!

Questions?