Learning for Optimization: EDAs, Probabilistic Modelling, or Explicit Modelling in Metaheuristic Optimization


Page 1: Learning for Optimization: EDAs, probabilistic modelling, or

Explicit Modelling in Metaheuristic Optimization

Dr Marcus Gallagher
School of Information Technology and Electrical Engineering
University of Queensland Q. 4072
[email protected]
MASCOS Symposium, 26/11/04

Page 2: Learning for Optimization: EDAs, probabilistic modelling, or


Talk outline:

- Optimization, heuristics and metaheuristics.
- “Estimation of Distribution” (optimization) algorithms (EDAs): a brief overview.
- A framework for describing EDAs.
- Other modelling approaches in metaheuristics.
- Summary

Page 3: Learning for Optimization: EDAs, probabilistic modelling, or


“Hard” Optimization Problems

Goal: Find

$x^* \in S$ such that $f(x^*) \le f(x), \ \forall x \in S$

where S is often multi-dimensional; real-valued or binary:

$S = \mathbb{R}^n$ or $S = \{0,1\}^n$

Many classes of optimization problems (and algorithms) exist.

When might it be worthwhile to consider metaheuristic or machine learning approaches?
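For concreteness, here are two standard toy objectives of this form, one over $S = \mathbb{R}^n$ and one over $S = \{0,1\}^n$; they are illustrative examples only, not problems from the talk.

```python
import numpy as np

# Continuous search space, S = R^n: the sphere function, minimized at x = 0.
def sphere(x):
    return float(np.dot(x, x))

# Binary search space, S = {0,1}^n: minimize the number of zero bits
# (equivalently, maximize the number of ones, i.e. "OneMax").
def zeros_count(x):
    return int(len(x) - np.sum(x))
```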

Page 4: Learning for Optimization: EDAs, probabilistic modelling, or


- Finding an “exact” solution is intractable.
- Limited knowledge of f():
  - No derivative information.
  - May be discontinuous, noisy, …
- Evaluating f() is expensive in terms of time or cost.
- f() is known or suspected to contain nasty features:
  - Many local minima, plateaus, ravines.
- The search space is high-dimensional.

Page 5: Learning for Optimization: EDAs, probabilistic modelling, or


What is the “practical” goal of (global) optimization?

“There exists a goal (e.g. to find as small a value of f() as possible), there exist resources (e.g. some number of trials), and the problem is how to use these resources in an optimal way.”

A. Torn and A. Zilinskas. Global Optimisation. Springer-Verlag, 1989. Lecture Notes in Computer Science, Vol. 350.

Page 6: Learning for Optimization: EDAs, probabilistic modelling, or


Heuristics

Heuristic (or approximate) algorithms aim to find a good solution to a problem in a reasonable amount of computation time, but with no guarantee of “goodness” or “efficiency” (cf. exact or complete algorithms).

Broad classes of heuristics:
- Constructive methods
- Local search methods

Page 7: Learning for Optimization: EDAs, probabilistic modelling, or


Metaheuristics

Metaheuristics are (roughly) high-level strategies that combine lower-level techniques for exploration and exploitation of the search space.

- An overarching term for algorithms including Evolutionary Algorithms, Simulated Annealing, Tabu Search, Ant Colony, Particle Swarm, Cross-Entropy, …

C. Blum and A. Roli. Metaheuristics in Combinatorial Optimization: Overview and Conceptual Comparison. ACM Computing Surveys, 35(3), 2003, pp. 268-308.

Page 8: Learning for Optimization: EDAs, probabilistic modelling, or


Learning/Modelling for Optimization

- Most optimization algorithms make some (explicit or implicit) assumptions about the nature of f().
- Many algorithms vary their behaviour during execution (e.g. simulated annealing).
- In some optimization algorithms the search is adaptive:
  - Future search points evaluated depend on previous points searched (and/or their f() values, derivatives of f(), etc.).
- Learning/modelling can be implicit (e.g. adapting the step-size in gradient descent, or the population in an EA)…
- …or explicit; examples from the optimization literature:
  - Nelder-Mead simplex algorithm.
  - Response surfaces (metamodelling, surrogate functions).
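As a small illustration of the first “explicit” example above, the Nelder-Mead simplex method is available in SciPy; the objective and starting point below are placeholders chosen only for the demo.

```python
import numpy as np
from scipy.optimize import minimize

# Nelder-Mead maintains and deforms a simplex of candidate points: an explicit
# (geometric) model of the local landscape that is adapted as the search runs.
result = minimize(lambda x: np.sum(x ** 2), x0=np.ones(5), method="Nelder-Mead")
print(result.x, result.fun)   # close to the origin, objective close to 0
```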

Page 9: Learning for Optimization: EDAs, probabilistic modelling, or


EDAs: Probabilistic Modelling for Optimization

- Based on the use of (unsupervised) density estimators/generative statistical models.
- The idea is to convert the optimization problem into a search over probability distributions.

P. Larranaga and J. A. Lozano (eds.). Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation. Kluwer Academic Publishers, 2002.

- The probabilistic model is, in some sense, an explicit model of (currently) promising regions of the search space.

Page 10: Learning for Optimization: EDAs, probabilistic modelling, or


EDAs: toy example

Page 11: Learning for Optimization: EDAs, probabilistic modelling, or


EDAs: toy example

Page 12: Learning for Optimization: EDAs, probabilistic modelling, or


GAs and EDAs compared

GA pseudocode:
1. Initialize the population, X(t);
2. Evaluate the objective function for each point;
3. Selection();
4. Crossover();
5. Mutation();
6. Form new population X(t+1);
7. While !(terminate()) Goto 2;

Page 13: Learning for Optimization: EDAs, probabilistic modelling, or


GAs and EDAs compared

EDA pseudocode:
1. Initialize a probability model, Q(x);
2. Create a population of points by sampling from Q(x);
3. Evaluate the objective function for each point;
4. Update Q(x) using the selected population and f() values;
5. While !(terminate()) Goto 2;
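A minimal, runnable sketch of the EDA loop above, assuming a binary search space and a model Q(x) made of independent Bernoulli marginals (a UMDA-style choice); the function and parameter values are illustrative only.

```python
import numpy as np

def umda_binary(f, n, pop_size=100, n_select=50, iters=100, seed=0):
    """Minimal univariate EDA for minimizing f over {0,1}^n.
    The model Q(x) is a vector of independent Bernoulli marginals q[i]."""
    rng = np.random.default_rng(seed)
    q = np.full(n, 0.5)                                    # 1. initialize Q(x)
    best_x, best_f = None, np.inf
    for _ in range(iters):
        pop = (rng.random((pop_size, n)) < q).astype(int)  # 2. sample a population from Q(x)
        fitness = np.array([f(x) for x in pop])            # 3. evaluate each point
        sel = pop[np.argsort(fitness)[:n_select]]          # truncation selection (lowest f)
        q = np.clip(sel.mean(axis=0), 0.05, 0.95)          # 4. refit Q(x) to the selected points
        if fitness.min() < best_f:
            best_f, best_x = fitness.min(), pop[fitness.argmin()]
    return best_x, best_f

# Example: minimize the number of zero bits (optimum: the all-ones string).
best_x, best_f = umda_binary(lambda x: int(len(x) - x.sum()), n=20)
```

On this toy objective the marginals q quickly drift towards 1, and the sampled population concentrates around the optimum.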

Page 14: Learning for Optimization: EDAs, probabilistic modelling, or


EDA Example 1

Population-Based Incremental Learning (PBIL)

S. Baluja, R. Caruana. Removing the Genetics from the Standard Genetic Algorithm. ICML’95.

p1 = Pr(x1 = 1)
p2 = Pr(x2 = 1)
...
pn = Pr(xn = 1)

Update rule, moving each probability towards the corresponding bit of the best sampled solution $x^b$ with learning rate $\alpha$:

$p_i \leftarrow (1 - \alpha)\, p_i + \alpha\, x_i^b$
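A short sketch of the PBIL loop implied by this update, again for a binary minimization problem; names and settings are illustrative.

```python
import numpy as np

def pbil(f, n, pop_size=50, alpha=0.1, iters=200, seed=0):
    """Sketch of PBIL for minimizing f over {0,1}^n: keep a probability vector p,
    sample a population from it, and nudge p towards the best sample of each
    generation with learning rate alpha."""
    rng = np.random.default_rng(seed)
    p = np.full(n, 0.5)
    for _ in range(iters):
        pop = (rng.random((pop_size, n)) < p).astype(int)
        fitness = np.array([f(x) for x in pop])
        x_best = pop[fitness.argmin()]            # best sample of this generation
        p = (1 - alpha) * p + alpha * x_best      # p_i <- (1 - alpha) p_i + alpha x_i^b
    return p

p = pbil(lambda x: int(len(x) - x.sum()), n=20)   # marginals drift towards 1
```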

Page 15: Learning for Optimization: EDAs, probabilistic modelling, or


EDA Example 2

Mutual Information Maximization for Input Clustering (MIMIC)

J. De Bonet, C. Isbell and P. Viola. MIMIC: Finding optima by estimating probability densities. Advances in Neural Information Processing Systems, vol. 9, 1997.

The model is a chain of pairwise conditionals over a permutation $i_1, \ldots, i_n$ of the variables:

$\hat{p}(\mathbf{x}) = p(x_{i_1} \mid x_{i_2})\, p(x_{i_2} \mid x_{i_3}) \cdots p(x_{i_{n-1}} \mid x_{i_n})\, p(x_{i_n})$
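A sketch of how one would sample a bit string from this chain factorization; the permutation and conditional tables here are hypothetical inputs (MIMIC itself builds the chain greedily from empirical entropy estimates on the selected population).

```python
import numpy as np

def sample_chain(perm, p_root, p_cond, rng):
    """Draw one bit string from a MIMIC-style chain model.

    perm   : permutation [i_1, ..., i_n] of the variable indices
    p_root : Pr(x_{i_n} = 1), the unconditioned variable at the end of the chain
    p_cond : p_cond[k][v] = Pr(x_{i_k} = 1 | x_{i_{k+1}} = v), for k = 0..n-2
    """
    n = len(perm)
    x = np.zeros(n, dtype=int)
    x[perm[-1]] = int(rng.random() < p_root)       # sample x_{i_n} first
    for k in range(n - 2, -1, -1):                 # then each variable given its successor
        v = x[perm[k + 1]]
        x[perm[k]] = int(rng.random() < p_cond[k][v])
    return x

# e.g. a 3-variable chain with hand-set, purely illustrative conditionals
rng = np.random.default_rng(0)
x = sample_chain(perm=[2, 0, 1], p_root=0.7,
                 p_cond=[[0.1, 0.9], [0.2, 0.8]], rng=rng)
```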

Page 16: Learning for Optimization: EDAs, probabilistic modelling, or


EDA Example 3

Combining Optimizers with Mutual Information Trees (COMIT)

S. Baluja and S. Davies. Using optimal dependency-trees for combinatorial optimization: learning the structure of the search space. Proc. ICML’97.

- Uses a tree-structured graphical model.
- The model can be constructed in O(n²) time using a variant of the minimum spanning tree algorithm.
- The model is optimal, given the restrictions, in the sense that the Kullback-Leibler divergence between the model and a full joint distribution is minimized.
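Below is a sketch of the dependency-tree fitting step described above (pairwise mutual information plus a maximum-weight spanning tree, i.e. a Chow-Liu tree), assuming binary samples; it covers only the model-building piece, not the full COMIT algorithm, and the function name is illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree

def mutual_information_tree(samples, eps=1e-9):
    """Fit a dependency tree to binary samples: compute pairwise empirical
    mutual information, then take the maximum-weight spanning tree, which
    minimizes the KL divergence between the tree model and the empirical joint."""
    m, n = samples.shape
    mi = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # empirical (smoothed) joint distribution of (x_i, x_j)
            joint = np.zeros((2, 2))
            for a in (0, 1):
                for b in (0, 1):
                    joint[a, b] = np.mean((samples[:, i] == a) & (samples[:, j] == b)) + eps
            joint /= joint.sum()
            pi, pj = joint.sum(axis=1), joint.sum(axis=0)
            mi[i, j] = np.sum(joint * np.log(joint / np.outer(pi, pj)))
    # Maximum-weight spanning tree via a minimum spanning tree on (C - MI):
    # every spanning tree has exactly n-1 edges, so the constant shift C does
    # not change which tree wins, and it keeps all edge weights positive.
    weights = np.triu(mi.max() + 1.0 - mi, k=1)
    return minimum_spanning_tree(csr_matrix(weights))   # nonzero entries = tree edges
```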

Page 17: Learning for Optimization: EDAs, probabilistic modelling, or


EDA Example 4

Bayesian Optimization Algorithm (BOA)

M. Pelikan, D. Goldberg and E. Cantu-Paz. BOA: The Bayesian optimization algorithm. In Proc. GECCO’99.

- Bayesian network model where nodes can have at most k parents.
- Greedy search over the Bayesian Dirichlet equivalence metric to find the network structure.

Page 18: Learning for Optimization: EDAs, probabilistic modelling, or


Further work on EDAs

EDAs have also been developed:
- For problems with continuous and mixed variables.
- That use mixture models and kernel estimators, allowing for the modelling of multi-modal distributions.
- …and more!

Page 19: Learning for Optimization: EDAs, probabilistic modelling, or


A framework to describe building and adapting a probabilistic model for optimization

See: M. Gallagher and M. Frean. Population-Based Continuous Optimization, Probabilistic Modelling and Mean Shift. To appear, Evolutionary Computation, 2005.

Consider a continuous EDA with model

$Q(\mathbf{x}) = \prod_{i=1}^{n} Q_i(x_i)$

Consider a Boltzmann distribution over f(x):

$P(\mathbf{x}) = \frac{1}{Z}\exp\left(\frac{f(\mathbf{x})}{T}\right)$

Page 20: Learning for Optimization: EDAs, probabilistic modelling, or


As T → 0, P(x) tends towards a set of impulse spikes over the global optima.

We now have a probability distribution whose form we know, Q(x), and we would like to modify it to be close to P(x). KL divergence:

$K = \int_{\mathbf{x}} Q(\mathbf{x}) \log\frac{Q(\mathbf{x})}{P(\mathbf{x})}\, d\mathbf{x}$

Let Q(x) be a Gaussian; try to minimize K via gradient descent with respect to the mean parameter of Q(x).

Page 21: Learning for Optimization: EDAs, probabilistic modelling, or


The gradient with respect to the mean v of Q becomes

$\frac{\partial K}{\partial v} = -\frac{1}{T}\int_{\mathbf{x}} Q(\mathbf{x})\,(\mathbf{x} - v)\, f(\mathbf{x})\, d\mathbf{x}$

using $\frac{\partial Q}{\partial v} = Q(\mathbf{x})(\mathbf{x} - v)$ (variance constants absorbed).

An approximation to the integral is to use a sample S of n points drawn from Q(x):

$\frac{\partial K}{\partial v} \approx -\frac{1}{nT}\sum_{\mathbf{x}_i \in S}(\mathbf{x}_i - v)\, f(\mathbf{x}_i)$

Page 22: Learning for Optimization: EDAs, probabilistic modelling, or


The algorithm update rule is then

$v \leftarrow v + \frac{\alpha}{n}\sum_{\mathbf{x}_i \in S}(\mathbf{x}_i - v)\, f(\mathbf{x}_i)$

Similar ideas can be found in:

A. Berny. Statistical Machine Learning and Combinatorial Optimization. In L. Kallel et al. (eds.), Theoretical Aspects of Evolutionary Computation, pp. 287-306. Springer, 2001.

M. Toussaint. On the evolution of phenotypic exploration distributions. In C. Cotta et al. (eds.), Foundations of Genetic Algorithms (FOGA VII), pp. 169-182. Morgan Kaufmann, 2003.
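A minimal sketch of this update rule, assuming an isotropic Gaussian Q(x) with fixed spread and absorbing 1/T into the learning rate α; all names and values are illustrative.

```python
import numpy as np

def kl_mean_update_eda(f, dim, v0, sigma=0.5, alpha=0.1, pop_size=50,
                       iters=200, seed=0):
    """Continuous EDA sketch: Q(x) is an isotropic Gaussian with mean v.
    Each generation, sample from Q(x), then move v along the sample
    approximation of the KL gradient, weighting every point by its raw f()
    value (no selection). Here f is treated as a quantity to maximize."""
    rng = np.random.default_rng(seed)
    v = np.asarray(v0, dtype=float)
    for _ in range(iters):
        X = v + sigma * rng.standard_normal((pop_size, dim))   # sample S from Q(x)
        fvals = np.array([f(x) for x in X])
        # v <- v + (alpha / n) * sum_{x_i in S} (x_i - v) f(x_i)
        v = v + (alpha / pop_size) * ((X - v) * fvals[:, None]).sum(axis=0)
    return v

# Example: f(x) = -||x||^2, so the mean should drift towards the origin.
v = kl_mean_update_eda(lambda x: -float(np.dot(x, x)), dim=5, v0=np.ones(5))
```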

Page 23: Learning for Optimization: EDAs, probabilistic modelling, or


Some insights

- The derived update rule is closely related to those found in Evolution Strategies and a version of PBIL for continuous spaces.
- It is possible to view these existing algorithms as approximately doing KL minimization.
- The objective function appears explicitly in this update rule (no selection).

Page 24: Learning for Optimization: EDAs, probabilistic modelling, or


Other Research in Learning/Modelling for Optimization

J. A. Boyan and A. W. Moore. Learning Evaluation Functions to Improve Optimization by Local Search. Journal of Machine Learning Research, 1:2, 2000.

B. Anderson, A. Moore and D. Cohn. A Nonparametric Approach to Noisy and Costly Optimization. International Conference on Machine Learning, 2000.

D. R. Jones. A Taxonomy of Global Optimization Methods Based on Response Surfaces. Journal of Global Optimization, 21(4):345-383, 2001.

Reinforcement learning:
- R. J. Williams (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229-256.
- V. V. Miagkikh and W. F. Punch III. An Approach to Solving Combinatorial Optimization Problems Using a Population of Reinforcement Learning Agents. Genetic and Evolutionary Computation Conf. (GECCO-99), pp. 1358-1365, 1999.

Page 25: Learning for Optimization: EDAs, probabilistic modelling, or


Summary

The field of metaheuristics (including Evolutionary Computation) has produced:
- A large variety of optimization algorithms.
- Demonstrated good performance on a range of real-world problems.

Metaheuristics are considerably more general:
- They can even be applied when there isn’t a “true” objective function (coevolution).
- They can evolve non-numerical objects.

Page 26: Learning for Optimization: EDAs, probabilistic modelling, or


Summary

EDAs take an explicit modelling approach to optimization:
- Existing statistical models and model-fitting algorithms can be employed.
- Potential for solving challenging problems.
- The model can be more easily visualized/interpreted than a dynamic population in a conventional EA.

Although the field is highly active, it is still relatively immature:
- Improve the quality of experimental results.
- Make sure research goals are well-defined.
- Lots of preliminary ideas, but a lack of comparative/follow-up research.
- Difficult to keep up with the literature and see connections with other fields.

Page 27: Learning for Optimization: EDAs, probabilistic modelling, or


The End!

Questions?