Hierarchical Bayesian Optimization Algorithm (hBOA)
Martin Pelikan, University of Missouri at St. Louis
[email protected]



  • Foreword
    Motivation: the black-box optimization (BBO) problem.
    Given: the set of all potential solutions and a performance measure (evaluation procedure).
    Task: find the optimum (best solution).
    The formulation is useful: no need for gradients or numerical functions. But it brings many important and tough challenges.
    This talk: combine machine learning and evolutionary computation to create practical and powerful optimizers (BOA and hBOA).

  • Overview
    Black-box optimization (BBO).
    BBO via probabilistic modeling: motivation and examples.
    Bayesian optimization algorithm (BOA).
    Hierarchical BOA (hBOA).
    Theory and experiment.
    Conclusions.

  • Black-box Optimization
    Input: What do potential solutions look like? How do we evaluate the quality of potential solutions?
    Output: the best solution found (the optimum).
    Important: we don't know what's inside the evaluation procedure.
    Vector and tree representations are common. This talk: binary strings of fixed length.

  • BBO: Examples
    Atomic cluster optimization. Solutions: vectors specifying the positions of all atoms. Performance: lower energy is better.
    Telecom network optimization. Solutions: connections between nodes (cities, ...). Performance: satisfy constraints, minimize cost.
    Design. Solutions: vectors specifying the parameters of the design. Performance: finite element analysis, experiment, ...

  • BBO: Advantages & Difficulties
    Advantages: use the same optimizer for all problems; no need for much prior knowledge.
    Difficulties:
    Many places to go: 100-bit strings have 2^100 = 1,267,650,600,228,229,401,496,703,205,376 solutions, so enumeration is not an option.
    Many places to get stuck: local operators are not an option.
    Must learn what's in the box automatically.
    Noise, multiple objectives, interactive evaluation, ...

  • Typical Black-Box Optimizer
    Sample solutions. Evaluate the sampled solutions. Learn to sample better.
    The loop: sample, evaluate, learn.

  • Many Ways to Do It
    Hill climber: Start with a random solution. Flip the bit that improves the solution most. Finish when no more improvement is possible.
    Simulated annealing: introduce the Metropolis criterion.
    Evolutionary algorithms: inspiration from natural evolution and genetics.

  • Evolutionary Algorithms
    Evolve a population of candidate solutions. Start with a random population.
    Each iteration:
    Selection: select promising solutions.
    Variation: apply crossover and mutation to the selected solutions.
    Replacement: incorporate the new solutions into the original population.

  • Estimation of Distribution Algorithms
    Replace the standard variation operators by:
    Building a probabilistic model of the promising solutions.
    Sampling the built model to generate new solutions.
    The probabilistic model stores the features that make good solutions good, and generates new solutions with just those features.

  • EDAs (diagram): a current population of binary strings; the selected population; a probabilistic model built from the selected strings; a new population sampled from the model.
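    A minimal sketch of this loop in Python, assuming truncation selection and full replacement; `eda`, `learn`, and `sample` are illustrative names, not from the slides, and stand in for the model-specific steps discussed next:

    import random

    def eda(evaluate, n_bits, learn, sample, pop_size=100, n_gens=50):
        """Generic EDA loop: select promising solutions, build a
        probabilistic model of them, sample the model for new ones."""
        pop = [[random.randint(0, 1) for _ in range(n_bits)]
               for _ in range(pop_size)]
        for _ in range(n_gens):
            pop.sort(key=evaluate, reverse=True)    # maximize fitness
            selected = pop[:pop_size // 2]          # truncation selection
            model = learn(selected)                 # model building
            pop = [sample(model) for _ in range(pop_size)]  # model sampling
        return max(pop, key=evaluate)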

  • What Models to Use?
    Our plan:
    Simple example: a probability vector for binary strings.
    Bayesian networks (BOA).
    Bayesian networks with local structures (hBOA).

  • Probability Vector
    Baluja (1995). Assumes binary strings of fixed length.
    Stores the probability of a 1 in each position. New strings are generated with those proportions.
    Example: (0.5, 0.5, ..., 0.5) for the uniform distribution; (1, 1, ..., 1) for generating strings of all 1s.

  • EDA example, probability vector (diagram): from the selected population, the learned vector is (1.0, 0.5, 0.5, 0.0, 1.0); the new population is sampled position by position with those probabilities.
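    Continuing the sketch above, the probability-vector model amounts to two short functions: marginal frequencies to learn, independent coin flips to sample.

    def learn_prob_vector(selected):
        """Marginal frequency of a 1 in each string position."""
        n = len(selected[0])
        return [sum(s[i] for s in selected) / len(selected) for i in range(n)]

    def sample_prob_vector(p):
        """One new string; bit i is 1 with probability p[i]."""
        return [1 if random.random() < p[i] else 0 for i in range(len(p))]

    # For example, on ONEMAX (fitness = number of 1s):
    # best = eda(sum, 20, learn_prob_vector, sample_prob_vector)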

  • Probability Vector Dynamics
    Bits that perform better get more copies, and are combined in new ways. But the context of each bit is ignored.
    Example problem 1: ONEMAX, where fitness is the number of 1s in the string.

    Optimum: 111...1

  • Probability Vector on ONEMAX (plot): proportions of 1s vs. iteration; every position converges to the optimum. Success.

  • Probability Vector: Ideal Scale-up
    O(n log n) evaluations until convergence (Harik, Cantú-Paz, Goldberg, & Miller, 1997; Mühlenbein & Schlierkamp-Voosen, 1993).
    Other algorithms:
    Hill climber: O(n log n) (Mühlenbein, 1992).
    GA with uniform crossover: approx. O(n log n).
    GA with one-point crossover: slightly slower.

  • When Does the Prob. Vector Fail?
    Example problem 2: Concatenated traps.
    Partition the input string into disjoint groups of 5 bits.
    Each group contributes via a trap function of u, the number of ones in the group: trap(u) = 5 if u = 5, and 4 − u otherwise.

    Concatenated trap = sum of the single traps. Optimum: 111...1 (see the sketch below).
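    A sketch of this fitness function, matching the averages quoted on the "Why Failure?" slide below (f(0****) = 2, f(1****) = 1.375); the contiguous grouping is for illustration only, since the partition is fixed but unknown to the algorithm:

    def trap5(block):
        """5-bit trap: deceptive everywhere except the all-ones optimum."""
        u = sum(block)                 # number of ones in the group
        return 5 if u == 5 else 4 - u

    def concatenated_traps(bits):
        """Sum of 5-bit traps over disjoint groups."""
        return sum(trap5(bits[i:i + 5]) for i in range(0, len(bits), 5))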

  • Trap (plot): trap value vs. number of 1s; the slope leads away from the isolated global optimum at 11111.

  • Probability Vector on Traps (plot): proportions of 1s vs. iteration; the vector converges away from the optimum. Failure.

  • Why Failure?
    ONEMAX: a 1 in any position outperforms a 0 on average.

    Traps: the optimum is 11111, but
    f(0****) = 2
    f(1****) = 1.375

    So single-bit statistics are misleading.

  • How to Fix It?
    Consider 5-bit statistics instead of 1-bit ones. Then 11111 would outperform 00000.
    Learn the model: compute p(00000), p(00001), ..., p(11111) for each group.
    Sample the model: sample 5 bits at a time; generate 00000 with p(00000), 00001 with p(00001), ...
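    A sketch of this fix, assuming the 5-bit block boundaries are known (discovering them automatically is exactly what BOA adds); it learns the empirical joint distribution of each block and samples whole blocks at a time:

    from collections import Counter
    import random

    def learn_block_model(selected, k=5):
        """Empirical joint distribution of each k-bit block."""
        n = len(selected[0])
        model = []
        for start in range(0, n, k):
            counts = Counter(tuple(s[start:start + k]) for s in selected)
            total = sum(counts.values())
            model.append([(cfg, c / total) for cfg, c in counts.items()])
        return model

    def sample_block_model(model):
        """Sample each block as a whole, with its empirical probability."""
        bits = []
        for block in model:
            configs, probs = zip(*block)
            bits.extend(random.choices(configs, weights=probs)[0])
        return bits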

  • Correct Model on Traps: Dynamics (plot): proportions of 1s vs. iteration; the proportions converge to the optimum. Success.

  • Good News: Good Stats Work Great!
    Optimum in O(n log n) evaluations. Same performance as on ONEMAX!
    Others:
    Hill climber: O(n^5 log n) = much worse.
    GA with uniform crossover: O(2^n) = intractable.
    GA with one-point crossover: O(2^n) (without tight linkage).

  • Challenge
    If we could learn and use the context for each position:
    Find non-misleading statistics.
    Use those statistics as in the probability vector.
    Then we could solve problems decomposable into statistics of order at most k with at most O(n^2) evaluations! And there are many such problems.

  • Bayesian Optimization Algorithm (BOA)
    Pelikan, Goldberg, & Cantú-Paz (1998). Uses a Bayesian network (BN) as the model.
    A Bayesian network is an acyclic directed graph:
    Nodes are variables (string positions).
    Edges encode conditional dependencies.
    Missing edges encode conditional independencies (implicit).

  • Conditional Dependency (diagram): three variables X, Y, Z with directed edges illustrating a conditional dependency.

  • Bayesian Network (BN)
    Explicit: conditional dependencies.
    Implicit: conditional independencies.
    Parameters: conditional probability tables (CPTs).

  • BOA (diagram): current population; selected population; Bayesian network learned from the selected solutions; new population sampled from the network.

  • BOA Variation
    Two steps:
    Learn a Bayesian network (for the promising solutions).
    Sample the built Bayesian network (to generate new candidate solutions).
    Next: a brief look at the two steps in BOA.

  • Learning BNs

    Two components:

    Scoring metric (to evaluate models).

    Search procedure (to find the best model).

  • Learning BNs: Scoring Metrics
    Bayesian metrics: the Bayesian-Dirichlet metric with likelihood equivalence (BDe).

    Minimum description length metrics: the Bayesian information criterion (BIC).
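    For reference, the BIC score of a network B with parent sets \Pi_i, measured on N selected solutions, is commonly written as follows (one standard form; the constants differ across sources):

    \mathrm{BIC}(B) = \sum_{i=1}^{n} \Big( -N \, H(X_i \mid \Pi_i) \;-\; 2^{|\Pi_i|} \, \frac{\log_2 N}{2} \Big)

    where H(X_i | \Pi_i) is the conditional entropy of position X_i given its parents; the first term rewards fit, the second penalizes model complexity.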

  • Learning BNs: Search Procedure
    Start with an empty network (equivalent to the probability vector).
    Execute the primitive operator that improves the metric the most, until no more improvement is possible.
    Primitive operators: edge addition, edge removal, edge reversal.
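    A sketch of the greedy search with edge additions only (removals and reversals omitted for brevity); `score(v, parent_set)` stands for a decomposable metric from the previous slide, and the parent-count bound is an illustrative assumption:

    def greedy_bn_search(score, n, max_parents=3):
        """Greedily add the edge that improves the metric the most,
        until no single addition improves it."""
        parents = {i: set() for i in range(n)}
        while True:
            best_gain, best_edge = 0.0, None
            for u in range(n):
                for v in range(n):
                    if u == v or u in parents[v] or len(parents[v]) >= max_parents:
                        continue
                    if creates_cycle(parents, u, v):   # keep the graph acyclic
                        continue
                    gain = score(v, parents[v] | {u}) - score(v, parents[v])
                    if gain > best_gain:
                        best_gain, best_edge = gain, (u, v)
            if best_edge is None:
                return parents
            parents[best_edge[1]].add(best_edge[0])

    def creates_cycle(parents, u, v):
        """Adding u -> v makes a cycle iff v is already an ancestor of u."""
        stack, seen = [u], set()
        while stack:
            w = stack.pop()
            if w == v:
                return True
            if w not in seen:
                seen.add(w)
                stack.extend(parents[w])
        return False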

  • Sampling BNs: PLS
    Probabilistic logic sampling (PLS), two phases:
    Create an ancestral ordering of the variables, where each variable depends only on its predecessors.
    Sample all variables in that order using the CPTs; repeat for each new candidate solution.
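    A sketch of PLS using the same representation as above: `parents[v]` is a set, and `cpt[v]` is an assumed mapping from each tuple of parent values to p(X_v = 1).

    def sample_pls(parents, cpt, n):
        """Ancestral ordering, then forward sampling from the CPTs."""
        order, done = [], set()
        while len(order) < n:            # terminates because the graph is acyclic
            for v in range(n):
                if v not in done and parents[v] <= done:
                    order.append(v)
                    done.add(v)
        x = [None] * n
        for v in order:
            key = tuple(x[p] for p in sorted(parents[v]))
            x[v] = 1 if random.random() < cpt[v][key] else 0
        return x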

  • BOA Theory: Key Components
    Primary target: scalability.
    Population sizing N: how large must populations be for a reliable solution?
    Number of generations (iterations) G: how many iterations until convergence?
    Overall complexity: O(N × G) evaluations.
    Overhead: low-order polynomial in N, G, and n.

  • BOA Theory: Population Sizing
    Assumptions: n bits, subproblems of order k.
    Initial supply (Goldberg): have enough partial solutions to combine.
    Decision making (Harik et al., 1997): decide well between competing partial solutions.
    Drift (Thierens, Goldberg, & Pereira, 1998): don't lose less salient partial solutions prematurely.
    Model building (Pelikan et al., 2000, 2002): find a good model.

  • BOA Theory: Number of Generations
    Two bounding cases:
    Uniform scaling: subproblems converge in parallel; the ONEMAX model (Mühlenbein & Schlierkamp-Voosen, 1993).

    Exponential scaling: subproblems converge sequentially; domino convergence (Thierens, Goldberg, & Pereira, 1998).

  • Good News
    Theory:
    Population sizing (Pelikan et al., 2000, 2002): initial supply, decision making, drift, model building. N = O(n) to O(n^1.05).
    Iterations until convergence (Pelikan et al., 2000, 2002): uniform scaling, exponential scaling. G = O(n^0.5) to O(n).
    BOA solves order-k decomposable problems in O(n^1.55) to O(n^2) evaluations!

  • Theory vs. Experiment (5-bit Traps)

  • Additional Plus: Prior Knowledge
    BOA need not know much about the problem: only the set of solutions plus the measure (BBO).
    But BOA can use prior knowledge:
    High-quality partial or full solutions.
    Likely or known interactions.
    Previously learned structures.
    Problem-specific heuristics and search methods.

  • From Single Level to Hierarchy
    What if the problem can't be decomposed like this?
    Inspiration from human problem solving: use hierarchical decomposition.
    Decompose the problem on multiple levels.
    Solutions from lower levels become the basic building blocks for constructing solutions on the current level.
    Bottom-up hierarchical problem solving.

  • Hierarchical Decomposition (diagram): a car decomposes into the engine, braking system, electrical system, and fuel system; the engine in turn decomposes into valves, the ignition system, and so on.

  • 3 Keys to Hierarchy Success
    Proper decomposition: must decompose the problem properly on each level.
    Chunking: must represent and manipulate large-order partial solutions.
    Preservation of alternative solutions: must preserve alternative partial solutions (chunks).

  • Hierarchical BOA (hBOA)
    Pelikan & Goldberg (2001).
    Proper decomposition: use BNs as in BOA.
    Chunking: use local structures in BNs.
    Preservation of alternative solutions: restricted tournament replacement (niching).

  • Local Structures in BNs
    Look at one conditional dependency: a full table needs 2^k probabilities for k parents.
    Why not use more powerful representations for the conditional probabilities?

    Diagram: the CPT for p(X1 | X2, X3) represented as a decision tree; branching on X2 and then X3 leads to leaves storing 26%, 44%, and 15%.
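    A sketch of such a local structure: a decision tree replacing the full CPT. The assignment of the slide's leaf values (26%, 44%, 15%) to particular branches is a guess for illustration.

    # A full CPT for k parents stores 2**k probabilities; a decision tree
    # can merge contexts that share one probability. Hypothetical tree for
    # p(X1 = 1 | X2, X3): test X2; only if X2 = 1, also test X3.
    tree = ('X2', 0.26,                    # X2 = 0            -> 26%
                  ('X3', 0.44, 0.15))      # X2 = 1, X3 = 0/1  -> 44% / 15%

    def tree_prob(node, assignment):
        """Walk the tree: internal nodes test a variable, leaves hold p(X=1)."""
        while isinstance(node, tuple):
            var, zero_branch, one_branch = node
            node = one_branch if assignment[var] else zero_branch
        return node

    # tree_prob(tree, {'X2': 1, 'X3': 0})  ->  0.44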

  • Restricted Tournament Replacement
    Used in hBOA for niching. Insert each new candidate solution x like this:
    Pick a random subset of the original population.
    Find the solution y most similar to x in the subset.
    Replace y by x if x is better than y.
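    A sketch of this replacement step for binary strings; the window size is a free parameter here:

    def rtr_insert(population, x, fitness, window=20):
        """x competes against the most similar member of a random window,
        which preserves alternative solutions in separate niches."""
        subset = random.sample(range(len(population)), window)
        closest = min(subset,             # most similar = min Hamming distance
                      key=lambda i: sum(a != b for a, b in zip(population[i], x)))
        if fitness(x) > fitness(population[closest]):
            population[closest] = x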

  • hBOA: Scalability
    Solves nearly decomposable and hierarchical problems (Simon, 1968).
    The number of evaluations grows as a low-order polynomial.
    Most other methods fail to solve many such problems.

  • Hierarchical Traps
    Traps on multiple levels: blocks of 0s and 1s are mapped to form solutions on the next level.

    Three challenges:
    Many local optima.
    Deception everywhere.
    No single-level decomposability.

  • Hierarchical Traps

  • Other Similar Algorithms
    Estimation of distribution algorithms (EDAs): a dynamic branch of evolutionary computation. Examples:
    PBIL (Baluja, 1995): univariate distributions (full independence).
    COMIT: considers tree models.
    ECGA: groups of variables considered together.
    EBNA (Etxeberria et al., 1999) and LFDA (Mühlenbein et al., 1999): versions of BOA.
    And others.

  • EDAs: Promising Results
    Artificial classes of problems; MAXSAT, SAT (Pelikan, 2005).
    Nurse scheduling (Li & Aickelin, 2003).
    Military antenna design (Santarelli et al., 2004).
    Groundwater remediation design (Arst et al., 2004).
    Forest management (Ducheyne et al., 2003).
    Telecommunication network design (Rothlauf, 2002).
    Graph partitioning (Ocenasek & Schwarz, 1999; Mühlenbein & Mahnig, 2002; Baluja, 2004).
    Portfolio management (Lipinski, 2005).
    Quantum excitation chemistry (Sastry et al., 2005).

  • Current Projects
    Algorithm design:
    hBOA for computer programs.
    hBOA for geometries (distance/angle-based).
    hBOA for machine learners and data miners.
    hBOA for scheduling and permutation problems.
    Efficiency enhancement for EDAs.
    Multiobjective EDAs.
    Applications:
    Cluster optimization and spin glasses.
    Data mining.
    Learning classifier systems & neural networks.

  • Conclusions for Researchers
    Principled design of practical black-box optimizers:
    Scalability.
    Robustness.
    Solutions to broad classes of problems.
    Facetwise design and little models are useful for approaching research in evolutionary computation, and allow the creation of practical algorithms and theory.

  • Conclusions for Practitioners
    BOA and hBOA are revolutionary optimizers:
    Need no parameters to tune.
    Need almost no problem-specific knowledge, but can incorporate knowledge in many forms.
    Problem regularities are discovered and exploited automatically.
    Solve broad classes of challenging problems, even problems unsolvable by any other black-box optimizer.
    Can deal with noise and multiple objectives.

  • Book on hBOA

    Martin Pelikan (2005). Hierarchical Bayesian Optimization Algorithm: Toward a New Generation of Evolutionary Algorithms. Springer.

  • Contact

    Martin Pelikan
    Dept. of Math. and Computer Science, 320 CCB
    University of Missouri at St. Louis
    8001 Natural Bridge Rd.
    St. Louis, MO 63121

    [email protected]
    http://www.cs.umsl.edu/~pelikan/

  • Problem 1: Concatenated Traps
    Partition input binary strings into 5-bit groups.
    The partitions are fixed but unknown to the algorithm.
    Each partition contributes the same way, and the contributions sum up.

  • Concatenated 5-bit Traps

  • Spin Glasses: Problem Definition
    A 1D, 2D, or 3D grid of spins; each spin can take the value +1 or −1.
    Relationships between neighboring spins (i, j) are defined by coupling constants J_ij.
    Usually periodic boundary conditions (a toroid).
    Task: find the values of the spins that minimize the energy (see below).
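    The energy in question is the usual Ising spin-glass Hamiltonian, summed over neighboring pairs (sign conventions vary by source):

    E(s) = -\sum_{\langle i,j \rangle} J_{ij} \, s_i \, s_j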

  • Spin Glasses as Constraint Satisfaction (diagram): spins are the variables; each coupling J_ij acts as a constraint between neighboring spins.

  • Spin Glasses: Problem Difficulty
    1D: easy, set the spins sequentially.
    2D: several polynomial methods exist, the best is O(n^3.5); but there are exponentially many local optima, so standard approaches (e.g., simulated annealing, MCMC) fail.
    3D: NP-complete, even for couplings restricted to {−1, 0, +1}.
    Often random subclasses are considered:
    ±J spin glasses: couplings uniformly −1 or +1.
    Gaussian spin glasses: couplings N(0, σ²).

  • Ising Spin Glasses (2D)

  • Results on 2D Spin Glasses
    The number of evaluations is O(n^1.51); the overall time is O(n^3.51).
    Compare O(n^3.51) with O(n^3.5) for the best problem-specific method (Galluccio & Loebl, 1999).
    Great results also on Gaussian couplings.

  • Ising Spin Glasses (3D)

  • MAXSAT
    Given a CNF formula, find an interpretation of the Boolean variables that maximizes the number of satisfied clauses.
    Example: (x2 ∨ x7 ∨ x5) ∧ (x1 ∨ x4 ∨ x3).

  • MAXSAT Difficulty
    MAXSAT is NP-complete for k-CNF with k > 1.

    But random problems are rather easy for almost any method.

    Many interesting subclasses are on SATLIB, e.g.:
    3-CNF from the phase transition (c = 4.3n clauses).
    CNFs from other problems (graph coloring, ...).

  • MAXSAT: Random 3CNFs

  • MAXSAT: Graph Coloring
    500 variables, 3600 clauses.
    From morphed graph coloring (Toby Walsh).

  • Spin Glass to MAXSAT
    Convert each coupling J_ij between spins s_i and s_j into two clauses over the corresponding Boolean variables.
    Consistent pairs of spins satisfy 2 clauses; inconsistent pairs satisfy 1 clause.
    MAXSAT solvers perform poorly even in 2D!
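    A sketch of one such conversion, with literals as signed integers (DIMACS style) and assuming the convention above, where J_ij = +1 rewards equal spins; the exact clause forms on the original slide were not recoverable:

    def coupling_to_clauses(i, j, J):
        """Two clauses per coupling; a pair of spins consistent with the
        coupling satisfies both, an inconsistent pair satisfies one."""
        if J == +1:                      # equal spins satisfy both clauses
            return [(i, -j), (-i, j)]
        return [(i, j), (-i, -j)]        # unequal spins satisfy both clauses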