
Page 1

Data-driven Meta-heuristic Search

Ke Tang

USTC-Birmingham Joint Research Institute in Intelligent Computation and Its Applications (UBRI) School of Computer Science and Technology

University of Science and Technology of China

December 2014 @ CityU of Hong Kong

Page 2

Outline

•  A data-driven perspective on Meta-heuristic Search
•  Speciation in DDMS
•  Algorithm Selection in DDMS
•  Identification of Interacting Decision Variables in DDMS
•  Summary

Page 3

Outline

•  A data-driven perspective on Meta-heuristic Search
•  Speciation in DDMS
•  Algorithm Selection in DDMS
•  Identification of Interacting Decision Variables in DDMS
•  Summary

Page 4

A data-driven perspective on MS

•  There are a lot of famous meta-heuristic search (MS) methods:
–  Simulated Annealing
–  Tabu Search
–  Scatter Search
–  Genetic Algorithms
–  Evolution Strategies
–  Evolutionary Programming
–  Particle Swarm Optimizer
–  Ant Colony Optimization
–  Differential Evolution
–  Estimation of Distribution Algorithms
–  etc.

Page 5

A data-driven perspective on MS

•  Despite their different historical backgrounds, most MS methods share a similar framework, i.e., they are Stochastic Generate-and-Test algorithms.

•  An MS method iteratively samples in a solution space, and thus can be viewed as a data-generating process.

(Figure: sampling produces a data table whose rows are individuals 1 … n and whose columns are x1 … xD and fitness.)

Page 6

A data-driven perspective on MS

•  The data contain a lot of information, such as:
–  The candidate solutions (individuals)
–  Their corresponding fitness
–  The "origin" of an individual (e.g., which operator was applied to which parents to generate it)

•  Data-driven Meta-heuristic Search (DDMS): exploiting the data generated by an MS during its search process to enhance the performance of the MS.

Page 7

A data-driven perspective on MS

•  Many existing works can be interpreted as DDMS:
–  Surrogate-Assisted Evolutionary Algorithms
–  Many parameter adaptation/self-adaptation schemes

•  Key questions of DDMS:
–  Q1: Why will an MS benefit from historical data?
–  Q2: What information is to be extracted from the data?
–  Q3: How should the required information be extracted from the data?

Page 8

A data-driven perspective on MS

•  The answers to Q1 and Q2 define the specific data analytics problem that needs to be addressed.

•  Data analytics problems in DDMS are likely to be intractable (e.g., NP-hard) themselves, or they may introduce substantial computational overhead. (No free lunch)

•  Hence, the trade-off between the overhead and the benefit of using historical data should always be kept in mind when answering Q3.

Page 9

Outline

•  A data-driven perspective on Meta-heuristic Search
•  Speciation in DDMS
•  Algorithm Selection in DDMS
•  Identification of Interacting Decision Variables in DDMS
•  Summary

Page 10

Speciation in DDMS

•  Challenges brought by a multimodal problem:
–  There might be more than one optimum that are (roughly) equally good.
–  The task becomes finding multiple optima of the problem.

•  Why?
–  Provides the user with a range of choices (more informed decisions)
–  Reveals insights into the problem (inspires innovations)

Page 11

Speciation in DDMS

•  When employing EAs to find multiple optima, a procedure called speciation is usually required.

•  Speciation: partitioning a population into a few species.
–  Niche: a region of attraction on the fitness landscape
–  Species: a group of individuals occupying the same niche
–  Species seed: the best (fittest) individual of a species

Page 12

Speciation in DDMS

•  A typical speciation procedure

Page 13

Speciation in DDMS

•  Most speciation methods rely on a sub-algorithm to determine whether two individuals are of the same species.

•  Speciation methods:
–  Distance-based: determines whether two individuals are of the same species according to their distance.
–  Topology-based: determines whether two individuals are of the same species according to the fitness landscape topography.

(Diagram: speciation methods split into distance-based and topology-based; the topology-based branch includes Hill-Valley and Recursive Middling.)

Page 14

Distance-based Speciation

•  Two individuals are assigned to the same species if their distance is smaller than a predefined threshold called the niche radius. (A minimal sketch follows below.)

•  Introduces an additional parameter (the niche radius), which is difficult to tune.

•  Makes strong assumptions, i.e., equally sized and spherically shaped niches.

Page 15

Topology-based Speciation

•  Topology-based methods:
–  Hill-Valley (HV)
–  Recursive Middling (RM)

•  Make weaker assumptions than distance-based speciation.

•  Sample new points in order to capture the landscape topography:
–  When more FEs are spent on speciation, fewer are available for the evolutionary algorithm to converge.
–  Not very attractive, especially when fitness evaluation is costly. (See the Hill-Valley sketch below.)
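A minimal sketch of the Hill-Valley test as commonly described in the literature: two points are judged to share a niche if no sampled interior point on the segment between them is worse than both endpoints. The evenly spaced sampling positions are an assumption for illustration.

```python
import numpy as np

def hill_valley(a, b, f, n_samples=3):
    """Return True if a and b appear to occupy the same niche (maximization).

    Evaluates n_samples interior points on the segment ab; each
    evaluation costs one fitness evaluation (FE) -- the overhead
    criticized on this slide.
    """
    threshold = min(f(a), f(b))
    for t in np.linspace(0, 1, n_samples + 2)[1:-1]:   # interior points only
        if f(a + t * (b - a)) < threshold:             # a valley lies between them
            return False
    return True
```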

Page 16

History-based Topological Speciation

•  Research question: could topology-based speciation be FE-free, so that its benefits can be better appreciated?

•  Approach: History-Based Topological Speciation (HTS)

•  Captures the landscape topography exclusively from the search history.

Page 17

History-based Topological Speciation

•  Topology-based speciation methods can be interpreted from the perspective of a sequence of points along the segment ab between two individuals:
–  The segment is an infinite sequence of points and cannot be tested directly.
–  RM "approximates" it by sampling a few points on ab.

•  Basic idea of HTS: approximate the sequence using only history data/points.

•  What is a "good" approximation? (Figure: examples of bad and good approximations of a segment by history points.)

Page 18

History-based Topological Speciation

•  Conceptually, HTS follows a two-step procedure:
1.  Construct a finite discrete approximate sequence.
2.  Test the approximate sequence to reach a final decision (trivial).

•  More formally, the problem of finding the best approximation can be stated as an optimization problem over candidate sequences of history points. (An illustrative sketch of step 1 follows.)
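A toy sketch of the flavor of step 1, under my own illustrative assumption that a candidate approximation consists of archived points lying close to the segment ab, ordered by their projection onto it; the actual criterion HTS optimizes is given in the paper cited on page 25.

```python
import numpy as np

def approximate_segment(a, b, archive, eps):
    """Pick archived points near segment ab, ordered along it.

    archive: (n, D) array of previously evaluated points.
    eps: how far from the segment a point may lie (illustrative knob).
    Returns indices into archive forming the approximate sequence.
    """
    d = b - a
    t = np.clip((archive - a) @ d / (d @ d), 0.0, 1.0)   # projection onto ab
    dist = np.linalg.norm(archive - (a + t[:, None] * d), axis=1)
    near = np.where(dist <= eps)[0]
    return near[np.argsort(t[near])]                     # order along the segment
```

Step 2 would then apply a hill-valley-style test to the fitnesses of the returned sequence, at zero additional FEs, since every point in the archive was already evaluated.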

Page 19

History-based Topological Speciation

Page 20

History-based Topological Speciation

Page 21

HTS: Experiments

Page 22

HTS: Experiments

•  Different methods are integrated into the same evolutionary framework for comparison:
–  Crowding Differential Evolution with Species Conservation

•  Benchmark functions:
–  F1-F6: six two-dimensional functions with various properties (number of optima: 2-10)
–  F7-F10: MMP functions in 4, 8, 16, and 32 dimensions, respectively (number of optima: 48)
–  F11: a composition of 50 random 32-dimensional shifted rotated ellipsoidal sub-functions coupled via the max operator (number of optima: 50)

•  The goal is to find all optima of the benchmark functions.

Page 23

HTS: Experiments

•  Performance measure: the distance error of the last generation is used to measure the performance of the algorithm.

•  Win/Draw/Lose of HTS versus every other method:
–  Both t-tests and Wilcoxon rank-sum tests are used.
–  A difference is considered statistically significant if it is asserted by both tests at the 0.05 significance level.
–  A draw is counted when no statistically significant difference is observed.

    DIS-    DIS    DIS+   HV1    HV3    HV5    RM     RM*
    9/0/2   2/5/4  4/5/2  9/0/2  9/0/2  9/0/2  9/0/2  1/3/7

Page 24

Summary of HTS

•  Q1: Why will an MS benefit from historical data?
A1: Because species can be formed without setting any parameter or consuming any additional FEs.

•  Q2: What information is to be extracted from the data?
A2: A few clusters of individuals (or a species tag for each individual).

•  Q3: How should the required information be extracted from the data?
A3: By finding an approximation to a line segment using only previously evaluated candidate solutions.

•  Representation of data: individuals and their fitness (e.g., an n-by-(D+1) matrix).

•  The generalization issue is not required/considered.

Page 25

For more details


•  L. Li and K. Tang, “History-Based Topological Speciation for Multimodal Optimization,” IEEE Transactions on Evolutionary Computation, in press (Early Access).

•  P. Yang, K. Tang and X. Lu, “Improving Estimation of Distribution Algorithm on Multi-modal Problems by Detecting Promising Areas,” IEEE Transactions on Cybernetics, accepted on 22 August 2014.

Page 26

Outline

•  A data-driven perspective on Meta-heuristic Search
•  Speciation in DDMS
•  Algorithm Selection in DDMS
•  Identification of Interacting Decision Variables in DDMS
•  Summary

Page 27

PAP - Background

•  A scenario frequently encountered in the real world:
–  A number of optimization problems
–  A time budget T
–  A number of optimization algorithms (e.g., GA, ES, EP, EDA, DE, PSO, …)

•  We want to obtain the best (or as good as possible) solutions for all the problems within T.

Page 28

PAP - Background

•  Intuitively, the total time budget T can be used for two purposes:
(1) to identify the best algorithm
(2) to search for the best solution

•  In general, the more time we spend on (2), the better the solutions we will achieve.

•  Different problems may favor different algorithms, and finding the best algorithm for a problem can be very time-consuming.

Page 29

PAP - Background

General thoughts:

•  Arbitrarily pick an algorithm for every problem?
–  T will solely be used to search for solutions, but this is too risky.

•  Carefully identify the best algorithm for each problem?
–  A lot of time will be used for algorithm selection; the time left for searching for good solutions might be insufficient.

•  Try to find a single algorithm suitable for all problems?
–  Sounds like a good trade-off, but the advantages of having a set of different algorithms are not fully utilized.

Page 30

PAP - Background

•  How about establishing a good "portfolio" of algorithms (i.e., a combination of multiple algorithms) for all problems?

Advantages:

•  Makes use of the advantages of different algorithms, rather than putting all the eggs (time) into a single basket (algorithm).

•  Hopefully not too time-consuming, since only one portfolio is needed for all problems.

Page 31

PAP - Background

•  Algorithm Portfolios "invest" limited time in multiple algorithms, to fully utilize the advantages of these algorithms and maximize the expected utility of a problem-solving episode.

•  Analogy to economics: one allocates money to different financial assets (stocks, bonds, etc.) in order to maximize the expected returns while minimizing risks.

•  Population-based Algorithm Portfolios (PAP):
–  Conceptually similar to Algorithm Portfolios
–  Aims to solve a set of problems rather than a single one
–  Focuses on population-based algorithms (e.g., EAs)

Page 32

PAP - Background

•  The general framework of PAP:
1.  Select the constituent algorithms from a pool of candidate algorithms.
2.  Construct a concrete PAP instantiation with the constituent algorithms.
3.  Apply the PAP instantiation to each problem.
4.  Output the best solution obtained for each problem.

Page 33

PAP - Background

Which candidate algorithms should serve as constituent algorithms depends on the way a PAP instantiation is built:

•  A PAP instantiation maintains multiple sub-populations.

•  Each sub-population is evolved with a constituent algorithm.

•  Information is shared among sub-populations by periodically activating a migration scheme.

Page 34

PAP - Background

Pseudo-code of a PAP instantiation:

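The pseudo-code figure did not survive this transcript. Below is a minimal Python sketch of a PAP instantiation as described on page 33 (multiple sub-populations, one constituent algorithm each, periodic migration); the interfaces, the migrate-best/replace-worst policy, and all names are assumptions for illustration.

```python
import random

def run_pap(problem, algorithms, max_gen, migration_interval=None):
    """Sketch of a PAP instantiation (minimization assumed).

    algorithms: constituent EAs; each is assumed to offer
    init_population() and step(pop, problem) -> pop (one generation).
    """
    if migration_interval is None:
        migration_interval = max(1, max_gen // 20)   # MAX_GEN/20, as on page 41
    subpops = [alg.init_population() for alg in algorithms]
    for gen in range(1, max_gen + 1):
        # Evolve each sub-population with its own constituent algorithm.
        subpops = [alg.step(pop, problem) for alg, pop in zip(algorithms, subpops)]
        if gen % migration_interval == 0:            # periodic migration, size 1
            for i, pop in enumerate(subpops):
                best = min(pop, key=problem.fitness)
                j = random.choice([t for t in range(len(subpops)) if t != i])
                worst = max(range(len(subpops[j])),
                            key=lambda t: problem.fitness(subpops[j][t]))
                subpops[j][worst] = best             # replace the target's worst
    return min((ind for pop in subpops for ind in pop), key=problem.fitness)
```

The migration settings mirror those reported on page 41 (interval MAX_GEN/20, size 1).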

Page 35

EPM-PAP

On Choosing Constituent Algorithms

•  Let F = {f_k | k = 1, 2, …, n} be a given problem set and A = {a_j | j = 1, 2, …, m} be a set of candidate EAs. Choosing constituent algorithms for PAP is formulated as seeking the subset Ã = {a_i | i = 1, 2, …, l} of A that leads to the best overall performance on F:

\tilde{A}_{opt} = \arg\max_{\tilde{A} \subseteq A} U(\tilde{A}, F, T)

•  The most straightforward approach: enumerate all possible subsets and employ a procedure like statistical racing to find the best one. This is even more time-consuming than selecting a single algorithm!

Page 36

EPM-PAP

•  Recall that we expect a good PAP instantiation to under-perform a candidate EA (say, a_j) only with small probability.

•  Assuming independence between constituent algorithms, the above statement can be written for algorithm a_j on problem f_k as:

R_{jk} = \prod_{i=1}^{l} (1 - P_{i,jk})

•  Averaging over all problems and all candidate EAs, we get:

R = \frac{1}{mn} \sum_{j=1}^{m} \sum_{k=1}^{n} \prod_{i=1}^{l} (1 - P_{i,jk})     (1)

Page 37

EPM-PAP

What is an Estimated Performance Matrix (EPM)?

•  A matrix that records the performance of each candidate EA.

•  For each aj, the corresponding EPM, denoted by EPMj, is an r-by-n matrix.

•  This matrix can be obtained by running aj on each of the n problems for r times.

•  Each element of EPMj is the objective value of the best solution that aj obtained on a problem in a single run.

•  Since each element of EPMj is obtained with a small portion of T, it can be viewed as a conservative estimate of the solution quality achieved by running aj with T on the same problem.
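A minimal sketch of constructing one EPM, assuming each candidate EA exposes a run(problem, budget) method that returns the best objective value of a single run; this interface is an assumption for illustration.

```python
import numpy as np

def build_epm(algorithm, problems, r, budget_per_run):
    """Build the r-by-n EPM for one candidate EA.

    Element [row, k] is the best objective value found by one
    independent run of the algorithm on problems[k].
    """
    n = len(problems)
    epm = np.empty((r, n))
    for k, problem in enumerate(problems):
        for row in range(r):
            epm[row, k] = algorithm.run(problem, budget_per_run)
    return epm
```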

Page 38

EPM-PAP

•  With the help of some statistical tests, EPMs provide all the information needed to calculate Eq. (1).

Good news:

•  No need to compare the performance of all possible subsets with a tedious procedure like statistical racing.

•  Estimating the performance of each single candidate EA is sufficient for selecting the constituent algorithm subset.

Page 39

EPM-PAP

Detailed steps for Choosing Constituent Algorithms

1.  Apply each candidate EA a_j to each problem for r independent runs. The final population obtained in each run is stored.

2.  Construct the EPM for each a_j based on the quality of the best solution obtained in each run.

3.  Enumerate all possible subsets of A and calculate the corresponding R for each using Eq. (1) and the EPMs.

4.  Select the subset with the smallest R as the constituent algorithms for PAP. (A sketch of steps 3-4 follows.)
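A sketch of steps 3 and 4, under my own simplifying assumption that P_{i,jk} is estimated as the fraction of run pairs in which constituent a_i's result is at least as good as candidate a_j's; the papers cited on page 46 use statistical tests for this estimate.

```python
from itertools import combinations
import numpy as np

def estimate_p(epm_i, epm_j, k):
    """P_{i,jk}: estimated probability that a_i does at least as well as
    a_j on problem k (minimization), from the r results of each."""
    runs_i, runs_j = epm_i[:, k], epm_j[:, k]
    return np.mean(runs_i[:, None] <= runs_j[None, :])

def select_subset(epms, l):
    """Return the index set of size l minimizing R of Eq. (1).

    epms: list of r-by-n EPMs, one per candidate EA.
    """
    m, n = len(epms), epms[0].shape[1]
    best_subset, best_r = None, float("inf")
    for subset in combinations(range(m), l):
        r_val = 0.0
        for j in range(m):                 # candidate EA a_j
            for k in range(n):             # problem f_k
                prod = 1.0
                for i in subset:           # constituent a_i
                    prod *= 1.0 - estimate_p(epms[i], epms[j], k)
                r_val += prod
        r_val /= m * n
        if r_val < best_r:
            best_subset, best_r = subset, r_val
    return best_subset
```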

Page 40

EPM-PAP: Experiments

•  4 candidate EAs: CMA-ES, G3PCX, SaNSDE, wPSO

•  Benchmark problems:
–  13 numerical problems from a classical benchmark suite
–  14 numerical problems from the CEC2005 benchmark suite
–  Dimension: 30

Page 41

EPM-PAP: Experiments

•  Total fitness evaluations (FEs) for each problem: 400000, 800000, and 1200000 (time budgets T1, T2, and T3), respectively.

•  25 independent runs on each problem.

•  For convenience of implementation, all constituent algorithms of a PAP instantiation evolve for the same number of generations.

•  Parameters of the constituent algorithms are not fine-tuned.

•  migration_interval = MAX_GEN/20, migration_size = 1

•  PAPs with 2 and 3 constituent algorithms are considered.

Page 42

EPM-PAP: Experiments

•  Wilcoxon test results (significance level 0.05); "w-d-l" stands for "win-draw-lose":

              Budget   SaNSDE   wPSO     G3PCX   CMA-ES   F-Race    Intra-AOTA
  EPM-PAP-2   T1       8-14-5   17-10-0  21-6-0  8-13-6   9-14-4    6-15-6
              T2       7-14-6   16-10-1  20-7-0  9-14-4   7-15-5    5-18-4
              T3       6-15-6   17-9-1   21-6-0  10-14-3  7-14-6    6-18-3
  EPM-PAP-3   T1       9-11-7   19-7-1   21-5-1  10-10-4  10-13-4   5-17-5
              T2       8-17-2   17-9-1   20-7-0  9-12-6   9-12-6    5-20-2
              T3       9-16-2   17-10-0  21-6-0  9-14-4   9-14-4    6-20-1

Page 43

EPM-PAP: Experiments

•  Performance ranking of all possible EPM-PAP-2 and EPM-PAP-3 instantiations:

With 2 constituent algorithms:

  Rank   Budget T1          Budget T2          Budget T3
  1      SaNSDE + CMA-ES    SaNSDE + CMA-ES    SaNSDE + CMA-ES
  2      wPSO + CMA-ES      wPSO + CMA-ES      wPSO + CMA-ES
  3      SaNSDE + wPSO      SaNSDE + wPSO      SaNSDE + wPSO
  4      SaNSDE + G3PCX     SaNSDE + G3PCX     SaNSDE + G3PCX
  5      G3PCX + CMA-ES     G3PCX + CMA-ES     G3PCX + CMA-ES
  6      wPSO + G3PCX       wPSO + G3PCX       wPSO + G3PCX

With 3 constituent algorithms:

  Rank   Budget T1              Budget T2              Budget T3
  1      SaNSDE+wPSO+CMA-ES     SaNSDE+wPSO+CMA-ES     SaNSDE+wPSO+CMA-ES
  2      SaNSDE+G3PCX+CMA-ES    SaNSDE+G3PCX+CMA-ES    SaNSDE+G3PCX+CMA-ES
  3      SaNSDE+wPSO+G3PCX      SaNSDE+wPSO+G3PCX      wPSO+G3PCX+CMA-ES
  4      wPSO+G3PCX+CMA-ES      wPSO+G3PCX+CMA-ES      SaNSDE+wPSO+G3PCX

Page 44

EPM-PAP: Experiments

•  Success rates of the EPM-based selection procedure: how likely did it select the best constituent algorithm subset?

              Budget   SR1    SR2
  EPM-PAP-2   T1       40%    88%
              T2       56%    100%
              T3       72%    100%
  EPM-PAP-3   T1       16%    84%
              T2       36%    88%
              T3       56%    100%

Page 45

Summary of EPM-PAP

•  Q1: Why will an MS benefit from historical data?
A1: Because a better subset of algorithms can be identified.

•  Q2: What information is to be extracted from the data?
A2: An m-dimensional binary vector (marking which candidate EAs are selected).

•  Q3: How should the required information be extracted from the data?
A3: Invest additional FEs to accumulate statistically meaningful estimates of the algorithms' performance.

•  Representation of data: the quality of the solutions found by each candidate algorithm on each problem (i.e., the EPMs).

•  It is implicitly assumed that performance achieved with a small number of FEs generalizes to cases with a larger number of FEs.

Page 46

For more details


•  F. Peng, K. Tang, G. Chen and X. Yao, “Population-based Algorithm Portfolios for Numerical Optimization,” IEEE Transactions on Evolutionary Computation, 14(5): 782-800, October 2010.

•  K. Tang, F. Peng, G. Chen and X. Yao, “Population-based Algorithm Portfolios with automated constituent algorithms selection,” Information Sciences, 279: 94-104, September 2014.

Page 47

Outline

•  A data-driven perspective on Meta-heuristic Search
•  Speciation in DDMS
•  Algorithm Selection in DDMS
•  Identification of Interacting Decision Variables in DDMS
•  Summary

Page 48

Background

•  Although EAs have achieved great success in the domain of optimization, most reported studies are based on small-scale problems (e.g., numerical optimization with fewer than 100 decision variables).

•  Most existing EAs suffer from the "curse of dimensionality".

•  On the other hand, large-scale problems have emerged in many areas.

Page 49

An example

•  Bird's Nest (China & Switzerland)

•  The irregular ordering of the beams posed an insoluble problem for the then-current CAD tools.

Page 50

Large Scale Optimization Problems

•  Research target of LSGO: to scale EAs up to problems at least one order of magnitude larger than the state-of-the-art (i.e., with about 1000 variables).

•  What makes large-scale problems difficult?
–  The solution space often increases exponentially with problem dimensionality.
–  Problem complexity may increase with dimensionality, e.g., the number of local optima.
–  Candidate search directions often increase exponentially, so EAs might fail to find the promising ones.

Page 51

EACC-G

•  Basic (and old) idea: divide-and-conquer.

•  Cooperative Coevolution (CC) is an ideal approach for implementing the idea:
–  Decompose the objective problem into several sub-problems;
–  Evolve each sub-problem separately using EAs;
–  Combine the solutions to all sub-problems to form the solution to the original problem.

•  By "decompose", we mean dividing the D decision variables into a few groups.

Page 52

EACC-G

•  More formally speaking…

•  The above approach is named EACC-G; it involves a predefined number of cycles.

•  Each cycle consists of the following steps:
–  Split the D decision variables into m groups, each containing s variables.
–  Optimize each sub-problem with an EA.
–  Solutions to each sub-problem are evaluated in combination with the best solutions obtained for the other sub-problems.

Page 53

EACC-G

•  The key question: how to decompose?

•  If a problem contains a nonseparable component, we say the decision variables in this component are interacting variables.

•  Intuitively, interacting variables should be grouped together by the decomposition procedure.

•  The simplest way to decompose is to group the decision variables randomly.
–  Sounds too straightforward to work properly.
–  But not as "silly" as it seems.

Page 54

EACC-G

Benefit of Random Grouping

•  The probability of EACC-G assigning two interacting variables x_i and x_j into the same group for at least k cycles is:

P_k = \sum_{r=k}^{N} \binom{N}{r} \left(\frac{1}{m}\right)^{r} \left(1 - \frac{1}{m}\right)^{N-r}

where N is the number of cycles and m is the number of groups.

•  For example, given a 1000-D problem, when m = 10: P_1 = 0.9948, P_2 = 0.9662.

•  Even the simple random grouping strategy has some chance to group two interacting variables together. (The quoted probabilities are checked numerically below.)
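The quoted numbers can be reproduced with a few lines of Python; N = 50 cycles is the setting that yields exactly P_1 = 0.9948 and P_2 = 0.9662 for m = 10 (an inference from the numbers themselves, since the slide does not state N).

```python
from math import comb

def p_at_least_k(k, n_cycles, m):
    """P_k: probability that two interacting variables share a group
    in at least k of n_cycles random-grouping cycles (m groups)."""
    q = 1.0 / m                      # chance of sharing a group in one cycle
    return sum(comb(n_cycles, r) * q**r * (1 - q)**(n_cycles - r)
               for r in range(k, n_cycles + 1))

print(round(p_at_least_k(1, 50, 10), 4))  # 0.9948
print(round(p_at_least_k(2, 50, 10), 4))  # 0.9662
```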

Page 55

EACC-G

•  With the random grouping scheme, each cycle of EACC-G becomes (see the sketch below):
–  Randomly split the D decision variables into m groups, each containing s variables.
–  Optimize each sub-problem with an EA.
–  Solutions to each sub-problem are evaluated in combination with the best solutions obtained for the other sub-problems.

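A minimal sketch of one such cycle, assuming a hypothetical sub_optimize(sub_f, dim) helper that runs an EA on the selected coordinates while the remaining variables stay fixed at the current best solution; the helper and all names are illustrative only.

```python
import numpy as np

def eaccg_cycle(f, best, m, rng):
    """One EACC-G cycle with random grouping (minimization sketch).

    f: objective over full D-dimensional vectors; best: current best
    solution (length-D array); m: number of groups;
    rng: e.g. numpy.random.default_rng().
    """
    D = best.size
    perm = rng.permutation(D)
    groups = np.array_split(perm, m)        # m groups of about s = D/m variables
    for idx in groups:
        # Optimize only the variables in idx; evaluation plugs the
        # candidate values into a copy of the current best solution.
        def sub_f(values, idx=idx):
            x = best.copy()
            x[idx] = values
            return f(x)
        best[idx] = sub_optimize(sub_f, dim=len(idx))   # hypothetical EA helper
    return best
```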

Page 56

Experimental Studies

•  Test suite: 13 minimization problems (1000-dimensional).

•  Baseline: applying Differential Evolution (DE) to each problem directly.

•  DECC-G: the CC framework using DE as the basic optimizer.

•  The number of FEs was set to 5e+06 for all algorithms.

•  Results of 25 independent runs were collected for each problem.

Page 57

Experimental Studies

Results (unimodal): comparison between DECC-G and SaNSDE on functions f1-f7 (unimodal), with dimension D = 1000, averaged over 25 runs.

        Dim    SaNSDE     DECC-G
  f1    1000   6.97E+00   2.17E-25
  f2    1000   1.24E+00   5.37E-14
  f3    1000   6.43E+01   3.71E-23
  f4    1000   4.99E+01   1.01E-01
  f5    1000   3.31E+03   9.87E+02
  f6    1000   3.93E+03   0.00E+00
  f7    1000   1.18E+01   8.40E-03

Page 58

Experimental Studies

Results (multimodal): comparison between DECC-G and SaNSDE on functions f8-f13 (multimodal), with dimension D = 1000, averaged over 25 runs.

        Dim    SaNSDE     DECC-G
  f8    1000   -372991    -418983
  f9    1000   8.69E+02   3.55E-16
  f10   1000   1.12E+01   2.22E-13
  f11   1000   4.80E-01   1.01E-15
  f12   1000   8.97E+00   6.89E-25
  f13   1000   7.41E+02   2.55E-21

Page 59

Drawbacks of Random Decomposition

•  The group size needs to be predefined, which is rather difficult.

•  All groups are assumed to be of the same size, which is probably unreasonable.

•  The nature of random grouping limits the chance of categorizing all interacting variables into the same group.

Page 60

Variable Interaction Learning

•  A bottom-up grouping approach:
1.  Start by treating each decision variable as a group.
2.  Learn the interactions between variables.
3.  Merge interacting variables/groups into the same group.
4.  Go to step 2 until a stopping criterion is met.

•  Benefits:
–  No need to specify the number of groups.
–  Groups can be of different sizes.
–  Once the learning phase finishes, there is no need to re-group the decision variables.

Page 61

Variable Interaction Learning

•  How to learn the interactions?
–  If two solution vectors x and x' differ only in the i-th dimension, and the i-th and j-th decision variables are NOT interacting,
–  then changing the value of the j-th decision variable (in both vectors) will NOT change the relative order of f(x) and f(x').

•  Hence, we may say that the i-th and j-th variables are interacting if the following condition holds: for some x and x' differing only in the i-th dimension, setting the j-th variable of both to a common new value (giving x̄ and x̄') flips the relative order, i.e.,

f(x) < f(x')  but  f(x̄) > f(x̄')

•  Every interaction learned by this mechanism is correct.

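A sketch of this pairwise test written as a standalone predicate; the perturbation values delta_i and delta_j and the four-evaluation probing scheme are illustrative assumptions (CCVIL embeds the check in its learning cycles).

```python
def interacts(f, x, i, j, delta_i, delta_j):
    """Test whether variables i and j interact at point x (costs 4 FEs).

    Compares f at x and at x with x_i perturbed; if changing x_j (in
    both vectors) flips the relative order, an interaction is detected.
    A True result is always correct; False may just mean "not detected
    at this point".
    """
    x_a = x.copy(); x_b = x.copy()
    x_b[i] += delta_i                      # the two vectors differ only in dimension i
    before = f(x_a) < f(x_b)
    x_a[j] += delta_j; x_b[j] += delta_j   # same change to dimension j in both
    after = f(x_a) < f(x_b)
    return before != after                 # order flipped => interaction
```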

Page 62

Variable Interaction Learning

Page 63

Variable Interaction Learning

CCVIL: A Two-stage Algorithm (Cooperative Coevolution with Variable Interaction Learning)

1.  Initialization: randomly initialize a population of solutions, and randomly choose an individual from the population.

2.  Learning stage: repeat a number of learning cycles; each learning cycle consists of the following steps:
(1) Randomly permute the sequence of decision variables.
(2) Scan over the permuted sequence and check the interaction between each pair of successive variables. If evidence of interaction is discovered, mark the two variables as "belonging to the same group". (A sketch of this cycle follows.)

3.  Optimization stage:
(1) Categorize the decision variables according to the information obtained in the learning stage.
(2) Solve the problem using the CC framework.
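A sketch of the learning-cycle bookkeeping, reusing the interacts() predicate from page 61 and a simple union-find to maintain groups; the union-find is my own illustrative choice, and the paper's bookkeeping differs in detail.

```python
def learning_cycle(f, x, parent, rng, delta=1.0):
    """One CCVIL-style learning cycle.

    parent: union-find array, initialized as list(range(D));
    rng: e.g. numpy.random.default_rng().
    """
    def find(v):                           # union-find with path halving
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    perm = rng.permutation(len(x))
    for a, b in zip(perm[:-1], perm[1:]):  # successive pairs in the permutation
        if interacts(f, x, a, b, delta, delta):
            parent[find(a)] = find(b)      # merge the two groups
```

After the learning stage, variables sharing a union-find root form one group for the optimization stage.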

Page 64

Variable Interaction Learning

No Free Lunch: The Learning Overhead

•  The learning stage costs FEs, so a trade-off between learning and evolution (optimization) needs to be set.

•  An appropriate setting of the learning cycles can deal with both separable and non-separable functions.

•  Termination conditions for the learning stage:
–  If no interaction has been learned after Kˇ cycles, the function is treated as separable and the learning stage terminates.
–  If any interaction has been learned before reaching Kˇ cycles, the function is treated as non-separable. In this case, the learning stage only stops when:
•  all N dimensions have been combined into one group, or
•  60% of the FEs have been consumed by the learning stage.

Page 65

Experimental Studies

Experimental Results

Page 66

Summary of VIL

•  Q1: Why will an MS benefit from historical data?
A1: Because interacting variables will be more likely to be grouped together.

•  Q2: What information is to be extracted from the data?
A2: A binary "interaction" matrix (D-by-D).

•  Q3: How should the required information be extracted from the data?
A3: Invest additional FEs to perform tests between variables.

•  Representation of data: individuals and their fitness (e.g., an n-by-(D+1) matrix).

•  The generalization issue is not required/considered.

Page 67

For more details


•  Z. Yang, K. Tang and X. Yao, “Large Scale Evolutionary Optimization Using Cooperative Coevolution,” Information Sciences, 178(15): 2985-2999, 2008.

•  W. Chen, T. Weise, Z. Yang and K. Tang, “Large-Scale Global Optimization using Cooperative Coevolution with Variable Interaction Learning,” in Proceedings of the 11th International Conference on Parallel Problem Solving From Nature (PPSN), Kraków, Poland, September 11–15, 2010, pp. 300–309.

Page 68

Outline

•  A data-driven perspective on Meta-heuristic Search
•  Speciation in DDMS
•  Algorithm Selection in DDMS
•  Identification of Interacting Decision Variables in DDMS
•  Summary

Page 69

Summary

•  Data-driven MS makes use of data analytics approaches to gain useful information from the data generated during search.

•  Three examples of DDMS have been introduced.

•  Different contexts in MS may induce significantly different data analytics problems, where a lot of work remains to be done.

Page 70

Collaborators

•  Collaborators at UBRI (ubri.ustc.edu.cn) –  Mr. Lingxi Li (HTS)

–  Dr. Fei Peng (EPM-PAP)

–  Prof. Xin Yao (EPM-PAP)

–  Prof. Guoliang Chen (EPM-PAP)

–  Mr. Wenxiang Chen (CCVIL)

–  Dr. Thomas Weise (CCVIL)

–  Dr. Zhenyu Yang (CCVIL)

Page 71

Thanks for your time! Q&A?
