

A Surrogate Model based on Walsh Decomposition for Pseudo-Boolean Functions

Sébastien Verel¹, Bilel Derbel²,³, Arnaud Liefooghe²,³, Hernán Aguirre⁴, and Kiyoshi Tanaka⁴

Black-box Pseudo-Boolean Functions

x ⟶ (black box) ⟶ f(x)

No information on the definition of f

Search space: x ∈ {0,1}^n

Objective function: given by a computation or an (expensive) simulation

Surrogate (meta-model) f̂: an approximation of f that is fast to compute

Surrogate models

For numerical problems:

• Huge number of works

• Kriging (Gaussian process): a collection of random variables N(m, k), with mean m(x) = E[f(x)] and covariance k(x, x′) = exp(−θ · dist(x, x′)^p)

• Efficient Global Optimization (EGO): maximize the expected improvement (EI), ...

For combinatorial problems, using a distance over the combinatorial space:

• Radial Basis Function Networks (RBFN) [Moraglio:2011]

• Kriging, EGO [Zaefferer:2014]

Walsh functions in genetic algorithms

Definition [Bethke:1980]: for any k ∈ [0, 2^n − 1], the Walsh function ϕ_k : {0,1}^n → {−1, 1} is defined by

∀x ∈ {0,1}^n,  ϕ_k(x) = (−1)^{∑_{j=0}^{n−1} k_j x_j}

(ϕ_0, ..., ϕ_{2^n−1}) is an orthogonal basis; for n = 3:

x     ϕ_0  ϕ_1  ϕ_2  ϕ_3  ϕ_4  ϕ_5  ϕ_6  ϕ_7
000     1    1    1    1    1    1    1    1
001     1   −1    1   −1    1   −1    1   −1
010     1    1   −1   −1    1    1   −1   −1
011     1   −1   −1    1    1   −1   −1    1
100     1    1    1    1   −1   −1   −1   −1
101     1   −1    1   −1   −1    1   −1    1
110     1    1   −1   −1   −1   −1    1    1
111     1   −1   −1    1   −1    1    1   −1
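As a sanity check, a minimal Python sketch that evaluates ϕ_k and reproduces the n = 3 table above (helper names and the MSB-first bit ordering are our own conventions, not from the poster):

```python
import numpy as np

def walsh(k_bits: np.ndarray, x: np.ndarray) -> int:
    # phi_k(x) = (-1)^(sum_j k_j * x_j), with k_bits and x in {0,1}^n
    return 1 - 2 * (int(k_bits @ x) % 2)

def bits(i: int, n: int) -> np.ndarray:
    # binary expansion of i as an n-bit 0/1 vector, most significant bit first
    return np.array([(i >> (n - 1 - j)) & 1 for j in range(n)])

n = 3
for i in range(2 ** n):  # rows x = 000, 001, ..., 111 of the table above
    row = [walsh(bits(k, n), bits(i, n)) for k in range(2 ** n)]
    print("".join(map(str, bits(i, n))), row)
```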

Decomposition of pseudo-Boolean functions

∀x ∈ {0,1}^n,  f(x) = ∑_{k=0}^{2^n−1} w_k · ϕ_k(x)

∀k ∈ [0, 2^n − 1],  w_k = (1/2^n) ∑_{x ∈ {0,1}^n} f(x) · ϕ_k(x)
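For small n, the full spectrum can be computed exactly from this sum; a minimal sketch under the same conventions as above (walsh_spectrum is our own illustrative helper):

```python
import numpy as np
from itertools import product

def walsh_spectrum(f, n):
    # Exact Walsh coefficients w_k = 2^{-n} * sum_x f(x) * phi_k(x).
    # Exhaustive over {0,1}^n, hence only tractable for small n.
    xs = [np.array(x) for x in product((0, 1), repeat=n)]
    w = np.empty(2 ** n)
    for k in range(2 ** n):
        kb = np.array([(k >> (n - 1 - j)) & 1 for j in range(n)])
        w[k] = sum(f(x) * (1 - 2 * (int(kb @ x) % 2)) for x in xs) / 2 ** n
    return w

# Example: the parity f(x) = x_0 XOR x_1 on n = 2 yields w_0 = 1/2,
# w_3 = -1/2, and all other coefficients zero.
print(walsh_spectrum(lambda x: x[0] ^ x[1], 2))
```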

Applications

• Design of deceptive functions: the average fitness over a schema of order p is given by the coefficients of order lower than p.

• Grey-box optimization: use the linear decomposition for smart computation (see F. Chicano, D. Whitley, etc.)

Walsh surrogate model

Approximation using a truncated Walsh decomposition:

f̂_ŵ(x) = ∑_k ŵ_k · ϕ_k(x),  with k ∈ {j : ord(ϕ_j) ≤ d}

where ord(ϕ_j), the order of ϕ_j, is the number of non-zero bits of j. This is a linear model with predictors (ϕ_0(x_i), ..., ϕ_k(x_i)) and responses y_i = f(x_i).

Estimators based on the mean squared error:

ŵ = argmin_w ∑_{x_i ∈ X} (f̂_w(x_i) − f(x_i))²

Estimator types:

• Non-sparse estimator: Conjugate Gradient (CG)

• Sparse estimator: Least-Angle Regression (LARS)

Regularization methods (lasso, ridge, etc.) and forward stepwise selection regression reduce the number of non-zero coefficients.
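A minimal sketch of one possible realization of the sparse fit, using scikit-learn's Lars (the feature ordering, n_nonzero_coefs value, and helper names are our assumptions, not the authors' exact setup):

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import Lars

def walsh_design(X, d):
    # Design matrix of all phi_k of order <= d, for X of shape (m, n) in {0,1}.
    # One column per subset S of variables with |S| <= d:
    # phi_S(x) = (-1)^(sum of the bits of x in S); the empty set gives phi_0 = 1.
    m, n = X.shape
    cols = [np.ones(m)]
    for order in range(1, d + 1):
        for S in combinations(range(n), order):
            cols.append(1.0 - 2.0 * (X[:, list(S)].sum(axis=1) % 2))
    return np.column_stack(cols)

# Hypothetical usage with a training sample (X_train, y_train = f(X_train)):
# Phi = walsh_design(X_train, d=3)
# model = Lars(n_nonzero_coefs=200).fit(Phi, y_train)  # sparse estimate of w
# y_hat = model.predict(walsh_design(X_test, d=3))     # the intercept plays w_0
```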

Experimental benchmark: nk-landscapes

nk-landscapes [Kauffman:1993]:

f(x) = (1/n) ∑_{i=1}^{n} f_i(x_i, x_{i_1}, ..., x_{i_k})

k = 0: linear problem (easy to optimize); k = 1: quadratic problem (∼ UBQP); k = 2: cubic problem (∼ max-3-SAT)

n ∈ {10, 15, 20, 25}, k ∈ {0, 1, 2}, 5 instances.

Solutions are generated uniformly at random for the training and test sets.

Maximum number of non-zero coefficients up to order d:
n_0 = 1;  n_d = n_{d−1} + (n choose d)

ord.   n=10   n=15   n=20   n=25
 0        1      1      1      1
 1       11     16     21     26
 2       56    121    211    326
 3      176    576   1351   2626

but the number of non-zero coefficients is much smaller for nk-landscapes: ≤ n(2^{k+1} − 1) + 1
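For concreteness, a minimal generator sketch for this benchmark family (random neighbourhoods and uniform [0,1) component tables are one common variant, not necessarily the exact instance generator used here):

```python
import numpy as np

def random_nk_landscape(n: int, k: int, seed: int = 0):
    # f(x) = (1/n) * sum_i f_i(x_i, x_{i_1}, ..., x_{i_k}); each component f_i
    # depends on x_i plus k distinct random neighbours, with one value drawn
    # uniformly in [0, 1) per pattern of the k+1 involved bits.
    rng = np.random.default_rng(seed)
    nbrs = [rng.choice([j for j in range(n) if j != i], size=k, replace=False)
            for i in range(n)]
    tables = rng.random((n, 2 ** (k + 1)))

    def f(x):
        total = 0.0
        for i in range(n):
            idx = int(x[i])
            for j in nbrs[i]:
                idx = (idx << 1) | int(x[j])
            total += tables[i, idx]
        return total / n

    return f

# f = random_nk_landscape(n=10, k=2)
# f(np.zeros(10, dtype=int))  # one (here cheap) black-box evaluation
```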

Non-sparse CG vs. sparse LARS

Regression error (test on 10³ solutions) according to training sample size:

[Figure: mean absolute error of fitness vs. sample size (0 to 400), for the CG and LARS estimators.]

Walsh (with LARS) vs. Kriging

Mean absolute error according to training sample size

[Figure: mean absolute error vs. # function evaluations (sample size), one panel per (n, k) with n ∈ {10, 15, 20, 25} and k ∈ {0, 1, 2}; surrogate models: kriging, walsh.]

Regression error according to the order of the Walsh functions

R² of the coefficient estimates according to training sample size:

[Figure: R² of the Walsh coefficient estimates vs. # function evaluations (sample size), one panel per (n, k); one curve per Walsh order 1, 2, 3.]

Conclusions and discussion

Surrogate based on the Walsh decomposition:

• A relevant orthogonal basis for learning pseudo-Boolean functions

• Efficient when combined with machine learning techniques

The model is not limited to a surrogate:

• From black-box to grey-box: learn a model! Applications to cellular automata problems, etc.

• A way to detect interactions between variables

Perspectives

• Replace LARS by other heuristics: from low to larger orders, other subsets, a priori knowledge of the model, etc. ⇒ a model selection problem

• Combine the Walsh surrogate with grey-box techniques

• Bayesian estimation of the coefficients, to compute the estimation error

• Extend to other combinatorial optimization problems

• Apply to expensive optimization problems

¹ Université du Littoral Côte d'Opale, LISIC, France
² Univ. Lille, CNRS, Centrale Lille, UMR 9189 – CRIStAL, F-59000 Lille, France
³ Inria Lille – Nord Europe, F-59650 Villeneuve d'Ascq, France
⁴ Shinshu University, Faculty of Engineering, Nagano, Japan