
DEHB: Evolutionary Hyperband for Scalable, Robust and Efficient Hyperparameter Optimization∗

Noor Awad1, Neeratyoy Mallik1, Frank Hutter1,2
1Department of Computer Science, University of Freiburg, Germany

2Bosch Center for Artificial Intelligence, Renningen, Germany
{awad, mallik, fh}@cs.uni-freiburg.de

Abstract

Modern machine learning algorithms crucially rely on several design decisions to achieve strong performance, making the problem of Hyperparameter Optimization (HPO) more important than ever. Here, we combine the advantages of the popular bandit-based HPO method Hyperband (HB) and the evolutionary search approach of Differential Evolution (DE) to yield a new HPO method which we call DEHB. Comprehensive results on a very broad range of HPO problems, as well as a wide range of tabular benchmarks from neural architecture search, demonstrate that DEHB achieves strong performance far more robustly than all previous HPO methods we are aware of, especially for high-dimensional problems with discrete input dimensions. For example, DEHB is up to 1000× faster than random search. It is also efficient in computational time, conceptually simple and easy to implement, positioning it well to become a new default HPO method.

1 Introduction

Many algorithms in artificial intelligence rely crucially on good settings of their hyperparameters to achieve strong performance. This is particularly true for deep learning [Henderson et al., 2018; Melis et al., 2018], where dozens of hyperparameters concerning both the neural architecture and the optimization & regularization pipeline need to be instantiated. At the same time, modern neural networks continue to get larger and more computationally expensive, making the need for efficient hyperparameter optimization (HPO) ever more important.

We believe that a practical, general HPO method must fulfill many desiderata, including: (1) strong anytime performance, (2) strong final performance with a large budget, (3) effective use of parallel resources, (4) scalability w.r.t. the dimensionality and (5) robustness & flexibility. These desiderata drove the development of BOHB [Falkner et al., 2018], which satisfied them by combining the best features of Bayesian optimization via Tree Parzen estimates

∗Proceedings of IJCAI-21

(TPE) [Bergstra et al., 2011] (in particular, strong final performance), and the many advantages of bandit-based HPO via Hyperband [Li et al., 2017]. While BOHB is among the best general-purpose HPO methods we are aware of, it still has problems with optimizing discrete dimensions and does not scale as well to high dimensions as one would wish. Therefore, it does not work well on high-dimensional HPO problems with discrete dimensions and also has problems with tabular neural architecture search (NAS) benchmarks (which can be tackled as high-dimensional discrete-valued HPO benchmarks, an approach followed, e.g., by regularized evolution (RE) [Real et al., 2019]).

The main contribution of this paper is to further improve upon BOHB to devise an effective general HPO method, which we dub DEHB. DEHB is based on a combination of the evolutionary optimization method of differential evolution (DE [Storn and Price, 1997]) and Hyperband and has several useful properties:

1. DEHB fulfills all the desiderata of a good HPO optimizer stated above, and in particular achieves strong final performance more robustly than BOHB, especially for high-dimensional and discrete-valued problems.

2. DEHB is conceptually simple and can thus be easily re-implemented in different frameworks.

3. DEHB is computationally cheap, not incurring the overhead typical of most BO methods.

4. DEHB effectively takes advantage of parallel resources.

After discussing related work (Section 2) and background on DE and Hyperband (Section 3), Section 4 describes our new DEHB method in detail. Section 5 then presents comprehensive experiments on artificial toy functions, surrogate benchmarks, Bayesian neural networks, reinforcement learning, and 13 different tabular neural architecture search benchmarks, demonstrating that DEHB is more effective and robust than a wide range of other HPO methods, and in particular up to 1000× faster than random search (Figure 7) and up to 32× faster than BOHB (Figure 11) on HPO problems; on toy functions, these speedup factors even reached 33,440× and 149×, respectively (Figure 6).

arXiv:2105.09821v2 [cs.LG] 21 Oct 2021


2 Related Work

HPO as a black-box optimization problem can be broadly tackled using two families of methods: model-free methods, such as evolutionary algorithms, and model-based Bayesian optimization methods. Evolutionary Algorithms (EAs) are model-free population-based methods which generally include a method of initializing a population; mutation, crossover, and selection operations; and a notion of fitness. EAs have been used for black-box optimization in HPO settings since the 1980s [Grefenstette, 1986]. They have also been popular for designing architectures of deep neural networks [Angeline et al., 1994; Xie and Yuille, 2017; Real et al., 2017; Liu et al., 2017]; recently, Regularized Evolution (RE) [Real et al., 2019] achieved state-of-the-art results on ImageNet.

Bayesian optimization (BO) uses a probabilistic model based on the already observed data points to model the objective function and to trade off exploration and exploitation. The most commonly used probabilistic models in BO are Gaussian processes (GPs), since they provide well-calibrated and smooth uncertainty estimates [Snoek et al., 2012]. However, GP-based models have high complexity, do not natively scale well to high dimensions, and do not apply to complex spaces without prior knowledge; alternatives include tree-based methods [Bergstra et al., 2011; Hutter et al., 2011] and Bayesian neural networks [Springenberg et al., 2016].

Recent so-called multi-fidelity methods exploit cheap approximations of the objective function to speed up the optimization [Liu et al., 2016; Wang et al., 2017]. Multi-fidelity optimization is also popular in BO, with Fabolas [Klein et al., 2016] and Dragonfly [Kandasamy et al., 2020] being GP-based examples. The popular method BOHB [Falkner et al., 2018], which combines BO and the bandit-based approach Hyperband [Li et al., 2017], has been shown to be a strong off-the-shelf HPO method and, to the best of our knowledge, is the best previous off-the-shelf multi-fidelity optimizer.

3 Background

3.1 Differential Evolution (DE)

In each generation g, DE uses an evolutionary search based on difference vectors to generate new candidate solutions. DE is a population-based EA which uses three basic iterative steps (mutation, crossover and selection). At the beginning of the search on a D-dimensional problem, we initialize a population of N individuals x_{i,g} = (x_{i,g}^1, x_{i,g}^2, ..., x_{i,g}^D) randomly within the search range of the problem being solved. Each individual x_{i,g} is evaluated by computing its corresponding objective function value. Then the mutation operation generates a new offspring for each individual. The canonical DE uses a mutation strategy called rand/1, which selects three random parents x_{r1}, x_{r2}, x_{r3} to generate a new mutant vector v_{i,g} for each x_{i,g} in the population, as shown in Eq. 1, where F is a scaling factor parameter that takes a value within the range (0, 1].

v_{i,g} = x_{r1,g} + F · (x_{r2,g} − x_{r3,g}).    (1)

The crossover operation then combines each individual x_{i,g} and its corresponding mutant vector v_{i,g} to generate the final offspring/child u_{i,g}. The canonical DE uses a simple binomial crossover to select values from v_{i,g} with a probability p (called the crossover rate) and from x_{i,g} otherwise. For the members x_{i,g+1} of the next generation, DE then uses the better of x_{i,g} and u_{i,g}. More details on DE can be found in Appendix A.
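The three iterative DE steps above can be sketched in a few lines. The following is a minimal illustration of our own (not DEHB's implementation) of rand/1 mutation, binomial crossover, and greedy selection for minimization over [0, 1]^D; all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def de_minimize(f, dim, pop_size=20, F=0.5, p=0.5, generations=100):
    pop = rng.random((pop_size, dim))            # init uniformly in [0, 1]^D
    fitness = np.array([f(x) for x in pop])
    for _ in range(generations):
        for i in range(pop_size):
            # rand/1 mutation: three distinct random parents, all != i
            candidates = [j for j in range(pop_size) if j != i]
            r1, r2, r3 = rng.choice(candidates, size=3, replace=False)
            v = pop[r1] + F * (pop[r2] - pop[r3])          # Eq. (1)
            # binomial crossover: take each coordinate from v with prob. p,
            # forcing at least one coordinate to come from v
            mask = rng.random(dim) < p
            mask[rng.integers(dim)] = True
            u = np.clip(np.where(mask, v, pop[i]), 0.0, 1.0)
            fu = f(u)
            if fu <= fitness[i]:                           # greedy selection
                pop[i], fitness[i] = u, fu
    return pop[np.argmin(fitness)], fitness.min()

# toy usage: minimize the sphere function in 5 dimensions
best_x, best_f = de_minimize(lambda x: np.sum(x ** 2), dim=5)
```

On this toy sphere function, the population quickly concentrates near the boundary optimum at the origin.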

3.2 Successive Halving (SH) and Hyperband (HB)

Successive Halving (SH) [Jamieson and Talwalkar, 2016] is a simple yet effective multi-fidelity optimization method that exploits the fact that, for many problems, low-cost approximations of the expensive blackbox function exist, which can be used to rule out poor parts of the search space at little computational cost. Higher-cost approximations are only used for a small fraction of the configurations to be evaluated. Specifically, an iteration of SH starts by sampling N configurations uniformly at random, evaluating them at the lowest-cost approximation (the so-called lowest budget), and forwarding the top 1/η fraction of them to the next budget (function evaluations at which are expected to be roughly η times more expensive). This process is repeated until the highest budget, used by the expensive original blackbox function, is reached. Once the runs on the highest budget are complete, the current SH iteration ends, and the next iteration starts with the lowest budget. We call each such fixed sequence of evaluations from lowest to highest budget an SH bracket. While SH is often very effective, it is not guaranteed to converge to the optimal configuration even with infinite resources, because it can drop poorly-performing configurations at low budgets that might actually be the best at the highest budget.
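One SH bracket as described above can be sketched as follows; this is our own illustration, with `evaluate` a hypothetical callable mapping a configuration and a budget to a loss:

```python
import numpy as np

rng = np.random.default_rng(1)

def sh_bracket(evaluate, dim, n_configs=27, min_budget=1, max_budget=27, eta=3):
    configs = rng.random((n_configs, dim))       # uniform random sampling
    budget = min_budget
    while True:
        losses = np.array([evaluate(c, budget) for c in configs])
        if budget >= max_budget:
            break
        # promote the top 1/eta to the next, roughly eta-times-costlier budget
        top = np.argsort(losses)[: max(1, len(configs) // eta)]
        configs = configs[top]
        budget *= eta
    return configs[np.argmin(losses)]

# toy objective whose noise shrinks as the budget grows
best = sh_bracket(lambda c, b: np.sum(c ** 2) + rng.normal(0, 1.0 / b), dim=3)
```

With n_configs = 27 and η = 3, this evaluates 27 configurations at budget 1, then 9 at budget 3, 3 at budget 9, and 1 at budget 27.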

Hyperband (HB) [Li et al., 2017] solves this problem by hedging its bets across different instantiations of SH with successively larger lowest budgets, thereby being provably at most a constant factor slower than random search. In particular, this procedure also allows finding configurations that are strong for higher budgets but would have been eliminated at lower budgets. Algorithm 2 in Appendix B shows the pseudocode for HB with the SH subroutine. One iteration of HB (also called an HB bracket) can be viewed as a sequence of SH brackets with different starting budgets and different numbers of configurations for each SH bracket. The precise budgets and number of configurations per budget are determined by HB given its 3 parameters: minimum budget, maximum budget, and η.
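The way HB derives its SH brackets from these three parameters can be sketched as follows (our own illustration of the standard schedule from Li et al. [2017]; with minimum budget 1, maximum budget 27 and η = 3 it yields the familiar 27/9/3/1 first bracket):

```python
import math

def hb_schedule(min_budget=1, max_budget=27, eta=3):
    # s_max + 1 brackets; bracket s starts at budget max_budget / eta**s
    s_max = round(math.log(max_budget / min_budget, eta))
    brackets = []
    for s in range(s_max, -1, -1):
        n = math.ceil((s_max + 1) / (s + 1) * eta ** s)  # initial configs
        b = max_budget / eta ** s                         # starting budget
        rungs = [(math.floor(n / eta ** i), b * eta ** i) for i in range(s + 1)]
        brackets.append(rungs)
    return brackets

# each bracket is a list of (number of configs, budget) rungs
schedule = hb_schedule()
```

For (1, 27, 3) this gives four SH brackets: [(27, 1), (9, 3), (3, 9), (1, 27)], [(12, 3), (4, 9), (1, 27)], [(6, 9), (2, 27)], and [(4, 27)].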

The main advantages of HB are its simplicity, theoretical guarantees, and strong anytime performance compared to optimization methods operating on the full budget. However, HB can perform worse than BO and DE for longer runs, since it only selects configurations based on random sampling and does not learn from previously sampled configurations.

4 DEHB

We design DEHB to satisfy all the desiderata described in the introduction (Section 1). DEHB inherits several advantages from HB to satisfy some of these desiderata, including its strong anytime performance, scalability and flexibility. From the DE component, it inherits robustness, simplicity, and computational efficiency. We explain DEHB in detail in the remainder of this section; full pseudocode can be found in Algorithm 3 in Appendix C.


Figure 1: Internals of a DEHB iteration showing information flow across fidelities (top-down), and how each subpopulation is updated in each DEHB iteration (left-right).

4.1 High-Level Overview

A key design principle of DEHB is to share information across the runs it executes at various budgets. DEHB maintains a subpopulation for each of the budget levels, where the population size of each subpopulation is set to the maximum number of function evaluations HB allocates for the corresponding budget.
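Under this rule, the subpopulation sizes follow directly from the HB schedule. The sketch below is our own illustration for the Figure 1 setting (budgets 1 to 27, η = 3): for each budget, it takes the largest number of configurations any SH bracket of one HB iteration evaluates at that budget.

```python
import math

def subpop_sizes(min_budget=1, max_budget=27, eta=3):
    s_max = round(math.log(max_budget / min_budget, eta))
    sizes = {}
    for s in range(s_max, -1, -1):                 # one SH bracket per s
        n = math.ceil((s_max + 1) / (s + 1) * eta ** s)
        b = max_budget // eta ** s                 # starting budget of bracket s
        for i in range(s + 1):
            count = math.floor(n / eta ** i)       # configs at this rung
            sizes[b] = max(sizes.get(b, 0), count) # keep the per-budget maximum
            b *= eta
    return sizes

sizes = subpop_sizes()
```

For (1, 27, 3) this yields subpopulation sizes of 27, 12, 6, and 4 for budgets 1, 3, 9, and 27, respectively.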

We borrow nomenclature from HB and call the HB iterations that DEHB uses DEHB iterations. Figure 1 illustrates one such iteration, where minimum budget, maximum budget, and η are 1, 27, and 3, respectively. The topmost sphere of SH Bracket 1 is the first step, where 27 configurations are sampled uniformly at random and evaluated at the lowest budget 1. These evaluated configurations then form the DE subpopulation associated with budget 1. The dotted arrow pointing downwards indicates that the top 9 configurations (27/η) are promoted to be evaluated on the next higher budget 3 to create the DE subpopulation associated with budget 3, and so on until the highest budget. This progressive increase of the budget by η and decrease of the number of evaluated configurations by η is simply vanilla SH. Indeed, each SH bracket of this first DEHB iteration basically executes vanilla SH, starting from different minimum budgets, just like in HB.

One difference from vanilla SH is that random sampling of configurations occurs only once: in the first step of the first SH bracket of the first DEHB iteration. Every subsequent SH bracket begins by reusing the subpopulation updated in the previous SH bracket and carrying out a DE evolution (detailed in Section 4.2). For example, for SH bracket 2 in Figure 1, the subpopulation of 9 configurations for budget 3 (topmost sphere) is propagated from SH bracket 1 and undergoes evolution. The top 3 configurations (9/η) then affect the population for the next higher budget 9 of SH bracket 2. Specifically, these will be used as the so-called parent pool for that higher budget, using the modified DE evolution to be discussed in Section 4.2. The end of SH Bracket 4 marks the end of this DEHB iteration. We dub DEHB's first iteration its initialization iteration. At the end of this iteration, all DE subpopulations associated with the higher budgets are seeded with configurations that performed well on the lower budgets. In subsequent SH brackets, no random sampling occurs anymore, and the search runs separate DE evolutions at different budget levels, where information flows from the subpopulations at lower budgets to those at higher budgets through the modified DE mutation (Fig. 3).

Figure 2: Modified SH routine under DEHB

4.2 Modified Successive Halving using DE Evolution

We now discuss the deviations from vanilla SH by elaborating on the design of an SH bracket inside DEHB, highlighted with a box in Figure 1 (SH Bracket 1). In DEHB, the top-performing configurations from a lower budget are not simply promoted and evaluated on a higher budget (except in the initialization SH bracket). Rather, in DEHB, the top-performing configurations are collected in a Parent Pool (Figure 2). This pool is responsible for the transfer of information from a lower budget to the next higher budget, but not by directly suggesting the best configurations from the lower budget for re-evaluation at a higher budget. Instead, the parent pool represents a well-performing region w.r.t. the lower budget, from which parents can be sampled for mutation. Figure 3b demonstrates how a parent pool contributes to a DE evolution in DEHB. Unlike in vanilla DE (Figure 3a), in DEHB the parents used to create mutants are drawn from the parent pool instead of the population itself. This allows the evolution to incorporate and combine information from the current budget, and also from the decoupled search happening on the lower budget.

Figure 3: Modified DE evolution under DEHB

Figure 4: Runtime comparison for DEHB and BOHB based on a single run on the Cifar-10 benchmark from NAS-Bench-201. The x-axis shows the actual cumulative wall-clock time spent by the algorithm (optimization time) in between the function evaluations.

The selection step, as shown in Figure 3, is responsible for updating the current subpopulation if the newly suggested configuration is better. If not, the existing configuration is retained in the subpopulation. This guards against cases where performance across budget levels is not correlated and good configurations from lower budgets do not improve higher-budget scores. However, search on the higher budget can still progress, as the first step of every SH bracket performs vanilla DE evolution (there is no parent pool to receive information from). Thereby, search at the required budget level progresses even if lower budgets are not informative.

Additionally, we also construct a global population pool consisting of configurations from all the subpopulations. This pool does not undergo any evolution and serves as the parent pool in the edge case where the parent pool is smaller than the minimum number of individuals required for the mutation step. For the example in Figure 2, under the rand/1 mutation strategy (which requires three parents), we see that for the highest budget, only one configuration (3/η) is included from the previous budget. In such a scenario, the two additional required parents are sampled from the global population pool.
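This modified mutation with its global-pool fallback can be sketched as follows; names and the pool representation are ours, and the pools here are hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(2)

def modified_mutation(parent_pool, global_pool, F=0.5):
    """rand/1 mutation drawing parents from the parent pool; if the pool
    has fewer than the three required parents, it is topped up from the
    global population pool (the Section 4.2 edge case)."""
    pool = list(parent_pool)
    if len(pool) < 3:
        deficit = 3 - len(pool)
        extra = rng.choice(len(global_pool), size=deficit, replace=False)
        pool += [global_pool[j] for j in extra]
    r1, r2, r3 = rng.choice(len(pool), size=3, replace=False)
    return pool[r1] + F * (pool[r2] - pool[r3])    # Eq. (1) on pool members

# e.g. only one configuration was promoted to the highest budget:
parent_pool = [rng.random(4)]                      # 4-dimensional configs
global_pool = [rng.random(4) for _ in range(10)]
mutant = modified_mutation(parent_pool, global_pool)
```

The mutant is then combined with the target individual via the usual binomial crossover, followed by the selection step described above.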

4.3 DEHB efficiency and parallelization

As mentioned previously, DEHB carries out separate DE searches at each budget level. Moreover, the DE operations involved in evolving a configuration take constant time. Therefore, DEHB's runtime overhead does not grow over time, even as the number of performed function evaluations increases; this is in stark contrast to model-based methods, whose time complexity is often cubic in the number of performed function evaluations. Indeed, Figure 4 demonstrates that, for a tabular benchmark with negligible cost for function evaluations, DEHB is almost 2 orders of magnitude faster than BOHB at performing 13,336 function evaluations. GP-based Bayesian optimization tools would require approximations to even fit a single model with this number of function evaluations.

Figure 5: Results for the OpenML Letter surrogate benchmark, where n represents the number of workers used for each DEHB run. Each trace is averaged over 10 runs.

We also briefly describe a parallel version of DEHB (see Appendix C.3 for details of its design). Since DEHB can be viewed as a sequence of predetermined SH brackets, the SH brackets can be asynchronously distributed over free workers. A central DEHB Orchestrator keeps a single copy of all DE subpopulations, allowing for asynchronous, immediate DE evolution updates. Figure 5 illustrates that this parallel version achieves linear speedups with similar final performance.
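The orchestrator idea can be illustrated with a toy sketch; all names here are illustrative stand-ins, not DEHB's actual API, and the worker body is a placeholder for a real DE evolution step and function evaluation:

```python
import random
import threading
from concurrent.futures import ThreadPoolExecutor

class Orchestrator:
    """Keeps the single copy of all DE subpopulations and folds worker
    results back in immediately (asynchronous updates)."""
    def __init__(self, budgets):
        self.subpops = {b: [] for b in budgets}
        self.lock = threading.Lock()

    def report(self, budget, config, loss):
        with self.lock:                            # atomic, immediate update
            self.subpops[budget].append((config, loss))

def worker(orch, budget):
    config = [random.random() for _ in range(3)]   # stand-in for DE evolution
    loss = sum(x * x for x in config)              # stand-in evaluation
    orch.report(budget, config, loss)

orch = Orchestrator(budgets=[1, 3, 9, 27])
with ThreadPoolExecutor(max_workers=4) as pool:    # free workers pick up jobs
    for b in [1, 1, 3, 9, 27]:
        pool.submit(worker, orch, b)
```

Because there is only one copy of each subpopulation, a result reported by any worker is visible to all subsequently dispatched evolutions.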

5 Experiments

We now comprehensively evaluate DEHB, illustrating that it is more robust and efficient than any other HPO method we are aware of. To keep comparisons fair and reproducible, we use a broad collection of publicly-available HPO and NAS benchmarks: all HPO benchmarks that were used to demonstrate the strength of BOHB [Falkner et al., 2018]1 and also a broad collection of 13 recent tabular NAS benchmarks represented as HPO problems [Awad et al., 2020].

In this section, to avoid cluttered plots, we present a focused comparison of DEHB with BOHB, the best previous off-the-shelf multi-fidelity HPO method we are aware of, which has in turn outperformed a broad range of competitors (GP-BO, TPE, SMAC, HB, Fabolas, MTBO, and HB-LCNet) on these benchmarks [Falkner et al., 2018]. For reference, we also include the obligatory random search (RS) baseline in these plots, showing it to be clearly dominated, with up to 1000-fold speedups. We also provide a comparison against a broader range of methods at the end of this section (see Figure 13 and Table 1), with a full comparison in Appendix D. We also compare to the recent GP-based multi-fidelity BO tool Dragonfly in Appendix D.7. Details of the hyperparameter values of the algorithms used can be found in Appendix D.1.

We use the same parameter settings of mutation factor F = 0.5 and crossover rate p = 0.5 for both DE and DEHB. The population size for DEHB is not user-defined but set by its internal Hyperband component, while we set it to 20 for DE, following [Awad et al., 2020]. Unless specified otherwise, we report results from 50 runs for all algorithms, plotting the validation regret2 over the cumulative cost incurred by the function evaluations, and ignoring the optimizers' overhead in order not to give DEHB what could be seen as an unfair advantage.3 We also show the speedups that DEHB achieves compared to RS and BOHB, where this is possible without adding clutter.

1We leave out the 2-dimensional SVM surrogate benchmarks since all multi-fidelity algorithms performed similarly for this easy task, without any discernible difference.

2This is the difference of the validation score from the global best.

Figure 6: Results for the Stochastic Counting Ones problem in a 64-dimensional space with 32 categorical and 32 continuous hyperparameters. All algorithms shown were run for 50 runs.

5.1 Artificial Toy Function: Stochastic Counting Ones

This toy benchmark by Falkner et al. [2018] is useful to assess scaling behavior and the ability to handle binary dimensions. The goal is to minimize the following objective function:

f(x) = − ( Σ_{x ∈ X_cat} x + Σ_{x ∈ X_cont} E_b[ B_{p=x} ] ),

where the sum over the categorical variables (x_i ∈ {0, 1}) represents the standard discrete counting ones problem. The continuous variables (x_j ∈ [0, 1]) represent the stochastic component, with the budget b controlling the noise. The budget here is the number of samples used to estimate the mean of the Bernoulli distribution (B) with parameter x_j. Following Falkner et al. [2018], we run 4 sets of experiments with N_cont = N_cat ∈ {4, 8, 16, 32}, where N_cont = |X_cont| and N_cat = |X_cat|, using the same budget spacing and plotting the normalized regret (f(x) + d)/d, where d = N_cat + N_cont. Although this is a toy benchmark, it can offer interesting insights, since the search space has mixed binary/continuous dimensions, which DEHB handles well (refer to C.2 in the Appendix for more details). In Figure 6, we consider the 64-dimensional space N_cat = N_cont = 32; results for the lower dimensions can be found in Appendix D.2. Both BOHB and DEHB begin with a set of randomly sampled individuals evaluated on the lowest budget. It is therefore unsurprising that in Figure 6 (and in other experiments too), these two algorithms follow a similar optimization trace at the beginning of the search. Given the high dimensionality, BOHB requires many more samples to switch to model-based search, which slows its convergence in comparison to the lower-dimensional cases (N_cont = N_cat ∈ {4, 8, 16}). In contrast, DEHB's convergence rate is almost agnostic to the increase in dimensionality.
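Evaluating this objective is straightforward to sketch; the snippet below is our own illustration, where `budget` is the number of Bernoulli samples drawn per continuous coordinate:

```python
import numpy as np

rng = np.random.default_rng(3)

def counting_ones(x_cat, x_cont, budget):
    # categorical coordinates contribute their value directly; each
    # continuous x_j contributes the empirical mean of `budget`
    # Bernoulli(x_j) samples, so the noise shrinks as the budget grows
    cont_part = sum(rng.binomial(1, xj, size=budget).mean() for xj in x_cont)
    return -(sum(x_cat) + cont_part)

def normalized_regret(x_cat, x_cont, budget):
    d = len(x_cat) + len(x_cont)
    return (counting_ones(x_cat, x_cont, budget) + d) / d

# at the optimum (all coordinates equal to one), the regret is zero
r = normalized_regret([1] * 32, [1.0] * 32, budget=729)
```

At the optimum, every Bernoulli sample equals one, so f(x) = −d and the normalized regret (f(x) + d)/d is exactly zero regardless of the budget.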

5.2 Surrogates for Feedforward Neural Networks

In this experiment, we optimize six architectural and training hyperparameters of a feed-forward neural network on six different datasets from OpenML [Vanschoren et al., 2014], using a surrogate benchmark built by Falkner et al. [2018]. The budgets are the training epochs of the neural networks. For all six datasets, we observe a similar pattern of the search trajectory, with DEHB and BOHB having similar anytime performance and DEHB achieving the best final score. An example is given in Figure 7, also showing a 1000-fold speedup over random search; qualitatively similar results for the other 5 datasets are in Appendix D.3.

3Shaded bands in plots represent the standard error of the mean.

Figure 7: Results for the OpenML Adult surrogate benchmark for 6 continuous hyperparameters for 50 runs of each algorithm.

Figure 8: Results for tuning 5 hyperparameters of a Bayesian Neural Network on the Boston Housing regression dataset for 50 runs each.

5.3 Bayesian Neural Networks

In this benchmark, introduced by Falkner et al. [2018], a two-layer fully-connected Bayesian Neural Network is trained using stochastic gradient Hamiltonian Monte-Carlo sampling (SGHMC) [Chen et al., 2014] with scale adaptation [Springenberg et al., 2016]. The budgets are the number of MCMC steps (500 as minimum; 10000 as maximum). Two regression datasets from UCI [Dua and Graff, 2017] were used for the experiments: Boston Housing and Protein Structure. Figure 8 shows the results for Boston Housing; the results for Protein Structure are in Appendix D.4. For this extremely noisy benchmark, BOHB and DEHB perform similarly, and both are about 2× faster than RS.

5.4 Reinforcement Learning

For this benchmark, used by Falkner et al. [2018], a proximal policy optimization (PPO) [Schulman et al., 2017] implementation is parameterized with 7 hyperparameters. PPO is used to learn the cartpole swing-up task from the OpenAI Gym [Brockman et al., 2016] environment. We plot the mean number of episodes needed until convergence for a configuration over actual cumulative wall-clock time in Figure 9. Despite the strong noise in this problem, BOHB and DEHB are able to improve continuously, showing similar performance, and speeding up over random search by roughly 2×.

Figure 9: Results for tuning PPO on the OpenAI Gym cartpole environment with 7 hyperparameters. Each algorithm was run for 50 runs.

5.5 NAS Benchmarks

In this series of experiments, we evaluate DEHB on a broad range of NAS benchmarks. We use a total of 13 tabular benchmarks from NAS-Bench-101 [Ying et al., 2019], NAS-Bench-1shot1 [Zela et al., 2020], NAS-Bench-201 [Dong and Yang, 2020] and NAS-HPO-Bench [Klein and Hutter, 2019]. For NAS-Bench-101, we show results on CifarC (a mixed data type encoding of the parameter space [Awad et al., 2020]) in Figure 10; BOHB and DEHB initially perform similarly to RS for this dataset, since there is only little correlation between runs with few epochs (low budgets) and many epochs (high budgets) in NAS-Bench-101. In the end, RS stagnates, BOHB stagnates at a slightly better performance, and DEHB continues to improve. In Figure 11, we report results for ImageNet16-120 from NAS-Bench-201. In this case, DEHB is clearly the best of the methods, quickly converging to a strong solution.

Figure 10: Results for Cifar C from NAS-Bench-101 for a 27-dimensional space (22 continuous + 5 categorical hyperparameters).

Figure 11: Results for ImageNet16-120 from NAS-Bench-201 for 50 runs of each algorithm. The search space contains 6 categorical parameters.

Finally, Figure 12 reports results for the Protein Structure dataset provided in NAS-HPO-Bench. DEHB makes faster progress than BOHB to reach the optimum. The results on the other NAS benchmarks are qualitatively similar to these 3 representative benchmarks and are given in Appendix D.6.

5.6 Results summaryWe now compare DEHB to a broader range of baseline al-gorithms, also including HB, TPE [Bergstra et al., 2011],SMAC [Hutter et al., 2011], regularized evolution (RE) [Realet al., 2019], and DE. Based on the mean validation regret, allalgorithms can be ranked for each benchmark, for every sec-ond of the estimated wallclock time. Arranging the mean re-gret per timepoint across all benchmarks (except the Stochas-tic Counting Ones and the Bayesian Neural Network bench-marks, which do not have runtimes as budgets), we computethe average relative rank over time for each algorithm in Fig-ure 13, where all 8 algorithms were given the mean rank of4.5 at the beginning. The shaded region clearly indicates thatDEHB is the most robust algorithm for this set of bench-marks (discussed further in Appendix D.8). In the end, REand DE are similarly good, but these blackbox optimizationalgorithms perform worst for small compute budgets, whileDEHB’s multi-fidelity aspect makes it robust across compute

Figure 12: Results for theProtein Structure dataset fromNAS-HPO-Bench for 50 runsof each algorithm. The searchspace contains 9 hyperparame-ters.

Figure 13: Average rank of the mean validation regret of 50 runs ofeach algorithm, averaged over the NAS-Bench-101, NAS-Bench-1shot1, NAS-HPO-Bench, NAS-Bench-201, OpenML surrogates,and the Reinforcement Learning benchmarks.

budgets. In Table 1, we show the average rank of each algorithm based on the final validation regret achieved across all benchmarks (now also including Stochastic Counting Ones and Bayesian Neural Networks; data derived from Table 2 in Appendix D.8). Next to its strong anytime performance, DEHB also yields the best final performance in this comparison, thus emerging as a strong general optimizer that works consistently across a diverse set of benchmarks. Result tables and figures for all benchmarks can be found in Appendix D.

            RS    HB    BOHB  TPE   SMAC  RE    DE    DEHB
Avg. rank   7.46  6.54  4.42  4.35  4.73  3.16  2.96  2.39

Table 1: Mean ranks based on final mean validation regret for all algorithms tested on all benchmarks.

6 Conclusion

We introduced DEHB, a new, general HPO solver, built to perform efficiently and robustly across many different problem domains. As discussed, DEHB satisfies the many requirements of such an HPO solver: strong performance with both short and long compute budgets, robust results, scalability to high dimensions, flexibility to handle mixed data types, parallelizability, and low computational overhead. Our experiments show that DEHB meets these requirements and in particular yields much more robust performance for discrete and high-dimensional problems than BOHB, the previous best overall HPO method we are aware of. Indeed, in our experiments, DEHB was up to 32× faster than BOHB and up to 1000× faster than random search. DEHB does not require advanced software packages, is simple by design, and can easily be implemented across various platforms and languages, allowing for practical adoption. We thus hope that DEHB will become a new default HPO method. Our reference implementation of DEHB is available at https://github.com/automl/DEHB.


Acknowledgements. The authors acknowledge funding by the Robert Bosch GmbH, by the German Federal Ministry of Education and Research (BMBF, grant RenormalizedFlows 01IS19077C), and support by the state of Baden-Württemberg through bwHPC and the German Research Foundation (DFG) through grant no. INST 39/963-1 FUGG.

References

P.J. Angeline, G.M. Saunders, and J.B. Pollack. An evolutionary algorithm that constructs recurrent neural networks. IEEE Transactions on Neural Networks, 5(1):54–65, 1994.

N. Awad, N. Mallik, and F. Hutter. Differential evolution for neural architecture search. In First ICLR Workshop on Neural Architecture Search, 2020.

J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. In Proc. of NeurIPS'11, pages 2546–2554, 2011.

G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

U. K. Chakraborty. Advances in Differential Evolution, volume 143. Springer, 2008.

T. Chen, E. Fox, and C. Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, pages 1683–1691, 2014.

S. Das, S. S. Mullick, and P. N. Suganthan. Recent advances in differential evolution – an updated survey. Swarm and Evolutionary Computation, 27:1–30, 2016.

X. Dong and Y. Yang. NAS-Bench-102: Extending the scope of reproducible neural architecture search. arXiv preprint arXiv:2001.00326, 2020.

D. Dua and C. Graff. UCI Machine Learning Repository, 2017.

S. Falkner, A. Klein, and F. Hutter. BOHB: Robust and efficient hyperparameter optimization at scale. In Proc. of ICML'18, pages 1437–1446, 2018.

J. J. Grefenstette. Optimization of control parameters for genetic algorithms. IEEE Transactions on Systems, Man, and Cybernetics, 16:341–359, 1986.

P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger. Deep reinforcement learning that matters. In Proc. of AAAI'18, 2018.

F. Hutter, H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Proc. of LION'11, pages 507–523, 2011.

K. Jamieson and A. Talwalkar. Non-stochastic best arm identification and hyperparameter optimization. In Proc. of AISTATS'16, 2016.

K. Price, R. M. Storn, and J. A. Lampinen. Differential Evolution: A Practical Approach to Global Optimization. Springer Science & Business Media, 2006.

K. Kandasamy, G. Dasarathy, J. Schneider, and B. Poczos. Multi-fidelity Bayesian optimisation with continuous approximations. arXiv:1703.06240 [stat.ML], 2017.

K. Kandasamy, K. R. Vysyaraju, W. Neiswanger, B. Paria, C. R. Collins, J. Schneider, B. Poczos, and E. P. Xing. Tuning hyperparameters without grad students: Scalable and robust Bayesian optimisation with Dragonfly. Journal of Machine Learning Research, 21(81):1–27, 2020.

A. Klein and F. Hutter. Tabular benchmarks for joint architecture and hyperparameter optimization. arXiv preprint arXiv:1905.04970, 2019.

A. Klein, S. Falkner, S. Bartels, P. Hennig, and F. Hutter. Fast Bayesian optimization of machine learning hyperparameters on large datasets. arXiv:1605.07079 [cs.LG], 2016.

L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar. Hyperband: Bandit-based configuration evaluation for hyperparameter optimization. In Proc. of ICLR'17, 2017.

B. Liu, S. Koziel, and Q. Zhang. A multi-fidelity surrogate-model-assisted evolutionary algorithm for computationally expensive optimization problems. Journal of Computational Science, 12:28–37, 2016.

H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu. Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436, 2017.

H. Liu, K. Simonyan, and Y. Yang. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.

G. Melis, C. Dyer, and P. Blunsom. On the state of the art of evaluation in neural language models. In Proc. of ICLR'18, 2018.

H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268, 2018.

E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le, and A. Kurakin. Large-scale evolution of image classifiers. In Proc. of ICML, pages 2902–2911. JMLR.org, 2017.

E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search. In Proc. of AAAI, volume 33, pages 4780–4789, 2019.

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Proc. of NeurIPS'12, pages 2951–2959, 2012.

J. T. Springenberg, A. Klein, S. Falkner, and F. Hutter. Bayesian optimization with robust Bayesian neural networks. In Proc. of NeurIPS, pages 4134–4142, 2016.

R. Storn and K. Price. Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization, 11(4):341–359, 1997.

M. Vallati, F. Hutter, L. Chrpa, and T. L. McCluskey. On the effective configuration of planning domain models. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.

J. Vanschoren, J. N. van Rijn, B. Bischl, and L. Torgo. OpenML: Networked science in machine learning. ACM SIGKDD Explorations Newsletter, 15(2):49–60, 2014.

H. Wang, Y. Jin, and J. Doherty. A generic test suite for evolutionary multifidelity optimization. IEEE Transactions on Evolutionary Computation, 22(6):836–850, 2017.

L. Xie and A. Yuille. Genetic CNN. In Proc. of ICCV, pages 1379–1388, 2017.

C. Ying, A. Klein, E. Real, E. Christiansen, K. Murphy, and F. Hutter. NAS-Bench-101: Towards reproducible neural architecture search. arXiv preprint arXiv:1902.09635, 2019.

A. Zela, J. Siems, and F. Hutter. NAS-Bench-1shot1: Benchmarking and dissecting one-shot neural architecture search. arXiv preprint arXiv:2001.10422, 2020.


A More details on DE

Differential Evolution (DE) is a simple, well-performing evolutionary algorithm for solving a variety of optimization problems [K. Price and Lampinen, 2006; Das et al., 2016]. The algorithm was originally introduced in 1995 by Storn and Price [Storn and Price, 1997] and later attracted the attention of many researchers, leading to new, improved state-of-the-art algorithms [Chakraborty, 2008]. DE is based on four steps: initialization, mutation, crossover and selection. Algorithm 1 presents the DE pseudo-code.

Initialization. DE is a population-based meta-heuristic algorithm which consists of a population of N individuals. Each individual is considered a solution and expressed as a vector of D-dimensional decision variables as follows:

pop_g = (x_{1,g}, ..., x_{N,g}),  x_{i,g} = (x^1_{i,g}, x^2_{i,g}, ..., x^D_{i,g}),  i = 1, 2, ..., N,   (2)

where g is the generation number, D is the dimension of the problem being solved and N is the population size. The algorithm starts with individuals randomly distributed within the search space. The function value for the problem being solved, f(x), is then computed for each individual.

Mutation. A new child/offspring is generated for each individual in the population using the mutation operation, by a so-called mutation strategy. Figure 14 illustrates this operation for a 2-dimensional case. Classical DE uses the mutation operator rand/1, in which three random individuals/parents denoted as x_{r1}, x_{r2}, x_{r3} are chosen to generate a new vector v_i as follows:

v_{i,g} = x_{r1,g} + F · (x_{r2,g} − x_{r3,g}),   (3)

where v_{i,g} is the mutant vector generated for each individual x_{i,g} in the population, F is the scaling factor that usually takes values within the range (0, 1], and r1, r2, r3 are the indices of distinct, randomly selected individuals. Eq. 3 allows some parameters to fall outside the search range; therefore, each parameter in v_{i,g} is checked and reset [4] if it happens to be outside the boundaries.
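As a concrete sketch (not the paper's reference implementation; the function name and the unit-hypercube bounds are our assumptions), the rand/1 mutation of Eq. 3, including the boundary reset of footnote 4, could be written as:

```python
import numpy as np

def rand1_mutation(pop, i, F=0.5, rng=None):
    """Generate a mutant v_i = x_r1 + F * (x_r2 - x_r3)  (Eq. 3).

    pop: population array of shape (N, D); i: index of the current target.
    r1, r2, r3 are distinct indices, all different from i.
    """
    rng = rng or np.random.default_rng()
    candidates = [j for j in range(len(pop)) if j != i]
    r1, r2, r3 = rng.choice(candidates, size=3, replace=False)
    v = pop[r1] + F * (pop[r2] - pop[r3])
    # Out-of-bounds parameters are reset to a uniform random value in [0, 1],
    # as in footnote 4, assuming the unit-hypercube encoding of Appendix C.2.
    out = (v < 0.0) | (v > 1.0)
    v[out] = rng.uniform(0.0, 1.0, size=out.sum())
    return v
```

Note that the population must contain at least four individuals so that i, r1, r2, r3 can all be distinct.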

Crossover. When the mutation phase is completed, the crossover operation is applied to each target vector x_{i,g} and its corresponding mutant vector v_{i,g} to generate a trial vector u_{i,g}. Classical DE uses the following uniform (binomial) crossover:

u^j_{i,g} =  v^j_{i,g}   if (rand ≤ p) or (j = j_rand)
             x^j_{i,g}   otherwise                        (4)

The crossover rate p is real-valued and is usually specified in the range [0, 1]. This variable controls the portion of parameter values that are copied from the mutant vector. The j-th parameter value is copied from the mutant vector v_{i,g} to the corresponding position in the trial vector u_{i,g} if a random number is less than or equal to p. If the condition is not satisfied, then the j-th position is copied from the target vector x_{i,g}. j_rand is a random integer in the range [1, D] ensuring that at least one dimension is copied from the mutant, in case the

[4] A random value from [0, 1] is chosen uniformly in this work.
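A minimal sketch of the binomial crossover of Eq. 4 (our own illustrative code with a hypothetical function name, not the reference implementation):

```python
import numpy as np

def binomial_crossover(target, mutant, p=0.5, rng=None):
    """Binomial (uniform) crossover of Eq. 4.

    Each dimension j of the trial vector is copied from the mutant with
    probability p; dimension j_rand is always taken from the mutant, so the
    trial differs from the target in at least one position.
    """
    rng = rng or np.random.default_rng()
    D = len(target)
    take_mutant = rng.uniform(size=D) <= p
    take_mutant[rng.integers(D)] = True  # j_rand: force >= 1 mutant entry
    return np.where(take_mutant, mutant, target)
```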

Figure 14: Illustration of the DE mutation operation for a 2-dimensional case using the rand/1 mutation strategy. The scaled difference vector F · (x_{r2} − x_{r3}) is used to determine the neighbourhood of search around x_{r1}. Depending on the diversity of the population, DE mutation's search will be explorative or exploitative.

random number generated for all dimensions is > p. Figure 15 shows an illustration of the crossover operation.

Selection. After the final offspring is generated, the selection operation takes place to determine whether the target (the parent, x_{i,g}) or the trial (the offspring, u_{i,g}) vector survives to the next generation, by comparing the function values. The offspring replaces its parent if it has a better [5] function value, as shown in Equation 5. Otherwise, the new offspring is discarded, and the target vector remains in the population for the next generation.

x_{i,g+1} =  u_{i,g}   if f(u_{i,g}) ≤ f(x_{i,g})
             x_{i,g}   otherwise                          (5)
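The greedy selection of Eq. 5 can be sketched for a whole population at once (illustrative code; `de_selection` is our own name):

```python
import numpy as np

def de_selection(pop, fitness, trials, f):
    """Greedy DE selection of Eq. 5: trial u_i replaces target x_i iff
    f(u_i) <= f(x_i).  DE minimizes, so lower function values are better."""
    trial_fitness = np.array([f(u) for u in trials])
    improved = trial_fitness <= fitness
    # Keep the trial row where it improved, the old target row otherwise.
    new_pop = np.where(improved[:, None], trials, pop)
    new_fitness = np.where(improved, trial_fitness, fitness)
    return new_pop, new_fitness
```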

B More details on Hyperband

The Hyperband (HB) algorithm [Li et al., 2017] was designed to perform random sampling with early stopping based on a pre-determined, geometrically spaced resource allocation. For DEHB we replace the random sampling with DE search. However, DEHB uses HB at its core to solve the "n versus B/n" tradeoff that HB was designed to address. Algorithm 2 shows how DEHB interfaces with HB to query how many configurations to run at each budget in each iteration. This view treats the DEHB algorithm as a sequence of predetermined (by HB), repeating Successive Halving (SH) brackets, where the iteration number refers to the index of the SH brackets run by DEHB.

C More details on DEHB

C.1 DEHB algorithm

Algorithm 3 gives the pseudo-code describing DEHB. DEHB takes as input the parameters for HB (bmin, bmax, η) and the

[5] DE is a minimizer.


Figure 15: Illustration of the DE crossover operation for a 2-dimensional case using binomial crossover. The vertices of the rectangle show the possible trial solutions between a parent x and a mutant v. Based on the choice of p, the resulting individual will either be a copy of the parent, a copy of the mutant, or a combination of components from parent and mutant.

Algorithm 1: DE Optimizer
Input:
  f - black-box problem
  F - scaling factor (default F = 0.5)
  p - crossover rate (default p = 0.5)
  N - population size
Output: Return best found individual in pop

1  g = 0, FE = 0
2  pop_g ← initial_population(N, D)
3  fitness_g ← evaluate_population(pop_g)
4  FE = N
5  while (FE < FE_max) do
6    mutate(pop_g)
7    offspring_g ← crossover(pop_g)
8    fitness_g ← evaluate_population(offspring_g)
9    pop_{g+1}, fitness_{g+1} ← select(pop_g, offspring_g)
10   FE = FE + N
11   g = g + 1
12 end
13 return individual with best fitness seen

parameters for DE (F, p). For the experiments in this paper, the termination condition was chosen as the total number of DEHB brackets to run. However, in our implementation it can also be specified as the total absolute number of function evaluations, or as a cumulative wallclock time budget. L6 is the call to Algorithm 2, which gives the list of increasing budgets to be used for that SH bracket. The nomenclature DE[budgets[i]], used in L9 and L12, indicates the DE subpopulation associated with the budgets[i] fidelity level. The if...else block from L11-15 differentiates the first DEHB bracket from the

Algorithm 2: A SH bracket under Hyperband
Input:
  bmin, bmax - min and max budgets
  η - fraction of configurations promoted
  iteration - iteration number
Output: List of no. of configurations and budgets

1  s_max = ⌊log_η(bmax / bmin)⌋
2  s = s_max − (iteration mod (s_max + 1))
3  N = ⌈((s_max + 1) / (s + 1)) · η^s⌉
4  b0 = bmax · η^{−s}
5  budgets = n_configs = []
6  for i ∈ {0, ..., s} do
7    N_i = ⌊N · η^{−i}⌋
8    b = b0 · η^i
9    n_configs.append(N_i)
10   budgets.append(b)
11 end
12 return n_configs, budgets
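The bracket arithmetic of Algorithm 2 can be sketched in Python (our own transcription with minor floating-point guards; `sh_bracket_under_hb` is a hypothetical name, not DEHB's API):

```python
import math

def sh_bracket_under_hb(b_min, b_max, eta, iteration):
    """Return (n_configs, budgets) for one SH bracket (Algorithm 2)."""
    # s_max = floor(log_eta(b_max / b_min)); a tiny epsilon guards fp error
    s_max = int(math.floor(math.log(b_max / b_min) / math.log(eta) + 1e-10))
    s = s_max - (iteration % (s_max + 1))
    # N = ceil((s_max + 1) / (s + 1) * eta^s), kept exact over integers
    N = math.ceil((s_max + 1) * eta ** s / (s + 1))
    b0 = b_max / eta ** s
    n_configs = [N // eta ** i for i in range(s + 1)]   # floor(N * eta^-i)
    budgets = [b0 * eta ** i for i in range(s + 1)]
    return n_configs, budgets
```

For instance, with bmin = 4, bmax = 108 epochs and η = 3, iteration 0 yields the most exploratory bracket: 27, 9, 3, 1 configurations at budgets 4, 12, 36, 108.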

rest. During the first DEHB bracket (bracket_counter == 0), from its second SH bracket onwards (i > 0), the top configurations from the lower fidelity are promoted [6] for evaluation on the next higher fidelity. In all other scenarios, the function DE_trial_generation on L14, i.e., the sequence of mutation-crossover operations, generates a candidate configuration (config) to be evaluated. L17 carries out the DE selection procedure by comparing the fitness score of config and the selected target for that DE evolution step. The target (x_{i,g} from Equation 4) is selected on L9 by a rolling pointer over the subpopulation list. That is, for every iteration (every increment of j) a pointer moves forward by one index position in the subpopulation, selecting an individual to be a target. When this pointer reaches the maximal index, it resets to point back to the starting index of the subpopulation. L18 compares the score of the last evaluated config with the best found score so far. If the new config has a better fitness score, the best found score is updated and the new config is marked as the incumbent, config_inc. This stores the best found configuration as an anytime best performing configuration.

C.2 Handling Mixed Data Types

When dealing with discrete or categorical search spaces, such as the NAS problem, the best way to apply DE with such parameters is to keep the population continuous and perform mutation and crossover normally (Eq. 3, 4); then, to evaluate a configuration, we evaluate a copy of it in the original discrete space, as we explain below. If we instead dealt with a discrete population, then the diversity of the population would drop dramatically, leading to many individuals having the same parameter values; the resulting population would then have many duplicates, lowering the diversity of the difference distribution and making it hard for DE to explore effectively. We designed DEHB to scale all parameters of a configuration in a

[6] Only evaluated on the higher budget, not evolved using mutation-crossover-selection.


Algorithm 3: DEHB
Input:
  bmin, bmax - min and max budgets
  η - (default η = 3)
  F - scaling factor (default F = 0.5)
  p - crossover rate (default p = 0.5)
Output: Best found configuration, config_inc

1  s_max = ⌊log_η(bmax / bmin)⌋
2  Initialize (s_max + 1) DE subpopulations randomly
3  bracket_counter = 0
4  while termination condition do
5    for iteration ∈ {0, 1, ..., s_max} do
6      n_configs, budgets = SH_bracket_under_HB(bmin, bmax, η, iteration)
7      for i ∈ {0, 1, ..., s_max − iteration} do
8        for j ∈ {1, 2, ..., n_configs[i]} do
9          target = rolling pointer for DE[budgets[i]]
10         mutation_type = "vanilla" if i is 0 else "altered"
11         if bracket_counter is 0 and i > 0 then
12           config = j-th best config from DE[budgets[i − 1]]
13         else
14           config = DE_trial_generation(target, mutation_type)
15         end
16         result = Evaluate config on budgets[i]
17         DE selection using result, config vs. target
18         Update incumbent, config_inc
19       end
20     end
21   end
22   bracket_counter += 1
23 end
24 return config_inc

population to a unit hypercube [0, 1], for the two broad types of parameters normally encountered:

• Integer and float parameters, X_i ∈ [a_i, b_i], are retrieved as a_i + (b_i − a_i) · U_{i,g}, where the integer parameters are additionally rounded.

• Ordinal and categorical parameters, X_i ∈ {x_1, ..., x_n}, are treated equivalently s.t. the range [0, 1] is divided uniformly into n bins.
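These two rules can be sketched as a small decoder (illustrative only; the `space` descriptor format and the function name are our own invention, not DEHB's API):

```python
def decode(u, space):
    """Map a DE individual u in [0, 1]^D to a mixed-type configuration.

    `space` is a list of toy parameter descriptors:
      ("float", a, b)      -> a + (b - a) * u_i
      ("int",   a, b)      -> round(a + (b - a) * u_i)
      ("cat",   [choices]) -> [0, 1] split into len(choices) uniform bins
    """
    config = []
    for u_i, spec in zip(u, space):
        kind = spec[0]
        if kind in ("float", "int"):
            _, a, b = spec
            value = a + (b - a) * u_i
            config.append(round(value) if kind == "int" else value)
        else:  # "cat": n uniform bins over [0, 1]
            choices = spec[1]
            idx = min(int(u_i * len(choices)), len(choices) - 1)
            config.append(choices[idx])
    return config
```

The population itself always stays in the continuous unit hypercube; only this decoded copy is evaluated.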

We also experimented with another encoding design, in which each category of each categorical variable is represented as a continuous variable in [0, 1], and the category with the maximal continuous value is chosen [Vallati et al., 2015]. For example, in Figure 16, the effective dimensionality of the search space becomes 96: 32 continuous variables + 64 continuous variables derived from 32 binary variables. We chose a representative set of benchmarks (NAS-Bench-201 and Counting Ones) to compare DEHB under the two encodings, in Figures 16 and 17. One example suffices to show that this encoding performs much worse than the DE-NAS [Awad et al., 2020] encoding we chose for DEHB; the encoding from [Vallati et al., 2015] did not achieve a better final performance than DEHB in any of our experiments.

Figure 16: Comparing DEHB encodings for the Stochastic Counting Ones problem in 64-dimensional space with 32 categorical and 32 continuous hyperparameters. Results for all algorithms on 50 runs.

Figure 17: Comparing DEHB encodings for the Cifar-100 dataset from NAS-Bench-201's 6-dimensional space. Results for all algorithms on 50 runs.

C.3 Parallel Implementation

The DEHB algorithm is a sequence of DEHB brackets, which in turn are a fixed sequence of SH brackets. This feature, along with the asynchronous nature of DE, allows a parallel execution of DEHB. We dub the main process the DEHB Orchestrator, which maintains a single copy of all DE subpopulations. An HB bracket manager determines which budget to run from which SH bracket. Based on this input from the bracket manager, the orchestrator can fetch a configuration [7] from the current subpopulations and make an asynchronous call for its evaluation on the assigned budget. The rest of the orchestrator continues synchronously to check for free workers and query the HB bracket manager for the next budget and SH bracket. Once a worker finishes computation, the orchestrator collects the result, performs DE selection and updates the relevant subpopulation accordingly. This form of update is referred to as immediate, asynchronous DE.

DEHB uses a synchronous SH routine. Though each of the function evaluations at a particular budget can be distributed, a higher budget needs to wait on all the lower budget evaluations to finish. A higher budget evaluation can begin only once the lower budget evaluations are over and the top 1/η can be selected. However, the asynchronous nature of DE allows a new bracket to begin if a worker is available while existing SH brackets have pending jobs or are waiting for results. The new bracket can continue using the current state of the DE subpopulations maintained by the DEHB Orchestrator. Once the pending jobs from previous brackets are over, the DE selection updates the DEHB Orchestrator's subpopulations. Thus, the utilisation of available computational resources is maximized, while the central copy of subpopulations maintained by the Orchestrator ensures that each new

[7] Via DE mutation and crossover to generate a configuration.


SH bracket spawned works with the latest updated subpopulation.
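The orchestrator loop described above can be caricatured with Python's standard `concurrent.futures` (a toy sketch of the idea only, not DEHB's actual worker management; `orchestrate`, `evaluate` and `select` are our own placeholder names):

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def orchestrate(jobs, evaluate, select, n_workers=4):
    """Toy sketch of the DEHB Orchestrator loop.

    `jobs` yields (config, budget, target_index) tuples from the HB bracket
    manager; `evaluate` runs one function evaluation on a worker; `select`
    performs the immediate, asynchronous DE selection on the central
    subpopulations as soon as a result comes back.
    """
    pending = set()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for config, budget, target in jobs:
            # Block only until *some* worker is free, then dispatch.
            while len(pending) >= n_workers:
                done, pending = wait(pending, return_when=FIRST_COMPLETED)
                for fut in done:
                    select(*fut.result())  # immediate population update
            fut = pool.submit(lambda c=config, b=budget:
                              (c, target, evaluate(c, b)))
            pending.add(fut)
        for fut in pending:                # drain the remaining jobs
            select(*fut.result())
```

The key property mirrored here is that selections are applied as results arrive, so newly spawned brackets always see the latest subpopulations.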

D More details on Experiments

D.1 Baseline Algorithms

In all our experiments we keep the configuration of each algorithm the same. These are well-performing settings that have been benchmarked in previous works [Falkner et al., 2018; Ying et al., 2019; Awad et al., 2020].

Random Search (RS) We sample random architectures in the configuration space from a uniform distribution in each generation.

BOHB We used the implementation from https://github.com/automl/HpBandSter. In [Ying et al., 2019], the settings of the key hyperparameters were identified as: η set to 3, the minimum bandwidth for the kernel density estimator set to 0.3, and the bandwidth factor set to 3. In our experiments, we deploy the same settings.

Hyperband (HB) We used the implementation from https://github.com/automl/HpBandSter. We set η = 3; this parameter is effectively fixed, since the NAS benchmarks do not include other budget levels.

Tree-structured Parzen Estimator (TPE) We used the open-source implementation from https://github.com/hyperopt/hyperopt. We kept the hyperparameters at their default settings.

Sequential Model-based Algorithm Configuration (SMAC) We used the implementation from https://github.com/automl/SMAC3 under its default parameter setting. Only for the Counting Ones problem with 64 dimensions, the initial design had to be changed to a Latin Hypercube design instead of a Sobol design.

Regularized Evolution (RE) We used the implementation from [Real et al., 2019]. We initially sample an edge or operator uniformly at random, then we perform the mutation. After reaching the population size, RE kills the oldest member at each iteration. As recommended by [Ying et al., 2019], the population size (PS) and sample size (TS) are set to 100 and 10, respectively.

Differential Evolution (DE) We used the implementation from [Awad et al., 2020], keeping rand/1 as the mutation strategy and binomial crossover as the crossover strategy. We also use the same population size of 20 as [Awad et al., 2020].

All plots for all baselines show the incumbent validation regret over the estimated wallclock time, ignoring the optimization time. The x-axis therefore accounts only for the cumulative cost incurred by function evaluations for each algorithm. All algorithms were run for a similar actual wallclock time budget. Certain algorithms on certain benchmarks may not appear to have an equivalent total estimated wallclock time; that is an artefact of ignoring optimization time. Model-based algorithms such as SMAC, BOHB and TPE have a computational cost that depends on the observation history, and may therefore perform fewer function evaluations in the same actual wallclock time.

Figure 18: Results for the Stochastic Counting Ones problem for N = {4, 8, 16, 32}, respectively indicating N categorical and N continuous hyperparameters for each case. All algorithms shown were run for 50 runs.

D.2 Artificial Toy Function: Stochastic Counting Ones

The Counting Ones benchmark was designed to minimize the following objective function:

f(x) = − ( Σ_{x_i ∈ X_cat} x_i + Σ_{x_j ∈ X_cont} E_b[ B_{p = x_j} ] ),

where the sum over the categorical variables (x_i ∈ {0, 1}) represents the standard discrete counting ones problem. The continuous variables (x_j ∈ [0, 1]) represent the stochastic component, with the budget b controlling the noise. The budget here represents the number of samples used to estimate the mean of the Bernoulli distribution (B) with parameter x_j.
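A direct transcription of this objective (our own sketch with a hypothetical function name; minimized, so lower is better):

```python
import numpy as np

def stochastic_counting_ones(x_cat, x_cont, budget, rng=None):
    """Noisy Counting Ones objective (to be minimized).

    x_cat: 0/1 vector; x_cont: values in [0, 1]. Each continuous term is the
    empirical mean of `budget` Bernoulli(x_j) samples, so larger budgets give
    less noisy estimates of the true objective -(sum(x_cat) + sum(x_cont)).
    """
    rng = rng or np.random.default_rng()
    noisy = [rng.binomial(budget, p) / budget for p in x_cont]
    return -(float(np.sum(x_cat)) + float(np.sum(noisy)))
```

At the optimum (all variables equal to 1) the objective approaches −2N for N categorical plus N continuous dimensions, regardless of the budget.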

The experiments on the Stochastic Counting Ones benchmark used N = {4, 8, 16, 32}, all of which are shown in Figure 18. For the low-dimensional cases, BOHB's and SMAC's models give them an early advantage: for this toy benchmark the global optimum is located at a corner of the unit hypercube, and random samples span the lower-dimensional space adequately for a model to improve the search rapidly. DEHB, on the other hand, may require a few extra function evaluations to reach similar convergence. However, this conservative approach aids DEHB in the high-dimensional cases, where it converges much more rapidly than the other algorithms, especially SMAC and BOHB, whose convergence worsens significantly. DEHB thus showcases its robust performance even as the dimensionality of the problem increases.

D.3 Feed-forward networks on OpenML datasets

Figure 19 shows the results on all 6 datasets from the OpenML surrogates benchmark: Adult, Letter, Higgs, MNIST, Optdigits, Poker. The surrogate model space is just 6-dimensional, allowing BOHB and TPE to build more confident models and perform well in this space, especially early in the optimization. However, DE and DEHB remain competitive and consistently achieve a better final performance than TPE and BOHB, respectively. Even TPE achieves a better final performance


Figure 19: Results for the OpenML surrogate benchmark for the 6 datasets: Adult, Higgs, Letter, MNIST, Optdigits, Poker. The search space has 6 continuous hyperparameters. All plots shown were averaged over 50 runs of each algorithm.

than BOHB. Overall, DEHB is a competitive anytime performer for this benchmark, with the most robust final performances.

D.4 Bayesian Neural Networks

The search space for the two-layer fully-connected Bayesian Neural Network is defined by 5 hyperparameters: the step length, the length of the burn-in period, the number of units in each layer, and the decay parameter of the momentum variable. In Figure 20, we show the results for tuning Bayesian Neural Networks on both the Boston Housing and Protein Structure datasets for the 5-dimensional Bayesian Neural Network benchmark. We observe that SMAC, TPE and BOHB are able to build models and reach similar regions of performance with high confidence. DEHB is slower to match them in such a low-dimensional noisy space. However, given the same cumulative budget, DEHB achieves a competitive final score.

Figure 20: Results for tuning 5 hyperparameters of a Bayesian Neural Network on the Boston Housing and Protein Structure datasets, respectively, for 50 runs of each algorithm.

D.5 Reinforcement Learning

For this benchmark, the proximal policy optimization (PPO) [Schulman et al., 2017] implementation is parameterized with 7 hyperparameters: # units layer 1, # units layer 2, batch size, learning rate, discount, likelihood ratio clipping and entropy regularization. Figure 21 summarises the performance of all algorithms on the RL problem for the Cartpole benchmark. SMAC uses a Sobol grid as its initial design, and both its benefit and drawback can be seen as SMAC rapidly improves, stalls, and then improves again once model-based search begins. However, BOHB and DEHB both remain competitive, and BOHB, DEHB and SMAC emerge as the top-3 for this benchmark, achieving similar final scores. We notice that the DE trace stands out as worse than RS; we explain the reason behind this below.

For classical DE, the updates are deferred: the results of the selection process are incorporated into the population for consideration in the next evolution step only after all the individuals of the population have undergone evolution. In terms of computation, the wallclock time for a full population's worth of function evaluations is accumulated before the population is updated. Given how this benchmark is designed, this minor detail slows down DE's convergence, as we illustrate in Figure 21 by comparing the DE of population size 20 used in the experiments against a DE of population size 10. For the Reinforcement Learning benchmark from [Falkner et al., 2018], each full-budget function evaluation consists of 9 trials of a maximum of 3000 episodes. With a population of 20, DE will not inject a new individual into the population until all 20 individuals have participated as a parent in the crossover operation. This accumulates wallclock time equivalent to 20 individuals times 9 trials times the time taken for a maximum of 3000 episodes, which can explain the flat trajectories in the optimization trace for DE pop = 20 in Figure 21 (right). DE pop = 10 halves this accumulated wallclock time, injects newer configurations into the population faster, and is thus able to search faster. Given enough runtime, we expect DE pop = 20 to converge to similar final scores. DEHB instead uses the immediate update design for DE, wherein it updates the population immediately after a DE selection rather than waiting for the entire population to evolve. We posit that this feature, along with lower-fidelity search and Hyperband's implicit grid search over population sizes, enables DEHB to be more practical than classical DE.
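The difference between deferred and immediate updates can be sketched as follows (illustrative code only; `propose` stands in for the mutation-crossover step, and the function names are our own):

```python
import numpy as np

def evolve_deferred(pop, fitness, propose, f):
    """Classical DE: trials for the whole population are generated from the
    *old* population; selections are applied only after all evaluations."""
    trials = [propose(pop, i) for i in range(len(pop))]
    scores = [f(u) for u in trials]
    for i, (u, s) in enumerate(zip(trials, scores)):
        if s <= fitness[i]:
            pop[i], fitness[i] = u, s
    return pop, fitness

def evolve_immediate(pop, fitness, propose, f):
    """DEHB-style immediate update: each selection is written back before the
    next trial is generated, so later trials within the same sweep can
    already exploit earlier improvements."""
    for i in range(len(pop)):
        u = propose(pop, i)
        s = f(u)
        if s <= fitness[i]:
            pop[i], fitness[i] = u, s
    return pop, fitness
```

With a `propose` that builds each trial from a neighbouring individual, the immediate variant compounds improvements within a single sweep, while the deferred variant only sees the old population.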

Figure 21: (left) Results for tuning PPO on the OpenAI Gym Cartpole environment with 7 hyperparameters. Each algorithm was run for 50 runs. (right) The same experiment comparing DE with population sizes of 10 and 20.


D.6 NAS benchmarks
NAS-Bench-101
This benchmark was the first tabular-lookup NAS benchmark, introduced to encourage research and reproducibility [Ying et al., 2019]. Each architecture in the search space is represented as a stack of cells. Each cell is treated as a directed acyclic graph (DAG), and the nodes and edges of these DAGs are parameterized; these parameters serve as the hyperparameters specifying a neural network architecture. NAS-Bench-101 offers a large search space of nearly 423k unique architectures trained on Cifar-10. The benchmark also offers a fidelity level (training epoch length), which allows HB, BOHB and DEHB to run on it. We run experiments on all 3 variants provided by NAS-Bench-101: Cifar A, Cifar B and Cifar C. The primary search space discussed by [Ying et al., 2019] is Cifar A; Cifar B and Cifar C are variants of the same search space with alternative encodings that treat the hyperparameters defined on the edges of the DAG as categorical or continuous, respectively.
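Because the benchmark is a lookup table, wall-clock time on the plot axes is simulated by accumulating the recorded training cost of each queried architecture rather than actually training networks. A minimal sketch of this standard simulation (our own illustration, not the authors' exact evaluation code):

```python
def simulated_trace(evals):
    """evals: (training_cost_seconds, validation_regret) pairs in the order
    an optimizer queried the tabular benchmark. Returns (cumulative_time,
    incumbent_regret) pairs, i.e. an anytime curve over estimated time."""
    t, best, trace = 0.0, float("inf"), []
    for cost, regret in evals:
        t += cost               # charge the recorded training time
        best = min(best, regret)  # incumbent = best regret seen so far
        trace.append((t, best))
    return trace

# three lookups: the second is worse than the incumbent, the third improves it
print(simulated_trace([(10.0, 0.5), (5.0, 0.7), (20.0, 0.2)]))
```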

In the NAS-Bench-101 benchmark, the correlation between performance scores at the different budgets is small [Ying et al., 2019], and therefore BOHB and DEHB do not yield better performance than the methods using only full function evaluations. All 3 evolutionary algorithms tested (DE, DEHB and RE) exploit the discrete high-dimensional space much better than model-based methods such as BOHB and TPE. While DEHB appears to be the algorithm with the best anytime performance in this high-dimensional discrete NAS space, DE yields the best final performance, closely followed by DEHB.
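One common way continuous evolutionary searchers such as DE are applied to a discrete NAS space is to evolve individuals in [0,1]^d and decode each dimension into a categorical choice. The sketch below illustrates this idea on a hypothetical space (the operation names and edge count are illustrative, not the exact NAS-Bench-101 encoding):

```python
import numpy as np

# Hypothetical NAS-style space: each of 5 DAG edges picks one of 3 operations.
OPS = ["conv3x3", "conv1x1", "maxpool3x3"]
N_EDGES = 5

def decode(x):
    """Map a continuous individual x in [0,1]^N_EDGES to a categorical
    architecture: dimension i selects OPS[int(x_i * len(OPS))], so each
    third of [0, 1) corresponds to one operation."""
    x = np.clip(x, 0.0, 1.0 - 1e-9)  # keep the computed index in range
    return [OPS[int(xi * len(OPS))] for xi in x]

x = np.array([0.1, 0.5, 0.95, 0.34, 0.66])
print(decode(x))
```

Under this decoding, DE's arithmetic mutation and crossover remain well defined even though the underlying benchmark only accepts discrete configurations.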

Figure 22: Results for Cifar A, B and C from NAS-Bench-101, with 26-, 14- and 27-dimensional spaces respectively. All algorithms reported for 50 runs.

NAS-Bench-1shot1
NAS-Bench-1shot1 was introduced by [Zela et al., 2020] as a benchmark derived from the large space of architectures offered by NAS-Bench-101. This benchmark allows the use of modern one-shot8 NAS methods with weight sharing [Pham et al., 2018; Liu et al., 2018]. The search space in NAS-Bench-1shot1 was modified to accommodate one-shot methods by keeping the macro network-level topology of the architectures similar and offering a different encoding design for the cell-level topology. This resulted in three search spaces: search space 1, search space 2 and search space 3, with 6240, 29160 and 363648 architectures respectively. In Figure 23, we show the results on all 3 search spaces. We exclude weight-sharing methods from the comparison in order to maintain parity across all experiments, focusing on the objective of comparing black-box solvers.

8 training a single large architecture that contains all possible architectures in the search space

Figure 23: Results for the 3 search spaces from NAS-Bench-1shot1 for 50 runs of each algorithm. The 3 search spaces contain 9, 9 and 11 categorical parameters respectively.

Among RS, TPE, SMAC, RE and DE (the full-budget algorithms), only DE is able to improve significantly as optimization proceeds. Among the multi-fidelity algorithms (HB, BOHB and DEHB), only DEHB is able to improve and diverge away from HB by the end of optimization. BOHB, HB, RS, TPE and RE all appear to follow a similar trace, showing the difficulty of finding good architectures in this benchmark. Nevertheless, the DE-based family of algorithms is able to further exploit the search space and shows better performance than the other algorithms. Though DE performs the best, RE remains competitive, again suggesting the power of evolutionary methods on discrete spaces. Among model-based methods, only TPE competes with DEHB.

NAS-Bench-201
To alleviate issues with the direct applicability of weight-sharing algorithms to NAS-Bench-101, [Dong and Yang, 2020] proposed NAS-Bench-201. This benchmark contains a fixed cell search space: each cell structure is a DAG with 4 nodes, and the edges of the DAG represent operations. The search space by design contains 6 discrete/categorical hyperparameters. NAS-Bench-201 provides lookup tables for 3 datasets: Cifar-10, Cifar-100 and ImageNet16-120, along with the number of training epochs as a fidelity level. The search space for all 3 datasets comprises 15,625 cells/architectures. From the validation regret performances in Figure 24, it is clear that DEHB quickly converges to strong solutions which are a few orders of magnitude better than those of BOHB and RS (in terms of regret). DE and RE are both competitive, with RE converging slightly faster than DE. Notably, DEHB is the only multi-fidelity algorithm in this experiment that works well.


NAS-Bench-201 specifies the same 6-dimensional discrete hyperparameter space for the Cifar-10 and Cifar-100 datasets. Figure 24 again shows that the evolutionary algorithms perform best in a space defined by categorical parameters. SMAC is the best of the rest in this scenario, behind DEHB, RE and DE. BOHB evidently struggles to be even significantly better than HB.

Figure 24: Results for Cifar-10, Cifar-100 and ImageNet16-120 from NAS-Bench-201 for 50 runs of each algorithm. The search space contains 6 categorical parameters.

NAS-HPO-Bench
To facilitate HPO research involving feed-forward neural networks, [Klein and Hutter, 2019] introduced NAS-HPO-Bench with a search space composed of hyperparameters that parameterize the architecture of a 2-layer feed-forward network9, along with hyperparameters for its training procedure. The primary difference of NAS-HPO-Bench from the OpenML surrogates benchmark is that the latter uses a random forest model as a surrogate to approximate the performance of configurations, whereas NAS-HPO-Bench is designed in the same vein as the other tabular NAS benchmarks discussed earlier. For the total of 9 discrete hyperparameters (4 for architecture + 5 for training), all 62208 configurations resulting from a grid over the search space were evaluated to yield a tabular mapping from configuration to performance. The benchmark provides such lookup tables for 4 popular UCI regression datasets: Protein Structure, Slice Localization, Naval Propulsion and Parkinsons Telemonitoring. NAS-HPO-Bench also provides the number of training epochs as a fidelity level.

Figure 25 illustrates the performance of all algorithms on the 4 datasets provided in NAS-HPO-Bench. Barring RS and HB, all algorithms obtain similar final scores on the validation set. BOHB and DEHB both diverge from HB and start improving early on. However, DEHB continues to improve and converges the fastest. TPE, RE and DE all compete with each other in terms of convergence rate, while BOHB and SMAC show similar convergence speeds.

9 additionally, a linear output layer

Figure 25: Results for the Protein Structure, Slice Localization, Naval Propulsion and Parkinsons Telemonitoring datasets from NAS-HPO-Bench for 50 runs of each algorithm. The search space contains 9 hyperparameters.

D.7 Comparison of DEHB to BO-based multi-fidelity methods

BOHB [Falkner et al., 2018] showed that its KDE-based BO outperformed other GP-based BO methods. Hence, BOHB was treated as the primary challenger to DEHB as a robust, general multi-fidelity HPO solver. In this section we run experiments on the benchmarks detailed in the previous sections to compare DEHB to another popular multi-fidelity BO optimizer, Dragonfly [Kandasamy et al., 2020] (in addition to BOHB). Dragonfly implements BOCA [Kandasamy et al., 2017], which performs BO with low-cost approximations of function evaluations, treating fidelities as a continuous domain. However, this GP-based BO method had longer execution time than the other algorithms on the tabular/surrogate benchmarks. In Figure 26 we therefore show the average of 32 runs for each algorithm, having to terminate some runs earlier than for other algorithms. In this experiment, we optimize the median performance of a configuration over different seeds. We observe that Dragonfly shows high variance in performance across benchmarks, whereas DEHB is consistently the best or, at worst, comparable to Dragonfly. Moreover, BOHB performs clearly better than Dragonfly in 8 of the 16 cases shown in Figure 26, while being comparable to Dragonfly in at least 4 others. Dragonfly comes out as the best optimizer only for the Cifar-10 dataset of the NAS-Bench-201 benchmark. These experiments further illustrate the practicality, robustness and generality of DEHB compared to GP-based multi-fidelity BO methods.
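Optimizing the median performance over seeds amounts to a thin wrapper around the benchmark lookup. A sketch, where `lookup` and its signature are hypothetical stand-ins for a tabular benchmark query:

```python
import statistics

def median_objective(lookup, config, seeds=(0, 1, 2)):
    """Evaluate `config` once per seed via the (hypothetical) `lookup`
    callable and return the median validation regret across seeds."""
    return statistics.median(lookup(config, seed=s) for s in seeds)

# toy stand-in for a tabular benchmark: regret depends on config and seed
fake = lambda config, seed: (config["lr"] - 0.1) ** 2 + 0.01 * seed
print(median_objective(fake, {"lr": 0.2}))  # median of 0.01, 0.02, 0.03
```

The median makes the objective robust to a single unlucky training seed, at the cost of several lookups per configuration.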

D.8 Results summary
In the previous experiment sections, results on all benchmarks for DEHB and all other baselines were reported, demonstrating the competitive and often superior anytime performance of DEHB. In Table 2, we report the mean final validation regret achieved by all algorithms across all 26 benchmarks. DEHB obtained the best performance on nearly a third of the benchmarks and the second-best performance on over a quarter of them. The last row of Table 2 shows the rank of each algorithm averaged across their final performances on each benchmark. DEHB is clearly the best-performing algorithm overall, followed by DE, which powers DEHB under the hood. These rankings illustrate DEHB's robustness across different search spaces, including high-dimensional, discrete and mixed-type spaces, and even problems where response signals from lower fidelities may not be very informative. Notably, across all the problems covered by this collection of benchmarks, DEHB is never consistently outperformed by any multi-fidelity or full-budget algorithm.
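The average-rank row at the bottom of Table 2 follows from a standard per-benchmark ranking with ties sharing the average rank. A small sketch of that computation (our own helper, not the authors' script):

```python
import numpy as np

def average_ranks(regrets):
    """regrets: (n_benchmarks, n_algorithms) array of mean final regrets.
    Rank algorithms per benchmark (1 = lowest regret; ties share the
    average of their positions), then average the ranks over benchmarks."""
    n_bench, n_algo = regrets.shape
    ranks = np.empty_like(regrets, dtype=float)
    for b in range(n_bench):
        row = regrets[b]
        # rank of v = 1 + (#strictly smaller) + 0.5 * (#equal) - 0.5
        ranks[b] = [1 + np.sum(row < v) + 0.5 * np.sum(row == v) - 0.5
                    for v in row]
    return ranks.mean(axis=0)

# toy example: 3 algorithms on 2 benchmarks (note the tie on benchmark 2)
R = np.array([[0.30, 0.10, 0.20],
              [0.05, 0.05, 0.90]])
print(average_ranks(R))  # -> [2.25 1.25 2.5 ]
```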

It must be noted that, based on the average rank plot in Figure 13, BOHB appears better than DEHB in the early-middle section of the optimization; BOHB's underlying model-based search can explain this phenomenon. Though DEHB's underlying DE requires more function evaluations to explore the space, it remains competitive with BOHB, and the latter is not significantly better on any of the benchmarks used. Moreover, as Figure 13 indicates, BOHB's relative performance worsens over time, while DEHB remains better even in comparison to the full-budget black-box optimizers such as DE, RE and TPE. BOHB's model-based search can again explain this: many of the benchmarks are high-dimensional with mixed data types, which can hurt the certainty of BOHB's model over the configuration space and require many more observations than DEHB needs. Overall, DEHB shows consistently good anytime performance along with strong final scores. As detailed earlier, DEHB's efficiency, simplicity and speed make good use of available resources and make it a practical, reliable tool for HPO.

E Ablation Studies

DEHB was designed as an easy-to-use tool for HPO and NAS. This necessitated that DEHB contain as few hyperparameters as possible that require tuning or to which it is sensitive. Since the HB parameters inherent to DEHB are contingent on the problem being solved, only the hyperparameters of the DE component remain to be set adequately. We therefore perform an ablation of the mutation and crossover rates to observe how DEHB's design fares across the suite of benchmarks we experiment on.

E.1 Varying F

The crossover probability p was fixed at 0.5, while the mutation factor F was varied over the values 0.1, 0.3, 0.5, 0.7 and 0.9. The studies were carried out on NAS-Bench-101, the OpenML surrogates and the toy Stochastic Counting Ones benchmarks. The results are reported in Figure 27.

We observe that a low F of 0.1 gives the DE search more exploitative power on a well-correlated benchmark such as Counting Ones, where F = 0.9 performs the worst. However, on the other benchmarks F = 0.1 performs the worst, with all other values of F performing similarly. As a result, we choose the conservative option of F = 0.5 to ensure that one general setting performs acceptably across all benchmarks.

Figure 26: Comparison of GP-based multi-fidelity BO (Dragonfly), KDE-based multi-fidelity BO (BOHB) and DE-based multi-fidelity (DEHB) methods on some of the benchmarks, averaged over 32 runs of each algorithm.


Benchmark | RS | HB | BOHB | TPE | SMAC | RE | DE | DEHB
Counting 4+4 | 9.8e-2±2.6e-2 | 6.1e-2±1.9e-2 | 2.1e-2±1.8e-2 | 8.1e-3±4.8e-3 | 1.1e-6±3.9e-6 | 2.7e-2±7.9e-3 | 1.5e-2±6.6e-3 | 9.7e-4±4.6e-4 (2)
Counting 8+8 | 1.9e-1±2.5e-2 | 1.5e-1±2.5e-2 | 3.9e-3±1.1e-3 | 7.6e-2±3.1e-2 | 2.1e-3±2.3e-3 | 5.0e-2±1.2e-2 | 6.6e-2±1.5e-2 | 1.4e-2±3.7e-3 (3)
Counting 16+16 | 2.8e-1±2.2e-2 | 2.4e-1±2.2e-2 | 9.6e-2±8.3e-3 | 1.7e-1±2.8e-2 | 1.6e-1±1.6e-2 | 9.2e-2±1.5e-2 | 1.4e-1±1.7e-2 | 6.5e-2±6.3e-3
Counting 32+32 | 3.5e-1±1.4e-2 | 3.2e-1±1.8e-2 | 2.4e-1±1.3e-2 | 2.6e-1±1.9e-2 | 3.6e-1±2.e-2 | 1.6e-1±2.e-2 | 2.2e-1±1.8e-2 | 1.4e-1±1.1e-2
OpenML adult | 3.8e-3±4.5e-4 | 3.1e-3±5.8e-4 | 1.8e-3±6.1e-4 | 1.9e-3±5.e-4 | 2.8e-3±8.3e-4 | 2.e-3±6.3e-4 | 1.9e-3±3.4e-4 | 1.1e-3±2.0e-4
OpenML higgs | 4.1e-3±7.3e-4 | 3.6e-3±5.5e-4 | 2.5e-3±5.8e-4 | 2.3e-3±7.4e-4 | 3.e-3±1.3e-3 | 2.1e-3±9.4e-4 | 2.0e-3±4.6e-4 | 1.8e-3±2.1e-4
OpenML letter | 3.8e-3±1.1e-3 | 2.9e-3±7.5e-4 | 2.e-3±1.5e-3 | 7.4e-4±2.8e-4 | 1.0e-3±9.9e-4 | 6.2e-4±3.3e-4 | 8.2e-4±1.5e-4 | 5.5e-4±3.4e-4
OpenML mnist | 1.8e-3±3.1e-4 | 1.6e-3±2.0e-4 | 1.3e-3±4.3e-4 | 9.4e-4±5.2e-5 | 1.1e-3±4.6e-4 | 9.2e-4±7.1e-5 | 5.3e-5±7.2e-5 | 9.5e-4±7.6e-5 (4)
OpenML optdigits | 3.2e-3±5.4e-4 | 2.8e-3±5.3e-4 | 1.7e-3±8.1e-4 | 1.4e-3±4.3e-4 | 1.5e-3±8.8e-4 | 1.0e-3±5.5e-4 | 8.1e-4±3.8e-4 | 7.9e-4±2.5e-4
OpenML poker | 1.1e-3±3.1e-4 | 7.6e-4±1.8e-4 | 3.e-4±1.6e-4 | 4.4e-4±1.1e-4 | 2.1e-4±1.7e-4 | 4.e-4±1.9e-4 | 4.4e-4±1.6e-4 | 1.9e-4±5.4e-5
BNN Boston | 4.7±9.4e-1 | 4.3±7.e-1 | 3.8±3.1e-1 | 4.0±4.2e-1 | 4.4±5.5e-1 | 6.6±9.5e+0 | 5.0±2.8e+0 | 4.0±4.9e-1 (2)
BNN Protein | 4.0±9.6e-1 | 3.7±5.2e-1 | 3.3±1.8e-1 | 3.4±2.7e-1 | 3.3±2.2e-1 | 5.1±5.3e+0 | 4.9±2.6e+0 | 3.5±3.7e-1 (4)
Cartpole (RL) | 4.7e+2±1.4e+2 | 3.9e+2±1.1e+2 | 1.9e+2±4.4e+1 | 2.9e+2±6.7e+1 | 1.8e+2±1.9e+1 | 3.8e+2±1.3e+2 | 4.9e+2±1.1e+2 | 2.e+2±7.3e+1 (3)
NAS101 CifarA | 2.9e-3±7.6e-4 | 3.e-3±6.0e-4 | 2.6e-3±1.2e-3 | 2.8e-3±1.1e-3 | 4.0e-3±1.5e-3 | 2.3e-3±1.2e-3 | 1.2e-3±1.1e-3 | 2.2e-3±1.4e-3 (2)
NAS101 CifarB | 3.1e-3±5.3e-4 | 3.2e-3±3.8e-4 | 2.8e-3±7.4e-4 | 3.e-3±5.8e-4 | 2.9e-3±1.4e-3 | 2.4e-3±1.0e-3 | 2.6e-3±6.9e-4 | 2.6e-3±1.1e-3 (2)
NAS101 CifarC | 3.2e-3±3.5e-4 | 3.1e-3±5.4e-4 | 2.6e-3±7.6e-4 | 2.7e-3±8.3e-4 | 6.4e-3±1.3e-3 | 2.3e-3±1.4e-3 | 1.7e-3±1.1e-3 | 2.0e-3±1.2e-3 (2)
NAS1s1 SS1 | 1.6e-3±8.6e-4 | 1.5e-3±9.6e-4 | 1.7e-3±1.1e-3 | 1.3e-3±1.1e-3 | 2.7e-3±9.9e-4 | 1.1e-3±1.1e-3 | 9.4e-4±9.1e-4 | 1.4e-3±7.8e-4 (4)
NAS1s1 SS2 | 1.3e-3±6.4e-4 | 9.8e-4±4.8e-4 | 8.6e-4±5.0e-4 | 8.2e-4±7.e-4 | 7.2e-4±4.3e-4 | 3.e-4±3.e-4 | 2.3e-4±2.5e-4 | 6.4e-4±5.5e-4 (3)
NAS1s1 SS3 | 3.5e-3±9.3e-4 | 3.4e-3±9.2e-4 | 3.9e-3±7.2e-4 | 3.5e-3±8.9e-4 | 3.8e-3±8.6e-4 | 2.8e-3±1.3e-3 | 2.3e-3±1.0e-3 | 2.6e-3±1.1e-3 (2)
NAS201 Cifar10 | 2.7e-3±1.1e-3 | 2.3e-3±7.6e-4 | 2.0e-3±1.4e-3 | 7.2e-4±1.3e-3 | 4.1e-4±5.8e-4 | 1.0e-4±5.6e-4 | 2.3e-4±1.2e-3 | 7.8e-5±1.7e-4
NAS201 Cifar100 | 8.1e-3±3.5e-3 | 6.1e-3±3.2e-3 | 5.7e-3±4.0e-3 | 1.9e-3±2.8e-3 | 1.3e-3±2.3e-3 | 8.e-5±2.4e-4 | 0.e+0±0.e+0 | 1.3e-4±2.9e-4 (3)
NAS201 ImageNet | 9.3e-3±3.6e-3 | 7.9e-3±3.9e-3 | 7.3e-3±4.1e-3 | 4.8e-3±3.7e-3 | 5.4e-3±3.6e-3 | 2.0e-3±1.4e-3 | 2.3e-3±8.8e-4 | 2.2e-3±1.6e-3 (2)
NASHPO Protein | 7.4e-3±4.5e-3 | 4.2e-3±2.7e-3 | 2.9e-4±1.1e-3 | -4.7e-4±1.9e-3 | 3.9e-4±2.5e-3 | -1.1e-3±3.5e-4 | -1.1e-3±3.1e-4 | -1.0e-3±5.9e-4 (3)
NASHPO Slice | 2.8e-5±3.6e-5 | 2.9e-6±2.2e-5 | -1.0e-5±2.7e-5 | -3.1e-5±1.6e-5 | -1.9e-5±1.9e-5 | -3.5e-5±1.2e-5 | -4.3e-5±7.8e-6 | -2.3e-5±1.6e-5 (4)
NASHPO Naval | 6.8e-6±6.2e-6 | 2.5e-6±4.3e-6 | -2.5e-6±2.8e-6 | -4.8e-6±3.3e-6 | -6.e-6±2.6e-6 | -6.7e-6±1.0e-6 | -7.e-6±8.6e-7 | -6.e-6±2.3e-6 (4)
NASHPO Parkinsons | -6.6e-4±1.1e-3 | -1.0e-3±1.0e-3 | -3.4e-3±7.4e-4 | -2.3e-3±1.1e-3 | -2.6e-3±8.4e-4 | -2.9e-3±7.5e-4 | -3.2e-3±6.8e-4 | -2.4e-3±8.8e-4 (5)
Avg. rank | 7.46 | 6.54 | 4.42 | 4.35 | 4.73 | 3.16 | 2.96 | 2.39

Table 2: Final mean validation regret ± standard deviation over 50 runs of all algorithms on all benchmarks. The performance score for DEHB is accompanied by its (rank) among the other algorithms where it is not ranked first. The last row shows the average relative rank of each algorithm based on its final performance on each benchmark.

E.2 Varying Cr
The mutation factor F was fixed at 0.5, while the crossover probability p (denoted Cr in Figure 28) was varied over the values 0.1, 0.3, 0.5, 0.7 and 0.9. The studies were carried out on NAS-Bench-101, the OpenML surrogates and the toy Stochastic Counting Ones benchmarks. The results are reported in Figure 28.

The lower the value of p, the less likely are the random mutant's traits to be incorporated into the population. For Counting Ones, we observe that a high p slows down convergence, whereas a low p speeds it up. However, for the other benchmarks, p = 0.5 is consistently the best performer. Hence, we chose p = 0.5 for the design of DEHB.
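The effect of p can be quantified: under binomial crossover with one forced dimension, a child inherits on average a fraction of roughly p + (1-p)/d of its d dimensions from the mutant. A quick empirical check (a sketch, not DEHB code):

```python
import numpy as np

rng = np.random.default_rng(1)

def mutant_fraction(p, d=20, trials=20000):
    """Empirical fraction of dimensions a child inherits from the mutant
    under binomial crossover with probability p, with one dimension always
    forced from the mutant, as in standard DE."""
    mask = rng.random((trials, d)) < p                     # per-dim coin flips
    mask[np.arange(trials), rng.integers(d, size=trials)] = True  # forced dim
    return mask.mean()

for p in (0.1, 0.5, 0.9):
    # empirical estimate vs. the analytic value p + (1 - p) / d
    print(p, round(float(mutant_fraction(p)), 3), p + (1 - p) / 20)
```

So with p = 0.1 and d = 20, only about 15% of a child's dimensions come from the mutant, which is why low p keeps mutant traits out of the population.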


Figure 27: Ablation study for mutation factor F for DEHB, with crossover probability fixed at p = 0.5


Figure 28: Ablation study for crossover probability p for DEHB, with mutation factor fixed at F = 0.5