
Supplementary material for: BOHB: Robust and Efficient Hyperparameter Optimization at Scale

Stefan Falkner¹  Aaron Klein¹  Frank Hutter¹

A. Available Software

To promote reproducible science and enable other researchers to use our method, we provide an open-source implementation of BOHB and Hyperband. It is available under https://github.com/automl/HpBandSter. The benchmarks and our scripts used to produce the data shown in the paper can be found in the icml_2018 branch.

B. Comparison to other Combinations of Bayesian optimization and Hyperband

Here we discuss the differences between our method and the related approaches of Bertrand et al. (2017) and Wang et al. (2018) in more detail. We note that these works are independent and concurrent; our work extends our preliminary short workshop papers at NIPS 2017 (Falkner et al., 2017) and ICLR 2018 (Falkner et al., 2018).

While the general idea of combining Hyperband and Bayesian optimization by Bertrand et al. (2017) is the same as in our work, they use a Gaussian process for modeling the performance. The budget is modeled like any other dimension of the search space, without any special treatment. Based on our experience with Fabolas (Klein et al., 2017), we expect that the squared exponential kernel might not extrapolate well, which would hinder performance. Also, the small evaluation provided by Bertrand et al. (2017) does not allow strong conclusions about the performance of their method.

Wang et al. (2018) also independently combined TPE and Hyperband, but in a slightly different way than we did. In their method, TPE is used as a subroutine in every iteration of Hyperband. In particular, a new model is built from scratch at the beginning of every SuccessiveHalving run (Algorithm 3, line 8 in Wang et al. (2018)). This means that in later iterations of the algorithm, the model does not benefit from any of the evaluations in previous iterations. In contrast, BOHB collects all evaluations on all budgets and uses the largest budget with enough evaluations (admittedly a heuristic, but we would argue a reasonable one) as a base for future evaluations. This way, BOHB aggregates more knowledge into its models for the different budgets as the optimization progresses. We believe this to be a crucial part of the strong performance of our method. Empirically, Wang et al. (2018) did not achieve the consistent and large speedups across a wide range of applications that BOHB achieved in our experiments.

¹ Department of Computer Science, University of Freiburg, Freiburg, Germany. Correspondence to: Stefan Falkner <[email protected]>.

C. Successive Halving

SuccessiveHalving is a simple heuristic to allocate more resources to promising candidates. For completeness, we provide pseudocode for it in Algorithm 1. It is initialized with a set of configurations, a minimum and maximum budget, and a scaling parameter η. In the first stage, all configurations are evaluated on the smallest budget (line 3). The losses are then sorted and only the best 1/η configurations are kept in the set C (line 4). For the following stage, the budget is increased by a factor of η (line 5). This is repeated until the maximum budget for a single configuration is reached (line 2). Within Hyperband, the budgets are chosen such that all SuccessiveHalving executions require a similar total budget.

Algorithm 1: Pseudocode for SuccessiveHalving used by Hyperband as a subroutine.

input: initial budget b_0, maximum budget b_max, set of n configurations C = {c_1, c_2, ..., c_n}

1  b = b_0
2  while b ≤ b_max do
3      L = { f̃(c, b) : c ∈ C }
4      C = top_k(C, L, ⌊|C|/η⌋)
5      b = η · b
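For readers who prefer running code, the following is a minimal Python sketch of Algorithm 1. It is our illustration, not part of the HpBandSter API; `evaluate` stands in for the (noisy) objective f̃(c, b), and we guard against an empty survivor set where the paper's pseudocode relies on the budgets being chosen appropriately.

```python
import numpy as np

def successive_halving(configs, evaluate, b_min, b_max, eta=3.0):
    """Minimal sketch of Algorithm 1; evaluate(c, b) returns f~(c, b)."""
    b = b_min
    while b <= b_max:                                  # line 2
        losses = [evaluate(c, b) for c in configs]     # line 3
        k = max(1, int(len(configs) / eta))            # keep the best 1/eta
        keep = np.argsort(losses)[:k]                  # line 4: top_k
        configs = [configs[i] for i in keep]
        b = eta * b                                    # line 5
    return configs

# Hypothetical usage: a toy objective whose noise shrinks with the budget.
rng = np.random.default_rng(0)
toy = lambda c, b: (c - 0.3) ** 2 + rng.normal(0.0, 1.0 / b)
print(successive_halving(list(np.linspace(0, 1, 27)), toy, b_min=1, b_max=27))
```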


[Figure 1: six log-log panels (adult, higgs, letter, mnist, optdigits, poker) of regret vs. wall clock time [s] for RS, TPE, GP-BO, HB, HB-LCNet, and BOHB.]

Figure 1. Mean performance on the surrogates for all six datasets. As uncertainties, we show the standard error of the mean based on 512 runs (except for GP-BO, which has only 50 runs).

D. Details on the Kernel Density Estimator

We used the MultivariateKDE from the statsmodels package (Seabold & Perktold, 2010), which constructs a factorized kernel, with a one-dimensional kernel for each dimension. Note that using this product of 1-d kernels differs from the original TPE, which uses a pdf that is the product of 1-d pdfs: the factorized kernel sums products of 1-d kernels over the data points, whereas TPE multiplies independently estimated one-dimensional densities. For the continuous parameters a Gaussian kernel is used, whereas the Aitchison-Aitken kernel is the default for categorical parameters. We used Scott's rule for efficient bandwidth estimation, as preliminary experiments with maximum-likelihood based bandwidth selection did not yield better performance but caused a significant overhead.
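As a concrete sketch of this setup (our illustration with hypothetical data; in current statsmodels the class is named KDEMultivariate):

```python
import numpy as np
from statsmodels.nonparametric.kernel_density import KDEMultivariate

rng = np.random.default_rng(0)
# Hypothetical observations: two continuous hyperparameters, one categorical.
X = np.column_stack([
    rng.uniform(-6, -2, size=100),    # e.g., log10 of the learning rate
    rng.uniform(0.0, 0.5, size=100),  # e.g., dropout rate
    rng.integers(0, 3, size=100),     # e.g., choice among 3 optimizers
])

# 'c' = continuous (Gaussian kernel), 'u' = unordered categorical
# (Aitchison-Aitken kernel). bw='normal_reference' is the cheap
# rule-of-thumb bandwidth (Scott's rule); 'cv_ml' would be the
# maximum-likelihood selection that caused significant overhead.
kde = KDEMultivariate(data=X, var_type='ccu', bw='normal_reference')
print(kde.bw)          # one fitted bandwidth per dimension
print(kde.pdf(X[:5]))  # factorized-kernel density at the first 5 points
```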

E. Performance of all methods on all surrogates

Figure 1 shows the performance of all methods we evaluated on all our surrogate benchmarks. Random search is clearly the worst optimizer across all datasets when the budget is large enough for GP-BO and TPE to leverage their model. Hyperband and the two methods based on it (HB-LCNet and BOHB) improve much more quickly due to the smaller budgets used. On all surrogate benchmarks, BOHB starts to outperform HB after the first couple of iterations (sometimes even earlier, e.g., on dataset letter). The same dataset also shows that traditional BO methods can still have an advantage for very large budgets, since in these late stages of the optimization process the low-fidelity evaluations of BOHB can cause a constant overhead without any gain.


[Figure 2: six log-log panels (adult, higgs, letter, mnist, optdigits, poker) of regret vs. wall clock time [s] for BOHB with n = 1, 2, 4, 8, 16, and 32 workers (only n = 1, 16, 32 for poker).]

Figure 2. Mean performance on the surrogates for all six datasets with different numbers of workers n. As uncertainties, we show the standard error of the mean based on 128 runs. Because we simulated them in real time to capture the true behavior, poker is too expensive to evaluate with less than 16 workers within a day.

F. Performance of parallel runs

Figure 2 shows the performance of BOHB when run in parallel on all our surrogate benchmarks. The speed-ups are quite consistent, and almost linear for a small number of workers (2-8). For more workers, more random configurations are evaluated in parallel before the first model is built, which degrades performance. But even for 32 workers, linear speedups are possible (see, e.g., dataset letter, for reaching a regret of 2 × 10⁻³).

We note that in order to carry out this evaluation of parallel performance, we actually simulated the parallel optimization by making each worker wait for the given budget before returning the corresponding performance value of our surrogate benchmark. (The case of one worker is an exception, where we can simply reconstruct the trajectory because all configurations are evaluated serially.) By using this approach in connection with threads, each evaluation of a parallel algorithm still only used 1 CPU, but the run actually ran in real time. For this reason, we decided not to evaluate all possible numbers of workers for dataset poker, for which each run with less than 16 workers would have taken more than a day, and we do not expect any different behavior compared to the other datasets.
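To make the mechanism concrete, here is a rough sketch (our illustration, not the authors' actual code) of the thread-based simulation: each worker sleeps for the requested budget in real time and then returns the surrogate's prediction, so the wall clock behavior of a parallel run is reproduced while each evaluation occupies only one CPU. `surrogate` is a stand-in for the random forest surrogate.

```python
import threading
import time

surrogate = lambda config, budget: 0.1  # placeholder for the RF surrogate

def simulated_worker(config, budget, results):
    time.sleep(budget)  # wait in real time instead of actually training
    results.append((config, surrogate(config, budget)))

results = []
jobs = [("cfg0", 1.0), ("cfg1", 2.0)]  # hypothetical (config, budget) pairs
threads = [threading.Thread(target=simulated_worker, args=(c, b, results))
           for c, b in jobs]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # all results arrive after max(budgets) seconds of wall clock
```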

G. Evaluating the hyperparameters of BOHB

In this section, we evaluate the importance of the individual hyperparameters of BOHB, namely the number of samples used to optimize the acquisition function (Figure 3), the fraction of purely random configurations ρ (Figure 4), the scaling parameter η (Figure 5), and the bandwidth factor used to encourage exploration (Figure 6).

Figure 3. Performance on the surrogates for all six datasets for different numbers of samples.

Additionally, we want to discuss the importance of η, b_min, and b_max, which are already present in HB. The parameter η controls how aggressively SH cuts down the budget and the number of configurations evaluated. Like HB (Li et al., 2017), BOHB is also quite insensitive to this choice in a reasonable range. For our experiments, we use the same default value (η = 3) for HB and BOHB.

More important for the optimization are b_min and b_max, which are problem-specific and inputs to both HB and BOHB. While the maximum budget is often naturally defined, or is constrained by compute resources, the situation for the minimum budget is often different. To get substantial speedups, an evaluation with a budget of b_min should contain some information about the quality of a configuration with larger budgets; for example, when subsampling the data, the smallest subset should not be one datum, but rather enough points to fit a meaningful model. This requires knowledge about the benchmark and the algorithm being optimized.
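To make the geometric budget spacing concrete, the small helper below (our own sketch, not part of HpBandSter) derives the SH budgets from b_min, b_max, and η; the example reproduces the adult budgets from Table 2.

```python
import math

def geometric_budgets(b_min, b_max, eta=3.0):
    """Budgets b_min * eta^k for k = 0..s_max, as iterated over by SH/HB."""
    s_max = int(round(math.log(b_max / b_min, eta)))
    return [b_min * eta ** k for k in range(s_max + 1)]

print(geometric_budgets(9, 243))  # -> [9.0, 27.0, 81.0, 243.0]
```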

H. Counting Ones

We now show results for the counting ones function for different dimensions. Figure 7 shows the mean performance of all applicable methods in d = 8, 16, 32, and 64 dimensions for a budget of 8192 full function evaluations.
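For orientation, here is our hedged reconstruction of the mixed-domain counting ones objective; the exact formulation is given in the main paper, and we assume here that the budget is the number of Bernoulli samples used to estimate each continuous dimension's contribution.

```python
import numpy as np

def counting_ones(x_cat, x_cont, budget, rng=None):
    """Noisy counting ones (our reconstruction): categorical dims x_i in
    {0, 1} count exactly; continuous dims p_j in [0, 1] are estimated from
    `budget` Bernoulli draws, so small budgets yield a noisy value of the
    true objective -(sum x + sum p), minimized at -d when all entries are 1."""
    rng = rng or np.random.default_rng()
    noisy = sum(rng.binomial(budget, p) / budget for p in x_cont)
    return -(sum(x_cat) + noisy)

# Hypothetical 4-dimensional instance (2 categorical + 2 continuous):
print(counting_ones([1, 0], [0.9, 0.5], budget=8))
```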

Figure 4. Performance on the surrogates for all six datasets for different random fractions.

We draw the following conclusions from the results:

1. Despite its simple definition, this problem is quite challenging for the methods we applied to it. RS and HB both suffer from the fact that drawing configurations at random performs quite poorly in this space. The model-based approaches SMAC and TPE performed substantially better, especially with large budgets, although they required more samples than BOHB before converging to the true optimum. However, we would like to mention that SMAC and TPE treated the problem as a blackbox optimization problem; the results for SMAC could likely be improved by treating individual samples as "instances" and using SMAC's intensification mechanism to reject poor configurations based on few samples and evaluate promising configurations with more samples.

2. BOHB struggles in the very high dimensional case. We attribute this to the fact that the noise is substantially higher in this case, such that larger budgets are required to build a good model. Therefore, given a large enough budget, BOHB's evaluations on small budgets lead to a constant overhead over only using the more reliable evaluations on larger budgets. Since the optimization problem is perfectly separable (there are no interaction effects between any dimensions), we also expect TPE's univariate KDE to perform better than BOHB's multivariate one.

Figure 5. Performance on the surrogates for all six datasets for different values of η.

I. Surrogates

I.1. Constructing the Surrogates

To build a surrogate, we sampled 10 000 random configurations for each dataset, trained them for 50 epochs, and recorded their classification error after each epoch, along with their total training time. We fitted two independent random forests that predict these two quantities as a function of the hyperparameter configuration used. This enabled us to predict the classification error as a function of time with sufficient accuracy. As almost all networks converged within the 50 epochs, we extend the curves by the last obtained value if the budget would allow for more epochs.
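A minimal sketch of this construction (our reconstruction with scikit-learn; the authors' actual scripts live in the icml_2018 branch of HpBandSter), assuming a matrix of configurations, a per-epoch error matrix, and a vector of total training times:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Placeholder shapes; the real data comes from the 10 000 sampled runs.
X = np.random.rand(1000, 6)           # hyperparameter configurations
errors = np.random.rand(1000, 50)     # validation error after each epoch
runtimes = np.random.rand(1000) * 50  # total training time in seconds

error_model = RandomForestRegressor().fit(X, errors)  # multi-output RF
time_model = RandomForestRegressor().fit(X, runtimes)

def surrogate(config, budget_seconds):
    """Predicted validation error reachable within a wall clock budget."""
    config = np.asarray(config).reshape(1, -1)
    curve = error_model.predict(config)[0]             # 50-epoch error curve
    seconds_per_epoch = time_model.predict(config)[0] / 50.0
    epochs = int(budget_seconds / seconds_per_epoch)
    epochs = min(max(epochs, 1), 50)  # curve extended past convergence
    return curve[epochs - 1]
```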

The surrogates enable cheap benchmarking, allowing us to run each algorithm 256 times. Since evaluating a configuration with the random forest is inexpensive, we used a global optimizer (differential evolution) to find the true optimum. We allowed the optimizer 10 000 iterations, which should be sufficient to locate it.

Besides these positive aspects of benchmarking with surrogates, there are also some drawbacks that we want to mention explicitly:

(a) There is no guarantee that the surrogate actually reflects the important properties of the true benchmark.

(b) The presented results show the optimized classification error on the validation set used during training. There is no test performance that could indicate overfitting.

(c) Training with stochastic gradient descent is an inherently noisy process, i.e., two evaluations of the same configuration can result in different performances. This is not at all reflected by our surrogates, making them potentially easier to optimize than the true benchmark they are based on.

(d) By fixing the budgets (see below) and having deterministic surrogates, the global minima might be the result of some small fluctuations in the classification error in the surrogates' training data. That means that the surrogate's minimizer might not be the true minimizer of the real benchmark.


Figure 6. Performance on the surrogates for all six datasets for different bandwidth factors.

None of these downsides necessarily have substantial implications for comparing different optimizers; they simply show that the surrogate benchmarks are not perfect models for the real benchmark they mimic. Nevertheless, we believe that, especially for development of novel algorithms, the positive aspects outweigh the negative ones.

I.2. Determining the budgets

To choose the largest budget for training, we looked at the best configuration as predicted by the surrogate and its training time. We chose the closest power of 3 (because we also used η = 3 for HB and BOHB) to achieve that performance. We chose the smallest budget for HB such that most configurations had finished at least one epoch. Table 2 lists the budgets used for all datasets.

Table 1. The hyperparameters and architecture choices for the fully connected networks.

Hyperparameter             Range              Log-transform
batch size                 [2³, 2⁸]           yes
dropout rate               [0, 0.5]           no
initial learning rate      [10⁻⁶, 10⁻²]       yes
exponential decay factor   [−0.185, 0]        no
# hidden layers            {1, 2, 3, 4, 5}    no
# units per layer          [2⁴, 2⁸]           yes

Figure 7. Mean performance of BOHB, HB, TPE, SMAC and RS on the mixed domain counting ones function with different dimensions. As uncertainties, we show the standard error of the mean based on 512 runs.

Table 2. The budgets used by HB and BOHB; random search and TPE only used the last budget.

Dataset   Budgets in seconds for HB and BOHB
Adult     9, 27, 81, 243
Higgs     9, 27, 81, 243
Letter    3, 9, 27, 81
Poker     81, 243, 729, 2187

J. Bayesian Neural Networks

We optimized the hyperparameters described in Table 3 for a Bayesian neural network trained with SGHMC on two UCI regression datasets: Boston Housing and Protein Structure. The budget for this benchmark was the number of steps for the MCMC sampler. We set the minimum budget to 500 steps and the maximum budget to 10 000 steps. After sampling 100 parameter vectors, we computed the log-likelihood on the validation dataset by averaging the predictive mean and variances of the individual models. The performance of all methods for both datasets is shown in Figure 8.
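As we read this protocol, the validation score averages the per-point predictive means and variances over the 100 sampled networks and then evaluates the Gaussian log-likelihood; a hedged sketch (the function name and array shapes are our assumptions):

```python
import numpy as np
from scipy.stats import norm

def validation_log_likelihood(means, variances, y_val):
    """means, variances: (100, n_val) predictions of the individual MCMC
    models. Averages them into a single Gaussian predictive distribution
    per validation point and returns the mean log-likelihood of y_val."""
    mu = means.mean(axis=0)
    var = variances.mean(axis=0)
    return norm.logpdf(y_val, loc=mu, scale=np.sqrt(var)).mean()
```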

K. Reinforcement Learning

Table 4 shows the hyperparameters we optimized for the PPO Cartpole task.

Table 3. The hyperparameters for the Bayesian neural network task.

Hyperparameter    Range           Log-transform
# units layer 1   [2⁴, 2⁹]        yes
# units layer 2   [2⁴, 2⁹]        yes
step length       [10⁻⁶, 10⁻¹]    yes
burn in           [0, 0.8]        no
momentum decay    [0, 1]          no

Table 4. The hyperparameters for the PPO Cartpole task.

Hyperparameter              Range           Log-transform
# units layer 1             [2³, 2⁷]        yes
# units layer 2             [2³, 2⁷]        yes
batch size                  [2³, 2⁸]        yes
learning rate               [10⁻⁷, 10⁻¹]    yes
discount                    [0, 1]          no
likelihood ratio clipping   [0, 1]          no
entropy regularization      [0, 1]          no

[Figure 8: two panels (Boston Housing, Protein) of negative log-likelihood vs. MCMC steps for RS, TPE, HB, and BOHB.]

Figure 8. Mean performance of TPE, RS, HB and BOHB for optimizing the 5 hyperparameters of a Bayesian neural network on two different UCI datasets. As uncertainties, we show the standard error of the mean based on 50 runs.

References

Bertrand, H., Ardon, R., Perrot, M., and Bloch, I. Hyperparameter optimization of deep neural networks: Combining hyperband with Bayesian model selection. In Proceedings of Conférence sur l'Apprentissage Automatique (CAP 2017), 2017.

Falkner, S., Klein, A., and Hutter, F. Combining hyperband and Bayesian optimization. In NIPS 2017 Bayesian Optimization Workshop, December 2017.

Falkner, S., Klein, A., and Hutter, F. Practical hyperparameter optimization for deep learning. In ICLR 2018 Workshop Track, 2018.

Klein, A., Falkner, S., Bartels, S., Hennig, P., and Hutter, F. Fast Bayesian optimization of machine learning hyperparameters on large datasets. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.

Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. Hyperband: Bandit-based configuration evaluation for hyperparameter optimization. In Proceedings of the International Conference on Learning Representations (ICLR'17), 2017. Published online: iclr.cc.

Seabold, S. and Perktold, J. Statsmodels: Econometric and statistical modeling with Python. In 9th Python in Science Conference, 2010.

Wang, J., Xu, J., and Wang, X. Combination of hyperband and Bayesian optimization for hyperparameter optimization in deep learning. arXiv preprint arXiv:1801.01596, 2018.