Explaining Hyperparameter Optimization via Partial Dependence Plots

Julia Moosbauer∗, Julia Herbinger∗, Giuseppe Casalicchio, Marius Lindauer, Bernd Bischl
Department of Statistics, Ludwig-Maximilians-University Munich, Munich, Germany
Institute of Information Processing, Leibniz University Hannover, Hannover, Germany

{julia.moosbauer, julia.herbinger, giuseppe.casalicchio, bernd.bischl}@stat.uni-muenchen.de

[email protected]

Abstract

Automated hyperparameter optimization (HPO) can support practitioners to obtain peak performance in machine learning models. However, there is often a lack of valuable insights into the effects of different hyperparameters on the final model performance. This lack of explainability makes it difficult to trust and understand the automated HPO process and its results. We suggest using interpretable machine learning (IML) to gain insights from the experimental data obtained during HPO with Bayesian optimization (BO). BO tends to focus on promising regions with potential high-performance configurations and thus induces a sampling bias. Hence, many IML techniques, such as the partial dependence plot (PDP), carry the risk of generating biased interpretations. By leveraging the posterior uncertainty of the BO surrogate model, we introduce a variant of the PDP with estimated confidence bands. We propose to partition the hyperparameter space to obtain more confident and reliable PDPs in relevant sub-regions. In an experimental study, we provide quantitative evidence for the increased quality of the PDPs within sub-regions.

1 Introduction

Most machine learning (ML) algorithms are highly configurable. Their hyperparameters must be chosen carefully, as their choice often impacts the model performance. Even for experts, it can be challenging to find well-performing hyperparameter configurations. Automated machine learning (AutoML) systems and methods for automated HPO have been shown to yield considerable efficiency compared to manual tuning by human experts [Snoek et al., 2012]. However, these approaches mainly return a well-performing configuration and leave users without insights into decisions of the optimization process. Questions about the importance of hyperparameters or their effects on the resulting performance often remain unanswered. Not all data scientists trust the outcome of an AutoML system due to the lack of transparency [Drozdal et al., 2020]. Consequently, they might not deploy an AutoML model, despite all performance gains. Providing insights into the search process may help increase trust and facilitate interactive and exploratory processes: A data scientist could monitor the AutoML process and make changes to it (e.g., restricting or expanding the search space) already during optimization to anticipate unintended results.

Transparency, trust, and understanding of the inner workings of an AutoML system can be increased by interpreting the internal surrogate model of an AutoML approach. For example, BO trains a surrogate model to approximate the relationship between hyperparameter configurations and model performance. It is used to guide the optimization process towards the most promising regions of the hyperparameter space. Hence, surrogate models implicitly contain information about the influence of

∗These authors contributed equally to this work.

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

arXiv:2111.04820v2 [cs.LG] 26 Jan 2022


hyperparameters. If the interpretation of the surrogate matches with a data scientist's expectation, confidence in the correct functioning of the system may be increased. If these do not match, it provides an opportunity to look either for bugs in the code or for new theoretical insights.

We propose to analyze surrogate models with methods from IML to provide insights into the results of HPO. In the context of BO, typical choices for surrogate models are flexible, probabilistic black-box models, such as Gaussian processes (GP) or random forests. Interpreting the effect of single hyperparameters on the performance of the model to be tuned is analogous to interpreting the feature effect of the black-box surrogate model. We focus on the PDP [Friedman, 2001], which is a widely-used method2 to visualize the average marginal effect of single features on a black-box model's prediction. When applied to surrogate models, PDPs provide information on how a specific hyperparameter influences the estimated model performance. However, applying PDPs out of the box to surrogate models might lead to misleading conclusions. Efficient optimizers such as BO tend to focus on exploiting promising regions of the hyperparameter space while leaving other regions less explored. Therefore, a sampling bias in input space is introduced, which in turn can lead to a poor fit and biased interpretations in underexplored regions of the space.

Contributions: We study the problem of sampling bias in experimental data produced by AutoML systems and the resulting bias of the surrogate model, and assess its implications on PDPs. We then derive an uncertainty measure for PDPs of probabilistic surrogate models. In addition, we propose a method that splits the hyperparameter space into interpretable sub-regions of varying uncertainty to obtain sub-regions with more reliable and confident PDP estimates. In the context of BO, we provide evidence for the usefulness of our proposed methods on a synthetic function and in an experimental study in which we optimize the architecture and hyperparameters of a deep neural network. Our Supplementary Material provides (A) more background related to uncertainty estimates, (B) notes on how our methods are applied to hierarchical hyperparameter spaces, (C) details on the experimental setup and more detailed results, and (D) a link to the source code.

Reproducibility and Open Science: The implementation of the proposed methods as well as reproducible scripts for the experimental analysis are provided in a public git repository3.

2 Background and Related Work

Recent research has begun to question whether the evaluation of an AutoML system should be purely based on the generated models' predictive performance without considering interpretability [Hutter et al., 2014a, Pfisterer et al., 2019, Freitas, 2019, Xanthopoulos et al., 2020]. Interpreting AutoML systems can be categorized as (1) interpreting the resulting ML model on the underlying dataset, or (2) interpreting the HPO process itself. In this paper, we focus on the latter.

Let c : Λ → ℝ be a black-box cost function, mapping a hyperparameter configuration λ = (λ_1, ..., λ_d) to the model error4 obtained by a learning algorithm with configuration λ. The hyperparameter space may be mixed, containing categorical and continuous hyperparameters. The goal of HPO is to find λ* ∈ argmin_{λ∈Λ} c(λ). Throughout the paper, we assume that a surrogate model ĉ : Λ → ℝ is given as an approximation to c. If the surrogate is assumed to be a GP, ĉ(λ) is a random variable following a Gaussian posterior distribution. In particular, for any finite indexed family of hyperparameter configurations (λ^(1), ..., λ^(k)) ∈ Λ^k, the vector of estimated performance values is Gaussian with posterior mean m = (m(λ^(i)))_{i=1,...,k} and covariance K = (k(λ^(i), λ^(j)))_{i,j=1,...,k}.
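As an illustration (not part of the original paper), the following minimal Python sketch shows how such a probabilistic surrogate and its joint Gaussian posterior over a finite set of configurations can be obtained with scikit-learn; the synthetic objective, the search-space bounds, and all variable names are placeholder assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Placeholder objective c(lambda): a cheap synthetic stand-in (Styblinski-Tang).
def c(lam):
    return 0.5 * np.sum(lam ** 4 - 16 * lam ** 2 + 5 * lam, axis=1)

rng = np.random.default_rng(0)
d = 3
X = rng.uniform(-5, 5, size=(30, d))   # evaluated configurations lambda^(1..k)
y = c(X)                               # observed performance values

# Probabilistic surrogate: a GP with a Matern-3/2 kernel.
gp = GaussianProcessRegressor(kernel=Matern(nu=1.5), normalize_y=True).fit(X, y)

# Joint posterior over a finite family of configurations: mean vector m and covariance K.
X_query = rng.uniform(-5, 5, size=(5, d))
m, K = gp.predict(X_query, return_cov=True)
print(m.shape, K.shape)                # (5,), (5, 5)
```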

Hyperparameter Importance. Understanding which hyperparameters influence model performance can provide valuable insights into the tuning strategy [Probst et al., 2019]. To quantify the relevance of hyperparameters, models that inherently quantify feature relevance – e.g., GPs with an ARD kernel [Neal, 1996] – can be used as surrogate models. Hutter et al. [2014a] quantified the importance of hyperparameters based on a random forest fitted on data generated by BO, for which the importance of both the main and the interaction effects of hyperparameters was calculated by a functional ANOVA approach. Similarly, Sharma et al. [2019] quantified the hyperparameter importance of

2There exist various implementations [Greenwell, 2017, Pedregosa et al., 2011], extensions [Greenwell et al., 2018, Goldstein et al., 2015], and applications [Friedman and Meulman, 2003, Cutler et al., 2007].

3https://github.com/slds-lmu/paper_2021_xautoml
4Typically, the model error is estimated via cross-validation or hold-out testing.


residual neural networks. These works highlight how useful it is to quantify the importance of hyperparameters. However, importance scores do not show how a specific hyperparameter affects the model performance according to the surrogate model. Therefore, we propose to visualize the assumed marginal effect of a hyperparameter. A model-agnostic interpretation method that can be used for this purpose is the PDP.

PDPs for Hyperparameters. Let S ⊂ {1, 2, ..., d} denote an index set of features, and let C = {1, 2, ..., d} \ S be its complement. The partial dependence (PD) function [Friedman, 2001] of c : Λ → ℝ for hyperparameter(s) S is defined as5

$$c_S(\lambda_S) := \mathbb{E}_{\lambda_C}\left[c(\lambda)\right] = \int_{\Lambda_C} c(\lambda_S, \lambda_C)\, d\mathbb{P}(\lambda_C). \tag{1}$$

When analyzing the PDP of hyperparameters, we are usually interested in how their values λ_S impact model performance uniformly across the hyperparameter space. In line with prior work [Hutter et al., 2014a], we therefore assume P to be the uniform distribution over Λ_C. Computing c_S(λ_S) exactly is usually not possible because c is unknown and expensive to evaluate in the context of HPO. Thus, the posterior mean m of the probabilistic surrogate model ĉ(λ) is commonly used as a proxy for c. Furthermore, the integral may not be analytically tractable for arbitrary surrogate models ĉ. Hence, the integral is approximated by Monte Carlo integration, i.e.,

$$\hat{c}_S(\lambda_S) = \frac{1}{n} \sum_{i=1}^{n} m\left(\lambda_S, \lambda_C^{(i)}\right) \tag{2}$$

for a sample (λ_C^(i))_{i=1,...,n} ∼ P(λ_C). Here, m(λ_S, λ_C^(i)) represents the marginal effect of λ_S for one specific instance i. Individual conditional expectation (ICE) curves [Goldstein et al., 2015] visualize the marginal effect of the i-th observation by plotting the value of m(λ_S, λ_C^(i)) against λ_S for a set of grid points6 λ_S^(g) ∈ Λ_S, g ∈ {1, ..., G}. Analogously, the PDP visualizes ĉ_S(λ_S) against the grid points. Following from Eq. (2), the PDP visualizes the average over all ICE curves. In HPO, the marginal predicted performance is a related concept. Instead of approximating the integral via Monte Carlo, the integral over ĉ is computed exactly. Hutter et al. [2014a] propose an efficient approach to compute this integral for random forest surrogate models.
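To make Eq. (2) and the ICE curves concrete, here is a hedged Python sketch (an illustration, not the authors' implementation): it overwrites the column of the hyperparameter of interest with each grid value, queries the surrogate's posterior mean, and averages the resulting ICE curves. The names `predict_mean`, `grid_s`, and `samples_c` are assumed placeholders.

```python
import numpy as np

def ice_and_pdp(predict_mean, grid_s, samples_c, s_idx):
    """ICE curves and Monte Carlo PDP (Eq. 2) for the hyperparameter at index s_idx.

    predict_mean : callable mapping an (n, d) array of configurations to posterior means
    grid_s       : (G,) grid of values for the hyperparameter of interest
    samples_c    : (n, d) Monte Carlo sample; column s_idx is overwritten by each grid value
    """
    n = samples_c.shape[0]
    ice = np.empty((n, len(grid_s)))
    for g, val in enumerate(grid_s):
        lam = samples_c.copy()
        lam[:, s_idx] = val                 # configurations (lambda_S^(g), lambda_C^(i))
        ice[:, g] = predict_mean(lam)       # m(lambda_S, lambda_C^(i)) for each i
    pdp = ice.mean(axis=0)                  # average of ICE curves = PD estimate
    return ice, pdp

# Hypothetical usage with the GP surrogate sketched earlier:
# grid = np.linspace(-5, 5, 20)
# samples = np.random.default_rng(1).uniform(-5, 5, size=(1000, d))
# ice, pdp = ice_and_pdp(lambda Z: gp.predict(Z), grid, samples, s_idx=0)
```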

Uncertainty Quantification in PDPs. Quantifying the uncertainty of PDPs provides additional information about the reliability of the mean estimator. Hutter et al. [2014a] quantified the model uncertainty specifically for random forests as surrogates in BO by calculating the standard deviation of the marginal predictions of the individual trees. However, this procedure is not applicable to general probabilistic surrogate models, such as the commonly used GP. There are approaches that quantify the uncertainty for ML models that do not provide uncertainty estimates out-of-the-box. Cafri and Bailey [2016] suggested a bootstrap approach for tree ensembles to quantify the uncertainties of effects based on PDPs. Another approach to quantify the uncertainty of PDPs is to leverage the ICE curves. For example, Greenwell [2017] implemented a method that marginalizes over the mean ± standard deviation of the ICE curves for each grid point. However, this approach quantifies the underlying uncertainty of the data at hand rather than the model uncertainty, as explained in Appendix A.1. A model-agnostic estimate based on uncertainty estimates for probabilistic models is missing so far.

Subgroup PDPs. Recently, a new research direction has concentrated on finding more reliable PDP estimates within subgroups of observations. Molnar et al. [2020] focused on problems in PDP estimation with correlated features. To that end, they apply transformation trees to find homogeneous subgroups and then visualize a PDP for each subgroup. Grömping [2020] looked at the same problem and also uses subgroup PDPs, where ICE curves are grouped regarding a correlated feature. Britton [2019] applied a clustering approach to group ICE curves in order to find interactions between features. However, none of these approaches aim at finding subgroups in which the PDP can be estimated reliably with low uncertainty. Additionally, to the best of our knowledge, nothing similar exists for analyzing experimental data created by HPO.

5To keep notation simple, we denote c(λ) as a function of two arguments (λ_S, λ_C) to differentiate components in the index set S from those in the complement. The integral is to be understood as a multiple integral of c in which the λ_j, j ∈ C, are integrated out.

6Grid points are typically chosen as an equidistant grid or sampled from P(λ_S). The granularity G is chosen by the user. For categorical features, the granularity typically corresponds to the number of categories.


Figure 1: Illustration of the sampling bias when optimizing the 2D Styblinski-Tang function with BO and the Lower Confidence Bound (LCB) acquisition function a(λ) = m(λ) + τ · s(λ) for τ = 0.1 (left) and τ = 2 (middle) vs. data sampled uniformly at random (right).

Figure 2: The two horizontal cuts (left) yield two ICE curves (right) showing the mean prediction and uncertainty band against λ_1 for ĉ with τ = 0.1 on the 2D Styblinski-Tang function. The upper ICE curve deviates more from the true effect (black) and shows a higher uncertainty.

3 Biased Sampling in HPO

Visualizing the marginal effect of hyperparameters of surrogate models via PDPs can be misleading. We show that this problem is due to the sequential nature of BO, which generates dependent instances (i.e., hyperparameter configurations) and thereby introduces a sampling bias and a resulting model bias. To save computational resources in contrast to grid search or random search, efficient optimizers like BO tend to exploit promising regions of the hyperparameter space while other regions are less explored (see Figure 1). Consequently, predictions of surrogate models are usually more accurate with less uncertainty in well-explored regions and less accurate with high uncertainty in under-explored regions. This model bias also affects the PD estimate (see Figure 2). ICE curves may be biased and less confident if they are computed in poorly-learned regions where the model has not seen much data before. Under the assumption of uniformly distributed hyperparameters, poorly-learned regions are incorporated in the PD estimate with the same weight as well-learned regions. ICE curves belonging to regions with high uncertainty may obfuscate well-learned effects of ICE curves belonging to other regions when they are aggregated to a PDP. Hence, the model bias may also lead to a less reliable PD estimate. PDPs visualizing only the mean estimator of Eq. (2) do not provide insights into the reliability of the PD estimate and how it is affected by the described model bias.

4 Quantifying Uncertainty in PDPs

Figure 3: PDPs (blue) with confidence bands for surrogates trained on data created by BO and LCB with τ = 0.1 (left), τ = 1 (middle), and a uniform i.i.d. dataset (right) vs. the true PD (black).

Pointwise uncertainty estimates of a probabilistic model provide insights into the reliability of the prediction ĉ(λ) for a specific configuration λ. This uncertainty directly correlates with how explored the region around λ is. Hence, including the model's uncertainty structure into the PD estimate enables users to understand in which regions the PDP is more reliable and which parts of the PDP must be cautiously interpreted.7 We now extend the PDP of Eq. (2) to probabilistic surrogate models ĉ (e.g., a GP). Let λ_S be a fixed grid point and (λ_C^(i))_{i=1,...,n} ∼ P(λ_C) a sample that is used to compute the Monte Carlo estimate of Eq. (2). The vector of predicted performances at the grid point λ_S is ĉ(λ_S) = (ĉ(λ_S, λ_C^(i)))_{i=1,...,n} with (posterior) mean m(λ_S) := (m(λ_S, λ_C^(i)))_{i=1,...,n} and

7Note that we aim at representing model uncertainty in a PD estimate, and not the variability of the mean prediction (see Appendix A.1 for a more detailed justification).


a (posterior) covariance K(λ_S) := (k((λ_S, λ_C^(i)), (λ_S, λ_C^(j))))_{i,j=1,...,n}. Thus, ĉ_S(λ_S) = (1/n) ∑_{i=1}^{n} ĉ(λ_S, λ_C^(i)) is a random variable itself. The expected value of ĉ_S(λ_S) corresponds to the PD of the posterior mean function m at λ_S, i.e.:

$$\hat{m}_S(\lambda_S) = \mathbb{E}_{\hat{c}}\left[\hat{c}_S(\lambda_S)\right] = \mathbb{E}_{\hat{c}}\left[\frac{1}{n}\sum_{i=1}^{n} \hat{c}\left(\lambda_S, \lambda_C^{(i)}\right)\right] = \frac{1}{n}\sum_{i=1}^{n} m\left(\lambda_S, \lambda_C^{(i)}\right). \tag{3}$$

The variance of ĉ_S(λ_S) is

$$\hat{s}^2_S(\lambda_S) = \mathbb{V}_{\hat{c}}\left[\hat{c}_S(\lambda_S)\right] = \mathbb{V}_{\hat{c}}\left[\frac{1}{n}\sum_{i=1}^{n} \hat{c}\left(\lambda_S, \lambda_C^{(i)}\right)\right] = \frac{1}{n^2}\, \mathbf{1}^\top K(\lambda_S)\, \mathbf{1}. \tag{4}$$

For the above estimate, it is important that the kernel is correctly specified such that the covariance structure is modeled properly by the surrogate model. Eq. (4) can be approximated empirically by treating the pairwise covariances as unknown, i.e.:

$$\hat{s}^2_S(\lambda_S) \approx \frac{1}{n} \sum_{i=1}^{n} K(\lambda_S)_{i,i}. \tag{5}$$

In Appendix A.2, we show empirically that this approximation is less sensitive to kernel misspecifications. Please note that the variance estimate and the mean estimate can also be applied to other probabilistic models, such as GAMLSS8, transformation trees, or a random forest. An example of PDPs with uncertainty estimates is shown in Figure 3 for different degrees of sampling bias.
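The following sketch, again only an illustration under the assumption of a scikit-learn GP surrogate named `gp`, computes the PD mean of Eq. (3) together with either the full-covariance variance of Eq. (4) or the diagonal approximation of Eq. (5).

```python
import numpy as np

def pdp_with_uncertainty(gp, grid_s, samples_c, s_idx, full_cov=True):
    """PDP mean (Eq. 3) and variance (Eq. 4, or the approximation of Eq. 5)."""
    n = samples_c.shape[0]
    mean_s, var_s = [], []
    for val in grid_s:
        lam = samples_c.copy()
        lam[:, s_idx] = val
        if full_cov:
            m, K = gp.predict(lam, return_cov=True)
            var = np.ones(n) @ K @ np.ones(n) / n ** 2   # Eq. (4): 1' K 1 / n^2
        else:
            m, sd = gp.predict(lam, return_std=True)
            var = np.mean(sd ** 2)                       # Eq. (5): mean of the variances
        mean_s.append(m.mean())                          # Eq. (3)
        var_s.append(var)
    return np.array(mean_s), np.array(var_s)

# 95% confidence band (hypothetical usage):
# m_hat, s2_hat = pdp_with_uncertainty(gp, grid, samples, s_idx=0, full_cov=False)
# lower, upper = m_hat - 1.96 * np.sqrt(s2_hat), m_hat + 1.96 * np.sqrt(s2_hat)
```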

5 Regional PDPs via Confidence Splitting

As discussed in Section 3, (efficient) optimization may imply that the sampling is biased, which in turn can produce misleading interpretations when IML is naively applied. We now aim to identify sub-regions Λ′ ⊂ Λ of the hyperparameter space in which the PD can be estimated with high confidence, and separate those from sub-regions in which it cannot be estimated reliably. In particular, we identify sub-regions in which poorly-learned effects do not obfuscate the well-learned effects along each grid point, thereby allowing the user to draw conclusions with higher confidence. By partitioning the entire hyperparameter space through a tree-based approach into disjoint and interpretable sub-regions, a more detailed understanding of the sampling process and hyperparameter effects is achieved. Users can either study the hyperparameter effect of a (confident) sub-region individually or understand the exploration-exploitation sampling of HPO by considering the complete tree structure. The result of this procedure for a single split is shown in Figure 5.

The PD estimate on the entire hyperparameter space Λ is computed by sampling the Monte Carlo estimate with (λ_C^(i))_{i∈N} ∼ P(λ_C), N := {1, 2, ..., n}. We now introduce the PD estimate on a sub-region Λ′ ⊂ Λ simply as the estimate based on (λ_C^(i))_{i∈N′}, using only N′ = {i ∈ N : λ^(i) ∈ Λ′}. Since we are interested in the marginal effect of the hyperparameter(s) S at each λ_S ∈ Λ_S, we will usually visualize the PD for the whole range Λ_S. Thus, all obtained sub-regions should be of the form Λ′ = Λ_S × Λ′_C with Λ′_C ⊂ Λ_C. This corresponds to an average of the ICE curves with i ∈ N′. The pseudo-code to partition a hyperparameter (sub-)space Λ and a corresponding sample (λ_C^(i))_{i∈N} ∈ Λ_C, N ⊆ {1, ..., n}, into two child regions is shown in Algorithm 1. This splitting is recursively applied in a CART9-like procedure [Breiman et al., 1984b] to expand a full tree structure, with the usual stopping criteria (e.g., a maximum number of splits, a minimum size of a region, or a minimum improvement in each node). In each leaf node, the sub-regional PDP and its corresponding uncertainty estimate are computed by aggregating over all contained ICE curves.

The criterion to evaluate a specific partitioning is based on the idea of grouping ICE curves with similar uncertainty structure. To be more exact, we evaluate the impurity of a PD estimate on a sub-region Λ′ with the help of the associated set of observations N′ = {i ∈ N : λ_C^(i) ∈ Λ′_C}, also referred to as nodes, as follows: For each grid point λ_S, we use the L2 loss in L(λ_S, N′) to evaluate how the

8Generalized additive models for location, scale and shape
9Classification and regression trees


Figure 4: ICE curves of ŝ of λ_S for the left (green) and right (blue) sub-region after the first split. The darker lines represent the respective PDPs. The orange vertical line marks the value λ̂_S of the optimal configuration.

[Tree diagram belonging to Figure 5: the root node N is split into child nodes N_l and N_r by the condition λ_j < 6.9 vs. λ_j ≥ 6.9.]

Figure 5: Example of two estimated PDPs (blue line) and 95% confidence bands after one partitioning step. The orange vertical line is the value of λ̂_S from the optimal configuration, the black curve is the true PD estimate c_S(λ_S).

uncertainty varies across all ICE estimates i ∈ N′ using ŝ²_{S|N′}(λ_S) := (1/|N′|) ∑_{i∈N′} ŝ²(λ_S, λ_C^(i)), and aggregate the loss L(λ_S, N′) over all grid points in R_{L2}(N′):

$$\mathcal{L}(\lambda_S, \mathcal{N}') = \sum_{i \in \mathcal{N}'} \left( \hat{s}^2\left(\lambda_S, \lambda_C^{(i)}\right) - \hat{s}^2_{S|\mathcal{N}'}(\lambda_S) \right)^2 \quad \text{and} \quad \mathcal{R}_{L2}(\mathcal{N}') = \sum_{g=1}^{G} \mathcal{L}\left(\lambda_S^{(g)}, \mathcal{N}'\right). \tag{6}$$

Algorithm 1: Tree-based Partitioning
input: N
for j ∈ C do
    for every split t on hyperparameter λ_j do
        N_l^{j,t} = {i ∈ N : λ_j^(i) ≤ t}
        N_r^{j,t} = {i ∈ N : λ_j^(i) > t}
        I(j, t) = R_{L2}(N_l^{j,t}) + R_{L2}(N_r^{j,t})
    end for
end for
Choose (j*, t*) ∈ argmin_{j,t} I(j, t)
Return N_l^{j,t} and N_r^{j,t} for (j, t) = (j*, t*)
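A minimal Python sketch of one partitioning step of Algorithm 1 with the impurity of Eq. (6) could look as follows; it assumes the ICE curves of the variance function have been precomputed as a matrix, and all names are hypothetical.

```python
import numpy as np

def impurity(var_ice):
    """R_L2 of Eq. (6): sum over grid points of the squared deviations of the ICE
    variance curves from their pointwise mean within the node."""
    center = var_ice.mean(axis=0)          # \hat{s}^2_{S|N'}(lambda_S^(g)) per grid point
    return np.sum((var_ice - center) ** 2)

def best_split(lambda_c, var_ice):
    """One step of Algorithm 1: search over all (hyperparameter, threshold) pairs.

    lambda_c : (n, p) values of the hyperparameters in C for the Monte Carlo sample
    var_ice  : (n, G) ICE curves of the variance function s^2(lambda_S^(g), lambda_C^(i))
    """
    best = (None, None, np.inf)
    for j in range(lambda_c.shape[1]):                 # candidate hyperparameter in C
        for t in np.unique(lambda_c[:, j])[:-1]:       # candidate thresholds
            left = lambda_c[:, j] <= t
            loss = impurity(var_ice[left]) + impurity(var_ice[~left])
            if loss < best[2]:
                best = (j, t, loss)
    return best   # (j*, t*, I(j*, t*)); apply recursively to the child nodes
```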

Hence, we measure the pointwise L2-distance between the ICE curves of the variance function ŝ²(λ_S, λ_C^(i)) and their PD estimate ŝ²_{S|N′}(λ_S) within a sub-region N′. This seems reasonable, as ICE curves in well-explored regions of the search space should, on average, have a lower uncertainty than those in less-explored regions. However, since we only split according to hyperparameters in C but not in S, the partitioning does not cut off less explored regions w.r.t. λ_S. Thus, the chosen split criterion groups ICE curves of the uncertainty estimate such that we receive sub-regions associated with low costs c (and thus high relevance for a user) to be more confident in well-explored regions of λ_S and less confident in under-explored regions. Figure 4 shows that ICE curves of the uncertainty measure with high uncertainty over the entire range of λ_S are grouped together (right sub-region). Those with low uncertainty close to the optimal configuration of λ_S and increasing uncertainties for less suitable configurations are grouped together by curve similarities in the left sub-region. The respective PDPs are illustrated in Figure 5, where the confidence band in the left sub-region decreased compared to the confidence band of the global PDP, especially for grid points close to the optimal value of λ_S. Hence, by grouping observations with similar ICE curves of the variance function, the resulting sub-regional PDPs with confidence bands provide the user with the information of which sub-regions of Λ_C are well-explored and lead to more reliable PDP estimates. Furthermore, the user will know which ranges of λ_S can be interpreted reliably and which ones need to be regarded with caution.

To sum up, the splitting procedure provides interpretable, disjoint sub-regions of the hyperparameter space. Based on the defined impurity measure, PDPs with high reliability can be identified and analyzed. In particular, the method provides more confident and reliable estimates in the sub-region containing the optimal configuration. Which PDPs are most interesting to explore depends on the question the user would like to answer. If the main interest lies in understanding the optimization and exploring the sampling process, a user might want to keep the number of sub-regions relatively low by performing only a few partitioning steps. Subsequently, one would investigate the overall structure of the sub-regions and the individual sub-regional PDPs. If users are more interested in interpreting

6

Page 7: Explaining Hyperparameter Optimization via Partial ...

hyperparameter effects only in the most relevant sub-region(s), they may want to split deeper and only look at sub-regions that are more confident than the global PDP.

Due to the nature of the splitting procedure, the PDP estimate on the entire hyperparameter space is a weighted average of the respective sub-regional PDPs. Hence, the global PDP estimate is decomposed into several sub-regional PDP estimates. Furthermore, note that the proposed method does not assume a numeric hyperparameter space, since the uncertainty estimates as well as the ICE and PDP estimates can also be calculated for categorical features. Thus, it is applicable to problems with mixed spaces as long as a probabilistic surrogate model (and particularly its uncertainty estimates) is available. In Appendix B, we describe how our method is applied to hierarchical hyperparameter spaces.

Since the proposed method is an instance of the CART algorithm, finding the optimal split for a categorical variable with q levels generally involves checking 2^q subsets. This becomes computationally infeasible for large values of q. It remains an open question for future work whether this can be sped up by an optimal procedure, as in regression with L2 loss [Fisher, 1958] and binary classification [Breiman et al., 1984a], or by a clever heuristic, as for multiclass classification [Wright and König, 2019].

6 Experimental Analysis

In this section, we validate the effectiveness of the introduced methods. We formulate two main hypotheses: First, experimental data affected by the sampling bias lead to biased surrogate models and thus to unreliable and misleading PDPs. Second, the proposed partitioning allows us to identify an interpretable sub-region (around the optimal configuration) that yields a more reliable and confident PDP estimate. In a first experiment, we apply our methods to BO runs on a synthetic function. In this controlled setup, we investigate the validity of our hypotheses with regard to problems of different dimensionality and different degrees of sampling bias. In a second experiment, we evaluate our PDP partitioning in the context of HPO for neural networks on a variety of tabular datasets.

We assess the sampling bias of the optimization design points by comparing their empirical distribution to a uniform distribution via the Maximum Mean Discrepancy (MMD) [Gretton et al., 2012, Molnar et al., 2020], which is covered in more detail in Appendix C.1. We measure the reliability of a PDP, i.e., the degree to which a user can rely on the PD estimate, by comparing it to the true PD c_S(λ_S) as defined in Eq. (1). More specifically, for every grid point λ_S^(g), we compute the negative log-likelihood (NLL) of c_S(λ_S^(g)) under the distribution of ĉ_S(λ_S^(g)). The confidence of a PDP is illustrated by the width of its confidence bands m̂_S(λ_S) ± q_{1−α/2} · ŝ_S(λ_S), with q_{1−α/2} denoting the (1 − α/2)-quantile of the standard normal distribution. We measure the confidence by assessing ŝ_S(λ_S) pointwise for every grid point. In particular, we consider the mean confidence (MC) across all grid points, (1/G) ∑_{g=1}^{G} ŝ_S(λ_S^(g)), as well as the confidence at the grid point closest to λ̂_S, abbreviated by OC, with λ̂ being the best configuration evaluated by the optimizer. To evaluate the performance of the confidence splitting, we report the above metrics on the sub-region that contains the best configuration evaluated by the optimizer, assuming that this region is of particular interest for a user of HPO. PDPs are computed with regard to single features for G = 20 equidistant grid points and n = 1000 Monte Carlo samples.
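For clarity, the three evaluation metrics described above could be computed as in the following hedged sketch (an illustration of the definitions, not the released evaluation code).

```python
import numpy as np

def nll(true_pd, m_hat, s_hat):
    """Mean pointwise Gaussian negative log-likelihood of the true PD values
    under N(m_hat, s_hat^2), averaged over all grid points."""
    return np.mean(0.5 * np.log(2 * np.pi * s_hat ** 2)
                   + (true_pd - m_hat) ** 2 / (2 * s_hat ** 2))

def mean_confidence(s_hat):
    """MC: mean of the estimated standard deviation over all grid points."""
    return np.mean(s_hat)

def opt_confidence(grid, s_hat, lambda_s_best):
    """OC: estimated standard deviation at the grid point closest to the
    optimizer's best value of lambda_S."""
    return s_hat[np.argmin(np.abs(grid - lambda_s_best))]
```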

6.1 BO on a Synthetic Function

We consider the d-dimensional Styblinski-Tang function c : [−5, 5]^d → ℝ, λ ↦ (1/2) ∑_{i=1}^{d} (λ_i^4 − 16 λ_i^2 + 5 λ_i) for d ∈ {3, 5, 8}. Since the PD is the same for each dimension i, we only present the effects of λ_1. We performed BO with a GP surrogate model with a Matérn-3/2 kernel and the LCB acquisition function a(λ) = m(λ) + τ · s(λ) with different values τ ∈ {0.1, 1, 5} to control the sampling bias. We compute the global PDP with confidence bands estimated according to Eq. (5) for the GP surrogate model ĉ that was fitted in the last iteration of BO. We ran Algorithm 1 and computed the PDP in the sub-region containing the optimal configuration. All computations were repeated 30 times. Further details on the setup are given in Appendix C.2.1.
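The synthetic setup can be mimicked with a short Python sketch. Note that this is an illustration only: the candidate-set optimization of the acquisition function, the budget, and the use of the standard lower-confidence-bound form for minimization are simplifying assumptions, not the paper's exact (R-based) implementation.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def styblinski_tang(lam):
    return 0.5 * np.sum(lam ** 4 - 16 * lam ** 2 + 5 * lam, axis=1)

def bo_lcb(objective, d, tau, n_init=10, n_iter=50, seed=0):
    """Minimal BO loop: fit a GP surrogate and pick the candidate minimizing the
    lower confidence bound; a small tau exploits aggressively (strong sampling bias)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-5, 5, size=(n_init, d))
    y = objective(X)
    for _ in range(n_iter):
        gp = GaussianProcessRegressor(kernel=Matern(nu=1.5), normalize_y=True).fit(X, y)
        cand = rng.uniform(-5, 5, size=(2000, d))
        m, s = gp.predict(cand, return_std=True)
        x_next = cand[np.argmin(m - tau * s)]
        X = np.vstack([X, x_next])
        y = np.append(y, objective(x_next[None, :]))
    return X, y, gp

# e.g. X, y, gp = bo_lcb(styblinski_tang, d=3, tau=0.1)
```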


Figure 6: The figure presents the MC (left) and the NLL (right) for d ∈ {3, 5, 8} for a high (τ = 0.1), medium (τ = 1), and low (τ = 5) sampling bias across 30 replications. With a lower sampling bias, we obtain narrower confidence bands and a lower NLL.

Table 1: The table shows the relative improvement of the MC and the NLL via Algorithm 1 with 1 and 3 splits, compared to the global PDP, along with the sampling bias for τ = 0.1 (high), τ = 2 (medium), and τ = 5 (low). Results are averaged across 30 replications.

                          δ MC (%)              δ NLL (%)
 d   MMD              nsp = 1   nsp = 3     nsp = 1   nsp = 3
 3   low (0.18)          7.65     13.64        5.89     10.92
 3   medium (0.51)      12.86     36.92        4.78      7.70
 3   high (0.56)        16.52     34.84        2.77     -1.62
 5   low (0.15)          6.63     15.45        2.82      6.05
 5   medium (0.45)      19.67     37.28        4.05      7.80
 5   high (0.53)        11.99     33.06       -3.86     -1.93
 8   low (0.11)          3.58      9.67        0.84      2.40
 8   medium (0.42)       8.86     23.03        1.51      3.30
 8   high (0.56)         6.59     19.84        1.53      4.29

As presented in Figure 6, the PDPs for surrogate models trained on less biased data (measured by the MMD) yield lower values of the NLL, as well as lower values for the MC. Table 1 shows that a single tree-based split reduces the MC by up to almost 20%, and up to 37% when performing 3 partitioning steps. Additionally, the NLL improves with an increasing number of partitioning steps in most cases. The results on the synthetic functions support our second hypothesis that the tree-based partitioning improves the reliability in terms of the NLL and the confidence of the PD estimates. The improvement of the MC is higher for a medium to high sampling bias, compared to scenarios that are less affected by sampling bias. We observe that (particularly for high sampling bias) there are some outlier cases in which the NLL worsens. More detailed results are shown in Appendix C.3.1.

6.2 HPO on Deep Learning

In a second experiment, we investigate HPO in the context of a surrogate benchmark [Eggensperger et al., 2015] based on the LCBench data [Zimmer et al., 2021]. For each of the 35 different OpenML [Vanschoren et al., 2013] classification tasks, LCBench provides access to evaluations of a deep neural network on 2000 configurations randomly drawn from the configuration space defined by Auto-PyTorch Tabular (see Table 5 in Appendix C.2). For each task, we trained a random forest as an empirical performance model that predicts the balanced validation error of the neural network for a given configuration. These empirical performance models serve as cheap-to-evaluate objective functions, which efficiently approximate the result of the real-world experiment of running a deep learning configuration on an LCBench instance. BO then acts on this empirical performance model as its objective10.
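A hedged sketch of such an empirical performance model is shown below; the file name, column names, and forest size are hypothetical stand-ins for the LCBench/Auto-PyTorch data described above, not the actual benchmark interface.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical file layout: one row per evaluated configuration of a task, with the
# Auto-PyTorch hyperparameters as columns and the balanced validation error as target.
hp_cols = ["batch_size", "learning_rate", "max_dropout", "max_units",
           "momentum", "num_layers", "weight_decay"]
df = pd.read_csv("lcbench_task.csv")   # placeholder path, not the real LCBench API

# Empirical performance model: a random forest mapping configurations to error.
epm = RandomForestRegressor(n_estimators=500, random_state=0)
epm.fit(df[hp_cols], df["balanced_val_error"])

# The cheap benchmark objective evaluated by BO is then simply the model's prediction.
def objective(lam_df):
    return epm.predict(lam_df[hp_cols])
```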

For each task, we ran BO to obtain the optimal architecture and hyperparameter configuration. Again, we used a GP with a Matérn-3/2 kernel and LCB with τ = 1. Each BO run was allotted a budget of 200 objective function evaluations. We computed the PDPs and their confidences, which are estimated according to Eq. (5), based on the surrogate model ĉ after the final iteration. We performed tree-based partitioning with up to 6 splits based on a uniformly distributed dataset of size n = 1000. All computations were statistically repeated 30 times. Further details are provided in Appendix C.2.2.

For the real-world data example, we focus on answering the second hypothesis, i.e., whether the tree-based Algorithm 1 improves the reliability of the PD estimates. We compare the PDP in sub-regions after 6 splits with the global PDP. We computed the relative improvement of the confidence (MC and OC) and the NLL of the sub-regional PDP compared to the respective estimates for the global PDP. As shown in Table 2, the MC of the PDPs is on average reduced by 30% to 52%, depending on the hyperparameter. At the optimal configuration λ̂_S, the improvement even increases to 50% to 62%. Thus, PDP estimates for all hyperparameters are on average (independent of the underlying dataset) clearly more confident in the relevant sub-regions when compared to the global PD estimates, especially around the optimal configuration λ̂_S. In addition to the MC, the NLL simultaneously improves. In Appendix C.3.2, we provide details regarding the evaluated metrics on the level of the dataset and demonstrate that our split criterion outperforms other impurity measures regarding MC

10Please note that the random forest is only used as a surrogate in order to construct an efficient benchmark objective, and not as a surrogate in the BO algorithm, where we use a GP.


and OC. Furthermore, we emphasize in Appendix C.3.2 the significance of our results by providing a comparison to a naive baseline method.

Figure 7: PDP (blue) and confidence band (grey) of the GP for the hyperparameter max. number of units (batch size) on the left (right) side. The black line shows the PDP of the meta surrogate model representing the true PDP estimate. The orange vertical line marks the optimal configuration λ̂_S. The relative improvements from the global PDP to the sub-regional PDP after 6 splits are, for max. number of units (batch size): δ MC = 61.6% (28.4%), δ OC = 63.5% (62.2%), δ NLL = 48.6% (30.1%).

Table 2: Relative improvement of MC, OC, and NLL on the hyperparameter level. The table shows the respective mean (standard deviation) of the average relative improvement over 30 replications for each dataset and 6 splits.

Hyperparameter       δ MC (%)      δ OC (%)      δ NLL (%)
Batch size           40.8 (14.9)   61.9 (13.5)   19.8 (19.5)
Learning rate        50.2 (13.7)   57.6 (14.4)   17.9 (20.5)
Max. dropout         49.7 (15.4)   62.4 (11.9)   17.4 (18.2)
Max. units           51.1 (15.2)   58.6 (12.7)   24.6 (22.0)
Momentum             51.7 (14.5)   58.3 (12.7)   19.7 (21.7)
Number of layers     30.6 (16.4)   50.9 (16.6)   13.8 (32.5)
Weight decay         36.3 (22.6)   61.0 (13.1)   11.9 (19.7)

To further study our suggested method, we now highlight a few individual experiments. We chose one iteration of the shuttle dataset. On the two left plots of Figure 7, we see that the true PDP estimate for max. number of units is decreasing, while the globally estimated PDP trend is increasing and thus misleading. Although the confidence band already indicates that the PDP cannot be reliably interpreted on the entire hyperparameter space, it remains challenging to draw any conclusions from it. After performing 6 splits, we receive a confident and reliable PD estimate on an interpretable sub-region. The same plots are depicted for the hyperparameter batch size on the right part of Figure 7. This example illustrates that the confidence band might not always shrink uniformly over the entire range of λ_S during the partitioning, but often particularly around the optimal configuration λ̂_S.

7 Discussion and Conclusion

In this paper, we showed that partial dependence estimates for surrogate models fitted on experimental data generated by efficient hyperparameter optimization can be unreliable due to an underlying sampling bias. We extended PDPs by an uncertainty estimate to provide users with more information regarding the reliability of the mean estimator. Furthermore, we introduced a tree-based partitioning approach for PDPs, where we leverage the uncertainty estimator to decompose the hyperparameter space into interpretable, disjoint sub-regions. We showed with two experimental studies that we generate, on average, more confident and more reliable regional PDP estimates in the sub-region containing the optimal configuration compared to the global PDP.

One of the main limitations of PDPs is that they bear the risk of providing misleading results if applied to correlated data in the presence of interactions, especially for nonparametric models [Grömping, 2020]. However, existing alternatives that visualize the global marginal effect of a feature, such as accumulated local effect (ALE) plots [Apley and Zhu, 2020], do not provide a fully satisfying solution to this problem either [Grömping, 2020]. As a solution, Grömping [2020] suggests stratified PDPs that condition on a correlated and potentially interacting feature to group ICE curves. This idea is in the spirit of our introduced tree-based partitioning algorithm. However, in the context of BO we might assume the distribution in Eq. (1) to be uniform, and therefore no correlations are present. Instead of correlated features, we are faced with a sampling bias (see Section 3) where

9

Page 10: Explaining Hyperparameter Optimization via Partial ...

we observe regions of varying uncertainty. Hence, instead of stratifying with respect to correlated features and aggregating ICE curves in regions with less correlated features, we stratify with respect to uncertainty and aggregate ICE curves in regions with low uncertainty variation. Nonetheless, it might be interesting to compare our approach with approaches based on the considerations made by Grömping [2020], or with potentially improved ALE curves.

Another limitation when using single-feature PDPs as in our examples is that hyperparameter interactions are not visible. However, two-way interactions can be visualized by plotting two-dimensional PDPs within sub-regions. Another possibility to detect interactions is to look at ICE curves within the sub-regions. If the shape of the ICE curves within a sub-region is very heterogeneous, it indicates that the hyperparameter under consideration interacts with one of the other hyperparameters. Hence, having the additional possibility to look at ICE curves of individual observations within a sub-region is an advantage compared to other global feature effect plots such as ALE plots [Apley and Zhu, 2020], as they are not defined on an observational level. While we mainly discussed GP surrogate models on a numerical hyperparameter space in our examples, our methods are applicable to a wide variety of distributional regression models and also to mixed and hierarchical hyperparameter spaces. We also considered different impurity measures in Appendix C.3.2. While the one introduced in this paper performed best in our experimental settings, this impurity measure as well as other components are exchangeable within the proposed algorithm. In the future, we will study our method on more complex, hierarchical configuration spaces for neural architecture search.

The proposed interpretation method is based on a surrogate and consequently provides insights into what the AutoML system has learned, which in turn allows plausibility checks and may increase trust in the system. To what extent this allows conclusions about the true underlying hyperparameter effects depends on the quality of the surrogate. How to efficiently perform model diagnostics to ensure a high surrogate quality before applying interpretability techniques is subject to future research.

While we focused on providing better explanations without generating any additional experimental data, it might be interesting to investigate in future work how the confidence and reliability of IML methods can be increased most efficiently when a user is allowed to conduct additional experiments.

Overall, we believe that increasing the interpretability of AutoML will pave the way for human-centered AutoML. Our vision is that users will be able to better understand the reasoning and the sampling process of AutoML systems and thus can either trust and accept the results of the AutoML system or interact with it in a feedback loop based on the gained insights and their preferences. How users can then best interact with AutoML (beyond simple changes of the configuration space) is left open for future research.

Acknowledgments and Disclosure of Funding

This work has been partially supported by the German Federal Ministry of Education and Research (BMBF) under Grant No. 01IS18036A. The authors of this work take full responsibility for its content.


References

D. W. Apley and J. Zhu. Visualizing the effects of predictor variables in black box supervised learning models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(4):1059–1086, 2020.

B. Bischl, J. Richter, J. Bossek, D. Horn, J. Thomas, and M. Lang. mlrMBO: A modular framework for model-based optimization of expensive black-box functions, 2018.

L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, 1984a. ISBN 0-534-98053-8.

L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, 1984b.

M. Britton. VINE: Visualizing statistical interactions in black box models. CoRR, abs/1904.00561, 2019.

G. Cafri and B. A. Bailey. Understanding variable effects from black box prediction: Quantifying effects in tree ensembles using partial dependence. Journal of Data Science, 14(1):67–95, 2016.

D. R. Cutler, T. C. Edwards Jr., K. H. Beard, A. Cutler, K. T. Hess, J. Gibson, and J. J. Lawler. Random forests for classification in ecology. Ecology, 88(11):2783–2792, 2007.

J. Drozdal, J. Weisz, D. Wang, G. Dass, B. Yao, C. Zhao, M. Muller, L. Ju, and H. Su. Trust in AutoML: Exploring information needs for establishing trust in automated machine learning systems. In Proceedings of the 25th International Conference on Intelligent User Interfaces, pages 297–307, 2020.

K. Eggensperger, F. Hutter, H. Hoos, and K. Leyton-Brown. Efficient benchmarking of hyperparameter optimizers via surrogates. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, pages 1114–1120, 2015.

W. D. Fisher. On grouping for maximum homogeneity. Journal of the American Statistical Association, 53(284):789–798, 1958.

A. A. Freitas. Automated machine learning for studying the trade-off between predictive accuracy and interpretability. In Third IFIP International Cross-Domain Conference for Machine Learning and Knowledge Extraction (CD-MAKE 2019), volume 11713, pages 48–66. Springer, August 2019.

J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.

J. H. Friedman and J. J. Meulman. Multiple additive regression trees with application in epidemiology. Statistics in Medicine, 22(9):1365–1381, 2003.

A. Goldstein, A. Kapelner, J. Bleich, and E. Pitkin. Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation. Journal of Computational and Graphical Statistics, 24(1):44–65, 2015.

B. M. Greenwell. pdp: An R Package for Constructing Partial Dependence Plots. The R Journal, 9(1):421–436, 2017.

B. M. Greenwell, B. C. Boehmke, and A. J. McCarthy. A simple and effective model-based variable importance measure. arXiv preprint arXiv:1805.04755, 2018.

A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(25):723–773, 2012.

U. Grömping. Model-agnostic effects plots for interpreting machine learning models. Report 1/2020, Reports in Mathematics, Physics and Chemistry, Department II, Beuth University of Applied Sciences Berlin, 2020.

F. Hutter, H. H. Hoos, and K. Leyton-Brown. An efficient approach for assessing hyperparameter importance. In Proceedings of the 31st International Conference on Machine Learning, ICML, volume 32, pages 754–762. JMLR.org, 2014a.

F. Hutter, L. Xu, H. H. Hoos, and K. Leyton-Brown. Algorithm runtime prediction: Methods & evaluation. Artificial Intelligence, 206:79–111, 2014b.

J. Levesque, A. Durand, C. Gagné, and R. Sabourin. Bayesian optimization for conditional hyperparameter spaces. In 2017 International Joint Conference on Neural Networks (IJCNN 2017), Anchorage, AK, USA, pages 286–293. IEEE, 2017. doi: 10.1109/IJCNN.2017.7965867.

C. Molnar, G. König, B. Bischl, and G. Casalicchio. Model-agnostic feature importance and effects with dependent features - a conditional subgroup approach. CoRR, abs/2006.04628, 2020.

R. M. Neal. Bayesian Learning for Neural Networks. Springer-Verlag, Berlin, Heidelberg, 1996. ISBN 0387947248.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

F. Pfisterer, J. Thomas, and B. Bischl. Towards human centered AutoML. CoRR, abs/1911.02391, 2019.

P. Probst, A. Boulesteix, and B. Bischl. Tunability: Importance of hyperparameters of machine learning algorithms. Journal of Machine Learning Research, 20:53:1–53:32, 2019.

A. Sharma, J. N. van Rijn, F. Hutter, and A. Müller. Hyperparameter importance for image classification by residual neural networks. In Discovery Science - 22nd International Conference, DS, volume 11828 of Lecture Notes in Computer Science, pages 112–126. Springer, 2019.

J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, pages 2960–2968, 2012.

K. Swersky, D. Duvenaud, J. Snoek, F. Hutter, and M. A. Osborne. Raiders of the lost architecture: Kernels for Bayesian optimization in conditional parameter spaces. arXiv: Machine Learning, 2014.

J. Vanschoren, J. N. van Rijn, B. Bischl, and L. Torgo. OpenML: Networked science in machine learning. SIGKDD Explorations, 15(2):49–60, 2013.

M. N. Wright and I. R. König. Splitting on categorical predictors in random forests. PeerJ, 7:e6339, 2019.

I. Xanthopoulos, I. Tsamardinos, V. Christophides, E. Simon, and A. Salinger. Putting the human back in the AutoML loop. In Proceedings of the Workshops of the EDBT/ICDT 2020 Joint Conference, volume 2578 of CEUR Workshop Proceedings. CEUR-WS.org, 2020.

L. Zimmer, M. Lindauer, and F. Hutter. Auto-PyTorch Tabular: Multi-fidelity metalearning for efficient and robust AutoDL. IEEE TPAMI, 2021. Preprint via Early Access.


A Uncertainty Estimation

A.1 Choice of Uncertainty Quantification

Besides using the uncertainty estimate of the surrogate model to quantify the uncertainty of the PDP mean estimate (our method), it is also possible to estimate uncertainty w.r.t. the variance over different ICE curves. However, if the uncertainty were estimated via the variance over ICE curves, we would describe how the levels of the mean prediction vary along the λ_S dimensions. In contrast, we propose to capture model uncertainty along the λ_S dimensions. For example, consider a constant surrogate function ĉ(λ) = γ with a high uncertainty estimate ŝ²(λ) = 100. Computing the variance over ICE curves in this example results in an uncertainty estimate of 0 (all ICE curves are identical). Our method, however, would return a variance estimate of 100 and thus capture model uncertainty.
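This distinction can be reproduced numerically with a tiny, purely illustrative sketch of the constant-surrogate example above:

```python
import numpy as np

# Constant surrogate c_hat(lambda) = gamma with model variance s_hat^2(lambda) = 100.
gamma, model_var = 0.0, 100.0
grid = np.linspace(-5, 5, 20)
n = 1000

ice_means = np.full((n, grid.size), gamma)      # all ICE curves of the mean are identical
ice_vars = np.full((n, grid.size), model_var)   # ICE curves of the model variance

var_over_ice = ice_means.var(axis=0)            # 0 everywhere: spread of the mean ICE curves
model_uncertainty = ice_vars.mean(axis=0)       # 100 everywhere: model uncertainty (our method)
print(var_over_ice[0], model_uncertainty[0])    # 0.0 100.0
```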

A.2 Covariance Estimates under Misspecification of Kernels

Figure 8: The figures show the NLL of the true PDP c_1(λ_1) under the estimated PDPs with variance estimates (4) and (5), for a misspecified kernel (Gaussian) and a correctly specified kernel (Matérn-3/2), respectively.

Table 3: The table shows the NLL of the true PDP c_1(λ_1) under the estimated PDPs with variance estimates (4) and (5), for a misspecified kernel (Gaussian) and a correctly specified kernel (Matérn-3/2), respectively. Shown are the mean across 50 replications and the standard deviation in brackets.

        Correct specification            Misspecification
 d    Estimate (5)   Estimate (4)    Estimate (5)   Estimate (4)
 3    3.61 (2.02)    4.47 (0.27)     5.10 (5.91)    4.62 (0.32)
 5    3.93 (2.00)    4.87 (0.23)     4.33 (3.72)    4.89 (0.28)
 8    4.05 (1.12)    5.18 (0.14)     4.24 (2.12)    5.13 (0.17)

In order to provide evidence for the claim that estimate Eq. (4) is more sensitive to misspecifications of the kernel (and thus of the covariance structure) than Eq. (5), we performed some prior experiments.

We assume that we are given an objective function that is generated by a Gaussian process (GP) with a Matérn-3/2 kernel. In our experiments, that function was created by fitting a GP on tuples (λ^(i), y^(i))_{i=1,...,30}, with λ^(i) ∼ Unif([−5, 5]^d) and y^(i) corresponding to the value of the d-dimensional Styblinski-Tang function at λ^(i). The posterior mean of this GP then serves as our true objective c, so that we know the correct kernel specification of the ground truth. Subsequently, we fit both a GP surrogate model with a correctly specified kernel (i.e., a Matérn-3/2 kernel) and a surrogate model with a misspecified kernel (in our case, we chose a Gaussian kernel) to the data (λ^(i), c(λ^(i)))_{i=1,...,30}. In both cases, we compute the PDPs for λ_1 with both variance


estimates Eq. (4) and Eq. (5) and measure the negative log-likelihood (NLL) of c_S under the respective estimated PDP. We performed 50 repetitions of the experiment for each d ∈ {3, 5, 8}.

Figure 8 shows that the median of the NLL across all 50 replications is slightly lower for the covariance estimate in Eq. (4). However, the variance of the NLL is much higher for the estimate in Eq. (5) than for Eq. (4). Table 3 confirms that, when using variance estimate Eq. (4), the standard deviation of the NLL values is lower. We conclude that the reliability of the estimate is particularly sensitive to a correct choice of the kernel function. The NLL for the PDPs computed with variance estimate Eq. (5) is, independent of whether the kernel is correctly specified or not, less sensitive to misspecifications of the kernel.

B Hierarchical Hyperparameter Spaces

Search spaces in HPO and AutoML are often hierarchical, i.e., some hyperparameters are only active conditional on the value of another hyperparameter (the latter usually being a categorical choice, e.g., using a certain method). The underlying dependency structure can be visualized by a tree structure (Figure 9 (a)). If one hyperparameter can activate another hyperparameter, we call the former a parent; the latter is subordinate to the former and called a child. A hyperparameter that has no parents is called a global hyperparameter. Sampled configurations can be presented in a nested block matrix (Figure 9 (b)), with missing entries for inactive hyperparameters.

[Figure 9 (a), dependency structure: the root Λ contains the global hyperparameters k and algorithm; algorithm with value SVM activates C and kernel (kernel with value rbf activates σ), and with value xgboost activates η and nrounds.]

[Figure 9 (b), sampled configurations (NA = inactive):

 k   algorithm   η      nrounds   C     kernel   σ
 3   svm         NA     NA        1     linear   NA
 7   svm         NA     NA        2     linear   NA
 3   svm         NA     NA        10    rbf      0.2
 7   svm         NA     NA        0.4   rbf      0.02
 2   xgboost     0.01   100       NA    NA       NA
 5   xgboost     0.1    300       NA    NA       NA  ]

Figure 9: Hyperparameter k ∈ ℕ represents the number of selected components of PCA, applied as a preprocessing step. While it is global, always active, and has no subordinate children, algorithm (depending on its values SVM and xgboost) is parent to the hyperparameters C, kernel, η and nrounds.

Various surrogate models exist which can learn on such hierarchical spaces, e.g., GPs with specialized kernels [Levesque et al., 2017, Swersky et al., 2014], specialized trees in random forests [Hutter et al., 2014b], or handling of the missing entries through imputation [Bischl et al., 2018].

The specialized tree algorithm (in normal AutoML, without PDP) can be briefly summarized as follows: We run a normal recursive partitioning as in CART, but in each split, a hyperparameter is only eligible for potential splitting if the path leading to the current node satisfies all of its preconditions and therefore activates it.

In order to generalize this tree building in the context of hierarchical dependency structures to a regional PDP computation as in Section 5, we now make the following three modifications:

First, after having selected a hyperparameter λ_S for which we want to estimate the PDP, we subset the uniformly sampled test data on which we are going to fit our regional PDP algorithm to all rows for which λ_S is not missing, i.e., only to configurations in which λ_S is active11.

Second, we now run the specialized tree algorithm as described above, but never allow splitting on λ_S. Obviously, this will then never activate any child of λ_S, so we can never split on these either.

11This will likely result in constant values for hyperparameters on the path leading to λ_S, and consequently the tree will not split on these.


Third, we now adapt the estimation of the PDPs in a given tree node and the associated data block N,

$$\hat{c}_S(\lambda_S) = \frac{1}{|\mathcal{N}|} \sum_{i \in \mathcal{N}} m\left(\lambda_S, \lambda_C^{(i)}\right), \qquad \hat{s}^2_S(\lambda_S) = \frac{1}{|\mathcal{N}|} \sum_{i \in \mathcal{N}} \hat{s}^2\left(\lambda_S, \lambda_C^{(i)}\right), \tag{7}$$

in the presence of hierarchical structures. If hyperparameter λ_S is a parent, sampling from the marginal λ_C ∼ P(λ_C) (which refers to all hyperparameters except λ_S) can still yield invalid combinations (λ_S, λ_C).

We now simply average only over valid configurations w.r.t. the dependency structure. Formally, let q(λ) : Λ → {0, 1} be a binary predicate function which is 1 if and only if λ is valid w.r.t. the dependency structure. Now let $v(\lambda_S, N) = \{i \in N \mid q(\lambda_S, \lambda_C^{(i)}) = 1\} \subset N$ be the set of valid configurations in N w.r.t. the dependency structure if we insert a given λS value. The dependency-adapted PDPs now are

$$c_S(\lambda_S) = \frac{1}{|v(\lambda_S, N)|} \sum_{i \in v(\lambda_S, N)} m\left(\lambda_S, \lambda_C^{(i)}\right), \tag{8}$$

with an analogous modification for $s^2_S(\lambda_S)$.

For example, to calculate the PDP of a child hyperparameter such as nrounds w.r.t. the example in Figure 9, we first need to subset the test dataset to all rows for which nrounds is not missing (in this specific example, this is the same as only keeping the instances where algorithm takes the value xgboost). Due to the dependency structure, only k, algorithm and η are active hyperparameters in λC. To calculate the PDP of a parent hyperparameter such as algorithm, there are (in this example) no missings w.r.t. λS. However, we will obtain invalid configurations, e.g., when we replace the value svm by xgboost for the parent hyperparameter algorithm. Thus, we need to use the third modification from the adjustments described above and average only over all valid configurations w.r.t. the dependency structure: if we insert svm, the invalid configurations are dropped and we average only over those configurations that were activated when choosing svm, i.e., that contain non-missing values in the child hyperparameters of the svm algorithm. If we insert xgboost, then we only average over those configurations that contain non-missing values for η and nrounds.
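The following R sketch illustrates the dependency-adapted estimate of Eq. (8) for this example; the surrogate mean m_hat and the validity predicate q are hypothetical stand-ins and not part of our released code.

```r
# Dependency-adapted PDP (Eq. 8): for each grid value of lambda_S, average the
# surrogate mean only over configurations that remain valid after insertion.
pdp_hierarchical <- function(grid_S, data_C, q, m_hat) {
  sapply(grid_S, function(lambda_S) {
    valid <- vapply(seq_len(nrow(data_C)), function(i) {
      q(lambda_S, data_C[i, , drop = FALSE])
    }, logical(1))
    vals <- vapply(which(valid), function(i) {
      m_hat(lambda_S, data_C[i, , drop = FALSE])
    }, numeric(1))
    mean(vals)
  })
}

# Example of a validity predicate for the parent 'algorithm': a configuration
# stays valid if the children required by the inserted value are non-missing.
q_algorithm <- function(value, row) {
  if (value == "xgboost") !is.na(row$eta) && !is.na(row$nrounds)
  else                    !is.na(row$C)   && !is.na(row$kernel)
}
```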

C Experimental Analysis

C.1 Maximum Mean Discrepancy

In Section 6.1, we analyze the first hypothesis, namely how the sampling bias affects the PDP estimation. An indicator of the size of the sampling bias is the exploration factor τ: the smaller τ, the higher the sampling bias compared to a uniformly distributed dataset (see, e.g., Figure 1). In other words, the sampling bias can be quantified by the distributional shift between a uniformly distributed sample and the sample generated by the BO process. A commonly used measure to quantify such distributional differences is the Kullback-Leibler divergence. However, since the joint distribution of the generated sample is not known, the Kullback-Leibler divergence might not be the most appropriate measure here. Another metric that tests whether two samples belong to the same distribution is the maximum mean discrepancy (MMD) [Gretton et al., 2012]. It is defined as the maximum deviation in expectation over a function class, here a reproducing kernel Hilbert space (RKHS). This is equivalent to 'the norm of the difference between distribution feature means in the RKHS' [Gretton et al., 2012].

An unbiased empirical estimate of the MMD for samples $X = \{x^{(1)}, \dots, x^{(n)}\}$ and $Y = \{y^{(1)}, \dots, y^{(m)}\}$ is given by

$$\mathrm{MMD}^2(X, Y) = \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j \neq i}^{n} k\left(x^{(i)}, x^{(j)}\right) + \frac{1}{m(m-1)} \sum_{i=1}^{m} \sum_{j \neq i}^{m} k\left(y^{(i)}, y^{(j)}\right) - \frac{2}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} k\left(x^{(i)}, y^{(j)}\right).$$


It follows that for X and Y being drawn from the same distribution, the MMD is small, while it becomes large with increasing distributional differences.

Here, X represents the sample drawn from a uniform distribution over the hyperparameter space Λ, while Y is the sample generated by the BO process. The kernel k is chosen to be the radial basis function kernel, with parameter σ set to the median L2-distance between sample points. This heuristic is commonly used [Gretton et al., 2012].
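A minimal R sketch of the unbiased MMD² estimate with an RBF kernel and the median heuristic (one common parameterization of the bandwidth; X and Y are numeric matrices with one configuration per row):

```r
# Unbiased estimate of MMD^2 between samples X and Y (rows = observations),
# using an RBF kernel whose bandwidth is set by the median heuristic.
mmd2_unbiased <- function(X, Y) {
  Z <- rbind(X, Y)
  D <- as.matrix(dist(Z))                  # pairwise L2 distances
  sigma <- median(D[upper.tri(D)])         # median heuristic
  K <- exp(-D^2 / (2 * sigma^2))           # RBF kernel matrix
  n <- nrow(X); m <- nrow(Y)
  Kxx <- K[1:n, 1:n]
  Kyy <- K[(n + 1):(n + m), (n + 1):(n + m)]
  Kxy <- K[1:n, (n + 1):(n + m)]
  (sum(Kxx) - sum(diag(Kxx))) / (n * (n - 1)) +
    (sum(Kyy) - sum(diag(Kyy))) / (m * (m - 1)) -
    2 * sum(Kxy) / (n * m)
}

# Example: X roughly uniform, Y concentrated (mimicking BO's sampling bias).
set.seed(1)
X <- matrix(runif(200, -5, 5), ncol = 2)
Y <- matrix(rnorm(200, mean = 2, sd = 0.5), ncol = 2)
mmd2_unbiased(X, Y)
```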

C.2 Experimental Design

All experiments only require CPUs (and no GPUs) and were computed on a Linux cluster (see Table 4).

Table 4: Description of the infrastructure used for the experiments in this paper.

Computing Infrastructure
Type                       Linux CPU Cluster
Architecture               28-way Haswell-EP nodes
Cores per Node             1
Memory limit (per core)    2.2 GB

The computational complexity of the PDP estimation with uncertainty is O(G · n) · O(c), with O(c) being the runtime complexity of a single surrogate prediction, n denoting the size of the dataset used to compute the Monte Carlo estimate, and G being the number of grid points. In the context of HPO, the general assumption is that the evaluation time of a single surrogate prediction is negligibly low compared to an evaluation of the objective function. We therefore argue that the runtime of computing a PDP with uncertainty estimate can be neglected in this context. When ICE curves and their variance estimates are computed beforehand, the algorithmic complexity of Algorithm 1 corresponds to that of the tree splitting [Breiman et al., 1984b].
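As a sketch of this O(G · n) computation (surrogate_predict is a hypothetical function returning the posterior mean m and standard deviation s of the surrogate for each row of data_C):

```r
# Monte Carlo PDP with uncertainty: for each of the G grid values of lambda_S,
# average the surrogate's posterior mean and variance over the n rows of data_C.
pdp_with_uncertainty <- function(grid_S, data_C, surrogate_predict) {
  res <- lapply(grid_S, function(lambda_S) {
    pred <- surrogate_predict(lambda_S, data_C)     # n predictions per grid point
    c(mean = mean(pred$m), var = mean(pred$s^2))    # estimates c_S and s^2_S
  })
  do.call(rbind, res)
}
```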

In our experiments, the runtimes to compute the PDPs and to perform the tree splitting lie within a few minutes. We consider them negligible and thus do not report them.

C.2.1 BO on a Synthetic Function

The Styblinski-Tang function

$$c : [-5, 5]^d \to \mathbb{R} \tag{9}$$
$$\lambda \mapsto \frac{1}{2} \sum_{i=1}^{d} \left(\lambda_i^4 - 16\lambda_i^2 + 5\lambda_i\right) \tag{10}$$

was optimized via BO for d ∈ {3, 5, 8} with a total budget of {80, 150, 250} objective function evaluations, respectively, to allow sufficient optimization progress depending on the problem dimension.
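For reference, the objective of Eqs. (9) and (10) reads as follows in R:

```r
# Styblinski-Tang function on [-5, 5]^d; its global minimum lies at
# lambda_i = -2.903534 in every dimension.
styblinski_tang <- function(lambda) {
  0.5 * sum(lambda^4 - 16 * lambda^2 + 5 * lambda)
}

styblinski_tang(rep(-2.903534, 3))  # approx. -117.5 for d = 3
```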

We computed an initial random design of size 4d¹². We performed BO with a GP surrogate model with a Matérn-3/2 kernel and the LCB acquisition function a(λ) = m(λ) + τ · s(λ) with different values τ ∈ {0.1, 1, 5}. A nugget of 10⁻⁸ was added for numerical stability. We denote the best evaluated configuration, measured by c, by λ.

Based on the last surrogate model, we performed the partitioning in Algorithm 1 for a total number of 5 splits, with the different splitting criteria (see Section C.3.2), with PDPs being computed on G = 20 equidistant grid points and n = 1000 samples for the Monte Carlo approximation¹³.

For all subsequent analyses, we considered the sub-regions Λ′ for which λ ∈ Λ′ and computed the PDPs according to Estimate (5).

¹²The initial design was fixed across replications.
¹³Both the grid points and the data to compute the MC estimate are fixed across replications.


Table 5: Hyperparameter space 1 of Auto-PyTorch Tabular.

Name                   Range           log   type
Number of layers       [1, 5]          no    int
Max. number of units   [64, 512]       yes   int
Batch size             [16, 512]       yes   int
Learning rate (SGD)    [1e−4, 1e−1]    yes   float
Weight decay           [1e−5, 1e−1]    no    float
Momentum               [0.1, 0.99]     no    float
Max. dropout rate      [0.0, 1.0]      no    float

Table 6: Hyperparameter space of the random forest that was tuned over to compute the empirical performance model.

Name                      Range            log   type
Number of trees           [10, 500]        yes   int
mtry                      {true, false}    no    bool
Minimum Size of Nodes     [1, 5]           no    int
Number of Random Splits   [1, 100]         no    int

For every considered sub-region Λ′, we compute the partial dependence of c for hyperparameter λ1, denoted as c1(λ1), to establish a ground-truth PDP estimate.

C.2.2 MLP

All experimental data were downloaded from the LCBench project¹⁴. As an empirical performance model, we fitted a random forest (ranger) to approximate the relationship between hyperparameters and the balanced error rate (BER). For every dataset, we performed a random search with 500 iterations and evaluation via 3-fold cross-validation to choose reasonable values for the hyperparameters listed in Table 6. The empirical performance model acts as ground truth in our experiments, and thus we denote it by c. This function was used to compute the true PDP cS.
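A hedged sketch of how such an empirical performance model can be fitted with the ranger package; the data object lcbench_data, its column name ber, and the fixed forest settings are illustrative placeholders (in our experiments, the forest's hyperparameters were chosen by the random search over Table 6).

```r
library(ranger)

# Hypothetical data frame: one row per evaluated MLP configuration (Table 5)
# together with the observed balanced error rate in column 'ber'.
# lcbench_data <- read.csv("lcbench_adult.csv")

# Fit a random forest as empirical performance model c: hyperparameters -> BER.
fit_epm <- function(data) {
  ranger(
    formula       = ber ~ .,
    data          = data,
    num.trees     = 500,    # placeholder; tuned via random search in practice
    min.node.size = 1       # placeholder
  )
}

# Predictions of the fitted model then serve as the ground-truth objective c,
# from which the true PDP c_S is computed.
```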

We computed an initial random design of size 2 · d¹⁵. We performed BO with a GP surrogate model with a Matérn-3/2 kernel and the LCB acquisition function a(λ) = m(λ) + τ · s(λ) with τ = 1. A nugget effect was modeled. The maximum budget per BO run was set to 200 objective function evaluations. We denote the best evaluated configuration, measured by c, by λ.

Based on the last surrogate model, we performed the partitioning in Algorithm 1 for a total number of 6 splits, with the different splitting criteria (see Section C.3.2), with PDPs being computed on G = 20 equidistant grid points and n = 1000 samples for the Monte Carlo approximation¹⁶.

C.3 Detailed Results

C.3.1 Synthetic

In Section 6.1, we analyzed our tree-based partitioning method on the Styblinski-Tang function for different dimensions and degrees of sampling bias. To make the results of Table 1 more tangible, we visualized the associated PDPs and confidence bands for λ1 and τ = 1 of one iteration in Figure 10. The plots clearly show that the number of splits required to obtain a more confident and reliable PDP estimate in the sub-region containing the optimal configuration depends on the problem dimension.

C.3.2 MLP

In Section 6.2, we evaluated the reliability of PDP estimation for the partitioning procedure proposed in Section 5. The results presented in Section 6.2 are aggregated over a total number of 35 different

¹⁴https://github.com/automl/LCBench, Apache License 2.0
¹⁵The initial design was fixed across replications.
¹⁶The grid and the data used to compute the Monte Carlo estimate were fixed across replications.


Figure 10: PDP (blue) and confidence band (grey) of the GP for hyperparameter λ1 for the Styblinski-Tang function in case of 3 (top), 5 (middle) and 8 (bottom) dimensions. The black line shows the true PDP estimate of the Styblinski-Tang function. The orange vertical line marks the optimal configuration λ1.

datasets. In Tables 7 and 8, the relative improvements of the mean confidence (MC) and the NLL are presented at the dataset level. The mean and standard deviation are averaged over all hyperparameters. Furthermore, the mean values of the features providing the highest and lowest relative improvement for each dataset are reported. Following on that, Table 9 shows for each hyperparameter the number of datasets for which the respective hyperparameter led to the highest (lowest) relative improvement for both evaluation metrics.

Split Criteria In Section 5, we introduced Eq. 6 as the split criterion within the tree-based partitioning of Algorithm 1. This measure splits ICE curves based on curve similarities, which is especially suitable in the underlying context, as explained in Section 5. However, we also compared it to two other measures that are based on the uncertainty estimates provided by the probabilistic surrogate model. The first one is also based on the ICE curves of the variance function $s^2(\lambda_S, \lambda_C^{(i)})$ and its PD estimate $s^2_{S|N'}(\lambda_S)$ within a sub-region N′. However, instead of minimizing the distance between curves and grouping the associated ICE curves by similar behavior, we can also minimize the area under the ICE curves of the variance function. The reasoning is as follows: if we aim for tight confidence bands over the entire range of ΛS, we want the ICE curves of the variance function to be, on average, as low as possible. This is equivalent to minimizing the average area under the ICE curves of the variance function. Thus, the calculation of Eq. 6 changes such that we first calculate the average area between each ICE curve of the uncertainty function and the respective sub-regional PDP,

$$L(\lambda_S, i) = \frac{1}{G} \sum_{g=1}^{G} \left( s^2\left(\lambda_S^{(g)}, \lambda_C^{(i)}\right) - s^2_{S|N'}\left(\lambda_S^{(g)}\right) \right),$$


Table 7: Relative improvement of MC on dataset level. The table shows the mean (µ) and standard deviation (σ) of the relative improvement (in %) over all 7 hyperparameters and 30 runs after 6 splits. Additionally, the mean values of the hyperparameter with the highest (µh) and lowest (µl) mean improvement are shown.

Dataset                      µ    σ   µh   µl
adult                       34    6   38   25
airlines                    49   20   61    3
albert                      57   26   78   14
Amazon_employee_access      58   17   69   21
APSFailure                  46   17   60   22
Australian                  41    7   46   32
bank-marketing              29   13   45   15
blood-transfusion-service   34   20   39   13
car                         44   17   51   32
christine                   47   14   54   19
cnae-9                      66   26   83    7
connect-4                   47   14   56   17
covertype                   41   17   53   12
credit-g                    57   21   69    7
dionis                      49   21   63    5
fabert                      64   21   75   18
Fashion-MNIST               41   12   47   18
helena                      43   16   52    8
higgs                       42   14   52   17
jannis                      35   13   44   19
jasmine                     46   11   56   27
jungle_chess_2pcs_raw       33   15   44    6
kc1                         33   12   41   17
KDDCup09_appetency          52   21   63    3
kr-vs-kp                    46   14   56   26
mfeat-factors               56   16   70   29
MiniBooNE                   36   14   42   18
nomao                       30    6   34   22
numerai28.6                 60   28   76   -3
phoneme                     29    7   32   25
segment                     53   21   66   10
shuttle                     48   11   58   32
sylvine                     37    6   42   29
vehicle                     34    8   41   30
volkert                     44   16   55   12

Table 8: Relative improvement of the NLL on dataset level. The table shows the mean (µ) and standard deviation (σ) of the relative improvement (in %) over all 7 hyperparameters and 30 runs after 6 splits. Additionally, the mean values of the feature with the highest (µh) and lowest (µl) mean improvement are shown.

Dataset                      µ    σ   µh   µl
adult                       13    6   23    8
airlines                    17    9   23    1
albert                      31   13   40    6
Amazon_employee_access      -0   36   29  -35
APSFailure                  15    7   23    6
Australian                  12   14   23   -4
bank-marketing               7    9   17   -1
blood-transfusion-service    6   17   10   -8
car                         26   32   35   10
christine                   10   11   17    1
cnae-9                      67   37   93  -11
connect-4                   -4   38   22  -84
covertype                   28   13   38    8
credit-g                    41   24   81    2
dionis                      47   55  144  -18
fabert                      37   17   54    8
Fashion-MNIST               15   11   28    2
helena                     -20   31   -9  -35
higgs                       20   12   33   -2
jannis                      17    7   21    8
jasmine                      6   14   24  -11
jungle_chess_2pcs_raw        9   15   24   -7
kc1                         11   10   17    4
KDDCup09_appetency          23   28   62  -33
kr-vs-kp                     9   35   43  -17
mfeat-factors               25   19   51   10
MiniBooNE                    9   14   17   -8
nomao                        8    6   16    3
numerai28.6                 17    9   23    4
phoneme                     11    7   17    5
segment                     22   57   41  -31
shuttle                     35   24   84   19
sylvine                     14   17   38   -0
vehicle                      0   20    9  -14
volkert                     23   18   40    5

Table 9: Number of datasets for which each of the hyperparameters had the highest (µh) and lowest (µl) average relative improvement w.r.t. MC and NLL.

                       MC               NLL
Hyperparameter         # µh    # µl     # µh    # µl
Batch size               1       3        3       4
Learning rate            6       2        6       3
Max. dropout             9       1        2       1
Max. units               4       -        7       -
Momentum                 8       -        7       3
Number of layers         3      14        9      11
Weight decay             4      15        1      13

where $s^2_{S|N'}\left(\lambda_S^{(g)}\right) := \frac{1}{|N'|} \sum_{i \in N'} s^2\left(\lambda_S^{(g)}, \lambda_C^{(i)}\right)$, and aggregate the quadratic value of it over all observations in the respective sub-region:

$$R_{\mathrm{area}}(N') = \sum_{i \in N'} L(\lambda_S, i)^2. \tag{11}$$
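A minimal sketch of this area-based criterion, assuming the ICE curves of the variance function within the sub-region are available as an |N′| × G matrix s2_ice (rows = observations, columns = grid points):

```r
# Area-based impurity (Eq. 11): squared average signed distance between each
# variance ICE curve and the sub-regional PD of the variance, summed over rows.
risk_area <- function(s2_ice) {
  s2_pdp <- colMeans(s2_ice)                    # s^2_{S|N'} at each grid point
  L      <- rowMeans(sweep(s2_ice, 2, s2_pdp))  # average area L(lambda_S, i)
  sum(L^2)
}
```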

Second, we also used the uncertainty estimates of the probabilistic surrogate model for each observation of the test data itself to define an impurity measure. To this end, we calculated the squared deviation of each observation from the mean uncertainty within the respective node. Hence, compared


to the other two approaches, we do not group curves but the observations themselves regarding their uncertainty. We further refer to this approach as the variance (var) approach.

As a third measure that is not based on the uncertainty estimates, we used the MSE of the posterior mean estimate of the surrogate model as split criterion. This is the most commonly used measure for regression trees and hence a solid baseline measure.

We compared the four impurity measures for the partitioning procedure over all datasets and hyperparameters. Table 10 contrasts the results presented in Section 6.2 with the corresponding results for the other three measures. The impurity measure based on curve similarities that we used for our analysis (L2) outperforms the other three measures on average for all hyperparameters regarding MC, and especially regarding OC. With regard to the NLL, there is not one measure that outperforms all others; rather, all measures perform, on average over all hyperparameters, equally well.

Table 10: Comparison of different impurity measures regarding the relative improvement of MC, OC and NLL on hyperparameter level. The table compares the results of Table 2 (L2) with the corresponding results for the impurity measure based on Eq. 11 (area), the variance measure (var) and the mean measure.

                     δ MC (in %)           δ OC (in %)           δ NLL (in %)
Hyperparameter       L2  area  var  mean   L2  area  var  mean   L2  area  var  mean
Batch size           41   40   38   36     62   58   55   53     20   19   16   19
Learning rate        50   50   50   42     58   57   57   51     18   18   18   19
Max. dropout         50   49   47   41     62   61   58   53     17   18   17   15
Max. units           51   51   50   45     59   58   58   53     25   24   25   25
Momentum             52   51   51   43     58   57   57   53     20   20   20   16
Number of layers     31   30   29   25     51   46   46   45     14   15   15   13
Weight decay         36   35   34   29     61   53   51   52     12   12   11   10

Baseline comparison To emphasize the significance of our results, we compare the results from Section 6.2 with the following naive baseline method: We consider the L1-neighborhood around the optimal configuration (where the GP can be assumed to be quite confident due to the focused sampling of BO) which has the same size as the sub-region found by our method. We compute the PDP on this neighborhood and compare it to the sub-regional PDP found by our method, in the same way as in Section 6.2. We calculated the average improvement of the three evaluation metrics over all datasets and repetitions on hyperparameter level, as done in Table 2 of our paper. While the mean confidence for our method improves on average by 30-52%, the naive baseline method improves only by 8-23%. Close to the optimal configuration, the mean improvement of our method is between 50-62%, while the baseline method only improves by 16-42%. While the negative log-likelihood does not improve for the baseline method, it improves using our method by 12-24%. See Table 11 for more detailed results. Hence, our method results in more reliable and confident PDP estimates than this baseline method. These results justify using the more complex approach of grouping ICE curves based on the similarity of their uncertainty structure to obtain more reliable and confident PDP estimates in the sub-region close to the optimal configuration. Other disadvantages of the baseline method are that we need to specify the size of the neighborhood and that we only obtain a rather local view around the optimal configuration. Our method, on the other hand, decomposes the global PDP into several distinct and interpretable sub-regions, which helps the user to understand which regions of the entire hyperparameter space can be interpreted more reliably and which ones need to be regarded with caution.
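The baseline can be sketched as follows; configs is a matrix of test configurations (assumed to be scaled to the unit cube), lambda_opt the optimal configuration, and the radius r would be chosen such that the neighborhood matches the size of the sub-region found by the tree, which is not shown here.

```r
# Naive baseline: keep all test configurations within an L1-ball of radius r
# around the incumbent and compute the PDP only on this neighborhood.
l1_neighborhood <- function(configs, lambda_opt, r) {
  d <- apply(configs, 1, function(row) sum(abs(row - lambda_opt)))
  configs[d <= r, , drop = FALSE]
}
```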

Increased confidence with more splits Furthermore, it needs to be noted that, using our method, the mean confidence and the NLL improve on average when we use six splits. However, this does not mean that they improve by design when splitting into sub-regions. As shown in Tables 7 and 8, improvements heavily depend on the dataset and the HP. Different factors influence the optimal number of splits, such as the sampling bias, the size of the test set, and the dimensionality of the HP space. For some of our benchmarks, the best results are reached with fewer splits. One example is shown in Figure 11, where improvements in both metrics are made until split 2; by splitting deeper, estimates get less accurate, especially when sample sizes in sub-regions become very small. Thus, the number of splits is a (useful and flexible) control parameter in our method which can be determined within a


Table 11: Relative improvement of MC, OC and NLL on hyperparameter level. The table shows, for our method and the baseline method, the respective mean (standard deviation) of the average relative improvement over 30 replications for each dataset and 6 splits.

                     Tree-based partitioning                          Baseline method
Hyperparameter       δ MC (%)     δ OC (%)     δ NLL (%)              δ MC (%)     δ OC (%)     δ NLL (%)
Batch size           40.8 (14.9)  61.9 (13.5)  19.8 (19.5)            13.7 (12.1)  18.9 (16.0)   1.4 (21.6)
Learning rate        50.2 (13.7)  57.6 (14.4)  17.9 (20.5)            23.1 (17.7)  27.2 (20.7)  -3.4 (27.0)
Max. dropout         49.7 (15.4)  62.4 (11.9)  17.4 (18.2)            21.1 (16.8)  26.7 (16.8)   3.3 (22.1)
Max. units           51.1 (15.2)  58.6 (12.7)  24.6 (22.0)            19.1 (16.5)  22.0 (17.1)  -1.4 (19.7)
Momentum             51.7 (14.5)  58.3 (12.7)  19.7 (21.7)            21.9 (16.4)  25.3 (16.9)   2.1 (25.4)
Number of layers     30.6 (16.4)  50.9 (16.6)  13.8 (32.5)             8.1 (5.9)   15.4 (12.8)   0.9 (11.8)
Weight decay         36.3 (22.6)  61.0 (13.1)  11.9 (19.7)            22.6 (15.9)  41.7 (15.8)   2.2 (24.2)

human-in-the-loop approach (view plots after each split and stop when the results are satisfying) or by defining a quantitative measure (e.g., based on a threshold for the confidence improvement).
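Such a quantitative stopping rule could, for instance, look as follows; split_once (one split of Algorithm 1) and mean_confidence (the MC metric, i.e., the average width of the confidence band, lower is better) are hypothetical helpers.

```r
# Keep splitting as long as the relative improvement in mean confidence (MC)
# exceeds a user-defined threshold, up to a maximum number of splits.
split_until_converged <- function(node, split_once, mean_confidence,
                                  threshold = 0.05, max_splits = 6) {
  for (s in seq_len(max_splits)) {
    candidate   <- split_once(node)
    improvement <- (mean_confidence(node) - mean_confidence(candidate)) /
                   mean_confidence(node)
    if (improvement < threshold) break
    node <- candidate
  }
  node
}
```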

Figure 11: Estimated PDP of the GP (blue) and true PDP estimate (black). The relative improvements after 2 (6) splits are δ MC = 5% (0%) and δ NLL = 5% (−28%).

D Code

All code related to this paper is made available via a public repository¹⁷. All methods are implemented within the folder R, and all code used to perform the experiments is provided in benchmarks. All analyses shown in this paper in the form of tables or figures can be reproduced by running the notebooks in analysis.

¹⁷https://github.com/slds-lmu/paper_2021_xautoml
