  • Predictive Entropy Search for Multi-objective Bayesian Optimization

    Daniel Hernández-Lobato, Universidad Autónoma de Madrid, Francisco Tomás y Valiente 11, 28049, Madrid, Spain.

    José Miguel Hernández-Lobato, Harvard University, 33 Oxford Street, Cambridge, MA 02138, USA.

    Amar Shah, Cambridge University, Trumpington Street, Cambridge CB2 1PZ, United Kingdom.

    Ryan P. Adams, Harvard University and Twitter, 33 Oxford Street, Cambridge, MA 02138, USA.

    Abstract: We present PESMO, a Bayesian method for identifying the Pareto set of multi-objective optimization problems, when the functions are expensive to evaluate. PESMO chooses the evaluation points to maximally reduce the entropy of the posterior distribution over the Pareto set. The PESMO acquisition function is decomposed as a sum of objective-specific acquisition functions, which makes it possible to use the algorithm in decoupled scenarios in which the objectives can be evaluated separately and perhaps with different costs. This decoupling capability is useful to identify difficult objectives that require more evaluations. PESMO also offers gains in efficiency, as its cost scales linearly with the number of objectives, in comparison to the exponential cost of other methods. We compare PESMO with other methods on synthetic and real-world problems. The results show that PESMO produces better recommendations with a smaller number of evaluations, and that a decoupled evaluation can lead to improvements in performance, particularly when the number of objectives is large.

    Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

    1. Introduction

    We address the problem of optimizing K real-valued functions f_1(x), . . . , f_K(x) over some bounded domain X ⊂ R^d, where d is the dimensionality of the input space. This is a more general, challenging and realistic scenario than the one considered in traditional optimization problems, where there is a single objective function. For example, in a complex robotic system, we may be interested in minimizing the energy consumption while maximizing locomotion speed (Ariizumi et al., 2014). When selecting a financial portfolio, it may be desirable to maximize returns while minimizing various risks. In a mechanical design, one may wish to minimize manufacturing cost while maximizing durability. In each of these multi-objective examples, it is unlikely to be possible to optimize all of the objectives simultaneously, as they may be conflicting: a fast-moving robot probably consumes more energy, high-return financial instruments typically carry greater risk, and cheaply manufactured goods are often more likely to break. Nevertheless, it is still possible to find a set of optimal points X⋆ known as the Pareto set (Collette & Siarry, 2003). Rather than a single best point, this set represents a collection of solutions at which no objective can be improved without damaging one of the others.

    In the context of minimization, we say that x Pareto dominates x′ if f_k(x) ≤ f_k(x′) ∀k, with at least one of the inequalities being strict. The Pareto set X⋆ is then the subset of non-dominated points in X, i.e., the set such that ∀x⋆ ∈ X⋆, ∀x ∈ X, ∃k ∈ {1, . . . , K} for which f_k(x⋆) < f_k(x). The Pareto set is considered to be optimal because for each point in that set one cannot improve one of the objectives without deteriorating some other objective. Given X⋆, the user may choose a point from this set according to their preferences, e.g., locomotion speed vs. energy consumption. The Pareto set is often not finite, and most strategies aim at finding a finite set with which to approximate X⋆ well.
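Under this definition of dominance (assuming minimization), extracting the non-dominated subset of a finite set of evaluated points is straightforward; a minimal NumPy sketch (the function names are ours, not from the paper):

```python
import numpy as np

def pareto_dominates(fa, fb):
    """True if objective vector fa Pareto dominates fb (minimization):
    fa <= fb in every objective and strictly better in at least one."""
    fa, fb = np.asarray(fa), np.asarray(fb)
    return bool(np.all(fa <= fb) and np.any(fa < fb))

def pareto_front_mask(F):
    """Boolean mask selecting the non-dominated rows of F
    (one row per point, one column per objective)."""
    F = np.asarray(F)
    n = F.shape[0]
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        for j in range(n):
            if i != j and pareto_dominates(F[j], F[i]):
                mask[i] = False
                break
    return mask

# Example: two conflicting objectives evaluated at four points.
F = np.array([[1.0, 4.0],   # non-dominated
              [2.0, 3.0],   # non-dominated
              [3.0, 3.0],   # dominated by [2, 3]
              [4.0, 1.0]])  # non-dominated
print(pareto_front_mask(F))  # -> [ True  True False  True]
```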

    It frequently happens that there is a high cost to evaluating one or more of the functions f_k(·). For example, in the robotics example, the evaluation process may involve a time-consuming experiment with the embodied robot. In this case, one wishes to minimize the number of evaluations required to obtain a useful approximation to the Pareto set X⋆. Furthermore, it is often the case that there is no simple closed form for the objectives f_k(·), i.e., they can be regarded as black boxes. One promising approach in this setting has been to use a probabilistic model such as a Gaussian process to approximate each function (Knowles, 2006; Emmerich, 2008; Ponweiser et al., 2008; Picheny, 2015). At each iteration, these strategies use the uncertainty captured by the probabilistic model to generate an acquisition (utility) function, the maximum of which provides an effective heuristic for identifying a promising location on which to evaluate the objectives. Unlike the actual objectives, the acquisition function is a function of the model and therefore relatively cheap to evaluate and maximize. This approach contrasts with model-free methods based on genetic algorithms or evolutionary strategies, which are known to be effective for approximating the Pareto set but demand a large number of function evaluations (Deb et al., 2002; Li, 2003; Zitzler & Thiele, 1999).

    Despite these successes, there are notable limitations to current model-based approaches: 1) they often build the acquisition function by transforming the multi-objective problem into a single-objective problem using scalarization techniques (an approach that is expected to be suboptimal), 2) the acquisition function generally requires the evaluation of all of the objective functions at the same location in each iteration, and 3) the computational cost of evaluating the acquisition function typically grows exponentially with the number of objectives, which limits their applicability to optimization problems with just 2 or 3 objectives.

    We describe here a strategy for multi-objective optimization that addresses these concerns. We extend previous single-objective strategies based on stepwise uncertainty reduction to the multi-objective case (Villemonteix et al., 2009; Hernández-Lobato et al., 2014; Hennig & Schuler, 2012). In the single-objective case, these strategies choose the next evaluation location based on the reduction of the Shannon entropy of the posterior estimate of the minimizer x⋆. The idea is that a smaller entropy implies that the minimizer x⋆ is better identified; the heuristic then chooses candidate evaluations based on how much they are expected to improve the quality of this estimate. These information-gain criteria have been shown to often provide better results than alternatives based, e.g., on the popular expected improvement (Hernández-Lobato et al., 2014; Hennig & Schuler, 2012; Shah & Ghahramani, 2015).

    The extension to the multi-objective case is obtained by considering the entropy of the posterior distribution over the Pareto set X⋆. More precisely, we choose the next evaluation as the one that is expected to most reduce the entropy of our estimate of X⋆. The proposed approach is called predictive entropy search for multi-objective optimization (PESMO). Several experiments involving real-world and synthetic optimization problems show that PESMO can lead to better performance than related methods from the literature. Furthermore, in PESMO the acquisition function is expressed as a sum across the different objectives, allowing for decoupled scenarios in which we can choose to only evaluate a subset of objectives at any given location. In the robotics example, one might be able to decouple the problems by estimating energy consumption from a simulator, even if the locomotion speed could only be evaluated via physical experimentation. Another example, inspired by Gelbart et al. (2014), might be the design of a low-calorie cookie: one wishes to maximize taste while minimizing calories, but calories are a simple function of the ingredients, while taste could require human trials. The results obtained show that PESMO can obtain better results with a smaller number of evaluations of the objective functions in such scenarios. Furthermore, we have observed that the decoupled evaluation provides significant improvements over a coupled evaluation when the number of objectives is large. Finally, unlike other methods (Ponweiser et al., 2008; Picheny, 2015), the computational cost of PESMO grows linearly with the number of objectives.

    2. Multi-objective Bayesian Optimization via Predictive Entropy Search

    In this section we describe the proposed approach for multi-objective optimization based on predictive entropy search. Given some previous evaluations of each objective function f_k(·), we seek to choose new evaluations that maximize the information gained about the Pareto set X⋆. This approach requires a probabilistic model for the unknown objectives, and we therefore assume that each f_k(·) follows a Gaussian process (GP) prior (Rasmussen & Williams, 2006), with observation noise that is i.i.d. Gaussian with zero mean. GPs are often used in model-based approaches to multi-objective optimization because of their flexibility and ability to model uncertainty (Knowles, 2006; Emmerich, 2008; Ponweiser et al., 2008; Picheny, 2015). For simplicity, we initially consider a coupled setting in which we evaluate all objectives at the same location in any given iteration. Nevertheless, the approach described can be easily extended to the decoupled scenario.

    Let D = {(x_n, y_n)}_{n=1}^N be the data (function evaluations) collected up to step N, where y_n is a K-dimensional vector with the values resulting from the evaluation of all objectives at step n, and x_n is a vector in input space denoting the evaluation location. The next query x_{N+1} is the one that maximizes the expected reduction in the entropy H(·) of the posterior distribution over the Pareto set X⋆, i.e., p(X⋆|D). The acquisition function of PESMO is hence:

        α(x) = H(X⋆|D) − E_y[H(X⋆|D ∪ {(x, y)})],   (1)

    where y is the output of all the GP models at x and the expectation is taken with respect to the posterior distribution for y given by these models, p(y|D, x) = ∏_{k=1}^K p(y_k|D, x). The GPs are assumed to be independent a priori. This acquisition function is known as entropy search (Villemonteix et al., 2009; Hennig & Schuler, 2012). Thus, at each iteration we set the location of the next evaluation to x_{N+1} = arg max_{x∈X} α(x).

    A practical difficulty, however, is that the exact evaluation of Eq. (1) is generally infeasible and the function must be approximated; we follow the approach described in (Hernández-Lobato et al., 2014; Houlsby et al., 2012). In particular, Eq. (1) is the mutual information between X⋆ and y given D. The mutual information is symmetric and hence we can exchange the roles of the variables X⋆ and y, leading to an expression that is equivalent to Eq. (1):

        α(x) = H(y|D, x) − E_{X⋆}[H(y|D, x, X⋆)],   (2)

    where the expectation is now with respect to the posterior distribution for the Pareto set X⋆ given the observed data, and H(y|D, x, X⋆) measures the entropy of p(y|D, x, X⋆), i.e., the predictive distribution for the objectives at x given D and conditioned on X⋆ being the Pareto set of the objective functions. This alternative formulation is known as predictive entropy search (Hernández-Lobato et al., 2014) and it significantly simplifies the evaluation of the acquisition function α(·). In particular, we no longer have to evaluate or approximate the entropy of the Pareto set X⋆, which may be quite difficult. The new acquisition function obtained in Eq. (2) favors evaluation in the regions of the input space where X⋆ is most informative about y. These are precisely also the regions in which y is most informative about X⋆.

    The first term on the r.h.s. of Eq. (2) is straightforward to evaluate; it is simply the entropy of the predictive distribution p(y|D, x), which is a factorizable K-dimensional Gaussian distribution. Thus, we have that

        H(y|D, x) = (K/2) log(2πe) + ∑_{k=1}^K 0.5 log(v_k^PD),   (3)

    where v_k^PD is the predictive variance of f_k(·) at x. The difficulty comes from the evaluation of the second term on the r.h.s. of Eq. (2), which is intractable and must be approximated; we follow Hernández-Lobato et al. (2014) and approximate the expectation using a Monte Carlo estimate of the Pareto set X⋆ given D. This involves sampling the objective functions several times from their posterior distribution p(f_1, . . . , f_K|D). This step is done as in Hernández-Lobato et al. (2014), using random kernel features and linear models that accurately approximate the samples from p(f_1, . . . , f_K|D). In practice, we generate 10 samples from the posterior of each objective f_k(·).
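Since Eq. (3) is just the differential entropy of a product of K independent univariate Gaussians, it is easy to check numerically; a small sketch (variable names are ours):

```python
import numpy as np

def gaussian_predictive_entropy(v_pd):
    """Entropy of a factorized K-dimensional Gaussian with
    per-objective predictive variances v_pd, as in Eq. (3)."""
    v_pd = np.asarray(v_pd, dtype=float)
    K = v_pd.size
    return 0.5 * K * np.log(2 * np.pi * np.e) + 0.5 * np.sum(np.log(v_pd))

# Sanity check against the closed form for a single Gaussian,
# H = 0.5 * log(2 * pi * e * v).
v = 2.5
assert np.isclose(gaussian_predictive_entropy([v]),
                  0.5 * np.log(2 * np.pi * np.e * v))
```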

    Given the samples of the objectives, we must optimize them to obtain a sample from the Pareto set X⋆. Note that, unlike the true objectives, the sampled functions can be evaluated without significant cost. Thus, given these functions, we use a grid search with d × 1,000 points to solve the corresponding multi-objective problem and find X⋆, where d is the number of dimensions. Of course, in high-dimensional problems such a grid search is expected to be sub-optimal; in that case, we use the NSGA-II evolutionary algorithm (Deb et al., 2002). The Pareto set is then approximated using a representative subset of 50 points. Given such a sample of X⋆, the differential entropy of p(y|D, x, X⋆) is estimated using the expectation propagation algorithm (Minka, 2001), as described in the next section.
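The sampling-and-optimization step described above can be sketched as follows, assuming a squared-exponential kernel so that the random-feature construction has a simple closed form; the feature count, hyper-parameter values and the tiny grid below are illustrative choices of ours, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior_sample(X, y, n_feat=200, ell=0.5, amp=1.0, noise=1e-2):
    """Return one approximate sample f(.) from a GP posterior with a
    squared-exponential kernel, via random Fourier features and
    Bayesian linear regression on those features."""
    d = X.shape[1]
    W = rng.normal(0.0, 1.0 / ell, size=(n_feat, d))
    b = rng.uniform(0.0, 2 * np.pi, size=n_feat)
    phi = lambda Z: np.sqrt(2.0 * amp / n_feat) * np.cos(Z @ W.T + b)
    P = phi(X)
    A = P.T @ P / noise + np.eye(n_feat)      # posterior precision of weights
    mean = np.linalg.solve(A, P.T @ y) / noise
    # Sample weights theta ~ N(mean, A^{-1}) via a Cholesky factor of A.
    L = np.linalg.cholesky(A)
    theta = mean + np.linalg.solve(L.T, rng.normal(size=n_feat))
    return lambda Z: phi(Z) @ theta

def sample_pareto_set(data, grid):
    """Sample one Pareto set: draw one function per objective and keep
    the non-dominated grid points (minimization)."""
    F = np.column_stack([posterior_sample(X, y)(grid) for X, y in data])
    keep = np.ones(F.shape[0], dtype=bool)
    for i in range(F.shape[0]):
        dominates_i = np.all(F <= F[i], axis=1) & np.any(F < F[i], axis=1)
        if dominates_i.any():
            keep[i] = False
    return grid[keep]

# Toy 1-d problem with two objectives observed at a few points.
X = rng.uniform(0, 1, size=(8, 1))
data = [(X, np.sin(3 * X[:, 0])), (X, np.cos(3 * X[:, 0]))]
grid = np.linspace(0, 1, 100).reshape(-1, 1)
pareto = sample_pareto_set(data, grid)
print(pareto.shape)  # a non-empty subset of the grid points
```

In real use the grid would have d × 1,000 points (or NSGA-II would replace it), and the resulting set would be thinned to 50 representative points, as stated above.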

    2.1. Approximating the Conditional Predictive Distribution Using Expectation Propagation

    To approximate the entropy of the conditional predictive distribution p(y|D, x, X⋆) we consider the distribution p(X⋆|f_1, . . . , f_K). In particular, X⋆ is the Pareto set of f_1, . . . , f_K iff ∀x⋆ ∈ X⋆, ∀x′ ∈ X, ∃k ∈ {1, . . . , K} such that f_k(x⋆) ≤ f_k(x′), assuming minimization. That is, each point within the Pareto set has to be better than or equal to any other point in the domain of the functions in at least one of the objectives. Let f be the set {f_1, . . . , f_K}. Informally, the conditions just described can be translated into the following un-normalized distribution for X⋆:

        p(X⋆|f) ∝ ∏_{x⋆∈X⋆} ∏_{x′∈X} [1 − ∏_{k=1}^K Θ(f_k(x′) − f_k(x⋆))] = ∏_{x⋆∈X⋆} ∏_{x′∈X} ψ(x′, x⋆),   (4)

    where ψ(x′, x⋆) = 1 − ∏_{k=1}^K Θ(f_k(x′) − f_k(x⋆)), Θ(·) is the Heaviside step function, and we have used the convention that Θ(0) = 1. Thus, the r.h.s. of Eq. (4) is non-zero only for a valid Pareto set. Next, we note that in the noiseless case p(y|x, f) = ∏_{k=1}^K δ(y_k − f_k(x)), where δ(·) is the Dirac delta function; in the noisy case we simply replace the delta functions with Gaussians. We can hence write the unnormalized version of p(y|D, x, X⋆) as:

        p(y|D, x, X⋆) ∝ ∫ p(y|x, f) p(X⋆|f) p(f|D) df
                      ∝ ∫ ∏_{k=1}^K δ(y_k − f_k(x)) ∏_{x⋆∈X⋆} [ψ(x, x⋆) ∏_{x′∈X∖{x}} ψ(x′, x⋆)] p(f|D) df,   (5)


    where we have separated out the factors ψ that do not depend on x, the point at which the acquisition function α(·) is going to be evaluated. The approximation to the r.h.s. of Eq. (5) is obtained in two stages. First, we approximate X with the set X̃ = {x_n}_{n=1}^N ∪ X⋆ ∪ {x}, i.e., the union of the input locations where the objective functions have already been evaluated, the current Pareto set, and the candidate location x at which α(·) should be evaluated. Then, we replace each non-Gaussian factor ψ with a corresponding approximate Gaussian factor ψ̃ whose parameters are found using expectation propagation (EP) (Minka, 2001). That is,

        ψ(x′, x⋆) = 1 − ∏_{k=1}^K Θ(f_k(x′) − f_k(x⋆)) ≈ ψ̃(x′, x⋆) = ∏_{k=1}^K φ̃_k(f_k(x′), f_k(x⋆)),   (6)

    where each approximate factor φ̃_k is an unnormalized two-dimensional Gaussian distribution. In particular, we set φ̃_k(f_k(x′), f_k(x⋆)) = exp{−(1/2) υ_k^T Ṽ_k υ_k + m̃_k^T υ_k}, where we have defined υ_k = (f_k(x′), f_k(x⋆))^T, and Ṽ_k and m̃_k are parameters to be adjusted by EP, which refines each ψ̃ until convergence to enforce that it looks similar to the corresponding exact factor ψ (Minka, 2001). The approximate factors ψ̃ that do not depend on the candidate input x are reused multiple times to evaluate the acquisition function α(·), and they only have to be computed once. The |X⋆| factors that depend on x must be obtained relatively quickly to guarantee that α(·) is not very expensive to evaluate. Thus, in practice we only update those factors once using EP, i.e., they are not refined until convergence.

    Once EP has been run, we approximate p(y|D, x, X⋆) by the normalized Gaussian that results from replacing each exact factor ψ by the corresponding approximate ψ̃. Note that the Gaussian distribution is closed under the product operation, and because all non-Gaussian factors in Eq. (5) have been replaced by Gaussians, the result is a Gaussian distribution. That is, p(y|D, x, X⋆) ≈ ∏_{k=1}^K N(f_k(x)|m_k^CPD, v_k^CPD), where the parameters m_k^CPD and v_k^CPD can be obtained from each ψ̃ and p(f_1, . . . , f_K|D). If we combine this result with Eq. (3), we obtain an approximation to the acquisition function in Eq. (2) that is given by the difference in entropies before and after conditioning on the Pareto sets. That is,

        α(x) ≈ ∑_{k=1}^K [ 0.5 log v_k^PD(x) − (1/S) ∑_{s=1}^S 0.5 log v_k^CPD(x|X⋆_(s)) ],   (7)

    where S is the number of Monte Carlo samples, {X⋆_(s)}_{s=1}^S are the Pareto sets sampled to approximate the expectation in Eq. (2), and v_k^PD(x) and v_k^CPD(x|X⋆_(s)) are, respectively, the variances of the predictive distribution at x before and after conditioning on X⋆_(s). Last, in the case of noisy observations around each f_k(·), we just increase the predictive variances by adding the noise variance. The next evaluation is simply set to x_{N+1} = arg max_{x∈X} α(x).

    Note that Eq. (7) is the sum of K functions

        α_k(x) = 0.5 log v_k^PD(x) − (1/S) ∑_{s=1}^S 0.5 log v_k^CPD(x|X⋆_(s)),   (8)

    that intuitively measure the contribution of each objective to the total acquisition. In a decoupled evaluation setting, each α_k(·) can be individually maximized to identify the location x_k^op = arg max_{x∈X} α_k(x) at which it is expected to be most useful to evaluate each of the K objectives. The objective k with the largest individual acquisition α_k(x_k^op) can then be chosen for evaluation in the next iteration. This approach is expected to reduce the entropy of the posterior over the Pareto set more quickly, i.e., with a smaller number of evaluations of the objectives, and to lead to better results.
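Given the predictive variances before and after conditioning, Eqs. (7) and (8) reduce to elementary arithmetic; a sketch of the coupled and decoupled selection rules (the array shapes and toy numbers are ours):

```python
import numpy as np

def alpha_k(v_pd_k, v_cpd_k):
    """Per-objective acquisition of Eq. (8) at a batch of candidates.
    v_pd_k:  (n,)   predictive variances of objective k.
    v_cpd_k: (S, n) conditioned variances, one row per Pareto sample."""
    return 0.5 * np.log(v_pd_k) - np.mean(0.5 * np.log(v_cpd_k), axis=0)

def alpha(v_pd, v_cpd):
    """Total acquisition of Eq. (7): sum of the K per-objective terms.
    v_pd: (K, n), v_cpd: (K, S, n)."""
    return sum(alpha_k(v_pd[k], v_cpd[k]) for k in range(v_pd.shape[0]))

# Toy numbers: K = 2 objectives, S = 3 Pareto samples, n = 4 candidates.
# Conditioning on a Pareto sample can only shrink the variance here.
rng = np.random.default_rng(1)
v_pd = rng.uniform(0.5, 2.0, size=(2, 4))
v_cpd = v_pd[:, None, :] * rng.uniform(0.3, 1.0, size=(2, 3, 4))

a = alpha(v_pd, v_cpd)
x_next = int(np.argmax(a))  # coupled: evaluate all objectives here
# decoupled: evaluate only the objective whose own maximum is largest
k_best = int(np.argmax([alpha_k(v_pd[k], v_cpd[k]).max() for k in range(2)]))
```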

    The total computational cost of evaluating the acquisition function α(x) includes the cost of running EP, which is O(K m^3), where m = N + |X⋆_(s)|, N is the number of observations made and K is the number of objectives. This is done once for each sample X⋆_(s). After this, we can re-use the factors that are independent of the candidate location x. The cost of computing the predictive variance at each x is hence O(K |X⋆_(s)|^3). In our experiments, the size of the Pareto set sample X⋆_(s) is 50, which means that m is a few hundred at most. The supplementary material contains additional details about the EP approximation to Eq. (5).

    3. Related Work

    ParEGO is another method for multi-objective Bayesian optimization (Knowles, 2006). ParEGO transforms the multi-objective problem into a single-objective problem using a scalarization technique: at each iteration, a vector of K weights θ = (θ_1, . . . , θ_K)^T, with θ_k ∈ [0, 1] and ∑_{k=1}^K θ_k = 1, is sampled at random from a uniform distribution. Given θ, a single-objective function is built:

        f_θ(x) = max_{k=1,...,K}(θ_k f_k(x)) + ρ ∑_{k=1}^K θ_k f_k(x),   (9)

    where ρ is set equal to 0.05. See (Nakayama et al., 2009, Sec. 1.3.3) for further details. After step N of the optimization process, and given θ, a new set of N observations of f_θ(·) is obtained by evaluating this function at the already observed points {x_n}_{n=1}^N. Then, a GP model is fit to the new data and expected improvement (Mockus et al., 1978; Jones et al., 1998) is used to find the location of the next evaluation x_{N+1}. The cost of evaluating the acquisition function in ParEGO is O(N^3), where N is the number of observations made. This is the cost of fitting the GP to the new data (only done once). Thus, ParEGO is a simple and fast technique. Nevertheless, it is often outperformed by more advanced approaches (Ponweiser et al., 2008).
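Eq. (9), an augmented Chebyshev scalarization, is cheap to implement; a sketch, with a Dirichlet draw standing in for ParEGO's uniform weight sampling (function and variable names are ours):

```python
import numpy as np

RHO = 0.05  # the value of rho used by ParEGO, as stated above

def parego_scalarize(F, theta):
    """Augmented Chebyshev scalarization of Eq. (9).
    F:     (n, K) objective values at n points.
    theta: (K,)   nonnegative weights summing to one."""
    F, theta = np.asarray(F, dtype=float), np.asarray(theta, dtype=float)
    weighted = F * theta
    return weighted.max(axis=1) + RHO * weighted.sum(axis=1)

# One weight vector drawn uniformly from the simplex (Dirichlet(1,...,1)).
rng = np.random.default_rng(2)
theta = rng.dirichlet(np.ones(2))
F = np.array([[1.0, 4.0], [2.0, 3.0], [4.0, 1.0]])
print(parego_scalarize(F, theta))  # one scalar target value per point
```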


    SMSego is another technique for multi-objective Bayesian optimization (Ponweiser et al., 2008). The first step in SMSego is to find a set of Pareto points X̃⋆, e.g., by optimizing the posterior means of the GPs, or by finding the non-dominated observations. Consider now an optimistic estimate of the objectives at input location x given by m_k^PD(x) − c · v_k^PD(x)^{1/2}, where c is some constant, and m_k^PD(x) and v_k^PD(x) are the posterior mean and variance of the k-th objective at location x, respectively. The acquisition value computed at a candidate location x ∈ X by SMSego is given by the gain in hyper-volume obtained by the corresponding optimistic estimate, after an ε-correction has been made. The hyper-volume is simply the volume of points in function space above the Pareto front (the function-space values associated with the Pareto set), with respect to a given reference point (Zitzler & Thiele, 1999). Because the hyper-volume is maximized by the actual Pareto set, it is a natural measure of performance. Thus, SMSego does not reduce the problem to a single objective. However, at each iteration it has to find a set of Pareto points and fit a different GP to each one of the objectives. This gives a computational cost that is O(K N^3). Finally, evaluating the gain in hyper-volume at each candidate location x is also more expensive than the computation of expected improvement in ParEGO.
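For K = 2 the hyper-volume (here for minimization, measured with respect to a reference point dominated by every front point) can be computed with a simple sorting sweep; a minimal sketch (names ours):

```python
import numpy as np

def hypervolume_2d(front, ref):
    """Hyper-volume dominated by a 2-objective Pareto front
    (minimization) with respect to the reference point ref."""
    F = np.asarray(front, dtype=float)
    F = F[np.argsort(F[:, 0])]       # ascending in objective 1
    hv, prev_y = 0.0, ref[1]
    for x, y in F:
        if y < prev_y:               # skip dominated or duplicate points
            hv += (ref[0] - x) * (prev_y - y)
            prev_y = y
    return hv

front = [[1.0, 3.0], [2.0, 2.0], [3.0, 1.0]]
print(hypervolume_2d(front, ref=(4.0, 4.0)))  # -> 6.0
```

For general K, exact computation is harder; this two-objective case is enough to see why the hyper-volume rewards fronts that are both close to the true front and well spread.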

    A method similar to SMSego is the Pareto active learning (PAL) algorithm (Zuluaga et al., 2013). At iteration N, PAL uses the GP prediction for each point x ∈ X to maintain an uncertainty region R_N(x) about the objective values associated with x. This region is defined as the intersection of R_{N−1}(x), i.e., the uncertainty region in the previous iteration, and Q_c(x), defined as the hyper-rectangle with lower corner given by m_k^PD(x) − c · v_k^PD(x)^{1/2}, for k = 1, . . . , K, and upper corner given by m_k^PD(x) + c · v_k^PD(x)^{1/2}, for k = 1, . . . , K, for some constant c. Given these regions, PAL classifies each point x ∈ X as Pareto-optimal, non-Pareto-optimal or uncertain. A point is classified as Pareto-optimal if the worst value in R_N(x) is not dominated by the best value in R_N(x′), for any other x′ ∈ X, with an ε tolerance. A point is classified as non-Pareto-optimal if the best value in R_N(x) is dominated by the worst value in R_N(x′) for some other x′ ∈ X, with an ε tolerance. All other points remain uncertain. After the classification, PAL chooses the uncertain point x with the largest uncertainty region R_N(x). The total computational cost of PAL is hence similar to that of SMSego.

    The expected hyper-volume improvement (EHI) (Emmerich, 2008) is a natural extension of expected improvement to the multi-objective setting (Mockus et al., 1978; Jones et al., 1998). Given the predictive distribution of the GPs at a candidate input location x, the acquisition is the expected increment of the hyper-volume of a candidate Pareto set X̃⋆. Thus, EHI also needs to find a Pareto set X̃⋆. This set can be obtained as in SMSego. A difficulty, however, is that computing the expected increment of the hyper-volume is very expensive. For this, the output space is divided into a series of cells, and the probability of improvement is simply obtained as the probability that the observation made at x lies in a non-dominated cell. This involves a sum across all non-dominated cells, whose number grows exponentially with the number of objectives K. In particular, the total number of cells is (|X̃⋆| + 1)^K. Thus, although some methods have been suggested to speed up its calculation, e.g., (Hupkens et al., 2014; Feliot et al., 2015), EHI is only feasible for 2 or 3 objectives at most.

    Sequential uncertainty reduction (SUR) is another method proposed for multi-objective Bayesian optimization (Picheny, 2015). The working principle of SUR is similar to that of EHI. However, SUR considers the probability of improving the hyper-volume in the whole domain of the objectives X. Thus, SUR also needs to find a set of Pareto points X̃⋆. These can be obtained as in SMSego. The acquisition computed by SUR is simply the expected decrease in the area under the probability of improving the hyper-volume, after evaluating the objectives at a new candidate location x. The SUR acquisition is also computed by dividing the output space into a total of (|X̃⋆| + 1)^K cells, and the area under the probability of improvement is obtained using a Sobol sequence as the integration points. Although some grouping of the cells has been suggested (Picheny, 2015), SUR is an extremely expensive criterion that is only feasible for 2 or 3 objectives at most.

    The proposed approach, PESMO, differs from the methods described in this section in that 1) it does not transform the multi-objective problem into a single-objective one, 2) the acquisition function of PESMO can be decomposed as the sum of K individual acquisition functions, which allows for decoupled evaluations, and 3) the computational cost of PESMO is linear in the total number of objectives K.

    4. Experiments

    We compare PESMO with the other strategies described in Section 3: ParEGO, SMSego, EHI and SUR. We do not compare results with PAL because it is expected to give similar results to those of SMSego, as both methods are based on a lower confidence bound. We have coded all these methods in the software for Bayesian optimization Spearmint (https://github.com/HIPS/Spearmint). We use a Matérn covariance function for the GPs, and all hyper-parameters (noise, length-scales and amplitude) are approximately sampled from their posterior distribution (we generate 10 samples from this distribution). The acquisition function of each method is averaged over these samples. In ParEGO we consider a different scalarization (i.e., a different value of θ) for each sample of the hyper-parameters. In SMSego, EHI and SUR, for each hyper-parameter sample we consider a different Pareto set X̃⋆, obtained by optimizing the posterior means of the GPs. The resulting Pareto set is extended by including all non-dominated observations. At iteration N, each method gives a recommendation in the form of a Pareto set obtained by optimizing the posterior means of the GPs. The acquisition function of each method is maximized using L-BFGS (a grid of size 1,000 is used to find a good starting point). The gradients of the acquisition function are approximated by differences (except in ParEGO).

    4.1. Accuracy of the PESMO Approximation

    One question is whether the proposed approximations are sufficiently accurate for the effective identification of the Pareto set. We compare, in a one-dimensional problem with 2 objectives, the acquisition function computed by PESMO with a more accurate estimate obtained via expensive Monte Carlo sampling and a non-parametric estimator of the entropy (Singh et al., 2003). Figure 1 (top) shows, at a given step, the observed data and the posterior mean and standard deviation of each objective. The figure on the bottom shows the acquisition function computed by PESMO and by the Monte Carlo method (Exact). Both functions look very similar, including the location of the global maximizer. This indicates that (7), obtained by expectation propagation, is potentially a good approximation of (2), the exact acquisition. The supplementary material has extra results showing that the individual acquisition functions computed by PESMO, i.e., α_k(·) for k = 1, 2, are also accurate.

    4.2. Experiments with Synthetic Objectives

We compare PESMO with other approaches in a 3-dimensional problem with 2 objectives obtained by sampling the functions from the GP prior. We generate 100 of these problems and report the average performance when considering noiseless observations and when the observations are contaminated with Gaussian noise with standard deviation equal to 0.1. The performance metric is the hyper-volume indicator, which is maximized by the actual Pareto set (Zitzler & Thiele, 1999). At each iteration we report the logarithm of the relative difference between the hyper-volume of the actual Pareto set (obtained by optimizing the actual objectives) and the hyper-volume of the recommendation (obtained by optimizing the posterior means of the GPs). Figure 2 (left column) shows, as a function of the evaluations made, the average performance of each method with error bars. PESMO obtains the best results, and when executed in a decoupled scenario, slight improvements are observed (only with noisy observations).
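For two objectives under minimization, the hyper-volume indicator used as the performance metric can be computed with a simple sweep over the front. A sketch under our own conventions (the reference point and the minimization assumption are ours):

```python
import numpy as np

def hypervolume_2d(front, ref):
    """Hyper-volume dominated by a 2-objective front (minimization),
    measured against a reference point worse in both objectives."""
    pts = np.asarray(front, dtype=float)
    # keep only points that actually dominate the reference point
    pts = pts[np.all(pts < ref, axis=1)]
    if pts.size == 0:
        return 0.0
    # sort by the first objective; on a non-dominated front the second
    # objective then decreases monotonically
    pts = pts[np.argsort(pts[:, 0])]
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y < prev_y:                        # skip dominated points
            hv += (ref[0] - x) * (prev_y - y)
            prev_y = y
    return hv
```

The reported metric is then the logarithm of the relative difference between this quantity for the actual Pareto set and for the recommendation.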

Table 1 shows the average time in seconds to determine the next evaluation in each method. The fastest method

Figure 1. (top) Observations of each objective and posterior mean and standard deviation of each GP model. (bottom) Estimates of the acquisition function (2) by PESMO, and by a Monte Carlo method combined with a non-parametric estimator of the entropy (Exact), which is expected to be more accurate. Best seen in color.

is ParEGO, followed by SMSego and PESMO. The decoupled version of PESMO, PESMOdec, takes more time because it optimizes α1(·) and α2(·). The slowest methods are EHI and SUR. Most of their cost is in the last iterations, in which the Pareto set size, |X̃*|, is large. The cost of evaluating the acquisition function in EHI and SUR is O((|X̃*| + 1)^K), leading to expensive optimization via L-BFGS. In PESMO the cost of evaluating α(·) is O(K|X*(s)|³) because K linear systems are solved. These computations are faster because they are performed by the OpenBLAS library, which is optimized for each processor. The acquisition function of EHI and SUR does not involve solving linear systems, so these methods cannot use OpenBLAS. Note that we also keep |X*(s)| = 50 fixed in PESMO.

Table 1. Avg. time in seconds doing calculations per iteration.

    PESMO    PESMOdec   ParEGO   SMSego   EHI       SUR
    33±1.0   52±2.5     11±0.2   16±1.3   405±115   623±59

We have carried out additional synthetic experiments with 4 objectives on a 6-dimensional input space. In this case, EHI and SUR become infeasible, so we do not compare results with them. Again, we sample the objectives from the GP prior. Figure 2 (right column) shows, as a function of the evaluations made, the average performance of each method. The best method is PESMO, and in this case the decoupled evaluation performs significantly better. This improvement arises because in the decoupled setting PESMO identifies the most difficult objectives and evaluates them more often. In particular, because there are 4 objectives, it is likely that some objectives are more difficult than others just by chance.

Figure 2. (left column) Average log relative difference between the hyper-volume of the recommendation and the maximum hyper-volume for each number of evaluations made. We consider noiseless (top) and noisy observations (bottom). The problem considered has 2 objectives and 3 dimensions. (right column) Similar results for a problem with 4 objectives and 6 dimensions. We do not compare results with EHI and SUR because they are infeasible due to their exponential cost with respect to the number of objectives. Best seen in color.

Figure 3 illustrates this behavior for a representative case in which the first two objectives are non-linear (difficult) and the last two objectives are linear (easy). We note that the decoupled version of PESMO evaluates the first two objectives almost three times more often.
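The decoupled evaluation selects both an input and an objective: each objective-specific acquisition αk(·) is maximized, and the objective whose maximum is largest is evaluated next. A minimal sketch of that outer selection step, assuming the αk are available as callables (computing them requires PESMO's expectation-propagation machinery, not reproduced here):

```python
import numpy as np

def next_decoupled_evaluation(alphas, candidates):
    """Decoupled selection sketch: given the objective-specific acquisition
    functions alpha_k (whose sum is PESMO's coupled acquisition), return the
    (objective index, input) pair with the largest acquisition value.

    alphas: list of callables, alphas[k](X) -> values of shape (n,)
    candidates: candidate inputs, shape (n, d)
    """
    best_k, best_x, best_val = None, None, -np.inf
    for k, alpha_k in enumerate(alphas):
        vals = np.asarray(alpha_k(candidates))
        i = int(np.argmax(vals))
        if vals[i] > best_val:
            best_k, best_x, best_val = k, candidates[i], vals[i]
    return best_k, best_x
```

In the actual experiments each acquisition is maximized with L-BFGS starting from a good grid point; the grid-only version above just illustrates the selection rule.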

Figure 3. (top) Contour curves of 4 illustrative objectives on 6 dimensions, obtained by changing the first two dimensions in input space while keeping the other 4 fixed to zero. The first 2 objectives are non-linear while the last 2 objectives are linear. (bottom) Number of evaluations of each objective done by PESMOdecoupled as a function of the number of iterations performed, N. Best seen in color.

    4.3. Finding a Fast and Accurate Neural Network

We consider the MNIST dataset (LeCun et al., 1998) and evaluate each method on the task of finding a neural network with low prediction error and small prediction time.

These are conflicting objectives because reducing the prediction error involves larger networks, which take longer at test time. We consider feed-forward networks with ReLUs at the hidden layers and a soft-max output layer. The networks are coded in the Keras library and are trained using Adam (Kingma & Ba, 2014) with a mini-batch size of 4,000 instances during 150 epochs. The adjustable parameters are: the number of hidden units per layer (between 50 and 300), the number of layers (between 1 and 3), the learning rate, the amount of dropout, and the level of ℓ1 and ℓ2 regularization. The prediction error is measured on a set of 10,000 instances extracted from the training set. The rest of the training data, i.e., 50,000 instances, is used for training. We consider a logit transformation of the prediction error because the error rates are very small. The prediction time is measured as the average time required for doing 10,000 predictions. We compute the logarithm of the ratio between the prediction time of the network and the prediction time of the fastest network (i.e., a single hidden layer and 50 units). When measuring the prediction time we do not train the network and consider random weights (the time objective is also set to ignore irrelevant parameters). Thus, the problem is suited for a decoupled evaluation because both objectives can be evaluated separately. We run each method for a total of 200 evaluations of the objectives and report results after 100 and 200 evaluations. Because there is no ground truth and the objectives are noisy, we re-evaluate 3 times the values associated with the recommendations made by each method (in the form of a Pareto set) and average the results.
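The two objective transformations described above (a logit of the validation error, and a log ratio of prediction times) can be written down directly; the function names are ours:

```python
import math

def error_objective(error_rate):
    """Logit transform of the validation error rate; useful because the
    raw error rates on MNIST are very small."""
    return math.log(error_rate / (1.0 - error_rate))

def time_objective(pred_time, fastest_time):
    """Log ratio of a network's prediction time to that of the fastest
    network considered (a single hidden layer with 50 units)."""
    return math.log(pred_time / fastest_time)
```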

  • Predictive Entropy Search for Multi-objective Bayesian Optimization

    Figure 4. Avg. Pareto fronts obtained by each method after 100 (left) and 200 (right) evaluations of the objectives. Best seen in color.

Then, we compute the Pareto front (i.e., the function-space values of the Pareto set) and its hyper-volume. We repeat these experiments 50 times and report the average results.

Table 2 shows the hyper-volumes obtained in the experiments (the higher, the better). The best results after 100 evaluations of the objectives correspond to the decoupled version of PESMO, followed by SUR and by the coupled version. When 200 evaluations are done, the best method is PESMO in either setting, i.e., coupled or decoupled. After PESMO, SUR gives the best results, followed by SMSego and EHI. ParEGO is the worst-performing method in either setting. In summary, PESMO gives the best overall results, and its decoupled version performs much better than the other methods when the number of evaluations is small.

Table 2. Avg. hyper-volume after 100 and 200 evaluations.

    # Eval.   PESMO     PESMOdec   ParEGO     SMSego    EHI       SUR
    100       66.2±.2   67.6±.1    62.9±1.2   65.0±.3   64.0±.9   66.6±.2
    200       67.8±.1   67.8±.1    66.1±.2    67.1±.2   66.6±.2   67.2±.1

Figure 4 shows the average Pareto front obtained by each method after 100 and 200 evaluations of the objectives. The results displayed are consistent with those in Table 2. In particular, PESMO is able to find networks that are faster than the ones found by the other methods, for a similar prediction error on the validation set. This is especially the case for PESMO when executed in a decoupled setting, after doing only 100 evaluations of the objectives. We also note that PESMO finds the most accurate networks, with almost 1.5% prediction error on the validation set.

Figure 5. Number of evaluations of each objective done by PESMOdecoupled, as a function of the iteration number N, in the problem of finding good neural networks. Best seen in color.

The good results obtained by PESMOdecoupled are explained by Figure 5, which shows the average number of evaluations of each objective. More precisely, the objective that measures the prediction time is evaluated only a few times. This makes sense because it depends on only two parameters, i.e., the number of layers and the number of hidden units per layer. It is hence simpler than the prediction error. PESMOdecoupled is able to detect this and focuses on the evaluation of the prediction error. Of course, evaluating the prediction error more times is more expensive, since it involves training the neural network more times. Nevertheless, this shows that PESMOdecoupled is able to successfully discriminate between easy and difficult objective functions.

The supplementary material has extra experiments comparing each method on the task of finding an ensemble of decision trees of small size and good prediction accuracy. The results obtained are similar to the ones reported here.

5. Conclusions

We have described PESMO, a method for multi-objective Bayesian optimization. At each iteration, PESMO evaluates the objective functions at the input location that is expected to most reduce the entropy of the posterior estimate of the Pareto set. Several synthetic experiments show that PESMO has better performance than other methods from the literature. That is, PESMO obtains better recommendations with a smaller number of evaluations, in the case of both noiseless and noisy observations. Furthermore, the acquisition function of PESMO can be understood as a sum of K individual acquisition functions, one for each of the K objectives. This allows for a decoupled evaluation scenario, in which the most promising objective is identified by maximizing the individual acquisition functions. When run in a decoupled evaluation setting, PESMO is able to identify the most difficult objectives and, by focusing on their evaluation, provides better results. This behavior of PESMO has been illustrated on a multi-objective optimization problem that consists of finding an accurate and fast neural network. Finally, the computational cost of PESMO is small. In particular, it scales linearly with the number of objectives K. Other methods have an exponential cost with respect to K, which makes them infeasible for more than 3 objectives.


Acknowledgments

Daniel Hernández-Lobato gratefully acknowledges the use of the facilities of Centro de Computación Científica (CCC) at Universidad Autónoma de Madrid. This author also acknowledges financial support from the Spanish Plan Nacional I+D+i, Grants TIN2013-42351-P and TIN2015-70308-REDT, and from Comunidad de Madrid, Grant S2013/ICE-2845 CASI-CAM-CM. José Miguel Hernández-Lobato acknowledges financial support from the Rafael del Pino Foundation. Amar Shah acknowledges support from the Qualcomm Innovation Fellowship program. Ryan P. Adams acknowledges support from the Alfred P. Sloan Foundation.

References

Ariizumi, R., Tesch, M., Choset, H., and Matsuno, F. Expensive multiobjective optimization for robotics with consideration of heteroscedastic noise. In 2014 IEEE International Conference on Intelligent Robots and Systems, pp. 2230–2235, 2014.

Collette, Y. and Siarry, P. Multiobjective Optimization: Principles and Case Studies. Springer, 2003.

Deb, K., Pratap, A., Agarwal, S., and Meyarivan, T. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6:182–197, 2002.

Emmerich, A. The computation of the expected improvement in dominated hypervolume of Pareto front approximations. Technical Report LIACS TR-4-2008, Leiden University, The Netherlands, 2008.

Feliot, P., Bect, J., and Vazquez, E. A Bayesian approach to constrained single- and multi-objective optimization. 2015. arXiv:1510.00503 [stat.CO].

Gelbart, M. A., Snoek, J., and Adams, R. P. Bayesian optimization with unknown constraints. In Thirtieth Conference on Uncertainty in Artificial Intelligence, 2014.

Hennig, P. and Schuler, C. J. Entropy search for information-efficient global optimization. Journal of Machine Learning Research, 13:1809–1837, 2012.

Hernández-Lobato, J. M., Hoffman, M. W., and Ghahramani, Z. Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems 27, pp. 918–926, 2014.

Houlsby, N., Hernández-Lobato, J. M., Huszár, F., and Ghahramani, Z. Collaborative Gaussian processes for preference learning. In Advances in Neural Information Processing Systems 25, pp. 2096–2104, 2012.

Hupkens, I., Emmerich, M., and Deutz, A. Faster computation of expected hypervolume improvement. 2014. arXiv:1408.7114 [cs.DS].

Jones, D. R., Schonlau, M., and Welch, W. J. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. 2014. arXiv:1412.6980.

Knowles, J. ParEGO: a hybrid algorithm with on-line landscape approximation for expensive multiobjective optimization problems. IEEE Transactions on Evolutionary Computation, 10:50–66, 2006.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Li, X. A non-dominated sorting particle swarm optimizer for multiobjective optimization. In Genetic and Evolutionary Computation GECCO 2003, pp. 37–48, 2003.

Minka, T. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, MIT, 2001.

Mockus, J., Tiesis, V., and Zilinskas, A. The application of Bayesian methods for seeking the extremum. Towards Global Optimization, 2(117-129):2, 1978.

Nakayama, H., Yun, Y., and Yoon, M. Sequential Approximate Multiobjective Optimization Using Computational Intelligence. Springer, 2009.

Picheny, V. Multiobjective optimization using Gaussian process emulators via stepwise uncertainty reduction. Statistics and Computing, 25:1265–1280, 2015.

Ponweiser, W., Wagner, T., Biermann, D., and Vincze, M. Multiobjective optimization on a limited budget of evaluations using model-assisted S-metric selection. In Parallel Problem Solving from Nature PPSN X, pp. 784–794, 2008.

Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2006.

Shah, A. and Ghahramani, Z. Parallel predictive entropy search for batch global optimization of expensive objective functions. In Advances in Neural Information Processing Systems 28, pp. 3312–3320, 2015.

Singh, H., Misra, N., Hnizdo, V., Fedorowicz, A., and Demchuk, E. Nearest neighbor estimates of entropy. American Journal of Mathematical and Management Sciences, 23:301–321, 2003.

Villemonteix, J., Vazquez, E., and Walter, E. An informational approach to the global optimization of expensive-to-evaluate functions. Journal of Global Optimization, 44:509–534, 2009.

Zitzler, E. and Thiele, L. Multiobjective evolutionary algorithms: a comparative case study and the strength Pareto approach. IEEE Transactions on Evolutionary Computation, 3:257–271, 1999.

Zuluaga, M., Krause, A., Sergent, G., and Püschel, M. Active learning for multi-objective optimization. In International Conference on Machine Learning, pp. 462–470, 2013.