Transcript of Machine Learning for Optimization Masterclass, Day 1 (Peter I. Frazier, Operations Research & Information Engineering, Cornell University; Lancaster University, Lancaster, UK, January 2014).

  • Machine Learning for Optimization: Masterclass Day 1

    Peter I. Frazier

    Operations Research & Information Engineering, Cornell University

    January 2014

    Lancaster University

    Lancaster, UK

  • Outline

    1 Introduction

    2 Gaussian Process Regression

    3 Noise-Free Global Optimization (Expected Improvement; Where is it Useful?; Knowledge-Gradient)

    4 Noisy Global Optimization

    5 Applications (Improving Customer Experience and Revenue at Yelp; Developing Inexpensive Organic Photovoltaic Cells; Simulation Calibration at Schneider National; Designing Cardiovascular Bypass Grafts; Drug Development for Ewing's Sarcoma; Experimental Design for Biochemistry and Materials Science)

    6 Conclusion

  • Noise-Free Global Optimization

    [Figure: objective function f(x), evaluations y(n), and global maximum x*.]

    Objective function f : R^d → R, continuous but generally not concave.

    Feasible set A ⊆ R^d. Our goal is to solve

    max_{x ∈ A} f(x)

    Typically, f is time-consuming to evaluate, derivative information is unavailable, and the dimension is not too large (d < 20).

  • Noise-Free Global Optimization Has Lots of Applications

    [Slide shows a page from Yang et al., Computer Methods in Applied Mechanics and Engineering 199 (2010): shape optimization of cardiovascular geometries for the Fontan procedure (Y-graft design) using CFD.]

    Design of grafts to be used in heart surgery. [Yang et al., 2010]

    Design of aerodynamic structures, e.g., cars, airplanes. [Forrester et al., 2008]

    Calibrating the parameters of a climate model to historical data.

    Predicting crystal structures by finding arrangements of molecules with minimum energy (current project).

  • Noisy Global Optimization

    [Figure: objective f(x), noisy evaluations y(n), and global maximum x*.]

    We cannot evaluate f(x) directly.

    Instead, we have a stochastic simulator that can evaluate f(x) with noise.

    It gives us g(x, ω) = f(x) + ε(x, ω), where E[g(x, ω)] = f(x). Our goal is still to find a global maximum,

    max_{x ∈ A} f(x)

    The term simulation optimization is also used.

  • Noisy Global Optimization Has Lots of Applications

    [Slide shows a page from Shi, Chen, and Yücesan: a hybrid algorithm using optimal computing budget allocation (OCBA) for a stochastic resource allocation problem (buffer allocation in a 10-node supply-chain network), where the objective has no analytical expression and is estimated through simulation.]

    Choose staffing levels in a hospital, using a discrete-event simulator.

    Choose an admissions control policy in a complex queuing system, e.g., a call center.

    Tuning algorithm parameters at Yelp (case study).

    Calibrate a logistics model to historical data (case study).

    Drug development (case study).

  • What is Bayesian Global Optimization?

    Bayesian Global Optimization (BGO) is a class of algorithms for solving Noise-Free and Noisy Global Optimization problems.

    These algorithms use methods from Bayesian statistics to decide where to sample.

  • BGO uses Bayesian Statistics to Decide Where to Sample

    Given the function evaluations obtained so far, a BGO algorithm uses Bayesian methods to get:

    estimates of f(x) over the feasible set; uncertainties in these estimates. Together, these are described by the posterior distribution.

    [Plot: posterior estimate of f with uncertainty bands over x.]

    BGO uses the posterior distribution to decide where to evaluate next.

  • Typical BGO Algorithm

    1 Choose several initial points x and evaluate f(x) or g(x, ω).

    2 While the stopping criterion is not met:

    2a. Calculate the Bayesian posterior distribution on f from the points observed.

    2b. Use the posterior to decide where to evaluate next.

    3 Based on the most recent posterior distribution, report the point with the best estimated value.

    The stopping criterion is often to stop after N samples, but can be more sophisticated.

    [Three plots: the posterior over f updating as points are evaluated.]
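The loop above can be sketched in code. This is a minimal illustration, not an implementation from the talk: GP regression and a principled acquisition rule are replaced by a toy nearest-observation mean plus a distance-to-data exploration bonus, just to make steps 1, 2a, 2b, and 3 concrete. The function `bgo_minimal` and its internals are hypothetical names.

```python
import numpy as np

def bgo_minimal(f, candidates, n_init=3, budget=10, seed=0):
    """Minimal sketch of the generic BGO loop from the slide.

    Evaluate a few initial points, then repeatedly (2a) summarize what we
    know and (2b) pick the next point, and finally (3) report the best
    observed point.  A toy 'posterior' (nearest-neighbor mean plus a
    distance-to-data uncertainty bonus) stands in for GP regression + EI.
    """
    rng = np.random.default_rng(seed)
    X = list(rng.choice(candidates, size=n_init, replace=False))  # step 1
    y = [f(x) for x in X]
    while len(X) < budget:                         # step 2: stopping criterion
        d = np.array([[abs(c - xi) for xi in X] for c in candidates])
        mean = np.array(y)[d.argmin(axis=1)]       # 2a: toy posterior mean
        bonus = d.min(axis=1)                      #     toy uncertainty
        score = mean + bonus                       # 2b: exploit + explore
        x_next = candidates[int(score.argmax())]
        X.append(x_next)
        y.append(f(x_next))
    best = int(np.argmax(y))                       # step 3: report best point
    return X[best], y[best]
```

For example, `bgo_minimal(lambda x: -(x - 0.3) ** 2, np.linspace(0, 1, 101))` concentrates its budget near the maximizer x = 0.3.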

  • Animation of a BGO Algorithm

    [Animation frames: posterior over f (top) and expected improvement EI(x) (bottom) as samples are added one at a time.]

  • Outline

    1 Introduction

    2 Gaussian Process Regression

    3 Noise-Free Global Optimization (Expected Improvement; Where is it Useful?; Knowledge-Gradient)

    4 Noisy Global Optimization

    5 Applications (Improving Customer Experience and Revenue at Yelp; Developing Inexpensive Organic Photovoltaic Cells; Simulation Calibration at Schneider National; Designing Cardiovascular Bypass Grafts; Drug Development for Ewing's Sarcoma; Experimental Design for Biochemistry and Materials Science)

    6 Conclusion

  • Illustration of Gaussian Process (GP) Regression

    Left: 2 noise-free function evaluations (blue), estimate of f (solid red), confidence bounds on this estimate (dashed red).

    Right: One more function evaluation is added.

    [Two plots: GP estimate and confidence bounds, before and after the added evaluation.]

  • Gaussian Process Regression: Two Points

    Fix two points x and x′.

    Consider the values of f at these points, f(x) and f(x′).

    [Figure: f evaluated at x and x′.]


  • Gaussian Process Regression: Two Points

    f(x) and f(x′) are unknown before we measure them.

    In Bayesian statistics, we model our uncertainty about f with a prior probability distribution:

    [ f(x), f(x′) ] ~ N( [ μ0(x), μ0(x′) ], [ Σ0(x,x), Σ0(x,x′) ; Σ0(x′,x), Σ0(x′,x′) ] )

    Here, μ0(·) and Σ0(·,·) are functions to be discussed later. In general, f(x) and f(x′) are correlated.
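This joint prior can be checked numerically by drawing from the bivariate normal at two fixed points. The zero mean and the squared-exponential covariance with α0 = α1 = 1 are illustrative assumptions, not choices made on the slide:

```python
import numpy as np

# Joint GP prior at two points x and x', as on the slide: a bivariate
# normal with mean [mu0(x), mu0(x')] and covariance matrix built from
# Sigma0.  mu0 = 0 and the covariance below are illustrative choices.
def cov(x, xp, alpha0=1.0, alpha1=1.0):
    return alpha0 * np.exp(-alpha1 * (x - xp) ** 2)

x, xp = 0.2, 0.5
mean = np.zeros(2)
Sigma = np.array([[cov(x, x),  cov(x, xp)],
                  [cov(xp, x), cov(xp, xp)]])

rng = np.random.default_rng(1)
draws = rng.multivariate_normal(mean, Sigma, size=5000)
corr = np.corrcoef(draws.T)[0, 1]   # near cov(x, xp), since variances are 1
```

The empirical correlation of the draws recovers the prior covariance between f(x) and f(x′), confirming that the two values are correlated under the prior.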

  • Gaussian Process Regression: Two Points

    [Animation frames: contour plot of the joint prior density of (f(x), f(x′)) with the marginal density of each; as f(x) is observed, the conditional density of f(x′) concentrates.]

  • Nearby Points Have Stronger Correlation

    The closer x and x′ are in the feasible domain, the stronger the correlation under our belief between f(x) and f(x′).

    [Figure: joint density of (f(x), f(x′)) for nearby x and x′, showing strong correlation.]

    This should be enforced by our choice of Σ0(·,·). A common choice is the power exponential:

    Σ0(x, x′) = α0 exp( -α1 ||x - x′||^2 )
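A two-line check of this property, with illustrative parameter values α0 = α1 = 1 (assumptions, not values fixed by the slide): the prior covariance decays with distance, so nearby points are more strongly tied together.

```python
import numpy as np

# Power-exponential covariance from the slide,
# Sigma0(x, x') = alpha0 * exp(-alpha1 * ||x - x'||^2).
def power_exp(x, xp, alpha0=1.0, alpha1=1.0):
    diff = np.asarray(x, float) - np.asarray(xp, float)
    return alpha0 * np.exp(-alpha1 * np.sum(diff ** 2))

near = power_exp(0.0, 0.1)   # prior covariance at two nearby points
far = power_exp(0.0, 1.0)    # ... and at two distant points
```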

  • Gaussian Process Regression: Two Close Points

    [Animation frames: joint density of (f(x), f(x′)) for two close points; because the prior correlation is strong, observing f(x) sharply constrains f(x′).]

  • GP Regression: Formal Definition

    A GP prior on an unknown function f : R^d → R is parameterized by: a mean function μ0(·); a covariance function Σ0(·,·), which must be positive semi-definite.

    Definition: A prior P on a function f is a Gaussian Process (GP) prior with mean function μ0 and covariance function Σ0 if: for any given set of points x1, . . . , xk, under P,

    [ f(x1), . . . , f(xk) ] ~ N( [ μ0(x1), . . . , μ0(xk) ], [ Σ0(xi, xj) ]_{i,j=1,...,k} )
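The definition translates directly into code: to visualize a GP prior, evaluate the mean and covariance functions on a grid of points and draw from the resulting multivariate normal. The kernel and its parameters below are illustrative assumptions:

```python
import numpy as np

def gp_prior_draws(xs, mu0, Sigma0, n_draws=3, seed=0, jitter=1e-8):
    """Draw sample paths from a GP prior at the points xs, straight from
    the definition: [f(x1), ..., f(xk)] is multivariate normal with mean
    [mu0(xi)] and covariance [Sigma0(xi, xj)]."""
    xs = np.asarray(xs, dtype=float)
    m = np.array([mu0(x) for x in xs])
    K = np.array([[Sigma0(a, b) for b in xs] for a in xs])
    K += jitter * np.eye(len(xs))  # numerical safeguard for PSD-ness
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(m, K, size=n_draws)

paths = gp_prior_draws(
    np.linspace(0, 1, 50),
    mu0=lambda x: 0.0,
    Sigma0=lambda a, b: np.exp(-10 * (a - b) ** 2),  # illustrative kernel
)
```

Each row of `paths` is one function sampled from the prior; plotting them is how figures like the ones in this talk are generated.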

  • The Posterior Can be Computed Analytically

    Suppose we have observed

    f(~x) = [f(x1), . . . , f(xn)].

    Fix any x*. The posterior on f(x*) is

    f(x*) | f(~x) ~ N( μn(x*), σn^2(x*) ).

    When μ0(·) = 0, μn(x*) and σn^2(x*) are:

    μn(x*) = Σ0(x*, ~x) Σ0(~x, ~x)^-1 f(~x),

    σn^2(x*) = Σ0(x*, x*) - Σ0(x*, ~x) Σ0(~x, ~x)^-1 Σ0(~x, x*)
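These formulas are a few lines of linear algebra. The sketch below assumes μ0 = 0 as on the slide; the kernel and data are made-up illustrations. Note that the posterior interpolates noise-free data: at an observed point the mean equals the observation and the variance is zero.

```python
import numpy as np

def gp_posterior(x_star, X, fX, Sigma0):
    """Posterior mean and variance at x_star given noise-free observations
    fX = f(X), using the slide's formulas with prior mean mu0 = 0:
      mu_n(x*)      = S(x*, X) S(X, X)^-1 f(X)
      sigma_n^2(x*) = S(x*, x*) - S(x*, X) S(X, X)^-1 S(X, x*)
    """
    X = np.asarray(X, dtype=float)
    k_star = np.array([Sigma0(x_star, xi) for xi in X])
    K = np.array([[Sigma0(a, b) for b in X] for a in X])
    w = np.linalg.solve(K, k_star)            # avoids forming K^-1 explicitly
    mean = w @ np.asarray(fX, dtype=float)
    var = Sigma0(x_star, x_star) - w @ k_star
    return mean, max(var, 0.0)                # clip tiny negative round-off

kernel = lambda a, b: np.exp(-10.0 * (a - b) ** 2)   # illustrative choice
X, fX = [0.1, 0.5, 0.9], [0.2, 1.0, -0.3]
m, v = gp_posterior(0.5, X, fX, kernel)  # at an observed point: m = f, v = 0
```

Using `np.linalg.solve` rather than an explicit matrix inverse anticipates the computational remarks later in the talk.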

  • Illustrative 1D Example

    [Animation frames: 1-D GP posterior mean and confidence bounds updating as evaluations are added.]

  • GP Regression and BGO Work in More Than 1 Dimension

    For clarity, most of the illustrations in this talk are in one dimension.

    GP Regression and BGO can also be applied in R^d with d > 1, and also in combinatorial spaces.

    The key is including a notion of distance in Σ0(·,·), e.g.

    Σ0(x, x′) = α0 exp( -α1 ||x - x′||^2 )

    [Contour plots over a 2-D domain (Bonus 1 vs. Bonus 2): mean of posterior μn, standard deviation of posterior, and log(KG factor); plus a plot of log10(best fit) vs. n.]

  • Choice of 0()

    The Gaussian process prior is parameterized by μ0(·) and Σ0(·,·). How should we choose these functions?

    One common choice for μ0(·) is simply to set it to a constant μ. Typically, one estimates this constant adaptively using maximum likelihood. (Discussed later.) Alternatively, if one places an independent normal prior on μ, then this can be folded back into the GP prior. Coupled with typical choices for Σ0(·,·), this produces a prior that is stationary across the domain: for any a, the likelihood that f(x) = a does not depend on x.

  • Choice of 0()

    Alternatively, if we suspect strong trends in f, we can choose a collection of basis functions ψ1, . . . , ψK, and set

    μ0(x) = μ + β1 ψ1(x) + · · · + βK ψK(x).

    This generally does not produce a stationary prior. Typically one estimates μ, β1, . . . , βK using maximum likelihood. Alternatively, one can place normal priors on the βk.

  • Choice of Σ0(·,·)

    We usually choose Σ0(·,·) from one of a few parametric classes of covariance functions.

    Isometric Gaussian:

    Σ0(x, x′) = α0 exp( -α1 ||x - x′||2^2 )

    Power exponential:

    Σ0(x, x′) = α0 exp( -Σ_{d=1}^{D} αd |ed · (x - x′)|^p )

    For others, see [Cressie, 1993, Rasmussen and Williams, 2006].

    By choosing different parameter values, we can encode different beliefs in the smoothness of f.

    [Three plots: GP sample paths under different covariance parameters, showing different degrees of smoothness.]

    We estimate these parameters adaptively using maximum likelihood.

  • Empirical Bayes Estimation of Parameters

    We have observed x1, . . . , xn, and y1 = f(x1), . . . , yn = f(xn).

    We have a Gaussian process prior with μ0(·), Σ0(·,·). μ0(·) and Σ0(·,·) are parameterized in turn by a collection of parameters θ. To estimate θ, we calculate the density of the prior at the observed data,

    P(y1, . . . , yn; θ)

    This density is multivariate normal with a mean and covariance that depend on θ. We find the θ that maximizes this density, and this is our estimate:

    arg max_θ P(y1, . . . , yn; θ)

    We generally update this estimate as we obtain more data.
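A one-parameter sketch of this procedure, assuming a zero-mean GP and estimating only the length-scale parameter α1 of a squared-exponential covariance (the data and parameterization are illustrative, not from the talk):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_marginal_likelihood(log_alpha1, X, y):
    """Negative log density of the observations under a zero-mean GP prior
    with covariance exp(-alpha1 * (x - x')^2).  alpha1 plays the role of
    theta; we optimize its log so the search is unconstrained in sign."""
    alpha1 = np.exp(log_alpha1)
    X = np.asarray(X, float)
    K = np.exp(-alpha1 * (X[:, None] - X[None, :]) ** 2)
    K += 1e-6 * np.eye(len(X))                  # jitter for stability
    sign, logdet = np.linalg.slogdet(K)
    quad = y @ np.linalg.solve(K, y)
    return 0.5 * (logdet + quad + len(X) * np.log(2 * np.pi))

X = np.linspace(0, 1, 20)
y = np.sin(6 * X)                               # synthetic 'observed' data
res = minimize_scalar(neg_log_marginal_likelihood, bounds=(-5, 5),
                      args=(X, y), method="bounded")
alpha1_hat = float(np.exp(res.x))
```

In practice all kernel parameters (and the constant mean) are estimated jointly, and the estimate is refreshed as new evaluations arrive, as the slide notes.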

  • You can use cross-validation to check your model

    The plot at left was obtained from a deterministic simulation of blood flow near the heart.

    For each datapoint, we hide that datapoint and train the model (including estimating model parameters) on the rest of the datapoints. We then get the mean and variance of the marginal posterior on the value of the held-out point.

    We plot an error bar corresponding to this mean and variance as Predicted, and the actual value as Actual.
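In code, the hold-one-out check looks like the following sketch. For brevity this version keeps the kernel hyperparameters fixed across folds, whereas the slide re-estimates model parameters on each fold; the kernel and data are illustrative:

```python
import numpy as np

def loo_check(X, y, Sigma0):
    """Leave-one-out check from the slide: hold out each datapoint, fit a
    zero-mean GP posterior on the rest, and return the held-out point's
    predicted mean and standard deviation alongside its actual value."""
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    out = []
    for i in range(len(X)):
        keep = np.arange(len(X)) != i
        Xi, yi = X[keep], y[keep]
        k = np.array([Sigma0(X[i], xj) for xj in Xi])
        K = np.array([[Sigma0(a, b) for b in Xi] for a in Xi])
        w = np.linalg.solve(K + 1e-10 * np.eye(len(Xi)), k)
        mean = w @ yi
        var = max(Sigma0(X[i], X[i]) - w @ k, 0.0)
        out.append((mean, np.sqrt(var), y[i]))   # ("Predicted", "Actual")
    return out

kernel = lambda a, b: np.exp(-5.0 * (a - b) ** 2)   # illustrative kernel
X = np.linspace(0, 1, 8)
results = loo_check(X, np.sin(3 * X), kernel)
```

Plotting each predicted mean with an error bar against the actual value reproduces the diagnostic described on the slide.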

  • Try transforming your objective to improve performance

    You can also try transforming your objective via any strictly increasing transformation to improve the ability of your Gaussian process prior to fit the objective.

    Strictly increasing transformations preserve the order between alternatives' values, preserving the solution to the optimization problem.

    If your objective is strictly positive, try log and square root.

  • I am leaving out some computational issues

    There are ways to avoid performing matrix inversions when calculating the posterior.

    This improves speed and increases numerical accuracy.

    For details see [Rasmussen and Williams, 2006] or [Forrester et al., 2008].

    Or, use an existing software package to do the computations (pointers at the end of the talk).

  • Outline

    1 Introduction

    2 Gaussian Process Regression

    3 Noise-Free Global Optimization (Expected Improvement; Where is it Useful?; Knowledge-Gradient)

    4 Noisy Global Optimization

    5 Applications (Improving Customer Experience and Revenue at Yelp; Developing Inexpensive Organic Photovoltaic Cells; Simulation Calibration at Schneider National; Designing Cardiovascular Bypass Grafts; Drug Development for Ewing's Sarcoma; Experimental Design for Biochemistry and Materials Science)

    6 Conclusion


  • Expected Improvement

    In BGO, we use the posterior distribution to decide where to sample next.

    One classic method is called Efficient Global Optimization (EGO), and is based on the idea of Expected Improvement.

    This method is due to [Jones et al., 1998], building on ideas in [Mockus, 1972].

  • Expected Improvement

    Suppose we've measured n points x1, . . . , xn, and observed f(x1), . . . , f(xn).

    Let f*n = max_{m=1,...,n} f(xm) be the best value observed so far.

    If we measure at a new point x, the improvement in our objective function is

    [f(x) - f*n]^+

    The expected improvement is

    EIn(x) = En[ [f(x) - f*n]^+ ],

    where En indicates the expectation taken with respect to the time-n posterior distribution.

  • Expected Improvement Can Be Computed Analytically

    Let Δn(x) = μn(x) - f*n be the difference between our estimate of f(x) and the best value observed so far. Then,

    EIn(x) = En[ [f(x) - f*n]^+ ] = [Δn(x)]^+ + σn(x) φ( Δn(x)/σn(x) ) - |Δn(x)| Φ( -|Δn(x)|/σn(x) ),

    where Φ and φ are the normal cdf and pdf.

    [Plots: posterior over f (top) and EIn(x) (bottom).]
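The closed-form expression is easy to implement, and it can be sanity-checked against the equivalent textbook form Δn(x) Φ(Δn/σn) + σn φ(Δn/σn); a sketch:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu_n, sigma_n, f_best):
    """Closed-form EI from the slide: with Delta = mu_n - f_best,
    EI = [Delta]^+ + sigma*phi(Delta/sigma) - |Delta|*Phi(-|Delta|/sigma)."""
    delta = mu_n - f_best
    if sigma_n <= 0:                      # no posterior uncertainty left
        return max(delta, 0.0)
    z = delta / sigma_n
    return (max(delta, 0.0)
            + sigma_n * norm.pdf(z)
            - abs(delta) * norm.cdf(-abs(delta) / sigma_n))
```

A short case split shows the two forms agree: for Δ > 0, [Δ]^+ - |Δ|Φ(-Δ/σ) = Δ(1 - Φ(-Δ/σ)) = ΔΦ(Δ/σ); for Δ ≤ 0, -|Δ|Φ(Δ/σ) = ΔΦ(Δ/σ).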

  • Expected Improvement

    The EGO/EI policy chooses to sample at the point with the largest expected improvement,

    x_{n+1} ∈ arg max_x EIn(x)

    [Plots: posterior over f (top) and EIn(x) (bottom); the next sample is taken at the EI maximizer.]

  • Expected Improvement

    The EGO/EI policy chooses to sample at the point with the largest expected improvement,

    x_{n+1} ∈ arg max_x EIn(x)

    Each time we decide which point to evaluate next (to solve our overall optimization problem), we have to solve an optimization problem!

    We have replaced one optimization problem (max_{x ∈ A} f(x)) with many optimization problems (max_x EIn(x), for n = 1, 2, 3, . . .). Why is this a good thing?

    Evaluating f(x) is expensive (minutes, hours, days), and derivative information is unavailable. Evaluating EIn(x) is quick (microseconds), and derivative information is available.

  • Maximize Expected Improvement

    The EGO/EI policy chooses to sample at the point with the largest expected improvement,

    x_{n+1} ∈ arg max_x EIn(x)

    One can calculate the gradient of EIn(x) with respect to x.

    To solve max_x EIn(x), use a first-order method combined with multistart.
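A sketch of this recipe with SciPy: L-BFGS-B supplies the gradient-based local solver (here with finite-difference gradients for brevity, although the analytic EI gradient is available), restarted from several points. The posterior μn, σn, and f*n below are made-up stand-ins, not a real GP posterior:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Multistart first-order maximization of EIn over [0, 1], as on the slide.
mu_n = lambda x: np.sin(6 * x)          # hypothetical posterior mean
sigma_n = lambda x: 0.3 + 0.2 * x       # hypothetical posterior std. dev.
f_best = 0.5                            # hypothetical best observed value

def neg_ei(x):
    """Negative EIn(x), so that minimizing it maximizes EI."""
    x = float(np.atleast_1d(x)[0])
    d, s = mu_n(x) - f_best, sigma_n(x)
    return -(max(d, 0.0) + s * norm.pdf(d / s) - abs(d) * norm.cdf(-abs(d) / s))

starts = np.linspace(0.05, 0.95, 10)    # multistart: many local searches
best = min((minimize(neg_ei, [s0], bounds=[(0, 1)]) for s0 in starts),
           key=lambda r: r.fun)
x_next = float(best.x[0])
```

Keeping the best local optimum over all starts guards against EI's many local maxima, which is exactly why multistart is recommended.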

  • EI Trades Exploration vs. Exploitation

    EIn(x) is bigger when Δn(x) is bigger. EIn(x) is bigger when σn(x) is bigger. These two tendencies often push against each other, and the EI policy must balance them.

    [Plots: posterior over f (top) and EIn(x) (bottom).]

  • EI Trades Exploration vs. Exploitation

    EI_n(x) = [Δ_n(x)]⁺ + σ_n(x) φ(Δ_n(x)/σ_n(x)) − |Δ_n(x)| Φ(−|Δ_n(x)|/σ_n(x))

    EI_n(x) is determined by Δ_n(x) = μ_n(x) − f*_n and σ_n(x).

    EI_n(x) increases as Δ_n(x) increases. Measure where f(x) seems large. (Exploitation)

    EI_n(x) increases as σ_n(x) increases. Measure where we are uncertain about f(x). (Exploration)

  • EI Trades Exploration vs. Exploitation

    EI_n(x) is bigger when Δ_n(x) is bigger. EI_n(x) is bigger when σ_n(x) is bigger. Below is a contour plot of EI_n(x). Red is bigger EI.

    [Figure: contour plot of EI_n(x) over Δ_n(x) (horizontal axis) and σ_n(x) (vertical axis).]

  • EGO Animation

    [Figure (animation): GP posterior (top) and EI_n(x) (bottom) as EGO adds samples.]


  • Outline

    1 Introduction

    2 Gaussian Process Regression

    3 Noise-Free Global Optimization
       Expected Improvement
       Where is it Useful?
       Knowledge-Gradient

    4 Noisy Global Optimization

    5 Applications
       Improving Customer Experience and Revenue at Yelp
       Developing Inexpensive Organic Photovoltaic Cells
       Simulation Calibration at Schneider National
       Designing Cardiovascular Bypass Grafts
       Drug Development for Ewing's Sarcoma
       Experimental Design for Biochemistry and Materials Science

    6 Conclusion

  • Requirement for Use: Expensive Function Evaluation

    BGO is only useful when function evaluation is time-consuming or expensive.

    In the simulation calibration problem discussed later, each function evaluation takes 3 days.

    In the drug development problem discussed later, each function evaluation takes several days.

    How expensive is expensive enough? Function evaluation should take significantly longer than the time that the BGO algorithm requires to decide where to sample next.

    BGO takes longer to decide where to take each sample, but requires fewer samples than other methodologies (when it works well).

  • Requirement for Use: Lack of Gradient Information

    If gradient information is available, it is usually better to simply use a multistart first-order method.

    Gradient information can be incorporated into a BGO algorithm to improve its speed, but this is difficult and is not covered here.

    Incorporating gradient information into BGO algorithms remains an area for research.

  • Other Derivative-Free Global Optimization Methods

    Many other derivative-free noise-tolerant global optimization methods exist, e.g.,

    pattern search, e.g., Nelder-Mead
    stochastic approximation, e.g., SPSA [Spall 1992]
    evolutionary algorithms, simulated annealing, tabu search
    response surface methods [Myers & Montgomery 2002]
    Lipschitzian optimization, e.g., DIRECT [Gablonsky et al. 2001]

    BGO methods require more computation to decide where to evaluate next, but require fewer evaluations to find global extrema (caveat: when the prior is chosen well).

    [Huang et al. 2006] compares sequential kriging optimization (a BGO method) against DIRECT [Gablonsky et al. 2001], Nelder-Mead modified for noise by [Humphrey et al. 2000], and SPSA [Spall 1992], and finds that SKO requires fewer function evaluations.

  • BGO is a Surrogate Method

    BGO methods operate by maintaining a posterior distribution on the unknown objective function f.

    There is a class of global optimization methods called surrogate methods that maintain a cheap-to-evaluate approximation to the objective function, and use this to decide where to sample next (see, e.g., [Booker et al., 1999, Regis and Shoemaker, 2005]).

    The mean of the posterior distribution can be thought of as a surrogate, and so, loosely speaking, BGO methods are a type of surrogate method.


  • Best estimated overall value might be at an unmeasured point

    The improvement considered by EI is:

    [f(x) − f*_n]⁺ = max(f(x), f*_n) − f*_n = f*_{n+1} − f*_n,

    where f*_n = max_{m≤n} f(x^m) is the best value we've measured by time n. But the point with the best estimated value might not be a point we've measured.

    [Figure: GP posterior; the point with the best posterior mean is not one of the measured points.]

  • We can measure improvement w.r.t. the best overall value

    Replace f*_n = max_{m≤n} f(x^m) = max_{m≤n} μ_n(x^m) with μ*_n = max_{x∈A} μ_n(x).

    [Figure: GP posterior, with the maximum of the posterior mean μ*_n marked.]

    The corresponding improvement is μ*_{n+1} − μ*_n. The corresponding value for taking a sample is

    E_n[μ*_{n+1} − μ*_n | x^{n+1} = x].

    The policy that measures at the x with the largest such value is called the knowledge-gradient with correlated beliefs (KGCB) policy.

  • Knowledge-Gradient with Correlated Beliefs (KGCB)

    Call this modified expected improvement the knowledge-gradient (KG) factor

    KG_n(x) = E_n[μ*_{n+1} − μ*_n | x^{n+1} = x].

    The KGCB policy measures at the point with the largest KG factor,

    x^{n+1} ∈ arg max_x KG_n(x).

    [Figure: GP posterior at n=4 and n=5 (top), with the EGO EI and KG factor criteria plotted against x (bottom); sampled points (x^k, μ^k) marked.]

  • KGCB Requires Fewer Function Evaluations than EGO, but More Computation

    [Figure: log10 opportunity cost (OC) vs. iterations n for KG and EGO (left); the difference EGO OC − KG OC (right).]

    Graph shows the difference in expected solution quality between KGCB and EGO, on noise-free problems. KGCB needs fewer function evaluations to find a good solution, but more computation to decide where to evaluate.


  • Noisy Global Optimization

    Thus far we have assumed noise-free function evaluations f (x).

    What if we observe function evaluations with noise, g(x, ω)? We use the same approach:

    1 Use GP regression to calculate the posterior on f(x) = E[g(x, ω)] from noisy function evaluations.
    2 Use the posterior to decide where to sample next.
    3 Repeat.

  • GP Regression Can Be Generalized to Allow Noise

    What if we have noisy measurements? i.e., we observe g(x, ω) = f(x) + ε(x, ω). If the noise is normally distributed with a known (possibly heterogeneous) variance, then we can still calculate the posterior in essentially the same way.

    In practice, the noise is neither normal nor of known variance, but it remains a useful approximation. (In practice, one estimates the variance as you go.)

    Current research examines what can be done to get rid of this approximation (e.g., stochastic kriging from [Ankenman, Nelson and Staum 2010]).

  • Illustrative 1D Example with Noise

    [Figure: GP posterior fit to noisy observations (value vs. x).]


  • KGCB Can be Generalized to Allow Noise

    When there is noise, the definition of the KG factor remains the same.

    KG_n(x) = E_n[μ*_{n+1} − μ*_n | x^{n+1} = x].

    The KGCB policy still measures at the point with the largest KG factor,

    x^{n+1} ∈ arg max_x KG_n(x).

    All that changes is that the estimate μ_n(x) incorporates noise.

    [Figure: GP posterior fit to noisy observations.]

  • Illustrative 1D Example with Noise (KGCB)

    [Figure (animation): GP posterior (top) and log(KG factor) (bottom) as KGCB adds noisy samples.]


  • How do we compute the KG factor?

    Recall that the KG factor is

    KG_n(x) = E_n[μ*_{n+1} − μ*_n | x^{n+1} = x].

    To compute this, we must determine the distribution of μ*_{n+1} = max_x μ_{n+1}(x) (given what we know at time n, and given that we are about to measure x^{n+1} = x).

    This distribution is determined by the conditional distribution of the vector μ_{n+1} = (μ_{n+1}(x) : x). This distribution is determined in turn by how μ_{n+1} depends on y^{n+1}, and by the distribution of y^{n+1}.

  • μ_{n+1} depends linearly on y^{n+1}

    Suppose momentarily that our space is finite and not too large.

    Let f = (f(x) : x) be a vector containing the value of f at all points in our feasible space.

    Then the posterior distribution on f is multivariate normal, f | x^{1:n}, y^{1:n} ~ N(μ_n, Σ_n). μ_n and Σ_n may be computed recursively:

    μ_{n+1} = μ_n + [(y^{n+1} − μ_n(x)) / (λ²(x) + Σ_n(x, x))] Σ_n(·, x),

    Σ_{n+1} = Σ_n − Σ_n(·, x) Σ_n(x, ·) / (λ²(x) + Σ_n(x, x)),

    where x = x^{n+1}.

    In the above: λ²(x) is the variance of the noise when measuring at x; Σ_n(·, x) is the column vector composed of all rows and just column x from Σ_n; and Σ_n(x, ·) is similarly a row vector.
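    These recursive updates are a few lines of NumPy. A sketch over a finite grid (the squared-exponential prior covariance and the numbers in the usage lines are illustrative assumptions, not from the slides):

```python
import numpy as np

def gp_update(mu, Sigma, x, y, lam2):
    """One recursive posterior update after observing y = f(x) + noise at grid index x.

    mu: posterior mean vector; Sigma: posterior covariance matrix;
    lam2: noise variance λ²(x) at the measured point.
    """
    Sx = Sigma[:, x]                    # column Σ_n(·, x)
    denom = lam2 + Sigma[x, x]          # λ²(x) + Σ_n(x, x)
    mu_new = mu + (y - mu[x]) / denom * Sx
    Sigma_new = Sigma - np.outer(Sx, Sx) / denom   # Σ_n(·, x) Σ_n(x, ·) / denom
    return mu_new, Sigma_new

# toy usage on a 3-point grid with a squared-exponential prior covariance
pts = np.array([0.0, 0.5, 1.0])
Sigma0 = np.exp(-((pts[:, None] - pts[None, :]) ** 2) / 0.5)
mu0 = np.zeros(3)
mu1, Sigma1 = gp_update(mu0, Sigma0, x=1, y=2.0, lam2=0.1)
```

    The posterior mean moves toward the observation at the measured point, and the posterior variance there shrinks; nearby points move too because of the prior correlation.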

  • y^{n+1} is normal

    From the previous page, μ_{n+1} = μ_n + [(y^{n+1} − μ_n(x)) / (λ²(x) + Σ_n(x, x))] Σ_n(·, x). At time n, the only randomness in this expression comes from y^{n+1}.

    y^{n+1} = f(x) + ε^{n+1}
    f(x) | x^{1:n}, y^{1:n} ~ N(μ_n(x), Σ_n(x, x))
    ε^{n+1} ~ N(0, λ²(x))
    y^{n+1} | x^{1:n}, y^{1:n} ~ N(μ_n(x), Σ_n(x, x) + λ²(x))

    Thus, μ_{n+1} = μ_n + σ̃_n Z, where Z = (y^{n+1} − μ_n(x)) / √(Σ_n(x, x) + λ²(x)) is a univariate N(0, 1) random variable and σ̃_n = Σ_n(·, x) / √(λ²(x) + Σ_n(x, x)) is a vector that depends on what we measure, x = x^{n+1}.

  • We use this to compute the KG factor

    Recall μ*_{n+1} = max_x μ_{n+1}(x). Recall KG_n(x) = E_n[μ*_{n+1} − μ*_n | x^{n+1} = x].

    We just worked out μ_{n+1} = μ_n + σ̃_n Z, for some univariate standard normal Z.

    So KG_n(x^{n+1}) = E_n[max_x μ_n(x) + σ̃_n(x) Z − μ*_n | x^{n+1}], and it turns out there is a nice algebraic expression for the expectation of the maximum of a collection of linear functions of a univariate normal that we will see momentarily.
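    Before seeing that algebraic expression, a brute-force Monte Carlo check is instructive: draw Z ~ N(0, 1), form μ_{n+1} = μ_n + σ̃_n Z, and average the improvement in the maximum. A sketch (the two-alternative example in the usage line is illustrative; for it, KG = E[max(Z, 0)] = 1/√(2π) ≈ 0.399):

```python
import numpy as np

def kg_monte_carlo(mu, sigma_tilde, n_samples=200_000, seed=0):
    """Monte Carlo estimate of KG_n(x) = E_n[max_i μ_{n+1,i} − max_i μ_{n,i}],
    using μ_{n+1} = μ_n + σ̃_n Z with Z ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal(n_samples)
    mu = np.asarray(mu, dtype=float)
    sigma_tilde = np.asarray(sigma_tilde, dtype=float)
    # each column of mu_next is one realization of the time-(n+1) posterior mean
    mu_next = mu[:, None] + sigma_tilde[:, None] * Z[None, :]
    return float(np.mean(mu_next.max(axis=0)) - mu.max())

kg = kg_monte_carlo([0.0, 0.0], [1.0, 0.0])  # two alternatives, measuring the first
```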

  • Computing the Knowledge-Gradient

    [Figure (animation): prior μ_n^i and posterior μ_{n+1}^i over alternatives i, with a measurement taken at x = 49 (left); the best posterior value max_i μ_{n+1}^i as a function of the observation y^{n+1} (right).]


  • Computing the Knowledge Gradient

    [Figure: the best posterior value max_i μ_{n+1}^i is a piecewise-linear function of the rescaled observation Z, equal to a_x + b_x Z on each interval, with breakpoints c_0 = −∞ < c_1 ≤ c_2 ≤ c_3 < c_4 = +∞.]

    E_n[max_x a_x + b_x Z] = Σ_x E_n[(a_x + b_x Z) 1{c_{x−1} ≤ Z < c_x}]

  • Computing the Knowledge Gradient

    In general, to compute the KG factor for a candidate measurement:

    Let A contain those alternatives that are best under the posterior with nonzero probability.

    Let [j] denote the jth entry in A.

    Let a_j = μ_n([j]) and b_j = σ̃_n([j]). Sort A in order of increasing b_j.

    The KG factor is

    Σ_{j=1}^{|A|−1} (b_{j+1} − b_j) f(−|a_{j+1} − a_j| / (b_{j+1} − b_j)),

    where f(z) = φ(z) + z Φ(z), φ is the normal pdf and Φ is the normal cdf.
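    A sketch of this computation (here a = (μ_n(i))_i and b = (σ̃_n(i))_i over all alternatives; the pass that discards alternatives never attaining the maximum is written as a standard upper-envelope scan, an implementation choice rather than anything prescribed by the slide):

```python
import numpy as np
from scipy.stats import norm

def kg_factor(a, b):
    """Exact KG factor E[max_i (a_i + b_i Z)] − max_i a_i for Z ~ N(0, 1),
    via the sort-and-scan formula Σ_j (b_{j+1} − b_j) f(−|c_j|), f(z) = φ(z) + z Φ(z)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    order = np.argsort(a)
    order = order[np.argsort(b[order], kind="stable")]  # sort by b, ties broken by a
    a, b = a[order], b[order]
    keep = np.concatenate([b[1:] != b[:-1], [True]])    # equal slopes: keep largest intercept
    a, b = a[keep], b[keep]
    # upper envelope of the lines a_j + b_j z, tracked with a stack
    idx, c = [0], [-np.inf]        # envelope line indices and their left breakpoints
    for j in range(1, len(a)):
        while True:
            i = idx[-1]
            z = (a[i] - a[j]) / (b[j] - b[i])  # intersection of lines i and j
            if z <= c[-1]:
                idx.pop(); c.pop()             # line i never attains the maximum; discard
            else:
                idx.append(j); c.append(z)
                break
    a, b = a[idx], b[idx]
    c = np.array(c[1:])            # interior breakpoints c_1, ..., c_{|A|−1}
    f = lambda z: norm.pdf(z) + z * norm.cdf(z)
    return float(np.sum((b[1:] - b[:-1]) * f(-np.abs(c))))
```

    For two alternatives with a = (0, 0) and b = (0, 1) this returns φ(0) ≈ 0.399, agreeing with the Monte Carlo estimate from the earlier slide.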

  • When the feasible space is large, use approximations instead

    When the feasible space is large or infinite, computing the KG factor exactly becomes difficult or impossible.

    Instead, use an approximation:

    For continuous spaces, use [Scott et al., 2011].

    For large discrete spaces, use [Xie et al., 2013].

  • There are Many Other BGO Methods

    There are other similar BGO methods: [Kushner, 1964, Mockus et al., 1978, Stuckman, 1988, Mockus, 1989, Calvin and Zilinskas, 2002, Calvin and Zilinskas, 2005, Huang et al., 2006, Forrester et al., 2006, Taddy et al., 2009, Villemonteix et al., 2009, Kleijnen et al., 2011], . . .

    If you want to dive deeper into these other methods, understanding EI and KG will make it much easier to read these other papers.



  • We are using BGO at Yelp


  • We are using BGO to solve crystal structures, to design better photovoltaic cells

    [Figure: minimum energy found (kcal/mol) vs. # function evaluations, for the optimization algorithm and the previous best.]

    With Paulette Clancy and Kristina Lenn (Cornell Chemical Eng.). Each function evaluation evaluates the energy of a particular crystal configuration, and requires 1 to 2 minutes of computation. The dashed line shows the lowest energy configuration found by a human, using 10 hours CPU time and weeks of manual inspection. The optimization algorithm was able to find a lower energy configuration, automatically, and using only 1 hour of CPU time.


  • Simulation Model Calibration at Schneider National

    The logistics company Schneider National uses a large simulation-based optimization model to try "what if" scenarios. The model has several input parameters that must be tuned to make its behavior match reality before it can be used. The model is tuned by hand once per year on the most recent data. Each tuning effort requires between 1 and 2 weeks.

    [Embedded images: Schneider National, 2008 Warren B. Powell, slides 113-114.]

    (Joint work with Warren B. Powell and Hugo Simao, Princeton University,[Frazier et al., 2009a])

  • Model Parameters

    Input parameters to the model include:

    time-at-home bonuses.pacing parameters describing how fast and far drivers drive per day.gas prices. . .

    Output parameters from the model include:

    billed miles
    driver utilization
    average number of trips home per driver per 4 weeks
    proportion of drivers without time at home over 4 weeks
    . . .

    Some of these inputs are known (e.g., gas prices), but some are unknown (e.g. time-at-home bonuses).

    Goal: adjust the inputs to make the optimal solution found by the model match current practice.

  • Simulation Model Calibration

    Goal: adjust the inputs to make the optimal solution found by the ADP model match current practice.

    x is a set of inputs to the simulator.
    f(x) is how closely the simulator output matches history.

    Running the simulator for one set of bonuses takes 3 days, making calibration difficult.

    The model may be run for shorter periods of time, e.g. 12 hours, to obtain noisy output estimates.
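As a hedged sketch of this setup (the discrepancy form, parameter names, and toy simulator are illustrative assumptions, not Schneider's actual model), the calibration objective f(x) can be phrased as a negative error between simulator output and history, so that maximizing f means matching reality more closely:

```python
import math

def calibration_objective(simulate, inputs, history):
    """Negative RMSE between simulator output statistics and history.

    simulate: callable mapping input parameters to a dict of output
              statistics (billed miles, driver utilization, ...).
    history:  dict of the same statistics observed in practice.
    Larger is better: f(x) = 0 means a perfect match.
    """
    outputs = simulate(inputs)
    sq_errs = [(outputs[k] - history[k]) ** 2 for k in history]
    return -math.sqrt(sum(sq_errs) / len(sq_errs))

# Toy stand-in for the 3-day ADP simulation: a higher time-at-home
# bonus yields more trips home but lower driver utilization.
def toy_simulator(inputs):
    bonus = inputs["time_at_home_bonus"]
    return {"trips_home_per_4wk": 1.0 + 0.5 * bonus,
            "driver_utilization": 0.9 - 0.1 * bonus}

history = {"trips_home_per_4wk": 2.0, "driver_utilization": 0.7}
f_good = calibration_objective(toy_simulator, {"time_at_home_bonus": 2.0}, history)
f_bad = calibration_objective(toy_simulator, {"time_at_home_bonus": 0.0}, history)
```

In the real problem each evaluation of `simulate` costs days, which is exactly why a BGO method that economizes on evaluations is attractive here.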

  • BGO is Flexible Enough to Handle Non-stationary Output

    The output of the simulator is non-stationary.

    Running the simulator to convergence takes too long (3 days).

    With just 12 hours of samples, we can use Bayesian statistics to get a noisy estimate of where the path is going.

    [Figure: left panel shows a sample path of the solo TAH output over iterations n; right panel shows the estimate of Gk(·), comparing the average of data after n = 100, the average of all data, the posterior mean, and the posterior mean ± 2 std. dev.]

  • Simulation Model Calibration Results

    [Figure, four panels over the two bonus parameters (Bonus 1, Bonus 2): mean of the posterior at iteration n, standard deviation of the posterior, log(KG factor), and log10(best fit) vs. n.]


  • Simulation Model Calibration Results

    The KG method calibrates the model in approximately 3 days, compared to 7–14 days when tuned by hand. The calibration is automatic, freeing the human calibrator to do other work.

    The KG method calibrates as accurately as, or better than, by-hand calibration.

    Current practice uses the year's calibrated bonuses for each new "what if" scenario, but to enforce the constraint on driver at-home time it would be better to recalibrate the model for each scenario. Automatic calibration with the KG method makes this feasible.

  • Outline

    1 Introduction

    2 Gaussian Process Regression

    3 Noise-Free Global Optimization
        Expected Improvement
        Where is it Useful?
        Knowledge-Gradient

    4 Noisy Global Optimization

    5 Applications
        Improving Customer Experience and Revenue at Yelp
        Developing Inexpensive Organic Photovoltaic Cells
        Simulation Calibration at Schneider National
        Designing Cardiovascular Bypass Grafts
        Drug Development for Ewing's Sarcoma
        Experimental Design for Biochemistry and Materials Science

    6 Conclusion

  • Design of Cardiovascular Bypass Grafts

    Joint work with Alison Marsden, Sethuraman Sankaran (UCSD, Mechanical and Aerospace Engineering), Jing Xie, Saleh Elmohamed (Cornell). [Xie et al., 2012]

  • Design of Cardiovascular Bypass Grafts

    We work with an idealized model of a cardiovascular bypass graft.

    Our goal is to choose the attachment angles to minimize the area of low wall-shear stress, subject to uncertainty about graft implementation, and environmental conditions within the body.

    To evaluate the area of low wall-shear stress for a particular set of implemented angles, and a particular set of environmental conditions, we have a fluid-flow simulation.

  • Design of Cardiovascular Bypass Grafts

    Target attachment angles x = (x1, x2) are given to the surgeon.

    Actual attachment angles constructed in surgery are θ = (θ1, θ2) = x + ε, where ε = (ε1, ε2) are the implementation errors introduced during surgery.

    Stenosis radius r and blood inflow velocity v are environmental variables.

  • Design of Cardiovascular Bypass Grafts

    The area of low wall-shear stress (WSS) given θ and ω = (r, v) is f(θ, ω). f can be evaluated exactly through expensive simulation. The joint probability density of (ε, ω) is assumed known, and is denoted p(ε, ω). Our goal is to find x1 and x2 to minimize the average area of low WSS,

    min_{x1, x2} ∫ p(ε, ω) f(x + ε, ω) dε dω.

    [Embedded video: Velocity_slice-1.mp4, a velocity slice from the fluid-flow simulation.]
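The averaged objective above can be approximated by Monte Carlo. The sketch below is purely illustrative: the synthetic `synthetic_wss_area` stands in for the expensive fluid-flow simulation, and the Gaussian implementation errors and uniform ranges for (r, v) are assumed densities, not the ones used in the study:

```python
import random

def mc_average_low_wss(f, x, n_samples=2000, seed=0):
    """Monte Carlo estimate of E[f(x + eps, omega)].

    eps:   implementation errors in the two attachment angles,
           here independent N(0, 5^2) degrees (assumed).
    omega: (stenosis radius r, inflow velocity v), drawn from
           assumed uniform ranges.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        eps = (rng.gauss(0.0, 5.0), rng.gauss(0.0, 5.0))
        omega = (rng.uniform(0.5, 1.5), rng.uniform(0.8, 1.2))
        theta = (x[0] + eps[0], x[1] + eps[1])
        total += f(theta, omega)
    return total / n_samples

# Synthetic stand-in for the WSS simulation: the low-WSS area grows
# as the implemented angles deviate from 45 degrees.
def synthetic_wss_area(theta, omega):
    r, v = omega
    return ((theta[0] - 45.0) ** 2 + (theta[1] - 45.0) ** 2) / 100.0 + r / v

est_good = mc_average_low_wss(synthetic_wss_area, x=(45.0, 45.0))
est_bad = mc_average_low_wss(synthetic_wss_area, x=(90.0, 90.0))
```

Because each real evaluation of f is an expensive simulation, the actual method replaces this brute-force averaging with value-of-information calculations over a GP posterior, choosing which (x, ε, ω) to simulate next.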

  • Design of Cardiovascular Bypass Grafts

    The algorithm chooses not just which x to evaluate, but also which ε and ω. The problem setting is a bit different from typical simulation-optimization problems, but we still use value-of-information calculations to decide where to sample next.

    We compare against the method from [Sankaran and Marsden, 2011], which uses stochastic collocation within the surrogate management framework, and which was designed for problems of this type.

    We compare on a test problem which is faster to run than the true fluid-flow simulation. Comparison on the fluid-flow simulation is in progress.

  • Design of Cardiovascular Bypass Grafts

  • Outline

    1 Introduction

    2 Gaussian Process Regression

    3 Noise-Free Global Optimization
        Expected Improvement
        Where is it Useful?
        Knowledge-Gradient

    4 Noisy Global Optimization

    5 Applications
        Improving Customer Experience and Revenue at Yelp
        Developing Inexpensive Organic Photovoltaic Cells
        Simulation Calibration at Schneider National
        Designing Cardiovascular Bypass Grafts
        Drug Development for Ewing's Sarcoma
        Experimental Design for Biochemistry and Materials Science

    6 Conclusion

  • Ewing's Sarcoma is a Pediatric Bone Cancer

    Long-term survival rate is 60–80% for localized disease, and 20% following metastasis.

  • Drug Development is Global Optimization

    We have a large number of chemically related small molecules, some of which might make a good drug.

    We can synthesize and test the quality of these molecules, but each molecule tested takes days of effort.

    f(x) is the quality of molecule x, and g(x, ω) is the noisy test result. We would like to find a good drug with a limited number of tests.

    (Joint work with Jeffrey Toretsky, M.D. (Georgetown), Diana Negoescu (Stanford), Warren B. Powell (Princeton), [Negoescu et al., 2011])

  • We Use a Gaussian Process Prior

    The molecules we consider share a common skeleton, and are described by which substituents are present at each location.

    [Figure: substituent table reproduced from Katz, Osborne, and Ionescu, Journal of Medicinal Chemistry, 1977, Vol. 20, No. 11.]

    We use Gaussian process regression over the discrete, combinatorial space of molecules. (Over a discrete space, this is also called Bayesian linear regression.)

    The covariance Σ0(x, x′) of two molecules x and x′ is larger when the two molecules have more substituents in common.

    This is called the Free-Wilson model in medicinal chemistry [Free and Wilson, 1964].
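A minimal sketch of such a prior covariance (the substituent encoding, the count-based form, and the scale parameter sigma2 are illustrative assumptions, not the exact kernel used in the study):

```python
def free_wilson_cov(x1, x2, sigma2=1.0):
    """Prior covariance between two molecules under a Free-Wilson-style
    model: each molecule is a tuple of substituents, one per skeleton
    location, and covariance grows with the number of shared substituents.
    """
    shared = sum(a == b for a, b in zip(x1, x2))
    return sigma2 * shared

# Molecules on a 3-site skeleton, encoded by the substituent at each site.
mol_a = ("CH3", "OH", "H")
mol_b = ("CH3", "OH", "Cl")   # shares 2 substituents with mol_a
mol_c = ("NH2", "Br", "Cl")   # shares 0 substituents with mol_a
```

Under this prior, testing one molecule also updates the belief about every molecule that shares substituents with it, which is what lets KGCB learn quickly over tens of thousands of compounds.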

  • KGCB Works Well in Tests

    [Figure 4.13: opportunity cost, max(truth) − truth(best), vs. # measurements, for KG and Pure Exploration; CKG with the Free-Wilson model for 99 compounds, using the informative prior and a single truth. With the informative prior, the learning rate is significantly faster than under the non-informative prior, and CKG is the only policy that finds the best compounds in the first 100 measurements.]

    Average over 100 sample paths on randomly selected subsets of benzomorphan compounds of size 99.

  • KGCB Works Well in Tests

    [Figure 5.8: mean opportunity cost, max(truth) − truth(max belief), vs. measurement #, for CKG and Pure Exploration, averaged over nine runs on data sets of 25,000 compounds. Figure 5.9: a sample path using the entire data set of 87,120 compounds.]

    One sample path on the full set of 87,120 benzomorphan compounds.

  • Discussion: KGCB Works Well So Far. . .

    BGO methods work well in test problems using a chemical dataset from the literature [Negoescu et al., 2011].

    Application to Ewing's sarcoma is ongoing.

    Our fingers are crossed. . .

  • Outline

    1 Introduction

    2 Gaussian Process Regression

    3 Noise-Free Global Optimization
        Expected Improvement
        Where is it Useful?
        Knowledge-Gradient

    4 Noisy Global Optimization

    5 Applications
        Improving Customer Experience and Revenue at Yelp
        Developing Inexpensive Organic Photovoltaic Cells
        Simulation Calibration at Schneider National
        Designing Cardiovascular Bypass Grafts
        Drug Development for Ewing's Sarcoma
        Experimental Design for Biochemistry and Materials Science

    6 Conclusion

  • (Jump to keynote presentation)

  • Outline

    1 Introduction

    2 Gaussian Process Regression

    3 Noise-Free Global Optimization
        Expected Improvement
        Where is it Useful?
        Knowledge-Gradient

    4 Noisy Global Optimization

    5 Applications
        Improving Customer Experience and Revenue at Yelp
        Developing Inexpensive Organic Photovoltaic Cells
        Simulation Calibration at Schneider National
        Designing Cardiovascular Bypass Grafts
        Drug Development for Ewing's Sarcoma
        Experimental Design for Biochemistry and Materials Science

    6 Conclusion

  • Software

    DiceKriging (http://cran.r-project.org/web/packages/DiceKriging/index.html) and DiceOptim (http://cran.r-project.org/web/packages/DiceOptim/index.html).

    These are R packages for doing statistical estimation and optimization using kriging and Gaussian process priors. They are described in the packages' documentation, and this paper:

    Roustant, Ginsbourger, Deville (2012) DiceKriging, DiceOptim: Two R Packages for the Analysis of Computer Experiments by Kriging-Based Metamodeling and Optimization, Journal of Statistical Software

    The version of KGCB for continuous spaces from [Scott et al., 2011] is implemented as the AKG (approximate KG) method.


  • Software

    All software is free unless otherwise noted.

    http://optimallearning.princeton.edu/ and go to Downloadable Software

    TOMLAB (http://tomopt.com/tomlab/), a (commercial) Matlab add-on with implementations of noise-free EGO on continuous spaces.

    SPACE (http://www.schonlau.net/space.html), an implementation of EGO in C on continuous spaces.

    the matlabKG library (http://people.orie.cornell.edu/pfrazier/src.html), an implementation of the KGCB algorithm for noisy discrete problems. I am planning to improve this library, both with respect to speed and usability; if you use it, please send me an email and share your experiences.

    Software library accompanying the book by Forrester, Sobester & Keane: go to http://www.soton.ac.uk/~aijf197/ and search for software.


  • More Software

    dace, a Matlab kriging toolbox (http://www2.imm.dtu.dk/~hbn/dace/). A Matlab library for doing kriging, which is very similar to GP regression. Assumes noise-free function evaluations, but can be easily tweaked.

    stochastic kriging (http://stochastickriging.net/). Matlab code for obtaining kriging estimates with unknown and variable sampling noise.

    For other GP regression software from the machine learning community, see http://www.gaussianprocess.org.


  • Introductory Reading (articles)

    Brochu, E., Cora, V. M., and de Freitas, N. (2009).

    A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning.

    Technical Report TR-2009-23, Department of Computer Science, University of British Columbia.

    Powell, W. and Frazier, P. (2008).

    Optimal Learning.

    TutORials in Operations Research: State-of-the-Art Decision-Making Tools in the Information-Intensive Age, pages 213–246.

  • Introductory Reading (books)

    Powell, W. and Ryzhov, I. (2012). Optimal Learning. Wiley.

    Forrester, A., Sobester, A., and Keane, A. (2008). Engineering design via surrogate modelling: a practical guide. Wiley, West Sussex, UK.

    Rasmussen, C. and Williams, C. (2006). Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA.

  • More Advanced Reading

    Advanced surveys and research papers may be found at http://people.orie.cornell.edu/pfrazier/

    Some introductory (and advanced) material may be found at http://optimallearning.princeton.edu/

    The original reference for the KGCB algorithm is [Frazier et al., 2009b].


  • Conclusion

    BGO methods use the Bayesian posterior on the unknown function to decide where to sample next.

    They tend to require a lot of computation to decide where to sample, but reduce the overall number of samples required.

    They are very flexible, and the Bayesian statistical model used can be tuned to new applications (non-stationary output, combinatorial feasible set, . . . ).
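As a concrete instance of posterior-based sampling, the expected improvement criterion from Section 3 has a standard closed form in the GP posterior mean and standard deviation. The sketch below (written for maximization; the toy numbers are illustrative) is a minimal stdlib implementation:

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, f_best):
    """EI at a candidate point for maximization:
    EI = (mu - f_best) * Phi(z) + sigma * phi(z),  z = (mu - f_best) / sigma,
    where mu and sigma are the GP posterior mean and std. dev. at the
    candidate, and f_best is the best value observed so far.
    """
    if sigma <= 0.0:
        return max(mu - f_best, 0.0)
    z = (mu - f_best) / sigma
    return (mu - f_best) * normal_cdf(z) + sigma * normal_pdf(z)

# All else equal, a candidate with higher posterior uncertainty
# carries more expected improvement.
ei_low = expected_improvement(mu=1.0, sigma=0.1, f_best=1.2)
ei_high = expected_improvement(mu=1.0, sigma=1.0, f_best=1.2)
```

The next sample is taken where this acquisition value is largest; knowledge-gradient methods replace EI with a value-of-information calculation but follow the same posterior-driven loop.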

  • References I

    Booker, A., Dennis, J., Frank, P., Serafini, D., Torczon, V., and Trosset, M. (1999).

    A rigorous framework for optimization of expensive functions by surrogates. Structural and Multidisciplinary Optimization, 17(1):1–13.

    Brochu, E., Cora, V. M., and de Freitas, N. (2009).

    A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Technical Report TR-2009-23, Department of Computer Science, University of British Columbia.

    Calvin, J. and Zilinskas, A. (2002).

    One-dimensional Global Optimization Based on Statistical Models. Nonconvex Optimization and its Applications, 59:49–64.

    Calvin, J. and Zilinskas, A. (2005).

    One-Dimensional global optimization for observations with noise. Computers & Mathematics with Applications, 50(1-2):157–169.

    Cressie, N. (1993).

    Statistics for Spatial Data, revised edition. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. Wiley Interscience, New York.

    Forrester, A., Keane, A., and Bressloff, N. (2006).

    Design and Analysis of Noisy Computer Experiments. AIAA Journal, 44(10):2331–2339.

    Forrester, A., Sobester, A., and Keane, A. (2008).

    Engineering design via surrogate modelling: a practical guide. Wiley, West Sussex, UK.

  • References II

    Frazier, P., Powell, W., and Simao, H. (2009a).

    Simulation model calibration with correlated knowledge-gradients. In Winter Simulation Conference Proceedings, 2009, Piscataway, New Jersey. Institute of Electrical and Electronics Engineers, Inc.

    Frazier, P., Powell, W. B., and Dayanik, S. (2009b).

    The knowledge gradient policy for correlated normal beliefs. INFORMS Journal on Computing, 21(4):599–613.

    Huang, D., Allen, T., Notz, W., and Miller, R. (2006).

    Sequential kriging optimization using multiple-fidelity evaluations. Structural and Multidisciplinary Optimization, 32(5):369–382.

    Jones, D., Schonlau, M., and Welch, W. (1998).

    Efficient Global Optimization of Expensive Black-Box Functions. Journal of Global Optimization, 13(4):455–492.

    Kleijnen, J., van Beers, W., and van Nieuwenhuyse, I. (2011).

    Expected improvement in efficient global optimization thr