
Introduction to Optimization

Marc Toussaint

July 2, 2014

This is a direct concatenation and reformatting of all lecture slides and exercises from the Optimization course (summer term 2014, U Stuttgart), including a bullet point list to help prepare for exams.

Contents

1 Introduction
Types of optimization problems (1:3)

2 Unconstrained Optimization Basics
Plain gradient descent (2:1) Stepsize and step direction as core issues (2:2) Stepsize adaptation (2:4) Backtracking (2:5) Line search (2:5) Steepest descent direction (2:9) Covariant gradient descent (2:11) Newton direction (2:12) Newton method (2:13) Gauss-Newton method (2:18) Quasi-Newton methods (2:21) Broyden-Fletcher-Goldfarb-Shanno (BFGS) (2:23) Conjugate gradient (2:26) Rprop (2:31) Gradient descent convergence (2:34) Wolfe conditions (2:36) Trust region (2:37)

3 Constrained Optimization
Constrained optimization (3:1) Log barrier method (3:6) Central path (3:9) Squared penalty method (3:12) Augmented Lagrangian method (3:14) Lagrangian: definition (3:21) Lagrangian: relation to KKT (3:24) Karush-Kuhn-Tucker (KKT) conditions (3:25) Lagrangian: saddle point view (3:27) Lagrange dual problem (3:29) Log barrier as approximate KKT (3:33) Primal-dual interior-point Newton method (3:36) Phase I optimization (3:40)

4 Convex Optimization
Function types: convex, quasi-convex, uni-modal (4:2) Linear program (LP) (4:7) Quadratic program (QP) (4:7) LP in standard form (4:8) Simplex method (4:11) LP-relaxations of integer programs (4:15) Sequential quadratic programming (4:23)

5 Blackbox Optimization: Local, Stochastic & Model-based Search
Blackbox optimization: definition (5:1) Blackbox optimization: overview (5:3) Greedy local search (5:5) Stochastic local search (5:6) Simulated annealing (5:7) Random restarts (5:10) Iterated local search (5:11) Variable neighborhood search (5:13) Coordinate search (5:14) Pattern search (5:15) Nelder-Mead simplex method (5:16) General stochastic search (5:20) Evolutionary algorithms (5:23) Covariance Matrix Adaptation (CMA) (5:24) Estimation of Distribution Algorithms (EDAs) (5:28) Model-based optimization (5:31) Implicit filtering (5:34)

6 Global & Bayesian Optimization
Bandits (6:4) Exploration, Exploitation (6:6) Belief planning (6:8) Upper Confidence Bound (UCB) (6:12) Global Optimization as infinite bandits (6:17) Gaussian Processes as belief (6:19) Expected Improvement (6:24) Maximal Probability of Improvement (6:24) GP-UCB (6:24)

7 Exercises

8 Bullet points to help learning

Index


1 Introduction

Why Optimization is interesting!

• In an otherwise unfortunate interview I’ve been asked why “we guys” (AI, ML, optimal control people) always talk about optimality. “People are by no means optimal”, the interviewer said. I think that statement pinpoints the whole misunderstanding of the role and concept of optimality principles.

– Optimality principles are a means of scientific (or engineering) description.

– It is often easier to describe a thing (natural or artificial) via an optimality principle than directly

• Which science does not use optimality principles to describe nature & artifacts?
– Physics, Chemistry, Biology, Mechanics, ...
– Operations research, scheduling, ...
– Computer Vision, Speech Recognition, Machine Learning, Robotics, ...

• Endless applications

1:1

Teaching optimization

• Standard: Convex Optimization, Numerical Optimization

• Discrete Optimization (Stefan Funke)

• Exotics: Evolutionary Algorithms, Swarm optimization, etc

• In this lecture I try to cover the standard topics, but include as

well work on stochastic search & global optimization

1:2

Rough Types of Optimization Problems

• Generic optimization problem:

Let x ∈ Rn, f : Rn → R, g : Rn → Rm, h : Rn → Rl. Find

min_x f(x)   s.t.   g(x) ≤ 0 , h(x) = 0

• Blackbox: only f(x) can be evaluated

• Gradient: ∇f(x) can be evaluated

• Gauss-Newton type: f(x) = φ(x)>φ(x) and∇φ(x) can be evaluated

• 2nd order: ∇2f(x) can be evaluated

• “Approximate upgrade”:

– Use samples of f(x) to approximate ∇f(x) locally

– Use samples of ∇f(x) to approximate ∇2f(x) locally

1:3

Optimization in Machine Learning: SVMs

• optimization problem

maxβ,||β||=1 M subject to yi(φ(xi)>β) ≥M, i = 1, . . . , n

• can be rephrased as

minβ ||β|| subject to yi(φ(xi)>β) ≥ 1, i = 1, . . . , n

Ridge regularization like ridge regression, but different loss

[Figure: max-margin separation of two point classes A and B in the (x, y) plane]

1:4

Optimization in Robotics

• Trajectories:

Let xt ∈ Rn be a joint configuration and x = x_{1:T} = (x1, . . . , xT) a trajectory of length T. Find

min_x  Σ_{t=0}^{T} ft(x_{t−k:t})> ft(x_{t−k:t})   s.t.  ∀t : gt(xt) ≤ 0 , ht(xt) = 0     (1)

• Control:

min_{u,q,λ}  ||u − a||²_H              (2)
 s.t.  u = M q + h + Jg> λ             (3)
       Jφ q = c                         (4)
       λ = λ*                           (5)
       Jg q = b                         (6)

1:5

Optimization in Computer Vision

• Andres Bruhn’s lectures

• Flow estimation, (relaxed) min-cut problems, segmentation, ...

1:6

Planned Outline

• Unconstrained Optimization: Gradient- and 2nd order methods
– stepsize & direction, plain gradient descent, steepest descent, line search & trust region methods, conjugate gradient
– Newton, Gauss-Newton, Quasi-Newton, (L)BFGS

• Constrained Optimization
– log barrier, squared penalties, augmented Lagrangian
– Lagrangian, KKT conditions, Lagrange dual, log barrier ↔ approx. KKT

• Special convex cases
– Linear Programming, (sequential) Quadratic Programming


– Simplex algorithm
– Relaxation of integer linear programs

• Global Optimization
– infinite bandits, probabilistic modelling, exploration vs. exploitation, GP-UCB

• Stochastic search
– Blackbox optimization (0th order methods), MCMC, downhill simplex

1:7

Books

Boyd and Vandenberghe: Convex Optimization. http://www.stanford.edu/~boyd/cvxbook/

(this course will not go to the full depth in math of Boyd et al.)

1:8

Books

Nocedal & Wright: Numerical Optimization. www.bioinfo.org.cn/~wangchao/maa/Numerical_Optimization.pdf

1:9

Organisation

• Webpage:

http://ipvs.informatik.uni-stuttgart.de/mlr/marc/teaching/14-Optimization/

– Slides, Exercises & Software (C++)
– Links to books and other resources

• Admin things, please first ask:

Carola Stahl, [email protected], Room 2.217

• Tutorials: Wednesday 9:45 (0.463)

• Rules for the tutorials:
– Doing the exercises is crucial!
– At the beginning of each tutorial:
  – sign into a list
  – mark which exercises you have (successfully) worked on
– Students are randomly selected to present their solutions
– You need 50% of completed exercises to be admitted to the exam
– Please check 2 weeks before the end of the term whether you can take the exam

1:10

Page 4: Introduction to Optimization - uni-stuttgart.de · Introduction to Optimization Marc Toussaint July 2, 2014 This is a direct concatenation and reformatting of all lecture slides and

4 Introduction to Optimization, Marc Toussaint—July 2, 2014

2 Unconstrained Optimization Basics

Descent direction & stepsize, plain gradient descent, stepsize adaptation & monotonicity, line search, trust region, steepest descent, Newton, Gauss-Newton, Quasi-Newton, BFGS, conjugate gradient, exotic: Rprop

Gradient descent

• Objective function: f : Rn → R
Gradient vector: ∇f(x) = [∂/∂x f(x)]> ∈ Rn

• Problem: min_x f(x)
where we can evaluate f(x) and ∇f(x) for any x ∈ Rn

• Plain gradient descent: iterative steps in the direction −∇f(x).

Input: initial x ∈ Rn, function ∇f(x), stepsize α, tolerance θ
Output: x
1: repeat
2:   x ← x − α∇f(x)
3: until |∆x| < θ [perhaps for 10 iterations in sequence]
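
For concreteness, a minimal Python transcription of this loop (an illustrative sketch, not from the slides; the quadratic test function is made up):

import numpy as np

def plain_gradient_descent(grad_f, x, alpha=0.1, theta=1e-6, max_iters=10000):
    # iterative steps in the direction -grad f(x), with a fixed stepsize
    for _ in range(max_iters):
        step = -alpha * grad_f(x)
        x = x + step
        if np.linalg.norm(step) < theta:   # |Delta x| < theta
            break
    return x

# example: a simple convex quadratic f(x) = x'Cx
C = np.diag([1.0, 2.0])
x_opt = plain_gradient_descent(lambda x: 2 * C @ x, np.array([1.0, 1.0]))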

2:1

• Plain gradient descent is really not efficient

• Two core issues of unconstrained optimization:

A. Stepsize

B. Descent direction
2:2

Stepsize

• Making steps proportional to ∇f(x)?
– large gradient → large step?
– small gradient → small step?

• We need methods that
– robustly adapt the stepsize
– exploit convexity, if known
– perhaps are independent of |∇f(x)| (e.g. if non-convex as above)

2:3

Stepsize Adaptation

Input: initial x ∈ Rn, functions f(x) and ∇f(x), tolerance θ, parameters (defaults: ϱ_α^+ = 1.2, ϱ_α^- = 0.5, ϱ_ls = 0.01)
Output: x
1: initialize stepsize α = 1
2: repeat
3:   d ← −∇f(x)/|∇f(x)|   // (alternative: d = −∇f(x))
4:   while f(x + αd) > f(x) + ϱ_ls ∇f(x)>(αd) do   // line search
5:     α ← ϱ_α^- α   // decrease stepsize
6:   end while
7:   x ← x + αd
8:   α ← ϱ_α^+ α   // increase stepsize (alternative: α = 1)
9: until |αd| < θ [perhaps for 10 iterations in sequence]

• α determines the absolute stepsize

• Guaranteed monotonicity (by construction)
(“Typically” ensures convergence to locally convex minima; see later)
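
A minimal Python transcription of this stepsize adaptation (an illustrative sketch, not from the slides; defaults mirror the pseudocode above, and the quadratic example is made up):

import numpy as np

def gradient_descent_adaptive(f, grad_f, x, theta=1e-6,
                              rho_plus=1.2, rho_minus=0.5, rho_ls=0.01):
    alpha = 1.0
    while True:
        g = grad_f(x)
        d = -g / np.linalg.norm(g)                    # normalized descent direction
        # backtracking line search (sufficient-decrease condition)
        while f(x + alpha * d) > f(x) + rho_ls * g @ (alpha * d):
            alpha *= rho_minus                        # decrease stepsize
        step = alpha * d
        x = x + step
        alpha *= rho_plus                             # increase stepsize again
        if np.linalg.norm(step) < theta:
            return x

# example on a convex quadratic
f = lambda x: x @ np.diag([1.0, 10.0]) @ x
grad_f = lambda x: 2 * np.diag([1.0, 10.0]) @ x
x_opt = gradient_descent_adaptive(f, grad_f, np.array([1.0, 1.0]))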

2:4

Backtracking line search

• Line search in general denotes the problem

min_{α≥0} f(x + αd)

for some step direction d.

• The most common line search is backtracking, which decreases α as long as

f(x + αd) > f(x) + ϱ_ls ∇f(x)>(αd)

ϱ_α^- describes the stepsize decrement in case of a rejected step
ϱ_ls describes a minimum desired decrease in f(x)

• Boyd et al.: typically ϱ_ls ∈ [0.01, 0.3] and ϱ_α^- ∈ [0.1, 0.8]

2:5

Backtracking line search

2:6


• In the appendix:
– Convergence of gradient descent with backtracking
– Trust region method

2:7

B. Descent Direction

2:8

Steepest Descent Direction

• The gradient ∇f(x) is sometimes called the steepest descent direction

Is it really?

• Here is a possible definition:

The steepest descent direction is the one where, when I make a step of length 1, I get the largest decrease of f in its linear approximation.

argmin_δ ∇f(x)>δ   s.t.   ||δ|| = 1

2:9

Steepest Descent Direction

• But the norm ||δ||² = δ>Aδ depends on the metric A!

Let A = B>B (Cholesky decomposition) and z = Bδ

δ* = argmin_δ ∇f>δ   s.t.  δ>Aδ = 1
   = B^-1 argmin_z (B^-1 z)>∇f   s.t.  z>z = 1
   = B^-1 argmin_z z>B^-T ∇f   s.t.  z>z = 1
   = B^-1 [−B^-T ∇f]  =  −A^-1 ∇f

The steepest descent direction is δ = −A^-1 ∇f
2:10

Behavior under linear coordinate transformations

• Let B be a matrix that describes a linear transformation in coordinates

• A coordinate vector x transforms as z = Bx

• The gradient vector ∇_x f(x) transforms as ∇_z f(z) = B^-T ∇_x f(x)

• The metric A transforms as A_z = B^-T A_x B^-1

• The steepest descent direction transforms as A_z^-1 ∇_z f(z) = B A_x^-1 ∇_x f(x)

The steepest descent direction transforms like a normal coordinate vector (covariant)

2:11

Newton Direction

• Assume we have access to the symmetric Hessian

∇2f(x) = [ ∂2f(x) / (∂xi ∂xj) ]_{i,j=1,..,n}  ∈ Rn×n

• which defines the Taylor expansion:

f(x + δ) ≈ f(x) + ∇f(x)>δ + ½ δ>∇2f(x) δ

Note: ∇2f(x) acts like a metric for δ

2:12

Newton method

• For finding roots (zero points) of f(x):

x ← x − f(x)/f′(x)

• For finding optima of f(x) in 1D:

x ← x − f′(x)/f′′(x)

For x ∈ Rn:

x ← x − ∇2f(x)^-1 ∇f(x)

2:13

Why 2nd order information is better

• Better direction:

[Figure: comparison of plain gradient, conjugate gradient, and 2nd-order step directions]

• Better stepsize:
– a full step jumps directly to the minimum of the local squared approx.
– often this is already a good heuristic
– additional stepsize reduction and dampening are straightforward

2:14


Newton method with adaptive stepsize

Input: initial x ∈ Rn, functions f(x), ∇f(x), ∇2f(x), tolerance θ, parameters (defaults: ϱ_α^+ = 1.2, ϱ_α^- = 0.5, ϱ_λ^+ = 1, ϱ_λ^- = 0.5, ϱ_ls = 0.01)
Output: x
1: initialize stepsize α = 1 and damping λ = λ0
2: repeat
3:   compute d to solve (∇2f(x) + λI) d = −∇f(x)
4:   while f(x + αd) > f(x) + ϱ_ls ∇f(x)>(αd) do   // line search
5:     α ← ϱ_α^- α   // decrease stepsize
6:     optionally: λ ← ϱ_λ^+ λ and recompute d   // increase damping
7:   end while
8:   x ← x + αd   // step is accepted
9:   α ← min{ϱ_α^+ α, 1}   // increase stepsize
10:  optionally: λ ← ϱ_λ^- λ   // decrease damping
11: until ||αd||∞ < θ

• Notes:
– Line 3 computes the Newton step d = −∇2f(x)^-1 ∇f(x); use the special LAPACK routine dposv to solve Ax = b (using a Cholesky decomposition)
– λ is called damping and is related to trust region methods; it makes the parabola steeper around the current x. For λ → ∞, d becomes collinear with −∇f(x) but |d| → 0
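
A compact Python sketch of this damped Newton loop (illustrative, not from the slides; numpy.linalg.solve stands in for the LAPACK dposv call, and the sketch assumes the damped Hessian stays positive definite):

import numpy as np

def damped_newton(f, grad_f, hess_f, x, theta=1e-8, lam=1e-10,
                  rho_a_plus=1.2, rho_a_minus=0.5,
                  rho_l_plus=1.0, rho_l_minus=0.5, rho_ls=0.01):
    alpha = 1.0
    I = np.eye(len(x))
    while True:
        g, H = grad_f(x), hess_f(x)
        d = np.linalg.solve(H + lam * I, -g)              # damped Newton step
        while f(x + alpha * d) > f(x) + rho_ls * g @ (alpha * d):
            alpha *= rho_a_minus                          # decrease stepsize
            lam *= rho_l_plus                             # optionally increase damping
            d = np.linalg.solve(H + lam * I, -g)
        step = alpha * d
        x = x + step                                      # step is accepted
        alpha = min(rho_a_plus * alpha, 1.0)              # increase stepsize, cap at 1
        lam *= rho_l_minus                                # decrease damping
        if np.max(np.abs(step)) < theta:
            return x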

2:15

Demo

2:16

• In the remainder: Extensions of the Newton approach:
– Gauss-Newton
– Quasi-Newton
– BFGS, (L)BFGS
– Conjugate Gradient

• And a crazy method: Rprop

2:17

Gauss-Newton method

• Consider a sum-of-squares problem:

min_x f(x)   where   f(x) = φ(x)>φ(x) = Σ_i φi(x)²

and we can evaluate φ(x), ∇φ(x) for any x ∈ Rn

• φ(x) ∈ Rd is a vector; each entry contributes a squared cost term to f(x)

• ∇φ(x) is the Jacobian (d × n matrix)

∇φ(x) = [ ∂φi(x) / ∂xj ]_{i=1,..,d; j=1,..,n}  ∈ Rd×n

with 1st-order Taylor expansion φ(x + δ) = φ(x) + ∇φ(x)δ

2:18

Gauss-Newton method

• The gradient and Hessian of f(x) become

f(x) = φ(x)>φ(x)
∇f(x) = 2∇φ(x)>φ(x)
∇2f(x) = 2∇φ(x)>∇φ(x) + 2φ(x)>∇2φ(x)

• The Gauss-Newton method is the Newton method for f(x) = φ(x)>φ(x) with the approximation ∇2φ(x) ≈ 0

In the Newton algorithm, replace line 3 by
3: compute d to solve (2∇φ(x)>∇φ(x) + λI) d = −2∇φ(x)>φ(x)

• The approximate Hessian 2∇φ(x)>∇φ(x) is always positive semi-definite!
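
A Python sketch of the resulting Gauss-Newton iteration (illustrative; the residual phi, its Jacobian and the exponential-fit example are made up for demonstration, and no line search is shown):

import numpy as np

def gauss_newton(phi, jac_phi, x, lam=1e-6, theta=1e-8, max_iters=100):
    # minimize f(x) = phi(x)'phi(x) with the Gauss-Newton Hessian approximation
    for _ in range(max_iters):
        r, J = phi(x), jac_phi(x)                  # residuals and Jacobian (d x n)
        g = 2 * J.T @ r                            # gradient of f
        H = 2 * J.T @ J + lam * np.eye(len(x))     # approximate (damped) Hessian
        d = np.linalg.solve(H, -g)
        x = x + d
        if np.linalg.norm(d) < theta:
            break
    return x

# example: fit y = a*exp(b*t) to data, with x = (a, b)
t = np.linspace(0, 1, 20); y = 2.0 * np.exp(-1.5 * t)
phi = lambda x: x[0] * np.exp(x[1] * t) - y
jac = lambda x: np.stack([np.exp(x[1] * t), x[0] * t * np.exp(x[1] * t)], axis=1)
x_fit = gauss_newton(phi, jac, np.array([2.0, -1.0]))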

2:19

Quasi-Newton methods

2:20

Quasi-Newton methods

• Assume we cannot evaluate ∇2f(x).

Can we still use 2nd order methods?

• Yes: We can approximate∇2f(x) from the data {(xi,∇f(xi))}ki=1

of previous iterations

2:21

Basic example

• We’ve seen already two data points (x1, ∇f(x1)) and (x2, ∇f(x2))

How can we estimate ∇2f(x)?

• In 1D:

∇2f(x) ≈ (∇f(x2) − ∇f(x1)) / (x2 − x1)

• In Rn: let y = ∇f(x2) − ∇f(x1), δ = x2 − x1. We require

∇2f(x) δ = y   and   δ = ∇2f(x)^-1 y

which is satisfied by the rank-1 estimates

∇2f(x) = y y> / (y>δ)        ∇2f(x)^-1 = δ δ> / (δ>y)

Convince yourself that these solve the desired relations.
[Left: how to update ∇2f(x). Right: how to update directly ∇2f(x)^-1.]

2:22


BFGS

• Broyden-Fletcher-Goldfarb-Shanno (BFGS) method:

Input: initial x ∈ Rn, functions f(x), ∇f(x), tolerance θ
Output: x
1: initialize H^-1 = I_n
2: repeat
3:   compute d = −H^-1 ∇f(x)
4:   perform a line search min_α f(x + αd)
5:   δ ← αd
6:   y ← ∇f(x + δ) − ∇f(x)
7:   x ← x + δ
8:   update H^-1 ← (I − y δ>/(δ>y))> H^-1 (I − y δ>/(δ>y)) + δ δ>/(δ>y)
9: until ||δ||∞ < θ

• Notes:
– The last term δ δ>/(δ>y) is the H^-1-update as on the previous slide
– The factors (I − y δ>/(δ>y)) “delete” previous H^-1-components
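
A Python sketch of this loop (illustrative, not from the slides; the exact line search of line 4 is replaced by simple backtracking):

import numpy as np

def bfgs(f, grad_f, x, theta=1e-8, max_iters=1000):
    n = len(x)
    Hinv = np.eye(n)                                  # running estimate of H^-1
    g = grad_f(x)
    for _ in range(max_iters):
        d = -Hinv @ g
        alpha = 1.0
        while f(x + alpha * d) > f(x) + 0.01 * alpha * g @ d:
            alpha *= 0.5                              # backtracking line search
        delta = alpha * d
        y = grad_f(x + delta) - g
        x, g = x + delta, g + y
        if np.max(np.abs(delta)) < theta:
            break
        dy = delta @ y
        if dy > 1e-12:                                # skip update if curvature condition fails
            V = np.eye(n) - np.outer(y, delta) / dy
            Hinv = V.T @ Hinv @ V + np.outer(delta, delta) / dy
    return x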

2:23

Quasi-Newton methods

• BFGS is the most popular of all Quasi-Newton methods

Others exist, which differ in the exact H^-1-update

• L-BFGS (limited memory BFGS) is a version which does not require to explicitly store H^-1 but instead stores the previous data {(xi, ∇f(xi))}_{i=1..k} and manages to compute d = −H^-1 ∇f(x) directly from this data

• Some thought:

In principle, there are alternative ways to estimate H^-1 from the data {(xi, f(xi), ∇f(xi))}_{i=1..k}, e.g. using Gaussian Process regression with derivative observations
– Not only the derivatives but also the value f(xi) should give information on H(x) for non-quadratic functions
– Should one weight ‘local’ data stronger than ‘far away’? (GP covariance function)

2:24

(Nonlinear) Conjugate Gradient

2:25

Conjugate Gradient

• The “Conjugate Gradient Method” is a method for solving large linear equation systems Ax + b = 0

We mention its extension for optimizing nonlinear functions f(x)

• A key insight:
– at xk we computed ∇f(xk)
– we made a (line-search) step to xk+1
– at xk+1 we computed ∇f(xk+1)

What conclusions can we draw about the “local quadratic shape” of f?

2:26

Conjugate Gradient

Input: initial x ∈ Rn, functions f(x), ∇f(x), tolerance θ
Output: x
1: initialize descent direction d = g = −∇f(x)
2: repeat
3:   α ← argmin_α f(x + αd)   // line search
4:   x ← x + αd
5:   g′ ← g, g = −∇f(x)   // store and compute grad
6:   β ← max{ g>(g − g′) / (g′>g′), 0 }
7:   d ← g + βd   // conjugate descent direction
8: until |∆x| < θ

• Notes:
– β > 0: The new descent direction always adds a bit of the old direction!
– This essentially provides 2nd order information
– The equation for β is by Polak-Ribière: On a quadratic function f(x) = x>Ax this leads to conjugate search directions, d′>Ad = 0.
– All this really only works with perfect line search
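
A Python sketch of nonlinear CG with the Polak-Ribière β (illustrative; the "perfect" line search is approximated with scipy.optimize.minimize_scalar on a bounded interval, which is an assumption of this sketch):

import numpy as np
from scipy.optimize import minimize_scalar

def nonlinear_cg(f, grad_f, x, theta=1e-8, max_iters=1000):
    g = -grad_f(x)
    d = g.copy()
    for _ in range(max_iters):
        # approximate exact line search along d
        alpha = minimize_scalar(lambda a: f(x + a * d),
                                bounds=(0.0, 10.0), method='bounded').x
        step = alpha * d
        x = x + step
        g_old, g = g, -grad_f(x)
        beta = max(g @ (g - g_old) / (g_old @ g_old), 0.0)   # Polak-Ribiere, with restart
        d = g + beta * d                                     # conjugate descent direction
        if np.linalg.norm(step) < theta:
            break
    return x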

2:27

Conjugate Gradient

• For quadratic functions CG converges in n iterations. But each

iteration does line search!

2:28

Conjugate Gradient

• Useful tutorial on CG and line search:

J. R. Shewchuk: An Introduction to the Conjugate Gradient Method

Without the Agonizing Pain

2:29

Rprop

2:30

Rprop

“Resilient Back Propagation” (outdated name from NN times...)


Input: initial x ∈ Rn, function f(x), ∇f(x), initial stepsize α, tolerance θ
Output: x
1: initialize x = x0, all αi = α, all g′i = 0
2: repeat
3:   g ← ∇f(x)
4:   x′ ← x
5:   for i = 1 : n do
6:     if gi g′i > 0 then   // same direction as last time
7:       αi ← 1.2 αi
8:       xi ← xi − αi sign(gi)
9:       g′i ← gi
10:    else if gi g′i < 0 then   // change of direction
11:      αi ← 0.5 αi
12:      xi ← xi − αi sign(gi)
13:      g′i ← 0   // force last case next time
14:    else
15:      xi ← xi − αi sign(gi)
16:      g′i ← gi
17:    end if
18:    optionally: cap αi ∈ [αmin xi, αmax xi]
19:  end for
20: until |x′ − x| < θ for 10 iterations in sequence
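
A Python sketch of this loop (illustrative, not from the slides; the optional stepsize caps of line 18 are omitted and the stopping test fires on the first small step rather than 10 in sequence):

import numpy as np

def rprop(grad_f, x, alpha0=0.1, theta=1e-8, max_iters=10000):
    alpha = np.full_like(x, alpha0)      # one stepsize per dimension
    g_old = np.zeros_like(x)
    for _ in range(max_iters):
        g = grad_f(x)
        x_old = x.copy()
        same = g * g_old > 0             # same sign as last time -> grow stepsize
        flip = g * g_old < 0             # sign change -> shrink stepsize
        alpha[same] *= 1.2
        alpha[flip] *= 0.5
        x = x - alpha * np.sign(g)       # the step uses only the sign of the gradient
        g_old = np.where(flip, 0.0, g)   # force the "else" case after a sign change
        if np.linalg.norm(x - x_old) < theta:
            break
    return x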

2:31

Rprop

• Rprop is a bit crazy:
– stepsize adaptation in each dimension separately
– it not only ignores |∇f| but also its exact direction: step directions may differ up to < 90° from ∇f
– Often works very robustly
– Guarantees? See work by Ch. Igel

• If you like, have a look at:
Christian Igel, Marc Toussaint, W. Weishui (2005): Rprop using the natural gradient compared to Levenberg-Marquardt optimization. In Trends and Applications in Constructive Approximation. International Series of Numerical Mathematics, volume 151, 259-272.

2:32

Appendix

2:33

Convergence for (locally) convex functions

following Boyd et al. Sec 9.3.1

• Assume that ∀x the Hessian satisfies m ≤ eig(∇2f(x)) ≤ M. It follows

f(x) + ∇f(x)>(y − x) + (m/2)(y − x)²  ≤  f(y)  ≤  f(x) + ∇f(x)>(y − x) + (M/2)(y − x)²

f(x) − (1/(2m)) |∇f(x)|²  ≤  fmin  ≤  f(x) − (1/(2M)) |∇f(x)|²

|∇f(x)|² ≥ 2m (f(x) − fmin)

• Consider a perfect line search with y = x − α*∇f(x), α* = argmin_α f(y(α)). The following holds, as M also upper-bounds ∇2f(x) along −∇f(x):

f(y) ≤ f(x) − (1/(2M)) |∇f(x)|²

f(y) − fmin  ≤  f(x) − fmin − (1/(2M)) |∇f(x)|²
            ≤  f(x) − fmin − (2m/(2M)) (f(x) − fmin)
            ≤  [1 − m/M] (f(x) − fmin)

→ each step is contracting at least by 1 − m/M < 1

2:34

Convergence for (locally) convex functions

following Boyd et al. Sec 9.3.1

• In the case of backtracking line search, backtracking will terminate at latest when α ≤ 1/M, because for y = x − α∇f(x) and α ≤ 1/M we have

f(y) ≤ f(x) − α|∇f(x)|² + (Mα²/2)|∇f(x)|²
     ≤ f(x) − (α/2)|∇f(x)|²
     ≤ f(x) − ϱ_ls α |∇f(x)|²

As backtracking terminates for any α ≤ 1/M, a step α ≥ ϱ_α^-/M is chosen, such that

f(y) ≤ f(x) − (ϱ_ls ϱ_α^- / M) |∇f(x)|²

f(y) − fmin  ≤  f(x) − fmin − (ϱ_ls ϱ_α^- / M) |∇f(x)|²
            ≤  f(x) − fmin − (2m ϱ_ls ϱ_α^- / M) (f(x) − fmin)
            ≤  [1 − 2m ϱ_ls ϱ_α^- / M] (f(x) − fmin)

→ each step is contracting at least by 1 − 2m ϱ_ls ϱ_α^- / M < 1

2:35

Wolfe Conditions

• The termination condition

f(x + αd) ≤ f(x) + ϱ_ls ∇f(x)>(αd)

is the 1st Wolfe condition (“sufficient decrease condition”)

• The second Wolfe condition (“curvature condition”)

|∇f(x+ αd)>d| ≤ b|∇f(x)>d|

implies a sufficient step

• See Nocedal et al., Section 3.1 & 3.2 on convergence of any

method that ensures the Wolfe conditions after each line search

2:36

Trust Region

• Instead of adapting the stepsize along a fixed direction, an alternative is to adapt the trust region

• Roughly, while f(x + δ) > f(x) + ϱ_ls ∇f(x)>δ:
– Reduce the trust region radius β
– try δ = argmin_{δ:|δ|<β} f(x + δ) using a local quadratic model of f(x + δ)

• The constrained optimization min_{δ:|δ|<β} f(x + δ) can be translated into an unconstrained min_δ f(x + δ) + λδ² for suitable λ. The λ is equivalent to a regularization of the Hessian; see damped Newton.


• We’ll not go into more details of trust region methods; see Nocedal Section 4.

2:37

Stopping Criteria

• Standard references (Boyd) define stopping criteria based on

the “change” in f(x), e.g. |∆f(x)| < θ or |∇f(x)| < θ.

• Throughout I will define stopping criteria based on the change

in x, e.g. |∆x| < θ! In my experience with certain applications

this is more meaningful, and invariant of the scaling of f . But

this is application dependent.

2:38

Evaluating optimization costs

• Standard references (Boyd) assume line search is cheap and

measure optimization costs as the number of iterations (count-

ing 1 per line search).

• Throughout I will assume that every evaluation of f(x) or (f(x),∇f(x))

or (f(x),∇f(x),∇2f(x)) is approx. equally expensive—as is the

case in certain applications.

2:39


3 Constrained Optimization

General definition, log barriers, central path, squared penalties, augmented Lagrangian (equalities & inequalities), the Lagrangian, force balance view & KKT conditions, saddle point view, dual problem, min-max max-min duality, modified KKT & log barriers, Phase I

Constrained Optimization

• General constrained optimization problem:

Let x ∈ Rn, f : Rn → R, g : Rn → Rm, h : Rn → Rl. Find

min_x f(x)   s.t.   g(x) ≤ 0, h(x) = 0

In this lecture I’ll focus on inequality constraints g!

• Applications

– Find an optimal, non-colliding trajectory in robotics

– Optimize the shape of a turbine blade, s.t. it must not break

– Optimize the train schedule, s.t. consistency/possibility

3:1

General approaches

• Try to somehow transform the constrained problem into
– a series of unconstrained problems
– a single but larger unconstrained problem
– another constrained problem, hopefully simpler (dual, convex)

3:2

General approaches

• Penalty & Barriers
– Associate an (adaptive) penalty cost with violation of the constraint
– Associate an additional “force compensating the gradient into the constraint” (augmented Lagrangian)
– Associate a log barrier with a constraint, becoming ∞ for violation (interior point method)

• Gradient projection methods (mostly for linear constraints)
– For ‘active’ constraints, project the step direction to become tangential
– When checking a step, always pull it back to the feasible region

• Lagrangian & dual methods
– Rewrite the constrained problem into an unconstrained one
– Or rewrite it as a (convex) dual problem

• Simplex methods (linear constraints)
– Walk along the constraint boundaries

3:3

Penalties & Barriers

• Convention:

A barrier is really∞ for g(x) > 0

A penalty is zero for g(x) ≤ 0 and increases with g(x) > 0

3:4

Log barrier method or Interior Point method

3:5

Log barrier method

• Instead of

min_x f(x)   s.t.   g(x) ≤ 0

we address

min_x f(x) − µ Σ_i log(−gi(x))

3:6

Log barrier

• For µ → 0, −µ log(−g) converges to ∞·[g > 0]

Notation: [boolean expression] ∈ {0, 1}

• The barrier gradient ∇ log(−g) = ∇g/g pushes away from the constraint

• Eventually we want to have a very small µ; but choosing a small µ makes the barrier very non-smooth, which might be bad for gradient and 2nd order methods

3:7

Log barrier method


Input: initial x ∈ Rn, functions f(x), g(x), ∇f(x), ∇g(x), tolerance θ
Output: x
1: initialize µ = 1
2: repeat
3:   find x ← argmin_x f(x) − µ Σ_i log(−gi(x)) with tolerance 10θ
4:   decrease µ ← µ/2
5: until |∆x| < θ

Note: See Boyd & Vandenberghe for stopping criteria based on f precision (duality gap) and a better choice of the initial µ (which is called t there).
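
A Python sketch of this outer loop (illustrative, not from the slides; the inner unconstrained minimization is delegated to SciPy's Nelder-Mead for simplicity, the example f and g are made up, and x must start strictly feasible):

import numpy as np
from scipy.optimize import minimize

def log_barrier(f, g, x, theta=1e-6):
    # min f(x) s.t. g(x) <= 0 (vector-valued g)
    mu = 1.0
    while True:
        def F(z):
            gz = g(z)
            if np.any(gz >= 0):
                return np.inf                      # the barrier is infinite outside the feasible region
            return f(z) - mu * np.sum(np.log(-gz))
        x_new = minimize(F, x, method='Nelder-Mead').x   # inner unconstrained minimization
        done = np.linalg.norm(x_new - x) < theta
        x, mu = x_new, mu / 2.0                    # decrease mu
        if done:
            return x

# example: minimize (x1-1)^2 + (x2-1)^2  s.t.  x1 + x2 <= 1
f = lambda x: (x[0] - 1)**2 + (x[1] - 1)**2
g = lambda x: np.array([x[0] + x[1] - 1.0])
x_opt = log_barrier(f, g, np.array([0.0, 0.0]))    # -> approx (0.5, 0.5)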

3:8

Central Path

• Every µ defines a different optimal x∗(µ)

x*(µ) = argmin_x f(x) − µ Σ_i log(−gi(x))

• Each point on the path can be understood as the optimal compromise between minimizing f(x) and a repelling force of the constraints. (Which corresponds to dual variables λ*(µ).)

3:9

We will revisit the log barrier method later, once we have introduced the Lagrangian...

3:10

Squared Penalty Method

3:11

Squared Penalty Method

• This is perhaps the simplest approach

• Instead of

min_x f(x)   s.t.   g(x) ≤ 0

we address

min_x f(x) + µ Σ_{i=1}^{m} [gi(x) > 0] gi(x)²

Input: initial x ∈ Rn, functions f(x), g(x), ∇f(x), ∇g(x), tolerances θ, ε
Output: x
1: initialize µ = 1
2: repeat
3:   find x ← argmin_x f(x) + µ Σ_i [gi(x) > 0] gi(x)² with tolerance 10θ
4:   µ ← 10µ
5: until |∆x| < θ and ∀i : gi(x) < ε
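
A Python sketch of this squared-penalty loop (illustrative, not from the slides; the inner minimization is again delegated to SciPy):

import numpy as np
from scipy.optimize import minimize

def squared_penalty(f, g, x, theta=1e-6, eps=1e-4):
    # min f(x) s.t. g(x) <= 0, penalizing violated constraints quadratically
    mu = 1.0
    while True:
        P = lambda z: f(z) + mu * np.sum(np.maximum(g(z), 0.0) ** 2)
        x_new = minimize(P, x, method='Nelder-Mead').x
        step = np.linalg.norm(x_new - x)
        x, mu = x_new, 10.0 * mu                       # increase the penalty weight
        if step < theta and np.all(g(x) < eps):
            return x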

3:12

Squared Penalty Method

• The method is ok, but will always lead to some violation of constraints

• A better idea would be to add an out-pushing gradient/force −∇gi(x) for every constraint gi(x) > 0 that is violated.

Ideally, the out-pushing gradient mixes with −∇f(x) exactly such that the result becomes tangential to the constraint!

This idea leads to the augmented Lagrangian approach.

3:13

Augmented Lagrangian

(We can introduce this in a self-contained manner, without yet defining the “Lagrangian”)

3:14

Augmented Lagrangian (equality constraint)

• We first consider an equality constraint before addressing inequalities

• Instead of

min_x f(x)   s.t.   h(x) = 0

we address

min_x f(x) + µ Σ_{i=1}^{m} hi(x)² + Σ_{i=1}^{m} λi hi(x)     (7)

• Note:

– The gradient ∇hi(x) is always orthogonal to the constraint

– By tuning λi we can induce a “virtual gradient” λi ∇hi(x)

– The term µ Σ_{i=1}^{m} hi(x)² penalizes as before

• Here is the trick:

– First minimize (7) for some µ and λi

– This will in general lead to a (slight) penalty µ Σ_{i=1}^{m} hi(x)²

– For the next iteration, choose λi to generate exactly the gradient that was previously generated by the penalty

3:15


• Optimality condition after an iteration:

x′ = argmin_x f(x) + µ Σ_{i=1}^{m} hi(x)² + Σ_{i=1}^{m} λi hi(x)

⇒ 0 = ∇f(x′) + µ Σ_{i=1}^{m} 2 hi(x′) ∇hi(x′) + Σ_{i=1}^{m} λi ∇hi(x′)

• Update the λ’s for the next iteration:

Σ_{i=1}^{m} λi^new ∇hi(x′) = µ Σ_{i=1}^{m} 2 hi(x′) ∇hi(x′) + Σ_{i=1}^{m} λi^old ∇hi(x′)

λi^new = λi^old + 2µ hi(x′)

Input: initial x ∈ Rn, functions f(x), h(x), ∇f(x), ∇h(x), tolerances θ, ε
Output: x
1: initialize µ = 1, λi = 0
2: repeat
3:   find x ← argmin_x f(x) + µ Σ_i hi(x)² + Σ_i λi hi(x)
4:   ∀i : λi ← λi + 2µ hi(x)
5: until |∆x| < θ and |hi(x)| < ε

3:16

This adaptation of λi is really elegant:
– We do not have to take the penalty limit µ → ∞ but still can have exact constraints
– If f and h were linear (∇f and ∇hi constant), the updated λi is exactly right: in the next iteration we would exactly hit the constraint (by construction)
– The penalty term is like a measuring device for the necessary “virtual gradient”, which is generated by the augmentation term in the next iteration
– The λi are very meaningful: they give the force/gradient that a constraint exerts on the solution

3:17

Augmented Lagrangian (inequality constraint)

• Instead of

min_x f(x)   s.t.   g(x) ≤ 0

we address

min_x f(x) + µ Σ_{i=1}^{m} [gi(x) ≥ 0 ∨ λi > 0] gi(x)² + Σ_{i=1}^{m} λi gi(x)

• A constraint is either active or inactive:

– When active (gi(x) ≥ 0 ∨ λi > 0) we aim for equality gi(x) = 0

– When inactive (gi(x) < 0 ∧ λi = 0) we don’t penalize/augment

– λi are zero or positive, but never negative

Input: initial x ∈ Rn, functions f(x), g(x), ∇f(x), ∇g(x), tolerances θ, ε
Output: x
1: initialize µ = 1, λi = 0
2: repeat
3:   find x ← argmin_x f(x) + µ Σ_i [gi(x) ≥ 0 ∨ λi > 0] gi(x)² + Σ_i λi gi(x)
4:   ∀i : λi ← max(λi + 2µ gi(x), 0)
5: until |∆x| < θ and gi(x) < ε
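
A Python sketch of this inequality augmented Lagrangian loop (illustrative, not from the slides; the inner minimization is delegated to SciPy and the example problem is made up):

import numpy as np
from scipy.optimize import minimize

def augmented_lagrangian(f, g, x, theta=1e-6, eps=1e-4):
    # min f(x) s.t. g(x) <= 0, with multiplier updates lam_i <- max(lam_i + 2 mu g_i(x), 0)
    mu, lam = 1.0, np.zeros(len(g(x)))
    while True:
        def A(z):
            gz = g(z)
            active = (gz >= 0) | (lam > 0)            # only active constraints get the squared term
            return f(z) + mu * np.sum(active * gz**2) + lam @ gz
        x_new = minimize(A, x, method='Nelder-Mead').x
        step = np.linalg.norm(x_new - x)
        x = x_new
        lam = np.maximum(lam + 2 * mu * g(x), 0.0)    # multiplier update
        if step < theta and np.all(g(x) < eps):
            return x

# example: minimize (x1-1)^2 + (x2-1)^2  s.t.  x1 + x2 <= 1
f = lambda x: (x[0] - 1)**2 + (x[1] - 1)**2
g = lambda x: np.array([x[0] + x[1] - 1.0])
x_opt = augmented_lagrangian(f, g, np.array([0.0, 0.0]))   # -> approx (0.5, 0.5)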

3:18

General approaches

• Penalty & Barriers
– Associate an (adaptive) penalty cost with violation of the constraint
– Associate an additional “force compensating the gradient into the constraint” (augmented Lagrangian)
– Associate a log-barrier with a constraint, becoming ∞ for violation (interior point method)

• Gradient projection methods (mostly for linear constraints)
– For ‘active’ constraints, project the step direction to become tangential
– When checking a step, always pull it back to the feasible region

• Lagrangian & dual methods
– Rewrite the constrained problem into an unconstrained one
– Or rewrite it as a (convex) dual problem

• Simplex methods (linear constraints)
– Walk along the constraint boundaries

3:19

The Lagrangian

3:20

The Lagrangian

• Given a constrained problem

min_x f(x)   s.t.   g(x) ≤ 0

we define the Lagrangian as

L(x, λ) = f(x) + Σ_{i=1}^{m} λi gi(x)

• The λi ≥ 0 are called dual variables or Lagrange multipliers

3:21

What’s the point of this definition?

• The Lagrangian is useful to compute optima analytically, on paper – that’s why physicists learn it early on

• The Lagrangian implies the KKT conditions of optimality

• Optima are necessarily at saddle points of the Lagrangian

• The Lagrangian implies a dual problem, which is sometimes

easier to solve than the primal

3:22


Example: Some calculus using the Lagrangian

• For x ∈ R², what is

min_x x²   s.t.   x1 + x2 = 1

• Solution:

L(x, λ) = x² + λ(x1 + x2 − 1)

0 = ∇_x L(x, λ) = 2x + λ (1, 1)>   ⇒   x1 = x2 = −λ/2

0 = ∇_λ L(x, λ) = x1 + x2 − 1 = −λ/2 − λ/2 − 1   ⇒   λ = −1

⇒ x1 = x2 = 1/2

3:23

The “force” & KKT view on the Lagrangian

• At the optimum there must be a balance between the cost gra-

dient −∇f(x) and the gradient of the active constraints −∇gi(x)

3:24

The “force” & KKT view on the Lagrangian

• At the optimum there must be a balance between the cost gradient −∇f(x) and the gradients of the active constraints −∇gi(x)

• Formally: for optimal x: ∇f(x) ∈ span{∇gi(x)}

• Or: for optimal x there must exist λi such that −∇f(x) = −[Σ_i (−λi ∇gi(x))]

• For optimal x it must hold (necessary condition): ∃λ s.t.

∇f(x) + Σ_{i=1}^{m} λi ∇gi(x) = 0   (“stationarity”)
∀i : gi(x) ≤ 0   (primal feasibility)
∀i : λi ≥ 0   (dual feasibility)
∀i : λi gi(x) = 0   (complementarity)

The last condition says that λi > 0 only for active constraints.

These are the Karush-Kuhn-Tucker conditions (KKT, neglecting equality constraints)

3:25

The “force” & KKT view on the Lagrangian

• The first condition (“stationarity”), ∃λ s.t.

∇f(x) + Σ_{i=1}^{m} λi ∇gi(x) = 0

can be equivalently expressed as: ∃λ s.t.

∇_x L(x, λ) = 0

• In that sense, the Lagrangian can be viewed as the “energy function” that generates (for a good choice of λ) the right balance between cost and constraint gradients

• This is exactly as in the augmented Lagrangian approach, where however we have an additional (“augmented”) squared penalty that is used to tune the λi

3:26

Saddle point view on the Lagrangian

• Let’s briefly consider the equality case again:

min_x f(x)   s.t.   h(x) = 0

with the Lagrangian

L(x, λ) = f(x) + Σ_{i=1}^{m} λi hi(x)

• Note:

min_x L(x, λ)   ⇒   0 = ∇_x L(x, λ)   ↔ stationarity

max_λ L(x, λ)   ⇒   0 = ∇_λ L(x, λ) = h(x)   ↔ constraint

• Optima (x*, λ*) are saddle points where

∇_x L = 0 ensures stationarity and

∇_λ L = 0 ensures primal feasibility

3:27

Saddle point view on the Lagrangian

• In the inequality case:

max_{λ≥0} L(x, λ) = { f(x)  if g(x) ≤ 0 ;  ∞  otherwise }

λ = argmax_{λ≥0} L(x, λ)   ⇒   { λi = 0  if gi(x) < 0 ;  0 = ∇_{λi} L(x, λ) = gi(x)  otherwise }

This implies either (λi = 0 ∧ gi(x) < 0) or gi(x) = 0, which is exactly equivalent to the complementarity and primal feasibility conditions

• Again, optima (x*, λ*) are saddle points where

min_x L enforces stationarity and

max_{λ≥0} L enforces complementarity and primal feasibility

Together, min_x L and max_{λ≥0} L enforce the KKT conditions!

3:28


The Lagrange dual problem

• Finding the saddle point can be written in two ways:

min_x max_{λ≥0} L(x, λ)     primal problem

max_{λ≥0} min_x L(x, λ)     dual problem

• Let’s define the Lagrange dual function as

l(λ) = min_x L(x, λ)

Then we have

min_x f(x)  s.t.  g(x) ≤ 0     primal problem

max_λ l(λ)  s.t.  λ ≥ 0        dual problem

The dual problem is convex (objective = concave, constraints = convex), even if the primal is non-convex!

3:29

The Lagrange dual problem

• The dual function is always a lower bound (for any λi ≥ 0)

l(λ) = min_x L(x, λ)  ≤  [ min_x f(x)  s.t.  g(x) ≤ 0 ]

And consequently

max_{λ≥0} min_x L(x, λ)  ≤  min_x max_{λ≥0} L(x, λ) = min_{x:g(x)≤0} f(x)

• We say strong duality holds iff

max_{λ≥0} min_x L(x, λ) = min_x max_{λ≥0} L(x, λ)

• If the primal is convex, and there exists an interior point

∃x : ∀i : gi(x) < 0

(which is called the Slater condition), then we have strong duality

3:30

And what about algorithms?

• So far we’ve only introduced a whole lot of formalism, and seen

that the Lagrangian sort of represents the constrained problem

• What are the algorithms we can get out of this?

3:31

Log barrier method revisited

3:32

Log barrier method revisited

• Log barrier method: Instead of

min_x f(x)   s.t.   g(x) ≤ 0

we address

min_x f(x) − µ Σ_i log(−gi(x))

• For given µ the optimality condition is

∇f(x) − Σ_i (µ / gi(x)) ∇gi(x) = 0

or equivalently

∇f(x) + Σ_i λi ∇gi(x) = 0 ,   λi gi(x) = −µ

These are called modified (= approximate) KKT conditions.

3:33

Log barrier method revisited

Centering (the unconstrained minimization) in the log barrier method is equivalent to solving the modified KKT conditions.

Note also: On the central path, the duality gap is mµ:
l(λ*(µ)) = f(x*(µ)) + Σ_i λi gi(x*(µ)) = f(x*(µ)) − mµ

3:34

Primal-Dual interior-point Newton Method

3:35

Primal-Dual interior-point Newton Method

• A core outcome of the Lagrangian theory was the shift in problem formulation:

find x to min_x f(x) s.t. g(x) ≤ 0
→ find x to solve the KKT conditions

Optimization problem  −→  Solve KKT conditions

• We think of the KKT conditions as an equation system r(x, λ) = 0, and can use the Newton method for solving it:

∇r · (∆x, ∆λ)> = −r

This leads to primal-dual algorithms that adapt x and λ concurrently. Roughly, this uses the curvature ∇2f to estimate the right λ to push out of the constraint.

3:36


Primal-Dual interior-point Newton Method

• The first and last modified (= approximate) KKT conditions

∇f(x) + Σ_{i=1}^{m} λi ∇gi(x) = 0   (“force balance”)
∀i : gi(x) ≤ 0   (primal feasibility)
∀i : λi ≥ 0   (dual feasibility)
∀i : λi gi(x) = −µ   (complementarity)

can be written as the (n+m)-dimensional equation system

r(x, λ) = 0 ,   r(x, λ) := ( ∇f(x) + ∇g(x)>λ ;  −diag(λ) g(x) − µ 1_m )

• Newton method to find the root r(x, λ) = 0:

(x, λ) ← (x, λ) − ∇r(x, λ)^-1 r(x, λ)

∇r(x, λ) = ( ∇2f(x) + Σ_i λi ∇2gi(x)    ∇g(x)>
             −diag(λ) ∇g(x)             −diag(g(x)) )   ∈ R^{(n+m)×(n+m)}

3:37

Primal-Dual interior-point Newton Method

• The method requires the Hessians ∇2f(x) and ∇2gi(x)

– One can approximate the constraint Hessians ∇2gi(x) ≈ 0

– Gauss-Newton case: f(x) = φ(x)>φ(x) only requires ∇φ(x)

• This primal-dual method does a joint update of both

– the solution x

– the Lagrange multipliers (constraint forces) λ

No need for nested iterations, as with penalty/barrier methods!

• The above formulation allows for a duality gap µ; choose µ = 0 or consult Boyd on how to update it on the fly (sec 11.7.3)

• The feasibility constraints gi(x) ≤ 0 and λi ≥ 0 need to be handled explicitly by the root finder (the line search needs to ensure these constraints)

3:38

Phase I: Finding a feasible initialization

3:39

Phase I: Finding a feasible initialization

• An elegant method for finding a feasible point x:

min(x,s)∈Rn+1

s s.t. ∀i : gi(x) ≤ s, s ≥ 0

or

min(x,s)∈Rn+m

m∑i=1

si s.t. ∀i : gi(x) ≤ si, si ≥ 0

3:40

General approaches

• Penalty & Barriers
– Associate an (adaptive) penalty cost with violation of the constraint
– Associate an additional “force compensating the gradient into the constraint” (augmented Lagrangian)
– Associate a log barrier with a constraint, becoming ∞ for violation (interior point method)

• Gradient projection methods (mostly for linear constraints)
– For ‘active’ constraints, project the step direction to become tangential
– When checking a step, always pull it back to the feasible region

• Lagrangian & dual methods
– Rewrite the constrained problem into an unconstrained one
– Or rewrite it as a (convex) dual problem

• Simplex methods (linear constraints)
– Walk along the constraint boundaries

3:41


4 Convex Optimization

Convex, quasiconvex, unimodal, convex optimization problem, linear program (LP), standard form, simplex algorithm, LP-relaxation of integer linear programs, quadratic programming (QP), sequential quadratic programming

Planned Outline

• Gradient-based optimization (1st order methods)
– plain grad., steepest descent, conjugate grad., Rprop, stochastic grad.
– adaptive stepsize heuristics

• Constrained Optimization
– squared penalties, augmented Lagrangian, log barrier
– Lagrangian, KKT conditions, Lagrange dual, log barrier ↔ approx. KKT

• 2nd order methods
– Newton, Gauss-Newton, Quasi-Newton, (L)BFGS
– constrained case, primal-dual Newton

• Special convex cases
– Linear Programming, (sequential) Quadratic Programming
– Simplex algorithm
– relation to relaxed discrete optimization

• Black box optimization (“0th order methods”)
– blackbox stochastic search
– Markov Chain Monte Carlo methods
– evolutionary algorithms

4:1

Function types

• A function is defined convex iff

f(ax+ (1−a)y) ≤ a f(x) + (1−a) f(y)

for all x, y ∈ Rn and a ∈ [0, 1].

• A function is quasiconvex iff

f(ax+ (1−a)y) ≤ max{f(x), f(y)}

for any x, y ∈ Rm and a ∈ [0, 1].

..alternatively, iff every sublevel set {x|f(x) ≤ α} is convex.

• [Subjective!] I call a function unimodal iff it has only 1 local

minimum, which is the global minimum

Note: in dimensions n > 1 quasiconvexity is stronger than unimodality

• A general non-linear function is unconstrained and can have

multiple local minima

4:2

convex ⊂ quasiconvex ⊂ unimodal ⊂ general

4:3

Local optimization

• So far I avoided making explicit assumptions about problem convexity: to emphasize that all methods we considered – except for Newton – are applicable also on non-convex problems.

• The methods we considered are local optimization methods, which can be defined as

– a method that adapts the solution locally

– a method that is guaranteed to converge to a local minimum only

• Local methods are efficient

– if the problem is (strictly) unimodal (strictly: no plateaux)

– if time is critical and a local optimum is a sufficiently good solution

– if the algorithm is restarted very often to hit multiple local optima

4:4

Convex problems

• Convexity is a strong assumption!

• Nevertheless, convex problems are important

– theoretically (convergence proofs!)

– for many real world applications

4:5

Convex problems

• A constrained optimization problem

minx

f(x) s.t. g(x) ≤ 0, h(x) = 0

is called convex iff

– f is convex

– each gi, i = 1, ..,m is convex

– h is linear: h(x) = Ax− b, A ∈ Rl×n, b ∈ Rl

• Alternative definition:f convex and feasible region is a convex set

4:6


Linear and Quadratic Programs

• Linear Program (LP)

min_x c>x   s.t.   Gx ≤ h, Ax = b

LP in standard form

min_x c>x   s.t.   x ≥ 0, Ax = b

• Quadratic Program (QP)

min_x ½ x>Qx + c>x   s.t.   Gx ≤ h, Ax = b

where Q is positive definite.

(One also defines Quadratically Constrained Quadratic Programs (QCQP))

4:7

Transforming an LP problem into standard form

• LP problem:

min_x c>x   s.t.   Gx ≤ h, Ax = b

• Define slack variables:

min_{x,ξ} c>x   s.t.   Gx + ξ = h, Ax = b, ξ ≥ 0

• Express x = x⁺ − x⁻ with x⁺, x⁻ ≥ 0:

min_{x⁺,x⁻,ξ} c>(x⁺ − x⁻)
 s.t.  G(x⁺ − x⁻) + ξ = h, A(x⁺ − x⁻) = b, ξ ≥ 0, x⁺ ≥ 0, x⁻ ≥ 0

where (x⁺, x⁻, ξ) ∈ R^{2n+m}

• Now this conforms to the standard form (replacing (x⁺, x⁻, ξ) ≡ x, etc.)

min_x c>x   s.t.   x ≥ 0, Ax = b
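
For illustration, a small LP in the Gx ≤ h form solved with SciPy's linprog, which accepts the inequality form directly (assumes SciPy is available; the numbers are a made-up example):

import numpy as np
from scipy.optimize import linprog

# min c'x  s.t.  Gx <= h  (here: x1 + x2 <= 4, x1 <= 3)  and  x >= 0
c = np.array([-1.0, -2.0])                   # maximize x1 + 2 x2 by minimizing its negative
G = np.array([[1.0, 1.0], [1.0, 0.0]])
h = np.array([4.0, 3.0])
res = linprog(c, A_ub=G, b_ub=h, bounds=[(0, None), (0, None)])
print(res.x, res.fun)                        # optimal corner, here (0, 4)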

4:8

Linear Programming

– Algorithms
– Application: LP relaxation of discrete problems

4:9

Algorithms for Linear Programming

• All of which we know!

– augmented Lagrangian (LANCELOT software), penalty

– log barrier (“interior point method”, “[central] path following”)

– primal-dual Newton

• The simplex algorithm, walking on the constraints

(The emphasis in the notion of interior point methods is to dis-

tinguish from constraint walking methods.)

• Interior point and simplex methods are comparably efficient

Which is better depends on the problem

4:10

Simplex Algorithm

Georg Dantzig (1947)

Note: Not to be confused with the Nelder-Mead method (downhill simplex method)

• We consider an LP in standard form

min_x c>x   s.t.   x ≥ 0, Ax = b

• Note that in a linear program the optimum is always situated at a corner

4:11

Simplex Algorithm

• The Simplex Algorithm walks along the edges of the polytope, at every corner choosing the edge that decreases c>x most

• This either terminates at a corner, or leads to an unconstrained edge (−∞ optimum)

• In practice this procedure is done by “pivoting on the simplex tableaux”

4:12

Simplex Algorithm

• The simplex algorithm is often efficient, but in the worst case exponential in n and m.


• Interior point methods (log barrier) and, more recently again, augmented Lagrangian methods have become somewhat more popular than the simplex algorithm

4:13

LP-relaxations of discrete problems

4:14

Integer linear programming

• An integer linear program (for simplicity binary) is

min_x c>x   s.t.   Ax = b, xi ∈ {0, 1}

• Examples:
– Traveling Salesman: min_{xij} Σ_{ij} cij xij with xij ∈ {0, 1} and constraints ∀j : Σ_i xij = 1 (columns sum to 1), ∀j : Σ_i xji = 1, ∀ij : tj − ti ≤ n − 1 + n xij (where the ti are additional integer variables).
– MaxSAT problem: In conjunctive normal form, each clause contributes an additional variable and a term in the objective function; each clause contributes a constraint
– Search the web for The Power of Semidefinite Programming Relaxations for MAXSAT

4:15

LP relaxations of integer linear programs

• Instead of solving

min_x c>x   s.t.   Ax = b, xi ∈ {0, 1}

we solve

min_x c>x   s.t.   Ax = b, x ∈ [0, 1]

• Clearly, the relaxed solution will be a lower bound on the integer solution (sometimes also called “outer bound” because [0, 1] ⊃ {0, 1})

• Computing the relaxed solution is interesting

– as an “approximation” or initialization to the integer problem

– to be aware of a lower bound

– in cases where the optimal relaxed solution happens to be integer

4:16

Example: MAP inference in MRFs

• Given integer random variables xi, i = 1, .., n, a pairwise Markov Random Field (MRF) is defined as

f(x) = Σ_{(ij)∈E} fij(xi, xj) + Σ_i fi(xi)

where E denotes the set of edges. Problem: find max_x f(x).
(Note: any general (non-pairwise) MRF can be converted into a pairwise one, blowing up the number of variables)

• Reformulate with indicator variables

bi(x) = [xi = x] ,   bij(x, y) = [xi = x] [xj = y]

These are nm + |E|m² binary variables

• The indicator variables need to fulfil the constraints

bi(x), bij(x, y) ∈ {0, 1}
Σ_x bi(x) = 1            because xi takes exactly one value
Σ_y bij(x, y) = bi(x)    consistency between indicators

4:17

Example: MAP inference in MRFs

• Finding max_x f(x) of a MRF is then equivalent to

max_{bi(x), bij(x,y)}  Σ_{(ij)∈E} Σ_{x,y} bij(x, y) fij(x, y) + Σ_i Σ_x bi(x) fi(x)

such that

bi(x), bij(x, y) ∈ {0, 1} ,   Σ_x bi(x) = 1 ,   Σ_y bij(x, y) = bi(x)

• The LP-relaxation replaces the constraints by

bi(x), bij(x, y) ∈ [0, 1] ,   Σ_x bi(x) = 1 ,   Σ_y bij(x, y) = bi(x)

This set of feasible b’s is called the marginal polytope (because it describes a space of “probability distributions” that are marginally consistent (but not necessarily globally normalized!))

4:18

Example: MAP inference in MRFs

• Solving the original MAP problem is NP-hard

Solving the LP-relaxation is really efficient

• If the solution of the LP-relaxation turns out to be integer, we’ve solved the originally NP-hard problem!

If not, the relaxed problem can be discretized to be a good initialization for discrete optimization

• For binary attractive MRFs (a common case) the solution will always be integer

4:19

Quadratic Programming

4:20


Quadratic Programming

min_x ½ x>Qx + c>x   s.t.   Gx ≤ h, Ax = b

• Efficient Algorithms:

– Interior point (log barrier)

– Augmented Lagrangian

– Penalty

• Highly relevant applications:

– Support Vector Machines

– Similar types of max-margin modelling methods

4:21

Example: Support Vector Machine

• Primal:

max_{β,||β||=1} M   s.t.   ∀i : yi(φ(xi)>β) ≥ M

• Dual:

min_β ||β||²   s.t.   ∀i : yi(φ(xi)>β) ≥ 1

[Figure: max-margin separation of two point classes A and B in the (x, y) plane]

4:22

Sequential Quadratic Programming

• We considered general non-linear problems

min_x f(x)   s.t.   g(x) ≤ 0

where we can evaluate f(x), ∇f(x), ∇2f(x) and g(x), ∇g(x), ∇2g(x) for any x ∈ Rn

→ Newton method

• The standard step direction ∆ solves (∇2f(x) + λI) ∆ = −∇f(x)

• Sometimes a better step direction ∆ can be found by solving the local QP-approximation to the problem

min_∆ f(x) + ∇f(x)>∆ + ∆>∇2f(x)∆   s.t.   g(x) + ∇g(x)>∆ ≤ 0

This is an optimization problem over ∆ and only requires the evaluation of f(x), ∇f(x), ∇2f(x), g(x), ∇g(x) once.

4:23


5 Blackbox Optimization: Local, Stochastic & Model-based Search

“Blackbox Optimization”

• We use the term to denote the problem: Let x ∈ Rn, f : Rn → R, find

min_x f(x)

where we can only evaluate f(x) for any x ∈ Rn

∇f(x) or ∇2f(x) are not (directly) accessible

• A constrained version: Let x ∈ Rn, f : Rn → R, g : Rn → {0, 1}, find

min_x f(x)   s.t.   g(x) = 1

where we can only evaluate f(x) and g(x) for any x ∈ Rn

I haven’t seen much work on this. It would be interesting to consider this more rigorously.

5:1

“Blackbox Optimization” – terminology/subareas

• Stochastic Optimization (aka. Stochastic Search, Metaheuristics)
– Simulated Annealing, Stochastic Hill Climbing, Tabu Search
– Evolutionary Algorithms, esp. Evolution Strategies, Covariance Matrix Adaptation, Estimation of Distribution Algorithms
– Some of them (implicitly or explicitly) locally approximate gradients or 2nd order models

• Derivative-Free Optimization (see Nocedal et al.)
– Methods for (locally) convex/unimodal functions; extending gradient/2nd-order methods
– Gradient estimation (finite differencing), model-based, Implicit Filtering

• Bayesian/Global Optimization
– Methods for arbitrary (smooth) blackbox functions that do not get stuck in local optima.
– Very interesting domain – close analogies to (active) Machine Learning, bandits, POMDPs, optimal decision making/planning, optimal experimental design

5:2

Outline

• Basic downhill running
– Greedy local search, stochastic local search, simulated annealing
– Iterated local search, variable neighborhood search, Tabu search
– Coordinate & pattern search, Nelder-Mead downhill simplex

• Memorize or model something
– General stochastic search
– Evolutionary Algorithms, Evolution Strategies, CMA, EDAs
– Model-based optimization, implicit filtering

• Bayesian/Global optimization: Learn & approximate optimal optimization
– Belief planning view on optimal optimization
– GPs & Bayesian regression methods for belief tracking
– bandits, UCB, expected improvement, etc. for decision making

5:3

Basic downhill running

– Greedy local search, stochastic local search, simulated annealing
– Iterated local search, variable neighborhood search, Tabu search
– Coordinate & pattern search, Nelder-Mead downhill simplex

5:4

Greedy local search (greedy downhill, hillclimbing)

• Let x ∈ X be continuous or discrete

• We assume there is a finite neighborhood N(x) ⊂ X defined for every x

• Greedy local search (variant 1):

Input: initial x, function f(x)
1: repeat
2:   x ← argmin_{y∈N(x)} f(y)   // convention: we assume x ∈ N(x)
3: until x converges

• Variant 2: x ← the “first” y ∈ N(x) such that f(y) < f(x)

• Greedy downhill is a basic ingredient of discrete optimization

• In the continuous case: what is N(x)? Why should it be fixed or finite?

5:5

Stochastic local search

• Let x ∈ Rn

• We assume a “neighborhood” probability distribution q(y|x), typically a Gaussian q(y|x) ∝ exp{−½ (y − x)>Σ^-1 (y − x)}

Input: initial x, function f(x), proposal distribution q(y|x)
1: repeat
2:   Sample y ∼ q(y|x)
3:   If f(y) < f(x) then x ← y
4: until x converges

• The choice of q(y|x) is crucial, e.g. of the covariance matrix Σ

• Simple heuristic: decrease the variance if many steps “fail”; increase the variance after sufficiently many successful steps

• Covariance Matrix Adaptation (discussed later) memorizes the recent successful steps and adapts Σ based on this.
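
A Python sketch of this accept-if-better scheme with the simple variance heuristic (illustrative, not from the slides; the adaptation factors are made up):

import numpy as np

def stochastic_local_search(f, x, sigma=1.0, theta=1e-8, max_iters=10000):
    rng = np.random.default_rng(0)
    fx = f(x)
    for _ in range(max_iters):
        y = x + sigma * rng.standard_normal(len(x))   # Gaussian proposal q(y|x)
        fy = f(y)
        if fy < fx:
            x, fx = y, fy
            sigma *= 1.1                              # success: increase variance
        else:
            sigma *= 0.98                             # failure: slowly decrease variance
        if sigma < theta:
            break
    return x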


5:6

Simulated Annealing (run also uphill)

• An extension to avoid getting stuck in local optima is to also accept steps with f(y) > f(x):

Input: initial x, function f(x), proposal distribution q(y|x)
1: initialize the temperature T = 1
2: repeat
3:   Sample y ∼ q(y|x)
4:   Acceptance probability A = min{ 1, e^{(f(x)−f(y))/T} q(x|y)/q(y|x) }
5:   With probability A update x ← y
6:   Decrease T, e.g. T ← (1 − ε)T for small ε
7: until x converges

• Typically: q(y|x) ∝ exp{−½ (y − x)²/σ²}
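
A Python sketch with a symmetric Gaussian proposal, for which the ratio q(x|y)/q(y|x) cancels (illustrative, not from the slides):

import numpy as np

def simulated_annealing(f, x, sigma=1.0, eps=1e-3, max_iters=10000):
    rng = np.random.default_rng(0)
    T, fx = 1.0, f(x)
    best_x, best_f = x, fx
    for _ in range(max_iters):
        y = x + sigma * rng.standard_normal(len(x))         # symmetric proposal
        fy = f(y)
        if rng.random() < min(1.0, np.exp((fx - fy) / T)):   # Metropolis acceptance
            x, fx = y, fy
            if fx < best_f:
                best_x, best_f = x, fx
        T *= (1.0 - eps)                                     # cool down
    return best_x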

5:7

Simulated Annealing

• Simulated Annealing is a Markov chain Monte Carlo (MCMC) method.
– Must read!: An Introduction to MCMC for Machine Learning
– These are iterative methods to sample from a distribution, in our case

p(x) ∝ e^{−f(x)/T}

• For a fixed temperature T, one can prove that the set of accepted points is distributed as p(x) (but non-i.i.d.!) The acceptance probability

A = min{ 1, e^{(f(x)−f(y))/T} q(x|y)/q(y|x) }

compares f(y) and f(x), but also the reversibility of q(y|x)

• When cooling the temperature, samples focus at the extrema. Guaranteed to sample all extrema eventually

• Of high theoretical relevance, of less practical relevance

5:8

Simulated Annealing

5:9

Random Restarts (run downhill multiple times)

• Greedy local search is typically only used as an ingredient of more robust methods

• We assume to have a start distribution q(x)

• Random restarts:

1: repeat
2:   Sample x ∼ q(x)
3:   x ← GreedySearch(x) or StochasticSearch(x)
4:   If f(x) < f(x*) then x* ← x
5: until run out of budget

• Greedy local search requires a neighborhood function N(x)

Stochastic local search requires a transition proposal q(y|x)

5:10

Iterated Local Search

• Random restarts may be rather expensive; sampling x ∼ q(x) is fully uninformed

• Iterated Local Search picks up the last visited local minimum x and restarts in a meta-neighborhood N*(x)

• Iterated Local Search (variant 1):

Input: initial x, function f(x)
1: repeat
2:   x ← argmin_{y′ ∈ {GreedySearch(y) : y ∈ N*(x)}} f(y′)
3: until x converges

– This version evaluates a GreedySearch for all meta-neighbors y ∈ N*(x) of the last local optimum x
– The inner GreedySearch uses another neighborhood function N(x)

• Variant 2: x ← the “first” y ∈ N*(x) such that f(GS(y)) < f(x)

• Stochastic variant: Neighborhoods N(x) and N*(x) are replaced by transition proposals q(y|x) and q*(y|x)

5:11

Iterated Local Search

• Application to Travelling Salesman Problem:

k-opt neighbourhood: solutions which differ by at most k edges

from Hoos & Stutzle: Tutorial: Stochastic Search Algorithms

• GreedySearch uses 2-opt or 3-opt neighborhood

Iterated Local Search uses 4-opt meta-neighborhood (double

bridges)

5:12


Very briefly...

• Variable Neighborhood Search:
– Switch the neighborhood function in different phases
– Similar to Iterated Local Search

• Tabu Search:
– Maintain a tabu list of points (or point features) which may not be visited again
– The list has a fixed finite size: FIFO
– Intensification and diversification heuristics make it more global

5:13

Coordinate Search

Input: Initial x ∈ R^n
1: repeat
2:   for i = 1, .., n do
3:     α∗ = argmin_α f(x + α e_i)   // Line Search
4:     x ← x + α∗ e_i
5:   end for
6: until x converges

• The LineSearch must be approximated
– E.g. abort on any improvement, when f(x + α e_i) < f(x)
– Remember the last successful stepsize α_i for each coordinate

• Twiddle:

Input: Initial x ∈ R^n, initial stepsizes α_i for all i = 1 : n
1: repeat
2:   for i = 1, .., n do
3:     x ← argmin_{y ∈ {x−α_i e_i, x, x+α_i e_i}} f(y)   // twiddle x_i
4:     Increase α_i if x changed; decrease α_i otherwise
5:   end for
6: until x converges
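A minimal Twiddle sketch in Python (not from the lecture); the multiplicative factors 1.2/0.5 for the stepsize adaptation are assumptions.

import numpy as np

def twiddle(f, x, alpha=None, iters=100):
    n = len(x)
    alpha = np.ones(n) if alpha is None else np.array(alpha, dtype=float)
    for _ in range(iters):
        for i in range(n):
            candidates = [x.copy(), x.copy(), x.copy()]
            candidates[0][i] -= alpha[i]        # x - alpha_i e_i
            candidates[2][i] += alpha[i]        # x + alpha_i e_i
            best = min(candidates, key=f)
            if f(best) < f(x):
                x = best
                alpha[i] *= 1.2                 # x_i changed -> increase stepsize
            else:
                alpha[i] *= 0.5                 # otherwise -> decrease stepsize
    return x

f = lambda x: float(x @ x)
print(twiddle(f, np.array([1.0, -2.0])))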

5:14

Pattern Search

– In each iteration k, have a (new) set of search directions D_k = {d_ki} and test steps of length α_k in these directions
– In each iteration, adapt the search directions D_k and step length α_k

Details: See Nocedal et al.

5:15

Nelder-Mead method – Downhill Simplex Method

5:16

Nelder-Mead method – Downhill Simplex Method

• Let x ∈ Rn

• Maintain n+ 1 points x0, .., xn, sorted by f(x0) < ... < f(xn)

• Compute the center c of the points

• Reflect: y = c + α(c − x_n)

• If f(y) < f(x_0): Expand: y = c + γ(c − x_n)

• If f(y) > f(x_{n−1}): Contract: y = c + ϱ(c − x_n)

• If still f(y) > f(x_n): Shrink ∀_{i=1,..,n}: x_i ← x_0 + σ(x_i − x_0)

• Typical parameters: α = 1, γ = 2, ϱ = −1/2, σ = 1/2
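A compact Nelder-Mead sketch in Python (not from the lecture); the initial simplex construction and keeping the better of the reflected/expanded point follow the common textbook variant and are assumptions beyond the outline above.

import numpy as np

def nelder_mead(f, x0, delta=0.1, iters=200, alpha=1.0, gamma=2.0, rho=-0.5, sigma=0.5):
    n = len(x0)
    # initial simplex: x0 plus n points offset along the coordinate axes (assumption)
    X = np.vstack([x0] + [x0 + delta*np.eye(n)[i] for i in range(n)])
    for _ in range(iters):
        X = X[np.argsort([f(x) for x in X])]       # sort by function value
        c = X[:-1].mean(axis=0)                    # center of all but the worst point
        y = c + alpha*(c - X[-1])                  # reflect
        if f(y) < f(X[0]):
            ye = c + gamma*(c - X[-1])             # expand
            y = ye if f(ye) < f(y) else y
        elif f(y) > f(X[-2]):
            y = c + rho*(c - X[-1])                # contract
        if f(y) < f(X[-1]):
            X[-1] = y                              # replace the worst point
        else:
            X[1:] = X[0] + sigma*(X[1:] - X[0])    # shrink towards the best point
    return X[0]

f = lambda x: float((x[0]-1)**2 + 10*(x[1]+2)**2)
print(nelder_mead(f, np.array([0.0, 0.0])))        # approx. (1, -2)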

5:17

Summary: Basic downhill running

• These methods are highly relevant! Despite their simplicity

• Essential ingredient to iterative approaches that try to find as

many local minima as possible

• Methods essentially differ in the notion of

neighborhood, transition proposal, or pattern of next search points

to consider

• Iterated downhill can be very effective

• However: There should be ways to better exploit data!
– Learn from previous evaluations where to test new points
– Learn from previous local minima where to restart

5:18

Memorize or model something

– Stochastic search schemes
– Evolutionary Algorithms, Evolution Strategies, CMA, EDAs
– Model-based optimization, implicit filtering

5:19

A general stochastic search scheme

• The general scheme:
– The algorithm maintains a probability distribution p_θ(x)
– In each iteration it takes n samples {x_i}_{i=1}^n ∼ p_θ(x)


– Each x_i is evaluated → data {(x_i, f(x_i))}_{i=1}^n
– That data is used to update θ

Input: initial parameter θ, function f(x), distribution model p_θ(x), update heuristic h(θ, D)
Output: final θ and best point x
1: repeat
2:   Sample {x_i}_{i=1}^n ∼ p_θ(x)
3:   Evaluate samples, D = {(x_i, f(x_i))}_{i=1}^n
4:   Update θ ← h(θ, D)
5: until θ converges

5:20

Example: Gaussian search distribution “(µ, λ)-ES”

From 1960s/70s. Rechenberg/Schwefel

• The simplest distribution family

θ = (x̂) , p_θ(x) = N(x | x̂, σ^2)

an n-dimensional isotropic Gaussian with fixed variance σ^2

• Update heuristic:
– Given D = {(x_i, f(x_i))}_{i=1}^λ, select the µ best: D′ = bestOf_µ(D)
– Compute the new mean x̂ from D′

• This algorithm is called “Evolution Strategy (µ, λ)-ES”
– The Gaussian is meant to represent a “species”
– λ offspring are generated
– the best µ are selected
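A minimal (µ, λ)-ES sketch (not from the lecture), illustrating the general stochastic search scheme with θ = x̂; the population sizes and σ are assumed values.

import numpy as np

def mu_lambda_es(f, x_mean, sigma=0.5, mu=5, lam=20, iters=100):
    # p_theta(x) = N(x | x_mean, sigma^2 I) with fixed sigma
    for _ in range(iters):
        X = x_mean + sigma*np.random.randn(lam, len(x_mean))   # lambda offspring
        idx = np.argsort([f(x) for x in X])[:mu]               # select the mu best
        x_mean = X[idx].mean(axis=0)                           # new mean
    return x_mean

f = lambda x: float(x @ x)
print(mu_lambda_es(f, np.array([3.0, -3.0])))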

5:21

θ is the “knowledge/information” gained

• The parameter θ is the only “knowledge/information” that is being propagated between iterations

θ encodes what has been learned from the history

θ defines where to search in the future

• The downhill methods of the previous section did not store any

information other than the current x. (Exception: Tabu search,

Nelder-Mead)

• Evolutionary Algorithms are a special case of this stochastic

search scheme

5:22

Evolutionary Algorithms (EAs)

• EAs can well be described as special kinds of parameterizing p_θ(x) and updating θ
– The θ typically is a set of good points found so far (parents)
– Mutation & Crossover define p_θ(x)
– The samples D are called offspring
– The θ-update is often a selection of the best, or “fitness-proportional” or rank-based

• Categories of EAs:
– Evolution Strategies: x ∈ R^n, often Gaussian p_θ(x)
– Genetic Algorithms: x ∈ {0, 1}^n, crossover & mutation define p_θ(x)
– Genetic Programming: x are programs/trees, crossover & mutation
– Estimation of Distribution Algorithms: θ directly defines p_θ(x)

5:23

Covariance Matrix Adaptation (CMA-ES)

• An obvious critique of the simple Evolution Strategies:
– The search distribution N(x | x̂, σ^2) is isotropic (no going forward, no preferred direction)
– The variance σ is fixed!

• Covariance Matrix Adaptation Evolution Strategy (CMA-ES)

5:24

Covariance Matrix Adaptation (CMA-ES)

• In Covariance Matrix Adaptation

θ = (x̂, σ, C, ϱ_σ, ϱ_C) , p_θ(x) = N(x | x̂, σ^2 C)

where C is the covariance matrix of the search distribution

• The θ maintains two more pieces of information: ϱ_σ and ϱ_C capture the “path” (motion) of the mean x̂ in recent iterations

• Rough outline of the θ-update:
– Let D′ = bestOf_µ(D) be the set of selected points
– Compute the new mean x̂ from D′
– Update ϱ_σ and ϱ_C proportional to x̂_{k+1} − x̂_k
– Update σ depending on |ϱ_σ|
– Update C depending on ϱ_C ϱ_C^T (rank-1-update) and Var(D′)

5:25

CMA references

Hansen, N. (2006), ”The CMA evolution strategy: a comparing

review”

Hansen et al.: Evaluating the CMA Evolution Strategy on Multi-

modal Test Functions, PPSN 2004.


• For “large enough” populations local minima are avoided

5:26

CMA conclusions

• It is a good starting point for an off-the-shelf blackbox algorithm

• It includes components like estimating the local gradient (ϱ_σ, ϱ_C), the local “Hessian” (Var(D′)), smoothing out local minima (large populations)

5:27

Estimation of Distribution Algorithms (EDAs)

• Generally, EDAs fit the distribution p_θ(x) to model the distribution of previously good search points
For instance, if in all previously good points the 3rd bit equals the 7th bit, then the search distribution p_θ(x) should put higher probability on such candidates. p_θ(x) is meant to capture the structure in previously good points, i.e. the dependencies/correlation between variables.

• A rather successful class of EDAs on discrete spaces uses graphical models to learn the dependencies between variables, e.g.

Bayesian Optimization Algorithm (BOA)

• In continuous domains, CMA is an example for an EDA

5:28

Stochastic search conclusions

Input: initial parameter θ, function f(x), distribution model p_θ(x), update heuristic h(θ, D)
Output: final θ and best point x
1: repeat
2:   Sample {x_i}_{i=1}^n ∼ p_θ(x)
3:   Evaluate samples, D = {(x_i, f(x_i))}_{i=1}^n
4:   Update θ ← h(θ, D)
5: until θ converges

• The framework is very general

• The crucial difference between algorithms is their choice of pθ(x)

5:29

Model-based optimization
following Nocedal et al. “Derivative-free optimization”

5:30

Model-based optimization

• The previous stochastic search methods are heuristics to update θ

Why not store the previous data directly?

• Model-based optimization takes the approach
– Store a data set θ = D = {(x_i, y_i)}_{i=1}^n of previously explored points (let x̂ be the current minimum in D)
– Compute a (quadratic) model D ↦ f̂(x) = φ_2(x)^T β
– Choose the next point as

x^+ = argmin_x f̂(x) s.t. |x − x̂| < α

– Update D and α depending on f(x^+)

• The argmin is solved with constrained optimization methods

5:31

Model-based optimization

1: Initialize D with at least 1/2 (n+1)(n+2) data points
2: repeat
3:   Compute a regression f̂(x) = φ_2(x)^T β on D
4:   Compute x^+ = argmin_x f̂(x) s.t. |x − x̂| < α
5:   Compute the improvement ratio ϱ = [f(x̂) − f(x^+)] / [f̂(x̂) − f̂(x^+)]
6:   if ϱ > ε then
7:     Increase the stepsize α
8:     Accept x̂ ← x^+
9:     Add to data, D ← D ∪ {(x^+, f(x^+))}
10:  else
11:    if det(D) is too small then   // Data improvement
12:      Compute x^+ = argmax_x det(D ∪ {x}) s.t. |x − x̂| < α
13:      Add to data, D ← D ∪ {(x^+, f(x^+))}
14:    else
15:      Decrease the stepsize α
16:    end if
17:  end if
18:  Prune the data, e.g., remove argmax_{x ∈ D} det(D \ {x})
19: until x̂ converges

• Variant: Initialize with only n + 1 data points and fit a linear model as long as |D| < 1/2 (n+1)(n+2) = dim(φ_2(x))

5:32

Model-based optimization

• Optimal parameters (with data matrix X ∈ R^{n×dim(β)})

β^ls = (X^T X)^{-1} X^T y

The determinant det(X^T X) or det(X) (denoted det(D) on the previous slide) is a measure for how well the data supports the regression. The data improvement explicitly selects a next evaluation point to increase det(D).
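A small numerical sketch of the regression step (not from the lecture); the 2-D quadratic feature map φ_2 and the toy data are assumptions.

import numpy as np

def phi2(x):
    # quadratic features: 1, x_i, and all products x_i x_j (i <= j)
    x = np.asarray(x)
    quad = [x[i]*x[j] for i in range(len(x)) for j in range(i, len(x))]
    return np.concatenate(([1.0], x, quad))

Xdata = np.random.uniform(-1, 1, size=(8, 2))    # 8 >= dim(phi2) = 6 points
y = np.array([x @ x for x in Xdata])             # placeholder objective values
X = np.vstack([phi2(x) for x in Xdata])          # data matrix
beta = np.linalg.solve(X.T @ X, X.T @ y)         # beta_ls = (X^T X)^-1 X^T y
print(np.linalg.det(X.T @ X))                    # "det(D)": data quality measure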

• Nocedal describes in more detail a geometry-improving procedure to update D.

• Model-based optimization is closely related to Bayesian approaches.

But
– Should we really prune data to have only a minimal set D (of size dim(β))?
– Is there another way to think about the “data improvement” selection of x^+? (→ maximizing uncertainty/information gain)

5:33


Implicit Filtering (briefly)

• Estimates the local gradient using finite differencing

∇_ε f(x) ≈ [ 1/(2ε) (f(x + ε e_i) − f(x − ε e_i)) ]_{i=1,..,n}

• Line search along the gradient; if not successful, decrease ε

• Can be extended by using ∇εf(x) to update an approximation

of the Hessian (as in BFGS)
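A minimal sketch of the finite-difference gradient estimate above (not from the lecture); ε is an assumed value.

import numpy as np

def fd_gradient(f, x, eps=1e-2):
    # central finite-difference estimate of the gradient at scale eps
    g = np.zeros(len(x))
    for i in range(len(x)):
        e = np.zeros(len(x)); e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2*eps)
    return g

f = lambda x: float(x @ x)
print(fd_gradient(f, np.array([1.0, -2.0])))   # approx. [2, -4]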

5:34

Conclusions

• We covered
– “downhill running”
– Two flavors of methods that exploit the recent data:
  – stochastic search (& EAs), maintaining θ that defines p_θ(x)
  – model-based opt., maintaining local data D that defines f̂(x)

• These methods can be very efficient, but somehow the problem formalization is unsatisfactory:
– What would be optimal optimization?
– What exactly is the information that we can gain from data about the optimum?
– If the optimization algorithm were an “AI agent”, selecting points being its actions and seeing f(x) being its observations, what would be its optimal decision making strategy?
– And what about global blackbox optimization?

5:35


6 Global & Bayesian Optimization

Multi-armed bandits, exploration vs. exploitation, navigation through belief space, upper confidence bound (UCB), global optimization = infinite bandits, Gaussian Processes, probability of improvement, expected improvement, UCB

Global Optimization

• Is there an optimal way to optimize (in the Blackbox case)?

• Is there a way to find the global optimum instead of only local?

6:1

Outline

• Play a game

• Multi-armed bandits
– Belief state & belief planning
– Upper Confidence Bound (UCB)

• Optimization as infinite bandits
– GPs as belief state

• Standard heuristics:
– Upper Confidence Bound (GP-UCB)
– Maximal Probability of Improvement (MPI)
– Expected Improvement (EI)

6:2

Bandits

6:3

Bandits

• There are n machines.

• Each machine i returns a reward y ∼ P (y; θi)

The machine’s parameter θi is unknown

6:4

Bandits

• Let a_t ∈ {1, .., n} be the choice of machine at time t

Let y_t ∈ R be the outcome with mean 〈y_{a_t}〉

• A policy or strategy maps all the history to a new choice:

π : [(a_1, y_1), (a_2, y_2), ..., (a_{t-1}, y_{t-1})] ↦ a_t

• Problem: Find a policy π that

max 〈∑_{t=1}^T y_t〉

or

max 〈y_T〉

or other objectives like discounted infinite horizon max 〈∑_{t=1}^∞ γ^t y_t〉

6:5

Exploration, Exploitation

• “Two effects” of choosing a machine:

– You collect more data about the machine → knowledge

– You collect reward

• Exploration: Choose the next action at to min 〈H(bt)〉

• Exploitation: Choose the next action at to max 〈yt〉

6:6

The Belief State

• “Knowledge” can be represented in two ways:
– as the full history

h_t = [(a_1, y_1), (a_2, y_2), ..., (a_{t-1}, y_{t-1})]

– as the belief

b_t(θ) = P(θ|h_t)

where θ are the unknown parameters θ = (θ_1, .., θ_n) of all machines

• In the bandit case:
– The belief factorizes b_t(θ) = P(θ|h_t) = ∏_i b_t(θ_i|h_t)

e.g. for Gaussian bandits with constant noise, θ_i = µ_i

b_t(µ_i|h_t) = N(µ_i | y_i, s_i)

e.g. for binary bandits, θ_i = p_i, with prior Beta(p_i|α, β):

b_t(p_i|h_t) = Beta(p_i | α + a_{i,t}, β + b_{i,t})

a_{i,t} = ∑_{s=1}^{t−1} [a_s = i][y_s = 0] ,  b_{i,t} = ∑_{s=1}^{t−1} [a_s = i][y_s = 1]
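A tiny sketch of this belief update for 3 binary bandits (not from the lecture), following the slide's counting convention with uniform prior α = β = 1.

import numpy as np

# one (alpha, beta) pair per machine, starting from Beta(p_i | 1, 1)
belief = np.ones((3, 2))

def play(i, y):
    # as on the slide: y = 0 increments the first count, y = 1 the second
    belief[i, 0] += (y == 0)
    belief[i, 1] += (y == 1)

play(0, 1); play(0, 1); play(1, 0)
print(belief)   # posterior parameters (alpha + a_i, beta + b_i) per machine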

6:7


The Belief MDP

• The process can be modelled as a graphical model with actions a_1, a_2, a_3, outcomes y_1, y_2, y_3 and fixed unknown parameters θ, or as a Belief MDP over beliefs b_0, b_1, b_2, b_3 with

P(b′|y, a, b) = { 1 if b′ = b[a, y]; 0 otherwise } ,   P(y|a, b) = ∫_{θ_a} b(θ_a) P(y|θ_a)

• The Belief MDP describes a different process: the interaction between the information available to the agent (b_t or h_t) and its actions, where the agent uses its current belief to anticipate observations, P(y|a, b).

• The belief (or history h_t) is all the information the agent has available; P(y|a, b) is the “best” possible anticipation of observations. If it acts optimally in the Belief MDP, it acts optimally in the original problem.

Optimality in the Belief MDP ⇒ optimality in the original

problem

6:8

Optimal policies via Belief Planning

• The Belief MDP (as before):

P(b′|y, a, b) = { 1 if b′ = b[a, y]; 0 otherwise } ,   P(y|a, b) = ∫_{θ_a} b(θ_a) P(y|θ_a)

• Belief Planning: Dynamic Programming on the value function

V_{t-1}(b_{t-1}) = max_π 〈∑_{t′=t}^T y_{t′}〉 = max_{a_t} ∫_{y_t} P(y_t|a_t, b_{t-1}) [ y_t + V_t(b_{t-1}[a_t, y_t]) ]

6:9

Optimal policies

• The value function assigns a value (maximal achievable return)

to a state of knowledge

• The optimal policy is greedy w.r.t. the value function (in the

sense of the max_{a_t} above)

• Computationally heavy: b_t is a probability distribution, V_t a function over probability distributions

• The term ∫_{y_t} P(y_t|a_t, b_{t-1}) [ y_t + V_t(b_{t-1}[a_t, y_t]) ] is related to the Gittins Index: it can be computed for each bandit separately.

6:10

Example exercise

• Consider 3 binary bandits for T = 10.
– The belief is 3 Beta distributions Beta(p_i | α + a_i, β + b_i) → 6 integers
– T = 10 → each integer ≤ 10
– V_t(b_t) is a function over {0, .., 10}^6

• Given a prior α = β = 1,

a) compute the optimal value function and policy for the final

reward and the average reward problems,

b) compare with the UCB policy.

6:11

Greedy heuristic: Upper Confidence Bound (UCB)

1: Initialization: Play each machine once
2: repeat
3:   Play the machine i that maximizes ŷ_i + √(2 ln n / n_i)
4: until

ŷ_i is the average reward of machine i so far

n_i is how often machine i has been played so far

n = ∑_i n_i is the number of rounds so far

See Finite-time analysis of the multiarmed bandit problem, Auer, Cesa-Bianchi & Fischer, Machine Learning, 2002.
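A minimal UCB1 sketch (not from the lecture); pull(i) is an assumed callable returning a sampled reward of machine i.

import numpy as np

def ucb1(pull, n_machines, T=1000):
    counts = np.zeros(n_machines)               # n_i
    means = np.zeros(n_machines)                # empirical mean reward per machine
    for i in range(n_machines):                 # initialization: play each machine once
        means[i] = pull(i); counts[i] = 1
    for t in range(n_machines, T):
        ucb = means + np.sqrt(2*np.log(t) / counts)
        i = int(np.argmax(ucb))                 # play the optimistic machine
        y = pull(i)
        counts[i] += 1
        means[i] += (y - means[i]) / counts[i]  # running average
    return means, counts

pull = lambda i: float(np.random.rand() < [0.2, 0.5, 0.8][i])   # 3 assumed Bernoulli bandits
print(ucb1(pull, 3)[1])   # machine 2 should be played most often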

6:12

UCB algorithms

• UCB algorithms determine a confidence interval such that

ŷ_i − σ_i < 〈y_i〉 < ŷ_i + σ_i

with high probability.

UCB chooses the upper bound of this confidence interval

• Optimism in the face of uncertainty

• Strong bounds on the regret (sub-optimality) of UCB (e.g. Auer

et al.)

6:13

Further reading

• ICML 2011 Tutorial Introduction to Bandits: Algorithms and The-

ory, Jean-Yves Audibert, Remi Munos

• Finite-time analysis of the multiarmed bandit problem, Auer, Cesa-

Bianchi & Fischer, Machine learning, 2002.

• On the Gittins Index for Multiarmed Bandits, Richard Weber, An-

nals of Applied Probability, 1992.

Optimal Value function is submodular.

6:14


Conclusions

• The bandit problem is an archetype for
– Sequential decision making
– Decisions that influence knowledge as well as rewards/states
– Exploration/exploitation

• The same aspects are inherent also in global optimization, active learning & RL

• Belief Planning in principle gives the optimal solution

• Greedy Heuristics (UCB) are computationally much more effi-

cient and guarantee bounded regret

6:15

Global Optimization

6:16

Global Optimization

• Let x ∈ R^n, f : R^n → R, find

min_x f(x)

(I neglect constraints g(x) ≤ 0 and h(x) = 0 here – but they could be included.)

• Blackbox optimization: find the optimum by sampling values y_t = f(x_t)

No access to ∇f or ∇^2 f

Observations may be noisy y ∼ N(y | f(x_t), σ)

6:17

Global Optimization = infinite bandits

• In global optimization f(x) defines a reward for every x ∈ Rn

– Instead of a finite number of actions at we now have xt

• Optimal Optimization could be defined as: find π : h_t ↦ x_t that

min 〈∑_{t=1}^T f(x_t)〉

or

min 〈f(x_T)〉

6:18

Gaussian Processes as belief

• The unknown “world property” is the function θ = f

• Given a Gaussian Process prior GP(f | µ, C) over f and a history

D_t = [(x_1, y_1), (x_2, y_2), ..., (x_{t-1}, y_{t-1})]

the belief is

b_t(f) = P(f | D_t) = GP(f | D_t, µ, C)

Mean(f(x)) = f̂(x) = κ(x)(K + σ^2 I)^{-1} y        (response surface)

Var(f(x)) = σ̂(x) = k(x, x) − κ(x)(K + σ^2 I_n)^{-1} κ(x)^T        (confidence interval)

• Side notes:
– Don’t forget that Var(y∗|x∗, D) = σ^2 + Var(f(x∗)|D)
– We can also handle discrete-valued functions f using GP classification
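A minimal sketch of the posterior mean and variance formulas above (not from the lecture), assuming an RBF kernel with assumed hyperparameters ell and noise, and omitting the observation-noise term of the side note.

import numpy as np

def rbf(A, B, ell=0.2):
    # k(x, x') = exp(-|x - x'|^2 / (2 ell^2))
    d2 = ((A[:, None, :] - B[None, :, :])**2).sum(-1)
    return np.exp(-0.5*d2/ell**2)

def gp_posterior(Xq, X, y, ell=0.2, noise=1e-2):
    # mean(x) = kappa(x) (K + sigma^2 I)^-1 y
    # var(x)  = k(x,x) - kappa(x) (K + sigma^2 I)^-1 kappa(x)^T
    K = rbf(X, X, ell) + noise*np.eye(len(X))
    kap = rbf(Xq, X, ell)
    mean = kap @ np.linalg.solve(K, y)
    var = 1.0 - np.einsum('ij,ij->i', kap @ np.linalg.inv(K), kap)
    return mean, var

X = np.array([[0.0], [0.5], [1.0]]); y = np.sin(3*X[:, 0])
Xq = np.array([[0.25], [0.75]])
print(gp_posterior(Xq, X, y))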

6:19

6:20

Optimal optimization via belief planning

• As for bandits it holds

V_{t-1}(b_{t-1}) = max_π 〈∑_{t′=t}^T y_{t′}〉 = max_{x_t} ∫_{y_t} P(y_t|x_t, b_{t-1}) [ y_t + V_t(b_{t-1}[x_t, y_t]) ]

V_{t-1}(b_{t-1}) is a function over the GP-belief!

If we could compute V_{t-1}(b_{t-1}) we “optimally optimize”

• I don’t know of a minimalistic case where this might be feasible

6:21

Conclusions

• Optimization as a problem of
– Computation of the belief
– Belief planning

• Crucial in all of this: the prior P(f)
– GP prior: smoothness; but also limited: only local correlations! No “discovery” of non-local/structural correlations through the space
– The latter would require different priors, e.g. over different function classes

6:22

Heuristics

6:23


1-step heuristics based on GPs

• Maximize Probability of Improvement (MPI)

from Jones (2001)

x_t = argmax_x ∫_{−∞}^{y∗} N(y | f̂(x), σ̂(x)) dy

• Maximize Expected Improvement (EI)

x_t = argmax_x ∫_{−∞}^{y∗} N(y | f̂(x), σ̂(x)) (y∗ − y) dy

• Maximize UCB

x_t = argmax_x f̂(x) + β_t σ̂(x)

(Often, β_t = 1 is chosen. UCB theory allows for better choices. See the Srinivas et al. citation below.)
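A small sketch of these three acquisition values (not from the lecture), assuming a minimization setting with the Gaussian posterior given by its mean and standard deviation; for minimization the confidence-bound heuristic is written here as a lower confidence bound, whereas the slide's formula applies when maximizing f.

import numpy as np
from scipy.stats import norm

def acquisitions(mean, std, y_best, beta=1.0):
    # mean, std: GP posterior mean and standard deviation at candidate points
    # y_best: best (lowest) observed value so far
    z = (y_best - mean) / std
    mpi = norm.cdf(z)                                    # probability of improvement
    ei = (y_best - mean)*norm.cdf(z) + std*norm.pdf(z)   # expected improvement
    lcb = -(mean - beta*std)                             # maximize = minimize mean - beta*std
    return mpi, ei, lcb

# usage: pick the candidate with maximal acquisition value
mean = np.array([0.3, 0.1, 0.5]); std = np.array([0.2, 0.05, 0.4])
print([int(np.argmax(a)) for a in acquisitions(mean, std, y_best=0.2)])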

6:24

Each step requires solving an optimization problem

• Note: each argmax on the previous slide is an optimization problem

• As f̂, σ̂ are given analytically, we have gradients and Hessians.

BUT: multi-modal problem.

• In practice:
– Many restarts of gradient/2nd-order optimization runs
– Restarts from a grid; from many random points

• We put a lot of effort into carefully selecting just the next query

point

6:25

From: Information-theoretic regret bounds for gaussian process optimization in the bandit setting, Srinivas, Krause, Kakade & Seeger, Information Theory, 2012.

6:26

6:27

Pitfall of this approach

• A real issue, in my view, is the choice of kernel (i.e. prior P(f))
– ’small’ kernel: almost exhaustive search
– ’wide’ kernel: miss local optima
– adapting/choosing kernel online (with CV): might fail
– real f might be non-stationary
– non-RBF kernels? Too strong prior, strange extrapolation

• Assuming that we have the right prior P (f) is really a strong

assumption

6:28

Further reading

• Classically, such methods are known as Kriging

• Information-theoretic regret bounds for gaussian process opti-

mization in the bandit setting Srinivas, Krause, Kakade & Seeger,

Information Theory, 2012.

• Efficient global optimization of expensive black-box functions.

Jones, Schonlau, & Welch, Journal of Global Optimization, 1998.

• A taxonomy of global optimization methods based on response

surfaces Jones, Journal of Global Optimization, 2001.

• Explicit local models: Towards optimal optimization algorithms,

Poland, Technical Report No. IDSIA-09-04, 2004.

6:29


7 Exercises

7.1 Exercise 1

7.1.1 Boyd & Vandenberghe

Read sections 1.1, 1.3 & 1.4 of Boyd & Vandenberghe “Convex Optimization”. This is for you to get an impression of the book. Learn in particular about their categories of convex and non-linear optimization problems.

7.1.2 Getting started

Consider the following functions over x ∈ R^n:

f_sq(x) = x^T x        (8)

f_hole(x) = 1 − exp(−x^T x)        (9)

These would be fairly simple to optimize. We change the conditioning (“skewedness of the Hessian”) of these functions to make them a bit more interesting.

Let c ∈ R be the conditioning parameter; let C be the diagonal matrix with entries C(i, i) = c^{(i−1)/(2(n−1))}. We define the test functions

f_sq^c(x) = f_sq(Cx)        (10)

f_hole^c(x) = f_hole(Cx) .        (11)

a) What are the gradients ∇f_sq^c(x) and ∇f_hole^c(x)?

b) What are the Hessians ∇^2 f_sq^c(x) and ∇^2 f_hole^c(x)?

c) Implement these functions and display them for c = 10 over x ∈ [−1, 1]^2. You can use any language, Octave/Matlab, Python, C++, R, whatever. Plotting is usually done by evaluating the function on a grid of points, e.g. in Octave

[X0,X1] = meshgrid(linspace(-1,1,20),linspace(-1,1,20));
X = [X0(:),X1(:)];
Y = sum(X.*X, 2);
Ygrid = reshape(Y,[20,20]);
hold on; mesh(X0,X1,Ygrid); hold off;

Or you can store the grid data in a file and use gnuplot, e.g.

splot [-1:1][-1:1] 'datafile' matrix us ($1/10-1):($2/10-1):3

d) Implement a simple fixed stepsize gradient descent, iterating x_{k+1} = x_k − α∇f(x_k), with start point x_0 = (1, 1), c = 10 and heuristically chosen α.

7.2 Exercise 2

7.2.1 Backtracking

Consider again the functions:

f_sq(x) = x^T C x        (12)

f_hole(x) = 1 − exp(−x^T C x)        (13)

with diagonal matrix C and entries C(i, i) = c^{(i−1)/(n−1)}. We choose a conditioning¹ c = 10.

a) Implement gradient descent with backtracking, as described on slide 02-05 (with default parameters ϱ). Test the algorithm on f_sq(x) and f_hole(x) with start point x_0 = (1, 1). To judge the performance, plot the function value over the number of function evaluations.

b) Test also the alternatives in step 3 and 8. Further, how does the performance change with ϱ_ls (the backtracking stop criterion)?

c) Implement steepest descent using C as a metric. Perform the same evaluations.

7.2.2 Newton direction

a) Derive the Newton direction d ∝ −∇^2 f(x)^{-1} ∇f(x) for f_sq(x) and f_hole(x).

b) Observe that the Newton direction diverges (is undefined) in the concave part of f_hole(x). Propose some methods/tricks to fix this, which at least exploit the efficiency of Newton methods in the convex part. Any ideas are allowed.

7.3 Exercise 3

As I was ill last week, we can first rediscuss open questions on the previous exercises.

¹The word “conditioning” generally denotes the ratio of the largest and smallest Eigenvalue of the Hessian.


7.3.1 Misc

a) How do you have to choose the “damping” λ depending on ∇^2 f(x) in line 3 of the Newton method (slide 02-16) to ensure that d is always well defined (i.e., finite)?

b) The Gauss-Newton method uses the “approximate Hessian” 2∇φ(x)^T ∇φ(x). First show that for any vector v ∈ R^n the matrix vv^T is symmetric and semi-positive-definite.² From this, how can you argue that ∇φ(x)^T ∇φ(x) is also symmetric and semi-positive-definite?

c) In the context of BFGS, convince yourself that choosing H^{-1} = δδ^T/(δ^T y) indeed fulfills the desired relation δ = H^{-1} y, where δ and y are defined as on slide 02-23. Are there other choices of H^{-1} that fulfill the relation? Which?

7.3.2 Gauss-Newton

In x ∈ R^2 consider the function

f(x) = φ(x)^T φ(x) ,   φ(x) = ( sin(a x_1), sin(a c x_2), 2 x_1, 2 c x_2 )^T

The function is plotted above for a = 4 (left) and a = 5 (right, having local minima), and conditioning c = 1. The function is non-convex.

a) Extend your backtracking method implemented in the last week's exercise to a Gauss-Newton method (with constant λ) to solve the unconstrained minimization problem min_x f(x) for a random start point in x ∈ [−1, 1]^2. Compare the algorithm for a = 4 and a = 5 and conditioning c = 3 with gradient descent.

b) If you work in Octave/Matlab or alike, optimize the function also using the fminunc routine from Octave. (Typically this uses BFGS internally.)

² A matrix A ∈ R^{n×n} is semi-positive-definite simply when for any x ∈ R^n it holds x^T A x ≥ 0. Intuitively: A might be a metric as it “measures” the norm of any x as positive. Or: if A is a Hessian, the function is (locally) convex.

7.4 Exercise 4

7.4.1 Squared Penalties & Log Barriers

In a previous exercise we defined the “hole function” f_hole^c(x), where we now assume a conditioning c = 4.

Consider the optimization problem

min_x f_hole^c(x) s.t. g(x) ≤ 0        (14)

g(x) = ( x^T x − 1 ,  x_n + 1/c )^T        (15)

a) First, assume n = 2 (x ∈ R^2 is 2-dimensional), c = 4, and draw on paper what the problem looks like and where you expect the optimum.

b) Implement the Squared Penalty Method. (In the inner loop you may choose any method, including simple gradient methods.) Choose as a start point x = (1/2, 1/2). Plot its optimization path and report on the number of total function/gradient evaluations needed.

c) Test the scaling of the method for n = 10 dimensions.

d) Implement the Log Barrier Method and test as in b) and c). Compare the function/gradient evaluations needed.

7.4.2 Lagrangian and dual function

(Taken roughly from ‘Convex Optimization’, Ex. 5.1)

A simple example. Consider the optimization problem

min x^2 + 1 s.t. (x − 2)(x − 4) ≤ 0

with variable x ∈ R.

a) Derive the optimal solution x∗ and the optimal value p∗ = f(x∗) by hand.

b) Write down the Lagrangian L(x, λ). Plot (using gnuplot or so) L(x, λ) over x for various values of λ ≥ 0. Verify the lower bound property min_x L(x, λ) ≤ p∗, where p∗ is the optimum value of the primal problem.

c) Derive the dual function l(λ) = min_x L(x, λ) and plot it (for λ ≥ 0). Derive the dual optimal solution λ∗ = argmax_λ l(λ). Is max_λ l(λ) = p∗ (strong duality)?


7.5 Exercise 5

7.5.1 Optimize a constrained problem

Use the Newton (or a gradient) method to solve the following constrained problem,

min_x ∑_{i=1}^n x_i s.t. g(x) ≤ 0        (16)

g(x) = ( x^T x − 1 ,  −x_1 )^T        (17)

You are free to choose the squared penalty. Start with µ = 1 and increase it by µ ← 2µ in each iteration. In each iteration report λ_i := 2µ [g_i(x) ≥ 0] g_i(x) for i = 1, 2. Test this for n = 2 and n = 50.

7.5.2 Phase I & Log Barriers

Consider the same problem (16).

a) Use the method you implemented above to find a feasible initialization (Phase I). Do this by solving the (n+1)-dimensional problem

min_{(x,s) ∈ R^{n+1}} s s.t. ∀i : g_i(x) ≤ s, s ≥ −ε

for some very small ε. Initialize this with the infeasible point (1, 1) ∈ R^2.

b) Once you've found a feasible point, use the standard log barrier method to find the solution to the original problem (16). Start with µ = 1, and decrease it by µ ← µ/2 in each iteration. In each iteration also report λ_i := −µ/g_i(x) for i = 1, 2.

7.6 Exercise 6

Solving real-world problems involves 2 subproblems:

1) formulating the problem as an optimization problem (conforming to a standard optimization problem category) (→ human)

2) solving the actual optimization problem (→ algorithm)

In the lecture we've seen some examples (MaxSAT, Travelling Salesman, MRFs) for the first problem, which is absolutely non-trivial. Here is some more training on this. Exercises from Boyd et al http://www.stanford.edu/˜boyd/cvxbook/bv_cvxbook.pdf:

7.6.1 Network flow problem

Solve Exercise 4.12 (pdf page 207) from Boyd & Vandenberghe, Convex Optimization.

7.6.2 Minimum fuel optimal control

Solve Exercise 4.16 (pdf page 208) from Boyd & Vandenberghe, Convex Optimization.

7.6.3 Primal-Dual Newton for Quadratic Programming

Derive an explicit equation for the primal-dual Newton update of (x, λ) (slide 03:38) in the case of Quadratic Programming. Use the special method for solving block matrix linear equations using the Schur complements (Wikipedia “Schur complement”).

What is the update for a general Linear Program?

7.7 Exercise 7

7.7.1 CMA vs. twiddle search

At https://www.lri.fr/˜hansen/cmaes_inmatlab.html there is code for CMA for all languages (I do not recommend the C++ versions).

a) Test CMA with a standard parameter setting on a log-variant of the Rosenbrock function (see Wikipedia). My implementation of this function in C++ is:

double LogRosenbrock(const arr& x) {
  double f=0.;
  for(uint i=1; i<x.N; i++)
    f += sqr(x(i)-sqr(x(i-1))) + .01*sqr(1-x(i-1));
  f = log(1.+f);
  return f;
}

where sqr computes the square of a double.

Test CMA for the n = 2 and n = 10 dimensional Rosenbrock function. Initialize around the start point (1, 10) and (1, 10, .., 10) ∈ R^10 with standard deviation 0.1. You might require up to 1000 iterations.

CMA should have no problem in optimizing this function – but as it always samples a whole population of size λ, the number of evaluations is rather large. Plot f(x_best) for the best point found so far versus the total number of function evaluations.

b) Implement Twiddle Search (slide 05:15) and test it on the same function under the same conditions. Also plot f(x_best) versus the total number of function evaluations and compare to the CMA results.

7.8 Exercise 8

7.8.1 Global optimization on the Rosenbrock function

On the webpage you'll find Octave code for GP regression from Carl Rasmussen (gp01pred.m). The test.m demonstrates how to use it.

Use this code to implement a global optimization method for 2D problems. Test the method

a) on the 2D Rosenbrock function (as in exercise e07), and

b) on the Rastrigin function as defined in exercise e03 with a = 6.

Note that in test.m I've chosen hyperparameters that correspond to assuming: smoothness is given by a kernel width √(1/10); initial value uncertainty (range) is given by √10. How does the performance of the method change with these hyperparameters?

7.8.2 Constrained global optimization?

On slide 6:2 it is speculated that one could consider a constrained blackbox optimization problem as well. How could one approach this in the UCB manner?


8 Bullet points to help learning

This is a summary list of core topics in the lecture and intended as a guide for preparation for the exam. Test yourself also on the bullet points in the table of contents. Going through all exercises is equally important.

8.1 Optimization Problems in General

• Types of optimization problems

– General constrained optimization problem definition

– Blackbox, gradient-based, 2nd order

– Understand the differences

• Hardly coherent texts that cover all three

– constrained & convex optimization

– local & adaptive search

– global/Bayesian optimization

• In the lecture we usually only consider inequality constraints (for simplicity of presentation)

– Understand in all cases how also equality constraints could be handled

8.2 Basic Unconstrained Optimization

• Plain gradient descent

– Understand the stepsize problem

– Stepsize adaptation

– Backtracking line search (2:21)

• Steepest descent

– Is the gradient the steepest direction?

– Covariance (= invariance under linear transformations) of the steepest descent direction

• 2nd-order information

– 2nd order information can improve direction & stepsize

– Hessian needs to be pos-def (↔ f(x) is convex) or modified/approximated as pos-def (Gauss-Newton, damping)

• Newton method

– Definition

– Adaptive stepsize & damping

• Gauss-Newton

– f(x) is a sum of squared cost terms

– The approx. Hessian 2∇φ(x)^T ∇φ(x) is always semi-pos-def!

• Quasi-Newton

– Accumulate gradient information to approximate a Hessian

– BFGS, understand the term δδ^T/(δ^T y)

• Conjugate gradient

– New direction d′ should be “orthogonal” to the previous d, but relative to the local quadratic shape, d′^T A d = 0 (= d′ and d are conjugate)

– On quadratic functions CG converges in n iterations

• Rprop

– Seems awfully hacky

– Every coordinate is treated separately. No invariance under rotations/transformations.

– Change in gradient sign → reduce stepsize; else increase

– Works surprisingly well and robustly in practice

• Convergence

– With perfect line search, the extreme (finite & positive!) eigenvalues of the Hessian ensure convergence

– The Wolfe conditions (acceptance criterion for backtracking line search) ensure a “significant” decrease in f(x), which also leads to convergence

• Trust region

– Alternative to stepsize adaptation and backtracking

• Evaluating optimization costs

– Be aware of differences in convention. Sometimes “1 iteration” = many function evaluations (line search)

– Best: always report on # function evaluations

8.3 Constrained Optimization

• Overview

– General problem definition


– Convert to series of unconstrained problems: penalty, log barrier, & Augmented Lagrangian methods

– Convert to series of QPs and line search: Sequential Quadratic Programming

– Convert to larger unconstrained problem: primal-dual Newton method

– Convert to other constrained problem: dual problem

• Log barrier method

– Definition

– Understand how the barrier gets steeper with µ → 0 (not µ → ∞!)

– Iteratively decreasing µ generates the central path

– The gradient of the log barrier generates a Lagrange term with λ_i = −µ/g_i(x)!

→ Each iteration solves the modified (approximate) KKT condition

• Squared penalty method

– Definition

– Motivates the Augmented Lagrangian

• Augmented Lagrangian

– Definition

– Role of the squared penalty: “measure” how strongly f pushes into the constraint

– Role of the Lagrangian term: generate a counter force

– Understand that the λ update generates the “desired gradient”

• The Lagrangian

– Definition

– Using the Lagrangian to solve constrained problems on paper (set both ∇_x L(x, λ) = 0 and ∇_λ L(x, λ) = 0)

– “Balance of gradients” and the first KKT condition

– Understand in detail the full KKT conditions

– Optima are necessarily saddle points of the Lagrangian

– min_x L ↔ first KKT ↔ balance of gradients

– max_λ L ↔ complementarity KKT ↔ constraints

• Lagrange dual problem

– primal problem: min_x max_{λ≥0} L(x, λ)

– dual problem: max_{λ≥0} min_x L(x, λ)

– Definition of Lagrange dual

– Lower bound and strong duality

• Primal-dual Newton method to solve KKT conditions

– Definition & description

• Phase I optimization

– Nice trick to find feasible initialization

8.4 Convex Optimization

• Definitions

– Convex, quasi-convex, uni-modal functions

– Convex optimization problem

• Linear Programming

– General and standard form definition

– Converting into standard form

– LPs are efficiently solved using 2nd-order log barrier, augmented Lagrangian or primal-dual methods

– Simplex Algorithm is classical alternative; walks on the constraint edges instead of the interior

• Application of LP:

– Very important application of LPs: LP-relaxations of integer linear programs

• Quadratic Programming

– Definition

– QPs are efficiently solved using 2nd-order log barrier, augmented Lagrangian or dual-primal methods

– Sequential QP solves general (non-quadratic) problems by defining a local QP for the step direction followed by a line search in that direction

8.5 Search methods for Blackbox optimization

• Overview

– Basic downhill running: mostly ignore the collected data


– Use the data to shape search: stochastic search, EAs, model-based search

– Bayesian (global) optimization

• Basic downhill running

– Greedy local search: defined by neighborhood N

– Stochastic local search: defined by transition probability q(y|x)

– Simulated Annealing: also accepts “bad” steps depending on temperature; theoretically highly relevant, practically less

– Random restarts of local search can be efficient

– Iterated local search: use meta-neighborhood N∗ to restart

– Coordinate & Pattern search, Twiddle: use heuristics to walk along coordinates

– Nelder-Mead simplex method: reflect, expand, contract, shrink

• Stochastic Search

– General scheme: sample from pθ(x), update θ

– Understand the crucial role of θ: θ captures all that is maintained and updated depending on the data; in EAs, θ is a population; in ESs, θ are parameters of a Gaussian

– Categories of EAs: ES, GA, GP, EDA

– CMA: adapting C and σ based on the path of the mean

• Model-based Optimization

– Precursor of Bayesian Optimization

– Core: smart ways to keep data D healthy

8.6 Bayesian Optimization

• Multi-armed bandit framework

– Problem definition

– Understand the concepts of exploration, exploitation & belief

– Optimal Optimization would imply to plan (exactly) through belief space

– Upper Confidence Bound (UCB) and confidence interval

– UCB is optimistic

• Global optimization

– Global optimization = infinite bandits

– Locally correlated bandits → Gaussian Process beliefs

– Maximum Probability of Improvement

– Expected Improvement

– GP-UCB

• Potential pitfalls

– Choice of prior belief (e.g. kernel of the GP) is crucial

– Pure variance-based sampling for radially symmetric kernel ≈ grid sampling


Index

Augmented Lagrangian method (3:13),

Backtracking (2:4),
Bandits (6:3),
Belief planning (6:7),
Blackbox optimization: definition (5:0),
Blackbox optimization: overview (5:2),
Broyden-Fletcher-Goldfarb-Shanno (BFGS) (2:22),
Central path (3:8),
Conjugate gradient (2:25),
Constrained optimization (3:0),
Coordinate search (5:13),
Covariance Matrix Adaptation (CMA) (5:23),
Covariant gradient descent (2:10),
Estimation of Distribution Algorithms (EDAs) (5:27),
Evolutionary algorithms (5:22),
Expected Improvement (6:23),
Exploration, Exploitation (6:5),
Function types: convex, quasi-convex, uni-modal (4:1),
Gauss-Newton method (2:17),
Gaussian Processes as belief (6:18),
General stochastic search (5:19),
Global Optimization as infinite bandits (6:16),
GP-UCB (6:23),
Gradient descent convergence (2:33),
Greedy local search (5:4),
Implicit filtering (5:33),
Iterated local search (5:10),
Karush-Kuhn-Tucker (KKT) conditions (3:24),
Lagrange dual problem (3:28),
Lagrangian: definition (3:20),
Lagrangian: relation to KKT (3:23),
Lagrangian: saddle point view (3:26),
Line search (2:4),
Linear program (LP) (4:6),
Log barrier as approximate KKT (3:32),
Log barrier method (3:5),
LP in standard form (4:7),
LP-relaxations of integer programs (4:14),
Maximal Probability of Improvement (6:23),
Model-based optimization (5:30),
Nelder-Mead simplex method (5:15),
Newton direction (2:11),
Newton method (2:12),
Pattern search (5:14),
Phase I optimization (3:39),
Plain gradient descent (2:0),
Primal-dual interior-point Newton method (3:35),
Quadratic program (QP) (4:6),
Quasi-Newton methods (2:20),
Random restarts (5:9),
Rprop (2:30),
Sequential quadratic programming (4:22),
Simplex method (4:10),
Simulated annealing (5:6),
Squared penalty method (3:11),
Steepest descent direction (2:8),
Stepsize adaptation (2:3),
Stepsize and step direction as core issues (2:1),
Stochastic local search (5:5),
Trust region (2:36),
Types of optimization problems (1:2),

Upper Confidence Bound (UCB) (6:11),

Variable neighborhood search (5:12),

Wolfe conditions (2:35),
