
Interior-Point and Augmented Lagrangian Algorithms for Optimization and Control

Stephen Wright

University of Wisconsin-Madison

May 2014


In This Section...

My first talk was about optimization formulations, optimality conditions, and duality — for LP, QP, LCP, and nonlinear optimization.

This section will review some algorithms, in particular:

Primal-dual interior-point (PDIP) methods

Augmented Lagrangian (AL) methods.

Both are useful in control applications. We’ll say something about PDIP methods for model-predictive control, and how they exploit the structure in that problem.


Recapping Gradient Methods

Consider the unconstrained minimization problem

min_x f(x),

where f is smooth and convex, or the constrained version in which x is restricted to the set Ω, usually closed and convex.

First-order or gradient methods take steps of the form

xk+1 = xk − αkgk ,

where αk ∈ R+ is a steplength and gk is a search direction, constructed from knowledge of the gradient ∇f(x) at the current iterate x = xk and possibly previous iterates xk−1, xk−2, . . . .

Can extend to nonsmooth f by using a subgradient from ∂f(x).

Extend to constrained minimization by projecting the search line onto the convex set Ω, or (similarly) minimizing a linear approximation to f over Ω.
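As a concrete illustration of such a step, here is a minimal gradient-descent loop in Python/NumPy with a backtracking choice of αk. The quadratic test function and the backtracking constants are illustrative assumptions of mine, not taken from the slides.

import numpy as np

def gradient_descent(f, grad, x0, alpha0=1.0, tol=1e-8, max_iter=500):
    # First-order method: x_{k+1} = x_k - alpha_k * g_k, with alpha_k chosen
    # by simple backtracking until f decreases sufficiently.
    x = x0.copy()
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        alpha = alpha0
        while f(x - alpha * g) > f(x) - 1e-4 * alpha * g.dot(g):
            alpha *= 0.5                       # shrink the steplength
        x = x - alpha * g
    return x

# Illustrative use on a convex quadratic f(x) = 0.5 x^T A x - b^T x
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_min = gradient_descent(lambda x: 0.5 * x @ A @ x - b @ x,
                         lambda x: A @ x - b,
                         x0=np.zeros(2))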


Prox Interpretation of Line Search

Can view the gradient method step xk+1 = xk − αk gk as the minimization of a first-order model of f plus a “prox-term” which prevents the step from being too long:

xk+1 = arg min_x f(xk) + gk^T (x − xk) + (1/(2αk)) ‖x − xk‖₂².

Taking the gradient of the quadratic and setting it to zero, we obtain

gk + (1/αk)(xk+1 − xk) = 0,

which gives the formula for xk+1.

This viewpoint is the key to several extensions.


Extensions: Constraints

When a constraint set Ω is present we can simply minimize the quadratic model function over Ω:

xk+1 = arg min_{x∈Ω} f(xk) + gk^T (x − xk) + (1/(2αk)) ‖x − xk‖₂².

Gradient Projection has this form.

We can replace the ℓ2-norm measure of distance with some other measure φ(x; xk):

xk+1 = arg min_{x∈Ω} f(xk) + gk^T (x − xk) + (1/(2αk)) φ(x; xk).

Could choose φ to “match” Ω. For example, a measure derived from the entropy function is a good match for the simplex

Ω := {x | x ≥ 0, e^T x = 1}.


Extensions: Regularizers

In many modern applications of optimization, f has the form

f(x) = l(x) + τψ(x),

where l is a smooth function and ψ is a simple nonsmooth function (a regularizer).

Can extend the prox approach above by

choosing gk to contain gradient information from l(x) only;

including τψ(x) explicitly in the subproblem.

Subproblems are thus:

xk+1 = arg min_x l(xk) + gk^T (x − xk) + (1/(2αk)) ‖x − xk‖² + τψ(x).
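For the common choice ψ(x) = ‖x‖1, this subproblem has a closed-form solution via soft-thresholding. The sketch below is my own illustration, assuming l(x) = (1/2)‖Ax − b‖² and a fixed steplength α taken small enough (at most 1/‖A^T A‖) for convergence.

import numpy as np

def soft_threshold(v, t):
    # Closed-form solution of min_x 0.5*||x - v||^2 + t*||x||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_gradient_l1(A, b, tau, alpha, max_iter=500):
    # Each iteration solves the subproblem
    #   x_{k+1} = argmin_x l(x_k) + g_k^T (x - x_k)
    #             + (1/(2*alpha)) ||x - x_k||^2 + tau*||x||_1,
    # whose solution is soft_threshold(x_k - alpha*g_k, alpha*tau).
    x = np.zeros(A.shape[1])
    for _ in range(max_iter):
        g = A.T @ (A @ x - b)                  # gradient of the smooth part l
        x = soft_threshold(x - alpha * g, alpha * tau)
    return x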


Extensions: Explicit Trust Regions

Rather than penalizing distance moved from the current xk, we can enforce an explicit constraint: a trust region.

xk+1 = arg min_x f(xk) + gk^T (x − xk) + I_{‖x−xk‖≤∆k}(x),

where I_Λ(x) denotes the indicator function:

I_Λ(x) = 0 if x ∈ Λ, ∞ otherwise.

Adjust the trust-region radius ∆k to ensure progress, e.g. descent in f.


Extension: Proximal Point

Could use the original f in the subproblem rather than a simpler model function:

xk+1 = arg min_x f(x) + (1/(2αk)) ‖x − xk‖₂².

Although the subproblem seems “just as hard” to solve as the original, the prox-term may make it easier by introducing strong convexity, and may stabilize progress.

Can extend to constrained and regularized cases also.


Quadratic Models: Newton’s Method

We can extend the iterative strategy further by adding a quadratic term to the model, instead of (or in addition to) the simple prox-term above.

Taylor’s Theorem suggests basing this term on the Hessian (second-derivative) matrix. That is, obtain the step from

xk+1 := arg min_x f(xk) + ∇f(xk)^T (x − xk) + (1/2)(x − xk)^T ∇²f(xk)(x − xk).

Can reformulate to solve for the step dk: then xk+1 = xk + dk, where

dk := arg min_d f(xk) + ∇f(xk)^T d + (1/2) d^T ∇²f(xk) d.

See immediately that this model won’t have a bounded solution if ∇²f(xk) is not positive definite. It usually is positive definite near a strict local solution x∗, but we need something that works more globally.


Practical Newton Method

One “obvious” strategy is to add the prox-term to the quadratic model:

dk := arg min_d f(xk) + ∇f(xk)^T d + (1/2) d^T (∇²f(xk) + (1/αk) I) d,

choosing αk so that

The quadratic term is positive definite;

Some other desirable property holds, e.g. descent: f(xk + dk) < f(xk).

We can also impose the trust-region explicitly:

dk := arg min_d f(xk) + ∇f(xk)^T d + (1/2) d^T ∇²f(xk) d + I_{‖d‖≤∆k}(d),

or alternatively:

dk := arg min_{d : ‖d‖≤∆k} f(xk) + ∇f(xk)^T d + (1/2) d^T ∇²f(xk) d.

But this is equivalent: for any ∆k, there exists αk such that the solutions of the prox form and the trust-region form are identical.
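A hedged sketch of the prox-modified Newton step above: the coefficient lam plays the role of 1/αk and is increased until the modified Hessian is positive definite and the step gives descent. The factor-of-ten update and the stationarity tolerance are my own illustrative choices.

import numpy as np

def prox_newton_step(f, grad, hess, x, lam=1e-3):
    # Solve (hess(x) + lam*I) d = -grad(x), increasing lam until the modified
    # Hessian is positive definite and the step gives descent f(x + d) < f(x).
    g, H = grad(x), hess(x)
    if np.linalg.norm(g) < 1e-12:
        return x                               # already (nearly) stationary
    n = len(x)
    while True:
        M = H + lam * np.eye(n)
        try:
            np.linalg.cholesky(M)              # succeeds only if M is positive definite
        except np.linalg.LinAlgError:
            lam *= 10.0
            continue
        d = np.linalg.solve(M, -g)
        if f(x + d) < f(x):
            return x + d
        lam *= 10.0                            # strengthen the prox term and retry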


Quasi-Newton Methods

Another disadvantage of Newton is that the Hessian may be difficult to evaluate or otherwise work with. The quadratic model is still useful when we use first-order information to learn about the Hessian.

Key observation (from Taylor’s theorem): the secant condition:

∇²f(xk)(xk+1 − xk) ≈ ∇f(xk+1) − ∇f(xk).

The difference of gradients tells us how the Hessian behaves along the direction xk+1 − xk. By aggregating such information over multiple steps, we can build up an approximation to the Hessian that is valid along multiple directions.

Quasi-Newton methods maintain an approximation Bk to ∇²f(xk) that respects the secant condition.

The approximation may be implicit rather than explicit, and we may store an approximation to the inverse Hessian instead.


L-BFGS

A particularly popular quasi-Newton method, suitable for large-scale problems, is the limited-memory BFGS method (L-BFGS), which stores the Hessian or inverse Hessian approximation implicitly.

L-BFGS stores the last 5–10 update pairs

sj := xj+1 − xj, yj := ∇f(xj+1) − ∇f(xj),

for j = k, k − 1, k − 2, . . . , k − m. Can implicitly construct Hk+1 that satisfies

Hk+1 yj = sj.

In fact, an efficient recursive formula is available for evaluating dk+1 := −Hk+1 ∇f(xk+1) — the next search direction — directly from the (sj, yj) pairs and from some initial estimate of the form (1/αk+1) I.
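The “efficient recursive formula” referred to here is the well-known two-loop recursion; a minimal sketch follows, assuming the stored pair lists are ordered oldest to newest and that gamma plays the role of the initial scaling (1/αk+1).

import numpy as np

def lbfgs_direction(grad_k, s_list, y_list, gamma):
    # Two-loop recursion: returns d = -H_{k+1} grad_k, where H_{k+1} is the
    # implicit L-BFGS inverse-Hessian approximation built from the (s_j, y_j) pairs.
    q = grad_k.copy()
    rhos = [1.0 / y.dot(s) for s, y in zip(s_list, y_list)]
    alphas = []
    for s, y, rho in reversed(list(zip(s_list, y_list, rhos))):   # newest pair first
        a = rho * s.dot(q)
        alphas.append(a)
        q = q - a * y
    r = gamma * q                               # initial estimate H_0 = gamma * I
    for (s, y, rho), a in zip(zip(s_list, y_list, rhos), reversed(alphas)):  # oldest first
        beta = rho * y.dot(r)
        r = r + (a - beta) * s
    return -r                                   # next search direction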


Newton for Nonlinear Equations

There is also a variant of Newton’s method for nonlinear equations: find x such that

F(x) = 0,

where F : R^n → R^n (n equations in n unknowns).

Newton’s method forms a linear approximation to this system, based on another variant of Taylor’s Theorem, which says

F(x + d) = F(x) + J(x) d + ∫₀¹ [J(x + td) − J(x)] d dt,

where J(x) is the Jacobian matrix of first partial derivatives:

J(x) = [∂Fi/∂xj], i, j = 1, 2, . . . , n (usually not symmetric).


When F is continuously differentiable, we have

F (xk + d) ≈ F (xk) + J(xk)d ,

so the Newton step is the one that makes the right-hand side zero:

dk := −J(xk)⁻¹ F(xk).

The basic Newton method takes steps xk+1 := xk + dk. Its effectiveness can be improved by

Doing a line search: xk+1 := xk + αk dk, for some αk > 0;

Levenberg strategy: add λI to J and set dk := −(J(xk) + λI)⁻¹ F(xk).

Guide progress via a merit function, usually

φ(x) := (1/2) ‖F(x)‖₂².

Achtung! Can get stuck in a local min of φ that’s not a solution of F(x) = 0.
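A minimal sketch of the enhanced Newton iteration for F(x) = 0, combining the Newton direction with a backtracking line search on the merit function φ(x) = (1/2)‖F(x)‖²; the acceptance constant and tolerances are illustrative assumptions.

import numpy as np

def newton_equations(F, J, x0, tol=1e-10, max_iter=50):
    # Newton's method for F(x) = 0, guided by the merit function
    # phi(x) = 0.5 * ||F(x)||^2 and a backtracking line search.
    x = x0.copy()
    for _ in range(max_iter):
        Fx = F(x)
        if np.linalg.norm(Fx) < tol:
            break
        d = np.linalg.solve(J(x), -Fx)          # Newton direction
        phi = 0.5 * Fx.dot(Fx)
        alpha = 1.0
        while 0.5 * np.linalg.norm(F(x + alpha * d))**2 > (1.0 - 1e-4 * alpha) * phi:
            alpha *= 0.5                        # backtrack toward descent in phi
            if alpha < 1e-12:
                break                           # possibly stuck at a local min of phi
        x = x + alpha * d
    return x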


Homotopy

Tries to avoid the local-min issue with the merit function. Start with an “easier” set of nonlinear equations, and gradually deform it to the system F(x) = 0, tracking changes to the solution as you go.

F(x, λ) := λF(x) + (1 − λ) F0(x), λ ∈ [0, 1].

Assume that F(x, 0) = F0(x) = 0 has solution x0. Homotopy methods trace the curve of solutions (x, λ) until λ = 1 is reached. The corresponding value of x then solves the original problem.

Many variants.

Some supporting theory.

Typically more expensive than enhanced Newton methods, but better at finding solutions to F(x) = 0.

We mention homotopy mostly because of its connection to interior-point methods.


Interior-Point Methods

Recall the monotone LCP: find z ∈ R^n such that

0 ≤ z ⊥ Mz + q ≥ 0,

where M ∈ R^{n×n} is positive semidefinite, and q ∈ R^n. Recall too that monotone LCP is a generalization of LP and convex QP.

Rewrite LCP as

w = Mz + q, (w, z) ≥ 0, wi zi = 0, i = 1, 2, . . . , n,

which is a constrained system of nonlinear equations:

F (z ,w) = 0, (z ,w) ≥ 0,

where F : R^{2n} → R^{2n} is defined as

F(z, w) := [ Mz + q − w ; WZe ] = [ 0 ; 0 ],

where

W = diag(w1,w2, . . . ,wn), Z = diag(z1, z2, . . . , zn), e = (1, 1, . . . , 1)T .


Interior-Point Methods

Interior-point methods generate iterates that satisfy the nonnegativity constraints strictly, that is,

(zk, wk) > 0 for all k.

An obvious strategy is to search along Newton directions for F, with a line search to maintain positivity of (zk+1, wk+1). The Newton equations are

[ M  −I ; W  Z ] [ ∆z ; ∆w ] = − [ Mz + q − w ; WZe ].

This affine-scaling approach can actually work, and convergence can be proved, but it needs very precise conditions on the steplength to avoid getting jammed at the boundary.

Much better performance is obtained from methods that use homotopy, in particular, the idea of following a central path.


The Central Path

The central path is a critical object in primal-dual interior-point methods. It’s defined by the parametric equations

F(z, w; τ) := [ Mz + q − w ; WZe − τe ] = [ 0 ; 0 ],

where τ ≥ 0 is the central-path parameter, along with the conditions (z, w) > 0.

The second block of conditions is saying that

wi zi = τ, for all i = 1, 2, . . . , n,

so if τ > 0, we must have all components of w and z strictly positive — hence the term interior point.¹

¹Actually, interior-point methods don’t force the first condition Mz + q − w = 0 to hold except in the limit, so the iterates are not really “interior” in the sense of being “strictly feasible.”


Primal-Dual Interior-Point Methods

The central path guides iterates to the solution. The basic primal-dual interior-point method works as follows:

Compute Newton steps for F (z ,w ; τk) for some value of τk .

Take a step

(zk+1,wk+1) = (zk ,wk) + αk(∆zk ,∆wk),

choosing αk so that (zk+1,wk+1) remains sufficiently positive.

Possibly enhance with some auxiliary “corrector” steps.

Choose a smaller value τk+1 < τk and repeat.

The effect is that it chases a moving target along the central path, staying reasonably close to the central path to avoid getting jammed near the boundary of the region (w, z) ≥ 0.


A Simple PDIP Method

Usually works, simple to code (one page!)

Choose τk+1 = 0.9 τk;

Choose αk to be the largest α for which

(zki + α ∆zki)(wki + α ∆wki) ≥ 0.01 (zk + α ∆zk)^T (wk + α ∆wk)/n, for all i.

The second condition ensures that iterates stay in a loose neighborhood of the central path — none of the variables go to zero prematurely.
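A compact sketch of this simple PDIP scheme for the monotone LCP above. The starting point (z, w) = (e, e), the backtracking used to approximate the “largest α”, and the stopping test are my own simplifications; a careful implementation would also exploit the structure of the Newton system rather than solving it densely.

import numpy as np

def simple_pdip_lcp(M, q, tol=1e-8, max_iter=200):
    # Simple primal-dual interior-point method for the monotone LCP:
    # w = M z + q, (z, w) >= 0, z_i w_i = 0 for all i.
    n = len(q)
    z, w = np.ones(n), np.ones(n)
    tau = z.dot(w) / n
    for _ in range(max_iter):
        if z.dot(w) / n < tol and np.linalg.norm(M @ z + q - w) < tol:
            break
        tau *= 0.9                               # reduce the central-path target
        # Newton system for F(z, w; tau) = 0
        r1 = M @ z + q - w
        r2 = z * w - tau
        J = np.block([[M, -np.eye(n)],
                      [np.diag(w), np.diag(z)]])
        step = np.linalg.solve(J, -np.concatenate([r1, r2]))
        dz, dw = step[:n], step[n:]
        # back off until the iterate stays in a loose central-path neighborhood
        alpha = 1.0
        while alpha > 1e-12:
            zt, wt = z + alpha * dz, w + alpha * dw
            if np.all(zt > 0) and np.all(wt > 0) and \
               np.all(zt * wt >= 0.01 * zt.dot(wt) / n):
                break
            alpha *= 0.8
        z, w = z + alpha * dz, w + alpha * dw
    return z, w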


A More Elaborate PDIP Method (Mehrotra)

Chooses τk+1 by a heuristic based on the performance of the affine-scaling step (∆zk_aff, ∆wk_aff), which is computed by solving the Newton equations with τ = 0.

If we can take a long step along the affine-scaling direction before reaching the boundary, choose τk+1 much smaller than τk — an aggressive choice.

Otherwise, if the affine-scaling step quickly jams at the boundary, make a more conservative choice — τk+1 only a little smaller than τk.

Mehrotra’s method also includes a corrector step, that corrects for the nonlinearity revealed by the affine-scaling step. The combined centering-corrector step is obtained from

[ M  −I ; W  Z ] [ ∆z^cc ; ∆w^cc ] = [ 0 ; τk+1 e − ∆Zk_aff ∆Wk_aff e ],

where ∆Zk_aff and ∆Wk_aff denote the diagonal matrices formed from ∆zk_aff and ∆wk_aff.

This is added to the affine-scaling step to obtain the search direction.


Solving the Linear Equations

At each iteration need to solve two or more linear systems of the form:[M −IW Z

] [∆z∆w

]=

[rf

rzw

],

where

W and Z are the diagonals of wk and zk which contain allpositivenumbers.

rf and rzw are different right-hand sides (affine-scaling,centering-corrector).

Note that there is a lot of structure in this system — three of the blocksare N × N nonsingular diagonal — and that the structure is the same atevery interior-point iteration.


Reduced Form

Can do block elimination (of ∆w) to obtain

(M + Z⁻¹W) ∆z = rf + Z⁻¹ rzw.

Note that

When M is positive semidefinite (as it is for monotone LCP), then M + Z⁻¹W is positive definite.

If we can identify a good strategy for reordering rows and columns of M to solve this efficiently, we can re-use this ordering at every iteration, because M + Z⁻¹W has the same nonzero pattern (only the diagonals change).

Since some elements of Z⁻¹W are going to ∞ as the iterates near a solution while others are going to zero, this system can become highly ill-conditioned — but it doesn’t seem to matter much.


Application to LP

When we apply this technique to LCPs arising from LP, M has a particular form, and the linear system that we solve has the form

[ 0  A^T  I ; A  0  0 ; S  0  X ] [ ∆x ; ∆λ ; ∆s ] = [ rp ; rd ; rxs ],

where X = diag(x1, x2, . . . , xn) and S = diag(s1, s2, . . . , sn). By eliminating ∆s we obtain a reduced form:

[ −S⁻¹X  A^T ; A  0 ] [ ∆x ; ∆λ ] = [ rp − S⁻¹rxs ; rd ].

Since S⁻¹X is positive diagonal we can go a step further and eliminate ∆x:

A (SX⁻¹) A^T ∆λ = rd + A SX⁻¹ (rp − S⁻¹rxs).
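A sketch of the final elimination in code: form A D A^T with D = SX⁻¹, factor it with Cholesky, and recover the other step components by back-substitution, following the eliminations displayed above. This dense NumPy version is for illustration only; practical sparse codes compute a fill-reducing ordering once and reuse it at every iteration.

import numpy as np

def solve_lp_step(A, x, s, r_p, r_d, r_xs):
    # Form A D A^T with D = S X^{-1} (positive diagonal), solve for dlambda,
    # then recover dx and ds, following the eliminations above.
    d = s / x                                    # diagonal of D = S X^{-1}
    ADA = (A * d) @ A.T                          # equals A @ diag(d) @ A.T
    L = np.linalg.cholesky(ADA)
    rhs = r_d + (A * d) @ (r_p - r_xs / s)
    dlam = np.linalg.solve(L.T, np.linalg.solve(L, rhs))
    dx = (A.T @ dlam - r_p + r_xs / s) / d       # from the reduced 2x2 block system
    ds = (r_xs - s * dx) / x                     # from the row S dx + X ds = r_xs
    return dx, dlam, ds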


Thus we have a coefficient matrix of the form A D A^T, where D = SX⁻¹ is positive diagonal.

Some elements of D go to zero as the iterates converge (those for which si = 0 at the solution) while others go to ∞ (those for which xi = 0). Hence the system MAY be ill conditioned.

In practical codes, this matrix product is actually formed and factored at every iteration, using a variant of the Cholesky factorization.

This factorization is stable regardless of the ordering of rows/columns, so we are free to use orderings that create the least fill-in during Cholesky.

Good codes: MOSEK, Gurobi, CPLEX. Freebie: PCx.


Optimal Control (Open Loop)

Given current state x0, choose a time horizon N (long), and solve the optimization problem for x = (x0, x1, . . . , xN), u = (u0, u1, . . . , uN−1):

min L(x , u), subject to x0 given,

xk+1 = Fk(xk , uk), k = 0, 1, . . . ,N − 1,

other constraints on x , u.

Then apply controls u0, u1, u2, . . . blindly, without monitoring the state.

flexible with respect to nonlinearity and constraints;

if model Fk is inaccurate, solution may be bad;

doesn’t account for system disturbances during time horizon;

never used in industrial practice!


Linear-Quadratic Regulator (Closed Loop)

A canonical control problem. For given x0, solve:

min_{x,u} Φ(x, u) := (1/2) Σ_{k=0}^{∞} (xk^T Q xk + uk^T R uk)  s.t.  xk+1 = A xk + B uk.

From KKT conditions, dependence of optimal values of x1, x2, . . . and u0, u1, . . . on the initial x0 is linear. We can substitute these variables out to obtain

min_{x,u} Φ(x, u) = (1/2) x0^T Π x0,

for some s.p.d. matrix Π.

By using this dynamic programming principle, isolating the first stage, can write the problem as:

min_{x1,u0} (1/2)(x0^T Q x0 + u0^T R u0) + (1/2) x1^T Π x1  s.t.  x1 = A x0 + B u0.


By substituting for x1, get an unconstrained quadratic problem in u0. The minimizer is

u0 = K x0, where K = −(R + B^T Π B)⁻¹ B^T Π A,

so that

x1 = A x0 + B u0 = (A + BK) x0.

By substituting for u0 and x1 in

(1/2) x0^T Π x0 = (1/2)(x0^T Q x0 + u0^T R u0) + (1/2) x1^T Π x1,

obtain the Riccati equation:

Π = Q + A^T Π A − A^T Π B (R + B^T Π B)⁻¹ B^T Π A.

There are well-known techniques to solve this equation for Π, hence K.

Hence, we have a feedback control law u = Kx that is optimal for the LQR problem. This is a closed-loop control strategy that can respond to changes in state.
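One of those well-known techniques is plain fixed-point iteration on the Riccati equation; the sketch below iterates until Π settles down and then forms the gain K. Convergence assumes the usual stabilizability and definiteness conditions on (A, B, Q, R); the tolerances are illustrative.

import numpy as np

def lqr_gain(A, B, Q, R, tol=1e-10, max_iter=10000):
    # Iterate Pi <- Q + A^T Pi A - A^T Pi B (R + B^T Pi B)^{-1} B^T Pi A
    # to (approximate) convergence, then K = -(R + B^T Pi B)^{-1} B^T Pi A.
    Pi = Q.copy()
    for _ in range(max_iter):
        BtPi = B.T @ Pi
        Pi_new = Q + A.T @ Pi @ A - A.T @ Pi @ B @ np.linalg.solve(R + BtPi @ B, BtPi @ A)
        if np.linalg.norm(Pi_new - Pi) < tol:
            Pi = Pi_new
            break
        Pi = Pi_new
    K = -np.linalg.solve(R + B.T @ Pi @ B, B.T @ Pi @ A)
    return Pi, K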


Model Predictive Control (MPC)

Closed-loop has the highly desirable property of being able to respond to system disturbances and modeling errors. But closed-form solutions can be found only for simple models such as LQR.

Open-loop allows for a rich variety of models and constraints, and yields a highly structured optimization problem that can often be solved efficiently. But it can’t fix disturbances and errors. “Set and forget” at your peril!

Model-Predictive Control (MPC) is a way of doing closed-loop control using open-loop / optimal-control techniques.

Set up an optimal control problem, initialized at the current state, with a finite planning horizon.

Solve this open-loop problem (quickly!) and implement the control u0.

At the next decision point, repeat the process. Can use the previous solution (shifted forward one period) as a warm start.


Linear MPC

A generalization of the linear-quadratic regulator (LQR) problem includes constraints:

min_{x,u} (1/2) Σ_{k=0}^{∞} (xk^T Q xk + uk^T R uk),

subject to xk+1 = Axk + Buk , k = 0, 1, 2, . . . ,

xk ∈ X , uk ∈ U,

possibly also mixed constraints, and constraints on uk+1 − uk .

Assuming that 0 ∈ int(X), 0 ∈ int(U) and that the system is stabilizable, we expect that uk → 0 and xk → 0 as k → ∞. Therefore, for large enough k, the non-model constraints become inactive.


Hence, for N large enough, the problem is equivalent to the following (finite) problem:

min_{x,u} (1/2) Σ_{k=0}^{N−1} (xk^T Q xk + uk^T R uk) + (1/2) xN^T Π xN,

subject to xk+1 = Axk + Buk , k = 0, 1, 2, . . . ,N − 1

xk ∈ X , uk ∈ U, k = 0, 1, 2, . . . ,N − 1,

where Π is the solution of the Riccati equation. In the “tail” of the sequence (k > N) simply apply the unconstrained LQR feedback law, derived above.

When constraints are linear, need to solve a finite, structured, convex quadratic program.


Interior-Point Method for MPC

min_{u,x,ε} Σ_{k=0}^{N−1} [ (1/2)(xk^T Q xk + uk^T R uk + 2 xk^T M uk + εk^T Z εk) + z^T εk ] + xN^T Π xN,

subject to

x0 = xj , (fixed)

xk+1 = Axk + Buk , k = 0, 1, . . . ,N − 1,

Duk − Gxk ≤ d , k = 0, 1, . . . ,N − 1,

Hxk − εk ≤ h, k = 1, 2, . . . ,N,

εk ≥ 0, k = 1, 2, . . . ,N,

FxN = 0.

Soft constraints on the state (involving εk);

Also general “hard” constraints on (xk, uk). Can use these to implement constraints on control changes uk+1 − uk.


Introduce dual variables, use stagewise ordering. The primal-dual interior-point method yields a block-banded system at each iteration:

[Block-banded coefficient matrix, displayed stagewise: the repeating stage block is built from Q, M, R, A, B, G, D, H, Z and the diagonal matrices ΣDk, Σεk+1, ΣHk+1.] The unknowns are ordered as (. . . , ∆xk, ∆uk, ∆λk, ∆pk+1, ∆ξk+1, ∆ηk+1, ∆εk+1, ∆xk+1, . . .) and the right-hand side is (. . . , rxk, ruk, rλk, rpk+1, rξk+1, rηk+1, rεk+1, rxk+1, . . .), where ΣDk, Σεk+1, etc. are diagonal.


By performing block elimination, get reduced system

[Block-banded reduced coefficient matrix, with stage blocks built from Rk, Qk, Mk, A, B and terminal blocks QN, F.] The unknowns are (∆u0, ∆p0, ∆x1, ∆u1, ∆p1, ∆x2, ∆u2, . . . , ∆xN, ∆β) and the right-hand side is (ru0, rp0, rx1, ru1, rp1, rx2, ru2, . . . , rxN, rβ).

This system has the same structure as the KKT system of a problem without side constraints (soft or hard).


Can solve by applying a banded linear solver: O(N) operations. Alternatively, seek matrices Πk and vectors πk such that the following relationship is satisfied between ∆pk−1 and ∆xk:

−∆pk−1 + Πk ∆xk = πk,  k = N, N − 1, . . . , 1.

By substituting in the linear system, find a recurrence relation:

ΠN = QN,  πN = rxN,

Πk−1 = Qk−1 + A^T Πk A − (A^T Πk B + Mk−1)(Rk−1 + B^T Πk B)⁻¹ (B^T Πk A + Mk−1^T),

πk−1 = rxk−1 + A^T Πk rpk−1 + A^T πk − (A^T Πk B + Mk−1)(Rk−1 + B^T Πk B)⁻¹ (ruk−1 + B^T Πk rpk−1 + B^T πk).

The recurrence for Πk is the discrete time-varying Riccati equation!


Duality for Nonlinear Programming (Refresher)

Recall yesterday’s conversation about duality for general constrained problems:

min f(x) subject to ci(x) ≥ 0, i = 1, 2, . . . , m,

with the Lagrangian defined by

L(x, λ) := f(x) − λ^T c(x) = f(x) − Σ_{i=1}^{m} λi ci(x).

Remember that we defined primal and dual objectives:

r(x) = sup_{λ≥0} L(x, λ);  q(λ) = inf_x L(x, λ),

so the primal and dual problems are:

min_x r(x),  max_{λ≥0} q(λ).


Augmented Lagrangian

Can motivate it crudely as proximal point on the dual. Consider the dual:

max_{λ≥0} q(λ) = max_{λ≥0} inf_x f(x) − λ^T c(x).

Add a proximal point term — a quadratic penalty on the distance moved from the last dual iterate λk:

max_{λ≥0} inf_x f(x) − λ^T c(x) − (1/(2αk)) ‖λ − λk‖₂².

Note that the objective here is “simple” in λ. Now switch the max and inf:

inf_x max_{λ≥0} f(x) − λ^T c(x) − (1/(2αk)) ‖λ − λk‖₂².

We can solve the max problem, to get an explicit value for λ! This is easy because the components of λ are separated in the objective; we can solve for each one individually.


Explicit solution is

λi = 0 if −αk ci(x) + λki ≤ 0;  λi = −αk ci(x) + λki otherwise.

Substitute this result in the previous expression to get

min_x f(x) − Σ_{i : ci(x) ≥ λki/αk} (1/(2αk)) (λki)² + Σ_{i : ci(x) < λki/αk} [ (αk/2) ci(x)² − λki ci(x) ].

Thus the basic augmented Lagrangian process is (iteration k):

Minimize the augmented Lagrangian function above for x (approximately); call it xk+1;

Plug this x into the explicit-max formula for λ to get λk+1;

Choose αk+1 for the next iteration.


Equality Constraints

For the equality-constrained case, the formulae simplify a lot. We have

min_x f(x) subject to dj(x) = 0, j = 1, 2, . . . , p.

Dual objective is inf_x L(x, µ) := f(x) − µ^T d(x).

Augmented Lagrangian subproblems are:

xk+1 = arg min_x f(x) − (µk)^T d(x) + (αk/2) ‖d(x)‖₂²,

µk+1 = µk − αk d(xk+1).
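A compact sketch of this method-of-multipliers loop, using scipy.optimize.minimize as a generic (approximate) inner solver; the fixed penalty αk ≡ α, the warm start from the previous iterate, and the iteration counts are illustrative choices of mine.

import numpy as np
from scipy.optimize import minimize

def augmented_lagrangian(f, d, x0, alpha=10.0, n_outer=20):
    # Method of multipliers for: min f(x) subject to d(x) = 0.
    #   x_{k+1} = argmin_x f(x) - mu_k^T d(x) + (alpha/2)||d(x)||^2
    #   mu_{k+1} = mu_k - alpha * d(x_{k+1})
    x = np.asarray(x0, dtype=float)
    mu = np.zeros(len(np.atleast_1d(d(x))))
    for _ in range(n_outer):
        aug = lambda xx: f(xx) - mu @ np.atleast_1d(d(xx)) \
                         + 0.5 * alpha * np.sum(np.atleast_1d(d(xx)) ** 2)
        x = minimize(aug, x, method="BFGS").x    # approximate inner minimization
        mu = mu - alpha * np.atleast_1d(d(x))    # multiplier update
    return x, mu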


Augmented Lagrangian: History and Practice

How accurately to solve the subproblem for x, which is unconstrained but nonlinear?

How to adjust αk? Use a different αk value for each constraint; increase it when this constraint is not becoming feasible rapidly enough.

Historical Sketch:

Dates from 1969: Hestenes, Powell.

Developments in the 1970s and early 1980s by Rockafellar, Bertsekas, others.

Lancelot code for nonlinear programming: Conn, Gould, Toint, 1990.

Largely lost favor as an approach for general nonlinear programming during the next 15 years.

Recent revival in the context of sparse optimization and its many applications, in conjunction with splitting / coordinate descent.


Separable Objectives: ADMM

The Alternating Directions Method of Multipliers (ADMM) arises when the objective in the basic linearly constrained problem is separable:

min_{(x,z)} f(x) + h(z) subject to Ax + Bz = c,

for which

L(x, z, λ; α) := f(x) + h(z) − λ^T (Ax + Bz − c) + (α/2) ‖Ax + Bz − c‖₂².

Standard augmented Lagrangian would minimize L(x, z, λ; α) over (x, z) jointly — but these are coupled through the quadratic term, so the advantage of separability is lost.

Instead, minimize over x and z separately and sequentially:

xk = arg min_x L(x, zk−1, λk−1; αk);

zk = arg min_z L(xk, z, λk−1; αk);

λk = λk−1 − αk (Axk + Bzk − c).


ADMM

Basically, does a round of block-coordinate descent in (x , z).

The minimizations over x and z add only a quadratic term to f and h, respectively. This does not alter the cost much.

Can perform these minimizations inexactly.

Convergence is often slow, but sufficient for many applications.

Many recent applications to compressed sensing, image processing,matrix completion, sparse principal components analysis.

ADMM has a rich collection of antecedents. A nice recent survey, including a diverse collection of machine learning applications, is Boyd et al. (2011).


ADMM for Consensus Optimization

Given

min_x Σ_{i=1}^{m} fi(x),

form m copies x^1, . . . , x^m of x, with the original x as a “master” variable:

min_{x, x^1, x^2, ..., x^m} Σ_{i=1}^{m} fi(x^i) subject to x^i = x, i = 1, 2, . . . , m.

Apply ADMM, with z = (x^1, x^2, . . . , x^m), to get

x^i_k = arg min_{x^i} fi(x^i) − (λ^i_{k−1})^T (x^i − xk−1) + (αk/2) ‖x^i − xk−1‖₂², for all i,

xk = (1/m) Σ_{i=1}^{m} ( x^i_k − (1/αk) λ^i_{k−1} ),

λ^i_k = λ^i_{k−1} − αk (x^i_k − xk), for all i.

Obvious parallel possibilities in the x^i updates. Synchronize for the x update.
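A sketch of these consensus updates for the toy case fi(x) = (1/2)‖x − ai‖², chosen purely so that the local minimizations have the closed forms used below; everything else follows the three update formulas above.

import numpy as np

def consensus_admm(a_list, alpha=1.0, n_iter=100):
    # Consensus ADMM for min_x sum_i f_i(x) with f_i(x) = 0.5*||x - a_i||^2.
    # The local step min_{x_i} f_i(x_i) - lam_i^T (x_i - x) + (alpha/2)||x_i - x||^2
    # has the closed form used below.
    m, n = len(a_list), len(a_list[0])
    x = np.zeros(n)
    lam = [np.zeros(n) for _ in range(m)]
    for _ in range(n_iter):
        # local updates (parallelizable over i)
        xi = [(a_list[i] + lam[i] + alpha * x) / (1.0 + alpha) for i in range(m)]
        # master update: average of x_i - (1/alpha) * lam_i
        x = np.mean([xi[i] - lam[i] / alpha for i in range(m)], axis=0)
        # multiplier updates
        lam = [lam[i] - alpha * (xi[i] - x) for i in range(m)]
    return x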


ADMM for Awkward Intersections

The feasible set is sometimes an intersection of two or more convex sets that are easy to handle separately (e.g. projections are easily computable), but whose intersection is more difficult to work with.

Example: Optimization over the cone of doubly nonnegative matrices:

min_X f(X) s.t. X ⪰ 0, X ≥ 0.

General form:

min f (x) s.t. x ∈ Ωi , i = 1, 2, . . . ,m

Just consensus optimization, with indicator functions for the sets.

xk = arg min_x f(x) + Σ_{i=1}^{m} [ −(λ^i_{k−1})^T (x − x^i_{k−1}) + (αk/2) ‖x − x^i_{k−1}‖₂² ],

x^i_k = arg min_{x^i ∈ Ωi} −(λ^i_{k−1})^T (xk − x^i) + (αk/2) ‖xk − x^i‖₂², for all i,

λ^i_k = λ^i_{k−1} − αk (xk − x^i_k), for all i.


ADMM and Prox-Linear

Given

min_x f(x) + τψ(x),

reformulate as the equality constrained problem:

min_{x,z} f(x) + τψ(z) subject to x = z.

ADMM form:

xk := arg min_x f(x) + τψ(zk−1) + (λk)^T (x − zk−1) + (αk/2) ‖zk−1 − x‖₂²,

zk := arg min_z f(xk) + τψ(z) + (λk)^T (xk − z) + (αk/2) ‖z − xk‖₂²,

λk+1 := λk + αk (xk − zk).

Minimization over z is the shrink operator — often inexpensive.

Minimization over x can be performed approximately using an algorithm suited to the form of f.
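For instance, with f(x) = (1/2)‖Dx − b‖² and ψ(z) = ‖z‖1, the x-step is a linear solve, the z-step is the shrink (soft-threshold) operator, and the multiplier step is as displayed above (using the sign conventions of this slide). The problem data D, b and the fixed αk ≡ α are illustrative.

import numpy as np

def shrink(v, t):
    # Soft-thresholding: prox operator of t*||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_l1(D, b, tau, alpha=1.0, n_iter=200):
    # ADMM for min_x 0.5*||D x - b||^2 + tau*||x||_1 via the splitting x = z.
    n = D.shape[1]
    x, z, lam = np.zeros(n), np.zeros(n), np.zeros(n)
    DtD, Dtb = D.T @ D, D.T @ b
    for _ in range(n_iter):
        # x-step: (D^T D + alpha I) x = D^T b - lam + alpha z
        x = np.linalg.solve(DtD + alpha * np.eye(n), Dtb - lam + alpha * z)
        # z-step: shrink operator
        z = shrink(x + lam / alpha, tau / alpha)
        # multiplier step
        lam = lam + alpha * (x - z)
    return z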


The subproblems are not too different from those obtained in prox-linear algorithms:

λk is asymptotically similar to the gradient term in prox-linear, that is, λk ≈ ∇f(xk);

Thus, the minimization over z is quite similar to the prox-linear step.


References I

Bertsekas, D. P. (1999). Nonlinear Programming. Athena Scientific, second edition.

Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122.

Nocedal, J. and Wright, S. J. (2006). Numerical Optimization. Springer, New York,second edition.

Rao, C. V., Wright, S. J., and Rawlings, J. B. (1998). Application of interior-point methods to model predictive control. Journal of Optimization Theory and Applications, 99:723–757.

Wright, S. J. (1997a). Applying new optimization algorithms to model predictive control. In Kantor, J. C., editor, Chemical Process Control-V, volume 93 of AIChE Symposium Series, pages 147–155. CACHE Publications.

Wright, S. J. (1997b). Primal-Dual Interior-Point Methods. SIAM, Philadelphia, PA.
