8/8/2019 NLP Slides
LECTURE SLIDES ON NONLINEAR PROGRAMMING
BASED ON LECTURES GIVEN AT THE
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
CAMBRIDGE, MASS.
DIMITRI P. BERTSEKAS

These lecture slides are based on the book: Nonlinear Programming, Athena Scientific, by Dimitri P. Bertsekas; see

http://www.athenasc.com/nonlinbook.html

for errata, selected problem solutions, and other support material.

The slides are copyrighted but may be freely reproduced and distributed for any noncommercial purpose.
LAST REVISED: Feb. 3, 2005
6.252 NONLINEAR PROGRAMMING
LECTURE 1: INTRODUCTION
LECTURE OUTLINE
Nonlinear Programming
Application Contexts
Characterization Issue
Computation Issue
Duality
Organization
NONLINEAR PROGRAMMING
    min_{x ∈ X} f(x),

where

• f : ℝⁿ → ℝ is a continuous (and usually differentiable) function of n variables
• X = ℝⁿ or X is a subset of ℝⁿ with a continuous character.

• If X = ℝⁿ, the problem is called unconstrained.
• If f is linear and X is polyhedral, the problem is a linear programming problem. Otherwise it is a nonlinear programming problem.

• Linear and nonlinear programming have traditionally been treated separately. Their methodologies have gradually come closer.
TWO MAIN ISSUES
• Characterization of minima
  − Necessary conditions
  − Sufficient conditions
  − Lagrange multiplier theory
  − Sensitivity
  − Duality
• Computation by iterative algorithms
  − Iterative descent
  − Approximation methods
  − Dual and primal-dual methods
APPLICATIONS OF NONLINEAR PROGRAMMING
• Data networks − routing
• Production planning
• Resource allocation
• Computer-aided design
• Solution of equilibrium models
• Data analysis and least squares formulations
• Modeling human or organizational behavior
CHARACTERIZATION PROBLEM
• Unconstrained problems
  − Zero 1st order variation along all directions
• Constrained problems
  − Nonnegative 1st order variation along all feasible directions
• Equality constraints
  − Zero 1st order variation along all directions on the constraint surface
  − Lagrange multiplier theory
  − Sensitivity
COMPUTATION PROBLEM
• Iterative descent
• Approximation
• Role of convergence analysis
• Role of rate of convergence analysis
• Using an existing package to solve a nonlinear programming problem
POST-OPTIMAL ANALYSIS
Sensitivity
Role of Lagrange multipliers as prices
DUALITY
• Min common point problem / max intercept point problem duality

[Figure: min common point and max intercept point of a set S, panels (a) and (b).]

Illustration of the optimal values of the min common point and max intercept point problems. In (a), the two optimal values are not equal. In (b), the set S, when extended upwards along the nth axis, yields the set

    S̄ = { x̄ | for some x ∈ S, x̄_n ≥ x_n, x̄_i = x_i, i = 1, . . . , n − 1 }

which is convex. As a result, the two optimal values are equal. This fact, when suitably formalized, is the basis for some of the most important duality results.
6.252 NONLINEAR PROGRAMMING
LECTURE 2
UNCONSTRAINED OPTIMIZATION -
OPTIMALITY CONDITIONS
LECTURE OUTLINE
Unconstrained Optimization
Local Minima
Necessary Conditions for Local Minima
Sufficient Conditions for Local Minima
The Role of Convexity
MATHEMATICAL BACKGROUND
• Vectors and matrices in ℝⁿ
• Transpose, inner product, norm
• Eigenvalues of symmetric matrices
• Positive definite and semidefinite matrices
• Convergent sequences and subsequences
• Open, closed, and compact sets
• Continuity of functions
• 1st and 2nd order differentiability of functions
• Taylor series expansions
• Mean value theorems
LOCAL AND GLOBAL MINIMA
[Figure: graph of f(x) showing local minima, a strict local minimum, and a strict global minimum.]
Unconstrained local and global minima in one dimension.
NECESSARY CONDITIONS FOR A LOCAL MIN
• 1st order condition: Zero slope at a local minimum x*

    ∇f(x*) = 0

• 2nd order condition: Nonnegative curvature at a local minimum x*

    ∇²f(x*) : Positive Semidefinite

• There may exist points that satisfy the 1st and 2nd order conditions but are not local minima

[Figure: f(x) = |x|³ (convex), f(x) = x³, and f(x) = −|x|³, each with x* = 0.]
First and second order necessary optimality conditions for functions of one variable.
PROOFS OF NECESSARY CONDITIONS
• 1st order condition ∇f(x*) = 0. Fix d ∈ ℝⁿ. Then (since x* is a local min), from 1st order Taylor expansion,

    d'∇f(x*) = lim_{α↓0} ( f(x* + αd) − f(x*) ) / α ≥ 0.

Replace d with −d, to obtain

    d'∇f(x*) = 0,  ∀ d ∈ ℝⁿ

• 2nd order condition ∇²f(x*) ≥ 0. From 2nd order Taylor expansion,

    f(x* + αd) − f(x*) = α∇f(x*)'d + (α²/2) d'∇²f(x*)d + o(α²)

Since ∇f(x*) = 0 and x* is a local min, there is sufficiently small ε > 0 such that for all α ∈ (0, ε),

    0 ≤ ( f(x* + αd) − f(x*) ) / α² = (1/2) d'∇²f(x*)d + o(α²)/α²

Take the limit as α → 0.
SUFFICIENT CONDITIONS FOR A LOCAL MIN
• 1st order condition: Zero slope

    ∇f(x*) = 0

• 2nd order condition: Positive curvature

    ∇²f(x*) : Positive Definite

• Proof: Let λ > 0 be the smallest eigenvalue of ∇²f(x*). Using a second order Taylor expansion, we have for all d

    f(x* + d) − f(x*) = ∇f(x*)'d + (1/2) d'∇²f(x*)d + o(‖d‖²)
                      ≥ (λ/2)‖d‖² + o(‖d‖²)
                      = ( λ/2 + o(‖d‖²)/‖d‖² ) ‖d‖².

For ‖d‖ small enough, o(‖d‖²)/‖d‖² is negligible relative to λ/2.
CONVEXITY
[Figure: convex and nonconvex sets; for a convex set, the segment αx + (1 − α)y, 0 < α < 1, between any x and y stays in the set.]
Convex and nonconvex sets.

[Figure: a convex function f over a convex set C; at z between x and y, the chord value αf(x) + (1 − α)f(y) lies above f(z).]
A convex function. Linear interpolation overestimates the function.
MINIMA AND CONVEXITY
• Local minima are also global under convexity

[Figure: the chord αf(x*) + (1 − α)f(x) lies above f( αx* + (1 − α)x ) on the segment between x* and x.]

Illustration of why local minima of convex functions are also global. Suppose that f is convex and that x* is a local minimum of f. Let x be such that f(x) < f(x*). By convexity, for all α ∈ (0, 1),

    f( αx* + (1 − α)x ) ≤ αf(x*) + (1 − α)f(x) < f(x*).

Thus, f takes values strictly lower than f(x*) on the line segment connecting x* with x, and x* cannot be a local minimum which is not global.
OTHER PROPERTIES OF CONVEX FUNCTIONS
• f is convex if and only if the linear approximation at a point x based on the gradient underestimates f:

    f(z) ≥ f(x) + ∇f(x)'(z − x),  ∀ z ∈ ℝⁿ

[Figure: f(z) and its tangent f(x) + (z − x)'∇f(x) at x.]

• Implication:

    ∇f(x*) = 0  ⇒  x* is a global minimum

• f is convex if and only if ∇²f(x) is positive semidefinite for all x
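The gradient inequality above is easy to verify numerically. A minimal sketch, assuming the convex function f(x) = eˣ (an illustrative choice, not from the slides):

```python
import math

# Numerical check of the convexity characterization
# f(z) >= f(x) + f'(x)(z - x) for the convex function f(x) = exp(x).
f = math.exp          # f(x) = e^x, convex on the whole real line
df = math.exp         # f'(x) = e^x

points = [t / 10.0 for t in range(-30, 31)]   # grid on [-3, 3]
ok = all(f(z) >= f(x) + df(x) * (z - x) - 1e-12
         for x in points for z in points)
print(ok)  # the linear approximation never overestimates f
```

The small tolerance 1e-12 guards against floating-point rounding; analytically the inequality is exact.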
6.252 NONLINEAR PROGRAMMING
LECTURE 3: GRADIENT METHODS
LECTURE OUTLINE
Quadratic Unconstrained Problems
Existence of Optimal Solutions
Iterative Computational Methods
Gradient Methods - Motivation
Principal Gradient Methods
Gradient Methods - Choices of Direction
QUADRATIC UNCONSTRAINED PROBLEMS
    min_{x ∈ ℝⁿ} f(x) = (1/2) x'Qx − b'x,

where Q is n × n symmetric, and b ∈ ℝⁿ.

• Necessary conditions:

    ∇f(x*) = Qx* − b = 0,
    ∇²f(x*) = Q ≥ 0 : positive semidefinite.

• Q ≥ 0  ⇒  f : convex, necessary conditions are also sufficient, and local minima are also global
• Conclusions:
  − Q : not ≥ 0  ⇒  f has no local minima
  − If Q > 0 (and hence invertible), x* = Q⁻¹b is the unique global minimum.
  − If Q ≥ 0 but not invertible, either no solution or ∞ number of solutions
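For a concrete instance of the invertible case, the unique minimum x* = Q⁻¹b can be computed by solving the linear system Qx = b (Q and b below are assumed for illustration):

```python
import numpy as np

# Unique global minimum of f(x) = (1/2) x'Qx - b'x for Q > 0:
# solve Qx = b rather than forming the inverse explicitly.
Q = np.array([[4.0, 1.0],
              [1.0, 3.0]])      # symmetric positive definite
b = np.array([1.0, 2.0])

x_star = np.linalg.solve(Q, b)   # x* = Q^{-1} b
grad = Q @ x_star - b            # gradient Qx* - b should vanish
print(np.allclose(grad, 0.0))    # True
```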
[Figure: isocost curves of f for four cases of the coefficients α, β.]

• α > 0, β > 0: (1/α, 0) is the unique global minimum
• α > 0, β = 0: { (1/α, ξ) | ξ : real } is the set of global minima
• α = 0: there is no global minimum
• α > 0, β < 0: there is no global minimum

Illustration of the isocost surfaces of the quadratic cost function f : ℝ² → ℝ given by

    f(x, y) = (1/2)( αx² + βy² ) − x

for various values of α and β.
EXISTENCE OF OPTIMAL SOLUTIONS
• Consider the problem

    min_{x ∈ X} f(x)

• The set of optimal solutions is

    X* = ∩_{k=1}^∞ { x ∈ X | f(x) ≤ γ_k }

where {γ_k} is a scalar sequence such that γ_k ↓ f*, with

    f* = inf_{x ∈ X} f(x)

• X* is nonempty and compact if all the sets { x ∈ X | f(x) ≤ γ_k } are compact. So:
  − A global minimum exists if f is continuous and X is compact (Weierstrass theorem)
  − A global minimum exists if X is closed, and f is continuous and coercive, that is, f(x) → ∞ when ‖x‖ → ∞
GRADIENT METHODS - MOTIVATION
[Figure: level sets f(x) = c₁ > c₂ > c₃ and the step from x to x̄ = x − α∇f(x).]
If ∇f(x) ≠ 0, there is an interval (0, δ) of stepsizes such that

    f( x − α∇f(x) ) < f(x)  for all α ∈ (0, δ).

[Figure: level sets and a direction d at x, with x̄ = x + αd.]
If d makes an angle with ∇f(x) that is greater than 90 degrees,

    ∇f(x)'d < 0,

there is an interval (0, δ) of stepsizes such that f(x + αd) < f(x) for all α ∈ (0, δ).
PRINCIPAL GRADIENT METHODS
    x_{k+1} = x_k + α_k d_k,  k = 0, 1, . . .

where, if ∇f(x_k) ≠ 0, the direction d_k satisfies

    ∇f(x_k)'d_k < 0,

and α_k is a positive stepsize. Principal example:

    x_{k+1} = x_k − α_k D_k ∇f(x_k),

where D_k is a positive definite symmetric matrix.

• Simplest method: Steepest descent

    x_{k+1} = x_k − α_k ∇f(x_k),  k = 0, 1, . . .

• Most sophisticated method: Newton's method

    x_{k+1} = x_k − α_k ( ∇²f(x_k) )⁻¹ ∇f(x_k),  k = 0, 1, . . .
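A minimal sketch comparing the two extremes on an assumed ill-conditioned quadratic f(x) = (1/2)x'Qx: steepest descent with a constant stepsize, and Newton's method with a full step (which is exact for a quadratic):

```python
import numpy as np

# Steepest descent x_{k+1} = x_k - a * grad(x_k) versus Newton's method
# x_{k+1} = x_k - inv(Hessian) @ grad(x_k) on f(x) = (1/2) x'Qx.
Q = np.array([[10.0, 0.0], [0.0, 1.0]])   # ill-conditioned diagonal Q (assumed)
x_star = np.zeros(2)                      # the minimum of f

grad = lambda x: Q @ x

# Steepest descent, constant stepsize 2/(M + m) with M = 10, m = 1
x = np.array([1.0, 1.0])
for _ in range(100):
    x = x - (2.0 / 11.0) * grad(x)
sd_error = np.linalg.norm(x - x_star)

# Newton: for a quadratic, a single full step lands exactly on x*
x = np.array([1.0, 1.0])
x = x - np.linalg.solve(Q, grad(x))       # Hessian of f is Q
newton_error = np.linalg.norm(x - x_star)

print(newton_error < 1e-12, sd_error < 1e-2)  # → True True
```

Steepest descent contracts the error only by the factor 9/11 per step here, while Newton's method solves the quadratic in one iteration.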
STEEPEST DESCENT AND NEWTON'S METHOD

[Figure: slow zig-zag convergence of steepest descent from x₀ through x₁, x₂ across the level sets f(x) = c₁ > c₂ > c₃.]
OTHER CHOICES OF DIRECTION
• Diagonally Scaled Steepest Descent

    D_k = diagonal approximation to ( ∇²f(x_k) )⁻¹

• Modified Newton's Method

    D_k = ( ∇²f(x₀) )⁻¹,  k = 0, 1, . . .

• Discretized Newton's Method

    D_k = ( H(x_k) )⁻¹,  k = 0, 1, . . . ,

where H(x_k) is a finite-difference based approximation of ∇²f(x_k)

• Gauss-Newton method for least squares problems: min_{x ∈ ℝⁿ} (1/2)‖g(x)‖². Here

    D_k = ( ∇g(x_k)∇g(x_k)' )⁻¹,  k = 0, 1, . . .
6.252 NONLINEAR PROGRAMMING
LECTURE 4
CONVERGENCE ANALYSIS OF GRADIENT METHODS
LECTURE OUTLINE
Gradient Methods - Choice of Stepsize
Gradient Methods - Convergence Issues
CHOICES OF STEPSIZE I
• Minimization Rule: α_k is such that

    f(x_k + α_k d_k) = min_{α ≥ 0} f(x_k + αd_k).

• Limited Minimization Rule: Min over α ∈ [0, s]

• Armijo rule:

[Figure: f(x_k + αd_k) − f(x_k) versus α, together with the lines α∇f(x_k)'d_k and σα∇f(x_k)'d_k; the set of acceptable stepsizes and the unsuccessful stepsize trials s, βs, β²s.]

Start with s and continue with βs, β²s, . . ., until β^m s falls within the set of α with

    f(x_k) − f(x_k + αd_k) ≥ −σα∇f(x_k)'d_k.
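The backtracking loop just described translates directly into code. A minimal sketch in plain Python, with illustrative values for s, β, σ (the slides leave the constants unspecified):

```python
# Armijo rule sketch: start with trial stepsize s and shrink by beta
# until the sufficient-decrease test holds.
def armijo(f, grad_f, x, d, s=1.0, beta=0.5, sigma=1e-4):
    """Return a stepsize a = s * beta**m satisfying
    f(x) - f(x + a*d) >= -sigma * a * grad_f(x)'d."""
    gd = sum(g * di for g, di in zip(grad_f(x), d))  # grad'd, must be < 0
    a = s
    while f([xi + a * di for xi, di in zip(x, d)]) > f(x) + sigma * a * gd:
        a *= beta                                    # unsuccessful trial
    return a

# Example: f(x) = x1^2 + x2^2, descent direction d = -grad f(x)
f = lambda x: x[0] ** 2 + x[1] ** 2
grad_f = lambda x: [2 * x[0], 2 * x[1]]
x = [1.0, 1.0]
d = [-g for g in grad_f(x)]
a = armijo(f, grad_f, x, d)
print(0 < a <= 1.0)  # an acceptable stepsize is found
```

In this example the trial s = 1 fails the test and the first backtracked stepsize 0.5 is accepted.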
CHOICES OF STEPSIZE II
• Constant stepsize: α_k is such that

    α_k = s : a constant

• Diminishing stepsize:

    α_k → 0

but satisfies the infinite travel condition

    Σ_{k=0}^∞ α_k = ∞
GRADIENT METHODS WITH ERRORS
    x_{k+1} = x_k − α_k( ∇f(x_k) + e_k )

where e_k is an uncontrollable error vector

• Several special cases:
  − e_k small relative to the gradient, i.e., for all k, ‖e_k‖ < ‖∇f(x_k)‖

[Figure: descent property of the direction g_k = ∇f(x_k) + e_k when ‖e_k‖ < ‖∇f(x_k)‖.]

  − {e_k} is bounded, i.e., for all k, ‖e_k‖ ≤ δ, where δ is some scalar.
  − {e_k} is proportional to the stepsize, i.e., for all k, ‖e_k‖ ≤ qα_k, where q is some scalar.
  − {e_k} are independent zero mean random vectors
CONVERGENCE ISSUES
• Only convergence to stationary points can be guaranteed
• Even convergence to a single limit may be hard to guarantee (capture theorem)
• Danger of nonconvergence if directions d_k tend to be orthogonal to ∇f(x_k)
• Gradient related condition:

For any subsequence {x_k}_{k∈K} that converges to a nonstationary point, the corresponding subsequence {d_k}_{k∈K} is bounded and satisfies

    lim sup_{k→∞, k∈K} ∇f(x_k)'d_k < 0.

• Satisfied if d_k = −D_k∇f(x_k) and the eigenvalues of D_k are bounded above and bounded away from zero
CONVERGENCE RESULTS
CONSTANT AND DIMINISHING STEPSIZES
Let {x_k} be a sequence generated by a gradient method x_{k+1} = x_k + α_k d_k, where {d_k} is gradient related. Assume that for some constant L > 0, we have

    ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖,  ∀ x, y ∈ ℝⁿ.

Assume that either

(1) there exists a scalar ε > 0 such that for all k

    ε ≤ α_k ≤ (2 − ε)|∇f(x_k)'d_k| / ( L‖d_k‖² )

or

(2) α_k → 0 and Σ_{k=0}^∞ α_k = ∞.

Then either f(x_k) → −∞ or else {f(x_k)} converges to a finite value and ∇f(x_k) → 0.
MAIN PROOF IDEA
[Figure: the quadratic majorant α∇f(x_k)'d_k + (1/2)α²L‖d_k‖² of the cost difference f(x_k + αd_k) − f(x_k), minimized at ᾱ = |∇f(x_k)'d_k| / ( L‖d_k‖² ).]

The idea of the convergence proof for a constant stepsize. Given x_k and the descent direction d_k, the cost difference f(x_k + αd_k) − f(x_k) is majorized by α∇f(x_k)'d_k + (1/2)α²L‖d_k‖² (based on the Lipschitz assumption; see next slide). Minimization of this function over α yields the stepsize

    ᾱ = |∇f(x_k)'d_k| / ( L‖d_k‖² )

This stepsize reduces the cost function f as well.
DESCENT LEMMA
Let t be a scalar and let g(t) = f(x + ty). Have

    f(x + y) − f(x) = g(1) − g(0) = ∫₀¹ (dg/dt)(t) dt
        = ∫₀¹ y'∇f(x + ty) dt
        ≤ ∫₀¹ y'∇f(x) dt + | ∫₀¹ y'( ∇f(x + ty) − ∇f(x) ) dt |
        ≤ ∫₀¹ y'∇f(x) dt + ∫₀¹ ‖y‖ ‖∇f(x + ty) − ∇f(x)‖ dt
        ≤ y'∇f(x) + ‖y‖ ∫₀¹ Lt‖y‖ dt
        = y'∇f(x) + (L/2)‖y‖².
CONVERGENCE RESULT ARMIJO RULE
Let {x_k} be generated by x_{k+1} = x_k + α_k d_k, where {d_k} is gradient related and α_k is chosen by the Armijo rule. Then every limit point of {x_k} is stationary.

Proof Outline: Assume x̄ is a nonstationary limit point. Then f(x_k) → f(x̄), so α_k∇f(x_k)'d_k → 0.

• If {x_k}_K → x̄, then lim sup_{k→∞, k∈K} ∇f(x_k)'d_k < 0 by gradient relatedness, so that {α_k}_K → 0.

• By the Armijo rule, for large k ∈ K

    f(x_k) − f( x_k + (α_k/β)d_k ) < −σ(α_k/β)∇f(x_k)'d_k.

• Defining p_k = d_k/‖d_k‖ and ᾱ_k = α_k‖d_k‖/β, we have

    ( f(x_k) − f(x_k + ᾱ_k p_k) ) / ᾱ_k < −σ∇f(x_k)'p_k.

• Use the Mean Value Theorem and let k → ∞. We get −∇f(x̄)'p̄ ≤ −σ∇f(x̄)'p̄, where p̄ is a limit point of p_k − a contradiction since ∇f(x̄)'p̄ < 0.
6.252 NONLINEAR PROGRAMMING
LECTURE 5: RATE OF CONVERGENCE
LECTURE OUTLINE
Approaches for Rate of Convergence Analysis
The Local Analysis Method
Quadratic Model Analysis
The Role of the Condition Number
Scaling
Diagonal Scaling
Extension to Nonquadratic Problems
Singular and Difficult Problems
APPROACHES FOR RATE OF
CONVERGENCE ANALYSIS
• Computational complexity approach
• Informational complexity approach
• Local analysis
• Why we will focus on the local analysis method
THE LOCAL ANALYSIS APPROACH
• Restrict attention to sequences x_k converging to a local min x*
• Measure progress in terms of an error function e(x) with e(x*) = 0, such as

    e(x) = ‖x − x*‖,   e(x) = f(x) − f(x*)

• Compare the tail of the sequence e(x_k) with the tail of standard sequences
• Geometric or linear convergence [if e(x_k) ≤ qβᵏ for some q > 0 and β ∈ [0, 1), and for all k]. Holds if

    lim sup_{k→∞} e(x_{k+1}) / e(x_k) < 1

• Superlinear convergence [if e(x_k) ≤ q β^(pᵏ) for some q > 0, p > 1 and β ∈ [0, 1), and for all k].
• Sublinear convergence
QUADRATIC MODEL ANALYSIS
• Focus on the quadratic function f(x) = (1/2)x'Qx with Q > 0.
• Analysis also applies to nonquadratic problems in the neighborhood of a nonsingular local min
• Consider steepest descent

    x_{k+1} = x_k − α_k∇f(x_k) = (I − α_k Q)x_k

    ‖x_{k+1}‖² = x_k'(I − α_k Q)²x_k ≤ ( max eig. (I − α_k Q)² ) ‖x_k‖²

The eigenvalues of (I − α_k Q)² are equal to (1 − α_k λ_i)², where λ_i are the eigenvalues of Q, so

    max eig of (I − α_k Q)² = max{ (1 − α_k m)², (1 − α_k M)² },

where m, M are the smallest and largest eigenvalues of Q. Thus

    ‖x_{k+1}‖ ≤ max{ |1 − α_k m|, |1 − α_k M| } ‖x_k‖
OPTIMAL CONVERGENCE RATE
• The value of α_k that minimizes the bound is α* = 2/(M + m), in which case

    ‖x_{k+1}‖ ≤ ( (M − m)/(M + m) ) ‖x_k‖

[Figure: |1 − αm| and |1 − αM| versus α; their max is minimized at α* = 2/(M + m) with value (M − m)/(M + m); the stepsizes that guarantee convergence are 0 < α < 2/M.]

• Convergence rate for minimization stepsize (see text)

    f(x_{k+1}) ≤ ( (M − m)/(M + m) )² f(x_k)

• The ratio M/m is called the condition number of Q, and problems with M/m large are called ill-conditioned.
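The contraction factor (M − m)/(M + m) can be observed directly. A sketch with an assumed diagonal Q (m = 1, M = 9, so the predicted factor is 0.8):

```python
import numpy as np

# For steepest descent on f(x) = (1/2) x'Qx with the optimal constant
# stepsize a = 2/(M + m), the per-step contraction of ||x_k|| is at
# most (M - m)/(M + m).
m, M = 1.0, 9.0
Q = np.diag([m, M])                   # eigenvalues m and M (assumed)
a = 2.0 / (M + m)                     # optimal constant stepsize
factor = (M - m) / (M + m)            # predicted contraction, 0.8 here

x = np.array([1.0, 1.0])
ratios = []
for _ in range(5):
    x_next = x - a * (Q @ x)          # steepest descent step
    ratios.append(np.linalg.norm(x_next) / np.linalg.norm(x))
    x = x_next

print(all(r <= factor + 1e-12 for r in ratios))  # True
```

With this starting point the bound is tight: every observed ratio equals 0.8.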
SCALING AND STEEPEST DESCENT
• View the more general method

    x_{k+1} = x_k − α_k D_k∇f(x_k)

as a scaled version of steepest descent.

• Consider a change of variables x = Sy with S = (D_k)^{1/2}. In the space of y, the problem is

    minimize h(y) ≡ f(Sy)
    subject to y ∈ ℝⁿ

• Apply steepest descent to this problem, multiply with S, and pass back to the space of x, using ∇h(y_k) = S∇f(x_k):

    y_{k+1} = y_k − α_k∇h(y_k)
    Sy_{k+1} = Sy_k − α_k S∇h(y_k)
    x_{k+1} = x_k − α_k D_k∇f(x_k)
DIAGONAL SCALING
• Apply the results for steepest descent to the scaled iteration y_{k+1} = y_k − α_k∇h(y_k):

    ‖y_{k+1}‖ ≤ max{ |1 − α_k m_k|, |1 − α_k M_k| } ‖y_k‖

    f(x_{k+1}) / f(x_k) = h(y_{k+1}) / h(y_k) ≤ ( (M_k − m_k)/(M_k + m_k) )²

where m_k and M_k are the smallest and largest eigenvalues of the Hessian of h, which is

    ∇²h(y) = S∇²f(x)S = (D_k)^{1/2} Q (D_k)^{1/2}

• It is desirable to choose D_k as close as possible to Q⁻¹. Also if D_k is so chosen, the stepsize α = 1 is near the optimal 2/(M_k + m_k).
• Using as D_k a diagonal approximation to Q⁻¹ is common and often very effective. Corrects for poor choice of units expressing the variables.
NONQUADRATIC PROBLEMS
• Rate of convergence to a nonsingular local minimum of a nonquadratic function is very similar to the quadratic case (linear convergence is typical).
• If D_k → ( ∇²f(x*) )⁻¹, we asymptotically obtain optimal scaling and superlinear convergence
• More generally, if the direction d_k = −D_k∇f(x_k) approaches asymptotically the Newton direction, i.e.,

    lim_{k→∞} ‖d_k + ( ∇²f(x*) )⁻¹∇f(x_k)‖ / ‖∇f(x_k)‖ = 0,

and the Armijo rule is used with initial stepsize equal to one, the rate of convergence is superlinear.
• Convergence rate to a singular local min is typically sublinear (in effect, condition number = ∞)
6.252 NONLINEAR PROGRAMMING
LECTURE 6
NEWTON AND GAUSS-NEWTON METHODS
LECTURE OUTLINE
Newton's Method
Convergence Rate of the Pure Form
Global Convergence Variants of Newton's Method
Least Squares Problems
The Gauss-Newton Method
NEWTON'S METHOD

    x_{k+1} = x_k − α_k ( ∇²f(x_k) )⁻¹ ∇f(x_k)

assuming that the Newton direction is defined and is a direction of descent

• Pure form of Newton's method (stepsize α_k = 1):

    x_{k+1} = x_k − ( ∇²f(x_k) )⁻¹ ∇f(x_k)

  − Very fast when it converges (how fast?)
  − May not converge (or worse, it may not be defined) when started far from a nonsingular local min
  − Issue: How to modify the method so that it converges globally, while maintaining the fast convergence rate
CONVERGENCE RATE OF PURE FORM
• Consider solution of nonlinear system g(x) = 0 where g : ℝⁿ → ℝⁿ, with method

    x_{k+1} = x_k − ( ∇g(x_k)' )⁻¹ g(x_k)

• If g(x) = ∇f(x), we get pure form of Newton
• Quick derivation: Suppose x_k → x* with g(x*) = 0 and ∇g(x*) is invertible. By Taylor

    0 = g(x*) = g(x_k) + ∇g(x_k)'(x* − x_k) + o(‖x_k − x*‖).

Multiply with ( ∇g(x_k)' )⁻¹:

    x_k − x* − ( ∇g(x_k)' )⁻¹ g(x_k) = o(‖x_k − x*‖),

so

    ‖x_{k+1} − x*‖ = o(‖x_k − x*‖),

implying superlinear convergence and capture.
CONVERGENCE BEHAVIOR OF PURE FORM
• Example: g(x) = eˣ − 1, pure Newton iteration starting at x₀ = −1:

    k    x_k         g(x_k)
    0    −1.00000    −0.63212
    1     0.71828     1.05091
    2     0.20587     0.22859
    3     0.01981     0.02000
    4     0.00019     0.00019
    5     0.00000     0.00000

[Figure: graph of g(x) = eˣ − 1 and the Newton iterates x₀, x₁, x₂, x₃.]
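The table can be reproduced in a few lines: for g(x) = eˣ − 1 the Newton step x_{k+1} = x_k − g(x_k)/g'(x_k) simplifies to x_{k+1} = x_k − 1 + e^{−x_k}.

```python
import math

# Pure Newton iteration for g(x) = e^x - 1, starting at x0 = -1.
x = -1.0
iterates = []
for _ in range(5):
    x = x - 1.0 + math.exp(-x)   # x - g(x)/g'(x) with g'(x) = e^x
    iterates.append(x)

print([round(v, 5) for v in iterates])
# → [0.71828, 0.20587, 0.01981, 0.00019, 0.0]
```

The error roughly squares at each step (x_{k+1} ≈ x_k²/2 for small x_k), the superlinear behavior derived on the previous slide.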
MODIFICATIONS FOR GLOBAL CONVERGENCE
• Use a stepsize
• Modify the Newton direction when:
  − Hessian is not positive definite
  − When Hessian is nearly singular (needed to improve performance)
• Use

    d_k = −( ∇²f(x_k) + Δ_k )⁻¹ ∇f(x_k),

whenever the Newton direction does not exist or is not a descent direction. Here Δ_k is a diagonal matrix such that

    ∇²f(x_k) + Δ_k > 0

  − Modified Cholesky factorization
  − Trust region methods
LEAST-SQUARES PROBLEMS
    minimize f(x) = (1/2)‖g(x)‖² = (1/2) Σ_{i=1}^m ‖g_i(x)‖²
    subject to x ∈ ℝⁿ,

where g = (g₁, . . . , g_m), g_i : ℝⁿ → ℝ^{r_i}.

• Many applications:
  − Solution of systems of n nonlinear equations with n unknowns
  − Model construction − curve fitting
  − Neural networks
  − Pattern classification
PURE FORM OF THE GAUSS-NEWTON METHOD
• Idea: Linearize around the current point x_k,

    g̃(x, x_k) = g(x_k) + ∇g(x_k)'(x − x_k),

and minimize the norm of the linearized function g̃:

    x_{k+1} = arg min_{x ∈ ℝⁿ} (1/2)‖g̃(x, x_k)‖²
            = x_k − ( ∇g(x_k)∇g(x_k)' )⁻¹ ∇g(x_k)g(x_k)

• The direction

    −( ∇g(x_k)∇g(x_k)' )⁻¹ ∇g(x_k)g(x_k)

is a descent direction since

    ∇g(x_k)g(x_k) = ∇( (1/2)‖g(x)‖² )|_{x=x_k},   ∇g(x_k)∇g(x_k)' > 0
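A minimal numerical sketch of the pure Gauss-Newton iteration, written with the Jacobian J = ∇g' and an assumed exponential-fit problem with noiseless synthetic data (so an exact fit exists):

```python
import numpy as np

# Gauss-Newton sketch for min (1/2)||g(a)||^2: fit z = exp(a*t) to data,
# with residuals g_i(a) = exp(a*t_i) - z_i. Data and a_true are assumed.
t = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
a_true = 0.5
z = np.exp(a_true * t)                       # noiseless data

a = 0.0                                      # starting guess
for _ in range(10):
    g = np.exp(a * t) - z                    # residual vector g(a)
    J = (t * np.exp(a * t)).reshape(-1, 1)   # Jacobian dg/da, m x 1
    # Pure Gauss-Newton step: a <- a - (J'J)^{-1} J' g
    a = a - np.linalg.solve(J.T @ J, J.T @ g).item()

print(abs(a - a_true) < 1e-8)  # True
```

Because the residual at the solution is zero, the pure Gauss-Newton iteration converges fast here; for large-residual problems the modifications on the next slide become important.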
MODIFICATIONS OF THE GAUSS-NEWTON
• Similar to those for Newton's method:

    x_{k+1} = x_k − α_k ( ∇g(x_k)∇g(x_k)' + Δ_k )⁻¹ ∇g(x_k)g(x_k),

where α_k is a stepsize and Δ_k is a diagonal matrix such that

    ∇g(x_k)∇g(x_k)' + Δ_k > 0

• Incremental version of the Gauss-Newton method:
  − Operate in cycles
  − Start a cycle with ψ₀ (an estimate of x)
  − Update using a single component of g:

    ψ_i = arg min_{x ∈ ℝⁿ} Σ_{j=1}^i ‖g̃_j(x, ψ_{j−1})‖²,  i = 1, . . . , m,

where g̃_j are the linearized functions

    g̃_j(x, ψ_{j−1}) = g_j(ψ_{j−1}) + ∇g_j(ψ_{j−1})'(x − ψ_{j−1})
MODEL CONSTRUCTION
• Given set of m input-output data pairs (y_i, z_i), i = 1, . . . , m, from the physical system
• Hypothesize an input/output relation z = h(x, y), where x is a vector of unknown parameters, and h is known
• Find x that matches best the data in the sense that it minimizes the sum of squared errors

    (1/2) Σ_{i=1}^m ‖z_i − h(x, y_i)‖²

• Example of a linear model: Fit the data pairs by a cubic polynomial approximation. Take

    h(x, y) = x₃y³ + x₂y² + x₁y + x₀,

where x = (x₀, x₁, x₂, x₃) is the vector of unknown coefficients of the cubic polynomial.
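Because h is linear in the coefficients x, this fit is an ordinary linear least-squares problem. A sketch with assumed synthetic data:

```python
import numpy as np

# Fit a cubic h(x, y) = x3*y^3 + x2*y^2 + x1*y + x0 to data pairs
# (y_i, z_i) by linear least squares. true_x is assumed for illustration.
y = np.linspace(-1.0, 1.0, 20)
true_x = np.array([1.0, -2.0, 0.5, 3.0])     # [x0, x1, x2, x3]
z = true_x[0] + true_x[1] * y + true_x[2] * y**2 + true_x[3] * y**3

A = np.vander(y, 4, increasing=True)         # columns: 1, y, y^2, y^3
x_fit, *_ = np.linalg.lstsq(A, z, rcond=None)

print(np.allclose(x_fit, true_x))  # noiseless data: coefficients recovered
```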
NEURAL NETS
• Nonlinear model construction with multilayer perceptrons
• x is the vector of weights
• Universal approximation property
PATTERN CLASSIFICATION
• Objects are presented to us, and we wish to classify them in one of s categories 1, . . . , s, based on a vector y of their features.
• Classical maximum posterior probability approach: Assume we know

    p(j|y) = P(object with feature vector y is of category j).

Assign an object with feature vector y to category

    j*(y) = arg max_{j=1,...,s} p(j|y).

• If p(j|y) are unknown, we can estimate them using functions h_j(x_j, y) parameterized by vectors x_j. Obtain x_j by minimizing

    (1/2) Σ_{i=1}^m ( z_i^j − h_j(x_j, y_i) )²,

where

    z_i^j = 1 if y_i is of category j, 0 otherwise.
6.252 NONLINEAR PROGRAMMING
LECTURE 7: ADDITIONAL METHODS
LECTURE OUTLINE
Least-Squares Problems and Incremental Gradient Methods
Conjugate Direction Methods
The Conjugate Gradient Method
Quasi-Newton Methods
Coordinate Descent Methods
• Recall the least-squares problem:

    minimize f(x) = (1/2)‖g(x)‖² = (1/2) Σ_{i=1}^m ‖g_i(x)‖²
    subject to x ∈ ℝⁿ,

where g = (g₁, . . . , g_m), g_i : ℝⁿ → ℝ^{r_i}.
INCREMENTAL GRADIENT METHODS
• Steepest descent method:

    x_{k+1} = x_k − α_k∇f(x_k) = x_k − α_k Σ_{i=1}^m ∇g_i(x_k)g_i(x_k)

• Incremental gradient method:

    ψ_i = ψ_{i−1} − α_k ∇g_i(ψ_{i−1})g_i(ψ_{i−1}),  i = 1, . . . , m,

    ψ₀ = x_k,   x_{k+1} = ψ_m

[Figure: one-dimensional example with components (a_i x − b_i)²; outside the interval [min_i b_i/a_i, max_i b_i/a_i] containing x*, every component gradient points toward the minimum.]

• Advantage of incrementalism
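A one-dimensional sketch of the incremental gradient method for f(x) = (1/2)Σ_i (a_i x − b_i)², with assumed data and a diminishing stepsize (the scaling constant 0.1 is an illustrative tuning choice):

```python
import numpy as np

# Incremental gradient: one component g_i(x) = a_i*x - b_i per inner
# step, cycling through the components, with a diminishing stepsize.
a = np.array([1.0, 2.0, 3.0])       # assumed data
b = np.array([2.0, 2.0, 3.0])
x_star = (a @ b) / (a @ a)          # exact minimizer of the full cost

x = 0.0
for k in range(2000):               # 2000 cycles
    step = 0.1 / (k + 1)            # diminishing stepsize
    for ai, bi in zip(a, b):        # one pass over the components
        x = x - step * ai * (ai * x - bi)

print(abs(x - x_star) < 1e-3)       # converges to the true minimizer
```

With a constant stepsize the iterates would instead settle into a neighborhood of x*, as the next slide explains via the stepsize-proportional error term.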
VIEW AS GRADIENT METHOD W/ ERRORS
• Can write incremental gradient method as

    x_{k+1} = x_k − α_k Σ_{i=1}^m ∇g_i(x_k)g_i(x_k)
              + α_k Σ_{i=1}^m ( ∇g_i(x_k)g_i(x_k) − ∇g_i(ψ_{i−1})g_i(ψ_{i−1}) )

• Error term is proportional to stepsize α_k
• Convergence (generically) for a diminishing stepsize (under a Lipschitz condition on ∇g_i g_i)
• Convergence to a neighborhood for a constant stepsize
CONJUGATE DIRECTION METHODS
• Aim to improve convergence rate of steepest descent, without the overhead of Newton's method.
• Analyzed for a quadratic model. They require n iterations to minimize f(x) = (1/2)x'Qx − b'x with Q an n × n positive definite matrix Q > 0.
• Analysis also applies to nonquadratic problems in the neighborhood of a nonsingular local min.
• The directions d₁, . . . , d_k are Q-conjugate if d_i'Qd_j = 0 for all i ≠ j.
• Generic conjugate direction method:

    x_{k+1} = x_k + α_k d_k,

where α_k is obtained by line minimization.

[Figure: expanding subspace theorem; iterates x₀, x₁, x₂ and conjugate directions d₀ = Q^{−1/2}w₀, d₁ = Q^{−1/2}w₁ obtained from orthogonal directions w₀, w₁.]
GENERATING Q-CONJUGATE DIRECTIONS
• Given a set of linearly independent vectors ξ₀, . . . , ξ_k, we can construct a set of Q-conjugate directions d₀, . . . , d_k s.t. Span(d₀, . . . , d_i) = Span(ξ₀, . . . , ξ_i)
• Gram-Schmidt procedure. Start with d₀ = ξ₀. If for some i < k, d₀, . . . , d_i are Q-conjugate and the above property holds, take

    d_{i+1} = ξ_{i+1} + Σ_{m=0}^i c^{(i+1)m} d_m;

choose c^{(i+1)m} so d_{i+1} is Q-conjugate to d₀, . . . , d_i:

    d_{i+1}'Qd_j = ξ_{i+1}'Qd_j + ( Σ_{m=0}^i c^{(i+1)m} d_m )'Qd_j = 0.

[Figure: d₁ = ξ₁ + c¹⁰d₀ and d₂ = ξ₂ + c²⁰d₀ + c²¹d₁.]
CONJUGATE GRADIENT METHOD
• Apply Gram-Schmidt to the vectors ξ_k = −g_k = −∇f(x_k), k = 0, 1, . . . , n − 1. Then

    d_k = −g_k + Σ_{j=0}^{k−1} ( g_k'Qd_j / d_j'Qd_j ) d_j

• Key fact: Direction formula can be simplified.
• Proposition: The directions of the CGM are generated by d₀ = −g₀, and

    d_k = −g_k + β_k d_{k−1},  k = 1, . . . , n − 1,

where β_k is given by

    β_k = g_k'g_k / g_{k−1}'g_{k−1}   or   β_k = (g_k − g_{k−1})'g_k / g_{k−1}'g_{k−1}

• Furthermore, the method terminates with an optimal solution after at most n steps.
• Extension to nonquadratic problems.
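The proposition can be checked numerically. A sketch of the conjugate gradient method using the first formula for β_k, on a randomly generated positive definite Q (assumed test data); for a quadratic the exact line-minimization stepsize is available in closed form:

```python
import numpy as np

# Conjugate gradient for f(x) = (1/2) x'Qx - b'x with Q > 0:
# d_0 = -g_0, then d_k = -g_k + beta_k d_{k-1}. Terminates in <= n steps.
n = 4
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
Q = A @ A.T + n * np.eye(n)           # symmetric positive definite (assumed)
b = rng.standard_normal(n)

x = np.zeros(n)
g = Q @ x - b                         # gradient g_0
d = -g
for k in range(n):                    # at most n iterations
    alpha = (g @ g) / (d @ Q @ d)     # exact minimizing stepsize
    x = x + alpha * d
    g_new = Q @ x - b
    beta = (g_new @ g_new) / (g @ g)  # beta_k = g_k'g_k / g_{k-1}'g_{k-1}
    d = -g_new + beta * d
    g = g_new

print(np.linalg.norm(Q @ x - b) < 1e-8)  # gradient ~ 0 after n steps
```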
PROOF OF CONJUGATE GRADIENT RESULT
• Use induction to show that all gradients g_k generated up to termination are linearly independent. True for k = 1. Suppose no termination after k steps, and g₀, . . . , g_{k−1} are linearly independent. Then Span(d₀, . . . , d_{k−1}) = Span(g₀, . . . , g_{k−1}) and there are two possibilities:
  − g_k = 0, and the method terminates.
  − g_k ≠ 0, in which case from the expanding manifold property
      g_k is orthogonal to d₀, . . . , d_{k−1},
      g_k is orthogonal to g₀, . . . , g_{k−1},
    so g_k is linearly independent of g₀, . . . , g_{k−1}, completing the induction.
• Since at most n linearly independent gradients can be generated, g_k = 0 for some k ≤ n.
• Algebra to verify the direction formula.
QUASI-NEWTON METHODS
• x_{k+1} = x_k − α_k D_k∇f(x_k), where D_k is an inverse Hessian approximation.
• Key idea: Successive iterates x_k, x_{k+1} and gradients ∇f(x_k), ∇f(x_{k+1}) yield curvature info:

    q_k ≈ ∇²f(x_{k+1}) p_k,
    p_k = x_{k+1} − x_k,   q_k = ∇f(x_{k+1}) − ∇f(x_k),

    ∇²f(x_n) ≈ [ q₀ · · · q_{n−1} ][ p₀ · · · p_{n−1} ]⁻¹

• Most popular Quasi-Newton method is a clever way to implement this idea:

    D_{k+1} = D_k + p_k p_k'/(p_k'q_k) − D_k q_k q_k'D_k/(q_k'D_k q_k) + ξ_k τ_k v_k v_k',

    v_k = p_k/(p_k'q_k) − D_k q_k/τ_k,   τ_k = q_k'D_k q_k,   0 ≤ ξ_k ≤ 1,

and D₀ > 0 is arbitrary, α_k by line minimization, and D_n = Q⁻¹ for a quadratic.
NONDERIVATIVE METHODS
• Finite difference implementations
• Forward and central difference formulas:

    ∂f(x_k)/∂x_i ≈ (1/h)( f(x_k + h e_i) − f(x_k) )

    ∂f(x_k)/∂x_i ≈ (1/(2h))( f(x_k + h e_i) − f(x_k − h e_i) )

• Use central difference for more accuracy near convergence
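The accuracy gap between the two formulas is easy to see numerically: the forward difference has O(h) error, the central difference O(h²). A sketch using f(x) = sin x (an illustrative choice):

```python
import math

# Forward vs central differences for f(x) = sin(x) at x = 1,
# compared against the exact derivative cos(1).
f, x, h = math.sin, 1.0, 1e-4
exact = math.cos(x)

forward = (f(x + h) - f(x)) / h               # O(h) error
central = (f(x + h) - f(x - h)) / (2 * h)     # O(h^2) error

print(abs(central - exact) < abs(forward - exact))  # True
```

With h = 1e-4 the forward error is around 4e-5 while the central error is near 1e-9, which is why the central formula is preferred close to convergence, at the cost of one extra function evaluation per coordinate.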
[Figure: coordinate descent iterates x_k, x_{k+1}, x_{k+2}.]
• Coordinate descent. Applies also to the case where there are bound constraints on the variables.
• Direct search methods. Nelder-Mead method.
6.252 NONLINEAR PROGRAMMING
LECTURE 8
OPTIMIZATION OVER A CONVEX SET;
OPTIMALITY CONDITIONS
• Problem: min_{x∈X} f(x), where:
(a) X ⊂ ℝⁿ is nonempty, convex, and closed.
(b) f is continuously differentiable over X.
• Local and global minima. If f is convex, local minima are also global.

[Figure: local minima and the global minimum of f over a constraint set X.]
OPTIMALITY CONDITION
Proposition (Optimality Condition)

(a) If x* is a local minimum of f over X, then

    ∇f(x*)'(x − x*) ≥ 0,  ∀ x ∈ X.

(b) If f is convex over X, then this condition is also sufficient for x* to minimize f over X.

[Figure: at a local minimum x*, the gradient ∇f(x*) makes an angle less than or equal to 90 degrees with all feasible variations x − x*, x ∈ X.]

[Figure: illustration of failure of the optimality condition when X is not convex; here x* is a local min but ∇f(x*)'(x − x*) < 0 for the feasible vector x shown.]
PROOF
Proof: (a) By contradiction. Suppose that ∇f(x*)'(x − x*) < 0 for some x ∈ X. By the Mean Value Theorem, for every ε > 0 there exists an s ∈ [0, 1] such that

    f( x* + ε(x − x*) ) = f(x*) + ε∇f( x* + sε(x − x*) )'(x − x*).

Since ∇f is continuous, for sufficiently small ε > 0,

    ∇f( x* + sε(x − x*) )'(x − x*) < 0,

so that f( x* + ε(x − x*) ) < f(x*). The vector x* + ε(x − x*) is feasible for all ε ∈ [0, 1] because X is convex, so the optimality of x* is contradicted.

(b) Using the convexity of f,

    f(x) ≥ f(x*) + ∇f(x*)'(x − x*)

for every x ∈ X. If the condition ∇f(x*)'(x − x*) ≥ 0 holds for all x ∈ X, we obtain f(x) ≥ f(x*), so x* minimizes f over X. Q.E.D.
OPTIMIZATION SUBJECT TO BOUNDS
• Let X = { x | x ≥ 0 }. Then the necessary condition for x* = (x₁*, . . . , x_n*) to be a local min is

    Σ_{i=1}^n ( ∂f(x*)/∂x_i )(x_i − x_i*) ≥ 0,  ∀ x_i ≥ 0, i = 1, . . . , n.

• Fix i. Let x_j = x_j* for j ≠ i and x_i = x_i* + 1:

    ∂f(x*)/∂x_i ≥ 0,  ∀ i.

• If x_i* > 0, let also x_j = x_j* for j ≠ i and x_i = (1/2)x_i*. Then ∂f(x*)/∂x_i ≤ 0, so

    ∂f(x*)/∂x_i = 0,  if x_i* > 0.

[Figure: the two cases x_i* > 0 (gradient component zero) and x_i* = 0 (gradient component nonnegative).]
OPTIMIZATION OVER A SIMPLEX
    X = { x | x ≥ 0, Σ_{i=1}^n x_i = r },

where r > 0 is a given scalar.

• Necessary condition for x* = (x₁*, . . . , x_n*) to be a local min:

    Σ_{i=1}^n ( ∂f(x*)/∂x_i )(x_i − x_i*) ≥ 0,  ∀ x_i ≥ 0 with Σ_{i=1}^n x_i = r.

• Fix i with x_i* > 0 and let j be any other index. Use x with x_i = 0, x_j = x_j* + x_i*, and x_m = x_m* for all m ≠ i, j:

    ( ∂f(x*)/∂x_j − ∂f(x*)/∂x_i ) x_i* ≥ 0,

so

    x_i* > 0  ⇒  ∂f(x*)/∂x_i ≤ ∂f(x*)/∂x_j,  ∀ j,

i.e., at the optimum, positive components have minimal (and equal) first cost derivative.
OPTIMAL ROUTING
• Given a data net, and a set W of OD pairs w = (i, j). Each OD pair w has input traffic r_w.

[Figure: origin and destination of OD pair w, with input traffic r_w split over path flows x₁, x₂, x₃, x₄.]

• Optimal routing problem:

    minimize D(x) = Σ_{(i,j)} D_{ij}( Σ_{all paths p containing (i,j)} x_p )
    subject to Σ_{p∈P_w} x_p = r_w,  ∀ w ∈ W,
               x_p ≥ 0,  ∀ p ∈ P_w, w ∈ W

• Optimality condition:

    x_p* > 0  ⇒  ∂D(x*)/∂x_p ≤ ∂D(x*)/∂x_{p'},  ∀ p' ∈ P_w,

i.e., paths carrying > 0 flow are shortest with respect to first cost derivative.
TRAFFIC ASSIGNMENT
• Transportation network with OD pairs w. Each w has paths p ∈ P_w and traffic r_w. Let x_p be the flow of path p and let

    T_{ij}( F_{ij} ),   F_{ij} = Σ_{p: crossing (i,j)} x_p,

be the travel time of link (i, j).

• User-optimization principle: Traffic equilibrium is established when each user of the network chooses, among all available paths, a path of minimum travel time, i.e., for all w ∈ W and paths p ∈ P_w,

    x_p* > 0  ⇒  t_p(x*) ≤ t_{p'}(x*),  ∀ p' ∈ P_w, w ∈ W,

where t_p(x) is the travel time of path p:

    t_p(x) = Σ_{all arcs (i,j) on path p} T_{ij}(F_{ij}),  ∀ p ∈ P_w, w ∈ W.

• Identical with the optimality condition of the routing problem if we identify the arc travel time T_{ij}(F_{ij}) with the cost derivative D'_{ij}(F_{ij}).
PROJECTION OVER A CONVEX SET
• Let z ∈ ℝⁿ and a closed convex set X be given. Problem:

    minimize f(x) = ‖z − x‖²
    subject to x ∈ X.

• Proposition (Projection Theorem): The problem has a unique solution [z]⁺ (the projection of z).

[Figure: necessary and sufficient condition for x* to be the projection; the angle between z − x* and x − x* should be greater or equal to 90 degrees for all x ∈ X, or (z − x*)'(x − x*) ≤ 0.]

• If X is a subspace, z − x* ⊥ X.
• The mapping f : ℝⁿ → X defined by f(x) = [x]⁺ is continuous and nonexpansive, that is,

    ‖[x]⁺ − [y]⁺‖ ≤ ‖x − y‖,  ∀ x, y ∈ ℝⁿ.
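Both claims are easy to test for the box X = { x | ℓ ≤ x ≤ u }, where the projection is componentwise clipping (a standard special case, used here as an assumed example):

```python
import numpy as np

# Projection onto the box X = {x : l <= x <= u} is componentwise
# clipping; check that the mapping is nonexpansive:
# ||[x]+ - [y]+|| <= ||x - y||.
l, u = np.array([0.0, 0.0]), np.array([1.0, 1.0])   # assumed box
proj = lambda z: np.clip(z, l, u)

rng = np.random.default_rng(1)
ok = True
for _ in range(100):
    x = rng.standard_normal(2) * 3
    y = rng.standard_normal(2) * 3
    ok &= np.linalg.norm(proj(x) - proj(y)) <= np.linalg.norm(x - y) + 1e-12
print(ok)  # nonexpansiveness holds on all random pairs
```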
6.252 NONLINEAR PROGRAMMING
LECTURE 9: FEASIBLE DIRECTION METHODS
LECTURE OUTLINE
Conditional Gradient Method
Gradient Projection Methods
A feasible direction at an x X is a vector d = 0such that x + d is feasible for all suff. small > 0
x1
x2
d
Constraint set X
Feasibledirections at x
x
Note: the set of feasible directions at x is theset of all (z x) where z X, z = x, and > 0
FEASIBLE DIRECTION METHODS
• A feasible direction method:

    x_{k+1} = x_k + α_k d_k,

where d_k is a feasible descent direction [∇f(x_k)'d_k < 0] and α_k > 0 is such that x_{k+1} ∈ X.

• Alternative definition:

    x_{k+1} = x_k + α_k(x̄_k − x_k),

where α_k ∈ (0, 1] and, if x_k is nonstationary,

    x̄_k ∈ X,   ∇f(x_k)'(x̄_k − x_k) < 0.

• Stepsize rules: Limited minimization, Constant α_k = 1, Armijo: α_k = β^{m_k}, where m_k is the first nonnegative m for which

    f(x_k) − f( x_k + β^m(x̄_k − x_k) ) ≥ −σβ^m∇f(x_k)'(x̄_k − x_k).
CONVERGENCE ANALYSIS
• Similar to the one for (unconstrained) gradient methods.
• The direction sequence {d_k} is gradient related to {x_k} if the following property can be shown: For any subsequence {x_k}_{k∈K} that converges to a nonstationary point, the corresponding subsequence {d_k}_{k∈K} is bounded and satisfies

    lim sup_{k→∞, k∈K} ∇f(x_k)'d_k < 0.

• Proposition (Stationarity of Limit Points): Let {x_k} be a sequence generated by the feasible direction method x_{k+1} = x_k + α_k d_k. Assume that:
  − {d_k} is gradient related,
  − α_k is chosen by the limited minimization rule or the Armijo rule.
Then every limit point of {x_k} is a stationary point.

• Proof: Nearly identical to the unconstrained case.
CONDITIONAL GRADIENT METHOD
• x_{k+1} = x_k + α_k(x̄_k − x_k), where

    x̄_k = arg min_{x∈X} ∇f(x_k)'(x − x_k).

• We assume that X is compact, so x̄_k is guaranteed to exist by Weierstrass.

[Figure: direction of the conditional gradient method; x̄ minimizes the linearized cost over the constraint set X.]

[Figure: operation of the method; iterates x₀, x₁, x₂ and linearization minimizers x̄₀, x̄₁ over X, approaching x*. Slow (sublinear) convergence.]
CONVERGENCE OF CONDITIONAL GRADIENT
• Show that the direction sequence of the conditional gradient method is gradient related, so the generic convergence result applies.
• Suppose that {x_k}_{k∈K} converges to a nonstationary point. We must prove that

    {x̄_k − x_k}_{k∈K} : bounded,   lim sup_{k→∞, k∈K} ∇f(x_k)'(x̄_k − x_k) < 0

(involves the transformation y_k = (H_k)^{1/2}x). Since the minimum value above is negative when x_k is nonstationary, ∇f(x_k)'(x̄_k − x_k) < 0.

• Newton's method for H_k = ∇²f(x_k).
• Variants: Projecting on an expanded constraint set, projecting on a restricted constraint set, combinations with unconstrained methods, etc.
6.252 NONLINEAR PROGRAMMING
LECTURE 10
ALTERNATIVES TO GRADIENT PROJECTION
LECTURE OUTLINE
Three Alternatives/Remedies for Gradient Pro-jection
Two-Metric Projection Methods
Manifold Suboptimization Methods
Affine Scaling Methods
Scaled GP method with scaling matrix Hk > 0:

    xk+1 = xk + αk(x̄k − xk),

    x̄k = arg min_{x∈X} { ∇f(xk)'(x − xk) + (1/(2sk))(x − xk)'Hk(x − xk) }

The QP direction subproblem is complicated by:
- Difficult inequality (e.g., nonorthant) constraints
- Nondiagonal Hk, needed for Newton scaling
THREE WAYS TO DEAL W/ THE DIFFICULTY
Two-metric projection methods:

    xk+1 = [xk − αk Dk ∇f(xk)]+

- Use Newton-like scaling but a standard projection
- Suitable for bounds, simplexes, Cartesian products of simple sets, etc.
Manifold suboptimization methods:
- Use (scaled) gradient projection on the manifold of active inequality constraints
- Each QP subproblem is equality-constrained
- Need strategies to cope with the changing active manifold (add/drop constraints)

Affine scaling methods:
- Go through the interior of the feasible set
- Each QP subproblem is equality-constrained, AND we don't have to deal with the changing active manifold
TWO-METRIC PROJECTION METHODS
In their simplest form, these methods apply to the constraint x ≥ 0, but they generalize to bound and other constraints.

Like unconstrained gradient methods except for the projection [·]+:

    xk+1 = [xk − αk Dk ∇f(xk)]+, Dk > 0

Major difficulty: descent is not guaranteed for an arbitrary Dk.

[Figure: (a) a scaled step xk − αk Dk ∇f(xk) whose projection yields descent; (b) one whose projection does not.]

Remedy: use a Dk that is diagonal with respect to the indices that are active and want to stay active:

    I+(xk) = { i | xk_i = 0, ∂f(xk)/∂xi > 0 }
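A single two-metric projection step for x ≥ 0 is simple to write down. The sketch below uses a diagonal D (trivially diagonal with respect to I+(x), so descent is guaranteed); the quadratic objective is a made-up illustration, not from the slides:

```python
def two_metric_step(x, grad, D, alpha):
    """One iteration x+ = [x - alpha * D * grad]^+ for the constraint x >= 0,
    with a diagonal scaling matrix D given as a list of positive entries.
    A diagonal D always yields descent for small alpha, which motivates
    making D diagonal with respect to the active indices I+(x)."""
    return [max(0.0, xi - alpha * di * gi) for xi, di, gi in zip(x, D, grad)]

# min 0.5*(x1+1)^2 + 0.5*(x2-1)^2  s.t.  x >= 0;  solution x* = (0, 1)
grad = lambda x: [x[0] + 1.0, x[1] - 1.0]
x = [2.0, 0.0]
for _ in range(100):
    x = two_metric_step(x, grad(x), D=[1.0, 1.0], alpha=0.5)
```

The first coordinate hits its bound and stays there (it is in I+), while the second converges to the unconstrained minimizer 1.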
PROPERTIES OF 2-METRIC PROJECTION
Suppose Dk is diagonal with respect to I+(xk), i.e., d^k_ij = 0 for i, j ∈ I+(xk) with i ≠ j, and let

    xk(α) = [xk − α Dk ∇f(xk)]+

- If xk is stationary, xk = xk(α) for all α > 0.
- Otherwise f(xk(α)) < f(xk) for all sufficiently small α > 0 (can use the Armijo rule).

Because I+(x) is discontinuous with respect to x, to guarantee convergence we need to include in I+(x) the constraints that are ε-active [those with xk_i ∈ [0, ε] and ∂f(xk)/∂xi > 0].

The constraints ε-active at x̄ eventually become active and don't matter.

The method reduces to an unconstrained Newton-like method on the manifold of constraints active at x̄.

Thus, superlinear convergence is possible with simple projections.
MANIFOLD SUBOPTIMIZATION METHODS
Feasible direction methods for

    min f(x) subject to aj'x ≤ bj, j = 1, . . . , r

The gradient is projected on a linear manifold of active constraints rather than on the entire constraint set (linearly constrained QP).

[Figure: (a), (b) iterate sequences x0, x1, x2, . . . of manifold suboptimization methods moving along faces of the constraint set.]
Searches through a sequence of manifolds, each differing by at most one constraint from the next.

Potentially many iterations are needed to identify the active manifold; then the method reduces to (scaled) steepest descent on the active manifold.

Well-suited for a small number of constraints, and for quadratic programming.
OPERATION OF MANIFOLD METHODS
Let A(x) = { j | aj'x = bj } be the active index set at x. Given xk, we find

    dk = arg min_{aj'd = 0, j∈A(xk)} { ∇f(xk)'d + ½ d'Hk d }

- If dk ≠ 0, then dk is a feasible descent direction. Perform feasible descent on the current manifold.
- If dk = 0, either (1) xk is stationary or (2) we enlarge the current manifold (drop an active constraint). For this, use the scalars μj such that

    ∇f(xk) + Σ_{j∈A(xk)} μj aj = 0
[Figure: geometry of −∇f(xk) = μ1 a1 + μ2 a2 at a point xk on the boundary of X, with μ1 < 0 and μ2 > 0; the constraints a1'x = b1 and a2'x = b2 are active.]

- If μj ≥ 0 for all j, xk is stationary, since for all feasible x, ∇f(xk)'(x − xk) is equal to −Σ_{j∈A(xk)} μj aj'(x − xk) ≥ 0.
- Else, drop a constraint j with μj < 0.
AFFINE SCALING
Particularly interesting choice (affine scaling):

    Hk = (Xk)^{-2},

where Xk is the diagonal matrix having the (positive) coordinates xk_i along the diagonal:

    xk+1 = xk − αk (Xk)²(c − A'λk), λk = (A(Xk)²A')^{-1} A(Xk)² c

Corresponds to the unscaled gradient projection iteration in the variables y = (Xk)^{-1}x. The vector xk is mapped onto the unit vector yk = (1, . . . , 1).

[Figure: the affine scaling iteration shown in the x-space (xk, xk+1, x*) and in the scaled y-space, with yk = (Xk)^{-1}xk = (1,1,1) and y* = (Xk)^{-1}x*.]
Extensions, convergence, practical issues.
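The iteration above can be sketched directly. A minimal NumPy illustration on a tiny LP; the fixed fraction-to-the-boundary stepsize is a common practical choice, not something specified on the slides:

```python
import numpy as np

def affine_scaling_step(x, A, c, step=0.5):
    """One affine-scaling iteration for min c'x s.t. Ax = b, x > 0:
    lam = (A X^2 A')^{-1} A X^2 c, then move along d = -X^2 (c - A' lam),
    going a fixed fraction `step` of the distance to the boundary."""
    X2 = np.diag(x ** 2)
    lam = np.linalg.solve(A @ X2 @ A.T, A @ X2 @ c)
    d = -X2 @ (c - A.T @ lam)          # feasible direction: A d = 0
    neg = d < 0
    t = step * np.min(-x[neg] / d[neg]) if neg.any() else 1.0
    return x + t * d

# min x1  s.t.  x1 + x2 = 1, x >= 0;  solution x* = (0, 1)
A = np.array([[1.0, 1.0]])
c = np.array([1.0, 0.0])
x = np.array([0.5, 0.5])
for _ in range(60):
    x = affine_scaling_step(x, A, c)
```

The iterates stay strictly inside the feasible set while x1 is driven toward its bound, illustrating "going through the interior".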
6.252 NONLINEAR PROGRAMMING
LECTURE 11
CONSTRAINED OPTIMIZATION;
LAGRANGE MULTIPLIERS
LECTURE OUTLINE
Equality Constrained Problems
Basic Lagrange Multiplier Theorem
Proof 1: Elimination Approach
Proof 2: Penalty Approach
Equality constrained problem

    minimize f(x)
    subject to hi(x) = 0, i = 1, . . . , m,

where f : ℝⁿ → ℝ and hi : ℝⁿ → ℝ, i = 1, . . . , m, are continuously differentiable functions. (The theory also applies to the case where f and the hi are continuously differentiable in a neighborhood of a local minimum.)
LAGRANGE MULTIPLIER THEOREM
Let x* be a local min and a regular point [i.e., the ∇hi(x*) are linearly independent]. Then there exist unique scalars λ1*, . . . , λm* such that

    ∇f(x*) + Σ_{i=1}^m λi* ∇hi(x*) = 0.

If in addition f and h are twice continuously differentiable,

    y'( ∇²f(x*) + Σ_{i=1}^m λi* ∇²hi(x*) ) y ≥ 0, for all y s.t. ∇h(x*)'y = 0
[Figure: minimize x1 + x2 subject to x1² + x2² = 2. Here x* = (−1,−1), ∇h(x*) = (−2,−2), ∇f(x*) = (1,1). The Lagrange multiplier is λ* = 1/2.]

[Figure: minimize x1 + x2 subject to (x1 − 1)² + x2² = 1 and (x1 − 2)² + x2² = 4. Here x* = (0,0) with ∇f(x*) = (1,1), ∇h1(x*) = (−2,0), ∇h2(x*) = (−4,0), so x* is not regular.]
PROOF VIA ELIMINATION APPROACH
Consider the linear constraints case

    minimize f(x)
    subject to Ax = b,

where A is an m × n matrix with linearly independent rows and b ∈ ℝᵐ is a given vector.

Partition A = ( B R ), where B is m × m invertible, and correspondingly x = ( xB, xR )'. Equivalent problem:

    minimize F(xR) ≡ f( B^{-1}(b − R xR), xR )
    subject to xR ∈ ℝ^{n−m}.

Unconstrained optimality condition:

    0 = ∇F(xR*) = −R'(B')^{-1} ∇B f(x*) + ∇R f(x*)   (1)

By defining

    λ* = −(B')^{-1} ∇B f(x*),

we have ∇B f(x*) + B'λ* = 0, while Eq. (1) is written ∇R f(x*) + R'λ* = 0. Combining:

    ∇f(x*) + A'λ* = 0
ELIMINATION APPROACH - CONTINUED
Second order condition: For all d ∈ ℝ^{n−m},

    0 ≤ d'∇²F(xR*)d = d'∇²[ f( B^{-1}(b − R xR), xR ) ]d.   (2)

After calculation we obtain

    ∇²F(xR*) = R'(B')^{-1} ∇²BB f(x*) B^{-1}R − R'(B')^{-1} ∇²BR f(x*) − ∇²RB f(x*) B^{-1}R + ∇²RR f(x*)

Eq. (2) and the linearity of the constraints [implying that ∇²hi(x) = 0] yield, for all d ∈ ℝ^{n−m},

    0 ≤ d'∇²F(xR*)d = y'∇²f(x*)y = y'( ∇²f(x*) + Σ_{i=1}^m λi* ∇²hi(x*) ) y,

where y = ( yB, yR ) = ( −B^{-1}Rd, d ). y has this form iff

    0 = ByB + RyR = ∇h(x*)'y.
LAGRANGIAN FUNCTION
Define the Lagrangian function

    L(x, λ) = f(x) + Σ_{i=1}^m λi hi(x).

Then, if x* is a local minimum which is regular, the Lagrange multiplier conditions are written

    ∇x L(x*, λ*) = 0, ∇λ L(x*, λ*) = 0,

a system of n + m equations with n + m unknowns. Also

    y'∇²xx L(x*, λ*) y ≥ 0, for all y s.t. ∇h(x*)'y = 0.

Example:

    minimize ½( x1² + x2² + x3² )
    subject to x1 + x2 + x3 = 3.

Necessary conditions:

    x1* + λ* = 0, x2* + λ* = 0,
    x3* + λ* = 0, x1* + x2* + x3* = 3.
EXAMPLE - PORTFOLIO SELECTION
Investment of 1 unit of wealth among n assets with random rates of return ei, given means ēi, and covariance matrix Q = [ E{(ei − ēi)(ej − ēj)} ].

If xi is the amount invested in asset i, we want to

    minimize x'Qx ( = variance of the return Σi ei xi )
    subject to Σi xi = 1, and a given mean Σi ēi xi = m

Let λ1 and λ2 be the L-multipliers. We have 2Qx* + λ1 u + λ2 ē = 0, where u = (1, . . . , 1)' and ē = (ē1, . . . , ēn)'. This yields

    x* = mv + w, variance of the return = σ² = (αm + β)² + γ,

where v and w are vectors, and α, β, and γ are scalars that depend on Q and ē.

[Figure: for a given m, the optimal (σ, m) pair lies on a line σ = αm + β, called the efficient frontier.]
SUFFICIENCY CONDITIONS
Second Order Sufficiency Conditions: Let x* ∈ ℝⁿ and λ* ∈ ℝᵐ satisfy

    ∇x L(x*, λ*) = 0, ∇λ L(x*, λ*) = 0,
    y'∇²xx L(x*, λ*) y > 0, for all y ≠ 0 with ∇h(x*)'y = 0.

Then x* is a strict local minimum.

Example: Minimize −(x1x2 + x2x3 + x1x3) subject to x1 + x2 + x3 = 3. We have that x1* = x2* = x3* = 1 and λ* = 2 satisfy the 1st order conditions. Also

    ∇²xx L(x*, λ*) = [  0  −1  −1
                       −1   0  −1
                       −1  −1   0 ].

We have for all y ≠ 0 with ∇h(x*)'y = 0, i.e., y1 + y2 + y3 = 0,

    y'∇²xx L(x*, λ*) y = −y1(y2 + y3) − y2(y1 + y3) − y3(y1 + y2) = y1² + y2² + y3² > 0.

Hence, x* is a strict local minimum.
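The second-order check above can be verified numerically by restricting the Hessian to the constraint nullspace. A small NumPy sketch (the nullspace basis Z is one convenient choice, not from the slides):

```python
import numpy as np

# Hessian of the Lagrangian at x* = (1,1,1), lam* = 2 for
# min -(x1 x2 + x2 x3 + x1 x3)  s.t.  x1 + x2 + x3 = 3:
H = np.array([[0., -1., -1.],
              [-1., 0., -1.],
              [-1., -1., 0.]])
# Basis for the nullspace of grad h(x*)' = (1, 1, 1):
Z = np.array([[1., 0.],
              [0., 1.],
              [-1., -1.]])
# The reduced Hessian Z' H Z must be positive definite for the
# second-order sufficiency condition to hold.
reduced = Z.T @ H @ Z
eigvals = np.linalg.eigvalsh(reduced)
```

Here Z'HZ equals Z'Z, reflecting the identity y'∇²L y = y1² + y2² + y3² on the nullspace, and its eigenvalues are positive.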
A BASIC LEMMA
Lemma: Let P and Q be two symmetric matrices. Assume that Q ≥ 0 and P > 0 on the nullspace of Q, i.e., x'Px > 0 for all x ≠ 0 with x'Qx = 0. Then there exists a scalar c̄ such that

    P + cQ : positive definite, for all c > c̄.

Proof: Assume the contrary. Then for every k, there exists a vector xk with ||xk|| = 1 such that

    xk'P xk + k xk'Q xk ≤ 0.

Consider a subsequence {xk}k∈K converging to some x̄ with ||x̄|| = 1. Taking the limit superior,

    x̄'Px̄ + lim sup_{k→∞, k∈K} ( k xk'Q xk ) ≤ 0.   (*)

We have xk'Q xk ≥ 0 (since Q ≥ 0), so {xk'Q xk}k∈K → 0. Therefore, x̄'Qx̄ = 0 and, using the hypothesis, x̄'Px̄ > 0. This contradicts (*).
PROOF OF SUFFICIENCY CONDITIONS
Consider the augmented Lagrangian function

    Lc(x, λ) = f(x) + λ'h(x) + (c/2)||h(x)||²,

where c is a scalar. We have

    ∇x Lc(x, λ) = ∇x L(x, λ̃), ∇²xx Lc(x, λ) = ∇²xx L(x, λ̃) + c∇h(x)∇h(x)',

where λ̃ = λ + c h(x). If (x*, λ*) satisfy the sufficiency conditions, we have, using the lemma,

    ∇x Lc(x*, λ*) = 0, ∇²xx Lc(x*, λ*) > 0,

for sufficiently large c. Hence for some γ > 0 and ε > 0,

    Lc(x, λ*) ≥ Lc(x*, λ*) + (γ/2)||x − x*||², if ||x − x*|| < ε.

Since Lc(x, λ*) = f(x) when h(x) = 0,

    f(x) ≥ f(x*) + (γ/2)||x − x*||², if h(x) = 0, ||x − x*|| < ε.
SENSITIVITY - GRAPHICAL DERIVATION
[Figure: sensitivity theorem for the problem min_{a'x=b} f(x); if b is changed to b + Δb, the minimum x* changes to x* + Δx.]

Since b + Δb = a'(x* + Δx) = a'x* + a'Δx = b + a'Δx, we have a'Δx = Δb. Using the condition ∇f(x*) = −λ*a,

    Δcost = f(x* + Δx) − f(x*) = ∇f(x*)'Δx + o(||Δx||) = −λ* a'Δx + o(||Δx||)

Thus Δcost = −λ* Δb + o(||Δx||), so up to first order

    λ* = − Δcost / Δb.

For multiple constraints ai'x = bi, i = 1, . . . , m, we have

    Δcost = − Σ_{i=1}^m λi* Δbi + o(||Δx||).
EXAMPLE
[Figure: the primal function p(u) with slope ∇p(0) = −λ* = −1 at u = 0.]

Illustration of the primal function p(u) = f(x(u)) for the two-dimensional problem

    minimize f(x) = ½( x1² − x2² ) − x2
    subject to h(x) = x2 = 0.

Here,

    p(u) = min_{h(x)=u} f(x) = −½u² − u

and λ* = −∇p(0) = 1, consistently with the sensitivity theorem.

Need for regularity of x*: Change the constraint to h(x) = x2² = 0. Then p(u) = −u/2 − √u for u ≥ 0, and p is undefined for u < 0.
PROOF OUTLINE OF SENSITIVITY THEOREM
Apply the implicit function theorem to the system

    ∇f(x) + ∇h(x)λ = 0, h(x) = u.

For u = 0 the system has the solution (x*, λ*), and the corresponding (n + m) × (n + m) Jacobian

    J = [ ∇²f(x*) + Σ_{i=1}^m λi* ∇²hi(x*)   ∇h(x*)
          ∇h(x*)'                             0      ]

is shown nonsingular using the sufficiency conditions. Hence, for all u in some open sphere S centered at u = 0, there exist x(u) and λ(u) such that x(0) = x*, λ(0) = λ*, the functions x(·) and λ(·) are continuously differentiable, and

    ∇f( x(u) ) + ∇h( x(u) )λ(u) = 0, h( x(u) ) = u.

For u close to u = 0, using the sufficiency conditions, x(u) and λ(u) are a local minimum-Lagrange multiplier pair for the problem min_{h(x)=u} f(x).

To derive ∇p(u), differentiate h( x(u) ) = u to obtain I = ∇x(u)'∇h( x(u) ), and combine with the relations ∇x(u)'∇f( x(u) ) + ∇x(u)'∇h( x(u) )λ(u) = 0 and

    ∇p(u) = ∇u f( x(u) ) = ∇x(u)'∇f( x(u) ),

yielding ∇p(u) = −λ(u).
6.252 NONLINEAR PROGRAMMING
LECTURE 13: INEQUALITY CONSTRAINTS
LECTURE OUTLINE
Inequality Constrained Problems
Necessary Conditions
Sufficiency Conditions
Linear Constraints
Inequality constrained problem

    minimize f(x)
    subject to h(x) = 0, g(x) ≤ 0,

where f : ℝⁿ → ℝ, h : ℝⁿ → ℝᵐ, g : ℝⁿ → ℝʳ are continuously differentiable. Here

    h = (h1, . . . , hm)', g = (g1, . . . , gr)'.
TREATING INEQUALITIES AS EQUATIONS
Consider the set of active inequality constraints

    A(x) = { j | gj(x) = 0 }.

If x* is a local minimum:
- The active inequality constraints at x* can be treated as equations
- The inactive constraints at x* don't matter

Assuming regularity of x* and assigning zero Lagrange multipliers to inactive constraints,

    ∇f(x*) + Σ_{i=1}^m λi* ∇hi(x*) + Σ_{j=1}^r μj* ∇gj(x*) = 0,
    μj* = 0, for all j ∉ A(x*).

Extra property: μj* ≥ 0 for all j.

Intuitive reason: Relax the jth constraint, gj(x) ≤ uj. Since Δcost ≤ 0 if uj > 0, by the sensitivity theorem we have

    μj* = −(Δcost due to uj)/uj ≥ 0
BASIC RESULTS
Kuhn-Tucker Necessary Conditions: Let x* be a local minimum and a regular point. Then there exist unique Lagrange multiplier vectors λ* = (λ1*, . . . , λm*)' and μ* = (μ1*, . . . , μr*)' such that

    ∇x L(x*, λ*, μ*) = 0,
    μj* ≥ 0, j = 1, . . . , r,
    μj* = 0, for all j ∉ A(x*).

If f, h, and g are twice continuously differentiable,

    y'∇²xx L(x*, λ*, μ*) y ≥ 0, for all y ∈ V(x*),

where

    V(x*) = { y | ∇h(x*)'y = 0, ∇gj(x*)'y = 0, j ∈ A(x*) }.

Similar sufficiency conditions and sensitivity results hold. They require strict complementarity, i.e., μj* > 0 for all j ∈ A(x*), as well as regularity of x*.
PROOF OF KUHN-TUCKER CONDITIONS
Use the equality-constraints result to obtain all the conditions except for μj* ≥ 0 for j ∈ A(x*). Introduce the penalty functions

    gj+(x) = max{ 0, gj(x) }, j = 1, . . . , r,

and for k = 1, 2, . . ., let xk minimize

    f(x) + (k/2)||h(x)||² + (k/2) Σ_{j=1}^r ( gj+(x) )² + ½||x − x*||²

over a closed sphere centered at x* such that f(x*) ≤ f(x) for all feasible x in the sphere. Using the same argument as for equality constraints,

    λi* = lim_{k→∞} k hi(xk), i = 1, . . . , m,
    μj* = lim_{k→∞} k gj+(xk), j = 1, . . . , r.

Since gj+(xk) ≥ 0, we obtain μj* ≥ 0 for all j.
GENERAL SUFFICIENCY CONDITION
Consider the problem

    minimize f(x)
    subject to x ∈ X, gj(x) ≤ 0, j = 1, . . . , r.

Let x* be feasible and satisfy

    μj* ≥ 0, j = 1, . . . , r, μj* = 0 for all j ∉ A(x*),
    x* = arg min_{x∈X} L(x, μ*).

Then x* is a global minimum of the problem.

Proof: We have

    f(x*) = f(x*) + μ*'g(x*) = min_{x∈X} { f(x) + μ*'g(x) }
          ≤ min_{x∈X, g(x)≤0} { f(x) + μ*'g(x) }
          ≤ min_{x∈X, g(x)≤0} f(x),

where the first equality follows from the hypothesis, which implies that μ*'g(x*) = 0, and the last inequality follows from the nonnegativity of μ*. Q.E.D.

Special Case: Let X = ℝⁿ, and let f and the gj be convex and differentiable. Then the 1st order Kuhn-Tucker conditions are also sufficient for global optimality.
LINEAR CONSTRAINTS
Consider the problem min_{aj'x ≤ bj, j=1,...,r} f(x).

Remarkable property: No need for regularity.

Proposition: If x* is a local minimum, there exist μ1*, . . . , μr* with μj* ≥ 0, j = 1, . . . , r, such that

    ∇f(x*) + Σ_{j=1}^r μj* aj = 0, μj* = 0, for all j ∉ A(x*).

The proof uses Farkas' Lemma: Consider the cone C generated by the aj, j ∈ A(x*), and the polar cone C⊥:

    C = { x | x = Σ_{j=1}^r μj aj, μj ≥ 0 }, C⊥ = { y | aj'y ≤ 0, j = 1, . . . , r }.

Then (C⊥)⊥ = C, i.e.,

    x ∈ C iff x'y ≤ 0 for all y ∈ C⊥.
PROOF OF LAGRANGE MULTIPLIER RESULT
[Figure: the constraint set { x | aj'x ≤ bj, j = 1, . . . , r }, the cone C generated by the aj, j ∈ A(x*), and −∇f(x*) lying in that cone at x*.]

The local min x* of the original problem is also a local min for the problem min_{aj'x ≤ bj, j∈A(x*)} f(x). Hence

    ∇f(x*)'(x − x*) ≥ 0, for all x with aj'x ≤ bj, j ∈ A(x*).

Since a constraint aj'x ≤ bj, j ∈ A(x*), can also be expressed as aj'(x − x*) ≤ 0, we have

    ∇f(x*)'y ≥ 0, for all y with aj'y ≤ 0, j ∈ A(x*).

From Farkas' lemma, −∇f(x*) has the form

    Σ_{j∈A(x*)} μj* aj, for some μj* ≥ 0, j ∈ A(x*).

Let μj* = 0 for j ∉ A(x*).
6.252 NONLINEAR PROGRAMMING
LECTURE 14: INTRODUCTION TO DUALITY
LECTURE OUTLINE
Convex Cost/Linear Constraints
Duality Theorem
Linear Programming Duality
Quadratic Programming Duality
Linear inequality constrained problem

    minimize f(x)
    subject to aj'x ≤ bj, j = 1, . . . , r,

where f is convex and continuously differentiable over ℝⁿ.
LAGRANGE MULTIPLIER RESULT
Let J ⊂ {1, . . . , r}. Then x* is a global min if and only if x* is feasible and there exist μj* ≥ 0, j ∈ J, such that μj* = 0 for all j ∈ J with j ∉ A(x*), and

    x* = arg min_{aj'x ≤ bj, j∉J} { f(x) + Σ_{j∈J} μj*(aj'x − bj) }.

Proof: Assume x* is a global min. Then there exist μj* ≥ 0 such that μj*(aj'x* − bj) = 0 for all j and ∇f(x*) + Σ_{j=1}^r μj* aj = 0, implying

    x* = arg min_{x∈ℝⁿ} { f(x) + Σ_{j=1}^r μj*(aj'x − bj) }.

Since μj*(aj'x* − bj) = 0 for all j,

    f(x*) = min_{x∈ℝⁿ} { f(x) + Σ_{j=1}^r μj*(aj'x − bj) }
          ≤ min_{aj'x ≤ bj, j∉J} { f(x) + Σ_{j=1}^r μj*(aj'x − bj) }.

Since μj*(aj'x − bj) ≤ 0 when aj'x − bj ≤ 0,

    f(x*) ≤ min_{aj'x ≤ bj, j∉J} { f(x) + Σ_{j∈J} μj*(aj'x − bj) } ≤ f(x*).
PROOF (CONTINUED)
Conversely, if x* is feasible and there exist scalars μj*, j ∈ J, with the stated properties, then x* is a global min by the general sufficiency condition of the preceding lecture (where X is taken to be the set of x such that aj'x ≤ bj for all j ∉ J). Q.E.D.

Interesting observation: The same set of μj* works for all index sets J.

The flexibility to split the set of constraints into those that are handled by Lagrange multipliers (set J) and those that are handled explicitly comes handy in many analytical and computational contexts.
THE DUAL PROBLEM
Consider the problem

    min_{x∈X, aj'x ≤ bj, j=1,...,r} f(x)

where f is convex and continuously differentiable over ℝⁿ, and X is polyhedral.

Define the dual function q : ℝʳ → [−∞, ∞),

    q(μ) = inf_{x∈X} L(x, μ) = inf_{x∈X} { f(x) + Σ_{j=1}^r μj(aj'x − bj) },

and the dual problem

    max_{μ≥0} q(μ).

If X is bounded, the dual function takes real values. In general, q(μ) can take the value −∞. The effective constraint set of the dual is

    Q = { μ | μ ≥ 0, q(μ) > −∞ }.
DUALITY THEOREM
(a) If the primal problem has an optimal solution, the dual problem also has an optimal solution and the optimal values are equal.

(b) x* is primal-optimal and μ* is dual-optimal if and only if x* is primal-feasible, μ* ≥ 0, and

    f(x*) = L(x*, μ*) = min_{x∈X} L(x, μ*).

Proof: (a) Let x* be a primal optimal solution. For all primal feasible x, and all μ ≥ 0, we have μj(aj'x − bj) ≤ 0 for all j, so

    q(μ) ≤ inf_{x∈X, aj'x≤bj, j=1,...,r} { f(x) + Σ_{j=1}^r μj(aj'x − bj) }
         ≤ inf_{x∈X, aj'x≤bj, j=1,...,r} f(x) = f(x*).   (*)

By the L-multiplier theorem, there exists μ* ≥ 0 such that μj*(aj'x* − bj) = 0 for all j, and x* = arg min_{x∈X} L(x, μ*), so

    q(μ*) = L(x*, μ*) = f(x*) + Σ_{j=1}^r μj*(aj'x* − bj) = f(x*).
PROOF (CONTINUED)
(b) If x* is primal-optimal and μ* is dual-optimal, by part (a)

    f(x*) = q(μ*),

which when combined with Eq. (*), yields

    f(x*) = L(x*, μ*) = q(μ*) = min_{x∈X} L(x, μ*).

Conversely, the relation f(x*) = min_{x∈X} L(x, μ*) is written as f(x*) = q(μ*), and since x* is primal-feasible and μ* ≥ 0, Eq. (*) implies that x* is primal-optimal and μ* is dual-optimal. Q.E.D.

Linear equality constraints are treated similarly to inequality constraints, except that the sign of the Lagrange multipliers is unrestricted:

    Primal: min_{x∈X, ei'x=di, i=1,...,m, aj'x≤bj, j=1,...,r} f(x)

    Dual: max_{λ∈ℝᵐ, μ≥0} q(λ, μ) = max_{λ∈ℝᵐ, μ≥0} inf_{x∈X} L(x, λ, μ).
THE DUAL OF A LINEAR PROGRAM
Consider the linear program

    minimize c'x
    subject to ei'x = di, i = 1, . . . , m, x ≥ 0

Dual function:

    q(λ) = inf_{x≥0} { Σ_{j=1}^n ( cj − Σ_{i=1}^m λi e_ij ) xj + Σ_{i=1}^m λi di }.

If cj − Σ_{i=1}^m λi e_ij ≥ 0 for all j, the infimum is attained for x = 0, and q(λ) = Σ_{i=1}^m λi di. If cj − Σ_{i=1}^m λi e_ij < 0 for some j, the expression in braces can be made arbitrarily small by taking xj sufficiently large, so q(λ) = −∞. Thus, the dual is

    maximize Σ_{i=1}^m λi di
    subject to Σ_{i=1}^m λi e_ij ≤ cj, j = 1, . . . , n.
THE DUAL OF A QUADRATIC PROGRAM
Consider the quadratic program

    minimize ½x'Qx + c'x
    subject to Ax ≤ b,

where Q is a given n × n positive definite symmetric matrix, A is a given r × n matrix, and b ∈ ℝʳ and c ∈ ℝⁿ are given vectors.

Dual function:

    q(μ) = inf_{x∈ℝⁿ} { ½x'Qx + c'x + μ'(Ax − b) }.

The infimum is attained for x = −Q^{-1}(c + A'μ), and, after substitution and calculation,

    q(μ) = −½μ'AQ^{-1}A'μ − μ'(b + AQ^{-1}c) − ½c'Q^{-1}c.

The dual problem, after a sign change, is

    minimize ½μ'Pμ + t'μ
    subject to μ ≥ 0,

where P = AQ^{-1}A' and t = b + AQ^{-1}c.
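The dual data P and t, and the primal recovery x = −Q^{-1}(c + A'μ), can be checked on a tiny instance with a single constraint, where the sign-constrained dual has a closed-form solution. The specific Q, c, A, b below are made-up illustrations:

```python
import numpy as np

# Primal QP:  min 0.5 x'Qx + c'x  s.t.  Ax <= b
Q = np.array([[2.0, 0.0], [0.0, 2.0]])
c = np.array([-2.0, -2.0])
A = np.array([[1.0, 0.0]])
b = np.array([0.0])            # constraint x1 <= 0, active at the solution

Qinv = np.linalg.inv(Q)
P = A @ Qinv @ A.T             # dual Hessian
t = b + A @ Qinv @ c           # dual linear term

# Dual (after sign change): min 0.5 mu'P mu + t'mu  s.t.  mu >= 0.
# With a single constraint the minimizer is mu = max(0, -t/P):
mu = np.maximum(0.0, -np.linalg.solve(P, t))
x = -Qinv @ (c + A.T @ mu)     # primal recovery x = -Q^{-1}(c + A'mu)
```

The unconstrained minimum of the primal is (1, 1); imposing x1 ≤ 0 moves it to x* = (0, 1) with multiplier μ* = 2, which is exactly what the dual recovers.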
6.252 NONLINEAR PROGRAMMING
LECTURE 15: INTERIOR POINT METHODS
LECTURE OUTLINE
Barrier and Interior Point Methods
Linear Programs and the Logarithmic Barrier
Path Following Using Newton's Method
Inequality constrained problem

    minimize f(x)
    subject to x ∈ X, gj(x) ≤ 0, j = 1, . . . , r,

where f and the gj are continuous and X is closed. We assume that the set

    S = { x ∈ X | gj(x) < 0, j = 1, . . . , r }

is nonempty and any feasible point is in the closure of S.
BARRIER METHOD
Consider a barrier function, that is, a function that is continuous and goes to ∞ as any one of the constraints gj(x) approaches 0 from negative values. Examples:

    B(x) = −Σ_{j=1}^r ln( −gj(x) ), B(x) = −Σ_{j=1}^r 1/gj(x).

Barrier Method:

    xk = arg min_{x∈S} { f(x) + εk B(x) }, k = 0, 1, . . . ,

where the parameter sequence {εk} satisfies 0 < εk+1 < εk for all k and εk → 0.

For the linear program min{ c'x | Ax = b, x ≥ 0 } with the logarithmic barrier: for ε > 0,

    x(ε) = arg min_{x∈S} Fε(x) = arg min_{x∈S} { c'x − ε Σ_{i=1}^n ln xi },

where S = { x | Ax = b, x > 0 }. We assume that S is nonempty and bounded.
As ε → 0, x(ε) follows the central path.

[Figure: the point x(ε) on the central path inside S. All central paths start at the analytic center

    x∞ = arg min_{x∈S} ( −Σ_{i=1}^n ln xi ),

and end at optimal solutions of (LP) (x* corresponds to ε = 0).]
PATH FOLLOWING W/ NEWTONS METHOD
Newton's method for minimizing Fε:

    x̄ = x + α(x̃ − x),

where x̃ is the pure Newton iterate

    x̃ = arg min_{Az=b} { ∇Fε(x)'(z − x) + ½(z − x)'∇²Fε(x)(z − x) }.

By straightforward calculation,

    x̃ = x − X q(x, ε),
    q(x, ε) = Xz/ε − e, e = (1 . . . 1)', z = c − A'λ,
    λ = (AX²A')^{-1} AX( Xc/ε − e ),

and X is the diagonal matrix with xi, i = 1, . . . , n, along the diagonal.

View q(x, ε) as the Newton increment (x̃ − x) transformed by X^{-1}, which maps x into e.

Consider ||q(x, ε)|| as a proximity measure of the current point to the point x(ε) on the central path.
KEY RESULTS
It is sufficient to minimize Fε approximately, up to where ||q(x, ε)|| < 1.

[Figure: the central path in S, the points x(ε0), x(ε1), x(ε2), the iterates x0, x1, x2, and the set { x | ||q(x, ε0)|| < 1 }.]

If x > 0, Ax = b, and ||q(x, ε)|| < 1, then

    c'x − min_{Ay=b, y≥0} c'y ≤ ε( n + √n ).

The termination set { x | ||q(x, ε)|| < 1 } is part of the region of quadratic convergence of the pure form of Newton's method.

Let x > 0, Ax = b, and suppose that for some δ < 1 we have ||q(x, ε)|| ≤ δ. Then if ε̄ = (1 − θn^{-1/2})ε for some θ > 0,

    ||q(x, ε̄)|| ≤ (δ² + θ) / (1 − θn^{-1/2}).

In particular, if θ ≤ δ(1 − δ)(1 + δ)^{-1}, we have ||q(x, ε̄)|| ≤ δ.

This can be used to establish nice complexity results; but ε must be reduced VERY slowly.
LONG STEP METHODS
Main features:
- Decrease ε faster than dictated by complexity analysis.
- Require more than one Newton step per (approximate) minimization.
- Use a line search as in unconstrained Newton's method.
- Require a much smaller number of (approximate) minimizations.

[Figure: (a) short steps tracking the central path closely through x(εk), x(εk+1), x(εk+2); (b) long steps skipping across the path toward x*.]

The methodology generalizes to quadratic programming and convex programming.
6.252 NONLINEAR PROGRAMMING
LECTURE 16: PENALTY METHODS
LECTURE OUTLINE
Quadratic Penalty Methods
Introduction to Multiplier Methods
*******************************************
Consider the equality constrained problem

    minimize f(x)
    subject to x ∈ X, h(x) = 0,

where f : ℝⁿ → ℝ and h : ℝⁿ → ℝᵐ are continuous, and X is closed.

The quadratic penalty method:

    xk = arg min_{x∈X} Lck(x, λk) ≡ f(x) + λk'h(x) + (ck/2)||h(x)||²,

where {λk} is a bounded sequence and {ck} satisfies 0 < ck < ck+1 for all k and ck → ∞.
TWO CONVERGENCE MECHANISMS
Taking λk close to a Lagrange multiplier vector:
- Assume X = ℝⁿ and (x*, λ*) is a local min-Lagrange multiplier pair satisfying the 2nd order sufficiency conditions
- For c sufficiently large, x* is a strict local min of Lc(·, λ*)

Taking ck very large:
- For large c and any λ,

    Lc(·, λ) ≈ f(·) if x ∈ X and h(x) = 0, ≈ ∞ otherwise

Example:

    minimize f(x) = ½( x1² + x2² )
    subject to x1 = 1

    Lc(x, λ) = ½( x1² + x2² ) + λ(x1 − 1) + (c/2)(x1 − 1)²

    x1(λ, c) = (c − λ)/(c + 1), x2(λ, c) = 0
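The closed-form minimizer x1(λ, c) = (c − λ)/(c + 1) makes the two convergence mechanisms concrete: with λ fixed, feasibility is approached only as c → ∞, while with λ set to the multiplier it holds for every c. A quick check:

```python
# Quadratic penalty for:  min 0.5*(x1^2 + x2^2)  s.t.  x1 = 1.
# Minimizing L_c(x, lam) = 0.5*(x1^2 + x2^2) + lam*(x1 - 1) + 0.5*c*(x1 - 1)^2
# in closed form gives x1(lam, c) = (c - lam)/(c + 1), x2 = 0.
def x1_of(lam, c):
    return (c - lam) / (c + 1.0)

# With lam fixed at 0, x1 -> 1 only as c -> infinity:
approx = [x1_of(0.0, c) for c in (1.0, 10.0, 100.0)]
# With lam at the Lagrange multiplier lam* = -1, x1 = 1 for every c:
exact = [x1_of(-1.0, c) for c in (1.0, 10.0, 100.0)]
```

The first list reproduces the values 1/2, 10/11, 100/101 seen in the plots that follow.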
EXAMPLE CONTINUED
min_{x1=1} ½( x1² + x2² ), x* = (1, 0), λ* = −1

[Figure: level sets of Lc(·, λ) and the minimizer x1(λ, c) for four cases: c = 1, λ = 0 gives x1 = 1/2; c = 1, λ = −1/2 gives x1 = 3/4; c = 1, λ = 0 gives x1 = 1/2; c = 10, λ = 0 gives x1 = 10/11.]
GLOBAL CONVERGENCE
Every limit point of {xk} is a global min.

Proof: The optimal value of the problem is f* = inf_{h(x)=0, x∈X} f(x). We have

    Lck(xk, λk) ≤ Lck(x, λk), for all x ∈ X,

so taking the inf of the RHS over x ∈ X, h(x) = 0,

    Lck(xk, λk) = f(xk) + λk'h(xk) + (ck/2)||h(xk)||² ≤ f*.

Let (x̄, λ̄) be a limit point of {xk, λk}. Without loss of generality, assume that {xk, λk} → (x̄, λ̄). Taking the limsup above,

    f(x̄) + λ̄'h(x̄) + lim sup_{k→∞} (ck/2)||h(xk)||² ≤ f*.   (*)

Since ||h(xk)||² ≥ 0 and ck → ∞, it follows that h(xk) → 0 and h(x̄) = 0. Hence, x̄ is feasible, and since from Eq. (*) we have f(x̄) ≤ f*, x̄ is optimal. Q.E.D.
LAGRANGE MULTIPLIER ESTIMATES
Assume that X = ℝⁿ, and that f and h are continuously differentiable. Let {λk} be bounded, and let ck → ∞. Assume xk satisfies ∇x Lck(xk, λk) = 0 for all k, and that xk → x̄, where x̄ is such that ∇h(x̄) has rank m. Then h(x̄) = 0 and λ̃k → λ̃, where

    λ̃k = λk + ck h(xk), ∇x L(x̄, λ̃) = 0.

Proof: We have

    0 = ∇x Lck(xk, λk) = ∇f(xk) + ∇h(xk)( λk + ck h(xk) ) = ∇f(xk) + ∇h(xk)λ̃k.

Multiply with

    ( ∇h(xk)'∇h(xk) )^{-1} ∇h(xk)'

and take the limit to obtain λ̃k → λ̃ with

    λ̃ = −( ∇h(x̄)'∇h(x̄) )^{-1} ∇h(x̄)'∇f(x̄).

We also have ∇x L(x̄, λ̃) = 0 and h(x̄) = 0 (since λ̃k converges).
PRACTICAL BEHAVIOR
Three possibilities:
- The method breaks down because an xk with ∇x Lck(xk, λk) ≈ 0 cannot be found.
- A sequence {xk} with ∇x Lck(xk, λk) ≈ 0 is obtained, but it either has no limit points, or for each of its limit points x̄ the matrix ∇h(x̄) has rank < m.
- A sequence {xk} with ∇x Lck(xk, λk) ≈ 0 is found and it has a limit point x̄ such that ∇h(x̄) has rank m. Then, x̄ together with λ̄ [the corresponding limit point of { λk + ck h(xk) }] satisfies the first-order necessary conditions.

Ill-conditioning: The condition number of the Hessian ∇²xx Lck(xk, λk) tends to increase with ck.

To overcome ill-conditioning:
- Use a Newton-like method (and double precision).
- Use good starting points.
- Increase ck at a moderate rate (if ck is increased at a fast rate, {xk} converges faster, but the likelihood of ill-conditioning is greater).
INEQUALITY CONSTRAINTS
Convert them to equality constraints by using squared slack variables that are eliminated later: convert the inequality constraint gj(x) ≤ 0 to the equality constraint gj(x) + zj² = 0.

The penalty method solves problems of the form

    min_{x,z} L̄c(x, z, μ) = f(x) + Σ_{j=1}^r { μj( gj(x) + zj² ) + (c/2)| gj(x) + zj² |² },

for various values of μ and c.

First minimize L̄c(x, z, μ) with respect to z,

    L̄c(x, μ) = min_z L̄c(x, z, μ) = f(x) + Σ_{j=1}^r min_{zj} { μj( gj(x) + zj² ) + (c/2)| gj(x) + zj² |² },

and then minimize L̄c(x, μ) with respect to x.
MULTIPLIER METHODS
Recall that if (x*, λ*) is a local min-Lagrange multiplier pair satisfying the 2nd order sufficiency conditions, then for c sufficiently large, x* is a strict local min of Lc(·, λ*). This suggests that for λk ≈ λ*, xk ≈ x*.

Hence it is a good idea to use λk ≈ λ*, such as

    λk+1 = λ̃k = λk + ck h(xk)

This is the (1st order) method of multipliers.

Key advantages to be shown:
- Less ill-conditioning: It is not necessary that ck → ∞ (only that ck exceeds some threshold).
- Faster convergence when λk is updated than when λk is kept constant (whether ck → ∞ or not).
6.252 NONLINEAR PROGRAMMING
LECTURE 17: AUGMENTED LAGRANGIAN METHODS
LECTURE OUTLINE
Multiplier Methods
*******************************************

Consider the equality constrained problem

    minimize f(x)
    subject to h(x) = 0,

where f : ℝⁿ → ℝ and h : ℝⁿ → ℝᵐ are continuously differentiable.

The (1st order) multiplier method finds

    xk = arg min_{x∈ℝⁿ} Lck(x, λk) ≡ f(x) + λk'h(x) + (ck/2)||h(x)||²

and updates λk using

    λk+1 = λk + ck h(xk)
CONVEX EXAMPLE
Problem: min_{x1=1} ½( x1² + x2² ) with optimal solution x* = (1, 0) and Lagrange multiplier λ* = −1.

We have

    xk = arg min_{x∈ℝⁿ} Lck(x, λk) = ( (ck − λk)/(ck + 1), 0 )

    λk+1 = λk + ck( (ck − λk)/(ck + 1) − 1 )

    λk+1 − λ* = (λk − λ*)/(ck + 1)

We see that:
- λk → λ* = −1 and xk → x* = (1, 0) for every nondecreasing sequence {ck}. It is NOT necessary to increase ck to ∞.
- The convergence rate becomes faster as ck becomes larger; in fact { |λk − λ*| } converges superlinearly if ck → ∞.
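The contraction λk+1 − λ* = (λk − λ*)/(ck + 1) can be observed by running the method with a fixed penalty parameter, confirming that ck → ∞ is not needed. A minimal sketch of this example:

```python
# First-order method of multipliers for:  min 0.5*(x1^2 + x2^2)  s.t.  x1 = 1.
# The augmented-Lagrangian minimizer is x1 = (c - lam)/(c + 1), x2 = 0,
# and the update lam <- lam + c*(x1 - 1) contracts lam toward lam* = -1
# by the factor 1/(c + 1), so c need not be driven to infinity.
lam, c = 0.0, 10.0            # keep the penalty parameter fixed
errs = []
for k in range(12):
    x1 = (c - lam) / (c + 1.0)
    lam = lam + c * (x1 - 1.0)
    errs.append(abs(lam - (-1.0)))
```

With c = 10 the multiplier error shrinks by a factor of 11 per iteration, a fast linear rate at a modest, constant penalty.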
NONCONVEX EXAMPLE
Problem: min_{x1=1} ½( −x1² + x2² ) with optimal solution x* = (1, 0) and Lagrange multiplier λ* = 1.

We have

    xk = arg min_{x∈ℝⁿ} Lck(x, λk) = ( (ck − λk)/(ck − 1), 0 ),

provided ck > 1 (otherwise the min does not exist), and

    λk+1 = λk + ck( (ck − λk)/(ck − 1) − 1 )

    λk+1 − λ* = −(λk − λ*)/(ck − 1)

We see that:
- No need to increase ck to ∞ for convergence; doing so results in a faster convergence rate.
- To obtain convergence, ck must eventually exceed the threshold 2.
THE PRIMAL FUNCTIONAL
Let (x*, λ*) be a regular local min-Lagrange multiplier pair satisfying the 2nd order sufficiency conditions.

The primal functional

    p(u) = min_{h(x)=u} f(x)

is defined for u in an open sphere centered at u = 0, and we have

    p(0) = f(x*), ∇p(0) = −λ*.

[Figure: (a) p(u) = min_{x1−1=u} ½(x1² + x2²) = ½(u + 1)², with p(0) = f(x*) = ½; (b) p(u) = min_{x1−1=u} ½(−x1² + x2²) = −½(u + 1)², with p(0) = f(x*) = −½.]
AUGM. LAGRANGIAN MINIMIZATION
Break down the minimization of Lc(·, λ):

    min_x Lc(x, λ) = min_u min_{h(x)=u} { f(x) + λ'h(x) + (c/2)||h(x)||² }
                   = min_u { p(u) + λ'u + (c/2)||u||² },

where the minimization above is understood to be local in a neighborhood of u = 0.

Interpretation of this minimization:

[Figure: the primal function p(u), with slope −λ* at u = 0 and value p(0) = f(x*); the penalized primal function p(u) + (c/2)||u||²; and the point u(λ, c) attaining min_x Lc(x, λ), where the supporting line has slope −λ and intercept min_x Lc(x, λ) − λ'u(λ, c).]

If c is sufficiently large, p(u) + λ'u + (c/2)||u||² is convex in a neighborhood of 0. Also, for λ ≈ λ* and large c, the value min_x Lc(x, λ) ≈ p(0) = f(x*).
INTERPRETATION OF THE METHOD
Geometric interpretation of the iteration

    λk+1 = λk + ck h(xk).

[Figure: the penalized primal function p(u) + (c/2)||u||² supported at uk with slope −λk+1 = ∇p(uk); the successive slopes −λk, −λk+1, −λk+2 approach −λ*, and the values min_x Lck(x, λk), min_x Lck+1(x, λk+1) approach p(0) = f(x*).]

If λk is sufficiently close to λ* and/or ck is sufficiently large, λk+1 will be closer to λ* than λk.

ck need not be increased to ∞ in order to obtain convergence; it is sufficient that ck eventually exceeds some threshold level.

If p(u) is linear, convergence to λ* will be achieved in one iteration.
COMPUTATIONAL ASPECTS
Key issue is how to select {ck}:
- ck should eventually become larger than the threshold of the given problem.
- c0 should not be so large as to cause ill-conditioning at the 1st minimization.
- ck should not be increased so fast that too much ill-conditioning is forced upon the unconstrained minimization too early.
- ck should not be increased so slowly that the multiplier iteration has a poor convergence rate.

A good practical scheme is to choose a moderate value c0, and use ck+1 = βck, where β is a scalar with β > 1 (typically β ∈ [5, 10] if a Newton-like method is used).

In practice the minimization of Lck(x, λk) is typically inexact (usually exact asymptotically). In some variants of the method, only one Newton step per minimization is used (with safeguards).
DUALITY FRAMEWORK
Consider the problem

    minimize f(x) + (c/2)||h(x)||²
    subject to ||x − x*|| < ε, h(x) = 0,

where ε is small enough for a local analysis to hold based on the implicit function theorem, and c is large enough for the minimum to exist.

Consider the dual function and its gradient

    qc(λ) = min_{||x−x*||<ε} Lc(x, λ)
6.252 NONLINEAR PROGRAMMING
LECTURE 18: DUALITY THEORY
LECTURE OUTLINE
Geometrical Framework for Duality
Geometric Multipliers
The Dual Problem
Properties of the Dual Function
Consider the problem

    minimize f(x)
    subject to x ∈ X, gj(x) ≤ 0, j = 1, . . . , r.

We assume that the problem is feasible and the cost is bounded from below:

    −∞ < f* = inf_{x∈X, gj(x)≤0, j=1,...,r} f(x) < ∞
MIN COMMON POINT/MAX CROSSING POINT
Let M be a subset of ℝⁿ:
- Min Common Point Problem: Among all points that are common to both M and the nth axis, find the one whose nth component is minimum.
- Max Crossing Point Problem: Among all hyperplanes that intersect the nth axis and support the set M from below, find the hyperplane for which the point of intercept with the nth axis is maximum.

[Figure: two examples of a set M with its min common point w* and max crossing point q*.]

Note: We will see that the min common/max crossing framework applies to the problem of the preceding slide, i.e., min_{x∈X, g(x)≤0} f(x), with the choice

    M = { (z, f(x)) | g(x) ≤ z, x ∈ X }
GEOMETRICAL DEFINITION OF A MULTIPLIER
A vector μ* = (μ1*, . . . , μr*) is said to be a geometric multiplier for the primal problem if

    μj* ≥ 0, j = 1, . . . , r,

and

    f* = inf_{x∈X} L(x, μ*).

[Figure: the set S = { (g(x), f(x)) | x ∈ X }, the hyperplane H = { (z, w) | f* = w + Σj μj* zj } with normal (μ*, 1) passing through (0, f*), and the pairs (z, w) corresponding to x that minimize L(x, μ*) over X.]

Note that the definition differs from the one given in Chapter 3 ... but if X = ℝⁿ, f and the gj are convex and differentiable, and x* is an optimal solution, the Lagrange multipliers corresponding to x*, as per the definition of Chapter 3, coincide with the geometric multipliers as per the above definition.
EXAMPLES: A G-MULTIPLIER EXISTS
[Figure: three examples where a geometric multiplier exists, shown via S = { (g(x), f(x)) | x ∈ X } and the supporting hyperplane with normal (μ*, 1):
(a) min f(x) = x1 − x2 s.t. g(x) = x1 + x2 − 1 ≤ 0, x ∈ X = { (x1, x2) | x1 ≥ 0, x2 ≥ 0 };
(b) min f(x) = ½(x1² + x2²) s.t. g(x) = x1 − 1 ≤ 0, x ∈ X = ℝ²;
(c) min f(x) = |x1| + x2 s.t. g(x) = x1 ≤ 0, x ∈ X = { (x1, x2) | x2 ≥ 0 }.]
EXAMPLES: A G-MULTIPLIER DOESNT EXIST
[Figure: two examples in which no geometric multiplier exists; in each, no hyperplane through (0, f*) = (0, 0) supports S from below.]

(a) min f(x) = x
    s.t. g(x) = x² ≤ 0
    x ∈ X = ℜ
    S = {(x², x) | x ∈ ℜ}

(b) min f(x) = −x
    s.t. g(x) = x − 1/2 ≤ 0
    x ∈ X = {0, 1}
    S = {(−1/2, 0), (1/2, −1)}
Proposition: Let μ* be a geometric multiplier. Then x* is a global minimum of the primal problem if and only if x* is feasible and

x* = arg min_{x ∈ X} L(x, μ*),   μ*_j g_j(x*) = 0, j = 1, . . . , r
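The conditions of the proposition are easy to check mechanically. A minimal sketch for example (b) of the previous slide, where the unconstrained minimum is already feasible so μ* = 0 and x* = (0, 0); the grid is an illustrative finite sample of X, not from the slides:

```python
def f(x1, x2):
    return 0.5 * (x1 * x1 + x2 * x2)

def g(x1, x2):
    return x1 - 1.0

mu_star, x_star = 0.0, (0.0, 0.0)

# 1) x* is feasible
assert g(*x_star) <= 0.0

# 2) x* minimizes L(x, mu*) = f(x) over X = R^2
#    (f is nonnegative everywhere and f(x*) = 0)
grid = [i / 10.0 for i in range(-30, 31)]
assert all(f(a, b) >= f(*x_star) for a in grid for b in grid)

# 3) complementary slackness: mu*_j g_j(x*) = 0 for each j
assert mu_star * g(*x_star) == 0.0

print("x* satisfies the conditions of the proposition")
```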
THE DUAL FUNCTION AND THE DUAL PROBLEM
The dual problem is

maximize q(μ)
subject to μ ≥ 0,

where q is the dual function

q(μ) = inf_{x ∈ X} L(x, μ),   μ ∈ ℜʳ.

Question: How does the optimal dual value q* = sup_{μ ≥ 0} q(μ) relate to f*?
[Figure: the set S = {(g(x), f(x)) | x ∈ X} and, for a given μ, a supporting hyperplane H = {(z, w) | w + μ'z = b} with normal (μ, 1); its intercept with the vertical axis is q(μ) = inf_{x ∈ X} L(x, μ), the support points correspond to minimizers of L(x, μ) over X, and the optimal dual value is the highest such intercept.]
WEAK DUALITY
The domain of q is

D_q = { μ | q(μ) > −∞ }.

Proposition: The domain D_q is a convex set and q is concave over D_q.

Proposition: (Weak Duality Theorem) We have

q* ≤ f*.
Proof: For all μ ≥ 0, and x ∈ X with g(x) ≤ 0, we have

q(μ) = inf_{z ∈ X} L(z, μ) ≤ f(x) + Σ_{j=1}^r μ_j g_j(x) ≤ f(x),

so

q* = sup_{μ ≥ 0} q(μ) ≤ inf_{x ∈ X, g(x) ≤ 0} f(x) = f*.
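Weak duality can be observed numerically on the earlier example min x s.t. x² ≤ 0 over X = ℜ, where the only feasible point is x = 0, so f* = 0. A minimal sketch; the grid bounds are an illustrative choice:

```python
f_star = 0.0

def q(mu):
    # Approximate q(mu) = inf_{x in R} L(x, mu) = inf_x (x + mu * x**2)
    # by minimizing over a finite grid on [-100, 100].
    xs = (i / 100.0 for i in range(-10000, 10001))
    return min(x + mu * x * x for x in xs)

for mu in (0.01, 0.1, 1.0, 10.0):
    assert q(mu) <= f_star    # weak duality: q(mu) <= f* for every mu >= 0

print("weak duality holds on all sampled mu")
```

As μ grows, q(μ) approaches f* = 0 from below but never reaches it, previewing the duality gap discussion below.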
DUAL OPTIMAL SOLUTIONS AND G-MULTIPLIER
Proposition: (a) If q* = f*, the set of geometric multipliers is equal to the set of optimal dual solutions. (b) If q* < f*, the set of geometric multipliers is empty.

Proof: By definition, a vector μ* ≥ 0 is a geometric multiplier if and only if f* = q(μ*) ≤ q*, which by the weak duality theorem holds if and only if there is no duality gap and μ* is a dual optimal solution. Q.E.D.
[Figure: the dual functions q(μ) of the two examples from the slide "A G-multiplier doesn't exist."]

(a) min f(x) = x
    s.t. g(x) = x² ≤ 0
    x ∈ X = ℜ,  f* = 0

    q(μ) = min_{x ∈ ℜ} { x + μx² } = { −1/(4μ) if μ > 0,  −∞ if μ ≤ 0 }

(b) min f(x) = −x
    s.t. g(x) = x − 1/2 ≤ 0
    x ∈ X = {0, 1},  f* = 0

    q(μ) = min_{x ∈ {0,1}} { −x + μ(x − 1/2) } = min{ −μ/2, μ/2 − 1 },  with q* = −1/2.
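Both closed-form dual functions above can be checked numerically. A minimal sketch; the grid bounds and step in q_a are illustrative choices:

```python
def q_a(mu):
    """Approximate q(mu) = inf_{x in R} (x + mu*x**2), example (a), mu > 0."""
    return min((i / 1000.0) + mu * (i / 1000.0) ** 2
               for i in range(-100000, 100001))      # x in [-100, 100]

def q_b(mu):
    """q(mu) = min over x in {0, 1} of (-x + mu*(x - 1/2)), example (b)."""
    return min(-x + mu * (x - 0.5) for x in (0.0, 1.0))

# (a) matches the closed form -1/(4*mu) for mu > 0
for mu in (0.5, 1.0, 2.0):
    assert abs(q_a(mu) + 1.0 / (4.0 * mu)) < 1e-6

# (b) matches min(-mu/2, mu/2 - 1); its maximum over mu >= 0 is q* = -1/2
# at mu = 1, while f* = 0, so example (b) has a duality gap of 1/2
for mu in (0.0, 0.5, 1.0, 2.0):
    assert q_b(mu) == min(-mu / 2.0, mu / 2.0 - 1.0)

print("both closed forms confirmed on sampled mu")
```

In (a) the supremum q* = 0 equals f* but is not attained; in (b) q* = −1/2 < f* = 0. In both cases the proposition correctly predicts that no geometric multiplier exists.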
6.252 NONLINEAR PROGRAMMING
LECTURE 19: DUALITY THEOREMS
LECTURE OUTLINE
Duality and G-multipliers (continued)
Consider the problem

minimize f(x)
subject to x ∈ X, g_j(x) ≤ 0, j = 1, . . . , r,

assuming −∞ < f* < ∞.

μ* is a geometric multiplier if μ* ≥ 0 and f* = inf_{x ∈ X} L(x, μ*).

The dual problem is

maximize q(μ)
subject to μ ≥ 0,

where q is the dual function q(μ) = inf_{x ∈ X} L(x, μ).
DUAL OPTIMALITY
[Figure: as on the earlier slide, the set S = {(g(x), f(x)) | x ∈ X} with a supporting hyperplane H = {(z, w) | w + μ'z = b} of normal (μ, 1); q(μ) = inf_{x ∈ X} L(x, μ) is the vertical intercept, the support points correspond to minimizers of L(x, μ) over X, and the optimal dual value is the highest intercept.]
Weak Duality Theorem: q* ≤ f*.
Geometric Multipliers and Dual Optimal Solutions:

(a) If there is no duality gap, the set of geometric multipliers is equal to the set of optimal dual solutions.

(b) If there is a duality gap, the set of geometric multipliers is empty.
DUALITY PROPERTIES
Optimality Conditions: (x*, μ*) is an optimal solution-geometric multiplier pair if and only if

x* ∈ X, g(x*) ≤ 0,   (Primal Feasibility),
μ* ≥ 0,   (Dual Feasibility),
x* = arg min_{x ∈ X} L(x, μ*),   (Lagrangian Optimality),
μ*_j g_j(x*) = 0, j = 1, . . . , r,   (Compl. Slackness).
Saddle Point Theorem: (x*, μ*) is an optimal solution-geometric multiplier pair if and only if x* ∈ X, μ* ≥ 0, and (x*, μ*) is a saddle point of the Lagrangian, in the sense that

L(x*, μ) ≤ L(x*, μ*) ≤ L(x, μ*),   ∀ x ∈ X, μ ≥ 0.
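The two saddle inequalities can be sampled numerically. A minimal sketch on example (a) from the slide "A G-multiplier exists", whose optimal pair is x* = (0, 1) with μ* = 1; the grids are an illustrative finite sample of X and of μ ≥ 0:

```python
def L(x1, x2, mu):
    # Lagrangian of: min x1 - x2  s.t.  x1 + x2 - 1 <= 0,  x1, x2 >= 0
    return (x1 - x2) + mu * (x1 + x2 - 1.0)

x_star, mu_star = (0.0, 1.0), 1.0
L_star = L(*x_star, mu_star)                      # L(x*, mu*) = -1

mus = [i / 10.0 for i in range(0, 101)]                        # mu in [0, 10]
xs = [(a / 10.0, b / 10.0) for a in range(51) for b in range(51)]

# left inequality:  L(x*, mu) <= L(x*, mu*) for all mu >= 0
assert all(L(*x_star, mu) <= L_star + 1e-12 for mu in mus)
# right inequality: L(x*, mu*) <= L(x, mu*) for all x in X
assert all(L_star <= L(x1, x2, mu_star) + 1e-12 for x1, x2 in xs)

print("saddle point inequalities hold on the sampled grid")
```

The left inequality is tight here for every μ because g(x*) = 0, so the saddle condition reduces to the Lagrangian optimality and complementary slackness conditions above.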
INFEASIBLE AND UNBOUNDED PROBLEMS
[Figure: the sets S = {(g(x), f(x)) | x ∈ X} for three pathological examples in the (z, w)-plane.]

(a) min f(x) = 1/x
    s.t. g(x) = x ≤ 0
    x ∈ X = {x | x > 0}
    f* = ∞ (infeasible), q* = ∞

(b) min f(x) = x
    s.t. g(x) = x² ≤ 0
    x ∈ X = {x | x > 0}
    S = {(x², x) | x > 0},  f* = ∞, q* = 0

(c) min f(x) = x1 + x2
    s.t. g(x) = x1 ≤ 0
    x ∈ X = {(x1, x2) | x1 > 0}
    S = {(g(x), f(x)) | x ∈ X} = {(z, w) | z > 0},  f* = ∞, q* = −∞
EXTENSIONS AND APPLICATIONS
Equality constraints h_i(x) = 0, i = 1, . . . , m, can be converted into the two inequality constraints

h_i(x) ≤ 0,   −h_i(x) ≤ 0.

Separable problems:

minimize Σ_{i=1}^m f_i(x_i)
subject to Σ_{i=1}^m g_{ij}(x_i) ≤ 0, j = 1, . . . , r,
           x_i ∈ X_i, i = 1, . . . , m.
Separable problem with a single constraint:

minimize Σ_{i=1}^n f_i(x_i)
subject to Σ_{i=1}^n x_i ≥ A,   α_i ≤ x_i ≤ β_i, ∀ i.
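The point of separability is that the Lagrangian decouples: writing the constraint as g(x) = A − Σ_i x_i ≤ 0 gives L(x, μ) = μA + Σ_i (f_i(x_i) − μx_i), so q(μ) is computed by n independent one-dimensional minimizations. A minimal sketch under illustrative assumptions (quadratic f_i(x) = c_i x², bounds [0, 1], A = 1.5 - none of this data is from the slides):

```python
c = [1.0, 2.0, 4.0]                     # f_i(x) = c_i * x**2
alpha, beta, A = 0.0, 1.0, 1.5

def argmin_i(ci, mu):
    # minimizer of c_i*x**2 - mu*x on [alpha, beta]: the unconstrained
    # minimizer mu/(2*c_i), clipped to the bounds
    return min(max(mu / (2.0 * ci), alpha), beta)

def q(mu):
    # q(mu) = mu*A + sum_i min_{alpha <= x <= beta} (f_i(x) - mu*x):
    # one cheap scalar subproblem per coordinate
    total = mu * A
    for ci in c:
        x = argmin_i(ci, mu)
        total += ci * x * x - mu * x
    return total

# maximize the concave dual by a coarse line search over mu >= 0
mu_best = max((i / 100.0 for i in range(0, 1001)), key=q)
x_best = [argmin_i(ci, mu_best) for ci in c]
print(mu_best, x_best, sum(x_best))
```

The primal variables are recovered coordinate-wise from the dual maximizer; with the data above, the recovered sum of the x_i lands close to A, as expected for this convex instance with no duality gap.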
DUALITY THEOREM I FOR CONVEX PROBLEMS
Strong Duality Theorem - Linear Constraints: Assume that the problem

minimize f(x)
subject to x ∈ X, a_i'x − b_i = 0, i = 1, . . . , m,
           e_j'x − d_j ≤ 0, j = 1, . . . , r,

has finite optimal value f*. Let also f be convex over ℜⁿ and let X be polyhedral. Then there exists at least one geometric multiplier and there is no duality gap.

Proof Issues

Application to Linear Programming
COUNTEREXAMPLE
A Convex Problem with a Duality Gap: Consider the two-dimensional problem

minimize f(x)
subject to x1 ≤ 0, x ∈ X = {x | x ≥ 0},

where

f(x) = e^(−√(x1 x2)), x ∈ X,

and f(x) is arbitrarily defined for x ∉ X.

f is convex over X (its Hessian is positive definite in the interior of X), and f* = 1.

Also, for all μ ≥ 0 we have

q(μ) = inf_{x ≥ 0} { e^(−√(x1 x2)) + μ x1 } = 0,

since the expression in braces is nonnegative for x ≥ 0 and can approach zero by taking x1 → 0 and x1 x2 → ∞. It follows that q* = 0.
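The limiting sequence in the argument can be made concrete: along x = (1/k, k³) we have x1 → 0 while x1 x2 = k² → ∞. A minimal sketch (the particular sequence is an illustrative choice):

```python
import math

def lagrangian(x1, x2, mu):
    # exp(-sqrt(x1*x2)) + mu*x1, the expression in braces above
    return math.exp(-math.sqrt(x1 * x2)) + mu * x1

mu = 1.0
# along x = (1/k, k**3): sqrt(x1*x2) = k, so the value is exp(-k) + mu/k
values = [lagrangian(1.0 / k, k ** 3, mu) for k in (1, 10, 100, 1000)]
print(values)   # decreases toward 0 while f* = 1
```

So q(μ) = 0 for every μ ≥ 0 even though f* = 1: a duality gap of 1 on a convex problem, which is why the convexity of f alone (without the polyhedral/Slater-type conditions of the theorems) is not enough.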
DUALITY THEOREM II FOR CONVEX PROBLEMS
Consider the problem

minimize f(x)
subject to x ∈ X, g_j(x) ≤ 0, j = 1, . . . , r.

Assume that X is convex and the functions f : ℜⁿ → ℜ, g_j : ℜⁿ → ℜ are convex over X. Furthermore, the optimal value f* is finite and there exists a vector x̄ ∈ X such that

g_j(x̄) < 0, j = 1, . . . , r.

Strong Duality Theorem: There exists at least one geometric multiplier and there is no duality gap.

Extension to linear equality constraints.
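A minimal sketch of the theorem on a small instance of my own choosing (not from the slides): min x² s.t. g(x) = 1 − x ≤ 0 on X = ℜ, where x̄ = 2 satisfies g(x̄) = −1 < 0, so the interior-point assumption holds and no duality gap is predicted.

```python
def q(mu):
    # q(mu) = inf_{x in R} (x**2 + mu*(1 - x)); the minimizer is x = mu/2,
    # giving the closed form mu - mu**2/4
    x = mu / 2.0
    return x * x + mu * (1.0 - x)

f_star = 1.0                                          # attained at x = 1
q_star = max(q(i / 100.0) for i in range(0, 1001))    # mu in [0, 10]
mu_star = max(range(0, 1001), key=lambda i: q(i / 100.0)) / 100.0

assert abs(q_star - f_star) < 1e-12    # no duality gap, as the theorem predicts
print(mu_star)                          # the geometric multiplier, mu* = 2
```

Here the dual maximum is attained at μ* = 2, which is exactly the geometric multiplier whose existence the theorem guarantees.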
6.252 NONLINEAR PROGRAMMING
LECTURE 20: STRONG DUALITY
LECTURE OUTLINE
Strong Duality Theorem
Linear Equality Constraints

Fenchel Duality
********************************
Consider the problem

minimize f(x)
subject to x ∈ X, g_j(x) ≤ 0, j = 1, . . . , r,

assuming −∞ < f* < ∞.

Normalize (β = 1), take the infimum over x ∈ X, and use the fact μ* ≥ 0, to obtain

f* ≤ inf_{x ∈ X} { f(x) + μ*'g(x) } = q(μ*) ≤ sup_{μ ≥ 0} q(μ) = q*.

Using the weak duality theorem, μ* is a geometric multiplier and there is no duality gap.
LINEAR EQUALITY CONSTRAINTS
Suppose we have the additional constraints

e_i'x − d_i = 0, i = 1, . . . , m.

We need the notion of the affine hull of a convex set X [denoted aff(X)]. This is the intersection of all hyperplanes containing X.

The relative interior of X, denoted ri(X), is the set of all x ∈ X such that there exists ε > 0 with

{ z | ‖z − x‖ < ε, z ∈ aff(X) } ⊂ X,

that is, ri(X) is the interior of X relative to aff(X).

Every nonempty convex set has a nonempty relative interior.
DUALITY THEOREM FOR EQUALITIES
Assumptions:

The set X is convex and the functions f, g_j are convex over X.

The optimal value f* is finite and there exists a vector x̄ ∈ ri(X) such that

g_j(x̄) < 0, j = 1, . . . , r,
e_i'x̄ − d_i = 0, i = 1, . . . , m.

Under the preceding assumptions there exists at least one geometric multiplier and there is no duality gap.
COUNTEREXAMPLE
Consider

minimize f(x) = x1
subject to x2 = 0, x ∈ X = { (x1, x2) | x1² ≤ x2 }.

The optimal solution is x* = (0, 0) and f* = 0.

The dual function is given by

q(λ) = inf_{x1² ≤ x2} { x1 + λx2 } = { −1/(4λ) if λ > 0,  −∞ if λ ≤ 0 }.

No dual optimal solution and therefore there is no geometric multiplier. (Even though there is no duality gap.)

Assumptions are violated (the feasible set and the relative interior of X have no common point).
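The closed form for q(λ) can be verified numerically: for λ > 0 the infimum is attained on the boundary x2 = x1² (raising x2 only increases x1 + λx2), so a one-dimensional grid search suffices. A minimal sketch; the grid bounds and step are illustrative choices:

```python
def q(lam, step=0.001, lo=-50.0, hi=50.0):
    # Approximate inf over {x | x1**2 <= x2} of (x1 + lam*x2) for lam > 0
    # by searching x1 on a grid with x2 = x1**2.
    best = float("inf")
    n = int((hi - lo) / step)
    for i in range(n + 1):
        x1 = lo + i * step
        best = min(best, x1 + lam * x1 * x1)
    return best

for lam in (0.5, 1.0, 4.0):
    assert abs(q(lam) + 1.0 / (4.0 * lam)) < 1e-5

print("q(lam) matches -1/(4*lam) on sampled lam > 0")
```

Since q(λ) = −1/(4λ) < 0 for every λ > 0 while sup_λ q(λ) = 0, the supremum is not attained: there is no dual optimal solution, hence no geometric multiplier, even though q* = f* = 0.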
FENCHEL DUALITY FRAMEWORK
Consider the problem

minimize f1(x) − f2(x)
subject to x ∈ X1 ∩ X2,

where f1 and f2 are real-valued functions on ℜⁿ, and X1 and X2 are subsets of ℜⁿ.

Assume that −∞ < f* < ∞.

Convert problem to

minimize f1(y) − f2(z)
subject to z = y, y