8/8/2019 NLP Slides
LECTURE SLIDES ON NONLINEAR PROGRAMMING
BASED ON LECTURES GIVEN AT THE
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
CAMBRIDGE, MASS.
DIMITRI P. BERTSEKAS

These lecture slides are based on the book: Nonlinear Programming, Athena Scientific, by Dimitri P. Bertsekas; see

http://www.athenasc.com/nonlinbook.html

for errata, selected problem solutions, and other support material.

The slides are copyrighted but may be freely reproduced and distributed for any noncommercial purpose.
LAST REVISED: Feb. 3, 2005
6.252 NONLINEAR PROGRAMMING
LECTURE 1: INTRODUCTION
LECTURE OUTLINE
Nonlinear Programming
Application Contexts
Characterization Issue
Computation Issue
Duality
Organization
NONLINEAR PROGRAMMING
    min_{x ∈ X} f(x),

where

• f : ℝⁿ → ℝ is a continuous (and usually differentiable) function of n variables
• X = ℝⁿ or X is a subset of ℝⁿ with a continuous character.

• If X = ℝⁿ, the problem is called unconstrained.
• If f is linear and X is polyhedral, the problem is a linear programming problem. Otherwise it is a nonlinear programming problem.

• Linear and nonlinear programming have traditionally been treated separately. Their methodologies have gradually come closer.
TWO MAIN ISSUES
• Characterization of minima
  − Necessary conditions
  − Sufficient conditions
  − Lagrange multiplier theory
  − Sensitivity
  − Duality
• Computation by iterative algorithms
  − Iterative descent
  − Approximation methods
  − Dual and primal-dual methods
APPLICATIONS OF NONLINEAR PROGRAMMING
• Data networks − routing
• Production planning
• Resource allocation
• Computer-aided design
• Solution of equilibrium models
• Data analysis and least squares formulations
• Modeling human or organizational behavior
CHARACTERIZATION PROBLEM
• Unconstrained problems
  − Zero 1st order variation along all directions
• Constrained problems
  − Nonnegative 1st order variation along all feasible directions
• Equality constraints
  − Zero 1st order variation along all directions on the constraint surface
  − Lagrange multiplier theory
  − Sensitivity
COMPUTATION PROBLEM
• Iterative descent
• Approximation
• Role of convergence analysis
• Role of rate of convergence analysis
• Using an existing package to solve a nonlinear programming problem
POST-OPTIMAL ANALYSIS
Sensitivity
Role of Lagrange multipliers as prices
DUALITY
• Min common point problem / max intercept point problem duality

[Figure: min common point and max intercept point of a set S, panels (a) and (b).]

Illustration of the optimal values of the min common point and max intercept point problems. In (a), the two optimal values are not equal. In (b), the set S, when extended upwards along the nth axis, yields the set

    S̄ = { x̄ | for some x ∈ S, x̄_n ≥ x_n, x̄_i = x_i, i = 1, . . . , n − 1 }

which is convex. As a result, the two optimal values are equal. This fact, when suitably formalized, is the basis for some of the most important duality results.
6.252 NONLINEAR PROGRAMMING
LECTURE 2
UNCONSTRAINED OPTIMIZATION -
OPTIMALITY CONDITIONS
LECTURE OUTLINE
Unconstrained Optimization
Local Minima
Necessary Conditions for Local Minima
Sufficient Conditions for Local Minima
The Role of Convexity
MATHEMATICAL BACKGROUND
• Vectors and matrices in ℝⁿ
• Transpose, inner product, norm
• Eigenvalues of symmetric matrices
• Positive definite and semidefinite matrices
• Convergent sequences and subsequences
• Open, closed, and compact sets
• Continuity of functions
• 1st and 2nd order differentiability of functions
• Taylor series expansions
• Mean value theorems
LOCAL AND GLOBAL MINIMA
[Figure: graph of f(x) showing local minima, a strict local minimum, and a strict global minimum.]
Unconstrained local and global minima in one dimension.
NECESSARY CONDITIONS FOR A LOCAL MIN
• 1st order condition: Zero slope at a local minimum x*

    ∇f(x*) = 0

• 2nd order condition: Nonnegative curvature at a local minimum x*

    ∇²f(x*) : Positive Semidefinite

• There may exist points that satisfy the 1st and 2nd order conditions but are not local minima

[Figure: f(x) = |x|³ (convex), f(x) = x³, and f(x) = −|x|³, each with x* = 0.]
First and second order necessary optimality conditions for functions of one variable.
PROOFS OF NECESSARY CONDITIONS
• 1st order condition ∇f(x*) = 0. Fix d ∈ ℝⁿ. Then (since x* is a local min), from 1st order Taylor expansion,

    d'∇f(x*) = lim_{α↓0} ( f(x* + αd) − f(x*) ) / α ≥ 0.

Replace d with −d, to obtain

    d'∇f(x*) = 0,  ∀ d ∈ ℝⁿ

• 2nd order condition ∇²f(x*) ≥ 0. From 2nd order Taylor expansion,

    f(x* + αd) − f(x*) = α∇f(x*)'d + (α²/2) d'∇²f(x*)d + o(α²)

Since ∇f(x*) = 0 and x* is a local min, there is sufficiently small ε > 0 such that for all α ∈ (0, ε),

    0 ≤ ( f(x* + αd) − f(x*) ) / α² = (1/2) d'∇²f(x*)d + o(α²)/α²

Take the limit as α → 0.
SUFFICIENT CONDITIONS FOR A LOCAL MIN
• 1st order condition: Zero slope

    ∇f(x*) = 0

• 2nd order condition: Positive curvature

    ∇²f(x*) : Positive Definite

• Proof: Let λ > 0 be the smallest eigenvalue of ∇²f(x*). Using a second order Taylor expansion, we have for all d

    f(x* + d) − f(x*) = ∇f(x*)'d + (1/2) d'∇²f(x*)d + o(‖d‖²)
                      ≥ (λ/2)‖d‖² + o(‖d‖²)
                      = ( λ/2 + o(‖d‖²)/‖d‖² ) ‖d‖².

For ‖d‖ small enough, o(‖d‖²)/‖d‖² is negligible relative to λ/2.
CONVEXITY
[Figure: convex and nonconvex sets; for a convex set, the segment αx + (1 − α)y, 0 < α < 1, between any x and y stays in the set.]
Convex and nonconvex sets.

[Figure: a convex function f over a convex set C; at z between x and y, the chord value αf(x) + (1 − α)f(y) lies above f(z).]
A convex function. Linear interpolation overestimates the function.
MINIMA AND CONVEXITY
• Local minima are also global under convexity

[Figure: the chord αf(x*) + (1 − α)f(x) lies above f( αx* + (1 − α)x ) on the segment between x* and x.]

Illustration of why local minima of convex functions are also global. Suppose that f is convex and that x* is a local minimum of f. Let x be such that f(x) < f(x*). By convexity, for all α ∈ (0, 1),

    f( αx* + (1 − α)x ) ≤ αf(x*) + (1 − α)f(x) < f(x*).

Thus, f takes values strictly lower than f(x*) on the line segment connecting x* with x, and x* cannot be a local minimum which is not global.
OTHER PROPERTIES OF CONVEX FUNCTIONS
• f is convex if and only if the linear approximation at a point x based on the gradient underestimates f:

    f(z) ≥ f(x) + ∇f(x)'(z − x),  ∀ z ∈ ℝⁿ

[Figure: f(z) and its tangent f(x) + (z − x)'∇f(x) at x.]

• Implication:

    ∇f(x*) = 0  ⇒  x* is a global minimum

• f is convex if and only if ∇²f(x) is positive semidefinite for all x
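The gradient inequality above is easy to verify numerically. A minimal sketch, assuming the convex function f(x) = eˣ (an illustrative choice, not from the slides):

```python
import math

# Numerical check of the convexity characterization
# f(z) >= f(x) + f'(x)(z - x) for the convex function f(x) = exp(x).
f = math.exp          # f(x) = e^x, convex on the whole real line
df = math.exp         # f'(x) = e^x

points = [t / 10.0 for t in range(-30, 31)]   # grid on [-3, 3]
ok = all(f(z) >= f(x) + df(x) * (z - x) - 1e-12
         for x in points for z in points)
print(ok)  # the linear approximation never overestimates f
```

The small tolerance 1e-12 guards against floating-point rounding; analytically the inequality is exact.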
6.252 NONLINEAR PROGRAMMING
LECTURE 3: GRADIENT METHODS
LECTURE OUTLINE
Quadratic Unconstrained Problems
Existence of Optimal Solutions
Iterative Computational Methods
Gradient Methods - Motivation
Principal Gradient Methods
Gradient Methods - Choices of Direction
QUADRATIC UNCONSTRAINED PROBLEMS
    min_{x ∈ ℝⁿ} f(x) = (1/2) x'Qx − b'x,

where Q is n × n symmetric, and b ∈ ℝⁿ.

• Necessary conditions:

    ∇f(x*) = Qx* − b = 0,
    ∇²f(x*) = Q ≥ 0 : positive semidefinite.

• Q ≥ 0  ⇒  f : convex, necessary conditions are also sufficient, and local minima are also global
• Conclusions:
  − Q : not ≥ 0  ⇒  f has no local minima
  − If Q > 0 (and hence invertible), x* = Q⁻¹b is the unique global minimum.
  − If Q ≥ 0 but not invertible, either no solution or ∞ number of solutions
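For a concrete instance of the invertible case, the unique minimum x* = Q⁻¹b can be computed by solving the linear system Qx = b (Q and b below are assumed for illustration):

```python
import numpy as np

# Unique global minimum of f(x) = (1/2) x'Qx - b'x for Q > 0:
# solve Qx = b rather than forming the inverse explicitly.
Q = np.array([[4.0, 1.0],
              [1.0, 3.0]])      # symmetric positive definite
b = np.array([1.0, 2.0])

x_star = np.linalg.solve(Q, b)   # x* = Q^{-1} b
grad = Q @ x_star - b            # gradient Qx* - b should vanish
print(np.allclose(grad, 0.0))    # True
```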
[Figure: isocost curves of f for four cases of the coefficients α, β.]

• α > 0, β > 0: (1/α, 0) is the unique global minimum
• α > 0, β = 0: { (1/α, ξ) | ξ : real } is the set of global minima
• α = 0: there is no global minimum
• α > 0, β < 0: there is no global minimum

Illustration of the isocost surfaces of the quadratic cost function f : ℝ² → ℝ given by

    f(x, y) = (1/2)( αx² + βy² ) − x

for various values of α and β.
EXISTENCE OF OPTIMAL SOLUTIONS
• Consider the problem

    min_{x ∈ X} f(x)

• The set of optimal solutions is

    X* = ∩_{k=1}^∞ { x ∈ X | f(x) ≤ γ_k }

where {γ_k} is a scalar sequence such that γ_k ↓ f*, with

    f* = inf_{x ∈ X} f(x)

• X* is nonempty and compact if all the sets { x ∈ X | f(x) ≤ γ_k } are compact. So:
  − A global minimum exists if f is continuous and X is compact (Weierstrass theorem)
  − A global minimum exists if X is closed, and f is continuous and coercive, that is, f(x) → ∞ when ‖x‖ → ∞
GRADIENT METHODS - MOTIVATION
[Figure: level sets f(x) = c₁ > c₂ > c₃ and the step from x to x̄ = x − α∇f(x).]
If ∇f(x) ≠ 0, there is an interval (0, δ) of stepsizes such that

    f( x − α∇f(x) ) < f(x)  for all α ∈ (0, δ).

[Figure: level sets and a direction d at x, with x̄ = x + αd.]
If d makes an angle with ∇f(x) that is greater than 90 degrees,

    ∇f(x)'d < 0,

there is an interval (0, δ) of stepsizes such that f(x + αd) < f(x) for all α ∈ (0, δ).
PRINCIPAL GRADIENT METHODS
    x_{k+1} = x_k + α_k d_k,  k = 0, 1, . . .

where, if ∇f(x_k) ≠ 0, the direction d_k satisfies

    ∇f(x_k)'d_k < 0,

and α_k is a positive stepsize. Principal example:

    x_{k+1} = x_k − α_k D_k ∇f(x_k),

where D_k is a positive definite symmetric matrix.

• Simplest method: Steepest descent

    x_{k+1} = x_k − α_k ∇f(x_k),  k = 0, 1, . . .

• Most sophisticated method: Newton's method

    x_{k+1} = x_k − α_k ( ∇²f(x_k) )⁻¹ ∇f(x_k),  k = 0, 1, . . .
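A minimal sketch comparing the two extremes on an assumed ill-conditioned quadratic f(x) = (1/2)x'Qx: steepest descent with a constant stepsize, and Newton's method with a full step (which is exact for a quadratic):

```python
import numpy as np

# Steepest descent x_{k+1} = x_k - a * grad(x_k) versus Newton's method
# x_{k+1} = x_k - inv(Hessian) @ grad(x_k) on f(x) = (1/2) x'Qx.
Q = np.array([[10.0, 0.0], [0.0, 1.0]])   # ill-conditioned diagonal Q (assumed)
x_star = np.zeros(2)                      # the minimum of f

grad = lambda x: Q @ x

# Steepest descent, constant stepsize 2/(M + m) with M = 10, m = 1
x = np.array([1.0, 1.0])
for _ in range(100):
    x = x - (2.0 / 11.0) * grad(x)
sd_error = np.linalg.norm(x - x_star)

# Newton: for a quadratic, a single full step lands exactly on x*
x = np.array([1.0, 1.0])
x = x - np.linalg.solve(Q, grad(x))       # Hessian of f is Q
newton_error = np.linalg.norm(x - x_star)

print(newton_error < 1e-12, sd_error < 1e-2)  # → True True
```

Steepest descent contracts the error only by the factor 9/11 per step here, while Newton's method solves the quadratic in one iteration.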
STEEPEST DESCENT AND NEWTON'S METHOD

[Figure: slow zig-zag convergence of steepest descent from x₀ through x₁, x₂ across the level sets f(x) = c₁ > c₂ > c₃.]
OTHER CHOICES OF DIRECTION
• Diagonally Scaled Steepest Descent

    D_k = diagonal approximation to ( ∇²f(x_k) )⁻¹

• Modified Newton's Method

    D_k = ( ∇²f(x₀) )⁻¹,  k = 0, 1, . . .

• Discretized Newton's Method

    D_k = ( H(x_k) )⁻¹,  k = 0, 1, . . . ,

where H(x_k) is a finite-difference based approximation of ∇²f(x_k)

• Gauss-Newton method for least squares problems: min_{x ∈ ℝⁿ} (1/2)‖g(x)‖². Here

    D_k = ( ∇g(x_k)∇g(x_k)' )⁻¹,  k = 0, 1, . . .
6.252 NONLINEAR PROGRAMMING
LECTURE 4
CONVERGENCE ANALYSIS OF GRADIENT METHODS
LECTURE OUTLINE
Gradient Methods - Choice of Stepsize
Gradient Methods - Convergence Issues
CHOICES OF STEPSIZE I
• Minimization Rule: α_k is such that

    f(x_k + α_k d_k) = min_{α ≥ 0} f(x_k + αd_k).

• Limited Minimization Rule: Min over α ∈ [0, s]

• Armijo rule:

[Figure: f(x_k + αd_k) − f(x_k) versus α, together with the lines α∇f(x_k)'d_k and σα∇f(x_k)'d_k; the set of acceptable stepsizes and the unsuccessful stepsize trials s, βs, β²s.]

Start with s and continue with βs, β²s, . . ., until β^m s falls within the set of α with

    f(x_k) − f(x_k + αd_k) ≥ −σα∇f(x_k)'d_k.
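The backtracking loop just described translates directly into code. A minimal sketch in plain Python, with illustrative values for s, β, σ (the slides leave the constants unspecified):

```python
# Armijo rule sketch: start with trial stepsize s and shrink by beta
# until the sufficient-decrease test holds.
def armijo(f, grad_f, x, d, s=1.0, beta=0.5, sigma=1e-4):
    """Return a stepsize a = s * beta**m satisfying
    f(x) - f(x + a*d) >= -sigma * a * grad_f(x)'d."""
    gd = sum(g * di for g, di in zip(grad_f(x), d))  # grad'd, must be < 0
    a = s
    while f([xi + a * di for xi, di in zip(x, d)]) > f(x) + sigma * a * gd:
        a *= beta                                    # unsuccessful trial
    return a

# Example: f(x) = x1^2 + x2^2, descent direction d = -grad f(x)
f = lambda x: x[0] ** 2 + x[1] ** 2
grad_f = lambda x: [2 * x[0], 2 * x[1]]
x = [1.0, 1.0]
d = [-g for g in grad_f(x)]
a = armijo(f, grad_f, x, d)
print(0 < a <= 1.0)  # an acceptable stepsize is found
```

In this example the trial s = 1 fails the test and the first backtracked stepsize 0.5 is accepted.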
CHOICES OF STEPSIZE II
• Constant stepsize: α_k is such that

    α_k = s : a constant

• Diminishing stepsize:

    α_k → 0

but satisfies the infinite travel condition

    Σ_{k=0}^∞ α_k = ∞
GRADIENT METHODS WITH ERRORS
    x_{k+1} = x_k − α_k( ∇f(x_k) + e_k )

where e_k is an uncontrollable error vector

• Several special cases:
  − e_k small relative to the gradient, i.e., for all k, ‖e_k‖ < ‖∇f(x_k)‖

[Figure: descent property of the direction g_k = ∇f(x_k) + e_k when ‖e_k‖ < ‖∇f(x_k)‖.]

  − {e_k} is bounded, i.e., for all k, ‖e_k‖ ≤ δ, where δ is some scalar.
  − {e_k} is proportional to the stepsize, i.e., for all k, ‖e_k‖ ≤ qα_k, where q is some scalar.
  − {e_k} are independent zero mean random vectors
CONVERGENCE ISSUES
• Only convergence to stationary points can be guaranteed
• Even convergence to a single limit may be hard to guarantee (capture theorem)
• Danger of nonconvergence if directions d_k tend to be orthogonal to ∇f(x_k)
• Gradient related condition:

For any subsequence {x_k}_{k∈K} that converges to a nonstationary point, the corresponding subsequence {d_k}_{k∈K} is bounded and satisfies

    lim sup_{k→∞, k∈K} ∇f(x_k)'d_k < 0.

• Satisfied if d_k = −D_k∇f(x_k) and the eigenvalues of D_k are bounded above and bounded away from zero
CONVERGENCE RESULTS
CONSTANT AND DIMINISHING STEPSIZES
Let {x_k} be a sequence generated by a gradient method x_{k+1} = x_k + α_k d_k, where {d_k} is gradient related. Assume that for some constant L > 0, we have

    ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖,  ∀ x, y ∈ ℝⁿ.

Assume that either

(1) there exists a scalar ε > 0 such that for all k

    ε ≤ α_k ≤ (2 − ε)|∇f(x_k)'d_k| / ( L‖d_k‖² )

or

(2) α_k → 0 and Σ_{k=0}^∞ α_k = ∞.

Then either f(x_k) → −∞ or else {f(x_k)} converges to a finite value and ∇f(x_k) → 0.
MAIN PROOF IDEA
[Figure: the quadratic majorant α∇f(x_k)'d_k + (1/2)α²L‖d_k‖² of the cost difference f(x_k + αd_k) − f(x_k), minimized at ᾱ = |∇f(x_k)'d_k| / ( L‖d_k‖² ).]

The idea of the convergence proof for a constant stepsize. Given x_k and the descent direction d_k, the cost difference f(x_k + αd_k) − f(x_k) is majorized by α∇f(x_k)'d_k + (1/2)α²L‖d_k‖² (based on the Lipschitz assumption; see next slide). Minimization of this function over α yields the stepsize

    ᾱ = |∇f(x_k)'d_k| / ( L‖d_k‖² )

This stepsize reduces the cost function f as well.
DESCENT LEMMA
Let t be a scalar and let g(t) = f(x + ty). Have

    f(x + y) − f(x) = g(1) − g(0) = ∫₀¹ (dg/dt)(t) dt
        = ∫₀¹ y'∇f(x + ty) dt
        ≤ ∫₀¹ y'∇f(x) dt + | ∫₀¹ y'( ∇f(x + ty) − ∇f(x) ) dt |
        ≤ ∫₀¹ y'∇f(x) dt + ∫₀¹ ‖y‖ ‖∇f(x + ty) − ∇f(x)‖ dt
        ≤ y'∇f(x) + ‖y‖ ∫₀¹ Lt‖y‖ dt
        = y'∇f(x) + (L/2)‖y‖².
CONVERGENCE RESULT ARMIJO RULE
Let {x_k} be generated by x_{k+1} = x_k + α_k d_k, where {d_k} is gradient related and α_k is chosen by the Armijo rule. Then every limit point of {x_k} is stationary.

Proof Outline: Assume x̄ is a nonstationary limit point. Then f(x_k) → f(x̄), so α_k∇f(x_k)'d_k → 0.

• If {x_k}_K → x̄, then lim sup_{k→∞, k∈K} ∇f(x_k)'d_k < 0 by gradient relatedness, so that {α_k}_K → 0.

• By the Armijo rule, for large k ∈ K

    f(x_k) − f( x_k + (α_k/β)d_k ) < −σ(α_k/β)∇f(x_k)'d_k.

• Defining p_k = d_k/‖d_k‖ and ᾱ_k = α_k‖d_k‖/β, we have

    ( f(x_k) − f(x_k + ᾱ_k p_k) ) / ᾱ_k < −σ∇f(x_k)'p_k.

• Use the Mean Value Theorem and let k → ∞. We get −∇f(x̄)'p̄ ≤ −σ∇f(x̄)'p̄, where p̄ is a limit point of p_k − a contradiction since ∇f(x̄)'p̄ < 0.
6.252 NONLINEAR PROGRAMMING
LECTURE 5: RATE OF CONVERGENCE
LECTURE OUTLINE
Approaches for Rate of Convergence Analysis
The Local Analysis Method
Quadratic Model Analysis
The Role of the Condition Number
Scaling
Diagonal Scaling
Extension to Nonquadratic Problems
Singular and Difficult Problems
APPROACHES FOR RATE OF
CONVERGENCE ANALYSIS
• Computational complexity approach
• Informational complexity approach
• Local analysis
• Why we will focus on the local analysis method
THE LOCAL ANALYSIS APPROACH
• Restrict attention to sequences x_k converging to a local min x*
• Measure progress in terms of an error function e(x) with e(x*) = 0, such as

    e(x) = ‖x − x*‖,   e(x) = f(x) − f(x*)

• Compare the tail of the sequence e(x_k) with the tail of standard sequences
• Geometric or linear convergence [if e(x_k) ≤ qβᵏ for some q > 0 and β ∈ [0, 1), and for all k]. Holds if

    lim sup_{k→∞} e(x_{k+1}) / e(x_k) < 1

• Superlinear convergence [if e(x_k) ≤ q β^(pᵏ) for some q > 0, p > 1 and β ∈ [0, 1), and for all k].
• Sublinear convergence
QUADRATIC MODEL ANALYSIS
• Focus on the quadratic function f(x) = (1/2)x'Qx with Q > 0.
• Analysis also applies to nonquadratic problems in the neighborhood of a nonsingular local min
• Consider steepest descent

    x_{k+1} = x_k − α_k∇f(x_k) = (I − α_k Q)x_k

    ‖x_{k+1}‖² = x_k'(I − α_k Q)²x_k ≤ ( max eig. (I − α_k Q)² ) ‖x_k‖²

The eigenvalues of (I − α_k Q)² are equal to (1 − α_k λ_i)², where λ_i are the eigenvalues of Q, so

    max eig of (I − α_k Q)² = max{ (1 − α_k m)², (1 − α_k M)² },

where m, M are the smallest and largest eigenvalues of Q. Thus

    ‖x_{k+1}‖ ≤ max{ |1 − α_k m|, |1 − α_k M| } ‖x_k‖
OPTIMAL CONVERGENCE RATE
• The value of α_k that minimizes the bound is α* = 2/(M + m), in which case

    ‖x_{k+1}‖ ≤ ( (M − m)/(M + m) ) ‖x_k‖

[Figure: |1 − αm| and |1 − αM| versus α; their max is minimized at α* = 2/(M + m) with value (M − m)/(M + m); the stepsizes that guarantee convergence are 0 < α < 2/M.]

• Convergence rate for minimization stepsize (see text)

    f(x_{k+1}) ≤ ( (M − m)/(M + m) )² f(x_k)

• The ratio M/m is called the condition number of Q, and problems with M/m large are called ill-conditioned.
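The contraction factor (M − m)/(M + m) can be observed directly. A sketch with an assumed diagonal Q (m = 1, M = 9, so the predicted factor is 0.8):

```python
import numpy as np

# For steepest descent on f(x) = (1/2) x'Qx with the optimal constant
# stepsize a = 2/(M + m), the per-step contraction of ||x_k|| is at
# most (M - m)/(M + m).
m, M = 1.0, 9.0
Q = np.diag([m, M])                   # eigenvalues m and M (assumed)
a = 2.0 / (M + m)                     # optimal constant stepsize
factor = (M - m) / (M + m)            # predicted contraction, 0.8 here

x = np.array([1.0, 1.0])
ratios = []
for _ in range(5):
    x_next = x - a * (Q @ x)          # steepest descent step
    ratios.append(np.linalg.norm(x_next) / np.linalg.norm(x))
    x = x_next

print(all(r <= factor + 1e-12 for r in ratios))  # True
```

With this starting point the bound is tight: every observed ratio equals 0.8.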
SCALING AND STEEPEST DESCENT
• View the more general method

    x_{k+1} = x_k − α_k D_k∇f(x_k)

as a scaled version of steepest descent.

• Consider a change of variables x = Sy with S = (D_k)^{1/2}. In the space of y, the problem is

    minimize h(y) ≡ f(Sy)
    subject to y ∈ ℝⁿ

• Apply steepest descent to this problem, multiply with S, and pass back to the space of x, using ∇h(y_k) = S∇f(x_k):

    y_{k+1} = y_k − α_k∇h(y_k)
    Sy_{k+1} = Sy_k − α_k S∇h(y_k)
    x_{k+1} = x_k − α_k D_k∇f(x_k)
DIAGONAL SCALING
• Apply the results for steepest descent to the scaled iteration y_{k+1} = y_k − α_k∇h(y_k):

    ‖y_{k+1}‖ ≤ max{ |1 − α_k m_k|, |1 − α_k M_k| } ‖y_k‖

    f(x_{k+1}) / f(x_k) = h(y_{k+1}) / h(y_k) ≤ ( (M_k − m_k)/(M_k + m_k) )²

where m_k and M_k are the smallest and largest eigenvalues of the Hessian of h, which is

    ∇²h(y) = S∇²f(x)S = (D_k)^{1/2} Q (D_k)^{1/2}

• It is desirable to choose D_k as close as possible to Q⁻¹. Also if D_k is so chosen, the stepsize α = 1 is near the optimal 2/(M_k + m_k).
• Using as D_k a diagonal approximation to Q⁻¹ is common and often very effective. Corrects for poor choice of units expressing the variables.
NONQUADRATIC PROBLEMS
• Rate of convergence to a nonsingular local minimum of a nonquadratic function is very similar to the quadratic case (linear convergence is typical).
• If D_k → ( ∇²f(x*) )⁻¹, we asymptotically obtain optimal scaling and superlinear convergence
• More generally, if the direction d_k = −D_k∇f(x_k) approaches asymptotically the Newton direction, i.e.,

    lim_{k→∞} ‖d_k + ( ∇²f(x*) )⁻¹∇f(x_k)‖ / ‖∇f(x_k)‖ = 0,

and the Armijo rule is used with initial stepsize equal to one, the rate of convergence is superlinear.
• Convergence rate to a singular local min is typically sublinear (in effect, condition number = ∞)
6.252 NONLINEAR PROGRAMMING
LECTURE 6
NEWTON AND GAUSS-NEWTON METHODS
LECTURE OUTLINE
Newton's Method
Convergence Rate of the Pure Form
Global Convergence Variants of Newton's Method
Least Squares Problems
The Gauss-Newton Method
NEWTON'S METHOD

    x_{k+1} = x_k − α_k ( ∇²f(x_k) )⁻¹ ∇f(x_k)

assuming that the Newton direction is defined and is a direction of descent

• Pure form of Newton's method (stepsize α_k = 1):

    x_{k+1} = x_k − ( ∇²f(x_k) )⁻¹ ∇f(x_k)

  − Very fast when it converges (how fast?)
  − May not converge (or worse, it may not be defined) when started far from a nonsingular local min
  − Issue: How to modify the method so that it converges globally, while maintaining the fast convergence rate
CONVERGENCE RATE OF PURE FORM
• Consider solution of nonlinear system g(x) = 0 where g : ℝⁿ → ℝⁿ, with method

    x_{k+1} = x_k − ( ∇g(x_k)' )⁻¹ g(x_k)

• If g(x) = ∇f(x), we get pure form of Newton
• Quick derivation: Suppose x_k → x* with g(x*) = 0 and ∇g(x*) is invertible. By Taylor

    0 = g(x*) = g(x_k) + ∇g(x_k)'(x* − x_k) + o(‖x_k − x*‖).

Multiply with ( ∇g(x_k)' )⁻¹:

    x_k − x* − ( ∇g(x_k)' )⁻¹ g(x_k) = o(‖x_k − x*‖),

so

    ‖x_{k+1} − x*‖ = o(‖x_k − x*‖),

implying superlinear convergence and capture.
CONVERGENCE BEHAVIOR OF PURE FORM
• Example: g(x) = eˣ − 1, pure Newton iteration starting at x₀ = −1:

    k    x_k         g(x_k)
    0    −1.00000    −0.63212
    1     0.71828     1.05091
    2     0.20587     0.22859
    3     0.01981     0.02000
    4     0.00019     0.00019
    5     0.00000     0.00000

[Figure: graph of g(x) = eˣ − 1 and the Newton iterates x₀, x₁, x₂, x₃.]
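The table can be reproduced in a few lines: for g(x) = eˣ − 1 the Newton step x_{k+1} = x_k − g(x_k)/g'(x_k) simplifies to x_{k+1} = x_k − 1 + e^{−x_k}.

```python
import math

# Pure Newton iteration for g(x) = e^x - 1, starting at x0 = -1.
x = -1.0
iterates = []
for _ in range(5):
    x = x - 1.0 + math.exp(-x)   # x - g(x)/g'(x) with g'(x) = e^x
    iterates.append(x)

print([round(v, 5) for v in iterates])
# → [0.71828, 0.20587, 0.01981, 0.00019, 0.0]
```

The error roughly squares at each step (x_{k+1} ≈ x_k²/2 for small x_k), the superlinear behavior derived on the previous slide.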
MODIFICATIONS FOR GLOBAL CONVERGENCE
• Use a stepsize
• Modify the Newton direction when:
  − Hessian is not positive definite
  − When Hessian is nearly singular (needed to improve performance)
• Use

    d_k = −( ∇²f(x_k) + Δ_k )⁻¹ ∇f(x_k),

whenever the Newton direction does not exist or is not a descent direction. Here Δ_k is a diagonal matrix such that

    ∇²f(x_k) + Δ_k > 0

  − Modified Cholesky factorization
  − Trust region methods
LEAST-SQUARES PROBLEMS
    minimize f(x) = (1/2)‖g(x)‖² = (1/2) Σ_{i=1}^m ‖g_i(x)‖²
    subject to x ∈ ℝⁿ,

where g = (g₁, . . . , g_m), g_i : ℝⁿ → ℝ^{r_i}.

• Many applications:
  − Solution of systems of n nonlinear equations with n unknowns
  − Model construction − curve fitting
  − Neural networks
  − Pattern classification
PURE FORM OF THE GAUSS-NEWTON METHOD
• Idea: Linearize around the current point x_k,

    g̃(x, x_k) = g(x_k) + ∇g(x_k)'(x − x_k),

and minimize the norm of the linearized function g̃:

    x_{k+1} = arg min_{x ∈ ℝⁿ} (1/2)‖g̃(x, x_k)‖²
            = x_k − ( ∇g(x_k)∇g(x_k)' )⁻¹ ∇g(x_k)g(x_k)

• The direction

    −( ∇g(x_k)∇g(x_k)' )⁻¹ ∇g(x_k)g(x_k)

is a descent direction since

    ∇g(x_k)g(x_k) = ∇( (1/2)‖g(x)‖² )|_{x=x_k},   ∇g(x_k)∇g(x_k)' > 0
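A minimal numerical sketch of the pure Gauss-Newton iteration, written with the Jacobian J = ∇g' and an assumed exponential-fit problem with noiseless synthetic data (so an exact fit exists):

```python
import numpy as np

# Gauss-Newton sketch for min (1/2)||g(a)||^2: fit z = exp(a*t) to data,
# with residuals g_i(a) = exp(a*t_i) - z_i. Data and a_true are assumed.
t = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
a_true = 0.5
z = np.exp(a_true * t)                       # noiseless data

a = 0.0                                      # starting guess
for _ in range(10):
    g = np.exp(a * t) - z                    # residual vector g(a)
    J = (t * np.exp(a * t)).reshape(-1, 1)   # Jacobian dg/da, m x 1
    # Pure Gauss-Newton step: a <- a - (J'J)^{-1} J' g
    a = a - np.linalg.solve(J.T @ J, J.T @ g).item()

print(abs(a - a_true) < 1e-8)  # True
```

Because the residual at the solution is zero, the pure Gauss-Newton iteration converges fast here; for large-residual problems the modifications on the next slide become important.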
MODIFICATIONS OF THE GAUSS-NEWTON
• Similar to those for Newton's method:

    x_{k+1} = x_k − α_k ( ∇g(x_k)∇g(x_k)' + Δ_k )⁻¹ ∇g(x_k)g(x_k),

where α_k is a stepsize and Δ_k is a diagonal matrix such that

    ∇g(x_k)∇g(x_k)' + Δ_k > 0

• Incremental version of the Gauss-Newton method:
  − Operate in cycles
  − Start a cycle with ψ₀ (an estimate of x)
  − Update using a single component of g:

    ψ_i = arg min_{x ∈ ℝⁿ} Σ_{j=1}^i ‖g̃_j(x, ψ_{j−1})‖²,  i = 1, . . . , m,

where g̃_j are the linearized functions

    g̃_j(x, ψ_{j−1}) = g_j(ψ_{j−1}) + ∇g_j(ψ_{j−1})'(x − ψ_{j−1})
MODEL CONSTRUCTION
• Given set of m input-output data pairs (y_i, z_i), i = 1, . . . , m, from the physical system
• Hypothesize an input/output relation z = h(x, y), where x is a vector of unknown parameters, and h is known
• Find x that matches best the data in the sense that it minimizes the sum of squared errors

    (1/2) Σ_{i=1}^m ‖z_i − h(x, y_i)‖²

• Example of a linear model: Fit the data pairs by a cubic polynomial approximation. Take

    h(x, y) = x₃y³ + x₂y² + x₁y + x₀,

where x = (x₀, x₁, x₂, x₃) is the vector of unknown coefficients of the cubic polynomial.
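Because h is linear in the coefficients x, this fit is an ordinary linear least-squares problem. A sketch with assumed synthetic data:

```python
import numpy as np

# Fit a cubic h(x, y) = x3*y^3 + x2*y^2 + x1*y + x0 to data pairs
# (y_i, z_i) by linear least squares. true_x is assumed for illustration.
y = np.linspace(-1.0, 1.0, 20)
true_x = np.array([1.0, -2.0, 0.5, 3.0])     # [x0, x1, x2, x3]
z = true_x[0] + true_x[1] * y + true_x[2] * y**2 + true_x[3] * y**3

A = np.vander(y, 4, increasing=True)         # columns: 1, y, y^2, y^3
x_fit, *_ = np.linalg.lstsq(A, z, rcond=None)

print(np.allclose(x_fit, true_x))  # noiseless data: coefficients recovered
```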
NEURAL NETS
• Nonlinear model construction with multilayer perceptrons
• x is the vector of weights
• Universal approximation property
PATTERN CLASSIFICATION
• Objects are presented to us, and we wish to classify them in one of s categories 1, . . . , s, based on a vector y of their features.
• Classical maximum posterior probability approach: Assume we know

    p(j|y) = P(object with feature vector y is of category j).

Assign an object with feature vector y to category

    j*(y) = arg max_{j=1,...,s} p(j|y).

• If p(j|y) are unknown, we can estimate them using functions h_j(x_j, y) parameterized by vectors x_j. Obtain x_j by minimizing

    (1/2) Σ_{i=1}^m ( z_i^j − h_j(x_j, y_i) )²,

where

    z_i^j = 1 if y_i is of category j, 0 otherwise.
6.252 NONLINEAR PROGRAMMING
LECTURE 7: ADDITIONAL METHODS
LECTURE OUTLINE
Least-Squares Problems and Incremental Gradient Methods
Conjugate Direction Methods
The Conjugate Gradient Method
Quasi-Newton Methods
Coordinate Descent Methods
• Recall the least-squares problem:

    minimize f(x) = (1/2)‖g(x)‖² = (1/2) Σ_{i=1}^m ‖g_i(x)‖²
    subject to x ∈ ℝⁿ,

where g = (g₁, . . . , g_m), g_i : ℝⁿ → ℝ^{r_i}.
INCREMENTAL GRADIENT METHODS
• Steepest descent method:

    x_{k+1} = x_k − α_k∇f(x_k) = x_k − α_k Σ_{i=1}^m ∇g_i(x_k)g_i(x_k)

• Incremental gradient method:

    ψ_i = ψ_{i−1} − α_k ∇g_i(ψ_{i−1})g_i(ψ_{i−1}),  i = 1, . . . , m,

    ψ₀ = x_k,   x_{k+1} = ψ_m

[Figure: one-dimensional example with components (a_i x − b_i)²; outside the interval [min_i b_i/a_i, max_i b_i/a_i] containing x*, every component gradient points toward the minimum.]

• Advantage of incrementalism
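A one-dimensional sketch of the incremental gradient method for f(x) = (1/2)Σ_i (a_i x − b_i)², with assumed data and a diminishing stepsize (the scaling constant 0.1 is an illustrative tuning choice):

```python
import numpy as np

# Incremental gradient: one component g_i(x) = a_i*x - b_i per inner
# step, cycling through the components, with a diminishing stepsize.
a = np.array([1.0, 2.0, 3.0])       # assumed data
b = np.array([2.0, 2.0, 3.0])
x_star = (a @ b) / (a @ a)          # exact minimizer of the full cost

x = 0.0
for k in range(2000):               # 2000 cycles
    step = 0.1 / (k + 1)            # diminishing stepsize
    for ai, bi in zip(a, b):        # one pass over the components
        x = x - step * ai * (ai * x - bi)

print(abs(x - x_star) < 1e-3)       # converges to the true minimizer
```

With a constant stepsize the iterates would instead settle into a neighborhood of x*, as the next slide explains via the stepsize-proportional error term.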
VIEW AS GRADIENT METHOD W/ ERRORS
• Can write incremental gradient method as

    x_{k+1} = x_k − α_k Σ_{i=1}^m ∇g_i(x_k)g_i(x_k)
              + α_k Σ_{i=1}^m ( ∇g_i(x_k)g_i(x_k) − ∇g_i(ψ_{i−1})g_i(ψ_{i−1}) )

• Error term is proportional to stepsize α_k
• Convergence (generically) for a diminishing stepsize (under a Lipschitz condition on ∇g_i g_i)
• Convergence to a neighborhood for a constant stepsize
CONJUGATE DIRECTION METHODS
• Aim to improve convergence rate of steepest descent, without the overhead of Newton's method.
• Analyzed for a quadratic model. They require n iterations to minimize f(x) = (1/2)x'Qx − b'x with Q an n × n positive definite matrix Q > 0.
• Analysis also applies to nonquadratic problems in the neighborhood of a nonsingular local min.
• The directions d₁, . . . , d_k are Q-conjugate if d_i'Qd_j = 0 for all i ≠ j.
• Generic conjugate direction method:

    x_{k+1} = x_k + α_k d_k,

where α_k is obtained by line minimization.

[Figure: expanding subspace theorem; iterates x₀, x₁, x₂ and conjugate directions d₀ = Q^{−1/2}w₀, d₁ = Q^{−1/2}w₁ obtained from orthogonal directions w₀, w₁.]
GENERATING Q-CONJUGATE DIRECTIONS
• Given a set of linearly independent vectors ξ₀, . . . , ξ_k, we can construct a set of Q-conjugate directions d₀, . . . , d_k s.t. Span(d₀, . . . , d_i) = Span(ξ₀, . . . , ξ_i)
• Gram-Schmidt procedure. Start with d₀ = ξ₀. If for some i < k, d₀, . . . , d_i are Q-conjugate and the above property holds, take

    d_{i+1} = ξ_{i+1} + Σ_{m=0}^i c^{(i+1)m} d_m;

choose c^{(i+1)m} so d_{i+1} is Q-conjugate to d₀, . . . , d_i:

    d_{i+1}'Qd_j = ξ_{i+1}'Qd_j + ( Σ_{m=0}^i c^{(i+1)m} d_m )'Qd_j = 0.

[Figure: d₁ = ξ₁ + c¹⁰d₀ and d₂ = ξ₂ + c²⁰d₀ + c²¹d₁.]
CONJUGATE GRADIENT METHOD
• Apply Gram-Schmidt to the vectors ξ_k = −g_k = −∇f(x_k), k = 0, 1, . . . , n − 1. Then

    d_k = −g_k + Σ_{j=0}^{k−1} ( g_k'Qd_j / d_j'Qd_j ) d_j

• Key fact: Direction formula can be simplified.
• Proposition: The directions of the CGM are generated by d₀ = −g₀, and

    d_k = −g_k + β_k d_{k−1},  k = 1, . . . , n − 1,

where β_k is given by

    β_k = g_k'g_k / g_{k−1}'g_{k−1}   or   β_k = (g_k − g_{k−1})'g_k / g_{k−1}'g_{k−1}

• Furthermore, the method terminates with an optimal solution after at most n steps.
• Extension to nonquadratic problems.
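The proposition can be checked numerically. A sketch of the conjugate gradient method using the first formula for β_k, on a randomly generated positive definite Q (assumed test data); for a quadratic the exact line-minimization stepsize is available in closed form:

```python
import numpy as np

# Conjugate gradient for f(x) = (1/2) x'Qx - b'x with Q > 0:
# d_0 = -g_0, then d_k = -g_k + beta_k d_{k-1}. Terminates in <= n steps.
n = 4
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
Q = A @ A.T + n * np.eye(n)           # symmetric positive definite (assumed)
b = rng.standard_normal(n)

x = np.zeros(n)
g = Q @ x - b                         # gradient g_0
d = -g
for k in range(n):                    # at most n iterations
    alpha = (g @ g) / (d @ Q @ d)     # exact minimizing stepsize
    x = x + alpha * d
    g_new = Q @ x - b
    beta = (g_new @ g_new) / (g @ g)  # beta_k = g_k'g_k / g_{k-1}'g_{k-1}
    d = -g_new + beta * d
    g = g_new

print(np.linalg.norm(Q @ x - b) < 1e-8)  # gradient ~ 0 after n steps
```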
PROOF OF CONJUGATE GRADIENT RESULT
• Use induction to show that all gradients g_k generated up to termination are linearly independent. True for k = 1. Suppose no termination after k steps, and g₀, . . . , g_{k−1} are linearly independent. Then Span(d₀, . . . , d_{k−1}) = Span(g₀, . . . , g_{k−1}) and there are two possibilities:
  − g_k = 0, and the method terminates.
  − g_k ≠ 0, in which case from the expanding manifold property
      g_k is orthogonal to d₀, . . . , d_{k−1},
      g_k is orthogonal to g₀, . . . , g_{k−1},
    so g_k is linearly independent of g₀, . . . , g_{k−1}, completing the induction.
• Since at most n linearly independent gradients can be generated, g_k = 0 for some k ≤ n.
• Algebra to verify the direction formula.
QUASI-NEWTON METHODS
• x_{k+1} = x_k − α_k D_k∇f(x_k), where D_k is an inverse Hessian approximation.
• Key idea: Successive iterates x_k, x_{k+1} and gradients ∇f(x_k), ∇f(x_{k+1}) yield curvature info:

    q_k ≈ ∇²f(x_{k+1}) p_k,
    p_k = x_{k+1} − x_k,   q_k = ∇f(x_{k+1}) − ∇f(x_k),

    ∇²f(x_n) ≈ [ q₀ · · · q_{n−1} ][ p₀ · · · p_{n−1} ]⁻¹

• Most popular Quasi-Newton method is a clever way to implement this idea:

    D_{k+1} = D_k + p_k p_k'/(p_k'q_k) − D_k q_k q_k'D_k/(q_k'D_k q_k) + ξ_k τ_k v_k v_k',

    v_k = p_k/(p_k'q_k) − D_k q_k/τ_k,   τ_k = q_k'D_k q_k,   0 ≤ ξ_k ≤ 1,

and D₀ > 0 is arbitrary, α_k by line minimization, and D_n = Q⁻¹ for a quadratic.
NONDERIVATIVE METHODS
• Finite difference implementations
• Forward and central difference formulas:

    ∂f(x_k)/∂x_i ≈ (1/h)( f(x_k + h e_i) − f(x_k) )

    ∂f(x_k)/∂x_i ≈ (1/(2h))( f(x_k + h e_i) − f(x_k − h e_i) )

• Use central difference for more accuracy near convergence
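The accuracy gap between the two formulas is easy to see numerically: the forward difference has O(h) error, the central difference O(h²). A sketch using f(x) = sin x (an illustrative choice):

```python
import math

# Forward vs central differences for f(x) = sin(x) at x = 1,
# compared against the exact derivative cos(1).
f, x, h = math.sin, 1.0, 1e-4
exact = math.cos(x)

forward = (f(x + h) - f(x)) / h               # O(h) error
central = (f(x + h) - f(x - h)) / (2 * h)     # O(h^2) error

print(abs(central - exact) < abs(forward - exact))  # True
```

With h = 1e-4 the forward error is around 4e-5 while the central error is near 1e-9, which is why the central formula is preferred close to convergence, at the cost of one extra function evaluation per coordinate.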
[Figure: coordinate descent iterates x_k, x_{k+1}, x_{k+2}.]
• Coordinate descent. Applies also to the case where there are bound constraints on the variables.
• Direct search methods. Nelder-Mead method.
6.252 NONLINEAR PROGRAMMING
LECTURE 8
OPTIMIZATION OVER A CONVEX SET;
OPTIMALITY CONDITIONS
• Problem: min_{x∈X} f(x), where:
(a) X ⊂ ℝⁿ is nonempty, convex, and closed.
(b) f is continuously differentiable over X.
• Local and global minima. If f is convex, local minima are also global.

[Figure: local minima and the global minimum of f over a constraint set X.]
OPTIMALITY CONDITION
Proposition (Optimality Condition)

(a) If x* is a local minimum of f over X, then

    ∇f(x*)'(x − x*) ≥ 0,  ∀ x ∈ X.

(b) If f is convex over X, then this condition is also sufficient for x* to minimize f over X.

[Figure: at a local minimum x*, the gradient ∇f(x*) makes an angle less than or equal to 90 degrees with all feasible variations x − x*, x ∈ X.]

[Figure: illustration of failure of the optimality condition when X is not convex; here x* is a local min but ∇f(x*)'(x − x*) < 0 for the feasible vector x shown.]
PROOF
Proof: (a) By contradiction. Suppose that ∇f(x*)'(x − x*) < 0 for some x ∈ X. By the Mean Value Theorem, for every ε > 0 there exists an s ∈ [0, 1] such that

    f( x* + ε(x − x*) ) = f(x*) + ε∇f( x* + sε(x − x*) )'(x − x*).

Since ∇f is continuous, for sufficiently small ε > 0,

    ∇f( x* + sε(x − x*) )'(x − x*) < 0,

so that f( x* + ε(x − x*) ) < f(x*). The vector x* + ε(x − x*) is feasible for all ε ∈ [0, 1] because X is convex, so the optimality of x* is contradicted.

(b) Using the convexity of f,

    f(x) ≥ f(x*) + ∇f(x*)'(x − x*)

for every x ∈ X. If the condition ∇f(x*)'(x − x*) ≥ 0 holds for all x ∈ X, we obtain f(x) ≥ f(x*), so x* minimizes f over X. Q.E.D.
OPTIMIZATION SUBJECT TO BOUNDS
• Let X = { x | x ≥ 0 }. Then the necessary condition for x* = (x₁*, . . . , x_n*) to be a local min is

    Σ_{i=1}^n ( ∂f(x*)/∂x_i )(x_i − x_i*) ≥ 0,  ∀ x_i ≥ 0, i = 1, . . . , n.

• Fix i. Let x_j = x_j* for j ≠ i and x_i = x_i* + 1:

    ∂f(x*)/∂x_i ≥ 0,  ∀ i.

• If x_i* > 0, let also x_j = x_j* for j ≠ i and x_i = (1/2)x_i*. Then ∂f(x*)/∂x_i ≤ 0, so

    ∂f(x*)/∂x_i = 0,  if x_i* > 0.

[Figure: the two cases x_i* > 0 (gradient component zero) and x_i* = 0 (gradient component nonnegative).]
OPTIMIZATION OVER A SIMPLEX
    X = { x | x ≥ 0, Σ_{i=1}^n x_i = r },

where r > 0 is a given scalar.

• Necessary condition for x* = (x₁*, . . . , x_n*) to be a local min:

    Σ_{i=1}^n ( ∂f(x*)/∂x_i )(x_i − x_i*) ≥ 0,  ∀ x_i ≥ 0 with Σ_{i=1}^n x_i = r.

• Fix i with x_i* > 0 and let j be any other index. Use x with x_i = 0, x_j = x_j* + x_i*, and x_m = x_m* for all m ≠ i, j:

    ( ∂f(x*)/∂x_j − ∂f(x*)/∂x_i ) x_i* ≥ 0,

so

    x_i* > 0  ⇒  ∂f(x*)/∂x_i ≤ ∂f(x*)/∂x_j,  ∀ j,

i.e., at the optimum, positive components have minimal (and equal) first cost derivative.
OPTIMAL ROUTING
• Given a data net, and a set W of OD pairs w = (i, j). Each OD pair w has input traffic r_w.

[Figure: origin and destination of OD pair w, with input traffic r_w split over path flows x₁, x₂, x₃, x₄.]

• Optimal routing problem:

    minimize D(x) = Σ_{(i,j)} D_{ij}( Σ_{all paths p containing (i,j)} x_p )
    subject to Σ_{p∈P_w} x_p = r_w,  ∀ w ∈ W,
               x_p ≥ 0,  ∀ p ∈ P_w, w ∈ W

• Optimality condition:

    x_p* > 0  ⇒  ∂D(x*)/∂x_p ≤ ∂D(x*)/∂x_{p'},  ∀ p' ∈ P_w,

i.e., paths carrying > 0 flow are shortest with respect to first cost derivative.
TRAFFIC ASSIGNMENT
• Transportation network with OD pairs w. Each w has paths p ∈ P_w and traffic r_w. Let x_p be the flow of path p and let

    T_{ij}( F_{ij} ),   F_{ij} = Σ_{p: crossing (i,j)} x_p,

be the travel time of link (i, j).

• User-optimization principle: Traffic equilibrium is established when each user of the network chooses, among all available paths, a path of minimum travel time, i.e., for all w ∈ W and paths p ∈ P_w,

    x_p* > 0  ⇒  t_p(x*) ≤ t_{p'}(x*),  ∀ p' ∈ P_w, w ∈ W,

where t_p(x) is the travel time of path p:

    t_p(x) = Σ_{all arcs (i,j) on path p} T_{ij}(F_{ij}),  ∀ p ∈ P_w, w ∈ W.

• Identical with the optimality condition of the routing problem if we identify the arc travel time T_{ij}(F_{ij}) with the cost derivative D'_{ij}(F_{ij}).
PROJECTION OVER A CONVEX SET
• Let z ∈ ℝⁿ and a closed convex set X be given. Problem:

    minimize f(x) = ‖z − x‖²
    subject to x ∈ X.

• Proposition (Projection Theorem): The problem has a unique solution [z]⁺ (the projection of z).

[Figure: necessary and sufficient condition for x* to be the projection; the angle between z − x* and x − x* should be greater or equal to 90 degrees for all x ∈ X, or (z − x*)'(x − x*) ≤ 0.]

• If X is a subspace, z − x* ⊥ X.
• The mapping f : ℝⁿ → X defined by f(x) = [x]⁺ is continuous and nonexpansive, that is,

    ‖[x]⁺ − [y]⁺‖ ≤ ‖x − y‖,  ∀ x, y ∈ ℝⁿ.
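Both claims are easy to test for the box X = { x | ℓ ≤ x ≤ u }, where the projection is componentwise clipping (a standard special case, used here as an assumed example):

```python
import numpy as np

# Projection onto the box X = {x : l <= x <= u} is componentwise
# clipping; check that the mapping is nonexpansive:
# ||[x]+ - [y]+|| <= ||x - y||.
l, u = np.array([0.0, 0.0]), np.array([1.0, 1.0])   # assumed box
proj = lambda z: np.clip(z, l, u)

rng = np.random.default_rng(1)
ok = True
for _ in range(100):
    x = rng.standard_normal(2) * 3
    y = rng.standard_normal(2) * 3
    ok &= np.linalg.norm(proj(x) - proj(y)) <= np.linalg.norm(x - y) + 1e-12
print(ok)  # nonexpansiveness holds on all random pairs
```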
6.252 NONLINEAR PROGRAMMING
LECTURE 9: FEASIBLE DIRECTION METHODS
LECTURE OUTLINE
Conditional Gradient Method
Gradient Projection Methods
A feasible direction at an x X is a vector d = 0such that x + d is feasible for all suff. small > 0
x1
x2
d
Constraint set X
Feasibledirections at x
x
Note: the set of feasible directions at x is theset of all (z x) where z X, z = x, and > 0
FEASIBLE DIRECTION METHODS
• A feasible direction method:

    x_{k+1} = x_k + α_k d_k,

where d_k is a feasible descent direction [∇f(x_k)'d_k < 0] and α_k > 0 is such that x_{k+1} ∈ X.

• Alternative definition:

    x_{k+1} = x_k + α_k(x̄_k − x_k),

where α_k ∈ (0, 1] and, if x_k is nonstationary,

    x̄_k ∈ X,   ∇f(x_k)'(x̄_k − x_k) < 0.

• Stepsize rules: Limited minimization, Constant α_k = 1, Armijo: α_k = β^{m_k}, where m_k is the first nonnegative m for which

    f(x_k) − f( x_k + β^m(x̄_k − x_k) ) ≥ −σβ^m∇f(x_k)'(x̄_k − x_k).
CONVERGENCE ANALYSIS
• Similar to the one for (unconstrained) gradient methods.
• The direction sequence {d_k} is gradient related to {x_k} if the following property can be shown: For any subsequence {x_k}_{k∈K} that converges to a nonstationary point, the corresponding subsequence {d_k}_{k∈K} is bounded and satisfies

    lim sup_{k→∞, k∈K} ∇f(x_k)'d_k < 0.

• Proposition (Stationarity of Limit Points): Let {x_k} be a sequence generated by the feasible direction method x_{k+1} = x_k + α_k d_k. Assume that:
  − {d_k} is gradient related,
  − α_k is chosen by the limited minimization rule or the Armijo rule.
Then every limit point of {x_k} is a stationary point.

• Proof: Nearly identical to the unconstrained case.
CONDITIONAL GRADIENT METHOD
• x_{k+1} = x_k + α_k(x̄_k − x_k), where

    x̄_k = arg min_{x∈X} ∇f(x_k)'(x − x_k).

• We assume that X is compact, so x̄_k is guaranteed to exist by Weierstrass.

[Figure: direction of the conditional gradient method; x̄ minimizes the linearized cost over the constraint set X.]

[Figure: operation of the method; iterates x₀, x₁, x₂ and linearization minimizers x̄₀, x̄₁ over X, approaching x*. Slow (sublinear) convergence.]
CONVERGENCE OF CONDITIONAL GRADIENT
• Show that the direction sequence of the conditional gradient method is gradient related, so the generic convergence result applies.
• Suppose that {x_k}_{k∈K} converges to a nonstationary point. We must prove that

    {x̄_k − x_k}_{k∈K} : bounded,   lim sup_{k→∞, k∈K} ∇f(x_k)'(x̄_k − x_k) < 0

(involves the transformation y_k = (H_k)^{1/2}x). Since the minimum value above is negative when x_k is nonstationary, ∇f(x_k)'(x̄_k − x_k) < 0.

• Newton's method for H_k = ∇²f(x_k).
• Variants: Projecting on an expanded constraint set, projecting on a restricted constraint set, combinations with unconstrained methods, etc.
6.252 NONLINEAR PROGRAMMING
LECTURE 10
ALTERNATIVES TO GRADIENT PROJECTION
LECTURE OUTLINE
Three Alternatives/Remedies for Gradient Pro-jection
Two-Metric Projection Methods
Manifold Suboptimization Methods
Affine Scaling Methods
Scaled GP method with scaling matrix Hk > 0:

    xk+1 = xk + αk(x̄k − xk),

    x̄k = arg min_{x∈X} { ∇f(xk)'(x − xk) + (1/(2sk))(x − xk)'Hk(x − xk) }

The QP direction subproblem is complicated by:
- Difficult inequality (e.g., nonorthant) constraints
- Nondiagonal Hk, needed for Newton scaling
THREE WAYS TO DEAL W/ THE DIFFICULTY
Two-metric projection methods:

    xk+1 = [xk − αk Dk ∇f(xk)]+

- Use Newton-like scaling but a standard projection
- Suitable for bounds, simplexes, Cartesian products of simple sets, etc.
Manifold suboptimization methods:
- Use (scaled) gradient projection on the manifold of active inequality constraints
- Each QP subproblem is equality-constrained
- Need strategies to cope with the changing active manifold (add/drop constraints)

Affine scaling methods:
- Go through the interior of the feasible set
- Each QP subproblem is equality-constrained, AND we don't have to deal with the changing active manifold
TWO-METRIC PROJECTION METHODS
In their simplest form, these methods apply to the constraint x ≥ 0, but they generalize to bound and other constraints.

Like unconstrained gradient methods except for the projection [·]+:

    xk+1 = [xk − αk Dk ∇f(xk)]+, Dk > 0

Major difficulty: descent is not guaranteed for an arbitrary Dk.

[Figure: (a) a scaled step xk − αk Dk ∇f(xk) whose projection yields descent; (b) one whose projection does not.]

Remedy: use a Dk that is diagonal with respect to the indices that are active and want to stay active:

    I+(xk) = { i | xk_i = 0, ∂f(xk)/∂xi > 0 }
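A single two-metric projection step for x ≥ 0 is simple to write down. The sketch below uses a diagonal D (trivially diagonal with respect to I+(x), so descent is guaranteed); the quadratic objective is a made-up illustration, not from the slides:

```python
def two_metric_step(x, grad, D, alpha):
    """One iteration x+ = [x - alpha * D * grad]^+ for the constraint x >= 0,
    with a diagonal scaling matrix D given as a list of positive entries.
    A diagonal D always yields descent for small alpha, which motivates
    making D diagonal with respect to the active indices I+(x)."""
    return [max(0.0, xi - alpha * di * gi) for xi, di, gi in zip(x, D, grad)]

# min 0.5*(x1+1)^2 + 0.5*(x2-1)^2  s.t.  x >= 0;  solution x* = (0, 1)
grad = lambda x: [x[0] + 1.0, x[1] - 1.0]
x = [2.0, 0.0]
for _ in range(100):
    x = two_metric_step(x, grad(x), D=[1.0, 1.0], alpha=0.5)
```

The first coordinate hits its bound and stays there (it is in I+), while the second converges to the unconstrained minimizer 1.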
PROPERTIES OF 2-METRIC PROJECTION
Suppose Dk is diagonal with respect to I+(xk), i.e., d^k_ij = 0 for i, j ∈ I+(xk) with i ≠ j, and let

    xk(α) = [xk − α Dk ∇f(xk)]+

- If xk is stationary, xk = xk(α) for all α > 0.
- Otherwise f(xk(α)) < f(xk) for all sufficiently small α > 0 (can use the Armijo rule).

Because I+(x) is discontinuous with respect to x, to guarantee convergence we need to include in I+(x) the constraints that are ε-active [those with xk_i ∈ [0, ε] and ∂f(xk)/∂xi > 0].

The constraints ε-active at x̄ eventually become active and don't matter.

The method reduces to an unconstrained Newton-like method on the manifold of constraints active at x̄.

Thus, superlinear convergence is possible with simple projections.
MANIFOLD SUBOPTIMIZATION METHODS
Feasible direction methods for

    min f(x) subject to aj'x ≤ bj, j = 1, . . . , r

The gradient is projected on a linear manifold of active constraints rather than on the entire constraint set (linearly constrained QP).

[Figure: (a), (b) iterate sequences x0, x1, x2, . . . of manifold suboptimization methods moving along faces of the constraint set.]
Searches through a sequence of manifolds, each differing by at most one constraint from the next.

Potentially many iterations are needed to identify the active manifold; then the method reduces to (scaled) steepest descent on the active manifold.

Well-suited for a small number of constraints, and for quadratic programming.
OPERATION OF MANIFOLD METHODS
Let A(x) = { j | aj'x = bj } be the active index set at x. Given xk, we find

    dk = arg min_{aj'd = 0, j∈A(xk)} { ∇f(xk)'d + ½ d'Hk d }

- If dk ≠ 0, then dk is a feasible descent direction. Perform feasible descent on the current manifold.
- If dk = 0, either (1) xk is stationary or (2) we enlarge the current manifold (drop an active constraint). For this, use the scalars μj such that

    ∇f(xk) + Σ_{j∈A(xk)} μj aj = 0
[Figure: geometry of −∇f(xk) = μ1 a1 + μ2 a2 at a point xk on the boundary of X, with μ1 < 0 and μ2 > 0; the constraints a1'x = b1 and a2'x = b2 are active.]

- If μj ≥ 0 for all j, xk is stationary, since for all feasible x, ∇f(xk)'(x − xk) is equal to −Σ_{j∈A(xk)} μj aj'(x − xk) ≥ 0.
- Else, drop a constraint j with μj < 0.
AFFINE SCALING
Particularly interesting choice (affine scaling):

    Hk = (Xk)^{-2},

where Xk is the diagonal matrix having the (positive) coordinates xk_i along the diagonal:

    xk+1 = xk − αk (Xk)²(c − A'λk), λk = (A(Xk)²A')^{-1} A(Xk)² c

Corresponds to the unscaled gradient projection iteration in the variables y = (Xk)^{-1}x. The vector xk is mapped onto the unit vector yk = (1, . . . , 1).

[Figure: the affine scaling iteration shown in the x-space (xk, xk+1, x*) and in the scaled y-space, with yk = (Xk)^{-1}xk = (1,1,1) and y* = (Xk)^{-1}x*.]
Extensions, convergence, practical issues.
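The iteration above can be sketched directly. A minimal NumPy illustration on a tiny LP; the fixed fraction-to-the-boundary stepsize is a common practical choice, not something specified on the slides:

```python
import numpy as np

def affine_scaling_step(x, A, c, step=0.5):
    """One affine-scaling iteration for min c'x s.t. Ax = b, x > 0:
    lam = (A X^2 A')^{-1} A X^2 c, then move along d = -X^2 (c - A' lam),
    going a fixed fraction `step` of the distance to the boundary."""
    X2 = np.diag(x ** 2)
    lam = np.linalg.solve(A @ X2 @ A.T, A @ X2 @ c)
    d = -X2 @ (c - A.T @ lam)          # feasible direction: A d = 0
    neg = d < 0
    t = step * np.min(-x[neg] / d[neg]) if neg.any() else 1.0
    return x + t * d

# min x1  s.t.  x1 + x2 = 1, x >= 0;  solution x* = (0, 1)
A = np.array([[1.0, 1.0]])
c = np.array([1.0, 0.0])
x = np.array([0.5, 0.5])
for _ in range(60):
    x = affine_scaling_step(x, A, c)
```

The iterates stay strictly inside the feasible set while x1 is driven toward its bound, illustrating "going through the interior".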
6.252 NONLINEAR PROGRAMMING
LECTURE 11
CONSTRAINED OPTIMIZATION;
LAGRANGE MULTIPLIERS
LECTURE OUTLINE
Equality Constrained Problems
Basic Lagrange Multiplier Theorem
Proof 1: Elimination Approach
Proof 2: Penalty Approach
Equality constrained problem

    minimize f(x)
    subject to hi(x) = 0, i = 1, . . . , m,

where f : ℝⁿ → ℝ and hi : ℝⁿ → ℝ, i = 1, . . . , m, are continuously differentiable functions. (The theory also applies to the case where f and the hi are continuously differentiable in a neighborhood of a local minimum.)
LAGRANGE MULTIPLIER THEOREM
Let x* be a local min and a regular point [i.e., the ∇hi(x*) are linearly independent]. Then there exist unique scalars λ1*, . . . , λm* such that

    ∇f(x*) + Σ_{i=1}^m λi* ∇hi(x*) = 0.

If in addition f and h are twice continuously differentiable,

    y'( ∇²f(x*) + Σ_{i=1}^m λi* ∇²hi(x*) ) y ≥ 0, for all y s.t. ∇h(x*)'y = 0
[Figure: minimize x1 + x2 subject to x1² + x2² = 2. Here x* = (−1,−1), ∇h(x*) = (−2,−2), ∇f(x*) = (1,1). The Lagrange multiplier is λ* = 1/2.]

[Figure: minimize x1 + x2 subject to (x1 − 1)² + x2² = 1 and (x1 − 2)² + x2² = 4. Here x* = (0,0) with ∇f(x*) = (1,1), ∇h1(x*) = (−2,0), ∇h2(x*) = (−4,0), so x* is not regular.]
PROOF VIA ELIMINATION APPROACH
Consider the linear constraints case

    minimize f(x)
    subject to Ax = b,

where A is an m × n matrix with linearly independent rows and b ∈ ℝᵐ is a given vector.

Partition A = ( B R ), where B is m × m invertible, and correspondingly x = ( xB, xR )'. Equivalent problem:

    minimize F(xR) ≡ f( B^{-1}(b − R xR), xR )
    subject to xR ∈ ℝ^{n−m}.

Unconstrained optimality condition:

    0 = ∇F(xR*) = −R'(B')^{-1} ∇B f(x*) + ∇R f(x*)   (1)

By defining

    λ* = −(B')^{-1} ∇B f(x*),

we have ∇B f(x*) + B'λ* = 0, while Eq. (1) is written ∇R f(x*) + R'λ* = 0. Combining:

    ∇f(x*) + A'λ* = 0
ELIMINATION APPROACH - CONTINUED
Second order condition: For all d ∈ ℝ^{n−m},

    0 ≤ d'∇²F(xR*)d = d'∇²[ f( B^{-1}(b − R xR), xR ) ]d.   (2)

After calculation we obtain

    ∇²F(xR*) = R'(B')^{-1} ∇²BB f(x*) B^{-1}R − R'(B')^{-1} ∇²BR f(x*) − ∇²RB f(x*) B^{-1}R + ∇²RR f(x*)

Eq. (2) and the linearity of the constraints [implying that ∇²hi(x) = 0] yield, for all d ∈ ℝ^{n−m},

    0 ≤ d'∇²F(xR*)d = y'∇²f(x*)y = y'( ∇²f(x*) + Σ_{i=1}^m λi* ∇²hi(x*) ) y,

where y = ( yB, yR ) = ( −B^{-1}Rd, d ). y has this form iff

    0 = ByB + RyR = ∇h(x*)'y.
LAGRANGIAN FUNCTION
Define the Lagrangian function

    L(x, λ) = f(x) + Σ_{i=1}^m λi hi(x).

Then, if x* is a local minimum which is regular, the Lagrange multiplier conditions are written

    ∇x L(x*, λ*) = 0, ∇λ L(x*, λ*) = 0,

a system of n + m equations with n + m unknowns. Also

    y'∇²xx L(x*, λ*) y ≥ 0, for all y s.t. ∇h(x*)'y = 0.

Example:

    minimize ½( x1² + x2² + x3² )
    subject to x1 + x2 + x3 = 3.

Necessary conditions:

    x1* + λ* = 0, x2* + λ* = 0,
    x3* + λ* = 0, x1* + x2* + x3* = 3.
EXAMPLE - PORTFOLIO SELECTION
Investment of 1 unit of wealth among n assets with random rates of return ei, given means ēi, and covariance matrix Q = [ E{(ei − ēi)(ej − ēj)} ].

If xi is the amount invested in asset i, we want to

    minimize x'Qx ( = variance of the return Σi ei xi )
    subject to Σi xi = 1, and a given mean Σi ēi xi = m

Let λ1 and λ2 be the L-multipliers. We have 2Qx* + λ1 u + λ2 ē = 0, where u = (1, . . . , 1)' and ē = (ē1, . . . , ēn)'. This yields

    x* = mv + w, variance of the return = σ² = (αm + β)² + γ,

where v and w are vectors, and α, β, and γ are scalars that depend on Q and ē.

[Figure: for a given m, the optimal (σ, m) pair lies on a line σ = αm + β, called the efficient frontier.]
SUFFICIENCY CONDITIONS
Second Order Sufficiency Conditions: Let x* ∈ ℝⁿ and λ* ∈ ℝᵐ satisfy

    ∇x L(x*, λ*) = 0, ∇λ L(x*, λ*) = 0,
    y'∇²xx L(x*, λ*) y > 0, for all y ≠ 0 with ∇h(x*)'y = 0.

Then x* is a strict local minimum.

Example: Minimize −(x1x2 + x2x3 + x1x3) subject to x1 + x2 + x3 = 3. We have that x1* = x2* = x3* = 1 and λ* = 2 satisfy the 1st order conditions. Also

    ∇²xx L(x*, λ*) = [  0  −1  −1
                       −1   0  −1
                       −1  −1   0 ].

We have for all y ≠ 0 with ∇h(x*)'y = 0, i.e., y1 + y2 + y3 = 0,

    y'∇²xx L(x*, λ*) y = −y1(y2 + y3) − y2(y1 + y3) − y3(y1 + y2) = y1² + y2² + y3² > 0.

Hence, x* is a strict local minimum.
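The second-order check above can be verified numerically by restricting the Hessian to the constraint nullspace. A small NumPy sketch (the nullspace basis Z is one convenient choice, not from the slides):

```python
import numpy as np

# Hessian of the Lagrangian at x* = (1,1,1), lam* = 2 for
# min -(x1 x2 + x2 x3 + x1 x3)  s.t.  x1 + x2 + x3 = 3:
H = np.array([[0., -1., -1.],
              [-1., 0., -1.],
              [-1., -1., 0.]])
# Basis for the nullspace of grad h(x*)' = (1, 1, 1):
Z = np.array([[1., 0.],
              [0., 1.],
              [-1., -1.]])
# The reduced Hessian Z' H Z must be positive definite for the
# second-order sufficiency condition to hold.
reduced = Z.T @ H @ Z
eigvals = np.linalg.eigvalsh(reduced)
```

Here Z'HZ equals Z'Z, reflecting the identity y'∇²L y = y1² + y2² + y3² on the nullspace, and its eigenvalues are positive.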
A BASIC LEMMA
Lemma: Let P and Q be two symmetric matrices. Assume that Q ≥ 0 and P > 0 on the nullspace of Q, i.e., x'Px > 0 for all x ≠ 0 with x'Qx = 0. Then there exists a scalar c̄ such that

    P + cQ : positive definite, for all c > c̄.

Proof: Assume the contrary. Then for every k, there exists a vector xk with ||xk|| = 1 such that

    xk'P xk + k xk'Q xk ≤ 0.

Consider a subsequence {xk}k∈K converging to some x̄ with ||x̄|| = 1. Taking the limit superior,

    x̄'Px̄ + lim sup_{k→∞, k∈K} ( k xk'Q xk ) ≤ 0.   (*)

We have xk'Q xk ≥ 0 (since Q ≥ 0), so {xk'Q xk}k∈K → 0. Therefore, x̄'Qx̄ = 0 and, using the hypothesis, x̄'Px̄ > 0. This contradicts (*).
PROOF OF SUFFICIENCY CONDITIONS
Consider the augmented Lagrangian function

    Lc(x, λ) = f(x) + λ'h(x) + (c/2)||h(x)||²,

where c is a scalar. We have

    ∇x Lc(x, λ) = ∇x L(x, λ̃), ∇²xx Lc(x, λ) = ∇²xx L(x, λ̃) + c∇h(x)∇h(x)',

where λ̃ = λ + c h(x). If (x*, λ*) satisfy the sufficiency conditions, we have, using the lemma,

    ∇x Lc(x*, λ*) = 0, ∇²xx Lc(x*, λ*) > 0,

for sufficiently large c. Hence for some γ > 0 and ε > 0,

    Lc(x, λ*) ≥ Lc(x*, λ*) + (γ/2)||x − x*||², if ||x − x*|| < ε.

Since Lc(x, λ*) = f(x) when h(x) = 0,

    f(x) ≥ f(x*) + (γ/2)||x − x*||², if h(x) = 0, ||x − x*|| < ε.
SENSITIVITY - GRAPHICAL DERIVATION
[Figure: sensitivity theorem for the problem min_{a'x=b} f(x); if b is changed to b + Δb, the minimum x* changes to x* + Δx.]

Since b + Δb = a'(x* + Δx) = a'x* + a'Δx = b + a'Δx, we have a'Δx = Δb. Using the condition ∇f(x*) = −λ*a,

    Δcost = f(x* + Δx) − f(x*) = ∇f(x*)'Δx + o(||Δx||) = −λ* a'Δx + o(||Δx||)

Thus Δcost = −λ* Δb + o(||Δx||), so up to first order

    λ* = − Δcost / Δb.

For multiple constraints ai'x = bi, i = 1, . . . , m, we have

    Δcost = − Σ_{i=1}^m λi* Δbi + o(||Δx||).
EXAMPLE
[Figure: the primal function p(u) with slope ∇p(0) = −λ* = −1 at u = 0.]

Illustration of the primal function p(u) = f(x(u)) for the two-dimensional problem

    minimize f(x) = ½( x1² − x2² ) − x2
    subject to h(x) = x2 = 0.

Here,

    p(u) = min_{h(x)=u} f(x) = −½u² − u

and λ* = −∇p(0) = 1, consistently with the sensitivity theorem.

Need for regularity of x*: Change the constraint to h(x) = x2² = 0. Then p(u) = −u/2 − √u for u ≥ 0, and p is undefined for u < 0.
PROOF OUTLINE OF SENSITIVITY THEOREM
Apply the implicit function theorem to the system

    ∇f(x) + ∇h(x)λ = 0, h(x) = u.

For u = 0 the system has the solution (x*, λ*), and the corresponding (n + m) × (n + m) Jacobian

    J = [ ∇²f(x*) + Σ_{i=1}^m λi* ∇²hi(x*)   ∇h(x*)
          ∇h(x*)'                             0      ]

is shown nonsingular using the sufficiency conditions. Hence, for all u in some open sphere S centered at u = 0, there exist x(u) and λ(u) such that x(0) = x*, λ(0) = λ*, the functions x(·) and λ(·) are continuously differentiable, and

    ∇f( x(u) ) + ∇h( x(u) )λ(u) = 0, h( x(u) ) = u.

For u close to u = 0, using the sufficiency conditions, x(u) and λ(u) are a local minimum-Lagrange multiplier pair for the problem min_{h(x)=u} f(x).

To derive ∇p(u), differentiate h( x(u) ) = u to obtain I = ∇x(u)'∇h( x(u) ), and combine with the relations ∇x(u)'∇f( x(u) ) + ∇x(u)'∇h( x(u) )λ(u) = 0 and

    ∇p(u) = ∇u f( x(u) ) = ∇x(u)'∇f( x(u) ),

yielding ∇p(u) = −λ(u).
6.252 NONLINEAR PROGRAMMING
LECTURE 13: INEQUALITY CONSTRAINTS
LECTURE OUTLINE
Inequality Constrained Problems
Necessary Conditions
Sufficiency Conditions
Linear Constraints
Inequality constrained problem

    minimize f(x)
    subject to h(x) = 0, g(x) ≤ 0,

where f : ℝⁿ → ℝ, h : ℝⁿ → ℝᵐ, g : ℝⁿ → ℝʳ are continuously differentiable. Here

    h = (h1, . . . , hm)', g = (g1, . . . , gr)'.
TREATING INEQUALITIES AS EQUATIONS
Consider the set of active inequality constraints

    A(x) = { j | gj(x) = 0 }.

If x* is a local minimum:
- The active inequality constraints at x* can be treated as equations
- The inactive constraints at x* don't matter

Assuming regularity of x* and assigning zero Lagrange multipliers to inactive constraints,

    ∇f(x*) + Σ_{i=1}^m λi* ∇hi(x*) + Σ_{j=1}^r μj* ∇gj(x*) = 0,
    μj* = 0, for all j ∉ A(x*).

Extra property: μj* ≥ 0 for all j.

Intuitive reason: Relax the jth constraint, gj(x) ≤ uj. Since Δcost ≤ 0 if uj > 0, by the sensitivity theorem we have

    μj* = −(Δcost due to uj)/uj ≥ 0
BASIC RESULTS
Kuhn-Tucker Necessary Conditions: Let x* be a local minimum and a regular point. Then there exist unique Lagrange multiplier vectors λ* = (λ1*, . . . , λm*)' and μ* = (μ1*, . . . , μr*)' such that

    ∇x L(x*, λ*, μ*) = 0,
    μj* ≥ 0, j = 1, . . . , r,
    μj* = 0, for all j ∉ A(x*).

If f, h, and g are twice continuously differentiable,

    y'∇²xx L(x*, λ*, μ*) y ≥ 0, for all y ∈ V(x*),

where

    V(x*) = { y | ∇h(x*)'y = 0, ∇gj(x*)'y = 0, j ∈ A(x*) }.

Similar sufficiency conditions and sensitivity results hold. They require strict complementarity, i.e., μj* > 0 for all j ∈ A(x*), as well as regularity of x*.
PROOF OF KUHN-TUCKER CONDITIONS
Use the equality-constraints result to obtain all the conditions except for μj* ≥ 0 for j ∈ A(x*). Introduce the penalty functions

    gj+(x) = max{ 0, gj(x) }, j = 1, . . . , r,

and for k = 1, 2, . . ., let xk minimize

    f(x) + (k/2)||h(x)||² + (k/2) Σ_{j=1}^r ( gj+(x) )² + ½||x − x*||²

over a closed sphere centered at x* such that f(x*) ≤ f(x) for all feasible x in the sphere. Using the same argument as for equality constraints,

    λi* = lim_{k→∞} k hi(xk), i = 1, . . . , m,
    μj* = lim_{k→∞} k gj+(xk), j = 1, . . . , r.

Since gj+(xk) ≥ 0, we obtain μj* ≥ 0 for all j.
GENERAL SUFFICIENCY CONDITION
Consider the problem

    minimize f(x)
    subject to x ∈ X, gj(x) ≤ 0, j = 1, . . . , r.

Let x* be feasible and satisfy

    μj* ≥ 0, j = 1, . . . , r, μj* = 0 for all j ∉ A(x*),
    x* = arg min_{x∈X} L(x, μ*).

Then x* is a global minimum of the problem.

Proof: We have

    f(x*) = f(x*) + μ*'g(x*) = min_{x∈X} { f(x) + μ*'g(x) }
          ≤ min_{x∈X, g(x)≤0} { f(x) + μ*'g(x) }
          ≤ min_{x∈X, g(x)≤0} f(x),

where the first equality follows from the hypothesis, which implies that μ*'g(x*) = 0, and the last inequality follows from the nonnegativity of μ*. Q.E.D.

Special Case: Let X = ℝⁿ, and let f and the gj be convex and differentiable. Then the 1st order Kuhn-Tucker conditions are also sufficient for global optimality.
LINEAR CONSTRAINTS
Consider the problem min_{aj'x ≤ bj, j=1,...,r} f(x).

Remarkable property: No need for regularity.

Proposition: If x* is a local minimum, there exist μ1*, . . . , μr* with μj* ≥ 0, j = 1, . . . , r, such that

    ∇f(x*) + Σ_{j=1}^r μj* aj = 0, μj* = 0, for all j ∉ A(x*).

The proof uses Farkas' Lemma: Consider the cone C generated by the aj, j ∈ A(x*), and the polar cone C⊥:

    C = { x | x = Σ_{j=1}^r μj aj, μj ≥ 0 }, C⊥ = { y | aj'y ≤ 0, j = 1, . . . , r }.

Then (C⊥)⊥ = C, i.e.,

    x ∈ C iff x'y ≤ 0 for all y ∈ C⊥.
PROOF OF LAGRANGE MULTIPLIER RESULT
[Figure: the constraint set { x | aj'x ≤ bj, j = 1, . . . , r }, the cone C generated by the aj, j ∈ A(x*), and −∇f(x*) lying in that cone at x*.]

The local min x* of the original problem is also a local min for the problem min_{aj'x ≤ bj, j∈A(x*)} f(x). Hence

    ∇f(x*)'(x − x*) ≥ 0, for all x with aj'x ≤ bj, j ∈ A(x*).

Since a constraint aj'x ≤ bj, j ∈ A(x*), can also be expressed as aj'(x − x*) ≤ 0, we have

    ∇f(x*)'y ≥ 0, for all y with aj'y ≤ 0, j ∈ A(x*).

From Farkas' lemma, −∇f(x*) has the form

    Σ_{j∈A(x*)} μj* aj, for some μj* ≥ 0, j ∈ A(x*).

Let μj* = 0 for j ∉ A(x*).
6.252 NONLINEAR PROGRAMMING
LECTURE 14: INTRODUCTION TO DUALITY
LECTURE OUTLINE
Convex Cost/Linear Constraints
Duality Theorem
Linear Programming Duality
Quadratic Programming Duality
Linear inequality constrained problem

    minimize f(x)
    subject to aj'x ≤ bj, j = 1, . . . , r,

where f is convex and continuously differentiable over ℝⁿ.
LAGRANGE MULTIPLIER RESULT
Let J ⊂ {1, . . . , r}. Then x* is a global min if and only if x* is feasible and there exist μj* ≥ 0, j ∈ J, such that μj* = 0 for all j ∈ J with j ∉ A(x*), and

    x* = arg min_{aj'x ≤ bj, j∉J} { f(x) + Σ_{j∈J} μj*(aj'x − bj) }.

Proof: Assume x* is a global min. Then there exist μj* ≥ 0 such that μj*(aj'x* − bj) = 0 for all j and ∇f(x*) + Σ_{j=1}^r μj* aj = 0, implying

    x* = arg min_{x∈ℝⁿ} { f(x) + Σ_{j=1}^r μj*(aj'x − bj) }.

Since μj*(aj'x* − bj) = 0 for all j,

    f(x*) = min_{x∈ℝⁿ} { f(x) + Σ_{j=1}^r μj*(aj'x − bj) }
          ≤ min_{aj'x ≤ bj, j∉J} { f(x) + Σ_{j=1}^r μj*(aj'x − bj) }.

Since μj*(aj'x − bj) ≤ 0 when aj'x − bj ≤ 0,

    f(x*) ≤ min_{aj'x ≤ bj, j∉J} { f(x) + Σ_{j∈J} μj*(aj'x − bj) } ≤ f(x*).
PROOF (CONTINUED)
Conversely, if x* is feasible and there exist scalars μj*, j ∈ J, with the stated properties, then x* is a global min by the general sufficiency condition of the preceding lecture (where X is taken to be the set of x such that aj'x ≤ bj for all j ∉ J). Q.E.D.

Interesting observation: The same set of μj* works for all index sets J.

The flexibility to split the set of constraints into those that are handled by Lagrange multipliers (set J) and those that are handled explicitly comes handy in many analytical and computational contexts.
THE DUAL PROBLEM
Consider the problem

    min_{x∈X, aj'x ≤ bj, j=1,...,r} f(x)

where f is convex and continuously differentiable over ℝⁿ, and X is polyhedral.

Define the dual function q : ℝʳ → [−∞, ∞),

    q(μ) = inf_{x∈X} L(x, μ) = inf_{x∈X} { f(x) + Σ_{j=1}^r μj(aj'x − bj) },

and the dual problem

    max_{μ≥0} q(μ).

If X is bounded, the dual function takes real values. In general, q(μ) can take the value −∞. The effective constraint set of the dual is

    Q = { μ | μ ≥ 0, q(μ) > −∞ }.
DUALITY THEOREM
(a) If the primal problem has an optimal solution, the dual problem also has an optimal solution and the optimal values are equal.

(b) x* is primal-optimal and μ* is dual-optimal if and only if x* is primal-feasible, μ* ≥ 0, and

    f(x*) = L(x*, μ*) = min_{x∈X} L(x, μ*).

Proof: (a) Let x* be a primal optimal solution. For all primal feasible x, and all μ ≥ 0, we have μj(aj'x − bj) ≤ 0 for all j, so

    q(μ) ≤ inf_{x∈X, aj'x≤bj, j=1,...,r} { f(x) + Σ_{j=1}^r μj(aj'x − bj) }
         ≤ inf_{x∈X, aj'x≤bj, j=1,...,r} f(x) = f(x*).   (*)

By the L-multiplier theorem, there exists μ* ≥ 0 such that μj*(aj'x* − bj) = 0 for all j, and x* = arg min_{x∈X} L(x, μ*), so

    q(μ*) = L(x*, μ*) = f(x*) + Σ_{j=1}^r μj*(aj'x* − bj) = f(x*).
PROOF (CONTINUED)
(b) If x* is primal-optimal and μ* is dual-optimal, by part (a)

    f(x*) = q(μ*),

which when combined with Eq. (*), yields

    f(x*) = L(x*, μ*) = q(μ*) = min_{x∈X} L(x, μ*).

Conversely, the relation f(x*) = min_{x∈X} L(x, μ*) is written as f(x*) = q(μ*), and since x* is primal-feasible and μ* ≥ 0, Eq. (*) implies that x* is primal-optimal and μ* is dual-optimal. Q.E.D.

Linear equality constraints are treated similarly to inequality constraints, except that the sign of the Lagrange multipliers is unrestricted:

    Primal: min_{x∈X, ei'x=di, i=1,...,m, aj'x≤bj, j=1,...,r} f(x)

    Dual: max_{λ∈ℝᵐ, μ≥0} q(λ, μ) = max_{λ∈ℝᵐ, μ≥0} inf_{x∈X} L(x, λ, μ).
THE DUAL OF A LINEAR PROGRAM
Consider the linear program

    minimize c'x
    subject to ei'x = di, i = 1, . . . , m, x ≥ 0

Dual function:

    q(λ) = inf_{x≥0} { Σ_{j=1}^n ( cj − Σ_{i=1}^m λi e_ij ) xj + Σ_{i=1}^m λi di }.

If cj − Σ_{i=1}^m λi e_ij ≥ 0 for all j, the infimum is attained for x = 0, and q(λ) = Σ_{i=1}^m λi di. If cj − Σ_{i=1}^m λi e_ij < 0 for some j, the expression in braces can be made arbitrarily small by taking xj sufficiently large, so q(λ) = −∞. Thus, the dual is

    maximize Σ_{i=1}^m λi di
    subject to Σ_{i=1}^m λi e_ij ≤ cj, j = 1, . . . , n.
THE DUAL OF A QUADRATIC PROGRAM
Consider the quadratic program

    minimize ½x'Qx + c'x
    subject to Ax ≤ b,

where Q is a given n × n positive definite symmetric matrix, A is a given r × n matrix, and b ∈ ℝʳ and c ∈ ℝⁿ are given vectors.

Dual function:

    q(μ) = inf_{x∈ℝⁿ} { ½x'Qx + c'x + μ'(Ax − b) }.

The infimum is attained for x = −Q^{-1}(c + A'μ), and, after substitution and calculation,

    q(μ) = −½μ'AQ^{-1}A'μ − μ'(b + AQ^{-1}c) − ½c'Q^{-1}c.

The dual problem, after a sign change, is

    minimize ½μ'Pμ + t'μ
    subject to μ ≥ 0,

where P = AQ^{-1}A' and t = b + AQ^{-1}c.
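The dual data P and t, and the primal recovery x = −Q^{-1}(c + A'μ), can be checked on a tiny instance with a single constraint, where the sign-constrained dual has a closed-form solution. The specific Q, c, A, b below are made-up illustrations:

```python
import numpy as np

# Primal QP:  min 0.5 x'Qx + c'x  s.t.  Ax <= b
Q = np.array([[2.0, 0.0], [0.0, 2.0]])
c = np.array([-2.0, -2.0])
A = np.array([[1.0, 0.0]])
b = np.array([0.0])            # constraint x1 <= 0, active at the solution

Qinv = np.linalg.inv(Q)
P = A @ Qinv @ A.T             # dual Hessian
t = b + A @ Qinv @ c           # dual linear term

# Dual (after sign change): min 0.5 mu'P mu + t'mu  s.t.  mu >= 0.
# With a single constraint the minimizer is mu = max(0, -t/P):
mu = np.maximum(0.0, -np.linalg.solve(P, t))
x = -Qinv @ (c + A.T @ mu)     # primal recovery x = -Q^{-1}(c + A'mu)
```

The unconstrained minimum of the primal is (1, 1); imposing x1 ≤ 0 moves it to x* = (0, 1) with multiplier μ* = 2, which is exactly what the dual recovers.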
6.252 NONLINEAR PROGRAMMING
LECTURE 15: INTERIOR POINT METHODS
LECTURE OUTLINE
Barrier and Interior Point Methods
Linear Programs and the Logarithmic Barrier
Path Following Using Newton's Method
Inequality constrained problem

    minimize f(x)
    subject to x ∈ X, gj(x) ≤ 0, j = 1, . . . , r,

where f and the gj are continuous and X is closed. We assume that the set

    S = { x ∈ X | gj(x) < 0, j = 1, . . . , r }

is nonempty and any feasible point is in the closure of S.
BARRIER METHOD
Consider a barrier function, that is, a function that is continuous and goes to ∞ as any one of the constraints gj(x) approaches 0 from negative values. Examples:

    B(x) = −Σ_{j=1}^r ln( −gj(x) ), B(x) = −Σ_{j=1}^r 1/gj(x).

Barrier Method:

    xk = arg min_{x∈S} { f(x) + εk B(x) }, k = 0, 1, . . . ,

where the parameter sequence {εk} satisfies 0 < εk+1 < εk for all k and εk → 0.

For the linear program min{ c'x | Ax = b, x ≥ 0 } with the logarithmic barrier: for ε > 0,

    x(ε) = arg min_{x∈S} Fε(x) = arg min_{x∈S} { c'x − ε Σ_{i=1}^n ln xi },

where S = { x | Ax = b, x > 0 }. We assume that S is nonempty and bounded.
As ε → 0, x(ε) follows the central path.

[Figure: the point x(ε) on the central path inside S. All central paths start at the analytic center

    x∞ = arg min_{x∈S} ( −Σ_{i=1}^n ln xi ),

and end at optimal solutions of (LP) (x* corresponds to ε = 0).]
PATH FOLLOWING W/ NEWTONS METHOD
Newton's method for minimizing Fε:

    x̄ = x + α(x̃ − x),

where x̃ is the pure Newton iterate

    x̃ = arg min_{Az=b} { ∇Fε(x)'(z − x) + ½(z − x)'∇²Fε(x)(z − x) }.

By straightforward calculation,

    x̃ = x − X q(x, ε),
    q(x, ε) = Xz/ε − e, e = (1 . . . 1)', z = c − A'λ,
    λ = (AX²A')^{-1} AX( Xc/ε − e ),

and X is the diagonal matrix with xi, i = 1, . . . , n, along the diagonal.

View q(x, ε) as the Newton increment (x̃ − x) transformed by X^{-1}, which maps x into e.

Consider ||q(x, ε)|| as a proximity measure of the current point to the point x(ε) on the central path.
KEY RESULTS
It is sufficient to minimize Fε approximately, up to where ||q(x, ε)|| < 1.

[Figure: the central path in S, the points x(ε0), x(ε1), x(ε2), the iterates x0, x1, x2, and the set { x | ||q(x, ε0)|| < 1 }.]

If x > 0, Ax = b, and ||q(x, ε)|| < 1, then

    c'x − min_{Ay=b, y≥0} c'y ≤ ε( n + √n ).

The termination set { x | ||q(x, ε)|| < 1 } is part of the region of quadratic convergence of the pure form of Newton's method.

Let x > 0, Ax = b, and suppose that for some δ < 1 we have ||q(x, ε)|| ≤ δ. Then if ε̄ = (1 − θn^{-1/2})ε for some θ > 0,

    ||q(x, ε̄)|| ≤ (δ² + θ) / (1 − θn^{-1/2}).

In particular, if θ ≤ δ(1 − δ)(1 + δ)^{-1}, we have ||q(x, ε̄)|| ≤ δ.

This can be used to establish nice complexity results; but ε must be reduced VERY slowly.
LONG STEP METHODS
Main features:
- Decrease ε faster than dictated by complexity analysis.
- Require more than one Newton step per (approximate) minimization.
- Use a line search as in unconstrained Newton's method.
- Require a much smaller number of (approximate) minimizations.

[Figure: (a) short steps tracking the central path closely through x(εk), x(εk+1), x(εk+2); (b) long steps skipping across the path toward x*.]

The methodology generalizes to quadratic programming and convex programming.
6.252 NONLINEAR PROGRAMMING
LECTURE 16: PENALTY METHODS
LECTURE OUTLINE
Quadratic Penalty Methods
Introduction to Multiplier Methods
*******************************************
Consider the equality constrained problem

    minimize f(x)
    subject to x ∈ X, h(x) = 0,

where f : ℝⁿ → ℝ and h : ℝⁿ → ℝᵐ are continuous, and X is closed.

The quadratic penalty method:

    xk = arg min_{x∈X} Lck(x, λk) ≡ f(x) + λk'h(x) + (ck/2)||h(x)||²,

where {λk} is a bounded sequence and {ck} satisfies 0 < ck < ck+1 for all k and ck → ∞.
TWO CONVERGENCE MECHANISMS
Taking λk close to a Lagrange multiplier vector:
- Assume X = ℝⁿ and (x*, λ*) is a local min-Lagrange multiplier pair satisfying the 2nd order sufficiency conditions
- For c sufficiently large, x* is a strict local min of Lc(·, λ*)

Taking ck very large:
- For large c and any λ,

    Lc(·, λ) ≈ f(·) if x ∈ X and h(x) = 0, ≈ ∞ otherwise

Example:

    minimize f(x) = ½( x1² + x2² )
    subject to x1 = 1

    Lc(x, λ) = ½( x1² + x2² ) + λ(x1 − 1) + (c/2)(x1 − 1)²

    x1(λ, c) = (c − λ)/(c + 1), x2(λ, c) = 0
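The closed-form minimizer x1(λ, c) = (c − λ)/(c + 1) makes the two convergence mechanisms concrete: with λ fixed, feasibility is approached only as c → ∞, while with λ set to the multiplier it holds for every c. A quick check:

```python
# Quadratic penalty for:  min 0.5*(x1^2 + x2^2)  s.t.  x1 = 1.
# Minimizing L_c(x, lam) = 0.5*(x1^2 + x2^2) + lam*(x1 - 1) + 0.5*c*(x1 - 1)^2
# in closed form gives x1(lam, c) = (c - lam)/(c + 1), x2 = 0.
def x1_of(lam, c):
    return (c - lam) / (c + 1.0)

# With lam fixed at 0, x1 -> 1 only as c -> infinity:
approx = [x1_of(0.0, c) for c in (1.0, 10.0, 100.0)]
# With lam at the Lagrange multiplier lam* = -1, x1 = 1 for every c:
exact = [x1_of(-1.0, c) for c in (1.0, 10.0, 100.0)]
```

The first list reproduces the values 1/2, 10/11, 100/101 seen in the plots that follow.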
EXAMPLE CONTINUED
min_{x1=1} ½( x1² + x2² ), x* = (1, 0), λ* = −1

[Figure: level sets of Lc(·, λ) and the minimizer x1(λ, c) for four cases: c = 1, λ = 0 gives x1 = 1/2; c = 1, λ = −1/2 gives x1 = 3/4; c = 1, λ = 0 gives x1 = 1/2; c = 10, λ = 0 gives x1 = 10/11.]
GLOBAL CONVERGENCE
Every limit point of {xk} is a global min.

Proof: The optimal value of the problem is f* = inf_{h(x)=0, x∈X} f(x). We have

    Lck(xk, λk) ≤ Lck(x, λk), for all x ∈ X,

so taking the inf of the RHS over x ∈ X, h(x) = 0,

    Lck(xk, λk) = f(xk) + λk'h(xk) + (ck/2)||h(xk)||² ≤ f*.

Let (x̄, λ̄) be a limit point of {xk, λk}. Without loss of generality, assume that {xk, λk} → (x̄, λ̄). Taking the limsup above,

    f(x̄) + λ̄'h(x̄) + lim sup_{k→∞} (ck/2)||h(xk)||² ≤ f*.   (*)

Since ||h(xk)||² ≥ 0 and ck → ∞, it follows that h(xk) → 0 and h(x̄) = 0. Hence, x̄ is feasible, and since from Eq. (*) we have f(x̄) ≤ f*, x̄ is optimal. Q.E.D.
LAGRANGE MULTIPLIER ESTIMATES
Assume that X = ℝⁿ, and that f and h are continuously differentiable. Let {λk} be bounded, and let ck → ∞. Assume xk satisfies ∇x Lck(xk, λk) = 0 for all k, and that xk → x̄, where x̄ is such that ∇h(x̄) has rank m. Then h(x̄) = 0 and λ̃k → λ̃, where

    λ̃k = λk + ck h(xk), ∇x L(x̄, λ̃) = 0.

Proof: We have

    0 = ∇x Lck(xk, λk) = ∇f(xk) + ∇h(xk)( λk + ck h(xk) ) = ∇f(xk) + ∇h(xk)λ̃k.

Multiply with

    ( ∇h(xk)'∇h(xk) )^{-1} ∇h(xk)'

and take the limit to obtain λ̃k → λ̃ with

    λ̃ = −( ∇h(x̄)'∇h(x̄) )^{-1} ∇h(x̄)'∇f(x̄).

We also have ∇x L(x̄, λ̃) = 0 and h(x̄) = 0 (since λ̃k converges).
PRACTICAL BEHAVIOR
Three possibilities:
- The method breaks down because an xk with ∇x Lck(xk, λk) ≈ 0 cannot be found.
- A sequence {xk} with ∇x Lck(xk, λk) ≈ 0 is obtained, but it either has no limit points, or for each of its limit points x̄ the matrix ∇h(x̄) has rank < m.
- A sequence {xk} with ∇x Lck(xk, λk) ≈ 0 is found and it has a limit point x̄ such that ∇h(x̄) has rank m. Then, x̄ together with λ̄ [the corresponding limit point of { λk + ck h(xk) }] satisfies the first-order necessary conditions.

Ill-conditioning: The condition number of the Hessian ∇²xx Lck(xk, λk) tends to increase with ck.

To overcome ill-conditioning:
- Use a Newton-like method (and double precision).
- Use good starting points.
- Increase ck at a moderate rate (if ck is increased at a fast rate, {xk} converges faster, but the likelihood of ill-conditioning is greater).
INEQUALITY CONSTRAINTS
Convert them to equality constraints by using squared slack variables that are eliminated later: convert the inequality constraint gj(x) ≤ 0 to the equality constraint gj(x) + zj² = 0.

The penalty method solves problems of the form

    min_{x,z} L̄c(x, z, μ) = f(x) + Σ_{j=1}^r { μj( gj(x) + zj² ) + (c/2)| gj(x) + zj² |² },

for various values of μ and c.

First minimize L̄c(x, z, μ) with respect to z,

    L̄c(x, μ) = min_z L̄c(x, z, μ) = f(x) + Σ_{j=1}^r min_{zj} { μj( gj(x) + zj² ) + (c/2)| gj(x) + zj² |² },

and then minimize L̄c(x, μ) with respect to x.
MULTIPLIER METHODS
Recall that if (x*, λ*) is a local min-Lagrange multiplier pair satisfying the 2nd order sufficiency conditions, then for c sufficiently large, x* is a strict local min of Lc(·, λ*). This suggests that for λk ≈ λ*, xk ≈ x*.

Hence it is a good idea to use λk ≈ λ*, such as

    λk+1 = λ̃k = λk + ck h(xk)

This is the (1st order) method of multipliers.

Key advantages to be shown:
- Less ill-conditioning: It is not necessary that ck → ∞ (only that ck exceeds some threshold).
- Faster convergence when λk is updated than when λk is kept constant (whether ck → ∞ or not).
6.252 NONLINEAR PROGRAMMING
LECTURE 17: AUGMENTED LAGRANGIAN METHODS
LECTURE OUTLINE
Multiplier Methods
*******************************************

Consider the equality constrained problem

    minimize f(x)
    subject to h(x) = 0,

where f : ℝⁿ → ℝ and h : ℝⁿ → ℝᵐ are continuously differentiable.

The (1st order) multiplier method finds

    xk = arg min_{x∈ℝⁿ} Lck(x, λk) ≡ f(x) + λk'h(x) + (ck/2)||h(x)||²

and updates λk using

    λk+1 = λk + ck h(xk)
CONVEX EXAMPLE
Problem: min_{x1=1} ½( x1² + x2² ) with optimal solution x* = (1, 0) and Lagrange multiplier λ* = −1.

We have

    xk = arg min_{x∈ℝⁿ} Lck(x, λk) = ( (ck − λk)/(ck + 1), 0 )

    λk+1 = λk + ck( (ck − λk)/(ck + 1) − 1 )

    λk+1 − λ* = (λk − λ*)/(ck + 1)

We see that:
- λk → λ* = −1 and xk → x* = (1, 0) for every nondecreasing sequence {ck}. It is NOT necessary to increase ck to ∞.
- The convergence rate becomes faster as ck becomes larger; in fact { |λk − λ*| } converges superlinearly if ck → ∞.
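The contraction λk+1 − λ* = (λk − λ*)/(ck + 1) can be observed by running the method with a fixed penalty parameter, confirming that ck → ∞ is not needed. A minimal sketch of this example:

```python
# First-order method of multipliers for:  min 0.5*(x1^2 + x2^2)  s.t.  x1 = 1.
# The augmented-Lagrangian minimizer is x1 = (c - lam)/(c + 1), x2 = 0,
# and the update lam <- lam + c*(x1 - 1) contracts lam toward lam* = -1
# by the factor 1/(c + 1), so c need not be driven to infinity.
lam, c = 0.0, 10.0            # keep the penalty parameter fixed
errs = []
for k in range(12):
    x1 = (c - lam) / (c + 1.0)
    lam = lam + c * (x1 - 1.0)
    errs.append(abs(lam - (-1.0)))
```

With c = 10 the multiplier error shrinks by a factor of 11 per iteration, a fast linear rate at a modest, constant penalty.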
NONCONVEX EXAMPLE
Problem: min_{x1=1} ½( −x1² + x2² ) with optimal solution x* = (1, 0) and Lagrange multiplier λ* = 1.

We have

    xk = arg min_{x∈ℝⁿ} Lck(x, λk) = ( (ck − λk)/(ck − 1), 0 ),

provided ck > 1 (otherwise the min does not exist), and

    λk+1 = λk + ck( (ck − λk)/(ck − 1) − 1 )

    λk+1 − λ* = −(λk − λ*)/(ck − 1)

We see that:
- No need to increase ck to ∞ for convergence; doing so results in a faster convergence rate.
- To obtain convergence, ck must eventually exceed the threshold 2.
THE PRIMAL FUNCTIONAL
Let (x*, λ*) be a regular local min-Lagrange multiplier pair satisfying the 2nd order sufficiency conditions.

The primal functional

    p(u) = min_{h(x)=u} f(x)

is defined for u in an open sphere centered at u = 0, and we have

    p(0) = f(x*), ∇p(0) = −λ*.

[Figure: (a) p(u) = min_{x1−1=u} ½(x1² + x2²) = ½(u + 1)², with p(0) = f(x*) = ½; (b) p(u) = min_{x1−1=u} ½(−x1² + x2²) = −½(u + 1)², with p(0) = f(x*) = −½.]
AUGM. LAGRANGIAN MINIMIZATION
Break down the minimization of Lc(·, λ):

    min_x Lc(x, λ) = min_u min_{h(x)=u} { f(x) + λ'h(x) + (c/2)||h(x)||² }
                   = min_u { p(u) + λ'u + (c/2)||u||² },

where the minimization above is understood to be local in a neighborhood of u = 0.

Interpretation of this minimization:

[Figure: the primal function p(u), with slope −λ* at u = 0 and value p(0) = f(x*); the penalized primal function p(u) + (c/2)||u||²; and the point u(λ, c) attaining min_x Lc(x, λ), where the supporting line has slope −λ and intercept min_x Lc(x, λ) − λ'u(λ, c).]

If c is sufficiently large, p(u) + λ'u + (c/2)||u||² is convex in a neighborhood of 0. Also, for λ ≈ λ* and large c, the value min_x Lc(x, λ) ≈ p(0) = f(x*).
INTERPRETATION OF THE METHOD
Geometric interpretation of the iteration

    λk+1 = λk + ck h(xk).

[Figure: the penalized primal function p(u) + (c/2)||u||² supported at uk with slope −λk+1 = ∇p(uk); the successive slopes −λk, −λk+1, −λk+2 approach −λ*, and the values min_x Lck(x, λk), min_x Lck+1(x, λk+1) approach p(0) = f(x*).]

If λk is sufficiently close to λ* and/or ck is sufficiently large, λk+1 will be closer to λ* than λk.

ck need not be increased to ∞ in order to obtain convergence; it is sufficient that ck eventually exceeds some threshold level.

If p(u) is linear, convergence to λ* will be achieved in one iteration.
COMPUTATIONAL ASPECTS
Key issue is how to select {ck}:
- ck should eventually become larger than the threshold of the given problem.
- c0 should not be so large as to cause ill-conditioning at the 1st minimization.
- ck should not be increased so fast that too much ill-conditioning is forced upon the unconstrained minimization too early.
- ck should not be increased so slowly that the multiplier iteration has a poor convergence rate.

A good practical scheme is to choose a moderate value c0, and use ck+1 = βck, where β is a scalar with β > 1 (typically β ∈ [5, 10] if a Newton-like method is used).

In practice the minimization of Lck(x, λk) is typically inexact (usually exact asymptotically). In some variants of the method, only one Newton step per minimization is used (with safeguards).
DUALITY FRAMEWORK
Consider the problem

    minimize f(x) + (c/2)||h(x)||²
    subject to ||x − x*|| < ε, h(x) = 0,

where ε is small enough for a local analysis to hold based on the implicit function theorem, and c is large enough for the minimum to exist.

Consider the dual function and its gradient

    qc(λ) = min_{||x−x*||<ε} Lc(x, λ)
6.252 NONLINEAR PROGRAMMING
LECTURE 18: DUALITY THEORY
LECTURE OUTLINE
Geometrical Framework for Duality
Geometric Multipliers
The Dual Problem
Properties of the Dual Function
Consider the problem

    minimize f(x)
    subject to x ∈ X, gj(x) ≤ 0, j = 1, . . . , r.

We assume that the problem is feasible and the cost is bounded from below:

    −∞ < f* = inf_{x∈X, gj(x)≤0, j=1,...,r} f(x) < ∞
MIN COMMON POINT/MAX CROSSING POINT
Let M be a subset of ℝⁿ:
- Min Common Point Problem: Among all points that are common to both M and the nth axis, find the one whose nth component is minimum.
- Max Crossing Point Problem: Among all hyperplanes that intersect the nth axis and support the set M from below, find the hyperplane for which the point of intercept with the nth axis is maximum.

[Figure: two examples of a set M with its min common point w* and max crossing point q*.]

Note: We will see that the min common/max crossing framework applies to the problem of the preceding slide, i.e., min_{x∈X, g(x)≤0} f(x), with the choice

    M = { (z, f(x)) | g(x) ≤ z, x ∈ X }
GEOMETRICAL DEFINITION OF A MULTIPLIER
A vector μ* = (μ1*, . . . , μr*) is said to be a geometric multiplier for the primal problem if

    μj* ≥ 0, j = 1, . . . , r,

and

    f* = inf_{x∈X} L(x, μ*).

[Figure: the set S = { (g(x), f(x)) | x ∈ X }, the hyperplane H = { (z, w) | f* = w + Σj μj* zj } with normal (μ*, 1) passing through (0, f*), and the pairs (z, w) corresponding to x that minimize L(x, μ*) over X.]

Note that the definition differs from the one given in Chapter 3 ... but if X = ℝⁿ, f and the gj are convex and differentiable, and x* is an optimal solution, the Lagrange multipliers corresponding to x*, as per the definition of Chapter 3, coincide with the geometric multipliers as per the above definition.
EXAMPLES: A G-MULTIPLIER EXISTS
[Figure: three examples where a geometric multiplier exists, shown via S = { (g(x), f(x)) | x ∈ X } and the supporting hyperplane with normal (μ*, 1):
(a) min f(x) = x1 − x2 s.t. g(x) = x1 + x2 − 1 ≤ 0, x ∈ X = { (x1, x2) | x1 ≥ 0, x2 ≥ 0 };
(b) min f(x) = ½(x1² + x2²) s.t. g(x) = x1 − 1 ≤ 0, x ∈ X = ℝ²;
(c) min f(x) = |x1| + x2 s.t. g(x) = x1 ≤ 0, x ∈ X = { (x1, x2) | x2 ≥ 0 }.]
EXAMPLES: A G-MULTIPLIER DOESNT EXIST
[Figure: two examples in which no geometric multiplier exists; in each, no hyperplane through (0, f*) = (0, 0) supports S from below.]

(a) min f(x) = x
    s.t. g(x) = x² ≤ 0
    x ∈ X = ℜ
    S = {(x², x) | x ∈ ℜ}

(b) min f(x) = −x
    s.t. g(x) = x − 1/2 ≤ 0
    x ∈ X = {0, 1}
    S = {(−1/2, 0), (1/2, −1)}
Proposition: Let μ* be a geometric multiplier. Then x* is a global minimum of the primal problem if and only if x* is feasible and

x* = arg min_{x ∈ X} L(x, μ*),   μ*_j g_j(x*) = 0, j = 1, . . . , r
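The conditions of the proposition are easy to check mechanically. A minimal sketch for example (b) of the previous slide, where the unconstrained minimum is already feasible so μ* = 0 and x* = (0, 0); the grid is an illustrative finite sample of X, not from the slides:

```python
def f(x1, x2):
    return 0.5 * (x1 * x1 + x2 * x2)

def g(x1, x2):
    return x1 - 1.0

mu_star, x_star = 0.0, (0.0, 0.0)

# 1) x* is feasible
assert g(*x_star) <= 0.0

# 2) x* minimizes L(x, mu*) = f(x) over X = R^2
#    (f is nonnegative everywhere and f(x*) = 0)
grid = [i / 10.0 for i in range(-30, 31)]
assert all(f(a, b) >= f(*x_star) for a in grid for b in grid)

# 3) complementary slackness: mu*_j g_j(x*) = 0 for each j
assert mu_star * g(*x_star) == 0.0

print("x* satisfies the conditions of the proposition")
```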
THE DUAL FUNCTION AND THE DUAL PROBLEM
The dual problem is

maximize q(μ)
subject to μ ≥ 0,

where q is the dual function

q(μ) = inf_{x ∈ X} L(x, μ),   μ ∈ ℜʳ.

Question: How does the optimal dual value q* = sup_{μ ≥ 0} q(μ) relate to f*?
[Figure: the set S = {(g(x), f(x)) | x ∈ X} and, for a given μ, a supporting hyperplane H = {(z, w) | w + μ'z = b} with normal (μ, 1); its intercept with the vertical axis is q(μ) = inf_{x ∈ X} L(x, μ), the support points correspond to minimizers of L(x, μ) over X, and the optimal dual value is the highest such intercept.]
WEAK DUALITY
The domain of q is

D_q = { μ | q(μ) > −∞ }.

Proposition: The domain D_q is a convex set and q is concave over D_q.

Proposition: (Weak Duality Theorem) We have

q* ≤ f*.
Proof: For all μ ≥ 0, and x ∈ X with g(x) ≤ 0, we have

q(μ) = inf_{z ∈ X} L(z, μ) ≤ f(x) + Σ_{j=1}^r μ_j g_j(x) ≤ f(x),

so

q* = sup_{μ ≥ 0} q(μ) ≤ inf_{x ∈ X, g(x) ≤ 0} f(x) = f*.
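Weak duality can be observed numerically on the earlier example min x s.t. x² ≤ 0 over X = ℜ, where the only feasible point is x = 0, so f* = 0. A minimal sketch; the grid bounds are an illustrative choice:

```python
f_star = 0.0

def q(mu):
    # Approximate q(mu) = inf_{x in R} L(x, mu) = inf_x (x + mu * x**2)
    # by minimizing over a finite grid on [-100, 100].
    xs = (i / 100.0 for i in range(-10000, 10001))
    return min(x + mu * x * x for x in xs)

for mu in (0.01, 0.1, 1.0, 10.0):
    assert q(mu) <= f_star    # weak duality: q(mu) <= f* for every mu >= 0

print("weak duality holds on all sampled mu")
```

As μ grows, q(μ) approaches f* = 0 from below but never reaches it, previewing the duality gap discussion below.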
DUAL OPTIMAL SOLUTIONS AND G-MULTIPLIER
Proposition: (a) If q* = f*, the set of geometric multipliers is equal to the set of optimal dual solutions. (b) If q* < f*, the set of geometric multipliers is empty.

Proof: By definition, a vector μ* ≥ 0 is a geometric multiplier if and only if f* = q(μ*) ≤ q*, which by the weak duality theorem holds if and only if there is no duality gap and μ* is a dual optimal solution. Q.E.D.
[Figure: the dual functions q(μ) of the two examples from the slide "A G-multiplier doesn't exist."]

(a) min f(x) = x
    s.t. g(x) = x² ≤ 0
    x ∈ X = ℜ,  f* = 0

    q(μ) = min_{x ∈ ℜ} { x + μx² } = { −1/(4μ) if μ > 0,  −∞ if μ ≤ 0 }

(b) min f(x) = −x
    s.t. g(x) = x − 1/2 ≤ 0
    x ∈ X = {0, 1},  f* = 0

    q(μ) = min_{x ∈ {0,1}} { −x + μ(x − 1/2) } = min{ −μ/2, μ/2 − 1 },  with q* = −1/2.
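Both closed-form dual functions above can be checked numerically. A minimal sketch; the grid bounds and step in q_a are illustrative choices:

```python
def q_a(mu):
    """Approximate q(mu) = inf_{x in R} (x + mu*x**2), example (a), mu > 0."""
    return min((i / 1000.0) + mu * (i / 1000.0) ** 2
               for i in range(-100000, 100001))      # x in [-100, 100]

def q_b(mu):
    """q(mu) = min over x in {0, 1} of (-x + mu*(x - 1/2)), example (b)."""
    return min(-x + mu * (x - 0.5) for x in (0.0, 1.0))

# (a) matches the closed form -1/(4*mu) for mu > 0
for mu in (0.5, 1.0, 2.0):
    assert abs(q_a(mu) + 1.0 / (4.0 * mu)) < 1e-6

# (b) matches min(-mu/2, mu/2 - 1); its maximum over mu >= 0 is q* = -1/2
# at mu = 1, while f* = 0, so example (b) has a duality gap of 1/2
for mu in (0.0, 0.5, 1.0, 2.0):
    assert q_b(mu) == min(-mu / 2.0, mu / 2.0 - 1.0)

print("both closed forms confirmed on sampled mu")
```

In (a) the supremum q* = 0 equals f* but is not attained; in (b) q* = −1/2 < f* = 0. In both cases the proposition correctly predicts that no geometric multiplier exists.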
6.252 NONLINEAR PROGRAMMING
LECTURE 19: DUALITY THEOREMS
LECTURE OUTLINE
Duality and G-multipliers (continued)
Consider the problem

minimize f(x)
subject to x ∈ X, g_j(x) ≤ 0, j = 1, . . . , r,

assuming −∞ < f* < ∞.

μ* is a geometric multiplier if μ* ≥ 0 and f* = inf_{x ∈ X} L(x, μ*).

The dual problem is

maximize q(μ)
subject to μ ≥ 0,

where q is the dual function q(μ) = inf_{x ∈ X} L(x, μ).
DUAL OPTIMALITY
[Figure: as on the earlier slide, the set S = {(g(x), f(x)) | x ∈ X} with a supporting hyperplane H = {(z, w) | w + μ'z = b} of normal (μ, 1); q(μ) = inf_{x ∈ X} L(x, μ) is the vertical intercept, the support points correspond to minimizers of L(x, μ) over X, and the optimal dual value is the highest intercept.]
Weak Duality Theorem: q* ≤ f*.
Geometric Multipliers and Dual Optimal Solutions:

(a) If there is no duality gap, the set of geometric multipliers is equal to the set of optimal dual solutions.

(b) If there is a duality gap, the set of geometric multipliers is empty.
DUALITY PROPERTIES
Optimality Conditions: (x*, μ*) is an optimal solution-geometric multiplier pair if and only if

x* ∈ X, g(x*) ≤ 0,   (Primal Feasibility),
μ* ≥ 0,   (Dual Feasibility),
x* = arg min_{x ∈ X} L(x, μ*),   (Lagrangian Optimality),
μ*_j g_j(x*) = 0, j = 1, . . . , r,   (Compl. Slackness).
Saddle Point Theorem: (x*, μ*) is an optimal solution-geometric multiplier pair if and only if x* ∈ X, μ* ≥ 0, and (x*, μ*) is a saddle point of the Lagrangian, in the sense that

L(x*, μ) ≤ L(x*, μ*) ≤ L(x, μ*),   ∀ x ∈ X, μ ≥ 0.
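The two saddle inequalities can be sampled numerically. A minimal sketch on example (a) from the slide "A G-multiplier exists", whose optimal pair is x* = (0, 1) with μ* = 1; the grids are an illustrative finite sample of X and of μ ≥ 0:

```python
def L(x1, x2, mu):
    # Lagrangian of: min x1 - x2  s.t.  x1 + x2 - 1 <= 0,  x1, x2 >= 0
    return (x1 - x2) + mu * (x1 + x2 - 1.0)

x_star, mu_star = (0.0, 1.0), 1.0
L_star = L(*x_star, mu_star)                      # L(x*, mu*) = -1

mus = [i / 10.0 for i in range(0, 101)]                        # mu in [0, 10]
xs = [(a / 10.0, b / 10.0) for a in range(51) for b in range(51)]

# left inequality:  L(x*, mu) <= L(x*, mu*) for all mu >= 0
assert all(L(*x_star, mu) <= L_star + 1e-12 for mu in mus)
# right inequality: L(x*, mu*) <= L(x, mu*) for all x in X
assert all(L_star <= L(x1, x2, mu_star) + 1e-12 for x1, x2 in xs)

print("saddle point inequalities hold on the sampled grid")
```

The left inequality is tight here for every μ because g(x*) = 0, so the saddle condition reduces to the Lagrangian optimality and complementary slackness conditions above.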
INFEASIBLE AND UNBOUNDED PROBLEMS
[Figure: the sets S = {(g(x), f(x)) | x ∈ X} for three pathological examples in the (z, w)-plane.]

(a) min f(x) = 1/x
    s.t. g(x) = x ≤ 0
    x ∈ X = {x | x > 0}
    f* = ∞ (infeasible), q* = ∞

(b) min f(x) = x
    s.t. g(x) = x² ≤ 0
    x ∈ X = {x | x > 0}
    S = {(x², x) | x > 0},  f* = ∞, q* = 0

(c) min f(x) = x1 + x2
    s.t. g(x) = x1 ≤ 0
    x ∈ X = {(x1, x2) | x1 > 0}
    S = {(g(x), f(x)) | x ∈ X} = {(z, w) | z > 0},  f* = ∞, q* = −∞
EXTENSIONS AND APPLICATIONS
Equality constraints h_i(x) = 0, i = 1, . . . , m, can be converted into the two inequality constraints

h_i(x) ≤ 0,   −h_i(x) ≤ 0.

Separable problems:

minimize Σ_{i=1}^m f_i(x_i)
subject to Σ_{i=1}^m g_{ij}(x_i) ≤ 0, j = 1, . . . , r,
           x_i ∈ X_i, i = 1, . . . , m.
Separable problem with a single constraint:

minimize Σ_{i=1}^n f_i(x_i)
subject to Σ_{i=1}^n x_i ≥ A,   α_i ≤ x_i ≤ β_i, ∀ i.
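The point of separability is that the Lagrangian decouples: writing the constraint as g(x) = A − Σ_i x_i ≤ 0 gives L(x, μ) = μA + Σ_i (f_i(x_i) − μx_i), so q(μ) is computed by n independent one-dimensional minimizations. A minimal sketch under illustrative assumptions (quadratic f_i(x) = c_i x², bounds [0, 1], A = 1.5 - none of this data is from the slides):

```python
c = [1.0, 2.0, 4.0]                     # f_i(x) = c_i * x**2
alpha, beta, A = 0.0, 1.0, 1.5

def argmin_i(ci, mu):
    # minimizer of c_i*x**2 - mu*x on [alpha, beta]: the unconstrained
    # minimizer mu/(2*c_i), clipped to the bounds
    return min(max(mu / (2.0 * ci), alpha), beta)

def q(mu):
    # q(mu) = mu*A + sum_i min_{alpha <= x <= beta} (f_i(x) - mu*x):
    # one cheap scalar subproblem per coordinate
    total = mu * A
    for ci in c:
        x = argmin_i(ci, mu)
        total += ci * x * x - mu * x
    return total

# maximize the concave dual by a coarse line search over mu >= 0
mu_best = max((i / 100.0 for i in range(0, 1001)), key=q)
x_best = [argmin_i(ci, mu_best) for ci in c]
print(mu_best, x_best, sum(x_best))
```

The primal variables are recovered coordinate-wise from the dual maximizer; with the data above, the recovered sum of the x_i lands close to A, as expected for this convex instance with no duality gap.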
DUALITY THEOREM I FOR CONVEX PROBLEMS
Strong Duality Theorem - Linear Constraints: Assume that the problem

minimize f(x)
subject to x ∈ X, a_i'x − b_i = 0, i = 1, . . . , m,
           e_j'x − d_j ≤ 0, j = 1, . . . , r,

has finite optimal value f*. Let also f be convex over ℜⁿ and let X be polyhedral. Then there exists at least one geometric multiplier and there is no duality gap.

Proof Issues

Application to Linear Programming
COUNTEREXAMPLE
A Convex Problem with a Duality Gap: Consider the two-dimensional problem

minimize f(x)
subject to x1 ≤ 0, x ∈ X = {x | x ≥ 0},

where

f(x) = e^(−√(x1 x2)), x ∈ X,

and f(x) is arbitrarily defined for x ∉ X.

f is convex over X (its Hessian is positive definite in the interior of X), and f* = 1.

Also, for all μ ≥ 0 we have

q(μ) = inf_{x ≥ 0} { e^(−√(x1 x2)) + μ x1 } = 0,

since the expression in braces is nonnegative for x ≥ 0 and can approach zero by taking x1 → 0 and x1 x2 → ∞. It follows that q* = 0.
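The limiting sequence in the argument can be made concrete: along x = (1/k, k³) we have x1 → 0 while x1 x2 = k² → ∞. A minimal sketch (the particular sequence is an illustrative choice):

```python
import math

def lagrangian(x1, x2, mu):
    # exp(-sqrt(x1*x2)) + mu*x1, the expression in braces above
    return math.exp(-math.sqrt(x1 * x2)) + mu * x1

mu = 1.0
# along x = (1/k, k**3): sqrt(x1*x2) = k, so the value is exp(-k) + mu/k
values = [lagrangian(1.0 / k, k ** 3, mu) for k in (1, 10, 100, 1000)]
print(values)   # decreases toward 0 while f* = 1
```

So q(μ) = 0 for every μ ≥ 0 even though f* = 1: a duality gap of 1 on a convex problem, which is why the convexity of f alone (without the polyhedral/Slater-type conditions of the theorems) is not enough.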
DUALITY THEOREM II FOR CONVEX PROBLEMS
Consider the problem

minimize f(x)
subject to x ∈ X, g_j(x) ≤ 0, j = 1, . . . , r.

Assume that X is convex and the functions f : ℜⁿ → ℜ, g_j : ℜⁿ → ℜ are convex over X. Furthermore, the optimal value f* is finite and there exists a vector x̄ ∈ X such that

g_j(x̄) < 0, j = 1, . . . , r.

Strong Duality Theorem: There exists at least one geometric multiplier and there is no duality gap.

Extension to linear equality constraints.
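A minimal sketch of the theorem on a small instance of my own choosing (not from the slides): min x² s.t. g(x) = 1 − x ≤ 0 on X = ℜ, where x̄ = 2 satisfies g(x̄) = −1 < 0, so the interior-point assumption holds and no duality gap is predicted.

```python
def q(mu):
    # q(mu) = inf_{x in R} (x**2 + mu*(1 - x)); the minimizer is x = mu/2,
    # giving the closed form mu - mu**2/4
    x = mu / 2.0
    return x * x + mu * (1.0 - x)

f_star = 1.0                                          # attained at x = 1
q_star = max(q(i / 100.0) for i in range(0, 1001))    # mu in [0, 10]
mu_star = max(range(0, 1001), key=lambda i: q(i / 100.0)) / 100.0

assert abs(q_star - f_star) < 1e-12    # no duality gap, as the theorem predicts
print(mu_star)                          # the geometric multiplier, mu* = 2
```

Here the dual maximum is attained at μ* = 2, which is exactly the geometric multiplier whose existence the theorem guarantees.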
6.252 NONLINEAR PROGRAMMING
LECTURE 20: STRONG DUALITY
LECTURE OUTLINE
Strong Duality Theorem
Linear Equality Constraints

Fenchel Duality
********************************
Consider the problem

minimize f(x)
subject to x ∈ X, g_j(x) ≤ 0, j = 1, . . . , r,

assuming −∞ < f* < ∞.

Normalize (β = 1), take the infimum over x ∈ X, and use the fact μ* ≥ 0, to obtain

f* ≤ inf_{x ∈ X} { f(x) + μ*'g(x) } = q(μ*) ≤ sup_{μ ≥ 0} q(μ) = q*.

Using the weak duality theorem, μ* is a geometric multiplier and there is no duality gap.
LINEAR EQUALITY CONSTRAINTS
Suppose we have the additional constraints

e_i'x − d_i = 0, i = 1, . . . , m.

We need the notion of the affine hull of a convex set X [denoted aff(X)]. This is the intersection of all hyperplanes containing X.

The relative interior of X, denoted ri(X), is the set of all x ∈ X such that there exists ε > 0 with

{ z | ‖z − x‖ < ε, z ∈ aff(X) } ⊂ X,

that is, ri(X) is the interior of X relative to aff(X).

Every nonempty convex set has a nonempty relative interior.
DUALITY THEOREM FOR EQUALITIES
Assumptions:

The set X is convex and the functions f, g_j are convex over X.

The optimal value f* is finite and there exists a vector x̄ ∈ ri(X) such that

g_j(x̄) < 0, j = 1, . . . , r,
e_i'x̄ − d_i = 0, i = 1, . . . , m.

Under the preceding assumptions there exists at least one geometric multiplier and there is no duality gap.
COUNTEREXAMPLE
Consider

minimize f(x) = x1
subject to x2 = 0, x ∈ X = { (x1, x2) | x1² ≤ x2 }.

The optimal solution is x* = (0, 0) and f* = 0.

The dual function is given by

q(λ) = inf_{x1² ≤ x2} { x1 + λx2 } = { −1/(4λ) if λ > 0,  −∞ if λ ≤ 0 }.

No dual optimal solution and therefore there is no geometric multiplier. (Even though there is no duality gap.)

Assumptions are violated (the feasible set and the relative interior of X have no common point).
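The closed form for q(λ) can be verified numerically: for λ > 0 the infimum is attained on the boundary x2 = x1² (raising x2 only increases x1 + λx2), so a one-dimensional grid search suffices. A minimal sketch; the grid bounds and step are illustrative choices:

```python
def q(lam, step=0.001, lo=-50.0, hi=50.0):
    # Approximate inf over {x | x1**2 <= x2} of (x1 + lam*x2) for lam > 0
    # by searching x1 on a grid with x2 = x1**2.
    best = float("inf")
    n = int((hi - lo) / step)
    for i in range(n + 1):
        x1 = lo + i * step
        best = min(best, x1 + lam * x1 * x1)
    return best

for lam in (0.5, 1.0, 4.0):
    assert abs(q(lam) + 1.0 / (4.0 * lam)) < 1e-5

print("q(lam) matches -1/(4*lam) on sampled lam > 0")
```

Since q(λ) = −1/(4λ) < 0 for every λ > 0 while sup_λ q(λ) = 0, the supremum is not attained: there is no dual optimal solution, hence no geometric multiplier, even though q* = f* = 0.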
FENCHEL DUALITY FRAMEWORK
Consider the problem

minimize f1(x) − f2(x)
subject to x ∈ X1 ∩ X2,

where f1 and f2 are real-valued functions on ℜⁿ, and X1 and X2 are subsets of ℜⁿ.

Assume that −∞ < f* < ∞.

Convert problem to

minimize f1(y) − f2(z)
subject to z = y, y