01 - NLP (1390-03-17)
References
• Edwin K. P. Chong, Stanislaw H. Zak, An Introduction to Optimization, John Wiley & Sons, 2nd Edition, 2001.
• David G. Luenberger, Linear and Nonlinear Programming, Addison-Wesley Publishing Company, 2nd Edition, 1989.
• S.S. Rao, Optimization: Theory and Applications, John Wiley & Sons, 2nd Edition, 1984.
Unconstrained Optimization
• Let X ∈ ℜⁿ:

$$X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}_{n \times 1}$$

• Let A = (a_ij) ∈ ℜⁿˣⁿ be an n×n symmetric matrix. The k-th order leading principal minor of A is defined as the determinant of the k×k submatrix formed by the first k rows and columns:

$$\Delta_k = \det \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1k} \\ a_{21} & a_{22} & \cdots & a_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ a_{k1} & a_{k2} & \cdots & a_{kk} \end{bmatrix}$$
Quadratic Form
• Consider the quadratic form:

$$Q(X) = X^T A X = \sum_{i=1}^{n} a_{ii} x_i^2 + 2 \sum_{1 \le i < j \le n} a_{ij} x_i x_j$$

• The above quadratic form (or simply the matrix A) is called:
- Positive semi-definite if XᵀAX ≥ 0 for all X.
- Positive definite if XᵀAX > 0 for all X ≠ 0.
- Negative semi-definite if XᵀAX ≤ 0 for all X.
- Negative definite if XᵀAX < 0 for all X ≠ 0.
- Indefinite if XᵀAX < 0 for some X and > 0 for others.
Quadratic Form
Theorems: Sylvester's Criterion
• A quadratic form XᵀAX, A = Aᵀ, is positive definite if and only if all the leading principal minors of A are > 0. (For positive semi-definiteness, all principal minors — not only the leading ones — must be ≥ 0.)
• A quadratic form XᵀAX, A = Aᵀ, is negative definite (negative semi-definite) if and only if the k-th leading principal minor of A has the sign of (−1)ᵏ, k = 1, 2, …, n (respectively, is zero or has the sign of (−1)ᵏ, k = 1, 2, …, n).
Quadratic Form
Theorems
• A symmetric matrix A is positive definite (or positive semi-definite) if and only if all eigenvalues of A are positive (or non-negative).
• A symmetric matrix A is negative definite (or negative semi-definite) if and only if all eigenvalues of A are negative (or non-positive).
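As a quick numerical illustration of these two tests (not from the original slides; the matrix below is an arbitrary symmetric example), both the leading-principal-minor and the eigenvalue criteria can be checked with NumPy:

```python
import numpy as np

def leading_principal_minors(A):
    """Determinants of the k-by-k top-left submatrices, k = 1..n."""
    return [np.linalg.det(A[:k, :k]) for k in range(1, A.shape[0] + 1)]

A = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  2.0]])   # symmetric example matrix

print(leading_principal_minors(A))   # [2.0, 3.0, 4.0]  -- all > 0
print(np.linalg.eigvalsh(A))         # [0.586, 2.0, 3.414] -- all > 0
# Both criteria agree: A is positive definite.
```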
Optima
• Let f(X) = f(x₁, x₂, …, x_n) be a real-valued function of the n variables x₁, x₂, …, x_n.
• Suppose:

$$X_0 = (x_1^0, x_2^0, \ldots, x_n^0), \qquad h = (h_1, h_2, \ldots, h_n), \qquad X_0 + h = (x_1^0 + h_1,\; x_2^0 + h_2,\; \ldots,\; x_n^0 + h_n)$$

- A point X₀ is said to be a local maximum of f(X) if there exists an ε > 0 such that f(X₀) ≥ f(X₀ + h) for all |h_j| ≤ ε.
- A point X₀ is said to be a local minimum of f(X) if there exists an ε > 0 such that f(X₀) ≤ f(X₀ + h) for all |h_j| ≤ ε.
- A point X₀ is said to be a strict local maximum of f(X) if there exists an ε > 0 such that f(X₀) > f(X₀ + h) for all |h_j| ≤ ε, h ≠ 0.
- A point X₀ is said to be a strict local minimum of f(X) if there exists an ε > 0 such that f(X₀) < f(X₀ + h) for all |h_j| ≤ ε, h ≠ 0.
Optima
- A point X₀ is said to be an absolute maximum or global maximum of f(X) if f(X₀) ≥ f(X) for all X.
- A point X₀ is said to be an absolute minimum or global minimum of f(X) if f(X₀) ≤ f(X) for all X.
- A point X₀ is said to be a strict absolute (global) maximum of f(X) if f(X₀) > f(X) for all X ≠ X₀.
- A point X₀ is said to be a strict absolute (global) minimum of f(X) if f(X₀) < f(X) for all X ≠ X₀.
Example
[Figure: a one-dimensional function with three marked minimizers]
- x₁: strict global minimizer
- x₂: strict local minimizer
- x₃: local (not strict) minimizer
Conditions for Local Minimizers
Theorem: First-Order Necessary Condition (FONC)
• Let Ω be a subset of ℜⁿ and f ∈ C¹ a real-valued function on Ω. If x* is a local minimizer of f over Ω, then for any feasible direction d at x*, we have
dᵀ∇f(x*) ≥ 0, i.e., ⟨∇f(x*), d⟩ ≥ 0.
Corollary: Interior case
• Let Ω be a subset of ℜⁿ and f ∈ C¹ a real-valued function on Ω. If x* is a local minimizer of f over Ω and x* is an interior point of Ω, then
∇f(x*) = 0
Conditions for Local Minimizers
Theorem (another statement)
• A necessary condition for x* to be an (interior) optimum point of f(x) is that ∇f(x*) = 0, i.e., all the first-order partial derivatives ∂f/∂xᵢ are zero at x*.
Definition
• A point x* for which ∇f(x*) = 0 is called a stationary point of f(x).
- A stationary point is a potential candidate for a local maximum or local minimum.
Example: #1
• Illustration of the FONC for the constrained case: [Figure] x₁ does not satisfy the FONC; x₂ satisfies the FONC.
Example: #2
• Consider the problem:
min f(x₁, x₂) = x₁² + 0.5x₂² + 3x₂ + 4.5
s.t.: x₁, x₂ ≥ 0
a. Is the FONC for a local minimizer satisfied at x = [1, 3]ᵀ?
b. Is the FONC for a local minimizer satisfied at x = [0, 3]ᵀ?
c. Is the FONC for a local minimizer satisfied at x = [1, 0]ᵀ?
d. Is the FONC for a local minimizer satisfied at x = [0, 0]ᵀ?
Solution
• ∇f(x₁, x₂) = [2x₁, x₂ + 3]ᵀ
• [Figure: a plot of the level sets of f]
Example: #2
a. At x = [1, 3]ᵀ, we have ∇f(x₁, x₂) = [2, 6]ᵀ. The point x = [1, 3]ᵀ is an interior point of Ω = {x : x₁ ≥ 0, x₂ ≥ 0}. Hence, the FONC requires ∇f(x₁, x₂) = 0, so the point x = [1, 3]ᵀ does not satisfy the FONC for a local minimizer.
b. At x = [0, 3]ᵀ, we have ∇f(x₁, x₂) = [0, 6]ᵀ, and hence dᵀ∇f(x₁, x₂) = 6d₂, where d = [d₁, d₂]ᵀ. For d to be feasible here, we need d₁ ≥ 0, while d₂ can take an arbitrary value in ℜ. The point x = [0, 3]ᵀ does not satisfy the FONC for a minimizer, because d₂ is allowed to be less than zero. For example, d = [1, −1]ᵀ is a feasible direction, but dᵀ∇f(x₁, x₂) = −6 < 0.
c. At x = [1, 0]ᵀ, we have ∇f(x₁, x₂) = [2, 3]ᵀ, and hence dᵀ∇f(x₁, x₂) = 2d₁ + 3d₂. For d to be feasible here, we need d₂ ≥ 0, while d₁ can take an arbitrary value in ℜ. For example, d = [−5, 1]ᵀ is a feasible direction, but dᵀ∇f(x₁, x₂) = −7 < 0. Thus x = [1, 0]ᵀ does not satisfy the FONC for a local minimizer.
d. At x = [0, 0]ᵀ, we have ∇f(x₁, x₂) = [0, 3]ᵀ, and hence dᵀ∇f(x₁, x₂) = 3d₂. For d to be feasible here, we need d₂ ≥ 0 and d₁ ≥ 0. Hence x = [0, 0]ᵀ satisfies the FONC for a local minimizer.
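A small numeric sanity check of parts (b) and (c) (a sketch; it simply evaluates dᵀ∇f at the sample feasible directions used above):

```python
import numpy as np

def grad_f(x):
    # gradient of f(x1, x2) = x1^2 + 0.5*x2^2 + 3*x2 + 4.5
    return np.array([2.0 * x[0], x[1] + 3.0])

checks = [
    (np.array([0.0, 3.0]), np.array([ 1.0, -1.0])),  # part (b)
    (np.array([1.0, 0.0]), np.array([-5.0,  1.0])),  # part (c)
]
for x, d in checks:
    print(x, d, d @ grad_f(x))   # -6.0 and -7.0: the FONC fails at both points
```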
Example #3: Function Approximation
• Suppose that through an experiment the value of a function g is observed at m points x₁, x₂, …, x_m, so the values g(x₁), g(x₂), …, g(x_m) are known. We wish to approximate the function g(x) by a polynomial
h(x) = a_n xⁿ + a_{n−1} x^{n−1} + … + a₀
of degree n (or less), where n < m. Find the aᵢ's.
Solution
• Define: e_k = g(x_k) − h(x_k).
• The best approximation is the polynomial that minimizes the sum of the squares of these errors:

$$\min \sum_{k=1}^{m} e_k^2$$
Example: Function Approximation
• or

$$\min f(a_n, a_{n-1}, \ldots, a_0) = \sum_{k=1}^{m} \Big( g(x_k) - \big( a_n x_k^{\,n} + a_{n-1} x_k^{\,n-1} + \cdots + a_0 \big) \Big)^2$$

• For the FONC:

$$\frac{\partial f}{\partial a_i} = -2 \sum_{k=1}^{m} x_k^{\,i} \Big( g(x_k) - \big( a_n x_k^{\,n} + \cdots + a_0 \big) \Big) = 0, \qquad i = 0, 1, \ldots, n,$$

or

$$a_n \sum_{k=1}^{m} x_k^{\,n+i} + a_{n-1} \sum_{k=1}^{m} x_k^{\,n-1+i} + \cdots + a_0 \sum_{k=1}^{m} x_k^{\,i} = \sum_{k=1}^{m} x_k^{\,i}\, g(x_k), \qquad i = 0, 1, \ldots, n.$$
Example: Function Approximation
• In matrix form (row i, column j, with i, j = 0, 1, …, n):

$$\left[ \sum_{k=1}^{m} x_k^{\,i+j} \right]_{(n+1)\times(n+1)} \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_n \end{bmatrix} = \left[ \sum_{k=1}^{m} x_k^{\,i}\, g(x_k) \right]_{(n+1)\times 1}$$

• This leads directly to a system of (n + 1) linear equations (the normal equations), which can be solved to determine the aᵢ's.
Example: Function Approximation
• Let

$$\mathbf{a} = \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_n \end{bmatrix}_{(n+1)\times 1}, \qquad A = \big[ a_{ij} \big]_{(n+1)\times(n+1)}, \;\; a_{ij} = \sum_{k=1}^{m} x_k^{\,i+j}, \qquad \mathbf{b} = \big[ b_j \big]_{(n+1)\times 1}, \;\; b_j = \sum_{k=1}^{m} x_k^{\,j}\, g(x_k), \qquad c = \sum_{k=1}^{m} g(x_k)^2.$$
Example: Function Approximation
• The problem can then be stated as the following quadratic form:

$$\min f(a_n, \ldots, a_0) = \mathbf{a}^T A\, \mathbf{a} - 2\,\mathbf{b}^T \mathbf{a} + c$$

• Then, as said before, the solution is determined by solving the following system of (n + 1) equations:

$$A\, \mathbf{a} = \mathbf{b}$$

• It should be noted that this answer is exactly the solution of the least-squares (LSE) problem.
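A numerical sketch of this fit (the data points below are made up for illustration; NumPy's polyfit solves the same least-squares problem internally):

```python
import numpy as np

# made-up observations of an unknown function g at m = 7 points
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
g = np.array([1.1, 1.8, 3.2, 5.1, 7.4, 10.2, 13.9])

n = 2                                      # fit a quadratic, n < m
V = np.vander(x, n + 1, increasing=True)   # columns: x^0, x^1, ..., x^n

A = V.T @ V            # A[i, j] = sum_k x_k^(i+j), as derived above
b = V.T @ g            # b[i]    = sum_k x_k^i * g(x_k)
a = np.linalg.solve(A, b)

print(a)                              # coefficients a_0, a_1, a_2
print(np.polyfit(x, g, n)[::-1])      # same answer via NumPy's least squares
```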
Conditions for Local Minimizers
Definition
• The Hessian matrix of f(X), H(x), is the n×n matrix whose i-th row consists of the partial derivatives of ∂f/∂x_j (j = 1, 2, …, n) with respect to xᵢ (i = 1, 2, …, n):

$$H(\mathbf{x}) = \nabla^2 f(\mathbf{x}) = \begin{bmatrix} \dfrac{\partial^2 f}{\partial x_1^2} & \dfrac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n} \\ \dfrac{\partial^2 f}{\partial x_2 \partial x_1} & \dfrac{\partial^2 f}{\partial x_2^2} & \cdots & \dfrac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f}{\partial x_n \partial x_1} & \dfrac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$
Example

$$f(x_1, x_2, x_3) = 7 - 3x_2^2 + 4x_1 x_3^3$$

$$\nabla f = \begin{bmatrix} 4x_3^3 \\ -6x_2 \\ 12x_1 x_3^2 \end{bmatrix}, \qquad \nabla^2 f = \begin{bmatrix} 0 & 0 & 12x_3^2 \\ 0 & -6 & 0 \\ 12x_3^2 & 0 & 24x_1 x_3 \end{bmatrix}$$
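The gradient and Hessian above can be reproduced symbolically (a SymPy sketch):

```python
import sympy as sp

x1, x2, x3 = sp.symbols('x1 x2 x3')
f = 7 - 3*x2**2 + 4*x1*x3**3

grad = [sp.diff(f, v) for v in (x1, x2, x3)]
H = sp.hessian(f, (x1, x2, x3))

print(grad)   # [4*x3**3, -6*x2, 12*x1*x3**2]
print(H)      # Matrix([[0, 0, 12*x3**2], [0, -6, 0], [12*x3**2, 0, 24*x1*x3]])
```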
Conditions for Local Minimizers
Theorem: Second-Order Necessary Condition (SONC)
• Let Ω ⊂ ℜⁿ, f ∈ C² a real-valued function on Ω, x* a local minimizer of f over Ω, and d a feasible direction at x*. If dᵀ∇f(x*) = 0, then
dᵀH(x*)d ≥ 0
where H is the Hessian of f.
Corollary: Interior Case
• Let x* be an interior point of Ω ⊂ ℜⁿ. If x* is a local minimizer of f : Ω → ℜ, f ∈ C², then
∇f(x*) = 0 and dᵀH(x*)d ≥ 0 for all d ∈ ℜⁿ,
i.e., H(x*) is positive semi-definite.
Conditions for Local Minimizers
Theorem: Second-Order Sufficient Condition (SOSC), Interior Case
• Let f ∈ C² be defined on a region in which x* is an interior point. Suppose that:
1. ∇f(x*) = 0,
2. H(x*) is positive definite, i.e., dᵀH(x*)d > 0 for all d ≠ 0.
Then x* is a strict local minimizer of f.
Conditions for Local Minimizers
Theorem
• Let X₀ be a stationary point of f(X). A sufficient condition for X₀ to be a
- local minimum of f(X) is that the Hessian matrix H(X₀) is positive definite;
- local maximum of f(X) is that the Hessian matrix H(X₀) is negative definite.
If H(X₀) is neither negative definite nor positive definite:
- If det H(X₀) = 0, then X₀ may be a local minimum, a local maximum, or a saddle point (the test is inconclusive).
- If det H(X₀) ≠ 0, then X₀ is not an optimum (it is a saddle point).
Conditions for Local Minimizers
Corollary
• If the Hessian matrix H(X) is indefinite at a point X₀ where the necessary conditions are satisfied, then the point X₀ is not an extreme point.
Conditions for Local Minimizers
Question
• How can a sufficient condition be established when H(X) is only semi-definite?
Review: Necessary and Sufficient Conditions
Local minimum
• Necessary conditions
- First-order (FONC): ∇f(x₀) = 0 (x₀ is a stationary point)
- Second-order (SONC): H(x₀) = ∇²f(x₀) is positive semi-definite
• Sufficient conditions
- First-order (FOSC): ∇f(x₀) = 0 (x₀ is a stationary point)
- Second-order (SOSC): H(x₀) = ∇²f(x₀) is positive definite
Global minimum
• Compare all local minima.
Example
• Find the stationary points of the function
f(x₁, x₂, x₃) = 2x₁x₂x₃ − 4x₁x₃ − 2x₂x₃ + x₁² + x₂² + x₃² − 2x₁ − 4x₂ + 4x₃
and hence find the extrema of f.
Solution:

$$\frac{\partial f}{\partial x_1} = 2x_2 x_3 - 4x_3 + 2x_1 - 2 = 0 \qquad (1)$$
$$\frac{\partial f}{\partial x_2} = 2x_1 x_3 - 2x_3 + 2x_2 - 4 = 0 \qquad (2)$$
$$\frac{\partial f}{\partial x_3} = 2x_1 x_2 - 4x_1 - 2x_2 + 2x_3 + 4 = 0 \qquad (3)$$
• Substituting for x₂ in (3):
2x₁ + x₁x₃ − x₁²x₃ − 2x₁ − 2 − x₃ + x₁x₃ + x₃ = −2
or
x₁x₃(2 − x₁) = 0
thus x₁ = 0 or x₃ = 0 or x₁ = 2.
• Case (i) x₁ = 0:
(1) ⇒ x₂x₃ − 2x₃ = 1 (4)
(2) ⇒ x₂ − x₃ = 2 (5)
(3) ⇒ −x₂ + x₃ = −2, same as (5)
(4) using (5) ⇒ x₃(2 + x₃) − 2x₃ = 1, i.e., x₃² = 1, so x₃ = ±1.
Example
- Sub-case (i): x₃ = 1 (using (5)) ⇒ x₂ = 3
- Sub-case (ii): x₃ = −1 (using (5)) ⇒ x₂ = 1
This gives two stationary points: (0, 3, 1) and (0, 1, −1).
• Case (ii) x₃ = 0:
(1) ⇒ x₁ = 1
(2) ⇒ x₂ = 2
and (3) holds: x₁x₂ − 2x₁ − x₂ = −2 ✓ Therefore the stationary point is (1, 2, 0).
• Case (iii) x₁ = 2:
(1) ⇒ x₂x₃ − 2x₃ = −1 (6)
(2) ⇒ x₂ + x₃ = 2 (7)
(3) ⇒ x₂ + x₃ = 2, same as (7)
Example
(6) using (7) ⇒ x₃(2 − x₃) − 2x₃ = −1, i.e., x₃² = 1, so x₃ = ±1.
- Sub-case (i): x₃ = 1 ⇒ (using (7)) x₂ = 1
- Sub-case (ii): x₃ = −1 ⇒ (using (7)) x₂ = 3
This gives two stationary points: (2, 1, 1) and (2, 3, −1).
• The Hessian matrix:

$$H(X) = \begin{bmatrix} 2 & 2x_3 & 2x_2 - 4 \\ 2x_3 & 2 & 2x_1 - 2 \\ 2x_2 - 4 & 2x_1 - 2 & 2 \end{bmatrix}$$
Example

Point      | Leading principal minors | Nature
(0, 3, 1)  | 2, 0, −32                | Saddle point
(0, 1, −1) | 2, 0, −32                | Saddle point
(1, 2, 0)  | 2, 4, 8                  | Local min
(2, 1, 1)  | 2, 0, −32                | Saddle point
(2, 3, −1) | 2, 0, −32                | Saddle point
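A numerical cross-check of this table (a sketch; the eigenvalues of the Hessian give the same classification as the leading principal minors):

```python
import numpy as np

def hessian(x1, x2, x3):
    return np.array([[2.0,         2.0*x3,      2.0*x2 - 4],
                     [2.0*x3,      2.0,         2.0*x1 - 2],
                     [2.0*x2 - 4,  2.0*x1 - 2,  2.0       ]])

for p in [(0, 3, 1), (0, 1, -1), (1, 2, 0), (2, 1, 1), (2, 3, -1)]:
    ev = np.linalg.eigvalsh(hessian(*p))
    kind = ("local min" if np.all(ev > 0) else
            "local max" if np.all(ev < 0) else "saddle point")
    print(p, np.round(ev, 3), kind)   # only (1, 2, 0) comes out a local min
```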
Convex & Concave Functions
Definition
• A function f(X) = f(x₁, x₂, …, x_n) of n variables is said to be convex if, for each pair of points X₁, X₂ on the graph, the line segment joining these two points lies entirely above or on the graph, i.e.,
f((1 − α)X₁ + αX₂) ≤ (1 − α)f(X₁) + αf(X₂) for all α, 0 ≤ α ≤ 1.
• f is said to be strictly convex if, for each pair of distinct points X₁ ≠ X₂ on the graph,
f((1 − α)X₁ + αX₂) < (1 − α)f(X₁) + αf(X₂) for all α, 0 < α < 1.
• f is called concave (strictly concave) if −f is convex (strictly convex).
Example: convex function
• f(x) = x²
[Figure: the parabola y = x²; the chord between X₁ and X₂ lies above the graph, with the point (1 − α)X₁ + αX₂ marked on the x axis]
Example: concave function
• f(x) = −x²
[Figure: the parabola y = −x²; the chord between X₁ and X₂ lies below the graph]
Example: non-convex/non-concave function
[Figure: a function with both convex and concave regions between X₁ and X₂]
Convexity for a function of one variable
• Convex: d²f/dx² ≥ 0
• Concave: d²f/dx² ≤ 0
Convexity test for functions of 2 variables

Principal minors    | Convex | Strictly convex | Concave | Strictly concave
f_xx                | ≥ 0    | > 0             | ≤ 0     | < 0
f_xx f_yy − (f_xy)² | ≥ 0    | > 0             | ≥ 0     | > 0

$$H(X) = \begin{bmatrix} \dfrac{\partial^2 f}{\partial x^2} & \dfrac{\partial^2 f}{\partial x\,\partial y} \\ \dfrac{\partial^2 f}{\partial y\,\partial x} & \dfrac{\partial^2 f}{\partial y^2} \end{bmatrix} = \begin{bmatrix} f_{xx} & f_{xy} \\ f_{yx} & f_{yy} \end{bmatrix}$$
Example
• Find whether f is convex, concave, or neither:
f(x₁, x₂) = 3x₁ + 5x₂ − 4x₁² + x₂² − 5x₁x₂
Solution
• Put f in matrix form:

$$f(x_1, x_2) = \begin{bmatrix} 3 & 5 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} x_1 & x_2 \end{bmatrix} \begin{bmatrix} -4 & -\tfrac{5}{2} \\ -\tfrac{5}{2} & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \;\Rightarrow\; A = \begin{bmatrix} -4 & -\tfrac{5}{2} \\ -\tfrac{5}{2} & 1 \end{bmatrix}$$
Example

$$\det(A - \lambda I) = \det \begin{bmatrix} -4 - \lambda & -\tfrac{5}{2} \\ -\tfrac{5}{2} & 1 - \lambda \end{bmatrix} = 0 \;\Rightarrow\; \lambda^2 + 3\lambda - \tfrac{41}{4} = 0$$

$$\lambda_1 = \frac{-3 + \sqrt{50}}{2} > 0, \qquad \lambda_2 = \frac{-3 - \sqrt{50}}{2} < 0$$

• Since one eigenvalue is negative and the other is positive, A is neither positive definite nor negative definite, which implies that f is neither convex nor concave.
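Checking the eigenvalues numerically (NumPy):

```python
import numpy as np

A = np.array([[-4.0, -2.5],
              [-2.5,  1.0]])
print(np.linalg.eigvalsh(A))   # approx [-5.036, 2.036]: one of each sign -> indefinite
```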
Example
• Find the local minimum of f(x) = x³.
Solution
∇f(x) = 3x² = 0 ⇒ x = 0; H(x) = 6x ⇒ H(0) = 0.
• The point x = 0 satisfies the FONC and the SONC, but not the SOSC.
Example
• Find the local minimum of f(x₁, x₂) = x₁² + x₂².
Solution

$$\nabla f(x_1, x_2) = \begin{bmatrix} 2x_1 \\ 2x_2 \end{bmatrix} = 0 \;\Rightarrow\; X = \begin{bmatrix} 0 \\ 0 \end{bmatrix}; \qquad H(X) = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix} \text{ for all } X \in \Re^2 \;\Rightarrow\; H(X) \text{ is positive definite}$$

• The point X = [0, 0]ᵀ satisfies the FONC, SONC, and SOSC. It is a strict local minimizer.
• Actually, X = [0, 0]ᵀ is a strict global minimizer.
Example
• Find the local minimum of f(x₁, x₂) = x₁² − x₂².
Solution

$$\nabla f(x_1, x_2) = \begin{bmatrix} 2x_1 \\ -2x_2 \end{bmatrix} = 0 \;\Rightarrow\; X = \begin{bmatrix} 0 \\ 0 \end{bmatrix}; \qquad H(X) = \begin{bmatrix} 2 & 0 \\ 0 & -2 \end{bmatrix} \text{ for all } X \in \Re^2 \;\Rightarrow\; H(X) \text{ is indefinite}$$

• The point X = [0, 0]ᵀ satisfies the FONC, but the SONC is not satisfied.
• It is a saddle point.
Importance of Concave Functions in NLP
• Suppose that we have an NLP with the following properties:
- the feasible region, say Ω, is convex;
- the objective function, say f, is concave;
- the objective is to maximize the value of the objective function:
max z = f(x) s.t. x ∈ Ω
Then:
• Any local maximum is a global maximum!
Data Mining, Spring 2011TMU, M.M.Pedram, [email protected]
Importance of Concave Functions in NLPproofv Suppose there exists a solution x’ that is a local maximum, but not a global
maximum. Since x’ is not a global maximum, there exists a solution x with the property that f(x) > f(x’).
v Now use the fact that f is concave Since f is concave, we have that
v If α is very close to 1, then α x’+(1-α)x is very close to x’ f (αx’+(1-α)x ) > f (x’ )
v Therefore, x’ cannot be a local optimum. This is a contradiction to our assumption that there exists a local maximum that is not a global maximum!
( ) ( ) ( )( ) ( )
( )
(1 ) (1 ) 0 1
(1 )
f x x f x f x
f x f x
f x
α α α α α
α α
′ ′+ − ≥ + − < <
′ ′> + −
′>
, for all
Data Mining, Spring 2011TMU, M.M.Pedram, [email protected]
Importance of Convex Functions in NLP
• Suppose that we have an NLP with the following properties:
- the feasible region, say Ω, is convex;
- the objective function, say f, is convex;
- the objective is to minimize the value of the objective function:
min z = f(x) s.t. x ∈ Ω
Then:
• Any local minimum is a global minimum!
• The Hessian matrix of a convex function is positive semi-definite.
Importance of Convex Functions in NLP
Another reason:
• Basic optimization algorithms search for local optima. Those that try to find global optima generally just run the underlying algorithms several times, starting from different solutions.
Properties of Convex/Concave Functions
Theorem
• Let f ∈ C¹. Then f is convex over a convex set Ω if and only if
f(x) ≥ f(x₀) + ∇f(x₀)ᵀ(x − x₀)
for all x, x₀ ∈ Ω.
• A convex function lies above its tangent planes. [Figure: f(x) with its tangent line at x₀]
Properties of Convex/Concave Functions
Theorem
• Let f ∈ C². Then f is convex over a convex set Ω containing an interior point if and only if the Hessian matrix H of f is positive semi-definite throughout Ω.
Properties of Convex/Concave Functions
Notes
1. The Hessian matrix is the generalization to ℜⁿ of the concept of the curvature of a function, and correspondingly, positive definiteness of the Hessian is the generalization of positive curvature. Convex functions have positive (or at least non-negative) curvature in every direction.
2. We sometimes refer to a function as being locally convex if its Hessian matrix is positive semi-definite in a small region, and locally strictly convex if the Hessian is positive definite in that region.
Properties of Convex/Concave Functions
Theorem 1
• Let f be a convex function defined on the convex set Ω. Then the set Γ where f achieves its minimum is convex, and any relative minimum of f is a global minimum.
Properties of Convex/Concave Functions
Theorem 2
• Let f ∈ C¹ be convex on the convex set Ω. If there is a point x* ∈ Ω such that, for all y ∈ Ω, ∇f(x*)ᵀ(y − x*) ≥ 0, then x* is a global minimum point of f over Ω.
Properties of Convex/Concave Functions
Theorem 3
• Let f be a convex function defined on the bounded, closed convex set Ω. If f has a maximum over Ω, it is achieved at an extreme point of Ω.
• Extrema = minima/maxima
Constrained Optimization
Level Sets of a Function
Definition
• The level set of a function f : ℜⁿ → ℜ at level c is the set of points
S = {x | f(x) = c}
• For f : ℜ² → ℜ, we are usually interested in S when it is a curve.
• For f : ℜ³ → ℜ, the sets S most often considered are surfaces.
Non-linear optimization with constraints
Example #3:
• Product Mix Problem:
Max Z = 13x₁ + 11x₂ (Income) ….. Eq (4)
s.t.
4x₁ + 5x₂ ≤ 1500 (Storage Space)
5x₁ + 3x₂ ≤ 1575 (Raw Material)
x₁ + 2x₂ ≤ 420 (Production Rate)
x₁ ≥ 0, x₂ ≥ 0

Example #3:
[Figure: the feasible region in the (x₁, x₂) plane bounded by lines d₁, d₂, d₃, with x₁-intercepts 375, 315, 420 and x₂-intercepts 300, 525, 210; the optimum is at (270, 75)]
With slack variables:
Max Z = 13x₁ + 11x₂
s.t.
d₁: 4x₁ + 5x₂ + s₁ = 1500
d₂: 5x₁ + 3x₂ + s₂ = 1575
d₃: x₁ + 2x₂ + s₃ = 420
Z_max = 4335 at (x₁, x₂) = (270, 75)
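This LP is easy to verify numerically (a sketch using scipy.optimize.linprog, which minimizes by convention, so the objective is negated):

```python
from scipy.optimize import linprog

# maximize 13*x1 + 11*x2  <=>  minimize -13*x1 - 11*x2
c = [-13, -11]
A_ub = [[4, 5],    # storage space   <= 1500
        [5, 3],    # raw material    <= 1575
        [1, 2]]    # production rate <= 420
b_ub = [1500, 1575, 420]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x, -res.fun)   # [270.  75.]  4335.0
```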
Example #3:
[Figure: level curves of Z = 13x₁ + 11x₂ (values 1000 through 8000) over the feasible region; the optimum lies at the corner (270, 75)]
• f(x₁, x₂) = x₁² + x₂²
[Figure: surface plot of f over x₁, x₂ ∈ [−15, 15]]
[Figure: level sets of f(x₁, x₂) = x₁² + x₂² — concentric circles in the (x₁, x₂) plane]
• f(x₁, x₂) = x₁² − x₂²
[Figure: saddle-shaped surface plot of f and its level sets — hyperbolas in the (x₁, x₂) plane]
• Rosenbrock's function: f(x₁, x₂) = 100(x₂ − x₁²)² + (1 − x₁)²
[Figure: surface plot and banana-shaped level sets of Rosenbrock's function; the global minimum is at (1, 1)]
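Rosenbrock's function is a standard unconstrained test problem; a quick check with SciPy (a sketch, using BFGS from the customary starting point (−1.2, 1)):

```python
import numpy as np
from scipy.optimize import minimize

def rosen(x):
    return 100.0 * (x[1] - x[0]**2)**2 + (1.0 - x[0])**2

res = minimize(rosen, x0=np.array([-1.2, 1.0]), method='BFGS')
print(res.x)          # close to [1. 1.], the global minimizer
print(rosen(res.x))   # close to 0
```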
• Peaks function:
[Figure: surface plot of the peaks function over [−3, 3]², showing several local maxima and minima]
The Importance of Level Sets
Theorem
• The vector ∇f(x₀) is orthogonal to the tangent vector of an arbitrary smooth curve passing through x₀ on the level set determined by f(x) = f(x₀).
The Importance of Convexity
• As said before, convexity guarantees that a local optimum is a global optimum.
Possible Optimal Solutions to Convex NLPs (not occurring at corner points)
[Figure: four panels, each showing a feasible region, objective-function level curves, and the optimal solution:
- linear objective, nonlinear constraints
- nonlinear objective, nonlinear constraints
- nonlinear objective, linear constraints
- nonlinear objective, linear constraints (interior optimum)]
Local vs. Global Optimal Solutions for Nonconvex NLPs
[Figure: a nonconvex feasible region in the (x₁, x₂) plane with points A through G marked; several are local optimal solutions, and one is both a local and the global optimal solution]
NLP with Equality Constraints

$$\min f(\mathbf{x}), \qquad \mathbf{x} = [x_1 \; x_2 \; \cdots \; x_n]^T$$
$$\text{s.t. } h_j(\mathbf{x}) = 0, \quad j = 1, 2, \ldots, m$$
The Lagrangian Function
• Let us introduce the Lagrangian function, L(x, λ), as:
L(x, λ) = f(x) + λᵀh(x)
where

$$\mathbf{h}(\mathbf{x}) = \begin{bmatrix} h_1(\mathbf{x}) \\ h_2(\mathbf{x}) \\ \vdots \\ h_m(\mathbf{x}) \end{bmatrix}, \qquad \boldsymbol{\lambda} = \begin{bmatrix} \lambda_1 \\ \lambda_2 \\ \vdots \\ \lambda_m \end{bmatrix} \quad \text{(the Lagrange multipliers, or dual vector)}$$

• The notation ∇x f(x, y) means the gradient of f with respect to x. Thus:
∇x L(x, λ) = ∇x f(x) + λᵀ∇x h(x)
First Order Necessary Conditions (FONC)
Theorem: Lagrange's Theorem (FONC)
• Let x* be a local minimizer (or maximizer) of f : ℜⁿ → ℜ, subject to h(x) = 0_{m×1}, h : ℜⁿ → ℜᵐ, m ≤ n. Assume that x* is a regular point. Then there exists λ* ∈ ℜᵐ such that:
∇x L(x*, λ*) = ∇x f(x*) + λ*ᵀ∇x h(x*) = 0_{1×n}
• Note that the constraints are in the form h(x) = 0_{m×1}. Thus the above FONC can be stated as:
∇x L(x, λ) = 0_{1×n}
∇λ L(x, λ) = 0_{1×m}
First Order Necessary Conditions (FONC)
• The Lagrangian can be thought of as an unconstrained optimization problem in the variables x₁, x₂, …, x_n and λ₁, λ₂, …, λ_m. The problem can be solved by solving the equations:

$$\nabla_x L = \mathbf{0}_{n \times 1}, \quad \text{or} \quad \frac{\partial L}{\partial x_i} = 0 \;\; \text{for } i = 1, 2, \ldots, n$$
$$\nabla_\lambda L = \mathbf{0}_{m \times 1}, \quad \text{or} \quad \frac{\partial L}{\partial \lambda_j} = 0 \;\; \text{for } j = 1, 2, \ldots, m$$
First Order Necessary Conditions (FONC)
• For n = 2 and m = 1, i.e., when f is a function of 2 variables and there is only one constraint, Lagrange's FONC for a local minimizer (or maximizer) x* is stated as:
∇f(x*) + λ*∇h(x*) = 0
• The equation means that ∇f(x*) and ∇h(x*) must be collinear — pointing in the same or exactly opposite directions, depending on the sign of λ* — at a minimum or maximum point!
Example
• Consider the problem:
min x₁ + x₂
s.t.: x₁² + x₂² − 1 = 0
• The feasible region is a circle of radius one. The objective-function level curves are lines with a slope of −1. The minimum is the point where the lowest such line still touches the circle.
Example
[Figure: the unit circle (feasible region) in the (x₁, x₂) plane with level lines f(x) = 1, f(x) = 0, and f(x) = −1.414; the gradient ∇f(x) points in the direction of increasing f, and the point x* = [0.707, 0.707]ᵀ is marked]
Example
• Since the objective-function level curves are straight parallel lines, the gradient of f is constant, pointing in the direction of increasing f — toward the upper right.
• The gradient of h points outward from the circle, so its direction depends on the point at which the gradient is evaluated.
Example
[Figure: the unit circle with ∇f(x*) and ∇h(x*) collinear at x* = [0.707, 0.707]ᵀ, and a second point x¹ where ∇f(x¹) and ∇h(x¹) are not parallel; the tangent plane at x¹ is shown]
Conclusions
• At the optimum point, ∇f(x) is parallel to ∇h(x).
• As we can see at the point x¹, ∇f(x¹) is not parallel to ∇h(x¹), and we can move (down along the circle) to improve the objective function.
• We can say that at a max or min, ∇f(x) must be parallel to ∇h(x); otherwise, we could improve the objective function by changing position.
Example
• Using the FONC for the previous example:
L(x, λ) = f(x) + λh(x) = (x₁ + x₂) + λ(x₁² + x₂² − 1)
• And the FONC equations are:
∂L/∂xᵢ = 0 for i = 1, 2
∂L/∂λ = 0
Example
• This becomes:
∂L/∂x₁ = 1 + 2λx₁ = 0
∂L/∂x₂ = 1 + 2λx₂ = 0
∂L/∂λ = x₁² + x₂² − 1 = 0
• There are three equations and three unknowns. Solving the system:
x₁ = x₂ = ±0.707, λ = ∓0.707
• It can be seen from the graph that positive x₁ and x₂ correspond to the maximum, while negative x₁ and x₂ correspond to the minimum.
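The same FONC system can be solved symbolically (a SymPy sketch):

```python
import sympy as sp

x1, x2, lam = sp.symbols('x1 x2 lam', real=True)
L = x1 + x2 + lam * (x1**2 + x2**2 - 1)

eqs = [sp.diff(L, v) for v in (x1, x2, lam)]
print(sp.solve(eqs, [x1, x2, lam]))
# two solutions: (x1, x2, lam) = (±sqrt(2)/2, ±sqrt(2)/2, ∓sqrt(2)/2)
```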
Limitations of FONC
• The FONC do not guarantee that the solution(s) will be minima/maxima.
• As in the case of unconstrained optimization, they only provide us with candidate points that need to be verified by the second-order conditions.
• If the problem is convex, then the FONC do guarantee that the solutions are extreme points.
Some Definitions
• Let:
- ∇²ₓL(x, λ): the Hessian matrix of L(x, λ) = f(x) + λᵀh(x) with respect to x,
- ∇²f(x): the Hessian matrix of f(x),
- ∇²hⱼ(x): the Hessian matrix of hⱼ(x), j = 1, 2, …, m.
Then:
∇²ₓL(x, λ) = ∇²f(x) + λ₁∇²h₁(x) + λ₂∇²h₂(x) + … + λ_m∇²h_m(x)
Some Definitions
• The tangent space at a point x* on the surface
S = {x ∈ ℜⁿ : h(x) = 0}
is the set T(x*) = {y | ⟨∇h(x*), y⟩ = 0}.
Tangent Plane
Example
• Let S = {x ∈ ℜ³ : h₁(x) = x₁ = 0, h₂(x) = x₁ − x₂ = 0}; then:

$$\nabla \mathbf{h}(\mathbf{x}) = \begin{bmatrix} \nabla h_1(\mathbf{x})^T \\ \nabla h_2(\mathbf{x})^T \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 1 & -1 & 0 \end{bmatrix}$$

$$T(\mathbf{x}) = \{\mathbf{y} \mid \nabla h_1(\mathbf{x})^T \mathbf{y} = 0,\; \nabla h_2(\mathbf{x})^T \mathbf{y} = 0\} = \left\{ \mathbf{y} \,\middle|\, \begin{bmatrix} 1 & 0 & 0 \\ 1 & -1 & 0 \end{bmatrix} \mathbf{y} = \mathbf{0} \right\} = \left\{ \begin{bmatrix} 0 \\ 0 \\ r \end{bmatrix} \,\middle|\, r \in \Re \right\},$$

i.e., the x₃ axis in ℜ³.
Second Order Necessary Conditions (SONC)
Theorem: SONC
• Let x* be a local minimizer of f : ℜⁿ → ℜ, subject to h(x) = 0_{m×1}, h : ℜⁿ → ℜᵐ, m ≤ n. Assume that x* is a regular point. Then there exists λ* ∈ ℜᵐ such that:
1. ∇x L(x*, λ*) = ∇x f(x*) + λ*ᵀ∇x h(x*) = 0
2. For all y ∈ T(x*), we have yᵀ∇²ₓL(x*, λ*) y ≥ 0.
Second Order Sufficient Conditions (SOSC)
Theorem: SOSC
• Let f, h ∈ C², and suppose there are a point x* ∈ ℜⁿ and a λ* ∈ ℜᵐ such that:
1. ∇x L(x*, λ*) = ∇x f(x*) + λ*ᵀ∇x h(x*) = 0
2. For all y ∈ T(x*), y ≠ 0, we have yᵀ∇²ₓL(x*, λ*) y > 0.
Then x* is a strict local minimizer of f subject to h(x) = 0_{m×1}.
The Tangent Space
[Figure: the constraint surface h(x) = 0 in (x₁, x₂, x₃) space, with the tangent plane (all possible y vectors) at x* and the normal vector ∇h(x*)]
• The tangent plane is the location of all y vectors and passes through x*; it must be orthogonal (perpendicular) to ∇h(x*).
Note
• Note the similarity between the Lagrangian approach and unconstrained optimization.
Maximization Problems
• Note that the previous definitions of the SONC & SOSC were for minimization problems!
• For maximization problems, the sense of the inequality sign is reversed (as in unconstrained optimization):
SONC: yᵀ∇²ₓL(x, λ)y ≤ 0
SOSC: yᵀ∇²ₓL(x, λ)y < 0
Necessary & Sufficient
• The necessary conditions are required for a point to be an extremum, but even if they are satisfied, they do not guarantee that the point is an extremum.
• If the sufficient conditions are true, then the point is guaranteed to be an extremum. But if they are not satisfied, this does not mean that the point is not an extremum.
Procedure
1. Solve the FONC to obtain candidate points.
2. Test the candidate points with the SONC:
- Eliminate any points that do not satisfy the SONC.
3. Test the remaining points with the SOSC:
- The points that satisfy it are minima/maxima.
- For the points that do not satisfy it, we cannot say whether they are extreme points or not.
NLP with Inequality Constraints
• Consider problems such as:
min f(x), x = [x₁ x₂ … x_n]ᵀ
s.t. hᵢ(x) = 0, i = 1, …, m
gⱼ(x) ≤ 0, j = 1, …, p
• An inequality constraint gⱼ(x) ≤ 0 is called "active" at x* if gⱼ(x*) = 0.
• Let the set I(x*) contain all the indices of the active constraints at x*:
gⱼ(x*) = 0 for all j in the set I(x*)
NLP with Inequality Constraints
• The generalized Lagrangian is written:

$$L(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\mu}) = f(\mathbf{x}) + \sum_{i=1}^{m} \lambda_i\, h_i(\mathbf{x}) + \sum_{j=1}^{p} \mu_j\, g_j(\mathbf{x})$$

• We use λ's for the equalities and μ's for the inequalities.
FONC for Equality & Inequality Constraints
Karush-Kuhn-Tucker (KKT) Theorem
• For the generalized Lagrangian, the FONC become:

$$\nabla_x L(\mathbf{x}^*, \boldsymbol{\lambda}^*, \boldsymbol{\mu}^*) = \nabla_x f(\mathbf{x}^*) + \sum_{i=1}^{m} \lambda_i^* \nabla_x h_i(\mathbf{x}^*) + \sum_{j=1}^{p} \mu_j^* \nabla_x g_j(\mathbf{x}^*) = \mathbf{0}_{n \times 1}$$

together with the complementary slackness conditions:

$$\mu_j^*\, g_j(\mathbf{x}^*) = 0, \qquad \mu_j^* \ge 0, \qquad j = 1, \ldots, p.$$
SONC for Equality & Inequality Constraints
• Non-negative Lagrange multiplier — two cases:
1. gⱼ(x) = 0 (the constraint is active),
2. gⱼ(x) < 0 → μⱼ = 0.
• The SONC (for a minimization problem) are: for all y ∈ T(x*), we have
yᵀ∇²ₓL(x*, λ*, μ*) y ≥ 0
where J(x*)·y = 0 as before.
• J(x*): the matrix of the gradients of all the equality constraints and only those inequality constraints that are active at x*.
SOSC for Equality & Inequality Constraints
• The SOSC for a minimization problem with equality & inequality constraints is:
yᵀ∇²ₓL(x*, λ*, μ*) y > 0 for all y ∈ T(x*), y ≠ 0.
Example
• Solve the problem:
min f(x) = (x₁ − 1)² + x₂²
s.t. h(x) = x₁² + x₂² + x₁ + x₂ = 0
g(x) = x₁ − x₂² ≤ 0
• The Lagrangian for this problem is:
L(x, λ, μ) = (x₁ − 1)² + x₂² + λ(x₁² + x₂² + x₁ + x₂) + μ(x₁ − x₂²)
Example
• The first-order necessary conditions:
∂L/∂x₁ = 2(x₁ − 1) + 2λx₁ + λ + μ = 0
∂L/∂x₂ = 2x₂ + 2λx₂ + λ − 2μx₂ = 0
∂L/∂λ = x₁² + x₂² + x₁ + x₂ = 0
μ(x₁ − x₂²) = 0
Example
• Solving the 4 FONC equations, we get 2 solutions:
1. x⁽¹⁾ = [0.2056, −0.4534]ᵀ, λ = 0.45, μ = 0.9537
2. x⁽²⁾ = [0, 0]ᵀ, λ = 0, μ = 2
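These candidates can be cross-checked numerically (a sketch with scipy.optimize.minimize and SLSQP; note SciPy's convention that inequality constraints are written as fun(x) ≥ 0, so g(x) = x₁ − x₂² ≤ 0 is passed as x₂² − x₁ ≥ 0):

```python
import numpy as np
from scipy.optimize import minimize

f = lambda x: (x[0] - 1)**2 + x[1]**2
cons = [
    {'type': 'eq',   'fun': lambda x: x[0]**2 + x[1]**2 + x[0] + x[1]},
    {'type': 'ineq', 'fun': lambda x: x[1]**2 - x[0]},   # x1 - x2^2 <= 0
]

res = minimize(f, x0=[0.5, -0.5], method='SLSQP', constraints=cons)
print(res.x)   # approximately [0.2056, -0.4534], the first candidate x(1)
```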
Example
• Now try the SONC at the 1st solution: both h(x) and g(x) are active at this point (they both equal zero). So the Jacobian consists of the gradients of both functions evaluated at x⁽¹⁾:

$$J(\mathbf{x}^{(1)}) = \begin{bmatrix} 2x_1 + 1 & 2x_2 + 1 \\ 1 & -2x_2 \end{bmatrix}_{\mathbf{x}^{(1)}} = \begin{bmatrix} 1.411 & 0.0932 \\ 1 & 0.9068 \end{bmatrix}$$
Example
• The only solution of the equation J(x⁽¹⁾)·y = 0 is y = [0, 0]ᵀ (the Jacobian is nonsingular). And the Hessian of the Lagrangian is:

$$\nabla_x^2 L = \begin{bmatrix} 2 + 2\lambda & 0 \\ 0 & 2 + 2\lambda - 2\mu \end{bmatrix}_{\mathbf{x}^{(1)}} = \begin{bmatrix} 2.9 & 0 \\ 0 & 0.993 \end{bmatrix}$$
Example
• So the SONC inequality reads:

$$\begin{bmatrix} 0 & 0 \end{bmatrix} \begin{bmatrix} 2.9 & 0 \\ 0 & 0.993 \end{bmatrix} \begin{bmatrix} 0 \\ 0 \end{bmatrix} = 0 \ge 0$$

• This inequality is true, so the SONC is satisfied for x⁽¹⁾ and it is still a candidate point.
Example
• The SOSC requires:
yᵀ∇²ₓL(x*, λ*, μ*) y > 0
We just calculated the left-hand side to be zero, since the only y in the tangent space is y = 0. So in our case, for x⁽¹⁾:
yᵀ∇²ₓL y = 0 ≯ 0
Thus the SOSC is not satisfied.
Example
• For the second solution, x⁽²⁾ = [0, 0]ᵀ: again, both h(x) and g(x) are active at this point. The Jacobian is:

$$J(\mathbf{x}^{(2)}) = \begin{bmatrix} 2x_1 + 1 & 2x_2 + 1 \\ 1 & -2x_2 \end{bmatrix}_{\mathbf{x}^{(2)}} = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}$$
Example
• The only solution of the equation J(x⁽²⁾)·y = 0 is y = [0, 0]ᵀ, and the Hessian of the Lagrangian is:

$$\nabla_x^2 L = \begin{bmatrix} 2 + 2\lambda & 0 \\ 0 & 2 + 2\lambda - 2\mu \end{bmatrix}_{\mathbf{x}^{(2)}} = \begin{bmatrix} 2 & 0 \\ 0 & -2 \end{bmatrix}$$
Example
• So the SONC inequality reads:

$$\begin{bmatrix} 0 & 0 \end{bmatrix} \begin{bmatrix} 2 & 0 \\ 0 & -2 \end{bmatrix} \begin{bmatrix} 0 \\ 0 \end{bmatrix} = 0 \ge 0$$

• This inequality is true, so the SONC is satisfied for x⁽²⁾ and it is still a candidate point.
Example
• The SOSC requires:
yᵀ∇²ₓL(x*, λ*, μ*) y > 0
Again, the left-hand side is zero, since the only y in the tangent space is y = 0. So in our case, for x⁽²⁾:
yᵀ∇²ₓL y = 0 ≯ 0
Thus the SOSC is not satisfied.
Example Conclusions
• So we can say that both x⁽¹⁾ and x⁽²⁾ may be local minima, but we cannot be sure, because the SOSC is not satisfied at either point.
Dual Problem
• Using the dual problem: constrained optimization → unconstrained optimization.
• Need to change maximization to minimization.
• Only valid when the original optimization problem is convex/concave (strong duality).
• Primal problem:
x* = argmax_x f(x) subject to g(x) = c
• Dual problem:
l(λ) = max_x ( f(x) + λ(g(x) − c) ), λ* = argmin_λ l(λ)
• When the problem is convex/concave (strong duality), the primal solution x* is recovered from the dual solution λ*.
Example:
max_{x,y} xy subject to x + y ≤ 6
• Introduce a Lagrange multiplier λ for the constraint 6 − (x + y) ≥ 0.
• Construct the Lagrangian:
L(x, y) = xy + λ(6 − x − y)
• KKT conditions:
∂L/∂x = y − λ = 0
∂L/∂y = x − λ = 0
⇒ x = y = λ
x + y ≤ 6 ⇒ λ ≤ 3
λ(6 − x − y) = 0
• Expressing the objective function using λ (with the sign flipped to turn the maximization into a minimization):
min_λ l(λ) = λ² − 6λ s.t. λ ≤ 3
• The solution is λ = 3, i.e., x = y = 3.
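A numeric sanity check of this example (a sketch; SciPy minimizes, so the objective is negated, and bounds x, y ≥ 0 are added here because without them xy is unbounded over x + y ≤ 6):

```python
from scipy.optimize import minimize

res = minimize(lambda v: -(v[0] * v[1]),       # maximize xy
               x0=[1.0, 1.0], method='SLSQP',
               bounds=[(0, None), (0, None)],  # assumption: x, y >= 0
               constraints=[{'type': 'ineq', 'fun': lambda v: 6 - v[0] - v[1]}])
print(res.x, res.x[0] * res.x[1])   # approximately [3. 3.] and 9.0
```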
Perceptron: Linear Separators
• Binary classification can be viewed as the task of separating classes in feature space:
[Figure: a separating hyperplane wᵀx + b = 0, with wᵀx + b > 0 on one side and wᵀx + b < 0 on the other]
f(x) = sign(wᵀx + b)
Linear Separators
• Which of the linear separators is optimal? [Figure: several separating lines for the same two-class data]
Classification Margin
• The distance from an example xᵢ to the separator is

$$r = \frac{\mathbf{w}^T \mathbf{x}_i + b}{\|\mathbf{w}\|}$$

• Examples closest to the hyperplane are support vectors.
• The margin ρ of the separator is the distance between support vectors. [Figure: separator with margin ρ and distance r]
Maximum Margin Classification
• Maximizing the margin is good.
• It implies that only support vectors matter; the other training examples are ignorable.
Linear SVM Mathematically
• Let the training set {(xᵢ, yᵢ)}, i = 1..n, xᵢ ∈ Rᵈ, yᵢ ∈ {−1, 1}, be separated by a hyperplane with margin ρ. Then for each training example (xᵢ, yᵢ):
wᵀxᵢ + b ≤ −ρ/2 if yᵢ = −1
wᵀxᵢ + b ≥ ρ/2 if yᵢ = 1
⇔ yᵢ(wᵀxᵢ + b) ≥ ρ/2
• For every support vector xₛ the above inequality is an equality. After rescaling w and b by ρ/2 in the equality, we obtain that the distance between each xₛ and the hyperplane is

$$r = \frac{y_s(\mathbf{w}^T \mathbf{x}_s + b)}{\|\mathbf{w}\|} = \frac{1}{\|\mathbf{w}\|}$$

• Then the margin can be expressed through the (rescaled) w and b as:

$$\rho = 2r = \frac{2}{\|\mathbf{w}\|}$$
Linear SVMs Mathematically (cont.)
• Then we can formulate the quadratic optimization problem:
Find w and b such that ρ = 2/‖w‖ is maximized and, for all (xᵢ, yᵢ), i = 1..n: yᵢ(wᵀxᵢ + b) ≥ 1.
• Which can be reformulated as:
Find w and b such that Φ(w) = ‖w‖² = wᵀw is minimized and, for all (xᵢ, yᵢ), i = 1..n: yᵢ(wᵀxᵢ + b) ≥ 1.
Linear SVMs Mathematically (cont.)
• Use Lagrangian theory to solve the optimization problem:

$$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\langle \mathbf{w}, \mathbf{w} \rangle - \sum_{i=1}^{l} \alpha_i \big[ y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle + b) - 1 \big]$$

• FOCs:

$$\frac{\partial L(\mathbf{w}, b, \boldsymbol{\alpha})}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{l} \alpha_i y_i \mathbf{x}_i = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{l} \alpha_i y_i \mathbf{x}_i$$

$$\frac{\partial L(\mathbf{w}, b, \boldsymbol{\alpha})}{\partial b} = \sum_{i=1}^{l} \alpha_i y_i = 0 \;\Rightarrow\; \sum_{i=1}^{l} \alpha_i y_i = 0$$
Linear SVMs Mathematically (cont.)
• Substitute w into the Lagrangian:

$$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\langle \mathbf{w}, \mathbf{w} \rangle - \sum_{i=1}^{l} \alpha_i \big[ y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle + b) - 1 \big] = \frac{1}{2}\sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle - \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle + \sum_{i=1}^{l} \alpha_i = \sum_{i=1}^{l} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle$$

• Dual problem:

$$\max_{\boldsymbol{\alpha}} W(\boldsymbol{\alpha}) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle$$
$$\text{s.t. } \sum_{i=1}^{l} y_i \alpha_i = 0, \qquad \alpha_i \ge 0, \; i = 1, \ldots, l$$
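A small end-to-end sketch of this dual on a two-point toy dataset (solved here with SciPy's generic SLSQP rather than a dedicated QP solver; all data are made up):

```python
import numpy as np
from scipy.optimize import minimize

# toy linearly separable data: one positive and one negative example
X = np.array([[2.0, 2.0],
              [0.0, 0.0]])
y = np.array([1.0, -1.0])
K = X @ X.T                      # Gram matrix <x_i, x_j>

def neg_W(a):                    # negate W(alpha) so we can minimize
    return -(a.sum() - 0.5 * (a * y) @ K @ (a * y))

res = minimize(neg_W, x0=np.zeros(2), method='SLSQP',
               bounds=[(0, None)] * 2,
               constraints=[{'type': 'eq', 'fun': lambda a: a @ y}])
alpha = res.x                    # [0.25, 0.25]
w = (alpha * y) @ X              # w = sum_i alpha_i y_i x_i = [0.5, 0.5]
b = y[0] - w @ X[0]              # from y_s (w.x_s + b) = 1  ->  b = -1
print(alpha, w, b)
```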
Support Vector Classification (cont.)
• Moving to the linearly non-separable situation: make the training sample linearly separable in the feature space implicitly defined by a kernel K(x, z) = ⟨φ(x), φ(z)⟩.
• Primal problem:

$$\min_{\mathbf{w}, b} \langle \mathbf{w}, \mathbf{w} \rangle \quad \text{s.t. } y_i(\langle \mathbf{w}, \phi(\mathbf{x}_i) \rangle + b) \ge 1, \; i = 1, \ldots, l$$

• Dual problem:

$$\max_{\boldsymbol{\alpha}} W(\boldsymbol{\alpha}) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle = \sum_{i=1}^{l} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j K(\mathbf{x}_i, \mathbf{x}_j)$$
$$\text{s.t. } \sum_{i=1}^{l} y_i \alpha_i = 0, \qquad \alpha_i \ge 0, \; i = 1, \ldots, l$$
Implementation Techniques
• What problem do we have?

$$\max_{\boldsymbol{\alpha}} W(\boldsymbol{\alpha}) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j K(\mathbf{x}_i, \mathbf{x}_j) \quad \text{s.t. } \sum_{i=1}^{l} y_i \alpha_i = 0, \; \alpha_i \ge 0, \; i = 1, \ldots, l$$

- This is a quadratic programming problem.
  - Standard software packages exist.
  - General (non-convex) QP is NP-hard; this convex instance is tractable, but generic solvers scale poorly (memory for the l×l kernel matrix).
- It is also a very special quadratic programming problem.