
Nonlinear Optimization

M.M. Pedram (pedram@tmu.ac.ir)

Tarbiat Moallem University of Tehran (Spring 2011)

References

- Edwin K. P. Chong, Stanislaw H. Zak, An Introduction to Optimization, John Wiley & Sons, 2nd Edition, 2001.
- David G. Luenberger, Linear and Nonlinear Programming, Addison-Wesley, 2nd Edition, 1989.
- S. S. Rao, Optimization: Theory and Applications, John Wiley & Sons, 2nd Edition, 1984.

Unconstrained Optimization


Quadratic Form

- Let $X \in \Re^n$:

$$X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$$

- Let $A = (a_{ij}) \in \Re^{n \times n}$ be an $n \times n$ symmetric matrix. The $k$-th order leading principal minor of $A$ is defined as the determinant of the $k \times k$ submatrix formed by the first $k$ rows and columns:

$$\Delta_k = \det \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1k} \\ a_{21} & a_{22} & \cdots & a_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ a_{k1} & a_{k2} & \cdots & a_{kk} \end{bmatrix}$$

Quadratic Form

- Consider the quadratic form:

$$Q(X) = X^T A X = \sum_{i=1}^{n} a_{ii} x_i^2 + 2 \sum_{1 \le i < j \le n} a_{ij} x_i x_j$$

- The above quadratic form (or simply the matrix A) is called:
  - Positive semi-definite if $X^T A X \ge 0$ for all $X$.
  - Positive definite if $X^T A X > 0$ for all $X \ne 0$.
  - Negative semi-definite if $X^T A X \le 0$ for all $X$.
  - Negative definite if $X^T A X < 0$ for all $X \ne 0$.
  - Indefinite if $X^T A X < 0$ for some $X$ and $X^T A X > 0$ for other $X$.

Quadratic Form: Theorems (Sylvester's Criterion)

- A quadratic form $X^T A X$, $A = A^T$, is positive definite (or positive semi-definite) if and only if all the leading principal minors of A are > 0 (or ≥ 0).
- A quadratic form $X^T A X$, $A = A^T$, is negative definite (or negative semi-definite) if and only if the k-th leading principal minor of A has the sign of $(-1)^k$, k = 1, 2, ..., n (or the k-th leading principal minor of A is zero or has the sign of $(-1)^k$, k = 1, 2, ..., n).

Quadratic Form: Theorems

- A symmetric matrix A is positive definite (or positive semi-definite) if and only if all eigenvalues of A are positive (or nonnegative).
- A symmetric matrix A is negative definite (or negative semi-definite) if and only if all eigenvalues of A are negative (or nonpositive).
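As a quick numerical illustration of the two tests above, the following sketch (assuming Python with NumPy; the test matrix is an arbitrary example, not one from the slides) checks Sylvester's criterion against the eigenvalue test:

```python
import numpy as np

def leading_principal_minors(A):
    """Determinants of the k-by-k upper-left submatrices, k = 1..n."""
    return [np.linalg.det(A[:k, :k]) for k in range(1, A.shape[0] + 1)]

A = np.array([[2.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 2.0]])      # an arbitrary symmetric test matrix

minors = leading_principal_minors(A)
eigvals = np.linalg.eigvalsh(A)       # eigenvalues of a symmetric matrix

print("leading minors:", np.round(minors, 3))   # [2. 3. 4.] -> all > 0
print("eigenvalues:", np.round(eigvals, 3))     # all > 0, as the theorem predicts
print("positive definite:", all(m > 0 for m in minors))
```

Both tests agree here: all leading principal minors and all eigenvalues are positive, so A is positive definite.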

Optima

- Let $f(X) = f(x_1, x_2, \dots, x_n)$ be a real-valued function of the $n$ variables $x_1, x_2, \dots, x_n$.
- Suppose

$$X_0 = (x_1^0, x_2^0, \dots, x_n^0), \qquad h = (h_1, h_2, \dots, h_n), \qquad X_0 + h = (x_1^0 + h_1, x_2^0 + h_2, \dots, x_n^0 + h_n).$$

  - A point $X_0$ is said to be a local maximum of $f(X)$ if there exists an $\varepsilon > 0$ such that $f(X_0) \ge f(X_0 + h)$ for all $|h_j| \le \varepsilon$.
  - A point $X_0$ is said to be a local minimum of $f(X)$ if there exists an $\varepsilon > 0$ such that $f(X_0) \le f(X_0 + h)$ for all $|h_j| \le \varepsilon$.
  - A point $X_0$ is said to be a strict local maximum of $f(X)$ if there exists an $\varepsilon > 0$ such that $f(X_0) > f(X_0 + h)$ for all $h \ne 0$ with $|h_j| \le \varepsilon$.
  - A point $X_0$ is said to be a strict local minimum of $f(X)$ if there exists an $\varepsilon > 0$ such that $f(X_0) < f(X_0 + h)$ for all $h \ne 0$ with $|h_j| \le \varepsilon$.

Optima

- A point $X_0$ is said to be an absolute maximum or global maximum of $f(X)$ if $f(X_0) \ge f(X)$ for all $X$.
- A point $X_0$ is said to be an absolute minimum or global minimum of $f(X)$ if $f(X_0) \le f(X)$ for all $X$.
- A point $X_0$ is said to be a strict absolute maximum or strict global maximum of $f(X)$ if $f(X_0) > f(X)$ for all $X \ne X_0$.
- A point $X_0$ is said to be a strict absolute minimum or strict global minimum of $f(X)$ if $f(X_0) < f(X)$ for all $X \ne X_0$.

Example

[Figure: a function of one variable with x1: strict global minimizer; x2: strict local minimizer; x3: local (not strict) minimizer]

Conditions for Local Minimizers

Theorem: First-Order Necessary Condition (FONC)
- Let $\Omega$ be a subset of $\Re^n$ and $f \in C^1$ a real-valued function on $\Omega$. If $x^*$ is a local minimizer of $f$ over $\Omega$, then for any feasible direction $d$ at $x^*$ we have

$$d^T \nabla f(x^*) \ge 0, \quad \text{i.e.,} \quad \langle \nabla f(x^*), d \rangle \ge 0.$$

Corollary: Interior case
- Let $\Omega$ be a subset of $\Re^n$ and $f \in C^1$ a real-valued function on $\Omega$. If $x^*$ is a local minimizer of $f$ over $\Omega$ and $x^*$ is an interior point of $\Omega$, then $\nabla f(x^*) = 0$.

Conditions for Local Minimizers

Theorem (another statement)
- A necessary condition for $x^*$ to be an optimum point of $f(x)$ is that $\nabla f(x^*) = 0$, i.e., all the first-order partial derivatives $\partial f / \partial x_i$ are zero at $x^*$.

Definition
- A point $x^*$ for which $\nabla f(x^*) = 0$ is called a stationary point of $f$. A stationary point is a potential candidate for a local maximum or a local minimum.

Example #1

- Illustration of the FONC for the constrained case: x1 does not satisfy the FONC; x2 satisfies the FONC. [Figure]

Example #2

- Consider the problem:

$$\min f(x_1, x_2) = x_1^2 + 0.5 x_2^2 + 3 x_2 + 4.5 \quad \text{s.t. } x_1, x_2 \ge 0$$

a. Is the FONC for a local minimizer satisfied at x = [1,3]^T?
b. Is the FONC for a local minimizer satisfied at x = [0,3]^T?
c. Is the FONC for a local minimizer satisfied at x = [1,0]^T?
d. Is the FONC for a local minimizer satisfied at x = [0,0]^T?

Solution
- $\nabla f(x_1, x_2) = [2x_1,\ x_2 + 3]^T$
- [Figure: a plot of the level sets of f]

Example #2

a. At x = [1,3]^T we have ∇f(x1,x2) = [2,6]^T. The point x = [1,3]^T is an interior point of Ω = {x : x1 ≥ 0, x2 ≥ 0}. Hence the FONC requires ∇f(x1,x2) = 0, so the point x = [1,3]^T does not satisfy the FONC for a local minimizer.

b. At x = [0,3]^T we have ∇f(x1,x2) = [0,6]^T, and hence d^T ∇f(x1,x2) = 6d2, where d = [d1,d2]^T. For d to be feasible at this point we need d1 ≥ 0, while d2 can take an arbitrary value in ℜ. The point x = [0,3]^T does not satisfy the FONC for a minimizer because d2 is allowed to be negative. For example, d = [1,−1]^T is a feasible direction, but d^T ∇f(x1,x2) = −6 < 0.

c. At x = [1,0]^T we have ∇f(x1,x2) = [2,3]^T, and hence d^T ∇f(x1,x2) = 2d1 + 3d2. For d to be feasible at this point we need d2 ≥ 0, while d1 can take an arbitrary value in ℜ. For example, d = [−5, 1]^T is a feasible direction, but d^T ∇f(x1,x2) = −7 < 0. Thus x = [1,0]^T does not satisfy the FONC for a local minimizer.

d. At x = [0,0]^T we have ∇f(x1,x2) = [0,3]^T, and hence d^T ∇f(x1,x2) = 3d2. For d to be feasible at this point we need d1 ≥ 0 and d2 ≥ 0. Hence x = [0,0]^T satisfies the FONC for a local minimizer.
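The four cases can also be checked numerically. Below is a small sketch (assuming Python with NumPy; the sample directions are illustrative choices consistent with the argument above):

```python
import numpy as np

def grad_f(x):
    # f(x1, x2) = x1**2 + 0.5*x2**2 + 3*x2 + 4.5
    return np.array([2.0 * x[0], x[1] + 3.0])

# (case, point, one sample feasible direction at that point)
tests = [
    ("a", np.array([1.0, 3.0]), np.array([-1.0, -1.0])),  # interior point: every d is feasible
    ("b", np.array([0.0, 3.0]), np.array([1.0, -1.0])),   # boundary x1 = 0: need d1 >= 0
    ("c", np.array([1.0, 0.0]), np.array([-5.0, 1.0])),   # boundary x2 = 0: need d2 >= 0
    ("d", np.array([0.0, 0.0]), np.array([1.0, 1.0])),    # corner: need d1 >= 0 and d2 >= 0
]
for name, x, d in tests:
    print(name, "d^T grad f =", d @ grad_f(x))
# a: -8, b: -6, c: -7 -> a feasible descent direction exists, so the FONC is violated
# d:  3              -> consistent with the FONC (3*d2 >= 0 for every feasible d)
```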

Example #3: Function Approximation

- Suppose that through an experiment the value of a function g is observed at m points $x_1, x_2, \dots, x_m$; thus the values $g(x_1), g(x_2), \dots, g(x_m)$ are known. We wish to approximate the function g by a polynomial

$$h(x) = a_n x^n + a_{n-1} x^{n-1} + \dots + a_0$$

of degree n (or less), where n < m. Find the $a_i$'s.

Solution
- Define $e_k = g(x_k) - h(x_k)$.
- The best approximation is the polynomial that minimizes the sum of the squares of these errors:

$$\min \sum_{k=1}^{m} e_k^2$$

Example: Function Approximation

- Equivalently:

$$\min f(a_n, a_{n-1}, \dots, a_0) = \sum_{k=1}^{m} \left( g(x_k) - \left( a_n x_k^n + a_{n-1} x_k^{n-1} + \dots + a_0 \right) \right)^2$$

- For the FONC:

$$\frac{\partial f}{\partial a_i} = 0 \;\Rightarrow\; -2 \sum_{k=1}^{m} x_k^i \left( g(x_k) - \left( a_n x_k^n + a_{n-1} x_k^{n-1} + \dots + a_0 \right) \right) = 0, \quad i = 0, 1, \dots, n,$$

or

$$\sum_{k=1}^{m} \left( a_n x_k^{n+i} + a_{n-1} x_k^{n-1+i} + \dots + a_0 x_k^i \right) = \sum_{k=1}^{m} x_k^i\, g(x_k), \quad i = 0, 1, \dots, n.$$

Example: Function Approximation

- In matrix form (row index i and column index j both running over 0, 1, ..., n):

$$\left[ \sum_{k=1}^{m} x_k^{i+j} \right]_{(n+1) \times (n+1)} \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_n \end{bmatrix} = \left[ \sum_{k=1}^{m} g(x_k)\, x_k^i \right]_{(n+1) \times 1}$$

- This leads directly to a system of (n+1) equations, which can be solved to determine the $a_i$'s.

Example: Function Approximation

- Let

$$A = [a_{ij}]_{(n+1) \times (n+1)}, \quad a_{ij} = \sum_{k=1}^{m} x_k^{i+j}; \qquad \mathbf{a} = \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_n \end{bmatrix}_{(n+1) \times 1};$$

$$\mathbf{b} = [b_j]_{(n+1) \times 1}, \quad b_j = \sum_{k=1}^{m} g(x_k)\, x_k^j; \qquad c = \sum_{k=1}^{m} \left( g(x_k) \right)^2.$$

Example: Function Approximation

- The problem can then be stated as the following quadratic form:

$$\min f(a_0, a_1, \dots, a_n) = \mathbf{a}^T A\, \mathbf{a} - 2\, \mathbf{b}^T \mathbf{a} + c$$

- Then, as said before, the solution is determined by solving the following system of (n+1) equations:

$$A\, \mathbf{a} = \mathbf{b}$$

- It should be noted that this answer is the solution to the least-squares estimation (LSE) problem.
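The normal equations above are straightforward to build and solve numerically. A minimal sketch (assuming Python with NumPy; the sampled function g, the sample points, and the degree n are illustrative choices, not from the slides):

```python
import numpy as np

x = np.linspace(0.0, 1.0, 7)      # m = 7 sample points
g = np.sin(2 * np.pi * x)         # observed values g(x_k)
n = 3                             # polynomial degree, n < m

V = np.vander(x, n + 1)           # Vandermonde matrix: columns x^n, ..., x, 1
A = V.T @ V                       # A_ij = sum_k x_k^(i+j)
b = V.T @ g                       # b_i  = sum_k g(x_k) x_k^i
a = np.linalg.solve(A, b)         # coefficients a_n, ..., a_0

# The same coefficients come out of a standard least-squares solver:
a_lstsq = np.linalg.lstsq(V, g, rcond=None)[0]
print(np.allclose(a, a_lstsq))    # True
```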

Conditions for Local Minimizers

Definition
- The Hessian matrix of $f(X)$, $H(\mathbf{x})$, is the $n \times n$ matrix whose $i$-th row consists of the partial derivatives of $\partial f / \partial x_j$ ($j = 1, 2, \dots, n$) with respect to $x_i$ ($i = 1, 2, \dots, n$):

$$H(\mathbf{x}) = \nabla^2 f(\mathbf{x}) = \begin{bmatrix} \dfrac{\partial^2 f}{\partial x_1^2} & \dfrac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n} \\ \dfrac{\partial^2 f}{\partial x_2 \partial x_1} & \dfrac{\partial^2 f}{\partial x_2^2} & \cdots & \dfrac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f}{\partial x_n \partial x_1} & \dfrac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$

Example

$$f(x_1, x_2, x_3) = 7x_1 - 3x_2^2 + 4x_1 x_3^3$$

$$\nabla f = \begin{bmatrix} 7 + 4x_3^3 \\ -6x_2 \\ 12 x_1 x_3^2 \end{bmatrix}, \qquad \nabla^2 f = \begin{bmatrix} 0 & 0 & 12 x_3^2 \\ 0 & -6 & 0 \\ 12 x_3^2 & 0 & 24 x_1 x_3 \end{bmatrix}$$
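A finite-difference check is a handy way to validate a hand-computed Hessian like the one above. A minimal sketch (assuming Python with NumPy, and using the function as reconstructed above):

```python
import numpy as np

def f(x):
    # f(x1, x2, x3) = 7*x1 - 3*x2^2 + 4*x1*x3^3
    return 7 * x[0] - 3 * x[1] ** 2 + 4 * x[0] * x[2] ** 3

def hessian_analytic(x):
    return np.array([[0.0, 0.0, 12 * x[2] ** 2],
                     [0.0, -6.0, 0.0],
                     [12 * x[2] ** 2, 0.0, 24 * x[0] * x[2]]])

def hessian_fd(f, x, h=1e-4):
    """Central finite-difference approximation of the Hessian."""
    n = len(x)
    H = np.zeros((n, n))
    I = np.eye(n)
    for i in range(n):
        for j in range(n):
            H[i, j] = (f(x + h * I[i] + h * I[j]) - f(x + h * I[i] - h * I[j])
                       - f(x - h * I[i] + h * I[j]) + f(x - h * I[i] - h * I[j])) / (4 * h * h)
    return H

x0 = np.array([1.0, 2.0, 3.0])
print(np.allclose(hessian_fd(f, x0), hessian_analytic(x0), atol=1e-3))  # True
```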

Conditions for Local Minimizers

Theorem: Second-Order Necessary Condition (SONC)
- Let $\Omega \subset \Re^n$, $f \in C^2$ a real-valued function on $\Omega$, $x^*$ a local minimizer of $f$ over $\Omega$, and $d$ a feasible direction at $x^*$. If $d^T \nabla f(x^*) = 0$, then

$$d^T H(x^*)\, d \ge 0,$$

where $H$ is the Hessian of $f$.

Corollary: Interior Case
- Let $x^*$ be an interior point of $\Omega \subset \Re^n$. If $x^*$ is a local minimizer of $f : \Omega \to \Re$, $f \in C^2$, then $\nabla f(x^*) = 0$ and $d^T H(x^*)\, d \ge 0$ for all $d \in \Re^n$, i.e., $H(x^*)$ is positive semi-definite.

Conditions for Local Minimizers

Theorem: Second-Order Sufficient Condition (SOSC), Interior Case
- Let $f \in C^2$ be defined on a region in which $x^*$ is an interior point. Suppose that:
  1. $\nabla f(x^*) = 0$,
  2. $H(x^*)$ is positive definite, i.e., $d^T H(x^*)\, d > 0$ for all $d \ne 0$.
- Then $x^*$ is a strict local minimizer of $f$.

Conditions for Local Minimizers

Theorem
- Let X0 be a stationary point of f(X). A sufficient condition for X0 to be a
  - local minimum of f(X) is that the Hessian matrix H(X0) is positive definite;
  - local maximum of f(X) is that the Hessian matrix H(X0) is negative definite.
- If H(X0) is neither negative definite nor positive definite:
  - If det H(X0) = 0, then X0 may be a local minimum, a local maximum, or a saddle point (the test is inconclusive).
  - If det H(X0) ≠ 0, then X0 is not an optimum (it is a saddle point).

Conditions for Local Minimizers

Corollary
- If the Hessian matrix H(X) is indefinite at X0, where the necessary conditions are satisfied, then the point X0 is not an extreme point.

Question
- How do we determine a sufficient condition when H(X) is semi-definite?

Review: Necessary and Sufficient Conditions

Local Minimum
- Necessary conditions
  - First-order (FONC): $\nabla f(x_0) = 0$ ($x_0$: stationary point)
  - Second-order (SONC): $H(x_0) = \nabla^2 f(x_0)$ is positive semi-definite.
- Sufficient conditions
  - First-order (FOSC): $\nabla f(x_0) = 0$ ($x_0$: stationary point)
  - Second-order (SOSC): $H(x_0) = \nabla^2 f(x_0)$ is positive definite.

Global Minimum
- Compare all local minima.

Example

- Find the stationary points of the function

$$f(x_1, x_2, x_3) = 2x_1x_2x_3 - 4x_1x_3 - 2x_2x_3 + x_1^2 + x_2^2 + x_3^2 - 2x_1 - 4x_2 + 4x_3$$

and hence find the extrema of f.

Solution:

$$\frac{\partial f}{\partial x_1} = 2x_2x_3 - 4x_3 + 2x_1 - 2 = 0 \quad (1)$$
$$\frac{\partial f}{\partial x_2} = 2x_1x_3 - 2x_3 + 2x_2 - 4 = 0 \quad (2)$$
$$\frac{\partial f}{\partial x_3} = 2x_1x_2 - 4x_1 - 2x_2 + 2x_3 + 4 = 0 \quad (3)$$

Example

From (2), $x_2 = 2 + x_3 - x_1x_3$. Substituting in (3) for $x_2$ (after dividing by 2):

$$2x_1 + x_1x_3 - x_1^2x_3 - 2x_1 - 2 - x_3 + x_1x_3 + x_3 = -2,$$

or

$$x_1x_3(2 - x_1) = 0;$$

thus $x_1 = 0$ or $x_3 = 0$ or $x_1 = 2$.

- Case (i) $x_1 = 0$:

(1) ⇒ $x_2x_3 - 2x_3 = 1$  (4)
(2) ⇒ $x_2 - x_3 = 2$  (5)
(3) ⇒ $-x_2 + x_3 = -2$, same as (5)

(4) using (5) ⇒ $x_3(2 + x_3) - 2x_3 = 1$, i.e., $x_3^2 = 1$, so $x_3 = \pm 1$.

Example

- Sub-case (i) $x_3 = 1$ (using (5)) ⇒ $x_2 = 3$.
- Sub-case (ii) $x_3 = -1$ (using (5)) ⇒ $x_2 = 1$.

There are two stationary points: (0, 3, 1) and (0, 1, −1).

- Case (ii) $x_3 = 0$:

(1) ⇒ $x_1 = 1$
(2) ⇒ $x_2 = 2$
(3) checks: $x_1x_2 - 2x_1 - x_2 = -2$ ✓

Therefore the stationary point is (1, 2, 0).

- Case (iii) $x_1 = 2$:

(1) ⇒ $x_2x_3 - 2x_3 = -1$  (6)
(2) ⇒ $x_2 + x_3 = 2$  (7)
(3) ⇒ $x_2 + x_3 = 2$, same as (7)

Example

(6) using (7) ⇒ $x_3(2 - x_3) - 2x_3 = -1$, i.e., $x_3^2 = 1$, so $x_3 = \pm 1$.

- Sub-case (i) $x_3 = 1$ ⇒ (using (7)) $x_2 = 1$.
- Sub-case (ii) $x_3 = -1$ ⇒ (using (7)) $x_2 = 3$.

There are two stationary points: (2, 1, 1) and (2, 3, −1).

The Hessian matrix:

$$H(X) = \begin{bmatrix} 2 & 2x_3 & 2x_2 - 4 \\ 2x_3 & 2 & 2x_1 - 2 \\ 2x_2 - 4 & 2x_1 - 2 & 2 \end{bmatrix}$$

Example

| Point      | Leading principal minors | Nature       |
|------------|--------------------------|--------------|
| (0, 3, 1)  | 2, 0, −32                | Saddle point |
| (0, 1, −1) | 2, 0, −32                | Saddle point |
| (1, 2, 0)  | 2, 4, 8                  | Local min    |
| (2, 1, 1)  | 2, 0, −32                | Saddle point |
| (2, 3, −1) | 2, 0, −32                | Saddle point |
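The classification in the table can be reproduced by inspecting the eigenvalues of H at each stationary point. A minimal sketch (assuming Python with NumPy):

```python
import numpy as np

def hessian(x1, x2, x3):
    return np.array([[2.0, 2 * x3, 2 * x2 - 4],
                     [2 * x3, 2.0, 2 * x1 - 2],
                     [2 * x2 - 4, 2 * x1 - 2, 2.0]])

points = [(0, 3, 1), (0, 1, -1), (1, 2, 0), (2, 1, 1), (2, 3, -1)]
for p in points:
    ev = np.linalg.eigvalsh(hessian(*p))
    if np.all(ev > 0):
        kind = "local min"
    elif np.all(ev < 0):
        kind = "local max"
    elif ev.min() < 0 < ev.max():
        kind = "saddle point"
    else:
        kind = "inconclusive (semi-definite)"
    print(p, np.round(ev, 3), kind)
# (1, 2, 0) has eigenvalues (2, 2, 2) -> local min; the other four points are saddles
```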

Convex & Concave Functions

Definition
- A function $f(X) = f(x_1, x_2, \dots, x_n)$ of n variables is said to be convex if for each pair of points $X_1, X_2$ on the graph, the line segment joining these two points lies entirely above or on the graph, i.e.,

$$f((1-\alpha)X_1 + \alpha X_2) \le (1-\alpha) f(X_1) + \alpha f(X_2) \quad \text{for all } \alpha,\ 0 \le \alpha \le 1.$$

- f is said to be strictly convex if for each pair of distinct points $X_1 \ne X_2$,

$$f((1-\alpha)X_1 + \alpha X_2) < (1-\alpha) f(X_1) + \alpha f(X_2) \quad \text{for all } \alpha,\ 0 < \alpha < 1.$$

- f is called concave (strictly concave) if −f is convex (strictly convex).

Example: convex function
- f(x) = x²  [Figure: the chord between X1 and X2 lies above the graph at (1−α)X1 + αX2]

Example: concave function
- f(x) = −x²  [Figure: the chord between X1 and X2 lies below the graph at (1−α)X1 + αX2]

Example: nonconvex/nonconcave function

[Figure: a function that is neither convex nor concave; the chord between X1 and X2 crosses the graph]

Convexity for a function of one variable

- Convex: $\dfrac{d^2 f}{dx^2} \ge 0$
- Concave: $\dfrac{d^2 f}{dx^2} \le 0$

Convexity test for functions of 2 variables

$$H(X) = \begin{bmatrix} \dfrac{\partial^2 f}{\partial x^2} & \dfrac{\partial^2 f}{\partial x \partial y} \\[6pt] \dfrac{\partial^2 f}{\partial y \partial x} & \dfrac{\partial^2 f}{\partial y^2} \end{bmatrix} = \begin{bmatrix} f_{xx} & f_{xy} \\ f_{yx} & f_{yy} \end{bmatrix}$$

| Principal minors              | Convex | Strictly convex | Concave | Strictly concave |
|-------------------------------|--------|-----------------|---------|------------------|
| $f_{xx}$                      | ≥ 0    | > 0             | ≤ 0     | < 0              |
| $f_{xx} f_{yy} - (f_{xy})^2$  | ≥ 0    | > 0             | ≥ 0     | > 0              |

Example

- Determine whether f is convex, concave, or neither:

$$f(x_1, x_2) = 3x_1 + 5x_2 - 4x_1^2 + x_2^2 - 5x_1x_2$$

Solution
- Put f in matrix form:

$$f(x_1, x_2) = \begin{bmatrix}3 & 5\end{bmatrix}\begin{bmatrix}x_1\\x_2\end{bmatrix} + \begin{bmatrix}x_1 & x_2\end{bmatrix}\begin{bmatrix}-4 & -\tfrac{5}{2}\\ -\tfrac{5}{2} & 1\end{bmatrix}\begin{bmatrix}x_1\\x_2\end{bmatrix} \;\Rightarrow\; A = \begin{bmatrix}-4 & -\tfrac{5}{2}\\ -\tfrac{5}{2} & 1\end{bmatrix}$$

Example

$$\det(A - \lambda I) = 0 \;\Rightarrow\; \det\begin{bmatrix}-4-\lambda & -\tfrac{5}{2}\\ -\tfrac{5}{2} & 1-\lambda\end{bmatrix} = 0 \;\Rightarrow\; \lambda^2 + 3\lambda - \frac{41}{4} = 0$$

$$\lambda_1 = \frac{-3 + \sqrt{50}}{2} > 0, \qquad \lambda_2 = \frac{-3 - \sqrt{50}}{2} < 0$$

- Since one eigenvalue is negative and the other is positive, A is neither positive definite nor negative definite, which implies that f is neither convex nor concave.
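The sign pattern of the eigenvalues can be confirmed numerically. A minimal sketch (assuming Python with NumPy):

```python
import numpy as np

A = np.array([[-4.0, -2.5],
              [-2.5, 1.0]])
ev = np.linalg.eigvalsh(A)
print(np.round(ev, 4))   # [-5.0355  2.0355], i.e., (-3 -+ sqrt(50))/2
# One negative and one positive eigenvalue -> A is indefinite,
# so f is neither convex nor concave.
```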

Example

- Find the local minimum of f(x) = x³.

Solution

$$\nabla f(x) = 3x^2 = 0 \Rightarrow x = 0; \qquad H(x) = 6x \Rightarrow H(0) = 0$$

- The point x = 0 satisfies the FONC and the SONC, but not the SOSC. (In fact x = 0 is not a minimizer; it is an inflection point.)

Example

- Find the local minimum of f(x1, x2) = x1² + x2².

Solution

$$\nabla f(x_1, x_2) = \begin{bmatrix}2x_1\\2x_2\end{bmatrix} = 0 \Rightarrow X = \begin{bmatrix}0\\0\end{bmatrix}; \qquad H(X) = \begin{bmatrix}2 & 0\\0 & 2\end{bmatrix} \text{ for all } X \in \Re^2 \Rightarrow H(X) \text{ is positive definite.}$$

- The point X = [0,0]^T satisfies the FONC, SONC, and SOSC. It is a strict local minimizer.
- Actually X = [0,0]^T is a strict global minimizer.

Example

- Find the local minimum of f(x1, x2) = x1² − x2².

Solution

$$\nabla f(x_1, x_2) = \begin{bmatrix}2x_1\\-2x_2\end{bmatrix} = 0 \Rightarrow X = \begin{bmatrix}0\\0\end{bmatrix}; \qquad H(X) = \begin{bmatrix}2 & 0\\0 & -2\end{bmatrix} \text{ for all } X \in \Re^2 \Rightarrow H \text{ is indefinite.}$$

- The point X = [0,0]^T satisfies the FONC, but the SONC is not satisfied.
- It is a saddle point.

[Figure: surface plot of f(x1, x2)]

Importance of Concave Functions in NLP

- Suppose that we have an NLP with the following properties:
  - the feasible region, say Ω, is convex;
  - the objective function, say f, is concave;
  - the objective is to maximize the value of the objective function:

$$\max\; z = f(x) \quad \text{s.t. } x \in \Omega$$

Then:
- Any local maximum is a global maximum!

Importance of Concave Functions in NLP

Proof
- Suppose there exists a solution x′ that is a local maximum but not a global maximum. Since x′ is not a global maximum, there exists a solution x with the property that f(x) > f(x′).
- Now use the fact that f is concave:

$$f(\alpha x' + (1-\alpha)x) \ge \alpha f(x') + (1-\alpha) f(x) > \alpha f(x') + (1-\alpha) f(x') = f(x'), \quad \text{for all } \alpha,\ 0 < \alpha < 1.$$

- If α is very close to 1, then αx′ + (1−α)x is very close to x′, yet f(αx′ + (1−α)x) > f(x′).
- Therefore x′ cannot be a local maximum. This contradicts our assumption that there exists a local maximum that is not a global maximum!

Importance of Convex Functions in NLP

- Suppose that we have an NLP with the following properties:
  - the feasible region, say Ω, is convex;
  - the objective function, say f, is convex;
  - the objective is to minimize the value of the objective function:

$$\min\; z = f(x) \quad \text{s.t. } x \in \Omega$$

Then:
- Any local minimum is a global minimum!
- The Hessian matrix of a convex function is positive semi-definite.

Importance of Convex Functions in NLP

Another reason:
- Basic optimization algorithms search for local optima. Those that try to find global optima generally just run the underlying algorithms several times, starting at different solutions.

Properties of Convex/Concave Functions

Theorem
- Let $f \in C^1$. Then f is convex over a convex set Ω if and only if

$$f(x) \ge f(x_0) + \nabla f(x_0)^T (x - x_0)$$

for all $x, x_0 \in \Omega$.
- A convex function lies above its tangent planes. [Figure: f(x) and its tangent line at x0]

Properties of Convex/Concave Functions

Theorem
- Let $f \in C^2$. Then f is convex over a convex set Ω containing an interior point if and only if the Hessian matrix H of f is positive semi-definite throughout Ω.

Properties of Convex/Concave Functions

Notes
1. The Hessian matrix is the generalization to ℜⁿ of the concept of the curvature of a function, and correspondingly, positive definiteness of the Hessian is the generalization of positive curvature. Convex functions have positive (or at least nonnegative) curvature in every direction.
2. We sometimes refer to a function as being locally convex if its Hessian matrix is positive semi-definite in a small region, and locally strictly convex if the Hessian is positive definite in the region.

Properties of Convex/Concave Functions

Theorem 1
- Let f be a convex function defined on the convex set Ω. Then the set Γ where f achieves its minimum is convex, and any relative minimum of f is a global minimum.

Properties of Convex/Concave Functions

Theorem 2
- Let $f \in C^1$ be convex on the convex set Ω. If there is a point $x^* \in \Omega$ such that, for all $y \in \Omega$, $\nabla f(x^*)^T (y - x^*) \ge 0$, then $x^*$ is a global minimum point of f over Ω.

Properties of Convex/Concave Functions

Theorem 3
- Let f be a convex function defined on the bounded, closed convex set Ω. If f has a maximum over Ω, it is achieved at an extreme point of Ω.

- Extrema = minimum/maximum.

Constrained Optimization

Level Sets of a Function

Definition
- The level set of a function f : ℜⁿ → ℜ at level c is the set of points S = {x | f(x) = c}.
- For f : ℜ² → ℜ, we are usually interested in S when it is a curve.
- For f : ℜ³ → ℜ, the sets S most often considered are surfaces.

Non-linear optimization with constraints

[Figure]

Example #3: Product Mix Problem

$$\max Z = 13x_1 + 11x_2 \quad \text{(Income)}$$

s.t.
- $4x_1 + 5x_2 \le 1500$ (Storage Space)
- $5x_1 + 3x_2 \le 1575$ (Raw Material)
- $x_1 + 2x_2 \le 420$ (Production Rate)
- $x_1 \ge 0$, $x_2 \ge 0$
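This LP can be checked numerically. A minimal sketch (assuming Python with SciPy; linprog minimizes, so the objective is negated):

```python
from scipy.optimize import linprog

c = [-13, -11]                     # maximize Z = 13*x1 + 11*x2 -> minimize -Z
A_ub = [[4, 5], [5, 3], [1, 2]]    # storage space, raw material, production rate
b_ub = [1500, 1575, 420]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x, -res.fun)             # approx [270.  75.]  4335.0
```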

Example #3:

$$\max Z = 13x_1 + 11x_2$$

s.t. (with slack variables)
- d1: $4x_1 + 5x_2 + s_1 = 1500$
- d2: $5x_1 + 3x_2 + s_2 = 1575$
- d3: $x_1 + 2x_2 + s_3 = 420$

[Figure: the feasible region in the (x1, x2) plane bounded by d1, d2, d3; the optimum is at (270, 75) with Zmax = 4335]

Example #3:

[Figure: level curves of the objective function Z = 13x1 + 11x2 over the feasible region]

- $f(x_1, x_2) = x_1^2 + x_2^2$

[Figure: surface plot and level sets of f; the level sets are concentric circles]

- $f(x_1, x_2) = x_1^2 - x_2^2$

[Figure: surface plot and level sets of the saddle-shaped f]

- Rosenbrock's function: $f(x_1, x_2) = 100\,(x_2 - x_1^2)^2 + (1 - x_1)^2$

[Figure: surface plot and level sets of Rosenbrock's banana-shaped valley]

- Peaks function:

[Figure: surface plot of the peaks function]

The Importance of Level Sets

Theorem
- The vector ∇f(x0) is orthogonal to the tangent vector of an arbitrary smooth curve passing through x0 on the level set determined by f(x) = f(x0).

The Importance of Convexity
- As said before, convexity guarantees that a local optimum is a global optimum.

Possible Optimal Solutions to Convex NLPs (not occurring at corner points)

[Figures: four cases, each showing an objective function level curve touching the feasible region at the optimal solution: (1) linear objective, nonlinear constraints; (2) nonlinear objective, nonlinear constraints; (3) nonlinear objective, linear constraints, optimum on the boundary; (4) nonlinear objective, linear constraints, optimum in the interior]

Local vs. Global Optimal Solutions for Nonconvex NLPs

[Figure: a nonconvex feasible region in the (x1, x2) plane with several local optimal solutions (labeled A through G); one of them is both a local and a global optimal solution]

NLP with Equality Constraints

$$\min f(\mathbf{x}), \quad \mathbf{x} = [x_1\ x_2\ \cdots\ x_n]^T$$
$$\text{s.t. } h_j(\mathbf{x}) = 0, \quad j = 1, 2, \dots, m$$

The Lagrangian Function

- Let us introduce the Lagrangian function, L(x, λ), as

$$L(\mathbf{x}, \boldsymbol{\lambda}) = f(\mathbf{x}) + \boldsymbol{\lambda}^T \mathbf{h}(\mathbf{x}),$$

where

$$\mathbf{h}(\mathbf{x}) = \begin{bmatrix} h_1(\mathbf{x}) \\ h_2(\mathbf{x}) \\ \vdots \\ h_m(\mathbf{x}) \end{bmatrix}, \qquad \boldsymbol{\lambda} = \begin{bmatrix} \lambda_1 \\ \lambda_2 \\ \vdots \\ \lambda_m \end{bmatrix} \quad \text{(Lagrange multipliers, or the dual vector).}$$

- The notation $\nabla_x f(x, y)$ means the gradient of f with respect to x.
- Thus:

$$\nabla_x L(\mathbf{x}, \boldsymbol{\lambda}) = \nabla_x f(\mathbf{x}) + \boldsymbol{\lambda}^T \nabla_x \mathbf{h}(\mathbf{x})$$

First-Order Necessary Conditions (FONC)

Theorem: Lagrange's Theorem (FONC)
- Let $x^*$ be a local minimizer (or maximizer) of $f : \Re^n \to \Re$, subject to $\mathbf{h}(\mathbf{x}) = 0_{m \times 1}$, $\mathbf{h} : \Re^n \to \Re^m$, $m \le n$. Assume that $x^*$ is a regular point. Then there exists $\lambda^* \in \Re^m$ such that:

$$\nabla_x L(x^*, \lambda^*) = \nabla_x f(x^*) + \lambda^{*T} \nabla_x \mathbf{h}(x^*) = (0_{n \times 1})^T$$

- Note that the constraints are in the form $\mathbf{h}(\mathbf{x}) = 0_{m \times 1}$. Thus the above FONC can be stated as:

$$\nabla_x L(\mathbf{x}, \boldsymbol{\lambda}) = 0_{1 \times n}, \qquad \nabla_\lambda L(\mathbf{x}, \boldsymbol{\lambda}) = 0_{1 \times m}.$$

First-Order Necessary Conditions (FONC)

- The Lagrangian can be thought of as an unconstrained optimization problem in the variables $x_1, x_2, \dots, x_n$ and $\lambda_1, \lambda_2, \dots, \lambda_m$. The problem can be solved by solving the equations:

$$\nabla_x L = 0_{n \times 1}, \;\text{ i.e., }\; \frac{\partial L}{\partial x_i} = 0 \text{ for } i = 1, 2, \dots, n;$$
$$\nabla_\lambda L = 0_{m \times 1}, \;\text{ i.e., }\; \frac{\partial L}{\partial \lambda_j} = 0 \text{ for } j = 1, 2, \dots, m.$$

First-Order Necessary Conditions (FONC)

- For n = 2 and m = 1, i.e., when f is a function of 2 variables with only one constraint, Lagrange's FONC for a local minimizer (or maximizer) $x^*$ reads:

$$\nabla f(x^*) + \lambda^* \nabla h(x^*) = 0$$

- The equation means that $\nabla f(x^*)$ and $\nabla h(x^*)$ must be collinear (pointing in the same or in exactly opposite directions) at a minimum or maximum point!

Example

- Consider the problem:

$$\min\; x_1 + x_2 \quad \text{s.t. } x_1^2 + x_2^2 - 1 = 0$$

- The feasible region is a circle of radius one. The possible objective function curves are lines with a slope of −1. The minimum is the point where the lowest such line still touches the circle.

Example

[Figure: the feasible circle with objective level lines f(x) = 1, f(x) = 0, and f(x) = −1.414; the gradient of f points in the direction of increasing f; the marked point is x* = [0.707, 0.707]^T]

Example

- Since the objective function lines are straight parallel lines, the gradient of f is constant and points in the direction of increasing f, which is toward the upper right.
- The gradient of h points outward from the circle, so its direction depends on the point at which the gradient is evaluated.

Example

[Figure: at the point x1 on the circle, ∇f(x1) and ∇h(x1) are not parallel and the tangent plane is drawn; at x* = [0.707, 0.707]^T, ∇f(x*) and ∇h(x*) are parallel]

Conclusions

- At the optimum point, ∇f(x) is parallel to ∇h(x).
- As we can see at the point x1, ∇f(x) is not parallel to ∇h(x), and we can move (down) to improve the objective function.
- We can say that at a max or min, ∇f(x) must be parallel to ∇h(x); otherwise we could improve the objective function by changing position.

Example

- Using the FONC for the previous example:

$$L(\mathbf{x}, \lambda) = f(\mathbf{x}) + \lambda\, h(\mathbf{x}) = (x_1 + x_2) + \lambda \cdot (x_1^2 + x_2^2 - 1)$$

- The FONC equations are:

$$\frac{\partial L}{\partial x_i} = 0 \text{ for } i = 1, 2; \qquad \frac{\partial L}{\partial \lambda} = 0.$$

Example

- This becomes:

$$\frac{\partial L}{\partial x_1} = 1 + 2\lambda x_1 = 0, \qquad \frac{\partial L}{\partial x_2} = 1 + 2\lambda x_2 = 0, \qquad \frac{\partial L}{\partial \lambda} = x_1^2 + x_2^2 - 1 = 0.$$

- There are three equations and three unknowns. Solving the system:

$$x_1 = x_2 = \pm 0.707, \qquad \lambda = \mp 0.707.$$

- It can be seen from the graph that positive $x_1, x_2$ correspond to the maximum, while negative $x_1, x_2$ correspond to the minimum.
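The same FONC system can be solved symbolically. A minimal sketch (assuming Python with SymPy):

```python
import sympy as sp

x1, x2, lam = sp.symbols("x1 x2 lam", real=True)
L = x1 + x2 + lam * (x1**2 + x2**2 - 1)     # Lagrangian of the example

# FONC: all first-order partials of L vanish
sols = sp.solve([sp.diff(L, v) for v in (x1, x2, lam)], [x1, x2, lam], dict=True)
print(sols)
# (up to ordering)
# [{x1: -sqrt(2)/2, x2: -sqrt(2)/2, lam:  sqrt(2)/2},   <- the minimum
#  {x1:  sqrt(2)/2, x2:  sqrt(2)/2, lam: -sqrt(2)/2}]   <- the maximum
```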

Limitations of the FONC

- The FONC do not guarantee that the solution(s) will be minima/maxima.
- As in the case of unconstrained optimization, they only provide us with candidate points that need to be verified by the second-order conditions.
- If the problem is convex, then the FONC do guarantee that the solutions are extreme points.

Some Definitions

- Let:
  - $\nabla_x^2 L(\mathbf{x}, \boldsymbol{\lambda})$: the Hessian matrix of $L(\mathbf{x}, \boldsymbol{\lambda}) = f(\mathbf{x}) + \boldsymbol{\lambda}^T \mathbf{h}(\mathbf{x})$ with respect to x,
  - $\nabla^2 f(\mathbf{x})$: the Hessian matrix of f(x),
  - $\nabla^2 h_j(\mathbf{x})$: the Hessian matrix of $h_j(\mathbf{x})$, j = 1, 2, ..., m.
- Then:

$$\nabla_x^2 L(\mathbf{x}, \boldsymbol{\lambda}) = \nabla^2 f(\mathbf{x}) + \lambda_1 \nabla^2 h_1(\mathbf{x}) + \lambda_2 \nabla^2 h_2(\mathbf{x}) + \dots + \lambda_m \nabla^2 h_m(\mathbf{x})$$

Some Definitions

- The tangent space at a point $x^*$ on the surface $S = \{\mathbf{x} \in \Re^n : \mathbf{h}(\mathbf{x}) = 0\}$ is the set

$$T(x^*) = \{\mathbf{y} \mid \langle \nabla \mathbf{h}(x^*), \mathbf{y} \rangle = 0\}.$$

Tangent Plane: Example

- Let $S = \{\mathbf{x} \in \Re^3 : h_1(\mathbf{x}) = x_1 = 0,\ h_2(\mathbf{x}) = x_1 - x_2 = 0\}$. Then:

$$\nabla h_1(\mathbf{x})^T = [1,\ 0,\ 0], \qquad \nabla h_2(\mathbf{x})^T = [1,\ -1,\ 0],$$

$$T(\mathbf{x}) = \{\mathbf{y} \mid \langle \nabla h_1(\mathbf{x}), \mathbf{y} \rangle = 0,\ \langle \nabla h_2(\mathbf{x}), \mathbf{y} \rangle = 0\} = \left\{ \mathbf{y} \,\middle|\, \begin{bmatrix} 1 & 0 & 0 \\ 1 & -1 & 0 \end{bmatrix} \mathbf{y} = 0 \right\} = \{[0,\ 0,\ y_3]^T \mid y_3 \in \Re\},$$

i.e., the $x_3$ axis in $\Re^3$.

Second-Order Necessary Conditions (SONC)

Theorem: SONC
- Let $x^*$ be a local minimizer (or maximizer) of $f : \Re^n \to \Re$, subject to $\mathbf{h}(\mathbf{x}) = 0_{m \times 1}$, $\mathbf{h} : \Re^n \to \Re^m$, $m \le n$. Assume that $x^*$ is a regular point. Then there exists $\lambda^* \in \Re^m$ such that:
  1. $\nabla_x L(x^*, \lambda^*) = \nabla_x f(x^*) + \lambda^{*T} \nabla_x \mathbf{h}(x^*) = 0_{n \times 1}$,
  2. for all $\mathbf{y} \in T(x^*)$, we have $\mathbf{y}^T \nabla_x^2 L(x^*, \lambda^*)\, \mathbf{y} \ge 0$.

Second-Order Sufficient Conditions (SOSC)

Theorem: SOSC
- Let $f$ and $\mathbf{h} \in C^2$, and suppose there are a point $x^* \in \Re^n$ and $\lambda^* \in \Re^m$ such that:
  1. $\nabla_x L(x^*, \lambda^*) = \nabla_x f(x^*) + \lambda^{*T} \nabla_x \mathbf{h}(x^*) = 0_{n \times 1}$,
  2. for all $\mathbf{y} \in T(x^*)$, $\mathbf{y} \ne 0$, we have $\mathbf{y}^T \nabla_x^2 L(x^*, \lambda^*)\, \mathbf{y} > 0$.
- Then $x^*$ is a strict local minimizer of $f$ subject to $\mathbf{h}(\mathbf{x}) = 0_{m \times 1}$.

The Tangent Space

[Figure: the surface h(x) = 0 with the tangent plane (all possible y vectors) at x*, and ∇h(x) normal to it]

- The tangent plane is the location of all y vectors and passes through x*; it must be orthogonal (perpendicular) to ∇h(x*).

Note
- Note the similarity between the Lagrangian function and unconstrained optimization.

Maximization Problems

- Note that the previous statements of the SONC and SOSC were for minimization problems!
- For maximization problems, the sense of the inequality sign is reversed (as in unconstrained optimization):
  - SONC: $\mathbf{y}^T \nabla_x^2 L(\mathbf{x}, \boldsymbol{\lambda})\, \mathbf{y} \le 0$
  - SOSC: $\mathbf{y}^T \nabla_x^2 L(\mathbf{x}, \boldsymbol{\lambda})\, \mathbf{y} < 0$

Necessary & Sufficient

- The necessary conditions are required for a point to be an extremum, but even if they are satisfied, they do not guarantee that the point is an extremum.
- If the sufficient conditions are true, then the point is guaranteed to be an extremum. But if they are not satisfied, this does not mean that the point is not an extremum.

Procedure
1. Solve the FONC to obtain candidate points.
2. Test the candidate points with the SONC; eliminate any points that do not satisfy the SONC.
3. Test the remaining points with the SOSC. The points that satisfy it are minima/maxima; for the points that do not satisfy it, we cannot say whether they are extreme points or not.

NLP with Inequality Constraints

- Consider problems such as:

$$\min f(\mathbf{x}), \quad \mathbf{x} = [x_1\ x_2\ \cdots\ x_n]^T$$
$$\text{s.t. } h_i(\mathbf{x}) = 0,\ i = 1, \dots, m; \qquad g_j(\mathbf{x}) \le 0,\ j = 1, \dots, p$$

- An inequality constraint $g_j(\mathbf{x}) \le 0$ is called "active" at $x^*$ if $g_j(x^*) = 0$.
- Let the set $I(x^*)$ contain all the indices of the active constraints at $x^*$: $g_j(x^*) = 0$ for all $j \in I(x^*)$.

NLP with Inequality Constraints

- The generalized Lagrangian is written:

$$L(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\mu}) = f(\mathbf{x}) + \sum_{i=1}^{m} \lambda_i h_i(\mathbf{x}) + \sum_{j=1}^{p} \mu_j g_j(\mathbf{x})$$

- We use λ's for the equalities and µ's for the inequalities.

FONC for Equality & Inequality Constraints

Karush-Kuhn-Tucker (KKT) Theorem
- For the generalized Lagrangian, the FONC become:

$$\nabla_x L(x^*, \lambda^*, \mu^*) = \nabla f(x^*) + \sum_{i=1}^{m} \lambda_i^* \nabla h_i(x^*) + \sum_{j=1}^{p} \mu_j^* \nabla g_j(x^*) = 0_{n \times 1},$$

together with the complementary slackness condition:

$$\mu_j^*\, g_j(x^*) = 0, \qquad \mu_j^* \ge 0, \qquad j = 1, \dots, p.$$

SONC for Equality & Inequality Constraints

- Non-negative Lagrange multiplier; two cases:
  1. $g_j(\mathbf{x}) = 0$ (active constraint),
  2. $g_j(\mathbf{x}) < 0 \Rightarrow \mu_j = 0$.
- The SONC (for a minimization problem) are: for all $\mathbf{y} \in T(x^*)$, we have $\mathbf{y}^T \nabla_x^2 L(x^*, \lambda^*, \mu^*)\, \mathbf{y} \ge 0$, where $J(x^*)\,\mathbf{y} = 0$ as before.
- $J(x^*)$: the matrix of the gradients of all the equality constraints and of only those inequality constraints that are active at $x^*$.

SOSC for Equality & Inequality Constraints

- The SOSC for a minimization problem with equality and inequality constraints are: for all $\mathbf{y} \in T(x^*)$, $\mathbf{y} \ne 0$,

$$\mathbf{y}^T \nabla_x^2 L(x^*, \lambda^*, \mu^*)\, \mathbf{y} > 0.$$

Example

- Solve the problem:

$$\min f(\mathbf{x}) = (x_1 - 1)^2 + x_2^2$$
$$\text{s.t. } h(\mathbf{x}) = x_1^2 + x_2^2 + x_1 + x_2 = 0, \qquad g(\mathbf{x}) = x_1 - x_2^2 \le 0$$

- The Lagrangian for this problem is:

$$L(\mathbf{x}, \lambda, \mu) = (x_1 - 1)^2 + x_2^2 + \lambda \cdot (x_1^2 + x_2^2 + x_1 + x_2) + \mu \cdot (x_1 - x_2^2)$$

Example

- The first-order necessary conditions:

$$\frac{\partial L}{\partial x_1} = 2(x_1 - 1) + \lambda(2x_1 + 1) + \mu = 0$$
$$\frac{\partial L}{\partial x_2} = 2x_2 + \lambda(2x_2 + 1) - 2\mu x_2 = 0$$
$$\frac{\partial L}{\partial \lambda} = x_1^2 + x_2^2 + x_1 + x_2 = 0$$
$$\mu \cdot (x_1 - x_2^2) = 0$$

Example

- Solving the 4 FONC equations, we get 2 solutions:

1. $x^{(1)} = [0.2056,\ -0.4534]^T$, with $\lambda = 0.45$, $\mu = 0.9537$;
2. $x^{(2)} = [0,\ 0]^T$, with $\lambda = 0$, $\mu = 2$.
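Both candidate solutions can be verified against the FONC equations numerically. A minimal sketch (assuming Python with NumPy):

```python
import numpy as np

def fonc_residuals(x1, x2, lam, mu):
    """The four FONC/KKT equations of the example; all should be ~0."""
    return np.array([
        2 * (x1 - 1) + lam * (2 * x1 + 1) + mu,     # dL/dx1
        2 * x2 + lam * (2 * x2 + 1) - 2 * mu * x2,  # dL/dx2
        x1**2 + x2**2 + x1 + x2,                    # h(x) = 0
        mu * (x1 - x2**2),                          # complementary slackness
    ])

print(fonc_residuals(0.2056, -0.4534, 0.45, 0.9537))  # ~0 to about 4 decimals
print(fonc_residuals(0.0, 0.0, 0.0, 2.0))             # exactly 0
```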

Example

- Now try the SONC at the first solution. Both h(x) and g(x) are active at this point (they both equal zero), so the Jacobian consists of the gradients of both functions evaluated at $x^{(1)}$:

$$J(x^{(1)}) = \begin{bmatrix} 2x_1 + 1 & 2x_2 + 1 \\ 1 & -2x_2 \end{bmatrix} = \begin{bmatrix} 1.411 & 0.0932 \\ 1 & 0.9068 \end{bmatrix}$$

Example

- The only solution to the equation $J(x^{(1)})\,\mathbf{y} = 0$ is $\mathbf{y} = [0,\ 0]^T$.
- The Hessian of the Lagrangian is:

$$\nabla_x^2 L = \begin{bmatrix} 2 + 2\lambda & 0 \\ 0 & 2 + 2\lambda - 2\mu \end{bmatrix} = \begin{bmatrix} 2.9 & 0 \\ 0 & 0.993 \end{bmatrix}$$

Example

- So the SONC equation is:

$$\begin{bmatrix} 0 & 0 \end{bmatrix} \begin{bmatrix} 2.9 & 0 \\ 0 & 0.993 \end{bmatrix} \begin{bmatrix} 0 \\ 0 \end{bmatrix} = 0 \ge 0$$

- This inequality is true, so the SONC is satisfied for $x^{(1)}$ and it is still a candidate point.

Example

- The SOSC condition is:

$$\mathbf{y}^T \nabla_x^2 L(x^*, \lambda^*, \mu^*)\, \mathbf{y} > 0.$$

- We just calculated the left-hand side to be zero, since the only y satisfying $J(x^{(1)})\,\mathbf{y} = 0$ is $\mathbf{y} = 0$:

$$\mathbf{y}^T \nabla_x^2 L\, \mathbf{y} = 0 \not> 0.$$

- Thus the SOSC is not satisfied at $x^{(1)}$.

Example

- For the second solution: again, both h(x) and g(x) are active at this point. The Jacobian is:

$$J(x^{(2)}) = \begin{bmatrix} 2x_1 + 1 & 2x_2 + 1 \\ 1 & -2x_2 \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}$$

Example

- The only solution to the equation $J(x^{(2)})\,\mathbf{y} = 0$ is $\mathbf{y} = [0,\ 0]^T$, and the Hessian of the Lagrangian is:

$$\nabla_x^2 L = \begin{bmatrix} 2 + 2\lambda & 0 \\ 0 & 2 + 2\lambda - 2\mu \end{bmatrix} = \begin{bmatrix} 2 & 0 \\ 0 & -2 \end{bmatrix}$$

Example

- So the SONC equation is:

$$\begin{bmatrix} 0 & 0 \end{bmatrix} \begin{bmatrix} 2 & 0 \\ 0 & -2 \end{bmatrix} \begin{bmatrix} 0 \\ 0 \end{bmatrix} = 0 \ge 0$$

- This inequality is true, so the SONC is satisfied for $x^{(2)}$ and it is still a candidate point.

Example

- The SOSC condition is again $\mathbf{y}^T \nabla_x^2 L(x^*, \lambda^*, \mu^*)\, \mathbf{y} > 0$, and as before the left-hand side is zero because the only y in the tangent space is $\mathbf{y} = 0$. Thus the SOSC is not satisfied at $x^{(2)}$ either.

Example Conclusions

- So we can say that both $x^{(1)}$ and $x^{(2)}$ may be local minima, but we cannot be sure, because the SOSC is not satisfied at either point.

Dual Problem

- Using the dual problem: constrained optimization ⇒ unconstrained optimization.
- Need to change maximization to minimization.
- Only valid when the original optimization problem is convex/concave (strong duality).

Primal problem:

$$x^* = \arg\max_x f(x) \quad \text{subject to } g(x) = c$$

Dual problem:

$$l(\lambda) = \max_x \left( f(x) + \lambda (g(x) - c) \right), \qquad \lambda^* = \arg\min_\lambda l(\lambda)$$

- When the problem is convex/concave (strong duality), the primal solution x* is recovered from the dual solution λ*.

Example:

$$\max_{x,y}\; xy \quad \text{subject to } x + y \le 6$$

- Introduce a Lagrange multiplier λ for the constraint $6 - (x + y) \ge 0$.
- Construct the Lagrangian:

$$L(x, y) = xy + \lambda (6 - x - y)$$

- KKT conditions:

$$\frac{\partial L(x, y)}{\partial x} = y - \lambda = 0, \qquad \frac{\partial L(x, y)}{\partial y} = x - \lambda = 0 \;\Rightarrow\; x = y = \lambda;$$
$$x + y \le 6, \qquad \lambda (6 - x - y) = 0, \qquad \lambda \ge 0.$$

- Expressing the objective function using λ (substituting x = y = λ into L):

$$l(\lambda) = 6\lambda - \lambda^2$$

- The solution is λ = 3, giving x = y = 3.
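The KKT system of this toy problem can be solved symbolically. A minimal sketch (assuming Python with SymPy, and assuming the constraint is active, i.e., x + y = 6):

```python
import sympy as sp

x, y, lam = sp.symbols("x y lam", real=True)
L = x * y + lam * (6 - x - y)

# Stationarity in x and y, plus the active constraint x + y = 6
sols = sp.solve([sp.diff(L, x), sp.diff(L, y), sp.Eq(x + y, 6)],
                [x, y, lam], dict=True)
print(sols)                                # [{x: 3, y: 3, lam: 3}]

# The objective expressed through lambda along x = y = lam:
l = sp.expand(L.subs({x: lam, y: lam}))
print(l, sp.solve(sp.diff(l, lam), lam))   # 6*lam - lam**2, stationary at [3]
```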

SVM


Perceptron: Linear Separators

- Binary classification can be viewed as the task of separating classes in feature space:

$$f(\mathbf{x}) = \operatorname{sign}(\mathbf{w}^T \mathbf{x} + b)$$

[Figure: a separating hyperplane $\mathbf{w}^T \mathbf{x} + b = 0$, with $\mathbf{w}^T \mathbf{x} + b > 0$ on one side and $\mathbf{w}^T \mathbf{x} + b < 0$ on the other]

Linear Separators

- Which of the linear separators is optimal? [Figure: several hyperplanes separating the same two classes]

Classification Margin

- The distance from an example $\mathbf{x}_i$ to the separator is

$$r = \frac{\mathbf{w}^T \mathbf{x}_i + b}{\|\mathbf{w}\|}.$$

- Examples closest to the hyperplane are support vectors.
- The margin ρ of the separator is the distance between the support vectors of the two classes. [Figure]

Maximum Margin Classification

- Maximizing the margin is good.
- It implies that only the support vectors matter; the other training examples are ignorable.

Linear SVM Mathematically

- Let the training set $\{(\mathbf{x}_i, y_i)\}_{i=1..n}$, $\mathbf{x}_i \in \Re^d$, $y_i \in \{-1, 1\}$, be separated by a hyperplane with margin ρ. Then for each training example $(\mathbf{x}_i, y_i)$:

$$\mathbf{w}^T \mathbf{x}_i + b \le -\rho/2 \ \text{ if } y_i = -1, \qquad \mathbf{w}^T \mathbf{x}_i + b \ge \rho/2 \ \text{ if } y_i = 1,$$

or equivalently $y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge \rho/2$.

- For every support vector $\mathbf{x}_s$ the above inequality is an equality. After rescaling w and b by ρ/2 in the equality, we obtain that the distance between each $\mathbf{x}_s$ and the hyperplane is

$$r = \frac{y_s(\mathbf{w}^T \mathbf{x}_s + b)}{\|\mathbf{w}\|} = \frac{1}{\|\mathbf{w}\|}.$$

- Then the margin can be expressed through the (rescaled) w and b as:

$$\rho = 2r = \frac{2}{\|\mathbf{w}\|}.$$
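The geometry above is easy to verify numerically. A minimal sketch (assuming Python with NumPy; the separator and the points are illustrative values, not from the slides):

```python
import numpy as np

w = np.array([2.0, 1.0])                 # an example separator w^T x + b = 0
b = -4.0
X = np.array([[1.0, 3.0], [3.0, 1.0], [0.0, 1.0]])
y = np.array([1, 1, -1])

r = (X @ w + b) / np.linalg.norm(w)      # signed distances to the hyperplane
print(np.round(r, 3))                    # [ 0.447  1.342 -1.342]

# With w, b scaled so that y_i (w^T x_i + b) = 1 at the support vectors,
# the margin is rho = 2 / ||w||:
print(2 / np.linalg.norm(w))             # 0.894...
```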

Linear SVMs Mathematically (cont.)

- Then we can formulate the quadratic optimization problem:

Find w and b such that $\rho = \dfrac{2}{\|\mathbf{w}\|}$ is maximized, and for all $(\mathbf{x}_i, y_i)$, $i = 1, \dots, n$: $y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1$.

Which can be reformulated as:

Find w and b such that $\Phi(\mathbf{w}) = \|\mathbf{w}\|^2 = \mathbf{w}^T \mathbf{w}$ is minimized, and for all $(\mathbf{x}_i, y_i)$, $i = 1, \dots, n$: $y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1$.

Linear SVMs Mathematically (cont.)

- Use Lagrangian theory to solve the optimization problem. The Lagrangian is:

$$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2} \langle \mathbf{w}, \mathbf{w} \rangle - \sum_{i=1}^{l} \alpha_i \left[ y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle + b) - 1 \right]$$

- FOCs:

$$\frac{\partial L(\mathbf{w}, b, \boldsymbol{\alpha})}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{l} \alpha_i y_i \mathbf{x}_i = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{l} \alpha_i y_i \mathbf{x}_i$$
$$\frac{\partial L(\mathbf{w}, b, \boldsymbol{\alpha})}{\partial b} = \sum_{i=1}^{l} \alpha_i y_i = 0 \;\Rightarrow\; \sum_{i=1}^{l} \alpha_i y_i = 0$$

Linear SVMs Mathematically (cont.)

- Substitute w into the Lagrangian:

$$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2} \langle \mathbf{w}, \mathbf{w} \rangle - \sum_{i=1}^{l} \alpha_i \left[ y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle + b) - 1 \right]$$
$$= \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle - \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle + \sum_{i=1}^{l} \alpha_i$$
$$= \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle$$

- Dual problem:

$$\max_{\boldsymbol{\alpha}}\; W(\boldsymbol{\alpha}) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle$$
$$\text{s.t. } \sum_{i=1}^{l} \alpha_i y_i = 0, \qquad \alpha_i \ge 0,\ i = 1, \dots, l.$$
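For small problems, this dual can be solved with a general-purpose constrained solver. A minimal sketch (assuming Python with NumPy and SciPy; the toy data set is an illustrative choice, not from the slides):

```python
import numpy as np
from scipy.optimize import minimize

# A tiny linearly separable data set
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
G = (y[:, None] * X) @ (y[:, None] * X).T      # G_ij = y_i y_j <x_i, x_j>

def neg_W(a):                                   # minimize -W(alpha) = maximize W(alpha)
    return 0.5 * a @ G @ a - a.sum()

cons = {"type": "eq", "fun": lambda a: a @ y}   # sum_i alpha_i y_i = 0
res = minimize(neg_W, np.zeros(len(y)), method="SLSQP",
               bounds=[(0, None)] * len(y), constraints=cons)
alpha = res.x
w = (alpha * y) @ X                             # w = sum_i alpha_i y_i x_i
sv = int(np.argmax(alpha))                      # index of one support vector
b = y[sv] - w @ X[sv]                           # from y_s (w^T x_s + b) = 1
print(np.round(alpha, 4), np.round(w, 4), round(b, 4))
print(np.sign(X @ w + b) == y)                  # all True: the data are separated
```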

Support Vector Classification (cont.)

- Moving to the linearly non-separable situation: make the training sample linearly separable in the feature space implicitly defined by the kernel K(x, z).
- Primal problem:

$$\min_{\mathbf{w}, b}\; \langle \mathbf{w}, \mathbf{w} \rangle \quad \text{s.t. } y_i(\langle \mathbf{w}, \phi(\mathbf{x}_i) \rangle + b) \ge 1,\ i = 1, \dots, l$$

- Dual problem:

$$\max_{\boldsymbol{\alpha}}\; W(\boldsymbol{\alpha}) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j K(\mathbf{x}_i, \mathbf{x}_j)$$
$$\text{s.t. } \sum_{i=1}^{l} \alpha_i y_i = 0, \qquad \alpha_i \ge 0,\ i = 1, \dots, l.$$

Implementation Techniques

- What problem do we have?

$$\max_{\boldsymbol{\alpha}}\; W(\boldsymbol{\alpha}) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j K(\mathbf{x}_i, \mathbf{x}_j) \quad \text{s.t. } \sum_{i=1}^{l} \alpha_i y_i = 0, \quad \alpha_i \ge 0,\ i = 1, \dots, l$$

- This is a quadratic programming problem:
  - Standard software packages exist.
  - Poor scalability: the kernel matrix alone requires memory quadratic in the number of training examples.
- This is also a very special quadratic programming problem.