01 - NLP (1390-03-17)
References
• Edwin K. P. Chong, Stanislaw H. Zak, An Introduction to Optimization, John Wiley & Sons, 2nd Edition, 2001.
• David G. Luenberger, Linear and Nonlinear Programming, Addison-Wesley Publishing Company, 2nd Edition, 1989.
• S.S. Rao, Optimization: Theory and Applications, John Wiley & Sons, 2nd Edition, 1984.
Unconstrained Optimization
• Let X ∈ ℜⁿ:

$$X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}_{n \times 1}$$

• Let A = (a_ij) ∈ ℜⁿˣⁿ be an n×n symmetric matrix. The k-th order leading principal minor of A is defined as the determinant of the k×k submatrix formed by the first k rows and columns:

$$\Delta_k = \det \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1k} \\ a_{21} & a_{22} & \cdots & a_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ a_{k1} & a_{k2} & \cdots & a_{kk} \end{bmatrix}$$
Quadratic Form
• Consider the quadratic form:

$$Q(X) = X^T A X = \sum_{i=1}^{n} a_{ii} x_i^2 + 2 \sum_{1 \le i < j \le n} a_{ij} x_i x_j$$

• The above quadratic form (or simply the matrix A) is called:
- Positive semi-definite if XᵀAX ≥ 0 for all X.
- Positive definite if XᵀAX > 0 for all X ≠ 0.
- Negative semi-definite if XᵀAX ≤ 0 for all X.
- Negative definite if XᵀAX < 0 for all X ≠ 0.
- Indefinite if XᵀAX < 0 for some X and > 0 for others.
Quadratic Form
Theorems: Sylvester's Criterion
• A quadratic form XᵀAX, A = Aᵀ, is positive definite if and only if all the leading principal minors of A are > 0. (For positive semi-definiteness, all principal minors — not only the leading ones — must be ≥ 0.)
• A quadratic form XᵀAX, A = Aᵀ, is negative definite (negative semi-definite) if and only if the k-th leading principal minor of A has the sign of (−1)ᵏ, k = 1, 2, …, n (respectively, is zero or has the sign of (−1)ᵏ, k = 1, 2, …, n).
Quadratic Form
Theorems
• A symmetric matrix A is positive definite (or positive semi-definite) if and only if all eigenvalues of A are positive (or non-negative).
• A symmetric matrix A is negative definite (or negative semi-definite) if and only if all eigenvalues of A are negative (or non-positive).
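As a quick numerical illustration of these two tests (not from the original slides; the matrix below is an arbitrary symmetric example), both the leading-principal-minor and the eigenvalue criteria can be checked with NumPy:

```python
import numpy as np

def leading_principal_minors(A):
    """Determinants of the k-by-k top-left submatrices, k = 1..n."""
    return [np.linalg.det(A[:k, :k]) for k in range(1, A.shape[0] + 1)]

A = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  2.0]])   # symmetric example matrix

print(leading_principal_minors(A))   # [2.0, 3.0, 4.0]  -- all > 0
print(np.linalg.eigvalsh(A))         # [0.586, 2.0, 3.414] -- all > 0
# Both criteria agree: A is positive definite.
```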
Optima
• Let f(X) = f(x₁, x₂, …, x_n) be a real-valued function of the n variables x₁, x₂, …, x_n.
• Suppose:

$$X_0 = (x_1^0, x_2^0, \ldots, x_n^0), \qquad h = (h_1, h_2, \ldots, h_n), \qquad X_0 + h = (x_1^0 + h_1,\; x_2^0 + h_2,\; \ldots,\; x_n^0 + h_n)$$

- A point X₀ is said to be a local maximum of f(X) if there exists an ε > 0 such that f(X₀) ≥ f(X₀ + h) for all |h_j| ≤ ε.
- A point X₀ is said to be a local minimum of f(X) if there exists an ε > 0 such that f(X₀) ≤ f(X₀ + h) for all |h_j| ≤ ε.
- A point X₀ is said to be a strict local maximum of f(X) if there exists an ε > 0 such that f(X₀) > f(X₀ + h) for all |h_j| ≤ ε, h ≠ 0.
- A point X₀ is said to be a strict local minimum of f(X) if there exists an ε > 0 such that f(X₀) < f(X₀ + h) for all |h_j| ≤ ε, h ≠ 0.
Optima
- A point X₀ is said to be an absolute maximum or global maximum of f(X) if f(X₀) ≥ f(X) for all X.
- A point X₀ is said to be an absolute minimum or global minimum of f(X) if f(X₀) ≤ f(X) for all X.
- A point X₀ is said to be a strict absolute (global) maximum of f(X) if f(X₀) > f(X) for all X ≠ X₀.
- A point X₀ is said to be a strict absolute (global) minimum of f(X) if f(X₀) < f(X) for all X ≠ X₀.
Example
[Figure: a one-dimensional function with three marked minimizers]
- x₁: strict global minimizer
- x₂: strict local minimizer
- x₃: local (not strict) minimizer
Conditions for Local Minimizers
Theorem: First-Order Necessary Condition (FONC)
• Let Ω be a subset of ℜⁿ and f ∈ C¹ a real-valued function on Ω. If x* is a local minimizer of f over Ω, then for any feasible direction d at x*, we have
dᵀ∇f(x*) ≥ 0, i.e., ⟨∇f(x*), d⟩ ≥ 0.
Corollary: Interior case
• Let Ω be a subset of ℜⁿ and f ∈ C¹ a real-valued function on Ω. If x* is a local minimizer of f over Ω and x* is an interior point of Ω, then
∇f(x*) = 0
Conditions for Local Minimizers
Theorem (another statement)
• A necessary condition for x* to be an (interior) optimum point of f(x) is that ∇f(x*) = 0, i.e., all the first-order partial derivatives ∂f/∂xᵢ are zero at x*.
Definition
• A point x* for which ∇f(x*) = 0 is called a stationary point of f(x).
- A stationary point is a potential candidate for a local maximum or local minimum.
Example: #1
• Illustration of the FONC for the constrained case: [Figure] x₁ does not satisfy the FONC; x₂ satisfies the FONC.
Example: #2
• Consider the problem:
min f(x₁, x₂) = x₁² + 0.5x₂² + 3x₂ + 4.5
s.t.: x₁, x₂ ≥ 0
a. Is the FONC for a local minimizer satisfied at x = [1, 3]ᵀ?
b. Is the FONC for a local minimizer satisfied at x = [0, 3]ᵀ?
c. Is the FONC for a local minimizer satisfied at x = [1, 0]ᵀ?
d. Is the FONC for a local minimizer satisfied at x = [0, 0]ᵀ?
Solution
• ∇f(x₁, x₂) = [2x₁, x₂ + 3]ᵀ
• [Figure: a plot of the level sets of f]
Example: #2
a. At x = [1, 3]ᵀ, we have ∇f(x₁, x₂) = [2, 6]ᵀ. The point x = [1, 3]ᵀ is an interior point of Ω = {x : x₁ ≥ 0, x₂ ≥ 0}. Hence, the FONC requires ∇f(x₁, x₂) = 0, so the point x = [1, 3]ᵀ does not satisfy the FONC for a local minimizer.
b. At x = [0, 3]ᵀ, we have ∇f(x₁, x₂) = [0, 6]ᵀ, and hence dᵀ∇f(x₁, x₂) = 6d₂, where d = [d₁, d₂]ᵀ. For d to be feasible here, we need d₁ ≥ 0, while d₂ can take an arbitrary value in ℜ. The point x = [0, 3]ᵀ does not satisfy the FONC for a minimizer, because d₂ is allowed to be less than zero. For example, d = [1, −1]ᵀ is a feasible direction, but dᵀ∇f(x₁, x₂) = −6 < 0.
c. At x = [1, 0]ᵀ, we have ∇f(x₁, x₂) = [2, 3]ᵀ, and hence dᵀ∇f(x₁, x₂) = 2d₁ + 3d₂. For d to be feasible here, we need d₂ ≥ 0, while d₁ can take an arbitrary value in ℜ. For example, d = [−5, 1]ᵀ is a feasible direction, but dᵀ∇f(x₁, x₂) = −7 < 0. Thus x = [1, 0]ᵀ does not satisfy the FONC for a local minimizer.
d. At x = [0, 0]ᵀ, we have ∇f(x₁, x₂) = [0, 3]ᵀ, and hence dᵀ∇f(x₁, x₂) = 3d₂. For d to be feasible here, we need d₂ ≥ 0 and d₁ ≥ 0. Hence x = [0, 0]ᵀ satisfies the FONC for a local minimizer.
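A small numeric sanity check of parts (b) and (c) (a sketch; it simply evaluates dᵀ∇f at the sample feasible directions used above):

```python
import numpy as np

def grad_f(x):
    # gradient of f(x1, x2) = x1^2 + 0.5*x2^2 + 3*x2 + 4.5
    return np.array([2.0 * x[0], x[1] + 3.0])

checks = [
    (np.array([0.0, 3.0]), np.array([ 1.0, -1.0])),  # part (b)
    (np.array([1.0, 0.0]), np.array([-5.0,  1.0])),  # part (c)
]
for x, d in checks:
    print(x, d, d @ grad_f(x))   # -6.0 and -7.0: the FONC fails at both points
```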
Example #3: Function Approximation
• Suppose that through an experiment the value of a function g is observed at m points x₁, x₂, …, x_m, so the values g(x₁), g(x₂), …, g(x_m) are known. We wish to approximate the function g(x) by a polynomial
h(x) = a_n xⁿ + a_{n−1} x^{n−1} + … + a₀
of degree n (or less), where n < m. Find the aᵢ's.
Solution
• Define: e_k = g(x_k) − h(x_k).
• The best approximation is the polynomial that minimizes the sum of the squares of these errors:

$$\min \sum_{k=1}^{m} e_k^2$$
Example: Function Approximation
• or

$$\min f(a_n, a_{n-1}, \ldots, a_0) = \sum_{k=1}^{m} \Big( g(x_k) - \big( a_n x_k^{\,n} + a_{n-1} x_k^{\,n-1} + \cdots + a_0 \big) \Big)^2$$

• For the FONC:

$$\frac{\partial f}{\partial a_i} = -2 \sum_{k=1}^{m} x_k^{\,i} \Big( g(x_k) - \big( a_n x_k^{\,n} + \cdots + a_0 \big) \Big) = 0, \qquad i = 0, 1, \ldots, n,$$

or

$$a_n \sum_{k=1}^{m} x_k^{\,n+i} + a_{n-1} \sum_{k=1}^{m} x_k^{\,n-1+i} + \cdots + a_0 \sum_{k=1}^{m} x_k^{\,i} = \sum_{k=1}^{m} x_k^{\,i}\, g(x_k), \qquad i = 0, 1, \ldots, n.$$
Example: Function Approximation
• In matrix form (row i, column j, with i, j = 0, 1, …, n):

$$\left[ \sum_{k=1}^{m} x_k^{\,i+j} \right]_{(n+1)\times(n+1)} \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_n \end{bmatrix} = \left[ \sum_{k=1}^{m} x_k^{\,i}\, g(x_k) \right]_{(n+1)\times 1}$$

• This leads directly to a system of (n + 1) linear equations (the normal equations), which can be solved to determine the aᵢ's.
Example: Function Approximation
• Let

$$\mathbf{a} = \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_n \end{bmatrix}_{(n+1)\times 1}, \qquad A = \big[ a_{ij} \big]_{(n+1)\times(n+1)}, \;\; a_{ij} = \sum_{k=1}^{m} x_k^{\,i+j}, \qquad \mathbf{b} = \big[ b_j \big]_{(n+1)\times 1}, \;\; b_j = \sum_{k=1}^{m} x_k^{\,j}\, g(x_k), \qquad c = \sum_{k=1}^{m} g(x_k)^2.$$
Example: Function Approximation
• The problem can then be stated as the following quadratic form:

$$\min f(a_n, \ldots, a_0) = \mathbf{a}^T A\, \mathbf{a} - 2\,\mathbf{b}^T \mathbf{a} + c$$

• Then, as said before, the solution is determined by solving the following system of (n + 1) equations:

$$A\, \mathbf{a} = \mathbf{b}$$

• It should be noted that this answer is exactly the solution of the least-squares (LSE) problem.
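A numerical sketch of this fit (the data points below are made up for illustration; NumPy's polyfit solves the same least-squares problem internally):

```python
import numpy as np

# made-up observations of an unknown function g at m = 7 points
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
g = np.array([1.1, 1.8, 3.2, 5.1, 7.4, 10.2, 13.9])

n = 2                                      # fit a quadratic, n < m
V = np.vander(x, n + 1, increasing=True)   # columns: x^0, x^1, ..., x^n

A = V.T @ V            # A[i, j] = sum_k x_k^(i+j), as derived above
b = V.T @ g            # b[i]    = sum_k x_k^i * g(x_k)
a = np.linalg.solve(A, b)

print(a)                              # coefficients a_0, a_1, a_2
print(np.polyfit(x, g, n)[::-1])      # same answer via NumPy's least squares
```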
Conditions for Local Minimizers
Definition
• The Hessian matrix of f(X), H(x), is the n×n matrix whose i-th row consists of the partial derivatives of ∂f/∂x_j (j = 1, 2, …, n) with respect to xᵢ (i = 1, 2, …, n):

$$H(\mathbf{x}) = \nabla^2 f(\mathbf{x}) = \begin{bmatrix} \dfrac{\partial^2 f}{\partial x_1^2} & \dfrac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n} \\ \dfrac{\partial^2 f}{\partial x_2 \partial x_1} & \dfrac{\partial^2 f}{\partial x_2^2} & \cdots & \dfrac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f}{\partial x_n \partial x_1} & \dfrac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$
Example

$$f(x_1, x_2, x_3) = 7 - 3x_2^2 + 4x_1 x_3^3$$

$$\nabla f = \begin{bmatrix} 4x_3^3 \\ -6x_2 \\ 12x_1 x_3^2 \end{bmatrix}, \qquad \nabla^2 f = \begin{bmatrix} 0 & 0 & 12x_3^2 \\ 0 & -6 & 0 \\ 12x_3^2 & 0 & 24x_1 x_3 \end{bmatrix}$$
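The gradient and Hessian above can be reproduced symbolically (a SymPy sketch):

```python
import sympy as sp

x1, x2, x3 = sp.symbols('x1 x2 x3')
f = 7 - 3*x2**2 + 4*x1*x3**3

grad = [sp.diff(f, v) for v in (x1, x2, x3)]
H = sp.hessian(f, (x1, x2, x3))

print(grad)   # [4*x3**3, -6*x2, 12*x1*x3**2]
print(H)      # Matrix([[0, 0, 12*x3**2], [0, -6, 0], [12*x3**2, 0, 24*x1*x3]])
```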
Conditions for Local Minimizers
Theorem: Second-Order Necessary Condition (SONC)
• Let Ω ⊂ ℜⁿ, f ∈ C² a real-valued function on Ω, x* a local minimizer of f over Ω, and d a feasible direction at x*. If dᵀ∇f(x*) = 0, then
dᵀH(x*)d ≥ 0
where H is the Hessian of f.
Corollary: Interior Case
• Let x* be an interior point of Ω ⊂ ℜⁿ. If x* is a local minimizer of f : Ω → ℜ, f ∈ C², then
∇f(x*) = 0 and dᵀH(x*)d ≥ 0 for all d ∈ ℜⁿ,
i.e., H(x*) is positive semi-definite.
Conditions for Local Minimizers
Theorem: Second-Order Sufficient Condition (SOSC), Interior Case
• Let f ∈ C² be defined on a region in which x* is an interior point. Suppose that:
1. ∇f(x*) = 0,
2. H(x*) is positive definite, i.e., dᵀH(x*)d > 0 for all d ≠ 0.
Then x* is a strict local minimizer of f.
Conditions for Local Minimizers
Theorem
• Let X₀ be a stationary point of f(X). A sufficient condition for X₀ to be a
- local minimum of f(X) is that the Hessian matrix H(X₀) is positive definite;
- local maximum of f(X) is that the Hessian matrix H(X₀) is negative definite.
If H(X₀) is neither negative definite nor positive definite:
- If det H(X₀) = 0, then X₀ may be a local minimum, a local maximum, or a saddle point (the test is inconclusive).
- If det H(X₀) ≠ 0, then X₀ is not an optimum (it is a saddle point).
Conditions for Local Minimizers
Corollary
• If the Hessian matrix H(X) is indefinite at a point X₀ where the necessary conditions are satisfied, then the point X₀ is not an extreme point.
Conditions for Local Minimizers
Question
• How can a sufficient condition be established when H(X) is only semi-definite?
Review: Necessary and Sufficient Conditions
Local minimum
• Necessary conditions
- First-order (FONC): ∇f(x₀) = 0 (x₀ is a stationary point)
- Second-order (SONC): H(x₀) = ∇²f(x₀) is positive semi-definite
• Sufficient conditions
- First-order (FOSC): ∇f(x₀) = 0 (x₀ is a stationary point)
- Second-order (SOSC): H(x₀) = ∇²f(x₀) is positive definite
Global minimum
• Compare all local minima.
Example
• Find the stationary points of the function
f(x₁, x₂, x₃) = 2x₁x₂x₃ − 4x₁x₃ − 2x₂x₃ + x₁² + x₂² + x₃² − 2x₁ − 4x₂ + 4x₃
and hence find the extrema of f.
Solution:

$$\frac{\partial f}{\partial x_1} = 2x_2 x_3 - 4x_3 + 2x_1 - 2 = 0 \qquad (1)$$
$$\frac{\partial f}{\partial x_2} = 2x_1 x_3 - 2x_3 + 2x_2 - 4 = 0 \qquad (2)$$
$$\frac{\partial f}{\partial x_3} = 2x_1 x_2 - 4x_1 - 2x_2 + 2x_3 + 4 = 0 \qquad (3)$$
• Substituting for x₂ in (3):
2x₁ + x₁x₃ − x₁²x₃ − 2x₁ − 2 − x₃ + x₁x₃ + x₃ = −2
or
x₁x₃(2 − x₁) = 0
thus x₁ = 0 or x₃ = 0 or x₁ = 2.
• Case (i) x₁ = 0:
(1) ⇒ x₂x₃ − 2x₃ = 1 (4)
(2) ⇒ x₂ − x₃ = 2 (5)
(3) ⇒ −x₂ + x₃ = −2, same as (5)
(4) using (5) ⇒ x₃(2 + x₃) − 2x₃ = 1, i.e., x₃² = 1, so x₃ = ±1.
Example
- Sub-case (i): x₃ = 1 (using (5)) ⇒ x₂ = 3
- Sub-case (ii): x₃ = −1 (using (5)) ⇒ x₂ = 1
This gives two stationary points: (0, 3, 1) and (0, 1, −1).
• Case (ii) x₃ = 0:
(1) ⇒ x₁ = 1
(2) ⇒ x₂ = 2
and (3) holds: x₁x₂ − 2x₁ − x₂ = −2 ✓ Therefore the stationary point is (1, 2, 0).
• Case (iii) x₁ = 2:
(1) ⇒ x₂x₃ − 2x₃ = −1 (6)
(2) ⇒ x₂ + x₃ = 2 (7)
(3) ⇒ x₂ + x₃ = 2, same as (7)
Example
(6) using (7) ⇒ x₃(2 − x₃) − 2x₃ = −1, i.e., x₃² = 1, so x₃ = ±1.
- Sub-case (i): x₃ = 1 ⇒ (using (7)) x₂ = 1
- Sub-case (ii): x₃ = −1 ⇒ (using (7)) x₂ = 3
This gives two stationary points: (2, 1, 1) and (2, 3, −1).
• The Hessian matrix:

$$H(X) = \begin{bmatrix} 2 & 2x_3 & 2x_2 - 4 \\ 2x_3 & 2 & 2x_1 - 2 \\ 2x_2 - 4 & 2x_1 - 2 & 2 \end{bmatrix}$$
Example

Point      | Leading principal minors | Nature
(0, 3, 1)  | 2, 0, −32                | Saddle point
(0, 1, −1) | 2, 0, −32                | Saddle point
(1, 2, 0)  | 2, 4, 8                  | Local min
(2, 1, 1)  | 2, 0, −32                | Saddle point
(2, 3, −1) | 2, 0, −32                | Saddle point
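A numerical cross-check of this table (a sketch; the eigenvalues of the Hessian give the same classification as the leading principal minors):

```python
import numpy as np

def hessian(x1, x2, x3):
    return np.array([[2.0,         2.0*x3,      2.0*x2 - 4],
                     [2.0*x3,      2.0,         2.0*x1 - 2],
                     [2.0*x2 - 4,  2.0*x1 - 2,  2.0       ]])

for p in [(0, 3, 1), (0, 1, -1), (1, 2, 0), (2, 1, 1), (2, 3, -1)]:
    ev = np.linalg.eigvalsh(hessian(*p))
    kind = ("local min" if np.all(ev > 0) else
            "local max" if np.all(ev < 0) else "saddle point")
    print(p, np.round(ev, 3), kind)   # only (1, 2, 0) comes out a local min
```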
Convex & Concave Functions
Definition
• A function f(X) = f(x₁, x₂, …, x_n) of n variables is said to be convex if, for each pair of points X₁, X₂ on the graph, the line segment joining these two points lies entirely above or on the graph, i.e.,
f((1 − α)X₁ + αX₂) ≤ (1 − α)f(X₁) + αf(X₂) for all α, 0 ≤ α ≤ 1.
• f is said to be strictly convex if, for each pair of distinct points X₁ ≠ X₂ on the graph,
f((1 − α)X₁ + αX₂) < (1 − α)f(X₁) + αf(X₂) for all α, 0 < α < 1.
• f is called concave (strictly concave) if −f is convex (strictly convex).
Example: convex function
• f(x) = x²
[Figure: the parabola y = x²; the chord between X₁ and X₂ lies above the graph, with the point (1 − α)X₁ + αX₂ marked on the x axis]
Example: concave function
• f(x) = −x²
[Figure: the parabola y = −x²; the chord between X₁ and X₂ lies below the graph]
Example: non-convex/non-concave function
[Figure: a function with both convex and concave regions between X₁ and X₂]
Convexity for a function of one variable
• Convex: d²f/dx² ≥ 0
• Concave: d²f/dx² ≤ 0
Convexity test for functions of 2 variables

Principal minors    | Convex | Strictly convex | Concave | Strictly concave
f_xx                | ≥ 0    | > 0             | ≤ 0     | < 0
f_xx f_yy − (f_xy)² | ≥ 0    | > 0             | ≥ 0     | > 0

$$H(X) = \begin{bmatrix} \dfrac{\partial^2 f}{\partial x^2} & \dfrac{\partial^2 f}{\partial x\,\partial y} \\ \dfrac{\partial^2 f}{\partial y\,\partial x} & \dfrac{\partial^2 f}{\partial y^2} \end{bmatrix} = \begin{bmatrix} f_{xx} & f_{xy} \\ f_{yx} & f_{yy} \end{bmatrix}$$
Example
• Find whether f is convex, concave, or neither:
f(x₁, x₂) = 3x₁ + 5x₂ − 4x₁² + x₂² − 5x₁x₂
Solution
• Put f in matrix form:

$$f(x_1, x_2) = \begin{bmatrix} 3 & 5 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} x_1 & x_2 \end{bmatrix} \begin{bmatrix} -4 & -\tfrac{5}{2} \\ -\tfrac{5}{2} & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \;\Rightarrow\; A = \begin{bmatrix} -4 & -\tfrac{5}{2} \\ -\tfrac{5}{2} & 1 \end{bmatrix}$$
Example

$$\det(A - \lambda I) = \det \begin{bmatrix} -4 - \lambda & -\tfrac{5}{2} \\ -\tfrac{5}{2} & 1 - \lambda \end{bmatrix} = 0 \;\Rightarrow\; \lambda^2 + 3\lambda - \tfrac{41}{4} = 0$$

$$\lambda_1 = \frac{-3 + \sqrt{50}}{2} > 0, \qquad \lambda_2 = \frac{-3 - \sqrt{50}}{2} < 0$$

• Since one eigenvalue is negative and the other is positive, A is neither positive definite nor negative definite, which implies that f is neither convex nor concave.
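Checking the eigenvalues numerically (NumPy):

```python
import numpy as np

A = np.array([[-4.0, -2.5],
              [-2.5,  1.0]])
print(np.linalg.eigvalsh(A))   # approx [-5.036, 2.036]: one of each sign -> indefinite
```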
Example
• Find the local minimum of f(x) = x³.
Solution
∇f(x) = 3x² = 0 ⇒ x = 0; H(x) = 6x ⇒ H(0) = 0.
• The point x = 0 satisfies the FONC and the SONC, but not the SOSC.
Example
• Find the local minimum of f(x₁, x₂) = x₁² + x₂².
Solution

$$\nabla f(x_1, x_2) = \begin{bmatrix} 2x_1 \\ 2x_2 \end{bmatrix} = 0 \;\Rightarrow\; X = \begin{bmatrix} 0 \\ 0 \end{bmatrix}; \qquad H(X) = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix} \text{ for all } X \in \Re^2 \;\Rightarrow\; H(X) \text{ is positive definite}$$

• The point X = [0, 0]ᵀ satisfies the FONC, SONC, and SOSC. It is a strict local minimizer.
• Actually, X = [0, 0]ᵀ is a strict global minimizer.
Example
• Find the local minimum of f(x₁, x₂) = x₁² − x₂².
Solution

$$\nabla f(x_1, x_2) = \begin{bmatrix} 2x_1 \\ -2x_2 \end{bmatrix} = 0 \;\Rightarrow\; X = \begin{bmatrix} 0 \\ 0 \end{bmatrix}; \qquad H(X) = \begin{bmatrix} 2 & 0 \\ 0 & -2 \end{bmatrix} \text{ for all } X \in \Re^2 \;\Rightarrow\; H(X) \text{ is indefinite}$$

• The point X = [0, 0]ᵀ satisfies the FONC, but the SONC is not satisfied.
• It is a saddle point.
Importance of Concave Functions in NLP
• Suppose that we have an NLP with the following properties:
- the feasible region, say Ω, is convex;
- the objective function, say f, is concave;
- the objective is to maximize the value of the objective function:
max z = f(x) s.t. x ∈ Ω
Then:
• Any local maximum is a global maximum!
Data Mining, Spring 2011TMU, M.M.Pedram, [email protected]
Importance of Concave Functions in NLPproofv Suppose there exists a solution x’ that is a local maximum, but not a global
maximum. Since x’ is not a global maximum, there exists a solution x with the property that f(x) > f(x’).
v Now use the fact that f is concave Since f is concave, we have that
v If α is very close to 1, then α x’+(1-α)x is very close to x’ f (αx’+(1-α)x ) > f (x’ )
v Therefore, x’ cannot be a local optimum. This is a contradiction to our assumption that there exists a local maximum that is not a global maximum!
( ) ( ) ( )( ) ( )
( )
(1 ) (1 ) 0 1
(1 )
f x x f x f x
f x f x
f x
α α α α α
α α
′ ′+ − ≥ + − < <
′ ′> + −
′>
, for all
Data Mining, Spring 2011TMU, M.M.Pedram, [email protected]
Importance of Convex Functions in NLP
• Suppose that we have an NLP with the following properties:
- the feasible region, say Ω, is convex;
- the objective function, say f, is convex;
- the objective is to minimize the value of the objective function:
min z = f(x) s.t. x ∈ Ω
Then:
• Any local minimum is a global minimum!
• The Hessian matrix of a convex function is positive semi-definite.
Importance of Convex Functions in NLP
Another reason:
• Basic optimization algorithms search for local optima. Those that try to find global optima generally just run the underlying algorithms several times, starting from different solutions.
Properties of Convex/Concave Functions
Theorem
• Let f ∈ C¹. Then f is convex over a convex set Ω if and only if
f(x) ≥ f(x₀) + ∇f(x₀)ᵀ(x − x₀)
for all x, x₀ ∈ Ω.
• A convex function lies above its tangent planes. [Figure: f(x) with its tangent line at x₀]
Properties of Convex/Concave Functions
Theorem
• Let f ∈ C². Then f is convex over a convex set Ω containing an interior point if and only if the Hessian matrix H of f is positive semi-definite throughout Ω.
Properties of Convex/Concave Functions
Notes
1. The Hessian matrix is the generalization to ℜⁿ of the concept of the curvature of a function, and correspondingly, positive definiteness of the Hessian is the generalization of positive curvature. Convex functions have positive (or at least non-negative) curvature in every direction.
2. We sometimes refer to a function as being locally convex if its Hessian matrix is positive semi-definite in a small region, and locally strictly convex if the Hessian is positive definite in that region.
Properties of Convex/Concave Functions
Theorem 1
• Let f be a convex function defined on the convex set Ω. Then the set Γ where f achieves its minimum is convex, and any relative minimum of f is a global minimum.
Properties of Convex/Concave Functions
Theorem 2
• Let f ∈ C¹ be convex on the convex set Ω. If there is a point x* ∈ Ω such that, for all y ∈ Ω, ∇f(x*)ᵀ(y − x*) ≥ 0, then x* is a global minimum point of f over Ω.
Properties of Convex/Concave Functions
Theorem 3
• Let f be a convex function defined on the bounded, closed convex set Ω. If f has a maximum over Ω, it is achieved at an extreme point of Ω.
• Extrema = minima/maxima
Constrained Optimization
Level Sets of a Function
Definition
• The level set of a function f : ℜⁿ → ℜ at level c is the set of points
S = {x | f(x) = c}
• For f : ℜ² → ℜ, we are usually interested in S when it is a curve.
• For f : ℜ³ → ℜ, the sets S most often considered are surfaces.
Non-linear optimization with constraints
Example #3:
• Product Mix Problem:
Max Z = 13x₁ + 11x₂ (Income) ….. Eq (4)
s.t.
4x₁ + 5x₂ ≤ 1500 (Storage Space)
5x₁ + 3x₂ ≤ 1575 (Raw Material)
x₁ + 2x₂ ≤ 420 (Production Rate)
x₁ ≥ 0, x₂ ≥ 0

Example #3:
[Figure: the feasible region in the (x₁, x₂) plane bounded by lines d₁, d₂, d₃, with x₁-intercepts 375, 315, 420 and x₂-intercepts 300, 525, 210; the optimum is at (270, 75)]
With slack variables:
Max Z = 13x₁ + 11x₂
s.t.
d₁: 4x₁ + 5x₂ + s₁ = 1500
d₂: 5x₁ + 3x₂ + s₂ = 1575
d₃: x₁ + 2x₂ + s₃ = 420
Z_max = 4335 at (x₁, x₂) = (270, 75)
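This LP is easy to verify numerically (a sketch using scipy.optimize.linprog, which minimizes by convention, so the objective is negated):

```python
from scipy.optimize import linprog

# maximize 13*x1 + 11*x2  <=>  minimize -13*x1 - 11*x2
c = [-13, -11]
A_ub = [[4, 5],    # storage space   <= 1500
        [5, 3],    # raw material    <= 1575
        [1, 2]]    # production rate <= 420
b_ub = [1500, 1575, 420]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x, -res.fun)   # [270.  75.]  4335.0
```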
Example #3:
[Figure: level curves of Z = 13x₁ + 11x₂ (values 1000 through 8000) over the feasible region; the optimum lies at the corner (270, 75)]
• f(x₁, x₂) = x₁² + x₂²
[Figure: surface plot of f over x₁, x₂ ∈ [−15, 15]]
[Figure: level sets of f(x₁, x₂) = x₁² + x₂² — concentric circles in the (x₁, x₂) plane]
• f(x₁, x₂) = x₁² − x₂²
[Figure: saddle-shaped surface plot of f and its level sets — hyperbolas in the (x₁, x₂) plane]
• Rosenbrock's function: f(x₁, x₂) = 100(x₂ − x₁²)² + (1 − x₁)²
[Figure: surface plot and banana-shaped level sets of Rosenbrock's function; the global minimum is at (1, 1)]
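Rosenbrock's function is a standard unconstrained test problem; a quick check with SciPy (a sketch, using BFGS from the customary starting point (−1.2, 1)):

```python
import numpy as np
from scipy.optimize import minimize

def rosen(x):
    return 100.0 * (x[1] - x[0]**2)**2 + (1.0 - x[0])**2

res = minimize(rosen, x0=np.array([-1.2, 1.0]), method='BFGS')
print(res.x)          # close to [1. 1.], the global minimizer
print(rosen(res.x))   # close to 0
```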
• Peaks function:
[Figure: surface plot of the peaks function over [−3, 3]², showing several local maxima and minima]
The Importance of Level Sets
Theorem
• The vector ∇f(x₀) is orthogonal to the tangent vector of an arbitrary smooth curve passing through x₀ on the level set determined by f(x) = f(x₀).
The Importance of Convexity
• As said before, convexity guarantees that a local optimum is a global optimum.
Possible Optimal Solutions to Convex NLPs (not occurring at corner points)
[Figure: four panels, each showing a feasible region, objective-function level curves, and the optimal solution:
- linear objective, nonlinear constraints
- nonlinear objective, nonlinear constraints
- nonlinear objective, linear constraints
- nonlinear objective, linear constraints (interior optimum)]
Local vs. Global Optimal Solutions for Nonconvex NLPs
[Figure: a nonconvex feasible region in the (x₁, x₂) plane with points A through G marked; several are local optimal solutions, and one is both a local and the global optimal solution]
NLP with Equality Constraints

$$\min f(\mathbf{x}), \qquad \mathbf{x} = [x_1 \; x_2 \; \cdots \; x_n]^T$$
$$\text{s.t. } h_j(\mathbf{x}) = 0, \quad j = 1, 2, \ldots, m$$
The Lagrangian Function
• Let us introduce the Lagrangian function, L(x, λ), as:
L(x, λ) = f(x) + λᵀh(x)
where

$$\mathbf{h}(\mathbf{x}) = \begin{bmatrix} h_1(\mathbf{x}) \\ h_2(\mathbf{x}) \\ \vdots \\ h_m(\mathbf{x}) \end{bmatrix}, \qquad \boldsymbol{\lambda} = \begin{bmatrix} \lambda_1 \\ \lambda_2 \\ \vdots \\ \lambda_m \end{bmatrix} \quad \text{(the Lagrange multipliers, or dual vector)}$$

• The notation ∇x f(x, y) means the gradient of f with respect to x. Thus:
∇x L(x, λ) = ∇x f(x) + λᵀ∇x h(x)
First Order Necessary Conditions (FONC)
Theorem: Lagrange's Theorem (FONC)
• Let x* be a local minimizer (or maximizer) of f : ℜⁿ → ℜ, subject to h(x) = 0_{m×1}, h : ℜⁿ → ℜᵐ, m ≤ n. Assume that x* is a regular point. Then there exists λ* ∈ ℜᵐ such that:
∇x L(x*, λ*) = ∇x f(x*) + λ*ᵀ∇x h(x*) = 0_{1×n}
• Note that the constraints are in the form h(x) = 0_{m×1}. Thus the above FONC can be stated as:
∇x L(x, λ) = 0_{1×n}
∇λ L(x, λ) = 0_{1×m}
First Order Necessary Conditions (FONC)
• The Lagrangian can be thought of as an unconstrained optimization problem in the variables x₁, x₂, …, x_n and λ₁, λ₂, …, λ_m. The problem can be solved by solving the equations:

$$\nabla_x L = \mathbf{0}_{n \times 1}, \quad \text{or} \quad \frac{\partial L}{\partial x_i} = 0 \;\; \text{for } i = 1, 2, \ldots, n$$
$$\nabla_\lambda L = \mathbf{0}_{m \times 1}, \quad \text{or} \quad \frac{\partial L}{\partial \lambda_j} = 0 \;\; \text{for } j = 1, 2, \ldots, m$$
First Order Necessary Conditions (FONC)
• For n = 2 and m = 1, i.e., when f is a function of 2 variables and there is only one constraint, Lagrange's FONC for a local minimizer (or maximizer) x* is stated as:
∇f(x*) + λ*∇h(x*) = 0
• The equation means that ∇f(x*) and ∇h(x*) must be collinear — pointing in the same or exactly opposite directions, depending on the sign of λ* — at a minimum or maximum point!
Example
• Consider the problem:
min x₁ + x₂
s.t.: x₁² + x₂² − 1 = 0
• The feasible region is a circle of radius one. The objective-function level curves are lines with a slope of −1. The minimum is the point where the lowest such line still touches the circle.
Example
[Figure: the unit circle (feasible region) in the (x₁, x₂) plane with level lines f(x) = 1, f(x) = 0, and f(x) = −1.414; the gradient ∇f(x) points in the direction of increasing f, and the point x* = [0.707, 0.707]ᵀ is marked]
Example
• Since the objective-function level curves are straight parallel lines, the gradient of f is constant, pointing in the direction of increasing f — toward the upper right.
• The gradient of h points outward from the circle, so its direction depends on the point at which the gradient is evaluated.
Example
[Figure: the unit circle with ∇f(x*) and ∇h(x*) collinear at x* = [0.707, 0.707]ᵀ, and a second point x¹ where ∇f(x¹) and ∇h(x¹) are not parallel; the tangent plane at x¹ is shown]
Conclusions
• At the optimum point, ∇f(x) is parallel to ∇h(x).
• As we can see at the point x¹, ∇f(x¹) is not parallel to ∇h(x¹), and we can move (down along the circle) to improve the objective function.
• We can say that at a max or min, ∇f(x) must be parallel to ∇h(x); otherwise, we could improve the objective function by changing position.
Example
• Using the FONC for the previous example:
L(x, λ) = f(x) + λh(x) = (x₁ + x₂) + λ(x₁² + x₂² − 1)
• And the FONC equations are:
∂L/∂xᵢ = 0 for i = 1, 2
∂L/∂λ = 0
Example
• This becomes:
∂L/∂x₁ = 1 + 2λx₁ = 0
∂L/∂x₂ = 1 + 2λx₂ = 0
∂L/∂λ = x₁² + x₂² − 1 = 0
• There are three equations and three unknowns. Solving the system:
x₁ = x₂ = ±0.707, λ = ∓0.707
• It can be seen from the graph that positive x₁ and x₂ correspond to the maximum, while negative x₁ and x₂ correspond to the minimum.
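The same FONC system can be solved symbolically (a SymPy sketch):

```python
import sympy as sp

x1, x2, lam = sp.symbols('x1 x2 lam', real=True)
L = x1 + x2 + lam * (x1**2 + x2**2 - 1)

eqs = [sp.diff(L, v) for v in (x1, x2, lam)]
print(sp.solve(eqs, [x1, x2, lam]))
# two solutions: (x1, x2, lam) = (±sqrt(2)/2, ±sqrt(2)/2, ∓sqrt(2)/2)
```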
Limitations of FONC
• The FONC do not guarantee that the solution(s) will be minima/maxima.
• As in the case of unconstrained optimization, they only provide us with candidate points that need to be verified by the second-order conditions.
• If the problem is convex, then the FONC do guarantee that the solutions are extreme points.
Some Definitions
• Let:
- ∇²ₓL(x, λ): the Hessian matrix of L(x, λ) = f(x) + λᵀh(x) with respect to x,
- ∇²f(x): the Hessian matrix of f(x),
- ∇²hⱼ(x): the Hessian matrix of hⱼ(x), j = 1, 2, …, m.
Then:
∇²ₓL(x, λ) = ∇²f(x) + λ₁∇²h₁(x) + λ₂∇²h₂(x) + … + λ_m∇²h_m(x)
Some Definitions
• The tangent space at a point x* on the surface
S = {x ∈ ℜⁿ : h(x) = 0}
is the set T(x*) = {y | ⟨∇h(x*), y⟩ = 0}.
Tangent Plane
Example
• Let S = {x ∈ ℜ³ : h₁(x) = x₁ = 0, h₂(x) = x₁ − x₂ = 0}; then:

$$\nabla \mathbf{h}(\mathbf{x}) = \begin{bmatrix} \nabla h_1(\mathbf{x})^T \\ \nabla h_2(\mathbf{x})^T \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 1 & -1 & 0 \end{bmatrix}$$

$$T(\mathbf{x}) = \{\mathbf{y} \mid \nabla h_1(\mathbf{x})^T \mathbf{y} = 0,\; \nabla h_2(\mathbf{x})^T \mathbf{y} = 0\} = \left\{ \mathbf{y} \,\middle|\, \begin{bmatrix} 1 & 0 & 0 \\ 1 & -1 & 0 \end{bmatrix} \mathbf{y} = \mathbf{0} \right\} = \left\{ \begin{bmatrix} 0 \\ 0 \\ r \end{bmatrix} \,\middle|\, r \in \Re \right\},$$

i.e., the x₃ axis in ℜ³.
Second Order Necessary Conditions (SONC)
Theorem: SONC
• Let x* be a local minimizer of f : ℜⁿ → ℜ, subject to h(x) = 0_{m×1}, h : ℜⁿ → ℜᵐ, m ≤ n. Assume that x* is a regular point. Then there exists λ* ∈ ℜᵐ such that:
1. ∇x L(x*, λ*) = ∇x f(x*) + λ*ᵀ∇x h(x*) = 0
2. For all y ∈ T(x*), we have yᵀ∇²ₓL(x*, λ*) y ≥ 0.
Second Order Sufficient Conditions (SOSC)
Theorem: SOSC
• Let f, h ∈ C², and suppose there are a point x* ∈ ℜⁿ and a λ* ∈ ℜᵐ such that:
1. ∇x L(x*, λ*) = ∇x f(x*) + λ*ᵀ∇x h(x*) = 0
2. For all y ∈ T(x*), y ≠ 0, we have yᵀ∇²ₓL(x*, λ*) y > 0.
Then x* is a strict local minimizer of f subject to h(x) = 0_{m×1}.
The Tangent Space
[Figure: the constraint surface h(x) = 0 in (x₁, x₂, x₃) space, with the tangent plane (all possible y vectors) at x* and the normal vector ∇h(x*)]
• The tangent plane is the location of all y vectors and passes through x*; it must be orthogonal (perpendicular) to ∇h(x*).
Note
• Note the similarity between the Lagrangian approach and unconstrained optimization.
Maximization Problems
• Note that the previous definitions of the SONC & SOSC were for minimization problems!
• For maximization problems, the sense of the inequality sign is reversed (as in unconstrained optimization):
SONC: yᵀ∇²ₓL(x, λ)y ≤ 0
SOSC: yᵀ∇²ₓL(x, λ)y < 0
Necessary & Sufficient
• The necessary conditions are required for a point to be an extremum, but even if they are satisfied, they do not guarantee that the point is an extremum.
• If the sufficient conditions are true, then the point is guaranteed to be an extremum. But if they are not satisfied, this does not mean that the point is not an extremum.
Procedure
1. Solve the FONC to obtain candidate points.
2. Test the candidate points with the SONC:
- Eliminate any points that do not satisfy the SONC.
3. Test the remaining points with the SOSC:
- The points that satisfy it are minima/maxima.
- For the points that do not satisfy it, we cannot say whether they are extreme points or not.
NLP with Inequality Constraints
• Consider problems such as:
min f(x), x = [x₁ x₂ … x_n]ᵀ
s.t. hᵢ(x) = 0, i = 1, …, m
gⱼ(x) ≤ 0, j = 1, …, p
• An inequality constraint gⱼ(x) ≤ 0 is called "active" at x* if gⱼ(x*) = 0.
• Let the set I(x*) contain all the indices of the active constraints at x*:
gⱼ(x*) = 0 for all j in the set I(x*)
NLP with Inequality Constraints
• The generalized Lagrangian is written:

$$L(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\mu}) = f(\mathbf{x}) + \sum_{i=1}^{m} \lambda_i\, h_i(\mathbf{x}) + \sum_{j=1}^{p} \mu_j\, g_j(\mathbf{x})$$

• We use λ's for the equalities and μ's for the inequalities.
FONC for Equality & Inequality Constraints
Karush-Kuhn-Tucker (KKT) Theorem
• For the generalized Lagrangian, the FONC become:

$$\nabla_x L(\mathbf{x}^*, \boldsymbol{\lambda}^*, \boldsymbol{\mu}^*) = \nabla_x f(\mathbf{x}^*) + \sum_{i=1}^{m} \lambda_i^* \nabla_x h_i(\mathbf{x}^*) + \sum_{j=1}^{p} \mu_j^* \nabla_x g_j(\mathbf{x}^*) = \mathbf{0}_{n \times 1}$$

together with the complementary slackness conditions:

$$\mu_j^*\, g_j(\mathbf{x}^*) = 0, \qquad \mu_j^* \ge 0, \qquad j = 1, \ldots, p.$$
SONC for Equality & Inequality Constraints
• Non-negative Lagrange multiplier — two cases:
1. gⱼ(x) = 0 (the constraint is active),
2. gⱼ(x) < 0 → μⱼ = 0.
• The SONC (for a minimization problem) are: for all y ∈ T(x*), we have
yᵀ∇²ₓL(x*, λ*, μ*) y ≥ 0
where J(x*)·y = 0 as before.
• J(x*): the matrix of the gradients of all the equality constraints and only those inequality constraints that are active at x*.
SOSC for Equality & Inequality Constraints
• The SOSC for a minimization problem with equality & inequality constraints is:
yᵀ∇²ₓL(x*, λ*, μ*) y > 0 for all y ∈ T(x*), y ≠ 0.
Example
• Solve the problem:
min f(x) = (x₁ − 1)² + x₂²
s.t. h(x) = x₁² + x₂² + x₁ + x₂ = 0
g(x) = x₁ − x₂² ≤ 0
• The Lagrangian for this problem is:
L(x, λ, μ) = (x₁ − 1)² + x₂² + λ(x₁² + x₂² + x₁ + x₂) + μ(x₁ − x₂²)
Example
• The first-order necessary conditions:
∂L/∂x₁ = 2(x₁ − 1) + 2λx₁ + λ + μ = 0
∂L/∂x₂ = 2x₂ + 2λx₂ + λ − 2μx₂ = 0
∂L/∂λ = x₁² + x₂² + x₁ + x₂ = 0
μ(x₁ − x₂²) = 0
Example
• Solving the 4 FONC equations, we get 2 solutions:
1. x⁽¹⁾ = [0.2056, −0.4534]ᵀ, λ = 0.45, μ = 0.9537
2. x⁽²⁾ = [0, 0]ᵀ, λ = 0, μ = 2
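These candidates can be cross-checked numerically (a sketch with scipy.optimize.minimize and SLSQP; note SciPy's convention that inequality constraints are written as fun(x) ≥ 0, so g(x) = x₁ − x₂² ≤ 0 is passed as x₂² − x₁ ≥ 0):

```python
import numpy as np
from scipy.optimize import minimize

f = lambda x: (x[0] - 1)**2 + x[1]**2
cons = [
    {'type': 'eq',   'fun': lambda x: x[0]**2 + x[1]**2 + x[0] + x[1]},
    {'type': 'ineq', 'fun': lambda x: x[1]**2 - x[0]},   # x1 - x2^2 <= 0
]

res = minimize(f, x0=[0.5, -0.5], method='SLSQP', constraints=cons)
print(res.x)   # approximately [0.2056, -0.4534], the first candidate x(1)
```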
Example
• Now try the SONC at the 1st solution: both h(x) and g(x) are active at this point (they both equal zero). So the Jacobian consists of the gradients of both functions evaluated at x⁽¹⁾:

$$J(\mathbf{x}^{(1)}) = \begin{bmatrix} 2x_1 + 1 & 2x_2 + 1 \\ 1 & -2x_2 \end{bmatrix}_{\mathbf{x}^{(1)}} = \begin{bmatrix} 1.411 & 0.0932 \\ 1 & 0.9068 \end{bmatrix}$$
Example
• The only solution of the equation J(x⁽¹⁾)·y = 0 is y = [0, 0]ᵀ (the Jacobian is nonsingular). And the Hessian of the Lagrangian is:

$$\nabla_x^2 L = \begin{bmatrix} 2 + 2\lambda & 0 \\ 0 & 2 + 2\lambda - 2\mu \end{bmatrix}_{\mathbf{x}^{(1)}} = \begin{bmatrix} 2.9 & 0 \\ 0 & 0.993 \end{bmatrix}$$
Example
• So the SONC inequality reads:

$$\begin{bmatrix} 0 & 0 \end{bmatrix} \begin{bmatrix} 2.9 & 0 \\ 0 & 0.993 \end{bmatrix} \begin{bmatrix} 0 \\ 0 \end{bmatrix} = 0 \ge 0$$

• This inequality is true, so the SONC is satisfied for x⁽¹⁾ and it is still a candidate point.
Example
• The SOSC requires:
yᵀ∇²ₓL(x*, λ*, μ*) y > 0
We just calculated the left-hand side to be zero, since the only y in the tangent space is y = 0. So in our case, for x⁽¹⁾:
yᵀ∇²ₓL y = 0 ≯ 0
Thus the SOSC is not satisfied.
Example
• For the second solution, x⁽²⁾ = [0, 0]ᵀ: again, both h(x) and g(x) are active at this point. The Jacobian is:

$$J(\mathbf{x}^{(2)}) = \begin{bmatrix} 2x_1 + 1 & 2x_2 + 1 \\ 1 & -2x_2 \end{bmatrix}_{\mathbf{x}^{(2)}} = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}$$
Example
• The only solution of the equation J(x⁽²⁾)·y = 0 is y = [0, 0]ᵀ, and the Hessian of the Lagrangian is:

$$\nabla_x^2 L = \begin{bmatrix} 2 + 2\lambda & 0 \\ 0 & 2 + 2\lambda - 2\mu \end{bmatrix}_{\mathbf{x}^{(2)}} = \begin{bmatrix} 2 & 0 \\ 0 & -2 \end{bmatrix}$$
Example
• So the SONC inequality reads:

$$\begin{bmatrix} 0 & 0 \end{bmatrix} \begin{bmatrix} 2 & 0 \\ 0 & -2 \end{bmatrix} \begin{bmatrix} 0 \\ 0 \end{bmatrix} = 0 \ge 0$$

• This inequality is true, so the SONC is satisfied for x⁽²⁾ and it is still a candidate point.
Example
• The SOSC requires:
yᵀ∇²ₓL(x*, λ*, μ*) y > 0
Again, the left-hand side is zero, since the only y in the tangent space is y = 0. So in our case, for x⁽²⁾:
yᵀ∇²ₓL y = 0 ≯ 0
Thus the SOSC is not satisfied.
Example Conclusions
• So we can say that both x⁽¹⁾ and x⁽²⁾ may be local minima, but we cannot be sure, because the SOSC is not satisfied at either point.
Dual Problem
• Using the dual problem: constrained optimization → unconstrained optimization.
• Need to change maximization to minimization.
• Only valid when the original optimization problem is convex/concave (strong duality).
• Primal problem:
x* = argmax_x f(x) subject to g(x) = c
• Dual problem:
l(λ) = max_x ( f(x) + λ(g(x) − c) ), λ* = argmin_λ l(λ)
• When the problem is convex/concave (strong duality), the primal solution x* is recovered from the dual solution λ*.
Example:
max_{x,y} xy subject to x + y ≤ 6
• Introduce a Lagrange multiplier λ for the constraint 6 − (x + y) ≥ 0.
• Construct the Lagrangian:
L(x, y) = xy + λ(6 − x − y)
• KKT conditions:
∂L/∂x = y − λ = 0
∂L/∂y = x − λ = 0
⇒ x = y = λ
x + y ≤ 6 ⇒ λ ≤ 3
λ(6 − x − y) = 0
• Expressing the objective function using λ (with the sign flipped to turn the maximization into a minimization):
min_λ l(λ) = λ² − 6λ s.t. λ ≤ 3
• The solution is λ = 3, i.e., x = y = 3.
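A numeric sanity check of this example (a sketch; SciPy minimizes, so the objective is negated, and bounds x, y ≥ 0 are added here because without them xy is unbounded over x + y ≤ 6):

```python
from scipy.optimize import minimize

res = minimize(lambda v: -(v[0] * v[1]),       # maximize xy
               x0=[1.0, 1.0], method='SLSQP',
               bounds=[(0, None), (0, None)],  # assumption: x, y >= 0
               constraints=[{'type': 'ineq', 'fun': lambda v: 6 - v[0] - v[1]}])
print(res.x, res.x[0] * res.x[1])   # approximately [3. 3.] and 9.0
```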
Perceptron: Linear Separators
• Binary classification can be viewed as the task of separating classes in feature space:
[Figure: a separating hyperplane wᵀx + b = 0, with wᵀx + b > 0 on one side and wᵀx + b < 0 on the other]
f(x) = sign(wᵀx + b)
Linear Separators
• Which of the linear separators is optimal? [Figure: several separating lines for the same two-class data]
Classification Margin
• The distance from an example xᵢ to the separator is

$$r = \frac{\mathbf{w}^T \mathbf{x}_i + b}{\|\mathbf{w}\|}$$

• Examples closest to the hyperplane are support vectors.
• The margin ρ of the separator is the distance between support vectors. [Figure: separator with margin ρ and distance r]
Maximum Margin Classification
• Maximizing the margin is good.
• It implies that only support vectors matter; the other training examples are ignorable.
Linear SVM Mathematically
• Let the training set {(xᵢ, yᵢ)}, i = 1..n, xᵢ ∈ Rᵈ, yᵢ ∈ {−1, 1}, be separated by a hyperplane with margin ρ. Then for each training example (xᵢ, yᵢ):
wᵀxᵢ + b ≤ −ρ/2 if yᵢ = −1
wᵀxᵢ + b ≥ ρ/2 if yᵢ = 1
⇔ yᵢ(wᵀxᵢ + b) ≥ ρ/2
• For every support vector xₛ the above inequality is an equality. After rescaling w and b by ρ/2 in the equality, we obtain that the distance between each xₛ and the hyperplane is

$$r = \frac{y_s(\mathbf{w}^T \mathbf{x}_s + b)}{\|\mathbf{w}\|} = \frac{1}{\|\mathbf{w}\|}$$

• Then the margin can be expressed through the (rescaled) w and b as:

$$\rho = 2r = \frac{2}{\|\mathbf{w}\|}$$
Linear SVMs Mathematically (cont.)
• Then we can formulate the quadratic optimization problem:
Find w and b such that ρ = 2/‖w‖ is maximized and, for all (xᵢ, yᵢ), i = 1..n: yᵢ(wᵀxᵢ + b) ≥ 1.
• Which can be reformulated as:
Find w and b such that Φ(w) = ‖w‖² = wᵀw is minimized and, for all (xᵢ, yᵢ), i = 1..n: yᵢ(wᵀxᵢ + b) ≥ 1.
Linear SVMs Mathematically (cont.)
• Use Lagrangian theory to solve the optimization problem:

$$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\langle \mathbf{w}, \mathbf{w} \rangle - \sum_{i=1}^{l} \alpha_i \big[ y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle + b) - 1 \big]$$

• FOCs:

$$\frac{\partial L(\mathbf{w}, b, \boldsymbol{\alpha})}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{l} \alpha_i y_i \mathbf{x}_i = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{l} \alpha_i y_i \mathbf{x}_i$$

$$\frac{\partial L(\mathbf{w}, b, \boldsymbol{\alpha})}{\partial b} = \sum_{i=1}^{l} \alpha_i y_i = 0 \;\Rightarrow\; \sum_{i=1}^{l} \alpha_i y_i = 0$$
Linear SVMs Mathematically (cont.)
• Substitute w into the Lagrangian:

$$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\langle \mathbf{w}, \mathbf{w} \rangle - \sum_{i=1}^{l} \alpha_i \big[ y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle + b) - 1 \big] = \frac{1}{2}\sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle - \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle + \sum_{i=1}^{l} \alpha_i = \sum_{i=1}^{l} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle$$

• Dual problem:

$$\max_{\boldsymbol{\alpha}} W(\boldsymbol{\alpha}) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle$$
$$\text{s.t. } \sum_{i=1}^{l} y_i \alpha_i = 0, \qquad \alpha_i \ge 0, \; i = 1, \ldots, l$$
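A small end-to-end sketch of this dual on a two-point toy dataset (solved here with SciPy's generic SLSQP rather than a dedicated QP solver; all data are made up):

```python
import numpy as np
from scipy.optimize import minimize

# toy linearly separable data: one positive and one negative example
X = np.array([[2.0, 2.0],
              [0.0, 0.0]])
y = np.array([1.0, -1.0])
K = X @ X.T                      # Gram matrix <x_i, x_j>

def neg_W(a):                    # negate W(alpha) so we can minimize
    return -(a.sum() - 0.5 * (a * y) @ K @ (a * y))

res = minimize(neg_W, x0=np.zeros(2), method='SLSQP',
               bounds=[(0, None)] * 2,
               constraints=[{'type': 'eq', 'fun': lambda a: a @ y}])
alpha = res.x                    # [0.25, 0.25]
w = (alpha * y) @ X              # w = sum_i alpha_i y_i x_i = [0.5, 0.5]
b = y[0] - w @ X[0]              # from y_s (w.x_s + b) = 1  ->  b = -1
print(alpha, w, b)
```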
Support Vector Classification (cont.)
• Moving to the linearly non-separable situation: make the training sample linearly separable in the feature space implicitly defined by a kernel K(x, z) = ⟨φ(x), φ(z)⟩.
• Primal problem:

$$\min_{\mathbf{w}, b} \langle \mathbf{w}, \mathbf{w} \rangle \quad \text{s.t. } y_i(\langle \mathbf{w}, \phi(\mathbf{x}_i) \rangle + b) \ge 1, \; i = 1, \ldots, l$$

• Dual problem:

$$\max_{\boldsymbol{\alpha}} W(\boldsymbol{\alpha}) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle = \sum_{i=1}^{l} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j K(\mathbf{x}_i, \mathbf{x}_j)$$
$$\text{s.t. } \sum_{i=1}^{l} y_i \alpha_i = 0, \qquad \alpha_i \ge 0, \; i = 1, \ldots, l$$
Implementation Techniques
• What problem do we have?

$$\max_{\boldsymbol{\alpha}} W(\boldsymbol{\alpha}) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j K(\mathbf{x}_i, \mathbf{x}_j) \quad \text{s.t. } \sum_{i=1}^{l} y_i \alpha_i = 0, \; \alpha_i \ge 0, \; i = 1, \ldots, l$$

- This is a quadratic programming problem.
  - Standard software packages exist.
  - General (non-convex) QP is NP-hard; this convex instance is tractable, but generic solvers scale poorly (memory for the l×l kernel matrix).
- It is also a very special quadratic programming problem.