A Review of Our Course: Classification and Regression — The Perceptron Algorithm: Primal vs. Dual Form
Date posted: 22-Dec-2015
A Review of Our Course: Classification and Regression
The Perceptron Algorithm: Primal vs. Dual Form
An on-line, mistake-driven procedure:
Update the weight vector and bias when a point is misclassified.
It converges when the problem is linearly separable.
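The primal update rule described above can be sketched in a few lines of NumPy. This is a minimal illustration on hypothetical toy data, not code from the course:

```python
import numpy as np

def perceptron(X, y, lr=1.0, max_epochs=100):
    """Primal perceptron: update (w, b) on every misclassified point."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:   # misclassified (or on the boundary)
                w += lr * yi * xi        # mistake-driven update
                b += lr * yi
                mistakes += 1
        if mistakes == 0:                # converged: a full pass with no mistakes
            break
    return w, b

# Linearly separable toy data (hypothetical, for illustration only)
X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron(X, y)
print(np.sign(X @ w + b))  # [ 1.  1. -1. -1.]
```

On separable data the loop exits as soon as a full epoch makes no mistakes, which is exactly the convergence condition on the slide.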
Classification Problem: 2-Category, Linearly Separable Case
(Figure: two point sets A− (e.g. benign) and A+ (e.g. malignant) separated by the planes x′w + b = −1, x′w + b = 0, and x′w + b = +1.)
Algebra of the Classification Problem (Linearly Separable Case)
Given ℓ points in the n-dimensional real space R^n, represented by an ℓ × n matrix A.
Membership of each point A_i in the classes A−, A+ is specified by an ℓ × ℓ diagonal matrix D:
D_ii = −1 if A_i ∈ A− and D_ii = +1 if A_i ∈ A+.
Separate A− and A+ by two bounding planes such that:
A_i w + b ≥ +1 for D_ii = +1;  A_i w + b ≤ −1 for D_ii = −1.
More succinctly: D(Aw + eb) ≥ e, where e = [1, 1, …, 1]′ ∈ R^ℓ.
Robust Linear Programming (Preliminary Approach to SVM)
D(Aw + eb) + ξ ≥ e, ξ ≥ 0, where ξ is a nonnegative slack (error) vector.
The term e′ξ, the 1-norm measure of the error vector, is called the training error.
(LP)  min_{w,b,ξ} e′ξ  s.t. D(Aw + eb) + ξ ≥ e, ξ ≥ 0
For the linearly separable case, at the solution of (LP): ξ = 0.
Support Vector Machines: Maximizing the Margin between Bounding Planes
(Figure: the bounding planes x′w + b = +1 and x′w + b = −1 around A+ and A−, with normal vector w; the distance between them, 2/||w||_2, is the margin.)
Support Vector Classification (Linearly Separable Case, Primal)
The hyperplane (w, b) that solves the minimization problem
min_{(w,b) ∈ R^{n+1}} (1/2)||w||_2^2  s.t. D(Aw + eb) ≥ e
realizes the maximal-margin hyperplane with geometric margin γ = 1/||w||_2.
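As a numerical sketch (not from the slides), the hard-margin problem can be approximated with scikit-learn's linear SVC by making C very large, so slack is effectively forbidden; the data below are hypothetical:

```python
import numpy as np
from sklearn.svm import SVC

# Approximate min (1/2)||w||^2 s.t. D(Aw + eb) >= e with a huge C
X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
margin = 2.0 / np.linalg.norm(w)   # distance between the two bounding planes
print(clf.predict(X), margin)
```

The recovered `margin` is exactly the 2/||w||_2 quantity from the previous slide.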
Soft Margin SVM (Nonseparable Case)
If the data are not linearly separable, the primal problem is infeasible and the dual problem is unbounded above.
Introduce a slack variable for each training point:
y_i(w′x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0 for all i.
The inequality system is then always feasible, e.g. w = 0, b = 0, ξ = e.
(Figure: two classes (x's and o's) with margin γ; slack variables ξ_i and ξ_j mark points lying beyond their bounding planes.)
Two Different Measures of Training Error
2-Norm Soft Margin:
min_{(w,b,ξ) ∈ R^{n+1+ℓ}} (1/2)||w||_2^2 + (C/2)||ξ||_2^2  s.t. D(Aw + eb) + ξ ≥ e
1-Norm Soft Margin:
min_{(w,b,ξ) ∈ R^{n+1+ℓ}} (1/2)||w||_2^2 + C e′ξ  s.t. D(Aw + eb) + ξ ≥ e, ξ ≥ 0
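The two slack penalties correspond to the `loss` choices in scikit-learn's `LinearSVC` (`hinge` ≈ 1-norm slack e′ξ, `squared_hinge` ≈ 2-norm slack ξ′ξ). A sketch on hypothetical noisy data — note `LinearSVC` also regularizes b, so it only approximates the slide's formulations:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical nonseparable data: a linear rule plus label noise
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
y = np.where(X[:, 0] + X[:, 1] + 0.3 * rng.normal(size=60) > 0, 1, -1)

m1 = LinearSVC(loss="hinge", C=1.0, max_iter=100000).fit(X, y)          # ~1-norm slack
m2 = LinearSVC(loss="squared_hinge", C=1.0, max_iter=100000).fit(X, y)  # ~2-norm slack
print(m1.score(X, y), m2.score(X, y))
```

Both remain feasible on nonseparable data, which is the whole point of the soft margin.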
Optimization Problem Formulation
Problem setting: given functions f, g_i (i = 1, …, k) and h_j (j = 1, …, m) defined on a domain Ω ⊆ R^n:
min_{x ∈ Ω} f(x)  subject to g_i(x) ≤ 0 ∀i, h_j(x) = 0 ∀j
where f(x) is called the objective function and g(x) ≤ 0, h(x) = 0 are called the constraints.
Definitions and Notation
Feasible region: F = {x ∈ Ω | g(x) ≤ 0, h(x) = 0},
where g(x) = [g_1(x), …, g_k(x)]′ and h(x) = [h_1(x), …, h_m(x)]′.
A solution of the optimization problem is a point x* ∈ F such that there is no x ∈ F for which f(x) < f(x*); such an x* is called a global minimum.
Definitions and Notation
A point x̄ ∈ F is called a local minimum of the optimization problem if there exists ε > 0 such that f(x) ≥ f(x̄) for all x ∈ F with ||x − x̄|| < ε.
At the solution x*, an inequality constraint g_i(x) is said to be active if g_i(x*) = 0; otherwise it is called inactive.
g_i(x) ≤ 0 ⇔ g_i(x) + ξ_i = 0, ξ_i ≥ 0, where ξ_i is called the slack variable.
Definitions and Notation
Removing an inactive constraint from an optimization problem does NOT affect the optimal solution — a very useful feature in SVM.
If F = R^n, the problem is called an unconstrained minimization problem.
The SSVM formulation is in this category; without a convexity assumption it is difficult to find the global minimum.
The least-squares problem is also in this category.
Gradient and Hessian
Let f : R^n → R be a differentiable function. The gradient of f at a point x ∈ R^n is defined as
∇f(x) = [∂f(x)/∂x_1, ∂f(x)/∂x_2, …, ∂f(x)/∂x_n] ∈ R^n
If f : R^n → R is twice differentiable, the Hessian matrix of f at a point x ∈ R^n is defined as the n × n matrix
∇²f(x) = [∂²f(x)/∂x_i∂x_j] ∈ R^{n×n},
whose (i, j) entry is ∂²f(x)/∂x_i∂x_j, for i, j = 1, …, n.
The Most Important Concept in Optimization (Minimization)
A point is an optimal solution of an unconstrained minimization problem if there exists no descent direction.
A point is an optimal solution of a constrained minimization problem if there exists no feasible descent direction.
A descent direction may still exist, but moving along it would leave the feasible region.
Two Important Algorithms for the Unconstrained Minimization Problem
Steepest descent with exact line search
Newton's method
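Both algorithms can be sketched on a small convex quadratic, where the exact line-search step has a closed form and Newton's method converges in one step. A standalone illustration (test function and all names are hypothetical):

```python
import numpy as np

# Minimize f(x) = x1^2 + 10*x2^2, whose unique minimizer is the origin.
H = np.array([[2.0, 0.0], [0.0, 20.0]])   # constant Hessian of f

def grad(x):
    return H @ x                          # gradient of the quadratic

# Steepest descent with exact line search (closed-form step for quadratics)
x_sd = np.array([1.0, 1.0])
for _ in range(100):
    g = grad(x_sd)
    t = (g @ g) / (g @ H @ g)             # exact minimizing step length
    x_sd = x_sd - t * g

# Newton's method: one step solves a quadratic exactly
x_nt = np.array([1.0, 1.0])
x_nt = x_nt - np.linalg.solve(H, grad(x_nt))

print(np.allclose(x_sd, 0, atol=1e-6), np.allclose(x_nt, 0))
```

The contrast is the usual one: steepest descent zig-zags at a rate governed by the Hessian's condition number, while Newton's method uses curvature and lands on the minimizer immediately here.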
Linear Program and Quadratic Program
An optimization problem in which the objective function and all constraints are linear is called a linear programming (LP) problem.
If the objective function is convex quadratic while the constraints are all linear, the problem is called a convex quadratic programming (QP) problem.
The standard SVM formulation is in the QP category; the ||·||_1 SVM formulation is an LP.
Lagrangian Dual Problem
max_{α,β} min_{x ∈ Ω} L(x, α, β)  subject to α ≥ 0
Equivalently:
max_{α,β} θ(α, β)  subject to α ≥ 0, where θ(α, β) = inf_{x ∈ Ω} L(x, α, β)
Weak Duality Theorem
Let x̄ ∈ Ω be a feasible solution of the primal problem and (α, β) a feasible solution of the dual problem. Then f(x̄) ≥ θ(α, β).
Corollary: sup{θ(α, β) | α ≥ 0} ≤ inf{f(x) | g(x) ≤ 0, h(x) = 0}
Proof sketch: θ(α, β) = inf_{x ∈ Ω} L(x, α, β) ≤ L(x̄, α, β) ≤ f(x̄) for any feasible x̄.
Saddle Point of the Lagrangian
Let x* ∈ Ω, α* ≥ 0, β* ∈ R^m satisfy
L(x*, α, β) ≤ L(x*, α*, β*) ≤ L(x, α*, β*)  for all x ∈ Ω, α ≥ 0.
Then (x*, α*, β*) is called a saddle point of the Lagrangian function.
Dual Problem of a Linear Program
Primal LP: min_{x ∈ R^n} p′x  subject to Ax ≥ b, x ≥ 0
Dual LP: max_{α ∈ R^m} b′α  subject to A′α ≤ p, α ≥ 0
※ All duality theorems hold and work perfectly!
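The absence of a duality gap for LPs can be checked numerically on a small hypothetical example (not from the slides), using `scipy.optimize.linprog` as the solver:

```python
import numpy as np
from scipy.optimize import linprog

p = np.array([2.0, 3.0])
A = np.array([[1.0, 1.0], [1.0, 2.0]])
b = np.array([2.0, 3.0])

# Primal: min p'x  s.t.  Ax >= b, x >= 0   (rewritten as -Ax <= -b)
primal = linprog(p, A_ub=-A, b_ub=-b, bounds=[(0, None)] * 2)
# Dual:   max b'a  s.t.  A'a <= p, a >= 0  (negated to fit linprog's min)
dual = linprog(-b, A_ub=A.T, b_ub=p, bounds=[(0, None)] * 2)

print(primal.fun, -dual.fun)  # equal optimal values: no duality gap
```

For this instance both problems attain the value 5 at x = (1, 1) and α = (1, 1).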
Dual Problem of a Strictly Convex Quadratic Program
Primal QP: min_{x ∈ R^n} (1/2)x′Qx + p′x  subject to Ax ≤ b
With the strict convexity assumption, we have
Dual QP: max −(1/2)(p′ + α′A)Q⁻¹(A′α + p) − α′b  subject to α ≥ 0
Support Vector Classification (Linearly Separable Case, Dual Form)
The dual problem of the previous mathematical program:
max_{α ∈ R^ℓ} e′α − (1/2)α′DAA′Dα  subject to e′Dα = 0, α ≥ 0
Applying the KKT optimality conditions, we have w = A′Dα. But where is b?
Don't forget the complementarity condition: 0 ≤ α ⊥ D(Aw + eb) − e ≥ 0
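Recovering w and b from a dual solution can be sketched with scikit-learn's `SVC` standing in as the QP solver (an assumption, not the slides' solver; data are hypothetical). For a support vector i, complementarity forces y_i(x_i′w + b) = 1, which pins down b:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)   # near hard-margin

# w = A'D alpha: sklearn stores y_i * alpha_i in dual_coef_
w = clf.dual_coef_[0] @ clf.support_vectors_
# Complementarity at a support vector i: y_i (x_i . w + b) = 1  =>  b = y_i - x_i . w
i = clf.support_[0]
b = y[i] - X[i] @ w
print(np.allclose(w, clf.coef_[0]), abs(b - clf.intercept_[0]) < 1e-3)
```

Only points with α_i > 0 (the support vectors) contribute to w, which is the slide's point.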
Dual Representation of SVM (Key of Kernel Methods)
The hypothesis is determined by (α*, b*):
h(x) = sgn(⟨x, A′Dα*⟩ + b*)
     = sgn(Σ_{i=1}^{ℓ} y_i α_i* ⟨x_i, x⟩ + b*)
     = sgn(Σ_{α_i* > 0} y_i α_i* ⟨x_i, x⟩ + b*)
Remember: w = A′Dα* = Σ_{i=1}^{ℓ} y_i α_i* A_i′, and A_i′ = x_i.
Learning in Feature Space (Could Simplify the Classification Task)
Learning in a high-dimensional space can degrade generalization performance — a phenomenon called the curse of dimensionality.
By using a kernel function, which represents the inner product of training examples in feature space, we never need to know the nonlinear map explicitly — we need not even know the dimensionality of the feature space.
There is no free lunch: we must deal with a huge, dense kernel matrix.
A reduced kernel can avoid this difficulty.
(Figure: the map φ sends training points from the input space X to the feature space F.)
Kernel Technique (Based on Mercer's Condition, 1909)
The value of the kernel function represents the inner product in feature space.
Kernel functions merge two steps:
1. map the input data from input space to feature space (which may be infinite-dimensional);
2. compute the inner product in the feature space.
Linear Machine in Feature Space
Let φ : X → F be a nonlinear map from the input space to some feature space.
The classifier will be of the form (primal):
f(x) = Σ_{j=1}^{?} w_j φ_j(x) + b,
where the upper limit (the dimension of F) may be infinite.
Make it into the dual form:
f(x) = Σ_{i=1}^{ℓ} α_i y_i ⟨φ(x_i) · φ(x)⟩ + b
Kernel: Representing the Inner Product in Feature Space
Definition: a kernel is a function K : X × X → R such that for all x, z ∈ X
K(x, z) = ⟨φ(x) · φ(z)⟩, where φ : X → F.
The classifier will become:
f(x) = Σ_{i=1}^{ℓ} α_i y_i K(x_i, x) + b
A Simple Example of a Kernel
Polynomial kernel of degree 2: K(x, z) = ⟨x, z⟩²
Let x = [x_1, x_2]′, z = [z_1, z_2]′ ∈ R², and define the nonlinear map φ : R² → R³ by
φ(x) = [x_1², x_2², √2·x_1x_2]′.
Then ⟨φ(x), φ(z)⟩ = ⟨x, z⟩² = K(x, z).
There are many other nonlinear maps ψ(x) that satisfy the relation ⟨ψ(x), ψ(z)⟩ = ⟨x, z⟩² = K(x, z).
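The identity on this slide is easy to verify numerically — a standalone check (not course code) with arbitrary test vectors:

```python
import numpy as np

def phi(x):
    # The explicit degree-2 feature map R^2 -> R^3 from the slide
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])
print(phi(x) @ phi(z), (x @ z) ** 2)  # both 121.0: <phi(x),phi(z)> = <x,z>^2
```

Evaluating ⟨x, z⟩² costs one 2-dimensional inner product, versus building the 3-dimensional feature vectors first — the gap widens dramatically for higher degrees.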
Power of the Kernel Technique
Consider a nonlinear map φ : R^n → R^p that consists of distinct features of all the monomials of degree d. Then p = C(n + d − 1, d).
For example: n = 10, d = 10 gives p = 92378.
Is the explicit map necessary? We only need to know ⟨φ(x), φ(z)⟩!
This can be achieved with K(x, z) = ⟨x, z⟩^d.
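The monomial count is a one-line check (requires Python 3.8+ for `math.comb`):

```python
from math import comb

# Number of distinct degree-d monomials in n variables: C(n + d - 1, d)
n, d = 10, 10
p = comb(n + d - 1, d)
print(p)  # 92378
```

The kernel ⟨x, z⟩^d needs only an n-dimensional inner product and one exponentiation, regardless of how large p grows.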
2-Norm Soft Margin: Dual Formulation
The Lagrangian for the 2-norm soft margin:
L(w, b, ξ, α) = (1/2)w′w + (C/2)ξ′ξ + α′[e − D(Aw + eb) − ξ], where α ≥ 0.
Setting the partial derivatives with respect to the primal variables to zero:
∂L/∂w = w − A′Dα = 0
∂L/∂b = e′Dα = 0
∂L/∂ξ = Cξ − α = 0
Dual Maximization Problem for the 2-Norm Soft Margin
Dual: max_{α ∈ R^ℓ} e′α − (1/2)α′D(AA′ + I/C)Dα  subject to e′Dα = 0, α ≥ 0
The corresponding KKT complementarity condition:
0 ≤ α ⊥ D(Aw + eb) + ξ − e ≥ 0
Use the conditions above to find b*.
Introduce a Kernel in the Dual Formulation for the 2-Norm Soft Margin
The feature space is implicitly defined by K(x, z). Suppose α* solves the QP problem:
max_{α ∈ R^ℓ} e′α − (1/2)α′D(K(A, A′) + I/C)Dα  subject to e′Dα = 0, α ≥ 0
Then the decision rule is defined by h(x) = sgn(K(x, A′)Dα* + b*).
Use the conditions above to find b*.
Introduce a Kernel in the Dual Formulation for the 2-Norm Soft Margin
b* is chosen so that y_i[K(A_i, A′)Dα* + b*] = 1 − α_i*/C for any i with α_i* ≠ 0.
Because: 0 ≤ α* ⊥ D(K(A, A′)Dα* + eb*) + ξ* − e ≥ 0 and α* = Cξ*.
Sequential Minimal Optimization (SMO)
Deals with the equality constraint and the box constraints of the dual problem.
Works on the smallest possible working set (only 2 α's).
Finds the optimal solution by changing only the α values in the working set.
The best feature of SMO: the solution for the working set can be found analytically.
Analytical Solution for Two Points
Suppose we change α_1 and α_2. To keep the equality constraint, we have to change the two α values such that
α_1 y_1 + α_2 y_2 = α_1^old y_1 + α_2^old y_2.
The new α values also have to satisfy the box constraints, which further restricts how α can change.
A Restrictive Constraint on the New α
Suppose we change α_1 and α_2. Once we have α_2^new, we can get α_1^new.
A restrictive constraint: U ≤ α_2^new ≤ V, where
if y_1 ≠ y_2: U = max(0, α_2^old − α_1^old), V = min(C, C − α_1^old + α_2^old);
if y_1 = y_2: U = max(0, α_1^old + α_2^old − C), V = min(C, α_1^old + α_2^old).
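The two cases above translate directly into a small helper (a sketch with hypothetical values, not a full SMO implementation):

```python
def smo_bounds(a1_old, a2_old, y1, y2, C):
    """Box bounds U <= alpha_2_new <= V for one SMO pair update."""
    if y1 != y2:
        U = max(0.0, a2_old - a1_old)
        V = min(C, C - a1_old + a2_old)
    else:
        U = max(0.0, a1_old + a2_old - C)
        V = min(C, a1_old + a2_old)
    return U, V

print(smo_bounds(0.25, 0.5, 1, -1, 1.0))   # (0.25, 1.0)
print(smo_bounds(0.25, 0.5, 1, 1, 1.0))    # (0.0, 0.75)
```

After the unconstrained two-variable optimum is computed, α_2^new is clipped into [U, V] and α_1^new follows from the equality constraint.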
ε-Support Vector Regression (Linear Case: f(x) = x′w + b)
Given the training set S = {(x_i, y_i) | x_i ∈ R^n, y_i ∈ R, i = 1, …, m},
represented by an m × n matrix A and a vector y ∈ R^m.
Motivated by SVM: ||w||_2 should be as small as possible, and some tiny errors should be discarded.
Try to find (w, b) such that y ≈ Aw + eb, that is, y_i ≈ w′x_i + b for i = 1, …, m, where e = [1, …, 1]′ ∈ R^m.
ε-Insensitive Loss Function (Tiny Errors Should Be Discarded)
The ε-insensitive loss: |y_i − f(x_i)|_ε = max{0, |y_i − f(x_i)| − ε}
The loss made by the estimation function f at the data point (x_i, y_i) is
|ξ|_ε = max{0, |ξ| − ε} = 0 if |ξ| ≤ ε, and |ξ| − ε otherwise.
If ξ ∈ R^n then |ξ|_ε ∈ R^n is defined componentwise: (|ξ|_ε)_i = |ξ_i|_ε, i = 1, …, n.
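The componentwise definition is one NumPy expression — a standalone sketch with hypothetical residuals:

```python
import numpy as np

def eps_insensitive(residual, eps):
    """Componentwise epsilon-insensitive loss: max(0, |r| - eps)."""
    return np.maximum(0.0, np.abs(residual) - eps)

r = np.array([-0.3, 0.05, 0.0, 0.8])
print(eps_insensitive(r, 0.1))   # residuals inside the eps-tube cost nothing
```

Residuals with |r| ≤ ε contribute zero loss, which is exactly how tiny errors get discarded.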
ε-Insensitive Linear Regression
(Figure: data points around the line f(x) = x′w + b inside an ε-tube; points above the tube incur error y_j − f(x_j) − ε, points below incur f(x_k) − y_k − ε.)
Find (w, b) with the smallest overall error.
Five Popular Loss Functions
(Figure: plots of five popular loss functions.)
ε-Insensitive Loss Regression
Linear ε-insensitive loss: L_ε(x, y, f) = |y − f(x)|_ε = max(0, |y − f(x)| − ε),
where x ∈ R^n, y ∈ R, and f is a real-valued function.
Quadratic ε-insensitive loss: L²_ε(x, y, f) = |y − f(x)|²_ε
ε-Insensitive Support Vector Regression Model
Motivated by SVM: ||w||_2 should be as small as possible, and some tiny errors should be discarded:
min_{(w,b,ξ) ∈ R^{n+1+m}} (1/2)||w||_2^2 + C e′|ξ|_ε,
where |ξ|_ε ∈ R^m and (|ξ|_ε)_i = max(0, |A_i w + b − y_i| − ε).
Why Minimize ||w||_2? A Probably Approximately Correct (pac) Argument
Consider performing linear regression for any training data distribution D with
max_{1 ≤ i ≤ m} ||(x_i, y_i)|| ≤ R, 0 < δ < 1, and c > 0. Then
Pr_D( err(f) > (c/m)((||w||_2^2 R^2 + SSE)/ε^2 · log² m + log(1/δ)) ) < δ,
i.e. Pr_D( err(f) ≤ (c/m)((||w||_2^2 R^2 + SSE)/ε^2 · log² m + log(1/δ)) ) ≥ 1 − δ.
Occam's razor: the simplest is the best.
Reformulated ε-SVR as a Constrained Minimization Problem
min_{(w,b,ξ,ξ*) ∈ R^{n+1+2m}} (1/2)w′w + C e′(ξ + ξ*)
subject to
y − Aw − eb ≤ eε + ξ
Aw + eb − y ≤ eε + ξ*
ξ, ξ* ≥ 0
This is a minimization problem with n + 1 + 2m variables and 2m constraints — enlarging the problem size and the computational complexity of solving it.
SV Regression by Minimizing the Quadratic ε-Insensitive Loss
We have the following problem:
min_{(w,b,ξ) ∈ R^{n+1+ℓ}} (1/2)||w||_2^2 + (C/2)||(|ξ|_ε)||_2^2,
where (|ξ|_ε)_i = |y_i − (w′x_i + b)|_ε.
Primal Formulation of SVR with the Quadratic ε-Insensitive Loss
min_{(w,b,ξ⁺,ξ⁻) ∈ R^{n+1+2ℓ}} (1/2)||w||_2^2 + (C/2)(||ξ⁺||_2^2 + ||ξ⁻||_2^2)
subject to
−Aw − eb + y ≤ eε + ξ⁺
Aw + eb − y ≤ eε + ξ⁻
ξ⁺, ξ⁻ ≥ 0
Extremely important: at the solution, 0 ≤ ξ⁻ ⊥ ξ⁺ ≥ 0.
Simplified Dual Formulation of SVR
max_α y′α − ε||α||_1 − (1/2)α′(AA′ + I/C)α  subject to e′α = 0
In the case ε = 0, the problem becomes least-squares linear regression with a weight-decay factor.
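The ε = 0 remark can be checked numerically. The sketch below (with hypothetical data) ignores the bias term b and the e′α = 0 constraint for simplicity: with a linear kernel, the unconstrained dual solution α = (AA′ + I/C)⁻¹y yields w = A′α, which coincides with ridge regression ("least squares with weight decay") by the push-through matrix identity:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))
y = A @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=20)

C = 10.0
alpha = np.linalg.solve(A @ A.T + np.eye(20) / C, y)          # dual route
w = A.T @ alpha
w_ridge = np.linalg.solve(A.T @ A + np.eye(3) / C, A.T @ y)   # primal ridge route
print(np.allclose(w, w_ridge))
```

The identity A′(AA′ + λI)⁻¹ = (A′A + λI)⁻¹A′ is what makes the two routes agree exactly.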
Kernel in the Dual Formulation for SVR
Suppose α* solves the QP problem:
max_{α ∈ R^ℓ} y′α − ε||α||_1 − (1/2)α′(K(A, A′) + I/C)α  subject to e′α = 0
Then the regression function is defined by f(x) = K(x, A′)α* + b*,
where b* is chosen such that f(x_i) − y_i = −ε − α_i*/C for any i with α_i* > 0.
Probably Approximately Correct (pac) Learning Model
Key assumption: training and testing data are generated i.i.d. according to a fixed but unknown distribution D.
When we evaluate the "quality" of a hypothesis (classification function) h ∈ H, we should take the unknown distribution D into account (i.e. the "average error" or "expected error" made by h ∈ H).
We call such a measure the risk functional and denote it
err_D(h) = D{(x, y) ∈ X × {1, −1} | h(x) ≠ y}.
Generalization Error of the pac Model
Let S = {(x_1, y_1), …, (x_ℓ, y_ℓ)} be a set of ℓ training examples chosen i.i.d. according to D.
Treat the generalization error err_D(h_S) as a random variable depending on the random selection of S.
Find a bound on the tail of the distribution of the r.v. err_D(h_S) in the form ε = ε(ℓ, H, δ),
where ε(ℓ, H, δ) is a function of ℓ, H and δ, and 1 − δ is the confidence level of the error bound, which is given by the learner.
Probably Approximately Correct
We assert: Pr({err_D(h_S) > ε = ε(ℓ, H, δ)}) < δ, or equivalently
Pr({err_D(h_S) ≤ ε = ε(ℓ, H, δ)}) ≥ 1 − δ.
The error made by the hypothesis h_S will then be less than the error bound ε(ℓ, H, δ), which does not depend on the unknown distribution D.
Find the Hypothesis with Minimum Expected Risk?
Let S = {(x_1, y_1), …, (x_ℓ, y_ℓ)} ⊆ X × {−1, 1} be the training examples chosen i.i.d. according to D with probability density p(x, y).
The expected misclassification error made by h ∈ H is
R[h] = ∫_{X × {−1,1}} (1/2)|h(x) − y| dp(x, y).
The ideal hypothesis h*_opt should have the smallest expected risk: R[h*_opt] ≤ R[h], ∀h ∈ H.
Unrealistic!
Empirical Risk Minimization (ERM)
(D and p(x, y) are not needed.)
Replace the expected risk over p(x, y) by an average over the training examples.
The empirical risk: R_emp[h] = (1/ℓ) Σ_{i=1}^{ℓ} (1/2)|h(x_i) − y_i|
Find the hypothesis h*_emp with the smallest empirical risk: R_emp[h*_emp] ≤ R_emp[h], ∀h ∈ H.
Focusing only on the empirical risk will cause overfitting.
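The empirical risk formula is a one-liner in NumPy — a minimal sketch with a hypothetical threshold classifier, not course code:

```python
import numpy as np

def empirical_risk(h, X, y):
    """R_emp[h] = (1/l) * sum_i (1/2)|h(x_i) - y_i|, labels in {-1, +1}."""
    return np.mean(0.5 * np.abs(h(X) - y))

# Hypothetical data and a deliberately poor threshold classifier
X = np.array([[1.0], [2.0], [-1.0], [-2.0]])
y = np.array([1, 1, -1, -1])
h = lambda X: np.sign(X[:, 0] - 1.5)
print(empirical_risk(h, X, y))  # 0.25 (one of four points misclassified)
```

With ±1 labels, (1/2)|h(x_i) − y_i| is 1 on a mistake and 0 otherwise, so the formula is just the training misclassification rate.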
Overfitting
Overfitting is the phenomenon in which the resulting function fits the training set too well but does not have good prediction performance on unseen data.
(Figure: solid curve: f(x) = 2x² − 5x + 5; red dots: points generated by f(x) with random noise; dotted curve: a nonlinear regression that passes through all 8 points.)
Tuning Procedure
(Figure: training vs. testing set correctness as the tuning parameter varies; pushing training correctness too far leads to overfitting.)
The final value of the parameter is the one with the maximum testing-set correctness!
VC Confidence (The Bound between R_emp[h] and R[h])
The following inequality holds with probability 1 − δ:
R[h] ≤ R_emp[h] + sqrt( (v(log(2ℓ/v) + 1) − log(δ/4)) / ℓ ),
where v is the VC-dimension of the hypothesis space.
C. J. C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery 2(2) (1998), pp. 121–167.
Capacity (Complexity) of a Hypothesis Space H: VC-Dimension
A given training set S is shattered by H if and only if for every labeling of S there exists h ∈ H consistent with this labeling.
Example: three points in general position (not collinear) are shattered by hyperplanes in R².
Shattering Points with Hyperplanes in R^n
Theorem: Consider some set of m points in R^n. Choose any one of the points as origin. Then the m points can be shattered by oriented hyperplanes if and only if the position vectors of the remaining points are linearly independent.
Can you always shatter three points with a line in R²?
Definition of VC-Dimension (A Capacity Measure of Hypothesis Space H)
The Vapnik–Chervonenkis dimension VC(H) of a hypothesis space H defined over the input space X is the size of the largest finite subset of X shattered by H.
If arbitrarily large finite subsets of X can be shattered by H, then VC(H) ≡ ∞.
Let H = {all hyperplanes in R^n}; then VC(H) = n + 1.