A Review of Our Course: Classification and Regression — The Perceptron Algorithm: Primal vs. Dual Form
Date posted: 22-Dec-2015
A Review of Our Course: Classification and Regression
The Perceptron Algorithm: Primal vs. Dual Form
An on-line, mistake-driven procedure:
Update the weight vector and bias when a point is misclassified.
It converges when the problem is linearly separable.
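The primal update rule described above can be sketched in a few lines of NumPy. This is a minimal illustration on hypothetical toy data, not code from the course:

```python
import numpy as np

def perceptron(X, y, lr=1.0, max_epochs=100):
    """Primal perceptron: update (w, b) on every misclassified point."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:   # misclassified (or on the boundary)
                w += lr * yi * xi        # mistake-driven update
                b += lr * yi
                mistakes += 1
        if mistakes == 0:                # converged: a full pass with no mistakes
            break
    return w, b

# Linearly separable toy data (hypothetical, for illustration only)
X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron(X, y)
print(np.sign(X @ w + b))  # [ 1.  1. -1. -1.]
```

On separable data the loop exits as soon as a full epoch makes no mistakes, which is exactly the convergence condition on the slide.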
Classification Problem: 2-Category, Linearly Separable Case
(Figure: two point sets A− (e.g. benign) and A+ (e.g. malignant) separated by the planes x′w + b = −1, x′w + b = 0, and x′w + b = +1.)
Algebra of the Classification Problem (Linearly Separable Case)
Given ℓ points in the n-dimensional real space R^n, represented by an ℓ × n matrix A.
Membership of each point A_i in the classes A−, A+ is specified by an ℓ × ℓ diagonal matrix D:
D_ii = −1 if A_i ∈ A− and D_ii = +1 if A_i ∈ A+.
Separate A− and A+ by two bounding planes such that:
A_i w + b ≥ +1 for D_ii = +1;  A_i w + b ≤ −1 for D_ii = −1.
More succinctly: D(Aw + eb) ≥ e, where e = [1, 1, …, 1]′ ∈ R^ℓ.
Robust Linear Programming (Preliminary Approach to SVM)
D(Aw + eb) + ξ ≥ e, ξ ≥ 0, where ξ is a nonnegative slack (error) vector.
The term e′ξ, the 1-norm measure of the error vector, is called the training error.
(LP)  min_{w,b,ξ} e′ξ  s.t. D(Aw + eb) + ξ ≥ e, ξ ≥ 0
For the linearly separable case, at the solution of (LP): ξ = 0.
Support Vector Machines: Maximizing the Margin between Bounding Planes
(Figure: the bounding planes x′w + b = +1 and x′w + b = −1 around A+ and A−, with normal vector w; the distance between them, 2/||w||_2, is the margin.)
Support Vector Classification (Linearly Separable Case, Primal)
The hyperplane (w, b) that solves the minimization problem
min_{(w,b) ∈ R^{n+1}} (1/2)||w||_2^2  s.t. D(Aw + eb) ≥ e
realizes the maximal-margin hyperplane with geometric margin γ = 1/||w||_2.
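As a numerical sketch (not from the slides), the hard-margin problem can be approximated with scikit-learn's linear SVC by making C very large, so slack is effectively forbidden; the data below are hypothetical:

```python
import numpy as np
from sklearn.svm import SVC

# Approximate min (1/2)||w||^2 s.t. D(Aw + eb) >= e with a huge C
X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
margin = 2.0 / np.linalg.norm(w)   # distance between the two bounding planes
print(clf.predict(X), margin)
```

The recovered `margin` is exactly the 2/||w||_2 quantity from the previous slide.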
Soft Margin SVM (Nonseparable Case)
If the data are not linearly separable, the primal problem is infeasible and the dual problem is unbounded above.
Introduce a slack variable for each training point:
y_i(w′x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0 for all i.
The inequality system is then always feasible, e.g. w = 0, b = 0, ξ = e.
(Figure: two classes (x's and o's) with margin γ; slack variables ξ_i and ξ_j mark points lying beyond their bounding planes.)
Two Different Measures of Training Error
2-Norm Soft Margin:
min_{(w,b,ξ) ∈ R^{n+1+ℓ}} (1/2)||w||_2^2 + (C/2)||ξ||_2^2  s.t. D(Aw + eb) + ξ ≥ e
1-Norm Soft Margin:
min_{(w,b,ξ) ∈ R^{n+1+ℓ}} (1/2)||w||_2^2 + C e′ξ  s.t. D(Aw + eb) + ξ ≥ e, ξ ≥ 0
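The two slack penalties correspond to the `loss` choices in scikit-learn's `LinearSVC` (`hinge` ≈ 1-norm slack e′ξ, `squared_hinge` ≈ 2-norm slack ξ′ξ). A sketch on hypothetical noisy data — note `LinearSVC` also regularizes b, so it only approximates the slide's formulations:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical nonseparable data: a linear rule plus label noise
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
y = np.where(X[:, 0] + X[:, 1] + 0.3 * rng.normal(size=60) > 0, 1, -1)

m1 = LinearSVC(loss="hinge", C=1.0, max_iter=100000).fit(X, y)          # ~1-norm slack
m2 = LinearSVC(loss="squared_hinge", C=1.0, max_iter=100000).fit(X, y)  # ~2-norm slack
print(m1.score(X, y), m2.score(X, y))
```

Both remain feasible on nonseparable data, which is the whole point of the soft margin.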
Optimization Problem Formulation
Problem setting: given functions f, g_i (i = 1, …, k) and h_j (j = 1, …, m) defined on a domain Ω ⊆ R^n:
min_{x ∈ Ω} f(x)  subject to g_i(x) ≤ 0 ∀i, h_j(x) = 0 ∀j
where f(x) is called the objective function and g(x) ≤ 0, h(x) = 0 are called the constraints.
Definitions and Notation
Feasible region: F = {x ∈ Ω | g(x) ≤ 0, h(x) = 0},
where g(x) = [g_1(x), …, g_k(x)]′ and h(x) = [h_1(x), …, h_m(x)]′.
A solution of the optimization problem is a point x* ∈ F such that there is no x ∈ F for which f(x) < f(x*); such an x* is called a global minimum.
Definitions and Notation
A point x̄ ∈ F is called a local minimum of the optimization problem if there exists ε > 0 such that f(x) ≥ f(x̄) for all x ∈ F with ||x − x̄|| < ε.
At the solution x*, an inequality constraint g_i(x) is said to be active if g_i(x*) = 0; otherwise it is called inactive.
g_i(x) ≤ 0 ⇔ g_i(x) + ξ_i = 0, ξ_i ≥ 0, where ξ_i is called the slack variable.
Definitions and Notation
Removing an inactive constraint from an optimization problem does NOT affect the optimal solution — a very useful feature in SVM.
If F = R^n, the problem is called an unconstrained minimization problem.
The SSVM formulation is in this category; without a convexity assumption it is difficult to find the global minimum.
The least-squares problem is also in this category.
Gradient and Hessian
Let f : R^n → R be a differentiable function. The gradient of f at a point x ∈ R^n is defined as
∇f(x) = [∂f(x)/∂x_1, ∂f(x)/∂x_2, …, ∂f(x)/∂x_n] ∈ R^n
If f : R^n → R is twice differentiable, the Hessian matrix of f at a point x ∈ R^n is defined as the n × n matrix
∇²f(x) = [∂²f(x)/∂x_i∂x_j] ∈ R^{n×n},
whose (i, j) entry is ∂²f(x)/∂x_i∂x_j, for i, j = 1, …, n.
The Most Important Concept in Optimization (Minimization)
A point is an optimal solution of an unconstrained minimization problem if there exists no descent direction.
A point is an optimal solution of a constrained minimization problem if there exists no feasible descent direction.
A descent direction may still exist, but moving along it would leave the feasible region.
Two Important Algorithms for the Unconstrained Minimization Problem
Steepest descent with exact line search
Newton's method
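Both algorithms can be sketched on a small convex quadratic, where the exact line-search step has a closed form and Newton's method converges in one step. A standalone illustration (test function and all names are hypothetical):

```python
import numpy as np

# Minimize f(x) = x1^2 + 10*x2^2, whose unique minimizer is the origin.
H = np.array([[2.0, 0.0], [0.0, 20.0]])   # constant Hessian of f

def grad(x):
    return H @ x                          # gradient of the quadratic

# Steepest descent with exact line search (closed-form step for quadratics)
x_sd = np.array([1.0, 1.0])
for _ in range(100):
    g = grad(x_sd)
    t = (g @ g) / (g @ H @ g)             # exact minimizing step length
    x_sd = x_sd - t * g

# Newton's method: one step solves a quadratic exactly
x_nt = np.array([1.0, 1.0])
x_nt = x_nt - np.linalg.solve(H, grad(x_nt))

print(np.allclose(x_sd, 0, atol=1e-6), np.allclose(x_nt, 0))
```

The contrast is the usual one: steepest descent zig-zags at a rate governed by the Hessian's condition number, while Newton's method uses curvature and lands on the minimizer immediately here.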
Linear Program and Quadratic Program
An optimization problem in which the objective function and all constraints are linear is called a linear programming (LP) problem.
If the objective function is convex quadratic while the constraints are all linear, the problem is called a convex quadratic programming (QP) problem.
The standard SVM formulation is in the QP category; the ||·||_1 SVM formulation is an LP.
Lagrangian Dual Problem
max_{α,β} min_{x ∈ Ω} L(x, α, β)  subject to α ≥ 0
Equivalently:
max_{α,β} θ(α, β)  subject to α ≥ 0, where θ(α, β) = inf_{x ∈ Ω} L(x, α, β)
Weak Duality Theorem
Let x̄ ∈ Ω be a feasible solution of the primal problem and (α, β) a feasible solution of the dual problem. Then f(x̄) ≥ θ(α, β).
Corollary: sup{θ(α, β) | α ≥ 0} ≤ inf{f(x) | g(x) ≤ 0, h(x) = 0}
Proof sketch: θ(α, β) = inf_{x ∈ Ω} L(x, α, β) ≤ L(x̄, α, β) ≤ f(x̄) for any feasible x̄.
Saddle Point of the Lagrangian
Let x* ∈ Ω, α* ≥ 0, β* ∈ R^m satisfy
L(x*, α, β) ≤ L(x*, α*, β*) ≤ L(x, α*, β*)  for all x ∈ Ω, α ≥ 0.
Then (x*, α*, β*) is called a saddle point of the Lagrangian function.
Dual Problem of a Linear Program
Primal LP: min_{x ∈ R^n} p′x  subject to Ax ≥ b, x ≥ 0
Dual LP: max_{α ∈ R^m} b′α  subject to A′α ≤ p, α ≥ 0
※ All duality theorems hold and work perfectly!
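The absence of a duality gap for LPs can be checked numerically on a small hypothetical example (not from the slides), using `scipy.optimize.linprog` as the solver:

```python
import numpy as np
from scipy.optimize import linprog

p = np.array([2.0, 3.0])
A = np.array([[1.0, 1.0], [1.0, 2.0]])
b = np.array([2.0, 3.0])

# Primal: min p'x  s.t.  Ax >= b, x >= 0   (rewritten as -Ax <= -b)
primal = linprog(p, A_ub=-A, b_ub=-b, bounds=[(0, None)] * 2)
# Dual:   max b'a  s.t.  A'a <= p, a >= 0  (negated to fit linprog's min)
dual = linprog(-b, A_ub=A.T, b_ub=p, bounds=[(0, None)] * 2)

print(primal.fun, -dual.fun)  # equal optimal values: no duality gap
```

For this instance both problems attain the value 5 at x = (1, 1) and α = (1, 1).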
Dual Problem of a Strictly Convex Quadratic Program
Primal QP: min_{x ∈ R^n} (1/2)x′Qx + p′x  subject to Ax ≤ b
With the strict convexity assumption, we have
Dual QP: max −(1/2)(p′ + α′A)Q⁻¹(A′α + p) − α′b  subject to α ≥ 0
Support Vector Classification (Linearly Separable Case, Dual Form)
The dual problem of the previous mathematical program:
max_{α ∈ R^ℓ} e′α − (1/2)α′DAA′Dα  subject to e′Dα = 0, α ≥ 0
Applying the KKT optimality conditions, we have w = A′Dα. But where is b?
Don't forget the complementarity condition: 0 ≤ α ⊥ D(Aw + eb) − e ≥ 0
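Recovering w and b from a dual solution can be sketched with scikit-learn's `SVC` standing in as the QP solver (an assumption, not the slides' solver; data are hypothetical). For a support vector i, complementarity forces y_i(x_i′w + b) = 1, which pins down b:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)   # near hard-margin

# w = A'D alpha: sklearn stores y_i * alpha_i in dual_coef_
w = clf.dual_coef_[0] @ clf.support_vectors_
# Complementarity at a support vector i: y_i (x_i . w + b) = 1  =>  b = y_i - x_i . w
i = clf.support_[0]
b = y[i] - X[i] @ w
print(np.allclose(w, clf.coef_[0]), abs(b - clf.intercept_[0]) < 1e-3)
```

Only points with α_i > 0 (the support vectors) contribute to w, which is the slide's point.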
Dual Representation of SVM (Key of Kernel Methods)
The hypothesis is determined by (α*, b*):
h(x) = sgn(⟨x, A′Dα*⟩ + b*)
     = sgn(Σ_{i=1}^{ℓ} y_i α_i* ⟨x_i, x⟩ + b*)
     = sgn(Σ_{α_i* > 0} y_i α_i* ⟨x_i, x⟩ + b*)
Remember: w = A′Dα* = Σ_{i=1}^{ℓ} y_i α_i* A_i′, and A_i′ = x_i.
Learning in Feature Space (Could Simplify the Classification Task)
Learning in a high-dimensional space can degrade generalization performance — a phenomenon called the curse of dimensionality.
By using a kernel function, which represents the inner product of training examples in feature space, we never need to know the nonlinear map explicitly — we need not even know the dimensionality of the feature space.
There is no free lunch: we must deal with a huge, dense kernel matrix.
A reduced kernel can avoid this difficulty.
(Figure: the map φ sends training points from the input space X to the feature space F.)
Kernel Technique (Based on Mercer's Condition, 1909)
The value of the kernel function represents the inner product in feature space.
Kernel functions merge two steps:
1. map the input data from input space to feature space (which may be infinite-dimensional);
2. compute the inner product in the feature space.
Linear Machine in Feature Space
Let φ : X → F be a nonlinear map from the input space to some feature space.
The classifier will be of the form (primal):
f(x) = Σ_{j=1}^{?} w_j φ_j(x) + b,
where the upper limit (the dimension of F) may be infinite.
Make it into the dual form:
f(x) = Σ_{i=1}^{ℓ} α_i y_i ⟨φ(x_i) · φ(x)⟩ + b
Kernel: Representing the Inner Product in Feature Space
Definition: a kernel is a function K : X × X → R such that for all x, z ∈ X
K(x, z) = ⟨φ(x) · φ(z)⟩, where φ : X → F.
The classifier will become:
f(x) = Σ_{i=1}^{ℓ} α_i y_i K(x_i, x) + b
A Simple Example of a Kernel
Polynomial kernel of degree 2: K(x, z) = ⟨x, z⟩²
Let x = [x_1, x_2]′, z = [z_1, z_2]′ ∈ R², and define the nonlinear map φ : R² → R³ by
φ(x) = [x_1², x_2², √2·x_1x_2]′.
Then ⟨φ(x), φ(z)⟩ = ⟨x, z⟩² = K(x, z).
There are many other nonlinear maps ψ(x) that satisfy the relation ⟨ψ(x), ψ(z)⟩ = ⟨x, z⟩² = K(x, z).
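The identity on this slide is easy to verify numerically — a standalone check (not course code) with arbitrary test vectors:

```python
import numpy as np

def phi(x):
    # The explicit degree-2 feature map R^2 -> R^3 from the slide
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])
print(phi(x) @ phi(z), (x @ z) ** 2)  # both 121.0: <phi(x),phi(z)> = <x,z>^2
```

Evaluating ⟨x, z⟩² costs one 2-dimensional inner product, versus building the 3-dimensional feature vectors first — the gap widens dramatically for higher degrees.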
Power of the Kernel Technique
Consider a nonlinear map φ : R^n → R^p that consists of distinct features of all the monomials of degree d. Then p = C(n + d − 1, d).
For example: n = 10, d = 10 gives p = 92378.
Is the explicit map necessary? We only need to know ⟨φ(x), φ(z)⟩!
This can be achieved with K(x, z) = ⟨x, z⟩^d.
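The monomial count is a one-line check (requires Python 3.8+ for `math.comb`):

```python
from math import comb

# Number of distinct degree-d monomials in n variables: C(n + d - 1, d)
n, d = 10, 10
p = comb(n + d - 1, d)
print(p)  # 92378
```

The kernel ⟨x, z⟩^d needs only an n-dimensional inner product and one exponentiation, regardless of how large p grows.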
2-Norm Soft Margin: Dual Formulation
The Lagrangian for the 2-norm soft margin:
L(w, b, ξ, α) = (1/2)w′w + (C/2)ξ′ξ + α′[e − D(Aw + eb) − ξ], where α ≥ 0.
Setting the partial derivatives with respect to the primal variables to zero:
∂L/∂w = w − A′Dα = 0
∂L/∂b = e′Dα = 0
∂L/∂ξ = Cξ − α = 0
Dual Maximization Problem for the 2-Norm Soft Margin
Dual: max_{α ∈ R^ℓ} e′α − (1/2)α′D(AA′ + I/C)Dα  subject to e′Dα = 0, α ≥ 0
The corresponding KKT complementarity condition:
0 ≤ α ⊥ D(Aw + eb) + ξ − e ≥ 0
Use the conditions above to find b*.
Introduce a Kernel in the Dual Formulation for the 2-Norm Soft Margin
The feature space is implicitly defined by K(x, z). Suppose α* solves the QP problem:
max_{α ∈ R^ℓ} e′α − (1/2)α′D(K(A, A′) + I/C)Dα  subject to e′Dα = 0, α ≥ 0
Then the decision rule is defined by h(x) = sgn(K(x, A′)Dα* + b*).
Use the conditions above to find b*.
Introduce a Kernel in the Dual Formulation for the 2-Norm Soft Margin
b* is chosen so that y_i[K(A_i, A′)Dα* + b*] = 1 − α_i*/C for any i with α_i* ≠ 0.
Because: 0 ≤ α* ⊥ D(K(A, A′)Dα* + eb*) + ξ* − e ≥ 0 and α* = Cξ*.
Sequential Minimal Optimization (SMO)
Deals with the equality constraint and the box constraints of the dual problem.
Works on the smallest possible working set (only 2 α's).
Finds the optimal solution by changing only the α values in the working set.
The best feature of SMO: the solution for the working set can be found analytically.
Analytical Solution for Two Points
Suppose we change α_1 and α_2. To keep the equality constraint, we have to change the two α values such that
α_1 y_1 + α_2 y_2 = α_1^old y_1 + α_2^old y_2.
The new α values also have to satisfy the box constraints, which further restricts how α can change.
A Restrictive Constraint on the New α
Suppose we change α_1 and α_2. Once we have α_2^new, we can get α_1^new.
A restrictive constraint: U ≤ α_2^new ≤ V, where
if y_1 ≠ y_2: U = max(0, α_2^old − α_1^old), V = min(C, C − α_1^old + α_2^old);
if y_1 = y_2: U = max(0, α_1^old + α_2^old − C), V = min(C, α_1^old + α_2^old).
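The two cases above translate directly into a small helper (a sketch with hypothetical values, not a full SMO implementation):

```python
def smo_bounds(a1_old, a2_old, y1, y2, C):
    """Box bounds U <= alpha_2_new <= V for one SMO pair update."""
    if y1 != y2:
        U = max(0.0, a2_old - a1_old)
        V = min(C, C - a1_old + a2_old)
    else:
        U = max(0.0, a1_old + a2_old - C)
        V = min(C, a1_old + a2_old)
    return U, V

print(smo_bounds(0.25, 0.5, 1, -1, 1.0))   # (0.25, 1.0)
print(smo_bounds(0.25, 0.5, 1, 1, 1.0))    # (0.0, 0.75)
```

After the unconstrained two-variable optimum is computed, α_2^new is clipped into [U, V] and α_1^new follows from the equality constraint.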
ε-Support Vector Regression (Linear Case: f(x) = x′w + b)
Given the training set S = {(x_i, y_i) | x_i ∈ R^n, y_i ∈ R, i = 1, …, m},
represented by an m × n matrix A and a vector y ∈ R^m.
Motivated by SVM: ||w||_2 should be as small as possible, and some tiny errors should be discarded.
Try to find (w, b) such that y ≈ Aw + eb, that is, y_i ≈ w′x_i + b for i = 1, …, m, where e = [1, …, 1]′ ∈ R^m.
ε-Insensitive Loss Function (Tiny Errors Should Be Discarded)
The ε-insensitive loss: |y_i − f(x_i)|_ε = max{0, |y_i − f(x_i)| − ε}
The loss made by the estimation function f at the data point (x_i, y_i) is
|ξ|_ε = max{0, |ξ| − ε} = 0 if |ξ| ≤ ε, and |ξ| − ε otherwise.
If ξ ∈ R^n then |ξ|_ε ∈ R^n is defined componentwise: (|ξ|_ε)_i = |ξ_i|_ε, i = 1, …, n.
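The componentwise definition is one NumPy expression — a standalone sketch with hypothetical residuals:

```python
import numpy as np

def eps_insensitive(residual, eps):
    """Componentwise epsilon-insensitive loss: max(0, |r| - eps)."""
    return np.maximum(0.0, np.abs(residual) - eps)

r = np.array([-0.3, 0.05, 0.0, 0.8])
print(eps_insensitive(r, 0.1))   # residuals inside the eps-tube cost nothing
```

Residuals with |r| ≤ ε contribute zero loss, which is exactly how tiny errors get discarded.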
ε-Insensitive Linear Regression
(Figure: data points around the line f(x) = x′w + b inside an ε-tube; points above the tube incur error y_j − f(x_j) − ε, points below incur f(x_k) − y_k − ε.)
Find (w, b) with the smallest overall error.
Five Popular Loss Functions
(Figure: plots of five popular loss functions.)
ε-Insensitive Loss Regression
Linear ε-insensitive loss: L_ε(x, y, f) = |y − f(x)|_ε = max(0, |y − f(x)| − ε),
where x ∈ R^n, y ∈ R, and f is a real-valued function.
Quadratic ε-insensitive loss: L²_ε(x, y, f) = |y − f(x)|²_ε
ε-Insensitive Support Vector Regression Model
Motivated by SVM: ||w||_2 should be as small as possible, and some tiny errors should be discarded:
min_{(w,b,ξ) ∈ R^{n+1+m}} (1/2)||w||_2^2 + C e′|ξ|_ε,
where |ξ|_ε ∈ R^m and (|ξ|_ε)_i = max(0, |A_i w + b − y_i| − ε).
Why Minimize ||w||_2? A Probably Approximately Correct (pac) Argument
Consider performing linear regression for any training data distribution D with
max_{1 ≤ i ≤ m} ||(x_i, y_i)|| ≤ R, 0 < δ < 1, and c > 0. Then
Pr_D( err(f) > (c/m)((||w||_2^2 R^2 + SSE)/ε^2 · log² m + log(1/δ)) ) < δ,
i.e. Pr_D( err(f) ≤ (c/m)((||w||_2^2 R^2 + SSE)/ε^2 · log² m + log(1/δ)) ) ≥ 1 − δ.
Occam's razor: the simplest is the best.
Reformulated ε-SVR as a Constrained Minimization Problem
min_{(w,b,ξ,ξ*) ∈ R^{n+1+2m}} (1/2)w′w + C e′(ξ + ξ*)
subject to
y − Aw − eb ≤ eε + ξ
Aw + eb − y ≤ eε + ξ*
ξ, ξ* ≥ 0
This is a minimization problem with n + 1 + 2m variables and 2m constraints — enlarging the problem size and the computational complexity of solving it.
SV Regression by Minimizing the Quadratic ε-Insensitive Loss
We have the following problem:
min_{(w,b,ξ) ∈ R^{n+1+ℓ}} (1/2)||w||_2^2 + (C/2)||(|ξ|_ε)||_2^2,
where (|ξ|_ε)_i = |y_i − (w′x_i + b)|_ε.
Primal Formulation of SVR with the Quadratic ε-Insensitive Loss
min_{(w,b,ξ⁺,ξ⁻) ∈ R^{n+1+2ℓ}} (1/2)||w||_2^2 + (C/2)(||ξ⁺||_2^2 + ||ξ⁻||_2^2)
subject to
−Aw − eb + y ≤ eε + ξ⁺
Aw + eb − y ≤ eε + ξ⁻
ξ⁺, ξ⁻ ≥ 0
Extremely important: at the solution, 0 ≤ ξ⁻ ⊥ ξ⁺ ≥ 0.
Simplified Dual Formulation of SVR
max_α y′α − ε||α||_1 − (1/2)α′(AA′ + I/C)α  subject to e′α = 0
In the case ε = 0, the problem becomes least-squares linear regression with a weight-decay factor.
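The ε = 0 remark can be checked numerically. The sketch below (with hypothetical data) ignores the bias term b and the e′α = 0 constraint for simplicity: with a linear kernel, the unconstrained dual solution α = (AA′ + I/C)⁻¹y yields w = A′α, which coincides with ridge regression ("least squares with weight decay") by the push-through matrix identity:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))
y = A @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=20)

C = 10.0
alpha = np.linalg.solve(A @ A.T + np.eye(20) / C, y)          # dual route
w = A.T @ alpha
w_ridge = np.linalg.solve(A.T @ A + np.eye(3) / C, A.T @ y)   # primal ridge route
print(np.allclose(w, w_ridge))
```

The identity A′(AA′ + λI)⁻¹ = (A′A + λI)⁻¹A′ is what makes the two routes agree exactly.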
Kernel in the Dual Formulation for SVR
Suppose α* solves the QP problem:
max_{α ∈ R^ℓ} y′α − ε||α||_1 − (1/2)α′(K(A, A′) + I/C)α  subject to e′α = 0
Then the regression function is defined by f(x) = K(x, A′)α* + b*,
where b* is chosen such that f(x_i) − y_i = −ε − α_i*/C for any i with α_i* > 0.
Probably Approximately Correct (pac) Learning Model
Key assumption: training and testing data are generated i.i.d. according to a fixed but unknown distribution D.
When we evaluate the "quality" of a hypothesis (classification function) h ∈ H, we should take the unknown distribution D into account (i.e. the "average error" or "expected error" made by h ∈ H).
We call such a measure the risk functional and denote it
err_D(h) = D{(x, y) ∈ X × {1, −1} | h(x) ≠ y}.
Generalization Error of the pac Model
Let S = {(x_1, y_1), …, (x_ℓ, y_ℓ)} be a set of ℓ training examples chosen i.i.d. according to D.
Treat the generalization error err_D(h_S) as a random variable depending on the random selection of S.
Find a bound on the tail of the distribution of the r.v. err_D(h_S) in the form ε = ε(ℓ, H, δ),
where ε(ℓ, H, δ) is a function of ℓ, H and δ, and 1 − δ is the confidence level of the error bound, which is given by the learner.
Probably Approximately Correct
We assert: Pr({err_D(h_S) > ε = ε(ℓ, H, δ)}) < δ, or equivalently
Pr({err_D(h_S) ≤ ε = ε(ℓ, H, δ)}) ≥ 1 − δ.
The error made by the hypothesis h_S will then be less than the error bound ε(ℓ, H, δ), which does not depend on the unknown distribution D.
Find the Hypothesis with Minimum Expected Risk?
Let S = {(x_1, y_1), …, (x_ℓ, y_ℓ)} ⊆ X × {−1, 1} be the training examples chosen i.i.d. according to D with probability density p(x, y).
The expected misclassification error made by h ∈ H is
R[h] = ∫_{X × {−1,1}} (1/2)|h(x) − y| dp(x, y).
The ideal hypothesis h*_opt should have the smallest expected risk: R[h*_opt] ≤ R[h], ∀h ∈ H.
Unrealistic!
Empirical Risk Minimization (ERM)
(D and p(x, y) are not needed.)
Replace the expected risk over p(x, y) by an average over the training examples.
The empirical risk: R_emp[h] = (1/ℓ) Σ_{i=1}^{ℓ} (1/2)|h(x_i) − y_i|
Find the hypothesis h*_emp with the smallest empirical risk: R_emp[h*_emp] ≤ R_emp[h], ∀h ∈ H.
Focusing only on the empirical risk will cause overfitting.
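The empirical risk formula is a one-liner in NumPy — a minimal sketch with a hypothetical threshold classifier, not course code:

```python
import numpy as np

def empirical_risk(h, X, y):
    """R_emp[h] = (1/l) * sum_i (1/2)|h(x_i) - y_i|, labels in {-1, +1}."""
    return np.mean(0.5 * np.abs(h(X) - y))

# Hypothetical data and a deliberately poor threshold classifier
X = np.array([[1.0], [2.0], [-1.0], [-2.0]])
y = np.array([1, 1, -1, -1])
h = lambda X: np.sign(X[:, 0] - 1.5)
print(empirical_risk(h, X, y))  # 0.25 (one of four points misclassified)
```

With ±1 labels, (1/2)|h(x_i) − y_i| is 1 on a mistake and 0 otherwise, so the formula is just the training misclassification rate.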
Overfitting
Overfitting is the phenomenon in which the resulting function fits the training set too well but does not have good prediction performance on unseen data.
(Figure: solid curve: f(x) = 2x² − 5x + 5; red dots: points generated by f(x) with random noise; dotted curve: a nonlinear regression that passes through all 8 points.)
Tuning Procedure
(Figure: training vs. testing set correctness as the tuning parameter varies; pushing training correctness too far leads to overfitting.)
The final value of the parameter is the one with the maximum testing-set correctness!
VC Confidence (The Bound between R_emp[h] and R[h])
The following inequality holds with probability 1 − δ:
R[h] ≤ R_emp[h] + sqrt( (v(log(2ℓ/v) + 1) − log(δ/4)) / ℓ ),
where v is the VC-dimension of the hypothesis space.
C. J. C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery 2(2) (1998), pp. 121–167.
Capacity (Complexity) of a Hypothesis Space H: VC-Dimension
A given training set S is shattered by H if and only if for every labeling of S there exists h ∈ H consistent with this labeling.
Example: three points in general position (not collinear) are shattered by hyperplanes in R².
Shattering Points with Hyperplanes in R^n
Theorem: Consider some set of m points in R^n. Choose any one of the points as origin. Then the m points can be shattered by oriented hyperplanes if and only if the position vectors of the remaining points are linearly independent.
Can you always shatter three points with a line in R²?
Definition of VC-Dimension (A Capacity Measure of Hypothesis Space H)
The Vapnik–Chervonenkis dimension VC(H) of a hypothesis space H defined over the input space X is the size of the largest finite subset of X shattered by H.
If arbitrarily large finite subsets of X can be shattered by H, then VC(H) ≡ ∞.
Let H = {all hyperplanes in R^n}; then VC(H) = n + 1.