1
New Horizon in Machine Learning —
Support Vector Machine for non-Parametric Learning
Zhao Lu, Ph.D.
Associate Professor
Department of Electrical Engineering, Tuskegee University
2
Introduction
As an innovative non-parametric learning strategy, the Support Vector Machine (SVM) gained increasing popularity in the late 1990s. It is currently among the best performers for a variety of tasks, such as pattern recognition, regression, and signal processing.
Support vector learning algorithms:
Support vector classification for nonlinear pattern recognition;
Support vector regression for highly nonlinear function approximation;
3
Part I. Support Vector Learning for Classification
4
Overfitting in linearly separable classification
5
What is a good Decision Boundary?
Consider a two-class, linearly separable classification problem. Construct the hyperplane $w^T x + b = 0$, with $w, x \in \mathbb{R}^n$, so that the decision function $f(x) = \mathrm{sign}(w^T x + b)$ satisfies

$w^T x_i + b > 0 \quad \text{for } y_i = 1$
$w^T x_i + b < 0 \quad \text{for } y_i = -1$

Many decision boundaries can separate Class 1 from Class 2. Are all decision boundaries equally good?

[Figure: Class 1 and Class 2 point clouds separated by several candidate decision boundaries.]
6
Examples of Bad Decision Boundaries
[Figure: two panels, each showing Class 1 and Class 2 with a decision boundary $f(x) = \mathrm{sign}(w^T x + b)$ that passes very close to the training points of one class.]

For linearly separable classes, new data from a class are expected to fall close to the training data of that class; a decision boundary that passes too close to the training points therefore generalizes poorly.
7
Optimal separating hyperplane
The optimal separating hyperplane (OSH) is defined as

$\max_{w,b}\; \min\{\|x - x_i\| : x \in \mathbb{R}^n,\ w^T x + b = 0,\ i = 1, 2, \dots, n\}$

It can be proved that the OSH is unique and located halfway between the margin hyperplanes.

[Figure: Class 1 and Class 2 separated by the OSH $w^T x + b = 0$, with margin hyperplanes $w^T x + b = 1$ and $w^T x + b = -1$ and margin $m$ between them.]
8
Canonical separating hyperplane
A hyperplane $w^T x + b = 0$ is in canonical form with respect to all training data $x_i \in X$ if:

$\min_{x_i \in X} |w^T x_i + b| = 1$

Margin hyperplanes:

$H_1:\ w^T x + b = 1$
$H_2:\ w^T x + b = -1$

with margin $m = \min\{\|x - x_i\| : x \in \mathbb{R}^n,\ w^T x + b = 0,\ i = 1, 2, \dots, n\}$. A canonical hyperplane having a maximal margin is the ultimate learning goal, i.e. the optimal separating hyperplane.
9
Margin in terms of the norm of $w$

According to conclusions from statistical learning theory, a large-margin decision boundary has excellent generalization capability.

For the canonical hyperplane, it can be proved that the margin is

$m = \frac{1}{\|w\|}$

Hence, maximizing the margin is equivalent to minimizing the square of the norm of $w$, i.e. $\|w\|^2$.
10
Finding the optimal decision boundary
Let $\{x_1, \dots, x_n\}$ be our data set and let $y_i \in \{1, -1\}$ be the class label of $x_i$. The optimal decision boundary should classify all points correctly, so the decision boundary can be found by solving the following constrained optimization problem:

$\text{minimize} \quad \tfrac{1}{2}\|w\|^2$
$\text{subject to} \quad y_i(w^T x_i + b) \ge 1, \ \forall i$

This is a quadratic optimization problem with linear inequality constraints.
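As an illustrative sketch (not part of the original slides), the hard-margin problem can be approximated with scikit-learn's SVC by using a linear kernel and a very large C, which makes the slack penalty effectively infinite; the toy dataset below is invented for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (hypothetical, for illustration only)
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.0],
              [-1.0, -1.0], [-2.0, -1.5], [0.0, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin classifier
clf = SVC(kernel="linear", C=1e10).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, " b =", b)
# Every point should satisfy the canonical constraint y_i (w^T x_i + b) >= 1
print("margins:", y * (X @ w + b))
```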
11
Generalized Lagrangian Function

Consider the general (primal) optimization problem

$\text{minimize} \quad f(w)$
$\text{subject to} \quad g_i(w) \le 0, \ i = 1, \dots, k$
$\qquad\qquad\quad\ h_j(w) = 0, \ j = 1, \dots, m$

where the functions $f$, $g_i\ (i = 1, \dots, k)$ and $h_j\ (j = 1, \dots, m)$ are defined on a domain $\Omega \subseteq \mathbb{R}^n$. The generalized Lagrangian is defined as

$L(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{j=1}^{m} \beta_j h_j(w) = f(w) + \alpha^T g(w) + \beta^T h(w)$
12
Dual Problem and Strong Duality Theorem
Given the primal optimization problem, its dual problem is defined as

$\text{maximize} \quad \theta(\alpha, \beta) = \inf_{w} L(w, \alpha, \beta)$
$\text{subject to} \quad \alpha \ge 0$

Strong Duality Theorem: given the primal optimization problem, where the domain is convex and the constraints $g_i$ and $h_i$ are affine functions, the optimum of the primal problem occurs at the same values as the optimum of the dual problem.
13
Karush-Kuhn-Tucker Conditions
Given the primal optimization problem with the objective function $f \in C^1$ convex and $g_i$, $h_i$ affine, necessary and sufficient conditions for $w^*$ to be an optimum are the existence of $\alpha^*$, $\beta^*$ such that

$\dfrac{\partial L(w^*, \alpha^*, \beta^*)}{\partial w} = 0, \qquad \dfrac{\partial L(w^*, \alpha^*, \beta^*)}{\partial \beta} = 0$

$\alpha_i^*\, g_i(w^*) = 0, \quad i = 1, \dots, k$ (KKT complementarity condition)
$g_i(w^*) \le 0, \quad i = 1, \dots, k$
$\alpha_i^* \ge 0, \quad i = 1, \dots, k$
14
Lagrangian of the optimization problem
The primal problem is

$\text{minimize} \quad \tfrac{1}{2}\|w\|^2$
$\text{subject to} \quad 1 - y_i(w^T x_i + b) \le 0$

The Lagrangian is

$L = \tfrac{1}{2} w^T w + \sum_{i=1}^{n} \alpha_i \big(1 - y_i(w^T x_i + b)\big)$

Setting the gradient of $L$ w.r.t. $w$ and $b$ to zero, we have

$w = \sum_{i=1}^{n} \alpha_i y_i x_i \qquad$ (parametric → nonparametric)
$\sum_{i=1}^{n} \alpha_i y_i = 0$
15
The Dual Problem
If we substitute $w = \sum_{i=1}^{n} \alpha_i y_i x_i$ into the Lagrangian $L$, we have

$L = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, x_i^T x_j$

Note that $\sum_{i=1}^{n} \alpha_i y_i = 0$, and the data points appear only in terms of their inner products; this is a quadratic function of the $\alpha_i$ only.
16
The Dual Problem
The new objective function is in terms of the $\alpha_i$ only. The original problem is known as the primal problem, and the objective function of the dual problem needs to be maximized. The dual problem is therefore:

$\text{maximize} \quad W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j\, x_i^T x_j$
$\text{subject to} \quad \alpha_i \ge 0, \quad \sum_{i=1}^{n} \alpha_i y_i = 0$

The constraint $\alpha_i \ge 0$ is a property of the introduced Lagrange multipliers; the constraint $\sum_{i=1}^{n} \alpha_i y_i = 0$ is the result of differentiating the original Lagrangian w.r.t. $b$.
17
The Dual Problem
This is a quadratic programming (QP) problem, and therefore a global optimum of the $\alpha_i$ can always be found:

$\text{minimize} \quad W(\alpha) = \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j\, x_i^T x_j - \sum_{i=1}^{n} \alpha_i$
$\text{subject to} \quad \alpha_i \ge 0, \quad \sum_{i=1}^{n} \alpha_i y_i = 0$

$w$ can be recovered by $w = \sum_{i=1}^{n} \alpha_i y_i x_i$, so the decision function can be written in the following non-parametric form:

$f(x) = \mathrm{sign}\Big(\sum_{i=1}^{n} \alpha_i y_i\, x_i^T x + b\Big)$
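As a hedged sketch (not from the slides), the dual QP can be solved directly with the cvxopt package; the tiny dataset and the small ridge added to P for numerical stability are my assumptions.

```python
import numpy as np
from cvxopt import matrix, solvers

# Toy linearly separable data (hypothetical)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# Dual in QP form: minimize (1/2) a^T P a - 1^T a
#   subject to y^T a = 0 and a_i >= 0
P = matrix(np.outer(y, y) * (X @ X.T) + 1e-8 * np.eye(n))  # ridge for stability
q = matrix(-np.ones(n))
G = matrix(-np.eye(n))          # encodes -a_i <= 0
h = matrix(np.zeros(n))
A = matrix(y.reshape(1, -1))
b_eq = matrix(0.0)

solvers.options["show_progress"] = False
alpha = np.ravel(solvers.qp(P, q, G, h, A, b_eq)["x"])

w = (alpha * y) @ X             # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6               # support vectors: alpha_i > 0
b = np.mean(y[sv] - X[sv] @ w)  # KKT: b = y_k - w^T x_k for any support vector
print("alpha =", np.round(alpha, 3), " w =", w, " b =", b)
```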
18
Conception of Support Vectors (SVs)
According to the Karush-Kuhn-Tucker (KKT) complementarity condition, the solution must satisfy

$\alpha_i \big(y_i(w^T x_i + b) - 1\big) = 0$

Thus, $\alpha_i \neq 0$ only for those points $x_i$ that are closest to the classifying hyperplane, i.e. those with $y_i(w^T x_i + b) = 1$. These points are called support vectors.

From the KKT complementarity condition, the bias term $b$ can be calculated by using the support vectors:

$b = y_k - \sum_{i=1}^{n} \alpha_i y_i\, x_i^T x_k \quad \text{for any } \alpha_k \neq 0$
19
Sparseness of the solution

[Figure: Class 1 and Class 2 with the hyperplanes $w^T x + b = 0$ and $w^T x + b = \pm 1$; only the support vectors carry nonzero multipliers, e.g. $\alpha_1 = 0.8$, $\alpha_6 = 1.4$, $\alpha_8 = 0.6$, while $\alpha_2 = \alpha_3 = \alpha_4 = \alpha_5 = \alpha_7 = \alpha_9 = \alpha_{10} = 0$.]
20
The use of slack variables
We allow "errors" $\xi_i$ in classification for noisy data.

[Figure: Class 1 and Class 2 with the hyperplanes $w^T x + b = 0$ and $w^T x + b = \pm 1$; points $x_i$ and $x_j$ lie on the wrong side of their margin hyperplane, with slack distances $\xi_i$ and $\xi_j$.]
21
Soft Margin Hyperplane
The use of slack variables $\xi_i$ enables the soft margin classifier:

$w^T x_i + b \ge 1 - \xi_i \quad \text{for } y_i = 1$
$w^T x_i + b \le -1 + \xi_i \quad \text{for } y_i = -1$
$\xi_i \ge 0$

The $\xi_i$ are "slack variables" in the optimization. Note that $\xi_i = 0$ if there is no error for $x_i$. The objective function becomes

$\frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n} \xi_i$

where $C$ is the tradeoff parameter between error and margin. The primal optimization problem becomes

$\text{minimize} \quad \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n} \xi_i$
$\text{subject to} \quad y_i(w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0$
22
Dual Soft-Margin Optimization Problem
The dual of this new constrained optimization problem is

$\text{maximize} \quad W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j\, x_i^T x_j$
$\text{subject to} \quad 0 \le \alpha_i \le C, \quad \sum_{i=1}^{n} \alpha_i y_i = 0$

$w$ can be recovered as $w = \sum_{i=1}^{n} \alpha_i y_i x_i$.

This is very similar to the optimization problem in the hard-margin case, except that there is now an upper bound $C$ on the $\alpha_i$. Once again, a QP solver can be used to find the $\alpha_i$.
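A small illustrative sketch (my addition; the noisy toy data are invented) of how the upper bound C trades margin width against training errors; smaller C typically yields more support vectors, and the dual coefficients are capped at C in magnitude.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two noisy, overlapping Gaussian blobs (hypothetical data)
X = np.vstack([rng.normal(loc=[2, 2], scale=1.2, size=(50, 2)),
               rng.normal(loc=[-2, -2], scale=1.2, size=(50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # dual_coef_ holds alpha_i * y_i, so |alpha_i| is bounded by C
    print(f"C={C:>6}: {clf.n_support_.sum()} support vectors, "
          f"max |alpha_i| = {np.abs(clf.dual_coef_).max():.3g}")
```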
23
Nonlinear separable problems
24
Extension to Non-linear Decision Boundary
How to extend the linear large-margin classifier to nonlinear case?
Cover’s theorem
Consider a space made up of nonlinearly separable patterns.
Cover’s theorem states that such a multi-dimensional space can be transformed into a new feature space where the patterns are linearly separable with a high probability, provided two conditions are satisfied:
(1) The transform is nonlinear;
(2) The dimensionality of the feature space is high enough.
25
Non-linear SVMs: Feature spaces
General idea: the data in the original input space can be mapped into some higher-dimensional feature space, where the training data become linearly separable, by using a nonlinear transformation:
Φ: x → φ(x)
kernel visualization: http://www.youtube.com/watch?v=9NrALgHFwTo
26
Transforming the data
Key idea: transform the $x_i$ to a higher-dimensional space by using a nonlinear transformation.

Input space: the space where the points $x_i$ are located.
Feature space: the space of the $\varphi(x_i)$ after transformation.

Curse of dimensionality: computation in the feature space can be very costly because it is high-dimensional, and the feature space is typically infinite-dimensional! This problem of the 'curse of dimensionality' can be surmounted on the strength of the kernel function, because the inner product is just a scalar; this is the most appealing characteristic of SVM.
27
Kernel trick
Recall the SVM dual optimization problem:

$\text{maximize} \quad W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j\, x_i^T x_j$
$\text{subject to} \quad 0 \le \alpha_i \le C, \quad \sum_{i=1}^{n} \alpha_i y_i = 0$

The data points only appear as the inner product $x_i^T x_j$. With the aid of the inner-product representation in the feature space, the nonlinear mapping $\varphi(x_i)$ can be used implicitly by defining the kernel function $K$ as

$K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$
28
What functions can be used as kernels?
Mercer's theorem in operator theory: every semi-positive definite symmetric function is a kernel.

Semi-positive definite symmetric functions correspond to a semi-positive definite symmetric Gram matrix on the data points:

$K = \begin{pmatrix} K(x_1,x_1) & K(x_1,x_2) & K(x_1,x_3) & \cdots & K(x_1,x_n) \\ K(x_2,x_1) & K(x_2,x_2) & K(x_2,x_3) & \cdots & K(x_2,x_n) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ K(x_n,x_1) & K(x_n,x_2) & K(x_n,x_3) & \cdots & K(x_n,x_n) \end{pmatrix}$
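A quick numerical sketch (my addition; the random data are arbitrary) checking Mercer's condition for the Gaussian kernel: the Gram matrix should be symmetric with no significantly negative eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))      # arbitrary data points

# Gaussian (RBF) kernel Gram matrix, K_ij = exp(-||xi - xj||^2 / (2 sigma^2))
sigma = 1.0
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2 * sigma ** 2))

# Mercer: K must be symmetric positive semi-definite
eigvals = np.linalg.eigvalsh(K)
print("symmetric:", np.allclose(K, K.T))
print("min eigenvalue:", eigvals.min())   # >= 0 up to round-off
```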
29
An Example for f (.) and K(.,.)
Suppose the nonlinear mapping $\varphi(\cdot): \mathbb{R}^2 \to \mathbb{R}^6$ is as follows:

$\varphi(x) = (1,\ \sqrt{2}x_1,\ \sqrt{2}x_2,\ x_1^2,\ x_2^2,\ \sqrt{2}x_1 x_2)^T$

An inner product in the feature space is

$\varphi(x)^T \varphi(y) = (1 + x_1 y_1 + x_2 y_2)^2$

So, if we define the kernel function as follows, there is no need to carry out $\varphi(\cdot)$ explicitly:

$K(x, y) = (1 + x_1 y_1 + x_2 y_2)^2$

This use of the kernel function to avoid carrying out $\varphi(\cdot)$ explicitly is known as the kernel trick.
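To make the identity concrete, here is a small numerical check (my addition): the explicit six-dimensional map and the kernel give the same inner product.

```python
import numpy as np

def phi(x):
    """Explicit feature map R^2 -> R^6 from the slide."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2, np.sqrt(2) * x1 * x2])

def K(x, y):
    """Degree-2 polynomial kernel, computed without the feature map."""
    return (1.0 + np.dot(x, y)) ** 2

x, y = np.array([0.7, -1.3]), np.array([2.1, 0.4])
print(phi(x) @ phi(y))   # inner product in feature space
print(K(x, y))           # same value via the kernel trick
```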
30
Kernel functions
In practical use of SVM, the user specifies the kernel function; the transformation $\varphi(\cdot)$ is not explicitly stated.

Given a kernel function $K(x_i, x_j)$, the transformation $\varphi(\cdot)$ is given by its eigenfunctions (a concept in functional analysis):

$\int_X k(x, z)\, \phi(z)\, dz = \lambda\, \phi(x)$

Eigenfunctions can be difficult to construct explicitly. This is why people only specify the kernel function without worrying about the exact transformation.
31
Examples of kernel functions
Polynomial kernel with degree $d$:

$K(x, y) = (x^T y + 1)^d$

Radial basis function kernel with width $\sigma$:

$K(x, y) = \exp\big(-\|x - y\|^2 / (2\sigma^2)\big)$

Closely related to radial basis function neural networks; the feature space induced is infinite-dimensional.

Sigmoid function with parameters $\kappa$ and $\theta$:

$K(x, y) = \tanh(\kappa\, x^T y + \theta)$

It does not satisfy the Mercer condition for all $\kappa$ and $\theta$; closely related to feedforward neural networks.
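These three kernels are straightforward to implement; a minimal numpy sketch (my addition, with illustrative parameter values):

```python
import numpy as np

def poly_kernel(x, y, d=3):
    """Polynomial kernel of degree d."""
    return (np.dot(x, y) + 1.0) ** d

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel with width sigma."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=0.5, theta=-1.0):
    """Sigmoid kernel; not Mercer-admissible for all parameter choices."""
    return np.tanh(kappa * np.dot(x, y) + theta)

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly_kernel(x, y), rbf_kernel(x, y), sigmoid_kernel(x, y))
```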
32
Kernel: Bridge from linear to nonlinear
Change all inner products to kernel functions. For training, the optimization problem is:

Linear:
$\text{maximize} \quad W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j\, x_i^T x_j$
$\text{subject to} \quad 0 \le \alpha_i \le C, \quad \sum_{i=1}^{n} \alpha_i y_i = 0$

Nonlinear:
$\text{maximize} \quad W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j\, K(x_i, x_j)$
$\text{subject to} \quad 0 \le \alpha_i \le C, \quad \sum_{i=1}^{n} \alpha_i y_i = 0$
33
Kernel expansion for decision function
For classifying a new datum $x$, it belongs to class 1 if $f(x) \ge 0$, and to class 2 if $f(x) < 0$.

Linear:
$w = \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad f(x) = w^T x + b = \sum_{i=1}^{n} \alpha_i y_i\, x_i^T x + b$

Nonlinear:
$w = \sum_{i=1}^{n} \alpha_i y_i\, \varphi(x_i), \qquad f(x) = w^T \varphi(x) + b = \sum_{i=1}^{n} \alpha_i y_i\, K(x_i, x) + b$
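A sketch of this kernel expansion (my addition, assuming scikit-learn is available): sklearn's SVC stores the products $\alpha_i y_i$ of the support vectors in dual_coef_, so $f(x)$ can be rebuilt by hand and compared against decision_function.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=100, noise=0.15, random_state=0)
gamma = 1.0
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

# Rebuild f(x) = sum_i alpha_i y_i K(x_i, x) + b over the support vectors
def f(x):
    k = np.exp(-gamma * np.sum((clf.support_vectors_ - x) ** 2, axis=1))
    return clf.dual_coef_[0] @ k + clf.intercept_[0]

x_new = np.array([0.5, 0.0])
print(f(x_new), clf.decision_function([x_new])[0])  # should match
```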
34
Compared to neural networks
SVMs are explicitly based on a theoretical model of learning rather than on loose analogies with natural learning systems or other heuristics.
Modularity: Any kernel-based learning algorithm is composed of two modules:
A general purpose learning machine
A problem specific kernel function
SVMs are not affected by the problem of local minima because their training amounts to convex optimization.
35
Key features in SV classifier
All of these features were already present and had been used in machine learning since the 1960s:
Maximum (large) margin; kernel method; duality in nonlinear programming; sparseness of the solution; slack variables.
However, not until 1995 were all of these features combined, and it is surprising how naturally and elegantly they fit together and complement each other in the SVM.
36
SVM classification for 2D data
Figure. Visualization of SVM classification.
37
Part II. Support Vector Learning for Regression
38
Overfitting in nonlinear regression
39
The linear regression problem
$y = f(x) = w \cdot x + b$
40
Linear regression
The problem of linear regression is much older than the classification one; least squares linear interpolation was first used by Gauss in the 18th century for astronomical problems.

Given a training set $S = \{(x_i, y_i)\}$, with $x_i \in X \subseteq \mathbb{R}^n$ and $y_i \in Y \subseteq \mathbb{R}$, the problem of linear regression is to find a linear function $f$ that models the data:

$y = f(x) = w \cdot x + b$
41
Least squares
The least squares approach prescribes choosing the parameters $(w, b)$ to minimize the sum of the squared deviations of the data,

$J(w, b) = \sum_{i=1}^{\ell} (y_i - w \cdot x_i - b)^2$

Setting $\hat{w} = (w^T, b)^T$ and

$\hat{X} = \begin{pmatrix} \hat{x}_1^T \\ \hat{x}_2^T \\ \vdots \\ \hat{x}_\ell^T \end{pmatrix}$

where $\hat{x}_i = (x_i^T, 1)^T$.
42
Least squares
The square loss function can be written as

$J(\hat{w}) = (y - \hat{X}\hat{w})^T (y - \hat{X}\hat{w})$

Taking derivatives of the loss and setting them equal to zero,

$\frac{\partial J}{\partial \hat{w}} = -2\hat{X}^T y + 2\hat{X}^T \hat{X}\hat{w} = 0$

yields the well-known 'normal equations'

$\hat{X}^T \hat{X}\hat{w} = \hat{X}^T y$

and, if the inverse of $\hat{X}^T \hat{X}$ exists, the solution is:

$\hat{w} = (\hat{X}^T \hat{X})^{-1} \hat{X}^T y$
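A minimal numpy sketch (my addition, with synthetic data) solving the normal equations and cross-checking against a numerically stabler solver:

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic data: y = 3*x1 - 2*x2 + 1 + noise (illustrative)
X = rng.normal(size=(50, 2))
y = X @ np.array([3.0, -2.0]) + 1.0 + 0.1 * rng.normal(size=50)

# Augment with a column of ones so the bias is absorbed into w_hat
X_hat = np.hstack([X, np.ones((50, 1))])

# Normal equations: (X^T X) w = X^T y
w_hat = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)
print(w_hat)                                     # approx [3, -2, 1]
print(np.linalg.lstsq(X_hat, y, rcond=None)[0])  # same result, stabler solver
```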
43
Ridge regression If the matrix in the least squares problem is not of full rank, or in other situations where numerical stability problems occur, one can use the following solution,
where is the identity matrix with the entry set to zero. This solution is called ridge regression.
The ridge regression minimizes the penalized loss function
regularizer
XX T ˆˆ
yXIXXw Tn
T ˆ)ˆˆ(ˆ 1
nI )1,1( nn
1
2)(),(i
ii ybxwwwbwJ
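The corresponding ridge solution in numpy, with the bias left unpenalized as the slide specifies (data and λ invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 2))
y = X @ np.array([3.0, -2.0]) + 1.0 + 0.1 * rng.normal(size=50)
X_hat = np.hstack([X, np.ones((50, 1))])

lam = 0.1
I = np.eye(3)
I[-1, -1] = 0.0   # do not penalize the bias entry
w_ridge = np.linalg.solve(X_hat.T @ X_hat + lam * I, X_hat.T @ y)
print(w_ridge)
```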
44
ε-insensitive loss function

Instead of the square loss function, the ε-insensitive loss function is used in SV regression,

$|y - f(x)|_\varepsilon = \begin{cases} 0 & \text{if } |y - f(x)| \le \varepsilon \\ |y - f(x)| - \varepsilon & \text{otherwise} \end{cases}$

which leads to sparsity of the solution.

[Figure: penalty versus (value of f) - target, for the square loss function and for the ε-insensitive loss function, which is zero inside the tube of width ε.]
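A one-line sketch of this loss (my addition; ε = 0.1 is illustrative):

```python
import numpy as np

def eps_insensitive_loss(y, f, eps=0.1):
    """Zero inside the eps-tube, linear outside: max(0, |y - f| - eps)."""
    return np.maximum(0.0, np.abs(y - f) - eps)

residuals = np.array([-0.3, -0.05, 0.0, 0.08, 0.5])
print(eps_insensitive_loss(residuals, 0.0))  # [0.2, 0, 0, 0, 0.4]
```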
45
The linear regression problem
$y = f(x) = w \cdot x + b$

$|y_i - (w \cdot x_i + b)|_\varepsilon = \begin{cases} 0 & \text{if } |y_i - (w \cdot x_i + b)| \le \varepsilon \\ |y_i - (w \cdot x_i + b)| - \varepsilon & \text{otherwise} \end{cases} = \max\big(0,\ |y_i - (w \cdot x_i + b)| - \varepsilon\big)$
46
Primal problem in ε-SVR

Given a data set $x_1, x_2, \dots, x_\ell$ with target values $y_1, y_2, \dots, y_\ell$, the ε-SVR was formulated as the following (primal) convex optimization problem:

$\text{minimize} \quad \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{\ell} (\xi_i + \xi_i^*)$
$\text{subject to} \quad y_i - w \cdot x_i - b \le \varepsilon + \xi_i$
$\qquad\qquad\quad\ w \cdot x_i + b - y_i \le \varepsilon + \xi_i^*$
$\qquad\qquad\quad\ \xi_i, \xi_i^* \ge 0$

The constant $C > 0$ determines the trade-off between the flatness of $f$ and the amount up to which deviations larger than $\varepsilon$ are tolerated.
47
Lagrangian
Construct the Lagrange function from the objective function and the corresponding constraints:

$L = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{\ell}(\xi_i + \xi_i^*) - \sum_{i=1}^{\ell}(\eta_i \xi_i + \eta_i^* \xi_i^*) - \sum_{i=1}^{\ell} \alpha_i(\varepsilon + \xi_i - y_i + w \cdot x_i + b) - \sum_{i=1}^{\ell} \alpha_i^*(\varepsilon + \xi_i^* + y_i - w \cdot x_i - b)$

where the Lagrange multipliers satisfy the positivity constraints

$\alpha_i, \alpha_i^*, \eta_i, \eta_i^* \ge 0$
48
Karush-Kuhn-Tucker Conditions
It follows from the saddle point condition that the partial derivatives of $L$ with respect to the primal variables have to vanish for optimality:

$\partial_b L = \sum_{i=1}^{\ell} (\alpha_i^* - \alpha_i) = 0$
$\partial_w L = w - \sum_{i=1}^{\ell} (\alpha_i - \alpha_i^*)\, x_i = 0$
$\partial_{\xi_i} L = C - \alpha_i - \eta_i = 0$
$\partial_{\xi_i^*} L = C - \alpha_i^* - \eta_i^* = 0$

The 2nd equation indicates that $w$ can be written as a linear combination of the training patterns $x_i$ (parametric → nonparametric).
49
Dual problem
Substituting the equations above into the Lagrangian yields the following dual problem:

$\text{maximize} \quad -\frac{1}{2}\sum_{i,j=1}^{\ell}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, x_i \cdot x_j - \varepsilon\sum_{i=1}^{\ell}(\alpha_i + \alpha_i^*) + \sum_{i=1}^{\ell} y_i(\alpha_i - \alpha_i^*)$
$\text{subject to} \quad \sum_{i=1}^{\ell}(\alpha_i - \alpha_i^*) = 0 \quad \text{and} \quad \alpha_i, \alpha_i^* \in [0, C]$

The function $f$ can be written in a non-parametric form by substituting $w = \sum_{i=1}^{\ell}(\alpha_i - \alpha_i^*)\, x_i$ into $f(x) = w \cdot x + b$:

$f(x) = \sum_{i=1}^{\ell}(\alpha_i - \alpha_i^*)\, x_i \cdot x + b$
50
KKT complementarity conditions
At the optimal solution the following Karush-Kuhn-Tucker complementarity conditions must be fulfilled:

$\alpha_i(\varepsilon + \xi_i - y_i + w \cdot x_i + b) = 0$
$\alpha_i^*(\varepsilon + \xi_i^* + y_i - w \cdot x_i - b) = 0$
$(C - \alpha_i)\, \xi_i = 0$
$(C - \alpha_i^*)\, \xi_i^* = 0$

Obviously, for $0 < \alpha_i < C$, $\xi_i = 0$ holds, and similarly for $0 < \alpha_i^* < C$, $\xi_i^* = 0$.
51
Unbounded support vectors
Hence, for $0 < \alpha_i, \alpha_i^* < C$, it follows

$\varepsilon - y_i + w \cdot x_i + b = 0$
$\varepsilon + y_i - w \cdot x_i - b = 0$

Thus, for all the data points fulfilling $y_i = f(x_i) + \varepsilon$, the dual variables satisfy $0 < \alpha_i < C$, and similarly for the ones satisfying $y_i = f(x_i) - \varepsilon$, the dual variables satisfy $0 < \alpha_i^* < C$. These data points are called the unbounded support vectors.
52
Computing bias term b
Unbounded support vectors allow computing the value of the bias term $b$ as given below:

$b = y_i - w \cdot x_i - \varepsilon \quad \text{for } 0 < \alpha_i < C$
$b = y_i - w \cdot x_i + \varepsilon \quad \text{for } 0 < \alpha_i^* < C$

The calculation of the bias term $b$ is numerically very sensitive, and it is better to compute $b$ by averaging over all the unbounded support vector data points.
53
Bounded support vectors
The bounded support vectors (those with $\alpha_i = C$ or $\alpha_i^* = C$) lie outside the ε-tube.
54
Sparsity of SV expansion
For all data points inside the ε-tube, i.e., $|y_i - (w \cdot x_i + b)| < \varepsilon$, the KKT complementarity conditions

$\alpha_i(\varepsilon + \xi_i - y_i + w \cdot x_i + b) = 0$
$\alpha_i^*(\varepsilon + \xi_i^* + y_i - w \cdot x_i - b) = 0$

force $\alpha_i$ and $\alpha_i^*$ to be zero. Therefore, we have a sparse expansion of $w$ in terms of the $x_i$:

$w = \sum_{i=1}^{\ell}(\alpha_i - \alpha_i^*)\, x_i = \sum_{i \in SV}(\alpha_i - \alpha_i^*)\, x_i$
55
SV Nonlinear Regression
Key idea: map the data to a higher-dimensional space by using a nonlinear transformation, and perform linear regression in the feature (embedded) space.

Input space: the space where the points $x_i$ are located.
Feature space: the space of the $\varphi(x_i)$ after transformation.

Computation in the feature space can be costly because it is high-dimensional, and the feature space is typically infinite-dimensional! This problem of the 'curse of dimensionality' can be surmounted by resorting to the kernel trick, which is the most appealing characteristic of SVM.
56
Kernel trick
Recall the SVR dual optimization problem:

$\text{maximize} \quad -\frac{1}{2}\sum_{i,j=1}^{\ell}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, x_i \cdot x_j - \varepsilon\sum_{i=1}^{\ell}(\alpha_i + \alpha_i^*) + \sum_{i=1}^{\ell} y_i(\alpha_i - \alpha_i^*)$
$\text{subject to} \quad \sum_{i=1}^{\ell}(\alpha_i - \alpha_i^*) = 0 \quad \text{and} \quad \alpha_i, \alpha_i^* \in [0, C]$

The data points only appear as the inner product $x_i \cdot x_j$. As long as we can calculate the inner product in the feature space, we do not need to know the nonlinear mapping explicitly. Define the kernel function $K$ by

$K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$
57
Kernel: Bridge from linear to nonlinear
Change all inner products to kernel functions. For training:

Original:
$\text{maximize} \quad -\frac{1}{2}\sum_{i,j=1}^{\ell}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, x_i \cdot x_j - \varepsilon\sum_{i=1}^{\ell}(\alpha_i + \alpha_i^*) + \sum_{i=1}^{\ell} y_i(\alpha_i - \alpha_i^*)$
$\text{subject to} \quad \sum_{i=1}^{\ell}(\alpha_i - \alpha_i^*) = 0 \quad \text{and} \quad \alpha_i, \alpha_i^* \in [0, C]$

With kernel function:
$\text{maximize} \quad -\frac{1}{2}\sum_{i,j=1}^{\ell}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, K(x_i, x_j) - \varepsilon\sum_{i=1}^{\ell}(\alpha_i + \alpha_i^*) + \sum_{i=1}^{\ell} y_i(\alpha_i - \alpha_i^*)$
$\text{subject to} \quad \sum_{i=1}^{\ell}(\alpha_i - \alpha_i^*) = 0 \quad \text{and} \quad \alpha_i, \alpha_i^* \in [0, C]$
58
Kernel expansion representation
For testing:

Original:
$w = \sum_{i=1}^{\ell}(\alpha_i - \alpha_i^*)\, x_i, \qquad f(x) = w \cdot x + b = \sum_{i=1}^{\ell}(\alpha_i - \alpha_i^*)\, x_i \cdot x + b$

With kernel function:
$w = \sum_{i=1}^{\ell}(\alpha_i - \alpha_i^*)\, \varphi(x_i), \qquad f(x) = w \cdot \varphi(x) + b = \sum_{i=1}^{\ell}(\alpha_i - \alpha_i^*)\, K(x_i, x) + b$
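A sketch of this expansion with scikit-learn (my addition; sklearn's SVR stores the differences $\alpha_i - \alpha_i^*$ of the support vectors in dual_coef_, and the data are synthetic):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(4)
X = np.sort(rng.uniform(-3, 3, size=(80, 1)), axis=0)
y = np.sinc(X).ravel() + 0.05 * rng.normal(size=80)

gamma, C, eps = 0.5, 1.0, 0.1
reg = SVR(kernel="rbf", gamma=gamma, C=C, epsilon=eps).fit(X, y)

# f(x) = sum_i (alpha_i - alpha_i^*) K(x_i, x) + b over the support vectors
def f(x):
    k = np.exp(-gamma * np.sum((reg.support_vectors_ - x) ** 2, axis=1))
    return reg.dual_coef_[0] @ k + reg.intercept_[0]

x_new = np.array([0.3])
print(f(x_new), reg.predict([x_new])[0])   # should agree
print("support vectors:", len(reg.support_vectors_), "of", len(X))
```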
59
Compared to SV classification
The size of the SVR problem, with respect to the size of an SV classifier design task, is doubled: there are $2\ell$ unknown dual variables ($\ell$ $\alpha_i$'s and $\ell$ $\alpha_i^*$'s) for a support vector regression.

Besides the penalty parameter $C$ and the shape parameters of the kernel function (such as the variance of a Gaussian kernel or the order of a polynomial), the insensitivity zone $\varepsilon$ also needs to be set beforehand when constructing SV machines for regression.
60
Influence of insensitivity zone
61
Linear programming SV regression
In an attempt to improve computational efficiency and model sparsity, linear programming SV regression was formulated as

$\text{minimize} \quad \|\alpha\|_1 + C\sum_{i=1}^{\ell}(\xi_i + \xi_i^*)$
$\text{subject to} \quad y_i - \sum_{j=1}^{\ell} \alpha_j\, k(x_i, x_j) - b \le \varepsilon + \xi_i$
$\qquad\qquad\quad\ \sum_{j=1}^{\ell} \alpha_j\, k(x_i, x_j) + b - y_i \le \varepsilon + \xi_i^*$
$\qquad\qquad\quad\ \xi_i, \xi_i^* \ge 0$

where $\alpha = [\alpha_1, \alpha_2, \dots, \alpha_\ell]^T$.
62
Linear programming SV regression
The optimization problem can be converted into a linear programming problem as follows:

$\text{minimize} \quad c^T \begin{pmatrix} \alpha^+ \\ \alpha^- \\ \xi \end{pmatrix}$
$\text{subject to} \quad \begin{pmatrix} K & -K & I \\ -K & K & I \end{pmatrix} \begin{pmatrix} \alpha^+ \\ \alpha^- \\ \xi \end{pmatrix} \ge \begin{pmatrix} y - \varepsilon\mathbf{1} \\ -y - \varepsilon\mathbf{1} \end{pmatrix}, \quad \alpha^+, \alpha^-, \xi \ge 0$

where $c = (1, \dots, 1,\ 1, \dots, 1,\ 2C, \dots, 2C)^T$, $\alpha = \alpha^+ - \alpha^-$ with $\alpha^+ = (\alpha_1^+, \dots, \alpha_\ell^+)^T$ and $\alpha^- = (\alpha_1^-, \dots, \alpha_\ell^-)^T$, $\xi = (\xi_1, \dots, \xi_\ell)^T$, and $K_{ij} = k(x_i, x_j)$.
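As a hedged sketch of this LP (my additions: scipy's linprog, a Gaussian kernel, synthetic data, and the bias term omitted for brevity are all assumptions not fixed by the slide):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(5)
X = np.sort(rng.uniform(-3, 3, size=(30, 1)), axis=0)
y = np.sinc(X).ravel() + 0.05 * rng.normal(size=30)
n, C, eps, gamma = len(y), 10.0, 0.1, 0.5

# Gaussian kernel Gram matrix on 1-D inputs
K = np.exp(-gamma * (X - X.T) ** 2)

# Variables z = [alpha_plus; alpha_minus; xi], all >= 0
c = np.hstack([np.ones(n), np.ones(n), 2 * C * np.ones(n)])
I = np.eye(n)
# |y - K alpha| <= eps + xi, written as two <= constraints
A_ub = np.block([[ K, -K, -I],
                 [-K,  K, -I]])
b_ub = np.hstack([y + eps, -y + eps])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")

alpha = res.x[:n] - res.x[n:2 * n]
print("nonzero alphas:", np.sum(np.abs(alpha) > 1e-6), "of", n)
y_fit = K @ alpha
print("max deviation:", np.abs(y - y_fit).max())  # <= eps plus slack
```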
63
MATLAB Demo for SV regression
The MATLAB support vector machine toolbox was developed by Steve R. Gunn at the University of Southampton, UK.
The software can be downloaded from
http://www.isis.ecs.soton.ac.uk/resources/svminfo/
64
Questions?