Nonlinear Data Discrimination via Generalized Support Vector Machines
David R. Musicant and Olvi L. Mangasarian
University of Wisconsin - Madison
www.cs.wisc.edu/~musicant
Outline
The linear support vector machine (SVM)
– Linear kernel
The generalized support vector machine (GSVM)
– Nonlinear indefinite kernel
Linear programming formulation of the GSVM
– MINOS
Quadratic programming formulation of the GSVM
– Successive overrelaxation (SOR)
Numerical comparisons
Conclusions
The Discrimination Problem
The Fundamental 2-Category Linearly Separable Case
[Figure: point sets A+ and A- separated by the plane x'w = γ, with bounding planes x'w = γ + 1 and x'w = γ - 1 and normal w.]
The Discrimination Problem
The Fundamental 2-Category Linearly Separable Case
Given m points in the n-dimensional space R^n, represented by an m x n matrix A.
Membership of each point A_i in the classes +1 or -1 is specified by:
– An m x m diagonal matrix D with ±1 along its diagonal.
Separate by two bounding planes x'w = γ ± 1 such that:
    A_i w ≥ γ + 1, for D_ii = +1
    A_i w ≤ γ - 1, for D_ii = -1
More succinctly (see the sketch below):
    D(Aw - eγ) ≥ e
where e is a vector of ones.
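As a concrete illustration of the condition D(Aw - eγ) ≥ e, here is a minimal numpy sketch. The toy points, w, and γ are made up for illustration and are not from the slides:

    import numpy as np

    # Hypothetical toy data: four points in R^2, first two in class +1, last two in class -1.
    A = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-3.0, -2.0]])
    d = np.array([1.0, 1.0, -1.0, -1.0])   # class labels: the diagonal of D
    D = np.diag(d)
    e = np.ones(4)

    # A candidate plane x'w = gamma, with bounding planes x'w = gamma +/- 1.
    w = np.array([1.0, 1.0])
    gamma = 0.0

    # The data are separated with margin iff D(Aw - e*gamma) >= e holds componentwise.
    print(D @ (A @ w - e * gamma) >= e)    # -> [ True  True  True  True]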
Preliminary Attempt at the (Linear) Support Vector Machine:
Robust Linear Programming
Solve the following mathematical program (a solver sketch follows below):
    min_{w,γ,y}  e'y
    s.t.  D(Aw - eγ) + y ≥ e,  y ≥ 0
where y = nonnegative error (slack) vector.
Note: y = 0 if the convex hulls of A+ and A- do not intersect.
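A hedged sketch of this robust LP using scipy.optimize.linprog as a stand-in solver (the talk itself uses MINOS for its LP runs; the function name and the stacking of variables z = [w, γ, y] are implementation choices, not from the slides):

    import numpy as np
    from scipy.optimize import linprog

    def robust_lp(A, d):
        """Solve min e'y  s.t.  D(Aw - e*gamma) + y >= e, y >= 0.
        Decision variables are stacked as z = [w (n), gamma (1), y (m)]."""
        m, n = A.shape
        D = np.diag(d.astype(float))
        e = np.ones(m)
        # Objective: zero weight on w and gamma, e'y on the slacks.
        c = np.concatenate([np.zeros(n + 1), e])
        # D(Aw - e*gamma) + y >= e, rewritten as -DAw + De*gamma - y <= -e.
        A_ub = np.hstack([-D @ A, (D @ e)[:, None], -np.eye(m)])
        b_ub = -e
        bounds = [(None, None)] * (n + 1) + [(0, None)] * m  # only y is sign-constrained
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)  # assumes the LP solves
        w, gamma, y = res.x[:n], res.x[n], res.x[n + 1:]
        return w, gamma, y

For linearly separable data such as the toy points above, the optimal slack y comes out zero.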
The (Linear) Support Vector Machine
Maximize the Margin Between Separating Planes
[Figure: bounding planes x'w = γ + 1 and x'w = γ - 1 with normal w, separating A+ and A-; the margin between them, 2/||w||_2, is maximized.]
The (Linear) Support Vector Machine Formulation
Solve the following mathematical program:
    min_{w,γ,y}  νe'y + ||w||
    s.t.  D(Aw - eγ) + y ≥ e,  y ≥ 0
where y = nonnegative error (slack) vector.
Note: y = 0 if the convex hulls of A+ and A- do not intersect.
GSVM: Generalized Support Vector Machine
Linear Programming Formulation
Linear support vector machine (linear separating surface x'w = γ):
    min_{y≥0,w,γ}  νe'y + ||w||_1
    s.t.  D(Aw - eγ) + y ≥ e
By "duality", set w = A'Du (linear separating surface x'A'Du = γ):
    min_{y≥0,u,γ}  νe'y + ||A'Du||_1
    s.t.  D(AA'Du - eγ) + y ≥ e
Nonlinear support vector machine: replace AA' by a nonlinear kernel K(A,A'). Nonlinear separating surface K(x',A')Du = γ (see the LP sketch below):
    min_{y≥0,u,γ}  νe'y + ||A'Du||_1
    s.t.  D(K(A,A')Du - eγ) + y ≥ e
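A sketch of this nonlinear GSVM LP, again with scipy's linprog standing in for MINOS. The 1-norm objective ||A'Du||_1 is linearized with an auxiliary variable s ≥ |A'Du|; the function name and the variable stacking are assumptions for illustration:

    import numpy as np
    from scipy.optimize import linprog

    def gsvm_lp(K, A, d, nu):
        """min nu*e'y + ||A'Du||_1  s.t.  D(K Du - e*gamma) + y >= e, y >= 0.
        Variables stacked as z = [u (m), gamma (1), y (m), s (n)]."""
        m, n = A.shape
        D = np.diag(d.astype(float))
        e, I = np.ones(m), np.eye(m)
        c = np.concatenate([np.zeros(m + 1), nu * e, np.ones(n)])
        # D(K D u - e*gamma) + y >= e  ->  -DKDu + De*gamma - y <= -e
        row1 = np.hstack([-D @ K @ D, (D @ e)[:, None], -I, np.zeros((m, n))])
        # A'Du - s <= 0 and -A'Du - s <= 0 enforce s >= |A'Du| componentwise.
        ADu = A.T @ D
        row2 = np.hstack([ADu, np.zeros((n, 1)), np.zeros((n, m)), -np.eye(n)])
        row3 = np.hstack([-ADu, np.zeros((n, 1)), np.zeros((n, m)), -np.eye(n)])
        A_ub = np.vstack([row1, row2, row3])
        b_ub = np.concatenate([-e, np.zeros(2 * n)])
        bounds = [(None, None)] * (m + 1) + [(0, None)] * (m + n)
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
        u, gamma = res.x[:m], res.x[m]
        return u, gamma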
Examples of Kernels
(a ∈ R^m, μ ∈ R, and d an integer)
– Polynomial kernel: (AA' + μaa')^d_•
  • The subscript • denotes componentwise exponentiation, as in MATLAB.
– Radial basis kernel: exp(-μ||A_i - A_j||^2), i, j = 1, ..., m
– Neural network kernel: (AA' + μaa')_•*
  • The subscript •* denotes the step function [·]_* : R → {0,1}, applied componentwise.
(Numpy versions of all three are sketched below.)
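The three kernels translate directly into numpy. A minimal sketch; the two-argument form (allowing cross kernels K(A, B') against test rows B, with b the corresponding piece of a) and the convention step(t) = 1 for t ≥ 0 are assumptions:

    import numpy as np

    def polynomial_kernel(A, B, mu, a, b, degree):
        # (AB' + mu*ab')^degree with componentwise (MATLAB .^) exponentiation;
        # B = A and b = a recovers K(A, A').
        return (A @ B.T + mu * np.outer(a, b)) ** degree

    def radial_basis_kernel(A, B, mu):
        # exp(-mu*||A_i - B_j||^2) for every pair of rows of A and B
        sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-mu * sq_dists)

    def neural_network_kernel(A, B, mu, a, b):
        # Componentwise step function of (AB' + mu*ab'), mapping R -> {0, 1};
        # step(t) = 1 for t >= 0 is an assumed convention.
        return ((A @ B.T + mu * np.outer(a, b)) >= 0).astype(float)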
A Nonlinear Kernel Application
Checkerboard Training Set: 1000 Points in R^2
Separate 486 Asterisks from 514 Dots
Previous Work
Polynomial kernel: K(A,A') = ((A/100)(A'/100) - 0.5)^6_•
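Assuming the kernel reads as reconstructed above, a one-line numpy version (the function name is hypothetical):

    import numpy as np

    def checkerboard_kernel(A, B):
        # K(A, B') = ((A/100)(B'/100) - 0.5)^6, exponentiation componentwise
        return ((A / 100.0) @ (B / 100.0).T - 0.5) ** 6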
Large Margin Classifier (SOR) Reformulation in (w, γ) Space
[Figure: bounding planes x'w - 1·γ = +1 and x'w - 1·γ = -1 separating A+ and A-; the margin 2/||(w, γ)||_2 between them is maximized.]
(SOR) Linear Support Vector Machine
Quadratic Programming Formulation
Solve the following mathematical program:
    min_{w,γ,y}  νe'y + (1/2)(w'w + γ^2)
    s.t.  D(Aw - eγ) + y ≥ e,  y ≥ 0
The quadratic term here maximizes the distance between the bounding planes x'w - γ = +1 and x'w - γ = -1 in the space R^{n+1} of (w, γ).
Introducing a Nonlinear Kernel
The Wolfe dual of the SOR linear SVM is:
    max_u  -(1/2)u'DAA'Du - (1/2)u'Dee'Du + e'u
    s.t.  0 ≤ u ≤ νe
    (w = A'Du, γ = -e'Du)
– Linear separating surface: x'A'Du = γ
Substitute a kernel for the AA' term:
    max_u  -(1/2)u'DK(A,A')Du - (1/2)u'Dee'Du + e'u
    s.t.  0 ≤ u ≤ νe
    (γ = -e'Du)
– Nonlinear separating surface: K(x',A')Du = γ (evaluated in the sketch below)
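Once the dual solution u and γ = -e'Du are in hand, a new point x is classified by the sign of K(x', A')Du - γ. A minimal sketch; the helper name classify is an assumption, and kernel is any two-argument kernel from above, e.g. lambda A, B: radial_basis_kernel(A, B, mu):

    import numpy as np

    def classify(x, A, d, u, gamma, kernel):
        # Nonlinear separating surface: K(x', A')Du = gamma.
        # A new point x lands in class +1 or -1 by the sign of K(x', A')Du - gamma.
        Kx = kernel(x[None, :], A)            # 1 x m row vector K(x', A')
        return np.sign(Kx @ (d * u) - gamma)  # d*u applies D = diag(d) to u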
SVM Optimality Conditions
Define
    H = D[K(A,A') + ee']D,   H = L + E + L'
where E is the diagonal of H and L its strictly lower triangular part. Then the dual SVM becomes much simpler:
    min_u  (1/2)u'Hu - e'u
    s.t.  0 ≤ u ≤ νe
Gradient projection necessary and sufficient optimality condition (see the sketch below):
    u = (u - ωE^{-1}(Hu - e))_#,   ω > 0
where (·)_# denotes projection of u onto the region 0 ≤ u ≤ νe.
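A sketch of these definitions in numpy: building H, the box projection (·)_#, and the residual ||u - (u - ωE^{-1}(Hu - e))_#||, which is zero exactly at a dual solution. Function names are assumptions:

    import numpy as np

    def build_H(K, d):
        # H = D[K(A,A') + ee']D with D = diag(d)
        D = np.diag(d.astype(float))
        return D @ (K + np.ones(K.shape)) @ D

    def project(u, nu):
        # (.)_#: projection onto the box 0 <= u <= nu*e
        return np.clip(u, 0.0, nu)

    def optimality_residual(u, H, nu, omega=1.0):
        # Zero exactly when u = (u - omega*E^{-1}(Hu - e))_#, i.e. at a solution.
        e = np.ones(len(u))
        E_inv = 1.0 / np.diag(H)  # E is the diagonal of H, assumed positive
        return np.linalg.norm(u - project(u - omega * E_inv * (H @ u - e), nu))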
SOR Algorithm & Convergence Above optimality conditions lead to the SOR algorithm:
Choose ! 2 (0;2): Start with any u0 2 Rm:Having ui; compute ui+1 as follows :
ui+1= (ui à ! E à 1(Hui à e+ L(ui+1à ui)))#
u = (uà ! E à 1(Huà e))#
– Remember, optimality conditions are expressed as:
SOR Linear Convergence [Luo-Tseng 1993]:– The iterates of the SOR algorithm converge R-linearly to a solution of the dual problem– The objective function values converge Q-linearly
to
f uig
f f (ui)gf (uö)
uö
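A compact sketch of the SOR iteration. Updating the components of u in place, one at a time, means each update already sees the new values of earlier components, which is precisely the L(u^{i+1} - u^i) term; ω = 1.3 and the stopping rule are illustrative choices, not from the slides:

    import numpy as np

    def sor(H, nu, omega=1.3, tol=1e-6, max_iter=10000):
        # Solve min (1/2)u'Hu - e'u  s.t. 0 <= u <= nu*e  by successive
        # overrelaxation; assumes the diagonal of H is positive.
        m = H.shape[0]
        u = np.zeros(m)
        for _ in range(max_iter):
            u_old = u.copy()
            for j in range(m):
                # Gradient component j at the current, partially updated u:
                # earlier components already hold their new values, which
                # supplies the L(u^{i+1} - u^i) term of the update formula.
                g = H[j] @ u - 1.0
                u[j] = min(max(u[j] - omega * g / H[j, j], 0.0), nu)
            if np.linalg.norm(u - u_old) < tol:
                break
        return u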
Numerical Testing
Comparison of linear & nonlinear kernels using:
– Linear programming
– Quadratic programming (SOR formulation)
Data sets:
– UCI Liver Disorders: 345 points in R^6
– Bell Labs Checkerboard: 1000 points in R^2
– Gaussian Synthetic: 1000 points in R^32
– SCDS Synthetic: 1 million points in R^32
– Massive Synthetic: 10 million points in R^32
Machines:
– Cluster of 4 Sun Enterprise E6000 machines, each consisting of 16 UltraSPARC II 250 MHz processors with 2 GB RAM
– Total: 64 processors, 8 GB RAM
Comparison of Linear & Nonlinear SVMsLinear Programming Generated
Dataset Tenfold Linear Kernel Nonlinear KernelCorrectness (Type)
Liver Training 71.21% 78.33%Disorders Testing 68.70% 73.37%
(Quadratic)Checkerboard Training 51.12% 99.97%
Testing 48.60% 98.50%(6th Degree Polynomial)
Checkerboard Training 51.12% 98.43%Testing 48.60% 97.70%
(Sinusoidal)
Nonlinear kernels yield better training and testing set correctness
SOR Results

Dataset    Correctness   Linear kernel   Quadratic kernel
Gaussian   Training      83.20%          97.68%
           Testing       81.70%          93.41%

Examples of training on massive data:
– 1 million point dataset generated by the SCDS generator:
  • Trained completely in 9.7 hours
  • Tuning set reached 99.7% of final accuracy in 0.3 hours
– 10 million point randomly generated dataset:
  • Tuning set reached 95% of final accuracy in 14.3 hours
  • Under 10,000 iterations
[Figure: comparison of linear and nonlinear kernels.]
Conclusions
Linear programming and successive overrelaxation can generate complex nonlinear separating surfaces via GSVMs.
Nonlinear separating surfaces improve generalization over linear ones.
SOR can handle very large problems not (easily) solvable by other methods.
SOR scales up with virtually no changes.
Future directions:
– Parallel SOR for very large problems not resident in memory
– Massive multicategory discrimination via SOR
– Support vector regression

Questions?