Nonlinear Data Discrimination via Generalized Support Vector Machines
David R. Musicant and Olvi L. Mangasarian
University of Wisconsin - Madison
www.cs.wisc.edu/~musicant
Outline
The linear support vector machine (SVM)
– Linear kernel
The generalized support vector machine (GSVM)
– Nonlinear indefinite kernel
Linear programming formulation of the GSVM
– MINOS
Quadratic programming formulation of the GSVM
– Successive overrelaxation (SOR)
Numerical comparisons
Conclusions
The Discrimination Problem
The Fundamental 2-Category Linearly Separable Case
[Figure: point sets A+ and A- separated by the plane x'w = γ, with bounding planes x'w = γ + 1 and x'w = γ - 1 and normal w.]
The Discrimination Problem
The Fundamental 2-Category Linearly Separable Case
Given m points in the n-dimensional space R^n, represented by an m x n matrix A.
Membership of each point A_i in the classes +1 or -1 is specified by:
– An m x m diagonal matrix D with ±1 along its diagonal.
Separate by two bounding planes x'w = γ ± 1 such that:
    A_i w ≥ γ + 1, for D_ii = +1
    A_i w ≤ γ - 1, for D_ii = -1
More succinctly (see the sketch below):
    D(Aw - eγ) ≥ e
where e is a vector of ones.
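As a concrete illustration of the condition D(Aw - eγ) ≥ e, here is a minimal numpy sketch. The toy points, w, and γ are made up for illustration and are not from the slides:

    import numpy as np

    # Hypothetical toy data: four points in R^2, first two in class +1, last two in class -1.
    A = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-3.0, -2.0]])
    d = np.array([1.0, 1.0, -1.0, -1.0])   # class labels: the diagonal of D
    D = np.diag(d)
    e = np.ones(4)

    # A candidate plane x'w = gamma, with bounding planes x'w = gamma +/- 1.
    w = np.array([1.0, 1.0])
    gamma = 0.0

    # The data are separated with margin iff D(Aw - e*gamma) >= e holds componentwise.
    print(D @ (A @ w - e * gamma) >= e)    # -> [ True  True  True  True]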
Preliminary Attempt at the (Linear) Support Vector Machine:
Robust Linear Programming
Solve the following mathematical program (a solver sketch follows below):
    min_{w,γ,y}  e'y
    s.t.  D(Aw - eγ) + y ≥ e,  y ≥ 0
where y = nonnegative error (slack) vector.
Note: y = 0 if the convex hulls of A+ and A- do not intersect.
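A hedged sketch of this robust LP using scipy.optimize.linprog as a stand-in solver (the talk itself uses MINOS for its LP runs; the function name and the stacking of variables z = [w, γ, y] are implementation choices, not from the slides):

    import numpy as np
    from scipy.optimize import linprog

    def robust_lp(A, d):
        """Solve min e'y  s.t.  D(Aw - e*gamma) + y >= e, y >= 0.
        Decision variables are stacked as z = [w (n), gamma (1), y (m)]."""
        m, n = A.shape
        D = np.diag(d.astype(float))
        e = np.ones(m)
        # Objective: zero weight on w and gamma, e'y on the slacks.
        c = np.concatenate([np.zeros(n + 1), e])
        # D(Aw - e*gamma) + y >= e, rewritten as -DAw + De*gamma - y <= -e.
        A_ub = np.hstack([-D @ A, (D @ e)[:, None], -np.eye(m)])
        b_ub = -e
        bounds = [(None, None)] * (n + 1) + [(0, None)] * m  # only y is sign-constrained
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)  # assumes the LP solves
        w, gamma, y = res.x[:n], res.x[n], res.x[n + 1:]
        return w, gamma, y

For linearly separable data such as the toy points above, the optimal slack y comes out zero.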
The (Linear) Support Vector Machine
Maximize the Margin Between Separating Planes
[Figure: bounding planes x'w = γ + 1 and x'w = γ - 1 with normal w, separating A+ and A-; the margin between them, 2/||w||_2, is maximized.]
The (Linear) Support Vector Machine Formulation
Solve the following mathematical program:
    min_{w,γ,y}  νe'y + ||w||
    s.t.  D(Aw - eγ) + y ≥ e,  y ≥ 0
where y = nonnegative error (slack) vector.
Note: y = 0 if the convex hulls of A+ and A- do not intersect.
GSVM: Generalized Support Vector Machine
Linear Programming Formulation
Linear support vector machine (linear separating surface x'w = γ):
    min_{y≥0,w,γ}  νe'y + ||w||_1
    s.t.  D(Aw - eγ) + y ≥ e
By "duality", set w = A'Du (linear separating surface x'A'Du = γ):
    min_{y≥0,u,γ}  νe'y + ||A'Du||_1
    s.t.  D(AA'Du - eγ) + y ≥ e
Nonlinear support vector machine: replace AA' by a nonlinear kernel K(A,A'). Nonlinear separating surface K(x',A')Du = γ (see the LP sketch below):
    min_{y≥0,u,γ}  νe'y + ||A'Du||_1
    s.t.  D(K(A,A')Du - eγ) + y ≥ e
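A sketch of this nonlinear GSVM LP, again with scipy's linprog standing in for MINOS. The 1-norm objective ||A'Du||_1 is linearized with an auxiliary variable s ≥ |A'Du|; the function name and the variable stacking are assumptions for illustration:

    import numpy as np
    from scipy.optimize import linprog

    def gsvm_lp(K, A, d, nu):
        """min nu*e'y + ||A'Du||_1  s.t.  D(K Du - e*gamma) + y >= e, y >= 0.
        Variables stacked as z = [u (m), gamma (1), y (m), s (n)]."""
        m, n = A.shape
        D = np.diag(d.astype(float))
        e, I = np.ones(m), np.eye(m)
        c = np.concatenate([np.zeros(m + 1), nu * e, np.ones(n)])
        # D(K D u - e*gamma) + y >= e  ->  -DKDu + De*gamma - y <= -e
        row1 = np.hstack([-D @ K @ D, (D @ e)[:, None], -I, np.zeros((m, n))])
        # A'Du - s <= 0 and -A'Du - s <= 0 enforce s >= |A'Du| componentwise.
        ADu = A.T @ D
        row2 = np.hstack([ADu, np.zeros((n, 1)), np.zeros((n, m)), -np.eye(n)])
        row3 = np.hstack([-ADu, np.zeros((n, 1)), np.zeros((n, m)), -np.eye(n)])
        A_ub = np.vstack([row1, row2, row3])
        b_ub = np.concatenate([-e, np.zeros(2 * n)])
        bounds = [(None, None)] * (m + 1) + [(0, None)] * (m + n)
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
        u, gamma = res.x[:m], res.x[m]
        return u, gamma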
Examples of Kernels
(a ∈ R^m, μ ∈ R, and d an integer)
– Polynomial kernel: (AA' + μaa')^d_•
  • The subscript • denotes componentwise exponentiation, as in MATLAB.
– Radial basis kernel: exp(-μ||A_i - A_j||^2), i, j = 1, ..., m
– Neural network kernel: (AA' + μaa')_•*
  • The subscript •* denotes the step function [·]_* : R → {0,1}, applied componentwise.
(Numpy versions of all three are sketched below.)
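The three kernels translate directly into numpy. A minimal sketch; the two-argument form (allowing cross kernels K(A, B') against test rows B, with b the corresponding piece of a) and the convention step(t) = 1 for t ≥ 0 are assumptions:

    import numpy as np

    def polynomial_kernel(A, B, mu, a, b, degree):
        # (AB' + mu*ab')^degree with componentwise (MATLAB .^) exponentiation;
        # B = A and b = a recovers K(A, A').
        return (A @ B.T + mu * np.outer(a, b)) ** degree

    def radial_basis_kernel(A, B, mu):
        # exp(-mu*||A_i - B_j||^2) for every pair of rows of A and B
        sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-mu * sq_dists)

    def neural_network_kernel(A, B, mu, a, b):
        # Componentwise step function of (AB' + mu*ab'), mapping R -> {0, 1};
        # step(t) = 1 for t >= 0 is an assumed convention.
        return ((A @ B.T + mu * np.outer(a, b)) >= 0).astype(float)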
A Nonlinear Kernel Application
Checkerboard Training Set: 1000 Points in R^2
Separate 486 Asterisks from 514 Dots
Previous Work
Polynomial kernel: K(A,A') = ((A/100)(A'/100) - 0.5)^6_•
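Assuming the kernel reads as reconstructed above, a one-line numpy version (the function name is hypothetical):

    import numpy as np

    def checkerboard_kernel(A, B):
        # K(A, B') = ((A/100)(B'/100) - 0.5)^6, exponentiation componentwise
        return ((A / 100.0) @ (B / 100.0).T - 0.5) ** 6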
Large Margin Classifier (SOR) Reformulation in (w, γ) Space
[Figure: bounding planes x'w - 1·γ = +1 and x'w - 1·γ = -1 separating A+ and A-; the margin 2/||(w, γ)||_2 between them is maximized.]
(SOR) Linear Support Vector Machine
Quadratic Programming Formulation
Solve the following mathematical program:
    min_{w,γ,y}  νe'y + (1/2)(w'w + γ^2)
    s.t.  D(Aw - eγ) + y ≥ e,  y ≥ 0
The quadratic term here maximizes the distance between the bounding planes x'w - γ = +1 and x'w - γ = -1 in the space R^{n+1} of (w, γ).
Introducing a Nonlinear Kernel
The Wolfe dual of the SOR linear SVM is:
    max_u  -(1/2)u'DAA'Du - (1/2)u'Dee'Du + e'u
    s.t.  0 ≤ u ≤ νe
    (w = A'Du, γ = -e'Du)
– Linear separating surface: x'A'Du = γ
Substitute a kernel for the AA' term:
    max_u  -(1/2)u'DK(A,A')Du - (1/2)u'Dee'Du + e'u
    s.t.  0 ≤ u ≤ νe
    (γ = -e'Du)
– Nonlinear separating surface: K(x',A')Du = γ (evaluated in the sketch below)
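Once the dual solution u and γ = -e'Du are in hand, a new point x is classified by the sign of K(x', A')Du - γ. A minimal sketch; the helper name classify is an assumption, and kernel is any two-argument kernel from above, e.g. lambda A, B: radial_basis_kernel(A, B, mu):

    import numpy as np

    def classify(x, A, d, u, gamma, kernel):
        # Nonlinear separating surface: K(x', A')Du = gamma.
        # A new point x lands in class +1 or -1 by the sign of K(x', A')Du - gamma.
        Kx = kernel(x[None, :], A)            # 1 x m row vector K(x', A')
        return np.sign(Kx @ (d * u) - gamma)  # d*u applies D = diag(d) to u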
SVM Optimality Conditions
Define
    H = D[K(A,A') + ee']D,   H = L + E + L'
where E is the diagonal of H and L its strictly lower triangular part. Then the dual SVM becomes much simpler:
    min_u  (1/2)u'Hu - e'u
    s.t.  0 ≤ u ≤ νe
Gradient projection necessary and sufficient optimality condition (see the sketch below):
    u = (u - ωE^{-1}(Hu - e))_#,   ω > 0
where (·)_# denotes projection of u onto the region 0 ≤ u ≤ νe.
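A sketch of these definitions in numpy: building H, the box projection (·)_#, and the residual ||u - (u - ωE^{-1}(Hu - e))_#||, which is zero exactly at a dual solution. Function names are assumptions:

    import numpy as np

    def build_H(K, d):
        # H = D[K(A,A') + ee']D with D = diag(d)
        D = np.diag(d.astype(float))
        return D @ (K + np.ones(K.shape)) @ D

    def project(u, nu):
        # (.)_#: projection onto the box 0 <= u <= nu*e
        return np.clip(u, 0.0, nu)

    def optimality_residual(u, H, nu, omega=1.0):
        # Zero exactly when u = (u - omega*E^{-1}(Hu - e))_#, i.e. at a solution.
        e = np.ones(len(u))
        E_inv = 1.0 / np.diag(H)  # E is the diagonal of H, assumed positive
        return np.linalg.norm(u - project(u - omega * E_inv * (H @ u - e), nu))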
SOR Algorithm & Convergence Above optimality conditions lead to the SOR algorithm:
Choose ! 2 (0;2): Start with any u0 2 Rm:Having ui; compute ui+1 as follows :
ui+1= (ui à ! E à 1(Hui à e+ L(ui+1à ui)))#
u = (uà ! E à 1(Huà e))#
– Remember, optimality conditions are expressed as:
SOR Linear Convergence [Luo-Tseng 1993]:– The iterates of the SOR algorithm converge R-linearly to a solution of the dual problem– The objective function values converge Q-linearly
to
f uig
f f (ui)gf (uö)
uö
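A compact sketch of the SOR iteration. Updating the components of u in place, one at a time, means each update already sees the new values of earlier components, which is precisely the L(u^{i+1} - u^i) term; ω = 1.3 and the stopping rule are illustrative choices, not from the slides:

    import numpy as np

    def sor(H, nu, omega=1.3, tol=1e-6, max_iter=10000):
        # Solve min (1/2)u'Hu - e'u  s.t. 0 <= u <= nu*e  by successive
        # overrelaxation; assumes the diagonal of H is positive.
        m = H.shape[0]
        u = np.zeros(m)
        for _ in range(max_iter):
            u_old = u.copy()
            for j in range(m):
                # Gradient component j at the current, partially updated u:
                # earlier components already hold their new values, which
                # supplies the L(u^{i+1} - u^i) term of the update formula.
                g = H[j] @ u - 1.0
                u[j] = min(max(u[j] - omega * g / H[j, j], 0.0), nu)
            if np.linalg.norm(u - u_old) < tol:
                break
        return u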
Numerical Testing
Comparison of linear & nonlinear kernels using:
– Linear programming
– Quadratic programming (SOR formulation)
Data sets:
– UCI Liver Disorders: 345 points in R^6
– Bell Labs Checkerboard: 1000 points in R^2
– Gaussian Synthetic: 1000 points in R^32
– SCDS Synthetic: 1 million points in R^32
– Massive Synthetic: 10 million points in R^32
Machines:
– Cluster of 4 Sun Enterprise E6000 machines, each consisting of 16 UltraSPARC II 250 MHz processors with 2 GB RAM
– Total: 64 processors, 8 GB RAM
Comparison of Linear & Nonlinear SVMsLinear Programming Generated
Dataset Tenfold Linear Kernel Nonlinear KernelCorrectness (Type)
Liver Training 71.21% 78.33%Disorders Testing 68.70% 73.37%
(Quadratic)Checkerboard Training 51.12% 99.97%
Testing 48.60% 98.50%(6th Degree Polynomial)
Checkerboard Training 51.12% 98.43%Testing 48.60% 97.70%
(Sinusoidal)
Nonlinear kernels yield better training and testing set correctness
SOR Results

Dataset    Correctness   Linear kernel   Quadratic kernel
Gaussian   Training      83.20%          97.68%
           Testing       81.70%          93.41%

Examples of training on massive data:
– 1 million point dataset generated by the SCDS generator:
  • Trained completely in 9.7 hours
  • Tuning set reached 99.7% of final accuracy in 0.3 hours
– 10 million point randomly generated dataset:
  • Tuning set reached 95% of final accuracy in 14.3 hours
  • Under 10,000 iterations
[Figure: comparison of linear and nonlinear kernels.]
Conclusions
Linear programming and successive overrelaxation can generate complex nonlinear separating surfaces via GSVMs.
Nonlinear separating surfaces improve generalization over linear ones.
SOR can handle very large problems not (easily) solvable by other methods.
SOR scales up with virtually no changes.
Future directions:
– Parallel SOR for very large problems not resident in memory
– Massive multicategory discrimination via SOR
– Support vector regression

Questions?