Page 1:

Nonlinear Data Discrimination via Generalized Support Vector Machines

David R. Musicant and Olvi L. Mangasarian
University of Wisconsin - Madison
www.cs.wisc.edu/~musicant

Page 2:

Outline

The linear support vector machine (SVM)
– Linear kernel
The generalized support vector machine (GSVM)
– Nonlinear, possibly indefinite kernel
Linear programming formulation of the GSVM
– Solved with MINOS
Quadratic programming formulation of the GSVM
– Solved with successive overrelaxation (SOR)
Numerical comparisons
Conclusions

Page 3:

The Discrimination Problem
The Fundamental 2-Category Linearly Separable Case

[Figure: the point sets A+ and A- separated by the plane x'w = γ, with bounding planes x'w = γ + 1 and x'w = γ - 1; w is the normal to the planes.]

Page 4:

The Discrimination Problem
The Fundamental 2-Category Linearly Separable Case

Given m points in the n-dimensional space R^n, represented by an m x n matrix A. Membership of each point A_i in the class +1 or -1 is specified by an m x m diagonal matrix D with ±1 along its diagonal.

Separate by two bounding planes x'w = γ ± 1 such that:

    A_i w ≥ γ + 1, for D_ii = +1
    A_i w ≤ γ - 1, for D_ii = -1

More succinctly:

    D(Aw - eγ) ≥ e

where e is a vector of ones.
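The condition above can be checked numerically. A minimal sketch with hypothetical toy data (the matrices A, D and the plane (w, γ) below are illustrative, not from the talk):

```python
import numpy as np

# Two small, linearly separable point sets in R^2 (hypothetical data).
A = np.array([[2.0, 2.0],    # class +1 points
              [3.0, 1.0],
              [-2.0, -1.0],  # class -1 points
              [-1.0, -3.0]])
D = np.diag([1.0, 1.0, -1.0, -1.0])  # class labels on the diagonal
e = np.ones(4)

# A candidate separating plane x'w = gamma with bounding planes x'w = gamma +/- 1.
w = np.array([1.0, 1.0])
gamma = 0.0

# The separability condition D(Aw - e*gamma) >= e from the slide.
print(np.all(D @ (A @ w - e * gamma) >= e))
```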

Page 5:

Preliminary Attempt at the (Linear) Support Vector Machine:
Robust Linear Programming

Solve the following mathematical program:

    min_{w,γ,y}  e'y
    s.t.  D(Aw - eγ) + y ≥ e
          y ≥ 0

where y = nonnegative error (slack) vector.

Note: y = 0 if the convex hulls of A+ and A- do not intersect.
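This program is an ordinary linear program. A minimal sketch of solving it with `scipy.optimize.linprog` on hypothetical toy data (the data and variable layout are illustrative assumptions, not from the talk):

```python
import numpy as np
from scipy.optimize import linprog

# Robust LP: min e'y  s.t.  D(Aw - e*gamma) + y >= e,  y >= 0,
# with the variable vector x = (w, gamma, y).
A = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
D = np.diag([1.0, 1.0, -1.0, -1.0])
m, n = A.shape
e = np.ones(m)

# Objective: zero cost on (w, gamma), unit cost on the slack y.
c = np.concatenate([np.zeros(n + 1), np.ones(m)])

# Rewrite D(Aw - e*gamma) + y >= e as A_ub @ x <= b_ub for linprog.
A_ub = np.hstack([-D @ A, (D @ e).reshape(-1, 1), -np.eye(m)])
b_ub = -e

# w and gamma are free; y is nonnegative.
bounds = [(None, None)] * (n + 1) + [(0, None)] * m
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
w, gamma, y = res.x[:n], res.x[n], res.x[n + 1:]
print("optimal e'y =", res.fun)  # 0 when the convex hulls do not intersect
```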

Page 6:

The (Linear) Support Vector Machine
Maximize Margin Between Separating Planes

[Figure: A+ and A- separated by the bounding planes x'w = γ + 1 and x'w = γ - 1; the distance (margin) between the planes is 2/||w||_2.]

Page 7:

The (Linear) Support Vector Machine Formulation

Solve the following mathematical program:

    min_{w,γ,y}  νe'y + ||w||
    s.t.  D(Aw - eγ) + y ≥ e
          y ≥ 0

where y = nonnegative error (slack) vector; the norm term ||w|| maximizes the margin between the bounding planes.

Note: y = 0 if the convex hulls of A+ and A- do not intersect.

Page 8:

GSVM: Generalized Support Vector Machine
Linear Programming Formulation

Linear support vector machine (linear separating surface x'w = γ):

    min_{y≥0,w,γ}  νe'y + ||w||_1
    s.t.  D(Aw - eγ) + y ≥ e

By "duality", set w = A'Du (linear separating surface x'A'Du = γ):

    min_{y≥0,u,γ}  νe'y + ||A'Du||_1
    s.t.  D(AA'Du - eγ) + y ≥ e

Nonlinear support vector machine: replace AA' by a nonlinear kernel K(A,A'). Nonlinear separating surface K(x',A')Du = γ:

    min_{y≥0,u,γ}  νe'y + ||A'Du||_1
    s.t.  D(K(A,A')Du - eγ) + y ≥ e
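The 1-norm objective can be linearized with an auxiliary variable s satisfying -s ≤ A'Du ≤ s, which keeps the whole program a linear program. A sketch on hypothetical toy data, using the linear kernel K = AA' (the data, ν value, and variable layout are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import linprog

# GSVM linear program after the substitution w = A'Du:
#   min   nu*e'y + 1's
#   s.t.  D(K(A,A')Du - e*gamma) + y >= e,  -s <= A'Du <= s,  y >= 0
A = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
D = np.diag([1.0, 1.0, -1.0, -1.0])
m, n = A.shape
e = np.ones(m)
nu = 1.0
K = A @ A.T                     # linear kernel; swap in any K(A,A') here

# Variable ordering: x = (u, gamma, y, s), with sizes (m, 1, m, n).
c = np.concatenate([np.zeros(m + 1), nu * np.ones(m), np.ones(n)])

P = A.T @ D                     # so A'Du = P @ u
A_ub = np.vstack([
    # -D K D u + D e gamma - y <= -e
    np.hstack([-D @ K @ D, (D @ e).reshape(-1, 1), -np.eye(m), np.zeros((m, n))]),
    #  P u - s <= 0  and  -P u - s <= 0   (i.e. |A'Du| <= s componentwise)
    np.hstack([P, np.zeros((n, 1 + m)), -np.eye(n)]),
    np.hstack([-P, np.zeros((n, 1 + m)), -np.eye(n)]),
])
b_ub = np.concatenate([-e, np.zeros(2 * n)])
bounds = [(None, None)] * (m + 1) + [(0, None)] * (m + n)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
u, gamma = res.x[:m], res.x[m]
w = A.T @ D @ u                 # recover the linear surface x'w = gamma
```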

Page 9:

Examples of Kernels

– Polynomial kernel: (AA' + μaa')^{d•}
  where (·)^{d•} denotes componentwise exponentiation, as in MATLAB
– Radial basis kernel: exp(-μ||A_i - A_j||^2), i, j = 1, ..., m
– Neural network kernel: (AA' + μaa')_{•∗}
  where (·)_{•∗} denotes the step function ∗ : R → {0,1} applied componentwise

(Here a ∈ R^m, μ ∈ R, and d is an integer.)
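The three kernels above can be sketched directly in numpy; the functions below are illustrative (in particular, how the step function treats exactly 0 is an assumption):

```python
import numpy as np

# Sketches of the three kernels on the slide for a data matrix A (m x n);
# a (a vector in R^m), mu, and d are the kernel parameters defined above.

def polynomial_kernel(A, a, mu, d):
    # (AA' + mu*aa')^(d.): the exponent d is applied componentwise
    return (A @ A.T + mu * np.outer(a, a)) ** d

def radial_basis_kernel(A, mu):
    # K_ij = exp(-mu * ||A_i - A_j||^2)
    sq = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-mu * sq)

def neural_network_kernel(A, a, mu):
    # Step function R -> {0,1} applied componentwise to AA' + mu*aa'
    # (mapping 0 to 0 here is an assumption).
    return (A @ A.T + mu * np.outer(a, a) > 0).astype(float)
```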

Page 10:

A Nonlinear Kernel Application
Checkerboard Training Set: 1000 Points in R^2

Separate 486 Asterisks from 514 Dots
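A training set of this kind can be sketched as follows; the 4x4 grid over [0,4)^2 and the random seed are assumptions for illustration, not the exact construction used in the talk:

```python
import numpy as np

# Generate ~1000 random points in R^2 labeled by a checkerboard pattern
# (hypothetical grid; the original Bell Labs set differs in its details).
rng = np.random.default_rng(0)
X = rng.uniform(0, 4, size=(1000, 2))
labels = np.where((np.floor(X[:, 0]) + np.floor(X[:, 1])) % 2 == 0, 1, -1)
```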

Page 11:

Previous Work

Page 12:

Polynomial Kernel: K(A,A') = ((100^{-1}A)(100^{-1}A') - 0.5)^{6•}

Page 13:

Large Margin Classifier (SOR) Reformulation in (w, γ) Space

[Figure: A+ and A- separated by the bounding planes x'w - 1·γ = +1 and x'w - 1·γ = -1; the distance between the planes in (w, γ) space is 2/||(w, γ)||_2.]

Page 14:

(SOR) Linear Support Vector Machine
Quadratic Programming Formulation

Solve the following mathematical program:

    min_{w,γ,y}  νe'y + ½(w'w + γ^2)
    s.t.  D(Aw - eγ) + y ≥ e
          y ≥ 0

The quadratic term here maximizes the distance between the bounding planes x'w - γ = +1 and x'w - γ = -1 in the space R^{n+1} of (w, γ).

Page 15:

Introducing a Nonlinear Kernel

The Wolfe dual for the SOR linear SVM is:

    max_u  -½u'DAA'Du - ½u'Dee'Du + e'u
    s.t.  0 ≤ u ≤ νe

    (w = A'Du, γ = -e'Du)

– Linear separating surface: x'A'Du = γ

Substitute a kernel for the AA' term:

    max_u  -½u'DK(A,A')Du - ½u'Dee'Du + e'u
    s.t.  0 ≤ u ≤ νe

    (γ = -e'Du)

– Nonlinear separating surface: K(x',A')Du = γ
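Once the dual solution is known, classification only needs kernel evaluations against the training data. A sketch (the data, the dual vector u, and γ below are hypothetical stand-ins for a solved dual, not values from the talk):

```python
import numpy as np

# Classify a point x with the nonlinear separating surface K(x',A')Du = gamma.

def classify(x, A, D, u, gamma, kernel):
    # kernel(x, A) returns the row vector K(x', A')
    return np.sign(kernel(x, A) @ D @ u - gamma)

# With the linear kernel K(x',A') = x'A', this reduces to x'A'Du = gamma.
linear = lambda x, A: A @ x     # entry j is x'A_j'
A = np.array([[2.0, 2.0], [-2.0, -1.0]])
D = np.diag([1.0, -1.0])
u = np.array([0.25, 0.25])      # hypothetical dual solution
gamma = 0.0
print(classify(np.array([1.0, 1.0]), A, D, u, gamma, linear))  # +1 side
```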

Page 16:

SVM Optimality Conditions

Define H = D[K(A,A') + ee']D and split H = L + E + L', where E is the diagonal of H and L is its strictly lower triangular part. Then the dual SVM becomes much simpler:

    min_u  ½u'Hu - e'u
    s.t.  0 ≤ u ≤ νe

Gradient projection gives a necessary & sufficient optimality condition:

    u = (u - ωE^{-1}(Hu - e))#,  for any ω > 0

where (·)# denotes projection onto the region 0 ≤ u ≤ νe.

Page 17:

SOR Algorithm & Convergence

The above optimality condition leads to the SOR algorithm:

    Choose ω ∈ (0, 2). Start with any u^0 ∈ R^m. Having u^i, compute u^{i+1} as follows:

    u^{i+1} = (u^i - ωE^{-1}(Hu^i - e + L(u^{i+1} - u^i)))#

– Remember, the optimality condition is expressed as: u = (u - ωE^{-1}(Hu - e))#

SOR linear convergence [Luo-Tseng 1993]:
– The iterates {u^i} of the SOR algorithm converge R-linearly to a solution ū of the dual problem
– The objective function values {f(u^i)} converge Q-linearly to f(ū)
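The iteration above can be sketched in a few lines of numpy. Because L is strictly lower triangular, the L(u^{i+1} - u^i) term simply means sweeping through the components in order and using the already-updated ones (Gauss-Seidel style); the two-point toy problem is a hypothetical illustration:

```python
import numpy as np

# SOR for min 0.5*u'Hu - e'u, 0 <= u <= nu*e. H must have a positive
# diagonal (true for H = D[K + ee']D with, e.g., a radial basis kernel).

def sor(H, nu, omega=1.0, iters=200):
    m = H.shape[0]
    u = np.zeros(m)
    E = np.diag(H)                       # the diagonal part of H = L + E + L'
    for _ in range(iters):
        for j in range(m):
            # H[j] @ u already uses the updated components u_0..u_{j-1}
            u[j] = u[j] - (omega / E[j]) * (H[j] @ u - 1.0)
            u[j] = min(max(u[j], 0.0), nu)   # the (.)# projection
    return u

# Tiny toy problem (hypothetical data): two points, one per class.
A = np.array([[1.0, 0.0], [-1.0, 0.0]])
D = np.diag([1.0, -1.0])
K = A @ A.T                              # linear kernel
H = D @ (K + np.ones((2, 2))) @ D
u = sor(H, nu=10.0)
```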

Page 18:

Numerical Testing

Comparison of linear & nonlinear kernels using:
– Linear programming formulation
– Quadratic programming (SOR) formulation

Data sets:
– UCI Liver Disorders: 345 points in R^6
– Bell Labs Checkerboard: 1000 points in R^2
– Gaussian Synthetic: 1000 points in R^32
– SCDS Synthetic: 1 million points in R^32
– Massive Synthetic: 10 million points in R^32

Machines:
– Cluster of 4 Sun Enterprise E6000 machines, each consisting of 16 UltraSPARC II 250 MHz processors with 2 GB RAM
  • Total: 64 processors, 8 GB RAM

Page 19:

Comparison of Linear & Nonlinear SVMs
Linear Programming Generated

Dataset          Tenfold Correctness   Linear Kernel   Nonlinear Kernel (Type)
Liver Disorders  Training              71.21%          78.33%
                 Testing               68.70%          73.37%   (Quadratic)
Checkerboard     Training              51.12%          99.97%
                 Testing               48.60%          98.50%   (6th Degree Polynomial)
Checkerboard     Training              51.12%          98.43%
                 Testing              48.60%          97.70%   (Sinusoidal)

Nonlinear kernels yield better training and testing set correctness.

Page 20:

SOR Results

Comparison of linear and nonlinear kernels:

Dataset   Correctness   Linear Kernel   Quadratic Kernel
Gaussian  Training      83.20%          97.68%
          Testing       81.70%          93.41%

Examples of training on massive data:
– 1 million point dataset generated by the SCDS generator:
  • Trained completely in 9.7 hours
  • Tuning set reached 99.7% of final accuracy in 0.3 hours
– 10 million point randomly generated dataset:
  • Tuning set reached 95% of final accuracy in 14.3 hours
  • Under 10,000 iterations

Page 21:

Conclusions

Linear programming and successive overrelaxation can generate complex nonlinear separating surfaces via GSVMs
Nonlinear separating surfaces improve generalization over linear ones
SOR can handle very large problems not (easily) solvable by other methods
SOR scales up with virtually no changes

Future directions:
– Parallel SOR for very large problems not resident in memory
– Massive multicategory discrimination via SOR
– Support vector regression

Page 22:

Questions?