Support Vector Machines
Overview
• Support Vector Machines for Classification
  – Linear Discrimination
  – Nonlinear Discrimination
• SVM Mathematically
• Extensions
• Application in Drug Design
• Data Classification
• Kernel Functions
Definition
'Support Vector Machine is a system for efficiently training linear learning machines in kernel-induced feature spaces, while respecting the insights of generalisation theory and exploiting optimisation theory.'

An excellent classification system based on a mathematical technique called convex optimization.

References:
– An Introduction to Support Vector Machines (and other kernel-based learning methods)
  • N. Cristianini and J. Shawe-Taylor, Cambridge University Press, 2000. ISBN: 0-521-78019-5
– Kernel Methods for Pattern Analysis
  • J. Shawe-Taylor and N. Cristianini, Cambridge University Press, 2004
Dot product (aka inner product)
a · b = ‖a‖ ‖b‖ cos θ

where θ is the angle between vectors a and b.
Recall: if the vectors are orthogonal, the dot product is zero.
The scalar (dot) product is, in some sense, a measure of similarity.
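As a quick illustration, a minimal NumPy sketch (the vectors and values here are invented for the example):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # parallel to a
c = np.array([3.0, 0.0, -1.0])  # orthogonal to a

# Dot product: large when vectors point the same way, zero when orthogonal.
print(np.dot(a, b))  # 28.0
print(np.dot(a, c))  # 0.0

# Normalizing by the lengths gives cos(theta), a scale-free similarity in [-1, 1].
cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_theta)  # 1.0 for parallel vectors
```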
Decision function for binary classification
f(x) ∈ ℝ

f(xi) ≥ 0 ⇒ yi = 1
f(xi) < 0 ⇒ yi = −1
Support vector machines
• SVMs pick the best separating hyperplane according to some criterion
  – e.g. maximum margin
• Training process is an optimisation
• Training set is effectively reduced to a relatively small number of support vectors
• Key words: optimization, kernels
Feature spaces
• We may separate data by mapping to a higher-dimensional feature space
  – The feature space may even have an infinite number of dimensions!
• We need not explicitly construct the new feature space
  – The "kernel trick"
  – Keeps the same computation time
• Key observation: the optimization involves only dot products
Kernels
• What are kernels?
• We may use kernel functions to implicitly map to a new feature space
• Kernel functions: K(x1, x2) ∈ ℝ
• In SVMs, kernels preserve the inner product in the new feature space.
Examples of kernels
• Linear: x · z
• Polynomial (non-linear): (x · z)^P
• Gaussian (non-linear): exp(−‖x − z‖² / σ²)
Perceptron as linear separator
• Binary classification can be viewed as the task of separating classes in feature space:
wTx + b = 0
wTx + b < 0
wTx + b > 0
f(x) = sign(wTx + b)
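In code, this decision rule is one line. A small NumPy sketch (w, b, and the sample points are made-up values):

```python
import numpy as np

def predict(w, b, x):
    """Classify x by which side of the hyperplane w.x + b = 0 it falls on."""
    return np.sign(np.dot(w, x) + b)

w = np.array([2.0, -1.0])  # normal vector of the hyperplane (invented)
b = -1.0                   # offset (invented)

print(predict(w, b, np.array([3.0, 1.0])))  #  1.0: w.x + b = 4 > 0
print(predict(w, b, np.array([0.0, 2.0])))  # -1.0: w.x + b = -3 < 0
```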
Which of the linear separators is optimal?
[Figure: Tumor vs. Normal data points with several candidate separating lines.]
Best linear separator?
Best linear separator?
Best linear separator? Not so…
Best linear separator? Possibly…
Find the closest points in the convex hulls (convex polygons in 2D) of the two classes; call them c and d.
Plane (3D) / line (2D) bisecting the closest points:

wTx + b = 0, with w = d − c
Classification margin
• Distance from an example x to the separator is r = (wTx + b) / ‖w‖
• Data closest to the hyperplane are support vectors.
• Margin ρ of the separator is the width of separation between classes.
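Numerically, a small NumPy sketch (w, b, and the point are invented):

```python
import numpy as np

w = np.array([3.0, 4.0])  # hyperplane normal (invented)
b = -5.0
x = np.array([2.0, 3.0])

# Signed distance of x from the hyperplane w.x + b = 0.
r = (np.dot(w, x) + b) / np.linalg.norm(w)
print(r)  # (6 + 12 - 5) / 5 = 2.6
```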
Maximum margin classification
• Maximize the margin (good according to intuition and theory).
• Implies that only support vectors are important; other training examples can be ignored.
Statistical learning theory
• Generalization (prediction) error is bounded by the misclassification error plus a measure of function complexity.
• Maximizing margins minimizes complexity.
• “Eliminates” overfitting.
• The solution depends only on the support vectors, not on the number of attributes.
Margins and complexity
A skinny margin is more flexible, thus more complex.
Margins and complexity
A fat margin is less complex.
Linear SVM
• Assuming all data is at distance at least 1 from the hyperplane, the following two constraints follow for a training set {(xi, yi)}:

wTxi + b ≥ 1 if yi = 1
wTxi + b ≤ −1 if yi = −1

• For support vectors, the inequality becomes an equality; then, since each example's distance from the hyperplane is r = (wTxi + b) / ‖w‖, the margin is:

ρ = 2 / ‖w‖
Linear SVM
We can formulate the problem:

Find w and b such that
ρ = 2 / ‖w‖ is maximized and for all {(xi, yi)}:
wTxi + b ≥ 1 if yi = 1; wTxi + b ≤ −1 if yi = −1

into a quadratic optimization formulation:

Find w and b such that
Φ(w) = ½ wTw is minimized and for all {(xi, yi)}: yi(wTxi + b) ≥ 1
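In practice this is rarely solved by hand. A minimal sketch using scikit-learn (an assumption on my part, not the slides' tooling), where a very large C approximates the hard-margin problem above:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (invented for illustration).
X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y = np.array([-1, -1, 1, 1])

# A very large C makes slack essentially forbidden, approximating hard margin.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

print(clf.coef_, clf.intercept_)  # w and b of the separating hyperplane
print(clf.support_vectors_)       # the points that define the margin
```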
Solving the optimization problem
• Need to optimize a quadratic function subject to linear constraints.
• Quadratic optimization problems are a well-known class of mathematical programming problems, and many (rather intricate) algorithms exist for solving them.
• The solution involves constructing a dual problem where a Lagrange multiplier αi is associated with every constraint in the primal problem:

Find w and b such that
Φ(w) = ½ wTw is minimized and for all {(xi, yi)}: yi(wTxi + b) ≥ 1

Find α1…αN such that
Q(α) = Σαi − ½ ΣΣ αiαjyiyj xiTxj is maximized and
(1) Σαiyi = 0
(2) αi ≥ 0 for all αi
The quadratic optimization problem solution
• The solution has the form:

w = Σαiyixi
b = yk − wTxk for any xk such that αk ≠ 0

• Each non-zero αi indicates that the corresponding xi is a support vector.
• Then the classifying function will have the form:

f(x) = ΣαiyixiTx + b

• Notice that it relies on an inner product between the test point x and the support vectors xi – we will return to this later!
• Also keep in mind that solving the optimization problem involved computing the inner products xiTxj between all training points!
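The same quantities can be read off a fitted model. A sketch (in scikit-learn, dual_coef_ stores the products αi·yi for the support vectors; data invented):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])  # invented data
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

# w = sum_i alpha_i * y_i * x_i, summed over support vectors only.
w = clf.dual_coef_ @ clf.support_vectors_
print(w, clf.coef_)    # the two should agree
print(clf.intercept_)  # b
```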
Soft margin classification
• What if the training set is not linearly separable?
• Slack variables ξi can be added to allow misclassification of difficult or noisy examples.
Soft margin classification
• The old formulation:

Find w and b such that
Φ(w) = ½ wTw is minimized and for all {(xi, yi)}: yi(wTxi + b) ≥ 1

• The new formulation incorporating slack variables:

Find w and b such that
Φ(w) = ½ wTw + CΣξi is minimized and for all {(xi, yi)}: yi(wTxi + b) ≥ 1 − ξi and ξi ≥ 0 for all i

• Parameter C can be viewed as a way to control overfitting.
Soft margin classification – solution
• The dual problem for soft margin classification:

Find α1…αN such that
Q(α) = Σαi − ½ ΣΣ αiαjyiyj xiTxj is maximized and
(1) Σαiyi = 0
(2) 0 ≤ αi ≤ C for all αi

• Neither the slack variables ξi nor their Lagrange multipliers appear in the dual problem!
• Again, the xi with non-zero αi will be support vectors.
• Solution to the dual problem is:

w = Σαiyixi
b = yk(1 − ξk) − wTxk where k = argmax αk
f(x) = ΣαiyixiTx + b

But neither w nor b are needed explicitly for classification!
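The dual is a standard quadratic program, so a generic QP solver can handle it. A minimal sketch using cvxopt (the data, the C value, and the choice of cvxopt are all assumptions for illustration; cvxopt minimizes ½αᵀPα + qᵀα subject to Gα ≤ h, Aα = b):

```python
import numpy as np
from cvxopt import matrix, solvers

X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 3.0], [4.0, 3.0]])  # invented data
y = np.array([-1.0, -1.0, 1.0, 1.0])
n, C = len(y), 10.0

# Maximizing Q(alpha) equals minimizing (1/2) a'Pa + q'a with
# P_ij = y_i y_j x_i.x_j and q = -1.
P = matrix(np.outer(y, y) * (X @ X.T))
q = matrix(-np.ones(n))
# Box constraints 0 <= alpha_i <= C, written as G alpha <= h.
G = matrix(np.vstack([-np.eye(n), np.eye(n)]))
h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
# Equality constraint: sum_i alpha_i y_i = 0.
A = matrix(y.reshape(1, -1))
b = matrix(0.0)

solvers.options["show_progress"] = False
alpha = np.ravel(solvers.qp(P, q, G, h, A, b)["x"])
print(alpha.round(4))  # non-zero entries mark the support vectors
```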
Theoretical justification for maximum margins
• Vapnik has proved the following:

The class of optimal linear separators has VC dimension h bounded from above as

h ≤ min(⌈D² / ρ²⌉, m0) + 1

where ρ is the margin, D is the diameter of the smallest sphere that can enclose all of the training examples, and m0 is the dimensionality.

• Intuitively, this implies that regardless of the dimensionality m0 we can minimize the VC dimension by maximizing the margin ρ.
• Thus, the complexity of the classifier is kept small regardless of dimensionality.
Linear SVM: Overview
• The classifier is a separating hyperplane.
• Most “important” training points are support vectors; they define the hyperplane.
• Quadratic optimization algorithms can identify which training points xi are support vectors with non-zero Lagrange multipliers αi.
• Both in the dual formulation of the problem and in the solution, training points appear only inside inner products:

Find α1…αN such that
Q(α) = Σαi − ½ ΣΣ αiαjyiyj xiTxj is maximized and
(1) Σαiyi = 0
(2) 0 ≤ αi ≤ C for all αi

f(x) = ΣαiyixiTx + b
Non-linear SVMs
• Datasets that are linearly separable with some noise work out great:
• But what are we going to do if the dataset is just too hard?
• How about… mapping data to a higher-dimensional space:
[Figure: 1-D data points on the x axis that are not linearly separable become separable after mapping x → (x, x²).]
Nonlinear classification
x = [a, b]
x · w = w1a + w2b

↓

θ(x) = [a, b, ab, a², b²]
θ(x) · w = w1a + w2b + w3ab + w4a² + w5b²
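As a sketch, the explicit map is a one-liner (NumPy; the names and weights are mine, invented for illustration):

```python
import numpy as np

def theta(x):
    """Map [a, b] to the degree-2 feature space [a, b, ab, a^2, b^2]."""
    a, b = x
    return np.array([a, b, a * b, a**2, b**2])

x = np.array([2.0, 3.0])
w = np.array([1.0, 1.0, 0.5, 0.1, 0.1])  # weights in feature space (invented)

# A linear function of theta(x) is a quadratic function of x.
print(theta(x))             # [2. 3. 6. 4. 9.]
print(np.dot(w, theta(x)))  # 2 + 3 + 3 + 0.4 + 0.9 = 9.3
```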
Non-linear SVMs: Feature spaces
• General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:
Φ: x→ φ(x)
The “Kernel Trick”
• The linear classifier relies on the inner product between vectors: K(xi, xj) = xiTxj
• If every datapoint is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the inner product becomes:

K(xi, xj) = φ(xi)Tφ(xj)

• A kernel function is a function that corresponds to an inner product in some feature space.
• Example:
  – 2-dimensional vectors x = [x1 x2]; let K(xi, xj) = (1 + xiTxj)²
  – Need to show that K(xi, xj) = φ(xi)Tφ(xj):

K(xi, xj) = (1 + xiTxj)²
          = 1 + xi1²xj1² + 2xi1xj1xi2xj2 + xi2²xj2² + 2xi1xj1 + 2xi2xj2
          = [1  xi1²  √2·xi1xi2  xi2²  √2·xi1  √2·xi2]ᵀ [1  xj1²  √2·xj1xj2  xj2²  √2·xj1  √2·xj2]
          = φ(xi)Tφ(xj),  where φ(x) = [1  x1²  √2·x1x2  x2²  √2·x1  √2·x2]
Positive definite matrices

• A square matrix A is positive definite if xTAx > 0 for all nonzero column vectors x.
• It is negative definite if xTAx < 0 for all nonzero x.
• It is positive semi-definite if xTAx ≥ 0 for all x.
• And negative semi-definite if xTAx ≤ 0 for all x.
What functions are kernels?
• For some functions K(xi, xj), checking that K(xi, xj) = φ(xi)Tφ(xj) can be cumbersome.
• Mercer's theorem:

Every positive semi-definite symmetric function is a kernel

• Positive semi-definite symmetric functions correspond to a positive semi-definite symmetric Gram matrix:

        | K(x1,x1)  K(x1,x2)  K(x1,x3)  …  K(x1,xN) |
        | K(x2,x1)  K(x2,x2)  K(x2,x3)  …  K(x2,xN) |
    K = |    …         …         …      …     …     |
        | K(xN,x1)  K(xN,x2)  K(xN,x3)  …  K(xN,xN) |
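A quick way to sanity-check a candidate kernel on data is to build the Gram matrix and inspect its eigenvalues (a sketch using the Gaussian kernel from the next slide; the sample points are invented):

```python
import numpy as np

def rbf(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma**2))

X = np.random.default_rng(0).normal(size=(5, 2))  # invented sample points

# Build the Gram matrix K_ij = K(x_i, x_j).
K = np.array([[rbf(xi, xj) for xj in X] for xi in X])

# Symmetric PSD <=> all eigenvalues >= 0 (up to numerical noise).
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # True for a valid kernel
```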
Examples of kernel functions
• Linear: K(xi, xj) = xiTxj
• Polynomial of power p: K(xi, xj) = (1 + xiTxj)^p
• Gaussian (radial-basis function network): K(xi, xj) = exp(−‖xi − xj‖² / 2σ²)
• Two-layer perceptron: K(xi, xj) = tanh(β0·xiTxj + β1)
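The same four kernels in NumPy (a sketch; the parameter values are placeholders):

```python
import numpy as np

def linear(x, z):
    return np.dot(x, z)

def polynomial(x, z, p=2):
    return (1.0 + np.dot(x, z)) ** p

def gaussian(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma**2))

def two_layer(x, z, beta0=1.0, beta1=-1.0):
    # Note: tanh kernels satisfy Mercer's condition only for some beta0, beta1.
    return np.tanh(beta0 * np.dot(x, z) + beta1)

x, z = np.array([1.0, 2.0]), np.array([2.0, 0.0])
print(linear(x, z), polynomial(x, z), gaussian(x, z), two_layer(x, z))
```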
Non-linear SVMs - optimization
• Dual problem formulation:

Find α1…αN such that
Q(α) = Σαi − ½ ΣΣ αiαjyiyj K(xi, xj) is maximized and
(1) Σαiyi = 0
(2) αi ≥ 0 for all αi

• The solution is:

f(x) = Σαiyi K(xi, x) + b

• Optimization techniques for finding the αi's remain the same!
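End-to-end, swapping in a kernel is a one-argument change. A scikit-learn sketch (the data and parameter values are invented):

```python
import numpy as np
from sklearn.svm import SVC

# XOR-style data: not linearly separable in the original space.
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="rbf", gamma=1.0, C=10.0)  # Gaussian kernel
clf.fit(X, y)
print(clf.predict(X))  # the RBF kernel separates the XOR pattern: [-1 -1 1 1]
```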
SVM applications
• SVMs were originally proposed by Boser, Guyon and Vapnik in 1992 and gained increasing popularity in the late 1990s.
• SVMs are currently among the best performers for a number of classification tasks ranging from text to genomic data.
• SVM techniques have been extended to a number of tasks such as regression [Vapnik et al. ’97], principal component analysis [Schölkopf et al. ’99], etc.
• Most popular optimization algorithms for SVMs are SMO [Platt ’99] and SVMlight [Joachims’ 99], both use decomposition to hill-climb over a subset of αi’s at a time.
• Tuning SVMs remains a black art: selecting a specific kernel and its parameters is usually done by trial and error.
SVM extensions
• Regression
• Variable Selection
• Boosting
• Density Estimation
• Unsupervised Learning
  – Novelty/Outlier Detection
  – Feature Detection
  – Clustering
Example in drug design
• Goal: predict the bio-reactivity of molecules to decrease drug development time.
• Target: predict the logarithm of the inhibition concentration for site "A" on the Cholecystokinin (CCK) molecule.
• Construct a quantitative structure–activity relationship (QSAR) model.
LCCKA problem
• Training data – 66 molecules
• 323 original attributes are wavelet coefficients of TAE Descriptors.
• A subset of 39 attributes was selected by a linear 1-norm SVM (with no kernels); a sketch of this selection step follows the list.
• For details, see the DDASSL project link at http://www.rpi.edu/~bennek
• Testing set results reported.
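A rough analogue of that selection step (an assumption: scikit-learn's L1-penalized linear SVM, not the authors' actual code; the data and labels are invented, only the shapes come from the slide):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(66, 323))    # 66 molecules x 323 descriptors (shapes from the slide)
y = rng.choice([-1, 1], size=66)  # invented labels for illustration

# The L1 (1-norm) penalty drives most coefficients to exactly zero,
# so the surviving non-zero coefficients select the attributes.
svm = LinearSVC(penalty="l1", dual=False, C=0.1).fit(X, y)
selected = np.flatnonzero(svm.coef_)
print(len(selected), "attributes kept")
```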
LCCK prediction
[Figure: LCCKA test-set estimates: scatter plot of predicted value against true value, both on a [0, 1] scale.]
Many other applications

• Speech Recognition
• Database Marketing
• Quark Flavors in High Energy Physics
• Dynamic Object Recognition
• Knock Detection in Engines
• Protein Sequence Problem
• Text Categorization
• Breast Cancer Diagnosis
• Cancer Tissue classification
• Translation initiation site recognition in DNA
• Protein fold recognition
One of the best!!

• Generalization theory and practice meet
• General methodology for many types of problems
• Same Program + New Kernel = New Method
• No problems with local minima
• Few model parameters; selects capacity
• Robust optimization methods
• Successful applications
Open questions

• Will SVMs beat my best hand-tuned method Z for problem X?
• Do SVMs scale to massive datasets?
• How to choose C and the kernel?
• What is the effect of attribute scaling?
• How to handle categorical variables?
• How to incorporate domain knowledge?
• How to interpret results?
Support Vector Machine Resources
• SVM Application List: http://www.clopinet.com/isabelle/Projects/SVM/applist.html
• Kernel Machines: http://www.kernel-machines.org/
• Pattern Classification and Machine Learning: http://clopinet.com/isabelle/#projects
• R, a language and environment for statistical computing and graphics: http://www.r-project.org/
• Kernel Methods for Pattern Analysis (2004): http://www.kernel-methods.net/
• An Introduction to Support Vector Machines (and other kernel-based learning methods): http://www.support-vector.net/
• Kristin P. Bennett's web page: http://www.rpi.edu/~bennek
• Isabelle Guyon's home page: http://clopinet.com/isabelle