Support Vector Machines
Chapter 12
Outline
• Separating Hyperplanes – Separable Case
• Extension to Non-separable Case – SVM
• Nonlinear SVM
• SVM as a Penalization Method
• SVM Regression
Separating Hyperplanes
• The separating hyperplane with maximum margin is likely to perform well on test data.
• Here the separating hyperplane is almost identical to the more standard linear logistic regression boundary
Distance to Hyperplanes
• For any point x_0 in L, β^T x_0 = −β_0.
• The signed distance of any point x to L is given by
  β*^T (x − x_0) = (1/‖β‖) (β^T x + β_0).
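As a quick numeric check of this formula (function name and data are illustrative, not from the slides):

```python
import numpy as np

def signed_distance(x, beta, beta0):
    """Signed distance from point x to the hyperplane L = {x : beta^T x + beta0 = 0}."""
    return (x @ beta + beta0) / np.linalg.norm(beta)

beta = np.array([3.0, 4.0])   # normal vector, norm 5
beta0 = -5.0
print(signed_distance(np.array([3.0, 4.0]), beta, beta0))  # (9 + 16 - 5)/5 = 4.0
print(signed_distance(np.array([0.6, 0.8]), beta, beta0))  # point on L: 0.0
```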
Maximum Margin Classifier
max C over β, β_0 with ‖β‖ = 1,
subject to y_i (x_i^T β + β_0) ≥ C, i = 1, . . . , N,
where y_i ∈ {−1, 1}, x_i ∈ R^p, and C is the distance from the boundary.
Vapnik (1995)
• Found by quadratic programming (Convex optimization)
• Solution determined by just a few points (support vectors) near the boundary
• Sparse solution in dual space
• Decision function
Ĝ(x) = sign[x^T β̂ + β̂_0],
where β̂ = Σ_{i=1}^N α̂_i y_i x_i;
α̂_i is non-zero only for those observations for which the constraints are exactly met (support vectors).
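A minimal sketch of this dual-form decision function, with hand-set α values standing in for the quadratic-programming solution (data and values are mine, for illustration only):

```python
import numpy as np

# Toy separable data; in practice the alphas come from the QP.
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.25, 0.0, 0.25, 0.0])  # nonzero only on the two support vectors

beta = (alpha * y) @ X        # beta_hat = sum_i alpha_i y_i x_i
beta0 = y[0] - X[0] @ beta    # chosen so a support vector sits exactly on the margin

def G(x):
    """Decision function G(x) = sign(x^T beta_hat + beta0_hat)."""
    return np.sign(x @ beta + beta0)

print(G(np.array([3.0, 3.0])))    # 1.0
print(G(np.array([-3.0, -3.0])))  # -1.0
```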
Non-separable Case: Standard Support Vector Classifier
max C over β, β_0 with ‖β‖ = 1,
subject to y_i (x_i^T β + β_0) ≥ C(1 − ξ_i), i = 1, . . . , N,
ξ_i ≥ 0, Σ_i ξ_i ≤ B.
This problem is computationally equivalent to

min over β, β_0 of (1/2)‖β‖² + γ Σ_{i=1}^N ξ_i
s.t. ξ_i ≥ 0, y_i (x_i^T β + β_0) ≥ 1 − ξ_i,
where γ is a tuning parameter.
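A small numeric check of this formulation: at the optimum each slack is ξ_i = [1 − y_i(x_i^T β + β_0)]_+, so the penalty term is a hinge loss (toy data, illustrative only):

```python
import numpy as np

X = np.array([[2.0, 0.0], [0.5, 0.0], [-1.5, 0.0]])
y = np.array([1.0, 1.0, -1.0])
beta, beta0 = np.array([1.0, 0.0]), 0.0

margins = y * (X @ beta + beta0)
xi = np.maximum(0.0, 1.0 - margins)       # optimal slack for each point
gamma = 1.0
objective = 0.5 * beta @ beta + gamma * xi.sum()
print(xi)         # only the point inside the margin gets positive slack
print(objective)
```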
Computation of SVM
• Lagrange (primal) function:
• Minimize w.r.t. β, β_0 and ξ_i: set the derivatives to zero:
Computation of SVM
• Lagrange (dual) function:
with constraints: 0 ≤ α_i ≤ γ and Σ_{i=1}^N α_i y_i = 0
• Karush-Kuhn-Tucker conditions:
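The dual objective L_D = Σ α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j can be evaluated directly; a sketch with made-up data satisfying the constraints:

```python
import numpy as np

def dual_objective(alpha, X, y):
    """Wolfe dual: L_D = sum(alpha) - 1/2 sum_ij alpha_i alpha_j y_i y_j <x_i, x_j>."""
    Yx = y[:, None] * X
    Q = Yx @ Yx.T                 # Q_ij = y_i y_j x_i^T x_j
    return alpha.sum() - 0.5 * alpha @ Q @ alpha

X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])      # satisfies sum_i alpha_i y_i = 0
print(dual_objective(alpha, X, y))
```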
Computation of SVM
• The final solution:
Example-Mixture Data
SVMs for large p, small n
• Suppose we have 5000 genes (p) and 50 samples (n), divided into two classes
  - Many more variables than observations
  - Infinitely many separating hyperplanes in this feature space
• SVMs provide the unique maximal-margin separating hyperplane
• Prediction performance can be good, but typically no better than simpler methods such as nearest centroids
• All genes get a weight, so no gene selection
  - May overfit the data
Non-Linear SVM via Kernels
• Note that the SVM classifier involves only inner products <x_i, x_j> = x_i^T x_j
• Enlarge the feature space via a basis transformation φ
• Replacing x_i^T x_j by an appropriate kernel K(x_i, x_j) = <φ(x_i), φ(x_j)> provides a non-linear SVM in the input space
Popular kernels
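The kernels usually listed here (d-th degree polynomial, radial basis, and sigmoid/neural-network) can be sketched as follows (parameter names are mine):

```python
import numpy as np

def poly_kernel(x, z, d=2):
    """d-th degree polynomial: K(x, z) = (1 + <x, z>)^d."""
    return (1.0 + x @ z) ** d

def rbf_kernel(x, z, gamma=1.0):
    """Radial basis: K(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def nn_kernel(x, z, kappa1=1.0, kappa2=0.0):
    """Neural network (sigmoid): K(x, z) = tanh(kappa1 * <x, z> + kappa2)."""
    return np.tanh(kappa1 * (x @ z) + kappa2)

x, z = np.array([1.0, 0.0]), np.array([1.0, 1.0])
print(poly_kernel(x, z))   # (1 + 1)^2 = 4.0
print(rbf_kernel(x, x))    # 1.0 at zero distance
```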
Kernel SVM-Mixture Data
Radial Basis Kernel
• The radial basis function has an infinite-dimensional basis: φ(x) is infinite-dimensional.
• The smaller the bandwidth c, the more wiggly the boundary, and hence the less overlap.
• The kernel trick does not allow the coefficients of all basis elements to be freely determined.
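A small illustration of the bandwidth effect: as c shrinks, the kernel (Gram) matrix approaches the identity, which is what produces wiggly boundaries (helper name and data are mine):

```python
import numpy as np

X = np.array([[0.0], [1.0], [2.0]])

def rbf_gram(X, c):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / c); c plays the bandwidth role."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / c)

wide = rbf_gram(X, c=100.0)    # large bandwidth: distinct points look similar
narrow = rbf_gram(X, c=0.01)   # small bandwidth: Gram matrix is nearly the identity
print(wide[0, 1], narrow[0, 1])
```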
SVM as penalization method
• For f(x) = h(x)^T β + β_0, consider the problem
  min over β, β_0 of Σ_{i=1}^N [1 − y_i f(x_i)]_+ + λ‖β‖²
• Margin loss + penalty
• For λ = 1/(2γ), the penalized setup leads to the same solution as the SVM.
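A minimal subgradient-descent sketch of this penalized formulation, on toy separable data of my own (not the book's examples):

```python
import numpy as np

# Tiny separable data set (made up for illustration)
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

beta = np.zeros(2)
beta0 = 0.0
lam, lr = 0.1, 0.05
for _ in range(500):
    margins = y * (X @ beta + beta0)
    active = margins < 1                     # points inside the margin
    # subgradient of sum_i [1 - y_i f(x_i)]_+ + lam * ||beta||^2
    g_beta = -(y[active, None] * X[active]).sum(0) + 2 * lam * beta
    g_beta0 = -y[active].sum()
    beta -= lr * g_beta
    beta0 -= lr * g_beta0

print(np.sign(X @ beta + beta0))  # all four points classified correctly
```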
SVM and other Loss Functions
Population Minimizers for Two Loss Functions
Logistic Regression with Loglikelihood Loss
Curse of Dimensionality in SVM
SVM Loss-Functions for Regression
Example
Example
Example
Generalized Discriminant Analysis
Chapter 12
Outline
• Flexible Discriminant Analysis(FDA)• Penalized Discriminant Analysis• Mixture Discriminant Analysis (MDA)
Linear Discriminant Analysis
• Let P(G = k) = π_k and P(X = x | G = k) = f_k(x)
• Then
• Assume f_k(x) ~ N(μ_k, Σ_k) and Σ_1 = Σ_2 = … = Σ_K = Σ
• Then we can show the decision rule is (HW#1):
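A sketch of the resulting linear discriminant rule, δ_k(x) = x^T Σ^{-1} μ_k − (1/2) μ_k^T Σ^{-1} μ_k + log π_k, with made-up parameters:

```python
import numpy as np

def lda_discriminants(x, mus, Sigma, priors):
    """delta_k(x) = x^T Sigma^-1 mu_k - 1/2 mu_k^T Sigma^-1 mu_k + log pi_k."""
    Sinv = np.linalg.inv(Sigma)
    return np.array([x @ Sinv @ m - 0.5 * m @ Sinv @ m + np.log(p)
                     for m, p in zip(mus, priors)])

mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigma = np.eye(2)
priors = [0.5, 0.5]
x = np.array([1.8, 1.9])
print(int(np.argmax(lda_discriminants(x, mus, Sigma, priors))))  # nearest class: 1
```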
LDA (cont)
• Plug in the estimates:
LDA Example
Data Prediction Vector
In this three-class problem, the middle class is classified correctly.
LDA Example
11 classes and X ∈ R^10
Virtues and Failings of LDA
• Simple prototype (centroid) classifier
  - New observation classified into the class with the closest centroid
  - But uses Mahalanobis distance
• Simple decision rules based on linear decision boundaries
• Estimated Bayes classifier for Gaussian class conditionals
  - But data might not be Gaussian
• Provides a low-dimensional view of the data
  - Using discriminant functions as coordinates
• Often produces the best classification results
  - Simplicity and low variance in estimation
Virtues and Failings of LDA
• LDA may fail in a number of situations
  - Often linear boundaries fail to separate classes
  - With large N, may estimate a quadratic decision boundary
  - May want to model even more irregular (non-linear) boundaries
  - A single prototype per class may not be sufficient
  - May have many (correlated) predictors for digitized analog signals
  - Too many parameters estimated with high variance, and the performance suffers
  - May want to regularize
Generalization of LDA
• Flexible Discriminant Analysis (FDA)
  - LDA in an enlarged space of predictors via basis expansions
• Penalized Discriminant Analysis (PDA)
  - With too many predictors, do not want to expand the set: already too large
  - Fit an LDA model with coefficients penalized to be smooth/coherent in the spatial domain
  - With a large number of predictors, could use penalized FDA
• Mixture Discriminant Analysis (MDA)
  - Model each class by a mixture of two or more Gaussians with different centroids, all sharing the same covariance matrix
  - Allows for subspace reduction
Flexible Discriminant Analysis
• Linear regression on derived responses for the K-class problem
  - Define indicator variables for each class (K in all)
  - Use the indicator functions as responses to create a set of Y variables
  - Obtain optimal score functions η_l(x) = x^T β_l as discriminant (canonical) variables
  - Classify into the nearest class centroid, using the Mahalanobis distance of a test point x to the kth class centroid
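A simplified sketch of the indicator-regression step above (it omits the optimal-scoring refinement; all names and data are illustrative):

```python
import numpy as np

# Step 1 of FDA: linear regression on class-indicator responses.
X = np.array([[0.0, 0.1], [0.2, -0.1], [2.0, 2.1], [2.1, 1.9]])
g = np.array([0, 0, 1, 1])            # class labels
K = 2
Y = np.eye(K)[g]                      # N x K indicator response matrix

X1 = np.hstack([np.ones((len(X), 1)), X])      # add intercept column
B, *_ = np.linalg.lstsq(X1, Y, rcond=None)     # fitted scores eta(x) = [1, x] B
eta = X1 @ B

# Classify a new point to the class whose mean fitted score is nearest.
centroids = np.array([eta[g == k].mean(0) for k in range(K)])
x_new = np.hstack([1.0, [1.9, 2.0]])
print(int(np.argmin(np.linalg.norm(x_new @ B - centroids, axis=1))))  # 1
```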
Flexible Discriminant Analysis
δ_J(x, k) = Σ_{l=1}^{K−1} w_l (η̂_l(x) − η̄_l^k)² + D(x),
where η̄_l^k = Ave{η̂_l(x_i)} over i in class k,
D(x) does not depend on k,
r_l² is the residual mean square of the l-th optimal score,
and w_l = 1/[r_l²(1 − r_l²)].
• We can replace linear regression fits by non-parametric fits, e.g., generalized additive fits, spline functions, MARS models etc., with a regularizer or kernel regression and possibly reduced rank regression
Computation of FDA
1. Multivariate nonparametric regression
2. Optimal scores
3. Update the model from step 1 using the optimal scores
Example of FDA
N(0, I)
N(0, 9I/4)
Bayes decision boundary
FDA using degree-two polynomial regression
Speech Recognition Data
• K = 11 classes of spoken vowel sounds
• p = 10 predictors extracted from digitized speech
• FDA uses adaptive additive-spline regression (BRUTO in S-PLUS)
• FDA/MARS uses Multivariate Adaptive Regression Splines; degree = 2 allows pairwise products
LDA Vs. FDA/BRUTO
Penalized Discriminant Analysis
• PDA is a regularized discriminant analysis on an enlarged set of predictors via a basis expansion
Penalized Discriminant Analysis
• PDA enlarges the predictors to h(x)
• Use LDA in the enlarged space, with the penalized Mahalanobis distance
  D(x, μ) = (h(x) − h(μ))^T (W + λΩ)^{-1} (h(x) − h(μ)),
  with W as the within-class covariance
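A direct sketch of this distance, assuming the usual form with (W + λΩ)^{-1} inside the quadratic (function name, Ω, and data are mine):

```python
import numpy as np

def penalized_mahalanobis(hx, hmu, W, Omega, lam):
    """D(x, mu) = (h(x) - h(mu))^T (W + lam * Omega)^{-1} (h(x) - h(mu))."""
    d = hx - hmu
    return d @ np.linalg.solve(W + lam * Omega, d)

W = np.eye(2)
Omega = np.diag([0.0, 10.0])          # penalize only the second basis direction
hx, hmu = np.array([1.0, 1.0]), np.array([0.0, 0.0])
print(penalized_mahalanobis(hx, hmu, W, Omega, lam=0.0))  # plain Mahalanobis: 2.0
print(penalized_mahalanobis(hx, hmu, W, Omega, lam=1.0))  # penalized direction shrinks
```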
Penalized Discriminant Analysis
• Decompose the classification subspace using the penalized metric:
max w.r.t.
USPS Digit Recognition
Digit Recognition-LDA vs. PDA
PDA Canonical Variates
Mixture Discriminant Analysis
• The class-conditional densities are modeled as mixtures of Gaussians
  - Possibly a different number of components in each class
  - Estimate the centroids and mixing proportions in each subclass by maximizing the joint likelihood P(G, X)
  - EM algorithm for MLE
• Could use penalized estimation
FDA and MDA
Wave Form Signal with Additive Gaussian Noise
Class 1: X_j = U h_1(j) + (1 − U) h_2(j) + ε_j
Class 2: X_j = U h_1(j) + (1 − U) h_3(j) + ε_j
Class 3: X_j = U h_2(j) + (1 − U) h_3(j) + ε_j

where j = 1, …, 21, U ~ Unif(0, 1), ε_j are additive Gaussian noise, and

h_1(j) = max(6 − |j − 11|, 0)
h_2(j) = h_1(j − 4)
h_3(j) = h_1(j + 4)
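The generating model above can be sketched directly (function names are mine; the noise is taken as standard Gaussian, consistent with the slide title):

```python
import numpy as np

def h1(j): return np.maximum(6 - np.abs(j - 11), 0)
def h2(j): return h1(j - 4)
def h3(j): return h1(j + 4)

def waveform_sample(cls, rng):
    """One 21-dimensional waveform observation for class 1, 2, or 3."""
    j = np.arange(1, 22)
    u = rng.uniform()
    eps = rng.normal(size=21)                 # additive Gaussian noise
    pairs = {1: (h1, h2), 2: (h1, h3), 3: (h2, h3)}
    ha, hb = pairs[cls]
    return u * ha(j) + (1 - u) * hb(j) + eps

rng = np.random.default_rng(0)
x = waveform_sample(1, rng)
print(x.shape)  # (21,)
```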
Waveform Data Results