1
Support Vector Machine (SVM)
MUMT611 Beinan Li Music Tech @ McGill 2005-3-17
2
Content
Related problems in pattern classification
VC theory and VC dimension
Overview of SVM
Application example
3
Related problems in pattern classification
Small sample-size effect (peaking effect): an overly small or overly large sample size leads to large error.
Inaccurate estimates of probability densities from finite sample sets in a typical Bayesian classifier.
Training data vs. test data; empirical risk vs. structural risk.
Misclassifying yet-to-be-seen data.
Picture taken from (Ridder 1997)
4
Related problems in pattern classification
Avoid solving a more general problem as an intermediate step (Vapnik 1995): classify without estimating probability densities.
ANN: depends on prior knowledge; uses the empirical-risk method (ERM).
Problem of generalization (over-fitting is hard to control).
Goal: find a theoretical analysis of the validity of ERM.
5
VC theory and VC dimension
VC dimension (a measure of classifier complexity): the maximum size of a sample set that a decision function can shatter, i.e. separate under every possible labelling.
Finite VC dimension implies the consistency of ERM.
Theoretical basis of ANN and SVM.
Linear decision function: VC dim = number of parameters.
Non-linear decision function: VC dim need not equal the number of parameters (it can be smaller or larger).
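For reference, the bound from VC theory (Vapnik 1995) that makes "finite VC dimension implies validity of ERM" precise can be sketched in LaTeX as follows, with h the VC dimension, l the sample size, and the bound holding with probability at least 1 - eta:

\[
R(\alpha) \;\le\; R_{\mathrm{emp}}(\alpha) + \sqrt{\frac{h\left(\ln\frac{2l}{h} + 1\right) - \ln\frac{\eta}{4}}{l}}
\]

The second term grows with the VC dimension h, which is why controlling h (see the SRM slide that follows) trades empirical risk against over-fitting.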
6
Overview of SVM
Structural-risk minimization (SRM): minimize the empirical risk while controlling the VC dimension; the result is a tradeoff between empirical risk and over-fitting.
Focus on the explicit problem of classification: find the optimal hyperplane dividing two classes.
Supervised learning.
7
Margin and Support Vectors (SV)
The 2-category, linearly separable case: small vs. large margin.
Picture taken from (Ferguson 2004)
8
Margin and Support Vectors (SV)
For 2-category, linearly separable data: find the hyperplane with the largest margin to the sample vectors of both classes.
D(x) = w^T x + b; in augmented (homogeneous) notation, D(x') = a^T x'.
Multiple solutions exist in weight space: find the weight that gives the largest margin.
The margin is determined by the SVs.
Picture taken from (Ferguson 2004)
9
Mathematical detail
Constraints: y_i D(x_i) >= 1, with y_i in {+1, -1}.
Margin: y_i D(x_i') / ||a|| >= margin, where D(x_i') = a^T x_i'; maximizing the margin means minimizing ||a||.
Quadratic programming: find the minimum ||a|| under the linear constraints.
The weights are expressed through Lagrange multipliers.
The problem can be simplified to a dot-product-based dual problem (Kuhn-Tucker conditions).
The parameters of the decision function, and its complexity, are completely determined by the SVs.
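As a worked reference for this slide (written with w and b rather than the augmented vector a), the hard-margin problem and its dot-product-based dual are:

\[
\min_{w,b}\ \tfrac{1}{2}\lVert w \rVert^2 \quad \text{s.t.}\quad y_i\,(w^{\top} x_i + b) \ge 1,\quad i = 1,\dots,l
\]
\[
\max_{\alpha}\ \sum_{i=1}^{l} \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, x_i^{\top} x_j \quad \text{s.t.}\quad \alpha_i \ge 0,\ \sum_{i=1}^{l} \alpha_i y_i = 0
\]

The samples with alpha_i > 0 are the SVs; only they appear in the resulting decision function D(x) = sum_i alpha_i y_i x_i^T x + b.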
10
Linearly non-separable case
Example: the XOR problem. Sample set size: 4; VC dim of a linear classifier in 2-D: 3 (so the 4 XOR points cannot be separated by a line).
Pictures taken from (Ferguson 2004)
11
Linearly non-separable case
Map the data to a higher-dimensional space in which they become linearly separable, and make a linear decision there.
Example: XOR.
6-D space:
phi(x) = (1, sqrt(2) x1, sqrt(2) x2, sqrt(2) x1 x2, x1^2, x2^2)
In the original coordinates the decision function reduces to D(x) = x1 x2.
Picture taken from (Ferguson 2004)
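A minimal Python sketch (not from the original slides, assuming NumPy is available) that checks the 6-D map above on the four XOR points; the labels follow the sign convention of D(x) = x1 x2:

import numpy as np

# The four XOR-style points; class +1 when the two inputs agree,
# which matches the slide's decision function D(x) = x1 * x2.
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
y = np.array([1, -1, -1, 1])

def phi(x):
    # Explicit 6-D polynomial feature map from the slide.
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, s * x1 * x2, x1 ** 2, x2 ** 2])

# The dot product in the 6-D space equals the polynomial kernel (1 + x.y)^2,
# so the mapping never has to be computed explicitly.
for a in X:
    for b in X:
        assert np.isclose(phi(a) @ phi(b), (1.0 + a @ b) ** 2)

# In the original coordinates the decision collapses to D(x) = x1 * x2,
# separating the two classes that no single line in 2-D can separate.
D = X[:, 0] * X[:, 1]
print(np.all(np.sign(D) == y))  # True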
12
Linearly non-separable case
The separating hyperplane in both the original and the higher-D space (projected back onto the 2-D plane). All 4 samples are SVs.
Picture taken from (Ferguson 2004; Luo 2002)
13
Linearly non-separable case
Modify the quadratic programming:
"Soft margin" with slack variables: y_i D(x_i) >= 1 - eps_i.
Penalty function; upper bound C on the Lagrange multipliers.
Kernel function: the dot product in the higher-D space expressed in terms of the original inputs, yielding a symmetric, positive semi-definite matrix that satisfies Mercer's theorem. Standard candidates: polynomial and Gaussian radial-basis functions. The choice of kernel depends on prior knowledge.
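A short Python sketch (assuming NumPy; parameter names such as degree and sigma are illustrative, not from the slides) of the two standard kernels and of the Mercer condition on the resulting matrix:

import numpy as np

def poly_kernel(a, b, degree=2):
    # Polynomial kernel: a dot product in an implicit higher-D space.
    return (1.0 + a @ b) ** degree

def rbf_kernel(a, b, sigma=1.0):
    # Gaussian radial-basis-function kernel.
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def gram_matrix(X, kernel):
    # Kernel (Gram) matrix over a sample set X.
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

X = np.random.randn(5, 3)
K = gram_matrix(X, rbf_kernel)
# A valid Mercer kernel yields a symmetric, positive semi-definite matrix.
assert np.allclose(K, K.T)
assert np.all(np.linalg.eigvalsh(K) >= -1e-9)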
14
Implementation with large sample set
Large computation: one Lagrange multiplier per sample.
Reductionist approach: divide the sample set into batches (subsets) and accumulate the SV set batch by batch. Assumption: samples that are not SVs locally are not global SVs either.
Several algorithms that vary in the size of the subsets:
Vapnik: chunking algorithm
Osuna: Osuna's algorithm
Platt: SMO algorithm (only 2 samples per step; the most popular)
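A hedged Python sketch of the reductionist/batch idea only (not Vapnik's chunking, Osuna's algorithm, or Platt's SMO themselves); scikit-learn's SVC is assumed available and stands in for the inner QP solver, and the function name chunked_svm is hypothetical:

import numpy as np
from sklearn.svm import SVC  # assumed available; acts as the per-batch QP solver

def chunked_svm(X, y, batch_size=200):
    # Train batch by batch, carrying forward only the support vectors found
    # so far (the slide's assumption: local non-SVs are not global SVs either).
    sv_X = np.empty((0, X.shape[1]))
    sv_y = np.empty((0,), dtype=y.dtype)
    clf = None
    for start in range(0, len(X), batch_size):
        batch_X = np.vstack([sv_X, X[start:start + batch_size]])
        batch_y = np.concatenate([sv_y, y[start:start + batch_size]])
        clf = SVC(kernel="rbf", C=1.0).fit(batch_X, batch_y)
        sv_X, sv_y = batch_X[clf.support_], batch_y[clf.support_]
    return clf

This is a sketch only: every batch must contain samples from both classes, and real chunking additionally re-checks the full sample set for violated constraints before stopping.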
15
From 2-category to multi-category SVM
No uniform way to extend. Common ways:
One-against-all
One-against-one, combined e.g. in a binary tree
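A Python sketch of the one-against-one scheme, assuming scikit-learn is available; it combines the C(C-1)/2 pairwise classifiers by majority voting rather than the binary tree mentioned above:

import numpy as np
from itertools import combinations
from sklearn.svm import SVC  # assumed available

def train_one_vs_one(X, y):
    # One binary SVM per pair of classes: C * (C - 1) / 2 models in total.
    models = {}
    for a, b in combinations(np.unique(y), 2):
        mask = np.isin(y, [a, b])
        models[(a, b)] = SVC(kernel="rbf").fit(X[mask], y[mask])
    return models

def predict_one_vs_one(models, x):
    # Every pairwise classifier votes; the class with the most votes wins.
    votes = {}
    for clf in models.values():
        label = clf.predict(x.reshape(1, -1))[0]
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)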
16
Advantages of SVM
Strong mathematical basis.
The decision function and its complexity are completely determined by the SVs.
Training time does not depend on the dimensionality of the (implicit) feature space, only on the fixed input space.
Good generalization; insensitive to the "curse of dimensionality".
Versatile choices of kernel function; feature-less classification (the kernel acts as a data-similarity measure).
17
Drawbacks of SVM
Still relies on prior knowledge: the choices of C, kernel, and penalty function.
C: how far the decision function adapts itself to avoid any training error.
Kernel: how much freedom (dimensionality) the SVM has to adapt itself.
Overlapping classes: the reductionist approach may discard promising SVs at any batch step, so classification can be limited by the size of the problem.
No uniform way to extend from 2-category to multi-category.
"Still not an ideal optimally-generalizing classifier."
18
Applications
Vapnik et al. at AT&T: handwritten digit recognition; error rate lower than that of an ANN.
Speech recognition, face recognition, MIR.
SVM-light: open-source C library.
19
Application example of SVM in MIR
Li & Guo 2000 (Microsoft Research China). Problem: classify 16 classes of sounds in a database of 409 sounds.
Features: concatenated perceptual and cepstral feature vectors.
Similarity measure: distance from the (SV-based) boundary.
Evaluation: average retrieval accuracy; average retrieval efficiency.
20
Application example of SVM in MIR
Details in applying SVM:
Both linear and kernel-based approaches are tested.
Kernel: exponential radial basis function; C = 200.
The corpus is randomly partitioned into training/test sets; one-against-one (binary tree) is used for the multi-category task.
Compared with other approaches:
NFL (Nearest Feature Line): an unsupervised approach.
Muscle Fish: normalized Euclidean metric with nearest-neighbor classification.
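A hedged Python sketch of how such a setup could be approximated today (not the authors' code): scikit-learn is assumed available, its SVC has no built-in exponential-RBF kernel, so one is passed as a callable, and the random features merely stand in for the perceptual + cepstral vectors:

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

def erbf_kernel(A, B, sigma=1.0):
    # Exponential RBF: exp(-||a - b|| / (2 sigma^2)), i.e. the Euclidean
    # distance itself rather than its square as in the Gaussian RBF.
    dists = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
    return np.exp(-dists / (2.0 * sigma ** 2))

# Stand-in corpus: 409 vectors, 16 classes (placeholder for real features).
X = np.random.randn(409, 20)
y = np.random.randint(0, 16, size=409)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# C = 200 as reported on this slide; SVC handles the multi-class case
# with a one-against-one decomposition internally.
clf = SVC(kernel=erbf_kernel, C=200).fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))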
21
Application example of SVM in MIR
Comparison of average error rates: different feature sets over different approaches.
Picture taken from (Li & Guo 2000)
22
Application example of SVM in MIR
Complexity comparison
SVM: training: yes; classification complexity: C * (C - 1) / 2 pairwise classifiers (binary tree), e.g. 16 * 15 / 2 = 120 for the 16-class task; inner-class complexity: number of SVs.
NFL: training: no; classification complexity: linear in the number of classes; inner-class complexity: Nc * (Nc - 1) / 2.
23
Future work
Speed up the quadratic programming.
Choice of kernel functions.
Find opportunities for solving so-far-intractable problems.
Generalize the non-linear kernel approach to methods other than SVM, e.g. kernel PCA (principal component analysis).
24
Bibliography
Summary: http://www.music.mcgill.ca/~damonli/MUMT611/week9_summary.pdf
HTML bibliography: http://www.music.mcgill.ca/~damonli/MUMT611/week9_bib.htm