VC Learning Theory and Support Vector Machines
Achim Hoffmann
School of Computer Science and Engineering
University of New South Wales
Sydney, Australia
October 2002
UNSW Slides © Achim Hoffmann, 2002
The Vapnik-Chervonenkis dimension
• Definition
• Various function sets relevant to learning systems, and their VC-dimension
• General bounds on the risk.
• Bounds from probably approximately correct learning (PAC-learning).
The Vapnik-Chervonenkis dimension
The VC-dimension is a useful combinatorial parameter on sets of subsets, e.g. on concept classes or hypothesis classes.
Definition
We say a set S ⊆ X is shattered by C iff {S ∩ c | c ∈ C} = 2^S.

The Vapnik-Chervonenkis dimension of C, VC-dim(C), is the cardinality of the largest set S ⊆ X shattered by C, i.e.

    VC-dim(C) = max { |S| : S ⊆ X and {S ∩ c | c ∈ C} = 2^S }
The Vapnik-Chervonenkis dimension
[Figure: four labelled points in the plane, illustrating which labellings can be realised by a linear decision function.]

The VC-dimension of the set of linear decision functions in the 2-dimensional Euclidean space is 3.
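Shattering can be checked by brute force for small concept classes. As an illustrative sketch (not part of the slides; intervals on the real line are used here as the simplest non-trivial case), the following verifies that intervals [a, b] shatter any 2 points but no set of 3 points, i.e. their VC-dimension is 2:

```python
def interval_labels(points, a, b):
    # Label each point +1 if it lies inside [a, b], else -1.
    return tuple(1 if a <= p <= b else -1 for p in points)

def shattered_by_intervals(points):
    # A set is shattered iff every labelling is realised by some interval;
    # endpoints at the points themselves (plus one value outside) suffice.
    cands = sorted(points)
    ends = [cands[0] - 1.0] + cands + [cands[-1] + 1.0]
    realised = {interval_labels(points, a, b)
                for a in ends for b in ends if a <= b}
    return len(realised) == 2 ** len(points)

print(shattered_by_intervals([0.0, 1.0]))       # 2 points: True
print(shattered_by_intervals([0.0, 1.0, 2.0]))  # 3 points: (+,-,+) fails, so False
```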
VC-dimension of some (discrete) function sets
An upper bound:
Theorem (Blumer et al., 1989)
Let L be a learning algorithm that uses H consistently. For any 0 < ε, δ < 1, given

    (4 log(2/δ) + 8 VC-dim(H) log(13/ε)) / ε

random examples, L will with probability of at least 1 − δ

• either produce a hypothesis h with error ≤ ε,
• or indicate correctly that the target concept is not in H.
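The bound can be evaluated directly. A small sketch (assuming the logarithms are natural logarithms, which reproduces the figures in the table on the following slide; the function name is illustrative):

```python
import math

def sample_bound(eps, delta, vc_dim):
    # Blumer et al. sample-size bound, with natural logarithms (an assumption).
    return (4 * math.log(2 / delta) + 8 * vc_dim * math.log(13 / eps)) / eps

print(int(sample_bound(0.05, 0.05, 10)))  # -> 9192
print(int(sample_bound(0.10, 0.10, 4)))   # -> 1677
```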
VC-dimension of some (discrete) function sets
A lower bound:
Theorem (Ehrenfeucht et al., 1992)
Let L be a learning algorithm that uses H consistently. For any 0 < ε < 1/8 and 0 < δ < 1/100, given less than

    (VC-dim(H) − 1) / (32ε)

random examples, there is some probability distribution for which L will not produce a hypothesis h with error(h) ≤ ε with probability 1 − δ.
VC-dimension bounds
ε      δ      VC-dim   lower bound   upper bound
5%     5%     10       6             9192
10%    5%     10       3             4040
5%     5%     4        2             3860
10%    5%     4        1             1707
10%    10%    4        1             1677
The Vapnik-Chervonenkis dimension
[Figure: labelled points in the X-Y plane.]

Let C be the set of all rectangles in the plane.
The Vapnik-Chervonenkis dimension
[Figure: labelled points in the X-Y plane.]

Let C be the set of all circles in the plane.
The Vapnik-Chervonenkis dimension
[Figure: labelled points in the X-Y plane.]

Let C be the set of all triangles in the plane.
VC-dimension of some (discrete) function sets
• linear attributes

[Figure: a linear scale of attribute values 1 2 3 4 5 6.]

• tree-structured attributes

[Figure: a hierarchy of shape attributes: any shape, convex shape, concave shape, polygon shape.]
VC-dimension of some (discrete) function sets
Theorem Let X be the n-dimensional input space. If H is the set of all functions that are pure conjunctions of attribute-value constraints on X, then

    n ≤ VC-dim(H) ≤ 2n
VC-dimension of some (discrete) function sets
Theorem Let X be the n-dimensional input space. If H is the set of all functions that are pure conjunctions of attribute-value constraints on X that contain at most s atoms (attribute-value constraints), then

    s⌊log(n/s)⌋ ≤ VC-dim(H) ≤ 4s log(4s√n)
VC-Dimensions of Neural Networks
As a good heuristic (Bartlett):
VC-dimension ≈ number of parameters
• For linear threshold functions on ℝⁿ,

    VC-dim = n + 1

(the number of parameters is n + 1).

• For linear threshold networks, and fixed-depth networks with piecewise polynomial squashing functions,

    c₁|W| ≤ VC-dim ≤ c₂|W| log |W|

where |W| is the number of weights in the network.

• Some threshold networks have VC-dim ≥ c|W| log |W|.

• VC-dim(sigmoid net) ≤ c|W|⁴
VC-Dimensions of Neural Networks
Any function class H that can be computed by a program that takes a real input vector x and k real parameters and involves no more than t of the following operations:

• +, −, ×, / on real numbers
• >, ≥, =, ≤, <, ≠ on real numbers
• output value y ∈ {−1, +1}

has VC-dimension of O(kt).
(See work of Peter Bartlett)
VC-Dimensions of Neural Networks
Any function class H that can be computed by a program that takes a real input vector x and k real parameters and involves no more than t of the following operations:

• +, −, ×, /, eᵅ on real numbers
• >, ≥, =, ≤, <, ≠ on real numbers
• output value y ∈ {−1, +1}

has VC-dimension of O(k²t²).

This includes sigmoid networks, RBF networks, mixtures of experts, etc.
(See work of Peter Bartlett)
VC-dimension Heuristic
For neural networks and decision trees:
VC-dimension ≈ size.
Hence, the order of the misclassification probability is no more than

    training error + √(size/m)

where m is the number of training examples.

This suggests that the number of training examples should grow roughly linearly with the size of the hypothesis to be produced.

If the function to be produced is too complex for the amount of data available, it is likely that the learned function is not a near-optimal one.
(See work of Peter Bartlett)
Summary - Part 2
• The VC-dimension is a useful combinatorial parameter of sets of functions.

• It can be used to estimate the true risk on the basis of the empirical risk and the number of i.i.d. training examples.

• It can also be used to determine a sufficient number of training examples to learn probably approximately correct.
Part 3: Structural Risk Minimisation (SRM)
• Applications of the VC-dimension for choosing the most suitable subset of functions for a given number of i.i.d. examples.
• Trading empirical risk against confidence in estimate.
• Foundations of Support Vector Machines.
• Experiments with Support Vector Machines.
Structural Risk Minimisation (SRM)
The complexity (or capacity) of a function class from which the learner chooses a function that minimises the empirical risk determines the convergence rate of the learner to the optimal function.

For a given number of i.i.d. training examples, there is a trade-off between the degree to which the empirical risk can be minimised and the degree to which the empirical risk will deviate from the true risk.
Structural Risk Minimisation (SRM)
[Figure: nested function classes S₁ ⊂ … ⊂ Sᵢ ⊂ … ⊂ Sₙ plotted against complexity degree, showing the empirical risk decreasing, the confidence interval growing, and the bound on the overall risk minimised at an intermediate hᵢ between h₁ and hₙ.]

Consider a partition of the set S of functions from which a hypothesis is chosen as follows: S₁ ⊂ S₂ ⊂ … ⊂ Sₙ ⊂ …
Structural Risk Minimisation (SRM)
The general SRM principle:
Choose a complexity parameter d, e.g. the number of hidden units in an MLP or the size of a decision tree, and a function h ∈ H_d such that the following is minimised:

    R_emp(h) + c √(VC-dim(H_d)/m)

where m is the number of training examples.

The higher the VC-dimension, the more likely the empirical error will be low, but the larger the confidence term becomes.

Structural risk minimisation seeks the right balance.
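The SRM selection rule can be sketched as follows; the candidate classes, their empirical risks and VC-dimensions, and the constant c are illustrative assumptions, not values from the slides:

```python
import math

def srm_select(candidates, m, c=1.0):
    # Each candidate is (d, empirical_risk, vc_dim) for a class H_d in the
    # nested structure; pick the one minimising R_emp + c * sqrt(VC-dim / m).
    return min(candidates, key=lambda cand: cand[1] + c * math.sqrt(cand[2] / m))

# Hypothetical nested classes: larger d fits the data better but is more complex.
candidates = [(1, 0.30, 5), (2, 0.12, 50), (3, 0.02, 500)]
best = srm_select(candidates, m=1000)
print(best)  # an intermediate complexity wins the trade-off
```

With m = 1000 the bounds are roughly 0.37, 0.34, and 0.73, so the middle class is chosen: the most complex class has the lowest empirical risk but the largest confidence term.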
Support Vector Machines
Suppose the training data
    (x₁, y₁), …, (x_ℓ, y_ℓ),   x ∈ ℝⁿ, y ∈ {−1, +1}

can be separated by a hyperplane

    (wᵀ · x) + b = 0.

The hyperplane which separates the training data without error and has maximal distance to the closest training vector is called the Optimal hyperplane.
Support Vector Machines
[Figure: positive (+) and negative (−) training examples separated by the Optimal hyperplane, with the margin marked.]

The Optimal separating hyperplane has maximal margin to the training examples.
Support Vector Machines
To describe hyperplanes, we use the following canonical form (which scales the coefficients):

    y_i[(wᵀ · x_i) + b] ≥ 1,   i = 1, …, ℓ

Then, the Optimal hyperplane is the hyperplane that satisfies the inequality above and minimises

    Φ(w) = ‖w‖²

over the vector w as well as b.
Support Vector Machines
Consider the following set of training vectors X* = {x₁, …, x_r}, bounded by a sphere of radius R, i.e.

    |x_i − a| ≤ R,   x_i ∈ X*,

where a is the centre of the sphere.

Theorem (Vapnik, 1995) A subset of canonical hyperplanes

    f(x, w, b) = sign{(wᵀ · x) + b},

defined on X* and satisfying the constraint ‖w‖ ≤ A has VC-dimension h bounded by the inequality

    h ≤ min([R²A²], n) + 1,

where n is the number of dimensions and [·] denotes the integer part.
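The bound is easy to evaluate; a minimal sketch, reading [R²A²] as the integer part (the function name is illustrative):

```python
import math

def hyperplane_vc_bound(R, A, n):
    # Vapnik's bound h <= min([R^2 A^2], n) + 1 for canonical hyperplanes with
    # ||w|| <= A on data inside a sphere of radius R ([.] = integer part).
    return min(math.floor(R * R * A * A), n) + 1

print(hyperplane_vc_bound(R=1.0, A=2.0, n=1000))    # -> 5
print(hyperplane_vc_bound(R=1.0, A=100.0, n=1000))  # -> 1001, capped by n
```

With a small A (i.e. a large margin), the bound can be far below the input dimension n, which is what makes large-margin hyperplanes attractive in high-dimensional feature spaces.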
Support Vector Machines
To find the Optimal hyperplane: minimise
    Φ(w) = ½ wᵀ · w

under the following constraints for all i ∈ {1, …, ℓ}:

    y_i[(x_iᵀ · w) + b] ≥ 1.

The solution to this optimisation problem is given by the saddle point of the Lagrangian:

    L(w, b, α) = ½ (wᵀ · w) − Σ_{i=1..ℓ} α_i {[(x_iᵀ · w) + b] y_i − 1},

where the α_i are Lagrange multipliers.

The Lagrangian has to be minimised with respect to w and b and maximised with respect to α_i ≥ 0.
Support Vector Machines
In the saddle point, the solutions w₀, b₀, and α₀ should satisfy the conditions

    ∂L(w₀, b₀, α₀)/∂b = 0   and   ∂L(w₀, b₀, α₀)/∂w = 0
Support Vector Machines
From the previous conditions, we can derive the following for the Optimal hyperplane:

    w₀ = Σ_{support vectors} y_i α_i⁰ x_i,   α_i⁰ ≥ 0,

which is equivalent to

    w₀ = Σ_{i=1..ℓ} y_i α_i⁰ x_i,   α_i⁰ ≥ 0,

and

    Σ_{i=1..ℓ} α_i y_i = 0
Support Vector Machines
Plugging the first equation into the Lagrangian results in the following:

    L(w, b, α) = ½ Σ_{i,j=1..ℓ} α_i α_j y_i y_j (x_iᵀ · x_j) − Σ_{i=1..ℓ} α_i [y_i((x_iᵀ · Σ_{j=1..ℓ} α_j y_j x_j) + b) − 1]

    = ½ Σ_{i,j=1..ℓ} α_i α_j y_i y_j (x_iᵀ · x_j) − { (Σ_{i=1..ℓ} α_i y_i (x_iᵀ · Σ_{j=1..ℓ} α_j y_j x_j)) + (Σ_{i=1..ℓ} α_i y_i b) − (Σ_{i=1..ℓ} α_i) }

Using the second equation from above, i.e. Σ_{i=1..ℓ} α_i y_i = 0, we obtain:
    L(w, b, α) = ½ Σ_{i,j=1..ℓ} α_i α_j y_i y_j (x_iᵀ · x_j) − { [Σ_{i=1..ℓ} α_i y_i (x_iᵀ · Σ_{j=1..ℓ} α_j y_j x_j)] − [Σ_{i=1..ℓ} α_i] }

From that we obtain the following function Q(α) of the Lagrange multipliers α which needs to be maximised for finding the Optimal hyperplane:

    Q(α) = Σ_{i=1..ℓ} α_i − ½ Σ_{i,j=1..ℓ} α_i α_j y_i y_j (x_iᵀ · x_j)

The optimal values for α can be found using quadratic programming techniques.
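As a worked sketch (a hypothetical two-point toy problem, not from the slides): for x₁ = (1, 1), y₁ = +1 and x₂ = (−1, −1), y₂ = −1, the constraint Σ α_i y_i = 0 forces α₁ = α₂ = t, so Q can be maximised by a simple grid search instead of a QP solver:

```python
# Toy hard-margin SVM dual, solved by grid search (illustrative only).
x = [(1.0, 1.0), (-1.0, -1.0)]
y = [1.0, -1.0]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def Q(alpha):
    # Q(a) = sum_i a_i - 1/2 sum_{i,j} a_i a_j y_i y_j (x_i . x_j)
    s = sum(alpha)
    for i in range(2):
        for j in range(2):
            s -= 0.5 * alpha[i] * alpha[j] * y[i] * y[j] * dot(x[i], x[j])
    return s

# The constraint sum_i alpha_i y_i = 0 forces alpha_1 = alpha_2 = t here.
best_t = max((t / 1000.0 for t in range(1001)), key=lambda t: Q((t, t)))

# Recover w0 = sum_i y_i alpha_i x_i and the threshold b0 = 1 - w0 . x+.
w0 = tuple(sum(y[i] * best_t * x[i][k] for i in range(2)) for k in range(2))
b0 = 1.0 - dot(w0, x[0])
print(best_t, w0, b0)  # -> 0.25 (0.5, 0.5) 0.0
```

Both points are support vectors, and the recovered hyperplane w₀ = (½, ½), b₀ = 0 gives each point margin exactly 1 in canonical form.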
Support Vector Machines
Once the Lagrange multipliers for the Optimal hyperplane have been determined, the following separating rule can be used by expressing the optimal weight vector in terms of the support vectors and the Lagrange multipliers:

    f(x) = sign( Σ_{support vectors} y_i α_i⁰ (x_iᵀ · x) + b₀ )

where x_i are the support vectors, α_i⁰ are the corresponding Lagrange coefficients, and b₀ is the threshold constant

    b₀ = 1 − w₀ᵀ · x₊,

where x₊ denotes a support vector belonging to the first class. (Any support vector x_i would fulfil the corresponding equation b₀ = y_i − w₀ᵀ · x_i.)
Support Vector Machines
A Support Vector Machine maps the input space into a high-dimensional feature space and then constructs an Optimal hyperplane in the feature space.
Support Vector Machines
Example:
To construct a decision surface corresponding to a polynomial of degree two, one can create a feature space Z which has N = n(n+3)/2 coordinates of the form

    z₁ = x₁, …, z_n = x_n   (n coordinates)
    z_{n+1} = x₁², …, z_{2n} = x_n²   (n coordinates)
    z_{2n+1} = x₁x₂, …, z_N = x_{n−1}x_n   (n(n−1)/2 coordinates)

where x = (x₁, …, x_n).

The separating hyperplane constructed in this space is a second-degree polynomial in the input space.
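The feature map above can be written out directly; a minimal sketch (the function name is an illustrative assumption):

```python
from itertools import combinations

def quadratic_features(x):
    # Map x in R^n to the N = n(n+3)/2 coordinates z described above:
    # n linear terms, n squares, and n(n-1)/2 pairwise products.
    linear = list(x)
    squares = [xi * xi for xi in x]
    cross = [x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return linear + squares + cross

z = quadratic_features([3.0, 4.0])
print(len(z), z)  # n = 2 gives N = 2*(2+3)/2 = 5 coordinates
```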
Support Vector Machines
Two problems:
• How to find a separating hyperplane that will generalise well? The dimensionality of the feature space will be very large. As a consequence, not all separating hyperplanes will generalise well.

• How to treat such high-dimensional spaces computationally? A very high-dimensional feature space cannot be explicitly computed. E.g. considering polynomials of degree 4 or 5 in a 200-dimensional input space results in a billion-dimensional feature space. Obviously, "special" treatment of such spaces is required.
Support Vector Machines
Theorem (Vapnik, 1995)
If the training vectors are separated by the Optimal hyperplane, then the expectation value of the probability of committing an error on a test example is bounded by the ratio of the expectation of the number of support vectors to the number of examples in the training set:

    E[P(error)] ≤ E[number of support vectors] / ((number of training vectors) − 1)

It is interesting to note that this bound depends neither on the dimensionality of the feature space, nor on the norm of the vector of coefficients, nor on the bound of the norm of the input vectors.
Support Vector Machines: Kernels
Using Kernel functions to map input space to feature space.
Instead of extending the input vector for each example by a huge number of derived dimensions, such as polynomials of the input scalars, the polynomials are not explicitly computed.

Instead, SVMs use kernel functions such as

    K(x, y) = (x · y)²

This is equivalent to the mapping Φ(x) into the 3 following features: (x₁², x₂², √2·x₁x₂).
Support Vector Machines: Kernels
    K(x, y) = (x · y)² = (x₁y₁)² + (x₂y₂)² + 2(x₁x₂y₁y₂)

    (Φ(x) · Φ(y)) = (x₁², x₂², √2·x₁x₂)(y₁², y₂², √2·y₁y₂)ᵀ
                  = x₁²y₁² + x₂²y₂² + 2x₁x₂y₁y₂
                  = (x₁y₁)² + (x₂y₂)² + 2(x₁x₂y₁y₂)

I.e. by using the quadratic kernel function above, the same calculations relevant for support vector machines can be performed as with an explicit mapping of the input features into polynomials of degree 2.
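The equality above is easy to confirm numerically; a small sketch:

```python
import math

def K(x, y):
    # Quadratic kernel K(x, y) = (x . y)^2 in two dimensions.
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    # Explicit feature map Phi(x) = (x1^2, x2^2, sqrt(2) x1 x2).
    return (x[0] ** 2, x[1] ** 2, math.sqrt(2) * x[0] * x[1])

x, y = (1.0, 2.0), (3.0, -1.0)
lhs = K(x, y)
rhs = sum(a * b for a, b in zip(phi(x), phi(y)))
print(abs(lhs - rhs) < 1e-9)  # -> True: kernel = inner product in feature space
```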
Commonly used Kernel functions
Polynomial function:
    K(x, y) = (x · y + 1)ᵈ,   d = 1, 2, …

E.g. for d = 2 and a 2-dimensional instance space we get:

    K(x, y) = (x·y + 1)² = 1 + x₁²y₁² + 2x₁x₂y₁y₂ + x₂²y₂² + 2x₁y₁ + 2x₂y₂

which corresponds to a mapping from instance space to feature space as follows:

    Φ(x) = [1, x₁², √2·x₁x₂, x₂², √2·x₁, √2·x₂]ᵀ
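The same numerical check works for the inhomogeneous kernel; a small sketch:

```python
import math

def K(x, y):
    # Inhomogeneous polynomial kernel of degree d = 2.
    return (x[0] * y[0] + x[1] * y[1] + 1) ** 2

def phi(x):
    # Corresponding 6-dimensional feature map for n = 2 (see expansion above).
    r2 = math.sqrt(2)
    return (1.0, x[0] ** 2, r2 * x[0] * x[1], x[1] ** 2, r2 * x[0], r2 * x[1])

x, y = (0.5, -2.0), (1.5, 0.25)
lhs = K(x, y)
rhs = sum(a * b for a, b in zip(phi(x), phi(y)))
print(abs(lhs - rhs) < 1e-9)  # -> True
```

Unlike the purely quadratic kernel, this map also contains the constant and linear terms, so the induced decision surfaces include all polynomials up to degree 2.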
Further Kernel functions
A radial basis function machine with convolution function:
    K(x, y) = exp(−‖x − y‖² / (2σ²))

or

    K(x, y) = tanh(b(xᵀ · y) − c)

for one-hidden-layer perceptrons.
Support Vector Machines
Experiments with Support Vector Machines:
Handwritten digit recognition with benchmark training and test data from the US Postal Service.

16 × 16 input values.

For each of the 10 classes an individual classifier was learned. There were 7,300 training patterns and 2,000 test patterns.

Classifier                        Raw error %
Human performance                 2.5
Decision tree learner (C4.5)      16.2
Best two-layer neural network     5.9
Five-layer network (LeNet 1)      5.1
Support Vector Machines
Polynomials up to degree 7 were used for Support Vector Machines.

degree of     dimensionality of   support   raw
polynomial    feature space       vectors   error
1             256                 282       8.9
2             ≈ 33,000            227       4.7
3             ≈ 10⁶               274       4.0
4             ≈ 10⁹               321       4.2
5             ≈ 10¹²              374       4.3
6             ≈ 10¹⁴              377       4.5
7             ≈ 10¹⁶              422       4.5
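The support-vector counts above can be combined with the earlier bound E[P(error)] ≤ E[number of support vectors]/((number of training vectors) − 1). Strictly, the theorem involves expectations over training sets, so plugging in the counts from a single run gives only an indicative figure; a small sketch with ℓ = 7,300:

```python
# Support vector counts per polynomial degree, taken from the table above.
support_vectors = {1: 282, 2: 227, 3: 274, 4: 321, 5: 374, 6: 377, 7: 422}
ell = 7300  # number of training patterns

for degree, n_sv in sorted(support_vectors.items()):
    bound = n_sv / (ell - 1)
    print(f"degree {degree}: indicative error bound {100 * bound:.1f}%")
```

E.g. for degree 3 this gives roughly 3.8%, in the same range as the 4.0% raw test error observed in the table.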
Support Vector Machines
        Chosen Classifier              Number of test errors per degree
Digit   degree   dimensions   h_est    1    2    3    4    5    6    7
0       3        ≈ 10⁶        530      36   14   11   11   11   12   17
1       7        ≈ 10¹⁶       101      17   15   14   11   10   10   10
2       3        ≈ 10⁶        842      53   32   28   26   28   27   32
3       3        ≈ 10⁶        1157     57   25   22   22   22   22   23
4       4        ≈ 10⁹        962      50   32   32   30   30   29   33
5       3        ≈ 10⁶        1090     37   20   22   24   24   26   28
6       4        ≈ 10⁹        626      23   12   12   15   17   17   19
7       5        ≈ 10¹²       530      25   15   12   10   11   13   14
8       4        ≈ 10⁹        1445     71   33   28   24   28   32   44
9       5        ≈ 10¹²       1226     51   18   15   11   11   12   15
Summary - Part 3
• Structural Risk Minimisation can be applied in many learning settings, e.g. for deciding the number of hidden units in an MLP or the size of a decision tree depending on the available number of i.i.d. training examples.

• Support Vector Machines (SVMs) are powerful learning machines which select relevant features from a very high-dimensional feature space.

• SVMs find a separating hyperplane for the data and use quadratic programming for finding the Optimal hyperplane.

• Those data points which have the minimal distance to the Optimal hyperplane are called support vectors.
Summary
Strengths of the VC Learning Theory
• The application of results in VC learning theory to practical learning has been demonstrated.

• Structural Risk Minimisation (SRM) is an important concept for the design of practical learning algorithms.

• However, the decomposition into substructures must be chosen a priori, i.e. before seeing the data.

• Support Vector Machines have been shown to be capable of learning well from very high-dimensional feature spaces.
Summary (cont.)
Limitations of the VC Learning Theory
• The probabilistic bounds on the required number of examples are worst-case analyses.

• No preference relation on the functions of the learner is modelled.

• In practice, training examples are not necessarily drawn from the same probability distribution as the test examples.