VC Learning Theory and Support Vector Machines
Achim Hoffmann
School of Computer Science and Engineering
University of New South Wales
Sydney, Australia
October 2002
UNSW Slides © Achim Hoffmann, 2002
The Vapnik-Chervonenkis dimension
• Definition
• Various function sets relevant to learning systems, and their VC-dimension
• General bounds on the risk.
• Bounds from probably approximately correct learning (PAC-learning).
The Vapnik-Chervonenkis dimension
The VC-dimension is a useful combinatorial parameter on sets of subsets, e.g. on concept classes or hypothesis classes.
Definition
We say a set S ⊆ X is shattered by C iff {S ∩ c | c ∈ C} = 2^S.

The Vapnik-Chervonenkis dimension of C, VC-dim(C), is the cardinality of the largest set S ⊆ X shattered by C, i.e.

    VC-dim(C) = max { |S| : S ⊆ X and {S ∩ c | c ∈ C} = 2^S }
The Vapnik-Chervonenkis dimension
[Figure: four labelled points in the plane, illustrating which labellings can be realised by a linear decision function.]

The VC-dimension of the set of linear decision functions in the 2-dimensional Euclidean space is 3.
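Shattering can be checked by brute force for small concept classes. As an illustrative sketch (not part of the slides; intervals on the real line are used here as the simplest non-trivial case), the following verifies that intervals [a, b] shatter any 2 points but no set of 3 points, i.e. their VC-dimension is 2:

```python
def interval_labels(points, a, b):
    # Label each point +1 if it lies inside [a, b], else -1.
    return tuple(1 if a <= p <= b else -1 for p in points)

def shattered_by_intervals(points):
    # A set is shattered iff every labelling is realised by some interval;
    # endpoints at the points themselves (plus one value outside) suffice.
    cands = sorted(points)
    ends = [cands[0] - 1.0] + cands + [cands[-1] + 1.0]
    realised = {interval_labels(points, a, b)
                for a in ends for b in ends if a <= b}
    return len(realised) == 2 ** len(points)

print(shattered_by_intervals([0.0, 1.0]))       # 2 points: True
print(shattered_by_intervals([0.0, 1.0, 2.0]))  # 3 points: (+,-,+) fails, so False
```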
VC-dimension of some (discrete) function sets
An upper bound:
Theorem (Blumer et al., 1989)
Let L be a learning algorithm that uses H consistently. For any 0 < ε, δ < 1, given

    (4 log(2/δ) + 8 VC-dim(H) log(13/ε)) / ε

random examples, L will with probability of at least 1 − δ

• either produce a hypothesis h with error ≤ ε,
• or indicate correctly that the target concept is not in H.
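The bound can be evaluated directly. A small sketch (assuming the logarithms are natural logarithms, which reproduces the figures in the table on the following slide; the function name is illustrative):

```python
import math

def sample_bound(eps, delta, vc_dim):
    # Blumer et al. sample-size bound, with natural logarithms (an assumption).
    return (4 * math.log(2 / delta) + 8 * vc_dim * math.log(13 / eps)) / eps

print(int(sample_bound(0.05, 0.05, 10)))  # -> 9192
print(int(sample_bound(0.10, 0.10, 4)))   # -> 1677
```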
VC-dimension of some (discrete) function sets
A lower bound:
Theorem (Ehrenfeucht et al., 1992)
Let L be a learning algorithm that uses H consistently. For any 0 < ε < 1/8 and 0 < δ < 1/100, given less than

    (VC-dim(H) − 1) / (32ε)

random examples, there is some probability distribution for which L will not produce a hypothesis h with error(h) ≤ ε with probability 1 − δ.
VC-dimension bounds
ε      δ      VC-dim   lower bound   upper bound
5%     5%     10       6             9192
10%    5%     10       3             4040
5%     5%     4        2             3860
10%    5%     4        1             1707
10%    10%    4        1             1677
The Vapnik-Chervonenkis dimension
[Figure: labelled points in the X-Y plane.]

Let C be the set of all rectangles in the plane.
The Vapnik-Chervonenkis dimension
[Figure: labelled points in the X-Y plane.]

Let C be the set of all circles in the plane.
The Vapnik-Chervonenkis dimension
[Figure: labelled points in the X-Y plane.]

Let C be the set of all triangles in the plane.
VC-dimension of some (discrete) function sets
• linear attributes

[Figure: a linear scale of attribute values 1 2 3 4 5 6.]

• tree-structured attributes

[Figure: a hierarchy of shape attributes: any shape, convex shape, concave shape, polygon shape.]
VC-dimension of some (discrete) function sets
Theorem Let X be the n-dimensional input space. If H is the set of all functions that are pure conjunctions of attribute-value constraints on X, then

    n ≤ VC-dim(H) ≤ 2n
VC-dimension of some (discrete) function sets
Theorem Let X be the n-dimensional input space. If H is the set of all functions that are pure conjunctions of attribute-value constraints on X that contain at most s atoms (attribute-value constraints), then

    s⌊log(n/s)⌋ ≤ VC-dim(H) ≤ 4s log(4s√n)
VC-Dimensions of Neural Networks
As a good heuristic (Bartlett):
VC-dimension ≈ number of parameters
• For linear threshold functions on ℝⁿ,

    VC-dim = n + 1

(the number of parameters is n + 1).

• For linear threshold networks, and fixed-depth networks with piecewise polynomial squashing functions,

    c₁|W| ≤ VC-dim ≤ c₂|W| log |W|

where |W| is the number of weights in the network.

• Some threshold networks have VC-dim ≥ c|W| log |W|.

• VC-dim(sigmoid net) ≤ c|W|⁴
VC-Dimensions of Neural Networks
Any function class H that can be computed by a program that takes a real input vector x and k real parameters and involves no more than t of the following operations:

• +, −, ×, / on real numbers
• >, ≥, =, ≤, <, ≠ on real numbers
• output value y ∈ {−1, +1}

has VC-dimension of O(kt).
(See work of Peter Bartlett)
VC-Dimensions of Neural Networks
Any function class H that can be computed by a program that takes a real input vector x and k real parameters and involves no more than t of the following operations:

• +, −, ×, /, eᵅ on real numbers
• >, ≥, =, ≤, <, ≠ on real numbers
• output value y ∈ {−1, +1}

has VC-dimension of O(k²t²).

This includes sigmoid networks, RBF networks, mixtures of experts, etc.
(See work of Peter Bartlett)
VC-dimension Heuristic
For neural networks and decision trees:
VC-dimension ≈ size.
Hence, the order of the misclassification probability is no more than

    training error + √(size/m)

where m is the number of training examples.

This suggests that the number of training examples should grow roughly linearly with the size of the hypothesis to be produced.

If the function to be produced is too complex for the amount of data available, it is likely that the learned function is not a near-optimal one.
(See work of Peter Bartlett)
Summary - Part 2
• The VC-dimension is a useful combinatorial parameter of sets of functions.

• It can be used to estimate the true risk on the basis of the empirical risk and the number of i.i.d. training examples.

• It can also be used to determine a sufficient number of training examples to learn probably approximately correct.
Part 3: Structural Risk Minimisation (SRM)
• Applications of the VC-dimension for choosing the most suitable subset of functions for a given number of i.i.d. examples.
• Trading empirical risk against confidence in estimate.
• Foundations of Support Vector Machines.
• Experiments with Support Vector Machines.
Structural Risk Minimisation (SRM)
The complexity (or capacity) of a function class from which the learner chooses a function that minimises the empirical risk determines the convergence rate of the learner to the optimal function.

For a given number of i.i.d. training examples, there is a trade-off between the degree to which the empirical risk can be minimised and the degree to which the empirical risk will deviate from the true risk.
Structural Risk Minimisation (SRM)
[Figure: nested function classes S₁ ⊂ … ⊂ Sᵢ ⊂ … ⊂ Sₙ plotted against complexity degree, showing the empirical risk decreasing, the confidence interval growing, and the bound on the overall risk minimised at an intermediate hᵢ between h₁ and hₙ.]

Consider a partition of the set S of functions from which a hypothesis is chosen as follows: S₁ ⊂ S₂ ⊂ … ⊂ Sₙ ⊂ …
Structural Risk Minimisation (SRM)
The general SRM principle:
Choose a complexity parameter d, e.g. the number of hidden units in an MLP or the size of a decision tree, and a function h ∈ H_d such that the following is minimised:

    R_emp(h) + c √(VC-dim(H_d)/m)

where m is the number of training examples.

The higher the VC-dimension, the more likely the empirical error will be low, but the larger the confidence term becomes.

Structural risk minimisation seeks the right balance.
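The SRM selection rule can be sketched as follows; the candidate classes, their empirical risks and VC-dimensions, and the constant c are illustrative assumptions, not values from the slides:

```python
import math

def srm_select(candidates, m, c=1.0):
    # Each candidate is (d, empirical_risk, vc_dim) for a class H_d in the
    # nested structure; pick the one minimising R_emp + c * sqrt(VC-dim / m).
    return min(candidates, key=lambda cand: cand[1] + c * math.sqrt(cand[2] / m))

# Hypothetical nested classes: larger d fits the data better but is more complex.
candidates = [(1, 0.30, 5), (2, 0.12, 50), (3, 0.02, 500)]
best = srm_select(candidates, m=1000)
print(best)  # an intermediate complexity wins the trade-off
```

With m = 1000 the bounds are roughly 0.37, 0.34, and 0.73, so the middle class is chosen: the most complex class has the lowest empirical risk but the largest confidence term.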
Support Vector Machines
Suppose the training data
    (x₁, y₁), …, (x_ℓ, y_ℓ),   x ∈ ℝⁿ, y ∈ {−1, +1}

can be separated by a hyperplane

    (wᵀ · x) + b = 0.

The hyperplane which separates the training data without error and has maximal distance to the closest training vector is called the Optimal hyperplane.
Support Vector Machines
[Figure: positive (+) and negative (−) training examples separated by the Optimal hyperplane, with the margin marked.]

The Optimal separating hyperplane has maximal margin to the training examples.
Support Vector Machines
To describe hyperplanes, we use the following canonical form (which scales the coefficients):

    y_i[(wᵀ · x_i) + b] ≥ 1,   i = 1, …, ℓ

Then, the Optimal hyperplane is the hyperplane that satisfies the inequality above and minimises

    Φ(w) = ‖w‖²

over the vector w as well as b.
Support Vector Machines
Consider the following set of training vectors X* = {x₁, …, x_r}, bounded by a sphere of radius R, i.e.

    |x_i − a| ≤ R,   x_i ∈ X*,

where a is the centre of the sphere.

Theorem (Vapnik, 1995) A subset of canonical hyperplanes

    f(x, w, b) = sign{(wᵀ · x) + b},

defined on X* and satisfying the constraint ‖w‖ ≤ A has VC-dimension h bounded by the inequality

    h ≤ min([R²A²], n) + 1,

where n is the number of dimensions and [·] denotes the integer part.
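The bound is easy to evaluate; a minimal sketch, reading [R²A²] as the integer part (the function name is illustrative):

```python
import math

def hyperplane_vc_bound(R, A, n):
    # Vapnik's bound h <= min([R^2 A^2], n) + 1 for canonical hyperplanes with
    # ||w|| <= A on data inside a sphere of radius R ([.] = integer part).
    return min(math.floor(R * R * A * A), n) + 1

print(hyperplane_vc_bound(R=1.0, A=2.0, n=1000))    # -> 5
print(hyperplane_vc_bound(R=1.0, A=100.0, n=1000))  # -> 1001, capped by n
```

With a small A (i.e. a large margin), the bound can be far below the input dimension n, which is what makes large-margin hyperplanes attractive in high-dimensional feature spaces.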
Support Vector Machines
To find the Optimal hyperplane: minimise
    Φ(w) = ½ wᵀ · w

under the following constraints for all i ∈ {1, …, ℓ}:

    y_i[(x_iᵀ · w) + b] ≥ 1.

The solution to this optimisation problem is given by the saddle point of the Lagrangian:

    L(w, b, α) = ½ (wᵀ · w) − Σ_{i=1..ℓ} α_i {[(x_iᵀ · w) + b] y_i − 1},

where the α_i are Lagrange multipliers.

The Lagrangian has to be minimised with respect to w and b and maximised with respect to α_i ≥ 0.
Support Vector Machines
In the saddle point, the solutions w₀, b₀, and α₀ should satisfy the conditions

    ∂L(w₀, b₀, α₀)/∂b = 0   and   ∂L(w₀, b₀, α₀)/∂w = 0
Support Vector Machines
From the previous conditions, we can derive the following for the Optimal hyperplane:

    w₀ = Σ_{support vectors} y_i α_i⁰ x_i,   α_i⁰ ≥ 0,

which is equivalent to

    w₀ = Σ_{i=1..ℓ} y_i α_i⁰ x_i,   α_i⁰ ≥ 0,

and

    Σ_{i=1..ℓ} α_i y_i = 0
Support Vector Machines
Plugging the first equation into the Lagrangian results in the following:

    L(w, b, α) = ½ Σ_{i,j=1..ℓ} α_i α_j y_i y_j (x_iᵀ · x_j) − Σ_{i=1..ℓ} α_i [y_i((x_iᵀ · Σ_{j=1..ℓ} α_j y_j x_j) + b) − 1]

    = ½ Σ_{i,j=1..ℓ} α_i α_j y_i y_j (x_iᵀ · x_j) − { (Σ_{i=1..ℓ} α_i y_i (x_iᵀ · Σ_{j=1..ℓ} α_j y_j x_j)) + (Σ_{i=1..ℓ} α_i y_i b) − (Σ_{i=1..ℓ} α_i) }

Using the second equation from above, i.e. Σ_{i=1..ℓ} α_i y_i = 0, we obtain:
    L(w, b, α) = ½ Σ_{i,j=1..ℓ} α_i α_j y_i y_j (x_iᵀ · x_j) − { [Σ_{i=1..ℓ} α_i y_i (x_iᵀ · Σ_{j=1..ℓ} α_j y_j x_j)] − [Σ_{i=1..ℓ} α_i] }

From that we obtain the following function Q(α) of the Lagrange multipliers α which needs to be maximised for finding the Optimal hyperplane:

    Q(α) = Σ_{i=1..ℓ} α_i − ½ Σ_{i,j=1..ℓ} α_i α_j y_i y_j (x_iᵀ · x_j)

The optimal values for α can be found using quadratic programming techniques.
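As a worked sketch (a hypothetical two-point toy problem, not from the slides): for x₁ = (1, 1), y₁ = +1 and x₂ = (−1, −1), y₂ = −1, the constraint Σ α_i y_i = 0 forces α₁ = α₂ = t, so Q can be maximised by a simple grid search instead of a QP solver:

```python
# Toy hard-margin SVM dual, solved by grid search (illustrative only).
x = [(1.0, 1.0), (-1.0, -1.0)]
y = [1.0, -1.0]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def Q(alpha):
    # Q(a) = sum_i a_i - 1/2 sum_{i,j} a_i a_j y_i y_j (x_i . x_j)
    s = sum(alpha)
    for i in range(2):
        for j in range(2):
            s -= 0.5 * alpha[i] * alpha[j] * y[i] * y[j] * dot(x[i], x[j])
    return s

# The constraint sum_i alpha_i y_i = 0 forces alpha_1 = alpha_2 = t here.
best_t = max((t / 1000.0 for t in range(1001)), key=lambda t: Q((t, t)))

# Recover w0 = sum_i y_i alpha_i x_i and the threshold b0 = 1 - w0 . x+.
w0 = tuple(sum(y[i] * best_t * x[i][k] for i in range(2)) for k in range(2))
b0 = 1.0 - dot(w0, x[0])
print(best_t, w0, b0)  # -> 0.25 (0.5, 0.5) 0.0
```

Both points are support vectors, and the recovered hyperplane w₀ = (½, ½), b₀ = 0 gives each point margin exactly 1 in canonical form.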
Support Vector Machines
Once the Lagrange multipliers for the Optimal hyperplane have been determined, the following separating rule can be used by expressing the optimal weight vector in terms of the support vectors and the Lagrange multipliers:

    f(x) = sign( Σ_{support vectors} y_i α_i⁰ (x_iᵀ · x) + b₀ )

where x_i are the support vectors, α_i⁰ are the corresponding Lagrange coefficients, and b₀ is the threshold constant

    b₀ = 1 − w₀ᵀ · x₊,

where x₊ denotes a support vector belonging to the first class. (Any support vector x_i would fulfil the corresponding equation b₀ = y_i − w₀ᵀ · x_i.)
Support Vector Machines
A Support Vector Machine maps the input space into a high-dimensional feature space and then constructs an Optimal hyperplane in the feature space.
Support Vector Machines
Example:
To construct a decision surface corresponding to a polynomial of degree two, one can create a feature space Z which has N = n(n+3)/2 coordinates of the form

    z₁ = x₁, …, z_n = x_n   (n coordinates)
    z_{n+1} = x₁², …, z_{2n} = x_n²   (n coordinates)
    z_{2n+1} = x₁x₂, …, z_N = x_{n−1}x_n   (n(n−1)/2 coordinates)

where x = (x₁, …, x_n).

The separating hyperplane constructed in this space is a second-degree polynomial in the input space.
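The feature map above can be written out directly; a minimal sketch (the function name is an illustrative assumption):

```python
from itertools import combinations

def quadratic_features(x):
    # Map x in R^n to the N = n(n+3)/2 coordinates z described above:
    # n linear terms, n squares, and n(n-1)/2 pairwise products.
    linear = list(x)
    squares = [xi * xi for xi in x]
    cross = [x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return linear + squares + cross

z = quadratic_features([3.0, 4.0])
print(len(z), z)  # n = 2 gives N = 2*(2+3)/2 = 5 coordinates
```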
Support Vector Machines
Two problems:
• How to find a separating hyperplane that will generalise well? The dimensionality of the feature space will be very large. As a consequence, not all separating hyperplanes will generalise well.

• How to treat such high-dimensional spaces computationally? A very high-dimensional feature space cannot be explicitly computed. E.g. considering polynomials of degree 4 or 5 in a 200-dimensional input space results in a billion-dimensional feature space. Obviously, "special" treatment of such spaces is required.
Support Vector Machines
Theorem (Vapnik, 1995)
If the training vectors are separated by the Optimal hyperplane, then the expectation value of the probability of committing an error on a test example is bounded by the ratio of the expectation of the number of support vectors to the number of examples in the training set:

    E[P(error)] ≤ E[number of support vectors] / ((number of training vectors) − 1)

It is interesting to note that this bound depends neither on the dimensionality of the feature space, nor on the norm of the vector of coefficients, nor on the bound of the norm of the input vectors.
Support Vector Machines: Kernels
Using Kernel functions to map input space to feature space.
Instead of extending the input vector for each example by a huge number of derived dimensions, such as polynomials of the input scalars, the polynomials are not explicitly computed.

Instead, SVMs use kernel functions such as

    K(x, y) = (x · y)²

This is equivalent to the mapping Φ(x) into the 3 following features: (x₁², x₂², √2·x₁x₂).
Support Vector Machines: Kernels
    K(x, y) = (x · y)² = (x₁y₁)² + (x₂y₂)² + 2(x₁x₂y₁y₂)

    (Φ(x) · Φ(y)) = (x₁², x₂², √2·x₁x₂)(y₁², y₂², √2·y₁y₂)ᵀ
                  = x₁²y₁² + x₂²y₂² + 2x₁x₂y₁y₂
                  = (x₁y₁)² + (x₂y₂)² + 2(x₁x₂y₁y₂)

I.e. by using the quadratic kernel function above, the same calculations relevant for support vector machines can be performed as with an explicit mapping of the input features into polynomials of degree 2.
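The equality above is easy to confirm numerically; a small sketch:

```python
import math

def K(x, y):
    # Quadratic kernel K(x, y) = (x . y)^2 in two dimensions.
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    # Explicit feature map Phi(x) = (x1^2, x2^2, sqrt(2) x1 x2).
    return (x[0] ** 2, x[1] ** 2, math.sqrt(2) * x[0] * x[1])

x, y = (1.0, 2.0), (3.0, -1.0)
lhs = K(x, y)
rhs = sum(a * b for a, b in zip(phi(x), phi(y)))
print(abs(lhs - rhs) < 1e-9)  # -> True: kernel = inner product in feature space
```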
Commonly used Kernel functions
Polynomial function:
    K(x, y) = (x · y + 1)ᵈ,   d = 1, 2, …

E.g. for d = 2 and a 2-dimensional instance space we get:

    K(x, y) = (x·y + 1)² = 1 + x₁²y₁² + 2x₁x₂y₁y₂ + x₂²y₂² + 2x₁y₁ + 2x₂y₂

which corresponds to a mapping from instance space to feature space as follows:

    Φ(x) = [1, x₁², √2·x₁x₂, x₂², √2·x₁, √2·x₂]ᵀ
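The same numerical check works for the inhomogeneous kernel; a small sketch:

```python
import math

def K(x, y):
    # Inhomogeneous polynomial kernel of degree d = 2.
    return (x[0] * y[0] + x[1] * y[1] + 1) ** 2

def phi(x):
    # Corresponding 6-dimensional feature map for n = 2 (see expansion above).
    r2 = math.sqrt(2)
    return (1.0, x[0] ** 2, r2 * x[0] * x[1], x[1] ** 2, r2 * x[0], r2 * x[1])

x, y = (0.5, -2.0), (1.5, 0.25)
lhs = K(x, y)
rhs = sum(a * b for a, b in zip(phi(x), phi(y)))
print(abs(lhs - rhs) < 1e-9)  # -> True
```

Unlike the purely quadratic kernel, this map also contains the constant and linear terms, so the induced decision surfaces include all polynomials up to degree 2.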
Further Kernel functions
A radial basis function machine with convolution function:
    K(x, y) = exp(−‖x − y‖² / (2σ²))

or

    K(x, y) = tanh(b(xᵀ · y) − c)

for one-hidden-layer perceptrons.
Support Vector Machines
Experiments with Support Vector Machines:
Handwritten digit recognition with benchmark training and test data from the US Postal Service.

16 × 16 input values.

For each of the 10 classes an individual classifier was learned. There were 7,300 training patterns and 2,000 test patterns.

Classifier                        Raw error %
Human performance                 2.5
Decision tree learner (C4.5)      16.2
Best two-layer neural network     5.9
Five-layer network (LeNet 1)      5.1
Support Vector Machines
Polynomials up to degree 7 were used for Support Vector Machines.

degree of     dimensionality of   support   raw
polynomial    feature space       vectors   error
1             256                 282       8.9
2             ≈ 33,000            227       4.7
3             ≈ 10⁶               274       4.0
4             ≈ 10⁹               321       4.2
5             ≈ 10¹²              374       4.3
6             ≈ 10¹⁴              377       4.5
7             ≈ 10¹⁶              422       4.5
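The support-vector counts above can be combined with the earlier bound E[P(error)] ≤ E[number of support vectors]/((number of training vectors) − 1). Strictly, the theorem involves expectations over training sets, so plugging in the counts from a single run gives only an indicative figure; a small sketch with ℓ = 7,300:

```python
# Support vector counts per polynomial degree, taken from the table above.
support_vectors = {1: 282, 2: 227, 3: 274, 4: 321, 5: 374, 6: 377, 7: 422}
ell = 7300  # number of training patterns

for degree, n_sv in sorted(support_vectors.items()):
    bound = n_sv / (ell - 1)
    print(f"degree {degree}: indicative error bound {100 * bound:.1f}%")
```

E.g. for degree 3 this gives roughly 3.8%, in the same range as the 4.0% raw test error observed in the table.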
Support Vector Machines
        Chosen Classifier              Number of test errors per degree
Digit   degree   dimensions   h_est    1    2    3    4    5    6    7
0       3        ≈ 10⁶        530      36   14   11   11   11   12   17
1       7        ≈ 10¹⁶       101      17   15   14   11   10   10   10
2       3        ≈ 10⁶        842      53   32   28   26   28   27   32
3       3        ≈ 10⁶        1157     57   25   22   22   22   22   23
4       4        ≈ 10⁹        962      50   32   32   30   30   29   33
5       3        ≈ 10⁶        1090     37   20   22   24   24   26   28
6       4        ≈ 10⁹        626      23   12   12   15   17   17   19
7       5        ≈ 10¹²       530      25   15   12   10   11   13   14
8       4        ≈ 10⁹        1445     71   33   28   24   28   32   44
9       5        ≈ 10¹²       1226     51   18   15   11   11   12   15
Summary - Part 3
• Structural Risk Minimisation can be applied in many learning settings, e.g. for deciding the number of hidden units in an MLP or the size of a decision tree depending on the available number of i.i.d. training examples.

• Support Vector Machines (SVMs) are powerful learning machines which select relevant features from a very high-dimensional feature space.

• SVMs find a separating hyperplane for the data and use quadratic programming for finding the Optimal hyperplane.

• Those data points which have the minimal distance to the Optimal hyperplane are called support vectors.
Summary
Strengths of the VC Learning Theory
• The application of results in VC learning theory to practical learning has been demonstrated.

• Structural Risk Minimisation (SRM) is an important concept for the design of practical learning algorithms.

• However, the decomposition into substructures must be chosen a priori, i.e. before seeing the data.

• Support Vector Machines have been shown to be capable of learning well from very high-dimensional feature spaces.
Summary (cont.)
Limitations of the VC Learning Theory
• The probabilistic bounds on the required number of examples are worst-case analyses.

• No preference relation on the functions of the learner is modelled.

• In practice, training examples are not necessarily drawn from the same probability distribution as the test examples.