
VC Learning Theory and Support Vector Machines

Achim Hoffmann

School of Computer Science and Engineering

University of New South Wales

Sydney, Australia

October 2002

Slides © Achim Hoffmann, 2002

The Vapnik-Chervonenkis dimension

• Definition

• Various function sets relevant to learning systems, and their VC-dimension

• General bounds on the risk

• Bounds from probably approximately correct learning (PAC learning)


The Vapnik-Chervonenkis dimension

The VC-dimension is a useful combinatorial parameter on sets of subsets, e.g. on concept classes or hypothesis classes.

Definition

We say a set S ⊆ X is shattered by C iff {S ∩ c | c ∈ C} = 2^S.

The Vapnik-Chervonenkis dimension of C, VC-dim(C), is the cardinality of the largest set S ⊆ X shattered by C, i.e.

VC-dim(C) = max { |S| : S ⊆ X ∧ {S ∩ c | c ∈ C} = 2^S }
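
As a minimal illustration of the definition (not from the slides; the point sets and helper names are illustrative): a brute-force check of whether a finite set of points on the real line is shattered by the class of closed intervals [a, b]. Two points can be shattered, three cannot, so the VC-dimension of intervals is 2.

```python
# Illustrative sketch: a set S is shattered by a concept class C iff every
# subset of S equals S ∩ c for some c in C.  Here C = closed intervals [a, b].
from itertools import chain, combinations

def subsets(points):
    """All subsets of a finite point set, as frozensets."""
    pts = list(points)
    return [frozenset(s) for s in chain.from_iterable(
        combinations(pts, k) for k in range(len(pts) + 1))]

def interval_shatters(points):
    """True iff intervals [a, b] pick out every subset of `points`."""
    reachable = {frozenset()}                     # empty intersection
    for a in points:
        for b in points:                          # endpoints from the points suffice
            reachable.add(frozenset(p for p in points if a <= p <= b))
    return set(subsets(points)) <= reachable

print(interval_shatters({1.0, 2.0}))        # True  -> two points are shattered
print(interval_shatters({1.0, 2.0, 3.0}))   # False -> {1, 3} without 2 is unreachable
```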


The Vapnik-Chervonenkis dimension

[Figure: four labelled points (1, 2, 3, 4) in the plane, illustrating which labellings can be realised by a linear decision boundary.]

The VC-dimension of the set of linear decision functions in the 2-dimensional Euclidean space is 3.
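
A small numerical sketch of this claim (the point sets and the use of a linear-programming feasibility test are illustrative choices, not from the slides): every labelling of three points in general position is realisable by sign(w · x + b), while the XOR labelling of four points is not (and in fact no four points in the plane can be shattered).

```python
# Sketch: test each labelling by a feasibility problem, does some (w, b)
# satisfy y_i (w · x_i + b) >= 1 for all i?
from itertools import product
import numpy as np
from scipy.optimize import linprog

def linearly_realisable(points, labels):
    """Feasibility of y_i (w·x_i + b) >= 1, written as A_ub z <= b_ub with z = (w1, w2, b)."""
    A_ub = np.array([[-y * x[0], -y * x[1], -y] for x, y in zip(points, labels)])
    b_ub = -np.ones(len(points))
    res = linprog(c=[0, 0, 0], A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.success

def shattered(points):
    return all(linearly_realisable(points, labels)
               for labels in product([-1, 1], repeat=len(points)))

three = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
four_xor = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]
print(shattered(three))     # True:  all 8 labellings are realisable
print(shattered(four_xor))  # False: the XOR labelling is not linearly separable
```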


VC-dimension of some (discrete) function sets

An upper bound:

Theorem (Blumer et al., 1989)

Let L be a learning algorithm that uses H consistently. For any 0 < ε, δ < 1, given

(4 log(2/δ) + 8 · VC-dim(H) · log(13/ε)) / ε

random examples, L will with probability of at least 1 − δ

either produce a hypothesis h with error ≤ ε

or indicate correctly, that the target concept is not in H.
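
As a rough worked example, reading log as the natural logarithm (the reading that matches the table two slides below): for ε = δ = 0.05 and VC-dim(H) = 10 the bound evaluates to (4 ln 40 + 80 ln 260)/0.05 ≈ 9200 examples.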


VC-dimension of some (discrete) function sets

A lower bound:

Theorem (Ehrenfeucht et al., 1992)

Let L be a learning algorithm that uses H consistently. For any 0 < ε < 1/8 and 0 < δ < 1/100, given fewer than

(VC-dim(H) − 1) / (32ε)

random examples, there is some probability distribution for which L will not produce a hypothesis h with error(h) ≤ ε with probability 1 − δ.


VC-dimension bounds

ε      δ      VC-dim    lower bound    upper bound
5%     5%     10        6              9192
10%    5%     10        3              4040
5%     5%     4         2              3860
10%    5%     4         1              1707
10%    10%    4         1              1677
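
A sketch that recomputes the table from the two theorems above. It assumes the logarithms are natural logarithms; with that reading the computed values agree with the table up to small rounding differences.

```python
import math

def upper_bound(eps, delta, vc_dim):
    """Blumer et al. (1989) sufficient sample size (natural logs assumed)."""
    return round((4 * math.log(2 / delta) + 8 * vc_dim * math.log(13 / eps)) / eps)

def lower_bound(eps, vc_dim):
    """Ehrenfeucht et al. (1992) necessary sample size."""
    return math.ceil((vc_dim - 1) / (32 * eps))

for eps, delta, d in [(0.05, 0.05, 10), (0.10, 0.05, 10),
                      (0.05, 0.05, 4), (0.10, 0.05, 4), (0.10, 0.10, 4)]:
    print(f"eps={eps:.0%}  delta={delta:.0%}  VC-dim={d:2d}  "
          f"lower={lower_bound(eps, d):2d}  upper={upper_bound(eps, delta, d)}")
```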


The Vapnik-Chervonenkis dimension


Let C be the set of all rectangles in the plane.


The Vapnik-Chervonenkis dimension


Let C be the set of all circles in the plane.


The Vapnik-Chervonenkis dimension


Let C be the set of all triangles in the plane.


VC-dimension of some (discrete) function sets

• linear attributes (illustrated by an ordered scale 1, 2, 3, 4, 5, 6)

• tree-structured attributes (illustrated by a hierarchy of shape values: any shape; convex shape / concave shape; polygon shape)


VC-dimension of some (discrete) function sets

Theorem. Let X be the n-dimensional input space. If H is the set of all functions that are pure conjunctions of attribute-value constraints on X, then

n ≤ VC-dim(H) ≤ 2n


VC-dimension of some (discrete) function sets

Theorem. Let X be the n-dimensional input space. If H is the set of all functions that are pure conjunctions of attribute-value constraints on X that contain at most s atoms (attribute-value constraints), then

s ⌊log(n/s)⌋ ≤ VC-dim(H) ≤ 4s log(4s√n)


VC-Dimensions of Neural Networks

As a good heuristic (Bartlett):

VC-dimension ≈ number of parameters

• For linear threshold functions on R^n,

VC-dim = n + 1

(the number of parameters is n + 1).

• For linear threshold networks, and fixed-depth networks with piecewise polynomial squashing functions,

c_1 |W| ≤ VC-dim ≤ c_2 |W| log |W|,

where |W| is the number of weights in the network.

• Some threshold networks have VC-dim ≥ c |W| log |W|.

• VC-dim(sigmoid net) ≤ c |W|^4
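
A quick illustrative sketch of the parameter-counting heuristic (the network sizes and the constants c1, c2 are arbitrary choices, not from the slides):

```python
# Count weights and biases of a fully connected feed-forward network and
# report the heuristic bracket c1*|W| ... c2*|W|*log|W| from the slide.
import math

def num_weights(layer_sizes):
    """Weights + biases of a fully connected feed-forward network."""
    return sum((fan_in + 1) * fan_out
               for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]))

def vc_bracket(layer_sizes, c1=1.0, c2=1.0):
    w = num_weights(layer_sizes)
    return c1 * w, c2 * w * math.log(w)

print(num_weights([10, 1]))      # 11: a single threshold unit on R^10, VC-dim = n + 1
print(num_weights([10, 5, 1]))   # 61 weights in a small network
print(vc_bracket([10, 5, 1]))    # (61.0, 61*ln(61) ≈ 250.8)
```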


VC-Dimensions of Neural Networks

Any function class H that can be computed by a program that takes a real input vector x and k real parameters and involves no more than t of the following operations:

• +, −, ×, / on real numbers

• >, ≥, =, ≤, <, ≠ on real numbers

• output value y ∈ {−1, +1}

has VC-dimension O(kt).

(See work of Peter Bartlett)


VC-Dimensions of Neural Networks

Any function class H that can be computed by a program that takes a real input vector x and k real parameters and involves no more than t of the following operations:

• +, −, ×, /, e^α on real numbers

• >, ≥, =, ≤, <, ≠ on real numbers

• output value y ∈ {−1, +1}

has VC-dimension O(k²t²).

This includes sigmoid networks, RBF networks, mixtures of experts, etc.

(See work of Peter Bartlett)


VC-dimension Heuristic

For neural networks and decision trees:

VC-dimension ≈ size.

Hence, the order of the misclassification probability is no more than

training error + √( size / m ),

where m is the number of training examples.

This suggests that the number of training examples should grow roughly linearly with the size of the hypothesis to be produced.

If the function to be produced is too complex for the amount of data available, it is likely that the learned function is not a near-optimal one.


(See work of Peter Bartlett)


Summary - Part 2

• The VC-dimension is a useful combinatorial parameter of sets of functions.

• It can be used to estimate the true risk on the basis of the empirical risk and the number of i.i.d. training examples.

• It can also be used to determine a sufficient number of training examples to learn probably approximately correctly.


Part 3: Structural Risk Minimisation (SRM)

• Applications of the VC-dimension for choosing the most suitable subset of functions for a given number of i.i.d. examples.

• Trading empirical risk against confidence in estimate.

• Foundations of Support Vector Machines.

• Experiments with Support Vector Machines.


Structural Risk Minimisation (SRM)

The complexity (or capacity) of a function class from which the learner chooses a function that minimises the empirical risk determines the convergence rate of the learner to the optimal function.

For a given number of i.i.d. training examples, there is a trade-off between the degree to which the empirical risk can be minimised and the degree to which the empirical risk will deviate from the true risk.


Structural Risk Minimisation (SRM)

[Figure: the bound on the overall risk as a function of the complexity degree, for nested classes S_1 ⊂ ... ⊂ S_n with VC-dimensions h_1 ≤ ... ≤ h_n. The empirical risk decreases with complexity while the confidence interval grows; their sum, the bound on the overall risk, is minimised at some intermediate class with VC-dimension h_i.]

Consider a partition of the set S of functions from which a hypothesis is chosen as follows: S_1 ⊂ S_2 ⊂ ... ⊂ S_n ⊂ ...


Structural Risk Minimisation (SRM)

The general SRM principle:

Choose a complexity parameter d, e.g. the number of hidden units in an MLP or the size of a decision tree, and a function h ∈ H_d such that the following is minimised:

R_emp(h) + c · √( VC-dim(H_d) / m ),

where m is the number of training examples.

The higher the VC-dimension, the more likely it is that the empirical error will be low.

Structural risk minimisation seeks the right balance.
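
A minimal sketch of this selection rule. The candidate complexities, empirical risks, VC-dimensions and the constant c are made-up illustrative values, not taken from the slides.

```python
# SRM sketch: pick the complexity d minimising empirical risk + confidence term.
import math

def srm_score(remp, vc_dim, m, c=1.0):
    """Guaranteed-risk style bound: empirical risk plus confidence term."""
    return remp + c * math.sqrt(vc_dim / m)

m = 1000  # number of i.i.d. training examples
candidates = [
    # (complexity label, empirical risk, VC-dimension of H_d) -- illustrative only
    ("small tree",  0.30,  10),
    ("medium tree", 0.10,  60),
    ("large tree",  0.02, 400),
]

best = min(candidates, key=lambda cand: srm_score(cand[1], cand[2], m))
for name, remp, d in candidates:
    print(f"{name:12s}  Remp={remp:.2f}  bound={srm_score(remp, d, m):.3f}")
print("SRM choice:", best[0])   # the medium tree balances both terms here
```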


Support Vector Machines

Suppose the training data

(x_1, y_1), ..., (x_ℓ, y_ℓ),   x ∈ R^n, y ∈ {−1, +1}

can be separated by a hyperplane

(w^T · x) + b = 0.

The hyperplane which separates the training data without error and has maximal distance to the closest training vector is called the Optimal hyperplane.


Support Vector Machines

[Figure: positively (+) and negatively (−) labelled training examples in the plane, separated by the Optimal hyperplane; the margin to the closest examples is indicated.]

The Optimal separating hyperplane has maximal margin to the training examples.


Support Vector Machines

To describe hyperplanes, we use the following canonical form (which scales the coefficients):

y_i [(w^T · x_i) + b] ≥ 1,   i = 1, ..., ℓ

Then, the Optimal hyperplane is the hyperplane that satisfies the inequality above and minimises

Φ(w) = ‖w‖²

over the vector w as well as b.


Support Vector Machines

Consider a set of training vectors X* = {x_1, ..., x_r} bounded by a sphere of radius R, i.e.

‖x_i − a‖ ≤ R,   x_i ∈ X*,

where a is the centre of the sphere.

Theorem (Vapnik, 1995) A subset of canonical hyperplanes

f(x, w, b) = sign{(w^T · x) + b},

defined on X* and satisfying the constraint ‖w‖ ≤ A, has VC-dimension h bounded by the inequality

h ≤ min([R²A²], n) + 1,

where n is the number of dimensions. (In canonical form the margin is 1/‖w‖, so the constraint ‖w‖ ≤ A guarantees a margin of at least 1/A; maximising the margin therefore minimises this bound on the VC-dimension.)


Support Vector Machines

To find the Optimal hyperplane: minimise

Φ(w) = (1/2) w^T · w

under the following constraints for all i ∈ {1, ..., ℓ}:

y_i [(x_i^T · w) + b] ≥ 1.

The solution to this optimisation problem is given by the saddle point of the Lagrangian:

L(w, b, α) = (1/2)(w^T · w) − Σ_{i=1}^ℓ α_i { [(x_i^T · w) + b] y_i − 1 },

where the α_i are Lagrange multipliers.

The Lagrangian has to be minimised with respect to w and b and maximised with respect to α_i ≥ 0.


Support Vector Machines

In the saddle point, the solutions w_0, b_0, and α^0 should satisfy the conditions

∂L(w_0, b_0, α^0) / ∂b = 0   and   ∂L(w_0, b_0, α^0) / ∂w = 0


Support Vector Machines

From the previous conditions, we can derive the following for the Optimal hyperplane:

w_0 = Σ_{support vectors} y_i α_i^0 x_i,   α_i^0 ≥ 0,

which is equivalent to

w_0 = Σ_{i=1}^ℓ y_i α_i^0 x_i,   α_i^0 ≥ 0,

and

Σ_{i=1}^ℓ α_i y_i = 0


Support Vector Machines

Plugging the first equation into the Lagrangian results in the following:

L(w, b, α) = (1/2) Σ_{i,j=1}^ℓ α_i α_j y_i y_j (x_i^T · x_j) − Σ_{i=1}^ℓ α_i [ y_i ((x_i^T · Σ_{j=1}^ℓ α_j y_j x_j) + b) − 1 ]

= (1/2) Σ_{i,j=1}^ℓ α_i α_j y_i y_j (x_i^T · x_j)

− { ( Σ_{i=1}^ℓ α_i [ y_i (x_i^T · Σ_{j=1}^ℓ α_j y_j x_j) ] ) + ( Σ_{i=1}^ℓ α_i y_i b ) − ( Σ_{i=1}^ℓ α_i ) }

Using the second equation from above, i.e. Σ_{i=1}^ℓ α_i y_i = 0, we obtain:


L(w, b, α) = (1/2) Σ_{i,j=1}^ℓ α_i α_j y_i y_j (x_i^T · x_j) − { [ Σ_{i=1}^ℓ α_i y_i (x_i^T · Σ_{j=1}^ℓ α_j y_j x_j) ] − [ Σ_{i=1}^ℓ α_i ] }

From that we obtain the following function Q(α) of the Lagrange multipliers α which needs to be maximised for finding the Optimal hyperplane:

Q(α) = Σ_{i=1}^ℓ α_i − (1/2) [ Σ_{i,j=1}^ℓ α_i α_j y_i y_j (x_i^T · x_j) ]

The optimal values for α can be found using quadratic programming techniques.
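
A minimal numeric sketch of this dual problem (not from the slides): the toy data set, and the use of scipy's general-purpose SLSQP optimiser instead of a dedicated quadratic-programming solver, are illustrative choices. The recovered w_0 and b_0 follow the expressions on the surrounding slides.

```python
# Maximise Q(alpha) (i.e. minimise -Q) subject to alpha_i >= 0 and
# sum_i alpha_i y_i = 0, then recover w_0 and b_0.
import numpy as np
from scipy.optimize import minimize

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])  # toy, separable
y = np.array([-1.0, -1.0, 1.0, 1.0])

K = X @ X.T                        # Gram matrix of inner products x_i^T x_j
P = (y[:, None] * y[None, :]) * K  # y_i y_j (x_i^T x_j)

def neg_Q(alpha):
    return 0.5 * alpha @ P @ alpha - alpha.sum()

res = minimize(neg_Q, x0=np.zeros(len(y)), method="SLSQP",
               bounds=[(0, None)] * len(y),
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x
sv = alpha > 1e-4                      # support vectors (small numerical tolerance)
w = (alpha * y) @ X                    # w_0 = sum_i alpha_i y_i x_i
b = 1.0 - w @ X[sv & (y > 0)][0]       # b_0 = 1 - w_0^T x_+  (a positive support vector)
print("alpha:", np.round(alpha, 3))    # ~ [0, 1, 1, 0]
print("support vectors:", X[sv])
print("decision values:", np.sign(X @ w + b))
```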


Support Vector Machines

Once the Lagrange multipliers for the Optimal hyperplane have been determined, the following separating rule can be used by expressing the optimal weight vector in terms of the support vectors and the Lagrange multipliers:

f(x) = sign( Σ_{support vectors} y_i α_i^0 (x_i^T · x) + b_0 )

where the x_i are the support vectors, the α_i^0 are the corresponding Lagrange coefficients, and b_0 is the threshold constant

b_0 = 1 − w_0^T · x_+,

where x_+ denotes a support vector belonging to the first (positive) class. (Any support vector of that class would fulfil the equation.)


Support Vector Machines

A Support Vector Machine maps the input space into a high-dimensional feature space and then constructs an Optimal hyperplane in the feature space.


Support Vector Machines

Example:

To construct a decision surface corresponding to a polynomial of degree two, one can create a feature space Z which has N = n(n+3)/2 coordinates of the form

z_1 = x_1, ..., z_n = x_n   (n coordinates)

z_{n+1} = x_1², ..., z_{2n} = x_n²   (n coordinates)

z_{2n+1} = x_1 x_2, ..., z_N = x_{n−1} x_n   (n(n−1)/2 coordinates)

where x = (x_1, ..., x_n).

The separating hyperplane constructed in this space is a second-degree polynomial in the input space.
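
A small sketch of this explicit feature map (the ordering of the coordinates and the example vector are arbitrary choices):

```python
# Build the degree-2 feature map and check that it has N = n(n+3)/2 coordinates.
from itertools import combinations
import numpy as np

def quadratic_features(x):
    """Map x in R^n to z in R^N: linear terms, squared terms and cross terms."""
    x = np.asarray(x, dtype=float)
    linear = list(x)                                             # z_1 .. z_n
    squares = [xi ** 2 for xi in x]                              # z_{n+1} .. z_{2n}
    cross = [x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.array(linear + squares + cross)

x = np.array([1.0, 2.0, 3.0])       # n = 3
z = quadratic_features(x)
n = len(x)
print(len(z), n * (n + 3) // 2)     # both 9: N = n(n+3)/2
```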


Support Vector Machines

Two problems:

• How to find a separating hyperplane that will generalise well?

The dimensionality of the feature space will be very large. As a consequence, not all separating hyperplanes will generalise well.

• How to treat such high-dimensional spaces computationally?

A very high-dimensional feature space cannot be computed explicitly. E.g. considering polynomials of degree 4 or 5 in a 200-dimensional input space results in a billion-dimensional feature space. Obviously, "special" treatment of such spaces is required.


Support Vector Machines

Theorem (Vapnik, 1995)

If the training vectors are separated by the Optimal hyperplane, then the expected value of the probability of committing an error on a test example is bounded by the ratio of the expected number of support vectors to the number of examples in the training set:

E[P(error)] ≤ E[number of support vectors] / ((number of training vectors) − 1)

It is interesting to note that this bound depends neither on the dimensionality of the feature space, nor on the norm of the vector of coefficients, nor on the bound on the norm of the input vectors.
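
As a rough illustration using the digit-recognition experiments reported later: a machine with 274 support vectors trained on 7,300 examples gives a bound of about 274/7299 ≈ 3.8%, close to the observed raw test error of 4.0% for that classifier.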


Support Vector Machines: Kernels

Using Kernel functions to map input space to feature space.

Instead of extending the input vector of each example by a huge number of derived dimensions, such as polynomials of the input scalars, the polynomials are not computed explicitly.

Instead, SVMs use kernel functions such as

K(x, y) = (x · y)²

For x ∈ R², this is equivalent to the mapping Φ(x) onto the following 3 features: (x_1², x_2², √2 x_1 x_2).


Support Vector Machines: Kernels

K(x, y) = (x · y)² = (x_1 y_1)² + (x_2 y_2)² + 2 x_1 x_2 y_1 y_2

(Φ(x) · Φ(y)) = (x_1², x_2², √2 x_1 x_2) · (y_1², y_2², √2 y_1 y_2)^T

= x_1² y_1² + x_2² y_2² + 2 x_1 x_2 y_1 y_2

= (x_1 y_1)² + (x_2 y_2)² + 2 x_1 x_2 y_1 y_2

I.e. by using the quadratic kernel function above, the same calculations relevant for support vector machines can be performed as with an explicit mapping of the input features into polynomials of degree 2.
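
A quick numerical confirmation of this identity (illustrative only):

```python
# Check that the quadratic kernel (x·y)^2 equals the inner product of the
# explicit feature vectors (x1^2, x2^2, sqrt(2) x1 x2).
import numpy as np

def phi(v):
    return np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])

rng = np.random.default_rng(0)
x, y = rng.normal(size=2), rng.normal(size=2)
kernel_value = (x @ y) ** 2
feature_space_value = phi(x) @ phi(y)
print(np.isclose(kernel_value, feature_space_value))  # True
```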


Commonly used Kernel functions

Polynomial function:

K(x, y) = (x · y + 1)^d,   d = 1, 2, ...

E.g. for d = 2 and a 2-dimensional instance space we get:

K(x, y) = (x · y + 1)² = 1 + x_1² y_1² + 2 x_1 x_2 y_1 y_2 + x_2² y_2² + 2 x_1 y_1 + 2 x_2 y_2

which corresponds to a mapping from instance space to feature space as follows:

Φ(x) = [1, x_1², √2 x_1 x_2, x_2², √2 x_1, √2 x_2]^T


Further Kernel functions

A radial basis function machine with convolution function:

K(x, y) = exp(−‖x − y‖² / (2σ²))

or

K(x, y) = tanh(b (x^T · y) − c)

for one-hidden-layer perceptrons.
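
As a practical aside not taken from the slides, such kernels can be plugged into an off-the-shelf SVM implementation; the sketch below assumes scikit-learn, whose parameter gamma plays the role of 1/(2σ²) in the RBF kernel above.

```python
# Train a kernel SVM with the RBF kernel on a toy problem with a circular
# class boundary, then inspect the support vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(np.linalg.norm(X, axis=1) < 1.0, 1, -1)   # circular class boundary

clf = SVC(kernel="rbf", gamma=0.5, C=10.0).fit(X, y)
print("training accuracy:", clf.score(X, y))
print("support vectors per class:", clf.n_support_)
```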


Support Vector Machines

Experiments with Support Vector Machines:

Handwritten digit recognition with benchmark training and test data from the US Postal Service.

16 × 16 input values.

For each of the 10 classes an individual classifier was learned. There were 7,300 training patterns and 2,000 test patterns.

Classifier                        Raw error %
Human performance                 2.5
Decision tree learner (C4.5)      16.2
Best two-layer neural network     5.9
Five-layer network (LeNet 1)      5.1


Support Vector Machines

Polynomials up to degree 7 were used for Support Vector Machines.

degree of      dimensionality of     support     raw
polynomial     feature space         vectors     error
1              256                   282         8.9
2              ≈ 33,000              227         4.7
3              ≈ 10^6                274         4.0
4              ≈ 10^9                321         4.2
5              ≈ 10^12               374         4.3
6              ≈ 10^14               377         4.5
7              ≈ 10^16               422         4.5


Support Vector Machines

        Chosen classifier                    Number of test errors (by degree)
Digit   degree    dimensions    h_est        1    2    3    4    5    6    7
0       3         ≈ 10^6        530          36   14   11   11   11   12   17
1       7         ≈ 10^16       101          17   15   14   11   10   10   10
2       3         ≈ 10^6        842          53   32   28   26   28   27   32
3       3         ≈ 10^6        1157         57   25   22   22   22   22   23
4       4         ≈ 10^9        962          50   32   32   30   30   29   33
5       3         ≈ 10^6        1090         37   20   22   24   24   26   28
6       4         ≈ 10^9        626          23   12   12   15   17   17   19
7       5         ≈ 10^12       530          25   15   12   10   11   13   14
8       4         ≈ 10^9        1445         71   33   28   24   28   32   44
9       5         ≈ 10^12       1226         51   18   15   11   11   12   15


Summary - Part 3

• Structural Risk Minimisation can be applied in many learning settings, e.g. for deciding the number of hidden units in an MLP or the size of a decision tree depending on the available number of i.i.d. training examples.

• Support Vector Machines (SVMs) are powerful learning machines which select relevant features from a very high-dimensional feature space.

• SVMs find a separating hyperplane for the data and use quadratic programming to find the Optimal hyperplane.

• Those data points which have the minimal distance to the Optimal hyperplane are called support vectors.


Summary

Strengths of the VC Learning Theory

• The application of results from VC learning theory to practical learning has been demonstrated.

• Structural Risk Minimisation (SRM) is an important concept for the design of practical learning algorithms.

• However, the proper choice of a substructure has to be made a priori, i.e. before seeing the data.

• Support Vector Machines have been shown to be capable of learning well from very high-dimensional feature spaces.


Summary (cont.)

Limitations of the VC Learning Theory

• The probabilistic bounds on the required number of examples are worst-case analyses.

• No preference relation on the functions of the learner is modelled.

• In practice, learning examples are not necessarily drawn from the same probability distribution as the test examples.
