2012 mdsp pr13 support vector machine


Transcript of 2012 mdsp pr13 support vector machine

Page 1: 2012 mdsp pr13 support vector machine

Course Calendar (revised 2012 Dec. 27)

Class DATE Contents

1 Sep. 26 Course information & Course overview

2 Oct. 4 Bayes Estimation

3 〃 11 Classical Bayes Estimation - Kalman Filter -

4 〃 18 Simulation-based Bayesian Methods

5 〃 25 Modern Bayesian Estimation: Particle Filter

6 Nov. 1 HMM (Hidden Markov Model)

Nov. 8 No Class

7 〃 15 Bayesian Decision

8 〃 29 Nonparametric Approaches

9 Dec. 6 PCA (Principal Component Analysis)

10 〃 13 ICA (Independent Component Analysis)

11 〃 20 Applications of PCA and ICA

12 〃 27 Clustering: k-means, Gaussian Mixture and EM

13 Jan. 17 Support Vector Machine

14 〃 22 (Tue) No Class

Page 2: 2012 mdsp pr13 support vector machine

Lecture Plan

Support Vector Machine

1. Linear Discriminative Machine

Perceptron Learning rule

2. Support Vector Machine

Problem setting, Optimization

3. Generalization of SVM

Page 3: 2012 mdsp pr13 support vector machine


1. Introduction

1.1 Classical Linear Discriminative Function - Perceptron Machine -

Consider the two-category linear discriminative problem using a perceptron-type machine.

- Assumption

Two-category (𝐶1 , 𝐶2) training data in D-dimensional feature space are

separable by a linear discriminative function of the form

f(x) = w^T x

which satisfies

f(x) ≥ 0 for x ∈ C1
f(x) < 0 for x ∈ C2

where w = (w_0, w_1, ..., w_D)^T is the (D+1)-dim. weight vector and
x = (1, x_1, ..., x_D)^T is the augmented feature vector (x_0 = 1).

Here, f(x) = w^T x = 0 gives the hyperplane surface which separates the
two categories, and its normal vector is w.

Page 4: 2012 mdsp pr13 support vector machine

f(x) = Σ_{i=0}^{D} w_i x_i   (x_0 = 1)

[Fig. 1 Perceptron: the inputs x_0 = 1, x_1, ..., x_D are weighted by w_0, w_1, ..., w_D and summed to form f(x)]

[Fig. 2 Linear Discrimination: Class C1 and Class C2 in x-space, separated by the hyperplane f(x) = 0]

Page 5: 2012 mdsp pr13 support vector machine

1.2 Learning Rule of Perceptron (η = 1 case)

- Reverse the training vectors of class C2:

  x_new^(i) = -x^(i)   for x^(i) ∈ C2

- Initial weight vector: w^(0)

- For the new training dataset x^(1), x^(2), ..., update

  w^(i+1) = w^(i)             if f(x^(i)) ≥ 0
  w^(i+1) = w^(i) + η x^(i)   if f(x^(i)) < 0

  where η determines the convergence speed of learning.

[Fig. 3 Reversed (反転, reflected) data of class C2: Class C1 together with the reversed class C2 data, and the hyperplane f(x) = 0]
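As a concrete illustration only (not part of the original slides), the following is a minimal NumPy sketch of this learning rule under the stated assumptions: augmented vectors with x_0 = 1 and the class-C2 sign reversal. The update is applied whenever f(x) ≤ 0 so that the zero initial weight vector also gets corrected; function and variable names are illustrative.

import numpy as np

def perceptron_train(X1, X2, eta=1.0, max_epochs=100):
    """Perceptron learning with the class-C2 sign reversal of Sec. 1.2.

    X1, X2: arrays of shape (N1, D) and (N2, D) holding the raw feature
    vectors of class C1 and C2. Returns the (D+1)-dim weight vector w,
    or None if no separating hyperplane is found within max_epochs.
    """
    # Augment with x_0 = 1, then reverse (negate) the class-C2 vectors.
    Z1 = np.hstack([np.ones((len(X1), 1)), X1])
    Z2 = -np.hstack([np.ones((len(X2), 1)), X2])
    Z = np.vstack([Z1, Z2])

    w = np.zeros(Z.shape[1])            # initial weight vector w^(0)
    for _ in range(max_epochs):
        updated = False
        for z in Z:                     # present the training vectors one by one
            if w @ z <= 0:              # not (yet) strictly on the positive side
                w = w + eta * z         # w^(i+1) = w^(i) + eta * x^(i)
                updated = True
        if not updated:                 # every sample now satisfies f(x) > 0
            return w
    return None

# Example: two linearly separable point clouds in 2-D.
w = perceptron_train(np.array([[2.0, 2.0], [3.0, 1.0]]),
                     np.array([[-1.0, -1.0], [-2.0, 0.0]]))
print(w)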

Page 6: 2012 mdsp pr13 support vector machine

[Illustration of the weight update scheme on training data: (a) i = 0, the initial weight w0 defines hyperplane H0; (b) i = 1, the misclassified sample x1 gives w1 = w0 + x1 and the new hyperplane H1]

Page 7: 2012 mdsp pr13 support vector machine

[Fig. 4 Learning process of the Perceptron: (c) i = 2, the sample x2 gives w2 = w1 + x2 and the updated hyperplane H2, following H0 and H1]

Page 8: 2012 mdsp pr13 support vector machine


2. Support Vector Machine (SVM)

2.1 Problem Setting

Given a linearly separable two-category (C1, C2) training dataset with

class labels

(x_i, t_i),   i = 1 ~ N

where x_i : D-dimensional feature vector
      t_i ∈ {-1, 1}: "1" for C1 and "-1" for C2

Find a separating hyperplane H

f(x) = w^T x + b = 0

- Among the set of possible separating hyperplanes, we seek the one which is farthest from all training sample vectors.

- The obtained discriminant hyperplane will give better generalization

capability. (*)

(*) That is, the hyperplane is expected to perform well on test data outside the training set.

Page 9: 2012 mdsp pr13 support vector machine


Motivation of SVM

The optimal discriminative hyperplane should have the largest

margin which is defined as the minimum distance of the training

vectors to the separation surface.

[Fig. 5 Margin: Class C1 and Class C2 with the separating hyperplane; the margin is the distance to the closest training vectors]

Page 10: 2012 mdsp pr13 support vector machine

2.2 Optimization problem

The distance between a hyperplane

  w^T x + b = 0      (1)

and a sample point x_i is given by (see Appendix)

  |w^T x_i + b| / ||w||      (2)

Since a scalar (k) multiple (kw, kb) of the pair (w, b) gives the same hyperplane, we choose the optimal hyperplane given by the discriminative function satisfying

  |w^T x_i + b| = 1   (Canonical hyperplane)      (3)

where x_i in (3) is the closest vector to the separation surface.

Page 11: 2012 mdsp pr13 support vector machine

Appendix: distance between x_i and the hyperplane w^T x + b = 0 (Fig. 6)

Take the point x_p = -(b / ||w||^2) w, which lies on the hyperplane because
w^T x_p + b = -b + b = 0. Projecting x_i - x_p onto the unit normal w / ||w|| gives the distance

  d(x_i) = w^T (x_i - x_p) / ||w|| = (w^T x_i + b) / ||w||

[Fig. 6: the hyperplane w^T x + b = 0, its normal vector w, a sample x_i, its projection x_q onto the hyperplane, and the point x_p]
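A quick numerical check of (2), added here for illustration (not from the slides): the projection-based distance of the Appendix and the closed-form expression agree for an arbitrary point and hyperplane. The numbers below are arbitrary.

import numpy as np

w = np.array([3.0, -1.0])
b = 2.0
x_i = np.array([1.5, 4.0])

# Closed form (2): (w^T x_i + b) / ||w||   (signed distance)
d_formula = (w @ x_i + b) / np.linalg.norm(w)

# Projection route from the Appendix: x_p = -(b/||w||^2) w lies on the plane,
# and the distance is the component of x_i - x_p along the unit normal w/||w||.
x_p = -(b / (w @ w)) * w
d_projection = w @ (x_i - x_p) / np.linalg.norm(w)

print(d_formula, d_projection)   # identical values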

Page 12: 2012 mdsp pr13 support vector machine

- The distance (2) from the closest training vector to the decision surface is

  |w^T x_i + b| / ||w|| = 1 / ||w||

- The margin is 2 / ||w||      (4)

- If t_i = 1 (C1) then w^T x_i + b ≥ 1
  If t_i = -1 (C2) then w^T x_i + b ≤ -1

  therefore

  t_i (w^T x_i + b) ≥ 1      (5)

[Fig. 7 Margin and distance: the hyperplane, the distance |w^T x_i + b| / ||w|| to a sample, and the margin 2 / ||w||]

Page 13: 2012 mdsp pr13 support vector machine

Optimization problem - Maximization of the margin -

  Minimize   J(w) := (1/2) ||w||^2 = (1/2) w^T w      (6)

  Subject to  t_i (w^T x_i + b) ≥ 1   (i = 1 ~ N)      (7)

Since J(w) is a quadratic function with respect to w, there exists a (unique) global minimum.
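As an added sketch (not from the slides), the primal problem (6)-(7) can be solved numerically for a small linearly separable toy dataset with SciPy; the data, variable names `X`, `t`, and the solver choice are illustrative assumptions.

import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data: t = +1 for C1, -1 for C2.
X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [-1.0, 0.5]])
t = np.array([1.0, 1.0, -1.0, -1.0])

D = X.shape[1]

def objective(theta):
    w = theta[:D]                       # J(w) = (1/2) w^T w, Eq. (6)
    return 0.5 * w @ w

def constraint(theta):
    w, b = theta[:D], theta[D]          # t_i (w^T x_i + b) - 1 >= 0, Eq. (7)
    return t * (X @ w + b) - 1.0

res = minimize(objective, x0=np.zeros(D + 1), method="SLSQP",
               constraints=[{"type": "ineq", "fun": constraint}])
w_opt, b_opt = res.x[:D], res.x[D]
print("w =", w_opt, "b =", b_opt, "margin =", 2.0 / np.linalg.norm(w_opt))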

Page 14: 2012 mdsp pr13 support vector machine

2.3 Lagrangian multiplier approach - general theory -

Kuhn-Tucker Theorem

  Minimize   J(z)   (z in a convex space)
  Subject to g_i(z) ≥ 0   (i = 1 ~ k)

The necessary and sufficient conditions (optimization conditions) for a point z* to be an optimum are the existence of α* = (α_1*, ..., α_k*) such that the Lagrangian function

  L(z, α) := J(z) - Σ_{i=1}^{k} α_i g_i(z)      (8)

satisfies

  (i)   ∂L(z*, α*)/∂z = 0      (9)
  (ii)  α_i* g_i(z*) = 0   (i = 1, ..., k)      (10)
  (iii) g_i(z*) ≥ 0      (11)
  (iv)  α_i* ≥ 0      (12)
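As a small worked illustration (added here; not on the original slide), consider the one-dimensional problem of minimizing J(z) = z^2 subject to g(z) = z - 1 ≥ 0. The Lagrangian is

  L(z, α) = z^2 - α (z - 1)

Condition (i) gives 2z* - α* = 0, condition (ii) gives α* (z* - 1) = 0, and conditions (iii), (iv) require z* ≥ 1 and α* ≥ 0. Taking α* = 0 would force z* = 0, which violates (iii); the only consistent solution is z* = 1 with α* = 2 > 0, i.e. the constraint is active and the unconstrained minimum z = 0 is excluded.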

Page 15: 2012 mdsp pr13 support vector machine

- The second condition (10), called the Karush-Kuhn-Tucker (KKT) condition or complementary condition, implies that a constraint is active (g_i(z*) = 0) whenever α_i > 0, and that α_i = 0 for inactive constraints.

2.4 Dual Problem

Apply the K-T theorem to Eqs. (6) and (7), with z = (w, b):

- Lagrangian

  L_p(w, b, α) := (1/2) w^T w - Σ_{i=1}^{N} α_i { t_i (w^T x_i + b) - 1 }      (13)

- Condition (i), by substituting z = (w, b), gives

  ∂L_p(w, b, α)/∂w = 0   →   w = Σ_{i=1}^{N} α_i t_i x_i      (14)

  ∂L_p(w, b, α)/∂b = 0   →   Σ_{i=1}^{N} α_i t_i = 0      (15)

Page 16: 2012 mdsp pr13 support vector machine

Substituting (14) and (15) into (13):

  L_p(w, b, α) = (1/2) w^T w - Σ_i α_i t_i w^T x_i - b Σ_i α_i t_i + Σ_i α_i
               =      I      -        K            -      0        + Σ_i α_i

where, using w = Σ_i α_i t_i x_i from (14),

  I := (1/2) w^T w = (1/2) Σ_{i,j=1}^{N} α_i α_j t_i t_j x_i^T x_j

  K := Σ_{i=1}^{N} α_i t_i w^T x_i = Σ_{i,j=1}^{N} α_i α_j t_i t_j x_i^T x_j

and the b-term vanishes by (15). Hence the dual Lagrangian

  L_D(α) := L_p(w(α), b, α) = Σ_{i=1}^{N} α_i - (1/2) Σ_{i,j=1}^{N} α_i α_j t_i t_j x_i^T x_j      (16)

is maximized subject to

  α_i ≥ 0   and   Σ_{i=1}^{N} α_i t_i = 0      (17)

Page 17: 2012 mdsp pr13 support vector machine

- The dual problem is easier to solve because L_D depends only on the α_i, not on (w, b).
- L_D contains the training data only through the inner products x_i^T x_j.
- Geometric interpretation of KKT condition (ii), Eq. (10):

  α_i { t_i (w^T x_i + b) - 1 } = 0   (i = 1 ~ N)      (18)

  means that either α_i = 0 or t_i (w^T x_i + b) = 1 must hold at every training vector.
  A vector x_j with α_j > 0 must lie on one of the margin hyperplanes t_j (w^T x_j + b) = 1; that is, its active constraint determines the largest margin.
  (Such an x_j is called a support vector, see Fig. 8.)
  At all other points α_i = 0 (inactive constraint points).

Page 18: 2012 mdsp pr13 support vector machine

- Only the support vectors contribute to determining the hyperplane, because

  w = Σ_{i} α_i t_i x_i = Σ_{α_i > 0} α_i t_i x_i      (19)

- Hyperplane:   Σ_{α_i > 0} α_i t_i x_i^T x + b = 0      (20)

- The KKT condition (10) is used to determine the bias b.

[Fig. 8 KKT conditions: support vectors (α_i > 0) lie on the margin hyperplanes; all other training points are inactive constraint points with α_i = 0]
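The dual route of pages 15-18 can be sketched numerically as follows (an added illustration, not the lecture's own code): maximize (16) subject to (17) with SciPy, keep the α_i > 0 as support vectors, recover w from (19), and obtain b from the KKT condition t_j (w^T x_j + b) = 1 on each support vector. The toy data and names (`X`, `t`, `alpha`) are illustrative.

import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data: t = +1 for C1, -1 for C2.
X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [-1.0, 0.5]])
t = np.array([1.0, 1.0, -1.0, -1.0])
N = len(t)

# Gram matrix of inner products x_i^T x_j (Eq. (16) needs only these).
K = X @ X.T
H = (t[:, None] * t[None, :]) * K

def neg_dual(alpha):
    # -L_D(alpha): we minimize the negative of Eq. (16).
    return 0.5 * alpha @ H @ alpha - alpha.sum()

res = minimize(neg_dual, x0=np.zeros(N), method="SLSQP",
               bounds=[(0.0, None)] * N,                    # alpha_i >= 0
               constraints=[{"type": "eq",
                             "fun": lambda a: a @ t}])       # sum_i alpha_i t_i = 0
alpha = res.x

sv = alpha > 1e-6                        # support vectors: alpha_i > 0
w = (alpha[sv] * t[sv]) @ X[sv]          # Eq. (19)
# KKT: t_j (w^T x_j + b) = 1 on each support vector -> b = t_j - w^T x_j
b = np.mean(t[sv] - X[sv] @ w)
print("support vectors:\n", X[sv], "\nw =", w, "b =", b)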

Page 19: 2012 mdsp pr13 support vector machine


3. Generalization of SVM

3.1 Non-separable case

- Introduce slack variables ξi in order to relax the constraint (7) as

follows;

t_i (w^T x_i + b) ≥ 1 - ξ_i

For ξ_i = 0, the data point is correctly classified with margin.

For 0 < ξ_i ≤ 1, the data point is still correctly classified but falls inside the margin region.

For ξ_i > 1, the data point falls on the wrong side of the separating surface.

Define the slack variable

  ξ_i := ramp{1 - t_i (w^T x_i + b)}

where ramp{u} = u for u > 0 and ramp{u} = 0 for u ≤ 0.

New Optimization Problem:

  Minimize   L_p(w, ξ) := (1/2) w^T w + C Σ_{i=1}^{N} ξ_i      (21)

  subject to  t_i (w^T x_i + b) - 1 + ξ_i ≥ 0,
              ξ_i ≥ 0   (i = 1 ~ N)      (22)

where the constant C controls the trade-off between the slack penalty and the margin.
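For a practical illustration (added here, not from the slides), scikit-learn's SVC with a linear kernel solves essentially this C-penalized soft-margin problem; the toy overlapping data and the two C values below are arbitrary choices to show the effect of the penalty.

import numpy as np
from sklearn.svm import SVC

# Overlapping (non-separable) toy data; labels follow t_i in {+1, -1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[2, 2], scale=1.0, size=(20, 2)),
               rng.normal(loc=[0, 0], scale=1.0, size=(20, 2))])
t = np.hstack([np.ones(20), -np.ones(20)])

# C is the penalty on the total slack: small C tolerates more margin
# violations (wider margin), large C approaches the hard-margin SVM.
for C in (0.1, 10.0):
    clf = SVC(kernel="linear", C=C).fit(X, t)
    print(f"C={C}: {len(clf.support_vectors_)} support vectors, "
          f"w={clf.coef_[0]}, b={clf.intercept_[0]:.3f}")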

Page 20: 2012 mdsp pr13 support vector machine

[Fig. 9 Non-separable case and slack variables: the optimum hyperplane w^T x + b = 0 and the margin hyperplanes w^T x + b = ±1 for the classes t_i = 1 and t_i = -1; support vectors are marked, and sample points are annotated with slack values such as ξ_i = 0, 0.5, 1.5, 2]

Page 21: 2012 mdsp pr13 support vector machine


3.2 Nonlinear SVM

- For separation problems that require a nonlinear discriminative surface, a nonlinear mapping approach is useful.

- Cover’s theorem: A complex pattern classification problem cast in a

high-dimensional space non-linearly is more likely to be linearly

separable than in a low-dimensional space.

[Fig. 10 Nonlinear mapping: a vector x in x-space is mapped to z = φ(x) in the higher-dimensional z-space, where a linear SVM is applied]

Page 22: 2012 mdsp pr13 support vector machine

Let the nonlinear mapping be

  φ(x) = ( 1, φ_1(x), ..., φ_M(x) )^T   (M ≫ D)      (23)

- Hyperplane in φ-space:   w^T φ(x) = 0

- The SVM in φ-space gives an optimum hyperplane of the form

  w = Σ_i α_i t_i φ(x_i)   (sum over the support vectors in φ-space)      (24)

- Discriminative function:

  w^T φ(x) = Σ_i α_i t_i φ(x_i)^T φ(x)   (inner products in the M-dim. space)

- If we can choose φ which satisfies

  φ(x_i)^T φ(x_j) = K(x_i, x_j)      (25)

  (an inner product in the φ-domain becomes a kernel function in the x-domain),
  the computational cost will be drastically reduced.
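A small added sketch (not from the slides) of this kernel form of the discriminant, using a Gaussian (RBF) kernel: scikit-learn's SVC stores the products α_i t_i as dual coefficients, so Σ_i α_i t_i K(x_i, x) + b over the support vectors reproduces the library's own decision function. The helper `gauss_kernel` and the toy data are illustrative.

import numpy as np
from sklearn.svm import SVC

def gauss_kernel(a, b, gamma=1.0):
    # K(u, v) = exp(-gamma * ||u - v||^2): inner product in an implicit phi-space.
    return np.exp(-gamma * np.sum((a - b) ** 2, axis=-1))

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
t = np.where(np.sum(X ** 2, axis=1) > 1.0, 1.0, -1.0)   # nonlinear (circular) boundary

clf = SVC(kernel="rbf", gamma=1.0, C=10.0).fit(X, t)

# Discriminant f(x) = sum_i alpha_i t_i K(x_i, x) + b over the support vectors only.
x_new = np.array([0.3, -0.2])
k = gauss_kernel(clf.support_vectors_, x_new, gamma=1.0)
f = clf.dual_coef_[0] @ k + clf.intercept_[0]
print(f, clf.decision_function([x_new])[0])   # the two values agree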

Page 23: 2012 mdsp pr13 support vector machine

Ex) Polynomial kernel (2-dimensional case)

  K(u, v) = (u^T v + 1)^2 = φ(u)^T φ(v)

  where φ(u) = ( 1, u_1^2, √2 u_1 u_2, u_2^2, √2 u_1, √2 u_2 )^T
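A quick numerical check of this identity, added for illustration: the kernel evaluated in the x-domain equals the inner product of the explicitly mapped vectors φ(u) and φ(v). The sample vectors are arbitrary.

import numpy as np

def phi(u):
    # Explicit feature map for the 2-D polynomial kernel (u^T v + 1)^2.
    u1, u2 = u
    return np.array([1.0, u1**2, np.sqrt(2) * u1 * u2, u2**2,
                     np.sqrt(2) * u1, np.sqrt(2) * u2])

u = np.array([0.7, -1.2])
v = np.array([2.0, 0.5])
print((u @ v + 1.0) ** 2)        # kernel in the x-domain
print(phi(u) @ phi(v))           # inner product in the phi-domain: same value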

Ex) Nonlinear SVM result obtained with a Gauss (RBF) kernel

[Fig. 11 Decision boundary and support vectors of a Gaussian-kernel SVM (from Bishop [1])]

Page 24: 2012 mdsp pr13 support vector machine


References:

[1] C. M. Bishop, “Pattern Recognition and Machine Learning”,

Springer, 2006

[2] R.O. Duda, P.E. Hart, and D. G. Stork, “Pattern Classification”,

John Wiley & Sons, 2nd edition, 2004

[3] 平井有三 (Y. Hirai), 「はじめてのパターン認識」, 森北出版 (Morikita Publishing), 2012 (in Japanese)