CS 2750 Machine Learning, Lecture 9 (people.cs.pitt.edu/.../cs2750-Spring2012/Lectures/class9.pdf)
CS 2750 Machine Learning
Lecture 9
Milos Hauskrecht
5329 Sennott Square
Support vector machines
Outline:
• Fisher linear discriminant
• Algorithms for linear decision boundaries
• Support vector machines
  – Maximum margin hyperplane
  – Support vectors
  – Support vector machines
  – Extensions to the non-separable case
  – Kernel functions
Linear decision boundaries
• What models define linear decision boundaries?
[Figure: two classes of points in 2D separated by a linear decision boundary; the regions satisfy g_1(x) ≥ g_0(x) and g_1(x) < g_0(x).]
Logistic regression model
• Discriminant functions:
  g_1(x) = g(w^T x),   g_0(x) = 1 − g(w^T x)
• Model output:  f(x, w) = g(w^T x)
• where g(z) = 1 / (1 + e^{−z}) is a logistic function
[Figure: network view of the model; the input vector x = (x_1, …, x_d) plus a constant input 1 are combined with weights w_0, w_1, …, w_d into z = w^T x, which is passed through the logistic function to produce f(x, w).]
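The two discriminants above can be sketched in a few lines of numpy; the weights below are arbitrary illustration values, not learned parameters:

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def discriminants(x, w, w0):
    """g_1(x) = g(w^T x + w_0) and g_0(x) = 1 - g_1(x)."""
    g1 = logistic(w @ x + w0)
    return g1, 1.0 - g1

# Illustrative (not learned) weights; the bias w_0 plays the role
# of the weight on the constant input 1 in the slide's diagram.
w, w0 = np.array([2.0, -1.0]), 0.5
g1, g0 = discriminants(np.array([1.0, 1.0]), w, w0)
# Predict class 1 when g1 >= g0; g1 + g0 = 1 by construction.
```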
Linear discriminant analysis (LDA)
• When the covariances are the same:
  x ~ N(μ_0, Σ) for y = 0
  x ~ N(μ_1, Σ) for y = 1
Linear decision boundaries
• Any other models/algorithms?
[Figure: the same 2D dataset with a linear decision boundary separating the region g_1(x) ≥ g_0(x) from g_1(x) < g_0(x).]
Fisher linear discriminant
• Project the data into one dimension:  y = w^T x
• How to find the projection line?
• Decision:  w^T x + w_0 ≥ 0
Fisher linear discriminant
How to find the projection line  y = w^T x ?
Fisher linear discriminant
Assume the class means:
  m_1 = (1/N_1) Σ_{i∈C_1} x_i,   m_2 = (1/N_2) Σ_{i∈C_2} x_i
Maximize the difference in the projected means:
  m̃_2 − m̃_1 = w^T (m_2 − m_1)
Fisher linear discriminant
Problem 1:  m̃_2 − m̃_1 = w^T (m_2 − m_1)  can be maximized simply by increasing the magnitude of w
Problem 2: the variance of the class distributions changes after the projection
Fisher's solution: maximize
  J(w) = (m̃_2 − m̃_1)^2 / (s̃_1^2 + s̃_2^2)
where  s̃_k^2 = Σ_{i∈C_k} (y_i − m̃_k)^2  is the within-class variance after the projection
Fisher linear discriminant
Criterion:
  J(w) = (m̃_2 − m̃_1)^2 / (s̃_1^2 + s̃_2^2)
where  s̃_k^2 = Σ_{i∈C_k} (y_i − m̃_k)^2  is the within-class variance after the projection
Optimal solution:
  w ∝ S_w^{−1} (m_2 − m_1)
where the within-class scatter matrix is
  S_w = Σ_{i∈C_1} (x_i − m_1)(x_i − m_1)^T + Σ_{i∈C_2} (x_i − m_2)(x_i − m_2)^T
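The optimal direction above can be sketched directly in numpy; the two small classes here are invented toy data, not from the lecture:

```python
import numpy as np

# Two small toy classes (illustrative data)
X1 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]])   # class C_1
X2 = np.array([[6.0, 5.0], [7.0, 8.0], [8.0, 7.0]])   # class C_2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)             # class means

# Within-class scatter S_w: sum of outer products of centered points
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Fisher direction: w proportional to S_w^{-1} (m_2 - m_1)
w = np.linalg.solve(Sw, m2 - m1)
```

Along this direction the projected class means are well separated relative to the within-class spread, which is exactly what J(w) rewards.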
Linearly separable classes
There is a hyperplane that separates the training instances with no error:
  Hyperplane:  w^T x + w_0 = 0
  Class (+1):  w^T x + w_0 > 0
  Class (−1):  w^T x + w_0 < 0
Algorithms for linearly separable sets
• Separating hyperplane:  w^T x + w_0 = 0
• We can use gradient methods or Newton-Raphson for sigmoidal switching functions and learn the weights
• Recall that we learn the linear decision boundary
[Figure: linear unit with constant input 1 and inputs x_1, …, x_d weighted by w_0, w_1, …, w_d; the output is thresholded at y ≥ 0.5?]
Algorithms for linearly separable sets
• Separating hyperplane:  w^T x + w_0 = 0
[Figure: the same linear unit, now thresholded at y ≥ 0?]
Algorithms for linearly separable sets
Perceptron algorithm:
• Simple iterative procedure for modifying the weights of the linear model
• Works for inputs x where each x_i is in [0,1]
Initialize the weights w
Loop through the examples (x, y) in the dataset D:
  1. Compute  ŷ = sign(w^T x)
  2. If  y = +1 and ŷ = −1  then  w ← w + x
  3. If  y = −1 and ŷ = +1  then  w ← w − x
Until all examples are classified correctly
Properties:
• guaranteed convergence if the classes are linearly separable
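The loop above can be sketched as follows; the bias w_0 is absorbed by appending a constant 1 to each input, and the toy dataset is an invented illustration:

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Perceptron updates: on a mistake, w <- w + x if y = +1,
    w <- w - x if y = -1 (written compactly as w <- w + y*x).
    A constant 1 is appended to each x so w_0 is learned too."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # absorb the bias w_0
    w = np.zeros(Xb.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x, t in zip(Xb, y):
            if t * (w @ x) <= 0:                # misclassified example
                w += t * x
                errors += 1
        if errors == 0:                         # all classified correctly
            break
    return w

# Tiny linearly separable set (illustrative):
X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
```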
Algorithms for linearly separable sets
Linear program solution:
• Finds weights that satisfy the following constraints:
  w^T x_i + w_0 > 0  for all i such that y_i = +1
  w^T x_i + w_0 < 0  for all i such that y_i = −1
  Together:  y_i (w^T x_i + w_0) > 0
Property: if there is a hyperplane separating the examples, the linear program finds a solution
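A minimal sketch of this linear-program view, assuming scipy is available; since LP solvers cannot handle strict inequalities, the constraints are tightened to y_i (w^T x_i + w_0) ≥ 1 (any separable set admits such a scaling), and the toy points are illustrative:

```python
import numpy as np
from scipy.optimize import linprog

# Tiny separable set (illustrative)
X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Variables z = [w_1, w_2, w_0].  Feasibility constraints
# y_i (w^T x_i + w_0) >= 1 become A_ub z <= b_ub with
# A_ub rows -y_i * [x_i, 1] and b_ub = -1.
A_ub = -(y[:, None] * np.hstack([X, np.ones((len(X), 1))]))
b_ub = -np.ones(len(X))

# Zero objective: we only ask the LP for a feasible point.
res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * 3)
w, w0 = res.x[:2], res.x[2]   # some feasible separating hyperplane
```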
Optimal separating hyperplane
• There are multiple hyperplanes that separate the data points
  – Which one to choose?
• Maximum margin choice: maximize the distance d⁺ + d⁻
  – where d⁺ (d⁻) is the shortest distance of a positive (negative) example from the hyperplane
[Figure: separating hyperplane with the closest positive and negative examples at distances d⁺ and d⁻ on either side.]
Maximum margin hyperplane
• For the maximum margin hyperplane only examples on the
margin matter (only these affect the distances)
• These are called support vectors
Finding maximum margin hyperplanes
• Assume the examples in the training set are (x_i, y_i) with y_i ∈ {−1, +1}
• Assume that all data satisfy:
  w^T x_i + w_0 ≥ 1   for y_i = +1
  w^T x_i + w_0 ≤ −1  for y_i = −1
• The inequalities can be combined as:
  y_i (w^T x_i + w_0) − 1 ≥ 0   for all i
• The equalities define two hyperplanes:
  w^T x_i + w_0 = 1   and   w^T x_i + w_0 = −1
Finding the maximum margin hyperplane
• Geometric margin:
  ρ(x, y; w, w_0) = y (w^T x + w_0) / ||w||_2
  – measures the distance of a point x from the hyperplane
  – w is normal to the hyperplane; ||·||_2 is the Euclidean norm
• For points satisfying  y_i (w^T x_i + w_0) − 1 = 0  the distance is 1 / ||w||_2
• Width of the margin:
  d⁺ + d⁻ = 2 / ||w||_2
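The geometric margin is a one-line computation; the hyperplane and test point below are arbitrary illustration values:

```python
import numpy as np

def geometric_margin(x, y, w, w0):
    """rho(x, y; w, w_0) = y (w^T x + w_0) / ||w||_2: signed distance
    of x from the hyperplane (positive iff x is correctly classified)."""
    return y * (w @ x + w0) / np.linalg.norm(w)

# Illustrative hyperplane x_1 + x_2 = 0, i.e. w = (1, 1), w_0 = 0:
w, w0 = np.array([1.0, 1.0]), 0.0
rho = geometric_margin(np.array([1.0, 1.0]), +1, w, w0)
```

For a point on one of the canonical hyperplanes (y (w^T x + w_0) = 1), this reduces to 1 / ||w||_2, which is where the margin width 2 / ||w||_2 comes from.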
Maximum margin hyperplane
• We want to maximize the margin  d⁺ + d⁻ = 2 / ||w||_2
• We do it by minimizing  ||w||_2^2 / 2 = w^T w / 2
  – w, w_0 are the variables
  – But we also need to enforce the constraints on the points:
    y_i (w^T x_i + w_0) − 1 ≥ 0
Maximum margin hyperplane
• Solution: incorporate the constraints into the optimization
• Optimization problem (Lagrangian):
  J(w, w_0, α) = ||w||^2 / 2 − Σ_{i=1}^n α_i [ y_i (w^T x_i + w_0) − 1 ]
  – α_i ≥ 0 are the Lagrange multipliers
• Minimize with respect to w, w_0 (primal variables)
• Maximize with respect to α (dual variables)
The Lagrange multipliers enforce the satisfaction of the constraints:
  If  y_i (w^T x_i + w_0) − 1 > 0  then  α_i = 0
  Else  α_i > 0  (active constraint)
Max margin hyperplane solution
• Set the derivatives to 0 (Kuhn-Tucker conditions):
  ∂J(w, w_0, α)/∂w = w − Σ_{i=1}^n α_i y_i x_i = 0
  ∂J(w, w_0, α)/∂w_0 = Σ_{i=1}^n α_i y_i = 0
• Now we need to solve for the Lagrange parameters (Wolfe dual): maximize
  J(α) = Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j y_i y_j (x_i^T x_j)
  subject to the constraints  α_i ≥ 0 for all i, and  Σ_{i=1}^n α_i y_i = 0
• Quadratic optimization problem: solution α̂_i for all i
Maximum margin solution
• The resulting parameter vector ŵ can be expressed as:
  ŵ = Σ_{i=1}^n α̂_i y_i x_i    where α̂ is the solution of the dual problem
• The parameter ŵ_0 is obtained through the Karush-Kuhn-Tucker (KKT) conditions:
  α̂_i [ y_i (ŵ^T x_i + ŵ_0) − 1 ] = 0
Solution properties:
• α̂_i = 0 for all points that are not on the margin
• ŵ is a linear combination of the support vectors only
• The decision boundary:
  ŵ^T x + ŵ_0 = Σ_{i∈SV} α̂_i y_i (x_i^T x) + ŵ_0 = 0
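For two opposite-class points the dual above can be solved in closed form, which gives a small check of these formulas; the two points are an assumption chosen for illustration:

```python
import numpy as np

# Two toy support vectors, one per class (illustrative):
x1, x2 = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
y1, y2 = 1.0, -1.0

# The constraint sum_i alpha_i y_i = 0 forces alpha_1 = alpha_2 = a,
# so J(a) = 2a - (1/2) a^2 ||x1 - x2||^2, maximized at:
a = 2.0 / (np.linalg.norm(x1 - x2) ** 2)

# w_hat = sum_i alpha_i y_i x_i
w = a * (y1 * x1 + y2 * x2)

# w_0 from the KKT condition y_1 (w^T x_1 + w_0) = 1:
w0 = 1.0 / y1 - w @ x1

# Both points then sit exactly on the margin hyperplanes
# w^T x + w_0 = +1 and w^T x + w_0 = -1.
```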
Support vector machines
• The decision boundary:
  ŵ^T x + ŵ_0 = Σ_{i∈SV} α̂_i y_i (x_i^T x) + ŵ_0 = 0
• The decision:
  ŷ = sign( Σ_{i∈SV} α̂_i y_i (x_i^T x) + ŵ_0 )
Support vector machines
• The decision boundary:
  ŵ^T x + ŵ_0 = Σ_{i∈SV} α̂_i y_i (x_i^T x) + ŵ_0 = 0
• The decision:
  ŷ = sign( Σ_{i∈SV} α̂_i y_i (x_i^T x) + ŵ_0 )
• (!!): The decision on a new x requires computing only the inner products (x_i^T x) between the examples
• Similarly, the optimization depends only on the inner products (x_i^T x_j):
  J(α) = Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j y_i y_j (x_i^T x_j)
Extension to a linearly non-separable case
• Idea: Allow some flexibility on crossing the separating
hyperplane
Extension to the linearly non-separable case
• Relax the constraints with slack variables ξ_i ≥ 0:
  w^T x_i + w_0 ≥ 1 − ξ_i   for y_i = +1
  w^T x_i + w_0 ≤ −1 + ξ_i  for y_i = −1
• An error occurs if ξ_i ≥ 1;  Σ_{i=1}^n ξ_i is an upper bound on the number of errors
• Introduce a penalty for the errors:
  minimize  ||w||^2 / 2 + C Σ_{i=1}^n ξ_i   subject to the constraints
• C is set by the user; a larger C leads to a larger penalty for an error
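The slack formulation above is equivalent to minimizing ||w||^2 / 2 + C Σ_i max(0, 1 − y_i (w^T x_i + w_0)) without explicit constraints; a rough subgradient-descent sketch of that form, on invented toy data (this optimizer is an assumption for illustration, not the lecture's method):

```python
import numpy as np

def soft_margin_svm(X, y, C=1.0, lr=0.01, epochs=500):
    """Subgradient descent on ||w||^2/2 + C * sum_i max(0, 1 - y_i(w^T x_i + w_0)),
    the unconstrained (hinge-loss) form of the slack objective."""
    n, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + w0)
        viol = margins < 1                 # points with xi_i > 0
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_w0 = -C * y[viol].sum()
        w -= lr * grad_w
        w0 -= lr * grad_w0
    return w, w0

# Toy classes (illustrative):
X = np.array([[2.0, 2.0], [1.0, 3.0], [0.5, 0.4],
              [-2.0, -1.0], [-1.0, -3.0], [-0.4, -0.5]])
y = np.array([1, 1, 1, -1, -1, -1])
w, w0 = soft_margin_svm(X, y, C=1.0)
```

Larger C pushes the solution toward fewer margin violations, at the price of a smaller margin, matching the role of C in the slide.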
Extension to the linearly non-separable case
• Lagrange multiplier form (primal problem):
  J(w, w_0, α) = ||w||^2 / 2 + C Σ_{i=1}^n ξ_i − Σ_{i=1}^n α_i [ y_i (w^T x_i + w_0) − 1 + ξ_i ]
  with primal variables w, w_0
• Dual form after w, w_0 are expressed (the ξ_i s cancel out): maximize
  J(α) = Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j y_i y_j (x_i^T x_j)
  subject to:  0 ≤ α_i ≤ C for all i, and  Σ_{i=1}^n α_i y_i = 0
• Solution:  ŵ = Σ_{i=1}^n α̂_i y_i x_i
• The parameter ŵ_0 is obtained through the KKT conditions
• The difference from the separable case:  0 ≤ α_i ≤ C
Support vector machines
• The decision boundary:
  ŵ^T x + ŵ_0 = Σ_{i∈SV} α̂_i y_i (x_i^T x) + ŵ_0 = 0
• The decision:
  ŷ = sign( Σ_{i∈SV} α̂_i y_i (x_i^T x) + ŵ_0 )
• (!!): The decision on a new x requires computing only the inner products (x_i^T x) between the examples
• Similarly, the optimization depends only on the inner products (x_i^T x_j):
  J(α) = Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j y_i y_j (x_i^T x_j)
Nonlinear case
• The linear case requires computing the inner products (x_i^T x)
• The non-linear case can be handled by using a set of features: essentially, we map the input vectors to (larger) feature vectors, x → φ(x)
• It is possible to use the SVM formalism on the feature vectors:  φ(x)^T φ(x')
• Kernel function:  K(x, x') = φ(x)^T φ(x')
• Crucial idea: if we choose the kernel function wisely, we can compute the linear separation in the feature space implicitly, such that we keep working in the original input space!
Kernel function example
• Assume x = [x_1, x_2]^T and a feature mapping that maps the input into a quadratic feature set:
  φ(x) = [x_1^2, x_2^2, √2 x_1 x_2, √2 x_1, √2 x_2, 1]^T
• Kernel function for the feature space:
  K(x, x') = φ(x)^T φ(x')
           = x_1^2 x_1'^2 + x_2^2 x_2'^2 + 2 x_1 x_2 x_1' x_2' + 2 x_1 x_1' + 2 x_2 x_2' + 1
           = (x_1 x_1' + x_2 x_2' + 1)^2
           = (1 + x^T x')^2
• The computation of the linear separation in the higher-dimensional space is performed implicitly in the original input space
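The identity above is easy to check numerically; the test point values are arbitrary:

```python
import numpy as np

def phi(x):
    """Quadratic feature map from the slide (s = sqrt(2))."""
    s = np.sqrt(2.0)
    return np.array([x[0]**2, x[1]**2, s*x[0]*x[1], s*x[0], s*x[1], 1.0])

def K(x, xp):
    """Quadratic kernel K(x, x') = (1 + x^T x')^2."""
    return (1.0 + x @ xp) ** 2

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
# phi(x) @ phi(xp) equals K(x, xp): the 6-D inner product is computed
# implicitly from the 2-D inputs.
```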
Kernel function example
[Figure: a linear separator in the feature space corresponds to a non-linear separator in the original input space.]
Kernel functions
• Linear kernel:  K(x, x') = x^T x'
• Polynomial kernel:  K(x, x') = (1 + x^T x')^k
• Radial basis kernel:  K(x, x') = exp( −(1/2) ||x − x'||^2 )
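The three kernels can be written directly; this sketch fixes k = 2 and a unit RBF width to match the formulas above:

```python
import numpy as np

def linear_kernel(x, xp):
    """K(x, x') = x^T x'"""
    return x @ xp

def poly_kernel(x, xp, k=2):
    """K(x, x') = (1 + x^T x')^k"""
    return (1.0 + x @ xp) ** k

def rbf_kernel(x, xp):
    """K(x, x') = exp(-(1/2) ||x - x'||^2)"""
    return np.exp(-0.5 * np.linalg.norm(x - xp) ** 2)

x, xp = np.array([1.0, 0.0]), np.array([0.0, 1.0])
# rbf_kernel(x, x) is 1 for any x, and the value decays with distance.
```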
Kernels
• Kernels can be defined for more complex objects:
– Strings
– Graphs
– Images
• A kernel acts as a similarity measure between pairs of objects