CS 2750 Machine Learning, Lecture 9 (people.cs.pitt.edu/.../cs2750-Spring2012/Lectures/class9.pdf)
CS 2750 Machine Learning
Lecture 9
Milos Hauskrecht
5329 Sennott Square
Support vector machines
Outline:
• Fisher linear discriminant
• Algorithms for linear decision boundaries
• Support vector machines
  – Maximum margin hyperplane
  – Support vectors
  – Support vector machines
  – Extensions to the non-separable case
  – Kernel functions
Linear decision boundaries
• What models define linear decision boundaries?
[Figure: two classes of points in 2D separated by a linear decision boundary; the regions satisfy g_1(x) ≥ g_0(x) and g_1(x) < g_0(x).]
Logistic regression model
• Discriminant functions:
  g_1(x) = g(w^T x),   g_0(x) = 1 − g(w^T x)
• Model output:  f(x, w) = g(w^T x)
• where g(z) = 1 / (1 + e^{−z}) is a logistic function
[Figure: network view of the model; the input vector x = (x_1, …, x_d) plus a constant input 1 are combined with weights w_0, w_1, …, w_d into z = w^T x, which is passed through the logistic function to produce f(x, w).]
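The two discriminants above can be sketched in a few lines of numpy; the weights below are arbitrary illustration values, not learned parameters:

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def discriminants(x, w, w0):
    """g_1(x) = g(w^T x + w_0) and g_0(x) = 1 - g_1(x)."""
    g1 = logistic(w @ x + w0)
    return g1, 1.0 - g1

# Illustrative (not learned) weights; the bias w_0 plays the role
# of the weight on the constant input 1 in the slide's diagram.
w, w0 = np.array([2.0, -1.0]), 0.5
g1, g0 = discriminants(np.array([1.0, 1.0]), w, w0)
# Predict class 1 when g1 >= g0; g1 + g0 = 1 by construction.
```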
Linear discriminant analysis (LDA)
• When the covariances are the same:
  x ~ N(μ_0, Σ) for y = 0
  x ~ N(μ_1, Σ) for y = 1
Linear decision boundaries
• Any other models/algorithms?
[Figure: the same 2D dataset with a linear decision boundary separating the region g_1(x) ≥ g_0(x) from g_1(x) < g_0(x).]
Fisher linear discriminant
• Project the data into one dimension:  y = w^T x
• How to find the projection line?
• Decision:  w^T x + w_0 ≥ 0
Fisher linear discriminant
How to find the projection line  y = w^T x ?
Fisher linear discriminant
Assume the class means:
  m_1 = (1/N_1) Σ_{i∈C_1} x_i,   m_2 = (1/N_2) Σ_{i∈C_2} x_i
Maximize the difference in the projected means:
  m̃_2 − m̃_1 = w^T (m_2 − m_1)
Fisher linear discriminant
Problem 1:  m̃_2 − m̃_1 = w^T (m_2 − m_1)  can be maximized simply by increasing the magnitude of w
Problem 2: the variance of the class distributions changes after the projection
Fisher's solution: maximize
  J(w) = (m̃_2 − m̃_1)^2 / (s̃_1^2 + s̃_2^2)
where  s̃_k^2 = Σ_{i∈C_k} (y_i − m̃_k)^2  is the within-class variance after the projection
Fisher linear discriminant
Criterion:
  J(w) = (m̃_2 − m̃_1)^2 / (s̃_1^2 + s̃_2^2)
where  s̃_k^2 = Σ_{i∈C_k} (y_i − m̃_k)^2  is the within-class variance after the projection
Optimal solution:
  w ∝ S_w^{−1} (m_2 − m_1)
where the within-class scatter matrix is
  S_w = Σ_{i∈C_1} (x_i − m_1)(x_i − m_1)^T + Σ_{i∈C_2} (x_i − m_2)(x_i − m_2)^T
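The optimal direction above can be sketched directly in numpy; the two small classes here are invented toy data, not from the lecture:

```python
import numpy as np

# Two small toy classes (illustrative data)
X1 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]])   # class C_1
X2 = np.array([[6.0, 5.0], [7.0, 8.0], [8.0, 7.0]])   # class C_2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)             # class means

# Within-class scatter S_w: sum of outer products of centered points
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Fisher direction: w proportional to S_w^{-1} (m_2 - m_1)
w = np.linalg.solve(Sw, m2 - m1)
```

Along this direction the projected class means are well separated relative to the within-class spread, which is exactly what J(w) rewards.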
Linearly separable classes
There is a hyperplane that separates the training instances with no error:
  Hyperplane:  w^T x + w_0 = 0
  Class (+1):  w^T x + w_0 > 0
  Class (−1):  w^T x + w_0 < 0
Algorithms for linearly separable sets
• Separating hyperplane:  w^T x + w_0 = 0
• We can use gradient methods or Newton-Raphson for sigmoidal switching functions and learn the weights
• Recall that we learn the linear decision boundary
[Figure: linear unit with constant input 1 and inputs x_1, …, x_d weighted by w_0, w_1, …, w_d; the output is thresholded at y ≥ 0.5?]
Algorithms for linearly separable sets
• Separating hyperplane:  w^T x + w_0 = 0
[Figure: the same linear unit, now thresholded at y ≥ 0?]
Algorithms for linearly separable sets
Perceptron algorithm:
• Simple iterative procedure for modifying the weights of the linear model
• Works for inputs x where each x_i is in [0,1]
Initialize the weights w
Loop through the examples (x, y) in the dataset D:
  1. Compute  ŷ = sign(w^T x)
  2. If  y = +1 and ŷ = −1  then  w ← w + x
  3. If  y = −1 and ŷ = +1  then  w ← w − x
Until all examples are classified correctly
Properties:
• guaranteed convergence if the classes are linearly separable
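The loop above can be sketched as follows; the bias w_0 is absorbed by appending a constant 1 to each input, and the toy dataset is an invented illustration:

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Perceptron updates: on a mistake, w <- w + x if y = +1,
    w <- w - x if y = -1 (written compactly as w <- w + y*x).
    A constant 1 is appended to each x so w_0 is learned too."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # absorb the bias w_0
    w = np.zeros(Xb.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x, t in zip(Xb, y):
            if t * (w @ x) <= 0:                # misclassified example
                w += t * x
                errors += 1
        if errors == 0:                         # all classified correctly
            break
    return w

# Tiny linearly separable set (illustrative):
X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
```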
Algorithms for linearly separable sets
Linear program solution:
• Finds weights that satisfy the following constraints:
  w^T x_i + w_0 > 0  for all i such that y_i = +1
  w^T x_i + w_0 < 0  for all i such that y_i = −1
  Together:  y_i (w^T x_i + w_0) > 0
Property: if there is a hyperplane separating the examples, the linear program finds a solution
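A minimal sketch of this linear-program view, assuming scipy is available; since LP solvers cannot handle strict inequalities, the constraints are tightened to y_i (w^T x_i + w_0) ≥ 1 (any separable set admits such a scaling), and the toy points are illustrative:

```python
import numpy as np
from scipy.optimize import linprog

# Tiny separable set (illustrative)
X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Variables z = [w_1, w_2, w_0].  Feasibility constraints
# y_i (w^T x_i + w_0) >= 1 become A_ub z <= b_ub with
# A_ub rows -y_i * [x_i, 1] and b_ub = -1.
A_ub = -(y[:, None] * np.hstack([X, np.ones((len(X), 1))]))
b_ub = -np.ones(len(X))

# Zero objective: we only ask the LP for a feasible point.
res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * 3)
w, w0 = res.x[:2], res.x[2]   # some feasible separating hyperplane
```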
Optimal separating hyperplane
• There are multiple hyperplanes that separate the data points
  – Which one to choose?
• Maximum margin choice: maximize the distance d⁺ + d⁻
  – where d⁺ (d⁻) is the shortest distance of a positive (negative) example from the hyperplane
[Figure: separating hyperplane with the closest positive and negative examples at distances d⁺ and d⁻ on either side.]
Maximum margin hyperplane
• For the maximum margin hyperplane only examples on the
margin matter (only these affect the distances)
• These are called support vectors
Finding maximum margin hyperplanes
• Assume the examples in the training set are (x_i, y_i) with y_i ∈ {−1, +1}
• Assume that all data satisfy:
  w^T x_i + w_0 ≥ 1   for y_i = +1
  w^T x_i + w_0 ≤ −1  for y_i = −1
• The inequalities can be combined as:
  y_i (w^T x_i + w_0) − 1 ≥ 0   for all i
• The equalities define two hyperplanes:
  w^T x_i + w_0 = 1   and   w^T x_i + w_0 = −1
Finding the maximum margin hyperplane
• Geometric margin:
  ρ(x, y; w, w_0) = y (w^T x + w_0) / ||w||_2
  – measures the distance of a point x from the hyperplane
  – w is normal to the hyperplane; ||·||_2 is the Euclidean norm
• For points satisfying  y_i (w^T x_i + w_0) − 1 = 0  the distance is 1 / ||w||_2
• Width of the margin:
  d⁺ + d⁻ = 2 / ||w||_2
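The geometric margin is a one-line computation; the hyperplane and test point below are arbitrary illustration values:

```python
import numpy as np

def geometric_margin(x, y, w, w0):
    """rho(x, y; w, w_0) = y (w^T x + w_0) / ||w||_2: signed distance
    of x from the hyperplane (positive iff x is correctly classified)."""
    return y * (w @ x + w0) / np.linalg.norm(w)

# Illustrative hyperplane x_1 + x_2 = 0, i.e. w = (1, 1), w_0 = 0:
w, w0 = np.array([1.0, 1.0]), 0.0
rho = geometric_margin(np.array([1.0, 1.0]), +1, w, w0)
```

For a point on one of the canonical hyperplanes (y (w^T x + w_0) = 1), this reduces to 1 / ||w||_2, which is where the margin width 2 / ||w||_2 comes from.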
Maximum margin hyperplane
• We want to maximize the margin  d⁺ + d⁻ = 2 / ||w||_2
• We do it by minimizing  ||w||_2^2 / 2 = w^T w / 2
  – w, w_0 are the variables
  – But we also need to enforce the constraints on the points:
    y_i (w^T x_i + w_0) − 1 ≥ 0
Maximum margin hyperplane
• Solution: incorporate the constraints into the optimization
• Optimization problem (Lagrangian):
  J(w, w_0, α) = ||w||^2 / 2 − Σ_{i=1}^n α_i [ y_i (w^T x_i + w_0) − 1 ]
  – α_i ≥ 0 are the Lagrange multipliers
• Minimize with respect to w, w_0 (primal variables)
• Maximize with respect to α (dual variables)
The Lagrange multipliers enforce the satisfaction of the constraints:
  If  y_i (w^T x_i + w_0) − 1 > 0  then  α_i = 0
  Else  α_i > 0  (active constraint)
Max margin hyperplane solution
• Set the derivatives to 0 (Kuhn-Tucker conditions):
  ∂J(w, w_0, α)/∂w = w − Σ_{i=1}^n α_i y_i x_i = 0
  ∂J(w, w_0, α)/∂w_0 = Σ_{i=1}^n α_i y_i = 0
• Now we need to solve for the Lagrange parameters (Wolfe dual): maximize
  J(α) = Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j y_i y_j (x_i^T x_j)
  subject to the constraints  α_i ≥ 0 for all i, and  Σ_{i=1}^n α_i y_i = 0
• Quadratic optimization problem: solution α̂_i for all i
Maximum margin solution
• The resulting parameter vector ŵ can be expressed as:
  ŵ = Σ_{i=1}^n α̂_i y_i x_i    where α̂ is the solution of the dual problem
• The parameter ŵ_0 is obtained through the Karush-Kuhn-Tucker (KKT) conditions:
  α̂_i [ y_i (ŵ^T x_i + ŵ_0) − 1 ] = 0
Solution properties:
• α̂_i = 0 for all points that are not on the margin
• ŵ is a linear combination of the support vectors only
• The decision boundary:
  ŵ^T x + ŵ_0 = Σ_{i∈SV} α̂_i y_i (x_i^T x) + ŵ_0 = 0
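For two opposite-class points the dual above can be solved in closed form, which gives a small check of these formulas; the two points are an assumption chosen for illustration:

```python
import numpy as np

# Two toy support vectors, one per class (illustrative):
x1, x2 = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
y1, y2 = 1.0, -1.0

# The constraint sum_i alpha_i y_i = 0 forces alpha_1 = alpha_2 = a,
# so J(a) = 2a - (1/2) a^2 ||x1 - x2||^2, maximized at:
a = 2.0 / (np.linalg.norm(x1 - x2) ** 2)

# w_hat = sum_i alpha_i y_i x_i
w = a * (y1 * x1 + y2 * x2)

# w_0 from the KKT condition y_1 (w^T x_1 + w_0) = 1:
w0 = 1.0 / y1 - w @ x1

# Both points then sit exactly on the margin hyperplanes
# w^T x + w_0 = +1 and w^T x + w_0 = -1.
```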
Support vector machines
• The decision boundary:
  ŵ^T x + ŵ_0 = Σ_{i∈SV} α̂_i y_i (x_i^T x) + ŵ_0 = 0
• The decision:
  ŷ = sign( Σ_{i∈SV} α̂_i y_i (x_i^T x) + ŵ_0 )
Support vector machines
• The decision boundary:
  ŵ^T x + ŵ_0 = Σ_{i∈SV} α̂_i y_i (x_i^T x) + ŵ_0 = 0
• The decision:
  ŷ = sign( Σ_{i∈SV} α̂_i y_i (x_i^T x) + ŵ_0 )
• (!!): The decision on a new x requires computing only the inner products (x_i^T x) between the examples
• Similarly, the optimization depends only on the inner products (x_i^T x_j):
  J(α) = Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j y_i y_j (x_i^T x_j)
Extension to a linearly non-separable case
• Idea: Allow some flexibility on crossing the separating
hyperplane
Extension to the linearly non-separable case
• Relax the constraints with slack variables ξ_i ≥ 0:
  w^T x_i + w_0 ≥ 1 − ξ_i   for y_i = +1
  w^T x_i + w_0 ≤ −1 + ξ_i  for y_i = −1
• An error occurs if ξ_i ≥ 1;  Σ_{i=1}^n ξ_i is an upper bound on the number of errors
• Introduce a penalty for the errors:
  minimize  ||w||^2 / 2 + C Σ_{i=1}^n ξ_i   subject to the constraints
• C is set by the user; a larger C leads to a larger penalty for an error
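The slack formulation above is equivalent to minimizing ||w||^2 / 2 + C Σ_i max(0, 1 − y_i (w^T x_i + w_0)) without explicit constraints; a rough subgradient-descent sketch of that form, on invented toy data (this optimizer is an assumption for illustration, not the lecture's method):

```python
import numpy as np

def soft_margin_svm(X, y, C=1.0, lr=0.01, epochs=500):
    """Subgradient descent on ||w||^2/2 + C * sum_i max(0, 1 - y_i(w^T x_i + w_0)),
    the unconstrained (hinge-loss) form of the slack objective."""
    n, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + w0)
        viol = margins < 1                 # points with xi_i > 0
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_w0 = -C * y[viol].sum()
        w -= lr * grad_w
        w0 -= lr * grad_w0
    return w, w0

# Toy classes (illustrative):
X = np.array([[2.0, 2.0], [1.0, 3.0], [0.5, 0.4],
              [-2.0, -1.0], [-1.0, -3.0], [-0.4, -0.5]])
y = np.array([1, 1, 1, -1, -1, -1])
w, w0 = soft_margin_svm(X, y, C=1.0)
```

Larger C pushes the solution toward fewer margin violations, at the price of a smaller margin, matching the role of C in the slide.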
Extension to the linearly non-separable case
• Lagrange multiplier form (primal problem):
  J(w, w_0, α) = ||w||^2 / 2 + C Σ_{i=1}^n ξ_i − Σ_{i=1}^n α_i [ y_i (w^T x_i + w_0) − 1 + ξ_i ]
  with primal variables w, w_0
• Dual form after w, w_0 are expressed (the ξ_i s cancel out): maximize
  J(α) = Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j y_i y_j (x_i^T x_j)
  subject to:  0 ≤ α_i ≤ C for all i, and  Σ_{i=1}^n α_i y_i = 0
• Solution:  ŵ = Σ_{i=1}^n α̂_i y_i x_i
• The parameter ŵ_0 is obtained through the KKT conditions
• The difference from the separable case:  0 ≤ α_i ≤ C
Support vector machines
• The decision boundary:
  ŵ^T x + ŵ_0 = Σ_{i∈SV} α̂_i y_i (x_i^T x) + ŵ_0 = 0
• The decision:
  ŷ = sign( Σ_{i∈SV} α̂_i y_i (x_i^T x) + ŵ_0 )
• (!!): The decision on a new x requires computing only the inner products (x_i^T x) between the examples
• Similarly, the optimization depends only on the inner products (x_i^T x_j):
  J(α) = Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j y_i y_j (x_i^T x_j)
Nonlinear case
• The linear case requires computing the inner products (x_i^T x)
• The non-linear case can be handled by using a set of features: essentially, we map the input vectors to (larger) feature vectors, x → φ(x)
• It is possible to use the SVM formalism on the feature vectors:  φ(x)^T φ(x')
• Kernel function:  K(x, x') = φ(x)^T φ(x')
• Crucial idea: if we choose the kernel function wisely, we can compute the linear separation in the feature space implicitly, such that we keep working in the original input space!
Kernel function example
• Assume x = [x_1, x_2]^T and a feature mapping that maps the input into a quadratic feature set:
  φ(x) = [x_1^2, x_2^2, √2 x_1 x_2, √2 x_1, √2 x_2, 1]^T
• Kernel function for the feature space:
  K(x, x') = φ(x)^T φ(x')
           = x_1^2 x_1'^2 + x_2^2 x_2'^2 + 2 x_1 x_2 x_1' x_2' + 2 x_1 x_1' + 2 x_2 x_2' + 1
           = (x_1 x_1' + x_2 x_2' + 1)^2
           = (1 + x^T x')^2
• The computation of the linear separation in the higher-dimensional space is performed implicitly in the original input space
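The identity above is easy to check numerically; the test point values are arbitrary:

```python
import numpy as np

def phi(x):
    """Quadratic feature map from the slide (s = sqrt(2))."""
    s = np.sqrt(2.0)
    return np.array([x[0]**2, x[1]**2, s*x[0]*x[1], s*x[0], s*x[1], 1.0])

def K(x, xp):
    """Quadratic kernel K(x, x') = (1 + x^T x')^2."""
    return (1.0 + x @ xp) ** 2

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
# phi(x) @ phi(xp) equals K(x, xp): the 6-D inner product is computed
# implicitly from the 2-D inputs.
```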
Kernel function example
[Figure: a linear separator in the feature space corresponds to a non-linear separator in the original input space.]
Kernel functions
• Linear kernel:  K(x, x') = x^T x'
• Polynomial kernel:  K(x, x') = (1 + x^T x')^k
• Radial basis kernel:  K(x, x') = exp( −(1/2) ||x − x'||^2 )
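The three kernels can be written directly; this sketch fixes k = 2 and a unit RBF width to match the formulas above:

```python
import numpy as np

def linear_kernel(x, xp):
    """K(x, x') = x^T x'"""
    return x @ xp

def poly_kernel(x, xp, k=2):
    """K(x, x') = (1 + x^T x')^k"""
    return (1.0 + x @ xp) ** k

def rbf_kernel(x, xp):
    """K(x, x') = exp(-(1/2) ||x - x'||^2)"""
    return np.exp(-0.5 * np.linalg.norm(x - xp) ** 2)

x, xp = np.array([1.0, 0.0]), np.array([0.0, 1.0])
# rbf_kernel(x, x) is 1 for any x, and the value decays with distance.
```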
Kernels
• Kernels can be defined for more complex objects:
– Strings
– Graphs
– Images
• A kernel acts as a similarity measure between pairs of objects