
An Introduction to Support Vector Machines

Seong-Bae Park

Kyungpook National University
http://sejong.knu.ac.kr/~sbpark


Supervised Learning

[Diagram: the Environment poses a problem x; the Teacher supplies the desired solution d; the Learner f (Student) receives x, outputs y, and adjusts itself using the feedback d - y.]


Quality of Learning Machine

Loss: L(y, f(x, w)) ≥ 0
The discrepancy between the true output and the output produced by the learning machine.

Risk functional: the expected value of the loss

R(w) = ∫ L(y, f(x, w)) p(x, y) dx dy

Learning: the process of estimating the function f(x, w) that minimizes the risk functional, using only the training data.


Common Learning Tasks (1): Classification

Loss:
L(y, f(x, w)) = 0 if y = f(x, w)
L(y, f(x, w)) = 1 if y ≠ f(x, w)

Risk:
R(w) = ∫ L(y, f(x, w)) p(x, y) dx dy


Common Learning Tasks (2): Regression

Common loss function: squared error (L2)
L(y, f(x, w)) = (y - f(x, w))²

Risk:
R(w) = ∫ (y - f(x, w))² p(x, y) dx dy


ML Hypothesis

Maximum Likelihood hypothesis:

h_ML = argmax_{h∈H} P(h|D)
     = argmax_{h∈H} P(D|h)P(h) / P(D)
     = argmax_{h∈H} P(D|h)P(h)
     = argmax_{h∈H} P(D|h)      (assuming all hypotheses are equally probable a priori)


Maximum Likelihood Revisited

If x1, …, xn are i.i.d. samples from a pdf f(x|w), the likelihood is defined by

P(x|w) = ∏_{i=1}^{n} f(xi|w).

Maximum Likelihood Estimator: choose w* that maximizes the likelihood.

Relation to loss: take the negative log-likelihood as the loss, so that maximizing the likelihood amounts to minimizing the risk

R_ML(w) = - Σ_{i=1}^{n} ln f(xi|w)
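A small numeric illustration (an assumed Gaussian-density example, not from the slides) of minimizing the negative log-likelihood risk R_ML(w) over a grid of candidate parameters:

import numpy as np

def neg_log_likelihood(w, x, sigma=1.0):
    # R_ML(w) = -sum_i ln f(x_i | w) for f(x | w) = N(x; w, sigma^2)
    return np.sum(0.5 * ((x - w) / sigma) ** 2 + np.log(sigma * np.sqrt(2 * np.pi)))

x = np.random.normal(loc=2.0, scale=1.0, size=100)
ws = np.linspace(0.0, 4.0, 81)
w_star = ws[np.argmin([neg_log_likelihood(w, x) for w in ws])]
print(w_star)   # close to the sample mean of x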


Empirical Risk Minimization

Do we know p(x, y)? Generally NO!!! What we have is only the training data!

Risk:
R(w) = ∫ L(y, f(x, w)) p(x, y) dx dy

Empirical risk:
R_emp(w) = (1/n) Σ_{i=1}^{n} L(yi, f(xi, w))

ERM is more general than ML. In density estimation, ERM with the loss L(f(x, w)) = - ln f(x|w) is equivalent to ML.
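A minimal Python sketch of the empirical risk with the 0-1 loss from the classification slide (NumPy arrays and the classifier signature f(x, w) are assumptions):

import numpy as np

def empirical_risk(f, w, X, y):
    # R_emp(w) = (1/n) * sum_i L(y_i, f(x_i, w)), here with the 0-1 loss
    predictions = np.array([f(x, w) for x in X])
    return np.mean(predictions != y)   # fraction of misclassified training examples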


Risk and Empirical Risk

When the loss is
L(y, f(x, w)) = ½ |y - f(x, w)|,

the risk and the empirical risk are
R(w) = ∫ ½ |y - f(x, w)| dP(x, y)
R_emp(w) = (1/l) Σ_{i=1}^{l} ½ |yi - f(xi, w)|

Relation between them (Vapnik, 1995): with probability 1-η,

R(w) ≤ R_emp(w) + √( (h (log(2l/h) + 1) - log(η/4)) / l )

h: the VC dimension (≥ 0)
This holds regardless of P(x, y).


Risk and Empirical Risk (cont'd)

In the bound above, the second term

√( (h (log(2l/h) + 1) - log(η/4)) / l )

is called the VC confidence.


VC dimension

A set of instances S is shattered by {f(w)} iff for every dichotomy of S there exists some f(w) consistent with this dichotomy.

In the case of l points, there are 2^l possible dichotomies. The Vapnik-Chervonenkis (VC) dimension of {f(w)} is the maximum number of training points that can be shattered by {f(w)}. For example, oriented lines in the plane can shatter some set of 3 points but no set of 4, so their VC dimension is 3.


Minimizing R(w) by minimizing h


Perceptron Revisited: Linear Separators

wTx + b = 0

wTx + b > 0
wTx + b < 0

f(x) = sign(wTx + b)


Learning Perceptron (1)

Perceptron Learning Algorithm
Given a training set S = {(x1, y1), …, (xl, yl)} and a learning rate η ∈ R+:

w0 ← 0; b0 ← 0; k ← 0
R ← max_{1≤i≤l} ||xi||
while (there are errors)
    for i = 1 to l
        if yi(⟨wk ⋅ xi⟩ + bk) ≤ 0 then
            wk+1 ← wk + η yi xi
            bk+1 ← bk + η yi R²
            k ← k + 1
        end if
    end for
end while
return (wk, bk)
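A runnable Python sketch of the algorithm above (the max_epochs cap is an addition, since the loop only terminates by itself on separable data):

import numpy as np

def perceptron(X, y, eta=1.0, max_epochs=100):
    # X: (l, d) training points; y: (l,) labels in {-1, +1}
    w, b = np.zeros(X.shape[1]), 0.0
    R = np.max(np.linalg.norm(X, axis=1))
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:       # misclassified (or on the boundary)
                w += eta * yi * xi
                b += eta * yi * R ** 2
                errors += 1
        if errors == 0:                      # a full pass with no errors: done
            break
    return w, b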


Learning Perceptron (2)


Linear Separators

Which of the linear separators is optimal?


Margin

The distance from an example x to the separator is
r = |wTx + b| / ||w||

Examples closest to the hyperplane are support vectors.
The margin ρ of the separator is the width of separation between the classes.


Maximum Margin Classification

Maximizing the margin is good according to intuition and PAC theory.
It implies that only support vectors are important; the other training examples are ignorable.


Linear SVMs Mathematically (1)

Assuming all data are at least distance 1 from the hyperplane, the following two constraints hold for a training set {(xi, yi)}:

wTxi + b ≥ 1 if yi = 1
wTxi + b ≤ -1 if yi = -1

For support vectors, the inequality becomes an equality. Since each example's distance from the hyperplane is r = (wTx + b) / ||w||, the margin is:

ρ = 2 / ||w||


Linear SVMs Mathematically (2)

Quadratic optimization problem:

Find w and b such that
ρ = 2 / ||w|| is maximized and for all {(xi, yi)}:
wTxi + b ≥ 1 if yi = 1; wTxi + b ≤ -1 if yi = -1

A better formulation:

Find w and b such that
Φ(w) = ½ wTw is minimized and for all {(xi, yi)}:
yi(wTxi + b) ≥ 1


Solving the Optimization Problem

We need to optimize a quadratic function subject to linear constraints. Quadratic optimization problems are a well-known class of mathematical programming problems, and many (rather intricate) algorithms exist for solving them. The solution involves constructing a dual problem in which a Lagrange multiplier αi is associated with every constraint in the primal problem:

Find w and b such that
Φ(w) = ½ wTw is minimized and for all {(xi, yi)}: yi(wTxi + b) ≥ 1

Find α1…αN such that
Q(α) = Σαi - ½ ΣΣ αiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) αi ≥ 0 for all αi
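As a rough illustration of this dual (not the solver SVMlight uses), a minimal sketch with NumPy and SciPy's general-purpose SLSQP optimizer; the name svm_dual and all variables are illustrative:

import numpy as np
from scipy.optimize import minimize

def svm_dual(X, y):
    # X: (n, d) training points; y: (n,) labels in {-1, +1}
    n = X.shape[0]
    Yx = y[:, None] * X
    G = Yx @ Yx.T                                   # G_ij = y_i y_j x_i^T x_j
    neg_Q = lambda a: 0.5 * a @ G @ a - a.sum()     # maximizing Q(a) = minimizing -Q(a)
    cons = {"type": "eq", "fun": lambda a: a @ y}   # (1) sum_i a_i y_i = 0
    bounds = [(0.0, None)] * n                      # (2) a_i >= 0
    res = minimize(neg_Q, np.zeros(n), method="SLSQP", bounds=bounds, constraints=cons)
    return res.x                                    # the multipliers alpha_1 ... alpha_n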


The Optimization Problem Solution

The solution has the form:

w = Σαiyixi        b = yk - wTxk for any xk such that αk ≠ 0

Each non-zero αi indicates that the corresponding xi is a support vector. Then the classifying function has the form:

f(x) = ΣαiyixiTx + b

Notice that it relies on an inner product between the test point x and the support vectors xi! Also keep in mind that solving the optimization problem involved computing the inner products xiTxj between all training points!
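Continuing the illustration, a sketch that recovers w, b, and the classifier from multipliers alpha produced by a dual solver such as the hypothetical svm_dual above:

import numpy as np

def build_classifier(X, y, alpha, tol=1e-6):
    w = (alpha * y) @ X                      # w = sum_i alpha_i y_i x_i
    k = int(np.argmax(alpha > tol))          # index of any x_k with alpha_k != 0
    b = y[k] - w @ X[k]                      # b = y_k - w^T x_k
    return lambda x: np.sign(w @ x + b)      # f(x) = sign(w^T x + b)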


Support Vectors in Dual Form

[Figure: a separating hyperplane with margin ρ; most training points have αi = 0, while the support vectors lying on the margin have non-zero multipliers, e.g. α3 = 0.8, α6 = 0.4, α13 = 1.4.]


Soft Margin Classification (1)

What if the training set is not linearly separable?
Slack variables ξi can be added to allow misclassification of difficult or noisy examples.

[Figure: two examples lying on the wrong side of their margin, each with its slack ξi.]


Soft Margin Classification (2)

Optimization situation:

Minimize
½ ||w||² + C Σ_{i=1}^{l} ξi
subject to
w ⋅ xi + b ≥ +1 - ξi for yi = +1
w ⋅ xi + b ≤ -1 + ξi for yi = -1

Dual problem:

Maximize
Σαi - ½ Σ_{i,j} αiαjyiyj xi ⋅ xj
subject to
0 ≤ αi ≤ C   (C is a user-defined parameter)
Σαiyi = 0
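Relative to the earlier hypothetical svm_dual sketch, the only change in this soft-margin dual is the box constraint on each multiplier:

import numpy as np
from scipy.optimize import minimize

def svm_dual_soft(X, y, C=1.0):
    # identical to the hard-margin sketch except for the upper bound C on each alpha
    n = X.shape[0]
    Yx = y[:, None] * X
    G = Yx @ Yx.T
    neg_Q = lambda a: 0.5 * a @ G @ a - a.sum()
    cons = {"type": "eq", "fun": lambda a: a @ y}
    bounds = [(0.0, C)] * n                     # 0 <= alpha_i <= C
    return minimize(neg_Q, np.zeros(n), method="SLSQP", bounds=bounds, constraints=cons).x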


Non-linear SVMs

Datasets that are linearly separable with some noise work out great.

But what are we going to do if the dataset is just too hard?

How about mapping the data to a higher-dimensional space?

[Figure: 1-D data on the x axis that is not linearly separable becomes separable after mapping x → (x, x²).]


Non-linear SVMs: Feature spaces

General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:

Φ: x → φ(x)


The “Kernel Trick”

The linear classifier relies on an inner product between vectors, K(xi, xj) = xiTxj.

If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the inner product becomes:
K(xi, xj) = φ(xi)Tφ(xj)

A kernel function is a function that corresponds to an inner product in some feature space.

Example: 2-dimensional vectors x = [x1 x2]; let K(xi, xj) = (1 + xiTxj)².
We need to show that K(xi, xj) = φ(xi)Tφ(xj):

K(xi, xj) = (1 + xiTxj)²
          = 1 + xi1²xj1² + 2xi1xj1xi2xj2 + xi2²xj2² + 2xi1xj1 + 2xi2xj2
          = [1  xi1²  √2 xi1xi2  xi2²  √2 xi1  √2 xi2]T [1  xj1²  √2 xj1xj2  xj2²  √2 xj1  √2 xj2]
          = φ(xi)Tφ(xj),   where φ(x) = [1  x1²  √2 x1x2  x2²  √2 x1  √2 x2]
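A quick numeric check of this identity (illustrative values, NumPy assumed):

import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([1, x1**2, np.sqrt(2)*x1*x2, x2**2, np.sqrt(2)*x1, np.sqrt(2)*x2])

x = np.array([0.5, -1.0])
z = np.array([2.0, 0.3])
# (1 + x^T z)^2 equals phi(x)^T phi(z)
assert np.isclose((1 + x @ z) ** 2, phi(x) @ phi(z))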


What Functions are Kernels?

For some functions K(xi, xj), checking that K(xi, xj) = φ(xi)Tφ(xj) can be cumbersome.

Mercer's theorem: every positive semi-definite symmetric function is a kernel.
Positive semi-definite symmetric functions correspond to a positive semi-definite symmetric Gram matrix:

        | K(x1,x1)  K(x1,x2)  K(x1,x3)  …  K(x1,xN) |
        | K(x2,x1)  K(x2,x2)  K(x2,x3)  …  K(x2,xN) |
K =     |    …         …         …      …     …     |
        | K(xN,x1)  K(xN,x2)  K(xN,x3)  …  K(xN,xN) |
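In practice the Mercer condition can be probed numerically: build the Gram matrix of a candidate kernel on sample points and check that no eigenvalue is negative. A small sketch (function and variable names assumed):

import numpy as np

def is_psd_on_sample(kernel, X, tol=1e-10):
    K = np.array([[kernel(a, b) for b in X] for a in X])   # Gram matrix on the sample
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))     # no negative eigenvalues

X = np.random.randn(20, 3)
print(is_psd_on_sample(lambda a, b: (1 + a @ b) ** 2, X))  # polynomial kernel -> True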


Examples of Kernel Functions

Linear: K(xi, xj) = xiTxj
Polynomial of power p: K(xi, xj) = (1 + xiTxj)^p
Gaussian (radial-basis function network): K(xi, xj) = exp(-||xi - xj||² / (2σ²))
Two-layer perceptron: K(xi, xj) = tanh(β0 xiTxj + β1)
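Illustrative NumPy implementations of these kernels (parameter names and defaults are assumptions, not SVMlight's):

import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def polynomial_kernel(xi, xj, p=3):
    return (1 + xi @ xj) ** p

def gaussian_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

def two_layer_perceptron_kernel(xi, xj, beta0=1.0, beta1=0.0):
    return np.tanh(beta0 * (xi @ xj) + beta1)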


Non-linear SVMs Mathematically

Dual problem formulation:

Find α1…αN such that
Q(α) = Σαi - ½ ΣΣ αiαjyiyjK(xi, xj) is maximized and
(1) Σαiyi = 0
(2) αi ≥ 0 for all αi

The solution is:
f(x) = ΣαiyiK(xi, x) + b

Optimization techniques for finding the αi's remain the same!
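A sketch of this kernelized decision function, assuming the multipliers alpha, the bias b, and a kernel K come from a dual solver (only support vectors contribute non-zero terms):

import numpy as np

def decision(x, X_train, y_train, alpha, b, K):
    s = sum(a * yi * K(xi, x) for a, yi, xi in zip(alpha, y_train, X_train))
    return np.sign(s + b)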


SVM Structure


VC dimension of SVM

Minimal embedding space: any embedding space with minimal dimension for a given kernel.

Let K be a kernel which corresponds to a minimal embedding space H. Then the VC dimension of the corresponding SVM is dim(H) + 1.

The VC dimension of an SVM can therefore be ∞.

A striking conundrum: high VC dimension, but good performance!


Generalization Error by Margin

Risk bound by the margin ρ: with probability 1-η,

R(w, b) ≤ (c/l) ( (R²/ρ²) log² l + log(1/η) )

A large margin makes the SVM stronger!


SVMlight (1)

Author: T. Joachims
Download: http://svmlight.joachims.org

Two executable files:
svm_learn: svm_learn training_data model_file
svm_classify: svm_classify test_data model_file

[Diagram: training data → svm_learn → generated model; test data + generated model → svm_classify → classified result.]


SVMlight (2)

Written in C
Applicable to classification, regression, and ranking tasks
Can handle thousands of support vectors
Can handle hundreds of thousands of training examples
Supports standard kernel functions and user-defined kernels
Uses a sparse vector representation


Why is handling many SVs important?

Learning an SVM:

Find α1…αN such that
Q(α) = Σαi - ½ ΣΣ αiαjQij is maximized and
(1) Σαiyi = 0
(2) αi ≥ 0 for all αi

Q: an n x n matrix (Qij = yiyjK(xi, xj)). For many real-world applications, Q is too large for standard computers.

SMO decomposition: the overall QP is decomposed into small QP subproblems.
Joachims presented an optimization method for this decomposition.


svm_learn options

-z {c, r, p}   Selection of task: classification (c), regression (r), preference ranking (p) (default is c)
-c float       C parameter for the soft-margin SVM (default: E[xTx]⁻¹)
-t int         Type of kernel function:
               0: linear
               1: polynomial (s x⋅y + c)^d
               2: RBF
               3: sigmoid tanh(s x⋅y + c)
               4: user-defined kernel
-d int         Parameter d in the polynomial kernel
-g float       Parameter gamma in the RBF kernel
-s float       Parameter s in the sigmoid/polynomial kernel
-r float       Parameter c in the sigmoid/polynomial kernel
-u string      Parameter of a user-defined kernel
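For example (file names are placeholders; flags taken from the table above), a classification task with a degree-2 polynomial kernel and C = 1.0 could be trained with: svm_learn -z c -t 1 -d 2 -c 1.0 train.dat model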


Format of data

Each example is represented as a single line.

Feature/value pairs must be ordered by increasing feature number.
Features with value zero can be skipped.

Example:

-1 1:0.43 3:0.12 9284:0.2 # comment

<line>    .=. <target> <feature>:<value> ... <feature>:<value> # <info>
<target>  .=. +1 | -1 | 0 | <float>
<feature> .=. <integer>
<value>   .=. <float>
<info>    .=. <string>
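A small Python sketch (not part of SVMlight) of parsing one line of this sparse format:

def parse_line(line):
    body = line.split("#", 1)[0].split()          # drop the trailing comment
    target, pairs = float(body[0]), body[1:]
    feats = {int(f): float(v) for f, v in (p.split(":") for p in pairs)}
    return target, feats

print(parse_line("-1 1:0.43 3:0.12 9284:0.2 # comment"))
# (-1.0, {1: 0.43, 3: 0.12, 9284: 0.2})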


Text Chunking

Corpus: http://sejong.knu.ac.kr/~sbpark/Chunk

[Sample from the corpus: each token is annotated with a POS tag and a chunk tag, e.g. maj B-ADVP, mmd B-NP, ncn I-NP, jxt I-NP, …, sf O. The Korean word forms are not reproduced here.]

Information      Value
Vocabulary       16,838
Total Words      321,328
Chunk Types      9
POS Tags         52
Sentences        12,092
Phrases          112,658


Context


Data Format

Sample lines from BNP.data:

1 1:1 16315:1 32630:1 50221:1 66411:1 82496:1 97890:1 114205:1 114258:1 114311:1 114401:1 114447:1 114492:1 114553:1 114576:1 114586:1 114596:1
-1 1:1 16315:1 33906:1 50096:1 66181:1 81575:1 98759:1 114205:1 114258:1 114348:1 114394:1 114439:1 114500:1 114535:1 114576:1 114586:1 114599:1
-1 1:1 17591:1 33781:1 49866:1 65260:1 82444:1 97890:1 114205:1 114295:1 114341:1 114386:1 114447:1 114482:1 114553:1 114576:1 114589:1 114603:1
1 1276:1 17466:1 33551:1 48945:1 66129:1 81575:1 97894:1 114242:1 114288:1 114333:1 114394:1 114429:1 114500:1 114556:1 114579:1 114593:1 114603:1
-1 1276:1 17466:1 33551:1 49814:1 65260:1 81579:1 97890:1 114242:1 114288:1 114333:1 114376:1 114447:1 114503:1 114552:1 114583:1 114593:1 114599:1
-1 1151:1 17236:1 33499:1 48945:1 65264:1 81575:1 98803:1 114235:1 114280:1 114323:1 114394:1 114450:1 114499:1 114533:1 114583:1 114589:1 114603:1


Running SVMlight

svm_learn BNP.data BNP.model

BNP.model:

SVM-light Version V3.50
0 # kernel type
3 # kernel parameter -d
1 # kernel parameter -g
1 # kernel parameter -s
1 # kernel parameter -r
empty # kernel parameter -u
114605 # highest feature index
290465 # number of training documents
13947 # number of support vectors plus 1
0.94731663 # threshold b
-0.05882352941165028270553705169732 456:1 16683:1 33555:1 48945:1 65260:1 81981:1 98703:1 114229:1 114309:1 114324:1 114394:1 114447:1 114480:1 114564:1 114579:1 114593:1 114603:1
-0.05882352941165028270553705169732 1:1 17591:1 33555:1 49634:1 65472:1 82444:1 98054:1 114205:1 114295:1 114324:1 114401:1 114447:1 114482:1 114550:1 114576:1 114589:1 114603:1
…


Performance

           Decision Tree    SVM             MBL
Accuracy   97.95±0.24%      98.15±0.20%     97.79±0.29%
F-score    91.36±0.85       92.54±0.72      91.38±1.01


Another Example Task

Korean Clause Boundary Detection

Word | POS | Chunk | Output

[Table: an example Korean sentence annotated morpheme by morpheme with its word form, POS tag (e.g. ncn, jca, pvg, etm), chunk tag (e.g. B-NP, I-NP, B-VP, I-VP, B-ADJP, O), and clause-boundary output label, where each position is marked S, E, or X.]


Clause Boundary Detection

Two binary classification tasks:
Finding the ending point (S vs. X)
Finding the starting point (E vs. X)

[Diagram: for each task, the pipeline is feature set → feature selection → learning → classification, applied to each word wi of a sentence S: w1, w2, …, wi, …, wn; one classifier outputs S or X and the other outputs E or X.]


Features

Dimension of a vector (= 4,232):
# of words: 4,171
# of POS tags: 52
# of chunk tags: 9

Trigram model:
wi-1: 1 ~ 4,232
wi: 4,233 ~ 8,464
wi+1: 8,465 ~ 12,696
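A small sketch of this index scheme (the helper name is hypothetical): each trigram position gets its own block of 4,232 dimensions, so a local feature id is offset by a multiple of 4,232.

BLOCK = 4232   # 4,171 words + 52 POS tags + 9 chunk tags

def trigram_indices(local_ids):
    # local_ids: per-position feature ids in 1..4232 for (w_{i-1}, w_i, w_{i+1})
    return [k * BLOCK + fid for k, fid in enumerate(local_ids)]

print(trigram_indices([30, 2070, 1457]))   # -> [30, 6302, 9921]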


Vector Representation

[The example sentence from the previous slides, with Word, POS, Chunk, and Output columns; the vector below encodes one position of the trigram window.]

              wi-1     wi          wi+1
Word          는       위버반도     에서
POS           etm      nq          jca
Chunk         I-VP     B-NP        I-NP
Ending point: E

Feature indices:
Word:   30:1      6302:1     9921:1
POS:    4215:1    8423:1     12664:1
Chunk:  4229:1    8462:1     12692:1
        4232:1

Resulting training vector:
-1 30:1 4215:1 4229:1 4232:1 6302:1 8423:1 8462:1 9921:1 12664:1 12692:1


Execution of SVMlight (1)


Execution of SVMlight (2)


Third Example: Text Classification

Document into a vector: binary vector
x = <w1, w2, …, w|v|>

Commonly-used corpus: Reuters-21578
12,902 Reuters stories
118 categories
ModApte split: 75% for training (9,603 stories), 25% for test (3,299 stories)

Feature selection: the 300 words with the highest mutual information with each category
|v| = 300
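A hedged sketch of this feature-selection step, treating word presence and category membership as binary events and scoring each word by mutual information (the data layout and function names are assumptions):

import numpy as np

def mutual_information(word_present, in_category):
    # both arguments: binary {0, 1} arrays over the training documents
    mi = 0.0
    for w in (0, 1):
        for c in (0, 1):
            p_wc = np.mean((word_present == w) & (in_category == c))
            p_w = np.mean(word_present == w)
            p_c = np.mean(in_category == c)
            if p_wc > 0:
                mi += p_wc * np.log(p_wc / (p_w * p_c))
    return mi

def select_features(X, y, k=300):
    # X: (documents, vocabulary) binary matrix; y: binary category labels
    scores = np.array([mutual_information(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(scores)[-k:]            # indices of the k highest-scoring words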


Text Classification Results


Interpreting the Weight Vector

Category "interest"

Terms with the highest weight:
Prime: 0.70
Rate: 0.67
Interest: 0.63
Rates: 0.60
Discount: 0.46

Terms with the lowest weight:
Group: -0.24
Year: -0.25
Sees: -0.33
World: -0.35
Dlrs: -0.71