Linear Document Classifier

Nov 23rd, 2001
Copyright © 2001, 2003, Andrew W. Moore

Linear Classifiers

• Binary classification: y = +1 for the positive class, y = -1 for the negative class
• Vector representation for documents

[Figure: documents plotted in the (a, b) plane; one symbol denotes +1, the other denotes -1, and a line f(d) separates the two groups.]

How would you classify this data?

Decision Boundary

[Figure: documents d1, d2, d3, d4 in the (a, b) plane, separated by the line f(d).]

1. How do we classify documents using f(d)?
2. How do we find the line f(d)?

• wa and wb are the weights for words a and b, so f(d) is a weighted combination of the counts of words a and b in document d

How to Classify Documents?

• Classify d as positive (y = +1) when f(d) > 0 and as negative (y = -1) when f(d) < 0; equivalently, predict y = sign(f(d)). This matches the correctness check used below: d is classified correctly exactly when y·f(d) > 0.
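A minimal sketch of this decision rule. The two-word feature vectors, the bias term w0, and the sample numbers are illustrative, not from the slides:

```python
# Sketch of the linear decision rule f(d) = wa*da + wb*db + w0,
# with w0 an assumed bias term (the slides only name wa and wb).

def f(d, w, w0=0.0):
    """Linear score: dot product of weights and word counts, plus bias."""
    return sum(wi * di for wi, di in zip(w, d)) + w0

def classify(d, w, w0=0.0):
    """Predict y = +1 if f(d) > 0, else y = -1."""
    return 1 if f(d, w, w0) > 0 else -1

w = [0.5, -1.0]                  # weights wa, wb for words a and b
d = [3, 1]                       # counts of words a and b in document d
print(f(d, w), classify(d, w))   # 0.5, so d is classified as +1
```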

Perceptron Algorithm

• Initialize the weights w (e.g., w = 0)
• Repeat:
  • Receive a labeled document (d, y), where y = +1 or y = -1
  • Check whether doc d is classified correctly: is y·f(d) > 0?
  • Yes: do nothing
  • No: update the weights, w ← w + y·d

[Figure: documents d1, d2, d3, d4 in the (a, b) plane with the current boundary f(d).]
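A minimal runnable sketch of the update loop above. The toy documents, labels, and the fixed number of passes are my additions; here f(d) is simply w·d, with no bias term:

```python
# Perceptron sketch: scan the data, and on each mistake (y*f(d) not > 0)
# apply the update w <- w + y*d.

def perceptron(docs, labels, n_epochs=10):
    """docs: list of count vectors; labels: +1/-1. Returns weights w."""
    w = [0.0] * len(docs[0])                   # initialize w = 0
    for _ in range(n_epochs):                  # repeat
        for d, y in zip(docs, labels):         # receive (d, y)
            score = sum(wi * di for wi, di in zip(w, d))
            if y * score <= 0:                 # misclassified
                w = [wi + y * di for wi, di in zip(w, d)]   # w <- w + y*d
    return w

docs   = [[2, 0], [0, 2], [3, 1], [1, 3]]
labels = [1, -1, 1, -1]
print(perceptron(docs, labels))                # e.g. [2.0, -2.0]
```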

Geometrical Interpretation

Linear Classifiers

[Figure: the same two-class data, with several different candidate separating lines f(d) drawn.]

How would you classify this data?

Any of these would be fine..

..but which is best?

Classifier Margin

[Figure: the separating line widened into a band until it touches the nearest datapoint on each side.]

Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
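For reference, this geometric definition has a standard algebraic form, which the slide does not spell out. With f(d) = w·d + w0, the distance from a point to the boundary is |f(d)| / ‖w‖, so for a classifier that separates the data:

```latex
% Standard margin formula for a linear boundary (a supplement; the slide
% defines the margin only geometrically).
\[
\operatorname{margin}(\mathbf{w}, w_0)
  = 2 \min_i \frac{y_i\, f(d_i)}{\lVert \mathbf{w} \rVert},
\qquad f(d) = \mathbf{w} \cdot \mathbf{d} + w_0 .
\]
```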

Maximum Margin

[Figure: the linear classifier whose margin band is widest, together with the datapoints the band touches.]

The maximum margin linear classifier is the linear classifier with the maximum margin.

It is called the Linear Support Vector Machine (SVM).

Empirical Studies with Text Categorization

• 10 categories from Reuters-21578
• For a few categories, the SVM method significantly outperforms the KNN approach

Classification accuracy (%):

Category   KNN    SVM
earn       97.3   98.0
acq        92.0   93.6
money-fx   78.2   74.5
grain      82.2   94.6
crude      85.7   88.9
trade      77.4   75.9
interest   74.0   77.7
ship       79.2   85.6
wheat      76.6   91.8
corn       77.9   90.3

Doing multi-class classification

• SVMs can only handle two-class outputs (i.e. a categorical output variable with arity 2).
• How to handle multiple classes?
  • E.g., classify documents into three categories: sports, business, politics
• Answer: one-vs-all; learn N SVMs (a code sketch follows the One-vs-All slides below)
  • SVM 1 learns "Output == 1" vs "Output != 1"
  • SVM 2 learns "Output == 2" vs "Output != 2"
  • :
  • SVM N learns "Output == N" vs "Output != N"

One-vs-All

• Red vs the other classes: red(d)
• Yellow vs the other classes: yellow(d)
• Cyan vs the other classes: cyan(d)
• Given a test document d, how do we decide its color?
• Assign d to the color function with the largest score
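Here is the sketch promised above: a minimal one-vs-all wrapper, reusing the perceptron function from the earlier sketch as the binary learner. The three-word feature vectors and the toy training set are illustrative:

```python
# One-vs-all: train one "c vs the rest" classifier per class, then
# assign a test document to the class with the largest score.

def train_one_vs_all(docs, labels, classes, train_binary):
    """Learn one binary classifier per class: class c vs all the rest."""
    models = {}
    for c in classes:
        binary = [1 if y == c else -1 for y in labels]   # c vs not-c
        models[c] = train_binary(docs, binary)
    return models

def predict_color(d, models):
    """Assign d to the color function with the largest score."""
    score = lambda w: sum(wi * di for wi, di in zip(w, d))
    return max(models, key=lambda c: score(models[c]))

docs   = [[3, 0, 0], [0, 3, 0], [0, 0, 3]]   # one toy document per class
labels = ["red", "yellow", "cyan"]
models = train_one_vs_all(docs, labels, ["red", "yellow", "cyan"], perceptron)
print(predict_color([2, 1, 0], models))      # -> red
```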

Suppose we're in 1-dimension

[Figure: positive and negative points on the number line around x = 0.]

What would SVMs do with this data?

Not a big surprise: the maximum margin boundary is the point halfway between the two closest opposite-class examples.

Harder 1-dimensional dataset

[Figure: negative points clustered around x = 0 with positive points on both sides, so no single threshold on x separates them.]

What can be done about this?

Expand from the one-dimensional space to a two-dimensional space: map each point x to (x, x²). In the new space the two classes become linearly separable.

Kernel trick: expand the dimensionality by a kernel function.
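A minimal sketch of why the expansion helps. The sample points are illustrative: negatives near x = 0, positives on both sides, so no threshold on x works, but a threshold on x² does:

```python
# Map each 1-D point x to the 2-D point (x, x^2); the classes then
# separate with the horizontal line x2 = 1.

xs     = [-3, -2, 2, 3, -0.5, 0, 0.5]
labels = [ 1,  1, 1, 1,   -1, -1,  -1]

expanded = [(x, x * x) for x in xs]
preds = [1 if x2 > 1 else -1 for (_, x2) in expanded]
print(preds == labels)   # True: linearly separable after the expansion
```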

Nonlinear Kernel (I)

Nonlinear Kernel (II)
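The figures for these two slides do not survive in the transcript. As a supplement, two standard nonlinear kernel functions (not necessarily the ones the slides pictured) are the polynomial and Gaussian RBF kernels:

```latex
% Two common nonlinear kernels (illustrative; the slides' own examples
% are not recoverable from the transcript).
\[
K_{\text{poly}}(\mathbf{x}, \mathbf{z}) = (\mathbf{x} \cdot \mathbf{z} + 1)^p,
\qquad
K_{\text{RBF}}(\mathbf{x}, \mathbf{z})
  = \exp\!\left(-\frac{\lVert \mathbf{x} - \mathbf{z} \rVert^2}{2\sigma^2}\right).
\]
```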

Software for SVM

• SVMlight (http://svmlight.joachims.org/)
• Libsvm (http://www.csie.ntu.edu.tw/~cjlin/libsvm/)
  • It is faster than SVMlight
• Sparse data representation
  • The occurrences of most words in a document are zero
  • Format: <label> <index1>:<value1> <index2>:<value2>
    (class label, then word-id:word-occurrence pairs)
  • Example
    • D = ('hello': 2, 'world': 3), a negative document
    • Word-id for 'hello' is 100, word-id for 'world' is 54
    • -1 100:2 54:3
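A minimal sketch of producing this sparse line. The word-id table is the slide's example; the helper function name is mine:

```python
# Render a document's nonzero word counts in the sparse
# '<label> <word-id>:<occurrence> ...' format shown above.

word_ids = {"hello": 100, "world": 54}   # word -> word-id

def to_sparse_line(label, word_counts):
    """Emit one '<label> <word-id>:<count> ...' line, nonzero counts only."""
    pairs = " ".join(f"{word_ids[w]}:{c}" for w, c in word_counts.items() if c > 0)
    return f"{label} {pairs}"

print(to_sparse_line(-1, {"hello": 2, "world": 3}))   # -> -1 100:2 54:3
```

Note that libsvm itself expects the indices within each line in ascending order, so real preprocessing would sort the pairs by word-id first.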