CS 294-5: Statistical Natural Language Processing · PDF fileAll CS188 materials are available...

23
CS 188: Artificial Intelligence Perceptrons Instructor: Anca Dragan --- University of California, Berkeley [These slides were created by Dan Klein, Pieter Abbeel, Anca Dragan for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at hp://ai.berkeley.edu.]

Transcript of CS 294-5: Statistical Natural Language Processing · PDF fileAll CS188 materials are available...

Page 1: CS 294-5: Statistical Natural Language Processing · PDF fileAll CS188 materials are available at .] Linear Classifiers. ... Problems with the ... other training examples are ignorable

CS 188: Artificial IntelligencePerceptrons

Instructor: Anca Dragan --- University of California, Berkeley[These slides were created by Dan Klein, Pieter Abbeel, Anca Dragan for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]

Page 2: CS 294-5: Statistical Natural Language Processing · PDF fileAll CS188 materials are available at .] Linear Classifiers. ... Problems with the ... other training examples are ignorable

Linear Classifiers

Page 3: CS 294-5: Statistical Natural Language Processing · PDF fileAll CS188 materials are available at .] Linear Classifiers. ... Problems with the ... other training examples are ignorable

Feature Vectors

Hello,

Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just

# free : 2YOUR_NAME : 0MISSPELLED : 2FROM_FRIEND : 0...

SPAMor+

PIXEL-7,12 : 1PIXEL-7,13 : 0...NUM_LOOPS : 1...

“2”

Page 4: CS 294-5: Statistical Natural Language Processing · PDF fileAll CS188 materials are available at .] Linear Classifiers. ... Problems with the ... other training examples are ignorable

Linear Classifiers

o Inputs are feature valueso Each feature has a weighto Sum is the activation

o If the activation is:o Positive, output +1o Negative, output -1

f1

f2

f3

w1

w2

w3

>0?

Page 5: CS 294-5: Statistical Natural Language Processing · PDF fileAll CS188 materials are available at .] Linear Classifiers. ... Problems with the ... other training examples are ignorable

Geometric Explanation

# free : 2YOUR_NAME : 0MISSPELLED : 2FROM_FRIEND : 0...

# free : 4YOUR_NAME :-1MISSPELLED : 1FROM_FRIEND :-3...

# free : 0YOUR_NAME : 1MISSPELLED : 1FROM_FRIEND : 1...

Dot product positive means the positive class

Page 6: CS 294-5: Statistical Natural Language Processing · PDF fileAll CS188 materials are available at .] Linear Classifiers. ... Problems with the ... other training examples are ignorable

Geometric Explanation

o In the space of feature vectorso Examples are pointso Any weight vector is a hyperplaneo One side corresponds to Y=+1o Other corresponds to Y=-1

BIAS : -3free : 4money : 2... 0 1

0

1

2

freem

oney

+1 = SPAM

-1 = HAM

Page 7: CS 294-5: Statistical Natural Language Processing · PDF fileAll CS188 materials are available at .] Linear Classifiers. ... Problems with the ... other training examples are ignorable

Weight Updates

Page 8: CS 294-5: Statistical Natural Language Processing · PDF fileAll CS188 materials are available at .] Linear Classifiers. ... Problems with the ... other training examples are ignorable

Learning: Binary Perceptrono Start with weights = 0o For each training instance:

o Classify with current weights

o If correct (i.e., y=y*), no change!

o If wrong: adjust the weight vector

Page 9: CS 294-5: Statistical Natural Language Processing · PDF fileAll CS188 materials are available at .] Linear Classifiers. ... Problems with the ... other training examples are ignorable

Learning: Binary Perceptrono Start with weights = 0o For each training instance:

o Classify with current weights

o If correct (i.e., y=y*), no change!o If wrong: adjust the weight vector

by adding or subtracting the feature vector. Subtract if y* is -1.

Page 10: CS 294-5: Statistical Natural Language Processing · PDF fileAll CS188 materials are available at .] Linear Classifiers. ... Problems with the ... other training examples are ignorable

Examples: Perceptron

o Separable Case

Page 11: CS 294-5: Statistical Natural Language Processing · PDF fileAll CS188 materials are available at .] Linear Classifiers. ... Problems with the ... other training examples are ignorable

Multiclass Decision Rule

o If we have multiple classes:o A weight vector for each class:

o Score (activation) of a class y:

o Prediction highest score wins

Binary = multiclass where the negative class has weight zero

Page 12: CS 294-5: Statistical Natural Language Processing · PDF fileAll CS188 materials are available at .] Linear Classifiers. ... Problems with the ... other training examples are ignorable

Learning: Multiclass Perceptron

o Start with all weights = 0o Pick up training examples one by oneo Predict with current weights

o If correct, no change!o If wrong: lower score of wrong

answer, raise score of right answer

Page 13: CS 294-5: Statistical Natural Language Processing · PDF fileAll CS188 materials are available at .] Linear Classifiers. ... Problems with the ... other training examples are ignorable

Example: Multiclass Perceptron

BIAS : 1win : 0game : 0 vote : 0 the : 0 ...

BIAS : 0 win : 0 game : 0 vote : 0 the : 0 ...

BIAS : 0 win : 0 game : 0 vote : 0 the : 0 ...

“win the vote”

“win the election”

“win the game”

Page 14: CS 294-5: Statistical Natural Language Processing · PDF fileAll CS188 materials are available at .] Linear Classifiers. ... Problems with the ... other training examples are ignorable

Properties of Perceptrons

o Separability: true if some parameters get the training set perfectly correct

o Convergence: if the training is separable, perceptron will eventually converge (binary case)

o Mistake Bound: the maximum number of mistakes (binary case) related to the margin or degree of separability

Separable

Non-Separable

Page 15: CS 294-5: Statistical Natural Language Processing · PDF fileAll CS188 materials are available at .] Linear Classifiers. ... Problems with the ... other training examples are ignorable

Examples: Perceptron

o Non-Separable Case

Page 16: CS 294-5: Statistical Natural Language Processing · PDF fileAll CS188 materials are available at .] Linear Classifiers. ... Problems with the ... other training examples are ignorable

Improving the Perceptron

Page 17: CS 294-5: Statistical Natural Language Processing · PDF fileAll CS188 materials are available at .] Linear Classifiers. ... Problems with the ... other training examples are ignorable

Problems with the Perceptron

o Noise: if the data isn’t separable, weights might thrasho Averaging weight vectors over time

can help (averaged perceptron)

o Mediocre generalization: finds a “barely” separating solution

o Overtraining: test / held-out accuracy usually rises, then fallso Overtraining is a kind of overfitting

Page 18: CS 294-5: Statistical Natural Language Processing · PDF fileAll CS188 materials are available at .] Linear Classifiers. ... Problems with the ... other training examples are ignorable

Fixing the Perceptron

o Idea: adjust the weight update to mitigate these effects

o MIRA*: choose an update size that fixes the current mistake…

o … but, minimizes the change to w

o The +1 helps to generalize

* Margin Infused Relaxed Algorithm

Page 19: CS 294-5: Statistical Natural Language Processing · PDF fileAll CS188 materials are available at .] Linear Classifiers. ... Problems with the ... other training examples are ignorable

Minimum Correcting Update

min not =0, or would not have made an error, so min will be where equality holds

Page 20: CS 294-5: Statistical Natural Language Processing · PDF fileAll CS188 materials are available at .] Linear Classifiers. ... Problems with the ... other training examples are ignorable

Maximum Step Sizeo In practice, it’s also bad to make updates that are too large

o Example may be labeled incorrectlyo You may not have enough featureso Solution: cap the maximum possible value of with some

constant C

o Corresponds to an optimization that assumes non-separable data

o Usually converges faster than perceptrono Usually better, especially on noisy data

Page 21: CS 294-5: Statistical Natural Language Processing · PDF fileAll CS188 materials are available at .] Linear Classifiers. ... Problems with the ... other training examples are ignorable

Linear Separators

o Which of these linear separators is optimal?

Page 22: CS 294-5: Statistical Natural Language Processing · PDF fileAll CS188 materials are available at .] Linear Classifiers. ... Problems with the ... other training examples are ignorable

Support Vector Machineso Maximizing the margin: good according to intuition, theory,

practiceo Only support vectors matter; other training examples are ignorable o Support vector machines (SVMs) find the separator with max

margino Basically, SVMs are MIRA where you optimize over all examples at

once MIRA

SVM

Page 23: CS 294-5: Statistical Natural Language Processing · PDF fileAll CS188 materials are available at .] Linear Classifiers. ... Problems with the ... other training examples are ignorable

Classification: Comparison

o Naïve Bayeso Builds a model training datao Gives prediction probabilitieso Strong assumptions about feature independenceo One pass through data (counting)

o Perceptrons / MIRA:o Makes less assumptions about datao Mistake-driven learningo Multiple passes through data (prediction)o Often more accurate