CS 188: Artificial Intelligence
Perceptrons
Instructor: Anca Dragan, University of California, Berkeley
[These slides were created by Dan Klein, Pieter Abbeel, and Anca Dragan for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]
Linear Classifiers
Feature Vectors
Hello,
Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just
# free      : 2
YOUR_NAME   : 0
MISSPELLED  : 2
FROM_FRIEND : 0
...
SPAM or +
PIXEL-7,12 : 1
PIXEL-7,13 : 0
...
NUM_LOOPS  : 1
...
“2”
Linear Classifiers
o Inputs are feature values
o Each feature has a weight
o Sum is the activation: activation_w(x) = Σ_i w_i · f_i(x) = w · f(x)
o If the activation is:
  o Positive, output +1
  o Negative, output -1
[Diagram: inputs f1, f2, f3 are multiplied by weights w1, w2, w3 and summed; the output node tests whether the sum is > 0.]
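The decision rule above can be sketched in a few lines of Python. The dict-of-floats representation for sparse feature vectors and the function names are illustration choices, not part of the original slides:

```python
def activation(weights, features):
    """Weighted sum of feature values: sum_i w_i * f_i(x)."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def classify(weights, features):
    """Output +1 if the activation is positive, -1 otherwise."""
    return 1 if activation(weights, features) > 0 else -1
```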
Geometric Explanation
f(x1):
  # free      : 2
  YOUR_NAME   : 0
  MISSPELLED  : 2
  FROM_FRIEND : 0
  ...

w:
  # free      : 4
  YOUR_NAME   : -1
  MISSPELLED  : 1
  FROM_FRIEND : -3
  ...

f(x2):
  # free      : 0
  YOUR_NAME   : 1
  MISSPELLED  : 1
  FROM_FRIEND : 1
  ...

A positive dot product w · f(x) means the positive class.
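Working out the dot products for the vectors above (the vector with negative entries is taken to be the weight vector w; the `dot` helper is illustrative):

```python
# Weight vector and the two example feature vectors from the slide.
w = {"# free": 4, "YOUR_NAME": -1, "MISSPELLED": 1, "FROM_FRIEND": -3}
spam_mail = {"# free": 2, "YOUR_NAME": 0, "MISSPELLED": 2, "FROM_FRIEND": 0}
ham_mail  = {"# free": 0, "YOUR_NAME": 1, "MISSPELLED": 1, "FROM_FRIEND": 1}

def dot(w, f):
    """Dot product of sparse vectors stored as dicts."""
    return sum(w.get(k, 0) * v for k, v in f.items())

print(dot(w, spam_mail))  # 4*2 + 1*2 = 10 > 0  -> positive class (SPAM)
print(dot(w, ham_mail))   # -1 + 1 - 3 = -3 < 0 -> negative class (HAM)
```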
Geometric Explanation
o In the space of feature vectors
  o Examples are points
  o Any weight vector is a hyperplane
  o One side corresponds to Y = +1
  o Other corresponds to Y = -1
BIAS  : -3
free  : 4
money : 2
...
[Plot: decision boundary in the feature space, with axes "free" and "money".]
+1 = SPAM
-1 = HAM
Weight Updates
Learning: Binary Perceptron
o Start with weights = 0
o For each training instance:
  o Classify with current weights: y = +1 if w · f(x) ≥ 0, else -1
  o If correct (i.e., y = y*), no change!
  o If wrong: adjust the weight vector by adding or subtracting the feature vector: w = w + y* · f(x). Subtract if y* is -1.
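The loop above can be sketched directly (a minimal version; the dict representation, function name, and fixed epoch count are assumptions for illustration):

```python
def train_binary_perceptron(data, epochs=10):
    """data: list of (features, y_star) pairs with y_star in {+1, -1}.
    Starts with all-zero weights; updates only on mistakes."""
    w = {}
    for _ in range(epochs):
        for features, y_star in data:
            score = sum(w.get(k, 0.0) * v for k, v in features.items())
            y = 1 if score >= 0 else -1  # classify with current weights
            if y != y_star:              # if wrong: w <- w + y* * f(x)
                for k, v in features.items():
                    w[k] = w.get(k, 0.0) + y_star * v
    return w
```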
Examples: Perceptron
o Separable Case
Multiclass Decision Rule
o If we have multiple classes:
  o A weight vector for each class: w_y
  o Score (activation) of a class y: score(x, y) = w_y · f(x)
  o Prediction: highest score wins, y = argmax_y w_y · f(x)
Binary = multiclass where the negative class has weight zero
Learning: Multiclass Perceptron
o Start with all weights = 0
o Pick up training examples one by one
o Predict with current weights: y = argmax_y w_y · f(x)
o If correct, no change!
o If wrong: lower score of wrong answer, raise score of right answer:
  w_y  = w_y  - f(x)
  w_y* = w_y* + f(x)
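A sketch of the multiclass learning rule above (the representation, one weight dict per class, and the tie-breaking by class order are illustration choices):

```python
def train_multiclass_perceptron(data, classes, epochs=10):
    """data: list of (features, y_star) pairs; one weight dict per class."""
    w = {c: {} for c in classes}
    for _ in range(epochs):
        for f, y_star in data:
            # predict: the class whose weight vector gives the highest score
            y = max(classes,
                    key=lambda c: sum(w[c].get(k, 0.0) * v for k, v in f.items()))
            if y != y_star:
                for k, v in f.items():
                    w[y][k] = w[y].get(k, 0.0) - v            # lower the wrong answer
                    w[y_star][k] = w[y_star].get(k, 0.0) + v  # raise the right answer
    return w
```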
Example: Multiclass Perceptron
BIAS : 1        BIAS : 0        BIAS : 0
win  : 0        win  : 0        win  : 0
game : 0        game : 0        game : 0
vote : 0        vote : 0        vote : 0
the  : 0        the  : 0        the  : 0
...             ...             ...
“win the vote”
“win the election”
“win the game”
Properties of Perceptrons
o Separability: true if some parameters get the training set perfectly correct
o Convergence: if the training set is separable, the perceptron will eventually converge (binary case)
o Mistake Bound: the maximum number of mistakes (binary case) is related to the margin or degree of separability
Separable
Non-Separable
Examples: Perceptron
o Non-Separable Case
Improving the Perceptron
Problems with the Perceptron
o Noise: if the data isn’t separable, weights might thrash
  o Averaging weight vectors over time can help (averaged perceptron)
o Mediocre generalization: finds a “barely” separating solution
o Overtraining: test / held-out accuracy usually rises, then falls
  o Overtraining is a kind of overfitting
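The averaged perceptron mentioned above can be sketched as follows: run the ordinary binary perceptron, but return the average of the weight vectors seen after every example rather than the final one (function name and representation are illustrative):

```python
def train_averaged_perceptron(data, epochs=10):
    """Binary averaged perceptron: accumulate the weight vector after each
    training example and return the average, which smooths out thrashing
    on noisy / non-separable data."""
    w, total, count = {}, {}, 0
    for _ in range(epochs):
        for f, y_star in data:
            score = sum(w.get(k, 0.0) * v for k, v in f.items())
            if (1 if score >= 0 else -1) != y_star:
                for k, v in f.items():
                    w[k] = w.get(k, 0.0) + y_star * v
            # fold the current weights into the running total
            for k, v in w.items():
                total[k] = total.get(k, 0.0) + v
            count += 1
    return {k: v / count for k, v in total.items()}
```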
Fixing the Perceptron
o Idea: adjust the weight update to mitigate these effects
o MIRA*: choose an update size that fixes the current mistake…
o … but, minimizes the change to w
o The +1 (requiring the right answer to win by a margin of 1) helps to generalize
* Margin Infused Relaxed Algorithm
Minimum Correcting Update
The minimizing step size is not 0: at τ = 0 the constraint is violated (otherwise no error would have been made), so the minimum-change update is the one where the constraint holds with equality.
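Writing the minimum correcting update out (a reconstruction consistent with the setup above: add τf to the right answer's weights, subtract τf from the wrong answer's, and require the right answer to win by 1):

$$ w'_{y^*} = w_{y^*} + \tau f, \qquad w'_{y} = w_{y} - \tau f $$

$$ w'_{y^*} \cdot f = w'_{y} \cdot f + 1 \;\Rightarrow\; \tau = \frac{(w_{y} - w_{y^*}) \cdot f + 1}{2\, f \cdot f} $$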
Maximum Step Size
o In practice, it’s also bad to make updates that are too large
  o Example may be labeled incorrectly
  o You may not have enough features
o Solution: cap the maximum possible value of τ with some constant C: τ* = min(τ, C)
o Corresponds to an optimization that assumes non-separable data
o Usually converges faster than perceptron
o Usually better, especially on noisy data
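One MIRA step, combining the minimum correcting update with the cap C, can be sketched as below (a minimal version under the assumptions above; the function name and dict-of-dicts representation are illustrative):

```python
def mira_update(w, f, y, y_star, C=0.01):
    """One MIRA step (multiclass). w: class -> feature dict, f: feature dict.
    Assumes the prediction y differs from the true label y_star."""
    def score(c):
        return sum(w[c].get(k, 0.0) * v for k, v in f.items())

    f_dot_f = sum(v * v for v in f.values())
    # smallest step that makes the right answer beat the wrong one by 1,
    # capped at C so single (possibly mislabeled) examples can't move w too far
    tau = min(C, (score(y) - score(y_star) + 1.0) / (2.0 * f_dot_f))
    for k, v in f.items():
        w[y][k] = w[y].get(k, 0.0) - tau * v
        w[y_star][k] = w[y_star].get(k, 0.0) + tau * v
    return w
```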
Linear Separators
o Which of these linear separators is optimal?
Support Vector Machines
o Maximizing the margin: good according to intuition, theory, practice
o Only support vectors matter; other training examples are ignorable
o Support vector machines (SVMs) find the separator with max margin
o Basically, SVMs are MIRA where you optimize over all examples at once
[Figure: separators found by MIRA vs. SVM]
Classification: Comparison
o Naïve Bayes
  o Builds a model of the training data
  o Gives prediction probabilities
  o Strong assumptions about feature independence
  o One pass through data (counting)
o Perceptrons / MIRA:
  o Makes fewer assumptions about data
  o Mistake-driven learning
  o Multiple passes through data (prediction)
  o Often more accurate