CS 188: Artificial Intelligence
Perceptrons
Instructor: Anca Dragan, University of California, Berkeley
[These slides were created by Dan Klein, Pieter Abbeel, and Anca Dragan for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]
Linear Classifiers
Feature Vectors
Hello,
Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just
# free      : 2
YOUR_NAME   : 0
MISSPELLED  : 2
FROM_FRIEND : 0
...
SPAM or +
PIXEL-7,12 : 1
PIXEL-7,13 : 0
...
NUM_LOOPS  : 1
...
“2”
Linear Classifiers
o Inputs are feature values
o Each feature has a weight
o Sum is the activation: activation_w(x) = Σ_i w_i · f_i(x) = w · f(x)
o If the activation is:
  o Positive, output +1
  o Negative, output -1
[Diagram: inputs f1, f2, f3 are multiplied by weights w1, w2, w3 and summed; the output node tests whether the sum is > 0.]
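The decision rule above can be sketched in a few lines of Python. The dict-of-floats representation for sparse feature vectors and the function names are illustration choices, not part of the original slides:

```python
def activation(weights, features):
    """Weighted sum of feature values: sum_i w_i * f_i(x)."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def classify(weights, features):
    """Output +1 if the activation is positive, -1 otherwise."""
    return 1 if activation(weights, features) > 0 else -1
```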
Geometric Explanation
f(x1):
  # free      : 2
  YOUR_NAME   : 0
  MISSPELLED  : 2
  FROM_FRIEND : 0
  ...

w:
  # free      : 4
  YOUR_NAME   : -1
  MISSPELLED  : 1
  FROM_FRIEND : -3
  ...

f(x2):
  # free      : 0
  YOUR_NAME   : 1
  MISSPELLED  : 1
  FROM_FRIEND : 1
  ...

A positive dot product w · f(x) means the positive class.
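Working out the dot products for the vectors above (the vector with negative entries is taken to be the weight vector w; the `dot` helper is illustrative):

```python
# Weight vector and the two example feature vectors from the slide.
w = {"# free": 4, "YOUR_NAME": -1, "MISSPELLED": 1, "FROM_FRIEND": -3}
spam_mail = {"# free": 2, "YOUR_NAME": 0, "MISSPELLED": 2, "FROM_FRIEND": 0}
ham_mail  = {"# free": 0, "YOUR_NAME": 1, "MISSPELLED": 1, "FROM_FRIEND": 1}

def dot(w, f):
    """Dot product of sparse vectors stored as dicts."""
    return sum(w.get(k, 0) * v for k, v in f.items())

print(dot(w, spam_mail))  # 4*2 + 1*2 = 10 > 0  -> positive class (SPAM)
print(dot(w, ham_mail))   # -1 + 1 - 3 = -3 < 0 -> negative class (HAM)
```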
Geometric Explanation
o In the space of feature vectors
  o Examples are points
  o Any weight vector is a hyperplane
  o One side corresponds to Y = +1
  o Other corresponds to Y = -1
BIAS  : -3
free  : 4
money : 2
...
[Plot: decision boundary in the feature space, with axes "free" and "money".]
+1 = SPAM
-1 = HAM
Weight Updates
Learning: Binary Perceptron
o Start with weights = 0
o For each training instance:
  o Classify with current weights: y = +1 if w · f(x) ≥ 0, else -1
  o If correct (i.e., y = y*), no change!
  o If wrong: adjust the weight vector by adding or subtracting the feature vector: w = w + y* · f(x). Subtract if y* is -1.
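The loop above can be sketched directly (a minimal version; the dict representation, function name, and fixed epoch count are assumptions for illustration):

```python
def train_binary_perceptron(data, epochs=10):
    """data: list of (features, y_star) pairs with y_star in {+1, -1}.
    Starts with all-zero weights; updates only on mistakes."""
    w = {}
    for _ in range(epochs):
        for features, y_star in data:
            score = sum(w.get(k, 0.0) * v for k, v in features.items())
            y = 1 if score >= 0 else -1  # classify with current weights
            if y != y_star:              # if wrong: w <- w + y* * f(x)
                for k, v in features.items():
                    w[k] = w.get(k, 0.0) + y_star * v
    return w
```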
Examples: Perceptron
o Separable Case
Multiclass Decision Rule
o If we have multiple classes:
  o A weight vector for each class: w_y
  o Score (activation) of a class y: score(x, y) = w_y · f(x)
  o Prediction: highest score wins, y = argmax_y w_y · f(x)
Binary = multiclass where the negative class has weight zero
Learning: Multiclass Perceptron
o Start with all weights = 0
o Pick up training examples one by one
o Predict with current weights: y = argmax_y w_y · f(x)
o If correct, no change!
o If wrong: lower score of wrong answer, raise score of right answer:
  w_y  = w_y  - f(x)
  w_y* = w_y* + f(x)
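A sketch of the multiclass learning rule above (the representation, one weight dict per class, and the tie-breaking by class order are illustration choices):

```python
def train_multiclass_perceptron(data, classes, epochs=10):
    """data: list of (features, y_star) pairs; one weight dict per class."""
    w = {c: {} for c in classes}
    for _ in range(epochs):
        for f, y_star in data:
            # predict: the class whose weight vector gives the highest score
            y = max(classes,
                    key=lambda c: sum(w[c].get(k, 0.0) * v for k, v in f.items()))
            if y != y_star:
                for k, v in f.items():
                    w[y][k] = w[y].get(k, 0.0) - v            # lower the wrong answer
                    w[y_star][k] = w[y_star].get(k, 0.0) + v  # raise the right answer
    return w
```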
Example: Multiclass Perceptron
BIAS : 1        BIAS : 0        BIAS : 0
win  : 0        win  : 0        win  : 0
game : 0        game : 0        game : 0
vote : 0        vote : 0        vote : 0
the  : 0        the  : 0        the  : 0
...             ...             ...
“win the vote”
“win the election”
“win the game”
Properties of Perceptrons
o Separability: true if some parameters get the training set perfectly correct
o Convergence: if the training set is separable, the perceptron will eventually converge (binary case)
o Mistake Bound: the maximum number of mistakes (binary case) is related to the margin or degree of separability
Separable
Non-Separable
Examples: Perceptron
o Non-Separable Case
Improving the Perceptron
Problems with the Perceptron
o Noise: if the data isn’t separable, weights might thrash
  o Averaging weight vectors over time can help (averaged perceptron)
o Mediocre generalization: finds a “barely” separating solution
o Overtraining: test / held-out accuracy usually rises, then falls
  o Overtraining is a kind of overfitting
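The averaged perceptron mentioned above can be sketched as follows: run the ordinary binary perceptron, but return the average of the weight vectors seen after every example rather than the final one (function name and representation are illustrative):

```python
def train_averaged_perceptron(data, epochs=10):
    """Binary averaged perceptron: accumulate the weight vector after each
    training example and return the average, which smooths out thrashing
    on noisy / non-separable data."""
    w, total, count = {}, {}, 0
    for _ in range(epochs):
        for f, y_star in data:
            score = sum(w.get(k, 0.0) * v for k, v in f.items())
            if (1 if score >= 0 else -1) != y_star:
                for k, v in f.items():
                    w[k] = w.get(k, 0.0) + y_star * v
            # fold the current weights into the running total
            for k, v in w.items():
                total[k] = total.get(k, 0.0) + v
            count += 1
    return {k: v / count for k, v in total.items()}
```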
Fixing the Perceptron
o Idea: adjust the weight update to mitigate these effects
o MIRA*: choose an update size that fixes the current mistake…
o … but, minimizes the change to w
o The +1 (requiring the right answer to win by a margin of 1) helps to generalize
* Margin Infused Relaxed Algorithm
Minimum Correcting Update
The minimizing step size is not 0: at τ = 0 the constraint is violated (otherwise no error would have been made), so the minimum-change update is the one where the constraint holds with equality.
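Writing the minimum correcting update out (a reconstruction consistent with the setup above: add τf to the right answer's weights, subtract τf from the wrong answer's, and require the right answer to win by 1):

$$ w'_{y^*} = w_{y^*} + \tau f, \qquad w'_{y} = w_{y} - \tau f $$

$$ w'_{y^*} \cdot f = w'_{y} \cdot f + 1 \;\Rightarrow\; \tau = \frac{(w_{y} - w_{y^*}) \cdot f + 1}{2\, f \cdot f} $$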
Maximum Step Size
o In practice, it’s also bad to make updates that are too large
  o Example may be labeled incorrectly
  o You may not have enough features
o Solution: cap the maximum possible value of τ with some constant C: τ* = min(τ, C)
o Corresponds to an optimization that assumes non-separable data
o Usually converges faster than perceptron
o Usually better, especially on noisy data
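One MIRA step, combining the minimum correcting update with the cap C, can be sketched as below (a minimal version under the assumptions above; the function name and dict-of-dicts representation are illustrative):

```python
def mira_update(w, f, y, y_star, C=0.01):
    """One MIRA step (multiclass). w: class -> feature dict, f: feature dict.
    Assumes the prediction y differs from the true label y_star."""
    def score(c):
        return sum(w[c].get(k, 0.0) * v for k, v in f.items())

    f_dot_f = sum(v * v for v in f.values())
    # smallest step that makes the right answer beat the wrong one by 1,
    # capped at C so single (possibly mislabeled) examples can't move w too far
    tau = min(C, (score(y) - score(y_star) + 1.0) / (2.0 * f_dot_f))
    for k, v in f.items():
        w[y][k] = w[y].get(k, 0.0) - tau * v
        w[y_star][k] = w[y_star].get(k, 0.0) + tau * v
    return w
```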
Linear Separators
o Which of these linear separators is optimal?
Support Vector Machines
o Maximizing the margin: good according to intuition, theory, practice
o Only support vectors matter; other training examples are ignorable
o Support vector machines (SVMs) find the separator with max margin
o Basically, SVMs are MIRA where you optimize over all examples at once
[Figure: separators found by MIRA vs. SVM]
Classification: Comparison
o Naïve Bayes
  o Builds a model of the training data
  o Gives prediction probabilities
  o Strong assumptions about feature independence
  o One pass through data (counting)
o Perceptrons / MIRA:
  o Makes fewer assumptions about data
  o Mistake-driven learning
  o Multiple passes through data (prediction)
  o Often more accurate