CMSC 723 Fall 2016 - University of Maryland
Supervised Classification
CMSC 723 / LING 723 / INST 725
MARINE CARPUAT
marine@cs.umd.edu
Some slides by Graham Neubig, Jacob Eisenstein
Last time
• Text classification problems
– and their evaluation
• Linear classifiers
– Features & Weights
– Bag of words
– Naïve Bayes
Machine Learning, Probability
Linguistics
Today
• 3 linear classifiers
– Naïve Bayes
– Perceptron
– (Logistic Regression)
• Bag of words vs. rich feature sets
• Generative vs. discriminative models
• Bias-variance tradeoff
Naïve Bayes Recap
The Naivety of Naïve Bayes
• Conditional independence assumption
Naïve Bayes: Example
Smoothing
• Goal: assign some probability mass to events
that were not seen during training
• One method: “add alpha” smoothing
– Often, alpha = 1
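The "add alpha" idea can be sketched in a few lines of Python (function and variable names are ours, not from the slides): every word in the vocabulary gets a pseudo-count of alpha, so words never seen in training still receive nonzero probability.

```python
from collections import Counter

def add_alpha_probs(tokens, vocabulary, alpha=1.0):
    """Estimate P(w) with add-alpha smoothing over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}

probs = add_alpha_probs(["the", "cat", "the"], {"the", "cat", "dog"}, alpha=1.0)
# "dog" was never observed, yet it is assigned some probability mass
```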
Multinomial Naïve Bayes:
Learning in Practice
• From training corpus, extract Vocabulary
• Calculate P(yj) terms
  – For each yj in Y do
    docsj ← all docs with class = yj
    P(yj) = |docsj| / |total # documents|
• Calculate P(wk | yj) terms
  – Textj ← single doc containing all docsj
  – n ← total # of word positions in Textj
  – For each word wk in Vocabulary
    nk ← # of occurrences of wk in Textj
    P(wk | yj) = (nk + 1) / (n + |Vocabulary|)
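The "count and normalize" recipe above can be sketched as a small trainer (a minimal illustration with add-one smoothing; all names are ours): extract the vocabulary, estimate P(y) from document counts, and estimate P(w | y) from the concatenated documents of each class.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Multinomial Naive Bayes: returns log P(y) and log P(w|y) tables."""
    vocab = {w for d in docs for w in d}
    log_prior, log_lik = {}, {}
    for y in set(labels):
        class_docs = [d for d, l in zip(docs, labels) if l == y]
        log_prior[y] = math.log(len(class_docs) / len(docs))   # P(yj)
        text = [w for d in class_docs for w in d]              # Text_j
        counts = Counter(text)
        denom = len(text) + alpha * len(vocab)
        log_lik[y] = {w: math.log((counts[w] + alpha) / denom) for w in vocab}
    return log_prior, log_lik

def predict_nb(doc, log_prior, log_lik):
    """Pick the class maximizing log P(y) + sum of log P(w|y)."""
    def score(y):
        return log_prior[y] + sum(log_lik[y][w] for w in doc if w in log_lik[y])
    return max(log_prior, key=score)

docs = [["good", "movie"], ["great", "film"], ["bad", "movie"], ["awful", "film"]]
labels = ["pos", "pos", "neg", "neg"]
log_prior, log_lik = train_nb(docs, labels)
```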
Bias-Variance trade-off
• Variance of a classifier
  – How much its decisions are affected by small changes in the training set
  – Lower variance = smaller changes
• Bias of a classifier
  – How accurate it is at modeling different training sets
  – Lower bias = more accurate
• High-variance classifiers tend to overfit
• High-bias classifiers tend to underfit
Bias-Variance trade-off
• Impact of smoothing
– Lowers variance
– Increases bias (toward uniform probabilities)
Naïve Bayes
• A linear classifier whose weights can be
interpreted as parameters of a probabilistic model
• Pros
– parameters are easy to estimate from data: "count and normalize" (and smooth)
• Cons
– requires making a conditional independence assumption, which does not hold in practice
Today
• 3 linear classifiers
– Naïve Bayes
– Perceptron
– Logistic Regression
• Bag of words vs. rich feature sets
• Generative vs. discriminative models
• Bias-variance tradeoff
– Smoothing, regularization
Beyond Bag of Words for classification tasks
Given an introductory sentence in Wikipedia, predict whether the article is about a person
Designing features
Predicting requires combining information
• Given features and weights
• Predicting for a new example:
  – If (sum of weights > 0), "yes"; otherwise "no"
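The decision rule above, in code (a sketch; the feature names are made up for illustration): sum the weights of the active features and check the sign.

```python
def predict(weights, features):
    """Linear decision rule: 'yes' iff the weighted feature sum is positive."""
    score = sum(weights.get(f, 0.0) * v for f, v in features.items())
    return "yes" if score > 0 else "no"

# Hypothetical weights for the "is this article about a person?" task
weights = {"contains_born": 2.0, "contains_company": -1.5}
answer = predict(weights, {"contains_born": 1})
```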
Formalizing binary classification
with linear models
Example feature functions: Unigram features
• Number of times a particular word appears
  – i.e. bag of words
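A unigram feature function is just a word-count map, which can be written directly (a sketch; the function name is ours):

```python
from collections import Counter

def unigram_features(tokens):
    """Bag of words: feature name = word, feature value = count."""
    return Counter(tokens)

feats = unigram_features("the cat sat on the mat".split())
```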
An online learning algorithm
Perceptron weight update
• If y = 1, increase the weights for the features in φ(x)
• If y = -1, decrease the weights for the features in φ(x)
Example: initial update
Example: second update
Perceptron
• A linear model for classification
• An algorithm to learn feature weights given
labeled data
– online algorithm
– error-driven
– Does it converge?
• See “A Course In Machine Learning” Ch.3
Multiclass perceptron
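The slide's figure is not reproduced here; a common formulation of the multiclass perceptron goes as follows (a sketch under that assumption, names ours): keep one weight vector per class, predict the highest-scoring class, and on a mistake add the features to the correct class's weights and subtract them from the predicted class's.

```python
def mc_update(weights, phi, y):
    """One multiclass perceptron step; weights maps class -> weight dict."""
    def score(c):
        return sum(weights[c].get(f, 0.0) * v for f, v in phi.items())
    pred = max(weights, key=score)
    if pred != y:
        for f, v in phi.items():
            weights[y][f] = weights[y].get(f, 0.0) + v       # boost correct class
            weights[pred][f] = weights[pred].get(f, 0.0) - v  # penalize prediction
    return weights
```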
Bias-Variance trade-off
• How do we decide when to stop?
  – Accuracy on held-out data
  – Early stopping
• Averaged perceptron
– Improves generalization
Averaged perceptron
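The slide's algorithm box is not reproduced here; the usual averaged perceptron can be sketched as follows (an illustration, names ours): train as before, but return the average of the weight vector over all updates, which typically generalizes better than the final weights.

```python
def averaged_perceptron(data, epochs=5):
    """Perceptron training that returns the running average of the weights."""
    w, w_sum, n = {}, {}, 0
    for _ in range(epochs):
        for phi, y in data:             # y in {+1, -1}
            n += 1
            score = sum(w.get(f, 0.0) * v for f, v in phi.items())
            if (1 if score > 0 else -1) != y:
                for f, v in phi.items():
                    w[f] = w.get(f, 0.0) + y * v
            for f, v in w.items():      # accumulate for the average
                w_sum[f] = w_sum.get(f, 0.0) + v
    return {f: s / n for f, s in w_sum.items()}

data = [({"a": 1}, 1), ({"b": 1}, -1)]
w_avg = averaged_perceptron(data, epochs=5)
```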
Learning as optimization: Loss functions
• Naïve Bayes chooses weights to maximize the joint likelihood of the training data (or log likelihood)
Perceptron Loss function
• “0-1” loss
• Treats all errors equally
• Does not care about confidence of
classification decision
Today
• 3 linear classifiers
– Naïve Bayes
– Perceptron
– (Logistic Regression)
• Bag of words vs. rich feature sets
• Generative vs. discriminative models
• Bias-variance tradeoff
Perceptron & Probabilities
• What if we want a probability p(y|x)?
• The perceptron gives us a prediction y
The logistic function
• "Softer" function than the perceptron's hard threshold
• Can account for uncertainty
• Differentiable
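The logistic (sigmoid) function maps the linear score to a probability in (0, 1); a one-line sketch:

```python
import math

def sigmoid(z):
    """Logistic function: a differentiable 'soft' step, output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))
```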
Logistic regression: how to train?
• Train based on conditional likelihood
• Find parameters w that maximize conditional
likelihood of all answers 𝑦𝑖 given examples 𝑥𝑖
Stochastic gradient ascent (or descent)
• Online training algorithm for logistic regression
  – and other probabilistic models
• Update weights for every training example
• Move in direction given by gradient
• Size of update step scaled by learning rate
Gradient of the logistic function
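The slide's derivation is not reproduced here; the key identity it relies on is that the derivative of the logistic function is sigma(z) * (1 - sigma(z)), which we can verify numerically:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

z = 0.7
# Central finite difference vs. the closed-form derivative
numeric = (sigmoid(z + 1e-6) - sigmoid(z - 1e-6)) / 2e-6
analytic = sigmoid(z) * (1 - sigmoid(z))
```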
Example: initial update
Example: second update
How to set the learning rate?
• Various strategies
• Decay over time:
  α = 1 / (C + t)
  where C is a parameter and t is the number of samples
• Use held-out test set, increase learning rate when likelihood increases
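The decay schedule α = 1 / (C + t), in code (a sketch; the function name is ours): the step size shrinks as more samples t are processed, and C controls how large the early steps are.

```python
def learning_rate(t, C=1.0):
    """Decaying learning rate: alpha = 1 / (C + t)."""
    return 1.0 / (C + t)
```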
Some models are better than others…
• Consider these 2 examples
• Which of the 2 models below is better?
Classifier 2 will probably generalize better!
It does not include irrelevant information
=> Smaller model is better
Regularization
• A penalty on adding extra weights
• L2 regularization: penalty grows with ||w||2
  – big penalty on large weights
  – small penalty on small weights
• L1 regularization: penalty grows with ||w||1
  – uniform penalty increase whether weights are large or small
  – will cause many weights to become zero
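The two penalty terms added to the training objective can be written out directly (a sketch; names and the regularization strength lam are ours): L2 penalizes weights quadratically, L1 penalizes their absolute values, which is what pushes many weights to exactly zero.

```python
def l2_penalty(w, lam=0.1):
    """lam * ||w||_2^2: large weights are penalized much more than small ones."""
    return lam * sum(v * v for v in w.values())

def l1_penalty(w, lam=0.1):
    """lam * ||w||_1: the penalty's slope is the same for large and small weights."""
    return lam * sum(abs(v) for v in w.values())

w = {"a": 3.0, "b": -4.0}
```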
L1 regularization in online learning
Today
• 3 linear classifiers
– Naïve Bayes
– Perceptron
– (Logistic Regression)
• Bag of words vs. rich feature sets
• Generative vs. discriminative models
• Bias-variance tradeoff