16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression...
Transcript of 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression...
![Page 1: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/1.jpg)
Midterm Review Instructor: Wei Xu
![Page 2: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/2.jpg)
![Page 3: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/3.jpg)
Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent
Classification
![Page 4: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/4.jpg)
Planning GUIGarbageCollection
Machine Learning NLP
parsertagtrainingtranslationlanguage...
learningtrainingalgorithmshrinkagenetwork...
garbagecollectionmemoryoptimizationregion...
Test document
parserlanguagelabeltranslation…
Text Classification
...planningtemporalreasoningplanlanguage...
?
![Page 5: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/5.jpg)
Classification Methods: Supervised Machine Learning
• Input: – a document d – a fixed set of classes C = {c1, c2,…, cJ}
– A training set of m hand-labeled documents (d1,c1),....,(dm,cm)
• Output: – a learned classifier γ:d ! c
![Page 6: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/6.jpg)
Naïve Bayes Classifier
cMAP = argmaxc∈C
P(c | d)
= argmaxc∈C
P(d | c)P(c)P(d)
= argmaxc∈C
P(d | c)P(c)
MAP is “maximum a posteriori” = most likely class
Bayes Rule
Dropping the denominator
= argmaxc∈C
P(x1, x2,…, xn | c)P(c)
cNB = argmaxc∈C
P(cj ) P(x | c)x∈X∏
bag of word
conditional independence assumption
![Page 7: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/7.jpg)
Multinomial Naïve Bayes: Learning
• maximum likelihood estimates – simply use the frequencies in the data
P̂(wi | cj ) =count(wi ,cj )count(w,cj )
w∈V∑
P̂(cj ) =doccount(C = cj )
Ndoc
=count(wi ,c)+1
count(w,cw∈V∑ )
⎛
⎝⎜⎜
⎞
⎠⎟⎟ + V
• For unknown words (which completely doesn’t occur in training set), we can ignore them.
laplace (add-1) smoothing to avoid zero probabilities
![Page 8: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/8.jpg)
Weaknesses of Naïve Bayes
• Assuming conditional independence
• Correlated features -> double counting evidence – Parameters are estimated independently
• This can hurt classifier accuracy and calibration
![Page 9: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/9.jpg)
Logistic Regression
• Doesn’t assume conditional independence of features – Better calibrated probabilities – Can handle highly correlated overlapping
features
![Page 10: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/10.jpg)
Logistic vs. Softmax Regression
one estimated probability for
each class
one estimated probability good for binary classification
the class with highest
probability
Generalization to Multi-class Problems
![Page 11: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/11.jpg)
Maximum Entropy Models (MaxEnt)• a.k.a logistic regression or multinominal logistic regression or
multiclass logistic regression or softmax regression
• belongs to the family of classifiers know as log-linear classifiers
• Math proof of “LR=MaxEnt”: - [Klein and Manning 2003] - [Mount 2011]
http://www.win-vector.com/dfiles/LogisticRegressionMaxEnt.pdf
![Page 12: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/12.jpg)
convert into probabilities between [0, 1]
softmax function
![Page 13: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/13.jpg)
concave function!
![Page 14: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/14.jpg)
Gradient Ascent
![Page 15: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/15.jpg)
Gradient Ascent
Loop While not converged: For all features j, compute and add derivatives
: Training set log-likelihood (Objective function)
: Gradient vector
![Page 16: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/16.jpg)
![Page 17: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/17.jpg)
![Page 18: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/18.jpg)
LR Gradient
logistic function
![Page 19: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/19.jpg)
Perceptron vs. Logistic Regression
weights (parameters)
inputs (features)
choices of activation function
![Page 20: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/20.jpg)
Initialize weight vector w = 0 Loop for K iterations Loop For all training examples x_i if sign(w * x_i) != y_i w += y_i * x_i
Error-driven!
Perceptron Algorithm• Very similar to logistic regression • Not exactly computing gradient
![Page 21: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/21.jpg)
MultiClass Perceptron Algorithm
Initialize weight vector w = 0 Loop for K iterations Loop For all training examples x_i y_pred = argmax_y w_y * x_i if y_pred != y_i w_y_gold += x_i w_y_pred -= x_i
increase score for right answer
decrease score for wrong answer
![Page 22: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/22.jpg)
Perceptron vs. Logistic Regression
• Only hyperparameter of perceptron is maximum number of iterations (LR also needs learning rate)
• Perceptron is guaranteed to converge if the data is linearly separable (LR always converge)
![Page 23: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/23.jpg)
Perceptron vs. Logistic Regression• The Perceptron is an online learning algorithm. • Logistic Regression is not:
this update is effectively the same as “w += y_i * x_i”
![Page 24: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/24.jpg)
Multinominal LR Gradient
empirical feature count expected feature count
![Page 25: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/25.jpg)
MAP-based learning (Perceptron)
.≈
Maximum A Posteriori
approximate using maximization
![Page 26: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/26.jpg)
Markov Assumption, Perplexity, Interpolation, Backoff, and
Kneser-Ney Smoothing
Language Modeling
![Page 27: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/27.jpg)
Probabilistic Language Modeling•Goal: compute the probability of a sentence or sequence of words: P(W) = P(w1,w2,w3,w4,w5…wn)
•Related task: probability of an upcoming word: P(w5|w1,w2,w3,w4)
•A model that computes either of these: P(W) or P(wn|w1,w2…wn-1) is called a language model or LM
![Page 28: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/28.jpg)
The Chain Rule applied to compute joint probability of words in sentence
Markov Assumption
![Page 29: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/29.jpg)
Bigram model• Condition on the previous word:
![Page 30: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/30.jpg)
Add-one estimation
•Also called Laplace smoothing
• Pretend we saw each word one more time than we did • Just add one to all the counts!
•MLE estimate:
•Add-1 estimate:
PMLE(wi |wi−1) =c(wi−1,wi )c(wi−1)
PAdd−1(wi |wi−1) =c(wi−1,wi )+1c(wi−1)+V
![Page 31: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/31.jpg)
Add-1 estimation is a blunt instrument
• So add-1 isn’t used for N-grams: •We’ll see better methods
• But add-1 is used to smooth other NLP models • For text classification • In domains where the number of zeros isn’t so huge.
![Page 32: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/32.jpg)
•Linear interpolation
•Backoff •Absolute Discounting •Kneser-Ney Smoothing
Better Language Models
![Page 33: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/33.jpg)
Perplexity
Perplexity is the inverse probability of the test set, “normalized” by the number of words:
Minimizing perplexity is the same as maximizing probability
The best language model is one that best predicts an unseen test set • Gives the highest P(sentence)
Chain Rule
for bigram
![Page 34: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/34.jpg)
Hidden Markov Models, Maximum Entropy Markov
Models (Log-linear Models for Tagging), and Viterbi Algorithm
Tagging
![Page 35: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/35.jpg)
Tagging (Sequence Labeling)• Given a sequence (in NLP, words), assign appropriate labels to
each word. • Many NLP problems can be viewed as sequence labeling: - POS Tagging - Chunking - Named Entity Tagging
• Labels of tokens are dependent on the labels of other tokens in the sequence, particularly their neighbors
Plays well with others. VBZ RB IN NNS
![Page 36: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/36.jpg)
![Page 37: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/37.jpg)
![Page 38: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/38.jpg)
Emission parameters
Trigram parameters
![Page 39: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/39.jpg)
![Page 40: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/40.jpg)
![Page 41: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/41.jpg)
![Page 42: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/42.jpg)
Andrew Viterbi, 1967
![Page 43: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/43.jpg)
![Page 44: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/44.jpg)
![Page 45: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/45.jpg)
running time for trigram HMM
running time for bigram HMM is O(n|S|2)
![Page 46: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/46.jpg)
![Page 47: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/47.jpg)
![Page 48: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/48.jpg)
![Page 49: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/49.jpg)
Decoding§ Linear Perceptron
§ Features must be local, for x=x1…xm, and s=s1…sm
§ Define π(i,si) to be the max score of a sequence of length iending in tag si
§ Viterbi algorithm (HMMs):
§ Viterbi algorithm (Maxent):�(i, si) = max
si�1
p(si|si�1, x1 . . . xm)�(i� 1, si�1)
⇡(i, si) = max
si�1
e(xi|si)q(si|si�1)⇡(i� 1, si�1)
s⇤ = argmaxs
w · �(x, s) · �
�(i, si) = max
si�1
w · ⇥(x, i, si�i, si) + �(i� 1, si�1)
�(x, s) =mX
j=1
�(x, j, sj�1, sj)
![Page 50: 16 midterm review - Wei Xu · Midterm Review Instructor: Wei Xu. Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent ...](https://reader033.fdocuments.us/reader033/viewer/2022050502/5f94646836427826c44b4c40/html5/thumbnails/50.jpg)