Maxent Models (II)
(lecture transcript)
CMSC 473/673
UMBC
September 20th, 2017
Announcements: Assignment 1
Due 11:59 PM, Saturday 9/23
~3.5 days
Use the submit utility with:
  class id: cs473_ferraro
  assignment id: a1
We must be able to run it on GL!
Common pitfall #1: forgetting files
Common pitfall #2: incorrect paths to files
Common pitfall #3: 3rd-party libraries
Announcements: Question 6
$p^{(n)}(x_i \mid x_{i-n+1:i-1}) = \lambda_n f^{(n)}(x_{i-n+1:i}) + (1 - \lambda_n)\, p^{(n-1)}(x_i \mid x_{i-n+2:i-1})$

$p^{(n)}(x_i \mid x_{i-n+1:i-1}) = \lambda_{n,n} f^{(n)}(x_{i-n+1:i}) + \lambda_{n,n-1} f^{(n-1)}(x_{i-n+2:i}) + \cdots + \lambda_{n,0} f^{(0)}(\cdot)$

where $\lambda_{n,0} = 1 - \sum_{m=0}^{n-1} \lambda_{n,n-m}$
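A minimal sketch of this generalized interpolation in Python; the function, the f tables of relative-frequency estimates f^(k), and the toy numbers below are hypothetical illustrations, not part of the assignment handout:

def interpolated_prob(context, word, f, lambdas, vocab_size):
    """p^(n)(word | context) as the lambda-weighted sum of the
    relative-frequency estimates f^(0) ... f^(n).  f[k] maps a
    k-gram tuple ending in `word` to f^(k); f^(0)(.) is taken to
    be uniform over the vocabulary.  Assumes sum(lambdas) == 1."""
    n = len(context) + 1                      # model order
    p = lambdas[0] * (1.0 / vocab_size)       # lambda_{n,0} * f^(0)(.)
    for k in range(1, n + 1):
        kgram = tuple(context[len(context) - (k - 1):]) + (word,)
        p += lambdas[k] * f[k].get(kgram, 0.0)
    return p

# Hypothetical toy bigram model (n = 2) over a 3-word vocabulary:
f = {1: {("cat",): 0.5}, 2: {("the", "cat"): 0.9}}
print(interpolated_prob(("the",), "cat", f, [0.1, 0.3, 0.6], vocab_size=3))
# 0.1*(1/3) + 0.3*0.5 + 0.6*0.9 = 0.7233...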
Announcements: Course Project
Official handout will be out Friday 9/22
Until then, focus on assignment 1
Teams of 1-3
Mixed undergrad/grad is encouraged but not required
Some novel aspect is needed
Ex 1: reimplement existing technique and apply to new domain
Ex 2: reimplement existing technique and apply to new (human) language
Ex 3: explore novel technique on existing problem
Recap from last time…
Classify or Decode with Bayes Rule
how well does text Y represent label X?
how likely is label X overall?
For “simple” or “flat” labels (see the sketch below):
* iterate through labels
* evaluate the score for each label, keeping only the best (or n best)
* return the best (or n best) label and score
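A minimal sketch of that loop, assuming a hypothetical score(label, text) function that returns, e.g., log p(text | label) + log p(label):

import math

def classify(text, labels, score):
    """Iterate through 'flat' labels, evaluate the score for each,
    and keep only the best (track a sorted list for n-best)."""
    best_label, best_score = None, -math.inf
    for label in labels:
        s = score(label, text)
        if s > best_score:
            best_label, best_score = label, s
    return best_label, best_score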
Classification Evaluation: the 2-by-2 contingency table

                           Actually Correct      Actually Incorrect
Selected/Guessed           True Positive (TP)    False Positive (FP)
Not selected/not guessed   False Negative (FN)   True Negative (TN)
[Figure: classes/choices, each marked Correct vs. Guessed]
Classification Evaluation: Accuracy, Precision, and Recall
Accuracy: % of items correct
Precision: % of selected items that are correct
Recall: % of correct items that are selected
                           Actually Correct      Actually Incorrect
Selected/Guessed           True Positive (TP)    False Positive (FP)
Not selected/not guessed   False Negative (FN)   True Negative (TN)
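In code, the three metrics read directly off the four cells of the table; this is a bare sketch that ignores empty denominators:

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)   # % of items correct

def precision(tp, fp):
    return tp / (tp + fp)    # % of selected items that are correct

def recall(tp, fn):
    return tp / (tp + fn)    # % of correct items that are selected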
Language Modeling as Naïve Bayes Classifier
Adopt a naïve bag-of-words representation Y_i
Assume position doesn’t matter
Assume the feature probabilities are independent given the class X
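Under those two assumptions the (unnormalized) log posterior is a sum, as in this sketch; the log_prior and log_likelihood tables are hypothetical stand-ins, and the crude floor for unseen words stands in for real smoothing:

import math

def naive_bayes_log_score(doc_words, label, log_prior, log_likelihood):
    """Bag-of-words Naive Bayes: position is ignored and per-word
    probabilities are independent given the class X, so the score is
    log p(label) + sum over words of log p(word | label)."""
    score = log_prior[label]
    for w in doc_words:
        score += log_likelihood[label].get(w, math.log(1e-10))  # crude unseen-word floor
    return score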
Naïve Bayes Summary
Potential Advantages
Very Fast, low storage requirements
Robust to Irrelevant Features
Very good in domains with many equally important features
Optimal if the independence assumptions hold
Dependable baseline for text classification (but often not the best)
Potential Issues
Model the posterior in one go?
Are the features really uncorrelated?
Are plain counts always appropriate?
Are there “better” ways of handling missing/noisy data?
(automated, more principled)
Relevant for classification… and language modeling
Maximum Entropy Models
a more general language model

Bayes rule: argmax_X p(Y | X) · p(X)
Classify in one go: argmax_X p(X | Y)
ATTACK: Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.
We need to score the different combinations.
Document Classification
Score and Combine Our Possibilities

score1(fatally shot, ATTACK)
score2(seriously wounded, ATTACK)
score3(Shining Path, ATTACK)
…
scorek(department, ATTACK)

COMBINE → posterior probability of ATTACK

are all of these uncorrelated?
Q: What are the score and combine functions for Naïve Bayes?
Scoring Our Possibilities

score(doc, ATTACK) =
score1(fatally shot, ATTACK)
score2(seriously wounded, ATTACK)
score3(Shining Path, ATTACK)
…

Learn these scores… but how?
What do we optimize?
p(ATTACK | doc) ∝ SNAP(score(doc, ATTACK))
Maxent Modeling

p(ATTACK | doc) ∝ exp(score(doc, ATTACK))

f(x) = exp(x): exp gives a positive, unnormalized probability

p(ATTACK | doc) ∝ exp(score1(fatally shot, ATTACK) +
                      score2(seriously wounded, ATTACK) +
                      score3(Shining Path, ATTACK) + …)
Maxent Modeling
Learn the scores (but we’ll declare what combinations should be looked at)

p(ATTACK | doc) ∝ exp(weight1 · applies1(fatally shot, ATTACK) +
                      weight2 · applies2(seriously wounded, ATTACK) +
                      weight3 · applies3(Shining Path, ATTACK) + …)
Q: What if none of our features apply?
Guiding Principle for Log-Linear Models

“[The log-linear estimate] is the least biased estimate possible on the given information; i.e., it is maximally noncommittal with regard to missing information.” (Edwin T. Jaynes, 1957)

If no features apply, the feature vector f is all zeros, so exp(θ · f) = exp(θ · 0) = 1.

p(ATTACK | doc) ∝ exp(θ1 · f1(fatally shot, ATTACK) +
                      θ2 · f2(seriously wounded, ATTACK) +
                      θ3 · f3(Shining Path, ATTACK) + …)
Easier-to-write form

p(ATTACK | doc) ∝ exp(θ · f(doc, ATTACK))

K weights, K features: θ is a K-dimensional weight vector, f is a K-dimensional feature vector, and θ · f is their dot product.
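As a sketch, the unnormalized maxent score is just exp of a dot product; this also shows the Jaynes point above, since an all-zero feature vector yields exp(0) = 1:

import math

def unnormalized_score(theta, features):
    """exp(theta . f): K weights dotted with K feature values."""
    return math.exp(sum(t * fi for t, fi in zip(theta, features)))

print(unnormalized_score([1.2, -0.3], [0.0, 0.0]))   # no features apply -> 1.0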
Log-Linear Models

f: the feature function(s), also called sufficient statistics or “strength” function(s)
θ: the feature weights, also called natural parameters or distribution parameters
Log-Linear Models: how do we normalize?

Z = Σ_{label x} exp(weight1 · applies1(fatally shot, x) +
                    weight2 · applies2(seriously wounded, x) +
                    weight3 · applies3(Shining Path, x) + …)

Normalization for Classification

p(x | y) ∝ exp(θ · f(x, y)): classify doc y with label x in one go
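A minimal sketch of that normalization, where feature_fn(x, y) is a hypothetical function returning the K-dimensional feature vector f(x, y):

import math

def maxent_posterior(y, labels, theta, feature_fn):
    """p(x | y) = exp(theta . f(x, y)) / Z(y), with Z(y) the sum of
    exp(theta . f(x', y)) over all labels x'."""
    def dot(x):
        return sum(t * fi for t, fi in zip(theta, feature_fn(x, y)))
    unnorm = {x: math.exp(dot(x)) for x in labels}
    Z = sum(unnorm.values())
    return {x: s / Z for x, s in unnorm.items()}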
Normalization for Language Model

a general class-based (X) language model of doc y
Can be significantly harder in the general case
Simplifying assumption: maxent n-grams!
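Under the maxent n-gram assumption, each word is predicted from its local history and the normalizer only sums over the vocabulary, as in this hypothetical sketch:

import math

def maxent_ngram_prob(word, history, vocab, theta, feature_fn):
    """p(word | history) = exp(theta . f(history, word)) / Z(history),
    normalized over the vocabulary rather than over whole documents."""
    def score(w):
        return sum(t * fi for t, fi in zip(theta, feature_fn(history, w)))
    Z = sum(math.exp(score(w)) for w in vocab)
    return math.exp(score(word)) / Z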
https://www.csee.umbc.edu/courses/undergraduate/473/f17/loglin-tutorial/
https://goo.gl/B23Rxo
Lesson 1
Connections to Other Techniques

Log-Linear Models are known by several names:
* (Multinomial) logistic regression / softmax regression: as statistical regression
* Maximum Entropy models (MaxEnt): based in information theory
* a form of Generalized Linear Models
* viewed as Discriminative Naïve Bayes
* Very shallow (sigmoidal) neural nets: to be cool today :)
Learning θ
pθ(y | x): probabilistic model
objective (given observations)
How will we optimize F(θ)?
Calculus
[Figure: F(θ) plotted against θ, with its maximizer θ*; F’(θ) is the derivative of F with respect to θ]
Example

F(x) = -(x - 2)²
Differentiate: F’(x) = -2x + 4
Solve F’(x) = 0: x = 2
Common Derivative Rules
What if you can’t find the roots? Follow the derivative.

Set t = 0
Pick a starting value θ_t
Until converged:
  1. Get value y_t = F(θ_t)
  2. Get derivative g_t = F’(θ_t)
  3. Get scaling factor ρ_t
  4. Set θ_{t+1} = θ_t + ρ_t · g_t
  5. Set t += 1

[Figure: iterates θ0, θ1, θ2, θ3 with values y0…y3 and derivatives g0, g1, g2 climbing F(θ) toward the maximizer θ*]
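The loop above as a runnable sketch; ρ is held fixed here for simplicity, whereas step 3 of the slide would let it vary per iteration:

def gradient_ascent_1d(dF, theta0, rho=0.1, tol=1e-8, max_iters=1000):
    """Follow the derivative uphill: theta_{t+1} = theta_t + rho * F'(theta_t)."""
    theta = theta0
    for t in range(max_iters):
        g = dF(theta)                  # step 2: derivative at theta_t
        theta_next = theta + rho * g   # step 4: move along the derivative
        if abs(theta_next - theta) < tol:
            break
        theta = theta_next
    return theta

# The earlier example F(x) = -(x - 2)^2, F'(x) = -2x + 4, maximum at x = 2:
print(gradient_ascent_1d(lambda x: -2 * x + 4, theta0=0.0))   # -> approx. 2.0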
Gradient = Multi-variable derivative (K-dimensional input, K-dimensional output)
Gradient Ascent
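In K dimensions the same loop applies with the gradient in place of the derivative; a minimal sketch on a hypothetical toy objective (not a trained maxent model) whose maximizer is (1, -3):

def gradient_ascent(grad_F, theta0, rho=0.1, tol=1e-8, max_iters=10000):
    """theta_{t+1} = theta_t + rho * grad F(theta_t), applied coordinate-wise."""
    theta = list(theta0)
    for t in range(max_iters):
        g = grad_F(theta)
        theta_next = [th + rho * gi for th, gi in zip(theta, g)]
        if max(abs(a - b) for a, b in zip(theta_next, theta)) < tol:
            return theta_next
        theta = theta_next
    return theta

# F(theta) = -(theta_0 - 1)^2 - (theta_1 + 3)^2
print(gradient_ascent(lambda th: [-2 * (th[0] - 1), -2 * (th[1] + 3)], [0.0, 0.0]))
# -> approximately [1.0, -3.0]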