Post on 12-Jan-2016
Maximum Entropy Model
ELN – Natural Language Processing
Slides by: Fei Xia
History

The concept of Maximum Entropy can be traced back along multiple threads to Biblical times.
Introduced to NLP by Berger et al. (1996).
Used in many NLP tasks: MT, tagging, parsing, PP attachment, LM, …
Outline

Modeling: intuition, basic concepts, …
Parameter training
Feature selection
Case study
Reference papers

(Ratnaparkhi, 1997)
(Ratnaparkhi, 1996)
(Berger et al., 1996)
(Klein and Manning, 2003)
Note: these papers use different notations.
Modeling
The basic idea

Goal: estimate p.
Choose p with maximum entropy (or "uncertainty") subject to the constraints (or "evidence"):

H(p) = -\sum_{x} p(x) \log p(x)

where x = (a, b), a \in A, b \in B.
Setting

From training data, collect (a, b) pairs:
  a: the thing to be predicted (e.g., a class in a classification problem)
  b: the context
  Ex: POS tagging: a = NN; b = the words in a window and the previous two tags
Learn the probability of each (a, b): p(a, b).
Features in POS tagging (Ratnaparkhi, 1996)

A feature f_j(a, b) pairs an allowable class a with a context (a.k.a. history) b:

Condition       | Features
wi is not rare  | wi = X & ti = T
wi is rare      | X is prefix of wi, |X| <= 4 & ti = T
                | X is suffix of wi, |X| <= 4 & ti = T
                | wi contains number & ti = T
                | wi contains uppercase character & ti = T
                | wi contains hyphen & ti = T
for all wi      | ti-1 = X & ti = T
                | ti-2 ti-1 = X Y & ti = T
                | wi-1 = X & ti = T
                | wi-2 = X & ti = T
                | wi+1 = X & ti = T
                | wi+2 = X & ti = T
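The templates above can be sketched as a feature extractor. This is a minimal illustration, not Ratnaparkhi's actual code: the feature-string names, the boundary markers, and how rare words are decided are all assumptions, and each extracted string is implicitly conjoined with the candidate tag ti = T.

```python
def extract_features(words, tags, i, rare_words):
    """Return the active (history-side) feature strings for position i."""
    w, feats = words[i], []
    if w not in rare_words:
        feats.append(f"w={w}")
    else:
        for k in range(1, 5):  # prefixes/suffixes up to length 4
            feats.append(f"prefix={w[:k]}")
            feats.append(f"suffix={w[-k:]}")
        if any(c.isdigit() for c in w):
            feats.append("has-number")
        if any(c.isupper() for c in w):
            feats.append("has-uppercase")
        if "-" in w:
            feats.append("has-hyphen")
    t1 = tags[i - 1] if i >= 1 else "BOS"
    t2 = tags[i - 2] if i >= 2 else "BOS"
    feats.append(f"t-1={t1}")
    feats.append(f"t-2,t-1={t2},{t1}")
    for offset in (-2, -1, 1, 2):  # surrounding words
        j = i + offset
        feats.append(f"w{offset:+d}={words[j] if 0 <= j < len(words) else 'NONE'}")
    return feats
```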
Maximum Entropy

Why maximum entropy?
Maximize entropy = minimize commitment.
Model all that is known and assume nothing about what is unknown:
  Model all that is known: satisfy a set of constraints that must hold.
  Assume nothing about what is unknown: choose the most "uniform" distribution, i.e., the one with maximum entropy.
(Maximum) Entropy

Entropy: the uncertainty of a distribution.
Quantifying uncertainty ("surprise"): event x with probability p_x has "surprise" \log(1/p_x).
Entropy: expected surprise (over p):

H(p) = E_p \log \frac{1}{p_x} = -\sum_x p_x \log p_x
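As a quick sanity check, "surprise" and entropy can be computed directly; this snippet is purely illustrative.

```python
import math

def surprise(px):
    """log(1 / p_x): rarer events are more surprising."""
    return math.log(1 / px)

def entropy(p):
    """H(p) = -sum_x p_x log p_x  (0 log 0 taken as 0)."""
    return -sum(px * math.log(px) for px in p if px > 0)

# A peaked distribution has lower entropy (less uncertainty) than a uniform one.
assert entropy([0.9, 0.1]) < entropy([0.5, 0.5])
```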
Ex1: Coin-flip example (Klein & Manning, 2003)

Toss a coin: p(H) = p1, p(T) = p2.
Constraint: p1 + p2 = 1.
Question: what's your estimate of p = (p1, p2)?
Answer: choose the p that maximizes H(p) = -\sum_x p(x) \log p(x).

[Figure: H as a function of p1 = p(HEADS); the case p1 = 0.3 is marked.]
Convexity

Constrained H(p) = -\sum x \log x is concave:
  -x \log x is concave
  -\sum x \log x is concave (a sum of concave functions is concave)
The feasible region of constrained H is a linear subspace (which is convex).
The constrained entropy surface is therefore concave, so maximization is well-posed.
The maximum likelihood exponential model (dual) formulation is also convex.
Ex2: An MT example (Berger et al., 1996)

Possible translations for the word "in": {dans, en, à, au cours de, pendant}

Constraint:
  p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1

Intuitive answer:
  p(dans) = p(en) = p(à) = p(au cours de) = p(pendant) = 1/5
An MT example (cont)

Constraints:
  p(dans) + p(en) = 1/5
  p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1

Intuitive answer:
  p(dans) = p(en) = 1/10
  p(à) = p(au cours de) = p(pendant) = 8/30
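A quick numerical check (illustrative, not from the paper) confirms the intuition: among distributions satisfying both constraints, the one that is uniform within each constrained group has the highest entropy.

```python
import math

def entropy(p):
    return -sum(v * math.log(v) for v in p if v > 0)

# order: dans, en, a, au cours de, pendant
p_star = [1/10, 1/10, 8/30, 8/30, 8/30]   # the intuitive answer
p_alt  = [0.15, 0.05, 0.20, 0.30, 0.30]   # also satisfies both constraints

for p in (p_star, p_alt):
    assert abs(sum(p) - 1.0) < 1e-12      # normalization
    assert abs(p[0] + p[1] - 1/5) < 1e-12 # p(dans) + p(en) = 1/5

assert entropy(p_alt) < entropy(p_star)   # uniform-within-groups wins
```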
An MT example (cont)

Constraints:
  p(dans) + p(en) = 1/5
  p(dans) + p(à) = 1/2
  p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1

Intuitive answer:
  p(dans) = ?  p(en) = ?  p(à) = ?  p(au cours de) = ?  p(pendant) = ?
  (No obvious intuitive answer; we need a principled way to pick the distribution.)
Ex3: POS tagging (Klein and Manning, 2003)

Let's say we have the following event space:
  NN   NNS  NNP  NNPS  VBZ  VBD

… and the following empirical data:
  3    5    11   13    3    1

Maximize H:
  1/e  1/e  1/e  1/e   1/e  1/e

… want probabilities: E[NN, NNS, NNP, NNPS, VBZ, VBD] = 1:
  1/6  1/6  1/6  1/6   1/6  1/6
Ex3 (cont)

Too uniform!
N* are more common than V*, so we add the feature f_N = {NN, NNS, NNP, NNPS}, with E[f_N] = 32/36:

  NN    NNS   NNP    NNPS   VBZ   VBD
  8/36  8/36  8/36   8/36   2/36  2/36

… and proper nouns are more frequent than common nouns, so we add f_P = {NNP, NNPS}, with E[f_P] = 24/36:

  NN    NNS   NNP    NNPS   VBZ   VBD
  4/36  4/36  12/36  12/36  2/36  2/36

… we could keep refining the model, e.g. by adding a feature to distinguish singular vs. plural nouns, or verb types.
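These numbers can be verified directly: each refined distribution matches the stated empirical feature expectations exactly. A small check with exact rational arithmetic:

```python
from fractions import Fraction as F

tags = ["NN", "NNS", "NNP", "NNPS", "VBZ", "VBD"]
f_N = {"NN", "NNS", "NNP", "NNPS"}   # noun feature
f_P = {"NNP", "NNPS"}                # proper-noun feature

# Distribution after adding f_N only: uniform within {N*} and within {V*}.
p1 = dict(zip(tags, [F(8, 36)] * 4 + [F(2, 36)] * 2))
assert sum(p1.values()) == 1
assert sum(p1[t] for t in f_N) == F(32, 36)

# Distribution after adding both f_N and f_P.
p2 = dict(zip(tags, [F(4, 36), F(4, 36), F(12, 36),
                     F(12, 36), F(2, 36), F(2, 36)]))
assert sum(p2.values()) == 1
assert sum(p2[t] for t in f_N) == F(32, 36)
assert sum(p2[t] for t in f_P) == F(24, 36)
```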
Ex4: overlapping features (Klein and Manning, 2003)

Maxent models handle overlapping features.
Unlike a Naive Bayes model, there is no double counting!
But they do not automatically model feature interactions.
Modeling the problem

Objective function: H(p).
Goal: among all the distributions that satisfy the constraints, choose the one, p*, that maximizes H(p):

p^* = \arg\max_{p \in P} H(p)

Question: how do we represent the constraints?
Features

A feature (a.k.a. feature function, indicator function) is a binary-valued function on events:

f_j : \mathcal{E} \to \{0, 1\}, \quad \mathcal{E} = A \times B

A: the set of possible classes (e.g., tags in POS tagging)
B: the space of contexts (e.g., neighboring words/tags in POS tagging)

Example:

f_j(a, b) = 1 if a = DET and curWord(b) = "that", and 0 otherwise.
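In code, such a feature is just a predicate on (class, context) pairs. Representing the context b as a dict with a curWord entry is an assumption made for illustration:

```python
def f_j(a, b):
    """1 iff the class is DET and the current word of context b is "that"."""
    return 1 if a == "DET" and b.get("curWord") == "that" else 0

assert f_j("DET", {"curWord": "that"}) == 1   # feature fires
assert f_j("NN",  {"curWord": "that"}) == 0   # wrong class
assert f_j("DET", {"curWord": "the"})  == 0   # wrong word
```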
Some notation

S: finite training sample of events
\tilde{p}(x): observed probability of x in S
p(x): the model p's probability of x
f_j: the jth feature
E_{\tilde{p}} f_j = \sum_x \tilde{p}(x) f_j(x): observed expectation of f_j (the empirical count of f_j)
E_p f_j = \sum_x p(x) f_j(x): model expectation of f_j
Constraints

Model's feature expectation = observed feature expectation:

E_p f_j = E_{\tilde{p}} f_j

How do we calculate E_{\tilde{p}} f_j? Example:

f_j(a, b) = 1 if a = DET and curWord(b) = "that", and 0 otherwise.

E_{\tilde{p}} f_j = \sum_x \tilde{p}(x) f_j(x) = \frac{1}{N} \sum_{i=1}^{N} f_j(x_i)

where x_1, …, x_N are the observed events in the training data.
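Since \tilde{p} puts mass 1/N on each observed event, the observed expectation is just the average of the feature's value over the sample. A minimal sketch with a hypothetical feature:

```python
def f_j(a, b):
    # hypothetical feature: class DET with current word "that"
    return 1 if a == "DET" and b == "that" else 0

def empirical_expectation(f, sample):
    """E_p~ f = sum_x p~(x) f(x) = (1/N) sum_i f(x_i)."""
    return sum(f(a, b) for a, b in sample) / len(sample)

sample = [("DET", "that"), ("DET", "the"), ("NN", "that"), ("DET", "that")]
assert empirical_expectation(f_j, sample) == 0.5   # fires on 2 of 4 events
```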
Restating the problem

The task: find p* s.t.

p^* = \arg\max_{p \in P} H(p)

where

P = \{ p \mid E_p f_j = E_{\tilde{p}} f_j, j \in \{1, \dots, k\} \}

Objective function: -H(p)
Constraints:
  E_p f_j = E_{\tilde{p}} f_j = d_j, \quad j = 1, \dots, k
  \sum_x p(x) = 1

Add a feature f_0(a, b) = 1 for all (a, b); then E_p f_0 = E_{\tilde{p}} f_0 = 1, which encodes the normalization constraint.
Questions

Is P empty?
Does p* exist?
Is p* unique?
What is the form of p*?
How do we find p*?
What is the form of p*? (Ratnaparkhi, 1997)

P = \{ p \mid E_p f_j = E_{\tilde{p}} f_j, j \in \{1, \dots, k\} \}

Q = \{ p \mid p(x) = \frac{1}{Z} \prod_{j=1}^{k} \alpha_j^{f_j(x)}, \alpha_j > 0 \}

Theorem: if p^* \in P \cap Q, then p^* = \arg\max_{p \in P} H(p). Furthermore, p* is unique.
Using Lagrange multipliers

Minimize A(p):

A(p) = -H(p) - \sum_{j=0}^{k} \lambda_j (E_p f_j - d_j)

Set the derivative to 0:

\frac{\partial A}{\partial p(x)} = \log p(x) + 1 - \sum_{j=0}^{k} \lambda_j f_j(x) = 0

\log p(x) = \sum_{j=0}^{k} \lambda_j f_j(x) - 1

p(x) = e^{\lambda_0 - 1} \, e^{\sum_{j=1}^{k} \lambda_j f_j(x)} = \frac{1}{Z} e^{\sum_{j=1}^{k} \lambda_j f_j(x)}, \quad \text{where } Z = e^{1 - \lambda_0}
Two equivalent forms

p(x) = \frac{1}{Z} e^{\sum_{j=1}^{k} \lambda_j f_j(x)}

p(x) = \frac{1}{Z} \prod_{j=1}^{k} \alpha_j^{f_j(x)}, \quad \lambda_j = \ln \alpha_j
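Both forms describe the same log-linear model. A sketch of evaluating it over a small, fully enumerable event space (the function name and the toy feature are illustrative assumptions):

```python
import math

def loglinear(lambdas, features, x_space):
    """p(x) = (1/Z) exp(sum_j lambda_j f_j(x)) over a finite event space."""
    unnorm = {x: math.exp(sum(l * f(x) for l, f in zip(lambdas, features)))
              for x in x_space}
    Z = sum(unnorm.values())
    return {x: u / Z for x, u in unnorm.items()}

features = [lambda x: 1.0 if x == "a" else 0.0]
# Zero weights give the uniform distribution.
assert loglinear([0.0], features, ["a", "b"]) == {"a": 0.5, "b": 0.5}
# lambda_1 = ln(3), i.e. alpha_1 = 3, gives p(a) = 3/4.
p = loglinear([math.log(3)], features, ["a", "b"])
assert abs(p["a"] - 0.75) < 1e-12
```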
Relation to Maximum Likelihood

The log-likelihood of the empirical distribution \tilde{p} as predicted by a model q is defined as

L(q) = \sum_x \tilde{p}(x) \log q(x)

Theorem: if p^* \in P \cap Q, then p^* = \arg\max_{q \in Q} L(q). Furthermore, p* is unique.

Goal: find p* in P, which maximizes H(p). It can be proved that when p* exists, it is unique.
The model p* in P with maximum entropy is the model in Q that maximizes the likelihood of the training sample.
Summary (so far)

P = \{ p \mid E_p f_j = E_{\tilde{p}} f_j, j = 1, \dots, k \}

Q = \{ p \mid p(x) = \frac{1}{Z} \prod_{j=1}^{k} \alpha_j^{f_j(x)}, \alpha_j > 0 \}

p^* = \arg\max_{p \in P} H(p) = \arg\max_{q \in Q} L(q)
Summary (cont)

Adding constraints (features) (Klein and Manning, 2003):
  Lowers the maximum entropy
  Raises the maximum likelihood of the data
  Brings the distribution further from uniform
  Brings the distribution closer to the data
Parameter estimation

Algorithms

Generalized Iterative Scaling (GIS): (Darroch and Ratcliff, 1972)
Improved Iterative Scaling (IIS): (Della Pietra et al., 1995)
GIS: setup

Requirements for running GIS:

Obey the form of the model and the constraints:

p(x) = \frac{1}{Z} e^{\sum_{j=1}^{k} \lambda_j f_j(x)}, \quad E_p f_j = d_j

An additional constraint: for every x,

\sum_{j=1}^{k} f_j(x) = C

If this does not hold, add a new feature f_{k+1}:

C = \max_x \sum_{j=1}^{k} f_j(x), \quad f_{k+1}(x) = C - \sum_{j=1}^{k} f_j(x)
GIS algorithm

Compute d_j, j = 1, …, k+1
Initialize \lambda_j^{(1)} (any values, e.g., 0)
Repeat until convergence:
  for each j:
    compute

    E_{p^{(n)}} f_j = \sum_x p^{(n)}(x) f_j(x)

    where

    p^{(n)}(x) = \frac{1}{Z} e^{\sum_{j=1}^{k+1} \lambda_j^{(n)} f_j(x)}

    update

    \lambda_j^{(n+1)} = \lambda_j^{(n)} + \frac{1}{C} \log \frac{d_j}{E_{p^{(n)}} f_j}
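The loop above can be sketched for a fully enumerable event space (real taggers approximate the model expectation over observed contexts instead). The fixed iteration count and the toy data are assumptions of this sketch:

```python
import math

def gis(x_space, features, p_tilde, iterations=100):
    """Generalized Iterative Scaling on a small, fully enumerable event space.
    Assumes sum_j f_j(x) = C for every x (add a slack feature if needed)."""
    C = sum(f(x_space[0]) for f in features)       # constant by assumption
    d = [sum(p_tilde[x] * f(x) for x in x_space)   # observed expectations d_j
         for f in features]
    lam = [0.0] * len(features)
    for _ in range(iterations):
        scores = [math.exp(sum(l * f(x) for l, f in zip(lam, features)))
                  for x in x_space]
        Z = sum(scores)
        p = [s / Z for s in scores]                # current model p^(n)
        for j, f in enumerate(features):
            e_j = sum(p[i] * f(x) for i, x in enumerate(x_space))
            lam[j] += math.log(d[j] / e_j) / C     # the GIS update
    return lam

# Toy check: two events, one indicator feature per event (so C = 1).
feats = [lambda x: 1.0 if x == "H" else 0.0,
         lambda x: 1.0 if x == "T" else 0.0]
lam = gis(["H", "T"], feats, {"H": 0.7, "T": 0.3})
pH = math.exp(lam[0]) / (math.exp(lam[0]) + math.exp(lam[1]))
assert abs(pH - 0.7) < 1e-6   # model expectation matches observed
```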
Approximation for calculating feature expectation

E_p f_j = \sum_x p(x) f_j(x)
        = \sum_{a \in A, b \in B} p(a, b) f_j(a, b)
        = \sum_{b \in B} p(b) \sum_{a \in A} p(a \mid b) f_j(a, b)
        \approx \sum_{b \in B} \tilde{p}(b) \sum_{a \in A} p(a \mid b) f_j(a, b)
        = \frac{1}{N} \sum_{i=1}^{N} \sum_{a \in A} p(a \mid b_i) f_j(a, b_i)
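A sketch of this approximation, where p_cond stands in for a conditional model p(a|b); the names and the toy uniform model are assumptions:

```python
def approx_model_expectation(f_j, p_cond, events, classes):
    """E_p f_j ~= (1/N) sum_i sum_{a in A} p(a | b_i) f_j(a, b_i),
    summing only over the contexts b_i observed in training."""
    N = len(events)
    return sum(p_cond(a, b) * f_j(a, b)
               for _, b in events for a in classes) / N

# Toy check with a uniform conditional model over two classes.
events = [("X", "b1"), ("Y", "b2")]
classes = ["X", "Y"]
uniform = lambda a, b: 0.5
fires_on_X = lambda a, b: 1.0 if a == "X" else 0.0
assert approx_model_expectation(fires_on_X, uniform, events, classes) == 0.5
```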
Properties of GIS

L(p^{(n+1)}) >= L(p^{(n)}).
The sequence is guaranteed to converge to p*.
Convergence can be very slow.
The running time of each iteration is O(NPA):
  N: the training set size
  P: the number of classes
  A: the average number of features that are active for a given event (a, b)
Feature selection

Throw in many features and let the machine select the weights:
  Manually specify feature templates
  Problem: too many features
An alternative: a greedy algorithm:
  Start with an empty set S
  Add one feature at each iteration
Notation

With the feature set S: model p_S
After adding a feature f: model p_{S \cup \{f\}}
The gain in the log-likelihood of the training data:

\Delta L(S, f) = L(p_{S \cup \{f\}}) - L(p_S)
Feature selection algorithm (Berger et al., 1996)

Start with S empty; thus p_S is uniform.
Repeat until the gain is small enough:
  For each candidate feature f:
    Compute the model p_{S \cup \{f\}} using IIS
    Calculate the log-likelihood gain
  Choose the feature with maximal gain, and add it to S

Problem: too expensive.
Approximating gains (Berger et al., 1996)

Instead of recalculating all the weights, calculate only the weight of the new feature.
Training a MaxEnt Model

Scenario #1:
  Define feature templates
  Create the feature set
  Determine the optimum feature weights via GIS or IIS

Scenario #2:
  Define feature templates
  Create the candidate feature set S
  At every iteration, choose the feature from S with maximal gain and determine its weight (or choose the top-n features and their weights).
Case study

POS tagging (Ratnaparkhi, 1996)

Notation variation:
  f_j(a, b): a: class, b: context
  f_j(h_i, t_i): h_i: history for the ith word, t_i: tag for the ith word

History:

h_i = \{ w_i, w_{i-1}, w_{i-2}, w_{i+1}, w_{i+2}, t_{i-2}, t_{i-1} \}

Training data:
  Treat it as a list of (h_i, t_i) pairs.
  How many pairs are there?
Using a MaxEnt Model

Modeling
Training:
  Define feature templates
  Create the feature set
  Determine the optimum feature weights via GIS or IIS
Decoding
Modeling

P(t_1 \dots t_n \mid w_1 \dots w_n) \approx \prod_{i=1}^{n} p(t_i \mid w_1 \dots w_n, t_1 \dots t_{i-1}) \approx \prod_{i=1}^{n} p(t_i \mid h_i)

p(t \mid h) = \frac{p(h, t)}{\sum_{t' \in T} p(h, t')}
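The conditional p(t|h) is computed by normalizing exp(score) over the tagset. The feature-naming scheme and the single toy weight below are illustrative assumptions:

```python
import math

def tag_posterior(h, tagset, active_features, weights):
    """p(t | h) = p(h, t) / sum_{t'} p(h, t'),
    with p(h, t) proportional to exp(sum of weights of active features)."""
    score = {t: math.exp(sum(weights.get(f, 0.0)
                             for f in active_features(h, t)))
             for t in tagset}
    Z = sum(score.values())
    return {t: s / Z for t, s in score.items()}

# Toy model: one word-tag feature with positive weight.
active = lambda h, t: [f"w={h}&t={t}"]
weights = {"w=flies&t=VBZ": 1.0}
p = tag_posterior("flies", ["NN", "VBZ"], active, weights)
assert abs(sum(p.values()) - 1.0) < 1e-12
assert p["VBZ"] > p["NN"]   # the weighted feature shifts mass to VBZ
```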
Training step 1: define feature templates

Each feature f_j(h_i, t_i) pairs a history h_i with a tag t_i:

Condition       | Features
wi is not rare  | wi = X & ti = T
wi is rare      | X is prefix of wi, |X| <= 4 & ti = T
                | X is suffix of wi, |X| <= 4 & ti = T
                | wi contains number & ti = T
                | wi contains uppercase character & ti = T
                | wi contains hyphen & ti = T
for all wi      | ti-1 = X & ti = T
                | ti-2 ti-1 = X Y & ti = T
                | wi-1 = X & ti = T
                | wi-2 = X & ti = T
                | wi+1 = X & ti = T
                | wi+2 = X & ti = T
Step 2: create the feature set

Collect all the features from the training data.
Throw away features that appear fewer than 10 times.
Step 3: determine the feature weights

Use GIS.
Training time: each iteration is O(NTA):
  N: the training set size
  T: the number of allowable tags
  A: the average number of features that are active for a given (h, t)
How many features?
Decoding: beam search

Generate tags for w_1, find the top N, and set s_{1j} accordingly, j = 1, 2, …, N
for i = 2 to n (n is the sentence length):
  for j = 1 to N:
    generate tags for w_i, given s_{(i-1)j} as the previous tag context
    append each tag to s_{(i-1)j} to make a new sequence
  find the N highest-probability sequences generated above, and set s_{ij} accordingly, j = 1, …, N
Return the highest-probability sequence s_{n1}.
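The pseudocode above can be sketched as follows; score_tag stands in for the MaxEnt model p(t|h) and is an assumption of this sketch.

```python
import math

def beam_search(words, tagset, score_tag, beam_size=3):
    """Left-to-right beam decoding: keep the top-N tag sequences per position.
    score_tag(word, prev_tags) returns {tag: p(tag | context)}."""
    beams = [((), 0.0)]                       # (tag sequence, log probability)
    for w in words:
        candidates = []
        for seq, logp in beams:
            probs = score_tag(w, seq)
            for t in tagset:
                candidates.append((seq + (t,), logp + math.log(probs[t])))
        # the extensions compete for the beam_size slots at the next position
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0][0]                        # highest-probability sequence
```

A real tagger would also restrict the candidate tags per word (e.g. via a tag dictionary for known words) rather than trying the full tagset everywhere.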
Beam search

Beam inference:
  At each position keep the top k complete sequences.
  Extend each sequence in each local way.
  The extensions compete for the k slots at the next position.
Advantages:
  Fast: beam sizes of 3-5 are as good or almost as good as exact inference in many cases.
  Easy to implement (no dynamic programming required).
Disadvantage:
  Inexact: the globally best sequence can fall off the beam.
Viterbi search

Viterbi inference:
  Dynamic programming or memoization.
  Requires a small window of state influence (e.g., only the past two states are relevant).
Advantages:
  Exact: the globally best sequence is returned.
Disadvantages:
  Harder to implement long-distance state-state interactions (but beam inference tends not to allow long-distance resurrection of sequences anyway).
Decoding (cont)

Tags for words:
  Known words: use a tag dictionary.
  Unknown words: try all possible tags.
Ex: "time flies like an arrow"
Running time: O(NTAB):
  N: sentence length
  B: beam size
  T: tagset size
  A: the average number of features that are active for a given event
Experiment results

Comparison with other learners:
  HMM: MaxEnt uses more context.
  SDT: MaxEnt does not split data.
  TBL: MaxEnt is statistical and provides probability distributions.
MaxEnt Summary

Concept: choose the p* that maximizes entropy while satisfying all the constraints.
Max likelihood: p* is also the model within a model family that maximizes the log-likelihood of the training data.
Training: GIS or IIS, which can be slow.
MaxEnt handles overlapping features well.
In general, MaxEnt achieves good performance on many NLP tasks.