Maximum Entropy Model
LING 572, Fei Xia
02/07-02/09/06
History
• The concept of Maximum Entropy can be traced back along multiple threads to Biblical times.
• Introduced to the NLP area by Berger et al. (1996).
• Used in many NLP tasks: MT, tagging, parsing, PP attachment, LM, …
Outline
• Modeling: intuition, basic concepts, …
• Parameter training
• Feature selection
• Case study
Reference papers
• (Ratnaparkhi, 1997)
• (Ratnaparkhi, 1996)
• (Berger et al., 1996)
• (Klein and Manning, 2003)
Note: these papers use different notations.
Modeling
The basic idea
• Goal: estimate p
• Choose p with maximum entropy (or “uncertainty”) subject to the constraints (or “evidence”):

$$H(p) = -\sum_{x \in A \times B} p(x) \log p(x)$$

where $x = (a, b)$, $a \in A$, $b \in B$.
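To make the objective concrete, here is a small sketch (not from the slides) that computes H(p) for a distribution stored as a dict from (a, b) events to probabilities:

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) log p(x), treating 0 log 0 as 0."""
    return -sum(px * math.log(px) for px in p.values() if px > 0)

# A uniform distribution over four (a, b) events has entropy log 4.
uniform = {("NN", "book"): 0.25, ("VB", "book"): 0.25,
           ("NN", "run"): 0.25, ("VB", "run"): 0.25}
print(entropy(uniform))  # log 4 ≈ 1.386
```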
Setting
• From training data, collect (a, b) pairs:
– a: the thing to be predicted (e.g., a class in a classification problem)
– b: the context
– Ex: POS tagging:
• a = NN
• b = the words in a window and the previous two tags
• Learn the probability of each (a, b): p(a, b)
Features in POS tagging (Ratnaparkhi, 1996)
context (a.k.a. history) | allowable classes
Maximum Entropy
• Why maximum entropy?
• Maximize entropy = minimize commitment
• Model all that is known and assume nothing about what is unknown.
– Model all that is known: satisfy a set of constraints that must hold.
– Assume nothing about what is unknown: choose the most “uniform” distribution, i.e., the one with maximum entropy.
Ex1: Coin-flip example (Klein & Manning, 2003)
• Toss a coin: p(H) = p1, p(T) = p2.
• Constraint: p1 + p2 = 1
• Question: what’s your estimate of p = (p1, p2)?
• Answer: choose the p that maximizes H(p)

$$H(p) = -\sum_{x} p(x) \log p(x)$$

[Figure: H plotted as a function of p1, with the point p1 = 0.3 marked on the curve.]
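A quick numerical check of this answer (illustrative, not from the slides): scanning p1 over a grid shows the constrained entropy peaks at p1 = 0.5:

```python
import math

def binary_entropy(p1):
    """Entropy of (p1, p2) under the constraint p2 = 1 - p1."""
    return -sum(p * math.log(p) for p in (p1, 1.0 - p1) if p > 0)

grid = [i / 1000 for i in range(1001)]
best = max(grid, key=binary_entropy)
print(best)  # 0.5
```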
Coin-flip example (cont)
[Figure: H plotted over (p1, p2); the constraint plane p1 + p2 = 1 cuts the entropy surface, with the point p1 = 0.3 marked on the constraint line.]
Ex2: An MT example (Berger et al., 1996)
Possible translations for the word “in” are:
Constraint:
Intuitive answer:
An MT example (cont)
Constraints:
Intuitive answer:
An MT example (cont)
Constraints:
Intuitive answer:
??
Ex3: POS tagging (Klein and Manning, 2003)
Ex3 (cont)
Ex4: overlapping features (Klein and Manning, 2003)
Modeling the problem
• Objective function: H(p)
• Goal: among all the distributions that satisfy the constraints, choose the one, p*, that maximizes H(p):

$$p^* = \arg\max_{p \in P} H(p)$$

• Question: how do we represent the constraints?
Features
• A feature (a.k.a. feature function, indicator function) is a binary-valued function on events:

$$f_j : A \times B \to \{0, 1\}$$

• A: the set of possible classes (e.g., tags in POS tagging)
• B: the space of contexts (e.g., neighboring words/tags in POS tagging)
• Ex:

$$f_j(a, b) = \begin{cases} 1 & \text{if } a = \text{DET and } \textit{curWord}(b) = \text{“that”} \\ 0 & \text{otherwise} \end{cases}$$
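A minimal sketch of this indicator feature in code; representing the context b as a dict with a "curWord" key is an assumption for illustration:

```python
def f_j(a, b):
    """1 if the class is DET and the current word is "that", else 0."""
    return 1 if a == "DET" and b.get("curWord") == "that" else 0

print(f_j("DET", {"curWord": "that"}))  # 1
print(f_j("NN", {"curWord": "that"}))   # 0
```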
Some notations
• $S$: finite training sample of events
• $\tilde{p}(x)$: observed probability of x in S
• $p(x)$: the model p’s probability of x
• $f_j$: the jth feature
• Model expectation of $f_j$: $E_p f_j = \sum_x p(x) f_j(x)$
• Observed expectation of $f_j$ (the empirical count of $f_j$): $E_{\tilde{p}} f_j = \sum_x \tilde{p}(x) f_j(x)$
Constraints
• Model’s feature expectation = observed feature expectation:

$$E_p f_j = E_{\tilde{p}} f_j$$

• How to calculate $E_{\tilde{p}} f_j$? For example, with

$$f_j(a, b) = \begin{cases} 1 & \text{if } a = \text{DET and } \textit{curWord}(b) = \text{“that”} \\ 0 & \text{otherwise} \end{cases}$$

$$E_{\tilde{p}} f_j = \sum_x \tilde{p}(x) f_j(x) = \frac{1}{N} \sum_{i=1}^{N} f_j(x_i)$$
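Since each feature is binary, the observed expectation is just the fraction of training events on which the feature fires; a sketch with a toy sample (the events are illustrative):

```python
def observed_expectation(f, sample):
    """E_~p f = (1/N) * sum_i f(x_i) over the training sample."""
    return sum(f(a, b) for a, b in sample) / len(sample)

def f_j(a, b):
    return 1 if a == "DET" and b == "that" else 0

sample = [("DET", "that"), ("NN", "book"), ("DET", "that"), ("DET", "the")]
print(observed_expectation(f_j, sample))  # fires on 2 of 4 events -> 0.5
```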
Training data → observed events
Restating the problem
The task: find p* s.t.

$$p^* = \arg\max_{p \in P} H(p)$$

where

$$P = \{ p \mid E_p f_j = E_{\tilde{p}} f_j, \ j \in \{1, \ldots, k\} \}$$

Objective function: −H(p)
Constraints:

$$E_p f_j = E_{\tilde{p}} f_j = d_j, \quad j \in \{1, \ldots, k\}$$

For $\sum_x p(x) = 1$, add a feature $f_0$ with $f_0(a, b) = 1$ for all $(a, b)$, so that $E_p f_0 = E_{\tilde{p}} f_0 = 1$.
Questions
• Is P empty?
• Does p* exist?• Is p* unique?• What is the form of p*?
• How to find p*?
What is the form of p*? (Ratnaparkhi, 1997)

$$P = \{ p \mid E_p f_j = E_{\tilde{p}} f_j, \ j \in \{1, \ldots, k\} \}$$

$$Q = \{ p \mid p(x) = \prod_{j=1}^{k} \alpha_j^{f_j(x)}, \ \alpha_j > 0 \}$$

Theorem: if $p^* \in P \cap Q$, then $p^* = \arg\max_{p \in P} H(p)$. Furthermore, p* is unique.
Using Lagrangian multipliers
Minimize A(p):

$$A(p) = -H(p) + \sum_{j=0}^{k} \lambda_j \, (d_j - E_p f_j)$$

Setting $A'(p) = 0$:

$$\frac{\partial}{\partial p(x)} \Big( \sum_x p(x) \log p(x) + \sum_{j=0}^{k} \lambda_j \big( d_j - \sum_x p(x) f_j(x) \big) \Big) = 0$$

$$\log p(x) + 1 - \sum_{j=0}^{k} \lambda_j f_j(x) = 0$$

$$\log p(x) = \sum_{j=0}^{k} \lambda_j f_j(x) - 1$$

$$p(x) = e^{\sum_{j=0}^{k} \lambda_j f_j(x) - 1} = e^{\sum_{j=1}^{k} \lambda_j f_j(x)} \cdot e^{\lambda_0 - 1}$$

$$p(x) = \frac{e^{\sum_{j=1}^{k} \lambda_j f_j(x)}}{Z}, \quad \text{where } Z = e^{1 - \lambda_0}$$

(using $f_0(x) = 1$ for all x).
Two equivalent forms

$$p(x) = \frac{1}{Z} \, e^{\sum_{j=1}^{k} \lambda_j f_j(x)}$$

$$p(x) = \frac{1}{Z} \prod_{j=1}^{k} \alpha_j^{f_j(x)}, \quad \lambda_j = \ln \alpha_j$$
Relation to Maximum Likelihood
The log-likelihood of the empirical distribution $\tilde{p}$ as predicted by a model q is defined as

$$L(q) = \sum_x \tilde{p}(x) \log q(x)$$

Theorem: if $p^* \in P \cap Q$, then $p^* = \arg\max_{q \in Q} L(q)$. Furthermore, p* is unique.
Summary (so far)
Goal: find p* in P, which maximizes H(p):

$$P = \{ p \mid E_p f_j = E_{\tilde{p}} f_j, \ j \in \{1, \ldots, k\} \}$$

It can be proved that, when p* exists, it is unique.
The model p* in P with maximum entropy is the model in Q that maximizes the likelihood of the training sample $\tilde{p}$:

$$Q = \{ p \mid p(x) = \prod_{j=1}^{k} \alpha_j^{f_j(x)}, \ \alpha_j > 0 \}$$
Summary (cont)
• Adding constraints (features): (Klein and Manning, 2003)
– Lowers the maximum entropy
– Raises the maximum likelihood of the data
– Brings the distribution further from uniform
– Brings the distribution closer to the data
Parameter estimation
Algorithms
• Generalized Iterative Scaling (GIS): (Darroch and Ratcliff, 1972)
• Improved Iterative Scaling (IIS): (Della Pietra et al., 1995)
GIS: setup
Requirements for running GIS:
• Obey the form of the model and the constraints:

$$p(x) = \frac{e^{\sum_{j=1}^{k} \lambda_j f_j(x)}}{Z}, \qquad E_p f_j = d_j$$

• An additional constraint:

$$\forall x, \ \sum_{j=1}^{k} f_j(x) = C$$

Let

$$C = \max_x \sum_{j=1}^{k} f_j(x)$$

Add a new feature $f_{k+1}$:

$$\forall x, \ f_{k+1}(x) = C - \sum_{j=1}^{k} f_j(x)$$
GIS algorithm
• Compute $d_j$, j = 1, …, k+1
• Initialize $\lambda_j^{(1)}$ (any values, e.g., 0)
• Repeat until convergence:
– For each j:
• Compute

$$E_{p^{(n)}} f_j = \sum_x p^{(n)}(x) f_j(x), \quad \text{where } p^{(n)}(x) = \frac{e^{\sum_{j=1}^{k+1} \lambda_j^{(n)} f_j(x)}}{Z}$$

• Update

$$\lambda_j^{(n+1)} = \lambda_j^{(n)} + \frac{1}{C} \log \frac{d_j}{E_{p^{(n)}} f_j}$$
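The loop above can be sketched directly for a toy, fully enumerable event space, adding the correction feature $f_{k+1}$ automatically; the function names and the tiny coin example are illustrative (real taggers approximate the model expectation over training contexts instead of enumerating all events):

```python
import math

def gis(events, features, n_iters=100):
    """Generalized Iterative Scaling on a small enumerable event space.
    `events` maps each event x to its observed count; `features` is a
    list of non-negative feature functions f_j(x)."""
    xs = list(events)
    N = sum(events.values())
    # Constant C and the correction feature f_{k+1}(x) = C - sum_j f_j(x).
    C = max(sum(f(x) for f in features) for x in xs)
    feats = features + [lambda x: C - sum(f(x) for f in features)]
    # Observed expectations d_j = E_~p f_j.
    d = [sum(events[x] * f(x) for x in xs) / N for f in feats]
    lam = [0.0] * len(feats)

    def model():
        w = {x: math.exp(sum(l * f(x) for l, f in zip(lam, feats))) for x in xs}
        Z = sum(w.values())
        return {x: wx / Z for x, wx in w.items()}

    for _ in range(n_iters):
        p = model()  # p^(n), held fixed within the iteration
        for j, f in enumerate(feats):
            Ej = sum(p[x] * f(x) for x in xs)
            if d[j] > 0 and Ej > 0:
                # lambda_j += (1/C) * log(d_j / E_{p^(n)} f_j)
                lam[j] += math.log(d[j] / Ej) / C
    return model()

# A biased coin observed 3 heads, 1 tail; one feature fires on heads.
p = gis({"H": 3, "T": 1}, [lambda x: 1 if x == "H" else 0])
print(p["H"])  # converges to the observed frequency 0.75
```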
Approximation for calculating feature expectation

$$E_p f_j = \sum_x p(x) f_j(x) = \sum_{a \in A,\, b \in B} p(a, b)\, f_j(a, b)$$

$$= \sum_{a \in A,\, b \in B} p(b)\, p(a \mid b)\, f_j(a, b)$$

$$\approx \sum_{a \in A,\, b \in B} \tilde{p}(b)\, p(a \mid b)\, f_j(a, b)$$

$$= \frac{1}{N} \sum_{i=1}^{N} \sum_{a \in A} p(a \mid b_i)\, f_j(a, b_i)$$
Properties of GIS
• L(p^(n+1)) >= L(p^(n))
• The sequence is guaranteed to converge to p*.
• Convergence can be very slow.
• The running time of each iteration is O(NPA):
– N: the training set size
– P: the number of classes
– A: the average number of features that are active for a given event (a, b)
IIS algorithm
• Compute $d_j$, j = 1, …, k+1, and

$$f^{\#}(x) = \sum_{j=1}^{k} f_j(x)$$

• Initialize $\lambda_j^{(1)}$ (any values, e.g., 0)
• Repeat until convergence:
– For each j:
• Let $\Delta\lambda_j^{(n)}$ be the solution to

$$\sum_x p^{(n)}(x)\, f_j(x)\, e^{\Delta\lambda_j^{(n)} f^{\#}(x)} = d_j$$

• Update $\lambda_j^{(n+1)} = \lambda_j^{(n)} + \Delta\lambda_j^{(n)}$
Calculating $\Delta\lambda_j$
If

$$\forall x, \ \sum_{j=1}^{k} f_j(x) = C$$

then

$$\Delta\lambda_j = \frac{1}{C} \log \frac{d_j}{E_{p^{(n)}} f_j}$$

and GIS is the same as IIS.
Otherwise, $\Delta\lambda_j$ must be calculated numerically.
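In the general case the update equation is one-dimensional and monotone in $\Delta\lambda_j$, so any root finder works; a bisection sketch over an enumerable event space (the bracket [-10, 10] and the toy example are illustrative assumptions):

```python
import math

def solve_delta(p, f_j, f_sharp, d_j, lo=-10.0, hi=10.0, iters=60):
    """Bisection solve of sum_x p(x) f_j(x) exp(delta * f#(x)) = d_j.
    `p` maps events to model probabilities; f_j and f_sharp are functions.
    Assumes the root lies in [lo, hi]."""
    def g(delta):
        return sum(px * f_j(x) * math.exp(delta * f_sharp(x))
                   for x, px in p.items()) - d_j
    for _ in range(iters):
        mid = (lo + hi) / 2
        if g(lo) * g(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

# Toy check: 0.5 * e^delta = 0.75 has the solution delta = log(1.5).
p = {"a": 0.5, "b": 0.5}
delta = solve_delta(p, lambda x: 1 if x == "a" else 0, lambda x: 1, 0.75)
print(delta)  # ≈ log(1.5) ≈ 0.405
```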
Feature selection
Feature selection
• Throw in many features and let the machine select the weights
– Manually specify feature templates
• Problem: too many features
• An alternative: a greedy algorithm
– Start with an empty set S
– Add a feature at each iteration
Notation
• With the feature set S: the model $p_S$
• After adding a feature f: the model $p_{S \cup \{f\}}$
• The gain in the log-likelihood of the training data from adding f
Feature selection algorithm (Berger et al., 1996)
• Start with S being empty; thus $p_S$ is uniform.
• Repeat until the gain is small enough:
– For each candidate feature f:
• Compute the model $p_{S \cup \{f\}}$ using IIS
• Calculate the log-likelihood gain
– Choose the feature with maximal gain, and add it to S
Problem: too expensive
Approximating gains (Berger et al., 1996)
• Instead of recalculating all the weights, calculate only the weight of the new feature.
Training a MaxEnt Model
Scenario #1:
• Define feature templates
• Create the feature set
• Determine the optimum feature weights via GIS or IIS
Scenario #2:
• Define feature templates
• Create a candidate feature set S
• At every iteration, choose the feature from S with max gain and determine its weight (or choose the top-n features and their weights).
Case study
POS tagging (Ratnaparkhi, 1996)
• Notation variation:
– $f_j(a, b)$ — a: class, b: context
– $f_j(h_i, t_i)$ — h: history for the ith word, t: tag for the ith word
• History:

$$h_i = \{ w_i, w_{i+1}, w_{i+2}, w_{i-1}, w_{i-2}, t_{i-1}, t_{i-2} \}$$

• Training data:
– Treat it as a list of $(h_i, t_i)$ pairs.
– How many pairs are there?
Using a MaxEnt Model
• Modeling:
• Training:
– Define feature templates
– Create the feature set
– Determine the optimum feature weights via GIS or IIS
• Decoding:
Modeling

$$P(t_1 \ldots t_n \mid w_1 \ldots w_n) \approx \prod_{i=1}^{n} p(t_i \mid w_1 \ldots w_n, t_1 \ldots t_{i-1}) \approx \prod_{i=1}^{n} p(t_i \mid h_i)$$

$$p(t \mid h) = \frac{p(h, t)}{\sum_{t' \in T} p(h, t')}$$
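The conditional in the last equation can be computed by normalizing the exponential joint model over the tagset; a sketch with one toy feature and an illustrative weight (not from Ratnaparkhi's trained model):

```python
import math

def p_tag_given_history(h, t, tags, features, lam):
    """p(t|h) = p(h,t) / sum_{t'} p(h,t'), where
    p(h,t) is proportional to exp(sum_j lam_j f_j(h, t))."""
    def score(tag):
        return math.exp(sum(l * f(h, tag) for l, f in zip(lam, features)))
    return score(t) / sum(score(t2) for t2 in tags)

# Toy feature: current word "that" tagged DET, with weight 1.0.
features = [lambda h, t: 1 if t == "DET" and h.get("curWord") == "that" else 0]
lam = [1.0]
tags = ["DET", "NN"]
prob = p_tag_given_history({"curWord": "that"}, "DET", tags, features, lam)
print(prob)  # e / (e + 1) ≈ 0.731
```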
Training step 1: define feature templates
History $h_i$ | Tag $t_i$
Step 2: Create the feature set
• Collect all the features from the training data
• Throw away features that appear fewer than 10 times
Step 3: determine the feature weights
• GIS
• Training time:– Each iteration: O(NTA):
• N: the training set size• T: the number of allowable tags• A: average number of features that are active for a (h, t).
– About 24 hours on an IBM RS/6000 Model 380.
• How many features?
Decoding: Beam search
• Generate tags for $w_1$, find the top N, and set $s_{1j}$ accordingly, j = 1, 2, …, N
• For i = 2 to n (n is the sentence length):
– For j = 1 to N:
• Generate tags for $w_i$, given $s_{(i-1)j}$ as the previous tag context
• Append each tag to $s_{(i-1)j}$ to make a new sequence
– Find the N highest-probability sequences generated above, and set $s_{ij}$ accordingly, j = 1, …, N
• Return the highest-probability sequence $s_{n1}$.
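The steps above can be sketched as follows; the toy conditional model and its probabilities are illustrative, not from Ratnaparkhi's tagger:

```python
def beam_search(words, cond_prob, beam_size):
    """Beam search over tag sequences: keep the N best partial
    sequences at each word position."""
    beams = [([], 1.0)]  # (tag sequence so far, probability)
    for word in words:
        candidates = []
        for seq, prob in beams:
            # cond_prob returns a dict tag -> p(tag | word, previous tags).
            for tag, p in cond_prob(word, seq).items():
                candidates.append((seq + [tag], prob * p))
        # Keep the N highest-probability sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0][0]

# Toy conditional model: "flies" prefers VBZ after NN, else NNS.
def toy(word, prev):
    if word == "time":
        return {"NN": 0.7, "VB": 0.3}
    return {"VBZ": 0.8, "NNS": 0.2} if prev[-1:] == ["NN"] else {"NNS": 0.9, "VBZ": 0.1}

best = beam_search(["time", "flies"], toy, beam_size=2)
print(best)  # ['NN', 'VBZ']
```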
Beam search
Viterbi search
Decoding (cont)
• Tags for words:
– Known words: use a tag dictionary
– Unknown words: try all possible tags
• Ex: “time flies like an arrow”
• Running time: O(NTAB)
– N: sentence length
– B: beam size
– T: tagset size
– A: average number of features that are active for a given event
Experiment results
Comparison with other learners
• HMM: MaxEnt uses more context
• SDT: MaxEnt does not split data
• TBL: MaxEnt is statistical and it provides probability distributions.
MaxEnt Summary
• Concept: choose the p* that maximizes entropy while satisfying all the constraints.
• Max likelihood: p* is also the model within a model family that maximizes the log-likelihood of the training data.
• Training: GIS or IIS, which can be slow.
• MaxEnt handles overlapping features well.
• In general, MaxEnt achieves good performance on many NLP tasks.
Additional slides
Ex4 (cont)
??