Maximum Entropy Model


ELN – Natural Language Processing

Slides by: Fei Xia

History

The concept of Maximum Entropy can be traced back along multiple threads to Biblical times.

Introduced to NLP by Berger et al. (1996).

Used in many NLP tasks: MT, tagging, parsing, PP attachment, LM, …

Outline

Modeling: intuition, basic concepts, …
Parameter training
Feature selection
Case study

Reference papers

(Ratnaparkhi, 1997)
(Ratnaparkhi, 1996)
(Berger et al., 1996)
(Klein and Manning, 2003)

Note that these papers use different notations.

Modeling

The basic idea

Goal: estimate p.
Choose the p with maximum entropy (or "uncertainty") subject to the constraints (or "evidence"):

H(p) = -\sum_{x} p(x) \log p(x), \quad x = (a, b), \; a \in A, \; b \in B

Setting

From training data, collect (a, b) pairs:
a: the thing to be predicted (e.g., a class in a classification problem)
b: the context
Ex: POS tagging:
a = NN
b = the words in a window and the previous two tags
Learn the probability of each (a, b): p(a, b)

Features in POS tagging (Ratnaparkhi, 1996)

Each feature conjoins a condition on the context (a.k.a. history) with an allowable class (the tag). A sketch of how these templates can be instantiated follows the table.

Condition: wi is not rare
Features: wi = X & ti = T

Condition: wi is rare
Features: X is prefix of wi, |X| ≤ 4 & ti = T
X is suffix of wi, |X| ≤ 4 & ti = T
wi contains number & ti = T
wi contains uppercase character & ti = T
wi contains hyphen & ti = T

Condition: for all wi
Features: ti-1 = X & ti = T
ti-2 ti-1 = X Y & ti = T
wi-1 = X & ti = T
wi-2 = X & ti = T
wi+1 = X & ti = T
wi+2 = X & ti = T
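A minimal sketch of instantiating these templates for one word position. The function name, the feature-string encodings, and the rare-word threshold are illustrative assumptions, not Ratnaparkhi's actual code.

```python
def extract_features(words, tags, i, word_counts=None, rare_threshold=5):
    """Instantiate Ratnaparkhi-style context predicates for position i.

    words: list of tokens; tags: tags already assigned to positions < i.
    word_counts: corpus frequency of each word, used for the (assumed) rare test.
    """
    w = words[i]
    feats = []
    is_rare = word_counts is not None and word_counts.get(w, 0) < rare_threshold
    if not is_rare:
        feats.append(f"w_i={w}")
    else:
        for k in range(1, min(4, len(w)) + 1):      # prefixes/suffixes up to length 4
            feats.append(f"prefix={w[:k]}")
            feats.append(f"suffix={w[-k:]}")
        if any(c.isdigit() for c in w):
            feats.append("contains_number")
        if any(c.isupper() for c in w):
            feats.append("contains_uppercase")
        if "-" in w:
            feats.append("contains_hyphen")
    # contextual predicates, used for every word
    prev1 = tags[i - 1] if i >= 1 else "BOS"
    prev2 = tags[i - 2] if i >= 2 else "BOS"
    feats.append("t_i-1=" + prev1)
    feats.append("t_i-2,t_i-1=" + prev2 + "," + prev1)
    for offset in (-2, -1, 1, 2):                   # neighboring words
        j = i + offset
        neighbor = words[j] if 0 <= j < len(words) else "PAD"
        feats.append(f"w_i{offset:+d}={neighbor}")
    return feats
```

Each of these context predicates is then conjoined with a candidate tag ti = T to form a binary feature f_j(a, b).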

Maximum Entropy

Why maximum entropy?
Maximize entropy = minimize commitment.
Model all that is known and assume nothing about what is unknown:
Model all that is known: satisfy a set of constraints that must hold.
Assume nothing about what is unknown: choose the most "uniform" distribution, i.e., the one with maximum entropy.

(Maximum) Entropy

Entropy: the uncertainty of a distribution.
Quantifying uncertainty ("surprise"):
Event x, probability p_x
"Surprise": log(1/p_x)
Entropy: expected surprise (over p):

H(p) = E_p\Big[\log \frac{1}{p_x}\Big] = -\sum_x p_x \log p_x

[Figure: entropy H plotted against p(HEADS) for a coin flip]

Ex1: Coin-flip example (Klein & Manning, 2003)

Toss a coin: p(H) = p1, p(T) = p2.
Constraint: p1 + p2 = 1.
Question: what's your estimate of p = (p1, p2)?
Answer: choose the p that maximizes H(p):

H(p) = -\sum_x p(x) \log p(x)

[Figure: H plotted as a function of p1; an additional constraint such as p1 = 0.3 pins the distribution to a single point on the curve]
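A small numerical check of this example, assuming nothing beyond the slide's setup:

```python
import numpy as np

# Entropy of a Bernoulli(p1) distribution: H(p) = -sum_x p(x) log p(x)
def entropy(p1):
    p = np.array([p1, 1.0 - p1])
    p = p[p > 0]                       # avoid log(0)
    return -np.sum(p * np.log(p))

grid = np.linspace(0.001, 0.999, 999)
best = grid[np.argmax([entropy(p1) for p1 in grid])]
print(best)             # ~0.5: the unconstrained maximum-entropy coin is fair
print(entropy(0.3))     # entropy of the constrained solution p1 = 0.3
```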

Convexity

The entropy H(p) = -\sum_x p(x) \log p(x) is concave in p:
-x \log x is concave, and a sum of concave functions is concave.
The feasible region defined by the constraints is a linear subspace (which is convex).
Maximizing the concave entropy over this convex region is therefore a convex optimization problem with a single global optimum.
The maximum likelihood exponential model (dual) formulation is also convex.

Ex2: An MT example (Berger et al., 1996)

Possible translations of the word "in": {dans, en, à, au cours de, pendant}

Constraint:
p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1

Intuitive answer:
p(dans) = 1/5
p(en) = 1/5
p(à) = 1/5
p(au cours de) = 1/5
p(pendant) = 1/5

An MT example (cont)

Constraints:
p(dans) + p(en) = 1/5
p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1

Intuitive answer:
p(dans) = 1/10
p(en) = 1/10
p(à) = 8/30
p(au cours de) = 8/30
p(pendant) = 8/30

An MT example (cont)

Constraints:
p(dans) + p(en) = 1/5
p(dans) + p(à) = 1/2
p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1

Intuitive answer:
p(dans) = ?
p(en) = ?
p(à) = ?
p(au cours de) = ?
p(pendant) = ?

With these overlapping constraints there is no longer an obvious "uniform" answer; this is exactly where the maximum entropy principle gives a well-defined solution.
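A minimal numerical sketch of this example, solving for the maximum entropy distribution under the three constraints with scipy; the variable ordering and solver setup are my own assumptions:

```python
import numpy as np
from scipy.optimize import minimize

# Order: dans, en, à, au cours de, pendant
def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return np.sum(p * np.log(p))          # minimize -H(p)

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},       # probabilities sum to 1
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.2},   # p(dans) + p(en) = 1/5
    {"type": "eq", "fun": lambda p: p[0] + p[2] - 0.5},   # p(dans) + p(à) = 1/2
]
res = minimize(neg_entropy, x0=np.full(5, 0.2),
               bounds=[(0, 1)] * 5, constraints=constraints)
print(res.x)   # the maximum entropy distribution satisfying all three constraints
```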

Ex3: POS tagging (Klein and Manning, 2003)

Let's say we have the following event space, with the following empirical data (counts):

        NN   NNS   NNP   NNPS   VBZ   VBD
counts   3    5    11    13      3     1

Maximize H with no constraints: every event gets value 1/e:

        1/e  1/e   1/e   1/e    1/e   1/e

… but we want probabilities, so add the constraint E[NN, NNS, NNP, NNPS, VBZ, VBD] = 1 (the values must sum to 1):

        1/6  1/6   1/6   1/6    1/6   1/6

Ex3 (cont)

Too uniform!
N* are more common than V*, so we add the feature fN = {NN, NNS, NNP, NNPS}, with E[fN] = 32/36:

        NN     NNS    NNP    NNPS   VBZ    VBD
        8/36   8/36   8/36   8/36   2/36   2/36

… and proper nouns are more frequent than common nouns, so we add fP = {NNP, NNPS}, with E[fP] = 24/36:

        NN     NNS    NNP    NNPS   VBZ    VBD
        4/36   4/36   12/36  12/36  2/36   2/36

… we could keep refining the model, e.g. by adding a feature to distinguish singular vs. plural nouns, or verb types.

Ex4: overlapping features (Klein and Manning, 2003)

Maxent models handle overlapping features.
Unlike a NB model, there is no double counting!
But they do not automatically model feature interactions.

Modeling the problem

Objective function: H(p)
Goal: among all the distributions that satisfy the constraints, choose the one, p*, that maximizes H(p):

p^* = \arg\max_{p \in P} H(p)

Question: how do we represent the constraints?

Features

A feature (a.k.a. feature function, indicator function) is a binary-valued function on events:

f_j : \varepsilon \to \{0, 1\}, \quad \varepsilon = A \times B

A: the set of possible classes (e.g., tags in POS tagging)
B: the space of contexts (e.g., neighboring words/tags in POS tagging)

Example:

f_j(a, b) = 1 if a = DET and curWord(b) = "that", 0 otherwise
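As a sketch, such an indicator can be written directly as a function over (class, context) pairs; the context representation here (a dict with a curWord field) is an assumption for illustration.

```python
def f_j(a, b):
    """Indicator feature: fires when the class is DET and the current word is 'that'."""
    return 1 if a == "DET" and b.get("curWord") == "that" else 0

# Example event: class DET in a context whose current word is "that"
print(f_j("DET", {"curWord": "that", "prevTag": "VBD"}))  # 1
```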

Some notations

S: a finite training sample of events
p̃(x): observed probability of x in S
p(x): the model p's probability of x
f_j: the jth feature

Observed expectation of f_j (the empirical count of f_j):

E_{\tilde p}[f_j] = \sum_x \tilde p(x) f_j(x)

Model expectation of f_j:

E_p[f_j] = \sum_x p(x) f_j(x)

Constraints

Model's feature expectation = observed feature expectation:

E_p[f_j] = E_{\tilde p}[f_j]

How to calculate E_{\tilde p}[f_j]? For example, with

f_j(a, b) = 1 if a = DET and curWord(b) = "that", 0 otherwise

the observed expectation is the feature's relative frequency over the N observed events in the training data:

E_{\tilde p}[f_j] = \sum_x \tilde p(x) f_j(x) = \frac{1}{N} \sum_{i=1}^{N} f_j(x_i)
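A minimal sketch of computing this empirical expectation by counting over a toy training sample; the data and the feature are made up for illustration:

```python
# Toy training sample of (class, context) events
sample = [("DET", "that"), ("DET", "the"), ("NN", "dog"), ("DET", "that")]

def f_det_that(a, b):
    return 1 if a == "DET" and b == "that" else 0

# Empirical expectation: average feature value over the N observed events
N = len(sample)
E_tilde = sum(f_det_that(a, b) for a, b in sample) / N
print(E_tilde)   # 2/4 = 0.5
```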

Restating the problem

The task: find p* s.t.

p^* = \arg\max_{p \in P} H(p)

where

P = \{\, p \mid E_p[f_j] = E_{\tilde p}[f_j], \; j \in \{1, \dots, k\} \,\}

Objective function: -H(p)
Constraints:

E_p[f_j] = E_{\tilde p}[f_j] = d_j, \quad j = 1, \dots, k
\sum_x p(x) = 1

Add a feature f_0(a, b) = 1 for all (a, b); the normalization constraint then becomes E_p[f_0] = E_{\tilde p}[f_0] = 1.

Questions

Is P empty?
Does p* exist?
Is p* unique?
What is the form of p*?
How to find p*?

What is the form of p*? (Ratnaparkhi, 1997)

P = \{\, p \mid E_p[f_j] = E_{\tilde p}[f_j], \; j \in \{1, \dots, k\} \,\}

Q = \Big\{\, p \mid p(x) = \frac{1}{Z} \prod_{j=1}^{k} \alpha_j^{f_j(x)}, \; \alpha_j > 0 \,\Big\}

Theorem: if p* ∈ P ∩ Q, then

p^* = \arg\max_{p \in P} H(p)

Furthermore, p* is unique.

Using Lagrange multipliers

Minimize A(p):

A(p) = -H(p) - \sum_{j=0}^{k} \lambda_j \big( E_p[f_j] - d_j \big)
     = \sum_x p(x) \log p(x) - \sum_{j=0}^{k} \lambda_j \Big( \sum_x p(x) f_j(x) - d_j \Big)

Set the derivative with respect to p(x) to 0:

A'(p) = 0
\log p(x) + 1 - \sum_{j=0}^{k} \lambda_j f_j(x) = 0
\log p(x) = \sum_{j=0}^{k} \lambda_j f_j(x) - 1
p(x) = \exp\Big( \sum_{j=1}^{k} \lambda_j f_j(x) \Big) \cdot e^{\lambda_0 - 1}
     = \frac{1}{Z} \exp\Big( \sum_{j=1}^{k} \lambda_j f_j(x) \Big), \quad \text{where } Z = e^{1 - \lambda_0}

(f_0(x) = 1 for every x, so \lambda_0, i.e. Z, is fixed by the normalization constraint \sum_x p(x) = 1.)

Two equivalent forms

p(x) = \frac{1}{Z} \exp\Big( \sum_{j=1}^{k} \lambda_j f_j(x) \Big)

p(x) = \frac{1}{Z} \prod_{j=1}^{k} \alpha_j^{f_j(x)}, \quad \text{where } \lambda_j = \ln \alpha_j
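A tiny sketch checking that the two forms agree on a made-up event space with made-up weights; everything here is illustrative:

```python
import numpy as np

# Feature matrix: rows = events x, columns = features f_j(x) in {0, 1}
F = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 0]])
lam = np.array([0.5, -0.3, 1.2])      # weights lambda_j
alpha = np.exp(lam)                   # alpha_j = e^{lambda_j}

# Exponential (log-linear) form
scores_exp = np.exp(F @ lam)
p_exp = scores_exp / scores_exp.sum()

# Product form with alpha_j^{f_j(x)}
scores_prod = np.prod(alpha ** F, axis=1)
p_prod = scores_prod / scores_prod.sum()

print(np.allclose(p_exp, p_prod))     # True: the two parameterizations coincide
```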

Relation to Maximum Likelihood

The log-likelihood of the empirical distribution p̃ as predicted by a model q is defined as

L(q) = \sum_x \tilde p(x) \log q(x)

Theorem: if p* ∈ P ∩ Q, then

p^* = \arg\max_{q \in Q} L(q)

Furthermore, p* is unique.

Goal: find p* in P which maximizes H(p). It can be proved that when p* exists, it is unique.
The model p* in P with maximum entropy is the model in Q that maximizes the likelihood of the training sample.

Summary (so far)

The constraint set:

P = \{\, p \mid E_p[f_j] = E_{\tilde p}[f_j], \; j = 1, \dots, k \,\}

The exponential family:

Q = \Big\{\, p \mid p(x) = \frac{1}{Z} \prod_{j=1}^{k} \alpha_j^{f_j(x)}, \; \alpha_j > 0 \,\Big\}

p* ∈ P ∩ Q is both the maximum entropy model in P and the model in Q that maximizes the likelihood of the empirical distribution p̃.

Summary (cont)

Adding constraints (features) (Klein and Manning, 2003):
lowers the maximum entropy,
raises the maximum likelihood of the data,
brings the distribution further from uniform,
and brings the distribution closer to the data.

Parameter estimation

Algorithms

Generalized Iterative Scaling (GIS): (Darroch and Ratcliff, 1972)
Improved Iterative Scaling (IIS): (Della Pietra et al., 1995)

GIS: setup

Requirements for running GIS:

Obey the form of the model and the constraints:

p(x) = \frac{1}{Z} \exp\Big( \sum_{j=1}^{k} \lambda_j f_j(x) \Big), \quad E_p[f_j] = d_j

An additional constraint: for every x,

\sum_{j=1}^{k} f_j(x) = C

If this does not hold, set C = \max_x \sum_{j=1}^{k} f_j(x) and add a new ("correction") feature f_{k+1}:

f_{k+1}(x) = C - \sum_{j=1}^{k} f_j(x)

GIS algorithm

Compute d_j, j = 1, …, k+1
Initialize \lambda_j^{(1)} (any values, e.g., 0)
Repeat until convergence:
for each j, compute the model expectation under the current parameters

E_{p^{(n)}}[f_j] = \sum_x p^{(n)}(x) f_j(x), \quad \text{where } p^{(n)}(x) = \frac{1}{Z} \exp\Big( \sum_{j=1}^{k+1} \lambda_j^{(n)} f_j(x) \Big)

and update

\lambda_j^{(n+1)} = \lambda_j^{(n)} + \frac{1}{C} \log \frac{d_j}{E_{p^{(n)}}[f_j]}
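A minimal sketch of GIS over a small, fully enumerable event space; the toy feature matrix and observed expectations are illustrative, and real taggers use the conditional approximation described on the next slide:

```python
import numpy as np

def gis(F, d, iterations=200):
    """Generalized Iterative Scaling over an enumerable event space.

    F: (num_events, num_features) binary feature matrix, rows summing to C
    d: observed feature expectations
    Returns the fitted weights lambda and the model distribution p.
    """
    C = F.sum(axis=1).max()
    lam = np.zeros(F.shape[1])
    for _ in range(iterations):
        scores = np.exp(F @ lam)
        p = scores / scores.sum()          # p(x) = exp(sum_j lam_j f_j(x)) / Z
        E_model = p @ F                    # E_p[f_j] for every feature j
        lam += np.log(d / E_model) / C     # GIS update
    return lam, p

# Toy example: 3 events, 3 features, all rows sum to C = 2
F = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 0]], dtype=float)
d = np.array([0.5, 0.7, 0.8])              # made-up observed expectations
lam, p = gis(F, d)
print(p, p @ F)                            # model expectations approach d
```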

Approximation for calculating feature expectation

E_p[f_j] = \sum_x p(x) f_j(x)
         = \sum_{a \in A, b \in B} p(a, b) f_j(a, b)
         = \sum_{a \in A, b \in B} p(b) \, p(a \mid b) \, f_j(a, b)
         \approx \sum_{a \in A, b \in B} \tilde p(b) \, p(a \mid b) \, f_j(a, b)
         = \frac{1}{N} \sum_{i=1}^{N} \sum_{a \in A} p(a \mid b_i) \, f_j(a, b_i)

That is, the marginal over contexts p(b) is replaced by the empirical distribution of contexts, so the sum runs only over the contexts b_i observed in the training data.
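A sketch of this approximation for a conditional model: model expectations are accumulated only over the observed contexts, summing over all candidate classes for each. The data structures and the feats callback are illustrative assumptions.

```python
import numpy as np

def approx_model_expectations(contexts, classes, feats, lam):
    """E_p[f_j] ~= (1/N) * sum_i sum_a p(a | b_i) f_j(a, b_i).

    contexts: list of observed contexts b_i
    classes:  list of all candidate classes a
    feats(a, b): returns the binary feature vector f(a, b) as a numpy array
    lam: current feature weights
    """
    E = np.zeros(len(lam))
    for b in contexts:
        F = np.array([feats(a, b) for a in classes])   # (num_classes, k)
        scores = np.exp(F @ lam)
        p_a_given_b = scores / scores.sum()            # conditional p(a | b)
        E += p_a_given_b @ F                           # sum_a p(a|b) f(a, b)
    return E / len(contexts)
```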

Properties of GIS

L(p^(n+1)) >= L(p^(n))
The sequence is guaranteed to converge to p*.
Convergence can be very slow.
The running time of each iteration is O(NPA):
N: the training set size
P: the number of classes
A: the average number of features that are active for a given event (a, b)

Feature selection

Throw in many features and let the machine select the weights:
manually specify feature templates.
Problem: too many features.

An alternative: a greedy algorithm:
start with an empty feature set S,
add one feature at each iteration.

Notation

With the feature set S: p_S denotes the model built from S.
After adding a feature f: p_{S ∪ {f}}.
The gain in the log-likelihood of the training data:

\Delta(S, f) = L(p_{S \cup \{f\}}) - L(p_S)

Feature selection algorithm (Berger et al., 1996)

Start with S empty; thus p_S is uniform.
Repeat until the gain is small enough:
For each candidate feature f:
compute the model p_{S ∪ {f}} using IIS,
calculate the log-likelihood gain.
Choose the feature with maximal gain, and add it to S.

Problem: too expensive.
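A schematic sketch of this greedy loop; train_model and log_likelihood stand in for the IIS training and likelihood computations and are assumptions, not library calls:

```python
def greedy_feature_selection(candidates, data, train_model, log_likelihood,
                             min_gain=1e-3):
    """Greedy forward selection in the style of Berger et al. (1996)."""
    S = []
    best_ll = log_likelihood(train_model(S, data), data)
    while candidates:
        # Score every remaining candidate by its log-likelihood gain
        gains = {}
        for f in candidates:
            model = train_model(S + [f], data)        # retrain with f added (IIS)
            gains[f] = log_likelihood(model, data) - best_ll
        f_best = max(gains, key=gains.get)
        if gains[f_best] < min_gain:                  # stop when the gain is small
            break
        S.append(f_best)
        candidates.remove(f_best)
        best_ll += gains[f_best]
    return S
```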

Approximating gains (Berger et al., 1996)

Instead of recalculating all the weights, calculate only the weight of the new feature.

Training a MaxEnt Model

Scenario #1:
Define feature templates.
Create the feature set.
Determine the optimum feature weights via GIS or IIS.

Scenario #2:
Define feature templates.
Create the candidate feature set S.
At every iteration, choose the feature from S with maximal gain and determine its weight (or choose the top-n features and their weights).

Case study

POS tagging (Ratnaparkhi, 1996)

Notation variation:
f_j(a, b): a: class, b: context
f_j(h_i, t_i): h_i: history for the ith word, t_i: tag for the ith word

History:

h_i = \{ w_i, w_{i-1}, w_{i-2}, w_{i+1}, w_{i+2}, t_{i-2}, t_{i-1} \}

Training data:
Treat it as a list of (h_i, t_i) pairs.
How many pairs are there? (One per word token in the training corpus.)

Using a MaxEnt Model

Modeling: define the probability model p(t_i | h_i).
Training:
Define feature templates.
Create the feature set.
Determine the optimum feature weights via GIS or IIS.
Decoding: search for the best tag sequence.

Modeling

P(t_1, \dots, t_n \mid w_1, \dots, w_n) = \prod_{i=1}^{n} p(t_i \mid w_1, \dots, w_n, t_1, \dots, t_{i-1}) \approx \prod_{i=1}^{n} p(t_i \mid h_i)

p(t_i \mid h_i) = \frac{p(h_i, t_i)}{\sum_{t' \in T} p(h_i, t')}
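A sketch of the second equation: the conditional tag probability is the joint score of (h, t) renormalized over all tags. The scoring function is an assumption standing in for the trained maxent model.

```python
import math

def p_tag_given_history(history, tags, joint_score):
    """p(t|h) = p(h,t) / sum_{t'} p(h,t'), with p(h,t) proportional to exp(sum_j lam_j f_j(h,t))."""
    scores = {t: math.exp(joint_score(history, t)) for t in tags}
    Z = sum(scores.values())
    return {t: s / Z for t, s in scores.items()}

# Example with a made-up scoring function
tags = ["NN", "VB", "DET"]
demo_score = lambda h, t: 1.0 if (h["word"] == "dog" and t == "NN") else 0.0
print(p_tag_given_history({"word": "dog"}, tags, demo_score))
```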

Training step 1: define feature templates

Each template conjoins a condition on the history h_i with the tag t_i:

Condition: wi is not rare
Features: wi = X & ti = T

Condition: wi is rare
Features: X is prefix of wi, |X| ≤ 4 & ti = T
X is suffix of wi, |X| ≤ 4 & ti = T
wi contains number & ti = T
wi contains uppercase character & ti = T
wi contains hyphen & ti = T

Condition: for all wi
Features: ti-1 = X & ti = T
ti-2 ti-1 = X Y & ti = T
wi-1 = X & ti = T
wi-2 = X & ti = T
wi+1 = X & ti = T
wi+2 = X & ti = T

Step 2: create the feature set

Collect all the features from the training data.
Throw away features that appear fewer than 10 times.

Step 3: determine the feature weights

Use GIS.
Training time: each iteration is O(NTA):
N: the training set size
T: the number of allowable tags
A: the average number of features that are active for a given (h, t)
How many features?

Decoding: Beam search

Generate tags for w_1, find the top N, and set s_1j accordingly, j = 1, 2, …, N
for i = 2 to n (n is the sentence length):
for j = 1 to N:
generate tags for w_i, given s_(i-1)j as the previous tag context
append each tag to s_(i-1)j to make a new sequence
find the N highest-probability sequences generated above, and set s_ij accordingly, j = 1, …, N
Return the highest-probability sequence s_n1.
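A compact sketch of this beam search; p_tag_given_history stands in for the trained maxent tagger, and the history construction is simplified relative to Ratnaparkhi's (both are assumptions):

```python
def beam_search(words, tags, p_tag_given_history, N=3):
    """Return the highest-probability tag sequence found with a beam of size N."""
    beam = [([], 1.0)]                                   # (tag sequence so far, probability)
    for i, _ in enumerate(words):
        candidates = []
        for seq, prob in beam:
            history = {"words": words, "i": i, "prev_tags": seq}
            dist = p_tag_given_history(history, tags)    # p(t | h_i) for every candidate tag
            for t, p in dist.items():
                candidates.append((seq + [t], prob * p))
        # Keep the N most probable partial sequences
        beam = sorted(candidates, key=lambda x: x[1], reverse=True)[:N]
    return beam[0][0]
```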

Beam search

Beam inference:
At each position keep the top k complete sequences.
Extend each sequence in each local way.
The extensions compete for the k slots at the next position.

Advantages:
Fast: beam sizes of 3-5 are as good or almost as good as exact inference in many cases.
Easy to implement (no dynamic programming required).

Disadvantage:
Inexact: the global best sequence can fall off the beam.

Viterbi search

Viterbi inference:
Dynamic programming or memoization.
Requires a small window of state influence (e.g., only the past two states are relevant).

Advantages:
Exact: the global best sequence is returned.

Disadvantages:
Harder to implement long-distance state-state interactions (but beam inference tends not to allow long-distance resurrection of sequences anyway).

Decoding (cont)

Tags for words:
Known words: use a tag dictionary.
Unknown words: try all possible tags.
Ex: "time flies like an arrow"
Running time: O(NTAB):
N: sentence length
B: beam size
T: tagset size
A: average number of features that are active for a given event

Experiment results

Comparison with other learners:
HMM: MaxEnt uses more context.
SDT: MaxEnt does not split the data.
TBL: MaxEnt is statistical and provides probability distributions.

MaxEnt Summary

Concept: choose the p* that maximizes entropy while satisfying all the constraints.
Maximum likelihood: p* is also the model within a model family (the exponential family Q) that maximizes the log-likelihood of the training data.
Training: GIS or IIS, which can be slow.
MaxEnt handles overlapping features well.
In general, MaxEnt achieves good performance on many NLP tasks.