
Page 1:

A Brief Maximum Entropy Tutorial
Presenter: Davidson
Date: 2009/02/04
Original Author: Adam Berger, 1996/07/05
http://www.cs.cmu.edu/afs/cs/user/aberger/www/html/tutorial/tutorial.html

Page 2:

Outline
- Overview
- Motivating example
- Maxent modeling
  - Training data
  - Features and constraints
  - The maxent principle
  - Exponential form
  - Maximum likelihood
- Skipped sections and further reading

Page 3:

Overview
- Statistical modeling
  - Models the behavior of a random process
  - Utilizes samples of output data to construct a representation of the process
  - Predicts the future behavior of the process
- Maximum Entropy Models
  - A family of distributions within the class of exponential models for statistical modeling

Page 4:

Motivating example (1/4)

[Figure: the English-to-French translator maps the word "in" to a probability distribution P(f) over 5 candidate French phrases: dans, en, à, au cours de, pendant]

An English-to-French translator translates the English word in into 5 French phrases

Goal:
1. Extract a set of facts about the decision-making process
2. Construct a model of this process

Page 5:

Motivating example (2/4)

The translator always chooses among those 5 French phrases, so P must satisfy:

$p(\text{dans}) + p(\text{en}) + p(\text{à}) + p(\text{au cours de}) + p(\text{pendant}) = 1$

The most intuitively appealing model (most uniform model subject to our knowledge) is:

$p(\text{dans}) = p(\text{en}) = p(\text{à}) = p(\text{au cours de}) = p(\text{pendant}) = \frac{1}{5}$

Page 6:

Motivating example (3/4)

If a second clue is discovered, that the translator chose either dans or en 30% of the time, then P must satisfy 2 constraints:

$p(\text{dans}) + p(\text{en}) = \frac{3}{10}$

$p(\text{dans}) + p(\text{en}) + p(\text{à}) + p(\text{au cours de}) + p(\text{pendant}) = 1$

A reasonable choice for P would then be (the most uniform one):

$p(\text{dans}) = p(\text{en}) = \frac{3}{20}$

$p(\text{à}) = p(\text{au cours de}) = p(\text{pendant}) = \frac{7}{30}$

Page 7:

Motivating example (4/4)

What if a third constraint is discovered, that the translator chose either dans or à half the time? Then P must satisfy 3 constraints:

$p(\text{dans}) + p(\text{à}) = \frac{1}{2}$

$p(\text{dans}) + p(\text{en}) = \frac{3}{10}$

$p(\text{dans}) + p(\text{en}) + p(\text{à}) + p(\text{au cours de}) + p(\text{pendant}) = 1$

The choice for the model is no longer as obvious. Two problems arise as complexity is added:
- The meaning of "uniform", and how to measure the uniformity of a model
- How to find the most uniform model subject to a set of constraints

One solution: the Maximum Entropy (maxent) model, illustrated numerically below.
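Before formalizing this, it may help to see the example numerically. The sketch below is ours, not from the tutorial; it assumes SciPy is available and uses scipy.optimize to find the most uniform (maximum entropy) distribution satisfying all three constraints.

```python
# Sketch (illustrative, not from the tutorial): numerically find the maximum
# entropy distribution over the 5 French phrases subject to the 3 constraints.
import numpy as np
from scipy.optimize import minimize

phrases = ["dans", "en", "à", "au cours de", "pendant"]

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)      # avoid log(0)
    return np.sum(p * np.log(p))    # minimizing -H(p) maximizes entropy

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},           # probabilities sum to 1
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 3.0 / 10},  # p(dans) + p(en) = 3/10
    {"type": "eq", "fun": lambda p: p[0] + p[2] - 1.0 / 2},   # p(dans) + p(à)  = 1/2
]
result = minimize(neg_entropy, np.full(5, 0.2),
                  bounds=[(0.0, 1.0)] * 5, constraints=constraints)
for phrase, prob in zip(phrases, result.x):
    print(f"p({phrase}) = {prob:.4f}")
```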

Page 8:

Maxent Modeling
- Consider a random process which produces an output value y, a member of a finite set Y.
- In generating y, the process may be influenced by some contextual information x, a member of a finite set X.
- The task is to construct a stochastic model that accurately represents the behavior of the random process.
- This model estimates the conditional probability that, given a context x, the process will output y.
- We denote by P the set of all conditional probability distributions. A model $p(y|x)$ is an element of P.

Page 9:

Training data

Training sample: $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$

The training sample's empirical probability distribution:

$\tilde{p}(x, y) \equiv \frac{1}{N} \times \text{number of times that } (x, y) \text{ occurs in the sample}$
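As a concrete illustration (the sample pairs below are invented; the context x is simplified to the word following in), the empirical distribution can be computed directly:

```python
# Sketch: p~(x, y) = (1/N) * count of (x, y) in the sample, on invented toy data.
from collections import Counter

sample = [("April", "en"), ("April", "en"), ("the", "dans"),
          ("April", "à"), ("the", "dans")]
N = len(sample)
p_tilde = {pair: count / N for pair, count in Counter(sample).items()}
print(p_tilde)  # {('April', 'en'): 0.4, ('the', 'dans'): 0.4, ('April', 'à'): 0.2}
```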

Page 10:

Features and constraints (1/4)

Use a set of statistics of the training sample to construct a statistical model of the process.

Statistics that are independent of the context, e.g. the constraints from the motivating example:

$p(\text{dans}) + p(\text{à}) = \frac{1}{2}$

$p(\text{dans}) + p(\text{en}) = \frac{3}{10}$

$p(\text{dans}) + p(\text{en}) + p(\text{à}) + p(\text{au cours de}) + p(\text{pendant}) = 1$

Statistics that depend on the conditioning information x, e.g.: in the training sample, if April is the word following in, then the translation of in is en with frequency 9/10.

Page 11:

Features and constraints (2/4)

To express the event that in translates as en when April is the following word, we can introduce the indicator function:

$f(x, y) = \begin{cases} 1 & \text{if } y = \text{en and April follows in} \\ 0 & \text{otherwise} \end{cases}$

The expected value of f with respect to the empirical distribution $\tilde{p}(x, y)$ is exactly the statistic we are interested in. This expected value is given by:

$\tilde{p}(f) \equiv \sum_{x,y} \tilde{p}(x, y) f(x, y)$

We can express any statistic of the sample as the expected value of an appropriate binary-valued indicator function f. We call such a function a feature function, or feature for short.
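A minimal sketch of this feature and its empirical expectation, again simplifying the context x to the following word and reusing invented toy probabilities:

```python
# Sketch: the indicator feature above and p~(f) = sum_{x,y} p~(x, y) f(x, y).
p_tilde = {("April", "en"): 0.4, ("April", "à"): 0.2, ("the", "dans"): 0.4}

def f(x, y):
    # fires when "in" translates as "en" and "April" is the following word
    return 1 if y == "en" and x == "April" else 0

p_tilde_f = sum(p * f(x, y) for (x, y), p in p_tilde.items())
print(p_tilde_f)  # 0.4
```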

Page 12:

Features and constraints (3/4)

The expected value of f with respect to the model $p(y|x)$ is:

$p(f) \equiv \sum_{x,y} \tilde{p}(x) p(y|x) f(x, y)$

where $\tilde{p}(x)$ is the empirical distribution of x in the training sample.

We constrain this expected value to be the same as the expected value of f in the training sample:

$p(f) = \tilde{p}(f)$

This is called a constraint equation, or simply a constraint.
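Continuing the toy sketch above, the model expectation and the constraint check might look like this (the candidate model values are invented for illustration):

```python
# Sketch: p(f) = sum_{x,y} p~(x) p(y|x) f(x, y), and the constraint p(f) = p~(f).
p_x = {"April": 0.6, "the": 0.4}                      # empirical p~(x)
p_y_given_x = {("April", "en"): 2 / 3, ("April", "à"): 1 / 3,
               ("the", "dans"): 1.0}                  # a candidate model p(y|x)

def f(x, y):
    return 1 if y == "en" and x == "April" else 0

model_f = sum(p_x[x] * pyx * f(x, y) for (x, y), pyx in p_y_given_x.items())
print(model_f)  # 0.4, equal to p~(f) above, so this model satisfies the constraint
```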

Page 13:

Features and constraints (4/4)

By restricting attention to those models for which the constraint holds, we are eliminating from consideration those models which do not agree with the training sample on how often the output of the process should exhibit the feature f.

What we have so far:
- A means of representing statistical phenomena inherent in a sample of data, namely $\tilde{p}(f)$
- A means of requiring that our model $p(y|x)$ of the process exhibit these phenomena, namely $p(f) = \tilde{p}(f)$

Combining the above three equations yields:

$\sum_{x,y} \tilde{p}(x) p(y|x) f(x, y) = \sum_{x,y} \tilde{p}(x, y) f(x, y)$

Page 14:

The maxent principle (1/2)

Suppose n feature functions $f_i$ are given. We would like our model to accord with these statistics, i.e. we would like p to lie in the subset C of P defined by

$C \equiv \left\{ p \in P \mid p(f_i) = \tilde{p}(f_i) \text{ for } i = 1, 2, \ldots, n \right\}$

Among the models $p \in C$, we would like to select the distribution which is most uniform. But what does "uniform" mean?

A mathematical measure of the uniformity of a conditional distribution $p(y|x)$ is provided by the conditional entropy:

$H(p) \equiv -\sum_{x,y} \tilde{p}(x) p(y|x) \log p(y|x)$
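A short sketch of this conditional entropy, computed for the toy model above (values invented for illustration):

```python
# Sketch: H(p) = -sum_{x,y} p~(x) p(y|x) log p(y|x), in nats.
import math

p_x = {"April": 0.6, "the": 0.4}
p_y_given_x = {"April": {"en": 2 / 3, "à": 1 / 3}, "the": {"dans": 1.0}}

H = -sum(p_x[x] * pyx * math.log(pyx)
         for x, dist in p_y_given_x.items()
         for pyx in dist.values() if pyx > 0)
print(H)  # ≈ 0.382: all uncertainty comes from the "April" context
```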

Page 15:

The maxent principle (2/2)

$H(p) \equiv -\sum_{x,y} \tilde{p}(x) p(y|x) \log p(y|x)$

- The entropy is bounded from below by zero: the entropy of a model with no uncertainty at all.
- The entropy is bounded from above by $\log |Y|$: the entropy of the uniform distribution over all $|Y|$ possible values of y.

The principle of maximum entropy: to select a model from a set $C$ of allowed probability distributions, choose the model $p^* \in C$ with maximum entropy $H(p)$:

$p^* = \arg\max_{p \in C} H(p)$

$p^*$ is always well-defined; that is, there is always a unique model $p^*$ with maximum entropy in any constrained set $C$.

Page 16:

Exponential form (1/3)

The method of Lagrange multipliers is applied to impose the constraints on the optimization. The constrained optimization problem is to find

$p^* = \arg\max_{p \in C} H(p) = \arg\max_{p \in C} \left( -\sum_{x,y} \tilde{p}(x) p(y|x) \log p(y|x) \right)$

Maximize $H(p)$ subject to the following constraints:

1. $p(y|x) \geq 0$ for all x, y
2. $\sum_y p(y|x) = 1$ for all x (together with 1, this guarantees that p is a conditional probability distribution)
3. $\sum_{x,y} \tilde{p}(x) p(y|x) f_i(x, y) = \sum_{x,y} \tilde{p}(x, y) f_i(x, y)$ for $i \in \{1, 2, \ldots, n\}$ (in other words, $p \in C$)

Page 17:

Exponential form (2/3)

When the Lagrange multipliers are introduced, the objective function becomes:

$\Lambda(p, \lambda, \gamma) \equiv -\sum_{x,y} \tilde{p}(x) p(y|x) \log p(y|x) + \sum_i \lambda_i \left( \sum_{x,y} \tilde{p}(x, y) f_i(x, y) - \sum_{x,y} \tilde{p}(x) p(y|x) f_i(x, y) \right) + \gamma \left( \sum_y p(y|x) - 1 \right)$

The real-valued parameters $\gamma$ and $\lambda = \{\lambda_1, \lambda_2, \ldots, \lambda_n\}$ correspond to the 1 + n constraints imposed on the solution.

Solve for $p$, $\lambda$, $\gamma$ by using the EM algorithm; see http://www.cs.cmu.edu/afs/cs/user/aberger/www/html/tutorial/node7.html
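The slide defers the actual fitting procedure to the linked notes. As a generic stand-in (not the tutorial's algorithm), plain gradient ascent on the log-likelihood also works for small problems, since the gradient with respect to each $\lambda_i$ is $\tilde{p}(f_i) - p_\lambda(f_i)$; the function below is our own sketch:

```python
# Sketch: fit lambda by gradient ascent on the log-likelihood (a generic
# alternative to the algorithm referenced above; all names are our own).
import numpy as np

def maxent_train(pairs, X, Y, features, steps=1000, lr=0.5):
    """pairs: training (x, y) tuples; features: list of functions f_i(x, y)."""
    N = len(pairs)
    lam = np.zeros(len(features))
    # empirical feature expectations p~(f_i) and context distribution p~(x)
    emp = np.array([sum(f(x, y) for x, y in pairs) / N for f in features])
    px = {x: sum(1 for cx, _ in pairs if cx == x) / N for x in X}
    F = {(x, y): np.array([f(x, y) for f in features], float)
         for x in X for y in Y}
    for _ in range(steps):
        model = np.zeros(len(features))
        for x in X:
            scores = np.array([lam @ F[(x, y)] for y in Y])
            probs = np.exp(scores - scores.max())  # numerically stable softmax
            probs /= probs.sum()
            for y, pyx in zip(Y, probs):
                model += px[x] * pyx * F[(x, y)]
        lam += lr * (emp - model)  # gradient of the log-likelihood w.r.t. lambda
    return lam
```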

Page 18:

Exponential form (3/3)

The final result: the maximum entropy model $p^*$ subject to the constraints $C$ has the parametric form

$p^*(y|x) = \frac{1}{Z_\lambda(x)} \exp\left( \sum_i \lambda_i f_i(x, y) \right)$

where the normalizing constant is

$Z_\lambda(x) = \sum_y \exp\left( \sum_i \lambda_i f_i(x, y) \right)$

and $\lambda^*$ can be determined by maximizing the dual function

$\Psi(\lambda) = \Lambda(p^*, \lambda, \gamma^*)$
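A minimal sketch of evaluating this parametric form, given a feature list and fitted weights (function and parameter names are our own):

```python
# Sketch: p*(y|x) = exp(sum_i lam_i f_i(x, y)) / Z_lambda(x).
import math

def p_star(y, x, Y, features, lam):
    score = lambda cand: math.exp(sum(l * f(x, cand)
                                      for l, f in zip(lam, features)))
    Z = sum(score(cand) for cand in Y)  # normalizing constant Z_lambda(x)
    return score(y) / Z
```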

Page 19:

Maximum likelihood

The log-likelihood $L_{\tilde{p}}(p)$ of the empirical distribution $\tilde{p}$ as predicted by a model $p$ is defined by:

$L_{\tilde{p}}(p) \equiv \log \prod_{x,y} p(y|x)^{\tilde{p}(x,y)} = \sum_{x,y} \tilde{p}(x, y) \log p(y|x)$

The dual function $\Psi(\lambda)$ of the previous section is just the log-likelihood for the exponential model $p_\lambda$; that is:

$\Psi(\lambda) = L_{\tilde{p}}(p_\lambda)$

The result from the previous section can be rephrased as: the model $p^* \in C$ with maximum entropy is the model in the parametric family $p_\lambda(y|x)$ that maximizes the likelihood of the training sample $\tilde{p}$.
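For completeness, a sketch of the log-likelihood computation matching the definition above (plain dicts, our own names):

```python
# Sketch: L_p~(p) = sum_{x,y} p~(x, y) log p(y|x).
import math

def log_likelihood(p_tilde, model):
    """p_tilde: dict (x, y) -> empirical prob; model: dict (x, y) -> p(y|x)."""
    return sum(pt * math.log(model[(x, y)]) for (x, y), pt in p_tilde.items())
```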

Page 20:

Skipped sections
- Computing the parameters:
  http://www.cs.cmu.edu/afs/cs/user/aberger/www/html/tutorial/node10.html#SECTION00030000000000000000
- Algorithms for inductive learning:
  http://www.cs.cmu.edu/afs/cs/user/aberger/www/html/tutorial/node11.html#SECTION00040000000000000000
- Further readings:
  http://www.cs.cmu.edu/afs/cs/user/aberger/www/html/tutorial/node14.html#SECTION00050000000000000000