A brief maximum entropy tutorial

Page 1

A brief maximum entropy tutorial

Page 2

Overview

• Statistical modeling addresses the problem of modeling the behavior of a random process

• In constructing this model, we typically have at our disposal a sample of output from the process. From the sample, which constitutes an incomplete state of knowledge about the process, the modeling problem is to parlay this knowledge into a succinct, accurate representation of the process

• We can then use this representation to make predictions of the future behavior of the process

Page 3

Motivating example

• Suppose we wish to model an expert translator’s decisions concerning the proper French rendering of the English word in.

• A model p of the expert’s decisions assigns to each French word or phrase f an estimate, p(f), of the probability that the expert would choose f as a translation of in.

• To develop p, we collect a large sample of instances of the expert’s decisions

Page 4

Motivating example

• Our goal is to:

– Extract a set of facts about the decision-making process from the sample (the first task of modeling)

– Construct a model of this process (the second task)

Page 5

Motivating example

• One obvious clue we might glean from the sample is the list of allowed translations

– in → {dans, en, à, au cours de, pendant}

• With this information in hand, we can impose our first constraint on our model p:

p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1

• This equation represents our first statistic of the process; we can now proceed to search for a suitable model which obeys this equation

– There are an infinite number of models p for which this identity holds

Page 6

Motivating example

• One model which satisfies the above equation is p(dans)=1 ; in other words, the model always predicts dans.

• Another model which obeys this constraint predicts pendant with a probability of ½, and à with a probability of ½.

• But both of these models offend our sensibilities: knowing only that the expert always chose from among these five French phrases, how can we justify either of these probability distributions?

Page 7

Motivating example

• Knowing only that the expert chose exclusively from among these five French phrases, the most intuitively appealing model is

p(dans) = 1/5
p(en) = 1/5
p(à) = 1/5
p(au cours de) = 1/5
p(pendant) = 1/5

This model, which allocates the total probability evenly among the five possible phrases, is the most uniform model subject to our knowledge

It is not, however, the most uniform overall; that model would grant an equal probability to every possible French phrase.

Page 8

Motivating example

• We might hope to glean more clues about the expert’s decisions from our sample.

• Suppose we notice that the expert chose either dans or en 30% of the time.

• Once again there are many probability distributions consistent with these two constraints.

• In the absence of any other knowledge, a reasonable choice for p is again the most uniform – that is, the distribution which allocates its probability as evenly as possible, subject to the constraints:

p(dans) + p(en) = 3/10
p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1

p(dans) = 3/20
p(en) = 3/20
p(à) = 7/30
p(au cours de) = 7/30
p(pendant) = 7/30

Page 9

Motivating example

• Say we inspect the data once more, and this time notice another interesting fact: in half the cases, the expert chose either dans or à. We can incorporate this information into our model as a third constraint:

p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
p(dans) + p(en) = 3/10
p(dans) + p(à) = 1/2

• We can once again look for the most uniform p satisfying these constraints, but now the choice is not as obvious.
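As an aside (not part of the tutorial), the most uniform model satisfying all three constraints can be found numerically; the sketch below maximizes the entropy directly with scipy.optimize.minimize, and the variable ordering, starting point, and tolerance choices are illustrative assumptions.

```python
# Minimal sketch (assumed setup): find the most uniform p over the five phrases
# subject to the three constraints by minimizing the negative entropy.
import numpy as np
from scipy.optimize import minimize

words = ["dans", "en", "à", "au cours de", "pendant"]

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)          # avoid log(0)
    return np.sum(p * np.log(p))

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},           # probabilities sum to 1
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 3.0 / 10},  # p(dans) + p(en) = 3/10
    {"type": "eq", "fun": lambda p: p[0] + p[2] - 1.0 / 2},   # p(dans) + p(à) = 1/2
]
result = minimize(neg_entropy, x0=np.full(5, 0.2),
                  bounds=[(0.0, 1.0)] * 5, constraints=constraints)
print(dict(zip(words, np.round(result.x, 4))))
```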

Page 10

Motivating example

• As we have added complexity, we have encountered two problems:

– First, what exactly is meant by “uniform,” and how can one measure the uniformity of a model?

– Second, having determined a suitable answer to these questions, how does one find the most uniform model subject to a set of constraints like those we have described?

Page 11

Motivating example

• The maximum entropy method answers both these questions.

• Intuitively, the principle is simple:

– model all that is known and assume nothing about that which is unknown

– In other words, given a collection of facts, choose a model which is consistent with all the facts, but otherwise as uniform as possible.

• This is precisely the approach we took in selecting our model p at each step in the above example

Page 12

Maxent Modeling

• Consider a random process which produces an output value y, a member of a finite set Y.

– y may be any word in the set {dans, en, à, au cours de, pendant}

• In generating y, the process may be influenced by some contextual information x, a member of a finite set X.

– x could include the words in the English sentence surrounding in

• The goal is to construct a stochastic model that accurately represents the behavior of the random process

– i.e., to estimate the conditional probability that, given a context x, the process will output y

Page 13

Training data

• Collect a large number of samples (x1, y1), (x2, y2),…, (xN, yN)

– Each sample would consist of a phrase x containing the words surrounding in, together with the translation y of in which the process produced

• Typically, a particular pair (x, y) will either not occur at all in the sample, or will occur at most a few times

– smoothing is one way to handle this sparsity

\tilde{p}(x, y) \equiv \frac{1}{N} \times \text{number of times that } (x, y) \text{ occurs in the sample}
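A minimal sketch of this counting step (the sample data and function name below are illustrative assumptions, not from the tutorial):

```python
# Estimate the empirical distribution p~(x, y) from (context, translation) pairs.
from collections import Counter

def empirical_distribution(samples):
    """Return p~(x, y) = count(x, y) / N as a dict keyed by (x, y)."""
    counts = Counter(samples)
    n = len(samples)
    return {pair: c / n for pair, c in counts.items()}

samples = [("in April", "en"), ("in April", "en"), ("in the box", "dans")]
print(empirical_distribution(samples))
```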

Page 14

Features and constraints

• The goal is to construct a statistical model of the process which generated the training sample

• The building blocks of this model will be a set of statistics of the training sample

– The frequency that in translated to either dans or en was 3/10

– The frequency that in translated to either dans or au cours de was 1/2

– And so on

– These are statistics of the training sample \tilde{p}(x, y)

Page 15

Features and constraints

• Conditioning information x

– E.g., in the training sample, if April is the word following in, then the translation of in is en with frequency 9/10

• Indicator function

f(x, y) = \begin{cases} 1 & \text{if } y = en \text{ and April follows in} \\ 0 & \text{otherwise} \end{cases}

• Expected value of f with respect to \tilde{p}(x, y)

\tilde{p}(f) \equiv \sum_{x,y} \tilde{p}(x, y) \, f(x, y)    (1)
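A small sketch of the feature above and the empirical expectation of equation (1); the toy p̃ dictionary and the substring test for "April follows in" are assumptions for illustration:

```python
# f(x, y) = 1 if y is "en" and "April" follows "in" in the context x, else 0,
# and p~(f) = sum_{x,y} p~(x, y) f(x, y) as in equation (1).
def f_april_en(x, y):
    return 1.0 if y == "en" and "in April" in x else 0.0   # rough check of the condition

def empirical_expectation(p_xy, f):
    return sum(prob * f(x, y) for (x, y), prob in p_xy.items())

p_xy = {("in April", "en"): 0.09, ("in April", "dans"): 0.01, ("in the box", "dans"): 0.90}
print(empirical_expectation(p_xy, f_april_en))   # 0.09
```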

Page 16

Features and constraints

• We can express any statistic of the sample as the expected value of an appropriate binary-valued indicator function f

– We call such a function a feature function or feature for short

Page 17

Features and constraints

• When we discover a statistic that we feel is useful, we can acknowledge its importance by requiring that our model accord with it

• We do this by constraining the expected value that the model assigns to the corresponding feature function f

• The expected value of f with respect to the model p(y | x) is

p(f) \equiv \sum_{x,y} \tilde{p}(x) \, p(y \mid x) \, f(x, y)    (2)

where \tilde{p}(x) is the empirical distribution of x in the training sample.
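A sketch of the model expectation in equation (2), with p̃(x) and p(y | x) represented as plain dictionaries (an illustrative choice, not the tutorial's):

```python
# p(f) = sum_{x,y} p~(x) p(y|x) f(x, y), equation (2).
def model_expectation(p_x, p_y_given_x, f):
    return sum(px * pyx * f(x, y)
               for x, px in p_x.items()
               for y, pyx in p_y_given_x[x].items())

def f_april_en(x, y):
    return 1.0 if y == "en" and "in April" in x else 0.0

p_x = {"in April": 0.4, "in the box": 0.6}                   # p~(x), toy values
p_y_given_x = {"in April": {"en": 0.9, "dans": 0.1},
               "in the box": {"en": 0.05, "dans": 0.95}}     # p(y|x), toy values
print(model_expectation(p_x, p_y_given_x, f_april_en))       # 0.4 * 0.9 = 0.36
```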

Page 18

Features and constraints

• We constrain this expected value to be the same as the expected value of f in the training sample. That is, we require

p(f) = \tilde{p}(f)    (3)

– We call the requirement (3) a constraint equation or simply a constraint

• Combining (1), (2) and (3) yields

\sum_{x,y} \tilde{p}(x) \, p(y \mid x) \, f(x, y) = \sum_{x,y} \tilde{p}(x, y) \, f(x, y)

Page 19

Features and constraints

• To sum up so far, we now have

– A means of representing statistical phenomena inherent in a sample of data (namely, \tilde{p}(f))

– A means of requiring that our model of the process exhibit these phenomena (namely, p(f) = \tilde{p}(f))

• Feature:

– a binary-valued function of (x, y)

• Constraint:

– an equation between the expected value of the feature function in the model and its expected value in the training data

Page 20

The maxent principle

• Suppose that we are given n feature functions fi, which determine statistics we feel are important in modeling the process. We would like our model to accord with these statistics

• That is, we would like p to lie in the subset C of P defined by

\mathcal{C} \equiv \left\{ \, p \in \mathcal{P} \;\middle|\; p(f_i) = \tilde{p}(f_i) \text{ for } i \in \{1, 2, \ldots, n\} \, \right\}    (4)
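For illustration (using the same dictionary representations as the sketches above, which are assumptions rather than the tutorial's notation), membership in C amounts to checking that every feature's model expectation matches its empirical expectation:

```python
# Check whether a candidate model p(y|x) lies in C of equation (4), up to a tolerance.
import math

def in_constraint_set(p_x, p_y_given_x, p_xy, features, tol=1e-9):
    for f in features:
        model_exp = sum(px * pyx * f(x, y)
                        for x, px in p_x.items()
                        for y, pyx in p_y_given_x[x].items())
        empirical_exp = sum(prob * f(x, y) for (x, y), prob in p_xy.items())
        if not math.isclose(model_exp, empirical_exp, abs_tol=tol):
            return False
    return True
```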

Page 21

[Figure 1: four panels (a)–(d) showing the space P of probability models, with linear constraints C1 and C2]

• If we impose no constraints, then all probability models are allowable

• Imposing one linear constraint C1 restricts us to those p ∈ P which lie in the region defined by C1

• A second linear constraint could determine p exactly, if the two constraints are satisfiable; that is, if the intersection of C1 and C2 is non-empty, then p ∈ C1 ∩ C2

• Alternatively, a second linear constraint could be inconsistent with the first (i.e., C1 ∩ C2 = ∅); no p ∈ P can satisfy them both

Page 22

The maxent principle

• In the present setting, however, the linear constraints are extracted from the training sample and cannot, by construction, be inconsistent

• Furthermore, the linear constraints in our applications will not even come close to determining p ∈ P uniquely as they do in (c); instead, the set C = C1 ∩ C2 ∩ … ∩ Cn of allowable models will be infinite

Page 23

The maxent principle

• Among the models p ∈ C, the maximum entropy philosophy dictates that we select the distribution which is most uniform

• A mathematical measure of the uniformity of a conditional distribution p(y|x) is provided by the conditional entropy

H(p) \equiv -\sum_{x,y} \tilde{p}(x) \, p(y \mid x) \, \log p(y \mid x)    (5)
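A direct transcription of (5) as code, with the dictionary formats again assumed for illustration:

```python
# H(p) = -sum_{x,y} p~(x) p(y|x) log p(y|x), equation (5).
import math

def conditional_entropy(p_x, p_y_given_x):
    return -sum(px * pyx * math.log(pyx)
                for x, px in p_x.items()
                for pyx in p_y_given_x[x].values()
                if pyx > 0)
```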

Page 24

The maxent principle

• The principle of maximum entropy:

– To select a model from a set C of allowed probability distributions, choose the model p★ ∈ C with maximum entropy H(p):

p^\star = \arg\max_{p \in \mathcal{C}} H(p)    (6)

Page 25

Exponential form

• The maximum entropy principle presents us with a problem in constrained optimization: find the p★ ∈ C which maximizes H(p)

• Find

p^\star = \arg\max_{p \in \mathcal{C}} H(p) = \arg\max_{p \in \mathcal{C}} \left( -\sum_{x,y} \tilde{p}(x) \, p(y \mid x) \, \log p(y \mid x) \right)    (7)

Page 26

Exponential form

• We refer to this as the primal problem; it is a succinct way of saying that we seek to maximize H(p) subject to the following constraints:

– 1. p(y \mid x) \geq 0 for all x, y.

– 2. \sum_y p(y \mid x) = 1 for all x.

• This and the previous condition guarantee that p is a conditional probability distribution

– 3. \sum_{x,y} \tilde{p}(x) \, p(y \mid x) \, f_i(x, y) = \sum_{x,y} \tilde{p}(x, y) \, f_i(x, y) for i \in \{1, 2, \ldots, n\}.

• In other words, p \in \mathcal{C}, and so satisfies the active constraints \mathcal{C}

Page 27

Exponential form

• To solve this constrained optimization problem (equivalently, the minimization of -H(p) subject to the same constraints), introduce the Lagrangian

\xi(p, \Lambda, \gamma) \equiv \sum_{x,y} \tilde{p}(x) \, p(y \mid x) \log p(y \mid x) + \sum_i \lambda_i \left( \sum_{x,y} \tilde{p}(x, y) f_i(x, y) - \sum_{x,y} \tilde{p}(x) \, p(y \mid x) f_i(x, y) \right) + \gamma \left( \sum_y p(y \mid x) - 1 \right)    (8)

Page 28

Exponential form

\frac{\partial \xi}{\partial p(y \mid x)} = \tilde{p}(x) \left( 1 + \log p(y \mid x) \right) - \tilde{p}(x) \sum_i \lambda_i f_i(x, y) + \gamma    (9)

Setting (9) to zero and solving for p(y \mid x):

\tilde{p}(x) \left( 1 + \log p(y \mid x) \right) = \tilde{p}(x) \sum_i \lambda_i f_i(x, y) - \gamma

\log p(y \mid x) = \sum_i \lambda_i f_i(x, y) - \frac{\gamma}{\tilde{p}(x)} - 1

p(y \mid x) = \exp\left( \sum_i \lambda_i f_i(x, y) \right) \exp\left( -\frac{\gamma}{\tilde{p}(x)} - 1 \right)    (10)

Page 29

Exponential form

• We have thus found the parametric form of p★, and so we now take up the task of solving for the optimal values Λ★, γ★.

• Recognizing that the second factor in this equation is the factor corresponding to the second of the constraints listed above, we can rewrite (10) as

p^\star(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_i \lambda_i f_i(x, y) \right)    (11)

where Z(x), the normalizing factor, is given by

Z(x) = \sum_y \exp\left( \sum_i \lambda_i f_i(x, y) \right)    (12)
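A minimal sketch of the parametric form (11)–(12); the single feature and its weight below are illustrative assumptions:

```python
# p(y|x) = exp(sum_i lambda_i f_i(x, y)) / Z(x), with Z(x) as in (12).
import math

def maxent_conditional(x, ys, features, lambdas):
    scores = {y: math.exp(sum(l * f(x, y) for l, f in zip(lambdas, features))) for y in ys}
    z = sum(scores.values())                       # Z(x), equation (12)
    return {y: s / z for y, s in scores.items()}

features = [lambda x, y: 1.0 if y == "en" and "April" in x else 0.0]
print(maxent_conditional("in April", ["dans", "en", "à", "au cours de", "pendant"],
                         features, [1.5]))
```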

Page 30

Proof of (12): apply the second constraint, \sum_y p(y \mid x) = 1, to the form (10):

\sum_y \exp\left( \sum_i \lambda_i f_i(x, y) \right) \exp\left( -\frac{\gamma^\star}{\tilde{p}(x)} - 1 \right) = 1

\exp\left( -\frac{\gamma^\star}{\tilde{p}(x)} - 1 \right) = \frac{1}{\sum_y \exp\left( \sum_i \lambda_i f_i(x, y) \right)}

so the normalizing factor is

Z(x) \equiv \exp\left( \frac{\gamma^\star}{\tilde{p}(x)} + 1 \right) = \sum_y \exp\left( \sum_i \lambda_i f_i(x, y) \right)

and p^\star(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_i \lambda_i f_i(x, y) \right), as stated in (11) and (12).

Page 31

Exponential form

• We have found γ★ but not yet Λ★. Towards this end we introduce some further notation. Define the dual function Ψ(Λ) as

\Psi(\Lambda) \equiv \xi(p^\star, \Lambda, \gamma^\star)    (13)

and the dual optimization problem as

Find \; \Lambda^\star = \arg\max_{\Lambda} \Psi(\Lambda)    (14)

• Since p★ and γ★ are fixed, the right-hand side of (14) has only the free variables Λ = {λ1, λ2, …, λn}.

Page 32

Exponential form

• Final result

– The maximum entropy model subject to the constraints C has the parametric form p★ of (11), where Λ★ can be determined by maximizing the dual function Ψ(Λ)

Page 33

Maximum likelihood

• The log-likelihood L_{\tilde{p}}(p) of the empirical distribution \tilde{p} as predicted by a model p is defined by

L_{\tilde{p}}(p) \equiv \log \prod_{x,y} p(y \mid x)^{\tilde{p}(x, y)} = \sum_{x,y} \tilde{p}(x, y) \log p(y \mid x)    (15)

• It is easy to check that the dual function Ψ(Λ) of the previous section is, in fact, just the log-likelihood for the exponential model p^\star; that is

\Psi(\Lambda) = L_{\tilde{p}}(p^\star)    (16)

• With this interpretation, the result of the previous section can be rephrased as:

– The model with maximum entropy is the model in the parametric family p^\star(y \mid x) of (11) that maximizes the likelihood of the training sample \tilde{p}.
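Equation (15) translates directly into code (dictionary inputs assumed, as in the earlier sketches):

```python
# L_p~(p) = sum_{x,y} p~(x, y) log p(y|x), equation (15).
import math

def log_likelihood(p_xy, p_y_given_x):
    return sum(prob * math.log(p_y_given_x[x][y]) for (x, y), prob in p_xy.items())
```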

Page 34

Maximum likelihood

• Since (16) concerns \Psi(\Lambda) \equiv \xi(p^\star, \Lambda, \gamma^\star), we can check it by substituting p^\star and \gamma^\star into (8):

\xi(p, \Lambda, \gamma) = \sum_{x,y} \tilde{p}(x) \, p(y \mid x) \log p(y \mid x) + \sum_i \lambda_i \left( \sum_{x,y} \tilde{p}(x, y) f_i(x, y) - \sum_{x,y} \tilde{p}(x) \, p(y \mid x) f_i(x, y) \right) + \gamma \left( \sum_y p(y \mid x) - 1 \right)

• At p = p^\star and \gamma = \gamma^\star the normalization term vanishes, and using \log p^\star(y \mid x) = \sum_i \lambda_i f_i(x, y) - \log Z(x) the remaining terms collapse to

\xi(p^\star, \Lambda, \gamma^\star) = \sum_{x,y} \tilde{p}(x) \, \tilde{p}(y \mid x) \log p^\star(y \mid x)

Page 35

Continuing, with \tilde{p}(x) \, \tilde{p}(y \mid x) = \tilde{p}(x, y) and p^\star(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_i \lambda_i f_i(x, y) \right):

\Psi(\Lambda) = \xi(p^\star, \Lambda, \gamma^\star) = \sum_{x,y} \tilde{p}(x) \, \tilde{p}(y \mid x) \log p^\star(y \mid x)

= \sum_{x,y} \tilde{p}(x) \, \tilde{p}(y \mid x) \left( \sum_i \lambda_i f_i(x, y) - \log Z(x) \right)

= \sum_i \lambda_i \sum_{x,y} \tilde{p}(x, y) f_i(x, y) - \sum_x \tilde{p}(x) \log Z(x)

= -\sum_x \tilde{p}(x) \log Z(x) + \sum_i \lambda_i \tilde{p}(f_i)

which, by (15), is exactly the log-likelihood L_{\tilde{p}}(p^\star) of the exponential model (11); this establishes (16).
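As a quick numerical sanity check (all data below are toy assumptions, not from the tutorial), the identity just derived, that the dual value equals the log-likelihood, can be verified directly:

```python
# Verify psi(Lambda) = sum_i lambda_i p~(f_i) - sum_x p~(x) log Z(x) = L_p~(p*).
import math

p_xy = {("in April", "en"): 0.4, ("in April", "dans"): 0.1,
        ("in the box", "dans"): 0.3, ("in the box", "en"): 0.2}
ys = ["en", "dans"]
features = [lambda x, y: 1.0 if y == "en" and "April" in x else 0.0]
lambdas = [0.7]

p_x = {}
for (x, _), prob in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + prob

def z(x):
    return sum(math.exp(sum(l * f(x, y) for l, f in zip(lambdas, features))) for y in ys)

def p_model(x, y):
    return math.exp(sum(l * f(x, y) for l, f in zip(lambdas, features))) / z(x)

dual = (sum(l * sum(prob * f(x, y) for (x, y), prob in p_xy.items())
            for l, f in zip(lambdas, features))
        - sum(px * math.log(z(x)) for x, px in p_x.items()))
loglik = sum(prob * math.log(p_model(x, y)) for (x, y), prob in p_xy.items())
print(math.isclose(dual, loglik))   # True
```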

Page 36

Outline (Maxent Modeling summary)

• We began by seeking the conditional distribution p(y|x) which had maximal entropy H(p) subject to a set of linear constraints (7)

• Following the traditional procedure in constrained optimization, we introduced the Lagrangian \xi(p, \Lambda, \gamma), where \Lambda, \gamma are a set of Lagrange multipliers for the constraints we imposed on p(y \mid x)

• To find the solution to the optimization problem, we appealed to the Kuhn-Tucker theorem, which states that we can (1) first solve \xi(p, \Lambda, \gamma) for p to get a parametric form for p^\star in terms of \Lambda, \gamma; (2) then plug p^\star back in to \xi(p, \Lambda, \gamma), this time solving for \Lambda^\star, \gamma^\star.

Page 37

Outline (Maxent Modeling summary)

• The parametric form for p★ turns out to have the exponential form (11)

• The γ★ gives rise to the normalizing factor Z(x), given in (12)

• The Λ★ will be solved for numerically using the dual function (14). Furthermore, it so happens that this function, Ψ(Λ), is the log-likelihood for the exponential model p★ of (11). So what started as the maximization of entropy subject to a set of linear constraints turns out to be equivalent to the unconstrained maximization of likelihood of a certain parametric family of distributions.

Page 38

Outline (Maxent Modeling summary)

• Table 1 summarizes the primal-dual framework

Table 1:

                        Primal                        Dual
problem description     argmax_{p ∈ C} H(p)           argmax_Λ Ψ(Λ)
                        maximum entropy               maximum likelihood
type of search          constrained optimization      unconstrained optimization
search domain           p ∈ C                         real-valued vectors {λ1, λ2, …}
solution                p★                            Λ★

Kuhn-Tucker theorem: p★ = the exponential model (11) evaluated at Λ★ (the primal and dual solutions coincide)

Page 39

Computing the parameters

Algorithm 1: Improved Iterative Scaling

Input: feature functions f_1, f_2, \ldots, f_n; empirical distribution \tilde{p}(x, y)
Output: optimal parameter values \lambda_i^\star; optimal model p^\star

1. Start with \lambda_i = 0 for all i \in \{1, 2, \ldots, n\}
2. Do for each i \in \{1, 2, \ldots, n\}:
   a. Let \Delta\lambda_i be the solution to
      \sum_{x,y} \tilde{p}(x) \, p(y \mid x) \, f_i(x, y) \exp\left( \Delta\lambda_i \, f^{\#}(x, y) \right) = \tilde{p}(f_i)    (18)
      where f^{\#}(x, y) \equiv \sum_{i=1}^{n} f_i(x, y)    (19)
   b. Update the value of \lambda_i according to: \lambda_i \leftarrow \lambda_i + \Delta\lambda_i
3. Go to step 2 if not all the \lambda_i have converged
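A compact sketch of Algorithm 1 under simplifying assumptions (binary features, small finite X and Y, and the one-dimensional equation (18) solved by bisection); the function names, toy data, and fixed number of sweeps are illustrative choices, not the tutorial's reference implementation:

```python
# Improved Iterative Scaling sketch: repeatedly solve (18) for each feature and
# update lambda_i, recomputing the model p(y|x) of (11)-(12) between sweeps.
import math
from collections import Counter

def iis(samples, features, xs, ys, sweeps=50):
    n = len(features)
    lam = [0.0] * n                                                    # step 1
    N = len(samples)
    p_xy = {pair: c / N for pair, c in Counter(samples).items()}       # p~(x, y)
    p_x = Counter()
    for (x, _), prob in p_xy.items():                                  # p~(x)
        p_x[x] += prob
    p_tilde_f = [sum(prob * f(x, y) for (x, y), prob in p_xy.items())  # p~(f_i)
                 for f in features]

    def f_sharp(x, y):                                                 # f#(x, y), eq. (19)
        return sum(f(x, y) for f in features)

    def model(lam):                                                    # p(y|x), eqs. (11)-(12)
        p = {}
        for x in xs:
            scores = {y: math.exp(sum(l * f(x, y) for l, f in zip(lam, features)))
                      for y in ys}
            z = sum(scores.values())
            for y in ys:
                p[(x, y)] = scores[y] / z
        return p

    for _ in range(sweeps):                                            # steps 2-3
        p = model(lam)
        for i, f in enumerate(features):
            def g(delta):                                              # residual of eq. (18)
                return sum(p_x[x] * p[(x, y)] * f(x, y) * math.exp(delta * f_sharp(x, y))
                           for x in xs for y in ys) - p_tilde_f[i]
            lo, hi = -20.0, 20.0                                       # bisection; g is increasing
            for _ in range(80):
                mid = (lo + hi) / 2.0
                lo, hi = (mid, hi) if g(mid) < 0 else (lo, mid)
            lam[i] += (lo + hi) / 2.0                                  # step 2b
    return lam

# Toy usage: one feature, firing when the context contains "April" and y is "en".
samples = ([("in April", "en")] * 9 + [("in April", "dans")]
           + [("in the box", "dans")] * 10)
features = [lambda x, y: 1.0 if y == "en" and "April" in x else 0.0]
print(iis(samples, features, xs=["in April", "in the box"], ys=["en", "dans"]))
```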