A Brief Maximum Entropy Tutorial
Presenter: Davidson
Date: 2009/02/04
Original Author: Adam Berger, 1996/07/05
http://www.cs.cmu.edu/afs/cs/user/aberger/www/html/tutorial/tutorial.html
Outline
- Overview
- Motivating example
- Maxent modeling
  - Training data
  - Features and constraints
  - The maxent principle
  - Exponential form
  - Maximum likelihood
- Skipped sections and further reading
Overview
- Statistical modeling
  - Models the behavior of a random process
  - Uses samples of output data to construct a representation of the process
  - Predicts the future behavior of the process
- Maximum entropy models
  - A family of distributions, within the class of exponential models, for statistical modeling
Motivating example (1/4)
An English-to-French translator translates the English word "in" into one of 5 French phrases:
- dans
- en
- à
- au cours de
- pendant
The task is to model the translator's decision as a probability distribution P over these 5 phrases.
Goal:
1. Extract a set of facts about the decision-making process
2. Construct a model of this process
Motivating example (2/4)
The translator always chooses among those 5 French phrases, so:

p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1

The most intuitively appealing model (the most uniform model subject to our knowledge) is:

p(dans) = p(en) = p(à) = p(au cours de) = p(pendant) = 1/5
Motivating example (3/4)
Suppose a second clue is discovered: the translator chose either dans or en 30% of the time. P must then satisfy 2 constraints:

p(dans) + p(en) = 3/10
p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1

A reasonable choice for P would be the most uniform one:

p(dans) = p(en) = 3/20
p(à) = p(au cours de) = p(pendant) = 7/30
Motivating example (4/4)
What if a third clue is discovered, say that the translator chose either dans or à half the time? The model must then satisfy 3 constraints:

p(dans) + p(en) = 3/10
p(dans) + p(à) = 1/2
p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1

The choice for the model is no longer obvious. Two problems arise as complexity is added:
- What does "uniform" mean, and how do we measure the uniformity of a model?
- How do we find the most uniform model subject to a set of constraints?
One solution: the maximum entropy (maxent) model.
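These 3 constraints can be solved numerically. The sketch below is not from the tutorial: it assumes the exponential form that the "Exponential form" section derives later, with one feature per clue, and fits the weights by plain gradient ascent on the dual (the step size and iteration count are arbitrary choices).

```python
import math

# Toy maxent for the translation example: find the maximum entropy distribution
# over the 5 phrases subject to p(dans)+p(en) = 3/10 and p(dans)+p(à) = 1/2.
PHRASES = ["dans", "en", "à", "au cours de", "pendant"]
features = [
    lambda y: 1.0 if y in ("dans", "en") else 0.0,  # f1: the second clue
    lambda y: 1.0 if y in ("dans", "à") else 0.0,   # f2: the third clue
]
targets = [3 / 10, 1 / 2]

def model(lams):
    # p(y) proportional to exp(sum_i lam_i * f_i(y))  -- the exponential form
    scores = {y: math.exp(sum(l * f(y) for l, f in zip(lams, features)))
              for y in PHRASES}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

lams = [0.0, 0.0]
for _ in range(20000):
    p = model(lams)
    for i, (f, t) in enumerate(zip(features, targets)):
        expectation = sum(p[y] * f(y) for y in PHRASES)
        lams[i] += 0.5 * (t - expectation)  # dual gradient: target minus model expectation
p = model(lams)
```

Both constraints come out satisfied to several decimal places, and since au cours de and pendant fire no feature, the remaining mass is split evenly between them.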
Maxent Modeling
Consider a random process which produces an output value y, a member of a finite set Y. In generating y, the process may be influenced by some contextual information x, a member of a finite set X.
The task is to construct a stochastic model that accurately represents the behavior of the random process. Such a model estimates the conditional probability p(y|x) that, given a context x, the process will output y.
We denote by P the set of all conditional probability distributions; a model p(y|x) is an element of P.
Training data
Training sample:

(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)

The training sample's empirical probability distribution:

p̃(x, y) ≡ (1/N) × (number of times (x, y) occurs in the sample)
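The empirical distribution takes only a few lines to compute; the (context, output) pairs below are made up for illustration.

```python
from collections import Counter

# Sketch: the empirical distribution p~(x, y) of a training sample,
# i.e. each pair's count divided by N.
sample = [
    ("in April", "en"), ("in April", "en"),
    ("in the", "dans"), ("in April", "à"),
]

def empirical_distribution(pairs):
    n = len(pairs)
    return {xy: count / n for xy, count in Counter(pairs).items()}

p_tilde = empirical_distribution(sample)
```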
Features and constraints (1/4)
We use a set of statistics of the training sample to construct a statistical model of the process. Some statistics are independent of the context, e.g. the constraints from the motivating example:

p(dans) + p(en) = 3/10
p(dans) + p(à) = 1/2

Other statistics depend on the conditioning information x. For example: in the training sample, if April is the word following in, then the translation of in is en with frequency 9/10.
Features and constraints (2/4)
To express the event that in translates as en when April is the following word, we introduce the indicator function:

f(x, y) = 1 if y = en and April follows in
        = 0 otherwise

The expected value of f with respect to the empirical distribution p̃(x, y) is exactly the statistic we are interested in. This expected value is given by:

p̃(f) ≡ Σ_{x,y} p̃(x, y) f(x, y)

We can express any statistic of the sample as the expected value of an appropriate binary-valued indicator function f. We call such a function a feature function, or feature for short.
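The feature and its empirical expectation can be sketched directly from the definitions above; the empirical distribution here is invented for illustration.

```python
# Sketch: the indicator feature from this slide and its empirical expectation
# p~(f) = sum over (x, y) of p~(x, y) * f(x, y).
def feature(x, y):
    # fires when the translation is "en" and the context contains "April"
    return 1.0 if y == "en" and "April" in x else 0.0

p_tilde = {("in April", "en"): 0.5, ("in the", "dans"): 0.25, ("in April", "à"): 0.25}

def empirical_expectation(p_tilde, f):
    return sum(prob * f(x, y) for (x, y), prob in p_tilde.items())

print(empirical_expectation(p_tilde, feature))  # 0.5
```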
Features and constraints (3/4)
The expected value of f with respect to the model p(y|x) is:

p(f) ≡ Σ_{x,y} p̃(x) p(y|x) f(x, y)

where p̃(x) is the empirical distribution of x in the training sample.
We constrain this expected value to be the same as the expected value of f in the training sample:

p(f) = p̃(f)

We call this requirement a constraint equation, or simply a constraint.
Features and constraints (4/4)
By restricting attention to those models for which the constraint holds, we eliminate from consideration those models which do not agree with the training sample on how often the output of the process should exhibit the feature f.
What we have so far:
- A means of representing statistical phenomena inherent in a sample of data, namely p̃(f)
- A means of requiring that our model p(y|x) of the process exhibit these phenomena, namely the constraint p(f) = p̃(f)
Combining the above equations yields:

Σ_{x,y} p̃(x) p(y|x) f(x, y) = Σ_{x,y} p̃(x, y) f(x, y)
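The left-hand side of the constraint, the model expectation p(f), can be sketched the same way; the context marginal and the conditional model below are hypothetical numbers.

```python
# Sketch: the model expectation p(f) = sum over (x, y) of p~(x) p(y|x) f(x, y).
def feature(x, y):
    return 1.0 if y == "en" and "April" in x else 0.0

p_tilde_x = {"in April": 0.5, "in the": 0.5}   # empirical context marginal
p_y_given_x = {                                 # hypothetical model p(y|x)
    "in April": {"en": 0.8, "dans": 0.1, "à": 0.1},
    "in the":   {"en": 0.2, "dans": 0.6, "à": 0.2},
}

def model_expectation(p_tilde_x, p_y_given_x, f):
    return sum(px * py * f(x, y)
               for x, px in p_tilde_x.items()
               for y, py in p_y_given_x[x].items())

print(model_expectation(p_tilde_x, p_y_given_x, feature))  # 0.5 * 0.8 = 0.4
```

A model satisfies the constraint exactly when this value equals the empirical expectation p̃(f).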
The maxent principle (1/2)
Suppose n feature functions f_i are given. We would like our model to accord with these statistics, i.e. we would like p to lie in the subset C of P defined by

C ≡ { p ∈ P | p(f_i) = p̃(f_i) for i = 1, 2, ..., n }

Among the models p ∈ C, we would like to select the distribution which is most uniform. But what does "uniform" mean?
A mathematical measure of the uniformity of a conditional distribution p(y|x) is provided by the conditional entropy:

H(p) ≡ −Σ_{x,y} p̃(x) p(y|x) log p(y|x)
The maxent principle (2/2)

H(p) ≡ −Σ_{x,y} p̃(x) p(y|x) log p(y|x)

- The entropy is bounded from below by zero, the entropy of a model with no uncertainty at all.
- The entropy is bounded from above by log|Y|, the entropy of the uniform distribution over all |Y| possible values of y.
The principle of maximum entropy: to select a model from a set C of allowed probability distributions, choose the model p* ∈ C with maximum entropy H(p):

p* = argmax_{p ∈ C} H(p)

p* is always well-defined; that is, there is always a unique model p* with maximum entropy in any constrained set C.
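The entropy bounds are easy to check numerically. A minimal sketch, using made-up contexts, that evaluates H(p) on a uniform conditional model and recovers the upper bound log|Y|:

```python
import math

# Sketch: the conditional entropy H(p) = -sum over (x, y) of p~(x) p(y|x) log p(y|x).
def conditional_entropy(p_tilde_x, p_y_given_x):
    return -sum(px * py * math.log(py)
                for x, px in p_tilde_x.items()
                for py in p_y_given_x[x].values()
                if py > 0.0)  # 0 log 0 is taken as 0

Y = ["dans", "en", "à"]
p_tilde_x = {"context 1": 0.5, "context 2": 0.5}
uniform = {x: {y: 1.0 / len(Y) for y in Y} for x in p_tilde_x}
print(conditional_entropy(p_tilde_x, uniform))  # equals log|Y| = log 3
```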
Exponential form (1/3)
The method of Lagrange multipliers is applied to impose the constraints on the optimization. The constrained optimization problem is to find

p* = argmax_{p ∈ C} H(p) = argmax_{p ∈ C} ( −Σ_{x,y} p̃(x) p(y|x) log p(y|x) )

That is, maximize H(p) subject to the following constraints:
1. p(y|x) ≥ 0 for all x, y
2. Σ_y p(y|x) = 1 for all x
   (Constraints 1 and 2 guarantee that p is a conditional probability distribution.)
3. Σ_{x,y} p̃(x) p(y|x) f_i(x, y) = Σ_{x,y} p̃(x, y) f_i(x, y) for i ∈ {1, 2, ..., n}
   (In other words, p ∈ C.)
Exponential form (2/3)
When the Lagrange multipliers are introduced, the objective function becomes:

Λ(p, λ, γ) ≡ −Σ_{x,y} p̃(x) p(y|x) log p(y|x)
             + Σ_i λ_i ( Σ_{x,y} p̃(x, y) f_i(x, y) − Σ_{x,y} p̃(x) p(y|x) f_i(x, y) )
             + γ ( Σ_y p(y|x) − 1 )

The real-valued parameters γ and λ_1, λ_2, ..., λ_n correspond to the 1 + n constraints imposed on the solution.
Holding the multipliers fixed, the maximizing p has a closed form; the λ_i themselves are then found by maximizing the dual function, typically with an iterative procedure such as improved iterative scaling.
See http://www.cs.cmu.edu/afs/cs/user/aberger/www/html/tutorial/node7.html
Exponential form (3/3)
The final result: the maximum entropy model p* subject to the constraints C has the parametric form below, where the optimal parameters λ* can be determined by maximizing the dual function Ψ(λ):

p*(y|x) = (1/Z_λ(x)) exp( Σ_i λ_i f_i(x, y) )

Z_λ(x) = Σ_y exp( Σ_i λ_i f_i(x, y) )

λ* = argmax_λ Ψ(λ), where Ψ(λ) ≡ Λ(p_λ, λ)
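The parametric form above can be sketched directly; the single feature, its weight, and the label set below are all hypothetical.

```python
import math

# Sketch: p_lambda(y|x) = exp(sum_i lam_i f_i(x, y)) / Z_lambda(x).
def p_lambda(x, y, lams, features, labels):
    def score(cand):
        return math.exp(sum(l * f(x, cand) for l, f in zip(lams, features)))
    z = sum(score(cand) for cand in labels)  # Z_lambda(x), the normalizing constant
    return score(y) / z

labels = ["dans", "en", "à"]
features = [lambda x, y: 1.0 if y == "en" and "April" in x else 0.0]
lams = [1.0]
probs = {y: p_lambda("in April", y, lams, features, labels) for y in labels}
```

With a positive weight on the single feature, the model shifts probability mass toward en in contexts containing April; here p(en|x) = e/(e+2).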
Maximum likelihood
The log-likelihood L_p̃(p) of the empirical distribution p̃ as predicted by a model p is defined by:

L_p̃(p) ≡ log Π_{x,y} p(y|x)^{p̃(x,y)} = Σ_{x,y} p̃(x, y) log p(y|x)

The dual function Ψ(λ) of the previous section is just the log-likelihood for the exponential model p_λ; that is:

Ψ(λ) = L_p̃(p_λ)

The result of the previous section can therefore be rephrased as: the model p* ∈ C with maximum entropy is the model in the parametric family p_λ(y|x) that maximizes the likelihood of the training sample p̃.
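The log-likelihood definition above is one line of code; the empirical distribution and conditional model here are made up for illustration.

```python
import math

# Sketch: L_{p~}(p) = sum over (x, y) of p~(x, y) * log p(y|x).
def log_likelihood(p_tilde, p_y_given_x):
    return sum(pxy * math.log(p_y_given_x[x][y])
               for (x, y), pxy in p_tilde.items())

p_tilde = {("in April", "en"): 0.5, ("in the", "dans"): 0.5}
model = {
    "in April": {"en": 0.8, "dans": 0.2},
    "in the":   {"en": 0.4, "dans": 0.6},
}
ll = log_likelihood(p_tilde, model)  # 0.5*log(0.8) + 0.5*log(0.6)
```

The value is always ≤ 0, and fitting the λ_i by maximizing this quantity over the exponential family gives exactly the maxent model.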
Skipped sections
- Computing the parameters
  http://www.cs.cmu.edu/afs/cs/user/aberger/www/html/tutorial/node10.html#SECTION00030000000000000000
- Algorithms for inductive learning
  http://www.cs.cmu.edu/afs/cs/user/aberger/www/html/tutorial/node11.html#SECTION00040000000000000000
- Further readings
  http://www.cs.cmu.edu/afs/cs/user/aberger/www/html/tutorial/node14.html#SECTION00050000000000000000