LING 696B: Maximum-Entropy and Random Fields
Review: two worlds
Statistical models and OT seem to ask different questions about learning:
- UG/OT: what is possible/impossible? Hard-coded generalizations; combinatorial optimization (sorting).
- Statistical: among the things that are possible, what is likely/unlikely? Soft-coded generalizations; numerical optimization.
- Marriage of the two?
Review: two worlds
- OT: relate possible/impossible patterns in different languages through constraint reranking.
- Stochastic OT: consider a distribution over all possible grammars to generate variation.
- Today: model the frequency of input/output pairs (among the possible) directly, using a powerful model.
Maximum entropy and OT
Imaginary data:
- Stochastic OT: let *[+voice] >> Ident(voice) half the time and Ident(voice) >> *[+voice] the other half.
- Maximum-Entropy (using positive weights):
  p([bab]|/bap/) = (1/Z) exp{-(2*w1)}
  p([pap]|/bap/) = (1/Z) exp{-(w2)}

/bap/   P(.)   *[+voice]   Ident(voice)
[bab]   .5     2
[pap]   .5                 1
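A minimal Python sketch of this calculation, assuming the illustrative weight values below (they are not fitted):

```python
import math

# hypothetical positive constraint weights: w1 = *[+voice], w2 = Ident(voice)
w = {"*[+voice]": 1.0, "Ident(voice)": 2.0}

# violation vectors for the two candidates of /bap/, from the tableau above
violations = {
    "bab": {"*[+voice]": 2, "Ident(voice)": 0},
    "pap": {"*[+voice]": 0, "Ident(voice)": 1},
}

# unnormalized score for each candidate: exp{-(sum_k wk * fk)}
scores = {cand: math.exp(-sum(w[k] * v for k, v in viol.items()))
          for cand, viol in violations.items()}
Z = sum(scores.values())                 # normalization constant
probs = {cand: s / Z for cand, s in scores.items()}
print(probs)                             # p([bab]|/bap/), p([pap]|/bap/)
```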
Maximum entropy
Why have Z?
- Needed for a conditional distribution: p([bab]|/bap/) + p([pap]|/bap/) = 1.
- So Z = exp{-(2*w1)} + exp{-(w2)} (the same for all candidates) -- called a normalization constant.
- Z can quickly become difficult to compute when the number of candidates is large.
- A very similar proposal appears in Smolensky (1986).
How to get w1, w2?
- Learned from data (by calculating gradients). Needed: frequency counts and violation vectors (same as Stochastic OT).
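A sketch of that learning step on the imaginary data: under the exp{-w*f} parameterization, the gradient of the log-likelihood for constraint k is E_model[fk] - E_data[fk], so the weights move until the model's expected violations match the observed ones. The learning rate and step count are arbitrary:

```python
import math

cands = ["bab", "pap"]
f = {"bab": [2, 0], "pap": [0, 1]}   # violation vectors (w1, w2 order)
p_data = {"bab": 0.5, "pap": 0.5}    # observed frequencies
w = [1.0, 1.0]                       # initial weights
lr = 0.1                             # arbitrary learning rate

for step in range(200):
    # current model distribution over candidates
    scores = {c: math.exp(-sum(wk * fk for wk, fk in zip(w, f[c]))) for c in cands}
    Z = sum(scores.values())
    p_model = {c: scores[c] / Z for c in cands}
    # gradient ascent on the log-likelihood, constraint by constraint
    for k in range(len(w)):
        e_model = sum(p_model[c] * f[c][k] for c in cands)
        e_data = sum(p_data[c] * f[c][k] for c in cands)
        w[k] += lr * (e_model - e_data)

print(w, p_model)   # p_model should end up near the observed .5/.5
```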
Maximum entropy
Why exp{.}?
- It is like taking a maximum, but "soft" -- easy to differentiate and optimize.
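A tiny sketch of the "soft max" intuition: exponentiating and normalizing gives a smooth, differentiable stand-in for picking the single largest score.

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    Z = sum(exps)
    return [e / Z for e in exps]

scores = [2.0, 1.0, -0.5]
print(softmax(scores))                    # smooth: the largest score dominates
print(softmax([10 * s for s in scores]))  # scaling up approaches a hard max (near one-hot)
```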
Maximum entropy and OT
- Inputs are violation vectors, e.g. x = (2,0) and (0,1); outputs are one of K winners -- essentially a classification problem.
- Violating a constraint works against the candidate: prob ~ exp{-(x1*w1 + x2*w2)}.
- Crucial difference: candidates are ordered by one score, not by lexicographic order (see the sketch below).

/bap/   P(.)   *[+voice]   Ident(voice)
[bab]   .5     2
[pap]   .5                 1
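A sketch of that difference on the /bap/ tableau: classic OT compares candidates lexicographically under the ranking *[+voice] >> Ident(voice), while Max-Ent collapses the violations into one weighted score; the weights below are invented to show that the two schemes can disagree.

```python
# violation vectors, ordered (*[+voice], Ident(voice))
candidates = {"bab": (2, 0), "pap": (0, 1)}

# OT with *[+voice] >> Ident(voice): Python tuple comparison is lexicographic
ot_winner = min(candidates, key=lambda c: candidates[c])
print("OT winner:", ot_winner)            # pap: (0, 1) < (2, 0)

# Max-Ent: one weighted sum; suitable weights can reverse the preference
w = (0.4, 1.0)                            # hypothetical weights
penalty = {c: sum(wk * fk for wk, fk in zip(w, v)) for c, v in candidates.items()}
print("Max-Ent best:", min(penalty, key=penalty.get))  # bab: 0.8 < 1.0
```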
Maximum entropy
- Ordering discrete outputs from input vectors is a common problem: it is also called logistic regression (recall Nearey).
- Explaining the name: let P = p([bab]|/bap/); then log[P/(1-P)] = w2 - 2*w1, i.e. the logistic transform of P is a linear function of the weights (linear regression).
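A quick numerical check of the log-odds identity, with arbitrary weight values:

```python
import math

w1, w2 = 0.7, 1.3                  # arbitrary illustrative weights
p_bab = math.exp(-2 * w1)          # unnormalized score of [bab]
p_pap = math.exp(-w2)              # unnormalized score of [pap]
P = p_bab / (p_bab + p_pap)        # P = p([bab]|/bap/)

print(math.log(P / (1 - P)))       # -0.1
print(w2 - 2 * w1)                 # -0.1, the same linear expression
```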
The power of Maximum Entropy
- Max-Ent/logistic regression is widely used in many areas with interacting, correlated inputs. Recall Nearey: phones, diphones, ... In NLP: tagging, labeling, parsing ... (anything with a discrete output).
- Easy to learn: the objective has only a global maximum, and optimization is efficient.
- Isn't this the greatest thing in the world? We need to understand the story behind the exp{} (in a few minutes).
Demo: Spanish diminutives
- Data from Arbisi-Kelm.
- Constraints: ALIGN(TE,Word,R), MAX-OO(V), DEP-IO, and BaseTooLittle.
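The demo itself is not reproduced here; below is a hedged sketch of the kind of fit involved, maximizing the likelihood of a small tableau with scipy. The candidates, violation vectors, and counts are invented placeholders, not the Arbisi-Kelm data.

```python
import numpy as np
from scipy.optimize import minimize

# hypothetical violation matrix: one row per candidate, one column per
# constraint (ALIGN(TE,Word,R), MAX-OO(V), DEP-IO, BaseTooLittle)
F = np.array([[1, 0, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 2]])
counts = np.array([70, 20, 10])   # hypothetical output frequencies

def neg_log_lik(w):
    scores = -F @ w                         # -(violations . weights)
    logZ = np.log(np.sum(np.exp(scores)))   # shared normalization constant
    return -np.sum(counts * (scores - logZ))

res = minimize(neg_log_lik, x0=np.zeros(4), method="L-BFGS-B",
               bounds=[(0, None)] * 4)      # keep the weights positive
print(res.x)
```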
Stochastic OT and Max-Ent
- Is a better fit always a good thing?
- Should model-fitting become a new fashion in phonology?
The crucial difference
- What are the possible distributions of p(.|/bap/) in this case?
- Max-Ent considers a much wider range of distributions.

/bap/   P(.)   *[+voice]   Ident(voice)
[bab]   ?      2
[pap]   ?                  1
[bap]   ?      1
[pab]   ?      1           1
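A sketch that makes the contrast concrete: sweeping the two weights over a small (arbitrary) grid shows the range of distributions Max-Ent can generate for this four-candidate tableau.

```python
import math, itertools

# violation vectors, ordered (*[+voice], Ident(voice))
f = {"bab": (2, 0), "pap": (0, 1), "bap": (1, 0), "pab": (1, 1)}

for w1, w2 in itertools.product([0.1, 1.0, 3.0], repeat=2):
    scores = {c: math.exp(-(w1 * v[0] + w2 * v[1])) for c, v in f.items()}
    Z = sum(scores.values())
    dist = {c: round(s / Z, 3) for c, s in scores.items()}
    print((w1, w2), dist)   # each weight pair yields a different distribution
```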
What is Maximum Entropy anyway?
- Jaynes (1957): the most ignorant state corresponds to the distribution with the most entropy.
- Given a die, which distribution has the largest entropy?
- Add constraints to the distribution: the average of some feature functions is assumed to be fixed, E[fk(x)] = observed value.
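A sketch of both points for the die, assuming a hypothetical observed mean of 4.5: with no constraints the uniform distribution maximizes entropy; the moment constraint tilts the solution away from uniform.

```python
import numpy as np
from scipy.optimize import minimize

def entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p))

faces = np.arange(1, 7)
uniform = np.full(6, 1 / 6)
print(entropy(uniform))            # log 6 ~ 1.79: the unconstrained maximum

# maximize entropy subject to sum(p) = 1 and E[face] = 4.5 (observed value)
cons = [{"type": "eq", "fun": lambda p: p.sum() - 1.0},
        {"type": "eq", "fun": lambda p: p @ faces - 4.5}]
res = minimize(lambda p: -entropy(p), uniform, method="SLSQP",
               bounds=[(0, 1)] * 6, constraints=cons)
print(res.x)                       # mass tilts toward the higher faces
```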
What is Maximum Entropy anyway?
- Examples of features: violations, word counts, N-grams, co-occurrences, ...
- The constraints change the shape of the maximum-entropy distribution: solve a constrained optimization problem.
- This leads to p(x) ~ exp{Σ_k wk*fk(x)}. Very general (see later); many choices of fk.
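A sketch of this general form with invented string features and weights, only to show that fk can be any function of x:

```python
import math

def features(x):
    return [x.count("a"),                   # a count-style feature
            1 if x.endswith("b") else 0]    # a co-occurrence-style indicator

w = [0.5, -1.0]                             # hypothetical weights
X = ["bab", "pap", "bap", "pab"]

scores = {x: math.exp(sum(wk * fk for wk, fk in zip(w, features(x)))) for x in X}
Z = sum(scores.values())
print({x: round(s / Z, 3) for x, s in scores.items()})
```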
The basic intuition
- Stay as "ignorant" as possible (maximum entropy), as long as the chosen distribution matches certain "descriptions" of the empirical data (the statistics of the fk(x)).
- Approximation property: any distribution can be approximated by a max-ent distribution with a sufficient number of features (Cramér and Wold). This is common practice in NLP.
- It is better seen as a "descriptive" model.
Going towards Markov random fields
- Maximum entropy applied to conditional/joint distributions: p(y|x) or p(x,y) ~ exp{Σ_k wk*fk(x,y)}.
- There are many creative ways of extracting features fk(x,y). One way is to let a graph structure guide the calculation of features, e.g. by neighborhood/clique.
- Known as a Markov network/random field.
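A sketch of a minimal Markov random field: three binary variables on a chain, with one agreement feature per edge (the cliques of this graph); the weight is invented.

```python
import math, itertools

edges = [(0, 1), (1, 2)]     # chain graph: x0 - x1 - x2
w_agree = 1.5                # hypothetical weight on the agreement feature

def score(x):
    # one feature per clique (edge): do the two neighbors agree?
    return sum(w_agree * (x[i] == x[j]) for i, j in edges)

states = list(itertools.product([0, 1], repeat=3))
Z = sum(math.exp(score(x)) for x in states)
for x in states:
    print(x, round(math.exp(score(x)) / Z, 3))   # agreeing configurations get more mass
```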
Conditional random field
- Impose a chain-structured graph and assign features to the edges. Still a max-ent model; the calculation is the same.
- Node features f(xi, yi) link each input to its label; edge features m(yi, yi+1) link adjacent labels.
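A brute-force sketch of a chain CRF in this spirit; the node and edge feature definitions are invented for illustration, and a real implementation would use dynamic programming instead of enumerating every y.

```python
import math, itertools

x = ["b", "a", "p"]            # input (e.g. underlying) string
labels = ["b", "a", "p"]       # possible output (e.g. surface) symbols

def f_node(xi, yi):            # node feature: reward faithful correspondence
    return 1.0 if xi == yi else 0.0

def m_edge(yi, yj):            # edge feature: penalize adjacent voiced stops
    return -1.0 if yi == "b" and yj == "b" else 0.0

def score(y):
    return (sum(f_node(xi, yi) for xi, yi in zip(x, y)) +
            sum(m_edge(y[i], y[i + 1]) for i in range(len(y) - 1)))

ys = list(itertools.product(labels, repeat=len(x)))
Z = sum(math.exp(score(y)) for y in ys)   # still a max-ent normalization
best = max(ys, key=score)
print(best, math.exp(score(best)) / Z)    # most probable output and its probability
```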
Wilson's idea
Isn't this a familiar picture in phonology?
- m(yi, yi+1) -- Markedness, defined over adjacent symbols of the surface form
- f(xi, yi) -- Faithfulness, relating the underlying form to the surface form
The story of smoothing
- In Max-Ent models, the weights can get very large and "over-fit" the data (see demo).
- It is common to penalize (smooth) this with a new objective function: new objective = old objective + parameter * magnitude of weights.
- Wilson's claim: this smoothing parameter relates to substantive bias in phonological learning. Constraints that force less similarity get a higher penalty for changing value.
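A sketch of the smoothed objective with an L2-style penalty. Treating Wilson's idea as a per-constraint penalty strength sigma (smaller sigma = stronger penalty on moving that weight) is one reading; all values here are invented.

```python
import numpy as np

def penalized_neg_log_lik(w, F, counts, sigma):
    scores = -F @ w
    logZ = np.log(np.sum(np.exp(scores)))
    nll = -np.sum(counts * (scores - logZ))   # old objective
    penalty = np.sum((w / sigma) ** 2)        # smoothing: magnitude of weights
    return nll + penalty

F = np.array([[2, 0], [0, 1]])   # /bap/ tableau violations
counts = np.array([5, 5])        # observed counts
sigma = np.array([1.0, 0.3])     # smaller sigma -> higher penalty, e.g. for
                                 # constraints that force less similarity
print(penalized_neg_log_lik(np.array([1.0, 2.0]), F, counts, sigma))
```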
Wilson's model fit to the velar palatalization data