LING 696B: Maximum-Entropy and Random Fields
Review: two worlds
Statistical models and OT seem to ask different questions about learning:
- UG/OT: what is possible/impossible? Hard-coded generalizations; combinatorial optimization (sorting).
- Statistical: among the things that are possible, what is likely/unlikely? Soft-coded generalizations; numerical optimization.
- Marriage of the two?
Review: two worlds
- OT: relate possible/impossible patterns in different languages through constraint reranking.
- Stochastic OT: consider a distribution over all possible grammars to generate variation.
- Today: model the frequency of input/output pairs (among the possible) directly, using a powerful model.
Maximum entropy and OT
Imaginary data:
- Stochastic OT: let *[+voice] >> Ident(voice) half the time and Ident(voice) >> *[+voice] the other half.
- Maximum-Entropy (using positive weights):
  p([bab]|/bap/) = (1/Z) exp{-(2*w1)}
  p([pap]|/bap/) = (1/Z) exp{-(w2)}

/bap/   P(.)   *[+voice]   Ident(voice)
[bab]   .5     2
[pap]   .5                 1
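A minimal Python sketch of this calculation, assuming the illustrative weight values below (they are not fitted):

```python
import math

# hypothetical positive constraint weights: w1 = *[+voice], w2 = Ident(voice)
w = {"*[+voice]": 1.0, "Ident(voice)": 2.0}

# violation vectors for the two candidates of /bap/, from the tableau above
violations = {
    "bab": {"*[+voice]": 2, "Ident(voice)": 0},
    "pap": {"*[+voice]": 0, "Ident(voice)": 1},
}

# unnormalized score for each candidate: exp{-(sum_k wk * fk)}
scores = {cand: math.exp(-sum(w[k] * v for k, v in viol.items()))
          for cand, viol in violations.items()}
Z = sum(scores.values())                 # normalization constant
probs = {cand: s / Z for cand, s in scores.items()}
print(probs)                             # p([bab]|/bap/), p([pap]|/bap/)
```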
Maximum entropy
Why have Z?
- Needed for a conditional distribution: p([bab]|/bap/) + p([pap]|/bap/) = 1.
- So Z = exp{-(2*w1)} + exp{-(w2)} (the same for all candidates) -- called a normalization constant.
- Z can quickly become difficult to compute when the number of candidates is large.
- A very similar proposal appears in Smolensky (1986).
How to get w1, w2?
- Learned from data (by calculating gradients). Needed: frequency counts and violation vectors (same as Stochastic OT).
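A sketch of that learning step on the imaginary data: under the exp{-w*f} parameterization, the gradient of the log-likelihood for constraint k is E_model[fk] - E_data[fk], so the weights move until the model's expected violations match the observed ones. The learning rate and step count are arbitrary:

```python
import math

cands = ["bab", "pap"]
f = {"bab": [2, 0], "pap": [0, 1]}   # violation vectors (w1, w2 order)
p_data = {"bab": 0.5, "pap": 0.5}    # observed frequencies
w = [1.0, 1.0]                       # initial weights
lr = 0.1                             # arbitrary learning rate

for step in range(200):
    # current model distribution over candidates
    scores = {c: math.exp(-sum(wk * fk for wk, fk in zip(w, f[c]))) for c in cands}
    Z = sum(scores.values())
    p_model = {c: scores[c] / Z for c in cands}
    # gradient ascent on the log-likelihood, constraint by constraint
    for k in range(len(w)):
        e_model = sum(p_model[c] * f[c][k] for c in cands)
        e_data = sum(p_data[c] * f[c][k] for c in cands)
        w[k] += lr * (e_model - e_data)

print(w, p_model)   # p_model should end up near the observed .5/.5
```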
Maximum entropy
Why exp{.}?
- It is like taking a maximum, but "soft" -- easy to differentiate and optimize.
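A tiny sketch of the "soft max" intuition: exponentiating and normalizing gives a smooth, differentiable stand-in for picking the single largest score.

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    Z = sum(exps)
    return [e / Z for e in exps]

scores = [2.0, 1.0, -0.5]
print(softmax(scores))                    # smooth: the largest score dominates
print(softmax([10 * s for s in scores]))  # scaling up approaches a hard max (near one-hot)
```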
Maximum entropy and OT
- Inputs are violation vectors, e.g. x = (2,0) and (0,1); outputs are one of K winners -- essentially a classification problem.
- Violating a constraint works against the candidate: prob ~ exp{-(x1*w1 + x2*w2)}.
- Crucial difference: candidates are ordered by one score, not by lexicographic order (see the sketch below).

/bap/   P(.)   *[+voice]   Ident(voice)
[bab]   .5     2
[pap]   .5                 1
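A sketch of that difference on the /bap/ tableau: classic OT compares candidates lexicographically under the ranking *[+voice] >> Ident(voice), while Max-Ent collapses the violations into one weighted score; the weights below are invented to show that the two schemes can disagree.

```python
# violation vectors, ordered (*[+voice], Ident(voice))
candidates = {"bab": (2, 0), "pap": (0, 1)}

# OT with *[+voice] >> Ident(voice): Python tuple comparison is lexicographic
ot_winner = min(candidates, key=lambda c: candidates[c])
print("OT winner:", ot_winner)            # pap: (0, 1) < (2, 0)

# Max-Ent: one weighted sum; suitable weights can reverse the preference
w = (0.4, 1.0)                            # hypothetical weights
penalty = {c: sum(wk * fk for wk, fk in zip(w, v)) for c, v in candidates.items()}
print("Max-Ent best:", min(penalty, key=penalty.get))  # bab: 0.8 < 1.0
```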
Maximum entropy
- Ordering discrete outputs from input vectors is a common problem: it is also called logistic regression (recall Nearey).
- Explaining the name: let P = p([bab]|/bap/); then log[P/(1-P)] = w2 - 2*w1, i.e. the logistic transform of P is a linear function of the weights (linear regression).
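A quick numerical check of the log-odds identity, with arbitrary weight values:

```python
import math

w1, w2 = 0.7, 1.3                  # arbitrary illustrative weights
p_bab = math.exp(-2 * w1)          # unnormalized score of [bab]
p_pap = math.exp(-w2)              # unnormalized score of [pap]
P = p_bab / (p_bab + p_pap)        # P = p([bab]|/bap/)

print(math.log(P / (1 - P)))       # -0.1
print(w2 - 2 * w1)                 # -0.1, the same linear expression
```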
The power of Maximum Entropy
- Max-Ent/logistic regression is widely used in many areas with interacting, correlated inputs. Recall Nearey: phones, diphones, ... In NLP: tagging, labeling, parsing ... (anything with a discrete output).
- Easy to learn: the objective has only a global maximum, and optimization is efficient.
- Isn't this the greatest thing in the world? We need to understand the story behind the exp{} (in a few minutes).
Demo: Spanish diminutives
- Data from Arbisi-Kelm.
- Constraints: ALIGN(TE,Word,R), MAX-OO(V), DEP-IO, and BaseTooLittle.
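The demo itself is not reproduced here; below is a hedged sketch of the kind of fit involved, maximizing the likelihood of a small tableau with scipy. The candidates, violation vectors, and counts are invented placeholders, not the Arbisi-Kelm data.

```python
import numpy as np
from scipy.optimize import minimize

# hypothetical violation matrix: one row per candidate, one column per
# constraint (ALIGN(TE,Word,R), MAX-OO(V), DEP-IO, BaseTooLittle)
F = np.array([[1, 0, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 2]])
counts = np.array([70, 20, 10])   # hypothetical output frequencies

def neg_log_lik(w):
    scores = -F @ w                         # -(violations . weights)
    logZ = np.log(np.sum(np.exp(scores)))   # shared normalization constant
    return -np.sum(counts * (scores - logZ))

res = minimize(neg_log_lik, x0=np.zeros(4), method="L-BFGS-B",
               bounds=[(0, None)] * 4)      # keep the weights positive
print(res.x)
```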
Stochastic OT and Max-Ent
- Is a better fit always a good thing?
- Should model-fitting become a new fashion in phonology?
The crucial difference
- What are the possible distributions of p(.|/bap/) in this case?
- Max-Ent considers a much wider range of distributions.

/bap/   P(.)   *[+voice]   Ident(voice)
[bab]   ?      2
[pap]   ?                  1
[bap]   ?      1
[pab]   ?      1           1
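A sketch that makes the contrast concrete: sweeping the two weights over a small (arbitrary) grid shows the range of distributions Max-Ent can generate for this four-candidate tableau.

```python
import math, itertools

# violation vectors, ordered (*[+voice], Ident(voice))
f = {"bab": (2, 0), "pap": (0, 1), "bap": (1, 0), "pab": (1, 1)}

for w1, w2 in itertools.product([0.1, 1.0, 3.0], repeat=2):
    scores = {c: math.exp(-(w1 * v[0] + w2 * v[1])) for c, v in f.items()}
    Z = sum(scores.values())
    dist = {c: round(s / Z, 3) for c, s in scores.items()}
    print((w1, w2), dist)   # each weight pair yields a different distribution
```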
What is Maximum Entropy anyway?
- Jaynes (1957): the most ignorant state corresponds to the distribution with the most entropy.
- Given a die, which distribution has the largest entropy?
- Add constraints to the distribution: the average of some feature functions is assumed to be fixed, E[fk(x)] = observed value.
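A sketch of both points for the die, assuming a hypothetical observed mean of 4.5: with no constraints the uniform distribution maximizes entropy; the moment constraint tilts the solution away from uniform.

```python
import numpy as np
from scipy.optimize import minimize

def entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p))

faces = np.arange(1, 7)
uniform = np.full(6, 1 / 6)
print(entropy(uniform))            # log 6 ~ 1.79: the unconstrained maximum

# maximize entropy subject to sum(p) = 1 and E[face] = 4.5 (observed value)
cons = [{"type": "eq", "fun": lambda p: p.sum() - 1.0},
        {"type": "eq", "fun": lambda p: p @ faces - 4.5}]
res = minimize(lambda p: -entropy(p), uniform, method="SLSQP",
               bounds=[(0, 1)] * 6, constraints=cons)
print(res.x)                       # mass tilts toward the higher faces
```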
What is Maximum Entropy anyway?
- Examples of features: violations, word counts, N-grams, co-occurrences, ...
- The constraints change the shape of the maximum-entropy distribution: solve a constrained optimization problem.
- This leads to p(x) ~ exp{Σ_k wk*fk(x)}. Very general (see later); many choices of fk.
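A sketch of this general form with invented string features and weights, only to show that fk can be any function of x:

```python
import math

def features(x):
    return [x.count("a"),                   # a count-style feature
            1 if x.endswith("b") else 0]    # a co-occurrence-style indicator

w = [0.5, -1.0]                             # hypothetical weights
X = ["bab", "pap", "bap", "pab"]

scores = {x: math.exp(sum(wk * fk for wk, fk in zip(w, features(x)))) for x in X}
Z = sum(scores.values())
print({x: round(s / Z, 3) for x, s in scores.items()})
```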
The basic intuition
- Stay as "ignorant" as possible (maximum entropy), as long as the chosen distribution matches certain "descriptions" of the empirical data (the statistics of the fk(x)).
- Approximation property: any distribution can be approximated by a max-ent distribution with a sufficient number of features (Cramér and Wold). This is common practice in NLP.
- It is better seen as a "descriptive" model.
Going towards Markov random fields
- Maximum entropy applied to conditional/joint distributions: p(y|x) or p(x,y) ~ exp{Σ_k wk*fk(x,y)}.
- There are many creative ways of extracting features fk(x,y). One way is to let a graph structure guide the calculation of features, e.g. by neighborhood/clique.
- Known as a Markov network/random field.
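A sketch of a minimal Markov random field: three binary variables on a chain, with one agreement feature per edge (the cliques of this graph); the weight is invented.

```python
import math, itertools

edges = [(0, 1), (1, 2)]     # chain graph: x0 - x1 - x2
w_agree = 1.5                # hypothetical weight on the agreement feature

def score(x):
    # one feature per clique (edge): do the two neighbors agree?
    return sum(w_agree * (x[i] == x[j]) for i, j in edges)

states = list(itertools.product([0, 1], repeat=3))
Z = sum(math.exp(score(x)) for x in states)
for x in states:
    print(x, round(math.exp(score(x)) / Z, 3))   # agreeing configurations get more mass
```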
Conditional random field
- Impose a chain-structured graph and assign features to the edges. Still a max-ent model; the calculation is the same.
- Node features f(xi, yi) link each input to its label; edge features m(yi, yi+1) link adjacent labels.
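A brute-force sketch of a chain CRF in this spirit; the node and edge feature definitions are invented for illustration, and a real implementation would use dynamic programming instead of enumerating every y.

```python
import math, itertools

x = ["b", "a", "p"]            # input (e.g. underlying) string
labels = ["b", "a", "p"]       # possible output (e.g. surface) symbols

def f_node(xi, yi):            # node feature: reward faithful correspondence
    return 1.0 if xi == yi else 0.0

def m_edge(yi, yj):            # edge feature: penalize adjacent voiced stops
    return -1.0 if yi == "b" and yj == "b" else 0.0

def score(y):
    return (sum(f_node(xi, yi) for xi, yi in zip(x, y)) +
            sum(m_edge(y[i], y[i + 1]) for i in range(len(y) - 1)))

ys = list(itertools.product(labels, repeat=len(x)))
Z = sum(math.exp(score(y)) for y in ys)   # still a max-ent normalization
best = max(ys, key=score)
print(best, math.exp(score(best)) / Z)    # most probable output and its probability
```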
Wilson's idea
Isn't this a familiar picture in phonology?
- m(yi, yi+1) -- Markedness, defined over adjacent symbols of the surface form
- f(xi, yi) -- Faithfulness, relating the underlying form to the surface form
The story of smoothing
- In Max-Ent models, the weights can get very large and "over-fit" the data (see demo).
- It is common to penalize (smooth) this with a new objective function: new objective = old objective + parameter * magnitude of weights.
- Wilson's claim: this smoothing parameter relates to substantive bias in phonological learning. Constraints that force less similarity get a higher penalty for changing value.
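A sketch of the smoothed objective with an L2-style penalty. Treating Wilson's idea as a per-constraint penalty strength sigma (smaller sigma = stronger penalty on moving that weight) is one reading; all values here are invented.

```python
import numpy as np

def penalized_neg_log_lik(w, F, counts, sigma):
    scores = -F @ w
    logZ = np.log(np.sum(np.exp(scores)))
    nll = -np.sum(counts * (scores - logZ))   # old objective
    penalty = np.sum((w / sigma) ** 2)        # smoothing: magnitude of weights
    return nll + penalty

F = np.array([[2, 0], [0, 1]])   # /bap/ tableau violations
counts = np.array([5, 5])        # observed counts
sigma = np.array([1.0, 0.3])     # smaller sigma -> higher penalty, e.g. for
                                 # constraints that force less similarity
print(penalized_neg_log_lik(np.array([1.0, 2.0]), F, counts, sigma))
```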
Wilson's model fit to the velar palatalization data