Maximum Entropy
Advanced Statistical Methods in NLP
Ling 572, January 31, 2012


Maximum Entropy

Roadmap

Maximum entropy:
Overview
Generative & discriminative models: Naïve Bayes vs. MaxEnt
Maximum entropy principle
Feature functions
Modeling
Decoding
Summary

Maximum Entropy ("MaxEnt"):

Popular machine learning technique for NLP

First uses in NLP circa 1996 – Rosenfeld, Berger

Applied to a wide range of tasks

Sentence boundary detection (MxTerminator, Ratnaparkhi), POS tagging (Ratnaparkhi, Berger), topic segmentation (Berger), language modeling (Rosenfeld), prosody labeling, etc.


Readings & Comments

Several readings:
(Berger, 1996), (Ratnaparkhi, 1997); (Klein & Manning, 2003): tutorial
Note: some of these are very 'dense'
Don't spend huge amounts of time on every detail
Take a first pass before class, review after lecture

Going forward: the techniques get more complex
Goal: understand the basic model and concepts
Training is especially complex; we'll discuss it, but not implement it

Notation Note

Notation is not entirely consistent across the readings:

We’ll use: input = x; output=y; pair = (x,y)

Consistent with Berger, 1996

Ratnaparkhi, 1996: input = h; output=t; pair = (h,t)

Klein/Manning, ‘03: input = d; output=c; pair = (c,d)


Joint vs Conditional Models

Assuming some training data {(x,y)}, need to learn a model Θ s.t. given a new x, can predict label y.

Different types of models:
Joint models (aka generative models) estimate P(x,y) by maximizing P(X,Y|Θ)
Most models so far: n-gram, Naïve Bayes, HMM, etc.
Conceptually easy to compute weights: relative frequency
Conditional (aka discriminative) models estimate P(y|x) by maximizing P(Y|X,Θ)
Models going forward: MaxEnt, SVM, CRF, …
Computing weights is more complex

Naïve Bayes Model

The Naïve Bayes model assumes features f are independent of each other, given the class c.

[Graphical model: class node c with arrows to feature nodes f1, f2, f3, …, fk]


Naïve Bayes Model

Makes the assumption of conditional independence of features given the class.

However, this is generally unrealistic:
P("cuts"|politics) = p_cuts
But what about P("cuts"|politics,"budget")? Is it still p_cuts?

We would like a model that doesn't make this assumption.

Model Parameters

Our model: c* = argmax_c P(c) Π_j P(f_j|c)

Two types of parameters:
P(c): class priors
P(f_j|c): class-conditional feature probabilities

In total, |C| + |V|·|C| parameters, if features are words in vocabulary V
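Not on the original slides: a minimal Python sketch of the decision rule above, with made-up classes, priors, and probabilities; the log-space computation and the floor value for unseen words are assumptions of this sketch, not part of the lecture.

```python
import math

def naive_bayes_predict(doc_words, class_priors, cond_probs, classes):
    """Pick c* = argmax_c P(c) * prod_j P(f_j|c), computed in log space
    to avoid underflow. cond_probs[c][w] holds P(w|c); unseen words fall
    back to a small floor value (a smoothing choice assumed here)."""
    best_class, best_score = None, float("-inf")
    for c in classes:
        score = math.log(class_priors[c])
        for w in doc_words:
            score += math.log(cond_probs[c].get(w, 1e-6))  # hypothetical floor
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy usage with made-up probabilities
priors = {"politics": 0.5, "sports": 0.5}
cond = {"politics": {"cuts": 0.02, "budget": 0.03},
        "sports": {"cuts": 0.001, "budget": 0.001}}
print(naive_bayes_predict(["budget", "cuts"], priors, cond, ["politics", "sports"]))
```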

Weights in Naïve Bayes

        c1          c2          c3          …   ck
f1      P(f1|c1)    P(f1|c2)    P(f1|c3)    …   P(f1|ck)
f2      P(f2|c1)    P(f2|c2)    …
…       …
f|V|    P(f|V||c1)


Weights in Naïve Bayes and Maximum Entropy

Naïve Bayes: P(f|y) are probabilities in [0,1]; these are the weights
P(y|x) = P(y) Π_j P(f_j|y) / Σ_y' P(y') Π_j P(f_j|y')

MaxEnt: weights are real numbers, of any magnitude and sign
P(y|x) = exp(Σ_j λ_j f_j(x,y)) / Σ_y' exp(Σ_j λ_j f_j(x,y'))


MaxEnt Overview

Prediction: P(y|x)
f_j(x,y): binary feature function, indicating the presence of feature j in instance x with class y
λ_j: feature weights, learned in training

Prediction: compute P(y|x) for each y, pick the highest
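Not from the slides: a minimal sketch of MaxEnt decoding in this form, P(y|x) proportional to exp(Σ_j λ_j f_j(x,y)); the toy feature functions, weights, and input here are illustrative assumptions.

```python
import math

def maxent_predict(x, classes, feature_funcs, weights):
    """Minimal MaxEnt decoding sketch: score each class y by
    exp(sum_j lambda_j * f_j(x, y)), normalize, and pick the best."""
    scores = {y: math.exp(sum(w * f(x, y) for f, w in zip(feature_funcs, weights)))
              for y in classes}
    z = sum(scores.values())                       # normalizer Z(x)
    probs = {y: s / z for y, s in scores.items()}  # P(y|x)
    return max(probs, key=probs.get), probs

# Toy example: one feature fires for class "guns" when "rifle" appears in x
f1 = lambda x, y: 1 if (y == "guns" and "rifle" in x) else 0
f2 = lambda x, y: 1 if (y == "mideast" and "peace" in x) else 0
label, probs = maxent_predict({"rifle", "ban"}, ["guns", "mideast"], [f1, f2], [1.2, 0.8])
print(label, probs)
```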

Weights in MaxEnt

        c1    c2    c3    …   ck
f1      λ1    λ8    …
f2      λ2    …
…       …
f|V|    λ6


Maximum Entropy Principle

Intuitively: model all that is known, and assume as little as possible about what is unknown

Maximum entropy = minimum commitment

Related to concepts like Occam's razor

Laplace's "Principle of Insufficient Reason": when one has no information to distinguish between the probability of two events, the best strategy is to consider them equally likely


Example I (K&M 2003)

Consider a coin flip, with entropy H(X) = -Σ_x P(X=x) log P(X=x)

What values of P(X=H), P(X=T) maximize H(X)?
P(X=H) = P(X=T) = 1/2
With no prior information, the best guess is a fair coin.

What if you know P(X=H) = 0.3?
Then P(X=T) = 0.7
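As a quick check (not in the original slides), a small script confirming that the entropy of a coin is maximized at the uniform distribution:

```python
import math

def entropy(p):
    """H(X) in bits for a coin with P(X=H) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Entropy is maximized at the uniform distribution, p = 0.5
best_p = max((i / 100 for i in range(101)), key=entropy)
print(best_p, entropy(best_p))  # -> 0.5 1.0
```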


Example II: MT (Berger, 1996)

Task: English-to-French machine translation; specifically, translating 'in'

Suppose we've seen 'in' translated as: {dans, en, à, au cours de, pendant}
Constraint: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1

If there is no other constraint, what is the maxent model?
p(dans) = p(en) = p(à) = p(au cours de) = p(pendant) = 1/5


Example II: MT (Berger, 1996), continued

What if we find out that the translator uses dans or en 30% of the time?
Constraint: p(dans) + p(en) = 3/10

Now what is the maxent model?
p(dans) = p(en) = 3/20
p(à) = p(au cours de) = p(pendant) = 7/30

What if we also know the translator picks à or dans 50% of the time?
Add a new constraint: p(à) + p(dans) = 0.5
Now what is the maxent model?
Not intuitively obvious…
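A hedged numerical check (not part of the lecture) of this three-constraint case, using SciPy's general-purpose SLSQP optimizer to maximize entropy subject to the constraints above:

```python
import numpy as np
from scipy.optimize import minimize

# The five candidate translations of 'in'
words = ["dans", "en", "à", "au cours de", "pendant"]

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return float(np.sum(p * np.log(p)))  # minimizing this maximizes H(p)

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},       # probabilities sum to 1
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.3},   # p(dans) + p(en) = 3/10
    {"type": "eq", "fun": lambda p: p[2] + p[0] - 0.5},   # p(à) + p(dans) = 1/2
]
result = minimize(neg_entropy, x0=np.full(5, 0.2), bounds=[(0, 1)] * 5,
                  constraints=constraints, method="SLSQP")
print(dict(zip(words, np.round(result.x, 3))))
```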

Example III: POS (K&M, 2003)


Example III, continued

Problem: too uniform

What else do we know? Nouns are more common than verbs.
So let f_N = {NN, NNS, NNP, NNPS}, with E[f_N] = 32/36
Also, proper nouns are more frequent than common nouns, so E[NNP, NNPS] = 24/36
Etc.


Maximum Entropy Principle: Summary

Among all probability distributions p in P that satisfy the set of constraints, select the p* that maximizes:
H(p) = -Σ_x p(x) log p(x)

Questions:
1) How do we model the constraints?
2) How do we select the distribution?
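One standard way to write this out (implied by the expectation constraints used later in the lecture, though not spelled out on this slide): the constraints require each feature function's model expectation to match its empirical expectation:

p* = argmax_{p in P} H(p), subject to E_p[f_j] = E_p̃[f_j] for all j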


MaxEnt Modeling


Basic Approach

Given some training data, collect training examples (x,y) in (X,Y), where:
x in X: the training data samples
y in Y: the labels (classes) to predict

For example, in text classification:
X: words in documents
Y: text categories: guns, mideast, misc, …

Compute P(y|x); select the y with the highest P(y|x) as the classification

Maximum Entropy Model

Compute P(y|x) by selecting the distribution that maximizes entropy subject to the constraints from the training data:
p* = argmax_{p in P} H(p)

Modeling Components

Feature functions

Computing expectations

Calculating P(y|x) and P(x,y)


Feature Functions


Feature Functions

A feature function is a binary-valued indicator function.

In text classification, j refers to a specific (feature, class) pair, such that the feature is present when y is that class:

f_j(x,y) = 1 if y = "guns" and x includes "rifle"
           0 otherwise


Feature Functions, continued

Feature functions can be complex. For example, a feature could be:
class = y and a conjunction of features, e.g. class="guns" and currwd="rifles" and prevwd="allow"
class = y and a disjunction of features, e.g. class="guns" and (currwd="rifles" or currwd="guns")
and so on

Many are simple.
Feature selection will be an issue.
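A minimal sketch (not from the slides) of how such conjunction and disjunction feature functions might be written, assuming each instance x is represented as a dict of contextual attributes:

```python
def f_conj(x, y):
    """Fires when class = "guns" AND currwd = "rifles" AND prevwd = "allow"."""
    return 1 if (y == "guns" and x.get("currwd") == "rifles"
                 and x.get("prevwd") == "allow") else 0

def f_disj(x, y):
    """Fires when class = "guns" AND (currwd = "rifles" OR currwd = "guns")."""
    return 1 if (y == "guns" and x.get("currwd") in {"rifles", "guns"}) else 0

# Toy instance represented as a dict of contextual attributes
x = {"currwd": "rifles", "prevwd": "allow"}
print(f_conj(x, "guns"), f_disj(x, "guns"), f_conj(x, "mideast"))  # 1 1 0
```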


Key Points

Feature functions are:
binary
correspond to a (feature, class) pair

MaxEnt training learns the weights associated with the feature functions


Computing Expectations

Computing Expectations

1: Consider a coin-flipping experiment:
Heads: win 100
Tails: lose 50
What is the expected yield?
P(X=H)*100 + P(X=T)*(-50)

2: Consider the more general case, where outcome i has yield v_i.
What is the expected yield?
Σ_i P(X=i) * v_i

Calculating Expectations

Let P(X=i) be the distribution of a random variable X.
Let f(x) be a function of x.
Let E_p(f) be the expectation of f(x) based on P(x):
E_p(f) = Σ_x P(x) f(x)


Empirical Expectation

The empirical distribution (denoted p̃) is estimated from the observed sample.

Example: toss a coin 4 times. Result: H, T, H, H
Average return: (100 - 50 + 100 + 100)/4 = 62.5

Empirical distribution: p̃(H) = 3/4, p̃(T) = 1/4
Empirical expectation: 3/4 * 100 + 1/4 * (-50) = 62.5

Example due to F. Xia
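A small sketch of the same computation in Python (assuming the +100/-50 return function from the earlier slide):

```python
from collections import Counter

# Observed tosses and the return function f: H -> +100, T -> -50
tosses = ["H", "T", "H", "H"]
returns = {"H": 100, "T": -50}

counts = Counter(tosses)
p_tilde = {x: c / len(tosses) for x, c in counts.items()}  # {'H': 0.75, 'T': 0.25}
empirical_expectation = sum(p_tilde[x] * returns[x] for x in p_tilde)
print(p_tilde, empirical_expectation)  # -> ... 62.5
```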


Model Expectation

Model: p(x). Assume a fair coin: P(X=H) = 1/2, P(X=T) = 1/2

Model expectation: 1/2 * 100 + 1/2 * (-50) = 25


Empirical Expectation

Example

Training data:
x1  c1  t1 t2 t3
x2  c2  t1 t4
x3  c1  t3 t4
x4  c3  t1 t3

Raw counts:
      t1  t2  t3  t4
c1    1   1   2   1
c2    1   0   0   1
c3    1   0   1   0

Example due to F. Xia

Empirical distribution (raw counts divided by the number of training instances, N = 4):
      t1   t2   t3   t4
c1    1/4  1/4  2/4  1/4
c2    1/4  0    0    1/4
c3    1/4  0    1/4  0

Example due to F. Xia
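A short sketch (not from the slides) that reproduces this empirical distribution by counting (class, feature) pairs and dividing by the number of training instances:

```python
from collections import Counter

# The four training instances from the example above
training = [("x1", "c1", ["t1", "t2", "t3"]),
            ("x2", "c2", ["t1", "t4"]),
            ("x3", "c1", ["t3", "t4"]),
            ("x4", "c3", ["t1", "t3"])]

counts = Counter()
for _, c, feats in training:
    for t in feats:
        counts[(c, t)] += 1

N = len(training)                                   # normalize by N = 4 instances
p_tilde = {pair: n / N for pair, n in counts.items()}
print(p_tilde[("c1", "t3")])  # -> 0.5, i.e. 2/4 as in the table
```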
