Maximum Entropy Modeling and its application to NLP


Page 1: Maximum Entropy Modeling and its application to NLP

Maximum Entropy Modeling and its application to NLP

Utpal Garain, Indian Statistical Institute, Kolkata

http://www.isical.ac.in/~utpal

Page 2: Maximum Entropy Modeling and its application to NLP

Language Engineering in Daily Life

Page 3: Maximum Entropy Modeling and its application to NLP

In Our Daily Life
• Message, email
– We can now type our messages in our own language/script
• Oftentimes I need not write the full text
– My mobile understands what I intend to write!!
– "I ha reac saf" => "I have reached safely"
• Even if I am afraid of typing in my own language (so many letters, spellings are so difficult.. Uffss!!)
– I type my language in "English" and my computer or my mobile types it in my language!!
– mera bharat..

Page 4: Maximum Entropy Modeling and its application to NLP

• I say "maa…" to my cell
– and my mother's number is called!
• I have gone back to my previous days and stopped typing on the computer/mobile
– I just write on a piece of paper or scribble on the screen
– My letters are typed!!
• Those days were so boring…
– If you are an existing customer press 1, otherwise press 2
– If you remember your customer ID press 1, otherwise press 2
– So on and so on..
– Now I just say "1", "service", "cricket" and the telephone understands what I want!!
• My grandma can't read English, but she told me she found her name written in Hindi in the Railway reservation chart
– Do Railway staff type so many names in Hindi every day?
– No!! A computer does this

In Our Daily Life

Page 5: Maximum Entropy Modeling and its application to NLP

• Cross Lingual Information Search
– I wanted to know what exactly happened that created such a big inter-community problem in UP
– My friend told me to read the UP newspapers
– I don't know Hindi
– I gave a query on the net in my language
– I got news articles from UP local newspapers translated into my language!! Unbelievable!!!
• Translation
– I don't know French
– Still I can chat with my French friend

In Our Daily Life

Page 6: Maximum Entropy Modeling and its application to NLP

• I had a problem drawing the diagram for this
– ABCD is a parallelogram, DC is extended to E such that BCE is an equilateral triangle.
– I gave it to my computer and it drew the diagram showing the steps!!
• I got three history books for my son and couldn't decide which one would be good for him
– My computer suggested Book 2 as it has better readability for a grade-V student
– Later on, I found it was right!!!
• I type questions on the net and get answers (oftentimes they are correct!!)
– How does it happen?!!!

In Our Daily Life

Page 7: Maximum Entropy Modeling and its application to NLP

• Language is key to culture
– Communication
– Power and influence
– Identity
– Cultural records
• The multilingual character of Indian society
– Need to preserve this character to move successfully towards closer cooperation at a political, economic, and social level

• Language is both the basis for communication and a barrier

Language

Page 8: Maximum Entropy Modeling and its application to NLP

Role of Language

Courtesy: Simpkins and Diver

Page 9: Maximum Entropy Modeling and its application to NLP

• Application of knowledge of language to the development of computer systems
– That can understand, interpret and generate human language in all its forms
• Comprises a set of
– Techniques and
– Language resources

Language Engineering

Page 10: Maximum Entropy Modeling and its application to NLP

• Get material– Speech, typed/printed/handwritten text, image, video

• Recognize the language and validate it– Encoding scheme, distinguishing separate words..

• Build an understanding of the meaning– Depending on the application you target

• Build the application– Speech to text

• Generate and present the results – Use monitor, printer, plotter, speaker, telephone…

Components of Language Engg.

Page 11: Maximum Entropy Modeling and its application to NLP

• Lexicons– Repository of words and knowledge about them

• Specialist lexicons– Proper names, Terminology– Wordnets

• Grammars
• Corpora
– Language samples
– Text, speech
– Help to train a machine

Language Resources

Page 12: Maximum Entropy Modeling and its application to NLP

NLP vs. Speech

• Consider these two types of problems:
– Problem set 1
• "I teach NLP at M.Tech. CS" => what is it in Bengali?
• Scan newspapers, pick out the news items dealing with forest fires, fill up a database with the relevant information
– Problem set 2
• In someone's utterance you might have difficulty distinguishing "merry" from "very" or "pan" from "ban"
• Context often overcomes this
– Please give me the ??? (pan/ban)
– The choice you made was ??? good.

Page 13: Maximum Entropy Modeling and its application to NLP

NLP

• The NLU community is more concerned about
– Parsing sentences
– Assigning semantic relations to the parts of a sentence
– etc…
• The speech recognition community is concerned about
– Predicting the next word on the basis of the words so far
• Extracting the most likely words from the signal
• Deciding among these possibilities using knowledge about the language

Page 14: Maximum Entropy Modeling and its application to NLP

NLP
• NLU demands "understanding"
– Requires a lot of human effort
• Speech people rely on statistical technology
– The absence of any understanding limits its ability
• Hence a combination of these two techniques

Page 15: Maximum Entropy Modeling and its application to NLP

NLP
• Understanding
– Rule based
• POS tagging
• Tag using a rule base
– "I am going to make some tea"
– "I dislike the make of this shirt"
– Use grammatical rules
– Statistical
• Use probability
• Probability of a tag sequence/path
– PN VG PREP V/N? ADJ N
– PN V ART V/N? PP

Page 16: Maximum Entropy Modeling and its application to NLP

Basic Probability

Page 17: Maximum Entropy Modeling and its application to NLP

Probability Theory
• X: random variable
– Uncertain outcome of some event
• V(X): the set of possible outcomes
– Example event: open an English book to some page; X is the word you point to
– V(X) ranges over all possible words of English
• If x is a possible outcome of X, i.e. x ∈ V(X)
– We write P(X = x), or simply P(x)

• If w_i is the i-th word, the probability of picking the i-th word is

P(w_i) = |w_i| / Σ_j |w_j| = |w_i| / |U|

• If U denotes the universe of all possible outcomes, then the denominator is |U|.
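A minimal sketch of this relative-frequency estimate; the toy corpus below is hypothetical, but any tokenized text would do.

```python
from collections import Counter

# Estimate P(w) = |w| / |U| by counting over a (hypothetical) toy corpus.
corpus = "the cat ate the dog slept here and the cat slept there".split()

counts = Counter(corpus)   # |w_i| for each word
U = len(corpus)            # |U|, total number of outcomes

def p(word):
    """P(X = word) = |word| / |U|."""
    return counts[word] / U

print(p("the"))   # 3/12 = 0.25
print(p("cat"))   # 2/12 ≈ 0.167
```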

Page 18: Maximum Entropy Modeling and its application to NLP

Conditional Probability
• Pick two words that occur in a row -> w1 and w2
– Or, given the first word, guess the second word
– The choice of w1 changes things
• Bayes' law: P(x|y) = P(x) * P(y|x) / P(y)
– Counting check: |x,y|/|y| = (|x|/|U|) * (|x,y|/|x|) / (|y|/|U|)
• Given some evidence e, we want to pick the best conclusion c, i.e. the one maximizing P(c|e)
– We can do this if we know P(c|e) = P(c) * P(e|c) / P(e)
• Once the evidence is fixed, the denominator stays the same for all conclusions.

P(w2 = w_j | w1 = w_i) = |w_i, w_j| / |w_i|

Page 19: Maximum Entropy Modeling and its application to NLP

Conditional Probabiliy

• P(w,x|y,z) = P(w,x) P(y,z|w,x) / P(y,z)
• Generalization (chain rule):
– P(w1,w2,…,wn) = P(w1) P(w2|w1) P(w3|w1,w2) … P(wn|w1,…,wn-1)
– P(w1,w2,…,wn|x) = P(w1|x) P(w2|w1,x) P(w3|w1,w2,x) … P(wn|w1,…,wn-1,x)
– Shorthand: P(W1,n = w1,n), written P(w1,n)

• Example:– John went to ?? (hospital, pink, number, if)

Page 20: Maximum Entropy Modeling and its application to NLP

Conditional Probability
• P(w1,n | speech signal) = P(w1,n) P(signal | w1,n) / P(signal)
• Say the candidate words are (a1,a2,a3), (b1,b2), (c1,c2,c3,c4)
• We compare P(a2,b1,c4 | signal) across hypotheses, using P(a2,b1,c4)
• P(a2,b1,c4 | signal) ∝ P(a2,b1,c4) * P(signal | a2,b1,c4)
• Example:
– The {big / pig} dog
– P(the big dog) = P(the) P(big|the) P(dog|the big)
– P(the pig dog) = P(the) P(pig|the) P(dog|the pig)

Page 21: Maximum Entropy Modeling and its application to NLP

• Predictive Text Entry
– Tod => Today
– I => I
– => have
– a => a
– ta => take, tal => talk
– tod => today, tod, toddler
• Techniques
– Probability of the word
– Probability of the word at position "x"
– Conditional probability
• What is the probability of writing "have" after writing the two words "today" and "I"? (see the sketch after this slide)
• Resource
– Language corpus

Application Building
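A minimal sketch of such a predictor using conditional probabilities over the two preceding words; the toy messages, the `suggest` helper and the prefix filter are all illustrative assumptions, not part of any real product.

```python
from collections import Counter, defaultdict

# Hypothetical training messages; a real system would use a large SMS/email corpus.
messages = [
    "today i have a meeting",
    "today i have reached safely",
    "today i will call you",
]

# Count how often each word follows a given two-word history.
follow = defaultdict(Counter)
for msg in messages:
    words = msg.split()
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        follow[(w1, w2)][w3] += 1

def suggest(history, prefix=""):
    """Rank candidate next words after `history`, optionally filtered by a typed prefix."""
    cands = follow[tuple(history)]
    total = sum(cands.values())
    if not total:
        return []
    return sorted(((w, n / total) for w, n in cands.items() if w.startswith(prefix)),
                  key=lambda t: -t[1])

print(suggest(["today", "i"]))        # 'have' is most probable (2/3)
print(suggest(["today", "i"], "ha"))  # only completions starting with 'ha'
```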

Page 22: Maximum Entropy Modeling and its application to NLP

• Transliteration
– Kamal => কমল
• Indian Railways did it before
– Rule based
• Kazi => কাজী
• Ka => ক or কা
– Difficult to extend to other languages
• Statistical model
– N-gram modeling
– Kamal => ka am ma al; Kazi => ka az zi (see the sketch after this slide)
– কমল => কম মল; কাজী => কা াজ জী
– Alignment of pairs (a difficult computational problem)

Application Building
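A minimal sketch of the character-bigram split shown above ("Kamal => ka am ma al"); pair alignment and probability estimation are not shown here.

```python
# Split a name into overlapping character bigrams, as in the slide.
def char_bigrams(name):
    name = name.lower()
    return [name[i:i + 2] for i in range(len(name) - 1)]

print(char_bigrams("Kamal"))  # ['ka', 'am', 'ma', 'al']
print(char_bigrams("Kazi"))   # ['ka', 'az', 'zi']
```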

Page 23: Maximum Entropy Modeling and its application to NLP

• Probabilities are computed for
– P(ka => ক), P(ক), P(কমল), etc.
• The most probable word is the output
• Advantage:
– Easily extendable to any language pair
– Multiple choices are given (ranked)
• Resources needed
– Name pairs
– Language model

Transliteration

Page 24: Maximum Entropy Modeling and its application to NLP

Statistical Models and Methods

Page 25: Maximum Entropy Modeling and its application to NLP

Statistical models and methods
• Intuition lets us make crude probability judgments
• Entropy
– Situation          Prob.
– No occurrence      0.5
– 1st occurrence     0.125
– 2nd occurrence     0.125
– Both               0.25
• Expected code length: [1*1/2 + 2*1/4 + 3*(1/8 + 1/8)] bits = 1.75 bits
• If a random variable W takes on one of several values V(W), its entropy is H(W) = -Σ_{w ∈ V(W)} P(w) log P(w)
• -log P(w) bits are required to code w
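A minimal sketch checking the 1.75-bit figure for the four-outcome distribution above.

```python
import math

# H(W) = -sum_w P(w) * log2 P(w) for the distribution on the slide.
probs = {"no occurrence": 0.5, "1st occurrence": 0.125,
         "2nd occurrence": 0.125, "both": 0.25}

H = -sum(p * math.log2(p) for p in probs.values())
print(H)  # 1.75 bits, matching the code lengths 1, 3, 3, 2 bits
```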

Page 26: Maximum Entropy Modeling and its application to NLP

Use in Speech
• Vocabulary: {the, a, cat, dog, ate, slept, here, there}
• If each word is used equally often and independently
• Then the entropy of the language is
-P(the) log P(the) - P(a) log P(a) - … = 8 * (-1/8 * log 1/8) = 3 bits/word
– More generally, H(L) = lim_{n→∞} -(1/n) Σ_{w1,n} P(w1,n) log P(w1,n)

Page 27: Maximum Entropy Modeling and its application to NLP

Markov Chain

• If we remove the numbers (the probabilities on the arcs), it is a finite state automaton that is an acceptor as well as a generator
• Adding the probabilities makes it a probabilistic finite state automaton => a Markov Chain
• Assuming all states are accepting states (a Markov Process), we can compute the probability of generating a given string
• It is the product of the probabilities of the arcs traversed in generating the string.

Page 28: Maximum Entropy Modeling and its application to NLP

Cross entropy
• Per-word entropy of the previous model is
– -[0.5 log(1/2) + 0.5 log(1/2)]
– At each state there are only two equiprobable choices, so
• H(p) = 1 bit/word
• If we instead consider each word equiprobable, then H(p_M) = 3 bits/word
• Cross Entropy
– The cross entropy of a set of random variables W1,n, where the correct model is P(w1,n) but the probabilities are estimated using the model P_M(w1,n), is

H(W1,n, P_M) = -Σ_{w1,n} P(w1,n) log P_M(w1,n)

Page 29: Maximum Entropy Modeling and its application to NLP

Cross entropy
• Per-word cross entropy is (1/n) H(W1,n, P_M)
• Per-word entropy of the given Markov Chain: 1
• If we slightly change the model:
– Outgoing probabilities are 0.75 and 0.25
– per-word cross entropy becomes
• -[1/2 log(3/4) + 1/2 log(1/4)] = -(1/2)[log 3 - log 4 + log 1 - log 4] = 2 - (log 3)/2 ≈ 1.2
• For an incorrect model:
– H(W1,n) ≤ H(W1,n, P_M)
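A minimal sketch checking the numbers above: the true model has two equiprobable choices at each state, the changed model assigns them 3/4 and 1/4.

```python
import math

H_true = -(0.5 * math.log2(0.5) + 0.5 * math.log2(0.5))     # 1.0 bit/word
H_cross = -(0.5 * math.log2(0.75) + 0.5 * math.log2(0.25))  # ≈ 1.21 bits/word

print(H_true, H_cross)  # cross entropy >= entropy, as claimed
```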


Page 31: Maximum Entropy Modeling and its application to NLP

Cross entropy
• Cross entropy of a language
– A stochastic process is ergodic if its statistical properties can be computed from a single sufficiently large sample of the process.
– Assuming L is an ergodic language
– The cross entropy of L is

H(L, P_M) = lim_{n→∞} -(1/n) Σ_{w1,n} P(w1,n) log P_M(w1,n)
          = lim_{n→∞} -(1/n) log P_M(w1,n)

Page 32: Maximum Entropy Modeling and its application to NLP

Corpus
• Brown corpus
– Coverage
• 500 text segments of 2,000 words each
• Press, reportage, etc.: 44
• Press, editorial, etc.: 27
• Press, reviews: 17
• Religion: books, periodicals, etc.: 17
• …

Page 33: Maximum Entropy Modeling and its application to NLP

Trigram Model

Page 34: Maximum Entropy Modeling and its application to NLP

Trigram models
• N-gram model…
• Trigram assumption: P(wn | w1 … wn-1) = P(wn | wn-2, wn-1)
• P(w1,n) = P(w1) P(w2|w1) P(w3|w1,w2) .. P(wn|w1,n-1)
          = P(w1) P(w2|w1) P(w3|w1,w2) .. P(wn|wn-2,n-1)
• Pseudo-words w-1 and w0 let even the first two factors be written in trigram form
• Estimation from counts, e.g. "to create such":
– #(to create such) = ?
– #(to create) = ?

P(w1,n) = P(w1) P(w2|w1) Π_{i=3}^{n} P(wi | wi-2, wi-1)

P_e(wi | wi-2, wi-1) = C(wi-2, wi-1, wi) / C(wi-2, wi-1)
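A minimal sketch of this count-based trigram estimate; the toy corpus below is a hypothetical stand-in for real training text.

```python
from collections import Counter

# P_e(w_i | w_{i-2}, w_{i-1}) = C(w_{i-2}, w_{i-1}, w_i) / C(w_{i-2}, w_{i-1})
corpus = ("we want to create such a model and to create such data "
          "and to create applications").split()

bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

def p_trigram(w1, w2, w3):
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

print(p_trigram("to", "create", "such"))  # #(to create such) / #(to create) = 2/3
```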

Page 35: Maximum Entropy Modeling and its application to NLP

Trigram as Markov Chain

• It is not possible to determine the state of the machine simply on the basis of the last output (the last two outputs are needed)

• Markov chain of order 2

Page 36: Maximum Entropy Modeling and its application to NLP

Problem of sparse data
• Jelinek studied
– a 1,500,000-word corpus
– Extracted trigrams
– Applied them to 300,000 words of new text
– 25% of the trigram types were missing

Page 37: Maximum Entropy Modeling and its application to NLP

Maximum Entropy Model

Page 38: Maximum Entropy Modeling and its application to NLP

An example

• Machine translation
– "star" in English
• Translations in Hindi:
– सितारा, तारा, तारक, प्रसिद्ध अभिनेता, भाग्य
• First statistic of this process
– p(सितारा) + p(तारा) + p(तारक) + p(प्रसिद्ध अभिनेता) + p(भाग्य) = 1
• There are an infinite number of models p for which this identity holds

Page 39: Maximum Entropy Modeling and its application to NLP

• One model
– p(सितारा) = 1
– This model always predicts सितारा
• Another model
– p(तारा) = ½
– p(प्रसिद्ध अभिनेता) = ½
• These models offend our sensibilities
– The expert always chose from the five choices
– How can we justify either of these probability distributions?
– These models make bold assumptions without empirical justification

Page 40: Maximum Entropy Modeling and its application to NLP

• What we know
– The expert chose exclusively from these five words
• The most intuitively appealing model is
– p(सितारा) = 1/5
– p(तारा) = 1/5
– p(तारक) = 1/5
– p(प्रसिद्ध अभिनेता) = 1/5
– p(भाग्य) = 1/5
• The most uniform model subject to our knowledge

Page 41: Maximum Entropy Modeling and its application to NLP

• Suppose we notice that the expert chose either सितारा or तारा 30% of the time
• We apply this knowledge to update our model
– p(सितारा) + p(तारा) = 3/10
– p(सितारा) + p(तारा) + p(तारक) + p(प्रसिद्ध अभिनेता) + p(भाग्य) = 1
• Many probability distributions are consistent with the above constraints
• A reasonable choice for p is again the most uniform

Page 42: Maximum Entropy Modeling and its application to NLP

• i.e. the distribution that allocates its probability as evenly as possible, subject to the constraints
– p(सितारा) = 3/20
– p(तारा) = 3/20
– p(तारक) = 7/30
– p(प्रसिद्ध अभिनेता) = 7/30
– p(भाग्य) = 7/30
• Say we inspect the data once more and notice another interesting fact
– In half the cases, the expert chose either सितारा or प्रसिद्ध अभिनेता

Page 43: Maximum Entropy Modeling and its application to NLP

• So we add a third constraint
– p(सितारा) + p(तारा) = 3/10
– p(सितारा) + p(तारा) + p(तारक) + p(प्रसिद्ध अभिनेता) + p(भाग्य) = 1
– p(सितारा) + p(प्रसिद्ध अभिनेता) = ½
• Now if we look for the most uniform p satisfying the constraints, the choice is not as obvious
• As complexity is added, we face two difficulties
– What is meant by "uniform", and how can we measure the uniformity of a model?
– How will we find the most uniform model subject to a set of constraints?
• The maximum entropy method (E. T. Jaynes) answers both of these questions

Page 44: Maximum Entropy Modeling and its application to NLP

Maximum Entropy Modeling
• Consider a random process that produces an output value y (a member of a finite set Y)
• For the translation example just considered, the process generates a translation of the word star, and the output y can be any word in the set {सितारा, तारा, तारक, प्रसिद्ध अभिनेता, भाग्य}.

• In generating y, the process may be influenced by some contextual information x, a member of a finite set X.

• In the present example, this information could include the words in the English sentence surrounding star.

• Our task is to construct a stochastic model that accurately represents the behavior of the random process.

Page 45: Maximum Entropy Modeling and its application to NLP

Maximum Entropy Modeling
• Such a model is a method of estimating the conditional probability that, given a context x, the process will output c (the class label; the same output called y above).
• We will denote by p(c|x) the probability that the model assigns to output c in context x.

• We will denote by P the set of all conditional probability distributions. Thus a model p(c|x) is, by definition, just an element of P.

Page 46: Maximum Entropy Modeling and its application to NLP

Training Data• A large number of samples – (x1,c1), (x2, c2) . . . (xN, cN).

• Each sample would consist of a phrase x containing the words surrounding star, together with the translation c of star that the process produced.

• The empirical probability distribution p̃ is

p̃(x, c) = (1/N) × (number of times that (x, c) occurs in the sample)
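A minimal sketch of this empirical distribution; the (context, translation) pairs below are hypothetical stand-ins, written in Roman script for readability.

```python
from collections import Counter

# p~(x, c) = count(x, c) / N over a (hypothetical) training sample.
samples = [("cinema", "actor"), ("cinema", "actor"), ("sky", "sitara"),
           ("sky", "tara"), ("luck", "bhagya")]

N = len(samples)
counts = Counter(samples)

def p_tilde(x, c):
    return counts[(x, c)] / N

print(p_tilde("cinema", "actor"))  # 2/5 = 0.4
```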

Page 47: Maximum Entropy Modeling and its application to NLP

Features
• The features fi are binary functions that can be used to characterize any property of a pair (x, c)
• x is a vector representing an input element and c is the class label
• Example: f(x, c) = 1 if c = प्रसिद्ध अभिनेता and star follows cinema; otherwise f(x, c) = 0

Page 48: Maximum Entropy Modeling and its application to NLP

Features
• We have two things in hand
– The empirical distribution p̃(x, c)
– The model p(c|x)
• The expected value of f with respect to the empirical distribution is

p̃(f) = Σ_{x,c} p̃(x, c) f(x, c)

• The expected value of f with respect to the model p(c|x) is

p(f) = Σ_{x,c} p̃(x) p(c|x) f(x, c)

• Our constraint is

p(f) = p̃(f)
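A minimal sketch of the two expectations in this constraint; the samples, the feature, and the uniform toy model below are all illustrative assumptions.

```python
# Hypothetical (context, class) samples, Romanized for readability.
samples = [("cinema", "actor"), ("cinema", "actor"), ("sky", "sitara"), ("sky", "tara")]
classes = ["actor", "sitara", "tara"]
N = len(samples)

def f(x, c):
    # binary feature: class is "actor" and the context word is "cinema"
    return 1 if (c == "actor" and x == "cinema") else 0

# empirical expectation: sum_{x,c} p~(x, c) f(x, c)
emp = sum(f(x, c) for x, c in samples) / N

# model expectation: sum_{x,c} p~(x) p(c|x) f(x, c), here with a uniform model p(c|x)
p_model = lambda c, x: 1 / len(classes)
mod = sum((1 / N) * p_model(c, x) * f(x, c) for x, _ in samples for c in classes)

print(emp, mod)  # 0.5 vs ~0.167; training adjusts the weights until these match
```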

Page 49: Maximum Entropy Modeling and its application to NLP

Classification
• For a given x we need to know its class label c
– p(x, c)
• Loglinear models
– A general and very important class of models for classification with categorical variables
– Logistic regression is another example
– K is the number of features, α_i is the weight for feature fi, and Z is a normalizing constant used to ensure that a probability distribution results.

p(x, c) = (1/Z) Π_{i=1}^{K} α_i^{f_i(x,c)}

log p(x, c) = -log Z + Σ_{i=1}^{K} f_i(x, c) log α_i

Page 50: Maximum Entropy Modeling and its application to NLP

An example

• Text classification
• x consists of a single element, indicating presence or absence of the word profit in the article
• Classes c
– two classes: "earnings" or not
• Features
– Two features
• f1: 1 if and only if the article is "earnings" and the word profit is in it
• f2: the filler feature (fK+1)
– C is the greatest possible feature sum

f_{K+1}(x, c) = C - Σ_{i=1}^{K} f_i(x, c)

Page 51: Maximum Entropy Modeling and its application to NLP

An example

x (profit)   c ("earnings")   f1   f2   f1·log α1 + f2·log α2   Π_i α_i^f_i
0            0                0    1    1                        2
0            1                0    1    1                        2
1            0                0    1    1                        2
1            1                1    0    2                        4

• Parameters:
– log α1 = 2.0, log α2 = 1.0 (i.e. α1 = 4, α2 = 2, logs to base 2)
• Z = 2 + 2 + 2 + 4 = 10
• p(0,0) = p(0,1) = p(1,0) = 2/10 = 0.2
• p(1,1) = 4/10 = 0.4
• A data set that follows the same empirical distribution
– ((0,0), (0,1), (1,0), (1,1), (1,1))

The unnormalized weight in the last column is

Π_{i=1}^{K} α_i^{f_i(x,c)} = 2^{Σ_i f_i(x,c) log α_i}
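A minimal sketch reproducing the numbers in this table (Z = 10, p(1,1) = 0.4); the feature definitions and α values come from the slides, the Python layout is only illustrative.

```python
# Two binary features, log2(alpha1) = 2 and log2(alpha2) = 1, i.e. alpha1 = 4, alpha2 = 2.
alpha = [4, 2]

def features(x, c):
    f1 = 1 if (x == 1 and c == 1) else 0   # "profit" present and class "earnings"
    f2 = 1 - f1                            # filler feature, so the feature sum is constant
    return [f1, f2]

events = [(0, 0), (0, 1), (1, 0), (1, 1)]
unnorm = {e: alpha[0] ** features(*e)[0] * alpha[1] ** features(*e)[1] for e in events}
Z = sum(unnorm.values())

print(Z)                                   # 10
print({e: unnorm[e] / Z for e in events})  # 0.2, 0.2, 0.2 and 0.4 for (1,1)
```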

Page 52: Maximum Entropy Modeling and its application to NLP

Computation of α_i and Z
• We search for a model p* such that

E_{p*} f_i = E_{p̃} f_i

• Empirical expectation

E_{p̃} f_i = Σ_{x,c} p̃(x, c) f_i(x, c) = (1/N) Σ_{j=1}^{N} f_i(x_j, c_j)

• In general E_p f_i cannot be computed efficiently, as it would require summing over all possible combinations of x and c, a huge or infinite set.
• The following approximation is used instead

E_p f_i = Σ_{x,c} p̃(x) p(c|x) f_i(x, c) ≈ (1/N) Σ_{j=1}^{N} Σ_c p(c|x_j) f_i(x_j, c)

Page 53: Maximum Entropy Modeling and its application to NLP

Generalized Iterative Scaling Algo
• Step 1
– For all i = 1, …, K+1, initialize α_i^(1).
– Compute the empirical expectations E_{p̃} f_i
– Set n = 1
• Step 2
– Compute p^(n)(x, c) for the distribution p^(n) given by the {α_j^(n)}, for each element (x, c) in the training set

p^(n)(x, c) = (1/Z_n) Π_{i=1}^{K+1} (α_i^(n))^{f_i(x,c)},   where Z_n = Σ_{x,c} Π_{i=1}^{K+1} (α_i^(n))^{f_i(x,c)}

Page 54: Maximum Entropy Modeling and its application to NLP

Generalized Iterative Scaling Algo
• Step 3
– Compute E_{p^(n)}(f_i) for all i = 1, …, K+1 according to the formula shown before
• Step 4
– Update the parameters α_i

α_i^(n+1) = α_i^(n) ( E_{p̃} f_i / E_{p^(n)} f_i )^{1/C}

• Step 5
– If the parameters have converged, stop; otherwise increment n and go to Step 2.
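A minimal sketch of this GIS loop on the toy "earnings" data set from the earlier slides, assuming the conditionally normalized model p(c|x) used in the expectation approximation; the iteration count and variable names are illustrative choices, not part of the original slides.

```python
# Toy data set ((0,0),(0,1),(1,0),(1,1),(1,1)) and features from the earlier slides.
samples = [(0, 0), (0, 1), (1, 0), (1, 1), (1, 1)]
classes = [0, 1]
N = len(samples)
K = 2        # f1 plus the filler feature
C = 1.0      # greatest possible feature sum (f1 + f2 is always 1 here)

def features(x, c):
    f1 = 1 if (x == 1 and c == 1) else 0   # "profit" present and class "earnings"
    f2 = 1 - f1                            # filler feature f_{K+1} = C - f1
    return [f1, f2]

# empirical expectations E_p~[f_i] = (1/N) sum_j f_i(x_j, c_j)
emp = [sum(features(x, c)[i] for x, c in samples) / N for i in range(K)]

alpha = [1.0, 1.0]                         # Step 1: initialize the weights

def p_cond(c, x):
    # model p(c|x) proportional to prod_i alpha_i^{f_i(x,c)}
    def score(cc):
        fs = features(x, cc)
        return alpha[0] ** fs[0] * alpha[1] ** fs[1]
    return score(c) / sum(score(cc) for cc in classes)

for n in range(200):                       # Steps 2-5: iterate to convergence
    # model expectations E_p[f_i] ~= (1/N) sum_j sum_c p(c|x_j) f_i(x_j, c)
    model = [sum(p_cond(c, x) * features(x, c)[i]
                 for x, _ in samples for c in classes) / N
             for i in range(K)]
    # Step 4: alpha_i <- alpha_i * (E_p~[f_i] / E_p[f_i])^(1/C)
    alpha = [a * (e / m) ** (1.0 / C) for a, e, m in zip(alpha, emp, model)]

print(alpha)
print(p_cond(1, 1))   # ~2/3, the empirical fraction of "earnings" when "profit" occurs
```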

Page 55: Maximum Entropy Modeling and its application to NLP

Application of MaxEnt in NLP
• POS tagger
– Stanford tagger
• At our lab
– Honorific information
– Use of this information for Anaphora Resolution
– BioNLP
• Entity tagger
• Stanford Univ. has open source code for MaxEnt
• You can also use their implementation for your own task.

Page 56: Maximum Entropy Modeling and its application to NLP

HMM, MaxEnt and CRF
• HMM
– Models observation and class together
• MaxEnt
– Makes local decisions
• CRF
– Combines the strengths of HMM and MaxEnt