Maximum Entropy Modeling and its application to NLP


Page 1: Maximum Entropy Modeling and its application to NLP

Maximum Entropy Modeling and its application to NLP

Utpal Garain, Indian Statistical Institute, Kolkata

http://www.isical.ac.in/~utpal

Page 2: Maximum Entropy Modeling and its application to NLP

Language Engineering in Daily Life

Page 3: Maximum Entropy Modeling and its application to NLP

In Our Daily Life
• Message, email
– We can now type our messages in our own language/script
• Oftentimes I need not write the full text
– My mobile understands what I intend to write!!
– "I ha reac saf" => "I have reached safely"
• Even if I am afraid of typing in my own language (so many letters, spellings are so difficult.. Uffss!!)
– I type my language in "English" and my computer or my mobile types it in my language!!
– mera bharat..

Page 4: Maximum Entropy Modeling and its application to NLP

• I say "maa…" to my cell
– and my mother's number is called!
• I have gone back to my previous days and stopped typing on the computer/mobile
– I just write on a piece of paper or scribble on the screen
– My letters are typed!!
• Those days were so boring…
– If you are an existing customer press 1, otherwise press 2
– If you remember your customer ID press 1, otherwise press 2
– So on and so on..
– Now I just say "1", "service", "cricket" and the telephone understands what I want!!
• My grandma can't read English, but she told me she found her name written in Hindi in the Railway reservation chart
– Do Railway staff type so many names in Hindi every day?
– No!! A computer does this

In Our Daily Life

Page 5: Maximum Entropy Modeling and its application to NLP

• Cross Lingual Information Search
– I wanted to know what exactly happened that created such a big inter-community problem in UP
– My friend told me to read the UP newspapers
– I don't know Hindi
– I gave a query on the net in my language
– I got news articles from UP local newspapers translated into my language!! Unbelievable!!!
• Translation
– I don't know French
– Still I can chat with my French friend

In Our Daily Life

Page 6: Maximum Entropy Modeling and its application to NLP

• I had a problem drawing the diagram for this
– ABCD is a parallelogram, DC is extended to E such that BCE is an equilateral triangle.
– I gave it to my computer and it drew the diagram showing the steps!!
• I got three history books for my son and couldn't decide which one would be good for him
– My computer suggested Book 2 as it has better readability for a grade-V student
– Later on, I found it was right!!!
• I type questions on the net and get answers (oftentimes they are correct!!)
– How does it happen?!!!

In Our Daily Life

Page 7: Maximum Entropy Modeling and its application to NLP

• Language is key to culture
– Communication
– Power and influence
– Identity
– Cultural records
• The multilingual character of Indian society
– Need to preserve this character to move successfully towards closer cooperation at a political, economic, and social level

• Language is both the basis for communication and a barrier

Language

Page 8: Maximum Entropy Modeling and its application to NLP

Role of Language

Courtesy: Simpkins and Diver

Page 9: Maximum Entropy Modeling and its application to NLP

• Application of knowledge of language to the development of computer systems
– That can understand, interpret and generate human language in all its forms
• Comprises a set of
– Techniques and
– Language resources

Language Engineering

Page 10: Maximum Entropy Modeling and its application to NLP

• Get material– Speech, typed/printed/handwritten text, image, video

• Recognize the language and validate it– Encoding scheme, distinguishing separate words..

• Build an understanding of the meaning– Depending on the application you target

• Build the application– Speech to text

• Generate and present the results – Use monitor, printer, plotter, speaker, telephone…

Components of Language Engg.

Page 11: Maximum Entropy Modeling and its application to NLP

• Lexicons– Repository of words and knowledge about them

• Specialist lexicons– Proper names, Terminology– Wordnets

• Grammars
• Corpora
– Language samples
– Text, speech
– Help to train a machine

Language Resources

Page 12: Maximum Entropy Modeling and its application to NLP

NLP vs. Speech

• Consider these two types of problems:
– Problem set 1
• "I teach NLP at M.Tech. CS" => what is it in Bengali?
• Scan newspapers, pick out the news items dealing with forest fires, fill up a database with the relevant information
– Problem set 2
• In someone's utterance you might have difficulty distinguishing "merry" from "very" or "pan" from "ban"
• Context often overcomes this
– Please give me the ??? (pan/ban)
– The choice you made was ??? good.

Page 13: Maximum Entropy Modeling and its application to NLP

NLP

• The NLU community is more concerned about
– Parsing sentences
– Assigning semantic relations to the parts of a sentence
– etc…
• The speech recognition community is concerned about
– Predicting the next word on the basis of the words so far
• Extracting the most likely words from the signal
• Deciding among these possibilities using knowledge about the language

Page 14: Maximum Entropy Modeling and its application to NLP

NLP
• NLU demands "understanding"
– Requires a lot of human effort
• Speech people rely on statistical technology
– The absence of any understanding limits its ability
• Hence a combination of these two techniques

Page 15: Maximum Entropy Modeling and its application to NLP

NLP
• Understanding
– Rule based
• POS tagging
• Tag using a rule base
– "I am going to make some tea"
– "I dislike the make of this shirt"
– Use grammatical rules
– Statistical
• Use probability
• Probability of a tag sequence/path
– PN VG PREP V/N? ADJ N
– PN V ART V/N? PP

Page 16: Maximum Entropy Modeling and its application to NLP

Basic Probability

Page 17: Maximum Entropy Modeling and its application to NLP

Probability Theory
• X: random variable
– Uncertain outcome of some event
• V(X): the set of possible outcomes
– Example event: open an English book to some page; X is the word you point to
– V(X) ranges over all possible words of English
• If x is a possible outcome of X, i.e. x ∈ V(X)
– We write P(X = x), or simply P(x)

• If w_i is the i-th word, the probability of picking the i-th word is

P(w_i) = |w_i| / Σ_j |w_j| = |w_i| / |U|

• If U denotes the universe of all possible outcomes, then the denominator is |U|.
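A minimal sketch of this relative-frequency estimate; the toy corpus below is hypothetical, but any tokenized text would do.

```python
from collections import Counter

# Estimate P(w) = |w| / |U| by counting over a (hypothetical) toy corpus.
corpus = "the cat ate the dog slept here and the cat slept there".split()

counts = Counter(corpus)   # |w_i| for each word
U = len(corpus)            # |U|, total number of outcomes

def p(word):
    """P(X = word) = |word| / |U|."""
    return counts[word] / U

print(p("the"))   # 3/12 = 0.25
print(p("cat"))   # 2/12 ≈ 0.167
```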

Page 18: Maximum Entropy Modeling and its application to NLP

Conditional Probability
• Pick two words that occur in a row -> w1 and w2
– Or, given the first word, guess the second word
– The choice of w1 changes things
• Bayes' law: P(x|y) = P(x) * P(y|x) / P(y)
– Counting check: |x,y|/|y| = (|x|/|U|) * (|x,y|/|x|) / (|y|/|U|)
• Given some evidence e, we want to pick the best conclusion c, i.e. the one maximizing P(c|e)
– We can do this if we know P(c|e) = P(c) * P(e|c) / P(e)
• Once the evidence is fixed, the denominator stays the same for all conclusions.

P(w2 = w_j | w1 = w_i) = |w_i, w_j| / |w_i|

Page 19: Maximum Entropy Modeling and its application to NLP

Conditional Probabiliy

• P(w,x|y,z) = P(w,x) P(y,z|w,x) / P(y,z)
• Generalization (chain rule):
– P(w1,w2,…,wn) = P(w1) P(w2|w1) P(w3|w1,w2) … P(wn|w1,…,wn-1)
– P(w1,w2,…,wn|x) = P(w1|x) P(w2|w1,x) P(w3|w1,w2,x) … P(wn|w1,…,wn-1,x)
– Shorthand: P(W1,n = w1,n), written P(w1,n)

• Example:– John went to ?? (hospital, pink, number, if)

Page 20: Maximum Entropy Modeling and its application to NLP

Conditional Probability
• P(w1,n | speech signal) = P(w1,n) P(signal | w1,n) / P(signal)
• Say the candidate words are (a1,a2,a3), (b1,b2), (c1,c2,c3,c4)
• We compare P(a2,b1,c4 | signal) across hypotheses, using P(a2,b1,c4)
• P(a2,b1,c4 | signal) ∝ P(a2,b1,c4) * P(signal | a2,b1,c4)
• Example:
– The {big / pig} dog
– P(the big dog) = P(the) P(big|the) P(dog|the big)
– P(the pig dog) = P(the) P(pig|the) P(dog|the pig)

Page 21: Maximum Entropy Modeling and its application to NLP

• Predictive Text Entry
– Tod => Today
– I => I
– => have
– a => a
– ta => take, tal => talk
– tod => today, tod, toddler
• Techniques
– Probability of the word
– Probability of the word at position "x"
– Conditional probability
• What is the probability of writing "have" after writing the two words "today" and "I"? (see the sketch after this slide)
• Resource
– Language corpus

Application Building
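A minimal sketch of such a predictor using conditional probabilities over the two preceding words; the toy messages, the `suggest` helper and the prefix filter are all illustrative assumptions, not part of any real product.

```python
from collections import Counter, defaultdict

# Hypothetical training messages; a real system would use a large SMS/email corpus.
messages = [
    "today i have a meeting",
    "today i have reached safely",
    "today i will call you",
]

# Count how often each word follows a given two-word history.
follow = defaultdict(Counter)
for msg in messages:
    words = msg.split()
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        follow[(w1, w2)][w3] += 1

def suggest(history, prefix=""):
    """Rank candidate next words after `history`, optionally filtered by a typed prefix."""
    cands = follow[tuple(history)]
    total = sum(cands.values())
    if not total:
        return []
    return sorted(((w, n / total) for w, n in cands.items() if w.startswith(prefix)),
                  key=lambda t: -t[1])

print(suggest(["today", "i"]))        # 'have' is most probable (2/3)
print(suggest(["today", "i"], "ha"))  # only completions starting with 'ha'
```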

Page 22: Maximum Entropy Modeling and its application to NLP

• Transliteration
– Kamal => কমল
• Indian Railways did it before
– Rule based
• Kazi => কাজী
• Ka => ক or কা
– Difficult to extend to other languages
• Statistical model
– N-gram modeling
– Kamal => ka am ma al; Kazi => ka az zi (see the sketch after this slide)
– কমল => কম মল; কাজী => কা াজ জী
– Alignment of pairs (a difficult computational problem)

Application Building
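A minimal sketch of the character-bigram split shown above ("Kamal => ka am ma al"); pair alignment and probability estimation are not shown here.

```python
# Split a name into overlapping character bigrams, as in the slide.
def char_bigrams(name):
    name = name.lower()
    return [name[i:i + 2] for i in range(len(name) - 1)]

print(char_bigrams("Kamal"))  # ['ka', 'am', 'ma', 'al']
print(char_bigrams("Kazi"))   # ['ka', 'az', 'zi']
```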

Page 23: Maximum Entropy Modeling and its application to NLP

• Probabilities are computed for
– P(ka => ক), P(ক), P(কমল), etc.
• The most probable word is the output
• Advantage:
– Easily extendable to any language pair
– Multiple choices are given (ranked)
• Resources needed
– Name pairs
– Language model

Transliteration

Page 24: Maximum Entropy Modeling and its application to NLP

Statistical Models and Methods

Page 25: Maximum Entropy Modeling and its application to NLP

Statistical models and methods
• Intuition lets us make crude probability judgments
• Entropy
– Situation          Prob.
– No occurrence      0.5
– 1st occurrence     0.125
– 2nd occurrence     0.125
– Both               0.25
• Expected code length: [1*1/2 + 2*1/4 + 3*(1/8 + 1/8)] bits = 1.75 bits
• If a random variable W takes on one of several values V(W), its entropy is H(W) = -Σ_{w ∈ V(W)} P(w) log P(w)
• -log P(w) bits are required to code w
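A minimal sketch checking the 1.75-bit figure for the four-outcome distribution above.

```python
import math

# H(W) = -sum_w P(w) * log2 P(w) for the distribution on the slide.
probs = {"no occurrence": 0.5, "1st occurrence": 0.125,
         "2nd occurrence": 0.125, "both": 0.25}

H = -sum(p * math.log2(p) for p in probs.values())
print(H)  # 1.75 bits, matching the code lengths 1, 3, 3, 2 bits
```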

Page 26: Maximum Entropy Modeling and its application to NLP

Use in Speech
• Vocabulary: {the, a, cat, dog, ate, slept, here, there}
• If each word is used equally often and independently
• Then the entropy of the language is
-P(the) log P(the) - P(a) log P(a) - … = 8 * (-1/8 * log 1/8) = 3 bits/word
– More generally, H(L) = lim_{n→∞} -(1/n) Σ_{w1,n} P(w1,n) log P(w1,n)

Page 27: Maximum Entropy Modeling and its application to NLP

Markov Chain

• If we remove the numbers (the probabilities on the arcs), it is a finite state automaton that is an acceptor as well as a generator
• Adding the probabilities makes it a probabilistic finite state automaton => a Markov Chain
• Assuming all states are accepting states (a Markov Process), we can compute the probability of generating a given string
• It is the product of the probabilities of the arcs traversed in generating the string.

Page 28: Maximum Entropy Modeling and its application to NLP

Cross entropy
• Per-word entropy of the previous model is
– -[0.5 log(1/2) + 0.5 log(1/2)]
– At each state there are only two equiprobable choices, so
• H(p) = 1 bit/word
• If we instead consider each word equiprobable, then H(p_M) = 3 bits/word
• Cross Entropy
– The cross entropy of a set of random variables W1,n, where the correct model is P(w1,n) but the probabilities are estimated using the model P_M(w1,n), is

H(W1,n, P_M) = -Σ_{w1,n} P(w1,n) log P_M(w1,n)

Page 29: Maximum Entropy Modeling and its application to NLP

Cross entropy
• Per-word cross entropy is (1/n) H(W1,n, P_M)
• Per-word entropy of the given Markov Chain: 1
• If we slightly change the model:
– Outgoing probabilities are 0.75 and 0.25
– per-word cross entropy becomes
• -[1/2 log(3/4) + 1/2 log(1/4)] = -(1/2)[log 3 - log 4 + log 1 - log 4] = 2 - (log 3)/2 ≈ 1.2
• For an incorrect model:
– H(W1,n) ≤ H(W1,n, P_M)
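A minimal sketch checking the numbers above: the true model has two equiprobable choices at each state, the changed model assigns them 3/4 and 1/4.

```python
import math

H_true = -(0.5 * math.log2(0.5) + 0.5 * math.log2(0.5))     # 1.0 bit/word
H_cross = -(0.5 * math.log2(0.75) + 0.5 * math.log2(0.25))  # ≈ 1.21 bits/word

print(H_true, H_cross)  # cross entropy >= entropy, as claimed
```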


Page 31: Maximum Entropy Modeling and its application to NLP

Cross entropy
• Cross entropy of a language
– A stochastic process is ergodic if its statistical properties can be computed from a single sufficiently large sample of the process.
– Assuming L is an ergodic language
– The cross entropy of L is

H(L, P_M) = lim_{n→∞} -(1/n) Σ_{w1,n} P(w1,n) log P_M(w1,n)
          = lim_{n→∞} -(1/n) log P_M(w1,n)

Page 32: Maximum Entropy Modeling and its application to NLP

Corpus
• Brown corpus
– Coverage
• 500 text segments of 2,000 words each
• Press, reportage, etc.: 44
• Press, editorial, etc.: 27
• Press, reviews: 17
• Religion: books, periodicals, etc.: 17
• …

Page 33: Maximum Entropy Modeling and its application to NLP

Trigram Model

Page 34: Maximum Entropy Modeling and its application to NLP

Trigram models
• N-gram model…
• Trigram assumption: P(wn | w1 … wn-1) = P(wn | wn-2, wn-1)
• P(w1,n) = P(w1) P(w2|w1) P(w3|w1,w2) .. P(wn|w1,n-1)
          = P(w1) P(w2|w1) P(w3|w1,w2) .. P(wn|wn-2,n-1)
• Pseudo-words w-1 and w0 let even the first two factors be written in trigram form
• Estimation from counts, e.g. "to create such":
– #(to create such) = ?
– #(to create) = ?

P(w1,n) = P(w1) P(w2|w1) Π_{i=3}^{n} P(wi | wi-2, wi-1)

P_e(wi | wi-2, wi-1) = C(wi-2, wi-1, wi) / C(wi-2, wi-1)
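A minimal sketch of this count-based trigram estimate; the toy corpus below is a hypothetical stand-in for real training text.

```python
from collections import Counter

# P_e(w_i | w_{i-2}, w_{i-1}) = C(w_{i-2}, w_{i-1}, w_i) / C(w_{i-2}, w_{i-1})
corpus = ("we want to create such a model and to create such data "
          "and to create applications").split()

bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

def p_trigram(w1, w2, w3):
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

print(p_trigram("to", "create", "such"))  # #(to create such) / #(to create) = 2/3
```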

Page 35: Maximum Entropy Modeling and its application to NLP

Trigram as Markov Chain

• It is not possible to determine the state of the machine simply on the basis of the last output (the last two outputs are needed)

• Markov chain of order 2

Page 36: Maximum Entropy Modeling and its application to NLP

Problem of sparse data
• Jelinek studied
– a 1,500,000-word corpus
– Extracted trigrams
– Applied them to 300,000 words of new text
– 25% of the trigram types were missing

Page 37: Maximum Entropy Modeling and its application to NLP

Maximum Entropy Model

Page 38: Maximum Entropy Modeling and its application to NLP

An example

• Machine translation
– "star" in English
• Translations in Hindi:
– सितारा, तारा, तारक, प्रसिद्ध अभिनेता, भाग्य
• First statistic of this process
– p(सितारा) + p(तारा) + p(तारक) + p(प्रसिद्ध अभिनेता) + p(भाग्य) = 1
• There are an infinite number of models p for which this identity holds

Page 39: Maximum Entropy Modeling and its application to NLP

• One model
– p(सितारा) = 1
– This model always predicts सितारा
• Another model
– p(तारा) = ½
– p(प्रसिद्ध अभिनेता) = ½
• These models offend our sensibilities
– The expert always chose from the five choices
– How can we justify either of these probability distributions?
– These models make bold assumptions without empirical justification

Page 40: Maximum Entropy Modeling and its application to NLP

• What we know
– The expert chose exclusively from these five words
• The most intuitively appealing model is
– p(सितारा) = 1/5
– p(तारा) = 1/5
– p(तारक) = 1/5
– p(प्रसिद्ध अभिनेता) = 1/5
– p(भाग्य) = 1/5
• The most uniform model subject to our knowledge

Page 41: Maximum Entropy Modeling and its application to NLP

• Suppose we notice that the expert chose either सितारा or तारा 30% of the time
• We apply this knowledge to update our model
– p(सितारा) + p(तारा) = 3/10
– p(सितारा) + p(तारा) + p(तारक) + p(प्रसिद्ध अभिनेता) + p(भाग्य) = 1
• Many probability distributions are consistent with the above constraints
• A reasonable choice for p is again the most uniform

Page 42: Maximum Entropy Modeling and its application to NLP

• i.e. the distribution that allocates its probability as evenly as possible, subject to the constraints
– p(सितारा) = 3/20
– p(तारा) = 3/20
– p(तारक) = 7/30
– p(प्रसिद्ध अभिनेता) = 7/30
– p(भाग्य) = 7/30
• Say we inspect the data once more and notice another interesting fact
– In half the cases, the expert chose either सितारा or प्रसिद्ध अभिनेता

Page 43: Maximum Entropy Modeling and its application to NLP

• So we add a third constraint
– p(सितारा) + p(तारा) = 3/10
– p(सितारा) + p(तारा) + p(तारक) + p(प्रसिद्ध अभिनेता) + p(भाग्य) = 1
– p(सितारा) + p(प्रसिद्ध अभिनेता) = ½
• Now if we look for the most uniform p satisfying the constraints, the choice is not as obvious
• As complexity is added, we face two difficulties
– What is meant by "uniform", and how can we measure the uniformity of a model?
– How will we find the most uniform model subject to a set of constraints?
• The maximum entropy method (E. T. Jaynes) answers both of these questions

Page 44: Maximum Entropy Modeling and its application to NLP

Maximum Entropy Modeling
• Consider a random process that produces an output value y (a member of a finite set Y)
• For the translation example just considered, the process generates a translation of the word star, and the output y can be any word in the set {सितारा, तारा, तारक, प्रसिद्ध अभिनेता, भाग्य}.

• In generating y, the process may be influenced by some contextual information x, a member of a finite set X.

• In the present example, this information could include the words in the English sentence surrounding star.

• Our task is to construct a stochastic model that accurately represents the behavior of the random process.

Page 45: Maximum Entropy Modeling and its application to NLP

Maximum Entropy Modeling
• Such a model is a method of estimating the conditional probability that, given a context x, the process will output c (the class label; the same output called y above).
• We will denote by p(c|x) the probability that the model assigns to output c in context x.

• We will denote by P the set of all conditional probability distributions. Thus a model p(c|x) is, by definition, just an element of P.

Page 46: Maximum Entropy Modeling and its application to NLP

Training Data• A large number of samples – (x1,c1), (x2, c2) . . . (xN, cN).

• Each sample would consist of a phrase x containing the words surrounding star, together with the translation c of star that the process produced.

• The empirical probability distribution p̃ is

p̃(x, c) = (1/N) × (number of times that (x, c) occurs in the sample)
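A minimal sketch of this empirical distribution; the (context, translation) pairs below are hypothetical stand-ins, written in Roman script for readability.

```python
from collections import Counter

# p~(x, c) = count(x, c) / N over a (hypothetical) training sample.
samples = [("cinema", "actor"), ("cinema", "actor"), ("sky", "sitara"),
           ("sky", "tara"), ("luck", "bhagya")]

N = len(samples)
counts = Counter(samples)

def p_tilde(x, c):
    return counts[(x, c)] / N

print(p_tilde("cinema", "actor"))  # 2/5 = 0.4
```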

Page 47: Maximum Entropy Modeling and its application to NLP

Features
• The features fi are binary functions that can be used to characterize any property of a pair (x, c)
• x is a vector representing an input element and c is the class label
• Example: f(x, c) = 1 if c = प्रसिद्ध अभिनेता and star follows cinema; otherwise f(x, c) = 0

Page 48: Maximum Entropy Modeling and its application to NLP

Features
• We have two things in hand
– The empirical distribution p̃(x, c)
– The model p(c|x)
• The expected value of f with respect to the empirical distribution is

p̃(f) = Σ_{x,c} p̃(x, c) f(x, c)

• The expected value of f with respect to the model p(c|x) is

p(f) = Σ_{x,c} p̃(x) p(c|x) f(x, c)

• Our constraint is

p(f) = p̃(f)
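A minimal sketch of the two expectations in this constraint; the samples, the feature, and the uniform toy model below are all illustrative assumptions.

```python
# Hypothetical (context, class) samples, Romanized for readability.
samples = [("cinema", "actor"), ("cinema", "actor"), ("sky", "sitara"), ("sky", "tara")]
classes = ["actor", "sitara", "tara"]
N = len(samples)

def f(x, c):
    # binary feature: class is "actor" and the context word is "cinema"
    return 1 if (c == "actor" and x == "cinema") else 0

# empirical expectation: sum_{x,c} p~(x, c) f(x, c)
emp = sum(f(x, c) for x, c in samples) / N

# model expectation: sum_{x,c} p~(x) p(c|x) f(x, c), here with a uniform model p(c|x)
p_model = lambda c, x: 1 / len(classes)
mod = sum((1 / N) * p_model(c, x) * f(x, c) for x, _ in samples for c in classes)

print(emp, mod)  # 0.5 vs ~0.167; training adjusts the weights until these match
```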

Page 49: Maximum Entropy Modeling and its application to NLP

Classification
• For a given x we need to know its class label c
– p(x, c)
• Loglinear models
– A general and very important class of models for classification with categorical variables
– Logistic regression is another example
– K is the number of features, α_i is the weight for feature fi, and Z is a normalizing constant used to ensure that a probability distribution results.

p(x, c) = (1/Z) Π_{i=1}^{K} α_i^{f_i(x,c)}

log p(x, c) = -log Z + Σ_{i=1}^{K} f_i(x, c) log α_i

Page 50: Maximum Entropy Modeling and its application to NLP

An example

• Text classification
• x consists of a single element, indicating presence or absence of the word profit in the article
• Classes c
– two classes: "earnings" or not
• Features
– Two features
• f1: 1 if and only if the article is "earnings" and the word profit is in it
• f2: the filler feature (fK+1)
– C is the greatest possible feature sum

f_{K+1}(x, c) = C - Σ_{i=1}^{K} f_i(x, c)

Page 51: Maximum Entropy Modeling and its application to NLP

An example

x (profit)   c ("earnings")   f1   f2   f1·log α1 + f2·log α2   Π_i α_i^f_i
0            0                0    1    1                        2
0            1                0    1    1                        2
1            0                0    1    1                        2
1            1                1    0    2                        4

• Parameters:
– log α1 = 2.0, log α2 = 1.0 (i.e. α1 = 4, α2 = 2, logs to base 2)
• Z = 2 + 2 + 2 + 4 = 10
• p(0,0) = p(0,1) = p(1,0) = 2/10 = 0.2
• p(1,1) = 4/10 = 0.4
• A data set that follows the same empirical distribution
– ((0,0), (0,1), (1,0), (1,1), (1,1))

The unnormalized weight in the last column is

Π_{i=1}^{K} α_i^{f_i(x,c)} = 2^{Σ_i f_i(x,c) log α_i}
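A minimal sketch reproducing the numbers in this table (Z = 10, p(1,1) = 0.4); the feature definitions and α values come from the slides, the Python layout is only illustrative.

```python
# Two binary features, log2(alpha1) = 2 and log2(alpha2) = 1, i.e. alpha1 = 4, alpha2 = 2.
alpha = [4, 2]

def features(x, c):
    f1 = 1 if (x == 1 and c == 1) else 0   # "profit" present and class "earnings"
    f2 = 1 - f1                            # filler feature, so the feature sum is constant
    return [f1, f2]

events = [(0, 0), (0, 1), (1, 0), (1, 1)]
unnorm = {e: alpha[0] ** features(*e)[0] * alpha[1] ** features(*e)[1] for e in events}
Z = sum(unnorm.values())

print(Z)                                   # 10
print({e: unnorm[e] / Z for e in events})  # 0.2, 0.2, 0.2 and 0.4 for (1,1)
```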

Page 52: Maximum Entropy Modeling and its application to NLP

Computation of α_i and Z
• We search for a model p* such that

E_{p*} f_i = E_{p̃} f_i

• Empirical expectation

E_{p̃} f_i = Σ_{x,c} p̃(x, c) f_i(x, c) = (1/N) Σ_{j=1}^{N} f_i(x_j, c_j)

• In general E_p f_i cannot be computed efficiently, as it would require summing over all possible combinations of x and c, a huge or infinite set.
• The following approximation is used instead

E_p f_i = Σ_{x,c} p̃(x) p(c|x) f_i(x, c) ≈ (1/N) Σ_{j=1}^{N} Σ_c p(c|x_j) f_i(x_j, c)

Page 53: Maximum Entropy Modeling and its application to NLP

Generalized Iterative Scaling Algo
• Step 1
– For all i = 1, …, K+1, initialize α_i^(1).
– Compute the empirical expectations E_{p̃} f_i
– Set n = 1
• Step 2
– Compute p^(n)(x, c) for the distribution p^(n) given by the {α_j^(n)}, for each element (x, c) in the training set

p^(n)(x, c) = (1/Z_n) Π_{i=1}^{K+1} (α_i^(n))^{f_i(x,c)},   where Z_n = Σ_{x,c} Π_{i=1}^{K+1} (α_i^(n))^{f_i(x,c)}

Page 54: Maximum Entropy Modeling and its application to NLP

Generalized Iterative Scaling Algo
• Step 3
– Compute E_{p^(n)}(f_i) for all i = 1, …, K+1 according to the formula shown before
• Step 4
– Update the parameters α_i

α_i^(n+1) = α_i^(n) ( E_{p̃} f_i / E_{p^(n)} f_i )^{1/C}

• Step 5
– If the parameters have converged, stop; otherwise increment n and go to Step 2.
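A minimal sketch of this GIS loop on the toy "earnings" data set from the earlier slides, assuming the conditionally normalized model p(c|x) used in the expectation approximation; the iteration count and variable names are illustrative choices, not part of the original slides.

```python
# Toy data set ((0,0),(0,1),(1,0),(1,1),(1,1)) and features from the earlier slides.
samples = [(0, 0), (0, 1), (1, 0), (1, 1), (1, 1)]
classes = [0, 1]
N = len(samples)
K = 2        # f1 plus the filler feature
C = 1.0      # greatest possible feature sum (f1 + f2 is always 1 here)

def features(x, c):
    f1 = 1 if (x == 1 and c == 1) else 0   # "profit" present and class "earnings"
    f2 = 1 - f1                            # filler feature f_{K+1} = C - f1
    return [f1, f2]

# empirical expectations E_p~[f_i] = (1/N) sum_j f_i(x_j, c_j)
emp = [sum(features(x, c)[i] for x, c in samples) / N for i in range(K)]

alpha = [1.0, 1.0]                         # Step 1: initialize the weights

def p_cond(c, x):
    # model p(c|x) proportional to prod_i alpha_i^{f_i(x,c)}
    def score(cc):
        fs = features(x, cc)
        return alpha[0] ** fs[0] * alpha[1] ** fs[1]
    return score(c) / sum(score(cc) for cc in classes)

for n in range(200):                       # Steps 2-5: iterate to convergence
    # model expectations E_p[f_i] ~= (1/N) sum_j sum_c p(c|x_j) f_i(x_j, c)
    model = [sum(p_cond(c, x) * features(x, c)[i]
                 for x, _ in samples for c in classes) / N
             for i in range(K)]
    # Step 4: alpha_i <- alpha_i * (E_p~[f_i] / E_p[f_i])^(1/C)
    alpha = [a * (e / m) ** (1.0 / C) for a, e, m in zip(alpha, emp, model)]

print(alpha)
print(p_cond(1, 1))   # ~2/3, the empirical fraction of "earnings" when "profit" occurs
```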

Page 55: Maximum Entropy Modeling and its application to NLP

Application of MaxEnt in NLP
• POS tagger
– Stanford tagger
• At our lab
– Honorific information
– Use of this information for Anaphora Resolution
– BioNLP
• Entity tagger
• Stanford Univ. has open source code for MaxEnt
• You can also use their implementation for your own task.

Page 56: Maximum Entropy Modeling and its application to NLP

HMM, MaxEnt and CRF
• HMM
– Models observation and class together
• MaxEnt
– Makes local decisions
• CRF
– Combines the strengths of HMM and MaxEnt