CS460/626 : Natural LanguageCS460/626 : Natural Language ...cs626-460-2012/cs626... · The...
Transcript of CS460/626 : Natural LanguageCS460/626 : Natural Language ...cs626-460-2012/cs626... · The...
CS460/626 : Natural LanguageCS460/626 : Natural Language Processing/Speech, NLP and the Web
(Lecture 13 14 Argmax Computation)(Lecture 13, 14–Argmax Computation)
Pushpak BhattacharyyaPushpak BhattacharyyaCSE Dept., IIT Bombay
3 d d 7th F b 20113rd and 7th Feb, 2011
Key difference between Statistical/ML-based NLP and Knowledge-based NLP and Knowledgebased/linguistics-based NLP
Stat NLP: speed and robustness are the main concernsKB NLP: Phenomena basedExample:p
Boys, Toys, ToesTo get the root remove “s”o get t e oot e o e sHow about foxes, boxes, ladiesUnderstand phenomena: go deeperUnderstand phenomena: go deeperSlower processing
Noisy Channel ModelNoisy Channel Model
w tw t
( ) (t t t )
Noisy Channel
(wn, wn-1, … , w1) (tm, tm-1, … , t1)
Sequence w is transformed into sequence t.
T*=argmax(P(T|W))w
W*=argmax(P(W|T))T
Bayesian Decision TheoryBayes Theorem : Given the random variables ABayes Theorem : Given the random variables A and B,
( ) ( | )( | ) P A P B AP A B ( ) ( | )( | )( )
P A BP B
=
( | )P A B Posterior probability( | )P A B( )P A
Posterior probability
Prior probability
( | )P B A Likelihood
To understand when and why to apply Bayes Theorem
An example: it is known that in a population, 1 in 50000 has meningitis and 1 in 20 has stiff neck. It is also observed that 50% of the meningitis patients have stiff neck.
A doctor observes that a patient has stiff neckA doctor observes that a patient has stiff neck. What is the probability that the patient has meningitis?g
(Mitchel, Machine Learning, 1997)
Ans: We need to findP(m|s): probability of
meningitis given the stiff neck
Apply Bayes Rule (why?)
P(m|s)P(m|s)= [P(m). P(s|m)]/P(s)
P(m)= prior probability of meningitisP(s|m)=likelihod of stiff neck given
meningitismeningitisP(s)=Probability of stiff neck
Probabilities1
50000P(m)=
P i120
P(s)=
Prior
Likelihood
( ) ( | )P P1 * 0 .5
5 0 0 0 0 1
P(s|m)= 0.5Likelihood
( ) ( | )( | )( )
P m P s mP m sP s
= 5 0 0 0 01
2 0
=1
5 0 0 0=
posterior
( | )« (~ | )P m s P m s
posterior
Hence meningitis is not likely
Some IssuesSome Issues
p(m|s) could have been found asp(m|s) could have been found as
#( )#
m s∩
Questions:Which is more reliable to compute, p(s|m) or p(m|s)?
# s
Which evidence is more sparse , p(s|m) or p(m|s)?Test of significance : The counts are always on a sample of population Which probability count has sufficientpopulation. Which probability count has sufficient statistics?
5 problems in NLP whose probabilistic formulation use Bayes theoremBayes theorem
The problems
Statistical Spell CheckingAutomatic Speech RecognitionAutomatic Speech RecognitionPart of Speech Tagging: discussed in detail in subsequent classesdetail in subsequent classesProbabilistic ParsingStatistical Machine Translation
Some general observations
A*= argmax [P(A|B)]A
= argmax [P(A).P(B|A)]AA
Computing and using P(A) and P(B|A), both needeed(i) looking at the internal structures of A and B(ii) making independence assumptions(iii) putting together a computation from smaller parts
CorpusCorpus
A ll i f ll d i d f ll iA collection of text called corpus, is used for collecting various language dataWith annotation: more information but manual laborWith annotation: more information, but manual labor intensivePractice: label automatically; correct manuallyThe famous Brown Corpus contains 1 million tagged words.Switchboard: very famous corpora 2400 conversations, 543 speakers many US dialects annotated with orthography543 speakers, many US dialects, annotated with orthography and phonetics
Example-1 of Application of Noisy Channel Model:Example 1 of Application of Noisy Channel Model: Probabilistic Speech Recognition (Isolated Word)[8]
Problem Definition : Given a sequence of speech Problem Definition : Given a sequence of speech signals, identify the words.2 steps : p
Segmentation (Word Boundary Detection)Identify the word
Isolated Word Recognition : Identify W given SS (speech signal)
^arg max ( | )W P W SS=
W
Identifying the wordIdentifying the word^
arg m ax ( | )W P W SS=
a rg m ax ( ) ( | )W
WP W P SS W=
P(SS|W) = likelihood called “phonological model “ intuitively more tractable!P(W) = prior probability called “language model”
# W appears in the corpus( )# words in the corpus
P W =# words in the corpus
Ambiguities in the context ofAmbiguities in the context of P(SS|W) or P(W|SS)
ConcernsSound Text ambiguitySound Text ambiguity
whether v/s weatherright v/s writebought v/s bot
Text Sound ambiguityread (present tense) v/s read (past tense)lead (verb) v/s lead (noun)
Primitives
Phonemes (sound)SyllablesSyllablesASCII bytes (machine representation)
Phonemes
Standardized by the IPA (International Phonetic Alphabet) conventionPhonetic Alphabet) convention/t/ sound of t in tag/d/ sound of d in dog/d/ sound of d in dog/D/ sound of the
SyllablesAd iAdvise (verb)
Advice (noun)
d dad viceadvise
• Consists of• Consists of1. Rhyme
1. Nucleus2 O t2. Onset3. Coda
Pronunciation DictionaryPronunciation Automaton
s4
1 00 73Word
t o m o
ae
t end1.0 1.0 1.0 1.0
1.0
1 0
0.73
0.27Tomato
aas1 s2 s3s5
s6 s71.0
P(SS|W) is maintained in this way.P(t o m ae t o |Word is “tomato”) = Product of arc probabilities
Problem 2: Spell checker: apply Bayes RuleBayes Rule
W*= argmax [P(W|T)]= argmax [P(W).P(T|W)]
W=correct word, T=misspelt wordW correct word, T misspelt wordWhy apply Bayes rule?
Finding p(w|t) vs. p(t|w) ?g p( | ) p( | )Assumptions :
t is obtained from w by a single error.y gThe words consist of only alphabets
(Jurafsky and Martin, Speech and NLP, 2000)
4 Confusion Matrices: sub, ins, del and trans
If x and y are alphabetsIf x and y are alphabets,sub(x,y) = # times y is written for x (substitution)(substitution)ins(x,y) = # times x is written as xydel(x y) = # times xy is written as xdel(x,y) = # times xy is written as xtrans(x,y) = # times xy is written as yx
Probabilities from confusion matrix
P(t|w)= P(t|w)S + P(t|w)I + P(t|w)D + P(t|w)X
whereP(t|w)S = sub(x,y) / count of xP(t|w)I = ins(x,y) / count of xP(t|w)D = del(x,y) / count of xP(t|w)X = trans(x,y) / count of x
These are considered to be mutually l iexclusive events
URLs for database of misspeltwords
http://www.wsu.edu/~brians/errors/errors.htmlors.htmlhttp://en.wikipedia.org/wiki/Wikipedia:Lists of common misspellings/For machists_of_common_misspellings/For_machines
A sampleabandonned->abandoned aberation->aberration abilties->abilities abilty->ability b d b dabondon->abandon
abondoned->abandoned abondoning->abandoningabondoning->abandoning abondons->abandons aborigene->aborigineaborigene aborigine
fi yuo cna raed tihs, yuo hvae a sgtrane mnid too.Cna yuo raed tihs? Olny 55 plepoe can.
i cdnuolt blveiee taht I cluod aulaclty uesdnatnrd waht I was rdanieg.The phaonmneal pweor of the hmuan mnid, aoccdrnig to a rscheearch atrscheearch atCmabrigde Uinervtisy, it dseno't mtaetr in waht oerdr the ltteres in awrod are, the olny iproamtnt tihng is taht the frsit and lsat ltteer bein the rghit pclae. The rset can be a taotl mses and you can sitll
draedit whotuit a pboerlm. Tihs is bcuseae the huamn mnid deos not raedervey lteter by istlef, but the wrod as a wlohe. Azanmig huh? yaehy y , g yandI
awlyas tghuhot slpeling was ipmorantt! if you can raed tihs forwradit.
S ll h ki E lSpell checking: ExampleGiven aple, find and rankp ,
P(maple|aple), P(apple|aple), P(able|aple), P(pale|aple) etc.
Exercise: Give an intuitive feel for which of these will rank higherrank higher
Example 3: Part of Speech T iTagging
POS Tagging is a process that attaches each word in a sentence with a suitableeach word in a sentence with a suitable tag from a given set of tags.
The set of tags is called the Tag-set.
Standard Tag-set : Penn Treebank (forStandard Tag-set : Penn Treebank (for English).
Penn Treebank Tagset: sample1. CC Coordinating conjunction; Jack and_CC Jill2. CD Cardinal number; Four_CD children3. DT Determiner; The_DT sky4. EX Existential there ; There_EX was a king5. FW Foreign word; श द_FW means ‘word’6. IN Preposition or subordinating conjunction; play with_IN ball7. JJ Adjective; fast_JJ car8 JJR Adjective comparative; faster JJR car8. JJR Adjective, comparative; faster_JJR car 9. JJS Adjective, superlative; fastest_JJS car10. LS List item marker; 1._LS bread 2._LS butter 3._LS Jam 11. MD Modal; You may_MD go12. NN Noun, singular or mass; water_NN13. NNS Noun, plural; boys_NNS, p ; y4. NNP Proper noun, singular; John_NNP
POS TPOS Tags
NN – Noun; e.g. Dog_NN
VM Main Verb; e g Run VMVM – Main Verb; e.g. Run_VM
VAUX – Auxiliary Verb; e.g. Is_VAUX
JJ – Adjective; e.g. Red_JJ
PRP – Pronoun; e.g. You_PRP
NNP – Proper Noun; e.g. John NNPNNP Proper Noun; e.g. John_NNP
etc.
POS T A bi itPOS Tag Ambiguity
In English : I bank1 on the bank2 on the river bank3
for my transactions.
Bank1 is verb, the other two banks are noun
In Hindi :
”Khaanaa” : can be noun (food) or verb (to eat)
Mujhe khaanaa khaanaa hai. (first khaanaa is j (noun and second is verb)
For Hindi
Rama achhaa gaata hai. (hai is VAUX : Auxiliary verb); Ram sings wellAuxiliary verb); Ram sings well
Rama achha ladakaa hai. (hai is VCOP : Copula verb) ; Ram is a good boy
PProcess
List all possible tag for each word in sentencesentence.
Choose best suitable tag sequence.
E lExample
”People jump high”.
People : Noun/VerbPeople : Noun/Verb
jump : Noun/Verb
high : Noun/Verb/Adjective
We can start with probabilities.
Derivation of POS tagging formulaDerivation of POS tagging formulaBest tag sequence= T*= T= argmax P(T|W)= argmax P(T)P(W|T) (by Baye’s Theorem)
P(T) = P(t0=^ t1t2 … tn+1=.)= P(t0)P(t1|t0)P(t2|t1t0)P(t3|t2t1t0) …0 1 0 2 1 0 3 2 1 0
P(tn|tn-1tn-2…t0)P(tn+1|tntn-1…t0)= P(t0)P(t1|t0)P(t2|t1) … P(tn|tn-1)P(tn+1|tn)
= P(ti|ti-1) Bigram Assumption∏N+1
i = 1
Lexical Probability AssumptionP(W|T) = P(w0|t0-tn+1)P(w1|w0t0-tn+1)P(w2|w1w0t0-tn+1) …
P(wn|w0-wn-1t0-tn+1)P(wn+1|w0-wnt0-tn+1)
Assumption: A word is determined completely by its tag. This is inspired by speech recognition
= P(w |t )P(w |t ) P(w |t )= P(wo|to)P(w1|t1) … P(wn+1|tn+1)
= P(wi|ti)∏n+1
i = 0
= P(wi|ti) (Lexical Probability Assumption)∏n+1
i =0
Generative Model
^_^ People_N Jump_V High_R ._.
^ N N N
Lexical Probabilities
^ N
V
N
V
N
V
.
Bigram
R
BigramProbabilities
RR
This model is called Generative model. Here words are observed from tags as states.This is similar to HMM.
b b lBigram probabilities
N V A
N 0.2 0.7 0.1
V 0.6 0.2 0.2
A 0.5 0.2 0.3
L i l P b bilitLexical Probability
People jump high
N 10-5 0 4x10-3 10-7N 10 0.4x10 10
V 10-7 10-2 10-7
A 0 0 10-1
values in cell are P(col heading/row heading)values in cell are P(col-heading/row-heading)
Calculation from actual dataCorpusCorpus
^ Ram got many NLP books. He found them all very interestingall very interesting.
Pos Tagged^ N V A N N N V N A R A^ N V A N N . N V N A R A .
Recording numbers^ N V A R .
^ 0 2 0 0 0 0
N 0 1 2 1 0 1
V 0 1 0 1 0 0V 0 1 0 1 0 0
A 0 1 0 0 1 1
R 0 0 0 1 0 0
. 1 0 0 0 0 0
Probabilities^ N V A R .
^ 0 1 0 0 0 0
N 0 1/5 2/5 1/5 0 1/5
V 0 1/2 0 1/2 0 0V 0 1/2 0 1/2 0 0
A 0 1/3 0 0 1/3 1/3
R 0 0 0 1 0 0
. 1 0 0 0 0 0
T fi dTo find
T* = argmax (P(T) P(W/T)) P(T).P(W/T) = Π P( ti / ti-1 ).P(wi /ti)( ) ( ) ( i i 1 ) ( i i)
i=1 n+1
P( t / t ) : Bigram probabilityP( ti / ti-1 ) : Bigram probabilityP(wi /ti): Lexical probability
P( /t ) 1 i 0 ,Note: P(wi/ti)=1 for i=0 (̂ , sentence beginner)) and i=(n+1) ( f ll t )(., fullstop)
b b lBigram probabilities
N V A R
N 0.15 0.7 0.05 0.1
V 0 6 0 2 0 1 0 1V 0.6 0.2 0.1 0.1
A 0.5 0.2 0.3 0
R 0.1 0.3 0.5 0.1
L i l P b bilitLexical Probability
People jump high
N 10-5 0.4x10-3 10-7
V 10-7 10-2 10-7
A 0 0 10-1
R 0 0 0
values in cell are P(col-heading/row-heading)values in cell are P(col heading/row heading)
Some notable text corpora of English
American National CorpusBank of EnglishBank of EnglishBritish National CorpusCorpus Juris SecundumpCorpus of Contemporary American English (COCA) 400+ million words, 1990-present. Freely searchable onlineonline.Brown Corpus, forming part of the "Brown Family" of corpora, together with LOB, Frown and F-LOB.p , g ,International Corpus of EnglishOxford English CorpusScottish Corpus of Texts & Speech