Introduction to Markov Models
Transcript of Introduction to Markov Models
Estimating the probability of phrases of words, sentences, etc.…
But first:
A few preliminaries
CIS 391 - Intro to AI 2
What counts as a word? A tricky question….
How to find Sentences??
Q1: How to estimate the probability of a given sentence W?

A crucial step in speech recognition (and lots of other applications)
First guess: products of unigrams
    P̂(W) = ∏_{w ∈ W} P(w)

Given word lattice:

    form   subsidy    for
    farm   subsidies  far

Unigram counts (in 1.7 × 10⁶ words of AP text):

    form 183    subsidy 15      for 18185
    farm 74     subsidies 55    far 570

Not quite right…
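The unigram product can be sketched in a few lines of Python. The counts are the AP-text counts from the slide; treating N as the 1.7 × 10⁶-token corpus size is an assumption for illustration:

```python
# Unigram product model: P-hat(W) = product over w of P(w) = Count(w) / N.
# Counts are the AP-text counts from the slide; N = 1.7e6 tokens (assumed).
COUNTS = {"form": 183, "farm": 74, "subsidy": 15,
          "subsidies": 55, "for": 18185, "far": 570}
N = 1_700_000

def unigram_prob(sentence):
    """Product of unigram probabilities over a whitespace-split sentence."""
    p = 1.0
    for w in sentence.split():
        p *= COUNTS[w] / N
    return p

# The hugely frequent "for" dominates regardless of what precedes it --
# one reason products of unigrams are "not quite right".
```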
Predicting a word sequence II
Next guess: products of bigrams
• For W = w1w2w3…wn,

    P̂(W) = ∏_{i=1}^{n} P(wi|wi-1)

Given word lattice:

    form   subsidy    for
    farm   subsidies  far

Bigram counts (in 1.7 × 10⁶ words of AP text):

    form subsidy 0      subsidy for 2
    form subsidies 0    subsidy far 0
    farm subsidy 0      subsidies for 6
    farm subsidies 4    subsidies far 0

Better (if not quite right) … (But the counts are tiny! Why?)
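A minimal sketch of the bigram product, using the bigram counts above; using the AP unigram counts from the previous slide as the denominators is an assumption:

```python
# Bigram product model: P-hat(W) = product of P(wi|wi-1),
# with P(wi|wi-1) = Count(wi-1 wi) / Count(wi-1).
# Bigram counts from this slide; conditioning-word counts are the
# AP unigram counts from the previous slide (assumed).
BIGRAM = {("form", "subsidy"): 0, ("subsidy", "for"): 2,
          ("form", "subsidies"): 0, ("subsidy", "far"): 0,
          ("farm", "subsidy"): 0, ("subsidies", "for"): 6,
          ("farm", "subsidies"): 4, ("subsidies", "far"): 0}
UNIGRAM = {"form": 183, "farm": 74, "subsidy": 15, "subsidies": 55}

def bigram_prob(sentence):
    """Product of MLE bigram probabilities (first word's probability omitted)."""
    words = sentence.split()
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        p *= BIGRAM[(w1, w2)] / UNIGRAM[w1]
    return p

# "farm subsidies for" gets (4/74) * (6/55) > 0; every other lattice path
# contains a zero-count bigram -- tiny counts make zeros very common.
```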
How can we estimate P correctly?
Problem: Naïve Bayes model for bigrams violates independence assumptions.
Let’s do this right….
Let W = w1w2w3…wn. Then, by the chain rule,

    P(W) = P(w1) * P(w2|w1) * P(w3|w1w2) * … * P(wn|w1w2…wn-1)

We can estimate P(w2|w1) by the Maximum Likelihood Estimator

    Count(w1w2) / Count(w1)

and P(w3|w1w2) by

    Count(w1w2w3) / Count(w1w2)

and so on…
and finally, Estimating P(wn|w1w2…wn-1)
Again, we can estimate P(wn|w1w2…wn-1) with the MLE
So to decide pat vs. pot in Heat up the oil in a large p?t,
compute for pot
    Count(w1w2…wn) / Count(w1w2…wn-1)

    Count("Heat up the oil in a large pot") / Count("Heat up the oil in a large") = 0 / 0
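The full-history estimator can be sketched over a tiny illustrative corpus (the corpus below is invented for the demonstration): long word sequences essentially never recur, so both the numerator and denominator counts are usually zero.

```python
# Full-history MLE: P(wn|w1...wn-1) = Count(w1...wn) / Count(w1...wn-1).
# CORPUS is a toy, assumed corpus; real prefixes like
# "heat up the oil in a large" behave the same way: count 0.
CORPUS = "heat the oil . heat the pan . pour the oil into a pan .".split()

def count(phrase):
    """Occurrences of the word sequence `phrase` in CORPUS."""
    target = phrase.split()
    n = len(target)
    return sum(CORPUS[i:i + n] == target for i in range(len(CORPUS) - n + 1))

# Short histories have usable counts; long histories do not --
# dividing two zero counts is the 0/0 problem from the slide.
```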
Hmm… The Web Changes Things (2008 or so)
Even the web in 2008 yields low counts!
Statistics and the Web II
So, P(“pot” | ”heat up the oil in a large ___”) = 8/49 ≈ 0.16
But the web has grown!!!
….
165/891 ≈ 0.185
![Page 13: Introduction to Markov Models](https://reader031.fdocuments.us/reader031/viewer/2022012915/61c5bc1656cc705c260d3893/html5/thumbnails/13.jpg)
CIS 391 - Intro to AI 13
So….
A larger corpus won’t help much unless it’s HUGE …. but the web is!!!

But what if we only have 100 million words for our estimates??
A BOTEC Estimate of What We Can Estimate
What parameters can we estimate with 100 million words of training data??
Assuming (for now) uniform distribution over only 5000 words
So even with 10⁸ words of data, we encounter the sparse data problem even for trigrams…..
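The back-of-the-envelope arithmetic can be made concrete; the per-n-gram token budget below follows from the slide's V = 5000 uniform-vocabulary assumption:

```python
# BOTEC: with V = 5000 word types and 10**8 training tokens, how many
# n-gram parameters could we estimate? (Uniform-distribution assumption
# from the slide; the thresholds are back-of-the-envelope only.)
V, TOKENS = 5_000, 10**8
for n in (1, 2, 3):
    params = V ** n
    print(f"{n}-grams: {params:.1e} parameters, "
          f"{TOKENS / params:.1e} tokens per parameter")
```

Unigrams get roughly 20,000 tokens per parameter and bigrams about 4, but trigrams get far less than one token per parameter, which is the sparse data problem in a nutshell.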
The Markov Assumption: Only the Immediate Past Matters
The Markov Assumption: Estimation
We estimate the probability of each wi given previous context by

    P(wi|w1w2…wi-1) = P(wi|wi-1)

which can be estimated by

    Count(wi-1 wi) / Count(wi-1)

So we’re back to counting only unigrams and bigrams!!

AND we have a correct practical estimation method for P(W) given the Markov assumption!
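The whole pipeline under the Markov assumption fits in a short sketch; the training corpus and sentence below are toy examples invented for illustration:

```python
from collections import Counter

# Markov-assumption estimator: P(wi|w1...wi-1) = P(wi|wi-1),
# estimated as Count(wi-1 wi) / Count(wi-1). Toy corpus (assumed).
def train(tokens):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def markov_prob(words, unigrams, bigrams, n_tokens):
    """P(W) = P(w1) * product of P(wi|wi-1) under the Markov assumption."""
    p = unigrams[words[0]] / n_tokens          # unigram estimate for w1
    for w1, w2 in zip(words, words[1:]):
        p *= bigrams[(w1, w2)] / unigrams[w1]  # bigram MLE
    return p

tokens = "the cat sat on the mat".split()
uni, bi = train(tokens)
# markov_prob(["the", "cat"], uni, bi, len(tokens)) == (2/6) * (1/2)
```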
Markov Models
Visualizing an n-gram based language model: the Shannon/Miller/Selfridge method
To generate a sequence of n words given unigram estimates:

• Fix some ordering of the vocabulary v1 v2 v3 … vk.
• For each word wi, 1 ≤ i ≤ n
  — Choose a random value ri between 0 and 1
  — wi = the first vj such that ∑_{m=1}^{j} P(vm) ≥ ri
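The cumulative-probability step above can be sketched directly; the three-word vocabulary and its probabilities are illustrative:

```python
import random

# Shannon/Miller/Selfridge unigram sampling: draw r in [0, 1) and take
# the first vj whose cumulative probability reaches r.
# Vocabulary and probabilities are toy values (assumed).
def sample(vocab_probs, rng=random.random):
    r, cum = rng(), 0.0
    for v, p in vocab_probs:        # fixed ordering v1 v2 ... vk
        cum += p
        if cum >= r:
            return v
    return vocab_probs[-1][0]       # guard against float round-off

VOCAB = [("the", 0.5), ("cat", 0.3), ("sat", 0.2)]
# sample(VOCAB, rng=lambda: 0.6) -> "cat"   (0.5 < 0.6 <= 0.8)
```

Passing a deterministic `rng` makes the threshold behavior easy to check; with the default `random.random` it draws from the unigram distribution.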
Visualizing an n-gram based language model: the Shannon/Miller/Selfridge method
To generate a sequence of n words given a 1st-order Markov model (i.e. conditioned on one previous word):

• Fix some ordering of the vocabulary v1 v2 v3 … vk.
• Use the unigram method to generate an initial word w1
• For each remaining wi, 2 ≤ i ≤ n
  — Choose a random value ri between 0 and 1
  — wi = the first vj such that ∑_{m=1}^{j} P(vm|wi-1) ≥ ri
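The first-order version only changes which distribution the cumulative sum walks over; the conditional table below is an invented toy model:

```python
import random

# First-order Markov generation: each word is sampled from P(v|wi-1).
# The "<s>" start state plays the role of the unigram step for w1.
# Conditional probabilities are toy values (assumed).
COND = {"<s>": [("the", 1.0)],
        "the": [("cat", 0.6), ("mat", 0.4)],
        "cat": [("sat", 1.0)],
        "mat": [("sat", 1.0)],
        "sat": [("on", 1.0)],
        "on":  [("the", 1.0)]}

def generate(n, rng=random.random):
    words, prev = [], "<s>"
    for _ in range(n):
        r, cum = rng(), 0.0
        for v, p in COND[prev]:     # first vj with cumulative P(vj|prev) >= r
            cum += p
            if cum >= r:
                words.append(v)
                prev = v
                break
    return words

# generate(4, rng=lambda: 0.5) -> ["the", "cat", "sat", "on"]
```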
The Shannon/Miller/Selfridge method trained on Shakespeare
(This and next two slides from Jurafsky)
Wall Street Journal just isn’t Shakespeare
Shakespeare as corpus
N=884,647 tokens, V=29,066
Shakespeare produced 300,000 bigram types out of V² ≈ 844 million possible bigrams.
• So 99.96% of the possible bigrams were never seen (have zero entries in the table)
Quadrigrams worse: What's coming out looks like Shakespeare because it is Shakespeare
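The sparsity arithmetic follows directly from the slide's figures:

```python
# Bigram sparsity from the slide's Shakespeare figures:
# V = 29,066 types gives V**2 possible bigram types, of which
# only about 300,000 actually occur in the 884,647-token corpus.
V, SEEN = 29_066, 300_000
possible = V ** 2                   # roughly 844 million
unseen = 1 - SEEN / possible        # fraction with zero counts
print(f"{possible:,} possible bigrams, {unseen:.2%} never seen")
```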
The Sparse Data Problem Again
How likely is a 0 count? Much more likely than I let on!!!
English word frequencies well described by Zipf’s Law
Zipf (1949) characterized the relation between word frequency and rank as:

    f * r = C   (for constant C),  i.e.  f = C/r

    log(f) = log(C) − log(r)

Purely Zipfian data plots as a straight line on a log-log scale

*Rank (r): The numerical position of a word in a list sorted by decreasing frequency (f).
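The straight-line claim can be verified numerically; the constant C below is an arbitrary illustrative value:

```python
import math

# Purely Zipfian data: f = C / r, so log(f) = log(C) - log(r) and the
# frequency-rank curve has slope -1 on log-log axes. C is illustrative.
C = 60_000
ranks = [1, 10, 100, 1000]
freqs = [C / r for r in ranks]

# slope between consecutive points in log-log space
slopes = [(math.log(f2) - math.log(f1)) / (math.log(r2) - math.log(r1))
          for (r1, f1), (r2, f2)
          in zip(zip(ranks, freqs), zip(ranks[1:], freqs[1:]))]
# every pairwise slope is exactly -1 for perfectly Zipfian data
```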
Word frequency & rank in Brown Corpus vs Zipf
From: Interactive mathematics http://www.intmath.com
Lots of area under the tail of this curve!
Zipf’s law for the Brown corpus
Smoothing
This black art is why NLP is taught in the engineering school –Jason Eisner
Smoothing
At least one unknown word likely per sentence given Zipf!!
To fix 0’s caused by this, we can smooth the data.
• Assume we know how many types never occur in the data.
• Steal probability mass from types that occur at least once.
• Distribute this probability mass over the types that never occur.
Smoothing
….is like Robin Hood:
• it steals from the rich
• and gives to the poor
Review: Add-One Smoothing
Estimate probabilities by assuming every possible word type v ∈ V actually occurred one extra time (as if by appending an unabridged dictionary)
So if there were N words in our corpus, then instead of estimating

    P̂(w) = Count(w) / N

we estimate

    P̂(w) = (Count(w) + 1) / (N + V)
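Add-one smoothing can be sketched directly from that formula; the corpus and the unseen word "pot" below are toy choices for illustration:

```python
from collections import Counter

# Add-one (Laplace) smoothing: P-hat(w) = (Count(w) + 1) / (N + V),
# as if every vocabulary type occurred once more than observed.
# Toy corpus and vocabulary (assumed).
def add_one(counts, vocab):
    N, V = sum(counts.values()), len(vocab)
    return {w: (counts.get(w, 0) + 1) / (N + V) for w in vocab}

counts = Counter("the cat sat on the mat".split())
vocab = set(counts) | {"pot"}       # "pot": a type never seen in the corpus
probs = add_one(counts, vocab)
# the unseen "pot" now gets mass 1/(N+V) instead of 0, and the smoothed
# probabilities still sum to 1 over the vocabulary
```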
Add-One Smoothing (again)
Pro: Very simple technique
Cons:
• Probability of frequent n-grams is underestimated
• Probability of rare (or unseen) n-grams is overestimated
• Therefore, too much probability mass is shifted towards unseen n-grams
• All unseen n-grams are smoothed in the same way
Using a smaller added count improves things, but only somewhat

More advanced techniques (Kneser-Ney, Witten-Bell) use properties of component (n-1)-grams and the like...

(Hint for this homework)