Part 5 Language Model
CSE717, SPRING 2008
CUBS, Univ at Buffalo
Examples of Good & Bad Language Models
Excerpted from Herman, comic strips by Jim Unger (four comic panels shown on the slide)
What’s a Language Model
A language model is a probability distribution over word sequences:
P(“And nothing but the truth”) ≈ 0.001
P(“And nuts sing on the roof”) ≈ 0
What’s a language model for?
Speech recognition
Handwriting recognition
Spelling correction
Optical character recognition
Machine translation
(and anyone doing statistical modeling)
The Equation
$$\widehat{W} = \arg\max_{W} \Pr(W \mid O) = \arg\max_{W} \frac{\Pr(O \mid W)\,\Pr(W)}{\Pr(O)} = \arg\max_{W} \Pr(O \mid W)\,\Pr(W)$$

where $W$ is the word sequence and $O$ the observations.
The observation can be image features (handwriting recognition), acoustics (speech recognition), word sequence in another language (MT), etc.
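The argmax above can be made concrete with a toy decoder. Everything in this sketch — the candidate sequences and both probabilities — is made up for illustration, not taken from the slides.

```python
# Noisy-channel decoding sketch: choose the word sequence that maximizes
# P(observations | word sequence) * P(word sequence).
# Candidates and all probabilities are invented numbers.
candidates = {
    "and nothing but the truth": (0.002, 1e-3),  # (P(obs | words), P(words))
    "and nuts sing on the roof": (0.003, 1e-9),  # similar acoustics, bad LM score
}
best = max(candidates, key=lambda w: candidates[w][0] * candidates[w][1])
print(best)  # the language model overrules the slightly better acoustic score
```

Here the language model term P(words) dominates: the implausible sequence loses despite its marginally higher observation likelihood.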
How Language Models work
It is hard to compute P(“And nothing but the truth”) directly.
Decompose the probability with the chain rule:
P(“and nothing but the truth”) = P(“and”) P(“nothing” | “and”) P(“but” | “and nothing”) P(“the” | “and nothing but”) P(“truth” | “and nothing but the”)
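The chain-rule decomposition can be sketched in a few lines; `cond_prob` below is a hypothetical callback standing in for any conditional model, not something defined in the slides.

```python
# Chain rule: P(w1..wn) = product over i of P(wi | w1..w_{i-1}).
def sentence_prob(words, cond_prob):
    p = 1.0
    for i, w in enumerate(words):
        p *= cond_prob(w, tuple(words[:i]))  # history = all preceding words
    return p

# Toy conditional model: the first word has probability 0.1, the rest are certain.
toy = lambda w, history: 0.1 if not history else 1.0
print(sentence_prob("and nothing but the truth".split(), toy))  # 0.1
```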
The Trigram Approximation
Assume each word depends only on the previous two words
P(“the” | “and nothing but”) ≈ P(“the” | “nothing but”)
P(“truth” | “and nothing but the”) ≈ P(“truth” | “but the”)
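The trigram approximation turns the chain rule into a product of two-word-context terms. A minimal sketch, assuming a hypothetical `trigram_prob(word, w_minus2, w_minus1)`:

```python
# Trigram approximation: P(wi | w1..w_{i-1}) ≈ P(wi | w_{i-2} w_{i-1}).
def sentence_prob_trigram(words, trigram_prob):
    padded = ["<s>", "<s>"] + words  # sentence-start padding for the first words
    p = 1.0
    for i in range(2, len(padded)):
        p *= trigram_prob(padded[i], padded[i - 2], padded[i - 1])
    return p

uniform = lambda z, x, y: 0.5  # toy model: every word has probability 1/2
print(sentence_prob_trigram(["the", "truth"], uniform))  # 0.5 * 0.5 = 0.25
```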
How to find probabilities?
Count from real text
Pr(“the” | “nothing but”) ≈ c(“nothing but the”) / c(“nothing but”)
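Counting from real text can be sketched with the slide's own example sentence as a (tiny) corpus:

```python
from collections import Counter

# Relative-frequency estimate from a toy corpus:
# Pr(z | x y) = c(x y z) / c(x y).
corpus = "the whole truth and nothing but the truth".split()
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))  # trigram counts c(xyz)
bi = Counter(zip(corpus, corpus[1:]))               # bigram counts c(xy)

def mle_trigram(z, x, y):
    return tri[(x, y, z)] / bi[(x, y)] if bi[(x, y)] else 0.0

# c("nothing but the") / c("nothing but") = 1 / 1
print(mle_trigram("the", "nothing", "but"))  # 1.0
```

Note that any trigram that never occurs in the corpus gets probability 0 — the problem the smoothing slides below address.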
Evaluation
How can you tell a good language model from a bad one?
Run a speech recognizer (or your application of choice) and calculate the word error rate.
Drawbacks: slow, and specific to your recognizer.
Perplexity
Perplexity of a test text $w_1, \ldots, w_T$:

$$PP(w) = \Pr(w_1, \ldots, w_T)^{-1/T}$$

where $\Pr(w_1, \ldots, w_T)$ is the probability that $w_1, \ldots, w_T$ is generated by the given model.

An example
Data: “the whole truth and nothing but the truth” (T = 8)
Lexicon: L = {the, whole, truth, and, nothing, but}
Model 1: unigram, Pr(L1) = … = Pr(L6) = 1/6
Model 2: unigram, Pr(“the”) = Pr(“truth”) = 1/4, Pr(“whole”) = Pr(“and”) = Pr(“nothing”) = Pr(“but”) = 1/8

$$PP_1(w) = [(1/6)^8]^{-1/8} = 6 \qquad PP_2(w) = [(1/4)^4 (1/8)^4]^{-1/8} \approx 5.657$$
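The two unigram models from the example can be checked numerically. A short sketch (computed in log space, which is how perplexity is usually implemented to avoid underflow):

```python
import math

# Perplexity: PP(w) = Pr(w1..wT) ** (-1/T), computed via log probabilities.
data = "the whole truth and nothing but the truth".split()  # T = 8

def perplexity(words, unigram_prob):
    log_p = sum(math.log(unigram_prob[w]) for w in words)
    return math.exp(-log_p / len(words))

model1 = dict.fromkeys(["the", "whole", "truth", "and", "nothing", "but"], 1 / 6)
model2 = {"the": 1/4, "truth": 1/4, "whole": 1/8, "and": 1/8, "nothing": 1/8, "but": 1/8}
print(perplexity(data, model1))  # 6.0 (up to float rounding)
print(perplexity(data, model2))  # ≈ 5.657 (= 2**2.5)
```

Model 2, which matches the data's actual word frequencies, gets the lower perplexity — exactly the point of the next slide.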
Perplexity: Is lower better?
Remarkable fact: the “true” model for data has the lowest possible perplexity
The lower the perplexity, the closer we are to the true model.
Perplexity correlates well with the error rate of the recognition task. The correlation is better when both models are trained on the same data, and poor when the training data changes.
Smoothing
The maximum-likelihood estimate is terrible on test data: if there are no occurrences of c(xyz), the probability is 0.

$$\Pr(z \mid y) = \frac{c(yz)}{c(y)} = \frac{c(yz)}{\sum_w c(yw)}$$

P(“sing” | “nuts”) = 0 leads to infinite perplexity!
Smoothing: Add One
Add-one smoothing:

$$\Pr(z \mid y) = \frac{c(yz) + 1}{c(y) + |L|}$$

Add-delta smoothing:

$$\Pr(z \mid y) = \frac{c(yz) + \delta}{c(y) + \delta |L|}$$

Simple add-one smoothing does not perform well: the probability of rarely seen events is over-estimated.
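The add-delta formula can be sketched directly (a bigram version for brevity, using the slides' example sentence as a toy corpus; delta = 1 recovers add-one):

```python
from collections import Counter

# Add-delta smoothing sketch: Pr(z | y) = (c(yz) + delta) / (c(y) + delta * |L|).
corpus = "the whole truth and nothing but the truth".split()
lexicon = set(corpus)                    # |L| = 6
bi = Counter(zip(corpus, corpus[1:]))    # bigram counts c(yz)
hist = Counter(corpus[:-1])              # history counts c(y)

def add_delta(z, y, delta=1.0):
    return (bi[(y, z)] + delta) / (hist[y] + delta * len(lexicon))

# An unseen bigram now gets a nonzero probability:
print(add_delta("whole", "truth"))  # (0 + 1) / (1 + 6) = 1/7
```

For any fixed history the smoothed probabilities still sum to 1 over the lexicon.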
Smoothing: Simple Interpolation
Interpolate Trigram, Bigram, Unigram for best combination
Almost good enough
$$\Pr(z \mid xy) = \lambda\,\frac{c(xyz)}{c(xy)} + \mu\,\frac{c(yz)}{c(y)} + (1 - \lambda - \mu)\,\frac{c(z)}{\sum_w c(w)}$$
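The interpolation formula can be sketched as follows; the weights lam and mu are made-up values (in practice they would be tuned on held-out data), and the corpus is again the slides' example sentence:

```python
from collections import Counter

# Simple interpolation: mix trigram, bigram, and unigram relative
# frequencies with weights lam, mu, and 1 - lam - mu.
corpus = "the whole truth and nothing but the truth".split()
N = len(corpus)
uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))

def interp(z, x, y, lam=0.6, mu=0.3):
    p3 = tri[(x, y, z)] / bi[(x, y)] if bi[(x, y)] else 0.0
    p2 = bi[(y, z)] / uni[y] if uni[y] else 0.0
    p1 = uni[z] / N
    return lam * p3 + mu * p2 + (1 - lam - mu) * p1

print(interp("the", "nothing", "but"))  # ≈ 0.925
```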
Smoothing: Redistribution of Probability Mass (Backing Off) [Katz87]

Discounting — replace the raw counts with discounted counts $c^*$:

$$\Pr(z \mid y) = \frac{c^*(yz)}{c(y)}, \quad z: c(yz) > 0, \quad c^*(yz) < c(yz)$$

Discounted probability mass:

$$\beta(y) = 1 - \sum_{z:\,c(yz)>0} \frac{c^*(yz)}{c(y)}$$

Redistribution — if $c(y_1 \ldots y_n z) = 0$, back off to the (n-1)-gram:

$$\Pr(z \mid y_1 \ldots y_n) = k \, \Pr(z \mid y_2 \ldots y_n)$$

where $k$ is selected so that $\sum_z \Pr(z \mid y) = 1$.
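A backing-off model can be sketched for the bigram → unigram case. This is only a sketch: a flat absolute discount stands in for the Good-Turing discounted counts $c^*$ of [Katz87], and the unigram counts double as history counts, which is only exact for histories that are never sentence-final.

```python
from collections import Counter

# Katz-style backing-off sketch (bigram -> unigram).
corpus = "the whole truth and nothing but the truth".split()
N = len(corpus)
uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
D = 0.5  # assumed flat discount per seen bigram type (stand-in for c*)

def backoff(z, y):
    if bi[(y, z)] > 0:                   # seen event: discounted estimate
        return (bi[(y, z)] - D) / uni[y]
    # Unseen event: redistribute the held-out mass beta(y) over unseen
    # words in proportion to their unigram ((n-1)-gram) probabilities.
    beta = D * sum(1 for (a, _) in bi if a == y) / uni[y]
    unseen = sum(uni[w] for w in uni if bi[(y, w)] == 0) / N
    return beta * (uni[z] / N) / unseen

# The distribution still sums to 1 for a given history:
print(sum(backoff(w, "the") for w in uni))  # 1.0 (up to rounding)
```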
Linear Discount

$$\Pr(z \mid y) = (1 - \alpha)\,\frac{c(yz)}{c(y)}, \quad z: c(yz) > 0$$

i.e. $c^*(yz) = (1 - \alpha)\,c(yz)$ with $\alpha < 1$, so the relative discount $d(yz) = 1 - \frac{c^*(yz)}{c(yz)} = \alpha$ is the same for every seen event.

The factor $\alpha$ can be determined by the relative frequency of singletons, i.e., events observed exactly once in the data [Ney95].
More General Formulation

Drawback of linear discount: the counts of frequently observed events are modified the most, which goes against the “law of large numbers”.

Generalization:

$$c^*(yz) = \alpha(y)\,c(yz), \quad \alpha(y) < 1$$

$\alpha(y)$: a function of $y$, determined by cross-validation.
Requires more data, and the computation is expensive.
Absolute Discounting

The discount is an absolute value $\delta$ subtracted from each seen count:

$$\Pr(z \mid y) = \frac{c(yz) - \delta}{c(y)}, \quad z: c(yz) > 0$$

Works pretty well, and is easier than linear discounting.
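The contrast with linear discounting can be shown with a few invented numbers (alpha and delta below are illustrative values, not from the slides): linear discounting removes mass proportional to the count, so the best-attested events are modified the most, while absolute discounting removes the same small amount from every seen event.

```python
# Linear:   c*(yz) = (1 - alpha) * c(yz)  -> removes alpha * c, grows with c.
# Absolute: c*(yz) = c(yz) - delta        -> always removes delta.
alpha, delta = 0.1, 0.5
for c in [1, 10, 100]:
    removed_linear = c - (1 - alpha) * c
    removed_absolute = c - (c - delta)
    print(c, round(removed_linear, 3), removed_absolute)
```

For the count-100 event, linear discounting strips 10 whole counts while absolute discounting strips only 0.5 — the "law of large numbers" point made above.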
References
[1] Katz S, Estimation of probabilities from sparse data for the language model component of a speech recognizer, IEEE Trans. on Acoustics, Speech, and Signal Processing 35(3):400-401, 1987
[2] Ney H, Essen U, Kneser R, On the estimation of “small” probabilities by leaving-one-out, IEEE Trans. on PAMI 17(12):1202-1212, 1995
[3] Joshua Goodman, A tutorial on language modeling: The State of the Art in Language Modeling, research.microsoft.com/~joshuago/lm-tutorial-public.ppt