Language Modeling
Michael Collins, Columbia University
Overview
- The language modeling problem
- Trigram models
- Evaluating language models: perplexity
- Estimation techniques:
  - Linear interpolation
  - Discounting methods
The Language Modeling Problem
- We have some (finite) vocabulary, say V = {the, a, man, telescope, Beckham, two, . . .}
- We have an (infinite) set of strings, V†:

  the STOP
  a STOP
  the fan STOP
  the fan saw Beckham STOP
  the fan saw saw STOP
  the fan saw Beckham play for Real Madrid STOP
The Language Modeling Problem (Continued)
- We have a training sample of example sentences in English
- We need to "learn" a probability distribution p, i.e., p is a function that satisfies

  ∑_{x ∈ V†} p(x) = 1,   p(x) ≥ 0 for all x ∈ V†

For example:

  p(the STOP) = 10^{-12}
  p(the fan STOP) = 10^{-8}
  p(the fan saw Beckham STOP) = 2 × 10^{-8}
  p(the fan saw saw STOP) = 10^{-15}
  . . .
  p(the fan saw Beckham play for Real Madrid STOP) = 2 × 10^{-9}
  . . .
Why on earth would we want to do this?!
- Speech recognition was the original motivation. (Related problems are optical character recognition, handwriting recognition.)
- The estimation techniques developed for this problem will be VERY useful for other problems in NLP
A Naive Method
- We have N training sentences
- For any sentence x_1 . . . x_n, c(x_1 . . . x_n) is the number of times the sentence is seen in our training data
- A naive estimate:

  p(x_1 . . . x_n) = c(x_1 . . . x_n) / N
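To make this concrete, here is a minimal Python sketch of the naive estimate (the tiny training corpus is hypothetical):

```python
from collections import Counter

# Hypothetical training corpus: each sentence is a tuple of tokens ending in STOP.
training = [
    ("the", "STOP"),
    ("the", "fan", "STOP"),
    ("the", "fan", "STOP"),
    ("the", "fan", "saw", "Beckham", "STOP"),
]

N = len(training)          # number of training sentences
c = Counter(training)      # c(x1 ... xn): how often each full sentence occurs

def p_naive(sentence):
    """Naive estimate: relative frequency of the whole sentence."""
    return c[tuple(sentence)] / N

print(p_naive(("the", "fan", "STOP")))                 # 2/4 = 0.5
print(p_naive(("the", "fan", "saw", "saw", "STOP")))   # 0.0 for any unseen sentence
```

The sketch also makes the weakness plain: any sentence that never occurs in training gets probability exactly 0, which motivates the Markov models below.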
Markov Processes
- Consider a sequence of random variables X_1, X_2, . . . , X_n. Each random variable can take any value in a finite set V. For now we assume the length n is fixed (e.g., n = 100).
- Our goal: model

  P(X_1 = x_1, X_2 = x_2, . . . , X_n = x_n)
First-Order Markov Processes
P(X_1 = x_1, X_2 = x_2, . . . , X_n = x_n)
  = P(X_1 = x_1) ∏_{i=2}^{n} P(X_i = x_i | X_1 = x_1, . . . , X_{i-1} = x_{i-1})
  = P(X_1 = x_1) ∏_{i=2}^{n} P(X_i = x_i | X_{i-1} = x_{i-1})

The first equality is exact (the chain rule); the second uses the first-order Markov assumption: for any i ∈ {2 . . . n}, for any x_1 . . . x_i,

P(X_i = x_i | X_1 = x_1, . . . , X_{i-1} = x_{i-1}) = P(X_i = x_i | X_{i-1} = x_{i-1})
Second-Order Markov Processes
P(X_1 = x_1, X_2 = x_2, . . . , X_n = x_n)
  = P(X_1 = x_1) × P(X_2 = x_2 | X_1 = x_1) × ∏_{i=3}^{n} P(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})
  = ∏_{i=1}^{n} P(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})

(For convenience we assume x_0 = x_{-1} = *, where * is a special "start" symbol.)
Modeling Variable Length Sequences
- We would like the length of the sequence, n, to also be a random variable
- A simple solution: always define X_n = STOP, where STOP is a special symbol
- Then use a Markov process as before:

  P(X_1 = x_1, X_2 = x_2, . . . , X_n = x_n) = ∏_{i=1}^{n} P(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})

  (For convenience we assume x_0 = x_{-1} = *, where * is a special "start" symbol.)
Trigram Language Models
- A trigram language model consists of:
  1. A finite set V
  2. A parameter q(w | u, v) for each trigram u, v, w such that w ∈ V ∪ {STOP}, and u, v ∈ V ∪ {*}
- For any sentence x_1 . . . x_n where x_i ∈ V for i = 1 . . . (n − 1), and x_n = STOP, the probability of the sentence under the trigram language model is

  p(x_1 . . . x_n) = ∏_{i=1}^{n} q(x_i | x_{i-2}, x_{i-1})

  where we define x_0 = x_{-1} = *.
An Example
For the sentence

  the dog barks STOP

we would have

  p(the dog barks STOP) = q(the | *, *) × q(dog | *, the) × q(barks | the, dog) × q(STOP | dog, barks)
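A sketch of how this product is computed in Python; `q` is assumed to be a parameter function of the kind defined above (the names here are mine, not from the slides):

```python
def sentence_probability(sentence, q):
    """p(x1 ... xn) = product over i of q(xi | xi-2, xi-1), with x0 = x-1 = '*'.

    `sentence` is a list of tokens ending in 'STOP'; `q(w, u, v)` is assumed
    to return the trigram parameter q(w | u, v).
    """
    p = 1.0
    u, v = "*", "*"        # the two "start" symbols
    for w in sentence:
        p *= q(w, u, v)
        u, v = v, w        # slide the two-word history along
    return p

# For ["the", "dog", "barks", "STOP"] this multiplies exactly the four
# factors above: q(the|*,*), q(dog|*,the), q(barks|the,dog), q(STOP|dog,barks).
```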
The Trigram Estimation Problem
Remaining estimation problem:

  q(w_i | w_{i-2}, w_{i-1})

For example: q(laughs | the, dog)

A natural estimate (the "maximum likelihood estimate"):

  q(w_i | w_{i-2}, w_{i-1}) = Count(w_{i-2}, w_{i-1}, w_i) / Count(w_{i-2}, w_{i-1})

  q(laughs | the, dog) = Count(the, dog, laughs) / Count(the, dog)
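A sketch of collecting these counts and forming the estimate, under the convention that each sentence is padded with the two start symbols (function and variable names are mine):

```python
from collections import defaultdict

trigram_count = defaultdict(int)   # Count(u, v, w)
bigram_count = defaultdict(int)    # Count(u, v)

def train(sentences):
    """Accumulate trigram and bigram counts from sentences ending in 'STOP'."""
    for sentence in sentences:
        tokens = ["*", "*"] + list(sentence)
        for i in range(2, len(tokens)):
            u, v, w = tokens[i - 2], tokens[i - 1], tokens[i]
            trigram_count[(u, v, w)] += 1
            bigram_count[(u, v)] += 1

def q_ml(w, u, v):
    """qML(w | u, v) = Count(u, v, w) / Count(u, v).

    Returning 0.0 for an unseen bigram context is an arbitrary choice here;
    the estimate is really undefined in that case (see the sparse data
    discussion below).
    """
    if bigram_count[(u, v)] == 0:
        return 0.0
    return trigram_count[(u, v, w)] / bigram_count[(u, v)]
```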
Sparse Data Problems
A natural estimate (the "maximum likelihood estimate"):

  q(w_i | w_{i-2}, w_{i-1}) = Count(w_{i-2}, w_{i-1}, w_i) / Count(w_{i-2}, w_{i-1})

  q(laughs | the, dog) = Count(the, dog, laughs) / Count(the, dog)

Say our vocabulary size is N = |V|; then there are N^3 parameters in the model.

e.g., N = 20,000 ⇒ 20,000^3 = 8 × 10^{12} parameters
Evaluating a Language Model: Perplexity
- We have some test data, m sentences

  s_1, s_2, s_3, . . . , s_m

- We could look at the probability under our model, ∏_{i=1}^{m} p(s_i). Or more conveniently, the log probability

  log ∏_{i=1}^{m} p(s_i) = ∑_{i=1}^{m} log p(s_i)

- In fact the usual evaluation measure is perplexity:

  Perplexity = 2^{-l}   where   l = (1/M) ∑_{i=1}^{m} log_2 p(s_i)

  and M is the total number of words in the test data. (The logs here are base 2.)
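A direct transcription into Python; the sentence-probability function `p` is assumed (e.g. the trigram product above), and whether STOP tokens count toward M is a convention — this sketch counts them:

```python
import math

def perplexity(test_sentences, p):
    """Perplexity = 2^(-l), where l = (1/M) * sum_i log2 p(s_i)."""
    M = sum(len(s) for s in test_sentences)           # total words, STOP included
    l = sum(math.log2(p(s)) for s in test_sentences) / M
    return 2.0 ** (-l)

# If the model assigns p(s) = 0 to any test sentence, l is undefined
# (math.log2 raises a domain error at 0) and the perplexity is effectively
# infinite -- one reason pure ML estimates are unusable on their own.
```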
Some Intuition about Perplexity
- Say we have a vocabulary V, with N = |V| + 1, and a model that predicts

  q(w | u, v) = 1/N

  for all w ∈ V ∪ {STOP}, for all u, v ∈ V ∪ {*}.

- Easy to calculate the perplexity in this case:

  Perplexity = 2^{-l}   where   l = log_2 (1/N)

  ⇒ Perplexity = N

Perplexity is a measure of the effective "branching factor".
Typical Values of Perplexity
- Results from Goodman ("A bit of progress in language modeling"), where |V| = 50,000

- A trigram model: p(x_1 . . . x_n) = ∏_{i=1}^{n} q(x_i | x_{i-2}, x_{i-1}). Perplexity = 74
- A bigram model: p(x_1 . . . x_n) = ∏_{i=1}^{n} q(x_i | x_{i-1}). Perplexity = 137
- A unigram model: p(x_1 . . . x_n) = ∏_{i=1}^{n} q(x_i). Perplexity = 955
Some History
- Shannon conducted experiments on the entropy of English, i.e., how good are people at the perplexity game?

  C. Shannon. Prediction and entropy of printed English. Bell Systems Technical Journal, 30:50–64, 1951.
Some History

Chomsky (in Syntactic Structures (1957)):

Second, the notion "grammatical" cannot be identified with "meaningful" or "significant" in any semantic sense. Sentences (1) and (2) are equally nonsensical, but any speaker of English will recognize that only the former is grammatical.

(1) Colorless green ideas sleep furiously.

(2) Furiously sleep ideas green colorless.

. . .

. . . Third, the notion "grammatical in English" cannot be identified in any way with the notion "high order of statistical approximation to English". It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally 'remote' from English. Yet (1), though nonsensical, is grammatical, while (2) is not. . . .
The Bias-Variance Trade-Off
- Trigram maximum-likelihood estimate

  q_ML(w_i | w_{i-2}, w_{i-1}) = Count(w_{i-2}, w_{i-1}, w_i) / Count(w_{i-2}, w_{i-1})

- Bigram maximum-likelihood estimate

  q_ML(w_i | w_{i-1}) = Count(w_{i-1}, w_i) / Count(w_{i-1})

- Unigram maximum-likelihood estimate

  q_ML(w_i) = Count(w_i) / Count()
Linear Interpolation
- Take our estimate q(w_i | w_{i-2}, w_{i-1}) to be

  q(w_i | w_{i-2}, w_{i-1}) = λ_1 × q_ML(w_i | w_{i-2}, w_{i-1}) + λ_2 × q_ML(w_i | w_{i-1}) + λ_3 × q_ML(w_i)

  where λ_1 + λ_2 + λ_3 = 1, and λ_i ≥ 0 for all i.
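A sketch of the interpolated estimate, assuming trigram, bigram, and unigram ML estimators like the ones sketched earlier; the λ values shown are placeholders, not tuned values:

```python
def q_interp(w, u, v, lambdas=(0.5, 0.3, 0.2)):
    """Linearly interpolate the trigram, bigram, and unigram ML estimates.

    Assumes q_ml_trigram(w, u, v), q_ml_bigram(w, v), and q_ml_unigram(w)
    exist (hypothetical names); lambdas must be nonnegative and sum to 1.
    """
    l1, l2, l3 = lambdas
    return (l1 * q_ml_trigram(w, u, v)
            + l2 * q_ml_bigram(w, v)
            + l3 * q_ml_unigram(w))
```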
Linear Interpolation (continued)
Our estimate correctly defines a distribution (define V′ = V ∪ {STOP}):

  ∑_{w ∈ V′} q(w | u, v)
  = ∑_{w ∈ V′} [λ_1 × q_ML(w | u, v) + λ_2 × q_ML(w | v) + λ_3 × q_ML(w)]
  = λ_1 ∑_w q_ML(w | u, v) + λ_2 ∑_w q_ML(w | v) + λ_3 ∑_w q_ML(w)
  = λ_1 + λ_2 + λ_3
  = 1

(Can show also that q(w | u, v) ≥ 0 for all w ∈ V′)
How to estimate the λ values?
- Hold out part of the training set as "validation" data
- Define c′(w_1, w_2, w_3) to be the number of times the trigram (w_1, w_2, w_3) is seen in the validation set
- Choose λ_1, λ_2, λ_3 to maximize:

  L(λ_1, λ_2, λ_3) = ∑_{w_1, w_2, w_3} c′(w_1, w_2, w_3) log q(w_3 | w_1, w_2)

  such that λ_1 + λ_2 + λ_3 = 1, and λ_i ≥ 0 for all i, and where

  q(w_i | w_{i-2}, w_{i-1}) = λ_1 × q_ML(w_i | w_{i-2}, w_{i-1}) + λ_2 × q_ML(w_i | w_{i-1}) + λ_3 × q_ML(w_i)
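One simple (if crude) way to carry out this maximization is a grid search over the simplex; in practice the λ's are usually fit with an EM-style iteration, but the sketch below at least makes the objective explicit (all names are hypothetical):

```python
import math

def choose_lambdas(validation_counts, q_tri, q_bi, q_uni, steps=20):
    """Maximize L(l1,l2,l3) = sum over trigrams of c'(w1,w2,w3) * log q(w3|w1,w2)
    by brute-force search over a grid on the simplex l1 + l2 + l3 = 1.

    `validation_counts` maps (w1, w2, w3) -> c'(w1, w2, w3).
    """
    best, best_score = None, -math.inf
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            l1, l2 = i / steps, j / steps
            l3 = max(0.0, 1.0 - l1 - l2)
            score = 0.0
            for (w1, w2, w3), count in validation_counts.items():
                q = l1 * q_tri(w3, w1, w2) + l2 * q_bi(w3, w2) + l3 * q_uni(w3)
                if q <= 0.0:
                    score = -math.inf   # this setting gives some trigram zero mass
                    break
                score += count * math.log(q)
            if score > best_score:
                best, best_score = (l1, l2, l3), score
    return best
```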
Allowing the λ’s to vary
- Take a function Π that partitions histories, e.g.,

  Π(w_{i-2}, w_{i-1}) = 1  if Count(w_{i-2}, w_{i-1}) = 0
                        2  if 1 ≤ Count(w_{i-2}, w_{i-1}) ≤ 2
                        3  if 3 ≤ Count(w_{i-2}, w_{i-1}) ≤ 5
                        4  otherwise

- Introduce a dependence of the λ's on the partition:

  q(w_i | w_{i-2}, w_{i-1}) = λ_1^{Π(w_{i-2}, w_{i-1})} × q_ML(w_i | w_{i-2}, w_{i-1}) + λ_2^{Π(w_{i-2}, w_{i-1})} × q_ML(w_i | w_{i-1}) + λ_3^{Π(w_{i-2}, w_{i-1})} × q_ML(w_i)

  where λ_1^{Π(w_{i-2}, w_{i-1})} + λ_2^{Π(w_{i-2}, w_{i-1})} + λ_3^{Π(w_{i-2}, w_{i-1})} = 1, and λ_i^{Π(w_{i-2}, w_{i-1})} ≥ 0 for all i.
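A sketch of the bucketed version, reusing the `bigram_count` table and the hypothetical ML estimators from the earlier sketches; the λ triples below are made-up placeholders (in practice each triple would be estimated on validation data, as above, restricted to histories in that bucket):

```python
def bucket(count_uv):
    """The partition function Pi, applied to Count(wi-2, wi-1)."""
    if count_uv == 0:
        return 1
    if count_uv <= 2:
        return 2
    if count_uv <= 5:
        return 3
    return 4

# Hypothetical lambda triples, one per bucket. Note the natural pattern:
# when the bigram history is unseen (bucket 1), the trigram ML estimate
# is undefined, so its weight should be 0.
lambdas_by_bucket = {
    1: (0.0, 0.3, 0.7),
    2: (0.2, 0.5, 0.3),
    3: (0.4, 0.4, 0.2),
    4: (0.7, 0.2, 0.1),
}

def q_bucketed(w, u, v):
    """Interpolated estimate whose lambdas depend on the history's bucket."""
    l1, l2, l3 = lambdas_by_bucket[bucket(bigram_count[(u, v)])]
    return (l1 * q_ml_trigram(w, u, v)
            + l2 * q_ml_bigram(w, v)
            + l3 * q_ml_unigram(w))
```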
Discounting Methods

- Say we've seen the following counts:

  x                Count(x)   q_ML(w_i | w_{i-1})
  the              48
  the, dog         15         15/48
  the, woman       11         11/48
  the, man         10         10/48
  the, park        5          5/48
  the, job         2          2/48
  the, telescope   1          1/48
  the, manual      1          1/48
  the, afternoon   1          1/48
  the, country     1          1/48
  the, street      1          1/48

- The maximum-likelihood estimates are high (particularly for low count items)
Discounting Methods

- Now define "discounted" counts, Count*(x) = Count(x) − 0.5
- New estimates:

  x                Count(x)   Count*(x)   Count*(x)/Count(the)
  the              48
  the, dog         15         14.5        14.5/48
  the, woman       11         10.5        10.5/48
  the, man         10         9.5         9.5/48
  the, park        5          4.5         4.5/48
  the, job         2          1.5         1.5/48
  the, telescope   1          0.5         0.5/48
  the, manual      1          0.5         0.5/48
  the, afternoon   1          0.5         0.5/48
  the, country     1          0.5         0.5/48
  the, street      1          0.5         0.5/48
Discounting Methods (Continued)

- We now have some "missing probability mass":

  α(w_{i-1}) = 1 − ∑_w Count*(w_{i-1}, w) / Count(w_{i-1})

  e.g., in our example, α(the) = 10 × 0.5 / 48 = 5/48
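A sketch of the discounted estimates and the missing mass α, using bigram and unigram count tables of the kind built earlier (all names are mine; this assumes the bigram table only contains bigrams actually observed):

```python
DISCOUNT = 0.5

def q_discounted(w, w_prev, bigram_count, unigram_count):
    """Count*(w_prev, w) / Count(w_prev), with Count*(x) = Count(x) - 0.5."""
    return (bigram_count[(w_prev, w)] - DISCOUNT) / unigram_count[w_prev]

def missing_mass(w_prev, bigram_count, unigram_count):
    """alpha(w_prev) = 1 - sum_w Count*(w_prev, w) / Count(w_prev)."""
    seen = [w for (u, w) in bigram_count if u == w_prev]
    reserved = sum(bigram_count[(w_prev, w)] - DISCOUNT for w in seen)
    return 1.0 - reserved / unigram_count[w_prev]

# With the table above (10 distinct words following "the", Count(the) = 48):
# alpha("the") = 10 * 0.5 / 48 = 5/48.
```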
Katz Back-Off Models (Bigrams)

- For a bigram model, define two sets

  A(w_{i-1}) = {w : Count(w_{i-1}, w) > 0}
  B(w_{i-1}) = {w : Count(w_{i-1}, w) = 0}

- A bigram model:

  q_BO(w_i | w_{i-1}) =
    Count*(w_{i-1}, w_i) / Count(w_{i-1})                       if w_i ∈ A(w_{i-1})
    α(w_{i-1}) × q_ML(w_i) / ∑_{w ∈ B(w_{i-1})} q_ML(w)         if w_i ∈ B(w_{i-1})

  where

  α(w_{i-1}) = 1 − ∑_{w ∈ A(w_{i-1})} Count*(w_{i-1}, w) / Count(w_{i-1})
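A sketch of the bigram back-off estimate; it reuses `DISCOUNT` and `missing_mass` from the previous sketch, and exploits the identity that the q_ML mass of B(w_{i-1}) is 1 minus the q_ML mass of A(w_{i-1}):

```python
def q_bo(w, w_prev, bigram_count, unigram_count, q_ml_unigram):
    """Katz back-off bigram estimate qBO(w | w_prev), following the case split above."""
    seen = {v for (u, v) in bigram_count if u == w_prev}   # the set A(w_prev)
    if w in seen:
        # Seen bigram: discounted relative frequency.
        return (bigram_count[(w_prev, w)] - DISCOUNT) / unigram_count[w_prev]
    # Unseen bigram: split the missing mass alpha(w_prev) over B(w_prev)
    # in proportion to the unigram estimate qML(w).
    alpha = missing_mass(w_prev, bigram_count, unigram_count)
    mass_B = 1.0 - sum(q_ml_unigram(v) for v in seen)      # = sum of qML over B(w_prev)
    return alpha * q_ml_unigram(w) / mass_B
```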
Katz Back-Off Models (Trigrams)

- For a trigram model, first define two sets

  A(w_{i-2}, w_{i-1}) = {w : Count(w_{i-2}, w_{i-1}, w) > 0}
  B(w_{i-2}, w_{i-1}) = {w : Count(w_{i-2}, w_{i-1}, w) = 0}

- A trigram model is defined in terms of the bigram model:

  q_BO(w_i | w_{i-2}, w_{i-1}) =
    Count*(w_{i-2}, w_{i-1}, w_i) / Count(w_{i-2}, w_{i-1})                                       if w_i ∈ A(w_{i-2}, w_{i-1})
    α(w_{i-2}, w_{i-1}) × q_BO(w_i | w_{i-1}) / ∑_{w ∈ B(w_{i-2}, w_{i-1})} q_BO(w | w_{i-1})     if w_i ∈ B(w_{i-2}, w_{i-1})

  where

  α(w_{i-2}, w_{i-1}) = 1 − ∑_{w ∈ A(w_{i-2}, w_{i-1})} Count*(w_{i-2}, w_{i-1}, w) / Count(w_{i-2}, w_{i-1})
Summary
- Three steps in deriving the language model probabilities:

  1. Expand p(w_1, w_2 . . . w_n) using the chain rule.
  2. Make Markov independence assumptions: p(w_i | w_1, w_2 . . . w_{i-2}, w_{i-1}) = p(w_i | w_{i-2}, w_{i-1})
  3. Smooth the estimates using low order counts.

- Other methods used to improve language models:

  - "Topic" or "long-range" features.
  - Syntactic models.

It's generally hard to improve on trigram models though!!