Lm Tutorial v8
Transcript of Lm Tutorial v8
SA1-1
The State of the Art in Language Modeling
Joshua Goodman, Microsoft Research, Machine Learning Group, http://www.research.microsoft.com/~joshuago
Eugene Charniak, Brown University, Department of Computer Science, http://www.cs.brown.edu/people/ec
SA1-2
A bad language model
SA1-3
A bad language model
SA1-4
A bad language model
Herman is reprinted with permission from LaughingStock Licensing Inc., Ottawa, Canada. All rights reserved.
SA1-5
A bad language model
SA1-6
What’s a Language Model?
A language model is a probability distribution over word sequences
P(“And nothing but the truth”) ≈ 0.001
P(“And nuts sing on the roof”) ≈ 0
SA1-7
What’s a language model for?
Speech recognition
Handwriting recognition
Spelling correction
Optical character recognition
Machine translation
(and anyone doing statistical modeling)
SA1-8
Really Quick Overview
Humor
What is a language model?
Really quick overview
Two minute probability overview
How language models work (trigrams)
Real overview
Smoothing, caching, skipping, sentence-mixture models, clustering, parsing language models, applications, tools
SA1-9
Everything you need to know about probability – definition
P(X) means probability that X is true
• P(baby is a boy) ≈ 0.5 (% of total that are boys)
• P(baby is named John) ≈ 0.001 (% of total named John)
[Venn diagram: Babies ⊃ Baby boys ⊃ John]
SA1-10
Everything about probability: Joint probabilities
P(X, Y) means probability that X and Y are both true, e.g. P(brown eyes, boy)
[Venn diagram: Babies ⊃ Baby boys; John and Brown eyes overlap]
SA1-11
Everything about probability: Conditional probabilities
P(X|Y) means probability that X is true when we already know Y is true
• P(baby is named John | baby is a boy) ≈ 0.002
• P(baby is a boy | baby is named John) ≈ 1
[Venn diagram: Babies ⊃ Baby boys ⊃ John]
SA1-12
Everything about probabilities: math
P(X|Y) = P(X, Y) / P(Y)
• P(baby is named John | baby is a boy) = P(baby is named John, baby is a boy) / P(baby is a boy) = 0.001 / 0.5 = 0.002
[Venn diagram: Babies ⊃ Baby boys ⊃ John]
SA1-13
Everything about probabilities: Bayes Rule
Bayes rule: P(X|Y) = P(Y|X) × P(X) / P(Y)
P(named John | boy) = P(boy | named John) × P(named John) / P(boy)
[Venn diagram: Babies ⊃ Baby boys ⊃ John]
SA1-14
THE Equation
SA1-15
How Language Models work
Hard to compute P(“And nothing but the truth”)
Step 1: Decompose probability
P(“And nothing but the truth”) = P(“And”) × P(“nothing|and”) × P(“but|and nothing”) × P(“the|and nothing but”) × P(“truth|and nothing but the”)
SA1-16
The Trigram Approximation
Assume each word depends only on the previous two words (three words total – tri means three, gram means writing)
P(“the|… whole truth and nothing but”) ≈ P(“the|nothing but”)
P(“truth|… whole truth and nothing but the”) ≈ P(“truth|but the”)
SA1-17
Trigrams, continued
How do we find probabilities? Get real text, and start counting!
• P(“the | nothing but”) ≈ C(“nothing but the”) / C(“nothing but”)
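The counting above can be sketched in a few lines of Python; the corpus here is a toy stand-in, not from the tutorial:

```python
from collections import Counter

def trigram_prob(corpus, x, y, z):
    """MLE trigram estimate: P(z | x y) = C(x y z) / C(x y)."""
    words = corpus.split()
    tri = Counter(zip(words, words[1:], words[2:]))
    bi = Counter(zip(words, words[1:]))
    if bi[(x, y)] == 0:
        return 0.0
    return tri[(x, y, z)] / bi[(x, y)]

corpus = "and nothing but the truth and nothing but lies"
# C("nothing but the") = 1, C("nothing but") = 2:
print(trigram_prob(corpus, "nothing", "but", "the"))  # 0.5
```

Unseen contexts fall straight to zero here, which is exactly the problem the smoothing slides address later.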
SA1-18
Language Modeling is “AI Complete”
What is
p(205 | “213 minus 6 is”)
p(207 | “213 minus 6 is”)
p(plates | “Fred ate several red”)
p(apples | “Fred ate several red”)
SA1-19
Real Overview Overview
Basics: probability, language model definition
Real Overview (8 slides)
Evaluation
Smoothing
Caching, Skipping
Clustering
Sentence-mixture models
Parsing language models
Applications
Tools
SA1-20
Real Overview: Evaluation
Need to compare different language models
Speech recognition word error rate
Perplexity
Entropy
Coding theory
SA1-21
Real Overview: Smoothing
Got trigram for P(“the” | “nothing but”) from C(“nothing but the”) / C(“nothing but”)
What about P(“sing” | “and nuts”) = C(“and nuts sing”) / C(“and nuts”)?
Probability would be 0: very bad!
SA1-22
Real Overview: Caching
If you say something, you are likely to say it again later
SA1-23
Real Overview: Skipping
Trigram uses last two words
Other words are useful too – 3-back, 4-back
Words are useful in various combinations (e.g. 1-back (bigram) combined with 3-back)
SA1-24
Real Overview: Clustering
What is the probability P(“Tuesday | party on”)
Similar to P(“Monday | party on”)
Similar to P(“Tuesday | celebration on”)
Put words in clusters:
• WEEKDAY = Sunday, Monday, Tuesday, …
• EVENT = party, celebration, birthday, …
SA1-25
Real Overview: Sentence Mixture Models
In Wall Street Journal, many sentences like “In heavy trading, Sun Microsystems fell 25 points yesterday”
In Wall Street Journal, many sentences like “Nathan Myhrvold, vice president of Microsoft, took a one year leave of absence.”
Model each sentence type separately.
SA1-26
Real Overview: Parsing Language Models
Language has structure – noun phrases, verb phrases, etc.
“The butcher from Albuquerque slaughtered chickens” – even though slaughtered is far from butcher, it is predicted by butcher, not by Albuquerque
Recent, somewhat promising models
SA1-27
Real Overview: Applications
In Machine Translation we break up the problem in two: proposing possible phrases, and then fitting them together. The second is the language model.
The same is true for optical character recognition (just substitute “letter” for “phrase”).
A lot of other problems also fit this mold.
SA1-28
Real Overview: Tools
You can make your own language models with tools freely available for research
CMU language modeling toolkit
SRI language modeling toolkit
SA1-29
Evaluation
How can you tell a good language model from a bad one?
Run a speech recognizer (or your application of choice), calculate word error rate
• Slow
• Specific to your recognizer
SA1-30
Evaluation: Perplexity Intuition
Ask a speech recognizer to recognize digits: “0, 1, 2, 3, 4, 5, 6, 7, 8, 9” – easy – perplexity 10
Ask a speech recognizer to recognize names at Microsoft – hard – 30,000 names – perplexity 30,000
Ask a speech recognizer to recognize “Operator” (1 in 4), “Technical support” (1 in 4), “sales” (1 in 4), 30,000 names (1 in 120,000 each) – perplexity 54
Perplexity is weighted equivalent branching factor.
SA1-31
Evaluation: perplexity
“A, B, C, D, E, F, G…Z”: perplexity is 26
“Alpha, bravo, charlie, delta…yankee, zulu”: perplexity is 26
Perplexity measures language model difficulty, not acoustic difficulty.
SA1-32
Perplexity: Math
Perplexity is geometric average inverse probability
Imagine model: “Operator” (1 in 4), “Technical support” (1 in 4), “sales” (1 in 4), 30,000 names (1 in 120,000 each)
Imagine data: All 30,003 equally likely
Example: Perplexity of test data, given model, is 119,829
Remarkable fact: the true model for data has the lowest possible perplexity
SA1-33
Perplexity: Math
Imagine model: “Operator” (1 in 4), “Technical support” (1 in 4), “sales” (1 in 4), 30,000 names (1 in 120,000 each)
Imagine data: All 30,003 equally likely
Can compute three different perplexities
• Model (ignoring test data): perplexity 54
• Test data (ignoring model): perplexity 30,003
• Model on test data: perplexity 119,829
When we say perplexity, we mean “model on test”
Remarkable fact: the true model for data has the lowest possible perplexity
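The “model on test” perplexity can be checked directly as a geometric average inverse probability; this sketch uses the slide’s model, and the exact value depends on rounding conventions:

```python
import math

def perplexity(probs):
    """Geometric average inverse probability of the test items."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

# Model from the slide: "Operator", "Technical support", "sales" at 1/4 each,
# 30,000 names at 1/120,000 each.  Test data: each of the 30,003 outcomes once.
test_probs = [1 / 4] * 3 + [1 / 120000] * 30000
pp = perplexity(test_probs)
print(round(pp))  # roughly 1.2e5, close to the slide's 119,829
```

Note this sits far above the uniform test-data perplexity of 30,003, because the model badly underweights the names the test data actually contains.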
SA1-34
Perplexity: Is lower better?
Remarkable fact: the true model for data has the lowest possible perplexity
Lower the perplexity, the closer we are to the true model.
Typically, perplexity correlates well with speech recognition word error rate
• Correlates better when both models are trained on same data
• Doesn’t correlate well when training data changes
SA1-35
Perplexity: The Shannon Game
Ask people to guess the next letter, given context. Compute perplexity.
• (when we get to entropy, the “100” column corresponds to the “1 bit per character” estimate)

Char n-gram | Low char | Upper char | Low word | Upper word
1           | 9.1      | 16.3       | 191,237  | 4,702,511
5           | 3.2      | 6.5        | 653      | 29,532
10          | 2.0      | 4.3        | 45       | 2,998
15          | 2.3      | 4.3        | 97       | 2,998
100         | 1.5      | 2.5        | 10       | 142
SA1-36
Evaluation: entropy
Entropy = log2 perplexity
Should be called “cross-entropy of model on test data.”
Remarkable fact: entropy is average number of bits per word required to encode test data using this probability model, and an optimal coder. Measured in bits.
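As a quick worked instance of the entropy–perplexity relation (the perplexity value here is made up):

```python
import math

# Cross-entropy in bits is log2 of perplexity: a perplexity-64 model needs,
# on average, 6 bits per word to encode the test data with an optimal coder.
perplexity = 64
entropy_bits = math.log2(perplexity)
print(entropy_bits)  # 6.0
```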
SA1-37
Smoothing: None
Called Maximum Likelihood estimate.
Lowest perplexity trigram on training data.
Terrible on test data: If no occurrences of C(xyz), probability is 0.
SA1-38
Smoothing: Add One
What is P(sing|nuts)? Zero? Leads to infinite perplexity!
Add one smoothing: P(z|xy) ≈ (C(xyz) + 1) / (C(xy) + V), where V is the vocabulary size
Works very badly. DO NOT DO THIS
Add delta smoothing: P(z|xy) ≈ (C(xyz) + δ) / (C(xy) + δV)
Still very bad. DO NOT DO THIS
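Add-one and add-delta smoothing are easy to write down, which is part of why they are tempting despite the slide’s warning; a sketch with a made-up vocabulary size:

```python
def add_delta_prob(c_xyz, c_xy, vocab_size, delta=1.0):
    """Add-delta estimate: P(z|xy) = (C(xyz) + delta) / (C(xy) + delta*V).
    delta=1 gives add-one smoothing; the slide warns both work badly."""
    return (c_xyz + delta) / (c_xy + delta * vocab_size)

# An unseen trigram like "and nuts sing" no longer gets probability zero:
print(add_delta_prob(c_xyz=0, c_xy=5, vocab_size=10000))  # 1/10005
```

The failure mode is visible in the example: with a large vocabulary, almost all probability mass is shoveled onto unseen events.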
SA1-39
Smoothing: Simple Interpolation
P(z|xy) ≈ λ C(xyz)/C(xy) + μ C(yz)/C(y) + (1−λ−μ) C(z)/C
Trigram is very context specific, very noisy
Unigram is context-independent, smooth
Interpolate Trigram, Bigram, Unigram for best combination
Find 0 < λ < 1 by optimizing on “held-out” data
Almost good enough
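A sketch of the interpolation, with placeholder component estimates and weights that would in practice be tuned on held-out data:

```python
def interpolated_prob(p_tri, p_bi, p_uni, lam1=0.6, lam2=0.3):
    """Linear interpolation of trigram, bigram, and unigram estimates.
    The weights here are arbitrary placeholders, not tuned values."""
    return lam1 * p_tri + lam2 * p_bi + (1 - lam1 - lam2) * p_uni

# Even when the trigram estimate is 0, the bigram and unigram terms keep
# the interpolated probability nonzero:
print(interpolated_prob(p_tri=0.0, p_bi=0.01, p_uni=0.001))  # ≈ 0.0031
```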
SA1-40
Smoothing: Finding parameter values
Split data into training, “heldout”, test
Try lots of different values for λ on heldout data, pick best
Test on test data
Sometimes, can use tricks like “EM” (expectation maximization) to find values
I prefer to use a generalized search algorithm, “Powell search” – see Numerical Recipes in C
SA1-41
Smoothing digression: Splitting data
How much data for training, heldout, test?
Some people say things like “1/3, 1/3, 1/3” or “80%, 10%, 10%”. They are WRONG.
Heldout should have (at least) 100–1000 words per parameter.
Answer: enough test data to be statistically significant. (1000s of words perhaps)
SA1-42
Smoothing digression: Splitting data
Be careful: WSJ data divided into stories. Some are easy, with lots of numbers, financial, others much harder. Use enough to cover many stories.
Be careful: Some stories repeated in data sets.
Can take data from end – better – or randomly from within training. Temporal effects like “Elian Gonzalez”
SA1-43
Smoothing: Jelinek-Mercer
Simple interpolation: P(z|xy) ≈ λ C(xyz)/C(xy) + (1−λ) P(z|y)
Better: smooth a little after “The Dow”, lots after “Adobe acquired” – let λ depend on the context
SA1-44
Smoothing: Jelinek-Mercer continued
Put λs into buckets by count
Find λs by cross-validation on held-out data
Also called “deleted-interpolation”
SA1-45
Smoothing: Good Turing
Imagine you are fishing
You have caught 10 Carp, 3 Cod, 2 tuna, 1 trout, 1 salmon, 1 eel.
How likely is it that next species is new? 3/18
How likely is it that next is tuna? Less than 2/18
SA1-46
Smoothing: Good Turing
How many species (words) were seen once? Estimate for how many are unseen.
All other estimates are adjusted (down) to give probabilities for unseen
SA1-47
Smoothing: Good Turing Example
10 Carp, 3 Cod, 2 tuna, 1 trout, 1 salmon, 1 eel.
How likely is new data (p0)? Let n1 be number of species occurring once (3), N be total (18). p0 = 3/18
How likely is eel? Discounted count 1* = 2 × n2/n1; with n1 = 3, n2 = 1, 1* = 2 × 1/3 = 2/3
P(eel) = 1*/N = (2/3)/18 = 1/27
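The fishing numbers can be verified with a small Good-Turing sketch (raw formula only, with no smoothing of the n_r counts):

```python
from collections import Counter

def good_turing(counts):
    """Good-Turing: mass for unseen events is n1/N; a count r is
    discounted to r* = (r+1) * n_{r+1} / n_r."""
    n = Counter(counts.values())   # n[r] = number of species seen r times
    total = sum(counts.values())
    p_unseen = n[1] / total
    def prob(species):
        r = counts[species]
        r_star = (r + 1) * n[r + 1] / n[r]
        return r_star / total
    return p_unseen, prob

catch = {"carp": 10, "cod": 3, "tuna": 2, "trout": 1, "salmon": 1, "eel": 1}
p0, prob = good_turing(catch)
print(p0)           # 3/18
print(prob("eel"))  # (2/3)/18 = 1/27
```

The raw formula is unstable when some n_r are zero or tiny (e.g. the count for carp here), which is why practical estimators smooth the n_r values first.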
SA1-48
Smoothing: Katz
Use Good-Turing estimate for seen n-grams; back off to the lower-order model, weighted by α, for unseen ones
Works pretty well.
Not good for 1 counts
α is calculated so probabilities sum to 1
SA1-49
Smoothing: Absolute Discounting
Assume fixed discount: subtract a constant D from each seen count
Works pretty well, easier than Katz.
Not so good for 1 counts
SA1-50
Smoothing: Interpolated Absolute Discount
Backoff: ignore bigram if have trigram
Interpolated: always combine bigram, trigram
SA1-51
Smoothing: Interpolated Multiple Absolute Discounts
One discount is good
Different discounts for different counts
Multiple discounts: for 1 count, 2 counts, >2
SA1-52
Smoothing: Kneser-Ney
P(Francisco | eggplant) vs P(stew | eggplant)
“Francisco” is common, so backoff, interpolated methods say it is likely
But it only occurs in context of “San”
“Stew” is common, and in many contexts
Weight backoff by number of contexts word occurs in
SA1-53
Smoothing: Kneser-Ney
Interpolated
Absolute-discount
Modified backoff distribution
Consistently best technique
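The core Kneser-Ney idea, counting distinct left contexts rather than raw frequency for the backoff distribution, can be sketched as follows; the bigram list is invented:

```python
from collections import defaultdict

def continuation_prob(bigrams):
    """Kneser-Ney-style unigram: P_cont(z) is proportional to the number
    of distinct words that precede z, not to z's raw frequency."""
    preceders = defaultdict(set)
    for x, z in bigrams:
        preceders[z].add(x)
    total = sum(len(s) for s in preceders.values())
    return {z: len(s) / total for z, s in preceders.items()}

# "Francisco" is frequent but follows only "San"; "stew" follows many words.
bigrams = [("San", "Francisco")] * 5 + [("beef", "stew"), ("fish", "stew"),
           ("lamb", "stew")]
p = continuation_prob(bigrams)
print(p["Francisco"], p["stew"])  # 0.25 0.75
```

Even though “Francisco” appears five times to “stew”’s three, the continuation distribution prefers “stew”, matching the slide’s intuition.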
SA1-54
Smoothing: Chart
SA1-55
Caching
If you say something, you are likely to say it again later.
Interpolate trigram with cache
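A minimal cache sketch, interpolating a hypothetical trigram estimate with a unigram cache over the document so far (the weight 0.9 is an arbitrary placeholder):

```python
from collections import Counter

def cached_prob(p_trigram, word, history, lam=0.9):
    """Interpolate a (hypothetical) trigram estimate with a unigram cache
    built from the words seen so far in this document."""
    cache = Counter(history)
    p_cache = cache[word] / len(history) if history else 0.0
    return lam * p_trigram + (1 - lam) * p_cache

history = "i swear to tell the truth the whole truth".split()
# "truth" appeared twice in 9 words, so the cache boosts its probability
# well above the trigram estimate alone:
print(cached_prob(p_trigram=0.001, word="truth", history=history))
```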
SA1-56
Caching: Real Life
Someone says “I swear to tell the truth”
System hears “I swerve to smell the soup”
Cache remembers!
Person says “The whole truth”, and, with cache, system hears “The whole soup.” – errors are locked in.
Caching works well when users correct as they go, poorly or even hurts without correction.
SA1-57
Caching: Variations
N-gram caches: cache bigrams or trigrams, not just single words
Conditional n-gram cache: use n-gram cache only if xy ∈ history
Remove function-words like “the”, “to”
SA1-58
Cache Results
[Chart: perplexity reduction (0–40%) vs. training data size (100,000 to all), for unigram, bigram, and trigram caches and conditional combinations: “unigram + cond bigram + cond trigram”, “unigram + cond trigram”, “trigram”, “bigram”, “unigram”]
SA1-59
5-grams
Why stop at 3-grams?
If P(z|…rstuvwxy) ≈ P(z|xy) is good, then P(z|…rstuvwxy) ≈ P(z|vwxy) is better!
Very important to smooth well
Interpolated Kneser-Ney works much better than Katz on 5-gram, more than on 3-gram
SA1-60
N-gram versus smoothing algorithm
[Chart: entropy (5.5–10 bits) vs. n-gram order (1–20), for Katz and Kneser-Ney (KN) at training sizes 100,000; 1,000,000; 10,000,000; and all]
SA1-61
Speech recognizer mechanics
Keep many hypotheses alive
Find acoustic, language model scores
• P(acoustics | truth) = .3, P(truth | tell the) = .1
• P(acoustics | soup) = .2, P(soup | smell the) = .01
“…tell the” (.01)   “…smell the” (.01)
“…tell the truth” (.01 × .3 × .1)   “…smell the soup” (.01 × .2 × .01)
SA1-62
Speech recognizer slowdowns
Speech recognizer uses tricks (dynamic programming) to merge hypotheses
Trigram merges to: “…tell the”, “…smell the”
Fivegram keeps separate: “…swear to tell the”, “…swerve to smell the”, “…swear too tell the”, “…swerve too smell the”, “…swerve to tell the”, “…swerve too tell the”, …
SA1-63
Speech recognizer vs. n-gram
Recognizer can threshold out bad hypotheses
Trigram works so much better than bigram: better thresholding, no slow-down
4-gram, 5-gram start to become expensive
SA1-64
Speech recognizer with language model
In theory: argmax over word sequences of P(acoustics | word sequence) × P(word sequence)
In practice, language model is a better predictor – acoustic probabilities aren’t “real” probabilities
In practice, penalize insertions:
argmax over word sequences of P(acoustics | word sequence) × P(word sequence)^8 × 0.1^length(word sequence)
SA1-65
Skipping
P(z|…rstuvwxy) ≈ P(z|vwxy)
Why not P(z|v_xy) – “skipping” n-gram – skips value of 3-back word.
Example: “P(time|show John a good)” -> P(time | show ____ a good)
P(z|…rstuvwxy) ≈ λ P(z|vwxy) + μ P(z|vw_y) + (1−λ−μ) P(z|v_xy)
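The skipping interpolation can be sketched with placeholder component estimates and weights (none of these numbers are from the tutorial):

```python
def skipping_prob(p_vwxy, p_vw_y, p_v_xy, lam=0.5, mu=0.3):
    """Interpolate the full 4-word-context estimate with two skipping
    estimates:  lam*P(z|vwxy) + mu*P(z|vw_y) + (1-lam-mu)*P(z|v_xy).
    The weights here are arbitrary placeholders."""
    return lam * p_vwxy + mu * p_vw_y + (1 - lam - mu) * p_v_xy

# "show John a good" was never seen, but "show ____ a good" was, so the
# v_xy skipping term rescues the estimate:
print(skipping_prob(p_vwxy=0.0, p_vw_y=0.0, p_v_xy=0.04))  # ≈ 0.008
```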
SA1-66
5-gram Skipping Results
[Chart: perplexity reduction (0–7%) vs. training size (10,000 to 10^9), comparing combinations: “vwyx, vxyw, wxyv, vywx, yvwx, xvwy, wvxy, vw_y, v_xy, vwx_ – skipping”; “xvwy, wvxy, yvwx – rearranging”; “vwyx, vxyw, wxyv – rearranging”; “vwyx, vywx, yvwx – rearranging”; “vw_y, v_xy – skipping”]
(Best trigram skipping result: 11% reduction)
SA1-67
Clustering
CLUSTERING = CLASSES (same thing)
What is P(“Tuesday | party on”)
Similar to P(“Monday | party on”)
Similar to P(“Tuesday | celebration on”)
Put words in clusters:
• WEEKDAY = Sunday, Monday, Tuesday, …
• EVENT = party, celebration, birthday, …
SA1-68
Clustering overview
Major topic, useful in many fields
Kinds of clustering
• Predictive clustering
• Conditional clustering
• IBM-style clustering
How to get clusters
• Be clever or it takes forever!
SA1-69
Predictive clustering
Let “z” be a word, “Z” be its cluster
One cluster per word: hard clustering
• WEEKDAY = Sunday, Monday, Tuesday, …
• MONTH = January, February, April, May, June, …
P(z|xy) = P(Z|xy) × P(z|xyZ)
P(Tuesday | party on) = P(WEEKDAY | party on) × P(Tuesday | party on WEEKDAY)
P_smooth(z|xy) ≈ P_smooth(Z|xy) × P_smooth(z|xyZ)
SA1-70
Predictive clustering example
Find P(Tuesday | party on)
• P_smooth(WEEKDAY | party on) × P_smooth(Tuesday | party on WEEKDAY)
• C(party on Tuesday) = 0
• C(party on Wednesday) = 10
• C(arriving on Tuesday) = 10
• C(on Tuesday) = 100
P_smooth(WEEKDAY | party on) is high
P_smooth(Tuesday | party on WEEKDAY) backs off to P_smooth(Tuesday | on WEEKDAY)
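The predictive-clustering decomposition is a single product of two smoothed estimates; the component values below are made up for illustration:

```python
def predictive_cluster_prob(p_cluster_given_ctx, p_word_given_ctx_cluster):
    """Predictive clustering: P(z|xy) = P(Z|xy) * P(z|xyZ),
    where Z is z's hard cluster (e.g. WEEKDAY for Tuesday)."""
    return p_cluster_given_ctx * p_word_given_ctx_cluster

# Slide's example with hypothetical smoothed component estimates:
# P(Tuesday | party on) =
#     P(WEEKDAY | party on) * P(Tuesday | party on WEEKDAY)
print(predictive_cluster_prob(0.2, 0.15))  # ≈ 0.03
```

The win is that each factor is easier to estimate than the original: the cluster is seen in many more contexts than any single weekday.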
SA1-71
Conditional clustering
P(z|xy) ≈ P(z|XY)
P(Tuesday | party on) ≈ P(Tuesday | EVENT PREPOSITION)
P_smooth(z|xy) ≈ P_smooth(z|XY) = λ P_ML(Tuesday | EVENT PREPOSITION) + μ P_ML(Tuesday | PREPOSITION) + (1−λ−μ) P_ML(Tuesday)
SA1-72
IBM Clustering
P(z|xy) ≈ P_smooth(Z|XY) × P(z|Z)
P(WEEKDAY | EVENT PREPOSITION) × P(Tuesday | WEEKDAY)
Small, very smooth, mediocre perplexity
P(z|xy) ≈ λ P_smooth(z|xy) + (1−λ) P_smooth(Z|XY) × P(z|Z)
Bigger, better than no clusters
SA1-73
Cluster Results
[Chart: perplexity reduction (−20% to +20%) vs. training size (100,000 to 10,000,000), comparing Kneser-Ney trigram, Predict, IBM, Full IBM-Predict, and All-Combine clusterings]
SA1-74
Clustering by Position
“A” and “AN”: same cluster or different cluster?
Same cluster for predictive clustering
Different clusters for conditional clustering
Small improvement by using different clusters for conditional and predictive
SA1-75
Clustering: how to get them
Build them by hand
• Works ok when almost no data
Part of Speech (POS) tags
• Tends not to work as well as automatic
Automatic Clustering
• Swap words between clusters to minimize perplexity
SA1-76
Clustering: automatic
Minimize perplexity of P(z|Y)
Mathematical tricks speed it up
Use top-down splitting, not bottom-up merging!
SA1-77
Two actual WSJ classes
MONDAYS FRIDAYS THURSDAY MONDAY EURODOLLARS SATURDAY WEDNESDAY FRIDAY TENTERHOOKS TUESDAY SUNDAY CONDITION
PARTY FESCO CULT NILSON PETA CAMPAIGN WESTPAC FORCE CONRAN DEPARTMENT PENH GUILD
SA1-78
Sentence Mixture Models
Lots of different sentence types:
• Numbers (The Dow rose one hundred seventy three points)
• Quotations (Officials said “quote we deny all wrong doing ”quote)
• Mergers (AOL and Time Warner, in an attempt to control the media and the internet, will merge)
Model each sentence type separately
SA1-79
Sentence Mixture Models
Roll a die to pick sentence type, s_k, with probability σ_k
Probability of sentence, given s_k: Π_{i=1}^{n} P(w_i | w_{i-2} w_{i-1} s_k)
Probability of sentence across types: Σ_{k=1}^{m} σ_k Π_{i=1}^{n} P(w_i | w_{i-2} w_{i-1} s_k)
SA1-80
Sentence Model Smoothing

Each topic model is smoothed with the overall model.
The sentence mixture model is smoothed with the overall model (sentence type 0).

P(w_1 … w_n) = Σ_{k=0}^{m} λ_k Π_{i=1}^{n} [ μ P(w_i | w_{i-2} w_{i-1} s_k) + (1-μ) P(w_i | w_{i-2} w_{i-1}) ]
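The smoothed mixture computation can be sketched as below. The callable interface for the topic and overall trigram models, and the interpolation weight mu, are assumptions for illustration; the overall model itself is passed in as sentence type 0.

```python
def sentence_mixture_prob(sentence, lambdas, topic_models, global_model, mu=0.7):
    """Smoothed sentence-mixture probability.

    topic_models[k](w, context) and global_model(w, context) return
    conditional trigram probabilities; lambdas[k] are the mixture
    weights, with topic_models[0] conventionally being the overall
    model ("sentence type 0").
    """
    total = 0.0
    for lam, topic in zip(lambdas, topic_models):
        p = 1.0
        for i, w in enumerate(sentence):
            ctx = tuple(sentence[max(0, i - 2):i])  # up to two previous words
            # interpolate the topic trigram with the overall trigram
            p *= mu * topic(w, ctx) + (1 - mu) * global_model(w, ctx)
        total += lam * p
    return total
```

In practice the per-topic product underflows for long sentences, so real code works in log space; the plain product keeps the correspondence with the formula visible.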
SA1-81
Sentence Mixture Results

[Chart: perplexity reduction (0–20%) versus number of sentence types (1, 2, 4, …, 128), with curves for all/10,000,000/1,000,000/100,000 words of training data, each with 3-gram and 5-gram models.]
SA1-82
Sentence Clustering

Same algorithm as word clustering
Assign each sentence to a type, s_k
Minimize perplexity of P(z|s_k) instead of P(z|Y)
SA1-83
Topic Examples - 0 (Mergers and acquisitions)

JOHN BLAIR &ERSAND COMPANY IS CLOSE TO AN AGREEMENT TO SELL ITS T. V. STATION ADVERTISING REPRESENTATION OPERATION AND PROGRAM PRODUCTION UNIT TO AN INVESTOR GROUP LED BY JAMES H. ROSENFIELD ,COMMA A FORMER C. B. S. INCORPORATED EXECUTIVE ,COMMA INDUSTRY SOURCES SAID .PERIOD
INDUSTRY SOURCES PUT THE VALUE OF THE PROPOSED ACQUISITION AT MORE THAN ONE HUNDRED MILLION DOLLARS .PERIOD
JOHN BLAIR WAS ACQUIRED LAST YEAR BY RELIANCE CAPITAL GROUP INCORPORATED ,COMMA WHICH HAS BEEN DIVESTING ITSELF OF JOHN BLAIR'S MAJOR ASSETS .PERIOD
JOHN BLAIR REPRESENTS ABOUT ONE HUNDRED THIRTY LOCAL TELEVISION STATIONS IN THE PLACEMENT OF NATIONAL AND OTHER ADVERTISING .PERIOD
MR. ROSENFIELD STEPPED DOWN AS A SENIOR EXECUTIVE VICE PRESIDENT OF C. B. S. BROADCASTING IN DECEMBER NINETEEN EIGHTY FIVE UNDER A C. B. S. EARLY RETIREMENT PROGRAM .PERIOD
SA1-84
Topic Examples - 1 (production, promotions, commas)

MR. DION ,COMMA EXPLAINING THE RECENT INCREASE IN THE STOCK PRICE ,COMMA SAID ,COMMA "DOUBLE-QUOTE OBVIOUSLY ,COMMA IT WOULD BE VERY ATTRACTIVE TO OUR COMPANY TO WORK WITH THESE PEOPLE .PERIOD
BOTH MR. BRONFMAN AND MR. SIMON WILL REPORT TO DAVID G. SACKS ,COMMA PRESIDENT AND CHIEF OPERATING OFFICER OF SEAGRAM .PERIOD
JOHN A. KROL WAS NAMED GROUP VICE PRESIDENT ,COMMA AGRICULTURE PRODUCTS DEPARTMENT ,COMMA OF THIS DIVERSIFIED CHEMICALS COMPANY ,COMMA SUCCEEDING DALE E. WOLF ,COMMA WHO WILL RETIRE MAY FIRST .PERIOD
MR. KROL WAS FORMERLY VICE PRESIDENT IN THE AGRICULTURE PRODUCTS DEPARTMENT .PERIOD
RAPESEED ,COMMA ALSO KNOWN AS CANOLA ,COMMA IS CANADA'S MAIN OILSEED CROP .PERIOD
YALE E. KEY IS A WELL -HYPHEN SERVICE CONCERN .PERIOD
SA1-85
Topic Examples - 2 (Numbers)

SOUTH KOREA POSTED A SURPLUS ON ITS CURRENT ACCOUNT OF FOUR HUNDRED NINETEEN MILLION DOLLARS IN FEBRUARY ,COMMA IN CONTRAST TO A DEFICIT OF ONE HUNDRED TWELVE MILLION DOLLARS A YEAR EARLIER ,COMMA THE GOVERNMENT SAID .PERIOD
THE CURRENT ACCOUNT COMPRISES TRADE IN GOODS AND SERVICES AND SOME UNILATERAL TRANSFERS .PERIOD
COMMERCIAL -HYPHEN VEHICLE SALES IN ITALY ROSE ELEVEN .POINT FOUR %PERCENT IN FEBRUARY FROM A YEAR EARLIER ,COMMA TO EIGHT THOUSAND ,COMMA EIGHT HUNDRED FORTY EIGHT UNITS ,COMMA ACCORDING TO PROVISIONAL FIGURES FROM THE ITALIAN ASSOCIATION OF AUTO MAKERS .PERIOD
INDUSTRIAL PRODUCTION IN ITALY DECLINED THREE .POINT FOUR %PERCENT IN JANUARY FROM A YEAR EARLIER ,COMMA THE GOVERNMENT SAID .PERIOD
CANADIAN MANUFACTURERS' NEW ORDERS FELL TO TWENTY .POINT EIGHT OH BILLION DOLLARS (LEFT-PAREN CANADIAN )RIGHT-PAREN IN JANUARY ,COMMA DOWN FOUR %PERCENT FROM DECEMBER'S TWENTY ONE .POINT SIX SEVEN BILLION DOLLARS ON A SEASONALLY ADJUSTED BASIS ,COMMA STATISTICS CANADA ,COMMA A FEDERAL AGENCY ,COMMA SAID .PERIOD
THE DECREASE FOLLOWED A FOUR .POINT FIVE %PERCENT INCREASE IN DECEMBER .PERIOD
SA1-86
Topic Examples – 3 (quotations)

NEITHER MR. ROSENFIELD NOR OFFICIALS OF JOHN BLAIR COULD BE REACHED FOR COMMENT .PERIOD
THE AGENCY SAID THERE IS "DOUBLE-QUOTE SOME INDICATION OF AN UPTURN "DOUBLE-QUOTE IN THE RECENT IRREGULAR PATTERN OF SHIPMENTS ,COMMA FOLLOWING THE GENERALLY DOWNWARD TREND RECORDED DURING THE FIRST HALF OF NINETEEN EIGHTY SIX .PERIOD
THE COMPANY SAID IT ISN'T AWARE OF ANY TAKEOVER INTEREST .PERIOD
THE SALE INCLUDES THE RIGHTS TO GERMAINE MONTEIL IN NORTH AND SOUTH AMERICA AND IN THE FAR EAST ,COMMA AS WELL AS THE WORLDWIDE RIGHTS TO THE DIANE VON FURSTENBERG COSMETICS AND FRAGRANCE LINES AND U. S. DISTRIBUTION RIGHTS TO LANCASTER BEAUTY PRODUCTS .PERIOD
BUT THE COMPANY WOULDN'T ELABORATE .PERIOD
HEARST CORPORATION WOULDN'T COMMENT ,COMMA AND MR. GOLDSMITH COULDN'T BE REACHED .PERIOD
A MERRILL LYNCH SPOKESMAN CALLED THE REVISED QUOTRON AGREEMENT "DOUBLE-QUOTE A PRUDENT MANAGEMENT MOVE --DASH IT GIVES US A LITTLE FLEXIBILITY .PERIOD
SA1-87
Parsing

[Diagram: a parser maps "Alice ate yellow squash." to its parse tree]

(S (NP (N Alice))
   (VP (V ate)
       (NP (Adj yellow) (N squash))))
SA1-88
The Importance of Parsing

In the hotel fake property was sold to tourists.
What does “fake” modify?
What does “In the hotel” modify?
SA1-89
Ambiguity

[Diagram: two parse trees for "Salesmen sold the dog biscuits"]

(S (NP (N Salesmen)) (VP (V sold) (NP (Det the) (N dog) (N biscuits))))
(S (NP (N Salesmen)) (VP (V sold) (NP (Det the) (N dog)) (NP (N biscuits))))
SA1-90
Probabilistic Context-free Grammars (PCFGs)

S → NP VP        1.0
VP → V NP        0.5
VP → V NP NP     0.5
NP → Det N       0.5
NP → Det N N     0.5
N → salespeople  0.3
N → dog          0.4
N → biscuits     0.3
V → sold         1.0

[Diagram: a parse of "Salesmen sold the dog biscuits" using these rules]
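Under a PCFG, a parse tree's probability is the product of the probabilities of the rules it uses. A minimal sketch of that computation follows; the NP → N and Det → the rules (and the adjusted NP weights that make room for them) are assumed additions needed to cover the example tree, not part of the slide's grammar.

```python
# Toy PCFG, as {(lhs, rhs): probability}. NP -> N and Det -> the are
# assumed extra rules; the NP probabilities are adjusted to sum to one.
PCFG = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("V", "NP")): 0.5,
    ("VP", ("V", "NP", "NP")): 0.5,
    ("NP", ("Det", "N")): 0.4,
    ("NP", ("Det", "N", "N")): 0.4,
    ("NP", ("N",)): 0.2,          # assumed
    ("Det", ("the",)): 1.0,       # assumed
    ("N", ("salespeople",)): 0.3,
    ("N", ("dog",)): 0.4,
    ("N", ("biscuits",)): 0.3,
    ("V", ("sold",)): 1.0,
}

def tree_prob(tree, pcfg):
    """P(tree) = product of the probabilities of the rules it uses.
    A tree is (label, child, ...); terminal children are strings."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = pcfg[(label, rhs)]
    for c in children:
        if not isinstance(c, str):
            p *= tree_prob(c, pcfg)
    return p

# One parse of "salespeople sold the dog biscuits"
TREE = ("S",
        ("NP", ("N", "salespeople")),
        ("VP", ("V", "sold"),
               ("NP", ("Det", "the"), ("N", "dog"), ("N", "biscuits"))))
```

Multiplying the nine rule probabilities along this tree gives its joint probability with the sentence.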
SA1-91
Producing a Single “Best” Parse

The parser finds the most probable parse tree given the sentence (s)
For a PCFG we have the following, where r varies over the rules used in the tree τ:

P(τ) = Π_{r ∈ τ} p(r)
SA1-92
The Penn Wall Street Journal Tree-bank

About one million words.
Average sentence length is 23 words and punctuation.
In an Oct. 19 review of “The Misanthrope” at Chicago’s Goodman Theatre (“Revitalized Classics Take the Stage in Windy City,” Leisure & Arts), the role of Celimene, played by Kim Cattrall, was mistakenly attributed to Christina Hagg.
SA1-93
“Learning” a PCFG from a Tree-Bank

(S (NP (N Salespeople))
   (VP (V sold)
       (NP (Det the)
           (N dog)
           (N biscuits)))
   (. .))

Rules extracted:
S → NP VP .
VP → V NP
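Extracting rule counts from bracketed trees like the one above, and turning them into relative-frequency probabilities, can be sketched as follows. The nested-tuple tree encoding and the function names are assumptions for illustration.

```python
from collections import Counter, defaultdict

def count_rules(tree, counts):
    # tree = (label, child, ...); terminal children are plain strings
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[(label, rhs)] += 1
    for c in children:
        if not isinstance(c, str):
            count_rules(c, counts)

def mle_pcfg(trees):
    """Relative-frequency estimates: count each rule, then divide by
    the total count of all rules sharing its left-hand side."""
    counts = Counter()
    for t in trees:
        count_rules(t, counts)
    totals = defaultdict(int)
    for (lhs, _), n in counts.items():
        totals[lhs] += n
    return {rule: n / totals[rule[0]] for rule, n in counts.items()}
```

Running this over the full tree-bank yields the rule probabilities a PCFG parser needs.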
SA1-94
Lexicalized Parsing

To do better, it is necessary to condition probabilities on the actual words of the sentence. This makes the probabilities much tighter:

p(VP → V NP NP)        = 0.00151
p(VP → V NP NP | said) = 0.00001
p(VP → V NP NP | gave) = 0.01980
SA1-95
Lexicalized Probabilities for Heads

p(prices | n-plural) = .013
p(prices | n-plural, NP) = .013
p(prices | n-plural, NP, S) = .025
p(prices | n-plural, NP, S, v-past) = .052
p(prices | n-plural, NP, S, v-past, fell) = .146
SA1-96
Statistical Parser Improvement

[Chart: labeled precision & recall (84–90%) by year, 1995–2001.]
SA1-97
Parsers and Language Models

Generative parsers are of particular interest because they can be turned into language models.
Here, of course, we sum over the parses that yield the sentence:

P(s) = Σ_{τ : τ yields s} P(τ)
SA1-98
Parsing vs. Trigram

Model                           Perplexity
Trigram, poor smoothing         167
Trigram, deleted interpolation  155
Trigram, Kneser-Ney             145
Parsing                         119

All experiments are trained on one million words of Penn tree-bank data, and tested on 80,000 words.
SA1-99
Language Modeling Applications

Speech Recognition
Machine Translation
Language Analysis
Fuzzy Keyboard
Information Retrieval
(Spelling correction, handwriting recognition, telephone keypad entry, Chinese/Japanese text entry)
SA1-100
Application in Machine Translation

Let f be a French sentence we wish to translate into English.
Let e be a possible English translation.
Choose the translation maximizing the language model times the translation model:

ê = argmax_e P(e) × P(f | e)

where P(e) is the language model and P(f | e) is the translation model.
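The noisy-channel decision rule can be sketched over an explicit candidate list; a real decoder searches the space of translations rather than enumerating it, and the toy models below are stand-ins.

```python
def best_translation(f, candidates, lm, tm):
    # Noisy channel: pick the English e maximizing P(e) * P(f | e),
    # where lm(e) is the language model and tm(f, e) the translation model.
    return max(candidates, key=lambda e: lm(e) * tm(f, e))
```

With a fluent-English language model, disfluent candidates like "impossibility." for "This is not possible." are penalized even when the channel model likes them.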
SA1-101
Why is p(f|e) Easier than p(e|f)?

[Diagram: good and bad English sentences mapped to good and bad French.]

With anything less than a perfect model of p(e|f) we will generally pick out bad English for good French.
SA1-102
Why is p(f|e) Easier than p(e|f)?

[Diagram: good and bad English sentences mapped to good and bad French.]

With a bad p(f|e) we will pick out lots of bad French, but that’s OK, because we already know the correct French!
SA1-103
MT and Parsing Language Models
Finds tree fragments that match the French.
Puts the fragments together into a parse.
SA1-104
Some Successes
Correct: This is not possible.
Trigram: Impossibility.
Parser: This is impossible.
Correct: This is the globalization of production.
Trigram: This globalization of production.
Parser: This is globalization of production.
SA1-105
Some Less than Successes
Correct: He said he often eats Chinese dishes.
Trigram: He said China frequently tastes food.
Parser: He said recurrent taste of Chinese cuisine.
Correct: Wishful thinking out of touch with reality.
Trigram: Divorce practical delusion.
Parser: Practical delusion divorced.
SA1-106
Measurement

So far we have asked how much good, say, caching does, and taken it as a fact about language modeling. Alternatively we could take it as a measurement about caching.
Also, we have asked questions about the perplexity of “all English”. Alternatively we could ask about smaller units.
SA1-107
Measurement by Sentence

A basic fact about, say, a newspaper article is that it has a first sentence, a second, etc.
How does per-word perplexity vary as a function of sentence number?

[Chart: perplexity versus sentence number.]
SA1-108
Perplexity Increases with Sentence Number

[Chart: perplexity (100–220) versus sentence number (1–15).]
SA1-109
Theory: Sentence Perplexity is Constant.

We want to measure sentence perplexity given all previous information.
In fact, models like trigrams, or even parsing language models, only look at local context.
There are useful clues from previous sentences that are not being used, and these clues increase with sentence number.
SA1-110
Which Words, How they are Used

We can categorize possible contextual influences into those that affect which words are used and those that affect how they are used.
SA1-111
Open vs. Closed Class Words

Closed class words like prepositions (“of”, “at”, “for”, etc.) or determiners (“the”, “a”, “some”, etc.) should remain constant over the sentences.
Open class words, like nouns (“piano”, “trial”, “concert”), should change depending on context.
SA1-112
Perplexity of Closed Class Items

[Chart: perplexity (100–220) versus sentence number (1–15).]
SA1-113
Perplexity of Open Class Items

[Chart: noun perplexity (200–550) versus sentence number (1–10).]
SA1-114
Contextual Lexical Effects as Detected by Caching Models

[Chart: preposition perplexity (25–41) versus sentence number (1–10).]
SA1-115
Effect of Caching on Nouns

[Chart: noun perplexity (220–570) versus sentence number (1–10).]
SA1-116
Fuzzy Keyboard

A soft keyboard is an image of a keyboard on a Palm Pilot or Windows CE device.
Very small – users can easily hit a key boundary, or the wrong key.
SA1-117
Fuzzy Keyboard Idea

If a user hits the “Q” key, then hits between the “U” and “I” keys, assume he meant “QU”
If a user hits between “Q” and “W” and then hits the “E”, assume he meant “WE”
In general, we can use a language model to help decide.
SA1-118
Fuzzy Keyboard: Language Model and Pen Positions

Math: language model times pen position:

argmax over letter sequences of P(letter sequence) × P(pen down positions | letter sequence)

For pen-down positions, collect data and compute a simple Gaussian distribution.
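The decision rule above can be sketched with an isotropic Gaussian over the pen-down position for each key. The key coordinates, the value of sigma, and the explicit candidate list are toy assumptions; a real decoder searches with the language model rather than enumerating candidates.

```python
import math

def gaussian_logpdf(pos, center, sigma=0.4):
    # Log density of an isotropic 2-D Gaussian centered on one key.
    dx, dy = pos[0] - center[0], pos[1] - center[1]
    return -(dx * dx + dy * dy) / (2 * sigma * sigma) - math.log(2 * math.pi * sigma * sigma)

def decode(taps, key_centers, candidates, lm):
    # argmax over letter sequences of P(sequence) * P(taps | sequence),
    # computed in log space for numerical stability.
    def score(seq):
        s = math.log(lm(seq))
        for pos, ch in zip(taps, seq):
            s += gaussian_logpdf(pos, key_centers[ch])
        return s
    return max(candidates, key=score)
```

When a tap lands exactly between two keys, the pen-position likelihoods tie and the language model decides, which is the slide's "QU" example.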
SA1-119
Fuzzy Keyboard Results

40% fewer errors, same speed.
See poster at AAAI
Can be applied to eye typing, etc.
SA1-120
Information Retrieval

For a given query, what document is most likely?

P(document | query) ∝ P(query | document) × P(document)

Use a language model for P(query | document). For the prior P(document): ignore it, use uniform, or learn it somehow.
SA1-121
Probability of Query

Use a simple unigram language model smoothed with the global model:

P(q_i | document) = λ C(q_i in document) / doclength + (1 − λ) C(q_i in all documents) / everything

P(query | document) = P(q_1 | document) × P(q_2 | document) × … × P(q_n | document)

Works about as well as TF/IDF (standard, simple IR techniques).
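The smoothed query-likelihood scoring can be sketched as below; the value of λ, the function names, and the toy documents are assumptions for illustration.

```python
import math
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    # P(q_i|d) = lam * c(q_i, d)/|d| + (1 - lam) * c(q_i, C)/|C|,
    # accumulated in log space over the query terms.
    d, c = Counter(doc), Counter(collection)
    dl, cl = len(doc), len(collection)
    logp = 0.0
    for q in query:
        p = lam * d[q] / dl + (1 - lam) * c[q] / cl
        if p == 0:
            return float("-inf")  # term unseen even in the collection
        logp += math.log(p)
    return logp

def rank(query, docs):
    # Return document indices, best-scoring first.
    collection = [w for doc in docs for w in doc]
    return sorted(range(len(docs)),
                  key=lambda i: query_likelihood(query, docs[i], collection),
                  reverse=True)
```

The global-model term keeps a single missing query word from zeroing out an otherwise good document, which is the point of the smoothing.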
SA1-122
Clusters for IR

Use predictive clustering:

P(q_i | document) = P(Q_i | document) × P(q_i | document, Q_i)

Example: search for Honda, find Accord, because in same cluster. Word model is smoothed with global model.
In some experiments, works better than unigram.
SA1-123
Other Language Model Uses

Handwriting Recognition
• P(observed ink | words) × P(words)
Telephone Keypad Input
• P(numbers | words) × P(words)
Spelling Correction
• P(observed keys | words) × P(words)
Chinese/Japanese Text Entry
• P(phonetic representation | characters) × P(characters)
SA1-124
Tools: CMU Language Modeling Toolkit

Can handle bigrams, trigrams, more
Can handle different smoothing schemes
Many separate tools – output of one tool is input to next: easy to use
Free for research purposes
http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html
SA1-125
Using the CMU LM Tools
SA1-126
Tools: SRI Language Modeling Toolkit

More powerful than CMU toolkit
Can handle clusters, lattices, n-best lists, hidden tags
Free for research use
http://www.speech.sri.com/projects/srilm
SA1-127
Tools: Text Normalization

What about “$3,100,000”? Convert to “three million one hundred thousand dollars”, etc.
Need to do this for dates, numbers, maybe abbreviations.
Some text-normalization tools come with the Wall Street Journal corpus, from the LDC (Linguistic Data Consortium)
Not much available
Write your own (use Perl!)
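The slide suggests writing your own normalizer; the same idea can be sketched in Python for plain dollar amounts only. A real normalizer must also handle dates, decimals, abbreviations, and context; everything here is an illustrative assumption.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def small(n):
    # Spell out 0..999.
    if n < 20:
        return ONES[n]
    if n < 100:
        return TENS[n // 10] + ("" if n % 10 == 0 else " " + ONES[n % 10])
    rest = small(n % 100) if n % 100 else ""
    return ONES[n // 100] + " hundred" + (" " + rest if rest else "")

def number_words(n):
    # Spell out 0 .. 999,999,999.
    parts = []
    for div, name in ((10 ** 6, "million"), (10 ** 3, "thousand")):
        if n >= div:
            parts.append(small(n // div) + " " + name)
            n %= div
    if n or not parts:
        parts.append(small(n))
    return " ".join(parts)

def normalize_dollars(token):
    # "$3,100,000" -> "three million one hundred thousand dollars"
    n = int(token.lstrip("$").replace(",", ""))
    return number_words(n) + " dollars"
```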
SA1-128
Small enough

Real language models are often huge
5-gram models typically larger than the training data
Use count cutoffs (eliminate parameters with few counts) or, better,
Use Stolcke pruning – finds counts that contribute least to perplexity reduction, e.g.
• P(City | New York) ≈ P(City | York)
• P(Friday | God it’s) ≈ P(Friday | it’s)
Remember, Kneser-Ney helped most when there were lots of 1 counts
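The count-cutoff idea (the simpler alternative to Stolcke pruning) can be sketched in one function; the Counter-of-tuples representation is an assumption for illustration.

```python
from collections import Counter

def count_cutoff(ngram_counts, cutoff=1):
    """Keep only n-grams seen more than `cutoff` times; at query time
    the pruned entries fall back to the lower-order model."""
    return Counter({ng: c for ng, c in ngram_counts.items() if c > cutoff})
```

Stolcke pruning improves on this by ranking entries by their contribution to perplexity rather than by raw count, so well-predicted-by-backoff entries go first regardless of frequency.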
SA1-129
Some Experiments

I re-implemented all techniques
Trained on 260,000,000 words of WSJ
Optimize parameters on heldout data
Test on separate test section
Some combinations extremely time-consuming (days of CPU time)
• Don’t try this at home, or in anything you want to ship
Rescored N-best lists to get results
• Maximum possible improvement from 10% word error rate absolute to 5%
SA1-130
Overall Results: Perplexity

[Scatter plot: perplexity (70–115) versus word error rate (8.8–10%), comparing Katz and Kneser-Ney baselines with their 5-gram, skip, sentence, and cluster variants, plus combined all-cache models (all-cache, all-cache-5gram, all-cache-skip, all-cache-sentence, all-cache-cluster, all-cache-KN).]
SA1-131
Conclusions

Use trigram models
Use any reasonable smoothing algorithm (Katz, Kneser-Ney)
Use caching if you have correction information.
Parsing is a promising technique.
Clustering, sentence mixtures, skipping not usually worth the effort.
SA1-132
Shannon Revisited

People can make GREAT use of long context
With 100 characters, computers get very roughly 50% word perplexity reduction.

Char n-gram   Low char   Upper char   Low word   Upper word
1             9.1        16.3         191,237    4,702,511
5             3.2        6.5          653        29,532
10            2.0        4.3          45         2,998
15            2.3        4.3          97         2,998
100           1.5        2.5          10         142
SA1-133
The Future?

Sentence mixture models need more exploration
Structured language models
Topic-based models
Integrating domain knowledge with language model
Other ideas?
In the end, we need real understanding
SA1-134
More Resources

Joshua’s web page: www.research.microsoft.com/~joshuago
• Smoothing technical report: good introduction to smoothing and lots of details too.
• “A Bit of Progress in Language Modeling,” which is the journal version of much of this talk.
• Papers on fuzzy keyboard, language model compression, and maximum entropy.
• Clustering tool
SA1-135
More Resources

Eugene’s web page: http://www.cs.brown.edu/people/ec
• Papers on statistical parsing for its own sake and for language modeling, as well as using language modeling to measure contextual influence.
• Pointers to software for statistical parsing as well as statistical parsers optimized for language modeling
SA1-136
More Resources: Books

Books (all are OK, none focus on language models)
• Statistical Language Learning by Eugene Charniak
• Speech and Language Processing by Dan Jurafsky and Jim Martin (especially Chapter 6)
• Foundations of Statistical Natural Language Processing by Chris Manning and Hinrich Schütze
• Statistical Methods for Speech Recognition by Frederick Jelinek
• Spoken Language Processing by Huang, Acero and Hon
SA1-137
More Resources

Sentence Mixture Models (also, caching):
• Rukmini Iyer, EE Ph.D. Thesis, 1998, "Improving and predicting performance of statistical language models in sparse domains"
• Rukmini Iyer and Mari Ostendorf. Modeling long distance dependence in language: Topic mixtures versus dynamic cache models. IEEE Transactions on Acoustics, Speech and Audio Processing, 7:30–39, January 1999.
Caching: Above, plus
• R. Kuhn. Speech recognition and the frequency of recently used words: A modified Markov model for natural language. In 12th International Conference on Computational Linguistics, pages 348–350, Budapest, August 1988.
• R. Kuhn and R. D. Mori. A cache-based natural language model for speech reproduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(6):570–583, 1990.
• R. Kuhn and R. D. Mori. Correction to a cache-based natural language model for speech reproduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(6):691–692, 1992.
SA1-138
More Resources: Clustering

The seminal reference:
• P. F. Brown, V. J. DellaPietra, P. V. deSouza, J. C. Lai, and R. L. Mercer. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, December 1992.
Two-sided clustering:
• H. Yamamoto and Y. Sagisaka. Multi-class composite n-gram based on connection direction. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Phoenix, Arizona, May 1999.
Fast clustering:
• D. R. Cutting, D. R. Karger, J. R. Pedersen, and J. W. Tukey. Scatter/gather: A cluster-based approach to browsing large document collections. In SIGIR 92, 1992.
Other:
• R. Kneser and H. Ney. Improved clustering techniques for class-based statistical language modeling. In Eurospeech 93, volume 2, pages 973–976, 1993.
SA1-139
More Resources

Structured Language Models:
• Eugene’s web page
• Ciprian Chelba’s web page: http://www.clsp.jhu.edu/people/chelba/
Maximum Entropy:
• Roni Rosenfeld’s home page and thesis: http://www.cs.cmu.edu/~roni/
• Joshua’s web page
Stolcke Pruning:
• A. Stolcke (1998), Entropy-based pruning of backoff language models. Proc. DARPA Broadcast News Transcription and Understanding Workshop, pp. 270–274, Lansdowne, VA. NOTE: get corrected version from http://www.speech.sri.com/people/stolcke
SA1-140
More Resources: Skipping

Skipping:
• X. Huang, F. Alleva, H.-W. Hon, M.-Y. Hwang, K.-F. Lee, and R. Rosenfeld. The SPHINX-II speech recognition system: An overview. Computer, Speech, and Language, 2:137–148, 1993.
Lots of stuff:
• S. Martin, C. Hamacher, J. Liermann, F. Wessel, and H. Ney. Assessment of smoothing methods and complex stochastic language modeling. In 6th European Conference on Speech Communication and Technology, volume 5, pages 1939–1942, Budapest, Hungary, September 1999.
• H. Ney, U. Essen, and R. Kneser. On structuring probabilistic dependences in stochastic language modeling. Computer, Speech, and Language, 8:1–38, 1994.