Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.
![Page 1: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/1.jpg)
Hidden Markov Models: Probabilistic Reasoning Over Time
Natural Language Processing
![Page 2: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/2.jpg)
Noisy-Channel Model
• Original message not directly observable
  – Passed through some channel between sender and receiver, plus noise
  – Examples: telephone (Shannon), word sequence vs. acoustics (Jelinek), genome sequence vs. CATG, object vs. image
• Derive most likely original input based on the observed output
![Page 3: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/3.jpg)
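Bayesian Inference
• P(W|O) difficult to compute
  – W – input, O – observations
  – Generative and Sequence

$$W^* = \arg\max_W P(W \mid O) = \arg\max_W \frac{P(O \mid W)\,P(W)}{P(O)} = \arg\max_W P(O \mid W)\,P(W)$$

P(O) is the same for every candidate W, so it drops out of the argmax.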
![Page 4: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/4.jpg)
Applications
• AI: Speech recognition!, POS tagging, sense tagging, dialogue, image understanding, information retrieval
• Non-AI:
  – Bioinformatics: gene sequencing
  – Security: intrusion detection
  – Cryptography
![Page 5: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/5.jpg)
Agenda
• Hidden Markov Models
  – Uncertain observation
  – Temporal Context
  – Recognition: Viterbi
  – Training the model: Baum-Welch
• Speech Recognition
  – Framing the problem: Sounds to Sense
  – Speech Recognition as Modern AI
![Page 6: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/6.jpg)
Modelling Processes over Time
• Infer underlying state sequence from observed
• Issue: New state depends on preceding states
  – Analyzing sequences
• Problem 1: Possibly unbounded # prob tables
  – Observation+State+Time
• Solution 1: Assume stationary process
  – Rules governing process same at all times
• Problem 2: Possibly unbounded # parents
  – Markov assumption: Only consider finite history
  – Common: 1st or 2nd order Markov: depend on last couple of states
![Page 7: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/7.jpg)
Hidden Markov Models (HMMs)
• An HMM is:
  – 1) A set of states: $Q = q_0, q_1, \ldots, q_k$
  – 2) A set of transition probabilities: $A = a_{01}, \ldots, a_{mn}$
    • Where $a_{ij}$ is the probability of transition $q_i \to q_j$
  – 3) Observation probabilities: $B = b_i(o_t)$
    • The probability of observing $o_t$ in state i
  – 4) An initial probability distribution over states: $\pi_i$
    • The probability of starting in state i
  – 5) A set of accepting states
![Page 8: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/8.jpg)
Three Problems for HMMs
• Find the probability of an observation sequence given a model– Forward algorithm
• Find the most likely path through a model given an observed sequence– Viterbi algorithm (decoding)
• Find the most likely model (parameters) given an observed sequence– Baum-Welch (EM) algorithm
![Page 9: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/9.jpg)
Bins and Balls Example
• Assume there are two bins filled with red and blue balls. Behind a curtain, someone selects a bin and then draws a ball from it (and replaces it). They then select either the same bin or the other one and then select another ball…
– (Example due to J. Martin)
![Page 10: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/10.jpg)
Bins and Balls Example
[State diagram: two states, Bin 1 and Bin 2, with arcs labeled .6, .4, .3 and .7]
![Page 11: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/11.jpg)
Bins and Balls
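• Π (initial probabilities): Bin 1: 0.9; Bin 2: 0.1
• A: transition probabilities matrix

| From \ To | Bin 1 | Bin 2 |
|---|---|---|
| Bin 1 | 0.6 | 0.4 |
| Bin 2 | 0.3 | 0.7 |

• B: emission probabilities matrix

| | Bin 1 | Bin 2 |
|---|---|---|
| Red | 0.7 | 0.4 |
| Blue | 0.3 | 0.6 |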
![Page 12: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/12.jpg)
Bins and Balls
• Assume the observation sequence:
  – Blue Blue Red (BBR)
• Both bins have Red and Blue
  – Any state sequence could produce observations
• However, NOT equally likely
  – Big difference in start probabilities
  – Observation depends on state
  – State depends on prior state
![Page 13: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/13.jpg)
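Bins and Balls
Observations: Blue Blue Red

| State sequence | P(state sequence, observations) |
|---|---|
| 1 1 1 | (0.9*0.3)*(0.6*0.3)*(0.6*0.7) = 0.0204 |
| 1 1 2 | (0.9*0.3)*(0.6*0.3)*(0.4*0.4) = 0.0078 |
| 1 2 1 | (0.9*0.3)*(0.4*0.6)*(0.3*0.7) = 0.0136 |
| 1 2 2 | (0.9*0.3)*(0.4*0.6)*(0.7*0.4) = 0.0181 |
| 2 1 1 | (0.1*0.6)*(0.3*0.3)*(0.6*0.7) = 0.0023 |
| 2 1 2 | (0.1*0.6)*(0.3*0.3)*(0.4*0.4) = 0.0009 |
| 2 2 1 | (0.1*0.6)*(0.7*0.6)*(0.3*0.7) = 0.0053 |
| 2 2 2 | (0.1*0.6)*(0.7*0.6)*(0.7*0.4) = 0.0071 |

$$\arg\max_W P(O \mid W)\,P(W)$$

Each parenthesized factor pairs a state probability with an emission probability. The first factor of the first row is P(W=bin1) * P(O=blue|W=bin1) = 0.9*0.3.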
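The enumeration above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the Π, A and B values from the Bins and Balls slides; the names `pi`, `A`, `B` and `path_probability` are illustrative choices, with states 0 and 1 standing for Bin 1 and Bin 2.

```python
from itertools import product

# Parameters copied from the Bins and Balls slides (state 0 = Bin 1, state 1 = Bin 2).
pi = [0.9, 0.1]
A = [[0.6, 0.4], [0.3, 0.7]]
B = [{"Red": 0.7, "Blue": 0.3}, {"Red": 0.4, "Blue": 0.6}]

def path_probability(path, obs):
    """Joint probability of one state path and the observation sequence."""
    p = pi[path[0]] * B[path[0]][obs[0]]
    for prev, cur, o in zip(path, path[1:], obs[1:]):
        p *= A[prev][cur] * B[cur][o]   # transition, then emission
    return p

obs = ["Blue", "Blue", "Red"]
scores = {path: path_probability(path, obs)
          for path in product(range(2), repeat=len(obs))}
total = sum(scores.values())            # probability of the observations
best = max(scores, key=scores.get)      # most likely state sequence

for path, p in scores.items():
    print([s + 1 for s in path], round(p, 4))
print("P(O) =", round(total, 5), "best path:", [s + 1 for s in best])
```

Summing the eight values gives the probability of the observations, and taking the maximum gives the most likely state sequence. The cost grows as N^T, which is what motivates the dynamic programming solutions that follow.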
![Page 14: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/14.jpg)
Answers and Issues
• Here, to compute probability of observed
  – Just add up all the state sequence probabilities
• To find most likely state sequence
  – Just pick the sequence with the highest value
• Problem: Computing all paths expensive
  – 2T*N^T
• Solution: Dynamic Programming
  – Sweep across all states at each time step
    • Summing (Problem 1) or Maximizing (Problem 2)
![Page 15: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/15.jpg)
Forward Probability
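$$\alpha_j(t) = P(o_1, o_2, \ldots, o_t, q_t = j \mid \lambda)$$

$$\alpha_j(1) = \pi_j\, b_j(o_1), \quad 1 \le j \le N$$

$$\alpha_j(t+1) = \left[ \sum_{i=1}^{N} \alpha_i(t)\, a_{ij} \right] b_j(o_{t+1})$$

$$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_i(T)$$

Where α is the forward probability, t is the time in utterance, i, j are states in the HMM, a_ij is the transition probability, b_j(o_t) is the probability of observing o_t in state j, N is the max state, and T is the last time.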
![Page 16: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/16.jpg)
Forward Algorithm
• Idea: matrix where each cell forward[t,j] represents probability of being in state j after seeing first t observations.
• Each cell expresses the probability: forward[t,j] = P(o1,o2,...,ot,qt=j|w)
• qt = j means "the t-th state in the sequence of states is state j".
• Compute probability by summing over extensions of all paths leading to current cell.
• An extension of a path from a state i at time t-1 to state j at t is computed by multiplying together:
  i. previous path probability from the previous cell forward[t-1,i],
  ii. transition probability aij from previous state i to current state j,
  iii. observation likelihood bj(ot) that current state j matches observation symbol ot.
![Page 17: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/17.jpg)
Forward Algorithm
Function Forward(observations of length T, state-graph) returns forward-probability
  num-states <- num-of-states(state-graph)
  Create path probability matrix forward[num-states+2, T+2]
  forward[0,0] <- 1.0
  For each time step t from 0 to T do
    for each state s from 0 to num-states do
      for each transition s' from s in state-graph
        new-score <- forward[s,t] * a[s,s'] * b_s'(o_t)
        forward[s',t+1] <- forward[s',t+1] + new-score
  Return the sum of the probabilities in the final column of forward
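The forward recursion can be sketched in Python with NumPy. This is an illustrative version, not the slide's exact procedure: it assumes an explicit initial distribution `pi` in place of the start-state convention of the pseudocode, and it reuses the Bins and Balls parameters with observations encoded as column indices (Red = 0, Blue = 1).

```python
import numpy as np

def forward(pi, A, B, obs):
    """Return (alpha, P(O | model)), where alpha[t, j] = P(o_1..o_t, q_t = j)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, T):
        # sum over previous states i of alpha[t-1, i] * a_ij, then emit o_t in j
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha, float(alpha[-1].sum())              # termination

# Bins and Balls parameters (rows of B are bins; columns are Red=0, Blue=1).
pi = np.array([0.9, 0.1])
A = np.array([[0.6, 0.4], [0.3, 0.7]])
B = np.array([[0.7, 0.3], [0.4, 0.6]])
RED, BLUE = 0, 1

alpha, prob = forward(pi, A, B, [BLUE, BLUE, RED])
print(alpha)
print("P(Blue Blue Red) =", round(prob, 5))
```

Each time step costs N^2 multiplications, so the whole sweep is O(N^2 T) rather than exponential in T.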
![Page 18: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/18.jpg)
Viterbi Algorithm
• Find BEST sequence given signal
  – Best P(sequence|signal)
  – Take HMM & observation sequence
    • => seq (prob)
• Dynamic programming solution
  – Record most probable path ending at a state i
    • Then most probable path from i to end
    • O(bMn)
![Page 19: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/19.jpg)
Viterbi Code
Function Viterbi(observations of length T, state-graph) returns best-path
  num-states <- num-of-states(state-graph)
  Create path probability matrix viterbi[num-states+2, T+2]
  viterbi[0,0] <- 1.0
  For each time step t from 0 to T do
    for each state s from 0 to num-states do
      for each transition s' from s in state-graph
        new-score <- viterbi[s,t] * a[s,s'] * b_s'(o_t)
        if ((viterbi[s',t+1] == 0) || (viterbi[s',t+1] < new-score)) then
          viterbi[s',t+1] <- new-score
          back-pointer[s',t+1] <- s
  Backtrace from highest probability state in final column of viterbi[] & return path
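A compact Python sketch of Viterbi decoding follows. It is a hedged illustration that assumes an initial distribution `pi` rather than the start state used in the pseudocode; `delta` and `backptr` are names chosen here for the path probability matrix and the back-pointers.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Return (most likely state path, its joint probability with obs)."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))                 # best path probability ending in each state
    backptr = np.zeros((T, N), dtype=int)    # best predecessor of each state
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A   # scores[i, j]: best path into i, then i -> j
        backptr[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    # Backtrace from the highest-probability state in the final column.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    path.reverse()
    return path, float(delta[-1].max())

# Bins and Balls parameters (state 0 = Bin 1, state 1 = Bin 2; Red=0, Blue=1).
pi = np.array([0.9, 0.1])
A = np.array([[0.6, 0.4], [0.3, 0.7]])
B = np.array([[0.7, 0.3], [0.4, 0.6]])
RED, BLUE = 0, 1

path, best = viterbi(pi, A, B, [BLUE, BLUE, RED])
print("best path:", [s + 1 for s in path], "probability:", round(best, 4))
```

The only change from the forward recursion is that the sum over predecessor states becomes a max, plus the bookkeeping needed to recover the path.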
![Page 20: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/20.jpg)
Learning HMMs
• Issue: Where do the probabilities come from?
• Supervised/manual construction
• Solution: Learn from data
  – Trains transition (aij), emission (bj), and initial (πi) probabilities
    • Typically assume state structure is given
  – Unsupervised
  – Baum-Welch aka forward-backward algorithm
    • Iteratively estimate counts of transitions/emitted
    • Get estimated probabilities by forward computation
      – Divide probability mass over contributing paths
![Page 21: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/21.jpg)
Manual Construction
• Manually labeled data
  – Observation sequences, aligned to
  – Ground truth state sequences
• Compute (relative) frequencies of state transitions
• Compute frequencies of observations/state
• Compute frequencies of initial states
• Bootstrapping: iterate tag, correct, reestimate, tag.
• Problem:
  – Labeled data is expensive, hard/impossible to obtain, may be inadequate to fully estimate
    • Sparseness problems
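The relative-frequency estimates described above can be sketched as simple counting. The function name `estimate_hmm` and the two labeled sequences are hypothetical, made up purely for illustration; no smoothing is applied, so unseen events get no probability, which is the sparseness problem in miniature.

```python
from collections import Counter

def estimate_hmm(tagged_sequences):
    """Relative-frequency estimates from sequences of (state, observation) pairs."""
    init, trans, emit = Counter(), Counter(), Counter()
    from_state, in_state = Counter(), Counter()
    for seq in tagged_sequences:
        states = [s for s, _ in seq]
        init[states[0]] += 1                       # initial-state counts
        for s, o in seq:
            emit[s, o] += 1                        # observation-per-state counts
            in_state[s] += 1
        for prev, cur in zip(states, states[1:]):
            trans[prev, cur] += 1                  # state-transition counts
            from_state[prev] += 1
    pi = {s: c / len(tagged_sequences) for s, c in init.items()}
    a = {k: c / from_state[k[0]] for k, c in trans.items()}
    b = {k: c / in_state[k[0]] for k, c in emit.items()}
    return pi, a, b

# Hypothetical hand-labeled data: each item is (bin, ball colour).
data = [
    [("Bin1", "Blue"), ("Bin1", "Red"), ("Bin2", "Blue")],
    [("Bin1", "Red"), ("Bin2", "Blue"), ("Bin2", "Red")],
]
pi, a, b = estimate_hmm(data)
print(pi)
print(a)
print(b)
```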
![Page 22: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/22.jpg)
Unsupervised Learning
• Re-estimation from unlabeled data
  – Baum-Welch aka forward-backward algorithm
  – Assume "representative" collection of data
    • E.g. recorded speech, gene sequences, etc
  – Assign initial probabilities
    • Or estimate from very small labeled sample
  – Compute state sequences given the data
    • I.e. use forward algorithm
  – Update transition, emission, initial probabilities
![Page 23: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/23.jpg)
Updating Probabilities
• Intuition:
  – Observations identify state sequences
  – Adjust probability of transitions/emissions
  – Make closer to those consistent with observed
  – Increase P(Observations|Model)
• Functionally
  – For each state i, what proportion of transitions from state i go to state j
  – For each state i, what proportion of observations match O?
  – How often is state i the initial state?
![Page 24: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/24.jpg)
Estimating Transitions
• Consider updating transition aij
  – Compute probability of all paths using aij
  – Compute probability of all paths through i (w/ and w/o i->j)

[Diagram: transition from state i to state j]
![Page 25: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/25.jpg)
Forward Probability
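$$\alpha_j(t) = P(o_1, o_2, \ldots, o_t, q_t = j \mid \lambda)$$

$$\alpha_j(1) = \pi_j\, b_j(o_1), \quad 1 \le j \le N$$

$$\alpha_j(t+1) = \left[ \sum_{i=1}^{N} \alpha_i(t)\, a_{ij} \right] b_j(o_{t+1})$$

$$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_i(T)$$

Where α is the forward probability, t is the time in utterance, i, j are states in the HMM, a_ij is the transition probability, b_j(o_t) is the probability of observing o_t in state j, N is the max state, and T is the last time.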
![Page 26: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/26.jpg)
Backward Probability
$$\beta_i(T) = 1$$

$$\beta_i(t) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_j(t+1)$$

$$P(O \mid \lambda) = \sum_{j=1}^{N} \pi_j\, b_j(o_1)\, \beta_j(1)$$

Where β is the backward probability, t is the time in sequence, i, j are states in the HMM, a_ij is the transition probability, b_j(o_t) is the probability of observing o_t in state j, N is the final state, and T is the last time.
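The backward recursion mirrors the forward one, sweeping from the last time step to the first. The sketch below is illustrative and assumes the Bins and Balls parameters, with β initialized to 1 at the final time step; the termination step should recover the same P(O) as the forward pass.

```python
import numpy as np

def backward(pi, A, B, obs):
    """Return (beta, P(O | model)), where beta[t, i] = P(o_{t+1}..o_T | q_t = i)."""
    T, N = len(obs), len(pi)
    beta = np.ones((T, N))                               # beta at the last step is 1
    for t in range(T - 2, -1, -1):
        # sum over next states j of a_ij * b_j(o_{t+1}) * beta[t+1, j]
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    prob = float(np.sum(pi * B[:, obs[0]] * beta[0]))    # termination
    return beta, prob

# Bins and Balls parameters (rows of B are bins; columns are Red=0, Blue=1).
pi = np.array([0.9, 0.1])
A = np.array([[0.6, 0.4], [0.3, 0.7]])
B = np.array([[0.7, 0.3], [0.4, 0.6]])
RED, BLUE = 0, 1

beta, prob = backward(pi, A, B, [BLUE, BLUE, RED])
print(beta)
print("P(Blue Blue Red) =", round(prob, 5))
```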
![Page 27: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/27.jpg)
Re-estimating
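• Estimate transitions from i->j

$$\xi_t(i,j) = \frac{\alpha_i(t)\, a_{ij}\, b_j(o_{t+1})\, \beta_j(t+1)}{P(O \mid \lambda)}$$

$$\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \sum_{j=1}^{N} \xi_t(i,j)}$$

• Estimate observations in j

$$\gamma_j(t) = \frac{P(q_t = j, O \mid \lambda)}{P(O \mid \lambda)} = \frac{\alpha_j(t)\, \beta_j(t)}{P(O \mid \lambda)}$$

$$\hat{b}_j(v_k) = \frac{\sum_{t=1,\ \text{s.t. } o_t = v_k}^{T} \gamma_j(t)}{\sum_{t=1}^{T} \gamma_j(t)}$$

• Estimate initial i

$$\hat{\pi}_i = \gamma_i(1)$$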
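One re-estimation pass can be sketched as follows. This is a minimal illustration for a single observation sequence, assuming the standard forward-backward quantities (called `xi` and `gamma` here) and the Bins and Balls parameters as the starting point; a practical trainer would work over many sequences, iterate to convergence, and use scaling or log probabilities to avoid underflow.

```python
import numpy as np

def baum_welch_step(pi, A, B, obs):
    """One forward-backward re-estimation pass on a single observation sequence."""
    obs = np.asarray(obs)
    T, N = len(obs), len(pi)
    alpha, beta = np.zeros((T, N)), np.ones((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    prob = alpha[-1].sum()

    # xi[t, i, j]: probability of taking transition i -> j at time t, given obs.
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :] / prob
    gamma = alpha * beta / prob              # gamma[t, j]: probability of state j at time t

    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / xi.sum(axis=(0, 2))[:, None]
    new_B = np.stack([gamma[obs == k].sum(axis=0) for k in range(B.shape[1])], axis=1)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_pi, new_A, new_B, float(prob)

pi = np.array([0.9, 0.1])
A = np.array([[0.6, 0.4], [0.3, 0.7]])
B = np.array([[0.7, 0.3], [0.4, 0.6]])     # columns: Red=0, Blue=1
obs = [1, 1, 0]                            # Blue Blue Red

pi1, A1, B1, p0 = baum_welch_step(pi, A, B, obs)
_, _, _, p1 = baum_welch_step(pi1, A1, B1, obs)
print("P(O) before:", round(p0, 5), "after one update:", round(p1, 5))
```

Each update should not decrease P(O | model), which is the property that makes iterating the procedure worthwhile.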
![Page 28: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/28.jpg)
Speech Recognition
• Goal:
  – Given an acoustic signal, identify the sequence of words that produced it
  – Speech understanding goal:
    • Given an acoustic signal, identify the meaning intended by the speaker
• Issues:
  – Ambiguity: many possible pronunciations
  – Uncertainty: what signal, what word/sense produced this sound sequence
![Page 29: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/29.jpg)
Decomposing Speech Recognition
• Q1: What speech sounds were uttered?
  – Human languages: 40-50 phones
    • Basic sound units: b, m, k, ax, ey, … (ARPAbet)
    • Distinctions categorical to speakers
      – Acoustically continuous
    • Part of knowledge of language
      – Build per-language inventory
      – Could we learn these?
![Page 30: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/30.jpg)
Decomposing Speech Recognition
• Q2: What words produced these sounds?
  – Look up sound sequences in dictionary
  – Problem 1: Homophones
    • Two words, same sounds: too, two
  – Problem 2: Segmentation
    • No "space" between words in continuous speech
    • "I scream"/"ice cream", "Wreck a nice beach"/"Recognize speech"
• Q3: What meaning produced these words?
  – NLP (But that's not all!)
![Page 31: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/31.jpg)
![Page 32: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/32.jpg)
Signal Processing
• Goal: Convert impulses from microphone into a representation that
  – is compact
  – encodes features relevant for speech recognition
• Compactness: Step 1
  – Sampling rate: how often to look at the data
    • 8 kHz, 16 kHz (44.1 kHz = CD quality)
  – Quantization factor: how much precision
    • 8-bit, 16-bit (encoding: μ-law, linear…)
![Page 33: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/33.jpg)
(A Little More) Signal Processing
• Compactness & Feature identification
  – Capture mid-length speech phenomena
    • Typically "frames" of 10 ms (80 samples at 8 kHz)
      – Overlapping
  – Vector of features: e.g. energy at some frequency
  – Vector quantization:
    • n-feature vectors: n-dimension space
      – Divide into m regions (e.g. 256)
      – All vectors in region get same label, e.g. C256
      – Use labels rather than n-feature vectors
      – No longer popular in large-scale systems
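Framing and vector quantization can be sketched on synthetic data. Everything specific below is an assumption for illustration: the 50% frame overlap (`hop=40`), the two toy features (log energy and zero-crossing count), and a 16-entry codebook drawn at random from the data, where a real system would train a larger codebook (e.g. 256 entries with k-means) on spectral features.

```python
import numpy as np

def frame_signal(signal, frame_len=80, hop=40):
    """Split a 1-D signal into overlapping frames (80 samples = 10 ms at 8 kHz)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])

def quantize(features, codebook):
    """Label each feature vector with the index of its nearest codebook entry."""
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)
signal = rng.standard_normal(8000)           # one second of synthetic "audio" at 8 kHz
frames = frame_signal(signal)

# Toy 2-dimensional feature vector per frame (illustrative only).
energy = np.log((frames ** 2).sum(axis=1))
zero_cross = (np.diff(np.sign(frames), axis=1) != 0).sum(axis=1)
features = np.column_stack([energy, zero_cross])

# Stand-in codebook: 16 feature vectors sampled from the data.
codebook = features[rng.choice(len(features), size=16, replace=False)]
labels = quantize(features, codebook)        # one discrete symbol per frame
print(frames.shape, labels[:10])
```

The resulting label sequence plays the role of the discrete observations o_1, ..., o_T consumed by the HMM.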
![Page 34: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/34.jpg)
Signal Processing
![Page 35: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/35.jpg)
Speech Recognition Model
• Question: Given signal, what words?
• Problem: uncertainty
  – Capture of sound by microphone, how phones produce sounds, which words make phones, etc
• Solution: Probabilistic model
  – P(words|signal) = P(signal|words)P(words)/P(signal)
  – Idea: Maximize P(signal|words)*P(words)
    • P(signal|words): acoustic model; P(words): language model
![Page 36: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/36.jpg)
Language Model
• Idea: some utterances more probable
• Standard solution: "n-gram" model
  – Typically tri-gram: P(w_i | w_{i-1}, w_{i-2})
    • Collect training data
    • Smooth with bi- & uni-grams to handle sparseness
  – Product over words in utterance
![Page 37: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/37.jpg)
Acoustic Model
• P(signal|words)
  – words -> phones + phones -> vector quantization
• Words -> phones (pronunciation model)
  – Pronunciation dictionary lookup
    • Multiple pronunciations?
      – Probability distribution
        » Dialect variation: tomato
        » + Coarticulation
  – Product along path

[Figure: pronunciation network for "tomato": t ow m, then aa (0.5) or ey (0.5), then t ow]
P(t ow m aa t ow | tomato) = 0.5

[Figure: pronunciation network with coarticulation: t, then ow (0.2) or ax (0.8), then m, then aa (0.5) or ey (0.5), then t ow]
P(t ow m aa t ow | tomato) = 0.2*0.5 = 0.1
![Page 38: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/38.jpg)
Pronunciation Example
• Observations: 0/1
![Page 39: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/39.jpg)
Acoustic Model
• P(signal|phones):
  – How a phone maps into a sequence of frames
  – Problem: Phones can be pronounced differently
    • Speaker differences, speaking rate, microphone
    • Phones may not even appear, different contexts
  – Observation sequence is uncertain
• Solution: Hidden Markov Models
  – 1) Hidden => Observations uncertain
  – 2) Probability of word sequences =>
    • State transition probabilities
  – 3) 1st order Markov => use 1 prior state
![Page 40: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/40.jpg)
Acoustic Model
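• 3-state phone model for [m]
  – Use Hidden Markov Model (HMM)
  – Probability of sequence: sum of prob of paths

[Figure: left-to-right HMM with states Onset, Mid, End, Final]

Transition probabilities (the two arcs leaving each state): Onset: 0.7, 0.3; Mid: 0.9, 0.1; End: 0.4, 0.6

Observation probabilities:

| State | Observation probabilities |
|---|---|
| Onset | C1: 0.5, C2: 0.2, C3: 0.3 |
| Mid | C3: 0.2, C4: 0.7, C5: 0.1 |
| End | C4: 0.1, C6: 0.5, C6: 0.4 |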
![Page 41: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/41.jpg)
ASR Training
• Models to train:
  – Language model: typically tri-gram
  – Observation likelihoods: B
  – Transition probabilities: A
  – Pronunciation lexicon: sub-phone, word
• Training materials:
  – Speech files with word transcription
  – Large text corpus
  – Small phonetically transcribed speech corpus
![Page 42: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/42.jpg)
Training
• Language model:
  – Uses large text corpus to train n-grams
    • 500 M words
• Pronunciation model:
  – HMM state graph
  – Manual coding from dictionary
    • Expand to triphone context and sub-phone models
![Page 43: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/43.jpg)
HMM Training
• Training the observations:
  – E.g. Gaussian: set uniform initial mean/variance
    • Train based on contents of small (e.g. 4hr) phonetically labeled speech set (e.g. Switchboard)
• Training A&B:
  – Forward-Backward algorithm training
![Page 44: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/44.jpg)
Does it work?
• Yes:
  – 99% on isolated single digits
  – 95% on restricted short utterances (air travel)
  – 89+% professional news broadcast
• No:
  – 77% Conversational English
  – 67% Conversational Mandarin (CER)
  – 55% Meetings
  – ?? Noisy cocktail parties
![Page 45: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/45.jpg)
N-grams
• Perspective:
  – Some sequences (words/chars) are more likely than others
  – Given sequence, can guess most likely next
• Used in
  – Speech recognition
  – Spelling correction
  – Augmentative communication
  – Other NL applications
![Page 46: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/46.jpg)
Probabilistic Language Generation
• Coin-flipping models
  – A sentence is generated by a randomized algorithm
    • The generator can be in one of several "states"
    • Flip coins to choose the next state.
    • Flip other coins to decide which letter or word to output
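The coin-flipping generator can be sketched directly: one weighted draw picks the output for the current state, another picks the next state. The sketch below is illustrative only; it reuses the Bins and Balls parameters as the generator's tables, and the function name `generate` and the fixed seed are choices made here.

```python
import random

# Generator tables (Bins and Balls parameters reused as a toy example).
pi = {"Bin1": 0.9, "Bin2": 0.1}
A = {"Bin1": {"Bin1": 0.6, "Bin2": 0.4}, "Bin2": {"Bin1": 0.3, "Bin2": 0.7}}
B = {"Bin1": {"Red": 0.7, "Blue": 0.3}, "Bin2": {"Red": 0.4, "Blue": 0.6}}

def draw(dist, rng):
    """Weighted coin flip over the keys of a probability table."""
    return rng.choices(list(dist), weights=list(dist.values()))[0]

def generate(length, rng):
    """Sample a hidden state sequence and the outputs it emits."""
    states, outputs = [], []
    state = draw(pi, rng)
    for _ in range(length):
        states.append(state)
        outputs.append(draw(B[state], rng))   # flip coins for the output
        state = draw(A[state], rng)           # flip coins for the next state
    return states, outputs

states, outputs = generate(5, random.Random(0))
print(states)
print(outputs)
```

An observer who sees only `outputs` faces exactly the decoding problem addressed by the forward and Viterbi algorithms.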
![Page 47: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/47.jpg)
Shannon’s Generated Language
• 1. Zero-order approximation:
  – XFOML RXKXRJFFUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD
• 2. First-order approximation:
  – OCRO HLI RGWR NWIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH RBL
• 3. Second-order approximation:
  – ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIND ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE
![Page 48: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/48.jpg)
Shannon’s Word Models
• 1. First-order approximation:
  – REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE
• 2. Second-order approximation:
  – THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED
![Page 49: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/49.jpg)
Corpus Counts
• Estimate probabilities by counts in large collections of text/speech
• Issues:
  – Wordforms (surface) vs lemma (root)
  – Case? Punctuation? Disfluency?
  – Type (distinct words) vs Token (total)
![Page 50: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/50.jpg)
Basic N-grams
• Most trivial: 1/#tokens: too simple!
• Standard unigram: frequency
  – # word occurrences/total corpus size
    • E.g. the=0.07; rabbit = 0.00001
  – Too simple: no context!
• Conditional probabilities of word sequences

$$P(w_1^n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1^2) \cdots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})$$
![Page 51: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/51.jpg)
Markov Assumptions
• Exact computation requires too much data
• Approximate probability given all prior words
  – Assume finite history
  – Bigram: Probability of word given 1 previous
    • First-order Markov
  – Trigram: Probability of word given 2 previous
• N-gram approximation

$$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})$$

• Bigram sequence

$$P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})$$
![Page 52: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/52.jpg)
Issues
• Relative frequency
  – Typically compute count of sequence
    • Divide by prefix

$$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}$$

• Corpus sensitivity
  – Shakespeare vs Wall Street Journal
    • Very unnatural
• N-grams
  – Unigram: little; bigrams: collocations; trigrams: phrases
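The relative-frequency estimate can be sketched as a count of each bigram divided by the count of its prefix. The three-sentence toy corpus and the `<s>` / `</s>` boundary markers are assumptions for illustration, and no smoothing is applied.

```python
from collections import Counter

def bigram_mle(sentences):
    """Relative-frequency bigram estimates P(w | prev) = C(prev w) / C(prev)."""
    context, pairs = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        context.update(tokens[:-1])                 # prefix counts
        pairs.update(zip(tokens, tokens[1:]))       # bigram counts
    return {(prev, w): c / context[prev] for (prev, w), c in pairs.items()}

corpus = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]
probs = bigram_mle(corpus)
print(probs[("<s>", "I")], probs[("I", "am")], probs[("am", "Sam")])

def sentence_probability(sentence, probs):
    """Product of bigram probabilities over the sentence; 0.0 for unseen bigrams."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, w in zip(tokens, tokens[1:]):
        p *= probs.get((prev, w), 0.0)
    return p

print(sentence_probability("I am Sam", probs))
```

Any sentence containing an unseen bigram gets probability zero, which is why practical models smooth these estimates.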
![Page 53: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/53.jpg)
Evaluating n-gram models
• Entropy & Perplexity
  – Information theoretic measures
  – Measures information in grammar or fit to data
  – Conceptually, lower bound on # bits to encode
• Entropy: H(X): X is a random variable, p: probability function

$$H(X) = -\sum_{x \in X} p(x) \log_2 p(x)$$

  – E.g. 8 things: number as code => 3 bits/transmission
  – Alternative: short code if high probability; longer if lower
    • Can reduce the average number of bits
• Perplexity: $2^H$
  – Weighted average of number of choices
![Page 54: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/54.jpg)
Computing Entropy
• Picking horses (Cover and Thomas)
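• Send message: identify horse - 1 of 8
  – If all horses equally likely, p(i) = 1/8

$$H(X) = -\sum_{i=1}^{8} \frac{1}{8} \log_2 \frac{1}{8} = -\log_2 \frac{1}{8} = 3 \text{ bits}$$

  – Some horses more likely:
    • 1: 1/2; 2: 1/4; 3: 1/8; 4: 1/16; 5,6,7,8: 1/64

$$H(X) = -\sum_{i=1}^{8} p(i) \log_2 p(i) = 2 \text{ bits}$$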
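Both horse-race entropies can be checked numerically. This is a small sketch; the function name `entropy` is a choice made here, and perplexity is computed as 2 raised to the entropy.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [1 / 8] * 8
skewed = [1 / 2, 1 / 4, 1 / 8, 1 / 16] + [1 / 64] * 4

for name, dist in [("uniform", uniform), ("skewed", skewed)]:
    h = entropy(dist)
    print(name, "entropy:", h, "bits; perplexity:", 2 ** h)
```

The skewed race needs only 2 bits on average because likely horses can be given shorter codes, and its perplexity of 4 says the choice is about as hard as picking among 4 equally likely horses.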
![Page 55: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/55.jpg)
Entropy of a Sequence
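• Basic sequence

$$\frac{1}{n} H(W_1^n) = -\frac{1}{n} \sum_{W_1^n \in L} p(W_1^n) \log p(W_1^n)$$

• Entropy of language: infinite lengths

$$H(L) = \lim_{n \to \infty} -\frac{1}{n} \sum_{W \in L} p(w_1, \ldots, w_n) \log p(w_1, \ldots, w_n)$$

  – Assume stationary & ergodic

$$H(L) = \lim_{n \to \infty} -\frac{1}{n} \log p(w_1, \ldots, w_n)$$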
![Page 56: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/56.jpg)
Cross-Entropy
• Comparing models
  – Actual distribution unknown
  – Use simplified model to estimate
    • Closer match will have lower cross-entropy

$$H(p, m) = \lim_{n \to \infty} -\frac{1}{n} \sum_{W \in L} p(w_1, \ldots, w_n) \log m(w_1, \ldots, w_n)$$

$$H(p, m) = \lim_{n \to \infty} -\frac{1}{n} \log m(w_1, \ldots, w_n)$$
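On a finite test sample, the second form is approximated by the average negative log probability the model assigns to the held-out tokens. The sketch below is illustrative: the two unigram models and the eight-token test string are invented for the example, and a unigram model is assumed so that the sequence probability factors into per-token terms.

```python
import math

def cross_entropy(model, test_tokens):
    """Per-token cross-entropy in bits: -(1/n) * log2 m(w_1 ... w_n) for a unigram model."""
    return -sum(math.log2(model[w]) for w in test_tokens) / len(test_tokens)

# Hypothetical held-out data and two candidate models.
test_tokens = "a b a c a b a d".split()
skewed_model = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
uniform_model = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}

for name, model in [("skewed", skewed_model), ("uniform", uniform_model)]:
    h = cross_entropy(model, test_tokens)
    print(name, "cross-entropy:", h, "perplexity:", round(2 ** h, 3))
```

The skewed model matches the test frequencies exactly, so it scores the lower cross-entropy and perplexity of the two.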
![Page 57: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/57.jpg)
Perplexity Model Comparison
• Compare models with different history
• Train models
  – 38 million words – Wall Street Journal
• Compute perplexity on held-out test set
  – 1.5 million words (~20K unique, smoothed)

| N-gram Order | Perplexity |
|---|---|
| Unigram | 962 |
| Bigram | 170 |
| Trigram | 109 |
![Page 58: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/58.jpg)
Entropy of English
• Shannon's experiment
  – Subjects guess strings of letters, count guesses
  – Entropy of guess sequence = Entropy of letter sequence
  – 1.3 bits; Restricted text
• Build stochastic model on text & compute
  – Brown computed trigram model on varied corpus
  – Compute (per-char) entropy of model
  – 1.75 bits
![Page 59: Hidden Markov Models: Probabilistic Reasoning Over Time Natural Language Processing.](https://reader035.fdocuments.us/reader035/viewer/2022062423/56649f425503460f94c61e33/html5/thumbnails/59.jpg)
Speech Recognition as Modern AI
• Draws on wide range of AI techniques
  – Knowledge representation & manipulation
    • Optimal search: Viterbi decoding
  – Machine Learning
    • Baum-Welch for HMMs
    • Nearest neighbor & k-means clustering for signal id
  – Probabilistic reasoning/Bayes rule
    • Manage uncertainty in signal, phone, word mapping
• Enables real world application