Hidden Markov Models
Introduction to Artificial Intelligence
COS302
Michael L. Littman
Fall 2001
Administration
Exams need a swift kick.
Form project groups next week.
Project due on the last day. BUT! There will be milestones.
In 2 weeks: synonyms via the web.
In 3 weeks: synonyms via WordNet.
Shannon Game Again
Recall:
Sue swallowed the large green __.
Ok: pepper, frog, pea, pill
Not ok: beige, running, very
Parts of speech could help:
noun verbed det adj adj noun
POS Language Model
How could we define a probabilistic model over sequences of parts of speech?
How could we use this to define a model over word sequences?
Hidden Markov Model
Idea: We have states with a simple transition rule (Markov model). We observe a probabilistic function of the states. Therefore, the states are not what we see…
HMM Example
Browsing the web, the connection is either up or down. We observe a response or no response.
(Two-state diagram: UP emits Response 0.7 / No response 0.3; DOWN emits Response 0.0 / No response 1.0. Transitions: UP→UP 0.9, UP→DOWN 0.1, DOWN→UP 0.2, DOWN→DOWN 0.8.)
HMM Definition
N states, M observations
π(s): prob. that the starting state is s
p(s,s'): prob. of an s to s' transition
b(s,k): prob. of observing k from state s
k_0 k_1 … k_l: observation sequence
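To make the notation concrete, here is a minimal sketch (my own encoding, not part of the lecture) of the UP/DOWN example's parameters in Python. Note that π(UP) = 1.0 is an assumption read off the worked examples that follow, not stated on the diagram.

```python
# Hedged sketch: the UP/DOWN example's parameters as plain dictionaries.
# pi(UP) = 1.0 is inferred from the worked examples, not stated on the slide.
pi = {"U": 1.0, "D": 0.0}                  # pi(s): prob. starting state is s
p = {("U", "U"): 0.9, ("U", "D"): 0.1,     # p(s, s'): transition probabilities
     ("D", "U"): 0.2, ("D", "D"): 0.8}
b = {("U", "R"): 0.7, ("U", "N"): 0.3,     # b(s, k): observation probabilities
     ("D", "R"): 0.0, ("D", "N"): 1.0}     # R = response, N = no response

# Sanity check: each state's transitions and emissions sum to 1.
for s in ("U", "D"):
    assert abs(p[(s, "U")] + p[(s, "D")] - 1.0) < 1e-9
    assert abs(b[(s, "R")] + b[(s, "N")] - 1.0) < 1e-9
```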
HMM Problems
Probability of a state sequence
Probability of an observation sequence
Most likely state sequence for an observation sequence
Most likely model given an observation sequence
States Example
Probability of
U U D D U U U
is
1.0 · 0.9 · 0.1 · 0.8 · 0.2 · 0.9 · 0.9 ≈ 0.01166
Derivation
Pr(s_0 … s_l)
= Pr(s_0) Pr(s_1 | s_0) Pr(s_2 | s_0 s_1) … Pr(s_l | s_0 s_1 s_2 … s_{l-1})
= Pr(s_0) Pr(s_1 | s_0) Pr(s_2 | s_1) … Pr(s_l | s_{l-1})   (Markov property: each state depends only on the previous one)
= π(s_0) p(s_0, s_1) p(s_1, s_2) … p(s_{l-1}, s_l)
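A quick sketch (reusing my assumed dictionary encoding from above) that reproduces the States Example by multiplying π(s_0) by the chain of transition probabilities:

```python
# Pr(s_0 ... s_l) = pi(s_0) * p(s_0,s_1) * ... * p(s_{l-1},s_l)
pi = {"U": 1.0, "D": 0.0}
p = {("U", "U"): 0.9, ("U", "D"): 0.1, ("D", "U"): 0.2, ("D", "D"): 0.8}

def state_seq_prob(states):
    prob = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        prob *= p[(prev, cur)]
    return prob

print(state_seq_prob("UUDDUUU"))   # 0.011664, matching the slide's .01166
```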
States/Obs. Example
Probability of
R R N N N R N
given
U U D D U U U
is
0.7 · 0.7 · 1.0 · 1.0 · 0.3 · 0.7 · 0.3 = 0.03087
Derivation
Pr(k_0 … k_l | s_0 … s_l)
= Pr(k_0 | s_0…s_l) Pr(k_1 | s_0…s_l, k_0) Pr(k_2 | s_0…s_l, k_0, k_1) … Pr(k_l | s_0…s_l, k_0…k_{l-1})
= Pr(k_0 | s_0) Pr(k_1 | s_1) Pr(k_2 | s_2) … Pr(k_l | s_l)   (each observation depends only on its own state)
= b(s_0, k_0) b(s_1, k_1) b(s_2, k_2) … b(s_l, k_l)
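The companion sketch (same assumed encoding) reproduces the States/Obs. Example by multiplying the per-state observation probabilities:

```python
# Pr(k_0...k_l | s_0...s_l) = b(s_0,k_0) * b(s_1,k_1) * ... * b(s_l,k_l)
b = {("U", "R"): 0.7, ("U", "N"): 0.3, ("D", "R"): 0.0, ("D", "N"): 1.0}

def obs_given_states(obs, states):
    prob = 1.0
    for s, k in zip(states, obs):
        prob *= b[(s, k)]
    return prob

print(obs_given_states("RRNNNRN", "UUDDUUU"))   # 0.03087, matching the slide
```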
Observation Example
Probability of:
R R N N N R N
= sum over {s_0…s_6} of Pr(s_0…s_6) Pr(R R N N N R N | s_0…s_6)
Derivation
Pr(k_0 … k_l)
= sum over {s_0…s_l} of Pr(s_0…s_l) Pr(k_0…k_l | s_0…s_l)
= sum over {s_0…s_l} of Pr(s_0) Pr(k_0|s_0) Pr(s_1|s_0) Pr(k_1|s_1) Pr(s_2|s_1) Pr(k_2|s_2) … Pr(s_l|s_{l-1}) Pr(k_l|s_l)
= sum over {s_0…s_l} of π(s_0) b(s_0,k_0) p(s_0,s_1) b(s_1,k_1) p(s_1,s_2) b(s_2,k_2) … p(s_{l-1},s_l) b(s_l,k_l)
Uh Oh, Better Get α
The number of state sequences grows exponentially with l: N^l.
How do we do this more efficiently?
α(i,t): probability of seeing the first t observations and ending up in state i: Pr(k_0…k_t, s_t = i)
α(i,0) = π(s_i) b(s_i, k_0)
α(i,t) = sum_j α(j,t-1) p(s_j, s_i) b(s_i, k_t)
Return Pr(k_0…k_l) = sum_j α(j,l)
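A sketch of this forward recursion in Python (same assumed encoding as before); it does O(l · N²) work instead of summing over all N^l state sequences:

```python
# Forward (alpha) recursion for the UP/DOWN example.
pi = {"U": 1.0, "D": 0.0}
p = {("U", "U"): 0.9, ("U", "D"): 0.1, ("D", "U"): 0.2, ("D", "D"): 0.8}
b = {("U", "R"): 0.7, ("U", "N"): 0.3, ("D", "R"): 0.0, ("D", "N"): 1.0}
states = ["U", "D"]

def forward(obs):
    # alpha[i] = Pr(k_0...k_t, s_t = i), updated one observation at a time
    alpha = {i: pi[i] * b[(i, obs[0])] for i in states}
    for k in obs[1:]:
        alpha = {i: sum(alpha[j] * p[(j, i)] for j in states) * b[(i, k)]
                 for i in states}
    return sum(alpha.values())     # Pr(k_0...k_l) = sum_j alpha(j, l)

print(forward("NNN"))              # 0.05577
```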
Partial Derivation
α(i,t)
= Pr(k_0…k_t, s_t = i)
= sum_j Pr(k_0…k_{t-1}, s_{t-1} = j) Pr(k_0…k_t, s_t = i | k_0…k_{t-1}, s_{t-1} = j)
= sum_j α(j,t-1) Pr(k_0…k_{t-1} | k_0…k_{t-1}, s_{t-1} = j) Pr(s_t = i, k_t | k_0…k_{t-1}, s_{t-1} = j)
= sum_j α(j,t-1) Pr(s_t = i | k_0…k_{t-1}, s_{t-1} = j) Pr(k_t | s_t = i, k_0…k_{t-1}, s_{t-1} = j)
= sum_j α(j,t-1) Pr(s_t = i | s_{t-1} = j) Pr(k_t | s_t = i)
= sum_j α(j,t-1) p(s_j, s_i) b(s_i, k_t)
Pr(N N N)
= Pr(U U U) Pr(N N N | U U U)
+ Pr(U U D) Pr(N N N | U U D)
+ Pr(U D U) Pr(N N N | U D U)
+ Pr(U D D) Pr(N N N | U D D)
+ Pr(D U U) Pr(N N N | D U U)
+ Pr(D U D) Pr(N N N | D U D)
+ Pr(D D U) Pr(N N N | D D U)
+ Pr(D D D) Pr(N N N | D D D)
Pr(N N N)
(Figure: the UP/DOWN model unrolled into a trellis, states UP and DOWN in rows, one column per observation N N N.)
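Filling in this trellis by hand (my arithmetic, using the assumed parameters above):
α(U,0) = 1.0 · 0.3 = 0.3, α(D,0) = 0.0 · 1.0 = 0
α(U,1) = (0.3·0.9 + 0·0.2) · 0.3 = 0.081, α(D,1) = (0.3·0.1 + 0·0.8) · 1.0 = 0.03
α(U,2) = (0.081·0.9 + 0.03·0.2) · 0.3 = 0.02367, α(D,2) = (0.081·0.1 + 0.03·0.8) · 1.0 = 0.0321
Pr(N N N) = 0.02367 + 0.0321 = 0.05577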
POS Tagging
A simple tagging model says a part of speech depends only on the previous part of speech, and a word depends only on its part of speech.
So, the HMM state is the previous part of speech, and the observation is the word.
What are the probabilities and where could they come from? (One standard answer is sketched just below.)
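One standard answer (my addition, not spelled out on the slide): estimate them by maximum-likelihood counts from a hand-tagged corpus, e.g. p(t, t') ≈ Count(t followed by t') / Count(t) for transitions and b(t, w) ≈ Count(w tagged as t) / Count(t) for observations.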
Shannon Game
If we have a POS-based language model, how do we compute probabilities?
Pr(Sue ate the small blue candy.)
= sum over tags of Pr(tags) Pr(sent | tags)
Best Tag Sequence
Useful problem to solve: given a sentence, find the most likely sequence of POS tags.
Sue   saw   the   fly   .
noun  verb  det   noun  .
max_tags Pr(tags | sent)
= max_tags Pr(sent | tags) Pr(tags) / Pr(sent)
= c · max_tags Pr(sent | tags) Pr(tags)   (Pr(sent) is the same for every tag sequence, so it folds into the constant c)
Viterbi
δ(i,t): probability of the most likely state sequence that ends in state i when seeing the first t observations:
max over {s_0…s_{t-1}} of Pr(k_0…k_t, s_0…s_{t-1}, s_t = i)
δ(i,0) = π(s_i) b(s_i, k_0)
δ(i,t) = max_j δ(j,t-1) p(s_j, s_i) b(s_i, k_t)
Return max_j δ(j,l) (trace it back to recover the sequence)
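A sketch of the Viterbi recursion (same assumed encoding as before), keeping backpointers so the most likely state sequence can be traced back:

```python
pi = {"U": 1.0, "D": 0.0}
p = {("U", "U"): 0.9, ("U", "D"): 0.1, ("D", "U"): 0.2, ("D", "D"): 0.8}
b = {("U", "R"): 0.7, ("U", "N"): 0.3, ("D", "R"): 0.0, ("D", "N"): 1.0}
states = ["U", "D"]

def viterbi(obs):
    delta = {i: pi[i] * b[(i, obs[0])] for i in states}
    back = []                        # back[t][i] = best predecessor of i at t
    for k in obs[1:]:
        prev = {i: max(states, key=lambda j: delta[j] * p[(j, i)])
                for i in states}
        delta = {i: delta[prev[i]] * p[(prev[i], i)] * b[(i, k)]
                 for i in states}
        back.append(prev)
    last = max(states, key=lambda i: delta[i])
    seq = [last]
    for prev in reversed(back):      # trace it back
        seq.append(prev[seq[-1]])
    return delta[last], "".join(reversed(seq))

print(viterbi("NNN"))                # (0.024, 'UDD')
```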
max_{s*} Pr(s*) Pr(N N N | s*)
= max {
Pr(U U U) Pr(N N N | U U U),
Pr(U U D) Pr(N N N | U U D),
Pr(U D U) Pr(N N N | U D U),
Pr(U D D) Pr(N N N | U D D),
Pr(D U U) Pr(N N N | D U U),
Pr(D U D) Pr(N N N | D U D),
Pr(D D U) Pr(N N N | D D U),
Pr(D D D) Pr(N N N | D D D) }
max_{s*} Pr(s*) Pr(N N N | s*)
(Figure: the same trellis over N N N, with max in place of sum at each node.)
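Running the same numbers by hand (my arithmetic, not from the slides): δ(U,0) = 0.3, δ(D,0) = 0; δ(U,1) = 0.3·0.9·0.3 = 0.081, δ(D,1) = 0.3·0.1·1.0 = 0.03; δ(U,2) = 0.081·0.9·0.3 = 0.02187, δ(D,2) = 0.03·0.8·1.0 = 0.024. The max is 0.024, ending in DOWN; tracing back gives U D D, and indeed Pr(U D D) Pr(N N N | U D D) = (1.0·0.1·0.8)(0.3·1.0·1.0) = 0.024.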
PCFG
An analogy…
Finite-state Automata (FSAs) : Hidden Markov Models (HMMs) ::
Context-free Grammars (CFGs) : Probabilistic Context-free Grammars (PCFGs)
Grammars & Recursion
S  → NP VP   (1.0)
NP → NP PP   (0.45)  |  N   (0.55)
PP → P NP    (1.0)
VP → VP PP   (0.3)   |  V NP   (0.7)
N  → mice (0.4)  |  hats (0.25)  |  pimples (0.35)
P  → with (1.0)
V  → wore (1.0)
Computing with PCFGs
Can compute Pr(sent) and also max_tree Pr(tree | sent) using algorithms like the ones we discussed for HMMs. Still polynomial time.
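The slide doesn't name the algorithms, but for Pr(sent) this is usually the inside algorithm, implemented CKY-style. Here is a hedged sketch for the toy grammar above (all names and the unit-rule handling are my own; the unit rule NP → N is folded in after each chart cell is filled):

```python
from collections import defaultdict

lexical = {                       # preterminal -> word probabilities
    ("N", "mice"): 0.4, ("N", "hats"): 0.25, ("N", "pimples"): 0.35,
    ("P", "with"): 1.0, ("V", "wore"): 1.0,
}
binary = [                        # A -> B C rules with probabilities
    ("S", "NP", "VP", 1.0),
    ("NP", "NP", "PP", 0.45),
    ("PP", "P", "NP", 1.0),
    ("VP", "VP", "PP", 0.3),
    ("VP", "V", "NP", 0.7),
]
unit = [("NP", "N", 0.55)]        # the one unit rule, NP -> N

def inside(words):
    n = len(words)
    # chart[i][j]: nonterminal -> total inside probability of words[i:j]
    chart = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):                  # length-1 spans
        for (a, word), pr in lexical.items():
            if word == w:
                chart[i][i + 1][a] += pr
        for a, child, pr in unit:                  # unit closure
            chart[i][i + 1][a] += pr * chart[i][i + 1][child]
    for span in range(2, n + 1):                   # longer spans
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for a, l, r, pr in binary:
                    chart[i][j][a] += pr * chart[i][k][l] * chart[k][j][r]
            for a, child, pr in unit:
                chart[i][j][a] += pr * chart[i][j][child]
    return chart[0][n]["S"]

print(inside("mice wore hats".split()))            # 0.021175
```

Replacing the sums with max (plus backpointers) gives max_tree Pr(tree | sent), just as Viterbi replaces the forward algorithm's sums with max.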
What to Learn
HMM definition
Computing the probability of state and observation sequences
Part of speech tagging
Viterbi: most likely state sequence
PCFG definition
Homework 7 (due 11/21)
1. Give a maximization scheme for filling in the two blanks in a sentence like "I hate it when ___ goes ___ like that." Be somewhat rigorous to make the TA's job easier.
2. Derive that (a) Pr(k_0…k_l) = sum_j α(j,l), and (b) α(i,0) = π(s_i) b(s_i, k_0).
3. Email me your project group of three or four.
Homework 8 (due 11/28)
1. Write a program that decides if a pair of words are synonyms using the web. I'll send you the list, you send me the answers.
2. More soon.
Homework 9 (due 12/5)
1. Write a program that decides if a pair of words are synonyms using WordNet. I'll send you the list, you send me the answers.
2. More soon.