Hidden Markov Models
Introduction to Artificial Intelligence
COS302
Michael L. Littman
Fall 2001
Administration
Exams need a swift kick.
Form project groups next week.
Project due on the last day. BUT! There will be milestones.
In 2 weeks: synonyms via the web.
In 3 weeks: synonyms via WordNet.
Shannon Game Again
Recall:
Sue swallowed the large green __.
Ok: pepper, frog, pea, pill
Not ok: beige, running, very
Parts of speech could help:
noun verbed det adj adj noun
POS Language Model
How could we define a probabilistic model over sequences of parts of speech?
How could we use this to define a model over word sequences?
Hidden Markov Model
Idea: We have states with a simple transition rule (Markov model). We observe a probabilistic function of the states. Therefore, the states are not what we see…
HMM Example
Browsing the web, the connection is either up or down. We observe a response or no response.
(Two-state diagram: UP emits Response 0.7 / No response 0.3; DOWN emits Response 0.0 / No response 1.0. Transitions: UP→UP 0.9, UP→DOWN 0.1, DOWN→UP 0.2, DOWN→DOWN 0.8.)
HMM Definition
N states, M observations
π(s): prob. that the starting state is s
p(s,s'): prob. of an s to s' transition
b(s,k): prob. of observing k from state s
k_0 k_1 … k_l: observation sequence
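To make the notation concrete, here is a minimal sketch (my own encoding, not part of the lecture) of the UP/DOWN example's parameters in Python. Note that π(UP) = 1.0 is an assumption read off the worked examples that follow, not stated on the diagram.

```python
# Hedged sketch: the UP/DOWN example's parameters as plain dictionaries.
# pi(UP) = 1.0 is inferred from the worked examples, not stated on the slide.
pi = {"U": 1.0, "D": 0.0}                  # pi(s): prob. starting state is s
p = {("U", "U"): 0.9, ("U", "D"): 0.1,     # p(s, s'): transition probabilities
     ("D", "U"): 0.2, ("D", "D"): 0.8}
b = {("U", "R"): 0.7, ("U", "N"): 0.3,     # b(s, k): observation probabilities
     ("D", "R"): 0.0, ("D", "N"): 1.0}     # R = response, N = no response

# Sanity check: each state's transitions and emissions sum to 1.
for s in ("U", "D"):
    assert abs(p[(s, "U")] + p[(s, "D")] - 1.0) < 1e-9
    assert abs(b[(s, "R")] + b[(s, "N")] - 1.0) < 1e-9
```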
HMM Problems
Probability of a state sequence
Probability of an observation sequence
Most likely state sequence for an observation sequence
Most likely model given an observation sequence
States Example
Probability of
U U D D U U U
is
1.0 · 0.9 · 0.1 · 0.8 · 0.2 · 0.9 · 0.9 ≈ 0.01166
Derivation
Pr(s_0 … s_l)
= Pr(s_0) Pr(s_1 | s_0) Pr(s_2 | s_0 s_1) … Pr(s_l | s_0 s_1 s_2 … s_{l-1})
= Pr(s_0) Pr(s_1 | s_0) Pr(s_2 | s_1) … Pr(s_l | s_{l-1})   (Markov property: each state depends only on the previous one)
= π(s_0) p(s_0, s_1) p(s_1, s_2) … p(s_{l-1}, s_l)
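A quick sketch (reusing my assumed dictionary encoding from above) that reproduces the States Example by multiplying π(s_0) by the chain of transition probabilities:

```python
# Pr(s_0 ... s_l) = pi(s_0) * p(s_0,s_1) * ... * p(s_{l-1},s_l)
pi = {"U": 1.0, "D": 0.0}
p = {("U", "U"): 0.9, ("U", "D"): 0.1, ("D", "U"): 0.2, ("D", "D"): 0.8}

def state_seq_prob(states):
    prob = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        prob *= p[(prev, cur)]
    return prob

print(state_seq_prob("UUDDUUU"))   # 0.011664, matching the slide's .01166
```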
States/Obs. Example
Probability of
R R N N N R N
given
U U D D U U U
is
0.7 · 0.7 · 1.0 · 1.0 · 0.3 · 0.7 · 0.3 = 0.03087
Derivation
Pr(k_0 … k_l | s_0 … s_l)
= Pr(k_0 | s_0…s_l) Pr(k_1 | s_0…s_l, k_0) Pr(k_2 | s_0…s_l, k_0, k_1) … Pr(k_l | s_0…s_l, k_0…k_{l-1})
= Pr(k_0 | s_0) Pr(k_1 | s_1) Pr(k_2 | s_2) … Pr(k_l | s_l)   (each observation depends only on its own state)
= b(s_0, k_0) b(s_1, k_1) b(s_2, k_2) … b(s_l, k_l)
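The companion sketch (same assumed encoding) reproduces the States/Obs. Example by multiplying the per-state observation probabilities:

```python
# Pr(k_0...k_l | s_0...s_l) = b(s_0,k_0) * b(s_1,k_1) * ... * b(s_l,k_l)
b = {("U", "R"): 0.7, ("U", "N"): 0.3, ("D", "R"): 0.0, ("D", "N"): 1.0}

def obs_given_states(obs, states):
    prob = 1.0
    for s, k in zip(states, obs):
        prob *= b[(s, k)]
    return prob

print(obs_given_states("RRNNNRN", "UUDDUUU"))   # 0.03087, matching the slide
```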
Observation Example
Probability of:
R R N N N R N
= sum over {s_0…s_6} of Pr(s_0…s_6) Pr(R R N N N R N | s_0…s_6)
Derivation
Pr(k_0 … k_l)
= sum over {s_0…s_l} of Pr(s_0…s_l) Pr(k_0…k_l | s_0…s_l)
= sum over {s_0…s_l} of Pr(s_0) Pr(k_0|s_0) Pr(s_1|s_0) Pr(k_1|s_1) Pr(s_2|s_1) Pr(k_2|s_2) … Pr(s_l|s_{l-1}) Pr(k_l|s_l)
= sum over {s_0…s_l} of π(s_0) b(s_0,k_0) p(s_0,s_1) b(s_1,k_1) p(s_1,s_2) b(s_2,k_2) … p(s_{l-1},s_l) b(s_l,k_l)
Uh Oh, Better Get α
The number of state sequences grows exponentially with l: N^l.
How do we do this more efficiently?
α(i,t): probability of seeing the first t observations and ending up in state i: Pr(k_0…k_t, s_t = i)
α(i,0) = π(s_i) b(s_i, k_0)
α(i,t) = sum_j α(j,t-1) p(s_j, s_i) b(s_i, k_t)
Return Pr(k_0…k_l) = sum_j α(j,l)
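A sketch of this forward recursion in Python (same assumed encoding as before); it does O(l · N²) work instead of summing over all N^l state sequences:

```python
# Forward (alpha) recursion for the UP/DOWN example.
pi = {"U": 1.0, "D": 0.0}
p = {("U", "U"): 0.9, ("U", "D"): 0.1, ("D", "U"): 0.2, ("D", "D"): 0.8}
b = {("U", "R"): 0.7, ("U", "N"): 0.3, ("D", "R"): 0.0, ("D", "N"): 1.0}
states = ["U", "D"]

def forward(obs):
    # alpha[i] = Pr(k_0...k_t, s_t = i), updated one observation at a time
    alpha = {i: pi[i] * b[(i, obs[0])] for i in states}
    for k in obs[1:]:
        alpha = {i: sum(alpha[j] * p[(j, i)] for j in states) * b[(i, k)]
                 for i in states}
    return sum(alpha.values())     # Pr(k_0...k_l) = sum_j alpha(j, l)

print(forward("NNN"))              # 0.05577
```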
Partial Derivation
α(i,t)
= Pr(k_0…k_t, s_t = i)
= sum_j Pr(k_0…k_{t-1}, s_{t-1} = j) Pr(k_0…k_t, s_t = i | k_0…k_{t-1}, s_{t-1} = j)
= sum_j α(j,t-1) Pr(k_0…k_{t-1} | k_0…k_{t-1}, s_{t-1} = j) Pr(s_t = i, k_t | k_0…k_{t-1}, s_{t-1} = j)
= sum_j α(j,t-1) Pr(s_t = i | k_0…k_{t-1}, s_{t-1} = j) Pr(k_t | s_t = i, k_0…k_{t-1}, s_{t-1} = j)
= sum_j α(j,t-1) Pr(s_t = i | s_{t-1} = j) Pr(k_t | s_t = i)
= sum_j α(j,t-1) p(s_j, s_i) b(s_i, k_t)
Pr(N N N)
= Pr(U U U) Pr(N N N | U U U)
+ Pr(U U D) Pr(N N N | U U D)
+ Pr(U D U) Pr(N N N | U D U)
+ Pr(U D D) Pr(N N N | U D D)
+ Pr(D U U) Pr(N N N | D U U)
+ Pr(D U D) Pr(N N N | D U D)
+ Pr(D D U) Pr(N N N | D D U)
+ Pr(D D D) Pr(N N N | D D D)
Pr(N N N)
(Figure: the UP/DOWN model unrolled into a trellis, states UP and DOWN in rows, one column per observation N N N.)
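Filling in this trellis by hand (my arithmetic, using the assumed parameters above):
α(U,0) = 1.0 · 0.3 = 0.3, α(D,0) = 0.0 · 1.0 = 0
α(U,1) = (0.3·0.9 + 0·0.2) · 0.3 = 0.081, α(D,1) = (0.3·0.1 + 0·0.8) · 1.0 = 0.03
α(U,2) = (0.081·0.9 + 0.03·0.2) · 0.3 = 0.02367, α(D,2) = (0.081·0.1 + 0.03·0.8) · 1.0 = 0.0321
Pr(N N N) = 0.02367 + 0.0321 = 0.05577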
POS Tagging
A simple tagging model says a part of speech depends only on the previous part of speech, and a word depends only on its part of speech.
So, the HMM state is the previous part of speech, and the observation is the word.
What are the probabilities and where could they come from? (One standard answer is sketched just below.)
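One standard answer (my addition, not spelled out on the slide): estimate them by maximum-likelihood counts from a hand-tagged corpus, e.g. p(t, t') ≈ Count(t followed by t') / Count(t) for transitions and b(t, w) ≈ Count(w tagged as t) / Count(t) for observations.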
Shannon Game
If we have a POS-based language model, how do we compute probabilities?
Pr(Sue ate the small blue candy.)
= sum over tags of Pr(tags) Pr(sent | tags)
Best Tag Sequence
Useful problem to solve: given a sentence, find the most likely sequence of POS tags.
Sue   saw   the   fly   .
noun  verb  det   noun  .
max_tags Pr(tags | sent)
= max_tags Pr(sent | tags) Pr(tags) / Pr(sent)
= c · max_tags Pr(sent | tags) Pr(tags)   (Pr(sent) is the same for every tag sequence, so it folds into the constant c)
Viterbi
δ(i,t): probability of the most likely state sequence that ends in state i when seeing the first t observations:
max over {s_0…s_{t-1}} of Pr(k_0…k_t, s_0…s_{t-1}, s_t = i)
δ(i,0) = π(s_i) b(s_i, k_0)
δ(i,t) = max_j δ(j,t-1) p(s_j, s_i) b(s_i, k_t)
Return max_j δ(j,l) (trace it back to recover the sequence)
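A sketch of the Viterbi recursion (same assumed encoding as before), keeping backpointers so the most likely state sequence can be traced back:

```python
pi = {"U": 1.0, "D": 0.0}
p = {("U", "U"): 0.9, ("U", "D"): 0.1, ("D", "U"): 0.2, ("D", "D"): 0.8}
b = {("U", "R"): 0.7, ("U", "N"): 0.3, ("D", "R"): 0.0, ("D", "N"): 1.0}
states = ["U", "D"]

def viterbi(obs):
    delta = {i: pi[i] * b[(i, obs[0])] for i in states}
    back = []                        # back[t][i] = best predecessor of i at t
    for k in obs[1:]:
        prev = {i: max(states, key=lambda j: delta[j] * p[(j, i)])
                for i in states}
        delta = {i: delta[prev[i]] * p[(prev[i], i)] * b[(i, k)]
                 for i in states}
        back.append(prev)
    last = max(states, key=lambda i: delta[i])
    seq = [last]
    for prev in reversed(back):      # trace it back
        seq.append(prev[seq[-1]])
    return delta[last], "".join(reversed(seq))

print(viterbi("NNN"))                # (0.024, 'UDD')
```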
max_{s*} Pr(s*) Pr(N N N | s*)
= max {
Pr(U U U) Pr(N N N | U U U),
Pr(U U D) Pr(N N N | U U D),
Pr(U D U) Pr(N N N | U D U),
Pr(U D D) Pr(N N N | U D D),
Pr(D U U) Pr(N N N | D U U),
Pr(D U D) Pr(N N N | D U D),
Pr(D D U) Pr(N N N | D D U),
Pr(D D D) Pr(N N N | D D D) }
max_{s*} Pr(s*) Pr(N N N | s*)
(Figure: the same trellis over N N N, with max in place of sum at each node.)
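Running the same numbers by hand (my arithmetic, not from the slides): δ(U,0) = 0.3, δ(D,0) = 0; δ(U,1) = 0.3·0.9·0.3 = 0.081, δ(D,1) = 0.3·0.1·1.0 = 0.03; δ(U,2) = 0.081·0.9·0.3 = 0.02187, δ(D,2) = 0.03·0.8·1.0 = 0.024. The max is 0.024, ending in DOWN; tracing back gives U D D, and indeed Pr(U D D) Pr(N N N | U D D) = (1.0·0.1·0.8)(0.3·1.0·1.0) = 0.024.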
PCFG
An analogy…
Finite-state Automata (FSAs) : Hidden Markov Models (HMMs) ::
Context-free Grammars (CFGs) : Probabilistic Context-free Grammars (PCFGs)
Grammars & Recursion
S  → NP VP   (1.0)
NP → NP PP   (0.45)  |  N   (0.55)
PP → P NP    (1.0)
VP → VP PP   (0.3)   |  V NP   (0.7)
N  → mice (0.4)  |  hats (0.25)  |  pimples (0.35)
P  → with (1.0)
V  → wore (1.0)
Computing with PCFGs
Can compute Pr(sent) and also max_tree Pr(tree | sent) using algorithms like the ones we discussed for HMMs. Still polynomial time.
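The slide doesn't name the algorithms, but for Pr(sent) this is usually the inside algorithm, implemented CKY-style. Here is a hedged sketch for the toy grammar above (all names and the unit-rule handling are my own; the unit rule NP → N is folded in after each chart cell is filled):

```python
from collections import defaultdict

lexical = {                       # preterminal -> word probabilities
    ("N", "mice"): 0.4, ("N", "hats"): 0.25, ("N", "pimples"): 0.35,
    ("P", "with"): 1.0, ("V", "wore"): 1.0,
}
binary = [                        # A -> B C rules with probabilities
    ("S", "NP", "VP", 1.0),
    ("NP", "NP", "PP", 0.45),
    ("PP", "P", "NP", 1.0),
    ("VP", "VP", "PP", 0.3),
    ("VP", "V", "NP", 0.7),
]
unit = [("NP", "N", 0.55)]        # the one unit rule, NP -> N

def inside(words):
    n = len(words)
    # chart[i][j]: nonterminal -> total inside probability of words[i:j]
    chart = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):                  # length-1 spans
        for (a, word), pr in lexical.items():
            if word == w:
                chart[i][i + 1][a] += pr
        for a, child, pr in unit:                  # unit closure
            chart[i][i + 1][a] += pr * chart[i][i + 1][child]
    for span in range(2, n + 1):                   # longer spans
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for a, l, r, pr in binary:
                    chart[i][j][a] += pr * chart[i][k][l] * chart[k][j][r]
            for a, child, pr in unit:
                chart[i][j][a] += pr * chart[i][j][child]
    return chart[0][n]["S"]

print(inside("mice wore hats".split()))            # 0.021175
```

Replacing the sums with max (plus backpointers) gives max_tree Pr(tree | sent), just as Viterbi replaces the forward algorithm's sums with max.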
What to Learn
HMM definition
Computing the probability of state and observation sequences
Part of speech tagging
Viterbi: most likely state sequence
PCFG definition
Homework 7 (due 11/21)
1. Give a maximization scheme for filling in the two blanks in a sentence like "I hate it when ___ goes ___ like that." Be somewhat rigorous to make the TA's job easier.
2. Derive that (a) Pr(k_0…k_l) = sum_j α(j,l), and (b) α(i,0) = π(s_i) b(s_i, k_0).
3. Email me your project group of three or four.
Homework 8 (due 11/28)
1. Write a program that decides if a pair of words are synonyms using the web. I'll send you the list, you send me the answers.
2. More soon.
Homework 9 (due 12/5)
1. Write a program that decides if a pair of words are synonyms using WordNet. I'll send you the list, you send me the answers.
2. More soon.