Sequence Models Introduction to Artificial Intelligence COS302 Michael L. Littman Fall 2001.


Administration

Exams enjoyed Toronto.

Letter grades for programs:

A: 74-100 (31)

B: 30-60 (20)

C: 10-15 (4)

?: (7)

(0 did not imply “incorrect”)

Shannon Game

Sue swallowed the large green __.

pepper, frog, pea, pill

Not:

idea, beige, running, very

“AI Complete” Problem

My mom told me that playing Monopoly® with toddlers was a bad idea, but I thought it would be ok. I was wrong. Billy chewed on the “Get Out of Jail Free Card”. Todd ran away with the little metal dog. Sue swallowed the large green __.

Language Modeling

If we had a way of assigning probabilities to sentences, we could solve this. How?

Pr(Sue swallowed the large green cat.)

Pr(Sue swallowed the large green odd.)

How could such a thing be learned from data?

Why Play This Game?

Being able to assign likelihood to sentences is a useful way of processing language.

• Speech recognition
• Criterion for comparing language models
• Techniques useful for other problems

Statistical Estimation

To use statistical estimation:

• Divide data into equivalence classes
• Estimate parameters for the different classes

Conflicting Interests

Reliability
• Lots of data in each class
• So, small number of classes

Discrimination
• All relevant distinctions made
• So, large number of classes

End Points

Unigram model:

Pr(w | Sue swallowed the large green ___.) = Pr(w)

Exact match model:

Pr(w | Sue swallowed the large green ___.) = Pr(w | Sue swallowed the large green ___.)

What word would these suggest?

N-grams: Compromise

N-grams are simple, powerful.

Bigram model:

Pr(w | Sue swallowed the large green ___.) = Pr(w | green ___)

Trigram model:

Pr(w | Sue swallowed the large green ___.) = Pr(w | large green ___)

Not perfect: misses “swallowed”.

pillow, crystal, caterpillar, Iguana, Santa, tigers
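The bigram and trigram estimates above can be sketched as count ratios. This is a minimal illustration, not the course's implementation: the toy corpus and the helper names `ngram_counts` and `conditional_prob` are invented here, and the estimate is the plain maximum-likelihood ratio with no smoothing.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def conditional_prob(tokens, context, word):
    """Estimate Pr(word | context) as count(context + word) / count(context)."""
    context = tuple(context)
    n = len(context)
    # Count a context only where a next word exists, so the estimates sum to 1.
    context_count = ngram_counts(tokens[:-1], n)[context]
    full_count = ngram_counts(tokens, n + 1)[context + (word,)]
    return full_count / context_count if context_count else 0.0

# Toy corpus, invented for illustration only.
corpus = ("sue ate the large green pea and bob ate the small green pill "
          "and ann saw the large green frog").split()

bigram = conditional_prob(corpus, ["green"], "pea")            # context: green ___
trigram = conditional_prob(corpus, ["large", "green"], "pea")  # context: large green ___
```

In this corpus "green" is followed by three different words and "large green" by two, so the trigram context is more discriminating but rests on fewer observations.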

Aside: Syntax

Can do better with a little bit of knowledge about grammar:

Pr(w | Sue swallowed the large green ___.) = Pr(w | modified by swallowed, the, green)

pill, dye, one, pineapple, dragon, beans, speck, liquid, solution, drink

Estimating Trigrams

Treat sentences independently. Ok?

Pr(w_1 w_2)

Pr(w_j | w_{j-1} w_{j-2})

Pr(EOS | w_{j-1} w_{j-2})

Simple so far.
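The three quantities above combine by the chain rule into a probability for a whole sentence. A sketch, assuming a sentence of n ≥ 2 words w_1 … w_n and keeping the same ordering of the conditioning words:

```latex
\Pr(w_1 \dots w_n\ \mathrm{EOS})
  = \Pr(w_1 w_2)
    \left[ \prod_{j=3}^{n} \Pr(w_j \mid w_{j-1} w_{j-2}) \right]
    \Pr(\mathrm{EOS} \mid w_n w_{n-1})
```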

Sparsity

Pr(w | comes across)

as     8/10 (in Austen’s works)
a      1/10
more   1/10
the    0/10

Don’t estimate as zeros!

Can use Laplace smoothing, e.g., or back off to bigram, unigram.
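As a rough illustration of Laplace (add-one) smoothing on the counts above: the function name `laplace_estimate` and the 4-word vocabulary are assumptions made only to keep the arithmetic visible. A real vocabulary is far larger and shifts much more probability to unseen words.

```python
def laplace_estimate(count, total, vocab_size):
    """Add-one (Laplace) estimate: (count + 1) / (total + vocab_size)."""
    return (count + 1) / (total + vocab_size)

# Counts of words following "comes across", taken from the slide (10 in total).
counts = {"as": 8, "a": 1, "more": 1, "the": 0}
total = sum(counts.values())

# Assumption: the vocabulary is just these 4 words.
vocab_size = len(counts)

smoothed = {w: laplace_estimate(c, total, vocab_size) for w, c in counts.items()}
```

Under this assumption "the" moves from 0/10 to 1/14, and the four estimates still sum to 1.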

Unreliable Words

Can’t take much stock in words only seen once (hapax legomena). Change to “UNK”.

Generally a small fraction of the tokens and half the types.

The boy saw the dog.

5 tokens, 4 types.
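The token/type count and the “UNK” replacement can be sketched as follows. The helper names are invented for this example, and lowercasing is an assumption made so that “The” and “the” count as one type, which is what gives 4 types for the example sentence.

```python
from collections import Counter

def token_type_counts(tokens):
    """Return (number of tokens, number of distinct types)."""
    return len(tokens), len(set(tokens))

def replace_hapax(tokens, unk="UNK"):
    """Replace every word seen exactly once with the unknown-word symbol."""
    freq = Counter(tokens)
    return [w if freq[w] > 1 else unk for w in tokens]

# Assumption: case is folded before counting.
tokens = "The boy saw the dog".lower().split()
n_tokens, n_types = token_type_counts(tokens)
mapped = replace_hapax(tokens)
```

In a five-word sentence nearly every word is a hapax, so almost everything becomes “UNK”. On a large corpus the replaced words are a small share of the tokens.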

Zipf’s Law

Frequency is inversely proportional to rank.

Thus, extremely long tail!

[Figure: Word Frequencies in Tom Sawyer]

Using Trigrams

Hand me the ___ knife now .

butter

knife

Counts

me the              2832670
me the butter            88
me the knife            638
the knife            154771
the knife knife          72
the butter            92304
the butter knife        559
knife knife            7831
knife knife now           4
butter knife           9046
butter knife now         15

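Markov Model

[Figure: trigram Markov model for “Hand me the ___ knife now”. States are word pairs; each transition is labeled with the next word and its log probability.]

Hand me → me the: the (-2.4)
me the → the butter: butter (-10.4)
me the → the knife: knife (-8.4)
the butter → butter knife: knife (-5.1)
the knife → knife knife: knife (-7.7)
butter knife → knife now: now (-6.4)
knife knife → knife now: now (-7.6)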
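The choice between the two candidates can be worked through from the numbers on the Counts slide. This is a sketch under two assumptions: each trigram probability is the plain count ratio, and logs are natural logs, which matches the values in the diagram. The names `log_prob` and `score` are invented here. The “Hand me the” step is the same for both candidates, so it is left out of the comparison.

```python
import math

# Counts copied from the "Counts" slide.
counts = {
    ("me", "the"): 2832670,
    ("me", "the", "butter"): 88,
    ("me", "the", "knife"): 638,
    ("the", "knife"): 154771,
    ("the", "knife", "knife"): 72,
    ("the", "butter"): 92304,
    ("the", "butter", "knife"): 559,
    ("knife", "knife"): 7831,
    ("knife", "knife", "now"): 4,
    ("butter", "knife"): 9046,
    ("butter", "knife", "now"): 15,
}

def log_prob(w1, w2, w3):
    """Natural log of count(w1 w2 w3) / count(w1 w2)."""
    return math.log(counts[(w1, w2, w3)] / counts[(w1, w2)])

def score(candidate):
    """Sum the log probabilities of the three trigrams that involve the blank."""
    words = ["me", "the", candidate, "knife", "now"]
    return sum(log_prob(*words[i:i + 3]) for i in range(3))

scores = {c: score(c) for c in ("butter", "knife")}
best = max(scores, key=scores.get)
```

Although “knife” is the more likely word right after “me the”, the later transitions favor “butter”, so the best path through the model fills the blank with “butter”.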

Mutual Information

Log( Pr(x and y) / (Pr(x) Pr(y)) )

Measures the degree to which two events are dependent (how much “information” we learn about one from knowing the other). It is zero when the events are independent.
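A small sketch of the score, estimating each probability as a count divided by the number of observations. The function name `pmi` and all counts below are hypothetical, chosen only to show the independent and the associated case.

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """log( Pr(x and y) / (Pr(x) Pr(y)) ), with Pr estimated as count / total."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log(p_xy / (p_x * p_y))

# Hypothetical counts, invented for illustration.
# x and y co-occur exactly as often as independence predicts (0.1 * 0.1 = 0.01).
independent = pmi(count_xy=10, count_x=100, count_y=100, total=1000)
# x and y co-occur five times more often than independence predicts.
associated = pmi(count_xy=50, count_x=100, count_y=100, total=1000)
```

The first score is zero up to floating-point rounding, and the second is log 5, so larger scores indicate a stronger association.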

Mutual Inf. Application

Measure of strength of association between words

levied: imposed vs. believed

Reduces to simply (Pr(levied) is the same for both candidates, so only the conditional matters)

Pr(levied | x) = Pr(levied, x) / Pr(x)

= count(levied and x) / count(x)

“imposed” has higher score.

Analogy Idea

Find a linking word such that a mutual information score is maximized.

Tricky to find the right word. Unclear if any word will have the right effect.

traffic flows through the street
water flows through the riverbed

What to Learn

Reliability/discrimination tradeoff.

Definition of N-gram models

How to find most likely word in an N-gram model

Mutual Information

Homework 7 (due 11/21)

1. Give a maximization scheme for filling in the two blanks in a sentence like “I hate it when ___ goes ___ on me.” Be somewhat rigorous to make the TA’s job easier.

2. more soon
