CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing


Page 1: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

CPSC 7373: Artificial Intelligence, Lecture 13: Natural Language Processing

Jiang Bian, Fall 2012, University of Arkansas at Little Rock

Page 2: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Natural Language Processing

• Understanding natural languages:
– Philosophically: we, as humans, have defined ourselves in terms of our ability to speak with and understand each other.
– Application-wise: we want to be able to talk to computers.
– Learning: we want computers to be smarter and to learn human knowledge from textbooks.

Page 3: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Language Models

• Two types of language models:
– Represented as a sequence of letters/words:
• Probabilistic: the probability of a sequence, P(word1, word2, …)
• Mostly word-based
• Learned from data
– Trees and abstract structure of words:
• Logical: L = {S1, S2, …}
• Abstraction: trees/categories
• Hand-coded

Example parse tree: S -> NP VP; NP -> Name -> Sam; VP -> Verb -> slept.

Page 4: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Bag of Words

• A bag rather than a sequence
• Unigram, Naïve Bayes model:
– Each individual word is treated as a separate factor that is unconditionally independent of all the other words.
• It is possible to take the sequence into account (a short sketch follows below).

[Figure: a jumbled word cloud illustrating the bag-of-words model]
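As a minimal sketch of the unigram, bag-of-words idea (the toy corpus, the smoothing floor, and all names below are illustrative assumptions, not part of the lecture), word probabilities can be estimated by counting, and a sentence scored as a product of independent word probabilities:

from collections import Counter

# Bag-of-words (unigram) model: word order is ignored and each word is
# treated as independent of all the others.
corpus = "the fed raises interest rates the fed raises raises".split()
counts = Counter(corpus)
total = sum(counts.values())

def unigram_prob(word, floor=1e-6):
    # Tiny floor so an unseen word does not zero out the whole product.
    p = counts[word] / total
    return p if p > 0 else floor

def sentence_prob(sentence):
    # P(w1 ... wn) ~= product of P(wi) under the bag-of-words assumption.
    p = 1.0
    for w in sentence.lower().split():
        p *= unigram_prob(w)
    return p

print(sentence_prob("the fed raises rates"))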

Page 5: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Probabilistic Models

• P(w1 w2 w3 … wn) = P(w1:n) = ∏i P(wi | w1:i-1)
• Markov Assumption:
– The effect of one variable on another is local;
– The nth word depends only on its previous k words:
• P(wi | w1:i-1) = P(wi | wi-k:i-1)
• For a first-order Markov model: P(wi | wi-1)
• Stationary Assumption:
– The conditional probability is the same at every position;
– i.e., a word's probability depends only on its surrounding words in a sentence, not on which sentence it appears in:
• P(wi | wi-1) = P(wj | wj-1)

Page 6: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Applications of Language Models

• Classification (e.g., spam)
• Clustering (e.g., news stories)
• Input correction (spelling, segmentation)
• Sentiment analysis (e.g., product reviews)
• Information retrieval (e.g., web search)
• Question answering (e.g., IBM’s Watson)
• Machine translation (e.g., Chinese to English)
• Speech recognition (e.g., Apple’s Siri)

Page 7: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

N-gram Model

• An n-gram is a contiguous sequence of n items from a given sequence of text or speech.
• Language Models (LM):
– Unigrams, Bigrams, Trigrams, …
• Applications:
– Speech recognition / data compression
• Predict the next word
– Information Retrieval
• Retrieved documents are ranked based on the probability of the query given the document's language model: P(Q | Md)

Page 8: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

N-gram examples

• S = “I saw the red house”
– Unigram:
• P(S) = P(I, saw, the, red, house) = P(I) P(saw) P(the) P(red) P(house)
– Bigram (Markov assumption):
• P(S) = P(I | <s>) P(saw | I) P(the | saw) P(red | the) P(house | red) P(</s> | house)
– Trigram:
• P(S) = P(I | <s>, <s>) P(saw | <s>, I) P(the | I, saw) P(red | saw, the) P(house | the, red) P(</s> | red, house)
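A minimal sketch of how such a factorization is computed in practice; the toy sentences, the <s>/</s> markers, and the unsmoothed maximum-likelihood estimates are illustrative assumptions rather than the lecture's exact setup:

from collections import Counter

# Train a bigram model on a toy corpus and score a sentence as the product
# of P(wi | wi-1) terms, including the sentence-boundary markers.
sentences = [
    "<s> i saw the red house </s>",
    "<s> i saw the dog </s>",
    "<s> the dog saw the house </s>",
]

unigrams, bigrams = Counter(), Counter()
for s in sentences:
    tokens = s.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p(word, prev):
    # Maximum-likelihood estimate: count(prev, word) / count(prev).
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

def sentence_prob(sentence):
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= p(word, prev)
    return prob

print(sentence_prob("i saw the red house"))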

Page 9: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

How do we train these models?

• Very large corpora: collections of text and speech
– Shakespeare
– Brown Corpus
– Wall Street Journal
– AP newswire
– Hansards
– TIMIT
– DARPA/NIST text/speech corpora (Call Home, Call Friend, ATIS, Switchboard, Broadcast News, Broadcast Conversation, TDT, Communicator)
– TRAINS, Boston Radio News Corpus

Page 10: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

A Simple Bigram Example

• Estimate the likelihood of the sentence "I want to eat Chinese food."
– P(I want to eat Chinese food) = P(I | <start>) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese) P(<end> | food)
• What do we need to calculate these likelihoods?
– Bigram probabilities for each word pair sequence in the sentence
– Calculated from a large corpus

Page 11: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Early Bigram Probabilities from BERP

Eat on         .16        Eat some       .06
Eat lunch      .06        Eat dinner     .05
Eat at         .04        Eat a          .04
Eat Indian     .04        Eat today      .03
Eat Thai       .03        Eat breakfast  .03
Eat in         .02        Eat Chinese    .02
Eat Mexican    .02        Eat tomorrow   .01
Eat dessert    .007       Eat British    .001

Page 12: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

<start> I       .25       <start> I’d         .06
<start> Tell    .04       <start> I’m         .02
I want          .32       I would             .29
I don’t         .08       I have              .04
Want to         .65       Want a              .05
Want some       .04       Want Thai           .01
To eat          .26       To have             .14
To spend        .09       To be               .02
British food    .60       British restaurant  .15
British cuisine .01       British lunch       .01

Page 13: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

• P(I want to eat British food) = P(I | <start>) P(want | I) P(to | want) P(eat | to) P(British | eat) P(food | British) = .25 × .32 × .65 × .26 × .001 × .60 ≈ .0000081
– Suppose P(<end> | food) = .2?
– How would we calculate P(I want to eat Chinese food)?
• The probabilities roughly capture “syntactic” facts and “world knowledge”:
– eat is often followed by an NP
– British food is not too popular
• N-gram models can be trained by counting and normalizing
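The two sentence probabilities can be reproduced directly from the bigram values quoted in the tables above (P(food | Chinese) = .56 appears on a later slide); as on the slide, the sentence-end factor is left out:

# Chain of bigram probabilities, values taken from the BERP tables above.
p_british = 0.25 * 0.32 * 0.65 * 0.26 * 0.001 * 0.60
p_chinese = 0.25 * 0.32 * 0.65 * 0.26 * 0.02 * 0.56

print(f"P(I want to eat British food) ~ {p_british:.7f}")   # ~0.0000081
print(f"P(I want to eat Chinese food) ~ {p_chinese:.7f}")   # ~0.0001514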

Page 14: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Early BERP Bigram Counts

Bigram counts (rows: wn-1, columns: wn):

           I      Want   To     Eat    Chinese  Food   Lunch
I          8      1087   0      13     0        0      0
Want       3      0      786    0      6        8      6
To         3      0      10     860    3        0      12
Eat        0      0      2      0      19       2      52
Chinese    2      0      0      0      0        120    1
Food       19     0      17     0      0        0      0
Lunch      4      0      0      0      0        1      0

Page 15: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Early BERP Bigram Probabilities

• Normalization: divide each row's bigram counts by the appropriate unigram count for wn-1
• Computing the bigram probability of “I I”:
– C(I, I) / C(I in all contexts)
– P(I | I) = 8 / 3437 = .0023
• Maximum Likelihood Estimation (MLE): relative frequency

Unigram counts:

I      Want   To     Eat    Chinese  Food   Lunch
3437   1215   3256   938    213      1506   459

P(wn | wn-1) = freq(wn-1, wn) / freq(wn-1)

Page 16: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

What do we learn about the language?

• What's being captured with ...
– P(want | I) = .32
– P(to | want) = .65
– P(eat | to) = .26
– P(food | Chinese) = .56
– P(lunch | eat) = .055
• What about...
– P(I | I) = .0023
– P(I | want) = .0025
– P(I | food) = .013

Page 17: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

– P(I | I) = .0023 (e.g., “I I I I want”)
– P(I | want) = .0025 (e.g., “I want I want”)
– P(I | food) = .013 (e.g., “the kind of food I want is ...”)

Page 18: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Approximating Shakespeare

• Generating sentences with random unigrams...
– Every enter now severally so, let
– Hill he late speaks; or! a more to leg less first you enter
• With bigrams...
– What means, sir. I confess she? then all sorts, he is trim, captain.
– Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry.
• Trigrams...
– Sweet prince, Falstaff shall die.
– This shall forbid it should be branded, if renown made it empty.

Page 19: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

• Quadrigrams...
– What! I will go seek the traitor Gloucester.
– Will you not tell me who I am?
– What's coming out here looks like Shakespeare because it is Shakespeare
• Note: As we increase the value of N, the accuracy of an n-gram model increases, since the choice of the next word becomes increasingly constrained.
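A minimal sketch of generating text from a bigram model; the tiny training string, the <s>/</s> markers, and sampling with random.choice over the list of observed successors (which is proportional to bigram counts) are illustrative assumptions, not the lecture's setup:

import random
from collections import defaultdict

def train_bigrams(tokens):
    # Map each word to the list of words that followed it in the corpus.
    successors = defaultdict(list)
    for prev, word in zip(tokens, tokens[1:]):
        successors[prev].append(word)
    return successors

def generate(successors, start="<s>", max_len=20):
    word, output = start, []
    for _ in range(max_len):
        candidates = successors.get(word)
        if not candidates:
            break
        word = random.choice(candidates)  # sample proportional to bigram counts
        if word == "</s>":
            break
        output.append(word)
    return " ".join(output)

tokens = "<s> to be or not to be that is the question </s>".split()
print(generate(train_bigrams(tokens)))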

Page 20: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

N-Gram Training Sensitivity

• If we repeated the Shakespeare experiment but trained our n-grams on a Wall Street Journal corpus, what would we get?

• Note: This question has major implications for corpus selection or design

Page 21: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

WSJ is not Shakespeare: Sentences Generated from WSJ

Page 22: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Probabilistic Letter Models

• The probability of a sequence of letters.
• What can we do with letter models?
– Language identification (EN, DE, FR, ES, AZ): e.g., “Hello, World” (EN), “Guten Tag, Welt” (DE), “Salam Dunya” (AZ)

Page 23: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Language Identification

• Bigram model: which column of most-frequent letter bigrams is English? German? Azerbaijani?

#   A    B    C
1   TH   EN   IN
2   TE   ER   AN
3   OU   CH   ƏR
4   AN   DE   LA
5   ER   EI   IR
6   IN   IN   AR

Page 24: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Language Identification

• Trigram model: which column is English? German? Azerbaijani?

          A       B       C
P(the)    1.1%    0.03%   0.00%
P(der)    0.06%   0.68%   0.00%
P(rba)    0.00%   0.01%   0.53%
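A minimal character-bigram language-identification sketch along these lines; the training snippets, the Laplace smoothing, and the rough bigram-vocabulary estimate are illustrative stand-ins for models trained on large corpora:

import math
from collections import Counter

# Score a text under each language's character-bigram model and pick the
# language with the highest total log-probability.
training = {
    "EN": "hello world this is a file full of english words",
    "DE": "hallo welt dies ist eine datei voll von deutschen worten",
}

def train(text, alpha=1.0):
    counts = Counter(zip(text, text[1:]))
    total = sum(counts.values())
    vocab = len(set(text)) ** 2          # rough size of the bigram space
    logp = {bg: math.log((c + alpha) / (total + alpha * vocab))
            for bg, c in counts.items()}
    unseen = math.log(alpha / (total + alpha * vocab))
    return logp, unseen

models = {lang: train(text) for lang, text in training.items()}

def identify(text):
    def score(lang):
        logp, unseen = models[lang]
        return sum(logp.get(bg, unseen) for bg in zip(text, text[1:]))
    return max(models, key=score)

print(identify("the world is full of words"))   # expected: EN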

Page 25: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Classification

People            Places                   Drugs

Steve Jobs San Francisco Lipitor

Bill Gates Palo Alto Prevacid

Andy Grove Stern Grove Zoloft

Larry Page San Mateo Zocor

Andrew Ng Santa Cruz Plavix

Jennifer Widom New York Protonix

Daphne Koller New Jersey Celebrex

Noah Goodman Jersey City Zyrtec

Julie Zelinski South San Francisco Aggrenox

Naïve Bayes, k-Nearest Neighbor, Support Vector Machine, Logistic Regression... the gzip command???

Page 26: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Gzip

• EN
– Hello world!
– This is a file full of English words…
• DE
– Hallo Welt!
– Dies ist eine Datei voll von deutschen Worte…
• AZ
– Salam Dunya!
– Bu fayl AzƏrbaycan tam sozlƏr…

This is a new piece of text to be classified: concatenate it with each language's corpus, compress, and pick the language whose corpus yields the smallest compressed size.

(echo `cat new EN | gzip | wc -c` EN; \
 echo `cat new DE | gzip | wc -c` DE; \
 echo `cat new AZ | gzip | wc -c` AZ) \
| sort -n | head -1

Page 27: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Segmentation

• Given a sequence of words, how do we break it up into meaningful segments?
– e.g., 羽西中国新锐画家大奖 (Chinese is written without spaces between words)
• Written English has spaces between words:
– e.g., words have spaces
– Speech recognition
– URL: choosespain.com
• Choose Spain
• Chooses pain

Page 28: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Segmentation

• The best segmentation is the one that maximizes the joint probability of the segmentation:
– S* = argmax P(w1:n) = argmax ∏i P(wi | w1:i-1)
– Markov assumption:
• S* ≈ argmax ∏i P(wi | wi-1)
– Naïve Bayes assumption: words don’t depend on each other
• S* ≈ argmax ∏i P(wi)

Page 29: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Segmentation

• “nowisthetime”: 12 letters
– How many possible segmentations?
• n-1
• (n-1)^2
• (n-1)!
• 2^(n-1)
• Naïve Bayes assumption:
– S* = argmax(s = f + r) P(f) P(S*(r)), where f is the first word and r is the rest of the string
– 1) Computationally easy
– 2) Learning is easier: it’s easier to estimate the unigram probabilities

Page 30: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Best Segmentation

• S* = argmax(s = f + r) P(f) P(S*(r))
• “nowisthetime”

f       r        P(f)        P(S*(r))
n       owis…    .000001     10^-19
no      wis…     .004        10^-13
now     is…      .003        10^-10
nowi    st…      -           10^-18
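A minimal sketch of this argmax as a memoized recursive segmenter; the tiny word-probability table and the crude unknown-word penalty are made-up illustrations, not estimates from the large corpus mentioned on the next slide:

from functools import lru_cache

# Best segmentation under the Naive Bayes (unigram) assumption:
# S* = argmax over (first word f, rest r) of P(f) * P(S*(r)).
P = {"now": 0.003, "is": 0.01, "the": 0.05, "time": 0.002, "no": 0.004}

def p_word(w):
    # Crude penalty for unknown words, shrinking with length.
    return P.get(w, 1e-10 / 10 ** len(w))

@lru_cache(maxsize=None)
def segment(text):
    if not text:
        return 1.0, []
    best = (0.0, [text])
    for i in range(1, len(text) + 1):
        first, rest = text[:i], text[i:]
        p_rest, words = segment(rest)
        prob = p_word(first) * p_rest
        if prob > best[0]:
            best = (prob, [first] + words)
    return best

print(segment("nowisthetime")[1])   # expected: ['now', 'is', 'the', 'time']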

Page 31: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Segmentation Examples

• Trained on a 4-billion-word corpus
• e.g.,
– Baseratesoughtto
• Base rate sought to
• Base rates ought to
– smallandinsignificant
• small and in significant
• small and insignificant
– Ginormousego
• G in or mouse go
• Ginormous ego
• What can we do to improve?
1) More data???
2) Markov assumption???
3) Smoothing???

Page 32: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Spelling Correction

• Given a misspelled word w, find the best correction c:
– c* = argmaxc P(c | w)
– Bayes' theorem: c* = argmaxc P(w | c) P(c)
• P(c): from word counts in data
• P(w | c): from spelling-error data

Page 33: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Spelling Data

• c : w => P(w | c)
– pulse: pluse
– elegant: elagent, elligit
– second: secand, sexeon, secund, seconnd, seond, sekon
– sailed: saled, saild
– blouse: boludes
– thunder: thounder
– cooking: coking, chocking, kooking, cocking
– fossil: fosscil
• We cannot have all the common misspelling cases.
– Letter-based models, e.g.,
• ul:lu

Page 34: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Correction Example

• w = “thew” => P(w|c)P(c)

w      c      w|c      P(w|c)       P(c)          10^9 · P(w|c) P(c)
thew   the    ew|e     0.000007     0.02          144
thew   thew            0.95         0.00000009    90
thew   thaw   e|a      0.001        0.0000007     0.7
thew   threw  h|hr     0.000008     0.000004      0.03
thew   thwe   ew|we    0.000003     0.00000004    0.0001
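A minimal sketch of the noisy-channel choice argmaxc P(w | c) P(c), using the probabilities from the table above; in practice both tables would be estimated from corpus counts and spelling-error data:

# Pick the candidate c maximizing P(w|c) * P(c).
P_c = {"the": 0.02, "thew": 0.00000009, "thaw": 0.0000007,
       "threw": 0.000004, "thwe": 0.00000004}

# P(w | c): probability of typing w when the intended word was c.
P_w_given_c = {("thew", "the"): 0.000007, ("thew", "thew"): 0.95,
               ("thew", "thaw"): 0.001, ("thew", "threw"): 0.000008,
               ("thew", "thwe"): 0.000003}

def correct(w):
    candidates = [c for (typed, c) in P_w_given_c if typed == w]
    return max(candidates, key=lambda c: P_w_given_c[(w, c)] * P_c[c])

print(correct("thew"))  # -> "the" (144 vs 90 after scaling by 10^9)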

Page 35: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Sentence Structure

• P(Fed raises interest rates) = ???

Fed raises interest rates

Two possible parses:
1) [S [NP [N Fed]] [VP [V raises] [NP [N interest] [N rates]]]]
2) [S [NP [N Fed] [N raises]] [VP [V interest] [NP [N rates]]]]

Page 36: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Context Free Grammar Parsing

• Sentence structure trees are constructed according to a grammar.

• A grammar is a list of rules, e.g.:
– S -> NP VP
– NP -> N | D N (determiners: e.g., the, a) | NN | NNN (e.g., mortgage interest rates)
– VP -> V NP | V | V NP NP (e.g., give me the money)
– N -> interest | Fed | rates | raises
– V -> interest | rates | raises
– D -> the | a

Page 37: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Ambiguity

How many parsing options do I have???
• The Fed raises interest rates
• The Fed raises raises
• Raises raises interest raises

Page 38: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Ambiguity

How many parsing options do I have???
• The Fed raises interest rates (2)
– The Fed (NP) raises (V) interest rates (NP)
– The Fed raises (NP) interest (V) rates (NP)
• The Fed raises raises (1)
– The Fed (NP) raises (V) raises (NP)
• Raises raises interest raises (4)
– Raises (NP) raises (V) interest raises (NP)
– Raises (NP) raises (V) interest (NP) raises (NP)
– Raises raises (NP) interest (V) raises (NP)
– Raises raises interest (NP) raises (V)

Page 39: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Problems and Solutions

Problems (T/F):
• Is it easy to omit good parses?
• Is it easy to include bad parses?
• Are trees unobservable?

Solutions (T/F):
• A probabilistic view of the trees?
• Consider word associations in the trees?
• Make the grammar unambiguous (like in programming languages)?

Page 40: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Problems and Solutions

Problems:
• Is it easy to omit good parses? True
• Is it easy to include bad parses? True
• Are trees unobservable? True

Solutions:
• A probabilistic view of the trees? True
• Consider word associations in the trees? True
• Make the grammar unambiguous (like in programming languages)? False

Page 41: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Problems of writing grammars

• Natural languages are messy, unorganized things that evolved through human history in a variety of contexts.

• It is inherently hard to specify a set of grammar rules that covers all possibilities without introducing errors.

• Ambiguity is the “enemy”…

Page 42: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Probabilistic Context-Free Grammar

• S -> NP VP (1)
– NP -> N (.3) | DN (.4) | NN (.2) | NNN (.1)
– VP -> V NP (.4) | V (.4) | V NP NP (.2)
– N -> interest (.3) | Fed (.3) | rates (.3) | raises (.1)
– V -> interest (.1) | rates (.3) | raises (.6)
– D -> the (.5) | a (.5)

Page 43: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Probabilistic Context-Free Grammar

• S -> NP VP (1)
– NP -> N (.3) | DN (.2) | NN (.2) | NNN (.1)
– VP -> V NP (.4) | V (.4) | V NP NP (.2)
– N -> interest (.3) | Fed (.3) | rates (.3) | raises (.1)
– V -> interest (.1) | rates (.3) | raises (.6)
– D -> the (.5) | a (.5)

Parse of “Fed raises interest rates”:
[S [NP [N Fed]] [VP [V raises] [NP [N interest] [N rates]]]]

P(tree) = 1 × .3 × .3 × .4 × .6 × .2 × .3 × .3 = 0.0003888 ≈ 0.039%
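A minimal sketch of how that number is obtained: the probability of a parse tree is the product of the probabilities of the rules it uses. The nested-tuple tree encoding is an illustrative choice; the rule probabilities are those of the toy grammar above:

# Probability of a parse tree under a PCFG = product of its rule probabilities.
rule_prob = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("N",)): 0.3,
    ("NP", ("N", "N")): 0.2,
    ("VP", ("V", "NP")): 0.4,
    ("N", ("Fed",)): 0.3,
    ("N", ("interest",)): 0.3,
    ("N", ("rates",)): 0.3,
    ("V", ("raises",)): 0.6,
}

# Trees as nested tuples: (label, child, child, ...); leaves are words.
tree = ("S",
        ("NP", ("N", "Fed")),
        ("VP", ("V", "raises"),
               ("NP", ("N", "interest"), ("N", "rates"))))

def tree_prob(node):
    if isinstance(node, str):        # a word at a leaf
        return 1.0
    label, *children = node
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rule_prob[(label, rhs)]
    for c in children:
        p *= tree_prob(c)
    return p

print(tree_prob(tree))  # 0.0003888, i.e. about 0.039%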

Page 44: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Probabilistic Context-Free Grammar

• S -> NP VP (1)
– NP -> N (.3) | DN (.2) | NN (.2) | NNN (.1)
– VP -> V NP (.4) | V (.4) | V NP NP (.2)
– N -> interest (.3) | Fed (.3) | rates (.3) | raises (.1)
– V -> interest (.1) | rates (.3) | raises (.6)
– D -> the (.5) | a (.5)

Two parses of “Raises raises interest rates”:
1) [S [NP [N Raises]] [VP [V raises] [NP [N interest] [N rates]]]]    P = ???%
2) [S [NP [N Raises] [N raises]] [VP [V interest] [NP [N rates]]]]    P = ???%

Page 45: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Probabilistic Context-Free Grammar

• S -> NP VP (1)
– NP -> N (.3) | DN (.2) | NN (.2) | NNN (.1)
– VP -> V NP (.4) | V (.4) | V NP NP (.2)
– N -> interest (.3) | Fed (.3) | rates (.3) | raises (.1)
– V -> interest (.1) | rates (.3) | raises (.6)
– D -> the (.5) | a (.5)

Two parses of “Raises raises interest rates”:
1) [S [NP [N Raises]] [VP [V raises] [NP [N interest] [N rates]]]]    P = .012%
2) [S [NP [N Raises] [N raises]] [VP [V interest] [NP [N rates]]]]    P = .00072%

Page 46: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Statistical Parsing

• S -> NP VP (1)
– NP -> N (.3) | DN (.2) | NN (.2) | NNN (.1)
– VP -> V NP (.4) | V (.4) | V NP NP (.2)
– N -> interest (.3) | Fed (.3) | rates (.3) | raises (.1)
– V -> interest (.1) | rates (.3) | raises (.6)
– D -> the (.5) | a (.5)
• Where do these probabilities come from?
– Training on a large annotated corpus
• e.g., the Penn Treebank Project (1990), which annotates naturally occurring text for linguistic structure.

Page 47: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

The Penn Treebank Project

( (S
    (NP-SBJ (NN Stock-market) (NNS tremors) )
    (ADVP-TMP (RB again) )
    (VP (VBD shook)
        (NP (NN bond) (NNS prices) )
        (, ,)
        (SBAR-TMP (IN while)
            (S
                (NP-SBJ (DT the) (NN dollar) )
                (VP (VBD turned)
                    (PRT (RP in) )
                    (NP-PRD (DT a) (VBN mixed) (NN performance) )))))
    (. .) ))

Page 48: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Resolving Ambiguity

• Ambiguity:
– Syntactic: more than one possible structure for the same string of words.
• e.g., We need more intelligent leaders.
– need more, or more intelligent?
– Lexical (homonymy): a word form has more than one meaning.
• e.g., Did you see the bat?
• e.g., Where is the bank?

Page 49: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

“The boy saw the man with the telescope”

[Parse tree: one of the two PP-attachment readings of the sentence, with “with the telescope” attached either to the VP (the seeing is done with the telescope) or to the NP “the man” (the man has the telescope)]

Page 50: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

“The boy saw the man with the telescope”

[Parse tree: the alternative PP-attachment reading of the same sentence]

Page 51: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Lexicalized PCFG

• CFG:
– rules: VP -> V NP NP
• PCFG:
– P(VP -> V NP NP | lhs = VP) = .2
• LPCFG:
– P(VP -> V NP NP | V = `gave’) = .25
• e.g., “Gave me the knife”
– P(VP -> V NP NP | V = `said’) = .0001
• e.g., “I said my piece”
– P(VP -> V | V = `quake’) =
– P(VP -> V NP | V = `quake’) = 0.0001
• i.e., dictionary: quake: verb (used without object)
• ??? Web: “quake the earth”; 595,000 Google results ????

Page 52: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

LPCFG

• “The boy saw the man with the telescope”
– P(NP -> NP PP | H(NP) = man, PP = with/telescope)
• “The boy saw the man with the telescope”
– P(VP -> V NP PP | V = saw, H(NP) = man, PP = with/telescope)
• These probabilities are hard to estimate, since we are conditioning on quite a few specific words.
– Back-off: instead of conditioning on H(NP) = man, we can condition on any head word.
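A minimal sketch of one way to realize such a back-off: interpolate the sparse lexicalized estimate with the unlexicalized PCFG estimate, trusting the lexicalized one more as its counts grow. The counts, the mixing constant k, and the helper names are illustrative assumptions, not the lecture's exact method:

from collections import Counter

lexicalized = Counter()    # counts of (rule, head word)
unlexicalized = Counter()  # counts of rule
lhs_lex = Counter()        # counts of (lhs, head word)
lhs = Counter()            # counts of lhs

def observe(rule, head):
    left = rule.split("->")[0].strip()
    lexicalized[(rule, head)] += 1
    unlexicalized[rule] += 1
    lhs_lex[(left, head)] += 1
    lhs[left] += 1

def rule_prob(rule, head, k=5.0):
    left = rule.split("->")[0].strip()
    p_backoff = unlexicalized[rule] / lhs[left]
    n = lhs_lex[(left, head)]
    if n == 0:
        return p_backoff
    p_lex = lexicalized[(rule, head)] / n
    lam = n / (n + k)   # trust the lexicalized estimate more as counts grow
    return lam * p_lex + (1 - lam) * p_backoff

observe("VP -> V NP NP", "gave")
observe("VP -> V NP", "said")
print(rule_prob("VP -> V NP NP", "gave"))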

Page 53: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Parsing into a Tree

Fed raises interest rates

[Target parse: [S [NP [N Fed]] [VP [V raises] [NP [N interest] [N rates]]]]]

• S -> NP VP (1)
– NP -> N (.3) | DN (.2) | NN (.2) | NNN (.1)
– VP -> V NP (.4) | V (.4) | V NP NP (.2)
– N -> interest (.3) | Fed (.3) | rates (.3) | raises (.1)
– V -> interest (.1) | rates (.3) | raises (.6)
– D -> the (.5) | a (.5)

[Bottom-up parsing: each word can be tagged as N or V (Fed: N; raises: N/V; interest: N/V; rates: N/V), and partial constituents (NP, VP, S) are built over the ambiguous tags]

Page 54: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Machine Translation

Vauquois’ pyramid: translation can operate at different levels of analysis.
• Word level: Yo lo haré mañana. => I will do it tomorrow.
• Phrase level: Yo lo haré mañana. => I will do it tomorrow.
• Tree (syntax) level: Yo lo haré mañana. => I will do it tomorrow. (mapping between source and target parse trees, e.g., S -> NP VP)
• Semantics level: Yo lo haré mañana. => I will do it tomorrow. (Action: doing + Time: tomorrow)

Page 55: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Phrase-based Translation Model

• The models define probabilities over inputs.

Morgen fliege ich nach Kanada zur Konferenz
Tomorrow I will fly to the conference in Canada

• What is the probability of a specific phrase segmentation of both languages? (Segmentation)
• What is the probability of a foreign phrase being translated as a particular English phrase? (Translation)
• What is the probability of a word/phrase changing order? (Distortion)
• Is the translated English text the best sentence? (From the language model)
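A minimal sketch of combining those factors to score one candidate translation; the phrase table, the distortion penalty, and the assumed language-model log-probability are illustrative stand-ins rather than a real system:

import math

phrase_table = {  # P(english phrase | foreign phrase)
    ("morgen", "tomorrow"): 0.7,
    ("fliege ich", "i will fly"): 0.4,
    ("nach kanada", "to canada"): 0.6,
    ("zur konferenz", "to the conference"): 0.5,
}

def score(phrase_pairs, reorderings, lm_logprob, distortion_penalty=-0.5):
    # Phrase-translation probabilities, a reordering (distortion) penalty,
    # and the target-side language model, combined in log space.
    log_p = sum(math.log(phrase_table[pair]) for pair in phrase_pairs)
    log_p += distortion_penalty * reorderings
    log_p += lm_logprob
    return log_p

pairs = [("morgen", "tomorrow"), ("fliege ich", "i will fly"),
         ("zur konferenz", "to the conference"), ("nach kanada", "to canada")]
# Two phrases were swapped relative to the German order; assume an LM log-prob.
print(score(pairs, reorderings=1, lm_logprob=-12.3))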

Page 56: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Statistical Machine Translation

• Components: translation model, language model, decoder
• Foreign/English parallel text -> statistical analysis -> translation model
• English text -> statistical analysis -> language model
• Translation model + language model -> machine translation algorithm (decoder)