CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing


Page 1: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

CPSC 7373: Artificial Intelligence, Lecture 13: Natural Language Processing

Jiang Bian, Fall 2012, University of Arkansas at Little Rock

Page 2: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Natural Language Processing

• Understanding natural languages:
– Philosophically: we, as humans, have defined ourselves in terms of our ability to speak with and understand each other.
– Application-wise: we want to be able to talk to computers.
– Learning: we want computers to be smarter and to learn human knowledge from textbooks.

Page 3: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Language Models

• Two types of language models:
– Represented as a sequence of letters/words:
• Probabilistic: the probability of a sequence, P(word1, word2, …)
• Mostly word-based
• Learned from data
– Trees and abstract structure of words:
• Logical: L = {S1, S2, …}
• Abstraction: trees/categories
• Hand-coded

Example parse tree: S -> NP VP; NP -> Name -> Sam; VP -> Verb -> slept.

Page 4: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Bag of Words

• A bag rather than a sequence
• Unigram, Naïve Bayes model:
– Each individual word is treated as a separate factor that is unconditionally independent of all the other words.
• It is possible to take the sequence into account (a short sketch follows below).

[Figure: a jumbled word cloud illustrating the bag-of-words model]
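As a minimal sketch of the unigram, bag-of-words idea (the toy corpus, the smoothing floor, and all names below are illustrative assumptions, not part of the lecture), word probabilities can be estimated by counting, and a sentence scored as a product of independent word probabilities:

from collections import Counter

# Bag-of-words (unigram) model: word order is ignored and each word is
# treated as independent of all the others.
corpus = "the fed raises interest rates the fed raises raises".split()
counts = Counter(corpus)
total = sum(counts.values())

def unigram_prob(word, floor=1e-6):
    # Tiny floor so an unseen word does not zero out the whole product.
    p = counts[word] / total
    return p if p > 0 else floor

def sentence_prob(sentence):
    # P(w1 ... wn) ~= product of P(wi) under the bag-of-words assumption.
    p = 1.0
    for w in sentence.lower().split():
        p *= unigram_prob(w)
    return p

print(sentence_prob("the fed raises rates"))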

Page 5: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Probabilistic Models

• P(w1 w2 w3 … wn) = P(w1:n) = ∏i P(wi | w1:i-1)
• Markov Assumption:
– The effect of one variable on another is local;
– The nth word depends only on its previous k words:
• P(wi | w1:i-1) = P(wi | wi-k:i-1)
• For a first-order Markov model: P(wi | wi-1)
• Stationary Assumption:
– The conditional probability is the same at every position;
– i.e., a word's probability depends only on its surrounding words in a sentence, not on which sentence it appears in:
• P(wi | wi-1) = P(wj | wj-1)

Page 6: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Applications of Language Models

• Classification (e.g., spam)
• Clustering (e.g., news stories)
• Input correction (spelling, segmentation)
• Sentiment analysis (e.g., product reviews)
• Information retrieval (e.g., web search)
• Question answering (e.g., IBM’s Watson)
• Machine translation (e.g., Chinese to English)
• Speech recognition (e.g., Apple’s Siri)

Page 7: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

N-gram Model

• An n-gram is a contiguous sequence of n items from a given sequence of text or speech.
• Language Models (LM):
– Unigrams, Bigrams, Trigrams, …
• Applications:
– Speech recognition / data compression
• Predict the next word
– Information Retrieval
• Retrieved documents are ranked based on the probability of the query given the document's language model: P(Q | Md)

Page 8: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

N-gram examples

• S = “I saw the red house”
– Unigram:
• P(S) = P(I, saw, the, red, house) = P(I) P(saw) P(the) P(red) P(house)
– Bigram (Markov assumption):
• P(S) = P(I | <s>) P(saw | I) P(the | saw) P(red | the) P(house | red) P(</s> | house)
– Trigram:
• P(S) = P(I | <s>, <s>) P(saw | <s>, I) P(the | I, saw) P(red | saw, the) P(house | the, red) P(</s> | red, house)
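A minimal sketch of how such a factorization is computed in practice; the toy sentences, the <s>/</s> markers, and the unsmoothed maximum-likelihood estimates are illustrative assumptions rather than the lecture's exact setup:

from collections import Counter

# Train a bigram model on a toy corpus and score a sentence as the product
# of P(wi | wi-1) terms, including the sentence-boundary markers.
sentences = [
    "<s> i saw the red house </s>",
    "<s> i saw the dog </s>",
    "<s> the dog saw the house </s>",
]

unigrams, bigrams = Counter(), Counter()
for s in sentences:
    tokens = s.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p(word, prev):
    # Maximum-likelihood estimate: count(prev, word) / count(prev).
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

def sentence_prob(sentence):
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= p(word, prev)
    return prob

print(sentence_prob("i saw the red house"))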

Page 9: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

How do we train these models?

• Very large corpora: collections of text and speech
– Shakespeare
– Brown Corpus
– Wall Street Journal
– AP newswire
– Hansards
– TIMIT
– DARPA/NIST text/speech corpora (Call Home, Call Friend, ATIS, Switchboard, Broadcast News, Broadcast Conversation, TDT, Communicator)
– TRAINS, Boston Radio News Corpus

Page 10: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

A Simple Bigram Example

• Estimate the likelihood of the sentence "I want to eat Chinese food."
– P(I want to eat Chinese food) = P(I | <start>) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese) P(<end> | food)
• What do we need to calculate these likelihoods?
– Bigram probabilities for each word pair sequence in the sentence
– Calculated from a large corpus

Page 11: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Early Bigram Probabilities from BERP

Eat on         .16        Eat some       .06
Eat lunch      .06        Eat dinner     .05
Eat at         .04        Eat a          .04
Eat Indian     .04        Eat today      .03
Eat Thai       .03        Eat breakfast  .03
Eat in         .02        Eat Chinese    .02
Eat Mexican    .02        Eat tomorrow   .01
Eat dessert    .007       Eat British    .001

Page 12: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

<start> I       .25       <start> I’d         .06
<start> Tell    .04       <start> I’m         .02
I want          .32       I would             .29
I don’t         .08       I have              .04
Want to         .65       Want a              .05
Want some       .04       Want Thai           .01
To eat          .26       To have             .14
To spend        .09       To be               .02
British food    .60       British restaurant  .15
British cuisine .01       British lunch       .01

Page 13: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

• P(I want to eat British food) = P(I | <start>) P(want | I) P(to | want) P(eat | to) P(British | eat) P(food | British) = .25 × .32 × .65 × .26 × .001 × .60 ≈ .0000081
– Suppose P(<end> | food) = .2?
– How would we calculate P(I want to eat Chinese food)?
• The probabilities roughly capture “syntactic” facts and “world knowledge”:
– eat is often followed by an NP
– British food is not too popular
• N-gram models can be trained by counting and normalizing
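The two sentence probabilities can be reproduced directly from the bigram values quoted in the tables above (P(food | Chinese) = .56 appears on a later slide); as on the slide, the sentence-end factor is left out:

# Chain of bigram probabilities, values taken from the BERP tables above.
p_british = 0.25 * 0.32 * 0.65 * 0.26 * 0.001 * 0.60
p_chinese = 0.25 * 0.32 * 0.65 * 0.26 * 0.02 * 0.56

print(f"P(I want to eat British food) ~ {p_british:.7f}")   # ~0.0000081
print(f"P(I want to eat Chinese food) ~ {p_chinese:.7f}")   # ~0.0001514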

Page 14: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Early BERP Bigram Counts

Bigram counts (rows: wn-1, columns: wn):

           I      Want   To     Eat    Chinese  Food   Lunch
I          8      1087   0      13     0        0      0
Want       3      0      786    0      6        8      6
To         3      0      10     860    3        0      12
Eat        0      0      2      0      19       2      52
Chinese    2      0      0      0      0        120    1
Food       19     0      17     0      0        0      0
Lunch      4      0      0      0      0        1      0

Page 15: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Early BERP Bigram Probabilities

• Normalization: divide each row's bigram counts by the appropriate unigram count for wn-1
• Computing the bigram probability of “I I”:
– C(I, I) / C(I in all contexts)
– P(I | I) = 8 / 3437 = .0023
• Maximum Likelihood Estimation (MLE): relative frequency

Unigram counts:

I      Want   To     Eat    Chinese  Food   Lunch
3437   1215   3256   938    213      1506   459

P(wn | wn-1) = freq(wn-1, wn) / freq(wn-1)

Page 16: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

What do we learn about the language?

• What's being captured with ...
– P(want | I) = .32
– P(to | want) = .65
– P(eat | to) = .26
– P(food | Chinese) = .56
– P(lunch | eat) = .055
• What about...
– P(I | I) = .0023
– P(I | want) = .0025
– P(I | food) = .013

Page 17: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

– P(I | I) = .0023 (e.g., “I I I I want”)
– P(I | want) = .0025 (e.g., “I want I want”)
– P(I | food) = .013 (e.g., “the kind of food I want is ...”)

Page 18: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Approximating Shakespeare

• Generating sentences with random unigrams...
– Every enter now severally so, let
– Hill he late speaks; or! a more to leg less first you enter
• With bigrams...
– What means, sir. I confess she? then all sorts, he is trim, captain.
– Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry.
• Trigrams...
– Sweet prince, Falstaff shall die.
– This shall forbid it should be branded, if renown made it empty.

Page 19: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

• Quadrigrams...
– What! I will go seek the traitor Gloucester.
– Will you not tell me who I am?
– What's coming out here looks like Shakespeare because it is Shakespeare
• Note: As we increase the value of N, the accuracy of an n-gram model increases, since the choice of the next word becomes increasingly constrained.
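A minimal sketch of generating text from a bigram model; the tiny training string, the <s>/</s> markers, and sampling with random.choice over the list of observed successors (which is proportional to bigram counts) are illustrative assumptions, not the lecture's setup:

import random
from collections import defaultdict

def train_bigrams(tokens):
    # Map each word to the list of words that followed it in the corpus.
    successors = defaultdict(list)
    for prev, word in zip(tokens, tokens[1:]):
        successors[prev].append(word)
    return successors

def generate(successors, start="<s>", max_len=20):
    word, output = start, []
    for _ in range(max_len):
        candidates = successors.get(word)
        if not candidates:
            break
        word = random.choice(candidates)  # sample proportional to bigram counts
        if word == "</s>":
            break
        output.append(word)
    return " ".join(output)

tokens = "<s> to be or not to be that is the question </s>".split()
print(generate(train_bigrams(tokens)))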

Page 20: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

N-Gram Training Sensitivity

• If we repeated the Shakespeare experiment but trained our n-grams on a Wall Street Journal corpus, what would we get?

• Note: This question has major implications for corpus selection or design

Page 21: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

WSJ is not Shakespeare: Sentences Generated from WSJ

Page 22: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Probabilistic Letter Models

• The probability of a sequence of letters.
• What can we do with letter models?
– Language identification (EN, DE, FR, ES, AZ): e.g., “Hello, World” (EN), “Guten Tag, Welt” (DE), “Salam Dunya” (AZ)

Page 23: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Language Identification

• Bigram model: which column of most-frequent letter bigrams is English? German? Azerbaijani?

#   A    B    C
1   TH   EN   IN
2   TE   ER   AN
3   OU   CH   ƏR
4   AN   DE   LA
5   ER   EI   IR
6   IN   IN   AR

Page 24: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Language Identification

• Trigram model: which column is English? German? Azerbaijani?

          A       B       C
P(the)    1.1%    0.03%   0.00%
P(der)    0.06%   0.68%   0.00%
P(rba)    0.00%   0.01%   0.53%
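A minimal character-bigram language-identification sketch along these lines; the training snippets, the Laplace smoothing, and the rough bigram-vocabulary estimate are illustrative stand-ins for models trained on large corpora:

import math
from collections import Counter

# Score a text under each language's character-bigram model and pick the
# language with the highest total log-probability.
training = {
    "EN": "hello world this is a file full of english words",
    "DE": "hallo welt dies ist eine datei voll von deutschen worten",
}

def train(text, alpha=1.0):
    counts = Counter(zip(text, text[1:]))
    total = sum(counts.values())
    vocab = len(set(text)) ** 2          # rough size of the bigram space
    logp = {bg: math.log((c + alpha) / (total + alpha * vocab))
            for bg, c in counts.items()}
    unseen = math.log(alpha / (total + alpha * vocab))
    return logp, unseen

models = {lang: train(text) for lang, text in training.items()}

def identify(text):
    def score(lang):
        logp, unseen = models[lang]
        return sum(logp.get(bg, unseen) for bg in zip(text, text[1:]))
    return max(models, key=score)

print(identify("the world is full of words"))   # expected: EN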

Page 25: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Classification

People            Places                   Drugs

Steve Jobs San Francisco Lipitor

Bill Gates Palo Alto Prevacid

Andy Grove Stern Grove Zoloft

Larry Page San Mateo Zocor

Andrew Ng Santa Cruz Plavix

Jennifer Widom New York Protonix

Daphne Koller New Jersey Celebrex

Noah Goodman Jersey City Zyrtec

Julie Zelinski South San Francisco Aggrenox

Naïve Bayes, k-Nearest Neighbor, Support Vector Machine, Logistic Regression... the gzip command???

Page 26: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Gzip

• EN
– Hello world!
– This is a file full of English words…
• DE
– Hallo Welt!
– Dies ist eine Datei voll von deutschen Worte…
• AZ
– Salam Dunya!
– Bu fayl AzƏrbaycan tam sozlƏr…

This is a new piece of text to be classified: concatenate it with each language's corpus, compress, and pick the language whose corpus yields the smallest compressed size.

(echo `cat new EN | gzip | wc -c` EN; \
 echo `cat new DE | gzip | wc -c` DE; \
 echo `cat new AZ | gzip | wc -c` AZ) \
| sort -n | head -1

Page 27: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Segmentation

• Given a sequence of words, how do we break it up into meaningful segments?
– e.g., 羽西中国新锐画家大奖 (Chinese is written without spaces between words)
• Written English has spaces between words:
– e.g., words have spaces
– Speech recognition
– URL: choosespain.com
• Choose Spain
• Chooses pain

Page 28: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Segmentation

• The best segmentation is the one that maximizes the joint probability of the segmentation:
– S* = argmax P(w1:n) = argmax ∏i P(wi | w1:i-1)
– Markov assumption:
• S* ≈ argmax ∏i P(wi | wi-1)
– Naïve Bayes assumption: words don’t depend on each other
• S* ≈ argmax ∏i P(wi)

Page 29: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Segmentation

• “nowisthetime”: 12 letters
– How many possible segmentations?
• n-1
• (n-1)^2
• (n-1)!
• 2^(n-1)
• Naïve Bayes assumption:
– S* = argmax(s = f + r) P(f) P(S*(r)), where f is the first word and r is the rest of the string
– 1) Computationally easy
– 2) Learning is easier: it’s easier to estimate the unigram probabilities

Page 30: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Best Segmentation

• S* = argmax(s = f + r) P(f) P(S*(r))
• “nowisthetime”

f       r        P(f)        P(S*(r))
n       owis…    .000001     10^-19
no      wis…     .004        10^-13
now     is…      .003        10^-10
nowi    st…      -           10^-18
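A minimal sketch of this argmax as a memoized recursive segmenter; the tiny word-probability table and the crude unknown-word penalty are made-up illustrations, not estimates from the large corpus mentioned on the next slide:

from functools import lru_cache

# Best segmentation under the Naive Bayes (unigram) assumption:
# S* = argmax over (first word f, rest r) of P(f) * P(S*(r)).
P = {"now": 0.003, "is": 0.01, "the": 0.05, "time": 0.002, "no": 0.004}

def p_word(w):
    # Crude penalty for unknown words, shrinking with length.
    return P.get(w, 1e-10 / 10 ** len(w))

@lru_cache(maxsize=None)
def segment(text):
    if not text:
        return 1.0, []
    best = (0.0, [text])
    for i in range(1, len(text) + 1):
        first, rest = text[:i], text[i:]
        p_rest, words = segment(rest)
        prob = p_word(first) * p_rest
        if prob > best[0]:
            best = (prob, [first] + words)
    return best

print(segment("nowisthetime")[1])   # expected: ['now', 'is', 'the', 'time']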

Page 31: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Segmentation Examples

• Trained on a 4-billion-word corpus
• e.g.,
– Baseratesoughtto
• Base rate sought to
• Base rates ought to
– smallandinsignificant
• small and in significant
• small and insignificant
– Ginormousego
• G in or mouse go
• Ginormous ego
• What can we do to improve?
1) More data???
2) Markov assumption???
3) Smoothing???

Page 32: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Spelling Correction

• Given a misspelled word w, find the best correction c:
– c* = argmaxc P(c | w)
– Bayes' theorem: c* = argmaxc P(w | c) P(c)
• P(c): from word counts in data
• P(w | c): from spelling-error data

Page 33: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Spelling Data

• c : w => P(w | c)
– pulse: pluse
– elegant: elagent, elligit
– second: secand, sexeon, secund, seconnd, seond, sekon
– sailed: saled, saild
– blouse: boludes
– thunder: thounder
– cooking: coking, chocking, kooking, cocking
– fossil: fosscil
• We cannot have all the common misspelling cases.
– Letter-based models, e.g.,
• ul:lu

Page 34: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Correction Example

• w = “thew” => P(w|c)P(c)

w      c      w|c      P(w|c)       P(c)          10^9 · P(w|c) P(c)
thew   the    ew|e     0.000007     0.02          144
thew   thew            0.95         0.00000009    90
thew   thaw   e|a      0.001        0.0000007     0.7
thew   threw  h|hr     0.000008     0.000004      0.03
thew   thwe   ew|we    0.000003     0.00000004    0.0001
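A minimal sketch of the noisy-channel choice argmaxc P(w | c) P(c), using the probabilities from the table above; in practice both tables would be estimated from corpus counts and spelling-error data:

# Pick the candidate c maximizing P(w|c) * P(c).
P_c = {"the": 0.02, "thew": 0.00000009, "thaw": 0.0000007,
       "threw": 0.000004, "thwe": 0.00000004}

# P(w | c): probability of typing w when the intended word was c.
P_w_given_c = {("thew", "the"): 0.000007, ("thew", "thew"): 0.95,
               ("thew", "thaw"): 0.001, ("thew", "threw"): 0.000008,
               ("thew", "thwe"): 0.000003}

def correct(w):
    candidates = [c for (typed, c) in P_w_given_c if typed == w]
    return max(candidates, key=lambda c: P_w_given_c[(w, c)] * P_c[c])

print(correct("thew"))  # -> "the" (144 vs 90 after scaling by 10^9)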

Page 35: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Sentence Structure

• P(Fed raises interest rates) = ???

Fed raises interest rates

Two possible parses:
1) [S [NP [N Fed]] [VP [V raises] [NP [N interest] [N rates]]]]
2) [S [NP [N Fed] [N raises]] [VP [V interest] [NP [N rates]]]]

Page 36: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Context Free Grammar Parsing

• Sentence structure trees are constructed according to a grammar.

• A grammar is a list of rules, e.g.:
– S -> NP VP
– NP -> N | D N (determiners: e.g., the, a) | NN | NNN (e.g., mortgage interest rates)
– VP -> V NP | V | V NP NP (e.g., give me the money)
– N -> interest | Fed | rates | raises
– V -> interest | rates | raises
– D -> the | a

Page 37: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Ambiguity

How many parsing options do I have???
• The Fed raises interest rates
• The Fed raises raises
• Raises raises interest raises

Page 38: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Ambiguity

How many parsing options do I have???
• The Fed raises interest rates (2)
– The Fed (NP) raises (V) interest rates (NP)
– The Fed raises (NP) interest (V) rates (NP)
• The Fed raises raises (1)
– The Fed (NP) raises (V) raises (NP)
• Raises raises interest raises (4)
– Raises (NP) raises (V) interest raises (NP)
– Raises (NP) raises (V) interest (NP) raises (NP)
– Raises raises (NP) interest (V) raises (NP)
– Raises raises interest (NP) raises (V)

Page 39: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Problems and Solutions

Problems (T/F):
• Is it easy to omit good parses?
• Is it easy to include bad parses?
• Are trees unobservable?

Solutions (T/F):
• A probabilistic view of the trees?
• Consider word associations in the trees?
• Make the grammar unambiguous (like in programming languages)?

Page 40: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Problems and Solutions

Problems:
• Is it easy to omit good parses? True
• Is it easy to include bad parses? True
• Are trees unobservable? True

Solutions:
• A probabilistic view of the trees? True
• Consider word associations in the trees? True
• Make the grammar unambiguous (like in programming languages)? False

Page 41: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Problems of writing grammars

• Natural languages are messy, unorganized things that evolved through human history in a variety of contexts.

• It is inherently hard to specify a set of grammar rules that covers all possibilities without introducing errors.

• Ambiguity is the “enemy”…

Page 42: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Probabilistic Context-Free Grammar

• S -> NP VP (1)
– NP -> N (.3) | DN (.4) | NN (.2) | NNN (.1)
– VP -> V NP (.4) | V (.4) | V NP NP (.2)
– N -> interest (.3) | Fed (.3) | rates (.3) | raises (.1)
– V -> interest (.1) | rates (.3) | raises (.6)
– D -> the (.5) | a (.5)

Page 43: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Probabilistic Context-Free Grammar

• S -> NP VP (1)
– NP -> N (.3) | DN (.2) | NN (.2) | NNN (.1)
– VP -> V NP (.4) | V (.4) | V NP NP (.2)
– N -> interest (.3) | Fed (.3) | rates (.3) | raises (.1)
– V -> interest (.1) | rates (.3) | raises (.6)
– D -> the (.5) | a (.5)

Parse of “Fed raises interest rates”:
[S [NP [N Fed]] [VP [V raises] [NP [N interest] [N rates]]]]

P(tree) = 1 × .3 × .3 × .4 × .6 × .2 × .3 × .3 = 0.0003888 ≈ 0.039%
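A minimal sketch of how that number is obtained: the probability of a parse tree is the product of the probabilities of the rules it uses. The nested-tuple tree encoding is an illustrative choice; the rule probabilities are those of the toy grammar above:

# Probability of a parse tree under a PCFG = product of its rule probabilities.
rule_prob = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("N",)): 0.3,
    ("NP", ("N", "N")): 0.2,
    ("VP", ("V", "NP")): 0.4,
    ("N", ("Fed",)): 0.3,
    ("N", ("interest",)): 0.3,
    ("N", ("rates",)): 0.3,
    ("V", ("raises",)): 0.6,
}

# Trees as nested tuples: (label, child, child, ...); leaves are words.
tree = ("S",
        ("NP", ("N", "Fed")),
        ("VP", ("V", "raises"),
               ("NP", ("N", "interest"), ("N", "rates"))))

def tree_prob(node):
    if isinstance(node, str):        # a word at a leaf
        return 1.0
    label, *children = node
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rule_prob[(label, rhs)]
    for c in children:
        p *= tree_prob(c)
    return p

print(tree_prob(tree))  # 0.0003888, i.e. about 0.039%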

Page 44: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Probabilistic Context-Free Grammar

• S -> NP VP (1)
– NP -> N (.3) | DN (.2) | NN (.2) | NNN (.1)
– VP -> V NP (.4) | V (.4) | V NP NP (.2)
– N -> interest (.3) | Fed (.3) | rates (.3) | raises (.1)
– V -> interest (.1) | rates (.3) | raises (.6)
– D -> the (.5) | a (.5)

Two parses of “Raises raises interest rates”:
1) [S [NP [N Raises]] [VP [V raises] [NP [N interest] [N rates]]]]    P = ???%
2) [S [NP [N Raises] [N raises]] [VP [V interest] [NP [N rates]]]]    P = ???%

Page 45: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Probabilistic Context-Free Grammar

• S -> NP VP (1)
– NP -> N (.3) | DN (.2) | NN (.2) | NNN (.1)
– VP -> V NP (.4) | V (.4) | V NP NP (.2)
– N -> interest (.3) | Fed (.3) | rates (.3) | raises (.1)
– V -> interest (.1) | rates (.3) | raises (.6)
– D -> the (.5) | a (.5)

Two parses of “Raises raises interest rates”:
1) [S [NP [N Raises]] [VP [V raises] [NP [N interest] [N rates]]]]    P = .012%
2) [S [NP [N Raises] [N raises]] [VP [V interest] [NP [N rates]]]]    P = .00072%

Page 46: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Statistical Parsing

• S -> NP VP (1)
– NP -> N (.3) | DN (.2) | NN (.2) | NNN (.1)
– VP -> V NP (.4) | V (.4) | V NP NP (.2)
– N -> interest (.3) | Fed (.3) | rates (.3) | raises (.1)
– V -> interest (.1) | rates (.3) | raises (.6)
– D -> the (.5) | a (.5)
• Where do these probabilities come from?
– Training on a large annotated corpus
• e.g., the Penn Treebank Project (1990), which annotates naturally occurring text for linguistic structure.

Page 47: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

The Penn Treebank Project

( (S
    (NP-SBJ (NN Stock-market) (NNS tremors) )
    (ADVP-TMP (RB again) )
    (VP (VBD shook)
        (NP (NN bond) (NNS prices) )
        (, ,)
        (SBAR-TMP (IN while)
            (S
                (NP-SBJ (DT the) (NN dollar) )
                (VP (VBD turned)
                    (PRT (RP in) )
                    (NP-PRD (DT a) (VBN mixed) (NN performance) )))))
    (. .) ))

Page 48: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Resolving Ambiguity

• Ambiguity:
– Syntactic: more than one possible structure for the same string of words.
• e.g., We need more intelligent leaders.
– need more, or more intelligent?
– Lexical (homonymy): a word form has more than one meaning.
• e.g., Did you see the bat?
• e.g., Where is the bank?

Page 49: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

“The boy saw the man with the telescope”

[Parse tree: one of the two PP-attachment readings of the sentence, with “with the telescope” attached either to the VP (the seeing is done with the telescope) or to the NP “the man” (the man has the telescope)]

Page 50: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

“The boy saw the man with the telescope”

[Parse tree: the alternative PP-attachment reading of the same sentence]

Page 51: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Lexicalized PCFG

• CFG:
– rules: VP -> V NP NP
• PCFG:
– P(VP -> V NP NP | lhs = VP) = .2
• LPCFG:
– P(VP -> V NP NP | V = `gave’) = .25
• e.g., “Gave me the knife”
– P(VP -> V NP NP | V = `said’) = .0001
• e.g., “I said my piece”
– P(VP -> V | V = `quake’) =
– P(VP -> V NP | V = `quake’) = 0.0001
• i.e., dictionary: quake: verb (used without object)
• ??? Web: “quake the earth”; 595,000 Google results ????

Page 52: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

LPCFG

• “The boy saw the man with the telescope”
– P(NP -> NP PP | H(NP) = man, PP = with/telescope)
• “The boy saw the man with the telescope”
– P(VP -> V NP PP | V = saw, H(NP) = man, PP = with/telescope)
• These probabilities are hard to estimate, since we are conditioning on quite a few specific words.
– Back-off: instead of conditioning on H(NP) = man, we can condition on any head word.
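A minimal sketch of one way to realize such a back-off: interpolate the sparse lexicalized estimate with the unlexicalized PCFG estimate, trusting the lexicalized one more as its counts grow. The counts, the mixing constant k, and the helper names are illustrative assumptions, not the lecture's exact method:

from collections import Counter

lexicalized = Counter()    # counts of (rule, head word)
unlexicalized = Counter()  # counts of rule
lhs_lex = Counter()        # counts of (lhs, head word)
lhs = Counter()            # counts of lhs

def observe(rule, head):
    left = rule.split("->")[0].strip()
    lexicalized[(rule, head)] += 1
    unlexicalized[rule] += 1
    lhs_lex[(left, head)] += 1
    lhs[left] += 1

def rule_prob(rule, head, k=5.0):
    left = rule.split("->")[0].strip()
    p_backoff = unlexicalized[rule] / lhs[left]
    n = lhs_lex[(left, head)]
    if n == 0:
        return p_backoff
    p_lex = lexicalized[(rule, head)] / n
    lam = n / (n + k)   # trust the lexicalized estimate more as counts grow
    return lam * p_lex + (1 - lam) * p_backoff

observe("VP -> V NP NP", "gave")
observe("VP -> V NP", "said")
print(rule_prob("VP -> V NP NP", "gave"))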

Page 53: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Parsing into a Tree

Fed raises interest rates

[Target parse: [S [NP [N Fed]] [VP [V raises] [NP [N interest] [N rates]]]]]

• S -> NP VP (1)
– NP -> N (.3) | DN (.2) | NN (.2) | NNN (.1)
– VP -> V NP (.4) | V (.4) | V NP NP (.2)
– N -> interest (.3) | Fed (.3) | rates (.3) | raises (.1)
– V -> interest (.1) | rates (.3) | raises (.6)
– D -> the (.5) | a (.5)

[Bottom-up parsing: each word can be tagged as N or V (Fed: N; raises: N/V; interest: N/V; rates: N/V), and partial constituents (NP, VP, S) are built over the ambiguous tags]

Page 54: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Machine Translation

Vauquois’ pyramid: translation can operate at different levels of analysis.
• Word level: Yo lo haré mañana. => I will do it tomorrow.
• Phrase level: Yo lo haré mañana. => I will do it tomorrow.
• Tree (syntax) level: Yo lo haré mañana. => I will do it tomorrow. (mapping between source and target parse trees, e.g., S -> NP VP)
• Semantics level: Yo lo haré mañana. => I will do it tomorrow. (Action: doing + Time: tomorrow)

Page 55: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Phrase-based Translation Model

• The models define probabilities over inputs.

Morgen fliege ich nach Kanada zur Konferenz
Tomorrow I will fly to the conference in Canada

• What is the probability of a specific phrase segmentation of both languages? (Segmentation)
• What is the probability of a foreign phrase being translated as a particular English phrase? (Translation)
• What is the probability of a word/phrase changing order? (Distortion)
• Is the translated English text the best sentence? (From the language model)
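A minimal sketch of combining those factors to score one candidate translation; the phrase table, the distortion penalty, and the assumed language-model log-probability are illustrative stand-ins rather than a real system:

import math

phrase_table = {  # P(english phrase | foreign phrase)
    ("morgen", "tomorrow"): 0.7,
    ("fliege ich", "i will fly"): 0.4,
    ("nach kanada", "to canada"): 0.6,
    ("zur konferenz", "to the conference"): 0.5,
}

def score(phrase_pairs, reorderings, lm_logprob, distortion_penalty=-0.5):
    # Phrase-translation probabilities, a reordering (distortion) penalty,
    # and the target-side language model, combined in log space.
    log_p = sum(math.log(phrase_table[pair]) for pair in phrase_pairs)
    log_p += distortion_penalty * reorderings
    log_p += lm_logprob
    return log_p

pairs = [("morgen", "tomorrow"), ("fliege ich", "i will fly"),
         ("zur konferenz", "to the conference"), ("nach kanada", "to canada")]
# Two phrases were swapped relative to the German order; assume an LM log-prob.
print(score(pairs, reorderings=1, lm_logprob=-12.3))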

Page 56: CPSC 7373: Artificial Intelligence Lecture 13: Natural Language Processing

Statistical Machine Translation

• Components: translation model, language model, decoder
• Foreign/English parallel text -> statistical analysis -> translation model
• English text -> statistical analysis -> language model
• Translation model + language model -> machine translation algorithm (decoder)