CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520:...

124
CSC 8520 Spring 2013. Paula Matuszek CS 8520: Artificial Intelligence Natural Language Processing Introduction Paula Matuszek Spring, 2013 Thursday, April 11, 2013

Transcript of CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520:...

Page 1: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek

CS 8520: Artificial Intelligence

Natural Language Processing Introduction

Paula MatuszekSpring, 2013

Thursday, April 11, 2013

Page 2: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 2

Natural Language Processing• speech recognition• natural language understanding• computational linguistics• psycholinguistics• information extraction• information retrieval• inference• natural language generation• speech synthesis• language evolution

Thursday, April 11, 2013

Page 3: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 3

Applied NLP• Machine translation• spelling/grammar correction• Information Retrieval• Data mining• Document classification• Question answering, conversational agents

Thursday, April 11, 2013

Page 4: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 4

Natural Language Understanding

accoustic /phonetic

morphological/syntactic

semantic / pragmatic

sound waves

internal representation

Thursday, April 11, 2013

Page 5: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 5

Sounds Symbols Sense

Natural Language Understanding

accoustic /phonetic

morphological/syntactic

semantic / pragmatic

sound waves

internal representation

Thursday, April 11, 2013

Page 6: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 6

•“How to recognize speech, not to wreck a nice beach”•“The cat scares all the birds away”•“The cat’s cares are few”

Where are the words? sound waves

internal representation

accoustic /phonetic

morphological/syntactic

semantic / pragmatic

- pauses in speech bear little relation to word breaks+ intonation offers additional clues to meaning

Thursday, April 11, 2013

Page 7: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 7

•“The dealer sold the merchant a dog”• “I saw the Golden bridge flying into San Francisco”

Dissecting words/sentences

internal representation

accoustic /phonetic

morphological/syntactic

semantic / pragmatic

sound waves

Thursday, April 11, 2013

Page 8: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 8

• “I saw Pathfinder on Mars with a telescope”

• “Pathfinder photographed Mars”

• “The Pathfinder photograph from Ford has arrived”

• “When a Pathfinder fords a river it sometimes mars its paint job.”

What does it mean?sound waves

internal representation

accoustic /phonetic

morphological/syntactic

semantic / pragmatic

Thursday, April 11, 2013

Page 9: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 9

What does it mean?sound waves

internal representation

accoustic /phonetic

morphological/syntactic

semantic / pragmatic

• “Jack went to the store. He found the milk in aisle 3. He paid for it and left.”• “ Q: Did you read the report?

A: I read Bob’s email.”

Thursday, April 11, 2013

Page 10: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek

The steps in NLP• Morphology: Concerns the way words are

built up from smaller meaning bearing units.

• Syntax: concerns how words are put together to form correct sentences and what structural role each word has

• Semantics: concerns what words mean and how these meanings combine in sentences to form sentence meanings

Taken from ocw.kfupm.edu.sa/user062/ICS48201/NL2introduction.ppt 10

Thursday, April 11, 2013

Page 11: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek

The steps in NLP (Cont.)• Pragmatics: concerns how sentences are

used in different situations and how use affects the interpretation of the sentence

• Discourse: concerns how the immediately preceding sentences affect the interpretation of the next sentence

11Taken from ocw.kfupm.edu.sa/user062/ICS48201/NL2introduction.ppt

Thursday, April 11, 2013

Page 12: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek

Some of the Tools• Regular Expressions and Finite State

Automata• Part of Speech taggers• N-Grams• Grammars• Parsers• Semantic Analysis

12

Thursday, April 11, 2013

Page 13: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek

Parsing (Syntactic Analysis)• Assigning a syntactic and logical form to an input

sentence– uses knowledge about word and word meanings (lexicon)– uses a set of rules defining legal structures (grammar)

• Paula ate the apple.• (S (NP (NAME Paula))• (VP (V ate)• (NP (ART the)• (N apple))))

13Taken from ocw.kfupm.edu.sa/user062/ICS48201/NL2introduction.ppt

Thursday, April 11, 2013

Page 14: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek

Word Sense Resolution • Many words have many meanings or senses• We need to resolve which of the senses of

an ambiguous word is invoked in a particular use of the word

• I made her duck. (made her a bird for lunch or made her move her head quickly downwards?)

14Taken from ocw.kfupm.edu.sa/user062/ICS48201/NL2introduction.ppt

Thursday, April 11, 2013

Page 15: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek

Reference Resolution• Domain Knowledge (Registration transaction)• Discourse Knowledge• World Knowledge• U: I would like to register in an CSC Course.• S: Which number?• U: Make it 8520.• S: Which section?• U: Which section is in the evening?• S: section 1.• U: Then make it that section.

15Taken from ocw.kfupm.edu.sa/user062/ICS48201/NL2introduction.ppt

Thursday, April 11, 2013

Page 16: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek Taken from ocw.kfupm.edu.sa/user062/ICS48201/NL2introduction.ppt

Morphological Analysis

Syntactic Analysis

Semantic Analysis

Discourse Analysis

Pragmatic Analysis

Internal representation

lexicon

user

Surface form

Perform action

stems

parse tree

Resolve references

Stages of NLP

Thursday, April 11, 2013

Page 17: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 17

Human Languages• You know ~50,000 words of primary language,

each with several meanings• six year old knows ~13000 words• First 16 years we learn 1 word every 90 min of

waking time• Mental grammar generates sentences -virtually

every sentence is novel• 3 year olds already have 90% of grammar• ~6000 human languages – none of them simple!

Adapted from Martin Nowak 2000 – Evolutionary biology of language – Phil.Trans. Royal Society London

Thursday, April 11, 2013

Page 18: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 18

Human Spoken language• Most complicated mechanical motion of the

human body– Movements must be accurate to within mm– synchronized within hundredths of a second

• We can understand up to 50 phonemes/sec (normal speech 10-15ph/sec)– but if sound is repeated 20 times /sec we hear

continuous buzz!• All aspects of language processing are involved

and manage to keep apace

Adapted from Martin Nowak 2000 – Evolutionary biology of language – Phil.Trans. Royal Society London

Thursday, April 11, 2013

Page 19: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 19

Why Language is Hard• NLP is AI-complete• Abstract concepts are difficult to represent• LOTS of possible relationships among

concepts• Many ways to represent similar concepts• Tens of hundreds or thousands of features/

dimensions

Thursday, April 11, 2013

Page 20: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 20

Why Language is Easy• Highly redundant

• Many relatively crude methods provide fairly good results

Thursday, April 11, 2013

Page 21: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 21

History of NLP• Prehistory (1940s, 1950s)

– automata theory, formal language theory, markov processes (Turing, McCullock&Pitts, Chomsky)

– information theory and probabilistic algorithms (Shannon)

– Turing test – can machines think?

Thursday, April 11, 2013

Page 22: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek

History of NLP• Early work:

– symbolic approach• generative syntax - eg Transformations and Discourse Analysis Project

(TDAP- Harris)• AI – pattern matching, logic-based, special-purpose systems

– Eliza -- Rogerian therapist http://www.manifestation.com/neurotoys/eliza.php3

– stochastic• bayesian methods

– early successes -- $$$$ grants!– by 1966 US government had spent 20 million on machine translation

• Critics:– Bar Hillel – “no way to disambiguation without deep understanding”– Pierce NSF 1966 report: “no way to justify work in terms of practical

output”

22

Thursday, April 11, 2013

Page 23: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 23

History of NLP• The middle ages (1970-1990)

– stochastic• speech recognition and synthesis (Bell Labs)

– logic-based• compositional semantics (Montague)• definite clause grammars (Pereira&Warren)

– ad hoc AI-based NLU systems• SHRDLU robot in blocks world (Winograd)• knowledge representation systems at Yale (Shank)

– discourse modeling• anaphora• focus/topic (Groz et al)• conversational implicature (Grice)

Thursday, April 11, 2013

Page 24: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 24

History of NLP• NLP Renaissance (1990-2000)

– Lessons from phonology & morphology successes: – finite-state models are very powerful– probabilistic models pervasive– Web creates new opportunities and challenges– practical applications driving the field again

• 21st Century NLP– The web changes everything:– much greater use for NLP– much more data available

Thursday, April 11, 2013

Page 25: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek

Document Features• Most NLP is applied to some quantity of

unstructured text. • For simplicity, we will refer to any such

quantity as a document• What features of a document are of interest?• Most common is the actual terms in the

document.

25

Thursday, April 11, 2013

Page 26: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek

Tokenization• Tokenization is the process of breaking up a string

of letters into words and other meaningful components (numbers, punctuation, etc.

• Typically broken up at white space.• Very standard NLP tool• Language-dependent, and sometimes also domain-

dependent.– 3,7-dihydro-1,3,7-trimethyl-1H-purine-2,6-dione

• Tokens can also be larger divisions: sentences, paragraphs, etc.

26

Thursday, April 11, 2013

Page 27: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 27

Lexical Analyser• Basic idea is a finite state machine• Triples of input state, transition token,

output state

• Must be very efficient; gets used a LOT

0

1

2

blankA-Z

A-Z

blank, EOF

Thursday, April 11, 2013

Page 28: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 28

Design Issues for Tokenizer• Punctuation

– treat as whitespace?– treat as characters?– treat specially?

• Case– fold?

• Digits– assemble into numbers?– treat as characters?– treat as punctuation?

Thursday, April 11, 2013

Page 29: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek

NLTK Tokenizer• Natural Language ToolKit• http://text-processing.com/demo/tokenize/

• Call me Ishmael. Some years ago--never mind how long precisely--having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.

29

Thursday, April 11, 2013

Page 30: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek

Document Features• Once we have a corpus of standardized and possibly

tokenized documents, what features of text documents might be interesting?– Word frequencies: Bag of Words (BOW)– Language– Document Length -- characters, words, sentences– Named Entities– Parts of speech– Average word length– Average sentence length– Domain-specific stuff

30

Thursday, April 11, 2013

Page 31: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek

Bag of Words• Specifically not interested in order• Frequency of each possible word in this

document.• Very sparse vector!• In order to assign count to correct position,

need to know all the words used in the corpus– two-pass– reverse-index

31

Thursday, April 11, 2013

Page 32: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek

Simplifying BOW• Do we really want every word?• Stop words

– omit common function words.– e.g. http://www.ranks.nl/resources/stopwords.html

• Stemming or lemmatization– Convert words to standard form – lemma is the standard word; stem may not be a word at all

• Synonym matching• TF*IDF

– Use the “most meaningful” words

32

Thursday, April 11, 2013

Page 33: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek

Stemming• Inflectional stemming: eliminate morphological variants

– singular/plural, present/past– In English, rules plus a dictionary

• books -> book, children -> child– few errors, but many omissions

• Root stemming: eliminate inflections and derivations– invention, inventor -> invent– much more aggressive

• Which (if either) depends on problem• http://text-processing.com/demo/stem/

33

Thursday, April 11, 2013

Page 34: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek

Synonym-matching• Map some words to their synonyms• http://thesaurus.com/browse/computer• Problematic in English

– requires a large dictionary– many words have multiple meanings

• In specific domains may be important– biological and chemical domains: http://

en.wikipedia.org/wiki/Caffeine– any specific domain: Nova, Villanova, V’Nova

34

Thursday, April 11, 2013

Page 35: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek

TF-IDF• Word frequency is simple, but:

– affected by length of document– not best indicator of what doc is about

• very common words don’t tell us much about differences between documents

• very uncommon words are typos or idiosyncratic

• Term Frequency Inverse Document Frequency– tf-idf(j) = tf(j)*idf(j). idf(j)=log(N/df(j))

35

Thursday, April 11, 2013

Page 36: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek

Language• Relevant if documents are in multiple

languages• May know from source• Determining language itself can be

considered a form of NLP. • http://translate.google.com/?hl=en&tab=wT• http://fr.wikipedia.org/• http://de.wikipedia.org/

36

Thursday, April 11, 2013

Page 37: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek

Document Counts• Number of characters, words, sentences• Average length of words, sentences,

paragraphs• EG: clustering documents to determine

how many authors have written them or how many genres are represented

• NLTK + Python makes this easy

37

Thursday, April 11, 2013

Page 38: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek

Named Entities• What persons, places, companies are mentioned in

documents?• “Proper nouns”• One of most common information extraction tasks• Combination of rules and dictionary

– Example rules:• Capitalized word not at beginning of sentence• Two capitalized words in a row• One or more capitalized words followed by Inc

– Dictionaries of common names, places, major corporations. Sometimes called “gazetteer”

38

Thursday, April 11, 2013

Page 39: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek

Parts of Speech• Part-of-Speech (POS) taggers identify

nouns, verbs, adjectives, noun phrases, etc• Brill is the best-known rule-based tagger• More recent work uses machine learning to

create taggers from labeled examples• http://text-processing.com/demo/tag/• http://cst.dk/online/pos_tagger/uk/

index.html

39

Thursday, April 11, 2013

Page 40: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek

Domain-Specific• Structure in document

– email: To, From, Subject, Body– Villanova Campus Currents: News, Academic

Notes, Events, Sports, etc• Tags in document

– Medline corpus is tagged with MeSH terms– Twitter feeds may be tagged with #tags.

• Intranet documents might have date, source, department, author, etc.

40

Thursday, April 11, 2013

Page 41: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek

N-Grams• These language features will take you a

long way• But it’s not hard to come up with examples

where it fails:– Dog bites man. vs. Man bites dog.

• Without adding explicit semantics we can still get substantial additional information by considering sequences of words.

41

Thursday, April 11, 2013

Page 42: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 42

Free Association Exercise J

Thursday, April 11, 2013

Page 43: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 42

Free Association Exercise J

• I am going to say some phrases. Write down the next word or two that occur to you.

Thursday, April 11, 2013

Page 44: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 42

Free Association Exercise J

• I am going to say some phrases. Write down the next word or two that occur to you.– Microsoft announced a new security ____

Thursday, April 11, 2013

Page 45: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 42

Free Association Exercise J

• I am going to say some phrases. Write down the next word or two that occur to you.– Microsoft announced a new security ____– NHL commissioner cancels rest ____

Thursday, April 11, 2013

Page 46: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 42

Free Association Exercise J

• I am going to say some phrases. Write down the next word or two that occur to you.– Microsoft announced a new security ____– NHL commissioner cancels rest ____– One Fish, ______

Thursday, April 11, 2013

Page 47: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 42

Free Association Exercise J

• I am going to say some phrases. Write down the next word or two that occur to you.– Microsoft announced a new security ____– NHL commissioner cancels rest ____– One Fish, ______– Colorless green ideas ______

Thursday, April 11, 2013

Page 48: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 42

Free Association Exercise J

• I am going to say some phrases. Write down the next word or two that occur to you.– Microsoft announced a new security ____– NHL commissioner cancels rest ____– One Fish, ______– Colorless green ideas ______– Conjunction Junction, what’s __________

Thursday, April 11, 2013

Page 49: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 42

Free Association Exercise J

• I am going to say some phrases. Write down the next word or two that occur to you.– Microsoft announced a new security ____– NHL commissioner cancels rest ____– One Fish, ______– Colorless green ideas ______– Conjunction Junction, what’s __________– Oh, say, can you see, by the dawn’s ______

Thursday, April 11, 2013

Page 50: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 42

Free Association Exercise J

• I am going to say some phrases. Write down the next word or two that occur to you.– Microsoft announced a new security ____– NHL commissioner cancels rest ____– One Fish, ______– Colorless green ideas ______– Conjunction Junction, what’s __________– Oh, say, can you see, by the dawn’s ______– After I finished my homework I went _____.

Thursday, April 11, 2013

Page 51: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 43

Human Word Prediction• Clearly, at least some of us have the ability

to predict future words in an utterance.• How?

– Domain knowledge– Syntactic knowledge– Lexical knowledge

Thursday, April 11, 2013

Page 52: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 44

Claim• A useful part of the knowledge needed to

allow Word Prediction can be captured using simple statistical techniques

• In particular, we'll rely on the notion of the probability of a sequence (a phrase, a sentence)

Thursday, April 11, 2013

Page 53: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 45

Applications• Why do we want to predict a word, given

some preceding words?– Rank the likelihood of sequences containing

various alternative hypotheses, e.g. for automated speech recognition, OCRing.

• Theatre owners say popcorn/unicorn sales have doubled...

– Assess the likelihood/goodness of a sentence, e.g. for text generation or machine translation

• Como mucho pescado. --> At the most fished.

Thursday, April 11, 2013

Page 54: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 46

Real Word Spelling Errors

Example from Dorr, http://www.umiacs.umd.edu/~bonnie/courses/cmsc723-04/lecture-notes/Lecture5.ppt

Thursday, April 11, 2013

Page 55: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 46

Real Word Spelling Errors• They are leaving in about fifteen minuets to go to

her house.

Example from Dorr, http://www.umiacs.umd.edu/~bonnie/courses/cmsc723-04/lecture-notes/Lecture5.ppt

Thursday, April 11, 2013

Page 56: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 46

Real Word Spelling Errors• They are leaving in about fifteen minuets to go to

her house.• The study was conducted mainly be John Black.

Example from Dorr, http://www.umiacs.umd.edu/~bonnie/courses/cmsc723-04/lecture-notes/Lecture5.ppt

Thursday, April 11, 2013

Page 57: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 46

Real Word Spelling Errors• They are leaving in about fifteen minuets to go to

her house.• The study was conducted mainly be John Black.• The design an construction of the system will take

more than a year.

Example from Dorr, http://www.umiacs.umd.edu/~bonnie/courses/cmsc723-04/lecture-notes/Lecture5.ppt

Thursday, April 11, 2013

Page 58: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 46

Real Word Spelling Errors• They are leaving in about fifteen minuets to go to

her house.• The study was conducted mainly be John Black.• The design an construction of the system will take

more than a year.• Hopefully, all with continue smoothly in my absence.

Example from Dorr, http://www.umiacs.umd.edu/~bonnie/courses/cmsc723-04/lecture-notes/Lecture5.ppt

Thursday, April 11, 2013

Page 59: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 46

Real Word Spelling Errors• They are leaving in about fifteen minuets to go to

her house.• The study was conducted mainly be John Black.• The design an construction of the system will take

more than a year.• Hopefully, all with continue smoothly in my absence.• Can they lave him my messages?

Example from Dorr, http://www.umiacs.umd.edu/~bonnie/courses/cmsc723-04/lecture-notes/Lecture5.ppt

Thursday, April 11, 2013

Page 60: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 46

Real Word Spelling Errors• They are leaving in about fifteen minuets to go to

her house.• The study was conducted mainly be John Black.• The design an construction of the system will take

more than a year.• Hopefully, all with continue smoothly in my absence.• Can they lave him my messages?• I need to notified the bank of….

Example from Dorr, http://www.umiacs.umd.edu/~bonnie/courses/cmsc723-04/lecture-notes/Lecture5.ppt

Thursday, April 11, 2013

Page 61: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 46

Real Word Spelling Errors• They are leaving in about fifteen minuets to go to

her house.• The study was conducted mainly be John Black.• The design an construction of the system will take

more than a year.• Hopefully, all with continue smoothly in my absence.• Can they lave him my messages?• I need to notified the bank of….• He is trying to fine out.

Example from Dorr, http://www.umiacs.umd.edu/~bonnie/courses/cmsc723-04/lecture-notes/Lecture5.ppt

Thursday, April 11, 2013

Page 62: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 47

Language Modeling • Fundamental tool in NLP• Main idea:

– Some words are more likely than others to follow each other

– You can predict fairly accurately that likelihood.

• In other words, you can build a language model

Adapted from Hearst, http://www.sims.berkeley.edu/courses/is290-2/f04/lectures/lecture4.ppt

Thursday, April 11, 2013

Page 63: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 48

N-Grams• N-Grams are sequences of tokens.• The N stands for how many terms are used

– Unigram: 1 term– Bigram: 2 terms– Trigrams: 3 terms

• You can use different kinds of tokens– Character based n-grams– Word-based n-grams– POS-based n-grams

• N-Grams give us some idea of the context around the token we are looking at.

Adapted from Hearst, http://www.sims.berkeley.edu/courses/is290-2/f04/lectures/lecture4.ppt

Thursday, April 11, 2013

Page 64: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 49

N-Gram Models of Language• A language model is a model that lets us compute

the probability, or likelihood, of a sentence S, P(S).• N-Gram models use the previous N-1 words in a

sequence to predict the next word– unigrams, bigrams, trigrams,…

• How do we construct or train these language models?– Count frequencies in very large corpora– Determine probabilities using Markov models, similar

to POS tagging.

Thursday, April 11, 2013

Page 65: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 50

Counting Words in Corpora• What is a word?

– e.g., are cat and cats the same word?– September and Sept?– zero and oh?– Is _ a word? * ? ‘(‘ ?– How many words are there in don’t ? Gonna ?– In Japanese and Chinese text -- how do we

identify a word?• Back to stemming!

Thursday, April 11, 2013

Page 66: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 51

Terminology• Sentence: unit of written language• Utterance: unit of spoken language• Word Form: the inflected form that appears in the

corpus• Lemma: an abstract form, shared by word forms

having the same stem, part of speech, and word sense• Types: number of distinct words in a corpus

(vocabulary size)• Tokens: total number of words

Thursday, April 11, 2013

Page 67: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 52

Simple N-Grams• Assume a language has V word types in its

lexicon, how likely is word x to follow word y?– Simplest model of word probability: 1/ V– Alternative 1: estimate likelihood of x occurring in

new text based on its general frequency of occurrence estimated from a corpus (unigram probability)

• popcorn is more likely to occur than unicorn– Alternative 2: condition the likelihood of x occurring

in the context of previous words (bigrams, trigrams,…)

• mythical unicorn is more likely than mythical popcorn

Thursday, April 11, 2013

Page 68: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 53

Computing the Probability of a Word Sequence

• Compute the product of component conditional probabilities?– P(the mythical unicorn) = P(the) P(mythical|

the) P(unicorn|the mythical)• The longer the sequence, the less likely we

are to find it in a training corpus • P(Most biologists and folklore specialists believe

that in fact the mythical unicorn horns derived from the narwhal)

• Solution: approximate using n-grams

Thursday, April 11, 2013

Page 69: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 54

Bigram Model

• Approximate by – P(unicorn|the mythical) by P(unicorn|mythical)

• Markov assumption: the probability of a word depends only on the probability of a limited history

• Generalization: the probability of a word depends only on the probability of the n previous words– Trigrams, 4-grams, …– The higher n is, the more data needed to train– The higher n is, the sparser the matrix.

• Leads us to backoff models

Thursday, April 11, 2013

Page 70: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 55

• For N-gram models– ≈ – P(wn-1,wn) = P(wn | wn-1) P(wn-1)– By the Chain Rule we can decompose a joint

probability, e.g. P(w1,w2,w3)P(w1,w2, ...,wn) = P(w1|w2,w3,...,wn) P(w2|w3, ...,wn) … P(wn-1|wn)

P(wn)

For bigrams then, the probability of a sequence is just the product of the conditional probabilities of its bigrams

P(the,mythical,unicorn) = P(unicorn|mythical) P(mythical|the) P(the|<start>)

Using N-Grams

Thursday, April 11, 2013

Page 71: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 56

A Simple Example

– P(I want to eat Chinese food) = P(I | <start>) * P(want | I) P(to | want) * P(eat | to) P(Chinese | eat) * P(food | Chinese)

Thursday, April 11, 2013

Page 72: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 57

Counts from the Berkeley Restaurant Project

Nth term

N-1 term

Thursday, April 11, 2013

Page 73: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 58

BeRP Bigram Table

Nth term

N-1 term

Thursday, April 11, 2013

Page 74: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 59

A Simple Example

– P(I want to eat Chinese food) = P(I | <start>) * P(want | I) P(to | want) * P(eat | to) P(Chinese | eat) * P(food | Chinese)

•.25*.32*.65*.26*.02*.56 = .00015

Thursday, April 11, 2013

Page 75: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 60

• P(I want to eat British food) = P(I|<start>) P(want|I) P(to|want) P(eat|to) P(British|eat) P(food|British) = .25*.32*.65*.26*.001*.60 = .000080

• vs. I want to eat Chinese food = .00015• Probabilities seem to capture ``syntactic''

facts, ``world knowledge'' – eat is often followed by an NP– British food is not too popular

So What?

Thursday, April 11, 2013

Page 76: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 61

Approximating Shakespeare• As we increase the value of N, the accuracy of the n-gram

model increases, since choice of next word becomes increasingly constrained

• Generating sentences with random unigrams...– Every enter now severally so, let– Hill he late speaks; or! a more to leg less first you enter

• With bigrams...– What means, sir. I confess she? then all sorts, he is trim, captain.– Why dost stand forth thy canopy, forsooth; he is this palpable hit

the King Henry.

Thursday, April 11, 2013

Page 77: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 62

• Trigrams– Sweet prince, Falstaff shall die.– This shall forbid it should be branded, if

renown made it empty.• Quadrigrams

– What! I will go seek the traitor Gloucester.– Will you not tell me who I am?

Thursday, April 11, 2013

Page 78: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 63

• There are 884,647 tokens, with 29,066 word form types, in about a one million word Shakespeare corpus

• Shakespeare produced 300,000 bigram types out of 844 million possible bigrams: so, 99.96% of the possible bigrams were never seen (have zero entries in the table)

• Quadrigrams worse: What's coming out looks like Shakespeare because it is Shakespeare

Thursday, April 11, 2013

Page 79: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 64

N-Gram Training Sensitivity• If we repeated the Shakespeare experiment

but trained our n-grams on a Wall Street Journal corpus, what would we get?

• This has major implications for corpus selection or design

Thursday, April 11, 2013

Page 80: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 65

Some Useful Empirical Observations• A few events occur with high frequency• Many events occur with low frequency• You can quickly collect statistics on the high

frequency events• You might have to wait an arbitrarily long time to

get valid statistics on low frequency events• Some of the zeroes in the table are really zeros

But others are simply low frequency events you haven't seen yet. We smooth the frequency table by assigning small but non-zero frequencies to these terms.

Thursday, April 11, 2013

Page 81: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 66

Smoothing is like Robin Hood: Steal from the rich and give to the poor (in probability mass)

From Snow, http://www.stanford.edu/class/linguist236/lec11.ppt

Thursday, April 11, 2013

Page 82: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 67

Summary• N-gram probabilities can be used to

estimate the likelihood– Of a word occurring in a context (N-1)– Of a sentence occurring at all

• Smoothing techniques deal with problems of unseen words in a corpus

• N-grams are useful in a wide variety of NLP tasks

Thursday, April 11, 2013

Page 83: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek

All Our N-Grams Are Belong to Youhttp://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html

• Google uses n-grams for machine translation, spelling correction, other NLP

• In 2006 they released a large collection of n-gram counts through the Linguistic Data Consortium, based on a trillion web pages– a trillion tokens, 300 million bigrams, about a

billion tri-, four-and five-grams. (http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13)

• This quantity of data makes a qualitative change in what we can do with statistical NLP.

68

Thursday, April 11, 2013

Page 84: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 69

Semantics and Semantic Analysis

Thursday, April 11, 2013

Page 85: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 70

Meaning• So far, we have focused on the structure of language,

not on what things mean• We have been doing natural language processing, but

not natural language understanding.• So what is natural language understanding?

– Answering an essay question on an exam?– Deciding what to order at a restaurant by reading a menu?– Realizing you’ve been insulted?– Appreciating a sonnet?

• As hard as answering "What is artificial intelligence?"

Thursday, April 11, 2013

Page 86: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 71

Meaning, cont• On the practical side, we want to “understand”

natural language because BOW and n-gram methods will only take us so far in some things:– Machine translation– Generation– Question answering

• So we saw how we could use n-grams to choose the more likely translation for a word given other words. But n-grams can’t give us the potential translations in the first place.

Thursday, April 11, 2013

Page 87: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 72

Semantics• What kinds of things can we not do well with the

tools we have already looked at?– Retrieve information in response to unconstrained

questions: e.g., travel planning– Accurate translations– Play the "chooser" side of 20 Questions– Read a newspaper article and answer questions about it

• These tasks require that we also consider semantics: the meaning of our tokens and their sequences

Thursday, April 11, 2013

Page 88: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 73

So What IS Meaning?• From NLP viewpoint, meaning is a mapping from

linguistic forms to some kind of representation of knowledge of the world

• It is interpreted within the framework of some sort of action to be taken.

• Often we manipulate symbols all the way through; the “meaning” is put in by the human user. Translations, for instance.

• But not always – voice command systems, for instance, may map from the representation into actions.

Thursday, April 11, 2013

Page 89: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 74

Example• Question: Is there a restaurant in King of Prussia

serving vegetarian dinners?• From a restaurant database

– Lemon Grass is a sister restaurant to the one in West Philadelphia serving traditional Thai dishes in addition to a complete vegetarian Thai menu. The service and atmosphere are quite pleasant. A welcome addition to the King of Prussia area.

• What do we need to know to answer this question from this text?

• Can we unambiguously answer it?

Thursday, April 11, 2013

Page 90: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 75

Example, continued• Is there a restaurant in King of Prussia serving vegetarian

dinners? Yes.• Lemon Grass is a sister restaurant to the one in West Philadelphia

serving traditional Thai dishes in addition to a complete vegetarian Thai menu. The service and atmosphere are quite pleasant. A welcome addition to the King of Prussia area.

• What do we need to know? – Vegetarian Thai menu = vegetarian dinners– Welcome addition to the KoP area = in KoP

• Can we unambiguously answer it?– No. But we can be fairly certain.

• The only parse that makes sense• Restaurants normally do serve the dinner meal.

Thursday, April 11, 2013

Page 91: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 76

Knowledge Representation for NLP• So what representation?• A useful KR needs to give us a way to encode the

knowledge relevant to our problem in a way which allows us to use it. Any representation which does this will work.

• How do we decide what we want to represent?– Entities, categories, events, time, aspect– Predicates, relationships among entities, arguments

(constants, variables)– And…quantifiers, operators (e.g. temporal)

• How much structure do we capture?

Thursday, April 11, 2013

Page 92: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 77

Semantic Nets• labeled, directed graph• nodes represent objects, concepts, or situations

– labels indicate the name– nodes can be instances (individual objects) or

classes (generic nodes)• links represent relationships

– the relationships contain the structural information of the knowledge to be represented

– the label indicates the type of the relationship

Thursday, April 11, 2013

Page 93: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 78

Semantic Net Examples

Ship

Hull Propulsion

Steamboat

HasAHasA

InstanceOf

giveBob

Candy

Mary

Person

agent recipient

object

InstanceInstance

Thursday, April 11, 2013

Page 94: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 79

Generic/Individual• Generic describes the idea--the “notion”

– static• Individual or instance describes a real entity

– must conform to notion of generic– dynamic– “individuate” or “instantiate”

• A lot of NLP using semantic nets involves instantiating generic nets based on a given piece of text.

Thursday, April 11, 2013

Page 95: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 80

Individuation examplegivePerson

Thing

Personagent recipient

object

giveBob

Candy

Maryagent recipient

object

Generic RepresentationProcess the sentence“Bob gave Mary some candy”.

Instantiation

Thursday, April 11, 2013

Page 96: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 81

Frames• Represent related knowledge about a subject• Frame has a title and a set of slots

– Title is what the frame “is” – the concept– Slots capture relationships of the concept to other things

• Typically can be organized hierarchically – Most frame systems have an is-a slot– allows the use of inheritance

• Slots can contain all kinds of items– Rules, facts, images, video, questions, hypotheses, other frames– In NLP, typically capture relationships to other frames or entities

• Slots can also have procedural attachments– on creation, modification, removal of the slot value

Thursday, April 11, 2013

Page 97: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 82

Simple Frame ExampleSlot Name Filler

Restaurant Lemon GrassCuisine Thai, VegetarianPrice ExpensiveService ExcellentAtmosphere GoodLocation KoPWeb page Ask Google.

Thursday, April 11, 2013

Page 98: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 83

Usage of Frames• Most operations with frames do one of two

things:• Fill slots

– Process a piece of text to identify an entity for which we have a frame

– Fill as many slots as possible• Use contents of slots

– Look up answers to questions– Generate new text

[Rogers 1999]Thursday, April 11, 2013

Page 99: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 84

Scripts• Describe typical events or sequences• Components are

– script variables (players, props)– entry conditions– transactions– exit conditions

• Create instance by filling in variables

Thursday, April 11, 2013

Page 100: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 85

Restaurant Script Example• generic template for restaurants

– different types– default values

• script for a typical sequence of activities at a restaurant

• Often has a frame behind it; script is essentially instantiating the frame

[Rogers 1999]Thursday, April 11, 2013

Page 101: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 86

Restaurant ScriptEAT-AT-RESTAURANT Script

Props: (Restaurant, Money, Food, Menu, Tables, Chairs)Roles: (Hungry-Persons, Wait-Persons, Chef-Persons)Point-of-View: Hungry-PersonsTime-of-Occurrence: (Times-of-Operation of Restaurant)Place-of-Occurrence: (Location of Restaurant)Event-Sequence: first: Enter-Restaurant Script then: if (Wait-To-Be-Seated-Sign or Reservations) then Get-Maitre-d's-Attention Script then: Please-Be-Seated Script then: Order-Food-Script then: Eat-Food-Script unless (Long-Wait) when Exit-Restaurant-Angry

Script then: if (Food-Quality was better than Palatable) then Compliments-To-The-Chef Script then: Pay-For-It-Script finally: Leave-Restaurant Script [Rogers 1999]

Thursday, April 11, 2013

Page 102: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 87

Comments on Scripts• Obviously takes a lot of time to develop

them initially.– The script itself has much of the knowledge– May be serious overkill for most NLP tasks

• We need this level of detail if we want to include answers based on reasoning like “Most restaurants do serve dinner.”

Thursday, April 11, 2013

Page 103: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 88

First Order Predicate Calculus (FOPC)

• Terms– Constants: Lemon Grass– Functions: LocationOf(Lemon Grass)– Variables: x in LocationOf(x)

• Predicates: Relations that hold among objects– Serves(Lemon Grass,VegetarianFood)

• Logical Connectives: Permit compositionality of meaning– I only have $5 and I don’t have a lot of time– Have(I,$5) ∧ ¬ Have(I,LotofTime)

Thursday, April 11, 2013

Page 104: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 89

Choosing a Representation• We would like our representation to

support:– Verifiability– Unambiguous Representation– Canonical Form– Inference– Expressiveness

Thursday, April 11, 2013

Page 105: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 90

Verifiability• System can match input representation

against representations in knowledge base. If it finds a match, it can return Yes; Otherwise No.

• Does Lemon Grass serve vegetarian food?Serves(Lemon Grass,vegetarian food)

Thursday, April 11, 2013

Page 106: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 91

Unambiguous Representation• Single linguistic input can have different meaning

representations• Each representation unambiguously characterizes

one meaning.• Example: small cars and motorcycles are allowed

– car(x) & small(x) & motorcycle(y) & small(y) & allowed(x) & allowed(y)

– car(x) & small(x) & motorcycle(y) & allowed(x) & allowed(y)

Thursday, April 11, 2013

Page 107: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 92

Ambiguity and Vagueness• An expression is ambiguous if, in a given

context, it can be disambiguated to have a specific meaning, from a number of discrete, possible meanings. E.g., bank (financial institution) vs bank (river bank)

• An expression is vague, if it refers to a range of a scalar variable, such that, even in a specific context, it’s hard to specify the range entirely. E.g., he’s tall, it’s warm, etc.

Thursday, April 11, 2013

Page 108: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 93

Representing Similar Concepts• Distinct inputs could have the same meaning

– Does Lemon Grass have vegetarian dishes?– Do they have vegetarian food at Lemon Grass?– Are vegetarian dishes served at Lemon Grass?– Does Lemon Grass serve vegetarian fare?

• Alternatives– Four different semantic representations– Store all possible meaning representations in KB

Thursday, April 11, 2013

Page 109: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 94

Canonical Form• Solution: Inputs that mean same thing have

same meaning representation• Is this easy? No!

– Vegetarian dishes, vegetarian food, vegetarian fare

– Have, serve• What to do?

Thursday, April 11, 2013

Page 110: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 95

How to Produce a Canonical Form• Systematic Meaning Representations can be derived

from thesaurus– food ___– dish ___|____one overlapping meaning sense– fare ___|

• We can systematically relate syntactic constructions– [S [NP Lemon Grass] serves [NP vegetarian

dishes]]– [S [NP vegetarian dishes] are served at [NP Lemon

Grass]]

Thursday, April 11, 2013

Page 111: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 96

Inference• Consider a more complex request

– Can vegetarians eat at Lemon Grass?– Vs: Does Lemon Grass serve vegetarian food?

• Why do these result in the same answer?• Inference: Draw conclusions about truth of

propositions not explicitly stored in KB• serve(Lemon Grass,VegetarianFood) =>

CanEat(Vegetarians,At Lemon Grass)

Thursday, April 11, 2013

Page 112: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 97

Non-Yes/No Questions• Example: I'd like to find a restaurant where

I can get vegetarian food.• serve(x,VegetarianFood)• Matching succeeds only if variable x can be

replaced by known object in KB.

Thursday, April 11, 2013

Page 113: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 98

Meaning Structure of Language• Human Languages

– Display a basic predicate-argument structure– Make use of variables– Make use of quantifiers– Display a partially compositional semantics

Thursday, April 11, 2013

Page 114: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 99

Compositional Semantics• Assumption: The meaning of the whole is comprised of the

meaning of its parts– George cooks. Dan eats. Dan is sick.– Cook(George) Eat(Dan) Sick(Dan)– If George cooks and Dan eats, Dan will get sick.– (Cook(George) ^ eat(Dan)) --> Sick(Dan)

• Meaning derives from – The people and activities represented (predicates and arguments,

or, nouns and verbs)– The way they are ordered and related: syntax of the

representation, which may also reflect the syntax of the sentence• Parse inputs and associate meaning as we parse.

Thursday, April 11, 2013

Page 115: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 100

Feature and Information Extraction• We don’t always need a complete parse.

– We may be after only some specific information– And we know what that information is

• Lemon Grass is a sister restaurant to the one in West Philadelphia serving traditional Thai dishes in addition to a complete vegetarian Thai menu. The service and atmosphere are quite pleasant. A welcome addition to the King of Prussia area.

– We don’t care at all about the other information

Thursday, April 11, 2013

Page 116: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 101

Information Extraction• Retrieve some specific information which is located

somewhere in this set of documents.• Don’t want all the information, just what we have specified.• Information may occur multiple times in the text but we

just need to find it once• Often what is really wanted from a web search, for

instance.• Often involves long sentences, complex syntax, extended

discourse, multiple writers• May involve sentence fragments and other non-

grammatical text

Thursday, April 11, 2013

Page 117: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 102

Some Examples of Information Extraction• Financial Information

– Who is the CEO/CTO of a company?– What were the dividend payments for stocks I’m interested in for

the last five years?• Biological Information

– Are there known inhibitors of enzymes in a pathway?– Are there chromosomally located point mutations that result in a

described phenotype?• Other typical questions

– Who is expert in some area?– Is there a vegetarian restaurant in KoP?– Can I get a flight to Chicago tomorrow?

Thursday, April 11, 2013

Page 118: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 103

• Create a model of information to be extracted• Create knowledge base of rules for extraction

– concepts– relations among concepts

• Process information applying a series of tools• Always involves some level of domain

modeling and considerable natural language processing

Information Extraction -- How

Thursday, April 11, 2013

Page 119: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 104

Staged Approach• Apply a series of tools to an input text• At each stage an element of is identified for

possible use in the next level• The end result is a set of relations suitable

for entry into a database

Thursday, April 11, 2013

Page 120: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 105

Cascaded Process• Tokenize• POS Tag• Tag named entities: Names, dates, locations• Tag complex entities and relationships

– Grammatical, such as noun phrases– Semantic, such as CEO

• Bind values to tags throughout.

Thursday, April 11, 2013

Page 121: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 106

Named Entities• Named entity recognition…

– Labeling all the occurrences of named entities in a text…

• People, organizations, lakes, bridges, hospitals, mountains, etc…

• This is well-understood technology, readily available and works well.

• Typically uses a combination of enumerated lists (often called gazetteers) and regular expressions.

Thursday, April 11, 2013

Page 122: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 107

Complex Entities and Relationships• Uses Named Entities as components• Pattern-matching rules specific to a given domain• May be multi-pass: One pass creates entities

which are used as part of later passes.– EG: First pass locates CEOs, second pass locates

dates for CEOs being replaced.• May involve syntactic analysis as well, especially

for things like negatives and reference resolution

Thursday, April 11, 2013

Page 123: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 108

Why Does This Work?• The problem is very constrained.

– Only looking for a small set of items that can appear in a small set of roles

• We can ignore stuff we don’t care about.• BUT: creating the knowledge bases can be

very complex and time-consuming.

Thursday, April 11, 2013

Page 124: CS 8520: Artificial Intelligence - Villanova Universitymatuszek/spring2013/NLP2013.pdf · CS 8520: Artificial Intelligence ... • Basic idea is a finite state machine • Triples

CSC 8520 Spring 2013. Paula Matuszek 109

Summary• Some NLP tasks require some model of meaning: semantics• Semantics can be modeled in a variety of different

formalisms. The goal of all of them is to map from the sample of text being processed into some representation relevant to a real-world task

• Complete generic semantic parses gives us a very rich representation of text, but typically depend on the assumption that language is compositional.

• Other tasks such as Information Extraction may not require a detailed parse, but may depend on extensive domain knowledge

• How you represent and work with semantics will always depend on the NLP task you are trying to accomplish.

Thursday, April 11, 2013