CS188 Guest Lecture: Statistical Natural Language Processing

Prof. Marti Hearst

School of Information Management & Systems
www.sims.berkeley.edu/~hearst



School of Information Management & Systems (SIMS)

Information economics and policy

Sociology of information

Human-computer interaction

Information assurance

Information design and architecture


How do we Automatically Analyze Human Language?

The answer is … forget all that logic and inference stuff you’ve been learning all semester!

Instead, we do something entirely different.

Gather HUGE collections of text, and compute statistics over them. This allows us to make predictions.

Nearly always, a VERY simple algorithm with a VERY large text collection does better than a smart algorithm using knowledge engineering.


Statistical Natural Language Processing

Chapter 23 of the textbook. Prof. Russell said it won’t be on the final.

Today: 3 Applications
Author Identification
Speech Recognition (language models)
Spelling Correction


Slide adapted from Fred S. Roberts

Author Identification

Problem Variations:

1. Disputed authorship (choose among k known authors)

2. Document pair analysis: Were two documents written by the same author?

3. Odd-person-out: Were these documents written by one of this set of authors, or by someone else?

4. Clustering of “putative” authors (e.g., internet handles: termin8r, heyr, KaMaKaZie)


Slide adapted from Glenn Fung

The Federalist Papers

Written in 1787-1788 by Alexander Hamilton, John Jay, and James Madison to persuade the citizens of New York to ratify the Constitution.

The papers were short essays, 900 to 3,500 words in length.

The authorship of 12 of those papers has been in dispute (Madison or Hamilton). These papers are referred to as the disputed Federalist Papers.


Stylometry

The use of metrics of literary style to analyze texts:
Sentence length
Paragraph length
Punctuation
Density of parts of speech
Vocabulary

Mosteller & Wallace, 1964:
The Federalist Papers problem
Used Naïve Bayes and 30 “marker” words more typical of one or the other author
Concluded the disputed documents were written by Madison (see the sketch below).
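To make the marker-word idea concrete, here is a minimal sketch in the spirit of Mosteller & Wallace’s approach. It is not their exact model: the marker words and per-author rates below are invented for illustration, whereas the real study fit Poisson and negative binomial models to rates estimated from each author’s undisputed papers.

    import math

    # Hypothetical per-1000-word rates for a few marker words.  Mosteller &
    # Wallace used about 30 such words; these numbers are invented.
    RATES = {
        "upon":   {"Hamilton": 3.0, "Madison": 0.2},
        "whilst": {"Hamilton": 0.1, "Madison": 0.5},
        "enough": {"Hamilton": 0.6, "Madison": 0.1},
    }

    def likely_author(text):
        """Score each author by the Poisson log-likelihood of the observed
        marker-word counts, and return the higher-scoring one."""
        words = text.lower().split()
        n = len(words)
        scores = {"Hamilton": 0.0, "Madison": 0.0}
        for marker, rates in RATES.items():
            count = words.count(marker)
            for author, rate in rates.items():
                lam = rate * n / 1000.0          # expected count in n words
                # log of the Poisson pmf: count*log(lam) - lam - log(count!)
                scores[author] += count * math.log(lam) - lam - math.lgamma(count + 1)
        return max(scores, key=scores.get)

A document that uses “upon” even a few times swings sharply toward Hamilton, since Madison’s rate for it is so low; that asymmetry is exactly what makes a good marker word.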


Slide adapted from Glenn Fung

An Alternative Method (Fung)

Find a hyperplane based on 3 words:

0.5368 to + 24.6634 upon + 2.9532 would = 66.6159

All disputed papers end up on the Madison side of the plane.
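Applying the hyperplane to a new document is just the evaluation of a linear function. Here is a sketch; I’m assuming the features are occurrences per 1,000 words and that lower scores fall on the Madison side (both are assumptions — the paper defines the exact feature scaling and orientation).

    def fung_side(text):
        """Evaluate the 3-word hyperplane
        0.5368*to + 24.6634*upon + 2.9532*would = 66.6159
        on a document and report which side it falls on."""
        words = text.lower().split()
        per_1000 = lambda w: 1000.0 * words.count(w) / len(words)
        score = (0.5368 * per_1000("to")
                 + 24.6634 * per_1000("upon")
                 + 2.9532 * per_1000("would"))
        # Assumption: Madison's sparing use of "upon" puts his papers on
        # the low side of the threshold.
        return "Madison" if score < 66.6159 else "Hamilton"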


Slide adapted from Fred S. Roberts

Idiosyncratic Features

Idiosyncratic usage (misspellings, repeated neologisms, etc.) is apparently also useful.

For example, Foster’s unmasking of Klein as the author of “Primary Colors”:

“Klein and Anonymous loved unusual adjectives ending in -y and -inous: cartoony, chunky, crackly, dorky, snarly, …, slimetudinous, vertiginous, …”

“Both Klein and Anonymous added letters to their interjections: ahh, aww, naww.”

“Both Klein and Anonymous loved to coin words beginning in hyper-, mega-, post-, quasi-, and semi- more than all others put together”

“Klein and Anonymous use ‘riffle’ to mean rifle or rustle, a usage for which the OED provides no instance in the past thousand years.”



Language Modeling

A fundamental concept in NLP.

Main idea:

For a given language, some words are more likely than others to follow each other, or

You can predict (with some degree of accuracy) the probability that, given a word, a particular other word will follow it.


Adapted from slide by Bonnie Dorr

Next Word Prediction

From a NY Times story...
Stocks ...
Stocks plunged this ...
Stocks plunged this morning, despite a cut in interest rates
Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall ...
Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began


Adapted from slide by Bonnie Dorr

Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began trading for the first time since last ...
Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began trading for the first time since last Tuesday's terrorist attacks.


Adapted from slide by Bonnie Dorr

Next Word Prediction

Clearly, we have the ability to predict future words in an utterance to some degree of accuracy.
How?

Domain knowledge
Syntactic knowledge
Lexical knowledge

Claim:
A useful part of the knowledge needed to allow word prediction can be captured using simple statistical techniques.
In particular, we’ll rely on the notion of the probability of a sequence (a phrase, a sentence).


Adapted from slide by Bonnie Dorr

Applications of Language Models

Why do we want to predict a word, given some preceding words?

Rank the likelihood of sequences containing various alternative hypotheses,

– e.g. for spoken language recognition

Theatre owners say unicorn sales have doubled...
Theatre owners say popcorn sales have doubled...

Assess the likelihood/goodness of a sentence
– e.g. for text generation or machine translation

The doctor recommended a cat scan.
El doctor recomendó una exploración del gato.


Adapted from slide by Bonnie Dorr

N-Gram Models of Language

Use the previous N-1 words in a sequence to predict the next word.
This is called a Language Model (LM).

unigrams, bigrams, trigrams, …

How do we train these models?
Very large corpora.


Notation

P(unicorn)
Read this as “the probability of seeing the token unicorn.”

P(unicorn | mythical)
Called the conditional probability.
Read this as “the probability of seeing the token unicorn given that you’ve seen the token mythical.”
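In standard n-gram notation (the usual formulation, not from the slides themselves): a bigram model approximates the probability of a whole sequence as a product of these conditional probabilities, each estimated from counts in the training corpus:

    P(w_1 \dots w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1}),
    \qquad
    P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\, w_i)}{C(w_{i-1})}

where w_0 is the <start> token and C(·) counts occurrences in the corpus.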


Adapted from slide by Bonnie Dorr

Speech Recognition Example
From BeRP: The Berkeley Restaurant Project (Jurafsky et al.)

A testbed for a Speech Recognition project

System prompts user for information in order to fill in slots in a restaurant database.

– Type of food, hours open, how expensive

After getting lots of input, can compute how likely it is that someone will say X given that they already said Y.

P(I want to eat Chinese food) = P(I | <start>) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese)
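Here is a tiny sketch of that chain-rule computation, using bigram probabilities like those in the BeRP tables on the next two slides (the dictionary below holds just the entries needed for one sentence):

    # Bigram probabilities copied from the BeRP fragment on the slides below.
    BIGRAM_P = {
        ("<start>", "i"): .25, ("i", "want"): .32, ("want", "to"): .65,
        ("to", "eat"): .26, ("eat", "british"): .001, ("british", "food"): .60,
    }

    def sentence_probability(sentence):
        """Multiply bigram probabilities along the sentence (a first-order
        Markov approximation of the chain rule)."""
        words = ["<start>"] + sentence.lower().split()
        p = 1.0
        for prev, cur in zip(words, words[1:]):
            # A real system would smooth rather than fail on unseen bigrams.
            p *= BIGRAM_P[(prev, cur)]
        return p

    print(sentence_probability("I want to eat British food"))  # ~8.1e-06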


Adapted from slide by Bonnie Dorr

A Bigram Grammar Fragment from BeRP

.001  Eat British     .03  Eat today
.007  Eat dessert     .04  Eat Indian
.01   Eat tomorrow    .04  Eat a
.02   Eat Mexican     .04  Eat at
.02   Eat Chinese     .05  Eat dinner
.02   Eat in          .06  Eat lunch
.03   Eat breakfast   .06  Eat some
.03   Eat Thai        .16  Eat on


Adapted from slide by Bonnie Dorr

.01  British lunch        .05  Want a
.01  British cuisine      .65  Want to
.15  British restaurant   .04  I have
.60  British food         .08  I don’t
.02  To be                .29  I would
.09  To spend             .32  I want
.14  To have              .02  <start> I’m
.26  To eat               .04  <start> Tell
.01  Want Thai            .06  <start> I’d
.04  Want some            .25  <start> I


Adapted from slide by Bonnie Dorr

P(I want to eat British food)
= P(I | <start>) P(want | I) P(to | want) P(eat | to) P(British | eat) P(food | British)
= .25 × .32 × .65 × .26 × .001 × .60 ≈ .0000081

vs. P(I want to eat Chinese food) ≈ .00015

Probabilities seem to capture “syntactic” facts and “world knowledge”:

eat is often followed by an NP
British food is not too popular

N-gram models can be trained by counting and normalization
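A minimal sketch of that counting-and-normalization step, on an invented two-sentence corpus:

    from collections import Counter

    def train_bigram_model(sentences):
        """Estimate P(w2 | w1) = count(w1 w2) / count(w1)."""
        unigram, bigram = Counter(), Counter()
        for sentence in sentences:
            words = ["<start>"] + sentence.lower().split()
            unigram.update(words)
            bigram.update(zip(words, words[1:]))
        return {pair: c / unigram[pair[0]] for pair, c in bigram.items()}

    # Toy corpus, invented for illustration.
    model = train_bigram_model(["I want to eat Chinese food",
                                "I want to eat lunch"])
    print(model[("to", "eat")])  # 1.0: "to" is always followed by "eat" here

Real models are trained the same way, just over millions of words, with smoothing to handle bigrams that never occur in training.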


Spelling Correction

How to do it? The standard approach:

Rely on a dictionary for comparison
Assume a single “point change” (see the sketch after this slide):

– Insertion, deletion, transposition, substitution
– Don’t handle word substitution

Problems:
Might guess the wrong correction
Dictionary not comprehensive:

– Shrek, Britney Spears, nsync, p53, ground zero

May spell the word right but use it in the wrong place:
– principal, principle
– read, red
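Here is a sketch of the single point-change generator described above, in the style of the classic dictionary approach (the tiny dictionary is a stand-in; a real one would have tens of thousands of entries):

    import string

    DICTIONARY = {"principal", "principle", "read", "red", "food", "eat"}

    def edits1(word):
        """All strings one point change away: insertion, deletion,
        transposition, substitution."""
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [a + b[1:] for a, b in splits if b]
        transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
        substitutes = [a + c + b[1:] for a, b in splits if b
                       for c in string.ascii_lowercase]
        inserts = [a + c + b for a, b in splits for c in string.ascii_lowercase]
        return set(deletes + transposes + substitutes + inserts)

    def corrections(word):
        """In-dictionary words one point change away from the input."""
        return DICTIONARY & edits1(word)

    print(corrections("fod"))  # {'food'}

Note how this reproduces the listed problems: “prinicple” corrects to principle even where principal was intended, and out-of-dictionary names like Shrek generate no candidates at all.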


New Approach: Use Search Engine Query Logs!

Leverage the mistakes and corrections that millions of other people have already made!


Spelling Correction via Query Logs

Cucerzan and Brill ’04

Main idea:

Iteratively transform the query into other strings that correspond to more likely queries.
Use statistics from query logs to determine likelihood:

– Despite the fact that many of these are misspelled
– Assume that the less wrong a misspelling is, the more frequent it is, and correct > incorrect

Example: ditroitigers -> detroittigers -> detroit tigers


Spelling Correction via Query Logs (Cucerzan and Brill ’04)


Spelling Correction Algorithm

Algorithm (a simplified sketch follows this slide):

Compute the set of all possible alternatives for each word in the query

– Look at word unigrams and bigrams from the logs
– This handles concatenation and splitting of words

Find the best possible alternative string to the input
– Do this efficiently with a modified Viterbi algorithm

Constraints:
No 2 adjacent in-vocabulary words can change simultaneously
Short queries have further (unstated) restrictions
In-vocabulary words can’t be changed in the first round of iteration
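A much-simplified sketch of the iteration. The real system scores whole queries with a modified Viterbi search over per-word candidates and enforces the constraints above; here everything is collapsed into a greedy walk over invented log statistics:

    # Invented query-log frequencies.
    FREQ = {"ditroitigers": 5, "detroittigers": 120, "detroit tigers": 14000}

    # Toy candidate sets: strings "close enough" to each query.  The real
    # system generates candidates from log unigrams and bigrams, which is
    # how it handles word splitting and concatenation.
    NEIGHBORS = {
        "ditroitigers": ["detroittigers"],
        "detroittigers": ["detroit tigers", "ditroitigers"],
        "detroit tigers": ["detroittigers"],
    }

    def iterative_correct(query, max_iters=5):
        """Repeatedly move to the most frequent nearby query string; stop
        when the query is its own best alternative (a fixed point)."""
        for _ in range(max_iters):
            best = max(NEIGHBORS.get(query, []) + [query],
                       key=lambda q: FREQ.get(q, 0))
            if best == query:
                return query
            query = best
        return query

    print(iterative_correct("ditroitigers"))  # detroit tigers, in two steps

Note how the "less wrong, more frequent" assumption drives the walk: each step lands on a more popular string until the correct query is reached.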


Spelling Correction Evaluation

Emphasizing coverage:
1044 randomly chosen queries
Annotated by two people (91.3% agreement)
180 misspelled; annotators provided corrections
81.1% system agreement with annotators

– 131 false positives, e.g. 2002 kawasaki ninja zx6e -> 2002 kawasaki ninja zx6r

– 156 suggestions for the misspelled queries

2 iterations were sufficient for most corrections
Problem: annotators were guessing user intent


Spell Checking: Summary

Can use the collective knowledge stored in query logs

Works pretty well despite the noisiness of the data
Exploits the errors made by people
Might be further improved by incorporating text from other domains


Other Search Engine Applications

Many other statistical NLP applications arise in search engines and related settings.
One more example … automatic synonym and related word generation.


Synonym Generation



Speaking of Search Engines… Introducing a New Course!

Search Engines: Technology, Society, and Business

IS141 (2 units), Mondays 4-6pm + 1 hr section
CCN 42702, no prerequisites
http://www.sims.berkeley.edu/courses/is141/f05/


A Great Line-up of World-Class Experts!


Thank you!

Prof. Marti Hearst

School of Information Management & Systems
www.sims.berkeley.edu/~hearst