Transcript of Natural Language Processing Lecture 8—9/24/2013, Jim Martin.

Page 1

Natural Language Processing

Lecture 8—9/24/2013, Jim Martin

Page 2

Today

Review
  Word classes
  Part-of-speech tagging
HMMs
  Basic HMM model
  Decoding
    Viterbi

Page 3

Word Classes: Parts of Speech

8 (ish) traditional parts of speech: noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.

Also called parts-of-speech, lexical categories, word classes, morphological classes, lexical tags...

Lots of debate within linguistics about the number, nature, and universality of these; we'll completely ignore this debate.

Page 4

POS Tagging

The process of assigning a part-of-speech or lexical class marker to each word in a collection.

WORD    TAG
the     DET
koala   N
put     V
the     DET
keys    N
on      P
the     DET
table   N

Page 5

Penn TreeBank POS Tagset

Page 6

POS Tagging

Words often have more than one part of speech. Consider "back":
  The back door = JJ
  On my back = NN
  Win the voters back = RB
  Promised to back the bill = VB

The POS tagging problem is to determine the POS tag for a particular instance of a word in context (usually a sentence).

Page 7

POS Tagging

Note that this is distinct from the task of identifying which sense of a word is being used given a particular part of speech; that's called word sense disambiguation. We'll get to that later.
  "… backed the car into a pole"
  "… backed the wrong candidate"


Page 8

How Hard is POS Tagging? Measuring Ambiguity

Page 9

Two Methods for POS Tagging

1. Rule-based tagging (ENGTWOL; Section 5.4)

2. Stochastic: probabilistic sequence models
   HMM (Hidden Markov Model) tagging
   MEMMs (Maximum Entropy Markov Models)

Page 10

POS Tagging as Sequence Classification

We are given a sentence (an "observation" or "sequence of observations"):
  Secretariat is expected to race tomorrow

What is the best sequence of tags that corresponds to this sequence of observations?

Probabilistic view: consider all possible sequences of tags and, out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w1…wn.

Page 11

Getting to HMMs

We want, out of all sequences of n tags t1…tn, the single tag sequence such that P(t1…tn | w1…wn) is highest.

The hat ^ means "our estimate of the best one." Argmax_x f(x) means "the x such that f(x) is maximized."
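In symbols, the equation the slide is building toward (in the textbook's notation) is:

    \hat{t}_1^n = \operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n)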

Page 12

Getting to HMMs

This equation should give us the best tag sequence

But how to make it operational? How to compute this value?

Intuition of Bayesian inference: Use Bayes rule to transform this equation

into a set of probabilities that are easier to compute (and give the right answer)

Page 13

Using Bayes Rule

Know this.
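The slide's equation applies Bayes' rule and then drops the denominator P(w_1^n), which is the same for every candidate tag sequence:

    \hat{t}_1^n = \operatorname*{argmax}_{t_1^n} \frac{P(w_1^n \mid t_1^n)\, P(t_1^n)}{P(w_1^n)}
                = \operatorname*{argmax}_{t_1^n} P(w_1^n \mid t_1^n)\, P(t_1^n)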

Page 14

Likelihood and Prior
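The slide's figure is not reproduced here; in the textbook's derivation the likelihood and the prior are each approximated with an independence assumption (every word depends only on its own tag, every tag depends only on the previous tag):

    P(w_1^n \mid t_1^n) \approx \prod_{i=1}^{n} P(w_i \mid t_i)
    P(t_1^n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})

    \hat{t}_1^n \approx \operatorname*{argmax}_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})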

Page 15

Two Kinds of Probabilities

Tag transition probabilities P(ti | ti-1)
  Determiners are likely to precede adjectives and nouns:
    That/DT flight/NN
    The/DT yellow/JJ hat/NN
  So we expect P(NN|DT) and P(JJ|DT) to be high.
  Compute P(NN|DT) by counting in a labeled corpus, as shown below.
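The counting estimate is the usual ratio of corpus counts (a maximum likelihood estimate):

    P(t_i \mid t_{i-1}) = \frac{C(t_{i-1}, t_i)}{C(t_{i-1})} \qquad \text{e.g.} \qquad P(NN \mid DT) = \frac{C(DT, NN)}{C(DT)}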

Page 16

Two Kinds of Probabilities

Word likelihood probabilities P(wi | ti)
  A VBZ (3sg present verb) is likely to be "is".
  Compute P(is|VBZ) by counting in a labeled corpus, as shown below.
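Same idea for the word likelihoods:

    P(w_i \mid t_i) = \frac{C(t_i, w_i)}{C(t_i)} \qquad \text{e.g.} \qquad P(is \mid VBZ) = \frac{C(VBZ, is)}{C(VBZ)}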

Page 17

Example: The Verb “race”

Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR

People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN

How do we pick the right tag?

Page 18

Disambiguating “race”

Page 19

Disambiguating “race”

Page 20

Example

P(NN|TO) = .00047
P(VB|TO) = .83
P(race|NN) = .00057
P(race|VB) = .00012
P(NR|VB) = .0027
P(NR|NN) = .0012

P(VB|TO) P(NR|VB) P(race|VB) = .83 × .0027 × .00012 = .00000027
P(NN|TO) P(NR|NN) P(race|NN) = .00047 × .0012 × .00057 = .00000000032

So we (correctly) choose the verb tag for “race”
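A quick check of that arithmetic (a throwaway Python snippet, not from the lecture):

    # Multiply out the two candidate analyses using the probabilities on the slide.
    p_vb = 0.83 * 0.0027 * 0.00012     # P(VB|TO) * P(NR|VB) * P(race|VB)
    p_nn = 0.00047 * 0.0012 * 0.00057  # P(NN|TO) * P(NR|NN) * P(race|NN)
    print(f"{p_vb:.2e} vs {p_nn:.2e}") # ~2.7e-07 vs ~3.2e-10, so VB wins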

Page 21

Break

Quiz readings: Chapters 1 to 6
  Chapter 2
  Chapter 3
    Skip 3.4.1, 3.10, 3.12
  Chapter 4
    Skip 4.7, 4.8-4.11
  Chapter 5
    Skip 5.5.4, 5.6, 5.8-5.10
  Chapter 6
    Skip 6.6-6.9

Page 22

Hidden Markov Models

What we’ve just described is called a Hidden Markov Model (HMM)

This is a kind of generative model.
  There is a hidden underlying generator of observable events.
  The hidden generator can be modeled as a network of states and transitions.
  We want to infer the underlying state sequence given the observed event sequence.

Page 23

Hidden Markov Models

States Q = q1, q2, …, qN
Observations O = o1, o2, …, oT
  Each observation is a symbol drawn from a vocabulary V = {v1, v2, …, vV}
Transition probabilities: transition probability matrix A = {aij}
Observation likelihoods: output probability matrix B = {bi(k)}
Special initial probability vector π

Page 24

HMMs for Ice Cream

You are a climatologist in the year 2799 studying global warming

You can’t find any records of the weather in Baltimore for summer of 2007

But you find Jason Eisner’s diary which lists how many ice-creams Jason ate every day that summer

Your job: figure out how hot it was each day

Page 25

Eisner Task

Given an ice cream observation sequence:
  1, 2, 3, 2, 2, 2, 3, …
Produce the hidden weather sequence:
  H, C, H, H, H, C, C, …

Page 26

HMM for Ice Cream

Page 27

Ice Cream HMM

Let's just do 1 3 1 as the sequence.

How many underlying state (hot/cold) sequences are there?

How do you pick the right one?


The 8 possible state sequences: HHH, HHC, HCH, HCC, CCC, CCH, CHC, CHH

Argmax P(sequence | 1 3 1)

Page 28

Ice Cream HMM

Let’s just do 1 sequence: CHC


Cold as the initial state:     P(Cold | Start) = .2
Observing a 1 on a cold day:   P(1 | Cold)     = .5
Hot as the next state:         P(Hot | Cold)   = .4
Observing a 3 on a hot day:    P(3 | Hot)      = .4
Cold as the next state:        P(Cold | Hot)   = .3
Observing a 1 on a cold day:   P(1 | Cold)     = .5

P(CHC, 1 3 1) = .2 × .5 × .4 × .4 × .3 × .5 = .0024
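The same product in a few lines of Python (a throwaway sketch; the dictionaries hold only the numbers shown on the slide, and the layout is mine, not the lecture's):

    # Joint probability of the hidden sequence C,H,C and the observations 1,3,1.
    start = {"C": 0.2}                          # P(Cold | Start)
    trans = {("C", "H"): 0.4, ("H", "C"): 0.3}  # P(Hot | Cold), P(Cold | Hot)
    emit  = {("C", 1): 0.5, ("H", 3): 0.4}      # P(1 | Cold), P(3 | Hot)

    def joint(states, obs):
        # P(s1|Start) * P(o1|s1) * product over i of P(si|si-1) * P(oi|si)
        p = start[states[0]] * emit[(states[0], obs[0])]
        for prev, cur, o in zip(states, states[1:], obs[1:]):
            p *= trans[(prev, cur)] * emit[(cur, o)]
        return p

    print(joint(["C", "H", "C"], [1, 3, 1]))    # ~0.0024, matching the slide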

Page 29

POS Transition Probabilities

Page 30

Observation Likelihoods

Page 31

Question

If there are 30 or so tags in the Penn set, and the average sentence is around 20 words... how many tag sequences do we have to enumerate to argmax over in the worst case?


30^20 (roughly 3.5 × 10^29)

Page 32

Decoding

Ok, now we have a complete model that can give us what we need. Recall that we need to get the most probable tag sequence given the observed words.

We could just enumerate all paths given the input and use the model to assign probabilities to each. Not a good idea. Luckily, dynamic programming (last seen in Ch. 3 with minimum edit distance) helps us here.

Page 33

Intuition

Consider a state sequence (tag sequence) that ends at state j with a particular tag T.

The probability of that tag sequence can be broken into parts:
  The probability of the BEST tag sequence up through j-1
  Multiplied by the transition probability from the tag at the end of the j-1 sequence to T
  And the observation probability of the word given tag T.
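In the textbook's notation this is the Viterbi recurrence, where a_{ij} is a tag transition probability and b_j(o_t) is a word likelihood:

    v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t)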

Page 34

The Viterbi Algorithm

Page 35

Viterbi Example

Page 36

Viterbi Summary

Create an array
  With columns corresponding to inputs
  And rows corresponding to possible states
Sweep through the array in one pass, filling the columns left to right using our transition probs and observation probs.
The dynamic programming key is that we need only store the MAX prob path to each cell (not all paths). See the sketch below.
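A minimal Python sketch of that sweep, reusing the ice-cream HMM from earlier; the function and data layout are mine, and any probability not shown on the slides (P(Hot|Start), P(Hot|Hot), the remaining ice-cream emissions) is an assumption chosen so each distribution sums to 1:

    states = ["H", "C"]
    start = {"H": 0.8, "C": 0.2}             # P(Cold|Start)=.2 from the slide; P(Hot|Start) assumed
    trans = {"H": {"H": 0.7, "C": 0.3},      # P(Cold|Hot)=.3 from the slide; P(Hot|Hot) assumed
             "C": {"H": 0.4, "C": 0.6}}      # P(Hot|Cold)=.4 from the slide; P(Cold|Cold) assumed
    emit  = {"H": {1: 0.2, 2: 0.4, 3: 0.4},  # P(3|Hot)=.4 from the slide; the rest assumed
             "C": {1: 0.5, 2: 0.4, 3: 0.1}}  # P(1|Cold)=.5 from the slide; the rest assumed

    def viterbi(obs):
        # One column per observation; each cell keeps (best prob into it, best previous state).
        cols = [{s: (start[s] * emit[s][obs[0]], None) for s in states}]
        for o in obs[1:]:
            col = {}
            for s in states:
                prev = max(states, key=lambda p: cols[-1][p][0] * trans[p][s])
                col[s] = (cols[-1][prev][0] * trans[prev][s] * emit[s][o], prev)
            cols.append(col)
        # Trace back from the most probable final cell.
        best_last = max(states, key=lambda s: cols[-1][s][0])
        path = [best_last]
        for col in reversed(cols[1:]):
            path.append(col[path[-1]][1])
        return list(reversed(path)), cols[-1][best_last][0]

    print(viterbi([1, 3, 1]))  # -> (['H', 'H', 'C'], ~0.0067) under these assumed numbers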

Page 37

Evaluation

So once you have your POS tagger running, how do you evaluate it?
  Overall error rate with respect to a gold-standard test set
  Error rates on particular tags
  Error rates on particular words
  Tag confusions...
A small sketch of these measurements follows below.
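A minimal sketch, given parallel lists of gold and predicted tags (the names and layout are mine, for illustration only):

    from collections import Counter

    def evaluate(gold, pred):
        # Overall accuracy plus counts of (gold tag, predicted tag) confusions.
        correct = sum(g == p for g, p in zip(gold, pred))
        confusions = Counter((g, p) for g, p in zip(gold, pred) if g != p)
        return correct / len(gold), confusions

    acc, conf = evaluate(["DT", "NN", "VBD", "NN"], ["DT", "JJ", "VBD", "NN"])
    print(acc, conf.most_common())  # 0.75 [(('NN', 'JJ'), 1)]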

Page 38

Error Analysis

Look at a confusion matrix

See what errors are causing problems
  Noun (NN) vs Proper Noun (NNP) vs Adjective (JJ)
  Preterite (VBD) vs Participle (VBN) vs Adjective (JJ)

Page 39

Evaluation

The result is compared with a manually coded "gold standard"
  Typically accuracy reaches 96-97%
  This may be compared with the result for a baseline tagger (one that uses no context)

Important: 100% is impossible even for human annotators.

Page 40

Summary

Parts of speech
Tagsets
Part-of-speech tagging
HMM tagging
  Markov chains
  Hidden Markov Models
  Viterbi decoding

Page 41

Next Time

Three tasks for HMMs
  Decoding
    Viterbi algorithm
  Assigning probabilities to inputs
    Forward algorithm
  Finding parameters for a model
    EM